That was a fairly ignorant comment because the whole idea behind a speech codec is to compress and reproduce speech in a manner that allows one to at least recognize who's speaking.
But I wonder if there isn't a gem in there somewhere. The essential expressive characteristics of a person's voice change much more slowly than the frame rate of any codec, and predictive coding alone doesn't cover all of the possibilities. There are also not that many unique human voices in the world. If you had several thousand people read a given passage and then asked me to pick out which voice belongs to a close friend or relative, I doubt I could do it.
So, if a codec could adequately pigeonhole the speaker's inflection, accent, timbre, pacing, and other characteristics and send that information only once per transmission, or whenever it actually changes, then a text-to-speech solution with appropriate metadata describing how to 'render' the voice might work really well.
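As a rough sketch of that idea (all names and sizes here are hypothetical, not any real codec's format), the transmission could be split into a one-time speaker profile plus a low-rate content stream:

```python
# Sketch of the "send the voice once" idea: a hypothetical wire format
# where a speaker profile (timbre, accent, pacing, ...) is transmitted a
# single time, and subsequent packets carry only content.
from dataclasses import dataclass

@dataclass
class SpeakerProfile:
    # Hypothetical fixed-size embedding describing how to 'render' the voice.
    embedding: tuple          # e.g. a few hundred floats, sent once per session
    version: int = 1

@dataclass
class ContentPacket:
    # Per-packet payload: "what was said", not "how it sounds".
    text: str                 # or phonemes plus prosody deltas

def transmission(profile, packets):
    # The profile goes over the wire once; everything after it is tiny.
    yield ("PROFILE", profile)
    for p in packets:
        yield ("CONTENT", p)

stream = list(transmission(SpeakerProfile(embedding=(0.1,) * 4),
                           [ContentPacket("hello"), ContentPacket("world")]))
# A receiver would load PROFILE into a TTS voice model, then render each
# CONTENT packet in that voice, resending PROFILE only when it changes.
```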
Put another way, I doubt the nerves between your brain and your larynx carry anything close to 1.6 kbps of bandwidth. Near-optimal compression might be achieved by modeling the larynx accurately alongside the actual nerve signals that drive it, rather than by trying to represent both with a traditional frame-based predictive model.
This idea is the basis of an interesting plot point in Vernor Vinge's sci-fi novel A Fire Upon the Deep.
The book is set in a universe where long-distance, faster-than-light communication is possible, but extremely bandwidth-constrained. However, localized computational power is many orders of magnitude beyond what we have today. As a result, much of the communication on the "Known Net" is text (heavily inspired by Usenet) but you can also send an "evocation" of audio and/or video, which is extremely heavily compressed and relies on an intelligent system at the receiving end to reconstruct and extrapolate all the information that was stripped out.
The downside, of course, is that it can become difficult to tell which nuances were originally present, and which are confabulated.
I love that book. I loved that the ships could send text messages across 10,000 light-years, provided the receiving end had an antenna swarm with the mass of a planet.
He played with the idea in at least one other story that I'm aware of, a short story where a human scientist on a planet just inside the Slow Zone (where no FTL is supposed to be possible) acquires a device from the Transcend (where FTL and AI are apparently trivial). The device is a terminal connected to a transcendent intelligence, and it can send and receive a few bits per second via FTL, even in the Slow Zone. Using those few bits it's able to transmit natural-language questions, and reproduce the answers to those questions, but the scientist can't decide whether he can trust it. Can the AI really model him well enough to understand his questions using that tiny amount of information? Can the answers coming back be anything but random? And then the solar weather starts acting a bit unusual...
The cepstrum that takes up most of the bits (or the LSPs in other codecs) is actually a model of the larynx -- another reason why it doesn't do well on music. Because of the accuracy needed to exactly represent the filter that the larynx makes, plus the fact that it can move relatively quickly, a significant number of bits is indeed involved here.
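As a minimal illustration of why the cepstrum captures that filter (this is a toy example, not LPCNet's actual code): the real cepstrum is the inverse FFT of the log magnitude spectrum, and the filter's slowly varying spectral envelope lands in the low-quefrency coefficients.

```python
# Toy cepstrum computation: the low-quefrency coefficients form a coarse
# model of the spectral envelope (the vocal-tract filter), which is
# roughly where a low-rate speech codec spends most of its bits.
import numpy as np

fs = 8000
t = np.arange(512) / fs
# Toy "voiced" frame: a 100 Hz pulse train shaped by a decaying envelope.
x = np.sign(np.sin(2 * np.pi * 100 * t)) * np.exp(-5 * t)
x *= np.hanning(len(x))

spectrum = np.fft.rfft(x)
log_mag = np.log(np.abs(spectrum) + 1e-9)   # small floor avoids log(0)
cepstrum = np.fft.irfft(log_mag)

# Keeping only the first ~20 coefficients gives a smooth envelope model;
# the pitch shows up further out, as a peak near quefrency fs/100.
envelope_coeffs = cepstrum[:20]
print(envelope_coeffs.shape)  # (20,)
```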
The bitrate could definitely be reduced (possibly by 50% or more) by using 1-second packets along with entropy coding, but the resulting codec would not be very useful for voice communication. You want packets short enough to get decent latency, and if you use RF, then VBR makes things a lot more complicated (and less robust).
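A back-of-the-envelope version of that bitrate argument: codec parameters are not uniformly distributed, so their Shannon entropy sits well below the fixed number of bits a CBR codec spends on them. The symbol probabilities below are made up purely for illustration.

```python
# Fixed-rate coding spends log2(alphabet size) bits per symbol; an
# entropy coder over long packets can approach the distribution's
# entropy instead -- at the cost of becoming variable-rate.
import math

# Hypothetical skewed distribution over 16 quantizer indices.
p = [0.4, 0.2, 0.1, 0.08, 0.05, 0.04, 0.03, 0.03,
     0.02, 0.02, 0.01, 0.005, 0.005, 0.004, 0.003, 0.003]
assert abs(sum(p) - 1.0) < 1e-9

fixed_bits = math.log2(len(p))                   # 4.0 bits/symbol, CBR
entropy = -sum(q * math.log2(q) for q in p)      # ~2.8 bits/symbol, VBR floor
print(f"fixed: {fixed_bits:.1f} bits, entropy: {entropy:.2f} bits")
# The gap is the potential saving; the catch, as noted above, is that
# realizing it needs long packets (latency) and VBR (robustness on RF).
```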
In theory, it wouldn't be too hard to implement with a neural network. In theory. In practice, the problem is figuring out how to do the training, because I don't have 2 hours of your voice saying the same thing as the target voice with perfect alignment. I suspect it's still possible, but it's not a simple thing either.
Perfect alignment, or any alignment for that matter, is not necessary. Check out adversarial networks: CycleGAN can be trained to do similar feats in the image domain without aligned inputs. It shouldn't be hard to adapt it to audio.
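For anyone unfamiliar with the trick, here's a toy numeric illustration of the cycle-consistency idea behind CycleGAN (the adversarial part is omitted, and the "generators" are stand-in linear maps rather than networks on spectrogram features):

```python
# Cycle consistency: two mappings G: A->B and F: B->A are trained so
# that F(G(a)) ~= a and G(F(b)) ~= b, which never requires aligned
# (a, b) pairs -- only unpaired samples from each domain.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "generators": a well-conditioned linear map and its inverse.
G = np.eye(8) + 0.1 * rng.normal(size=(8, 8))   # A -> B
F = np.linalg.inv(G)                            # B -> A (a perfect inverse here)

a = rng.normal(size=8)                          # an unpaired sample from domain A
cycle_error = np.abs(F @ (G @ a) - a).mean()    # the L1 cycle-consistency term
print(cycle_error)                              # ~0 when F undoes G
```

In a real voice-conversion setup F would not be an exact inverse; the cycle loss is what pushes the two learned mappings toward being inverses of each other.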
I didn't say "impossible", merely "not simple". The minute you bring in a GAN, things are already not simple. Also, I'm not aware of any work on a GAN that works with a network that does conditional sampling (like LPCNet/WaveNet), so it would mean starting from scratch.