The cepstrum that takes up most of the bits (or the LSPs in other codecs) is actually a model of the vocal tract -- another reason why it doesn't do well on music. Because of the accuracy needed to exactly represent the filter that the vocal tract makes, plus the fact that it can change relatively quickly, there are indeed a significant number of bits involved here.
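To make the budget concrete, here's a back-of-the-envelope sketch in Python. Every number below is invented for illustration (it's not the allocation of LPCNet or any other real codec), but it shows the typical shape: the spectral envelope dominates the packet.

    # Hypothetical bit budget for a 40 ms packet -- numbers invented
    # for illustration only, not taken from any actual codec.
    packet_ms = 40
    bits = {"cepstrum VQ": 44, "pitch": 7, "voicing": 3, "energy": 10}

    total = sum(bits.values())            # bits per packet
    rate = total / (packet_ms / 1000.0)   # bits per second
    print(f"{total} bits/packet -> {rate:.0f} b/s")
    for name, b in bits.items():
        print(f"  {name}: {b} bits ({100.0 * b / total:.0f}%)")

With these made-up numbers the envelope eats about 70% of a 1600 b/s stream, which is the general picture being described.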
The bitrate could definitely be reduced (possibly by 50%+) by using 1-second packets along with entropy coding, but the resulting codec would not be very useful for voice communication. You want packets short enough to get decent latency, and if you use RF, then VBR makes things a lot more complicated (and less robust).
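For a feel of where the entropy-coding gain comes from: over a 1-second packet, consecutive frames are highly correlated, so the quantizer indices have a skewed distribution and code to far fewer bits than their fixed-length size. A toy sketch (the index stream below is invented):

    import math
    from collections import Counter

    def entropy_bits(symbols):
        # Empirical zeroth-order entropy, in bits per symbol.
        n = len(symbols)
        return -sum(c / n * math.log2(c / n) for c in Counter(symbols).values())

    # Invented stream of 4-bit quantizer indices over a 1 s packet
    # (25 frames at 40 ms). Steady voiced speech repeats indices, so
    # the distribution is skewed and the entropy drops well below 4.
    indices = [3, 3, 3, 4, 3, 2, 3, 3, 5, 3, 3, 4, 3,
               3, 3, 2, 3, 3, 3, 4, 3, 3, 6, 3, 3]
    print("fixed-length: 4.00 bits/frame")
    print(f"entropy:      {entropy_bits(indices):.2f} bits/frame")

The catch is exactly the one above: the gain only materializes over long packets, and the output rate varies with the signal, which is what makes VBR over RF painful.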
In theory, it wouldn't be too hard to implement with a neural network. In theory. In practice, the problem is figuring out how to do the training, because I don't have 2 hours of your voice saying the same thing as the target voice with perfect alignment. I suspect it's still possible, but it's not a simple thing either.
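To see why alignment matters in the straightforward supervised setup, here's a minimal PyTorch sketch (the network shape and the 18-dimensional feature size are arbitrary choices for illustration): the frame-wise loss only means anything if frame t of the source features carries the same phonetic content as frame t of the target features.

    import torch
    import torch.nn as nn

    # Hypothetical conversion network: source-speaker acoustic features
    # (e.g., cepstra) in, target-speaker features out, frame by frame.
    convert = nn.Sequential(nn.Linear(18, 128), nn.Tanh(), nn.Linear(128, 18))

    # Stand-ins for 1000 *aligned* frames: src[t] and tgt[t] must be the
    # same phonetic content at the same time t. With unaligned data this
    # loss compares unrelated frames and the gradient is meaningless.
    src = torch.randn(1000, 18)
    tgt = torch.randn(1000, 18)

    loss = ((convert(src) - tgt) ** 2).mean()  # frame-wise L2
    loss.backward()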
Perfect alignment, or any alignment for that matter, is not necessary. Check out adversarial networks. CycleGAN can be trained to do similar feats in the image domain without aligned inputs. It shouldn't be hard to adapt it to audio.
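For what it's worth, the trick that removes the alignment requirement is the cycle-consistency loss: map A -> B -> A and penalize the round trip against the original, so each term only ever compares a sample to itself. A minimal sketch on frame-level features (dimensions and network shapes are again invented, and the adversarial terms that push G(a) toward speaker B's distribution are omitted for brevity):

    import torch
    import torch.nn as nn

    dim = 18  # hypothetical acoustic feature dimension
    G = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, dim))  # A -> B
    F = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, dim))  # B -> A

    a = torch.randn(256, dim)  # unpaired frames from speaker A
    b = torch.randn(256, dim)  # unpaired frames from speaker B

    # A -> B -> A should reproduce the input, likewise B -> A -> B.
    # Neither term ever compares a frame of A against a frame of B,
    # so no cross-speaker alignment is needed.
    cycle_loss = (F(G(a)) - a).abs().mean() + (G(F(b)) - b).abs().mean()
    cycle_loss.backward()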
I didn't say "impossible", merely "not simple". The minute you bring in a GAN, things are already not simple. Also, I'm not aware of any work on a GAN that works with a network that does conditional sampling (like LPCNet/WaveNet), so it would mean starting from scratch.