
It's a bit of mathematical bikeshedding; hardcoding reversibility would cause far more problems than it would solve.

Best to simply scale log-likelihood-based training: next-token prediction already entails learning all of the subproblems needed to predict that next token, and hardcoding something to give humans warm fuzzies would create a biased estimator (and move us back towards the 90s a bit).
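To make the log-likelihood point concrete, here's a toy sketch (all conditional probabilities are hypothetical, hand-set values for illustration): the training objective is just the sum of log P(token | prefix), so a model can only maximize it by getting every conditional subproblem right.

```python
import math

# Toy autoregressive model: P(next token | full prefix), backed by
# hypothetical hand-set conditionals purely for illustration.
COND = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.7, "dog": 0.3},
    ("the", "cat"): {"sat": 0.9, "ran": 0.1},
}

def sequence_log_likelihood(tokens):
    """Sum of log P(token_t | token_<t): the quantity next-token training maximizes."""
    total = 0.0
    for t in range(len(tokens)):
        prefix = tuple(tokens[:t])
        total += math.log(COND[prefix][tokens[t]])
    return total

# The sequence likelihood factors into the per-step conditionals,
# so improving the objective means improving each conditional.
ll = sequence_log_likelihood(["the", "cat", "sat"])
print(round(ll, 4))  # log(0.6) + log(0.7) + log(0.9) ≈ -0.9729
```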

Models already do context-dependent token utilization constantly: it's an autoregressive feature based on the entire stream of incoming tokens. Humans have a bias to focus on the 'last token used'; this is not what language models look at.
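A minimal sketch of that conditioning claim (the prefixes and probabilities here are hypothetical): two prefixes ending in the same last token can imply very different next-token distributions, because the model conditions on the whole stream, not just the final token.

```python
# Hypothetical conditionals: both prefixes end in "bank", yet the
# continuation distribution depends on the entire prefix.
COND = {
    ("river", "bank"): {"erosion": 0.8, "loan": 0.2},
    ("savings", "bank"): {"erosion": 0.1, "loan": 0.9},
}

def next_distribution(prefix):
    """Return P(next token | full prefix) for this toy model."""
    return COND[tuple(prefix)]

# Same last token, different distributions: the full context matters.
print(next_distribution(["river", "bank"])["loan"])    # 0.2
print(next_distribution(["savings", "bank"])["loan"])  # 0.9
```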



>it's an autoregressive feature based on the entire stream of incoming tokens. Humans have a bias to focus on the 'last token used'; this is not what language models look at.

But human language is created for and by humans. Isn't operating on language in a categorically different manner, then, an incorrect usage or understanding of language?



I'm not sure why you think that's a rebuttal.


It is, because it lays the foundation for the mathematics underpinning your question about communication systems, encodings, and their various structures. It also explains why log likelihood is a better fit for estimating a token-based model than some other hand-crafted heuristic.

The version with the Weaver introduction is quite good as well; there are other versions and similar papers covering the topic from different angles. I find it well worth the read.
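The coding-theory connection can be sketched numerically (the distributions below are made up for illustration): log loss is average code length in bits, the entropy of the true distribution is the best achievable rate, and any model mismatch costs extra bits. That is the sense in which minimizing log likelihood is the principled fit for a token-based model.

```python
import math

# -log2 p(x) is the ideal code length for symbol x (source coding view).
p = {"a": 0.5, "b": 0.25, "c": 0.25}   # hypothetical true token distribution
q = {"a": 0.25, "b": 0.5, "c": 0.25}   # a mismatched model of it

# Entropy: best achievable average bits/token under the true distribution.
entropy = -sum(p[x] * math.log2(p[x]) for x in p)
# Cross-entropy: average bits/token if we code with the mismatched model q.
cross_entropy = -sum(p[x] * math.log2(q[x]) for x in p)

print(entropy)        # 1.5 bits per token
print(cross_entropy)  # 1.75 bits: the mismatch costs 0.25 extra bits (the KL divergence)
```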



