
It's a bit of mathematical bikeshedding; hardcoding reversibility would cause far more problems than it would solve.

Best to simply scale log-likelihood-based training: next-token prediction already entails learning all of the subproblems needed to predict that next token, and hardcoding something to give humans warm fuzzies would create a biased estimator (and move us back towards the 90s a bit).
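To make the log-likelihood point concrete, here's a toy sketch (all conditional probabilities are hypothetical, hand-set values for illustration): the training objective is just the sum of log P(token | prefix), so a model can only maximize it by getting every conditional subproblem right.

```python
import math

# Toy autoregressive model: P(next token | full prefix), backed by
# hypothetical hand-set conditionals purely for illustration.
COND = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.7, "dog": 0.3},
    ("the", "cat"): {"sat": 0.9, "ran": 0.1},
}

def sequence_log_likelihood(tokens):
    """Sum of log P(token_t | token_<t): the quantity next-token training maximizes."""
    total = 0.0
    for t in range(len(tokens)):
        prefix = tuple(tokens[:t])
        total += math.log(COND[prefix][tokens[t]])
    return total

# The sequence likelihood factors into the per-step conditionals,
# so improving the objective means improving each conditional.
ll = sequence_log_likelihood(["the", "cat", "sat"])
print(round(ll, 4))  # log(0.6) + log(0.7) + log(0.9) ≈ -0.9729
```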

Models already do context-dependent token utilization constantly: it's an autoregressive feature based on the entire stream of incoming tokens. Humans have a bias to focus on the 'last token used'; this is not what language models look at.
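A minimal sketch of that conditioning claim (the prefixes and probabilities here are hypothetical): two prefixes ending in the same last token can imply very different next-token distributions, because the model conditions on the whole stream, not just the final token.

```python
# Hypothetical conditionals: both prefixes end in "bank", yet the
# continuation distribution depends on the entire prefix.
COND = {
    ("river", "bank"): {"erosion": 0.8, "loan": 0.2},
    ("savings", "bank"): {"erosion": 0.1, "loan": 0.9},
}

def next_distribution(prefix):
    """Return P(next token | full prefix) for this toy model."""
    return COND[tuple(prefix)]

# Same last token, different distributions: the full context matters.
print(next_distribution(["river", "bank"])["loan"])    # 0.2
print(next_distribution(["savings", "bank"])["loan"])  # 0.9
```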



>it's an autoregressive feature based on the entire stream of incoming tokens. Humans have a bias to focus on the 'last token used'; this is not what language models look at.

But human language is created for and by humans. Isn't operating on language in a categorically different manner, then, an incorrect usage or understanding of language?



I'm not sure why you think that's a rebuttal.


It is, because it lays the foundation for the mathematics underpinning your question about communication systems, encodings, and their various structures. It also explains why log likelihood is a better fit for estimating a token-based model than some other hand-crafted heuristic.

The version with the Weaver introduction is quite good as well; there are other versions and similar papers covering the topic from different angles. I find it well worth the read.
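The coding-theory connection can be sketched numerically (the distributions below are made up for illustration): log loss is average code length in bits, the entropy of the true distribution is the best achievable rate, and any model mismatch costs extra bits. That is the sense in which minimizing log likelihood is the principled fit for a token-based model.

```python
import math

# -log2 p(x) is the ideal code length for symbol x (source coding view).
p = {"a": 0.5, "b": 0.25, "c": 0.25}   # hypothetical true token distribution
q = {"a": 0.25, "b": 0.5, "c": 0.25}   # a mismatched model of it

# Entropy: best achievable average bits/token under the true distribution.
entropy = -sum(p[x] * math.log2(p[x]) for x in p)
# Cross-entropy: average bits/token if we code with the mismatched model q.
cross_entropy = -sum(p[x] * math.log2(q[x]) for x in p)

print(entropy)        # 1.5 bits per token
print(cross_entropy)  # 1.75 bits: the mismatch costs 0.25 extra bits (the KL divergence)
```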



