Hacker News

I tried with some other undeciphered texts and I just got random translations that changed drastically with every attempt. Once I became skeptical, I tried to see if it could do a simple replacement cipher on English text, and it could not. I don’t think ChatGPT is a good fit for this kind of problem.


Yes, I’ve also seen it struggle with ROT13. I’ve read that this is because the tokenizer breaks up words in a way that makes mapping to ROT13 hard. I don’t expect it to have much luck with a language with a small corpus.
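For context, ROT13 is just a fixed-shift substitution cipher, trivial for a few lines of code but apparently awkward for a token-based model. A minimal Python sketch (not how any model does it, just the cipher itself):

```python
def rot13(s: str) -> str:
    # Shift each letter 13 places in the alphabet, preserving case;
    # non-letters pass through unchanged. Applying it twice is a no-op.
    out = []
    for ch in s:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            out.append(chr((ord(ch) - base + 13) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

print(rot13("Hello, World!"))  # Uryyb, Jbeyq!
```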


The tokenizer, as far as I know, is just byte-pair encoding. You take your whole corpus, find the most common two-byte pair (probably ".[space]" on the first iteration), and assign it a token. Then you do it again, with previously found tokens allowed as halves of new pairs. Do it enough times and eventually you get full words as tokens if they're common enough, and for more uncommon words just the root of the word (which can later be assembled as root+ing, for example, "ing" being just a normal token among others).
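The iteration described above can be sketched as a toy BPE trainer, operating on characters rather than bytes for readability (this is an illustration, not any real tokenizer's code):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    # Start with each character as its own token.
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs and pick the most frequent one.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing worth merging anymore
        merges.append(a + b)
        # Merge every occurrence of the winning pair into one token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

toks, merges = train_bpe("low low low lower lowest", 10)
# "lo" and then "low" emerge as merged tokens, since "low" is the common root.
```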

It struggles with ROT13 because people don't generally publish large corpora of ROT13 text next to their translations, so the problem compounds. On the one hand, there are probably not many ROT13'd words recognized by the tokenizer; on the other hand, even if there were, the model wouldn't be trained to predict the correct translation after those tokens, because there are very few ROT13 Rosetta stones just lying around.


Wait until you ask it for quotes: it literally makes them up, citing sources that exist but don't contain the quoted passage, etc.

They’re generative models — it’s what you get




