I bet the scores could be improved by allowing this: let the LLM pull documentation for the language from the web to understand how it works.<p>If the LLM has “skills” for that language, that will likely increase accuracy.
I'm shocked to see how poorly these models, which I find useful day to day, do at solving virtually <i>any</i> of the problems in Unlambda.<p>Before looking at the results, my guess was that scores would be higher for Unlambda than for any of the others, because humans who learn Scheme don't find it all that hard to learn about the lambda calculus and combinatory logic.<p>But the model that did the best, Qwen-235B, got virtually every problem wrong.
I had hoped we might finally be ushering in a bold new era of programming in Malbolge, but apparently that was too optimistic.
Hmm... my hunch is that part of this is that all Python keywords are single tokens, I assume. For those very weird languages, tokenizing might make it harder to reason over the tokens.<p>I'd love to see how the benchmark results change if the esoteric languages were tweaked so that their keywords are each a single token.
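To make the hunch concrete, here's a toy sketch (not any real LLM tokenizer, and the vocabulary is entirely made up for illustration): a greedy longest-match tokenizer over a tiny vocabulary that happens to contain Python keywords. A short Python snippet compresses into few tokens, while Unlambda's canonical hello-world program, being symbol-dense with nothing multi-character in the vocabulary, falls apart into one token per character.

```python
# Toy illustration only: a greedy longest-match tokenizer over a small,
# hypothetical vocabulary. Real BPE tokenizers are learned from data,
# but the qualitative effect shown here is the same.

VOCAB = {
    # Assumed single-token entries: common Python keywords and symbols.
    "def", "return", "if", "else", "lambda",
    "(", ")", ":", " ", "\n", "f", "x", "`", "r", "i",
}

def tokenize(text: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position;
    characters not in the vocabulary become single-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: one token each
            i += 1
    return tokens

python_src = "def f(x): return x"
# Canonical Unlambda hello-world program.
unlambda_src = "`r```````````.H.e.l.l.o. .w.o.r.l.di"

print(len(tokenize(python_src)))    # few tokens: keywords match whole
print(len(tokenize(unlambda_src)))  # one token per character
```

The Python line tokenizes into 11 tokens, while the shorter-looking Unlambda program yields one token per character (36 tokens), which supports the intuition that per-token reasoning is harder for such languages.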