6 comments

  • simianwords · 47 minutes ago
    I bet I can do better by allowing this: the LLM can pull documentation of the language from the web to understand how it works.

    If the LLM has "skills" for that language, it will definitely increase accuracy.
  • bwestergard · 56 minutes ago
    I'm shocked to see how poorly these models, which I find useful day to day, do in solving virtually *any* of the problems in Unlambda.

    Before looking at the results, my guess was that scores would be higher for Unlambda than for any of the others, because humans who learn Scheme don't find it all that hard to learn about the lambda calculus and combinatory logic.

    But the model that did best, Qwen-235B, got virtually every problem wrong.
    • __alexs · 53 minutes ago
      They are also weirdly bad at Brainfuck, which is basically just a subset of C.
  • __alexs · 56 minutes ago
    I had hoped we might finally be ushering in a bold new era of programming in Malbolge, but apparently that was too optimistic.
  • deklesen · 1 hour ago
    Mhh... my hunch is that part of this is that all Python keywords are 1 token, I assume. And for those very weird languages, tokenizing might make it harder to reason over those tokens.

    Would love to see how the benchmark results change if the esoteric languages are changed a bit so that they have 1-token keywords only.
    • chychiu · 54 minutes ago
      Considering that Brainfuck only has 8 characters and models are scoring at 6.2%, I don't think tokenization is the issue.
      • altruios · 29 minutes ago
        Not the *only* issue, that is.

        Reasoning is hard; reasoning about colors while wearing glasses that obfuscate the real colors is even harder... but it's not the core issue if your brain isn't wired correctly to reason.

        I suspect the way out of this is to separate knowledge from reason: to train reasoning with zero knowledge and zero language, and then to train language on top of a pre-trained-for-reasoning model.
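The point about Brainfuck's size is easy to make concrete: its entire instruction set is eight single-character commands, small enough to implement in a few lines. Below is a minimal interpreter sketch (the `,` input command is omitted for brevity, and `run_bf` is an illustrative name, not something from the benchmark):

```python
# Minimal Brainfuck interpreter: the whole language is eight
# single-character commands (> < + - . , [ ]); ',' is omitted here.
def run_bf(code: str, tape_len: int = 30000) -> str:
    tape = [0] * tape_len
    out, ptr, pc = [], 0, 0
    # Precompute matching bracket positions so loops can jump directly.
    stack, jumps = [], {}
    for i, c in enumerate(code):
        if c == '[':
            stack.append(i)
        elif c == ']':
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    while pc < len(code):
        c = code[pc]
        if c == '>':
            ptr += 1                      # move data pointer right
        elif c == '<':
            ptr -= 1                      # move data pointer left
        elif c == '+':
            tape[ptr] = (tape[ptr] + 1) % 256   # increment cell (byte wrap)
        elif c == '-':
            tape[ptr] = (tape[ptr] - 1) % 256   # decrement cell (byte wrap)
        elif c == '.':
            out.append(chr(tape[ptr]))    # output current cell as a character
        elif c == '[' and tape[ptr] == 0:
            pc = jumps[pc]                # skip loop body when cell is zero
        elif c == ']' and tape[ptr] != 0:
            pc = jumps[pc]                # jump back while cell is nonzero
        pc += 1
    return ''.join(out)

# 8 * 8 + 1 = 65, the ASCII code for 'A'
print(run_bf("++++++++[>++++++++<-]>+."))  # → A
```

Despite the trivial syntax, programs like this require tracking tape state across many steps, which may be closer to why models struggle than tokenization alone.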