4 comments

  • modeless 8 minutes ago
    On the public set of 25 problems. These are intended for development and testing, not evaluation. There are 110 private problems for actual evaluation purposes, and the ARC-AGI-3 paper says "the public set is materially easier than the private set".
    • SchemaLoad 7 minutes ago
      Benchmarks on public tests are too easy to game. The model owners can just incorporate the answers into the training dataset. Only the private problems actually matter.
      • sanxiyn 3 minutes ago
        In this case the code is public and you can see they are not cheating in that sense.
  • lairv 2 hours ago
    Note that this uses a harness, so it doesn't qualify for the official ARC-AGI-3 leaderboard.
    According to the authors, though, the harness isn't ARC-AGI specific: https://x.com/agenticasdk/status/2037335806264971461
    • krackers 26 minutes ago
      > this uses a harness
      This seems like an arbitrary restriction. Tool use requires a harness, and their whitepaper never defines exactly what counts as valid.
    • osti 8 minutes ago
      Don't the chat versions of ChatGPT and Gemini also have interleaved tool calls? Do those also count as using a harness?
    • falcor84 1 hour ago
      I for one think that harness development is perhaps the most interesting part at the moment and would love to have an alternative leaderboard with harnesses.
      • steve_adams_86 5 minutes ago
        I'm so into harness development right now. Once it clicked that harnesses can bring more safety and determinism to LLMs, I started to wonder where I'd need that and why (vs MCP or just throwing Claude Code at everything), and my brain gears have been turning endlessly since then. I'd love to see more of what people do with them. My use cases are admittedly lame and boring, but it's such a fun paradigm to think and develop around.
      • sanxiyn 1 hour ago
        There is one. The official leaderboard is without harnesses, and the community leaderboard is with harnesses. Read the ARC-AGI-3 Technical Paper for details.
        • falcor84 1 hour ago
          I went through the technical paper again, and while they explain why they decided against the harness, I disagree with them - my take is that if harnesses are overfitting, then they should be penalized on the hidden test set.
          Anyway, searching both in ARC-AGI's paper and website and directly on Kaggle, I failed to find a with-harness leaderboard; can you please give the link?
          • sanxiyn 1 hour ago
            Here it is: https://arcprize.org/leaderboard/community
  • esafak 1 hour ago
    Anybody used this Agentica of theirs?
  • AbanoubRodolf 22 minutes ago
    [dead]