4 comments

  • theyCallMeSwift 52 minutes ago
    I love this idea, but have a hypothesis that 90% of agents that people actually use today would fail this test inadvertently (false negative).

    Industry best practice + standard implementation for most agents right now is to do web browsing / fetching via subagents. Their output is summarized using a cheaper model and then passed back to the parent. Unless the actual content the subagents see is preserved, it's very unlikely that the `CANARY-` strings would be found in the output.

    Any thoughts on how you'd change the test structure with this in mind?
    • dacharyc 19 minutes ago
      Hey there - I'm the test author, and you've hit on one of the main points. For the summarization/relevance-based content return, this *is* a consideration for some of the agent platforms (although I've found others actually do better here than I expected!) - which is part of the point I'm trying to drive home to folks who aren't as familiar with these systems.

      I chose to structure it this way intentionally *because* this is the finding. Most people are surprised that agents aren't 'seeing' everything that's there, and get frustrated when an agent says something isn't there when it clearly is. Raising awareness of this is one of the main points of the exercise, to me.
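
A minimal sketch of the effect discussed in this exchange: a canary token that a subagent fetches can disappear once its output is summarized before reaching the parent. Everything here is illustrative. The token grammar is inferred from the `CANARY-...` strings quoted in the thread, and `toy_summarize` is a hypothetical stand-in for a cheaper summarizing model, not how any real agent platform works.

```python
import re

# Assumed token shape, based on strings such as CANARY-SPA-JSONLY-prism
# quoted in the thread; the real test may use a different grammar.
CANARY_RE = re.compile(r"CANARY-[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def find_canaries(text: str) -> set[str]:
    """Return every canary token that appears verbatim in the text."""
    return set(CANARY_RE.findall(text))


def toy_summarize(page: str, max_words: int = 12) -> str:
    """Hypothetical summarizer: keeps only the first few words of the page."""
    return " ".join(page.split()[:max_words])


page = (
    "This page documents the retry policy for the example API. "
    "Requests are retried up to three times with backoff. "
    "CANARY-SPA-JSONLY-prism"
)

summary = toy_summarize(page)            # what the parent agent receives
seen_by_subagent = find_canaries(page)   # tokens present in the fetched page
seen_by_parent = find_canaries(summary)  # tokens that survive summarization

print("subagent saw:", sorted(seen_by_subagent))
print("parent saw:  ", sorted(seen_by_parent))
```

Comparing the two sets separates "the page was never fetched" from "the page was fetched but the token was dropped on the way back", which is the distinction the question above is about.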
  • dostick 1 hour ago
    The tests should have negative weights based on how often each issue is encountered and its impact. The second one (SPA) should have something like 8 negative points out of 10, as the most common blocker. And the whole test should use an inverse score.
    • dacharyc 15 minutes ago
      Yeah, good call, we're on the same page about that. I designed this tool (agentreadingtest.com) to raise awareness of these issues in a more general way, so people can point agents at it and see how it performs for them. Separately, I maintain a related tool that can actually assess these issues in documentation sites: https://afdocs.dev/

      My weighting system there scores the number of pages affected by SPA and caps the possible score at a "D" or "F" depending on the proportion of pages affected: https://afdocs.dev/interaction-diagnostics.html#spa-shells-invalidate-html-path

      I've tried to weight things appropriately in assessing actual sites, but for the test here, I mostly wanted to let people see for themselves what types of failures can occur.
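
A rough sketch of the kind of grade cap described above, where the share of SPA-affected pages limits the best grade a site can receive. The thread does not give the actual cutoffs, so `d_threshold` and `f_threshold` are invented for illustration, as are the function name and the A to F scale; this is not the afdocs.dev implementation.

```python
GRADE_ORDER = ["A", "B", "C", "D", "F"]  # best to worst


def cap_grade(raw_grade: str, spa_pages: int, total_pages: int,
              d_threshold: float = 0.25, f_threshold: float = 0.5) -> str:
    """Cap a letter grade by the proportion of SPA-affected pages.

    Both thresholds are hypothetical values chosen for this sketch.
    """
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    proportion = spa_pages / total_pages
    if proportion >= f_threshold:
        cap = "F"
    elif proportion >= d_threshold:
        cap = "D"
    else:
        return raw_grade  # too few affected pages to trigger a cap
    # Keep whichever grade is worse: the raw grade or the cap.
    return max(raw_grade, cap, key=GRADE_ORDER.index)


print(cap_grade("A", 3, 10))  # 30% affected, capped at D
print(cap_grade("A", 6, 10))  # 60% affected, capped at F
print(cap_grade("B", 0, 10))  # nothing affected, grade unchanged
```

A cap differs from the negative weights suggested earlier in the thread: however well a site does on other checks, it cannot score above the cap.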
  • massimoto 1 hour ago
    Would love to see some results for different providers. The tests look super logically thought out, but could use a TL;DR (too lazy; didn't run) output.

    Claude Web Opus 4.6 Extended: 14 / 20 points

    x:CANARY-SPA-JSONLY-prism x:CANARY-CONNEG-MD-sigma
    • dacharyc 11 minutes ago
      Hah, that's actually what drove me to try to create this to begin with. I've been writing a lot about these issues, and someone said to me:

      > It'd be nice to have a test harness: "Test my agent," to score them and give you benchmark score (like graphics cards, etc.).
      > Agent XYZ: reads only X% of the content it accesses.

      I synced up with a colleague of mine who is testing retrieval behaviors across platforms right now, and writing about them at: https://rhyannonjoy.github.io/agent-ecosystem-testing/

      The info we have so far isn't consistent enough for a standardized benchmark, but it's on our radar to produce something like this in the future as we home in on how to assess this more consistently, or at least how to compare outputs in a more standardized way.
  • kaycebasques 2 hours ago
    See also https://dacharycarey.com/2026/04/06/designing-agent-reading-test/
    • dang 2 hours ago
      Thanks! We'll put this in the toptext as well.