Hah, that's actually what drove me to try to create this to begin with. I've been writing a lot about these issues, and someone said to me:<p>> It'd be nice to have a test harness: "Test my agent," to score them and give you benchmark score (like graphics cards, etc.).
> Agent XYZ: reads only X% of the content it accesses.<p>I synced up with a colleague of mine who is testing retrieval behaviors across platforms right now and writing about them at: <a href="https://rhyannonjoy.github.io/agent-ecosystem-testing/" rel="nofollow">https://rhyannonjoy.github.io/agent-ecosystem-testing/</a><p>The info we have so far isn't consistent enough for a standardized benchmark, but it's on our radar to produce something like this in the future as we home in on how to assess this more consistently, or at least how to compare outputs in a more standardized way.