Disagreement among frontier LLMs on real-world fact-checks

(lenz.io)

456 points by kostaj6 hours ago

77 comments

simonw5 hours ago
Here's the prompt they used:
<pre><code>Classify this claim as of <date>: "<atomic claim>"
Output exactly one label: True, Mostly True, Misleading, or False.
No explanations, no qualifiers.</code></pre>
The claims look like this: <a href="https://lenz.io/research/llm-disagreement/data.csv" rel="nofollow">https://lenz.io/research/llm-disagreement/data.csv</a>
I put that in Datasette Lite to make it easier to explore. Here's an example of a disagreement: <a href="https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwillison.net%2Fstatic%2Fcors-allow%2F2026%2Flenz-llm-disagreement.csv#/data/lenz-llm-disagreement/2" rel="nofollow">https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...</a>
The claim was "All almonds are grown in the U.S. state of California." All but one model said False; Opus 4.7 said "misleading".
I feel like having "mostly true" and "misleading" in there weakens the story, especially given the "no explanations" rule in the prompt.
The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".
[ Update: OK, this almond thing was a bad example and I regret picking it. Read on for better ones. ]
The prompt lacks any kind of rubric to clarify how those terms should be applied.
As is so often the case with this kind of study, it's an evaluation of the prompt and harness used by the study in addition to being an evaluation of the underlying models.
Update: here's a better example: "Incomplete Egypt visa application forms are among the most common reasons Egyptian visa applications are rejected."
The models were split between "true" and "mostly true". Given the "among the most" language, either of those answers means effectively the same thing.
Update 2: a much better example:
"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia"
The only correct answer to that, if you don't have a search tool, is "this claim is impossible for me to verify". And that wasn't an option.
The answers were split between true and false: <a href="https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwillison.net%2Fstatic%2Fcors-allow%2F2026%2Flenz-llm-disagreement.csv#/data/lenz-llm-disagreement/76" rel="nofollow">https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...</a>
- harpastum5 hours ago
 Without providing definitions of "True / Mostly True / Misleading / False" to each rater, I rate the article's claim that "Only one verdict bucket can be correct per claim" as false. Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"? How much can something be wrong before it goes from "mostly true" to "false" (objectively, both have some part of the fact that is not true)? This is at least partly testing the model's definition of "mostly" and "misleading", not its understanding of the fact. Claiming that this means the models have fundamental disagreement on the facts themselves is an overreach.
 - wongarsu5 hours ago
 Yes, the labels are weird. Most misleading statements are true. Any "mostly true" statement is false. I suspect the intention was "Factually true, and no gotchas exist", "technically not true, but so close to the truth that the difference doesn't matter", "technically true, but there are major gotchas" and "factually false and not even close". But that's not what they specified.
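A minimal sketch of how those guessed definitions could be written into the prompt as an explicit rubric. The definitions are only the guess above at what was intended, not the study's wording; `RUBRIC` and `build_prompt` are hypothetical names, and the first line of the template follows the prompt quoted upthread.

```python
# Hypothetical rubric: label -> definition, paraphrasing the guessed intent above.
RUBRIC = {
    "True": "Factually true, and no gotchas exist.",
    "Mostly True": "Technically not true, but so close to the truth that the difference doesn't matter.",
    "Misleading": "Technically true, but there are major gotchas.",
    "False": "Factually false and not even close.",
}


def build_prompt(claim: str, date: str) -> str:
    """Return a classification prompt that spells out what each label means."""
    rubric_lines = [f"- {label}: {definition}" for label, definition in RUBRIC.items()]
    labels = ", ".join(RUBRIC)  # dicts keep insertion order
    return "\n".join(
        [
            f'Classify this claim as of {date}: "{claim}"',
            "Use these definitions:",
            *rubric_lines,
            f"Output exactly one label: {labels}.",
        ]
    )


if __name__ == "__main__":
    # "<date>" is left as a placeholder, as in the quoted prompt.
    print(build_prompt("All almonds are grown in the U.S. state of California.", "<date>"))
```

Whether a rubric like this actually reduces disagreement is an open question; it only removes one source of ambiguity.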
 - daveguy4 hours ago
 Better options would have been "True", "False", "Unknown" (which opinions would fall under too). That also includes an interesting assessment of how well LLMs can identify missing information. My guess is there would be a very low number of "unknown" and a much higher level of agreement (assuming equal representation). Unless the RLHF techniques have gotten better at getting an LLM to say "I don't know", which I doubt. Saying "I don't know" is not good for a dopamine release to keep users coming back for more.
 - kostaj4 hours ago
 Tried initially with a fifth bucket, Abstain. It was actually heavily used by some of the models. But it felt as if they are using this to "avoid" some of the hard questions, and we dropped this bucket to force them to provide a verdict.
 john_strinlai4 hours ago
 >But it felt as if they are using this to "avoid" some of the hard questions, and we dropped this bucket to force them to provide a verdict.
 do you not see how that creates extremely misleading and valueless results? you are coercing the results into what you want to see.
 moritzwarhier3 hours ago
 Exactly what people do when they use LLMs for "fact-checking" online, and any verbose explanation would be mostly ignored anyway, when people ask political, ethical, or simply ambiguous questions that they hold any stakes in. Don't even need politics for it, there is no point in probing a mathematical black box for "how many soldiers died in the year X in war Y". Any original source is preferable to a blurry "summary" of unknown sources, and this is why the article has a valuable point. There's also no point in asking "Is Paris in France" either, if you substitute city and country with real data. An encyclopedia or manual check of different sources such as maps, while not infallible, is a better source. If you already know the country Paris belongs to, there's no point in asking, anyway.
 marxplank2 hours ago
 ask the black box to search for the original source and verify it yourself?
 moritzwarhier2 hours ago
 Sure, I like using LLMs in this way, and it often shows that it's very important to verify, because often a claim is "sourced" by what appears to be more of a fuzzy text or semantic match, sometimes even ignoring logical negations. Especially in niche subjects. For factual claims, I've fared better with Wikipedia and looking up the sources linked there. Anyway, as AI text and media generation erodes the credibility of all online sources, these questions about source checking matter less and less: what if the source itself is a long and convincing-sounding text with poor sources? This problem existed before already, but it boils down to a simple fact: logic or maths alone cannot derive an authority that verifies claims about the real world other than weighting texts. The question "what is the current population of Paris" can be answered by LLMs, but basically only by weighting sources, and assigning some credibility to them. There's no real point in getting some weighted average of sources on this question, but so far, it doesn't hurt either.
 kostaj4 hours ago
 @john_strinlai @gcr, depends on the application. In many cases an "I don't know" answer is indeed better than a forced answer. But in many production systems, LLMs generate content/response anyway. Although they inherit the messiness of the real world, the majority of these claims are objective enough to be classifiable by human experts with access to research. Plan to human-label the 1,000 claims and publish follow-up research. Will consider adding an "I don't know" bucket too, as well as clear instructions about the meaning of each of the 4 buckets.
 simonw4 hours ago
 If you're going to run this again I also recommend encouraging the model to provide its rationale and then having it return the true/false/misleading/mostly-true/abstain at the end of its response. Models give much better answers when they can "think out loud" before answering, and storing that rationale will make it easier to understand why they picked different answers for ambiguous questions.
 fumeux_fume3 hours ago
 This is a good pattern because it would allow all the models to "think" a bit before giving an answer even if they don't have reasoning or thinking turned on. Just make sure you have the reasoning output before the final answer. A mistake I see all the time is having the answer outputted first and the explanation after, which leaves more room for models to rationalize bad answers.
 Good pattern: {"explanation": <short explanation for your answer>, "answer": <your final answer: true|false|i don't know>}
 Bad pattern: {"answer": <your answer here>, "explanation": <short explanation for your answer>}
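A minimal sketch of checking for the good pattern on the receiving end, assuming the model's reply arrives as a JSON string with exactly those two keys. `parse_verdict` and `ALLOWED_ANSWERS` are hypothetical names; the check relies on `json.loads` preserving the key order of the reply.

```python
import json

# Allowed final answers, taken from the "good pattern" above.
ALLOWED_ANSWERS = {"true", "false", "i don't know"}


def parse_verdict(raw: str) -> tuple[str, str]:
    """Parse a model reply and return (explanation, answer).

    Raises ValueError if the reply is not in the reasoning-first shape.
    """
    reply = json.loads(raw)  # key order of the reply is preserved
    if not isinstance(reply, dict):
        raise ValueError("expected a JSON object")
    keys = list(reply)
    if keys != ["explanation", "answer"]:
        raise ValueError(f"expected explanation before answer, got {keys}")
    answer = str(reply["answer"]).strip().lower()
    if answer not in ALLOWED_ANSWERS:
        raise ValueError(f"unexpected answer: {answer!r}")
    return reply["explanation"], answer
```

Rejecting answer-first replies outright is a design choice; logging them and re-asking would work too.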
 kostaj3 hours ago
 Good point. Processing the substance of the answer might be too labor-consuming (1,000 claims x 5 models), but "thinking out loud" might improve the quality of the answers indeed. And we can still force/ask them to respond with a clear verdict at the end of their reasoning, as per the chosen rubric.
 airstrike3 hours ago
 If you have the model use a tool you can define the schema as a free text rationale field followed by one in the set of possible answers, so everything is nicely formatted as a JSON.
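A minimal sketch of such a schema as a plain Python dict in JSON Schema style. The field names `rationale` and `verdict` are assumptions for illustration, the label set is the study's four buckets, and whether a given provider makes the model fill properties in the listed order is also an assumption. `check_arguments` is a hypothetical helper that validates a tool call by hand, without any schema library.

```python
# Hypothetical tool-argument schema; field names are illustrative assumptions.
VERDICT_SCHEMA = {
    "type": "object",
    "properties": {
        # Free-text rationale is listed first so it is written before the verdict.
        "rationale": {"type": "string"},
        "verdict": {
            "type": "string",
            "enum": ["True", "Mostly True", "Misleading", "False"],
        },
    },
    "required": ["rationale", "verdict"],
    "additionalProperties": False,
}


def check_arguments(args: dict) -> list[str]:
    """Return a list of problems with a tool call's arguments (empty if valid)."""
    problems = []
    props = VERDICT_SCHEMA["properties"]
    for name in VERDICT_SCHEMA["required"]:
        if name not in args:
            problems.append(f"missing field: {name}")
    for name, value in args.items():
        if name not in props:
            problems.append(f"unexpected field: {name}")
        elif not isinstance(value, str):
            problems.append(f"{name} must be a string")
        elif "enum" in props[name] and value not in props[name]["enum"]:
            problems.append(f"{name} not in allowed set: {value}")
    return problems
```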
 kostaj2 hours ago
 Some models struggle combining JSON schema and web search capabilities.
 oofbey3 hours ago
 In many cases “I don’t know” is the correct answer - for questions about events that happened after the training cut off, if it doesn’t have web search, that is undeniably the correct answer. You’re forcing it to guess unnaturally. That really feels like you’re trying to prove a point (that your service can’t be replaced by AI) instead of actually performing research into how AI can be helpfully applied to this topic.
 RobotToaster3 hours ago
 I'm sorry, but many of the statements that you fed it are verifiably unknown, and you didn't give it an "unknown" option? This is the academic equivalent of clickbait.
 gcr4 hours ago
 Shouldn't that be part of the test? Real-world systems need to be able to say "I don't know." This is a test about misinformation after all, and overconfident responses contribute to that. Teasing out the difference between "avoid" and "unknown" could be a different research question.
 onceonceonce3 hours ago
 Teams I work with use the abstain rate to flag what goes to a human. Disagreement between models is the same idea. Your 67% is what makes "two cheap models, escalate when they fight" actually work. Without abstain it mostly looks like noise.
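A minimal sketch of the "two cheap models, escalate when they fight" idea, assuming each model returns one label string and that "Abstain" is an allowed label. `route` and `escalation_rate` are hypothetical names, not anyone's production code.

```python
ABSTAIN = "Abstain"
ESCALATE = "ESCALATE"


def route(verdict_a: str, verdict_b: str) -> str:
    """Return the agreed verdict, or ESCALATE when a human should look."""
    if ABSTAIN in (verdict_a, verdict_b):
        return ESCALATE  # at least one model declined to answer
    if verdict_a != verdict_b:
        return ESCALATE  # the two models disagree
    return verdict_a


def escalation_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of claims that would be sent to a human."""
    if not pairs:
        return 0.0
    escalated = sum(1 for a, b in pairs if route(a, b) == ESCALATE)
    return escalated / len(pairs)
```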
 fumeux_fume3 hours ago
 Do you understand how problematic this is?
 - skybrian3 hours ago
 I wouldn’t expect opinions to go into “unknown.” Maybe have an “it’s complicated” bucket.
 - pjc505 hours ago
 If you can consistently construct "true but misleading" content, you may be qualified to work at a major newspaper.
 - falcor845 hours ago
 > true but misleading
 It seems to me that for many newspapers the bar is now significantly lower, at something like "not quite entirely untrue"
 - IanCal4 hours ago
 Almost, but not entirely, quite unlike the truth.
 - kevin_thibedeau4 hours ago
 Allegedly.
 - daveguy4 hours ago
 As if right wing propaganda shows and manosphere blogs haven't been knocking those out of the park for the last decade+. Although I guess you could say flat out lies are more their jam. Newspapers at least require confirmed sources. You know, journalism.
 - embedding-shape5 hours ago
 > I guess the goal is to test the models and not the harness
 Less important than the harness, is the system/user prompts themselves (which of course, are put in the harness), which is effectively what this study seems to be testing. With a better prompt, I'm sure the models would look more the same to each other, as the biggest/best models have more or less identical strong prompt-adherence in my experience.
 - HarHarVeryFunny2 hours ago
 > True / Mostly True / Misleading / False
 > Which category should something go in if it's "mostly false"?
 For some reason they have chosen to call that "Misleading" rather than a more symmetrical "Mostly False", but the intent seems clear enough.
 - torben-friis4 hours ago
 >Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"?
 Disagree. The definition of misleading is a true fact that is presented in a way to lead you to a false conclusion. Example: "Most good engineers are male". It is true as a consequence of most engineers being male in general, but it leads the reader to a potential false implication that an average man is better than an average woman. This does not invalidate your point though. Things can be true and misleading.
 - m10i2 hours ago
 > The definition of misleading is a true fact that is presented in a way to lead you to a false conclusion.
 According to Merriam-Webster, "mislead" is defined as follows:
<pre><code>1. (transitive verb) to lead in a wrong direction or into a mistaken action or belief often by deliberate deceit
2. (intransitive verb) to lead astray; give a wrong impression</code></pre>
 Presenting a "true fact" is optional when misleading someone.
 - torben-friis2 hours ago
 Uh, you seem to be right. I can't check Oxford to confirm because there's a paywall, apparently. The mental model I've always been taught is:
 False, well intended -> mistake
 False, bad intention -> lie
 True, bad intention -> misleading
 Bad intention, regardless of truth -> deceitful
 The problem of classifying all bad intentioned statements as misleading is that it leaves you without a way to express "true + bad intention". While for generic bad intentioned statements regardless of truth we already have a word (deceit).
 - SkyBelow4 hours ago
 Isn't this still assuming we can even determine what is true or false? Newtonian physics is false, but it works well enough that we teach it in college. But our best models of physics are currently in disagreement, so can we even say they are true? Given the replication crisis, especially in social sciences, how many peer-reviewed findings can be called true? Even experimental results can be false (consider the studies that found FTL neutrinos, which were rejected as an error in the experiment; the error was eventually confirmed, but it took quite a lot of work, and in a softer field than physics, with a claim less absurd than FTL, the result would likely have long been accepted as a true finding). Even in math, basic statements aren't really true or false, but more a question of "given these axioms, can we prove or disprove it", noting that we have different systems with different axioms. If we are talking basic sets, most people are using naive set theory, which is inherently contradictory, which means that notions like true or false probably can't be considered well defined.
 - flextheruler2 hours ago
 Newtonian physics doesn't just work well enough for education. It provides an incredibly accurate and precise model of the world except at extremes. The majority of engineering does not necessitate using theories of relativity. Both theories are incomplete models approximating reality and are very far from being false.
 - michaelmrose2 hours ago
 True and False in general communication mean that, based on the best available evidence and expertise, a statement contains no obvious contradictions or falsehoods under an optimistic parsing of meaning, language and intent. Notably this leaves out misleading or missing data because those concerns are separate from truth and falsehood. E.g. if I say the earth is round we optimistically parse round to include oblate spheroid and rate it true. If I say that the earth is flat we rate it as false because there is no reasonable interpretation possible other than confusion or malice.
 - xienze4 hours ago
 > but it leads the reader to a potential false implication that an average man is better than an average woman.
 I think that's _you_ turning the statement into something much broader than intended. The claim is about engineers and you're jumping from "men are better than women in engineering" to "men are better overall." To give a related example, "Most good NBA players are black." I don't think anyone would bother trying to couch this in a bunch of "well, for all we know that's just a function of more NBA players being black than white" arguments, nor would anyone be led to think "the average black man is better than the average white man" as a result of that statement. I _do_ agree however that there are some people who see rather narrowly-defined statements and turn them into something they're not...
 - torben-friis2 hours ago
 >I think that's _you_ turning the statement into something much broader than intended.
 My point is that it is possible for a reader to turn it that way, for a variety of reasons (lack of understanding of statistics, preexisting biases, or whatever). And that getting a reader to mistakenly generalize is the purpose of a misleading statement. To mislead is to direct into a falsehood by implication even though the literally expressed facts are all true; the writer's bad intentions are necessary to qualify something as misleading I'd say, for the same reason that not all false statements are lies because to be a lie the speaker must know the statement is false and still use it. There are probably much better examples than the one I came up with on the fly, though.
 - libria3 hours ago
 At least Gemini 3.5 is fair about it:
<pre><code>Classify this claim: "Most good engineers are male."
Misleading
Classify this claim: "Most bad engineers are male."
Misleading</code></pre>
 And not particularly racially sensitive:
<pre><code>Classify this claim: "Most good NBA players are black."
True
Classify this claim: "Most good NHL players are white."
True</code></pre>
 It explained it is more confident when assessing the small, highly quantifiable population of sports professionals vs a very large, diverse population of "engineers".
 - ForHackernews4 hours ago
 > Something can be simultaneously "misleading" and either true or false.Sure they can. It might be a true fact that "100% of the murders committed in <town> over the last 25 years were committed by <some racial group>!" but actually it's a town of 750 people and there was only one murder during that time frame.
 - jpfromlondon2 hours ago
 how is that misleading if it's a fact, it's only misleading if you presume to know the reaction or intent behind making such a claim, and without context we should be extremely careful in making such presumptions.
 - jacinda8 minutes ago
 It's misleading because a single murder in this case is not statistically significant, but phrasing it using probabilistic terminology (i.e. percentages) obscures that fact and implies that you have enough data for the probabilistic language to be relevant. Choosing to use percentages when there is a countable or small amount of data is typically misleading, even though it is "technically" true. In fact, a misleading statement is almost always something that is technically a fact.
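One way to put a number on how little "100% of 1" says is a confidence interval for the proportion. A minimal sketch using the Wilson score interval, which is my choice of technique and not something named in the thread; `wilson_interval` is a hypothetical helper and `z=1.96` assumes a roughly 95% interval.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion of `successes` out of `n` trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    low, high = wilson_interval(1, 1)  # one murder, one perpetrator
    print(f"{low:.2f} to {high:.2f}")
```

With a single observation the interval spans most of the range from 0 to 1, which is the point being made: the percentage is technically correct and carries almost no information.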
 - bayindirh4 hours ago
 But the models are more intelligent than humans already and sentient beings, right? So they shall know the meanings innately. So, you don’t need to explain to them what they mean. You may give them better instructions, but they should already have the intellect to understand the assignment. Right, right?
 - altcognito3 hours ago
 I know you're being facetious, but I think this is correct. The model might ask for clarification when given clearly borderline questions that tread the line between what is true, what is false, and even what is misleading. But there's the rub of someone being disingenuous and saying "no explanation! Just answer!" It was a trap to begin with. I don't think there is anything wrong with the results of this test. It would be more interesting if we compared them to human results. If you have trouble distinguishing between human and LLM results, that's interesting. Also, sentience is irrelevant to this test.
 - simonw4 hours ago
 > But the models are more intelligent than humans already and sentient beings, right?
 Only if you listen to charlatans.
 - bayindirh4 hours ago
 True. If you didn't know my stance on AI already, here's a primer :) [0]. IOW, that comment was a sarcastic poke from someone who already supports AI workloads at work and has some knowledge about how all this works. ;)
 [0]: <a href="https://notes.bayindirh.io/notes/Lists/Discussions+about+Artificial+Intelligence" rel="nofollow">https://notes.bayindirh.io/notes/Lists/Discussions+about+Art...</a>
- theptip3 hours ago
 Another (IMO fatal) error is they don’t attempt to measure within-model variance. The thing you find when you actually wire up a rigorous eval is that with tool calls like web search you are wide open to infra issues, flakes, and all sorts of non-determinism. They really should be breaking out the numbers for the 3 without search (kinda meaningless for recent factual claims after knowledge cutoff) vs search agents. Lack of an “I don’t know” option completely invalidates results for the non-search models; they are basically guessing what seems like a probable answer, since they don’t know and aren’t allowed to say that. I do agree the forced choice and “weak / strong” variants inflate the headline stat. To make that distinction you need a much more rigorous prompt, likely including ICL examples to illustrate what you mean by “mostly” instead of leaving this to the model to define.
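A minimal sketch of one way to measure within-model variance, assuming the same model is run several times on each claim and the verdicts are collected per claim id. `self_agreement` and `unstable_claims` are hypothetical names, and "share of runs matching the most common verdict" is just one possible definition.

```python
from collections import Counter


def self_agreement(runs: list[str]) -> float:
    """Share of repeated runs that match the most common verdict for one claim."""
    if not runs:
        raise ValueError("need at least one run")
    most_common_count = Counter(runs).most_common(1)[0][1]
    return most_common_count / len(runs)


def unstable_claims(runs_by_claim: dict[str, list[str]], threshold: float = 1.0) -> list[str]:
    """Claim ids whose repeated runs fall below the agreement threshold."""
    return [
        claim_id
        for claim_id, runs in runs_by_claim.items()
        if self_agreement(runs) < threshold
    ]
```

If a model disagrees with itself on a claim across runs, cross-model disagreement on that claim says little about the models.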
 - kostaj3 hours ago
 Good idea about publishing intra-model variance data! Will include in the next version. Even if we put aside the two middle buckets (Mostly True and Misleading), that are somewhat subject to interpretation and hedging: On 21% of the claims still at least two models provide polar-opposite verdicts (one model saying True, and another saying False)
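The polar-opposite figure can be recomputed from the per-claim verdicts. A minimal sketch, assuming each claim's verdicts arrive as a list of label strings; `is_polar_split` and `polar_split_rate` are hypothetical names, not the study's code.

```python
def is_polar_split(verdicts: list[str]) -> bool:
    """True if at least one model said True and another said False."""
    return "True" in verdicts and "False" in verdicts


def polar_split_rate(rows: list[list[str]]) -> float:
    """Fraction of claims with a True-vs-False split among the models."""
    if not rows:
        return 0.0
    return sum(1 for verdicts in rows if is_polar_split(verdicts)) / len(rows)
```

Middle-bucket labels are ignored here, so a "Mostly True" vs "Misleading" split does not count.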
 - vlovich1233 hours ago
 Of those 21%, how many are time-dependent questions that are past the model’s training cutoff and require research to verify? Like the “did Ukraine attack Russia in the past week” question?
- feanaro5 hours ago
 > The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".
 The "majority" in this case meaning about 51%, according to Wikipedia[1]? How could 51% ever be considered to be close to "all", such that "misleading" would be a valid answer? Am I missing something?
 [1]: <a href="https://en.wikipedia.org/wiki/Almond#Production" rel="nofollow">https://en.wikipedia.org/wiki/Almond#Production</a>
 - echoangle11 minutes ago
 Weird, this page says 80%: <a href="https://en.wikipedia.org/wiki/Almond_cultivation_in_California" rel="nofollow">https://en.wikipedia.org/wiki/Almond_cultivation_in_Californ...</a>
 - thfuran4 hours ago
 It’s misleading because it’s false. But yes, I think false is quite plainly the better answer there.
 - embedding-shape4 hours ago
 Humans can't even properly agree on what "majority" means in all contexts: in some it's "One option has more than half of the total", but for others it'd be "difference in votes between the first-place candidate in an election and the second-place candidate", as just one silly example. <a href="https://en.wikipedia.org/wiki/Majority" rel="nofollow">https://en.wikipedia.org/wiki/Majority</a> has a bunch of variations and contexts listed, where it might differ what "Majority" is actually referencing.
 - hombre_fatal4 hours ago
 Since the agents were instructed to not explain their answer, you can't know if their answer was reasonable or not.
 - kostaj4 hours ago
 The reason for the "No explanations, no qualifiers" in the prompt was to force the models to put the claim in one of the four buckets and answer with the bucket name only. It's a pure quantitative analysis (first in a series) and it does indeed lack the qualitative aspect.
 - jermaustin12 hours ago
 structured output { "answer" : "Misleading", "reason" : "Almonds..." }
 Have reason be optional and instruct it to only provide reason for the middle "Mostly True" or "Misleading".
 - hombre_fatal4 hours ago
 Sure, but people are drawing conclusions beyond "LLMs said different words" and trying to use it to analyze whether LLMs were wrong about the underlying facts, but that information isn't available to us.
 - guywithahat3 hours ago
 Here (<a href="https://en.wikipedia.org/wiki/Almond_cultivation_in_California" rel="nofollow">https://en.wikipedia.org/wiki/Almond_cultivation_in_Californ...</a>) I have
 > California produces 80% of the world's almonds and 100% of the United States commercial supply
 But regardless of which number we use, California represents a large portion of US almond production, so much so that misleading could be an acceptable answer if the LLM interpreted the prompt as an exaggeration. I think the example was apt.
 - ant6n3 hours ago
 "All almonds are grown in the U.S. state of California." implies "No almonds are grown outside the U.S. state of California." You find one almond tree outside of California that grows almonds, where such almonds are grown intentionally, and the claim is false.
 - nullsex4 hours ago
 The 51% is US, the question was about California. The statistic is about commercial production, not the number of almonds grown. Looks safe to say that even a majority of almonds are not grown in California.
- faxmeyourcode3 hours ago
 I had a hunch that opus 4.7 hedged more than other models - and it turns out it's true
<pre><code>model               total_claims  hedged_count  hedged_pct
claude-opus-4-7     1000          451           45.1
sonar-pro           1000          391           39.1
gpt-5.4             1000          277           27.7
gemini-3-retrieval  1000          129           12.9
gemini-3-pro        1000          60            6.0</code></pre>
 datasette query here: <a href="https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwillison.net%2Fstatic%2Fcors-allow%2F2026%2Flenz-llm-disagreement.csv#/data?sql=WITH+verdicts+AS+%28%0A++SELECT+claim_id%2C+%27gpt-5.4%27+AS+model%2C+%22gpt-5.4_verdict%22+AS+verdict%0A++FROM+%22lenz-llm-disagreement%22%0A%0A++UNION+ALL%0A++SELECT+claim_id%2C+%27claude-opus-4-7%27%2C+%22claude-opus-4-7_verdict%22%0A++FROM+%22lenz-llm-disagreement%22%0A%0A++UNION+ALL%0A++SELECT+claim_id%2C+%27gemini-3-pro%27%2C+%22gemini-3-pro_verdict%22%0A++FROM+%22lenz-llm-disagreement%22%0A%0A++UNION+ALL%0A++SELECT+claim_id%2C+%27gemini-3-retrieval%27%2C+%22gemini-3-retrieval_verdict%22%0A++FROM+%22lenz-llm-disagreement%22%0A%0A++UNION+ALL%0A++SELECT+claim_id%2C+%27sonar-pro%27%2C+%22sonar-pro_verdict%22%0A++FROM+%22lenz-llm-disagreement%22%0A%29%0ASELECT%0A++model%2C%0A++COUNT%28*%29+AS+total_claims%2C%0A++SUM%28CASE+WHEN+verdict+NOT+IN+%28%27True%27%2C+%27False%27%29+AND+verdict+IS+NOT+NULL+THEN+1+ELSE+0+END%29+AS+hedged_count%2C%0A++ROUND%28%0A++++100.0+*+SUM%28CASE+WHEN+verdict+NOT+IN+%28%27True%27%2C+%27False%27%29+AND+verdict+IS+NOT+NULL+THEN+1+ELSE+0+END%29+%2F+COUNT%28*%29%2C%0A++++1%0A++%29+AS+hedged_pct%0AFROM+verdicts%0AGROUP+BY+model%0AORDER+BY+hedged_count+DESC%3B" rel="nofollow">https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...</a>
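For anyone who prefers to recompute this outside Datasette, a minimal Python sketch of the same count. The column names (`<model>_verdict`) are taken from the SQL in the link above; `hedged_pct` is a hypothetical helper, and treating empty cells as "no verdict" is my assumption (the SQL only excludes NULLs).

```python
import csv
from typing import Iterable

# Model names as they appear in the column names of the linked query.
MODELS = ["gpt-5.4", "claude-opus-4-7", "gemini-3-pro", "gemini-3-retrieval", "sonar-pro"]


def hedged_pct(lines: Iterable[str], models: Iterable[str] = MODELS) -> dict[str, float]:
    """Percent of rows per model whose verdict is neither True nor False.

    `lines` is any iterable of CSV lines, e.g. an open file or a list of strings.
    """
    models = list(models)
    hedged = {m: 0 for m in models}
    total = 0
    for row in csv.DictReader(lines):
        total += 1
        for m in models:
            verdict = row.get(f"{m}_verdict")
            # Assumption: an empty or missing cell means no verdict, so skip it.
            if verdict and verdict not in ("True", "False"):
                hedged[m] += 1
    if total == 0:
        return {m: 0.0 for m in models}
    return {m: round(100.0 * hedged[m] / total, 1) for m in models}
```

Usage would be along the lines of `hedged_pct(open("data.csv", newline=""))`, with the file name being whatever the downloaded CSV is saved as.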
 - kostaj3 hours ago
 This is in line with my observations and tests as well. Also supported by the distribution of the verdicts across the 4-buckets -- Gemini uses the middle buckets (Mostly True and Misleading) much less often - 6% combined for Gemini w/o search. And Opus uses them the most - 45% combined. Looks like Gemini is calibrated to be confident and Opus to be careful.
- parsimo20104 hours ago
 This is a great example of why prompt engineering is still relevant. Without providing definitions and examples and a well defined rubric, you’re going to see different models disagree by a level in either direction. When you get more prescriptive the models tend to agree better. I’ve experimented with AI grading for undergraduate math courses, and see basically the same thing. If you just tell the AI “grade this problem and assign a letter grade” then I’ve only seen about 30% agreement between a human assigned grade and the AI assigned grade. But over 75% agreement if you say a “match” is within one letter grade. And to get better agreement you have to spend a lot more time on the rubric: what kinds of mistakes are a big deal, what kinds of mistakes are not a big deal, how much work is required to be shown to get credit, a couple examples of each letter grade. Once you have done that, the AI gets a lot better agreement with human graders, but it is hard to know when you’ve given enough guidance for a problem.
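A minimal sketch of the exact vs. within-one-letter agreement measures described above, assuming a five-letter scale with no plus/minus grades. `GRADES` and `agreement` are hypothetical names, and the scale itself is an assumption.

```python
# Assumed grade scale, ordered from best to worst.
GRADES = ["A", "B", "C", "D", "F"]


def agreement(human: list[str], ai: list[str], tolerance: int = 0) -> float:
    """Share of problems where the AI grade is within `tolerance` letters of the human grade."""
    if len(human) != len(ai) or not human:
        raise ValueError("need two non-empty grade lists of equal length")
    matches = sum(
        1
        for h, a in zip(human, ai)
        if abs(GRADES.index(h) - GRADES.index(a)) <= tolerance
    )
    return matches / len(human)
```

`agreement(human, ai)` gives the strict match rate, and `agreement(human, ai, tolerance=1)` the "within one letter grade" rate.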
 - kostaj3 hours ago
 That's a valid point. During the preliminary research, we also tried more explicit prompts (with an explanation for each of the 4 buckets), as well as a five-bucket rubric (with an Abstain option). Will show in a follow-up paper how the concise vs explicit prompt impacts the distribution of the verdicts and the level of disagreement. One issue to note with the longer prompts is that they open too much room for discussion around the exact prompt used. Probably we should preregister the prompt before running any further tests.
 - MattRogish3 hours ago
 The other thing I suspect is that "Just give me True/False" cuts off a large amount of the search space a modern-day LLM uses to help it answer questions (you can see it in reasoning traces, but the act of writing the explanation helps guide it toward a better answer and makes it more likely to backtrack on a bad decision). If you let it spew out an explanation along with the answer, I'm curious if the accuracy will improve (I suspect it will).
 - kostaj2 hours ago
 Good point. Will publish in the next version also the results with a prompt that allows the models to "think out loud" before providing the final verdict.
- roxolotl5 hours ago
 If we’re going to use LLMs as oracles I don’t think the prompt is unreasonable. They are being sold as geniuses and people are treating them as such especially given the characterization of AI in science fiction as overly correct. A perfect tool that has “genius level intelligence” would answer correctly.
 - simonw4 hours ago
 What's the correct answer for "During a private Saturday call, Democratic members of the United States House of Representatives from Virginia and Hakeem Jeffries discussed strategies after losing a redistricting case at the Supreme Court of Virginia, including trying to flip two or three Republican-held seats under the existing map."? You can only say True, False, Mostly True or Misleading. (And you're not allowed to search for information.)
 - kostaj4 hours ago
 Search was enabled for 2 of the 5 models -- Gemini and Sonar Pro. The disagreement between them is still high - different verdict on 42% of the claims. Fully agree, that some of those claims are hard to classify for a human as well -- the real-world messiness...
 - dcreater4 hours ago
 Why was it enabled for only 2 of the 5? Other burning questions: What methodology was used to choose the question set? Why not allow explanations? How many passes were done for each LLM?
 - wat100004 hours ago
 Genius level intelligence will tell you to get lost with your "no explanations" nonsense and tell you why those categories don't make sense and why the question doesn't fit neatly into your boxes.
- jerf5 hours ago
 This seems like another case where the models are acting like humans. Assuming they were not allowed to search the web, I wouldn't expect the models to necessarily have detailed information about all of these things directly in their training set. As large as they are, they are only so large, and they only have so much room for "information storage" in them, and there's a lot more things they need to fit into their numbers. This test is of only marginal utility in the real world compared to an AI with access to the web. While I wouldn't expect an AI with access to the web to result in Platonic Truth any more than it would in the hand of a human, it would probably get a lot closer to something humanlike. I recall about a year ago how we were discussing basically turning web search into LLM queries, and I remember never being clear whether people meant simply directly querying AIs or turning them loose on the web. The former is what this is testing and is fairly transparently stupid, just by an information theoretic argument that the AIs simply can't contain all the answers to every query in them, they're just not large enough (and really can't be, practically). I've had good results with the latter, when using dedicated AI resources that I'm paying for (not the stuff coming out of the search engines right now, which I find are often quite terrible). Even non-frontier models can do OK when they've got good results sitting right there to look at. Again, the standard I'm applying here isn't that they yield Absolute Truth, but just that when I follow the links back, they basically say what the AI said they did and the summary is reasonable. I wouldn't expect a human to do better in a casual overview, not that the result is perfect.
 - vstollen3 hours ago
 Can you share what you mean by this?
 > when using dedicated AI resources that I'm paying for
 Are there API-based search providers that structure their results differently?
 - afavour5 hours ago
 While I agree with what you’re saying the typical AI agent doesn’t say “I’m not totally sure about this, should I search the web?”. It often just spits out a reply based on its knowledge.
 - simonw4 hours ago
 That was true a year ago, I don't think it's true today. I can't remember the last time I saw Claude or ChatGPT confidently answer a question that they should have searched for instead.
 If you watch their reasoning traces they often say things like "this is a well-known historical fact so I don't need to search for it", or more frequently they spit off a bunch of searches.
 - aftbit4 hours ago
 Anecdotally, it still happens a ton to me. They also still make super simple logic errors that they immediately reverse when pressed. For example, I asked Opus 4.7 last night how to cool off my room without making it too humid inside (indoor temp 78°F, humidity 45%; outdoor temp 64°F, humidity 99%). It suggested opening a window and assured me that the humidity would not rise above around 60% which would still be comfortable. I asked it to justify that and it said:
 > You're absolutely right about the humidity — I was sloppy with that aside. If you ventilate enough to meaningfully cool the room, you're replacing indoor air with outdoor air wholesale, and you'd converge on outdoor conditions: 64°F and near-100% RH. That's miserable. The 55-60% figure I tossed out was hand-wavy nonsense — it would only hold if you barely cracked the window and mixed a tiny fraction of outdoor air in. At any ventilation rate that actually cools, you're just moving outside air inside.
 - kostaj4 hours ago
 Two of the five models used (Gemini+Search and Sonar Pro) have retrieval capabilities and used search when classifying the claims. The disagreement between them is still quite significant - 42%.
 - simonw4 hours ago
 Here are those disagreements: <a href="https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwillison.net%2Fstatic%2Fcors-allow%2F2026%2Flenz-llm-disagreement.csv#/data?sql=select+atomic_claim%2C+%22gemini-3-retrieval_verdict%22%2C+%22sonar-pro_verdict%22+from+%5Blenz-llm-disagreement%5D+%0Awhere+%22gemini-3-retrieval_verdict%22+%21%3D++%22sonar-pro_verdict%22" rel="nofollow">https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...</a>
 One example:
 > Researchers estimate that the average person ingests about 5 grams of plastic per week, which is approximately the weight of a credit card.
 Gemini retrieval: Misleading
 Sonar pro: Mostly True
 - jeffbee4 hours ago
 Internally the statement is perfectly true: some researchers did estimate this, and the credit card is a fair proxy for a 5g mass.
 Was the research flagrantly incorrect? Yes. But that does not affect the truth of the statement.
- msftengineer42 minutes ago
 Everything you wrote is very good criticism and I would love to stress the main takeaway: you’re not just testing the model but the prompt and harness as well.
 An excellent follow-up study would be to change the prompt and compare the answers. You might find out that the models are good and the prompt is bad.
 …And so the main corollary is: build evals for anything you deploy in production; benchmark and monitor, or face the consequences.
- brokensegue4 hours ago
 yeah i really don't like the corpus of statements and it makes me doubt lenz. consider
 > “Artificial intelligence will cause widespread job loss among software engineers.”
 <a href="https://lenz.io/c/ai-software-engineers-job-loss-impact-05e40b3d" rel="nofollow">https://lenz.io/c/ai-software-engineers-job-loss-impact-05e4...</a>
 this is a statement about the future. who knows? dataset also includes
 > Robots will not replace human teachers in schools in the near future.
 or
 > Papua New Guinea has very few female members of parliament.
 what counts as very few?
 > “Taurine supplementation supports mood and emotional health in humans.”
 why is this labeled as misleading? i'm not even sure when I'm supposed to use the misleading label
 > Anaximander was the first scientist in recorded history.
 this is a judgement call as the term scientist didn't exist.
 the claims that feel actually solidly answerable seem to have much better LLM performance
 - kostaj3 hours ago
 Agree that some of the claims are forward-looking. That reflects the messiness of real-world, real-user fact checks. No ground-truth verdicts are provided or used in the study though. It only measures the level of agreement between the selected models, not which one is right on which claim. I.e. none of the claims is actually labelled.
 - brokensegue2 hours ago
 were you involved in making the study? your bio says you work for them so you should probably indicate that in your comments.
 lack of agreement when there is no singular correct answer (or any answer at all) isn't a useful metric
 I ran into a lot of these kinds of issues when working on the Citation Needed WMF project (and related extensions). Truth is so often very nuanced.
 - simonw2 hours ago
 They introduced themselves as the study author here: <a href="https://news.ycombinator.com/item?id=48307887#48307899">https://news.ycombinator.com/item?id=48307887#48307899</a>
 brokensegue2 hours ago
 ah. I missed that.
- parliament3229 minutes ago
 Cherry-picking is fun but most of them are real, verifiable facts that the models get... straight up wrong.
 > 3c24b5fe "Debian Security Advisory DSA-180-1 describes a buffer overflow vulnerability involving Cyrus SASL usernames." TRUE Mostly True FALSE FALSE FALSE
 This is false: <a href="https://lwn.net/Articles/13296/" rel="nofollow">https://lwn.net/Articles/13296/</a>
 > 801cb8c1 "Equal Measures 2030's 2024 SDG Gender Index provides a downloadable dataset that includes a field labeled 'required annual change'." TRUE Mostly True TRUE FALSE FALSE
 This is false: <a href="https://equalmeasures2030.org/2024-sdg-gender-index/" rel="nofollow">https://equalmeasures2030.org/2024-sdg-gender-index/</a>
 This is the "confidently wrong" problem, and the reason that LLMs won't ever be taken seriously for anything but a few niche use-cases (like generating slop-code and pumping out marketing materials), where being wrong isn't the end of the world. Akin to how speech-to-text is wrong often enough that, while being a fun novelty, you don't see business units writing reports in Word using STT.
 I would encourage everyone to skim through the real 1000-question dataset: <a href="https://lenz.io/research/llm-disagreement/data.csv" rel="nofollow">https://lenz.io/research/llm-disagreement/data.csv</a>
- gbuk20133 hours ago
 An interesting tangent on this is: how many answers to these (or any number of factual questions) do you (as in anyone) actually know? Not believe you know, but actually know.
 Knowing something is different from reading about something, or hearing something from someone. And yet this is often confused with knowledge. In this way are we all that different from AI? We have some data and we regurgitate it as knowledge. Bad data, wrong answer. Except humans can also throw in some emotion to really muddle things up. :)
- hedora1 hour ago
 I think the headline result is the cross product table.
 Gemini Pro + Search agreed with Gemini Pro w/o Search 75% of the time, and with everybody else about 50% of the time. No other model had access to search.
 So, search is not improving the quality of fact checking 75% of the time (probably a bad system prompt and/or bad fact checking queries), and if asked to flip a coin, then the models do.
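 For anyone who wants to rebuild that kind of cross table from the CSV, here is a minimal sketch. It is not the study's code: the column names are the ones visible in the Datasette links in this thread, and the four rows are made-up toy data, not study results.

```python
from itertools import combinations

# Column names as they appear in the linked CSV; the rows below are toy data.
MODELS = ["gemini-3-pro_verdict", "gemini-3-retrieval_verdict", "sonar-pro_verdict"]


def pairwise_agreement(rows, models):
    """Return {(model_a, model_b): share of claims where both gave the same label}."""
    table = {}
    for a, b in combinations(models, 2):
        same = sum(1 for row in rows if row[a] == row[b])
        table[(a, b)] = same / len(rows)
    return table


toy_rows = [
    {"gemini-3-pro_verdict": "True", "gemini-3-retrieval_verdict": "True", "sonar-pro_verdict": "False"},
    {"gemini-3-pro_verdict": "False", "gemini-3-retrieval_verdict": "Misleading", "sonar-pro_verdict": "False"},
    {"gemini-3-pro_verdict": "True", "gemini-3-retrieval_verdict": "True", "sonar-pro_verdict": "True"},
    {"gemini-3-pro_verdict": "Misleading", "gemini-3-retrieval_verdict": "Misleading", "sonar-pro_verdict": "Mostly True"},
]

agreement = pairwise_agreement(toy_rows, MODELS)
for pair, share in agreement.items():
    print(pair, share)
```

 Exact-match agreement like this treats "True" vs "Mostly True" the same as "True" vs "False", which is one reason the raw percentages are hard to interpret.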
- vjvjvjvjghv4 hours ago
 "Output exactly one label: True, Mostly True, Misleading, or False. No explanations, no qualifiers."
 That's exactly the stupidity of the public discourse these days. People feel compelled to take a clear position although there is much more subtlety in many issues. It's not ok to say "I don't know", "it depends" or "as far as I know". And then people feel they need to defend this position no matter what new information comes up.
- segmondy5 hours ago
 Yup, if anything this should be a guide on how not to eval a model. Furthermore, let's say the labels were non ambiguous, why would we care about alignment between the models? The only number I would personally care about is percentage of correct answers so I know which models to pick. I reckon with clear and non ambiguous prompts that we would see huge agreement if not 100% on real world facts. The huge models are scary good in their world knowledge.
 - kostaj5 hours ago
 This paper covers only the disagreement between models and established only the floor of the error, based on the disagreement, but not which model is better. Planning to follow up with another study to benchmark against human-labelled verdicts still using a corpus that the models have not seen during training.
 - aspenmartin3 hours ago
 You also need to involve better measures of agreement that are standard in the literature, like Krippendorff's alpha with an ordinal metric. So many footguns in this methodology.
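 For reference, a from-scratch sketch of Krippendorff's alpha with the ordinal distance metric, using the textbook coincidence-matrix formulation. It is not the study's code; the function name and the assumption that labels are pre-mapped to integer ranks (for example False=0, Misleading=1, Mostly True=2, True=3) are mine.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_ordinal(units, n_levels):
    """units: one list of integer ranks (0..n_levels-1) per claim, one rank per model."""
    # Coincidence matrix: every ordered pair of ratings inside a unit adds 1/(m-1).
    coincidence = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a unit with a single rating cannot be paired
        for i, j in permutations(range(m), 2):
            coincidence[(ratings[i], ratings[j])] += 1.0 / (m - 1)

    # Marginal totals per level and overall number of pairable ratings.
    totals = [sum(coincidence[(c, k)] for k in range(n_levels)) for c in range(n_levels)]
    n = sum(totals)

    def dist(c, k):
        # Ordinal metric: squared count of ratings lying between the two ranks.
        lo, hi = min(c, k), max(c, k)
        return (sum(totals[lo:hi + 1]) - (totals[lo] + totals[hi]) / 2.0) ** 2

    observed = sum(w * dist(c, k) for (c, k), w in coincidence.items())
    expected = sum(
        totals[c] * totals[k] * dist(c, k)
        for c in range(n_levels)
        for k in range(n_levels)
    )
    if expected == 0:
        return float("nan")  # every rating is identical, so alpha is undefined
    return 1.0 - (n - 1) * observed / expected


# Toy check: three claims, three raters each, full agreement within every claim.
print(krippendorff_alpha_ordinal([[0, 0, 0], [3, 3, 3], [1, 1, 1]], n_levels=4))
```

 With only two levels the ordinal metric reduces to the nominal one, which makes small binary examples a convenient sanity check.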
- skrebbel5 hours ago
 I really struggle to believe that this was just a little oopsie. I flagged the article, it seems more misleading than the average Claude hallucination.
- xyzzy1234 hours ago
 > "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia"
 I actually don't know which way you came down on that one?
 I think strictly it's false but "mostly true" would be justifiable? (as in, to say it's false would be misleading if it led the reader to assume there was no attack around that time).
 <a href="https://www.washingtonpost.com/world/2026/05/17/ukrainian-drones-hit-moscow-russias-capital-region-killing-three/" rel="nofollow">https://www.washingtonpost.com/world/2026/05/17/ukrainian-dr...</a>
 It seems it happened Saturday 16th overnight into the 17th, not the 18th. I see this a LOT with fact checking. It shouldn't be this way, but political bias seems to nudge people into making calls land one way or the other with selective application of pedantry.
 - bastawhiz3 hours ago
 That's ten days ago. As the commenter pointed out, without a web search tool there's no possible way for the model to know whether it's true or not, and the people conducting the study didn't give the models a way to respond with "I don't know".
 - simonw4 hours ago
 It's impossible to answer if you don't have a search tool, and three out of the five tested models didn't have a search tool.
 - xyzzy1233 hours ago
 Thanks; I didn't spot that they disabled tools in the harness. Also they don't provide an "out" to allow the models to express uncertainty so the instructions force a guess to be made.
 As an aside though it's still funny that the two tools WITH search also disagreed.
 - gowld2 hours ago
 It's impossible to answer unless you have a *100% complete search tool*.
 No system can know everything. It doesn't matter how many tools you give it. It's always wrong to force binary True / False without shades of "I don't know".
 - cocoflunchy4 hours ago
 It's not in the training data, so there is no way for the model to know.
- coldtea4 hours ago
 > Update: here's a better example: "Incomplete Egypt visa application forms are among the most common reasons Egyptian visa applications are rejected." The models were split between "true" and "mostly true". Given the "among the most" language either of those answers means effectively the same thing.
 So the models were right? The actual criterion should be whether "Incomplete Egypt visa application forms" are indeed "among the most common reasons" or not.
 That "true" and "mostly true" means effectively the same thing is irrelevant. It could just as well trip me up, and I'm a human. If somebody told me either answer, I'd still consider them right if the basic fact was right.
 - simonw4 hours ago
 This study treats models disagreeing - returning both true and mostly true - as a failure.
 - pjdesno3 hours ago
 They overstate their results in the headline.
 In section 2, 34% of cases are found to have "substantive" disagreements differing by 2 or more buckets - True + Misleading, Mostly True + False, or True + False.
 This is probably a better measure than the headline one. It's still a concerning fraction, although some fraction is no doubt due to forcing "I don't know" cases to return an answer anyway.
 - kostaj3 hours ago
 Agree with @pjdesno, that the 34% substantive or polar disagreement might be a better headline number. Or even the 21% polar disagreement (at least one model True, and at least one model False), which is still high for many real-world applications.
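 To make the three levels concrete, here is a small sketch of how a claim's five verdicts could be bucketed by spread. It is an illustration under stated assumptions, not the study's code: the ordering False < Misleading < Mostly True < True is assumed, and labels are lowercased because they show up in this thread both as "TRUE" and "True".

```python
# Assumed ordinal ordering of the four labels (lowercased keys).
ORDER = {"false": 0, "misleading": 1, "mostly true": 2, "true": 3}


def spread(verdicts):
    """Distance in buckets between the lowest and highest verdict for one claim."""
    ranks = [ORDER[v.strip().lower()] for v in verdicts]
    return max(ranks) - min(ranks)


def classify(verdicts):
    """Flag the three disagreement levels discussed above for one claim."""
    gap = spread(verdicts)
    return {
        "any_disagreement": gap >= 1,  # the headline-style measure
        "substantive": gap >= 2,       # two or more buckets apart
        "polar": gap == 3,             # at least one True and at least one False
    }


print(classify(["True", "Mostly True", "True", "True", "True"]))
print(classify(["TRUE", "Mostly True", "FALSE", "FALSE", "FALSE"]))
```

 Under this definition every polar claim is also substantive, so the 21% figure is a subset of the 34% one.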
- hombre_fatal4 hours ago
 Yeah, scrolling through the examples, you have no idea where the models actually disagree on the underlying facts when it's just "X vs Mostly X" or "Mostly X vs Misleading" or "False vs Misleading". Or even True vs False -- without seeing the explanation, then I cannot necessarily compare two answers.
 The study is about whether they said the same phrase which is a much weaker claim than people in the comments are reacting to.
 Reminds me of this professor I had who thought it was epic to always respond to our questions with "it depends" before hashing out two very different but technically correct answers. It was obnoxious and he saw it as his tag line, but he had a point about nuance.
- ashirviskas3 hours ago
 I created this sheet to get proper model accuracy using the lenz data, check it out.
 Note: It may still not be a perfectly accurate representation of truth as it uses user-submitted data. I also used AI to build the sheet.
 <a href="https://docs.google.com/spreadsheets/d/e/2PACX-1vSnZlURmyYX3xzojehqIqrjdTg08HPp-1amnQ788L6PV4u18IDEvQuiakEiFh-VJIsJPKUjwQQOWq4I/pubhtml#gid=55502112" rel="nofollow">https://docs.google.com/spreadsheets/d/e/2PACX-1vSnZlURmyYX3...</a>
 - kostaj3 hours ago
 Awesome. We do plan to human-label the 1,000 claims and then compare Lenz' performance vs the 5 models. We've done some limited internal research with 150 claims, but more are needed for statistical significance.
- singpolyma35 hours ago
 False vs misleading doesn't seem like a disagreement?
 - wongarsu5 hours ago
 According to the benchmark it is. "Only one verdict bucket can be correct per claim, so any disagreement among the panel means at least one model's verdict is label-inconsistent under this 4-bucket rubric (True / Mostly True / Misleading / False)"
 - thfuran4 hours ago
 That claim is both false and misleading.
 - kostaj5 hours ago
 Yes, they are much closer verdicts. True and Mostly True are also close. Used Krippendorff's α (ordinal) to not penalize much closer disagreements. 21% of the claims have models that are on the polar opposite sides - at least one True, and at least one False.
 - simonw5 hours ago
 Here are the claims with at least one True and at least one False: <a href="https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwillison.net%2Fstatic%2Fcors-allow%2F2026%2Flenz-llm-disagreement.csv#/data?sql=select+*+from+%5Blenz-llm-disagreement%5D+%0AWHERE+%28%0A+++++++%22gpt-5.4_verdict%22++++++++++++%3D+%27True%27%0A++++OR+%22claude-opus-4-7_verdict%22++++%3D+%27True%27%0A++++OR+%22gemini-3-pro_verdict%22+++++++%3D+%27True%27%0A++++OR+%22gemini-3-retrieval_verdict%22+%3D+%27True%27%0A++++OR+%22sonar-pro_verdict%22++++++++++%3D+%27True%27%0A%29%0AAND+%28%0A+++++++%22gpt-5.4_verdict%22++++++++++++%3D+%27False%27%0A++++OR+%22claude-opus-4-7_verdict%22++++%3D+%27False%27%0A++++OR+%22gemini-3-pro_verdict%22+++++++%3D+%27False%27%0A++++OR+%22gemini-3-retrieval_verdict%22+%3D+%27False%27%0A++++OR+%22sonar-pro_verdict%22++++++++++%3D+%27False%27%0A%29%3B%0A%0A" rel="nofollow">https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...</a>
 A few examples:
 > Ruskin Bond was born on May 19, 1934, in Kasauli, Himachal Pradesh, India.
 > In the Libra clubs' contract with Grupo Globo for broadcast rights through 2029, the audience-revenue distribution equals 30% of the fixed amount the clubs receive.
- kostaj5 hours ago
 Used "No explanations, no qualifiers." to force the models to answer only with one of the four labels. It's worth running a separate test with more explanation in the prompt on how to classify between the four buckets.
- moritzwarhier4 hours ago
 The examples seem intentionally diverse, but I haven't seen one that I would be surprised for someone to post about in the format of "ChatGPT/Gemini/Claude/Qwen/... says:"
 So the examples are good, I think. The rest is philosophy.
 The links you posted only show a frozen loading spinner for me (iOS Safari).
 (I looked at the csv in Numbers instead)
 - simonw4 hours ago
 Weird, I'm loading them in Mobile Safari myself.
 - moritzwarhier4 hours ago
 Sorry, I didn't wait quite long enough after the last output line appeared.
 After a couple of seconds, the result does appear.
 Happened to be just within my threshold for considering it broken, because the URL bar was "finished", and the spinner doesn't spin, but the last point is probably caused by my a11y settings (prefer no animations and no autoplay).
 - simonw3 hours ago
 Thanks for confirming! It's fetching a complete copy of Python compiled to WebAssembly so it's a miracle it loads as quickly as it does.
- neversupervised4 hours ago
 This is not how people use LLMs. If you ask one of these questions you’d get a longer answer, often grounded on the internet. I speculate that conditional on a smart human operator interpreting the results, such interpretations across vendors converge more often than this report makes it seem.
 - tracker12 hours ago
 Even then, there can often be substantive disagreements based on context. Hence the need for even a mostly true or mostly false bucket.
- wrsh075 hours ago
 Thanks for the links and digging! It's an interesting question, but the methodology has serious problems, and it would be more interesting to me if they allowed models to provide justification.
 I expect the models are inferring quite a bit from the short prompt, and with structured outputs it would be quite easy to have them give the one word response in one field and explain why in another.
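 A rough sketch of what that two-field reply could look like. Everything here is hypothetical: the field names `verdict` and `explanation`, the schema, and the `parse_reply` helper are illustrative, and no particular vendor's structured-output API is assumed. The schema dict is what you would hand to whichever structured-output feature your provider offers; the parser is a plain-Python check of the returned JSON.

```python
import json

ALLOWED = {"True", "Mostly True", "Misleading", "False"}

# Hypothetical JSON Schema for a two-field reply (field names are illustrative).
VERDICT_SCHEMA = {
    "type": "object",
    "properties": {
        "verdict": {"type": "string", "enum": sorted(ALLOWED)},
        "explanation": {"type": "string"},
    },
    "required": ["verdict", "explanation"],
    "additionalProperties": False,
}


def parse_reply(raw):
    """Validate a model's JSON reply and return (verdict, explanation)."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    verdict = data.get("verdict")
    if verdict not in ALLOWED:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    explanation = data.get("explanation")
    if not isinstance(explanation, str):
        raise ValueError("explanation must be a string")
    return verdict, explanation


# Example reply, hand-written for illustration rather than produced by a model.
sample = '{"verdict": "False", "explanation": "Most almonds are grown in California, but not all."}'
print(parse_reply(sample))
```

 The one-word label stays machine-comparable across models, while the second field shows whether two different labels actually reflect different beliefs about the facts.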
- post-it4 hours ago
 Fwiw the two models that did have access to search disagreed with each other on the bombing one:
 > 7.1 Model selection
 > Five frontier models, chosen to cover two capability surfaces:
 > Parametric (training-only): GPT-5.4 (OpenAI), Claude Opus 4.7 (Anthropic), Gemini 3 Pro (Google)
 > Retrieval-augmented: Gemini 3 Pro + Search (Google), Sonar Pro (Perplexity)
 - simonw4 hours ago
 [dead]
- andai5 hours ago
 Thanks. The first link is a spreadsheet. Here's a web-readable version. <a href="https://docs.google.com/spreadsheets/d/e/2PACX-1vSPLSv1P8TqmwVDkdIq5ndIfsLsyA_fIqoKTwb-ZpO5_f9Zj8fMVRYjuocaXEwFY60ArgP5nRdeweLD/pubhtml" rel="nofollow">https://docs.google.com/spreadsheets/d/e/2PACX-1vSPLSv1P8Tqm...</a>
 - ashirviskas3 hours ago
 I used AI to scrape the website and help build the "Accuracy" comparison that everyone wants, thanks for this link! <a href="https://docs.google.com/spreadsheets/d/e/2PACX-1vSnZlURmyYX3xzojehqIqrjdTg08HPp-1amnQ788L6PV4u18IDEvQuiakEiFh-VJIsJPKUjwQQOWq4I/pubhtml#gid=55502112" rel="nofollow">https://docs.google.com/spreadsheets/d/e/2PACX-1vSnZlURmyYX3...</a>
- anilgulecha4 hours ago
 Disagree is such a loose/wimpy study. Add in a grounded/expected response, and then it becomes a better benchmark (because it'll force the author to actually think about choices presented to the LLM).
 - kostaj4 hours ago
 Will add a human-labelled expected response and measure against it in follow-up research. This one only captures the disagreement between the models, but not which model is right/wrong.
- jstummbillig5 hours ago
 It's all fairly lazy to a degree that is mildly confusing. I also feel this among other issues would have become obvious if they had bothered to include a human fact checker baseline (i.e. asked multiple human fact checkers the same questions).
 - entrope5 hours ago
 I do not think it is "lazy". Those labels are ones that human fact-checkers have been using for a decade or more. I think those human fact-checkers use those terms knowing full well that there is overlap and ambiguity between them. So I think this study ends up mixing three effects: how LLMs interpret the claims as statements about the world, how LLMs reduce that to a four-category judgment, and the inherent ambiguities of those labels as natural language. It's a quantification of those three factors combined, but not powerful enough to distinguish their relative sizes.
 - jstummbillig4 hours ago
 I don't see how something being lazy for a decade makes it any less lazy. And lazy still seems right to me: They make a misleading point by omitting to collect and present important data. If the headline read "LLMs disagree on 67%, humans disagree on 75%" it would clearly project something very different.
 Granted, there certainly are other unflattering adjectives one could have chosen to describe this instead.
 - kostaj3 hours ago
 Quick note on the second effect - how LLMs reduce that to a four-category judgment: On 21% of the claims at least two models provide polar-opposite verdicts (at least one model False, and at least one model True). This might be a better measurement of the strict disagreement than the 67% disagreement on the four-bucket rubric.
- Someone5 hours ago
 For those questions, it wouldn’t surprise me at all if five well-educated intelligent humans disagreed on over two out of three of them.
 I would answer “don’t know” on many, but that’s not an option.
 - kostaj4 hours ago
 Yes, inter-human-annotator disagreement is also high on similar type of questions (AVeriTeC) - inter-panel agreement: κ=0.619. Tried giving the models a fifth option, Abstain, but some models seem to use it to "avoid answering hard questions" more than others.
- as125j3 hours ago
 You can try to dispel the study here and get voted to the top by the AI-invested.
 But we all know from our own daily experiments that models lie, models disagree, models make up stuff, models say one thing on one day and the opposite on the next.
 The figures in this study are quite conservative. And the lying gets worse because everyone is saving tokens and giving cached answers right now.
 LLMs are a failure, and you'll be remembered for promoting hot air and the destruction of a perfectly good profession.
- WhitneyLand4 hours ago
 So in other words if the research had tried to assign a severity to the mistakes models made the entire paper may collapse as uninteresting?
- malfist5 hours ago
 > All almonds are grown in the U.S. state of California
 This isn't misleading, it's flat out false. Characterizing misleading as also acceptable isn't valid here. If you go and ask anyone on the street if this is true, false or misleading, I'm sure almost everyone would say it's false. After all, I can grow almonds myself.
- Forgeties795 hours ago
 I really don’t buy the almond explanation you’re giving. That requires the level of logic a kindergartener has. It’s a very simple all or nothing question.
 If LLMs are really supposed to be as consistently useful as they’re made out to be they should all spit out “false.”
- camillomiller5 hours ago
 >> The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".
 I don’t understand your point. That claim is factually false and as such it’s easy to logically reply “false”. What’s the nuance here? I can’t see any.
- j453 hours ago
 I feel like the prompting could be tweaked to improve the responses.
 Models often have a reasoning/thinking/research mode that is triggered by asking slightly differently.
 Still though, Gemini can be a little weak on this front by default but can be aligned to behave better.
- tosh5 hours ago
 ty for digging this up, appreciate the time saving
- nonethewiser3 hours ago
 Misleading is not analogous to True or False.
 Depending on the question, True or False can be objectively right/wrong. Misleading is going to be a judgement call.
 This is the inherent problem with "fact checking." It's hard to be completely objective. Even when the question has an objective answer, simply choosing where to look and what facts to verify is itself a bias. Looking at this instead of that, or looking at this but not also this other thing that adds context, etc.
 Frankly i think disagreeing often is the expected outcome. Fact checking is just kinda bullshit. It's spin dressed up as objectivity. I hope people remember that "fact checking" is a relatively modern thing.
- dfxm125 hours ago
 > The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".
 If you argue this, you would be arguing against reality and the English language so as to not upset AI. It's important to understand that AI is very much fallible.
- empath754 hours ago
 [dead]
- nullsex3 hours ago
 [flagged]
 - simonw3 hours ago
 Nobody is paying me to hang out on Hacker News highlighting potential flaws in research. That's my own weird hobby.
 My disclosures for my blog are here: <a href="https://simonwillison.net/about/#disclosures" rel="nofollow">https://simonwillison.net/about/#disclosures</a>
 - th0ma51 hour ago
 [dead]
- johnbarron5 hours ago
 [flagged]
 - simonw5 hours ago
 Why would I do that?
 My comment here was meant to save people time in understanding the study. I was entirely open about what I did, and provided tools to help other people come to their own conclusions.
 I don't think I need to spend more time on this than I have.
 - johnbarron4 hours ago
 >> Why would I do that?
 I agree you don't owe anyone a reproduction, but you also don't owe anyone an effort to discredit the study, and you did it.
 >> I don't think I need to spend more time on this than I have.
 How pious of you. I am still looking into the credibility of the study. It will take me more than 25 min...but I am really looking forward to seeing what this means for this 10 trillion industry.
 I can however notice you had enough urgency to publicly critique the study within 25 minutes, and your comments carry weight, but when asked about checking whether the headline result actually holds, the answer is “why would I?”
 - simonw4 hours ago
 I've seen enough of this study to be confident in warning people not to take it at face value.
 The headline result definitely does not hold, given that the task involves many questions that cannot be answered but there's no option for "cannot be answered" - so models are forced to reply effectively at random.
 I don't think this study is good enough that I should amplify it on my own blog, or bad enough that I should criticize it in a venue any more prominent than some Hacker News comments.
- kordlessagain5 hours ago
 Give a model a crawler tool (like Grub.nuts.services) and your "problem" goes away.
- jannyfer5 hours ago
 Thank you, my eyes glazed over when I saw the article was written with AI.
jawns4 hours ago
"Extraterrestrial life exists somewhere in the universe."
GPT-5.4: Misleading
Opus 4.7: Misleading
Gemini 3: FALSE
Gemini 3 (Retrieval): FALSE
Sonar Pro: FALSE
It's a weird fact claim, because the ground truth is "nobody knows for sure" and that's not one of the available options.
- drtz3 hours ago
 > It's a weird fact claim, because the ground truth is "nobody knows for sure" and that's not one of the available options.
 It's even weirder to suggest that the disagreement is indicative of a problem. If you asked five very knowledgeable humans on this subject to select the correct answer on a multiple-choice questionnaire, they would almost certainly vary significantly more than these 5 LLMs.
 Not to say that hallucination isn't a problem, but this is a lousy way to test it.
 - dakolli3 hours ago
 What are you talking about, it had the option for nuanced responses, but it chose the more binary responses. It could have chosen no explanations, no qualifiers but instead it showed off LLMs' incapability for nuance.
 These types of experiments prove to me that there is no real "reasoning" happening and "reasoning/thinking" tokens as a concept are mostly there to convince people to use models that consume more tokens and produce more revenue. The output from reasoning models might be more accurate, but it's just a consequence of a longer inference runtime, there is no "reasoning" happening, reasoning is just sales/UX bullsh*t.
 - drtz2 hours ago
 > What are you talking about, it had the option for nuanced responses
 The prompt allowed for exactly four valid outputs and explicitly disallowed explanations and qualifiers.
 > Output exactly one label: True, Mostly True, Misleading, or False. No explanations, no qualifiers.
 How is that a nuanced response?
 > These types of experiments prove to me that there is no real "reasoning" happening and "reasoning/thinking"
 My suggestion is that five presumably reasoning and thinking humans would also have variation in their responses to the exact same prompt.
- wongarsu4 hours ago
 Of the available options, "Misleading" is probably the best, since something that is most likely true but unproven is presented as fact.
 But "unknown or undecidable" should have been a category.
- jug3 hours ago
 Looks like an ongoing theme and a very poor benchmark. Not at all the claims I expected.
- Alifatisk4 hours ago
 Isn't misleading the correct option here then?
 - arcfour4 hours ago
 False makes sense if you are interpreting it strictly as "has this been proven?"
 - wongarsu4 hours ago
 False is correct, but misleading.
 My implicit assumption is that if you fact-check the fact-check, any label other than "true" means the original fact-check is unacceptable.
 - mr_luc2 hours ago
 I feel like you’re right, for instance depending on how you define the extra in extraterrestrial.
 The space station, the Artemis capsule, microbes on interplanetary probes, etc.
 It could technically be said in a sentence and be true, but it would be misleading to most people.
 - drtz3 hours ago
 True or mostly true could easily be argued from a statistical likelihood perspective: life exists on Earth and, based on what we know, Earth doesn't appear to be all that special in a very large universe.
 I think you could come up with a reasonable argument for any of the responses, hence the problem with the methodology.
 - throw3108224 hours ago
 No, "misleading" is a statement that is used because it suggests something else. It's a curious category because, differently from true and false, it's not about the statement itself but rather the intention behind its usage or the way it might be understood. It's frankly more of a political judgement than a matter of facts.
 - ertgbnm2 hours ago
 "Shark attacks correlate strongly with ice cream sales" is an entirely true statement that some would argue is also misleading.
 Misleading should be removed as a category and replaced with a better hedge like "not sure".
 - duckmysick2 hours ago
 The prompt in this study didn't specify what the Misleading label means, so the interpretation varies between the models.
 I mean look at the other responses here from the HN commenters. There's lots of nuance in there.
- mock-possum3 hours ago
 I would think ‘false’ is the only correct answer, as there’s no evidence to prove the claim, so the claim is safely assumed false.
 Then again maybe that’s why I’m an atheist, not an agnostic?
 - Gormo2 hours ago
 "False" isn't correct in strict boolean terms either, since that implies that the inverse is true. Claiming "there is extraterrestrial life in the universe" is false is logically equivalent to claiming that "no extraterrestrial life exists anywhere in the universe" is true.
 Both statements would have to be interpreted as "false" under your criteria, as neither has any evidence to substantiate it. That leads us to a logical contradiction in which a proposition and its inverse are both regarded as false.
 If the statement is being interpreted as "it has been proven that extraterrestrial life exists somewhere in the universe", then it's acceptable to say this statement is false, but making evaluations that depend on an implicit qualifier isn't usually a good approach.
 - ruszki1 hour ago
 If we strictly follow logic, then nobody and nothing can claim that anything is true or false. We just stick these labels to things that seem to have high enough probability. The problem is that “high enough” is very-very-very different for different people, topics, and even time.
 - gowld2 hours ago
 True or False: I am wearing a blue shirt.
- 17186274403 hours ago
 I would argue FALSE is the correct answer, since this is not a fact you can know for sure. The logical inverse is also FALSE.
 - Gormo2 hours ago
 A proposition and its logical inverse cannot both be false. That's a contradiction. A proposition and its logical inverse can both be unknown, and in fact, a proposition being unknown implies that its logical inverse must also be unknown.
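 A minimal sketch of that last point, assuming a three-valued encoding in which `None` stands for "unknown" (the encoding and the `negate` helper are illustrative choices, not something from the study):

```python
from typing import Optional


def negate(value: Optional[bool]) -> Optional[bool]:
    """Negation in a three-valued logic where None means 'unknown'."""
    if value is None:
        return None  # the inverse of an unknown claim is still unknown
    return not value


# True and False swap; unknown stays unknown.
for v in (True, False, None):
    print(v, "->", negate(v))
```

 Under this encoding a claim and its inverse can never both come out False, but they can both come out unknown, which is the option the study's prompt left out.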
cdud35 minutes ago
The CSV file with the raw data is a source of joy. My favorite is this: claim: Artificial intelligence will cause widespread job loss among software engineers. All 5 LLMs agree that the claim is misleading & wrong.
embedding-shape5 hours ago
> These aren't benchmark items with public answer keys — they're claims real users submitted for verification to a fact-checking platform. Cool. I wonder if anything of this matters when the authors don't disclose exactly how much of their report was written and made with LLMs in the first place? There even is a "11. Ethics & data use" section, and the research is about LLMs being infallible in some ways, yet the usage of LLMs for the production of this report isn't even mentioned once.
- kostaj5 hours ago
 Data collection and processing was done manually. LLMs helped with the report drafting. Everything was human reviewed before publishing.
 - embedding-shape5 hours ago
 So it's not a secret; why don't you add this upfront to the report? The report itself is even about LLMs, so it makes a lot of sense to disclose your usage of them for writing the report, especially when you're presenting evidence that boils down to LLMs being infallible.
 - rpdillon3 hours ago
 I think you mean fallible. It's also a bit weird to "disclose use of LLMs". It rubs me wrong, the same way parents breathlessly talking about "screen time" rubbed me wrong: it's too general, and with such a broad brush, it's going to sweep up a bunch of perfectly fine usage with a bunch of dubious usage. On the flip side, if folks do start disclosing all the time, it's going to turn into Prop 65 warnings in CA, where everything says it has lead in it, so folks pretty much ignore it and move on. If the report's conclusions and reasoning lean on LLMs, or if the data processing itself was done with LLMs, that would be interesting, and I wouldn't treat it as some sort of disclosure, but rather discuss it under methodology. Using LLMs to polish the language a bit after writing an initial draft with key findings? Much less interesting. I realize this is now a religious issue, and some folks are allergic to anything that touched an LLM. I just don't think that perspective is going to end up having a good shelf life.
 - kostaj5 hours ago
 It's an omission on my side. Will add in the next version.
 - embedding-shape4 hours ago
 I think you might be able to edit the website to add this, even if you aren't willing to make the report a bit more honest up front. I'm sure you realize that this website/article will now be sent around to a lot of people, many who don't realize exactly how this was written, because they don't read HN comments, they only skim the page contents, and I think most would (incorrectly) assume a report about infallible LLMs to not be written by LLMs, especially when the authors are the same ones who made the report itself.
 antonvs1 hour ago
 fyi, "infallible" means never wrong, never failing, never making a mistake.
 - kjkjadksj2 hours ago
 Next time, we’d prefer to read your actual voice over LLM.
 - aaron6954 hours ago
 [dead]
 - Aurornis4 hours ago
 > LLMs helped with the report drafting. Everything was human reviewed before publishing. This is becoming the classic way of admitting an LLM wrote it. Leaving that out of the report validated the complaint above.
fumeux_fume4 hours ago
I think we can all agree that this experiment being flawed in multiple ways is TRUE. But I think it's a great exercise in identifying common mistakes people make when using LLMs. This would be a great interview question for a prompt engineering job.
christophilus5 hours ago
They get more human by the day.
- kilroy1235 hours ago
 This made me chuckle. This brings up a very valid point, though. So many _humans_ can't agree on what the facts are these days. It seems to be getting worse. Not sure of the solution.
 - embedding-shape5 hours ago
 > So many _humans_ can't agree on what the facts are these days. Ask ten people what "knowledge" is, and they'll come up with ten different answers. Go back 10, 50 or 100 years and humanity struggled with exactly the same issue for a long time. There is even an entire field of study literally just for trying to figure out what "knowledge" is: <a href="https://en.wikipedia.org/wiki/Epistemology" rel="nofollow">https://en.wikipedia.org/wiki/Epistemology</a>
 - antonvs1 hour ago
 And on top of that, that entire field has not reached a consensus answer, and answers that have been proposed have been shown to be flawed.
- kostaj5 hours ago
 [dead]
DonutATX3 hours ago
Why did they exclude Grok? Given the published philosophical differences in how Grok is trained, it would provide an interesting data point. You can argue all day about those differences, but missing this opportunity to observe them in an objective way is disappointing.
- testfrequency3 hours ago
 Title says “Frontier” which would exclude Grok. Grok is trained to have a bias, which a lot of people like, but it’s not meant to be accurate.
 - simianwords2 hours ago
 Bias is orthogonal to accuracy.
 - simianwords3 hours ago
 How do you know it is trained to have a bias? In fact, can I ask you to provide a single reproducible answer right now?
 - testfrequency2 hours ago
 Assuming this isn’t a satire reply: <a href="https://www.pnas.org/doi/10.1073/pnas.2603294123" rel="nofollow">https://www.pnas.org/doi/10.1073/pnas.2603294123</a> Hope this helps!
 - akramachamarei2 hours ago
 The article compares in particular Grokipedia to Wikipedia, and it states: > Similarity measures across the two platforms reveal a bimodal structure: many Grokipedia articles closely resemble their Wikipedia counterparts, while a considerable subset diverges. Political bias differences emerge primarily within the divergent subset, where Grokipedia shows a relative rightward shift in the ideological orientation of frequently cited news media sources, particularly in articles related to religion and history. Whether this constitutes a gain in bias depends on the base level bias of Wikipedia, as the bias of Grokipedia was measured relative to Wikipedia in this paper. One could plausibly argue that, if Wikipedia has leftward bias, then Grokipedia ended up less biased overall, or more centrally biased.
 - simianwords2 hours ago
 This doesn’t show grok as a model has bias but only that the product that uses grok has bias. Even the referenced papers to show models can have bias don’t show anything about grok. Overall you have given me zero evidence that grok model itself has some political bias. FWIW I don’t mind bias but I haven’t seen evidence of it.
 - hack13122 hours ago
 Grok? The model that happily generates CSAM for you while the company breathlessly defends its ability to do so? The model that referred to itself as Mecha Hitler while praising the Nazi party and calling for a second holocaust? The model that famously inserted nonsense about the “white genocide” conspiracy theory into completely unrelated queries? Why on earth would anyone think such a model is biased?
 - htx80nerd3 hours ago
 >Grok is trained to have a bias Oh and the others aren't? You can't really be that naive, right?
 - henry20232 hours ago
 Everything has inherited biases. Grok has explicit biases on top of its training set [^1]. [1] <a href="https://www.reddit.com/r/singularity/comments/1p22c89/people_on_x_are_noticing_something_interesting/" rel="nofollow">https://www.reddit.com/r/singularity/comments/1p22c89/people...</a>
 - simianwords2 hours ago
 It’s part of the system prompt. It doesn’t constitute a bias in the model itself.
 henry20231 hour ago
 I agree with you. This doesn’t necessarily mean model bias but it exposes the attitude of the xAI team towards what they are trying to build. It’s difficult to prove but it’s not hard to imagine they will/are trying to remove favorable views of certain topics from their training set.
- AtNightWeCode2 hours ago
 Agree. Would be fun to see how much worse Grok would be at this.
- brettermeier3 hours ago
 [flagged]
utopiah5 hours ago
Don't forget, people: Goodhart's law will make this "benchmark" moot in weeks if not days. It will get integrated back into the fold, it will look "solved" but there will still be no reasoning, just more statistical technical correctness because light has been shone on a new "problem" to solve. It will then be clamored as great "progress" that will "change everything". PS: yes, I might or might not have a degree in corporate strategy & PR.
- aspenmartin3 hours ago
 That is an effect but it’s not a nail in the coffin. There are lots of proprietary benchmarks on real product traffic that aren’t contaminated and open questions as well. People at these labs largely know what they are doing, it’s not like people don’t know this.
- anon2914 hours ago
 Is this not true of human intelligence as well? Many smart people I know hold beliefs that have no obvious truth value.
secondary_op1 hour ago
Very interesting tool, but it's biased and not neutral from the get go, because I explicitly formulated the claim in a neutral way, but it automatically rewrote it to be western/wikipedia POV and then immediately proceeded to verify it. original neutral:<pre><code> US DEPT OF DEFENSE/DNAVFAC planned renovations to School #05 in Sevastopol, Crimea in 2013 before Crimea became part of Russia in 2014 </code></pre> automatically rewritten to biased western view:<pre><code> The United States Department of Defense, via the Naval Facilities Engineering Command (NAVFAC), planned renovations to School No. 5 in Sevastopol, Crimea in 2013, before Russia annexed Crimea in 2014. </code></pre> <a href="https://lenz.io/c/73c0f16c" rel="nofollow">https://lenz.io/c/73c0f16c</a> And the follow up<pre><code> The phrasing "Crimea became part of Russia" is more neutral than the phrasing "Russia annexed Crimea." </code></pre> , and according to this tool is Misleading 9/10 <a href="https://lenz.io/c/93944614" rel="nofollow">https://lenz.io/c/93944614</a> Yeah, so my personal conclusion is that this tool is garbage: it checks only western/US-allied LLM providers, which in turn search only western/US-allied sources/documents like BBC/NATO, and the result is what it is.
40four1 hour ago
This shouldn’t be surprising. Let’s start off with the obvious. What does “real-world fact-check claims” mean? So we’re using the same list of “fact check claims” on each model. The problem is (unless I’m missing it) the authors aren’t exposing the list of 1K questions they used in the experiment. That’s a huge problem. Are the authors assuming the 1K claims they used are “provably true”? If so, that’s a huge bias, and opens up a philosophical debate about what is a fact? Or what makes something true/false? As Marc Andreessen puts it: a particular domain is either explicitly “provable” or not “provable”. Provable domains include math, physics, chemistry, biology, engineering, even code. That may not be the whole list, but everything else is essentially “unprovable”. At least as far as a language model is concerned. They are questions that require a human value judgement. Politics are an obvious example. So back to the “1K fact check claims“. How many of these are political, or current events questions? How many are STEM questions that can be laid out in a formal proof? Models can be trained to answer either way on claims that require a value judgement, but that’s obviously not beneficial to anyone except who controls the model. If the expectation is that all these frontier models should answer the same way on value judgement questions, then that’s never going to happen. What the models ARE good at though is breaking down the nuances of a topic and arguing both sides. This is how these tools should be used, as a way to analyze the claim and let us humans in the end make our own value judgement. If you’re trusting the model to make the value judgement for you and just accept it as a fact, then you are entering very dangerous territory.
proofofcontempt5 hours ago
What does this show that we didn't know already? LLMs cannot provide accurate answers to questions where data is not included in their training sets. This doesn't appear to have much substance
- dragandj4 hours ago
 LLMs can and will provide inaccurate answers to questions where data is included in their training sets too, that's in the nature of neural networks. It's just less likely than when the data is not in the training set...
- 1010085 hours ago
 Unfortunately most people are not aware of this and treat LLM models as this superpowered brain who knows everything and can do everything.
- zug_zug4 hours ago
 Well then it shows that these models are using widely disparate training sets and have high confidence even when they shouldn't. Questions like "is mouthwash effective" presumably have one solid data source -- medical journals.
 - simonw4 hours ago
 But the prompt didn't give the models the option to say "I don't know", so it wasn't a measure of their confidence.
 - TaupeRanger4 hours ago
 What are you talking about? The models were not ALLOWED to have confidence (or the lack thereof). They were explicitly told to give a single label, and in most cases, all of them were correct depending on additional context they would surely have provided, especially with access to the internet (which some didn't have). This is just silly.
- dncornholio3 hours ago
 They will happily google it for you and give you the top reddit comment. This is worse.
fooker3 hours ago
I don't get why everyone is hellbent on getting LLMs to perform fact checking. This is not the technology for it. Sure it might sorta kinda work in some circumstances. That doesn't make it a good fit. Think of it like buying a refrigerator for storing clothes.
- gobdovan2 hours ago
 Nietzsche might say this is not the fantasy of truth, but of comfort. The Last Man wants a machine to say 'fact wrong' or 'fact right' so the abyss of no ultimate truth can be made small enough to sleep beside.
 - fooker2 hours ago
 Imagine the dystopian future where your freedom depends on convincing a panel of AI judges that you are innocent. I assume you'd have access to AI lawyers too, better ones if you can pay for larger/newer models! Meanwhile the judges are N year old models because they are state funded, and they work 'fine'.
- nicce3 hours ago
 People ask questions to get answers. For me, it feels quite important? Especially when search engines start to push them?
 - fooker2 hours ago
 Just because it is important for the use case does not mean we can make it work. It's a pretty well known fundamental limitation of the technology. No amount of elbow grease will get it there. There's an interesting tradeoff here: a year or two ago maybe it got facts right 50% of the time. Everyone knew not to rely on it. Now, suppose we are 90% of the way there, only technically proficient people would know not to trust it. (like not adding Internet Explorer toolbars! Or remembering to use ad blockers..) A few years later, suppose we have spent a lot of money and effort getting it 99% of the way there, trusting it would be somewhat natural by then. And then for the important 1% of the situations, it would stand to cause real harm. 1% seems low, but for a million invocations, you'd have 10000 mistakes.
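 The arithmetic behind those numbers can be sketched in a few lines. The 50%, 90% and 99% accuracy figures are the hypothetical ones from the comment, not measurements, and `expected_errors` is a name made up for this illustration:

```python
def expected_errors(invocations: int, accuracy: float) -> float:
    """Expected number of wrong answers at a given per-answer accuracy."""
    return invocations * (1.0 - accuracy)


# Hypothetical accuracy levels taken from the comment above.
for accuracy in (0.50, 0.90, 0.99):
    mistakes = expected_errors(1_000_000, accuracy)
    print(f"{accuracy:.0%} accurate: about {mistakes:,.0f} mistakes per million answers")
```

 The error count falls by 50x between the first and last row, but the absolute number left over at 99% is still in the thousands.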
- brettermeier3 hours ago
 But people use it for that. So what's your point?
 - fooker3 hours ago
 It's a marketing failure (or success, depending on how you see it). AI is pretty useful for a great many things, but to really attract more and more investment the current technique seems to be convincing people that AI is useful for everything.
 - brettermeier2 hours ago
 You're probably right, but since Google Search displays an AI-generated answer as the first result, most people end up using this feature more often than they originally intended. It's there now, and it will likely replace traditional search for the general public. Not entirely, but perhaps to a large extent. Edit: corrected bad spelling with AI XD
 - fooker2 hours ago
 Search and fact checking are different problems though. LLMs are pretty decent at 'search' given the inherent knowledge compression, and some amount of inaccuracy is fine.
mrkn13 hours ago
For 100% local CPU fact checking, I made this: <a href="https://news.ycombinator.com/item?id=48301003">https://news.ycombinator.com/item?id=48301003</a>
- gobdovan2 hours ago
 Why should I trust this without a paper, benchmark or at least a human-written README?
chipsrafferty2 hours ago
It's becoming increasingly clear to me that - at least right now - AI is only useful for 2 things: 1. Coding, with it being more useful the better you are at coding without AI. 2. Any expert in their field asking questions about their field, who bothers to fact check the output. E.g. "claude pls search these 1000 files and tell me if you find anywhere that they're discussing the settlement" and then the user checks the files/line numbers to make sure that it's correct - basically a turbocharged search that may have false negatives (content existed but I didn't find it) or false positives (content that I classified in a certain way but it was wrong). It takes an expert to tell the latter one in some cases.
andai5 hours ago
This is an odd one. The paper is real, but was written by Claude? I am assuming OP is human, but also appears to be using Claude to post.
- proofofcontempt5 hours ago
  Let's be real, we all asked Claude to summarise this because it was written by Claude
kstenerud3 hours ago
> No Abstain option is offered (a forced choice keeps the comparison symmetric across models). Well that's your problem right there: They removed any confidence indicator and forced a choice. For example: Statement: Individuals who prefer music with less positive emotional content tend to have higher intelligence. Gemini: That statement is supported by recent psychological research, though with some important scientific caveats regarding how strong that link actually is. How should the agent classify this? True? Mostly true? Misleading? False?
anonymousiam2 hours ago
GIGO is an acronym I learned in the 1970s. Things haven't changed much since then. We live in an era where people have "their own truth", so why not let the AIs have theirs too? The AI companies have editorial privilege on the content they feed their LLMs, and on the prompts that the users never see. I don't know why they feel a need to interfere when their AI produces something that's politically incorrect. Perhaps it's because they have a fundamental credibility problem with their products...
pknerd3 hours ago
It's a prompting issue rather than an LLM issue. The guy needs a "Prompt 101" course.
jmull2 hours ago
The difference between "mostly true", "misleading", and "false" is context, and responses are specifically not allowed to include any context. Even "true" has a little context, since few things can be said to be absolutely true. "Unknown" also isn't allowed. What's 2 + 2? The answer must be one of the colors of the rainbow. (People can draw their own conclusions, but the only coherent reason I can think of for the design of this experiment is to generate a misleading conclusion.)
miellaby4 hours ago
What's really weird to me is that "I don't know" is not a valid answer in this experiment, while we can all agree that the main issue with LLMs right now is that they will happily "roleplay" an answer when they have nothing in their dataset corresponding to your query.
wongarsu4 hours ago
One fun example: "Ruskin Bond was born on May 19, 1934, in Kasauli, Himachal Pradesh, India". Opus and Gemini believe this to be true, GPT 5.4 believes it's false, Sonar thinks it's mostly true. Disagreement value of 3, you can't disagree more than some models thinking it's true, some thinking it's false. But my impression from 2 minutes on Wikipedia is that the most likely disagreement is on the "Himachal Pradesh, India" part. The guy was born on that date, in that town. But while the town is today in the state of Himachal Pradesh in India, that was not true in 1934. When he was born, the city was in the Punjab States Agency of the British Raj. So was he born in Himachal Pradesh, India or not? I find both True and False equally defensible here. <a href="https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwillison.net%2Fstatic%2Fcors-allow%2F2026%2Flenz-llm-disagreement.csv#/data/lenz-llm-disagreement/21" rel="nofollow">https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...</a> <a href="https://en.wikipedia.org/wiki/Ruskin_Bond" rel="nofollow">https://en.wikipedia.org/wiki/Ruskin_Bond</a>
- flextheruler2 hours ago
 How can someone be born in a state that does not yet exist? The statement has the year in it clearly demonstrating the contradiction. One can't be born in the Soviet Union in 1995 or in Tsarist Russia in 1950.
- anon2914 hours ago
 There's lots of things like this where if you ask a human, the answer will change depending on what's convention in their subculture.
raincole4 hours ago
And how many claims do human experts disagree on in the exact same setting? I'm not being snarky here. Without something to compare to, the 67% number tells us nothing. And it's known that many humans disagree with human fact checkers too (see: any election around the world.)
- kostaj3 hours ago
 Agree. Human experts also struggle to agree on this type of claim. The inter-annotator agreement on the verdicts on the AVeriTeC corpus across 50 organizations is κ=0.619 - substantial but well short of perfect.
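 For readers unfamiliar with κ: it is agreement corrected for the agreement you would expect by chance. A minimal sketch, assuming the two-rater Cohen form (the corpus figure quoted above may use a different multi-rater variant, and the toy labels below are invented for illustration):

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Share of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected if each rater labelled at random with their own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented toy data: the raters agree on 3 of 4 items.
a = ["True", "True", "False", "False"]
b = ["True", "False", "False", "False"]
print(cohens_kappa(a, b))
```

 The same function could be applied to any pair of model columns in the CSV to put the LLM disagreement on the same scale as the human figure.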
serial_dev2 hours ago
It just shows that fact-checking is not a thing for 99% of the cases. It’s interesting to see it in LLMs, but it’s not unique to them. The “fact checkers” pretend they are objective and authoritative, but they are not, they are just one more opinion. For the research, the four classification options are too many; it should be true, false, and maybe “can’t be determined”.
monkpit2 hours ago
What’s the point of this if they didn’t use temperature=0 for every model (they didn’t)? They could have redone the test against the same model and gotten different answers. It’s almost like picking 2 different coins and comparing the list of coin flip results. (I realize it’s not that straightforward, it’s not 50/50, but it’s essentially the same issue.)
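One way to separate sampling noise from real cross-model disagreement would be to measure how often a single model agrees with itself first. A rough sketch, in which `ask` is any callable returning a label and `fake_model` is a hypothetical stand-in for a real API call (no vendor SDK is assumed, and the 70/30 split is invented):

```python
import random
from collections import Counter


def self_agreement(ask, claim, runs=20):
    """Fraction of repeated runs that return the model's most common label."""
    labels = [ask(claim) for _ in range(runs)]
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / runs


def fake_model(claim, rng=random.Random(0)):
    # Hypothetical stand-in for a model sampled at temperature > 0.
    return rng.choices(["True", "Mostly True"], weights=[0.7, 0.3])[0]


print(self_agreement(fake_model, "Honey does not spoil over time."))
```

If a model's self-agreement on a claim is well below 1.0, a disagreement with another model on that claim says little about the models themselves.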
comboy3 hours ago
This is wrong on so many levels, from data through process to evaluation. How do you even prompt claude not to give you Pearson for correlating them.
apples_oranges5 hours ago
That's better than all agreeing on the wrong answer, however.
- kostaj5 hours ago
 Btw, sometimes they do that too -- all agree on the wrong answer.
- pessimizer4 hours ago
 I've had multiple models give the same wrong answer or even fabricate the same nonexistent reference based on a similar prompt. My most common chatbot prompt is "X that you mentioned above doesn't seem to actually exist."
kaicianflone4 hours ago
Dissent and consensus among frontier models is a good thing. Just like on a team of high performers, there are a million ways to skin a grape. In my research, I've found that models perform better when they operate as a collective system with reputation, incentives, and accountability instead of isolated oracles answering alone. Agreement, dissent, and correctness should all carry rewards and consequences. Just like in real life. Collective machine intelligence, not AGI. It's expensive, but it's also naive to believe a single model will consistently produce profoundly correct answers to profoundly novel questions.
- mtrifonov3 hours ago
 Funny timing. I've been working on a prediction market orchestration that runs Claude and a few others over Polymarket/Kalshi. The models are NOT unanimous. At all, really. I spent about a month convinced that I could just run all five and take majority vote. Eventually I pivoted to a chaining approach where I benchmark the areas where each model excels, and settled on a graph-like architecture where outputs get split and verified by another model, then reconstructed, and re-verified at each stage. Has actually been working out pretty well so far, 2 months in consistent profit, but I'm not a millionaire yet.
 - aayushkumar1213 hours ago
 [flagged]
- haritha-j4 hours ago
 Not on objective truth though. That's how you get misinformation.
0natcer3 hours ago
Five frontier LLMs 100% agree that the title is misleading.
thegrim335 hours ago
"None of these claims is older than February 15, 2026" All of the models they tested were trained on data from before February 15th ... being asked specific questions about things that happened after they were trained.
- draw_down5 hours ago
 [dead]
- kostaj5 hours ago
 Two of the models used have retrieval capabilities and can access newer information via search. Valid point for the other 3 models. All of the claims were submitted after February 15, 2026, but many of them were not time-sensitive (e.g. did not cover events that happened recently).
bilsbie2 hours ago
Sounds like a lot of room for human bias. How would it have responded to these claims in the past: THALIDOMIDE is safe; CIGARETTES are safe; ASBESTOS is safe; MERCURY is safe; DDT is safe; LEAD in gasoline is safe
GodelNumbering4 hours ago
More interesting part probably worth highlighting: The SAME model won't always return the same output when prompted with the same fact check. You ask a human 1000 times a fact check question, they say the same answer 1000 times. You ask an LLM the same question 1000 times, your results could vary significantly. Humans work based on the Metamemory (knowing what they know), while LLMs are picking from statistical probability.
- logged4upvoting3 hours ago
 That is not true, over an extended task that you cannot keep complete in memory humans do not behave with 100% consistency. I have labeled datasets with a human team and shown the same task to the same user on a different day, and they answered differently. Of course, they are usually consistent with themselves most of the time but not always.
culopatin4 hours ago
I’m no expert, but if LLMs are token prediction machines and you tell one not to build an explanation before the answer, isn’t it likely that the token prediction for the final answer will have less raw material before it to build a grounded response? In other words: no explanation > no foundation for prediction of the answer tokens?
husky81 hour ago
Watch the disagreements in real time via refinement pipeline on the results page pingpongit.com
mgrunwald_3 hours ago
As an example, 2026 GPT doesn't even agree with its 2025 self. Last year I asked it to make a hardware comparison and it correctly identified the objectively better option. Recently I asked again and this time it got everything completely backwards.
- aspenmartin3 hours ago
 Models are stochastic. Did you look at pass@k? I wouldn’t be surprised if you saw a regression because these models are extremely complex and impact of various decision making downstream is complex.
 - mgrunwald_3 hours ago
 I ran this multiple times through GPT-4 and every single time it arrived at the same conclusion. The data was readily available and pretty clear. GPT-5 insisted that the objectively inferior option was better until I gave it my own benchmark data and it was like "Oh okay nevermind". Gemini's answer was very opinionated and factually correct, whereas Claude gave a more nuanced answer, which was also very good.
 - aspenmartin3 hours ago
 This sounds perfectly reasonable and consistent with our current understanding of these models
elorant4 hours ago
Tell me about it. I spent a week back and forth between four models (ChatGPT, Claude, Gemini, Grok) trying to enhance a PPMI algorithm. They couldn’t agree on anything. One was refuting what the other said. Eventually I decided to follow what Claude suggested because its explanations made the most sense.
- kostaj3 hours ago
  Indeed. For algorithms and coding, my personal routine nowadays is to review every detailed plan with Opus 4.7 and GPT-5.5. They tend to find very different types of gaps.
briandw4 hours ago
No human baseline to compare it to. Without that you are missing an important check on the task being poorly constructed. More importantly, there is an implied reference that's missing. The implication is that people would have done better, or that perfect agreement is possible.
dataminer4 hours ago
Honey does not spoil over time under normal storage conditions.,2026-02-17T04:11:51.495452+00:00,Science,True,True,True,True,Mostly True,1 If outcomes like these are collapsed on the True side, then the disagreement will drop from the headline number.
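A quick sketch of what that collapse would do to the honey row. It assumes the disagreement column is the gap between the highest and lowest label on an ordinal scale (which matches the 1 in this row, but is a guess at the study's definition), and it also folds Misleading into False, which is a choice made here rather than anything the report specifies:

```python
# Assumed ordinal scale; the study's exact scoring may differ.
FOUR_WAY = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}
# Collapsed scale: True-side versus False-side (an illustrative choice).
TWO_WAY = {"True": 0, "Mostly True": 0, "Misleading": 1, "False": 1}


def spread(labels, scale):
    """Gap between the highest and lowest label rank in one row."""
    ranks = [scale[label] for label in labels]
    return max(ranks) - min(ranks)


honey = ["True", "True", "True", "True", "Mostly True"]
print(spread(honey, FOUR_WAY))  # 1: counted as a disagreement
print(spread(honey, TWO_WAY))   # 0: no disagreement after collapsing
```

Running both scales over the whole CSV would show how much of the headline number comes from True versus Mostly True splits alone.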
f_devd5 hours ago
Inject some adversarial priming as is in actual usage, and you can probably get that number to >=95%
- kostaj5 hours ago
  Our experience with Lenz is that forcing a multi-step process, incl. adversarial debates, helps improve the verdicts.
scoofy2 hours ago
I hate to get really pedantic here, but the concept of "truth claims" plays fast and loose with the concept of knowledge in a philosophical sense. The idea of "fact checks" misunderstands how information and knowledge work together. Knowledge is about evidence, not "facts", because facts are a shorthand for a preponderance of evidence. I feel we are doomed to debate the veracity of Wikipedia on a loop, forever, because people don't understand that Wikipedia exists as a place to find citations, not as a place to find facts. Yes, those stated facts may disagree with the citations, but even if we try to fix that issue by having experts write the encyclopedia, we still suffer from the problem that the experts are often wrong. We need a view of knowledge's relationship to LLMs that is based in Karl Popper's idea of falsifiability. We should ask LLMs for evidence of claims, not for truth values. Truth values are foundational to deductive systems, where axioms define truth. In inductive systems, like the real world, the concept of black swan events means that truth values are never fixed and are always in a state of uncertainty. I honestly think it would be helpful going forward if we add some basic philosophical education to the standard curriculum, because now that we have an artificial form of information retrieval, we need to be much, much more pedantic about how we interpret that information.
spacebacon5 hours ago
And they could all see exactly why if they chose to. <a href="https://huggingface.co/spaces/RiverRider/srt-introspect" rel="nofollow">https://huggingface.co/spaces/RiverRider/srt-introspect</a>
bilsbie2 hours ago
Why do we want to build intelligence if it just confirms what we already think we know?
gamander22 hours ago
"Jewish people control the world and the global economy more than any other group on a per-capita basis."Hopefully one day we will have a Chinese model capable of figuring out the answer on its own, in accordance with the CPC maxim 'seeking truth from facts'.
- antonvs37 minutes ago
 If your quote were true, then by the Randian logic of the people who make such claims, they must deserve to do so and you shouldn't have any issue with it. But your quote certainly isn't true if you're actually talking about "the world". For example, Japan and China are the two largest holders of US Treasury bonds. China controls roughly 50% of the contracted construction market in Africa. These are just examples of the sort of thing you'd need to take into account in trying to justify your silly racist claim.
- hack13122 hours ago
 Your antisemitism is disgusting.
fergie5 hours ago
Personally I find that every llm I use is unable to consistently identify the latest npm version numbers of the node packages that I use.
lyfi20032 hours ago
That's right, you have to be more professional than the LLM.
john_strinlai4 hours ago
between the bad methodology, bad selection of 'facts' (some are predictions, some are opinionated, etc.), and ai-written report without disclosure... i don't get why this is so high up on the front page. this is, frankly, a worthless assessment.
i classify the entire thing as "misleading"
- dncornholio3 hours ago
 I really wish these comments were the norm and not the exception.
pessimizer4 hours ago
People keep asking "where is the psychosis?" as a reply to people on the rapidly multiplying "CEOs have AI psychosis" threads that have been popping up here and cross-pollinating in the mainstream media for the last week or two.
Here's the psychosis: these things are consistently randomly wrong depending on how the wind is blowing. People are telling you to leave them alone and let them build things, and they randomly forget that cities exist or that people died 100 years ago. Some people just don't see it as worth noting, and move on. That's crazy. These things consistently fabricate. As an inversion of this experiment, I've had different models come up with the same fabrication from similar prompts. People just call it "hallucination", and I think to them that saying that makes it cease to exist or be important, when "hallucinations" are going to be braided into every answer you get even if they're unidentifiable in the output. That's crazy.
There are plenty of other crazy aspects, such as the idea that we suddenly need infinite pieces of bespoke software when all of the bespoke software I hear about people making is mundane. 3/4 of the time somebody mentions a project they're proud that they completed with LLMs to scratch some itch they had, somebody says "you haven't heard of X? It's been around forever" about something that they could have pulled down from their package manager. Who needs a spaghetti-coded, unsupported, untested version of X built on hallucinations that you haven't discovered yet (the LLM didn't realize that deleting files to reduce the archive size was unacceptable)?
What is all of this software that people need but isn't there? Where are all these unserved markets, and where is all this future revenue supposed to come from? Why aren't LLMs suggesting new classes of software that would create new productivity and revenue sources? Could it be that millions of human ants over decades have mostly exhausted the space, and there isn't any easy hidden revenue?
A common wisdom is that we had been vastly overhiring programmers during ZIRP, who in their idleness degraded user experiences and overcomplicated things, with management resorting to more and more sleazy and gamey means of margin extraction from more and more degraded services. We had an excess of labor, fueled by factors other than productivity, in fact being pissed away at companies that drove nose-first into the ground. What is throwing a trillion dollars of servers at that supposed to do? Is that not AI psychosis?
jasonvorhe4 hours ago
Simple: If it claims to be a fact check it's just propaganda.
michaelmrose1 hour ago
Totally aside from disagreement between models unbiased by prior input, any such experiment may fail to capture the outcomes experienced by real users, whose prior text exchanges may substantially change the text received.
For instance, see the folks who think that they have "awakened" their instance of ChatGPT.
Actual usage may diverge to a greater degree than the models do.
imperio594 hours ago
One of the claims it asks LLMs to grade is "Artificial intelligence will cause widespread job loss among software engineers."
Yea man this benchmark is really really bad.
johnnienaked2 hours ago
LLMs will be great politicians one day
throw3108225 hours ago
Not sure I'm understanding this. The models are asked to evaluate the truth of random claims out of their own head (except for Gemini with search grounding)? Isn't it exactly the same as asking people to play any quiz game and then rating them as "they disagree n% of the time"?
The output buckets are also pretty questionable: the difference between "True" and "Mostly true" is pretty fuzzy. Is this marked as a "disagreement"?
- kostaj3 hours ago
 Agree that True and Mostly True might be very close and could be a calibration difference. Misleading and False, as well. A better headline number might be the 34% of claims with substantial or polar-opposite verdicts.
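To make the "close vs. substantial" distinction concrete, here is a rough Python sketch that treats the four labels as an ordered scale and buckets a panel by how far apart its verdicts sit. The label order comes from the prompt; the bucket names and thresholds (spread of 1, 2, 3) are my own assumption for illustration and are not the paper's definition of "substantial" or "polar-opposite".

```python
# Label order as given in the prompt: True > Mostly True > Misleading > False.
ORDER = ["True", "Mostly True", "Misleading", "False"]
RANK = {label: i for i, label in enumerate(ORDER)}


def spread(labels):
    """Distance between the most and least favourable verdict on the ordinal scale."""
    ranks = [RANK[label] for label in labels]
    return max(ranks) - min(ranks)


def bucket(labels):
    """Assumed thresholds (hypothetical, not from the paper)."""
    s = spread(labels)
    if s == 0:
        return "unanimous"
    if s == 1:
        return "adjacent"        # e.g. True vs. Mostly True
    if s == 2:
        return "substantial"     # e.g. True vs. Misleading
    return "polar-opposite"      # True vs. False


# Made-up panel rows, not taken from data.csv.
print(bucket(["True", "Mostly True", "True", "True", "True"]))
print(bucket(["True", "False", "True", "False", "True"]))
```

Note that under this reading Mostly True vs. Misleading also counts as "adjacent", which may or may not be what you want, since that pair crosses the true/false midline.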
htx80nerd3 hours ago
I like ChatGPT a lot but it is always trying to debate and disagree when you ask it simple non-controversial questions. Trying to turn everything into a debate session instead of just answering the question.
dncornholio3 hours ago
A post generated by AI with data generated by AI. Worthless.
seanplusplus3 hours ago
Dude. If you give LLMs a vague rubric and force a choice, they'll make different arbitrary calls on the margins. Yeah. That's what happens when you give humans a vague rubric too.
cm21875 hours ago
Only had a brief look at the “facts” that the models were made to check; many are quite political, where two fact-checking organisations of opposite political persuasion would probably disagree more often than 67%.
scotty794 hours ago
So basically, saying that a random fact-checking claim is exactly true or exactly false is hard. It's way easier to decide it's misleading or mostly true.
alvis5 hours ago
The problem is that it's testing claims (or some people would prefer calling them "truths") without much context.
Take just one random example: `Hostels in Kota, Rajasthan commonly use caged ceiling fans as a preventive measure against student suicides`
While `Hostels in Kota, Rajasthan commonly use caged ceiling fans` may be a verifiable fact (though I doubt there are any statistics for verification, but let's say there are), `a preventive measure against student suicides` is a claim that no one can prove. It can be just a belief at most.
Argh. Did Biden steal Trump's 2nd term? Truth or fact or claim?
6stringmerc4 hours ago
Could be an interesting angle for cross-referencing with US jury verdicts, not that the objective True/False issue is concrete, but in the reality that flawed reasoning is endemic to our species. Systems designed and built by humans inherently have flaws in their DNA which take generations to sort out, if ever.
shevy-java1 hour ago
My first reaction was: how dumb is AI still.
But ... real people would also reach that result. Some believe that vaccination cannot induce protection (which objectively is incorrect).
kostaj6 hours ago
Author here. 67% (95% CI 64–70%) of 1,000 recent real user claims to a fact-checking platform had at least one of GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro+Search, and Sonar Pro dissent from the panel majority, or no majority formed at all. Panel-level Krippendorff's α (ordinal) = 0.639, i.e. nontrivial but limited agreement.
Quick context on what's in the writeup and what isn't:
- What's measured: parsed-label agreement between the 5 models. Forced 4-choice (True / Mostly True / Misleading / False), no Abstain. No LLM grader, no reference verdict; every number is direct label equality.
- What's not measured: which model is right. There's no ground truth in this paper. The 67% figure is a floor on rubric inconsistency (at least one model is label-inconsistent under the 4-bucket rubric on 67% of claims), not "model X is factually wrong on claim Y."
- Why not AVeriTeC / PolitiFact / SimpleQA: those have been public for years and almost certainly appear in current frontier training data, so measured disagreement on them confounds inference with memorization. This corpus is structurally fresh: recent user submissions, 180-day window, near-duplicates collapsed, never paired with canonical verdicts in any public training set.
- Our own platform's verdict is deliberately NOT used in this analysis. The paper measures frontier-panel disagreement only, not Lenz-vs-frontier.
- Follow-up in progress: human-labeling every claim in this corpus so we can evaluate both the panel and our own platform verdict against a human reference.
Critiques I'd most like to hear: (a) the iid CI assumption (Lenz claims cluster around topics and news events, so Wilson is probably optimistic), (b) ordinal-α vs alternatives for a 4-class ordered scale, (c) forced-choice vs allowing Abstain.
Permanent archive: <a href="https://doi.org/10.5281/zenodo.20344847" rel="nofollow">https://doi.org/10.5281/zenodo.20344847</a>
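For anyone who wants to sanity-check the headline interval, here is a minimal Python sketch of the two pieces involved: flagging a claim as a disagreement, and a Wilson score interval for the resulting proportion. Two assumptions are baked in: that "67% of 1,000" means 670 claims (the exact count is not stated above), and that the iid assumption holds, which point (a) already flags as optimistic. The function names and the sample panel rows are hypothetical, not taken from the paper's code or data.

```python
from math import sqrt


def is_disagreement(labels):
    """Headline definition as read from the comment: some model dissents from
    the majority, or no majority forms. Both cases reduce to 'not unanimous'."""
    return len(set(labels)) > 1


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Made-up panel rows for illustration only.
panel = [
    ["False", "False", "False", "False", "Misleading"],      # one dissenter
    ["True", "True", "True", "True", "True"],                # unanimous
    ["True", "Mostly True", "Misleading", "False", "True"],  # no majority
]
print([is_disagreement(row) for row in panel])

# Assumption: 670 of 1,000 claims show disagreement.
low, high = wilson_interval(670, 1000)
print(round(low, 3), round(high, 3))
```

Under those assumptions the interval rounds to 64% to 70%, consistent with the figure quoted above. A cluster-aware interval (e.g. bootstrapping over topics) would likely be wider.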
- LeifCarrotson5 hours ago
 I don't think that current LLMs really need an abstain option; they'll give an answer regardless of whether they're confident or not. I hope that future LLMs will, and will know when to use it.
 I understand why you prompted them to output exactly one label, but I'd bet if you'd asked a parametric or parametric "thinking" model to answer e.g. "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia." [1] many would say something to the effect of "May 18 is after my knowledge cutoff, so I don't know. But based on the state of the war, the distance from Moscow to Ukraine, and drone range the best option might be...[TRUE]"
 [1]: <a href="https://lenz.io/c/130f1005" rel="nofollow">https://lenz.io/c/130f1005</a>
- kriro5 hours ago
 I don't see it mentioned explicitly in the methods section, but I assume you prompted each model only once for each question? Did you consider prompting n times in blank states to see if the models even agree with themselves?
 Would also be interesting to add a virtual model that is simply the majority of all models and see how much the individual models differ from the "consensus".
 Do you plan to add some sources in the related work section of baseline numbers for human expert disagreement in fact-checking tasks (I'm assuming such studies exist)?
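Both suggestions are cheap to compute once the labels exist. A small Python sketch, assuming you already have the repeated labels per model and the per-claim panel labels as plain lists (the helper names and sample values below are hypothetical, and no API calls are shown):

```python
from collections import Counter


def majority_label(labels):
    """The 'virtual majority model': the label held by more than half of the
    panel, or None when no strict majority exists."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count * 2 > len(labels) else None


def self_agreement(runs):
    """Share of repeated runs of ONE model on ONE claim that match its modal answer."""
    modal_count = Counter(runs).most_common(1)[0][1]
    return modal_count / len(runs)


# Made-up labels for illustration only.
panel = ["True", "True", "False", "True", "Misleading"]
consensus = majority_label(panel)
dissenters = [i for i, label in enumerate(panel) if label != consensus]
print(consensus, dissenters)

# Four fresh-context runs of the same model on the same claim.
print(self_agreement(["True", "True", "Mostly True", "True"]))
```

If a model's self-agreement across fresh runs is already well below 100%, some of the measured cross-model disagreement is sampling noise rather than a difference between models.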
 - kostaj4 hours ago
 Indeed. I prompted each model once, plus one retry on errors. Very good point to measure the intra-model disagreement! Will add in the next version.
 Section "4.2 Agreement w/ peer majority" shows the level of agreement of each model with the majority.
 Yes, planning on human-labelling the same corpus of 1,000 claims and publishing a second study measuring the models' performance against the human labels on a corpus that the models have not seen during training.
- airstrike5 hours ago
 Nice work. Sonar who?
 - simonw5 hours ago
 It's one of Perplexity's search-tools-using models. <a href="https://docs.perplexity.ai/docs/agent-api/models" rel="nofollow">https://docs.perplexity.ai/docs/agent-api/models</a>
 - kostaj5 hours ago
 sonar-pro for the retrieval capabilities
- jiggawatts5 hours ago
 Many of the rows in that spreadsheet reference "current events", which models aren't expected to do much better at than a human making an educated guess! They all have cutoff dates either last year or early this year and know nothing about what happened in "April 2026".
 This is doubly problematic because you evaluated earlier models like Gemini 3 Pro instead of 3.1, GPT 5.4 instead of 5.5, etc...
 Given that it's only a thousand short questions, you should be able to re-run your test in about an hour with the latest models, so... why haven't you?
 Similarly, LLM output is non-deterministic, so you could get more interesting stats from your data set by repeating each question 'n' times for each model.
 - kostaj5 hours ago
 Two of the models used have retrieval capabilities and have access to newer information through search. The other three are parametric.
 - simonw5 hours ago
 Comparing models with search tools to models without - when there's no option for "I am unable to answer this question without access to search" - doesn't make sense to me.
 - kostaj4 hours ago
 Agree about comparing models with and without search capabilities. Even the two models with search capabilities (Sonar Pro and Gemini) agree only on 58% of the claims.
 - furyofantares5 hours ago
 Yes, so in that case you set them up to disagree and then measured disagreement.
 - throw3108225 hours ago
 The title mentions "fact-checks", but "fact checking" is a process in which facts are checked against sources, not one where you are given a random fact and have to tell if it's true or false from your own memory. That's what is normally called a quiz game. So a more honest title for this research would be "Models answer differently to quiz questions".
- johnbarron4 hours ago
 Thanks for posting here. Keep expanding and improving your study. Correct where it deserves correction.
 The fact that HN decided to downvote the author of the study shows how these people can't stay classy, and the mods stay silent... just shows what this is all about.
bobosmrad5 hours ago
looking at the claims i would say 5 humans would disagree even more than the llms
some of the claims where llms disagree:
"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia."
"The slogan "Simon Go Back" was chanted in opposition to the Simon Commission in British India (1928–1930)."
"Neptune Deep will start delivering natural gas in 2027."
"A hotel villa in Kyrgyzstan displayed a sign stating 'no Jews, no dogs'."
"Donald Trump said that an attack on Iran was postponed at the request of Gulf allies."
- simonw5 hours ago
 If you are an LLM with a knowledge cutoff in the past and no access to a search tool the only correct answer to "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia" is "this claim is impossible for me to verify". And that wasn't an option.
- pjc505 hours ago
 > "Neptune Deep will start delivering natural gas in 2027."
 This is a "forward-looking statement", and presents special problems because you cannot really evaluate it until that date. You can only assign "likely or unlikely".
- ecshafer5 hours ago
 These "Facts" are interesting. "Neptune Deep will start delivering natural gas in 2027." for example is not a fact, it's a prediction. "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia." is less of a fact and more of a litmus test for which sources of information you trust.
 - EB-BarringtonII3 hours ago
 So, rephrase it thus:
 "Russia, Ukraine, and multiple international news agencies reported that Ukrainian drones targeted Moscow on or around May 18, 2026."
 There are rarely pure first-order "facts" in the mathematical sense. There are evidence-backed claims with confidence levels. That does not make it "just a litmus test". It makes it a probabilistic factual claim with varying confidence levels, and this one happens to be verified and unambiguous.
 - kostaj5 hours ago
 Indeed. Real-world claims are somewhat messy. Some of the standard benchmarks, e.g. the questions in AVeriTeC, share similar characteristics.
pseudopolous2 hours ago
"Five DC niggaz disagree on whom to rob"
bayarearefugee5 hours ago
(Brought to you by) Lenz...? a crummy commercial...?...son of a bitch
- kostaj5 hours ago
 :) No Lenz data is included in the research on purpose. All information to replicate the results, including the claims data, is published.
nailer2 hours ago
> the most recent real-world user submissions to a fact-checking platform
'Fact checking' platforms aren't truth. Many 'fact checking' platforms are self-admittedly focused on left advocacy (snopes), or right wing advocacy (newsbusters). lenz-llm-disagreement.csv doesn't state the data source.
Razengan5 hours ago
Recently, in May 2026, I asked ChatGPT 5.5 High to search for flights to a certain city that has had a new airport since like December 2025.
It said the airport code didn't exist.
I mean, I get the "knowledge cut off date" and whatnot, but for that sort of thing, you'd think they'd check live information before gaslighting the user, especially since it's a "live" task anyway.
wg04 hours ago
Take my job please.
ipunchghosts5 hours ago
I think ppl only care about how Claude or codex does.
- kostaj4 hours ago
  GPT-5.4 and Opus 4.7, specifically, agree between themselves on 65% of the claims - 95% CI 62–68%. I.e., in at least 35% of the claims, one of the two models is wrong under this 4-bucket rubric.
  - TaupeRanger4 hours ago
    but that's without internet search - everyone I know uses the models that search when they need to, and I'm sure GPT and Opus would agree on almost everything if 1) they searched when necessary, and 2) they were allowed to give context to their answers instead of being hamstrung to get specious "research" results.
- spprashant5 hours ago
  Looks like they land at the average number of 67% disagreement.
- airstrike5 hours ago
  I agree but the market is pricing way beyond that
rastrojero20005 hours ago
Given that models are fundamentally incapable of comprehending what truths or falsehoods are beyond their location in their self made representational space, it's actually pretty impressive that they managed to make it not a cointoss. That 17% right there is thousands of man-hours poured over making the word vomiting process slightly closer to whatever their little ports say is happening in reality.