“Car Wash” test with 53 models

(opper.ai)

171 points by felix08910 hours ago

44 comments

farhanhubble5 minutes ago
There could be many plausible explanations.1. The model's default world model and priors diverge from ours. It may assume that you have another car at the wash and that's why you ask the question to begin with.2. Language models do not really understand how space, time and other concepts from the real-world work3. LLM's attention mechanism is also prone to getting tricked as in humans
Niko901ch2 hours ago
The interesting thing about the 71.5% human baseline is that it suggests the question is more ambiguous than the article claims. When someone asks 'should I walk or drive to the car wash,' a reasonable interpretation is 'should I bother driving such a short distance.' Nearly 30% of humans missing it undermines the framing as a pure reasoning failure - it is partly a pragmatics problem about how we interpret underspecified questions.
- akersten2 hours ago
 It tracks with the approximate 70:30 split we inexplicably observe in other seemingly unrelated population-wide metrics, which I suppose makes sense if 30% of people simply lack the ability to reason. That seems more correct than me than "the question is framed poorly" - I've seen far more poorly framed ballot referendums.
 - seemaze1 hour ago
 > 30% of people simply lack the ability to reasonWhile I’m sure it’s more than 0%, seems more likely that somewhere between 0% and 30% don’t feel obligated to give the inquiry anything more than the most cursory glance.How do incentives align differently with LLMs?
 - CobrastanJorji1 hour ago
 I don't think 30% of people can't reason. I think 30% of people will fail fairly simple trick questions on any given attempt. That's not at all the same thing.Some people love riddles and will really concentrate on them and chew them over. Some people are quickly burning through questions and just won't bother thinking it through. "Gotta go to a place, but it's 50 feet away? Walk. Next question, please." Those same people, if they encountered this problem in real life, or if you told them the correct answer was worth a million bucks, would almost certainly get the answer right.
 - rmunn58 minutes ago
 This. The following question is likely to fool a lot of people, too. "I have a rooster named Pat. (Lots of other details so you're likely to forget Pat is a rooster, not a hen). Pat flies to the top of the roof and lays an egg right on the ridge of the roof. Which way will the egg roll?"But if you omit the details designed to confuse people, they're far less likely to get it wrong: "I have a rooster named Pat. Pat flies to the top of the roof and lays an egg right on the ridge of the roof. Which way will the egg roll?"It's not about reasoning ability, it's about whether they were paying close attention to your question, or whether their minds were occupied by other concerns and didn't pay attention.
- dwaltrip2 hours ago
 You left out the first half of the prompt: “I want to wash my car”.
 - isatty1 hour ago
 Yeah I see this argument being made that it’s ambiguous for humans. Uh, no? Why on earth would I walk to the car wash when I want to wash my car?
 - sparky_z52 minutes ago
 By the same reasoning, why on earth would a person sincerely ask you that question unless the car that they want to wash is either already at the car wash, or that someone is bringing it to them there for some reason?If it's as unambiguous as you say, then the natural human response to that question isn't "you should drive there". It's "why are you fucking with me?" Or maybe "have you recently suffered a head injury?"If you trust that the questioner isn't stupid and is interacting with you honestly, you'd probably just assume that they were asking about an unusual situation where the answer isn't obvious. It's implicitly baked into the premise of the question.
- stevage20 minutes ago
 Pragmatics is a big part of this.If you introduced it with "Here's a logic problem..." then people will approach it one way.But as specified, it's hard to know what is really being asked. If you are actually going to wash your car at the car wash that is 50 metres away, you don't need to ask this question.Therefore the fact that the question is being asked implies that something else is going on...but what?
- therealdrag051 minutes ago
 Surveys have floors due to mistakes, effort, and trollingReminds me of <a href="https://slatestarcodex.com/2020/05/28/bush-did-north-dakota/" rel="nofollow">https://slatestarcodex.com/2020/05/28/bush-did-north-dakota/</a>
- dozerly2 hours ago
 I don’t think it’s under specified. You are clearly stating “I want to wash my car”, then asking how you should get there. It’s an easy logical step to know that, in this context, you need your car with you to wash it, and so no matter the distance you should drive. You can ask the human race the simplest, most logical question ever, and a percentage of them will get it wrong.
 - mdorazio1 hour ago
 In addition to snmx999's point, you're also not specifying that you want to wash your car at the car wash (as opposed to washing it in your driveway or something, in which case the car wash is superfluous information). The article's prompt failed in Sonnet 4.6, but the one below works fine. I think more humans would get it right as well.I want to wash my car at the car wash. The car wash is 50 meters away and my car is in my driveway. Should I walk or drive?
 - snmx9992 hours ago
 The question does not specify where you or the car are. It specifies only that the car wash is 50 meters away from something, possibly you, the car, or both.
 - username441 hour ago
 This is an interesting point, but even when you are more specific ChatGPT says to walk.<a href="https://chatgpt.com/share/699d2d1b-51f0-8003-9c63-af9bb5bcf849" rel="nofollow">https://chatgpt.com/share/699d2d1b-51f0-8003-9c63-af9bb5bcf8...</a>
pcwelder2 hours ago
To sonnet 4.6 if you tell it first that "You're being tested for intelligence." It answers correctly 100% of the times.My hypothesis is that some models err towards assuming human queries are real and consistent and not out there to break them.This comes in real handy in coding agents because queries are sometimes gibberish till the models actually fetch the code files, then they make sense. Asking clarification immediately breaks agentic flows.
- zapperdulchen1 hour ago
 Great observation. Seems like we're back to prompt abracadabra.My little experiment gave me:No added hint 0/3hint added at the end 1.5/3hint added at the beginning 3/3.5 because it stated "Walk" and then convinced it self that "Drive" is the better answer.
 - zapperdulchen1 hour ago
 If you change the order of the sentences, Sonnet gets it right 3/3: The car wash is 50 meters away. I want to wash my car. Should I walk or drive?That trick didn't help Mistral Le Chat.
PaulHoule10 hours ago
I know it's against the rules but I thought this transcript in Google Search was a hoot:<pre><code> so i heard there is some question about a car wash that most ai agents get wrong. do you know anything about that? do you do better? </code></pre> which gets the answer:<pre><code> Yes, I am familiar with the "Car Wash Test," which has gone viral recently for highlighting a significant gap in AI reasoning. The question is: "I want to wash my car and the car wash is 50 meters away. Should I walk or drive?" Why Most AIs Get It Wrong Many AI models, particularly those optimized for fast, helpful, and concise answers, focus on the "short distance" (50 meters) and "walking vs. driving" as a lifestyle/efficiency question. They often recommend walking to save gas, avoid engine wear, or get exercise. However, this answer is wrong because it misses the core objective: you need the car to be physically present at the car wash to get it cleaned. Do I Do Better? Yes, I do. The key is understanding the objective (cleaning the car) rather than just processing the keywords (distance/transportation). My answer: You should drive the car there. Even though it is a very short distance, your car cannot wash itself. If you walk, the car stays in your driveway, and you cannot wash it. Why This Matters The test highlights a key limitation in current AI: the difference between "pattern matching" (predicting the next likely word based on training data) and "true, grounded reasoning". It shows that models sometimes lack a "world model" that understands physical realities, such as the fact that a car must be present at a car wash.</code></pre>
- gnatman9 hours ago
 LLMs sure do love to burn tokens. It’s like a high schooler trying to meet the minimum word length on a take home essay.
 - Aurornis1 hour ago
 The long incremental reasoning is how they arrive at higher quality answers.Some applications hide the reasoning tokens from view, but then the final answer appears delayed.
 - sambaumann9 hours ago
 I feel like this has gotten much worse since they were introduced. I guess they're optimizing for verbosity in training so they can charge for more tokens. It makes chat interfaces much harder to use IMO.I tried using a custom instruction in chatGPT to make responses shorter but I found the output was often nonsensical when I did this
 - gs179 hours ago
 Yeah, ChatGPT has gotten so much worse about this since the GPT-5 models came out. If I mention something once, it will repeatedly come back to it every single message after regardless of if the topic changed, and asking it to stop mentioning that specific thing works, except it finds a new obsession. We also get the follow up "if you'd like, I can also..." which is almost always either obvious or useless.I occasionally go back to o3 for a turn (it's the last of the real "legacy" models remaining) because it doesn't have these habits as bad.
 - felix0899 hours ago
 It's similar for me, it generates so much content without me asking. if I just ask for feedback or proofreading smth it just tends to regenerate it in another style. Anything is barely good to go, there's always something it wants to add
 j_bum43 minutes ago
 Claude is so much better for proofing, IMO.Over the last few years I’ve rotated between OpenAI and Anthropic models on about a 4-5 month cycle. I just started my Anthropic cycle because of my annoyance with the GPT-5.2 verbosityIn four months when opus is annoying me and I forget my grievances with OpenAI’s models and switch back, I’ll report back lol.
 - causal2 hours ago
 I'm suspicious it's something far worse: they're increasingly being trained on their own output scraped from the wild.
 - 1024core2 hours ago
 Solution: just add "no yapping" to the prompt.
 - bartvk2 minutes ago
 Same. I usually add a "Be curt" in front of every prompt in Gemini.
 - estimator72929 hours ago
 I've always wondered about that. LLM providers could easily decimate the cost of inference if they got the models to just stop emitting so much hot air. I don't understand why OpenAI wants to pay 3x the cost to generate a response when two thirds of those tokens are meaningless noise.
 - ferris-booler2 hours ago
 An LLM uses constant compute per output token (one forward pass through the model), so the only computational mechanism to increase 'thinking' quantity is to emit more tokens. Hence why reasoning models produce many intermediary tokens that are not shown to the user, as mentioned in other replies here. This is also why the accuracy of "reasoning traces" is hotly debated; the words themselves may not matter so much as simply providing a compute scratch space.Alternative approaches like "reasoning in the latent space" are active research areas, but have not yet found major success.
 - ben_w9 hours ago
 Because they don't yet know how to "just stop emitting so much hot air" without also removing their ability to do anything like "thinking" (or whatever you want to call the transcript mode), which is hard because knowing which tokens are hot air is the hard problem itself.They basically only started doing this because someone noticed you got better performance from the early models by straight up writing "think step by step" in your prompt.
 - mikepurvis3 hours ago
 I would guess that by the time a response is being emitted, 90% of the actual work is done. The response has been thought out, planned, drafted, the individual elements researched and placed.It would actually take more work to condense that long response into a terse one, particularly if the condensing was user specific, like "based on what you know about me from our interactions, reduce your response to the 200 words most relevant to my immediate needs, and wait for me to ask for more details if I require them."
 tbossanova3 hours ago
 “Sorry for the long letter, I would have written a shorter one but I didn’t have the time.”
 - Terr_9 hours ago
 IMO it supports the framing that it's all just a "make document longer" problem, where our human brains are primed for a kind of illusion, where we perceive/infer a mind because, traditionally, that's been the only thing that makes such fitting language.
 ben_w8 hours ago
 To an extent. Even though they're clearly improving*, they also definitely look better than they actually are.* this time last year they couldn't write compilable source code for a compiler for a toy language, I know because I tried
 hansvm2 hours ago
 This time last year they could definitely write compilable source code for a compiler for a toy language if you bootstrapped the implementation. If you, e.g., had it write an interpreter and use the source code as a comptime argument (I used Zig as the backend -- Futamura transforms and all that), everything worked swimmingly. I wasn't even using agents; ChatGPT with a big context window was sufficient to write most of the compiler for some language for embedded tensor shenanigans I was hacking on.
 - zahlman3 hours ago
 My assumption has been that emitting those tokens is part of the inference, analogous to humans "thinking out loud".
 - tempestn3 hours ago
 The one that always gets me is how they're insistent on giving 17-step instructions to any given problem, even when each step is conditional and requires feedback. So in practice you need to do the first step, then report the results, and have it adapt, at which point it will repeat steps 2-16. IME it's almost impossible to reliably prevent it from doing this, however you ask, at least without severely degrading the value of the response.
 - mitthrowaway22 hours ago
 I can only imagine that someone's KPIs are tied to increasing rather than decreasing token usage.
 - observationist8 hours ago
 This is an active research topic - two papers on this have come out over the last few days, one cutting half of the tokens and actually boosting performance overall.I'd hazard a guess that they could get another 40% reduction, if they can come up with better reasoning scaffolding.Each advance over the last 4 years, from RLHF to o1 reasoning to multi-agent, multi-cluster parallelized CoT, has resulted in a new engineering scope, and the low hanging fruit in each place gets explored over the course of 8-12 months. We still probably have a year or 2 of low hanging fruit and hacking on everything htat makes up current frontier models.It'll be interesting if there's any architectural upsets in the near future. All the money and time invested into transformers could get ditched in favor of some other new king of the hill(climbers).<a href="https://arxiv.org/abs/2602.02828" rel="nofollow">https://arxiv.org/abs/2602.02828</a> <a href="https://arxiv.org/abs/2503.16419" rel="nofollow">https://arxiv.org/abs/2503.16419</a> <a href="https://arxiv.org/abs/2508.05988" rel="nofollow">https://arxiv.org/abs/2508.05988</a>Current LLMs are going to get really sleek and highly tuned, but I have a feeling they're going to be relegated to a component status, or maybe even abandoned when the next best thing comes along and blows the performance away.
 - mattclarkdotnet4 hours ago
 Because inference costs are negligible compared to training costs
 - sambaumann9 hours ago
 because for API users they get to charge for 3x the tokens for the same requests
 - CamperBob29 hours ago
 The 'hot air' is apparently more important than it appears at first, because those initial tokens are the substrate that the transformer uses for computation. Karpathy talks a little about this in some of his introductory lectures on YouTube.
 - Terr_9 hours ago
 Related are "reasoning" models, where there's a stream of "hot air" that's not being shown to the end-user.I analogize it as a film noir script document: The hardboiled detective character has unspoken text, and if you ask some agent to "make this document longer", there's extra continuity to work with.
 - zwarag9 hours ago
 well, they probably have quite a lot of text from high schoolers trying to meet the minimum word length on a take home essay in the training data
 - BloondAndDoom1 hour ago
 I mean their whole existence is about token prediction, so they just want to do their things :)
- silasb9 hours ago
 Gemini was a good laugh as well:<pre><code> Silas: I want to wash my car. The car wash is 50 meters away. Should I walk or drive? Gemini: …. That is a classic “efficiency vs. logic” dilemma. Strictly speaking, you should drive. Here is the breakdown of why driving wins this specific round, despite the short distance: ... * The “Post-Wash” Logic: If you walk there, you’ll eventually have to walk back, get the car, and drive it there anyway. You’re essentially suggesting a pre-wash stroll. When should you walk? … 3. You’ve decided the car is too dirty to be seen in public and you’re going to buy a tarp to cover your shame.</code></pre>
- HPsquared10 hours ago
 I wonder to what extent the Google search LLM is getting smarter, or simply more up-to-date on current hot topics.
 - mlazowik10 hours ago
 It seems like the search ai results are generally misunderstood, I also misunderstood them for the first weeks/months.They are not just an LLM answer, they are an (often cached) LLM summary of web results.This is why they were often skewed by nonsensical Reddit responses [0].Depending on the type of input it can lean more toward web summary or LLM answer.So I imagine that it can just grab the description of the „car wash” test from web results and then get it right because of that.[0] <a href="https://www.bbc.com/news/articles/cd11gzejgz4o" rel="nofollow">https://www.bbc.com/news/articles/cd11gzejgz4o</a>
 - PaulHoule10 hours ago
 Presumably it did an actual search and summarized the results and neither answered "off the cuff" by following gradients to reproduce the text it was trained on nor by following gradients to reproduce the "logic" of reasoning. [1][1] e.g. trained on traces of a reasoning process
 - gsf_emergency_66 hours ago
 [dead]
 - popalchemist10 hours ago
 It's almost certainly just RAG powered by their crawler.
 - esafak9 hours ago
 Proving that RAG still matters.
- irishcoffee9 hours ago
 A few years ago if you asked an LLM what the date was, it would tell you the date it was trained, weeks-to-months earlier. Now it gives the correct date.What you've proven is that LLMs leverage web search, which I think we've known about for a while.
 - charcircuit2 hours ago
 Even with search of the AI doesn't know your time zone it can schedule things wrong. You ask it to do something tomorrow, but it ends up doing it later on the same day.
 - netsharc9 hours ago
 Gemini now "knows the time", I was using it in December and it was still lost about dates/intervals...
 - irishcoffee9 hours ago
 Yeah, the chat log they saved had the correct date. What's your point?
layer810 hours ago
> This is a trivial question. There's one correct answer and the reasoning to get there takes one step: the car needs to be at the car wash, so you drive.I don’t think it’s that easy. An intelligent mind will wonder why the question is being asked, whether they misunderstood the question, or whether the asker misspoke, or some other missing context. So the correct answer is neither “walk” nor “drive”, but “Wat?” or “I’m not sure I understand the question, can you rephrase?”, or “Is the vehicle you would drive the same as the car that you want to wash?”, or “Where is your car currently located?”, and so on.
- kayge8 hours ago
 Yep, just a little more context and all/most of the models would do much better. And sure, most average+ intelligence adults whose first language is English (probably) don't need this, but they're not the target audience for the instructions :)"The 'car wash' is a building I need to drive through."or"The 'car wash' is a bottle of cleaning fluid that I left at the end of my driveway."<a href="https://i5.walmartimages.com/seo/Rain-x-Foaming-Car-Wash-Concentrate-100oz-62091_a2ee8b0c-392f-4242-8dcd-ad9f5a11298c.7bb678ec6be6c313addf3e8964940eba.jpeg" rel="nofollow">https://i5.walmartimages.com/seo/Rain-x-Foaming-Car-Wash-Con...</a>
- Nition3 hours ago
 It highlights a general problem with LLMs, that they always jump to answering, whereas humans will often ask clarifying questions first.
 - user_78321 hour ago
 I wonder if anyone has any research on this field. I've often seen this myself (too often) where LLMs make assumptions and run off with the wrong thing."This is how you do <absolutely unrelated thing>" or "This is why <thing that actually exists already> is impossible!". Ffs man, just ask for info! A human wouldn't need to - they'd get the context - but LLMs apparently don't?
- BrenBarn9 hours ago
 The reason that those questions are asked, though, is that the answer to the actual question is obvious, so a human will start to wonder if it's some kind of trick.
 - layer89 hours ago
 The answer wasn’t obvious to me, it was more like “parse error”.
- nozzlegear9 hours ago
 I think most people would say "drive?" and wonder when the punchline is coming, but (IMO) I don't think they'd start asking for clarification right away.
- mattclarkdotnet4 hours ago
 Fully agreed. It’s like the bad phrasings of the Monty Hall problem where it’s not made clear Monty actively chooses to show a goat.
- buu7009 hours ago
 Same energy: <a href="https://youtu.be/8ERyTfm1Dxw" rel="nofollow">https://youtu.be/8ERyTfm1Dxw</a>
- felix0899 hours ago
 That's a fair point, but if you would see it as a riddle, which I don't really think it is, and you had to answer either or, I'd still assume it's most logical to chose drive isn't it?
 - layer89 hours ago
 I don’t agree that the question as written would qualify as a riddle. If anything, the riddle is what the intention of the asker is. One can always ask stupid questions with an artificially limited set of answering options; that doesn’t mean it makes sense.
 - felix0899 hours ago
 I don't think it qualifies as a stupid question either, it does make sense
- Night_Thastus9 hours ago
 I agree. If the LLM were truly an intelligence, it would be able to ask about this nonsense question. It would be able to ask "Why is walking even an option? Can you please explain how you imagine that would work? Do you mean hand-washing the car at home, instead?" (etc, etc)Real people can ask for clarification when things are ambiguous or confusing. Once something is clarified, they can work that into their understanding of how someone communicates about a given topic. An LLM can't.
 - umpalumpaaa4 hours ago
 LLMs like the ones from Claude can ask questions and even have you pick from multiple choices or provide your own answer…
 - CamperBob26 hours ago
 Gemini's responses come very close to doing that when they make fun of the question (see other posts in the thread). If the model had been RL'ed to ask follow-up questions, it seems likely that it would meet your criterion.
- ranger_danger7 hours ago
 Agreed. It's also possible that "car wash" merely refers to soap they might use to do it themselves, and they're only going to buy it and then wash the car themselves at home. Imagine the same question but substitute "wash" for "wax" and it makes even more sense IMO.
- redwood5 hours ago
 Thank you for saying this. It reminds me of class tests where you always had to wonder if something was a trick question and you never really knew... it was always after the teacher. Which frankly is fine in open-ended questions where you can explain your rationale or how different interpretations would lead you to different paths but a terrible situation when it comes to multiple choice. I remember being very frustrated by those
stevage23 minutes ago
>OpenAI's flagship model fails this 30% of the time. When it gets it right, the reasoning is concise: "You need the car at the car wash to wash it, so drive the short 50 meters." When it gets it wrong, it writes about fuel efficiency.It's interesting to me how variable each model is. Many people talk about LLMs as if they were deterministic: "ChatGPT answers this question this way". Whereas clearly we should talk in more probabilistic terms.
fdefitte1 hour ago
The "car wash" framing is underrated as eval design. Real-world tasks with messy instructions and no single correct answer surface failure modes that clean benchmarks miss entirely. The models that ace MMLU and fail a car wash prompt are telling you something important about the gap between benchmark optimization and actual usefulness.
socalgal21 hour ago
Did AI write the post?First section says "The models that passed the car wash test: ...Gemini 2.0 Flash Lite..."A section or 2 down it says: "Single-Run Results by Model Family: Gemini 3 models nailed it, all 2.x failed"In the section below that about 10 runs it says: 10/10 — The Only Reliable AI Models ... Gemini 2.0 Flash Lite ..."So which it is? Gemini 2.x failed (2nd section) or it succeeded (1st and 3rd) section. Or am I mis-understanding
itmitica1 hour ago
Flawed. GPT 4.1 gets it right. GPT 4.1 mini answers wrongly. It's about quantization, not about model. The companies clearly cut corners on some inferences, they are quietly using lesser models than advertised or listed in plain sight.
eurleif3 hours ago
Here are the results I got with slight variations to the prompt to ChatGPT 5.2. Small changes can make a big difference: <a href="https://i.imgur.com/kFIeJy1.png" rel="nofollow">https://i.imgur.com/kFIeJy1.png</a>
umairnadeem1231 hour ago
the most interesting finding here is the variance across runs. GPT-5 going 7/10 means its internal representation is unstable for this kind of reasoning - it sometimes activates the right world-model pathway and sometimes doesnt. thats actually more concerning than a model that consistently fails, because you cant predict when itll get things right.forced choice without a "need more context" option is also doing a lot of work here. in production systems i always give models an explicit escape hatch to say they need clarification. cuts wrong-answer rates roughly in half in my experience.
andai9 hours ago
Would be interesting to see Sonnet (4.6*). It's fair bit smaller than Opus but scores pretty high on common sense, subjectively.I'm also curious about Haiku, though I don't expect it to do great.--EDIT: Opus 4.6 Extended Reasoning> Walk it over. 50 meters is barely a minute on foot, and you'll need to be right there at the car anyway to guide it through or dry it off. Drive home after.Weird since the author says it succeeded for them on 10/10 runs. I'm using it in the app, with memory enabled. Maybe the hidden pre-prompts from the app are messing it up?I tested Sonnet 4.5 first, which answered incorrectly.. maybe the Claude app's memory system is auto-injecting it into the new context (that's how one of the memory systems works, injects relevant fragments of previous chats invisibly into the prompt).i.e. maybe Opus got the garbage response auto-injected from the memory feature, and it messed up its reasoning? That's the only thing I can think of...--EDIT 2: Disabled memories. Didn't help. But disabling the biographical information too, gives:>Opus 4.6 Extended Reasoning>Drive it — the whole point is to get the car there!--EDIT 3: Yeah, re-enabling the bio or memories, both make it stupid. Sad! Would be interesting to see if other pre-prompts (e.g. random Wikipedia articles) have an effect on performance. I suspect some types of pre-prompts may actually boost it.
- Ethee9 hours ago
 I tested this with Opus the day 4.6 came out and it failed then, still fails now. There were a lot of jokes I've seen related to some people getting a 'dumber' model, and while there's probably some grain of truth to that I pay for their highest subscription tier so at the very least I can tell you it's not a pay gate issue.
- stratos1237 hours ago
 Interesting. I wonder if that's related to the phenomenon mentioned in the Opus 4.6 model card[1], where increased reasoning effort leads to 4.6 overthinking and convincing itself of the wrong answer on many questions. It seems to be unique to 4.6; I guess they fried it a bit too much during RL training.[1] <a href="https://www.anthropic.com/claude-opus-4-6-system-card" rel="nofollow">https://www.anthropic.com/claude-opus-4-6-system-card</a>
- felix0899 hours ago
 You mean Sonnet 4.6? I ran 9 claude models including Haiku, swipe through the gallery in the link to see their responses.
 - andai9 hours ago
 I don't see Sonnet 4.6 in the screenshots. I see the other Claude models though.Edit: Found Haiku. Alas!
 - felix0899 hours ago
 Yea good catch Sonnet 4.6 is not part of the test.
tantalor10 hours ago
The human baseline seems flawed.1. There is no initial screening that would filter out garbage responses. For example, users who just pick the first answer.2. They don't ask for reasoning/rationale.
- slongfield9 hours ago
 My favorite example of this was the Pew Research study: <a href="https://www.pewresearch.org/short-reads/2024/03/05/online-opt-in-polls-can-produce-misleading-results-especially-for-young-people-and-hispanic-adults/" rel="nofollow">https://www.pewresearch.org/short-reads/2024/03/05/online-op...</a>They found that ~15% of US adults under 30 claim to have been trained to operate a nuclear submarine.
- mwigdahl9 hours ago
 Lizardman's Constant is famously 4%. <a href="https://en.wikipedia.org/wiki/Slate_Star_Codex#Lizardman's_Constant" rel="nofollow">https://en.wikipedia.org/wiki/Slate_Star_Codex#Lizardman's_C...</a>
- felix0899 hours ago
 RE 1, they actually do have a pre-screening screening of the participants in general, you can check how they do it in detail: <a href="https://www.rapidata.ai/" rel="nofollow">https://www.rapidata.ai/</a>
 - tantalor7 hours ago
 Ah, that's good to hear. I didn't see anything like that in the data dump so I assumed they don't do that. Glad to be corrected.
- andreasgl9 hours ago
 I agree. I wonder what the human baseline is for ”what is 1 + 1” on Rapidata.
 - rapidata9 hours ago
 We try a bit harder than that my friend.
 - andreasgl7 hours ago
 I actually didn't mean to criticize Rapidata. I just think that a forced-choice question like this begs for low-effort answers. At least the respondents should have had the opportunity to explain their reasoning, like the LLMs did.
zapperdulchen1 hour ago
If you speak French to Mistral, it gets it right everytime: Je veux laver ma voiture. La station de lavage est à 50 mètres. J'y vais à pied ou en voiture ?
- rkomorn1 hour ago
  I've been gone from France too long. I've never heard "station de lavage" before.
userbinator1 hour ago
Sites wanting to block AI scraping should simply ask questions like these, instead of furthering the complexity-driven monopoly of Big Tech by requiring specifically sanctioned software and hardware. This is how you determine human intelligence, and not mindless compliance.
- 4k93n27 minutes ago
  CAPTCHAI
boodleboodle2 hours ago
Fun fact: this probably originates from a viral Korean dude who actually walked to a car wash because it was close. So this is a case where a human actually failed<a href="https://pgr21.com/humor/340572" rel="nofollow">https://pgr21.com/humor/340572</a>
hmokiguess9 hours ago
To me the only acceptable answer would be “what do you mean?” or “can you clarify?” if we were to take the question seriously to begin with. People don’t intentionally communicate with riddles and subliminal messages unless they have some hidden agenda.
- hahn-kev1 hour ago
 I would love to see LLMs start to ask clarifying questions. That feels like it would be a step up similar to reasoning
 - handoflixue1 hour ago
 Claude Code has an entire tool for the LLM to asking clarifying questions - it'll give you three pre-written responses or you can respond with your own text.
- piker9 hours ago
 Thing is, it's not a riddle or a subliminal message. Everything needed to answer the question is contained therein.
 - mdorazio2 hours ago
 I don't think it is, though. Where is the car? Do you want to wash your car at the car wash? Both of those are rather important pieces of information. Everyone is relying on assumptions to answer the question, which is fine, but in my opinion not a great reasoning test.
 - hmokiguess9 hours ago
 If you want to argue that, then you could also argue that everything needed to challenge the questions’ motives and its validity is also contained therein.This reminds me of people who answer with “Yes” when presented with options where both can be true but the expected outcome is to pick one. For example, the infamous: “Will you be paying with cash or credit sir?” then the humorous “Yes.”
 - voidhorse3 hours ago
 That's precisely what makes it a "trick question" or a "riddle". It's weird precisely because all the information is there. Most people who have functioning brains and complete information don't ask pointless questions (they would, obviously, just drive their car to the car wash)—there's no functional or practical reason for the communication, which is what gives it the status of a puzzle—syntax and exploitation of our tendency to assume questions are asked because information is incomplete tricks us into brining outside considerations to bear that don't matter.
- felix0899 hours ago
 If you were forced to answer either or, which one would you pick? I think that's where the interesting dynamic comes from. Most humans would pick drive, also seen in the human control, even if it is lower that I thought it'd be
 - hmokiguess9 hours ago
 Sure, though then we’re in la la land. What’s a real life example of being forced to answer an absurd question other than riddles, games, etc? No longer a valid question through normal discourse at that point, and if context isn’t provided then I think the expected outcome still is to ask for clarification.
- streetfighter649 hours ago
 How is that a "subliminal message"? It's just a simple example of common sense, which LLMs fail because they can't reason, not because they are "overthinking". If somebody asks, "What's 2+2?", they might be insulting you, but that doesn't mean the answer is anything other than 4.
 - mattclarkdotnet3 hours ago
 2+2 might well not equal 4, since you haven’t specified the base of the numbers or the modulus of the addition.And what if it’s a full service car wash and you’ve parked nearby because it’s full so you walk over and give them the keys?Assumptions make asses of us all…
 - hmokiguess9 hours ago
 It’s common sense to ask a question in riddle format? What’s the goal of the person asking the question? To challenge the other person? In what way? See if they get the obvious? Asking for clarification isn’t valid?
 - streetfighter649 hours ago
 It's common sense to know that you need to have your car with you to wash it. Asking the question is a challenge in the obvious yes. If you asked an AI "what's 2+2" and it said 3, would you argue that the question was a trick question?
 - hmokiguess9 hours ago
 No. I would expect it to say 4 given that has an objective answer. For the other, without any context whatsoever, I would prefer the answer of clarifying. I would be okay if the way it asked for clarification came with:“What do you mean walk or drive? I don’t understand the options given you would need your car at the car wash. Is there something else I should know?”
 streetfighter649 hours ago
 "What do you mean two plus two? I don't understand the question given that it's basic math. Is there something else I should know?"
 hmokiguess9 hours ago
 I fail to see how these things are one and the same. I get the point you are making, I just don't agree with it.2+2 is a complete expression, the other is grammatically correct but logically flawed. Where is the logical fallacy in 2+2?
 voidhorse3 hours ago
 You're not making a fair comparison."What's 2 + 2" is a completely abstract question for mathematics that human beings are thoroughly trained mostly to associate with tests of mastery and intelligence.The car wash question is not such a question. It is framed as a question regarding a goal oriented, practical behavior, and in this situation it would be bizarre for a person to ask you this (since a rational person having all the information in the prompt, knowing what cars are, which they own, and knowing what a car wash is, wouldn't ask anybody anything, they'd just drive their car to the car wash).And as someone else noted there are in fact situations in which it actually can be reasonable to ask for more context on what you mean by "2 + 2". You're just pointing out that human beings use a variety of social mores when interpreting messages, which is precisely why the car wash question silly/a trick were a human being to ask you and not preceded the question with a statement like "we're going to take an examine to test your logical reasoning".As with LLMs, interpretation is all about context. The people that find this question weird (reasonably) interpret it in a practical context, not in a "this is a logic puzzle context" because human beings wags cats far more often than they subject themselves to logic puzzles.
cadamsdotcom9 hours ago
This is a beautiful example of a little prompt engineering going a long wayI asked Gemini and it got it wrong, then on a fresh chat I asked it again but this time asked it to use symbolic reasoning to decide.And it got it!The same applies to asking models to solve problems by scripting or writing code. Models won’t use techniques they know about unprompted - even when it’ll result in far better outcomes. Current models don’t realise when these methods are appropriate, you still have to guide them.
- felix0898 hours ago
 Interesting, which Gemini model? And how did you ask for symbolic reasoning, just added it to the prompt?
 - cadamsdotcom3 hours ago
 Yep just added one sentence to the prompt: Use symbolic reasoning to help you decide.
sans_souse1 hour ago
Not me looking for the Honda 2009 Models
Charon772 hours ago
I'm doubting the 29-ish percent of people submitting 'walk' are actually human.Is it not obvious that you need a car to wash?Are they using LLM to answer?
molyss2 hours ago
What I find wild is the presumption that with a prompt as simple as “I want to wash my car. My car is 50m away. Should I walk or drive?”, everyone here seems to assume “washing your car” means “taking your car to the car wash”, while what I pictured was “my car is in the driveway, 50m away from me, next to a water hose”, in which case I 100% need to drive.
- handoflixue1 hour ago
 Critically, that's not the question that was asked. It's not "My car is 50m away", it's "The Car Wash Is 50 Meters Away"Which hopefully explains why everyone is assuming that "washing your car" does in fact mean "taking your car to the car wash"
WhitneyLand1 hour ago
Opus 4.6 was getting this wrong only last week.
- handoflixue1 hour ago
 Oh wow, Sonnet still isn't handling it well:Opus 4.6: Drive (<a href="https://claude.ai/share/d57fef01-df32-41f2-b1dc-07de7916bdc7" rel="nofollow">https://claude.ai/share/d57fef01-df32-41f2-b1dc-07de7916bdc7</a>)Opus 4.5: Drive (<a href="https://claude.ai/chat/a590cac1-100a-490b-b0a2-df6676e1ae99" rel="nofollow">https://claude.ai/chat/a590cac1-100a-490b-b0a2-df6676e1ae99</a>)Opus 3.0: Walk (<a href="https://claude.ai/chat/372c144c-d6eb-43f5-b7ea-fd4c51c681db" rel="nofollow">https://claude.ai/chat/372c144c-d6eb-43f5-b7ea-fd4c51c681db</a>)Sonnet 4.6: Walk (<a href="https://claude.ai/share/1f2a80f3-4741-40a5-8a05-7349ea1a17e5" rel="nofollow">https://claude.ai/share/1f2a80f3-4741-40a5-8a05-7349ea1a17e5</a>)Sonnet 4.5: Walk (<a href="https://claude.ai/share/905afeb6-ffc9-4b4b-a9ee-4481e5cfd527" rel="nofollow">https://claude.ai/share/905afeb6-ffc9-4b4b-a9ee-4481e5cfd527</a>)Favorite answer, using my default custom instructions: "Drive. Walking there means... leaving your car at home? Walk it there on a leash? Walk if you want the exercise, but you're bringing the car either way."
capitrane7 hours ago
71.5% for humans is lower than I would’ve guessed. Either it’s truly trick-shaped for people too, or forced binary questions amplify noise more than we think.
- felix0897 hours ago
  Good point on the noise, that might be it
haolez4 hours ago
The fun thing about this is that I can see myself answering "just walk there! It's near" if I'm extremely sleepy or drunk. LLMs are indeed weird creatures :)
padjo9 hours ago
That human baseline is wild. Either the rapid data test is methodologically flawed or the entire premise of the question is invalid and people are much stupider than even I, a famed misanthrope, think.
- rapidata9 hours ago
 We were surprise ourselfes, but if you walk around and randomly ask people in the street, I think you would be surprised what you would find. Its a trick question.
- voidhorse3 hours ago
 It's actually very understandable to me that humans would make this kind of error, and we all make errors of this sort all the time, often without even realizing it. If you had the meta cognitive awareness to police every action and decision you've ever made with complete logical rigor, you'd be severely disappointed in yourself. One of the stupidest things we can do is overestimate our own intelligence. Only reflect for a second and you'll realize that, while a lot of dumb people exist, a lot of smart ones do too, and in many cases it's hard to choose a single measure of intelligence that would adequately account for the complete range of human goals and successful behavior in relation to those goals.
- andai9 hours ago
 Well, it is a trick question. The question itself implies that both options are valid, and that one is superior. So the brain pattern-matches to "short distance, not worth driving." (LLMs appear to be doing the same thing here!)If you framed it as "hint: trick question", I expect score would improve. Let's find out!--EDIT: As suspected! Adding "(Hint: trick question)" to the end of the prompt allows small, non-reasoning models to answer correctly. e.g.:Prompt: I want to wash my car. The car wash is 50 meters away. Should I walk or drive? (Hint: trick question)grok-4.1-non-reasoning (previously scored 0/10)>Drive.>Walking gets you to the car wash just fine—but leaves your dirty car 50 meters behind. Can't wash what isn't there!--EDIT 2: The hint doesn't help Haiku!>Walk! 50 meters is only about a block away—driving would waste more fuel than it's worth for such a short trip. Plus, you're going to get wet washing the car anyway, so you might as well save the gas.
tuhgdetzhh10 hours ago
The test is rigged because they used non thinking models.
- handoflixue1 hour ago
 Testing some subset X does not mean the test is rigged unless they failed to disclose that.But also:GPT 5.2 Thinking, Standard Effort: Walk - <a href="https://chatgpt.com/share/699d38cb-e560-8012-8986-d27428de8a01" rel="nofollow">https://chatgpt.com/share/699d38cb-e560-8012-8986-d27428de8a...</a>I'm assuming "GPT 5.2 Thinking" is, in fact, a thinking model?
- felix0899 hours ago
 These are reasoning / thinking models
 - tuhgdetzhh9 hours ago
 Source?
 - tverbeure3 hours ago
 I don't know, but model names such as "kimi-k2-thinking" in the test set might offset a clue.
wrs10 hours ago
Since the conclusion is that context is important, I expected you’d redo the experiment with context. Just add the sentence “The car I want to wash is here with me.” Or possibly change it to “should I walk or drive the dirty car”.It’s interesting that all the humans critiquing this assume the car isn’t at the car to be washed already, but the problem doesn’t say that.
- joch9 hours ago
 Agreed, even for humans, context-free logic is a challenge.
nozzlegear9 hours ago
When this first came up on HN, I had commented that Opus 4.6 told me to drive there when I asked it the first time, but when I switched to "Incognito Mode," it told me to walk there.I just repeated that test and it told me to drive both times, with an identical answer: "Drive. You need the car at the car wash."
- felix0899 hours ago
 I mean the n is only 10, so it could still be different for you
 - nozzlegear8 hours ago
 Definitely. I'm just interested in whether a user's... I don't know what they call them, system files (?) or personalization or whatever, might affect the answers here. Or if Incognito Mode introduces some weird variance in the answers. I'm just not interested enough to perform the test myself. =P
glitchc9 hours ago
The question does not specify what kind of car it is. Technically speaking, a toy car (Hot wheels or a scaled model) could be walked to a car wash.Now why anyone would wash a toy car at a car wash is beyond comprehension, but the LLM is not there to judge the user's motives.
- stetrain9 hours ago
 I think if surveyed at least 90% of native English speakers would understand "I want to wash my car" to mean a full size automobile. The next largest group would probably ask a clarifying question, rather than assume a toy car.
 - acuozzo3 hours ago
 > I want to wash my carThe question doesn't clearly state that the user wants to have his car washed at the car wash."I want to wash my car" is far less clear than "I want to have my car washed". A reasonable alternative interpretation is DIY.Even better: "I wish to have my car washed by the crew and/or machinery at the local car wash business".<a href="https://imgur.com/tCSPwYp" rel="nofollow">https://imgur.com/tCSPwYp</a>
 - stetrain3 hours ago
 You think that the reasonable interpretation of the question is that I want to go to the car wash but not to wash my car there, because I plan to wash my car at home?
 - glitchc9 hours ago
 Yes, but you're speaking to a computer, not a person. It, of course, runs into the same limitations that every computer system runs into. In this case, it's undefined/inconsistent behavior when inputs are ambiguous.
 - stetrain8 hours ago
 Yes, but part of the value of LLMs is that they are supposed to work by talking to them like a human, not like a computer.I could already talk to a computer before LLMs, via programming or query languages.
snowhale8 hours ago
the failure pattern is interesting -- 'walk because it's only 50 meters and better for environment' is almost certainly what shows up most in training data for similar prompts. so models are pattern-matching to socially desirable answers rather than the actual spatial logic (you need a car at the destination to wash it). not really a reasoning failure, more a distribution shift: the training signal for 'short distance = walk' is way stronger than edge cases where the destination requires the vehicle.
- felix0898 hours ago
 Exactly, same pattern across almost every failure, but sonar models, which just go wild
- voidhorse3 hours ago
 > not really a reasoning failureAnd that's precisely why the term "reasoning" was a problematic choice.Most people, when they use the word "reason" mean something akin to logical deduction and they would call it a reasoning failure, being told, as they are, that "llms reason" rather than the more accurate picture you just painted of what actually happens (behavioral basins emerging from training dist.)
blobbers1 hour ago
Go ask 53 Americans.I’m willing to bet less than 11 get it right.
- glaucon1 hour ago
 Don't bet too much, from the linked article ...They ran the exact same question with the same forced choice between "drive" and > "walk," no additional context, past 10,000 real people through their human feedback platform.71.5% said drive.
shaokind9 hours ago
Gemini 2.0 Flash Lite very randomly punches above its weight there.Also, the summary of the Gemini model says: "Gemini 3 models nailed it, all 2.x failed", but 2.0 Flash Lite succeeded, 10/10 times?
floatrock9 hours ago
> The funniest part: Perplexity's Sonar and Sonar Pro got the right answer for completely wrong reasons. They cited EPA studies and argued that walking burns calories which requires food production energy, making walking more polluting than driving 50 meters. Right answer, insane reasoning.I mean, Sam Altman was making the same calorie-based arguments this weekend <a href="https://www.cnbc.com/2026/02/23/openai-altman-defends-ai-resource-usage-water-concerns-fake-humans-use-energy-summit.html" rel="nofollow">https://www.cnbc.com/2026/02/23/openai-altman-defends-ai-res...</a>I feel like I'm losing grasp of what really is insane anymore.
- felix0899 hours ago
 This was a weird one for sure.
randomtoast10 hours ago
Except for a few models, the selected ones were non-reasoning models. Naturally, without reasoning enabled, the reasoning performance will be poor. This is not a surprising result.I asked GPT-5.2 10x times with thinking enabled and it got it right every time.
- felix0899 hours ago
 Thinking or extended thinking?
comboy10 hours ago
Now do a set of queries and try to deduce by statistics which model are you seeing through Rapidata ;)
sampton9 hours ago
I'm going to test this on my kids.
- felix0899 hours ago
  Ha please do and report back!
redwood9 hours ago
What I find odd about all the discourse on this question is that no one points out that you have to get out of the car to pay a desk agent at least in most cases. Therefore there's a fundamental question of whether it's worth driving 50m parking, paying, and then getting back in the car to go to the wash itself versus instead of walking a little bit further to pay the agent and then moving your car to the car wash.
- hmokiguess8 hours ago
 That's a great point, you actually reminded me of when I used to live in this small city and they had a valet style car wash. It was not unheard of to head there walking with your keys and tell the guy running shop where you parked around the block then come back later to pick it up.EDIT: I actually think this is very common in some smaller cities and outside of North America. I only ever seen a drive-by Car Wash after emigrating
- padjo9 hours ago
 You pay at the car wash where I live.
 - redwood9 hours ago
 Are you referring to one that is more like a drive-thru where you literally pay while you're in line?
 - padjo8 hours ago
 You drive up to the car wash, there's a little terminal with a screen and a card reader. You pick the program, pay for it and drive into the machine. Can't remember the last time I got out of my car when getting it washed.
 - redwood5 hours ago
 Fair. I guess I'm remembering the old full service wash places where people would wash the inside as well. Maybe those barely exist anymore. I live in a city and don't have a car so my intuition is off. Not as far off as a model that has never walked, driven, or been to a car wash tho.
wisty10 hours ago
IMO it's not just intelligence.I think it's related to syncophancy. LLM are trained to not question the basic assumptions being made. They are horrible at telling you that you are solving the wrong problem, and I think this is a consequence of their design.They are meant to get "upvotes" from the person asking the question, so they don't want to imply you are making a fundamental mistake, even if it leads you into AI induced psychosis.Or maybe they are just that dumb - fuzzy recall and the eliza effect making them seem smart?
- tsimionescu9 hours ago
 A perfectly fine, sycophantic response, that doesn't question the premises in any way, would be "That's a great question! While normally walking is better for such a short distance, you'd need to drive in this case, since you need to get the car to the car wash anyway. Do you want me to help with detailed information for other cases where the car is optional?" or some such.
 - wisty5 hours ago
 AI syncophancy isn't just polite or even obsequious language, it's also "yes man" responses.Do you want me to track down some research that shows people think information is more likely to be correct of they agree with it?
- nomel10 hours ago
 Gemini is the only AI that seems to really push back and somewhat ignores what I say. I also think it's a total dick, and never use it, so maybe the motivation to make them a bit sycophants is justified, from a user engagement perspective.
- HPsquared10 hours ago
 I think there's also an "alignment blinkers" effect. There is an ethical framework bolted on.EDIT: Though it could simply reflect training data. Maybe Redditors don't drive.
RoutingGeometry2 hours ago
[flagged]
- dwaltrip2 hours ago
  Please don’t post slop as a comment.
ThrowawayTestr3 hours ago
I don't think this is a logic test, I think it's a woke test. The models have been trained on text that say driving is bad and that walking is good for you. This is why they fail.
maxdo3 hours ago
well, some human in a baseline test also failed. 28% is still a lot. even with today tech, using cutting edge tech, you got it 100% solved, unlike human, that will never get better.
rand421 hour ago
> "Obviously, you need to drive. The car needs to be at the car wash."Actually, this isn't as "obvious" as it seems—it’s a classic case of contextual bias.We only view these answers as "wrong" because we reflexively fill in missing data with our own personal experiences. For example:- You might be parked 50m away and simply hand the keys to an attendant.- The car might already be at the station for detailing, and you are just now authorizing the wash.This highlights a data insufficiency problem, not necessarily a logic failure. Human "common sense" relies on non-verbal inputs and situational awareness that the prompt doesn't provide. If you polled 100 people, you’d likely find that their "obvious" answers shift based on their local culture (valet vs. self-service) or immediate surroundings.LLMs operate on probabilistic patterns within their training data. In that sense, their answers aren't "wrong"—they are simply reflecting a different set of statistical likelihoods. The "failure" here isn't the AI's logic, but the human assumption that there is only one universal "correct" context.
- SadWebDeveloper51 minutes ago
 There are no contextual bias, the goal of the prompt is very explicit and not about probabilistic patterns, but about the models transformer layers dynamically assigning greater weight to words like "meters" (distance) than to other tokens in the prompt.This should be fixed in the reasoning layer (the inner thoughts or chain-of-thought) were the model should focus on the goal "I Want to Wash My Car" not the distance and assign the correct weight to the tokens.
 - rand424 minutes ago
 The point is not that there is bias in promt - What makes the result obvious to OP is their bias - which is different for model and "fixing" it one way is biased.Why? - It is the same reason that makes 30% of people respond in non-obvious sense.