10 comments

  • cjbarber 1 hour ago
    It could be interesting to measure intelligence per second, i.e. intelligence per token, times tokens per second.

    My current feeling is that if Sonnet 4.6 were 5x faster than Opus 4.6, I'd primarily be using Sonnet 4.6. That wasn't true for me with prior model generations; in those generations the Sonnet-class models didn't feel good enough compared to the Opus-class models. And it might shift again when I'm doing things that feel more intelligence-bottlenecked.

    But fast responses have an advantage of their own: faster iteration. Kind of like how I used to like OpenAI Deep Research, but then switched to o3-thinking with web search enabled once that came out, because it gave 80% of the thoroughness in 20% of the time, which tended to be better overall.
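    A toy version of that metric (Python; all scores and speeds below are made-up placeholders, with a benchmark score standing in for "intelligence per token"):

```python
# Toy "useful intelligence per second" comparison. The scores and decode
# speeds are illustrative placeholders, not real measurements of any model.
models = {
    "big-model":  {"score": 0.80, "tok_per_s": 40},   # smarter, slower
    "fast-model": {"score": 0.70, "tok_per_s": 200},  # dumber, faster
}

def intelligence_per_second(m: dict) -> float:
    # intelligence/token (proxied by a benchmark score) times tokens/second
    return m["score"] * m["tok_per_s"]

for name, m in sorted(models.items(),
                      key=lambda kv: -intelligence_per_second(kv[1])):
    print(f"{name}: {intelligence_per_second(m):.1f}")
```

    On these made-up numbers the fast model wins by a wide margin; the real question is how sharply the score term falls off on intelligence-bottlenecked tasks.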
    • josephg 10 minutes ago
      Yeah, I agree with this. We might be able to benchmark it soon (if we can't already) by asking different agentic coding models to produce some relatively simple pieces of software. Fast models can iterate faster; big models will write better code on the first attempt and need fewer debugging loops. Who will win?

      At the moment I'm loving Opus 4.6, but I have no idea if its extra intelligence makes it worth using over Sonnet. Some data would be great!
    • nubg 48 minutes ago
      Interesting perspective. Perhaps the user would also adapt their queries, knowing the model can only take small (but very fast) steps. I wonder who would win!
  • nylonstrung 44 minutes ago
    I'm not sold on diffusion models.

    Other labs like Google have them, but they have simply trailed the Pareto frontier for the vast majority of use cases.

    Here's more detail on how price/performance stacks up: https://artificialanalysis.ai/models/mercury-2
    • volodia 7 minutes ago
      I’d push back a bit on the Pareto point.

      On speed/quality, diffusion has actually moved the frontier. At comparable quality levels, Mercury is >5x faster than similar AR models (including the ones referenced on the AA page). So for a fixed quality target, you can get meaningfully higher throughput.

      That said, I agree diffusion models today don’t yet match the very largest AR systems (Opus, Gemini Pro, etc.) on absolute intelligence. That’s not surprising: we’re starting from smaller models and gradually scaling up. The roadmap is to scale intelligence while preserving the large inference-time advantage.
  • volodia 13 minutes ago
    Co-founder / Chief Scientist at Inception here. If helpful, I’m happy to answer technical questions about Mercury 2 or diffusion LMs more broadly.
    • kristianp 2 minutes ago
      How big is Mercury 2? How many tokens was it trained on?

      Is its agentic accuracy good enough to operate, say, coding agents without needing a larger model for more difficult tasks?
    • CamperBob2 6 minutes ago
      Seems to work pretty well, and it's especially interesting to see answers pop up so quickly! It is easily fooled by the usual trick questions about car washes and such, but seems on par with the better open models when I ask it math/engineering questions, and is obviously much faster.
  • dvt 2 hours ago
    What excites me most about these new four-figure-tokens-per-second models is that you can essentially do multi-shot prompting (+ nudging) without the user even feeling it, potentially fixing some of the weird hallucinatory/non-deterministic behavior we sometimes end up with.
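    A minimal sketch of that invisible multi-shot loop (Python; `call_model` is a stub standing in for a real fast-model API call, with a malformed first draft simulated):

```python
# Sketch of invisible multi-shot prompting: draft, validate, nudge, retry.
# With a fast enough model, all retries finish before the user notices.
import json

def call_model(prompt: str, attempt: int) -> str:
    # Stub: a real implementation would hit an inference API here.
    # Simulates a malformed first draft, then a corrected retry.
    return "not json" if attempt == 0 else '{"answer": 42}'

def validated_json(prompt: str, max_attempts: int = 3) -> dict:
    for attempt in range(max_attempts):
        draft = call_model(prompt, attempt)
        try:
            return json.loads(draft)  # validation gate
        except json.JSONDecodeError:
            # Nudge: feed the failure back into the prompt and retry.
            prompt += f"\n(Previous output was invalid JSON: {draft!r})"
    raise ValueError("no valid output after retries")

print(validated_json("Return a JSON object."))  # {'answer': 42}
```

    The same structure works for any checkable output (schemas, compilers, unit tests); the validator supplies the determinism the model lacks.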
  • ilaksh 1 hour ago
    It seems like the chat demo is really suffering from everything going into a queue. You can't actually tell that it's fast at all; the latency is not good.

    Assuming that's what is causing this, they might show some kind of feedback when a request actually makes it out of the queue.
    • volodia 1 minute ago
      Thank you for your patience. We are working to handle the surge in demand.
  • mhitza 53 minutes ago
    Comment retracted. My bad, missed some details.
    • selcuka 43 minutes ago
      I think your comment is a bit unfair.

      > no reasoning comparison

      Benchmarks against reasoning models: https://www.inceptionlabs.ai/blog/introducing-mercury-2

      > no demo

      https://chat.inceptionlabs.ai/

      > no info on number of parameters for the model

      This is a closed model. Do other providers publish the number of parameters for their models?

      > testimonials that don't actually read like something used in production

      Fair point.
      • volodia 2 minutes ago
        Just to clarify one point: Mercury (the original v1, non-reasoning model) is already used in production in mainstream IDEs like Zed: https://zed.dev/blog/edit-prediction-providers

        Mercury v1 focused on autocomplete and next-edit prediction. Mercury 2 extends that into reasoning and agent-style workflows, and we have editor integrations available (docs linked from the blog). I’d encourage folks to try the models!
      • mhitza 22 minutes ago
        You are right; I edited my post (twice, actually). I missed the chat the first time around (though it's hard to see it as a reasoning model when the chain of thought is hidden or not obvious; I guess this is the new normal), and I also missed the reasoning table, because the text is pretty small on mobile and I thought it was another speed benchmark.
    • pants2 33 minutes ago
      Reading such obvious LLM-isms in the announcement just makes me cringe a bit too, e.g.:

      > We optimize for speed users actually feel: responsiveness in the moments users experience — p95 latency under high concurrency, consistent turn-to-turn behavior, and stable throughput when systems get busy.
  • tl2do 1 hour ago
    Genuine question: what kinds of workloads benefit most from this speed? In my coding use, I still hit limitations even with stronger models, so I'm interested in where a much faster model changes the outcome rather than just reducing latency.
    • layoric 1 hour ago
      I think it would assist in exploring multiple solution spaces in parallel, and I can see it, with the right user in the loop plus a harness wrapping tools like compilers, static analysis, and tests, iterating very quickly on multiple solutions. An example might be "I need to optimize this SQL query," pointed at a locally running Postgres. Multiple changes could be tested and combined, with EXPLAIN plans to validate performance and a test for correct results. Then only valid solutions would be presented to the developer for review. I don't personally care about a model's 'opinion' or recommendations; using them for architectural choices is, IMO, a flawed use of a coding tool.

      It doesn't change the fact that the most important thing is verification/validation of their output, whether by tools or by the developer reviewing and making decisions. But even if you don't want that approach, diffusion models just seem a lot more efficient. I'm interested to see whether they are simply a better match for common developer tasks when paired with validation/verification systems, rather than just writing (likely wrong) code faster.
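      A minimal sketch of that candidate-and-verify loop, using stdlib sqlite3 as a stand-in for the locally running Postgres (the hard-coded candidate rewrites stand in for model proposals):

```python
# Sketch: generate candidate query rewrites, then gate them on correctness
# (same result set as the baseline) and on the plan actually using an index.
# sqlite3 stands in for Postgres; candidates stand in for model output.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(1, "a", 10.0), (2, "b", 20.0), (3, "a", 5.0)])
db.execute("CREATE INDEX idx_customer ON orders(customer)")

baseline = "SELECT total FROM orders WHERE customer = 'a'"
candidates = [
    # "rewrite" with wrong semantics: must be rejected by the results gate
    "SELECT total FROM orders WHERE customer = 'b'",
    # rewrite that forces the index: should pass both gates
    "SELECT total FROM orders INDEXED BY idx_customer WHERE customer = 'a'",
]

expected = sorted(r[0] for r in db.execute(baseline))

valid = []
for sql in candidates:
    if sorted(r[0] for r in db.execute(sql)) != expected:
        continue  # correctness gate: result set must match the baseline
    plan = db.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    if any("idx_customer" in str(row) for row in plan):
        valid.append(sql)  # performance gate: plan must use the index

print(valid)  # only the result-preserving, index-using rewrite survives
```

      With a fast model proposing dozens of rewrites in parallel, this kind of harness does the filtering and the developer only reviews the survivors.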
    • cjbarber 1 hour ago
      I've tried a few computer-use and browser-use tools and they feel relatively tok/s bottlenecked.

      And in some sense, all of my Claude Code usage feels tok/s bottlenecked. There's never really a time when I'm glad to wait for the tokens; I'd always prefer faster.
    • quotemstr 25 minutes ago
      Once you make a model fast and small enough, it starts to become practical to use LLMs for things as mundane as spell checking, touchscreen-keyboard tap disambiguation, and database query planning. If the fast, small model is multimodal, use it in a microwave to make a better DWIM auto-cook.

      Hell, want to do syntax highlighting? Just throw buffer text into an ultra-fast LLM.

      It's easy to overlook how many small day-to-day heuristic schemes can be replaced with AI. It's almost embarrassing to think about all the totally mundane uses to which we can put fast, modest intelligence.
    • irthomasthomas 1 hour ago
      Multi-model arbitration, synthesis, parallel reasoning, etc. Judging large models with small models is quite effective.
  • dhruv3006 23 minutes ago
    I am a little underwhelmed by anything diffusion at the moment; they haven't really delivered.
    • quotemstr 22 minutes ago
      What isn't these days? I've found it pointless to get upset about it.
      • dhruv3006 16 minutes ago
        We need a new architecture. I wonder what Ilya is cooking.
  • arjie 36 minutes ago
    Please pre-render your website on the server. Client-side JS means that my agent cannot read the press release, which reduces the chance I'll read it myself. Also, being on OpenRouter on day one increases the chance that someone will try it.