Mistral Medium 3.5

(mistral.ai)

247 points by meetpateltech3 hours ago

21 comments

simjnd2 hours ago
I'm not sure what people are on in the comments. It doesn't beat the other models, but it sure competes despite its size.GLM 5.1 is an excellent model, but even at Q4 you're looking at ~400GB. Kimi K2.5 is really good too, and at Q4 quantization you're looking at almost ~600GB.This model? You can run it at Q4 with 70GB of VRAM. This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).For the Claude-pilled people, I don't know if you only run Opus but when I was on the Pro plan Sonnet was already extremely capable. This beats the latest Sonnet while running locally, without anyone charging you extra for having HERMES.md in your repo, or locking you out of your account on a whim.Mistral has never been competitive at the frontier, but maybe that is not what we need from them. Having Pareto models that get you 80% of the frontier at 20% of the cost/size sounds really good to me.
- gregsadetsky1 hour ago
 I didn't know about HERMES.md ... (??) - found information here for others who are curious <a href="https://github.com/anthropics/claude-code/issues/53262" rel="nofollow">https://github.com/anthropics/claude-code/issues/53262</a>
 - gnulinux18 minutes ago
 This github thread is incredible, thanks for sharing. This link should be its own HN topic.
 - giancarlostoro55 minutes ago
 That is insane, if you billed me an extra $200 for a bug in your system I'd flat out cancel my subscription. If you're not going to credit that back to me, you don't deserve anymore of my money. I'm a Claude first guy, but if you're going to bill me incorrectly, that's on you, own it, fix it.
 - xcrjm48 minutes ago
 They did credit it back to him. There's a comment in the linked issue.
 - MarsIronPI32 minutes ago
 Where? Just searched the entire thread for both the word "refund" and the word "credit" and I'm seeing nothing about credit being issued.Also what's with @sasha-id talking to himself? Looks weird as all get out.
 argee22 minutes ago
 Looks like he copy pasted responses he got from their support agents.
 - simjnd28 minutes ago
 Where? All I see is Boris saying "we are unable to issue compensation for degraded service or technical errors that result in incorrect billing routing".
- ksubedi18 minutes ago
 Let's not forget Qwen 35B A3B MoE. It gets better performance than this in all the metrics for a fraction of the memory / compute footprint.Sad to see all the non Chinese open source models being at least one generation behind.
 - simjnd6 minutes ago
 Qwen3.6 27B is even more impressive IMO. Dense so it doesn't run as fast but it's so good.
- Aurornis1 hour ago
 > This model? You can run it at Q4 with 70GB of VRAM. This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).The one thing I would want everyone curious about local LLMs to know is that being able to run a model and being able to run a model fast are two very different thresholds. You can get these models to run on a 128GB Mac, but we need to first tell if Q4 retains enough quality (models have different sensitivities to quantization) and how fast it runs.For running async work and background tasks the prompt processing and token generation speeds matter less, but a lot of Mac Studio buyers have discovered the hard way that it's not going to be as responsive as working with a model hosted in the cloud on proper hardware.For most people without hard requirements for on-site processing, the best use case for this model would be going through one of the OpenRouter hosted providers for it and paying by token.> This beats the latest Sonnet while running locallyAlmost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.
 - simjnd47 minutes ago
 > The one thing I would want everyone curious about local LLMs to know is that being able to run a model and being able to run a model fast are two very different thresholds. You can get these models to run on a 128GB Mac, but we need to first tell if Q4 retains enough quality (models have different sensitivities to quantization) and how fast it runs.Very valid. This is an active area of research, and there are a lot of options to try out already today.- People have successfully used TurboQuant to quantize model weights (TQ3_4S), not just the context KV, to achieve smaller sizes than Q4 (~3.5 bpw) with much better PPL and faster decoding.- Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KDL, etc. at the same size as a Q4.- DFlash (block diffusion for speculative decoding) needs a good drafting model compatible with the big model, but can provide an uplift up to 5x in decoding (although usually in the 2-2.5x range)- Forcing a model's thinking to obey a simple grammar has been shown to improve results with drastically lower thinking output (faster effective result generation) although that has been more impactful on smaller models.We should be skeptical, but it's definitely trending in the right direction and I wouldn't be surprised if we are indeed able to run it at acceptable speeds.> Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.This hasn't been my experience. After Anthropic's started their shenanigans I've switched to exclusively using open-weights models via OpenRouter and OpenCode and I can't really tell a difference (for better or for worse).
 - zozbot2341 hour ago
 Cloud hardware is not inherently more "proper" than what's being proposed here, there's nothing wrong per se about targeting slower inference speeds in an on prem single-user context.
 - Aurornis1 hour ago
 > Cloud hardware is not inherently more "proper" than what's being proposed hereCloud hardware can run the original model. Quantization will reduce quality. The quality drop to Q4 is not trivial.Cloud hardware is also massively faster in time to first token and token generation speed.> there's nothing wrong per se about targeting slower inference speeds in a local single-user context.If that's what the user wants and expects then it's fineMost people working interactively with an LLM would suffer from slower turns.
 - zozbot23418 minutes ago
 > Cloud hardware can run the original model. Quantization will reduce quality.New models are often being released in quantized format to begin with. This is true of both Kimi and the new DeepSeek V4 series. There is no "original model", the model is generated using Quantization Aware Training (QAT).
 - cbg01 hour ago
 The quantization for some models can be very detrimental and their quality can drop considerably from the posted benchmarks which are probably at bf16, this is why having considerable RAM can be important.
- giancarlostoro59 minutes ago
 > For the Claude-pilled people, I don't know if you only run Opus but when I was on the Pro plan Sonnet was already extremely capable.Before February I was able to use Opus on High exclusively on my Max plan no problem. Now I've shifted to just using Sonnet on high and yeah, its pretty capable. I love that, Claude Pilled. ;)
 - simjnd41 minutes ago
 Yeah I love Claude, amazing models. Anthropic has very quickly burned most of the goodwill I had for it so I still ended up cancelling my subscription.
- UncleOxidant53 minutes ago
 Yeah, you can run it locally if you have enough VRAM, but the reports trickling in are saying about 3 tok/sec. This was on a Strix Halo box which definitely has the needed VRAM, but isn't going to have as high mem bandwidth as a GPU card, it's going to be similar on a Mac - that's the dilemma... the unified memory machines have the VRAM, but the bandwidth isn't great for running dense models. This size of a dense model is only going to be runnable (usefully) by very few people who have multiple GPU cards with enough memory to add up to about 70GB.
 - simjnd31 minutes ago
 I don't think this is quite correct, a Strix Halo box usually has 256 GB/s memory bandwidth. An M5 Max has 614 GB/s. An M3 Ultra (no M4 or M5 Ultra) has 820 GB/s. It's still not GDDR or HBM territory, but still significantly faster.That's the edge of Apple Silicon for AI. When they scale up the chip they add more memory controllers which adds more channels and more bandwidth.But yeah in the end it's still going to be only a handful of people that can run it.What I meant is that I think researching and developing smaller more powerful model is more interesting than chasing the next 3T parameter model while burning through VC money and squeezing your customer base more and more aggressively.
- liuliu1 hour ago
 The competition is on DeepSeek v4 Flash for similar size / deployment target.
 - simjnd42 minutes ago
 DeepSeek v4 Flash is still over 100GB at Q4 IIRC, and Q4 has generally been the sweet spot. Although it's an MoE so it might run a lot faster that this dense Mistral model if you have the RAM.
- revolvingthrow6 minutes ago
 Eh. Those results would be noteworthy if it was a a MoE. A 120B dense? Firmly in meh territory.
- YetAnotherNick1 hour ago
 It has similar SWE bench score to qwen 3.6 27b[1]. No one is comparing it to frontier.[1]: There is no other common benchmark in the blog.
 - simjnd21 minutes ago
 That's more a testament of how good Qwen3.6 27B is (it really is great) more than how bad this one is IMO. Gemma 4 31B was already good, but Qwen3.6 27B is incredible for its size.
- zackangelo42 minutes ago
 Isn't Kimi K2.6 natively INT4?
 - simjnd10 minutes ago
 I don't think any models are natively INT4? I wouldn't see the point to nerf the model out-of-the-box.
- DeathArrow1 hour ago
 >This model? You can run it at Q4 with 70GB of VRAM. >This beats the latest Sonnet while running locallyNot sure it will beat Sonet at Q4.>This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).For $3500 I can get 7-8 years of GLM using coding plans, have a faster model and much better code quality.
 - simjnd13 minutes ago
 > Not sure it will beat Sonet at Q4.Very valid. Importance-weighted quantization and TurboQuant on model weights can reduce loss a lot compared to "traditional" Q4 so one can be hopeful.> For $3500 I can get 7-8 years of GLM using coding plans, have a faster model and much better code qualityBut you will own no computer, and that's also assuming prices stay what they are. Anyway my point was not whether or not it makes financial sense for everyone. A lot of people are very happy not owning their movies, software, games, cars or house. I'm just happy there is a future where the people can own and locally run the tech that was trained on their stolen data.
 - kobalsky59 minutes ago
 > For $3500 I can get 7-8 years of GLMmind sharing where's the go to place to pay for open models?
 - simjnd19 minutes ago
 I recommend using OpenRouter (openrouter.ai). Basically a broker between inference providers and you which allows you to pick, try, and switch models from a massive catalog, extremely transparent about usage and pricing.
 - DeathArrow50 minutes ago
 You can get GLM coding plans from Z.ai and Ollama Cloud and OpenCode Go.
- deepsquirrelnet25 minutes ago
 I would love to be able to run frontier locally, but I think the larger importance of open weight models is price accountability.In the US with our broken system of capitalism, it’s the only way we can tether these companies to reality. Left to their own devices, I’m not convinced they would actually compete with each other on price.Buy nobody like to talk about how “moat” building is fundamentally anti-competitive, even in name.Funny that self proclaimed capitalists hate the system in practice. Commodity pricing is what truly terrifies them.
 - simjnd7 minutes ago
 I'm not necessarily interested in having frontier locally. You don't need to be frontier to be a very good and useful coding agent. I agree with your point on price accountability though. Hopefully no tariff comes down on the Chinese and European open-weight models.
- 2ndorderthought1 hour ago
 The point is it's open weight and is tiny compared to a lot of it's competitors. 4gpus for world class performance - sweet!
- redrove1 hour ago
 It’s 128b dense model. Good luck getting more than 3t/s out of a mac. It doesn’t matter if it fits or not.
 - zozbot2341 hour ago
 You could run it on a single Mac Studio with M3 Ultra, or two Mac Studios with M4 Max at higher perf than that. And lightly quantizing this could give us modern dense models in the ~80GB size range, which is a very compelling target.
 - freakynit1 hour ago
 Wouldn't matter much still. M3 ultra has 819GB/s unified memory bandwidth. That means theoretical max tokem rate is 819/128 =~ 6.39 t/s. At 80 GB (5 bit quantization), its still near about 10 t/s ... far from a good coding experience. Also, these are theoretical max.. real world token generation rates would be at least 15-20% less.
- sayYayToLife18 minutes ago
 [dead]
- freakynit1 hour ago
 I was hoping a lot from it... but this one, is not up to that mark. For example, here is it's comparion with 4.7x smaller model, qwen3.7-27b.<a href="https://chatgpt.com/share/69f239e8-7414-83a8-8fdd-6308906e5f73" rel="nofollow">https://chatgpt.com/share/69f239e8-7414-83a8-8fdd-6308906e5f...</a>Tldr: qwen3.6-27b, a 4.7x smaller model, have similar performance.
 - r0b051 hour ago
 That's a chatgpt summary. Actual usage would a better test.
 - freakynit1 hour ago
 yep.. until then, this is good enough since the tests are standard, and the results are numeric and can be compared without any doubt.
 - lostmsu1 hour ago
 To be fair MoE from Qwen itself had the same "problem". 3.5 122B MoE was same or worse than 3.5 27B. Yet to see 122B 3.6.UPD. NVM, Mistral Medium 3.5 is dense. So yes, it is worse in every way.
vessenes2 hours ago
As always, rooting for these guys — model and national diversity is great. This looks like a solid foundation to build on; hopefully the 3.6/3.7 will dial in more gains. It looks like maybe from the computer use benchmarks that their vision pipeline could use improvement, but that’s just speculation.The different results on some benchmarks vibes as if this is truly an independently trained model, not just exfiltrated frontier logs, which I think is also really important - having different weight architectures inside a particular model seems like a benefit on its own when viewed from a global systems architecture perspective.
- sayYayToLife17 minutes ago
 [dead]
Mashimo1 hour ago
Compared to all other hosted LLMs that I have tested, Mistral seems to be the only one with rather strict CSP headers. When you ask them to create a website with some javascript library it will not preview, even though le chat offers canvas mode.Sometimes when a new release comes around from any provider I just want to test it a bit on the web. without paying and using an agent harness.Why are they like this ;_;Edit: Christ on a bike it's bad at drawing SVGs <a href="https://chat.mistral.ai/chat/23214adb-5530-4af9-bb47-90f52192f274" rel="nofollow">https://chat.mistral.ai/chat/23214adb-5530-4af9-bb47-90f5219...</a>
- SyneRyder44 minutes ago
 > Edit: Christ on a bike it's bad at drawing SVGs On the bike would be an improvement. Geez.I know SVGs may not be the best benchmark, but that matches my experience of trying to run a (previous) Mistral model in Mistral Vibe, asking it to help me configure an MCP server in Vibe. It confidently explained that MCP is the MineCraft Protocol and then began a search of my computer looking for Minecraft binaries.
- 2ndorderthought1 hour ago
 I have never wanted, needed or hoped to draw svgs with an LLM. All of the models suck at it, some are just more fun or something.
 - Mashimo1 hour ago
 I can't speak for what you consider sucking, but there is a significant difference between Mistral and Kimi or Gemini. I find the others to be usable for my needs.
 - 2ndorderthought53 minutes ago
 I agree there is a difference but does that translate to anything? It's not the same operations used to write code, and it's kind of useless. I wouldn't waste my power bill ensuring a model I was releasing was good at it.
mtct882 hours ago
It's okay, nothing exceptional, but any news from non US and non Chinese models is still good news.
- pb72 hours ago
 This is the bar for Europe, huh?
 - deaux2 hours ago
 Where are the competitive models from Singapore, Japan, Taiwan, Korea, Russia, Canada, India, the UK? From anywhere that isn't China or the US?There are none. Mistral Small 4 is pareto-competitive in its pricing bracket at $0.15/$0.60, at worst it's second to Gemma 4 26B A4B. The above countries have never had a model that is even close to being so.This particular Mistral Medium looks to be uncompetitive at that pricing. I'm surprised it's so expensive given its size. Wonder if we'll see other providers offer it for cheaper.but that doesn't mean Mistral has never produced anything useful.
 - argsnd2 hours ago
 DeepMind, which is headquartered in London, probably had a significant role in the development of the Gemini and Gemma models.Yes, it might be a problem that the UK allows companies like this to be bought up by foreign countries.
 - wasfgwp1 hour ago
 Without Google’s funding its not obvious i DeepMind would have went anywhere.Unless the moved to US for funding while keeping a back office in the UK.It’s strange to expect anything significant to come out from Europe when VCs there are either very risk averse and/or don’t have enough cash to begin with. It’s not like government or EU funding can replace that since its almost always wasted or missdirected
 - johndough1 hour ago
 > KoreaEXAONE from LG AI Research <a href="https://huggingface.co/LGAI-EXAONE" rel="nofollow">https://huggingface.co/LGAI-EXAONE</a>They had one of the best small models a few months ago and they released a new model just last week.There's also HyperCLOVA X (haven't tested it, but maybe it is also good) <a href="https://huggingface.co/naver-hyperclovax" rel="nofollow">https://huggingface.co/naver-hyperclovax</a>> IndiaIndia has the Sarvam model series, which admittedly are not SotA, but they have pretty good voice capabilities <a href="https://huggingface.co/sarvamai" rel="nofollow">https://huggingface.co/sarvamai</a>The UAE (not part of the list above) also has a few noteworthy models: <a href="https://huggingface.co/tiiuae" rel="nofollow">https://huggingface.co/tiiuae</a>
 - deaux1 hour ago
 I'm familiar with those models. They're nowhere near competitive. Miles away from Mistral or (obviously) Chinese models.> (haven't tested it, but maybe it is also good)I have. It is not.
 johndough1 hour ago
 You mentioned "pareto-competitive", and EXAONE certainly was that. The statement that the "above countries have never had a model that is even close to being so" is simply too broad.
 deaux35 minutes ago
 You're talking about EXAONE 4.5 33B? Gemma 4 31B was released 1 week earlier and blows it out of the water. Which point in time/model size are you possibly talking about? The original K-EXAONE in January?More than anything the availability speaks for itself. If it was indeed pareto competitive, all dozens of model providers would be doing their best to offer it for serverless inference. They don't. There's maybe one that does. Do you think a lot of companies wouldn't prefer a Korean model over a Chinese one? In this case, the market speaks. Go talk to people who run business based on putting billions or trillions of tokens through open weights models. And how much time they put into optimization of model selection to save money and latency. And ask why none of them are using EXAONE models. It's not because we're not aware of their existence. There's also reason to believe they've been benchmaxxing more than Chinese models, btw. Have you done the vibecheck?I wish they were strong, I hope that in the future, they are. More diversity is better. So far they have not yet been a serious option at any point.
 - cyanydeez1 hour ago
 they should ask unsloth to follow them. For my usecases locally w/128GB, Qwen3.5-Coder-Next is SOTA.
 - class4behavior2 hours ago
 Although the Manus decision might change things for AI, Singapore-washing is quite rampant among Chinese companies, so I wouldn't call this place of origin an alternative market.
 - sayYayToLife16 minutes ago
 [dead]
 - amunozo2 hours ago
 This is the bar for anybody that's not the frontier labs.
 - locknitpicker2 hours ago
 > This is the bar for Europe, huh?A few months ago China was being criticized left and right on how somehow it was not able to compete, and once DeepSeek showed up then all the hatred shifted onto how China was actually competing but exploring unfair competitive advantages.Funny how that works.Also, aren't the likes of OpenAI burning through over $2 of investment for each $1 of revenue?
 - pb72 hours ago
 [flagged]
 - 2ndorderthought1 hour ago
 I find it funny how people don't realize the technical achievements and papers coming out of deepseek or Alibaba. They are making this whole AI thing sustainable and cheap and available to do at home. That's the future. I should be able to run my own harness and model and never bother with openai or anthropic at all.
 - nickthegreek2 hours ago
 2 businesses working to get money from the same customers in the same field is competition. Kellogs is competing with store brand cereal. People are choosing to use these Chinese AI apis because they are good enough for some workflows and cheaper. If they didn't exist, the money would go to the frontier labs. There is no world where this would not be defined as competition.
 - locknitpicker2 hours ago
 > China is not competing, it is distilling US models.I think you should check your notes. The likes of Kimi K2 thinking shows up as high as the second best general purpose model currently in existence. It seems they compete just fine.If you believe "distilling" is all it takes to put together a model at the top of any synthetic benchmark then I wonder what you would have to say about all US models that greatly underperform in comparison and still manage to be used extensively in professional settings.But your argument is an emotional one and not rarional, isn't it?
 wasfgwp1 hour ago
 > high as the second best general purpose modelAccording to benchmarks which are gamed to the extreme these days. Trusting them blindly isn’t exactly rational either. They don’t necessarily translate that well to real world tasksIt’s obviously not “distilling” as such but there are reasons why Chinnese models are consistently several months behind OpenAI/Antropic
 2ndorderthought1 hour ago
 [dead]
 - tirpen2 hours ago
 > China is not competing, it is distilling US modelsChina are cheating by using data obtained without permission to train their models in an evil commie way!They should have done what the US did instead and trained models on data obtained without permission in a fair and freedum way!> Where are the Chinese models that are blowing US ones out of the water?Kimi2 blows every US model out of the water in any comparison that includes both costs and performance.
 2ndorderthought1 hour ago
 Qwen3.6 runs on a single GPU and beats claudes sonnet. In benchmarks and real world tests from humans. Kimi is awesome but most people won't be able to host it themselves.A lot of people are slowly realizing the moat of 1T closed source models is gone as of the last few weeks. It's going to change the industry. April was a huge month for open models, it'll be curious to see if that continues.This Mistral submission is another nail in the coffin.
 wasfgwp1 hour ago
 > beats claudes sonnetBased on benchmarks which don’t mean that much these days.> models is gone as of the last few weeks.Yes, that’s exactly what people were saying after every major release for the past year or so. It’s always a couple of weeks away
 prodigycorp1 hour ago
 i run qwen 3.6. you need to drink some settle down juice.
 2ndorderthought1 hour ago
 No way it's awesome.
 - Jackpillar1 hour ago
 Theft is quite a slippery slope argument not in your favor in the context of US based LLMs and how/what they were trained on..
 - sagacity2 hours ago
 Ah yes, like those EUV machines America and China have worked on.
 - saulapremium1 hour ago
 You wouldn't happen to be a Trump supporter by any chance, would you?
 - wg02 hours ago
 I don't mind Chinese but US under Trump is a fascist state based on ethnic and theological grounds pretty much or soon would be if electorate doesn't decide otherwise.China and rest of the world has sane leadership that aren't mentally retarted.
 - gadders1 hour ago
 Yes, China is a much freer and more democratic country than the USA. It's not like you can get a Uygur killed to order for a new kidney or anything.
 - 2ndorderthought56 minutes ago
 I would rather support Chinese tech companies then American ones who write manifestos, bomb children, praise wwii Germany, can't stay online, are publicly making weapons for wars I don't support, etc.Chinese AI companies are just trying to make money. They are also publicly contributing to forward the field. We all get to decide, but claiming deepseek is involved in genocide is beyond a stretch. Claiming anthropic and chatgpt are... Actually not so much given the president was threatening it and enabling it with an ally...
minimaxir2 hours ago
It's funny that 128B is now considered Medium. I remember back in the day when 355M parameters was considered medium with GPT-2.
- speedgoose2 hours ago
 And GPT-2 1.5B was considered too dangerous to release.They were perhaps right.
andhuman31 minutes ago
The Vibe CLI is really bad on Windows, sure they don’t officially support it, so can’t blame them, but a FYI for anyone wanting to try it. It can’t get find and replace right.
simonw2 hours ago
I can't figure out if this is available in the official Mistral API or not.Their model listing API returns this:<pre><code> { "id": "mistral-medium-2508", "object": "model", "created": 1777479384, "owned_by": "mistralai", "capabilities": { "completion_chat": true, "function_calling": true, "reasoning": false, "completion_fim": false, "fine_tuning": true, "vision": true, "ocr": false, "classification": false, "moderation": false, "audio": false, "audio_transcription": false, "audio_transcription_realtime": false, "audio_speech": false }, "name": "mistral-medium-2508", "description": "Update on Mistral Medium 3 with improved capabilities.", "max_context_length": 131072, "aliases": [ "mistral-medium-latest", "mistral-medium", "mistral-vibe-cli-with-tools" ], "deprecation": null, "deprecation_replacement_model": null, "default_model_temperature": 0.3, "type": "base" }, </code></pre> So that has the alias "mistral-medium-latest", but the official ID is "mistral-medium-2508" which suggests it's the model they released in August 2025.But... that 1777479384 timestamp decodes to Wednesday, April 29, 2026 at 04:16:24 PM UTCSo is that the new Mistral Medium?
- simonw2 hours ago
 Some poking around in the source code for <a href="https://github.com/mistralai/mistral-vibe" rel="nofollow">https://github.com/mistralai/mistral-vibe</a> got me to this:<pre><code> curl https://api.mistral.ai/v1/chat/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer $(llm keys get mistral)" \ -d '{ "model": "mistral-medium-3.5", "messages": [ {"role": "user", "content": "Generate an SVG of a pelican riding a bicycle"} ] }' </code></pre> Which did work: <a href="https://gist.github.com/simonw/f3158919b18d2c47863b0a5dc257a355#response" rel="nofollow">https://gist.github.com/simonw/f3158919b18d2c47863b0a5dc257a...</a> - it's pretty disappointing.Weird that it doesn't show up in the model list:<pre><code> curl https://api.mistral.ai/v1/models \ -H "Content-Type: application/json" \ -H "Authorization: Bearer $(llm keys get mistral)" | jq</code></pre>
 - Mashimo1 hour ago
 I also did some SVG tests, it's really bad.<a href="https://chat.mistral.ai/chat/897fbe7d-b1ae-4109-9b29-f3ccc4f8896d" rel="nofollow">https://chat.mistral.ai/chat/897fbe7d-b1ae-4109-9b29-f3ccc4f...</a>
 - spijdar1 hour ago
 Wow. I get that "how well can it make SVGs" isn't the (or a) gold standard for how useful a model is or isn't, but the fact the Gemma 4 26B A4B I'm running locally can blow it out of the water doesn't give me high confidence for the model. Maybe an unfair comparison, but...
 - 2ndorderthought1 hour ago
 It sounds like they focussed performance on not drawing svgs. Which honestly, makes a lot of sense to me.
 spijdar1 hour ago
 Drawing SVGs isn't something I really care about either, and I think it's still to "qualitatively compare" e.g. "Opus's pelican vs GPT's pelican vs GLM's pelican" or whatever the kids are doing.But what stands out to me is that it's barely able to draw a "recognizable" pelican at all. The Devstral 2 model even looks slightly better, though maybe I'm splitting hairs: <a href="https://simonwillison.net/2025/Dec/9/" rel="nofollow">https://simonwillison.net/2025/Dec/9/</a>
 - Mashimo1 hour ago
 It's so bad I don't want to spend the 18 EUR just to test it for a month. It can't even create an SVG of the facebook logo. There should be plenty of examples of that around.Gemini fast could do that in under 5 seconds.
 - cyanydeez1 hour ago
 I'm curios: are you doing a real apples to apples comparison, or are you running a harness that already curates prompts? There's a far and wide margin how any of these models respond based on already loaded context. Most models are pretty much hot garbage until their context is curated appropiately.
 spijdar1 hour ago
 I just copied and pasted each prompt as specified by Mashimo and simonw into a chat interface, using a 4-bit Unsloth quantization of Gemma 4 26B, with the default sampler settings recommended by Google, and a system prompt of "You are a helpful assistant". The results are miles ahead of what the Mistral model output.I've gotten a lot of use out of Mistral models, and I imagine this model is pretty good at other things, but it really feels like a 128B parameter dense model should be at least a little better than this.
syntaxing1 hour ago
This is a very interesting strategy that might pay off. This model is a very good option for enterprise self host. I would argue a lot of companies are VRAM constrained rather than compute constrained. You could fit 4-5 running instances on one H100 cluster where you can only fit 1-2 Kimi K2 or GLM5.
- 2001zhaozhao50 minutes ago
  This is 128B dense though. the K/V cache on long context is going to be massive
  - Havoc7 minutes ago
    Don’t think kv size correlates to dense/moe
- sayYayToLife15 minutes ago
  [dead]
postalcoder2 hours ago
This release Mistral really reminds you of the gap between the frontier labs and everyone else.Pre-agent, there wasn't always an obvious difference between models. Various models had their charms. Nowadays, I don't want to entertain anything less than the frontier models. The difference in capability is enormous and choosing anything less has a real cost in terms of productivity.I've been a big fan of the smaller labs like Mistral and especially Cohere but it's been a while since I've been excited by a release by either company.That said, I'm using mistral voxtral realtime daily – it's great.
- deaux2 hours ago
 Can't agree at all. Productivity gap just 1 year ago was much larger for frontier model vs non-frontier. Let alone 2 years ago.
 - 2ndorderthought38 minutes ago
 Same the gap is almost paper thin for anyone who hasn't gone full uninformed vibe code.
 - postalcoder2 hours ago
 When I was thinking pre-agentic, I was actually thinking more pre-"coding seen as the main use case for these models".
 - deaux1 hour ago
 Coding has always been the main real-world business usecase since day one. There has been no point since the very first public availability of GPT 3.5 in November 2022, that it wasn't.A lot of us have been agentic coding since almost 2 years ago, mid-2024. I have. The productivity gap of "best vs 2nd vs 3rd best model" was biggest back then and has slowly been shrinking ever since.
- onlyrealcuzzo2 hours ago
 > Pre-agent, there wasn't always an obvious difference between models. Various models had their charms. Nowadays, I don't want to entertain anything less than the frontier models. The difference in capability is enormous and choosing anything less has a real cost in terms of productivity.It's just apples to oranges.There is not a clear, across the board, winner on non-agentic tasks between Gemini, ChatGPT, and Claude - the simple chatbot interface.But Claude Code is substantially better than Codex which itself is notably better than Gemini-cli.In this vein, it should not be surprising that Claude Code is way better than non-frontier models for agentic coding... It's substantially better than other frontier models at specialized agentic tasks.
 - philipbjorge2 hours ago
 I’ve been comparing Claude Code and Codex extensively side by side over the past couple of weeks with my favorite prompting framework superpowers…From my perspective, Claude Code is decidedly not better than Codex. They’re slightly different and work better together. I would have no issues dropping CC entirely and using codex 100%.If you’re working off of “defaults”, in other words no custom prompting, Claude Code does perform a lot better out of the box. I think this matters, but if you’re a professional software developer, I’d make the case that you should be owning your tools and moving beyond the baked in prompts.
 - postalcoder2 hours ago
 I think there's a fair amount of evidence that the heavy harnesses actually drag down performance compared to bare harnesses.
 - nothinkjustai2 hours ago
 CC is not better than Codex, nor is it better than OpenCode, Crush, Pi etc…
- locknitpicker2 hours ago
 > Pre-agent, there wasn't always an obvious difference between models. Various models had their charms. Nowadays, I don't want to entertain anything less than the frontier models.This is a very naive and misguided opinion. In most tasks, including complex coding tasks, you can hardly tell the difference between a frontier model and something like GPT4.1. You need to really focus on areas such as context window, tool calling and specific aspects of reasoning steps to start noticing differences. To make matters worse, frontier models are taking a brute force approach to results which ends up making them far more expensive to run, both in terms of what shows up on your invoice and how much more you have to wait to get any resemblance of output.And I won't even go into the topic or local models.
 - postalcoder2 hours ago
 > You need to really focus on areas such as context window, tool calling and specific aspects of reasoning steps to start noticing differences.This is like saying "the current models and the old models are the same if you ignore every important advance they've made"
- sayYayToLife13 minutes ago
 [dead]
maelito1 hour ago
Given what Vibe already did in the previous versions with codestral-v2, that's great news. Keep up the good work ! I don't want to depend on the world's two hungry superpowers.
seb_lz1 hour ago
I'm using mistral-medium-2508 for some text transformation operations. It's giving me better results than mistral-large for my use cases. Looking forward to testing this new model, although I'm not sure if it's really meant at replacing the previous medium model since it's a lot more expensive and presented more as a coding / agentic model (mistral-medium-2508 was priced $0.4/$2 per 1M tokens, mistral-medium-3.5 is $1,5/$7.5).
schipperai1 hour ago
With most OSS releases being MoEs, and modern GPUs optimized for MoEs, can somebody with knowledge of the topic explain or speculate why Mistral might have opted for a dense model?
- ac2936 minutes ago
 Modern GPUs aren't optimized for MoEs though?The advantage to a dense model like this Mistral one is that it is as smart as a much larger MoE model so it can fit on less GPUs. The tradeoff is that it is much slower since it has to read 100% of its weights for every token, MoE models typically only read about a tenth (though sparsity levels vary).
mark_l_watson2 hours ago
I like the idea of Mistral, but the last time I evaluated Mistral Vibe it was really nice for $15/month but not as effective as Gemini Plus with AntiGravity and gemini-cli. I am currently running Gemini Ultra on a 3 month 'special deal' and AntiGravity with Opus 4.7 tokens is pretty much fantastic.That said, when I stop spending money on Gemini Ultra, I will give Mistral Vibe another 1-month test.I like the entire business model and vibe of Mistral so much more than OpenAI/Anthropic/Google but I also have stuff to get done. I am curious if Mistral Vibe for $15/month is a stable business model (i.e., can they make a profit).
- danelski57 minutes ago
 How do you feel about the responsiveness of gemini-cli? I tried it on a paid plan and the 10-minute hang-ups (per step, not the whole plan execution) really break the illusion of performance gains, unless you run it in the background and do something else in the meantime. It's more noticeable when Americans are awake.
- amunozo2 hours ago
 I'm testing it right now and it seems very buggy and unstable, just like before.
Tepix2 hours ago
I use Mistral Le Chat quite a bit.One thing in particular I was disappointed in was its bad explanations when asking about French grammar. It made multiple mistakes and the other models got it right, even Qwen 3.6 27b!Anyway, I'm hoping they catch up some more.
- kubb2 hours ago
 There's a good chance that they'll catch up. The "AI race" is a race to the bottom, with the leaders blowing huge wads of cash on capabilities that get replicated months later by the competition at a fraction of the cost.The only benefit of leading is mindshare. OpenAI is doubling down on that, by investing in communication companies. That's their pathetic attempt at a "moat".
 - pb72 hours ago
 They catch up by distilling frontier models. They will eventually figure out how to prevent that from happening. No one has any interest in investing tens of billions if the product can be copied and sold for less.
 - amarcheschi2 hours ago
 >No one has any interest in investing tens of billions if the product can be copied and sold for less.That is what has happened until now though
Alifatisk1 hour ago
A 1000B model, can we call it 1KB model?
wyre2 hours ago
I'm rooting for Mistral. It seems they are making a big bet that smaller models will win over larger ones and I can see it happening. I was running some simple chat and tool-calling benchmarks for small models and Mistral Small 4 performed well for it's price ($.15/$.60). Seeing this today got me excited, benchmarks seems solid compared to models much larger, but it's priced higher than Haiku, 5.4 mini, and all the the Chinese models it's comparing itself too. It's not even winning those benches either, just being competitive with them, which is great, those models are 5x+ the size, but they are also 1/2 the price. Hard to be excited about that.
spwa43 hours ago
TLDR: Mistral Medium 3.5, text-only, 128B dense model, 256k context window, modified MIT license. Model is ~140G ...<a href="https://huggingface.co/mistralai/Mistral-Medium-3.5-128B" rel="nofollow">https://huggingface.co/mistralai/Mistral-Medium-3.5-128B</a>They more or less claim this exceeds Claude Sonnet 3.5 on most things, but is worse than Sonnet 3.6, and exceeds all other open models.Oh and they have a cloud service that will code your apps "in the cloud". But, yeah, at this point, so does my cat.And, yes, unsloth is on it: <a href="https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF" rel="nofollow">https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF</a> (but 4bit quant is 75G)
- wolttam2 hours ago
 Sonnet 4.5 and 4.6*There is no way it exceeds “all other” open models - but it does exceed all of Mistral’s past models.You can see it getting blown past by GLM 5.1 and Kimi in this.Still excited to give it a try
 - 2ndorderthought1 hour ago
 It looks like qwen 3.6 is winning and smaller for the April small model roll out
- pama2 hours ago
 Unfortunately they only compare to old “all other open models”. There are probably over 10 other open models better than it by now.
- Marciplan2 hours ago
 You mean Sonnet 4.5 and 4.6 riight
 - spwa41 hour ago
 right
Giorgi2 hours ago
Oh they are still a thing?! Completely forgot about Mistral. I am assuming they are still burning trough investor money.
- Havoc5 minutes ago
 Think they’re positioned pretty well. They’ve got an edge in the European corporate space and don’t have ungodly large numbers to hit
- danelski51 minutes ago
 I believe they'll get profitable sooner than their frontier competition. Their operating costs seem to peanuts compared to the providers they're compared to most often while having the local advantage of not being Chinese nor American.
- kergonath25 minutes ago
 > they are still burning trough investor moneyDifficult to say, this information is not really public. That said, those investors include EU agencies and European multinational companies and governments. It’s not as flashy as the ridiculous sums OpenAI is getting but it should be enough to keep them going for a while.They also have a different business model. They are selling their expertise to fine tune and adapt their models to on-premises computers (which they can help you build) to handle confidential data and information. I would not be surprised that the revenue they get from normal people is negligible in comparison.
- sev_verso2 hours ago
 What's better than Voxtral for locally processed voice input? More competition is always better.
amunozo3 hours ago
I want to believe it's gonna be good, but after trying GPT-5.5 even the most advanced Chinese models seem depressing.
- r0b053 hours ago
 This is a French model sir
 - spwa42 hours ago
 ÉvidemmentFunny detail: Google AI (the one they use in search) can't spell évidemment correctly.
 - baq2 hours ago
 What's French for 'goblin'...?
- ako3 hours ago
 Then you’ll be happy to learn it’s not Chinese
 - dotancohen3 hours ago
 GP is stating that the second best in the field, the Chinese, is so far behind the best in the field, GPT 5.5, that it is not even worth testing anything else.
 - amunozo2 hours ago
 Thanks for the translation, I did not express it very clearly. Anything that I try is so much worse.
 - Ritewut2 hours ago
 Is GPT 5.5 the best in the field? I think Opus is still better despite Anthropic's recent stumbling.
- manishsharan2 hours ago
 I am not following this obsession with SOTA and benchmark rankingsI have been using DeepSeek and GLMnmodels with OpenCode and Codex and Claudr side by side.I have not found the Chinese models lacking. I enjoy for coding and like to maintain full control of my codebade and deeply care about the GOF patterns. So I am very stringent in terms of what I want the LLM to code and how to code.So from my perspective, they are all about the same.
 - amunozo2 hours ago
 That I agree with, but for more complex autonomous changes the differences are considerable. However, it seems that most models will reach the saturation time in which they will be useful for almost everything and the difference will be in more and more niche and specialized tasks.
- lava_pidgeon2 hours ago
 Honestly I depends on the context which this performance matters. Mistral is quiet cheap
InputName3 hours ago
Looks at first graph. It's SWE-Bench Verified. A benchmark Open-AI stopped using two months ago due to contamination.Doesn't look to promising. Is there any reason to consider Mistral other than it's not US?
- 2ndorderthought1 hour ago
 They did not stop using it due to contamination. They said it's flawed and indirectly said anthropics results were impossible. It's very possible they are sore losers
- tpurves2 hours ago
 If it's not US and it's within a few percent of SOTA that might be good enough for a lot of people (eg Europeans)
 - NitpickLawyer2 hours ago
 Gemma has been better for us at EU languages than mistral (for comparable sized models) :/ so ... dunno. What mistral does well and others are lagging behind is deploying on prem with their engineers and know-how, offering tuned models for your tasks and finetuning on your own data. (I expect google to start offering this next)
 - deaux1 hour ago
 It's sad that despite their strength in this for onprem, they're so behind on this in the cloud. No publicly available cloud SFT at all. Meanwhile Google has been offering that for years - though remains to be seen if they will for Gemini 3 when GA.And on top of it a range of providers like Fireworks and so on that offer it for Chinese models. This seems such an obvious thing for Mistral to offer.
- amunozo2 hours ago
 Price and speed.