GLM-5.2 is a step change for open agents

(interconnects.ai)

190 points by vantareed2 days ago

19 comments

jerojero1 day ago
Open weight models from Chinese labs tend to be significantly cheaper. I think they're absolutely needed. I can't afford 200 USD a month for personal use of coding AI, and I don't think such prices are reasonable for most of the world economy anyway. Not to mention US firms might be giving their employees a lot more than that. It's increasingly feeling, to me, that there's a gap building up between haves and have nots. But then, we get news of these open weight models that are reasonably priced in inference with reasonable capabilities. Yes, they take maybe 6-9 months to get there, but tbh, that's not a bad trade-off at all.
- fbrncci5 hours ago
 You made me realize something. I routinely spend upwards of 500$ per month on LLMs for coding (expensed towards clients). However I live in a place where 500$ is around the avg. salary. I’m lucky that I know my way around western clients. Clients who pay these expenses and are happy to work with me because I am still about 50% cheaper than local talent in EU/US, while my salary at home converts to an upper class income at the highest tax bracket. Which of course causes some unfairness on both ends. Nobody here can compete with me. I often use left over tokens on local client projects; which despite lower pay, still pays off because they now take hours not days or weeks to complete. And nobody in the local clients talent pool can compete with me; unless they charge about half the market rate. Take away my 500$ monthly grant; and I’d be more or less screwed. Better open models will more or less start to reduce this advantage. It’s not like I positioned myself here on purpose. But it’s definitely a „right place, right time“ situation.
 - listic3 hours ago
 Thanks for sharing your insight. Mind if I ask you for a few vibe coding tips? I failed to solve your gh puzzle in the profile though.
 - swader9995 hours ago
 If you are running multiple agents, your cost for them should be multiples less than what their ROI is.
 - fbrncci5 hours ago
 My costs are 0$ as any token or subscription spend on agents is invoiced as an expense to my clients.
 - kreelman3 hours ago
 Thanks so much for being bold enough to be fairly open about the costs, how you arrange billing and the advantages that's given you. I've been fooling around with DeepSeek 4 agentically. It's probably not as good as Anthropic offerings, but even those seem to be roiled in politics and strife and DeepSeek 4 is very good IMHO. I'll later try out GLM. I'm in Australia. The government has set up a "return and earn" scheme to keep aluminium cans, plastic bottles and paper drink cartons out of the waste stream. A laudable project. The money you make from returned drink containers is pretty low, $AU 0.1 per container. I've participated to get the rubbish out of natural water streams and to make a nano amount of money on the side. When I looked at the costs of an app I was getting DeepSeek to help me with, I realised that the several hours I'd spent learning and building had cost something like 8 recycled containers. In my head after doing some DeepSeek stuff, I calculate a "cans per app" metric for myself for fun. I may even set up a simple graph to view my costs that way. I kind of hope the Anthropics of the world get enough price competition from sources like DeepSeek and GLM to drop their prices significantly. Time will tell. I'm using the Chinese DeepSeek provider, so everything done there could potentially be taken and used by the CCP... But this is hobbyist learning. There is probably a market for Deepseek/GLM served from non CCP available servers. I might even look into how hard that would be to set up here. I also hope that inference focused hardware will come to the fore, reducing energy use and cost. Realistically this will take time though, on the order of years. Here in Oz, we have community batteries that community members can charge and later draw from. Their electricity prices are competitive. I wonder if someone could set up something like a community battery to run data centres... That way reasonable environmental consideration could be given to inference power generation... This might not work in a market like the US or Europe, but small market size might be an advantage... Who knows.
 esperent25 minutes ago
 > I'm using the Chinese DeepSeek provider, so everything done there could potentially be taken and used by the CCP
 As opposed to Anthropic or OpenAI where everything done could potentially be taken and used by the US government. Also, replace "could potentially" with "will definitely" in both cases, there's no conspiracy here. We're stuck between two bad positions, so just use the one that's best for you, and wait for a better solution to arrive.
- arikrahman6 hours ago
 Someone else on this forum put it well, U.S. is trying to achieve AGI at all costs, while Chinese models are seeking widespread adoption.
 - azinman26 hours ago
 I don't think anthropic/openai/google aren't also seeing widespread adoption. In fact they already have the marketshare.
 - Turskarama1 hour ago
 The difference is that the US companies are using it as a means to an end, they need to make just enough profit that the investors don't all get cold feet before they get to AGI. The Chinese companies on the other hand are trying to be profitable immediately, which means that they're going slower to save development costs.
- tacomagick23 hours ago
 DeepSeek through their own API has saved me tons of tokens honestly. Even though it is not as smart as Kimi or Claude, their level of entry is very low with a top up of 2$ and Pay as you go compared to the subscription of Claude or 20$ top up of Kimi
 - praveer138 hours ago
 For personal use I’m considering using the frontier models from openai or anthropic to create a plan with research and brainstorming etc with enough details for cheap models to be able to follow (glm, deepseek etc) - with openrouter - will monitor how cheap and effective that turns out to be.
 - tacomagick32 minutes ago
 For my case Openrouter breaks Deepseek caching and charges me multiple times over what I pay for Deepseek's API, with 2$ I was able to get around 120M tokens from deepseek easily when Openrouter could only barely do 250k
 - ImaCake6 hours ago
 You should try out the cheaper models first. I find Deepseek v4 models pretty comparable to sonnet 4.6 but at a fraction of the cost. You might find you just don't need to use the American models at all.
- brian-armstrong55 minutes ago
 I read these stories and I can never figure out how people are managing to use these $200 plans. If I really go full bore, I can sometimes max out the $20 plan. Even then, it already produces more code than I can reasonably review and merge.
 - ipaddr32 minutes ago
 I've maxed out my chatgpt plus the first week and that included an smf forum rewrite. Trying my best I haven't been able to max out again. Things are set up so that you need to max out your 5 hour window multiple times, which becomes a job in itself. At work I'm struggling to keep my claude bill around $500.
- Fr0styMatt886 hours ago
 If we can agree that the AI model is at least as capable as a junior engineer or new contractor, how’s that different to saying “software engineering isn’t worth $200 a month”? Has a very race-to-the-bottom feel to it. Though in the grand scheme of it, $200/mo probably isn’t the real price either. Also looking at it not just in a vacuum - paying for a product that can change what you get from under you doesn’t seem great anyway. At least with a locally-hosted model you know what you’re getting.
 - matheusmoreira6 hours ago
 Yeah. There's no way to verify what these providers are doing. The real future is running these models at home. Opus level inference on our own hardware would be a dream come true.
 - IncreasePosts5 hours ago
 How will anyone running home instances be able to compete against people paying some money running much more powerful models on much more powerful hardware?
 - Fr0styMatt884 hours ago
 It’ll be interesting. I’m using Qwen3.6:27B at home and mostly Sonnet/Opus (depending on the complexity of the task) at work. You have to break things down into smaller chunks for the local models. For the bigger cloud ones they can do a lot of the broader thinking.
 - jimbokun2 hours ago
 At some point it will be hard for us to tell the difference.
- ImaCake6 hours ago
 Significantly cheaper than comparable models if you are using openrouter [0]. Just yesterday I spent roughly 13 cents centering some divs using Deepseek in a personal project. It would have been north of $1 to do that with a US frontier model.
 [0] https://openrouter.ai/compare/z-ai/glm-5.2/anthropic/claude-opus-4.8
 - ipaddr26 minutes ago
 For centering divs the free models opencode offers can easily handle that work. DeepSeek V4 Flash is pretty decent.
- matheusmoreira6 hours ago
 > It's increasingly feeling, to me, that there's a gap building up between haves and have nots.
 People speak of a permanent underclass. https://www.nytimes.com/2026/04/30/opinion/ai-labor-work-force-silicon-valley.html
- throwaway-blaze6 hours ago
 Just don't ask it to tell you the events of June 4, 1989.
- ttoinou1 day ago
 200 is much less than the value you’re supposed to get out of it. If it’s not then yeah go ahead and use cheaper models with worse quality
 - martinjc8 hours ago
 Are you aware of how much purchasing power 200 dollars is in China, Brazil, Thailand or India? This is an extremely arrogant take.
 - dash246 minutes ago
 Parent’s point was that many many people will get much more than $200 value from the “expensive” model. Sure, a Bihar farmer won’t, but even an Indian software developer may easily do if he or she has Western clients.
 - nwienert6 hours ago
 I’ve hired many asian developers anywhere from 1-4k a month. I get a lot more out of a 200/mo subscription now in a week than I did from them in a month. Now obviously in today’s world they’d be using a 200/mo subscription themselves. But it’s not like money is nothing, software development doesn’t scale down below 1k/mo for anyone competent even in the poorest areas.
 - xydone2 hours ago
 The point the post you replied to is making is that while you get value out of it, and in your case it's not that expensive, it's just simply not the case worldwide
 nwienert1 hour ago
 I don’t think you’re really reading between the lines.
 - matheusmoreira6 hours ago
 For the record, 200 USD is around 60% of the brazilian minimum wage.
 - Dayshine1 day ago
 I'm not sure how I'm supposed to get $200 of value out of personal use!
 - LPisGood9 hours ago
 Note that 200 dollars of value is different than 200 dollars of profit.
 - devmor8 hours ago
 I personally don’t find it that useful for most tasks, but if say, you get paid $50/hr for your work and it saves you more than 4 hours of work in a month, there you go.
 - selcuka5 hours ago
 Obviously this assumes that you can find 4+ extra hours of $50/hr work every month, or you can work 4 hours less. Neither of these assumptions is correct for people who work for a fixed salary.
 devmor1 hour ago
 That doesn’t change value. It’s value whether or not you can maintain a profit over it.
 - holoduke8 hours ago
 Here most of my colleagues have +200 dollar rates. It's really a no brainer. But sure, in south America or some Asian countries maybe it is. But still most devs need it anyway. Also in the poor regions.
 - HDBaseT7 hours ago
 $200/h is on the extreme end and I would argue most people here aren't anywhere close to that. The median hourly wage in the US is $28/h, so $200 equates to a little over 7 hours. Nearly a full day of work a month for the average person to use Claude with reasonable limits. Yes, the people on $28/h may not be the software development types, so their income might not be as high, but these are the people who would probably be vibe coding the most since they aren't day to day programmers!
 ray_kay7776 hours ago
 I suspect the reply above is referring to charge out rates rather than wages.
 HDBaseT3 hours ago
 My fault, thanks for the correction (:
 - folkrav6 hours ago
 Most of the world's developers, even in not-poor regions, make significantly less than what your colleagues charge.
 - uberex8 hours ago
 Unless that value is $200 cash in hand it will be hard to afford it for people who just don't have $200.
 - margalabargala7 hours ago
 Last time you bought a computer, did you buy the absolute fastest best CPU available?
 - girvo7 hours ago
 Yes, but that was because I could see the writing on the wall with respect to hardware prices being cooked by AI demand, so I built the best computer possible at the time knowing it'd probably need to last me the next 5+ years. So not really comparable. I use Step 3.7 Flash locally, models are good enough for so many coding tasks even at the lower end! (Though I note that calling a 200B model "lower end" is kind of amusing)
 - smrtinsert7 hours ago
 I've actually come to believe the overwhelming majority of use cases require nowhere near frontier quality so there's that. Much faster execution is just a bonus on top of the much reduced cost
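
The break-even arithmetic that devmor and HDBaseT trade above can be written down in a few lines. This is only a sketch: the $200/month price and the hourly rates are the figures quoted in the thread, the function name `breakeven_hours` is made up for illustration, and, as selcuka points out, the result only means something if saved hours can actually be turned into income or free time.

```python
def breakeven_hours(monthly_cost: float, hourly_rate: float) -> float:
    """Hours of work per month whose pay equals the subscription price."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return monthly_cost / hourly_rate


# Hourly figures as quoted by commenters in this thread (not independently checked).
QUOTED_RATES = {
    "devmor's example": 50.0,
    "US median wage per HDBaseT": 28.0,
    "holoduke's colleagues": 200.0,
}

if __name__ == "__main__":
    for label, rate in QUOTED_RATES.items():
        hours = breakeven_hours(200.0, rate)
        print(f"{label} (${rate:.0f}/h): {hours:.1f} h/month to cover $200")
```

At $28/h the plan costs a bit more than 7 hours of work a month; at $200/h it costs one.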
guybedo6 hours ago
GLM-5.2 has been a step change in how fast i can burn through tokens. I subscribed to their max plan to try it out. It counted me 700M tokens and drained my weekly quota in under 2 days. Quota just reset less than 24h ago and i'm already >60% weekly quota usage. For reference the kind of work i did would have used somewhere between 3% and 5% of Codex max or Claude max. The model is good, the plan is a scam
- try-working3 hours ago
 Kimi and GLM models have coined a new term: Thinkslop. They run a chain of thought that is up to 10x longer than other models and it seems that through a lookback mechanism they are able to use the CoT to reason about solutions to tasks they couldn't otherwise solve. The downside is of course that they consume many more tokens off your plan, and also that they are significantly slower. Kimi K2.7 takes about 7x longer to finish the same benchmark tasks as DeepSeek V4 Pro on my router benchmarks (https://role-model.dev/). So for now I'm happy with just two models: GPT and DeepSeek.
 - spwa419 minutes ago
 Turning up the thinking (max time spent thinking) lever really changes model performance, even for tiny models. But it's really irritating because it adds a lot of time.
 - guybedo3 hours ago
 yeah Kimi K2.7 was doing ok but was painfully slow. The coding plan limits were good though. I haven't tried deepseek yet, i should check this one out.
 - try-working29 minutes ago
 After the release of K2.7, the Kimi plan quotas have been reduced by about 80%.
- jubilanti6 hours ago
 > The model is good, the plan is a scam
 If it is needing to generate that many tokens to do the same tasks, then it probably has higher inference costs. So (for you) the model is bad, the plan is the same plan.
- anatoliikmt4 hours ago
 What kind of tasks have you been using it for?
aunty_helen7 hours ago
I signed up to a z.ai max account, $144. Hardly been able to use it as it 429s on most requests. They’re also refusing to refund me.
- osti6 hours ago
 Even as a GLM z.ai fan, I wouldn't pay for their plans. They are just way worse value than gpt or anthropic plans, in terms of both usage and capabilities.
- reissbaker3 hours ago
 Self-promo but you should try our service synthetic.new. We generally have up-to-date open-source LLMs on the sub, and we have GLM-5.2 :) Perf+stability should be wayyy better than zai.
- guybedo6 hours ago
 same here. Barely usable due to API connection issues. And when i can use it, it just drains the quota 5 times faster than codex or claude. Their plan is a scam
- sergiotapia6 hours ago
 My experience as well unfortunately :(
- fartcoin674 hours ago
 [dead]
christophilus6 hours ago
I've been working with Deepseek V4 Flash (with opencode as the harness). It's been almost indistinguishable from Codex / Claude Code for me. I'm sure I'll run into problems when I get to a stickier ticket to tackle. But so far, it's been quite good, and I find it writes straightforward code. I do think the Chinese models are good enough for an 80/20 rule use case.
- scottchiefbaker3 hours ago
 I tried Deepseek V4 Flash with very low expectations and was pleasantly surprised. It's a surprisingly capable model for the price.
 - timcobb3 hours ago
 What provider(s) do you use?
mlmonkey5 hours ago
Here are the numbers from their bar chart (scores in %):

Benchmark                        GLM-5.2  GLM-5.1  Claude Opus 4.8  GPT-5.5  Gemini 3.1 Pro
SWE-bench Pro                       62.1     58.4             69.2     58.6            54.2
Terminal-Bench 2.1                  81.0     63.5             85.0     84.0            74.0
NL2Repo                             48.9     42.7             69.7     50.7            33.4
DeepSWE                             46.2     18.0             58.0     70.0            10.0
ProgramBench                        63.7     50.9             71.9     70.8            39.5
MCP-Atlas                           77.0     71.8             77.8     75.3            69.2
Tool-Decathlon                      48.2     40.7             59.9     55.6            48.8
Humanity's Last Exam (base)         40.5     31.0             49.8     41.4            45.0
Humanity's Last Exam (w/ tools)     54.7     52.3             57.9     52.2            51.4

Seems to be handily beating Gemini 3.1 Pro. What _is_ Google DeepMind doing (other than bleeding talent to A\ )?
- vineyardmike3 hours ago
 > What _is_ Google DeepMind doing
 I feel like it has been pretty visible about what’s happening, between their press and products and financial statements. It’s just not what people are accustomed to expect. First, Google has become a major compute provider for competitors, thanks to TPUs. They’ve talked about allocating TPUs to GCP instead of their first party products. I can only assume it’s because they’re collecting a higher margin, and it covers the cost of data center buildout - which they’ve been aggressively doing. I wouldn’t be surprised if they made the financial decisions to delay or slow training for Gemini 3.5 when they provided last minute compute to Anthropic this spring. Second, Gemini has very directly not been focused on agentic coding, maybe 3.5 Flash being the change. They’ve built models they can deploy to watch YouTube videos, Nest cameras, scale to AI in search, understand fitness info in Fitbit, etc. They’re very clearly not focused around agentic/coding. They’ve put in a ton of effort into multimodal data in and out, and they’re the only major lab working on video generation still. There was a leak/rumor that their cofounder (brin) was getting involved in the model training to renew focus on agents so maybe this will change, and again 3.5 already feels different.
- linzhangrun28 minutes ago
 Just waiting for the 3.5 Pro they said would come out this month. Gemini is pretty much useless for any serious work right now.
- verdverm4 hours ago
 copying the graphs and tables to HN is noisy and harder to read
 - JSR_FDED3 hours ago
 Still more helpful than this comment
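
For anyone who wants to check the "handily beating Gemini" reading against the numbers mlmonkey transcribed, a rough sketch follows. The scores are copied from that comment as-is (not re-verified against the source chart), Humanity's Last Exam is left out because it reports two numbers per model, and the helper names `wins` and `gain` are made up for illustration.

```python
# Scores (%) as transcribed in mlmonkey's comment above; HLE omitted.
SCORES = {
    "SWE-bench Pro":      {"GLM-5.2": 62.1, "GLM-5.1": 58.4, "Claude Opus 4.8": 69.2, "GPT-5.5": 58.6, "Gemini 3.1 Pro": 54.2},
    "Terminal-Bench 2.1": {"GLM-5.2": 81.0, "GLM-5.1": 63.5, "Claude Opus 4.8": 85.0, "GPT-5.5": 84.0, "Gemini 3.1 Pro": 74.0},
    "NL2Repo":            {"GLM-5.2": 48.9, "GLM-5.1": 42.7, "Claude Opus 4.8": 69.7, "GPT-5.5": 50.7, "Gemini 3.1 Pro": 33.4},
    "DeepSWE":            {"GLM-5.2": 46.2, "GLM-5.1": 18.0, "Claude Opus 4.8": 58.0, "GPT-5.5": 70.0, "Gemini 3.1 Pro": 10.0},
    "ProgramBench":       {"GLM-5.2": 63.7, "GLM-5.1": 50.9, "Claude Opus 4.8": 71.9, "GPT-5.5": 70.8, "Gemini 3.1 Pro": 39.5},
    "MCP-Atlas":          {"GLM-5.2": 77.0, "GLM-5.1": 71.8, "Claude Opus 4.8": 77.8, "GPT-5.5": 75.3, "Gemini 3.1 Pro": 69.2},
    "Tool-Decathlon":     {"GLM-5.2": 48.2, "GLM-5.1": 40.7, "Claude Opus 4.8": 59.9, "GPT-5.5": 55.6, "Gemini 3.1 Pro": 48.8},
}


def wins(model: str, rival: str) -> list[str]:
    """Benchmarks on which `model` scores strictly higher than `rival`."""
    return [bench for bench, s in SCORES.items() if s[model] > s[rival]]


def gain(bench: str, new: str, old: str) -> float:
    """Percentage-point difference between two models on one benchmark."""
    return round(SCORES[bench][new] - SCORES[bench][old], 1)


if __name__ == "__main__":
    for rival in ("Gemini 3.1 Pro", "GPT-5.5", "Claude Opus 4.8"):
        won = wins("GLM-5.2", rival)
        print(f"GLM-5.2 beats {rival} on {len(won)}/{len(SCORES)}: {won}")
    biggest = max(SCORES, key=lambda b: gain(b, "GLM-5.2", "GLM-5.1"))
    print(f"Largest jump over GLM-5.1: {biggest} (+{gain(biggest, 'GLM-5.2', 'GLM-5.1')} points)")
```

On these seven benchmarks GLM-5.2 is ahead of Gemini 3.1 Pro on six (Tool-Decathlon is the exception), ahead of GPT-5.5 on two, and behind Claude Opus 4.8 on all of them.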
timcobb7 hours ago
Can people share their GLM and open model setups in general please? What provider do you use? Why do you trust it with serving full quality? What harness do you use? Why do you trust it not to have malware (most harnesses are TS apps)? I am just trying GLM 5.1 from Nvidia build in open code and would love to hear how you all do it, thanks.
- michimagdesign7 hours ago
 Next to my Claude Pro plan, I have subbed to OpenCode Go. I find the OpenCode UX much better than in Claude Code CLI. As for models, I started a few months ago with GLM 5.1 and it was solid and could achieve near sonnet-level tasks. It weirdly sputtered out Chinese characters sometimes. Then I switched to Kimi K2.6, which is the Chinese model I used the most until now. It used way too many reasoning tokens (improved in k2.7). But it executed Claude-created plans reliably. Now I’m back with GLM 5.2 and it’s really solid (among other things it’s good at design) and I get good usage with the $10 plan. Still the Claude models have fewer hiccups but the Chinese models are getting really close.
- gandreani7 hours ago
 I use both the openai subscription and the opencode go subscription. I use the go subscription for my personal work and the openai subscription for my consulting work. The differences between the models are minimal, but I usually stick with gpt-5.4-mini, gpt-5.4, mimo-pro-2.5, deepseek-v4-pro. These latter ones have way more usage than even using 5.4-mini so I tend to use them in personal projects for that reason. My harness is https://github.com/can1357/oh-my-pi. I trust it...enough. It updates very frequently so as a safeguard I run it sandboxed with https://github.com/containers/bubblewrap so it can only access the project folder and some whitelisted config files
 - timcobb6 hours ago
 Thanks. I was looking at open code go yesterday and I couldn't figure out if the base pricing is including usage or if that's just base pricing and then you have to pay for usage too. How does it work? It is very cheap.
 - arcanemachiner5 hours ago
 OpenCode Go is a smoking deal IMO. You basically get 6x multiplier on the $10 price since you get $60 worth of usage for $10. And the first month is only $5 so it's even better. It goes pretty quick, but it's still a great deal. Highly recommended.
- ukuina3 hours ago
 Synthetic.new and Claude Code using GLM-5.2. Great model, but the harness will error out if using subagents. The base plan only allows one concurrent request at a time. Also, GLM will burn through your weekly quota in a day if you're not precise with your scope.
- smoe7 hours ago
 For work, I mostly use Codex and some Claude. For personal use, I’ve started using Chinese models directly through their respective providers, mostly for automation tasks and experiments so far, either via the API directly or through the Pi harness. I do not trust any of them. Everything runs inside virtual machines, not just the sandboxes provided by the harnesses. I also do not run Claude or Codex directly on the host machine. Not just because of supply chain fears, but also because of how incredibly user hostile the VC funded companies are when it comes to installing random stuff on your machine.
- Fr0styMatt882 hours ago
 Local using Qwen3.6-27B; 2xRTX 5070Ti graphics cards; VS Code with Cline at the moment and Ollama back-end (will get to trying the others soon).
- rainmaking7 hours ago
 GLM 5.2 coding plan- I'll post the agent as soon as I can! But opencode works and their own zcode is really good as well.
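
The sandboxing gandreani describes above (bubblewrap, with only the project folder and a few allow-listed config files visible) could be set up roughly like this. It is a sketch under assumptions, not their actual configuration: the mount layout assumes a merged-/usr Linux distro, the helper name `bwrap_argv` and the `opencode` / `~/.gitconfig` arguments are illustrative, and a real harness will likely need extra read-only binds (TLS certificates, its own install directory).

```python
from pathlib import Path


def bwrap_argv(project_dir, command, ro_configs=()):
    """Build (but do not run) a bwrap command line for an agent harness."""
    project = str(Path(project_dir).resolve())
    argv = [
        "bwrap",
        "--unshare-all",      # fresh namespaces for the sandboxed process
        "--share-net",        # keep networking so the harness can reach its model API
        "--die-with-parent",  # tear the sandbox down if the launcher exits
        "--ro-bind", "/usr", "/usr",  # system binaries and libraries, read-only
        "--symlink", "usr/bin", "/bin",
        "--symlink", "usr/lib", "/lib",
        "--symlink", "usr/lib64", "/lib64",
        "--ro-bind", "/etc/resolv.conf", "/etc/resolv.conf",  # DNS config only
        "--proc", "/proc",
        "--dev", "/dev",
        "--tmpfs", "/tmp",    # private scratch space, not the host /tmp
    ]
    for cfg in ro_configs:    # allow-listed config files, mounted read-only
        path = str(Path(cfg).expanduser())
        argv += ["--ro-bind", path, path]
    # The project folder is the only host path mounted read-write.
    argv += ["--bind", project, project, "--chdir", project]
    return argv + list(command)


if __name__ == "__main__":
    import shlex
    print(shlex.join(bwrap_argv(".", ["opencode"], ro_configs=["~/.gitconfig"])))
```

Printing the command instead of executing it makes it easy to review what the harness will be able to see before running it; smoe's point still stands that a VM gives stronger isolation than any namespace sandbox.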
fraywing6 hours ago
It feels like the gap is closing from an intelligence perspective. Or at least doing some kind of log flattening. Been playing with GLM 5.2 in different contexts. It's less good if you don't max out thinking, but at xhigh it's been able to solve most problems I was throwing at Opus in about the same amount of time (via OpenRouter). Wild time to be alive.
- JSR_FDED2 hours ago
 Anecdote, not “research”: Yesterday I compared Deepseek, Kimi 2.6, MiMo 2.5 and GLM 5.2 for the same task (replace a custom token-based auth scheme with a cookies-based scheme across a front- and back-end codebase). I used Opencode with the zen subscription to try different models. All did this perfectly, basically indistinguishable from each other. However, when I pointed out that the new cookies-based auth didn’t allow multiple independent logins across browser tabs (which the previous scheme did allow) I noticed this: Deepseek, Kimi, MiMo started giving me multiple options but advocating strongly that I should either accept this deficiency, or don’t use the cookies version (keep the old auth scheme). They were so similar it was as if they were all the same model. Only GLM 5.2 said “here’s how to use cookies and also have tab-level separation”. The difference vs the other models was very stark.
neosat6 hours ago
I've been using GLM 5.2 recently (company hosted, for non-coding tasks) and it's been strong and reliable. There are areas where GPT 5.5 and Opus 4.x still feel marginally better but only marginally. For most tasks if GLM 5.2 is the only model I have to use I'm productive and happy. This was not true before GLM 5.2. No doubt in my mind that the gap is closing quickly and for most tasks that are not very specialized open models will be usably on par with flagship closed models and have an edge factoring in cost. For coding I still use 5.5 w/ Codex and prefer that to other models + harness combinations.
themgt8 hours ago
I just tested GLM 5.2 out via Z.ai in pi for a little one-off project that was already scoped. It actually did a relatively decent job starting out, and figured important things out from context. But the reasoning traces became increasingly hilarious, with it getting confused and going in loops, doubting itself. I began to feel almost sad, it was like listening to the internal monologue of someone with anxiety disorder. It made pretty good progress but wound up going in a lot of goofy loops and doing things a bit "off" from standards I'd hoped it would infer, and finally started going a bit nuts, "This is very confusing.", "OH WAIT", seemingly hallucinating a whole side-quest that didn't make sense and looking at making internal system changes to try to achieve its (now very confused) goal when I pulled the plug. Without seeing the reasoning traces from Claude/GPT it's hard to really know, but it definitely didn't feel like the same quality of reasoning, even if dogged persistence does wind up actually working eventually.
- dools6 hours ago
 The reasoning traces always look terrible and they’re frustrating to watch. It’s the same with Kimi. What’s interesting is that the end result is then good. I think it’s just some sort of devils advocate trick to get better output.
 - rufo3 hours ago
 The reasoning tokens are really just there to extend the amount the LLM can "compute" the problem; put another way, the only way a given model can "think" more about a problem is to fill more of its context with predicted tokens, which has the effect of increasing the accuracy of each token. The reinforcement learning these models go through generally doesn't care what the chain of thought tokens look like (outside of preventing loops/gibberish/reward hacking), only how good the final answer is - so while it does look something like "reasoning" to us and has a rough correlation with the final answer, treating it as actually representative of what the final answer will be or an actual thought process is giving those tokens too much credit :)
 - fc417fc8022 hours ago
 For me what really drove this point home (that reasoning traces aren't "real" by any reasonable definition of the term) was noticing instances of things being out of order and exhibiting various inconsistencies with the final answer. My favorite was an example posted to HN that went something along the lines of the model first output the conclusion, then performed the supposed derivation after the fact, then stated it needed to verify the earlier conclusion to verify the derivation was correct so it hallucinated a tool call, then it remarked positively about the verification matching, and finally it output a slightly different answer. At no point was the answer actually correct although it was vaguely in the ballpark.
 - teravor2 hours ago
 as compared to what though? you can't see the actual think traces for opus or gpt.
 - dools1 hour ago
 Compared to what comes out at the end. Like if you sit there watching Kimi k2.6 "think", you're like "what? no you fucking idiot!" and you get this urge to "steer" it and so on, but very rarely is that steering actually necessary, it just winds up popping out the correct answer and all of those 'Wait! That's it! I found it! Actually ... Let me just' is just whatever internal processing it needed to use to get to the correct response. Most likely it's just being self-adversarial and exploring a bunch of dumb avenues to isolate the best outcome with the highest probability
 - try-working3 hours ago
 thinkslop recursion.
- jauntywundrkind8 hours ago
 I think the self-doubt might actually be a very crucial part of its capability. I often feel compelled to interrupt when I'm watching it think (which thank the stars it lets us do, unlike the big American models!!), but usually it makes the right pick! Being willing and able to reconsider seems very good. Going around and around, pulling in more thinking, integrating it: maybe that's why it is as good as it is. I want to emphasize again how excellent it is that we can see the thinking. I think this makes GLM so much better an experience for me. It gives me such insight into what is being considered, helps me see where things go wrong. It grounds me, gives me the notion of where the results come from. It was so jarring to switch to GPT and Opus and find that they won't discuss with me, won't reveal their thinking: that feels fundamentally unsafe, for me, for society, to have such a severe black box. I don't think it should be allowed, honestly. Many thanks to this recent submission, which is the first time I've seen anyone blog about this core difference: The text in Claude Code’s “Extended Thinking” output is not authentic. https://patrickmccanna.net/the-text-in-claude-codes-extended-thinking-output-is-not-authentic/ https://news.ycombinator.com/item?id=48630535
 - wuhhh7 hours ago
 Your post made me laugh because I experienced the same as you but the other way around. I switched from Claude to a multi model harness a couple of days ago and the first model I tried was GLM5.2. I gave it some simple code porting exercises and watched dumbfounded at the reasoning, which was more like the ravings of a lunatic - but lo and behold, after much confusion and a dizzying number of eureka moments the task was completed very successfully. I tried Kimi on a similar task, much faster, a little more reassuring somehow in its ramblings, also surprisingly good results. To be clear, I’m not surprised the results were good because they’re not GPT or Claude, but because the line of reasoning was so bonkers. Coming from Claude, I was just not used to seeing this, but I’ll bet it’s just as nuts with the frontier models and we’re just not allowed to see it (I’m about to read the links you shared). Agree wholeheartedly that transparency is of grave importance.
 - nl6 hours ago
 If you look at the "thinking" traces as expressions of uncertainty rather than literal thinking they make more sense. Consider debugging - you start off in one place, think you have worked out what is happening, and then there is a "oh but what about xxx" thing that happens and you explore another branch. Then you "have it for sure" until you find another edge case. The LLM is doing something analogous. It's writing circuits to try to emulate your program. Each time it gets one that seems right it is very sure that circuit is correct, but then it finds another thing. At any point you can stop and go "write code now" and it will, and the code will seem fine provided it hasn't hit one of these edge cases. Turning up thinking time is literally forcing more exploration. The words that come out are amusingly dramatic, but... TBH when I debug I am often like "WTF" and throwing my hands up in the air at some gotcha I didn't expect.
 - rainmaking7 hours ago
 Yeah, isn't that thinking weird? "Now I see the issue clearly! But wait... now I have the full picture! But wait... Found it!" I gave up a few times because of it at first, until I realized I just had to let GLM get on with it, and what came out was great! But once it was outright endearing: on a challenging bug, it said, "I have been very thorough." Then it escalated where to look and aced it. Built-in Confucian values.
 - wuhhh6 hours ago
 If there’s one thing I’ve learned these past couple of days, it’s to resist the temptation to jab the escape button and start waving my arms! I wonder how much of this cyclical self-doubt / self-congratulation I go through in my own thoughts without even realising it. If you could verbalise or articulate all the half-thoughts, snatches of ideas, feelings and ruminations the human mind goes through on some tasks, it might be even more bizarre (or could just be me).
newaccountman25 hours ago
5.1 and Qwen 3.6 are great too IMO
seany6 hours ago
What's the current best for ablation? Specifically chemistry and red-team/netsec?
- forsalebypwner3 hours ago
  IME DeepSeek v4 Pro is great for cybersec/netsec; I have not tried GLM, though.
yogthos3 hours ago
It's by far the most competent open model I've tried yet. It's a bit slower than Claude, but in terms of coding capability it seems to get comparable results at least for the work I'm doing.
citizenpaul8 hours ago
I've been using GLM5 since its release and still prefer it to GLM5.1 and, so far, to GLM5.2. Perhaps it is just my harness and workflow, but the older model still seems to work better. Also, the token cost is significantly lower. I rarely spend more than $20 a week with a $50 cap. That's not even half of Claude's ambiguous minimum $200-a-month plan.
- rainmaking6 hours ago
 Now that's a tremendous pointer, I'm going to have to try that. Do you full-on let GLM5 get stuff done on its own, or is it more like a guided workflow? The former's what the point releases doubled down on and is also something that uses a lot of juice.
Balinares1 day ago
I can't help wondering what kind of models we'll see coming out of China once it gets its own chip fabs up and running. Right now it sounds like the US's export ban is not slowing them down a whole lot.
- pianopatrick6 hours ago
 There does not seem to be a big penalty for going slow anyways. People seem to just switch on cost as soon as a model can do a task well enough. There do not seem to be strong network effects or vendor lock-in. Seems to me that going slow is the better long-term tactic. China can just let the USA pay the high R&D costs to figure out what works, then just copy what works.
- ceejayoz8 hours ago
 > Right now it sounds like the US's export ban is not slowing them down a whole lot. It may wind up being a massive boost to them in the long run, even. Necessity is the mother of invention.
 - pkroll8 hours ago
 If this pans out, you're not at all kidding: https://www.youtube.com/watch?v=8ekndZwyOzo
 - verdverm4 hours ago
 Trump allowed more advanced chips (H200s) to be sold after his visit, because some people in the admin still believe the US can "addict" China to the hardware. It seems China is only letting a token few in; the ban is more on their side now, as Xi really wants indigenous capability.
- briga6 hours ago
 With subsidization from the Chinese government they will probably be equal to or better than the models here. I mean, have you looked at the author list of any given AI paper published within, say, the past 5 years? I wouldn't be surprised if half or more of AI researchers are from China.
 - buzzin__5 hours ago
 Can you compare the amount to the USA's subsidization? Which one is bigger? Per capita? Per unit of economic growth achieved?
 - usef-4 hours ago
 You mean from the private investors? It seems the labs on both sides of the ocean are quite negative in their profitability right now due to the competitiveness. Though Anthropic claims they will have a profitable quarter this year (despite the huge build-out), so their margins on API costs are likely quite decent.
dools6 hours ago
Is z.aiIs 2 better than x.ai