> We'll assume a 32B dense model, as they've gotten quite good for production use and a B200 can comfortably serve them. This could be a Gemma, Qwen, DeepSeek, whatever.<p>That seems like a very consequential point to include halfway through the post. They aren't wrong that Qwen 3.6 26B or Gemma 4 31B are quite good, depending on the use case, but if we're doing napkin math, I'd want some more headroom in the assumptions.<p>They really ought to have Qwen parameterize their post's calculations and add sliders so a reader could play around with the values.<p>Edit: And since they specifically mentioned DeepSeek (or whatever), as far as I know, none of their current generation of models is a dense model, and even the smallest of the mixture of experts (MoE) models is 284B parameters (13B activated). That will completely incinerate their napkin.
Yes, 32B dense is a weird one to choose.<p>But in practice, 32B dense is very similar* to an MoE with 32B activated parameters in terms of inference cost. And I highly suspect e.g. Opus is around that level of active params.<p>A 284ba13b model at scale is almost certainly cheaper to serve than a 32b dense model.<p>*Because you can shard the model across multiple GPUs at scale, though you do lose some efficiency to GPU coordination and expert routing.
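To put rough numbers on the dense-vs-MoE comparison, here is a minimal sketch. It assumes the common rule of thumb of about 2 FLOPs per active parameter per generated token, and 1 byte per parameter (8-bit weights) for memory. Both are my assumptions, not figures from the post, and the sketch ignores attention/KV-cache cost, expert routing, and inter-GPU communication. The function names are made up for illustration.

```python
# Napkin comparison of a 32B dense model and a 284B MoE with 13B active params.
FLOPS_PER_ACTIVE_PARAM = 2   # assumption: rule-of-thumb forward-pass cost per token
BYTES_PER_PARAM = 1          # assumption: 8-bit weights


def flops_per_token(active_params: float) -> float:
    """Approximate compute per generated token; scales with ACTIVE parameters."""
    return FLOPS_PER_ACTIVE_PARAM * active_params


def weight_gb(total_params: float) -> float:
    """Approximate weight memory in GB; scales with TOTAL parameters."""
    return total_params * BYTES_PER_PARAM / 1e9


# name -> (total parameters, active parameters per token)
models = {
    "32B dense": (32e9, 32e9),
    "284B MoE (13B active)": (284e9, 13e9),
}

for name, (total, active) in models.items():
    gflops = flops_per_token(active) / 1e9
    print(f"{name}: ~{gflops:.0f} GFLOPs/token, ~{weight_gb(total):.0f} GB of weights")

compute_ratio = flops_per_token(13e9) / flops_per_token(32e9)
memory_ratio = weight_gb(284e9) / weight_gb(32e9)
print(f"MoE vs dense: {compute_ratio:.2f}x compute, {memory_ratio:.1f}x weight memory")
```

Under these assumptions the MoE needs roughly 0.4x the compute per token but roughly 9x the weight memory, which is why the "cheaper to serve" claim only holds once the weights are sharded across many GPUs.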
>This largely depends on whether you own or rent your hardware. At $40,000 per B200, your lifetime cost per user is 40_000/num_users.
In the 100% duty cycle case (worst for cost), that's $6k per user. Realistically, serving 300 users per GPU you'll spend a lifetime cost of about $133 per user, plus the <i>datacenter/upkeep bill</i>.
If you rent the GPU, the cost is more straightforward. At an hourly rate of $4, your hourly cost per user is 4/num_users. For num_users=300 you get an hourly rate of about $0.013 per user, or $9.36 per month.<p>This leads me to believe you can buy a GPU but leave it at a data center?<p>Do people do this? I don't understand. Or are you equating the upkeep bill to electricity on premises?
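For anyone who wants to replay the quoted own-vs-rent arithmetic, here is a minimal sketch. It assumes the $4/hour rental rate implied by the quoted 4/num_users formula and a 30-day month. The function names are hypothetical, and the break-even figure ignores the datacenter/upkeep bill entirely.

```python
# Own-vs-rent napkin math from the quoted passage.
GPU_PRICE_USD = 40_000       # purchase price per B200, from the quoted post
RENT_USD_PER_HOUR = 4.0      # assumption: implied by the quoted "4/num_users"
HOURS_PER_MONTH = 24 * 30    # assumption: 30-day month


def owned_cost_per_user(num_users: int) -> float:
    """Lifetime hardware cost per user when buying the GPU outright."""
    return GPU_PRICE_USD / num_users


def rented_cost_per_user_hour(num_users: int) -> float:
    """Hourly rental cost per user."""
    return RENT_USD_PER_HOUR / num_users


def rented_cost_per_user_month(num_users: int) -> float:
    """Monthly rental cost per user, renting around the clock."""
    return rented_cost_per_user_hour(num_users) * HOURS_PER_MONTH


def breakeven_rental_hours() -> float:
    """Hours of rental that add up to the purchase price (upkeep ignored)."""
    return GPU_PRICE_USD / RENT_USD_PER_HOUR


users = 300
print(f"owned:  ${owned_cost_per_user(users):.2f} per user, lifetime")
print(f"rented: ${rented_cost_per_user_hour(users):.4f} per user per hour")
print(f"rented: ${rented_cost_per_user_month(users):.2f} per user per month")
print(f"break-even after {breakeven_rental_hours():.0f} rental hours")
```

Without intermediate rounding the monthly rental figure comes out at about $9.60 per user; the quoted $9.36 results from rounding the hourly cost to $0.013 first.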
> Realistically, serving 300 users per GPU you'll spend a lifetime cost of about $133 per user, plus the datacenter/upkeep bill.<p>What is the operational cost and when does it become more expensive than the upfront capex?<p>The B200 tops out at 1000W and idles around 140W. It averages around 600W. <a href="https://www.lightly.ai/blog/nvidia-b200-vs-h100">https://www.lightly.ai/blog/nvidia-b200-vs-h100</a>
U.S. average electricity cost was $0.14 per kWh in March. <a href="https://www.eia.gov/electricity/monthly/epm_table_grapher.php?t=epmt_5_6_a" rel="nofollow">https://www.eia.gov/electricity/monthly/epm_table_grapher.ph...</a><p>600/1000 * 0.14 = $0.084 per hour.
$2.01 per day.
$60.30 per month.
With 300 users, $0.20 per user per month. Seems fairly cheap for the electricity.<p>Does anyone know how to estimate colo/data center rent costs?
Where did I screw up my estimates?
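The electricity estimate above can be replayed with a short sketch. It uses the figures from the comment (600 W average draw, $0.14 per kWh, 300 users) and the $40,000 purchase price from the quoted post, and it assumes a 30-day month. It counts the GPU's own power draw only: cooling, the host machine, and colo rent are not included, so treat the capex comparison as a lower bound on running costs.

```python
# GPU-only electricity cost, using the figures from the comment above.
AVG_POWER_W = 600        # average B200 draw, from the comment
USD_PER_KWH = 0.14       # U.S. average price, from the comment
NUM_USERS = 300
GPU_PRICE_USD = 40_000   # purchase price, from the quoted post
DAYS_PER_MONTH = 30      # assumption: 30-day month


def electricity_usd_per_hour(avg_power_w: float, usd_per_kwh: float) -> float:
    """Convert an average draw in watts to an hourly cost in dollars."""
    return avg_power_w / 1000 * usd_per_kwh


per_hour = electricity_usd_per_hour(AVG_POWER_W, USD_PER_KWH)
per_day = per_hour * 24
per_month = per_day * DAYS_PER_MONTH
per_user_month = per_month / NUM_USERS

# Months of GPU-only electricity needed to match the upfront hardware price.
months_to_match_capex = GPU_PRICE_USD / per_month

print(f"${per_hour:.3f} per hour")
print(f"${per_day:.2f} per day")
print(f"${per_month:.2f} per month")
print(f"${per_user_month:.2f} per user per month")
print(f"{months_to_match_capex:.0f} months of electricity to match the GPU price")
```

The small gap to the $60.30 per month above comes from rounding the daily cost before multiplying. On these numbers, GPU-only electricity would take several decades to match the purchase price, so any real crossover has to come from the costs left out here.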
I'd like to see a bit of the running costs inside the napkin math. Power, cooling, maintenance, rent, etc. are probably significant factors as well.
> 2B = 562 => B = 331<p>what kind of math is this? why isn't it B = 562 / 2 = 281?