Launch HN: IonRouter (YC W26) – High-throughput, low-cost inference

(ionrouter.io)

35 points by vshah1016 4 hours ago

10 comments

GodelNumbering 3 hours ago
As an inference-hungry human, I am obviously hooked. Quick feedback:
1. The models/pricing page should perhaps be linked from the top, as that is the most interesting part for most users. You have mentioned some impressive numbers (e.g. GLM5 ~220 tok/s, $1.20 in · $3.50 out), but those are way down the page and many would miss them.
2. When looking for inference, I always look at three things: which models are supported, at which quantization, and what the cached input pricing is (this is way more important than headline pricing for agentic loops). You have the info about the first on the site but not the second and third. Would definitely like to know!
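To see why cached input pricing matters for agentic loops, here is a minimal Python sketch, assuming each turn resends the full conversation history and that tokens already sent or generated in an earlier turn are billed at a cached rate. The $1.20 in / $3.50 out per million tokens are the GLM5 prices quoted above; the $0.30 cached rate and all token counts are hypothetical, since no cached pricing is given in the thread, and real providers' prefix-caching rules (minimum prefix length, cache lifetime) vary.

```python
from typing import Optional


def agent_loop_cost(turns: int, first_prompt: int, new_per_turn: int,
                    out_per_turn: int, price_in: float, price_out: float,
                    price_cached: Optional[float] = None) -> float:
    """USD cost of an agent loop in which every turn resends the whole history.

    Prices are USD per million tokens. If price_cached is None, every prompt
    token is billed at price_in; otherwise tokens from earlier turns are billed
    at price_cached (a simplified, hypothetical prefix-cache model).
    """
    history_rate = price_in if price_cached is None else price_cached
    total = 0.0
    history = 0  # tokens from earlier turns, resent as a prefix each turn
    for turn in range(turns):
        fresh = first_prompt if turn == 0 else new_per_turn
        total += history * history_rate + fresh * price_in + out_per_turn * price_out
        history += fresh + out_per_turn
    return total / 1_000_000


if __name__ == "__main__":
    # GLM5 prices from the thread; the 0.30 cached rate and token counts are hypothetical.
    args = dict(turns=20, first_prompt=8_000, new_per_turn=1_000,
                out_per_turn=500, price_in=1.20, price_out=3.50)
    print(f"no cache:   ${agent_loop_cost(**args):.4f}")
    print(f"with cache: ${agent_loop_cost(**args, price_cached=0.30):.4f}")
```

With these made-up numbers the uncached loop costs about $0.57 and the cached one about $0.19, because the resent history dominates the input tokens by the later turns.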
- 2uryaa 2 hours ago
 Thank you for the feedback! I think we will definitely reorganize the info on the front page and show quantizations better. For reference, Kimi and MiniMax are NVFP4; the rest are FP8. But I will make this more obvious on the site itself.
Frannky 46 minutes ago
I have no idea how big the demand for fine-tuned models is. Is it big? Are people actively looking for endpoints for fine-tuned models? Why? Mostly out of curiosity; I personally never had the need.
What I want from an LLM is smart, super cheap, fast, and private. I wonder if we will ever get there. Like having a cheap Cerebras machine at home with OSS 400B models on it.
Oras 3 hours ago
The problem is well articulated, and it’s a nice story for both cofounders.
One thing I don’t get is why anyone would use a direct service that does the same thing as others when there are services such as OpenRouter where you can use the same model from different providers. I would understand if your landing page mentioned only fine-tuning and custom models, but just listing the same open-source models, TPS, and pricing doesn’t tell me how you’re different from other providers.
I remember using banana.dev a few years ago, and it had a very clear proposition at the time (serverless GPUs with fast cold starts).
I suppose positioning will take multiple iterations before you land on the right one. Good luck!
- 2uryaa 2 hours ago
 Hey Oras, thank you for the feedback! We could definitely list on OpenRouter, but as you point out, our end goal is to host fine-tuned models for individuals. The IonRouter product is mostly a showcase for our engine. In the backend, we are multiplexing fine-tuned and open-source models on a homogeneous fleet of GPUs. So if you see better or even similar performance on our cloud, we're already proving what we set out to show.
 I do think we will lean harder into hosting fine-tuned models, though; this is a good insight.
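One common way to multiplex many fine-tunes on shared GPUs is to keep a single base model resident and swap small LoRA adapters in and out, so that only a cache miss pays a load cost. The Python sketch below is a hypothetical illustration of that idea with a least-recently-used eviction policy; it is not IonRouter's engine, and `AdapterCache`, `slots`, and `acquire` are made-up names.

```python
from collections import OrderedDict


class AdapterCache:
    """Hypothetical LRU cache of fine-tuned adapters resident on one GPU."""

    def __init__(self, slots: int):
        if slots < 1:
            raise ValueError("slots must be at least 1")
        self.slots = slots              # how many adapters fit beside the base model
        self.resident = OrderedDict()   # adapter_id -> None, oldest first
        self.cold_loads = 0             # misses that required loading an adapter

    def acquire(self, adapter_id: str) -> bool:
        """Make adapter_id resident; return True if it had to be loaded (a miss)."""
        if adapter_id in self.resident:
            self.resident.move_to_end(adapter_id)  # mark as most recently used
            return False
        if len(self.resident) >= self.slots:
            self.resident.popitem(last=False)      # evict the least recently used
        self.resident[adapter_id] = None
        self.cold_loads += 1
        return True


if __name__ == "__main__":
    cache = AdapterCache(slots=2)
    for request in ["a", "b", "a", "c", "b"]:
        print(request, "load" if cache.acquire(request) else "hit")
```

With two slots, the request sequence a, b, a, c, b triggers four loads: the repeated a is a hit, c evicts b, and the returning b then evicts a.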
reactordev 3 hours ago
“Pricing is per token, no idle costs: GPT-OSS-120B is $0.02 in / $0.095 out, Qwen3.5-122B is $0.20 in / $1.60 out. Full model list and pricing at https://ionrouter.io.”
Man, you had me panicking there for a second. Per token?!? Turns out it’s per million tokens, according to their site.
Cool concept. I used to run a Fortune 500 company’s cloud, and hot, ready GPU instances were the biggest ask. We weren’t ready for that cost-wise, so we would only spin them up when absolutely necessary.
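Reading the prices as per token instead of per million tokens is a factor-of-a-million mistake. A minimal Python sketch using the two prices quoted above, assuming they are USD per million tokens as the site states; the request sizes are hypothetical.

```python
# USD per million tokens (input, output), as quoted in the thread.
PRICES_PER_MILLION = {
    "GPT-OSS-120B": (0.02, 0.095),
    "Qwen3.5-122B": (0.20, 1.60),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request under per-million-token pricing."""
    price_in, price_out = PRICES_PER_MILLION[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 10,000 input tokens and 2,000 output tokens.
    print(f"${request_cost('Qwen3.5-122B', 10_000, 2_000):.4f}")
```

That request comes to $0.0052; read as per-token prices, the same request would have cost $5,200.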
- 2uryaa 1 hour ago
 Haha sorry for the typo! Your F500 use case is exactly who we want to target, especially as they start serving finetunes on their own data. Thanks for the feedback!
nylonstrung 3 hours ago
Unless I misunderstood, it seems like this is trailing the Pareto frontier in cost and speed.
Compare to providers like Fireworks: even with the OpenRouter 5% charge, it's not competitive.
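"Trailing the Pareto frontier" means some other offer is at least as cheap and at least as fast, and strictly better on one of the two. A minimal Python sketch of that check; the provider names and numbers are hypothetical placeholders, and collapsing input and output prices into a single blended price per million tokens is a simplifying assumption.

```python
from typing import NamedTuple


class Offer(NamedTuple):
    name: str
    usd_per_million: float    # blended price; lower is better
    tokens_per_second: float  # throughput; higher is better


def dominates(a: Offer, b: Offer) -> bool:
    """True if a is no worse than b on both axes and strictly better on at least one."""
    no_worse = (a.usd_per_million <= b.usd_per_million
                and a.tokens_per_second >= b.tokens_per_second)
    strictly_better = (a.usd_per_million < b.usd_per_million
                       or a.tokens_per_second > b.tokens_per_second)
    return no_worse and strictly_better


def pareto_frontier(offers: list) -> list:
    """Keep only the offers that no other offer dominates."""
    return [o for o in offers if not any(dominates(other, o) for other in offers)]


if __name__ == "__main__":
    offers = [
        Offer("A", 1.00, 100.0),  # hypothetical
        Offer("B", 2.00, 200.0),  # hypothetical
        Offer("C", 2.50, 150.0),  # hypothetical
    ]
    print([o.name for o in pareto_frontier(offers)])
```

In this example C is dominated by B, which is both cheaper and faster, so only A and B sit on the frontier.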
- 2uryaa 2 hours ago
 Our SLA is actually higher and we are lower priced. We are also using this as a step toward serving fine-tuned models much more cheaply than Fireworks/Together, without the horrible cold starts of Modal. We're essentially trying to prove that our engine can hang with the best providers while multiplexing models.
cmrdporcupine 2 hours ago
Very cool. I see that "Deploy your finetunes, custom LoRAs, or any open-source model on our fleet." is "Book a call". Any sense of what pricing will actually look like here? This seems like where your approach wins out: the ability to swap in a custom model more easily and cheaply.
Just curious how close we are to a world where I can fine-tune for my (low-volume calls) domain and then get it hosted. Right now this is not practical anywhere I've seen, at the volumes I would be doing it at (which are really hobby level).
- 2uryaa 1 hour ago
 We usually charge by the GPU hour for those fine-tunes, around $8-10 depending on GPU type and volume! This is similar to Modal, but since the engine is fully ours, you don't wait ~1 min for cold starts. Ideally, we will make onboarding frictionless and self-serve, but we are onboarding people manually for now.
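Hourly GPU billing only competes with per-token pricing when the GPU stays busy. A minimal Python sketch converting an hourly rate into an effective price per million tokens; the $9/hour value sits inside the $8-10 range quoted above, while the throughput figures are hypothetical and assume the GPU produces tokens at that average rate for the whole billed hour.

```python
def effective_usd_per_million(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """Effective USD per million tokens for a GPU billed hourly at a given average throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical average throughputs: a lightly used hobby deployment vs. a busy one.
    for tps in (50, 2000):
        print(tps, round(effective_usd_per_million(9.0, tps), 2))
```

At 50 tokens/s the GPU hour works out to about $50 per million tokens; at 2,000 tokens/s it is about $1.25, which is why hosting a single low-volume fine-tune is hard to make cheap.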
erichocean 3 hours ago
> what would make this actually useful for you?
A privacy policy that's at least as good as Vertex AI at Google.
Otherwise it's a non-starter at any price.
- 2uryaa 1 hour ago
 Also curious about this. We have a 30-day content retention policy and need access to your fine-tuned model/LoRA if you deploy one. If there's anything we should change, we're happy to hear it.
- Oras 2 hours ago
 What's unique about Vertex's privacy policy?