60% Fable cost cut by converting code to images and having the model OCR it

(github.com)

83 points by dimitropoulos3 hours ago

12 comments

aabhay1 hour ago
In Gemini at least, if you look at how they process PDFs, they do an OCR and then feed the text + image to the model, without charging you for the text tokens (I believe). So my guess is that Claude’s backend is doing the same, so this hack is probably more of a loophole in token accounting that might get closed if Claude is doing what Gemini does.
- hn_throwaway_994 minutes ago
 This is really fascinating to me. I was reading this article and originally agreed with you, "I mean, under the covers it's got to be converting to text tokens at some point, so there is no way it's actually cheaper for Claude itself to execute." But then there is a comment below talking about how DeepSeek was able to get a huge improvement in compression by using visual tokens (https://news.ycombinator.com/item?id=48777848). I don't fully understand all of the underlying technical details so I am still fundamentally baffled about how going the OCR route could actually result in overall electricity/computational savings.
lpellis1 hour ago
I tried the same thing last year (with OpenAI models). Back then it worked to reduce prompt tokens, but you needed way more completion tokens, so it was ultimately more expensive (and slower): https://pagewatch.ai/blog/post/llm-text-as-image-tokens/
aabhay2 hours ago
Ahhh my eyes the vibe coded readme
- mpalmer1 hour ago
  What, you don't like your caveats to be honest?
genxy2 hours ago
This seems like a pricing hack that burns resources. When the loophole gets closed, won't the price of OCR have to rise?
- ricardobeat2 hours ago
 It’s not a loophole, it just happens that encoding information as optical tokens is much more efficient than text.
 - geor9e57 minutes ago
 Step back and think about it another way. Ask which scenario is more likely: (1) Some random person discovered a 60% across the board gain in all LLMs, using an extremely simple trick that none of the labs noticed in all these years. That trick being to rasterize 8-bit characters into 8x8 pixels in a big image. 60% in a market worth trillions of dollars. Or (2) Anthropic's marketing team arbitrarily prices tokens to drive growth, according to vibes and feelings, and didn't think they needed to price images on par with text in their rush to burn cash and drive growth. Some folks take advantage of the trick during the first few days of the model's availability before Anthropic corrects their pricing to align more proportionally with actual compute costs.
 - calebkaiser30 minutes ago
 Nah, optical compression is a thing. You see it in a lot of different areas in ML. In this case, the "trick" has been known for a while, and belongs to a whole world of compression research. But I think where you're maybe getting mixed up is in where that 60% gain is coming from. It's not a 60% reduction in cost for 100% of the same output. If you have a model and input text A, and you fix the seed etc. and run text A through the model as text tokens and as compressed image tokens, you will not get identical outputs. You're specifically reducing the number of tensors needed to represent your input, which saves you on raw compute, but also by definition gives you less room to represent the information in your input. It's lossy, in other words. Put another way, if you're using a model like Fable because you need the absolute frontier of capability and cheaper models cannot solve your tasks, then there is a very real chance that a compression strategy like this drops Fable's accuracy such that it's no longer suitable for your task. Which defeats the point of you paying for the most expensive model in the first place. So, it's cool research. Might be useful for some people. Probably isn't something that has incredible utility in real use cases.
 - jug21 minutes ago
 Alternative 1 isn’t all that unlikely given Opus 4.8 couldn’t do this. So it’s a recently possible hack, not something LLM corps have been blindsided by for years. I also strongly recommend RTFA in this case, namely “The honest part, read before relying on it”.
 - stevenhuang5 minutes ago
 This has been known since VLMs were a thing, that more information can be encoded visually and token efficiency is increased. But it came with performance issues (more hallucinations, etc). Also I don't think you realize how much dumb stuff is still left on the table. That the market is worth trillions is quite irrelevant here given the dynamism of the field.
 - vineyardmike48 minutes ago
 > Some random person discovered a 60% across the board gain in all LLMs, using an extremely simple trick that none of the labs noticed in all these years of multi-trillion dollar growth
 DeepSeek published a pretty well circulated paper on exactly this many months ago. It just hasn’t been attempted and shared publicly, as a retrofit, AFAIK. Also, it’s no free lunch; the readme indicates that this “use images” hack is lossy and reduces success rates alongside the reduced cost. Most labs would focus on success increases regardless of price.
 - geor9e33 minutes ago
 If the trick were genuinely useful, and was well circulated months ago, the resource-starved inference providers would have squeezed this trick dry already, instead of wasting 60% of their tokens, waiting for users to implement it themselves in 5 minutes of effort.
 - Aurornis28 minutes ago
 I think you missed the part where this is a lossy technique that reduces performance. The image trick reduces context because it’s lossy. The README says you can’t use it for anything needing exact recall. It produces a gist of the input. You could achieve something similar by using a small, cheap model to pre-summarize information for the expensive LLM. This is what many people do already and it’s a much better way to do it for most situations.
 - guardiangod1 hour ago
 Truly a picture is worth a thousand words.
 - TZubiri1 hour ago
 Of course it isn't. A text encoding uses 8 bits per character on average, and tokenization compresses that further. An image font would be 25 bits per character if 5x5, and most fonts are 12 pixels high. Of course it isn't efficient; this is a pricing inefficiency and a hack to exploit it (even the author describes it as an exploit).
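 For anyone who wants to check the arithmetic in the comment above, here is a minimal sketch. It only counts raw storage bits for a glyph cell, assuming monochrome glyphs (1 bit per pixel) in a square cell; the 12x12 cell is an assumption derived from "12 pixels high". It says nothing about how a model actually tokenizes text or images.

```python
# Raw bits needed to store one character, following the comment's framing.
# Assumption: glyphs are monochrome (1 bit per pixel) in a square cell.

TEXT_BITS_PER_CHAR = 8  # one byte per character, as stated in the comment


def glyph_bits(cell_side: int, bits_per_pixel: int = 1) -> int:
    """Bits in one square glyph cell with the given side length in pixels."""
    return cell_side * cell_side * bits_per_pixel


if __name__ == "__main__":
    for side in (5, 12):
        bits = glyph_bits(side)
        ratio = bits / TEXT_BITS_PER_CHAR
        print(f"{side}x{side} glyph: {bits} bits, {ratio:.1f}x one text byte")
```

 By this raw-bit measure a bitmap is always larger than the byte it depicts, which is the comment's point; the replies argue that raw bits are the wrong unit of comparison.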
 - legel1 hour ago
 You are wrong. Text tokens are high-dimensional vectors, not 8 bits per character. Every token has a deep embedding, e.g. 1024 float values per text token. DeepSeek-OCR proved 10x+ compression from visual embedding of text, which was a groundbreaking result. [1] Very cool to see OP's project hacking on this principle. It's still not lossless, as noted in the GitHub repo, but it is a promising research direction.
 [1] https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf
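 A rough sketch of the comparison this comment is making, counting the vectors the model processes rather than the bytes of the source text. The 1024-dimension figure and the 10x ratio come from the comment; the 32-bit float width, the 1000-token input, and the idea that vision tokens use the same embedding width are illustrative assumptions, not facts about any particular model.

```python
import math

# Assumptions for illustration only: 1024-dim embeddings (the comment's
# example), 32-bit floats, and vision tokens of the same width as text tokens.
EMBED_DIM = 1024
BITS_PER_FLOAT = 32


def sequence_bits(num_tokens: int) -> int:
    """Total bits in the embedded input sequence (one vector per token)."""
    return num_tokens * EMBED_DIM * BITS_PER_FLOAT


def vision_tokens_needed(text_tokens: int, compression: float) -> int:
    """Vision tokens needed if each one stands in for `compression` text tokens."""
    return math.ceil(text_tokens / compression)


if __name__ == "__main__":
    text_tokens = 1000  # hypothetical input length
    vision_tokens = vision_tokens_needed(text_tokens, compression=10)
    print("text sequence bits:  ", sequence_bits(text_tokens))
    print("vision sequence bits:", sequence_bits(vision_tokens))
```

 Under these assumptions the saving comes entirely from the shorter sequence, not from smaller vectors, and nothing here measures how much information survives the compression.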
 - deburo50 minutes ago
 A token is probably not a single char, and an image is probably decomposed into tokens as well (and god knows how many tokens an image is decomposed into) which probably map to similar float-hungry vectors. Your counterargument could use a bit more flesh. And we're talking about images of texts, not images that represent complex imagery such as a very detailed scene or what have you.
 - netsharc46 minutes ago
 Huh, what if the image encoding used the 8 bits of each of a pixel's R, G, and B values? Then one could encode the same amount of text in far fewer pixels (3 letters would need 1 pixel instead of three 12x12 glyphs). The top line can be the OCR-able instruction on how to decode the rest of the image, and the rest of the image would be a random-looking colourful palette. It might not even need to use 8 bits per character, since ASCII is 7 bits/character.
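 The packing idea in this comment can be sketched in a few lines. This is a toy illustration only: the function names are made up, it handles ASCII input, it pads with NUL bytes, and it skips the OCR-able instruction line. Whether a vision model could decode such an image is untested here.

```python
def pack_ascii_to_pixels(text: str) -> list[tuple[int, ...]]:
    """Pack ASCII text into RGB pixels, three characters per pixel."""
    data = text.encode("ascii")  # raises UnicodeEncodeError on non-ASCII input
    padded = data + b"\x00" * (-len(data) % 3)  # pad to a multiple of 3 bytes
    return [tuple(padded[i:i + 3]) for i in range(0, len(padded), 3)]


def unpack_pixels_to_ascii(pixels: list[tuple[int, ...]]) -> str:
    """Reverse the packing; trailing NUL padding is dropped."""
    raw = bytes(channel for pixel in pixels for channel in pixel)
    return raw.rstrip(b"\x00").decode("ascii")


if __name__ == "__main__":
    sample = "def add(a, b): return a + b"
    pixels = pack_ascii_to_pixels(sample)
    rendered = len(sample) * 12 * 12  # pixels if each char were a 12x12 glyph
    print(f"packed: {len(pixels)} pixels, rendered glyphs: {rendered} pixels")
    assert unpack_pixels_to_ascii(pixels) == sample
```

 The round trip is exact only if every pixel value survives untouched, so any resizing or lossy image compression on the way to the model would corrupt the text.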
 - vineyardmike1 hour ago
 [dead]
- samrus1 hour ago
 Not really. They aren't actually using more resources this way either. This might be a fundamental inefficiency that's being removed. It kinda makes sense too, because while people do read code word by word, we often "glance over" it and do rough pattern recognition to know what it does, only homing in on something when we need to answer a specific question. I think humans kinda naturally do this exploit anyway.
himata411356 minutes ago
Related: https://blog.can.ac/2026/06/10/snapcompact/
yogthos8 minutes ago
Isn't this basically what DeepSeek came up with? https://github.com/deepseek-ai/DeepSeek-OCR
dimitropoulos2 hours ago
there's also a DeepSeek whitepaper on this technique: https://www.seangoedecke.com/text-tokens-as-image-tokens
__hugues58 minutes ago
seems really dumb and like it would need to violate basic information theory to work? input tokens are cheaper than output tokens. seems like it would maybe reduce input tokens at the expense of many more output tokens if you're actually triggering OCR via thinking?
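The trade-off raised here is easy to put into numbers. The sketch below assumes a simple linear pricing model; the prices and token counts are hypothetical placeholders, not any provider's actual rates.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in: float, price_out: float) -> float:
    """Cost of one request under linear per-token pricing."""
    return input_tokens * price_in + output_tokens * price_out


def break_even_extra_output(saved_input_tokens: int,
                            price_in: float, price_out: float) -> float:
    """Extra output tokens at which the input savings are fully cancelled."""
    return saved_input_tokens * price_in / price_out


if __name__ == "__main__":
    # Hypothetical prices in arbitrary units; output is 5x the input price.
    price_in, price_out = 1.0, 5.0
    baseline = request_cost(10_000, 500, price_in, price_out)
    # Image route: 60% fewer input tokens, but 1,500 extra "thinking" tokens.
    image_route = request_cost(4_000, 500 + 1_500, price_in, price_out)
    print("baseline:   ", baseline)
    print("image route:", image_route)
    print("break-even extra output tokens:",
          break_even_extra_output(6_000, price_in, price_out))
```

With these placeholder numbers the image route ends up more expensive, because the extra output tokens exceed the break-even point; with fewer extra output tokens it would come out cheaper.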
dippogriff1 hour ago
I want to see more text-free foundation models
puppycodes1 hour ago
That is hilarious and an amazing find.
AIorNot13 minutes ago
I can't get past that intense LLM slop text in the GitHub repo