> Opus 4.7 tokenizer used 1.46x the number of tokens as Opus 4.6<p>Interesting. Unfortunately Anthropic doesn't actually share their tokenizer, but my educated guess is that they might have made the tokenizer more semantically aware to make the model perform better. What do I mean by that? Let me give you an example. (This isn't necessarily what they did exactly; just illustrating the idea.)<p>Let's take the gpt-oss-120b tokenizer as an example. Here's how a few pieces of text tokenize (I use "|" here to separate tokens):<p><pre><code> Kill -> [70074]
Killed -> [192794]
kill -> [25752]
k|illed -> [74, 7905]
<space>kill -> [15874]
<space>killed -> [17372]
</code></pre>
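The fragmentation above can be reproduced with a toy greedy longest-match tokenizer. The vocabulary IDs are copied from the example (the "ed" entry is made up), and the matching rule is deliberately simplified: real BPE tokenizers apply learned merge rules instead of longest-match, which is why gpt-oss actually splits "killed" as k|illed rather than kill|ed as this toy does.

```python
# Toy greedy longest-match tokenizer illustrating the fragmentation
# above. IDs for "Kill"/"kill"/etc. are from the example; the "ed"
# entry (id 295) is made up for illustration. Real BPE differs: it
# applies learned merge rules, not longest-match.

VOCAB = {
    "Kill": 70074,
    "Killed": 192794,
    "kill": 25752,
    "k": 74,
    "illed": 7905,
    "ed": 295,       # made-up ID
    " kill": 15874,
    " killed": 17372,
}

def tokenize(text: str) -> list[int]:
    """At each position, consume the longest matching vocabulary entry."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # longest candidate first
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i:]!r}")
    return ids

# Same word, three unrelated IDs depending on case and leading space:
print(tokenize("Kill"), tokenize("kill"), tokenize(" kill"))
# [70074] [25752] [15874]
```

Nothing in the vocabulary tells the model that 70074, 25752, and 15874 are the same word; that relationship has to be learned from data.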
You have three different tokens encoding the same word (Kill, kill, <space>kill) depending on capitalization and whether it's preceded by a space; there are separate tokens for the past tense, and so on.<p>This is <i>not</i> necessarily an ideal way of encoding text, because the model must learn by brute force that these tokens are, indeed, related. Now imagine encoding these like this:<p><pre><code> <capitalize>|kill
<capitalize>|kill|ed
kill
kill|ed
<space>|kill
<space>|kill|ed
</code></pre>
Notice that this makes much more sense: the model only has to learn what "<capitalize>" is, what "kill" is, what "<space>" is, and what "ed" (the past-tense suffix) is, and it can compose those together. The downside is that it increases token usage.<p>So I wouldn't be surprised if this is what they did. Or, guess number two: they removed the tokenizer altogether, replaced it with a small trained model (something like the Byte Latent Transformer), and simply "emulate" the token counts.
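The morpheme-style scheme above can be sketched as a toy normalizer. This is purely hypothetical (we don't know Anthropic's tokenizer), and real morphological splitting would also need rules for irregular verbs, punctuation, and so on:

```python
# Hypothetical morpheme-style tokenizer matching the scheme above.
# Toy rules only: one optional leading space, one optional capital,
# one optional "ed" suffix.

def tokenize_morph(word: str) -> list[str]:
    """Split a word into <space>/<capitalize> markers, a lowercase
    stem, and an 'ed' suffix piece."""
    pieces = []
    if word.startswith(" "):
        pieces.append("<space>")
        word = word[1:]
    if word[:1].isupper():
        pieces.append("<capitalize>")
        word = word.lower()
    if word.endswith("ed"):
        pieces += [word[:-2], "ed"]
    else:
        pieces.append(word)
    return pieces

for w in ["Kill", "Killed", "kill", "killed", " kill", " killed"]:
    print(w, "->", "|".join(tokenize_morph(w)))
```

Every form now decomposes into at most three shared pieces, but words that were single tokens before become two or three tokens, which is the kind of token-count inflation quoted at the top.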