From the blog:<p><a href="https://arxiv.org/abs/2501.00663" rel="nofollow">https://arxiv.org/abs/2501.00663</a><p><a href="https://arxiv.org/pdf/2504.13173" rel="nofollow">https://arxiv.org/pdf/2504.13173</a><p>Is there any other company that's openly publishing their research on AI at this level? Google should get a lot of credit for this.
DeepSeek and other Chinese companies. Not only do they publish research, they also put their resources where their mouth (research) is. They actually use it and prove it through their open models.<p>Most research coming out of big US labs is counter indicative of practical performance. If it worked (too) well in practice, it wouldn't have been published.<p>Some examples from DeepSeek:<p><a href="https://arxiv.org/abs/2405.04434" rel="nofollow">https://arxiv.org/abs/2405.04434</a><p><a href="https://arxiv.org/abs/2502.11089" rel="nofollow">https://arxiv.org/abs/2502.11089</a>
Well it's cool that they released a paper, but at this point it's been 11 months and you can't download code or weights for a Titans-architecture model anywhere. That would put a lot of companies ahead of them (Meta's Llama, Qwen, DeepSeek).
Closest you can get is an unofficial implementation of the paper <a href="https://github.com/lucidrains/titans-pytorch" rel="nofollow">https://github.com/lucidrains/titans-pytorch</a>
The hardest part about making a new architecture is that even if it is just better than transformers in every way, it’s very difficult to both prove a significant improvement at scale and gain traction. Until google puts in a lot of resources into training a scaled up version of this architecture, I believe there’s plenty of low hanging fruit with improving existing architectures such that it’ll always take the back seat.
<i>Until google puts in a lot of resources into training a scaled up version of this architecture</i><p>If Google is not willing to scale it up, then why would anyone else?
Google is large enough, well-funded enough, and the opportunity is great enough to run experiments.<p>You don't necessarily have to prove it out on large foundation models first. Can it beat out a 32b parameter model, for example?
Do you think there might be an approval process to navigate when experiment costs might run seven or eight digits and months of reserved resources?<p>While they do have lots of money and many people, they don't have infinite money and specifically only have so much hot infrastructure to spread around. You'd expect they have to gradually build up the case that a large-scale experiment is likely enough to yield a big enough advantage over what's already claiming those resources.
Prove it beats models of different architectures trained under identical limited resources?
But it's companies like Google that made tools like JAX and TPUs, saying we can throw together models with cheap, easy scaling. Their paper's math is probably harder to put together than an alpha-level prototype, which they need anyway.<p>So, I think they could default to doing it for small demonstrators.
Yes. The path dependence for current attention based LLMs is enormous.
At the same time, there is now a ton of data for training models to act as useful assistants, and benchmarks to compare different assistant models. The wide availability and ease of obtaining new RLHF training data will make it more feasible to build models on new architectures I think.
I don't think the comparison is valid. Releasing code and weights for an architecture that is widely known is a lot different than releasing research about an architecture that could mitigate fundamental problems that are common to all LLM products.
The newer one is from late May: <a href="https://arxiv.org/abs/2505.23735" rel="nofollow">https://arxiv.org/abs/2505.23735</a>
I don't think model code is a big deal compared to the idea. If the public had recognized the value of the idea 11 months ago, they could have implemented the code quickly, because there are so many smart engineers in the AI field.
If that is true, does it follow this idea does not actually have a lot of value?
Student: Look, there’s a hundred dollar bill on the ground!
Economist: No there isn’t. If there were, someone would have picked it up already.<p>To wit, it's dangerous to assume the value of this idea based on the lack of public implementations.
If the hundred dollar bill was in an accessible place and the fact of its existence had been transmitted to interested parties worldwide, then yeah, the economist would probably be right.
That day the student was the 100th person to pick it up, realize it's fake, and drop it
Well we have the idea and the next best thing to official code, but if this was a big revelation where are all of the Titan models? If this were public, I think we'd have a few attempts at variants (all of the Mamba SSMs, etc.) and get a better sense if this is valuable or not.
> it's been 11 months<p>Is that supposed to be a long time? Seems fair that companies don't rush to open up their models.
Just keep in mind it is performance review time for all the tech companies. Their promotion of these seems to be directly correlated with that event.
Gemini 3 _is_ that architecture.
I've read many very positive reviews about Gemini 3. I tried using it, including Pro, and to me it looks very inferior to ChatGPT. What was very interesting, though, was that when I caught it bullshitting me and called its BS, Gemini showed very human-like behavior. It did try to weasel its way out, degenerated down to "true Scotsman" level, but finally admitted that it was full of it. This is kind of impressive / scary.
Working with 1M context windows daily - the real limitation isn't storage but retrieval. You can feed massive context but knowing WHICH part to reference at the right moment is hard. Effective long-term memory needs both capacity and intelligent indexing.
Bytedance is publishing pretty aggressively.<p>Recently, my favorite from them was lumine: <a href="https://arxiv.org/abs/2511.08892" rel="nofollow">https://arxiv.org/abs/2511.08892</a><p>Here's their official page: <a href="https://seed.bytedance.com/en/research" rel="nofollow">https://seed.bytedance.com/en/research</a>
Meta is also being pretty open with their stuff. And recently most of the Chinese competition.
> Is there any other company that's openly publishing their research on AI at this level? Google should get a lot of credit for this.<p>80% of the ecosystem is built on top of companies, groups and individuals publishing their research openly, not sure why Google would get more credit for this than others...
It was not always like this. Google was very secretive in the early days. We did not start to see things until the GFS, BigTable and Borg (or Chubby) papers in the 2006 timeframe.
By 2006, Google was 8 years old. OpenAI is now 10.
Google publishes detailed papers of its architecture once it’s built the next version.<p>AI is a bit different.
Page Rank
Every Google publication goes through multiple reviews. If anyone thinks the publication is a competitive risk, it gets squashed.<p>It's very likely no one is using this architecture at Google for any production workloads. There are a lot of student researchers doing fun proof-of-concept papers; they're allowed to publish because it's good PR and it's good for their careers.
The amazing thing about this is the first author has published multiple high-impact papers with Google Research VPs! And he is just a 2nd-year PhD student. Very few L7/L8 RS/SWEs can even do this.
I mean, they did publish the word2vec and transformers papers, which are both of major significance to the development of LLMs.
Underrated comment, IMHO. There is such a gulf between what Google does on its own part, and the papers and source code they publish, that I always think about their motivations before I read or adopt it. Think Borg vs. Kubernetes, Stubby vs. gRPC.
The author is listed as a "student researcher", which might include a clause that students can publish their results.<p>Here is a bit more information about this program: <a href="https://www.google.com/about/careers/applications/jobs/results/93865849051325126-student-researcher-2026" rel="nofollow">https://www.google.com/about/careers/applications/jobs/resul...</a>
Arxiv is flooded with ML papers. Github has a lot of prototypes for them. I'd say it's pretty normal with some companies not sharing for perceived, competitive advantage. Perceived because it may or may not be real vs published prototypes.<p>We post a lot of research on mlscaling sub if you want to look back through them.<p><a href="https://www.reddit.com/r/t5_3bzqh1/s/yml1o2ER33" rel="nofollow">https://www.reddit.com/r/t5_3bzqh1/s/yml1o2ER33</a>
lol you don't get it. If it's published it means it's not very useful
Maybe it's just misdirection - a failed approach?<p>Given the competitive nature of the AI race, it's hard to believe any of these companies are really trying to help the competition.
Amazon has a foundation model named Titan - mostly recommended for creating embeddings. Possible confusion in this space.
"At long last, we have created the Torment Nexus from the classic novel Don't Create the Torment Nexus"<p>(In Eclipse Phase, TITAN - the Total Information Tactical Awareness Network - mulched humanity when it went rogue.)
>The model uses this internal error signal (the gradient) as a mathematical equivalent of saying, "This is unexpected and important!" This allows the Titans architecture to selectively update its long-term memory only with the most novel and context-breaking information<p>So one can break a model by consistently feeding it with random, highly improbable junk? Everything would be registered as a surprise and get stored, impacting future interactions
This is an oversimplification of what Titans does. The model performs nested learning: the model learns during inference, and during training the model weights learn _how and what_ to learn during inference. If the input contains junk or irrelevant information, the model has most likely learned during training to assign low-surprise query and key embeddings to those tokens, because learning those junk tokens would have hurt the overall ability of the model to predict subsequent tokens (and thus would have increased the training loss).
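To make the "surprise" update discussed above a bit more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: it assumes a linear d×d memory instead of the deep MLP the paper uses, and fixed constants for the learning rate, momentum and forgetting rate where the paper learns data-dependent gates. The names `LinearMemory`, `lr`, `momentum` and `forget` are made up for illustration. The update follows the momentum-plus-decay form as I read it in the Titans paper: S_t = η·S_{t-1} − θ·∇ℓ and M_t = (1 − α)·M_{t-1} + S_t, with ℓ = ‖M·k − v‖².

```python
import math


def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


class LinearMemory:
    """Toy associative memory updated by gradient steps at inference time.

    Assumption: a linear map stands in for the paper's deep memory MLP,
    and lr / momentum / forget are constants rather than learned gates.
    """

    def __init__(self, dim, lr=0.1, momentum=0.5, forget=0.01):
        self.dim = dim
        self.M = [[0.0] * dim for _ in range(dim)]  # memory weights
        self.S = [[0.0] * dim for _ in range(dim)]  # accumulated "past surprise"
        self.lr = lr
        self.momentum = momentum
        self.forget = forget

    def read(self, key):
        return matvec(self.M, key)

    def write(self, key, value):
        """Take one gradient step on ||M k - v||^2 and return the gradient norm."""
        error = [p - v for p, v in zip(self.read(key), value)]
        # Gradient of the squared error with respect to M is 2 * error * key^T.
        grad = [[2.0 * e * k for k in key] for e in error]
        surprise = math.sqrt(sum(g * g for row in grad for g in row))
        for i in range(self.dim):
            for j in range(self.dim):
                self.S[i][j] = self.momentum * self.S[i][j] - self.lr * grad[i][j]
                self.M[i][j] = (1.0 - self.forget) * self.M[i][j] + self.S[i][j]
        return surprise


memory = LinearMemory(dim=4)
key = [1.0, 0.0, 0.0, 0.0]
value = [0.0, 1.0, 0.0, 0.0]
# Writing the same association repeatedly: the first write is the most surprising.
surprises = [memory.write(key, value) for _ in range(6)]
print([round(s, 3) for s in surprises])
```

With constant hyperparameters like these, every novel input gets written, junk included. The point made in the parent comment is that the real model learns the gates during training, so it can learn to assign near-zero update strength to tokens that never helped prediction.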
I’m guessing that this is the first thing they thought of and the problem only exists in the superficial gloss you’re responding to?
In what world can you not always break the response of an AI by feeding it a bunch of random junk?
Indeed. In what world can you not break any tool when deliberately misusing it?
I mean, currently LLMs are stateless and you can get rid of all the poisoned data by just starting a new conversation (context). And OP introduces "long-term memory" where junk will accumulate with time
I believe you're misunderstanding what the OP means about "long-term" memory. From what I can tell, it's not actively modifying the weights of the underlying model, it just "remembers" things from a high number of tokens into the past of its context. The point is that this allows it to remember something it read ~200 pages ago in a very long context window, not that it can remember something from one session into another clean session.
In something like Cursor, if it messes something up you can click 'undo'. I'd imagine a small snapshot would only be persisted to the memory if you keep its output, and even then it's mostly just a summary.<p>There's probably lots of small signals of "the user is happy with the output", plus the longer the history, the more it will converge on the middle of what you want. Including when the user says "don't do [x]", which overrides past stuff.
I mean ideally AI would be resilient to junk, don't you think?
Humans are pretty vulnerable to junk so I’m not sure.
Ideally, you'd run your own instance of this, I think.<p>I can see a product where you purchase a model that has basic training, and then, using the features outlined in the paper, it learns on the fly from your usage.<p>I can also see there being a secondary market for specially trained models, with long-term memory filled with some specific skill, done in some specific way. To make a silly example, imagine buying a licence to Torvalds's OS coding assistant, ready to insult your PRs before you even commit them! (And possibly help you write code in Torvalds's style too.)<p>This would of course require Linus to use the model enough for it to learn. I won't comment on the likelihood of that happening: it's just a silly example after all.
This is the start of what I always thought an AI should have - a limbic system. Humans don't store memory based on novelty, they store it based on emotional content. This is where I was afraid of the tiger, this is where I smelled delicious food, this was what it felt like when I was victorious in the hunt.<p>AI needs an internal emotional state because that's what drives attention and memory. AI needs to <i>want</i> something.
This is no different from what happens to humans if they're locked into cult programming situations: they'll start believing and regurgitating all kinds of nonsense if their information stream is tightly curated.<p>Practically, for use with a codebase development effort, if the model remembers the original design decisions and the discussions about costs and benefits, and can then recall all that much later in the process, it's going to start getting really good at thinking about what the next step is, or even at making decisions about when a major refactor is needed, etc.
When i first read the Titans papers, for me it was a "this will be a big step forward" moment.<p>While i have no "AI" title and don't work in the AI industry, i've spent many years thinking about AI concepts, even long before the whole NN/LLM hype started.<p>Maybe because of that i was always really annoyed that LLMs are called AI, because in my years of thinking about how an actual "human like" thinking AI might work, the things an LLM does were far below what my minimum definition was.<p>But when i stumbled across the Titans paper, while it still is not an "AI" as i would call it, from my POV it's a massive step in the right direction.<p>Sometimes i consider writing all my ideas/thoughts about AI down in my blog, but then i think nobody would care anyway since i'm not a known figure <i>shrug</i> - so other than to say "look i wrote it years ago!" there's no actual point in doing so i guess.<p>However - i'm looking forward to seeing Titans in action, and i guess it will impress us all.
Sharing it in your blog over a period of months or years is how you become a known figure eventually.
A lot of LLM/AI writing these days can feel lost in the weeds – the specifics of very detailed techniques are interesting undoubtedly, but writing that steps back and looks at the big picture, informed by those details, could be very useful for people who want to think about where this all may be going.
Thanks, and i'm gonna think about doing a writeup. As i mentioned in another comment, reading my previous comment back from yesterday i don't even know why i mentioned it - probably because i think so much about the topic but then i think "well you're just a guy in a shed" type of thing and decide that prolly no one would care about what i would write. After all - if it's just something i can look back on in some years, prolly worth it.
Are you curious to see whether a blog post shared here might gain any traction and perhaps some valuable feedback?
Tbh, if i read back my comment from yesterday i don't even know exactly why i did mention that part. Sounds even to me like a "look at my blog" thingy which it definitely should not. Maybe some day ill give it a try and write something about my 'ideas' and drop it here. Tho not today (w0rk w0rk) ^
I wrote about that a while ago: <a href="https://paxamans.github.io/blog/titans/" rel="nofollow">https://paxamans.github.io/blog/titans/</a>
I’m curious if this makes them more or less susceptible to prompt injection?<p>On the one hand, learning on the job could allow better training of what not to be influenced by, but on the other hand, an injected prompt could have an even deeper effect on them long term.
> The Transformer architecture revolutionized sequence modeling with its introduction of attention, a mechanism by which models look back at earlier inputs to prioritize relevant input data<p>I've always wanted to read how something like Cursor manages memory. It seems to have developed a long history of all of prompts and understands both the codebase and what I'm building slightly more over time, causing less errors.
Very interesting. Is it correct for me to imagine it as some kind of "LoRA" that's continuously adapted as the model goes through its day?<p>If so, could there perhaps be a step where the LoRA is merged back into the main model?<p>That would be like sleeping :-)
I don't think that's a great analogy.<p>LoRAs tend to be adapters bolted onto systems by people other than the system designers, and they are low-rank factorizations.<p>There is nothing low-rank or adapter-like here.
Kind-of. You could theoretically use LoRA for this, in fact, but it probably wouldn't have enough capacity to make it a proper substitute of the attention mechanism. Instead a full MLP is trained as input chunks get processed.
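A rough way to see the capacity gap mentioned above is to count trainable parameters. The sketch below compares a rank-r LoRA update (two thin matrices, B of shape d_out×r and A of shape r×d_in) against a full dense weight matrix of the same layer. The layer width 1024 and rank 8 are numbers picked purely for illustration, not taken from the paper, and the function names are hypothetical.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters in a LoRA update W + B @ A (B: d_out x rank, A: rank x d_in)."""
    return rank * (d_in + d_out)


def dense_params(d_in, d_out):
    """Trainable parameters in a full dense weight matrix (bias ignored)."""
    return d_in * d_out


width = 1024  # illustrative layer width, an assumption
rank = 8      # illustrative LoRA rank, an assumption

full = dense_params(width, width)
low_rank = lora_params(width, width, rank)
ratio = full / low_rank

print(f"dense: {full:,} params, LoRA r={rank}: {low_rank:,} params, ratio {ratio:.0f}x")
```

A memory MLP stacks several such dense layers, so the gap only widens. Whether a low-rank update would really be too small for this job is the commenter's guess, not something this arithmetic proves.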
Titans: Learning to Memorize at Test Time
<a href="https://arxiv.org/abs/2501.00663" rel="nofollow">https://arxiv.org/abs/2501.00663</a>
This just feels like a tremendous missing piece to LLMs. Looking forward to seeing it in action.
Very very interesting, definitely a missing piece in current AI space.<p>Small typo where the text “Virtually all successful existing sequence models rely on mean squared error…” is repeated twice within the same paragraph. Happens to the best of us.
Would this also allow aligning it further with the user's prompt? Notably due to the surprise factor and how it may understand it?
It's interesting that they publish a blog post about the Titans and MIRAS papers only now, while the blog post about the new follow-up paper (Nested Learning), all by the same main author(!), came out a month ago: <a href="https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/" rel="nofollow">https://research.google/blog/introducing-nested-learning-a-n...</a>
See also Hope:<p><i>In the previous sections, we first discussed Continuum Memory System (CMS) that allows for more persistent storage of memories and defines memory as a spectrum of blocks with different frequencies of update. Due to the larger capacity and constraints for scaling the parameters, often CMS requires simple learning rule but higher capacity to store more persistent knowledge. On the other hand, in the previous section, we discussed the design of a self-modifying Titans, where it can generate its own keys and so learning update to better adapt to the context. Contrary to CMS, the self-modifying Titans has a small capacity but is using a complex and expressive learning rule. Accordingly, these two systems seem to be complementary and their combination can enhance the model expressiveness from different aspects.</i><p><i>To this end, we present Hope architecture: A neural learning module that incorporates self-modifying Titans followed by Continuum Memory System.</i><p><a href="https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/" rel="nofollow">https://research.google/blog/introducing-nested-learning-a-n...</a>
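The "spectrum of blocks with different frequencies of update" in the quoted passage can be pictured with a toy schedule. This is only a sketch of the scheduling idea, under loud assumptions: each block holds a single scalar state, the update rule (move toward the mean of the signals seen since the last update) is invented for illustration, and the periods 1, 4 and 16 are arbitrary. None of this is the paper's actual learning rule, and `MemoryBlock` is a hypothetical name.

```python
class MemoryBlock:
    """Toy memory block that only updates every `period` steps."""

    def __init__(self, period, lr=0.5):
        self.period = period
        self.lr = lr
        self.state = 0.0
        self.pending = []  # signals buffered since the last update
        self.updates = 0

    def observe(self, step, signal):
        self.pending.append(signal)
        if step % self.period == 0:
            # Illustrative rule (an assumption): nudge the state toward
            # the mean of everything seen since the previous update.
            mean = sum(self.pending) / len(self.pending)
            self.state += self.lr * (mean - self.state)
            self.pending.clear()
            self.updates += 1


# Fast, medium and slow blocks; the periods are arbitrary.
blocks = [MemoryBlock(period) for period in (1, 4, 16)]
for step in range(1, 33):
    for block in blocks:
        block.observe(step, float(step))

update_counts = [block.updates for block in blocks]
print(update_counts)
```

The fast block tracks the most recent input closely while the slow block changes rarely and so holds on to older information longer, which is the "more persistent storage" trade-off the quote describes.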
For most papers, the main idea can be described in 1-2 sentences, sort of "we did X using Y".<p>That doesn't work for HOPE - a short summary can't explain what it actually does besides "self-modifying" and "continuum memory".<p>So it seems to be an innovation of Transformers calibre, really big (if true). It's definitely not "transformer but with such-and-such modification".<p>Gemini came up with the following visual metaphor for the difference:<p>> Transformer is a series of frozen glass panes (the weights) and a scratchpad (the attention) where it writes notes about the current text.<p>> The HOPE architecture involves no scratchpad. Instead, the glass panes themselves are made of smart liquid. As the data flows through, the first pane reshapes itself instantly. The second pane reshapes itself slowly. And the mechanism deciding how to reshape them is itself a tiny, intelligent machine, not just a basic math rule.
+1 Insightful.<p>This comment was illuminating -- and IMHO an excellent example of why it's important to avoid rigid rules against posting any AI-generated content in HN comments. You gained insights by asking Gemini, and shared them, noting the source. Thank you!
Long-term memory on top of the base model, but is this idea for local users or for the data-center hosted model used by many different people?<p>P.S. This quote from the paper sounds just like LLM output:<p>> "This memory module provides significantly higher expressive power, allowing the model to summarize large volumes of information without losing important context. The model isn't simply taking notes; it's understanding and synthesizing the entire story. Crucially, Titans doesn’t just passively store data. It actively learns how to recognize and retain important relationships and conceptual themes that connect tokens across the entire input."
I submitted this exact URL yesterday. What are the criteria for when HN creates a new post vs. going to the existing one?
Post starts with wrong statement right away:<p>"The Transformer architecture revolutionized sequence modeling with its introduction of attention"<p>Attention was developed before transformers.
> Attention was developed before transformers.<p>I just looked this up and it’s true, this changes the timeline I had in my mind completely! I thought the paper on Transformers is what also introduced the attention mechanism, but it existed before too and was applied on RNN encoder-decoder. Wow
Skynet kind of sucks ...
"Titans", huh?<p>... anyone here familiar with the RPG Eclipse Phase?
I'm not, but I'm familiar with the mythology of the eastern Mediterranean they're likely getting the word from.<p>There the titans did incest, birthed the olympians, then the youngest of the titans castrated his dad and took all power for himself, and then Zeus and the olympians waged a decade long war against him which they won.
So what happens if I write a book and on the last page write "Everything in this book was a lie and should not be cared about"? Will this be surprising enough for Titan? A regular LLM may ignore it completely if it's a massive book (massive book + 1 line contradiction).
Here is my amateur understanding of the architecture: Fine-tune on the fly by using degrees of surprise to update a separate/new memory network that matches the base model, and just call that network for each token iteration.<p>So if we are viewing this through the needle-in-a-haystack lens: The needle was very surprising for the base model, so going forward, when it sees anything of the same nature, the memory module will not just give you hay, but the needle, because it made a special note of it when it went through the haystack 1 million tokens ago, because the needle was surprising.<p>The Transformer's normal attention mechanism is already secretly trying to be a long-term memory system. Every time it writes a new KV pair into the cache, it’s desperately trying to “remember” that token forever.<p>But it’s doing it in the dumbest possible way: by hoarding an ever-growing pile of raw vectors, then frantically dot-product searching through the pile every single step.
It’s like a hoarder who never throws anything away and has to rummage through mountains of junk to find the one receipt they need.
Of course it chokes at long contexts.<p>Titans/MIRAS looks at that mess and says:
“Why store memory in a growing garbage pile of vectors?
Store it in the weights of a deep neural network instead — and let that network keep training itself in real time, but only on the stuff that actually surprises it.”
That’s literally it.<p>Using the Tim Cook Martian example:
The model is cruising through boring financial numbers → attention is doing its normal thing, KV cache is growing, but nothing is really sticking.<p>Suddenly: “Tim Cook is a Martian.”<p>Normal attention would just add one more KV pair to the pile and pray it doesn’t get drowned out later.<p>Titans instead goes: “Holy shit, reconstruction error off the charts → this does NOT fit my current memory at all → massive gradient → actually rewrite huge chunks of the memory MLP’s weights right now so this fact is burned in forever.”<p>From that moment on, the memory MLP has physically changed its internal wiring. Any future query that even vaguely smells like “Tim Cook” or “Martian” will make the activations explode through the newly rewired paths and spit out a vector screaming “MARTIAN” at the frozen attention layers.<p>The frozen attention (which is still doing its normal job on the short window) suddenly sees this one extra “virtual token” in its context that is confidently yelling the surprising fact → it attends hard to it → the model answers as if the Martian revelation happened one token ago, even if it was 2 million tokens back.<p>It looks exactly like a super-attention mechanism that only “primes” or “locks in” the surprising needles and deliberately forgets or ignores the hay. And it is also a way to fine-tune on the fly permanently for the current context.<p>I think…
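The "ever-growing pile of raw vectors" part of the comment above can be sketched in a few lines of plain Python. This is a toy single-head KV cache with softmax dot-product readout, with no scaling, masking or batching, and with made-up 2-dimensional keys; the class name `KVCache` and the hay/needle vectors are assumptions for illustration. It shows two things: stored floats grow linearly with the number of tokens, and every readout has to score the query against every stored key.

```python
import math


class KVCache:
    """Toy key/value cache: append one pair per token, attend over all of them."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def attend(self, query):
        """Softmax-weighted average of values; cost is linear in cache length."""
        scores = [sum(q * k for q, k in zip(query, key)) for key in self.keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(self.values[0])
        return [
            sum(w * v[i] for w, v in zip(weights, self.values)) / total
            for i in range(dim)
        ]

    def floats_stored(self):
        return sum(len(k) for k in self.keys) + sum(len(v) for v in self.values)


cache = KVCache()
for _ in range(1000):
    cache.append([1.0, 0.0], [0.0])  # "hay": boring tokens
cache.append([0.0, 5.0], [1.0])      # "needle": one distinctive token

readout = cache.attend([0.0, 5.0])   # query that matches the needle
stored = cache.floats_stored()
print(len(cache.keys), stored, readout)
```

A weight-based memory of fixed size (for example a d×d matrix or a small MLP) keeps the same footprint no matter how many tokens have passed, which is the trade the comment is describing. The cost is that it stores a compressed summary rather than the raw tokens.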
This is the one thing missing from my interactions with AI. If successful, this will change everything. If you thought people were getting AI boyfriends and girlfriends before, wait until you see this.