Is One Layer Enough? A Single Transformer Layer Matches Full-Parameter RL Train

(arxiv.org)

120 points by tcp_handshaker7 hours ago

10 comments

HarHarVeryFunny5 hours ago
It's interesting that it's the middle layers of the Transformer that are affected most by RL post-training, but it perhaps makes some intuitive sense given that RL is being used to shape the high-level, planning-type direction of the output.
It seems that the input layers to a Transformer are necessarily going to be doing the most low level work of syntax -> semantic augmentation, starting with things like tagging parts of speech etc. Similarly the output layers are by necessity going to be concerned with mapping high level representations back into surface level word sequence form. This leaves the middle layers to do the work of first recognizing deep enough patterns to support good quality prediction, then doing the high level prediction itself, which is what RL is typically going to be trying to shape.
mike_hearn5 hours ago
This result feels very intuitive. The early layers of a transformer can be thought of as understanding surface level things like syntax, how tokens group, which groups are entities and how to disambiguate them, etc. The last layers are in a sense decoding ideas into a selection of words, ensuring the grammar makes sense, that the text flows and is structured correctly, etc. The middle layers are where the abstract thought and manipulation of concepts is happening.
But for the tasks this paper uses for RL training, it's all about improving the way the net is manipulating concepts. So the middle layers are where the focus should be.
Note: RL is also used for tasks that aren't about conceptual manipulation, like instruct training. I bet that their result doesn't hold for that because the delta vs the foundation model is all about the selection of words and flow of the text, not the core understanding.
- throw3108223 hours ago
 I keep thinking of the RYS (Repeat Yourself) experiment of simply looping some of the inner layers of LLMs for better results and wonder if any progress was made on it.
 https://dnhkng.github.io/posts/rys/
 It feels like it should be straightforward to integrate into LLMs a network that controls the looping. Or just duplicate entire blocks of layers after the initial training.
 - naasking1 hour ago
 Yes, computing in latent space is a big thing now.
 https://ouro-llm.github.io/
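For illustration, the layer-looping idea in this subthread can be sketched in a few lines of Python. This is a toy under stated assumptions, not the RYS implementation: `looped_layer_order`, `run_layers`, and the layer indices are all hypothetical, and which block to repeat (and how often) is exactly the open question.

```python
def looped_layer_order(num_layers, start, end, repeats):
    """Return the order in which layer indices run when the block
    [start, end) is executed `repeats` times in a row (hypothetical helper)."""
    if not (0 <= start < end <= num_layers) or repeats < 1:
        raise ValueError("need 0 <= start < end <= num_layers and repeats >= 1")
    block = list(range(start, end))
    return list(range(start)) + block * repeats + list(range(end, num_layers))


def run_layers(x, layers, order):
    """Apply `layers` to `x` in the given order; weights are shared across repeats."""
    for idx in order:
        x = layers[idx](x)
    return x


# Toy "layers": each one adds its own index, so the result reveals which layers ran.
toy_layers = [lambda x, i=i: x + i for i in range(8)]
order = looped_layer_order(num_layers=8, start=3, end=5, repeats=2)
result = run_layers(0, toy_layers, order)
```

Because the repeated block reuses the same layer objects, looping adds compute but no parameters. A learned controller for the number of repeats, as suggested above, would replace the fixed `repeats` argument.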
janalsncm46 minutes ago
This is interesting theoretically, but in practical terms it’s hard to apply.
RL is already hard. There are many things which can go wrong. You have all of the problems with regular LLM SFT, plus now you have a reward model which can be hacked or too hard. Or KL collapse because the outputs are repetitive. Or maybe your groups in GRPO aren’t producing advantages. Or the rollouts are OOD for your reward model. Or maybe you’re running the rollout at a different precision than the trained weights. Or maybe your importance sampling should be clipping when it’s not, or should be clipping at the token level rather than sequence level.
Maybe after reading the above you think that the above are not problems because smart people wouldn’t make those mistakes. Fair enough. But I would prefer RL people like myself who are not geniuses.
Now, this is adding another variable into the mix: choosing a single layer to train. If it doesn’t work is it because there’s a problem with your RL setup? Or did you just choose the wrong layer? Or maybe there’s no problem with your setup but you chose a suboptimal layer to train.
Also note that we already have LoRA, which is a more established method for low memory parameter updates.
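Two of the failure modes listed above are easy to see in a toy calculation. The sketch below assumes the common group-normalized advantage (reward minus group mean, divided by group standard deviation) and a PPO-style clipped surrogate. The function names are made up for illustration, and the length-normalized sequence ratio is only one possible choice of sequence-level ratio.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages for the rollouts of one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio, adv, clip=0.2):
    """PPO-style surrogate: min(ratio * A, clip(ratio, 1 - c, 1 + c) * A)."""
    clipped = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * adv, clipped * adv)


def token_level_objective(logp_new, logp_old, adv, clip=0.2):
    """One importance ratio per token, clipped per token, then averaged."""
    terms = [clipped_term(math.exp(n - o), adv, clip)
             for n, o in zip(logp_new, logp_old)]
    return mean(terms)


def sequence_level_objective(logp_new, logp_old, adv, clip=0.2):
    """One length-normalized ratio for the whole sequence (an assumed choice)."""
    ratio = math.exp(mean([n - o for n, o in zip(logp_new, logp_old)]))
    return clipped_term(ratio, adv, clip)


# A group where some rollouts succeed gives a learning signal...
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])
# ...but a group where every rollout fails gives all-zero advantages.
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])

# Per-token ratios that cancel out on average are clipped at the token level
# but look like "no change" at the sequence level.
logp_new, logp_old = [-1.0, -1.0], [-1.5, -0.5]
tok = token_level_objective(logp_new, logp_old, adv=1.0)
seq = sequence_level_objective(logp_new, logp_old, adv=1.0)
```

When every rollout in a group gets the same reward, the advantages are all zero and that prompt contributes no gradient, which is the "groups aren’t producing advantages" case. The last four lines show that the two clipping granularities can disagree on the same rollout.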
olliepro5 hours ago
The authors have some inconsistencies with training token length…
Most errors are probably responses that didn’t finish before their 3K token limit. They’ve measured how well RL is able to shorten the response to their limit.
usernametaken296 hours ago
If you think about it for some time then you’ll come to realise transformers are autoencoders on steroids. A small input space is expanded onto a big manifold and contracted again. Now, suppose you want to impose a function to regulate the output of an autoencoder. It’s actually pretty obvious that you need exactly one layer to do so… f(manifold).
- sailingparrot3 hours ago
 Everything can be represented as f(), a full scale SotA transformer model is also just f(context). That does not mean one layer is sufficient. It all depends on the level of expressivity required by this f to be a good model.
- getnormality6 hours ago
 What you're suggesting seems to go implausibly far beyond what the paper says.
 RL post-training alters the parameters of the transformer, while your f(manifold) idea seems to suggest that a new layer on top would suffice, no need to alter the transformer itself at all.
 It would be extremely handy if that were so, but I'm guessing it isn't, or it would be the prevailing approach.
 - wrs3 hours ago
 The manifold is in the middle (“small input space is expanded onto a big manifold and contracted again”) so f(manifold) would need to be in the middle too.
- earthnail6 hours ago
 Took me a short time to understand what you mean with "autoencoders on steroids", but I believe you mean they are autoencoders with an inverse bottleneck - an intermediate representation that isn't smaller, but that's much larger than the input space. Is my understanding of your comment correct?
 - usernametaken296 hours ago
 Kind of. Autoencoders don’t need to have an embedding that’s smaller than the input. Their only requirement is that they compress information and thus create reconstruction loss. Typically, however, they are not trained this way because they don’t converge. Transformers do the same thing, but they can squeeze many more bits of information through one pass because of the way they are designed. This holds true even for decoder-only networks because they’re still doing the same thing.
 - earthnail1 hour ago
 If the embedding isn’t smaller than the input, how is it compressing information? It might lose information in its mapping to the embedding space, but in my understanding, the definition of compression means it has to use fewer bits than the original to hold the same information. As such, the embedding space must be smaller.
- soraki_soladead6 hours ago
 I might be misunderstanding your point but this conflates the distinguishing features of each. you mention expansion but autoencoders canonically compress their inputs. autoencoders have an explicit encoder and decoder. most transformers we interact with these days (LLMs) are decoder only. the manifold isn't typically something the model is applied to directly. we apply the function/model to the latent representations. those are what live on the manifold.
 - usernametaken296 hours ago
 Now that’s interesting.. what exactly distinguishes latent representations and the manifold? IMHO, those are the same, and you’re constructing a piecewise function of the manifold itself. Decoders also produce manifolds much in the same way, with the distinction being that the encoder isn’t learned but static after initialisation. So fundamentally it is still DOING the same operation.
 - soraki_soladead5 hours ago
 The latent representations of the data are like points on a surface. That surface is the manifold. We don't typically have the full manifold and can only sample points from it by embedding data into it.
 Worth noting a different manifold "exists" after each transformation (e.g. layer). You only sample from the same manifold when you apply the same transformation(s).
 - CuriouslyC5 hours ago
 Also worth noting that in reality manifolds will be "spiky" in very high dimensions, so the idea of a "surface" is best understood through patterns of distance between samples in embedding space and the way they collapse in low D.
baq5 hours ago
I'm reminded of this dude who was sitting at or near the top of some Kaggle leaderboard by simply[0] splicing together some duplicated middle layers and applying a bit of fine tuning
[0] not simply
- deflator3 hours ago
 Was it this guy? https://news.ycombinator.com/item?id=47322887
 - baq2 hours ago
 May have been, yes!
 - deflator1 hour ago
 His articles blow me away. Like the one where he plumbed a pair of Grace-Hopper superchips with liquid cooling for a home rig.
- kmeisthax1 hour ago
 I'm wondering if the big problem is just the lack of recurrent connections in the standard Transformer design, and selective layer duplication is just a weird way to fix the same problem. I have to wonder if it would be possible to deliberately architect a model to discover and exploit layers worth duplicating at training time.
 The current model architectures we use have a fixed routing of residuals per layer, from the first to the last. I'm imagining replacing this with a matrix of routing weights[0] that determines how "strong" the connection is between each Transformer layer. We still evaluate each layer "in order", but now instead of just giving the layer the last layer's residuals, it gets the sum of all prior layers times their weight in the routing matrix. Recurrent connections (i.e. output of layer 9 to input of layer 3) could be handled by doing a second pass and using the first pass's recurrent residuals as inputs. You could then "loop" the model as many times as desired per token, or even have it do parallel decoding with each token communicating with the others while also recurring on itself.
 You'd probably need some kind of normalization akin to what DeepSeek did with Manifold-Constrained Hyper-Connections (mHC). Hell, mHC might also be useful in combination with this kind of layer routing, so the model could grow different recurrent loops for various bits of its thought-space.
 EDIT: if anyone uses it please call it "neuralese recurrence" just to scare the AI safety bros
 [0] I'm not sure how you'd initialize these weights. Maybe each row/column is a narrow gaussian centered around the prior layer, with some random or constant weighting everywhere else?
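A minimal sketch of the feed-forward half of the routing-matrix idea above, using scalars in place of hidden states. Everything here is an assumption made for illustration: `routed_forward` is a hypothetical name, each layer's "output" is taken to be its residual delta, and the recurrent second pass, the normalization, and the initialization question from the footnote are all left out.

```python
def routed_forward(x0, layers, routing):
    """Run layers in order; layer i reads a weighted sum of earlier outputs.

    outputs[0] is the embedding and outputs[j + 1] is layer j's residual delta.
    routing[i][j] weights outputs[j] in the input of layer i, and the extra
    last row of `routing` weights every output in the final readout.
    """
    outputs = [x0]
    for i, layer in enumerate(layers):
        layer_in = sum(routing[i][j] * outputs[j] for j in range(i + 1))
        outputs.append(layer(layer_in))
    readout = routing[len(layers)]
    return sum(readout[j] * outputs[j] for j in range(len(outputs)))


def residual_forward(x0, layers):
    """Standard residual stream: h <- h + layer(h)."""
    h = x0
    for layer in layers:
        h = h + layer(h)
    return h


# Toy scalar "layers" standing in for Transformer blocks.
toy_layers = [lambda h: 0.5 * h, lambda h: -0.25 * h, lambda h: h * h]

# A lower-triangular matrix of ones reproduces the ordinary residual stream.
n = len(toy_layers)
ones_routing = [[1.0] * (i + 1) for i in range(n + 1)]
routed = routed_forward(1.0, toy_layers, ones_routing)
baseline = residual_forward(1.0, toy_layers)

# Down-weighting one connection (layer 0's delta into layer 2) changes the result.
tweaked = [row[:] for row in ones_routing]
tweaked[2][1] = 0.0
rerouted = routed_forward(1.0, toy_layers, tweaked)
```

Under this framing, today's fixed routing is the special case where every lower-triangular entry is 1, so a learned matrix could be initialized there and allowed to drift. Entries above the diagonal would be the recurrent connections, which need the second pass described in the comment.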
soleveloper5 hours ago
Makes sense - this is very similar to fine-tuning a downstream task in an encoder-decoder architecture (~BERT style)
tribal8085 hours ago
If most of the performance gains are hidden in a few middle layers, you can save a massive amount of compute by freezing the rest
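A back-of-the-envelope estimate of how much freezing could save, under loudly stated assumptions: every layer costs the same, the backward pass costs about twice the forward pass per layer, and that backward cost splits evenly between activation gradients and weight gradients. Embeddings, the output head, and rollout generation (which is forward-only and unaffected by freezing) are ignored. `single_layer_cost_ratio` is a hypothetical helper, not something from the paper.

```python
def single_layer_cost_ratio(num_layers, trained_layer):
    """Cost of one training step with a single trainable layer, relative to
    training all layers. Units: one layer's forward pass. `trained_layer`
    is 0-indexed from the input side."""
    if not 0 <= trained_layer < num_layers:
        raise ValueError("trained_layer must be in [0, num_layers)")
    full = 3 * num_layers                        # forward + act. grads + weight grads
    forward = num_layers                         # every layer still runs forward
    activation_grads = num_layers - trained_layer  # loss back down to the trained layer
    weight_grads = 1                             # only the trained layer
    return (forward + activation_grads + weight_grads) / full


def optimizer_state_fraction(num_layers):
    """Share of per-layer optimizer state that is still needed."""
    return 1 / num_layers


middle = single_layer_cost_ratio(num_layers=32, trained_layer=16)
first = single_layer_cost_ratio(num_layers=32, trained_layer=0)
last = single_layer_cost_ratio(num_layers=32, trained_layer=31)
state = optimizer_state_fraction(32)
```

With these assumptions, training a middle layer of a 32-layer model costs roughly half of a full step, because gradients still have to flow back through the frozen layers above it. The larger saving is in optimizer state and weight-gradient memory, which shrink to about 1/32.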
khalic3 hours ago
Really good work here, bravo
vatsachak5 hours ago
I still can't believe that LLM encoders aren't learned unsupervised.
So much left on the table
- d3m0t3p2 hours ago
 They are using Qwen, so this is decoder only.
 - vatsachak1 hour ago
 Yes, my comment was kind of a non sequitur