TinyLoRA – Learning to Reason in 13 Parameters

(arxiv.org)

203 points by sorenjan 5 days ago

16 comments

cestith 16 minutes ago
This is interesting and all, but “LoRA” is painfully close to “LoRa” (which is related to radio networking, not AI) when just scanning a list of topics. We’re never going to beat the Shannon limit on acronyms and initialisms.
I’m glad the rest of the anchor text gave some context.
- jorlow 4 minutes ago
 I see this comment on every single LoRA post despite the vast majority of posts being about LoRA not LoRa. Can we please stop beating this dead horse?
kashifr 31 minutes ago
You can try out TinyLoRA in PEFT main now: https://huggingface.co/docs/peft/main/en/package_reference/tinylora
dollo_7 8 hours ago
Not sure if I buy it. First, computing the SVD to obtain U, Σ, V is computationally expensive, so it would work only if we are not finetuning very big models.
But my real concern is with the results. The "13 parameters" looks like bait, because it is one result of finetuning a model on a very simple math benchmark, grade-school math (GSM8K), an already very saturated benchmark for every model. Besides, it seems to happen only for the Qwen model family... It looks like GSM8K was part of the training set of the Qwen model, and this TinyLoRA finetuning made the last adjustments to perfectly reflect that overtraining.
- sachaa 8 hours ago
 Fair points, especially on GSM8K saturation and Qwen possibly already sitting close to the solution. That said, even if this is mostly "last-mile alignment", the fact that it can be done with such a tiny signal is still interesting: it suggests the gap between capability and behavior might be much smaller (and cheaper to bridge) than we assume.
- sorenjan 1 hour ago
 They're using the truncated SVD, not the full variant; that's computationally cheaper.
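For readers following the SVD cost sub-thread, here is a minimal NumPy sketch of what "truncated" means: keep only the top-r singular triplets of a weight matrix. The matrix size, the rank `r`, and the helper name `truncated_svd` are illustrative assumptions, not values or code from the paper. This version computes the full SVD and then slices, so it shows the result rather than the speed-up; a routine such as `scipy.sparse.linalg.svds` computes only the leading triplets.

```python
import numpy as np

def truncated_svd(W, r):
    """Return the top-r singular triplets (U_r, s_r, Vt_r) of W (hypothetical helper)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))   # stand-in for a frozen weight matrix (assumed size)
r = 4                               # assumed rank, chosen only for illustration

U_r, s_r, Vt_r = truncated_svd(W, r)
W_r = (U_r * s_r) @ Vt_r            # rank-r approximation U_r diag(s_r) Vt_r

# Eckart-Young: the Frobenius error equals the norm of the discarded singular values.
s_full = np.linalg.svd(W, compute_uv=False)
err = np.linalg.norm(W - W_r)
expected = np.sqrt(np.sum(s_full[r:] ** 2))
```

Storing the truncated factors takes r * (64 + 48 + 1) numbers here instead of 64 * 48, which is the sense in which the truncated variant is the cheaper object to work with.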
- robrenaud 8 hours ago
 Yeah, my big problem with the paper is it just might be an artifact of qwen's training process.
 - taneq 3 hours ago
 In all fairness most of the unique stuff I can do is probably an artifact of my training process, so it seems unfair to deny an LLM the same accommodation.
 - nativeit 40 minutes ago
 How much did your training cost society?
volume_tech 49 minutes ago
the '13 parameters' framing is misleading in both directions. the base model's billions of parameters do the heavy lifting; the 13 just steer existing circuitry. but that's actually the interesting finding: the behavior gap between a capable model and a reasoning model is geometrically tiny. you can find it with gradient descent in 13 dimensions of a multi-billion-dimensional space. that's a strong claim about the structure of task-relevant representations in transformers.
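A toy sketch of "gradient descent in 13 dimensions of a much larger space", under loudly stated assumptions: the frozen random projection `P`, the matrix sizes, the synthetic target, and the learning rate are all made up for illustration, and this is not claimed to be the paper's construction. Only the 13-entry vector `v` is trained; the base weights `W` never change.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k = 32, 32, 13         # k = 13 trainable numbers (sizes are assumptions)
D = d_out * d_in

W = rng.standard_normal((d_out, d_in))        # frozen base weights
P = rng.standard_normal((k, D)) / np.sqrt(D)  # frozen random projection (assumption)

def delta_w(v):
    """Map the 13-dim vector v to a full-size weight update."""
    return (v @ P).reshape(d_out, d_in)

# Synthetic target that is reachable by construction, so the toy problem is solvable.
v_true = rng.standard_normal(k)
W_target = W + delta_w(v_true)

def loss(v):
    return 0.5 * np.sum((W + delta_w(v) - W_target) ** 2)

v = np.zeros(k)
initial_loss = loss(v)
for _ in range(200):
    residual = (W + delta_w(v) - W_target).reshape(-1)
    grad = P @ residual             # chain rule through the linear map v -> delta_w(v)
    v -= 0.5 * grad                 # plain gradient descent on 13 numbers
final_loss = loss(v)
```

The point of the toy is only the bookkeeping: 1,024 weights are steered through 13 coordinates. Whether a real task's "behavior gap" lies in such a small subspace is the empirical question the comment is about.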
MASNeo 7 hours ago
Is it an April Fools' publication?
- darkxanthos 3 hours ago
  This hit me too hard.
kgeist 8 hours ago
>One theory is that the knowledge required to solve the task is already stored in the parameters of the model, and only the style has to change for task success
>In particular, learning to generate longer outputs may be possible in few parameters
Reminded me of: https://arxiv.org/abs/2501.19393
>we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps
Maybe, indeed, the model simply learns to insert the EOS token (or similar) later, and the capability is already in the base model.
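The "lengthening" half of the quoted budget-forcing idea can be sketched in a few lines. Everything here is a hypothetical stand-in: `budget_forcing`, `toy_generate`, the `<eos>` marker, and the canned reasoning chunks are invented for illustration and are not the linked paper's code. The `generate` callable is assumed to return the text a model would produce up to the point where it tries to stop.

```python
def budget_forcing(generate, prompt, min_extensions=2, stop_token="<eos>", nudge="Wait"):
    """Suppress the first `min_extensions` attempts to stop by appending `nudge` instead."""
    text = prompt
    for _ in range(min_extensions):
        chunk = generate(text)            # model output up to its attempted stop
        text += chunk + " " + nudge + "," # replace the stop with a nudge to keep thinking
    text += generate(text) + stop_token   # finally let the model end
    return text

def toy_generate(text):
    """Hypothetical stand-in for a model call: picks a canned chunk by nudge count."""
    chunks = [" 2+2=5.", " let me recheck: 2+2=4.", " so the answer is 4."]
    return chunks[min(text.count("Wait"), len(chunks) - 1)]

result = budget_forcing(toy_generate, "Q: 2+2?", min_extensions=2)
```

In the toy run the first attempt stops on a wrong answer, and the forced continuation is where the correction happens, which mirrors the "double-check its answer" behavior described in the quote.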
Xx_crazy420_xX 7 hours ago
If I understand it correctly, the analogy could be: let's say we have an expert low-level programmer and we try to teach him algebra. Either we:
 - (SFT): give him an algebra book with new nomenclature, definitions, and syntax
 - (RL): let him learn algebra using C syntax
- nathan_compton 2 hours ago
 I don't think so.
 Fine-tuning works on an input/output basis. You are rewarded for producing a plausible output _now_.
 RL rewards you later for producing the right output now. So you have to learn to generate a lot of activity, but you are only rewarded if you end up at the right place.
 In SFT you are rewarded for generating tokens plausible to the proof, one token at a time. In RL you are expected to generate an entire proof and then you are rewarded or punished only when the proof is done.
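The distinction drawn above can be written down as two loss functions. This is a minimal sketch using a REINFORCE-style surrogate as one common way to express a sequence-level reward; it is not claimed to be the objective used in the paper, and the function names and the example probabilities are assumptions.

```python
import math

def sft_loss(token_logprobs):
    """Token-level cross-entropy: every target token contributes its own term."""
    return -sum(token_logprobs) / len(token_logprobs)

def reinforce_loss(token_logprobs, final_reward, baseline=0.0):
    """Sequence-level surrogate: one scalar reward, known only at the end,
    scales the log-probability of every sampled token."""
    advantage = final_reward - baseline
    return -advantage * sum(token_logprobs)

# Log-probabilities the model assigned to three tokens (made-up numbers).
logps = [math.log(0.9), math.log(0.5), math.log(0.8)]

sft = sft_loss(logps)                                 # signal for each token, right away
rl_correct = reinforce_loss(logps, final_reward=1.0)  # whole sequence reinforced
rl_wrong = reinforce_loss(logps, final_reward=0.0)    # no learning signal without a baseline
```

With a zero reward and no baseline the RL surrogate vanishes for the entire sequence, which is the "only rewarded if you end up at the right place" point in code form.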
vasco 2 hours ago
Most data in the training set of most reasoning models is crap I guess.
a-t-c-g 12 hours ago
The quality of custom models trained with proper reasoning datasets[0], even with small parameter counts (3-7B is the sweet spot), is incredible now.
[0]: cartesien.io or Salesforce's WebscaleRL
- objektif 11 hours ago
 What are you basing how good they are on? Personal experience or some benchmarks?
 - a-t-c-g 10 hours ago
 Benchmarks. We have internal ones testing reasoning fine-tuned models vs. frontier models + prompts.
 For some use cases it can range from parity performance at 1/20th the cost up to exceeding it at 1/10th the cost. The trade-off is ofc narrow applicability.
 - objektif 1 hour ago
 How can I learn more about these models? Are they open source?
measurablefunc 12 hours ago
With four parameters I can fit an elephant, and with five I can make him wiggle his trunk so there is still room for improvement.
- esafak 12 hours ago
 Except learning to reason is a far cry from curve fitting. Our brains have more than five parameters.
 - voxelghost 11 hours ago
 After a quick browse of the content, my understanding is that it's more like this: with a very compressed diff vector applied to a multi-billion-parameter model, the model could be 'retrained' to reason (score) better on a specific topic, e.g. math, which was used in the paper.
 - sdenton4 9 hours ago
 It's the statistics equivalent of 'no one needs more than 640kb of RAM'
 - nativeit 36 minutes ago
 My very first PC was a Packard Bell with 640KB of RAM. If I’d known, I’d have saved all my RAM for retirement…
 - ekuck 11 hours ago
 speak for yourself!
 - est 12 hours ago
 reasoning capability might just be some specific combination of mirror neurons.
 even some advanced math usually involves applying patterns found elsewhere to new topics
 - measurablefunc 11 hours ago
 I agree, I don't think gradient descent is going to work in the long run for the kind of luxurious & automated communist utopia the technocrats are promising everyone.
matt123456789 10 hours ago
Such low dimensionality of the LoRA vector must surely result in a close-to-linear modification to the KV calculation. This seems to me to imply that what we call "reasoning" is latent within the model. Pretty clear I didn't read the paper, I'm sure the authors address this.
- a-t-c-g 10 hours ago
 Yes - some degree of reasoning appears to be latent in the structure of language itself. But models trained explicitly on reasoning-focused data still perform better than models trained only on general corpora.*
 *At least up to 300B parameters, based on the models we’ve tested.
sachaa 8 hours ago
If 13 parameters can unlock better reasoning, then we will not be "training" models, we'll be steering them. Most of the capability is already there.
The real unlock isn’t TinyLoRA, it’s what this implies: ultra-cheap, continuous adaptation. The bottleneck shifts from compute to having a good reward signal.