Dispersion loss counteracts embedding condensation in small language models

22 points by E-Reverance 3 hours ago

2 comments

aetherspawn 2 hours ago
It makes sense to me that distributing across more parameters results in models that can be quantized more heavily (information theory: more bits available).

I wonder if anyone has figured out how the information is compressed and calculated the amount of information an LLM can hold depending on its size
- woadwarrior01 1 hour ago
  > I wonder if anyone has figured out how the information is compressed and calculated the amount of information an LLM can hold depending on its size

  You might want to look at Physics of Language Models [1]. IIRC, the authors estimate it to be ~2 bits of factual knowledge per parameter.

  [1]: https://physics.allen-zhu.com/
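
To put the ~2 bits per parameter figure in perspective, here is a back-of-the-envelope sketch. It assumes the estimate scales linearly with parameter count, which is a simplification on my part and not something the comment claims; the function names and the example model sizes are made up for illustration.

```python
# Rough capacity arithmetic based on the "~2 bits of factual knowledge
# per parameter" figure quoted in the thread. Linear scaling is an
# assumption made for illustration only.

def estimated_capacity_bits(num_params: int, bits_per_param: float = 2.0) -> float:
    """Estimated factual-knowledge capacity in bits."""
    return num_params * bits_per_param


def bits_to_gigabytes(bits: float) -> float:
    """Convert bits to decimal gigabytes (1 GB = 1e9 bytes)."""
    return bits / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical model sizes, chosen only to show the scale.
    for label, n in [("100M", 100_000_000), ("1B", 1_000_000_000), ("7B", 7_000_000_000)]:
        bits = estimated_capacity_bits(n)
        print(f"{label} params: {bits:.3g} bits, about {bits_to_gigabytes(bits):.3g} GB")
```

Under that assumption, a 1B-parameter model would hold on the order of 2e9 bits, i.e. roughly 0.25 GB of factual knowledge.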
lwansbrough 2 hours ago
Anyone with a billion dollars want to try this and report back?
- nullc 2 hours ago
  From the paper, it appears that it's probably more useful on small-ish models.
  - lwansbrough 1 hour ago
    What does it cost to train a model like 1-bit Bonsai? Anyone know?