2 comments
It makes sense to me that distributing information across more parameters results in models that can be quantized more heavily (information theory: more bits available).<p>I wonder if anyone has figured out how the information is compressed and calculated the amount of information an LLM can hold depending on its size
> I wonder if anyone has figured out how the information is compressed and calculated the amount of information an LLM can hold depending on its size<p>You might want to look at Physics of Language Models[1]. IIRC, the authors estimate it to be ~2 bits of factual knowledge per parameter.<p>[1]: <a href="https://physics.allen-zhu.com/" rel="nofollow">https://physics.allen-zhu.com/</a>
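To put rough numbers on that, here is a back-of-the-envelope sketch. It takes the ~2 bits per parameter figure recalled above as an assumption and compares it with the raw bits a model occupies at a given weight precision. The function names and the example model sizes are made up for illustration, and the ratio is only a crude headroom indicator, not a claim about how far any real model can be quantized.

```python
# Back-of-the-envelope sketch: how much "factual knowledge" a model might hold
# if the ~2 bits/parameter estimate applies. That figure is an assumption taken
# from the comment above, not something this code measures.
ASSUMED_KNOWLEDGE_BITS_PER_PARAM = 2.0


def knowledge_capacity_bits(num_params, bits_per_param=ASSUMED_KNOWLEDGE_BITS_PER_PARAM):
    """Estimated knowledge capacity in bits for a model with num_params parameters."""
    return num_params * bits_per_param


def bits_to_gigabytes(bits):
    """Convert bits to gigabytes (1 GB = 10**9 bytes)."""
    return bits / 8 / 1e9


def capacity_fraction(bits_per_weight, bits_per_param=ASSUMED_KNOWLEDGE_BITS_PER_PARAM):
    """Share of the stored weight bits that the estimated knowledge would occupy."""
    return bits_per_param / bits_per_weight


if __name__ == "__main__":
    # Hypothetical model sizes, chosen only for illustration.
    for num_params in (1e9, 7e9, 70e9):
        capacity_gb = bits_to_gigabytes(knowledge_capacity_bits(num_params))
        print(f"{num_params / 1e9:.0f}B params: ~{capacity_gb:.2f} GB of knowledge")

    # Lower precision leaves less slack between stored bits and estimated knowledge.
    for bits_per_weight in (16, 8, 4):
        print(f"{bits_per_weight}-bit weights: {capacity_fraction(bits_per_weight):.1%} of stored bits")
```

Under that assumption, 16-bit weights store far more raw bits than the estimated knowledge needs, which fits the intuition that there is slack for quantization to remove.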
Anyone with a billion dollars want to try this and report back?