2 comments

  • aetherspawn 2 hours ago
    It makes sense to me that distributing information across more parameters results in models that can be quantized more heavily (information theory: more bits available).

    I wonder if anyone has figured out how the information is compressed and calculated the amount of information an LLM can hold depending on its size.
    • woadwarrior01 1 hour ago
      > I wonder if anyone has figured out how the information is compressed and calculated the amount of information an LLM can hold depending on its size

      You might want to look at Physics of Language Models [1]. IIRC, the authors estimate it to be ~2 bits of factual knowledge per parameter.

      [1]: https://physics.allen-zhu.com/
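      To put the ~2 bits per parameter figure in perspective, here is a rough back-of-the-envelope sketch. It treats the number recalled above as a flat constant, which is an assumption for illustration only; the function names and the example model sizes are made up here and do not come from the paper.

```python
# Assumption: the ~2 bits/parameter estimate quoted in the comment above,
# applied as a flat constant regardless of architecture or training setup.
BITS_PER_PARAM = 2.0


def knowledge_capacity_bits(num_params: int, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated factual-knowledge capacity in bits for a model with num_params parameters."""
    return num_params * bits_per_param


def bits_to_gigabytes(bits: float) -> float:
    """Convert bits to decimal gigabytes (1 GB = 1e9 bytes)."""
    return bits / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical model sizes, chosen only to show the scaling.
    for name, n in [("1B", 1_000_000_000), ("7B", 7_000_000_000), ("70B", 70_000_000_000)]:
        bits = knowledge_capacity_bits(n)
        print(f"{name} params: {bits:.2e} bits, roughly {bits_to_gigabytes(bits):.2f} GB of facts")
```

      Under that assumption the estimate is linear in parameter count, so a 1B-parameter model would hold on the order of 0.25 GB of factual knowledge.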
  • lwansbrough 2 hours ago
    Anyone with a billion dollars want to try this and report back?
    • nullc 2 hours ago
      From the paper it appears that it's probably more useful on small-ish models.
      • lwansbrough 1 hour ago
        What does it cost to train a model like 1-bit Bonsai? Anyone know?