Would it make sense to embed such a single-purpose network with fixed weights within an LLM before pre-training?
Not sure how much this fits into the rules, but I saw someone on Twitter claim 28 params: <a href="https://gist.github.com/SeuperHakkerJa/da3050739bea97aabd86ee0d7d5ef689" rel="nofollow">https://gist.github.com/SeuperHakkerJa/da3050739bea97aabd86e...</a>
> In short: if you can swap in a different set of weights and use the exact same inference code for a different task, your setup is legitimate. If the inference code is inseparable from the algorithm, it's not.

I wonder why they don't just write the inference code themselves, so that by design the focus stays on the model.
>=99% accuracy wtf?!?

I was initially excited until I saw that, because it would have revealed some sort of required minimum capacity. The further revelation that this was all vibe-coded with no arXiv paper makes me feel I should save my attn for another article.
So, hand-coded weights can do it with 36 params, versus 311 for trained weights. Did anyone try the former architecture, but starting from random weights and training it?
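Trying that is a few lines in any framework. A minimal NumPy sketch of the idea, where the XOR toy task and the 2-4-1 layer sizes are stand-ins (the thread's actual task and 36-param architecture live in the linked gist):

```python
import numpy as np

# Toy stand-in task: XOR. The real task/architecture are in the gist.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 4))   # random init, not hand-coded
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1))
b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)
    out = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # sigmoid output
    return h, out

lr = 0.5
losses = []
for step in range(2000):
    h, out = forward(X)
    losses.append(np.mean((out - y) ** 2))
    # Backprop: MSE -> sigmoid -> linear -> tanh -> linear
    d_out = 2 * (out - y) / len(X) * out * (1 - out)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (1 - h ** 2)
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print(losses[0], losses[-1])  # loss should drop well below chance (0.25)
```

Whether random init reliably finds the hand-coded solution at exactly 36 params is the interesting open question; small nets often get stuck in poor local minima.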
You can do that in a single matmul, of course.
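E.g., with one-hot inputs, any finite input-to-output map collapses into one matmul against a stored table (a toy sketch; the 4-row table here is hypothetical, not the contest's actual task):

```python
import numpy as np

# Hypothetical lookup table: the "weights" just store the answers.
table = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

def lookup(i):
    one_hot = np.eye(4)[i]   # encode the input index as a one-hot vector
    return one_hot @ table   # the single matmul selects row i

print(lookup(2))  # -> [1. 0.], i.e. row 2 of the table
```

Which is exactly why the quoted rule requires that the same inference code work with swapped-in weights: otherwise param counts can be gamed by baking the task into the encoding.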
Now wrap it all in an Electron app!