NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute

(qlabs.sh)

32 points by sdpmas1 hour ago

2 comments

archermarks4 minutes ago
Very cool idea. Interested to see how this progresses. One question: how worried are you about over-training on this particular dataset? i.e. instead of generalizing you lean more toward memorization? Obviously you leave out a validation set but since you're meta-optimizing the model itself by its performance on the validation dataset you're still at risk of over-fitting.
suddenlybananas44 minutes ago
Reminds me a fair bit of the BabyLM challenge. It would be good to give them a shout-out and see how this challenge differs.
- sdpmas27 minutes ago
  hey, it's Samip (behind the Slowrun repo). yeah that's a fair point, we will mention them in the blog. but there are a couple of major differences: 1. our emphasis is on using more compute to get better data efficiency. this is important because there are lots of hacky chances that will get lower loss, but when compared to general methods that leverage a lot of compute, they don't do so well. and you can already see how this emphasis on compute leads to different methods to BabyLM! 2. our reasoning behind the repo is not anything to do with how much data a child sees. and our dataset is not tailored towards that either. it's simple pretraining on random subset of the internet. we know there are better training algorithms that get lower loss on that data, and we are finding those.
  - soraki_soladead23 minutes ago
    also, BabyLM is more of a conference track / workshop than an open-repo competition which creates a different vibe