Olmo releases its full datasets.<p>Nemotron releases only portions of some of its datasets, such as the source code dataset it pretrains on.<p>For example, from <a href="https://docs.nvidia.com/nemotron/latest/nemotron/super3/pretrain.html" rel="nofollow">https://docs.nvidia.com/nemotron/latest/nemotron/super3/pret...</a> :<p><pre><code> Open-source data coverage: The released datasets cover an estimated 8–10T tokens
(~40–50% of the internal 25T blend). Missing categories include code (~14% of blend),
nemotron-cc-code (~2%), crawl++ (~2%), and academic text (~2%). Users should
supplement with their own data for these categories and adjust train_iters
accordingly.
</code></pre>
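To make that gap concrete, here is a rough Python sketch of the arithmetic in the quoted note. The token counts and percentages are taken from the quote and are estimates. The function names are mine, and the train_iters formula assumes each iteration consumes global_batch_size * seq_length tokens, which you should check against the actual recipe config. The batch size and sequence length in the example call are made-up values.

```python
# Back-of-the-envelope numbers from the quoted note; all figures are estimates.
TOTAL_BLEND_T = 25.0        # internal blend, in trillions of tokens (from the quote)
RELEASED_T = (8.0, 10.0)    # released open data, low/high estimate (from the quote)

# Share of the blend for each category the note lists as missing.
MISSING_SHARE = {
    "code": 0.14,
    "nemotron-cc-code": 0.02,
    "crawl++": 0.02,
    "academic text": 0.02,
}


def missing_tokens_t(total_t=TOTAL_BLEND_T, shares=MISSING_SHARE):
    """Tokens (in trillions) you would need to source per named category."""
    return {name: share * total_t for name, share in shares.items()}


def shortfall_t(total_t=TOTAL_BLEND_T, released=RELEASED_T):
    """(low, high) gap in trillions between the full blend and the released data."""
    low_released, high_released = released
    return (total_t - high_released, total_t - low_released)


def train_iters(tokens, global_batch_size, seq_length):
    """Iterations needed to consume `tokens`, ASSUMING each iteration
    processes global_batch_size * seq_length tokens."""
    return int(tokens // (global_batch_size * seq_length))


if __name__ == "__main__":
    per_category = missing_tokens_t()
    for name, tokens_t in per_category.items():
        print(f"{name}: ~{tokens_t:.1f}T tokens")
    print(f"named categories total: ~{sum(per_category.values()):.1f}T")
    low, high = shortfall_t()
    print(f"overall shortfall: {low:.0f}-{high:.0f}T")
    # Hypothetical batch size and sequence length, for illustration only.
    print(train_iters(10e12, global_batch_size=3072, seq_length=8192))
```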
K2 Think V2 is another fully open model like Olmo, with its full datasets released.<p>Note that the Nemotron models are generally stronger than Olmo and K2 Think V2 (according to Artificial Analysis benchmarks), and their datasets overlap a lot: many are built from the same sources with different filtering, and Olmo and K2 Think V2 have both used some Nemotron datasets.<p>But yeah, Nemotron is a modern and fairly capable LLM. Even the 122B is more capable than DeepSeek R1 (a 671B model) on most benchmarks, and there's also the recently released 550B Ultra.<p>It does have a fully open training recipe; only some of the data is missing from its released datasets. If you want a fully open pipeline, it's a good place to start: you just need to find enough reasonably high-quality data to fill in the missing categories and get back up to the token count.