7 comments

  • Tiberium 7 hours ago
    Last update over a year ago, so I hope (2025) gets added to the title:

    > [2025/05/26] (Step 1 completed!) We release Mixture-of-Thoughts--a curated reasoning dataset of 350k verified traces distilled from R1. The dataset spans tasks in mathematics, coding, and science, and is designed to teach language models to reason step-by-step. We also provide a recipe to train OpenR1-Distill-7B, which replicates the reasoning capabilities of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B and marks the completion of step 1 in the Open R1 project.

    Doesn't look like they managed to actually reproduce R1; they stopped at Step 1 of their 3-step plan.
    • spmurrayzzz 7 hours ago
      One of my favorite code comments of all time is still in the src:

      "# TODO: implement a proper validator to compare against ground truth. For now we just check for exact string match on each line of stdout." [1]

      This was one of my chief complaints about the entire R1 news cycle: it felt like no one actually read the technical report. They were being heralded for their openness, but they left out the most meaningful details that you'd need to reproduce their work.

      [1] https://github.com/huggingface/open-r1/blob/1416fa0cf21595d2083b399a2a0bbddd7f6e9563/src/open_r1/rewards.py#L552
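To make concrete why an exact-match check is a weak validator, here is a minimal sketch contrasting a strict line-by-line stdout comparison with a slightly more forgiving one. This is not the open-r1 code: both function names are hypothetical, and the tolerance value is an assumption chosen only for illustration.

```python
import math


def exact_line_match(stdout: str, expected: str) -> bool:
    """Strict check: after trimming outer whitespace, every line must be identical."""
    return stdout.strip().splitlines() == expected.strip().splitlines()


def _tokens_match(a: str, b: str, tol: float) -> bool:
    """Tokens match if they are equal as text, or both parse as floats that are close."""
    if a == b:
        return True
    try:
        return math.isclose(float(a), float(b), rel_tol=tol, abs_tol=tol)
    except ValueError:
        # At least one token is not numeric, and the text differed.
        return False


def tolerant_line_match(stdout: str, expected: str, tol: float = 1e-6) -> bool:
    """Looser check (hypothetical): ignore spacing within lines, compare numbers within tol."""
    out_lines = [line.split() for line in stdout.strip().splitlines()]
    exp_lines = [line.split() for line in expected.strip().splitlines()]
    if len(out_lines) != len(exp_lines):
        return False
    for out_tokens, exp_tokens in zip(out_lines, exp_lines):
        if len(out_tokens) != len(exp_tokens):
            return False
        if not all(_tokens_match(o, e, tol) for o, e in zip(out_tokens, exp_tokens)):
            return False
    return True
```

With these definitions, a program that prints `3.0` where the ground truth is `3` fails the strict check but passes the tolerant one, which is the kind of false negative an exact string match produces.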
      • neutronicus 6 hours ago
        Reminds me of my days in a computational physics PhD program.
        • devmor 2 hours ago
          I had some contract work years ago helping astrophysicists in a PhD program write scripts so their research algorithms would finish computing their data before their great-grandchildren retire. One of the action items was never implemented because someone wrote an “…etc” in the middle of the math.

          They knew that everything was verifiably correct on the input, and verifiably correct on the output, and they swore they had figured it out at one point and just never wrote it down. I was asked if I could “just extrapolate it” and had to explain that computer programs work the same as math: I can give you a literally infinite number of ways to reach an output from an input.
      • khazhoux 4 hours ago
        > This was one of my chief complaints about the entire R1 news cycle

        For me it was the headline that a group of students replicated GPT-3 for $5000.
  • aesthesia 6 hours ago
    If you really want to see fully open training pipelines for modern LLMs, Olmo and to a lesser extent Nemotron are what you should look at.

    https://github.com/allenai/OLMo

    https://github.com/NVIDIA-NeMo/Nemotron
    • achrono 45 minutes ago
      After my own very exhaustive survey, I can just say '+1'. It's also good to note that OLMo has actually had one independent reproduction (albeit not open) done: https://www.amd.com/en/developer/resources/technical-articles/introducing-the-first-amd-1b-language-model.html

      I often wonder why OLMo and Nemotron aren't more popular -- they are the gold standard / "frontier" of a year ago. If we had more support behind these, seeing a *true* open-source AI system that legitimately challenges OpenAI & Anthropic might not be far away!
    • spijdar 5 hours ago
      I'm not really familiar with either, but I'm more familiar with Olmo. My impression is Nemotron is newer -- why is it less applicable? Is it not totally open like Olmo?
      • lambda 5 hours ago
        Olmo releases their full datasets.

        Nemotron only releases portions of some of their datasets, like the source code dataset that they pretrain on.

        For example, from https://docs.nvidia.com/nemotron/latest/nemotron/super3/pretrain.html :

        > Open-source data coverage: The released datasets cover an estimated 8–10T tokens (~40–50% of the internal 25T blend). Missing categories include code (~14% of blend), nemotron-cc-code (~2%), crawl++ (~2%), and academic text (~2%). Users should supplement with their own data for these categories and adjust train_iters accordingly.

        K2 Think V2 is another fully open model like Olmo, with full datasets released.

        Note that the Nemotron models are generally stronger than Olmo and K2 Think V2 (according to Artificial Analysis benchmarks), and there is a lot of overlap in their datasets: many datasets are based on the same sources with different filtering, and Olmo and K2 Think V2 have both used some Nemotron datasets.

        But yeah, Nemotron is a modern and fairly capable LLM. Even the 122b is more capable than DeepSeek R1 (a 671b model) on most benchmarks, and there's also the recently released 550b Ultra now.

        It does have a fully open training recipe, with just some data missing from its datasets. If you want a fully open pipeline, it's going to be a good place to start; you just need to find enough reasonably high-quality data to fill in the datasets and get up to the token count.
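As a rough illustration of the "adjust train_iters accordingly" advice in the quoted docs, the sketch below computes the token shortfall and an iteration count from a token budget. Only the 8–10T and 25T figures come from the quote; the function names, the global batch size of 1024 sequences, and the sequence length of 8192 are hypothetical placeholders, not values from the Nemotron recipe.

```python
TRILLION = 10**12


def token_shortfall(target_tokens: int, available_tokens: int) -> int:
    """Tokens you would still need to source to reach the target blend size."""
    return max(target_tokens - available_tokens, 0)


def train_iters_for(token_budget: int, global_batch_size: int, seq_len: int) -> int:
    """Number of full optimizer steps that fit in a token budget.

    Assumes every step consumes global_batch_size sequences of seq_len tokens.
    """
    tokens_per_iter = global_batch_size * seq_len
    if tokens_per_iter <= 0:
        raise ValueError("global_batch_size and seq_len must be positive")
    return token_budget // tokens_per_iter


if __name__ == "__main__":
    # Hypothetical training shape, chosen only for illustration.
    GLOBAL_BATCH_SIZE = 1024
    SEQ_LEN = 8192

    internal_blend = 25 * TRILLION  # figure from the quoted docs
    for released in (8 * TRILLION, 10 * TRILLION):  # range from the quoted docs
        missing = token_shortfall(internal_blend, released)
        iters = train_iters_for(released, GLOBAL_BATCH_SIZE, SEQ_LEN)
        print(f"released={released:,} missing={missing:,} train_iters={iters:,}")
```

Training only on the released data means shrinking the iteration count to match the smaller budget; topping the blend back up with your own data moves it back toward the full-blend figure.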
  • madiator 7 hours ago
    Check out OpenThoughts. It has a widely used dataset, a model that beats DeepSeek's smaller reasoning models, and a paper that talks in detail about the data curation methodology.

    https://www.open-thoughts.ai/
    • lambda 4 hours ago
      Oh, neat, I hadn't heard of that.

      From the blog, it looks like there hasn't been much progress for a few months, but if you check their HF it looks like they uploaded a series of 32B models a few days ago, trained on top of Qwen3 32B with different numbers of training examples: https://huggingface.co/collections/open-thoughts/openthinker-agent-complete

      So it looks a little more research-oriented than intended for production use, but it's still neat to see this effort.
    • yogthos 6 hours ago
      neat
  • yieldcrv 6 hours ago
    Too old now
  • christkv 6 hours ago
    What is the estimated cost these days to train something like this to conclusion?
    • lambda 3 hours ago
      DeepSeek themselves claimed that R1 cost $294k to train. Folks are skeptical of how low that is, however.

      https://www.techspot.com/news/109542-rare-disclosure-deepseek-claims-r1-model-training-cost.html

      Olmo 3 claims that if they had paid market rates for their training, it would have cost $2.75m. It was trained by a non-profit and probably had some of the compute donated, which is why they have to estimate.

      https://arxiv.org/pdf/2512.13961

      So, probably somewhere between hundreds of thousands and tens of millions.
  • poppafuze 5 hours ago
    "This will likely involve curating new, large-scale datasets for math, reasoning, and code." Everybody likes to hand-wave on this.
  • RedMagicBox 6 hours ago
    [dead]