How to scale RL to 10^26 FLOPs
A roadmap for RL-ing LLMs on the entire Internet
Jul 10, 2025
TLDR: Reinforcement learning (RL) is the next training technique for building frontier-level AI models. To make it better, we need to train on more data. The current approach of scaling many environments simultaneously is messy and complicated. Instead, I propose we find a way to do next-token prediction on the Web using RL. This way, we learn to reason from general web data, instead of just math and code.
I've spent a good part of the past year in denial.
I was in denial because when OpenAI released o1, and explained their paradigm of test-time compute, I thought it was a good idea but mostly a way to get better performance out of models of fixed size. After all, letting models "think for longer" by generating more tokens lets them do more internal computation.
The o1 release from OpenAI was the first demonstration of a new type of language model, one that could think for longer to generate better answers.
So I wasn't that surprised that these new models, termed reasoning models, gave better answers. And I especially wasn't surprised when I found out these answers mostly came on problems that inherently require lots of computation, like difficult math and engineering test questions.
Don't get me wrong: I always thought reasoning models were interesting. It's cool to me that they generate "thinking traces" before giving answers (although the thinking traces might not be very reliable). And it's amazing that the models were trained with reinforcement learning, a foundational technique in machine learning that was generally understood to be difficult to use effectively for real problems.
But I still thought of myself as a scale maximalist: all that really mattered, I thought, was training bigger models on more data. Anything else (read: reasoning models) appeared to be a coping mechanism, just a way to get by while we wait for the hardware needed to train bigger models.
I've spent the past few months working on RL research at Meta. It took a bit of time, but I've come full-circle: something far more nuanced is happening with reasoning models. RL isn't just a way to give models more compute. RL training really is teaching models something different: a way to use compute to generate better answers given finite model capacity. Through RL, models are clearly learning something that they're not getting from pretraining.
Two waves of AI scaling
The AI research-into-production cycle moves through a few distinct phases. First, we as a community identify a new learning paradigm. Second, we find the correct datasets for training and design evaluations to know when our models are getting better, and by how much. And third, we scale it to all hell.
This cycle has happened already, exactly once. Pretraining. It started with the innocuous observation that models can learn quite a lot when trained on internet text data using next-token prediction. We realized that this gives intelligence improvements in just about every domain. And then we scaled.
We spent 2022-2024 scaling language model pretraining: first making models bigger, and now working to pack as much knowledge as we could into models of various scales. We'll spend the next several years scaling post-training using RL.
And to be clear, pretraining research is ongoing. We're still figuring out how to scale our models via bigger datacenters, better hardware, and more efficient algorithms. And we're gathering more and better data every year. But the upper bound of pretraining performance is really clear: to build better models, we need to give them more parameters and train them on bigger datasets. This is what the AI labs have been working on for three years or so now.
But as the dust settles on the pretraining frenzy, reasoning models are showing us a new way to scale. We've found a way to make models better that's independent of the number of training tokens or model size.
The murky path to RL scaling starts with data
We've identified a new paradigm: learning to reason. But reasoning models are in their GPT-3 era: they're trained on small datasets to do a narrow selection of tasks. We have a brittle proof-of-concept in the reasoning models of 2025. These models have achieved state-of-the-art scores on a small number of tasks, mostly expert-level math and coding questions.
In the case of pretraining, the path to progress was very clear. Models can learn via next-token prediction on just about any data, so we could simply scrape the entire Web and feed it to the models. And once we'd done that, it became clear that our models were too small and we needed to make them much, much bigger.
But RL training is different. Let's briefly remind ourselves how RL works:
Models like o1 are trained with verifiable rewards, meaning that after thinking and generating answers, we teach models by encouraging them to produce more of the thoughts that led to correct answers, and fewer of the thoughts that led to incorrect answers. This is how RL algorithms like PPO (what o1 probably uses) and GRPO (the algorithm behind DeepSeek R1) work. They don't teach, they incentivize.
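To make "incentivize" concrete, here is a minimal sketch of the group-relative scoring at the heart of GRPO-style training. The function and the 0/1 correctness reward are illustrative assumptions, not any lab's actual implementation: sample several answers per prompt, score each with a verifier, and normalize each score against its group.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: compare each sampled answer's reward to the
    other answers drawn for the same prompt (group mean and std)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid dividing by zero when all rewards tie
    return [(r - mean) / std for r in rewards]

# Hypothetical group: 4 sampled reasoning traces for one math prompt,
# scored 1.0 if the final answer was correct, 0.0 otherwise.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

Traces with positive advantages get reinforced and traces with negative advantages get discouraged; nothing in the update says how to reason, only which sampled reasoning paid off.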
So clearly we can only train RL models on tasks where we can score answers based on correctness. This is the idea behind verifiability, an RL buzzword used to describe tasks with a well-defined automatic scoring function. (The o1-style RL training paradigm is usually called RLVR, reinforcement learning with verifiable rewards, to distinguish it from RLHF, reinforcement learning from human feedback.)
Unfortunately, most things aren't automatically verifiable. There aren't perfect computer programs that can tell you whether an essay is good, for example, or an explanation.
In fact, the things that we know how to automatically verify tend to be in the scientific domain. For example, OpenThoughts, a recently released dataset of training data for reasoning models, contains four categories: Code, Math, Science, and "Puzzle":
The recent OpenThoughts dataset contains verifiable tasks in Math, Science, Coding, as well as a small dataset of puzzles. (What is the Puzzle task, I wonder?)
Ok, so we can see that there are at least four domains that contain verifiable problems that we can train on. But there are many open problems here. Are those all the verifiable things that exist? Are they equally valuable? During training should we randomly alternate between them or train separate models and then average?
In fact, in typical RL setups, we don't even understand the marginal value of a single training example. One recent paper, Reinforcement Learning for Reasoning in Large Language Models with One Training Example, demonstrated that training on just a single example, with thousands of different reasoning attempts, can actually produce a very good model:
A recent paper showed close to on-par reasoning performance from just learning a reasoning process from a single training example.
It's also interesting to note the x-axis in the above graph: training only runs for 2,000 steps. And that's typical. Right now, these models are typically only trained for a few hundred or at most a few thousand steps. In the pretraining era, we often trained models on trillions of tokens, which meant millions of training steps.
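To put rough numbers on that gap (the token count and batch size here are assumptions for illustration, not any particular lab's recipe):

```python
# Back-of-the-envelope comparison; all numbers are assumed for illustration.
pretraining_tokens = 15e12   # ~15T training tokens
tokens_per_step = 4e6        # ~4M tokens per optimizer step
pretraining_steps = pretraining_tokens / tokens_per_step  # ~3.75 million steps
rl_steps = 2_000             # a typical RLVR run today

print(f"pretraining: ~{pretraining_steps:,.0f} steps")
print(f"RLVR:        ~{rl_steps:,} steps ({pretraining_steps / rl_steps:,.0f}x fewer)")
```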
This is mostly a compute issue: each step of RL requires sampling reasoning tokens, which is expensive and just plain difficult from a software perspective. The infrastructure to do this sort of thing is challenging and requires a lot of new engineering, since we aren't used to doing generation at all during training, let alone at each step.
Mark my words: before we know it, we'll be running millions of steps of RLVR too.
RL compute scales unevenly
There are many practical engineering problems that need to be solved to scale RL.
In the pretraining days, training was a very homogeneous workload, a continuous, almost real-time process: a batch of text passes through the model, we compute losses and backpropagate once, and then we queue up the next batch of text. This was simple and straightforward to optimize.
When we do RL, training infrastructure has to get more complicated. Gradient steps happen much less frequently and (depending on our chosen hyperparameters) we spend a lot more time generating thinking tokens.
Luckily, we've spent the last year or two making LLM inference super fast, and we can take advantage of these improvements here. In particular, there are two really good libraries for doing inference (SGLang and vLLM) that make this part ~10x faster than naive Python inference code. That helps a lot.
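As a sketch of what the generation half of an RL step looks like with one of these libraries (the model name and sampling settings below are placeholders; a real training loop would point this at the current policy checkpoint):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder policy model
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024, n=8)

prompts = ["Solve step by step, then give the final answer: what is 17 * 24?"]
outputs = llm.generate(prompts, sampling)

for request in outputs:
    # n sampled completions per prompt: the "group" that gets scored and
    # compared during the subsequent gradient step.
    for completion in request.outputs:
        print(completion.text)
```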
Another systems problem arises when we actually compute verifiable rewards. In the case of math problems, this is usually pretty easy. Most datasets have answers computed ahead of time, so we can simply check if the final answer is correct and score accordingly. (In practice, formatting makes this process slightly more complicated.)
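A minimal math verifier might look like the sketch below. The \boxed{} answer convention and the plain string comparison are assumptions about formatting; real pipelines do more normalization (fractions, units, whitespace) than this.

```python
import re

def extract_boxed_answer(response: str) -> str | None:
    """Pull the last \\boxed{...} answer out of a reasoning trace (assumed format)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def math_reward(response: str, reference_answer: str) -> float:
    """1.0 if the extracted final answer matches the precomputed reference, else 0.0."""
    answer = extract_boxed_answer(response)
    if answer is None:
        return 0.0  # formatting failure: nothing parsable to score
    return 1.0 if answer == reference_answer.strip() else 0.0

print(math_reward(r"17 * 24 = 408, so the answer is \boxed{408}", "408"))  # 1.0
```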
But in domains besides math, verification quickly becomes expensive. This is especially noticeable in the code domain, which is where a lot of AI labs are focusing their efforts right now.
Remember that each domain needs a domain-specific "verifier," a system that provides rewards that guide LLMs to generate better outputs. In the case of code, this usually involves running some code and scoring based on the code's output. Given an LLM-generated answer to a coding problem, we may need to run a bunch of unit tests and count how many of them pass to provide a reward.
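A toy version of such a verifier, as an illustration of the shape of the problem rather than how any lab actually does it: append each test to the generated solution, run it in a subprocess, and use the pass rate as the reward.

```python
import subprocess
import sys
import tempfile

def code_reward(solution_code: str, test_cases: list[str], timeout_s: float = 5.0) -> float:
    """Fraction of test snippets that run cleanly against the generated solution.
    Toy illustration only: real verifiers need sandboxing and resource limits."""
    passed = 0
    for test in test_cases:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(solution_code + "\n" + test)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
            passed += int(result.returncode == 0)
        except subprocess.TimeoutExpired:
            pass  # hung code counts as a failure
    return passed / len(test_cases)

# Hypothetical usage: an LLM-generated function plus assert-based tests.
solution = "def add(a, b):\n    return a + b"
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
print(code_reward(solution, tests))  # 1.0 if both tests pass
```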
There are a lot of people working on this right now: doing better and faster verification, running it in parallel, scaling it properly. In some cases, the training bottleneck isn't anything related to the model. It's not inference or backpropagation that slows things down, but the time it takes to compile and execute model-generated code.
The new DGX B200 GPU servers from NVIDIA cost $500K a pop and provide around 10^17 FLOPS for training or inference. Unfortunately, our systems for doing RL on LLMs are pretty primitive and can't get anywhere near this level of performance yet.
Since a single B200 server costs over $500K, any time training spends bottlenecked by CPU execution is a big waste of money.
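Some rough arithmetic on the numbers above shows what is at stake; this assumes perfect utilization, which no RL stack gets close to today.

```python
target_flops = 1e26   # the compute budget in the title
server_flops = 1e17   # rough peak throughput of one DGX B200 server, FLOP/s
seconds = target_flops / server_flops            # 1e9 seconds of fully-utilized compute

print(f"one server:    ~{seconds / (3600 * 24 * 365):.0f} years")   # ~32 years
print(f"1,000 servers: ~{seconds / 1000 / (3600 * 24):.0f} days")   # ~12 days
```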
One path to scaling RL is optimizing this kind of system: making verifiers faster and more reliable, optimizing the new generate-before-backprop training pipeline, and designing clever systems that let us scale RL across datacenters.
In addition to all these systems-level improvements, we'll need to build lots of new environments to try and learn diverse skills via RL training. Oh, and no one really knows what the right skills are to learn via RL, or how to combine them. So we'll have to try lots of different combinations and train many models. And we can try averaging them in different ways via some kind of model souping (also known as model merging). We'll just run lots of evaluations to find which combination of environments and souping produces the best model. This sounds difficult, doesn't it? And quite messy.
What if I told you there was another way?
What does it mean to be verifiable?
If we want to scale RL in verifiable settings, we should probably start by figuring out which things are verifiable in the first place. It's my feeling that people have been throwing this word around a lot, but there's no clear definition.
It all comes down to what we can train into the models. If we can check a model output and provide a score, that's good enough.
Wait, but isn't this how language modeling works already?
Pretraining for reasoning with next-token prediction
Before making my proposal, let me start with listing a few core tenets that I believe about the current state of AI:
- The only data we've found that really "works" (i.e. helps us build more intelligent models) is web-scale pretraining data. Ilya Sutskever famously compared all the human text on the internet to a reserve of fossil fuel: it's exceptionally useful, but finite.
- Reasoning, at its core, is a way to get better performance out of smaller models. It's not doing anything more magical. Crucially, we're getting limited new signal from the verifier itself; RL with verification is just a way to elicit capabilities that already exist within models. (This is a common belief about RL.)
- There is nothing special about math and code. These modalities happen to lie in a space that's difficult to model with a reward model (so prior approaches didn't work super well) but easy to verify. And we happen to care about them (automating coding seems especially valuable). But we should be able to learn to reason from any type of data.
- And finally, we haven't fully saturated models with Internet data. Today's models don't seem to have enough capacity to memorize the entire Internet. Additional pretraining on Web data should still give us a performance boost, and might be enough to learn to reason.
- Next-token prediction is verifiable. This is perhaps the central argument I'm making. The current strategy of checking whether a math problem has been answered correctly is spiritually no different from confirming whether a model has output the proper next tokens.
Putting all this together, I'm betting that the "right way" to scale RL is by unifying it with next-token prediction. We should teach models to reason by practicing reasoning at scale on the vast diversity of data available on the Web.
The proposed framework of learning to reason via next-token prediction.
This shows a comparison of the new paradigm demonstrated on a math problem. Normal next-token prediction is guessing which tokens come next. Typical RLVR allows the model to "think" for a few tokens and then rewards it for outputting the right thing. Our idea of reasoning with next-token prediction (RNTP) would allow the model to think and then reward it based on the next-token prediction loss of its outputs on the text that follows.
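To make RNTP concrete, here is a minimal sketch of one possible reward design: let the model emit a thinking span over a chunk of web text, have it commit to a guess at the next k tokens, and reward it by how many it gets right. The delimiters, the value of k, and exact-match scoring (rather than, say, log-likelihood of the true continuation) are all assumed design choices, not a settled recipe.

```python
def rntp_reward(predicted_tokens: list[str], reference_tokens: list[str]) -> float:
    """Score a think-then-predict rollout by its accuracy on the next k
    tokens of the actual document (one assumed reward design among many)."""
    correct = sum(p == r for p, r in zip(predicted_tokens, reference_tokens))
    return correct / len(reference_tokens)

# Hypothetical rollout on a web-text chunk.
# Document so far:     "The capital of France is"
# Hidden continuation: the next 5 tokens of the real document.
# The policy first emits a <think>...</think> span (never scored directly),
# then commits to its guess; the reward flows back through the thinking.
reference = ["Paris", ".", "It", "is", "known"]
prediction = ["Paris", ".", "It", "was", "known"]
print(rntp_reward(prediction, reference))  # 0.8
```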
Grok 2 was trained on more FLOPs than GPT-4. Apparently Grok 3, which came out later, was trained on around 10^26 FLOPs, which would put it above the top of this graph. But that was all supervised learning. How do we scale RL to use this much compute?
What do we even need RL for?
Now that we've stripped things down to their base components, it might not be obvious what benefit we get from doing reinforcement learning, if any.
The answer lies in the fact that we have no labels for the reasoning itself.
In other words, we're trying to get the model to learn to reason without knowing what reasoning should look like. We just sample lots of things from the model and encourage it to do the things that get rewards. If there was ground-truth reasoning, we could use the typical supervised training techniques to train the model to output the proper reasoning chains.
But in the real world, there's no ground truth for reasoning, so we can't do supervised learning. And in fact we want it this way; this is the magic of reinforcement learning. We're hoping that the model will discover reasoning chains that are more useful than anything we could ever write ourselves.
Scaling reasoning via next-token prediction
If you've read this far, and you agree this idea makes sense, you might be thinking about how it could be tricky to implement.
And in fact, you're right. This is where the research comes in. Almost all research that matters comes from figuring out how to implement and scale ideas that make sense from first principles.
For example: what exactly is the reward? Do we give the model a point for guessing a token correctly? Should we reward it more for multiple tokens in a row? Perhaps we use a string-similarity reward like BLEU score, as was common in machine translation in 2018. We could do some kind of self-evaluation, where a decent model can look at its own outputs and decide which ones should get rewards. Perhaps we filter tokens by entropy and use that to determine which to reason about. Or maybe we want to account for confidence in the reward, and give the model more points for being confidently correct.
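A few of those candidates are easy to write down side by side; these are sketches of design options, not answers, and the scales are made up.

```python
import math

def exact_match_reward(predicted: str, reference: str) -> float:
    """One point for guessing the single next token correctly."""
    return float(predicted == reference)

def run_length_reward(predicted: list[str], reference: list[str]) -> float:
    """Reward the length of the correctly-predicted prefix, so a long correct
    run is worth more than the same number of isolated lucky guesses."""
    run = 0
    for p, r in zip(predicted, reference):
        if p != r:
            break
        run += 1
    return float(run)

def confidence_weighted_reward(prob_of_true_token: float) -> float:
    """Log-likelihood of the true next token: confidently correct beats
    hesitantly correct, and confidently wrong is punished hardest."""
    return math.log(max(prob_of_true_token, 1e-12))

print(exact_match_reward("Paris", "Paris"))                            # 1.0
print(run_length_reward(["Paris", ".", "It"], ["Paris", ".", "In"]))   # 2.0
print(confidence_weighted_reward(0.9))                                 # ~ -0.11
```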
Another question: how many times should you "reason" within a single text chunk? One approach is to insert reasoning tokens at a random position per chunk. Or perhaps we allow models to reason multiple times throughout each chunk. But then we'd have to figure out how many times we can learn from a given text chunk with different reasoning patterns before memorization starts to occur.
There are additional difficulties that arise when switching from math and code to general reasoning. One reason we like math and code is because they're difficult for base models to do "from scratch" but often easy to learn via reasoning. This won't be the case with general text: some tokens are already extremely low-entropy, and therefore easy to predict; other tokens are nearly impossible, and will never be guessed correctly with any amount of reasoning.
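One way to cope with that spread, assuming we can cheaply query the base model's next-token distributions: measure per-position entropy and only spend reasoning on positions in a middle band, skipping tokens that are trivially predictable or hopelessly unpredictable. The band edges below are invented for illustration.

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def positions_worth_reasoning(per_position_probs, low=0.5, high=4.0):
    """Keep positions whose entropy falls in an assumed 'interesting' band:
    not already easy, not impossible."""
    return [i for i, probs in enumerate(per_position_probs)
            if low <= token_entropy(probs) <= high]

# Hypothetical next-token distributions from a base model at three positions.
chunk_probs = [
    [0.98, 0.01, 0.01],      # near-deterministic token: no reasoning needed
    [0.4, 0.3, 0.2, 0.1],    # genuinely uncertain: a good place to think
    [0.01] * 100,            # flat over ~100 candidates: likely hopeless
]
print(positions_worth_reasoning(chunk_probs))  # [1]
```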
Even the best pretraining datasets are still a long way from perfect.
Andrej Karpathy recently noted that if you actually look at samples from a typical pretraining dataset, they're quite ugly, and there is a lot of obvious noise. One benefit of scale is that it irons out many of these low-level idiosyncrasies: after a lot of training, much of this noise gets averaged away. It's possible that this would happen with my proposed RL training scheme. If we train for long enough, on enough tokens, we might not even care what the exact reward or reasoning schema looks like.
But wait, didn't somebody try this already?
Those among us who diligently trawl arXiv for the latest nuggets of progress might recognize that someone proposed something like this in a recently released preprint (Reinforcement Pre-Training). This research was praised on Twitter (the title sounds important, and the figure is funny!) but disappointed a lot of researchers:
Headline figure from the recent "Reinforcement Pre-Training" paper, which also proposes the idea of pretraining for RL via next-token prediction.
To be more specific, this paper proposed something similar to what I'm advocating for: using large unlabeled text datasets and next-token prediction to scale RL! And it has pretraining in the name, just like I was describing.
Alas, it turns out to be a classic case of academic title-grabbing. What the paper actually does is very specific: they finetune a single model with chain-of-thought to improve single-token outputs for some multiple-choice questions. They're not actually doing pretraining, just finetuning, and they train on a small subset of questions from a single math dataset. There aren't a lot of comparisons to any of the other RLVR papers, so it's hard to tell whether this thing even works, and if so, when and how well.
Normally I'd file this type of work away as a sort of negative result: if a simpler and more general setting worked, they surely would have tried it in this paper, right? But that's exactly what I don't think we should do. My overall point in this piece is that if something makes sense from first principles, we should keep working on it until we work out all the kinks.
Making good ideas work often turns out to require significantly more labor than academic researchers expect from a single project. But this is the price of progress.
What's next?
It's very exciting to me that (a) RL works and (b) no one knows the right way to do it. There is so much opportunity here. One way or another, we will have much better reasoning models in a year or two; it's just that the path is unclear. Before we can see with clarity, we have a lot to learn. If reasoning via next-token prediction turns out to really be the right way to scale RL, we're going to need to answer all these questions and many more.
Thanks to my friends Wenting Zhao, Will Brown, and Nishanth Kumar for reading this blog post early and providing helpful feedback.