Leopold Aschenbrenner
SITUATIONAL AWARENESS
The Decade Ahead
JUNE 2024
Dedicated to Ilya Sutskever.
While I used to work at OpenAI, all of this is based on publicly-
available information, my own ideas, general field-knowledge, or
SF-gossip.
Thank you to Collin Burns, Avital Balwit, Carl Shulman, Jan Leike,
Ilya Sutskever, Holden Karnofsky, Sholto Douglas, James Bradbury,
Dwarkesh Patel, and many others for formative discussions. Thank
you to many friends for feedback on earlier drafts. Thank you to
Joe Ronan for help with graphics, and Nick Whitaker for publishing
help.
situational-awareness.ai
leopold@situational-awareness.ai
Updated June 6, 2024
San Francisco, California
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion
compute clusters to $100 billion clusters to trillion-dollar clusters. Ev-
ery six months another zero is added to the boardroom plans. Behind
the scenes, there’s a fierce scramble to secure every power contract
still available for the rest of the decade, every voltage transformer
that can possibly be procured. American big business is gearing up
to pour trillions of dollars into a long-unseen mobilization of Amer-
ican industrial might. By the end of the decade, American electricity
production will have grown tens of percent; from the shale fields of
Pennsylvania to the solar farms of Nevada, hundreds of millions of
GPUs will hum.
The AGI race has begun. We are building machines that can think
and reason. By 2025/26, these machines will outpace college grad-
uates. By the end of the decade, they will be smarter than you or I;
we will have superintelligence, in the true sense of the word. Along
the way, national security forces not seen in half a century will be un-
leashed, and before long, The Project will be on. If we’re lucky, we’ll
be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer
of what is about to hit them. Nvidia analysts still think 2024 might
be close to the peak. Mainstream pundits are stuck on the willful
blindness of “it’s just predicting the next word”. They see only hype
and business-as-usual; at most they entertain another internet-scale
technological change.
Before long, the world will wake up. But right now, there are perhaps
a few hundred people, most of them in San Francisco and the AI
labs, that have situational awareness. Through whatever peculiar forces
of fate, I have found myself amongst them. A few years ago, these
people were derided as crazy—but they trusted the trendlines, which
allowed them to correctly predict the AI advances of the past few
years. Whether these people are also right about the next few years
remains to be seen. But these are very smart people—the smartest
people I have ever met—and they are the ones building this technol-
ogy. Perhaps they will be an odd footnote in history, or perhaps they
will go down in history like Szilard and Oppenheimer and Teller. If
they are seeing the future even close to correctly, we are in for a wild
ride.
Let me tell you what we see.
Contents

Introduction
History is live in San Francisco.

I. From GPT-4 to AGI: Counting the OOMs
AGI by 2027 is strikingly plausible. GPT-2 to GPT-4 took us from
~preschooler to ~smart high-schooler abilities in 4 years. Tracing
trendlines in compute (~0.5 orders of magnitude or OOMs/year),
algorithmic efficiencies (~0.5 OOMs/year), and “unhobbling” gains (from
chatbot to agent), we should expect another preschooler-to-high-schooler-
sized qualitative jump by 2027.

II. From AGI to Superintelligence: the Intelligence Explosion
AI progress won’t stop at human-level. Hundreds of millions of AGIs
could automate AI research, compressing a decade of algorithmic progress
(5+ OOMs) into 1 year. We would rapidly go from human-level to vastly
superhuman AI systems. The power—and the peril—of superintelligence
would be dramatic.

III. The Challenges

IIIa. Racing to the Trillion-Dollar Cluster
The most extraordinary techno-capital acceleration has been set in
motion. As AI revenue grows rapidly, many trillions of dollars will go
into GPU, datacenter, and power buildout before the end of the decade.
The industrial mobilization, including growing US electricity production
by 10s of percent, will be intense.

IIIb. Lock Down the Labs: Security for AGI
The nation’s leading AI labs treat security as an afterthought.
Currently, they’re basically handing the key secrets for AGI to the CCP
on a silver platter. Securing the AGI secrets and weights against the
state-actor threat will be an immense effort, and we’re not on track.

IIIc. Superalignment
Reliably controlling AI systems much smarter than we are is an unsolved
technical problem. And while it is a solvable problem, things could very
easily go off the rails during a rapid intelligence explosion. Managing
this will be extremely tense; failure could easily be catastrophic.

IIId. The Free World Must Prevail
Superintelligence will give a decisive economic and military advantage.
China isn’t at all out of the game yet. In the race to AGI, the free
world’s very survival will be at stake. Can we maintain our preeminence
over the authoritarian powers? And will we manage to avoid
self-destruction along the way?

IV. The Project
As the race to AGI intensifies, the national security state will get
involved. The USG will wake from its slumber, and by 27/28 we’ll get
some form of government AGI project. No startup can handle
superintelligence. Somewhere in a SCIF, the endgame will be on.

V. Parting Thoughts
What if we’re right?

Appendix
I. From GPT-4 to AGI: Counting the OOMs
AGI by 2027 is strikingly plausible. GPT-2 to GPT-4 took
us from ~preschooler to ~smart high-schooler abilities in
4 years. Tracing trendlines in compute (~0.5 orders of magni-
tude or OOMs/year), algorithmic efficiencies (~0.5 OOMs/year),
and “unhobbling” gains (from chatbot to agent), we should
expect another preschooler-to-high-schooler-sized qualitative
jump by 2027.
Look. The models, they just want to learn. You have to
understand this. The models, they just want to learn.
Ilya Sutskever
(circa 2015, via Dario Amodei)
GPT-4’s capabilities came as a shock to many: an AI system
that could write code and essays, could reason through difficult
math problems, and ace college exams. A few years ago, most
thought these were impenetrable walls.
But GPT-4 was merely the continuation of a decade of break-
neck progress in deep learning. A decade earlier, models could
barely identify simple images of cats and dogs; four years ear-
lier, GPT-2 could barely string together semi-plausible sen-
tences. Now we are rapidly saturating all the benchmarks we
can come up with. And yet this dramatic progress has merely
been the result of consistent trends in scaling up deep learning.
There have been people who have seen this for far longer. They
were scoffed at, but all they did was trust the trendlines. The
trendlines are intense, and they were right. The models, they
just want to learn; you scale them up, and they learn more.
I make the following claim: it is strikingly plausible that
by 2027, models will be able to do the work of an AI re-
searcher/engineer. That doesn’t require believing in sci-fi; it
just requires believing in straight lines on a graph.
Figure 1: Rough estimates of past and
future scaleup of effective compute
(both physical compute and algorith-
mic efficiencies), based on the public
estimates discussed in this piece. As
we scale models, they consistently get
smarter, and by “counting the OOMs”
we get a rough sense of what model
intelligence we should expect in the
(near) future. (This graph shows only
the scaleup in base models; “unhob-
blings” are not pictured.)
In this piece, I will simply “count the OOMs” (OOM = order
of magnitude, 10x = 1 order of magnitude): look at the trends
in 1) compute, 2) algorithmic efficiencies (algorithmic progress
that we can think of as growing “effective compute”), and 3)
“unhobbling” gains (fixing obvious ways in which models are
hobbled by default, unlocking latent capabilities and giving
them tools, leading to step-changes in usefulness). We trace
the growth in each over four years before GPT-4, and what we
should expect in the four years after, through the end of 2027.
Given deep learning’s consistent improvements for every OOM
of effective compute, we can use this to project future progress.
Publicly, things have been quiet for a year since the GPT-4 re-
lease, as the next generation of models has been in the oven—
leading some to proclaim stagnation and that deep learning is
hitting a wall.1 But by counting the OOMs, we get a peek at
what we should actually expect.

1 Predictions they’ve made every year for the last decade, and which
they’ve been consistently wrong about. . .
The upshot is pretty simple. GPT-2 to GPT-4—from models
that were impressive for sometimes managing to string to-
gether a few coherent sentences, to models that ace high-school
exams—was not a one-time gain. We are racing through the
OOMs extremely rapidly, and the numbers indicate we should
expect another ~100,000x effective compute scaleup—resulting
in another GPT-2-to-GPT-4-sized qualitative jump—over four
years. Moreover, and critically, that doesn’t just mean a better
chatbot; picking the many obvious low-hanging fruit on “un-
hobbling” gains should take us from chatbots to agents, from a
tool to something that looks more like drop-in remote worker
replacements.
While the inference is simple, the implication is striking. An-
other jump like that very well could take us to AGI, to models
as smart as PhDs or experts that can work beside us as cowork-
ers. Perhaps most importantly, if these AI systems could auto-
mate AI research itself, that would set in motion intense feed-
back loops—the topic of the next piece in the series.
Even now, barely anyone is pricing all this in. But situational
awareness on AI isn’t actually that hard, once you step back
and look at the trends. If you keep being surprised by AI capa-
bilities, just start counting the OOMs.
The last four years
We have machines now that we can basically talk to like
humans. It’s a remarkable testament to the human capacity to
adjust that this seems normal, that we’ve become inured to the
pace of progress. But it’s worth stepping back and looking at
the progress of just the last few years.
GPT-2 to GPT-4
Let me remind you of how far we came in just the ~4 (!) years
leading up to GPT-4.
GPT-2 (2019) ~ preschooler: “Wow, it can string together a few
plausible sentences.” A very-cherry-picked example of a semi-
coherent story about unicorns in the Andes it generated was
incredibly impressive at the time. And yet GPT-2 could barely
count to 5 without getting tripped up;2 when summarizing
an article, it just barely outperformed selecting 3 random
sentences from the article.3

2 From SSC: “Janelle Shane asks GPT-2 its ten favorite animals:
Prompt: My 10 favorite animals are: 1.
My ten favorite animals are:
1. Zebras with a white scar on the back
2. Insiduous spiders and octopus
3. Frog with large leaves, hopefully black
4. Cockatiel with scales
5. Razorbill with wings hanging about 4 inches from one’s face and a heart tattoo on a frog
3. Cockatric interlocking tetrabods that can be blind, cut, and eaten raw:
4. Black and white desert crocodiles living in sunlight
5. Zebra and many other pea bugs”

3 From the GPT-2 paper, Section 3.6.
Figure 2: Some examples of what
people found impressive about GPT-
2 at the time. Left: GPT-2 does an
ok job on extremely basic reading
comprehension questions. Right: In
a cherry-picked sample (best of 10
tries), GPT-2 can write a semi-coherent
paragraph that says some semi-relevant
things about the Civil War.
Comparing AI capabilities with human intelligence is difficult
and flawed, but I think it’s informative to consider the analogy
here, even if it’s highly imperfect. GPT-2 was shocking for its
command of language, and its ability to occasionally generate a
semi-cohesive paragraph, or occasionally answer simple factual
questions correctly. It’s what would have been impressive for a
preschooler.
GPT-3 (2020)4 ~ elementary schooler: “Wow, with just some few-
shot examples it can do some simple useful tasks.” It started
being cohesive over even multiple paragraphs much more
consistently, and could correct grammar and do some very basic
arithmetic. For the first time, it was also commercially useful in
a few narrow ways: for example, GPT-3 could generate simple
copy for SEO and marketing.

4 I mean clunky old GPT-3 here, not the dramatically-improved GPT-3.5
you might know from ChatGPT.
Figure 3: Some examples of what
people found impressive about GPT-
3 at the time. Top: After a simple
instruction, GPT-3 can use a made-up
word in a new sentence. Bottom-left:
GPT-3 can engage in rich storytelling
back-and-forth. Bottom-right: GPT-3
can generate some very simple code.
Again, the comparison is imperfect, but what impressed peo-
ple about GPT-3 is perhaps what would have been impressive
for an elementary schooler: it wrote some basic poetry, could
tell richer and coherent stories, could start to do rudimentary
coding, could fairly reliably learn from simple instructions and
demonstrations, and so on.
GPT-4 (2023) ~ smart high schooler: “Wow, it can write pretty so-
phisticated code and iteratively debug, it can write intelligently
and sophisticatedly about complicated subjects, it can reason
through difficult high-school competition math, it’s beating the
vast majority of high schoolers on whatever tests we can give
it, etc.” From code to math to Fermi estimates, it can think and
reason. GPT-4 is now useful in my daily tasks, from helping
write code to revising drafts.
Figure 4: Some of what people found
impressive about GPT-4 when it was re-
leased, from the “Sparks of AGI” paper.
Top: It’s writing very complicated code
(producing the plots shown in the mid-
dle) and can reason through nontrivial
math problems. Bottom-left: Solving an
AP math problem. Bottom-right: Solv-
ing a fairly complex coding problem.
More interesting excerpts from that
exploration of GPT-4’s capabilities here.
On everything from AP exams to the SAT, GPT-4 scores better
than the vast majority of high schoolers.
Of course, even GPT-4 is still somewhat uneven; for some tasks
it’s much better than smart high-schoolers, while there are
other tasks it can’t yet do. That said, I tend to think most of
these limitations come down to obvious ways models are still
hobbled, as I’ll discuss in-depth later. The raw intelligence
is (mostly) there, even if the models are still artificially con-
strained; it’ll take extra work to unlock models being able to
fully apply that raw intelligence across applications.
Figure 5: Progress over just four years.
Where are you on this line?
The trends in deep learning
The pace of deep learning progress in the last decade has sim-
ply been extraordinary. A mere decade ago it was revolution-
ary for a deep learning system to identify simple images. To-
day, we keep trying to come up with novel, ever harder tests,
and yet each new benchmark is quickly cracked. It used to take
decades to crack widely-used benchmarks; now it feels like
mere months.
We’re literally running out of benchmarks. As an anecdote, my
friends Dan and Collin made a benchmark called MMLU a few
years ago, in 2020. They hoped to finally make a benchmark
that would stand the test of time, equivalent to all the hardest
exams we give high school and college students. Just three
years later, it’s basically solved: models like GPT-4 and Gemini
get ~90%.

Figure 6: Deep learning systems are rapidly reaching or exceeding
human-level in many domains. Graphic: Our World in Data
More broadly, GPT-4 mostly cracks all the standard high school
and college aptitude tests (Figure 7).5

5 And no, these tests aren’t in the training set. AI labs put real effort
into ensuring these evals are uncontaminated, because they need good
measurements in order to do good science. A recent analysis on this by
ScaleAI confirmed that the leading labs aren’t overfitting to the
benchmarks (though some smaller LLM developers might be juicing their
numbers).
Or consider the MATH benchmark, a set of difficult mathemat-
ics problems from high-school math competitions.6 When the
benchmark was released in 2021, GPT-3 only got ~5% of problems
right. And the original paper noted: “Moreover, we find
that simply increasing budgets and model parameter counts
will be impractical for achieving strong mathematical reasoning
if scaling trends continue [...]. To have more traction on
mathematical problem solving we will likely need new algorithmic
advancements from the broader research community”—we
would need fundamental new breakthroughs to solve MATH,
or so they thought. A survey of ML researchers predicted minimal
progress over the coming years (Figure 8);7 and yet within
just a year (by mid-2022), the best models went from ~5% to
50% accuracy; now, MATH is basically solved, with recent
performance over 90%.

6 In the original paper, it was noted: “We also evaluated humans on
MATH, and found that a computer science PhD student who does not
especially like mathematics attained approximately 40% on MATH, while a
three-time IMO gold medalist attained 90%, indicating that MATH can be
challenging for humans as well.”

7 A coauthor notes: “When our group first released the MATH dataset, at
least one [ML researcher colleague] told us that it was a pointless
dataset because it was too far outside the range of what ML models could
accomplish (indeed, I was somewhat worried about this myself).”
Figure 7: GPT-4 scores on standardized
tests. Note also the large jump from
GPT-3.5 to GPT-4 in human percentile
on these tests, often from well below
the median human to the very top of
the human range. (And this is GPT-3.5,
a fairly recent model released less than
a year before GPT-4, not the clunky old
elementary-school-level GPT-3 we were
talking about earlier!)
Figure 8: Gray: Professional forecasts,
made in August 2021, for June 2022
performance on the MATH benchmark
(difficult mathematics problems from
high-school math competitions). Red
star: actual state-of-the-art performance
by June 2022, far exceeding even the
upper range forecasters gave. The
median ML researcher was even more
pessimistic.
Over and over again, year after year, skeptics have claimed
“deep learning won’t be able to do X” and have been quickly
proven wrong.8 If there’s one lesson we’ve learned from the past
decade of AI, it’s that you should never bet against deep learning.

8 Here’s Yann LeCun predicting in 2022 that even GPT-5000 won’t be able
to reason about physical interactions with the real world; GPT-4
obviously does it with ease a year later.
Here’s Gary Marcus’s walls predicted after GPT-2 being solved by GPT-3,
and the walls he predicted after GPT-3 being solved by GPT-4.
Here’s Prof. Bryan Caplan losing his first-ever public bet (after
previously famously having a perfect public betting track record). In
January 2023, after GPT-3.5 got a D on his economics midterm, Prof.
Caplan bet Matthew Barnett that no AI would get an A on his economics
midterms by 2029. Just two months later, when GPT-4 came out, it
promptly scored an A on his midterm (and it would have been one of the
highest scores in his class).
Now the hardest unsolved benchmarks are tests like GPQA,
a set of PhD-level biology, chemistry, and physics questions.
Many of the questions read like gibberish to me, and even
PhDs in other scientific fields spending 30+ minutes with
Google barely score above random chance. Claude 3 Opus
currently gets ~60%,9 compared to in-domain PhDs who get
~80%—and I expect this benchmark to fall as well, in the next
generation or two.

9 On the diamond set, majority voting of the model trying 32 times with
chain-of-thought.
Figure 9: Example GPQA questions.
Models are already better at this than
I am, and we’ll probably crack expert-
PhD-level soon. . .
Counting the OOMs
How did this happen? The magic of deep learning is that it just
works—and the trendlines have been astonishingly consistent,
despite naysayers at every turn.
Figure 10: The effects of scaling com-
pute, in the example of OpenAI Sora.
With each OOM of effective compute, models predictably, reliably get
better.10 If we can count the OOMs, we can (roughly, qualitatively)
extrapolate capability improvements.11 That’s how a few prescient
individuals saw GPT-4 coming.

10 And it’s worth noting just how consistent these trendlines are.
Combining the original scaling laws paper with some of the estimates on
compute and compute efficiency scaling since then implies a consistent
scaling trend for over 15 orders of magnitude (over
1,000,000,000,000,000x in effective compute)!

11 A common misconception is that scaling only holds for perplexity
loss, but we see very clear and consistent scaling behavior on
downstream performance on benchmarks as well. It’s usually just a matter
of finding the right log-log graph. For example, in the GPT-4 blog post,
they show consistent scaling behavior for performance on coding problems
over 6 OOMs (1,000,000x) of compute, using MLPR (mean log pass rate).
The “Are Emergent Abilities a Mirage?” paper makes a similar point; with
the right choice of metric, there is almost always a consistent trend
for performance on downstream tasks.
More generally, the “scaling hypothesis” qualitative observation—very
clear trends on model capability with scale—predates loss-scaling-curves;
the “scaling laws” work was just a formal measurement of this.
We can decompose the progress in the four years from GPT-2
to GPT-4 into three categories of scaleups:
1. compute: We’re using much bigger computers to train
these models.
2. algorithmic efficiencies: There’s a continuous trend of
algorithmic progress. Many of these act as “compute multi-
pliers,” and we can put them on a unified scale of growing
effective compute.
3. “unhobbling” gains: By default, models learn a lot of
amazing raw capabilities, but they are hobbled in all sorts
of dumb ways, limiting their practical value. With simple
algorithmic improvements like reinforcement learning from
human feedback (RLHF), chain-of-thought (CoT), tools, and
scaffolding, we can unlock significant latent capabilities.
We can “count the OOMs” of improvement along these axes:
that is, trace the scaleup for each in units of effective com-
pute. 3x is 0.5 OOMs; 10x is 1 OOM; 30x is 1.5 OOMs; 100x
is 2 OOMs; and so on. We can also look at what we should
expect on top of GPT-4, from 2023 to 2027.
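To make the unit conversion concrete, here is a minimal sketch (Python is my own choice of illustration; nothing below comes from the original analysis) that simply takes log10 of a scaleup factor:

    import math

    def ooms(multiplier: float) -> float:
        """Convert a scaleup factor into orders of magnitude (OOMs)."""
        return math.log10(multiplier)

    # The conversions quoted above:
    for factor in (3, 10, 30, 100):
        print(f"{factor}x is about {ooms(factor):.1f} OOMs")
    # 3x ~ 0.5 OOMs, 10x ~ 1.0 OOM, 30x ~ 1.5 OOMs, 100x ~ 2.0 OOMs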
I’ll go through each one-by-one, but the upshot is clear: we are
rapidly racing through the OOMs. There are potential head-
winds in the data wall, which I’ll address—but overall, it seems
likely that we should expect another GPT-2-to-GPT-4-sized
jump, on top of GPT-4, by 2027.
Compute
I’ll start with the most commonly-discussed driver of recent
progress: throwing (a lot) more compute at models.
Many people assume that this is simply due to Moore’s Law.
But even in the old days when Moore’s Law was in its heyday,
it was comparatively glacial—perhaps 1-1.5 OOMs per decade.
We are seeing much more rapid scaleups in compute—close
to 5x the speed of Moore’s law—instead because of mammoth
investment. (Spending even a million dollars on a single model
used to be an outrageous thought nobody would entertain, and
now that’s pocket change!)
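As a quick sanity check on that “close to 5x” claim, a minimal sketch (my own illustration, using only the rough figures in this paragraph):

    # Moore's law in its heyday: roughly 1-1.5 OOMs of compute per decade.
    # Frontier AI training compute: roughly ~0.5 OOMs/year, i.e. ~5 OOMs per decade.
    moores_law_ooms_per_decade = (1.0, 1.5)
    ai_ooms_per_decade = 0.5 * 10

    low = ai_ooms_per_decade / moores_law_ooms_per_decade[1]
    high = ai_ooms_per_decade / moores_law_ooms_per_decade[0]
    print(f"AI training compute grows ~{low:.1f}x-{high:.1f}x faster than Moore's law")
    # ~3.3x-5.0x, consistent with "close to 5x the speed of Moore's law"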
Model           Estimated Compute       Growth
GPT-2 (2019)    ~4e21 FLOP
GPT-3 (2020)    ~3e23 FLOP              + ~2 OOMs
GPT-4 (2023)    8e24 to 4e25 FLOP       + ~1.5–2 OOMs

Table 1: Estimates of compute for GPT-2 to GPT-4 by Epoch AI.
We can use public estimates from Epoch AI (a source widely
respected for its excellent analysis of AI trends) to trace the
compute scaleup from 2019 to 2023. GPT-2 to GPT-3 was a
quick scaleup; there was a large overhang of compute, scaling
from a smaller experiment to using an entire datacenter to train
a large language model. With the scaleup from GPT-3 to GPT-
4, we transitioned to the modern regime: having to build an
entirely new (much bigger) cluster for the next model. And yet
the dramatic growth continued. Overall, Epoch AI estimates
suggest that GPT-4 training used ~3,000x-10,000x more raw
compute than GPT-2.
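Just as a check, we can recompute those ratios directly from the Table 1 estimates; a minimal sketch (my own, and only as precise as the public estimates it starts from):

    import math

    # Epoch AI's public training-compute estimates from Table 1 (FLOP).
    gpt2 = 4e21
    gpt3 = 3e23
    gpt4_low, gpt4_high = 8e24, 4e25

    def ooms(x: float) -> float:
        return math.log10(x)

    print(f"GPT-2 -> GPT-3: {gpt3/gpt2:,.0f}x (~{ooms(gpt3/gpt2):.1f} OOMs)")
    print(f"GPT-3 -> GPT-4: {gpt4_low/gpt3:,.0f}x-{gpt4_high/gpt3:,.0f}x "
          f"(~{ooms(gpt4_low/gpt3):.1f}-{ooms(gpt4_high/gpt3):.1f} OOMs)")
    print(f"GPT-2 -> GPT-4: {gpt4_low/gpt2:,.0f}x-{gpt4_high/gpt2:,.0f}x "
          f"(~{ooms(gpt4_low/gpt2):.1f}-{ooms(gpt4_high/gpt2):.1f} OOMs)")
    # Roughly 2,000x-10,000x from GPT-2 to GPT-4, in line with the range quoted above.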
In broad strokes, this is just the continuation of a longer-running
trend. For the last decade and a half, primarily because of
broad scaleups in investment (and specializing chips for AI
workloads in the form of GPUs and TPUs), the training com-
pute used for frontier AI systems has grown at roughly ~0.5
OOMs/year.
Figure 11: Training compute of no-
table deep learning models over time.
Source: Epoch AI.
The compute scaleup from GPT-2 to GPT-3 in a year was an
unusual overhang, but all the indications are that the longer-
run trend will continue. The SF-rumor-mill is abuzz with dra-
matic tales of huge GPU orders. The investments involved will
be extraordinary—but they are in motion. I go into this more
later in the series, in IIIa. Racing to the Trillion-Dollar Cluster;
based on that analysis, an additional 2 OOMs of compute (a
cluster in the $10s of billions) seems very likely to happen by
the end of 2027; even a cluster closer to +3 OOMs of compute
($100 billion+) seems plausible (and is rumored to be in the
works at Microsoft/OpenAI).
Algorithmic efficiencies
While massive investments in compute get all the attention,
algorithmic progress is probably a similarly important driver of
progress (and has been dramatically underrated).
To see just how big of a deal algorithmic progress can be, con-
sider the following illustration (Figure 12) of the drop in price
to attain ~50% accuracy on the MATH benchmark (high school
competition math) over just two years. (For comparison, a com-
puter science PhD student who didn’t particularly like math
scored 40%, so this is already quite good.) Inference efficiency
improved by nearly 3 OOMs—1,000x—in less than two years.
Figure 12: Rough estimate on relative
inference cost of attaining ~50% MATH
performance.
Though these are numbers just for inference efficiency (which
may or may not correspond to training efficiency improve-
ments, where numbers are harder to infer from public data),
they make clear there is an enormous amount of algorithmic
progress possible and happening.
Calculations below.
Gemini 1.5 Flash scores 54.9% on
MATH, and costs $0.35/$1.05 (in-
put/output) per million tokens. GPT-4
scored 42.5% on MATH prelease and
52.9% on MATH in early 2023, and cost
$30/$60 (input/output) per million
tokens; that’s 85x/57x (input/output)
more expensive per token than Gemini
1.5 Flash. To be conservative, I use an
estimate of 30x cost decrease above (ac-
counting for Gemini 1.5 Flash possibly
using more tokens to reason through
problems).
Minerva540B scores 50.3% on MATH,
using majority voting among 64 sam-
ples. A knowledgeable friend estimates
the base model here is probably 2-
3x more expensive to inference than
GPT-4. However, Minerva seems to use
somewhat fewer tokens per answer
on a quick spot check. More impor-
tantly, Minerva needed 64 samples to
achieve that performance, naively im-
plying a 64x multiple on cost if you e.g.
naively ran this via an inference API. In
practice, prompt tokens can be cached
when running an eval; given a few-shot
prompt, prompt tokens are likely a
majority of the cost, even accounting
for output tokens. Supposing output
tokens are a third of the cost for getting
a single sample, that would imply only
a ~20x increase in cost from the maj@64
with caching. To be conservative, I
use the rough number of a 20x cost
decrease in the above (even if the naive
decrease in inference cost from running
this via an API would be larger).
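Making the arithmetic of that note explicit: the headline number multiplies the two conservative factors chosen above. A minimal sketch (mine, not from the original analysis):

    import math

    # Conservative cost-decrease factors from the calculation note above,
    # for reaching ~50% MATH accuracy:
    minerva_to_gpt4 = 20   # Minerva (maj@64, mid-2022) -> GPT-4 (early 2023)
    gpt4_to_flash = 30     # GPT-4 -> Gemini 1.5 Flash (2024)

    total = minerva_to_gpt4 * gpt4_to_flash
    print(f"Total inference-cost decrease: ~{total}x, about {math.log10(total):.1f} OOMs")
    # ~600x, i.e. "nearly 3 OOMs" in under two years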
In this piece, I’ll separate out two kinds of algorithmic progress.
Here, I’ll start by covering “within-paradigm” algorithmic
improvements—those that simply result in better base mod-
els, and that straightforwardly act as compute efficiencies or com-
pute multipliers. For example, a better algorithm might allow
us to achieve the same performance but with 10x less training
compute. In turn, that would act as a 10x (1 OOM) increase
in effective compute. (Later, I’ll cover “unhobbling,” which you
can think of as “paradigm-expanding/application-expanding”
algorithmic progress that unlocks capabilities of base models.)
If we step back and look at the long-term trends, we seem to
find new algorithmic improvements at a fairly consistent rate.
Individual discoveries seem random, and at every turn, there
seem to be insurmountable obstacles—but the long-run trendline is
predictable, a straight line on a graph. Trust the trendline.
We have the best data for ImageNet (where algorithmic re-
search has been mostly public and we have data stretching
back a decade), for which we have consistently improved com-
pute efficiency by roughly ~0.5 OOMs/year across the 9-year
period between 2012 and 2021.
Figure 13: We can measure algorith-
mic progress: how much less compute
is needed in 2021 compared to 2012
to train a model with the same per-
formance? We see a trend of ~0.5
OOMs/year of algorithmic efficiency.
Source: Erdil and Besiroglu 2022.
That’s a huge deal: that means 4 years later, we can achieve the
same level of performance for ~100x less compute (and con-
comitantly, much higher performance for the same compute!).
Unfortunately, since labs don’t publish internal data on this,
it’s harder to measure algorithmic progress for frontier LLMs
over the last four years. Epoch AI has new work replicating
their results on ImageNet for language modeling, and estimates
a similar ~0.5 OOMs/year of algorithmic efficiency trend in
LLMs from 2012 to 2023. (This has wider error bars though,
and doesn’t capture some more recent gains, since the leading
labs have stopped publishing their algorithmic efficiencies.)
Figure 14: Estimates by Epoch AI of
algorithmic efficiencies in language
modeling. Their estimates suggest
we’ve made ~4 OOMs of efficiency
gains in 8 years.
More directly looking at the last 4 years, GPT-2 to GPT-3 was
basically a simple scaleup (according to the paper), but there
have been many publicly-known and publicly-inferable gains
since GPT-3:
• We can infer gains from API costs:12
– GPT-4, on release, cost ~the same as GPT-3 when it was released,
despite the absolutely enormous performance increase.13 (If we do
a naive and oversimplified back-of-the-envelope estimate based on
scaling laws, this suggests that perhaps roughly half the effective
compute increase from GPT-3 to GPT-4 came from algorithmic
improvements.14)
– Since the GPT-4 release a year ago, OpenAI prices for GPT-4-level
models have fallen another 6x/4x (input/output) with the release of
GPT-4o.
– Gemini 1.5 Flash, recently released, offers between “GPT-3.75-level”
and GPT-4-level performance,15 while costing 85x/57x (input/output)
less than the original GPT-4 (extraordinary gains!).
• Chinchilla scaling laws give a 3x+ (0.5 OOMs+) efficiency gain.16
• Gemini 1.5 Pro claimed major compute efficiency gains (outperforming
Gemini 1.0 Ultra, while using “significantly less” compute), with
Mixture of Experts (MoE) as a highlighted architecture change. Other
papers also claim a substantial multiple on compute from MoE.
• There have been many tweaks and gains on architecture, data, training
stack, etc. all the time.17

Put together, public information suggests that the GPT-2 to GPT-4 jump
included 1-2 OOMs of algorithmic efficiency gains.18

12 Though these are inference efficiencies (rather than necessarily
training efficiencies), and to some extent will reflect inference-specific
optimizations, a) they suggest enormous amounts of algorithmic progress
is possible and happening in general, and b) it’s often the case that an
algorithmic improvement is both a training efficiency gain and an
inference efficiency gain, for example by reducing the number of
parameters necessary.

13 GPT-3: $60/1M tokens; GPT-4: $30/1M input tokens and $60/1M output
tokens.

14 Chinchilla scaling laws say that one should scale parameter count and
data equally. That is, parameter count grows “half the OOMs” of the OOMs
that effective training compute grows. At the same time, parameter count
is intuitively roughly proportional to inference costs. All else equal,
constant inference costs thus imply that half of the OOMs of effective
compute growth were “canceled out” by algorithmic wins.
That said, to be clear, this is a very naive calculation (just meant for
a rough illustration) that is wrong in various ways. There may be
inference-specific optimizations (that don’t translate into training
efficiency); there may be training efficiencies that don’t reduce
parameter count (and thus don’t translate into inference efficiency);
and so on.

15 Gemini 1.5 Flash ranks similarly to GPT-4 (higher than original
GPT-4, lower than updated versions of GPT-4) on LMSys, a chatbot
leaderboard, and has similar performance on MATH and GPQA (evals that
measure reasoning) as the original GPT-4, while landing roughly in the
middle between GPT-3.5 and GPT-4 on MMLU (an eval that more heavily
weights towards measuring knowledge).

16 At ~GPT-3 scale; more than 3x at larger scales.

17 For example, this paper contains a comparison of a GPT-3-style
vanilla Transformer to various simple changes to architecture and
training recipe published over the years (RMSnorms instead of layernorm,
different positional embeddings, SwiGLU activation, AdamW optimizer
instead of Adam, etc.), what they call “Transformer++”, implying a 6x
gain at least at small scale.

18 If we take the trend of 0.5 OOMs/year, and 4 years between GPT-2 and
GPT-4 release, that would be 2 OOMs. However, GPT-2 to GPT-3 was a
simple scaleup (after big gains from e.g. Transformers), and OpenAI
claims GPT-4 pretraining finished in 2022, which could mean we’re
looking at closer to 2 years’ worth of algorithmic progress that we
should be counting here. 1 OOM of algorithmic efficiency seems like a
conservative lower bound.

Figure 15: Decomposing progress: compute and algorithmic efficiencies.
(Rough illustration.)
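As a small illustration of the “infer gains from API costs” point, here is a minimal sketch (my own) converting the price ratios quoted above into OOMs; as footnote 12 cautions, these are inference-cost OOMs and need not map one-to-one onto training efficiency:

    import math

    def ooms(x: float) -> float:
        return math.log10(x)

    # Price ratios quoted above, relative to the original GPT-4 (input/output tokens).
    price_drops = {
        "GPT-4o vs. original GPT-4": (6, 4),
        "Gemini 1.5 Flash vs. original GPT-4": (85, 57),
    }

    for name, (inp, out) in price_drops.items():
        print(f"{name}: {inp}x/{out}x cheaper, i.e. "
              f"~{ooms(inp):.1f}/{ooms(out):.1f} OOMs (input/output)")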
Over the 4 years following GPT-4, we should expect the trend
to continue:19 on average ~0.5 OOMs/year of compute efficiency,
i.e. ~2 OOMs of gains compared to GPT-4 by 2027.
While compute efficiencies will become harder to find as we
pick the low-hanging fruit, AI lab investments in money and
talent to find new algorithmic improvements are growing
rapidly.20 (The publicly-inferable inference cost efficiencies,
at least, don’t seem to have slowed down at all.) On the high
end, we could even see more fundamental, Transformer-like21
breakthroughs with even bigger gains.

Put together, this suggests we should expect something like 1-3
OOMs of algorithmic efficiency gains (compared to GPT-4) by
the end of 2027, maybe with a best guess of ~2 OOMs.

19 At the very least, given over a decade of consistent algorithmic
improvements, the burden of proof would be on those who would suggest it
will all suddenly come to a halt!

20 The economic returns to a 3x compute efficiency will be measured in
the $10s of billions or more, given cluster costs.

21 Very roughly something like a ~10x gain.
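Putting the compute and algorithmic-efficiency forecasts side by side, a rough sketch (the ranges are the ones quoted in this piece; the naive sum is just for illustration, and leaves out “unhobbling” entirely):

    # Forecast ranges quoted in this piece, in OOMs of effective compute
    # gained over GPT-4 by the end of 2027.
    compute_ooms = (2.0, 3.0)       # ~$10s-of-billions cluster likely; ~$100B+ plausible
    algorithmic_ooms = (1.0, 3.0)   # ~0.5 OOMs/year trend, best guess ~2 OOMs

    low = compute_ooms[0] + algorithmic_ooms[0]
    high = compute_ooms[1] + algorithmic_ooms[1]
    print(f"Base-model effective compute vs. GPT-4: ~{low:.0f}-{high:.0f} OOMs, "
          f"i.e. ~{10**low:,.0f}x to ~{10**high:,.0f}x (before unhobbling)")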
The data wall
There is a potentially important source of variance for all of
this: we’re running out of internet data. That could mean
that, very soon, the naive approach to pretraining larger
language models on more scraped data could start hitting
serious bottlenecks.
Frontier models are already trained on much of the inter-
net. Llama 3, for example, was trained on over 15T tokens.
Common Crawl, a dump of much of the internet used for
LLM training, is >100T tokens raw, though much of that is
spam and duplication (e.g., a relatively simple deduplica-
tion leads to 30T tokens, implying Llama 3 would already
be using basically all the data). Moreover, for more specific
domains like code, there are many fewer tokens still, e.g.
public GitHub repos are estimated to be in the low trillions of
tokens.
You can go somewhat further by repeating data, but aca-
demic work on this suggests that repetition only gets you
so far, finding that after 16 epochs (a 16-fold repetition),
returns diminish extremely fast to nil. At some point, even
with more (effective) compute, making your models bet-
ter can become much tougher because of the data con-
straint. This isn’t to be understated: we’ve been riding the
scaling curves, riding the wave of the language-modeling-
pretraining-paradigm, and without something new here,
this paradigm will (at least naively) run out. Despite the
massive investments, we’d plateau.
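For a very rough sense of scale, here is a naive token-budget sketch using only the figures quoted in this box; it overstates the real headroom, since returns diminish well before the 16th epoch and much of the data is low quality:

    import math

    # Rough figures quoted above (order-of-magnitude estimates only).
    llama3_tokens = 15e12         # ~15T tokens used to train Llama 3
    unique_web_tokens = 30e12     # ~30T tokens after simple deduplication of Common Crawl
    max_useful_epochs = 16        # repetition beyond ~16 epochs yields ~no further returns

    naive_ceiling = unique_web_tokens * max_useful_epochs
    headroom = naive_ceiling / llama3_tokens
    print(f"Naive ceiling: ~{naive_ceiling/1e12:.0f}T tokens "
          f"(~{headroom:.0f}x Llama 3, ~{math.log10(headroom):.1f} OOMs of data headroom)")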
All of the labs are rumored to be making massive research
bets on new algorithmic improvements or approaches to
get around this. Researchers are purportedly trying many
strategies, from synthetic data to self-play and RL ap-
proaches. Industry insiders seem to be very bullish: Dario
Amodei (CEO of Anthropic) recently said on a podcast: “if
you look at it very naively we’re not that far from running
out of data [. . . ] My guess is that this will not be a blocker
[. . . ] There’s just many different ways to do it.” Of course,
any research results on this are proprietary and not being
published these days.
In addition to insider bullishness, I think there’s a strong
intuitive case for why it should be possible to find ways to
train models with much better sample efficiency (algorith-
mic improvements that let them learn more from limited
data). Consider how you or I would learn from a really
dense math textbook:
• What a modern LLM does during training is, essentially,
very very quickly skim the textbook, the words just fly-
ing by, not spending much brain power on it.
• Rather, when you or I read that math textbook, we read
a couple pages slowly; then have an internal monologue
about the material in our heads and talk about it with
a few study-buddies; read another page or two; then
try some practice problems, fail, try them again in a
different way, get some feedback on those problems,
try again until we get a problem right; and so on, until
eventually the material “clicks.”
• You or I also wouldn’t learn much at all from a pass
through a dense math textbook if all we could do was
breeze through it like LLMs.22

22 And just rereading the same textbook over and over again might result
in memorization, not understanding. I take it that’s how many wordcels
pass math classes!
• But perhaps, then, there are ways to incorporate aspects
of how humans would digest a dense math textbook
to let the models learn much more from limited data.
In a simplified sense, this sort of thing—having an in-
ternal monologue about material, having a discussion
with a study-buddy, trying and failing at problems un-
til it clicks—is what many synthetic data/self-play/RL
approaches are trying to do.23
23 One other way of thinking about it
I find interesting: there is a “missing-
middle” between pretraining and
in-context learning.
In-context learning is incredible (and
competitive with human sample effi-
ciency). For example, the Gemini 1.5
Pro paper discusses giving the model
instructional materials (a textbook, a
dictionary) on Kalamang, a language
spoken by fewer than 200 people and
basically not present on the internet,
in context—and the model learns to
translate from English to Kalamang at
human-level! In context, the model is
able to learn from the textbook as well
as a human could (and much better
than it would learn from just chucking
that one textbook into pretraining).
When a human learns from a text-
book, they’re able to distill their short-
term memory/learnings into long-term
memory/long-term skills with practice;
however, we don’t have an equivalent
way to distill in-context learning “back
to the weights.” Synthetic data/self-
play/RL/etc are trying to fix that: let
the model learn by itself, then think
about it and practice what it learned,
distilling that learning back into the
weights.
The old state of the art of training models was simple and
naive, but it worked, so nobody really tried hard to crack
these approaches to sample efficiency. Now that it may
become more of a constraint, we should expect all the labs
to invest billions of dollars and their smartest minds into
cracking it. A common pattern in deep learning is that it
takes a lot of effort (and many failed projects) to get the
details right, but eventually some version of the obvious
and simple thing just works. Given how deep learning has
managed to crash through every supposed wall over the
last decade, my base case is that it will be similar here.
Moreover, it actually seems possible that cracking one of
these algorithmic bets like synthetic data could dramati-
cally improve models. Here’s an intuition pump. Current
frontier models like Llama 3 are trained on the internet—
and the internet is mostly crap, like e-commerce or SEO
or whatever. Many LLMs spend the vast majority of their
training compute on this crap, rather than on really high-
quality data (e.g. reasoning chains of people working
through difficult science problems). Imagine if you could
spend GPT-4-level compute on entirely extremely high-
quality data—it could be a much, much more capable
model.
A look back at AlphaGo—the first AI system that beat the
world champions at Go, decades before it was thought
possible—is useful here as well.24

24 See also Andrej Karpathy’s talk discussing this here.
• In step 1, AlphaGo was trained by imitation learning on
expert human Go games. This gave it a foundation.
• In step 2, AlphaGo played millions of games against
itself. This let it become superhuman at Go: remember
the famous move 37 in the game against Lee Sedol, an
extremely unusual but brilliant move a human would
never have played.
Developing the equivalent of step 2 for LLMs is a key re-
search problem for overcoming the data wall (and, more-
over, will ultimately be the key to surpassing human-level
intelligence).
All of this is to say that data constraints seem to inject large
error bars either way into forecasting the coming years
of AI progress. There’s a very real chance things stall out
(LLMs might still be as big of a deal as the internet, but we
wouldn’t get to truly crazy AGI). But I think it’s reasonable
to guess that the labs will crack it, and that doing so will
not just keep the scaling curves going, but possibly enable
huge gains in model capability.
As an aside, this also means that we should expect more
variance between the different labs in coming years com-
pared to today. Up until recently, the state of the art tech-
niques were published, so everyone was basically doing
the same thing. (And new upstarts or open source projects
could easily compete with the frontier, since the recipe
was published.) Now, key algorithmic ideas are becom-
ing increasingly proprietary. I’d expect labs’ approaches
to diverge much more, and some to make faster progress
than others—even a lab that seems on the frontier now
could get stuck on the data wall while others make a
breakthrough that lets them race ahead. And open source
will have a much harder time competing. It will certainly
make things interesting. (And if and when a lab figures
it out, their breakthrough will be the key to AGI, key to
superintelligence—one of the United States’ most prized
secrets.)
Unhobbling
Finally, the hardest to quantify—but no less important—category
of improvements: what I’ll call “unhobbling.”
Imagine if when asked to solve a hard math problem, you had
to instantly answer with the very first thing that came to mind.
It seems obvious that you would have a hard time, except for
the simplest problems. But until recently, that’s how we had
LLMs solve math problems. Instead, most of us work through
the problem step-by-step on a scratchpad, and are able to solve
much more difficult problems that way. “Chain-of-thought”
prompting unlocked that for LLMs. Despite excellent raw ca-
pabilities, they were much worse at math than they could be
because they were hobbled in an obvious way, and it took a
small algorithmic tweak to unlock much greater capabilities.
We’ve made huge strides in “unhobbling” models over the past
few years. These are algorithmic improvements beyond just
training better base models—and often only use a fraction of
pretraining compute—that unleash model capabilities:
• Reinforcement learning from human feedback (RLHF). Base mod-
els have incredible latent capabilities,25 but they’re raw and
incredibly hard to work with. While the popular conception
of RLHF is that it merely censors swear words, RLHF has
been key to making models actually useful and commercially
valuable (rather than having models merely predict random
internet text, it gets them to actually apply their capabilities
to try to answer your question!). This was the magic of
ChatGPT—well-done RLHF made models usable and useful
to real people for the first time. The original InstructGPT
paper has a great quantification of this: an RLHF’d small
model was equivalent to a non-RLHF’d >100x larger model
in terms of human rater preference.

25 That’s the magic of unsupervised learning, in some sense: to better
predict the next token, to make perplexity go down, models learn
incredibly rich internal representations, everything from (famously)
sentiment to complex world models. But, out of the box, they’re hobbled:
they’re using their incredible internal representations merely to
predict the next token in random internet text, rather than applying
them in the best way to actually try to solve your problem.
• Chain of Thought (CoT). As discussed. CoT started being
widely used just 2 years ago and can provide the equiva-
lent of a >10x effective compute increase on math/reasoning
problems.
• Scaffolding. Think of CoT++: rather than just asking a model
to solve a problem, have one model make a plan of attack,
have another propose a bunch of possible solutions, have
another critique it, and so on. For example, on HumanEval
(coding problems), simple scaffolding enables GPT-3.5 to
outperform un-scaffolded GPT-4. On SWE-Bench (a bench-
mark of solving real-world software engineering tasks), GPT-
4 can only solve ~2% correctly, while with Devin’s agent
scaffolding it jumps to 14-23%. (Unlocking agency is only in
its infancy though, as I’ll discuss more later.)
• Tools: Imagine if humans weren’t allowed to use calculators
or computers. We’re only at the beginning here, but Chat-
GPT can now use a web browser, run some code, and so on.
• Context length. Models have gone from 2k token context
(GPT-3) to 32k context (GPT-4 release) to 1M+ context (Gem-
ini 1.5 Pro). This is a huge deal. A much smaller base model
with, say, 100k tokens of relevant context can outperform a
model that is much larger but only has, say, 4k relevant to-
kens of context—more context is effectively a large compute
efficiency gain.26 More generally, context is key to unlocking
many applications of these models: for example, many
coding applications require understanding large parts of
a codebase in order to usefully contribute new code; or, if
you’re using a model to help you write a document at work,
it really needs the context from lots of related internal docs
and conversations. Gemini 1.5 Pro, with its 1M+ token context,
was even able to learn a new language (a low-resource
language not on the internet) from scratch, just by putting a
dictionary and grammar reference materials in context!

26 See Figure 7 from the updated Gemini 1.5 whitepaper, comparing
perplexity vs. context for Gemini 1.5 Pro and Gemini 1.5 Flash (a much
cheaper and presumably smaller model).
• Posttraining improvements. The current GPT-4 has substan-
tially improved compared to the original GPT-4 when re-
leased, according to John Schulman due to posttraining
improvements that unlocked latent model capability: on
reasoning evals it’s made substantial gains (e.g., ~50% ->
72% on MATH, ~40% to ~50% on GPQA) and on the LMSys
leaderboard, it’s made a nearly 100-point Elo jump (comparable
to the difference in Elo between Claude 3 Haiku and the
much larger Claude 3 Opus, models that have a ~50x price
difference).
A survey by Epoch AI of some of these techniques, like scaf-
folding, tool use, and so on, finds that techniques like this can
typically result in effective compute gains of 5-30x on many
benchmarks. METR (an organization that evaluates models)
similarly found very large performance improvements on their
set of agentic tasks, via unhobbling from the same GPT-4 base
model: from 5% with just the base model, to 20% with the
GPT-4 as posttrained on release, to nearly 40% today from bet-
ter posttraining, tools, and agent scaffolding (Figure 16).
Figure 16: Performance on METR’s
agentic tasks, over time via better
unhobbling. Source: Model Evaluation
and Threat Research
While it’s hard to put these on a unified effective compute
scale with compute and algorithmic efficiencies, it’s clear these
are huge gains, at least on a roughly similar magnitude as the
compute scaleup and algorithmic efficiencies. (It also highlights
the central role of algorithmic progress: the ~0.5 OOMs/year
of compute efficiencies, already significant, are only part of the
story; put together with unhobbling, algorithmic progress
overall accounts for maybe even a majority of the gains on the
current trend.)
“Unhobbling” is a huge part of what actually enabled these
models to become useful—and I’d argue that much of what is
holding back many commercial applications today is the need
for further “unhobbling” of this sort. Indeed, models today are
still incredibly hobbled! For example:
• They don’t have long-term memory.
Figure 17: Decomposing progress:
compute, algorithmic efficiencies, and
unhobbling. (Rough illustration.)
• They can’t use a computer (they still only have very limited
tools).
• They still mostly don’t think before they speak. When you
ask ChatGPT to write an essay, that’s like expecting a human
to write an essay via their initial stream-of-consciousness.27

27 People are working on this though.
• They can (mostly) only engage in short back-and-forth dia-
logues, rather than going away for a day or a week, thinking
about a problem, researching different approaches, consult-
ing other humans, and then writing you a longer report or
pull request.
• They’re mostly not personalized to you or your application
(just a generic chatbot with a short prompt, rather than hav-
ing all the relevant background on your company and your
work).
The possibilities here are enormous, and we’re rapidly picking
the low-hanging fruit. This is critical: it’s completely wrong
to just imagine “GPT-6 ChatGPT.” With continued unhobbling
progress, the improvements will be step-changes compared to
GPT-6 + RLHF. By 2027, rather than a chatbot, you’re going to
have something that looks more like an agent, like a coworker.
From chatbot to agent-coworker
What could ambitious unhobbling over the coming years
look like? The way I think about it, there are three key
ingredients:
1. solving the “onboarding problem”
GPT-4 has the raw smarts to do a decent chunk of many
people’s jobs, but it’s sort of like a smart new hire that just
showed up 5 minutes ago: it doesn’t have any relevant con-
text, hasn’t read the company docs or Slack history or had
conversations with members of the team, or spent any time
understanding the company-internal codebase. A smart
new hire isn’t that useful 5 minutes after arriving—but
they are quite useful a month in! It seems like it should be
possible, for example via very-long-context, to “onboard”
models like we would a new human coworker. This alone
would be a huge unlock.
2. the test-time compute overhang (reasoning/error correction/system II for longer-horizon problems)
Right now, models can basically only do short tasks: you
ask them a question, and they give you an answer. But
that’s extremely limiting. Most useful cognitive work hu-
mans do is longer horizon—it doesn’t just take 5 minutes,
but hours, days, weeks, or months.
A scientist that could only think about a difficult problem
for 5 minutes couldn’t make any scientific breakthroughs.
A software engineer that could only write skeleton code
for a single function when asked wouldn’t be very useful—
software engineers are given a larger task, and they then go
make a plan, understand relevant parts of the codebase or
technical tools, write different modules and test them in-
crementally, debug errors, search over the space of possible
solutions, and eventually submit a large pull request that’s
the culmination of weeks of work. And so on.
In essence, there is a large test-time compute overhang. Think
of each GPT-4 token as a word of internal monologue when
you think about a problem. Each GPT-4 token is quite
smart, but it can currently only really effectively use on
the order of ~hundreds of tokens for chains of thought co-
herently (effectively as though you could only spend a few
minutes of internal monologue/thinking on a problem or
project).
What if it could use millions of tokens to think about and
work on really hard problems or bigger projects?
Number of tokens    Equivalent to me working on something for. . .
100s                A few minutes          ChatGPT (we are here)
1,000s              Half an hour           +1 OOMs test-time compute
10,000s             Half a workday         +2 OOMs
100,000s            A workweek             +3 OOMs
Millions            Multiple months        +4 OOMs

Table 2: Assuming a human thinking at ~100 tokens/minute and working 40
hours/week, translating “how long a model thinks” in tokens to human-time
on a given problem/project.
Even if the “per-token” intelligence were the same, it’d
be the difference between a smart person spending a few
minutes vs. a few months on a problem. I don’t know about
you, but there’s much, much, much more I am capable
of in a few months vs. a few minutes. If we could unlock
“being able to think and work on something for months-
equivalent, rather than a few-minutes-equivalent” for mod-
els, it would unlock an insane jump in capability. There’s a
huge overhang here, many OOMs worth.
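Table 2’s conversion is easy to reproduce; a minimal sketch (the assumptions, ~100 tokens/minute and a 40-hour workweek, are the table’s own; the sample token counts are my own picks within each range):

    # Translate "how long a model thinks" (in tokens) into human working time.
    TOKENS_PER_MINUTE = 100
    HOURS_PER_WORKWEEK = 40

    def human_time(tokens: int) -> str:
        minutes = tokens / TOKENS_PER_MINUTE
        hours = minutes / 60
        weeks = hours / HOURS_PER_WORKWEEK
        if hours < 1:
            return f"~{minutes:.0f} minutes"
        if weeks < 1:
            return f"~{hours:.0f} hours"
        if weeks < 5:
            return f"~{weeks:.1f} workweeks"
        return f"~{weeks / 4:.0f} months"  # ~4 workweeks per month

    for tokens in (300, 3_000, 30_000, 300_000, 3_000_000):
        print(f"{tokens:>9,} tokens is roughly {human_time(tokens)} of human working time")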
Right now, models can’t do this yet. Even with recent ad-
vances in long-context, this longer context mostly only
works for the consumption of tokens, not the production of
tokens—after a while, the model goes off the rails or gets
stuck. It’s not yet able to go away for a while to work on a
problem or project on its own.28

28 Which makes sense—why would it have learned the skills for
longer-horizon reasoning and error correction? There’s very little data
on the internet in the form of “my complete internal monologue,
reasoning, all the relevant steps over the course of a month as I work
on a project.” Unlocking this capability will require a new kind of
training, for it to learn these extra skills. Or as Gwern put it
(private correspondence): “ ‘Brain the size of a galaxy, and what do
they ask me to do? Predict the misspelled answers on benchmarks!’
Marvin the depressed neural network moaned.”
But unlocking test-time compute might merely be a matter
of relatively small “unhobbling” algorithmic wins. Perhaps
a small amount of RL helps a model learn to error correct
(“hm, that doesn’t look right, let me double check that”),
make plans, search over possible solutions, and so on. In a
sense, the model already has most of the raw capabilities,
it just needs to learn a few extra skills on top to put it all
together.
In essence, we just need to teach the model a sort of System
II outer loop29 that lets it reason through difficult, long-
horizon projects.

29 System I vs. System II is a useful way of thinking about current
capabilities of LLMs—including their limitations and dumb mistakes—and
what might be possible with RL and unhobbling. Think of it this way:
when you are driving, most of the time you are on autopilot (System I,
what models mostly do right now). But when you encounter a complex
construction zone or novel intersection, you might ask your
passenger-seat companion to pause your conversation for a moment while
you figure out—actually think about—what’s going on and what to do. If
you were forced to go about life with only System I (closer to models
today), you’d have a lot of trouble. Creating the ability for System II
reasoning loops is a central unlock.
If we succeed at teaching this outer loop, instead of a short
chatbot answer of a couple paragraphs, imagine a stream
of millions of words (coming in more quickly than you can
read them) as the model thinks through problems, uses
tools, tries different approaches, does research, revises its
work, coordinates with others, and completes big projects
on its own.
Trading off test-time and train-time compute in other ML do-
mains. In other domains, like AI systems for board games,
it’s been demonstrated that you can use more test-time
compute (also called inference-time compute) to substitute
for training compute.
Figure 18: Jones (2021): A smaller model can do as well as a much larger model at the game of Hex if you give it more test-time compute (“more time to think”). In this domain, they find that one can spend ~1.2 OOMs more compute at test-time to get performance equivalent to a model with ~1 OOM more training compute.
If a similar relationship held in our case, if we could unlock
+4 OOMs of test-time compute, that might be equivalent to
+3 OOMs of pretraining compute, i.e. very roughly some-
thing like the jump between GPT-3 and GPT-4. (Solving
this “unhobbling” would be equivalent to a huge OOM
scaleup.)
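A back-of-the-envelope version of that conversion, assuming the board-game exchange rate from Jones (2021) above (~1.2 OOMs of test-time compute per ~1 OOM of training compute) carries over; that transfer is an assumption, not something the Hex result establishes for LLMs.

```python
# Rough conversion of test-time compute OOMs into "training-compute-equivalent"
# OOMs, using the ~1.2 : 1 exchange rate from Jones (2021) on Hex.
TEST_TO_TRAIN_RATIO = 1.2  # ~1.2 OOMs test-time ≈ ~1 OOM train-time (board games)

test_time_ooms = 4.0       # the hypothesized unlock: hundreds -> millions of tokens
train_equivalent_ooms = test_time_ooms / TEST_TO_TRAIN_RATIO

print(f"+{test_time_ooms} OOMs test-time ≈ +{train_equivalent_ooms:.1f} OOMs pretraining")
print(f"i.e. roughly a {10**train_equivalent_ooms:,.0f}x effective-compute scaleup")
# ≈ +3.3 OOMs, in the same ballpark as the GPT-3 → GPT-4 jump mentioned in the text.
```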
3. Using a computer
This is perhaps the most straightforward of the three. Chat-
GPT right now is basically like a human that sits in an
isolated box that you can text. While early unhobbling im-
provements teach models to use individual isolated tools,
I expect that with multimodal models we will soon be able
to do this in one fell swoop: we will simply enable models
to use a computer like a human would.
That means joining your Zoom calls, researching things on-
line, messaging and emailing people, reading shared docs,
using your apps and dev tooling, and so on. (Of course,
for models to make the most use of this in longer-horizon
loops, this will go hand-in-hand with unlocking test-time
compute.)
By the end of this, I expect us to get something that
looks a lot like a drop-in remote worker. An agent that joins
your company, is onboarded like a new human hire, mes-
sages you and colleagues on Slack and uses your software,
makes pull requests, and that, given big projects, can do
the model-equivalent of a human going away for weeks to
independently complete the project. You’ll probably need
somewhat better base models than GPT-4 to unlock this,
but possibly not even that much better—a lot of juice is in
fixing the clear and basic ways models are still hobbled.
(A very early peek at what this might look like is Devin, an
early prototype of unlocking the “agency-overhang”/”test-
time compute overhang” on models on the path to creating
a fully automated software engineer. I don’t know how
well Devin works in practice, and this demo is still very
limited compared to what proper chatbot → agent un-
hobbling would yield, but it’s a useful teaser of the sort of
thing coming soon.)
By the way, the centrality of unhobbling might lead to a some-
what interesting “sonic boom” effect in terms of commercial
applications. Intermediate models between now and the drop-
in remote worker will require tons of schlep to change work-
flows and build infrastructure to integrate and derive economic
value from. The drop-in remote worker will be dramatically
easier to integrate—just, well, drop them in to automate all
the jobs that could be done remotely. It seems plausible that
the schlep will take longer than the unhobbling, that is, by the
time the drop-in remote worker is able to automate a large
number of jobs, intermediate models won’t yet have been fully
harnessed and integrated—so the jump in economic value gen-
erated could be somewhat discontinuous.
The next four years
Putting the numbers together, we should (roughly) ex-
pect another GPT-2-to-GPT-4-sized jump in the 4 years follow-
ing GPT-4, by the end of 2027.
• GPT-2 to GPT-4 was roughly a 4.5–6 OOM base effective
compute scaleup (physical compute and algorithmic efficien-
cies), plus major “unhobbling” gains (from base model to
chatbot).
• In the subsequent 4 years, we should expect 3–6 OOMs
of base effective compute scaleup (physical compute and algorithmic efficiencies)—with perhaps a best guess of ~5
OOMs—plus step-changes in utility and applications un-
locked by “unhobbling” (from chatbot to agent/drop-in
remote worker).
To put this in perspective, suppose GPT-4 training took 3 months. In 2027, a leading AI lab will be able to train a GPT-4-level model in a minute.30 The OOM effective compute scaleup will be dramatic.
30 On the best guess assumptions on physical compute and algorithmic efficiency scaleups described above, and simplifying parallelism considerations (in reality, it might look more like “1440 (60*24) GPT-4-level models in a day” or similar).
Figure 19: Summary of the estimates on drivers of progress in the four years preceding GPT-4, and what we should expect in the four years following GPT-4.
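As a sanity check on footnote 30, the claim is just OOM arithmetic: going from a ~3-month training run to a one-minute run requires roughly the ~5 OOM effective-compute scaleup estimated above. A minimal sketch:

```python
import math

# Footnote 30 sanity check: how many OOMs separate "train GPT-4 in 3 months"
# from "train a GPT-4-level model in a minute"?
gpt4_training_minutes = 3 * 30 * 24 * 60   # ~3 months expressed in minutes
speedup = gpt4_training_minutes / 1        # target: one minute
ooms_needed = math.log10(speedup)

print(f"Speedup needed: ~{speedup:,.0f}x ≈ {ooms_needed:.1f} OOMs")
# ≈ 5.1 OOMs — consistent with the ~5 OOM best-guess effective-compute scaleup
# (ignoring parallelism; in practice it's more like many GPT-4s per day).
```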
Where will that take us?
Figure 20: Summary of counting the
OOMs.
GPT-2 to GPT-4 took us from ~preschooler to ~smart high-
schooler; from barely being able to output a few cohesive sen-
tences to acing high-school exams and being a useful coding
assistant. That was an insane jump. If this is the intelligence
gap we’ll cover once more, where will that take us?31 We should not be surprised if that takes us very, very far. Likely, it will take us to models that can outperform PhDs and the best experts in a field.
31 Of course, any benchmark we have today will be saturated. But that’s not saying much; it’s mostly a reflection on the difficulty of making hard-enough benchmarks.
(One neat way to think about this is that the current trend of AI
progress is proceeding at roughly 3x the pace of child develop-
ment. Your 3x-speed-child just graduated high school; it’ll be
taking your job before you know it!)
Again, critically, don’t just imagine an incredibly smart Chat-
GPT: unhobbling gains should mean that this looks more like
a drop-in remote worker, an incredibly smart agent that can
reason and plan and error-correct and knows everything about
you and your company and can work on a problem indepen-
dently for weeks.
We are on course for AGI by 2027. These AI systems will be able to automate basically all cognitive jobs (think: all jobs that could be done remotely).
To be clear—the error bars are large. Progress could stall as
we run out of data, if the algorithmic breakthroughs necessary
to crash through the data wall prove harder than expected.
Maybe unhobbling doesn’t go as far, and we are stuck with
merely expert chatbots, rather than expert coworkers. Perhaps
the decade-long trendlines break, or scaling deep learning hits
a wall for real this time. (Or an algorithmic breakthrough, even
simple unhobbling that unleashes the test-time compute over-
hang, could be a paradigm-shift, accelerating things further
and leading to AGI even earlier.)
In any case, we are racing through the OOMs, and it requires
no esoteric beliefs, merely trend extrapolation of straight lines,
to take the possibility of AGI—true AGI—by 2027 extremely
seriously.
It seems like many are in the game of downward-defining AGI
these days, as just a really good chatbot or whatever. What
I mean is an AI system that could fully automate my or my
friends’ job, that could fully do the work of an AI researcher
or engineer. Perhaps some areas, like robotics, might take
longer to figure out by default. And the societal rollout, e.g.
in medical or legal professions, could easily be slowed by so-
cietal choices or regulation. But once models can automate AI
research itself, that’s enough—enough to kick off intense feed-
back loops—and we could very quickly make further progress,
the automated AI engineers themselves solving all the remain-
ing bottlenecks to fully automating everything. In particular,
millions of automated researchers could very plausibly com-
press a decade of further algorithmic progress into a year or
less. AGI will merely be a small taste of the superintelligence
soon to follow. (More on that in the next chapter.)
In any case, do not expect the vertiginous pace of progress to
abate. The trendlines look innocent, but their implications are
intense. As with every generation before them, every new gen-
eration of models will dumbfound most onlookers; they’ll be
incredulous when, very soon, models solve incredibly difficult
science problems that would take PhDs days, when they’re
whizzing around your computer doing your job, when they’re
writing codebases with millions of lines of code from scratch,
when every year or two the economic value generated by these
models 10xs. Forget sci-fi, count the OOMs: it’s what we should
expect. AGI is no longer a distant fantasy. Scaling up simple
deep learning techniques has just worked, the models just want
to learn, and we’re about to do another 100,000x+ by the end of
2027. It won’t be long before they’re smarter than us.
Figure 21: GPT-4 is just the beginning—
where will we be four years later? Do
not make the mistake of underestimat-
ing the rapid pace of deep learning
progress (as illustrated by progress in
GANs).
Addendum. Racing through the OOMs: It’s this decade or bust
I used to be more skeptical of short timelines to AGI. One
reason is that it seemed unreasonable to privilege this
decade, concentrating so much AGI-probability-mass on
it (it seemed like a classic fallacy to think “oh we’re so
special”). I thought we should be uncertain about what
it takes to get AGI, which should lead to a much more
“smeared-out” probability distribution over when we
might get AGI.
However, I’ve changed my mind: critically, our uncertainty
over what it takes to get AGI should be over OOMs (of
effective compute), rather than over years.
We’re racing through the OOMs this decade. Even at its
bygone heyday, Moore’s law was only 1–1.5 OOMs/decade.
I estimate that we will do ~5 OOMs in 4 years, and over
~10 this decade overall.
Figure 22: Rough projections on ef-
fective compute scaleup. We’ve been
racing through the OOMs this decade;
after the early 2030s, we will face a slow
slog.
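To put those rates side by side, here is the arithmetic on the figures just quoted (the ~1–1.5 OOMs/decade Moore's-law figure and the ~5 OOMs in 4 years estimate are the essay's own; nothing else is assumed):

```python
# Comparing effective-compute growth rates implied by the numbers above.
moores_law_ooms_per_decade = 1.5          # upper end of the 1-1.5 OOMs/decade figure
projected_ooms = 5                        # ~5 OOMs in 4 years
projected_years = 4

rate = projected_ooms / projected_years   # OOMs per year
print(f"Projected: {rate:.2f} OOMs/year  ->  ~{10**rate:,.0f}x per year")
print(f"Moore's law heyday: {moores_law_ooms_per_decade/10:.2f} OOMs/year")
print(f"Ratio of rates: ~{rate / (moores_law_ooms_per_decade/10):.0f}x faster")
# ~1.25 OOMs/year (~18x/year) vs ~0.15 OOMs/year: roughly 8x faster than
# Moore's law even at its fastest, hence "racing through the OOMs".
```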
In essence, we’re in the middle of a huge scaleup reap-
ing one-time gains this decade, and progress through the
OOMs will be multiples slower thereafter. If this scaleup
doesn’t get us to AGI in the next 5-10 years, it might be a
long way out.
• Spending scaleup: Spending a million dollars on a model
used to be outrageous; by the end of the decade, we will
likely have $100B or $1T clusters. Going much higher
than that will be hard; that’s already basically the feasi-
ble limit (both in terms of what big business can afford,
and even just as a fraction of GDP). Thereafter all we
have is glacial 2%/year trend real GDP growth to in-
crease this.
• Hardware gains: AI hardware has been improving much
more quickly than Moore’s law. That’s because we’ve
been specializing chips for AI workloads. For exam-
ple, we’ve gone from CPUs to GPUs; adapted chips for
Transformers; and we’ve gone down to much lower pre-
cision number formats, from fp64/fp32 for traditional
supercomputing to fp8 on H100s. These are large gains,
but by the end of the decade we’ll likely have totally-
specialized AI-specific chips, without much further
beyond-Moore’s law gains possible.
• Algorithmic progress: In the coming decade, AI labs will
invest tens of billions in algorithmic R&D, and all the
smartest people in the world will be working on this;
from tiny efficiencies to new paradigms, we’ll be picking
lots of the low-hanging fruit. We probably won’t reach
any sort of hard limit (though “unhobblings” are likely
finite), but at the very least the pace of improvements
should slow down, as the rapid growth (in $ and human
capital investments) necessarily slows down (e.g., most
of the smart STEM talent will already be working on AI).
(That said, this is the most uncertain to predict, and the
source of most of the uncertainty on the OOMs in the
2030s on the plot above.)
Put together, this means we are racing through many
more OOMs in the next decade than we might in multi-
ple decades thereafter. Maybe it’s enough—and we get
AGI soon—or we might be in for a long, slow slog. You
and I can reasonably disagree on the median time to AGI,
depending on how hard we think achieving AGI will be—
but given how we’re racing through the OOMs right now,
certainly your modal AGI year should be sometime later this decade or so.
Figure 23: Matthew Barnett has a nice
related visualization of this, considering
just compute and biological bounds.
II. From AGI to Superintelligence: the Intelligence Ex-
plosion
AI progress won’t stop at human-level. Hundreds of millions
of AGIs could automate AI research, compressing a decade
of algorithmic progress (5+ OOMs) into 1 year. We would
rapidly go from human-level to vastly superhuman AI sys-
tems. The power—and the peril—of superintelligence would
be dramatic.
Let an ultraintelligent machine be defined as a machine that
can far surpass all the intellectual activities of any man
however clever. Since the design of machines is one of these
intellectual activities, an ultraintelligent machine could design
even better machines; there would then unquestionably be an
‘intelligence explosion,’ and the intelligence of man would be
left far behind. Thus the first ultraintelligent machine is the
last invention that man need ever make.
I. J. Good (1965)
The Bomb and The Super
In the common imagination, the Cold War’s terrors principally trace
back to Los Alamos, with the invention of the atomic bomb. But The
Bomb, alone, is perhaps overrated. Going from The Bomb to The
Super—hydrogen bombs—was arguably just as important.
In the Tokyo air raids, hundreds of bombers dropped thousands of
tons of conventional bombs on the city. Later that year, Little Boy,
dropped on Hiroshima, unleashed similar destructive power in a single
device. But just 7 years later, Teller’s hydrogen bomb multiplied yields
a thousand-fold once again—a single bomb with more explosive power
than all of the bombs dropped in the entirety of WWII combined.
The Bomb was a more efficient bombing campaign. The Super was a country-annihilating device.32
32 And much of the Cold War’s perversities (cf. Daniel Ellsberg’s book) stemmed from merely replacing A-bombs with H-bombs, without adjusting nuclear policy and war plans to the massive capability increase.
So it will be with AGI and Superintelligence.
AI progress won’t stop at human-level. After initially
learning from the best human games, AlphaGo started playing
against itself—and it quickly became superhuman, playing
extremely creative and complex moves that a human would
never have come up with.
We discussed the path to AGI in the previous piece. Once we
get AGI, we’ll turn the crank one more time—or two or three
more times—and AI systems will become superhuman—vastly
superhuman. They will become qualitatively smarter than you
or I, much smarter, perhaps similar to how you or I are qualita-
tively smarter than an elementary schooler.
The jump to superintelligence would be wild enough at the
current rapid but continuous rate of AI progress (if we could
make the jump to AGI in 4 years from GPT-4, what might an-
other 4 or 8 years after that bring?). But it could be much faster
than that, if AGI automates AI research itself.
Once we get AGI, we won’t just have one AGI. I’ll walk through
the numbers later, but: given inference GPU fleets by then,
we’ll likely be able to run many millions of them (perhaps 100
million human-equivalents, and soon after at 10x+ human speed).
Even if they can’t yet walk around the office or make coffee,
they will be able to do ML research on a computer. Rather
than a few hundred researchers and engineers at a leading AI
lab, we’d have more than 100,000x that—furiously working
on algorithmic breakthroughs, day and night. Yes, recursive
self-improvement, but no sci-fi required; they would need only to
accelerate the existing trendlines of algorithmic progress (currently at
~0.5 OOMs/year).
Automated AI research could probably compress a human-decade of algorithmic progress into less than a year (and that seems conservative). That’d be 5+ OOMs, another GPT-2-to-GPT-4-sized jump, on top of AGI—a qualitative jump like that from a preschooler to a smart high schooler, on top of AI systems already as smart as expert AI researchers/engineers.
Figure 24: Automated AI research could accelerate algorithmic progress, leading to 5+ OOMs of effective compute gains in a year. The AI systems we’d have by the end of an intelligence explosion would be vastly smarter than humans.
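As a quick check on that arithmetic, compressing a decade of the ~0.5 OOMs/year algorithmic trendline cited just above into one year gives the 5+ OOM figure directly:

```python
# Quick arithmetic check: compressing a decade of the current algorithmic
# trendline (~0.5 OOMs/year, as cited in the text) into one year.
algorithmic_trend_ooms_per_year = 0.5   # the essay's trendline estimate
years_compressed = 10                   # "a human-decade ... into less than a year"

ooms_gained = algorithmic_trend_ooms_per_year * years_compressed
print(f"Effective compute gain: ~{ooms_gained:.0f} OOMs, i.e. ~{10**ooms_gained:,.0f}x")
# -> ~5 OOMs, i.e. ~100,000x — comparable to the GPT-2 → GPT-4 scaleup.
```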
There are several plausible bottlenecks—including limited com-
pute for experiments, complementarities with humans, and
algorithmic progress becoming harder—which I’ll address, but
none seem sufficient to definitively slow things down.
Before we know it, we would have superintelligence on our
hands—AI systems vastly smarter than humans, capable of
novel, creative, complicated behavior we couldn’t even begin
to understand—perhaps even a small civilization of billions
of them. Their power would be vast, too. Applying superin-
telligence to R&D in other fields, explosive progress would
broaden from just ML research; soon they’d solve robotics,
make dramatic leaps across other fields of science and technol-
ogy within years, and an industrial explosion would follow.
Superintelligence would likely provide a decisive military ad-
vantage, and unfold untold powers of destruction. We will be
faced with one of the most intense and volatile moments of
human history.
Automating AI research
We don’t need to automate everything—just AI research. A
common objection to transformative impacts of AGI is that
it will be hard for AI to do everything. Look at robotics, for
instance, doubters say; that will be a gnarly problem, even if
AI is cognitively at the levels of PhDs. Or take automating
biology R&D, which might require lots of physical lab-work
and human experiments.
But we don’t need robotics—we don’t need many things—for
AI to automate AI research. The jobs of AI researchers and
engineers at leading labs can be done fully virtually and don’t
run into real-world bottlenecks in the same way (though it will
still be limited by compute, which I’ll address later). And the
job of an AI researcher is fairly straightforward, in the grand
scheme of things: read ML literature and come up with new
questions or ideas, implement experiments to test those ideas,
interpret the results, and repeat. This all seems squarely in the
domain where simple extrapolations of current AI capabilities
could easily take us to or beyond the levels of the best humans
by the end of 2027.33
33 The job of an AI researcher is also
a job that AI researchers at AI labs
just, well, know really well—so it’ll
be particularly intuitive to them to
optimize models to be good at that job.
And there will be huge incentives to do
so to help them accelerate their research
and their labs’ competitive edge.
It’s worth emphasizing just how straightforward and hacky
some of the biggest machine learning breakthroughs of the last
decade have been: “oh, just add some normalization” (Lay-
erNorm/BatchNorm) or “do f(x)+x instead of f(x)” (residual
connections) or “fix an implementation bug” (Kaplan → Chin-
chilla scaling laws). AI research can be automated. And au-
tomating AI research is all it takes to kick off extraordinary
feedback loops.34
34 This suggests an important point in
terms of the sequencing of risks from
AI, by the way. A common AI threat
model people point to is AI systems
developing novel bioweapons, and
that posing catastrophic risk. But if
AI research is more straightforward to
automate than biology R&D, we might
get an intelligence explosion before we
get extreme AI biothreats. This matters,
for example, with regard to whether we
should expect “bio warning shots” in
time before things get crazy on AI.
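To make the “straightforward and hacky” point concrete, here is the residual-connection trick (“do f(x)+x instead of f(x)”) as a minimal PyTorch-style sketch; the class name and structure are just for illustration, not taken from any particular codebase.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """The 'do f(x) + x instead of f(x)' trick, in a few lines."""
    def __init__(self, f: nn.Module):
        super().__init__()
        self.f = f  # any sub-network: attention, MLP, conv stack, ...

    def forward(self, x):
        return self.f(x) + x  # the entire innovation: add the input back
```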
We’d be able to run millions of copies (and soon at 10x+ hu-
man speed) of the automated AI researchers. Even by 2027,
we should expect GPU fleets in the 10s of millions. Training
clusters alone should be approaching ~3 OOMs larger, already
putting us at 10 million+ A100-equivalents. Inference fleets
should be much larger still. (More on all this in IIIa. Racing to
the Trillion Dollar Cluster.)
That would let us run many millions of copies of our auto-
mated AI researchers, perhaps 100 million human-researcher-
equivalents, running day and night. There are some assumptions that flow into the exact numbers, including that humans “think” at 100 tokens/minute (just a rough order-of-magnitude estimate, e.g. consider your internal monologue) and extrapolating historical trends and Chinchilla scaling laws on per-token inference costs for frontier models remaining in the same ballpark.35 We’d also want to reserve some of the GPUs for running experiments and training new models. Full calculation in a footnote.36
35 As noted earlier, the GPT-4 API costs less today than GPT-3 did when it was released—this suggests that the trend of inference efficiency wins is fast enough to keep inference costs roughly constant even as models get much more powerful. Similarly, there have been huge inference cost wins in just the year since GPT-4 was released; for example, the current version of Gemini 1.5 Pro outperforms the original GPT-4, while being roughly 10x cheaper.
We can also ground this somewhat more by considering Chinchilla scaling laws. On Chinchilla scaling laws, model size—and thus inference costs—grow with the square root of training cost, i.e. half the OOMs of the OOM scaleup of effective compute. However, in the previous piece, I suggested that algorithmic efficiency was advancing at roughly the same pace as compute scaleup, i.e. it made up roughly half of the OOMs of effective compute scaleup. If these algorithmic wins also translate into inference efficiency, that means that the algorithmic efficiencies would compensate for the naive increase in inference cost.
In practice, training compute efficiencies often, but not always, translate into inference efficiency wins. However, there are also separately many inference efficiency wins that are not training efficiency wins. So, at least in terms of the rough ballpark, assuming the $/token of frontier models stays roughly similar doesn’t seem crazy.
(Of course, they’ll use more tokens, i.e. more test-time compute. But that’s already part of the calculation here, by pricing human-equivalents as 100 tokens/minute.)
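To spell out footnote 35's Chinchilla argument, here is a rough sketch using the standard approximations C ≈ 6·N·D with compute-optimal D ≈ 20·N (so parameters N, and hence per-token inference cost, scale as the square root of training compute); the 50/50 split between physical compute and algorithmic efficiency is the essay's rough assumption, carried over as-is.

```python
# Footnote 35, made concrete with the usual Chinchilla approximations:
#   training FLOP  C ≈ 6 * N * D, with compute-optimal data D ≈ 20 * N
#   => C ≈ 120 * N^2, so N (and ~2N inference FLOP/token) scales as sqrt(C).
def param_ooms_from_compute_ooms(compute_ooms: float) -> float:
    return compute_ooms / 2  # N ∝ sqrt(C)  =>  half the OOMs

effective_compute_ooms = 5.0   # the ~5 OOM best-guess scaleup
physical_compute_ooms = 2.5    # assume ~half is physical compute...
algorithmic_ooms = 2.5         # ...and ~half algorithmic efficiency (per the text)

naive_inference_cost_ooms = param_ooms_from_compute_ooms(effective_compute_ooms)
# If the algorithmic wins also cut inference cost, they roughly cancel the increase:
net_inference_cost_ooms = naive_inference_cost_ooms - algorithmic_ooms

print(f"Naive per-token cost increase: ~{naive_inference_cost_ooms:.1f} OOMs")
print(f"Net of algorithmic efficiency: ~{net_inference_cost_ooms:.1f} OOMs (≈ flat $/token)")
```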
36 GPT-4 Turbo (GPT4T) is about $0.03/1K tokens, i.e. ~33K tokens/$. We supposed above that we would have 10s of millions of A100-equivalents, which cost ~$1/hour per GPU. If we use the API costs to translate GPUs into tokens generated, that implies 10s of millions of GPUs * $1/GPU-hour * 33K tokens/$ = ~one trillion tokens/hour. Suppose a human does 100 tokens/min of thinking; that means a human-equivalent is 6,000 tokens/hour. One trillion tokens/hour divided by 6,000 tokens/human-hour = ~200 million human-equivalents—i.e. as if running 200 million human researchers, day and night. (And even if we reserve half the GPUs for experiment compute, we get 100 million human-researcher-equivalents.)
Another way of thinking about it is that given inference fleets in 2027, we should be able to generate an entire internet’s worth of tokens, every single day.37 In any case, the exact numbers don’t matter that much, beyond a simple plausibility demonstration.
37 The previous footnote estimated ~1T tokens/hour, i.e. 24T tokens a day. In the previous piece, I noted that a public deduplicated CommonCrawl had around 30T tokens.
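Footnotes 36 and 37 written out as code, using the same rough inputs (the GPU count, $/GPU-hour, tokens/$, and tokens/minute figures are the essay's ballpark estimates, not measurements):

```python
# Back-of-the-envelope from footnotes 36-37, using the essay's rough inputs.
num_gpus = 30e6                    # "10s of millions" of A100-equivalents (rough)
cost_per_gpu_hour = 1.0            # ~$1/hour per A100-equivalent
tokens_per_dollar = 33_000         # ~$0.03 per 1K tokens (GPT-4 Turbo-ish pricing)
human_tokens_per_hour = 100 * 60   # "thinking" at ~100 tokens/minute

tokens_per_hour = num_gpus * cost_per_gpu_hour * tokens_per_dollar
human_equivalents = tokens_per_hour / human_tokens_per_hour
tokens_per_day = tokens_per_hour * 24

print(f"~{tokens_per_hour:.1e} tokens/hour")
print(f"~{human_equivalents/1e6:.0f} million human-researcher-equivalents, day and night")
print(f"~{tokens_per_day/1e12:.0f}T tokens/day (vs. ~30T in a deduplicated CommonCrawl)")
# With 30M GPUs: ~1e12 tokens/hour, ~165M human-equivalents, ~24T tokens/day —
# the same ballpark as the footnotes (which use "10s of millions" loosely).
```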
Moreover, our automated AI researchers may soon be able to
run at much faster than human-speed:
• By taking some inference penalties, we can trade off running fewer copies in exchange for running them at faster serial speed. (For example, we could go from ~5x human speed to ~100x human speed by “only” running 1 million copies of the automated researchers.38)
38 Jacob Steinhardt estimates that k^3 parallel copies of a model can be replaced with a single model that is k^2 faster, given some math on inference tradeoffs with a tiling scheme (that theoretically works even for k of 100 or more). Suppose initial speeds were already ~5x human speed (based on, say, GPT-4 speed on release). Then, by taking this inference penalty (with k ≈ 5), we’d be able to run ~1 million automated AI researchers at ~100x human speed. (A short sketch of this arithmetic follows after this list.)
• More importantly, the first algorithmic innovation the automated AI researchers work on is getting a 10x or 100x speedup. Gemini 1.5 Flash is ~10x faster than the originally-released GPT-4,39 merely a year later, while providing similar performance to the originally-released GPT-4 on reasoning benchmarks. If that’s the algorithmic speedup a few hundred human researchers can find in a year, the automated AI researchers will be able to find similar wins very quickly.
39 This source benchmarks throughput of Flash at ~6x GPT-4 Turbo, and GPT-4 Turbo was faster than original GPT-4. Latency is probably also roughly 10x faster.
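As referenced in footnote 38, here is a short sketch of that serial-speed tradeoff, taking Steinhardt's k³-copies-for-k²-speed estimate and the ~5x starting speed as given assumptions:

```python
# Footnote 38's serial-speed tradeoff, written out: k^3 parallel copies can be
# traded for one copy running k^2 faster (Steinhardt's estimate, taken as given).
k = 5
base_copies = 125e6          # ~order of the 100M+ human-equivalents above
base_speed_multiplier = 5    # assume models already run at ~5x human speed

copies_after_tradeoff = base_copies / k**3           # fewer parallel copies...
speed_after_tradeoff = base_speed_multiplier * k**2  # ...each running k^2 faster

print(f"~{copies_after_tradeoff/1e6:.0f} million copies at ~{speed_after_tradeoff}x human speed")
# ~1 million automated researchers at ~125x (call it ~100x) human speed.
```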
That is: expect 100 million automated researchers each working at
100x human speed not long after we begin to be able to automate AI
research. They’ll each be able to do a year’s worth of work in a
few days. The increase in research effort—compared to a few
hundred puny human researchers at a leading AI lab today,
working at a puny 1x human speed—will be extraordinary.
This could easily dramatically accelerate existing trends of
algorithmic progress, compressing a decade of advances into
a year. We need not postulate anything totally novel for au-
tomated AI research to intensely speed up AI progress. Walk-
ing through the numbers in the previous piece, we saw that
algorithmic progress has been a central driver of deep learn-
ing progress in the last decade; we noted a trendline of ~0.5
OOMs/year on algorithmic efficiencies alone, with additional
large algorithmic gains from unhobbling on top. (I think the
import of algorithmic progress has been underrated by many,
and properly appreciating it is important for appreciating the
possibility of an intelligence explosion.)
Could our millions of automated AI researchers (soon working
at 10x or 100x human speed) compress the algorithmic progress
human researchers would have found in a decade into a year
instead? That would be 5+ OOMs in a year.
Don’t just imagine 100 million junior software engineer
interns here (we’ll get those earlier, in the next couple years!).
Real automated AI researchers will be very smart—and in addition
to their raw quantitative advantage, automated AI researchers
will have other enormous advantages over human researchers:
• They’ll be able to read every single ML paper ever written,
have been able to deeply think about every single previous
experiment ever run at the lab, learn in parallel from each
of their copies, and rapidly accumulate the equivalent of
millennia of experience. They’ll be able to develop far deeper
intuitions about ML than any human.
• They’ll be easily able to write millions of lines of complex
code, keep the entire codebase in context, and spend human-
decades (or more) checking and rechecking every line of
code for bugs and optimizations. They’ll be superbly compe-
tent at all parts of the job.
• You won’t have to individually train up each automated
AI researcher (indeed, training and onboarding 100 million
new human hires would be difficult). Instead, you can just
teach and onboard one of them—and then make replicas.
(And you won’t have to worry about politicking, cultural
acclimation, and so on, and they’ll work with peak energy
and focus day and night.)
• Vast numbers of automated AI researchers will be able to
share context (perhaps even accessing each other’s latent
space and so on), enabling much more efficient collaboration
and coordination compared to human researchers.
• And of course, however smart our initial automated AI
researchers would be, we’d soon be able to make further
OOM-jumps, producing even smarter models, even more
capable at automated AI research.
Imagine an automated Alec Radford—imagine 100 million automated Alec Radfords.40 I think just about every researcher at OpenAI would agree that if they had 10 Alec Radfords, let alone 100 or 1,000 or 1 million running at 10x or 100x human speed, they could very quickly solve very many of their problems. Even with various other bottlenecks (more in a moment), compressing a decade of algorithmic progress into a year as a result seems very plausible. (A 10x acceleration from a million times more research effort seems conservative if anything.)
40 Alec Radford is an incredibly gifted and prolific researcher/engineer at OpenAI, behind many of the most important advances, though he flies under the radar some.
That would be 5+ OOMs right there. 5 OOMs of algorithmic
wins would be a similar scaleup to what produced the GPT-
2-to-GPT-4 jump, a capability jump from ~a preschooler to ~a
smart high schooler. Imagine such a qualitative jump on top of
AGI, on top of Alec Radford.
It’s strikingly plausible we’d go from AGI to superintelligence
very quickly, perhaps in 1 year.
Possible bottlenecks
While this basic story is surprisingly strong—and is supported
by thorough economic modeling work—there are some real
and plausible bottlenecks that will probably slow down an
automated-AI-research intelligence explosion.
I’ll give a summary here, and then discuss these in more detail
in the optional sections below for those interested:
• Limited compute: AI research doesn’t just take good ideas,
thinking, or math—but running experiments to get empirical
signal on your ideas. A million times more research effort
via automated research labor won’t mean a million times
faster progress, because compute will still be limited—and
limited compute for experiments will be the bottleneck. Still,
even if this won’t be a 1,000,000x speedup, I find it hard to
imagine that the automated AI researchers couldn’t use the
compute at least 10x more effectively: they’ll be able to get
incredible ML intuition (having internalized the whole ML
literature and every previous experiment ever run!) and
centuries-equivalent of thinking-time to figure out exactly
the right experiment to run, configure it optimally, and get
maximum value of information; they’ll be able to spend
centuries-equivalent of engineer-time before running even
tiny experiments to avoid bugs and get them right on the
first try; they can make tradeoffs to economize on compute
by focusing on the biggest wins; and they’ll be able to try
tons of smaller-scale experiments (and given effective com-
pute scaleups by then, “smaller-scale” means being able to
train 100,000 GPT-4-level models in a year to try architec-
ture breakthroughs). Some human researchers and engineers
are able to produce 10x the progress of others, even with
the same amount of compute—and this should apply even
moreso to automated AI researchers. I do think this is the
most important bottleneck, and I address it in more depth
below.
• Complementarities/long tail: A classic lesson from eco-
nomics (cf Baumol’s growth disease) is that if you can au-
tomate, say, 70% of something, you get some gains but
quickly the remaining 30% become your bottleneck. For any-
thing that falls short of full automation—say, really good
copilots—human AI researchers would remain a major
bottleneck, making the overall increase in the rate of algo-
rithmic progress relatively small. Moreover, there’s likely
some long tail of capabilities required for automating AI
research—the last 10% of the job of an AI researcher might
be particularly hard to automate. This could soften takeoff
some, though my best guess is that this only delays things
by a couple years. Perhaps 2026/27 models are the proto-automated-researcher; it takes another year or two for some final unhobbling, a somewhat better model, inference speedups, and working out kinks to get to full automation; and finally by 2028 we get the 10x acceleration (and superintelligence by the end of the decade).
• Inherent limits to algorithmic progress: Maybe another 5 OOMs of algorithmic efficiency will be fundamentally impossible? I doubt it. While there will definitely be upper limits,41 if we got 5 OOMs in the last decade, we should probably expect at least another decade’s-worth of progress to be possible. More directly, current architectures and training algorithms are still very rudimentary, and it seems that much more efficient schemes should be possible. Biological reference classes also support dramatically more efficient algorithms being plausible.
41 25 OOMs of algorithmic progress on top of GPT-4, for example, are clearly impossible: that would imply it would be possible to train a GPT-4-level model with just a handful of FLOP.
• Ideas get harder to find, so the automated AI researchers
will merely sustain, rather than accelerate, the current rate
of progress: One objection is that although automated re-
search would increase effective research effort a lot, ideas
also get harder to find. That is, while it takes only a few
hundred top researchers at a lab to sustain 0.5 OOMs/year
today, as we exhaust the low-hanging fruit, it will take more
and more effort to sustain that progress—and so the 100 mil-
lion automated researchers will be merely what’s necessary
to sustain progress. I think this basic model is correct, but
the empirics don’t add up: the magnitude of the increase in
research effort—a million-fold—is way, way larger than the
historical trends of the growth in research effort that’s been
necessary to sustain progress. In econ modeling terms, it’s a
bizarre “knife-edge assumption” to assume that the increase
in research effort from automation will be just enough to keep
progress constant.
• Ideas get harder to find and there are diminishing returns, so
the intelligence explosion will quickly fizzle: Related to the
above objection, even if the automated AI researchers lead
to an initial burst of progress, whether rapid progress can be
sustained depends on the shape of the diminishing returns
curve to algorithmic progress. Again, my best read of the
empirical evidence is that the exponents shake out in favor
of explosive/accelerating progress. In any case, the sheer
size of the one-time boost—from 100s to 100s of millions of
AI researchers—probably overcomes diminishing returns
here for at least a good number of OOMs of algorithmic
progress, even though it of course can’t be indefinitely self-
sustaining.
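A toy calculation, not from the essay, to illustrate why the knife-edge is implausible: suppose (purely hypothetically) that the rate of algorithmic progress scales as (research effort)^λ for some diminishing-returns exponent λ < 1. Even strongly diminishing returns leave a large acceleration from a one-time million-fold jump in effort.

```python
# Toy illustration (my assumption, not the essay's model): suppose the rate of
# algorithmic progress scales as (research effort)**lam for a hypothetical
# diminishing-returns exponent lam < 1. How much does a one-time ~10^6x jump
# in effort (hundreds of researchers -> ~100M automated ones) speed things up?
effort_multiplier = 1e6

for lam in (0.2, 0.4, 0.7):
    acceleration = effort_multiplier ** lam
    print(f"lambda = {lam}: ~{acceleration:,.0f}x faster progress")
# Even with quite strong diminishing returns (lambda = 0.2), the one-time jump
# implies well over a 10x acceleration; the "knife-edge" where it exactly
# offsets harder ideas would require a very particular exponent near zero.
```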
Overall, these factors may slow things down somewhat: the
most extreme versions of intelligence explosion (say, overnight)
seem implausible. And they may result in a somewhat longer
runup (perhaps we need to wait an extra year or two from
more sluggish, proto-automated researchers to the true auto-
mated Alec Radfords, before things kick off in full force). But
they certainly don’t rule out a very rapid intelligence explosion.
A year—or at most just a few years, but perhaps even just a few
months—in which we go from fully-automated AI researchers
to vastly superhuman AI systems should be our mainline ex-
pectation.
If you’d rather skip the in-depth discussions on the various bottlenecks below, feel free to skip ahead to the next section.
Limited compute for experiments (optional, in more depth)
The production function for algorithmic progress includes
two complementary factors of production: research effort
and experiment compute. The millions of automated AI
researchers won’t have any more compute to run their
experiments on than human AI researchers; perhaps they’ll
just be sitting around waiting for their jobs to finish.
This is probably the most important bottleneck to the
intelligence explosion. Ultimately this is a quantitative
question—just how much of a bottleneck is it? On balance,
I find it hard to believe that the 100 million Alec Radfords
couldn’t increase the marginal product of experiment com-
pute by at least 10x (and thus, would still accelerate the
pace of progress by 10x):
• There’s a lot you can do with smaller amounts of compute.
The way most AI research works is that you test things
out at small scale—and then extrapolate via scaling laws.
(Many key historical breakthroughs required only a very
small amount of compute, e.g. the original Transformer
was trained on just 8 GPUs for a few days.) And note
that with ~5 OOMs of baseline scaleup in the next four
years, “small scale” will mean GPT-4 scale—the auto-
mated AI researchers will be able to run 100,000 GPT-4-
level experiments on their training cluster in a year, and
tens of millions of GPT-3-level experiments. (That’s a lot
of potential-breakthrough new architectures they’ll be
able to test!)
– A lot of the compute goes into larger-scale validation
of the final pretraining run—making sure you are get-
ting a high-enough degree of confidence on marginal
efficiency wins for your annual headline product—but
if you’re racing through the OOMs in the intelligence
explosion, you could economize and just focus on the
really big wins.
– As discussed in the previous piece, there are often
enormous gains to be had from relatively low-compute
“unhobbling” of models. These don’t require big pre-
training runs. It’s highly plausible that the intelligence explosion starts off with automated AI research, e.g., discovering a way to do RL on top that gives us a couple OOMs via unhobbling wins (and then we’re off to the races).
– As the automated AI researchers find efficiencies, that’ll
let them run more experiments. Recall the near-1000x
cheaper inference in two years for equivalent-MATH
performance, and the 10x general inference gains in
the last year, discussed in the previous piece, from
mere-human algorithmic progress. The first thing
the automated AI researchers will do is quickly find
similar gains, and in turn, that’ll let them run 100x
more experiments on e.g. new RL approaches. Or
they’ll be able to quickly make smaller models with
similar performance in relevant domains (cf previous
discussion of Gemini Flash, near-100x cheaper than
GPT-4), which in turn will let them run many more
experiments with these smaller models (again, imag-
ine using these to try different RL schemes). There are
probably other overhangs too, e.g. the automated AI
researchers might be able to quickly develop much
better distributed training schemes to utilize all the
inference GPUs (probably at least 10x more compute
right there). More generally, every OOM of training
efficiency gains they find will give them an OOM
more of effective compute to run experiments on.
• The automated AI researchers could be way more efficient.
It’s hard to overstate how many fewer experiments
you would have to run if you just got it right on the
first try—no gnarly bugs, being more selective about
exactly what you are running, and so on. Imagine 1000
automated AI researchers spending a month-equivalent
checking your code and getting the exact experiment
right before you press go. I’ve asked some AI lab col-
leagues about this and they agreed: you should pretty
easily be able to save 3x-10x of compute on most projects
merely if you could avoid frivolous bugs, get things right
on the first try, and only run high value-of-information
experiments.
• The automated AI researchers could have way better intu-
itions.
– Recently, I was speaking to an intern at a frontier lab;
they said that their dominant experience over the
past few months was suggesting many experiments
they wanted to run, and their supervisor (a senior
researcher) saying they could already predict the re-
sult beforehand so there was no need. The senior
researcher’s years of random experiments messing
around with models had honed their intuitions about
what ideas would work—or not. Similarly, it seems
like our AI systems could easily get superhuman in-
tuitions about ML experiments—they will have read
the entire machine learning literature, be able to learn
from every other experiment result and deeply think
about it, they could easily be trained to predict the
outcome of millions of ML experiments, and so on.
And maybe one of the first things they do is build up
a strong basic science of "predicting if this large scale
experiment will be successful just after seeing the first
1% of training, or just after seeing the smaller scale
version of this experiment", and so on.
– Moreover, beyond really good intuitions about re-
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
Vietnam Cotton & Spinning Association
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
agdhot
 
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
1tyxnjpia
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
eudsoh
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Kaxil Naik
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
actyx
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
mkkikqvo
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
Bisnar Chase Personal Injury Attorneys
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
TeukuEriSyahputra
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
ywqeos
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
uevausa
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
KiriakiENikolaidou
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
aguty
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
Vietnam Cotton & Spinning Association
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
dataschool1
 

Recently uploaded (20)

DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
 
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
 

I. From GPT-4 to AGI: Counting the OOMs

AGI by 2027 is strikingly plausible. GPT-2 to GPT-4 took us from ~preschooler to ~smart high-schooler abilities in 4 years. Tracing trendlines in compute (~0.5 orders of magnitude or OOMs/year), algorithmic efficiencies (~0.5 OOMs/year), and "unhobbling" gains (from chatbot to agent), we should expect another preschooler-to-high-schooler-sized qualitative jump by 2027.

Look. The models, they just want to learn. You have to understand this. The models, they just want to learn.
Ilya Sutskever (circa 2015, via Dario Amodei)

GPT-4's capabilities came as a shock to many: an AI system that could write code and essays, could reason through difficult math problems, and ace college exams. A few years ago, most thought these were impenetrable walls.

But GPT-4 was merely the continuation of a decade of breakneck progress in deep learning. A decade earlier, models could barely identify simple images of cats and dogs; four years earlier, GPT-2 could barely string together semi-plausible sentences. Now we are rapidly saturating all the benchmarks we can come up with. And yet this dramatic progress has merely been the result of consistent trends in scaling up deep learning.

There have been people who have seen this for far longer. They were scoffed at, but all they did was trust the trendlines. The trendlines are intense, and they were right. The models, they just want to learn; you scale them up, and they learn more.

I make the following claim: it is strikingly plausible that by 2027, models will be able to do the work of an AI researcher/engineer. That doesn't require believing in sci-fi; it just requires believing in straight lines on a graph.

Figure 1: Rough estimates of past and future scaleup of effective compute (both physical compute and algorithmic efficiencies), based on the public estimates discussed in this piece. As we scale models, they consistently get smarter, and by "counting the OOMs" we get a rough sense of what model intelligence we should expect in the (near) future. (This graph shows only the scaleup in base models; "unhobblings" are not pictured.)

In this piece, I will simply "count the OOMs" (OOM = order of magnitude, 10x = 1 order of magnitude): look at the trends in 1) compute, 2) algorithmic efficiencies (algorithmic progress that we can think of as growing "effective compute"), and 3) "unhobbling" gains (fixing obvious ways in which models are hobbled by default, unlocking latent capabilities and giving them tools, leading to step-changes in usefulness). We trace the growth in each over four years before GPT-4, and what we should expect in the four years after, through the end of 2027. Given deep learning's consistent improvements for every OOM of effective compute, we can use this to project future progress.

Publicly, things have been quiet for a year since the GPT-4 release, as the next generation of models has been in the oven—leading some to proclaim stagnation and that deep learning is hitting a wall.[1] But by counting the OOMs, we get a peek at what we should actually expect.

[1] Predictions they've made every year for the last decade, and which they've been consistently wrong about...

The upshot is pretty simple. GPT-2 to GPT-4—from models that were impressive for sometimes managing to string together a few coherent sentences, to models that ace high-school exams—was not a one-time gain. We are racing through the OOMs extremely rapidly, and the numbers indicate we should expect another ~100,000x effective compute scaleup—resulting in another GPT-2-to-GPT-4-sized qualitative jump—over four years. Moreover, and critically, that doesn't just mean a better chatbot; picking the many obvious low-hanging fruit on "unhobbling" gains should take us from chatbots to agents, from a tool to something that looks more like drop-in remote worker replacements.

While the inference is simple, the implication is striking. Another jump like that very well could take us to AGI, to models as smart as PhDs or experts that can work beside us as coworkers. Perhaps most importantly, if these AI systems could automate AI research itself, that would set in motion intense feedback loops—the topic of the next piece in the series.

Even now, barely anyone is pricing all this in. But situational awareness on AI isn't actually that hard, once you step back and look at the trends. If you keep being surprised by AI capabilities, just start counting the OOMs.
The last four years

We have machines now that we can basically talk to like humans. It's a remarkable testament to the human capacity to adjust that this seems normal, that we've become inured to the pace of progress. But it's worth stepping back and looking at the progress of just the last few years.

GPT-2 to GPT-4

Let me remind you of how far we came in just the ~4 (!) years leading up to GPT-4.

GPT-2 (2019) ~ preschooler: "Wow, it can string together a few plausible sentences." A very-cherry-picked example of a semi-coherent story about unicorns in the Andes it generated was incredibly impressive at the time. And yet GPT-2 could barely count to 5 without getting tripped up;[2] when summarizing an article, it just barely outperformed selecting 3 random sentences from the article.[3]

[2] From SSC: "Janelle Shane asks GPT-2 its ten favorite animals: Prompt: My 10 favorite animals are: 1. My ten favorite animals are: 1. Zebras with a white scar on the back 2. Insiduous spiders and octopus 3. Frog with large leaves, hopefully black 4. Cockatiel with scales 5. Razorbill with wings hanging about 4 inches from one's face and a heart tattoo on a frog 3. Cockatric interlocking tetrabods that can be blind, cut, and eaten raw: 4. Black and white desert crocodiles living in sunlight 5. Zebra and many other pea bugs"
[3] From the GPT-2 paper, Section 3.6.

Figure 2: Some examples of what people found impressive about GPT-2 at the time. Left: GPT-2 does an ok job on extremely basic reading comprehension questions. Right: In a cherry-picked sample (best of 10 tries), GPT-2 can write a semi-coherent paragraph that says some semi-relevant things about the Civil War.

Comparing AI capabilities with human intelligence is difficult and flawed, but I think it's informative to consider the analogy here, even if it's highly imperfect. GPT-2 was shocking for its command of language, and its ability to occasionally generate a semi-cohesive paragraph, or occasionally answer simple factual questions correctly. It's what would have been impressive for a preschooler.

GPT-3 (2020)[4] ~ elementary schooler: "Wow, with just some few-shot examples it can do some simple useful tasks." It started being cohesive over even multiple paragraphs much more consistently, and could correct grammar and do some very basic arithmetic. For the first time, it was also commercially useful in a few narrow ways: for example, GPT-3 could generate simple copy for SEO and marketing.

[4] I mean clunky old GPT-3 here, not the dramatically-improved GPT-3.5 you might know from ChatGPT.

Figure 3: Some examples of what people found impressive about GPT-3 at the time. Top: After a simple instruction, GPT-3 can use a made-up word in a new sentence. Bottom-left: GPT-3 can engage in rich storytelling back-and-forth. Bottom-right: GPT-3 can generate some very simple code.

Again, the comparison is imperfect, but what impressed people about GPT-3 is perhaps what would have been impressive for an elementary schooler: it wrote some basic poetry, could tell richer and coherent stories, could start to do rudimentary coding, could fairly reliably learn from simple instructions and demonstrations, and so on.

GPT-4 (2023) ~ smart high schooler: "Wow, it can write pretty sophisticated code and iteratively debug, it can write intelligently and sophisticatedly about complicated subjects, it can reason through difficult high-school competition math, it's beating the vast majority of high schoolers on whatever tests we can give it, etc." From code to math to Fermi estimates, it can think and reason. GPT-4 is now useful in my daily tasks, from helping write code to revising drafts.

Figure 4: Some of what people found impressive about GPT-4 when it was released, from the "Sparks of AGI" paper. Top: It's writing very complicated code (producing the plots shown in the middle) and can reason through nontrivial math problems. Bottom-left: Solving an AP math problem. Bottom-right: Solving a fairly complex coding problem. More interesting excerpts from that exploration of GPT-4's capabilities here.
On everything from AP exams to the SAT, GPT-4 scores better than the vast majority of high schoolers.

Of course, even GPT-4 is still somewhat uneven; for some tasks it's much better than smart high-schoolers, while there are other tasks it can't yet do. That said, I tend to think most of these limitations come down to obvious ways models are still hobbled, as I'll discuss in-depth later. The raw intelligence is (mostly) there, even if the models are still artificially constrained; it'll take extra work to unlock models being able to fully apply that raw intelligence across applications.

Figure 5: Progress over just four years. Where are you on this line?

The trends in deep learning

The pace of deep learning progress in the last decade has simply been extraordinary. A mere decade ago it was revolutionary for a deep learning system to identify simple images. Today, we keep trying to come up with novel, ever harder tests, and yet each new benchmark is quickly cracked. It used to take decades to crack widely-used benchmarks; now it feels like mere months.

Figure 6: Deep learning systems are rapidly reaching or exceeding human-level in many domains. Graphic: Our World in Data.

We're literally running out of benchmarks. As an anecdote, my friends Dan and Collin made a benchmark called MMLU a few years ago, in 2020. They hoped to finally make a benchmark that would stand the test of time, equivalent to all the hardest exams we give high school and college students. Just three years later, it's basically solved: models like GPT-4 and Gemini get ~90%. More broadly, GPT-4 mostly cracks all the standard high school and college aptitude tests (Figure 7).[5]

[5] And no, these tests aren't in the training set. AI labs put real effort into ensuring these evals are uncontaminated, because they need good measurements in order to do good science. A recent analysis on this by ScaleAI confirmed that the leading labs aren't overfitting to the benchmarks (though some smaller LLM developers might be juicing their numbers).

Or consider the MATH benchmark, a set of difficult mathematics problems from high-school math competitions.[6] When the benchmark was released in 2021, GPT-3 only got ~5% of problems right. And the original paper noted: "Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue [...]. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community"—we would need fundamental new breakthroughs to solve MATH, or so they thought. A survey of ML researchers predicted minimal progress over the coming years (Figure 8);[7] and yet within just a year (by mid-2022), the best models went from ~5% to 50% accuracy; now, MATH is basically solved, with recent performance over 90%.

[6] In the original paper, it was noted: "We also evaluated humans on MATH, and found that a computer science PhD student who does not especially like mathematics attained approximately 40% on MATH, while a three-time IMO gold medalist attained 90%, indicating that MATH can be challenging for humans as well."
[7] A coauthor notes: "When our group first released the MATH dataset, at least one [ML researcher colleague] told us that it was a pointless dataset because it was too far outside the range of what ML models could accomplish (indeed, I was somewhat worried about this myself)."

Figure 7: GPT-4 scores on standardized tests. Note also the large jump from GPT-3.5 to GPT-4 in human percentile on these tests, often from well below the median human to the very top of the human range. (And this is GPT-3.5, a fairly recent model released less than a year before GPT-4, not the clunky old elementary-school-level GPT-3 we were talking about earlier!)

Figure 8: Gray: Professional forecasts, made in August 2021, for June 2022 performance on the MATH benchmark (difficult mathematics problems from high-school math competitions). Red star: actual state-of-the-art performance by June 2022, far exceeding even the upper range forecasters gave. The median ML researcher was even more pessimistic.

Over and over again, year after year, skeptics have claimed "deep learning won't be able to do X" and have been quickly proven wrong.[8] If there's one lesson we've learned from the past decade of AI, it's that you should never bet against deep learning.

[8] Here's Yann LeCun predicting in 2022 that even GPT-5000 won't be able to reason about physical interactions with the real world; GPT-4 obviously does it with ease a year later. Here's Gary Marcus's walls predicted after GPT-2 being solved by GPT-3, and the walls he predicted after GPT-3 being solved by GPT-4. Here's Prof. Bryan Caplan losing his first-ever public bet (after previously famously having a perfect public betting track record). In January 2023, after GPT-3.5 got a D on his economics midterm, Prof. Caplan bet Matthew Barnett that no AI would get an A on his economics midterms by 2029. Just two months later, when GPT-4 came out, it promptly scored an A on his midterm (and it would have been one of the highest scores in his class).

Now the hardest unsolved benchmarks are tests like GPQA, a set of PhD-level biology, chemistry, and physics questions. Many of the questions read like gibberish to me, and even PhDs in other scientific fields spending 30+ minutes with Google barely score above random chance. Claude 3 Opus currently gets ~60%,[9] compared to in-domain PhDs who get ~80%—and I expect this benchmark to fall as well, in the next generation or two.

[9] On the diamond set, majority voting of the model trying 32 times with chain-of-thought.

Figure 9: Example GPQA questions. Models are already better at this than I am, and we'll probably crack expert-PhD-level soon...
Counting the OOMs

How did this happen? The magic of deep learning is that it just works—and the trendlines have been astonishingly consistent, despite naysayers at every turn.

Figure 10: The effects of scaling compute, in the example of OpenAI Sora.

With each OOM of effective compute, models predictably, reliably get better.[10] If we can count the OOMs, we can (roughly, qualitatively) extrapolate capability improvements.[11] That's how a few prescient individuals saw GPT-4 coming.

[10] And it's worth noting just how consistent these trendlines are. Combining the original scaling laws paper with some of the estimates on compute and compute efficiency scaling since then implies a consistent scaling trend for over 15 orders of magnitude (over 1,000,000,000,000,000x in effective compute)!
[11] A common misconception is that scaling only holds for perplexity loss, but we see very clear and consistent scaling behavior on downstream performance on benchmarks as well. It's usually just a matter of finding the right log-log graph. For example, in the GPT-4 blog post, they show consistent scaling behavior for performance on coding problems over 6 OOMs (1,000,000x) of compute, using MLPR (mean log pass rate). The "Are Emergent Abilities a Mirage?" paper makes a similar point; with the right choice of metric, there is almost always a consistent trend for performance on downstream tasks. More generally, the "scaling hypothesis" qualitative observation—very clear trends on model capability with scale—predates loss-scaling-curves; the "scaling laws" work was just a formal measurement of this.

We can decompose the progress in the four years from GPT-2 to GPT-4 into three categories of scaleups:

1. Compute: We're using much bigger computers to train these models.
2. Algorithmic efficiencies: There's a continuous trend of algorithmic progress. Many of these act as "compute multipliers," and we can put them on a unified scale of growing effective compute.
3. "Unhobbling" gains: By default, models learn a lot of amazing raw capabilities, but they are hobbled in all sorts of dumb ways, limiting their practical value. With simple algorithmic improvements like reinforcement learning from human feedback (RLHF), chain-of-thought (CoT), tools, and scaffolding, we can unlock significant latent capabilities.

We can "count the OOMs" of improvement along these axes: that is, trace the scaleup for each in units of effective compute. 3x is 0.5 OOMs; 10x is 1 OOM; 30x is 1.5 OOMs; 100x is 2 OOMs; and so on. We can also look at what we should expect on top of GPT-4, from 2023 to 2027.

I'll go through each one-by-one, but the upshot is clear: we are rapidly racing through the OOMs. There are potential headwinds in the data wall, which I'll address—but overall, it seems likely that we should expect another GPT-2-to-GPT-4-sized jump, on top of GPT-4, by 2027.

Compute

I'll start with the most commonly-discussed driver of recent progress: throwing (a lot) more compute at models.

Many people assume that this is simply due to Moore's Law. But even in the old days when Moore's Law was in its heyday, it was comparatively glacial—perhaps 1-1.5 OOMs per decade. We are seeing much more rapid scaleups in compute—close to 5x the speed of Moore's Law—instead because of mammoth investment. (Spending even a million dollars on a single model used to be an outrageous thought nobody would entertain, and now that's pocket change!)

Table 1: Estimates of compute for GPT-2 to GPT-4 by Epoch AI.

    Model          Estimated Compute       Growth
    GPT-2 (2019)   ~4e21 FLOP
    GPT-3 (2020)   ~3e23 FLOP              + ~2 OOMs
    GPT-4 (2023)   8e24 to 4e25 FLOP       + ~1.5-2 OOMs

We can use public estimates from Epoch AI (a source widely respected for its excellent analysis of AI trends) to trace the compute scaleup from 2019 to 2023. GPT-2 to GPT-3 was a quick scaleup; there was a large overhang of compute, scaling from a smaller experiment to using an entire datacenter to train a large language model. With the scaleup from GPT-3 to GPT-4, we transitioned to the modern regime: having to build an entirely new (much bigger) cluster for the next model.
  • 21. situational awareness 21 the dramatic growth continued. Overall, Epoch AI estimates suggest that GPT-4 training used ~3,000x-10,000x more raw compute than GPT-2. In broad strokes, this is just the continuation of a longer-running trend. For the last decade and a half, primarily because of broad scaleups in investment (and specializing chips for AI workloads in the form of GPUs and TPUs), the training com- pute used for frontier AI systems has grown at roughly ~0.5 OOMs/year. Figure 11: Training compute of no- table deep learning models over time. Source: Epoch AI. The compute scaleup from GPT-2 to GPT-3 in a year was an unusual overhang, but all the indications are that the longer- run trend will continue. The SF-rumor-mill is abreast with dra- matic tales of huge GPU orders. The investments involved will be extraordinary—but they are in motion. I go into this more later in the series, in IIIa. Racing to the Trillion-Dollar Cluster; based on that analysis, an additional 2 OOMs of compute (a cluster in the $10s of billions) seems very likely to happen by the end of 2027; even a cluster closer to +3 OOMs of compute ($100 billion+) seems plausible (and is rumored to be in the works at Microsoft/OpenAI).
Algorithmic efficiencies

While massive investments in compute get all the attention, algorithmic progress is probably a similarly important driver of progress (and has been dramatically underrated).

To see just how big of a deal algorithmic progress can be, consider the following illustration (Figure 12) of the drop in price to attain ~50% accuracy on the MATH benchmark (high school competition math) over just two years. (For comparison, a computer science PhD student who didn't particularly like math scored 40%, so this is already quite good.) Inference efficiency improved by nearly 3 OOMs—1,000x—in less than two years.

Figure 12: Rough estimate on relative inference cost of attaining ~50% MATH performance.

Though these are numbers just for inference efficiency (which may or may not correspond to training efficiency improvements, where numbers are harder to infer from public data), they make clear there is an enormous amount of algorithmic progress possible and happening.

Calculations: Gemini 1.5 Flash scores 54.9% on MATH, and costs $0.35/$1.05 (input/output) per million tokens. GPT-4 scored 42.5% on MATH pre-release and 52.9% on MATH in early 2023, and cost $30/$60 (input/output) per million tokens; that's 85x/57x (input/output) more expensive per token than Gemini 1.5 Flash. To be conservative, I use an estimate of 30x cost decrease above (accounting for Gemini 1.5 Flash possibly using more tokens to reason through problems).

Minerva 540B scores 50.3% on MATH, using majority voting among 64 samples. A knowledgeable friend estimates the base model here is probably 2-3x more expensive to inference than GPT-4. However, Minerva seems to use somewhat fewer tokens per answer on a quick spot check. More importantly, Minerva needed 64 samples to achieve that performance, naively implying a 64x multiple on cost if you e.g. naively ran this via an inference API. In practice, prompt tokens can be cached when running an eval; given a few-shot prompt, prompt tokens are likely a majority of the cost, even accounting for output tokens. Supposing output tokens are a third of the cost for getting a single sample, that would imply only a ~20x increase in cost from the maj@64 with caching. To be conservative, I use the rough number of a 20x cost decrease in the above (even if the naive decrease in inference cost from running this via an API would be larger).
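As a sanity check on the ~1,000x figure, here is a quick script (a sketch of mine, using only the per-token prices and the conservative multipliers quoted above):

    import math

    # Per-million-token API prices quoted above (input, output), in USD.
    gpt4_prices  = (30.00, 60.00)   # original GPT-4
    flash_prices = (0.35, 1.05)     # Gemini 1.5 Flash

    ratios = [int(g / f) for g, f in zip(gpt4_prices, flash_prices)]
    print(f"GPT-4 vs Gemini 1.5 Flash per-token price: ~{ratios[0]}x (input), ~{ratios[1]}x (output)")

    # The conservative end-to-end multipliers used in the text:
    minerva_to_gpt4 = 20    # Minerva 540B maj@64 (with prompt caching) vs GPT-4
    gpt4_to_flash   = 30    # GPT-4 vs Gemini 1.5 Flash, discounted for Flash's longer reasoning
    total = minerva_to_gpt4 * gpt4_to_flash
    print(f"Total drop in cost to reach ~50% MATH: ~{total}x (~{math.log10(total):.1f} OOMs)")

The conservative multipliers compound to ~600x, just shy of 3 OOMs; using the raw per-token ratios instead would push the estimate past 1,000x.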
In this piece, I'll separate out two kinds of algorithmic progress. Here, I'll start by covering "within-paradigm" algorithmic improvements—those that simply result in better base models, and that straightforwardly act as compute efficiencies or compute multipliers. For example, a better algorithm might allow us to achieve the same performance but with 10x less training compute. In turn, that would act as a 10x (1 OOM) increase in effective compute. (Later, I'll cover "unhobbling," which you can think of as "paradigm-expanding/application-expanding" algorithmic progress that unlocks capabilities of base models.)

If we step back and look at the long-term trends, we seem to find new algorithmic improvements at a fairly consistent rate. Individual discoveries seem random, and at every turn there seem to be insurmountable obstacles—but the long-run trendline is predictable, a straight line on a graph. Trust the trendline.

We have the best data for ImageNet (where algorithmic research has been mostly public and we have data stretching back a decade), for which we have consistently improved compute efficiency by roughly ~0.5 OOMs/year across the 9-year period between 2012 and 2021.

Figure 13: We can measure algorithmic progress: how much less compute is needed in 2021 compared to 2012 to train a model with the same performance? We see a trend of ~0.5 OOMs/year of algorithmic efficiency. Source: Erdil and Besiroglu 2022.

That's a huge deal: that means 4 years later, we can achieve the same level of performance for ~100x less compute (and, concomitantly, much higher performance for the same compute!).
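For intuition on how such a rate compounds (an illustrative one-liner of mine, not a new estimate):

    # A steady ~0.5 OOMs/year of algorithmic efficiency compounds quickly.
    rate = 0.5  # OOMs per year
    for years in (2, 4, 8):
        print(f"{years} years at {rate} OOMs/yr -> ~{10 ** (rate * years):,.0f}x less compute "
              f"for the same performance")

Four years at trend gives the ~100x figure above; eight years gives the ~4 OOMs that Epoch AI estimates for language modeling in Figure 14 below.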
Unfortunately, since labs don't publish internal data on this, it's harder to measure algorithmic progress for frontier LLMs over the last four years. Epoch AI has new work replicating their results on ImageNet for language modeling, and estimates a similar ~0.5 OOMs/year algorithmic efficiency trend in LLMs from 2012 to 2023. (This has wider error bars though, and doesn't capture some more recent gains, since the leading labs have stopped publishing their algorithmic efficiencies.)

Figure 14: Estimates by Epoch AI of algorithmic efficiencies in language modeling. Their estimates suggest we've made ~4 OOMs of efficiency gains in 8 years.

More directly looking at the last 4 years, GPT-2 to GPT-3 was basically a simple scaleup (according to the paper), but there have been many publicly-known and publicly-inferable gains since GPT-3:

• We can infer gains from API costs:[12]
  – GPT-4, on release, cost ~the same as GPT-3 when it was released, despite the absolutely enormous performance increase.[13] (If we do a naive and oversimplified back-of-the-envelope estimate based on scaling laws, this suggests that perhaps roughly half the effective compute increase from GPT-3 to GPT-4 came from algorithmic improvements.[14])
  – Since the GPT-4 release a year ago, OpenAI prices for GPT-4-level models have fallen another 6x/4x (input/output) with the release of GPT-4o.
  – Gemini 1.5 Flash, recently released, offers between "GPT-3.75-level" and GPT-4-level performance,[15] while costing 85x/57x (input/output) less than the original GPT-4 (extraordinary gains!).
• Chinchilla scaling laws give a 3x+ (0.5 OOMs+) efficiency gain.[16]
• Gemini 1.5 Pro claimed major compute efficiency gains (outperforming Gemini 1.0 Ultra, while using "significantly less" compute), with Mixture of Experts (MoE) as a highlighted architecture change. Other papers also claim a substantial multiple on compute from MoE.
• There have been many tweaks and gains on architecture, data, training stack, etc. all the time.[17]

[12] Though these are inference efficiencies (rather than necessarily training efficiencies), and to some extent will reflect inference-specific optimizations, a) they suggest enormous amounts of algorithmic progress is possible and happening in general, and b) it's often the case that an algorithmic improvement is both a training efficiency gain and an inference efficiency gain, for example by reducing the number of parameters necessary.
[13] GPT-3: $60/1M tokens; GPT-4: $30/1M input tokens and $60/1M output tokens.
[14] Chinchilla scaling laws say that one should scale parameter count and data equally. That is, parameter count grows "half the OOMs" of the OOMs that effective training compute grows. At the same time, parameter count is intuitively roughly proportional to inference costs. All else equal, constant inference costs thus imply that half of the OOMs of effective compute growth were "canceled out" by algorithmic wins. That said, to be clear, this is a very naive calculation (just meant for a rough illustration) that is wrong in various ways. There may be inference-specific optimizations (that don't translate into training efficiency); there may be training efficiencies that don't reduce parameter count (and thus don't translate into inference efficiency); and so on.
[15] Gemini 1.5 Flash ranks similarly to GPT-4 (higher than original GPT-4, lower than updated versions of GPT-4) on LMSys, a chatbot leaderboard, and has similar performance on MATH and GPQA (evals that measure reasoning) as the original GPT-4, while landing roughly in the middle between GPT-3.5 and GPT-4 on MMLU (an eval that more heavily weights towards measuring knowledge).
[16] At ~GPT-3 scale; more than 3x at larger scales.
[17] For example, this paper contains a comparison of a GPT-3-style vanilla Transformer to various simple changes to architecture and training recipe published over the years (RMSNorm instead of LayerNorm, different positional embeddings, SwiGLU activation, AdamW optimizer instead of Adam, etc.), what they call "Transformer++", implying a 6x gain at least at small scale.

Put together, public information suggests that the GPT-2 to GPT-4 jump included 1-2 OOMs of algorithmic efficiency gains.[18]

[18] If we take the trend of 0.5 OOMs/year, and 4 years between the GPT-2 and GPT-4 releases, that would be 2 OOMs. However, GPT-2 to GPT-3 was a simple scaleup (after big gains from e.g. Transformers), and OpenAI claims GPT-4 pretraining finished in 2022, which could mean we're looking at closer to 2 years' worth of algorithmic progress that we should be counting here. 1 OOM of algorithmic efficiency seems like a conservative lower bound.

Figure 15: Decomposing progress: compute and algorithmic efficiencies. (Rough illustration.)

Over the 4 years following GPT-4, we should expect the trend to continue:[19] on average ~0.5 OOMs/year of compute efficiency, i.e. ~2 OOMs of gains compared to GPT-4 by 2027. While compute efficiencies will become harder to find as we pick the low-hanging fruit, AI lab investments in money and talent to find new algorithmic improvements are growing rapidly.[20] (The publicly-inferable inference cost efficiencies, at least, don't seem to have slowed down at all.) On the high end, we could even see more fundamental, Transformer-like[21] breakthroughs with even bigger gains.

[19] At the very least, given over a decade of consistent algorithmic improvements, the burden of proof would be on those who would suggest it will all suddenly come to a halt!
[20] The economic returns to a 3x compute efficiency will be measured in the $10s of billions or more, given cluster costs.
[21] Very roughly something like a ~10x gain.

Put together, this suggests we should expect something like 1-3 OOMs of algorithmic efficiency gains (compared to GPT-4) by the end of 2027, maybe with a best guess of ~2 OOMs.
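Tallying this up with the compute projection from the previous section (a rough tally of my own, using only the ranges given in this piece):

    # Projected scaleup on top of GPT-4 by the end of 2027, in OOMs of effective compute
    # (base models only; "unhobbling" gains would come on top of this).
    compute_low, compute_high = 2.0, 3.0   # $10s-of-billions cluster very likely; ~$100B+ plausible
    algo_low, algo_high       = 1.0, 3.0   # algorithmic efficiency gains, best guess ~2 OOMs

    total_low, total_high = compute_low + algo_low, compute_high + algo_high
    print(f"Total: {total_low:.0f}-{total_high:.0f} OOMs of effective compute, "
          f"i.e. ~{10 ** total_low:,.0f}x to ~{10 ** total_high:,.0f}x over GPT-4")

The middle of that range, roughly 4-5 OOMs, is another GPT-2-to-GPT-4-sized scaleup of the kind flagged at the start of this chapter, before counting any "unhobbling" gains.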
The data wall

There is a potentially important source of variance for all of this: we're running out of internet data. That could mean that, very soon, the naive approach to pretraining larger language models on more scraped data could start hitting serious bottlenecks.

Frontier models are already trained on much of the internet. Llama 3, for example, was trained on over 15T tokens. Common Crawl, a dump of much of the internet used for LLM training, is >100T tokens raw, though much of that is spam and duplication (e.g., a relatively simple deduplication leads to 30T tokens, implying Llama 3 would already be using basically all the data). Moreover, for more specific domains like code, there are many fewer tokens still, e.g. public GitHub repos are estimated to be in the low trillions of tokens.

You can go somewhat further by repeating data, but academic work on this suggests that repetition only gets you so far, finding that after 16 epochs (a 16-fold repetition), returns diminish extremely fast to nil. At some point, even with more (effective) compute, making your models better can become much tougher because of the data constraint.
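To make the bottleneck concrete, here is a quick back-of-the-envelope using only the rough public figures above (these are loose estimates, not precise numbers):

    import math

    # Rough token-budget arithmetic behind the data wall.
    llama3_tokens     = 15e12   # Llama 3: trained on over 15T tokens
    deduped_web       = 30e12   # Common Crawl after a simple deduplication (~30T of >100T raw)
    max_useful_epochs = 16      # repetition beyond ~16 epochs reportedly yields little further gain

    ceiling  = deduped_web * max_useful_epochs
    headroom = ceiling / llama3_tokens
    print(f"Naive ceiling: ~{ceiling / 1e12:.0f}T token-passes, ~{headroom:.0f}x a Llama-3-scale run "
          f"(~{math.log10(headroom):.1f} OOMs of data headroom)")

On these numbers, naive pretraining has only around 1.5 OOMs of data headroom left, which is why the research bets described next matter so much.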
This isn't to be understated: we've been riding the scaling curves, riding the wave of the language-modeling-pretraining-paradigm, and without something new here, this paradigm will (at least naively) run out. Despite the massive investments, we'd plateau.

All of the labs are rumored to be making massive research bets on new algorithmic improvements or approaches to get around this. Researchers are purportedly trying many strategies, from synthetic data to self-play and RL approaches. Industry insiders seem to be very bullish: Dario Amodei (CEO of Anthropic) recently said on a podcast: "if you look at it very naively we're not that far from running out of data [...] My guess is that this will not be a blocker [...] There's just many different ways to do it." Of course, any research results on this are proprietary and not being published these days.

In addition to insider bullishness, I think there's a strong intuitive case for why it should be possible to find ways to train models with much better sample efficiency (algorithmic improvements that let them learn more from limited data). Consider how you or I would learn from a really dense math textbook:

• What a modern LLM does during training is, essentially, very very quickly skim the textbook, the words just flying by, not spending much brain power on it.
• Rather, when you or I read that math textbook, we read a couple pages slowly; then have an internal monologue about the material in our heads and talk about it with a few study-buddies; read another page or two; then try some practice problems, fail, try them again in a different way, get some feedback on those problems, try again until we get a problem right; and so on, until eventually the material "clicks."
• You or I also wouldn't learn much at all from a pass through a dense math textbook if all we could do was breeze through it like LLMs.[22]
• But perhaps, then, there are ways to incorporate aspects of how humans would digest a dense math textbook to let the models learn much more from limited data. In a simplified sense, this sort of thing—having an internal monologue about material, having a discussion with a study-buddy, trying and failing at problems until it clicks—is what many synthetic data/self-play/RL approaches are trying to do.[23]

[22] And just rereading the same textbook over and over again might result in memorization, not understanding. I take it that's how many wordcels pass math classes!
[23] One other way of thinking about it I find interesting: there is a "missing middle" between pretraining and in-context learning. In-context learning is incredible (and competitive with human sample efficiency). For example, the Gemini 1.5 Pro paper discusses giving the model instructional materials (a textbook, a dictionary) on Kalamang, a language spoken by fewer than 200 people and basically not present on the internet, in context—and the model learns to translate from English to Kalamang at human-level! In context, the model is able to learn from the textbook as well as a human could (and much better than it would learn from just chucking that one textbook into pretraining). When a human learns from a textbook, they're able to distill their short-term memory/learnings into long-term memory/long-term skills with practice; however, we don't have an equivalent way to distill in-context learning "back to the weights." Synthetic data/self-play/RL/etc. are trying to fix that: let the model learn by itself, then think about it and practice what it learned, distilling that learning back into the weights.

The old state of the art of training models was simple and naive, but it worked, so nobody really tried hard to crack these approaches to sample efficiency. Now that it may become more of a constraint, we should expect all the labs to invest billions of dollars and their smartest minds into cracking it. A common pattern in deep learning is that it takes a lot of effort (and many failed projects) to get the details right, but eventually some version of the obvious and simple thing just works. Given how deep learning has managed to crash through every supposed wall over the last decade, my base case is that it will be similar here.

Moreover, it actually seems possible that cracking one of these algorithmic bets like synthetic data could dramatically improve models. Here's an intuition pump. Current frontier models like Llama 3 are trained on the internet—and the internet is mostly crap, like e-commerce or SEO or whatever. Many LLMs spend the vast majority of their training compute on this crap, rather than on really high-quality data (e.g. reasoning chains of people working through difficult science problems). Imagine if you could spend GPT-4-level compute on entirely extremely high-quality data—it could be a much, much more capable model.

A look back at AlphaGo—the first AI system that beat the world champions at Go, decades before it was thought possible—is useful here as well.[24]

[24] See also Andrej Karpathy's talk discussing this here.

• In step 1, AlphaGo was trained by imitation learning on expert human Go games. This gave it a foundation.
• In step 2, AlphaGo played millions of games against itself. This let it become superhuman at Go: remember the famous move 37 in the game against Lee Sedol, an extremely unusual but brilliant move a human would never have played.

Developing the equivalent of step 2 for LLMs is a key research problem for overcoming the data wall (and, moreover, will ultimately be the key to surpassing human-level intelligence).

All of this is to say that data constraints seem to inject large error bars either way into forecasting the coming years of AI progress. There's a very real chance things stall out (LLMs might still be as big of a deal as the internet, but we wouldn't get to truly crazy AGI). But I think it's reasonable to guess that the labs will crack it, and that doing so will not just keep the scaling curves going, but possibly enable huge gains in model capability.

As an aside, this also means that we should expect more variance between the different labs in coming years compared to today. Up until recently, the state-of-the-art techniques were published, so everyone was basically doing the same thing. (And new upstarts or open source projects could easily compete with the frontier, since the recipe was published.) Now, key algorithmic ideas are becoming increasingly proprietary. I'd expect labs' approaches to diverge much more, and some to make faster progress than others—even a lab that seems on the frontier now could get stuck on the data wall while others make a breakthrough that lets them race ahead. And open source will have a much harder time competing. It will certainly make things interesting. (And if and when a lab figures it out, their breakthrough will be the key to AGI, key to superintelligence—one of the United States' most prized secrets.)
  • 30. situational awareness 30 Unhobbling Finally, the hardest to quantify—but no less important—category of improvements: what I’ll call “unhobbling.” Imagine if when asked to solve a hard math problem, you had to instantly answer with the very first thing that came to mind. It seems obvious that you would have a hard time, except for the simplest problems. But until recently, that’s how we had LLMs solve math problems. Instead, most of us work through the problem step-by-step on a scratchpad, and are able to solve much more difficult problems that way. “Chain-of-thought” prompting unlocked that for LLMs. Despite excellent raw ca- pabilities, they were much worse at math than they could be because they were hobbled in an obvious way, and it took a small algorithmic tweak to unlock much greater capabilities. We’ve made huge strides in “unhobbling” models over the past few years. These are algorithmic improvements beyond just training better base models—and often only use a fraction of pretraining compute—that unleash model capabilities: • Reinforcement learning from human feedback (RLHF). Base mod- els have incredible latent capabilities,25 but they’re raw and 25 That’s the magic of unsupervised learning, in some sense: to better pre- dict the next token, to make perplexity go down, models learn incredibly rich internal representations, everything from (famously) sentiment to complex world models. But, out of the box, they’re hobbled: they’re using their in- credible internal representations merely to predict the next token in random internet text, and rather than applying them in the best way to actually try to solve your problem. incredibly hard to work with. While the popular conception of RLHF is that it merely censors swear words, RLHF has been key to making models actually useful and commer- cially valuable (rather than making models predict random internet text, get them to actually apply their capabilities to try to answer your question!). This was the magic of ChatGPT—well-done RLHF made models usable and use- ful to real people for the first time. The original InstructGPT paper has a great quantification of this: an RLHF’d small model was equivalent to a non-RLHF’d >100x larger model in terms of human rater preference. • Chain of Thought (CoT). As discussed. CoT started being widely used just 2 years ago and can provide the equiva- lent of a >10x effective compute increase on math/reasoning problems. • Scaffolding. Think of CoT++: rather than just asking a model
to solve a problem, have one model make a plan of attack, have another propose a bunch of possible solutions, have another critique it, and so on. For example, on HumanEval (coding problems), simple scaffolding enables GPT-3.5 to outperform un-scaffolded GPT-4. On SWE-Bench (a benchmark of solving real-world software engineering tasks), GPT-4 can only solve ~2% correctly, while with Devin's agent scaffolding it jumps to 14-23%. (Unlocking agency is only in its infancy though, as I'll discuss more later.)

• Tools: Imagine if humans weren't allowed to use calculators or computers. We're only at the beginning here, but ChatGPT can now use a web browser, run some code, and so on.

• Context length. Models have gone from 2k token context (GPT-3) to 32k context (GPT-4 release) to 1M+ context (Gemini 1.5 Pro). This is a huge deal. A much smaller base model with, say, 100k tokens of relevant context can outperform a model that is much larger but only has, say, 4k relevant tokens of context—more context is effectively a large compute efficiency gain.26 More generally, context is key to unlocking many applications of these models: for example, many coding applications require understanding large parts of a codebase in order to usefully contribute new code; or, if you're using a model to help you write a document at work, it really needs the context from lots of related internal docs and conversations. Gemini 1.5 Pro, with its 1M+ token context, was even able to learn a new language (a low-resource language not on the internet) from scratch, just by putting a dictionary and grammar reference materials in context!

26. See Figure 7 from the updated Gemini 1.5 whitepaper, comparing perplexity vs. context for Gemini 1.5 Pro and Gemini 1.5 Flash (a much cheaper and presumably smaller model).

• Posttraining improvements. The current GPT-4 has substantially improved compared to the original GPT-4 when released, according to John Schulman due to posttraining improvements that unlocked latent model capability: on reasoning evals it's made substantial gains (e.g., ~50% -> 72% on MATH, ~40% to ~50% on GPQA) and on the LMSys leaderboard, it's made a nearly 100-point elo jump (comparable to the difference in elo between Claude 3 Haiku and the much larger Claude 3 Opus, models that have a ~50x price difference).
A survey by Epoch AI of some of these techniques, like scaffolding, tool use, and so on, finds that techniques like this can typically result in effective compute gains of 5-30x on many benchmarks. METR (an organization that evaluates models) similarly found very large performance improvements on their set of agentic tasks, via unhobbling from the same GPT-4 base model: from 5% with just the base model, to 20% with GPT-4 as posttrained on release, to nearly 40% today from better posttraining, tools, and agent scaffolding (Figure 16).

Figure 16: Performance on METR's agentic tasks, over time via better unhobbling. Source: Model Evaluation and Threat Research

While it's hard to put these on a unified effective compute scale with compute and algorithmic efficiencies, it's clear these are huge gains, at least of a roughly similar magnitude to the compute scaleup and algorithmic efficiencies. (It also highlights the central role of algorithmic progress: the ~0.5 OOMs/year of compute efficiencies, already significant, are only part of the story, and put together with unhobbling, algorithmic progress overall is maybe even a majority of the gains on the current trend.)

"Unhobbling" is a huge part of what actually enabled these models to become useful—and I'd argue that much of what is holding back many commercial applications today is the need for further "unhobbling" of this sort. Indeed, models today are still incredibly hobbled! For example:

• They don't have long-term memory.
• They can't use a computer (they still only have very limited tools).

• They still mostly don't think before they speak. When you ask ChatGPT to write an essay, that's like expecting a human to write an essay via their initial stream-of-consciousness.27

27. People are working on this though.

• They can (mostly) only engage in short back-and-forth dialogues, rather than going away for a day or a week, thinking about a problem, researching different approaches, consulting other humans, and then writing you a longer report or pull request.

• They're mostly not personalized to you or your application (just a generic chatbot with a short prompt, rather than having all the relevant background on your company and your work).

Figure 17: Decomposing progress: compute, algorithmic efficiencies, and unhobbling. (Rough illustration.)

The possibilities here are enormous, and we're rapidly picking low-hanging fruit here. This is critical: it's completely wrong to just imagine "GPT-6 ChatGPT." With continued unhobbling
  • 34. situational awareness 34 progress, the improvements will be step-changes compared to GPT-6 + RLHF. By 2027, rather than a chatbot, you’re going to have something that looks more like an agent, like a coworker. From chatbot to agent-coworker What could ambitious unhobbling over the coming years look like? The way I think about it, there are three key ingredients: 1. solving the “onboarding problem” GPT-4 has the raw smarts to do a decent chunk of many people’s jobs, but it’s sort of like a smart new hire that just showed up 5 minutes ago: it doesn’t have any relevant con- text, hasn’t read the company docs or Slack history or had conversations with members of the team, or spent any time understanding the company-internal codebase. A smart new hire isn’t that useful 5 minutes after arriving—but they are quite useful a month in! It seems like it should be possible, for example via very-long-context, to “onboard” models like we would a new human coworker. This alone would be a huge unlock. 2. the test-time compute overhang (reasoning/error correction/system ii for longer-horizon prob- lems) Right now, models can basically only do short tasks: you ask them a question, and they give you an answer. But that’s extremely limiting. Most useful cognitive work hu- mans do is longer horizon—it doesn’t just take 5 minutes, but hours, days, weeks, or months. A scientist that could only think about a difficult problem for 5 minutes couldn’t make any scientific breakthroughs. A software engineer that could only write skeleton code for a single function when asked wouldn’t be very useful— software engineers are given a larger task, and they then go make a plan, understand relevant parts of the codebase or
technical tools, write different modules and test them incrementally, debug errors, search over the space of possible solutions, and eventually submit a large pull request that's the culmination of weeks of work. And so on.

In essence, there is a large test-time compute overhang. Think of each GPT-4 token as a word of internal monologue when you think about a problem. Each GPT-4 token is quite smart, but it can currently only really effectively use on the order of ~hundreds of tokens for chains of thought coherently (effectively as though you could only spend a few minutes of internal monologue/thinking on a problem or project). What if it could use millions of tokens to think about and work on really hard problems or bigger projects?

Number of tokens    Equivalent to me working on something for...
100s                A few minutes            ChatGPT (we are here)
1,000s              Half an hour             +1 OOMs test-time compute
10,000s             Half a workday           +2 OOMs
100,000s            A workweek               +3 OOMs
Millions            Multiple months          +4 OOMs

Table 2: Assuming a human thinking at ~100 tokens/minute and working 40 hours/week, translating "how long a model thinks" in tokens to human-time on a given problem/project.

Even if the "per-token" intelligence were the same, it'd be the difference between a smart person spending a few minutes vs. a few months on a problem. I don't know about you, but there's much, much, much more I am capable of in a few months vs. a few minutes. If we could unlock "being able to think and work on something for months-equivalent, rather than a few-minutes-equivalent" for models, it would unlock an insane jump in capability. There's a huge overhang here, many OOMs worth.

Right now, models can't do this yet. Even with recent advances in long-context, this longer context mostly only works for the consumption of tokens, not the production of tokens—after a while, the model goes off the rails or gets stuck. It's not yet able to go away for a while to work on a
problem or project on its own.28

28. Which makes sense—why would it have learned the skills for longer-horizon reasoning and error correction? There's very little data on the internet in the form of "my complete internal monologue, reasoning, all the relevant steps over the course of a month as I work on a project." Unlocking this capability will require a new kind of training, for it to learn these extra skills. Or as Gwern put it (private correspondence): "'Brain the size of a galaxy, and what do they ask me to do? Predict the misspelled answers on benchmarks!' Marvin the depressed neural network moaned."

But unlocking test-time compute might merely be a matter of relatively small "unhobbling" algorithmic wins. Perhaps a small amount of RL helps a model learn to error correct ("hm, that doesn't look right, let me double check that"), make plans, search over possible solutions, and so on. In a sense, the model already has most of the raw capabilities, it just needs to learn a few extra skills on top to put it all together.

In essence, we just need to teach the model a sort of System II outer loop29 that lets it reason through difficult, long-horizon projects.

29. System I vs. System II is a useful way of thinking about current capabilities of LLMs—including their limitations and dumb mistakes—and what might be possible with RL and unhobbling. Think of it this way: when you are driving, most of the time you are on autopilot (system I, what models mostly do right now). But when you encounter a complex construction zone or novel intersection, you might ask your passenger-seat companion to pause your conversation for a moment while you figure out—actually think about—what's going on and what to do. If you were forced to go about life with only system I (closer to models today), you'd have a lot of trouble. Creating the ability for system II reasoning loops is a central unlock.

If we succeed at teaching this outer loop, instead of a short chatbot answer of a couple paragraphs, imagine a stream of millions of words (coming in more quickly than you can read them) as the model thinks through problems, uses tools, tries different approaches, does research, revises its work, coordinates with others, and completes big projects on its own.

Trading off test-time and train-time compute in other ML domains. In other domains, like AI systems for board games, it's been demonstrated that you can use more test-time compute (also called inference-time compute) to substitute for training compute.

Figure 18: Jones (2021): A smaller model can do as well as a much larger model at the game of Hex if you give it more test-time compute ("more time to think").

In this domain, they find that one can spend ~1.2 OOMs more compute at test-time to get performance equivalent to a model with ~1 OOM more training compute.
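As a minimal sketch of the arithmetic here, the snippet below translates chain-of-thought token budgets into human working time using the ~100 tokens/minute and 40-hour-week assumptions behind Table 2, and applies the Jones (2021) exchange rate quoted above (~1.2 OOMs of test-time compute per ~1 OOM of training compute). The numbers are the rough estimates from the text, and treating the exchange rate as a constant is a simplification.

```python
# Back-of-envelope arithmetic for the test-time compute overhang.
# Assumptions (from the text): a human "thinks" at ~100 tokens/minute and
# works a ~40-hour week; Jones (2021) finds ~1.2 OOMs of extra test-time
# compute buys ~1 OOM of training compute (measured on Hex, not on LLMs).

TOKENS_PER_MINUTE = 100
WORK_HOURS_PER_WEEK = 40

def tokens_to_human_time(n_tokens: int) -> str:
    """Translate a chain-of-thought token budget into human working time."""
    minutes = n_tokens / TOKENS_PER_MINUTE
    hours = minutes / 60
    weeks = hours / WORK_HOURS_PER_WEEK
    if hours < 1:
        return f"~{minutes:.0f} minutes"
    if weeks < 1:
        return f"~{hours:.1f} working hours"
    return f"~{weeks:.1f} workweeks (~{weeks / 4:.1f} months)"

def testtime_to_traintime_ooms(test_time_ooms: float, exchange_rate: float = 1.2) -> float:
    """Jones-style exchange: ~1.2 OOMs at test time is roughly 1 OOM of training compute."""
    return test_time_ooms / exchange_rate

# Table 2's rows ("100s", "1,000s", ...) denote a few hundred, a few thousand,
# etc., so these round budgets land slightly below the table's entries.
for budget in [100, 1_000, 10_000, 100_000, 1_000_000]:
    print(f"{budget:>9,} tokens -> {tokens_to_human_time(budget)}")

print(f"+4 OOMs of test-time compute ~= +{testtime_to_traintime_ooms(4):.1f} OOMs of training compute")
```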
  • 37. situational awareness 37 If a similar relationship held in our case, if we could unlock +4 OOMs of test-time compute, that might be equivalent to +3 OOMs of pretraining compute, i.e. very roughly some- thing like the jump between GPT-3 and GPT-4. (Solving this “unhobbling” would be equivalent to a huge OOM scaleup.) 3. using a computer This is perhaps the most straightforward of the three. Chat- GPT right now is basically like a human that sits in an isolated box that you can text. While early unhobbling im- provements teach models to use individual isolated tools, I expect that with multimodal models we will soon be able to do this in one fell swoop: we will simply enable models to use a computer like a human would. That means joining your Zoom calls, researching things on- line, messaging and emailing people, reading shared docs, using your apps and dev tooling, and so on. (Of course, for models to make the most use of this in longer-horizon loops, this will go hand-in-hand with unlocking test-time compute.) By the end of this, I expect us to get something that looks a lot like a drop-in remote worker. An agent that joins your company, is onboarded like a new human hire, mes- sages you and colleagues on Slack and uses your softwares, makes pull requests, and that, given big projects, can do the model-equivalent of a human going away for weeks to independently complete the project. You’ll probably need somewhat better base models than GPT-4 to unlock this, but possibly not even that much better—a lot of juice is in fixing the clear and basic ways models are still hobbled. (A very early peek at what this might look like is Devin, an early prototype of unlocking the “agency-overhang”/”test- time compute overhang” on models on the path to creating a fully automated software engineer. I don’t know how well Devin works in practice, and this demo is still very
limited compared to what proper chatbot → agent unhobbling would yield, but it's a useful teaser of the sort of thing coming soon.)

By the way, the centrality of unhobbling might lead to a somewhat interesting "sonic boom" effect in terms of commercial applications. Intermediate models between now and the drop-in remote worker will require tons of schlep to change workflows and build infrastructure to integrate and derive economic value from. The drop-in remote worker will be dramatically easier to integrate—just, well, drop them in to automate all the jobs that could be done remotely. It seems plausible that the schlep will take longer than the unhobbling, that is, by the time the drop-in remote worker is able to automate a large number of jobs, intermediate models won't yet have been fully harnessed and integrated—so the jump in economic value generated could be somewhat discontinuous.

The next four years

Putting the numbers together, we should (roughly) expect another GPT-2-to-GPT-4-sized jump in the 4 years following GPT-4, by the end of 2027.

• GPT-2 to GPT-4 was roughly a 4.5–6 OOM base effective compute scaleup (physical compute and algorithmic efficiencies), plus major "unhobbling" gains (from base model to chatbot).

• In the subsequent 4 years, we should expect 3–6 OOMs of base effective compute scaleup (physical compute and algorithmic efficiencies)—with perhaps a best guess of ~5 OOMs—plus step-changes in utility and applications unlocked by "unhobbling" (from chatbot to agent/drop-in remote worker).

To put this in perspective, suppose GPT-4 training took 3 months. In 2027, a leading AI lab will be able to train a GPT-4-level model in a minute.30

30. On the best guess assumptions on physical compute and algorithmic efficiency scaleups described above, and simplifying parallelism considerations (in reality, it might look more like "1440 (60*24) GPT-4-level models in a day" or similar).
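As a rough sanity check on that claim, here is a minimal sketch using only the text's assumptions (a 3-month GPT-4 training run and a ~5 OOM best-guess effective-compute scaleup), with parallelism ignored as footnote 30 notes:

```python
# If effective compute scales up by ~5 OOMs (the best-guess estimate above),
# a job that took 3 months of GPT-4-era effective compute shrinks by 10^5.
# Parallelism is ignored here, per footnote 30: in practice it may look more
# like "1440 GPT-4-level models per day" than one model per minute.

GPT4_TRAINING_MONTHS = 3
EFFECTIVE_COMPUTE_OOMS = 5  # best-guess scaleup in the 4 years after GPT-4

minutes_original = GPT4_TRAINING_MONTHS * 30 * 24 * 60            # ~129,600 minutes
minutes_in_2027 = minutes_original / 10**EFFECTIVE_COMPUTE_OOMS   # ~1.3 minutes

models_per_day_alternative = 24 * 60  # the footnote's "1440 (60*24) models in a day" framing

print(f"GPT-4-level training time in 2027: ~{minutes_in_2027:.1f} minutes")
print(f"Or, equivalently: ~{models_per_day_alternative} GPT-4-level models per day")
```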
Figure 19: Summary of the estimates on drivers of progress in the four years preceding GPT-4, and what we should expect in the four years following GPT-4.
The OOM effective compute scaleup will be dramatic. Where will that take us?

Figure 20: Summary of counting the OOMs.

GPT-2 to GPT-4 took us from ~preschooler to ~smart high-schooler; from barely being able to output a few cohesive sentences to acing high-school exams and being a useful coding assistant. That was an insane jump. If this is the intelligence gap we'll cover once more, where will that take us?31 We should not be surprised if that takes us very, very far. Likely, it will take us to models that can outperform PhDs and the best experts in a field.

31. Of course, any benchmark we have today will be saturated. But that's not saying much; it's mostly a reflection on the difficulty of making hard-enough benchmarks.

(One neat way to think about this is that the current trend of AI progress is proceeding at roughly 3x the pace of child development. Your 3x-speed-child just graduated high school; it'll be taking your job before you know it!)

Again, critically, don't just imagine an incredibly smart ChatGPT: unhobbling gains should mean that this looks more like a drop-in remote worker, an incredibly smart agent that can reason and plan and error-correct and knows everything about you and your company and can work on a problem independently
for weeks.

We are on course for AGI by 2027. These AI systems will be able to automate basically all cognitive jobs (think: all jobs that could be done remotely).

To be clear—the error bars are large. Progress could stall as we run out of data, if the algorithmic breakthroughs necessary to crash through the data wall prove harder than expected. Maybe unhobbling doesn't go as far, and we are stuck with merely expert chatbots, rather than expert coworkers. Perhaps the decade-long trendlines break, or scaling deep learning hits a wall for real this time. (Or an algorithmic breakthrough, even simple unhobbling that unleashes the test-time compute overhang, could be a paradigm-shift, accelerating things further and leading to AGI even earlier.)

In any case, we are racing through the OOMs, and it requires no esoteric beliefs, merely trend extrapolation of straight lines, to take the possibility of AGI—true AGI—by 2027 extremely seriously.

It seems like many are in the game of downward-defining AGI these days, as just a really good chatbot or whatever. What I mean is an AI system that could fully automate my or my friends' job, that could fully do the work of an AI researcher or engineer. Perhaps some areas, like robotics, might take longer to figure out by default. And the societal rollout, e.g. in medical or legal professions, could easily be slowed by societal choices or regulation. But once models can automate AI research itself, that's enough—enough to kick off intense feedback loops—and we could very quickly make further progress, the automated AI engineers themselves solving all the remaining bottlenecks to fully automating everything. In particular, millions of automated researchers could very plausibly compress a decade of further algorithmic progress into a year or less. AGI will merely be a small taste of the superintelligence soon to follow. (More on that in the next chapter.)

In any case, do not expect the vertiginous pace of progress to abate. The trendlines look innocent, but their implications are
  • 42. situational awareness 42 intense. As with every generation before them, every new gen- eration of models will dumbfound most onlookers; they’ll be incredulous when, very soon, models solve incredibly difficult science problems that would take PhDs days, when they’re whizzing around your computer doing your job, when they’re writing codebases with millions of lines of code from scratch, when every year or two the economic value generated by these models 10xs. Forget scifi, count the OOMs: it’s what we should expect. AGI is no longer a distant fantasy. Scaling up simple deep learning techniques has just worked, the models just want to learn, and we’re about to do another 100,000x+ by the end of 2027. It won’t be long before they’re smarter than us. Figure 21: GPT-4 is just the beginning— where will we be four years later? Do not make the mistake of underestimat- ing the rapid pace of deep learning progress (as illustrated by progress in GANs).
Addendum. Racing through the OOMs: It's this decade or bust

I used to be more skeptical of short timelines to AGI. One reason is that it seemed unreasonable to privilege this decade, concentrating so much AGI-probability-mass on it (it seemed like a classic fallacy to think "oh we're so special"). I thought we should be uncertain about what it takes to get AGI, which should lead to a much more "smeared-out" probability distribution over when we might get AGI.

However, I've changed my mind: critically, our uncertainty over what it takes to get AGI should be over OOMs (of effective compute), rather than over years.

We're racing through the OOMs this decade. Even at its bygone heyday, Moore's law was only 1–1.5 OOMs/decade. I estimate that we will do ~5 OOMs in 4 years, and over ~10 this decade overall.

Figure 22: Rough projections on effective compute scaleup. We've been racing through the OOMs this decade; after the early 2030s, we will face a slow slog.

In essence, we're in the middle of a huge scaleup reaping one-time gains this decade, and progress through the OOMs will be multiples slower thereafter. If this scaleup doesn't get us to AGI in the next 5-10 years, it might be a
  • 44. situational awareness 44 long way out. • Spending scaleup: Spending a million dollars on a model used to be outrageous; by the end of the decade, we will likely have $100B or $1T clusters. Going much higher than that will be hard; that’s already basically the feasi- ble limit (both in terms of what big business can afford, and even just as a fraction of GDP). Thereafter all we have is glacial 2%/year trend real GDP growth to in- crease this. • Hardware gains: AI hardware has been improving much more quickly than Moore’s law. That’s because we’ve been specializing chips for AI workloads. For exam- ple, we’ve gone from CPUs to GPUs; adapted chips for Transformers; and we’ve gone down to much lower pre- cision number formats, from fp64/fp32 for traditional supercomputing to fp8 on H100s. These are large gains, but by the end of the decade we’ll likely have totally- specialized AI-specific chips, without much further beyond-Moore’s law gains possible. • Algorithmic progress: In the coming decade, AI labs will invest tens of billions in algorithmic R&D, and all the smartest people in the world will be working on this; from tiny efficiencies to new paradigms, we’ll be picking lots of the low-hanging fruit. We probably won’t reach any sort of hard limit (though “unhobblings” are likely finite), but at the very least the pace of improvements should slow down, as the rapid growth (in $ and human capital investments) necessarily slows down (e.g., most of the smart STEM talent will already be working on AI). (That said, this is the most uncertain to predict, and the source of most of the uncertainty on the OOMs in the 2030s on the plot above.) Put together, this means we are racing through many more OOMs in the next decade than we might in multi- ple decades thereafter. Maybe it’s enough—and we get AGI soon—or we might be in for a long, slow slog. You and I can reasonably disagree on the median time to AGI,
depending on how hard we think achieving AGI will be—but given how we're racing through the OOMs right now, certainly your modal AGI year should be sometime later this decade or so.

Figure 23: Matthew Barnett has a nice related visualization of this, considering just compute and biological bounds.
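To make the pacing claim concrete, here is a toy comparison using only the numbers above (~5 OOMs in 4 years now, versus ~1-1.5 OOMs per decade at a Moore's-law pace); the rates are the text's rough estimates, and nothing depends on the precise values:

```python
# How long does a ~10 OOM effective-compute scaleup take at different paces?
def years_to_cover(ooms: float, ooms_per_year: float) -> float:
    return ooms / ooms_per_year

CURRENT_PACE = 5 / 4          # ~5 OOMs in 4 years (this decade's one-time scaleup)
MOORES_LAW_PACE = 1.25 / 10   # ~1-1.5 OOMs per decade, even at its bygone heyday

print(f"~10 OOMs at the current pace:    ~{years_to_cover(10, CURRENT_PACE):.0f} years")
print(f"~10 OOMs at a Moore's-law pace:  ~{years_to_cover(10, MOORES_LAW_PACE):.0f} years")
```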
  • 46. II. From AGI to Superintelligence: the Intelligence Ex- plosion AI progress won’t stop at human-level. Hundreds of millions of AGIs could automate AI research, compressing a decade of algorithmic progress (5+ OOMs) into 1 year. We would rapidly go from human-level to vastly superhuman AI sys- tems. The power—and the peril—of superintelligence would be dramatic. Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion,’ and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make. i. j. good (1965) The Bomb and The Super In the common imagination, the Cold War’s terrors principally trace back to Los Alamos, with the invention of the atomic bomb. But The Bomb, alone, is perhaps overrated. Going from The Bomb to The Super—hydrogen bombs—was arguably just as important. In the Tokyo air raids, hundreds of bombers dropped thousands of tons of conventional bombs on the city. Later that year, Little Boy, dropped on Hiroshima, unleashed similar destructive power in a single device. But just 7 years later, Teller’s hydrogen bomb multiplied yields a thousand-fold once again—a single bomb with more explosive power than all of the bombs dropped in the entirety of WWII combined.
The Bomb was a more efficient bombing campaign. The Super was a country-annihilating device.32

32. And much of the Cold War's perversities (cf Daniel Ellsberg's book) stemmed from merely replacing A-bombs with H-bombs, without adjusting nuclear policy and war plans to the massive capability increase.

So it will be with AGI and Superintelligence.

AI progress won't stop at human-level. After initially learning from the best human games, AlphaGo started playing against itself—and it quickly became superhuman, playing extremely creative and complex moves that a human would never have come up with.

We discussed the path to AGI in the previous piece. Once we get AGI, we'll turn the crank one more time—or two or three more times—and AI systems will become superhuman—vastly superhuman. They will become qualitatively smarter than you or I, much smarter, perhaps similar to how you or I are qualitatively smarter than an elementary schooler.

The jump to superintelligence would be wild enough at the current rapid but continuous rate of AI progress (if we could make the jump to AGI in 4 years from GPT-4, what might another 4 or 8 years after that bring?). But it could be much faster than that, if AGI automates AI research itself.

Once we get AGI, we won't just have one AGI. I'll walk through the numbers later, but: given inference GPU fleets by then, we'll likely be able to run many millions of them (perhaps 100 million human-equivalents, and soon after at 10x+ human speed). Even if they can't yet walk around the office or make coffee, they will be able to do ML research on a computer. Rather than a few hundred researchers and engineers at a leading AI lab, we'd have more than 100,000x that—furiously working on algorithmic breakthroughs, day and night. Yes, recursive self-improvement, but no sci-fi required; they would need only to accelerate the existing trendlines of algorithmic progress (currently at ~0.5 OOMs/year).

Automated AI research could probably compress a human-decade of algorithmic progress into less than a year (and that seems conservative). That'd be 5+ OOMs, another GPT-2-to-
GPT-4-sized jump, on top of AGI—a qualitative jump like that from a preschooler to a smart high schooler, on top of AI systems already as smart as expert AI researchers/engineers.

Figure 24: Automated AI research could accelerate algorithmic progress, leading to 5+ OOMs of effective compute gains in a year. The AI systems we'd have by the end of an intelligence explosion would be vastly smarter than humans.

There are several plausible bottlenecks—including limited compute for experiments, complementarities with humans, and algorithmic progress becoming harder—which I'll address, but none seem sufficient to definitively slow things down.

Before we know it, we would have superintelligence on our hands—AI systems vastly smarter than humans, capable of novel, creative, complicated behavior we couldn't even begin to understand—perhaps even a small civilization of billions of them. Their power would be vast, too. Applying
superintelligence to R&D in other fields, explosive progress would broaden from just ML research; soon they'd solve robotics, make dramatic leaps across other fields of science and technology within years, and an industrial explosion would follow. Superintelligence would likely provide a decisive military advantage, and unfold untold powers of destruction. We will be faced with one of the most intense and volatile moments of human history.

Automating AI research

We don't need to automate everything—just AI research. A common objection to transformative impacts of AGI is that it will be hard for AI to do everything. Look at robotics, for instance, doubters say; that will be a gnarly problem, even if AI is cognitively at the levels of PhDs. Or take automating biology R&D, which might require lots of physical lab-work and human experiments.

But we don't need robotics—we don't need many things—for AI to automate AI research. The jobs of AI researchers and engineers at leading labs can be done fully virtually and don't run into real-world bottlenecks in the same way (though it will still be limited by compute, which I'll address later). And the job of an AI researcher is fairly straightforward, in the grand scheme of things: read ML literature and come up with new questions or ideas, implement experiments to test those ideas, interpret the results, and repeat. This all seems squarely in the domain where simple extrapolations of current AI capabilities could easily take us to or beyond the levels of the best humans by the end of 2027.33

33. The job of an AI researcher is also a job that AI researchers at AI labs just, well, know really well—so it'll be particularly intuitive to them to optimize models to be good at that job. And there will be huge incentives to do so to help them accelerate their research and their labs' competitive edge.

It's worth emphasizing just how straightforward and hacky some of the biggest machine learning breakthroughs of the last decade have been: "oh, just add some normalization" (LayerNorm/BatchNorm) or "do f(x)+x instead of f(x)" (residual connections) or "fix an implementation bug" (Kaplan → Chinchilla scaling laws). AI research can be automated. And automating AI research is all it takes to kick off extraordinary feedback loops.34

34. This suggests an important point in terms of the sequencing of risks from AI, by the way. A common AI threat model people point to is AI systems developing novel bioweapons, and that posing catastrophic risk. But if AI research is more straightforward to automate than biology R&D, we might get an intelligence explosion before we get extreme AI biothreats. This matters, for example, with regard to whether we should expect "bio warning shots" in time before things get crazy on AI.
We'd be able to run millions of copies (and soon at 10x+ human speed) of the automated AI researchers.

Even by 2027, we should expect GPU fleets in the 10s of millions. Training clusters alone should be approaching ~3 OOMs larger, already putting us at 10 million+ A100-equivalents. Inference fleets should be much larger still. (More on all this in IIIa. Racing to the Trillion Dollar Cluster.)

That would let us run many millions of copies of our automated AI researchers, perhaps 100 million human-researcher-equivalents, running day and night. There are some assumptions that flow into the exact numbers, including that humans "think" at 100 tokens/minute (just a rough order of magnitude estimate, e.g. consider your internal monologue) and extrapolating historical trends and Chinchilla scaling laws on per-token inference costs for frontier models remaining in the same ballpark.35

35. As noted earlier, the GPT-4 API costs less today than GPT-3 when it was released—this suggests that the trend of inference efficiency wins is fast enough to keep inference costs roughly constant even as models get much more powerful. Similarly, there have been huge inference cost wins in just the year since GPT-4 was released; for example, the current version of Gemini 1.5 Pro outperforms the original GPT-4, while being roughly 10x cheaper. We can also ground this somewhat more by considering Chinchilla scaling laws. On Chinchilla scaling laws, model size—and thus inference costs—grow with the square root of training cost, i.e. half the OOMs of the OOM scaleup of effective compute. However, in the previous piece, I suggested that algorithmic efficiency was advancing at roughly the same pace as compute scaleup, i.e. it made up roughly half of the OOMs of effective compute scaleup. If these algorithmic wins also translate into inference efficiency, that means that the algorithmic efficiencies would compensate for the naive increase in inference cost. In practice, training compute efficiencies often, but not always, translate into inference efficiency wins. However, there are also separately many inference efficiency wins that are not training efficiency wins. So, at least in terms of the rough ballpark, assuming the $/token of frontier models stays roughly similar doesn't seem crazy. (Of course, they'll use more tokens, i.e. more test-time compute. But that's already part of the calculation here, by pricing human-equivalents as 100 tokens/minute.)

We'd also want to reserve some of the GPUs for running experiments and training new models. Full calculation in a footnote.36

36. GPT4T is about $0.03/1K tokens. We supposed we would have 10s of millions of A100-equivalents, which cost roughly $1 per GPU-hour. If we used the API costs to translate GPUs into tokens generated, that implies 10s of millions of GPUs * $1/GPU-hour * 33K tokens/$ = ~one trillion tokens/hour. Suppose a human does 100 tokens/min of thinking; that means a human-equivalent is 6,000 tokens/hour. One trillion tokens/hour divided by 6,000 tokens/human-hour = ~200 million human-equivalents—i.e. as if running 200 million human researchers, day and night. (And even if we reserve half the GPUs for experiment compute, we get 100 million human-researcher-equivalents.)
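The calculation in footnote 36, written out as a minimal sketch; the GPU count, the ~$1/GPU-hour A100-equivalent cost, the ~$0.03/1K-token GPT-4-Turbo-era price, and the 100 tokens/minute figure are all the rough assumptions stated above, so the outputs are illustrative only:

```python
# Reproducing the back-of-envelope calculation in footnote 36.
GPUS = 30_000_000                  # "10s of millions" of A100-equivalents by 2027 (assumed here: 30M)
DOLLARS_PER_GPU_HOUR = 1.0         # rough A100-equivalent cost
TOKENS_PER_DOLLAR = 1_000 / 0.03   # ~33K tokens/$ at GPT-4-Turbo-like pricing

tokens_per_hour = GPUS * DOLLARS_PER_GPU_HOUR * TOKENS_PER_DOLLAR  # ~1e12 tokens/hour

HUMAN_TOKENS_PER_HOUR = 100 * 60   # a human "thinking" at 100 tokens/minute

# ~1.7e8 with these inputs; the footnote rounds this to ~200 million.
human_equivalents = tokens_per_hour / HUMAN_TOKENS_PER_HOUR

print(f"~{tokens_per_hour:.1e} tokens/hour")
print(f"~{human_equivalents / 1e6:.0f} million human-researcher-equivalents, day and night")
print(f"~{human_equivalents / 2e6:.0f} million if half the GPUs are reserved for experiments")
```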
Another way of thinking about it is that given inference fleets in 2027, we should be able to generate an entire internet's worth of tokens, every single day.37 In any case, the exact numbers don't matter that much, beyond a simple plausibility demonstration.

37. The previous footnote estimated ~1T tokens/hour, i.e. 24T tokens a day. In the previous piece, I noted that a public deduplicated CommonCrawl had around 30T tokens.
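And footnote 37's framing, expressed per day (the ~30T-token figure for a public deduplicated CommonCrawl is the one cited in the previous piece; the hourly rate is the rough estimate above):

```python
# ~1T tokens/hour of inference capacity, expressed per day vs. the public internet.
TOKENS_PER_HOUR = 1e12        # from the estimate above
COMMON_CRAWL_TOKENS = 30e12   # ~30T tokens in a public deduplicated CommonCrawl

tokens_per_day = TOKENS_PER_HOUR * 24
print(f"~{tokens_per_day / 1e12:.0f}T tokens per day, "
      f"~{tokens_per_day / COMMON_CRAWL_TOKENS:.1f}x a deduplicated CommonCrawl")
```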
Moreover, our automated AI researchers may soon be able to run at much faster than human-speed:

• By taking some inference penalties, we can trade off running fewer copies in exchange for running them at faster serial speed. (For example, we could go from ~5x human speed to ~100x human speed by "only" running 1 million copies of the automated researchers.38)

38. Jacob Steinhardt estimates that k^3 parallel copies of a model can be replaced with a single model that is k^2 faster, given some math on inference tradeoffs with a tiling scheme (that theoretically works even for k of 100 or more). Suppose initial speeds were already ~5x human speed (based on, say, GPT-4 speed on release). Then, by taking this inference penalty (with k = ~5), we'd be able to run ~1 million automated AI researchers at ~100x human speed. (A short sketch of this arithmetic follows at the end of this passage.)

• More importantly, the first algorithmic innovation the automated AI researchers work on is getting a 10x or 100x speedup. Gemini 1.5 Flash is ~10x faster than the originally-released GPT-4,39 merely a year later, while providing similar performance to the originally-released GPT-4 on reasoning benchmarks. If that's the algorithmic speedup a few hundred human researchers can find in a year, the automated AI researchers will be able to find similar wins very quickly.

39. This source benchmarks throughput of Flash at ~6x GPT-4 Turbo, and GPT-4 Turbo was faster than original GPT-4. Latency is probably also roughly 10x faster.

That is: expect 100 million automated researchers each working at 100x human speed not long after we begin to be able to automate AI research. They'll each be able to do a year's worth of work in a few days.

The increase in research effort—compared to a few hundred puny human researchers at a leading AI lab today, working at a puny 1x human speed—will be extraordinary.

This could easily dramatically accelerate existing trends of algorithmic progress, compressing a decade of advances into a year. We need not postulate anything totally novel for automated AI research to intensely speed up AI progress. Walking through the numbers in the previous piece, we saw that algorithmic progress has been a central driver of deep learning progress in the last decade; we noted a trendline of ~0.5 OOMs/year on algorithmic efficiencies alone, with additional large algorithmic gains from unhobbling on top. (I think the import of algorithmic progress has been underrated by many, and properly appreciating it is important for appreciating the possibility of an intelligence explosion.)
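A minimal sketch of the copies-versus-serial-speed trade-off from footnote 38; the k^3-copies-for-a-k^2-speedup rule is Steinhardt's rough estimate, and the ~5x baseline serial speed and ~100-million-copy fleet are the text's assumptions, so the outputs are illustrative only:

```python
# Trade parallel copies for serial speed: k^3 copies ~ one copy running k^2 faster
# (Jacob Steinhardt's inference-tiling estimate, per footnote 38).
BASELINE_COPIES = 100_000_000   # ~100 million human-researcher-equivalents (from above)
BASELINE_SPEED = 5.0            # ~5x human speed, roughly GPT-4-at-release serial speed

def trade_copies_for_speed(copies: float, speed: float, k: float) -> tuple[float, float]:
    """Give up a factor of k^3 in copies to gain a factor of k^2 in serial speed."""
    return copies / k**3, speed * k**2

copies, speed = trade_copies_for_speed(BASELINE_COPIES, BASELINE_SPEED, k=5)
print(f"~{copies / 1e6:.1f} million copies at ~{speed:.0f}x human speed")
# -> ~0.8 million copies at ~125x human speed (the text rounds to ~1 million at ~100x)
```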
Could our millions of automated AI researchers (soon working at 10x or 100x human speed) compress the algorithmic progress human researchers would have found in a decade into a year instead? That would be 5+ OOMs in a year.

Don't just imagine 100 million junior software engineer interns here (we'll get those earlier, in the next couple years!). Real automated AI researchers will be very smart—and in addition to their raw quantitative advantage, automated AI researchers will have other enormous advantages over human researchers:

• They'll be able to read every single ML paper ever written, have been able to deeply think about every single previous experiment ever run at the lab, learn in parallel from each of their copies, and rapidly accumulate the equivalent of millennia of experience. They'll be able to develop far deeper intuitions about ML than any human.

• They'll be easily able to write millions of lines of complex code, keep the entire codebase in context, and spend human-decades (or more) checking and rechecking every line of code for bugs and optimizations. They'll be superbly competent at all parts of the job.

• You won't have to individually train up each automated AI researcher (indeed, training and onboarding 100 million new human hires would be difficult). Instead, you can just teach and onboard one of them—and then make replicas. (And you won't have to worry about politicking, cultural acclimation, and so on, and they'll work with peak energy and focus day and night.)

• Vast numbers of automated AI researchers will be able to share context (perhaps even accessing each others' latent space and so on), enabling much more efficient collaboration and coordination compared to human researchers.

• And of course, however smart our initial automated AI researchers would be, we'd soon be able to make further OOM-jumps, producing even smarter models, even more capable at automated AI research.

Imagine an automated Alec Radford—imagine 100 million
automated Alec Radfords.40 I think just about every researcher at OpenAI would agree that if they had 10 Alec Radfords, let alone 100 or 1,000 or 1 million running at 10x or 100x human speed, they could very quickly solve very many of their problems.

40. Alec Radford is an incredibly gifted and prolific researcher/engineer at OpenAI, behind many of the most important advances, though he flies under the radar some.

Even with various other bottlenecks (more in a moment), compressing a decade of algorithmic progress into a year as a result seems very plausible. (A 10x acceleration from a million times more research effort, which seems conservative if anything.) That would be 5+ OOMs right there.

5 OOMs of algorithmic wins would be a similar scaleup to what produced the GPT-2-to-GPT-4 jump, a capability jump from ~a preschooler to ~a smart high schooler. Imagine such a qualitative jump on top of AGI, on top of Alec Radford.

It's strikingly plausible we'd go from AGI to superintelligence very quickly, perhaps in 1 year.

Possible bottlenecks

While this basic story is surprisingly strong—and is supported by thorough economic modeling work—there are some real and plausible bottlenecks that will probably slow down an automated-AI-research intelligence explosion. I'll give a summary here, and then discuss these in more detail in the optional sections below for those interested:

• Limited compute: AI research doesn't just take good ideas, thinking, or math—but running experiments to get empirical signal on your ideas. A million times more research effort via automated research labor won't mean a million times faster progress, because compute will still be limited—and limited compute for experiments will be the bottleneck. Still, even if this won't be a 1,000,000x speedup, I find it hard to imagine that the automated AI researchers couldn't use the compute at least 10x more effectively: they'll be able to get incredible ML intuition (having internalized the whole ML literature and every previous experiment ever run!) and
centuries-equivalent of thinking-time to figure out exactly the right experiment to run, configure it optimally, and get maximum value of information; they'll be able to spend centuries-equivalent of engineer-time before running even tiny experiments to avoid bugs and get them right on the first try; they can make tradeoffs to economize on compute by focusing on the biggest wins; and they'll be able to try tons of smaller-scale experiments (and given effective compute scaleups by then, "smaller-scale" means being able to train 100,000 GPT-4-level models in a year to try architecture breakthroughs). Some human researchers and engineers are able to produce 10x the progress of others, even with the same amount of compute—and this should apply even more so to automated AI researchers. I do think this is the most important bottleneck, and I address it in more depth below.

• Complementarities/long tail: A classic lesson from economics (cf Baumol's growth disease) is that if you can automate, say, 70% of something, you get some gains but quickly the remaining 30% become your bottleneck. For anything that falls short of full automation—say, really good copilots—human AI researchers would remain a major bottleneck, making the overall increase in the rate of algorithmic progress relatively small. Moreover, there's likely some long tail of capabilities required for automating AI research—the last 10% of the job of an AI researcher might be particularly hard to automate. This could soften takeoff some, though my best guess is that this only delays things by a couple years. Perhaps 2026/27 models are the proto-automated-researcher; it takes another year or two for some final unhobbling, a somewhat better model, inference speedups, and working out kinks to get to full automation, and finally by 2028 we get the 10x acceleration (and superintelligence by the end of the decade).

• Inherent limits to algorithmic progress: Maybe another 5 OOMs of algorithmic efficiency will be fundamentally impossible? I doubt it. While there will definitely be upper limits,41 if we got 5 OOMs in the last decade, we should probably expect at least another decade's-worth of progress
to be possible. More directly, current architectures and training algorithms are still very rudimentary, and it seems that much more efficient schemes should be possible. Biological reference classes also support dramatically more efficient algorithms being plausible.

41. 25 OOMs of algorithmic progress on top of GPT-4, for example, are clearly impossible: that would imply it would be possible to train a GPT-4-level model with just a handful of FLOP.

• Ideas get harder to find, so the automated AI researchers will merely sustain, rather than accelerate, the current rate of progress: One objection is that although automated research would increase effective research effort a lot, ideas also get harder to find. That is, while it takes only a few hundred top researchers at a lab to sustain 0.5 OOMs/year today, as we exhaust the low-hanging fruit, it will take more and more effort to sustain that progress—and so the 100 million automated researchers will be merely what's necessary to sustain progress. I think this basic model is correct, but the empirics don't add up: the magnitude of the increase in research effort—a million-fold—is way, way larger than the historical trends of the growth in research effort that's been necessary to sustain progress. In econ modeling terms, it's a bizarre "knife-edge assumption" to assume that the increase in research effort from automation will be just enough to keep progress constant.

• Ideas get harder to find and there are diminishing returns, so the intelligence explosion will quickly fizzle: Related to the above objection, even if the automated AI researchers lead to an initial burst of progress, whether rapid progress can be sustained depends on the shape of the diminishing returns curve to algorithmic progress. Again, my best read of the empirical evidence is that the exponents shake out in favor of explosive/accelerating progress. In any case, the sheer size of the one-time boost—from 100s to 100s of millions of AI researchers—probably overcomes diminishing returns here for at least a good number of OOMs of algorithmic progress, even though it of course can't be indefinitely self-sustaining.

Overall, these factors may slow things down somewhat: the most extreme versions of intelligence explosion (say, overnight) seem implausible. And they may result in a somewhat longer
  • 56. situational awareness 56 runup (perhaps we need to wait an extra year or two from more sluggish, proto-automated researchers to the true auto- mated Alec Radfords, before things kick off in full force). But they certainly don’t rule out a very rapid intelligence explosion. A year—or at most just a few years, but perhaps even just a few months—in which we go from fully-automated AI researchers to vastly superhuman AI systems should be our mainline ex- pectation. If you’d rather skip the in-depth discussions on the various bottle- necks below, click here to skip to the next section. Limited compute for experiments (optional, in more depth) The production function for algorithmic progress includes two complementary factors of production: research effort and experiment compute. The millions of automated AI researchers won’t have any more compute to run their experiments on than human AI researchers; perhaps they’ll just be sitting around waiting for their jobs to finish. This is probably the most important bottleneck to the intelligence explosion. Ultimately this is a quantitative question—just how much of a bottleneck is it? On balance, I find it hard to believe that the 100 million Alec Radfords couldn’t increase the marginal product of experiment com- pute by at least 10x (and thus, would still accelerate the pace of progress by 10x): • There’s a lot you can do with smaller amounts of compute. The way most AI research works is that you test things out at small scale—and then extrapolate via scaling laws. (Many key historical breakthroughs required only a very small amount of compute, e.g. the original Transformer was trained on just 8 GPUs for a few days.) And note that with ~5 OOMs of baseline scaleup in the next four years, “small scale” will mean GPT-4 scale—the auto- mated AI researchers will be able to run 100,000 GPT-4- level experiments on their training cluster in a year, and tens of millions of GPT-3-level experiments. (That’s a lot
  • 57. situational awareness 57 of potential-breakthrough new architectures they’ll be able to test!) – A lot of the compute goes into larger-scale validation of the final pretraining run—making sure you are get- ting a high-enough degree of confidence on marginal efficiency wins for your annual headline product—but if you’re racing through the OOMs in the intelligence explosion, you could economize and just focus on the really big wins. – As discussed in the previous piece, there are often enormous gains to be had from relatively low-compute “unhobbling” of models. These don’t require big pre- training runs. It’s highly plausible that that intelli- gence explosion starts off automated AI research e.g. discovering a way to do RL on top that gives us a cou- ple OOMs via unhobbling wins (and then we’re off to the races). – As the automated AI researchers find efficiencies, that’ll let them run more experiments. Recall the near-1000x cheaper inference in two years for equivalent-MATH performance, and the 10x general inference gains in the last year, discussed in the previous piece, from mere-human algorithmic progress. The first thing the automated AI researchers will do is quickly find similar gains, and in turn, that’ll let them run 100x more experiments on e.g. new RL approaches. Or they’ll be able to quickly make smaller models with similar performance in relevant domains (cf previous discussion of Gemini Flash, near-100x cheaper than GPT-4), which in turn will let them run many more experiments with these smaller models (again, imag- ine using these to try different RL schemes). There are probably other overhangs too, e.g. the automated AI researchers might be able to quickly develop much better distributed training schemes to utilize all the inference GPUs (probably at least 10x more compute right there). More generally, every OOM of training efficiency gains they find will give them an OOM
more of effective compute to run experiments on.

• The automated AI researchers could be way more efficient. It's hard to overstate how many fewer experiments you would have to run if you just got it right on the first try—no gnarly bugs, being more selective about exactly what you are running, and so on. Imagine 1000 automated AI researchers spending a month-equivalent checking your code and getting the exact experiment right before you press go. I've asked some AI lab colleagues about this and they agreed: you should pretty easily be able to save 3x-10x of compute on most projects merely if you could avoid frivolous bugs, get things right on the first try, and only run high value-of-information experiments.

• The automated AI researchers could have way better intuitions.

– Recently, I was speaking to an intern at a frontier lab; they said that their dominant experience over the past few months was suggesting many experiments they wanted to run, and their supervisor (a senior researcher) saying they could already predict the result beforehand so there was no need. The senior researcher's years of random experiments messing around with models had honed their intuitions about what ideas would work—or not. Similarly, it seems like our AI systems could easily get superhuman intuitions about ML experiments—they will have read the entire machine learning literature, be able to learn from every other experiment result and deeply think about it, they could easily be trained to predict the outcome of millions of ML experiments, and so on. And maybe one of the first things they do is build up a strong basic science of "predicting if this large scale experiment will be successful just after seeing the first 1% of training, or just after seeing the smaller scale version of this experiment", and so on.

– Moreover, beyond really good intuitions about re-