Leopold Aschenbrenner
SITUATIONAL AWARENESS
The Decade Ahead
JUNE 2024
Dedicated to Ilya Sutskever.
While I used to work at OpenAI, all of this is based on publicly-
available information, my own ideas, general field-knowledge, or
SF-gossip.
Thank you to Collin Burns, Avital Balwit, Carl Shulman, Jan Leike,
Ilya Sutskever, Holden Karnofsky, Sholto Douglas, James Bradbury,
Dwarkesh Patel, and many others for formative discussions. Thank
you to many friends for feedback on earlier drafts. Thank you to
Joe Ronan for help with graphics, and Nick Whitaker for publishing
help.
situational-awareness.ai
leopold@situational-awareness.ai
Updated June 6, 2024
San Francisco, California
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion
compute clusters to $100 billion clusters to trillion-dollar clusters. Ev-
ery six months another zero is added to the boardroom plans. Behind
the scenes, there’s a fierce scramble to secure every power contract
still available for the rest of the decade, every voltage transformer
that can possibly be procured. American big business is gearing up
to pour trillions of dollars into a long-unseen mobilization of Amer-
ican industrial might. By the end of the decade, American electricity
production will have grown tens of percent; from the shale fields of
Pennsylvania to the solar farms of Nevada, hundreds of millions of
GPUs will hum.
The AGI race has begun. We are building machines that can think
and reason. By 2025/26, these machines will outpace college grad-
uates. By the end of the decade, they will be smarter than you or I;
we will have superintelligence, in the true sense of the word. Along
the way, national security forces not seen in half a century will be un-
leashed, and before long, The Project will be on. If we’re lucky, we’ll
be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer
of what is about to hit them. Nvidia analysts still think 2024 might
be close to the peak. Mainstream pundits are stuck on the willful
blindness of “it’s just predicting the next word”. They see only hype
and business-as-usual; at most they entertain another internet-scale
technological change.
Before long, the world will wake up. But right now, there are perhaps
a few hundred people, most of them in San Francisco and the AI
labs, that have situational awareness. Through whatever peculiar forces
of fate, I have found myself amongst them. A few years ago, these
people were derided as crazy—but they trusted the trendlines, which
allowed them to correctly predict the AI advances of the past few
years. Whether these people are also right about the next few years
remains to be seen. But these are very smart people—the smartest
people I have ever met—and they are the ones building this technol-
ogy. Perhaps they will be an odd footnote in history, or perhaps they
will go down in history like Szilard and Oppenheimer and Teller. If
they are seeing the future even close to correctly, we are in for a wild
ride.
Let me tell you what we see.
Contents

Introduction
History is live in San Francisco.

I. From GPT-4 to AGI: Counting the OOMs
AGI by 2027 is strikingly plausible. GPT-2 to GPT-4 took us from
~preschooler to ~smart high-schooler abilities in 4 years. Tracing
trendlines in compute (~0.5 orders of magnitude or OOMs/year),
algorithmic efficiencies (~0.5 OOMs/year), and “unhobbling” gains (from
chatbot to agent), we should expect another preschooler-to-high-schooler-
sized qualitative jump by 2027.

II. From AGI to Superintelligence: the Intelligence Explosion
AI progress won’t stop at human-level. Hundreds of millions of AGIs
could automate AI research, compressing a decade of algorithmic progress
(5+ OOMs) into 1 year. We would rapidly go from human-level to vastly
superhuman AI systems. The power—and the peril—of superintelligence
would be dramatic.

III. The Challenges

IIIa. Racing to the Trillion-Dollar Cluster
The most extraordinary techno-capital acceleration has been set in
motion. As AI revenue grows rapidly, many trillions of dollars will go
into GPU, datacenter, and power buildout before the end of the decade.
The industrial mobilization, including growing US electricity production
by 10s of percent, will be intense.

IIIb. Lock Down the Labs: Security for AGI
The nation’s leading AI labs treat security as an afterthought.
Currently, they’re basically handing the key secrets for AGI to the CCP
on a silver platter. Securing the AGI secrets and weights against the
state-actor threat will be an immense effort, and we’re not on track.

IIIc. Superalignment
Reliably controlling AI systems much smarter than we are is an unsolved
technical problem. And while it is a solvable problem, things could very
easily go off the rails during a rapid intelligence explosion. Managing
this will be extremely tense; failure could easily be catastrophic.

IIId. The Free World Must Prevail
Superintelligence will give a decisive economic and military advantage.
China isn’t at all out of the game yet. In the race to AGI, the free
world’s very survival will be at stake. Can we maintain our preeminence
over the authoritarian powers? And will we manage to avoid
self-destruction along the way?

IV. The Project
As the race to AGI intensifies, the national security state will get
involved. The USG will wake from its slumber, and by 27/28 we’ll get
some form of government AGI project. No startup can handle
superintelligence. Somewhere in a SCIF, the endgame will be on.

V. Parting Thoughts
What if we’re right?

Appendix
I. From GPT-4 to AGI: Counting the OOMs
AGI by 2027 is strikingly plausible. GPT-2 to GPT-4 took
us from ~preschooler to ~smart high-schooler abilities in
4 years. Tracing trendlines in compute (~0.5 orders of magni-
tude or OOMs/year), algorithmic efficiencies (~0.5 OOMs/year),
and “unhobbling” gains (from chatbot to agent), we should
expect another preschooler-to-high-schooler-sized qualitative
jump by 2027.
Look. The models, they just want to learn. You have to
understand this. The models, they just want to learn.
Ilya Sutskever
(circa 2015, via Dario Amodei)
GPT-4’s capabilities came as a shock to many: an AI system
that could write code and essays, could reason through difficult
math problems, and ace college exams. A few years ago, most
thought these were impenetrable walls.
But GPT-4 was merely the continuation of a decade of break-
neck progress in deep learning. A decade earlier, models could
barely identify simple images of cats and dogs; four years ear-
lier, GPT-2 could barely string together semi-plausible sen-
tences. Now we are rapidly saturating all the benchmarks we
can come up with. And yet this dramatic progress has merely
been the result of consistent trends in scaling up deep learning.
There have been people who have seen this for far longer. They
were scoffed at, but all they did was trust the trendlines. The
trendlines are intense, and they were right. The models, they
just want to learn; you scale them up, and they learn more.
I make the following claim: it is strikingly plausible that
by 2027, models will be able to do the work of an AI re-
searcher/engineer. That doesn’t require believing in sci-fi; it
just requires believing in straight lines on a graph.
Figure 1: Rough estimates of past and
future scaleup of effective compute
(both physical compute and algorith-
mic efficiencies), based on the public
estimates discussed in this piece. As
we scale models, they consistently get
smarter, and by “counting the OOMs”
we get a rough sense of what model
intelligence we should expect in the
(near) future. (This graph shows only
the scaleup in base models; “unhob-
blings” are not pictured.)
In this piece, I will simply “count the OOMs” (OOM = order
of magnitude, 10x = 1 order of magnitude): look at the trends
in 1) compute, 2) algorithmic efficiencies (algorithmic progress
that we can think of as growing “effective compute”), and 3)
“unhobbling” gains (fixing obvious ways in which models are
hobbled by default, unlocking latent capabilities and giving
them tools, leading to step-changes in usefulness). We trace
the growth in each over four years before GPT-4, and what we
should expect in the four years after, through the end of 2027.
Given deep learning’s consistent improvements for every OOM
of effective compute, we can use this to project future progress.
Publicly, things have been quiet for a year since the GPT-4 re-
lease, as the next generation of models has been in the oven—
leading some to proclaim stagnation and that deep learning is
hitting a wall.1 But by counting the OOMs, we get a peek at
what we should actually expect.

1 Predictions they’ve made every year for the last decade, and which
they’ve been consistently wrong about. . .
The upshot is pretty simple. GPT-2 to GPT-4—from models
that were impressive for sometimes managing to string to-
gether a few coherent sentences, to models that ace high-school
exams—was not a one-time gain. We are racing through the
OOMs extremely rapidly, and the numbers indicate we should
expect another ~100,000x effective compute scaleup—resulting
in another GPT-2-to-GPT-4-sized qualitative jump—over four
years. Moreover, and critically, that doesn’t just mean a better
chatbot; picking the many obvious low-hanging fruit on “un-
hobbling” gains should take us from chatbots to agents, from a
tool to something that looks more like drop-in remote worker
replacements.
While the inference is simple, the implication is striking. An-
other jump like that very well could take us to AGI, to models
as smart as PhDs or experts that can work beside us as cowork-
ers. Perhaps most importantly, if these AI systems could auto-
mate AI research itself, that would set in motion intense feed-
back loops—the topic of the next piece in the series.
Even now, barely anyone is pricing all this in. But situational
awareness on AI isn’t actually that hard, once you step back
and look at the trends. If you keep being surprised by AI capa-
bilities, just start counting the OOMs.
The last four years
We have machines now that we can basically talk to like
humans. It’s a remarkable testament to the human capacity to
adjust that this seems normal, that we’ve become inured to the
pace of progress. But it’s worth stepping back and looking at
the progress of just the last few years.
GPT-2 to GPT-4
Let me remind you of how far we came in just the ~4 (!) years
leading up to GPT-4.
GPT-2 (2019) ~ preschooler: “Wow, it can string together a few
plausible sentences.” A very-cherry-picked example of a semi-
coherent story about unicorns in the Andes it generated was
incredibly impressive at the time. And yet GPT-2 could barely
count to 5 without getting tripped up;2 when summarizing
an article, it just barely outperformed selecting 3 random
sentences from the article.3

2 From SSC: “Janelle Shane asks GPT-2 its ten favorite animals:
Prompt: My 10 favorite animals are: 1.
My ten favorite animals are:
1. Zebras with a white scar on the back
2. Insiduous spiders and octopus
3. Frog with large leaves, hopefully black
4. Cockatiel with scales
5. Razorbill with wings hanging about 4 inches from one’s face and a heart tattoo on a frog
3. Cockatric interlocking tetrabods that can be blind, cut, and eaten raw:
4. Black and white desert crocodiles living in sunlight
5. Zebra and many other pea bugs”

3 From the GPT-2 paper, Section 3.6.
Figure 2: Some examples of what
people found impressive about GPT-
2 at the time. Left: GPT-2 does an
ok job on extremely basic reading
comprehension questions. Right: In
a cherry-picked sample (best of 10
tries), GPT-2 can write a semi-coherent
paragraph that says some semi-relevant
things about the Civil War.
Comparing AI capabilities with human intelligence is difficult
and flawed, but I think it’s informative to consider the analogy
here, even if it’s highly imperfect. GPT-2 was shocking for its
command of language, and its ability to occasionally generate a
semi-cohesive paragraph, or occasionally answer simple factual
questions correctly. It’s what would have been impressive for a
preschooler.
GPT-3 (2020)4 ~ elementary schooler: “Wow, with just some few-
shot examples it can do some simple useful tasks.” It started
being cohesive over even multiple paragraphs much more
consistently, and could correct grammar and do some very basic
arithmetic. For the first time, it was also commercially useful in
a few narrow ways: for example, GPT-3 could generate simple
copy for SEO and marketing.

4 I mean clunky old GPT-3 here, not the dramatically-improved GPT-3.5
you might know from ChatGPT.
Figure 3: Some examples of what
people found impressive about GPT-
3 at the time. Top: After a simple
instruction, GPT-3 can use a made-up
word in a new sentence. Bottom-left:
GPT-3 can engage in rich storytelling
back-and-forth. Bottom-right: GPT-3
can generate some very simple code.
Again, the comparison is imperfect, but what impressed peo-
ple about GPT-3 is perhaps what would have been impressive
for an elementary schooler: it wrote some basic poetry, could
tell richer and coherent stories, could start to do rudimentary
coding, could fairly reliably learn from simple instructions and
demonstrations, and so on.
GPT-4 (2023) ~ smart high schooler: “Wow, it can write pretty so-
phisticated code and iteratively debug, it can write intelligently
and sophisticatedly about complicated subjects, it can reason
through difficult high-school competition math, it’s beating the
vast majority of high schoolers on whatever tests we can give
it, etc.” From code to math to Fermi estimates, it can think and
reason. GPT-4 is now useful in my daily tasks, from helping
write code to revising drafts.
Figure 4: Some of what people found
impressive about GPT-4 when it was re-
leased, from the “Sparks of AGI” paper.
Top: It’s writing very complicated code
(producing the plots shown in the mid-
dle) and can reason through nontrivial
math problems. Bottom-left: Solving an
AP math problem. Bottom-right: Solv-
ing a fairly complex coding problem.
More interesting excerpts from that
exploration of GPT-4’s capabilities here.
On everything from AP exams to the SAT, GPT-4 scores better
than the vast majority of high schoolers.
Of course, even GPT-4 is still somewhat uneven; for some tasks
it’s much better than smart high-schoolers, while there are
other tasks it can’t yet do. That said, I tend to think most of
these limitations come down to obvious ways models are still
hobbled, as I’ll discuss in-depth later. The raw intelligence
is (mostly) there, even if the models are still artificially con-
strained; it’ll take extra work to unlock models being able to
fully apply that raw intelligence across applications.
Figure 5: Progress over just four years.
Where are you on this line?
The trends in deep learning
The pace of deep learning progress in the last decade has sim-
ply been extraordinary. A mere decade ago it was revolution-
ary for a deep learning system to identify simple images. To-
day, we keep trying to come up with novel, ever harder tests,
and yet each new benchmark is quickly cracked. It used to take
decades to crack widely-used benchmarks; now it feels like
mere months.
We’re literally running out of benchmarks. As an anecdote, my
friends Dan and Collin made a benchmark called MMLU a few
years ago, in 2020. They hoped to finally make a benchmark
that would stand the test of time, equivalent to all the hardest
exams we give high school and college students. Just three
years later, it’s basically solved: models like GPT-4 and Gemini
get ~90%.

Figure 6: Deep learning systems are rapidly reaching or exceeding
human-level in many domains. Graphic: Our World in Data
More broadly, GPT-4 mostly cracks all the standard high school
and college aptitude tests (Figure 7).5

5 And no, these tests aren’t in the training set. AI labs put real effort
into ensuring these evals are uncontaminated, because they need good
measurements in order to do good science. A recent analysis on this by
ScaleAI confirmed that the leading labs aren’t overfitting to the
benchmarks (though some smaller LLM developers might be juicing their
numbers).
Or consider the MATH benchmark, a set of difficult mathemat-
ics problems from high-school math competitions.6 When the
benchmark was released in 2021, GPT-3 only got ~5% of problems
right. And the original paper noted: “Moreover, we find
that simply increasing budgets and model parameter counts
will be impractical for achieving strong mathematical reasoning
if scaling trends continue [...]. To have more traction on
mathematical problem solving we will likely need new algorithmic
advancements from the broader research community”—we
would need fundamental new breakthroughs to solve MATH,
or so they thought. A survey of ML researchers predicted minimal
progress over the coming years (Figure 8);7 and yet within
just a year (by mid-2022), the best models went from ~5% to
50% accuracy; now, MATH is basically solved, with recent
performance over 90%.

6 In the original paper, it was noted: “We also evaluated humans on
MATH, and found that a computer science PhD student who does not
especially like mathematics attained approximately 40% on MATH, while a
three-time IMO gold medalist attained 90%, indicating that MATH can be
challenging for humans as well.”

7 A coauthor notes: “When our group first released the MATH dataset, at
least one [ML researcher colleague] told us that it was a pointless
dataset because it was too far outside the range of what ML models could
accomplish (indeed, I was somewhat worried about this myself).”
Figure 7: GPT-4 scores on standardized
tests. Note also the large jump from
GPT-3.5 to GPT-4 in human percentile
on these tests, often from well below
the median human to the very top of
the human range. (And this is GPT-3.5,
a fairly recent model released less than
a year before GPT-4, not the clunky old
elementary-school-level GPT-3 we were
talking about earlier!)
Figure 8: Gray: Professional forecasts,
made in August 2021, for June 2022
performance on the MATH benchmark
(difficult mathematics problems from
high-school math competitions). Red
star: actual state-of-the-art performance
by June 2022, far exceeding even the
upper range forecasters gave. The
median ML researcher was even more
pessimistic.
Over and over again, year after year, skeptics have claimed
“deep learning won’t be able to do X” and have been quickly
proven wrong.8 If there’s one lesson we’ve learned from the past
decade of AI, it’s that you should never bet against deep learning.

8 Here’s Yann LeCun predicting in 2022 that even GPT-5000 won’t be able
to reason about physical interactions with the real world; GPT-4
obviously does it with ease a year later.
Here’s Gary Marcus’s walls predicted after GPT-2 being solved by GPT-3,
and the walls he predicted after GPT-3 being solved by GPT-4.
Here’s Prof. Bryan Caplan losing his first-ever public bet (after
previously famously having a perfect public betting track record). In
January 2023, after GPT-3.5 got a D on his economics midterm, Prof.
Caplan bet Matthew Barnett that no AI would get an A on his economics
midterms by 2029. Just two months later, when GPT-4 came out, it
promptly scored an A on his midterm (and it would have been one of the
highest scores in his class).
Now the hardest unsolved benchmarks are tests like GPQA,
a set of PhD-level biology, chemistry, and physics questions.
Many of the questions read like gibberish to me, and even
PhDs in other scientific fields spending 30+ minutes with
Google barely score above random chance. Claude 3 Opus
currently gets ~60%,9 compared to in-domain PhDs who get
~80%—and I expect this benchmark to fall as well, in the next
generation or two.

9 On the diamond set, majority voting of the model trying 32 times with
chain-of-thought.
Figure 9: Example GPQA questions.
Models are already better at this than
I am, and we’ll probably crack expert-
PhD-level soon. . .
Counting the OOMs
How did this happen? The magic of deep learning is that it just
works—and the trendlines have been astonishingly consistent,
despite naysayers at every turn.
Figure 10: The effects of scaling com-
pute, in the example of OpenAI Sora.
With each OOM of effective compute, models predictably, reliably get
better.10 If we can count the OOMs, we can (roughly, qualitatively)
extrapolate capability improvements.11 That’s how a few prescient
individuals saw GPT-4 coming.

10 And it’s worth noting just how consistent these trendlines are.
Combining the original scaling laws paper with some of the estimates on
compute and compute efficiency scaling since then implies a consistent
scaling trend for over 15 orders of magnitude (over
1,000,000,000,000,000x in effective compute)!

11 A common misconception is that scaling only holds for perplexity
loss, but we see very clear and consistent scaling behavior on
downstream performance on benchmarks as well. It’s usually just a matter
of finding the right log-log graph. For example, in the GPT-4 blog post,
they show consistent scaling behavior for performance on coding problems
over 6 OOMs (1,000,000x) of compute, using MLPR (mean log pass rate).
The “Are Emergent Abilities a Mirage?” paper makes a similar point; with
the right choice of metric, there is almost always a consistent trend
for performance on downstream tasks.
More generally, the “scaling hypothesis” qualitative observation—very
clear trends on model capability with scale—predates loss-scaling-curves;
the “scaling laws” work was just a formal measurement of this.
We can decompose the progress in the four years from GPT-2
to GPT-4 into three categories of scaleups:
1. compute: We’re using much bigger computers to train
these models.
2. algorithmic efficiencies: There’s a continuous trend of
algorithmic progress. Many of these act as “compute multi-
pliers,” and we can put them on a unified scale of growing
effective compute.
3. “unhobbling” gains: By default, models learn a lot of
amazing raw capabilities, but they are hobbled in all sorts
of dumb ways, limiting their practical value. With simple
algorithmic improvements like reinforcement learning from
human feedback (RLHF), chain-of-thought (CoT), tools, and
scaffolding, we can unlock significant latent capabilities.
We can “count the OOMs” of improvement along these axes:
that is, trace the scaleup for each in units of effective com-
pute. 3x is 0.5 OOMs; 10x is 1 OOM; 30x is 1.5 OOMs; 100x
is 2 OOMs; and so on. We can also look at what we should
expect on top of GPT-4, from 2023 to 2027.
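To make the unit conversion concrete, here is a minimal sketch (Python is my own choice of illustration; nothing below comes from the original analysis) that simply takes log10 of a scaleup factor:

    import math

    def ooms(multiplier: float) -> float:
        """Convert a scaleup factor into orders of magnitude (OOMs)."""
        return math.log10(multiplier)

    # The conversions quoted above:
    for factor in (3, 10, 30, 100):
        print(f"{factor}x is about {ooms(factor):.1f} OOMs")
    # 3x ~ 0.5 OOMs, 10x ~ 1.0 OOM, 30x ~ 1.5 OOMs, 100x ~ 2.0 OOMs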
I’ll go through each one-by-one, but the upshot is clear: we are
rapidly racing through the OOMs. There are potential head-
winds in the data wall, which I’ll address—but overall, it seems
likely that we should expect another GPT-2-to-GPT-4-sized
jump, on top of GPT-4, by 2027.
Compute
I’ll start with the most commonly-discussed driver of recent
progress: throwing (a lot) more compute at models.
Many people assume that this is simply due to Moore’s Law.
But even in the old days when Moore’s Law was in its heyday,
it was comparatively glacial—perhaps 1-1.5 OOMs per decade.
We are seeing much more rapid scaleups in compute—close
to 5x the speed of Moore’s law—instead because of mammoth
investment. (Spending even a million dollars on a single model
used to be an outrageous thought nobody would entertain, and
now that’s pocket change!)
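As a quick sanity check on that “close to 5x” claim, a minimal sketch (my own illustration, using only the rough figures in this paragraph):

    # Moore's law in its heyday: roughly 1-1.5 OOMs of compute per decade.
    # Frontier AI training compute: roughly ~0.5 OOMs/year, i.e. ~5 OOMs per decade.
    moores_law_ooms_per_decade = (1.0, 1.5)
    ai_ooms_per_decade = 0.5 * 10

    low = ai_ooms_per_decade / moores_law_ooms_per_decade[1]
    high = ai_ooms_per_decade / moores_law_ooms_per_decade[0]
    print(f"AI training compute grows ~{low:.1f}x-{high:.1f}x faster than Moore's law")
    # ~3.3x-5.0x, consistent with "close to 5x the speed of Moore's law"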
Model           Estimated Compute       Growth
GPT-2 (2019)    ~4e21 FLOP
GPT-3 (2020)    ~3e23 FLOP              + ~2 OOMs
GPT-4 (2023)    8e24 to 4e25 FLOP       + ~1.5–2 OOMs

Table 1: Estimates of compute for GPT-2 to GPT-4 by Epoch AI.
We can use public estimates from Epoch AI (a source widely
respected for its excellent analysis of AI trends) to trace the
compute scaleup from 2019 to 2023. GPT-2 to GPT-3 was a
quick scaleup; there was a large overhang of compute, scaling
from a smaller experiment to using an entire datacenter to train
a large language model. With the scaleup from GPT-3 to GPT-
4, we transitioned to the modern regime: having to build an
entirely new (much bigger) cluster for the next model. And yet
the dramatic growth continued. Overall, Epoch AI estimates
suggest that GPT-4 training used ~3,000x-10,000x more raw
compute than GPT-2.
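Just as a check, we can recompute those ratios directly from the Table 1 estimates; a minimal sketch (my own, and only as precise as the public estimates it starts from):

    import math

    # Epoch AI's public training-compute estimates from Table 1 (FLOP).
    gpt2 = 4e21
    gpt3 = 3e23
    gpt4_low, gpt4_high = 8e24, 4e25

    def ooms(x: float) -> float:
        return math.log10(x)

    print(f"GPT-2 -> GPT-3: {gpt3/gpt2:,.0f}x (~{ooms(gpt3/gpt2):.1f} OOMs)")
    print(f"GPT-3 -> GPT-4: {gpt4_low/gpt3:,.0f}x-{gpt4_high/gpt3:,.0f}x "
          f"(~{ooms(gpt4_low/gpt3):.1f}-{ooms(gpt4_high/gpt3):.1f} OOMs)")
    print(f"GPT-2 -> GPT-4: {gpt4_low/gpt2:,.0f}x-{gpt4_high/gpt2:,.0f}x "
          f"(~{ooms(gpt4_low/gpt2):.1f}-{ooms(gpt4_high/gpt2):.1f} OOMs)")
    # Roughly 2,000x-10,000x from GPT-2 to GPT-4, in line with the range quoted above.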
In broad strokes, this is just the continuation of a longer-running
trend. For the last decade and a half, primarily because of
broad scaleups in investment (and specializing chips for AI
workloads in the form of GPUs and TPUs), the training com-
pute used for frontier AI systems has grown at roughly ~0.5
OOMs/year.
Figure 11: Training compute of no-
table deep learning models over time.
Source: Epoch AI.
The compute scaleup from GPT-2 to GPT-3 in a year was an
unusual overhang, but all the indications are that the longer-
run trend will continue. The SF-rumor-mill is abuzz with dra-
matic tales of huge GPU orders. The investments involved will
be extraordinary—but they are in motion. I go into this more
later in the series, in IIIa. Racing to the Trillion-Dollar Cluster;
based on that analysis, an additional 2 OOMs of compute (a
cluster in the $10s of billions) seems very likely to happen by
the end of 2027; even a cluster closer to +3 OOMs of compute
($100 billion+) seems plausible (and is rumored to be in the
works at Microsoft/OpenAI).
Algorithmic efficiencies
While massive investments in compute get all the attention,
algorithmic progress is probably a similarly important driver of
progress (and has been dramatically underrated).
To see just how big of a deal algorithmic progress can be, con-
sider the following illustration (Figure 12) of the drop in price
to attain ~50% accuracy on the MATH benchmark (high school
competition math) over just two years. (For comparison, a com-
puter science PhD student who didn’t particularly like math
scored 40%, so this is already quite good.) Inference efficiency
improved by nearly 3 OOMs—1,000x—in less than two years.
Figure 12: Rough estimate on relative
inference cost of attaining ~50% MATH
performance.
Though these are numbers just for inference efficiency (which
may or may not correspond to training efficiency improve-
ments, where numbers are harder to infer from public data),
they make clear there is an enormous amount of algorithmic
progress possible and happening.
Calculations below.
Gemini 1.5 Flash scores 54.9% on
MATH, and costs $0.35/$1.05 (in-
put/output) per million tokens. GPT-4
scored 42.5% on MATH prelease and
52.9% on MATH in early 2023, and cost
$30/$60 (input/output) per million
tokens; that’s 85x/57x (input/output)
more expensive per token than Gemini
1.5 Flash. To be conservative, I use an
estimate of 30x cost decrease above (ac-
counting for Gemini 1.5 Flash possibly
using more tokens to reason through
problems).
Minerva540B scores 50.3% on MATH,
using majority voting among 64 sam-
ples. A knowledgeable friend estimates
the base model here is probably 2-
3x more expensive to inference than
GPT-4. However, Minerva seems to use
somewhat fewer tokens per answer
on a quick spot check. More impor-
tantly, Minerva needed 64 samples to
achieve that performance, naively im-
plying a 64x multiple on cost if you e.g.
naively ran this via an inference API. In
practice, prompt tokens can be cached
when running an eval; given a few-shot
prompt, prompt tokens are likely a
majority of the cost, even accounting
for output tokens. Supposing output
tokens are a third of the cost for getting
a single sample, that would imply only
a ~20x increase in cost from the maj@64
with caching. To be conservative, I
use the rough number of a 20x cost
decrease in the above (even if the naive
decrease in inference cost from running
this via an API would be larger).
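Making the arithmetic of that note explicit: the headline number multiplies the two conservative factors chosen above. A minimal sketch (mine, not from the original analysis):

    import math

    # Conservative cost-decrease factors from the calculation note above,
    # for reaching ~50% MATH accuracy:
    minerva_to_gpt4 = 20   # Minerva (maj@64, mid-2022) -> GPT-4 (early 2023)
    gpt4_to_flash = 30     # GPT-4 -> Gemini 1.5 Flash (2024)

    total = minerva_to_gpt4 * gpt4_to_flash
    print(f"Total inference-cost decrease: ~{total}x, about {math.log10(total):.1f} OOMs")
    # ~600x, i.e. "nearly 3 OOMs" in under two years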
In this piece, I’ll separate out two kinds of algorithmic progress.
Here, I’ll start by covering “within-paradigm” algorithmic
improvements—those that simply result in better base mod-
els, and that straightforwardly act as compute efficiencies or com-
pute multipliers. For example, a better algorithm might allow
us to achieve the same performance but with 10x less training
compute. In turn, that would act as a 10x (1 OOM) increase
in effective compute. (Later, I’ll cover “unhobbling,” which you
can think of as “paradigm-expanding/application-expanding”
algorithmic progress that unlocks capabilities of base models.)
If we step back and look at the long-term trends, we seem to
find new algorithmic improvements at a fairly consistent rate.
Individual discoveries seem random, and at every turn, there
seem to be insurmountable obstacles—but the long-run trendline is
predictable, a straight line on a graph. Trust the trendline.
We have the best data for ImageNet (where algorithmic re-
search has been mostly public and we have data stretching
back a decade), for which we have consistently improved com-
pute efficiency by roughly ~0.5 OOMs/year across the 9-year
period between 2012 and 2021.
Figure 13: We can measure algorith-
mic progress: how much less compute
is needed in 2021 compared to 2012
to train a model with the same per-
formance? We see a trend of ~0.5
OOMs/year of algorithmic efficiency.
Source: Erdil and Besiroglu 2022.
That’s a huge deal: that means 4 years later, we can achieve the
same level of performance for ~100x less compute (and con-
comitantly, much higher performance for the same compute!).
Unfortunately, since labs don’t publish internal data on this,
it’s harder to measure algorithmic progress for frontier LLMs
over the last four years. Epoch AI has new work replicating
their results on ImageNet for language modeling, and estimates
a similar ~0.5 OOMs/year of algorithmic efficiency trend in
LLMs from 2012 to 2023. (This has wider error bars though,
and doesn’t capture some more recent gains, since the leading
labs have stopped publishing their algorithmic efficiencies.)
Figure 14: Estimates by Epoch AI of
algorithmic efficiencies in language
modeling. Their estimates suggest
we’ve made ~4 OOMs of efficiency
gains in 8 years.
More directly looking at the last 4 years, GPT-2 to GPT-3 was
basically a simple scaleup (according to the paper), but there
have been many publicly-known and publicly-inferable gains
since GPT-3:
• We can infer gains from API costs:12
– GPT-4, on release, cost ~the same as GPT-3 when it was released,
despite the absolutely enormous performance increase.13 (If we do
a naive and oversimplified back-of-the-envelope estimate based on
scaling laws, this suggests that perhaps roughly half the effective
compute increase from GPT-3 to GPT-4 came from algorithmic
improvements.14)
– Since the GPT-4 release a year ago, OpenAI prices for GPT-4-level
models have fallen another 6x/4x (input/output) with the release of
GPT-4o.
– Gemini 1.5 Flash, recently released, offers between “GPT-3.75-level”
and GPT-4-level performance,15 while costing 85x/57x (input/output)
less than the original GPT-4 (extraordinary gains!).
• Chinchilla scaling laws give a 3x+ (0.5 OOMs+) efficiency gain.16
• Gemini 1.5 Pro claimed major compute efficiency gains (outperforming
Gemini 1.0 Ultra, while using “significantly less” compute), with
Mixture of Experts (MoE) as a highlighted architecture change. Other
papers also claim a substantial multiple on compute from MoE.
• There have been many tweaks and gains on architecture, data, training
stack, etc. all the time.17

Put together, public information suggests that the GPT-2 to GPT-4 jump
included 1-2 OOMs of algorithmic efficiency gains.18

12 Though these are inference efficiencies (rather than necessarily
training efficiencies), and to some extent will reflect inference-specific
optimizations, a) they suggest enormous amounts of algorithmic progress
is possible and happening in general, and b) it’s often the case that an
algorithmic improvement is both a training efficiency gain and an
inference efficiency gain, for example by reducing the number of
parameters necessary.

13 GPT-3: $60/1M tokens; GPT-4: $30/1M input tokens and $60/1M output
tokens.

14 Chinchilla scaling laws say that one should scale parameter count and
data equally. That is, parameter count grows “half the OOMs” of the OOMs
that effective training compute grows. At the same time, parameter count
is intuitively roughly proportional to inference costs. All else equal,
constant inference costs thus imply that half of the OOMs of effective
compute growth were “canceled out” by algorithmic wins.
That said, to be clear, this is a very naive calculation (just meant for
a rough illustration) that is wrong in various ways. There may be
inference-specific optimizations (that don’t translate into training
efficiency); there may be training efficiencies that don’t reduce
parameter count (and thus don’t translate into inference efficiency);
and so on.

15 Gemini 1.5 Flash ranks similarly to GPT-4 (higher than original
GPT-4, lower than updated versions of GPT-4) on LMSys, a chatbot
leaderboard, and has similar performance on MATH and GPQA (evals that
measure reasoning) as the original GPT-4, while landing roughly in the
middle between GPT-3.5 and GPT-4 on MMLU (an eval that more heavily
weights towards measuring knowledge).

16 At ~GPT-3 scale; more than 3x at larger scales.

17 For example, this paper contains a comparison of a GPT-3-style
vanilla Transformer to various simple changes to architecture and
training recipe published over the years (RMSnorms instead of layernorm,
different positional embeddings, SwiGLU activation, AdamW optimizer
instead of Adam, etc.), what they call “Transformer++”, implying a 6x
gain at least at small scale.

18 If we take the trend of 0.5 OOMs/year, and 4 years between GPT-2 and
GPT-4 release, that would be 2 OOMs. However, GPT-2 to GPT-3 was a
simple scaleup (after big gains from e.g. Transformers), and OpenAI
claims GPT-4 pretraining finished in 2022, which could mean we’re
looking at closer to 2 years’ worth of algorithmic progress that we
should be counting here. 1 OOM of algorithmic efficiency seems like a
conservative lower bound.

Figure 15: Decomposing progress: compute and algorithmic efficiencies.
(Rough illustration.)
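As a small illustration of the “infer gains from API costs” point, here is a minimal sketch (my own) converting the price ratios quoted above into OOMs; as footnote 12 cautions, these are inference-cost OOMs and need not map one-to-one onto training efficiency:

    import math

    def ooms(x: float) -> float:
        return math.log10(x)

    # Price ratios quoted above, relative to the original GPT-4 (input/output tokens).
    price_drops = {
        "GPT-4o vs. original GPT-4": (6, 4),
        "Gemini 1.5 Flash vs. original GPT-4": (85, 57),
    }

    for name, (inp, out) in price_drops.items():
        print(f"{name}: {inp}x/{out}x cheaper, i.e. "
              f"~{ooms(inp):.1f}/{ooms(out):.1f} OOMs (input/output)")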
Over the 4 years following GPT-4, we should expect the trend
to continue:19 on average ~0.5 OOMs/year of compute efficiency,
i.e. ~2 OOMs of gains compared to GPT-4 by 2027.
While compute efficiencies will become harder to find as we
pick the low-hanging fruit, AI lab investments in money and
talent to find new algorithmic improvements are growing
rapidly.20 (The publicly-inferable inference cost efficiencies,
at least, don’t seem to have slowed down at all.) On the high
end, we could even see more fundamental, Transformer-like21
breakthroughs with even bigger gains.

Put together, this suggests we should expect something like 1-3
OOMs of algorithmic efficiency gains (compared to GPT-4) by
the end of 2027, maybe with a best guess of ~2 OOMs.

19 At the very least, given over a decade of consistent algorithmic
improvements, the burden of proof would be on those who would suggest it
will all suddenly come to a halt!

20 The economic returns to a 3x compute efficiency will be measured in
the $10s of billions or more, given cluster costs.

21 Very roughly something like a ~10x gain.
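Putting the compute and algorithmic-efficiency forecasts side by side, a rough sketch (the ranges are the ones quoted in this piece; the naive sum is just for illustration, and leaves out “unhobbling” entirely):

    # Forecast ranges quoted in this piece, in OOMs of effective compute
    # gained over GPT-4 by the end of 2027.
    compute_ooms = (2.0, 3.0)       # ~$10s-of-billions cluster likely; ~$100B+ plausible
    algorithmic_ooms = (1.0, 3.0)   # ~0.5 OOMs/year trend, best guess ~2 OOMs

    low = compute_ooms[0] + algorithmic_ooms[0]
    high = compute_ooms[1] + algorithmic_ooms[1]
    print(f"Base-model effective compute vs. GPT-4: ~{low:.0f}-{high:.0f} OOMs, "
          f"i.e. ~{10**low:,.0f}x to ~{10**high:,.0f}x (before unhobbling)")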
The data wall
There is a potentially important source of variance for all of
this: we’re running out of internet data. That could mean
that, very soon, the naive approach to pretraining larger
language models on more scraped data could start hitting
serious bottlenecks.
Frontier models are already trained on much of the inter-
net. Llama 3, for example, was trained on over 15T tokens.
Common Crawl, a dump of much of the internet used for
LLM training, is >100T tokens raw, though much of that is
spam and duplication (e.g., a relatively simple deduplica-
tion leads to 30T tokens, implying Llama 3 would already
be using basically all the data). Moreover, for more specific
domains like code, there are many fewer tokens still, e.g.
public GitHub repos are estimated to be in the low trillions of
tokens.
You can go somewhat further by repeating data, but aca-
demic work on this suggests that repetition only gets you
so far, finding that after 16 epochs (a 16-fold repetition),
returns diminish extremely fast to nil. At some point, even
with more (effective) compute, making your models bet-
ter can become much tougher because of the data con-
straint. This isn’t to be understated: we’ve been riding the
scaling curves, riding the wave of the language-modeling-
pretraining-paradigm, and without something new here,
this paradigm will (at least naively) run out. Despite the
massive investments, we’d plateau.
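For a very rough sense of scale, here is a naive token-budget sketch using only the figures quoted in this box; it overstates the real headroom, since returns diminish well before the 16th epoch and much of the data is low quality:

    import math

    # Rough figures quoted above (order-of-magnitude estimates only).
    llama3_tokens = 15e12         # ~15T tokens used to train Llama 3
    unique_web_tokens = 30e12     # ~30T tokens after simple deduplication of Common Crawl
    max_useful_epochs = 16        # repetition beyond ~16 epochs yields ~no further returns

    naive_ceiling = unique_web_tokens * max_useful_epochs
    headroom = naive_ceiling / llama3_tokens
    print(f"Naive ceiling: ~{naive_ceiling/1e12:.0f}T tokens "
          f"(~{headroom:.0f}x Llama 3, ~{math.log10(headroom):.1f} OOMs of data headroom)")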
All of the labs are rumored to be making massive research
bets on new algorithmic improvements or approaches to
get around this. Researchers are purportedly trying many
strategies, from synthetic data to self-play and RL ap-
proaches. Industry insiders seem to be very bullish: Dario
Amodei (CEO of Anthropic) recently said on a podcast: “if
you look at it very naively we’re not that far from running
out of data [. . . ] My guess is that this will not be a blocker
[. . . ] There’s just many different ways to do it.” Of course,
any research results on this are proprietary and not being
published these days.
In addition to insider bullishness, I think there’s a strong
intuitive case for why it should be possible to find ways to
train models with much better sample efficiency (algorith-
mic improvements that let them learn more from limited
data). Consider how you or I would learn from a really
dense math textbook:
• What a modern LLM does during training is, essentially,
very very quickly skim the textbook, the words just fly-
ing by, not spending much brain power on it.
• Rather, when you or I read that math textbook, we read
a couple pages slowly; then have an internal monologue
about the material in our heads and talk about it with
a few study-buddies; read another page or two; then
try some practice problems, fail, try them again in a
different way, get some feedback on those problems,
try again until we get a problem right; and so on, until
eventually the material “clicks.”
• You or I also wouldn’t learn much at all from a pass
through a dense math textbook if all we could do was
breeze through it like LLMs.22

22 And just rereading the same textbook over and over again might result
in memorization, not understanding. I take it that’s how many wordcels
pass math classes!
• But perhaps, then, there are ways to incorporate aspects
of how humans would digest a dense math textbook
to let the models learn much more from limited data.
In a simplified sense, this sort of thing—having an in-
ternal monologue about material, having a discussion
with a study-buddy, trying and failing at problems un-
til it clicks—is what many synthetic data/self-play/RL
approaches are trying to do.23
23 One other way of thinking about it
I find interesting: there is a “missing-
middle” between pretraining and
in-context learning.
In-context learning is incredible (and
competitive with human sample effi-
ciency). For example, the Gemini 1.5
Pro paper discusses giving the model
instructional materials (a textbook, a
dictionary) on Kalamang, a language
spoken by fewer than 200 people and
basically not present on the internet,
in context—and the model learns to
translate from English to Kalamang at
human-level! In context, the model is
able to learn from the textbook as well
as a human could (and much better
than it would learn from just chucking
that one textbook into pretraining).
When a human learns from a text-
book, they’re able to distill their short-
term memory/learnings into long-term
memory/long-term skills with practice;
however, we don’t have an equivalent
way to distill in-context learning “back
to the weights.” Synthetic data/self-
play/RL/etc are trying to fix that: let
the model learn by itself, then think
about it and practice what it learned,
distilling that learning back into the
weights.
The old state of the art of training models was simple and
naive, but it worked, so nobody really tried hard to crack
these approaches to sample efficiency. Now that it may
become more of a constraint, we should expect all the labs
to invest billions of dollars and their smartest minds into
cracking it. A common pattern in deep learning is that it
takes a lot of effort (and many failed projects) to get the
details right, but eventually some version of the obvious
and simple thing just works. Given how deep learning has
managed to crash through every supposed wall over the
last decade, my base case is that it will be similar here.
Moreover, it actually seems possible that cracking one of
these algorithmic bets like synthetic data could dramati-
cally improve models. Here’s an intuition pump. Current
frontier models like Llama 3 are trained on the internet—
and the internet is mostly crap, like e-commerce or SEO
or whatever. Many LLMs spend the vast majority of their
training compute on this crap, rather than on really high-
quality data (e.g. reasoning chains of people working
through difficult science problems). Imagine if you could
spend GPT-4-level compute on entirely extremely high-
quality data—it could be a much, much more capable
model.
A look back at AlphaGo—the first AI system that beat the
world champions at Go, decades before it was thought
possible—is useful here as well.24

24 See also Andrej Karpathy’s talk discussing this here.
• In step 1, AlphaGo was trained by imitation learning on
expert human Go games. This gave it a foundation.
• In step 2, AlphaGo played millions of games against
itself. This let it become superhuman at Go: remember
the famous move 37 in the game against Lee Sedol, an
extremely unusual but brilliant move a human would
never have played.
Developing the equivalent of step 2 for LLMs is a key re-
search problem for overcoming the data wall (and, more-
over, will ultimately be the key to surpassing human-level
intelligence).
All of this is to say that data constraints seem to inject large
error bars either way into forecasting the coming years
of AI progress. There’s a very real chance things stall out
(LLMs might still be as big of a deal as the internet, but we
wouldn’t get to truly crazy AGI). But I think it’s reasonable
to guess that the labs will crack it, and that doing so will
not just keep the scaling curves going, but possibly enable
huge gains in model capability.
As an aside, this also means that we should expect more
variance between the different labs in coming years com-
pared to today. Up until recently, the state of the art tech-
niques were published, so everyone was basically doing
the same thing. (And new upstarts or open source projects
could easily compete with the frontier, since the recipe
was published.) Now, key algorithmic ideas are becom-
ing increasingly proprietary. I’d expect labs’ approaches
to diverge much more, and some to make faster progress
than others—even a lab that seems on the frontier now
could get stuck on the data wall while others make a
breakthrough that lets them race ahead. And open source
will have a much harder time competing. It will certainly
make things interesting. (And if and when a lab figures
it out, their breakthrough will be the key to AGI, key to
superintelligence—one of the United States’ most prized
secrets.)
Unhobbling
Finally, the hardest to quantify—but no less important—category
of improvements: what I’ll call “unhobbling.”
Imagine if when asked to solve a hard math problem, you had
to instantly answer with the very first thing that came to mind.
It seems obvious that you would have a hard time, except for
the simplest problems. But until recently, that’s how we had
LLMs solve math problems. Instead, most of us work through
the problem step-by-step on a scratchpad, and are able to solve
much more difficult problems that way. “Chain-of-thought”
prompting unlocked that for LLMs. Despite excellent raw ca-
pabilities, they were much worse at math than they could be
because they were hobbled in an obvious way, and it took a
small algorithmic tweak to unlock much greater capabilities.
We’ve made huge strides in “unhobbling” models over the past
few years. These are algorithmic improvements beyond just
training better base models—and often only use a fraction of
pretraining compute—that unleash model capabilities:
• Reinforcement learning from human feedback (RLHF). Base mod-
els have incredible latent capabilities,25 but they’re raw and
incredibly hard to work with. While the popular conception
of RLHF is that it merely censors swear words, RLHF has
been key to making models actually useful and commercially
valuable (rather than having models merely predict random
internet text, it gets them to actually apply their capabilities
to try to answer your question!). This was the magic of
ChatGPT—well-done RLHF made models usable and useful
to real people for the first time. The original InstructGPT
paper has a great quantification of this: an RLHF’d small
model was equivalent to a non-RLHF’d >100x larger model
in terms of human rater preference.

25 That’s the magic of unsupervised learning, in some sense: to better
predict the next token, to make perplexity go down, models learn
incredibly rich internal representations, everything from (famously)
sentiment to complex world models. But, out of the box, they’re hobbled:
they’re using their incredible internal representations merely to
predict the next token in random internet text, rather than applying
them in the best way to actually try to solve your problem.
• Chain of Thought (CoT). As discussed. CoT started being
widely used just 2 years ago and can provide the equiva-
lent of a >10x effective compute increase on math/reasoning
problems.
• Scaffolding. Think of CoT++: rather than just asking a model
to solve a problem, have one model make a plan of attack,
have another propose a bunch of possible solutions, have
another critique it, and so on. For example, on HumanEval
(coding problems), simple scaffolding enables GPT-3.5 to
outperform un-scaffolded GPT-4. On SWE-Bench (a bench-
mark of solving real-world software engineering tasks), GPT-
4 can only solve ~2% correctly, while with Devin’s agent
scaffolding it jumps to 14-23%. (Unlocking agency is only in
its infancy though, as I’ll discuss more later.)
• Tools: Imagine if humans weren’t allowed to use calculators
or computers. We’re only at the beginning here, but Chat-
GPT can now use a web browser, run some code, and so on.
• Context length. Models have gone from 2k token context
(GPT-3) to 32k context (GPT-4 release) to 1M+ context (Gem-
ini 1.5 Pro). This is a huge deal. A much smaller base model
with, say, 100k tokens of relevant context can outperform a
model that is much larger but only has, say, 4k relevant to-
kens of context—more context is effectively a large compute
efficiency gain.26 More generally, context is key to unlocking
many applications of these models: for example, many
coding applications require understanding large parts of
a codebase in order to usefully contribute new code; or, if
you’re using a model to help you write a document at work,
it really needs the context from lots of related internal docs
and conversations. Gemini 1.5 Pro, with its 1M+ token context,
was even able to learn a new language (a low-resource
language not on the internet) from scratch, just by putting a
dictionary and grammar reference materials in context!

26 See Figure 7 from the updated Gemini 1.5 whitepaper, comparing
perplexity vs. context for Gemini 1.5 Pro and Gemini 1.5 Flash (a much
cheaper and presumably smaller model).
• Posttraining improvements. The current GPT-4 has substan-
tially improved compared to the original GPT-4 when re-
leased, according to John Schulman due to posttraining
improvements that unlocked latent model capability: on
reasoning evals it’s made substantial gains (e.g., ~50% ->
72% on MATH, ~40% to ~50% on GPQA) and on the LMSys
leaderboard, it’s made a nearly 100-point Elo jump (comparable
to the difference in Elo between Claude 3 Haiku and the
much larger Claude 3 Opus, models that have a ~50x price
difference).
A survey by Epoch AI of some of these techniques, like scaf-
folding, tool use, and so on, finds that techniques like this can
typically result in effective compute gains of 5-30x on many
benchmarks. METR (an organization that evaluates models)
similarly found very large performance improvements on their
set of agentic tasks, via unhobbling from the same GPT-4 base
model: from 5% with just the base model, to 20% with the
GPT-4 as posttrained on release, to nearly 40% today from bet-
ter posttraining, tools, and agent scaffolding (Figure 16).
Figure 16: Performance on METR’s
agentic tasks, over time via better
unhobbling. Source: Model Evaluation
and Threat Research
While it’s hard to put these on a unified effective compute
scale with compute and algorithmic efficiencies, it’s clear these
are huge gains, at least on a roughly similar magnitude as the
compute scaleup and algorithmic efficiencies. (It also highlights
the central role of algorithmic progress: the ~0.5 OOMs/year
of compute efficiencies, already significant, are only part of the
story; put together with unhobbling, algorithmic progress
overall accounts for maybe even a majority of the gains on the
current trend.)
“Unhobbling” is a huge part of what actually enabled these
models to become useful—and I’d argue that much of what is
holding back many commercial applications today is the need
for further “unhobbling” of this sort. Indeed, models today are
still incredibly hobbled! For example:
• They don’t have long-term memory.
Figure 17: Decomposing progress:
compute, algorithmic efficiencies, and
unhobbling. (Rough illustration.)
• They can’t use a computer (they still only have very limited
tools).
• They still mostly don’t think before they speak. When you
ask ChatGPT to write an essay, that’s like expecting a human
to write an essay via their initial stream-of-consciousness.27

27 People are working on this though.
• They can (mostly) only engage in short back-and-forth dia-
logues, rather than going away for a day or a week, thinking
about a problem, researching different approaches, consult-
ing other humans, and then writing you a longer report or
pull request.
• They’re mostly not personalized to you or your application
(just a generic chatbot with a short prompt, rather than hav-
ing all the relevant background on your company and your
work).
The possibilities here are enormous, and we’re rapidly picking
the low-hanging fruit. This is critical: it’s completely wrong
to just imagine “GPT-6 ChatGPT.” With continued unhobbling
progress, the improvements will be step-changes compared to
GPT-6 + RLHF. By 2027, rather than a chatbot, you’re going to
have something that looks more like an agent, like a coworker.
From chatbot to agent-coworker
What could ambitious unhobbling over the coming years
look like? The way I think about it, there are three key
ingredients:
1. solving the “onboarding problem”
GPT-4 has the raw smarts to do a decent chunk of many
people’s jobs, but it’s sort of like a smart new hire that just
showed up 5 minutes ago: it doesn’t have any relevant con-
text, hasn’t read the company docs or Slack history or had
conversations with members of the team, or spent any time
understanding the company-internal codebase. A smart
new hire isn’t that useful 5 minutes after arriving—but
they are quite useful a month in! It seems like it should be
possible, for example via very-long-context, to “onboard”
models like we would a new human coworker. This alone
would be a huge unlock.
2. the test-time compute overhang (reasoning/error correction/system II for longer-horizon problems)
Right now, models can basically only do short tasks: you
ask them a question, and they give you an answer. But
that’s extremely limiting. Most useful cognitive work hu-
mans do is longer horizon—it doesn’t just take 5 minutes,
but hours, days, weeks, or months.
A scientist that could only think about a difficult problem
for 5 minutes couldn’t make any scientific breakthroughs.
A software engineer that could only write skeleton code
for a single function when asked wouldn’t be very useful—
software engineers are given a larger task, and they then go
make a plan, understand relevant parts of the codebase or
technical tools, write different modules and test them in-
crementally, debug errors, search over the space of possible
solutions, and eventually submit a large pull request that’s
the culmination of weeks of work. And so on.
In essence, there is a large test-time compute overhang. Think
of each GPT-4 token as a word of internal monologue when
you think about a problem. Each GPT-4 token is quite
smart, but it can currently only really effectively use on
the order of ~hundreds of tokens for chains of thought co-
herently (effectively as though you could only spend a few
minutes of internal monologue/thinking on a problem or
project).
What if it could use millions of tokens to think about and
work on really hard problems or bigger projects?
Number of tokens    Equivalent to me working on something for. . .
100s                A few minutes          ChatGPT (we are here)
1,000s              Half an hour           +1 OOMs test-time compute
10,000s             Half a workday         +2 OOMs
100,000s            A workweek             +3 OOMs
Millions            Multiple months        +4 OOMs

Table 2: Assuming a human thinking at ~100 tokens/minute and working 40
hours/week, translating “how long a model thinks” in tokens to human-time
on a given problem/project.
Even if the “per-token” intelligence were the same, it’d
be the difference between a smart person spending a few
minutes vs. a few months on a problem. I don’t know about
you, but there’s much, much, much more I am capable
of in a few months vs. a few minutes. If we could unlock
“being able to think and work on something for months-
equivalent, rather than a few-minutes-equivalent” for mod-
els, it would unlock an insane jump in capability. There’s a
huge overhang here, many OOMs worth.
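Table 2’s conversion is easy to reproduce; a minimal sketch (the assumptions, ~100 tokens/minute and a 40-hour workweek, are the table’s own; the sample token counts are my own picks within each range):

    # Translate "how long a model thinks" (in tokens) into human working time.
    TOKENS_PER_MINUTE = 100
    HOURS_PER_WORKWEEK = 40

    def human_time(tokens: int) -> str:
        minutes = tokens / TOKENS_PER_MINUTE
        hours = minutes / 60
        weeks = hours / HOURS_PER_WORKWEEK
        if hours < 1:
            return f"~{minutes:.0f} minutes"
        if weeks < 1:
            return f"~{hours:.0f} hours"
        if weeks < 5:
            return f"~{weeks:.1f} workweeks"
        return f"~{weeks / 4:.0f} months"  # ~4 workweeks per month

    for tokens in (300, 3_000, 30_000, 300_000, 3_000_000):
        print(f"{tokens:>9,} tokens is roughly {human_time(tokens)} of human working time")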
Right now, models can’t do this yet. Even with recent ad-
vances in long-context, this longer context mostly only
works for the consumption of tokens, not the production of
tokens—after a while, the model goes off the rails or gets
stuck. It’s not yet able to go away for a while to work on a
problem or project on its own.28

28 Which makes sense—why would it have learned the skills for
longer-horizon reasoning and error correction? There’s very little data
on the internet in the form of “my complete internal monologue,
reasoning, all the relevant steps over the course of a month as I work
on a project.” Unlocking this capability will require a new kind of
training, for it to learn these extra skills. Or as Gwern put it
(private correspondence): “ ‘Brain the size of a galaxy, and what do
they ask me to do? Predict the misspelled answers on benchmarks!’
Marvin the depressed neural network moaned.”
But unlocking test-time compute might merely be a matter
of relatively small “unhobbling” algorithmic wins. Perhaps
a small amount of RL helps a model learn to error correct
(“hm, that doesn’t look right, let me double check that”),
make plans, search over possible solutions, and so on. In a
sense, the model already has most of the raw capabilities,
it just needs to learn a few extra skills on top to put it all
together.
In essence, we just need to teach the model a sort of System
II outer loop29 that lets it reason through difficult, long-
horizon projects.

29 System I vs. System II is a useful way of thinking about current
capabilities of LLMs—including their limitations and dumb mistakes—and
what might be possible with RL and unhobbling. Think of it this way:
when you are driving, most of the time you are on autopilot (System I,
what models mostly do right now). But when you encounter a complex
construction zone or novel intersection, you might ask your
passenger-seat companion to pause your conversation for a moment while
you figure out—actually think about—what’s going on and what to do. If
you were forced to go about life with only System I (closer to models
today), you’d have a lot of trouble. Creating the ability for System II
reasoning loops is a central unlock.
If we succeed at teaching this outer loop, instead of a short
chatbot answer of a couple paragraphs, imagine a stream
of millions of words (coming in more quickly than you can
read them) as the model thinks through problems, uses
tools, tries different approaches, does research, revises its
work, coordinates with others, and completes big projects
on its own.
Trading off test-time and train-time compute in other ML do-
mains. In other domains, like AI systems for board games,
it’s been demonstrated that you can use more test-time
compute (also called inference-time compute) to substitute
for training compute.
Figure 18: Jones (2021): A smaller model can do as well as a much larger model at the game of Hex if you give it more test-time compute (“more time to think”). In this domain, they find that one can spend ~1.2 OOMs more compute at test-time to get performance equivalent to a model with ~1 OOM more training compute.
If a similar relationship held in our case, if we could unlock
+4 OOMs of test-time compute, that might be equivalent to
+3 OOMs of pretraining compute, i.e. very roughly some-
thing like the jump between GPT-3 and GPT-4. (Solving
this “unhobbling” would be equivalent to a huge OOM
scaleup.)
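A back-of-the-envelope version of that conversion, assuming the board-game exchange rate from Jones (2021) above (~1.2 OOMs of test-time compute per ~1 OOM of training compute) carries over; that transfer is an assumption, not something the Hex result establishes for LLMs.

```python
# Rough conversion of test-time compute OOMs into "training-compute-equivalent"
# OOMs, using the ~1.2 : 1 exchange rate from Jones (2021) on Hex.
TEST_TO_TRAIN_RATIO = 1.2  # ~1.2 OOMs test-time ≈ ~1 OOM train-time (board games)

test_time_ooms = 4.0       # the hypothesized unlock: hundreds -> millions of tokens
train_equivalent_ooms = test_time_ooms / TEST_TO_TRAIN_RATIO

print(f"+{test_time_ooms} OOMs test-time ≈ +{train_equivalent_ooms:.1f} OOMs pretraining")
print(f"i.e. roughly a {10**train_equivalent_ooms:,.0f}x effective-compute scaleup")
# ≈ +3.3 OOMs, in the same ballpark as the GPT-3 → GPT-4 jump mentioned in the text.
```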
3. Using a computer
This is perhaps the most straightforward of the three. Chat-
GPT right now is basically like a human that sits in an
isolated box that you can text. While early unhobbling im-
provements teach models to use individual isolated tools,
I expect that with multimodal models we will soon be able
to do this in one fell swoop: we will simply enable models
to use a computer like a human would.
That means joining your Zoom calls, researching things on-
line, messaging and emailing people, reading shared docs,
using your apps and dev tooling, and so on. (Of course,
for models to make the most use of this in longer-horizon
loops, this will go hand-in-hand with unlocking test-time
compute.)
By the end of this, I expect us to get something that
looks a lot like a drop-in remote worker. An agent that joins
your company, is onboarded like a new human hire, mes-
sages you and colleagues on Slack and uses your software,
makes pull requests, and that, given big projects, can do
the model-equivalent of a human going away for weeks to
independently complete the project. You’ll probably need
somewhat better base models than GPT-4 to unlock this,
but possibly not even that much better—a lot of juice is in
fixing the clear and basic ways models are still hobbled.
(A very early peek at what this might look like is Devin, an
early prototype of unlocking the “agency-overhang”/”test-
time compute overhang” on models on the path to creating
a fully automated software engineer. I don’t know how
well Devin works in practice, and this demo is still very
limited compared to what proper chatbot → agent un-
hobbling would yield, but it’s a useful teaser of the sort of
thing coming soon.)
By the way, the centrality of unhobbling might lead to a some-
what interesting “sonic boom” effect in terms of commercial
applications. Intermediate models between now and the drop-
in remote worker will require tons of schlep to change work-
flows and build infrastructure to integrate and derive economic
value from. The drop-in remote worker will be dramatically
easier to integrate—just, well, drop them in to automate all
the jobs that could be done remotely. It seems plausible that
the schlep will take longer than the unhobbling, that is, by the
time the drop-in remote worker is able to automate a large
number of jobs, intermediate models won’t yet have been fully
harnessed and integrated—so the jump in economic value gen-
erated could be somewhat discontinuous.
The next four years
Putting the numbers together, we should (roughly) ex-
pect another GPT-2-to-GPT-4-sized jump in the 4 years follow-
ing GPT-4, by the end of 2027.
• GPT-2 to GPT-4 was roughly a 4.5–6 OOM base effective
compute scaleup (physical compute and algorithmic efficien-
cies), plus major “unhobbling” gains (from base model to
chatbot).
• In the subsequent 4 years, we should expect 3–6 OOMs
of base effective compute scaleup (physical compute and algorithmic efficiencies)—with perhaps a best guess of ~5
OOMs—plus step-changes in utility and applications un-
locked by “unhobbling” (from chatbot to agent/drop-in
remote worker).
To put this in perspective, suppose GPT-4 training took 3 months. In 2027, a leading AI lab will be able to train a GPT-4-level model in a minute.30 The OOM effective compute scaleup will be dramatic.
30 On the best guess assumptions on physical compute and algorithmic efficiency scaleups described above, and simplifying parallelism considerations (in reality, it might look more like “1440 (60*24) GPT-4-level models in a day” or similar).
Figure 19: Summary of the estimates on drivers of progress in the four years preceding GPT-4, and what we should expect in the four years following GPT-4.
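As a sanity check on footnote 30, the claim is just OOM arithmetic: going from a ~3-month training run to a one-minute run requires roughly the ~5 OOM effective-compute scaleup estimated above. A minimal sketch:

```python
import math

# Footnote 30 sanity check: how many OOMs separate "train GPT-4 in 3 months"
# from "train a GPT-4-level model in a minute"?
gpt4_training_minutes = 3 * 30 * 24 * 60   # ~3 months expressed in minutes
speedup = gpt4_training_minutes / 1        # target: one minute
ooms_needed = math.log10(speedup)

print(f"Speedup needed: ~{speedup:,.0f}x ≈ {ooms_needed:.1f} OOMs")
# ≈ 5.1 OOMs — consistent with the ~5 OOM best-guess effective-compute scaleup
# (ignoring parallelism; in practice it's more like many GPT-4s per day).
```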
Where will that take us?
Figure 20: Summary of counting the
OOMs.
GPT-2 to GPT-4 took us from ~preschooler to ~smart high-
schooler; from barely being able to output a few cohesive sen-
tences to acing high-school exams and being a useful coding
assistant. That was an insane jump. If this is the intelligence
gap we’ll cover once more, where will that take us?31 We should not be surprised if that takes us very, very far. Likely, it will take us to models that can outperform PhDs and the best experts in a field.
31 Of course, any benchmark we have today will be saturated. But that’s not saying much; it’s mostly a reflection on the difficulty of making hard-enough benchmarks.
(One neat way to think about this is that the current trend of AI
progress is proceeding at roughly 3x the pace of child develop-
ment. Your 3x-speed-child just graduated high school; it’ll be
taking your job before you know it!)
Again, critically, don’t just imagine an incredibly smart Chat-
GPT: unhobbling gains should mean that this looks more like
a drop-in remote worker, an incredibly smart agent that can
reason and plan and error-correct and knows everything about
you and your company and can work on a problem indepen-
dently for weeks.
We are on course for AGI by 2027. These AI systems will be able to automate basically all cognitive jobs (think: all jobs that could be done remotely).
To be clear—the error bars are large. Progress could stall as
we run out of data, if the algorithmic breakthroughs necessary
to crash through the data wall prove harder than expected.
Maybe unhobbling doesn’t go as far, and we are stuck with
merely expert chatbots, rather than expert coworkers. Perhaps
the decade-long trendlines break, or scaling deep learning hits
a wall for real this time. (Or an algorithmic breakthrough, even
simple unhobbling that unleashes the test-time compute over-
hang, could be a paradigm-shift, accelerating things further
and leading to AGI even earlier.)
In any case, we are racing through the OOMs, and it requires
no esoteric beliefs, merely trend extrapolation of straight lines,
to take the possibility of AGI—true AGI—by 2027 extremely
seriously.
It seems like many are in the game of downward-defining AGI
these days, as just a really good chatbot or whatever. What
I mean is an AI system that could fully automate my or my
friends’ job, that could fully do the work of an AI researcher
or engineer. Perhaps some areas, like robotics, might take
longer to figure out by default. And the societal rollout, e.g.
in medical or legal professions, could easily be slowed by so-
cietal choices or regulation. But once models can automate AI
research itself, that’s enough—enough to kick off intense feed-
back loops—and we could very quickly make further progress,
the automated AI engineers themselves solving all the remain-
ing bottlenecks to fully automating everything. In particular,
millions of automated researchers could very plausibly com-
press a decade of further algorithmic progress into a year or
less. AGI will merely be a small taste of the superintelligence
soon to follow. (More on that in the next chapter.)
In any case, do not expect the vertiginous pace of progress to
abate. The trendlines look innocent, but their implications are
intense. As with every generation before them, every new gen-
eration of models will dumbfound most onlookers; they’ll be
incredulous when, very soon, models solve incredibly difficult
science problems that would take PhDs days, when they’re
whizzing around your computer doing your job, when they’re
writing codebases with millions of lines of code from scratch,
when every year or two the economic value generated by these
models 10xs. Forget sci-fi, count the OOMs: it’s what we should
expect. AGI is no longer a distant fantasy. Scaling up simple
deep learning techniques has just worked, the models just want
to learn, and we’re about to do another 100,000x+ by the end of
2027. It won’t be long before they’re smarter than us.
Figure 21: GPT-4 is just the beginning—
where will we be four years later? Do
not make the mistake of underestimat-
ing the rapid pace of deep learning
progress (as illustrated by progress in
GANs).
Addendum. Racing through the OOMs: It’s this decade or bust
I used to be more skeptical of short timelines to AGI. One
reason is that it seemed unreasonable to privilege this
decade, concentrating so much AGI-probability-mass on
it (it seemed like a classic fallacy to think “oh we’re so
special”). I thought we should be uncertain about what
it takes to get AGI, which should lead to a much more
“smeared-out” probability distribution over when we
might get AGI.
However, I’ve changed my mind: critically, our uncertainty
over what it takes to get AGI should be over OOMs (of
effective compute), rather than over years.
We’re racing through the OOMs this decade. Even at its
bygone heyday, Moore’s law was only 1–1.5 OOMs/decade.
I estimate that we will do ~5 OOMs in 4 years, and over
~10 this decade overall.
Figure 22: Rough projections on ef-
fective compute scaleup. We’ve been
racing through the OOMs this decade;
after the early 2030s, we will face a slow
slog.
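To put those rates side by side, here is the arithmetic on the figures just quoted (the ~1–1.5 OOMs/decade Moore's-law figure and the ~5 OOMs in 4 years estimate are the essay's own; nothing else is assumed):

```python
# Comparing effective-compute growth rates implied by the numbers above.
moores_law_ooms_per_decade = 1.5          # upper end of the 1-1.5 OOMs/decade figure
projected_ooms = 5                        # ~5 OOMs in 4 years
projected_years = 4

rate = projected_ooms / projected_years   # OOMs per year
print(f"Projected: {rate:.2f} OOMs/year  ->  ~{10**rate:,.0f}x per year")
print(f"Moore's law heyday: {moores_law_ooms_per_decade/10:.2f} OOMs/year")
print(f"Ratio of rates: ~{rate / (moores_law_ooms_per_decade/10):.0f}x faster")
# ~1.25 OOMs/year (~18x/year) vs ~0.15 OOMs/year: roughly 8x faster than
# Moore's law even at its fastest, hence "racing through the OOMs".
```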
In essence, we’re in the middle of a huge scaleup reap-
ing one-time gains this decade, and progress through the
OOMs will be multiples slower thereafter. If this scaleup
doesn’t get us to AGI in the next 5-10 years, it might be a
long way out.
• Spending scaleup: Spending a million dollars on a model
used to be outrageous; by the end of the decade, we will
likely have $100B or $1T clusters. Going much higher
than that will be hard; that’s already basically the feasi-
ble limit (both in terms of what big business can afford,
and even just as a fraction of GDP). Thereafter all we
have is glacial 2%/year trend real GDP growth to in-
crease this.
• Hardware gains: AI hardware has been improving much
more quickly than Moore’s law. That’s because we’ve
been specializing chips for AI workloads. For exam-
ple, we’ve gone from CPUs to GPUs; adapted chips for
Transformers; and we’ve gone down to much lower pre-
cision number formats, from fp64/fp32 for traditional
supercomputing to fp8 on H100s. These are large gains,
but by the end of the decade we’ll likely have totally-
specialized AI-specific chips, without much further
beyond-Moore’s law gains possible.
• Algorithmic progress: In the coming decade, AI labs will
invest tens of billions in algorithmic R&D, and all the
smartest people in the world will be working on this;
from tiny efficiencies to new paradigms, we’ll be picking
lots of the low-hanging fruit. We probably won’t reach
any sort of hard limit (though “unhobblings” are likely
finite), but at the very least the pace of improvements
should slow down, as the rapid growth (in $ and human
capital investments) necessarily slows down (e.g., most
of the smart STEM talent will already be working on AI).
(That said, this is the most uncertain to predict, and the
source of most of the uncertainty on the OOMs in the
2030s on the plot above.)
Put together, this means we are racing through many
more OOMs in the next decade than we might in multi-
ple decades thereafter. Maybe it’s enough—and we get
AGI soon—or we might be in for a long, slow slog. You
and I can reasonably disagree on the median time to AGI,
depending on how hard we think achieving AGI will be—
but given how we’re racing through the OOMs right now,
certainly your modal AGI year should be sometime later this decade or so.
Figure 23: Matthew Barnett has a nice
related visualization of this, considering
just compute and biological bounds.
II. From AGI to Superintelligence: the Intelligence Ex-
plosion
AI progress won’t stop at human-level. Hundreds of millions
of AGIs could automate AI research, compressing a decade
of algorithmic progress (5+ OOMs) into 1 year. We would
rapidly go from human-level to vastly superhuman AI sys-
tems. The power—and the peril—of superintelligence would
be dramatic.
Let an ultraintelligent machine be defined as a machine that
can far surpass all the intellectual activities of any man
however clever. Since the design of machines is one of these
intellectual activities, an ultraintelligent machine could design
even better machines; there would then unquestionably be an
‘intelligence explosion,’ and the intelligence of man would be
left far behind. Thus the first ultraintelligent machine is the
last invention that man need ever make.
I. J. Good (1965)
The Bomb and The Super
In the common imagination, the Cold War’s terrors principally trace
back to Los Alamos, with the invention of the atomic bomb. But The
Bomb, alone, is perhaps overrated. Going from The Bomb to The
Super—hydrogen bombs—was arguably just as important.
In the Tokyo air raids, hundreds of bombers dropped thousands of
tons of conventional bombs on the city. Later that year, Little Boy,
dropped on Hiroshima, unleashed similar destructive power in a single
device. But just 7 years later, Teller’s hydrogen bomb multiplied yields
a thousand-fold once again—a single bomb with more explosive power
than all of the bombs dropped in the entirety of WWII combined.
The Bomb was a more efficient bombing campaign. The Super was a country-annihilating device.32
32 And much of the Cold War’s perversities (cf. Daniel Ellsberg’s book) stemmed from merely replacing A-bombs with H-bombs, without adjusting nuclear policy and war plans to the massive capability increase.
So it will be with AGI and Superintelligence.
AI progress won’t stop at human-level. After initially
learning from the best human games, AlphaGo started playing
against itself—and it quickly became superhuman, playing
extremely creative and complex moves that a human would
never have come up with.
We discussed the path to AGI in the previous piece. Once we
get AGI, we’ll turn the crank one more time—or two or three
more times—and AI systems will become superhuman—vastly
superhuman. They will become qualitatively smarter than you
or I, much smarter, perhaps similar to how you or I are qualita-
tively smarter than an elementary schooler.
The jump to superintelligence would be wild enough at the
current rapid but continuous rate of AI progress (if we could
make the jump to AGI in 4 years from GPT-4, what might an-
other 4 or 8 years after that bring?). But it could be much faster
than that, if AGI automates AI research itself.
Once we get AGI, we won’t just have one AGI. I’ll walk through
the numbers later, but: given inference GPU fleets by then,
we’ll likely be able to run many millions of them (perhaps 100
million human-equivalents, and soon after at 10x+ human speed).
Even if they can’t yet walk around the office or make coffee,
they will be able to do ML research on a computer. Rather
than a few hundred researchers and engineers at a leading AI
lab, we’d have more than 100,000x that—furiously working
on algorithmic breakthroughs, day and night. Yes, recursive
self-improvement, but no sci-fi required; they would need only to
accelerate the existing trendlines of algorithmic progress (currently at
~0.5 OOMs/year).
Automated AI research could probably compress a human-decade of algorithmic progress into less than a year (and that seems conservative). That’d be 5+ OOMs, another GPT-2-to-GPT-4-sized jump, on top of AGI—a qualitative jump like that from a preschooler to a smart high schooler, on top of AI systems already as smart as expert AI researchers/engineers.
Figure 24: Automated AI research could accelerate algorithmic progress, leading to 5+ OOMs of effective compute gains in a year. The AI systems we’d have by the end of an intelligence explosion would be vastly smarter than humans.
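As a quick check on that arithmetic, compressing a decade of the ~0.5 OOMs/year algorithmic trendline cited just above into one year gives the 5+ OOM figure directly:

```python
# Quick arithmetic check: compressing a decade of the current algorithmic
# trendline (~0.5 OOMs/year, as cited in the text) into one year.
algorithmic_trend_ooms_per_year = 0.5   # the essay's trendline estimate
years_compressed = 10                   # "a human-decade ... into less than a year"

ooms_gained = algorithmic_trend_ooms_per_year * years_compressed
print(f"Effective compute gain: ~{ooms_gained:.0f} OOMs, i.e. ~{10**ooms_gained:,.0f}x")
# -> ~5 OOMs, i.e. ~100,000x — comparable to the GPT-2 → GPT-4 scaleup.
```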
There are several plausible bottlenecks—including limited com-
pute for experiments, complementarities with humans, and
algorithmic progress becoming harder—which I’ll address, but
none seem sufficient to definitively slow things down.
Before we know it, we would have superintelligence on our
hands—AI systems vastly smarter than humans, capable of
novel, creative, complicated behavior we couldn’t even begin
to understand—perhaps even a small civilization of billions
of them. Their power would be vast, too. Applying superin-
telligence to R&D in other fields, explosive progress would
broaden from just ML research; soon they’d solve robotics,
make dramatic leaps across other fields of science and technol-
ogy within years, and an industrial explosion would follow.
Superintelligence would likely provide a decisive military ad-
vantage, and unfold untold powers of destruction. We will be
faced with one of the most intense and volatile moments of
human history.
Automating AI research
We don’t need to automate everything—just AI research. A
common objection to transformative impacts of AGI is that
it will be hard for AI to do everything. Look at robotics, for
instance, doubters say; that will be a gnarly problem, even if
AI is cognitively at the levels of PhDs. Or take automating
biology R&D, which might require lots of physical lab-work
and human experiments.
But we don’t need robotics—we don’t need many things—for
AI to automate AI research. The jobs of AI researchers and
engineers at leading labs can be done fully virtually and don’t
run into real-world bottlenecks in the same way (though it will
still be limited by compute, which I’ll address later). And the
job of an AI researcher is fairly straightforward, in the grand
scheme of things: read ML literature and come up with new
questions or ideas, implement experiments to test those ideas,
interpret the results, and repeat. This all seems squarely in the
domain where simple extrapolations of current AI capabilities
could easily take us to or beyond the levels of the best humans
by the end of 2027.33
33 The job of an AI researcher is also
a job that AI researchers at AI labs
just, well, know really well—so it’ll
be particularly intuitive to them to
optimize models to be good at that job.
And there will be huge incentives to do
so to help them accelerate their research
and their labs’ competitive edge.
It’s worth emphasizing just how straightforward and hacky
some of the biggest machine learning breakthroughs of the last
decade have been: “oh, just add some normalization” (Lay-
erNorm/BatchNorm) or “do f(x)+x instead of f(x)” (residual
connections) or “fix an implementation bug” (Kaplan → Chin-
chilla scaling laws). AI research can be automated. And au-
tomating AI research is all it takes to kick off extraordinary
feedback loops.34
34 This suggests an important point in
terms of the sequencing of risks from
AI, by the way. A common AI threat
model people point to is AI systems
developing novel bioweapons, and
that posing catastrophic risk. But if
AI research is more straightforward to
automate than biology R&D, we might
get an intelligence explosion before we
get extreme AI biothreats. This matters,
for example, with regard to whether we
should expect “bio warning shots” in
time before things get crazy on AI.
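To make the “straightforward and hacky” point concrete, here is the residual-connection trick (“do f(x)+x instead of f(x)”) as a minimal PyTorch-style sketch; the class name and structure are just for illustration, not taken from any particular codebase.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """The 'do f(x) + x instead of f(x)' trick, in a few lines."""
    def __init__(self, f: nn.Module):
        super().__init__()
        self.f = f  # any sub-network: attention, MLP, conv stack, ...

    def forward(self, x):
        return self.f(x) + x  # the entire innovation: add the input back
```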
We’d be able to run millions of copies (and soon at 10x+ hu-
man speed) of the automated AI researchers. Even by 2027,
we should expect GPU fleets in the 10s of millions. Training
clusters alone should be approaching ~3 OOMs larger, already
putting us at 10 million+ A100-equivalents. Inference fleets
should be much larger still. (More on all this in IIIa. Racing to
the Trillion Dollar Cluster.)
That would let us run many millions of copies of our auto-
mated AI researchers, perhaps 100 million human-researcher-
equivalents, running day and night. There are some assumptions that flow into the exact numbers, including that humans “think” at 100 tokens/minute (just a rough order-of-magnitude estimate, e.g. consider your internal monologue) and extrapolating historical trends and Chinchilla scaling laws on per-token inference costs for frontier models remaining in the same ballpark.35 We’d also want to reserve some of the GPUs for running experiments and training new models. Full calculation in a footnote.36
35 As noted earlier, the GPT-4 API costs less today than GPT-3 did when it was released—this suggests that the trend of inference efficiency wins is fast enough to keep inference costs roughly constant even as models get much more powerful. Similarly, there have been huge inference cost wins in just the year since GPT-4 was released; for example, the current version of Gemini 1.5 Pro outperforms the original GPT-4, while being roughly 10x cheaper.
We can also ground this somewhat more by considering Chinchilla scaling laws. On Chinchilla scaling laws, model size—and thus inference costs—grow with the square root of training cost, i.e. half the OOMs of the OOM scaleup of effective compute. However, in the previous piece, I suggested that algorithmic efficiency was advancing at roughly the same pace as compute scaleup, i.e. it made up roughly half of the OOMs of effective compute scaleup. If these algorithmic wins also translate into inference efficiency, that means that the algorithmic efficiencies would compensate for the naive increase in inference cost.
In practice, training compute efficiencies often, but not always, translate into inference efficiency wins. However, there are also separately many inference efficiency wins that are not training efficiency wins. So, at least in terms of the rough ballpark, assuming the $/token of frontier models stays roughly similar doesn’t seem crazy.
(Of course, they’ll use more tokens, i.e. more test-time compute. But that’s already part of the calculation here, by pricing human-equivalents as 100 tokens/minute.)
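To spell out footnote 35's Chinchilla argument, here is a rough sketch using the standard approximations C ≈ 6·N·D with compute-optimal D ≈ 20·N (so parameters N, and hence per-token inference cost, scale as the square root of training compute); the 50/50 split between physical compute and algorithmic efficiency is the essay's rough assumption, carried over as-is.

```python
# Footnote 35, made concrete with the usual Chinchilla approximations:
#   training FLOP  C ≈ 6 * N * D, with compute-optimal data D ≈ 20 * N
#   => C ≈ 120 * N^2, so N (and ~2N inference FLOP/token) scales as sqrt(C).
def param_ooms_from_compute_ooms(compute_ooms: float) -> float:
    return compute_ooms / 2  # N ∝ sqrt(C)  =>  half the OOMs

effective_compute_ooms = 5.0   # the ~5 OOM best-guess scaleup
physical_compute_ooms = 2.5    # assume ~half is physical compute...
algorithmic_ooms = 2.5         # ...and ~half algorithmic efficiency (per the text)

naive_inference_cost_ooms = param_ooms_from_compute_ooms(effective_compute_ooms)
# If the algorithmic wins also cut inference cost, they roughly cancel the increase:
net_inference_cost_ooms = naive_inference_cost_ooms - algorithmic_ooms

print(f"Naive per-token cost increase: ~{naive_inference_cost_ooms:.1f} OOMs")
print(f"Net of algorithmic efficiency: ~{net_inference_cost_ooms:.1f} OOMs (≈ flat $/token)")
```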
36 GPT-4 Turbo (GPT4T) is about $0.03/1K tokens, i.e. ~33K tokens/$. We supposed above that we would have 10s of millions of A100-equivalents, which cost ~$1/hour per GPU. If we use the API costs to translate GPUs into tokens generated, that implies 10s of millions of GPUs * $1/GPU-hour * 33K tokens/$ = ~one trillion tokens/hour. Suppose a human does 100 tokens/min of thinking; that means a human-equivalent is 6,000 tokens/hour. One trillion tokens/hour divided by 6,000 tokens/human-hour = ~200 million human-equivalents—i.e. as if running 200 million human researchers, day and night. (And even if we reserve half the GPUs for experiment compute, we get 100 million human-researcher-equivalents.)
Another way of thinking about it is that given inference fleets in 2027, we should be able to generate an entire internet’s worth of tokens, every single day.37 In any case, the exact numbers don’t matter that much, beyond a simple plausibility demonstration.
37 The previous footnote estimated ~1T tokens/hour, i.e. 24T tokens a day. In the previous piece, I noted that a public deduplicated CommonCrawl had around 30T tokens.
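Footnotes 36 and 37 written out as code, using the same rough inputs (the GPU count, $/GPU-hour, tokens/$, and tokens/minute figures are the essay's ballpark estimates, not measurements):

```python
# Back-of-the-envelope from footnotes 36-37, using the essay's rough inputs.
num_gpus = 30e6                    # "10s of millions" of A100-equivalents (rough)
cost_per_gpu_hour = 1.0            # ~$1/hour per A100-equivalent
tokens_per_dollar = 33_000         # ~$0.03 per 1K tokens (GPT-4 Turbo-ish pricing)
human_tokens_per_hour = 100 * 60   # "thinking" at ~100 tokens/minute

tokens_per_hour = num_gpus * cost_per_gpu_hour * tokens_per_dollar
human_equivalents = tokens_per_hour / human_tokens_per_hour
tokens_per_day = tokens_per_hour * 24

print(f"~{tokens_per_hour:.1e} tokens/hour")
print(f"~{human_equivalents/1e6:.0f} million human-researcher-equivalents, day and night")
print(f"~{tokens_per_day/1e12:.0f}T tokens/day (vs. ~30T in a deduplicated CommonCrawl)")
# With 30M GPUs: ~1e12 tokens/hour, ~165M human-equivalents, ~24T tokens/day —
# the same ballpark as the footnotes (which use "10s of millions" loosely).
```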
Moreover, our automated AI researchers may soon be able to
run at much faster than human-speed:
• By taking some inference penalties, we can trade off running fewer copies in exchange for running them at faster serial speed. (For example, we could go from ~5x human speed to ~100x human speed by “only” running 1 million copies of the automated researchers.38)
38 Jacob Steinhardt estimates that k^3 parallel copies of a model can be replaced with a single model that is k^2 faster, given some math on inference tradeoffs with a tiling scheme (that theoretically works even for k of 100 or more). Suppose initial speeds were already ~5x human speed (based on, say, GPT-4 speed on release). Then, by taking this inference penalty (with k ≈ 5), we’d be able to run ~1 million automated AI researchers at ~100x human speed. (A short sketch of this arithmetic follows after this list.)
• More importantly, the first algorithmic innovation the automated AI researchers work on is getting a 10x or 100x speedup. Gemini 1.5 Flash is ~10x faster than the originally-released GPT-4,39 merely a year later, while providing similar performance to the originally-released GPT-4 on reasoning benchmarks. If that’s the algorithmic speedup a few hundred human researchers can find in a year, the automated AI researchers will be able to find similar wins very quickly.
39 This source benchmarks throughput of Flash at ~6x GPT-4 Turbo, and GPT-4 Turbo was faster than original GPT-4. Latency is probably also roughly 10x faster.
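As referenced in footnote 38, here is a short sketch of that serial-speed tradeoff, taking Steinhardt's k³-copies-for-k²-speed estimate and the ~5x starting speed as given assumptions:

```python
# Footnote 38's serial-speed tradeoff, written out: k^3 parallel copies can be
# traded for one copy running k^2 faster (Steinhardt's estimate, taken as given).
k = 5
base_copies = 125e6          # ~order of the 100M+ human-equivalents above
base_speed_multiplier = 5    # assume models already run at ~5x human speed

copies_after_tradeoff = base_copies / k**3           # fewer parallel copies...
speed_after_tradeoff = base_speed_multiplier * k**2  # ...each running k^2 faster

print(f"~{copies_after_tradeoff/1e6:.0f} million copies at ~{speed_after_tradeoff}x human speed")
# ~1 million automated researchers at ~125x (call it ~100x) human speed.
```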
That is: expect 100 million automated researchers each working at
100x human speed not long after we begin to be able to automate AI
research. They’ll each be able to do a year’s worth of work in a
few days. The increase in research effort—compared to a few
hundred puny human researchers at a leading AI lab today,
working at a puny 1x human speed—will be extraordinary.
This could easily dramatically accelerate existing trends of
algorithmic progress, compressing a decade of advances into
a year. We need not postulate anything totally novel for au-
tomated AI research to intensely speed up AI progress. Walk-
ing through the numbers in the previous piece, we saw that
algorithmic progress has been a central driver of deep learn-
ing progress in the last decade; we noted a trendline of ~0.5
OOMs/year on algorithmic efficiencies alone, with additional
large algorithmic gains from unhobbling on top. (I think the
import of algorithmic progress has been underrated by many,
and properly appreciating it is important for appreciating the
possibility of an intelligence explosion.)
Could our millions of automated AI researchers (soon working
at 10x or 100x human speed) compress the algorithmic progress
human researchers would have found in a decade into a year
instead? That would be 5+ OOMs in a year.
Don’t just imagine 100 million junior software engineer
interns here (we’ll get those earlier, in the next couple years!).
Real automated AI researchers will be very smart—and in addition
to their raw quantitative advantage, automated AI researchers
will have other enormous advantages over human researchers:
• They’ll be able to read every single ML paper ever written,
have been able to deeply think about every single previous
experiment ever run at the lab, learn in parallel from each
of their copies, and rapidly accumulate the equivalent of
millennia of experience. They’ll be able to develop far deeper
intuitions about ML than any human.
• They’ll be easily able to write millions of lines of complex
code, keep the entire codebase in context, and spend human-
decades (or more) checking and rechecking every line of
code for bugs and optimizations. They’ll be superbly compe-
tent at all parts of the job.
• You won’t have to individually train up each automated
AI researcher (indeed, training and onboarding 100 million
new human hires would be difficult). Instead, you can just
teach and onboard one of them—and then make replicas.
(And you won’t have to worry about politicking, cultural
acclimation, and so on, and they’ll work with peak energy
and focus day and night.)
• Vast numbers of automated AI researchers will be able to
share context (perhaps even accessing each other’s latent
space and so on), enabling much more efficient collaboration
and coordination compared to human researchers.
• And of course, however smart our initial automated AI
researchers would be, we’d soon be able to make further
OOM-jumps, producing even smarter models, even more
capable at automated AI research.
Imagine an automated Alec Radford—imagine 100 million automated Alec Radfords.40 I think just about every researcher at OpenAI would agree that if they had 10 Alec Radfords, let alone 100 or 1,000 or 1 million running at 10x or 100x human speed, they could very quickly solve very many of their problems. Even with various other bottlenecks (more in a moment), compressing a decade of algorithmic progress into a year as a result seems very plausible. (A 10x acceleration from a million times more research effort seems conservative if anything.)
40 Alec Radford is an incredibly gifted and prolific researcher/engineer at OpenAI, behind many of the most important advances, though he flies under the radar some.
That would be 5+ OOMs right there. 5 OOMs of algorithmic
wins would be a similar scaleup to what produced the GPT-
2-to-GPT-4 jump, a capability jump from ~a preschooler to ~a
smart high schooler. Imagine such a qualitative jump on top of
AGI, on top of Alec Radford.
It’s strikingly plausible we’d go from AGI to superintelligence
very quickly, perhaps in 1 year.
Possible bottlenecks
While this basic story is surprisingly strong—and is supported
by thorough economic modeling work—there are some real
and plausible bottlenecks that will probably slow down an
automated-AI-research intelligence explosion.
I’ll give a summary here, and then discuss these in more detail
in the optional sections below for those interested:
• Limited compute: AI research doesn’t just take good ideas,
thinking, or math—but running experiments to get empirical
signal on your ideas. A million times more research effort
via automated research labor won’t mean a million times
faster progress, because compute will still be limited—and
limited compute for experiments will be the bottleneck. Still,
even if this won’t be a 1,000,000x speedup, I find it hard to
imagine that the automated AI researchers couldn’t use the
compute at least 10x more effectively: they’ll be able to get
incredible ML intuition (having internalized the whole ML
literature and every previous experiment ever run!) and
centuries-equivalent of thinking-time to figure out exactly
the right experiment to run, configure it optimally, and get
maximum value of information; they’ll be able to spend
centuries-equivalent of engineer-time before running even
tiny experiments to avoid bugs and get them right on the
first try; they can make tradeoffs to economize on compute
by focusing on the biggest wins; and they’ll be able to try
tons of smaller-scale experiments (and given effective com-
pute scaleups by then, “smaller-scale” means being able to
train 100,000 GPT-4-level models in a year to try architec-
ture breakthroughs). Some human researchers and engineers
are able to produce 10x the progress of others, even with
the same amount of compute—and this should apply even
moreso to automated AI researchers. I do think this is the
most important bottleneck, and I address it in more depth
below.
• Complementarities/long tail: A classic lesson from eco-
nomics (cf Baumol’s growth disease) is that if you can au-
tomate, say, 70% of something, you get some gains but
quickly the remaining 30% become your bottleneck. For any-
thing that falls short of full automation—say, really good
copilots—human AI researchers would remain a major
bottleneck, making the overall increase in the rate of algo-
rithmic progress relatively small. Moreover, there’s likely
some long tail of capabilities required for automating AI
research—the last 10% of the job of an AI researcher might
be particularly hard to automate. This could soften takeoff
some, though my best guess is that this only delays things
by a couple years. Perhaps 2026/27 models are the proto-automated-researcher; it takes another year or two for some final unhobbling, a somewhat better model, inference speedups, and working out kinks to get to full automation; and finally by 2028 we get the 10x acceleration (and superintelligence by the end of the decade).
• Inherent limits to algorithmic progress: Maybe another 5 OOMs of algorithmic efficiency will be fundamentally impossible? I doubt it. While there will definitely be upper limits,41 if we got 5 OOMs in the last decade, we should probably expect at least another decade’s-worth of progress to be possible. More directly, current architectures and training algorithms are still very rudimentary, and it seems that much more efficient schemes should be possible. Biological reference classes also support dramatically more efficient algorithms being plausible.
41 25 OOMs of algorithmic progress on top of GPT-4, for example, are clearly impossible: that would imply it would be possible to train a GPT-4-level model with just a handful of FLOP.
• Ideas get harder to find, so the automated AI researchers
will merely sustain, rather than accelerate, the current rate
of progress: One objection is that although automated re-
search would increase effective research effort a lot, ideas
also get harder to find. That is, while it takes only a few
hundred top researchers at a lab to sustain 0.5 OOMs/year
today, as we exhaust the low-hanging fruit, it will take more
and more effort to sustain that progress—and so the 100 mil-
lion automated researchers will be merely what’s necessary
to sustain progress. I think this basic model is correct, but
the empirics don’t add up: the magnitude of the increase in
research effort—a million-fold—is way, way larger than the
historical trends of the growth in research effort that’s been
necessary to sustain progress. In econ modeling terms, it’s a
bizarre “knife-edge assumption” to assume that the increase
in research effort from automation will be just enough to keep
progress constant.
• Ideas get harder to find and there are diminishing returns, so
the intelligence explosion will quickly fizzle: Related to the
above objection, even if the automated AI researchers lead
to an initial burst of progress, whether rapid progress can be
sustained depends on the shape of the diminishing returns
curve to algorithmic progress. Again, my best read of the
empirical evidence is that the exponents shake out in favor
of explosive/accelerating progress. In any case, the sheer
size of the one-time boost—from 100s to 100s of millions of
AI researchers—probably overcomes diminishing returns
here for at least a good number of OOMs of algorithmic
progress, even though it of course can’t be indefinitely self-
sustaining.
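A toy calculation, not from the essay, to illustrate why the knife-edge is implausible: suppose (purely hypothetically) that the rate of algorithmic progress scales as (research effort)^λ for some diminishing-returns exponent λ < 1. Even strongly diminishing returns leave a large acceleration from a one-time million-fold jump in effort.

```python
# Toy illustration (my assumption, not the essay's model): suppose the rate of
# algorithmic progress scales as (research effort)**lam for a hypothetical
# diminishing-returns exponent lam < 1. How much does a one-time ~10^6x jump
# in effort (hundreds of researchers -> ~100M automated ones) speed things up?
effort_multiplier = 1e6

for lam in (0.2, 0.4, 0.7):
    acceleration = effort_multiplier ** lam
    print(f"lambda = {lam}: ~{acceleration:,.0f}x faster progress")
# Even with quite strong diminishing returns (lambda = 0.2), the one-time jump
# implies well over a 10x acceleration; the "knife-edge" where it exactly
# offsets harder ideas would require a very particular exponent near zero.
```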
Overall, these factors may slow things down somewhat: the
most extreme versions of intelligence explosion (say, overnight)
seem implausible. And they may result in a somewhat longer
runup (perhaps we need to wait an extra year or two from
more sluggish, proto-automated researchers to the true auto-
mated Alec Radfords, before things kick off in full force). But
they certainly don’t rule out a very rapid intelligence explosion.
A year—or at most just a few years, but perhaps even just a few
months—in which we go from fully-automated AI researchers
to vastly superhuman AI systems should be our mainline ex-
pectation.
If you’d rather skip the in-depth discussions on the various bottlenecks below, feel free to skip ahead to the next section.
Limited compute for experiments (optional, in more depth)
The production function for algorithmic progress includes
two complementary factors of production: research effort
and experiment compute. The millions of automated AI
researchers won’t have any more compute to run their
experiments on than human AI researchers; perhaps they’ll
just be sitting around waiting for their jobs to finish.
This is probably the most important bottleneck to the
intelligence explosion. Ultimately this is a quantitative
question—just how much of a bottleneck is it? On balance,
I find it hard to believe that the 100 million Alec Radfords
couldn’t increase the marginal product of experiment com-
pute by at least 10x (and thus, would still accelerate the
pace of progress by 10x):
• There’s a lot you can do with smaller amounts of compute.
The way most AI research works is that you test things
out at small scale—and then extrapolate via scaling laws.
(Many key historical breakthroughs required only a very
small amount of compute, e.g. the original Transformer
was trained on just 8 GPUs for a few days.) And note
that with ~5 OOMs of baseline scaleup in the next four
years, “small scale” will mean GPT-4 scale—the auto-
mated AI researchers will be able to run 100,000 GPT-4-
level experiments on their training cluster in a year, and
tens of millions of GPT-3-level experiments. (That’s a lot
of potential-breakthrough new architectures they’ll be
able to test!)
– A lot of the compute goes into larger-scale validation
of the final pretraining run—making sure you are get-
ting a high-enough degree of confidence on marginal
efficiency wins for your annual headline product—but
if you’re racing through the OOMs in the intelligence
explosion, you could economize and just focus on the
really big wins.
– As discussed in the previous piece, there are often
enormous gains to be had from relatively low-compute
“unhobbling” of models. These don’t require big pre-
training runs. It’s highly plausible that the intelligence explosion starts off with automated AI research, e.g., discovering a way to do RL on top that gives us a couple OOMs via unhobbling wins (and then we’re off to the races).
– As the automated AI researchers find efficiencies, that’ll
let them run more experiments. Recall the near-1000x
cheaper inference in two years for equivalent-MATH
performance, and the 10x general inference gains in
the last year, discussed in the previous piece, from
mere-human algorithmic progress. The first thing
the automated AI researchers will do is quickly find
similar gains, and in turn, that’ll let them run 100x
more experiments on e.g. new RL approaches. Or
they’ll be able to quickly make smaller models with
similar performance in relevant domains (cf previous
discussion of Gemini Flash, near-100x cheaper than
GPT-4), which in turn will let them run many more
experiments with these smaller models (again, imag-
ine using these to try different RL schemes). There are
probably other overhangs too, e.g. the automated AI
researchers might be able to quickly develop much
better distributed training schemes to utilize all the
inference GPUs (probably at least 10x more compute
right there). More generally, every OOM of training
efficiency gains they find will give them an OOM
more of effective compute to run experiments on.
• The automated AI researchers could be way more efficient.
It’s hard to overstate how many fewer experiments
you would have to run if you just got it right on the
first try—no gnarly bugs, being more selective about
exactly what you are running, and so on. Imagine 1000
automated AI researchers spending a month-equivalent
checking your code and getting the exact experiment
right before you press go. I’ve asked some AI lab col-
leagues about this and they agreed: you should pretty
easily be able to save 3x-10x of compute on most projects
merely if you could avoid frivolous bugs, get things right
on the first try, and only run high value-of-information
experiments.
• The automated AI researchers could have way better intu-
itions.
– Recently, I was speaking to an intern at a frontier lab;
they said that their dominant experience over the
past few months was suggesting many experiments
they wanted to run, and their supervisor (a senior
researcher) saying they could already predict the re-
sult beforehand so there was no need. The senior
researcher’s years of random experiments messing
around with models had honed their intuitions about
what ideas would work—or not. Similarly, it seems
like our AI systems could easily get superhuman in-
tuitions about ML experiments—they will have read
the entire machine learning literature, be able to learn
from every other experiment result and deeply think
about it, they could easily be trained to predict the
outcome of millions of ML experiments, and so on.
And maybe one of the first things they do is build up
a strong basic science of "predicting if this large scale
experiment will be successful just after seeing the first
1% of training, or just after seeing the smaller scale
version of this experiment", and so on.
– Moreover, beyond really good intuitions about re-
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
Vietnam Cotton & Spinning Association
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
agdhot
 
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
1tyxnjpia
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
eudsoh
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Kaxil Naik
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
actyx
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
mkkikqvo
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
Bisnar Chase Personal Injury Attorneys
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
TeukuEriSyahputra
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
ywqeos
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
uevausa
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
KiriakiENikolaidou
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
aguty
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
Vietnam Cotton & Spinning Association
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
dataschool1
 

Recently uploaded (20)

DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
 
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
 

I. From GPT-4 to AGI: Counting the OOMs

AGI by 2027 is strikingly plausible. GPT-2 to GPT-4 took us from ~preschooler to ~smart high-schooler abilities in 4 years. Tracing trendlines in compute (~0.5 orders of magnitude or OOMs/year), algorithmic efficiencies (~0.5 OOMs/year), and "unhobbling" gains (from chatbot to agent), we should expect another preschooler-to-high-schooler-sized qualitative jump by 2027.

Look. The models, they just want to learn. You have to understand this. The models, they just want to learn.
Ilya Sutskever (circa 2015, via Dario Amodei)

GPT-4's capabilities came as a shock to many: an AI system that could write code and essays, could reason through difficult math problems, and ace college exams. A few years ago, most thought these were impenetrable walls.

But GPT-4 was merely the continuation of a decade of breakneck progress in deep learning. A decade earlier, models could barely identify simple images of cats and dogs; four years earlier, GPT-2 could barely string together semi-plausible sentences. Now we are rapidly saturating all the benchmarks we can come up with. And yet this dramatic progress has merely been the result of consistent trends in scaling up deep learning.

There have been people who have seen this for far longer. They were scoffed at, but all they did was trust the trendlines. The trendlines are intense, and they were right. The models, they just want to learn; you scale them up, and they learn more.

I make the following claim: it is strikingly plausible that by 2027, models will be able to do the work of an AI researcher/engineer. That doesn't require believing in sci-fi; it just requires believing in straight lines on a graph.

Figure 1: Rough estimates of past and future scaleup of effective compute (both physical compute and algorithmic efficiencies), based on the public estimates discussed in this piece. As we scale models, they consistently get smarter, and by "counting the OOMs" we get a rough sense of what model intelligence we should expect in the (near) future. (This graph shows only the scaleup in base models; "unhobblings" are not pictured.)

In this piece, I will simply "count the OOMs" (OOM = order of magnitude, 10x = 1 order of magnitude): look at the trends in 1) compute, 2) algorithmic efficiencies (algorithmic progress that we can think of as growing "effective compute"), and 3) "unhobbling" gains (fixing obvious ways in which models are hobbled by default, unlocking latent capabilities and giving them tools, leading to step-changes in usefulness). We trace the growth in each over four years before GPT-4, and what we should expect in the four years after, through the end of 2027. Given deep learning's consistent improvements for every OOM of effective compute, we can use this to project future progress.

Publicly, things have been quiet for a year since the GPT-4 release, as the next generation of models has been in the oven—leading some to proclaim stagnation and that deep learning is hitting a wall.[1] But by counting the OOMs, we get a peek at what we should actually expect.

[1] Predictions they've made every year for the last decade, and which they've been consistently wrong about...

The upshot is pretty simple. GPT-2 to GPT-4—from models that were impressive for sometimes managing to string together a few coherent sentences, to models that ace high-school exams—was not a one-time gain. We are racing through the OOMs extremely rapidly, and the numbers indicate we should expect another ~100,000x effective compute scaleup—resulting in another GPT-2-to-GPT-4-sized qualitative jump—over four years. Moreover, and critically, that doesn't just mean a better chatbot; picking the many obvious low-hanging fruit on "unhobbling" gains should take us from chatbots to agents, from a tool to something that looks more like drop-in remote worker replacements.

While the inference is simple, the implication is striking. Another jump like that very well could take us to AGI, to models as smart as PhDs or experts that can work beside us as coworkers. Perhaps most importantly, if these AI systems could automate AI research itself, that would set in motion intense feedback loops—the topic of the next piece in the series.

Even now, barely anyone is pricing all this in. But situational awareness on AI isn't actually that hard, once you step back and look at the trends. If you keep being surprised by AI capabilities, just start counting the OOMs.
The last four years

We have machines now that we can basically talk to like humans. It's a remarkable testament to the human capacity to adjust that this seems normal, that we've become inured to the pace of progress. But it's worth stepping back and looking at the progress of just the last few years.

GPT-2 to GPT-4

Let me remind you of how far we came in just the ~4 (!) years leading up to GPT-4.

GPT-2 (2019) ~ preschooler: "Wow, it can string together a few plausible sentences." A very-cherry-picked example of a semi-coherent story about unicorns in the Andes it generated was incredibly impressive at the time. And yet GPT-2 could barely count to 5 without getting tripped up;[2] when summarizing an article, it just barely outperformed selecting 3 random sentences from the article.[3]

[2] From SSC: "Janelle Shane asks GPT-2 its ten favorite animals: Prompt: My 10 favorite animals are: 1. My ten favorite animals are: 1. Zebras with a white scar on the back 2. Insiduous spiders and octopus 3. Frog with large leaves, hopefully black 4. Cockatiel with scales 5. Razorbill with wings hanging about 4 inches from one's face and a heart tattoo on a frog 3. Cockatric interlocking tetrabods that can be blind, cut, and eaten raw: 4. Black and white desert crocodiles living in sunlight 5. Zebra and many other pea bugs"
[3] From the GPT-2 paper, Section 3.6.

Figure 2: Some examples of what people found impressive about GPT-2 at the time. Left: GPT-2 does an ok job on extremely basic reading comprehension questions. Right: In a cherry-picked sample (best of 10 tries), GPT-2 can write a semi-coherent paragraph that says some semi-relevant things about the Civil War.

Comparing AI capabilities with human intelligence is difficult and flawed, but I think it's informative to consider the analogy here, even if it's highly imperfect. GPT-2 was shocking for its command of language, and its ability to occasionally generate a semi-cohesive paragraph, or occasionally answer simple factual questions correctly. It's what would have been impressive for a preschooler.

GPT-3 (2020)[4] ~ elementary schooler: "Wow, with just some few-shot examples it can do some simple useful tasks." It started being cohesive over even multiple paragraphs much more consistently, and could correct grammar and do some very basic arithmetic. For the first time, it was also commercially useful in a few narrow ways: for example, GPT-3 could generate simple copy for SEO and marketing.

[4] I mean clunky old GPT-3 here, not the dramatically-improved GPT-3.5 you might know from ChatGPT.

Figure 3: Some examples of what people found impressive about GPT-3 at the time. Top: After a simple instruction, GPT-3 can use a made-up word in a new sentence. Bottom-left: GPT-3 can engage in rich storytelling back-and-forth. Bottom-right: GPT-3 can generate some very simple code.

Again, the comparison is imperfect, but what impressed people about GPT-3 is perhaps what would have been impressive for an elementary schooler: it wrote some basic poetry, could tell richer and coherent stories, could start to do rudimentary coding, could fairly reliably learn from simple instructions and demonstrations, and so on.

GPT-4 (2023) ~ smart high schooler: "Wow, it can write pretty sophisticated code and iteratively debug, it can write intelligently and sophisticatedly about complicated subjects, it can reason through difficult high-school competition math, it's beating the vast majority of high schoolers on whatever tests we can give it, etc." From code to math to Fermi estimates, it can think and reason. GPT-4 is now useful in my daily tasks, from helping write code to revising drafts.

Figure 4: Some of what people found impressive about GPT-4 when it was released, from the "Sparks of AGI" paper. Top: It's writing very complicated code (producing the plots shown in the middle) and can reason through nontrivial math problems. Bottom-left: Solving an AP math problem. Bottom-right: Solving a fairly complex coding problem. More interesting excerpts from that exploration of GPT-4's capabilities here.
On everything from AP exams to the SAT, GPT-4 scores better than the vast majority of high schoolers.

Of course, even GPT-4 is still somewhat uneven; for some tasks it's much better than smart high-schoolers, while there are other tasks it can't yet do. That said, I tend to think most of these limitations come down to obvious ways models are still hobbled, as I'll discuss in-depth later. The raw intelligence is (mostly) there, even if the models are still artificially constrained; it'll take extra work to unlock models being able to fully apply that raw intelligence across applications.

Figure 5: Progress over just four years. Where are you on this line?

The trends in deep learning

The pace of deep learning progress in the last decade has simply been extraordinary. A mere decade ago it was revolutionary for a deep learning system to identify simple images. Today, we keep trying to come up with novel, ever harder tests, and yet each new benchmark is quickly cracked. It used to take decades to crack widely-used benchmarks; now it feels like mere months.

Figure 6: Deep learning systems are rapidly reaching or exceeding human-level in many domains. Graphic: Our World in Data.

We're literally running out of benchmarks. As an anecdote, my friends Dan and Collin made a benchmark called MMLU a few years ago, in 2020. They hoped to finally make a benchmark that would stand the test of time, equivalent to all the hardest exams we give high school and college students. Just three years later, it's basically solved: models like GPT-4 and Gemini get ~90%. More broadly, GPT-4 mostly cracks all the standard high school and college aptitude tests (Figure 7).[5]

[5] And no, these tests aren't in the training set. AI labs put real effort into ensuring these evals are uncontaminated, because they need good measurements in order to do good science. A recent analysis on this by ScaleAI confirmed that the leading labs aren't overfitting to the benchmarks (though some smaller LLM developers might be juicing their numbers).

Or consider the MATH benchmark, a set of difficult mathematics problems from high-school math competitions.[6] When the benchmark was released in 2021, GPT-3 only got ~5% of problems right. And the original paper noted: "Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue [...]. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community"—we would need fundamental new breakthroughs to solve MATH, or so they thought. A survey of ML researchers predicted minimal progress over the coming years (Figure 8);[7] and yet within just a year (by mid-2022), the best models went from ~5% to 50% accuracy; now, MATH is basically solved, with recent performance over 90%.

[6] In the original paper, it was noted: "We also evaluated humans on MATH, and found that a computer science PhD student who does not especially like mathematics attained approximately 40% on MATH, while a three-time IMO gold medalist attained 90%, indicating that MATH can be challenging for humans as well."
[7] A coauthor notes: "When our group first released the MATH dataset, at least one [ML researcher colleague] told us that it was a pointless dataset because it was too far outside the range of what ML models could accomplish (indeed, I was somewhat worried about this myself)."

Figure 7: GPT-4 scores on standardized tests. Note also the large jump from GPT-3.5 to GPT-4 in human percentile on these tests, often from well below the median human to the very top of the human range. (And this is GPT-3.5, a fairly recent model released less than a year before GPT-4, not the clunky old elementary-school-level GPT-3 we were talking about earlier!)

Figure 8: Gray: Professional forecasts, made in August 2021, for June 2022 performance on the MATH benchmark (difficult mathematics problems from high-school math competitions). Red star: actual state-of-the-art performance by June 2022, far exceeding even the upper range forecasters gave. The median ML researcher was even more pessimistic.

Over and over again, year after year, skeptics have claimed "deep learning won't be able to do X" and have been quickly proven wrong.[8] If there's one lesson we've learned from the past decade of AI, it's that you should never bet against deep learning.

[8] Here's Yann LeCun predicting in 2022 that even GPT-5000 won't be able to reason about physical interactions with the real world; GPT-4 obviously does it with ease a year later. Here's Gary Marcus's walls predicted after GPT-2 being solved by GPT-3, and the walls he predicted after GPT-3 being solved by GPT-4. Here's Prof. Bryan Caplan losing his first-ever public bet (after previously famously having a perfect public betting track record). In January 2023, after GPT-3.5 got a D on his economics midterm, Prof. Caplan bet Matthew Barnett that no AI would get an A on his economics midterms by 2029. Just two months later, when GPT-4 came out, it promptly scored an A on his midterm (and it would have been one of the highest scores in his class).

Now the hardest unsolved benchmarks are tests like GPQA, a set of PhD-level biology, chemistry, and physics questions. Many of the questions read like gibberish to me, and even PhDs in other scientific fields spending 30+ minutes with Google barely score above random chance. Claude 3 Opus currently gets ~60%,[9] compared to in-domain PhDs who get ~80%—and I expect this benchmark to fall as well, in the next generation or two.

[9] On the diamond set, majority voting of the model trying 32 times with chain-of-thought.

Figure 9: Example GPQA questions. Models are already better at this than I am, and we'll probably crack expert-PhD-level soon...
Counting the OOMs

How did this happen? The magic of deep learning is that it just works—and the trendlines have been astonishingly consistent, despite naysayers at every turn.

Figure 10: The effects of scaling compute, in the example of OpenAI Sora.

With each OOM of effective compute, models predictably, reliably get better.[10] If we can count the OOMs, we can (roughly, qualitatively) extrapolate capability improvements.[11] That's how a few prescient individuals saw GPT-4 coming.

[10] And it's worth noting just how consistent these trendlines are. Combining the original scaling laws paper with some of the estimates on compute and compute efficiency scaling since then implies a consistent scaling trend for over 15 orders of magnitude (over 1,000,000,000,000,000x in effective compute)!
[11] A common misconception is that scaling only holds for perplexity loss, but we see very clear and consistent scaling behavior on downstream performance on benchmarks as well. It's usually just a matter of finding the right log-log graph. For example, in the GPT-4 blog post, they show consistent scaling behavior for performance on coding problems over 6 OOMs (1,000,000x) of compute, using MLPR (mean log pass rate). The "Are Emergent Abilities a Mirage?" paper makes a similar point; with the right choice of metric, there is almost always a consistent trend for performance on downstream tasks. More generally, the "scaling hypothesis" qualitative observation—very clear trends on model capability with scale—predates loss-scaling-curves; the "scaling laws" work was just a formal measurement of this.

We can decompose the progress in the four years from GPT-2 to GPT-4 into three categories of scaleups:

1. Compute: We're using much bigger computers to train these models.
2. Algorithmic efficiencies: There's a continuous trend of algorithmic progress. Many of these act as "compute multipliers," and we can put them on a unified scale of growing effective compute.
3. "Unhobbling" gains: By default, models learn a lot of amazing raw capabilities, but they are hobbled in all sorts of dumb ways, limiting their practical value. With simple algorithmic improvements like reinforcement learning from human feedback (RLHF), chain-of-thought (CoT), tools, and scaffolding, we can unlock significant latent capabilities.

We can "count the OOMs" of improvement along these axes: that is, trace the scaleup for each in units of effective compute. 3x is 0.5 OOMs; 10x is 1 OOM; 30x is 1.5 OOMs; 100x is 2 OOMs; and so on. We can also look at what we should expect on top of GPT-4, from 2023 to 2027.

I'll go through each one-by-one, but the upshot is clear: we are rapidly racing through the OOMs. There are potential headwinds in the data wall, which I'll address—but overall, it seems likely that we should expect another GPT-2-to-GPT-4-sized jump, on top of GPT-4, by 2027.

Compute

I'll start with the most commonly-discussed driver of recent progress: throwing (a lot) more compute at models.

Many people assume that this is simply due to Moore's Law. But even in the old days when Moore's Law was in its heyday, it was comparatively glacial—perhaps 1-1.5 OOMs per decade. We are seeing much more rapid scaleups in compute—close to 5x the speed of Moore's Law—instead because of mammoth investment. (Spending even a million dollars on a single model used to be an outrageous thought nobody would entertain, and now that's pocket change!)

Table 1: Estimates of compute for GPT-2 to GPT-4 by Epoch AI.

    Model          Estimated Compute       Growth
    GPT-2 (2019)   ~4e21 FLOP
    GPT-3 (2020)   ~3e23 FLOP              + ~2 OOMs
    GPT-4 (2023)   8e24 to 4e25 FLOP       + ~1.5-2 OOMs

We can use public estimates from Epoch AI (a source widely respected for its excellent analysis of AI trends) to trace the compute scaleup from 2019 to 2023. GPT-2 to GPT-3 was a quick scaleup; there was a large overhang of compute, scaling from a smaller experiment to using an entire datacenter to train a large language model. With the scaleup from GPT-3 to GPT-4, we transitioned to the modern regime: having to build an entirely new (much bigger) cluster for the next model.
  • 21. situational awareness 21 the dramatic growth continued. Overall, Epoch AI estimates suggest that GPT-4 training used ~3,000x-10,000x more raw compute than GPT-2. In broad strokes, this is just the continuation of a longer-running trend. For the last decade and a half, primarily because of broad scaleups in investment (and specializing chips for AI workloads in the form of GPUs and TPUs), the training com- pute used for frontier AI systems has grown at roughly ~0.5 OOMs/year. Figure 11: Training compute of no- table deep learning models over time. Source: Epoch AI. The compute scaleup from GPT-2 to GPT-3 in a year was an unusual overhang, but all the indications are that the longer- run trend will continue. The SF-rumor-mill is abreast with dra- matic tales of huge GPU orders. The investments involved will be extraordinary—but they are in motion. I go into this more later in the series, in IIIa. Racing to the Trillion-Dollar Cluster; based on that analysis, an additional 2 OOMs of compute (a cluster in the $10s of billions) seems very likely to happen by the end of 2027; even a cluster closer to +3 OOMs of compute ($100 billion+) seems plausible (and is rumored to be in the works at Microsoft/OpenAI).
Algorithmic efficiencies

While massive investments in compute get all the attention, algorithmic progress is probably a similarly important driver of progress (and has been dramatically underrated).

To see just how big of a deal algorithmic progress can be, consider the following illustration (Figure 12) of the drop in price to attain ~50% accuracy on the MATH benchmark (high school competition math) over just two years. (For comparison, a computer science PhD student who didn't particularly like math scored 40%, so this is already quite good.) Inference efficiency improved by nearly 3 OOMs—1,000x—in less than two years.

Figure 12: Rough estimate on relative inference cost of attaining ~50% MATH performance.

Though these are numbers just for inference efficiency (which may or may not correspond to training efficiency improvements, where numbers are harder to infer from public data), they make clear there is an enormous amount of algorithmic progress possible and happening.

Calculations: Gemini 1.5 Flash scores 54.9% on MATH, and costs $0.35/$1.05 (input/output) per million tokens. GPT-4 scored 42.5% on MATH pre-release and 52.9% on MATH in early 2023, and cost $30/$60 (input/output) per million tokens; that's 85x/57x (input/output) more expensive per token than Gemini 1.5 Flash. To be conservative, I use an estimate of 30x cost decrease above (accounting for Gemini 1.5 Flash possibly using more tokens to reason through problems).

Minerva 540B scores 50.3% on MATH, using majority voting among 64 samples. A knowledgeable friend estimates the base model here is probably 2-3x more expensive to inference than GPT-4. However, Minerva seems to use somewhat fewer tokens per answer on a quick spot check. More importantly, Minerva needed 64 samples to achieve that performance, naively implying a 64x multiple on cost if you e.g. naively ran this via an inference API. In practice, prompt tokens can be cached when running an eval; given a few-shot prompt, prompt tokens are likely a majority of the cost, even accounting for output tokens. Supposing output tokens are a third of the cost for getting a single sample, that would imply only a ~20x increase in cost from the maj@64 with caching. To be conservative, I use the rough number of a 20x cost decrease in the above (even if the naive decrease in inference cost from running this via an API would be larger).
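As a sanity check on the ~1,000x figure, here is a quick script (a sketch of mine, using only the per-token prices and the conservative multipliers quoted above):

    import math

    # Per-million-token API prices quoted above (input, output), in USD.
    gpt4_prices  = (30.00, 60.00)   # original GPT-4
    flash_prices = (0.35, 1.05)     # Gemini 1.5 Flash

    ratios = [int(g / f) for g, f in zip(gpt4_prices, flash_prices)]
    print(f"GPT-4 vs Gemini 1.5 Flash per-token price: ~{ratios[0]}x (input), ~{ratios[1]}x (output)")

    # The conservative end-to-end multipliers used in the text:
    minerva_to_gpt4 = 20    # Minerva 540B maj@64 (with prompt caching) vs GPT-4
    gpt4_to_flash   = 30    # GPT-4 vs Gemini 1.5 Flash, discounted for Flash's longer reasoning
    total = minerva_to_gpt4 * gpt4_to_flash
    print(f"Total drop in cost to reach ~50% MATH: ~{total}x (~{math.log10(total):.1f} OOMs)")

The conservative multipliers compound to ~600x, just shy of 3 OOMs; using the raw per-token ratios instead would push the estimate past 1,000x.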
In this piece, I'll separate out two kinds of algorithmic progress. Here, I'll start by covering "within-paradigm" algorithmic improvements—those that simply result in better base models, and that straightforwardly act as compute efficiencies or compute multipliers. For example, a better algorithm might allow us to achieve the same performance but with 10x less training compute. In turn, that would act as a 10x (1 OOM) increase in effective compute. (Later, I'll cover "unhobbling," which you can think of as "paradigm-expanding/application-expanding" algorithmic progress that unlocks capabilities of base models.)

If we step back and look at the long-term trends, we seem to find new algorithmic improvements at a fairly consistent rate. Individual discoveries seem random, and at every turn there seem to be insurmountable obstacles—but the long-run trendline is predictable, a straight line on a graph. Trust the trendline.

We have the best data for ImageNet (where algorithmic research has been mostly public and we have data stretching back a decade), for which we have consistently improved compute efficiency by roughly ~0.5 OOMs/year across the 9-year period between 2012 and 2021.

Figure 13: We can measure algorithmic progress: how much less compute is needed in 2021 compared to 2012 to train a model with the same performance? We see a trend of ~0.5 OOMs/year of algorithmic efficiency. Source: Erdil and Besiroglu 2022.

That's a huge deal: that means 4 years later, we can achieve the same level of performance for ~100x less compute (and, concomitantly, much higher performance for the same compute!).
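For intuition on how such a rate compounds (an illustrative one-liner of mine, not a new estimate):

    # A steady ~0.5 OOMs/year of algorithmic efficiency compounds quickly.
    rate = 0.5  # OOMs per year
    for years in (2, 4, 8):
        print(f"{years} years at {rate} OOMs/yr -> ~{10 ** (rate * years):,.0f}x less compute "
              f"for the same performance")

Four years at trend gives the ~100x figure above; eight years gives the ~4 OOMs that Epoch AI estimates for language modeling in Figure 14 below.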
Unfortunately, since labs don't publish internal data on this, it's harder to measure algorithmic progress for frontier LLMs over the last four years. Epoch AI has new work replicating their results on ImageNet for language modeling, and estimates a similar ~0.5 OOMs/year algorithmic efficiency trend in LLMs from 2012 to 2023. (This has wider error bars though, and doesn't capture some more recent gains, since the leading labs have stopped publishing their algorithmic efficiencies.)

Figure 14: Estimates by Epoch AI of algorithmic efficiencies in language modeling. Their estimates suggest we've made ~4 OOMs of efficiency gains in 8 years.

More directly looking at the last 4 years, GPT-2 to GPT-3 was basically a simple scaleup (according to the paper), but there have been many publicly-known and publicly-inferable gains since GPT-3:

• We can infer gains from API costs:[12]
  – GPT-4, on release, cost ~the same as GPT-3 when it was released, despite the absolutely enormous performance increase.[13] (If we do a naive and oversimplified back-of-the-envelope estimate based on scaling laws, this suggests that perhaps roughly half the effective compute increase from GPT-3 to GPT-4 came from algorithmic improvements.[14])
  – Since the GPT-4 release a year ago, OpenAI prices for GPT-4-level models have fallen another 6x/4x (input/output) with the release of GPT-4o.
  – Gemini 1.5 Flash, recently released, offers between "GPT-3.75-level" and GPT-4-level performance,[15] while costing 85x/57x (input/output) less than the original GPT-4 (extraordinary gains!).
• Chinchilla scaling laws give a 3x+ (0.5 OOMs+) efficiency gain.[16]
• Gemini 1.5 Pro claimed major compute efficiency gains (outperforming Gemini 1.0 Ultra, while using "significantly less" compute), with Mixture of Experts (MoE) as a highlighted architecture change. Other papers also claim a substantial multiple on compute from MoE.
• There have been many tweaks and gains on architecture, data, training stack, etc. all the time.[17]

[12] Though these are inference efficiencies (rather than necessarily training efficiencies), and to some extent will reflect inference-specific optimizations, a) they suggest enormous amounts of algorithmic progress is possible and happening in general, and b) it's often the case that an algorithmic improvement is both a training efficiency gain and an inference efficiency gain, for example by reducing the number of parameters necessary.
[13] GPT-3: $60/1M tokens; GPT-4: $30/1M input tokens and $60/1M output tokens.
[14] Chinchilla scaling laws say that one should scale parameter count and data equally. That is, parameter count grows "half the OOMs" of the OOMs that effective training compute grows. At the same time, parameter count is intuitively roughly proportional to inference costs. All else equal, constant inference costs thus imply that half of the OOMs of effective compute growth were "canceled out" by algorithmic wins. That said, to be clear, this is a very naive calculation (just meant for a rough illustration) that is wrong in various ways. There may be inference-specific optimizations (that don't translate into training efficiency); there may be training efficiencies that don't reduce parameter count (and thus don't translate into inference efficiency); and so on.
[15] Gemini 1.5 Flash ranks similarly to GPT-4 (higher than original GPT-4, lower than updated versions of GPT-4) on LMSys, a chatbot leaderboard, and has similar performance on MATH and GPQA (evals that measure reasoning) as the original GPT-4, while landing roughly in the middle between GPT-3.5 and GPT-4 on MMLU (an eval that more heavily weights towards measuring knowledge).
[16] At ~GPT-3 scale; more than 3x at larger scales.
[17] For example, this paper contains a comparison of a GPT-3-style vanilla Transformer to various simple changes to architecture and training recipe published over the years (RMSNorm instead of LayerNorm, different positional embeddings, SwiGLU activation, AdamW optimizer instead of Adam, etc.), what they call "Transformer++", implying a 6x gain at least at small scale.

Put together, public information suggests that the GPT-2 to GPT-4 jump included 1-2 OOMs of algorithmic efficiency gains.[18]

[18] If we take the trend of 0.5 OOMs/year, and 4 years between the GPT-2 and GPT-4 releases, that would be 2 OOMs. However, GPT-2 to GPT-3 was a simple scaleup (after big gains from e.g. Transformers), and OpenAI claims GPT-4 pretraining finished in 2022, which could mean we're looking at closer to 2 years' worth of algorithmic progress that we should be counting here. 1 OOM of algorithmic efficiency seems like a conservative lower bound.

Figure 15: Decomposing progress: compute and algorithmic efficiencies. (Rough illustration.)

Over the 4 years following GPT-4, we should expect the trend to continue:[19] on average ~0.5 OOMs/year of compute efficiency, i.e. ~2 OOMs of gains compared to GPT-4 by 2027. While compute efficiencies will become harder to find as we pick the low-hanging fruit, AI lab investments in money and talent to find new algorithmic improvements are growing rapidly.[20] (The publicly-inferable inference cost efficiencies, at least, don't seem to have slowed down at all.) On the high end, we could even see more fundamental, Transformer-like[21] breakthroughs with even bigger gains.

[19] At the very least, given over a decade of consistent algorithmic improvements, the burden of proof would be on those who would suggest it will all suddenly come to a halt!
[20] The economic returns to a 3x compute efficiency will be measured in the $10s of billions or more, given cluster costs.
[21] Very roughly something like a ~10x gain.

Put together, this suggests we should expect something like 1-3 OOMs of algorithmic efficiency gains (compared to GPT-4) by the end of 2027, maybe with a best guess of ~2 OOMs.
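Tallying this up with the compute projection from the previous section (a rough tally of my own, using only the ranges given in this piece):

    # Projected scaleup on top of GPT-4 by the end of 2027, in OOMs of effective compute
    # (base models only; "unhobbling" gains would come on top of this).
    compute_low, compute_high = 2.0, 3.0   # $10s-of-billions cluster very likely; ~$100B+ plausible
    algo_low, algo_high       = 1.0, 3.0   # algorithmic efficiency gains, best guess ~2 OOMs

    total_low, total_high = compute_low + algo_low, compute_high + algo_high
    print(f"Total: {total_low:.0f}-{total_high:.0f} OOMs of effective compute, "
          f"i.e. ~{10 ** total_low:,.0f}x to ~{10 ** total_high:,.0f}x over GPT-4")

The middle of that range, roughly 4-5 OOMs, is another GPT-2-to-GPT-4-sized scaleup of the kind flagged at the start of this chapter, before counting any "unhobbling" gains.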
The data wall

There is a potentially important source of variance for all of this: we're running out of internet data. That could mean that, very soon, the naive approach to pretraining larger language models on more scraped data could start hitting serious bottlenecks.

Frontier models are already trained on much of the internet. Llama 3, for example, was trained on over 15T tokens. Common Crawl, a dump of much of the internet used for LLM training, is >100T tokens raw, though much of that is spam and duplication (e.g., a relatively simple deduplication leads to 30T tokens, implying Llama 3 would already be using basically all the data). Moreover, for more specific domains like code, there are many fewer tokens still, e.g. public GitHub repos are estimated to be in the low trillions of tokens.

You can go somewhat further by repeating data, but academic work on this suggests that repetition only gets you so far, finding that after 16 epochs (a 16-fold repetition), returns diminish extremely fast to nil. At some point, even with more (effective) compute, making your models better can become much tougher because of the data constraint.
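To make the bottleneck concrete, here is a quick back-of-the-envelope using only the rough public figures above (these are loose estimates, not precise numbers):

    import math

    # Rough token-budget arithmetic behind the data wall.
    llama3_tokens     = 15e12   # Llama 3: trained on over 15T tokens
    deduped_web       = 30e12   # Common Crawl after a simple deduplication (~30T of >100T raw)
    max_useful_epochs = 16      # repetition beyond ~16 epochs reportedly yields little further gain

    ceiling  = deduped_web * max_useful_epochs
    headroom = ceiling / llama3_tokens
    print(f"Naive ceiling: ~{ceiling / 1e12:.0f}T token-passes, ~{headroom:.0f}x a Llama-3-scale run "
          f"(~{math.log10(headroom):.1f} OOMs of data headroom)")

On these numbers, naive pretraining has only around 1.5 OOMs of data headroom left, which is why the research bets described next matter so much.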
This isn't to be understated: we've been riding the scaling curves, riding the wave of the language-modeling-pretraining-paradigm, and without something new here, this paradigm will (at least naively) run out. Despite the massive investments, we'd plateau.

All of the labs are rumored to be making massive research bets on new algorithmic improvements or approaches to get around this. Researchers are purportedly trying many strategies, from synthetic data to self-play and RL approaches. Industry insiders seem to be very bullish: Dario Amodei (CEO of Anthropic) recently said on a podcast: "if you look at it very naively we're not that far from running out of data [...] My guess is that this will not be a blocker [...] There's just many different ways to do it." Of course, any research results on this are proprietary and not being published these days.

In addition to insider bullishness, I think there's a strong intuitive case for why it should be possible to find ways to train models with much better sample efficiency (algorithmic improvements that let them learn more from limited data). Consider how you or I would learn from a really dense math textbook:

• What a modern LLM does during training is, essentially, very very quickly skim the textbook, the words just flying by, not spending much brain power on it.
• Rather, when you or I read that math textbook, we read a couple pages slowly; then have an internal monologue about the material in our heads and talk about it with a few study-buddies; read another page or two; then try some practice problems, fail, try them again in a different way, get some feedback on those problems, try again until we get a problem right; and so on, until eventually the material "clicks."
• You or I also wouldn't learn much at all from a pass through a dense math textbook if all we could do was breeze through it like LLMs.[22]
• But perhaps, then, there are ways to incorporate aspects of how humans would digest a dense math textbook to let the models learn much more from limited data. In a simplified sense, this sort of thing—having an internal monologue about material, having a discussion with a study-buddy, trying and failing at problems until it clicks—is what many synthetic data/self-play/RL approaches are trying to do.[23]

[22] And just rereading the same textbook over and over again might result in memorization, not understanding. I take it that's how many wordcels pass math classes!
[23] One other way of thinking about it I find interesting: there is a "missing middle" between pretraining and in-context learning. In-context learning is incredible (and competitive with human sample efficiency). For example, the Gemini 1.5 Pro paper discusses giving the model instructional materials (a textbook, a dictionary) on Kalamang, a language spoken by fewer than 200 people and basically not present on the internet, in context—and the model learns to translate from English to Kalamang at human-level! In context, the model is able to learn from the textbook as well as a human could (and much better than it would learn from just chucking that one textbook into pretraining). When a human learns from a textbook, they're able to distill their short-term memory/learnings into long-term memory/long-term skills with practice; however, we don't have an equivalent way to distill in-context learning "back to the weights." Synthetic data/self-play/RL/etc. are trying to fix that: let the model learn by itself, then think about it and practice what it learned, distilling that learning back into the weights.

The old state of the art of training models was simple and naive, but it worked, so nobody really tried hard to crack these approaches to sample efficiency. Now that it may become more of a constraint, we should expect all the labs to invest billions of dollars and their smartest minds into cracking it. A common pattern in deep learning is that it takes a lot of effort (and many failed projects) to get the details right, but eventually some version of the obvious and simple thing just works. Given how deep learning has managed to crash through every supposed wall over the last decade, my base case is that it will be similar here.

Moreover, it actually seems possible that cracking one of these algorithmic bets like synthetic data could dramatically improve models. Here's an intuition pump. Current frontier models like Llama 3 are trained on the internet—and the internet is mostly crap, like e-commerce or SEO or whatever. Many LLMs spend the vast majority of their training compute on this crap, rather than on really high-quality data (e.g. reasoning chains of people working through difficult science problems). Imagine if you could spend GPT-4-level compute on entirely extremely high-quality data—it could be a much, much more capable model.

A look back at AlphaGo—the first AI system that beat the world champions at Go, decades before it was thought possible—is useful here as well.[24]

[24] See also Andrej Karpathy's talk discussing this here.

• In step 1, AlphaGo was trained by imitation learning on expert human Go games. This gave it a foundation.
• In step 2, AlphaGo played millions of games against itself. This let it become superhuman at Go: remember the famous move 37 in the game against Lee Sedol, an extremely unusual but brilliant move a human would never have played.

Developing the equivalent of step 2 for LLMs is a key research problem for overcoming the data wall (and, moreover, will ultimately be the key to surpassing human-level intelligence).

All of this is to say that data constraints seem to inject large error bars either way into forecasting the coming years of AI progress. There's a very real chance things stall out (LLMs might still be as big of a deal as the internet, but we wouldn't get to truly crazy AGI). But I think it's reasonable to guess that the labs will crack it, and that doing so will not just keep the scaling curves going, but possibly enable huge gains in model capability.

As an aside, this also means that we should expect more variance between the different labs in coming years compared to today. Up until recently, the state-of-the-art techniques were published, so everyone was basically doing the same thing. (And new upstarts or open source projects could easily compete with the frontier, since the recipe was published.) Now, key algorithmic ideas are becoming increasingly proprietary. I'd expect labs' approaches to diverge much more, and some to make faster progress than others—even a lab that seems on the frontier now could get stuck on the data wall while others make a breakthrough that lets them race ahead. And open source will have a much harder time competing. It will certainly make things interesting. (And if and when a lab figures it out, their breakthrough will be the key to AGI, key to superintelligence—one of the United States' most prized secrets.)
  • 30. situational awareness 30 Unhobbling Finally, the hardest to quantify—but no less important—category of improvements: what I’ll call “unhobbling.” Imagine if when asked to solve a hard math problem, you had to instantly answer with the very first thing that came to mind. It seems obvious that you would have a hard time, except for the simplest problems. But until recently, that’s how we had LLMs solve math problems. Instead, most of us work through the problem step-by-step on a scratchpad, and are able to solve much more difficult problems that way. “Chain-of-thought” prompting unlocked that for LLMs. Despite excellent raw ca- pabilities, they were much worse at math than they could be because they were hobbled in an obvious way, and it took a small algorithmic tweak to unlock much greater capabilities. We’ve made huge strides in “unhobbling” models over the past few years. These are algorithmic improvements beyond just training better base models—and often only use a fraction of pretraining compute—that unleash model capabilities: • Reinforcement learning from human feedback (RLHF). Base mod- els have incredible latent capabilities,25 but they’re raw and 25 That’s the magic of unsupervised learning, in some sense: to better pre- dict the next token, to make perplexity go down, models learn incredibly rich internal representations, everything from (famously) sentiment to complex world models. But, out of the box, they’re hobbled: they’re using their in- credible internal representations merely to predict the next token in random internet text, and rather than applying them in the best way to actually try to solve your problem. incredibly hard to work with. While the popular conception of RLHF is that it merely censors swear words, RLHF has been key to making models actually useful and commer- cially valuable (rather than making models predict random internet text, get them to actually apply their capabilities to try to answer your question!). This was the magic of ChatGPT—well-done RLHF made models usable and use- ful to real people for the first time. The original InstructGPT paper has a great quantification of this: an RLHF’d small model was equivalent to a non-RLHF’d >100x larger model in terms of human rater preference. • Chain of Thought (CoT). As discussed. CoT started being widely used just 2 years ago and can provide the equiva- lent of a >10x effective compute increase on math/reasoning problems. • Scaffolding. Think of CoT++: rather than just asking a model
to solve a problem, have one model make a plan of attack, have another propose a bunch of possible solutions, have another critique it, and so on. For example, on HumanEval (coding problems), simple scaffolding enables GPT-3.5 to outperform un-scaffolded GPT-4. On SWE-Bench (a benchmark of solving real-world software engineering tasks), GPT-4 can only solve ~2% correctly, while with Devin's agent scaffolding it jumps to 14-23%. (Unlocking agency is only in its infancy though, as I'll discuss more later.)

• Tools: Imagine if humans weren't allowed to use calculators or computers. We're only at the beginning here, but ChatGPT can now use a web browser, run some code, and so on.

• Context length. Models have gone from 2k token context (GPT-3) to 32k context (GPT-4 release) to 1M+ context (Gemini 1.5 Pro). This is a huge deal. A much smaller base model with, say, 100k tokens of relevant context can outperform a model that is much larger but only has, say, 4k relevant tokens of context—more context is effectively a large compute efficiency gain.26 More generally, context is key to unlocking many applications of these models: for example, many coding applications require understanding large parts of a codebase in order to usefully contribute new code; or, if you're using a model to help you write a document at work, it really needs the context from lots of related internal docs and conversations. Gemini 1.5 Pro, with its 1M+ token context, was even able to learn a new language (a low-resource language not on the internet) from scratch, just by putting a dictionary and grammar reference materials in context!

26. See Figure 7 from the updated Gemini 1.5 whitepaper, comparing perplexity vs. context for Gemini 1.5 Pro and Gemini 1.5 Flash (a much cheaper and presumably smaller model).

• Posttraining improvements. The current GPT-4 has substantially improved compared to the original GPT-4 when released, according to John Schulman due to posttraining improvements that unlocked latent model capability: on reasoning evals it's made substantial gains (e.g., ~50% -> 72% on MATH, ~40% to ~50% on GPQA) and on the LMSys leaderboard, it's made a nearly 100-point elo jump (comparable to the difference in elo between Claude 3 Haiku and the much larger Claude 3 Opus, models that have a ~50x price difference).
A survey by Epoch AI of some of these techniques, like scaffolding, tool use, and so on, finds that techniques like this can typically result in effective compute gains of 5-30x on many benchmarks. METR (an organization that evaluates models) similarly found very large performance improvements on their set of agentic tasks, via unhobbling from the same GPT-4 base model: from 5% with just the base model, to 20% with GPT-4 as posttrained on release, to nearly 40% today from better posttraining, tools, and agent scaffolding (Figure 16).

Figure 16: Performance on METR's agentic tasks, over time via better unhobbling. Source: Model Evaluation and Threat Research

While it's hard to put these on a unified effective compute scale with compute and algorithmic efficiencies, it's clear these are huge gains, at least of a roughly similar magnitude to the compute scaleup and algorithmic efficiencies. (It also highlights the central role of algorithmic progress: the ~0.5 OOMs/year of compute efficiencies, already significant, are only part of the story, and put together with unhobbling, algorithmic progress overall is maybe even a majority of the gains on the current trend.)

"Unhobbling" is a huge part of what actually enabled these models to become useful—and I'd argue that much of what is holding back many commercial applications today is the need for further "unhobbling" of this sort. Indeed, models today are still incredibly hobbled! For example:

• They don't have long-term memory.
• They can't use a computer (they still only have very limited tools).

• They still mostly don't think before they speak. When you ask ChatGPT to write an essay, that's like expecting a human to write an essay via their initial stream-of-consciousness.27

27. People are working on this though.

• They can (mostly) only engage in short back-and-forth dialogues, rather than going away for a day or a week, thinking about a problem, researching different approaches, consulting other humans, and then writing you a longer report or pull request.

• They're mostly not personalized to you or your application (just a generic chatbot with a short prompt, rather than having all the relevant background on your company and your work).

Figure 17: Decomposing progress: compute, algorithmic efficiencies, and unhobbling. (Rough illustration.)

The possibilities here are enormous, and we're rapidly picking low-hanging fruit here. This is critical: it's completely wrong to just imagine "GPT-6 ChatGPT." With continued unhobbling
  • 34. situational awareness 34 progress, the improvements will be step-changes compared to GPT-6 + RLHF. By 2027, rather than a chatbot, you’re going to have something that looks more like an agent, like a coworker. From chatbot to agent-coworker What could ambitious unhobbling over the coming years look like? The way I think about it, there are three key ingredients: 1. solving the “onboarding problem” GPT-4 has the raw smarts to do a decent chunk of many people’s jobs, but it’s sort of like a smart new hire that just showed up 5 minutes ago: it doesn’t have any relevant con- text, hasn’t read the company docs or Slack history or had conversations with members of the team, or spent any time understanding the company-internal codebase. A smart new hire isn’t that useful 5 minutes after arriving—but they are quite useful a month in! It seems like it should be possible, for example via very-long-context, to “onboard” models like we would a new human coworker. This alone would be a huge unlock. 2. the test-time compute overhang (reasoning/error correction/system ii for longer-horizon prob- lems) Right now, models can basically only do short tasks: you ask them a question, and they give you an answer. But that’s extremely limiting. Most useful cognitive work hu- mans do is longer horizon—it doesn’t just take 5 minutes, but hours, days, weeks, or months. A scientist that could only think about a difficult problem for 5 minutes couldn’t make any scientific breakthroughs. A software engineer that could only write skeleton code for a single function when asked wouldn’t be very useful— software engineers are given a larger task, and they then go make a plan, understand relevant parts of the codebase or
technical tools, write different modules and test them incrementally, debug errors, search over the space of possible solutions, and eventually submit a large pull request that's the culmination of weeks of work. And so on.

In essence, there is a large test-time compute overhang. Think of each GPT-4 token as a word of internal monologue when you think about a problem. Each GPT-4 token is quite smart, but it can currently only really effectively use on the order of ~hundreds of tokens for chains of thought coherently (effectively as though you could only spend a few minutes of internal monologue/thinking on a problem or project). What if it could use millions of tokens to think about and work on really hard problems or bigger projects?

Number of tokens    Equivalent to me working on something for...
100s                A few minutes            ChatGPT (we are here)
1,000s              Half an hour             +1 OOMs test-time compute
10,000s             Half a workday           +2 OOMs
100,000s            A workweek               +3 OOMs
Millions            Multiple months          +4 OOMs

Table 2: Assuming a human thinking at ~100 tokens/minute and working 40 hours/week, translating "how long a model thinks" in tokens to human-time on a given problem/project.

Even if the "per-token" intelligence were the same, it'd be the difference between a smart person spending a few minutes vs. a few months on a problem. I don't know about you, but there's much, much, much more I am capable of in a few months vs. a few minutes. If we could unlock "being able to think and work on something for months-equivalent, rather than a few-minutes-equivalent" for models, it would unlock an insane jump in capability. There's a huge overhang here, many OOMs worth.

Right now, models can't do this yet. Even with recent advances in long-context, this longer context mostly only works for the consumption of tokens, not the production of tokens—after a while, the model goes off the rails or gets stuck. It's not yet able to go away for a while to work on a
problem or project on its own.28

28. Which makes sense—why would it have learned the skills for longer-horizon reasoning and error correction? There's very little data on the internet in the form of "my complete internal monologue, reasoning, all the relevant steps over the course of a month as I work on a project." Unlocking this capability will require a new kind of training, for it to learn these extra skills. Or as Gwern put it (private correspondence): "'Brain the size of a galaxy, and what do they ask me to do? Predict the misspelled answers on benchmarks!' Marvin the depressed neural network moaned."

But unlocking test-time compute might merely be a matter of relatively small "unhobbling" algorithmic wins. Perhaps a small amount of RL helps a model learn to error correct ("hm, that doesn't look right, let me double check that"), make plans, search over possible solutions, and so on. In a sense, the model already has most of the raw capabilities, it just needs to learn a few extra skills on top to put it all together.

In essence, we just need to teach the model a sort of System II outer loop29 that lets it reason through difficult, long-horizon projects.

29. System I vs. System II is a useful way of thinking about current capabilities of LLMs—including their limitations and dumb mistakes—and what might be possible with RL and unhobbling. Think of it this way: when you are driving, most of the time you are on autopilot (system I, what models mostly do right now). But when you encounter a complex construction zone or novel intersection, you might ask your passenger-seat companion to pause your conversation for a moment while you figure out—actually think about—what's going on and what to do. If you were forced to go about life with only system I (closer to models today), you'd have a lot of trouble. Creating the ability for system II reasoning loops is a central unlock.

If we succeed at teaching this outer loop, instead of a short chatbot answer of a couple paragraphs, imagine a stream of millions of words (coming in more quickly than you can read them) as the model thinks through problems, uses tools, tries different approaches, does research, revises its work, coordinates with others, and completes big projects on its own.

Trading off test-time and train-time compute in other ML domains. In other domains, like AI systems for board games, it's been demonstrated that you can use more test-time compute (also called inference-time compute) to substitute for training compute.

Figure 18: Jones (2021): A smaller model can do as well as a much larger model at the game of Hex if you give it more test-time compute ("more time to think").

In this domain, they find that one can spend ~1.2 OOMs more compute at test-time to get performance equivalent to a model with ~1 OOM more training compute.
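As a minimal sketch of the arithmetic here, the snippet below translates chain-of-thought token budgets into human working time using the ~100 tokens/minute and 40-hour-week assumptions behind Table 2, and applies the Jones (2021) exchange rate quoted above (~1.2 OOMs of test-time compute per ~1 OOM of training compute). The numbers are the rough estimates from the text, and treating the exchange rate as a constant is a simplification.

```python
# Back-of-envelope arithmetic for the test-time compute overhang.
# Assumptions (from the text): a human "thinks" at ~100 tokens/minute and
# works a ~40-hour week; Jones (2021) finds ~1.2 OOMs of extra test-time
# compute buys ~1 OOM of training compute (measured on Hex, not on LLMs).

TOKENS_PER_MINUTE = 100
WORK_HOURS_PER_WEEK = 40

def tokens_to_human_time(n_tokens: int) -> str:
    """Translate a chain-of-thought token budget into human working time."""
    minutes = n_tokens / TOKENS_PER_MINUTE
    hours = minutes / 60
    weeks = hours / WORK_HOURS_PER_WEEK
    if hours < 1:
        return f"~{minutes:.0f} minutes"
    if weeks < 1:
        return f"~{hours:.1f} working hours"
    return f"~{weeks:.1f} workweeks (~{weeks / 4:.1f} months)"

def testtime_to_traintime_ooms(test_time_ooms: float, exchange_rate: float = 1.2) -> float:
    """Jones-style exchange: ~1.2 OOMs at test time is roughly 1 OOM of training compute."""
    return test_time_ooms / exchange_rate

# Table 2's rows ("100s", "1,000s", ...) denote a few hundred, a few thousand,
# etc., so these round budgets land slightly below the table's entries.
for budget in [100, 1_000, 10_000, 100_000, 1_000_000]:
    print(f"{budget:>9,} tokens -> {tokens_to_human_time(budget)}")

print(f"+4 OOMs of test-time compute ~= +{testtime_to_traintime_ooms(4):.1f} OOMs of training compute")
```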
  • 37. situational awareness 37 If a similar relationship held in our case, if we could unlock +4 OOMs of test-time compute, that might be equivalent to +3 OOMs of pretraining compute, i.e. very roughly some- thing like the jump between GPT-3 and GPT-4. (Solving this “unhobbling” would be equivalent to a huge OOM scaleup.) 3. using a computer This is perhaps the most straightforward of the three. Chat- GPT right now is basically like a human that sits in an isolated box that you can text. While early unhobbling im- provements teach models to use individual isolated tools, I expect that with multimodal models we will soon be able to do this in one fell swoop: we will simply enable models to use a computer like a human would. That means joining your Zoom calls, researching things on- line, messaging and emailing people, reading shared docs, using your apps and dev tooling, and so on. (Of course, for models to make the most use of this in longer-horizon loops, this will go hand-in-hand with unlocking test-time compute.) By the end of this, I expect us to get something that looks a lot like a drop-in remote worker. An agent that joins your company, is onboarded like a new human hire, mes- sages you and colleagues on Slack and uses your softwares, makes pull requests, and that, given big projects, can do the model-equivalent of a human going away for weeks to independently complete the project. You’ll probably need somewhat better base models than GPT-4 to unlock this, but possibly not even that much better—a lot of juice is in fixing the clear and basic ways models are still hobbled. (A very early peek at what this might look like is Devin, an early prototype of unlocking the “agency-overhang”/”test- time compute overhang” on models on the path to creating a fully automated software engineer. I don’t know how well Devin works in practice, and this demo is still very
limited compared to what proper chatbot → agent unhobbling would yield, but it's a useful teaser of the sort of thing coming soon.)

By the way, the centrality of unhobbling might lead to a somewhat interesting "sonic boom" effect in terms of commercial applications. Intermediate models between now and the drop-in remote worker will require tons of schlep to change workflows and build infrastructure to integrate and derive economic value from. The drop-in remote worker will be dramatically easier to integrate—just, well, drop them in to automate all the jobs that could be done remotely. It seems plausible that the schlep will take longer than the unhobbling, that is, by the time the drop-in remote worker is able to automate a large number of jobs, intermediate models won't yet have been fully harnessed and integrated—so the jump in economic value generated could be somewhat discontinuous.

The next four years

Putting the numbers together, we should (roughly) expect another GPT-2-to-GPT-4-sized jump in the 4 years following GPT-4, by the end of 2027.

• GPT-2 to GPT-4 was roughly a 4.5–6 OOM base effective compute scaleup (physical compute and algorithmic efficiencies), plus major "unhobbling" gains (from base model to chatbot).

• In the subsequent 4 years, we should expect 3–6 OOMs of base effective compute scaleup (physical compute and algorithmic efficiencies)—with perhaps a best guess of ~5 OOMs—plus step-changes in utility and applications unlocked by "unhobbling" (from chatbot to agent/drop-in remote worker).

To put this in perspective, suppose GPT-4 training took 3 months. In 2027, a leading AI lab will be able to train a GPT-4-level model in a minute.30

30. On the best guess assumptions on physical compute and algorithmic efficiency scaleups described above, and simplifying parallelism considerations (in reality, it might look more like "1440 (60*24) GPT-4-level models in a day" or similar).
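As a rough sanity check on that claim, here is a minimal sketch using only the text's assumptions (a 3-month GPT-4 training run and a ~5 OOM best-guess effective-compute scaleup), with parallelism ignored as footnote 30 notes:

```python
# If effective compute scales up by ~5 OOMs (the best-guess estimate above),
# a job that took 3 months of GPT-4-era effective compute shrinks by 10^5.
# Parallelism is ignored here, per footnote 30: in practice it may look more
# like "1440 GPT-4-level models per day" than one model per minute.

GPT4_TRAINING_MONTHS = 3
EFFECTIVE_COMPUTE_OOMS = 5  # best-guess scaleup in the 4 years after GPT-4

minutes_original = GPT4_TRAINING_MONTHS * 30 * 24 * 60            # ~129,600 minutes
minutes_in_2027 = minutes_original / 10**EFFECTIVE_COMPUTE_OOMS   # ~1.3 minutes

models_per_day_alternative = 24 * 60  # the footnote's "1440 (60*24) models in a day" framing

print(f"GPT-4-level training time in 2027: ~{minutes_in_2027:.1f} minutes")
print(f"Or, equivalently: ~{models_per_day_alternative} GPT-4-level models per day")
```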
Figure 19: Summary of the estimates on drivers of progress in the four years preceding GPT-4, and what we should expect in the four years following GPT-4.
The OOM effective compute scaleup will be dramatic. Where will that take us?

Figure 20: Summary of counting the OOMs.

GPT-2 to GPT-4 took us from ~preschooler to ~smart high-schooler; from barely being able to output a few cohesive sentences to acing high-school exams and being a useful coding assistant. That was an insane jump. If this is the intelligence gap we'll cover once more, where will that take us?31 We should not be surprised if that takes us very, very far. Likely, it will take us to models that can outperform PhDs and the best experts in a field.

31. Of course, any benchmark we have today will be saturated. But that's not saying much; it's mostly a reflection on the difficulty of making hard-enough benchmarks.

(One neat way to think about this is that the current trend of AI progress is proceeding at roughly 3x the pace of child development. Your 3x-speed-child just graduated high school; it'll be taking your job before you know it!)

Again, critically, don't just imagine an incredibly smart ChatGPT: unhobbling gains should mean that this looks more like a drop-in remote worker, an incredibly smart agent that can reason and plan and error-correct and knows everything about you and your company and can work on a problem independently
for weeks.

We are on course for AGI by 2027. These AI systems will be able to automate basically all cognitive jobs (think: all jobs that could be done remotely).

To be clear—the error bars are large. Progress could stall as we run out of data, if the algorithmic breakthroughs necessary to crash through the data wall prove harder than expected. Maybe unhobbling doesn't go as far, and we are stuck with merely expert chatbots, rather than expert coworkers. Perhaps the decade-long trendlines break, or scaling deep learning hits a wall for real this time. (Or an algorithmic breakthrough, even simple unhobbling that unleashes the test-time compute overhang, could be a paradigm-shift, accelerating things further and leading to AGI even earlier.)

In any case, we are racing through the OOMs, and it requires no esoteric beliefs, merely trend extrapolation of straight lines, to take the possibility of AGI—true AGI—by 2027 extremely seriously.

It seems like many are in the game of downward-defining AGI these days, as just a really good chatbot or whatever. What I mean is an AI system that could fully automate my or my friends' job, that could fully do the work of an AI researcher or engineer. Perhaps some areas, like robotics, might take longer to figure out by default. And the societal rollout, e.g. in medical or legal professions, could easily be slowed by societal choices or regulation. But once models can automate AI research itself, that's enough—enough to kick off intense feedback loops—and we could very quickly make further progress, the automated AI engineers themselves solving all the remaining bottlenecks to fully automating everything. In particular, millions of automated researchers could very plausibly compress a decade of further algorithmic progress into a year or less. AGI will merely be a small taste of the superintelligence soon to follow. (More on that in the next chapter.)

In any case, do not expect the vertiginous pace of progress to abate. The trendlines look innocent, but their implications are
  • 42. situational awareness 42 intense. As with every generation before them, every new gen- eration of models will dumbfound most onlookers; they’ll be incredulous when, very soon, models solve incredibly difficult science problems that would take PhDs days, when they’re whizzing around your computer doing your job, when they’re writing codebases with millions of lines of code from scratch, when every year or two the economic value generated by these models 10xs. Forget scifi, count the OOMs: it’s what we should expect. AGI is no longer a distant fantasy. Scaling up simple deep learning techniques has just worked, the models just want to learn, and we’re about to do another 100,000x+ by the end of 2027. It won’t be long before they’re smarter than us. Figure 21: GPT-4 is just the beginning— where will we be four years later? Do not make the mistake of underestimat- ing the rapid pace of deep learning progress (as illustrated by progress in GANs).
Addendum. Racing through the OOMs: It's this decade or bust

I used to be more skeptical of short timelines to AGI. One reason is that it seemed unreasonable to privilege this decade, concentrating so much AGI-probability-mass on it (it seemed like a classic fallacy to think "oh we're so special"). I thought we should be uncertain about what it takes to get AGI, which should lead to a much more "smeared-out" probability distribution over when we might get AGI.

However, I've changed my mind: critically, our uncertainty over what it takes to get AGI should be over OOMs (of effective compute), rather than over years.

We're racing through the OOMs this decade. Even at its bygone heyday, Moore's law was only 1–1.5 OOMs/decade. I estimate that we will do ~5 OOMs in 4 years, and over ~10 this decade overall.

Figure 22: Rough projections on effective compute scaleup. We've been racing through the OOMs this decade; after the early 2030s, we will face a slow slog.

In essence, we're in the middle of a huge scaleup reaping one-time gains this decade, and progress through the OOMs will be multiples slower thereafter. If this scaleup doesn't get us to AGI in the next 5-10 years, it might be a
  • 44. situational awareness 44 long way out. • Spending scaleup: Spending a million dollars on a model used to be outrageous; by the end of the decade, we will likely have $100B or $1T clusters. Going much higher than that will be hard; that’s already basically the feasi- ble limit (both in terms of what big business can afford, and even just as a fraction of GDP). Thereafter all we have is glacial 2%/year trend real GDP growth to in- crease this. • Hardware gains: AI hardware has been improving much more quickly than Moore’s law. That’s because we’ve been specializing chips for AI workloads. For exam- ple, we’ve gone from CPUs to GPUs; adapted chips for Transformers; and we’ve gone down to much lower pre- cision number formats, from fp64/fp32 for traditional supercomputing to fp8 on H100s. These are large gains, but by the end of the decade we’ll likely have totally- specialized AI-specific chips, without much further beyond-Moore’s law gains possible. • Algorithmic progress: In the coming decade, AI labs will invest tens of billions in algorithmic R&D, and all the smartest people in the world will be working on this; from tiny efficiencies to new paradigms, we’ll be picking lots of the low-hanging fruit. We probably won’t reach any sort of hard limit (though “unhobblings” are likely finite), but at the very least the pace of improvements should slow down, as the rapid growth (in $ and human capital investments) necessarily slows down (e.g., most of the smart STEM talent will already be working on AI). (That said, this is the most uncertain to predict, and the source of most of the uncertainty on the OOMs in the 2030s on the plot above.) Put together, this means we are racing through many more OOMs in the next decade than we might in multi- ple decades thereafter. Maybe it’s enough—and we get AGI soon—or we might be in for a long, slow slog. You and I can reasonably disagree on the median time to AGI,
depending on how hard we think achieving AGI will be—but given how we're racing through the OOMs right now, certainly your modal AGI year should be sometime later this decade or so.

Figure 23: Matthew Barnett has a nice related visualization of this, considering just compute and biological bounds.
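To make the pacing claim concrete, here is a toy comparison using only the numbers above (~5 OOMs in 4 years now, versus ~1-1.5 OOMs per decade at a Moore's-law pace); the rates are the text's rough estimates, and nothing depends on the precise values:

```python
# How long does a ~10 OOM effective-compute scaleup take at different paces?
def years_to_cover(ooms: float, ooms_per_year: float) -> float:
    return ooms / ooms_per_year

CURRENT_PACE = 5 / 4          # ~5 OOMs in 4 years (this decade's one-time scaleup)
MOORES_LAW_PACE = 1.25 / 10   # ~1-1.5 OOMs per decade, even at its bygone heyday

print(f"~10 OOMs at the current pace:    ~{years_to_cover(10, CURRENT_PACE):.0f} years")
print(f"~10 OOMs at a Moore's-law pace:  ~{years_to_cover(10, MOORES_LAW_PACE):.0f} years")
```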
  • 46. II. From AGI to Superintelligence: the Intelligence Ex- plosion AI progress won’t stop at human-level. Hundreds of millions of AGIs could automate AI research, compressing a decade of algorithmic progress (5+ OOMs) into 1 year. We would rapidly go from human-level to vastly superhuman AI sys- tems. The power—and the peril—of superintelligence would be dramatic. Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion,’ and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make. i. j. good (1965) The Bomb and The Super In the common imagination, the Cold War’s terrors principally trace back to Los Alamos, with the invention of the atomic bomb. But The Bomb, alone, is perhaps overrated. Going from The Bomb to The Super—hydrogen bombs—was arguably just as important. In the Tokyo air raids, hundreds of bombers dropped thousands of tons of conventional bombs on the city. Later that year, Little Boy, dropped on Hiroshima, unleashed similar destructive power in a single device. But just 7 years later, Teller’s hydrogen bomb multiplied yields a thousand-fold once again—a single bomb with more explosive power than all of the bombs dropped in the entirety of WWII combined.
The Bomb was a more efficient bombing campaign. The Super was a country-annihilating device.32

32. And much of the Cold War's perversities (cf Daniel Ellsberg's book) stemmed from merely replacing A-bombs with H-bombs, without adjusting nuclear policy and war plans to the massive capability increase.

So it will be with AGI and Superintelligence.

AI progress won't stop at human-level. After initially learning from the best human games, AlphaGo started playing against itself—and it quickly became superhuman, playing extremely creative and complex moves that a human would never have come up with.

We discussed the path to AGI in the previous piece. Once we get AGI, we'll turn the crank one more time—or two or three more times—and AI systems will become superhuman—vastly superhuman. They will become qualitatively smarter than you or I, much smarter, perhaps similar to how you or I are qualitatively smarter than an elementary schooler.

The jump to superintelligence would be wild enough at the current rapid but continuous rate of AI progress (if we could make the jump to AGI in 4 years from GPT-4, what might another 4 or 8 years after that bring?). But it could be much faster than that, if AGI automates AI research itself.

Once we get AGI, we won't just have one AGI. I'll walk through the numbers later, but: given inference GPU fleets by then, we'll likely be able to run many millions of them (perhaps 100 million human-equivalents, and soon after at 10x+ human speed). Even if they can't yet walk around the office or make coffee, they will be able to do ML research on a computer. Rather than a few hundred researchers and engineers at a leading AI lab, we'd have more than 100,000x that—furiously working on algorithmic breakthroughs, day and night. Yes, recursive self-improvement, but no sci-fi required; they would need only to accelerate the existing trendlines of algorithmic progress (currently at ~0.5 OOMs/year).

Automated AI research could probably compress a human-decade of algorithmic progress into less than a year (and that seems conservative). That'd be 5+ OOMs, another GPT-2-to-
GPT-4-sized jump, on top of AGI—a qualitative jump like that from a preschooler to a smart high schooler, on top of AI systems already as smart as expert AI researchers/engineers.

Figure 24: Automated AI research could accelerate algorithmic progress, leading to 5+ OOMs of effective compute gains in a year. The AI systems we'd have by the end of an intelligence explosion would be vastly smarter than humans.

There are several plausible bottlenecks—including limited compute for experiments, complementarities with humans, and algorithmic progress becoming harder—which I'll address, but none seem sufficient to definitively slow things down.

Before we know it, we would have superintelligence on our hands—AI systems vastly smarter than humans, capable of novel, creative, complicated behavior we couldn't even begin to understand—perhaps even a small civilization of billions of them. Their power would be vast, too. Applying
superintelligence to R&D in other fields, explosive progress would broaden from just ML research; soon they'd solve robotics, make dramatic leaps across other fields of science and technology within years, and an industrial explosion would follow. Superintelligence would likely provide a decisive military advantage, and unfold untold powers of destruction. We will be faced with one of the most intense and volatile moments of human history.

Automating AI research

We don't need to automate everything—just AI research. A common objection to transformative impacts of AGI is that it will be hard for AI to do everything. Look at robotics, for instance, doubters say; that will be a gnarly problem, even if AI is cognitively at the levels of PhDs. Or take automating biology R&D, which might require lots of physical lab-work and human experiments.

But we don't need robotics—we don't need many things—for AI to automate AI research. The jobs of AI researchers and engineers at leading labs can be done fully virtually and don't run into real-world bottlenecks in the same way (though it will still be limited by compute, which I'll address later). And the job of an AI researcher is fairly straightforward, in the grand scheme of things: read ML literature and come up with new questions or ideas, implement experiments to test those ideas, interpret the results, and repeat. This all seems squarely in the domain where simple extrapolations of current AI capabilities could easily take us to or beyond the levels of the best humans by the end of 2027.33

33. The job of an AI researcher is also a job that AI researchers at AI labs just, well, know really well—so it'll be particularly intuitive to them to optimize models to be good at that job. And there will be huge incentives to do so to help them accelerate their research and their labs' competitive edge.

It's worth emphasizing just how straightforward and hacky some of the biggest machine learning breakthroughs of the last decade have been: "oh, just add some normalization" (LayerNorm/BatchNorm) or "do f(x)+x instead of f(x)" (residual connections) or "fix an implementation bug" (Kaplan → Chinchilla scaling laws). AI research can be automated. And automating AI research is all it takes to kick off extraordinary feedback loops.34

34. This suggests an important point in terms of the sequencing of risks from AI, by the way. A common AI threat model people point to is AI systems developing novel bioweapons, and that posing catastrophic risk. But if AI research is more straightforward to automate than biology R&D, we might get an intelligence explosion before we get extreme AI biothreats. This matters, for example, with regard to whether we should expect "bio warning shots" in time before things get crazy on AI.
We'd be able to run millions of copies (and soon at 10x+ human speed) of the automated AI researchers.

Even by 2027, we should expect GPU fleets in the 10s of millions. Training clusters alone should be approaching ~3 OOMs larger, already putting us at 10 million+ A100-equivalents. Inference fleets should be much larger still. (More on all this in IIIa. Racing to the Trillion Dollar Cluster.)

That would let us run many millions of copies of our automated AI researchers, perhaps 100 million human-researcher-equivalents, running day and night. There are some assumptions that flow into the exact numbers, including that humans "think" at 100 tokens/minute (just a rough order of magnitude estimate, e.g. consider your internal monologue) and extrapolating historical trends and Chinchilla scaling laws on per-token inference costs for frontier models remaining in the same ballpark.35

35. As noted earlier, the GPT-4 API costs less today than GPT-3 when it was released—this suggests that the trend of inference efficiency wins is fast enough to keep inference costs roughly constant even as models get much more powerful. Similarly, there have been huge inference cost wins in just the year since GPT-4 was released; for example, the current version of Gemini 1.5 Pro outperforms the original GPT-4, while being roughly 10x cheaper. We can also ground this somewhat more by considering Chinchilla scaling laws. On Chinchilla scaling laws, model size—and thus inference costs—grow with the square root of training cost, i.e. half the OOMs of the OOM scaleup of effective compute. However, in the previous piece, I suggested that algorithmic efficiency was advancing at roughly the same pace as compute scaleup, i.e. it made up roughly half of the OOMs of effective compute scaleup. If these algorithmic wins also translate into inference efficiency, that means that the algorithmic efficiencies would compensate for the naive increase in inference cost. In practice, training compute efficiencies often, but not always, translate into inference efficiency wins. However, there are also separately many inference efficiency wins that are not training efficiency wins. So, at least in terms of the rough ballpark, assuming the $/token of frontier models stays roughly similar doesn't seem crazy. (Of course, they'll use more tokens, i.e. more test-time compute. But that's already part of the calculation here, by pricing human-equivalents as 100 tokens/minute.)

We'd also want to reserve some of the GPUs for running experiments and training new models. Full calculation in a footnote.36

36. GPT4T is about $0.03/1K tokens. We supposed we would have 10s of millions of A100-equivalents, which cost roughly $1 per GPU-hour. If we used the API costs to translate GPUs into tokens generated, that implies 10s of millions of GPUs * $1/GPU-hour * 33K tokens/$ = ~one trillion tokens/hour. Suppose a human does 100 tokens/min of thinking; that means a human-equivalent is 6,000 tokens/hour. One trillion tokens/hour divided by 6,000 tokens/human-hour = ~200 million human-equivalents—i.e. as if running 200 million human researchers, day and night. (And even if we reserve half the GPUs for experiment compute, we get 100 million human-researcher-equivalents.)
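The calculation in footnote 36, written out as a minimal sketch; the GPU count, the ~$1/GPU-hour A100-equivalent cost, the ~$0.03/1K-token GPT-4-Turbo-era price, and the 100 tokens/minute figure are all the rough assumptions stated above, so the outputs are illustrative only:

```python
# Reproducing the back-of-envelope calculation in footnote 36.
GPUS = 30_000_000                  # "10s of millions" of A100-equivalents by 2027 (assumed here: 30M)
DOLLARS_PER_GPU_HOUR = 1.0         # rough A100-equivalent cost
TOKENS_PER_DOLLAR = 1_000 / 0.03   # ~33K tokens/$ at GPT-4-Turbo-like pricing

tokens_per_hour = GPUS * DOLLARS_PER_GPU_HOUR * TOKENS_PER_DOLLAR  # ~1e12 tokens/hour

HUMAN_TOKENS_PER_HOUR = 100 * 60   # a human "thinking" at 100 tokens/minute

# ~1.7e8 with these inputs; the footnote rounds this to ~200 million.
human_equivalents = tokens_per_hour / HUMAN_TOKENS_PER_HOUR

print(f"~{tokens_per_hour:.1e} tokens/hour")
print(f"~{human_equivalents / 1e6:.0f} million human-researcher-equivalents, day and night")
print(f"~{human_equivalents / 2e6:.0f} million if half the GPUs are reserved for experiments")
```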
Another way of thinking about it is that given inference fleets in 2027, we should be able to generate an entire internet's worth of tokens, every single day.37 In any case, the exact numbers don't matter that much, beyond a simple plausibility demonstration.

37. The previous footnote estimated ~1T tokens/hour, i.e. 24T tokens a day. In the previous piece, I noted that a public deduplicated CommonCrawl had around 30T tokens.
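And footnote 37's framing, expressed per day (the ~30T-token figure for a public deduplicated CommonCrawl is the one cited in the previous piece; the hourly rate is the rough estimate above):

```python
# ~1T tokens/hour of inference capacity, expressed per day vs. the public internet.
TOKENS_PER_HOUR = 1e12        # from the estimate above
COMMON_CRAWL_TOKENS = 30e12   # ~30T tokens in a public deduplicated CommonCrawl

tokens_per_day = TOKENS_PER_HOUR * 24
print(f"~{tokens_per_day / 1e12:.0f}T tokens per day, "
      f"~{tokens_per_day / COMMON_CRAWL_TOKENS:.1f}x a deduplicated CommonCrawl")
```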
Moreover, our automated AI researchers may soon be able to run at much faster than human-speed:

• By taking some inference penalties, we can trade off running fewer copies in exchange for running them at faster serial speed. (For example, we could go from ~5x human speed to ~100x human speed by "only" running 1 million copies of the automated researchers.38)

38. Jacob Steinhardt estimates that k^3 parallel copies of a model can be replaced with a single model that is k^2 faster, given some math on inference tradeoffs with a tiling scheme (that theoretically works even for k of 100 or more). Suppose initial speeds were already ~5x human speed (based on, say, GPT-4 speed on release). Then, by taking this inference penalty (with k = ~5), we'd be able to run ~1 million automated AI researchers at ~100x human speed. (A short sketch of this arithmetic follows at the end of this passage.)

• More importantly, the first algorithmic innovation the automated AI researchers work on is getting a 10x or 100x speedup. Gemini 1.5 Flash is ~10x faster than the originally-released GPT-4,39 merely a year later, while providing similar performance to the originally-released GPT-4 on reasoning benchmarks. If that's the algorithmic speedup a few hundred human researchers can find in a year, the automated AI researchers will be able to find similar wins very quickly.

39. This source benchmarks throughput of Flash at ~6x GPT-4 Turbo, and GPT-4 Turbo was faster than original GPT-4. Latency is probably also roughly 10x faster.

That is: expect 100 million automated researchers each working at 100x human speed not long after we begin to be able to automate AI research. They'll each be able to do a year's worth of work in a few days.

The increase in research effort—compared to a few hundred puny human researchers at a leading AI lab today, working at a puny 1x human speed—will be extraordinary.

This could easily dramatically accelerate existing trends of algorithmic progress, compressing a decade of advances into a year. We need not postulate anything totally novel for automated AI research to intensely speed up AI progress. Walking through the numbers in the previous piece, we saw that algorithmic progress has been a central driver of deep learning progress in the last decade; we noted a trendline of ~0.5 OOMs/year on algorithmic efficiencies alone, with additional large algorithmic gains from unhobbling on top. (I think the import of algorithmic progress has been underrated by many, and properly appreciating it is important for appreciating the possibility of an intelligence explosion.)
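A minimal sketch of the copies-versus-serial-speed trade-off from footnote 38; the k^3-copies-for-a-k^2-speedup rule is Steinhardt's rough estimate, and the ~5x baseline serial speed and ~100-million-copy fleet are the text's assumptions, so the outputs are illustrative only:

```python
# Trade parallel copies for serial speed: k^3 copies ~ one copy running k^2 faster
# (Jacob Steinhardt's inference-tiling estimate, per footnote 38).
BASELINE_COPIES = 100_000_000   # ~100 million human-researcher-equivalents (from above)
BASELINE_SPEED = 5.0            # ~5x human speed, roughly GPT-4-at-release serial speed

def trade_copies_for_speed(copies: float, speed: float, k: float) -> tuple[float, float]:
    """Give up a factor of k^3 in copies to gain a factor of k^2 in serial speed."""
    return copies / k**3, speed * k**2

copies, speed = trade_copies_for_speed(BASELINE_COPIES, BASELINE_SPEED, k=5)
print(f"~{copies / 1e6:.1f} million copies at ~{speed:.0f}x human speed")
# -> ~0.8 million copies at ~125x human speed (the text rounds to ~1 million at ~100x)
```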
Could our millions of automated AI researchers (soon working at 10x or 100x human speed) compress the algorithmic progress human researchers would have found in a decade into a year instead? That would be 5+ OOMs in a year.

Don't just imagine 100 million junior software engineer interns here (we'll get those earlier, in the next couple years!). Real automated AI researchers will be very smart—and in addition to their raw quantitative advantage, automated AI researchers will have other enormous advantages over human researchers:

• They'll be able to read every single ML paper ever written, have been able to deeply think about every single previous experiment ever run at the lab, learn in parallel from each of their copies, and rapidly accumulate the equivalent of millennia of experience. They'll be able to develop far deeper intuitions about ML than any human.

• They'll be easily able to write millions of lines of complex code, keep the entire codebase in context, and spend human-decades (or more) checking and rechecking every line of code for bugs and optimizations. They'll be superbly competent at all parts of the job.

• You won't have to individually train up each automated AI researcher (indeed, training and onboarding 100 million new human hires would be difficult). Instead, you can just teach and onboard one of them—and then make replicas. (And you won't have to worry about politicking, cultural acclimation, and so on, and they'll work with peak energy and focus day and night.)

• Vast numbers of automated AI researchers will be able to share context (perhaps even accessing each others' latent space and so on), enabling much more efficient collaboration and coordination compared to human researchers.

• And of course, however smart our initial automated AI researchers would be, we'd soon be able to make further OOM-jumps, producing even smarter models, even more capable at automated AI research.

Imagine an automated Alec Radford—imagine 100 million
automated Alec Radfords.40 I think just about every researcher at OpenAI would agree that if they had 10 Alec Radfords, let alone 100 or 1,000 or 1 million running at 10x or 100x human speed, they could very quickly solve very many of their problems.

40. Alec Radford is an incredibly gifted and prolific researcher/engineer at OpenAI, behind many of the most important advances, though he flies under the radar some.

Even with various other bottlenecks (more in a moment), compressing a decade of algorithmic progress into a year as a result seems very plausible. (A 10x acceleration from a million times more research effort, which seems conservative if anything.) That would be 5+ OOMs right there.

5 OOMs of algorithmic wins would be a similar scaleup to what produced the GPT-2-to-GPT-4 jump, a capability jump from ~a preschooler to ~a smart high schooler. Imagine such a qualitative jump on top of AGI, on top of Alec Radford.

It's strikingly plausible we'd go from AGI to superintelligence very quickly, perhaps in 1 year.

Possible bottlenecks

While this basic story is surprisingly strong—and is supported by thorough economic modeling work—there are some real and plausible bottlenecks that will probably slow down an automated-AI-research intelligence explosion. I'll give a summary here, and then discuss these in more detail in the optional sections below for those interested:

• Limited compute: AI research doesn't just take good ideas, thinking, or math—but running experiments to get empirical signal on your ideas. A million times more research effort via automated research labor won't mean a million times faster progress, because compute will still be limited—and limited compute for experiments will be the bottleneck. Still, even if this won't be a 1,000,000x speedup, I find it hard to imagine that the automated AI researchers couldn't use the compute at least 10x more effectively: they'll be able to get incredible ML intuition (having internalized the whole ML literature and every previous experiment ever run!) and
centuries-equivalent of thinking-time to figure out exactly the right experiment to run, configure it optimally, and get maximum value of information; they'll be able to spend centuries-equivalent of engineer-time before running even tiny experiments to avoid bugs and get them right on the first try; they can make tradeoffs to economize on compute by focusing on the biggest wins; and they'll be able to try tons of smaller-scale experiments (and given effective compute scaleups by then, "smaller-scale" means being able to train 100,000 GPT-4-level models in a year to try architecture breakthroughs). Some human researchers and engineers are able to produce 10x the progress of others, even with the same amount of compute—and this should apply even more so to automated AI researchers. I do think this is the most important bottleneck, and I address it in more depth below.

• Complementarities/long tail: A classic lesson from economics (cf Baumol's growth disease) is that if you can automate, say, 70% of something, you get some gains but quickly the remaining 30% become your bottleneck. For anything that falls short of full automation—say, really good copilots—human AI researchers would remain a major bottleneck, making the overall increase in the rate of algorithmic progress relatively small. Moreover, there's likely some long tail of capabilities required for automating AI research—the last 10% of the job of an AI researcher might be particularly hard to automate. This could soften takeoff some, though my best guess is that this only delays things by a couple years. Perhaps 2026/27 models are the proto-automated-researcher; it takes another year or two for some final unhobbling, a somewhat better model, inference speedups, and working out kinks to get to full automation, and finally by 2028 we get the 10x acceleration (and superintelligence by the end of the decade).

• Inherent limits to algorithmic progress: Maybe another 5 OOMs of algorithmic efficiency will be fundamentally impossible? I doubt it. While there will definitely be upper limits,41 if we got 5 OOMs in the last decade, we should probably expect at least another decade's-worth of progress
to be possible. More directly, current architectures and training algorithms are still very rudimentary, and it seems that much more efficient schemes should be possible. Biological reference classes also support dramatically more efficient algorithms being plausible.

41. 25 OOMs of algorithmic progress on top of GPT-4, for example, are clearly impossible: that would imply it would be possible to train a GPT-4-level model with just a handful of FLOP.

• Ideas get harder to find, so the automated AI researchers will merely sustain, rather than accelerate, the current rate of progress: One objection is that although automated research would increase effective research effort a lot, ideas also get harder to find. That is, while it takes only a few hundred top researchers at a lab to sustain 0.5 OOMs/year today, as we exhaust the low-hanging fruit, it will take more and more effort to sustain that progress—and so the 100 million automated researchers will be merely what's necessary to sustain progress. I think this basic model is correct, but the empirics don't add up: the magnitude of the increase in research effort—a million-fold—is way, way larger than the historical trends of the growth in research effort that's been necessary to sustain progress. In econ modeling terms, it's a bizarre "knife-edge assumption" to assume that the increase in research effort from automation will be just enough to keep progress constant.

• Ideas get harder to find and there are diminishing returns, so the intelligence explosion will quickly fizzle: Related to the above objection, even if the automated AI researchers lead to an initial burst of progress, whether rapid progress can be sustained depends on the shape of the diminishing returns curve to algorithmic progress. Again, my best read of the empirical evidence is that the exponents shake out in favor of explosive/accelerating progress. In any case, the sheer size of the one-time boost—from 100s to 100s of millions of AI researchers—probably overcomes diminishing returns here for at least a good number of OOMs of algorithmic progress, even though it of course can't be indefinitely self-sustaining.

Overall, these factors may slow things down somewhat: the most extreme versions of intelligence explosion (say, overnight) seem implausible. And they may result in a somewhat longer
  • 56. situational awareness 56 runup (perhaps we need to wait an extra year or two from more sluggish, proto-automated researchers to the true auto- mated Alec Radfords, before things kick off in full force). But they certainly don’t rule out a very rapid intelligence explosion. A year—or at most just a few years, but perhaps even just a few months—in which we go from fully-automated AI researchers to vastly superhuman AI systems should be our mainline ex- pectation. If you’d rather skip the in-depth discussions on the various bottle- necks below, click here to skip to the next section. Limited compute for experiments (optional, in more depth) The production function for algorithmic progress includes two complementary factors of production: research effort and experiment compute. The millions of automated AI researchers won’t have any more compute to run their experiments on than human AI researchers; perhaps they’ll just be sitting around waiting for their jobs to finish. This is probably the most important bottleneck to the intelligence explosion. Ultimately this is a quantitative question—just how much of a bottleneck is it? On balance, I find it hard to believe that the 100 million Alec Radfords couldn’t increase the marginal product of experiment com- pute by at least 10x (and thus, would still accelerate the pace of progress by 10x): • There’s a lot you can do with smaller amounts of compute. The way most AI research works is that you test things out at small scale—and then extrapolate via scaling laws. (Many key historical breakthroughs required only a very small amount of compute, e.g. the original Transformer was trained on just 8 GPUs for a few days.) And note that with ~5 OOMs of baseline scaleup in the next four years, “small scale” will mean GPT-4 scale—the auto- mated AI researchers will be able to run 100,000 GPT-4- level experiments on their training cluster in a year, and tens of millions of GPT-3-level experiments. (That’s a lot
  • 57. situational awareness 57 of potential-breakthrough new architectures they’ll be able to test!) – A lot of the compute goes into larger-scale validation of the final pretraining run—making sure you are get- ting a high-enough degree of confidence on marginal efficiency wins for your annual headline product—but if you’re racing through the OOMs in the intelligence explosion, you could economize and just focus on the really big wins. – As discussed in the previous piece, there are often enormous gains to be had from relatively low-compute “unhobbling” of models. These don’t require big pre- training runs. It’s highly plausible that that intelli- gence explosion starts off automated AI research e.g. discovering a way to do RL on top that gives us a cou- ple OOMs via unhobbling wins (and then we’re off to the races). – As the automated AI researchers find efficiencies, that’ll let them run more experiments. Recall the near-1000x cheaper inference in two years for equivalent-MATH performance, and the 10x general inference gains in the last year, discussed in the previous piece, from mere-human algorithmic progress. The first thing the automated AI researchers will do is quickly find similar gains, and in turn, that’ll let them run 100x more experiments on e.g. new RL approaches. Or they’ll be able to quickly make smaller models with similar performance in relevant domains (cf previous discussion of Gemini Flash, near-100x cheaper than GPT-4), which in turn will let them run many more experiments with these smaller models (again, imag- ine using these to try different RL schemes). There are probably other overhangs too, e.g. the automated AI researchers might be able to quickly develop much better distributed training schemes to utilize all the inference GPUs (probably at least 10x more compute right there). More generally, every OOM of training efficiency gains they find will give them an OOM
more of effective compute to run experiments on.

• The automated AI researchers could be way more efficient. It's hard to overstate how many fewer experiments you would have to run if you just got it right on the first try—no gnarly bugs, being more selective about exactly what you are running, and so on. Imagine 1000 automated AI researchers spending a month-equivalent checking your code and getting the exact experiment right before you press go. I've asked some AI lab colleagues about this and they agreed: you should pretty easily be able to save 3x-10x of compute on most projects merely if you could avoid frivolous bugs, get things right on the first try, and only run high value-of-information experiments.

• The automated AI researchers could have way better intuitions.

– Recently, I was speaking to an intern at a frontier lab; they said that their dominant experience over the past few months was suggesting many experiments they wanted to run, and their supervisor (a senior researcher) saying they could already predict the result beforehand so there was no need. The senior researcher's years of random experiments messing around with models had honed their intuitions about what ideas would work—or not. Similarly, it seems like our AI systems could easily get superhuman intuitions about ML experiments—they will have read the entire machine learning literature, be able to learn from every other experiment result and deeply think about it, they could easily be trained to predict the outcome of millions of ML experiments, and so on. And maybe one of the first things they do is build up a strong basic science of "predicting if this large scale experiment will be successful just after seeing the first 1% of training, or just after seeing the smaller scale version of this experiment", and so on.

– Moreover, beyond really good intuitions about re-