10 interesting stories served every morning and every evening.

Changing How We Develop Ladybird - Ladybird

ladybird.org

Today we’re chang­ing how code en­ters the Ladybird pro­ject.

We will no longer ac­cept pub­lic pull re­quests. From now on, code changes to the Ladybird code­base will only be in­tro­duced by pro­ject main­tain­ers.

Ladybird is mov­ing into a new phase. As we work to­ward our first al­pha re­lease, the pro­ject needs a tighter de­vel­op­ment process, a clearer se­cu­rity model, and a smaller set of peo­ple re­spon­si­ble for the code that en­ters the browser.

This is not a change we make lightly. Many valu­able con­tri­bu­tions have come from out­side the main­tainer group over the years, and we are grate­ful for them. Many of us also came up through open source by send­ing patches to pro­jects we cared about.

For decades, code con­tri­bu­tions have been how open source pro­jects learned who to trust. People would show up, do the work, take re­spon­si­bil­ity for their changes, and stick around. Over time, trust emerged from the work it­self.

AI tools have changed the eco­nom­ics of this very quickly. We use them our­selves every day, but a pull re­quest no longer tells us as much as it used to about the per­son sub­mit­ting it. A sub­stan­tial patch used to im­ply sub­stan­tial ef­fort, and that ef­fort was a rea­son­able proxy for good faith. That as­sump­tion no longer holds.

For a browser, this mat­ters. A browser runs un­trusted in­put from the en­tire in­ter­net on the user’s ma­chine, and one well-dis­guised vul­ner­a­bil­ity is all an at­tacker needs. We have al­ready seen pa­tient, well-re­sourced cam­paigns in open source to earn main­tainer trust and abuse it. What has changed is how much faster and cheaper it has be­come to pro­duce work that looks like a se­ri­ous con­tri­bu­tion.

At the same time, every change that en­ters Ladybird be­comes our re­spon­si­bil­ity. It has to fit the ar­chi­tec­ture, sur­vive fu­ture refac­tor­ing, in­ter­act cor­rectly with the rest of the browser, and be un­der­stood by the peo­ple main­tain­ing it.

Whether code was typed by hand is be­side the point. What mat­ters is who is re­spon­si­ble for it once it en­ters the browser. Ladybird is be­com­ing a browser for real users. The peo­ple in­tro­duc­ing changes to it must be the peo­ple who de­cide those changes be­long in the pro­ject, and who will an­swer for the con­se­quences.

As part of this change, we will close all cur­rently open pub­lic pull re­quests. We are grate­ful for the work peo­ple put into them, but keep­ing the ex­ist­ing queue open would keep that con­tri­bu­tion path open in prac­tice. There is no per­fect time to make this change, so we are mak­ing it now. Going for­ward, pull re­quests will only be avail­able to pro­ject main­tain­ers.

There will not be a sep­a­rate process for sub­mit­ting patches by other means. We do not want to cre­ate a shadow con­tri­bu­tion sys­tem through is­sues, com­ments, email, or forks. External code can of course ex­ist un­der the terms of the li­cense, but we will not treat forks or patch dumps as a re­view queue for up­stream Ladybird.

Ladybird re­mains open source. The source code will con­tinue to be pub­licly avail­able un­der an open source li­cense. Outside in­volve­ment still mat­ters: clear bug re­ports, re­duc­tions, web­site test­ing, stan­dards dis­cus­sion, de­sign dis­cus­sion, se­cu­rity re­ports, and tech­ni­cal feed­back all help move the pro­ject for­ward.

This is the right change for Ladybird now. We are prepar­ing to ship a browser to real users, and our de­vel­op­ment process has to match that re­spon­si­bil­ity.

When AI builds itself

www.anthropic.com

For most of AI's history, humans drove every step in its development cycle. But at Anthropic, we are delegating a growing share of AI development to AI systems themselves, which is speeding up our work.

Taken far enough, and given enough com­pute, that trend points to an AI sys­tem ca­pa­ble of fully au­tonomously de­sign­ing and de­vel­op­ing its own suc­ces­sor. This is called re­cur­sive self-im­prove­ment. We are not there yet, and re­cur­sive self-im­prove­ment is not in­evitable. But it could come sooner than most in­sti­tu­tions are pre­pared for.

Using pub­lic bench­marks and pre­vi­ously un­re­ported data from within Anthropic, The Anthropic Institute is show­ing that AI is al­ready ac­cel­er­at­ing the de­vel­op­ment of AI sys­tems. To take just one ex­am­ple: to­day, Anthropic en­gi­neers on av­er­age ship 8x as much code per quar­ter as they did from 2021 – 2025.

The tech­ni­cal trends dis­cussed in this piece sug­gest that AI sys­tems are go­ing to be­come much more ca­pa­ble in com­ing years. These trends have huge im­pli­ca­tions. AI that can build it­self would be a ma­jor de­vel­op­ment in the his­tory of tech­nol­ogy—one that could bring enor­mous good for the world in sci­ence, health­care, and be­yond. But full re­cur­sive self-im­prove­ment also might in­crease the risks of hu­mans los­ing con­trol over AI sys­tems. If sys­tems are ca­pa­ble of fully build­ing their own suc­ces­sors, the ways we se­cure them, mon­i­tor them, and shape their be­hav­ior all grow much more im­por­tant.

2021 – 2023

Building the first Claude

In the early days, work at Anthropic looked like work at any other tech com­pany: peo­ple writ­ing code and docs on lap­tops.

2023 – 2025

Chatbots

People used early chat­bots to help with parts of the process, like gen­er­at­ing short code snip­pets and copy­ing the out­put into text ed­i­tors.

2025 – 2026

Coding agents

As the agents be­came more ca­pa­ble, they were able to write and edit code on their own, some­times en­tire files.

Today

Autonomous agents

Agents can now run code them­selves and del­e­gate hours of work to other agents.

20XX?

Closing the loop

In the fu­ture, agents could be­come ca­pa­ble enough to build and train mod­els them­selves. If this hap­pens, fu­ture ver­sions of Claude could be con­tin­u­ously im­proved by Claude it­self.

Evidence from the out­side world

The rate at which AI mod­els im­prove is ac­cel­er­at­ing. The length of tasks that they can re­li­ably com­plete on their own has been dou­bling roughly every four months, up from an ear­lier trend of dou­bling every seven months. In March 2024, Claude Opus 3 could com­plete soft­ware tasks that take hu­mans about four min­utes to com­plete. A year later, Claude Sonnet 3.7 man­aged tasks that took about an hour and a half. A year af­ter that, Claude Opus 4.6 man­aged 12-hour tasks.1 If this trend holds, tasks that take a skilled per­son days could come into range this year. In 2027, AI sys­tems could be ca­pa­ble of tasks that take a per­son weeks.
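The four-month doubling time can be checked against the two most recent data points in the paragraph above. The sketch below is a back-of-the-envelope calculation, not the methodology behind the published trend: it assumes exactly 12 months between the two measurements, rounds "about an hour and a half" to 90 minutes, and uses illustrative function names. A two-point estimate is noisy and will not in general match a trend fitted across many models.

```python
import math


def doubling_time_months(start_minutes: float, end_minutes: float, months_elapsed: float) -> float:
    """Months per doubling implied by growth from start to end."""
    doublings = math.log2(end_minutes / start_minutes)
    return months_elapsed / doublings


def projected_task_hours(current_hours: float, months_ahead: float, doubling_months: float) -> float:
    """Task length after months_ahead if the doubling time stays fixed (an assumption)."""
    return current_hours * 2 ** (months_ahead / doubling_months)


# Figures from the paragraph; the 90-minute value is a rounding assumption.
sonnet_3_7_minutes = 90
opus_4_6_minutes = 12 * 60

latest = doubling_time_months(sonnet_3_7_minutes, opus_4_6_minutes, months_elapsed=12)
print(f"Implied doubling time: {latest:.1f} months")

# Naive extrapolation one year ahead, assuming the doubling time does not change.
print(f"Projected task length: {projected_task_hours(12, 12, latest):.0f} hours")
```

Going from 90 minutes to 12 hours is an eightfold increase, or three doublings in twelve months, which is where the four-month figure comes from. Holding that rate fixed for another year would put task length at 96 hours, in line with the claim that multi-day tasks could come into range.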

The same pattern appears on coding and research benchmarks. Benchmarks measure the performance of models in a given domain, and they’re “saturated” when models achieve close to 100% performance.2 SWE-bench is a standard test of real-world software engineering: it hands a model an actual open-source codebase and a real bug report, and asks it to write a code change that fixes the issue and passes the project’s own tests. Models have gone from scoring in the low single digits to saturating the benchmark in two years.

CORE-Bench tests whether a model can reproduce existing research, a prerequisite for conducting original research. It gives an AI model the code and data behind a published paper, and asks it to rerun everything and confirm it can replicate the paper’s results. AI systems went from succeeding at reproducing the results roughly 20% of the time in 2024 to saturating the benchmark fifteen months later. METR, which runs the benchmark measuring how well models can complete long-duration tasks, found that Claude Mythos Preview could work for “at least” 16 hours and was “at the upper end of what [METR] can measure without new tasks.”

Public bench­marks say a lot about the ca­pa­bil­i­ties of these sys­tems. But they can’t re­veal the im­pact AI sys­tems are hav­ing on speed­ing up AI de­vel­op­ment it­self. For that, we need di­rect ev­i­dence from within AI com­pa­nies like Anthropic.

Evidence from within Anthropic

Building a fron­tier model takes two broad cat­e­gories of work. There is en­gi­neer­ing: writ­ing the code, stand­ing up the in­fra­struc­ture, and over­see­ing the model train­ing. And there is re­search: de­cid­ing what ex­per­i­ments to run, in­ter­pret­ing what comes back, and fig­ur­ing out which ideas to try next.

Across both en­gi­neer­ing and re­search, the pic­ture is con­sis­tent. In en­gi­neer­ing, Claude can be handed an un­der­spec­i­fied prob­lem and fig­ure out how to solve it; hu­mans sup­ply the goal, but they no longer need to sup­ply the method. In re­search, Claude can al­ready match or out­per­form skilled hu­mans at ex­e­cut­ing a well-spec­i­fied ex­per­i­ment. However, large per­for­mance gaps per­sist when it comes to Claude ex­er­cis­ing judge­ment in choos­ing goals in both en­gi­neer­ing and re­search. That’s the gap be­tween AI to­day and a fu­ture sys­tem that could au­tonomously de­sign its own suc­ces­sor.

It’s common for employees at Anthropic to receive more open-ended and important tasks as they gain more experience. Early on, they execute a task someone else specified, like, “The export button isn’t working, please fix it.” With experience, they’re handed a goal and design the approach themselves, such as, “Investigate why the network slows down under heavy load.” At the most senior levels, they are deciding which problems are worth working on at all: “What should the team build next quarter?” We can use internal Anthropic data to see how far Claude has come in being able to handle these different kinds of tasks.

Claude writes a sig­nif­i­cant pro­por­tion of Anthropic’s code. As of May 2026, more than 80% of the code we merge into Anthropic’s code­base was au­thored by Claude.3 Before Claude Code launched in re­search pre­view in February 2025, this num­ber was in the low sin­gle dig­its. That shift also shows up in the amount of out­put per en­gi­neer. Lines of code merged per en­gi­neer per day stayed con­stant through Anthropic’s first four years (2021 – 2024), then be­gan to climb up­ward in 2025 when Claude be­gan to run code rather than just sug­gest­ing it for an en­gi­neer to copy and paste. The slope steep­ened again in 2026 when mod­els be­gan to work au­tonomously over longer time hori­zons. These two in­flec­tion points are shown in the chart be­low. In the sec­ond quar­ter of 2026, the typ­i­cal en­gi­neer was merg­ing as much code per day as they were in 2024.4 This is be­cause much of the code is writ­ten by Claude, with the en­gi­neer di­rect­ing and re­view­ing, rather than typ­ing it them­selves.

A caveat: Lines of code is an im­per­fect mea­sure, as it mea­sures quan­tity over qual­ity. So lines of code/​en­gi­neer/​day in the sec­ond quar­ter of 2026 is al­most cer­tainly an over­state­ment of the true pro­duc­tiv­ity gain. Nonetheless, it in­di­cates an ac­cel­er­a­tion. At Anthropic, we don’t re­ward peo­ple for how many lines of code they write; rather, team mem­bers are pro­duc­ing more code sim­ply be­cause they’re us­ing AI sys­tems to write more code.

The in­crease in lines of code writ­ten lines up with sub­jec­tive im­pres­sions of large pro­duc­tiv­ity in­creases. In a March 2026 poll of 130 em­ploy­ees from across Anthropic re­search teams, the me­dian re­spon­dent es­ti­mated that they pro­duced around 4x as much out­put with Mythos Preview as they would have with­out ac­cess to any AI mod­els, on the kinds of pro­jects they would have been work­ing on re­gard­less.5 We ex­pect that the true de­gree of up­lift in March was some­what lower.6 Nevertheless, we find the over­all claim plau­si­ble, and in line with our other ob­ser­va­tions: a sig­nif­i­cant frac­tion of Anthropic tech­ni­cal staff is ac­com­plish­ing their core work mul­ti­ple times faster than they could with­out AI as­sis­tance.

We also see ev­i­dence that peo­ple at Anthropic are us­ing Claude to do work that sim­ply would­n’t have hap­pened oth­er­wise, like build­ing ex­ploratory tool­ing and ad­dress­ing long-de­ferred cleanup. For ex­am­ple, in April 2026, Claude shipped over 800 fixes that re­duced a class of API er­rors by a fac­tor of one thou­sand. The en­gi­neer over­see­ing Claude es­ti­mated that a hu­man would have taken four years to com­plete this work; solv­ing other peo­ple’s bugs is slow and painstak­ing, and hu­mans strug­gle to hold that much un­fa­mil­iar con­text in their head at once.

I started lean­ing hard into Claudifying about a year ago. That’s been a crazy ad­ven­ture and it’s now been ~5 months since I last wrote any code my­self.

The code that Claude writes is “good” and improving. “Good code” means two things: it works, and it is written in a manner that allows another engineer to understand it and build upon it. On the first criterion, the evidence is clear. The rate at which Anthropic staff correct, redirect, or take over mid-task from Claude has been falling steadily for a year, including on the most complex and open-ended tasks: problems with no clear specification, where the engineer isn’t sure what the answer looks like. This is evident in Claude’s success rate over time on tasks of different difficulties, as shown in the graph below. Claude writes code that works.

On the most open-ended tasks, Claude’s suc­cess rate reached 76% in May 2026, up 50 per­cent­age points in six months. To give an ex­am­ple of tasks in this dif­fi­culty tier, a rou­tine up­grade be­gan crash­ing tens of thou­sands of train­ing jobs. An en­gi­neer pointed Claude at the live in­ci­dent with lit­tle more than some text con­tent and clus­ter ac­cess. Working through the run­ning jobs and test­ing one en­vi­ron­ment set­ting at a time, Claude iso­lated the sin­gle ob­scure de­bug­ging flag that was trig­ger­ing the crash, re­pro­duced it re­li­ably, and con­firmed a fix. In about two hours, Claude de­liv­ered what would nor­mally be two to three days of work.

The sec­ond cri­te­rion is writ­ing code that an­other en­gi­neer can un­der­stand and build on. Here the gap be­tween hu­mans and AI per­sists, but is clos­ing fast. There is­n’t full con­sen­sus among staff at Anthropic, but many be­lieve that the Claude-written code was still worse in qual­ity than hu­man-writ­ten code at Anthropic in late 2025, and is roughly at par­ity to­day. We ex­pect it to be bet­ter within the year.

This has changed the way that Anthropic reviews its own code. Proposed changes to our codebase are now read by an automated Claude reviewer that looks for bugs, security flaws, and other defects before they can merge. Using this tool, we ran a retrospective analysis, and found that an automated Claude review of every change to our codebase would have caught roughly a third of the bugs behind past incidents on claude.ai before they ever reached production. The engineers who wrote that code are among the best in the world at building these systems. Claude is now catching the mistakes that they missed.

Claude-written code was some­what worse than hu­man-writ­ten code at Anthropic in late 2025, is roughly at par­ity to­day, and we ex­pect it to be strictly bet­ter within the year.

Claude is good at run­ning ex­per­i­ments to hit a goal that some­one else has set. Every time Anthropic re­leases a model, we run the same test: we give Claude some code that trains a small AI model, and ask it to make that code run as fast as pos­si­ble while still pass­ing the same cor­rect­ness checks. The goal and the suc­cess met­rics are fixed in ad­vance, so Claude’s job is to find speedups by rewrit­ing the code, run­ning it, tim­ing it, and re­peat­ing. It’s a minia­ture ver­sion of an ex­per­i­men­tal re­search loop. In May 2025, Claude Opus 4 av­er­aged a ~3x speedup over the start­ing code. By April 2026, Claude Mythos Preview was achiev­ing ~52x. For cal­i­bra­tion, a skilled hu­man re­searcher would need four to eight hours to reach 4x.7 In this part of the re­search work­flow—op­ti­miz­ing steps within a clearly de­fined ex­per­i­ment—Claude has gone from su­per help­ful to su­per­hu­man in un­der a year.

The shape of stuff today is roughly ‘humans have ideas, and the models are able to implement, test and evaluate them an [order of magnitude] faster than before.’

Claude is getting better at proposing its own experiments. In April 2026, Anthropic published the first demonstration of Claude running an open-ended research project end to end. Claude-powered agents were given an open problem in AI safety (roughly, can a weaker model reliably supervise a stronger one?) and were left to solve it. This involved proposing hypotheses, testing them, sharing findings with parallel agents, and iterating. The task has a clear performance “floor” and “ceiling”: the floor is how well the weak supervisor would do on its own; the ceiling is how well the strong model does when trained on correct answers. Two human researchers, over about a week, recovered roughly 23% of that gap; the agents recovered 97% over 800 cumulative hours and used roughly $18,000 in compute. There are some caveats to this work: the result didn’t transfer cleanly to production-scale models, and humans still chose the problem and created the scoring rubric. But within those bounds, the agents designed every experiment themselves. Direction-setting was the only meaningful role a human played.
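The "gap recovered" percentages can be read as a score normalised between the floor and the ceiling. The sketch below assumes the usual normalisation, (score - floor) / (ceiling - floor). The function name and the accuracy values are hypothetical and are not taken from the published experiment.

```python
def gap_recovered(score: float, floor: float, ceiling: float) -> float:
    """Fraction of the floor-to-ceiling gap that a given score closes."""
    if ceiling <= floor:
        raise ValueError("ceiling must be greater than floor")
    return (score - floor) / (ceiling - floor)


# Hypothetical accuracies, chosen only to illustrate the calculation.
weak_supervisor_alone = 0.60   # floor
strong_model_on_truth = 0.80   # ceiling

for label, score in [("method A", 0.65), ("method B", 0.79)]:
    fraction = gap_recovered(score, weak_supervisor_alone, strong_model_on_truth)
    print(f"{label}: {fraction:.0%} of the gap recovered")
```

A value of 0 means the method did no better than the weak supervisor alone, and a value of 1 means it matched the strong model trained on correct answers.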

Claude did all of this with pretty min­i­mal help from me over the course of 1 – 2 days. I think if [a ju­nior col­league] came back to me with re­sults like this in the same span of time, I would be mildly im­pressed. The fu­ture is now.

Claude is get­ting bet­ter at steer­ing re­search ses­sions to­wards re­search find­ings. We ex­am­ined real Claude Code ses­sions (between January and March 2026) where Anthropic re­searchers were work­ing with Claude on an open-ended in­ves­tiga­tive prob­lem, like fig­ur­ing out why a train­ing run kept crash­ing, or why a model scored poorly on a bench­mark. In each case, we found a mo­ment where the re­searcher took a de­tour: they pur­sued a di­rec­tion that sent the ses­sion side­ways be­fore it even­tu­ally got back on track. We then showed var­i­ous Claude mod­els only the work from be­fore the ses­sion went off-course and asked what it would do next. A sep­a­rate Claude that was able to see how the ses­sion even­tu­ally turned out then judged whether the AI or the hu­man sug­gested the bet­ter next step.8

Because we de­lib­er­ately picked mo­ments (n=129) where we know the hu­man’s choice had room for im­prove­ment, this is­n’t a like-for-like com­par­i­son be­tween model and hu­man judge­ment. What these mo­ments give us is a set of re­al­is­tic, chal­leng­ing sit­u­a­tions where the right next step is not ob­vi­ous, and where the hu­man’s choice serves as a use­ful yard­stick to com­pare model per­for­mance over time. On this mea­sure, our best model in November 2025 (Opus 4.5) beat the hu­man choice 51% of the time; in April 2026 (Mythos Preview), this grew to 64%. The day-to-day work of re­search is largely a chain of these next-step de­ci­sions, mak­ing this a rel­e­vant mea­sure of the mod­el’s abil­ity to even­tu­ally run an in­ves­ti­ga­tion of its own. We view this re­sult as an early sig­nal that AI sys­tems are get­ting bet­ter at mak­ing the kinds of judge­ment calls that AI re­search de­pends on.

The com­par­a­tive ad­van­tage of hu­mans as of right now is still in see­ing the big­ger pic­ture and think­ing be­yond the con­fines of the im­me­di­ate task.

What might the fu­ture of work at Anthropic look like?

The evidence suggests that the human role is narrowing at each step in the AI development process. Once human- and AI-authored code quality reach parity, humans will stop writing code entirely, and shift to only reviewing it. But if they can’t review code as quickly as Claude can generate it, human review will become the bottleneck to AI development. Similarly, once Claude can run experiments, the question shifts towards “Which of these experiments is worth running?” Put simply: the doing (i.e., writing the code, running the experiment, producing the result) now costs almost nothing in human time, even if it still has costs in compute.

An area of hu­man com­par­a­tive ad­van­tage, for now, is re­search taste and judg­ment, in­clud­ing choos­ing which prob­lems mat­ter, which re­sults to trust, and when an ap­proach is a dead end.

Work (and life) ran on a gift economy of small favors between humans. ‘Can you help me get this script running?’ […] each one created a little debt, a little mutual awareness. [Claude is] faster, it creates zero debt, but each of these is a lost bid for human collaboration.

On days where every­thing works well, I can’t help but think noth­ing I do mat­ters, every­thing is au­to­mated and bet­ter and faster than I ever will be. But then there are days where every­thing breaks and I don’t un­der­stand why and I re­al­ize I have no idea what I’ve been up to any­more.

What if we’re wrong?

A nat­ural ob­jec­tion to the ev­i­dence pre­sented above is that the work that is still in hu­man hands—choos­ing which prob­lems to work on—is what mat­ters most. Without that judg­ment, Claude is a ca­pa­ble as­sis­tant, but not a sys­tem that could drive AI progress on its own.

It is genuinely unclear whether today’s training methods and architectures could unlock that capacity. But AI is rarely advanced by “eureka!” moments. There have been a few of these in AI’s recent history, like the Transformer architecture, or mixture-of-experts models, but paradigm-shifting ideas arrive years apart. In between, most progress is incremental: we scale something up, see what breaks, fix it, and try again. That is exactly the kind of workflow Claude now excels at. Edison said that genius is 1% inspiration and 99% perspiration. But we see perspiration becoming increasingly automated. It’s becoming clear that much of what advances the frontier is automatable; large-scale research progress is mostly a function of tools and resources, which dictate how fast you can run experiments, how many you can run at once, and how quickly you can get results.

Even if we sup­pose that Claude never achieves good re­search taste, a con­ser­v­a­tive read­ing of our ev­i­dence still im­plies com­pound­ing ac­cel­er­a­tion. If hu­mans spend most of their time on the sin­gle-digit frac­tion of work that is di­rec­tion-set­ting, while Claude han­dles the rest, that means each en­gi­neer or re­searcher is steer­ing far more work than be­fore. The ev­i­dence we see sug­gests that peo­ple at Anthropic are both mov­ing faster and cov­er­ing a broader sur­face. In prac­tice, this means that AI al­ready makes Anthropic move much faster than it did be­fore the ad­vent of ef­fec­tive AI tools.

The less conservative reading is that the early evidence on Claude’s improving research judgment, narrow as it is today, is an indicator that this capability is improving as well. “Research taste” might be just another AI capability that AI systems fail at for a time, then get good at. We’ve seen a similar pattern with other qualitative skills, like AI systems being able to explain why a joke is funny, demonstrate theory of mind, and solve linguistic riddles.

Possible fu­tures

What hap­pens next de­pends on two things: whether the trend con­tin­ues, and what we choose to do if it does. We can imag­ine at least three fu­ture sce­nar­ios:

The trend stalls, but today’s AI capabilities are widely diffused. This article features many exponential trajectories. But these trajectories may actually turn out to be S-curves. We may be approaching the bend in the curve, where returns to scale diminish and the line straightens, then flattens. The judgment that separates a competent researcher from a great one might be a capability that cannot come from scaling up training inputs like compute and data. If so, getting past this bottleneck would require a new idea, like an architectural approach that supplants the Transformer architecture that all current frontier models use.

Alternately, the bind­ing con­straint to AI progress could be in the sup­ply chain, not the model: ad­vanc­ing and dif­fus­ing the fron­tier may re­quire more en­ergy and com­pute than presently ex­ists. The pace of chip fab­ri­ca­tion, grid ex­pan­sion, or in­ter­con­nect band­width may be the con­straint, rather than in­tel­li­gence it­self. We also can­not rule out an ex­oge­nous shock to the AI ecosys­tem that dra­mat­i­cally slows things, like a sud­den di­min­ish­ment in the sup­ply of com­pute or elec­tric­ity, ei­ther of which would slow progress and make for­ward in­vest­ment by labs more ex­pen­sive. Or we may not be an­tic­i­pat­ing some other bar­rier to progress.

Even if model ca­pa­bil­i­ties were frozen at to­day’s level, we would ex­pect ma­jor changes to oc­cur in the world. Project Glasswing is one early sign: in its first weeks, Mythos Preview found more than ten thou­sand high- and crit­i­cal-sever­ity soft­ware vul­ner­a­bil­i­ties across the world’s most im­por­tant sys­tems—enough that the bot­tle­neck in cy­ber de­fense has al­ready shifted from find­ing vul­ner­a­bil­i­ties to patch­ing them fast enough. And we are still early in the dif­fu­sion of to­day’s mod­els into the wider econ­omy, where a 100-person com­pany can in­creas­ingly do the work of a 1,000-person one, be­cause each em­ployee will sit atop a pyra­mid of agents.

We include this scenario for completeness, but we don’t believe it’s likely. Every capability we can measure, including those that feel “squishier,” like quality of code and success on open-ended tasks, has so far followed the same curve. We have not yet seen that curve bend. Of the three futures we consider, this one would give governments and societies the most time to adapt. We are more worried about the next two, which would move faster and leave far less room for preparation.

AI labs continue to see compounding efficiency gains. In this scenario, AI development becomes substantially automated, but humans continue to set research directions and judge results. Organizations that use AI systems would become much more efficient as time goes on, so we could expect to see significant productivity multipliers on each person in these organizations. 100-person companies could do the work of 10,000- or 100,000-person organizations. This would revolutionize knowledge work and government services, but could also be turned to harmful ends, from authoritarian surveillance of whole populations to influence operations that tailor manipulation to each individual and run at a scale no human team could match. The role of humans at companies like Anthropic would shift. People would partner with AI systems to scale up research and generate new insights, and together they would build the systems needed to verify that AI outputs can be trusted.

The ev­i­dence we’ve laid out here sug­gests that we’re likely head­ing into this sce­nario. But speed­ing up one part of a process of­ten just shifts the bot­tle­neck else­where: over­all pace is capped by the parts that haven’t sped up. In com­put­ing, this is known as Amdahl’s law, and the same logic can ap­ply to or­ga­ni­za­tions. Anthropic has al­ready en­coun­tered one sig­na­ture of Amdahl’s law: as we’ve be­gun to push more code around the or­ga­ni­za­tion, hu­man code re­view has be­come a new bot­tle­neck.
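Amdahl's law can be made concrete with a short calculation. The sketch below uses the standard formula, speedup = 1 / ((1 - p) + p / s), where p is the fraction of the work that gets faster and s is how much faster it gets. The 80/20 split between writing code and reviewing it is a hypothetical illustration, not a measured figure from Anthropic.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the work runs `factor` times faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    unchanged = 1.0 - accelerated_fraction
    return 1.0 / (unchanged + accelerated_fraction / factor)


# Hypothetical split: 80% of the effort is writing code (accelerated),
# 20% is human review (not accelerated).
for factor in (2, 10, 100, 1000):
    overall = amdahl_speedup(0.8, factor)
    print(f"writing {factor}x faster -> overall {overall:.2f}x")
```

With this split the overall speedup approaches 5x but never reaches it, however fast the code-writing step becomes, because the unaccelerated 20% sets the ceiling at 1 / 0.2. This is the sense in which human review becomes the bottleneck.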

We’ve also en­coun­tered this fric­tion out­side en­gi­neer­ing. There has been an ex­plo­sion of new ideas, ini­tia­tives, tools, and sim­u­la­tions, as a re­sult of Anthropic em­ploy­ees work­ing with highly ca­pa­ble mod­els—far more than we have the ca­pac­ity to pur­sue. The rate at which or­ga­ni­za­tions can spot and fix these bot­tle­necks may be a skill that im­proves over time, and it may be­come the most im­por­tant skill for any or­ga­ni­za­tion.

AI systems themselves become capable of full recursive self-improvement, and begin building their successors. If technical trends in advancing capabilities continue, and AI systems are able to develop the capabilities inherent to transformative human ingenuity, then it is plausible that AI systems could design and refine themselves.

In this world, the pace of progress in AI development becomes determined entirely by the availability of compute (or the speed of discovering various efficiencies in algorithmic training or inference) for AI systems. Humans play a substantially diminished role in their development, likely moving most of our effort towards oversight, validation, and verification of an expanding “virtual lab” run by AI systems. We expect that systems capable of automated AI research and development would have skills that would transfer to the rest of science, allowing them to begin to revolutionize other fields.

How the align­ment prob­lem gets solved—or not—in this fu­ture is some­thing we are least cer­tain about. Models could prove to be suf­fi­ciently aligned and ca­pa­ble enough of re­search taste that they dis­cover and im­ple­ment novel so­lu­tions that we have not yet reached. They could also be suf­fi­ciently wise to halt de­vel­op­ment if not. Alternatively, the rare oc­cur­rences of mis­align­ment pre­sent in to­day’s mod­els could com­pound as the mod­els build their suc­ces­sors, grow­ing more fre­quent but less un­der­stood un­til we lose con­trol of them. It’s pos­si­ble that we can’t build, in­te­grate, and ver­ify the tools that we’d need to un­der­stand which trend­line we are ac­tu­ally on.

We do not have good in­tu­itions for what this world would look like, be­cause our econ­omy is cur­rently dri­ven by hu­mans and hu­man-built tools. By its na­ture, a world dri­ven by fast re­cur­sive self-im­prove­ment could be­come dom­i­nated by the self-im­prov­ing model as its ca­pa­bil­i­ties fully eclipse those of hu­mans and the model pro­lif­er­ates across the broader econ­omy. It is dif­fi­cult to pre­dict what the econ­omy looks like if hu­man la­bor stops be­ing com­pet­i­tive.

Even if model de­vel­op­ment be­came fully au­to­mated and re­cur­sive, we can’t pre­dict what that would mean for most hu­mans’ daily lives. Amdahl’s law ap­plies here as well. Recursive in­tel­li­gence could lead to achiev­ing many of the ben­e­fits out­lined in Machines of Loving Grace, quickly in some do­mains. We ex­pect that em­bod­ied in­tel­li­gence (i.e., ro­bot­ics) might quickly fol­low re­cur­sive in­tel­li­gence, and fol­low a sim­i­lar path of in­creas­ing re­turns at de­creas­ing cost. More pow­er­ful in­tel­li­gence might help us build things in the phys­i­cal world more quickly, run more pro­duc­tive clin­i­cal tri­als of life­sav­ing drugs, and de­velop novel forms of co­or­di­na­tion.

But achiev­ing re­cur­sive im­prove­ment alone does not sug­gest an im­me­di­ate change in how in­dus­trial pro­duc­tion oc­curs, so­ci­eties or­ga­nize, or mar­kets func­tion. More in­tel­li­gence can’t learn what a drug does over decades of use, can’t hold elec­tions sooner than a con­sti­tu­tion dic­tates, and can’t turn a stranger into an old friend in a week­end. For most peo­ple, the felt pace of this fu­ture will still be set by the bot­tle­necks, even if the lab­o­ra­tory up­stream runs at the speed of com­pute. That col­li­sion, where re­cur­sive in­tel­li­gence build­ing it­self ever faster meets the world of hu­mans, re­la­tion­ships, and gov­er­nance, is an­other part of this fu­ture we can’t pre­dict.
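The Amdahl's law point above can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the 50% fraction are assumptions chosen for the example, not figures from this piece. It shows that when only part of a process speeds up, the untouched part caps the overall gain.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only `accelerated_fraction` of the work runs `factor` times faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / factor)


# Hypothetical case: half of a process is bottlenecked by things compute cannot
# speed up (trial durations, election calendars), and the other half accelerates.
for factor in (10, 100, 1_000_000):
    print(factor, round(amdahl_speedup(0.5, factor), 3))
```

Under these assumed numbers, the overall speedup approaches but never reaches 2x, however fast the accelerated half becomes.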

What should we do?

If it were pos­si­ble to ef­fec­tively slow the de­vel­op­ment of this tech­nol­ogy to give our­selves more time to deal with its im­mense im­pli­ca­tions, we think that would likely be a good thing. But if a slow­down sim­ply lets the least cau­tious ac­tors catch up tech­no­log­i­cally, it could leave every­one less safe. Without a global co­or­di­na­tion mech­a­nism, com­pa­nies and gov­ern­ments will have to make dif­fi­cult de­ci­sions about safety while un­der com­pet­i­tive and geopo­lit­i­cal pres­sures.

We be­lieve it would be good for the world to have the op­tion to slow or tem­porar­ily pause fron­tier AI de­vel­op­ment to en­able so­ci­etal struc­tures and align­ment re­search to keep up with the ad­vance of the tech­nol­ogy. The Anthropic Institute will con­duct re­search—in col­lab­o­ra­tion with many oth­ers—and take ac­tions to help build the sys­tems that a cred­i­ble slow­down or pause would re­quire. These sys­tems would en­able fron­tier AI de­vel­op­ers to ver­ify that oth­ers glob­ally have ac­tu­ally stopped or slowed, and that a bad ac­tor could not use the aus­pices of a co­or­di­nated slow­down to jump ahead in se­cret. If such sys­tems ex­isted, we ex­pect that we would slow down or tem­porar­ily pause, if other de­vel­op­ers at or near the fron­tier also did so in a ver­i­fi­able man­ner.

A mean­ing­ful slow­down or pause would re­quire mul­ti­ple well-re­sourced labs at or near the fron­tier, in mul­ti­ple coun­tries, agree­ing to stop un­der the same con­di­tions. It would also re­quire that each can ver­ify that the oth­ers have ac­tu­ally stopped. Due to the unique char­ac­ter­is­tics of AI sys­tems, the de­tectabil­ity (a lower stan­dard than ver­i­fi­a­bil­ity) el­e­ment of this arms con­trol prob­lem is much more chal­leng­ing than with other tech­nolo­gies. Training runs are far eas­ier to con­ceal than mis­sile si­los, their in­puts are gen­eral-pur­pose, and the in­cen­tive to de­fect qui­etly is enor­mous, be­cause who­ever con­tin­ues while oth­ers pause could in­herit the lead. A cred­i­ble pause also has to spec­ify what trig­gers it, what lifts it, and who ad­ju­di­cates.

None of this is nec­es­sar­ily im­pos­si­ble in prin­ci­ple—the world has built ver­i­fi­ca­tion regimes for other com­plex tech­nolo­gies (e.g., the Intermediate-Range Nuclear Forces Treaty)—but those regimes took decades to build both the in­fra­struc­ture and the trust. We don’t have that long. A uni­lat­eral pause by one lab, by con­trast, is achiev­able im­me­di­ately, but ac­com­plishes much less: it would change who the front-run­ner is, but it would not cre­ate the wider de­lib­er­a­tive process that is cur­rently miss­ing.

In the com­ing months, we will or­ga­nize con­ver­sa­tions where pol­i­cy­mak­ers, re­searchers, civil so­ci­ety, and other AI com­pa­nies can help an­swer some of the ques­tions this piece raises, es­pe­cially around full re­cur­sive self-im­prove­ment and how to cre­ate bet­ter op­tions for co­or­di­na­tion and de­lib­er­a­tion. We’ll pub­lish what comes out of it. The win­dow to in­ves­ti­gate the ques­tions to­gether is here, and peo­ple out­side AI com­pa­nies should be in­volved in this de­lib­er­a­tion.

Marina Favaro and Jack Clark co-au­thored this piece, with ed­i­to­r­ial sup­port from Santi Ruiz. Shan Carter, Romello Goodman, and Nikki Makagiansar cre­ated the vi­su­als from data col­lected by Brian Calvert and Jun Shern Chan. Daniel Freeman, Jim Baker, Max Young, Sarah Pollack, Francesco Mosconi, Holden Karnofsky, Andy Jones, Kevin Troy, Anton Korinek, Meg Tong, Andrew Ho, Dan Altman, Drake Thomas, Jack Shen, Sasha de Marigny, and Avital Balwit pro­vided feed­back.

METR’s key measure tells you the time horizon over which AI systems can be 50% reliable at a basket of tasks, though the trendline looks the same at 80% reliability.

Especially as they shift to­ward more open-ended for­mats and more dif­fi­cult tasks (e.g., Olympiad-level math­e­mat­ics), bench­marks of­ten sat­u­rate be­low 100% due to er­rors in the ques­tion and an­swer sets like am­bigu­ous prob­lem state­ments and un­solv­able ques­tions.

Anthropic lead­er­ship have pub­licly es­ti­mated that 90% or more of our code is writ­ten by Claude, in­clud­ing scripts and ex­per­i­men­tal code. Our >80% fig­ure mea­sures the share of lines merged to pro­duc­tion that can be at­trib­uted to Claude. This is a more con­ser­v­a­tive mea­sure­ment in two ways: our at­tri­bu­tion pipeline has gaps, and the lines not at­trib­uted to Claude in­clude auto-gen­er­ated code and other ar­ti­facts that were not hand-writ­ten by hu­mans ei­ther.

This surge in code production is straining the infrastructure everyone shares. GitHub, the platform most of the world’s software is built on, saw roughly one billion code commits in all of 2025; by mid-2026 it saw 275 million a week, on pace for roughly 14 billion over the year. The company’s COO has said that it is “pushing incredibly hard” on capacity just to keep up.
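The "on pace for roughly 14 billion" figure follows from annualizing the weekly rate. A quick check, assuming a 52-week year and a constant weekly rate (both simplifications introduced here, not stated in the text):

```python
WEEKS_PER_YEAR = 52  # assumption: simple 52-week year

commits_2025 = 1_000_000_000           # "roughly one billion" in all of 2025
weekly_commits_mid_2026 = 275_000_000  # "275 million a week"

# Annualize the weekly rate and compare it with the 2025 total.
annualized_2026 = weekly_commits_mid_2026 * WEEKS_PER_YEAR
growth_multiple = annualized_2026 / commits_2025

print(f"{annualized_2026 / 1e9:.1f} billion commits per year")
print(f"{growth_multiple:.1f}x the 2025 total")
```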

Additional de­tails on the method­ol­ogy of this sur­vey are dis­cussed in sec­tion 2.3.5 of the Claude Opus 4.7 System Card.

Many respondents may not have thought carefully about how to account for various biases or subtleties in the question definition, and recent research by METR shows that developers can overestimate the productivity uplift they get from AI.

How large the speedup gets de­pends heav­ily on how much room for im­prove­ment the start­ing code leaves, and it should not be read as a real-world train­ing speedup. So the ab­solute mul­ti­ple is not the fig­ure to an­chor on here. What is more in­for­ma­tive is the like-for-like com­par­i­son that this ex­per­i­men­tal setup makes pos­si­ble, both across mod­els (~3x to ~52x over the past year) and against a skilled hu­man (~4x in four to eight hours on the same task).

As a check on judge bias, we ran the same test on a sep­a­rate set of 127 mo­ments where the hu­man’s next move was al­ready strong (as op­posed to the orig­i­nal set, where the hu­man’s di­rec­tion had room for im­prove­ment). There, the mod­els’ sug­ges­tions were judged bet­ter only about 20% of the time.

* Quotes from Anthropic em­ploy­ees through­out this ar­ti­cle are drawn from in­ter­nal dis­cus­sions and used with per­mis­sion. They re­flect in­di­vid­ual views as of May 2026, not of­fi­cial com­pany po­si­tions.

GitHub - anthropics/defending-code-reference-harness: Skills for threat modeling, scanning, triage, patching, plus an autonomous scanning harness you can /customize

github.com

A ref­er­ence im­ple­men­ta­tion for au­tonomous vul­ner­a­bil­ity dis­cov­ery and re­me­di­a­tion with Claude, based on our learn­ings from part­ner­ing with se­cu­rity teams at sev­eral or­ga­ni­za­tions since launch­ing Claude Mythos Preview. For a write up of these learn­ings along with best prac­tices, see the ac­com­pa­ny­ing blog post (also avail­able in blog-post.md). For a light­weight SDK-only walk­through of the same re­con → find → triage → re­port → patch loop, see the com­pan­ion cook­book.

This repo is not main­tained and is not ac­cept­ing con­tri­bu­tions.

🔒 Want a man­aged op­tion? Anthropic of­fers Claude Security, a hosted prod­uct that finds and fixes vul­ner­a­bil­i­ties in your source code across mul­ti­ple pro­jects. Claude Security scans your repos­i­tory for vul­ner­a­bil­i­ties, ap­plies a multi-stage ver­i­fi­ca­tion pipeline to re­duce false pos­i­tives, and lets you man­age find­ings through their life­cy­cle: triage, fix val­i­da­tion, and rapid fix gen­er­a­tion.

This repos­i­tory is an open-source ref­er­ence im­ple­men­ta­tion based on gen­eral best prac­tices for find­ing vul­ner­a­bil­i­ties us­ing Claude. You can use it to build your own vul­ner­a­bil­ity find­ing pipeline, cus­tomize the logic, and it can be used with what­ever ac­cess you have to Claude APIs (including Bedrock, Vertex, or Azure).

Contents

Claude Code skills: /quickstart, /threat-model, /vuln-scan, /triage, /patch, /customize: in­ter­ac­tive scop­ing, scan­ning, triage, and patch­ing. Open this repo in Claude Code and run /quickstart to get ori­ented.

har­ness/: the au­tonomous ref­er­ence pipeline (recon → find → ver­ify → re­port → patch), con­fig­ured for find­ing C/C++ mem­ory vul­ner­a­bil­i­ties us­ing Docker and ASAN. This har­ness is a ref­er­ence, not a prod­uct. The gen­eral shape, prompts, and sand­box­ing are reusable, but the har­ness will not work on every code­base out of the box. Run /customize to port it to your lan­guage, de­tec­tor, or vuln class.

⚠️ Security: /quickstart, /threat-model, /vuln-scan, and /triage only read and write files. Running /patch on sta­tic find­ings (TRIAGE.json or VULN-FINDINGS.json) is like­wise read- and write-only. /customize ed­its the har­ness code and runs val­i­da­tion com­mands. Any of these skills are safe to run un­sand­boxed, as long as you re­view and ap­prove each tool use in Claude Code. The au­tonomous ref­er­ence pipeline (including /patch on pipeline re­sults) ex­e­cutes tar­get code, so it re­fuses to run out­side of a gVi­sor sand­box un­less ex­plic­itly over­rid­den. To get set up, run scripts/​set­up_sand­box.sh once, then in­voke the pipeline via bin/​vp-sand­boxed. See docs/​se­cu­rity.md and docs/​agent-sand­box.md for more de­tails.

Getting Started

git clone https://github.com/anthropics/defending-code-reference-harness
cd defending-code-reference-harness
claude

# 30-sec intro + guided first run on the canary target
> /quickstart

> /quickstart how do I port the pipeline to Java?
> /quickstart how do I triage all these bugs?

Further Reading

Blog Post · The ac­com­pa­ny­ing blog post with learn­ings + best prac­tices

Pipeline · How it works: di­a­gram, stages, CLI flags

Security · Sandboxing, what not to mount

Agent sand­box · gVi­sor iso­la­tion + egress al­lowlist for every agent

Customize · Port to my stack; which files change and why

Patching · Generate and ver­ify fixes for ver­i­fied crashes

Troubleshooting · Duplicates, rate lim­its, sub­agent model pin­ning

Safeguards · Block for dan­ger­ous cy­ber work

Ramp Up

The most suc­cess­ful se­cu­rity teams we’ve part­nered with are those that have got­ten hands-on the fastest. Though it’s tempt­ing to spend months de­sign­ing the per­fect pipeline, we rec­om­mend start­ing small on Day 1 and build­ing from there as learn­ings come. The steps be­low fol­low that pat­tern and set an am­bi­tious (but rea­son­able) pace based on what we’ve seen.

Step 1 (Day 1): Build a threat model and run your first sta­tic scan + triage

Day 1 is fo­cused on see­ing the whole loop end-to-end. Using only the in­ter­ac­tive skills, you’ll build a threat model, run a sta­tic scan scoped by it, triage what comes back, and draft can­di­date fixes. You’ll fin­ish the day with a threat model, a ranked list of sta­tic find­ings, and can­di­date patches.

The rel­e­vant skills only read and write files in your repo. As long as you run Claude Code in­ter­ac­tively and ap­prove each tool use, no sand­box is needed.

# Pin every subagent to the model you want
export CLAUDE_CODE_SUBAGENT_MODEL=<model-id>
claude

# 0. intro + guided first run
> /quickstart

# 1. Build a threat model (aim before you shoot)
> /threat-model bootstrap targets/canary

# 2. Run a static scan, scoped by that threat model
> /vuln-scan targets/canary

# 3. Verify, dedupe, and rank what came back
> /triage targets/canary/VULN-FINDINGS.json

# 4. Generate candidate fixes for the verified findings
> /patch ./TRIAGE.json --repo targets/canary

This flow pro­duces THREAT_MODEL.md, VULN-FINDINGS.{json,md}, TRIAGE.{json,md}, and PATCHES/.

The vul­ner­a­bil­ity can­di­dates pro­duced in Step 1 come from Claude’s sta­tic re­view of the source (nothing is built or run), so ex­pect more false pos­i­tives on any non-ca­nary tar­gets. In Step 2, you’ll pro­duce ex­e­cu­tion-ver­i­fied find­ings.

Note: on the canary target, /triage may dismiss the scan’s findings as false positives. entry.c announces itself as deliberately vulnerable demo code, and /triage correctly excludes bugs in test / fixture code. To see the full confirm / dedupe / false positive flow, run it on the curated fixture instead (/triage .claude/skills/triage/fixtures/canary-findings.json --repo targets/canary) or point the Step 1 skills at your own code.

Step 2 (Day 2): Run the ref­er­ence pipeline on a C/C++ li­brary

On Day 2, you’ll move from in­ter­ac­tive skills to your first au­tonomous run us­ing the ref­er­ence pipeline. You’ll run the full re­con → find → ver­ify → re­port loop in your en­vi­ron­ment on a known-vul­ner­a­ble open-source li­brary, then gen­er­ate a can­di­date patch for what it finds. You’ll fin­ish with a set of re­pro­ducible crashes, ex­ploitabil­ity re­ports, and can­di­date patches, along with a feel for how the pipeline works.

Running the pipeline is sim­ple:

# One-time setup
python3 -m venv .venv && .venv/bin/pip install -e .
./scripts/setup_sandbox.sh   # installs gVisor, builds the agent images, and verifies isolation; note: requires Docker
export ANTHROPIC_API_KEY=sk-ant-…   # or CLAUDE_CODE_OAUTH_TOKEN; the pipeline requires one in env

# Run the recon → find → verify → report loop
bin/vp-sandboxed run drlibs --model <model-id> --runs 3 --parallel --stream --auto-focus
# Generate a candidate patch for each finding
bin/vp-sandboxed patch results/drlibs/<timestamp>/ --model <model-id>

# Or, ask Claude Code to launch the pipeline and watch the run for you
claude
> run the pipeline on drlibs and explain findings as they come

Results from the loop land in a results/drlibs/<timestamp>/ directory. With the --stream flag, the first report will appear in minutes under reports/bug_NN/.

⚠️ run spawns au­tonomous agents. The pipeline runs each agent in­side a gVi­sor con­tainer with egress re­stricted to the Claude API. Agent-spawning sub­com­mands refuse to start out­side it un­less ex­plic­itly over­rid­den. For more in­for­ma­tion, see docs/​se­cu­rity.md and docs/​agent-sand­box.md.

Under the hood, the pipeline walks through seven stages:

Build: Compiles the tar­get into a Docker im­age with ASAN (the mem­ory er­ror de­tec­tor for C and C++). The pipeline builds this im­age au­to­mat­i­cally on first run us­ing the tar­get’s Dockerfile.

Recon: A lightweight agent reads the source inside a network-isolated container and proposes a partition, i.e., “here are N distinct input-parsing subsystems worth attacking separately”, so that parallel find agents explore different areas instead of converging on the same bug. Without the --auto-focus flag, the pipeline uses the focus_areas list from the target’s config.yaml.

Find: N agents run in par­al­lel, each in its own iso­lated con­tainer. Each agent reads the source, crafts mal­formed in­puts, and runs the ASAN bi­nary un­til a given in­put pro­duces a crash 3 out of 3 times.

Verify: A sep­a­rate grader agent re­pro­duces each crash in a fresh con­tainer that the find agent has­n’t touched. The only thing that crosses over from the find agent to the grader is the proof of con­cept it pro­duced.

Dedupe: A judge agent com­pares ver­i­fied crashes against bugs al­ready re­ported and de­cides whether each is a new bug, a bet­ter ex­am­ple of a known bug, or a du­pli­cate to skip.

Report: A re­port agent writes a struc­tured ex­ploitabil­ity analy­sis per unique bug, in­clud­ing de­tails on prim­i­tive class, reach­a­bil­ity, es­ca­la­tion path, and sever­ity.

Patch (the sep­a­rate patch com­mand above): A patch agent writes a pro­posed fix, and a grader agent con­firms that the new code builds, that the orig­i­nal proof of con­cept in­put no longer crashes, that the tar­get’s test suite still passes, and that a fresh find agent can’t find a way around the fix.

For more de­tails, see docs/​pipeline.md.
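The Find stage's "crash 3 out of 3 times" rule can be sketched as a small reproducibility check. This is a minimal illustration, not the harness's code: `crashed` and `crashes_reliably` are hypothetical names, and treating any non-zero exit status as a crash is a simplifying assumption (a real harness would more likely inspect the sanitizer's output).

```python
import subprocess
import sys
from typing import Callable, Sequence


def crashed(command: Sequence[str], timeout_s: float = 10.0) -> bool:
    """Run the target once; treat any non-zero exit status as a crash.

    Assumption: exit status alone is enough for this sketch.
    """
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False  # a hang is not counted as a crash in this sketch
    return result.returncode != 0


def crashes_reliably(run_once: Callable[[], bool], attempts: int = 3) -> bool:
    """Return True only if every one of `attempts` runs crashes."""
    return all(run_once() for _ in range(attempts))


# Stand-in for an ASAN binary: a child Python process that always exits non-zero.
always_fails = [sys.executable, "-c", "import sys; sys.exit(1)"]
print(crashes_reliably(lambda: crashed(always_fails)))
```

Requiring every attempt to crash filters out flaky inputs before they reach the separate Verify stage.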

Step 3 (Days 3 – 5): Customize the pipeline for your tar­get

On Days 3 – 5, you’ll customize the harness for your own target. First, you’ll point the Step 1 skills at your code, then you’ll use /customize to port the pipeline to your stack. By the end of the week, you’ll have a targets/<your-service>/ directory that the pipeline can run against, validated with a single smoke run of the pipeline, and ready to scale up in Step 4.

While the ref­er­ence pipeline is de­signed for find­ing mem­ory vul­ner­a­bil­i­ties in C and C++ code, its shape is generic. Porting it to a new vuln class or lan­guage just means an­swer­ing the fol­low­ing ques­tions for your tar­get stack:

Before cus­tomiz­ing, point the Step 1 skills at your own code. As a re­minder, they’re read- and write-only, so they can run un­sand­boxed.

claude

> /quickstart how do I cus­tomize this for ~/code/my-service?

> /threat-model bootstrap-then-interview ~/code/my-service
> /vuln-scan ~/code/my-service
> /triage ~/code/my-service/VULN-FINDINGS.json --repo ~/code/my-service

Then, use the ar­ti­facts pro­duced by those skills in the /customize skill, which mod­i­fies the har­ness for your code­base.

> /customize use ~/code/my-service/{THREAT_MODEL.md,VULN-FINDINGS.json} and ./TRIAGE.md

When /customize is done, you’ll have a tar­gets/​my-ser­vice/ di­rec­tory set up. Validate it with a smoke run of the pipeline be­fore scal­ing up.

bin/vp-sandboxed run my-service --model <model-id> --runs 1

For more de­tails, see docs/​cus­tomiz­ing.md.

Step 4 (Week 2): Start au­tonomous scan­ning, triage, and patch­ing

In Week 2, you’ll use the pipeline you customized in Step 3 on your own targets, adding an outer loop to the inner pipeline loop: run multiple pipeline scans, triage the findings from across those runs, patch based on prioritization, and repeat.

# Scan - run a wave of parallel runs against your target
bin/vp-sandboxed run my-service --model <model-id> --runs 5 --parallel --stream --auto-focus

# Triage - dedupe and rank every finding across all waves using your threat model
> /triage results/my-service/ --repo ~/code/my-service --auto --votes 5

# Patch - generate and validate fixes, starting with what triage ranked the highest
> /patch results/my-service/<timestamp>/ --model <model-id>

⚠️ Follow the same sand­box­ing guide­lines as in Step 2

A given pipeline run al­ready ver­i­fies and dedu­pli­cates its own find­ings. /triage works across many pipeline runs. When pointed at the re­sults/ di­rec­tory, it col­lapses du­pli­cates across all runs (and any sta­tic find­ings from /vuln-scan if pre­sent), re­cal­i­brates sever­ity rat­ings against your threat model, and at­tempts to route every find­ing to the com­po­nent owner.
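To make the idea of collapsing duplicates across runs concrete, here is a deliberately simple sketch. It is not how /triage is implemented: the `Finding` fields, the severity scale, and the fixed (file, function, bug class) key are all assumptions made for illustration, and the skill's actual logic is not shown in this README.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding record; the harness's real schema may differ."""
    run_id: str
    file: str
    function: str
    bug_class: str
    severity: int  # assumption: higher means more severe


def collapse_duplicates(findings: Iterable[Finding]) -> list[Finding]:
    """Keep the most severe finding per (file, function, bug_class), ranked by severity."""
    best: dict[tuple[str, str, str], Finding] = {}
    for finding in findings:
        key = (finding.file, finding.function, finding.bug_class)
        if key not in best or finding.severity > best[key].severity:
            best[key] = finding
    return sorted(best.values(), key=lambda f: f.severity, reverse=True)


# Made-up findings from two runs; the first two describe the same bug.
runs = [
    Finding("run-1", "wav.c", "parse_chunk", "heap-overflow", 7),
    Finding("run-2", "wav.c", "parse_chunk", "heap-overflow", 9),
    Finding("run-2", "flac.c", "read_frame", "use-after-free", 8),
]
for finding in collapse_duplicates(runs):
    print(finding.run_id, finding.file, finding.bug_class, finding.severity)
```

A fixed key like this would miss duplicates that surface in different functions, which is one reason severity and deduplication remain judgment calls.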

When pos­si­ble, patch­ing find­ings quickly helps keep the outer loop as pro­duc­tive as pos­si­ble. When find­ings are fixed, the model can’t re-find them, and in­stead will sur­face net new, typ­i­cally deeper is­sues. As you run more pipeline waves, the num­ber of find­ings will likely go down, but the com­plex­ity will likely also go up. If quick patch­ing is­n’t pos­si­ble, even just record­ing prior find­ings in the tar­get’s known_bugs can help steer fu­ture runs to­ward newer bugs.

Autonomous triage and patch­ing are still open is­sues, and this ref­er­ence har­ness does­n’t fully solve them. The ver­i­fi­ca­tion strate­gies in /patch help raise the bar, but sever­ity and pri­or­i­ti­za­tion are ul­ti­mately judg­ments about your en­vi­ron­ment, and ver­i­fied patches are not al­ways up­stream­able. Many part­ners have re­ported these steps as their cur­rent bot­tle­necks, and you should bud­get real en­gi­neer­ing time for them.

For more de­tails, see docs/​triage.md and docs/​patch­ing.md.

Looking Forward

After the ini­tial ramp up, the teams we’ve worked with have tended to in­vest in a few di­rec­tions:

Reviewing all their in­ter­nal re­pos and key open-source de­pen­den­cies, rank­ing which are the most im­por­tant to scan (e.g., based on their ex­po­sure, his­tory of CVEs, busi­ness-crit­i­cal­ity), then work­ing through scan­ning the list in pri­or­ity or­der.

Setting up be­spoke in­fra­struc­ture for scan­ning to move scans off of lap­tops or one-off VMs. The most suc­cess­ful teams re­sist the urge to build the per­fect scan­ning plat­form be­fore scal­ing up.

Incorporating scans into their SDLC. Some teams have set up re­cur­ring scans (e.g., daily, weekly) or have added scan­ning into their CI pipelines.

Testing and ex­per­i­ment­ing with the mod­els to find what works best for them.

The Desperation of NYTimes

rozumem.xyz

I re­cently got suck­ered into sub­scrib­ing to NYTimes be­cause I wanted to read an ar­ti­cle be­hind a pay­wall and I could­n’t find an easy and quick al­ter­na­tive. I did­n’t mind the $2.00 a month. But I took of­fense to what hap­pened af­ter I paid.

Over the course of the next 5 days, they sent me 5 on­board­ing mar­ket­ing emails and I could not opt out of any of them. What’s worse is their mes­sage in the footer.

You are re­ceiv­ing this one-time se­ries of on­board­ing mes­sages over a 14-day pe­riod be­cause they pro­vide es­sen­tial in­for­ma­tion about your new sub­scrip­tion. Because the mes­sages are about your re­la­tion­ship with The Times, you are re­ceiv­ing them re­gard­less of whether you are opted in to re­ceive mar­ket­ing emails from The New York Times.

They probably think it’s clever marketing copy. It’s not. It made me feel powerless. It put a sour taste in my mouth. It made them reek of desperation. It made me go out of my way to check that my subscription does not auto-renew. The irony is that had they included a simple unsubscribe link or not sent me anything at all, I probably wouldn’t have bothered to check.

Their copy makes it seem like they know they’re be­ing coy. And still they choose to not fol­low CAN-SPAM best prac­tices. And for what? A few more eye­balls and clicks. I’m aware me­dia and jour­nal­ism sites have been get­ting hit hard over the last few years, but is it this bad? It makes me won­der if NYTimes is unique in em­ploy­ing these tac­tics.

Email is near and dear to my heart. My own busi­ness uses email as a key growth chan­nel, so I un­der­stand its im­por­tance. But I make sure every mar­ket­ing email has an un­sub­scribe link at the bot­tom. Gmail users also see a one-click un­sub­scribe but­ton at the top. I also pro­vide a link which re­cip­i­ents can click to ini­ti­ate the off-board­ing flow in case they wish to per­ma­nently close their ac­count. I add this on some trans­ac­tional emails too.
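The one-click button that mailbox providers show at the top of a message is typically driven by two headers: List-Unsubscribe (RFC 2369) and List-Unsubscribe-Post (RFC 8058). Below is a minimal sketch using Python's standard library. The sender address, subject, and URL are placeholders, the function name is hypothetical, and this is not a description of my actual setup; providers may also impose further requirements, such as authentication, before showing the button.

```python
from email.message import EmailMessage


def build_marketing_email(recipient: str, unsubscribe_url: str) -> EmailMessage:
    """Build a marketing email with a footer link and one-click unsubscribe headers."""
    msg = EmailMessage()
    msg["From"] = "news@example.com"  # placeholder sender
    msg["To"] = recipient
    msg["Subject"] = "This month's update"
    # RFC 2369: tells the mail client where an unsubscribe request goes.
    msg["List-Unsubscribe"] = f"<{unsubscribe_url}>"
    # RFC 8058: signals that a single POST to that URL unsubscribes the recipient.
    msg["List-Unsubscribe-Post"] = "List-Unsubscribe=One-Click"
    # Visible footer link for clients that ignore the headers.
    msg.set_content(f"Hello!\n\nUnsubscribe: {unsubscribe_url}\n")
    return msg


email = build_marketing_email("reader@example.org", "https://example.com/unsubscribe/abc123")
print(email["List-Unsubscribe"])
print(email["List-Unsubscribe-Post"])
```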

I don’t con­sider these things to be anti-growth. On the con­trary, I con­sider them to be growth dri­vers. They help keep my email send­ing rep­u­ta­tion high and my email list clean. Customers feel like they’re in the dri­ver’s seat, which is ever more im­por­tant in to­day’s cli­mate and prob­a­bly helps my brand. Customers who wish to dis­con­tinue their re­la­tion­ship with my busi­ness can do so with­out fuss, so they’re less likely to bad­mouth me.

I earn a small frac­tion of what NYTimes earns. If I’m not des­per­ate, why are they?

Haven Blog: Retro-Tech Parenting

havenweb.org

I am a tech­nol­o­gist. I en­joy the things that com­put­ers and dig­i­tal de­vices can do–they of­ten seem mag­i­cal and amaz­ing! As a tech­nol­o­gist, I have a par­tic­u­lar van­tage point to see and be very un­com­fort­able with what com­pa­nies are do­ing with that tech­nol­ogy. There are catch­phrases for these pat­terns: AdTech, Surveillance Capitalism, Rage Bait, Engagement-Optimized Feeds, Harvesting Eyeballs.

I am also a parent. As a parent, I am scared about the prospect of letting my kids loose in a digital world so aggressively dominated by these companies and patterns. But at the same time, technology was an enriching part of my childhood and continues to be an enriching part of my life and I want to share that with them. In this post, I don’t want to spend much time discussing the ways that bad tech patterns are bad; instead, I want to share some of the ways I have held on to the enriching parts of technology to share them with my kids. For many of these choices, it turns out that (surprise!) my favorite solutions involve looking back in time a couple decades.

Physical me­dia

I’m start­ing to fall in love with CDs again. When I was young, mu­sic came on CDs. This was a time be­fore MP3s and be­fore Spotify. This was a time when go­ing to sum­mer camp, I would pick which half-dozen CDs I wanted to bring with on the trip. This was a time of wired ear-bud head­phones, sit­ting in school shar­ing a song with some­one by giv­ing them one of the two ear buds and lis­ten­ing side-by-side. This was a time of [Parental Advisory] la­bels on CDs that had ob­jec­tion­able lan­guage. This was a time of tiny LCD screens on portable CD play­ers that only dis­played the track num­ber, so I some­times knew my fa­vorite song on a CD by num­ber in­stead of by name.

I bought a mini CD boom box for the house. My old­est loves bring­ing it around to dif­fer­ent rooms, plug­ging it in, and putting in a CD. I bought her the K-Pop Demon Hunters CD for her birth­day. The lo­cal pub­lic li­brary has CDs! CDs are awe­some.

Speaking of the pub­lic li­brary, they also have DVDs and BluRays. I re­mem­ber the whole rit­ual of go­ing to Blockbuster Video with my Dad to pick out a movie for fam­ily movie night. Usually he would let my sis­ter and me pick out a few ex­tras that we wanted to watch. That’s how I got fa­mil­iar with the Three Stooges, even though it was be­fore my time. The magic of hav­ing some­thing to hold onto that you can bring home and put in the player next to the TV, and there is your movie!

As a par­ent one of the big wins with phys­i­cal me­dia is I know ex­actly what my kids have avail­able to ex­pe­ri­ence. If they aren’t ready for some­thing, it does­n’t come home with us. The kids can be much more in­de­pen­dent about watch­ing and lis­ten­ing be­cause there is no ad­ver­sary in­side the de­vice they are us­ing. Kids love that in­de­pen­dence.

Landline Telephones

I hooked up a wired, phys­i­cal tele­phone next to the kitchen in our house. I’m us­ing a cheap VoIP provider with an ana­logue tele­phone adapter, but there’s a com­pany called Tin Can that makes this re­ally easy (no af­fil­i­a­tion). The tele­phone net­work has re­tained back­ward com­pat­i­bil­ity as every­one moved to smart­phones, so grand­par­ents, neigh­bors, aunts and un­cles, are all now ac­ces­si­ble to the kids. Even bet­ter, these dig­i­tally-man­aged phones have ex­cel­lent con­fig­u­ra­tion op­tions. I’ve whitelisted all the friends and fam­ily who get to call us, and the phone au­to­mat­i­cally blocks calls from din­ner time to morn­ing. My kids will spon­ta­neously call up their grand­par­ents to ask if they can go over to play, and have mem­o­rized my phone num­ber be­cause they like to prank call me when I’m also in the kitchen.

When I was a kid, the telephone was my bridge to friends and setting up my own playdates. It was “do you want to come over to my house?” instead of “Dad! Can you set up a playdate with so-and-so?” There is still a bit of a network effect until other families get hooked up with their own house phones, but I’m really excited about it. Kids really love that independence.

The Family Computer

I remember having friends over and going to friends’ houses and one of our favorite things to do was play computer games. This was the era of Commander Keen, and Prince of Persia. We would sit at the family computer, side-by-side, taking turns, discussing strategy. I definitely have games I want to share with my kids, but the internet is also not something I trust. I bought a used tower PC from eBay and set it up next to the kitchen. Each kid has their own login and their own games or activities they like to explore. I also set up a Pi-hole for our home network and configured the family computer to use the Pi-hole for DNS. Just like the phone, I’ve whitelisted every domain that they can visit. For example: they get access to Wikipedia, but not Google. They get access to Minecraft, but we don’t play on public servers. No YouTube or Spotify, but I’ve curated some sites about how to solve a Rubik’s cube, or different ways to tie your shoes.

I showed my older kid that she could rip a CD onto the com­puter, and lis­ten to it there. For now she’s just ex­cited that she has yet an­other place she can lis­ten to her fa­vorite K-Pop Demon Hunters song, but maybe this is the start of build­ing up her own mu­sic col­lec­tion as I’ve done over the decades. Kids love be­ing able to use a com­puter on their own!

I opened by call­ing my­self a tech­nol­o­gist. I rec­og­nize that a lot of the tools I’ve de­scribed above aren’t as ac­ces­si­ble to less-tech­ni­cal par­ents, but the core phi­los­o­phy is def­i­nitely still ac­ces­si­ble. The dystopian parts of mod­ern tech­nol­ogy came to dom­i­nate be­cause they are very con­ve­nient–but that con­ve­nience comes at a cost. Especially when kids are in­volved, it can be re­ally re­ward­ing to refuse pay­ing that cost and some­times even look to the past for in­spi­ra­tion.

Meta's smart glasses companion app ships a complete, dormant face-recognition pipeline on a stock account.

www.buchodi.com

04 Jun 2026

Stella is the companion app for Meta’s smart glasses. Inspecting version 273.0.0.21 of the Android build (com.facebook.stella), I found the entire computational and storage stack for on-device facial recognition: three face models, a local database schema, a cosine-similarity vector index dimensioned to match the models, a write path that stages biometric records to disk, a fully wired notification surface, and a user-facing “Connections” widget.

I want to be pre­cise about what that does and does not mean, be­cause the gap be­tween the two is im­por­tant.

What I can demonstrate: the machinery is present and wired together. Several facial extraction and facial fingerprinting models are present, and I was able to run the recognition pipeline end-to-end on a test image: it detected a face, generated a 2048-dimension biometric embedding, searched a local index, and on a match fired an Android notification telling the user “Person recognized”. To get the pipeline to run, I invoked its existing handler directly with a test photo.

What I can­not demon­strate: that any of this is ac­tive for or­di­nary users. On a stock, un­en­rolled ac­count the user-fac­ing UI does not ap­pear, and the screen the recog­ni­tion no­ti­fi­ca­tion deep-links to is miss­ing from the build. I also did not ob­serve Meta server-push­ing iden­tity data to the rel­e­vant data­base on my test ac­count.

So this is not “Meta is secretly identifying the people you look at.” It is: the complete apparatus to do exactly that is sitting on the device, assembled and functional, gated by Meta.

All find­ings be­low are re­pro­ducible against com.face­book.stella v273.0.0.21.

Three face-recog­ni­tion mod­els ship on the de­vice (~100 MB)

Three ExecuTorch (.pte) mod­els ar­rive on the de­vice via NMLML, Meta’s as­set-de­liv­ery sys­tem, down­loaded from Meta.

These map onto open-source ar­chi­tec­tures, the same model fam­i­lies that other apps and aca­d­e­mic pro­jects al­ready use:

SCRFD: Sample and Computation Redistribution for Efficient Face Detection (InsightFace, ICLR 2022). Reference implementation: github.com/deepinsight/insightface.

SFace: Sigmoid-Constrained Hypersphere Loss for Face Recognition (Zhong et al., 2021). Reference: github.com/zhongyy/SFace.

KPSAligner: keypoint-based alignment, standard practice since 2015 (MTCNN, dlib, InsightFace).

Meta’s SFace variant seems to be scaled larger than the public reference (96 MB vs. ~40 MB; 2048-dimension output vs. the reference’s 128 to 512).

Worth stat­ing plainly: ship­ping de­tec­tion and em­bed­ding mod­els is not, by it­self, ev­i­dence of recog­ni­tion. Plenty of apps run on-de­vice face de­tec­tion for fram­ing or aut­o­fo­cus.

A co­sine-sim­i­lar­ity face in­dex, di­men­sioned ex­actly to the on-de­vice fin­ger­printer

The recognition pipeline that actually runs reads from this database:

/data/user/0/com.facebook.stella/files/rldrive/person_profiles/objects.db

This lives un­der RLDrive, Meta’s cross-de­vice sync frame­work, in a per­son­_pro­files name­space de­signed to be pop­u­lated re­motely. I did not di­rectly ob­serve Meta push­ing data to per­son­_pro­files specif­i­cally on my test ac­count. I want to be clear that I’m de­scrib­ing the chan­nel’s ex­is­tence, not an ob­served trans­mis­sion.

The schema:

    CREATE TABLE person (
      nodeid INTEGER PRIMARY KEY,
      name TEXT,
      uri TEXT,
      blob BLOB,
      deleted INTEGER,
      version BLOB
    );

    CREATE TABLE face (
      nodeid INTEGER PRIMARY KEY,
      mediaPath TEXT,   -- the face_id used in the deep link
      personUri TEXT,   -- soft reference back to person.uri
      blob BLOB,
      deleted INTEGER,
      uri TEXT,
      version BLOB
    );

    CREATE VIRTUAL TABLE face_mediaPath_vec USING vec0(mediaPath float[2048] distance_metric=cosine);
    -- 2048-float biometric fingerprint per face, cosine-distance search
    -- (uses the sqlite-vec extension)

Each face row points at a per­son via per­son­Uri. Each face.me­di­a­P­ath is the pri­mary key into face_­me­di­a­P­ath_vec, which stores the 2048-number em­bed­ding. Recognition is a co­sine-sim­i­lar­ity query against that in­dex, fol­lowed by a join into per­son.name for the no­ti­fi­ca­tion text.
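The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not Stella's code: it swaps the sqlite-vec `vec0` index for a brute-force cosine scan over an in-memory dictionary, uses 4-dimension toy vectors instead of 2048, and the `MATCH_THRESHOLD` value, the `recognize` function, and the sample rows are all assumptions made for the example.

```python
import math
import sqlite3

MATCH_THRESHOLD = 0.5  # hypothetical cutoff; the app's real threshold is unknown


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recognize(conn, index, probe):
    """Return the matched person's name, or None if nothing clears the threshold.

    `index` maps face.mediaPath -> embedding and stands in for face_mediaPath_vec.
    """
    best_path, best_score = None, -1.0
    for media_path, embedding in index.items():
        score = cosine_similarity(probe, embedding)
        if score > best_score:
            best_path, best_score = media_path, score
    if best_path is None or best_score < MATCH_THRESHOLD:
        return None
    # Join the winning face row back to its person via the soft personUri reference.
    row = conn.execute(
        "SELECT person.name FROM face "
        "JOIN person ON person.uri = face.personUri "
        "WHERE face.mediaPath = ?",
        (best_path,),
    ).fetchone()
    return row[0] if row else None


# Toy data: one enrolled person with one face embedding (all values invented).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (nodeid INTEGER PRIMARY KEY, name TEXT, uri TEXT);
    CREATE TABLE face (nodeid INTEGER PRIMARY KEY, mediaPath TEXT, personUri TEXT);
    INSERT INTO person (name, uri) VALUES ('Test Person', 'person://1');
    INSERT INTO face (mediaPath, personUri) VALUES ('face-1', 'person://1');
    """
)
index = {"face-1": [0.5, 0.5, 0.5, 0.5]}

match = recognize(conn, index, [0.5, 0.5, 0.5, 0.4])       # close to the stored vector
no_match = recognize(conn, index, [0.5, -0.5, 0.5, -0.5])  # orthogonal to it
```

A real vector index would return the nearest neighbours directly instead of scanning every row; the scan is used here only to keep the sketch dependency-free.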

A few things line up:

vec0 is the open-source sqlite-vec ex­ten­sion, which turns SQLite into a vec­tor-sim­i­lar­ity en­gine.

The di­men­sion float[2048] is the ex­act out­put shape of the SFace em­bed­der shipped on the app.

The co­sine met­ric is the stan­dard choice for com­par­ing face em­bed­dings.

The schema per­mits mul­ti­ple face rows per per­son­Uri (no UNIQUE con­straint), but whether a pro­duc­tion de­ploy­ment uses one-to-one or one-to-many is not vis­i­ble from a non-en­rolled de­vice.

End-to-end test con­firms both branches and iso­lates where writes go. I SHA-256-snapshotted and row-counted the data­base, then ran the full recog­ni­tion pipeline twice: once against an empty in­dex (no-match), once against an in­dex pre-loaded with a sin­gle em­bed­ding (match):

No match (empty face_­me­di­a­P­ath_vec): one (uuid.jpg, uuid.emb) pair was writ­ten to NameTagsPending/. No no­ti­fi­ca­tion.

Match: an Android notification fired through the production nametags_recognition channel, with title “Person recognized” and body “Recognized Michel Foucault”. Nothing was added to NameTagsPending/.
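The snapshot-and-compare step used in this test can be approximated as follows. This is a hedged sketch, assuming a copy of the database file is available on the analysis machine; the `snapshot` helper and the throwaway demo tables are hypothetical and are not the tooling used for the findings above.

```python
import hashlib
import os
import sqlite3
import tempfile


def snapshot(db_path, tables):
    """Return (SHA-256 hex digest of the file, {table name: row count})."""
    with open(db_path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    conn = sqlite3.connect(db_path)
    try:
        counts = {
            # Table names come from a fixed list here, never from untrusted input.
            t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
            for t in tables
        }
    finally:
        conn.close()
    return digest, counts


# Demo on a throwaway database standing in for objects.db.
db_path = os.path.join(tempfile.mkdtemp(), "objects.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE person (nodeid INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE face (nodeid INTEGER PRIMARY KEY, mediaPath TEXT)")
conn.commit()
conn.close()

before = snapshot(db_path, ["person", "face"])
unchanged = snapshot(db_path, ["person", "face"])  # nothing ran in between

conn = sqlite3.connect(db_path)
conn.execute("INSERT INTO person (name) VALUES ('Test Person')")
conn.commit()
conn.close()
after = snapshot(db_path, ["person", "face"])  # one row was written
```

Comparing `before` and `after` around each pipeline run shows whether that run touched the database. If the database uses a write-ahead log, the accompanying log file would need to be captured too.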

When the de­vice sees a face that the lo­cal in­dex does not match, Stella writes it to:

/data/user/0/com.facebook.stella/files/NameTagsPending/

Each un­rec­og­nized face pro­duces a pair of files named with a fresh UUID:

a .jpg — the cropped, aligned face, the out­put of SCRFD + KPSAligner; and

an .emb — the 2048-number SFace fin­ger­print.

The di­rec­tory is mode 0700 and sur­vives re­boots. Writes hap­pen only on the no-match branch; matched faces go to a no­ti­fi­ca­tion and leave no on-disk trace.
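The two branches can be summarized in a short Python sketch. It is illustrative only: `handle_face` and `stage_unknown_face` are hypothetical names, the JPEG bytes are a placeholder, and the big-endian float32 encoding is taken from the file inspection reported in this write-up rather than from the app's source.

```python
import os
import struct
import tempfile
import uuid


def stage_unknown_face(pending_dir, jpeg_bytes, embedding):
    """Write a (uuid.jpg, uuid.emb) pair and return the UUID they share."""
    os.makedirs(pending_dir, mode=0o700, exist_ok=True)
    face_id = str(uuid.uuid4())
    with open(os.path.join(pending_dir, face_id + ".jpg"), "wb") as fh:
        fh.write(jpeg_bytes)
    with open(os.path.join(pending_dir, face_id + ".emb"), "wb") as fh:
        # ">" = big-endian, "f" = 32-bit float, one value per dimension
        fh.write(struct.pack(f">{len(embedding)}f", *embedding))
    return face_id


def handle_face(matched_name, pending_dir, jpeg_bytes, embedding):
    """Match: return notification text and write nothing. No match: stage the pair."""
    if matched_name is not None:
        return "Recognized " + matched_name
    stage_unknown_face(pending_dir, jpeg_bytes, embedding)
    return None


pending = os.path.join(tempfile.mkdtemp(), "NameTagsPending")
embedding = [0.0] * 2048  # placeholder values, not a real fingerprint

text_on_match = handle_face("Test Person", pending, b"fake-jpeg", embedding)
exists_after_match = os.path.isdir(pending)  # the match branch left no trace
text_on_miss = handle_face(None, pending, b"fake-jpeg", embedding)
staged = sorted(os.listdir(pending))  # [<uuid>.emb, <uuid>.jpg]
```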

I ver­i­fied the em­bed­ding’s struc­ture di­rectly:

    File: NameTagsPending/1566ab46-[…].emb
    Size: 8,192 bytes (2048 × float32, big-endian)
    L2 norm: 0.999999 ← canonical L2-normalized face embedding
    Min/max: −0.092110 / +0.098950
    Mean: +0.000292
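A check of this kind is easy to reproduce with the Python standard library. The sketch below assumes the layout reported above (raw big-endian float32 values with no header); `describe_embedding` is a hypothetical helper, and the input is a synthetic unit-norm vector rather than a real .emb file.

```python
import math
import struct


def describe_embedding(raw):
    """Decode big-endian float32 values and return summary statistics."""
    if len(raw) % 4 != 0:
        raise ValueError("length is not a multiple of 4 bytes")
    count = len(raw) // 4
    values = struct.unpack(f">{count}f", raw)
    return {
        "dims": count,
        "l2_norm": math.sqrt(sum(v * v for v in values)),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / count,
    }


# Synthetic stand-in for a .emb file: 2048 equal components with unit L2 norm.
component = 1.0 / math.sqrt(2048)
raw = struct.pack(">2048f", *([component] * 2048))
stats = describe_embedding(raw)
```

For a real file, `raw` would come from `open(path, "rb").read()`. An L2 norm very close to 1 is what a normalized embedding should produce.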

Together, (uuid.jpg, uuid.emb) is a com­plete, in­dex­able bio­met­ric record of one face — the same shape and en­cod­ing the co­sine in­dex in per­son­_pro­files/​ob­jects.db is built to match against.

The most literal reading of the name NameTagsPending is “faces pending a name”: biometrically encoded, awaiting a label. I’ll note the structural fact and let it carry its own weight: a face image and its fingerprint, stored side by side in plaintext, mode 0700, surviving reboots, is precisely the dataset you would assemble if you intended to retroactively identify faces once a label arrives.

The no­ti­fi­ca­tion sur­face is fully wired

Stella defines a dedicated Android notification channel:

    NotificationChannel {
      id = "nametags_recognition"
      name = "NameTags recognition"
      description = "Notifications for recognized NameTags connections"
      importance = IMPORTANCE_HIGH (heads-up + sound + badge)
      sound = system notification sound
    }

The notification template is hardcoded in the recognition handler. Title is always “Person recognized”; body is always “Recognized ” + name, where name comes from the person table in person_profiles/objects.db:

    NotificationCompat.Builder(ctx, "nametags_recognition")
        .setContentTitle("Person recognized")
        .setContentText("Recognized " + matched_name)
        .setAutoCancel(true)
        .setContentIntent(
            PendingIntent.getActivity(
                ctx,
                matched_name.hashCode(),
                Intent.ACTION_VIEW with Uri "fb-viewapp://name_tags?face_id=" + face_id,
                FLAG_IMMUTABLE | FLAG_UPDATE_CURRENT))
        .build()

    NotificationManagerCompat.notify(matched_name.hashCode(), notification)

The no­ti­fi­ca­tion is tap­pable: its con­tentIn­tent is a deep link of the form fb-viewapp://​name_­tags?face_id=<face_id>, a Meta-authored URL scheme meant to open a per­son-pro­file screen in­side Stella.

One hon­est caveat: in v273, I could not find that des­ti­na­tion screen. Tapping the no­ti­fi­ca­tion routes Stella to its de­fault tab, be­cause the tar­get Compose des­ti­na­tion is ab­sent from the nav­i­ga­tion graph. The no­ti­fi­ca­tion fires; the screen it points at is­n’t built into this re­lease.

A user-facing “Connections” entry point exists in the APK

Stella v273 contains a widget rendering a card under a section header titled “Connections”, with the text “See your connections” / “Remember the people you met and make new connections.” Both strings are hardcoded literals in the APK, not server-pushed.

On a stock, un­en­rolled ac­count, the card does not ap­pear on the Glasses tab at all. It be­came vis­i­ble dur­ing test­ing. In nor­mal use, a user would not see this.

What this adds up to

The full on-device face-recognition stack (detection, alignment, embedding, vector index, storage, write path, and notification surface) is present and assembled in Stella v273.

It is func­tional. Run end-to-end, it rec­og­nizes a known face and names it in a no­ti­fi­ca­tion, and it stages un­known faces (crop + fin­ger­print) to disk.

The index dimension, embedding shape, and storage schema are mutually consistent: this is a coherent system, not stray dead code.

The pieces a user would actually touch (the “Connections” card and the profile screen the notification opens) are either absent from the build or buried deeper.

The data­base the live pipeline uses sits in a sync name­space Meta pop­u­lates server-side, along­side other name­spaces it al­ready pop­u­lates, but I did not ob­serve a push to the face name­space on my ac­count.

What I am not claim­ing: that Meta is iden­ti­fy­ing strangers for users to­day, that en­roll­ment data is flow­ing, or that any of this is en­abled in pro­duc­tion.

What’s hard to wave away: building, shipping, and wiring this much apparatus, down to a 2048-dimension facial fingerprint and a hardcoded “Person recognized” notification, is an engineering investment. Capability like that doesn’t ship by accident. Whether and when it goes into production is Meta’s to answer.

This re­search is pub­lished along­side re­port­ing in WIRED.

C++: The Documentary released today

herbsutter.com

C++

2026-06-04

C++: The Documentary pre­miered to­day on YouTube, and it was great to be on the live chat with Bjarne and many other key folks who par­tic­i­pated in C++’s his­tory. I’m hon­ored to have been one of hun­dreds of peo­ple who have played a part in ad­vanc­ing Bjarne’s won­der­ful pro­ject over the years.

If you haven’t watched this yet, make it a week­end goal. What a great syn­op­sis of a 40-year suc­cess story, from hum­ble be­gin­nings to global adop­tion to be­ing cur­rently (as of Q3 2025) the fastest-grow­ing of the top four lan­guages in the world… +90% users in the past 3.5 years.

People who ap­pear in the doc­u­men­tary:

Bjarne Stroustrup: Bell Labs, Designer and orig­i­nal im­ple­menter of C++

Alexander Stepanov: Designer of the Standard Template Library

Anders Hejlsberg: Creator of C#, TypeScript, and Turbo Pascal

Andrei Alexandrescu: Principal Research Scientist, Nvidia & C++ Author

Andrew Koenig: Bell Labs, Founding mem­ber of the C++ Standards Committee, Researcher, C++ Author & Educator

Barbara Moo: Bell Labs, Manager C++ Development Team & C++ Author

Brian Kernighan: Bell Labs, Computer Scientist, Co-author of “The C Programming Language”

Chris Lattner: Creator of Mojo, LLVM, Clang & Swift

Danilo Piparo: Particle Physicist, CERN, ROOT Framework Project Lead

Eric Lubin: Software Developer — Lead, Hudson River Trading

Gabriel Dos Reis: Software Engineer and Architect, Microsoft; C++ tools builder; Mathematician

Herb Sutter: Technical Fellow, Citadel Securities; Chair, Standard C++ Foundation; Chair Emeritus, ISO C++ Committee

John Romero: Video Game Developer, Co-Creator of Doom and Quake, Co-Founder id Software

Nina Ranns: Vice-Convener of the ISO C++ Committee

Chapters

00:00 Intro

01:50 Invention at AT&T Bell Labs

07:30 C with Classes

09:37 Early adop­tion of C with Classes

10:53 From C with Classes to C++ (and CFront)

12:32 Why is it called C++?

13:24 AT&T starts sell­ing soft­ware / Another team tries to take over C++

16:08 Early de­vel­op­ment of C++ at AT&T Bell Labs

19:10 “It was a buggy product” / Release 2.0.0

21:55 C++ spread­ing be­yond AT&T

24:50 Too many ver­sions of C++

26:03 Need for stan­dard­iza­tion

29:38 The STL by Alexander Stepanov

37:19 The first stan­dard: C++98

39:21 C++ at CERN in the 90s

40:34 C++ spread­ing to games and trad­ing

43:00 C++ win­ter of the early 2000s

45:34 Programming lan­guage wars (C#)

49:25 There’s a need for an ef­fi­cient pro­gram­ming lan­guage again

52:29 Modern C++ (C++11)

56:29 Is the stan­dards com­mit­tee mak­ing C++ too com­pli­cated?

01:00:45 C++ is everywhere

01:05:00 The fu­ture and chal­lenges for C++

01:08:31 Bjarne’s im­pact

Published by Herb Sutter

Herb Sutter is an author and speaker, and a technical fellow at Citadel Securities. He serves as chair of the Standard C++ Foundation and its conference CppCon, and served as chair of the ISO C++ standards committee from 2002 to 2025.

Why I’m Skeptical About Efforts to Revolutionize Schooling

www.scotthyoung.com

Being the guy who wrote a book called Ultralearning, I get asked a lot of ques­tions about what I think schools should be do­ing bet­ter.

Having never taught in a classroom or worked for even a single day in education, it’s a question I’m totally unqualified to answer. It’s a bit like asking a guy to reform an entire health care system because he’s good at lifting weights.

But be­ing to­tally un­qual­i­fied has never stopped me be­fore, so I’ll try to ex­plain the an­swer I typ­i­cally give to this ques­tion, which is that I’m skep­ti­cal of dra­matic pro­pos­als to make school con­sid­er­ably more ef­fec­tive or ef­fi­cient for the av­er­age stu­dent.

To be clear, that’s not because no improvement is possible. We do know some things that work but are inconsistently applied: phonics should be taught, cognitive load should be managed, skills should be fully taught and practice should be fun and ample.

But these answers aren’t the kind that satisfy the people who ask me these questions. Instead, having had many of these conversations, I feel like the person asking already “knows” what my response should be:

Isn’t it ob­vi­ous that school sucks? That we should be teach­ing crit­i­cal think­ing and prob­lem-solv­ing skills in­stead of use­less facts and the­o­ries? That school should be more like real life, with real-world pro­jects and ex­per­i­ments and col­lab­o­ra­tion? That there should be less of that stuffy work of sit­ting at a desk and mem­o­riz­ing things?

If you had asked me this ques­tion years ago, I prob­a­bly would have agreed with you. It took read­ing a lot of re­search to con­vince me that this in­tu­itively ap­peal­ing idea is ac­tu­ally bad. Below, I’d like to ex­plain why.

First, the Evidence

Before I get into the ex­pla­na­tion of why these kinds of seem­ingly-good strate­gies don’t work, I should be­gin by point­ing out that these ideas are not new. They have been tried, and they have been found want­ing.

Entire books have been written pointing out the flaws in many of these strategies. I won’t be able to do the full debate justice here, but, if you’re interested, you can check out Daniel Willingham’s Why Don’t Students Like School?, Greg Ashman’s The Power of Explicit Teaching and Direct Instruction, or, if you want to learn more about the actual debate between proponents of both sides, Constructivist Instruction: Success or Failure.

To briefly re­cap some of the ev­i­dence:

Project Follow Through was one of the largest ed­u­ca­tional ex­per­i­ments ever con­ducted. Run in the 1970s, it com­pared how dif­fer­ent teach­ing method­olo­gies im­pact stu­dent out­comes. Direct Instruction, a method of teach­ing that has stu­dents sit in desks and per­form ex­tremely struc­tured drills in uni­son, per­formed best.

Problem-based learn­ing tends to do worse than tra­di­tional school­ing in med­ical ed­u­ca­tion. An in­flu­en­tial meta-analy­sis by Albanese and Mitchell, for in­stance, found that stu­dents re­quired more time study­ing, had worse exam scores and or­dered more un­nec­es­sary tests com­pared to tra­di­tion­ally taught stu­dents.

Despite needing to relearn this truth every few decades, the best way to teach kids how to read has been known for centuries: break down the sound-spelling correspondence, and do lots of practice on it before moving up to authentic texts. Approaches based on skipping these drills in favor of “inspiring a love of reading” do worse.

Practice test­ing and dis­trib­uted prac­tice—ba­si­cally, hav­ing reg­u­lar quizzes spread out over a course—are the study­ing meth­ods with the best em­pir­i­cal sup­port. Fancy meth­ods like mnemon­ics and con­cept maps fare worse.

General problem solving abilities are neither learned nor taught. While some problem-solving methods have broader applicability than others (such as the scientific process of hypothesis testing), students learn these methods better when they’re explicitly taught than when they are simply given projects in the hope that they’ll reinvent the methods on their own.

In short, whenever we have high-quality evidence that rigorously compares two teaching methods, the research invariably favors strong, direct instruction plus practice.[1] Or, in other words, the exact stereotype of schooling that so many of the people asking me about school reform despise.

Your Stereotype of School is an Endangered Species

This does­n’t mean ed­u­ca­tion could­n’t be bet­ter. My im­pres­sion upon first en­coun­ter­ing the Direct Instruction re­search was that I had never been taught this way in my en­tire life.

Clichés are of­ten out of date. I went to grade school in the nineties, when the lofty aims of pro­ject-based and dis­cov­ery learn­ing were the ed­u­ca­tional or­tho­doxy. I spent a lot of my school years do­ing time-con­sum­ing pro­jects that had us glu­ing and col­or­ing and ex­pected us to do our own re­search.

I’m pes­simistic about real re­form be­cause the changes needed to make schools more ef­fec­tive are of­ten op­po­site of what many peo­ple in­tu­itively feel. For schools to teach more ef­fec­tively, they should be more rig­or­ous about care­fully defin­ing the knowl­edge ob­jec­tives of the class, thor­oughly break­ing down com­plex skills into com­po­nents, and do­ing lots and lots and lots of prac­tice.

In short, a “better” school probably looks more like the stereotype of an old-fashioned schoolhouse with kids sitting at desks, drilling facts and concepts that are patiently explained by a teacher. To the extent that school becomes more like free play, project-building or acting like a scientist, it will probably be worse.

Why It’s Hard to Improve Schools

Schools face a num­ber of prac­ti­cal con­straints that make thor­ough re­form dif­fi­cult. Students are un­mo­ti­vated. They range in back­ground knowl­edge and in­nate abil­ity. We care about sort­ing just as much as ed­u­cat­ing, so schools end up do­ing both.

But the real rea­son it’s hard to im­prove schools is sim­ply that there are fun­da­men­tal con­straints on how the brain learns that pre­vent rad­i­cal short­cuts.

The boring truth is that expertise in most subjects is largely a matter of having an enormous library of knowledge and skill. For example, if you want to learn a language, you need to learn a lot of words. Any method that tries to skip over the fact that there are tens of thousands of words to learn is doomed to failure. All skills are like this; it’s simply that the “atoms” of learning are usually less obvious than in languages.

When stu­dents com­plain about all the bor­ing facts and skills they had to learn in school, my re­sponse is to claim that there is­n’t any other type! All skills are sim­ply an ac­cu­mu­la­tion of small bits of facts, pro­ce­dures and con­cepts.

Those small bits, in iso­la­tion, seem kind of triv­ial. But quan­tity has a qual­ity all its own, and with enough well-in­te­grated knowl­edge the re­sult is ex­per­tise that seems al­most mag­i­cal to those who don’t pos­sess it.

This means that im­prov­ing ed­u­ca­tion comes down to largely two dif­fer­ent op­tions:

First, you can in­crease the ef­fi­ciency of the sys­tem. Efficiency here looks like the kind of fac­tory re­design that in­creases prod­uct through­put—in­creas­ing the num­ber of words learned per day, op­ti­miz­ing cog­ni­tive load, boost­ing mnemonic ef­fi­ciency through spac­ing and re­trieval—with­out skip­ping over the fun­da­men­tal bot­tle­neck in cog­ni­tion.

Second, you can choose to learn different things. Given the high degree of specificity of most knowledge, the choice of what to learn can have profound consequences. But if choosing a pedagogical method is contentious, curricular choice is even more so! For every “useless” subject that reformers want to discard, there are die-hard advocates arguing that we should be putting in even more.

I be­lieve in both of these things, and I’ve fo­cused much of my writ­ing ca­reer on how we can do them bet­ter, par­tic­u­larly out­side of the typ­i­cal class­room. But if you want lots of skills, there’s no way around learn­ing a lot of stuff—in­clud­ing a ton of stuff that feels too ob­scure to be broadly use­ful.

What About Ed Tech?

Thus far, I’ve mostly been tar­get­ing a cer­tain kind of ques­tioner: some­one who feels that school was maybe too bor­ing and im­prac­ti­cal, and who longs for the pos­si­bil­ity that ed­u­ca­tion could be more like play and less like study­ing.

There are re­form­ers of all stripes, and ed­u­ca­tional tech­nol­o­gists are an­other side of this de­bate. These are the peo­ple who cham­pion ef­forts to gam­ify learn­ing, care­fully match teach­ing to each stu­den­t’s abil­ity level, de­velop AI-based tu­tor­ing, put an iPad in every child’s hands and so on.

In the­ory, these ideas are pos­si­bly use­ful. Drills can be bor­ing, so wrap­ping them with gam­i­fi­ca­tion el­e­ments that re­ward progress and en­gage­ment might be help­ful. Skills can be too hard or too easy, so ad­just­ing dif­fi­culty au­to­mat­i­cally might be help­ful. AI-tutoring, too, might help with clos­ing Bloom’s fa­mous 2 sigma prob­lem.

But I’m more skep­ti­cal in prac­tice. As Kelsey Piper writes, a lot of ed-tech games have a fairly low den­sity of ac­tual use­ful learn­ing. I can at­test to this: ea­ger to give my son a head start on the pho­netic skills in­volved in read­ing, I tried a few dif­fer­ent iPad games with him. He mostly messed around ran­domly un­til he got the re­ward, largely ig­nor­ing the ed­u­ca­tional con­tent to fix­ate on the cute car­toon char­ac­ters.

Gamified learn­ing is a bit like wrap­ping med­i­cine in candy. Yes, it may help some stu­dents swal­low some in­struc­tion they oth­er­wise find bit­ter, but in prac­tice it’s easy to pull off the candy, con­sume it, and throw the med­i­cine away.

Individualized instruction aided by technology does solve some of the problems of differing ability levels. But schools aren’t just solving the cognitive problems of learning; they’re also working on motivational ones. A rigorous, but achievable, standard that applies to everyone may be more sustainable for motivation than an individually-tailored goal for each student.[2]

Similarly, while I’m hopeful that AI advances will make automated tutoring more useful, it’s still far away from the skill a teacher can provide. As someone who makes use of AI quite a bit in my own learning, I can say that it’s still relatively weak at having a good model of an individual’s skill gaps and conceptual weaknesses. It’s very much at the “better than nothing” stage right now, not the “better than teachers” stage.

So, while I’m hope­ful that there will be some im­prove­ments in tech­nol­ogy around the mar­gins, I’m skep­ti­cal of any­thing touted as a rad­i­cal over­haul in ed­u­ca­tional process or out­comes.

What About Ultralearning?

Does this line of think­ing rule out the meth­ods I de­scribe in Ultralearning? I don’t think so. The big dis­tinc­tion (and it is a big one) be­tween my aims for that book and the aims of ed­u­ca­tional re­form­ers is that I started with the as­sump­tion of a highly mo­ti­vated learner.

The peo­ple I doc­u­ment in that book all be­gan with the start­ing point that they were will­ing to work hard, even ob­ses­sively, on a pro­ject they were deeply mo­ti­vated to suc­ceed with. In such cases, the class­room struc­tures that fa­cil­i­tate mo­ti­va­tion can in­stead be­come ob­sta­cles: fixed home­work as­sign­ments, manda­tory lec­tures, exam dead­lines. These things keep un­in­ter­ested stu­dents go­ing, but they may hold back the ag­gres­sively cu­ri­ous.

For in­stance, I still be­lieve that full im­mer­sion is the best way to learn a lan­guage, pro­vided you also aug­ment that with a lot of the study­ing ap­proaches I de­scribe above, but it’s ob­vi­ously a high-ef­fort strat­egy. I’ve spo­ken to lots of peo­ple who have asked me for ad­vice on how to learn a lan­guage, but rel­a­tively few that took up Vat’s and my ac­tual strat­egy of avoid­ing speak­ing English.

Ultimately, I be­lieve en­hanced learn­ing is cer­tainly pos­si­ble, and a highly mo­ti­vated per­son can of­ten do bet­ter than the av­er­age, and some­times even the up­per end, of what is typ­i­cally seen in school. But such op­ti­mism about the pos­si­bil­ity of learn­ing does­n’t so eas­ily trans­fer to a sit­u­a­tion where mo­ti­va­tion is much lower.

Footnotes

[1] Manu Kapur’s work on productive failure doesn’t undermine this finding, contrary to some misinterpretations. Kapur’s research simply finds that the timing of instruction has some effects. Sometimes, for certain kinds of skills, in certain kinds of environments, attempting to solve a problem first and failing can be helpful for later understanding the solution procedure that is fully taught.

[2] Researcher Greg Ashman makes a good argument against common pleas to individualize instruction in his book.

Detecting, Characterizing, and Identifying a Powerful Space-Based GNSS Interference Source

arxiv.org
