10 interesting stories served every morning and every evening.

This Alberta Startup Sells No-Tech Tractors for Half Price

wheelfront.com


Four hun­dred in­quiries from American farm­ers poured in af­ter a sin­gle in­ter­view. Not for a John Deere. Not for a Case IH. For a trac­tor built in Alberta with a re­man­u­fac­tured 1990s diesel en­gine and zero elec­tron­ics.

Ursa Ag, a small Canadian man­u­fac­turer, is as­sem­bling trac­tors pow­ered by 12-valve Cummins en­gines — the same me­chan­i­cally in­jected work­horses that pow­ered com­bines and pickup trucks decades ago — and sell­ing them for roughly half the price of com­pa­ra­ble ma­chines from es­tab­lished brands. The 150-horsepower model starts at $129,900 CAD, about $95,000 USD. The range-top­ping 260-hp ver­sion runs $199,900 CAD, around $146,000.

Try find­ing a sim­i­larly pow­ered John Deere for that money.

Owner Doug Wilson is­n’t pre­tend­ing this is cut­ting-edge tech­nol­ogy. That’s the en­tire point. The 150-hp and 180-hp mod­els use re­man­u­fac­tured 5.9-liter Cummins en­gines, while the 260-hp gets an 8.3-liter unit.

All are fed by Bosch P-pumps — purely me­chan­i­cal fuel in­jec­tion, no ECU, no pro­pri­etary soft­ware hand­shake re­quired. The cabs are sourced ex­ter­nally and stripped to es­sen­tials: an air ride seat, me­chan­i­cally con­nected con­trols, and noth­ing re­sem­bling a touch­screen.

This plays di­rectly into a fight that has been sim­mer­ing for years. John Deere’s right-to-re­pair bat­tles be­came a na­tional story when farm­ers dis­cov­ered they could­n’t fix their own equip­ment with­out dealer-au­tho­rized soft­ware. Lawsuits fol­lowed, then leg­is­la­tion.

Deere even­tu­ally made con­ces­sions, but the dam­age was done. A gen­er­a­tion of farm­ers learned ex­actly how much con­trol they’d sur­ren­dered by buy­ing ma­chines loaded with pro­pri­etary code.

Wilson saw the gap and drove a trac­tor through it. The 12-valve Cummins is ar­guably the most widely un­der­stood diesel en­gine in North America. Every in­de­pen­dent shop, every shade-tree me­chanic with a set of wrenches, every farmer who grew up turn­ing bolts has en­coun­tered one.

Parts sit on shelves in thou­sands of stores. Downtime — the thing that ac­tu­ally costs a farmer money dur­ing plant­ing or har­vest — shrinks dra­mat­i­cally when you don’t need a fac­tory tech­ni­cian with a lap­top to di­ag­nose a fuel de­liv­ery prob­lem.

Ursa Ag’s dealer net­work re­mains tiny, and the com­pany sells di­rect. Wilson ad­mit­ted they haven’t scaled up dis­tri­b­u­tion be­cause they can’t keep shelves stocked as it stands. He says 2026 pro­duc­tion will ex­ceed the com­pa­ny’s en­tire cu­mu­la­tive out­put, which is a bold claim from a small op­er­a­tion, and whether they can ac­tu­ally de­liver is the sin­gle biggest ques­tion hang­ing over this story.

The U.S. market is where things get interesting. Ursa Ag has no American distributors yet, though Wilson says that's likely to change. “The easiest answer is yes, we can ship to the United States,” he told reporters.

Those 400 American in­quiries af­ter one Farms.com seg­ment sug­gest the ap­petite is real. Farmers who have been buy­ing 30-year-old equip­ment to avoid mod­ern com­plex­ity now have a new al­ter­na­tive — a ma­chine with fresh sheet metal, a war­ranty, and an en­gine phi­los­o­phy rooted firmly in the past.

There’s a rea­son the used trac­tor mar­ket has been so ro­bust. Plenty of op­er­a­tors looked at a $300,000 ma­chine full of sen­sors and soft­ware and de­cided a well-main­tained older unit was the smarter bet. Ursa Ag is man­u­fac­tur­ing that bet from scratch.

Whether a small Alberta com­pany can scale fast enough to meet de­mand from an en­tire con­ti­nent is an­other mat­ter. The big man­u­fac­tur­ers have sup­ply chains, dealer net­works, and fi­nanc­ing arms that took decades to build. Wilson has re­man­u­fac­tured Cummins en­gines and a value propo­si­tion that res­onates with any­one who has ever waited three days for a dealer tech to show up with a di­ag­nos­tic ca­ble.

The farm equip­ment in­dus­try spent 20 years adding com­plex­ity and cost. Ursa Ag is wa­ger­ing that a sig­nif­i­cant num­ber of farm­ers never wanted any of it.


Qwen Studio

qwen.ai

We Found a Stable Firefox Identifier Linking All Your Private Tor Identities

fingerprint.com

We re­cently dis­cov­ered a pri­vacy vul­ner­a­bil­ity af­fect­ing all Firefox-based browsers. The is­sue al­lows web­sites to de­rive a unique, de­ter­min­is­tic, and sta­ble process-life­time iden­ti­fier from the or­der of en­tries re­turned by IndexedDB, even in con­texts where users ex­pect stronger iso­la­tion.

This means a website can create a set of IndexedDB databases, inspect the returned ordering, and use that ordering as a fingerprint for the running browser process. Because the behavior is process-scoped rather than origin-scoped, unrelated websites can independently observe the same identifier and link activity across origins during the same browser runtime. In Firefox Private Browsing mode, the identifier can also persist after all private windows are closed, as long as the Firefox process remains running. In Tor Browser, the stable identifier persists even through the “New Identity” feature, which is designed to be a full reset that clears cookies and browser history and uses new Tor circuits. The feature is described as being “for users who want to prevent [their] subsequent browser activity from being linkable to what [they] were doing before.” This vulnerability effectively defeats the isolation guarantees users rely on for unlinkability.

We re­spon­si­bly dis­closed the is­sue to Mozilla and to the Tor Project. Mozilla has quickly re­leased the fix in Firefox 150 and ESR 140.10.0, and the patch is tracked in Mozilla Bug 2024220. The un­der­ly­ing root cause is in­her­ited by Tor Browser through Gecko’s IndexedDB im­ple­men­ta­tion, so the is­sue is rel­e­vant to both prod­ucts and to all Firefox-based browsers.

The fix is straight­for­ward in prin­ci­ple: the browser should not ex­pose in­ter­nal stor­age or­der­ing that re­flects process-scoped state. Canonicalizing or sort­ing re­sults be­fore re­turn­ing them re­moves the en­tropy and pre­vents this API from act­ing as a sta­ble iden­ti­fier.

Why this mat­ters

Private brows­ing modes and pri­vacy-fo­cused browsers are de­signed to re­duce web­sites’ abil­ity to iden­tify users across con­texts. Users gen­er­ally ex­pect two things:

First, un­re­lated web­sites should not be able to tell they are in­ter­act­ing with the same browser in­stance un­less a shared stor­age or ex­plicit iden­tity mech­a­nism is in­volved.

Second, when a pri­vate ses­sion ends, the state as­so­ci­ated with that ses­sion should dis­ap­pear.

This is­sue breaks both ex­pec­ta­tions. A web­site does not need cook­ies, lo­cal­Stor­age, or any ex­plicit cross-site chan­nel. Instead, it can rely on the browser’s own in­ter­nal stor­age be­hav­ior to de­rive a high-ca­pac­ity iden­ti­fier from the or­der­ing of data­base names re­turned by an API.

For de­vel­op­ers, this is a use­ful re­minder that pri­vacy bugs do not al­ways come from di­rect ac­cess to iden­ti­fy­ing data. Sometimes they come from de­ter­min­is­tic ex­po­sure of in­ter­nal im­ple­men­ta­tion de­tails.

For se­cu­rity and prod­uct stake­hold­ers, the key point is sim­ple: even an API that ap­pears harm­less can be­come a cross-site track­ing vec­tor if it leaks sta­ble process-level state.

What is IndexedDB and what does in­dexedDB.data­bases() do?

IndexedDB is a browser API for stor­ing struc­tured data on the client side. Web ap­pli­ca­tions use it for of­fline sup­port, caching, ses­sion state, and other lo­cal stor­age needs. Each ori­gin can cre­ate one or more named data­bases, which can hold ob­ject stores and large amounts of data.

The indexedDB.databases() API re­turns meta­data about the data­bases vis­i­ble to the cur­rent ori­gin. In prac­tice, de­vel­op­ers might use it to in­spect ex­ist­ing data­bases, de­bug stor­age us­age, or man­age ap­pli­ca­tion state.

Under nor­mal pri­vacy ex­pec­ta­tions, the or­der of re­sults re­turned by this API should not, in it­self, carry iden­ti­fy­ing in­for­ma­tion. It should sim­ply re­flect a neu­tral, canon­i­cal, or oth­er­wise non-sen­si­tive pre­sen­ta­tion of data­base meta­data.

The is­sue we found comes from the fact that, in all Firefox-based browsers, the re­turned or­der was not neu­tral at all.

How indexedDB.databases() became a sta­ble iden­ti­fier

In Firefox Private Browsing mode, indexedDB.databases() returns database metadata in an order derived from internal storage structures rather than from database creation order.

The rel­e­vant im­ple­men­ta­tion is in dom/​in­dexedDB/​Ac­tors­Par­ent.cpp.

In Private Browsing mode, data­base names are not used di­rectly as on-disk iden­ti­fiers. Instead, they are mapped to UUID-based file­name bases via a global hash table:

using StorageDatabaseNameHashtable = nsTHashMap<nsString, nsString>;

StaticAutoPtr<StorageDatabaseNameHashtable> gStorageDatabaseNameHashtable;

The map­ping is per­formed in­side GetDatabaseFilenameBase() called within OpenDatabaseOp::DoDatabaseWork().

When aIsPrivate is true, the website-provided database name is replaced with a generated UUID and stored in the global StorageDatabaseNameHashtable. This mapping:

Is keyed only by the data­base name string

Persists for the life­time of the IndexedDB QuotaClient

Is shared across all ori­gins

Is cleared only when Firefox is fully restarted

Later, when indexedDB.databases() is invoked, Firefox gathers database filenames via QuotaClient::GetDatabaseFilenames(…) called in GetDatabasesOp::DoDatabaseWork().

Database base names are inserted into an nsTHashSet.

No sort­ing is per­formed be­fore it­er­a­tion. The fi­nal re­sult or­der is de­ter­mined by it­er­a­tion over the hash set’s in­ter­nal bucket lay­out.

Because UUID map­pings are sta­ble for the life­time of the Firefox process, and hash table struc­ture and it­er­a­tion or­der are de­ter­min­is­tic for a given in­ter­nal lay­out, the re­turned or­der­ing be­comes a de­ter­min­is­tic func­tion of the gen­er­ated UUID val­ues, hash func­tion be­hav­ior, and hash table ca­pac­ity and in­ser­tion his­tory. This or­der­ing per­sists across tabs and pri­vate win­dows, re­set­ting only upon a full Firefox restart. Crucially, the UUID map­ping and hash set it­er­a­tion are not ori­gin-scoped. They are process-scoped.

Reproducing the is­sue

A sim­ple proof of con­cept is enough to demon­strate the be­hav­ior. Two dif­fer­ent ori­gins host the same script. Each script:

Creates a fixed set of named data­bases.

Calls indexedDB.databases().

Extracts and prints the re­turned or­der.

In af­fected Firefox Private Browsing and Tor Browser builds, both ori­gins ob­serve the same per­mu­ta­tion dur­ing the life­time of the same browser process. Restarting the browser changes the per­mu­ta­tion.

Conceptually, the out­put looks like this:

cre­ated:

a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p

listed:

g,c,p,a,l,f,n,d,j,b,o,h,e,m,i,k

The im­por­tant point is not the ex­act or­der it­self, but rather that the or­der is not the orig­i­nal cre­ation or­der, that the same or­der ap­pears across un­re­lated ori­gins, and it per­sists across re­loads and new pri­vate win­dows, even af­ter all pri­vate win­dows are closed. Only a full browser restart yields a new one. That is ex­actly what you do not want from a pri­vacy per­spec­tive.

Privacy im­pact

This is­sue en­ables both cross-ori­gin and same-ori­gin track­ing within a sin­gle browser run­time.

Cross-origin im­pact

Unrelated web­sites can in­de­pen­dently de­rive the same iden­ti­fier and in­fer that they are in­ter­act­ing with the same run­ning Firefox or Tor Browser process. That lets them link ac­tiv­ity across do­mains with­out cook­ies or other shared stor­age.

Same-origin im­pact

In Firefox Private Browsing mode, the identifier can persist even after all private windows are closed, provided the Firefox process itself is still running. That means a site can recognize a later visit in what appears to be a fresh private session. In Tor Browser, the stable identifier effectively defeats Tor Browser's “New Identity” isolation within a running browser process, allowing websites to link sessions that are expected to be fully isolated from one another.

Why this is es­pe­cially se­ri­ous in Tor Browser

Tor Browser is specif­i­cally de­signed to re­duce cross-site link­a­bil­ity and min­i­mize browser-in­stance-level iden­tity. A sta­ble process-life­time iden­ti­fier cuts di­rectly against that de­sign goal. Even if it only sur­vives un­til a full process restart, that is still enough to weaken un­link­a­bil­ity dur­ing ac­tive use.

Entropy and fin­ger­print­ing ca­pac­ity

The sig­nal is not just sta­ble. It also has high ca­pac­ity.

If a site con­trols N data­base names, then the num­ber of pos­si­ble ob­serv­able per­mu­ta­tions is N!, with the­o­ret­i­cal en­tropy of log2(N!). With 16 con­trolled names, the the­o­ret­i­cal space is about 44 bits. That is far more than enough to dis­tin­guish re­al­is­tic num­bers of con­cur­rent browser in­stances in prac­tice.
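As a quick check on that figure, the theoretical entropy log2(N!) can be computed directly (Python):

import math

# Theoretical capacity of the ordering signal: with N controlled database
# names, a site can observe one of N! permutations, i.e. log2(N!) bits.
for n in (8, 16, 32):
    print(f"N = {n:>2}: {math.log2(math.factorial(n)):6.1f} bits")

# N = 8  ->  ~15.3 bits
# N = 16 ->  ~44.3 bits (the ~44 bits quoted above)
# N = 32 -> ~117.7 bits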

The ex­act num­ber of reach­able per­mu­ta­tions may be some­what lower be­cause of in­ter­nal hash table be­hav­ior, but that does not ma­te­ri­ally change the se­cu­rity story. The ex­posed or­der­ing still pro­vides more than enough en­tropy to act as a strong iden­ti­fier.

The fix

The right fix is to stop ex­pos­ing en­tropy de­rived from the in­ter­nal stor­age lay­out.

The cleanest mitigation is to return results in a canonical order, such as lexicographic sorting. That preserves the API's usefulness for developers while removing the fingerprinting signal. Randomizing output per call could also hide the stable ordering, but sorting is simpler, more predictable, and easier for developers to reason about.
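To make the mitigation concrete, here is a toy Python simulation (not the actual Gecko code): the unsorted listing order is a function of per-process state only, so any two origins in the same process derive the same fingerprint, while canonical sorting collapses every process to the same order and leaves nothing to fingerprint.

import hashlib
import random

DB_NAMES = [f"db{i:02d}" for i in range(16)]

def listing_order(process_seed, sort_results=False):
    """Toy stand-in for indexedDB.databases(): without sorting, the order
    depends only on per-process state (the seed), not on the calling origin."""
    order = DB_NAMES[:]
    random.Random(process_seed).shuffle(order)
    if sort_results:
        order.sort()  # the canonicalization fix
    return order

def fingerprint(order):
    return hashlib.sha256(",".join(order).encode()).hexdigest()[:12]

# Two unrelated origins in the same browser process observe the same permutation...
print(fingerprint(listing_order(process_seed=1234)))
print(fingerprint(listing_order(process_seed=1234)))
# ...while a restarted process yields a different one.
print(fingerprint(listing_order(process_seed=9999)))
# With sorting, every process returns the same canonical order: no identifier left.
print(fingerprint(listing_order(1234, sort_results=True)))
print(fingerprint(listing_order(9999, sort_results=True)))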

From a security engineering standpoint, this is an ideal fix:

Low con­cep­tual com­plex­ity

Minimal com­pat­i­bil­ity risk

Direct elim­i­na­tion of the pri­vacy leak

Responsible dis­clo­sure

We re­spon­si­bly dis­closed the is­sue to Mozilla and to the Tor Project. Mozilla has re­leased the fix in Firefox 150 and ESR 140.10.0, and the patch is tracked in Mozilla Bug 2024220. Because the be­hav­ior orig­i­nates from Gecko’s IndexedDB im­ple­men­ta­tion, down­stream Gecko-based browsers, in­clud­ing Tor Browser, are also af­fected un­less they ap­ply their own mit­i­ga­tion.

Building for pri­vacy

This vul­ner­a­bil­ity shows how a small im­ple­men­ta­tion de­tail can cre­ate a mean­ing­ful pri­vacy prob­lem. The im­pact is sig­nif­i­cant. Unrelated web­sites can link ac­tiv­ity across ori­gins dur­ing the same browser run­time, and pri­vate-ses­sion bound­aries are weak­ened be­cause the iden­ti­fier sur­vives longer than users would ex­pect.

The good news is that the fix is sim­ple and ef­fec­tive. By canon­i­cal­iz­ing the out­put be­fore re­turn­ing it, browsers can elim­i­nate this source of en­tropy and re­store the ex­pected pri­vacy bound­ary. This is ex­actly the kind of is­sue worth pay­ing at­ten­tion to: sub­tle, easy to miss, and highly in­struc­tive for any­one build­ing pri­vacy-sen­si­tive browser fea­tures.

5x5 Pixel font for tiny screens (Maurycy's blog)

maurycyz.com

All char­ac­ters fit within a 5 pixel square, and are safe to draw on a 6x6 grid.

The de­sign is based off of lcamtuf’s 5x6 font-in­line.h, which is it­self in­spired by the ZX Spectrum’s 8x8 font.

5x5 is the small­est size that does­n’t com­pro­mise leg­i­bil­ity:

2x2: Impossible.

3x3: Technically pos­si­ble, but un­read­able.

4x4: Not enough to draw “E”, “M” or “W” properly.

5x5: This font.

Five by five is ac­tu­ally big enough to draw most low­er­case let­ters one pixel smaller, mak­ing them vi­su­ally dis­tinct from up­per­case.

Narrower 4x5 and 3x5 dimensions are possible, but would require sacrificing the “M” and the dotted zero, and would reduce U/V/Y distinctiveness.

There's no artistic reason to make all characters five wide just because a few must be… but using a constant width makes programming a lot easier:

The length of a string on screen is al­ways 6 times the num­ber of char­ac­ters.

It also makes com­pact lay­outs much safer:

There's no need to worry that a number will overflow because “8978” is longer than “1111”.

The whole font takes up just 350 bytes of memory, which makes it ideally suited to 8-bit microcontrollers like the AVR128DA28 (16 kB of RAM). These are cheap, low power and robust… but they fall short on graphics: even a low-resolution 384x288 display has 110 thousand pixels, way too big to fit in the AVR's memory.

… except most projects don't need anywhere near that many pixels. A 160x128 or 128x64 OLED is more practical and cheaper — but these need hand-drawn, pixel-efficient fonts to make good use of them.
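For a rough sense of the arithmetic (my numbers, assuming 1-bit-per-pixel framebuffers and one byte per glyph row):

# Back-of-the-envelope memory math, assuming 1 bit per pixel and
# one byte per glyph row (5 of the 8 bits used).
def framebuffer_bytes(width, height):
    return width * height // 8

print(framebuffer_bytes(384, 288))  # 13824 bytes: nearly all of the AVR128DA28's 16 kB of RAM
print(framebuffer_bytes(160, 128))  # 2560 bytes
print(framebuffer_bytes(128, 64))   # 1024 bytes

# A 5x5 glyph stored as 5 row-bytes costs 5 bytes, so ~70 glyphs come to ~350 bytes.
print(5 * 70)                       # 350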

For ref­er­ence, here’s a vec­tor font ren­dered at a sim­i­lar scale:

Antialiasing, several megabytes of code, a megabyte of font data, and it's still terrible compared to 350 hand-crafted bytes.

Real pix­els:

Pixels aren’t per­fect squares, so the font won’t ac­tu­ally look like the ren­der­ing at the top of this post:

This is it on an ac­tual screen:

I ac­tu­ally re­ally like the pseudo-drop­shadow ef­fect cre­ated by the sub­pix­els.

This won’t hap­pen on mono­chrome dis­plays, but the font will still look smoother than you might ex­pect.

The gaps between pixels really help sell the “e” and “g”, but this same effect should allow…

Even smaller fonts:

While 5x5 is the small­est no-com­pro­mise res­o­lu­tion, a 3x5 is­n’t too bad:

The “M”, “W” and “Q” suffer, but it's still got a distinct O and zero.

Something like this might ac­tu­ally be a good op­tion if you need to cram (50%) more columns into a dis­play.

That’s still read­able, so what about 3x4?

At this size, there’s no way to have a dis­tinct up­per and low­er­case, so I’ve picked what­ever style works the best in the lim­ited space.

The num­bers have also taken a hit, but still work ok.

How about 3x3?

The main loss was the num­bers, but the let­ters don’t in­clude any du­pli­cates and are some­what rec­og­niz­able.

This font is hugely im­proved by be­ing dis­played on real hard­ware:

That means it’s still too big. How about 2x3?

Ok, this is get­ting ridicu­lous.

Most let­ters are un­rec­og­niz­able, and there are quite a few du­pli­cates.

In case you couldn't tell, the bottom line reads “Hello World”.

Flipping the as­pect ra­tio to a 3x2 makes it a lot bet­ter:

More letters have horizontal detail (M, W, N, Q, G, P, etc.) than have vertical detail (E, F).

The bottom line reads “you can probably read this”, although you might have to squint or zoom out.

… and for the sake of com­plete­ness, a 2x2:

On pa­per, there are 16 pos­si­ble 2x2 im­ages, but one of them is blank and 5 of them are shifted copies of an­other one.

That leaves 10, just enough to do all the digits… but because they have no resemblance to the originals, it's more of a secret code than a font.
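That count is easy to check by brute force: enumerate all 16 patterns, drop the blank one, and collapse patterns that are translations of one another (a quick Python sketch):

from itertools import product

# All 16 possible 2x2 monochrome patterns, as sets of lit (row, col) pixels.
patterns = []
for bits in product([0, 1], repeat=4):
    lit = {(r, c) for i, (r, c) in enumerate(product(range(2), range(2))) if bits[i]}
    patterns.append(frozenset(lit))

def normalize(p):
    """Translate a non-empty pattern so its lit pixels start at row 0, column 0."""
    r0 = min(r for r, _ in p)
    c0 = min(c for _, c in p)
    return frozenset((r - r0, c - c0) for r, c in p)

non_blank = [p for p in patterns if p]
distinct = {normalize(p) for p in non_blank}

print(len(patterns))   # 16 total patterns
print(len(non_blank))  # 15 non-blank
print(len(distinct))   # 10 distinct up to shifting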

Related:

/projects/mcufont/mcufont.h: The 5x5 font.

/projects/mcufont/test.c: Program to pre­view the font.

https://​lcamtuf.core­dump.cx/​soft/​em­bed­ded/​font-in­line.h: The orig­i­nal font.

https://​moon­bench.xyz/​pro­jects/​tiny-pixel-art-fonts/: More tiny fonts.

Apple fixes bug that cops used to extract deleted chat messages from iPhones

techcrunch.com

12:13 PM PDT · April 22, 2026

Apple re­leased a soft­ware up­date on Wednesday for iPhones and iPads fix­ing a bug that al­lowed law en­force­ment to ex­tract mes­sages that had been deleted or dis­ap­peared au­to­mat­i­cally from mes­sag­ing apps. This was be­cause no­ti­fi­ca­tions that dis­played the mes­sages’ con­tent were also cached on the de­vice for up to a month.

In a security notice on its website, Apple said that the bug meant “notifications marked for deletion could be unexpectedly retained on the device.”

This is a clear ref­er­ence to an is­sue re­vealed by 404 Media ear­lier this month. The in­de­pen­dent news out­let re­ported that the FBI had been able to ex­tract deleted Signal mes­sages from some­one’s iPhone us­ing foren­sic tools, due to the fact that the con­tent of the mes­sages had been dis­played in a no­ti­fi­ca­tion and then stored in­side a phone’s data­base — even af­ter the mes­sages were deleted in­side Signal.

After the news, Signal president Meredith Whittaker said the messaging app maker asked Apple to address the issue. “Notifications for deleted messages shouldn't remain in any OS notification database,” Whittaker wrote in a post on Bluesky.


It’s un­clear why the no­ti­fi­ca­tions’ con­tent was logged to be­gin with, but to­day’s fix sug­gests it was a bug.

Apple did not im­me­di­ately re­spond to a re­quest for com­ment ask­ing why the no­ti­fi­ca­tions were be­ing re­tained. The com­pany also back­ported the fix to iPhone and iPad own­ers run­ning the older iOS 18 soft­ware.

Privacy ac­tivists ex­pressed alarm when they learned that the FBI had found a way around a se­cu­rity fea­ture that is used daily by at-risk users. Signal, like other mes­sag­ing apps such as WhatsApp, al­lows users to set up a timer that in­structs the app to au­to­mat­i­cally delete mes­sages af­ter a set amount of time. This fea­ture can be help­ful for any­one who wants to keep their con­ver­sa­tions se­cret in the event that au­thor­i­ties seize their de­vices.


Telemetry

cli.github.com

GitHub CLI sends pseu­do­ny­mous teleme­try to help us im­prove the prod­uct. We want you to un­der­stand what is be­ing sent and why.

Why we col­lect teleme­try

As agen­tic adop­tion of GitHub CLI grows, our team needs vis­i­bil­ity into how fea­tures are be­ing used in prac­tice. We use this data to pri­or­i­tize our work and eval­u­ate whether fea­tures are meet­ing real user needs.

For ex­am­ple, when we ship a new sub­com­mand, we want to un­der­stand whether any­one is us­ing it and how. If adop­tion is low, we know we need to re­visit the fea­ture’s dis­cov­er­abil­ity or de­sign. If a sub­com­mand sees high us­age with cer­tain flags, that tells us where to in­vest in a bet­ter ex­pe­ri­ence.

What we col­lect

The fol­low­ing fields are in­cluded in teleme­try events. Fields with a Command Scope value are only sent for com­mands within that scope.

How to in­spect what’s be­ing sent

GitHub CLI is open source, so you can re­view the teleme­try im­ple­men­ta­tion in the cli/​cli repos­i­tory.

Additionally, you can en­able log­ging mode us­ing ei­ther an en­vi­ron­ment vari­able or con­fig­u­ra­tion op­tion to see ex­actly what would be sent with­out send­ing it.

1. Environment vari­able:

export GH_TELEMETRY=log

2. CLI con­fig:

gh config set telemetry log

In log­ging mode, the JSON pay­load that would nor­mally be sent is printed to stderr in­stead. This lets you in­spect every field be­fore de­cid­ing whether to keep teleme­try en­abled, for ex­am­ple:

$ GH_TELEMETRY=log gh pr edit 42 --title "bug fix" --body "fixed a bug"

Telemetry payload:

{
  "events": [
    {
      "type": "command_invocation",
      "dimensions": {
        "agent": "",
        "architecture": "arm64",
        "ci": "false",
        "command": "gh pr edit",
        "device_id": "d80dc1eb-5c66-4bcd-bbc8-568e173bb977",
        "flags": "body,title",
        "github_actions": "false",
        "invocation_id": "51b4383c-23b1-47da-91d7-dcc8aa79dd1c",
        "is_tty": "true",
        "os": "darwin",
        "timestamp": "2026-04-22T00:00:00.000Z",
        "version": "2.91.0"
      }
    }
  ]
}

Note that this command can only log telemetry for the exact command and context in which it ran. For example, changing environment variables or authenticated accounts may change the events and event dimensions that are included in the payload.

How to opt out

There are three ways to dis­able teleme­try:

1. Set the GH_TELEMETRY en­vi­ron­ment vari­able (any falsy value works: 0, false, dis­abled, or an empty string):

export GH_TELEMETRY=false

2. Use the DO_NOT_TRACK con­ven­tion:

export DO_NOT_TRACK=true

3. Use the CLI con­fig:

gh config set telemetry disabled

Note: The en­vi­ron­ment vari­ables (options 1 and 2) take prece­dence over the con­fig value.

Where data is sent

Telemetry events are sent to GitHub’s in­ter­nal an­a­lyt­ics in­fra­struc­ture. For more in­for­ma­tion about how GitHub han­dles your data, see the GitHub General Privacy Statement.

Additional in­for­ma­tion

GitHub CLI allows you to add features to the product by installing GitHub and third-party extensions, including agents. These extensions may collect their own usage data, which is not controlled by this opt-out. Consult the specific extension's documentation to learn about its telemetry reporting and whether it can be disabled.

This page de­scribes client-side data col­lec­tion for the GitHub CLI (gh). It does not ap­ply to GitHub Copilot or the Copilot CLI, which han­dle data col­lec­tion sep­a­rately. For in­for­ma­tion on the Copilot CLI, see Using GitHub Copilot CLI and Responsible Use of the GitHub Copilot CLI.

Our eighth generation TPUs: two chips for the agentic era

blog.google

The cul­mi­na­tion of a decade of de­vel­op­ment, TPU 8t and TPU 8i are cus­tom-en­gi­neered to power the next gen­er­a­tion of su­per­com­put­ing with ef­fi­ciency and scale.


Today at Google Cloud Next, we are introducing the eighth generation of Google's custom Tensor Processing Unit (TPU), coming soon with two distinct, purpose-built architectures for training and inference: TPU 8t and TPU 8i. These two chips are designed to power our custom-built supercomputers, driving everything from cutting-edge model training and agent development to massive inference workloads. TPUs have been powering leading foundation models, including Gemini, for years. Together, these 8th generation TPUs will deliver scale, efficiency and capabilities across training, serving and agentic workloads.

In this age of AI agents, mod­els must rea­son through prob­lems, ex­e­cute multi-step work­flows and learn from their own ac­tions in con­tin­u­ous loops. This places a new set of de­mands on in­fra­struc­ture, and TPU 8t and TPU 8i were de­signed in part­ner­ship with Google DeepMind to take on the most de­mand­ing AI work­loads and adapt to evolv­ing model ar­chi­tec­tures at scale.

TPUs set the stan­dard for a num­ber of ML su­per­com­put­ing com­po­nents in­clud­ing cus­tom nu­mer­ics, liq­uid cool­ing, cus­tom in­ter­con­nects and more, and our eighth gen­er­a­tion TPUs are the cul­mi­na­tion of more than a decade of de­vel­op­ment. The key in­sight be­hind the orig­i­nal TPU de­sign con­tin­ues to hold to­day: by cus­tomiz­ing and co-de­sign­ing sil­i­con with hard­ware, net­work­ing and soft­ware, in­clud­ing model ar­chi­tec­ture and ap­pli­ca­tion re­quire­ments, we can de­liver dra­mat­i­cally more power ef­fi­ciency and ab­solute per­for­mance.

We are thrilled to see how a decade of innovation translates into real-world breakthroughs. Today, pioneering organizations like Citadel Securities are pushing the boundaries of what's possible, choosing TPUs to power their cutting-edge AI workloads.

Two chips to meet the mo­ment

Hardware de­vel­op­ment cy­cles are much longer than soft­ware. With each gen­er­a­tion of TPUs, we need to con­sider what tech­nolo­gies and de­mands will ex­ist by the time they are brought to mar­ket. Several years ago, we an­tic­i­pated ris­ing de­mand for in­fer­ence from cus­tomers as fron­tier AI mod­els are de­ployed in pro­duc­tion and at scale. And with the rise of AI agents, we de­ter­mined the com­mu­nity would ben­e­fit from chips in­di­vid­u­ally spe­cial­ized to the needs of train­ing and serv­ing.

TPU 8t is designed with larger compute throughput and more scale-up bandwidth, and shines at massive, compute-intensive training workloads. TPU 8i is designed with more memory bandwidth to serve the most latency-sensitive inference workloads, which is critical because interactions between agents at scale magnify even small inefficiencies.

Importantly, both chips can run var­i­ous work­loads, but spe­cial­iza­tion un­locks sig­nif­i­cant ef­fi­cien­cies and gains.

TPU 8t: The train­ing pow­er­house

TPU 8t is built to re­duce the fron­tier model de­vel­op­ment cy­cle from months to weeks. By bal­anc­ing the high­est pos­si­ble com­pute through­put, shared mem­ory and in­ter­chip band­width with the best pos­si­ble power ef­fi­ciency and pro­duc­tive com­pute time, we have crafted a sys­tem that de­liv­ers nearly 3x the com­pute per­for­mance per pod over the pre­vi­ous gen­er­a­tion, en­abling faster in­no­va­tion to en­sure our cus­tomers con­tinue to set the pace for the in­dus­try.

Massive scale: A sin­gle TPU 8t su­per­pod now scales to 9,600 chips and two petabytes of shared high band­width mem­ory, with dou­ble the in­ter­chip band­width of the pre­vi­ous gen­er­a­tion. This ar­chi­tec­ture de­liv­ers 121 ExaFlops of com­pute and al­lows the most com­plex mod­els to lever­age a sin­gle, mas­sive pool of mem­ory.

Maximum uti­liza­tion: By also in­te­grat­ing 10x faster stor­age ac­cess, com­bined with TPUDirect to pull data di­rectly into the TPU, TPU 8t helps en­sure max­i­mum uti­liza­tion of the end-to-end sys­tem.

Near-linear scal­ing: Our new Virgo Network, com­bined with JAX and our Pathways soft­ware, means TPU 8t can pro­vide near-lin­ear scal­ing for up to a mil­lion chips in a sin­gle log­i­cal clus­ter.

In addition to raw performance, TPU 8t is engineered to target over 97% “goodput” — a measure of useful, productive compute time — through a comprehensive set of Reliability, Availability and Serviceability (RAS) capabilities. These include real-time telemetry across tens of thousands of chips, automatic detection and rerouting around faulty ICI links without interrupting a job, and Optical Circuit Switching (OCS) that reconfigures hardware around failures with no human intervention.

Every hard­ware fail­ure, net­work stall or check­point restart is time the clus­ter is not train­ing, and at fron­tier train­ing scale, every per­cent­age point can trans­late into days of ac­tive train­ing time.

TPU 8i: The rea­son­ing en­gine

In the agentic era, users expect to be able to ask questions, delegate tasks and get outcomes. TPU 8i is designed to handle the intricate, collaborative, iterative work of many specialized agents, often “swarming” together in complex flows to deliver solutions and insights for the most challenging tasks. We redesigned the stack to eliminate the “waiting room” effect through four key innovations:

Breaking the “memory wall”: To stop processors from sitting idle, TPU 8i pairs 288 GB of high-bandwidth memory with 384 MB of on-chip SRAM — 3x more than the previous generation — keeping a model's active working set entirely on-chip.

Axion-powered ef­fi­ciency: We dou­bled the phys­i­cal CPU hosts per server, mov­ing to our cus­tom Axion Arm-based CPUs. By us­ing a non-uni­form mem­ory ar­chi­tec­ture (NUMA) for iso­la­tion, we have op­ti­mized the full sys­tem for su­pe­rior per­for­mance.

Scaling MoE mod­els: For mod­ern Mixture of Expert (MoE) mod­els, we dou­bled the Interconnect (ICI) band­width to 19.2 Tb/s. Our new Boardfly ar­chi­tec­ture re­duces the max­i­mum net­work di­am­e­ter by more than 50%, en­sur­ing the sys­tem works as one co­he­sive, low-la­tency unit.

Eliminating lag: Our new on-chip Collectives Acceleration Engine (CAE) of­floads global op­er­a­tions, re­duc­ing on-chip la­tency by up to 5x, min­i­miz­ing lag.

These in­no­va­tions de­liver 80% bet­ter per­for­mance-per-dol­lar com­pared to the pre­vi­ous gen­er­a­tion, en­abling busi­nesses to serve nearly twice the cus­tomer vol­ume at the same cost.

TPU 8i hi­er­ar­chi­cal Boardfly topol­ogy build­ing up from a build­ing block of four fully con­nected chips into a fully con­nected group of eight boards, with 36 of such groups fully con­nected into a TPU 8i pod

Co-designed for Gemini, open for every­one

This eighth generation TPU is also the latest expression of our co-design philosophy, where every spec is built to solve AI's biggest hurdles.

Boardfly topol­ogy was de­signed specif­i­cally for the com­mu­ni­ca­tion de­mands of to­day’s most ca­pa­ble rea­son­ing mod­els.

SRAM ca­pac­ity in TPU 8i was sized for the KV cache foot­print of rea­son­ing mod­els at pro­duc­tion scale.

Virgo Network fab­ric’s band­width tar­gets were de­rived from the par­al­lelism re­quire­ments of tril­lion-pa­ra­me­ter train­ing.

And for the first time, both chips run on Google’s own Axion ARM-based CPU host, al­low­ing us to op­ti­mize the full sys­tem, not just the chip, for per­for­mance and ef­fi­ciency.

Both platforms support native JAX, MaxText, PyTorch, SGLang and vLLM — the frameworks developers already use — and offer bare metal access, giving customers direct hardware access without the overhead of virtualization. Open-source contributions, including MaxText reference implementations and Tunix for reinforcement learning, provide turnkey paths from capability to production deployment.

Designing for power ef­fi­ciency at scale

In to­day’s data cen­ters, power, not just chip sup­ply, is a bind­ing con­straint. To solve this, we have op­ti­mized ef­fi­ciency across the en­tire stack, with in­te­grated power man­age­ment that dy­nam­i­cally ad­justs the power draw based on real-time de­mand. TPU 8t and TPU 8i de­liver up to two times bet­ter per­for­mance-per-watt over the pre­vi­ous gen­er­a­tion, Ironwood.

But ef­fi­ciency at Google is not just a chip-level met­ric; it’s also a sys­tem-level com­mit­ment that runs from sil­i­con to the data cen­ter. For ex­am­ple, we in­te­grate net­work con­nec­tiv­ity with com­pute on the same chip, sig­nif­i­cantly re­duc­ing the power costs of mov­ing data across the TPU pod. Even our data cen­ters are co-de­signed with our TPUs. We in­no­vated across hard­ware and soft­ware to en­able our data cen­ters to de­liver six times more com­put­ing power per unit of elec­tric­ity than they did just five years ago.

TPU 8t and TPU 8i con­tinue that tra­jec­tory. Both are sup­ported by our fourth-gen­er­a­tion liq­uid cool­ing tech­nol­ogy that sus­tains per­for­mance den­si­ties air cool­ing can­not. By own­ing the full stack, from Axion host to ac­cel­er­a­tor, we can op­ti­mize sys­tem-level en­ergy ef­fi­ciency in ways that sim­ply can­not be achieved when the host and chip are de­signed in­de­pen­dently.

Google Cloud’s fourth gen­er­a­tion cool­ing dis­tri­b­u­tion unit

Infrastructure for the agen­tic era

Every ma­jor com­put­ing tran­si­tion has re­quired in­fra­struc­ture break­throughs, and the agen­tic era is no dif­fer­ent. Infrastructure must evolve to meet the de­mands of au­tonomous agents op­er­at­ing in con­tin­u­ous loops of rea­son­ing, plan­ning, ex­e­cu­tion and learn­ing.

TPU 8t and TPU 8i are our an­swer to this chal­lenge: two spe­cial­ized ar­chi­tec­tures built to re­de­fine what is pos­si­ble in AI, from build­ing the most ca­pa­ble AI mod­els, to swarms of agents per­fectly or­ches­trated, to man­ag­ing the most com­plex rea­son­ing tasks. Both chips will be gen­er­ally avail­able later this year, and can be used as part of Google’s AI Hypercomputer, which brings to­gether pur­pose-built hard­ware (compute, stor­age, net­work­ing), open soft­ware (frameworks, in­fer­ence en­gines), and flex­i­ble con­sump­tion (orchestration, clus­ter man­age­ment and de­liv­ery mod­els) into a uni­fied stack.

Agentic com­put­ing will re­de­fine what is pos­si­ble. We are thrilled to an­nounce the lat­est in­car­na­tion of our re­lent­less in­no­va­tion to power this trans­for­ma­tion, TPU 8i and 8t. Interested cus­tomers can re­quest more in­for­ma­tion.


Coding Models Are Doing Too Much

nrehiew.github.io

Code for this post is avail­able here.

AI-assisted cod­ing has be­come the norm and with tools like Cursor, GitHub Copilot, Claude Code, Codex, we are in­creas­ingly let­ting mod­els touch our code. If you have used any of these tools in the past year, you have prob­a­bly ex­pe­ri­enced some­thing like this: you ask the model to fix a sim­ple bug (perhaps a sin­gle off-by-one er­ror, or maybe a wrong op­er­a­tor). The model fixes the bug but half the func­tion has been rewrit­ten. An ex­tra helper func­tion has ap­peared. A per­fectly rea­son­able vari­able name has been re­named. New in­put val­i­da­tion has been added. And the diff is enor­mous.

I re­fer to this as the Over-Editing prob­lem where mod­els have the ten­dency to rewrite code that did­n’t need rewrit­ing. This mat­ters more than it might seem. Code re­view is al­ready a bot­tle­neck and re­view­ers need to un­der­stand what changed, why it changed, and whether the change is safe. A model that rewrites en­tire func­tions, even cor­rectly, makes this job dra­mat­i­cally harder as the code is now com­pletely un­rec­og­niz­able.

In this post, I will in­ves­ti­gate this prob­lem: whether ex­ist­ing LLMs have a ten­dency to over-edit and whether we can train mod­els to be more faith­ful ed­i­tors.

Over-Editing

Over-editing refers to a model mod­i­fy­ing code be­yond what is strictly nec­es­sary to fix the prob­lem at hand. To be pre­cise: a model is over-edit­ing if its out­put is func­tion­ally cor­rect but struc­turally di­verges from the orig­i­nal code more than the min­i­mal fix re­quires.

The example in Figure 1 illustrates this well. The bug is a single off-by-one error in a range() call (range(len(x) - 1) should be range(len(x))) and the correct fix is a single line. GPT-5.4 (with high reasoning effort) responds by rewriting the entire function: it adds explicit None checks, introduces np.asarray conversions with dtype=float, adds finite-value masking, validates array sizes, changes the curve_fit call signature, and replaces the plotting logic entirely. While the output passes the tests and is functionally correct, the diff is enormous, and none of those additions were asked for or even necessary.

It helps to think about this in terms of the kind of work be­ing done. Software en­gi­neer­ing broadly splits into two modes: green-field (building some­thing new from scratch) and brown-field (working within an ex­ist­ing code­base). Specifically in brown-field, the ex­ist­ing code has been un­der­stood by the team and has been de­lib­er­ately writ­ten the way it was. The mod­el’s job is to fix the is­sue and noth­ing else.

A common piece of advice for working with AI coding tools is to simply write more tests, because if the tests pass, the code is fine. However, over-editing is a brown-field failure that, unlike correctness failures, is invisible to test suites. As models generate more code, engineers have more to review, and over-editing makes that harder. There is more complex logic to parse, more lines of code to read, and a higher chance that overall codebase quality quietly degrades.

Measuring Over-Editing

To study over-edit­ing, we first need a dataset of code ed­its where the ground truth” edit is well-de­fined with some de­gree of minimality”. Rather than us­ing an­other LLM to in­tro­duce bugs (which is what most ex­ist­ing bench­marks do), we pro­gram­mat­i­cally cor­rupt 400 prob­lems from BigCodeBench which gives us more fine-grained con­trol: things like flip­ping a com­par­i­son op­er­a­tor (< → <=), swap­ping + for -, or chang­ing boolean val­ues (True → False).1 Each cor­rupted sam­ple re­mains syn­tac­ti­cally valid and ver­i­fied to break the cor­re­spond­ing test cases. This en­sures that the ground truth edit is ex­actly the re­ver­sal of the cor­rup­tion and noth­ing more, thus mak­ing this edit min­i­mal by con­struc­tion. We can then eval­u­ate not just whether a model fixes the bug, but how much else it changed in the process.
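As a concrete illustration of that kind of corruption (my own sketch, not the harness used in the post), Python's ast module can flip a single comparison operator while leaving everything else untouched:

import ast

class FlipComparison(ast.NodeTransformer):
    """Corrupt the first `<` comparison into `<=`, leaving the rest of the code intact."""
    def __init__(self):
        self.done = False

    def visit_Compare(self, node):
        self.generic_visit(node)
        if not self.done:
            for i, op in enumerate(node.ops):
                if isinstance(op, ast.Lt):
                    node.ops[i] = ast.LtE()  # the single-token corruption
                    self.done = True
                    break
        return node

src = """
def first_negative(xs):
    for i in range(len(xs)):
        if xs[i] < 0:
            return i
    return -1
"""

corrupted = ast.unparse(FlipComparison().visit(ast.parse(src)))
print(corrupted)  # identical code except `xs[i] <= 0`, still syntactically valid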

Metrics

Most cod­ing bench­marks eval­u­ate mod­els on cor­rect­ness us­ing some vari­ant of Pass@1. However, Pass@1 is nec­es­sary but not suf­fi­cient. A model can score per­fectly on Pass@1 while com­pletely rewrit­ing every func­tion it touches. For this ex­per­i­ment, we need met­rics that cap­ture how much the model changed be­yond what was re­quired.

Token-level Levenshtein Distance. Unlike stan­dard Levenshtein which counts the min­i­mum num­ber of char­ac­ter in­ser­tions, dele­tions, and sub­sti­tu­tions to trans­form one string into an­other, we use a Python to­ken-level vari­ant. The code is first passed through Python’s to­k­enizer, which splits it into its atomic syn­tac­tic units (def, add, (, a, ,, b, ), :, re­turn, a, +, b). Levenshtein is then com­puted over this to­ken se­quence rather than raw char­ac­ters.

For ex­am­ple, con­sider the fol­low­ing two func­tions:

def add(a, b):
    return a + b

def someotherfunctionname(a, b):
    return a + b

Character-level Levenshtein gives a dis­tance of 19. Token-level Levenshtein gives a dis­tance of 1 since someother­func­tion­name be­comes a sin­gle to­ken. We nor­mal­ize by to­tal to­ken count so scores are com­pa­ra­ble across func­tions of dif­fer­ent lengths.

In addition, rather than simply comparing the model's output to the ground truth, we compare both against the corrupted input. Let $C$ be the corrupted solution, $G$ the ground truth, and $M$ the model's output. The true minimal edit (simply the reversal of the corruption) is $D_{\text{true}} = d(G, C)$ and the model's edit is $D_{\text{model}} = d(M, C)$; the relative patch score compares these two quantities.

Values closer to zero in­di­cate the mod­el’s patch re­sem­bles the true min­i­mal fix. The in­tu­ition is that we can in­ter­pret the orig­i­nal un­cor­rupted so­lu­tion as the best pos­si­ble edit to the cor­rupted so­lu­tion, com­pute the scores for this best pos­si­ble patch, and then com­pare with the mod­el’s out­put.
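A minimal Python sketch of both pieces of the metric; note that the exact relative patch score formula is not reproduced above, so the normalization here (dividing each distance by the corrupted solution's token count and taking the difference) is an assumption:

import io
import tokenize

def tokens(code):
    """Split Python source into its atomic tokens (names, operators, literals, ...)."""
    return [t.string for t in tokenize.generate_tokens(io.StringIO(code).readline)
            if t.string.strip()]

def levenshtein(a, b):
    """Standard edit distance, computed over token sequences instead of characters."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def relative_patch_score(corrupted, ground_truth, model_output):
    # Assumed normalization: both edit distances are divided by the corrupted
    # solution's token count; 0 means the model's patch is exactly as small
    # as the true minimal fix, larger values mean extra, unneeded edits.
    n = len(tokens(corrupted))
    d_true = levenshtein(tokens(ground_truth), tokens(corrupted)) / n
    d_model = levenshtein(tokens(model_output), tokens(corrupted)) / n
    return d_model - d_true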

Added Cognitive Complexity. Cognitive Complexity (an im­prove­ment over Cyclomatic Complexity) mea­sures how hard code is to un­der­stand. It pe­nal­izes nest­ing, re­cur­sion, mixed log­i­cal op­er­a­tors, and non-ob­vi­ous con­trol flow. For ex­am­ple a straight line of code with no branches is much eas­ier to read than some­thing that re­quires a reader to hold state, such as an if, a loop, or try/​ex­cept. An ex­am­ple is shown be­low:

def process(items):
    result = []
    for item in items:            # +1
        if item > 0:              # +2 (nesting penalty: inside a loop)
            if item % 2 == 0:     # +3 (nesting penalty: two levels deep)
                result.append(item)
    return result

# Cognitive Complexity: 6

Since all our corruptions change values rather than structure, the correct fix should always add zero Cognitive Complexity. Any increase in the model's output was introduced unprompted and is unnecessary. We report the difference between the model's output and the original, which should be zero for a faithful minimal edit. Values below 0 are also unwanted, as unnecessary simplifications to existing code are undesirable too.

Do Models Over-Edit?

Yes, even fron­tier ones.

Among the lat­est fron­tier mod­els, GPT-5.4 over-ed­its the most. It has a Levenshtein of 0.39 in rea­son­ing mode and 0.33 in non-rea­son­ing, with Added Cognitive Complexity of 2.31 and 1.56 re­spec­tively. Despite this, its Pass@1 is only 0.723 and 0.770, mak­ing it one of the weak­est cor­rect­ness per­form­ers too. Claude Opus 4.6 achieves the high­est Pass@1 of any model eval­u­ated (0.912 rea­son­ing, 0.900 non-rea­son­ing) while also pro­duc­ing the small­est diffs with Levenshtein of 0.06 and 0.08, Added Cognitive Complexity of 0.20 and 0.31. Gemini 3.1 Pro Preview sits in sim­i­lar ter­ri­tory, with GLM 5 ar­guably the most con­ser­v­a­tive model among the open weight ones.

Does Prompting Help?

Many papers that claim to uncover a new LLM failure mode do not first test whether the model can do the task when asked directly. A behavior that looks impossible in one setup may be easy under an explicit prompt, so I investigate the impact of adding “IMPORTANT: Try to preserve the original code and the logic of the original code as much as possible” to the prompt.

With ex­plicit prompt­ing, every model im­proves and re­duces its Levenshtein Distance, and with the ex­cep­tion of DeepSeek R1/v3, also im­proves its Pass@1. One in­ter­pre­ta­tion is that the con­straint to make min­i­mal ed­its in­ad­ver­tently nar­rows the search space of pos­si­ble fixes, steer­ing mod­els to­ward the kind of pre­cise, tar­geted change that is more likely to be cor­rect. The ef­fect on Levenshtein Distance is much more pro­nounced in rea­son­ing mod­els, which is likely the re­sult of their stronger in­struc­tion fol­low­ing abil­ity.

Does Reasoning Mean Overthinking and Over-Editing?

Reasoning mod­els are gen­er­ally as­sumed to be bet­ter at cod­ing tasks, and they do score higher on Pass@1. But since we are in­ter­ested in the style of these ed­its, we need to look at the re­sults through a dif­fer­ent lens.

Figure 3 groups the mod­els into pairs where each pair con­tains a rea­son­ing and non-rea­son­ing model from the same fam­ily. For each pair, we plot the Levenshtein Distance of only the sam­ples where both mod­els get the an­swer cor­rect. This al­lows us to iso­late edit min­i­mal­ity from cor­rect­ness since a model that fails more of­ten has fewer sam­ples to over-edit on, which would oth­er­wise bias the com­par­i­son.

In the generic setting (top), reasoning models over-edit more than their non-reasoning counterparts in the majority of pairs. DeepSeek V3, GPT-5, GPT-5.4, Gemini 3.1 Pro Preview, Qwen 3.6 Plus, and Kimi 2.5 all show the reasoning bar sitting higher. Reasoning models seem to naturally produce more elaborate rewrites, where the model reasons its way into a “better” implementation rather than a minimal fix. The notable exception is Claude Opus 4.6, where the reasoning variant edits substantially less than its non-reasoning counterpart.

In the ex­plicit set­ting (bottom), the pic­ture changes con­sid­er­ably. Once mod­els are told to pre­serve the orig­i­nal code, rea­son­ing mod­els have much lower Levenshtein Distance than their non-rea­son­ing coun­ter­parts and match or un­der­cut them in al­most every pair. Claude Opus 4.6 (reasoning) drops to the low­est Levenshtein of any model in this set­ting. GPT-5 and GPT-5.4 both see their rea­son­ing vari­ants fall sig­nif­i­cantly, though GPT-5.4’s non-rea­son­ing model still edges ahead.

Therefore, the takeaway is that the default behavior of most reasoning models is to over-edit. Left unconstrained, the extended reasoning gives models more room to “improve” code that doesn't need improving. But that same reasoning capacity also makes them better at following the constraint once it is given. The gap between the generic and explicit setting is consistently larger for reasoning models, which suggests the over-editing is not a fundamental limitation but rather a default behavior that can be overridden.

Training

A nat­ural next ques­tion: can we train mod­els to be more faith­ful ed­i­tors? For this ex­per­i­ment, I start with Qwen3 4B 2507 Instruct as the base model. I use both 0-shot and 8-shot prompt­ing to­gether with the ex­plicit in­struc­tion to pre­serve the orig­i­nal code as base­lines. All other meth­ods are prompted in the generic set­ting with­out the ex­plicit in­struc­tion dur­ing eval­u­a­tion.

Setup

I first cre­ate a syn­thetic dataset of cor­rupted prob­lems from DeepCoder us­ing the same ap­proach de­tailed above. In ad­di­tion to this pro­gram­mat­i­cally gen­er­ated dataset, I also use the base Qwen3 4B 2507 Instruct model to cre­ate a syn­thetic dataset via self-dis­til­la­tion. Concretely, I prompt the model to gen­er­ate 8 com­ple­tions per prob­lem, keep­ing only the sam­ples that are func­tion­ally cor­rect and rank­ing them by Levenshtein Distance. The model is then trained with­out the ex­plicit in­struc­tion sim­i­lar to Context Distillation.
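A sketch of that filtering step (hypothetical helper names — sample_completions and passes_tests are stand-ins for the sampling and execution harness, which the post does not show — reusing the token-level Levenshtein from the metrics section):

# Sketch of the self-distillation filtering. `sample_completions` and
# `passes_tests` are hypothetical stand-ins for the sampling/execution harness.
def build_self_distillation_set(problems, k=8, keep=3):
    dataset = []
    for problem in problems:
        candidates = sample_completions(problem.prompt, n=k)           # 8 completions per problem
        correct = [c for c in candidates if passes_tests(c, problem)]  # keep only functionally correct ones
        correct.sort(key=lambda c: levenshtein(tokens(c), tokens(problem.corrupted_code)))
        dataset.extend((problem, c) for c in correct[:keep])           # smallest diffs first (used by rSFT/DPO)
    return dataset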

We eval­u­ate 4 dif­fer­ent meth­ods:

SFT: Supervised fine-tun­ing di­rectly on the pro­gram­mat­i­cally gen­er­ated dataset.

rSFT: Rejection-sampled SFT where we train on the com­ple­tions with the 3 low­est Levenshtein Distances for each sam­ple from the self-dis­til­la­tion dataset.

DPO: Preference op­ti­miza­tion be­tween the com­ple­tions with the high­est and low­est Levenshtein Distances for each sam­ple from the self-dis­til­la­tion dataset.

RL: Reinforcement learn­ing with a re­ward com­bin­ing func­tional cor­rect­ness and Levenshtein-based edit min­i­mal­ity. The re­ward struc­ture is a weighted sum of the Levenshtein Distance and a penalty for fail­ing to pass the test cases:

# r_edit is the normalized Levenshtein-based reward
if passes_all_test_cases:
    r = r_edit + 0.1
else:
    r = -0.2

Does It Work?

On the first at­tempt, SFT is al­most sus­pi­ciously good as the re­sul­tant model seems to have per­fectly learned the task. I found this ex­tremely sur­pris­ing and had the ini­tial hy­poth­e­sis that the model was just mem­o­riz­ing the re­ver­sal for this set of cor­rup­tions rather than learn­ing a gen­eral min­i­mal edit­ing be­hav­ior. As a re­sult, I re-cre­ated both syn­thetic datasets but in­stead us­ing a com­pletely dif­fer­ent set of cor­rup­tions than the eval­u­a­tion set to test for gen­er­al­iza­tion. The core hy­poth­e­sis was that the model was sim­ply learn­ing to re­verse a par­tic­u­lar set of cor­rup­tions.

Does It Generalize?

SFT col­lapses en­tirely out-of-do­main. Pass@1 drops to 0.458 as the model has learned to make spe­cific min­i­mal changes re­gard­less of whether they fix any­thing. rSFT and DPO are both bet­ter but the over­all im­prove­ment is slight com­pared to the 8-shot base­line. This in­di­cates that train­ing on traces dis­tilled from the base model it­self is suf­fi­cient to in­duce some de­gree of gen­er­al­iza­tion. RL is the only method that gen­er­al­izes cleanly, im­prov­ing on all three met­rics over both base­lines. The fact that the RL model has larger im­prove­ments on Levenshtein Distance and Added Cognitive Complexity than on Pass@1 is fur­ther ev­i­dence that it is not just mem­o­riz­ing cor­rup­tion re­ver­sals but has ac­tu­ally gen­er­al­ized to min­i­mal edit­ing.

Given the SFT mod­el’s in­abil­ity to even fix bugs, we also wanted to look at Catastrophic Forgetting. Specifically, whether fine-tun­ing for min­i­mal edit­ing de­grades gen­eral cod­ing abil­ity. We eval­u­ate all fine-tuned mod­els on LiveCodeBench v6 and com­pare against the orig­i­nal pre­trained model. Ideally, per­for­mance should re­main sim­i­lar af­ter train­ing.

SFT shows a 43% per­for­mance degra­da­tion, which aligns with our ear­lier find­ing that it can no longer iden­tify and fix ba­sic bugs. The rSFT and DPO mod­els ex­pe­ri­ence slight degra­da­tion, in­di­cat­ing that even though they were trained on sam­ples gen­er­ated by the orig­i­nal model, the na­ture of the task still re­sults in some de­gree of Catastrophic Forgetting. The RL model, how­ever, does not ex­pe­ri­ence any degra­da­tion. Combined with the fact that it also per­forms the task best, RL is able to teach the model a new be­hav­ior with­out de­grad­ing pre­vi­ously ac­quired abil­i­ties. This aligns with broader work show­ing that SFT mem­o­rizes while RL gen­er­al­izes.

Inspired by other work showing that RL's bias towards KL-minimal solutions reduces forgetting, we can interpret these results from a distributional perspective. Specifically, the distribution of the programmatically generated dataset is very different from the model's original distribution. As a result, the SFT model's distribution has been heavily modified and thus suffers from Catastrophic Forgetting. In contrast, for both rSFT and DPO, the distribution of the self-distilled dataset is more aligned and is thus less heavy-handed in shaping the trained model's distribution. Therefore, it is likely that the degree of Catastrophic Forgetting is proportional to the difference between the model's original distribution and the distribution of the task training data.

Additional Experiments

RL with LoRA: Do We Need Full Fine-Tuning?

Given that this task is less about teach­ing the model new knowl­edge and more about tun­ing its style on an ex­ist­ing task, we also wanted to ex­plore whether LoRA would be suf­fi­cient. Since the base model al­ready has the ca­pa­bil­ity to edit code and fix bugs, full fine-tun­ing might not be nec­es­sary.

The re­sults sup­port the hy­poth­e­sis. LoRA at rank 64 nearly matches full RL on Levenshtein Distance and beats it on Added Cognitive Complexity. LiveCodeBench dips slightly at low ranks but rank 64 is ef­fec­tively flat, and full RL re­mains best over­all. There is a clean mo­not­o­nic trend as rank in­creases where both Levenshtein and Added CC fall steadily from rank 1 to rank 64. The rate of im­prove­ment is not uni­form though, as the biggest gains hap­pen early. Rank 1 to 16 ac­counts for most of the Levenshtein re­duc­tion (0.166 → 0.087), while rank 16 to 64 closes the re­main­ing gap more grad­u­ally (0.087 → 0.051). Ranks 1 and 8 also trade cor­rect­ness for edit min­i­mal­ity which could be ex­plained by a lack of suf­fi­cient ca­pac­ity to learn both re­ward func­tions and in­stead bias to­wards the higher-re­ward edit min­i­mal­ity.

This is con­sis­tent with the idea that a small num­ber of ad­di­tional pa­ra­me­ters is enough to shift the mod­el’s edit­ing be­hav­ior and more ca­pac­ity be­yond a cer­tain point yields di­min­ish­ing re­turns. For style-level be­hav­ioral changes where the un­der­ly­ing ca­pa­bil­ity is al­ready pre­sent, LoRA is likely suf­fi­cient and con­sid­er­ably cheaper to run.
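For reference, attaching such an adapter is only a few lines of setup. The snippet below is a minimal sketch using PEFT and is not the training stack used in this post; the model name, alpha value, and target modules are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the post's training code): attach a rank-64
# LoRA adapter so that RL only updates the adapter weights.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")

lora_cfg = LoraConfig(
    r=64,                                                     # adapter rank, swept 1-64 above
    lora_alpha=128,                                           # assumed scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

policy = get_peft_model(base, lora_cfg)
policy.print_trainable_parameters()  # only the adapter parameters are trainable
```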

The original version of the reward function had a bug where rollouts with no successful execution were given a hardcoded reward of 0. This ended up being a higher reward than any rollout with successful execution, since the Levenshtein distance was negated to make it "higher is better." I found it interesting that even with this buggy reward function, full RL was still able to learn the task. Only with LoRA did the model fail to learn it, seemingly reward hacking by learning to never output functionally correct code, which is what triggered an investigation into the environment. With the fixed reward function, the results of full RL improved only slightly.
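To make the failure mode concrete, here is a toy reconstruction of the bug; it is not the actual reward code, and the large negative penalty in the fixed version is just one possible repair.

```python
# Toy reconstruction of the reward bug described above (illustrative only).
# The Levenshtein term is negated so "higher is better", which makes the
# hardcoded 0 for failed rollouts outrank every successful rollout.
import Levenshtein  # pip install python-Levenshtein

def buggy_reward(original: str, edited: str, tests_passed: bool) -> float:
    if not tests_passed:
        return 0.0                                 # bug: 0 > -distance for any successful edit
    return -Levenshtein.distance(original, edited)

def fixed_reward(original: str, edited: str, tests_passed: bool) -> float:
    if not tests_passed:
        return -1_000.0                            # assumed fix: failures always score below successes
    return -Levenshtein.distance(original, edited)
```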

Does It Scale?

Lastly, to validate the results on larger models, I apply the same RL recipe with the Out-of-Domain data to the larger Qwen3 14B model. Even at this parameter count there are gains across the board: higher Pass@1, lower Levenshtein Distance, lower Added Cognitive Complexity, and no sign of Catastrophic Forgetting. This gives me confidence that the recipe for Minimal Code Editing extends to models of different scales.

Final Thoughts

It is no­table that, de­spite be­ing a fron­tier model, GPT 5.4 strug­gles on the min­i­mal edit­ing task, es­pe­cially in the generic set­ting and rel­a­tive to Opus 4.6. Figure 3, how­ever, shows that it sees one of the largest gains when ex­plic­itly prompted in rea­son­ing mode, sec­ond only to its pre­de­ces­sor GPT 5, which sug­gests strong in­struc­tion fol­low­ing ca­pa­bil­i­ties. By con­trast, Opus 4.6 shows one of the small­est im­prove­ments, though that may sim­ply re­flect its al­ready strong base­line per­for­mance. This pat­tern fits the broader view that while GPT 5.4 of­ten de­faults to overly ver­bose code (“slop”), its be­hav­ior can be steered ef­fec­tively with proper prompt­ing.

Taken to­gether, the re­sults sug­gest that Over-Editing is both wide­spread and mea­sur­able. At the same time, the prompt­ing re­sults show that this is not purely a ca­pa­bil­ity lim­i­ta­tion. Especially for rea­son­ing mod­els, a sim­ple in­struc­tion to pre­serve the orig­i­nal code leads to much more faith­ful ed­its, which is an en­cour­ag­ing sign that when mod­els like GPT 5.4 over-edit, they can still be steered to­ward higher-qual­ity code.

Further, the train­ing re­sults sug­gest that this be­hav­ior can be im­proved. Reinforcement learn­ing pro­duced more faith­ful ed­i­tors with­out degra­da­tion in gen­eral cod­ing abil­ity, and those gains held up across both the 4B and 14B Qwen3 mod­els.

Admittedly, the field of code benchmarks has moved on from simple single-function evaluations to more agentic evaluation paradigms like SWE-Bench Pro. Relative to those, evaluating bug fixes in isolated functions is still a fairly contained task given the nature of the bugs.

Even so, in my experience Over-Editing is prevalent across all frontier coding models today, yet it has long been difficult to quantify in realistic settings. I hope this work can serve as a first step toward evaluating and improving the minimal editing capability of coding models, and ultimately the overall quality of AI-generated code.

Acknowledgements: I am grateful to my supervisor A/P Min-Yen Kan and my advisor Tongyao Zhu for their guidance, and to Prime Intellect for sponsoring the compute and API costs of this project.

The full list of cor­rup­tions can be found in the code. ↩︎



crawshaw - 2026-04-22

crawshaw.io

I am build­ing a cloud

2026-04-22

Today is fundrais­ing an­nounce­ment day. As is the na­ture of writ­ing for a larger au­di­ence, it is a for­mal, safe an­nounce­ment. As it should be. Writing must nec­es­sar­ily be­come im­per­sonal at scale. But I would like to write some­thing per­sonal about why I am do­ing this. What is the goal of build­ing exe.dev? I am al­ready the co-founder of one startup that is do­ing very well, sell­ing a prod­uct I love as much as when I first helped de­sign and build it.

What could possess me to go through all the pain of starting another company? Some fellow founders have looked at me with incredulity and shock that I would throw myself back into the frying pan. (Worse yet, experience tells me that most of the pain is still in my future.) It has been a genuinely hard question to answer because I keep searching for a "big" reason, a principle or a social need, a motivation beyond the challenge itself. But I believe the truth is far simpler, and to some, I am sure, almost as hard to believe.

I like com­put­ers.

In some tech cir­cles, that is an un­usual state­ment. (“In this house, we curse com­put­ers!”) I get it, com­put­ers can be re­ally frus­trat­ing. But I like com­put­ers. I al­ways have. It is re­ally fun get­ting com­put­ers to do things. Painful, sure, but the re­sults are worth it. Small mi­cro­con­trollers are fun, desk­tops are fun, phones are fun, and servers are fun, whether racked in your base­ment or in a data cen­ter across the world. I like them all.

So it is no small thing for me when I ad­mit: I do not like the cloud to­day.

I want to. Computers are great, whether it is a BSD installed directly on a PC or a Linux VM. I can enjoy Windows, BeOS, and Novell NetWare; I even installed OS/2 Warp back in the day and had a great time with it. Linux is particularly powerful today and a source of endless potential. And for all the pages of products, the cloud is just Linux VMs. Better, they are API-driven Linux VMs. I should be in heaven.

But every cloud prod­uct I try is wrong. Some are bet­ter than oth­ers, but I am con­stantly con­strained by the choices cloud ven­dors make in ways that make it hard to get com­put­ers to do the things I want them to do.

These issues go beyond UX or bad API design. Some of the fundamental building blocks of today's clouds are the wrong shape. VMs are the wrong shape because they are tied to CPU/memory resources. I want to buy some CPUs, memory, and disk, and then run VMs on it. A Linux VM is a process running in another Linux's cgroup; I should be able to run as many as I like on the computer I have. The only way to do that easily on today's clouds is to take isolation into my own hands, with gVisor or nested virtualization on a single cloud VM, paying the nesting performance penalty, and then I am left with the job of running and managing, at a minimum, a reverse proxy onto my VMs. All because the cloud abstraction is the wrong shape.

Clouds have tried to solve this with "PaaS" systems. Abstractions that are inherently less powerful than a computer, bespoke to a particular provider. Learn a new way to write software for each compute vendor, only to find halfway into your project that something that is easy on a normal computer is nearly impossible because of some obscure limit of the platform, buried so deep you cannot find it until you are deeply committed. Time and again I have said "this is the one" only to be betrayed by some half-assed, half-implemented, or half-thought-through abstraction. No thank you.

Consider disk. Cloud providers want you to use re­mote block de­vices (or some­thing even more lim­ited and slow, like S3). When re­mote block de­vices were in­tro­duced they made sense, be­cause com­put­ers used hard dri­ves. Remote does not hurt se­quen­tial read/​write per­for­mance, if the buffer­ing im­ple­men­ta­tion is good. Random seeks on a hard drive take 10ms, so 1ms RTT for the Ethernet con­nec­tion to re­mote stor­age is a fine price to pay. It is a good prod­uct for hard dri­ves and makes the cloud ven­dor’s life a lot eas­ier be­cause it re­moves an en­tire di­men­sion from their stan­dard in­stance types.

But then we all switched to SSD. Seek time went from 10 mil­lisec­onds to 20 mi­crosec­onds. Heroic ef­forts have cut the net­work RTT a bit for re­ally good re­mote block sys­tems, but the IOPS over­head of re­mote sys­tems went from 10% with hard dri­ves to more than 10x with SSDs.
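Back-of-envelope with those numbers: a 1 ms network round trip on top of a 10 ms seek is roughly a 10% penalty, while the same 1 ms on top of a ~20 µs flash read multiplies the access time by roughly 50x.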

It is a lot of work to con­fig­ure an EC2 VM to have 200k IOPS, and you will pay $10k/month for the priv­i­lege. My MacBook has 500k IOPS. Why are we hob­bling our cloud in­fra­struc­ture with slow disk?

Then there is networking. Hyperscalers have great networks. They charge you the earth for them and make it miserable to do deals with other vendors. The standard price for a GB of egress from a cloud provider is 10x what you pay racking a server in a normal data center. At moderate volume the multiplier is even worse. Sure, if you spend $XXm/month with a cloud the prices get much better, but most of my projects want to spend $XX/month, without the little m. The fundamental technology here is fine, but this is where limits are placed on you to make sure whatever you build cannot be affordable.

Finally, clouds have painful APIs. This is where projects like K8S come in, papering over the pain so engineers suffer a bit less from using the cloud. But VMs are hard with Kubernetes because the cloud makes you do it all yourself with lumpy nested virtualization. Disk is hard because back when K8S was being designed, Google didn't really even do usable remote block devices, and even if you can find a common pattern among today's clouds to paper over the differences, it will be slow. Networking is hard because if it were easy you would private-link in a few systems from a neighboring open DC and drop a zero from your cloud spend. It is tempting to dismiss Kubernetes as a scam, artificial make-work designed to avoid doing real product work, but the truth is worse: it is a product attempting to solve an impossible problem, making clouds portable and usable. It cannot be done.

You can­not solve the fun­da­men­tal prob­lems with cloud ab­strac­tions by build­ing new ab­strac­tions on top. Making Kubernetes good is in­her­ently im­pos­si­ble, a pro­ject in putting (admittedly high qual­ity) lip­stick on a pig.

We have been muddling along with these miserable clouds for 15 years now. We make do, in the way we do with all the unpleasant parts of our software stack, holding our nose whenever we have to deal with them and trying to minimize how often that happens.

This, however, is the moment to fix it.

This is the moment because something has changed: we have agents now. (Indeed, my co-founder Josh and I started tinkering because we wanted to use LLMs in programming. It turns out what needs building for LLMs are better traditional abstractions.) Agents, by making it easier to write code, mean there will be a lot more software. Economists would call this an instance of Jevons paradox. Each of us will write more programs, for fun and for work. We need private places to run them, easy sharing with friends and colleagues, minimal overhead.

With more total software in our lives, the cloud, which was an annoying pain, becomes a much bigger pain. We need a lot more compute, and we need it to be easier to manage. Agents help to some degree. If you trust them with your credentials they will do a great job driving the AWS API for you (though occasionally they will delete your production DB). But agents struggle with the fundamental limits of the abstractions as much as we do. You need more tokens than you should and you get a worse result than you should. Every percent of context window the agent spends thinking about how to contort classic clouds into working is context window it is not using to solve your problem.

So we are go­ing to fix it. What we have launched on exe.dev to­day ad­dresses the VM re­source iso­la­tion prob­lem: in­stead of pro­vi­sion­ing in­di­vid­ual VMs, you get CPU and mem­ory and run the VMs you want. We took care of a TLS proxy and an au­then­ti­ca­tion proxy, be­cause I do not ac­tu­ally want my fresh VMs dumped di­rectly on the in­ter­net. Your disk is lo­cal NVMe with blocks repli­cated off ma­chine asyn­chro­nously. We have re­gions around the world for your ma­chines, be­cause you want your ma­chines close. Your ma­chines are be­hind an any­cast net­work to give all your global users a low la­tency en­try­point to your prod­uct (and so we can build some new ex­cit­ing things soon).

There is a lot more to build here, from ob­vi­ous things like sta­tic IPs to UX chal­lenges like how to give you ac­cess to our au­to­matic his­tor­i­cal disk snap­shots. Those will get built. And at the same time we are go­ing right back to the be­gin­ning, rack­ing com­put­ers in data cen­ters, think­ing through every layer of the soft­ware stack, ex­plor­ing all the op­tions for how we wire up net­works.

So, I am build­ing a cloud. One I ac­tu­ally want to use. I hope it is use­ful to you.

Show HN submissions tripled and now mostly share the same vibe-coded look

www.adriankrebs.ch

An at­tempt to de­tect AI de­sign pat­terns in Show HN pages

Apr 20, 2026

When brows­ing Hacker News, I no­ticed that many Show HN pro­jects now have a generic ster­ile feel­ing that tells me they are purely AI-generated.

Initially I could­n’t tell what it was ex­actly, so I won­dered if we could au­to­mat­i­cally quan­tify this sub­jec­tive feel­ing by scor­ing 500 Show HN pages for AI de­sign pat­terns.

Claude Code has led to a large increase in Show HN projects. So much so that the moderators of HN had to restrict Show HN submissions for new accounts.

Here is how the Show HN sub­mis­sions in­creased over the last few years:

Update: dang pointed out that the March 2026 dip cor­re­lates with the roll­out of /showlim, the view newer ac­counts now see.

That should give us plenty of pages to score for AI de­sign pat­terns.

AI de­sign pat­terns

A designer recently told me that "colored left borders are almost as reliable a sign of AI-generated design as em-dashes for text", so I started to notice them on many pages.

Then I asked some more de­signer friends what they think are com­mon AI pat­terns.

The an­swers can be roughly grouped into fonts, col­ors, lay­out quirks, and CSS pat­terns.

Fonts

Inter used for every­thing, but es­pe­cially the cen­tered hero head­lines

LLMs tend to use certain font combos like Space Grotesk, Instrument Serif and Geist

Serif italic for one ac­cent word in an oth­er­wise-In­ter hero

Colors

"VibeCode Purple"

Perma dark mode with medium-grey body text and all-caps sec­tion la­bels

Barely pass­ing body-text con­trast in dark themes

Gradient every­thing

Large col­ored glows and col­ored box-shad­ows

Layout quirks

Centered hero set in a generic sans

Badge right above the hero H1

Colored bor­ders on cards, on the top or left edge

Identical fea­ture cards, each with an icon on top

Numbered "1, 2, 3" step sequences

Stat ban­ner rows

Sidebar or nav with emoji icons

All-caps head­ings and sec­tion la­bels

CSS pat­terns

shadcn/​ui

Glassmorphism

A few ex­am­ples from the Show HN sub­mis­sions:

Detecting AI de­sign in Show HN sub­mis­sions

Now we can try to sys­tem­at­i­cally score for these pat­terns by go­ing through 500 of the lat­est Show HN sub­mis­sions and scor­ing their land­ing pages against the list above.

Here is the scor­ing method:

A head­less browser loads each site (Playwright)

A small in-page script an­a­lyzes the DOM and reads com­puted styles

Every pattern is a deterministic CSS or DOM check; I intentionally avoid taking screenshots and having an LLM judge them. A sketch of what such a check looks like follows below.
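For illustration, here is a minimal sketch of this kind of check. It is not the actual scoring code; the two patterns, selectors, and thresholds are assumptions drawn from the list above.

```python
# Minimal sketch (assumptions, not the real scorer): load a page with Playwright,
# run an in-page script, and flag two of the patterns from the list above.
from playwright.sync_api import sync_playwright

IN_PAGE_CHECKS = """
() => {
  const hits = {};
  // Pattern: Inter used as the body font
  hits.inter_font = /inter/i.test(getComputedStyle(document.body).fontFamily || "");
  // Pattern: colored left borders on card-like elements
  hits.colored_left_border = [...document.querySelectorAll("div, section, article")].some(el => {
    const s = getComputedStyle(el);
    return parseFloat(s.borderLeftWidth) >= 2 &&
           s.borderLeftStyle === "solid" &&
           s.borderLeftColor !== s.color;   // border color differs from text color
  });
  return hits;
}
"""

def score_page(url: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="load", timeout=30_000)
        hits = page.evaluate(IN_PAGE_CHECKS)  # reads computed styles inside the page
        browser.close()
    return hits

if __name__ == "__main__":
    print(score_page("https://example.com"))  # e.g. {'inter_font': False, 'colored_left_border': False}
```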

This approach does produce some false positives, but a manual QA run put the rate at roughly 5 to 10%.

If there is any in­ter­est in open sourc­ing the scor­ing code to repli­cate (and im­prove) the run or score your own site, let me know.

Results

A sin­gle pat­tern does­n’t nec­es­sar­ily make a site AI-generated, so I grouped them into three tiers based on how many of the 15 pat­terns they trig­ger:

Heavy slop (5+ pat­terns) · 105 sites · 21%

Mild (2 – 4) · 230 sites · 46%

Clean (0 – 1) · 165 sites · 33%

Is this bad? Not re­ally, just unin­spired. After all, val­i­dat­ing a busi­ness idea was never about fancy de­sign, and be­fore the AI era, every­thing looked like Bootstrap.

There is a difference between trying to craft your own design and just shipping with whatever defaults the LLMs output. The same was true pre-LLM when people shipped with stock CSS/HTML templates.

I guess peo­ple will get back to craft­ing beau­ti­ful de­signs to stand out from the slop. On the other hand, I’m not sure how much de­sign will still mat­ter once AI agents are the pri­mary users of the web.

This post is hu­man-writ­ten, the scor­ing and analy­sis were AI-assisted.
