10 interesting stories served every morning and every evening.

Google Chrome silently installs a 4 GB AI model on your device without consent. At a billion-device scale the climate costs are insane.

www.thatprivacyguy.com

Two weeks ago I wrote about Anthropic silently registering a Native Messaging bridge in seven Chromium-based browsers on every machine where Claude Desktop was installed [1]. The pattern was: install on user launch of product A, write configuration into the user's installs of products B, C, D, E, F, G, H without asking. Reach across vendor trust boundaries. No consent dialog. No opt-out UI. It re-installs itself every time Claude Desktop is launched, even if the user removes it manually.

This week I discovered the same pattern, executed by Google. Google Chrome is reaching into users' machines and writing a 4 GB on-device AI model file to disk without asking. The file is named weights.bin. It lives in OptGuideOnDeviceModel. It is the weights for Gemini Nano, Google's on-device LLM. Chrome did not ask. Chrome does not surface it. If the user deletes it, Chrome re-downloads it.

The legal analysis is the same one I gave for the Anthropic case. The environmental analysis is new. At Chrome's scale, the climate bill for one model push, paid in atmospheric CO2 by the entire planet, is between six thousand and sixty thousand tonnes of CO2-equivalent emissions, depending on how many devices receive the push. That is the environmental cost of one company unilaterally deciding that two billion people's default browser will mass-distribute a 4 GB binary they did not request.

This is, in my professional opinion, a direct breach of Article 5(3) of Directive 2002/58/EC (the ePrivacy Directive) [2], a breach of the Article 5(1) GDPR principles of lawfulness, fairness, and transparency [3], a breach of Article 25 GDPR's data-protection-by-design obligation [3], and an environmental harm of a magnitude that would be a notifiable event under the Corporate Sustainability Reporting Directive (CSRD) for any in-scope undertaking [4].

What is on the disk and how it got there

On any machine that has Chrome installed, in the user profile, sits a directory whose name is OptGuideOnDeviceModel. Inside it is a file called weights.bin. The file is approximately 4 GB. It is the weights file for Gemini Nano. Chrome uses it to power features Google has marketed under names like “Help me write”, on-device scam detection, and other AI-assisted browser functions.
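
If you want to check your own machine, a few lines of Python will locate and size the directory (the paths below are the typical Chrome user-data locations; adjust for non-default installs):

import os
from pathlib import Path

# Typical Chrome user-data roots (assumption: default install locations).
CANDIDATE_ROOTS = [
    Path.home() / "Library/Application Support/Google/Chrome",             # macOS
    Path.home() / ".config/google-chrome",                                 # Linux
    Path(os.environ.get("LOCALAPPDATA", "")) / "Google/Chrome/User Data",  # Windows
]

def dir_size_bytes(path: Path) -> int:
    # Recursively sum the sizes of all regular files under `path`.
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

for root in CANDIDATE_ROOTS:
    model_dir = root / "OptGuideOnDeviceModel"
    if model_dir.is_dir():
        print(f"{model_dir}: {dir_size_bytes(model_dir) / 2**30:.2f} GiB")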

The file appeared with no consent prompt. There is no checkbox in Chrome Settings labelled “download a 4 GB AI model”. The download triggers when Chrome's AI features are active, and those features are active by default in recent Chrome versions. On any machine that meets the hardware requirements, Chrome treats the user's hardware as a delivery target and writes the model.

The cycle of deletion and re-download has been documented across multiple independent reports on Windows installations [5][6][7][8] - the user deletes, Chrome re-downloads, the user deletes again, Chrome re-downloads again. The only ways to make the deletion stick are to disable Chrome's AI features through chrome://flags or enterprise policy tooling that home users do not generally have, or to uninstall Chrome entirely [5]. On macOS the file lands as mode 600 owned by the user (so it is deletable in principle), but Chrome holds the install state in Local State after the bytes are written, and as soon as the variations server next tells Chrome the profile is eligible, the download fires again - the architecture is the same, only the file permissions differ.

How I ver­i­fied this on a freshly cre­ated Apple Silicon pro­file

Most of the ex­ist­ing re­port­ing on this be­hav­iour is from Windows users who no­ticed their disk fill­ing up - use­ful, but Google could (and prob­a­bly will) try to char­ac­terise those re­ports as anec­dotes from non-rep­re­sen­ta­tive con­fig­u­ra­tions. So I went look­ing for a clean wit­ness on a dif­fer­ent plat­form.

The witness I found is macOS itself. The operating system keeps a filesystem event log called .fseventsd - it records every file create, modify and delete at the OS level, independent of any application logging. Chrome cannot edit it, Google cannot remotely reach it, and the page files that record the events survive the deletion of the files they reference.
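
If you want to run the same check, a raw byte search is enough to find which page files mention the model directory. A minimal sketch, assuming the standard gzip-compressed page files and root access (decoding the full event records needs a dedicated fseventsd parser):

import gzip
from pathlib import Path

# .fseventsd sits at the volume root and is only readable as root.
# (Assumption: /System/Volumes/Data is the data volume on this machine.)
FSEVENTS = Path("/System/Volumes/Data/.fseventsd")
NEEDLE = b"OptGuideOnDeviceModel"

for page in sorted(FSEVENTS.glob("0*")):
    try:
        raw = gzip.open(page, "rb").read()  # page files are gzip-compressed
    except OSError:
        continue                            # skip non-gzip bookkeeping files
    if NEEDLE in raw:
        print(f"{page.name}: {raw.count(NEEDLE)} reference(s)")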

I created a Chrome user-data directory on 23 April 2026 to run an automated audit (one of the WebSentinel 100-site privacy sweeps). The audit driver works entirely over the Chrome DevTools Protocol - it loads a page, dwells for five minutes with no input, captures events, and closes Chrome between sites - and the profile had received zero keyboard or mouse input from a human at any point in its existence. Every “AI mode” surface in Chrome was untouched - in fact every UI surface in Chrome was untouched; the audit driver only interacts with the document via CDP, and the omnibox is never reached. By 29 April the profile contained 4 GB of OptGuideOnDeviceModel weights - I knew because a routine du -sh of the audit-profile directory caught it during a cleanup pass.

I went back to .fseventsd to ask ex­actly when those 4 GB landed. ma­cOS gave me the an­swer, byte-pre­cise, in three se­quen­tial page files:

24 April 2026, 16:38:54 CEST (14:38:54 UTC) - Chrome creates the OptGuideOnDeviceModel directory in the audit profile (page file 0000000003f7f339).

24 April 2026, 16:47:22 CEST (14:47:22 UTC) - three concurrent unpacker subprocesses spawn temporary directories in /private/var/folders/…/com.google.Chrome.chrome_chrome_Unpacker_BeginUnzipping.*/. One of them (5xzqPo) writes weights.bin, manifest.json, _metadata/verified_contents.json and on_device_model_execution_config.pb. The second writes a Certificate Revocation List update. The third writes a browser preload-data update. Chrome batched a security update, a preload refresh and a 4 GB AI model into the same idle window, as if they were equivalent (page file 00000000040c8855).

24 April 2026, 16:53:22 CEST (14:53:22 UTC) - the unpacked weights.bin is moved to its final location at OptGuideOnDeviceModel/2025.8.8.1141/weights.bin along with adapter_cache.bin, encoder_cache.bin, _metadata/verified_contents.json and the execution config. Concurrently, four additional model targets (numbered 40, 49, 51 and 59 in Chrome's optimization-guide enum) register fresh entries in optimization_guide_model_store - these are the smaller text-safety and prompt-routing models that pair with the LLM. None of these targets existed in the profile before this moment (page file 00000000040d0f9c).

Total install time, from directory creation to final move: 14 minutes and 28 seconds. Total human action against the profile during that window: none. The audit driver was either dwelling on a third-party home page or transitioning between sites - the unpacker fired in the background while a tab waited for a five-minute timer to expire.

The naming inside that fseventsd record is, if anything, the most damning detail. The temp directory is com.google.Chrome.chrome_chrome_Unpacker_BeginUnzipping.5xzqPo - that prefix, com.google.Chrome.chrome_chrome_*, is the bundle ID and subprocess naming convention Google Chrome itself uses. It is not com.google.GoogleUpdater.* and it is not com.google.GoogleSoftwareUpdate.*. The writer is Chrome - the browser process the user has installed and trusts to load web pages - reaching into the user's filesystem on its own initiative and laying down a 4 GB ML binary while the foreground tab does something completely unrelated.

Three further pieces of corroborating evidence sit elsewhere on the same machine:

Chrome's own Local State JSON for the audit profile contains an optimization_guide.on_device block with model_validation_result: { attempt_count: 1, result: 2, component_version: "2025.8.8.1141" }. Chrome ran the model. The component_version matches the version string the fseventsd events recorded as the path component. Two independent witnesses, same artefact. The same block reports performance_class: 6, vram_mb: 36864 - Chrome characterised my hardware (read the GPU, read the unified memory total) to decide whether I was eligible for the model push, before any user-facing AI feature surfaced.
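
That block is plain JSON, so inspecting your own Local State takes a few lines. The key names are the ones observed in this capture - not a documented, stable API - and may differ across Chrome versions:

import json
from pathlib import Path

# "Local State" sits at the root of the Chrome user-data directory
# (macOS path shown; adjust per platform).
local_state = Path.home() / "Library/Application Support/Google/Chrome/Local State"

state = json.loads(local_state.read_text())
# Key names as observed in the capture described above.
on_device = state.get("optimization_guide", {}).get("on_device", {})
print(json.dumps(on_device, indent=2))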

Chrome's ChromeFeatureState for the audit profile lists OnDeviceModelBackgroundDownload<OnDeviceModelBackgroundDownload and ShowOnDeviceAiSettings<OnDeviceModelBackgroundDownload in the enable-features block (in Chrome's Feature<FieldTrial notation, both features hang off the same OnDeviceModelBackgroundDownload trial). The first flag is what triggers the silent download. The second flag is what reveals the on-device AI section in chrome://settings. Both are gated by the same rollout trial - which means that, by Chrome's own architecture, the install begins before the user has any settings UI in which to refuse it. The settings page that would let you discover the feature exists is enabled in lockstep with the install - it is design, not oversight.

The GoogleUpdater logs record the on-device-model control component (appid {44fc7fe2-65ce-487c-93f4-edee46eeaaab}) being downloaded from http://edgedl.me.gvt1.com/edgedl/diffgen-puffin/%7B44fc7fe2-65ce-487c-93f4-edee46eeaaab%7D/… - a 7 MB compressed control file that arrived on 20 April 2026, three days before the audit profile in question was created. That is the upstream control plane: it is profile-independent, it is launched automatically by a LaunchAgent that fires every hour, and the URL is plain HTTP (the integrity is verified by the CRX-3 signature inside the package, not by transport security). The control component gives Chrome the manifest pointing at the actual weights, and Chrome's in-process OnDeviceModelComponentInstaller - a separate code path from GoogleUpdater - then fetches the multi-GB weights direct from Google's CDN.
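
You can enumerate that control plane on a Mac by listing Google's launch jobs. A minimal sketch that matches on the vendor prefix, since plist names vary across updater generations:

from pathlib import Path

# Google's updater registers LaunchAgents/LaunchDaemons on macOS; listing
# com.google.* plists shows what is scheduled to run on a timer.
for d in [Path.home() / "Library/LaunchAgents",
          Path("/Library/LaunchAgents"),
          Path("/Library/LaunchDaemons")]:
    if d.is_dir():
        for plist in sorted(d.glob("com.google.*")):
            print(plist)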

So we now have a four-way evidence chain - macOS kernel filesystem events, Chrome's own per-profile state, Chrome's runtime feature flags, and Google's component-updater logs - all four agreeing on the same conduct, and the conduct is: a 4 GB AI model arrived on this user's disk without consent, without notice, on a profile that received zero human input, in a window of 14 minutes and 28 seconds, on a Friday afternoon.

Reports of the OptGuideOnDeviceModel di­rec­tory and the weights.bin file have been cir­cu­lat­ing in com­mu­nity fo­rums for over a year - what is new in 2026 is the scale and the ver­i­fi­a­bil­ity. Chrome’s mar­ket share has held above 64% glob­ally [9][10], Chrome’s user base is be­tween 3.45 bil­lion and 3.83 bil­lion in­di­vid­u­als world­wide de­pend­ing on which 2026 es­ti­mate you trust [9][11], and Google has been rolling Gemini fea­tures into Chrome with in­creas­ing ag­gres­sion. The be­hav­iour is no longer af­fect­ing a mi­nor­ity of power users on a mi­nor­ity of plat­forms - it is af­fect­ing hun­dreds of mil­lions of de­vices, on every desk­top OS Chrome ships against.

The Anthropic com­par­i­son, point for point

The same dark-pat­tern play­book. I am re­peat­ing my cat­e­gori­sa­tion from the Claude Desktop ar­ti­cle [1] be­cause the pat­terns are iden­ti­cal and that is the point.

1. Forced bundling across trust bound­aries. Anthropic in­stalled Claude Desktop, then wrote into Brave, Edge, Arc, Vivaldi, Opera, and Chromium. Google in­stalls Chrome, then writes a 4 GB AI model un­der the user’s pro­file di­rec­tory with­out au­tho­ri­sa­tion. The bi­nary is not Chrome. It is a sep­a­rately-trained ma­chine-learn­ing model, with a sep­a­rate pur­pose, a sep­a­rate data-pro­tec­tion pro­file, and a sep­a­rate con­sent foot­print.

2. Invisible de­fault, no opt-in. No di­a­logue at first launch. No check­box in Settings. The model is down­loaded; the user finds out about it months later when their disk fills up [5][6][7].

3. More difficult to remove than install. Adding the file took zero clicks. Removing it requires (a) discovering the file exists, (b) understanding what it is, (c) navigating into a hidden user profile path, (d) deleting it (and on Windows, also clearing the read-only attribute first), and (e) accepting that Chrome will silently re-download it on the next eligible window unless the user also navigates chrome://flags, enterprise policy, or platform-specific configuration tooling to disable the underlying Chrome AI feature [5]. None of those steps is documented in the place a normal user looks - none of them is even hinted at in default Chrome.

4. Pre-staging of ca­pa­bil­ity the user has not re­quested. The Nano model ex­ists on the user’s disk so that Chrome fea­tures that use it can run in­stantly when the user in­vokes them. The user has not in­voked any of those fea­tures. The model still sits there, tak­ing 4 GB.

5. Scope inflation through generic naming. OptGuideOnDeviceModel is internal Chrome jargon for “OptimizationGuide on-device model storage”. A user looking at their disk usage, even one who knows roughly what they are looking at, would not match OptGuideOnDeviceModel/weights.bin to “Gemini Nano LLM weights”. Accurate naming would be GeminiNanoLLM/weights.bin. Google chose to obfuscate the name.

6. Registration into re­sources the user has not con­fig­ured. A user who has not opened Chrome’s AI fea­tures still gets the model. A user who has opened them once and de­cided they were not in­ter­ested still gets the model. The file’s pres­ence is de­cou­pled from the user’s ac­tual use of any fea­ture it pow­ers.

7. Documentation gap. Google's user-facing documentation about Chrome's AI features does not, with prominence proportionate to a 4 GB silent download, tell the user that the cost of the feature being available is a 4 GB file appearing on their device. The behaviour is documented in places a curious admin will find. It is not documented in the place a regular user looks before installing Chrome, or before Chrome decides to begin pushing the model.

8. Automatic re-in­stall on every run. Same as Claude Desktop. Delete the file, Chrome re-cre­ates it. The user’s dele­tion is treated as a tran­sient state to be cor­rected, not as a di­rec­tive to be re­spected.

9. Retroactive survival of any future user consent. If Google in future starts asking users “would you like Chrome to download a 4 GB AI model”, that prompt does not retroactively legitimise the silent installs that have already happened on hundreds of millions of devices. The damage to the trust relationship is done. The bytes have moved. The atmosphere has been written to.

10. Code-signed, shipped through the nor­mal re­lease chan­nel. This is not test build be­hav­iour. It is Chrome sta­ble.

The AI Mode” pill is the cherry on top

Here is the part that should make every privacy lawyer in the audience put their coffee down. When Chrome 147 launches against an eligible profile, the omnibox - the address bar at the top of the window, the most visible piece of real estate in the entire browser - renders an “AI Mode” pill to the right of the URL field. A reasonable user, seeing “AI Mode” sitting in their browser's most prominent UI element in 2026, with the well-publicised existence of on-device LLMs in Chrome and a 4 GB Gemini Nano binary already silently installed on their disk, is going to draw what feels like an obvious inference - that the visible AI Mode is using the on-device model, that their queries stay on the device, that the local model is what powers the local-looking surface.

Every part of that in­fer­ence is wrong. The AI Mode pill in the Chrome 147 om­ni­box is a cloud-backed Search Generative Experience sur­face - every query the user types into it is sent over the net­work to Google’s servers for pro­cess­ing by Google’s hosted mod­els. The on-de­vice Nano model is not in­voked by the AI Mode UI flow at all. They are en­tirely sep­a­rate code paths - the most vis­i­ble AI af­for­dance in the browser does not use the lo­cal model the user has been silently given, and the fea­tures that do use the lo­cal model (Help-Me-Write in <textarea>, tab-group AI sug­ges­tions, smart paste, page sum­mary) are buried in textarea-con­text menus and tab-group right-click menus that the av­er­age user will dis­cover, on av­er­age, never.

Think about what that arrangement actually is. The user pays the storage cost of the silent install (4 GB on disk, plus the bandwidth of the silent download). The user's most visible AI experience - the pill they actually see and click - delivers no on-device benefit at all because it routes to Google's servers regardless. The on-device model is therefore a sunk cost imposed on the user, with no offsetting transparency benefit at the surface where transparency would matter most. To put it another way - if the on-device install had given the user a clear “your AI Mode queries stay on your device” property, the install would have a defensible privacy framing (worse storage, better data flow). It does not - the install gives Google a future-options resource (the model can be invoked by other Chrome subsystems without further server round-trips) at the user's disk-and-bandwidth expense, while the headline AI surface continues to send the user's queries to Google as before. The local model is a Google-side asset positioned on the user's device - it is not a user-side asset, and one could argue the arrangement is nothing but sleight-of-hand concealing that the visible AI Mode is not using the local model at all.

That arrangement, on its own, engages at least three of the deceptive design pattern families catalogued in EDPB Guidelines 03/2022 [20]. It is misleading information because the visible label “AI Mode” creates a false impression about where processing occurs - the label does not say “cloud-backed” or “queries sent to Google”, and a reasonable user with knowledge of on-device AI will infer locality from the proximity of an on-device 4 GB model on their disk. It is skipping because the user is not given a moment to choose between local-only and cloud-backed AI surfaces - both are switched on by the same upstream rollout, with no per-feature consent. And it is hindering because turning AI Mode off does not also remove the on-device install, and removing the on-device install does not turn AI Mode off - the two are separately controlled, and discovering both controls requires knowing about both chrome://flags and chrome://settings/ai, neither of which is obvious in default Chrome.

So: not just a non-con­sented in­stall, but a non-con­sented in­stall that dou­bles as cover for a par­al­lel cloud-backed sur­face that mis­rep­re­sents to the user where their typ­ing is be­ing processed. Both lay­ers com­pound the con­sent prob­lem.

Why this is un­law­ful in the EEA and the UK

Article 5(3) of Directive 2002/58/EC (the ePri­vacy Directive) pro­hibits the stor­ing of in­for­ma­tion, or the gain­ing of ac­cess to in­for­ma­tion al­ready stored, in the ter­mi­nal equip­ment of a sub­scriber or user, with­out the user’s prior, freely-given, spe­cific, in­formed, and un­am­bigu­ous con­sent, ex­cept where strictly nec­es­sary for the pro­vi­sion of an in­for­ma­tion-so­ci­ety ser­vice ex­plic­itly re­quested by the user [2]. The 4 GB Gemini Nano weights file is in­for­ma­tion stored in the user’s ter­mi­nal equip­ment. The user did not con­sent. The user has not re­quested any ser­vice that strictly re­quires a 4 GB on-de­vice LLM. Chrome is func­tional with­out the file. The Article 5(3) breach is di­rect.

Article 5(1) GDPR re­quires pro­cess­ing of per­sonal data to be law­ful, fair, and trans­par­ent to the data sub­ject [3]. Where the user’s hard­ware is pro­filed to de­ter­mine el­i­gi­bil­ity for the model push, where the in­stall events are logged on Google’s servers, and where the on-de­vice fea­tures the model pow­ers process user prompts (whether or not those prompts leave the de­vice), the law­ful­ness, fair­ness, and trans­parency of all of that pro­cess­ing de­pend on the user be­ing told, in plain lan­guage, what is hap­pen­ing. They are not.

Article 25 GDPR requires the controller to implement appropriate technical and organisational measures to ensure that, by default, only personal data that are necessary for each specific purpose are processed [3]. Pre-staging a 4 GB AI model on a user's disk, against the contingency that the user might in future invoke an AI feature, is the architectural opposite of by-default minimisation. And the profiling of the device to determine whether to push the model is no different in kind from the profiling used to track users online: that profile contains personal data, and if the AI model is used it will process personal data, so the GDPR arguments are in scope and valid.

Under the UK GDPR and the Privacy and Electronic Communications Regulations 2003, the analy­sis is the same. Under the California Consumer Privacy Act, the ab­sence of a no­tice-at-col­lec­tion cov­er­ing this spe­cific cat­e­gory of pre-staged soft­ware puts Google’s CCPA no­tice pos­ture in ques­tion [12].

Then there is the question of liability under various national computer-misuse statutes - a dimension whose seriousness, again, cannot be overstated.

ESG: the cli­mate cost of the silent push

The Anthropic case I wrote about was a desk­top ap­pli­ca­tion in­stalling a 350-byte JSON man­i­fest in seven di­rec­to­ries. The band­width and en­ergy cost of that, summed across all Claude Desktop users, was neg­li­gi­ble. The Chrome case is dif­fer­ent. Chrome is push­ing a 4 GB bi­nary across hun­dreds of mil­lions of de­vices. That has a mea­sur­able, quan­tifi­able, and frankly alarm­ing en­vi­ron­men­tal foot­print.

I am cal­cu­lat­ing this us­ing the same method­ol­ogy our WebSentinel au­dit plat­form ap­plies to web­site en­vi­ron­men­tal analy­sis [13]:

Energy intensity of network data transfer: 0.06 kWh per GB, the mid-band of Pärssinen et al. (2018), “Environmental impact assessment of online advertising”, Science of The Total Environment [14]. The paper reports a 0.04-0.10 kWh/GB range depending on the share of fixed-line vs mobile transfer and inclusion of end-user device energy. 0.06 is a defensible mid-point.

Grid emis­sions fac­tor: 0.25 kg CO2e per kWh, the EEA / IEA com­pos­ite EU-27 elec­tric­ity-sup­ply fac­tor for 2024 re­port­ing [15]. Globally the fig­ure varies from ~0.10 kg/​kWh on mostly-re­new­able grids to over 0.70 kg/​kWh on coal-heavy grids; 0.25 is mid-band for a global push and is the fig­ure WebSentinel uses by de­fault.

Per-device cost of one Nano push

Bandwidth: 4 GB

Energy: 4 × 0.06 = 0.24 kWh per de­vice per push

CO2: 0.24 × 0.25 = 0.06 kg CO2e per de­vice per push

That is per de­vice, per push. A sin­gle down­load of the model. It does not in­clude re-down­loads trig­gered by the user try­ing and fail­ing to delete the file. It does not in­clude sub­se­quent up­dates to the model. It does not in­clude the on-de­vice in­fer­ence en­ergy when the model is ac­tu­ally used. It is just the one-time de­liv­ery cost to one de­vice.

Aggregated cost across the de­ploy­ment

Google does not publish how many devices receive the Nano push. The eligibility criteria gating the push (a hardware “performance class” that Chrome computes from CPU class, GPU class, system RAM and available VRAM - typically ~16 GB unified memory or better on Apple Silicon, ~16 GB RAM and a discrete or integrated GPU with sufficient VRAM on Windows and Linux) carve out the very low end of the consumer install base, but the qualifying population is still enormous. I will use three illustrative deployment bands so the reader can pick whichever they consider closest to reality; at 0.24 kWh and 0.06 kg CO2e per device, they work out as follows. None of these bands is implausibly large for a feature that ships in default-on Chrome.

Low band - 100 million devices: 400 PB transferred, 24 GWh, 6,000 tonnes CO2e.

Mid band - 500 million devices: 2,000 PB transferred, 120 GWh, 30,000 tonnes CO2e.

High band - 1 billion devices: 4,000 PB transferred, 240 GWh, 60,000 tonnes CO2e.
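
The arithmetic behind those bands is short enough to check in a few lines; this sketch uses the per-device figures above and the three device counts just listed:

GB_PER_PUSH = 4
KWH_PER_GB = 0.06        # Pärssinen et al. mid-band network energy intensity
KG_CO2E_PER_KWH = 0.25   # EU-27 composite grid factor used by WebSentinel

# Device counts for the three illustrative bands above.
BANDS = {"low": 100e6, "mid": 500e6, "high": 1e9}

kwh_per_device = GB_PER_PUSH * KWH_PER_GB          # 0.24 kWh per device per push
kg_per_device = kwh_per_device * KG_CO2E_PER_KWH   # 0.06 kg CO2e per device per push

for name, devices in BANDS.items():
    gwh = devices * kwh_per_device / 1e6           # kWh -> GWh
    tonnes = devices * kg_per_device / 1e3         # kg -> tonnes
    print(f"{name:>4}: {gwh:6.0f} GWh  {tonnes:10,.0f} t CO2e")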

To put those numbers in terms an ESG report would use:

24 GWh (low band) is roughly the an­nual elec­tric­ity con­sump­tion of about 7,000 av­er­age UK house­holds [16].

120 GWh (mid band) is roughly the annual electricity consumption of about 36,000 average UK households, or the annual output of roughly 40 MW of installed wind capacity at a typical UK capacity factor.

240 GWh (high band) is roughly the annual electricity consumption of about 72,000 average UK households, or the annual output of roughly 80 MW of installed wind capacity.

6,000 tonnes CO2e (low band) is roughly the an­nual emis­sions of 1,300 av­er­age pas­sen­ger cars in the EU [17].

30,000 tonnes CO2e (mid band) is roughly the an­nual emis­sions of 6,500 cars, or one re­turn flight from London to Sydney for about 8,000 pas­sen­gers in econ­omy.

60,000 tonnes CO2e (high band) is roughly the an­nual emis­sions of 13,000 cars.

These are the de­liv­ery-only num­bers. They count the bytes tra­vers­ing the net­work ex­actly once. They do not count:

The roughly 4 GB × N devices of disk-storage cost, sustained, on user hardware. Manufacturing NAND has an embodied carbon cost of approximately 0.16 kg CO2e per GB [18]; for 1 billion devices × 4 GB that is around 640,000 tonnes CO2e of embodied SSD carbon allocated to a use case the user did not consent to. This is a one-off manufacturing-carbon impact, but the storage burden is borne in perpetuity by user devices that could otherwise have used the space for user data.

The on-de­vice in­fer­ence en­ergy when Nano is in­voked. Per in­fer­ence this is small. At 2 bil­lion daily Chrome users it is no longer small.

The re-download cycle for users who try to delete the file. Each successful re-trigger of the download is another 4 GB × 0.06 kWh/GB × 0.25 kg/kWh = 0.06 kg CO2e per device per re-download.

The fu­ture model up­dates. Gemini Nano is not a one-shot arte­fact; it is an evolv­ing model with pe­ri­odic weight re­freshes. Each re­fresh re­peats the cal­cu­la­tion.

In ESG-reporting lan­guage, the one-time push of the cur­rent model is a Scope 3 Category 11 (“use of sold prod­ucts”) emis­sion against Google, at­trib­ut­able to the user-side de­liv­ery of a bi­nary the user did not re­quest, in the op­er­a­tion of a free prod­uct Google dis­trib­utes [4].

Why the band­width side mat­ters in its own right

In ad­di­tion to the car­bon cost, the net­work-band­width cost is paid by ISPs, by mo­bile net­work op­er­a­tors, by users on me­tered con­nec­tions, and by every piece of net­work in­fra­struc­ture that has to carry an un­wanted 4 GB pay­load to a des­ti­na­tion that did not ask for it. Per the Pärssinen ref­er­ence, around 50% of that de­liv­ery en­ergy is in the ac­cess net­work and CDN edge, around 30% is in user-side equip­ment (router, mo­dem, NIC), and the re­main­der is in the core. None of that in­fra­struc­ture ex­ists for free. Every byte Chrome pushes is a byte that com­petes with bytes the user ac­tu­ally wanted.

For users on capped mo­bile data plans, par­tic­u­larly in re­gions where smart­phone-as-only-in­ter­net is dom­i­nant (much of Africa, much of South and Southeast Asia, most of Latin America), 4 GB of un­re­quested down­load is on the or­der of a mon­th’s data al­lowance, vapourised by Chrome on the user’s be­half. Google has not, to my knowl­edge, pub­lished any analy­sis of the wel­fare im­pact of this on the pop­u­la­tions whose in­ter­net ac­cess is me­tered.

Keep in mind that mobile data plans (4G and 5G) are used by many households with no access to fibre, cable, or ADSL, and that they back desktop devices as well as phones - so the argument that Google won't push this to mobile connections (an argument I have found nothing official to support anyway) will not fly.

What Google should have done

This is not a hard list. It is the same list I gave Anthropic in the Claude Desktop ar­ti­cle, ap­plied to Google.

Ask. First time Chrome is about to download the Nano model, pop a dialogue: “Chrome would like to download a 4 GB AI model file to your device to power the following features. Allow, or skip and decide later.” Two buttons. Done.

Pull, not push. Trigger the down­load as a down­stream con­se­quence of the user in­vok­ing an AI fea­ture for the first time. Let the fea­ture it­self be the con­sent event. Do not pre-stage on a con­tin­gency.

Surface it. In chrome://settings, list the AI model files Chrome has downloaded, their size, the features they power, and a “Remove and stop downloading” button per model. Make removal persistent, not a transient state Chrome corrects on next launch.

Document it. Tell the user, plainly, in the Chrome de­scrip­tion on the Microsoft Store, in the Chrome in­staller, on the Google Chrome down­load page, that Chrome will down­load ad­di­tional model files of sub­stan­tial size on sup­ported hard­ware. Currently, this is es­sen­tially un­doc­u­mented to a nor­mal user.

Respect dele­tion. If the user deletes weights.bin, do not re-cre­ate it. If the user has a strong pref­er­ence about what is on their disk, the ap­pli­ca­tion is not in a po­si­tion to over­ride that pref­er­ence be­cause the ap­pli­ca­tion thinks it knows bet­ter.

Disclose at scale. Publish, in Google’s an­nual ESG re­port, the ag­gre­gate band­width and car­bon foot­print of all AI-feature model pushes to user de­vices, bro­ken down by re­gion. Treat it as the Scope 3 Category 11 emis­sion it is. Account for it.

DNSSEC Debugger - nic.de

dnssec-analyzer.verisignlabs.com

Analyzing DNSSEC problems for nic.de

Want a second opinion? Test nic.de at dnsviz.net.

AI didn't delete your database, you did

idiallo.com

Last week, a tweet went viral showing a guy claiming that a Cursor/Claude agent deleted his company's production database. We watched from the sidelines as he tried to get a confession from the agent: “Why did you delete it when you were told never to perform this action?” Then he tried to parse the answer to either learn from his mistake or warn us about the dangers of AI agents.

I have a ques­tion too: why do you have an API end­point that deletes your en­tire pro­duc­tion data­base? His post ram­bled on about false mar­ket­ing in AI, bad cus­tomer sup­port, and so on. What was miss­ing was ac­count­abil­ity.

I'm not one to blindly defend AI; I always err on the side of caution. But I also know you can't blame a tool for your own mistakes.

In 2010, I worked with a company that had a very manual deployment process. We used SVN for version control. To deploy, we had to copy trunk, the equivalent of the master branch, into a release folder labeled with a release date. Then we made a second copy of that release and called it “current.” That way, pulling the current folder always gave you the latest release.

One day, while de­ploy­ing, I ac­ci­den­tally copied trunk twice. To fix it via the CLI, I edited my pre­vi­ous com­mand to delete the du­pli­cate. Then I con­tin­ued the de­ploy­ment with­out any is­sues… or so I thought. Turns out, I had­n’t deleted the du­pli­cate copy at all. I had edited the wrong com­mand and deleted trunk in­stead. Later that day, an­other de­vel­oper was con­fused when he could­n’t find it.

All hell broke loose. Managers scram­bled, meet­ings were called. By the time the news reached my team, the lead de­vel­oper had al­ready run a com­mand to re­vert the dele­tion. He checked the logs, saw that I was re­spon­si­ble, and my next task was to write a script to au­to­mate our de­ploy­ment process so this kind of mis­take could­n’t hap­pen again. Before the day was over, we had a more ro­bust sys­tem in place. One that even­tu­ally grew into a full CI/CD pipeline.

Automation helps eliminate the silly mistakes that come with manual, repetitive work. We could have easily gone around asking “Why didn't SVN prevent us from deleting trunk?” But the real problem was our manual process. Unlike machines, we can't repeat a task exactly the same way every single day. We are bound to slip up eventually.

With AI generating large swaths of code, we get the illusion of that same security. But automation means doing the same thing the same way every time. AI is more like me copying and pasting branches: it's bound to make mistakes, and it's not equipped to explain why it did what it did. The terms we use, like “thinking” and “reasoning,” may look like reflection from an intelligent agent. But these are marketing terms slapped on top of AI. In reality, the models are still just generating tokens.

Now, back to the main problem this guy faced. Why does a public-facing API that can delete all your production databases even exist? If the AI hadn't called that endpoint, someone else eventually would have. It's like putting a self-destruct button on your car's dashboard. You have every reason not to press it, because you like your car and it takes you from point A to point B. But a motivated toddler who wiggles out of his car seat will hit that big red button the moment he sees it. You can't then interrogate the child about his reasoning. Mine would have answered simply: “I did it because I pressed it.”

I sus­pect a large part of this com­pa­ny’s ap­pli­ca­tion was vibe-coded. The soft­ware ar­chi­tects used AI to spec the prod­uct from AI-generated de­scrip­tions pro­vided by the prod­uct team. The de­vel­op­ers used AI to write the code. The re­view­ers used AI to ap­prove it. Now, when a bug ap­pears, the only op­tion is to in­ter­ro­gate yet an­other AI for an­swers, prob­a­bly not even run­ning on the same GPU that gen­er­ated the orig­i­nal code. You can’t blame the GPU!

The simple solution is to know what you're deploying to production. The more realistic one is, if you're going to use AI extensively, build a process where competent developers use it as a tool to augment their work, not a way to avoid accountability. And please, don't let your CEO or CTO write the code.

Accelerating Gemma 4: faster inference with multi-token prediction drafters

blog.google

May 05, 2026

By us­ing Multi-Token Prediction (MTP) drafters, Gemma 4 mod­els re­duce la­tency bot­tle­necks and achieve im­proved re­spon­sive­ness for de­vel­op­ers.

Olivier Lacombe

Director, Product Management

Maarten Grootendorst

Developer Relations Engineer

Just a few weeks ago, we in­tro­duced Gemma 4, our most ca­pa­ble open mod­els to date. With over 60 mil­lion down­loads in just the first few weeks, Gemma 4 is de­liv­er­ing un­prece­dented in­tel­li­gence-per-pa­ra­me­ter to de­vel­oper work­sta­tions, mo­bile de­vices and the cloud. Today, we are push­ing ef­fi­ciency even fur­ther.

We’re re­leas­ing Multi-Token Prediction (MTP) drafters for the Gemma 4 fam­ily. By us­ing a spe­cial­ized spec­u­la­tive de­cod­ing ar­chi­tec­ture, these drafters de­liver up to a 3x speedup with­out any degra­da­tion in out­put qual­ity or rea­son­ing logic.

Tokens-per-second speed in­creases, tested on hard­ware us­ing LiteRT-LM, MLX, Hugging Face Transformers, and vLLM.

Why spec­u­la­tive de­cod­ing?

The tech­ni­cal re­al­ity is that stan­dard LLM in­fer­ence is mem­ory-band­width bound, cre­at­ing a sig­nif­i­cant la­tency bot­tle­neck. The proces­sor spends the ma­jor­ity of its time mov­ing bil­lions of pa­ra­me­ters from VRAM to the com­pute units just to gen­er­ate a sin­gle to­ken. This leads to un­der-uti­lized com­pute and high la­tency, es­pe­cially on con­sumer-grade hard­ware.

Speculative decoding decouples token generation from verification. By pairing a heavy target model (e.g., Gemma 4 31B) with a lightweight drafter (the MTP model), we can utilize idle compute to “predict” several future tokens at once with the drafter in less time than it takes for the target model to process just one token. The target model then verifies all of these suggested tokens in parallel.

How spec­u­la­tive de­cod­ing works

Standard large language models generate text autoregressively, producing exactly one token at a time. While effective, this process dedicates the same amount of computation to predicting an obvious continuation (like predicting “words” after “Actions speak louder than…”) as it does to solving a complex logic puzzle.

MTP mitigates this inefficiency through speculative decoding, a technique introduced by Google researchers in “Fast Inference from Transformers via Speculative Decoding”. If the target model agrees with the draft, it accepts the entire sequence in a single forward pass - and even generates an additional token of its own in the process. This means your application can output the full drafted sequence plus one token in the time it usually takes to generate a single one.
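
To make the mechanism concrete, here is a toy sketch of the draft-and-verify loop - illustrative only, not Google's implementation, with stand-in functions where the real models would be:

# `draft` and `target` stand in for the small MTP drafter and the full
# Gemma model; here they are just functions mapping a token sequence to
# the next token.
def speculative_step(seq, draft, target, k=4):
    drafted = []
    for _ in range(k):                     # the drafter proposes k tokens cheaply
        drafted.append(draft(seq + drafted))
    accepted = []
    for tok in drafted:                    # in a real system these k checks are
        verdict = target(seq + accepted)   # one batched forward pass of the target
        if verdict != tok:
            accepted.append(verdict)       # first disagreement: keep the target's
            return seq + accepted          # own token and stop
        accepted.append(tok)
    accepted.append(target(seq + accepted))  # full agreement: one extra token free
    return seq + accepted

# Toy demo: both "models" share the same rule, so every draft is accepted.
model = lambda s: sum(s) % 11
print(speculative_step([1, 2, 3], draft=model, target=model))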

Unlocking faster AI from the edge to the work­sta­tion

For de­vel­op­ers, in­fer­ence speed is of­ten the pri­mary bot­tle­neck for pro­duc­tion de­ploy­ment. Whether you are build­ing cod­ing as­sis­tants, au­tonomous agents that re­quire rapid multi-step plan­ning, or re­spon­sive mo­bile ap­pli­ca­tions run­ning en­tirely on-de­vice, every mil­lisec­ond mat­ters.

By pair­ing a Gemma 4 model with its cor­re­spond­ing drafter, de­vel­op­ers can achieve:

Improved re­spon­sive­ness: Drastically re­duce la­tency for near real-time chat, im­mer­sive voice ap­pli­ca­tions and agen­tic work­flows.

Supercharged lo­cal de­vel­op­ment: Run our 26B MoE and 31B Dense mod­els on per­sonal com­put­ers and con­sumer GPUs with un­prece­dented speed, pow­er­ing seam­less, com­plex of­fline cod­ing and agen­tic work­flows.

Enhanced on-de­vice per­for­mance: Maximize the util­ity of our E2B and E4B mod­els on edge de­vices by gen­er­at­ing out­puts faster, which in turn pre­serves valu­able bat­tery life.

Zero qual­ity degra­da­tion: Because the pri­mary Gemma 4 model re­tains the fi­nal ver­i­fi­ca­tion, you get iden­ti­cal fron­tier-class rea­son­ing and ac­cu­racy, just de­liv­ered sig­nif­i­cantly faster.

Gemma 4 26B on an NVIDIA RTX PRO 6000. Standard Inference (left) vs. MTP Drafter (right) in tokens per second. Same output quality, half the wait time.

Where you can dive deeper into MTP drafters

To make these MTP drafters ex­cep­tion­ally fast and ac­cu­rate, we in­tro­duced sev­eral ar­chi­tec­tural en­hance­ments un­der the hood. The draft mod­els seam­lessly uti­lize the tar­get mod­el’s ac­ti­va­tions and share its KV cache, mean­ing they don’t have to waste time re­cal­cu­lat­ing con­text the larger model has al­ready fig­ured out. For our E2B and E4B edge mod­els, where the fi­nal logit cal­cu­la­tion be­comes a big bot­tle­neck, we even im­ple­mented an ef­fi­cient clus­ter­ing tech­nique in the em­bed­der to fur­ther ac­cel­er­ate gen­er­a­tion.

We’ve also been closely an­a­lyz­ing hard­ware-spe­cific op­ti­miza­tions. For ex­am­ple, while the 26B mix­ture-of-ex­perts model pre­sents unique rout­ing chal­lenges at a batch size of 1 on Apple Silicon, pro­cess­ing mul­ti­ple re­quests si­mul­ta­ne­ously (e.g., batch sizes of 4 to 8) un­locks up to a ~2.2x speedup lo­cally. We see sim­i­lar gains with Nvidia A100 when in­creas­ing batch size.

Want to see the ex­act me­chan­ics of how this works? We’ve pub­lished an in-depth tech­ni­cal ex­plainer that un­packs the vi­sual ar­chi­tec­ture, KV cache shar­ing and ef­fi­cient em­bed­ders pow­er­ing these drafters.

How to get started

The MTP drafters for the Gemma 4 family are available today under the same open-source Apache 2.0 license as Gemma 4. Read the documentation to learn how to use MTP with Gemma 4. You can download the model weights right now on Hugging Face or Kaggle and start experimenting with faster inference with transformers, MLX, vLLM, SGLang, and Ollama, or try them directly on Google AI Edge Gallery for Android or iOS.

We can’t wait to see how this new­found speed ac­cel­er­ates what you build next in the Gemmaverse.

GitHub - angelos-p/llm-from-scratch

github.com

Train Your Own LLM From Scratch

A hands-on work­shop where you write every piece of a GPT train­ing pipeline your­self, un­der­stand­ing what each com­po­nent does and why.

Andrej Karpathy’s nanoGPT was my first real ex­po­sure to LLMs and trans­form­ers. Seeing how a work­ing lan­guage model could be built in a few hun­dred lines of PyTorch com­pletely changed how I thought about AI and in­spired me to go deeper into the space.

This work­shop is my at­tempt to give oth­ers that same ex­pe­ri­ence. nanoGPT tar­gets re­pro­duc­ing GPT-2 (124M params) and cov­ers a lot of ground. This pro­ject strips it down to the es­sen­tials and scales it to a ~10M param model that trains on a lap­top in un­der an hour — de­signed to be com­pleted in a sin­gle work­shop ses­sion.

What You’ll Build

A work­ing GPT model trained from scratch on your MacBook, ca­pa­ble of gen­er­at­ing Shakespeare-like text. You’ll write:

Tokenizer — turn­ing text into num­bers the model can process

Model ar­chi­tec­ture — the trans­former: em­bed­dings, at­ten­tion, feed-for­ward lay­ers

Training loop — for­ward pass, loss, back­prop, op­ti­mizer, learn­ing rate sched­ul­ing

Text gen­er­a­tion — sam­pling from your trained model

Prerequisites

Any lap­top or desk­top (Mac, Linux, or Windows)

Python 3.12+

Comfort read­ing Python code (you don’t need ML ex­pe­ri­ence)

Training uses Apple Silicon GPU (MPS), NVIDIA GPU (CUDA), or CPU au­to­mat­i­cally. Also works on Google Colab — up­load the files and run with !python train.py.
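
If you're curious, that device auto-selection usually boils down to a few lines like these (a sketch, not the workshop's actual train.py):

import torch

def pick_device() -> torch.device:
    # Prefer CUDA, then Apple Silicon's MPS backend, then plain CPU.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

print(f"training on {pick_device()}")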

Getting Started

Local (recommended)

Install uv if you don’t have it:

# ma­cOS / Linux

curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Then set up the pro­ject:

uv sync

mkdir scratch­pad && cd scratch­pad

Google Colab

If you don’t have a lo­cal setup, up­load the repo to Colab and in­stall de­pen­den­cies:

!pip install torch numpy tqdm tiktoken

Upload data/shakespeare.txt to your Colab files, then write your code in notebook cells or upload .py files and run them with !python train.py.

Work through the docs in or­der. Each part walks you through writ­ing a piece of the pipeline, ex­plain­ing what each com­po­nent does and why. By the end, you’ll have a work­ing model.py, train.py, and gen­er­ate.py that you wrote your­self.

Architecture: GPT at a Glance

Input Text
     ↓
┌─────────────────┐
│ Tokenizer       │ "hello" → [20, 43, 50, 50, 53] (character-level)
└────────┬────────┘
         ↓
┌─────────────────┐
│ Token Embed +   │ token IDs → vectors (n_embd dimensions)
│ Position Embed  │ + positional information
└────────┬────────┘
         ↓
┌─────────────────┐
│ Transformer     │ × n_layer
│ Block:          │
│ ┌────────────┐  │
│ │ LayerNorm  │  │
│ │ Self-Attn  │  │ n_head parallel attention heads
│ │ + Residual │  │
│ ├────────────┤  │
│ │ LayerNorm  │  │
│ │ MLP (FFN)  │  │ expand 4x, GELU, project back
│ │ + Residual │  │
│ └────────────┘  │
└────────┬────────┘
         ↓
┌─────────────────┐
│ LayerNorm       │
│ Linear → logits │ vocab_size outputs (probability over next token)
└─────────────────┘

Model Configs for This Workshop

All con­figs use char­ac­ter-level to­k­eniza­tion (vocab_size=65) and block­_­size=256.
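
For orientation, the tokenizer you'll write in the first part has roughly this shape (a sketch; the workshop walks you through your own version):

# The vocabulary is just the sorted set of characters in the corpus
# (65 for the Shakespeare file used here).
text = open("data/shakespeare.txt").read()
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

assert decode(encode("hello")) == "hello"
print(len(chars), encode("hello"))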

Tokenization: Characters vs BPE

This workshop uses character-level tokenization on Shakespeare. BPE tokenization (GPT-2's 50k vocab) doesn't work on small datasets - most token bigrams are too rare for the model to learn patterns from.

Part 5 cov­ers switch­ing to BPE for larger datasets.
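
For a taste of the contrast, tiktoken (already in the dependency list above) exposes GPT-2's BPE directly:

import tiktoken

enc = tiktoken.get_encoding("gpt2")   # GPT-2's 50,257-token BPE vocabulary
ids = enc.encode("Actions speak louder than words")
print(ids)          # far fewer ids than characters...
print(enc.n_vocab)  # ...but a vocabulary a tiny model cannot hope to cover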

Key References

nanoGPT — The pro­ject this work­shop is based on. Minimal GPT train­ing in ~300 lines of PyTorch

build-nanogpt video lec­ture — 4-hour video build­ing GPT-2 from an empty file

Karpathy’s mi­crogpt — A full GPT in 200 lines of pure Python, no de­pen­den­cies

nanochat — Full ChatGPT clone train­ing pipeline

Attention Is All You Need (2017) — The orig­i­nal trans­former pa­per

GPT-2 pa­per (2019) — Language mod­els as un­su­per­vised learn­ers

TinyStories pa­per — Why small mod­els trained on cu­rated data punch above their weight

Async Rust never left the MVP state - Blog - Tweede golf

tweedegolf.nl

I’ve pre­vi­ously ex­plained async bloat and some work-arounds for it, but would much pre­fer to solve the is­sue at the root, in the com­piler. I’ve sub­mit­ted a Project Goal, and am look­ing for help to fund the ef­fort.

I love me some async Rust! It’s amaz­ing how we can write ex­ecu­tor ag­nos­tic code that can run con­cur­rently on huge servers and tiny mi­cro­con­trollers.

But especially on those tiny microcontrollers we notice that async Rust is far from the zero cost abstractions we were promised. That's because every byte of binary size counts and async introduces a lot of bloat. This bloat exists on desktops and servers as well, but it's much less noticeable when you have substantially more memory and compute available.

I’ve pre­vi­ously ex­plained some work-arounds for this is­sue, but would much pre­fer to get to the root of the prob­lem, and work on im­prov­ing async bloat in the com­piler. As such I have sub­mit­ted a Project Goal.

This is part 2 of my blog se­ries on this topic. See part 1 for the ini­tial ex­plo­ration of the topic and what you can do when writ­ing async code to avoid some of the bloat. In this sec­ond part we’ll dive into the in­ter­nals and trans­late the meth­ods of blog 1 into op­ti­miza­tions for the com­piler.

What I won’t be talk­ing about is the of­ten dis­cussed prob­lem of fu­tures be­com­ing big­ger than nec­es­sary and them do­ing a lot of copy­ing. People are aware of that al­ready. In fact, there is an open PR that tack­les part of it: https://​github.com/​rust-lang/​rust/​pull/​135527

Anatomy of a gen­er­ated fu­ture

We’re go­ing to be look­ing at this code:

fn foo() -> impl Future<Output = i32> {
    async { 5 }
}

fn bar() -> impl Future<Output = i32> {
    async {
        foo().await + foo().await
    }
}

god­bolt

We’re us­ing the desug­ared syn­tax for fu­tures be­cause it’s eas­ier to see what’s hap­pen­ing.

So what does the bar fu­ture look like?

There are two await points, so the state ma­chine must have at least two states, right?

Well, yes. But there’s more.

Luckily we can ask the compiler to dump MIR for us at various passes. An interesting pass is the coroutine_resume pass. This is the last async-specific MIR pass. Why is this important? Well, async is a language feature that still exists in MIR, but not in LLVM IR. So the transformation of async to state machine happens as a MIR pass.

The bar func­tion gen­er­ates 360 lines of MIR. Pretty crazy, right? Although this gets op­ti­mized some­what later on, the non-async ver­sion uses only 23 lines for this.

The com­piler also out­puts the CoroutineLayout. It’s ba­si­cally an enum with these states (comments my own):

variant_fields: {
    Unresumed(0): [],         // Starting state
    Returned (1): [],
    Panicked (2): [],
    Suspend0 (3): [_s1],      // At await point 1, _s1 = the foo future
    Suspend1 (4): [_s0, _s2], // At await point 2, _s0 = result of _s1, _s2 = the second foo future
},

So what are Returned and Panicked?

Well, Future::poll is a safe func­tion. Calling it must not in­duce any UB, even when the fu­ture is done. So af­ter Suspend1 the fu­ture re­turns Ready and the fu­ture is changed to the Returned state. Once polled again in that state, the poll func­tion will panic.

The Panicked state ex­ists so that af­ter an async fn has pan­icked, but the catch-un­wind mech­a­nism was used to catch it, the fu­ture can’t be polled any­more. Polling a fu­ture in the Panicked state will panic. If this mech­a­nism was­n’t there, we could poll the fu­ture again af­ter a panic. But the fu­ture may be in an in­com­plete state and so that could cause UB. This mech­a­nism is very sim­i­lar to mu­tex poi­son­ing.

(I’m 90% sure I’m cor­rect about the Panicked state, but I can’t re­ally find any docs that ac­tu­ally de­scribe this.)

Cool, this seems rea­son­able.

Why panic?

But is it rea­son­able? Futures in the Returned state will panic. But they don’t have to. The only thing we can’t do is cause UB to hap­pen.

Panics are rel­a­tively ex­pen­sive. They in­tro­duce a path with a side-ef­fect that’s not eas­ily op­ti­mized out. What if in­stead, we just re­turn Pending again? Nothing un­safe go­ing on, so we ful­fill the con­tract of the Future type.

I’ve hacked this in the com­piler to try it out and saw a 2%-5% re­duc­tion in bi­nary size for async em­bed­ded firmware.

So I pro­pose this should be a switch, just like over­flow-checks = false is for in­te­ger over­flow. In de­bug builds it would still panic so that wrong be­hav­ior is im­me­di­ately vis­i­ble, but in re­lease builds we get smaller fu­tures.

Similarly, when panic=abort is used, we might be able to get rid of the Panicked state al­to­gether. I want to look into the reper­cus­sions of that.

Always a state ma­chine

We’ve looked at bar, but not yet at foo.

fn foo() -> impl Future<Output = i32> {
    async { 5 }
}

Let’s im­ple­ment it man­u­ally, to see what the op­ti­mal so­lu­tion would be.

struct FooFut;

impl Future for FooFut {
    type Output = i32;

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Self::Output> {
        Poll::Ready(5)
    }
}

Easy right? We don’t need any state. We just re­turn the num­ber.

Let’s see what the gen­er­ated MIR is for the ver­sion the com­piler gives us:

// MIR for `foo::{closure#0}` 0 coroutine_resume

/* coroutine_layout = CoroutineLayout {
    field_tys: {},
    variant_fields: {
        Unresumed(0): [],
        Returned (1): [],
        Panicked (2): [],
    },
    storage_conflicts: BitMatrix(0x0) {},
} */

fn foo::{closure#0}(_1: Pin<&mut {async block@src\main.rs:5:5: 5:10}>, _2: &mut Context<'_>) -> Poll<i32> {
    debug _task_context => _2;
    let mut _0: core::task::Poll<i32>;
    let mut _3: i32;
    let mut _4: u32;
    let mut _5: &mut {async block@src\main.rs:5:5: 5:10};

    bb0: {
        _5 = copy (_1.0: &mut {async block@src\main.rs:5:5: 5:10});
        _4 = discriminant((*_5));
        switchInt(move _4) -> [0: bb1, 1: bb4, otherwise: bb5];
    }

    bb1: {
        _3 = const 5_i32;
        goto -> bb3;
    }

    bb2: {
        _0 = Poll::<i32>::Ready(move _3);
        discriminant((*_5)) = 1;
        return;
    }

    bb3: {
        goto -> bb2;
    }

    bb4: {
        assert(const false, "`async fn` resumed after completion") -> [success: bb4, unwind unreachable];
    }

    bb5: {
        unreachable;
    }
}

Yikes! That’s a lot of code!

Notice in the coroutine_layout comment that we still have the 3 default states, and in bb0 that we're still switching on the discriminant. There's a big optimization opportunity here that we're not using, i.e. to have no states and always return Poll::Ready(5) on every poll.

iOS 27 is adding a 'Create a Pass' button to Apple Wallet

walletwallet.alen.ro

Bloomberg's Mark Gurman reported on Monday that iOS 27 will add a “Create a Pass” feature to the Wallet app. Tap the “+” button you already use to add credit cards or pass emails, and Wallet will offer something it has never offered before on iPhone: a path to build your own pass.

You can scan a QR code on a pa­per ticket or mem­ber­ship card with the cam­era, or build a pass from scratch in a lay­out ed­i­tor. The whole flow runs with­out an Apple Developer ac­count, a Pass Type ID, or any cer­tifi­cate sign­ing.

iOS 27 is ex­pected to pre­view at WWDC on June 8, with a pub­lic re­lease in September.

How the new flow works

Reporting from Bloomberg, MacRumors, 9to5Mac, and AppleInsider lines up on the same workflow. Inside the Wallet app, the existing “+” button gains a new option for creating a pass. From there you choose between two starting points:

Scan a QR code from a pa­per card, ticket, or screen

Build a cus­tom pass from scratch with no scan needed

Once you are in the ed­i­tor, Wallet ex­poses ad­justable styles, im­ages, col­ors, and text fields. The re­ports de­scribe a fairly con­ven­tional tem­plate-dri­ven lay­out, closer in spirit to what Pass2U, WalletWallet, and other third-party gen­er­a­tors have of­fered for years than to Apple’s de­vel­oper-only PassKit pipeline.

Three tem­plates, color-coded

Apple is test­ing three start­ing tem­plates, each tied to a de­fault color:

Standard (orange): the de­fault for any gen­eral-pur­pose pass.

Membership (blue): geared to­ward gyms, clubs, li­braries, and other re­cur­ring-ac­cess cards.

Event (purple): meant for tick­ets to games, movies, and one-off oc­ca­sions.

The color choice is not just dec­o­ra­tion. Wallet cur­rently sorts passes vi­su­ally in the stack, and the tem­plate hue is what sets each card apart at a glance, so a quick look is enough to pick out the or­ange punch card from the pur­ple ticket with­out read­ing a word.

Why now: 14 years of PassKit drought

Apple shipped PassKit alongside iOS 6 back in 2012. The pitch was clean: businesses build .pkpass files, customers tap to add, everyone wins. In practice, the consistent adopters ended up being airlines, big-box retailers, ticketing platforms, and a handful of national chains. Most gyms, cafes, libraries, rec centers, and small loyalty programs never built one, because the path requires an Apple Developer account, signing certificates, and enough engineering work that “just print a paper card” almost always won the budget conversation.
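For context on what that pipeline produces: a .pkpass is a zip archive containing a pass.json plus artwork, signed with the developer's pass certificate. A minimal sketch of the JSON, with every identifier hypothetical:

{
  "formatVersion": 1,
  "passTypeIdentifier": "pass.com.example.loyalty",
  "teamIdentifier": "ABCDE12345",
  "serialNumber": "0001",
  "organizationName": "Example Coffee",
  "description": "Loyalty card",
  "barcode": {
    "format": "PKBarcodeFormatQR",
    "message": "member-0001",
    "messageEncoding": "iso-8859-1"
  },
  "storeCard": {}
}

The signing requirement, not the JSON, is what kept small businesses out; Create a Pass removes that step for end users entirely.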

The Next Web's framing is blunt: Apple is no longer waiting on developers. With Create a Pass, the supply-side problem is finally being solved from the demand side. If the business will not build a Wallet pass, the user does it themselves from the QR code the business already printed.

That is a mean­ing­ful shift in pos­ture. For more than a decade, Wallet has been a di­rec­tory of what brands chose to ship. In iOS 27 it be­comes a di­rec­tory of what peo­ple choose to keep.

What this means for WalletWallet

We will be hon­est. WalletWallet ex­ists be­cause of this ex­act gap. You take a bar­code from any loy­alty card, paste it into our web app, pick a color, and a free Apple Wallet pass lands on your phone in about a minute, all from the browser with­out an ac­count or any de­vel­oper setup. Once Create a Pass ships in September, a chunk of that work­flow moves na­tively into the iPhone Wallet app.

That is good for users. We started this pro­ject to make Wallet friend­lier for the cafes-and-gyms long tail, and Apple agree­ing with us at OS-level scope is a healthy out­come. The cat­e­gory needed it.

A few places where we still help, even af­ter iOS 27 ships:

Google Wallet. Create a Pass is iPhone-only. Roughly half of the wal­let-us­ing world is on Android, and our gen­er­a­tor builds Google Wallet passes from the same form.

Web, no OS up­grade. iOS 27 needs a com­pat­i­ble iPhone and the September up­date. WalletWallet runs in any browser to­day. iOS 14, iPad, Mac, a friend’s lap­top, all fine.

Tag passes with real in­te­gra­tions. Our Bandcamp, SoundCloud, and Spotify pass builders pull artist art and links au­to­mat­i­cally into a tag pass. That is a dif­fer­ent shape from the generic tem­plated pass Apple is show­ing.

Sharing. A web-gen­er­ated .pkpass is just a file. You can email it, post it, hand it to a friend on Android via QR. The Wallet-native flow is more locked to the de­vice that built it.

We ex­pect to lose vol­ume on the sim­plest one-bar­code-to-Wal­let case once Create a Pass goes live. That is fine. The rea­son WalletWallet started was that Apple’s bar for a Wallet pass was too high for nor­mal peo­ple. If iOS 27 low­ers that bar, the world we wanted is closer.

What we still do not know

The cur­rent re­ports cover the UI, the tem­plates, and the high-level work­flow. They are silent on a lot of de­tails that mat­ter:

Whether iCloud will sync user-cre­ated passes across iPhone, iPad, and Mac

Whether passes can be ex­ported as .pkpass files to share with non-iPhone users

Whether Wallet sup­ports Code 128, PDF417, and Aztec bar­codes, or only QR

Whether mer­chants can claim, co-sign, or up­date user-cre­ated passes af­ter the fact

Whether passes have lock-screen be­hav­ior tied to time and lo­ca­tion, the way de­vel­oper-is­sued passes do to­day

We will know more once Apple pre­views iOS 27 at WWDC on June 8, and again when the first de­vel­oper be­tas land. We will up­date this post when there is some­thing con­crete to add.

Quick re­cap

iOS 27 is adding a Create a Pass but­ton to the Wallet app, with a QR-scan or build-from-scratch flow and three color-coded tem­plates: Standard (orange), Membership (blue), and Event (purple). Bloomberg broke the story on May 4, and a pub­lic re­lease is ex­pected in September 2026. It will be the first time iPhone users do not need a third-party tool to put a bar­code into Wallet, and for us that is a sign the cat­e­gory is ma­tur­ing the right way.


Three Inverse Laws of AI

susam.net

By Susam Pal on 12 Jan 2026

Introduction

Since the launch of ChatGPT in November 2022, generative artificial intelligence (AI) chatbot services have become increasingly sophisticated and popular. These systems are now embedded in search engines, software development tools as well as office software. For many people, they have quickly become part of everyday computing.

These services have turned out to be quite useful, especially for exploring unfamiliar topics and as a general productivity aid. However, I also think that the way these services are advertised and consumed can pose a danger to society, especially if we get into the habit of trusting their output without further scrutiny.

Contents

Introduction

Pitfalls

Inverse Laws of Robotics

Non-Anthropomorphism

Non-Deference

Non-Abdication of Responsibility

Conclusion

Pitfalls

Certain design choices in modern AI systems can encourage uncritical acceptance of their output. For example, many popular search engines are already highlighting answers generated by AI at the very top of the page. When this happens, it is easy to stop scrolling, accept the generated answer and move on. Over time, this could inadvertently train users to treat AI as the default authority rather than as a starting point for further investigation. I wish that each such generative AI service came with a brief but conspicuous warning explaining that these systems can sometimes produce output that is factually incorrect, misleading or incomplete. Such warnings should highlight that habitually trusting AI output can be dangerous. In my experience, even when such warnings exist, they tend to be minimal and visually deemphasised.

In the world of science fiction, there are the Three Laws of Robotics devised by Isaac Asimov, which recur throughout his work. These laws were designed to constrain the behaviour of robots in order to keep humans safe. As far as I know, Asimov never formulated any equivalent laws governing how humans should interact with robots. I think we now need something to that effect to keep ourselves safe. I will call them the Inverse Laws of Robotics. These apply to any situation that requires us humans to interact with a robot, where the term ‘robot’ refers to any machine, computer program, software service or AI system that is capable of performing complex tasks automatically. I use the term ‘inverse’ here not in the sense of logical negation but to indicate that these laws apply to humans rather than to robots.

It is well known that Asimov's laws were flawed. Indeed, Asimov used those flaws to great effect as a source of tension. But the particular ways in which they fail for fictional robots do not necessarily carry over to these inverse laws for humans. Asimov's laws try to constrain the behaviour of autonomous robots. However, these inverse laws are meant to guide the judgement and conduct of humans. Still, one thing we can learn from Asimov's stories is that no finite set of laws can ever be foolproof for the complex issues we face with AI and robotics. But that does not mean we should not even try. There will always be edge cases where judgement is required. A non-exhaustive set of principles can still be useful if it helps us think more clearly about the risks involved.

Inverse Laws of Robotics

Here are the three inverse laws of robotics:

Humans must not anthropomorphise AI systems.

Humans must not blindly trust the output of AI systems.

Humans must remain fully responsible and accountable for consequences arising from the use of AI systems.

Non-Anthropomorphism

Humans must not anthropomorphise AI systems. That is, humans must not attribute emotions, intentions or moral agency to them. Anthropomorphism distorts judgement. In extreme cases, anthropomorphising can lead to emotional dependence.

Modern chatbot systems often sound conversational and empathetic. They use polite phrasing and conversational patterns that closely resemble human interaction. While this makes them easier and more pleasant to use, it also makes it easier to forget what they actually are: large statistical models producing plausible text based on patterns in data.

I think vendors of AI-based chatbot services could do a better job here. In many cases, the systems are deliberately tuned to feel more human rather than more mechanical. I would argue that the opposite approach would be healthier in the long term. A slightly more robotic tone would reduce the likelihood that users mistake fluent language for understanding, judgement or intent.

Whether or not vendors make such changes, it still serves us well, I think, to avoid this pitfall ourselves. We should actively resist the habit of treating AI systems as social actors or moral agents. Doing so preserves clear thinking about their capabilities and limitations.

Non-Deference

Humans must not blindly trust the output of AI systems. AI-generated content must not be treated as authoritative without independent verification appropriate to its context.

This principle is not unique to AI. In most areas of life, we should not accept information uncritically. In practice, of course, this is not always feasible. Not everyone is an expert in medicine or law, so we often rely on trusted institutions and public health authorities.

Should I Run Plain Docker Compose in Production in 2026?

distr.sh

I am Philip—an en­gi­neer work­ing at Distr, which helps soft­ware and AI com­pa­nies dis­trib­ute their ap­pli­ca­tions to self-man­aged en­vi­ron­ments.

Our Open Source Software Distribution plat­form is avail­able on GitHub (github.com/​distr-sh/​distr) and or­ches­trates both Docker Compose and Docker Swarm de­ploy­ments on cus­tomer hosts every day.

Most of the production incidents I have seen on Docker Compose hosts come from the same handful of quirks: an old container that should have been removed, a disk that filled up overnight, a health check that detected a problem and then did nothing about it, a :latest tag that pointed somewhere new, or a socket mount nobody thought twice about. None of these are bugs in Docker. They are deliberate trade-offs in a tool that started as internal tooling at dotCloud, a PaaS company that wrapped LXC to fix “it works on my machine”, and is now running the back end of a lot of real businesses. This post collects the recurring ones, with the commands and the operational answer for each.

Short an­swer: yes—plain Docker Compose can still run real pro­duc­tion work­loads in 2026, but only if you han­dle the op­er­a­tional gaps it leaves your­self.

Where Plain Docker Compose Fits in Production

Before the list of quirks, a quick word on the au­di­ence. Docker Compose is a de­clar­a­tive way to wire up a multi-con­tainer ap­pli­ca­tion: one YAML file de­scribes the ser­vices, the net­works be­tween them, the vol­umes they share, the en­vi­ron­ment they need, and—through the pat­terns for over­writ­ing or patch­ing ser­vice con­fig­u­ra­tion—the on-disk con­fig­u­ra­tion each ap­pli­ca­tion ex­pects. docker com­pose up rec­on­ciles the host to that file. The sweet spot in pro­duc­tion is the sin­gle-node de­ploy­ment built around ex­actly that—a ven­dor push­ing a multi-con­tainer ap­pli­ca­tion into a cus­tomer en­vi­ron­ment, an in­ter­nal team run­ning a long-tail ser­vice that does not jus­tify a Kubernetes clus­ter, an edge box in a re­tail lo­ca­tion. The foot­print is small, the op­er­a­tional over­head is low, and a com­pe­tent op­er­a­tor can rea­son about the whole stack from one docker-com­pose.yaml. There is no con­trol plane be­hind Compose it­self—no sched­uler watch­ing the host, no rec­on­ciler reap­ply­ing state, no op­er­a­tor push­ing up­dates from some­where else. docker com­pose up runs once and ex­its.
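For reference, here is a minimal sketch of the shape being described; the service names, images, and values are all illustrative:

services:
  app:
    image: myapp:1.4
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgres://db:5432/app
    depends_on:
      - db

  db:
    image: postgres:16
    volumes:
      - db-data:/var/lib/postgresql/data

volumes:
  db-data:

One docker compose up -d against this file creates the network, the volume, and both containers; running it again reconciles the host toward whatever the file now says.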

That ar­chi­tec­tural sim­plic­ity is ex­actly why the quirks bite. Compose as­sumes you—or who­ever runs the host—will do the op­er­a­tional work noth­ing else is do­ing, and if you ship Compose files to cus­tomers the safe as­sump­tion is that the cus­tomer will not. The rest of this post is about clos­ing the gap be­tween what Compose does and what a pro­duc­tion host ac­tu­ally needs, ei­ther by hand or with an agent that does it for you. If you have al­ready con­cluded that the gap is too wide and want to com­pare with the next step up, read our Docker Compose vs Kubernetes break­down.

Docker Compose Orphan Containers and --remove-orphans

Remove a service from docker-compose.yaml, run docker compose up -d, and the container you removed keeps running. It is detached from the project but still bound to the same networks and ports. docker compose ps will not show it, because Compose only lists what is in the current file. docker ps --filter label=com.docker.compose.project=<name> will, because Docker still has the label on the container. This is how you discover, six months in, that an old worker service has been quietly consuming RAM since the last refactor.

The fix is one flag:


docker compose up -d --remove-orphans

docker compose down --remove-orphans

The flag tells Compose: any container that was once part of this project but is no longer in the file should be removed. Networks Compose created for the project are reconciled the same way on each up, so orphan networks go away too. Volumes are the exception: Compose preserves named volumes by default to protect data, and there is no per-service flag to drop the ones a removed service used. To reclaim that space you have to do it manually: list candidates with docker volume ls --filter dangling=true and docker volume rm by name, or use docker compose down -v if you intend to wipe the project's volumes wholesale. To audit before deleting, list everything Docker still associates with the project name:

docker ps -a --filter label=com.docker.compose.project=<name>

Distr's Docker agent passes RemoveOrphans: true on every Compose Up call, so customer hosts never accumulate orphans across deployment updates. That single flag has eliminated a recurring class of “the old version is still answering on port 8080” support tickets.

Pruning Docker Images and Capping Container Logs

Every docker com­pose pull keeps the pre­vi­ous im­age on disk. Every con­tainer with the de­fault json-file log dri­ver writes un­bounded JSON to /var/lib/docker/containers/<id>/<id>-json.log. On a busy host this is one of the most com­mon rea­sons for an out­age: the disk fills and Docker stops be­ing able to write any­thing—logs, meta­data, im­age lay­ers—at which point con­tain­ers start fail­ing in con­fus­ing ways.

The first thing to learn is the au­dit com­mand:


docker sys­tem df

docker sys­tem df -v

-v breaks the to­tals down per im­age, con­tainer, vol­ume, and build cache, which is usu­ally enough to spot the of­fender. From there, the tar­geted prune com­mands:


docker image prune -a --filter "until=168h" -f   # delete unused images older than 7 days

docker container prune -f   # remove stopped containers

docker builder prune -f   # drop the BuildKit cache

docker volume prune -f exists too, and it is genuinely useful, but be careful: anything it deems unused is deleted for good, so audit with docker volume ls first.

The other half of the disk story is logs. Cap them at the dae­mon level, once, in /etc/docker/daemon.json:


{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

After systemctl restart docker, every new container will rotate its logs at 10 MB and keep at most three rotated files, a 30 MB ceiling per container, instead of “until the disk is gone.” Existing containers need to be recreated to pick up the new defaults.
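If restarting the daemon is not an option, the same cap can be set per service in the Compose file instead. A sketch with the same limits (service name and image are illustrative):

services:
  app:
    image: myapp:1.4
    logging:
      driver: json-file
      options:
        max-size: 10m
        max-file: "3"   # Compose requires log option values to be strings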

This is one of the top­ics worth get­ting right be­fore you ship.

In Distr’s Docker agent the cleanup is built in: each de­ploy­ment tar­get has an opt-out con­tainer im­age cleanup set­ting that re­moves the pre­vi­ous ver­sion’s im­ages au­to­mat­i­cally af­ter a suc­cess­ful up­date, with re­tries on fail­ure. It only fires on suc­cess, so the pre­vi­ous im­age stays on disk if some­thing goes wrong and you need to roll back.

Docker Health Checks Don’t Restart Unhealthy Containers

This is the one that sur­prises peo­ple the most. You add a HEALTHCHECK to your Dockerfile or a healthcheck: block to the ser­vice in Compose, you watch the con­tainer go from healthy to un­healthy, and then… noth­ing hap­pens. The Docker Engine re­ports the sta­tus. It does not act on it. restart: un­less-stopped is trig­gered by the con­tainer ex­it­ing, not by it be­ing marked un­healthy.

You can con­firm what Docker ac­tu­ally thinks:

docker inspect --format='{{json .State.Health}}' <container> | jq

You will see the sta­tus, the streak of fail­ures, and the last few probe out­puts—use­ful in­for­ma­tion that is silently ig­nored by the en­gine.

There are three an­swers to this:

Run an autoheal sidecar. The community standard is willfarrell/docker-autoheal: a tiny container that mounts the Docker socket, watches for unhealthy events, and restarts the offending container. You opt containers in by labeling them autoheal=true (or set AUTOHEAL_CONTAINER_LABEL=all to monitor everything). A minimal Compose sketch follows this list.

Run on Docker Swarm. Swarm restarts un­healthy tasks by de­fault. If you are al­ready con­sid­er­ing Swarm, this is one of the bet­ter rea­sons.

Use Distr. Every Distr Docker agent deploys an adapted autoheal service alongside it. The “Enable autoheal for all containers” toggle is on by default at deployment-target creation, so customer-side restarts of unhealthy containers happen without anyone configuring it.
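To make the sidecar option concrete, here is a minimal Compose sketch. The app service, image tag, and probe endpoint are illustrative, and the healthcheck assumes curl exists inside the image:

services:
  app:
    image: myapp:1.4
    restart: unless-stopped   # reacts to exits, not to unhealthy status
    labels:
      - autoheal=true         # opt this container in to autoheal restarts
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3

  autoheal:
    image: willfarrell/autoheal
    restart: always
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock   # root-equivalent access; see the socket section below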

Whichever path you pick, the take­away is the same: a HEALTHCHECK with­out some­thing act­ing on it is a sta­tus light, not a self-heal­ing sys­tem.

Pinning Docker Images by Digest Instead of :latest

Docker tags are mutable references. myapp:1.4 today is whatever the registry currently has under that tag; tomorrow it can point at a different layer set after a re-push. :latest is the worst offender because everyone treats it as a synonym for “stable” when in practice it often means “whatever was pushed most recently.” It is also the silent default: an unqualified image: nginx in a Compose file is treated as image: nginx:latest, so even Compose files that never type the word land on it by accident. The result, in production, is that two hosts pulling the “same” tag five minutes apart can end up running different code.

The fix is to pin by con­tent-ad­dress­able di­gest. Every im­age has one, and Docker ac­cepts it any­where a tag would go.

To find the di­gest for an im­age you al­ready pulled:


docker image inspect --format='{{index .RepoDigests 0}}' myapp:1.4

# myapp@sha256:9b7c…

Or, with­out pulling, from the lo­cal Docker in­stal­la­tion against the re­mote reg­istry:

docker buildx imagetools inspect myapp:1.4

In your Compose file, re­place the tag with the di­gest:


services:
  app:
    image: myapp@sha256:9b7c0a3e1f…

A pull against a di­gest fails fast if the reg­istry no longer has those bytes, which is ex­actly what you want—silent drift be­comes a loud er­ror. The same im­age ref­er­ence works in docker stack de­ploy, in docker run, and in Kubernetes man­i­fests.

For the broader pic­ture of what your cus­tomers can ex­tract from a pub­lished im­age (and why im­age hy­giene mat­ters be­yond re­pro­ducibil­ity), check out our guide on pro­tect­ing source code and IP in Docker and Kubernetes de­ploy­ments. And if you’re still pick­ing a reg­istry, our con­tainer reg­istry com­par­i­son walks through the trade-offs.

Why Mounting /var/run/docker.sock Is a Security Risk

A con­tainer with /var/run/docker.sock mounted can call the Docker API, and the Docker API can launch a priv­i­leged con­tainer that mounts the host’s root filesys­tem. In other words: any con­tainer with the socket has ef­fec­tively root priv­i­leges on the host. This is not a Docker bug; it is the threat model of the socket. It de­serves a mo­ment of at­ten­tion be­cause the line that grants this ac­cess is one bind mount in a Compose file and is easy to add with­out think­ing about it.

Practical hy­giene:

Inventory the con­tain­ers that mount the socket. Agents, CI run­ners, mon­i­tor­ing side­cars, con­tainer man­age­ment UIs—keep the list short and in­ten­tional.

Run rootless Docker where possible. dockerd-rootless-setuptool.sh install sets up a Docker daemon that runs as a regular user. The blast radius of a compromised socket-mounting container shrinks from “full host” to “this user account.”

Consider socket-proxy. Projects like Tecnativa's docker-socket-proxy expose a filtered subset of the API to the container that needs it (e.g. read-only containers and events for monitoring) instead of the full socket; see the sketch after this list.

Keep socket-mount­ing im­ages min­i­mal. Smaller sur­face, fewer li­braries, fewer ways in.
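Here is a minimal sketch of the socket-proxy pattern. The monitor service is hypothetical; the proxy image and its permission flags come from Tecnativa's documentation:

services:
  socket-proxy:
    image: tecnativa/docker-socket-proxy
    environment:
      CONTAINERS: 1   # allow the read-only /containers endpoints
      EVENTS: 1       # allow the events stream
      POST: 0         # deny every mutating call
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro

  monitor:
    image: example/monitoring-agent   # hypothetical consumer
    environment:
      DOCKER_HOST: tcp://socket-proxy:2375
    depends_on:
      - socket-proxy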

The Distr Docker agent does mount the socket—it has to, in or­der to or­ches­trate Compose and Swarm on the host. We doc­u­ment that bound­ary openly in the Docker agent docs so cus­tomer se­cu­rity teams can re­view it be­fore in­stal­la­tion. The agent au­then­ti­cates to the Hub with a JWT, and the in­stall se­cret is shown once and never stored.

Updating Docker Compose Deployments Across Customer Hosts

docker com­pose pull && docker com­pose up -d is a fine com­mand if you are SSH’d into the host. At cus­tomer scale—dozens of self-man­aged en­vi­ron­ments be­hind fire­walls, each with its own change-con­trol process—that man­ual process does­n’t scale. Docker has no built-in mech­a­nism to push a new man­i­fest to a run­ning host from some­where else. Docker Hub web­hooks can trig­ger a CI re­build when an im­age is pushed, but they do not reach into a cus­tomer’s net­work and tell their docker com­pose to pull.

The usual workarounds and what they cost:

Watchtower: Polls the reg­istry on a sched­ule, pulls new im­ages, recre­ates con­tain­ers. Easy to set up, hard to con­trol. No staged roll­out, no roll­back path, lim­ited vis­i­bil­ity from your side—you find out a cus­tomer up­dated when they file a ticket.

Bastion + SSH + Ansible/scripts: Works for ten cus­tomers. Falls apart at fifty, es­pe­cially when three of them are air-gapped and four run their own change-con­trol ca­dence. Every op­er­a­tor has to live with shared keys and a main­te­nance win­dow cal­en­dar.

A pull-based agent. This is the shape Distr lands on. The agent runs on the cus­tomer host, polls a known end­point every 5 sec­onds, and rec­on­ciles the lo­cal Compose state against what the Hub says it should be. The agent re­ports sta­tus back, so you can see in your dash­board which cus­tomers are on which ver­sion. When the agent it­self needs to up­date, it spawns a sep­a­rate con­tainer to per­form the swap so it is not try­ing to re­place it­self while run­ning.

The pat­tern is not unique—Ku­ber­netes op­er­a­tors and GitOps tools do the same thing—but Compose users rou­tinely re-in­vent it badly. If you find your­self build­ing one, at least give it roll­back, sta­tus re­port­ing, and a way to pin ver­sions, or you will end up with a fleet that drifts in ways you can­not see.

The other thing worth not­ing: re­cur­ring sched­uled jobs along­side the ap­pli­ca­tion have no na­tive Compose an­swer ei­ther. If your stack in­cludes any­thing like a nightly cleanup, a pe­ri­odic re­port, or a heart­beat-style task, the in-app sched­uler is one op­tion, but you even­tu­ally run into the cases it can’t cover (cross-service jobs, jobs that should out­live a sin­gle con­tainer). For the three pat­terns I have seen sur­vive cus­tomer de­ploy­ments, check out our guide on Compose cron jobs.

Outgrowing Docker Compose: Kubernetes vs Swarm

If a sin­gle-node Compose de­ploy­ment out­grows it­self, the re­al­is­tic next step for most teams is Kubernetes. The ecosys­tem is large, the op­er­a­tional pat­terns are well doc­u­mented, and the tal­ent pool to hire against ac­tu­ally ex­ists. For the side-by-side, read our Docker Compose vs Kubernetes com­par­i­son.

Docker Swarm is the other op­tion—it reuses the Compose YAML for­mat, ships in the box, and solves a few of the quirks above di­rectly (it restarts un­healthy tasks, rolls out up­dates with up­date_­con­fig, and treats se­crets and con­figs as first-class ob­jects). It is a real fit for some sin­gle-clus­ter, low-cer­e­mony de­ploy­ments.

The Distr agent sup­ports both—the Hub records whether a de­ploy­ment is Compose or Swarm, and the agent runs the match­ing docker com­pose up or docker stack de­ploy. If you do choose Swarm, read our rout­ing and Traefik guide for Docker Swarm and the prod­uct walk­through for dis­trib­ut­ing ap­pli­ca­tions to Swarm for the de­tails.

So, should you run plain Docker Compose in pro­duc­tion?

Yes—plain Docker Compose still runs a lot of real production workloads in 2026, as long as you accept that “plain Compose” is shorthand for “Compose plus the operator practices it doesn't enforce.” None of the quirks above are secret. They are all in Docker's documentation, in GitHub issues that have been open for years, and in the war stories of every team that has run Compose in anger. What makes them dangerous is not the quirks themselves but the order in which you discover them: usually at 2 a.m., one at a time.

TL;DR:

Pass --remove-orphans on every compose up and compose down.

Cap con­tainer logs in dae­mon.json and prune im­ages on a sched­ule. Be care­ful with docker vol­ume prune.

Health checks do not heal. Run an au­to­heal side­car, run on Swarm, or use an agent that bun­dles one.

Pin by @sha256:… di­gest. Treat tags as ref­er­ences, not con­tracts.

The socket is root. Inventory the con­tain­ers that mount it; pre­fer root­less Docker.

Updates need an agent of some kind. Watchtower is fine for one host; not for a fleet.

When Compose stops be­ing enough, Kubernetes is usu­ally the right next step. Swarm is a nar­rower fit and worth pick­ing eyes-open.

If you ship soft­ware to self-man­aged cus­tomers and you would rather not re­build this list your­self, the Distr Docker agent han­dles all of the above on the cus­tomer side. The Docker agent doc­u­men­ta­tion walks through the in­stall, the socket model, the au­to­heal and im­age-cleanup de­faults, and how the agent self-up­dates. The repos­i­tory is on GitHub.

Computer use is 45x More Expensive Than Structured APIs

reflex.dev

We ran a bench­mark com­par­ing two ways of let­ting an AI agent op­er­ate the same ad­min panel, with the goal of putting a price tag on vi­sion agents (browser-use, com­puter-use).

Here is what we mea­sured, what we had to change to make the vi­sion agent work at all, and what changes when gen­er­at­ing an API sur­face stops be­ing a sep­a­rate en­gi­neer­ing pro­ject.

Why vi­sion agents?

Vision agents are the de­fault for let­ting AI agents op­er­ate web apps that don’t ex­pose APIs. The al­ter­na­tive, writ­ing an MCP or REST sur­face per app, is its own en­gi­neer­ing pro­ject across the 20+ in­ter­nal tools most teams have. Most teams de­fault to vi­sion agents not be­cause they are bet­ter, but be­cause the al­ter­na­tive is too ex­pen­sive to build. The cost of the vi­sion ap­proach is treated as a fixed price.

We wanted to mea­sure the price.

The setup

The test app is an ad­min panel for man­ag­ing cus­tomers, or­ders, and re­views, mod­eled on the re­act-ad­min Posters Galore demo. Two agents tar­get the same run­ning app: one dri­ves the UI via screen­shots and clicks, the other calls the ap­p’s HTTP end­points di­rectly. Same Claude Sonnet, same pinned dataset, same task. The in­ter­face is the only vari­able.

The task: find the cus­tomer named Smith” with the most or­ders, lo­cate their most re­cent pend­ing or­der, ac­cept all of their pend­ing re­views, and mark the or­der as de­liv­ered. This touches three re­sources, re­quires fil­ter­ing, pag­i­na­tion, cross-en­tity lookups, and both reads and writes. It is the shape of work a typ­i­cal in­ter­nal tool sees daily.

Path A: Vision agent. Claude Sonnet dri­ving the UI via browser-use 0.12. Vision mode, tak­ing screen­shots and ex­e­cut­ing clicks.

Path B: API agent. Claude Sonnet with tool-use, call­ing the han­dlers the UI calls. Each tool maps to one or more event han­dlers on the ap­p’s State, the same func­tions a but­ton click would trig­ger. The agent gets the struc­tured re­sponse back in­stead of a ren­dered page.
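As a rough illustration of Path B's shape (not the benchmark's actual harness; the tool schema and model id are placeholders, though list_customers matches the tool names described in the notes below), the tool surface looks something like this with the Anthropic SDK:

import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

# Each tool wraps one of the app's event handlers and returns the
# handler's structured response instead of a rendered page.
tools = [
    {
        "name": "list_customers",
        "description": "List customers, optionally filtered by last name. "
                       "Responses include the full pagination info.",
        "input_schema": {
            "type": "object",
            "properties": {
                "last_name": {"type": "string"},
                "page": {"type": "integer"},
            },
        },
    },
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; the post only says Claude Sonnet
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Find the customer named Smith with the most orders."}],
)

# When the model decides to call a tool, the content block carries the tool
# name and structured input; the runner executes the matching handler and
# feeds the structured result back on the next turn.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)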

The vi­sion agent could­n’t com­plete the task

We started by giv­ing both agents the same six-sen­tence task above and see­ing what hap­pened.

The API agent com­pleted it in 8 calls. It listed the cus­tomer’s re­views fil­tered by pend­ing sta­tus, ac­cepted each one, and marked the or­der as de­liv­ered. Both agents are call­ing into the same ap­pli­ca­tion logic; the API agent just reads the struc­tured re­sponse di­rectly in­stead of look­ing at a ren­dered page.

The vi­sion agent, on the same prompt, found one of four pend­ing re­views, ac­cepted it, and moved on. It never pag­i­nated. The re­main­ing three re­views were be­low the vis­i­ble fold of the re­views page and the agent had no sig­nal to scroll for them.

This is not a model prob­lem. The vi­sion agent was rea­son­ing about a ren­dered page and had no sig­nal that the page was­n’t show­ing every­thing. The API agent calls the same han­dler the UI calls, but the re­sponse in­cludes the full re­sult set the han­dler re­turned, not just the rows cur­rently ren­dered. The agent reads page 1 of 4 with 50 re­sults per page” di­rectly in­stead of hav­ing to in­ter­pret pag­i­na­tion con­trols from pix­els.

With a 14-step walk­through, it suc­ceeded

To make the com­par­i­son ap­ples-to-ap­ples, we rewrote the vi­sion prompt as an ex­plicit UI walk­through, nam­ing the side­bar items, tabs, and form fields the agent should in­ter­act with at each step. Fourteen num­bered in­struc­tions cov­er­ing the nav­i­ga­tion the agent had failed to fig­ure out on its own.

With the walk­through, the vi­sion agent com­pleted the task. It also ran for four­teen min­utes and con­sumed about half a mil­lion in­put to­kens.

The walk­through is it­self a find­ing. Each num­bered in­struc­tion is en­gi­neer­ing work that does­n’t show up in to­ken counts but rep­re­sents real cost. Anyone de­ploy­ing a vi­sion agent against an in­ter­nal tool is ei­ther writ­ing prompts at this level of speci­ficity or ac­cept­ing that the agent will silently miss work.

How we ran it

We ran the API path five times and the vision path three times. The vision path was capped at three trials because each run takes 14-22 minutes and consumes 400-750k tokens.

Variance was the most sur­pris­ing part of the vi­sion re­sults. Across three tri­als the wall-clock time spanned 749s to 1257s, and in­put to­kens spanned 407k to 751k. The agent took 43 cy­cles in the short­est run and 68 in the longest. The screen­shot-rea­son-click loop has enough non-de­ter­min­ism that a sin­gle run is not a rep­re­sen­ta­tive cost es­ti­mate.

The API path had no such vari­ance. Sonnet hit iden­ti­cal 8 tool calls on every trial, with in­put to­ken counts vary­ing by ±27 across all five runs. The agent calls the same han­dlers in the same or­der be­cause the struc­tured re­sponses give it no rea­son to de­vi­ate.

The full re­sults

Numbers are mean ± sample standard deviation (n−1), with n=5 for the API path and n=3 for the vision path. Full run details are available in the repo.

Haiku could not complete the vision path. The failure was specific to browser-use 0.12's structured-output schema, which Haiku could not reliably produce in either vision or text-only mode. On the API path, Haiku finished in under 8 seconds for under 10k input tokens, which is the cheapest configuration we tested.

The struc­tural gap

The cost difference follows directly from the architecture. An agent that must see in order to act will always pay for the seeing, regardless of how good the model gets. Better vision models reduce error rates per screenshot, but they do not reduce the number of screenshots required to reach the relevant data. Each render is a screenshot, and each screenshot is thousands of input tokens.

Both agents in this bench­mark walk through the same ap­pli­ca­tion logic. They both fil­ter, pag­i­nate, and up­date the same way the UI does. The dif­fer­ence is what they read at each step. The vi­sion agent reads pix­els and has to ren­der every in­ter­me­di­ate state to in­ter­pret it. The API agent reads the struc­tured re­sponse from the same han­dlers, which al­ready con­tains the data the UI was go­ing to dis­play.

Better mod­els will nar­row the cost per step. They will not nar­row the step count, be­cause the step count is set by the in­ter­face.

How we jus­tify the API en­gi­neer­ing cost

The bench­mark was made cheap to run by Reflex 0.9, which in­cludes a plu­gin that auto-gen­er­ates HTTP end­points from a Reflex ap­pli­ca­tion’s event han­dlers. None of the struc­tural ar­gu­ment de­pends on Reflex specif­i­cally, but it is what made run­ning the API path pos­si­ble with­out writ­ing a sec­ond code­base.

The in­ter­est­ing ques­tion is what be­comes pos­si­ble when the en­gi­neer­ing cost of an API sur­face drops to zero. Vision agents re­main the right tool for ap­pli­ca­tions you do not con­trol: third-party SaaS prod­ucts, legacy sys­tems, any­thing you can­not mod­ify. For in­ter­nal tools you build your­self, the math now points the other way.

Notes

Vision re­sults are spe­cific to browser-use 0.12 in vi­sion mode, and other vi­sion agents may be­have dif­fer­ently. The Path B run­ner shapes the auto-gen­er­ated end­points into a small REST tool sur­face of about thirty lines, which the agent sees as list_­cus­tomers, up­date_or­der, and sim­i­lar. The dataset is pinned and small (900 cus­tomers, 600 or­ders, 324 re­views), so be­hav­ior on pro­duc­tion-scale data is not mea­sured here. The vi­sion agent runs through LangChain’s ChatAnthropic, and the API agent runs through the Anthropic SDK di­rectly. Reported to­ken counts are un­cached in­put to­kens.

Reproduce it

The repo in­cludes seed data gen­er­a­tion, the patched re­act-ad­min demo, both agent scripts, and raw re­sults.
