Two weeks ago I wrote about Anthropic silently registering a Native Messaging bridge in seven Chromium-based browsers on every machine where Claude Desktop was installed [1]. The pattern was: on user launch of product A, write configuration into the user’s installs of products B, C, D, E, F, G and H without asking. Reach across vendor trust boundaries. No consent dialog. No opt-out UI. And if the user removes it manually, it re-installs itself every time Claude Desktop is launched.
This week I discovered the same pattern, executed by Google. Google Chrome is reaching into users’ machines and writing a 4 GB on-device AI model file to disk without asking. The file is named weights.bin. It lives in OptGuideOnDeviceModel. It is the weights for Gemini Nano, Google’s on-device LLM. Chrome did not ask. Chrome does not surface it. If the user deletes it, Chrome re-downloads it.
The legal analysis is the same one I gave for the Anthropic case. The environmental analysis is new. At Chrome’s scale, the climate bill for one model push, paid in atmospheric CO2 by the entire planet, is between six thousand and sixty thousand tonnes of CO2-equivalent emissions, depending on how many devices receive the push. That is the environmental cost of one company unilaterally deciding that two billion people’s default browser will mass-distribute a 4 GB binary they did not request.
This is, in my professional opinion, a direct breach of Article 5(3) of Directive 2002/58/EC (the ePrivacy Directive) [2], a breach of the Article 5(1) GDPR principles of lawfulness, fairness, and transparency [3], a breach of Article 25 GDPR’s data-protection-by-design obligation [3], and an environmental harm of a magnitude that would be a notifiable event under the Corporate Sustainability Reporting Directive (CSRD) for any in-scope undertaking [4].
What is on the disk and how it got there
On any machine that has Chrome installed, in the user profile, sits a directory whose name is OptGuideOnDeviceModel. Inside it is a file called weights.bin. The file is approximately 4 GB. It is the weights file for Gemini Nano. Chrome uses it to power features Google has marketed under names like “Help me write”, on-device scam detection, and other AI-assisted browser functions.
The file appeared with no consent prompt. There is no checkbox in Chrome Settings labelled “download a 4 GB AI model”. The download triggers when Chrome’s AI features are active, and those features are active by default in recent Chrome versions. On any machine that meets the hardware requirements, Chrome treats the user’s hardware as a delivery target and writes the model.
The cycle of deletion and re-download has been documented across multiple independent reports on Windows installations [5][6][7][8] - the user deletes, Chrome re-downloads, the user deletes again, Chrome re-downloads again. The only ways to make the deletion stick are to disable Chrome’s AI features through chrome://flags or enterprise policy tooling that home users do not generally have, or to uninstall Chrome entirely [5]. On macOS the file lands as mode 600 owned by the user (so it is deletable in principle) but Chrome holds the install state in Local State after the bytes are written, and as soon as the variations server next tells Chrome the profile is eligible, the download fires again - the architecture is the same, only the file permissions differ.
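For completeness, here is roughly what the enterprise-policy route alluded to above looks like on Linux. The managed-policy directory is the Google-documented location, but the policy name GenAILocalFoundationalModelSettings (value 1 = do not download the model) is my reading of Chrome’s enterprise policy list at the time of writing - verify it against chrome://policy and the policy reference for your Chrome version before relying on it. The sketch writes to a local directory so it can run without root; on a real system the file belongs in /etc/opt/chrome/policies/managed.

```shell
# Demo: writes the policy JSON into ./policies/managed so it runs without root.
# On a real Linux system Chrome reads /etc/opt/chrome/policies/managed instead.
# Policy name is an assumption from Chrome's enterprise policy list; verify it.
POLICY_DIR="${POLICY_DIR:-./policies/managed}"
mkdir -p "$POLICY_DIR"
cat > "$POLICY_DIR/no-on-device-model.json" <<'EOF'
{
  "GenAILocalFoundationalModelSettings": 1
}
EOF
# Sanity-check that the file is valid JSON before pointing Chrome at it.
python3 -m json.tool "$POLICY_DIR/no-on-device-model.json"
```

After restarting Chrome, the policy should show as applied on chrome://policy; home users without policy tooling are still left with chrome://flags or uninstalling.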
How I verified this on a freshly created Apple Silicon profile
Most of the existing reporting on this behaviour is from Windows users who noticed their disk filling up - useful, but Google could (and probably will) try to characterise those reports as anecdotes from non-representative configurations. So I went looking for a clean witness on a different platform.
The witness I found is macOS itself. The OS keeps a filesystem event log called .fseventsd - the fseventsd daemon records every file create, modify and delete reported by the kernel, independent of any application logging. Chrome cannot edit it, Google cannot remotely reach it, and the page files that record the events survive the deletion of the files they reference.
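For readers who want to reproduce the lookup, here is a minimal sketch of searching the page files for a path string. It assumes root access to the .fseventsd directory and relies on the page files being gzip-compressed with event paths stored as plain bytes:

```python
import gzip
import pathlib

def scan_fsevents_pages(log_dir: str, needle: bytes):
    """Yield (page_file_name, byte_offset) for every occurrence of `needle`
    inside the decompressed fseventsd page files under log_dir."""
    for page in sorted(pathlib.Path(log_dir).iterdir()):
        if not page.is_file():
            continue
        try:
            data = gzip.decompress(page.read_bytes())
        except OSError:
            continue  # not a gzip page file (e.g. the fseventsd-uuid marker)
        start = 0
        while (idx := data.find(needle, start)) != -1:
            yield page.name, idx
            start = idx + 1

# Example (requires root; the data-volume log lives at /System/Volumes/Data/.fseventsd):
# for name, off in scan_fsevents_pages("/System/Volumes/Data/.fseventsd",
#                                      b"OptGuideOnDeviceModel"):
#     print(name, off)
```

The page-file names are monotonically increasing hex counters, so the hits come back in rough chronological order; decoding the full event records (flags, event IDs) takes more work, but a raw byte search is enough to establish which page files mention the path.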
I created a Chrome user-data directory on 23 April 2026 to run an automated audit (one of the WebSentinel 100-site privacy sweeps). The audit driver works entirely through the Chrome DevTools Protocol - it loads a page, dwells for five minutes with no input, captures events, and closes Chrome between sites - and the profile had received zero keyboard or mouse input from a human at any point in its existence. Every “AI mode” surface in Chrome was untouched; in fact every UI surface in Chrome was untouched, because the audit driver only interacts with the document via CDP and the omnibox is never reached. By 29 April the profile contained 4 GB of OptGuideOnDeviceModel weights - I know because a routine du -sh of the audit-profile directory caught it during a cleanup pass.
I went back to .fseventsd to ask exactly when those 4 GB landed. macOS gave me the answer, byte-precise, in three sequential page files:
24 April 2026, 16:38:54 CEST (14:38:54 UTC) - Chrome creates the OptGuideOnDeviceModel directory in the audit profile (page file 0000000003f7f339).
24 April 2026, 16:47:22 CEST (14:47:22 UTC) - three concurrent unpacker subprocesses spawn temporary directories in /private/var/folders/…/com.google.Chrome.chrome_chrome_Unpacker_BeginUnzipping.*/. One of them (5xzqPo) writes weights.bin, manifest.json, _metadata/verified_contents.json and on_device_model_execution_config.pb. The second writes a Certificate Revocation List update. The third writes a browser preload-data update. Chrome batched a security update, a preload refresh and a 4 GB AI model into the same idle window, as if they were equivalent (page file 00000000040c8855).
24 April 2026, 16:53:22 CEST (14:53:22 UTC) - the unpacked weights.bin is moved to its final location at OptGuideOnDeviceModel/2025.8.8.1141/weights.bin along with adapter_cache.bin, encoder_cache.bin, _metadata/verified_contents.json and the execution config. Concurrently four additional model targets (numbered 40, 49, 51 and 59 in Chrome’s optimization-guide enum) register fresh entries in optimization_guide_model_store - these are the smaller text-safety and prompt-routing models that pair with the LLM. None of these targets existed in the profile before this moment (page file 00000000040d0f9c).
Total install time, from directory creation to final move: 14 minutes and 28 seconds. Total human action against the profile during that window: none. The audit driver was either dwelling on a third-party home page or transitioning between sites - the unpacker fired in the background while a tab waited for a five-minute timer to expire.
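The 14-minute window falls straight out of the two kernel timestamps:

```python
from datetime import datetime

# Timestamps from the fseventsd page files, expressed in UTC.
created = datetime.fromisoformat("2026-04-24T14:38:54")  # OptGuideOnDeviceModel dir created
moved = datetime.fromisoformat("2026-04-24T14:53:22")    # weights.bin moved into place

elapsed = moved - created
minutes, seconds = divmod(int(elapsed.total_seconds()), 60)
print(f"{minutes}m {seconds}s")  # -> 14m 28s
```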
The naming inside that fseventsd record is, if anything, the most damning detail. The temp directory is com.google.Chrome.chrome_chrome_Unpacker_BeginUnzipping.5xzqPo - that prefix com.google.Chrome.chrome_chrome_* is the bundle ID and subprocess naming convention Google Chrome itself uses. It is not com.google.GoogleUpdater.* and it is not com.google.GoogleSoftwareUpdate.*. The writer is Chrome - the browser process the user has installed and trusts to load web pages - reaching into the user’s filesystem on its own initiative and laying down a 4 GB ML binary while the foreground tab does something completely unrelated.
Three further pieces of corroborating evidence sit elsewhere on the same machine:
Chrome’s own Local State JSON for the audit profile contains an optimization_guide.on_device block with model_validation_result: { attempt_count: 1, result: 2, component_version: "2025.8.8.1141" }. Chrome ran the model. The component_version matches the version string the fseventsd events recorded as the path component. Two independent witnesses, same artefact. The same block reports performance_class: 6, vram_mb: "36864" - Chrome characterised my hardware (read the GPU, read the unified memory total) to decide whether I was eligible for the model push, before any user-facing AI feature surfaced.
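Checking your own Local State for the same block takes a few lines. The path below is the macOS default (on Windows it sits under %LOCALAPPDATA%\Google\Chrome\User Data); the optimization_guide.on_device key structure is what I observed in the audit profile, not a documented API, so treat it as an assumption that may shift between Chrome versions:

```python
import json
import pathlib

# macOS default location; adjust for your OS / user-data directory.
LOCAL_STATE = pathlib.Path.home() / "Library/Application Support/Google/Chrome/Local State"

def on_device_model_state(path=LOCAL_STATE):
    """Return Chrome's optimization_guide.on_device block, or None if absent.
    Key names are those observed in the audit profile, not a stable API."""
    state = json.loads(pathlib.Path(path).read_text(encoding="utf-8"))
    return state.get("optimization_guide", {}).get("on_device")

# block = on_device_model_state()
# if block:
#     print(block.get("model_validation_result"))
#     print(block.get("performance_class"), block.get("vram_mb"))
```

If the block is present, the component_version inside model_validation_result should match the version directory name under OptGuideOnDeviceModel - the same cross-check used above.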
Chrome’s ChromeFeatureState for the audit profile lists OnDeviceModelBackgroundDownload<OnDeviceModelBackgroundDownload and ShowOnDeviceAiSettings<OnDeviceModelBackgroundDownload in the enable-features block. The first flag is what triggers the silent download. The second flag is what reveals the on-device AI section in chrome://settings. Both are gated by the same rollout flag - which means that by Chrome’s own architecture, the install begins before the user has any settings UI in which to refuse it. The settings page that would let you discover the feature exists is enabled in lockstep with the install - it is design, not oversight.
The GoogleUpdater logs record the on-device-model control component (appid {44fc7fe2-65ce-487c-93f4-edee46eeaaab}) being downloaded from http://edgedl.me.gvt1.com/edgedl/diffgen-puffin/%7B44fc7fe2-65ce-487c-93f4-edee46eeaaab%7D/… - a 7 MB compressed control file that arrived on 20 April 2026, three days before the audit profile in question was created. That is the upstream control plane: it is profile-independent, it is launched automatically by a LaunchAgent that fires every hour, and the URL is plain HTTP (the integrity is verified by the CRX-3 signature inside the package, not by transport security). The control component gives Chrome the manifest pointing at the actual weights, and Chrome’s in-process OnDeviceModelComponentInstaller - a separate code path from GoogleUpdater - then fetches the multi-GB weights direct from Google’s CDN.
So we now have a four-way evidence chain - macOS kernel filesystem events, Chrome’s own per-profile state, Chrome’s runtime feature flags, and Google’s component-updater logs - all four agreeing on the same conduct, and the conduct is: a 4 GB AI model arrived on this user’s disk without consent, without notice, on a profile that received zero human input, in a window of 14 minutes and 28 seconds, on a Tuesday afternoon.
Reports of the OptGuideOnDeviceModel directory and the weights.bin file have been circulating in community forums for over a year - what is new in 2026 is the scale and the verifiability. Chrome’s market share has held above 64% globally [9][10], Chrome’s user base is between 3.45 billion and 3.83 billion individuals worldwide depending on which 2026 estimate you trust [9][11], and Google has been rolling Gemini features into Chrome with increasing aggression. The behaviour is no longer affecting a minority of power users on a minority of platforms - it is affecting hundreds of millions of devices, on every desktop OS Chrome ships on.
The Anthropic comparison, point for point
The same dark-pattern playbook. I am repeating my categorisation from the Claude Desktop article [1] because the patterns are identical and that is the point.
1. Forced bundling across trust boundaries. Anthropic installed Claude Desktop, then wrote into Brave, Edge, Arc, Vivaldi, Opera, and Chromium. Google installs Chrome, then writes a 4 GB AI model under the user’s profile directory without authorisation. The binary is not Chrome. It is a separately-trained machine-learning model, with a separate purpose, a separate data-protection profile, and a separate consent footprint.
2. Invisible default, no opt-in. No dialogue at first launch. No checkbox in Settings. The model is downloaded; the user finds out about it months later when their disk fills up [5][6][7].
3. More difficult to remove than install. Adding the file took zero clicks. Removing it requires (a) discovering the file exists, (b) understanding what it is, (c) navigating into a hidden user profile path, (d) deleting it (and on Windows, also clearing the read-only attribute first), and (e) accepting that Chrome will silently re-download it on next eligible window unless the user also navigates chrome://flags, enterprise policy, or platform-specific configuration tooling to disable the underlying Chrome AI feature [5]. None of those steps is documented in the place a normal user looks - none of them is even hinted at in default Chrome.
4. Pre-staging of capability the user has not requested. The Nano model exists on the user’s disk so that Chrome features that use it can run instantly when the user invokes them. The user has not invoked any of those features. The model still sits there, taking 4 GB.
5. Scope inflation through generic naming. OptGuideOnDeviceModel is internal Chrome jargon for “OptimizationGuide on-device model storage”. A user looking at their disk usage, even one who knows roughly what they are looking at, would not match OptGuideOnDeviceModel/weights.bin to “Gemini Nano LLM weights”. Accurate naming would be GeminiNanoLLM/weights.bin. Google chose to obfuscate the name.
6. Registration into resources the user has not configured. A user who has not opened Chrome’s AI features still gets the model. A user who has opened them once and decided they were not interested still gets the model. The file’s presence is decoupled from the user’s actual use of any feature it powers.
7. Documentation gap. Google’s user-facing documentation about Chrome’s AI features does not, with the prominence proportionate to a 4 GB silent download, tell the user that the cost of the feature being available is a 4 GB file appearing on their device. The behaviour is documented in places a curious admin will find. It is not documented in the place a regular user looks before installing Chrome or before Chrome decides to begin pushing the model.
8. Automatic re-install on every run. Same as Claude Desktop. Delete the file, Chrome re-creates it. The user’s deletion is treated as a transient state to be corrected, not as a directive to be respected.
9. Retroactive survival of any future user consent. If Google in future starts asking users “would you like Chrome to download a 4 GB AI model”, that prompt does not retroactively legitimise the silent installs that have already happened on hundreds of millions of devices. The damage to the trust relationship is done. The bytes have moved. The atmosphere has been written to.
10. Code-signed, shipped through the normal release channel. This is not test build behaviour. It is Chrome stable.
The “AI Mode” pill is the cherry on top
Here is the part that should make every privacy lawyer in the audience put their coffee down. When Chrome 147 launches against an eligible profile, the omnibox - the address bar at the top of the window, the most visible piece of real estate in the entire browser - renders an “AI Mode” pill to the right of the URL field. A reasonable user, seeing “AI Mode” in the browser’s most prominent UI element in 2026 - with on-device LLMs in Chrome well publicised and a 4 GB Gemini Nano binary already silently installed on their disk - will draw what feels like an obvious inference: that the visible AI Mode is using the on-device model, that their queries stay on the device, that the local model is what powers the local-looking surface.
Every part of that inference is wrong. The AI Mode pill in the Chrome 147 omnibox is a cloud-backed Search Generative Experience surface: every query the user types into it is sent over the network to Google’s servers for processing by Google’s hosted models. The on-device Nano model is not invoked by the AI Mode UI flow at all. They are entirely separate code paths. The most visible AI affordance in the browser does not use the local model the user has been silently given, and the features that do use the local model (Help-Me-Write in <textarea>, tab-group AI suggestions, smart paste, page summary) are buried in textarea context menus and tab-group right-click menus that the average user will discover, on average, never.
Think about what that arrangement actually is. The user pays the storage cost of the silent install (4 GB on disk, plus the bandwidth of the silent download). The user’s most visible AI experience - the pill they actually see and click - delivers no on-device benefit at all, because it routes to Google’s servers regardless. The on-device model is therefore a sunk cost imposed on the user, with no offsetting transparency benefit at the surface where transparency would matter most. To put it another way: if the on-device install had given the user a clear “your AI Mode queries stay on your device” property, the install would have a defensible privacy framing (worse storage, better data flow). It does not. The install gives Google a future-options resource (the model can be invoked by other Chrome subsystems without further server round-trips) at the user’s disk-and-bandwidth expense, while the headline AI surface continues to send the user’s queries to Google as before. The local model is a Google-side asset positioned on the user’s device, not a user-side asset - and one could argue the arrangement is sleight-of-hand designed to hide that the visible AI Mode does not use the local model at all.
That arrangement, on its own, engages at least three of the deceptive design pattern families catalogued in EDPB Guidelines 03/2022 [20]. It is misleading information because the visible label “AI Mode” creates a false impression about where processing occurs - the label does not say “cloud-backed” or “queries sent to Google”, and a reasonable user with knowledge of on-device AI will infer locality from the proximity of an on-device 4 GB model on their disk. It is skipping because the user is not given a moment to choose between local-only and cloud-backed AI surfaces - both are switched on by the same upstream rollout, with no per-feature consent. And it is hindering because turning AI Mode off does not also remove the on-device install, and removing the on-device install does not turn AI Mode off - the two are separately controlled, and discovering both controls requires knowing about both chrome://flags and chrome://settings/ai, neither of which is obvious in default Chrome.
So: not just a non-consented install, but a non-consented install that doubles as cover for a parallel cloud-backed surface that misrepresents to the user where their typing is being processed. Both layers compound the consent problem.
Why this is unlawful in the EEA and the UK
Article 5(3) of Directive 2002/58/EC (the ePrivacy Directive) prohibits the storing of information, or the gaining of access to information already stored, in the terminal equipment of a subscriber or user, without the user’s prior, freely-given, specific, informed, and unambiguous consent, except where strictly necessary for the provision of an information-society service explicitly requested by the user [2]. The 4 GB Gemini Nano weights file is information stored in the user’s terminal equipment. The user did not consent. The user has not requested any service that strictly requires a 4 GB on-device LLM. Chrome is functional without the file. The Article 5(3) breach is direct.
Article 5(1) GDPR requires processing of personal data to be lawful, fair, and transparent to the data subject [3]. Where the user’s hardware is profiled to determine eligibility for the model push, where the install events are logged on Google’s servers, and where the on-device features the model powers process user prompts (whether or not those prompts leave the device), the lawfulness, fairness, and transparency of all of that processing depend on the user being told, in plain language, what is happening. They are not.
Article 25 GDPR requires the controller to implement appropriate technical and organisational measures to ensure that, by default, only personal data that are necessary for each specific purpose are processed [3]. Pre-staging a 4 GB AI model on a user’s disk, against the contingency that the user might one day invoke an AI feature, is the architectural opposite of by-default minimisation. And the device profiling used to decide whether to push the model is no different in kind from the profiling used to track users online: the resulting hardware profile contains personal data, and the model, once invoked, will process personal data. The GDPR arguments are therefore squarely in scope.
Under the UK GDPR and the Privacy and Electronic Communications Regulations 2003, the analysis is the same. Under the California Consumer Privacy Act, the absence of a notice-at-collection covering this specific category of pre-staged software puts Google’s CCPA notice posture in question [12].
Then there is the potential exposure under national computer-misuse statutes - re-writing a file to a user’s machine after the user has explicitly deleted it is close to the paradigm of unauthorised modification those statutes contemplate - and that exposure should not be understated.
ESG: the climate cost of the silent push
The Anthropic case I wrote about was a desktop application installing a 350-byte JSON manifest in seven directories. The bandwidth and energy cost of that, summed across all Claude Desktop users, was negligible. The Chrome case is different. Chrome is pushing a 4 GB binary across hundreds of millions of devices. That has a measurable, quantifiable, and frankly alarming environmental footprint.
I am calculating this using the same methodology our WebSentinel audit platform applies to website environmental analysis [13]:
Energy intensity of network data transfer: 0.06 kWh per GB, the mid-band of Pärssinen et al. (2018) “Environmental impact assessment of online advertising”, Science of The Total Environment [14]. The paper reports a 0.04 – 0.10 kWh/GB range depending on the share of fixed-line vs mobile transfer and inclusion of end-user device energy. 0.06 is a defensible mid-point.
Grid emissions factor: 0.25 kg CO2e per kWh, the EEA / IEA composite EU-27 electricity-supply factor for 2024 reporting [15]. Globally the figure varies from ~0.10 kg/kWh on mostly-renewable grids to over 0.70 kg/kWh on coal-heavy grids; 0.25 is mid-band for a global push and is the figure WebSentinel uses by default.
Per-device cost of one Nano push
Bandwidth: 4 GB
Energy: 4 × 0.06 = 0.24 kWh per device per push
CO2: 0.24 × 0.25 = 0.06 kg CO2e per device per push
That is per device, per push. A single download of the model. It does not include re-downloads triggered by the user trying and failing to delete the file. It does not include subsequent updates to the model. It does not include the on-device inference energy when the model is actually used. It is just the one-time delivery cost to one device.
Aggregated cost across the deployment
Google does not publish how many devices receive the Nano push. The eligibility criteria gating the push (a hardware “performance class” that Chrome computes from CPU class, GPU class, system RAM and available VRAM - typically ~16 GB unified memory or better on Apple Silicon, ~16 GB RAM and a discrete or integrated GPU with sufficient VRAM on Windows and Linux) carve out the very low end of the consumer install base, but the qualifying population is still enormous. I will use three illustrative deployment bands so the reader can pick whichever they consider closest to reality. None of these bands is implausibly large for a feature that ships in default-on Chrome.
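The band arithmetic is reproducible in a few lines. The device counts are my illustrative assumption, not Google figures - working backwards from the aggregate energy figures, the three bands correspond to 100 million, 500 million and 1 billion receiving devices:

```python
GB_PER_PUSH = 4            # size of the weights payload
KWH_PER_GB = 0.06          # mid-band network transfer intensity (Pärssinen et al.)
KG_CO2E_PER_KWH = 0.25     # composite grid emissions factor

# Illustrative deployment bands - my assumption, not Google figures.
bands = {"low": 100e6, "mid": 500e6, "high": 1e9}

for name, devices in bands.items():
    kwh = devices * GB_PER_PUSH * KWH_PER_GB    # one-time delivery energy, kWh
    tonnes = kwh * KG_CO2E_PER_KWH / 1000       # kg CO2e -> tonnes
    print(f"{name}: {kwh / 1e6:.0f} GWh, {tonnes:,.0f} t CO2e")
```

Running it prints 24 GWh / 6,000 t for the low band, 120 GWh / 30,000 t for the mid band and 240 GWh / 60,000 t for the high band - the figures used below.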
To put those numbers in the kind of terms an ESG report would use:
24 GWh (low band) is roughly the annual electricity consumption of about 7,000 average UK households [16].
120 GWh (mid band) is roughly the annual electricity consumption of about 36,000 average UK households, or the annual output of about 35 MW of installed wind capacity at a typical UK capacity factor of around 40%.
240 GWh (high band) is roughly the annual electricity consumption of about 72,000 average UK households, or the annual output of about 70 MW of installed wind capacity at the same capacity factor.
6,000 tonnes CO2e (low band) is roughly the annual emissions of 1,300 average passenger cars in the EU [17].
30,000 tonnes CO2e (mid band) is roughly the annual emissions of 6,500 cars, or one return flight from London to Sydney for about 8,000 passengers in economy.
60,000 tonnes CO2e (high band) is roughly the annual emissions of 13,000 cars.
These are the delivery-only numbers. They count the bytes traversing the network exactly once. They do not count:
The roughly 4 GB × N devices of disk-storage cost, sustained, on user hardware. SSDs have a per-GB embodied carbon cost of approximately 0.16 kg CO2e per GB of NAND manufactured [18]; for 1 billion devices × 4 GB that is around 640,000 tonnes CO2e of embodied SSD allocated to a use case the user did not consent to. This is a one-off manufacturing-carbon impact, but the storage burden is borne in perpetuity by user devices that could otherwise have used the space for user data.
The on-device inference energy when Nano is invoked. Per inference this is small. At 2 billion daily Chrome users it is no longer small.
The re-download cycle for users who try to delete the file. Each successful re-trigger of the download is another 4 GB × 0.06 kWh/GB × 0.25 kg/kWh = 0.06 kg CO2e per device per re-download.
The future model updates. Gemini Nano is not a one-shot artefact; it is an evolving model with periodic weight refreshes. Each refresh repeats the calculation.
In ESG-reporting language, the one-time push of the current model is a Scope 3 Category 11 (“use of sold products”) emission against Google, attributable to the user-side delivery of a binary the user did not request, in the operation of a free product Google distributes [4].
Why the bandwidth side matters in its own right
In addition to the carbon cost, the network-bandwidth cost is paid by ISPs, by mobile network operators, by users on metered connections, and by every piece of network infrastructure that has to carry an unwanted 4 GB payload to a destination that did not ask for it. Per the Pärssinen reference, around 50% of that delivery energy is in the access network and CDN edge, around 30% is in user-side equipment (router, modem, NIC), and the remainder is in the core. None of that infrastructure exists for free. Every byte Chrome pushes is a byte that competes with bytes the user actually wanted.
For users on capped mobile data plans, particularly in regions where smartphone-as-only-internet is dominant (much of Africa, much of South and Southeast Asia, most of Latin America), 4 GB of unrequested download is on the order of a month’s data allowance, vapourised by Chrome on the user’s behalf. Google has not, to my knowledge, published any analysis of the welfare impact of this on the populations whose internet access is metered.
Keep in mind that 4G and 5G data plans serve many households with no access to fibre, cable or ADSL, and serve desktop devices as well as phones - so the argument that Google would not push this over mobile connections (an argument I have found nothing official to support in any case) will not fly.
What Google should have done
This is not a hard list. It is the same list I gave Anthropic in the Claude Desktop article, applied to Google.
Ask. First time Chrome is about to download the Nano model, pop a dialogue. “Chrome would like to download a 4 GB AI model file to your device to power the following features. Allow, or skip and decide later.” Two buttons. Done.
Pull, not push. Trigger the download as a downstream consequence of the user invoking an AI feature for the first time. Let the feature itself be the consent event. Do not pre-stage on a contingency.
Surface it. In chrome://settings/, list the AI model files Chrome has downloaded, their size, the features they power, and a “Remove and stop downloading” button per model. Make removal persistent, not a transient state Chrome corrects on next launch.
Document it. Tell the user, plainly, in the Chrome description on the Microsoft Store, in the Chrome installer, on the Google Chrome download page, that Chrome will download additional model files of substantial size on supported hardware. Currently, this is essentially undocumented to a normal user.
Respect deletion. If the user deletes weights.bin, do not re-create it. If the user has a strong preference about what is on their disk, the application is not in a position to override that preference because the application thinks it knows better.
Respect deletion. If the user deletes weights.bin, do not re-create it. If the user has a strong preference about what is on their disk, the application is not in a position to override that preference because the application thinks it knows better.
Disclose at scale. Publish, in Google’s annual ESG report, the aggregate bandwidth and carbon footprint of all AI-feature model pushes to user devices, broken down by region. Treat it as the Scope 3 Category 11 emission it is. Account for it.
Disclose at scale. Publish, in Google’s annual ESG report, the aggregate bandwidth and carbon footprint of all AI-feature model pushes to user devices, broken down by region. Treat it as the Scope 3 Category 11 emission it is. Account for it.
Last week, a tweet went viral showing a guy claiming that a Cursor/Claude agent deleted his company’s production database. We watched from the sidelines as he tried to get a confession from the agent: “Why did you delete it when you were told never to perform this action?” Then he tried to parse the answer to either learn from his mistake or warn us about the dangers of AI agents.
I have a question too: why do you have an API endpoint that deletes your entire production database? His post rambled on about false marketing in AI, bad customer support, and so on. What was missing was accountability.
I’m not one to blindly defend AI; I always err on the side of caution. But I also know you can’t blame a tool for your own mistakes.
In 2010, I worked with a company that had a very manual deployment process. We used SVN for version control. To deploy, we had to copy trunk, the equivalent of the master branch, into a release folder labeled with a release date. Then we made a second copy of that release and called it “current.” That way, pulling the current folder always gave you the latest release.
One day, while deploying, I accidentally copied trunk twice. To fix it via the CLI, I edited my previous command to delete the duplicate. Then I continued the deployment without any issues… or so I thought. Turns out, I hadn’t deleted the duplicate copy at all. I had edited the wrong command and deleted trunk instead. Later that day, another developer was confused when he couldn’t find it.
All hell broke loose. Managers scrambled, meetings were called. By the time the news reached my team, the lead developer had already run a command to revert the deletion. He checked the logs, saw that I was responsible, and my next task was to write a script to automate our deployment process so this kind of mistake couldn’t happen again. Before the day was over, we had a more robust system in place. One that eventually grew into a full CI/CD pipeline.
Automation helps eliminate the silly mistakes that come with manual, repetitive work. We could have easily gone around asking “Why didn’t SVN prevent us from deleting trunk?” But the real problem was our manual process. Unlike machines, we can’t repeat a task exactly the same way every single day. We are bound to slip up eventually.
With AI generating large swaths of code, we get the illusion of that same security. But automation means doing the same thing the same way every time. AI is more like me copying and pasting branches: it’s bound to make mistakes, and it’s not equipped to explain why it did what it did. The terms we use, like “thinking” and “reasoning,” may look like reflection from an intelligent agent. But these are marketing terms slapped on top of AI. In reality, the models are still just generating tokens.
Now, back to the main problem this guy faced. Why does a public-facing API that can delete all your production databases even exist? If the AI hadn’t called that endpoint, someone else eventually would have. It’s like putting a self-destruct button on your car’s dashboard. You have every reason not to press it, because you like your car and it takes you from point A to point B. But a motivated toddler who wiggles out of his car seat will hit that big red button the moment he sees it. You can’t then interrogate the child about his reasoning. Mine would have answered simply: “I did it because I pressed it.”
I suspect a large part of this company’s application was vibe-coded. The software architects used AI to spec the product from AI-generated descriptions provided by the product team. The developers used AI to write the code. The reviewers used AI to approve it. Now, when a bug appears, the only option is to interrogate yet another AI for answers, probably not even running on the same GPU that generated the original code. You can’t blame the GPU!
The simple solution is to know what you’re deploying to production. The more realistic one: if you’re going to use AI extensively, build a process where competent developers use it as a tool to augment their work, not as a way to avoid accountability. And please, don’t let your CEO or CTO write the code.
Train Your Own LLM From Scratch
A hands-on workshop where you write every piece of a GPT training pipeline yourself, understanding what each component does and why.
Andrej Karpathy’s nanoGPT was my first real exposure to LLMs and transformers. Seeing how a working language model could be built in a few hundred lines of PyTorch completely changed how I thought about AI and inspired me to go deeper into the space.
This workshop is my attempt to give others that same experience. nanoGPT targets reproducing GPT-2 (124M params) and covers a lot of ground. This project strips it down to the essentials and scales it down to a ~10M param model that trains on a laptop in under an hour — designed to be completed in a single workshop session.
What You’ll Build
A working GPT model trained from scratch on your MacBook, capable of generating Shakespeare-like text. You’ll write:
Tokenizer — turning text into numbers the model can process
Model architecture — the transformer: embeddings, attention, feed-forward layers
Training loop — forward pass, loss, backprop, optimizer, learning rate scheduling
Text generation — sampling from your trained model
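The first of those components, the character-level tokenizer, fits in a few lines of plain Python. This is an illustrative sketch (the function and variable names are mine, not necessarily the workshop’s exact API):

```python
# Minimal character-level tokenizer: every distinct character gets an ID.
text = "hello world"  # stand-in for the full Shakespeare corpus
chars = sorted(set(text))                     # vocabulary, sorted for determinism
stoi = {ch: i for i, ch in enumerate(chars)}  # string -> int
itos = {i: ch for ch, i in stoi.items()}      # int -> string

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

assert decode(encode("hello")) == "hello"  # round-trip is lossless
```

On the full Shakespeare file, `len(chars)` is where the workshop’s vocab_size=65 comes from: 65 distinct characters in the corpus.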
Prerequisites
Any laptop or desktop (Mac, Linux, or Windows)
Python 3.12+
Comfort reading Python code (you don’t need ML experience)
Training uses Apple Silicon GPU (MPS), NVIDIA GPU (CUDA), or CPU automatically. Also works on Google Colab — upload the files and run with !python train.py.
Getting Started
Local (recommended)
Install uv if you don’t have it:
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
Then set up the project:
uv sync
mkdir scratchpad && cd scratchpad
Google Colab
If you don’t have a local setup, upload the repo to Colab and install dependencies:
!pip install torch numpy tqdm tiktoken
Upload data/shakespeare.txt to your Colab files, then write your code in notebook cells or upload .py files and run them with !python train.py.
Work through the docs in order. Each part walks you through writing a piece of the pipeline, explaining what each component does and why. By the end, you’ll have a working model.py, train.py, and generate.py that you wrote yourself.
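The sampling step at the heart of generation boils down to “softmax over logits, then draw a token.” Here is a dependency-free sketch of that one idea (the function name is illustrative; the workshop’s generate.py uses PyTorch tensors instead of lists):

```python
import math
import random

def sample_next(logits: list[float], temperature: float = 1.0) -> int:
    """Pick a token id from raw logits: softmax with temperature, then sample."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                            # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

# A dominant logit is chosen essentially always:
assert sample_next([0.0, 1000.0, 0.0]) == 1
```

Lower temperatures sharpen the distribution toward the argmax; higher ones flatten it toward uniform, which is why generated text gets more conservative or more chaotic as you turn the knob.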
Architecture: GPT at a Glance
Input Text
│
▼
┌─────────────────┐
│ Tokenizer │ “hello” → [20, 43, 50, 50, 53] (character-level)
└────────┬────────┘
▼
┌─────────────────┐
│ Token Embed + │ token IDs → vectors (n_embd dimensions)
│ Position Embed │ + positional information
└────────┬────────┘
▼
┌─────────────────┐
│ Transformer │ × n_layer
│ Block: │
│ ┌────────────┐ │
│ │ LayerNorm │ │
│ │ Self-Attn │ │ n_head parallel attention heads
│ │ + Residual │ │
│ ├────────────┤ │
│ │ LayerNorm │ │
│ │ MLP (FFN) │ │ expand 4x, GELU, project back
│ │ + Residual │ │
│ └────────────┘ │
└────────┬────────┘
▼
┌─────────────────┐
│ LayerNorm │
│ Linear → logits│ vocab_size outputs (probability over next token)
└─────────────────┘
Model Configs for This Workshop
All configs use character-level tokenization (vocab_size=65) and block_size=256.
Tokenization: Characters vs BPE
This workshop uses character-level tokenization on Shakespeare. BPE tokenization (GPT-2's 50k vocab) doesn’t work on small datasets — most token bigrams are too rare for the model to learn patterns from.
Part 5 covers switching to BPE for larger datasets.
Key References
nanoGPT — The project this workshop is based on. Minimal GPT training in ~300 lines of PyTorch
build-nanogpt video lecture — 4-hour video building GPT-2 from an empty file
Karpathy’s microgpt — A full GPT in 200 lines of pure Python, no dependencies
nanochat — Full ChatGPT clone training pipeline
Attention Is All You Need (2017) — The original transformer paper
GPT-2 paper (2019) — Language models as unsupervised learners
TinyStories paper — Why small models trained on curated data punch above their weight
I’ve previously explained async bloat and some work-arounds for it, but would much prefer to solve the issue at the root, in the compiler. I’ve submitted a Project Goal, and am looking for help to fund the effort.
I love me some async Rust! It’s amazing how we can write executor agnostic code that can run concurrently on huge servers and tiny microcontrollers.
But especially on those tiny microcontrollers we notice that async Rust is far from the zero cost abstractions we were promised. That’s because every byte of binary size counts and async introduces a lot of bloat. This bloat exists on desktops and servers as well, but it’s much less noticeable when you have substantially more memory and compute available.
I’ve previously explained some work-arounds for this issue, but would much prefer to get to the root of the problem, and work on improving async bloat in the compiler. As such I have submitted a Project Goal.
This is part 2 of my blog series on this topic. See part 1 for the initial exploration of the topic and what you can do when writing async code to avoid some of the bloat. In this second part we’ll dive into the internals and translate the methods of blog 1 into optimizations for the compiler.
What I won’t be talking about is the often-discussed problem of futures becoming bigger than necessary and doing a lot of copying. People are aware of that already. In fact, there is an open PR that tackles part of it: https://github.com/rust-lang/rust/pull/135527
Anatomy of a generated future
We’re going to be looking at this code:
fn foo() -> impl Future<Output = i32> {
async { 5 }
}
fn bar() -> impl Future<Output = i32> {
async {
foo().await + foo().await
}
}
godbolt
We’re using the desugared syntax for futures because it’s easier to see what’s happening.
So what does the bar future look like?
There are two await points, so the state machine must have at least two states, right?
Well, yes. But there’s more.
Luckily we can ask the compiler to dump MIR for us at various passes. An interesting pass is the coroutine_resume pass. This is the last async-specific MIR pass. Why is this important? Well, async is a language feature that still exists in MIR, but not in LLVM IR. So the transformation of async to state machine happens as a MIR pass.
The bar function generates 360 lines of MIR. Pretty crazy, right? Although this gets optimized somewhat later on, the non-async version uses only 23 lines for this.
The compiler also outputs the CoroutineLayout. It’s basically an enum with these states (comments my own):
variant_fields: {
Unresumed(0): [], // Starting state
Returned (1): [],
Panicked (2): [],
Suspend0 (3): [_s1], // At await point 1, _s1 = the foo future
Suspend1 (4): [_s0, _s2], // At await point 2, _s0 = result of _s1, _s2 = the second foo future
},
So what are Returned and Panicked?
Well, Future::poll is a safe function. Calling it must not induce any UB, even when the future is done. So after Suspend1 the future returns Ready and transitions to the Returned state; polling it again in that state panics.
The Panicked state exists so that when an async fn panics and the panic is caught via the catch-unwind mechanism, the future can’t be polled anymore. Polling a future in the Panicked state will panic. Without this mechanism, we could poll the future again after a panic, but the future may be in an incomplete state, and that could cause UB. This mechanism is very similar to mutex poisoning.
(I’m 90% sure I’m correct about the Panicked state, but I can’t really find any docs that actually describe this.)
Cool, this seems reasonable.
Why panic?
But is it reasonable? Futures in the Returned state will panic. But they don’t have to. The only thing we can’t do is cause UB to happen.
Panics are relatively expensive. They introduce a path with a side-effect that’s not easily optimized out. What if instead, we just return Pending again? Nothing unsafe going on, so we fulfill the contract of the Future type.
I’ve hacked this in the compiler to try it out and saw a 2%-5% reduction in binary size for async embedded firmware.
So I propose this should be a switch, just like overflow-checks = false is for integer overflow. In debug builds it would still panic so that wrong behavior is immediately visible, but in release builds we get smaller futures.
Similarly, when panic=abort is used, we might be able to get rid of the Panicked state altogether. I want to look into the repercussions of that.
Always a state machine
We’ve looked at bar, but not yet at foo.
fn foo() -> impl Future<Output = i32> {
async { 5 }
}
Let’s implement it manually, to see what the optimal solution would be.
struct FooFut;
impl Future for FooFut {
type Output = i32;
fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Self::Output> {
Poll::Ready(5)
}
}
Easy right? We don’t need any state. We just return the number.
Let’s see what the generated MIR is for the version the compiler gives us:
// MIR for `foo::{closure#0}` 0 coroutine_resume
/* coroutine_layout = CoroutineLayout {
field_tys: {},
variant_fields: {
Unresumed(0): [],
Returned (1): [],
Panicked (2): [],
},
storage_conflicts: BitMatrix(0x0) {},
} */
fn foo::{closure#0}(_1: Pin<&mut {async block@src\main.rs:5:5: 5:10}>, _2: &mut Context<'_>) -> Poll<i32> {
debug _task_context => _2;
let mut _0: core::task::Poll<i32>;
let mut _3: i32;
let mut _4: u32;
let mut _5: &mut {async block@src\main.rs:5:5: 5:10};
bb0: {
_5 = copy (_1.0: &mut {async block@src\main.rs:5:5: 5:10});
_4 = discriminant((*_5));
switchInt(move _4) -> [0: bb1, 1: bb4, otherwise: bb5];
}
bb1: {
_3 = const 5_i32;
goto -> bb3;
}
bb2: {
_0 = Poll::<i32>::Ready(move _3);
discriminant((*_5)) = 1;
return;
}
bb3: {
goto -> bb2;
}
bb4: {
assert(const false, "`async fn` resumed after completion") -> [success: bb4, unwind unreachable];
}
bb5: {
unreachable;
}
}
Yikes! That’s a lot of code!
Notice in the coroutine_layout comment that we still have the 3 default states, and in bb0 that we’re still switching on the discriminant. There’s a big optimization opportunity here that we’re not using: have no states at all and always return Poll::Ready(5) on every poll.
May 05, 2026
By using Multi-Token Prediction (MTP) drafters, Gemma 4 models reduce latency bottlenecks and achieve improved responsiveness for developers.
Olivier Lacombe
Director, Product Management
Maarten Grootendorst
Developer Relations Engineer
Just a few weeks ago, we introduced Gemma 4, our most capable open models to date. With over 60 million downloads in just the first few weeks, Gemma 4 is delivering unprecedented intelligence-per-parameter to developer workstations, mobile devices and the cloud. Today, we are pushing efficiency even further.
We’re releasing Multi-Token Prediction (MTP) drafters for the Gemma 4 family. By using a specialized speculative decoding architecture, these drafters deliver up to a 3x speedup without any degradation in output quality or reasoning logic.
Tokens-per-second speed increases, tested on hardware using LiteRT-LM, MLX, Hugging Face Transformers, and vLLM.
Why speculative decoding?
The technical reality is that standard LLM inference is memory-bandwidth bound, creating a significant latency bottleneck. The processor spends the majority of its time moving billions of parameters from VRAM to the compute units just to generate a single token. This leads to under-utilized compute and high latency, especially on consumer-grade hardware.
Speculative decoding decouples token generation from verification. By pairing a heavy target model (e.g., Gemma 4 31B) with a lightweight drafter (the MTP model), we can utilize idle compute to “predict” several future tokens at once with the drafter in less time than it takes for the target model to process just one token. The target model then verifies all of these suggested tokens in parallel.
How speculative decoding works
Standard large language models generate text autoregressively, producing exactly one token at a time. While effective, this process dedicates the same amount of computation to predicting an obvious continuation (like predicting “words” after “Actions speak louder than…”) as it does to solving a complex logic puzzle.
MTP mitigates this inefficiency through speculative decoding, a technique introduced by Google researchers in Fast Inference from Transformers via Speculative Decoding. The drafter proposes several tokens ahead; if the target model agrees with the draft, it accepts the entire sequence in a single forward pass, and even generates an additional token of its own in the process. This means your application can output the full drafted sequence plus one token in the time it usually takes to generate a single one.
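The draft/verify loop is easy to see in a toy sketch. This is an illustration only: both “models” here are greedy, deterministic stand-ins (real systems compare draft and target probabilities and sample), and the function name is mine, not Google’s API:

```python
def speculative_step(target, draft, prefix, k=4):
    """One round of speculative decoding with greedy toy models.

    `target` and `draft` are functions: token sequence -> next token.
    The draft model proposes k tokens; the target accepts the longest
    matching prefix and contributes one token of its own.
    """
    # 1. Draft k tokens autoregressively with the cheap model.
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Verify: in a real system the target scores all k positions in
    #    ONE forward pass; here we just compare greedy choices in a loop.
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        if target(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)

    # 3. The target always supplies one extra token: the correction at the
    #    first mismatch, or a bonus token if everything was accepted.
    accepted.append(target(ctx))
    return accepted

# Toy greedy "model": next token = current sequence length mod 5.
toy = lambda seq: len(seq) % 5
assert speculative_step(toy, toy, [0], k=3) == [1, 2, 3, 4]  # k drafts accepted + 1 bonus
```

Because the target verifies every position, the output always matches what the target alone would have generated; the drafter only changes how many of those tokens arrive per target forward pass.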
Unlocking faster AI from the edge to the workstation
For developers, inference speed is often the primary bottleneck for production deployment. Whether you are building coding assistants, autonomous agents that require rapid multi-step planning, or responsive mobile applications running entirely on-device, every millisecond matters.
By pairing a Gemma 4 model with its corresponding drafter, developers can achieve:
Improved responsiveness: Drastically reduce latency for near real-time chat, immersive voice applications and agentic workflows.
Supercharged local development: Run our 26B MoE and 31B Dense models on personal computers and consumer GPUs with unprecedented speed, powering seamless, complex offline coding and agentic workflows.
Enhanced on-device performance: Maximize the utility of our E2B and E4B models on edge devices by generating outputs faster, which in turn preserves valuable battery life.
Zero quality degradation: Because the primary Gemma 4 model retains the final verification, you get identical frontier-class reasoning and accuracy, just delivered significantly faster.
Gemma 4 26B on a NVIDIA RTX PRO 6000. Standard Inference (left) vs. MTP Drafter (right) in tokens per second. Same output quality, half the wait time.
Where you can dive deeper into MTP drafters
To make these MTP drafters exceptionally fast and accurate, we introduced several architectural enhancements under the hood. The draft models seamlessly utilize the target model’s activations and share its KV cache, meaning they don’t have to waste time recalculating context the larger model has already figured out. For our E2B and E4B edge models, where the final logit calculation becomes a big bottleneck, we even implemented an efficient clustering technique in the embedder to further accelerate generation.
We’ve also been closely analyzing hardware-specific optimizations. For example, while the 26B mixture-of-experts model presents unique routing challenges at a batch size of 1 on Apple Silicon, processing multiple requests simultaneously (e.g., batch sizes of 4 to 8) unlocks up to a ~2.2x speedup locally. We see similar gains with Nvidia A100 when increasing batch size.
Want to see the exact mechanics of how this works? We’ve published an in-depth technical explainer that unpacks the visual architecture, KV cache sharing and efficient embedders powering these drafters.
How to get started
The MTP drafters for the Gemma 4 family are available today under the same open-source Apache 2.0 license as Gemma 4. Read the documentation to learn how to use MTP with Gemma 4. You can download the model weights right now on Hugging Face and Kaggle, and start experimenting with faster inference with transformers, MLX, vLLM, SGLang, and Ollama, or try them directly on Google AI Edge Gallery for Android or iOS.
We can’t wait to see how this newfound speed accelerates what you build next in the Gemmaverse.
Speaking of companies with valuable minority stakes in AI companies, there’s one thing that stuck in my craw about the blockbuster Ronan Farrow / Andrew Marantz investigative piece on Sam Altman and OpenAI last month for The New Yorker. It didn’t come up during Nilay Patel’s excellent interview with Farrow on Decoder, either.
Sam Altman was the president of Y Combinator for several years, and left to become the full-time CEO of OpenAI. The New Yorker quotes Y Combinator co-founder Paul Graham multiple times, in the context of Altman’s trustworthiness. (Some of those quotes are firsthand, others secondhand.) Graham’s role in the story — particularly his public remarks after publication — comprised an entire section in my own take on the New Yorker piece, wherein I concluded:
I would characterize Graham’s tweets re: Altman this week as
emphasizing only that Altman was not fired or otherwise forced
from YC, and could have stayed as CEO at YC if he’d found another
CEO for OpenAI. But for all of Graham’s elucidating engagement on
Twitter/X this week regarding this story, he’s dancing around the
core question of the Farrow/Marantz investigation, the one right
there in The New Yorker’s headline: Can Sam Altman be trusted?
“We didn’t ‘remove’ Sam Altman” and “We didn’t want him to
leave” are not the same things as saying, say, “I think Sam
Altman is honest and trustworthy” or “Sam Altman is a man of
integrity”. If Paul Graham were to say such things, clearly and
unambiguously, those remarks would carry tremendous weight. But — rather conspicuously to my eyes — he’s not saying such things.
The thing that stuck in my craw is this: Does Y Combinator own a stake in OpenAI? And if they do, given OpenAI’s sky-high valuation, isn’t that stake worth billions of dollars?
OpenAI was seeded by an offshoot of Y Combinator called YC Research in 2016 — when Altman was running YC. In December 2023, the well-known AI expert (and AI-hype skeptic) Gary Marcus wrote the following, in a piece on Altman’s trustworthiness in the wake of the OpenAI board saga that saw Altman fired, re-hired, and the board purged in the course of a tumultuous week:
After poking around, I found out that “I have no equity in OpenAI”
was only half the truth; while Altman to my knowledge holds no
direct equity in OpenAI, he does have an indirect stake in
OpenAI, and that fact should have been disclosed.
In particular, he owns a stake of Y Combinator, and Y Combinator
owns a stake in OpenAI. It may well be worth tens of millions of
dollars; even for Altman, that’s not trivial. Since he was
President of Y Combinator, and CEO of OpenAI; he surely was
aware of this.
So it’s well known that Y Combinator owns some stake in OpenAI. But how big is that stake? This seems like devilishly difficult information to obtain. I asked around and a little birdie who knows several OpenAI investors came back with an answer: Y Combinator owns about 0.6 percent of OpenAI. At OpenAI’s current $852 billion valuation, that’s worth over $5 billion.
Graham and his wife Jessica Livingston are two of Y Combinator’s four founding partners. The fact that Paul Graham personally has billions of dollars at stake with OpenAI doesn’t mean that his public opinion on Sam Altman’s trustworthiness and leadership is invalid. But it certainly seems like the sort of thing that ought to be disclosed when quoting Graham as an Altman character reference. A billion dollars here, a billion there — that adds up to the sort of money that might skew a fellow’s opinion.
A senior engineer’s job is mostly the parts that don’t show up in the diff. Specs. Tests. Reviews. Scope discipline. Refusing to ship what can’t be verified. AI coding agents skip those parts by default. Agent Skills is my attempt to make them not optional.
The default behaviour of any AI coding agent is to take the shortest path to “done.” Ask for a feature and it writes the feature. It does not ask whether you have a spec, write a test before the implementation, consider whether the change crosses a trust boundary, or check what the PR will look like to a reviewer. It produces code, declares victory, and moves on.
This is the same failure mode every senior engineer has spent their career learning to avoid. The senior version of any task includes work that doesn’t show up in the diff: surfacing assumptions, writing the spec, breaking the work into reviewable chunks, choosing the boring design, leaving evidence that the result is correct, sizing the change so a human can actually review it. Those steps are most of what separates engineers who ship reliable software at scale from people who push code that breaks.
Agents skip those steps for the same reason any junior would. They’re invisible. The reward signal points at “task complete” not “task complete and the design doc exists.” So we have to bolt the senior-engineer scaffolding back on.
Agent Skills is my attempt at that scaffolding. It just crossed 27K stars, so apparently I’m not alone in wanting it. This post is the part the README doesn’t quite cover: why each design choice exists, how it maps onto standard SDLC and Google’s published engineering practices, and what you should steal from the project even if you never install a single skill.
What a “skill” actually is
The word “skill” is doing a lot of work in the Claude Code / Anthropic vocabulary, and it helps to be precise. A skill is a markdown file with frontmatter that gets injected into the agent’s context when the situation calls for it. Somewhere between a system-prompt fragment and a runbook.
A skill is not reference documentation. It is not “everything you should know about testing.” It is a workflow: a sequence of steps the agent follows, with checkpoints that produce evidence, ending in a defined exit criterion.
That distinction is the whole game. If you put a 2,000-word essay on testing best practices into the agent’s context, the agent reads it, generates plausible-looking text, and skips the actual testing. If you put a workflow there (write the failing test first, run it, watch it fail, write the minimum code to pass, watch it pass, refactor), the agent has something to do, and you have something to verify.
Process over prose. Workflows over reference. Steps with exit criteria over essays without them. That single distinction separates a useful skill from a pretty markdown file. It also explains why so many “AI rules” repos end up doing nothing in practice. The rules are essays.
The SDLC the skills encode
The twenty skills in the repo organise around six lifecycle phases, with seven slash commands sitting on top. Define (/spec) is where you decide what you’re actually building. Plan (/plan) breaks the work down. Build (/build) implements it in vertical slices. Verify (/test) proves it works. Review (/review) catches what slipped through. Ship (/ship) gets it to users safely. /code-simplify sits across the bottom of the whole thing.
This isn’t a coincidence. It’s the same SDLC every functioning engineering organisation runs, just in different vocabulary. Google calls it design doc → review → implementation → readability review → launch checklist. Amazon calls it the working-backwards memo and the bar raiser. Every healthy team has some version of this loop.
What’s new with AI coding agents is that most agents skip most of these phases by default. You ask for a feature, you get an implementation, and the spec, plan, tests, review, and launch checklist all just don’t happen. Skills push the agent through the same phases a senior engineer forces themselves through, because shipping the code without them is how you produce incidents.
A complex feature might activate eleven skills in sequence. A small bug fix might use three. The router (using-agent-skills) decides which apply. The point is that the workflow scales to the actual scope, not to the assumed scope.
Five principles that are doing the work
Five design decisions in the project are the load-bearing ones. The rest of the system follows from them.
1. Process over prose
Already covered. Workflows are agent-actionable; essays are not. The same is true for human teams. If your team handbook is 200 pages, no one reads it under time pressure. If it’s a small set of workflows with checkpoints, people actually run them.
2. Anti-rationalization tables
This is the most distinctive design decision in the project, and the one I most want other teams to steal.
Each skill includes a table of common excuses an agent (or a tired engineer) might use to skip the workflow, paired with a written rebuttal. A few examples close to the originals:
“This task is too simple to need a spec.” → Acceptance criteria still apply. Five lines is fine. Zero lines is not.
“I’ll write tests later.” → Later is the load-bearing word. There is no later. Write the failing test first.
“Tests pass, ship it.” → Passing tests are evidence, not proof. Did you check the runtime? Did you verify user-visible behaviour? Did a human read the diff?
The reason this works is that LLMs are excellent at rationalisation. They will produce a plausible-sounding paragraph explaining why this particular task doesn’t need a spec, or why this particular change is fine to merge without review. Anti-rationalization tables are pre-written rebuttals to lies the agent hasn’t yet told.
The pattern is just as good for human teams. Most engineering decay isn’t anyone choosing to do bad work. It’s people accepting plausible-sounding justifications for skipping the parts they don’t feel like doing. A team that writes down its anti-rationalizations is a team that has fewer of them.
3. Verification is non-negotiable
Every skill terminates in concrete evidence. Tests pass. Build output is clean. The runtime trace shows the expected behaviour. A reviewer signs off. “Seems right” is never sufficient.
This is the same principle that makes Anthropic’s harness recover from failures, that makes Cursor’s planner/worker/judge split actually catch bugs, that makes any long-running agent recoverable. The agent is a generator. You need a separate signal that the work is done. Skills bake that signal into every workflow.
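As a sketch of the idea (names and structure are my own, not from the repo): a task object that simply refuses to close until at least one piece of evidence is attached.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Illustrative sketch: 'done' requires evidence, never 'seems right'."""
    name: str
    evidence: list[str] = field(default_factory=list)  # test runs, logs, review sign-offs

    def attach(self, artifact: str) -> None:
        self.evidence.append(artifact)

    def close(self) -> str:
        # The hard exit criterion: no evidence, no completion.
        if not self.evidence:
            raise RuntimeError(f"{self.name}: no evidence attached; task stays open")
        return f"{self.name}: done ({len(self.evidence)} artifact(s))"
```

A harness hook can enforce the same rule deterministically: block the agent’s “finished” message until a verification artifact exists in the session log.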
4. Progressive disclosure
Do not load all twenty skills into context at session start. Activate them based on the phase. A small meta-skill (using-agent-skills) acts as a router that decides which skill applies to the current task.
This is the harness engineering lesson applied at skill granularity. Every token loaded into context degrades performance somewhere, so you load what’s relevant and leave the rest on disk. Progressive disclosure is how you get a twenty-skill library into a 5K-token slot without poisoning the well.
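A minimal sketch of such a router, assuming skills live as markdown files on disk; the phase-to-skill mapping here is hypothetical, and the real using-agent-skills meta-skill is itself a prompt, not code:

```python
from pathlib import Path

# Hypothetical mapping from task phase to skill names (illustrative only).
PHASE_TO_SKILLS = {
    "spec":   ["writing-specs"],
    "build":  ["test-driven-development"],
    "review": ["code-review-and-quality"],
}

def skills_for_phase(phase: str, skills_dir: Path) -> str:
    """Load only the skill text relevant to the current phase.

    Everything else stays on disk, so the context window carries
    only what the task actually needs."""
    chunks = []
    for name in PHASE_TO_SKILLS.get(phase, []):
        path = skills_dir / name / "SKILL.md"
        if path.is_file():
            chunks.append(path.read_text())
    return "\n\n".join(chunks)
```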
5. Scope discipline
The meta-skill encodes a non-negotiable I’d staple to every agent if I could: “touch only what you’re asked to touch.” Don’t refactor adjacent systems. Don’t remove code you don’t fully understand. Don’t brush against a TODO and decide to rewrite the file.
This sounds obvious until you watch an agent decide that fixing one bug requires modernizing three unrelated files. Scope discipline is the single biggest determinant of whether an agent’s PR is mergeable or has to be unwound. It’s also the principle that maps most cleanly onto Google’s code review norms, where reviewers will block a PR for doing more than one thing.
The Google DNA
The skills are saturated with practices from Software Engineering at Google and Google’s public engineering culture. This is intentional. Most of what makes Google-scale software work is documented and public, and it is exactly the part agents are most likely to skip.
A partial map of which skill encodes which practice:
Hyrum’s Law in api-and-interface-design. Every observable behaviour of your API will eventually be depended on by someone, so design with that in mind.
The test pyramid (~80/15/5) and the Beyoncé Rule in test-driven-development. “If you liked it, you should have put a test on it.” Infrastructure changes don’t catch bugs; tests do.
DAMP over DRY in tests. Google’s testing philosophy is explicit that test code should read like a specification even at the cost of some duplication. Over-abstracted tests are a known anti-pattern.
~100-line PR sizing, with Critical / Nit / Optional / FYI severity labels in code-review-and-quality. Straight from Google’s code review norms. Big PRs don’t get reviewed; they get rubber-stamped.
Chesterton’s Fence in code-simplification. Don’t remove a thing until you understand why it was put there.
Trunk-based development and atomic commits in git-workflow-and-versioning.
Shift Left and feature flags in ci-cd-and-automation. Catch problems as early as possible, decouple deploy from release.
Code-as-liability in deprecation-and-migration. Every line you keep is one you have to maintain forever, so prefer the smaller surface.
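Hyrum’s Law in miniature, as a toy example of my own (not from the skills): even an error message becomes an interface once some caller starts observing it.

```python
def lookup(table: dict, key: str) -> int:
    if key not in table:
        # The message text is an "internal" detail... until someone observes it.
        raise KeyError(f"missing key: {key}")
    return table[key]

def caller(table: dict, key: str) -> int:
    try:
        return lookup(table, key)
    except KeyError as e:
        # Brittle: this caller parses the message. Rewording it now breaks them.
        if "missing key" in str(e):
            return 0
        raise
```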
None of these are new ideas. The point is that none of them are in the agent by default. A frontier model has read the phrase “Hyrum’s Law” in its training data, but it does not apply Hyrum’s Law when it’s designing your API at 3am. Skills are how you make sure it does.
How to actually use it
Three modes, in roughly increasing commitment.
Mode 1: install via marketplace. If you’re using Claude Code:
/plugin marketplace add addyosmani/agent-skills
/plugin install agent-skills@addy-agent-skills
You get the slash commands (/spec, /plan, /build, /test, /review, /ship, /code-simplify) and the agent activates the relevant skills automatically based on context. This is the path I’d recommend most people start on.
Mode 2: drop the markdown into your tool of choice. The skills are plain markdown with frontmatter. Cursor users put them in .cursor/rules/. Gemini CLI has its own install path. Codex, Aider, Windsurf, OpenCode, anything that accepts a system prompt can read them. The tooling matters less than the workflow underneath.
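Given that description (plain markdown with frontmatter), a skill file presumably looks something like the sketch below; the field names and content are my guess at the shape, not copied from the repo:

```markdown
---
name: test-driven-development
description: Write the failing test before the implementation.
---

# Test-driven development

## Workflow
1. Turn the acceptance criterion into a failing test.
2. Implement the minimum change that makes it pass.
3. Refactor only while the suite is green.

## Exit criterion
A test run showing the new test failing before the change and passing after.
```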
Mode 3: read them as a spec. Even if you never install anything, the skills are a documented description of what good engineering with AI agents looks like. Read code-review-and-quality.md and apply the five-axis framework to your team’s review process. Read test-driven-development.md and use it to settle the next “do we need to write the test first” argument with a junior. Read the meta-skill and steal the five non-negotiables for your own AGENTS.md.
This third mode is where I’d actually start. Pick the four or five skills closest to your current pain. Decide which workflows you want enforced. Then install the runtime, or roll your own, to do the enforcing.
What to steal even if you never install
A few patterns from the project I’d steal regardless of whether you use AI coding agents at all.
Anti-rationalization as a team practice. Write down the lies your team tells itself. “We’ll fix the tests after launch.” “This change is too small for a design doc.” “It’s fine, we have monitoring.” Pair each with the rebuttal. Put it in your AGENTS.md or your engineering wiki. It will save you arguments and it will catch the next tired Friday-afternoon shortcut.
Process over prose for anything you write internally. If you find yourself writing a 2,000-word doc titled “how we approach X”, you’ve written reference material. Convert it to a workflow with checkpoints. The doc shrinks to 400 words and people actually run it. This applies as much to onboarding guides and runbooks as it does to agent skills.
Verification as a hard exit criterion. Make “produce evidence” the exit step of every task. For agents, for engineers, for yourself. Evidence is whatever proves the work is done: a green test run, a screenshot, a log, a review approval. Without it, the task is not done. “Seems right” never closes the loop.
Progressive disclosure for any rulebook. Do not write a 50-page handbook. Write a small router that points to the right small chapter for the situation. This is true for AGENTS.md, for runbooks, for incident playbooks, for anything anyone will read under time pressure.
Five non-negotiables, lifted from the meta-skill, that I’d put in any AGENTS.md tomorrow:
Surface assumptions before building. Wrong assumptions held silently are the most common failure mode.
Stop and ask when requirements conflict. Don’t guess.
Push back when warranted. The agent (or engineer) is not a yes-machine.
Prefer the boring, obvious solution. Cleverness is expensive.
Touch only what you’re asked to touch.
That’s a worthwhile engineering culture in five lines, and you don’t need to install anything to adopt it.
Where this fits in the harness
In the broader picture, skills are one layer of agent harness engineering. The harness is the model plus everything you build around it; skills are the reusable workflow chunks that get progressively disclosed into the system prompt. They sit alongside AGENTS.md (the rolling rulebook), hooks (the deterministic enforcement layer), tools (the actions the agent can take), and the session log (the durable memory). Each layer has a specific job. Skills do the senior-engineer-process job.
Skills matter more for long-running agents than they do for chat-style ones, because long runs amplify every shortcut. An agent that skips the test in a 10-minute session produces one bug. An agent that skips the test in a 30-hour session produces a debugging archaeology project at the end of the run, when no one remembers what the original intent was. The longer the run, the more the senior-engineer scaffolding has to be enforced rather than suggested.
The portability of the skills format matters too. The same SKILL.md file works in Claude Code, Cursor (with rules), Gemini CLI, Codex, and any other harness that accepts system-prompt content. Write the workflow once, the runtime enforces it. That’s the thing the markdown-with-frontmatter format buys you that bespoke prompt engineering does not.
Closing
The thing I most want people to take from this project, more than the skills themselves, is the framing.
AI coding agents are extremely capable junior engineers with no instinct for the parts of the job that don’t show up in the diff. The senior-engineering work (surfacing assumptions, sizing changes, writing the spec, leaving evidence, refusing to merge what can’t be reviewed) is exactly what an agent will skip unless you make it impossible to skip. The job, increasingly, is to encode that discipline as something the agent cannot talk itself out of.
Skills are one shape of that. Anti-rationalization tables. Progressive disclosure. Process over prose. Verification as the load-bearing exit criterion. The Google practices that already work, made portable.
You can install my version. You can roll your own. The lesson stands either way: the senior-engineer parts of the job are no longer optional, even when the engineer is a model.
The repo is at github.com/addyosmani/agent-skills (MIT). For the broader scaffolding picture, see Agent Harness Engineering and Long-running Agents.
Bloomberg’s Mark Gurman reported on Monday that iOS 27 will add a “Create a Pass” feature to the Wallet app. Tap the “+” button you already use to add credit cards or passes from emails, and Wallet will offer something it has never offered before on iPhone: a path to build your own pass.
You can scan a QR code on a paper ticket or membership card with the camera, or build a pass from scratch in a layout editor. The whole flow runs without an Apple Developer account, a Pass Type ID, or any certificate signing.
iOS 27 is expected to preview at WWDC on June 8, with a public release in September.
How the new flow works
Reporting from Bloomberg, MacRumors, 9to5Mac, and AppleInsider lines up on the same workflow. Inside the Wallet app, the existing “+” button gains a new option for creating a pass. From there you choose between two starting points:
Scan a QR code from a paper card, ticket, or screen
Build a custom pass from scratch with no scan needed
Once you are in the editor, Wallet exposes adjustable styles, images, colors, and text fields. The reports describe a fairly conventional template-driven layout, closer in spirit to what Pass2U, WalletWallet, and other third-party generators have offered for years than to Apple’s developer-only PassKit pipeline.
Three templates, color-coded
Apple is testing three starting templates, each tied to a default color:
Standard (orange): the default for any general-purpose pass.
Membership (blue): geared toward gyms, clubs, libraries, and other recurring-access cards.
Event (purple): meant for tickets to games, movies, and one-off occasions.
The color choice is not just decoration. Wallet currently sorts passes visually in the stack, and the template hue is what sets each card apart at a glance, so a quick look is enough to pick out the orange punch card from the purple ticket without reading a word.
Why now: 14 years of PassKit drought
Apple shipped PassKit alongside iOS 6 back in 2012. The pitch was clean: businesses build .pkpass files, customers tap to add, everyone wins. In practice, the consistent adopters ended up being airlines, big-box retailers, ticketing platforms, and a handful of national chains. Most gyms, cafes, libraries, rec centers, and small loyalty programs never built one, because the path requires an Apple Developer account, signing certificates, and enough engineering work that “just print a paper card” almost always won the budget conversation.
The Next Web’s framing is blunt: Apple is no longer waiting on developers. With Create a Pass, the supply-side problem is finally being solved from the demand side. If the business will not build a Wallet pass, the user does it themselves from the QR code that business already printed.
That is a meaningful shift in posture. For more than a decade, Wallet has been a directory of what brands chose to ship. In iOS 27 it becomes a directory of what people choose to keep.
What this means for WalletWallet
We will be honest. WalletWallet exists because of this exact gap. You take a barcode from any loyalty card, paste it into our web app, pick a color, and a free Apple Wallet pass lands on your phone in about a minute, all from the browser without an account or any developer setup. Once Create a Pass ships in September, a chunk of that workflow moves natively into the iPhone Wallet app.
That is good for users. We started this project to make Wallet friendlier for the cafes-and-gyms long tail, and Apple agreeing with us at OS-level scope is a healthy outcome. The category needed it.
A few places where we still help, even after iOS 27 ships:
Google Wallet. Create a Pass is iPhone-only. Roughly half of the wallet-using world is on Android, and our generator builds Google Wallet passes from the same form.
Web, no OS upgrade. iOS 27 needs a compatible iPhone and the September update. WalletWallet runs in any browser today. iOS 14, iPad, Mac, a friend’s laptop, all fine.
Tag passes with real integrations. Our Bandcamp, SoundCloud, and Spotify pass builders pull artist art and links automatically into a tag pass. That is a different shape from the generic templated pass Apple is showing.
Sharing. A web-generated .pkpass is just a file. You can email it, post it, hand it to a friend on Android via QR. The Wallet-native flow is more locked to the device that built it.
We expect to lose volume on the simplest one-barcode-to-Wallet case once Create a Pass goes live. That is fine. The reason WalletWallet started was that Apple’s bar for a Wallet pass was too high for normal people. If iOS 27 lowers that bar, the world we wanted is closer.
What we still do not know
The current reports cover the UI, the templates, and the high-level workflow. They are silent on a lot of details that matter:
Whether iCloud will sync user-created passes across iPhone, iPad, and Mac
Whether passes can be exported as .pkpass files to share with non-iPhone users
Whether Wallet supports Code 128, PDF417, and Aztec barcodes, or only QR
Whether merchants can claim, co-sign, or update user-created passes after the fact
Whether passes have lock-screen behavior tied to time and location, the way developer-issued passes do today
We will know more once Apple previews iOS 27 at WWDC on June 8, and again when the first developer betas land. We will update this post when there is something concrete to add.
Quick recap
iOS 27 is adding a Create a Pass button to the Wallet app, with a QR-scan or build-from-scratch flow and three color-coded templates: Standard (orange), Membership (blue), and Event (purple). Bloomberg broke the story on May 4, and a public release is expected in September 2026. It will be the first time iPhone users do not need a third-party tool to put a barcode into Wallet, and for us that is a sign the category is maturing the right way.