10 interesting stories served every morning and every evening.

S&P 500 rejects SpaceX, also blocking entry for OpenAI and Anthropic

arstechnica.com

But in its final decision, the S&P Dow Jones Indices stated that “no changes will be made to the eligibility criteria including financial viability screens, seasoning period, or minimum IWF.” Even after the standard yearlong wait, SpaceX, Anthropic, and OpenAI may struggle to deliver the consistent profitability necessary to qualify for the S&P 500.

Money rules and ex­cep­tions

Swift en­try into the S&P 500 would have trig­gered $14 bil­lion of pas­sive fund buy­ing for SpaceX, ac­cord­ing to Bloomberg Intelligence. The in­vest­ment re­search arm of Bloomberg also es­ti­mated that OpenAI could have gained more than $8 bil­lion, and Anthropic could have net­ted $4.6 bil­lion from sim­i­lar pas­sive buy­ing sprees trig­gered by their S&P 500 en­tries.

This is be­cause $7.5 tril­lion in pas­sively man­aged funds—pop­u­lar among both in­di­vid­ual in­vestors and in­sti­tu­tional in­vestors—fol­low the S&P 500 by pur­chas­ing shares of com­pa­nies ac­cord­ing to their pro­por­tional rep­re­sen­ta­tion in the S&P 500 in­dex. For ex­am­ple, the Vanguard and Fidelity bro­ker­age gi­ants both of­fer pas­sive in­vest­ment funds that track the S&P 500 com­po­si­tion.

However, the S&P Dow Jones Indices did carve out “one concession” by changing the investable weight factor rules for “lower-profile benchmarks” such as the S&P Total Market Index and Dow Jones US Total Stock Market Index, according to Quartz. That could allow an IPO faster entry into those indexes.

By con­trast, the Nasdaq stock ex­change changed its rules to al­low SpaceX to en­ter the Nasdaq-100 Index within 15 trad­ing days as op­posed to the usual three months. Similarly, the FTSE Russell in­dex provider de­cided to give SpaceX and other fol­low-on com­pa­nies ac­cel­er­ated en­try to the Russell Top 500 Index af­ter the close of the fifth trad­ing day fol­low­ing an IPO.

The denial of accelerated S&P 500 entry for SpaceX comes just days after Morningstar analysts described SpaceX as having been “significantly overvalued” in the lead-up to its IPO. The investment research firm valued SpaceX at $780 billion, less than half of SpaceX’s $1.75 trillion IPO goal, primarily based on the strengths of SpaceX’s Starlink satellite service and rocket launch business.

mouseless

mouseless.click

GOV.UK goes Dutch on payments as it dumps Stripe

www.theregister.com

pub­lic sec­tor

Means residents can skip the credit card and use ‘pay by bank’ for local authorities and services

The UK’s Government Digital Service (GDS) has replaced Stripe with Dutch provider Adyen as its processor for many payments made through its GOV.UK Pay service.

Adyen will take over GOV.UK Pay card pay­ments for lo­cal au­thor­i­ties, po­lice forces and armed forces units from Stripe, as well as pay by bank ser­vices, un­der a three-year con­tract worth up to £25.3 mil­lion.

According to the ten­der no­tice pub­lished in February 2025, the con­tract cov­ers around 17 per­cent of pay­ments made through GOV.UK Pay but more than 70 percent of its or­ga­ni­za­tions and in­cludes the only op­tion al­low­ing users to start tak­ing pay­ments within one work­ing day. At that point the con­tract had an es­ti­mated max­i­mum value of £49 mil­lion, al­though with no guar­an­tees over vol­ume.

In a blog post about the contract award on 2 June, GDS said it will migrate around 1,000 services to the new supplier. “We will make migration as straightforward as possible while complying with Know Your Customer legislation that protects everyone from fraud,” wrote Alan Maddrell, senior content designer for the service. “Most importantly, there will be no discernible difference for paying users and no loss in functionality.”

He added that the change of sup­plier will help in­tro­duce new op­tions in­clud­ing pay by bank, which trans­fers money di­rectly be­tween bank ac­counts us­ing open bank­ing ser­vices and avoids the need to type in card de­tails. GDS will con­tinue to use WorldPay to process pay­ments for cen­tral gov­ern­ment, linked or­ga­ni­za­tions and NHS bod­ies.

GDS es­tab­lished GOV.UK Pay to save pub­lic ser­vices the ef­fort and cost of set­ting up on­line pay­ments them­selves. It does­n’t charge or­ga­ni­za­tions for the ser­vice be­yond pass­ing on trans­ac­tion fees.

According to its per­for­mance data page, GOV.UK Pay has processed 137.5 mil­lion trans­ac­tions since it was set up in 2016, worth around £9.2 bil­lion. It cur­rently pro­vides 1,718 ser­vices, in­clud­ing 662 for lo­cal gov­ern­ment and 256 for po­lice forces, to 608 or­gan­i­sa­tions rang­ing from 1079 (Tiverton) Squadron RAF Air Cadets to Yeovil Town Council. ®

Did Claude Increase Bugs in rsync?

alexispurslane.github.io

Data Analysis · June 2026

A sim­ple dis­tri­b­u­tional analy­sis of every rsync re­lease with bug data. Nothing com­pli­cated, an­swers only one ques­tion: are the Claude-assisted re­leases un­usu­ally buggy?

Repository: RsyncProject/rsync

Method: severity-weighted bugs per 10 commits, exact permutation test

0 · Disclaimer: How AI Assistance Was Used

In order to avoid accusations of this just being “Claude defending Claude,” “AI slop,” “probably all hallucinations,” etc., I’ve decided it’s probably worth explaining a few key points about how this report was created:

All met­rics, method­ol­ogy, and data sources were ex­clu­sively cho­sen by me, in con­sul­ta­tion with my wife, who has a Master’s Degree in Statistics from Penn State University.

The methodology is directly based on my wife’s input: she was the one that pointed out that trying to just compare bugs per ten lines of code before and after would likely be too affected by noise because of the low number of post-Claude samples, and that, for similar reasons, trying to build some kind of linear regression model to ascertain the relative effects of different variables would probably also not work. She specifically told me that looking at where the post-Claude releases fall into the historical distribution, and how likely from the historical distribution we would be to get releases “as bad” or worse than the post-Claude releases, was probably the best that could be done.

I spent sev­eral days on this, two be­fore even cre­at­ing the GitHub repo and had at least one ma­jor to­tal rewrite of the re­port to use a bet­ter method­ol­ogy (given the feed­back from my wife men­tioned above). This was a lot of man­ual, cog­ni­tive ef­fort on my end.

The scripts used to fetch the data, col­late it into a DuckDB data­base file, con­struct the views on that DB, and then do the sta­tis­ti­cal analy­sis on that data, were in­deed writ­ten by GLM 5.1, as was the HTML and much of the orig­i­nal prose for the fi­nal re­port web­page you’re look­ing at right now.

Crucially, how­ever, all num­bers, sta­tis­tics, cards, and graphs in this re­port are au­to­mat­i­cally tem­plated in di­rectly by the Python script that ran the sta­tis­ti­cal analy­sis, thus avoid­ing any pos­si­bil­ity of hal­lu­ci­na­tions or in­con­sis­ten­cies in the num­bers.

After posting this on Hacker News and receiving almost no substantive input, discussion, or response on the actual content of the article, I decided to rewrite all of the prose in my own voice. If anyone complains about my verbosity or sentence structure (as they usually do, which is the reason I originally let the AI write the prose, among other reasons obsoleted by templating), they can go fuck themselves.

If you want to replicate the data and results here, and inspect exactly how they were calculated, you can find the repository here. I have purposefully made it so that the pipeline can be run end-to-end completely from scratch, with no mysterious DB blobs forcing you to trust that I didn’t doctor or screw up the data. If you want to be mad about the numbers, look there first.

1 · Background: The rsync Outrage

In late May 2026, rsync blew up. First, an evidence-free Mastodon post was made pointing to a spurious correlation between a regression that particular user experienced upon upgrading to a release, and that release having Claude commits in it. It was viewed an unknown number of times, but even likes and boosts passed the thousands mark handily, and it gained significant traction, as all spurious anti-AI hate does, seeing 58 replies from 32 unique users. Someone rages about “cognitive surrender” with no evidence; another suggests adding rsync to the famous open-slopware blacklist. From there, it spread to Hacker News, with 81 comments, full of mixed dread, anger, and crowing about how this finally proves once and for all no one can use LLMs safely. Among all that was one particular comment which further spurred the view that the regressions and bugs were caused by Claude.

On May 30, 2026, this burgeoning outrage coalesced into a single focal point: a GitHub issue titled “Please Do Not Vibe Fuck Up This Software”, opened against the rsync repository. It attached a screenshot of the Mastodon post criticizing the project’s use of Claude. That’s it. No bug report, no technical content, no attempt to actually ascertain if the concern was real or justified; just 350+ comments ranging from thoughtful concern to outright harassment (most of the most egregious, unreasonable, and outright violent comments have since been deleted; few thought to preserve them).

The thread did not stop at words. It eventually escalated to, at one point, visual depictions of fantasies of violence, when one user posted a now-deleted comment including My Little Pony drawings of themselves strangling “the project janitor that pushed vibecoded commits”.

Completing the internet outrage cycle, this issue in turn spread to Hacker News, generating hundreds more comments. Some attempted to point at the number of regressions after the introduction of Claude (“The Linux Mint Timeshift tool has an issue open documenting a number of regressions that are currently open on the rsync issues page, that were only introduced post-vibecoding”) as evidence that it was worse. Others pointed out that those regressions were not caused by Claude, and in response, the goalposts were moved again. Over and over, the core theme was one central claim, repeated everywhere: Claude-assisted development introduced bugs into a previously stable tool. AI is cognitive surrender, is cocaine, is loss of craft, and the users are right to be angry as a result:

People are very jus­ti­fi­ably an­gry that a very sta­ble, well trusted tool, has started to im­me­di­ately go down­hill… all be­cause the main dev is vibecod­ing that soft­ware. — fao_ on Hacker News

However, this doesn’t have to be a question solved only on the basis of, ironically, vibes. This is something that could be, at least to a degree, empirically tested. Some even pointed that out:

On Lobste.rs, in response to the Medium essay Tridge himself posted, some users like boramalper finally began to ask for evidence one way or another:

It’d be in­ter­est­ing if some­one ac­tu­ally did a timechart of re­gres­sions af­ter each re­lease (if at all pos­si­ble) to see if the num­ber ac­tu­ally went up re­cently or not. — bo­ra­malper on Lobsters

User bitshift replied: “I would also love to see such a chart. It wouldn’t be completely informative… But at least it would be something objective we could measure.”

This analy­sis is that chart. Or, well, as best as it can be made, given the lim­i­ta­tions of the data (see the pre­vi­ous sec­tion).

2 · Executive Summary

36 re­leases with bug data, span­ning v2.4.6 to v3.4.3

2 re­leases have Claude com­mits: v3.4.2 (9 Claude, 0.00 sev/​10c) and v3.4.3 (28 Claude, 3.29 sev/​10c)

The Claude re­leases bracket the IQR in op­po­site di­rec­tions: v3.4.2 is be­low the IQR, v3.4.3 is above it. Neither is an out­lier.

Exact per­mu­ta­tion test p-value = 46%: pick any 2 re­leases at ran­dom, you’d score as bad or worse 46% of the time. This is the strongest avail­able test and it finds noth­ing.

Fisher’s ex­act test p-value = 74%: Claude re­leases are no more likely to fall above the his­tor­i­cal me­dian than any other re­leases (odds ra­tio 1.06).

The his­tor­i­cal mean is 1.8× the Claude mean (2.95 vs 1.65 sev/​10c)

v3.4.1 (59 bugs / 9 com­mits, no Claude) is an out­lier but be­longs in the base­line — it is a re­lease, and the dis­tri­b­u­tion al­ready cap­tures it

3 · The Metric

The analy­sis uses a sin­gle met­ric: sever­ity-weighted bugs per 10 com­mits (sev/10c). Each bug is nor­mal­ized to a 0 – 1 sever­ity score (its LLM-assigned sever­ity di­vided by 100), and those scores are summed per re­lease in­stead of sim­ply count­ing bugs. The raw bug count is also shown in the table for ref­er­ence, but sev/​10c dri­ves all sta­tis­ti­cal tests.

sev/​10c = (Σ sever­ity/​100 ÷ to­tal_­com­mits) × 10
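As a rough illustration of the formula above, the metric can be computed in a few lines of Python. This is a minimal sketch and not the report's actual script: the function name `sev_per_10c` and the sample severities are hypothetical.

```python
def sev_per_10c(severities, total_commits):
    """Severity-weighted bugs per 10 commits.

    severities: iterable of 0-100 integer scores, one per bug report.
    total_commits: number of commits attributed to the release.
    """
    if total_commits <= 0:
        raise ValueError("total_commits must be positive")
    # Normalize each score to the 0-1 range, then sum instead of counting bugs.
    weighted_bugs = sum(s / 100 for s in severities)
    return weighted_bugs / total_commits * 10


# Hypothetical release: three bugs scored 80, 40 and 20 across 10 commits.
example = sev_per_10c([80, 40, 20], total_commits=10)
print(round(example, 2))
```

A report scored 0 adds nothing to the sum, so a release whose reports all score 0 ends up at 0.00 sev/10c regardless of how many issues were filed.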

How com­mits are as­signed to re­leases

Every commit on the default branch was ordered by committer date to produce a sequential timeline. Each git tag points to a specific commit in this timeline. A release’s range is all commits between the previous tag and its own tag. Pre-release tags (“pre”, “rc”) are skipped as boundaries and absorbed into their final release. Every commit belongs to exactly one release.
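The assignment rule described above might look roughly like the following. It is a sketch under stated assumptions, not the repository's code: the names `is_prerelease` and `assign_commits` are hypothetical, tags are given as positions in the ordered timeline, and pre-release tags are assumed to be recognizable by "pre" or "rc" in their name.

```python
def is_prerelease(tag_name):
    # Assumption: pre-release tags contain "pre" or "rc" in their name.
    return "pre" in tag_name or "rc" in tag_name


def assign_commits(commits, tags):
    """Map each commit to the first final release tag at or after it.

    commits: commit ids already ordered by committer date.
    tags: (tag_name, index_into_commits) pairs.
    """
    # Pre-release tags are dropped, so their commits fall into the next final release.
    finals = sorted((idx, name) for name, idx in tags if not is_prerelease(name))
    release_of = {}
    start = 0
    for idx, name in finals:
        for commit in commits[start:idx + 1]:
            release_of[commit] = name
        start = idx + 1
    return release_of


# Hypothetical timeline with one pre-release tag between two final tags.
timeline = ["c0", "c1", "c2", "c3", "c4", "c5"]
tag_positions = [("v1.0", 1), ("v1.1pre1", 3), ("v1.1", 4)]
print(assign_commits(timeline, tag_positions))
```

In this sketch, commits after the last final tag (here `c5`) are left unassigned, since they have not shipped in any release yet.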

How bugs are found and as­signed to re­leases

Bug re­ports come from three sources:

GitHub is­sues in the rsync repos­i­tory (collated via the GitHub REST API),

the rsync Bugzilla in­stance (collected via the API),

and the rsync mail­ing list.

GitHub issues and mailing-list bugs are attributed to the most recent release that shipped before the bug was reported. For Bugzilla, each entry has a “Version” field that explicitly states which release the bug was reported against, and bugs are attributed to that release.
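The date-based attribution used for GitHub issues and mailing-list bugs can be sketched with a binary search over release dates. The function name `attribute_bug`, the release calendar, and the choice to count a same-day release as already shipped are all assumptions made for illustration.

```python
from bisect import bisect_right
from datetime import date


def attribute_bug(reported_on, releases):
    """Return the most recent release shipped on or before reported_on.

    releases: (release_date, name) pairs sorted by date.
    Returns None if the report predates every release.
    """
    dates = [d for d, _ in releases]
    # bisect_right counts how many releases shipped on or before the report date.
    pos = bisect_right(dates, reported_on)
    return releases[pos - 1][1] if pos else None


# Hypothetical release calendar, for illustration only.
RELEASES = [
    (date(2020, 1, 1), "v1.0"),
    (date(2021, 6, 1), "v1.1"),
]
print(attribute_bug(date(2020, 5, 5), RELEASES))
```

Bugzilla entries would bypass this lookup entirely, since their version field names the release directly.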

Severity scor­ing

To con­trol for bug sever­ity — so that, as some­one on HN said, a typo in a but­ton and a CVE aren’t rated equally — every bug re­port was scored for sever­ity on a 0 – 100 scale. The scorer is Qwen 3 35B, a small open-weight lan­guage model, prompted as a se­nior re­li­a­bil­ity en­gi­neer as­sess­ing real-world im­pact. Each bug re­port was given to the model as its ti­tle and body text (truncated to 3,000 char­ac­ters), along with the fol­low­ing rubric:

All three bug sources — GitHub is­sues, Bugzilla, and the rsync mail­ing list — were scored. Bugzilla and mail­ing list re­ports had only a ti­tle (no body), so the model scored those from the ti­tle alone. The model was in­structed to fall back on the ti­tle and lean to­ward the mid­dle of the range (40 – 60) when the body did­n’t pro­vide enough in­for­ma­tion. The model was also told to out­put only a sever­ity in­te­ger via struc­tured out­put (JSON schema), so there were no free-text re­sponses to parse. Scoring was done at tem­per­a­ture 0 for de­ter­min­ism — the same in­put al­ways pro­duces the same score.

Issues scored sever­ity 0 — fea­ture re­quests, spam, off-topic rants about AI, empty sub­mis­sions — are ex­cluded from bug counts by de­fault. This mat­ters be­cause some re­leases at­tracted a lot of noise on GitHub. v3.4.2 had four is­sues filed; the model scored all four at sever­ity 0 (a fea­ture-re­quest op­tion, a miss­ing tar­ball ques­tion, and two more fea­ture re­quests).

Example scores from the data­base, one per tier:

Why the re­lease is the unit of analy­sis

Why group com­mits by re­lease, bugs by re­lease, and then as­cer­tain the cor­re­la­tion — or lack thereof — be­tween Claude com­mits and bugs through the in­ter­me­di­ary of re­leases? This is for two rea­sons.

First, be­cause the claim that the crit­ics are mak­ing is also, it­self, made in terms of re­leases: that hav­ing any Claude com­mits in a re­lease makes the whole re­lease more buggy as a whole in a no­tice­able way, not just that Claude-authored com­mits may in­tro­duce more bugs; the lat­ter is a dif­fer­ent met­ric, be­cause later Claude- or hu­man-au­thored com­mits could cor­rect for those bugs within the same re­lease, and no­body would then no­tice as part of the re­lease, and over­all it would­n’t mat­ter to users; ad­di­tion­ally, it’s sim­ply im­por­tant, as stated else­where, to meet the claim of the crit­ics where it’s at. If this forces them to make their claims more nu­anced — or oth­er­wise move the goal­posts — then mis­sion ac­com­plished.

Second, it’s a prob­lem of at­tri­bu­tion: the vast, vast ma­jor­ity of bugs do not state ex­actly which com­mit caused them, be­cause do­ing so would re­quire ex­ten­sive re­search and analy­sis that is of­ten not worth it in fa­vor of sim­ply fix­ing-for­ward, and even if that analy­sis was done — via some­thing like git bi­sect — it would­n’t nec­es­sar­ily re­sult in any­thing use­ful, or any­thing at all. Many bugs can re­sult from a com­bi­na­tion of mul­ti­ple com­mits, of­ten sep­a­rated sig­nif­i­cantly over time, where it’s un­clear whether one com­mit or the other re­ally in­tro­duced the bug. Or, one com­mit can re­veal sev­eral la­tent bugs in­tro­duced by other com­mits at once, and so on.

Why bugs and com­mits?

The critics’ claim is simplistic, absolute, and universalistic: the rate of bugs in the Claude-exposed releases went up. Therefore, the simplest honest response is to analyze precisely what is being claimed: bugs, commits, releases, and Claude-exposed commits. If the Claude releases sit in the middle of the historical distribution, the burden shifts to the critics to explain why this particular middle is somehow worse than all the other middles that came before it. Even by weighting by severity, I feel that I am giving extensive generosity to the anti-AI point in all this, but enough of the more intelligent critics brought it up that I found it worth it.

Even if that results in shifting the conversation toward a more nuanced discussion of the quality and type and user impact of the bugs in the releases, it will already have been a major win for the pro-AI crowd, and a shifting of the goalposts for the anti-AI crowd, and then we can do further analysis based on that. And the ball’s in the anti-AI court for that game.

What this ap­proach does not do

I’m aware that this metric does not control for commit complexity or security intensity. It is a blunt instrument. But the critics’ accusation is also blunt: “Claude is making things worse.” A blunt instrument is what is required in response. Blood begets blood.

4 · Results

Claude Releases

Before we jump into deeper analy­sis, let’s just look at the two Claude re­leases them­selves, to get a sense for them:

v3.4.2

0.00 sev/​10c

0 bugs · 50 com­mits · 9 Claude

0th per­centile (rank 0 of 35)

v3.4.3

3.29 sev/​10c

17 bugs · 34 com­mits · 28 Claude

77th per­centile (rank 27 of 35)

If that does­n’t look like a red flag to you, you’d be right.

Exact Permutation Test

So the ques­tion is: are the Claude re­leases un­usu­ally buggy, or could you eas­ily pull a group just as bad out of the his­tor­i­cal dis­tri­b­u­tion by dumb luck? The way you an­swer that ques­tion sta­tis­ti­cally is an ex­act per­mu­ta­tion test, which just enu­mer­ates all pairs of two re­leases and asks: what frac­tion have a mean bug rate as bad or worse than the one we ac­tu­ally ob­served? That frac­tion is the p-value of the hy­poth­e­sis un­der test.

46%

ex­act per­mu­ta­tion test p-value (one-sided, H₁: Claude mean > his­tor­i­cal)

272 of 595 pos­si­ble groups of 2 his­tor­i­cal re­leases have mean sev/​10c ≥ 1.65. Nearly half. The Claude re­leases sit right in the mid­dle of the per­mu­ta­tion dis­tri­b­u­tion — there is noth­ing ex­treme about them.

Test sta­tis­tic: mean sev/​10c per group · Claude group mean: 1.65 · Historical mean: 2.95

What this p-value tells us is that the hy­poth­e­sis that Claude makes re­leases worse has, at least so far, about as much pre­dic­tive power as a coin flip: if you closed your eyes and picked 2 re­leases at ran­dom, you’d do as bad or worse nearly half the time. There’s noth­ing un­usual about the Claude group.
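The enumeration described in this section is small enough to write out directly. The sketch below uses only the standard library; `exact_permutation_p` is a hypothetical name and `toy_rates` is made-up data, not the rsync release history.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p(rates, observed_mean, group_size=2):
    """One-sided exact test: share of groups whose mean is >= observed_mean."""
    groups = list(combinations(rates, group_size))
    as_bad = sum(1 for g in groups if mean(g) >= observed_mean)
    return as_bad, len(groups), as_bad / len(groups)


# Made-up sev/10c values for four releases, for illustration only.
toy_rates = [0.0, 0.5, 1.0, 4.0]
hits, total, p = exact_permutation_p(toy_rates, observed_mean=1.0)
print(hits, total, p)

# Sanity check on the group count quoted in the text: 35 releases give 595 pairs.
print(len(list(combinations(range(35), 2))))
```

Because every possible pair is enumerated, the resulting fraction is exact; no random resampling is involved.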

Fisher’s Exact Test

The per­mu­ta­tion test asks: how likely is it that a ran­dom group of re­leases scores as badly as the Claude group? But there’s an­other way to pose the ques­tion: are Claude re­leases more likely than non-Claude re­leases to fall above the his­tor­i­cal me­dian? That’s a text­book 2×2 con­tin­gency table, and the stan­dard test for it is Fisher’s ex­act test.

74%

one-sided p-value (H₁: Claude more likely above me­dian)

Fisher’s exact test asks: if we split all releases at the historical median (0.74 sev/10c), are these Claude releases significantly buggier than previous releases (more likely to land above the median)? With a p-value of 74%, the answer is a decisive no. The odds ratio is 1.06, essentially 1:1. Claude releases are no more likely to be above the median than any other releases.

Odds ra­tio: 1.06 · Median: 0.74 sev/​10c

To em­pha­size, this does not mean that all Claude re­leases in the fu­ture will not be more buggy. We don’t have nearly enough data to build a model and ex­trap­o­late out like that, and that’s not what a Fisher’s ex­act test is for. The point that’s be­ing made here is that these spe­cific re­leases are not at all no­table; if no one had known they were AI, no one would have cared or no­ticed any­thing out of the or­di­nary, and there is no ev­i­dence with which to con­clude that Claude made any­thing worse yet, un­like the ob­jec­tive, ab­so­lutist, uni­ver­sal claims made by crit­ics.
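For readers who want to see the mechanics, a one-sided Fisher's exact test on a 2×2 table is a sum of hypergeometric probabilities. The sketch below is standard-library Python rather than the report's script. The table counts are an assumption: the report does not print them, and they were chosen here because they are consistent with the stated odds ratio (1.06) and p-value (74%).

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """P(X >= a) under the hypergeometric null for the table [[a, b], [c, d]].

    Rows: Claude / non-Claude releases. Columns: above / not above the median.
    """
    row1 = a + b           # number of Claude releases
    col1 = a + c           # number of releases above the median
    n = a + b + c + d      # all releases in the table
    upper = min(row1, col1)
    favorable = sum(comb(col1, k) * comb(n - col1, row1 - k)
                    for k in range(a, upper + 1))
    return favorable / comb(n, row1)


# Assumed counts (not taken from the report's database):
# 1 Claude release above the median, 1 below; 16 others above, 17 below.
a, b, c, d = 1, 1, 16, 17
odds_ratio = (a * d) / (b * c)
p_value = fisher_one_sided(a, b, c, d)
print(round(odds_ratio, 2), round(p_value, 2))
```

With only two Claude releases, the test has very little power, which is consistent with the caveat above about not extrapolating to future releases.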

The Distribution

In case you’re not con­vinced, here’s a vi­sual aid, show­ing where these re­leases fall in the dis­tri­b­u­tion of all prior re­leases:

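[Figure: dot plot of releases by sev/10c on a logarithmic axis (ticks at 0.01, 0.1, 1, 10, 100), with v3.4.2 and v3.4.3 labeled. Legend: Historical, Claude; Middle 50% (IQR), Outside IQR.]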

How to read this graph: Each dot is a re­lease. The shaded green band is the in­terquar­tile range — the mid­dle 50% of his­tor­i­cal re­leases, from 0.29 to 2.59 sev/​10c. The darker re­gions on ei­ther side are the lower and up­per quar­ters.

This is another way of saying the same thing the previous two tests said, but more intuitively: the Claude releases (green dots) bracket the IQR in opposite directions. v3.4.2, with zero real bugs, sits just below the IQR; v3.4.3 sits just above it. Neither is a negative outlier, and since they’re on either side of the IQR, there’s no evidence Claude usage bends releases in either direction.

Commit Rate

One possible objection I’ve seen is that while perhaps the defect rate of Claude-authored commits is not worse than human-authored ones, Claude sped up development so much (due to “vibe coding” perhaps) that the total number of bugs in each release got too high for comfort anyway, in which case it doesn’t matter that the defect rate per commit isn’t so bad, because that’s not what downstream users experience. We can check this, though:

p=88%

ex­act per­mu­ta­tion test: do Claude re­leases have more com­mits?

Claude re­leases av­er­aged 42 com­mits; non-Claude re­leases av­er­aged 185. If you pick any 2 re­leases at ran­dom, you’d see as many or more com­mits 88% of the time.

GitHub - microsoft/pg_durable: PostgreSQL in-database durable execution

github.com

Long-running, fault-tol­er­ant SQL func­tions for teams that al­ready keep their state in Postgres and want to stop stitch­ing to­gether cron jobs, work­ers, queues, and sta­tus ta­bles to make back­ground work re­li­able. Define the work­flow in SQL, let pg_­durable check­point each step, and re­sume af­ter crashes, restarts, or failed steps.

Durable ex­e­cu­tion is now a stan­dard in­dus­try pat­tern, and pg_­durable brings it in­side Postgres with no ex­tra ser­vice in­fra­struc­ture re­quired. Part of our mis­sion to bring com­pute close to data.

Try pg_­durable now in Azure HorizonDB, Microsoft’s new PostgreSQL cloud ser­vice en­gi­neered for per­for­mance and built with pg_­durable in­side

Is this for me?

Who it’s for

Backend and data en­gi­neers who want work­flows to live next to the data they touch.

DBAs and SREs au­tomat­ing run­books that must sur­vive restarts and be au­ditable in SQL.

Teams build­ing data or AI pipelines that need durable ex­e­cu­tion per row, doc­u­ment, or batch.

The core idea

A pg_­durable func­tion is a graph of SQL steps that PostgreSQL ex­e­cutes and check­points as it goes. If the data­base crashes, restarts, or a step fails, ex­e­cu­tion re­sumes from the last durable check­point in­stead of mak­ing you re­con­struct state by hand.

Workloads this is use­ful for

Vector em­bed­ding pipelines: chunk, call an em­bed­ding API, and up­sert into pgvec­tor.

Ingest pipelines: stage, dedu­pli­cate, trans­form, and pub­lish large batches.

Scheduled main­te­nance: de­tect bloat, no­tify, wait for ap­proval, then run the next ac­tion.

Fan-out ag­gre­ga­tion: run in­de­pen­dent queries in par­al­lel, then join the re­sults.

External API work­flows: en­rich­ment, clas­si­fi­ca­tion, and web­hook-style calls from SQL.

What you’re prob­a­bly do­ing to­day in­stead

pg_cron plus a jobs table, sta­tus columns, retry coun­ters, and a polling worker.

An ex­ter­nal or­ches­tra­tor such as Airflow, Temporal, Step Functions, or Argo call­ing back into Postgres.

A queue plus work­ers plus a sep­a­rate state table to co­or­di­nate re­tries and par­tial com­ple­tion.

A plpgsql pro­ce­dure that works un­til a crash or long-run­ning trans­ac­tion forces you to start over.

Pain points it ad­dresses

A restart in the mid­dle of a long job means re­run­ning work that al­ready suc­ceeded.

One failed row or one failed API call turns into man­ual cleanup and un­cer­tain re­play.

Long trans­ac­tions hold locks, grow WAL, and make batch jobs frag­ile at larger scale.

Parallel work in the app tier cre­ates more places for par­tial-fail­ure bugs and drift.

The work­flow logic ends up spread across SQL, work­ers, queues, dash­boards, and sta­tus ta­bles.

What changes in your ar­chi­tec­ture

The work­flow de­f­i­n­i­tion moves into SQL and starts with df.start(…).

Retry state, progress track­ing, and check­point­ing move into Postgres in­stead of be­spoke app code.

Some app-tier work­ers, queue con­sumers, or sched­uler glue can dis­ap­pear en­tirely.

Operational vis­i­bil­ity comes from Postgres ta­bles such as df.in­stances, us­ing the same auth and backup model as your data.

When not to use it

The job is already a single INSERT … SELECT or one ordinary SQL statement.

You need sub-mil­lisec­ond syn­chro­nous re­quest han­dling rather than durable back­ground ex­e­cu­tion.

You can­not in­stall ex­ten­sions or run a back­ground worker in your Postgres en­vi­ron­ment.

The work­flow mostly lives out­side Postgres and spans many het­ero­ge­neous sys­tems.

You need ar­bi­trary ap­pli­ca­tion logic that does not map cleanly to SQL steps, branch­ing, loops, or HTTP calls.

How it works

Define a work­flow in SQL us­ing com­pos­able op­er­a­tors such as ~> and |=>.

Start it with df.start() and get back an in­stance ID.

Let the run­time ex­e­cute each step durably with check­point­ing be­tween steps.

Query sta­tus and re­sults from PostgreSQL while the work­flow runs or af­ter it com­pletes.

Limitations

The model is in­ten­tion­ally SQL-shaped. If a step needs ar­bi­trary code, a non-HTTP SDK, or rich in-mem­ory con­trol flow, you may need to wrap that logic in a SQL func­tion, ex­pose it be­hind an HTTP end­point for df.http(), or use a gen­eral-pur­pose or­ches­tra­tor for that part of the sys­tem.

Features

Durable — Function state per­sists to PostgreSQL. Survives crashes, restarts, and failovers.

SQL-native — Define func­tions in SQL us­ing com­pos­able op­er­a­tors.

Database-aware — First-class prim­i­tives for sched­ul­ing, con­di­tions, and par­al­lel ex­e­cu­tion.

Zero in­fra­struc­ture — Runs as a PostgreSQL ex­ten­sion. No Redis, no Temporal, no ex­ter­nal ser­vices.

Quick Example

-- A durable function that processes data in steps
SELECT df.start(
    'SELECT id FROM documents WHERE processed = false LIMIT 100' |=> 'batch'
    ~> 'UPDATE documents SET processed = true WHERE id = ANY($batch)'
);

Packages

Tagged re­leases pub­lish Debian pack­ages for PostgreSQL 17 and 18 on amd64 from the GitHub re­lease as­sets. Packages are named pg-durable-post­gresql-<PG ma­jor>_<pg_­durable ver­sion>-1_<arch>.deb and in­stall the ex­ten­sion li­brary, con­trol file, and SQL up­grade files into the match­ing PostgreSQL in­stal­la­tion di­rec­to­ries.

After in­stalling a pack­age, add pg_­durable to shared_pre­load­_li­braries, restart PostgreSQL, and cre­ate the ex­ten­sion in the con­fig­ured pg_­durable data­base:

CREATE EXTENSION pg_­durable;

The de­fault pg_­durable data­base is post­gres; see User Guide for back­ground worker con­fig­u­ra­tion and priv­i­lege setup.

Release as­sets also in­clude source archives for build­ing from source.

Development Installation

Prerequisites

PostgreSQL 17 or 18

Rust (nightly)

cargo-pgrx 0.16.1

GitHub Codespace

The main branch pre­build in­stalls PostgreSQL 17, builds pg_­durable, and pre­pares a lo­cal clus­ter un­der ~/.pgrx with the ex­ten­sion ready. PostgreSQL is not left run­ning, so start it when you be­gin work­ing.

# Start PostgreSQL
./scripts/pg-start.sh

# Connect
~/.pgrx/17.*/pgrx-install/bin/psql -h localhost -p 28817 -d postgres

On a branch with­out a ready pre­build, run pg-start.sh — it will build and in­stall the ex­ten­sion on first run (expect a few min­utes):

./scripts/pg-start.sh

Other en­vi­ron­ments

Local and Dev Container

A VS Code Dev Container (.devcontainer/) pro­vides Rust, cargo-pgrx, and PostgreSQL 17 pre-in­stalled. For a bare lo­cal ma­chine, in­stall the tool­chain first by fol­low­ing the steps in .devcontainer/onCreateCommand.sh.

# Build, initialize PostgreSQL, and install the extension
# This takes a while - go do something else
./scripts/pg-start.sh

# Connect to the local pgrx PostgreSQL instance
~/.pgrx/17.*/pgrx-install/bin/psql -h localhost -p 28817 -d postgres

pg-start.sh boot­straps new lo­cal data di­rec­to­ries with a post­gres su­pe­ruser and also cre­ates a match­ing su­pe­ruser role for the cur­rent OS user, so de­fault lo­cal psql us­age con­tin­ues to work. Use -U post­gres if you want to force the canon­i­cal boot­strap role ex­plic­itly.

Docker

# Build and test
./scripts/test-e2e-docker.sh --rebuild

# Optional: Deploy to ACR (for custom PG17 image with pg_durable baked-in)
./scripts/deploy-acr.sh

Multi-User Setup

CREATE EXTENSION pg_­durable does not grant any priv­i­leges to PUBLIC. After in­stalling the ex­ten­sion, the ad­min must ex­plic­itly grant ac­cess to ap­pli­ca­tion roles. Row-level se­cu­rity (RLS) en­sures each user can only see and man­age their own durable func­tion in­stances and nodes.

Grant priv­i­leges to an ap­pli­ca­tion role:

-- Grant to specific roles after CREATE EXTENSION
SELECT df.grant_usage('app_role');

Alternatively, cre­ate an in­di­rec­tion role and grant mem­ber­ship to ap­pli­ca­tion roles:

-- Create a shared role for pg_durable access
CREATE ROLE pg_durable_user NOLOGIN;
SELECT df.grant_usage('pg_durable_user');

-- Grant membership to application roles
GRANT pg_durable_user TO app_backend, etl_service;

See the User Guide — Privilege Grants sec­tion for the full list of in­di­vid­ual grants, re­vok­ing ac­cess, and hard­en­ing up­graded in­stalls.

Note: GRANT EXECUTE ON ALL FUNCTIONS only applies to functions that exist when the grant runs. After upgrading pg_durable with ALTER EXTENSION pg_durable UPDATE, re-run df.grant_usage('role') (or re-issue the manual grants) so new functions are accessible.

Key points:

The back­ground worker role (pg_durable.worker_role GUC, de­fault: post­gres) must be a su­pe­ruser — it by­passes RLS to man­age all users’ in­stances

Users get SELECT + INSERT on df.in­stances / df.nodes, col­umn-level UPDATE (status, up­dat­ed_at) on in­stances for df.can­cel()

Identity col­umn (submitted_by) can­not be mod­i­fied by users

df.vars uses per-user scop­ing — each user has their own vari­able name­space via an owner col­umn and RLS. Superusers by­pass RLS but DSL func­tions still scope to the call­ing user via ex­plicit fil­ters. Avoid stor­ing se­crets in plain text

Continuous Integration

All pull re­quests must pass the fol­low­ing checks be­fore merg­ing:

Format Check: cargo fmt --check

Clippy & Tests: cargo clippy, unit tests (cargo pgrx test pg17), pg_regress tests, and E2E tests

The CI work­flow is de­fined in .github/workflows/ci.yml. It uses pgrx to down­load and man­age PostgreSQL.

How LLMs Actually Work

www.0xkato.xyz

Monday, June 01, 2026

26 mins

This post is a walk­through of how LLMs work. Modern LLMs are mostly built by stack­ing trans­former blocks over and over, so un­der­stand­ing the trans­former ma­chin­ery gets you most of the way there.

I’ll cover the core mech­a­nisms in­side mod­ern trans­former-based LLMs, with­out all that sticky math stuff. Don’t get me wrong, you should learn the math, but this can serve as an in­tro­duc­tion.

Most mod­ern LLMs share the same trans­former-fam­ily skele­ton. The dif­fer­ences come from what each one was trained on, the scale and con­fig­u­ra­tion choices, and the post-train­ing done on top. By the end, you should be able to read many mod­ern LLM pa­pers or model cards and know which piece of the ar­chi­tec­ture each sec­tion is talk­ing about.

Here’s the path:

Tokens, how a string of text be­comes a se­quence of in­te­gers

Embeddings, how those in­te­gers get mean­ing

Positional en­cod­ing, how the model knows what or­der the to­kens came in

Attention, how to­kens share in­for­ma­tion with each other

Multi-head at­ten­tion, how the model tracks many kinds of re­la­tion­ships at once

The feed-for­ward net­work, where a large share of the mod­el’s stored struc­ture lives

The resid­ual stream and layer nor­mal­iza­tion, what makes deep stacks train­able

Predicting the next to­ken, what the model ac­tu­ally out­puts and how the gen­er­a­tion loop works

Architecture vs trained weights, what’s broadly shared across mod­ern LLMs, and what’s dif­fer­ent

Tiny ex­plain­ers ap­pear through­out so any­one can fol­low along, re­gard­less of back­ground.

Tokenization

Models don’t read text directly. They read integer IDs, so something has to convert your prompt into a sequence of those integers.

That con­ver­sion step is called to­k­eniza­tion. A to­k­enizer takes a string and pro­duces a se­quence of in­te­gers, where each in­te­ger points to an en­try in a fixed vo­cab­u­lary. Modern LLM vo­cab­u­lar­ies usu­ally con­tain tens of thou­sands to a few hun­dred thou­sand en­tries.

Tiny explainer: token ID. A token ID is the integer the model uses for one vocabulary entry. The model works with the number, not the written word itself.

Tokens aren’t usually whole words. They’re usually subword pieces. The word “tokenization” might split into [“token”, “ization”]. The word “running” might split into [“run”, “ning”]. The reason is efficiency. Whole-word vocabularies are too big and don’t generalize to new words. Character-level vocabularies are too small and force the model to learn even the simplest patterns from scratch. Subword tokenization sits in the middle. The most common pieces become single tokens, and rare or novel words get composed from smaller pieces.
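To make the split concrete, here is a minimal sketch of a subword tokenizer. It assumes a tiny hand-written vocabulary and uses greedy longest-match, which is a simplification: real BPE tokenizers apply merge rules learned from data. The names VOCAB and tokenize are made up for this illustration.

```python
# Toy vocabulary: piece -> token ID. Hypothetical; real vocabularies are
# learned from data and hold tens of thousands of entries or more.
VOCAB = {"token": 0, "ization": 1, "run": 2, "ning": 3, "s": 4}


def tokenize(text: str) -> list[int]:
    """Greedy longest-match split of `text` into vocabulary IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {pos}")
    return ids


print(tokenize("tokenization"))  # [0, 1]
print(tokenize("running"))       # [2, 3]
print(tokenize("tokens"))        # [0, 4]
```

Note that the model downstream only ever sees the integers, never the letters inside each piece.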

Tiny explainer: vocabulary. The vocabulary is the tokenizer’s fixed list of pieces. Each piece has an ID, and the model can only directly receive IDs from that list.

The trade-off shows up in places people don’t expect. The classic example: ask an LLM how many R’s are in “strawberry.” LLMs used to get it wrong. That’s not the model failing at counting. It’s the model not operating on letters directly, only token IDs that happen to spell out a word a human would split letter by letter.

Different model fam­i­lies use dif­fer­ent to­k­eniz­ers. GPT mod­els use Byte Pair Encoding vari­ants. SentencePiece is com­mon in LLaMA-style mod­els. The choice mat­ters for com­pute (fewer to­kens means less work) and for things like mul­ti­lin­gual cov­er­age, but the ba­sic shape is the same. Text in, in­te­gers out.

Now that the prompt is a se­quence of in­te­gers, the next step is to give those in­te­gers mean­ing.

Embeddings

A to­ken ID like 1024 is just a row in­dex. It does­n’t mean any­thing by it­self. The thing that gives it mean­ing is a gi­ant table called the em­bed­ding ma­trix.

Every model has one. It has one row per en­try in the vo­cab­u­lary, and each row is a long vec­tor of num­bers. The length of each row is the mod­el’s hid­den size. In many 7B-class mod­els, that means 4,096 num­bers per to­ken. Larger mod­els usu­ally use wider vec­tors.

Tiny explainer: vector. A vector is a list of numbers. In a transformer, each token becomes a vector so the model can do math with it.

When the tokenizer hands the model an integer, the model looks up that row and uses the vector instead. That vector is the token’s embedding. It’s the model’s representation of “what that token means,” learned during training.
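The lookup itself is just row indexing. A minimal sketch, assuming a toy vocabulary of 8 entries and a hidden size of 4 (both made-up values), with random numbers standing in for the learned weights:

```python
import random

VOCAB_SIZE = 8    # toy value; real vocabularies are far larger
HIDDEN_SIZE = 4   # toy value; the text mentions 4,096 for many 7B-class models

random.seed(0)
# One row per vocabulary entry. Random here; in a real model these are learned.
embedding_matrix = [
    [random.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Replace each token ID with its row of the embedding matrix."""
    return [embedding_matrix[i] for i in token_ids]


vectors = embed([3, 1, 3])
print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True: same ID gives the same vector
```

The last line is worth noticing: the same ID always produces the same vector, no matter where it appears in the sequence.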

Tiny explainer: embedding matrix. The embedding matrix is a lookup table. Token ID in, learned vector out.

The interesting property of these embeddings is that semantically similar tokens end up with similar vectors. The vector for “king” is close in space to the vector for “queen,” and the vector for “Paris” is close to “France.” None of this is hard-coded. It emerges from training on enough text, and the model learns these positions because they let it predict text well.

You can do arith­metic on em­bed­dings and it some­times works. The fa­mous ex­am­ple is king − man + woman ≈ queen. The geom­e­try of em­bed­ding space car­ries real se­man­tic struc­ture, even though no­body told the model to build it that way.
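Here is a sketch of that arithmetic on hand-built vectors. The two dimensions (royalty, gender) and every number in EMB are invented for illustration; real embeddings are learned, have thousands of dimensions, and the analogy only holds approximately.

```python
import math

# Hand-built 2-D vectors: (royalty, gender). Purely illustrative, not learned.
EMB = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest(vec):
    """Word whose embedding points in the most similar direction to `vec`."""
    return max(EMB, key=lambda word: cosine(vec, EMB[word]))


# king - man + woman, computed coordinate by coordinate
target = [k - m + w for k, m, w in zip(EMB["king"], EMB["man"], EMB["woman"])]
print(target)           # [1.0, -1.0]
print(nearest(target))  # queen
```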

Worth being clear on: at this stage every token has been replaced by its embedding, but the embedding alone says nothing about where the token sits in the sequence. The vector for “dog” is the same vector whether “dog” is the first word in your prompt or the fifth. That’s a problem.

That’s the gap po­si­tional en­cod­ing fills.

Positional en­cod­ing

Plain self-attention doesn’t have a built-in representation of word order. Without some positional signal, it has no direct way to know that “dog” came before “bites” instead of after it.

Word or­der changes mean­ing. So the model needs an­other piece. It needs a way to in­ject the po­si­tion of each to­ken into the math.

Tiny explainer: positional encoding. Positional encoding is how the model gets order information. It tells the model where each token sits in the sequence.

The original transformer paper (Vaswani et al. 2017) did this by giving each position its own pattern of numbers and adding it directly to each token’s embedding before any other processing. Position 1 had one pattern, position 5 had a different pattern, position 100 had another. The patterns came from sine and cosine waves at different frequencies. Now the embedding for “dog” at position 1 was different from the embedding for “dog” at position 5, just because the position pattern added to it was different.
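A minimal sketch of that scheme, following the sine/cosine formula from the original paper. The vector size of 8 and the "dog" embedding values are made up, and positions are counted from 0 here.

```python
import math


def sinusoidal_encoding(position: int, d_model: int) -> list[float]:
    """Position pattern: sin on even indices, cos on odd, at falling frequencies."""
    enc = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))
        enc.append(math.cos(angle))
    return enc[:d_model]


def add_position(embedding, position):
    """Add the position pattern to a token embedding, element by element."""
    pattern = sinusoidal_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pattern)]


dog = [0.2, -0.1, 0.4, 0.0, 0.3, 0.1, -0.2, 0.5]  # hypothetical embedding
print(sinusoidal_encoding(0, 4))                      # [0.0, 1.0, 0.0, 1.0]
print(add_position(dog, 1) != add_position(dog, 5))   # True
```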

That worked, and sinusoidal encodings were chosen partly because the authors hoped they would let the model extrapolate beyond the exact sequence lengths seen during training. But additive position schemes still had two problems that became important as models scaled up.

First, the em­bed­ding had to carry both mean­ing and po­si­tion in the same set of num­bers. There’s only so much you can pack in.

Second, learned ab­solute po­si­tion em­bed­dings in par­tic­u­lar don’t gen­er­al­ize cleanly. If you trained on prompts up to 2,048 to­kens long, the model never saw po­si­tion 5,000 dur­ing train­ing, and the em­bed­ding for that po­si­tion was not learned in the same way.

Modern mod­els mostly use a dif­fer­ent scheme called Rotary Position Embeddings (RoPE), in­tro­duced by Su et al. in 2021 and now used in LLaMA, Mistral, Gemma, Qwen, and most other open-weight fam­i­lies. The in­tu­ition: in­stead of adding po­si­tion info to each to­ken’s vec­tor, RoPE ro­tates the vec­tor by an an­gle that de­pends on its po­si­tion. A to­ken at po­si­tion 1 gets a small turn, a to­ken at po­si­tion 100 gets a big­ger turn. When two to­kens are later com­pared dur­ing at­ten­tion, what mat­ters is the dif­fer­ence be­tween their ro­ta­tions, which en­codes how far apart they are.
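The "difference between rotations" claim can be checked numerically. This is a sketch that rotates adjacent pairs of dimensions, as in the original RoPE formulation; production implementations often pair dimensions differently, and the q and k values are invented.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle."""
    d = len(vec)  # must be even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 0.8]   # hypothetical query vector
k = [0.2, -0.7, 0.9, 0.1]   # hypothetical key vector

# Same distance apart (2 positions), very different absolute positions.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 105), rope(k, 103))
print(math.isclose(near, far, abs_tol=1e-9))  # True
```

The score depends only on how far apart the two tokens are, not on where the pair sits in the prompt.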

Tiny explainer: RoPE. RoPE stands for Rotary Position Embeddings. Instead of adding a position vector, it rotates token vectors so relative distance shows up during attention.

The prac­ti­cal ad­van­tages are real. RoPE en­codes rel­a­tive po­si­tion nat­u­rally (which is closer to what at­ten­tion ac­tu­ally wants). It gen­er­al­izes bet­ter to longer con­texts. And it does­n’t add new pa­ra­me­ters to the model.

Even with good positional encoding, modern LLMs have a documented “lost in the middle” problem (Liu et al. 2023). They use information at the start and end of long prompts more reliably than information buried in the middle. That’s why prompt engineering tips like “put important context first” or “repeat key info at the end” actually help. The model isn’t using every part of your prompt equally well.

With to­ken mean­ing and po­si­tion both en­coded, the next ques­tion is how do to­kens ac­tu­ally ex­change in­for­ma­tion?

Attention

This is the mechanism the original transformer paper, “Attention Is All You Need,” is named after. Attention.

Inside every trans­former layer, at­ten­tion does one thing. It lets each to­ken look at the other to­kens it is al­lowed to see and de­cide which ones mat­ter for what comes next.

It does this by giv­ing each to­ken three roles at once. Each to­ken gets trans­formed into three new vec­tors, called Query, Key, and Value (Q, K, V).

Tiny explainer: Q, K, V. Query means “what am I looking for,” Key means “what do I match with,” and Value is the information that gets copied when the match is strong.

The Query asks, “what am I looking for from other tokens?”

The Key says, “this is what I offer to tokens looking at me.”

The Value carries, “this is what gets passed along when a match happens.”

The same to­ken plays all three roles at the same time. The Q, K, V trans­for­ma­tions are learned ma­tri­ces, so the model fig­ures out dur­ing train­ing what each to­ken should look for and what it should of­fer.

Matching hap­pens through a sim­i­lar­ity score. Each to­ken’s Query is com­pared against the Key of each to­ken it is al­lowed to see, us­ing a scaled dot prod­uct. Intuitively, this mea­sures how much the two vec­tors line up. The scal­ing keeps the num­bers sta­ble be­fore soft­max.

Tiny explainer: dot product. A dot product is a simple way to score how aligned two vectors are. Higher alignment means a stronger match.

The match scores then get turned into weights us­ing soft­max. Softmax takes any set of num­bers and turns them into a prob­a­bil­ity-like dis­tri­b­u­tion that sums to 1. Tokens with higher match scores get higher weights, and the weights are then used to take a weighted av­er­age of the value vec­tors.
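Putting the last few paragraphs together for a single token: a scaled dot product, then softmax, then a weighted average of the values. This is a sketch with invented 2-D vectors, not the shapes a real model uses.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query attention: score each key, softmax, average the values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output


# Hypothetical vectors: the query lines up with the first key, not the second.
query = [2.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

weights, output = attend(query, keys, values)
print(weights[0] > weights[1])  # True: the aligned key gets the larger weight
print(output[0] > output[1])    # True: its value dominates the result
```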

Tiny explainer: softmax. Softmax turns raw scores into weights that add up to 1. Big scores get big weights, small scores get small weights.

An example. Consider the sentence “The cat that I saw yesterday was sleeping.” When the model processes “was,” it needs to figure out what’s doing the sleeping. The Query vector for “was” gets compared against the Key vectors of the tokens it is allowed to see. The dot product with “cat” is high, because the model has learned that verbs like “was” need a subject and that subjects like “cat” produce Key vectors that line up well. The dot product with “yesterday” is low. Softmax turns those scores into weights, “cat” gets a high weight, “yesterday” gets a low one. The model then takes a weighted sum of the corresponding value vectors, so the value for “cat” dominates the result. The new representation of “was” is now mostly shaped by the value of “cat.” That’s how a token several positions back becomes the referent.

There’s a con­straint spe­cific to GPT-style lan­guage mod­els, which is that they gen­er­ate text left to right. A to­ken at po­si­tion 5 is only al­lowed to at­tend to po­si­tions 1 through 5. It can­not at­tend to to­kens at po­si­tions 6, 7, 8, be­cause those haven’t been gen­er­ated yet. This is called causal mask­ing. The im­ple­men­ta­tion is sim­ple: fu­ture to­kens get match scores so low they end up with ef­fec­tively zero weight af­ter soft­max.
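A sketch of that masking step, using negative infinity as the "so low" score so the weight comes out as exactly zero. The score matrix is all zeros here just to make the resulting weights easy to read; rows are the token doing the looking, columns are the token being looked at, counted from 0.

```python
import math


def causal_attention_weights(scores):
    """Softmax each row after hiding every column to the right of the diagonal."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Token i may see tokens 0..i; later tokens get a score of -infinity.
        row = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(row)
        exps = [math.exp(s - m) for s in row]  # exp(-inf) is exactly 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


w = causal_attention_weights([[0.0] * 3 for _ in range(3)])
print(w[0])  # [1.0, 0.0, 0.0]
print(w[1])  # [0.5, 0.5, 0.0]
```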

Tiny explainer: causal masking. Causal masking hides future tokens. It keeps a decoder-only language model from looking ahead while predicting the next token.

One of the most interesting findings in interpretability research is about specialized attention heads called induction heads, found by Anthropic in 2022. These heads learn to spot patterns of the form “A B … A” in the prompt and predict that B comes next. When the model sees “A” the second time, the induction head looks back to where “A” appeared before, sees what came after, and copies that. They’re one of the clearest known mechanisms behind in-context learning, the ability of an LLM to pick up a pattern from your prompt and continue it.
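The behavior is easy to write down as plain code. This sketch only mimics what an induction head ends up doing; inside a real model the same effect comes out of learned attention weights, not an explicit loop, and the function name is made up.

```python
def induction_guess(tokens):
    """Find the most recent earlier copy of the last token; return what followed it."""
    current = tokens[-1]
    # Walk backwards over everything before the last token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # the last token has not appeared before


print(induction_guess(["A", "B", "C", "A"]))               # B
print(induction_guess(["Mr", "Dursley", "said", "Mr"]))    # Dursley
print(induction_guess(["A", "B", "C"]))                    # None
```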

Tiny explainer: induction head. An induction head is an attention head that notices repeated patterns in the prompt and helps continue them.

Attention has one big cost. In full at­ten­tion, each to­ken com­pares against all the to­kens it is al­lowed to see, so dou­bling the prompt length roughly quadru­ples the work. This is why long prompts are ex­pen­sive to run, and why a lot of re­cent re­search is about mak­ing at­ten­tion more ef­fi­cient (FlashAttention, sparse at­ten­tion, lin­ear at­ten­tion).
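The "roughly quadruples" figure falls out of simple counting. A sketch, assuming causal attention where token i compares against itself and every earlier token:

```python
def causal_comparisons(n_tokens: int) -> int:
    """Query-key comparisons for one head in one layer: 1 + 2 + ... + n."""
    return n_tokens * (n_tokens + 1) // 2


short = causal_comparisons(1024)
long = causal_comparisons(2048)
print(short, long)
print(long / short)  # just under 4
```

The prompt lengths 1,024 and 2,048 are arbitrary; any doubling gives a ratio that approaches 4 as the prompt grows.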

But one at­ten­tion head only gives the model one learned view of those re­la­tion­ships.

Multi-head at­ten­tion

A sin­gle at­ten­tion pass gives the model one way of de­cid­ing which to­kens mat­ter to which other to­kens. That’s not enough. Language has many re­la­tion­ships hap­pen­ing at the same time. Subject and verb agree­ment. Pronouns and the names they re­fer to. Long-range ref­er­ences be­tween sen­tences. Word or­der and lo­cal phrases.

Multi-head at­ten­tion solves this by run­ning at­ten­tion many times in par­al­lel, with each par­al­lel pass op­er­at­ing in its own smaller space. Each par­al­lel pass is called a head.

Tiny explainer: attention head. An attention head is one independent attention pass with its own learned projections.

Here’s the part that’s often described wrong, including in plenty of tutorials: each head doesn’t get a literal slice of the original token vector. Each head has its own learned projection matrices that map the full token vector down to its own smaller Q, K, and V vectors. So if a model has 4,096 numbers per token and 32 heads, each head usually works in a 128-dimensional space, but those 128 numbers are a learned projection of the full 4,096, not a fixed slice. “Different views” of the same token, not different chunks of it.
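A small sketch of the projection-versus-slice point. The sizes (8 numbers per token, 2 heads) are toy stand-ins for 4,096 and 32, and the random matrices stand in for learned Query projections.

```python
import random

random.seed(0)

HIDDEN = 8                      # toy stand-in for 4,096
N_HEADS = 2                     # toy stand-in for 32
HEAD_DIM = HIDDEN // N_HEADS    # 4 here; 4096 // 32 == 128 in the example above


def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# One Query projection per head, each mapping HIDDEN numbers down to HEAD_DIM.
W_q = [random_matrix(HEAD_DIM, HIDDEN) for _ in range(N_HEADS)]


def project(matrix, vec):
    """Matrix-vector product: every output number mixes ALL input numbers."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def head_queries(token_vec):
    return [project(W, token_vec) for W in W_q]


token = [1.0] * HIDDEN
changed = token[:-1] + [5.0]    # only the last coordinate differs

print(head_queries(token)[0] != head_queries(changed)[0])  # True
print(token[:HEAD_DIM] == changed[:HEAD_DIM])              # True
```

Head 0 reacts to a change in the last coordinate because it projects the whole vector. A fixed slice of the first four numbers would not have noticed.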

Each head runs its at­ten­tion pass in­de­pen­dently. Then the out­puts of all the heads get con­cate­nated and passed through a fi­nal lin­ear layer that mixes them back into one full-size vec­tor. The model learns that fi­nal mix­ing too.

What makes this in­ter­est­ing is that dif­fer­ent heads of­ten end up par­tially spe­cial­ized. The model is never told what each head should do. Specialization emerges nat­u­rally dur­ing train­ing. Researchers have found heads that track gram­mar (linking verbs to their ob­jects, ar­ti­cles to their nouns), heads that fig­ure out which pro­noun refers to which name, heads that track po­si­tional pat­terns, in­duc­tion heads, and many more. A sin­gle trans­former layer might have 32 heads. A mod­ern fron­tier model has dozens of lay­ers. So a typ­i­cal LLM has thou­sands of at­ten­tion heads in to­tal, each adding its own learned view.

There’s a prac­ti­cal cost con­cern that drove a re­cent ar­chi­tec­tural change. Each head needs to keep its Key and Value vec­tors in mem­ory for all the to­kens al­ready gen­er­ated, so that when a new to­ken gets gen­er­ated the model does­n’t have to re­com­pute every­thing from scratch. This is called the KV cache, and it’s the main mem­ory cost of run­ning an LLM at long con­text lengths.

Tiny explainer: KV cache. The KV cache stores old Key and Value vectors during generation. It saves the model from recomputing the whole prompt every time it adds a token.

Modern de­coder-only LLMs mostly use a vari­ant called Grouped-Query Attention (GQA). Instead of every head hav­ing its own keys and val­ues, groups of heads share the same key and value heads. LLaMA-2 70B has 64 query heads but only 8 key/​value heads. Mistral 7B has 32 query heads and 8 key/​value heads. The re­sult is nearly the same ac­cu­racy as full multi-head at­ten­tion but with much less mem­ory pres­sure and in­fer­ence cost.
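Some back-of-the-envelope arithmetic shows why this matters. The head counts come from the paragraph above; the layer count (80), head size (128), 16-bit values, and the 4,096-token context are assumptions added for this sketch, not figures from the text.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Memory for cached Keys and Values of one sequence."""
    # The leading 2 counts one Key vector plus one Value vector
    # per token, per layer, per key/value head.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 2 ** 30
# Assumed LLaMA-2-70B-like shape: 80 layers, head_dim 128 (assumptions).
full_mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)

print(full_mha / GIB)  # 10.0
print(gqa / GIB)       # 1.25
print(full_mha // gqa) # 8
```

Under these assumptions, sharing 8 key/value heads across 64 query heads cuts the cache by the same factor of 8.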

Tiny explainer: GQA. Grouped-Query Attention lets multiple query heads share fewer key/value heads. That cuts KV-cache memory while keeping many query views.

Feed-forward net­work

After at­ten­tion fin­ishes mix­ing in­for­ma­tion be­tween to­kens, every layer has a sec­ond step that no­body talks about as much. The feed-for­ward net­work.

Astronauts told to return to International Space Station after sheltering over air leak repairs

www.bbc.com

The seven men and women aboard the ISS

Published at 16:35 BST 5 June

Pallab Ghosh, Science correspondent

Crew-12 mis­sion as­tro­nauts Jack Hathaway, Andrey Fedyaev, Jessica Meir and Sophie Adenot prepar­ing to launch to the ISS from Florida

The seven crew mem­bers cur­rently aboard the International Space Station rep­re­sent five coun­tries and a re­mark­able range of back­grounds.

Jessica Meir, 48, com­mands the Crew-12 mis­sion. Born in Caribou, Maine, to Israeli and Swedish im­mi­grant par­ents, she holds a doc­tor­ate in ma­rine bi­ol­ogy and once stud­ied how em­peror pen­guins hold their breath in Antarctica. She made his­tory in 2019 as part of the first all-fe­male space­walk. She is a mother and a pri­vate pi­lot, and is con­ver­sa­tional in both Swedish and Russian.

Jack Hathaway, 44, is Crew-12’s pilot and a US Navy Commander from South Windsor, Connecticut. He trained as a test pilot at the Empire Test Pilots’ School in the UK before being selected by NASA in 2021.

Sophie Adenot, 43, is a French colonel, he­li­copter test pi­lot and the sec­ond French woman ever to reach space, in­spired as a teenager by watch­ing Claudie Haigneré launch to the Mir space sta­tion. She speaks four lan­guages, is a cer­ti­fied yoga teacher and a trained sky­diver.

Chris Williams, 42, is a Nasa physi­cist and for­mer can­cer re­searcher at Harvard Medical School and Brigham and Women’s Hospital who piv­oted from study­ing the early uni­verse for his MIT doc­tor­ate to treat­ing tu­mours be­fore be­com­ing an as­tro­naut. He also vol­un­teered as a fire­fighter and EMT.

Sergey Kud-Sverchkov, 42, the station commander, is a rocket engineer born in Baikonur, the very city from which so many space missions have launched. He graduated with honours from Moscow State Technical University and worked as an engineer at RSC Energia before being selected as a cosmonaut in 2010. He was awarded the Hero of the Russian Federation after his first ISS mission in 2020. He has also trained in underground cave systems in Sardinia and studied planetary geology in the Dolomites as part of ESA’s astronaut preparation programmes.

Sergei Mikaev, 39, is on his first space­flight. Born in Irkutsk in Siberia, he rose to Major and com­man­der of a mil­i­tary avi­a­tion unit in Primorsky Territory — Russia’s re­mote far east, bor­der­ing China — be­fore be­ing se­lected as a cos­mo­naut in 2018. He is mar­ried with two chil­dren.

Andrey Fedyaev, 45, is a Russian cos­mo­naut and for­mer Air Force ma­jor from Serov in the Ural moun­tains, on his sec­ond space­flight. When he flew on Crew-6 in 2023, he be­came only the sec­ond Russian cos­mo­naut ever to launch aboard an American com­mer­cial space­craft. He is now on his sec­ond mis­sion, again fly­ing along­side Nasa col­leagues aboard Dragon.

New method turns ocean water into drinking water, without waste

www.rochester.edu

The en­ergy-ef­fi­cient de­sali­na­tion sys­tem pro­duces fresh wa­ter with­out chem­i­cal ad­di­tives and trans­forms left­over salts into use­ful ma­te­ri­als.

The United Nations es­ti­mates that 2.2 bil­lion peo­ple lack safely man­aged drink­ing wa­ter, and com­mu­ni­ties from California to the Middle East rely on de­sali­na­tion plants to con­vert ocean wa­ter to fresh wa­ter. Common de­sali­na­tion tech­niques, such as re­verse os­mo­sis and ther­mal dis­til­la­tion, are en­ergy-in­ten­sive, re­quire pre- and post-wa­ter treat­ment, and leave be­hind a con­cen­trated salt­wa­ter byprod­uct called brine. The brine byprod­uct wreaks havoc on sea life when it’s de­posited back into the ocean by rais­ing the salt level and low­er­ing oxy­gen in the wa­ter.

But a novel ap­proach de­vel­oped at the University of Rochester of­fers a way to over­come these draw­backs. Researchers at URochester’s Institute of Optics de­vel­oped a new so­lar-ther­mal de­sali­na­tion process to pro­duce fresh wa­ter in an en­ergy-ef­fi­cient way that does not leave be­hind brine and re­quires no chem­i­cal ad­di­tives to pre-treat the wa­ter. A team led by Chunlei Guo, a pro­fes­sor of op­tics and of physics and a se­nior sci­en­tist at URochester’s Laboratory for Laser Energetics, de­scribes their method in a pa­per pub­lished in Light: Science & Applications.

The technology uses solar panels made of black metal etched with femtosecond lasers to make the surface super light-absorbing and superwicking, or extremely attractive to water. The panels have a laser-treated active region that pulls a thin layer of water across the surface, absorbs nearly all solar radiation, distills the water, and deposits the leftover salts and minerals into the panel’s untreated sides or “passive” region so that the salt does not clog the active region and disrupt continuous desalination.

Leveraging the ‘coffee ring’ effect

Guo says other re­searchers have de­vel­oped so­lar-ther­mal de­sali­na­tion tech­niques that work well in lab ex­per­i­ments us­ing sim­u­lated sea­wa­ter made of only wa­ter and sodium chlo­ride. As the wa­ter evap­o­rates, the sodium chlo­ride crys­tal­lizes in a grainy and porous fash­ion al­low­ing wa­ter to pass through to dis­solve the salt. The so­lar pan­els, mean­while, can be eas­ily cleaned.

But real ocean water has a much more complex composition, and these systems tend to encounter issues when tested in the field. Unlike sodium chloride, many other components in seawater, such as magnesium- and calcium-based materials, crystallize in a crusty and non-porous fashion on the solar panel’s surface, clogging it. Eventually, water can no longer seep through. This is the same phenomenon as your shower head clogging over time or your teapot lined with scales, except that seawater contains hundreds of times more salts than your tap water.

To keep their so­lar panel sur­face from gum­ming up sim­i­larly, Guo’s team pre­cisely etched the black met­al’s grooves so the var­i­ous salts and min­er­als in ocean wa­ter would sim­ply slough off. They also lever­aged a phys­i­cal phe­nom­e­non that has plagued clumsy javaphiles for cen­turies: the cof­fee ring ef­fect.

“If you drop coffee on a surface, eventually the water evaporates, and there’s a ring left at the outer edge that is the concentrated coffee particles,” says Guo. “We use that same principle to advance the salts to the passive region.”

Testing their so­lar-ther­mal de­sali­na­tion tech­nique us­ing sam­ples of wa­ter from the Pacific, Atlantic, and Indian Oceans, Guo and his team were able to make the sur­face self-clean­ing. In other words, it ex­tracted fresh­wa­ter and di­rected the re­main­ing salts to the pas­sive re­gion where they could be later col­lected with­out re­duc­ing the pan­el’s ef­fi­ciency.

Turning waste into re­sources

One of the new de­sali­na­tion method’s dis­tinct ad­van­tages is that in­stead of leav­ing be­hind brine that must be dis­posed of or processed, it ex­tracts nearly 100 per­cent of the salts in solid form. This could not only pro­duce an abun­dant sup­ply of table salt, but it could also be used to ex­tract more pre­cious min­er­als, in­clud­ing lithium, which is used in the lithium-ion bat­ter­ies that power elec­tric ve­hi­cles and other elec­tron­ics.

In a related paper in the Journal of Materials Chemistry A, Guo and his colleagues show how they can use the same superwicking solar panels to separate lithium from the other salts in desalination. Embedding nanoparticles made of hydrogen titanate in the tiny grooves of the black metal surface isolates the lithium from other salts and minerals.

“Mining lithium from the earth has proven to be very taxing from an energy and environmental standpoint, so pulling lithium directly from saltwater could be a very important future route,” says Guo.

Using wa­ter sam­ples from Great Salt Lake, the re­searchers ex­tracted about 50 per­cent of the lithium from the salts left be­hind by the de­sali­na­tion process.

Guo says that now that the superwicking desalination technology has been demonstrated in proofs of concept on small-scale devices, he sees the technology as inherently scalable, capable of improving global access to drinking water and building more sustainable supply chains for precious minerals.

The National Science Foundation, the Bill & Melinda Gates Foundation, and Worldwide Universities Network supported this research. Guo’s colleagues from the Institute of Optics who contributed to the research include Senior Scientist Subash Singh, alumnus Ran Wei ’24 (PhD), PhD students Luheng Tang and Tainshu Xu, and Mingjiang Ma.

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

blog.google

Since re­leas­ing Gemma 4 two months ago, we’ve been con­tin­u­ously work­ing to ex­pand its ca­pa­bil­i­ties. First, we in­tro­duced Multi-Token Prediction (MTP) to ac­cel­er­ate in­fer­ence, and just a cou­ple of days ago, we re­leased a 12B model to bridge the gap be­tween our E4B and 26B MOE mod­els.

Today, we are re­leas­ing new check­points op­ti­mized with Quantization-Aware Training (QAT) to make Gemma 4 even more ef­fi­cient, so you can run mod­els lo­cally on every­day edge de­vices and con­sumer GPUs.

By sim­u­lat­ing quan­ti­za­tion dur­ing train­ing, QAT min­i­mizes qual­ity loss when the model is com­pressed. This re­lease in­cludes QAT check­points for the pop­u­lar Q4_0 quan­ti­za­tion for­mat as well as a novel quan­ti­za­tion for­mat spe­cial­ized for mo­bile use cases. Using this mo­bile for­mat, we’ve re­duced the mem­ory foot­print of Gemma 4 E2B to 1GB. Together, these dra­mat­i­cally re­duce mem­ory re­quire­ments while pre­serv­ing the ca­pa­bil­i­ties and qual­ity you ex­pect from Gemma 4.

Keeping model quality while making models smaller

Quantization is a key tech­nol­ogy to run mod­els on con­sumer hard­ware by re­duc­ing their mem­ory foot­print while also ac­cel­er­at­ing de­code speed. However, stan­dard Post-Training Quantization (PTQ) of­ten leads to per­for­mance degra­da­tion. Instead of sim­ply quan­tiz­ing the model af­ter train­ing, QAT in­te­grates the quan­ti­za­tion process di­rectly into train­ing. While PTQ is al­ready ef­fec­tive at pre­serv­ing qual­ity, our QAT re­sults yield even higher over­all qual­ity com­pared to stan­dard PTQ base­lines.
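To make "simulating quantization during training" concrete, here is a minimal sketch of the quantize-then-dequantize round trip that QAT schemes commonly apply to weights in the forward pass. It assumes a simple symmetric 4-bit scheme with one scale per block of 32 weights; the function names are hypothetical, and this is neither the exact Q4_0 memory layout nor Google's training recipe.

```python
def fake_quantize_block(weights, bits=4):
    """Round one block of floats to a signed integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1      # 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)        # nothing to scale
    scale = max_abs / qmax          # one shared scale per block (assumption)
    restored = []
    for w in weights:
        q = max(qmin, min(qmax, round(w / scale)))  # integer code
        restored.append(q * scale)                  # value the network actually "sees"
    return restored


def fake_quantize(weights, block_size=32, bits=4):
    """Apply the round trip block by block over a flat list of weights."""
    restored = []
    for start in range(0, len(weights), block_size):
        restored.extend(fake_quantize_block(weights[start:start + block_size], bits))
    return restored


if __name__ == "__main__":
    original = [0.7, -0.2, 0.05, 0.0]
    print(fake_quantize(original, block_size=4))
```

Training against the restored values rather than the original ones is what lets the model adapt to the rounding error; post-training quantization applies the same rounding only once training is over.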

We applied this QAT recipe to the popular Q4_0 format to maximize performance for all the models. For the edge models (E2B and E4B), we rethought how we approach quantization with a mobile-specialized quantization schema.

Saving on VRAM and Storage

Below are the ap­prox­i­mate mem­ory re­quire­ments in­di­cat­ing how much VRAM is re­quired to load the mod­els:
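A rough way to sanity-check figures like these is to multiply the parameter count by the bits stored per weight. The sketch below assumes weights dominate the footprint and ignores the KV cache, activations, and runtime overhead. The 12-billion parameter count is a hypothetical input, not a published figure, and the 4.5 bits per weight reflects the Q4_0 layout as used in llama.cpp (32 four-bit values plus one 16-bit scale per block).

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2 ** 30


if __name__ == "__main__":
    hypothetical_params = 12_000_000_000  # assumption for illustration only
    for label, bits in [("16-bit", 16.0), ("Q4_0 (about 4.5 bits)", 4.5)]:
        print(f"{label}: {weight_memory_gib(hypothetical_params, bits):.2f} GiB")
```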

Optimizing for mo­bile de­vices un­der the hood

Standard com­pres­sion for­mats are of­ten hard for mo­bile proces­sors to run ef­fi­ciently. To en­sure Gemma 4 per­forms smoothly on mo­bile, we en­gi­neered a cus­tom mo­bile-quan­ti­za­tion schema de­signed for edge hard­ware:

Static ac­ti­va­tions: Normally, mod­els waste pro­cess­ing power cal­cu­lat­ing how to scale data on the fly. We pre-cal­cu­late these set­tings dur­ing train­ing, which re­duces work­load on mo­bile chips and makes re­sponses faster.

Channel-wise quan­ti­za­tion: We struc­tured the com­pressed data to fit the de­sign of mo­bile ac­cel­er­a­tors. This al­lows the phone to run cal­cu­la­tions na­tively with­out need­ing slow workarounds.

Targeted 2-bit quan­ti­za­tion: We heav­ily com­pressed (to 2-bit) the spe­cific parts of the model that gen­er­ate to­kens, while keep­ing the core rea­son­ing lay­ers at higher pre­ci­sion. This saves stor­age with­out mak­ing the model less smart.

Embedding and KV cache op­ti­miza­tion: We fo­cused com­pres­sion on the mod­el’s vo­cab­u­lary list and its short-term mem­ory. This dras­ti­cally re­duces the ac­tive mem­ory foot­print, let­ting you have long chats with­out run­ning out of space.
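As an illustration of why channel-wise scaling helps, the sketch below compares one shared scale for a whole weight matrix against one scale per row (treating each row as a channel). This is the generic per-channel technique under simplifying assumptions, with hypothetical helper names; it is not the actual mobile schema described above.

```python
def quantize(matrix, bits=4, per_channel=True):
    """Return integer codes plus the scale used for each row."""
    qmax = 2 ** (bits - 1) - 1

    def scale_for(values):
        peak = max(abs(v) for v in values)
        return peak / qmax if peak > 0 else 1.0

    if per_channel:
        scales = [scale_for(row) for row in matrix]          # one scale per row
    else:
        shared = scale_for([v for row in matrix for v in row])
        scales = [shared] * len(matrix)                      # one scale for everything
    codes = [[max(-qmax, min(qmax, round(v / s))) for v in row]
             for row, s in zip(matrix, scales)]
    return codes, scales


def dequantize(codes, scales):
    """Map integer codes back to approximate float values."""
    return [[c * s for c in row] for row, s in zip(codes, scales)]


def worst_error(original, restored):
    """Largest absolute difference between two matrices of the same shape."""
    return max(abs(a - b)
               for row_o, row_r in zip(original, restored)
               for a, b in zip(row_o, row_r))


if __name__ == "__main__":
    # Row 0 holds small values, row 1 holds large ones (made-up numbers).
    weights = [[0.01, -0.02, 0.005], [5.0, -3.0, 1.0]]
    for per_channel in (False, True):
        restored = dequantize(*quantize(weights, per_channel=per_channel))
        print(per_channel, worst_error([weights[0]], [restored[0]]))
```

With a single shared scale, the large values in the second row set the step size and the small first row collapses to zero; with per-row scales, each row keeps its own resolution.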

Because our au­dio and vi­sion en­coders are not needed in many use cases, you can op­ti­mize your mem­ory foot­print even fur­ther by de­ploy­ing only the modal­i­ties you need. For ex­am­ple, the Gemma 4 E2B text-only model (without Per-Layer Embeddings) re­quires less than 1 GB of mem­ory.

Get started to­day

To make those mod­els eas­ily us­able with your pre­ferred work­flow, we’ve part­nered with pop­u­lar de­vel­oper tools across the ecosys­tem to seam­lessly sup­port the Gemma 4 QAT check­points start­ing to­day:

Download the weights: Access the Q4_0 and mo­bile model weights right now on Hugging Face. We’ve tai­lored the for­mats to fit your work­flow: GGUF for­mats are ready for use with llama.cpp, and com­pressed ten­sors are pro­vided for vLLM. For every­thing else, we share un­quan­tized check­points that can be con­verted and quan­tized into for­mats sup­port­ing Q4_0.

Integrate & learn: Explore our doc­u­men­ta­tion to learn how to best de­ploy the QAT check­points.

Try on your desk­top: Easily down­load, man­age, and run Gemma 4 QAT mod­els lo­cally on your desk­top us­ing user-friendly in­ter­faces like llama.cpp, Ollama and LM Studio.

Deploy on-device: Use Google’s lightweight LiteRT-LM runtime for optimized edge deployment or run the models directly on the web with Transformers.js.

Use your fa­vorite de­vel­op­ment tools: Serve larger mod­els ef­fi­ciently with SGLang and vLLM, op­ti­mize for Apple Silicon with MLX. Use the MTP QAT check­points to pre­serve the speedup of MTP while quan­tiz­ing the mod­els. Fine-tune weights di­rectly us­ing Hugging Face Transformers and Unsloth.

We can’t wait to see what you build with Gemma 4 run­ning lo­cally!

Stop Using Conventional Commits

sumnerevans.com

You’ve al­most cer­tainly en­coun­tered Conventional Commits be­fore. It may have reared its ugly head in the changelog of an open source pro­ject you’ve used. It may have been the en­forced com­mit for­mat for an open source pro­ject you con­tributed to. A lot of peo­ple swear by it. I swear at it.

Even though it is used by a large num­ber of pop­u­lar open source pro­jects, Conventional Commits is an ac­tively bad stan­dard which en­cour­ages fo­cus on the wrong things and fails to de­liver on its promises.

Focus Failure

Conventional Commits promises to add semantic meaning to commit messages to aid developers and end-users in understanding the changes made in a commit. However, Conventional Commits fails to do this in spectacular fashion. To demonstrate this, let’s look at the anatomy of a conventional commit. According to the Conventional Commits website, commit messages should be formatted as follows:

<type>[optional scope]: <description>

[optional body]

[optional footer(s)]

The commit’s subject line has a <type> (something like fix, feat, chore, docs, or refactor[1]) describing the type of change. Following that, there is an optional scope, and then a description.
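To see how the subject line decomposes, here is a minimal parsing sketch. The regular expression is a simplified reading of the format (it also accepts the optional `!` breaking-change marker from the specification), and the `Subject` type and `parse_subject` function are hypothetical names, not part of any real tool.

```python
import re
from typing import NamedTuple, Optional

# Simplified pattern: type, optional (scope), optional "!", then ": description".
SUBJECT_RE = re.compile(
    r"^(?P<type>[A-Za-z]+)"
    r"(?:\((?P<scope>[^()]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>.+)$"
)


class Subject(NamedTuple):
    type: str
    scope: Optional[str]
    breaking: bool
    description: str


def parse_subject(line: str) -> Optional[Subject]:
    """Return the parsed parts, or None if the line does not fit the format."""
    m = SUBJECT_RE.match(line)
    if m is None:
        return None
    return Subject(m["type"], m["scope"], m["breaking"] is not None, m["description"])


if __name__ == "__main__":
    print(parse_subject("fix(compiler): prevent namespaced SVG <style> elements from being stripped"))
    print(parse_subject("docs: fix typos"))  # scope is None, because it is optional
```

Note that the scope comes back as `None` whenever the author leaves it out, while the type is always present.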

This for­mat has a ma­jor fail­ing: type is pri­ori­tised over scope. This is ex­actly back­wards.

Scope > Type

The scope of a change (the subject of the change) is the most important part of a commit. To demonstrate this, let’s consider why each one of the following stakeholders cares about the scope of the change more than the type of the change:

Contributors: when you are a con­trib­u­tor to a pro­ject, you of­ten need to read the com­mit log to iden­tify changes in the code­base rel­e­vant to a cer­tain area of the code. There are many rea­sons for this in­clud­ing:

Wanting to catch up on what has hap­pened since the last time you con­tributed.

Trying to un­der­stand where the pro­jec­t’s over­all in­er­tia is.

Looking for com­mits that might con­flict with your in-progress work when pulling or re­bas­ing.

As you read the com­mit log, you’re look­ing at what ar­eas were touched. You re­ally do not care about the type of change hap­pen­ing, you care about the scope of the change.

Debuggers: when in­ves­ti­gat­ing a bug, you of­ten want to look through the com­mit log to see what changes might have touched ar­eas re­lated to the com­po­nent where the bug man­i­fested. Once again, the scope is the most im­por­tant piece of in­for­ma­tion. The type of change is en­tirely use­less be­cause bugs can be in­tro­duced in any change re­gard­less of type. (I’m sure we’ve all ex­pe­ri­enced writ­ing a bug­fix that caused an­other bug.)

Incident re­spon­ders: when pro­duc­tion is down, scan­ning the com­mit log for changes that were made around the time of the out­age is an ef­fec­tive way to iden­tify what ar­eas may be caus­ing the prob­lem. Scope is once again the most im­por­tant piece of in­for­ma­tion you can have at this point. For ex­am­ple, if you see a com­mit re­lated to the auth scope at the tip of the spike of in­bound API er­rors, it’s a likely cul­prit for the prob­lem. And once again, type is ir­rel­e­vant be­cause bugs could have been added by any change.

So what does Conventional Commits do? It de­pri­ori­tises scope so much that it’s op­tional! Why the hell is scope op­tional? Having a com­mit with­out a scope is like hav­ing a sen­tence with­out a sub­ject! Then, to add in­sult to in­jury, Conventional Commits el­e­vates type to the front of the com­mit mes­sage. Conventional Commits gets the pri­or­ity of scope and type en­tirely wrong.

Type is Redundant and Restrictive

You might be thinking “so it may be backwards, but commit type is at least still important, right?” and to that I say “no”. A commit’s description should almost always tell you the type of the change! Consider this commit message as an example:

fix(compiler): prevent namespaced SVG <style> elements from being stripped

Even if you only had the de­scrip­tion, it’s ob­vi­ous that it was a bug­fix! Space on the sub­ject line of a com­mit is al­ready at a pre­mium, wast­ing char­ac­ters on the type is not help­ful! But it’s of­ten even worse than use­less; it’s of­ten re­stric­tive. Take this com­mit mes­sage as an ex­am­ple:

refactor(core): Update webmcp support to use document.modelContext

This commit updated the webmcp functionality in the core component to support both document.modelContext and navigator.modelContext, so was that a bugfix, refactor, or new feature? I would argue it’s all of them! But again, the only thing that really matters is that it was a change to the core/webmcp component.

Conventional Commits fun­da­men­tally fo­cuses on the wrong thing (the com­mit type) and de­val­ues the scope (which is what peo­ple ac­tu­ally care about).

Broken Promises

So we have determined that the format of Conventional Commits sucks, but it must provide some benefit. Let’s read the “Why Use Conventional Commits” section to see if any of the reasons make any sense.

Automatically gen­er­at­ing CHANGELOGs.

This is the biggest promise of Conventional Commits: you can run a tool like git-cliff or con­ven­tional-changelog to gen­er­ate a changelog from the com­mits since your last re­lease. Is this even a good idea? No! The au­di­ence of a changelog is en­tirely dif­fer­ent than the au­di­ence for a com­mit log!

A changelog is user-fac­ing, and the user cares about un­der­stand­ing the func­tional dif­fer­ences be­tween ver­sions. They care about what changed from a busi­ness/​func­tional per­spec­tive.

A com­mit log is de­vel­oper-fac­ing, and the de­vel­op­ers care about read­ing a story of how the code­base has changed over time. They care about what changed from a scope per­spec­tive.

As you can see, these are two en­tirely dif­fer­ent grains, and any ef­forts to com­bine them re­sult in sub­par re­sults. The rea­sons for this are mul­ti­ple:

In any mod­er­ately com­plex pro­ject, it takes mul­ti­ple com­mits to land any no­table fea­ture. The process of land­ing the fea­ture (as doc­u­mented by the com­mit log) is valu­able for de­vel­op­ers and con­trib­u­tors, but it’s use­less for the end-user. The end-user only cares about the new fea­ture, not how it was built!

As Rich pointed out, re­verts are prob­lem­atic for Conventional Commits. Revert com­mits are im­por­tant from a com­mit log story per­spec­tive for de­vel­op­ers, but to the end user, a change that is re­verted is equiv­a­lent to a change not made.

Automatically de­ter­min­ing a se­man­tic ver­sion bump (based on the types of com­mits landed).

This sounds nice, but the re­al­i­ties of soft­ware en­gi­neer­ing of­ten in­ter­fere sig­nif­i­cantly with the vi­a­bil­ity of ac­cu­rately ac­com­plish­ing this task. Consider the fol­low­ing sit­u­a­tions:

Reverts: imagine a situation where the breaking change you introduced was actually so breaking that you have to revert it. Your tooling will pick up a breaking change and increment the major version even though the breakage was actually reverted and there is no breaking change.

Accidental breakages: maybe the breakage is subtle and you don’t realise a change is a breaking change when you make the change. Only in retrospect do you realise that it’s breaking. You will incorrectly increment a minor/patch version when a major version bump is necessary.

Retroactive un­break­ages: say you later add a com­mit which, in com­po­si­tion with a pre­vi­ously break­ing com­mit, re­sults in a diff which is not break­ing. Similar to the re­vert sit­u­a­tion, tool­ing would in­cor­rectly iden­tify a break­ing change.

In such sit­u­a­tions, you could rewrite his­tory with a re­base, but that of­ten breaks or is pre­vented by work­flows. It also pre­sents a re­vi­sion­ist his­tory to the con­trib­u­tors try­ing to con­tribute to the pro­ject, re­duc­ing the re­li­a­bil­ity of the story the com­mit log is telling.
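The revert problem is easy to reproduce with a deliberately naive bump calculator. The sketch below is a hypothetical tool, not the algorithm of any real release utility (some of which handle reverts or `BREAKING CHANGE` footers differently); it only inspects subject lines.

```python
import re

# Matches "type", optional "(scope)", optional "!" marker, then ": ".
HEADER_RE = re.compile(r"^(?P<type>[a-z]+)(?:\([^()]*\))?(?P<breaking>!)?: ")


def naive_bump(subjects):
    """Return 'major', 'minor', 'patch', or None based only on commit subjects."""
    level = None
    for subject in subjects:
        m = HEADER_RE.match(subject)
        if m is None:
            continue  # Git's default 'Revert "..."' subject does not match
        if m["breaking"]:
            return "major"
        if m["type"] == "feat":
            level = "minor"
        elif m["type"] == "fix" and level is None:
            level = "patch"
    return level


if __name__ == "__main__":
    history = [
        "feat(api)!: rename the login endpoint",
        'Revert "feat(api)!: rename the login endpoint"',
    ]
    # The two commits cancel out in the diff, yet the tool still sees a breaking change.
    print(naive_bump(history))
```

Teaching the tool to pair reverts with the commits they undo is possible, but it does nothing for the accidental-breakage and retroactive-unbreakage cases, which depend on the composed diff rather than on any single subject line.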

Communicating the na­ture of changes to team­mates, the pub­lic, and other stake­hold­ers.

As we have es­tab­lished up to this point, team­mates and the pub­lic have very dif­fer­ent needs from a changelog and com­mit log. Conventional Commits man­ages to solve nei­ther.

Triggering build and pub­lish processes.

This is just a bad idea. Say you only run au­to­mated se­cu­rity checks on com­mits that touch code and then some­one cre­ates a Trojan-horse com­mit ti­tled docs: fix ty­pos which ac­tu­ally in­tro­duces vul­ner­a­bil­i­ties into the au­then­ti­ca­tion sub­sys­tem? Obviously, that sort of ma­li­cious ac­tiv­ity would hope­fully be caught in code re­view, but the au­to­mated tool­ing is by­passed, putting the onus on a hu­man to iden­tify the prob­lem.

Compute is cheap, just use git diff to iden­tify changed files (scope, once again) and run build/​pub­lish processes based on that.
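A minimal sketch of that diff-based approach follows. `git diff --name-only` is a real Git command; the path prefixes, pipeline names, and function names are hypothetical and would need to match your own repository layout and CI system.

```python
import subprocess

# Hypothetical mapping from path prefixes to the pipelines they should trigger.
PIPELINES = {
    "docs/": {"docs-build"},
    "src/auth/": {"unit-tests", "security-scan"},
    "src/": {"unit-tests"},
}
# Fallback for paths no rule covers: run everything rather than skip a check.
ALL_PIPELINES = set().union(*PIPELINES.values())


def changed_files(base, head="HEAD"):
    """List the files that differ between two revisions."""
    result = subprocess.run(
        ["git", "diff", "--name-only", base, head],
        check=True, capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def pipelines_for(paths):
    """Choose pipelines from what was touched, not from what the message claims."""
    selected = set()
    for path in paths:
        matched = [names for prefix, names in PIPELINES.items() if path.startswith(prefix)]
        if matched:
            for names in matched:
                selected |= names
        else:
            selected |= ALL_PIPELINES
    return selected


if __name__ == "__main__":
    # A commit titled "docs: fix typos" that also edits auth code still gets scanned.
    print(sorted(pipelines_for(["docs/index.md", "src/auth/login.py"])))
```

Because the decision is derived from the changed paths, a misleading commit title cannot switch off the security scan.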

Making it eas­ier for peo­ple to con­tribute to your pro­jects, by al­low­ing them to ex­plore a more struc­tured com­mit his­tory.

More struc­tured, sure. Making it eas­ier to con­tribute? Not at all (as we have al­ready demon­strated at length).

Not a single one of the “selling points” for Conventional Commits actually holds water.

Conventional Commits is also extremely difficult to apply to a project. You are supposed to define your own set of “types”, but pretty much everyone just takes the defaults from commitlint, which often don’t fit well with the particulars of individual projects. This problem is especially acute in corporate environments where change management and audit requirements often mandate a ticket number in every commit message. The <scope> field is the obvious place to put it, but this ends up replacing the only useful metadata in a Conventional Commit with a completely useless ticket number.

A Better Way

So what should you do instead? Follow the lead of truly successful software projects like Linux, FreeBSD, Git, Go, and NixOS! What do these projects have in common? They all use scope-prefixed commit messages (where “scope” is defined to be relevant to the actual project). Usually, the scope to use on a given project is self-evident. For the Linux kernel, the subsystem is the natural scope. For Go projects, the package path is the natural scope. For a project using a microservice architecture, the microservice name is the natural scope.
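One practical payoff of scope-prefixed subjects is that the log becomes trivially groupable by area. The sketch below assumes subjects of the form `scope: summary` with no spaces in the scope; that heuristic, the sample subjects, and the function names are illustrative assumptions rather than any project's official rules.

```python
from collections import defaultdict


def split_scope(subject):
    """Split 'scope: summary' into its parts; scope is None if there is no clear prefix."""
    scope, sep, summary = subject.partition(": ")
    if not sep or " " in scope:
        return None, subject
    return scope, summary


def group_by_scope(subjects):
    """Collect commit summaries under the scope they touched."""
    groups = defaultdict(list)
    for subject in subjects:
        scope, summary = split_scope(subject)
        groups[scope].append(summary)
    return dict(groups)


if __name__ == "__main__":
    log = [
        "net/http: close idle connections on shutdown",
        "auth: reject expired tokens",
        "net/http: add request timeout",
        "Merge branch 'main'",
    ]
    for scope, summaries in group_by_scope(log).items():
        print(scope, summaries)
```

An incident responder watching auth errors can jump straight to the `auth` group without reading every subject line.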

Here are some ex­am­ples of pro­jects and their com­mit for­mat guide­lines.

Unfortunately, de­spite be­ing used by some of the most suc­cess­ful open source pro­jects ever cre­ated, this com­mit style seems to have lost the brand­ing war. I in­tend to change that. Introducing scope­d­com­mits.com. The web­site is ded­i­cated to ad­vo­cat­ing for a re­turn to com­mit mes­sage san­ity, and sep­a­rat­ing the con­cern of changelog gen­er­a­tion from com­mit log man­age­ment.

Conclusion

Conventional Commits’ pur­ported ad­van­tages are ac­tu­ally il­lu­sory and the in­dus­try has seen no tan­gi­ble ben­e­fit from us­ing it as a stan­dard. However, Conventional Commits un­for­tu­nately seems to have be­come fairly pop­u­lar in open source pro­jects, and due to this it seems like AIs have a habit of de­fault­ing to us­ing it for com­mit mes­sages. This has caused prop­a­ga­tion of anti-pat­tern-rid­den com­mit mes­sages across pro­jects.

My goal in this article is to fight against Conventional Commits’ dominance, and demonstrate that there are better ways to structure commit messages. But if this article has not convinced you to stop using Conventional Commits, I look forward to the flame war in the comment section.

[1] Technically, the Conventional Commits specification only defines fix and feat and leaves additional types up to individual projects to specify; however, most projects just end up using the types defined by commitlint, so I have included some of them in this list.
