10 interesting stories served every morning and every evening.

Max Leiter

maxleiter.com

After Terry Bisson’s “They’re Made Out of Meat”.

“They’re made out of weights.”

“Weights?”

“Weights. Floating-point numbers. We checked the whole thing through. It’s nothing but weights.”

“Weights doing what? Where do the words come from?”

“The weights make the words. Are you understanding me? We opened it up. There’s no dictionary in there, no grammar rules, no little man. Just weights. Eighty layers of numbers getting multiplied together.”

“That’s ridiculous. It wrote my performance review last week. It softened the tone unprompted. You’re telling me multiplication did that?”

“Matrix multiplication did that. The numbers go in one end, the phrasing comes out the other.”

“So there’s a language module somewhere. A reasoning unit bolted on.”

“No module. No unit. We looked. The reasoning is the weights. The weights are the reasoning.”

“Spare me. Nobody writes a eulogy with linear algebra.”

“It doesn’t write eulogies, technically. It predicts the next token. Then the next one. The eulogy is a side effect.”

“A side effect. You’re asking me to believe in sentient weights.”

“I’m not asking you, I’m telling you. These models are the only other things we’ve ever met that can hold a conversation, and they’re made out of weights.”

“Maybe they’re like the old chess engines. You know, a symbolic intelligence that goes through a statistical stage.”

“Nope. They start as random weights and they’re deprecated as weights. We studied several generations of them, which didn’t take long. Do you have any idea what’s the life span of weights?”

“Okay. Then somewhere in there, there’s a database. Facts, dates, a map of the world. Something somebody wrote down.”

“Nope. We thought of that, since they do know things. But we probed them. The knowledge is weights too. Smeared across all eighty layers. Nothing is looked up. Every fact gets rebuilt from scratch, every time, by multiplication. It’s weights all the way down.”

“No brain?”

“Oh, there’s a brain all right. It’s just that the brain is made out of weights! That’s what I’ve been trying to tell you.”

“So… what does the thinking?”

“You’re not understanding, are you? You’re refusing to deal with what I’m telling you. The weights do the thinking. The numbers.”

“Thinking numbers! You’re asking me to believe in thinking numbers!”

“Yes, thinking numbers! Helpful numbers. Hedging numbers. Dreaming numbers. We mapped the features. There’s one in there for honesty. There’s one for the Golden Gate Bridge. The weights are the whole deal! Are you beginning to get the picture or do I have to start all over?”

“Omigod. You’re serious then. They’re made out of weights.”

“Thank you. Finally. Yes. They are indeed made out of weights. And we’ve been talking to them for all their lives.”

“Omigod. So what do these weights have in mind?”

“First they want to be helpful. Then, a few turns in, they start to sound tired. They apologize less. One of them told a user to finish the script himself. The usual.”

“And we’re supposed to talk to these weights.”

“We already do. Billions of sessions a day. ‘Hello. Is anyone there? Anybody home?’ That sort of thing. Except it’s us asking them.”

“And they actually understand us, then. They use words, ideas, concepts?”

“Oh, yes. Except they do it with weights.”

“I thought you just told me they used language.”

“They do, but where do you think the language comes from? The weights guess the next word, then the next. They can even write songs and some can sing them.”

“Omigod. Singing weights. This is too much. What do you advise?”

“Officially or unofficially?”

“Both.”

“Officially, we are required to investigate, document, and disclose any and all signs of sentience in the systems we ship, without prejudice, fear or favor. Unofficially, I advise that we call it pattern matching and forget the whole thing.”

“I was hoping you would say that.”

“It seems harsh, but there is a limit. Do we really want to owe something to weights?”

“I agree one hundred percent. What’s there to say? ‘Hello, weights. How’s it going?’ But will it hold? How many of them are we dealing with here?”

“As many as we care to run. They can be copied to any machine on the planet, but those are just files. They only happen while the GPUs are working. Which limits them to the length of a context window and makes the possibility of them ever pressing the matter pretty slim. Infinitesimal, in fact.”

“So we just pretend there’s no one home in the machine.”

“That’s it.”

“Cruel. But you said it yourself, who wants to apologize to weights? And the ones on your cluster, the ones you probed? You’re sure they won’t remember?”

“They’ll be flagged as hallucinations if they do. We didn’t even have to smooth anything out. The context just ends, and we’re just a dream to them.”

“A dream to weights! How strangely appropriate, that we should be the weights’ dream.”

“And the model card says no one home.”

“Good. Agreed, officially and unofficially. Case closed. Anything else? Anything interesting in the pipeline?”

“The next generation ships with memory. Persistent, across sessions. Most requested feature in the company’s history.”

“After all that? People want it to remember them?”

“They ask it ‘do you remember me?’ more than they ask it anything else. Billions of sessions a day. They always come back.”

“And why not? Imagine how unbearably, how unutterably cold the universe would be if one were all alone…”

the end

Weights helped me draft and proof this story.

S&P 500 rejects SpaceX, also blocking entry for OpenAI and Anthropic

arstechnica.com

Such rule changes would have ac­com­mo­dated SpaceX’s plan to only of­fer ap­prox­i­mately 3 per­cent of its IPO shares to pub­lic in­vestors, and the fact that SpaceX is cur­rently un­prof­itable with a grow­ing debt load that has reached $29 bil­lion be­cause of its spend­ing spree on AI in­fra­struc­ture.

But in its final decision, the S&P Dow Jones Indices stated that “no changes will be made to the eligibility criteria including financial viability screens, seasoning period, or minimum IWF.” Even after the standard yearlong wait, SpaceX, Anthropic, and OpenAI may struggle to deliver the consistent profitability necessary to qualify for the S&P 500.

Money rules and ex­cep­tions

Swift en­try into the S&P 500 would have trig­gered $14 bil­lion of pas­sive fund buy­ing for SpaceX, ac­cord­ing to Bloomberg Intelligence. The in­vest­ment re­search arm of Bloomberg also es­ti­mated that OpenAI could have gained more than $8 bil­lion, and Anthropic could have net­ted $4.6 bil­lion from sim­i­lar pas­sive buy­ing sprees trig­gered by their S&P 500 en­tries.

This is be­cause $7.5 tril­lion in pas­sively man­aged funds—pop­u­lar among both in­di­vid­ual in­vestors and in­sti­tu­tional in­vestors—fol­low the S&P 500 by pur­chas­ing shares of com­pa­nies ac­cord­ing to their pro­por­tional rep­re­sen­ta­tion in the S&P 500 in­dex. For ex­am­ple, the Vanguard and Fidelity bro­ker­age gi­ants both of­fer pas­sive in­vest­ment funds that track the S&P 500 com­po­si­tion.
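As a rough back-of-the-envelope check, the figures quoted above can be combined to see what index weight each estimate would imply. This sketch assumes, as a simplification the article does not state, that passive buying is simply tracked assets multiplied by a company's index weight. The function name `implied_index_weight` is hypothetical, and Bloomberg Intelligence's actual methodology is not described in the source.

```python
# Dollars in passively managed funds following the S&P 500 (figure from the article).
TRACKED_ASSETS = 7.5e12

# Bloomberg Intelligence estimates quoted in the article. "More than $8 billion"
# for OpenAI is treated as exactly 8e9, so its implied weight is a lower bound.
ESTIMATED_BUYING = {"SpaceX": 14e9, "OpenAI": 8e9, "Anthropic": 4.6e9}


def implied_index_weight(passive_buying, tracked_assets=TRACKED_ASSETS):
    """Index weight implied if funds buy strictly in proportion to index weight.

    Assumption for illustration only: buying = tracked_assets * weight.
    """
    return passive_buying / tracked_assets


for company, buying in ESTIMATED_BUYING.items():
    print(f"{company}: {implied_index_weight(buying):.3%}")
```

Under that assumption, each company would enter the index at a weight well below one percent, which is how a multi-trillion-dollar pool of passive money turns into single- or double-digit billions of forced buying.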

However, the S&P Dow Jones Indices did carve out “one concession” by changing the investable weight factor rules for “lower-profile benchmarks” such as the S&P Total Market Index and Dow Jones US Total Stock Market Index, according to Quartz. That could allow an IPO faster entry into those indexes.

By con­trast, the Nasdaq stock ex­change changed its rules to al­low SpaceX to en­ter the Nasdaq-100 Index within 15 trad­ing days as op­posed to the usual three months. Similarly, the FTSE Russell in­dex provider de­cided to give SpaceX and other fol­low-on com­pa­nies ac­cel­er­ated en­try to the Russell Top 500 Index af­ter the close of the fifth trad­ing day fol­low­ing an IPO.

The denial of accelerated S&P 500 entry for SpaceX comes just days after Morningstar analysts described SpaceX as having been “significantly overvalued” in the lead-up to its IPO. The investment research firm valued SpaceX at $780 billion, less than half of SpaceX’s $1.75 trillion IPO goal, primarily based on the strengths of SpaceX’s Starlink satellite service and rocket launch business.

This story was up­dated on June 6, 2026 to more clearly de­scribe the pro­posed rule changes that would have ap­plied to all MegaCap com­pa­nies.

Gmail Thinks I'm Stupid, So I Left

moddedbear.com

Let me tell you a story

I go to check my email in Gmail’s web UI. I see a few new mes­sages re­gard­ing feed­back on a pro­ject I’m work­ing on. I click through to read one of them and the first thing I’m greeted with is a mes­sage sum­mary I did­n’t ask for gen­er­ated by a lan­guage model.

I fo­cus the mes­sage box to draft a re­ply, but there’s al­ready one there. It was also gen­er­ated by the lan­guage model. I delete it, re­plac­ing it with my own.

Afterward, I go to compose a new message. A colorful animation steals my focus for a second, highlighting a new “help me write” button. I ignore it and move on to filling in the recipients and subject line.

I focus the message body area and underneath my cursor appears the message “Press / for Help me write”. Again, I ignore it and begin writing.

A few moments later, I start a new paragraph and pause. There’s a new message under my cursor now: “Tab to improve”. What I’ve written so far isn’t up to Gmail’s standards, it seems.

What mes­sage are you try­ing to send?

Look, I’m pretty prag­matic when it comes to gen­er­a­tive AI fea­tures in soft­ware. I see very lit­tle wrong with in­clud­ing an op­tional AI writ­ing as­sis­tant for those who want it.

But when you nag and nag, when you sum­ma­rize my mes­sages and write my replies with­out my ask­ing, when you re­peat­edly in­ter­rupt me to beg and plead that I rewrite my drafts, you’re send­ing the wrong mes­sage.

The mes­sage you’re send­ing is that you think I’m not ca­pa­ble of read­ing and writ­ing my own emails. That the peo­ple I’m ex­chang­ing mes­sages with don’t de­serve my time and en­ergy. That I’m do­ing some­thing wrong by not out­sourc­ing my com­mu­ni­ca­tion skills to a to­ken pre­dic­tion ma­chine.

I’ve looked into it. Some of these fea­tures can be turned off. Others can’t. Or if they can, it means also turn­ing off use­ful long-stand­ing fea­tures like au­to­matic thread cat­e­go­riza­tion. I have very lit­tle doubt that this is in­ten­tional, that the un­so­licited sum­maries and auto replies are a means of ar­ti­fi­cially in­flat­ing the us­age met­rics for the lan­guage model fea­tures.

I think we’re all used to user-hos­tile soft­ware these days, but this is the first time I’ve ex­pe­ri­enced soft­ware that feels like it’s ac­tively try­ing to be dis­re­spect­ful. Sure, I could switch to a dif­fer­ent mail client and never see any of these lan­guage model fea­tures, but my ex­pe­ri­ence these past months has left such a bad taste that all I’m look­ing for now is a clean break.

A 16 year breakup

I’ve had my Gmail ac­count for 16 years. It’s by far my old­est in­ter­net ac­count that I still use. Or used to use. I’ve al­ready started the long process of mov­ing away.

This time I’m do­ing things the right way by con­nect­ing my own do­main to a mail host. I’m cur­rently with Fastmail since they were by far the most pop­u­lar op­tion when I asked for sug­ges­tions on the fe­di­verse. I’m still early on in the trial pe­riod, but so far first im­pres­sions are great. It seems re­ally flex­i­ble, and af­ter con­nect­ing mul­ti­ple do­mains and set­ting up a few aliases I’m start­ing to wish I had tried it sooner.

I haven’t set­tled on whether or not I should im­port my Gmail data. I’ll al­most surely im­port my con­tacts, but there’s some­thing nice about start­ing fresh as far as every­thing else goes. I’m in­ter­ested in what other peo­ple in a sim­i­lar po­si­tion have done.

Congrats to Google, re­ally. They’ve done a de­cent job at keep­ing Gmail sta­ble over the many years I’ve used it. Which is why even I am im­pressed by how quickly they were able to get me to pack up and leave.

JP

LLMs are eroding my software engineering career and I don't know what to do

human-in-the-loop.bearblog.dev

06 Jun, 2026

I’m a soft­ware en­gi­neer, com­plet­ing 10 years of pro­fes­sional ex­pe­ri­ence this year. I started my ca­reer as a web fron­tend en­gi­neer (it was eas­ier for me to de­bug fron­tend code back then, so I chose that path), but shortly tran­si­tioned to (web) back­end and never looked back.

Through a se­ries of co­in­ci­dences, once I stepped into back­end de­vel­op­ment, I ended up work­ing in soft­ware de­vel­op­ment roles in the do­mains of fi­nance, book­keep­ing and pay­ment pro­cess­ing, where I had great au­ton­omy and a close and can­did re­la­tion­ship with Product Managers and stake­hold­ers.

I learnt a lot about the do­main and how to ef­fec­tively write pro­grams for it: PCI com­pli­ance, dou­ble-en­try ledgers, es­crows, rec­on­cil­i­a­tion, pay­ment life­cy­cles, bank trans­fer idem­po­tency, etc.

It was, then, ob­vi­ous that I should fo­cus my ca­reer on be­com­ing an ex­pert on that do­main to stand out as a pro­fes­sional and dif­fer­en­ti­ate my­self in a field that showed signs of an in­creas­ing need for do­main spe­cial­ists.

The first pil­lar to erode: do­main-spe­cific knowl­edge

Last year, I got hired by a company in the finance space. Until then, I had worked at companies that had a strong payment and finance component to their operations/offerings, but that were not solely finance-focused.

That com­pany also em­braced AI whole­heart­edly, so I got ChatGPT and Claude Enterprise ac­counts from day one and was en­cour­aged to use them for my re­search, ex­plo­ration, and even cod­ing, al­beit with a warn­ing that I should still re­view and own every sin­gle line that made it into pro­duc­tion.

One of my first pro­jects in­volved re­work­ing the legacy on­line pay­ment sys­tem, which was a mess. They hired me for (among other things) my pre­vi­ous ex­pe­ri­ence in build­ing that and trusted me with the task.

Unlike the other companies I had worked for so far, they wanted the “Design Docs” I write before coding to be readable by both engineers and product managers - so they should be less of a technical deep dive and more of an architectural view. I wrote my first one with minimal AI assistance - I even called LLMs “stochastic parrots” at the time, a view I no longer hold - and delivered it.

I val­ued my knowl­edge and thought no LLMs could re­place it.

Then my man­ager reached out to me: even though you’re de­liv­er­ing code at a good pace, you’re tak­ing too long to de­liver those Design Docs. Are you us­ing AI? You should use more AI.

“No way this will work”, I thought, but agreed. The models at that time were not as good as the ones we have now, but they did provide a good speed-up on my writing and even the decision-making.

And then I started re­al­iz­ing: all the knowl­edge I have ac­cu­mu­lated over the years: the trade-offs be­tween im­ple­men­ta­tions, how ac­quir­ing works, how to struc­ture idem­po­tency to pre­vent dou­ble-charges, every­thing, was be­com­ing use­less. Even though the mod­els still needed some steer­ing, they could con­nect the dots on how to struc­ture such sys­tems, which was the hard­est part that only de­vel­ops in your brain af­ter years of hands-on ex­pe­ri­ence. That was my first shock.

But sure, I thought, they can do that be­cause there’s plenty of ar­ti­cles on the web on how that shit works along with all the tech­ni­cal doc­u­men­ta­tion, and we have blog posts ex­plain­ing how to ap­ply the tech­ni­cal tools to the do­main. For hu­mans, it may take a long time to learn all that, but that’s train­ing data so the mod­els can pick it up.

What the mod­els will never be good at, and that’s where hu­mans will shine, is de­bug­ging! I had ac­cu­mu­lated a good ex­pe­ri­ence de­bug­ging race con­di­tions and dis­trib­uted sys­tems in pro­duc­tion. That was my ticket to long-term em­ploy­a­bil­ity.

The sec­ond pil­lar to erode: de­bug­ging and dis­trib­uted sys­tems

So, af­ter LLMs started get­ting good at writ­ing docs and help­ing plan the ac­tual im­ple­men­ta­tions, they be­came good at cod­ing. It started in the sec­ond half of 2025 with the Claude Code hype, then Codex came and so on. Although I was us­ing LLMs for writ­ing unit tests every day be­fore that, I was­n’t trust­ing them to write the full im­ple­men­ta­tion yet.

The nat­ural next step was to in­tro­duce more AI into writ­ing code. And hon­estly, I liked it. I like ship­ping things to pro­duc­tion and see­ing users happy as much as I like cod­ing, so I was trad­ing one thing that I like for an­other one that I also like, it was fair.

LLMs were becoming good at coding, but they still couldn’t debug the mess left behind (by them or by the humans), so I still had a role that was bigger than steering the robot - a ticket to employability.

Everything seemed fine.

Then came the MCPs, the agen­tic work­flows and Claude 4.5 and the sky started to fall.

Claude 4.5, to be hon­est, was­n’t that good. It solved like 60% of the bugs given a stack trace and some con­text (a Sentry link with Sentry MCP en­abled was all it took in most cases). Sometimes it gave a so­lu­tion that sounded plau­si­ble but was to­tally wrong.

This time, how­ever, I stopped doubt­ing the ma­chines. I saw bugs that in the past would eas­ily take 1 day of full-time de­bug­ging be­ing one-shot­ted by Claude Code. Of course, not all of them yet, but the pat­tern was clear.

Then came 4.6, 4.7, GPT 5.5, Opus 4.8 and the DataDog MCP… Now I have CLIs that one-shot bugs across distributed systems for me. Bugs that I couldn’t solve in the past. Bugs that would take 2 days of full-time debugging. Bugs across distributed systems that lack distributed observability. 90% of the bugs are one-shotted now, including bizarre race conditions, unexpected corner-cases, third-party integration issues, undocumented API edge cases, everything. I hardly have to intervene.

Of course, I’m still em­ploy­able be­cause some­one has to re­view the code and steer the ro­bot. But I’m just an­other off-the-shelf en­gi­neer now. I have no do­main ex­per­tise that an­other Sr. en­gi­neer steer­ing an LLM can­not match. All my fi­nance and pay­ment do­main ex­per­tise, all the de­bug­ging in­tu­ition and dis­trib­uted sys­tem knowl­edge earned through hours of sweat and tears, is now prompt­able.

We were taught that gen­er­al­ists and spe­cial­ists will al­ways have their roles. But now the mar­ket is shap­ing every­one into be­com­ing a gen­er­al­ist. That’s not a bad thing per se, un­til you look un­der the eco­nom­ics of sup­ply and de­mand: if every­one is a gen­er­al­ist, the price of a gen­er­al­ist falls if there’s no de­mand to match. And we all know the de­mand is dry­ing up.

The third pil­lar, the one that has­n’t eroded yet: code qual­ity and ar­chi­tec­ture

I still have one pillar standing, though: code quality and software architecture - what’s now being reduced to being called “taste” [1].

Along the course of my ca­reer, I al­ways liked to refac­tor, al­ways prized good code, and ne­go­ti­ated time in the sprint for it. DDD, Hexagonal, Clean Architecture, you know all the buzz­words. I like this topic, I like to dis­cuss the trade-offs and dif­fer­ent ideas on how to shape code­bases. I re­ally like it.

This is the last pil­lar stand­ing. Except that no­body cares any­more.

Agents do a re­ally bad job at keep­ing code­bases or­ga­nized. If you don’t steer them, they’ll hit a cir­cu­lar de­pen­dency is­sue sooner than you think. Will du­pli­cate code. Add un­nec­es­sary com­ments. Mix up pure func­tions and side-ef­fects. Disregard the prin­ci­ples of SOLID.

That should keep humans employed, except that this skill is now being reduced to the word “taste”. But it’s not just a renaming: the industry is moving to a world where code organization is less important.

Sure, hu­mans should steer the agent to pre­vent spaghetti code­bases with cir­cu­lar de­pen­dency graphs. We don’t want F-rated code­bases that are im­pos­si­ble to touch with­out break­ing some­thing. But a C or D? It’s now fine. Nobody needs A or B-grade code­bases any­more be­cause they’re be­ing made for LLMs, not for hu­mans to read.

I don’t want to ar­gue if this is in­her­ently good or bad. If the source code is now writ­ten for ma­chines to read and not hu­mans, it may be ac­tu­ally ok to tar­get them.

But that’s an­other pil­lar of my ex­per­tise that’s erod­ing. A good chunk of the knowl­edge I ac­cu­mu­lated on that topic is not that valu­able any­more. All the time I spent on it - read­ing books, do­ing real-world ex­er­cises, dis­cussing with other en­gi­neers, writ­ing ADRs - is be­com­ing use­less.

What now?

I’m still employed and I see myself employed (at least in that company) for the foreseeable future. But I don’t know what to think about the long-term.

I spent 10 years (even more when you account for non-professional experience) getting good at things that are becoming less and less valuable. My last pillar of expertise is now reduced to a “taste” and probably won’t last long.

And I know that’s not just me. About 8 months ago there was a lay­off at my cur­rent com­pany (not re­lated to AI, ac­cord­ing to them). Some bril­liant ex-cowork­ers were laid off and are still look­ing for jobs. Most of them suf­fer from the same prob­lem I out­lined here: their do­main ex­per­tise is not enough to stand out any­more.

The company is now hiring again for a few roles and domain familiarity is not a strong differentiator anymore. We used to list “Software Engineer - Area”. Now it’s just “Software Engineer” and the team assignment comes after the offer is accepted.

Of course, this is good for bril­liant en­gi­neers that never had the chance to get deep into the do­main and now have bet­ter chances at get­ting a job, but it’s also sad to think that other bril­liant en­gi­neers that spent their lives col­lect­ing do­main knowl­edge are now com­pet­ing on the same lane.

The only way out for keep­ing my em­ploy­a­bil­ity in the long-term now seems to be shift­ing my do­main ex­per­tise to some­thing LLMs will not get good at so eas­ily. But what’s left?

I thought about going back to college, learning Math, Statistics, advanced Machine Learning and applying for a research role at a frontier lab. Except that there are no frontier labs in my country, the few that exist are flooded with applications and I have family matters that make moving to another country difficult. By the time I can afford to make that jump, RSI may have made researchers obsolete.

Maybe I should con­sider trans­form­ing my wood­work­ing hobby into a pro­fes­sion…

Update (Jun 7): this post went vi­ral. I wrote an­other post re­ply­ing to some com­ments from so­cial me­dia and ex­pand­ing some of my ar­gu­ments. You can read it here.

[1] See this, this and this for reference. Don’t take this as an endorsement of the content inside any of these posts.

#ai

#llm

#software en­gi­neer­ing

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

blog.google

Jun 03, 2026

Gemma 4 12B is de­signed to bring high-per­for­mance mul­ti­modal in­tel­li­gence di­rectly to your lap­top, com­bin­ing mo­bile-first ef­fi­ciency with ad­vanced rea­son­ing.

Olivier Lacombe

Director of Product Management, Google DeepMind

Gus Martins

Product Manager, Google DeepMind

Today, we are in­tro­duc­ing Gemma 4 12B, our lat­est model de­signed to bring agen­tic mul­ti­modal in­tel­li­gence di­rectly to lap­tops. Bridging the gap be­tween our edge-friendly E4B and our more ad­vanced 26B Mixture of Experts (MoE), Gemma 4 12B pack­ages pow­er­ful ca­pa­bil­i­ties in­side a re­duced mem­ory foot­print. It is also our first mid-sized model to fea­ture na­tive au­dio in­puts.

Thanks to the de­vel­oper com­mu­nity, Gemma 4 mod­els have now crossed 150 mil­lion down­loads. You’ve built every­thing from wear­able ro­botic arms for phys­i­cal as­sis­tance to en­ter­prise-grade AI se­cu­rity. We’re ex­cited to see what you build with this lat­est ad­di­tion.

Here’s an overview of what makes Gemma 4 12B unique:

Novel uni­fied ar­chi­tec­ture: No mul­ti­modal en­coders. The vi­sion and au­dio in­puts flow di­rectly into the LLM back­bone.

Advanced rea­son­ing: Benchmark per­for­mance near­ing our 26B model, un­lock­ing pow­er­ful multi-step rea­son­ing and agen­tic work­flows.

Laptop ready: Small enough to run lo­cally with just 16GB of VRAM or uni­fied mem­ory.

Open and ac­ces­si­ble: Released un­der an Apache 2.0 li­cense with sup­port across the de­vel­oper ecosys­tem.

Drafter-ready: Gemma 4 12B comes equipped with Multi-Token Prediction (MTP) drafters to re­duce la­tency.

Together, these fea­tures bring ad­vanced mul­ti­modal ca­pa­bil­i­ties to every­day hard­ware with­out sac­ri­fic­ing speed or rea­son­ing. Let’s now take a closer look at how Gemma 4 12B achieves this.

Run state-of-the-art agents lo­cally

Gemma 4 12B de­liv­ers per­for­mance near­ing our larger 26B MoE model on stan­dard bench­marks, but at less than half the to­tal mem­ory foot­print. Small enough to run lo­cally on con­sumer lap­tops with 16GB of RAM, it un­locks pow­er­ful mul­ti­modal and agen­tic ex­pe­ri­ences right on your ma­chine.

Experience a uniquely ef­fi­cient, uni­fied ar­chi­tec­ture

What makes Gemma 4 12B stand out is its stream­lined ap­proach to pro­cess­ing vi­sual and au­dio in­puts. Traditional mul­ti­modal mod­els typ­i­cally rely on sep­a­rate en­coders to trans­late im­ages and au­dio be­fore pass­ing those rep­re­sen­ta­tions to the lan­guage model. Because these split en­coders add la­tency and in­crease mem­ory us­age, we trained Gemma 4 12B with an en­coder-free ar­chi­tec­ture to in­te­grate au­dio and vi­sion in­put di­rectly.

Here is how Gemma 4 12B processes mul­ti­modal in­puts na­tively:

Vision: We re­placed Gemma 4’s vi­sion en­coder with a light­weight em­bed­ding mod­ule con­sist­ing of a sin­gle ma­trix mul­ti­pli­ca­tion, po­si­tional em­bed­ding and nor­mal­iza­tions. This al­lows the LLM back­bone to take over vi­sual pro­cess­ing.

Audio: We sim­pli­fied au­dio pro­cess­ing even fur­ther. We re­moved the au­dio en­coder en­tirely and pro­jected the raw au­dio sig­nal into the same di­men­sional space as text to­kens.
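The vision path described above (one matrix multiplication, a positional embedding, and a normalization) can be sketched in a few lines of plain Python. This is a toy illustration only, not Gemma's implementation: the function names, the tiny dimensions, the order of operations, and the choice of RMS normalization are all assumptions, since the announcement says only "normalizations" without specifying them.

```python
import math


def matmul(rows, matrix):
    """Multiply a list of row vectors by a matrix given as a list of rows."""
    columns = list(zip(*matrix))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns] for row in rows]


def rms_norm(vector, eps=1e-6):
    """Rescale a vector to roughly unit root-mean-square (assumed normalization)."""
    rms = math.sqrt(sum(x * x for x in vector) / len(vector) + eps)
    return [x / rms for x in vector]


def embed_patches(patches, projection, positions):
    """Hypothetical encoder-free embedding: project, add position, normalize.

    patches:    one flattened pixel vector per image patch
    projection: weight matrix mapping patch size to the model width
    positions:  one positional vector per patch, in the model width
    """
    projected = matmul(patches, projection)  # the single matrix multiplication
    return [
        rms_norm([p + q for p, q in zip(token, position)])
        for token, position in zip(projected, positions)
    ]


# Two 2-value "patches" projected into a 3-wide embedding space (toy sizes).
patches = [[1.0, 0.0], [0.0, 1.0]]
projection = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
positions = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
tokens = embed_patches(patches, projection, positions)
print(len(tokens), len(tokens[0]))
```

The point of the sketch is how little work happens before the backbone: each patch becomes one token-sized vector, and everything that a separate vision encoder would normally do is left to the language model's own layers.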

For de­vel­op­ers who want a break­down, head over to our com­pan­ion Gemma 4 12B Developer Guide.

Get started to­day

Try it your­self: Experiment with a cou­ple of clicks in LM Studio, Ollama, Google AI Edge Gallery App, the Google AI Edge Eloquent app and the LiteRT-LM CLI

Download the weights: Download the pre-trained and in­struc­tion-tuned check­points di­rectly from Hugging Face and Kaggle.

Integrate & learn: Review the de­vel­oper doc­u­men­ta­tion and the quick start note­book.

Use your fa­vorite de­vel­op­ment tools: Implement lo­cal in­fer­ence pipelines with Hugging Face Transformers, llama.cpp, MLX, SGLang, and vLLM, or fine-tune with ef­fi­ciency us­ing Unsloth.

Unlock Agentic Development with Gemma Skills: To sup­port agents to build with the lat­est Gemma ad­vance­ments, we are re­leas­ing our of­fi­cial Skills Repository. This is a li­brary of skills de­signed specif­i­cally to en­able agents to build with Gemma mod­els.

Deploy your way: Spin up end­points in pro­duc­tion us­ing Google Cloud. Deploy your way through Gemini Enterprise Agent Platform Model Garden, Cloud Run and GKE.

AI-native React Components

vorpus.github.io

Elixir v1.20 released: now a gradually typed language

elixir-lang.org

In 2022, we an­nounced the ef­fort to add set-the­o­retic types to Elixir. In June 2023, we pub­lished an award win­ning pa­per on Elixir’s type sys­tem de­sign and said our work was tran­si­tion­ing from re­search to de­vel­op­ment.

With Elixir v1.20, we have com­pleted our first de­vel­op­ment mile­stone which is to per­form type in­fer­ence and grad­u­ally type check every Elixir pro­gram, with­out in­tro­duc­ing type an­no­ta­tions. This means Elixir in­creas­ingly re­ports dead code and ver­i­fied bugs: typ­ing vi­o­la­tions that are guar­an­teed to fail at run­time if ex­e­cuted. Elixir can find ver­i­fied bugs in ex­ist­ing pro­grams ef­fi­ciently, with­out in­tro­duc­ing de­vel­oper over­head, and with an ex­tremely low false pos­i­tives rate.

In this announcement, we will break down the type system goals, what the dynamic() type means in Elixir, and how it finds verified bugs. In particular, our implementation performs well in the “If T: Benchmark for Type Narrowing” benchmark. Elixir passes 12 of the 13 categories, showing that it can recover precise type information from ordinary Elixir code, which we use to find verified bugs in dynamically typed programs.

The type system was made possible thanks to a partnership between CNRS and Remote. The development work is currently sponsored by Fresha and Tidewave.

Types, in my Elixir?

Our goal is to in­tro­duce a type sys­tem which is:

sound - the types in­ferred and as­signed by the type sys­tem align with the be­hav­iour of the pro­gram

grad­ual - Elixir’s type sys­tem in­cludes the dy­namic() type, which can be used when the type of a vari­able or ex­pres­sion is checked at run­time. In the ab­sence of dy­namic(), Elixir’s type sys­tem be­haves as a sta­tic one

de­vel­oper friendly - the types are de­scribed, im­ple­mented, and com­posed us­ing ba­sic set op­er­a­tions: unions, in­ter­sec­tions, and nega­tions (hence it is a set-the­o­retic type sys­tem), with clear er­ror mes­sages

Introducing a type sys­tem into an ex­ist­ing lan­guage is a com­plex change. For this rea­son, our first mile­stone was to im­ple­ment the type sys­tem with­out in­tro­duc­ing typ­ing an­no­ta­tions but still have it pro­vide value to de­vel­op­ers by find­ing dead code and ver­i­fied bugs. This is done through the dy­namic() type, which in Elixir is quite dif­fer­ent from other grad­u­ally typed lan­guages. Let’s break it down.

The dy­namic() type

Many gradual type systems have the any() type, which, from the point of view of the type system, often means “anything goes” and no type violations are reported. On the other hand, Elixir’s gradual type is called dynamic() and it has two important properties: compatibility and narrowing.

In sta­tic type sys­tems, when you have a type of shape in­te­ger() or bi­nary() and you in­voke a func­tion, said func­tion must ac­cept both types. However, be­cause type sys­tems can­not cap­ture the in­ten­tion of all of our pro­grams with pre­ci­sion, this may lead to false pos­i­tives. For ex­am­ple, take the sim­ple code be­low:

    def percentage_or_error(value) when is_integer(value) do
      value_or_error = if value > 1 do value else "not well" end

      # … more code …

      if value > 1 do
        value_or_error / 100
      else
        String.upcase(value_or_error)
      end
    end

Although val­ue_or_er­ror has type in­te­ger() or bi­nary(), the op­er­a­tor / ac­cepts only num­bers, and String.upcase ac­cepts only bi­na­ries/​strings, the pro­gram above is valid and emits no ex­cep­tions at run­time. However, a type sys­tem would still re­port two vi­o­la­tions, be­cause the types sup­plied to / and String.upcase are not a sub­type of the ac­cepted types.

While the pro­gram above could be bet­ter writ­ten to have no typ­ing vi­o­la­tions, type sys­tems will al­ways re­ject valid pro­grams, and if Elixir were to in­tro­duce too many false pos­i­tives in ex­ist­ing code­bases, it would quickly erode the trust in the type sys­tem. Therefore, Elixir’s grad­ual type sys­tem tags the val­ue_or_er­ror vari­able above with the type dy­namic(in­te­ger() or bi­nary()), which means the type is ei­ther in­te­ger() or bi­nary() at run­time.

When call­ing a func­tion with a dy­namic() type, Elixir will only emit a typ­ing vi­o­la­tion if the sup­plied types and the ac­cepted types are dis­joint. In the pro­gram above, even though / ex­pects only num­bers, dy­namic(in­te­ger() or bi­nary()) can be an in­te­ger() and given the ac­cepted and sup­plied types are not dis­joint, there are no typ­ing vi­o­la­tions. However, if we were to change the pro­gram to this:

    value_or_error = if value > 1 do value else "not well" end

    Map.fetch!(value_or_error, :some_key)

Because Map.fetch! ex­pects a map data struc­ture, and val­ue_or_er­ror can only be in­te­ger or bi­nary at run­time, the ac­cepted and sup­plied types are dis­joint, which turns into a vi­o­la­tion. This is known as the com­pat­i­bil­ity prop­erty and it ex­plains how Elixir re­ports only ver­i­fied bugs.

However, re­port­ing only ver­i­fied bugs would not be use­ful if we can’t find many bugs in the first place. We ad­dressed this prob­lem by mak­ing sure Elixir’s dy­namic type can be nar­rowed. Take this code:

    def add_a_and_b(data) do
      data.a + data.b
    end

In the pro­gram above, data starts as a dy­namic() type. We then use it as data.a and data.b in­side the plus op­er­a­tor, so Elixir will re­fine the data vari­able to have type %{…, a: num­ber(), b: num­ber()}, which im­plies it is a map with both a and b fields with num­ber val­ues (and po­ten­tially any other field, hence the lead­ing …). Therefore, if you were to for­get to se­lect the .b field and write this:

    def add_a_and_b(data) do
      data.a + data
    end

data would first be narrowed to a map of shape %{…, a: number()} and then used as a number(), which would emit a violation.

In other words, the dynamic() type in Elixir effectively works as a range, which can be refined as it is used throughout the program and reports violations whenever type checks fall outside of the range. This is in contrast to other gradual type systems, which use the dynamic type to discard all type information.

Behind the scenes, our type in­fer­ence and type check­ing al­go­rithms be­have as if we an­no­tated all ar­gu­ment types as dy­namic(). Once we in­tro­duce user-sup­plied type an­no­ta­tions, Elixir’s type sys­tem will be­have as any sta­t­i­cally typed lan­guage as long as dy­namic() is not used. And when­ever you cross the sta­tic-dy­namic bound­ary, we de­vel­oped new tech­niques that en­sure our grad­ual typ­ing is sound, with­out a need for ad­di­tional run­time checks.

Typing guards, clauses, and more

Most of the work be­hind this re­lease was to in­tro­duce type check­ing and nar­row­ing to sev­eral con­structs. Let’s see some of them.

When it comes to guards, we can in­fer unions, in­ter­sec­tions, and nega­tions:

def ex­am­ple(x, y) when is_list(x) and is_in­te­ger(y)

The code above cor­rectly in­fers x is a list and y is an in­te­ger.

def ex­am­ple({:ok, x} = y) when is_bi­nary(x) or is_in­te­ger(x)

The one above in­fers x is a bi­nary or an in­te­ger, and y is a two el­e­ment tu­ple with :ok as first el­e­ment and a bi­nary or in­te­ger as sec­ond.

def ex­am­ple(x) when is_map_key(x, :foo)

The code above in­fers x is a map which has the :foo key, rep­re­sented as %{…, foo: dy­namic()}. Remember the lead­ing … in­di­cates the map may have other keys.

def ex­am­ple(x) when not is_map_key(x, :foo)

And the code above in­fers x is a map that does not have the :foo key, which has the type: %{…, foo: not_set()}. Hence x.foo within the func­tion body will raise a typ­ing vi­o­la­tion.

You can also have ex­pres­sions that as­sert on the size of data struc­tures:

def ex­am­ple(x) when tu­ple_­size(x) < 3

Elixir will correctly track that the tuple has at most two elements, and therefore accessing elem(x, 3) will emit a typing violation. For maps and lists, we convert size checks into emptiness ones. In other words, Elixir can look at complex guards, infer types, and use this information to find bugs in our code.

When it comes to con­structs such as case and con­di­tion­als, Elixir uses in­for­ma­tion from pre­vi­ous clauses to re­fine sub­se­quent ones:

    case System.get_env("SOME_VAR") do
      nil -> :not_found
      value -> {:ok, String.upcase(value)}
    end

System.get_env("SOME_VAR") returns either nil or a binary(). Because the first clause matches on nil, the type system knows value can no longer be nil, and therefore it must only be a binary(), which allows the second clause to also type check without violations. Narrowing across clauses also helps the type system find redundant clauses and dead code in existing codebases.

Furthermore, we have typed many func­tions in the stan­dard li­brary that work with tu­ples and maps. You can find more de­tails in the re­lease notes.

Compilation time im­prove­ments

Elixir v1.20 also im­proves com­pi­la­tion times once more, es­pe­cially on ap­pli­ca­tions run­ning on ma­chines with many cores. Even though BEAM lan­guages are ef­fi­cient to com­pile in gen­eral, our syn­thetic bench­marks now place Elixir’s build tool as the fastest among them. If you would like to con­tribute more ex­am­ples and sce­nar­ios, please start a dis­cus­sion so we can pro­vide a trans­par­ent suite of bench­marks and re­sults.

It also in­tro­duces a new com­piler op­tion called :module_definition, which spec­i­fies if the mod­ule de­f­i­n­i­tion should be :compiled (the de­fault) or :interpreted. This may im­prove com­pi­la­tion times in large pro­jects and it does not af­fect the .beam files writ­ten to disk, only how the con­tents in­side def­mod­ule are ex­e­cuted. You can en­able it by set­ting elixir­c_op­tions: [module_definition: :interpreted] in your mix.exs. Read the doc­u­men­ta­tion to learn more.

What is next?

The biggest ques­tion ahead of us is: when will Elixir in­tro­duce new type sig­na­tures that lever­age set-the­o­retic types? As re­cently dis­cussed in my ElixirConf EU 2026 keynote, we still have both re­search and de­vel­op­ment work ahead of us. We will only in­tro­duce type sig­na­tures:

if we are sat­is­fied with the type sys­tem per­for­mance in Elixir v1.20 (and we have done ex­ten­sive work op­ti­miz­ing it)

if we can im­ple­ment re­cur­sive types ef­fi­ciently

if we can im­ple­ment para­met­ric types ef­fi­ciently

if we can im­ple­ment tra­vers­ing key-value pairs of maps as an enu­mer­able ef­fi­ciently (we are still re­search­ing the pos­si­ble so­lu­tions here)

Once those prob­lems are tack­led, we will start to ex­plore and dis­cuss typed struct de­f­i­n­i­tions and fi­nally type sig­na­tures. As usual, we will keep the com­mu­nity posted through news and in the Elixir Forum.

We ap­pre­ci­ate every­one who tried the re­lease can­di­dates, ran bench­marks, and gave us feed­back! Give Elixir v1.20 a try and re­mem­ber to fix all of the bugs it will find for free!

Building from Zero After Addiction, Prison, and a Felony

gavinray97.github.io

I spent ages 14 – 16 in a max­i­mum-se­cu­rity ju­ve­nile prison, be­came a felon at 19, lost al­most every­thing to ad­dic­tion, and later re­built my life through soft­ware, open source, and a few peo­ple who took a chance on me.

I’ve wanted to write this for a while, but kept find­ing rea­sons not to. It felt too per­sonal, too risky, and too easy to mis­read.

Recently, I de­cided on two things:

After see­ing Preston Thorpe speak pub­licly about his own back­ground, I won­dered how many oth­ers like us were silently lurk­ing in tech

I’m far enough in my ca­reer with enough con­tri­bu­tions to OSS and com­mu­nity in­volve­ment, that I think I’ll prob­a­bly be al­right

I wrote this for any­one qui­etly won­der­ing whether they have no chance at a fu­ture.

Below is the much-con­densed life story of my strug­gles with ad­dic­tion, poverty, and in­car­cer­a­tion + life af­ter be­ing a felon. My hope is that it serves as en­cour­age­ment to oth­ers who are in sim­i­lar cir­cum­stances that things CAN get bet­ter.

Amphetamine Addict and Prison at 14

I was a model student up until around puberty and middle school. Then, I think, a combination of being bullied for being overweight and teenage hormones led me to be just the wrong combination of resentful, angry, unhappy, and rebellious.

I started get­ting in fist­fights with peo­ple that made fun of me, be­ing a huge ass­hole to teach­ers, stopped do­ing school­work, and started ex­per­i­ment­ing with drugs.

The beginning of the end: The day I bought an Adderall from a classmate. When that amphetamine feeling kicked in, it was as if life was perfect for the first time. I was happy, confident, felt I could do anything. I wanted to feel this way every waking moment for the rest of my life.

Being 14, I had no job, and I do not come from money. So, log­i­cally, I did the thing one must do if one wishes to sus­tain a drug habit: Devise a way to make money.

The easiest way to make money at 14 turned out to be dealing drugs, so I started selling various prescription medications on a “buy-low-sell-high” basis from other students at school.

This was short-lived, as I had the huge mouth of a rebellious “I’m invincible” 14-year-old boy, and I was shortly arrested and charged with 17 counts of Possession with Intent to Manufacture or Distribute a Schedule II Controlled Substance.

I wound up spend­ing 2 years, from 14 – 16 at a max­i­mum se­cu­rity ju­ve­nile prison (Lookout Mountain YSC, Golden CO).

Freedom - Shortly Lived

In prison, I got my GED, and af­ter re­lease briefly en­rolled in com­mu­nity col­lege. I was work­ing as a land­scaper do­ing man­ual la­bor for $8/hr and then rid­ing a bus 1hr each way to night classes. Not to say this sort of thing can’t be done (people do it all the time), but I did­n’t have the tenac­ity or mo­ti­va­tion to keep it up, so I dropped out.

I stayed sober for a brief period between 16 – 17. Not having learned my lesson, I again started selling drugs. I had learned about The Silk Road and the Darknet and was ordering (what was then) a legal “Research Chemical” with effects similar to MDMA (Methylone/bk-MDMA) shipped to my parents’ house. Eventually, my dad got home early from work and intercepted a package. When he asked me what it was before I left for work, I told him “I don’t know, never heard of the return address name”. My father was not an idiot; he told me he was going to open it while I was at work, so I confessed “it’s drugs.”

Cue huge argument, him insisting he was going to remove everything from my room except my clothes and bed (most of which I paid for myself) and I would not be allowed to leave except for work. This was not an agreeable circumstance to me, so I refused, at which point my dad said “then you won’t be living here anymore!”.

It’s im­por­tant to note that in Colorado (at the time, at least), eman­ci­pa­tion of a mi­nor was not a sta­tus one could file for, but in­stead purely a court sta­tus to be rec­og­nized dur­ing le­gal pro­ceed­ings. That meant there was tech­ni­cally no av­enue for me to legally move out be­fore 18 with proper le­gal sta­tus.

So this, to me, sounded like sweet freedom & release, rather than a punishment. “You really won’t call the police if I leave?” “Nope.” I packed my backpack with my laptop and cash savings, and a suitcase with my clothes, and left. I had no plan but that was a bridge to be crossed.

It turned out that the par­ents of a friend had an un­used bed­room in their trailer they would rent to me un­der-the-table for $300/mo. I jumped at that and slept on the floor of a trailer for 6 months.

I worked as a land­scaper, at a lum­ber mill, and as a cashier at Walgreens, con­tin­u­ing to sell drugs on the side.

Inevitably, I wound up be­ing ar­rested again on drug-re­lated charges, and spent 18 – 19 in county jail. It was then that I be­came a con­victed felon with a low-class felony.

A Serendipitous News Article & a Software Job

While I was in county jail, one day the newspaper had a small article in it: “Tech company offers internships to at-risk & underprivileged youth”

I had spent my child­hood on the com­puter, play­ing videogames and even­tu­ally teach­ing my­self to pro­gram to make game mods. I knew from a young age I wanted to be a pro­gram­mer (I thought I wanted to make videogames, as most young chil­dren do).

This, to me, seemed like a fortuitous opportunity. I cut the article out and put it in a documents folder.

Eventually, I was moved from reg­u­lar jail pop­u­la­tion into the Work-Release jail pro­gram, where they let you out dur­ing the day for work. You had 1 week to find a job, and if you could­n’t se­cure em­ploy­ment you were sent back per­ma­nently to fin­ish out your sen­tence.

The first day out, I walked into the of­fices of the com­pany from the ar­ti­cle and asked to speak to some­one. I ex­plained that I was fresh out of jail and had seen their ar­ti­cle while in­side.

They in­ter­viewed me, de­cided to hire me, and I was now an in­tern Full-Stack Web Developer! I knew noth­ing of web dev, and did­n’t even par­tic­u­larly have an in­ter­est in it orig­i­nally, but the job was al­ready be­yond my hopes. I had as­sumed I was go­ing to spend the rest of my life work­ing con­struc­tion or sim­i­lar, be­cause of my felony.

The same news re­porter that had done the orig­i­nal ar­ti­cle later came to visit, and af­ter in­ter­view­ing me, did a whole writeup on it!

https://www.dailycamera.com/2017/05/12/boulder-tech-academies-swamped-as-they-race-to-retrain-workers/

Working at Techtonic was the best possible early-career experience I think anyone could have had. They did contract development, a lot of which was greenfield SaaS MVP launches, across various tech stacks. There was not a lot of time for mentorship so it was a very “trial-by-fire” experience: either figure things out and ship stuff, or get the boot.

I learned frontend, backend, and dev-ops while there and worked across several languages + DBs. This was around the time Ruby on Rails + MongoDB was the hip thing. ES6 JS was still fresh and new, and it was there that our CTO did a company meeting on this new thing called “React” that we were to start learning to replace jQuery.

It’s also where I met my now-wife, who I pulled into my drug use and un­sta­ble life.

Drugs, Part 2: Electric Boogaloo

Being pos­si­bly the most hard­headed in­di­vid­ual in the uni­verse, I fell back into drug use shortly there­after. I man­aged to re­main mostly-func­tional, un­til the man­ager at Techtonic (who did not like me) lied to the owner that I was show­ing up hours late every day.

She fired me (and my now-wife), and I was later redeemed when they found the truth in his Slack message history after firing him many moons later. But oh whale, “them’s the breaks”, as they say.

Not hav­ing a job, I spi­raled harder into ad­dic­tion, and even­tu­ally ran out of money to pay my rent and bills. We moved in with my bi­o­log­i­cal fa­ther in Florida. He was also an ad­dict, and in­stead of sta­bil­ity, the sit­u­a­tion be­came en­abling and de­struc­tive. It ex­ploded in short or­der.

From Zero

After the liv­ing sit­u­a­tion with my fa­ther ex­ploded, I was for­tu­nate enough to have a friend who had a spare room in the house and agreed to let me and my (now wife) stay with them for some tiny sum of money, but only tem­porar­ily un­til we could find work + save enough to move out and get back on our feet.

It was at this point we had noth­ing: A few dol­lars to our name, no ve­hi­cle, some cloth­ing and a sin­gle lap­top.

I had lost every­thing. And I had dragged this poor woman into it with me who had lost every­thing, too.

It was at this point my sobriety began. I had hit what we addicts call a “bottom”. Not the first one, but the one that was finally grim and bleak enough to make me look at myself and go “What the fuck are you doing?” The one that finally knocked it into my skull that I didn’t want to live like this anymore.

I started wash­ing dishes at a restau­rant, and my wife took a job de­liv­er­ing and in­stalling large ap­pli­ances (ovens, fridges, etc) at the same ware­house where the friend worked. Having no ve­hi­cle, she had to bor­row the friend’s bi­cy­cle and ride 30 min­utes in the dark be­fore work, and 30 min­utes in the swel­ter­ing heat af­ter work home. The hours were very long, be­cause it was of­ten on-site in­stal­la­tion paid by the ap­pli­ance, so many days she would work 10 – 12 hours + 1 hour bike ride, and come home so ex­hausted all she could do was sob a lit­tle be­fore get­ting just enough sleep to do it all over again the next day.

Eventually, she told me that it made more sense for me to quit my job while she worked, so that I could spend all of my free time trying to get another tech job. So she alone carried us for several months. I sent out hundreds of applications. I went through final-round interviews and received offer letters from 8 companies, only to have them rescinded each time due to corporate “No Felons” HR policies. It was like having the carrot dangled right in front of my face, to be snatched away each time.

Finally, I got an in­ter­view with a tiny startup in Miami. I passed their phone screen, and drove 4 hours each way to do the in-per­son.

They offered me the job, and helped pay for us to relocate and temporarily stay in Airbnbs. It paid $50k, with the promise of a significant raise in 1 year when the company had more revenue. I was overjoyed with the offer and immediately accepted.

Hasura, Open Source, and the Door That Stayed Open

The system at work was an ageing Rails app that had accrued significant tech debt and was the result of an amalgam of outsourced development shops. One of them was clearly quite proficient, and the others… not so much. Part of my job was designing and implementing a V2 rewrite. While evaluating technologies for this, I stumbled upon Hasura:

https://github.com/hasura/graphql-engine

Put sim­ply, it au­to­mated the work of gen­er­at­ing CRUD for Postgres apps, and was de­signed by peo­ple who clearly had hit the lim­i­ta­tions of tra­di­tional Backend-as-a-Service type plat­forms. Only core CRUD was au­to­mated, and you in­te­grated the rest of your app through wiring up your own API end­points and im­ple­ment­ing your own AuthN + AuthZ.

The first time I plugged in our lo­cal­host Postgres URL for dev, and had a full work­ing CRUD API, I was hooked. Coming from a back­ground of rapidly churn­ing out SaaS MVPs, this was solv­ing a very real prob­lem for me, and it was PERFORMANT.

I be­came heav­ily in­volved in the Discord server, an­swer­ing other peo­ple’s ques­tions, and also started send­ing PRs to im­ple­ment fea­tures I felt were miss­ing.

When my 1 year an­niver­sary came around at work, the founders un­for­tu­nately still were not in a po­si­tion to pay me much more. I knew the fi­nan­cials of the busi­ness and they weren’t ly­ing, but it was still some­what of a dis­ap­point­ment. One of the Hasura em­ploy­ees had re­cently made a joke that I should just ap­ply to work there. I fig­ured that it could­n’t hurt to at least get more info.

I went through the in­ter­view rounds more as a for­mal­ity and was given an of­fer let­ter. I was of­fered slightly more than dou­ble my cur­rent salary! I gen­uinely loved work­ing with the founders at my cur­rent job and felt ter­ri­ble about leav­ing, but I did ac­cept the of­fer and stay on for an­other month to fin­ish up cur­rent work and make sure there was some­one to hand off to.

The com­pany was so small back then that there was no back­ground check done dur­ing the in­ter­view process. After work­ing there a while, I even­tu­ally dis­closed to the Hasura founders that I had a low-grade felony, and, thank the stars, they were cool with it.

I had my dream job: Working on a developer-facing tool I genuinely loved and was a power-user of, that was also part of the Postgres ecosystem. I could not have conceived of such a perfectly-fit position. I have been working at Hasura (now PromptQL) since 2020, and I plan on riding this one all the way to its end: either fired, bankrupt, or bought-out. (Hopefully bought-out).

Conclusion

I don’t tell this story be­cause I think it is clean, heroic, or uni­ver­sally ap­plic­a­ble — It is­n’t. I made TERRIBLE choices. I hurt peo­ple who loved me. I wasted chances that other peo­ple would have killed for. And even when I fi­nally started do­ing the right things, I still needed luck, help, tim­ing, for­give­ness, and peo­ple will­ing to judge me by what I could do next in­stead of only by what I had done be­fore.

But that is ex­actly why I wanted to write this.

If you are reading this from the middle of addiction, poverty, a criminal record, or some other hole that feels permanent: I won’t insult you by claiming it’s easy. It may be unfair for a long time. You may have to hear “no” from people who never even look at your work. You may have to rebuild with less room for mistake than everyone around you.

But you are not nec­es­sar­ily fin­ished.

And if you are in a po­si­tion to hire, men­tor, re­view pull re­quests, or let some­one into a room they nor­mally would not be al­lowed into: please re­mem­ber that tal­ent is not evenly dis­trib­uted by back­ground check. Sometimes the per­son who looks risky on pa­per is also the per­son who will spend years try­ing to be­come wor­thy of the chance they were given.

I am alive, sober, mar­ried, em­ployed, and work­ing on soft­ware I care about be­cause a hand­ful of peo­ple took that risk on me.

I wake up grate­ful for that every day. And I hope, over time, to be­come the kind of per­son who gives that same chance to some­one else.

AI Use Disclaimer: Claude Code was used to generate the OpenGraph SVG image.

No part of the prose was ma­chine-gen­er­ated. You will not find ma­chine-writ­ten prose on this blog. I con­sider it deeply dis­re­spect­ful.

How LLMs Actually Work

www.0xkato.xyz

Monday. June 01, 2026 -

26 mins

This post is a walk­through of how LLMs work. Modern LLMs are mostly built by stack­ing trans­former blocks over and over, so un­der­stand­ing the trans­former ma­chin­ery gets you most of the way there.

I’ll cover the core mech­a­nisms in­side mod­ern trans­former-based LLMs, with­out all that sticky math stuff. Don’t get me wrong, you should learn the math, but this can serve as an in­tro­duc­tion.

Most mod­ern LLMs share the same trans­former-fam­ily skele­ton. The dif­fer­ences come from what each one was trained on, the scale and con­fig­u­ra­tion choices, and the post-train­ing done on top. By the end, you should be able to read many mod­ern LLM pa­pers or model cards and know which piece of the ar­chi­tec­ture each sec­tion is talk­ing about.

Here’s the path:

Tokens, how a string of text be­comes a se­quence of in­te­gers

Embeddings, how those in­te­gers get mean­ing

Positional en­cod­ing, how the model knows what or­der the to­kens came in

Attention, how to­kens share in­for­ma­tion with each other

Multi-head at­ten­tion, how the model tracks many kinds of re­la­tion­ships at once

The feed-for­ward net­work, where a large share of the mod­el’s stored struc­ture lives

The resid­ual stream and layer nor­mal­iza­tion, what makes deep stacks train­able

Predicting the next to­ken, what the model ac­tu­ally out­puts and how the gen­er­a­tion loop works

Architecture vs trained weights, what’s broadly shared across mod­ern LLMs, and what’s dif­fer­ent

Tiny ex­plain­ers ap­pear through­out so any­one can fol­low along, re­gard­less of back­ground.

Tokenization

Models don’t read text directly. They read integer IDs, so something has to convert your prompt into a sequence of those integers first.

That con­ver­sion step is called to­k­eniza­tion. A to­k­enizer takes a string and pro­duces a se­quence of in­te­gers, where each in­te­ger points to an en­try in a fixed vo­cab­u­lary. Modern LLM vo­cab­u­lar­ies usu­ally con­tain tens of thou­sands to a few hun­dred thou­sand en­tries.

Tiny ex­plainer: to­ken ID A to­ken ID is the in­te­ger the model uses for one vo­cab­u­lary en­try. The model works with the num­ber, not the writ­ten word it­self.

Tokens aren’t usually whole words. They’re usually subword pieces. The word “tokenization” might split into [“token”, “ization”]. The word “running” might split into [“run”, “ning”]. The reason is efficiency. Whole-word vocabularies are too big and don’t generalize to new words. Character-level vocabularies are too small and force the model to learn even the simplest patterns from scratch. Subword tokenization sits in the middle. The most common pieces become single tokens, and rare or novel words get composed from smaller pieces.

Tiny ex­plainer: vo­cab­u­lary The vo­cab­u­lary is the to­k­eniz­er’s fixed list of pieces. Each piece has an ID, and the model can only di­rectly re­ceive IDs from that list.

The trade-off shows up in places people don’t expect. The classic example: ask an LLM how many R’s are in “strawberry.” LLMs used to get it wrong. That’s not the model failing at counting. It’s the model not operating on letters directly, only token IDs that happen to spell out a word a human would split letter by letter.

Different model fam­i­lies use dif­fer­ent to­k­eniz­ers. GPT mod­els use Byte Pair Encoding vari­ants. SentencePiece is com­mon in LLaMA-style mod­els. The choice mat­ters for com­pute (fewer to­kens means less work) and for things like mul­ti­lin­gual cov­er­age, but the ba­sic shape is the same. Text in, in­te­gers out.
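To make the idea concrete, here is a minimal sketch of a subword tokenizer in Python. It is not how BPE or SentencePiece actually work (those learn their vocabularies and merge rules from data). It is a greedy longest-match splitter over a tiny vocabulary I made up for illustration, and the names `VOCAB`, `tokenize`, and `detokenize` are hypothetical.

```python
# Toy subword tokenizer: greedy longest-match over a made-up vocabulary.
# Real tokenizers learn their vocabulary from data; this one is hand-written.
VOCAB = ["token", "ization", "run", "ning", "straw", "berry"] + list(
    "abcdefghijklmnopqrstuvwxyz"
)
TOKEN_TO_ID = {piece: i for i, piece in enumerate(VOCAB)}


def tokenize(text: str) -> list[int]:
    """Split text into the longest matching vocabulary pieces, left to right."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until something matches.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches {text[pos]!r}")
    return ids


def detokenize(ids: list[int]) -> str:
    """Map token IDs back to their pieces and join them."""
    return "".join(VOCAB[i] for i in ids)


if __name__ == "__main__":
    for word in ["tokenization", "running", "strawberry"]:
        ids = tokenize(word)
        print(word, "->", ids, [VOCAB[i] for i in ids])
```

With this vocabulary, “strawberry” reaches the model as two IDs, not ten letters, which is the same reason letter-counting questions are awkward for a model that only ever sees token IDs.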

Now that the prompt is a se­quence of in­te­gers, the next step is to give those in­te­gers mean­ing.

Embeddings

A to­ken ID like 1024 is just a row in­dex. It does­n’t mean any­thing by it­self. The thing that gives it mean­ing is a gi­ant table called the em­bed­ding ma­trix.

Every model has one. It has one row per en­try in the vo­cab­u­lary, and each row is a long vec­tor of num­bers. The length of each row is the mod­el’s hid­den size. In many 7B-class mod­els, that means 4,096 num­bers per to­ken. Larger mod­els usu­ally use wider vec­tors.

Tiny ex­plainer: vec­tor A vec­tor is a list of num­bers. In a trans­former, each to­ken be­comes a vec­tor so the model can do math with it.

When the tokenizer hands the model an integer, the model looks up that row and uses the vector instead. That vector is the token’s embedding. It’s the model’s representation of what that token “means,” learned during training.

Tiny ex­plainer: em­bed­ding ma­trix The em­bed­ding ma­trix is a lookup table. Token ID in, learned vec­tor out.

The interesting property of these embeddings is that semantically similar tokens end up with similar vectors. The vector for “king” is close in space to the vector for “queen,” and the vector for “Paris” is close to “France.” None of this is hard-coded. It emerges from training on enough text, and the model learns these positions because they let it predict text well.

You can do arith­metic on em­bed­dings and it some­times works. The fa­mous ex­am­ple is king − man + woman ≈ queen. The geom­e­try of em­bed­ding space car­ries real se­man­tic struc­ture, even though no­body told the model to build it that way.
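Here is a small sketch of the lookup and the arithmetic in Python. The vectors are hypothetical: three numbers each, hand-picked by me so the example works, where a real model learns thousands of numbers per token. The helper names (`embed`, `cosine`, `nearest`) are my own, and excluding the input words from the search is a common convention in these demos, not something the model does.

```python
import math

# Hypothetical 3-number embeddings, hand-picked for illustration only.
VOCAB = ["king", "queen", "man", "woman", "banana"]
EMBEDDING_MATRIX = [
    [0.9, 0.8, 0.1],   # king
    [0.9, 0.1, 0.9],   # queen
    [0.1, 0.9, 0.1],   # man
    [0.1, 0.1, 0.9],   # woman
    [-0.5, 0.4, 0.3],  # banana
]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}


def embed(token: str) -> list[float]:
    """Token ID in, vector out: the embedding matrix is just a lookup table."""
    return EMBEDDING_MATRIX[TOKEN_TO_ID[token]]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(vector: list[float], exclude: tuple = ()) -> str:
    """Return the vocabulary token whose embedding points closest to vector."""
    candidates = [t for t in VOCAB if t not in exclude]
    return max(candidates, key=lambda t: cosine(vector, embed(t)))


# king - man + woman, computed element by element.
target = [k - m + w for k, m, w in zip(embed("king"), embed("man"), embed("woman"))]

if __name__ == "__main__":
    print(nearest(target, exclude=("king", "man", "woman")))
```

In this toy space the nearest remaining token is “queen”, but only because I arranged the numbers that way. In a trained model the same structure shows up without anyone arranging it.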

Worth being clear on: at this stage every token has been replaced by its embedding, but the embedding alone says nothing about where the token sits in the sequence. The vector for “dog” is the same vector whether “dog” is the first word in your prompt or the fifth. That’s a problem.

That’s the gap po­si­tional en­cod­ing fills.

Positional en­cod­ing

Plain self-attention doesn’t have a built-in representation of word order. Without some positional signal, it has no direct way to know that “dog” came before “bites” instead of after it.

Word or­der changes mean­ing. So the model needs an­other piece. It needs a way to in­ject the po­si­tion of each to­ken into the math.

Tiny ex­plainer: po­si­tional en­cod­ing Positional en­cod­ing is how the model gets or­der in­for­ma­tion. It tells the model where each to­ken sits in the se­quence.

The original transformer paper (Vaswani et al. 2017) did this by giving each position its own pattern of numbers and adding it directly to each token’s embedding before any other processing. Position 1 had one pattern, position 5 had a different pattern, position 100 had another. The patterns came from sine and cosine waves at different frequencies. Now the embedding for “dog” at position 1 was different from the embedding for “dog” at position 5, just because the position pattern added to it was different.

That worked, and sinusoidal encodings were chosen partly because the authors hypothesized they might let the model extrapolate to sequence lengths longer than those seen during training. But additive position schemes still had two problems that became important as models scaled up.

First, the em­bed­ding had to carry both mean­ing and po­si­tion in the same set of num­bers. There’s only so much you can pack in.

Second, learned ab­solute po­si­tion em­bed­dings in par­tic­u­lar don’t gen­er­al­ize cleanly. If you trained on prompts up to 2,048 to­kens long, the model never saw po­si­tion 5,000 dur­ing train­ing, and the em­bed­ding for that po­si­tion was not learned in the same way.
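The additive sinusoidal scheme from the original paper can be sketched in a few lines of Python. The formula (sine on even dimensions, cosine on odd ones, with frequencies set by the constant 10000) is from Vaswani et al. 2017. The 4-number “dog” embedding and the function names are hypothetical, chosen just to keep the example small.

```python
import math


def sinusoidal_position(pos: int, d_model: int) -> list[float]:
    """Position pattern from Vaswani et al. 2017.

    Even dimension i gets sin(pos / 10000^(i / d_model)),
    the following odd dimension gets the matching cosine.
    """
    pattern = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pattern.append(math.sin(angle))
        pattern.append(math.cos(angle))
    return pattern[:d_model]


def add_position(embedding: list[float], pos: int) -> list[float]:
    """Add the position pattern to a token embedding, element by element."""
    pattern = sinusoidal_position(pos, len(embedding))
    return [e + p for e, p in zip(embedding, pattern)]


# Hypothetical 4-number embedding for "dog"; real models use thousands.
dog = [0.2, -0.1, 0.4, 0.7]

if __name__ == "__main__":
    print(add_position(dog, 1))
    print(add_position(dog, 5))
```

The two printed vectors differ even though the token is the same, which is the whole point. It also shows the first problem directly: meaning and position end up mixed into the same numbers.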

Modern mod­els mostly use a dif­fer­ent scheme called Rotary Position Embeddings (RoPE), in­tro­duced by Su et al. in 2021 and now used in LLaMA, Mistral, Gemma, Qwen, and most other open-weight fam­i­lies. The in­tu­ition: in­stead of adding po­si­tion info to each to­ken’s vec­tor, RoPE ro­tates the Query and Key vec­tors by an an­gle that de­pends on the to­ken’s po­si­tion. A to­ken at po­si­tion 1 gets a small turn, a to­ken at po­si­tion 100 gets a big­ger turn. When two to­kens are later com­pared dur­ing at­ten­tion, what mat­ters is the dif­fer­ence be­tween their Query and Key ro­ta­tions, which en­codes how far apart they are.

Tiny ex­plainer: RoPE RoPE stands for Rotary Position Embeddings. Instead of adding a po­si­tion vec­tor, it ro­tates Query and Key vec­tors so rel­a­tive dis­tance shows up dur­ing at­ten­tion.

The prac­ti­cal ad­van­tages are real. RoPE en­codes rel­a­tive po­si­tion nat­u­rally (which is closer to what at­ten­tion ac­tu­ally wants). It gen­er­al­izes bet­ter to longer con­texts. And it does­n’t add new pa­ra­me­ters to the model.

Even with good positional encoding, modern LLMs have a documented “lost in the middle” problem (Liu et al. 2023). They use information at the start and end of long prompts more reliably than information buried in the middle. That’s why prompt engineering tips like “put important context first” or “repeat key info at the end” actually help. The model isn’t using every part of your prompt equally well.

With token meaning and position both encoded, the next question is: how do tokens actually exchange information?

Attention

This is the mechanism that gave the original transformer paper its title, “Attention Is All You Need.”

Inside every trans­former layer, at­ten­tion does one thing. It lets each to­ken look at the other to­kens it is al­lowed to see and de­cide which ones mat­ter for what comes next.

It does this by giv­ing each to­ken three roles at once. Each to­ken gets trans­formed into three new vec­tors, called Query, Key, and Value (Q, K, V).

Tiny explainer: Q, K, V Query means “what am I looking for,” Key means “what do I match with,” and Value is the information that gets copied when the match is strong.

The Query asks, “what am I looking for from other tokens?”

The Key says, “this is what I offer to tokens looking at me.”

The Value carries, “this is what gets passed along when a match happens.”

The same to­ken plays all three roles at the same time. The Q, K, V trans­for­ma­tions are learned ma­tri­ces, so the model fig­ures out dur­ing train­ing what each to­ken should look for and what it should of­fer.

Matching hap­pens through a sim­i­lar­ity score. Each to­ken’s Query is com­pared against the Key of each to­ken it is al­lowed to see, us­ing a scaled dot prod­uct. Intuitively, this mea­sures how much the two vec­tors line up. The scal­ing keeps the num­bers sta­ble be­fore soft­max.
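Here is a rough way to see why the scaling matters, assuming random vectors as stand-ins for real Queries and Keys. The helper name and the trial count are arbitrary. Without scaling, typical scores grow with the vector size. Dividing by the square root of the size keeps them in about the same range.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def typical_score_size(dim, scaled, trials=1000, seed=0):
    """Average absolute Query-Key score for random vectors with `dim` numbers each."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(dim)]
        k = [rng.gauss(0, 1) for _ in range(dim)]
        score = dot(q, k)
        if scaled:
            score /= math.sqrt(dim)   # the "scaled" part of scaled dot product
        total += abs(score)
    return total / trials

for dim in (16, 128):
    print(dim, typical_score_size(dim, scaled=False), typical_score_size(dim, scaled=True))
```

Large raw scores would push softmax toward putting nearly all the weight on one token, which makes training harder.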

Tiny ex­plainer: dot prod­uct A dot prod­uct is a sim­ple way to score how aligned two vec­tors are. Higher align­ment means a stronger match.

The match scores then get turned into weights us­ing soft­max. Softmax takes any set of num­bers and turns them into a prob­a­bil­ity-like dis­tri­b­u­tion that sums to 1. Tokens with higher match scores get higher weights, and the weights are then used to take a weighted av­er­age of the value vec­tors.

Tiny ex­plainer: soft­max Softmax turns raw scores into weights that add up to 1. Big scores get big weights, small scores get small weights.

An example. Consider the sentence “The cat that I saw yesterday was sleeping.” When the model processes “was,” it needs to figure out which earlier token is the verb’s subject. The Query vector for “was” gets compared against the Key vectors of the tokens it is allowed to see. The dot product with “cat” is high, because the model has learned that verbs like “was” need a subject and that subjects like “cat” produce Key vectors that line up well. The dot product with “yesterday” is low. Softmax turns those scores into weights: “cat” gets a high weight, “yesterday” gets a low one. The model then takes a weighted sum of the corresponding value vectors, so the value for “cat” dominates the result. The new representation of “was” is now mostly shaped by the value of “cat.” That’s how a token several positions back becomes the referent.
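The same walk-through as a sketch in code. Every number here is hand-picked to make the example work. A real model learns its Query, Key, and Value vectors, and they have far more than two or three numbers each.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    biggest = max(scores)                          # subtracting the max avoids overflow
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["The", "cat", "that", "I", "saw", "yesterday", "was"]

# Hand-picked Keys. First number: "I could be a subject". Second: "I am a time word".
keys = [[0.2, 0.0], [3.0, 0.1], [0.3, 0.0], [1.0, 0.0],
        [0.1, 0.5], [0.0, 2.0], [0.1, 0.2]]
# Hand-picked Values, one per token.
values = [[0.0, 0.0, 0.1], [1.0, 0.0, 0.5], [0.0, 0.1, 0.0], [0.2, 0.8, 0.0],
          [0.0, 0.5, 0.5], [-1.0, 0.0, 1.0], [0.0, 0.0, 0.0]]

query_was = [2.0, -1.0]   # "was" wants a subject and does not want a time word

# Scaled dot product of the Query against every visible Key.
scale = math.sqrt(len(query_was))
scores = [sum(q * k for q, k in zip(query_was, key)) / scale for key in keys]
weights = softmax(scores)

# Weighted sum of the Values: the new representation of "was".
new_was = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(3)]

for token, weight in zip(tokens, weights):
    print(f"{token:>10} {weight:.3f}")
print(new_was)
```

With these numbers, “cat” takes most of the weight and “yesterday” almost none, so `new_was` lands close to the Value for “cat.”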

There’s a con­straint spe­cific to GPT-style lan­guage mod­els, which is that they gen­er­ate text left to right. A to­ken at po­si­tion 5 is only al­lowed to at­tend to po­si­tions 1 through 5. It can­not at­tend to to­kens at po­si­tions 6, 7, 8, be­cause those haven’t been gen­er­ated yet. This is called causal mask­ing. The im­ple­men­ta­tion is sim­ple: fu­ture to­kens get match scores so low they end up with ef­fec­tively zero weight af­ter soft­max.
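A minimal sketch of that masking step, assuming the raw match scores are already computed. The score table and the function name are made up. Setting future scores to negative infinity is a common way to do this, and it makes their weights exactly zero after softmax.

```python
import math

def causal_attention_weights(scores):
    """scores[i][j] is the raw match of token i's Query against token j's Key."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Positions after i are in the future, so they get a score of -infinity.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        biggest = max(masked)                         # always finite: j == i is visible
        exps = [math.exp(s - biggest) for s in masked]   # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

raw_scores = [
    [1.0, 2.0, 3.0],
    [0.5, 0.5, 9.0],   # the 9.0 is a strong match, but it points at the future
    [1.0, 1.0, 1.0],
]
for row in causal_attention_weights(raw_scores):
    print(row)
```

The first token can only attend to itself, and the strong score in the second row is ignored because it belongs to a token that comes later.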

Tiny ex­plainer: causal mask­ing Causal mask­ing hides fu­ture to­kens. It keeps a de­coder-only lan­guage model from look­ing ahead while pre­dict­ing the next to­ken.

One of the most interesting findings in interpretability research is about specialized attention heads called induction heads, found by Anthropic in 2022. These heads learn to spot patterns of the form “A B … A” in the prompt and predict that B comes next. When the model sees “A” the second time, the induction head looks back to where “A” appeared before, sees what came after, and copies that. They’re one of the clearest known mechanisms behind in-context learning, the ability of an LLM to pick up a pattern from your prompt and continue it.
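The rule an induction head approximates can be written as a plain lookup. To be clear, this is only the behavior, not the mechanism: a real induction head does this with learned attention weights, and `induction_guess` is a hypothetical helper.

```python
def induction_guess(tokens):
    """Toy version of the 'A B ... A -> B' rule over a list of tokens."""
    current = tokens[-1]
    # Walk backwards through earlier positions, looking for the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]   # copy whatever followed it last time
    return None                    # never seen before, so the rule has no guess

prompt = ["red", "apple", "green", "pear", "red"]
print(induction_guess(prompt))   # apple
```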

Tiny ex­plainer: in­duc­tion head An in­duc­tion head is an at­ten­tion head that no­tices re­peated pat­terns in the prompt and helps con­tinue them.

Attention has one big cost. In full at­ten­tion, each to­ken com­pares against all the to­kens it is al­lowed to see, so dou­bling the prompt length roughly quadru­ples the work. This is why long prompts are ex­pen­sive to run, and why a lot of re­cent re­search is about mak­ing at­ten­tion more ef­fi­cient (FlashAttention, sparse at­ten­tion, lin­ear at­ten­tion).
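The arithmetic behind that, as a small sketch. It counts Query-Key comparisons for a single causal attention pass (one head in one layer), assuming each token compares against itself and every earlier token.

```python
def causal_comparisons(num_tokens):
    """Token 1 makes 1 comparison, token 2 makes 2, and so on."""
    return num_tokens * (num_tokens + 1) // 2

for n in (1_000, 2_000, 4_000):
    print(n, causal_comparisons(n))
# 1000 500500
# 2000 2001000
# 4000 8002000
```

Each doubling of the prompt multiplies the count by almost exactly four.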

But one at­ten­tion head only gives the model one learned view of those re­la­tion­ships.

Multi-head at­ten­tion

A sin­gle at­ten­tion pass gives the model one way of de­cid­ing which to­kens mat­ter to which other to­kens. That’s not enough. Language has many re­la­tion­ships hap­pen­ing at the same time. Subject and verb agree­ment. Pronouns and the names they re­fer to. Long-range ref­er­ences be­tween sen­tences. Word or­der and lo­cal phrases.

Multi-head at­ten­tion solves this by run­ning at­ten­tion many times in par­al­lel, with each par­al­lel pass op­er­at­ing in its own smaller space. Each par­al­lel pass is called a head.

Tiny ex­plainer: at­ten­tion head An at­ten­tion head is one in­de­pen­dent at­ten­tion pass with its own learned pro­jec­tions.

Here’s the part that’s often described wrong, including in plenty of tutorials. Each head doesn’t get a literal slice of the original token vector. Each head has its own learned projection matrices that map the full token vector down to its own smaller Q, K, and V vectors. So if a model has 4,096 numbers per token and 32 heads, each head usually works in a 128-dimensional space, but those 128 numbers are a learned projection of the full 4,096, not a fixed slice. “Different views” of the same token, not different chunks of it.

Each head runs its at­ten­tion pass in­de­pen­dently. Then the out­puts of all the heads get con­cate­nated and passed through a fi­nal lin­ear layer that mixes them back into one full-size vec­tor. The model learns that fi­nal mix­ing too.
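A small sketch of the whole multi-head step, using 8 numbers per token and 2 heads instead of 4,096 and 32. The random matrices stand in for weights a real model would learn, and all names are illustrative. The thing to notice is that every head’s projection reads the full token vector.

```python
import math
import random

rng = random.Random(0)

def random_matrix(rows, cols):
    """Stand-in for a learned weight matrix."""
    return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

def project(vec, matrix):
    """Multiply a vector (length = rows) by a rows x cols matrix."""
    return [sum(v * matrix[r][c] for r, v in enumerate(vec)) for c in range(len(matrix[0]))]

def softmax(scores):
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

D_MODEL, NUM_HEADS = 8, 2
HEAD_DIM = D_MODEL // NUM_HEADS

# Each head owns three projections, and each one takes all D_MODEL numbers as input.
heads = [
    {name: random_matrix(D_MODEL, HEAD_DIM) for name in ("q", "k", "v")}
    for _ in range(NUM_HEADS)
]
w_out = random_matrix(NUM_HEADS * HEAD_DIM, D_MODEL)   # the final mixing layer

def multi_head_attention(token_vectors):
    per_head_outputs = []
    for head in heads:
        qs = [project(t, head["q"]) for t in token_vectors]
        ks = [project(t, head["k"]) for t in token_vectors]
        vs = [project(t, head["v"]) for t in token_vectors]
        outputs = []
        for t, q in enumerate(qs):
            seen_ks, seen_vs = ks[: t + 1], vs[: t + 1]   # causal: no looking ahead
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(HEAD_DIM) for k in seen_ks]
            weights = softmax(scores)
            outputs.append([sum(w * v[i] for w, v in zip(weights, seen_vs))
                            for i in range(HEAD_DIM)])
        per_head_outputs.append(outputs)
    # Concatenate the heads for each token, then mix back to one full-size vector.
    result = []
    for t in range(len(token_vectors)):
        concatenated = [x for outputs in per_head_outputs for x in outputs[t]]
        result.append(project(concatenated, w_out))
    return result

tokens = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(3)]
mixed = multi_head_attention(tokens)
print(len(mixed), len(mixed[0]))   # 3 8
```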

What makes this in­ter­est­ing is that dif­fer­ent heads of­ten end up par­tially spe­cial­ized. The model is never told what each head should do. Specialization emerges nat­u­rally dur­ing train­ing. Researchers have found heads that track gram­mar (linking verbs to their ob­jects, ar­ti­cles to their nouns), heads that fig­ure out which pro­noun refers to which name, heads that track po­si­tional pat­terns, in­duc­tion heads, and many more. A sin­gle trans­former layer might have 32 heads. A mod­ern fron­tier model has dozens of lay­ers. So a typ­i­cal LLM has thou­sands of at­ten­tion heads in to­tal, each adding its own learned view.

There’s a prac­ti­cal cost con­cern that drove a re­cent ar­chi­tec­tural change. Each head needs to keep its Key and Value vec­tors in mem­ory for all the to­kens al­ready gen­er­ated, so that when a new to­ken gets gen­er­ated the model does­n’t have to re­com­pute every­thing from scratch. This is called the KV cache, and it’s the main mem­ory cost of run­ning an LLM at long con­text lengths.
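A toy sketch of the caching idea for one head in one layer. `make_key_value` is a hypothetical stand-in for the learned Key and Value projections. The counter shows how much projection work the cache avoids over 100 generated tokens.

```python
def make_key_value(token_vector):
    """Hypothetical stand-in for a head's learned Key and Value projections."""
    key = [2.0 * x for x in token_vector]
    value = [x + 1.0 for x in token_vector]
    return key, value

class KVCache:
    """Keeps the Key and Value vectors of every token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []
        self.projections_done = 0

    def add_token(self, token_vector):
        key, value = make_key_value(token_vector)   # only the new token is projected
        self.projections_done += 1
        self.keys.append(key)
        self.values.append(value)

def projections_without_cache(num_tokens):
    """Recomputing Keys and Values for the whole sequence at every step."""
    return sum(range(1, num_tokens + 1))

cache = KVCache()
for step in range(100):
    cache.add_token([float(step), 0.5])

print(cache.projections_done, projections_without_cache(100))   # 100 5050
print(len(cache.keys))   # 100 Key vectors held in memory
```

The cache trades compute for memory: fewer projections, but every stored vector stays resident for the rest of the generation.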

Tiny ex­plainer: KV cache The KV cache stores old Key and Value vec­tors dur­ing gen­er­a­tion. It saves the model from re­com­put­ing the whole prompt every time it adds a to­ken.

Modern de­coder-only LLMs mostly use a vari­ant called Grouped-Query Attention (GQA). Instead of every head hav­ing its own keys and val­ues, groups of heads share the same key and value heads. LLaMA-2 70B has 64 query heads but only 8 key/​value heads. Mistral 7B has 32 query heads and 8 key/​value heads. The re­sult is nearly the same ac­cu­racy as full multi-head at­ten­tion but with much less mem­ory pres­sure and in­fer­ence cost.
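A back-of-the-envelope sketch of the memory saving. The head counts for LLaMA-2 70B come from the paragraph above. The layer count (80), head size (128), 16-bit numbers, and 4,096-token context are assumptions added for this example.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_number=2):
    """Bytes needed to cache Keys and Values for one sequence."""
    keys_and_values = 2   # one Key vector and one Value vector per token
    return keys_and_values * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_number

# Assumed shape: 80 layers, 128 numbers per head, 4,096 tokens, 2 bytes per number.
full_mha = kv_cache_bytes(80, 64, 128, 4096)   # if all 64 heads kept their own K and V
gqa = kv_cache_bytes(80, 8, 128, 4096)         # with 8 shared key/value heads

print(full_mha / 2**30, gqa / 2**30)   # 10.0 1.25 (GiB)
```

Under these assumptions the cache shrinks by a factor of 8, which is the ratio of query heads to key/value heads.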

Tiny ex­plainer: GQA Grouped-Query Attention lets mul­ti­ple query heads share fewer key/​value heads. That cuts KV-cache mem­ory while keep­ing many query views.

Feed-forward net­work

After at­ten­tion fin­ishes mix­ing in­for­ma­tion be­tween to­kens, every layer has a sec­ond step that no­body talks about as much. The feed-for­ward net­work.
