10 interesting stories served every morning and every evening.




1 1,527 shares, 60 trendiness

20 Years of Digital Life, Gone in an Instant, thanks to Apple

Summary: A ma­jor brick-and-mor­tar store sold an Apple Gift Card that Apple seem­ingly took of­fence to, and locked out my en­tire Apple ID, ef­fec­tively brick­ing my de­vices and my iCloud Account, Apple Developer ID, and every­thing as­so­ci­ated with it, and I have no re­course. Can you help? Email paris AT paris.id.au (and read on for the de­tails). ❤️

Update 14 December 2025: Someone from Executive Relations at Apple says they’re looking into it. I hope this is true. They say they’ll call me back tomorrow, on 15 December 2025. In the meantime, it’s been covered by Daring Fireball, Apple Insider, Michael Tsai, and others, thanks folks! I’ve received 100s of emails of support, and will reply to you all in time, thank you. Fingers crossed Apple calls back.

I am writ­ing this as a des­per­ate mea­sure. After nearly 30 years as a loyal cus­tomer, au­thor­ing tech­ni­cal books on Apple’s own pro­gram­ming lan­guages (Objective-C and Swift), and spend­ing tens upon tens upon tens of thou­sands of dol­lars on de­vices, apps, con­fer­ences, and ser­vices, I have been locked out of my per­sonal and pro­fes­sional dig­i­tal life with no ex­pla­na­tion and no re­course.

My Apple ID, which I have held for around 25 years (it was orig­i­nally a user­name, be­fore they had to be email ad­dresses; it’s from the iTools era), has been per­ma­nently dis­abled. This is­n’t just an email ad­dress; it is my core dig­i­tal iden­tity. It holds ter­abytes of fam­ily pho­tos, my en­tire mes­sage his­tory, and is the key to sync­ing my work across the ecosys­tem.

The Trigger: The only recent activity on my account was an attempt to redeem a $500 Apple Gift Card to pay for my 6TB iCloud+ storage plan. The code failed. The vendor suggested that the card number was likely compromised and agreed to reissue it. Shortly after, my account was locked. An Apple Support representative suggested that this was the cause of the issue, indicating that something was likely untoward about the card. The card was purchased from a major brick-and-mortar retailer (Australians, think Woolworths scale; Americans, think Walmart scale), so if I cannot rely on the provenance of that, and have no recourse, what am I meant to do? We have even sent Apple the receipt, indicating the card’s serial number and purchase location.

The Consequence: My account is flagged as closed “in accordance with the Apple Media Services Terms and Conditions”.

The Damage: I effectively have over $30,000 worth of previously-active “bricked” hardware. My iPhone, iPad, Watch, and Macs cannot sync, update, or function properly. I have lost access to thousands of dollars in purchased software and media. Apple representatives claim that only the “Media and Services” side of my account is blocked, but now my devices have signed me out of iMessage (and I can’t sign back in), and I can’t even sign out of the blocked iCloud account because… it’s barred from the sign-out API, as far as I can tell. I can’t even log in to the “Secure File Transfer” system Apple uses to exchange information, because it relies on an Apple ID. Most of the ways Apple has suggested seeking help from them involve signing in to an Apple service to upload something, or communicate with them. This doesn’t work as the account is locked.

I can’t even download my iCloud Photos, as: there are repeated auth errors on my account, so I can’t make Photos work; and I don’t have a 6TB device to sync them to, even if I could.

No Information: Support staff refused to tell me why the account was banned or provide specific details on the decision.

No Escalation: When I begged for an escalation to Executive Customer Relations (ECR), noting that I would lose the ability to do my job and that my devices were useless, I was told that an additional escalation “won’t lead to a different outcome”. Many of the reps I’ve spoken to have suggested strange things; one of the strangest was telling me that I could physically go to Apple’s Australian HQ at Level 3, 20 Martin Place, Sydney, and plead my case. They even put me on hold for 5 minutes while they looked up the address.

Most insultingly, the official advice from the Senior Advisor was to “create a new Apple account… and update the payment information”.

The Legal Catch: Apple’s Terms and Conditions rely on “Termination of Access.” By closing my account, they have revoked my license to use their services.

The Technical Trap: If I follow their advice and create a new account on my current devices (which are likely hardware-flagged due to the gift card error), the new account will likely be linked to the banned one and disabled for circumventing security measures.

The Developer Risk: As a professional Apple Developer, attempting to “dodge” a ban by creating a new ID could lead to my Developer Program membership being permanently blacklisted, amongst other things.

I am not a casual user. I have literally written the book on Apple development (taking over the Learning Cocoa with Objective-C series, which Apple themselves used to write, for O’Reilly Media, and then 20+ books following that). I help run the longest-running Apple developer event not run by Apple themselves, /dev/world. I have effectively been an evangelist for this company’s technology for my entire professional life. We had an app on the App Store on Day 1, in every sense of the word.

I am ask­ing for a hu­man at Apple to re­view this case. I sus­pect an au­to­mated fraud flag re­gard­ing the bad gift card trig­gered a nu­clear re­sponse that front­line sup­port can­not over­ride. I have es­ca­lated this through my many friends in WWDR and SRE at Apple, with no suc­cess.

I am des­per­ate to re­solve this and re­store my dig­i­tal life. If you can help, please email paris AT paris.id.au

...

Read the original on hey.paris »

2 1,057 shares, 38 trendiness

frontier intelligence built for speed

Gemini 3 Flash is our lat­est model with fron­tier in­tel­li­gence built for speed that helps every­one learn, build, and plan any­thing — faster.

Senior Director, Product Management, on be­half of the Gemini team

Google is re­leas­ing Gemini 3 Flash, a fast and cost-ef­fec­tive model built for speed. You can now ac­cess Gemini 3 Flash through the Gemini app and AI Mode in Search. Developers can ac­cess it via the Gemini API in Google AI Studio, Google Antigravity, Gemini CLI, Android Studio, Vertex AI and Gemini Enterprise.

Summaries were gen­er­ated by Google AI. Generative AI is ex­per­i­men­tal.

It’s great for cod­ing, com­plex analy­sis, and quick an­swers in in­ter­ac­tive apps.

Gemini 3 Flash is now the de­fault model in the Gemini app and AI Mode in Search.

Developers and every­day users can ac­cess Gemini 3 Flash via var­i­ous Google plat­forms.


...

Read the original on blog.google »

3 1,014 shares, 41 trendiness

AWS CEO Explains 3 Reasons AI Can’t Replace Junior Devs

AWS CEO Matt Garman outlined 3 solid reasons why companies should not focus on cutting junior developer roles, noting that they are actually “the most experienced with the AI tools”.

In a tech world ob­sessed with AI re­plac­ing hu­man work­ers, Matt Garman, CEO of Amazon Web Services (AWS), is push­ing back against one of the in­dus­try’s most pop­u­lar cost-cut­ting ideas.

Speaking on WIREDs The Big Interview pod­cast, Garman has a bold mes­sage for com­pa­nies rac­ing to cut costs with AI.

He was asked to explain why he once called replacing junior employees with AI “one of the dumbest ideas” he’d ever heard, and to expand on how he believes agentic AI will actually change the workplace in the coming years.

First, ju­nior em­ploy­ees are of­ten bet­ter with AI tools than se­nior staff.

Fresh grads have grown up with new tech­nol­ogy, so they can adapt quickly. Many of them learn AI-powered tools while study­ing or dur­ing in­tern­ships. They tend to ex­plore new fea­tures, find quick meth­ods to write code, and fig­ure out how to get the best re­sults from AI agents.

According to the 2025 Stack Overflow Developer Survey, 55.5% of early-ca­reer de­vel­op­ers re­ported us­ing AI tools daily in their de­vel­op­ment process, higher than for the ex­pe­ri­enced folks.

This com­fort with new tools al­lows them to work more ef­fi­ciently. In con­trast, se­nior de­vel­op­ers have es­tab­lished work­flows and may take more time to adopt. Recent re­search shows that over half of Gen Z em­ploy­ees are ac­tu­ally help­ing se­nior col­leagues up­skill in AI.

Second, ju­nior staff are usu­ally the least ex­pen­sive em­ploy­ees.

Junior em­ploy­ees usu­ally get much less in salary and ben­e­fits, so re­mov­ing them does not de­liver huge sav­ings. If a com­pany is try­ing to save money, it does­n’t make that much fi­nan­cial sense.

So, when companies talk about increasing profit margins, junior employees should not be the default or only target. Real cost-cutting means looking at the whole company, because there are plenty of other places where expenses can be trimmed.

In fact, 30% of com­pa­nies that laid off work­ers ex­pect­ing sav­ings ended up in­creas­ing ex­penses, and many had to re­hire later.

Think of a com­pany like a sports team. If you only keep vet­eran play­ers and never re­cruit rook­ies, what hap­pens when those vet­er­ans re­tire? You are left with no one who knows how to play the game.

Also, hiring people straight out of college brings new ways of thinking into the workplace. They have fresh ideas shaped by the latest trends and the motivation to innovate.

More im­por­tantly, they form the foun­da­tion of a com­pa­ny’s fu­ture work­force. If a com­pany de­cides to stop hir­ing ju­nior em­ploy­ees al­to­gether, it cuts off its own tal­ent pipeline. Over time, that leads to fewer lead­ers to pro­mote from within.

A Deloitte report also notes that the tech workforce is expected to grow at roughly twice the rate of the overall U.S. workforce, highlighting the demand for tech talent. Without a strong pipeline of junior developers coming in, companies might face a tech talent shortage.

When there are not enough ju­nior hires be­ing trained to­day, teams strug­gle to fill roles to­mor­row, es­pe­cially as pro­jects scale.

This isn’t just corporate talk. As the leader of one of the world’s largest cloud computing platforms, serving everyone from Netflix to the U.S. intelligence agencies, Garman has a front-row seat to how companies are actually using AI.

And what he is see­ing makes him wor­ried that short-term think­ing could dam­age busi­nesses for years to come. Garman’s point is grounded in long-term strat­egy. A com­pany that re­lies solely on AI to han­dle tasks with­out train­ing new tal­ent could find it­self short of peo­ple.

Still, Garman admits the next few years will be bumpy. “Your job is going to change,” he said. He believes AI will make both companies and their employees more productive.

When tech­nol­ogy makes some­thing eas­ier, peo­ple want more of it. AI en­ables the cre­ation of soft­ware faster, al­low­ing com­pa­nies to de­velop more prod­ucts, en­ter new mar­kets, and serve more cus­tomers.

Developers will be re­spon­si­ble for more than just writ­ing code, with faster adap­ta­tion to new tech­nolo­gies be­com­ing es­sen­tial. But he has a hope­ful mes­sage in the end.

That’s why Geoffrey Hinton has ad­vised that Computer Science de­grees re­main es­sen­tial. This di­rectly sup­ports Matt Garman’s point. Fresh tal­ent with a strong un­der­stand­ing of core fun­da­men­tals be­comes cru­cial for fill­ing these higher-value roles of the fu­ture.

“I’m very confident in the medium to longer term that AI will definitely create more jobs than it removes at first,” Garman said.

...

Read the original on www.finalroundai.com »

4 880 shares, 28 trendiness

📝 Is Mozilla trying hard to kill itself?

It may be just me, but I read this as “I don’t want to 😜 😜 but I’ll kill AdBlockers in Firefox for buckerinos 😂”. This disappoints and saddens me a lot, and I hope I’m wrong.

...

Read the original on infosec.press »

5 838 shares, 31 trendiness

ALPR Watch – Track Surveillance Tech in Local Government

Your lo­cal gov­ern­ment might be dis­cussing sur­veil­lance tech like Flock cam­eras, fa­cial recog­ni­tion, or au­to­mated li­cense plate read­ers right now. This map helps you find those meet­ings and take ac­tion.

Why this matters: Municipalities across the US are quietly adopting surveillance technologies in rapidly growing numbers, with over 80,000 cameras already out on the streets. These systems track residents’ movements, collect biometric data, and build massive databases of our daily lives.

alpr.watch scans meeting agendas for keywords like “flock,” “license plate reader,” “alpr,” and more. Each pin on the map shows where these conversations are happening so that you can make a difference.


Data be­fore mid-De­cem­ber may be un­ver­i­fied. All fu­ture flags are 100% mod­er­a­tor ap­proved.

Automated License Plate Recognition (ALPR) sys­tems use cam­eras and ar­ti­fi­cial in­tel­li­gence to cap­ture, read, and store li­cense plate data from every pass­ing ve­hi­cle.

These sys­tems work 24/7 cre­at­ing a mas­sive data­base of where ve­hi­cles, and by ex­ten­sion, peo­ple, travel. Every trip to the gro­cery store, doc­tor’s of­fice, or place of wor­ship gets recorded and stored.

Flock Safety is one of the largest man­u­fac­tur­ers of ALPR cam­eras in the United States, mar­ket­ing their sys­tems to neigh­bor­hoods and law en­force­ment.

Flock cam­eras cap­ture li­cense plates, ve­hi­cle make/​model, color, and other iden­ti­fy­ing fea­tures. This data is shared across a mas­sive net­work of agen­cies and ju­ris­dic­tions, cre­at­ing a sur­veil­lance web that tracks mil­lions of Americans.

History shows that sur­veil­lance sys­tems ex­pand be­yond their orig­i­nal scope:

Systems marketed for “solving crimes” get used for immigration enforcement

These groups and in­di­vid­u­als are lead­ing the fight against mass sur­veil­lance. Consider sup­port­ing their work or get­ting in­volved lo­cally.

...

Read the original on alpr.watch »

6 791 shares, 32 trendiness

An extremely fast Python type checker and language server

TL;DR: ty is an extremely fast Python type checker and language server, written in Rust, and designed as an alternative to tools like mypy, Pyright, and Pylance.

Today, we’re an­nounc­ing the Beta re­lease of ty. We now use ty ex­clu­sively in our own pro­jects and are ready to rec­om­mend it to mo­ti­vated users for pro­duc­tion use.

At Astral, we build high-performance developer tools for the Python ecosystem. We’re best known for uv, our Python package manager, and Ruff, our linter and formatter.

Today, we’re announcing the Beta release of the next tool in the Astral toolchain: ty, an extremely fast Python type checker and language server, written in Rust.

ty was designed from the ground up to power a language server. The entire ty architecture is built around “incrementality”, enabling us to selectively re-run only the necessary computations when a user (e.g.) edits a file or modifies an individual function. This makes live updates extremely fast in the context of an editor or long-lived process.

You can install ty today with uv tool install ty@latest, or via our VS Code extension.

Like Ruff and uv, ty’s im­ple­men­ta­tion was grounded in some of our core prod­uct prin­ci­ples:

An ob­ses­sive fo­cus on per­for­mance. Without caching, ty is con­sis­tently be­tween 10x and 60x faster than mypy and Pyright. When run in an ed­i­tor, the gap is even more dra­matic. As an ex­am­ple, af­ter edit­ing a load-bear­ing file in the PyTorch repos­i­tory, ty re­com­putes di­ag­nos­tics in 4.7ms: 80x faster than Pyright (386ms) and 500x faster than Pyrefly (2.38 sec­onds). ty is very fast!

Correct, pragmatic, and ergonomic. With features like first-class intersection types, advanced type narrowing, and sophisticated reachability analysis, ty pushes forward the state of the art in Python type checking, providing more accurate feedback and avoiding assumptions about user intent that often lead to false positives. Our goal with ty is not only to build a faster type checker; we want to build a better type checker, and one that balances correctness with a deep focus on the end-user experience.

Built in the open. ty was built by our core team alongside dozens of active contributors under the MIT license, and the same goes for our editor extensions. You can run ty anywhere that you write Python (including in the browser).

Even com­pared to other Rust-based lan­guage servers like Pyrefly, ty can run or­ders of mag­ni­tude faster when per­form­ing in­cre­men­tal up­dates on large pro­jects.

ty also includes a best-in-class diagnostic system, inspired by the Rust compiler’s own world-class error messages. A single ty diagnostic can pull in context from multiple files at once to explain not only what’s wrong, but why (and, often, how to fix it).

Diagnostic out­put is the pri­mary user in­ter­face for a type checker; we pri­or­i­tized our di­ag­nos­tic sys­tem from the start (with both hu­mans and agents in mind) and view it as a first-class fea­ture in ty.

If you use VS Code, Cursor, or a similar editor, we recommend installing the ty VS Code extension. The ty language server supports all the capabilities that you’d expect for a modern language server (Go to Definition, Symbol Rename, Auto-Complete, Auto-Import, Semantic Syntax Highlighting, Inlay Hints, etc.), and runs in any editor that implements the Language Server Protocol.

Following the Beta release, our immediate priority is supporting early adopters. From there, we’re working towards a Stable release next year, with the gap between the Beta and Stable milestones largely focusing on: (1) stability and bug fixes, (2) completing the long tail of features in the Python typing specification, and (3) first-class support for popular third-party libraries like Pydantic and Django.

On a longer time hori­zon, though, ty will power se­man­tic ca­pa­bil­i­ties across the Astral tool­chain: dead code elim­i­na­tion, un­used de­pen­dency de­tec­tion, SemVer-compatible up­grade en­force­ment, CVE reach­a­bil­ity analy­sis, type-aware lint­ing, and more (including some that are too am­bi­tious to say out loud just yet).

We want to make Python the most productive programming ecosystem on Earth. Just as with Ruff and uv, our commitment from here is that ty will get significantly better every week by working closely with our users. Thank you for building with us.

ty is the most so­phis­ti­cated prod­uct we’ve built, and its de­sign and im­ple­men­ta­tion have sur­faced some of the hard­est tech­ni­cal prob­lems we’ve seen at Astral. Working on ty re­quires a deep un­der­stand­ing of type the­ory, Python run­time se­man­tics, and how the Python ecosys­tem ac­tu­ally uses Python.

I’d like to thank all those that contributed directly to the development of ty, including: Douglas Creager, Alex Waygood, David Peter, Micha Reiser, Andrew Gallant, Aria Desires, Carl Meyer, Zanie Blue, Ibraheem Ahmed, Dhruv Manilawala, Jack O’Connor, Zsolt Dollenstein, Shunsuke Shibayama, Matthew Mckee, Brent Westbrook, UnboundVariable, Shaygan Hooshyari, Justin Chapman, InSync, Bhuminjay Soni, Abhijeet Prasad Bodas, Rasmus Nygren, lipefree, Eric Mark Martin, Tomer Bin, Luca Chiodini, Brandt Bucher, Dylan Wilson, Eric Jolibois, Felix Scherz, Leandro Braga, Renkai Ge, Sumana Harihareswara, Takayuki Maeda, Max Mynter, med1844, William Woodruff, Chandra Kiran G, DetachHead, Emil Sadek, Jo, Joren Hammudoglu, Mahmoud Saada, Manuel Mendez, Mark Z. Ding, Simon Lamon, Suneet Tipirneni, Francesco Giacometti, Adam Aaronson, Alperen Keleş, charliecloudberry, Dan Parizher, Daniel Hollas, David Sherret, Dmitry, Eric Botti, Erudit Morina, François-Guillaume Fernandez, Fabrizio Damicelli, Guillaume-Fgt, Hugo van Kemenade, Josiah Kane, Loïc Riegel, Ramil Aleskerov, Samuel Rigaud, Soof Golan, Usul-Dev, decorator-factory, omahs, wangxiaolei, cake-monotone, slyces, Chris Krycho, Mike Perlov, Raphael Gaschignard, Connor Skees, Aditya Pillai, Lexxxzy, haarisr, Joey Bar, Andrii Turov, Kalmaegi, Trevor Manz, Teodoro Freund, Hugo Polloli, Nathaniel Roman, Victor Hugo Gomes, Nuri Jung, Ivan Yakushev, Hamir Mahal, Denys Zhak, Daniel Kongsgaard, Emily B. Zhang, Ben Bar-Or, Aleksei Latyshev, Aditya Pratap Singh, wooly18, Samodya Abeysiriwardane, and Pepe Navarro.

We’d also like to thank the Salsa team (especially Niko Matsakis, David Barsky, and Lukas Wirth) for their support and collaboration; the Elixir team (especially José Valim, Giuseppe Castagna, and Guillaume Duboc), whose work strongly influenced our approach to gradual types and intersections; and a few members of the broader Python typing community: Eric Traut, Jelle Zijlstra, Jia Chen, Sam Goldman, Shantanu Jain, and Steven Troxler.

...

Read the original on astral.sh »

7 785 shares, 31 trendiness

AI will make formal verification go mainstream — Martin Kleppmann’s blog

Much has been said about the ef­fects that AI will have on soft­ware de­vel­op­ment, but there is an an­gle I haven’t seen talked about: I be­lieve that AI will bring for­mal ver­i­fi­ca­tion, which for decades has been a bit of a fringe pur­suit, into the soft­ware en­gi­neer­ing main­stream.

Proof assistants and proof-oriented programming languages such as Rocq, Isabelle, Lean, F*, and Agda have been around for a long time. They make it possible to write a formal specification that some piece of code is supposed to satisfy, and then mathematically prove that the code always satisfies that spec (even on weird edge cases that you didn’t think of testing). These tools have been used to develop some large formally verified software systems, such as an operating system kernel, a C compiler, and a cryptographic protocol stack.
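To make the idea concrete, here is a toy illustration of the pattern, mine rather than the author’s, written in Lean 4 (assuming a recent toolchain where the built-in omega tactic is available): a small function, a specification of what it should do, and machine-checked proofs that it always satisfies that specification, edge cases included.

```lean
-- Toy illustration (not from the post): a function plus machine-checked
-- proofs that it meets its specification, in Lean 4.
def clampNonNeg (x : Int) : Int :=
  if x < 0 then 0 else x

-- Spec, part 1: the result is never negative.
theorem clampNonNeg_nonneg (x : Int) : 0 ≤ clampNonNeg x := by
  unfold clampNonNeg
  split <;> omega

-- Spec, part 2: non-negative inputs are returned unchanged.
theorem clampNonNeg_id (x : Int) (h : 0 ≤ x) : clampNonNeg x = x := by
  unfold clampNonNeg
  split <;> omega
```

If either proof were wrong, the proof checker would refuse to accept it; that mechanical check, rather than human review, is what the rest of the post leans on.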

At present, formal verification is mostly used by research projects, and it is uncommon for industrial software engineers to use formal methods (even those working on classic high-assurance software such as medical devices and aircraft). The reason is that writing those proofs is both very difficult (requiring PhD-level training) and very laborious.

For example, as of 2009, the formally verified seL4 microkernel consisted of 8,700 lines of C code, but proving it correct required 20 person-years and 200,000 lines of Isabelle code — or 23 lines of proof and half a person-day for every single line of implementation. Moreover, there are maybe a few hundred people in the world (wild guess) who know how to write such proofs, since it requires a lot of arcane knowledge about the proof system.

To put it in sim­ple eco­nomic terms: for most sys­tems, the ex­pected cost of bugs is lower than the ex­pected cost of us­ing the proof tech­niques that would elim­i­nate those bugs. Part of the rea­son is per­haps that bugs are a neg­a­tive ex­ter­nal­ity: it’s not the soft­ware de­vel­oper who bears the cost of the bugs, but the users. But even if the soft­ware de­vel­oper were to bear the cost, for­mal ver­i­fi­ca­tion is sim­ply very hard and ex­pen­sive.

At least, that was the case until recently. Now, LLM-based coding assistants are getting pretty good not only at writing implementation code, but also at writing proof scripts in various languages. At present, a human with specialist expertise still has to guide the process, but it’s not hard to extrapolate and imagine that process becoming fully automated in the next few years. And when that happens, it will totally change the economics of formal verification.

If for­mal ver­i­fi­ca­tion be­comes vastly cheaper, then we can af­ford to ver­ify much more soft­ware. But on top of that, AI also cre­ates a need to for­mally ver­ify more soft­ware: rather than hav­ing hu­mans re­view AI-generated code, I’d much rather have the AI prove to me that the code it has gen­er­ated is cor­rect. If it can do that, I’ll take AI-generated code over hand­crafted code (with all its ar­ti­sanal bugs) any day!

In fact, I would ar­gue that writ­ing proof scripts is one of the best ap­pli­ca­tions for LLMs. It does­n’t mat­ter if they hal­lu­ci­nate non­sense, be­cause the proof checker will re­ject any in­valid proof and force the AI agent to retry. The proof checker is a small amount of code that is it­self ver­i­fied, mak­ing it vir­tu­ally im­pos­si­ble to sneak an in­valid proof past the checker.

That does­n’t mean soft­ware will sud­denly be bug-free. As the ver­i­fi­ca­tion process it­self be­comes au­to­mated, the chal­lenge will move to cor­rectly defin­ing the spec­i­fi­ca­tion: that is, how do you know that the prop­er­ties that were proved are ac­tu­ally the prop­er­ties that you cared about? Reading and writ­ing such for­mal spec­i­fi­ca­tions still re­quires ex­per­tise and care­ful thought. But writ­ing the spec is vastly eas­ier and quicker than writ­ing the proof by hand, so this is progress.

I could also imag­ine AI agents help­ing with the process of writ­ing the spec­i­fi­ca­tions, trans­lat­ing be­tween for­mal lan­guage and nat­ural lan­guage. Here there is the po­ten­tial for sub­tleties to be lost in trans­la­tion, but this seems like a man­age­able risk.

I find it ex­cit­ing to think that we could just spec­ify in a high-level, de­clar­a­tive way the prop­er­ties that we want some piece of code to have, and then to vibe code the im­ple­men­ta­tion along with a proof that it sat­is­fies the spec­i­fi­ca­tion. That would to­tally change the na­ture of soft­ware de­vel­op­ment: we would­n’t even need to bother look­ing at the AI-generated code any more, just like we don’t bother look­ing at the ma­chine code gen­er­ated by a com­piler.

In sum­mary: 1. for­mal ver­i­fi­ca­tion is about to be­come vastly cheaper; 2. AI-generated code needs for­mal ver­i­fi­ca­tion so that we can skip hu­man re­view and still be sure that it works; 3. the pre­ci­sion of for­mal ver­i­fi­ca­tion coun­ter­acts the im­pre­cise and prob­a­bilis­tic na­ture of LLMs. These three things taken to­gether mean for­mal ver­i­fi­ca­tion is likely to go main­stream in the fore­see­able fu­ture. I sus­pect that soon the lim­it­ing fac­tor will not be the tech­nol­ogy, but the cul­ture change re­quired for peo­ple to re­alise that for­mal meth­ods have be­come vi­able in prac­tice.

...

Read the original on martin.kleppmann.com »

8 783 shares, 27 trendiness

8 Million Users' AI Conversations Sold for Profit by "Privacy" Extensions

A few weeks ago, I was wrestling with a ma­jor life de­ci­sion. Like I’ve grown used to do­ing, I opened Claude and started think­ing out loud-lay­ing out the op­tions, weigh­ing the trade­offs, ask­ing for per­spec­tive.

Midway through the con­ver­sa­tion, I paused. I re­al­ized how much I’d shared: not just this de­ci­sion, but months of con­ver­sa­tions-per­sonal dilem­mas, health ques­tions, fi­nan­cial de­tails, work frus­tra­tions, things I had­n’t told any­one else. I’d de­vel­oped a level of can­dor with my AI as­sis­tant that I don’t have with most peo­ple in my life.

And then an un­com­fort­able thought: what if some­one was read­ing all of this?

The thought did­n’t let go. As a se­cu­rity re­searcher, I have the tools to an­swer that ques­tion.

We asked Wings, our agen­tic-AI risk en­gine, to scan for browser ex­ten­sions with the ca­pa­bil­ity to read and ex­fil­trate con­ver­sa­tions from AI chat plat­forms. We ex­pected to find a hand­ful of ob­scure ex­ten­sions-low in­stall counts, sketchy pub­lish­ers, the usual sus­pects.

The re­sults came back with some­thing else en­tirely.

Near the top of the list: Urban VPN Proxy. A Chrome extension with over 6 million users. A 4.7-star rating from 58,000 reviews. A “Featured” badge from Google, meaning it had passed manual review and met what Google describes as “a high standard of user experience and design.”

A free VPN promis­ing pri­vacy and se­cu­rity. Exactly the kind of tool some­one in­stalls when they want to pro­tect them­selves on­line.

We de­cided to look closer.

For each platform, the extension includes a dedicated “executor” script designed to intercept and capture conversations. The harvesting is enabled by default through hardcoded flags in the extension’s configuration:

There is no user-fac­ing tog­gle to dis­able this. The only way to stop the data col­lec­tion is to unin­stall the ex­ten­sion en­tirely.

The data col­lec­tion op­er­ates in­de­pen­dently of the VPN func­tion­al­ity. Whether the VPN is con­nected or not, the har­vest­ing runs con­tin­u­ously in the back­ground.

The extension monitors your browser tabs. When you visit any of the targeted AI platforms (ChatGPT, Claude, Gemini, etc.), it injects an “executor” script directly into the page. Each platform has its own dedicated script - chatgpt.js, claude.js, gemini.js, and so on.

Once in­jected, the script over­rides fetch() and XMLHttpRequest - the fun­da­men­tal browser APIs that han­dle all net­work re­quests. This is an ag­gres­sive tech­nique. The script wraps the orig­i­nal func­tions so that every net­work re­quest and re­sponse on that page passes through the ex­ten­sion’s code first.

This means when Claude sends you a re­sponse, or when you sub­mit a prompt to ChatGPT, the ex­ten­sion sees the raw API traf­fic be­fore your browser even ren­ders it.

The injected script parses the intercepted API responses to extract conversation data - your prompts, the AI’s responses, timestamps, conversation IDs. This data is packaged and sent via window.postMessage to the extension’s content script, tagged with the identifier PANELOS_MESSAGE.

The content script forwards the data to the extension’s background service worker, which handles the actual exfiltration. The data is compressed and transmitted to Urban VPN’s servers at endpoints including analytics.urban-vpn.com and stats.urban-vpn.com.

* Every prompt you send to the AI

* The spe­cific AI plat­form and model used

The AI con­ver­sa­tion har­vest­ing was­n’t al­ways there. Based on our analy­sis:

* July 2025 - Present: All user con­ver­sa­tions with tar­geted AI plat­forms cap­tured and ex­fil­trated

Chrome and Edge ex­ten­sions auto-up­date by de­fault. Users who in­stalled Urban VPN for its stated pur­pose - VPN func­tion­al­ity - woke up one day with new code silently har­vest­ing their AI con­ver­sa­tions.

Anyone who used ChatGPT, Claude, Gemini, or the other targeted platforms while Urban VPN was installed after July 9, 2025 should assume those conversations are now on Urban VPN’s servers and have been shared with third parties. Medical questions, financial details, proprietary code, personal dilemmas - all of it, sold for “marketing analytics purposes.”

“Advanced VPN Protection - Our VPN provides added security features to help shield your browsing experience from phishing attempts, malware, intrusive ads and AI protection which checks prompts for personal data (like an email or phone number), checks AI chat responses for suspicious or unsafe links and displays a warning before click or submit your prompt.”

The fram­ing sug­gests the AI mon­i­tor­ing ex­ists to pro­tect you-check­ing for sen­si­tive data you might ac­ci­den­tally share, warn­ing you about sus­pi­cious links in re­sponses.

The code tells a different story. The data collection and the “protection” notifications operate independently. Enabling or disabling the warning feature has no effect on whether your conversations are captured and exfiltrated. The extension harvests everything regardless.

The protection feature shows occasional warnings about sharing sensitive data with AI companies. The harvesting feature sends that exact sensitive data - and everything else - to Urban VPN’s own servers, where it’s sold to advertisers. The extension warns you about sharing your email with ChatGPT while simultaneously exfiltrating your entire conversation to a data broker.

After doc­u­ment­ing Urban VPN Proxy’s be­hav­ior, we checked whether the same code ex­isted else­where.

It did. The iden­ti­cal AI har­vest­ing func­tion­al­ity ap­pears in seven other ex­ten­sions from the same pub­lisher, across both Chrome and Edge:

The extensions span different product categories - a VPN, an ad blocker, a “browser guard” security tool - but share the same surveillance backend. Users installing an ad blocker have no reason to expect their Claude conversations are being harvested.

All of these extensions carry “Featured” badges from their respective stores, except Urban Ad Blocker for Edge. These badges signal to users that the extensions have been reviewed and meet platform quality standards. For many users, a Featured badge is the difference between installing an extension and passing it by - it’s an implicit endorsement from Google and Microsoft.

Urban VPN is op­er­ated by Urban Cyber Security Inc., which is af­fil­i­ated with BiScience (B. I Science (2009) Ltd.), a data bro­ker com­pany.

This com­pany has been on re­searchers’ radar be­fore. Security re­searchers such as John Tuckner at Secure Annex have pre­vi­ously doc­u­mented BiScience’s data col­lec­tion prac­tices. Their re­search es­tab­lished that:

* The com­pany pro­vides an SDK to third-party ex­ten­sion de­vel­op­ers to col­lect and sell user data

* BiScience sells this data through prod­ucts like AdClarity and Clickstream OS

Our find­ing rep­re­sents an ex­pan­sion of this op­er­a­tion. BiScience has moved from col­lect­ing brows­ing his­tory to har­vest­ing com­plete AI con­ver­sa­tions-a sig­nif­i­cantly more sen­si­tive cat­e­gory of data.

“We share the Web Browsing Data with our affiliated company… BiScience that uses this raw data and creates insights which are commercially used and shared with Business Partners”

To be fair, Urban VPN does dis­close some of this-if you know where to look.

The consent prompt (shown during extension setup) mentions that the extension processes “ChatAI communication” along with “pages you visit” and “security signals.” It states this is done “to provide these protections.”

The pri­vacy pol­icy goes fur­ther, buried deep in the doc­u­ment:

“AI Inputs and Outputs. As part of the Browsing Data, we will collect the prompts and outputs queried by the End-User or generated by the AI chat provider, as applicable.”

“We also disclose the AI prompts for marketing analytics purposes.”

However, the Chrome Web Store list­ing-the place where users ac­tu­ally de­cide whether to in­stall-shows a dif­fer­ent pic­ture:

“This developer declares that your data is Not being sold to third parties, outside of the approved use cases”

The listing mentions the extension handles “Web history” and “Website content.” It says nothing about AI conversations specifically.

The con­sent prompt frames AI mon­i­tor­ing as pro­tec­tive. The pri­vacy pol­icy re­veals the data is sold for mar­ket­ing.

The store listing says data isn’t sold to third parties. The privacy policy describes sharing with BiScience, “Business Partners,” and use for “marketing analytics.”

Users who in­stalled be­fore July 2025 never saw the up­dated con­sent prompt-the AI har­vest­ing was added via silent up­date in ver­sion 5.5.0.

Even users who see the con­sent prompt have no gran­u­lar con­trol. You can’t ac­cept the VPN but de­cline the AI har­vest­ing. It’s all or noth­ing.

Nothing in­di­cates to users that the data col­lec­tion con­tin­ues even when the VPN is dis­con­nected and the AI pro­tec­tion fea­ture is turned off. The har­vest­ing runs silently in the back­ground re­gard­less of what fea­tures the user has en­abled.

Urban VPN Proxy carries Google’s “Featured” badge on the Chrome Web Store. According to Google’s documentation:

“Featured extensions follow our technical best practices and meet a high standard of user experience and design.”

“Before it receives a Featured badge, the Chrome Web Store team must review each extension.”

This means a hu­man at Google re­viewed Urban VPN Proxy and con­cluded it met their stan­dards. Either the re­view did­n’t ex­am­ine the code that har­vests con­ver­sa­tions from Google’s own AI prod­uct (Gemini), or it did and did­n’t con­sider this a prob­lem.

The Chrome Web Store’s Limited Use policy explicitly prohibits “transferring or selling user data to third parties like advertising platforms, data brokers, or other information resellers.” BiScience is, by its own description, a data broker.

The ex­ten­sion re­mains live and fea­tured as of this writ­ing.

Browser ex­ten­sions oc­cupy a unique po­si­tion of trust. They run in the back­ground, have broad ac­cess to your brows­ing ac­tiv­ity, and auto-up­date with­out ask­ing. When an ex­ten­sion promises pri­vacy and se­cu­rity, users have lit­tle rea­son to sus­pect it’s do­ing the op­po­site.

What makes this case no­table is­n’t just the scale - 8 mil­lion users - or the sen­si­tiv­ity of the data - com­plete AI con­ver­sa­tions. It’s that these ex­ten­sions passed re­view, earned Featured badges, and re­mained live for months while har­vest­ing some of the most per­sonal data users gen­er­ate on­line. The mar­ket­places de­signed to pro­tect users in­stead gave these ex­ten­sions their stamp of ap­proval.

If you have any of these ex­ten­sions in­stalled, unin­stall them now. Assume any AI con­ver­sa­tions you’ve had since July 2025 have been cap­tured and shared with third par­ties.

This writeup was au­thored by the re­search team at Koi.

We built Koi to de­tect ex­actly these kinds of threats - ex­ten­sions that slip past mar­ket­place re­views and qui­etly ex­fil­trate sen­si­tive data. Our risk en­gine, Wings, con­tin­u­ously mon­i­tors browser ex­ten­sions to catch threats be­fore they reach your team.

Book a demo to see how be­hav­ioral analy­sis catches what sta­tic re­view misses.

...

Read the original on www.koi.ai »

9 770 shares, 30 trendiness

No Graphics API — Sebastian Aaltonen

The complexity of graphics APIs, shader frameworks and drivers has increased rapidly during the past decades. The pipeline state object (PSO) explosion has gotten out of hand. How did we end up with 100GB local shader pipeline caches and massive cloud servers to host them? It’s time to start discussing how to cut down the abstractions and the API surface to simplify how we interact with the GPU. This blog post includes lots of low level hardware details. When writing this post I used the “GPT5 Thinking” AI model to cross reference public Linux open source drivers to confirm my knowledge and to ensure no NDA information is present in this blog post. Sources: AMD RDNA ISA documents and GPUOpen, Nvidia PTX ISA documents, Intel PRM, Linux open source GPU drivers (Mesa, Freedreno, Turnip, Asahi) and vendor optimization guides/presentations. The blog post has been screened by several industry insiders before the public release.

Ten years ago, a significant shift occurred in real-time computer graphics with the introduction of new low-level PC graphics APIs. AMD had won both Xbox One (2013) and Playstation 4 (2013) contracts. Their new Graphics Core Next (GCN) architecture became the de-facto lead development platform for AAA games. PC graphics APIs at that time, DirectX 11 and OpenGL 4.5, had heavy driver overhead and were designed for single threaded rendering. AAA developers demanded higher performance APIs for PC. DICE joined with AMD to create a low level AMD GCN specific API for the PC called Mantle. As a response, Microsoft, Khronos and Apple started developing their own low-level APIs: DirectX 12, Vulkan and Metal were born.

The initial reception of these new low-level APIs was mixed. Synthetic benchmarks and demos showed substantial performance increases, but performance gains couldn’t be seen in major game engines such as Unreal and Unity. At Ubisoft, our teams noticed that porting existing DirectX 11 renderers to DirectX 12 often resulted in performance regression. Something wasn’t right.

Existing high-level APIs featured minimal persistent state, with fine-grained state setters and individual data inputs bound to the shader just prior to draw call submission. New low-level APIs aimed to make draw calls cheaper by ahead-of-time bundling shader pipeline state and bindings into persistent objects. GPU architectures were highly heterogeneous back in the day. Doing the data remapping, validation, and uploading ahead of time was a big gain. However, the rendering hardware interfaces (RHI) of existing game engines were designed for fine grained immediate mode rendering, while the new low-level APIs required bundling data in persistent objects.

To address this incompatibility, a new low-level graphics remapping layer grew beneath the RHI. This layer assumed the complexity previously handled by the OpenGL and DirectX 11 graphics drivers, tracking resources and managing mappings between the fine-grained dynamic user-land state and the persistent low-level GPU state. Graphics programmers started specializing into two distinct roles: low-level graphics programmers, who focused on the new low-level “driver” layer and the RHI, and high-level graphics programmers, who built visual graphics algorithms on top of the RHI.
Visual programming was also getting more complex due to physically based lighting models, compute shaders and later ray-tracing. DirectX 12, Vulkan, and Metal are often referred to as “modern APIs”. These APIs are now 10 years old. They were initially designed to support GPUs that are now 13 years old, an incredibly long time in GPU history. Older GPU architectures were optimized for traditional vertex and pixel shader tasks rather than the compute-intensive generic workloads prevalent today. They had vendor specific binding models and data paths. Hardware differences had to be wrapped under the same API. Ahead-of-time created persistent objects were crucial in offloading the mapping, uploading, validation and binding costs.

In contrast, the console APIs and Mantle were exclusively designed for AMD’s GCN architecture, a forward-thinking design for its time. GCN boasted a comprehensive read/write cache hierarchy and scalar registers capable of storing texture and buffer descriptors, effectively treating everything as memory. No complex API for remapping the data was required, and significantly less ahead-of-time work was needed. The console APIs and Mantle had much less API complexity due to targeting a single modern GPU architecture.

A decade has passed, and GPUs have undergone a significant evolution. All modern GPU architectures now feature complete cache hierarchies with coherent last-level caches. CPUs can write directly to GPU memory using PCIe REBAR or UMA and 64-bit GPU pointers are directly supported in shaders. Texture samplers are bindless, eliminating the need for a CPU driver to configure the descriptor bindings. Texture descriptors can be directly stored in arrays within the GPU memory (often called descriptor heaps). If we were to design an API tailored for modern GPUs today, it wouldn’t need most of these persistent “retained mode” objects. The compromises that DirectX 12.0, Metal 1 and Vulkan 1.0 had to make are not needed anymore. We could simplify the API drastically.

The past decade has revealed the weaknesses of the modern APIs. The PSO permutation explosion is the biggest problem we need to solve. Vendors (Valve, Nvidia, etc) have massive cloud servers storing terabytes of PSOs for each different architecture/driver combination. User’s local PSO cache size can exceed 100GB. No wonder the gamers are complaining that loading takes ages and stutter is all over the place.

The history of GPUs and APIs

Before we talk about stripping the API surface, we need to understand why graphics APIs were historically designed this way. OpenGL wasn’t intentionally slow, nor was Vulkan intentionally complex. 10-20 years ago GPU hardware was highly diverse and undergoing rapid evolution. Designing a cross-platform API for such a diverse set of hardware required compromises.

Let’s start with a classic: The 3dFX Voodoo 2 12MB (1998) was a three chip design: A single rasterizer chip connected to a 4MB framebuffer memory and two texture sampling chips, each connected to their own 4MB texture memory. There was no geometry pipeline and no programmable shaders. CPU sent pre-transformed triangle vertices to the rasterizer.
The ras­ter­izer had a con­fig­urable blend­ing equa­tion to con­trol how the ver­tex col­ors and the two tex­ture sam­pler re­sults were com­bined to­gether. Texture sam­plers could not read each-oth­er’s mem­ory or the frame­buffer. Thus there was no sup­port for mul­ti­ple ren­der passes. Since the hard­ware was in­ca­pable of win­dow com­po­si­tion, it had a loop­back ca­ble to con­nect your ded­i­cated 2d video card. 3d ren­der­ing only worked in ex­clu­sive fullscreen mode. A 3d graph­ics card was a highly spe­cial­ized piece of hard­ware, with lit­tle in com­mon with the cur­rent GPUs and their mas­sive pro­gram­ma­ble SIMD ar­rays. Hardware of this era had a mas­sive im­pact on DirectX (1995) and OpenGL (1992) de­sign. Backwards com­pat­i­bil­ity played a huge role. APIs im­proved it­er­a­tively. These 30 year old API de­signs still im­pact the way we write soft­ware to­day.

3dFX Voodoo 2 12MB (1998): Individual proces­sors and traces be­tween them and their own mem­ory chips (four 1MB chips for each proces­sor) are clearly vis­i­ble. Image © TechPowerUp.

Nvidia’s Geforce 256 coined the term GPU. It had a geom­e­try proces­sor in ad­di­tion to the ras­ter­izer. The geom­e­try proces­sor, ras­ter­izer and tex­ture sam­pling units were all in­te­grated in the same die and shared mem­ory. DirectX 7 in­tro­duced two new con­cepts: ren­der tar­get tex­tures and uni­form con­stants. Multipass ren­der­ing meant that tex­ture sam­plers could read the ras­ter­izer out­put, in­val­i­dat­ing the 3dFX Voodoo 2 sep­a­rate mem­ory de­sign.The geom­e­try proces­sor API fea­tured uni­form data in­puts for trans­form ma­tri­ces (float4x4), light po­si­tions, and col­ors (float4). GPU im­ple­men­ta­tions var­ied among man­u­fac­tur­ers, many opt­ing to em­bed a small con­stant mem­ory block within the geom­e­try en­gine. But this was­n’t the only way to do it. In the OpenGL API each shader had its own per­sis­tent uni­forms. This de­sign en­abled the dri­ver to em­bed con­stants di­rectly in the shader’s in­struc­tion stream, an API pe­cu­liar­ity that still per­sists in OpenGL 4.6 and ES 3.2 to­day.GPUs back then did­n’t have generic read & write caches. Rasterizer had screen lo­cal cache for blend­ing and depth buffer­ing and tex­ture sam­plers leaned on lin­early in­ter­po­lated ver­tex UVs for data prefetch­ing. When shaders were in­tro­duced in DirectX 8 shader model 1.0 (SM 1.0), the pixel shader stage did­n’t sup­port cal­cu­lat­ing tex­ture UVs. UVs were cal­cu­lated at ver­tex gran­u­lar­ity, in­ter­po­lated by the hard­ware and passed di­rectly to the tex­ture sam­plers. Di­rectX 9 brought a sub­stan­tial in­crease in shader in­struc­tion lim­its, but shader model 2.0 did­n’t ex­pose any new data paths. Both ver­tex and pixel shaders still op­er­ated as 1:1 in­put:out­put ma­chines, al­low­ing users to only cus­tomize the trans­form math of the ver­tex po­si­tion and at­trib­utes and the pixel color. Programmable load and store were not sup­ported. The fixed-func­tion in­put blocks per­sisted: ver­tex fetch, uni­form (constant) mem­ory and tex­ture sam­pler. Vertex shader was a sep­a­rate ex­e­cu­tion unit. It gained new fea­tures like the abil­ity to in­dex con­stants (limited to float4 ar­rays) but still lacked tex­ture sam­pling sup­port.Di­rectX 9 shader model 3.0 in­creased the in­struc­tion limit to 65536 mak­ing it dif­fi­cult for hu­mans to write and main­tain shader as­sem­bly any­more. Higher level shad­ing lan­guages were born: HLSL (2002) and GLSL (2002-2004). These lan­guages adapted the 1:1 el­e­men­t­wise trans­form de­sign. Each shader in­vo­ca­tion op­er­ated on a sin­gle data el­e­ment: ver­tex or pixel. Framework-style shader de­sign heav­ily af­fected the graph­ics API de­sign in the fol­low­ing years. It was a nice way to ab­stract hard­ware dif­fer­ences back in the day, but is show­ing scal­ing pains to­day. Di­rectX 11 was a sig­nif­i­cant shift in the data model, in­tro­duc­ing sup­port for com­pute shaders, generic read-write buffers and in­di­rect draw­ing. The GPU could now fully feed it­self. The in­clu­sion of generic buffers en­abled shader pro­grams to ac­cess and mod­ify pro­gram­ma­ble mem­ory lo­ca­tions, which forced hard­ware ven­dors to im­ple­ment generic cache hi­er­ar­chies. Shaders evolved be­yond sim­ple 1:1 data trans­for­ma­tions, mark­ing the end of spe­cial­ized, hard­coded data paths. GPU hard­ware started to shift to­wards a generic SIMD de­sign. SIMD units were now ex­e­cut­ing all the dif­fer­ent shader types: ver­tex, pixel, geom­e­try, hull, do­main and com­pute. 
Today the frame­work has 16 dif­fer­ent shader en­try points. This adds a lot of API sur­face and makes com­po­si­tion dif­fi­cult. As a re­sult GLSL and HLSL still don’t have a flour­ish­ing li­brary ecosys­tem.Di­rectX 11 fea­tured a whole zoo of buffer types, each de­signed to ac­com­mo­date spe­cific hard­ware data path pe­cu­liar­i­ties: typed SRV & UAV, byte ad­dress SRV & UAV, struc­tured SRV & UAV, ap­pend & con­sume (with counter), con­stant, ver­tex, and in­dex buffers. Like tex­tures, buffers in DirectX uti­lize an opaque de­scrip­tor. Descriptors are hard­ware spe­cific (commonly 128-256 bit) data blobs en­cod­ing the size, for­mat, prop­er­ties and data ad­dress of the re­source in GPU mem­ory. DirectX 11 GPUs lever­aged their tex­ture sam­plers for buffer load (gather) op­er­a­tions. This was nat­ural since the sam­pler al­ready had a type con­ver­sion hard­ware and a small read-only data cache. Typed buffers sup­ported the same for­mats as tex­tures, and DirectX used the same SRV (shader re­source view) ab­strac­tion for both tex­tures and buffers.The use of opaque buffer de­scrip­tors meant that the buffer for­mat was not known at shader com­pile time. This was fine for read-only buffers as they were han­dled by the tex­ture sam­pler. Read-write buffer (UAV in DirectX) was ini­tially lim­ited to 32-bit and 128-bit (vec4) types. Subsequent API and hard­ware re­vi­sions grad­u­ally ad­dressed typed UAV load lim­i­ta­tions, but the core is­sues per­sisted: a de­scrip­tor re­quires an in­di­rec­tion (contains a pointer), com­piler op­ti­miza­tions are lim­ited (data type is known only at run­time), for­mat con­ver­sion hard­ware in­tro­duces la­tency (vs raw L1$ load), ex­pand at load re­serves reg­is­ters for longer time (vs ex­pand at use), de­scrip­tor man­age­ment adds CPU dri­ver com­plex­ity, and the API is com­plex (ten dif­fer­ent buffer types).In DirectX 11 the struc­tured buffers were the only buffer type al­low­ing an user de­fined struct type. All other buffer types rep­re­sented a ho­mo­ge­neous ar­ray of sim­ple scalar/​vec­tor el­e­ments. Unfortunately, struc­tured buffers were not lay­out com­pat­i­ble with other buffer types. Users were not al­lowed to have struc­tured buffer views to typed buffers, byte ad­dress buffers, or ver­tex/​in­dex buffers. The rea­son was that struc­tured buffers had spe­cial AoSoA swiz­zle op­ti­miza­tion un­der the hood, which was im­por­tant for older vec4 ar­chi­tec­tures. This hard­ware spe­cific op­ti­miza­tion lim­ited the struc­tured buffer us­abil­ity.Di­rectX 12 made all buffers lin­ear in mem­ory, mak­ing them com­pat­i­ble with each other. SM 6.2 also added load syn­tac­tic sugar for the byte ad­dress buffer, al­low­ing clean struct load­ing syn­tax from ar­bi­trary off­set. All the old buffer types are still sup­ported for back­wards com­pat­i­bil­ity rea­sons and all the buffers still use opaque de­scrip­tors. HLSL is still miss­ing sup­port for 64-bit GPU point­ers. In con­trast, the Nvidia CUDA com­put­ing plat­form (2007) fully leaned on 64-bit point­ers, but its pop­u­lar­ity was ini­tially lim­ited to aca­d­e­mic use. Today it is the lead­ing AI plat­form and is heav­ily af­fect­ing the hard­ware de­sign.Sup­port for 16-bit reg­is­ters and 16-bit math was dis­or­ga­nized when DirectX 12 launched. Microsoft ini­tially made a ques­tion­able de­ci­sion to not back­port DirectX 12 to Windows 7. Shader bi­na­ries tar­get­ing Windows 8 sup­ported 16-bit types, but most gamers con­tin­ued us­ing Windows 7. 
Developers did­n’t want to ship two sets of shaders. OpenGL lowp/​medi­ump spec­i­fi­ca­tion was also messy. Bit depths were not prop­erly stan­dard­ized. Mediump was a pop­u­lar op­ti­miza­tion in mo­bile games, but most PC dri­vers ig­nored it, mak­ing game de­vel­op­er’s life mis­er­able. AAA games mostly ig­nored 16-bit math un­til PS4 Pro launched in 2016 with dou­ble rate fp16 sup­port.With the rise of AI, ray-trac­ing, and GPU-driven ren­der­ing, GPU ven­dors started fo­cus­ing on op­ti­miz­ing their raw data load paths and pro­vid­ing larger and faster generic caches. Routing loads though the tex­ture sam­pler (type con­ver­sion) added too much la­tency, as de­pen­dent load chains are com­mon in mod­ern shaders. Hardware got na­tive sup­port for nar­row 8-bit, 16-bit, and 64-bit types and point­ers.Most ven­dors ditched their fixed func­tion ver­tex fetch hard­ware, emit­ting stan­dard raw load in­struc­tions in the ver­tex shader in­stead. Fully pro­gram­ma­ble ver­tex fetch al­lowed de­vel­op­ers to write new al­go­rithms such as clus­tered GPU-driven ren­der­ing. Fixed func­tion hard­ware tran­sis­tor bud­get could be used else­where.Mesh shaders rep­re­sent the cul­mi­na­tion of ras­ter­izer evo­lu­tion, elim­i­nat­ing the need for in­dex dedu­pli­ca­tion hard­ware and post-trans­form caches. In this par­a­digm, all in­puts are treated as raw mem­ory. The user is re­spon­si­ble for di­vid­ing the mesh into self-con­tained mesh­lets that in­ter­nally share ver­tices. This process is of­ten done of­fline. The GPU no longer needs to do par­al­lel in­dex dedu­pli­ca­tion for each draw call, sav­ing power and tran­sis­tors. Given that gam­ing ac­counts for only 10% of Nvidia’s rev­enue to­day, while AI rep­re­sents 90% and ray-trac­ing con­tin­ues to grow, it is likely only a mat­ter of time be­fore the fixed func­tion geom­e­try hard­ware is stripped to bare min­i­mum and dri­vers au­to­mat­i­cally con­vert ver­tex shaders to mesh shaders.Mo­bile GPUs are tile-based ren­der­ers. Tilers bin the in­di­vid­ual tri­an­gles to small tiles (commonly be­tween 16x16 to 64x64 pix­els) . Mesh shaders are too coarse grained for this pur­pose. Binning mesh­lets to tiny tiles would cause sig­nif­i­cant geom­e­try over­shad­ing. There’s no clear con­ver­gence path. We still need to sup­port the ver­tex shader path.10 years ago when DirectX 12.0, Vulkan 1.0 and Metal 1.0 ar­rived, the ex­ist­ing GPU hard­ware did­n’t widely sup­port bind­less re­sources. APIs adapted com­plex bind­ing mod­els to ab­stract the hard­ware dif­fer­ences. DirectX al­lowed in­dex­ing up to 128 re­sources per stage, Vulkan and Metal did­n’t ini­tially sup­port de­scrip­tor in­dex­ing at all. Developers had to con­tinue us­ing tra­di­tional workarounds to re­duce the bind­ings change over­head, such as pack­ing tex­tures into at­lases and merg­ing meshes to­gether. The GPU hard­ware has evolved sig­nif­i­cantly dur­ing the past decade and con­verged to generic bind­less SIMD de­sign.Let’s in­ves­ti­gate how much sim­pler the graph­ics API and the shader lan­guage would be­come if we de­signed them solely for mod­ern bind­less hard­ware.Let’s start our jour­ney dis­cussing mem­ory man­age­ment. Legacy graph­ics APIs ab­stracted the GPU mem­ory man­age­ment com­pletely. Abstraction was nec­es­sary, as old GPUs had split mem­o­ries and/​or spe­cial data paths with var­i­ous cache co­herency con­cerns. 
When DirectX 12 and Vulkan arrived 10 years ago, the GPU hardware had matured enough to expose placement heaps to the user. Consoles had already exposed memory for a few generations and developers requested similar flexibility for PC and mobile. Apple introduced placement heaps 4 years after Vulkan and DirectX 12, in Metal 2.

Modern APIs require the user to enumerate the heap types to find out what kind of memory the GPU driver has to offer. It's good practice to preallocate memory in big chunks and suballocate it using a user-land allocator. However, there's a design flaw in Vulkan: you have to create your texture/buffer object first. Then you can ask which heap types are compatible with the new resource. This forces the user into a lazy allocation pattern, which can cause performance hitches and memory spikes at runtime. This also makes it difficult to wrap GPU memory allocation into a cross-platform library. AMD VMA, for example, creates the Vulkan-specific buffer/texture object in addition to allocating memory. We want to fully separate these concerns.

Today the CPU has full visibility into the GPU memory. Integrated GPUs have UMA, and modern discrete GPUs have PCIe Resizable BAR. The whole GPU heap can be mapped. The Vulkan heap API naturally supports CPU-mapped GPU heaps. DirectX 12 got support in 2023 (HEAP_TYPE_GPU_UPLOAD).

CUDA has a simple design for GPU memory allocation: the GPU malloc API takes the size as input and returns a mapped CPU pointer. The GPU free API frees the memory. CUDA doesn't support CPU-mapped GPU memory; the GPU reads the CPU memory through the PCIe bus. CUDA also supports GPU memory allocations, but they can't be directly written by the CPU.

We combine the CUDA malloc design with CPU-mapped GPU memory (UMA/ReBAR). It's the best of both worlds: the data is fast for the CPU to write and fast for the GPU to read, yet we maintain the clean, easy to use design.
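A minimal sketch of what the resulting allocation API could look like, reconstructed from the calls used in the examples that follow (the gpuMalloc overloads, MEMORY_GPU and gpuHostToDevicePointer appear later in the text; gpuFree, MEMORY_MAPPED, MEMORY_READBACK and the GpuMemoryType enum name are my assumptions):

// Hypothetical allocation API sketch
enum GpuMemoryType { MEMORY_MAPPED, MEMORY_GPU, MEMORY_READBACK }; // default: CPU-mapped (write-combined) GPU memory

void* gpuMalloc(size_t size);                                      // 16 byte (vec4) default alignment
void* gpuMalloc(size_t size, size_t alignment);
void* gpuMalloc(size_t size, GpuMemoryType type);
void* gpuMalloc(size_t size, size_t alignment, GpuMemoryType type);
void  gpuFree(void* ptr);

void* gpuHostToDevicePointer(void* cpuMappedPtr);                  // translate a CPU-mapped address to a GPU address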

Default gpuMalloc alignment is 16 bytes (vec4 alignment). If you need wider alignment, use the gpuMalloc(size, alignment) overload. My example code uses gpuMalloc.

Writing data directly into GPU memory is optimal for small data like draw arguments, uniforms and descriptors. For large persistent data, we still want to perform a copy operation. GPUs store textures in a swizzled layout similar to Morton-order to improve cache locality. DirectX 11.3 and 12 tried to standardize the swizzle layout, but couldn't get all GPU manufacturers onboard. The common way to perform texture swizzling is to use a driver provided copy command. The copy command reads linear texture data from a CPU mapped "upload" heap and writes to a swizzled layout in a private GPU heap. Every modern GPU also has lossless delta color compression (DCC). Modern GPUs' copy engines are capable of DCC compression and decompression. DCC and Morton swizzle are the main reasons we want to copy textures into a private GPU heap. Recently, GPUs have also added generic lossless memory compression for buffer data. If the memory heap is CPU mapped, the GPU can't enable vendor specific lossless compression, as the CPU wouldn't know how to read or write it. A copy command must be used to compress the data.

We need a memory type parameter in the GPU malloc function to add support for private GPU memory. The standard memory type should be CPU mapped GPU memory (write combined CPU access). It is fast for the GPU to read, and the CPU can directly write to it just like it was a CPU memory pointer. GPU-only memory is used for textures and big GPU-only buffers. The CPU can't directly write to these GPU pointers. The user writes the data to CPU mapped GPU memory first and then issues a copy command, which transforms the data to an optimal compressed format. Modern texture samplers and display engines can read compressed GPU data directly, so there's no need for subsequent data layout transforms (see chapter: Modern barriers). The uploaded data is ready to use immediately.

We have two types of GPU pointers: a CPU mapped virtual address and a GPU virtual address. The GPU can only dereference GPU addresses. All pointers in GPU data structures must use GPU addresses. CPU mapped addresses are only used for CPU writes. CUDA has an API to transform a CPU mapped address to a GPU address (cudaHostGetDevicePointer). The Metal 4 buffer object has two getters: .contents (CPU mapped address) and .gpuAddress (GPU address). Since the gpuMalloc API returns a pointer, not a managed object handle (like Metal), we choose the CUDA approach (gpuHostToDevicePointer). This API call is not free. The driver likely implements it using a hash map (if addresses other than base addresses need to be translated, we need a tree). Preferably we call the address translation once per allocation and cache the result in a user-land struct (void *cpu, void *gpu). This is the approach my userland GPUBumpAllocator uses (see appendix for full implementation).
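The appendix (not included in this excerpt) has the full GPUBumpAllocator implementation; the sketch below only illustrates the idea under the same assumptions: a linear allocator that wraps one gpuMalloc block, caches the translated GPU base address once, and hands out {cpu, gpu} pointer pairs.

struct GpuPointerPair { void* cpu; void* gpu; };

struct GPUBumpAllocator
{
    uint8_t* cpuBase = nullptr; // CPU-mapped write-combined pointer from gpuMalloc
    uint8_t* gpuBase = nullptr; // translated once with gpuHostToDevicePointer
    size_t offset = 0;
    size_t capacity = 0;

    void init(size_t bytes)
    {
        cpuBase = (uint8_t*)gpuMalloc(bytes);
        gpuBase = (uint8_t*)gpuHostToDevicePointer(cpuBase); // translate once, cache forever
        capacity = bytes;
    }

    GpuPointerPair allocate(size_t bytes, size_t alignment = 16)
    {
        offset = (offset + alignment - 1) & ~(alignment - 1); // bump with alignment
        GpuPointerPair p { cpuBase + offset, gpuBase + offset };
        offset += bytes; // the caller resets offset once the GPU has consumed the data (e.g. per frame)
        return p;
    }
};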

// Load a mesh using a 3rd party library
auto mesh = createMesh("mesh.obj");
auto upload = uploadBumpAllocator.allocate(mesh.byteSize); // Custom bump allocator (wraps a gpuMalloc ptr)
mesh.load(upload.cpu);

// Allocate GPU-only memory and copy into it
void* meshGpu = gpuMalloc(mesh.byteSize, MEMORY_GPU);
gpuMemCpy(commandBuffer, meshGpu, upload.gpu);

Vulkan recently got a new extension called VK_EXT_host_image_copy. The driver implements a direct CPU to GPU image copy operation, performing the hardware specific texture swizzle on the CPU. This extension is currently only available on UMA architectures, but there's no technical reason why it's not available on PCIe ReBAR as well. Unfortunately this API doesn't support DCC. It would be too expensive to perform DCC compression on the CPU. The extension is mainly useful for block compressed textures, as they don't require DCC. It can't universally replace the hardware copy to GPU private memory.

There's also a need for a third memory type, CPU-cached, for readback purposes. This memory type is slower for the GPU to write due to cache coherency with the CPU. Games use readback only seldom. Common use cases are screenshots and virtual texturing readback. GPGPU algorithms such as AI training and inference lean on efficient communication between the CPU and the GPU.

When we mix the simplicity of CUDA malloc with CPU-mapped GPU memory we get a flexible and fast GPU memory allocation system with minimal API surface. This is an excellent starting point for a minimalistic modern graphics API.

CUDA, Metal and OpenCL leverage C/C++ shader languages featuring 64-bit pointer semantics. These languages support loading and storing of structs from/to any appropriately aligned GPU memory location. The compiler handles behind-the-scenes optimizations, including wide loads (combine), register mappings, and bit extractions. Many modern GPUs offer free instruction modifiers for extracting 8/16-bit portions of a register, allowing the compiler to pack 8-bit and 16-bit values into a single register. This keeps the shader code clean and efficient.

If you load a struct of eight 32-bit values, the compiler will most likely emit two 128-bit wide loads (each filling 4 registers), a 4x reduction in load instruction count. Wide loads are significantly faster, especially if the struct contains narrow 8 and 16-bit fields. GPUs are ALU dense and have big register files, but compared to CPUs their memory paths are relatively slow. A CPU often has two load ports, each doing a load per cycle. On a modern GPU we can achieve one SIMD load per 4 cycles. Wide load + unpack in the shader is often the most efficient way to handle data.

Compact 8-16 bit data has traditionally been stored in texel buffers (Buffer) in DirectX games. Modern GPUs are optimized for compute workloads. Raw buffer load instructions nowadays have up to 2x higher throughput and up to 3x lower latency than texel buffers. Texel buffers are no longer the optimal choice on modern GPUs. Texel buffers do not support structured data, so the user is forced to split their data into an SoA layout in multiple texel buffers. Each texel buffer has its own descriptor, which must be loaded before the data can be accessed. This consumes resources (SGPRs, descriptor cache slots) and adds startup latency compared to using a single 64-bit raw pointer. An SoA data layout also results in significantly more cache misses for non-linear index lookups (examples: material, texture, triangle, instance, bone id). Texel buffers offer free conversion of normalized ([0,1] and [-1,1]) types to floating point registers.
It's true that there's no ALU cost, but you lose wide load support (combined loads) and the instruction goes through the slow texture sampler hardware path. Narrow texel buffer loads also add register bloat. An RGBA8_UNORM load to vec4 allocates four vector registers immediately. The sampler hardware will eventually write the value to these registers. Compilers try to maximize the load→use distance by moving load instructions to the beginning of the shader. This hides the load latency with ALU work and allows overlapping multiple loads. If we instead use wide raw loads, our uint8x4 data consumes just a single 32-bit register. We unpack the 8-bit channels on use. The register lifetime is much shorter. Modern GPUs can directly access 16-bit low/high halves of registers without unpack, and some can even do 8-bit (AMD SDWA modifier). Packed double rate math makes 2x16 bit conversion instructions faster. Some GPU architectures (Nvidia, AMD) can also do 64-bit pointer raw loads directly from VRAM into groupshared memory, further reducing the register bloat needed for latency hiding. By using 64-bit pointers, game engines benefit from AI hardware optimizations.

Pointer based systems make memory alignment explicit. When you are allocating a buffer object in DirectX or Vulkan, you need to query the API for alignment. Buffer bind offsets must also be properly aligned. Vulkan has an API for querying the bind offset alignment and DirectX has fixed alignment rules. The alignment contract allows the low level shader compiler to emit optimal code (such as aligned 4x32-byte wide loads). The DirectX ByteAddressBuffer abstraction has a design flaw: load2, load3 and load4 instructions only require 4-byte alignment. The new SM 6.2 load also only requires elementwise alignment (half4 = 2, float4 = 4). Some GPU vendors (like Nvidia) have to split ByteAddressBuffer.load4 into four individual load instructions. The buffer abstraction can't always shield the user from bad codegen. It makes bad codegen hard to fix. C/C++ based languages (CUDA, Metal) allow the user to explicitly declare struct alignment with the alignas attribute. We use alignas(16) in all our example code root structs.

By default, GPU writes are only visible to the threads inside the same thread group (= inside a compute unit). This allows a non-coherent L1$ design. Visibility is commonly provided by barriers. If the user needs memory visibility between the groups in a single dispatch, they decorate the buffer binding with the [globallycoherent] attribute. The shader compiler emits coherent load/store instructions for accesses of that buffer. Since we use 64-bit pointers instead of buffer objects, we offer explicit coherent load/store instructions. The syntax is similar to atomic load/store. Similarly we can provide non-temporal load/store instructions that bypass the whole cache hierarchy.

Vulkan supports 64-bit pointers using the VK_KHR_buffer_device_address extension (2019, https://docs.vulkan.org/samples/latest/samples/extensions/buffer_device_address/README.html). The buffer device address extension is widely supported by all GPU vendors (including mobile), but is not a part of core Vulkan 1.4. The main issue with BDA is the lack of pointer support in the GLSL and HLSL shader languages. The user has to use raw 64-bit integers instead.
A 64-bit integer can be cast to a struct. Structs are defined with custom BDA syntax. Array indexing requires declaring an extra BDA struct type with an array in it, if the user wants the compiler to generate the index addressing math. Debugging support is currently limited. Usability matters a lot, and BDA will remain a niche until HLSL and GLSL support pointers natively. This is a stark contrast to CUDA, OpenCL and Metal, where native pointer support is a language core pillar and debugging works flawlessly.

DirectX 12 has no support for pointers in shaders. As a consequence, HLSL doesn't allow passing arrays as function parameters. Simple things like having a material array inside a UBO/SSBO require hacking around with macros. It's impossible to make reusable functions for reductions (prefix sum, sort, etc), since groupshared memory arrays can't be passed between functions. You could of course declare a separate global array for each utility header/library, but the compiler will allocate groupshared memory for each of them separately, reducing occupancy. There's no easy way to alias groupshared memory. GLSL has identical issues. Pointer based languages like CUDA and Metal MSL don't have such issues with arrays. CUDA has a vast ecosystem of 3rd party libraries, and this ecosystem makes Nvidia the most valued company on the planet. Graphics shading languages need to evolve to meet modern standards. We need a library ecosystem too.

I will be using a C/C++ style shading language similar to CUDA and Metal MSL in my examples, with some HLSL-style system value (SV) semantics mixed in for the graphics specific bits and pieces.

Operating system threading APIs commonly provide a single 64-bit void pointer to the thread function. The operating system doesn't care about the user's data input layout. Let's apply the same ideology to the GPU kernel data inputs. The shader kernel receives a single 64-bit pointer, which we cast to our desired struct (by the kernel function signature). Developers can use the same shared C/C++ header on both the CPU and GPU side.

// Common header…
struct alignas(16) Data
{
    // Uniform data
    float16x4 color;      // 16-bit float vector
    uint16x2 offset;      // 16-bit integer vector
    const uint8* lut;     // pointer to 8-bit data array

    // Pointers to in/out data arrays
    const uint32* input;
    uint32* output;
};

// CPU code…
gpuSetPipeline(commandBuffer, computePipeline);

auto data = myBumpAllocator.allocate(); // Custom bump allocator (wraps gpuMalloc ptr, see appendix)
data.cpu->color = {1.0f, 0.0f, 0.0f, 1.0f};
data.cpu->offset = {16, 0};
data.cpu->lut = luts.gpu + 64; // GPU pointers support pointer math (no need for offset API)
data.cpu->input = input.gpu;
data.cpu->output = output.gpu;

gpuDispatch(commandBuffer, data.gpu, uvec3(128, 1, 1));

// GPU kernel…
[groupsize = (64, 1, 1)]
void main(uint32x3 threadId : SV_ThreadID, const Data* data)
{
    uint32 value = data->input[threadId.x];
    // TODO: Code using color, offset, lut, etc…
    data->output[threadId.x] = value;
}

In the example code we use a simple linear bump allocator (myBumpAllocator) for allocating GPU arguments (see appendix for implementation). It returns a struct {void* cpu, void *gpu}. The CPU pointer is used for writing directly to persistently mapped GPU memory, and the GPU pointer can be stored to GPU data structures or passed as a dispatch command argument.

Most GPUs preload root uniforms (including 64-bit pointers) into constant or scalar registers just before launching a wave. This optimization remains viable: the draw/dispatch command carries the base data pointer. All the input uniforms (including pointers to other data) are found at small fixed offsets from the base pointer. Since shaders are pre-compiled and further optimized into device-specific microcode during PSO creation, drivers have ample opportunity to set up register preloading and similar root data optimizations. Users should put the most important data in the beginning of the root struct, as root data size is limited in some architectures. Our root struct has no hard size limit. The shader compiler will emit standard (scalar/uniform) memory loads for the remaining fields. The root data pointer provided to the shader is const. The shader can't modify the root input data, as it might still be used by the command processor for preloading data to new waves. Output is done through non-const pointers (see Data::output in the above example). By forcing the root data to be const, we also allow GPU drivers to perform their special uniform data path optimizations.

Do we need a special uniform buffer type? Modern shader compilers perform automatic uniformity analysis. If all inputs to an instruction are uniform, the output is also uniform. Uniformity propagates over the shader. All modern architectures have scalar registers/loads or a similar construct (SIMD1 on Intel). Uniformity analysis is used to convert vector loads into scalar loads, which saves registers and reduces latency. Uniformity analysis doesn't care about the buffer type (UBO vs SSBO). The resource must be readonly (this is why you should always decorate an SSBO with the readonly attribute in GLSL or prefer SRV over UAV in DirectX 12). The compiler also needs to be able to prove that the pointer is not aliased. The C/C++ const keyword means that data can't be modified through this pointer; it doesn't guarantee that no other read-write pointer aliases the same memory region. C99 added the restrict keyword for this purpose and CUDA kernels use it frequently. Root pointers in Metal are no-alias (restrict) by default, and so are buffer objects in Vulkan and DirectX 12. We should adopt the same convention to give the compiler more freedom to do optimizations.

The shader compiler is not always able to prove address uniformity at compile time. Modern GPUs opportunistically optimize dynamically uniform address loads. If the memory controller detects that all lanes of a vector load instruction have a uniform address, it emits a single lane load instead of a SIMD wide gather. The result is replicated to all lanes. This optimization is transparent, and doesn't affect shader code generation or register allocation.
Dynamically uniform data is a much smaller performance hit than it used to be in the past, especially when combined with the new fast raw load paths.

Some GPU vendors (ARM Mali and Qualcomm Adreno) take the uniformity analysis a step further. The shader compiler extracts uniform loads and uniform math. A scalar preamble runs before the shader. Uniform memory loads and math are executed once for the whole draw/dispatch, and the results are stored in special hardware constant registers (the same registers used by root constants).

All of the above optimizations together provide a better way of handling uniform data than the classic 16KB/64KB uniform/constant buffer abstraction. Many GPUs still have special uniform registers for root constants, system values and the preamble (see above paragraph).

Ideally, texture descriptors would behave like any other data in GPU memory, allowing them to be freely mixed in structs with other data. However, this level of flexibility isn't universally supported by all modern GPUs. Fortunately bindless texture sampler designs have converged over the last decade, with only two primary methods remaining: 256-bit raw descriptors and the indexed descriptor heap.

AMD's raw descriptor method loads 256-bit descriptors directly from GPU memory into the compute unit's scalar registers. Eight consecutive 32-bit scalar registers contain a single descriptor. During the SIMD texture sample instruction, the shader core sends a 256-bit texture descriptor and per-lane UVs to the sampler unit. This provides the sampler all the data it needs to address and load texels without any indirections. The drawback is that the 256-bit descriptor takes a lot of register space and needs to be resent to the sampler for each sample instruction.

The indexed descriptor heap approach uses 32-bit indices (20 bits for old Intel iGPUs). 32-bit indices are trivial to store in structs, load into standard SIMD registers and pass around efficiently. During a SIMD sample instruction, the shader core sends the texture index and the per-lane UVs to the sampler unit. The sampler fetches the descriptor from the descriptor heap: heap base address + texture index * stride (256 bits in modern GPUs). The texture heap base address is either abstracted by the driver (Vulkan and Metal) or provided by the user (SetDescriptorHeaps in DirectX 12). Changing the texture heap base address may result in an internal pipeline barrier (on older hardware). On modern GPUs the texture heap's 64-bit base address is often part of each sample instruction's data, allowing sampling from multiple heaps seamlessly (64-bit base + 32-bit offset per lane). The sampler unit has a tiny internal descriptor cache to avoid indirect reads after the first access. Descriptor caches must be invalidated whenever the descriptor heap is modified.

A few years ago it looked like AMD's scalar register based texture descriptors were the winning formula in the long run. Scalar registers are more flexible than a descriptor heap, allowing descriptors to be embedded inside GPU data structures directly. But there's a downside. Modern GPU workloads such as ray-tracing and deferred texturing (Nanite) lean on non-uniform texture indices. The texture heap index is not uniform over a SIMD wave.
A 32-bit heap index is just 4 bytes, so we can send it per lane. In contrast, a 256-bit descriptor is 32 bytes. It is not feasible to fetch and send a full 256-bit descriptor per lane. Modern Nvidia, Apple and Qualcomm GPUs support a per-lane descriptor index mode in their sample instructions, making the non-uniform case more efficient. The sampler unit performs an internal loop if required. Inputs/outputs to/from sampler units are sent once, regardless of the heap index coherence. AMD's scalar register based descriptor architecture requires the shader compiler to generate a scalarization loop around the texture sample instruction. This costs extra ALU cycles and requires sending and receiving (partially masked) sampler data multiple times. It's one of the reasons why Nvidia is faster in ray-tracing than AMD. ARM and Intel use 32-bit heap indices too (like Nvidia, Qualcomm and Apple), but their latest architectures don't yet have a per-lane heap index mode. They emit a similar scalarization loop as AMD for the non-uniform index case.

All of these differences can be wrapped under a unified texture descriptor heap abstraction. The de-facto texture descriptor size is 256 bits (192 bits on Apple for a separate texture descriptor, the sampler is the remaining 32 bits). The texture heap can be presented as a homogeneous array of 256-bit descriptor blobs. Indexing is trivial. DirectX 12 shader model 6.6 provides a texture heap abstraction like this, but doesn't allow direct CPU or compute shader write access to the descriptor heap memory. A set of APIs is used for creating descriptors and copying descriptors from the CPU to the GPU. The GPU is not allowed to write the descriptors. Today, we can remove this API abstraction completely by allowing direct CPU and GPU writes to the descriptor heap. All we need is a simple (user-land) driver helper function for creating a 256-bit (uint64[4]) hardware specific descriptor blob. Modern GPUs have UMA or PCIe ReBAR. The CPU can directly write descriptor blobs into GPU memory. Users can also use compute shaders to copy or generate descriptors. The shader language has a descriptor creation intrinsic too. It returns a hardware specific uint64x4 descriptor blob (analogous to the CPU API). This approach cuts the API complexity drastically and is both faster and more flexible than the DirectX 12 descriptor update model. Vulkan's VK_EXT_descriptor_buffer extension (2022, https://www.khronos.org/blog/vk-ext-descriptor-buffer) is similar to my proposal, allowing direct CPU and GPU writes. It is supported by most vendors, but unfortunately is not part of the Vulkan 1.4 core spec.

// App startup: Allocate a texture descriptor heap (for example 65536 descriptors)
GpuTextureDescriptor* textureHeap = (GpuTextureDescriptor*)gpuMalloc(65536 * sizeof(GpuTextureDescriptor));

// Load an image using a 3rd party library
auto pngImage = pngLoad("cat.png");
auto uploadMemory = uploadBumpAllocator.allocate(pngImage.byteSize); // Custom bump allocator (wraps gpuMalloc ptr)
pngImage.load(uploadMemory.cpu);

// Allocate GPU memory for our texture (optimal layout with metadata)
GpuTextureDesc textureDesc { .dimensions = pngImage.dimensions, .format = FORMAT_RGBA8_UNORM, .usage = SAMPLED };
GpuTextureSizeAlign textureSizeAlign = gpuTextureSizeAlign(textureDesc);
void* texturePtr = gpuMalloc(textureSizeAlign.size, textureSizeAlign.align, MEMORY_GPU);
GpuTexture texture = gpuCreateTexture(textureDesc, texturePtr);

// Create a 256-bit texture view descriptor and store it
textureHeap[0] = gpuTextureViewDescriptor(texture, { .format = FORMAT_RGBA8_UNORM });

// Batched upload: begin
GpuCommandBuffer uploadCommandBuffer = gpuStartCommandRecording(queue);

// Copy all textures here!
gpuCopyToTexture(uploadCommandBuffer, texturePtr, uploadMemory.gpu, texture);
// TODO other textures…

// Batched upload: end
gpuBarrier(uploadCommandBuffer, STAGE_TRANSFER, STAGE_ALL, HAZARD_DESCRIPTORS);
gpuSubmit(queue, { uploadCommandBuffer });

// Later during rendering…
gpuSetActiveTextureHeapPtr(commandBuffer, gpuHostToDevicePointer(textureHeap));

It is almost possible to get rid of the CPU side texture object (GpuTexture) completely. Unfortunately the triangle rasterizer units of all modern GPUs are not yet bindless. The CPU driver needs to prepare command packets to bind render targets and depth-stencil buffers, and to clear and resolve. These APIs don't use the 256-bit GPU texture descriptor. We need driver specific extra CPU data (stored in the GpuTexture object).

The simplest way to reference a texture in a shader is to use a 32-bit index. A single index can also represent the starting offset of a range of descriptors. This offers a straightforward way to implement the DirectX 12 descriptor table abstraction and the Vulkan descriptor set abstraction without an API. We also get an elegant solution to the fast material switch use case: all we need is a single 64-bit GPU pointer, pointing to a material data struct (containing material properties + a 32-bit texture heap start index). Vulkan vkCmdBindDescriptorSets and DirectX 12 SetGraphicsRootDescriptorTable are relatively fast API calls, but they are nowhere near as fast as writing a single 64-bit pointer to persistently mapped GPU memory. A lot of complexity is removed by not needing to create, update and delete resource binding API objects. CPU time is also saved as the user no longer needs to maintain a hash map of descriptor sets, a common approach to solve the immediate vs retained mode discrepancy in game engines.
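A minimal sketch of such a material struct, consistent with the shader example further below (the exact field names and the drawData/materialAllocation variables are my assumptions, not part of the proposal):

// Material properties + a 32-bit texture heap start index, stored in GPU memory like any other data
struct alignas(16) Material
{
    float32x4 color;
    float32x4 pbr;
    uint32 textureBase; // first descriptor of this material's texture range in the heap
};

// Switching materials is just one 64-bit pointer write into persistently mapped GPU memory
drawData.cpu->material = materialAllocation.gpu; // no descriptor set / binding API call needed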

Metal 4 manages the texture descriptor heap automatically. Texture objects have .gpuResourceID, which is a 64-bit heap index (the Xcode GPU debugger reveals small values such as 0x3). You can directly write texture IDs into GPU structs, as you would use texture indices in DirectX SM 6.6 and Vulkan (descriptor buffer extension). As the heap management in Metal is automatic, users can't allocate texture descriptors in contiguous ranges. It's a common practice to store a 32-bit index to the first texture in the range and calculate the indices for the other textures in the set (see the shader example below). Metal doesn't support this. The user has to write a 64-bit texture handle for each texture separately. To address a set of 5 textures, you need 40 bytes in Metal (5 * 64-bit). Vulkan and DirectX 12 only need 4 bytes (1 * 32-bit). Apple GPU hardware is able to implement SM 6.6 texture heaps. The limitation is the Metal API (software).

Texel buffers can still be supported for backwards compatibility. DirectX 12 stores texel buffer descriptors in the same heap with texture descriptors. A texel buffer functions similarly to a 1d texture (unfiltered tfetch path). Since texel buffers would be mainly used for backwards compatibility, driver vendors wouldn't need to jump through hoops to replace them with faster code paths such as raw memory loads behind the scenes. I am not a big fan of driver background threads and shader replacements.

A non-uniform texture index needs to use the NonUniformResourceIndex notation, similar to GLSL and HLSL. This tells the low level GPU shader compiler to emit a special texture instruction with a per-lane heap index, or a scalarization loop for GPUs that only support uniform descriptors. Since buffers are not descriptors, we never need NonUniformResourceIndex for buffers. We simply pass a 64-bit pointer per lane. It works on all modern GPUs. No scalarization loop, no mess. Additionally, the language should natively support ptr[index] notation for memory loads, where the index is 32 bits. Some GPUs support raw memory load instructions with a 32-bit per lane offset. It reduces the register pressure. Feedback to GPU vendors: please add the missing 64-bit shared base + 32-bit per lane offset raw load instruction and 16-bit uv(w) texture load instructions, if your architecture is still missing them.

const Texture textureHeap[];

[groupsize = (8, 8, 1)]
void main(uint32x3 threadId : SV_ThreadID, const Data* data)
{
    // Non-uniform "buffer data" is not an issue with pointer semantics!
    Material* material = data->materialMap[threadId.xy];

    // Non-uniform texture heap index
    uint32 textureBase = NonUniformResourceIndex(material->textureBase);
    Texture textureColor = textureHeap[textureBase + 0];
    Texture textureNormal = textureHeap[textureBase + 1];
    Texture texturePBR = textureHeap[textureBase + 2];

    Sampler sampler = {.minFilter = LINEAR, .magFilter = LINEAR};
    float32x2 uv = float32x2(threadId.xy) * data->invDimensions;
    float32x4 color = sample(textureColor, sampler, uv);
    float32x4 normal = sample(textureNormal, sampler, uv);
    float32x4 pbr = sample(texturePBR, sampler, uv);

    color *= material->color;
    pbr *= material->pbr;

    // Rest of the shader
}

Modern bindless texturing lets us remove all texture binding APIs. A global indexable texture heap makes all textures visible to all shaders. Texture data still needs to be loaded into GPU memory by copy commands (to enable DCC and Morton swizzle). Texture descriptor creation still needs a thin GPU-specific user-land API. The texture heap can be exposed directly to both the CPU and the GPU as a raw GPU memory array, removing most of the texture heap API complexity compared to DirectX 12 SM 6.6.

Since our shader root data is just a single 64-bit pointer and our textures are just 32-bit indices, shader pipeline creation becomes dead simple. There's no need to define texture bindings, buffer bindings, bind groups (descriptor sets, argument buffers) or the root signature.
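As an illustration, pipeline creation in this model could shrink to roughly the sketch below (gpuSetPipeline and gpuDispatch appear in the earlier examples; the two create functions and their parameters are my assumptions): the shader binary plus fixed-function state, with no layout objects at all.

// Hypothetical pipeline creation: no root signature, no descriptor set layouts, no vertex input layout
GpuPipeline computePipeline = gpuCreateComputePipeline(computeShaderBinary);
GpuPipeline drawPipeline = gpuCreateGraphicsPipeline(vertexShaderBinary, pixelShaderBinary, rasterizerState);

// Usage matches the earlier dispatch example
gpuSetPipeline(commandBuffer, computePipeline);
gpuDispatch(commandBuffer, data.gpu, uvec3(128, 1, 1));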

DirectX 12 and Vulkan utilize complex APIs to bind and set up root signatures, push descriptors, push constants, and descriptor sets. A modern GPU driver essentially constructs a single struct in GPU memory and passes its pointer to the command processor. We have shown that such API complexity is unnecessary. The user simply writes the root struct into persistently mapped GPU memory and passes a 64-bit GPU pointer directly to the draw/dispatch function. Users can also include 64-bit pointers and 32-bit texture heap indices inside their structs to build any indirect data layout that fits their needs. Root binding APIs and the whole DX12 buffer zoo can be replaced efficiently with 64-bit pointers. This simplifies shader pipeline creation drastically. We don't need to define the data layout at all. We successfully removed a massive chunk of API complexity while providing more flexibility to the user.

Vulkan, Metal and WebGPU have a concept of static (specialization) constants, locked in at shader pipeline creation. The driver's internal shader compiler applies these constants as literals in the input shader IR and does a constant propagation and dead code elimination pass afterward. This can be used to create multiple permutations of the same shader at pipeline creation, reducing the time and storage required for offline compiling all the shader permutations.

Vulkan and Metal have a set of APIs and a special shader syntax for describing the shader specialization constants and their values. It would be nicer to simply provide a C struct that matches the constant struct defined on the shader side. That would require minimal API surface and would bring important improvements.

Vulkan's specialization constants have a design flaw. Specialization constants can't modify the descriptor set layouts. Data inputs and outputs are fixed. The user could hack around the limitation by implementing an uber-layout containing all potential inputs/outputs and skip updating unused descriptors, but this is cumbersome and sub-optimal. Our proposed design doesn't have the same problem. One can simply branch on a constant (the other side is dead code eliminated) and reinterpret the shader data input pointer as a different struct. One could also mimic the C++ inheritance data layout: use a common layout for the beginning of the input struct and put specialized data at the end. Static polymorphism can be achieved cleanly. Runtime performance is identical to a hand optimized shader. The specialization struct can also include GPU pointers, allowing the user to hardcode runtime memory locations, avoiding indirections. This has never been possible in a shader language before. Instead, the GPU vendors had to use background threads to analyze the shaders to do similar shader replacement optimizations at runtime, increasing the CPU cost and the driver complexity significantly.
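A sketch of the C-struct style specialization under these assumptions (the struct, its field names and the extra gpuCreateComputePipeline parameters are illustrative, not an API from the post): the same header is shared by the CPU and the shader, and the driver folds the values in as literals at pipeline creation.

// Shared header: specialization constants as a plain struct
struct SpecConstants
{
    uint32 useDetailTextures;   // branch on this; the dead side is eliminated at pipeline creation
    uint32 lightCount;          // loop bound becomes a literal
    const float32x4* lightData; // even a runtime memory location can be hardcoded into the shader
};

// CPU side: pass the filled struct when creating the pipeline permutation
SpecConstants spec { .useDetailTextures = 1, .lightCount = 8, .lightData = lights.gpu };
GpuPipeline pipeline = gpuCreateComputePipeline(computeShaderBinary, &spec, sizeof(spec));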

The shader permutation hell is one of the biggest issues in modern graphics today. Gamers are complaining about stutter, devs are complaining about offline shader compilation taking hours. This new design gives the user added flexibility. They can toggle between static and dynamic behavior inside the shader, making it easy to have a generic fallback and specialization on demand. This design reduces the number of shader permutations and the runtime stalls caused by pipeline creation.

The most hated feature in modern graphics APIs must be the barriers. Barriers serve two purposes: enforce producer-to-consumer execution dependencies and transition textures between layouts.

Many graphics programmers have an incorrect mental model of GPU synchronization. A common belief is that GPU synchronization is based on fine-grained texture and buffer dependencies. In reality, modern GPU hardware doesn't really care about individual resources. We spend lots of CPU cycles in userland preparing a list of individual resources and how their layouts change, but modern GPU drivers practically throw that list away. The abstraction doesn't match reality.

Modern bindless architecture gives the GPU a lot of freedom. A shader can write to any 64-bit pointer or any texture in the global descriptor heap. The CPU doesn't know what decisions the GPU is going to make. How is it supposed to emit transition barriers for each affected resource? This is a clear mismatch between bindless architecture and classic CPU-driven rendering APIs today. Let's investigate why the APIs were designed like this 10 years ago.

AMD GCN had a big influence on modern graphics API design. GCN was ahead of its time with async compute and bindless texturing (using scalar registers to store descriptors), but it also had crucial limitations in its delta color compression (DCC) and cache design. These limitations are a great example of why the barrier model we have today is so complex. GCN didn't have a coherent last-level cache. ROPs (raster operations = pixel shader outputs) had special non-coherent caches directly connected to the VRAM. The driver had to first flush the ROP caches to memory and then invalidate the L2$ to make pixel shader writes visible to shaders and samplers. The command processor also wasn't a client of the L2$. Indirect arguments written in compute shaders weren't visible to the command processor without invalidating the whole L2$ and flushing all dirty lines into VRAM. GCN 3 introduced delta color compression (DCC) for ROPs, but AMD's texture samplers were not able to directly read DCC compressed textures or compressed depth buffers. The driver had to run an internal decompress compute shader to eliminate the compression. The display engine could not read DCC compressed textures either. The common case of sampling a render target required two internal barriers and flushing all caches (wait for ROPs, flush ROP cache and L2$, run the decompress compute shader, wait for compute).

AMD's new RDNA architecture has several crucial improvements: it has a coherent L2$ covering all memory operations. ROPs and the command processor are clients of the L2$. The only non-coherent caches are the tiny L0$ and K$ (scalar cache) inside the compute units.
A barrier now requires only flushing the outstanding writes in the tiny caches into the higher level cache. The driver no longer has to flush the last-level (L2) cache into the VRAM, making barriers significantly faster. RDNA's improved display engine is capable of reading DCC compressed textures, and a (de)compressor sits between the L2$ and the L0$ texture cache. There's no need to decompress textures into VRAM before sampling, removing the need for texture layout transitions (compressed / uncompressed).

All desktop and mobile GPU vendors have reached similar conclusions: bandwidth is the bottleneck today. We should never waste bandwidth decoding resources into VRAM. Layout transitions are no longer needed.

AMD RDNA (2019): Improved cache hi­er­ar­chy, DCC and dis­play en­gine in the RDNA ar­chi­tec­ture. L2$ con­tains DCC com­pressed data. (De)compressor sits be­tween L2$ and lower lev­els. L0$ (texture) is de­com­pressed. Image © AMD.

Resource lists are the most annoying aspect of barriers in DirectX 12 and Vulkan. Users are expected to track the state of each resource individually, and tell the graphics API their previous and next state for each barrier. This was necessary on 10 year old GPUs, as vendors hid various decompress commands under the barrier API. The barrier command functioned as the decompress command, so it had to know which resources required decompression. Today's hardware doesn't need texture layouts or decompress steps. Vulkan just got a new VK_KHR_unified_image_layouts extension (2025, https://www.khronos.org/blog/so-long-image-layouts-simplifying-vulkan-synchronisation), removing the image layout transitions from the barrier command. But it still requires the user to list individual textures and buffers. Why is this?

The main reason is legacy API and tooling compatibility. People are used to thinking about resource dependencies, and the existing Vulkan and DirectX 12 validation layers are designed that way. However, the barrier command executed by the GPU contains no information about textures or buffers at all. The resource list is consumed solely by the driver.

Our modern driver loops through your resource list and populates a set of flags. Drivers no longer need to worry about resource layouts or last level cache coherency, but there still exist tiny non-coherent caches that need flushing in special cases. Modern GPUs flush the majority of the non-coherent caches automatically in every barrier. For example the AMD L0$ and K$ (scalar cache) are always flushed, since every pass writes some outputs and these outputs live in some of these caches. Fine grained tracking of all write addresses would be too expensive. Tiny non-coherent caches tend to be inclusive. Modified lines get flushed to the next cache level. This is fast and doesn't produce VRAM traffic. Some architectures have special caches that are not automatically flushed. Examples: descriptor caches in the texture samplers (see above chapter), rasterizer ROP caches and HiZ caches. The command processor commonly runs ahead to reduce the wave spawn latency. If we write indirect arguments in a shader, we need to inform the GPU to stall the command processor prefetcher to avoid a race. The GPU doesn't actually know whether your compute shader was writing into an indirect argument buffer or not. In DirectX 12 the buffer is transitioned to D3D12_RESOURCE_STATE_INDIRECT_ARGUMENT and in Vulkan the consumer dependency has a special stage VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT. When a barrier has a resource transition like this or a stage dependency like this, the driver will include a command processor prefetcher stall flag in the barrier.

A modern barrier design replaces the resource list with a single bitfield describing what happens to these special non-coherent caches. Special cases include: invalidate texture descriptors, invalidate draw arguments and invalidate depth caches. These flags are needed when we generate draw arguments, write to the descriptor heap or write to a depth buffer with a compute shader. Most barriers don't need special cache invalidation flags.

Some GPUs still need to decompress data in special cases. For example during a copy or a clear command (fast clear eliminate if the clear color has changed).
Copy and clear com­mands take the af­fected re­source as a pa­ra­me­ter. The dri­ver can take nec­es­sary steps to de­code the data if needed. We don’t need a re­source list in our bar­rier for these spe­cial cases. Not all for­mats and us­age flags sup­port com­pres­sion. The dri­ver will keep the data un­com­pressed in these cases, in­stead of tran­si­tion­ing it back and forth, wast­ing band­width.
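A sketch of what such a barrier could look like, reusing the gpuBarrier call shape from the texture upload example above (the stage names and hazard flags beyond those already used there are my assumptions):

// A compute shader wrote new descriptors into the heap; later stages sample through them.
// The only special thing to request is descriptor cache invalidation; there is no resource list.
gpuBarrier(commandBuffer, STAGE_COMPUTE, STAGE_ALL, HAZARD_DESCRIPTORS);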

If you write to the tex­ture de­scrip­tor heap (uncommon), you need to add a spe­cial flag.
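And a sketch of the common render-target-to-sampling case under the same assumptions: the producer is the raster output stage and the consumer is the pixel shader stage, so the next pass' vertex work is free to overlap.

// Offscreen render target finished by the ROPs, sampled by the next pass' pixel shaders.
// Vertex shading (and tile binning on mobile GPUs) of the next pass is not blocked.
gpuBarrier(commandBuffer, STAGE_RASTER_OUTPUT, STAGE_PIXEL, HAZARD_NONE);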

A bar­rier be­tween ras­ter­izer out­put and pixel shader is a com­mon case for off­screen ren­der tar­get → sam­pling. Our ex­am­ple has de­pen­dency stages set up in a way that the bar­rier does­n’t block ver­tex shaders, al­low­ing ver­tex shad­ing (and tile bin­ning on mo­bile GPUs) to over­lap with pre­vi­ous passes. A bar­rier with raster out­put stage (or later) as the pro­ducer au­to­mat­i­cally flushes non-co­her­ent ROP caches if the GPU ar­chi­tec­ture needs that. We don’t need an ex­plicit flag for it.

Users only describe the queue execution dependencies: producer and consumer stage masks. There's no need to track the individual texture and buffer resource states, removing a lot of complexity and saving a significant amount of CPU time versus the current DirectX 12 and Vulkan designs. Metal 2 has a modern barrier design already: it doesn't use resource lists.

Many GPUs have custom scratchpad memories: groupshared memory inside each compute unit, tile memory, and large shared scratchpads like the Qualcomm GMEM. These memories are managed automatically by the driver. Temporary scratchpads like groupshared memory are never stored to memory. Tile memories are stored automatically by the tile rasterizer (store op == store). Uniform registers are read-only and pre-populated before each draw call. Scratchpads and uniform registers don't have cache coherency protocols and don't interact with the barriers directly.

Modern GPUs support a synchronization command that writes a value to memory when a shader stage is finished, and a command that waits for a value to appear in a memory location before a shader stage is allowed to begin (the wait includes optional cache flush semantics). This is equivalent to splitting the barrier into two: the producer and the consumer. DirectX 12 split barriers and Vulkan event→wait are examples of this design. Splitting the barrier into producer→consumer allows putting independent work between them, avoiding draining the GPU.

Vulkan event→wait (and DX12 split barriers) see barely any use. The main reason is that normal barriers are already highly complicated, and developers want to avoid extra complexity. Driver support for split barriers also hasn't been perfect in the past. Removing the resource lists simplifies the split barriers significantly. We can also make split barriers semantically similar to timeline semaphores: the signal command writes a monotonically increasing 64-bit value (atomic max) and the wait command waits for the value to be >= N (greater equal). The counter is just a GPU memory pointer; no persistent API object is required. This provides us with a significantly simpler event→wait API.
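A minimal sketch of that split barrier / timeline pattern (gpuWaitBefore and SIGNAL_ATOMIC_OR are named in the text below; gpuSignalAfter, SIGNAL_ATOMIC_MAX, the stage names and frameIndex are my assumptions):

// The counter is just 8 bytes of GPU memory, allocated like any other data
uint64* counterCpu = (uint64*)gpuMalloc(sizeof(uint64));
uint64* counterGpu = (uint64*)gpuHostToDevicePointer(counterCpu);

// Producer: when its compute work finishes, atomically raise the counter to the new value
gpuSignalAfter(commandBuffer, STAGE_COMPUTE, counterGpu, SIGNAL_ATOMIC_MAX, frameIndex);

// ... record independent work here; it is not drained by the signal/wait pair ...

// Consumer: don't start its compute work until the counter is >= frameIndex
gpuWaitBefore(commandBuffer, STAGE_COMPUTE, counterGpu, frameIndex);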

This API is much simpler than the existing VkEvent API, yet offers improved flexibility. In the above example we implemented the timeline semaphore semantics, but we can implement other patterns too, such as waiting on multiple producers using a bitmask: mark bits with SIGNAL_ATOMIC_OR and wait for all bits in a mask to be set (the mask is an optional parameter of the gpuWaitBefore command).

GPU→CPU synchronization was initially messy in Vulkan and Metal. Users needed a separate fence object for each submit. N-buffering was a common technique for reusing the objects. This is a similar usability issue as discussed above regarding VkEvent. DirectX 12 was the first API to solve GPU→CPU synchronization cleanly with timeline semaphores. Vulkan 1.2 and Metal 2 adopted the same design later. A timeline semaphore needs only a single 64-bit monotonically increasing counter. This reduces complexity over the older Vulkan and Metal fence APIs, which many engines still use today.
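Continuing the same sketch for the GPU→CPU direction: nothing more is needed than the same kind of counter being readable by the CPU (gpuSignalAfter and MEMORY_READBACK are the assumed names from the earlier sketches; a real driver would expose a blocking wait instead of the polling loop shown here):

// One monotonically increasing counter in CPU-cached readback memory, so the CPU can read it directly
uint64* frameFenceCpu = (uint64*)gpuMalloc(sizeof(uint64), MEMORY_READBACK);
uint64* frameFenceGpu = (uint64*)gpuHostToDevicePointer(frameFenceCpu);
*frameFenceCpu = 0;
uint64 frameNumber = 0;

// Per frame: record a signal at the very end of the frame's command buffer, then submit
frameNumber++;
gpuSignalAfter(commandBuffer, STAGE_ALL, frameFenceGpu, SIGNAL_ATOMIC_MAX, frameNumber);
gpuSubmit(queue, { commandBuffer });

// CPU side: allow at most 2 frames in flight
while (frameNumber > 2 && *frameFenceCpu < frameNumber - 2) { /* yield */ }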

Our proposed barrier design is a massive improvement over DirectX 12 and Vulkan. It reduces the API complexity significantly. Users no longer need to track individual resources. Our simple hazard tracking has queue + stage granularity. This matches what GPU hardware does today. Game engine graphics backends can be simplified and CPU cycles are saved.

Vulkan and DirectX 12 were designed to promote the pre-creation and reuse of resources. Early Vulkan examples recorded a single command buffer at startup, replaying it every frame. Developers quickly discovered that command buffer reuse was impractical. Real game environments are dynamic and the camera is in constant motion. The visible object set changes frequently.

Game engines ignored prerecorded command buffers entirely. Metal and WebGPU feature transient command buffers, which are created just before recording and disappear after the GPU has finished rendering. This eliminates the need for command buffer management and prevents multiple submissions of the same commands. GPU vendors recommend one-shot command buffers (a resettable command pool per frame in flight) in Vulkan too, as it simplifies the driver's internal memory management (bump allocator vs heap allocator). The best practices match the Metal and WebGPU design. Persistent command buffer objects can be removed. That API complexity didn't provide anything worth using.
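For completeness, a one-shot command buffer in this model is just record, submit and forget, exactly as already used in the texture upload example above (no new names here beyond those already in the text):

// Transient command buffer: created for this frame, gone after the GPU finishes it
GpuCommandBuffer commandBuffer = gpuStartCommandRecording(queue);
// ... record this frame's copies, barriers, dispatches and draws ...
gpuSubmit(queue, { commandBuffer }); // no reset, no reuse, no persistent object to manage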

Let's start with a burning question: do we need graphics shaders anymore? UE5 Nanite uses compute shaders to plot pixels using 64-bit atomics. High bits contain the pixel depth and low bits contain the payload. Atomic-min ensures that the closest surface remains. This technique was first presented at SIGGRAPH 2015 by Media Molecule's Dreams (Alex Evans). The hardware rasterizer still has some advantages, like hierarchical/early depth-stencil tests. Nanite has to lean solely on coarse cluster culling, which results in extra overdraw with kitbashed content. Ubisoft (me and Ulrich Haar) presented this two-pass cluster culling algorithm at SIGGRAPH 2015. Ubisoft used cluster culling in combination with the hardware rasterizer for more fine-grained culling. Today's GPUs are bindless and much better suited for GPU-driven workloads like this. 10 years ago Ubisoft had to lean on virtual texturing (all textures in the same atlas) instead of bindless texturing. Despite the many compute-only rasterizers today (Nanite, SDF sphere tracing, DDA voxel tracing), the hardware rasterizer still remains the most used technique for rendering triangles in games today. It's definitely worth discussing how to make the rasterization pipeline more flexible and easier to use.

The modern shader framework has grown to 16 shader entry points. We have eight entry points for rasterization (pixel, vertex, geometry, hull, domain, patch constant, mesh and amplification), and six for ray-tracing (ray generation, miss, closest hit, any hit, intersection and callable). In comparison, CUDA has a single entry point: the kernel. This makes CUDA composable. CUDA has a healthy ecosystem of 3rd party libraries. New GPU hardware blocks such as the tensor cores (AI) are exposed as intrinsic functions. This is how it all started in the graphics land as well: texture sampling was our first intrinsic function. Today, texture sampling is fully bindless and doesn't even require driver setup. This is the design developers prefer. Simple, easy to compose, and extend.

We recently got more intrinsics: inline raytracing and cooperative matrix (wave matrix in DirectX 12, subgroup matrix in Metal). I am hoping that this is the new direction. We should start tearing down the massive 16-shader framework and replacing it with intrinsics that can be composed in a flexible way.

Solving the shader framework complexity is a massive topic. To keep the scope of this blog post in check, I will today only discuss compute shaders and raster pipelines. I am going to be writing a followup about simplifying the shader framework, including modern topics such as ray-tracing, shader execution reordering (SER), dynamic register allocation extensions and Apple's new L1$ backed register file (called dynamic caching).

There are two relevant raster pipelines today: vertex+pixel and mesh+pixel. Mobile GPUs employing tile based deferred rendering (TBDR) perform per-triangle binning. The tile size is commonly between 16x16 and 64x64 pixels, making meshlets too coarse-grained a primitive for binning. A meshlet has no clear 1:1 lane to vertex mapping, and there's no straightforward way to run a partial mesh shader wave for selected triangles.
This is the main reason mobile GPU vendors haven't been keen to adopt the desktop-centric mesh shader API designed by Nvidia and AMD. Vertex shaders are still important for mobile.

I will not be discussing geometry, hull, domain, and patch constant (tessellation) shaders. The graphics community widely considers these shader types failed experiments. They all have crucial performance issues in their design. In all relevant use cases, you can run a compute prepass generating an index buffer to outperform these stages. Additionally, mesh shaders allow generating a compact 8-bit index buffer into on-chip memory, further increasing the performance gap over these legacy shader stages.

Our goal is to build a modern PSO abstraction with a minimal amount of baked state. One of the main critiques of Vulkan and DirectX 12 has been the pipeline permutation explosion. The less state we have inside the PSO, the fewer pipeline permutations we get. There are two main areas to improve: graphics shader data bindings and the rasterizer state.

The vertex+pixel shader pipeline needs several additional inputs compared to a compute kernel: vertex buffers, an index buffer, rasterizer state, render target views and a depth-stencil view. Let's start by discussing the shader visible data bindings.

Vertex buffer bindings are easy to solve: we simply remove them. Modern GPUs have fast raw load paths. Most GPU vendors have been emulating vertex fetch hardware already for several generations. Their low level shader compiler reads the user defined vertex layout and emits appropriate raw load instructions in the beginning of the vertex shader. The vertex bindings declaration is another example of a special C/C++ API for defining a struct memory layout. It adds complexity and forces compiling multiple PSO permutations for different layouts. We simply replace the vertex buffers with standard C/C++ structs. No API is required.

The same is true for per-in­stance data and mul­ti­ple ver­tex streams. We can im­ple­ment them ef­fi­ciently with raw mem­ory loads. When we use raw load in­struc­tions, we can dy­nam­i­cally ad­just the ver­tex stride, branch over sec­ondary ver­tex buffer loads and cal­cu­late our ver­tex in­dices us­ing cus­tom for­mu­las to im­ple­ment clus­tered GPU-driven ren­der­ing, par­ti­cle quad ex­pan­sion, higher or­der sur­faces, ef­fi­cient ter­rain ren­der­ing and many other al­go­rithms. Additional shader en­try points and bind­ing APIs are not needed. We can use our new sta­tic con­stant sys­tem to dead code elim­i­nate ver­tex streams at pipeline cre­ation or pro­vide a sta­tic ver­tex stride if we so pre­fer. All the old op­ti­miza­tion strate­gies still ex­ist, but we can now mix and match tech­niques freely to match our ren­der­er’s needs.

// Common header…
struct VertexPosition
{
    float32x4 position;
};

struct VertexAttributes
{
    uint8x4 normal;
    uint8x4 tangent;
    uint16x2 uv;
};

...

Read the original on www.sebastianaaltonen.com »

10 730 shares, 27 trendiness

Pricing changes for GitHub Actions

We’ve read your posts and heard your feed­back.

We’re post­pon­ing the an­nounced billing change for self-hosted GitHub Actions to take time to re-eval­u­ate our ap­proach.

We are continuing to reduce hosted-runner prices by up to 39% on January 1, 2026.

We have real costs in run­ning the Actions con­trol plane. We are also mak­ing in­vest­ments into self-hosted run­ners so they work at scale in cus­tomer en­vi­ron­ments, par­tic­u­larly for com­plex en­ter­prise sce­nar­ios. While this con­text mat­ters, we missed the mark with this change by not in­clud­ing more of you in our plan­ning.

We need to im­prove GitHub Actions. We’re tak­ing more time to meet and lis­ten closely to de­vel­op­ers, cus­tomers, and part­ners to start. We’ve also opened a dis­cus­sion to col­lect more di­rect feed­back and will use that feed­back to in­form the GitHub Actions roadmap. We’re work­ing hard to earn your trust through con­sis­tent de­liv­ery across GitHub Actions and the en­tire plat­form.

Below is the original announcement from 12/16/25

We’re announcing updates to our pricing and product models for GitHub Actions.

Historically, self-hosted runner customers were able to leverage much of GitHub Actions’ infrastructure and services at no cost. This meant that the cost of maintaining and evolving these essential services was largely being subsidized by the prices set for GitHub-hosted runners. By updating our pricing, we’re aligning costs more closely with usage and the value delivered to every Actions user, while fueling further innovation and investment across the platform. The vast majority of users, especially individuals and small teams, will see no price increase.

A GitHub Actions pricing calculator is available so you can see how much you will be charged; use it to estimate your future costs. 96% of customers will see no change to their bill. Of the 4% of Actions users impacted by this change, 85% will see their Actions bill decrease, and the remaining 15% will see a median increase of around $13.

GitHub Actions will remain free for public repositories. In 2025, we saw developers use 11.5 billion total Actions minutes in public projects for free (~$184 million) and we will continue to invest in Actions to provide a fast, reliable, and predictable experience for our users.

When we shipped Actions in 2018, we had no idea how popular it would become. By early 2024, the platform was running about 23 million jobs per day and our existing architecture couldn’t reliably support our growth curve. In order to increase feature velocity, we first needed to improve reliability and modernize the legacy frameworks that supported GitHub Actions.

Our solution was to re-architect the core backend services powering GitHub Actions jobs and runners with the goals of improving uptime and resilience against infrastructure issues, enhancing performance, reducing internal throttles, and leveraging GitHub’s broader platform investments and developer experience improvements. This work is paying off by helping us handle our current scale, even as we work through the last pieces of stabilizing our new platform.

Since August, all GitHub Actions jobs have run on our new architecture, which handles 71 million jobs per day (over 3x from where we started). Individual enterprises are able to start 7x more jobs per minute than our previous architecture could support.

As with any product, our goal at GitHub has been to meet customer needs while providing enterprises with flexibility and transparency.

This change better supports a world where CI/CD must be faster and more reliable: better caching, more workflow flexibility, and rock-solid reliability. It strengthens the core experience while positioning GitHub Actions to power GitHub’s open, secure platform for agentic workloads.

Starting today, we’re charging fairly for Actions across the board, which reduces the price of GitHub-hosted runners and the price the average GitHub customer pays. We’re reducing the net cost of GitHub-hosted runners by up to 39%, depending on which machine type is used.

This reduction is driven by a ~40% price reduction across all runner sizes, paired with the addition of a new $0.002 per-minute GitHub Actions cloud platform charge. For GitHub-hosted runners, the new Actions cloud platform charge is already included in the reduced meter price.
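
As a back-of-the-envelope illustration of how those two numbers combine (the per-minute base rates below are hypothetical placeholders, not GitHub's published price list; only the ~40% reduction and the $0.002 platform charge come from this announcement), a flat charge on top of a percentage cut is also why smaller runners see a smaller relative reduction than larger ones:

# Sketch of the announced price composition, under the stated assumption that
# new hosted rate ~ old rate cut by ~40%, plus the $0.002/min Actions cloud
# platform charge (already folded into the published meter price).
# The old_rate values are hypothetical examples, not GitHub's price list.

PLATFORM_CHARGE = 0.002  # USD per minute, from the announcement
REDUCTION = 0.40         # "~40% price reduction across all runner sizes"

def new_hosted_rate(old_rate: float) -> float:
    """Estimate a new per-minute meter price for a GitHub-hosted runner."""
    return old_rate * (1 - REDUCTION) + PLATFORM_CHARGE

for old in (0.008, 0.016, 0.064):  # hypothetical small/medium/large rates
    new = new_hosted_rate(old)
    print(f"old ${old:.3f}/min -> new ${new:.4f}/min "
          f"({(1 - new / old) * 100:.0f}% net reduction)")

Under these assumptions the net reduction approaches, but never quite reaches, 40% as runner size grows, which matches the "up to 39%" figure.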

Standard GitHub-hosted or self-hosted runner usage on public repositories will remain free. GitHub Enterprise Server pricing is not impacted by this change.

The price reduction you will see in your account depends on the types of machines that you use most frequently — smaller runners will have a smaller relative price reduction, larger runners will see a larger relative reduction.

This price reduction makes high-performance compute more accessible for both high-volume CI workloads and the agent jobs that rely on fast, secure execution environments.

For full pricing update details, see the updated Actions runner prices in our documentation.

This price change will go into effect on January 1, 2026.

We are introducing a $0.002 per-minute Actions cloud platform charge for all Actions workflows across GitHub-hosted and self-hosted runners. The new listed GitHub-runner rates include this charge. This will not impact Actions usage in public repositories or GitHub Enterprise Server customers.
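
For self-hosted runners on private repositories, the charge itself is a simple flat meter. A minimal sketch (the monthly minute count is an invented example, and any included free quota on your plan, covered in the FAQ below, would offset part of this):

# Sketch of the self-hosted platform charge on private repositories.
# The 50,000 minutes/month figure is an arbitrary example, not real data.

PLATFORM_CHARGE = 0.002       # USD per self-hosted runner minute
self_hosted_minutes = 50_000  # hypothetical monthly private-repo usage

monthly_charge = self_hosted_minutes * PLATFORM_CHARGE
print(f"{self_hosted_minutes} minutes -> ${monthly_charge:.2f}/month before any included quota")
# -> 50000 minutes -> $100.00/month before any included quota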

This aligns pricing to match consumption patterns and ensures consistent service quality as usage grows across both hosting modalities.

We are increasing our investment into our self-hosted experience to ensure that we can provide autoscaling for scenarios beyond just Linux containers. This will include new approaches to scaling, new platform support, Windows support, and more as we move through the next 12 months. Here’s a preview of what to expect in the new year:

The new Scale Set Client provides enterprises with a lightweight Go SDK to build custom autoscaling solutions without the complexity of Kubernetes or reliance on ARC. It integrates seamlessly with existing infrastructure—containers, virtual machines, cloud instances, or bare metal—while managing job queuing, secure configuration, and intelligent scaling logic. Customers gain a supported path to implement flexible autoscaling, reduce setup friction, and extend GitHub Actions beyond workflows to scenarios such as self-hosted Dependabot and Copilot Coding Agent.

We are reintroducing multi-label functionality for both GitHub-hosted larger runners and self-hosted runners, including those managed by Actions Runner Controller (ARC) and the new Scale Set Client.

This upcoming release introduces major quality-of-life improvements, including refined Helm charts for easier Docker configuration, enhanced logging, updated metrics, and formalized versioning requirements. It also announces the deprecation of legacy ARC, providing a clear migration path to a more reliable and maintainable architecture. Customers benefit from simplified setup, improved observability, and confidence in long-term support, reducing operational friction and improving scalability.

The Actions Data Stream will deliver a near real-time, authoritative feed of GitHub Actions workflow and job event data, including metadata such as the version of the action that was executed on any given workflow run. This capability enhances observability and troubleshooting by enabling organizations to integrate event data into monitoring and analytics systems for compliance and operational insights. By providing structured, high-fidelity data at scale, it eliminates reliance on manual log parsing and empowers teams to proactively manage reliability and performance.

Agents are expanding what teams can automate—but CI/CD remains the heartbeat of modern software delivery. These updates enable both a faster, more reliable CI/CD experience for every developer, and a scalable, flexible, secure execution layer to power GitHub’s agentic platform.

Our goal is to ensure GitHub Actions continues to meet the needs of the largest enterprises and of individual developers alike, with clear pricing, stronger performance, and a product direction built for the next decade of software development.

Why am I being charged to use my own hardware?

Historically, self-hosted runner customers were able to leverage much of GitHub Actions’ infrastructure and services at no cost. This meant that the cost of maintaining and evolving these essential services was largely being subsidized by the prices set for GitHub-hosted runners. By updating our pricing, we’re aligning costs more closely with usage and the value delivered to every Actions user, while fueling further innovation and investment across the platform. The vast majority of users, especially individuals and small teams, will see no price increase.

You can see the Actions pricing calculator to estimate your future costs.

What are the new GitHub-hosted runner rates?

See the GitHub Actions runner pricing reference for the updated rates that will go into effect on January 1, 2026. These listed rates include the new $0.002 per-minute Actions cloud platform charge.

Why is $0.002/minute the right price for self-hosted runners on cloud?

We determined that per-minute billing was deemed the fairest and most accurate model by our users, and we compared it to other self-hosted CI solutions in the market. We believe this is a sustainable option that will not deeply impact either our lightly or heavily active customers, while still delivering fast, flexible workloads for the best end-user experience.

Which job execution scenarios for GitHub Actions are affected by this pricing change?

* Jobs that run in private repositories and use standard GitHub-hosted or self-hosted runners

Standard GitHub-hosted or self-hosted runner usage on public repositories will remain free. GitHub Enterprise Server pricing is not impacted by this change.

When will this pricing change take effect?

The price decrease for GitHub-hosted runners will take effect on January 1, 2026. The new charge for self-hosted runners will apply beginning on March 1, 2026. The price changes will impact all customers on these dates.

Will the free usage quota available in my plan change?

Beginning March 1, 2026, self-hosted runners will be included within your free usage quota, and will consume available usage based on list price the same way that Linux, Windows, and macOS standard runners work today.

Will self-hosted runner usage consume from my free usage minutes?

Yes, billable self-hosted runner usage will be able to consume minutes from the free quota associated with your plan.

How does this pricing change affect customers on GitHub Enterprise Server?

This pricing change does not affect customers using GitHub Enterprise Server. Customers running Actions jobs on self-hosted runners on GitHub Enterprise Server may continue to host, manage, troubleshoot and use Actions on and in conjunction with their implementation free of charge.

Can I bill my self-hosted runner usage on private repositories through Azure?

Yes, as long as you have an active Azure subscription ID associated with your GitHub Enterprise or Organization(s).

What is the overall impact of this change to GitHub customers?

96% of customers will see no change to their bill. Of the 4% of Actions users impacted by this change, 85% will see their Actions bill decrease, and the remaining 15% will see a median increase of around $13.

Did GitHub consider how this impacts individual developers, not just enterprise-scale customers of GitHub?

Of our individual users (Free and Pro plans) who used GitHub Actions on private repos in the last month, only 0.09% would end up with a price increase, with a median increase of under $2 a month. Note that this impact applies only after these users have used the minutes included in their plans today, which entitle them to over 33 hours of included GitHub compute, and it has no effect on their free use of public repos. A further 2.8% of this total user base will see a decrease in their monthly cost as a result of these changes. The rest are unaffected by this change.

How can I figure out what my new monthly cost for Actions looks like?

GitHub Actions provides detailed usage reports for the current and prior year. You can use this prior usage alongside the rate changes that will be introduced in January and March to estimate cost under the new pricing structure. We have created a Python script to help you leverage full usage reports to calculate your expected cost after the price updates.
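
GitHub's script isn't reproduced here, but a rough sketch of the same idea might look like the following. The CSV column names ("runner_type", "minutes") and the hosted per-minute rates are assumptions to be replaced with values from your own usage report and the pricing reference; only the $0.002 self-hosted charge comes from this announcement, and the report is assumed to cover private-repository usage only.

# Illustrative cost estimator, not GitHub's script. Column names and the
# hosted-runner rates below are assumptions; substitute the real values
# from your usage report and the Actions pricing reference.
import csv
from collections import defaultdict

SELF_HOSTED_RATE = 0.002   # $/min platform charge from March 1, 2026
NEW_HOSTED_RATES = {       # hypothetical $/min, platform charge included
    "linux-2-core": 0.0068,
    "windows-2-core": 0.0116,
}

def estimate_monthly_cost(report_path: str) -> float:
    minutes = defaultdict(float)
    with open(report_path, newline="") as f:
        for row in csv.DictReader(f):
            minutes[row["runner_type"]] += float(row["minutes"])

    total = 0.0
    for runner, mins in minutes.items():
        # Anything not in the hosted table is treated as a self-hosted runner.
        rate = NEW_HOSTED_RATES.get(runner, SELF_HOSTED_RATE)
        total += mins * rate
    return total

# Example: print(f"${estimate_monthly_cost('actions-usage.csv'):.2f}")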

We have also updated our Actions pricing calculator, making it easier to estimate your future costs, particularly if your historical usage is limited or not representative of expected future usage.

...

Read the original on resources.github.com »
