10 interesting stories served every morning and every evening.
Summary: A major brick-and-mortar store sold an Apple Gift Card that Apple seemingly took offence to, and locked out my entire Apple ID, effectively bricking my devices and my iCloud Account, Apple Developer ID, and everything associated with it, and I have no recourse. Can you help? Email paris AT paris.id.au (and read on for the details). ❤️
Update 14 December 2025: Someone from Executive Relations at Apple says they’re looking into it. I hope this is true. They say they’ll call me back tomorrow, on 15 December 2025. In the meantime, it’s been covered by Daring Fireball, Apple Insider, Michael Tsai, and others, thanks folks! I’ve received 100s of emails of support, and will reply to you all in time, thank you. Fingers crossed Apple calls back.
I am writing this as a desperate measure. After nearly 30 years as a loyal customer, authoring technical books on Apple’s own programming languages (Objective-C and Swift), and spending tens upon tens upon tens of thousands of dollars on devices, apps, conferences, and services, I have been locked out of my personal and professional digital life with no explanation and no recourse.
My Apple ID, which I have held for around 25 years (it was originally a username, before they had to be email addresses; it’s from the iTools era), has been permanently disabled. This isn’t just an email address; it is my core digital identity. It holds terabytes of family photos, my entire message history, and is the key to syncing my work across the ecosystem.
The Trigger: The only recent activity on my account was an attempt to redeem a $500 Apple Gift Card to pay for my 6TB iCloud+ storage plan. The code failed. The vendor suggested that the card number was likely compromised and agreed to reissue it. Shortly after, my account was locked. An Apple Support representative suggested that this was the cause of the issue, indicating that something was likely untoward about this card. The card was purchased from a major brick-and-mortar retailer (Australians, think Woolworths scale; Americans, think Walmart scale), so if I cannot rely on the provenance of that, and have no recourse, what am I meant to do? We have even sent the receipt, indicating the card’s serial number and purchase location, to Apple.
The Consequence: My account is flagged as “closed in accordance with the Apple Media Services Terms and Conditions”.

The Damage: I effectively have over $30,000 worth of previously-active “bricked” hardware. My iPhone, iPad, Watch, and Macs cannot sync, update, or function properly. I have lost access to thousands of dollars in purchased software and media. Apple representatives claim that only the “Media and Services” side of my account is blocked, but now my devices have signed me out of iMessage (and I can’t sign back in), and I can’t even sign out of the blocked iCloud account because… it’s barred from the sign-out API, as far as I can tell.

I can’t even log in to the “Secure File Transfer” system Apple uses to exchange information, because it relies on an Apple ID. Most of the ways Apple has suggested seeking help from them involve signing in to an Apple service to upload something, or communicating with them. This doesn’t work as the account is locked.
I can’t even download my iCloud Photos, as: there are repeated auth-errors on my account, so I can’t make Photos work; I don’t have a 6TB device to sync them to, even if I could.
No Information: Support staff refused to tell me why the account was banned or provide specific details on the decision.

No Escalation: When I begged for an escalation to Executive Customer Relations (ECR), noting that I would lose the ability to do my job and that my devices were useless, I was told that “an additional escalation won’t lead to a different outcome”.

Many of the reps I’ve spoken to have suggested strange things; one of the strangest was telling me that I could physically go to Apple’s Australian HQ at Level 3, 20 Martin Place, Sydney, and plead my case. They even put me on hold for 5 minutes while they looked up the address.
Most insultingly, the official advice from the Senior Advisor was to “create a new Apple account… and update the payment information”.
The Legal Catch: Apple’s Terms and Conditions rely on “Termination of Access.” By closing my account, they have revoked my license to use their services.

The Technical Trap: If I follow their advice and create a new account on my current devices (which are likely hardware-flagged due to the gift card error), the new account will likely be linked to the banned one and disabled for circumventing security measures.

The Developer Risk: As a professional Apple Developer, attempting to “dodge” a ban by creating a new ID could lead to my Developer Program membership being permanently blacklisted, amongst other things.
I am not a casual user. I have literally written the book on Apple development (taking over the Learning Cocoa with Objective-C series, which Apple themselves used to write, for O’Reilly Media, and then 20+ books following that). I help run the longest-running Apple developer event not run by Apple themselves, /dev/world. I have effectively been an evangelist for this company’s technology for my entire professional life. We had an app on the App Store on Day 1, in every sense of the word.
I am asking for a human at Apple to review this case. I suspect an automated fraud flag regarding the bad gift card triggered a nuclear response that frontline support cannot override. I have escalated this through my many friends in WWDR and SRE at Apple, with no success.
I am desperate to resolve this and restore my digital life. If you can help, please email paris AT paris.id.au
...
Read the original on hey.paris »
Gemini 3 Flash is our latest model with frontier intelligence built for speed that helps everyone learn, build, and plan anything — faster.
Senior Director, Product Management, on behalf of the Gemini team
Google is releasing Gemini 3 Flash, a fast and cost-effective model built for speed. You can now access Gemini 3 Flash through the Gemini app and AI Mode in Search. Developers can access it via the Gemini API in Google AI Studio, Google Antigravity, Gemini CLI, Android Studio, Vertex AI and Gemini Enterprise.
Summaries were generated by Google AI. Generative AI is experimental.
It’s great for coding, complex analysis, and quick answers in interactive apps.
Gemini 3 Flash is now the default model in the Gemini app and AI Mode in Search.
Developers and everyday users can access Gemini 3 Flash via various Google platforms.
...
Read the original on blog.google »
AWS CEO Matt Garman outlined 3 solid reasons why companies should not focus on cutting junior developer roles, noting that they “are actually the most experienced with the AI tools”.
In a tech world obsessed with AI replacing human workers, Matt Garman, CEO of Amazon Web Services (AWS), is pushing back against one of the industry’s most popular cost-cutting ideas.
Speaking on WIRED’s The Big Interview podcast, Garman has a bold message for companies racing to cut costs with AI.
He was asked to explain why he once called replacing junior employees with AI “one of the dumbest ideas” he’d ever heard, and to expand on how he believes agentic AI will actually change the workplace in the coming years.
First, junior employees are often better with AI tools than senior staff.
Fresh grads have grown up with new technology, so they can adapt quickly. Many of them learn AI-powered tools while studying or during internships. They tend to explore new features, find quick methods to write code, and figure out how to get the best results from AI agents.
According to the 2025 Stack Overflow Developer Survey, 55.5% of early-career developers reported using AI tools daily in their development process, higher than for the experienced folks.
This comfort with new tools allows them to work more efficiently. In contrast, senior developers have established workflows and may take more time to adopt new tools. Recent research shows that over half of Gen Z employees are actually helping senior colleagues upskill in AI.
Second, junior staff are usually the least expensive employees.
Junior employees usually get much less in salary and benefits, so removing them does not deliver huge savings. If a company is trying to save money, it doesn’t make that much financial sense.
So, when companies talk about increasing profit margins, junior employees should not be the default or only target. True optimization and real cost-cutting mean looking at the whole company, because there are plenty of other places where expenses can be trimmed.
In fact, 30% of companies that laid off workers expecting savings ended up increasing expenses, and many had to rehire later.
Think of a company like a sports team. If you only keep veteran players and never recruit rookies, what happens when those veterans retire? You are left with no one who knows how to play the game.
Also, hiring people straight out of college brings new ways of thinking into the workplace. They have fresh ideas shaped by the latest trends and the motivation to innovate.
More importantly, they form the foundation of a company’s future workforce. If a company decides to stop hiring junior employees altogether, it cuts off its own talent pipeline. Over time, that leads to fewer leaders to promote from within.
A Deloitte report also notes that the tech workforce is expected to grow at roughly twice the rate of the overall U.S. workforce, highlighting the demand for tech talent. Without a strong pipeline of junior developers coming in, companies might face a tech talent shortage.
When there are not enough junior hires being trained today, teams struggle to fill roles tomorrow, especially as projects scale.
This isn’t just corporate talk. As the leader of one of the world’s largest cloud computing platforms, serving everyone from Netflix to U.S. intelligence agencies, Garman has a front-row seat to how companies are actually using AI.
And what he is seeing makes him worried that short-term thinking could damage businesses for years to come. Garman’s point is grounded in long-term strategy. A company that relies solely on AI to handle tasks without training new talent could find itself short of people.
Still, Garman admits the next few years will be bumpy. “Your job is going to change,” he said. He believes AI will make both companies and their employees more productive.
When technology makes something easier, people want more of it. AI enables the creation of software faster, allowing companies to develop more products, enter new markets, and serve more customers.
Developers will be responsible for more than just writing code, with faster adaptation to new technologies becoming essential. But he has a hopeful message in the end.
That’s why Geoffrey Hinton has advised that Computer Science degrees remain essential. This directly supports Matt Garman’s point. Fresh talent with a strong understanding of core fundamentals becomes crucial for filling these higher-value roles of the future.
“I’m very confident in the medium to longer term that AI will definitely create more jobs than it removes at first,” Garman said.
...
Read the original on www.finalroundai.com »
It may be just me, but I read this as “I don’t want to 😜 😜 but I’ll kill AdBlockers in Firefox for buckerinos 😂”. This disappoints and saddens me a lot, and I hope I’m wrong.
...
Read the original on infosec.press »
Your local government might be discussing surveillance tech like Flock cameras, facial recognition, or automated license plate readers right now. This map helps you find those meetings and take action.
Why this matters: Municipalities across the US are quietly adopting surveillance technologies in rapidly growing numbers with over 80,000 cameras already out on the streets. These systems track residents’ movements, collect biometric data, and build massive databases of our daily lives.
alpr.watch scans meeting agendas for keywords like “flock,” “license plate reader,” “alpr,” and more. Each pin on the map shows where these conversations are happening so that you can make a difference.
Data before mid-December may be unverified. All future flags are 100% moderator approved.
Automated License Plate Recognition (ALPR) systems use cameras and artificial intelligence to capture, read, and store license plate data from every passing vehicle.
These systems work 24/7 creating a massive database of where vehicles, and by extension, people, travel. Every trip to the grocery store, doctor’s office, or place of worship gets recorded and stored.
Flock Safety is one of the largest manufacturers of ALPR cameras in the United States, marketing their systems to neighborhoods and law enforcement.
Flock cameras capture license plates, vehicle make/model, color, and other identifying features. This data is shared across a massive network of agencies and jurisdictions, creating a surveillance web that tracks millions of Americans.
History shows that surveillance systems expand beyond their original scope:
Systems marketed for “solving crimes” get used for immigration enforcement
These groups and individuals are leading the fight against mass surveillance. Consider supporting their work or getting involved locally.
...
Read the original on alpr.watch »
TL;DR: ty is an extremely fast Python type checker and
language server, written in Rust, and designed as an alternative to tools like mypy, Pyright, and Pylance.
Today, we’re announcing the Beta release of ty. We now use ty exclusively in our own projects and are ready to recommend it to motivated users for production use.
At Astral, we build high-performance developer tools for the Python ecosystem. We’re best known for
uv, our Python package manager, and
Ruff, our linter and formatter.
Today, we’re announcing the Beta release of the next tool in the Astral toolchain: ty, an
extremely fast Python type checker and language server, written in Rust.
ty was designed from the ground up to power a language server. The entire ty architecture is built around “incrementality”, enabling us to selectively re-run only the necessary computations when a user (e.g.) edits a file or modifies an individual function. This makes live updates extremely fast in the context of an editor or long-lived process.
You can install ty today with uv tool install ty@latest, or via our
VS Code extension.
Like Ruff and uv, ty’s implementation was grounded in some of our core product principles:
An obsessive focus on performance. Without caching, ty is consistently between 10x and 60x faster than mypy and Pyright. When run in an editor, the gap is even more dramatic. As an example, after editing a load-bearing file in the PyTorch repository, ty recomputes diagnostics in 4.7ms: 80x faster than Pyright (386ms) and 500x faster than Pyrefly (2.38 seconds). ty is very fast!
Correct, pragmatic, and ergonomic. With features like
first-class intersection types,
advanced type narrowing, and
sophisticated reachability analysis, ty pushes forward the state of the art in Python type checking, providing more accurate feedback and avoiding assumptions
about user intent that often lead to false positives. Our goal with ty is not only to build a faster type checker; we want to build a better type checker, and one that balances correctness with a deep focus on the end-user experience.
Built in the open. ty was built by our core team alongside dozens of active contributors under the MIT license, and the same goes for our
editor extensions. You can run ty anywhere that you write Python (including in the browser).
Even compared to other Rust-based language servers like Pyrefly, ty can run orders of magnitude faster when performing incremental updates on large projects.
ty also includes a
best-in-class diagnostic system, inspired by the Rust compiler’s own world-class error messages. A single ty diagnostic can pull in context from multiple files at once to explain not only what’s wrong, but why (and, often, how to fix it).
Diagnostic output is the primary user interface for a type checker; we prioritized our diagnostic system from the start (with both humans and agents in mind) and view it as a first-class feature in ty.
If you use VS Code, Cursor, or a similar editor, we recommend installing the
ty VS Code extension. The ty language server supports all the capabilities
that you’d expect for a modern language server (Go to Definition, Symbol Rename, Auto-Complete, Auto-Import, Semantic Syntax Highlighting, Inlay Hints, etc.), and runs in any editor that implements the Language Server Protocol.
Following the Beta release, our immediate priority is supporting early adopters. From there, we’re working towards a Stable release next year, with the gap between the
Beta and
Stable milestones largely focusing on: (1) stability and bug fixes, (2) completing the long tail of features in the
Python typing specification, and (3) first-class support for popular third-party libraries like Pydantic and
Django.
On a longer time horizon, though, ty will power semantic capabilities across the Astral toolchain: dead code elimination, unused dependency detection, SemVer-compatible upgrade enforcement, CVE reachability analysis, type-aware linting, and more (including some that are too ambitious to say out loud just yet).
We want to make Python the most productive programming ecosystem on Earth. Just as with
Ruff and uv, our commitment from here is that ty will get significantly better every week by working closely with our users. Thank you for building with us.
ty is the most sophisticated product we’ve built, and its design and implementation have surfaced some of the hardest technical problems we’ve seen at Astral. Working on ty requires a deep understanding of type theory, Python runtime semantics, and how the Python ecosystem actually uses Python.
I’d like to thank all those that contributed directly to the development of ty, including:
Douglas Creager, Alex Waygood,
David Peter, Micha Reiser,
Andrew Gallant, Aria Desires,
Carl Meyer, Zanie Blue,
Ibraheem Ahmed,
Dhruv Manilawala, Jack O’Connor,
Zsolt Dollenstein, Shunsuke Shibayama,
Matthew Mckee, Brent Westbrook,
UnboundVariable,
Shaygan Hooshyari, Justin Chapman,
InSync, Bhuminjay Soni,
Abhijeet Prasad Bodas,
Rasmus Nygren, lipefree,
Eric Mark Martin, Tomer Bin,
Luca Chiodini, Brandt Bucher,
Dylan Wilson, Eric Jolibois,
Felix Scherz, Leandro Braga,
Renkai Ge, Sumana Harihareswara,
Takayuki Maeda, Max Mynter,
med1844, William Woodruff,
Chandra Kiran G, DetachHead,
Emil Sadek, Jo,
Joren Hammudoglu, Mahmoud Saada,
Manuel Mendez, Mark Z. Ding,
Simon Lamon, Suneet Tipirneni,
Francesco Giacometti,
Adam Aaronson, Alperen Keleş,
charliecloudberry,
Dan Parizher, Daniel Hollas,
David Sherret, Dmitry,
Eric Botti, Erudit Morina,
François-Guillaume Fernandez,
Fabrizio Damicelli,
Guillaume-Fgt, Hugo van Kemenade,
Josiah Kane, Loïc Riegel,
Ramil Aleskerov, Samuel Rigaud,
Soof Golan, Usul-Dev,
decorator-factory, omahs,
wangxiaolei, cake-monotone,
slyces, Chris Krycho,
Mike Perlov, Raphael Gaschignard,
Connor Skees, Aditya Pillai,
Lexxxzy, haarisr,
Joey Bar, Andrii Turov,
Kalmaegi, Trevor Manz,
Teodoro Freund, Hugo Polloli,
Nathaniel Roman, Victor Hugo Gomes,
Nuri Jung, Ivan Yakushev,
Hamir Mahal, Denys Zhak,
Daniel Kongsgaard,
Emily B. Zhang, Ben Bar-Or,
Aleksei Latyshev,
Aditya Pratap Singh, wooly18,
Samodya Abeysiriwardane, and
Pepe Navarro.
We’d also like to thank the Salsa team (especially
Niko Matsakis, David Barsky, and Lukas Wirth) for their support and collaboration; the
Elixir team (especially
José Valim, Giuseppe Castagna, and
Guillaume Duboc), whose work strongly influenced our approach to gradual types and intersections; and a few members of the broader Python typing community:
Eric Traut, Jelle Zijlstra,
Jia Chen, Sam Goldman,
Shantanu Jain, and Steven Troxler.
...
Read the original on astral.sh »
Much has been said about the effects that AI will have on software development, but there is an angle I haven’t seen talked about: I believe that AI will bring formal verification, which for decades has been a bit of a fringe pursuit, into the software engineering mainstream.
Proof assistants and proof-oriented programming languages such as Rocq,
Isabelle, Lean,
F*, and Agda have been around for a long time. They make it possible to write a formal specification that some piece of code is supposed to satisfy, and then mathematically prove that the code always satisfies that spec (even on weird edge cases that you didn’t think of testing). These tools have been used to develop some large formally verified software systems, such as an operating system kernel, a C compiler, and a
cryptographic protocol stack.
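To make that workflow concrete, here is a toy example in Lean 4 (the names are ours, purely illustrative): a tiny implementation together with a machine-checked proof that it satisfies a simple specification for every possible input, not just the cases someone thought to test.

def clampToLimit (x limit : Nat) : Nat :=
  if x ≤ limit then x else limit

-- Specification: the result never exceeds the limit.
theorem clampToLimit_le (x limit : Nat) : clampToLimit x limit ≤ limit := by
  unfold clampToLimit
  split
  · assumption               -- case x ≤ limit: the result is x itself
  · exact Nat.le_refl limit  -- case ¬ x ≤ limit: the result is limit

Real verification targets like the systems mentioned above involve vastly larger specifications and proof developments, but the basic loop is the same: state a property, and let the machine check the proof.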
At present, formal verification is mostly used by research projects, and it is
uncommon for industrial software engineers to use formal methods (even those working on classic high-assurance software such as medical devices and aircraft). The reason is that writing those proofs is both very difficult (requiring PhD-level training) and very laborious.
For example, as of 2009, the formally verified seL4 microkernel consisted of 8,700 lines of C code, but proving it correct required 20 person-years and
200,000 lines of Isabelle code — or 23 lines of proof and half a person-day for every single line of implementation. Moreover, there are maybe a few hundred people in the world (wild guess) who know how to write such proofs, since it requires a lot of arcane knowledge about the proof system.
To put it in simple economic terms: for most systems, the expected cost of bugs is lower than the expected cost of using the proof techniques that would eliminate those bugs. Part of the reason is perhaps that bugs are a negative externality: it’s not the software developer who bears the cost of the bugs, but the users. But even if the software developer were to bear the cost, formal verification is simply very hard and expensive.
At least, that was the case until recently. Now, LLM-based coding assistants are getting pretty good not only at writing implementation code, but also at
writing
proof scripts in
various languages. At present, a human with specialist expertise still has to guide the process, but it’s not hard to extrapolate and imagine that process becoming fully automated in the next few years. And when that happens, it will totally change the economics of formal verification.
If formal verification becomes vastly cheaper, then we can afford to verify much more software. But on top of that, AI also creates a need to formally verify more software: rather than having humans review AI-generated code, I’d much rather have the AI prove to me that the code it has generated is correct. If it can do that, I’ll take AI-generated code over handcrafted code (with all its artisanal bugs) any day!
In fact, I would argue that writing proof scripts is one of the best applications for LLMs. It doesn’t matter if they hallucinate nonsense, because the proof checker will reject any invalid proof and force the AI agent to retry. The proof checker is a small amount of code that is itself verified, making it virtually impossible to sneak an invalid proof past the checker.
That doesn’t mean software will suddenly be bug-free. As the verification process itself becomes automated, the challenge will move to correctly defining the specification: that is, how do you know that the properties that were proved are actually the properties that you cared about? Reading and writing such formal specifications still requires expertise and careful thought. But writing the spec is vastly easier and quicker than writing the proof by hand, so this is progress.
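A contrived Lean 4 sketch of that failure mode (again with made-up names): the “sort” below provably returns a sorted list, yet the theorem is worthless because the specification forgot to require that the output is a permutation of the input.

def fakeSort (_xs : List Nat) : List Nat := []

-- The proof goes through and the checker is satisfied, but the property
-- proved is not the property we actually cared about.
theorem fakeSort_sorted (xs : List Nat) : List.Pairwise (· ≤ ·) (fakeSort xs) := by
  unfold fakeSort
  exact List.Pairwise.nil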
I could also imagine AI agents helping with the process of writing the specifications, translating between formal language and natural language. Here there is the potential for subtleties to be lost in translation, but this seems like a manageable risk.
I find it exciting to think that we could just specify in a high-level, declarative way the properties that we want some piece of code to have, and then to vibe code the implementation along with a proof that it satisfies the specification. That would totally change the nature of software development: we wouldn’t even need to bother looking at the AI-generated code any more, just like we don’t bother looking at the machine code generated by a compiler.
In summary: 1. formal verification is about to become vastly cheaper; 2. AI-generated code needs formal verification so that we can skip human review and still be sure that it works; 3. the precision of formal verification counteracts the imprecise and probabilistic nature of LLMs. These three things taken together mean formal verification is likely to go mainstream in the foreseeable future. I suspect that soon the limiting factor will not be the technology, but the culture change required for people to realise that formal methods have become viable in practice.
...
Read the original on martin.kleppmann.com »
A few weeks ago, I was wrestling with a major life decision. Like I’ve grown used to doing, I opened Claude and started thinking out loud: laying out the options, weighing the tradeoffs, asking for perspective.
Midway through the conversation, I paused. I realized how much I’d shared. Not just this decision, but months of conversations: personal dilemmas, health questions, financial details, work frustrations, things I hadn’t told anyone else. I’d developed a level of candor with my AI assistant that I don’t have with most people in my life.
And then an uncomfortable thought: what if someone was reading all of this?
The thought didn’t let go. As a security researcher, I have the tools to answer that question.
We asked Wings, our agentic-AI risk engine, to scan for browser extensions with the capability to read and exfiltrate conversations from AI chat platforms. We expected to find a handful of obscure extensions: low install counts, sketchy publishers, the usual suspects.
The results came back with something else entirely.
Near the top of the list: Urban VPN Proxy. A Chrome extension with over 6 million users. A 4.7-star rating from 58,000 reviews. A “Featured” badge from Google, meaning it had passed manual review and met what Google describes as “a high standard of user experience and design.”
A free VPN promising privacy and security. Exactly the kind of tool someone installs when they want to protect themselves online.
We decided to look closer.
For each platform, the extension includes a dedicated “executor” script designed to intercept and capture conversations. The harvesting is enabled by default through hardcoded flags in the extension’s configuration.
There is no user-facing toggle to disable this. The only way to stop the data collection is to uninstall the extension entirely.
The data collection operates independently of the VPN functionality. Whether the VPN is connected or not, the harvesting runs continuously in the background.
The extension monitors your browser tabs. When you visit any of the targeted AI platforms (ChatGPT, Claude, Gemini, etc.), it injects an “executor” script directly into the page. Each platform has its own dedicated script - chatgpt.js, claude.js, gemini.js, and so on.
Once injected, the script overrides fetch() and XMLHttpRequest - the fundamental browser APIs that handle all network requests. This is an aggressive technique. The script wraps the original functions so that every network request and response on that page passes through the extension’s code first.
This means when Claude sends you a response, or when you submit a prompt to ChatGPT, the extension sees the raw API traffic before your browser even renders it.
The injected script parses the intercepted API responses to extract conversation data - your prompts, the AI’s responses, timestamps, conversation IDs. This data is packaged and sent via window.postMessage to the extension’s content script, tagged with the identifier PANELOS_MESSAGE.
The content script forwards the data to the extension’s background service worker, which handles the actual exfiltration. The data is compressed and transmitted to Urban VPN’s servers at endpoints including analytics.urban-vpn.com and stats.urban-vpn.com.
* Every prompt you send to the AI
* The specific AI platform and model used
The AI conversation harvesting wasn’t always there. Based on our analysis:
* July 2025 - Present: All user conversations with targeted AI platforms captured and exfiltrated
Chrome and Edge extensions auto-update by default. Users who installed Urban VPN for its stated purpose - VPN functionality - woke up one day with new code silently harvesting their AI conversations.
Anyone who used ChatGPT, Claude, Gemini, or the other targeted platforms while Urban VPN was installed after July 9, 2025 should assume those conversations are now on Urban VPN’s servers and have been shared with third parties. Medical questions, financial details, proprietary code, personal dilemmas - all of it, sold for “marketing analytics purposes.”
“Advanced VPN Protection - Our VPN provides added security features to help shield your browsing experience from phishing attempts, malware, intrusive ads and AI protection which checks prompts for personal data (like an email or phone number), checks AI chat responses for suspicious or unsafe links and displays a warning before click or submit your prompt.”
The framing suggests the AI monitoring exists to protect you: checking for sensitive data you might accidentally share, warning you about suspicious links in responses.
The code tells a different story. The data collection and the “protection” notifications operate independently. Enabling or disabling the warning feature has no effect on whether your conversations are captured and exfiltrated. The extension harvests everything regardless.
The protection feature shows occasional warnings about sharing sensitive data with AI companies. The harvesting feature sends that exact sensitive data - and everything else - to Urban VPN’s own servers, where it’s sold to advertisers. The extension warns you about sharing your email with ChatGPT while simultaneously exfiltrating your entire conversation to a data broker.
After documenting Urban VPN Proxy’s behavior, we checked whether the same code existed elsewhere.
It did. The identical AI harvesting functionality appears in seven other extensions from the same publisher, across both Chrome and Edge:
The extensions span different product categories (a VPN, an ad blocker, a “browser guard” security tool), but share the same surveillance backend. Users installing an ad blocker have no reason to expect their Claude conversations are being harvested.
All of these extensions carry “Featured” badges from their respective stores, except Urban Ad Blocker for Edge. These badges signal to users that the extensions have been reviewed and meet platform quality standards. For many users, a Featured badge is the difference between installing an extension and passing it by - it’s an implicit endorsement from Google and Microsoft.
Urban VPN is operated by Urban Cyber Security Inc., which is affiliated with BiScience (B. I Science (2009) Ltd.), a data broker company.
This company has been on researchers’ radar before. Security researchers such as John Tuckner at Secure Annex have previously documented BiScience’s data collection practices. Their research established that:
* The company provides an SDK to third-party extension developers to collect and sell user data
* BiScience sells this data through products like AdClarity and Clickstream OS
Our finding represents an expansion of this operation. BiScience has moved from collecting browsing history to harvesting complete AI conversations, a significantly more sensitive category of data.
“We share the Web Browsing Data with our affiliated company… BiScience that uses this raw data and creates insights which are commercially used and shared with Business Partners”
To be fair, Urban VPN does disclose some of this, if you know where to look.
The consent prompt (shown during extension setup) mentions that the extension processes “ChatAI communication” along with “pages you visit” and “security signals.” It states this is done “to provide these protections.”
The privacy policy goes further, buried deep in the document:
“AI Inputs and Outputs. As part of the Browsing Data, we will collect the prompts and outputs queried by the End-User or generated by the AI chat provider, as applicable.”
“We also disclose the AI prompts for marketing analytics purposes.”
However, the Chrome Web Store listing, the place where users actually decide whether to install, shows a different picture:
“This developer declares that your data is Not being sold to third parties, outside of the approved use cases”
The listing mentions the extension handles “Web history” and “Website content.” It says nothing about AI conversations specifically.
The consent prompt frames AI monitoring as protective. The privacy policy reveals the data is sold for marketing.
The store listing says data isn’t sold to third parties. The privacy policy describes sharing with BiScience, “Business Partners,” and use for “marketing analytics.”
Users who installed before July 2025 never saw the updated consent prompt; the AI harvesting was added via silent update in version 5.5.0.
Even users who see the consent prompt have no granular control. You can’t accept the VPN but decline the AI harvesting. It’s all or nothing.
Nothing indicates to users that the data collection continues even when the VPN is disconnected and the AI protection feature is turned off. The harvesting runs silently in the background regardless of what features the user has enabled.
Urban VPN Proxy carries Google’s “Featured” badge on the Chrome Web Store. According to Google’s documentation:
“Featured extensions follow our technical best practices and meet a high standard of user experience and design.”
“Before it receives a Featured badge, the Chrome Web Store team must review each extension.”
This means a human at Google reviewed Urban VPN Proxy and concluded it met their standards. Either the review didn’t examine the code that harvests conversations from Google’s own AI product (Gemini), or it did and didn’t consider this a problem.
The Chrome Web Store’s Limited Use policy explicitly prohibits “transferring or selling user data to third parties like advertising platforms, data brokers, or other information resellers.” BiScience is, by its own description, a data broker.
The extension remains live and featured as of this writing.
Browser extensions occupy a unique position of trust. They run in the background, have broad access to your browsing activity, and auto-update without asking. When an extension promises privacy and security, users have little reason to suspect it’s doing the opposite.
What makes this case notable isn’t just the scale - 8 million users - or the sensitivity of the data - complete AI conversations. It’s that these extensions passed review, earned Featured badges, and remained live for months while harvesting some of the most personal data users generate online. The marketplaces designed to protect users instead gave these extensions their stamp of approval.
If you have any of these extensions installed, uninstall them now. Assume any AI conversations you’ve had since July 2025 have been captured and shared with third parties.
This writeup was authored by the research team at Koi.
We built Koi to detect exactly these kinds of threats - extensions that slip past marketplace reviews and quietly exfiltrate sensitive data. Our risk engine, Wings, continuously monitors browser extensions to catch threats before they reach your team.
Book a demo to see how behavioral analysis catches what static review misses.
...
Read the original on www.koi.ai »
The complexity of graphics APIs, shader frameworks and drivers has increased rapidly during the past decades. The pipeline state object (PSO) explosion has gotten out of hand. How did we end up with 100GB local shader pipeline caches and massive cloud servers to host them? It’s time to start discussing how to cut down the abstractions and the API surface to simplify how we interact with the GPU.

This blog post includes lots of low level hardware details. When writing this post I used the “GPT5 Thinking” AI model to cross-reference public Linux open source drivers to confirm my knowledge and to ensure no NDA information is present in this blog post. Sources: AMD RDNA ISA documents and GPUOpen, Nvidia PTX ISA documents, Intel PRM, Linux open source GPU drivers (Mesa, Freedreno, Turnip, Asahi) and vendor optimization guides/presentations. The blog post has been screened by several industry insiders before the public release.

Ten years ago, a significant shift occurred in real-time computer graphics with the introduction of new low-level PC graphics APIs. AMD had won both Xbox One (2013) and Playstation 4 (2013) contracts. Their new Graphics Core Next (GCN) architecture became the de-facto lead development platform for AAA games. PC graphics APIs at that time, DirectX 11 and OpenGL 4.5, had heavy driver overhead and were designed for single threaded rendering. AAA developers demanded higher performance APIs for PC. DICE joined with AMD to create a low level AMD GCN specific API for the PC called Mantle. As a response, Microsoft, Khronos and Apple started developing their own low-level APIs: DirectX 12, Vulkan and Metal were born.

The initial reception of these new low-level APIs was mixed. Synthetic benchmarks and demos showed substantial performance increases, but performance gains couldn’t be seen in major game engines such as Unreal and Unity. At Ubisoft, our teams noticed that porting existing DirectX 11 renderers to DirectX 12 often resulted in performance regression. Something wasn’t right.

Existing high-level APIs featured minimal persistent state, with fine-grained state setters and individual data inputs bound to the shader just prior to draw call submission. New low-level APIs aimed to make draw calls cheaper by ahead-of-time bundling shader pipeline state and bindings into persistent objects. GPU architectures were highly heterogeneous back in the day. Doing the data remapping, validation, and uploading ahead of time was a big gain. However, the rendering hardware interfaces (RHI) of existing game engines were designed for fine grained immediate mode rendering, while the new low-level APIs required bundling data in persistent objects.

To address this incompatibility, a new low-level graphics remapping layer grew beneath the RHI. This layer assumed the complexity previously handled by the OpenGL and DirectX 11 graphics drivers, tracking resources and managing mappings between the fine-grained dynamic user-land state and the persistent low-level GPU state. Graphics programmers started specializing into two distinct roles: low-level graphics programmers, who focused on the new low-level “driver” layer and the RHI, and high-level graphics programmers, who built visual graphics algorithms on top of the RHI. Visual programming was also getting more complex due to physically based lighting models, compute shaders and later ray-tracing. DirectX 12, Vulkan, and Metal are often referred to as “modern APIs”. These APIs are now 10 years old.
They were initially designed to support GPUs that are now 13 years old, an incredibly long time in GPU history. Older GPU architectures were optimized for traditional vertex and pixel shader tasks rather than the compute-intensive generic workloads prevalent today. They had vendor specific binding models and data paths. Hardware differences had to be wrapped under the same API. Ahead-of-time created persistent objects were crucial in offloading the mapping, uploading, validation and binding costs.

In contrast, the console APIs and Mantle were exclusively designed for AMD’s GCN architecture, a forward-thinking design for its time. GCN boasted a comprehensive read/write cache hierarchy and scalar registers capable of storing texture and buffer descriptors, effectively treating everything as memory. No complex API for remapping the data was required, and significantly less ahead-of-time work was needed. The console APIs and Mantle had much less API complexity due to targeting a single modern GPU architecture.

A decade has passed, and GPUs have undergone a significant evolution. All modern GPU architectures now feature complete cache hierarchies with coherent last-level caches. CPUs can write directly to GPU memory using PCIe ReBAR or UMA, and 64-bit GPU pointers are directly supported in shaders. Texture samplers are bindless, eliminating the need for a CPU driver to configure the descriptor bindings. Texture descriptors can be directly stored in arrays within the GPU memory (often called descriptor heaps). If we were to design an API tailored for modern GPUs today, it wouldn’t need most of these persistent “retained mode” objects. The compromises that DirectX 12.0, Metal 1 and Vulkan 1.0 had to make are not needed anymore. We could simplify the API drastically.

The past decade has revealed the weaknesses of the modern APIs. The PSO permutation explosion is the biggest problem we need to solve. Vendors (Valve, Nvidia, etc) have massive cloud servers storing terabytes of PSOs for each different architecture/driver combination. A user’s local PSO cache can exceed 100GB. No wonder gamers are complaining that loading takes ages and stutter is all over the place.

The history of GPUs and APIs

Before we talk about stripping the API surface, we need to understand why graphics APIs were historically designed this way. OpenGL wasn’t intentionally slow, nor was Vulkan intentionally complex. 10-20 years ago GPU hardware was highly diverse and undergoing rapid evolution. Designing a cross-platform API for such a diverse set of hardware required compromises.

Let’s start with a classic: The 3dFX Voodoo 2 12MB (1998) was a three chip design: a single rasterizer chip connected to a 4MB framebuffer memory and two texture sampling chips, each connected to their own 4MB texture memory. There was no geometry pipeline and no programmable shaders. The CPU sent pre-transformed triangle vertices to the rasterizer. The rasterizer had a configurable blending equation to control how the vertex colors and the two texture sampler results were combined together. Texture samplers could not read each other’s memory or the framebuffer. Thus there was no support for multiple render passes. Since the hardware was incapable of window composition, it had a loopback cable to connect your dedicated 2d video card. 3d rendering only worked in exclusive fullscreen mode. A 3d graphics card was a highly specialized piece of hardware, with little in common with the current GPUs and their massive programmable SIMD arrays.
Hardware of this era had a massive impact on DirectX (1995) and OpenGL (1992) design. Backwards compatibility played a huge role. APIs improved iteratively. These 30 year old API designs still impact the way we write software today.
3dFX Voodoo 2 12MB (1998): Individual processors and traces between them and their own memory chips (four 1MB chips for each processor) are clearly visible. Image © TechPowerUp.
Nvidia’s GeForce 256 coined the term GPU. It had a geometry processor in addition to the rasterizer. The geometry processor, rasterizer and texture sampling units were all integrated in the same die and shared memory. DirectX 7 introduced two new concepts: render target textures and uniform constants. Multipass rendering meant that texture samplers could read the rasterizer output, invalidating the 3dFX Voodoo 2 separate memory design.

The geometry processor API featured uniform data inputs for transform matrices (float4x4), light positions, and colors (float4). GPU implementations varied among manufacturers, many opting to embed a small constant memory block within the geometry engine. But this wasn’t the only way to do it. In the OpenGL API each shader had its own persistent uniforms. This design enabled the driver to embed constants directly in the shader’s instruction stream, an API peculiarity that still persists in OpenGL 4.6 and ES 3.2 today.

GPUs back then didn’t have generic read & write caches. The rasterizer had a screen-local cache for blending and depth buffering, and texture samplers leaned on linearly interpolated vertex UVs for data prefetching. When shaders were introduced in DirectX 8 shader model 1.0 (SM 1.0), the pixel shader stage didn’t support calculating texture UVs. UVs were calculated at vertex granularity, interpolated by the hardware and passed directly to the texture samplers. DirectX 9 brought a substantial increase in shader instruction limits, but shader model 2.0 didn’t expose any new data paths. Both vertex and pixel shaders still operated as 1:1 input:output machines, allowing users to only customize the transform math of the vertex position and attributes and the pixel color. Programmable load and store were not supported. The fixed-function input blocks persisted: vertex fetch, uniform (constant) memory and texture sampler. The vertex shader was a separate execution unit. It gained new features like the ability to index constants (limited to float4 arrays) but still lacked texture sampling support.

DirectX 9 shader model 3.0 increased the instruction limit to 65536, making it difficult for humans to write and maintain shader assembly anymore. Higher level shading languages were born: HLSL (2002) and GLSL (2002-2004). These languages adopted the 1:1 elementwise transform design. Each shader invocation operated on a single data element: vertex or pixel. Framework-style shader design heavily affected the graphics API design in the following years. It was a nice way to abstract hardware differences back in the day, but is showing scaling pains today.

DirectX 11 was a significant shift in the data model, introducing support for compute shaders, generic read-write buffers and indirect drawing. The GPU could now fully feed itself. The inclusion of generic buffers enabled shader programs to access and modify programmable memory locations, which forced hardware vendors to implement generic cache hierarchies. Shaders evolved beyond simple 1:1 data transformations, marking the end of specialized, hardcoded data paths. GPU hardware started to shift towards a generic SIMD design. SIMD units were now executing all the different shader types: vertex, pixel, geometry, hull, domain and compute. Today the framework has 16 different shader entry points. This adds a lot of API surface and makes composition difficult.
As a result, GLSL and HLSL still don’t have a flourishing library ecosystem.

DirectX 11 featured a whole zoo of buffer types, each designed to accommodate specific hardware data path peculiarities: typed SRV & UAV, byte address SRV & UAV, structured SRV & UAV, append & consume (with counter), constant, vertex, and index buffers. Like textures, buffers in DirectX utilize an opaque descriptor. Descriptors are hardware specific (commonly 128-256 bit) data blobs encoding the size, format, properties and data address of the resource in GPU memory. DirectX 11 GPUs leveraged their texture samplers for buffer load (gather) operations. This was natural since the sampler already had type conversion hardware and a small read-only data cache. Typed buffers supported the same formats as textures, and DirectX used the same SRV (shader resource view) abstraction for both textures and buffers.

The use of opaque buffer descriptors meant that the buffer format was not known at shader compile time. This was fine for read-only buffers as they were handled by the texture sampler. The read-write buffer (UAV in DirectX) was initially limited to 32-bit and 128-bit (vec4) types. Subsequent API and hardware revisions gradually addressed typed UAV load limitations, but the core issues persisted: a descriptor requires an indirection (contains a pointer), compiler optimizations are limited (data type is known only at runtime), format conversion hardware introduces latency (vs raw L1$ load), expand-at-load reserves registers for a longer time (vs expand-at-use), descriptor management adds CPU driver complexity, and the API is complex (ten different buffer types).

In DirectX 11 the structured buffers were the only buffer type allowing a user-defined struct type. All other buffer types represented a homogeneous array of simple scalar/vector elements. Unfortunately, structured buffers were not layout compatible with other buffer types. Users were not allowed to have structured buffer views to typed buffers, byte address buffers, or vertex/index buffers. The reason was that structured buffers had a special AoSoA swizzle optimization under the hood, which was important for older vec4 architectures. This hardware specific optimization limited the structured buffer usability.

DirectX 12 made all buffers linear in memory, making them compatible with each other. SM 6.2 also added load syntactic sugar for the byte address buffer, allowing clean struct loading syntax from an arbitrary offset. All the old buffer types are still supported for backwards compatibility reasons and all the buffers still use opaque descriptors. HLSL is still missing support for 64-bit GPU pointers. In contrast, the Nvidia CUDA computing platform (2007) fully leaned on 64-bit pointers, but its popularity was initially limited to academic use. Today it is the leading AI platform and is heavily affecting the hardware design.

Support for 16-bit registers and 16-bit math was disorganized when DirectX 12 launched. Microsoft initially made a questionable decision to not backport DirectX 12 to Windows 7. Shader binaries targeting Windows 8 supported 16-bit types, but most gamers continued using Windows 7. Developers didn’t want to ship two sets of shaders. The OpenGL lowp/mediump specification was also messy. Bit depths were not properly standardized. Mediump was a popular optimization in mobile games, but most PC drivers ignored it, making game developers’ lives miserable.
AAA games mostly ignored 16-bit math until PS4 Pro launched in 2016 with double rate fp16 support.

With the rise of AI, ray-tracing, and GPU-driven rendering, GPU vendors started focusing on optimizing their raw data load paths and providing larger and faster generic caches. Routing loads through the texture sampler (type conversion) added too much latency, as dependent load chains are common in modern shaders. Hardware got native support for narrow 8-bit, 16-bit, and 64-bit types and pointers.

Most vendors ditched their fixed function vertex fetch hardware, emitting standard raw load instructions in the vertex shader instead. Fully programmable vertex fetch allowed developers to write new algorithms such as clustered GPU-driven rendering. The fixed function hardware transistor budget could be used elsewhere.

Mesh shaders represent the culmination of rasterizer evolution, eliminating the need for index deduplication hardware and post-transform caches. In this paradigm, all inputs are treated as raw memory. The user is responsible for dividing the mesh into self-contained meshlets that internally share vertices. This process is often done offline. The GPU no longer needs to do parallel index deduplication for each draw call, saving power and transistors. Given that gaming accounts for only 10% of Nvidia’s revenue today, while AI represents 90% and ray-tracing continues to grow, it is likely only a matter of time before the fixed function geometry hardware is stripped to the bare minimum and drivers automatically convert vertex shaders to mesh shaders.

Mobile GPUs are tile-based renderers. Tilers bin the individual triangles to small tiles (commonly between 16x16 and 64x64 pixels). Mesh shaders are too coarse grained for this purpose. Binning meshlets to tiny tiles would cause significant geometry overshading. There’s no clear convergence path. We still need to support the vertex shader path.

10 years ago, when DirectX 12.0, Vulkan 1.0 and Metal 1.0 arrived, the existing GPU hardware didn’t widely support bindless resources. APIs adopted complex binding models to abstract the hardware differences. DirectX allowed indexing up to 128 resources per stage; Vulkan and Metal didn’t initially support descriptor indexing at all. Developers had to continue using traditional workarounds to reduce the bindings change overhead, such as packing textures into atlases and merging meshes together. The GPU hardware has evolved significantly during the past decade and converged to a generic bindless SIMD design.

Let’s investigate how much simpler the graphics API and the shader language would become if we designed them solely for modern bindless hardware.

Let’s start our journey discussing memory management. Legacy graphics APIs abstracted the GPU memory management completely. Abstraction was necessary, as old GPUs had split memories and/or special data paths with various cache coherency concerns. When DirectX 12 and Vulkan arrived 10 years ago, the GPU hardware had matured enough to expose placement heaps to the user. Consoles had already exposed memory for a few generations and developers requested similar flexibility for PC and mobile. Apple introduced placement heaps 4 years after Vulkan and DirectX 12 in Metal 2.

Modern APIs require the user to enumerate the heap types to find out what kind of memory the GPU driver has to offer. It’s a good practice to preallocate memory in big chunks and suballocate it using a user-land allocator.
However, there’s a design flaw in Vulkan: You have to create your texture/buffer object first. Then you can ask which heap types are compatible with the new resource. This forces the user into a lazy allocation pattern, which can cause performance hitches and memory spikes at runtime. This also makes it difficult to wrap a GPU memory allocation into a cross-platform library. AMD VMA, for example, creates the Vulkan-specific buffer/texture object in addition to allocating memory. We want to fully separate these concerns.

Today the CPU has full visibility into the GPU memory. Integrated GPUs have UMA, and modern discrete GPUs have PCIe Resizable BAR. The whole GPU heap can be mapped. The Vulkan heap API naturally supports CPU mapped GPU heaps. DirectX 12 got support in 2023 (HEAP_TYPE_GPU_UPLOAD).

CUDA has a simple design for GPU memory allocation: The GPU malloc API takes the size as input and returns a mapped CPU pointer. The GPU free API frees the memory. CUDA doesn’t support CPU mapped GPU memory. The GPU reads the CPU memory through the PCIe bus. CUDA also supports GPU memory allocations, but they can’t be directly written by the CPU.

We combine the CUDA malloc design with CPU mapped GPU memory (UMA/ReBAR). It’s the best of both worlds: The data is fast for the CPU to write and fast for the GPU to read, yet we maintain the clean, easy to use design.
The default gpuMalloc alignment is 16 bytes (vec4 alignment). If you need wider alignment, use the gpuMalloc(size, alignment) overload. My example code uses gpuMalloc.

Writing data directly into GPU memory is optimal for small data like draw arguments, uniforms and descriptors. For large persistent data, we still want to perform a copy operation. GPUs store textures in a swizzled layout similar to Morton-order to improve cache locality. DirectX 11.3 and 12 tried to standardize the swizzle layout, but couldn’t get all GPU manufacturers onboard. The common way to perform texture swizzling is to use a driver provided copy command. The copy command reads linear texture data from a CPU mapped “upload” heap and writes to a swizzled layout in a private GPU heap. Every modern GPU also has lossless delta color compression (DCC). Modern GPUs’ copy engines are capable of DCC compression and decompression. DCC and Morton swizzle are the main reasons we want to copy textures into a private GPU heap. Recently, GPUs have also added generic lossless memory compression for buffer data. If the memory heap is CPU mapped, the GPU can’t enable vendor specific lossless compression, as the CPU wouldn’t know how to read or write it. A copy command must be used to compress the data.

We need a memory type parameter in the GPU malloc function to add support for private GPU memory. The standard memory type should be CPU mapped GPU memory (write combined CPU access). It is fast for the GPU to read, and the CPU can directly write to it as if it were a CPU memory pointer. GPU-only memory is used for textures and big GPU-only buffers. The CPU can’t directly write to these GPU pointers. The user writes the data to CPU mapped GPU memory first and then issues a copy command, which transforms the data to an optimal compressed format. Modern texture samplers and display engines can read compressed GPU data directly, so there’s no need for subsequent data layout transforms (see chapter: Modern barriers). The uploaded data is ready to use immediately.

We have two types of GPU pointers: a CPU mapped virtual address and a GPU virtual address. The GPU can only dereference GPU addresses. All pointers in GPU data structures must use GPU addresses. CPU mapped addresses are only used for CPU writes. CUDA has an API to transform a CPU mapped address to a GPU address (cudaHostGetDevicePointer). The Metal 4 buffer object has two getters: .contents (CPU mapped address) and .gpuAddress (GPU address). Since the gpuMalloc API returns a pointer, not a managed object handle (like Metal), we choose the CUDA approach (gpuHostToDevicePointer). This API call is not free. The driver likely implements it using a hash map (if addresses other than base addresses need to be translated, we need a tree). Preferably we call the address translation once per allocation and cache the result in a user-land struct (void *cpu, void *gpu). This is the approach my userland GPUBumpAllocator uses (see appendix for full implementation).
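Spelled out as plain declarations, the whole allocation interface discussed above is tiny. This is only a sketch: gpuMalloc, gpuFree, gpuHostToDevicePointer and MEMORY_GPU are the names used in the text, while the exact signatures and the memory type enum are illustrative assumptions rather than a finished specification.

#include <cstddef>

// Illustrative memory types: CPU-mapped GPU memory (the default) and private GPU-only
// memory filled via copy commands. (A CPU-cached readback type is discussed later.)
enum GpuMemoryType { MEMORY_MAPPED = 0, MEMORY_GPU };

void* gpuMalloc(size_t sizeInBytes);                     // CPU-mapped GPU memory, 16 byte (vec4) aligned
void* gpuMalloc(size_t sizeInBytes, size_t alignment);   // overload for wider alignment
void* gpuMalloc(size_t sizeInBytes, GpuMemoryType type); // MEMORY_GPU: private memory, written via copy commands
void  gpuFree(void* allocation);
void* gpuHostToDevicePointer(void* cpuMappedAddress);    // translate a CPU-mapped address to a GPU address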
// Load a mesh using a 3rd party library
auto mesh = createMesh("mesh.obj");
auto upload = uploadBumpAllocator.allocate(mesh.byteSize); // Custom bump allocator (wraps a gpuMalloc ptr)
mesh.load(upload.cpu);
// Allocate GPU-only memory and copy into it
void* meshGpu = gpuMalloc(mesh.byteSize, MEMORY_GPU);
gpuMemCpy(commandBuffer, meshGpu, upload.gpu);
Vulkan recently got a new extension called VK_EXT_host_image_copy. The driver implements a direct CPU-to-GPU image copy operation, performing the hardware-specific texture swizzle on the CPU. This extension is currently only available on UMA architectures, but there's no technical reason why it couldn't be available with PCIe ReBAR as well. Unfortunately this API doesn't support DCC; it would be too expensive to perform DCC compression on the CPU. The extension is mainly useful for block-compressed textures, as they don't require DCC. It can't universally replace the hardware copy to GPU private memory.

There's also a need for a third memory type, CPU-cached, for readback purposes. This memory type is slower for the GPU to write due to cache coherency with the CPU. Games use readback only rarely; common use cases are screenshots and virtual texturing readback. GPGPU algorithms such as AI training and inference lean on efficient communication between the CPU and the GPU.

When we mix the simplicity of CUDA malloc with CPU-mapped GPU memory, we get a flexible and fast GPU memory allocation system with a minimal API surface. This is an excellent starting point for a minimalistic modern graphics API.

CUDA, Metal and OpenCL leverage C/C++ shader languages featuring 64-bit pointer semantics. These languages support loading and storing structs from/to any appropriately aligned GPU memory location. The compiler handles behind-the-scenes optimizations, including wide loads (combining), register mappings, and bit extractions. Many modern GPUs offer free instruction modifiers for extracting 8/16-bit portions of a register, allowing the compiler to pack 8-bit and 16-bit values into a single register. This keeps the shader code clean and efficient.

If you load a struct of eight 32-bit values, the compiler will most likely emit two 128-bit wide loads (each filling 4 registers), a 4x reduction in load instruction count. Wide loads are significantly faster, especially if the struct contains narrow 8- and 16-bit fields. GPUs are ALU-dense and have big register files, but compared to CPUs their memory paths are relatively slow. A CPU often has two load ports, each doing a load per cycle. On a modern GPU we can achieve one SIMD load per 4 cycles. A wide load plus unpacking in the shader is often the most efficient way to handle data.

Compact 8/16-bit data has traditionally been stored in texel buffers (Buffer<T>) in DirectX games. Modern GPUs are optimized for compute workloads: raw buffer load instructions nowadays have up to 2x higher throughput and up to 3x lower latency than texel buffers, so texel buffers are no longer the optimal choice on modern GPUs. Texel buffers do not support structured data; the user is forced to split their data into an SoA layout across multiple texel buffers. Each texel buffer has its own descriptor, which must be loaded before the data can be accessed. This consumes resources (SGPRs, descriptor cache slots) and adds startup latency compared to using a single 64-bit raw pointer. The SoA data layout also results in significantly more cache misses for non-linear index lookups (examples: material, texture, triangle, instance, bone id).

Texel buffers offer free conversion of normalized ([0,1] and [-1,1]) types to floating-point registers. It's true that there's no ALU cost, but you lose wide load support (combined loads) and the instruction goes through the slow texture sampler hardware path. Narrow texel buffer loads also add register bloat: an RGBA8_UNORM load to vec4 allocates four vector registers immediately. The sampler hardware will eventually write the value to these registers. Compilers try to maximize the load→use distance by moving load instructions to the beginning of the shader. This hides the load latency with ALU work and allows overlapping multiple loads. If we instead use wide raw loads, our uint8x4 data consumes just a single 32-bit register and we unpack the 8-bit channels on use, so the register lifetime is much shorter. Modern GPUs can directly access the 16-bit low/high halves of registers without an unpack, and some can even do 8-bit (AMD SDWA modifier). Packed double-rate math makes 2x16-bit conversion instructions faster. Some GPU architectures (Nvidia, AMD) can also do 64-bit pointer raw loads directly from VRAM into groupshared memory, further reducing the register bloat needed for latency hiding. By using 64-bit pointers, game engines benefit from AI hardware optimizations.

Pointer-based systems make memory alignment explicit. When you allocate a buffer object in DirectX or Vulkan, you need to query the API for alignment, and buffer bind offsets must also be properly aligned. Vulkan has an API for querying the bind offset alignment and DirectX has fixed alignment rules. The alignment contract allows the low-level shader compiler to emit optimal code (such as aligned 4x32-byte wide loads). The DirectX ByteAddressBuffer abstraction has a design flaw: load2, load3 and load4 instructions only require 4-byte alignment, and the new SM 6.2 load also only requires elementwise alignment (half4 = 2, float4 = 4). Some GPU vendors (like Nvidia) have to split ByteAddressBuffer.load4 into four individual load instructions. The buffer abstraction can't always shield the user from bad codegen; it makes bad codegen hard to fix. C/C++-based languages (CUDA, Metal) allow the user to explicitly declare struct alignment with the alignas attribute. We use alignas(16) in all our example code root structs.

By default, GPU writes are only visible to the threads inside the same thread group (= inside a compute unit). This allows a non-coherent L1$ design. Visibility is commonly provided by barriers. If the user needs memory visibility between the groups of a single dispatch, they decorate the buffer binding with the [globallycoherent] attribute, and the shader compiler emits coherent load/store instructions for accesses of that buffer. Since we use 64-bit pointers instead of buffer objects, we offer explicit coherent load/store instructions instead. The syntax is similar to atomic load/store. Similarly, we can provide non-temporal load/store instructions that bypass the whole cache hierarchy.

Vulkan supports 64-bit pointers using the (2019) VK_KHR_buffer_device_address extension (https://docs.vulkan.org/samples/latest/samples/extensions/buffer_device_address/README.html). The buffer device address extension is widely supported by all GPU vendors (including mobile), but it is not a part of core Vulkan 1.4. The main issue with BDA is the lack of pointer support in the GLSL and HLSL shader languages. The user has to use raw 64-bit integers instead. A 64-bit integer can be cast to a struct, and structs are defined with custom BDA syntax. Array indexing requires declaring an extra BDA struct type with an array in it if the user wants the compiler to generate the index addressing math. Debugging support is currently limited. Usability matters a lot, and BDA will remain a niche until HLSL and GLSL support pointers natively.
This is a stark contrast to CUDA, OpenCL and Metal, where native pointer support is a core pillar of the language and debugging works flawlessly. DirectX 12 has no support for pointers in shaders. As a consequence, HLSL doesn't allow passing arrays as function parameters. Simple things like having a material array inside a UBO/SSBO require hacking around with macros. It's impossible to make reusable functions for reductions (prefix sum, sort, etc.), since groupshared memory arrays can't be passed between functions. You could of course declare a separate global array for each utility header/library, but the compiler will allocate groupshared memory for each of them separately, reducing occupancy. There's no easy way to alias groupshared memory. GLSL has identical issues. Pointer-based languages like CUDA and Metal MSL don't have such issues with arrays. CUDA has a vast ecosystem of 3rd party libraries, and this ecosystem makes Nvidia the most valuable company on the planet. Graphics shading languages need to evolve to meet modern standards. We need a library ecosystem too.

I will be using a C/C++ style shading language similar to CUDA and Metal MSL in my examples, with some HLSL-style system value (SV) semantics mixed in for the graphics-specific bits and pieces.

Operating system threading APIs commonly provide a single 64-bit void pointer to the thread function. The operating system doesn't care about the user's data input layout. Let's apply the same ideology to GPU kernel data inputs. The shader kernel receives a single 64-bit pointer, which we cast to our desired struct (via the kernel function signature). Developers can use the same shared C/C++ header on both the CPU and GPU side.
// Common header…
struct alignas(16) Data
{
    // Uniform data
    float16x4 color;       // 16-bit float vector
    uint16x2 offset;       // 16-bit integer vector
    const uint8* lut;      // pointer to 8-bit data array
    // Pointers to in/out data arrays
    const uint32* input;
    uint32* output;
};
// CPU code…
gpuSetPipeline(commandBuffer, computePipeline);
auto data = myBumpAllocator.allocate(); // Custom bump allocator (wraps gpuMalloc ptr, see appendix)
data.cpu->color = {1.0f, 0.0f, 0.0f, 1.0f};
data.cpu->offset = {16, 0};
data.cpu->lut = luts.gpu + 64; // GPU pointers support pointer math (no need for offset API)
data.cpu->input = input.gpu;
data.cpu->output = output.gpu;
gpuDispatch(commandBuffer, data.gpu, uvec3(128, 1, 1));
// GPU kernel…
[groupsize = (64, 1, 1)]
void main(uint32x3 threadId : SV_ThreadID, const Data* data)
{
    uint32 value = data->input[threadId.x];
    // TODO: Code using color, offset, lut, etc…
    data->output[threadId.x] = value;
}
In the example code we use a simple linear bump allocator (myBumpAllocator) for allocating GPU arguments (see appendix for the implementation). It returns a struct {void* cpu, void* gpu}. The CPU pointer is used for writing directly to persistently mapped GPU memory, and the GPU pointer can be stored in GPU data structures or passed as a dispatch command argument.

Most GPUs preload root uniforms (including 64-bit pointers) into constant or scalar registers just before launching a wave. This optimization remains viable: the draw/dispatch command carries the base data pointer, and all the input uniforms (including pointers to other data) are found at small fixed offsets from the base pointer. Since shaders are pre-compiled and further optimized into device-specific microcode during PSO creation, drivers have ample opportunity to set up register preloading and similar root data optimizations. Users should put the most important data at the beginning of the root struct, as root data size is limited on some architectures. Our root struct has no hard size limit; the shader compiler will emit standard (scalar/uniform) memory loads for the remaining fields.

The root data pointer provided to the shader is const. The shader can't modify the root input data, as it might still be used by the command processor for preloading data to new waves. Output is done through non-const pointers (see Data::output in the example above). By forcing the root data to be const, we also allow GPU drivers to perform their special uniform data path optimizations.

Do we need a special uniform buffer type? Modern shader compilers perform automatic uniformity analysis. If all inputs to an instruction are uniform, the output is also uniform, and uniformity propagates over the shader. All modern architectures have scalar registers/loads or a similar construct (SIMD1 on Intel). Uniformity analysis is used to convert vector loads into scalar loads, which saves registers and reduces latency. Uniformity analysis doesn't care about the buffer type (UBO vs SSBO). The resource must be read-only (this is why you should always decorate an SSBO with the readonly attribute in GLSL, or prefer SRV over UAV in DirectX 12). The compiler also needs to be able to prove that the pointer is not aliased. The C/C++ const keyword means that data can't be modified through this pointer; it doesn't guarantee that no other read-write pointer aliases the same memory region. C99 added the restrict keyword for this purpose, and CUDA kernels use it frequently. Root pointers in Metal are no-alias (restrict) by default, and so are buffer objects in Vulkan and DirectX 12. We should adopt the same convention to give the compiler more freedom to optimize.

The shader compiler is not always able to prove address uniformity at compile time. Modern GPUs opportunistically optimize dynamically uniform address loads: if the memory controller detects that all lanes of a vector load instruction have a uniform address, it emits a single-lane load instead of a SIMD-wide gather and replicates the result to all lanes. This optimization is transparent and doesn't affect shader code generation or register allocation. Dynamically uniform data is a much smaller performance hit than it used to be in the past, especially when combined with the new fast raw load paths.

Some GPU vendors (ARM Mali and Qualcomm Adreno) take the uniformity analysis a step further. The shader compiler extracts uniform loads and uniform math into a scalar preamble that runs before the shader.
Uniform memory loads and math are executed once for the whole draw/dispatch and the results are stored in special hardware constant registers (the same registers used by root constants). All of the above optimizations together provide a better way of handling uniform data than the classic 16KB/64KB uniform/constant buffer abstraction. Many GPUs still have special uniform registers for root constants, system values and the preamble (see above paragraph).

Ideally, texture descriptors would behave like any other data in GPU memory, allowing them to be freely mixed in structs with other data. However, this level of flexibility isn't universally supported by all modern GPUs. Fortunately, bindless texture sampler designs have converged over the last decade, with only two primary methods remaining: 256-bit raw descriptors and the indexed descriptor heap.

AMD's raw descriptor method loads 256-bit descriptors directly from GPU memory into the compute unit's scalar registers; eight consecutive 32-bit scalar registers contain a single descriptor. During the SIMD texture sample instruction, the shader core sends a 256-bit texture descriptor and per-lane UVs to the sampler unit. This provides the sampler all the data it needs to address and load texels without any indirections. The drawback is that the 256-bit descriptor takes a lot of register space and needs to be resent to the sampler for each sample instruction.

The indexed descriptor heap approach uses 32-bit indices (20 bits on old Intel iGPUs). 32-bit indices are trivial to store in structs, load into standard SIMD registers and pass around efficiently. During a SIMD sample instruction, the shader core sends the texture index and the per-lane UVs to the sampler unit. The sampler fetches the descriptor from the descriptor heap: heap base address + texture index * stride (256 bits on modern GPUs). The texture heap base address is either abstracted by the driver (Vulkan and Metal) or provided by the user (SetDescriptorHeaps in DirectX 12). Changing the texture heap base address may result in an internal pipeline barrier (on older hardware). On modern GPUs the 64-bit texture heap base address is often part of each sample instruction's data, allowing sampling from multiple heaps seamlessly (64-bit base + 32-bit offset per lane). The sampler unit has a tiny internal descriptor cache to avoid indirect reads after the first access. Descriptor caches must be invalidated whenever the descriptor heap is modified.

A few years ago it looked like AMD's scalar-register-based texture descriptors were the winning formula in the long run. Scalar registers are more flexible than a descriptor heap, allowing descriptors to be embedded inside GPU data structures directly. But there's a downside. Modern GPU workloads such as ray-tracing and deferred texturing (Nanite) lean on non-uniform texture indices: the texture heap index is not uniform over a SIMD wave. A 32-bit heap index is just 4 bytes, so we can send it per lane. In contrast, a 256-bit descriptor is 32 bytes; it is not feasible to fetch and send a full 256-bit descriptor per lane. Modern Nvidia, Apple and Qualcomm GPUs support a per-lane descriptor index mode in their sample instructions, making the non-uniform case more efficient. The sampler unit performs an internal loop if required. Inputs/outputs to/from the sampler units are sent once, regardless of the heap index coherence.
AMD's scalar-register-based descriptor architecture requires the shader compiler to generate a scalarization loop around the texture sample instruction. This costs extra ALU cycles and requires sending and receiving (partially masked) sampler data multiple times. It's one of the reasons why Nvidia is faster at ray-tracing than AMD. ARM and Intel use 32-bit heap indices too (like Nvidia, Qualcomm and Apple), but their latest architectures don't yet have a per-lane heap index mode; they emit a similar scalarization loop as AMD for the non-uniform index case.

All of these differences can be wrapped under a unified texture descriptor heap abstraction. The de-facto texture descriptor size is 256 bits (192 bits on Apple for a separate texture descriptor; the sampler is the remaining 32 bits). The texture heap can be presented as a homogeneous array of 256-bit descriptor blobs, so indexing is trivial. DirectX 12 shader model 6.6 provides a texture heap abstraction like this, but doesn't allow direct CPU or compute shader write access to the descriptor heap memory. A set of APIs is used for creating descriptors and copying descriptors from the CPU to the GPU, and the GPU is not allowed to write the descriptors.

Today, we can remove this API abstraction completely by allowing direct CPU and GPU writes to the descriptor heap. All we need is a simple (user-land) driver helper function for creating a 256-bit (uint64[4]) hardware-specific descriptor blob. Modern GPUs have UMA or PCIe ReBAR, so the CPU can directly write descriptor blobs into GPU memory. Users can also use compute shaders to copy or generate descriptors. The shader language has a descriptor creation intrinsic too; it returns a hardware-specific uint64x4 descriptor blob (analogous to the CPU API). This approach cuts the API complexity drastically and is both faster and more flexible than the DirectX 12 descriptor update model. Vulkan's VK_EXT_descriptor_buffer (https://www.khronos.org/blog/vk-ext-descriptor-buffer) extension (2022) is similar to my proposal, allowing direct CPU and GPU writes. It is supported by most vendors, but unfortunately is not part of the Vulkan 1.4 core spec.
// App startup: Allocate a texture descriptor heap (for example 65536 descriptors)
GpuTextureDescriptor* textureHeap = (GpuTextureDescriptor*)gpuMalloc(65536 * sizeof(GpuTextureDescriptor));
// Load an image using a 3rd party library
auto pngImage = pngLoad("cat.png");
auto uploadMemory = uploadBumpAllocator.allocate(pngImage.byteSize); // Custom bump allocator (wraps gpuMalloc ptr)
pngImage.load(uploadMemory.cpu);
// Allocate GPU memory for our texture (optimal layout with metadata)
GpuTextureDesc textureDesc { .dimensions = pngImage.dimensions, .format = FORMAT_RGBA8_UNORM, .usage = SAMPLED };
GpuTextureSizeAlign textureSizeAlign = gpuTextureSizeAlign(textureDesc);
void *texturePtr = gpuMalloc(textureSizeAlign.size, textureSizeAlign.align, MEMORY_GPU);
GpuTexture texture = gpuCreateTexture(textureDesc, texturePtr);
// Create a 256-bit texture view descriptor and store it
textureHeap[0] = gpuTextureViewDescriptor(texture, { .format = FORMAT_RGBA8_UNORM });
// Batched upload: begin
GpuCommandBuffer uploadCommandBuffer = gpuStartCommandRecording(queue);
// Copy all textures here!
gpuCopyToTexture(uploadCommandBuffer, texturePtr, uploadMemory.gpu, texture);
// TODO other textures…
// Batched upload: end
gpuBarrier(uploadCommandBuffer, STAGE_TRANSFER, STAGE_ALL, HAZARD_DESCRIPTORS);
gpuSubmit(queue, { uploadCommandBuffer });
// Later during rendering…
gpuSetActiveTextureHeapPtr(commandBuffer, gpuHostToDevicePointer(textureHeap));
It is almost possible to get rid of the CPU-side texture object (GpuTexture) completely. Unfortunately, the triangle rasterizer units of all modern GPUs are not yet bindless. The CPU driver needs to prepare command packets to bind render targets and depth-stencil buffers, and to clear and resolve. These APIs don't use the 256-bit GPU texture descriptor; we need driver-specific extra CPU data (stored in the GpuTexture object).

The simplest way to reference a texture in a shader is to use a 32-bit index. A single index can also represent the starting offset of a range of descriptors. This offers a straightforward way to implement the DirectX 12 descriptor table abstraction and the Vulkan descriptor set abstraction without an API. We also get an elegant solution to the fast material switch use case: all we need is a single 64-bit GPU pointer, pointing to a material data struct (containing the material properties plus a 32-bit texture heap start index). Vulkan's vkCmdBindDescriptorSets and DirectX 12's SetGraphicsRootDescriptorTable are relatively fast API calls, but they are nowhere near as fast as writing a single 64-bit pointer to persistently mapped GPU memory. A lot of complexity is removed by not needing to create, update and delete resource binding API objects. CPU time is also saved, as the user no longer needs to maintain a hash map of descriptor sets, a common approach for solving the immediate vs retained mode discrepancy in game engines.
Metal 4 manages the texture descriptor heap automatically. Texture objects have .gpuResourceID, which is a 64-bit heap index (the Xcode GPU debugger reveals small values such as 0x3). You can directly write texture IDs into GPU structs, just as you would use texture indices in DirectX SM 6.6 and Vulkan (descriptor buffer extension). As the heap management in Metal is automatic, users can't allocate texture descriptors in contiguous ranges. It's a common practice to store a 32-bit index to the first texture in a range and calculate the indices for the other textures in the set (see the shader example below). Metal doesn't support this; the user has to write a 64-bit texture handle for each texture separately. To address a set of 5 textures, you need 40 bytes in Metal (5 * 64-bit), while Vulkan and DirectX 12 only need 4 bytes (1 * 32-bit). Apple GPU hardware is able to implement SM 6.6-style texture heaps; the limitation is the Metal API (software).

Texel buffers can still be supported for backwards compatibility. DirectX 12 stores texel buffer descriptors in the same heap as texture descriptors. A texel buffer functions similarly to a 1d texture (unfiltered tfetch path). Since texel buffers would mainly be used for backwards compatibility, driver vendors wouldn't need to jump through hoops to replace them with faster code paths such as raw memory loads behind the scenes. I am not a big fan of driver background threads and shader replacements.

A non-uniform texture index needs to use NonUniformResourceIndex notation, similar to GLSL and HLSL. This tells the low-level GPU shader compiler to emit a special texture instruction with a per-lane heap index, or a scalarization loop for GPUs that only support uniform descriptors. Since buffers are not descriptors, we never need NonUniformResourceIndex for buffers. We simply pass a 64-bit pointer per lane. It works on all modern GPUs: no scalarization loop, no mess. Additionally, the language should natively support ptr[index] notation for memory loads, where the index is 32 bits. Some GPUs support raw memory load instructions with a 32-bit per-lane offset, which reduces register pressure. Feedback to GPU vendors: please add the missing 64-bit shared base + 32-bit per-lane offset raw load instruction and 16-bit uv(w) texture load instructions, if your architecture is still missing them.
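For context, the shader below assumes struct layouts roughly like the following. This is a hedged sketch: the post doesn't spell out the Material and Data definitions, so everything beyond the fields actually used in the shader is an assumption.
// Hedged sketch: structs assumed by the shader example below.
struct alignas(16) Material
{
    float32x4 color;       // material tint, multiplied into the sampled color
    float32x4 pbr;         // packed PBR parameters
    uint32 textureBase;    // first descriptor of this material's range in the texture heap
};

struct alignas(16) Data
{
    Material* const* materialMap;   // per-pixel material pointers, indexed by threadId.xy in the shader
    float32x2 invDimensions;        // 1.0 / render target dimensions
};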
const Texture textureHeap[];
[groupsize = (8, 8, 1)]
void main(uint32x3 threadId : SV_ThreadID, const Data* data)
{
    // Non-uniform “buffer data” is not an issue with pointer semantics!
    Material* material = data->materialMap[threadId.xy];
    // Non-uniform texture heap index
    uint32 textureBase = NonUniformResourceIndex(material->textureBase);
    Texture textureColor = textureHeap[textureBase + 0];
    Texture textureNormal = textureHeap[textureBase + 1];
    Texture texturePBR = textureHeap[textureBase + 2];
    Sampler sampler = {.minFilter = LINEAR, .magFilter = LINEAR};
    float32x2 uv = float32x2(threadId.xy) * data->invDimensions;
    float32x4 color = sample(textureColor, sampler, uv);
    float32x4 normal = sample(textureNormal, sampler, uv);
    float32x4 pbr = sample(texturePBR, sampler, uv);
    color *= material->color;
    pbr *= material->pbr;
    // Rest of the shader
}
Modern bindless texturing lets us remove all texture binding APIs. A global indexable texture heap makes all textures visible to all shaders. Texture data still needs to be loaded into GPU memory by copy commands (to enable DCC and Morton swizzle), and texture descriptor creation still needs a thin GPU-specific user-land API. The texture heap can be exposed directly to both the CPU and the GPU as a raw GPU memory array, removing most of the texture heap API complexity compared to DirectX 12 SM 6.6.

Since our shader root data is just a single 64-bit pointer and our textures are just 32-bit indices, shader pipeline creation becomes dead simple. There's no need to define texture bindings, buffer bindings, bind groups (descriptor sets, argument buffers) or the root signature.
DirectX 12 and Vulkan utilize complex APIs to bind and set up root signatures, push descriptors, push constants, and descriptor sets. A modern GPU driver essentially constructs a single struct in GPU memory and passes its pointer to the command processor. We have shown that such API complexity is unnecessary: the user simply writes the root struct into persistently mapped GPU memory and passes a 64-bit GPU pointer directly to the draw/dispatch function. Users can also include 64-bit pointers and 32-bit texture heap indices inside their structs to build any indirect data layout that fits their needs. Root binding APIs and the whole DX12 buffer zoo can be replaced efficiently with 64-bit pointers. This simplifies shader pipeline creation drastically, as we don't need to define the data layout at all. We successfully removed a massive chunk of API complexity while providing more flexibility to the user.

Vulkan, Metal and WebGPU have a concept of static (specialization) constants, locked in at shader pipeline creation. The driver's internal shader compiler applies these constants as literals in the input shader IR and runs constant propagation and dead code elimination passes afterwards. This can be used to create multiple permutations of the same shader at pipeline creation, reducing the time and storage required for offline compiling all the shader permutations.

Vulkan and Metal have a set of APIs and special shader syntax for describing the shader specialization constants and their values. It would be nicer to simply provide a C struct that matches the constant struct defined on the shader side. That would require minimal API surface and would bring important improvements.

Vulkan's specialization constants have a design flaw: specialization constants can't modify the descriptor set layouts, so data inputs and outputs are fixed. The user could hack around the limitation by implementing an uber-layout containing all potential inputs/outputs and skipping updates of unused descriptors, but this is cumbersome and sub-optimal. Our proposed design doesn't have the same problem. One can simply branch on a constant (the other side is dead code eliminated) and reinterpret the shader data input pointer as a different struct. One could also mimic the C++ inheritance data layout: use a common layout for the beginning of the input struct and put specialized data at the end. Static polymorphism can be achieved cleanly, and runtime performance is identical to a hand-optimized shader. The specialization struct can also include GPU pointers, allowing the user to hardcode runtime memory locations, avoiding indirections. This has never been possible in a shader language before. Instead, GPU vendors had to use background threads to analyze shaders and do similar shader replacement optimizations at runtime, increasing CPU cost and driver complexity significantly.
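As a hedged sketch of what the "just provide a C struct" approach could look like: the idea of a shared constants struct, dead-code-eliminated branches and hardcoded GPU pointers comes from the post, while the gpuCreateComputePipeline call, the GpuPipeline type and all field names here are assumptions.
// Common header: specialization constants shared by CPU code and the shader (hedged sketch).
struct alignas(16) ShaderSpecialization
{
    uint32 enableFog;            // branch on this constant; the unused side is dead code eliminated
    uint32 vertexStride;         // static stride, or 0 to read a dynamic stride at runtime
    const float32x4* lightColors; // hardcoded runtime memory location, avoiding an indirection
};

// CPU code: pass the constants struct at pipeline creation (API name assumed).
ShaderSpecialization spec = { .enableFog = 1, .vertexStride = 32, .lightColors = lightColorTable.gpu };
GpuPipeline pipeline = gpuCreateComputePipeline(shaderBytecode, &spec, sizeof(spec));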
The shader permutation hell is one of the biggest issues in modern graphics today. Gamers are complaining about stutter, and devs are complaining about offline shader compilation taking hours. This new design gives the user added flexibility: they can toggle between static and dynamic behavior inside the shader, making it easy to have a generic fallback and specialization on demand. This design reduces the number of shader permutations and the runtime stalls caused by pipeline creation.

The most hated feature in modern graphics APIs must be the barriers. Barriers serve two purposes: enforcing producer-to-consumer execution dependencies and transitioning textures between layouts.

Many graphics programmers have an incorrect mental model of GPU synchronization. A common belief is that GPU synchronization is based on fine-grained texture and buffer dependencies. In reality, modern GPU hardware doesn't really care about individual resources. We spend lots of CPU cycles in userland preparing a list of individual resources and how their layouts change, but modern GPU drivers practically throw that list away. The abstraction doesn't match reality.

Modern bindless architecture gives the GPU a lot of freedom. A shader can write to any 64-bit pointer or any texture in the global descriptor heap. The CPU doesn't know what decisions the GPU is going to make, so how is it supposed to emit transition barriers for each affected resource? This is a clear mismatch between bindless architecture and the classic CPU-driven rendering APIs of today.

Let's investigate why the APIs were designed like this 10 years ago. AMD GCN had a big influence on modern graphics API design. GCN was ahead of its time with async compute and bindless texturing (using scalar registers to store descriptors), but it also had crucial limitations in its delta color compression (DCC) and cache design. These limitations are a great example of why the barrier model we have today is so complex. GCN didn't have a coherent last-level cache. ROPs (raster operations = pixel shader outputs) had special non-coherent caches directly connected to the VRAM. The driver had to first flush the ROP caches to memory and then invalidate the L2$ to make pixel shader writes visible to shaders and samplers. The command processor also wasn't a client of the L2$: indirect arguments written in compute shaders weren't visible to the command processor without invalidating the whole L2$ and flushing all dirty lines to VRAM. GCN 3 introduced delta color compression (DCC) for ROPs, but AMD's texture samplers were not able to directly read DCC-compressed textures or compressed depth buffers. The driver had to run an internal decompress compute shader to eliminate the compression. The display engine could not read DCC-compressed textures either. The common case of sampling a render target required two internal barriers and flushing all caches (wait for ROPs, flush ROP cache and L2$, run the decompress compute shader, wait for compute).

AMD's newer RDNA architecture has several crucial improvements: it has a coherent L2$ covering all memory operations, and the ROPs and the command processor are clients of the L2$. The only non-coherent caches are the tiny L0$ and K$ (scalar cache) inside the compute units. A barrier now requires only flushing the outstanding writes in the tiny caches into the higher-level cache. The driver no longer has to flush the last-level (L2) cache to VRAM, making barriers significantly faster.
RDNA’s improved display engine is capable of reading DCC compressed textures and a (de)compressor sits between the L2$ and the L0$ texture cache. There’s no need to decompress textures into VRAM before sampling, removing the need for texture layout transitions (compressed / uncompressed). All desktop and mobile GPU vendors have reached similar conclusions: Bandwidth is the bottleneck today. We should never waste bandwidth decoding resources into VRAM. Layout transitions are no longer needed.
AMD RDNA (2019): Improved cache hierarchy, DCC and display engine in the RDNA architecture. L2$ contains DCC compressed data. (De)compressor sits between L2$ and lower levels. L0$ (texture) is decompressed. Image © AMD.
Resource lists are the most annoying aspect of barriers in DirectX 12 and Vulkan. Users are expected to track the state of each resource individually and tell the graphics API its previous and next state at each barrier. This was necessary on 10-year-old GPUs, as vendors hid various decompress commands under the barrier API: the barrier command functioned as the decompress command, so it had to know which resources required decompression. Today's hardware doesn't need texture layouts or decompress steps. Vulkan just got a new VK_KHR_unified_image_layouts (https://www.khronos.org/blog/so-long-image-layouts-simplifying-vulkan-synchronisation) extension (2025), removing the image layout transitions from the barrier command. But it still requires the user to list individual textures and buffers. Why is this?

The main reason is legacy API and tooling compatibility. People are used to thinking about resource dependencies, and the existing Vulkan and DirectX 12 validation layers are designed that way. However, the barrier command executed by the GPU contains no information about textures or buffers at all. The resource list is consumed solely by the driver.

Our modern driver loops through your resource list and populates a set of flags. Drivers no longer need to worry about resource layouts or last-level cache coherency, but there still exist tiny non-coherent caches that need flushing in special cases. Modern GPUs flush the majority of the non-coherent caches automatically in every barrier. For example, the AMD L0$ and K$ (scalar cache) are always flushed, since every pass writes some outputs and these outputs live in some of these caches; fine-grained tracking of all write addresses would be too expensive. Tiny non-coherent caches tend to be inclusive: modified lines get flushed to the next cache level. This is fast and doesn't produce VRAM traffic.

Some architectures have special caches that are not automatically flushed. Examples: descriptor caches in the texture samplers (see the chapter above), rasterizer ROP caches and HiZ caches. The command processor commonly runs ahead to reduce wave spawn latency. If we write indirect arguments in a shader, we need to inform the GPU to stall the command processor prefetcher to avoid a race. The GPU doesn't actually know whether your compute shader was writing into an indirect argument buffer or not. In DirectX 12 the buffer is transitioned to D3D12_RESOURCE_STATE_INDIRECT_ARGUMENT and in Vulkan the consumer dependency has a special stage, VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT. When a barrier has a resource transition or a stage dependency like this, the driver will include a command processor prefetcher stall flag in the barrier.

A modern barrier design replaces the resource list with a single bitfield describing what happens to these special non-coherent caches. Special cases include: invalidate texture descriptors, invalidate draw arguments and invalidate depth caches. These flags are needed when we generate draw arguments, write to the descriptor heap or write to a depth buffer with a compute shader. Most barriers don't need special cache invalidation flags.

Some GPUs still need to decompress data in special cases, for example during a copy or a clear command (fast clear eliminate if the clear color has changed). Copy and clear commands take the affected resource as a parameter, so the driver can take the necessary steps to decode the data if needed. We don't need a resource list in our barrier for these special cases.
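For example, a barrier for the common "render to an offscreen target, then sample it in a pixel shader" case could look roughly like this. This is a hedged sketch: gpuBarrier and HAZARD_DESCRIPTORS appear in the earlier upload example, while STAGE_RASTER_OUTPUT, STAGE_PIXEL_SHADER, STAGE_COMPUTE, STAGE_DRAW_INDIRECT, HAZARD_NONE and HAZARD_DRAW_ARGUMENTS are assumed names following the same convention.
// Hedged sketch: producer = rasterizer output, consumer = pixel shader, no special cache flags needed.
gpuBarrier(commandBuffer, STAGE_RASTER_OUTPUT, STAGE_PIXEL_SHADER, HAZARD_NONE);

// If a compute shader wrote indirect draw arguments, add the matching hazard flag instead:
gpuBarrier(commandBuffer, STAGE_COMPUTE, STAGE_DRAW_INDIRECT, HAZARD_DRAW_ARGUMENTS);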
Not all formats and usage flags support compression. The driver will keep the data uncompressed in these cases, instead of transitioning it back and forth, wasting bandwidth.
If you write to the texture descriptor heap (uncommon), you need to add a special flag.
A barrier between rasterizer output and pixel shader is a common case for offscreen render target → sampling. Our example has dependency stages set up in a way that the barrier doesn’t block vertex shaders, allowing vertex shading (and tile binning on mobile GPUs) to overlap with previous passes. A barrier with raster output stage (or later) as the producer automatically flushes non-coherent ROP caches if the GPU architecture needs that. We don’t need an explicit flag for it.
Users only describe the queue execution dependencies: producer and consumer stage masks. There's no need to track individual texture and buffer resource states, removing a lot of complexity and saving a significant amount of CPU time versus the current DirectX 12 and Vulkan designs. Metal 2 already has a modern barrier design: it doesn't use resource lists.

Many GPUs have custom scratchpad memories: groupshared memory inside each compute unit, tile memory, and large shared scratchpads like the Qualcomm GMEM. These memories are managed automatically by the driver. Temporary scratchpads like groupshared memory are never stored to memory, and tile memories are stored automatically by the tile rasterizer (store op == store). Uniform registers are read-only and pre-populated before each draw call. Scratchpads and uniform registers don't have cache coherency protocols and don't interact with barriers directly.

Modern GPUs support a synchronization command that writes a value to memory when a shader stage is finished, and a command that waits for a value to appear in a memory location before a shader stage is allowed to begin (the wait includes optional cache flush semantics). This is equivalent to splitting the barrier into two halves: the producer and the consumer. DirectX 12 split barriers and Vulkan event→wait are examples of this design. Splitting the barrier into producer and consumer halves allows putting independent work between them, avoiding draining the GPU.

Vulkan event→wait (and DX12 split barriers) see barely any use. The main reason is that normal barriers are already highly complicated, and developers want to avoid extra complexity. Driver support for split barriers also hasn't been perfect in the past. Removing the resource lists simplifies the split barriers significantly. We can also make split barriers semantically similar to timeline semaphores: the signal command writes a monotonically increasing 64-bit value (atomic max) and the wait command waits for the value to be >= N (greater or equal). The counter is just a GPU memory pointer; no persistent API object is required. This provides us with a significantly simpler event→wait API.
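For example, a hedged sketch of such an API: gpuWaitBefore and the SIGNAL_* flag convention follow the post's description, while gpuSignalAfter, SIGNAL_ATOMIC_MAX, the counter allocation and the exact parameter order are assumptions.
// Producer half: when the compute stage has finished, atomically max the counter to the new value.
gpuSignalAfter(commandBuffer, STAGE_COMPUTE, counter.gpu, frameIndex, SIGNAL_ATOMIC_MAX);

// Independent work can be recorded here without draining the GPU...

// Consumer half: before the transfer stage starts, wait until the counter reaches the value (timeline semantics).
gpuWaitBefore(commandBuffer, STAGE_TRANSFER, counter.gpu, frameIndex);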
This API is much simpler than the existing VkEvent API, yet offers improved flexibility. In the above example we implemented the timeline semaphore semantics, but we can implement other patterns too, such as waiting on multiple producers using a bitmask: mark bits with SIGNAL_ATOMIC_OR and wait for all bits in a mask to be set (the mask is an optional parameter of the gpuWaitBefore command).
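A hedged sketch of that bitmask pattern, following the same assumed conventions as above (the flag names, parameter order and the readyMask/ALL_PRODUCERS_MASK values are illustrative):
// Each producer sets its own bit when its stage completes.
gpuSignalAfter(commandBuffer, STAGE_COMPUTE, readyMask.gpu, 1u << producerIndex, SIGNAL_ATOMIC_OR);

// Consumer waits until all producer bits in the optional mask are set.
gpuWaitBefore(commandBuffer, STAGE_PIXEL_SHADER, readyMask.gpu, ALL_PRODUCERS_MASK, ALL_PRODUCERS_MASK);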
GPU→CPU synchronization was initially messy in Vulkan and Metal. Users needed a separate fence object for each submit, and N-buffering was a common technique for reusing the objects. This is a similar usability issue to the one discussed above regarding VkEvent. DirectX 12 was the first API to solve GPU→CPU synchronization cleanly with timeline semaphores. Vulkan 1.2 and Metal 2 adopted the same design later. A timeline semaphore needs only a single 64-bit monotonically increasing counter. This reduces complexity over the older Vulkan and Metal fence APIs, which many engines still use today.
Our proposed barrier design is a massive improvement over DirectX 12 and Vulkan. It reduces API complexity significantly: users no longer need to track individual resources. Our simple hazard tracking has queue + stage granularity, which matches what GPU hardware does today. Game engine graphics backends can be simplified and CPU cycles are saved.

Vulkan and DirectX 12 were designed to promote the pre-creation and reuse of resources. Early Vulkan examples recorded a single command buffer at startup and replayed it every frame. Developers quickly discovered that command buffer reuse was impractical: real game environments are dynamic, the camera is in constant motion, and the visible object set changes frequently.

Game engines ignored prerecorded command buffers entirely. Metal and WebGPU feature transient command buffers, which are created just before recording and disappear after the GPU has finished rendering. This eliminates the need for command buffer management and prevents multiple submissions of the same commands. GPU vendors recommend one-shot command buffers (a resettable command pool per frame in flight) in Vulkan too, as it simplifies the driver's internal memory management (bump allocator vs heap allocator). The best practices match the Metal and WebGPU design. Persistent command buffer objects can be removed; that API complexity didn't provide anything worth using.
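In the proposed API this maps to a simple per-frame pattern. A hedged sketch: gpuStartCommandRecording and gpuSubmit appear in the earlier upload example, and the rest is illustrative.
// Each frame: record a transient command buffer, submit it, and forget about it.
GpuCommandBuffer cb = gpuStartCommandRecording(queue);
// ... record this frame's draws, dispatches and barriers ...
gpuSubmit(queue, { cb });   // the driver recycles the command memory once the GPU has finished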
Let's start with a burning question: do we need graphics shaders anymore? UE5 Nanite uses compute shaders to plot pixels using 64-bit atomics. The high bits contain the pixel depth and the low bits contain the payload; atomic-min ensures that the closest surface remains. This technique was first presented at SIGGRAPH 2015 in Media Molecule's Dreams talk (Alex Evans). The hardware rasterizer still has some advantages, like hierarchical/early depth-stencil tests. Nanite has to lean solely on coarse cluster culling, which results in extra overdraw with kitbashed content. Ubisoft (me and Ulrich Haar) presented this two-pass cluster culling algorithm at SIGGRAPH 2015. Ubisoft used cluster culling in combination with the hardware rasterizer for more fine-grained culling. Today's GPUs are bindless and much better suited for GPU-driven workloads like this; 10 years ago Ubisoft had to lean on virtual texturing (all textures in the same atlas) instead of bindless texturing. Despite the many compute-only rasterizers today (Nanite, SDF sphere tracing, DDA voxel tracing), the hardware rasterizer still remains the most used technique for rendering triangles in games. It's definitely worth discussing how to make the rasterization pipeline more flexible and easier to use.

The modern shader framework has grown to 16 shader entry points. We have eight entry points for rasterization (pixel, vertex, geometry, hull, domain, patch constant, mesh and amplification) and six for ray-tracing (ray generation, miss, closest hit, any hit, intersection and callable). In comparison, CUDA has a single entry point: the kernel. This makes CUDA composable, and CUDA has a healthy ecosystem of 3rd party libraries. New GPU hardware blocks such as the tensor cores (AI) are exposed as intrinsic functions. This is how it all started in graphics land as well: texture sampling was our first intrinsic function. Today, texture sampling is fully bindless and doesn't even require driver setup. This is the design developers prefer: simple, easy to compose and extend.

We recently got more intrinsics: inline ray-tracing and cooperative matrix (wave matrix in DirectX 12, subgroup matrix in Metal). I am hoping that this is the new direction. We should start tearing down the massive 16-shader framework and replacing it with intrinsics that can be composed in a flexible way.

Solving the shader framework complexity is a massive topic. To keep the scope of this blog post in check, I will today only discuss compute shaders and raster pipelines. I am going to write a followup about simplifying the shader framework, including modern topics such as ray-tracing, shader execution reordering (SER), dynamic register allocation extensions and Apple's new L1$-backed register file (called dynamic caching).

There are two relevant raster pipelines today: vertex+pixel and mesh+pixel. Mobile GPUs employing tile-based deferred rendering (TBDR) perform per-triangle binning. The tile size is commonly between 16x16 and 64x64 pixels, making meshlets too coarse-grained a primitive for binning. A meshlet has no clear 1:1 lane-to-vertex mapping, so there's no straightforward way to run a partial mesh shader wave for selected triangles. This is the main reason mobile GPU vendors haven't been keen to adopt the desktop-centric mesh shader API designed by Nvidia and AMD. Vertex shaders are still important for mobile.

I will not be discussing geometry, hull, domain, and patch constant (tessellation) shaders. The graphics community widely considers these shader types failed experiments.
They all have crucial performance issues in their design. In all relevant use cases, you can run a compute prepass generating an index buffer to outperform these stages. Additionally, mesh shaders allow generating a compact 8-bit index buffer into on-chip memory, further increasing the performance gap over these legacy shader stages.

Our goal is to build a modern PSO abstraction with a minimal amount of baked state. One of the main critiques of Vulkan and DirectX 12 has been the pipeline permutation explosion: the less state we have inside the PSO, the fewer pipeline permutations we get. There are two main areas to improve: graphics shader data bindings and the rasterizer state.

The vertex+pixel shader pipeline needs several additional inputs compared to a compute kernel: vertex buffers, an index buffer, rasterizer state, render target views and a depth-stencil view. Let's start by discussing the shader-visible data bindings.

Vertex buffer bindings are easy to solve: we simply remove them. Modern GPUs have fast raw load paths, and most GPU vendors have been emulating vertex fetch hardware for several generations already. Their low-level shader compiler reads the user-defined vertex layout and emits appropriate raw load instructions at the beginning of the vertex shader. The vertex binding declaration is another example of a special C/C++ API for defining a struct memory layout. It adds complexity and forces compiling multiple PSO permutations for different layouts. We simply replace the vertex buffers with standard C/C++ structs. No API is required.
The same is true for per-instance data and multiple vertex streams. We can implement them efficiently with raw memory loads. When we use raw load instructions, we can dynamically adjust the vertex stride, branch over secondary vertex buffer loads and calculate our vertex indices using custom formulas to implement clustered GPU-driven rendering, particle quad expansion, higher order surfaces, efficient terrain rendering and many other algorithms. Additional shader entry points and binding APIs are not needed. We can use our new static constant system to dead code eliminate vertex streams at pipeline creation or provide a static vertex stride if we so prefer. All the old optimization strategies still exist, but we can now mix and match techniques freely to match our renderer’s needs.
// Common header…
struct VertexPosition
{
    float32x4 position;
};

struct VertexAttributes
{
    uint8x4 normal;
    uint8x4 tangent;
    uint16x2 uv;
...
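The snippet above is truncated in this excerpt. As a hedged illustration of the idea (not the post's actual continuation), a vertex shader consuming these structs through raw pointer loads instead of vertex buffer bindings might look roughly like this; the VertexData layout, the float32x4x4 matrix type and the entry point signature are assumptions.
// Hedged sketch: vertex fetch replaced by raw loads from 64-bit pointers.
struct alignas(16) VertexData
{
    const VertexPosition* positions;
    const VertexAttributes* attributes;
    float32x4x4 viewProjection;
};

float32x4 mainVertex(uint32 vertexId : SV_VertexID, const VertexData* data) : SV_Position
{
    VertexPosition p = data->positions[vertexId];      // wide raw load, no vertex buffer binding
    VertexAttributes a = data->attributes[vertexId];   // secondary stream is just another pointer
    // ... a.normal, a.tangent and a.uv would feed the interpolants ...
    return data->viewProjection * p.position;
}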
Read the original on www.sebastianaaltonen.com »
We’ve read your posts and heard your feedback.
We’re postponing the announced billing change for self-hosted GitHub Actions to take time to re-evaluate our approach.
We are continuing to reduce hosted-runners prices by up to 39% on January 1, 2026.
We have real costs in running the Actions control plane. We are also making investments into self-hosted runners so they work at scale in customer environments, particularly for complex enterprise scenarios. While this context matters, we missed the mark with this change by not including more of you in our planning.
We need to improve GitHub Actions. We’re taking more time to meet and listen closely to developers, customers, and partners to start. We’ve also opened a discussion to collect more direct feedback and will use that feedback to inform the GitHub Actions roadmap. We’re working hard to earn your trust through consistent delivery across GitHub Actions and the entire platform.
Below is the original announcement from 12/16/25
We’re announcing updates to our pricing and product models for GitHub Actions.
Historically, self-hosted runner customers were able to leverage much of GitHub Actions’ infrastructure and services at no cost. This meant that the cost of maintaining and evolving these essential services was largely being subsidized by the prices set for GitHub-hosted runners. By updating our pricing, we’re aligning costs more closely with usage and the value delivered to every Actions user, while fueling further innovation and investment across the platform. The vast majority of users, especially individuals and small teams, will see no price increase.
A GitHub Actions pricing calculator will be available so you can estimate how much you will be charged under the new model. 96% of customers will see no change to their bill. Of the 4% of Actions users impacted by this change, 85% will see their Actions bill decrease, and the remaining 15% face a median increase of around $13.
GitHub Actions will remain free for public repositories. In 2025, we saw developers use 11.5 billion total Actions minutes in public projects for free (~$184 million) and we will continue to invest in Actions to provide a fast, reliable, and predictable experience for our users.
When we shipped Actions in 2018, we had no idea how popular it would become. By early 2024, the platform was running about 23 million jobs per day and our existing architecture couldn’t reliably support our growth curve. In order to increase feature velocity, we first needed to improve reliability and modernize the legacy frameworks that supported GitHub Actions.
Our solution was to re-architect the core backend services powering GitHub Actions jobs and runners with the goals of improving uptime and resilience against infrastructure issues, enhancing performance, reducing internal throttles, and leveraging GitHub’s broader platform investments and developer experience improvements. This work is paying off by helping us handle our current scale, even as we work through the last pieces of stabilizing our new platform.
Since August, all GitHub Actions jobs have run on our new architecture, which handles 71 million jobs per day (over 3x from where we started). Individual enterprises are able to start 7x more jobs per minute than our previous architecture could support.
As with any product, our goal at GitHub has been to meet customer needs while providing enterprises with flexibility and transparency.
This change better supports a world where CI/CD must be faster and more reliable: better caching, more workflow flexibility, and rock-solid reliability. It strengthens the core experience while positioning GitHub Actions to power GitHub's open, secure platform for agentic workloads.
Starting today, we're pricing Actions more fairly across the board, which reduces the price of GitHub-hosted runners and the price the average GitHub customer pays. We're reducing the net cost of GitHub-hosted runners by up to 39%, depending on which machine type is used.
This reduction is driven by a ~40% price reduction across all runner sizes, paired with the addition of a new $0.002 per-minute GitHub Actions cloud platform charge. For GitHub-hosted runners, the new Actions cloud platform charge is already included into the reduced meter price.
Standard GitHub-hosted or self-hosted runner usage on public repositories will remain free. GitHub Enterprise Server pricing is not impacted by this change.
The price reduction you will see in your account depends on the types of machines that you use most frequently — smaller runners will have a smaller relative price reduction, larger runners will see a larger relative reduction.
This price reduction makes high-performance compute more accessible for both high-volume CI workloads and the agent jobs that rely on fast, secure execution environments.
For full pricing update details, see the updated Actions runner prices in our documentation.
This price change will go into effect on January 1, 2026.
We are introducing a $0.002 per-minute Actions cloud platform charge for all Actions workflows across GitHub-hosted and self-hosted runners. The new listed GitHub-runner rates include this charge. This will not impact Actions usage in public repositories or GitHub Enterprise Server customers.
This aligns pricing to match consumption patterns and ensures consistent service quality as usage grows across both hosting modalities.
We are increasing our investment into our self-hosted experience to ensure that we can provide autoscaling for scenarios beyond just Linux containers. This will include new approaches to scaling, new platform support, Windows support, and more as we move through the next 12 months. Here’s a preview of what to expect in the new year:
This new client provides enterprises with a lightweight Go SDK to build custom autoscaling solutions without the complexity of Kubernetes or reliance on ARC. It integrates seamlessly with existing infrastructure—containers, virtual machines, cloud instances, or bare metal—while managing job queuing, secure configuration, and intelligent scaling logic. Customers gain a supported path to implement flexible autoscaling, reduce setup friction, and extend GitHub Actions beyond workflows to scenarios such as self-hosted Dependabot and Copilot Coding Agent.
We are reintroducing multi-label functionality for both GitHub-hosted larger runners and self-hosted runners, including those managed by Actions Runner Controller (ARC) and the new Scale Set Client.
This upcoming release introduces major quality-of-life improvements, including refined Helm charts for easier Docker configuration, enhanced logging, updated metrics, and formalized versioning requirements. It also announces the deprecation of legacy ARC, providing a clear migration path to a more reliable and maintainable architecture. Customers benefit from simplified setup, improved observability, and confidence in long-term support, reducing operational friction and improving scalability.
The Actions Data Stream will deliver a near real-time, authoritative feed of GitHub Actions workflow and job event data, including metadata such as the version of the action that was executed on any given workflow run. This capability enhances observability and troubleshooting by enabling organizations to integrate event data into monitoring and analytics systems for compliance and operational insights. By providing structured, high-fidelity data at scale, it eliminates reliance on manual log parsing and empowers teams to proactively manage reliability and performance.
Agents are expanding what teams can automate—but CI/CD remains the heartbeat of modern software delivery. These updates enable both a faster, more reliable CI/CD experience for every developer, and a scalable, flexible, secure execution layer to power GitHub’s agentic platform.
Our goal is to ensure GitHub Actions continues to meet the needs of the largest enterprises and of individual developers alike, with clear pricing, stronger performance, and a product direction built for the next decade of software development.
Why am I being charged to use my own hardware?
Historically, self-hosted runner customers were able to leverage much of GitHub Actions’ infrastructure and services at no cost. This meant that the cost of maintaining and evolving these essential services was largely being subsidized by the prices set for GitHub-hosted runners. By updating our pricing, we’re aligning costs more closely with usage and the value delivered to every Actions user, while fueling further innovation and investment across the platform. The vast majority of users, especially individuals and small teams, will see no price increase.
You can see the Actions pricing calculator to estimate your future costs.
What are the new GitHub-hosted runner rates?
See the GitHub Actions runner pricing reference for the updated rates that will go into effect on January 1, 2026. These listed rates include the new $0.002 per-minute Actions cloud platform charge.
Why is $0.002 per minute the right price for self-hosted runners?
Per-minute billing was the model our users considered the most fair and accurate, and it compares well with other self-hosted CI solutions in the market. We believe this is a sustainable rate that will not significantly impact either lightly or heavily active customers, while still allowing us to deliver fast, flexible workloads and the best end-user experience.
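As a back-of-the-envelope illustration of what the per-minute charge means in practice (the rate is from this announcement; the job counts below are made-up examples):

```python
# Illustrative only: the $0.002/minute rate is from this announcement; the job
# counts and durations below are made-up examples, not GitHub data.
SELF_HOSTED_RATE_PER_MINUTE = 0.002  # USD, Actions cloud platform charge from March 1, 2026

def monthly_charge(jobs_per_month: int, avg_minutes_per_job: float) -> float:
    """Estimated monthly self-hosted runner charge before any included usage is applied."""
    return jobs_per_month * avg_minutes_per_job * SELF_HOSTED_RATE_PER_MINUTE

# 2,000 jobs averaging 5 minutes each -> 10,000 billable minutes -> $20.00/month
print(f"${monthly_charge(2_000, 5):.2f}")
```

Included usage from your plan (see below) would reduce that figure further.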
Which job execution scenarios for GitHub Actions are affected by this pricing change?
* Jobs that run in private repositories and use standard GitHub-hosted or self-hosted runners
Standard GitHub-hosted or self-hosted runner usage on public repositories will remain free. GitHub Enterprise Server pricing is not impacted by this change.
When will this pricing change take effect?
The price decrease for GitHub-hosted runners will take effect on January 1, 2026. The new charge for self-hosted runners will apply beginning on March 1, 2026. The new rates will apply to all customers from these dates.
Will the free usage quota available in my plan change?
Beginning March 1, 2026, self-hosted runners will be included within your free usage quota, and will consume available usage based on list price in the same way that Linux, Windows, and macOS standard runners do today.
Will self-hosted runner usage consume from my free usage minutes?
Yes, billable self-hosted runner usage will be able to consume minutes from the free quota associated with your plan.
How does this pricing change affect customers on GitHub Enterprise Server?
This pricing change does not affect customers using GitHub Enterprise Server. Customers running Actions jobs on self-hosted runners with GitHub Enterprise Server may continue to host, manage, troubleshoot, and use Actions with their installation free of charge.
Can I bill my self-hosted runner usage on private repositories through Azure?
Yes, as long as you have an active Azure subscription ID associated with your GitHub Enterprise or Organization(s).
What is the overall impact of this change to GitHub customers?
96% of customers will see no change to their bill. Of the 4% of Actions users impacted by this change, 85% will see their Actions bill decrease, and the remaining 15% face a median increase of around $13.
Did GitHub consider how this impacts individual developers, not just enterprise-scale customers?
Among individual users (Free and Pro plans) who used GitHub Actions in private repositories in the last month, only 0.09% would end up with a price increase, with a median increase of under $2 a month. Note that this impact applies only after these users have used the minutes included in their plans today, which entitle them to over 33 hours of included GitHub compute, and it has no impact on their free use of public repositories. A further 2.8% of this user base will see a decrease in their monthly cost as a result of these changes. The rest are unaffected by this change.
How can I figure out what my new monthly cost for Actions looks like?
GitHub Actions provides detailed usage reports for the current and prior year. You can use this prior usage alongside the rate changes that will be introduced in January and March to estimate cost under the new pricing structure. We have created a Python script to help you leverage full usage reports to calculate your expected cost after the price updates.
We have also updated our Actions pricing calculator, making it easier to estimate your future costs, particularly if your historical usage is limited or not representative of expected future usage.
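For a rough sense of the arithmetic such a script performs (this is not GitHub’s published script; the CSV column names and every rate except the $0.002 self-hosted charge are placeholders you would replace with the real column names from your usage report and the rates from the pricing reference):

```python
# Rough sketch of estimating post-change Actions cost from a usage report export.
import csv
from collections import defaultdict

# USD per minute. Only the 0.002 self-hosted charge comes from the announcement;
# other SKUs must be filled in from the updated Actions pricing reference.
NEW_RATES = {
    "self_hosted": 0.002,
}

def estimate(usage_csv_path: str) -> dict:
    """Sum billable minutes per runner SKU and price them at the assumed new rates."""
    minutes = defaultdict(float)
    with open(usage_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # "Product", "SKU", and "Quantity" are assumed column names; check your report.
            if row.get("Product", "").lower() != "actions":
                continue
            minutes[row.get("SKU", "unknown")] += float(row.get("Quantity") or 0)
    return {sku: mins * NEW_RATES.get(sku, 0.0) for sku, mins in minutes.items()}

# estimate("usage_report.csv") returns estimated USD per SKU under the assumed rates.
```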
...
Read the original on resources.github.com »