10 interesting stories served every morning and every evening.
Despite the fact that AI increasingly dominates our economy (it’s a hot IPO summer and we’re all just along for the ride), most Americans are not particularly optimistic about the technology’s long-term impact on the country, a new study from Pew Research reveals.
In fact, although a whole lot of Americans increasingly use AI in their daily lives, most of them have neutral to negative views about it, the research reveals.
Only 16% of Americans think that AI’s impact on society during the next 20 years will be positive, Pew says, while around 40% say that it will have a negative impact.
A vast majority of people (67%) don’t believe that the U.S. government will do anything to meaningfully regulate AI. A similarly skeptical cohort (59%) don’t trust companies to develop it safely.
Young people — that is, those people under 30 — are the ones with the most negative feelings about AI. Pew says that only 14% of this cohort believe the tech will have a positive impact on society.
On top of all this, a vast majority of Americans — nearly two-thirds — also think that AI’s development is occurring too quickly.
Despite all of the skepticism, a whole lot of Americans also report using AI in their daily lives on an increasingly regular basis. About a quarter of Americans say they use AI chatbots on a daily basis. Those who do are typically using the chatbots for research purposes or for work, Pew says.
A vast majority of people using AI are using ChatGPT. Pew writes that 44% of U.S. adults now say they use OpenAI’s chatbot, a figure that’s more than doubled since 2023.
The next most popular chatbot is Gemini (24%), followed by Copilot (17%) and Meta AI (14%), with Grok (8%), Claude (6%), and Character.ai (3%) lagging behind.
There is a bit of a gender divide. While chatbot use is growing for both men and women, men still use AI more and are more enthusiastic about it, while women are more skeptical, Pew says. Men are more likely to say they use AI chatbots in their daily lives (27% versus 20% for women) and while equal shares of men and women report using ChatGPT, men more commonly report usage of other brands, such as Copilot and Grok.
The report also highlights how AI is changing the ways Americans consume information. Six in 10 survey respondents told Pew that they routinely read AI-generated internet summaries (indeed, on Google, they’re pretty much unavoidable). A much smaller number report using AI to get information on fitness and dieting.
There are also still a whole lot of people, about half of the country, who say they do not use AI in their daily lives. The people who do not use AI tend to be older, while those under 50 are more likely to say that they use it. Nearly 75% of Americans aged 65 or older say that they never use AI chatbots.
Those people who don’t use chatbots say they don’t because they’re not interested in them, and add that they have no intention of using them in the future.
Lucas is a senior writer at TechCrunch, where he covers artificial intelligence, consumer tech, and startups. He previously covered AI and cybersecurity at Gizmodo.
You can contact Lucas by emailing lucas.ropek@techcrunch.com.
We’ve all heard people say that local Qwen 27B or 35-A3B is “near-Opus level”, but I have receipts from a software business and open source projects, and am here to be transparent with you.
This post is long-form for a reason. It’s not a cursory glance, an unsubstantiated claim on X about cancelling Claude Max, or a hobbyist report from a model running at single-digit tokens per second with a 32K context window. It isn’t written by a famous CEO tweeting about coding from an airplane. It’s my journey as a founder in a small software business, where local models have produced real, caveated value. I have skin in the game, but no incentive to push either cloud or local models, and a strong desire for local models to become capable and reliable.
I’ll cover how the card paid for itself in the first two or three months, how it keeps serving our specific business use case, why I still can’t trust it unsupervised, and Qwen’s worst trait: the infinite loops and hallucination risk. These show up most when you quantize it down to fit a consumer GPU.
Figuring out the power connectors for the RTX 6000 Pro
On my use case for AI
My journey as a maintainer and founder started with OpenFaaS - built completely by hand, as was all software in 2016 up until recently. That meant laying down the core of the project on my own, then inviting others to participate through community - not because I couldn’t do it on my own, but because my goal was to build a successful open source project. Around 2017 I tried to fund my time by joining VMware, and in 2019 after changes in the market, I needed a way to fund the work myself, so moved towards open-core and built a bootstrapped company. Today our small team maintains OpenFaaS, SlicerVM - AI sandboxes and “the missing API for Linux”, Actuated.com - self-hosted CI runners for GitHub/GitLab, and Inlets.com - self-hosted HTTP/TCP tunnels.
These products use very low level Linux primitives like containers, Kubernetes, Firecracker microVMs, and networked protocols. If you squint, they’re all opinionated infrastructure products focused on: efficiency, user-experience, control and autonomy. They’re written in Go, and some have React-based UI components, landing pages, docs, agent skills, and CLIs. Along with the code, we also provide the best-in-class support, because we are lean and willing to do things that don’t scale to help customers.
I’ve been using AI tools for as long as they’ve been available - from tab completion in VS Code in the early days, through to getting ChatGPT to generate chunks of code, or find bugs, to living in tmux 12 hours per day. I found myself in tmux so much of the time that I wrote a free tool Superterm.dev to keep track of my sessions, notes, and to get visual feedback from coding agents. Over that time, I’ve seen the capabilities go from “reduce boilerplate” to “design, architect, and test end to end”. It’s Claude or Codex that do the majority of my work, and whilst I insist on doing my own writing, I rarely write code by hand - as much as it pains me to say that.
A turning point for frontier intelligence
I’d say it was roughly between November 2025 and January 2026 that we saw a turning point. Many developers on X started to espouse Claude Opus as having changed and how it was now capable of doing all of their work. Manual coding turned bad as quickly as milk sours left out the fridge. The costs of the top-end coding plans settled at roughly 200 USD / mo for individuals. A real number, but tolerable for the value they generated. Even today, if you avoid too much unattended work, you can make it last through the 5 hour limit, and weekly limit if you’re careful.
What makes local models interesting
There’s an argument that says: “Why use anything less than the best you can afford?”
The year of 2026 certainly is a new frontier: we find ourselves in a place where any idea can be cloned overnight by someone you’ve never heard of with a subscription in a developing nation. I’ve seen it happen to our SlicerVM product (originally written by hand in 2022) and Superterm (new in 2026, 100% written by coding agents). That’s not to say that a vibecoded clone is a 100% equivalent of a well engineered and architected solution with an experienced team supporting it, but in a market where the cost of software has gone to nil, free and good enough can be all that matters.
So in such a competitive landscape, why limit yourself to something that’s worse? Isn’t that an opportunity cost? Isn’t that risking your livelihood?
There are estimates that the leading models contain between 0.5 – 2T parameters. That’s not just “marginally more” or a “few times more” than the best in class for local hardware - that’s on a different level. The parameter count is a rough proxy for capacity, knowledge, and reasoning ability. Yet somehow, even a tiny dense model like Qwen 3.6 27B is able to post a reputable score of 77.2% on SWE-Bench Verified vs 88.6% from Claude Opus 4.8.
So you could be forgiven for taking to X and shouting loudly that “local is only 12% behind SOTA”, and many have, complete with engaging one-shotted demos of space invaders. You may go as far as claiming that a single 6-year old GPU can replace your 200 USD / mo ChatGPT Pro subscription, and indeed many have made that claim.
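The two SWE-Bench Verified scores quoted above can be read in more than one way. The sketch below works through three readings of the same pair of numbers; the function name is made up for illustration, and the only inputs are the 77.2% and 88.6% figures from the text.

```python
def gap_summary(local_pct: float, frontier_pct: float):
    """Return three readings of the gap between two benchmark pass rates."""
    # Absolute gap in percentage points.
    points = frontier_pct - local_pct
    # Gap relative to the frontier score.
    relative = points / frontier_pct
    # How many more tasks the local model fails, as a ratio of failure rates.
    failure_ratio = (100.0 - local_pct) / (100.0 - frontier_pct)
    return points, relative, failure_ratio


points, relative, failure_ratio = gap_summary(77.2, 88.6)
print(f"gap: {points:.1f} points, {relative:.1%} relative")
print(f"failure rate ratio: {failure_ratio:.2f}x")
```

Read as a failure rate, the smaller model leaves about twice as many benchmark tasks unsolved, which is a less flattering framing than a gap of around 12%.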
Benchmaxxing
Benchmarks are a moving target, and since they’re widely available, it’s possible to educate and tune a model to obtain a higher score than it would otherwise on these tests. The classic SWE-Bench Verified benchmark is based upon a set of Python issues across a number of Open Source projects. Python has threads, and async, however most code you run into is single-threaded and synchronous. In contrast, we write distributed systems in Go, where channels, contexts, and structs span across a large execution domain.
Cost
There’s a very popular take “local models aren’t about cost” and that comes from a position of privilege. Individuals can use coding plans that provide high amounts of usage through a working day for 200 USD / mo. On that basis, you are getting SOTA level intelligence, the best chance of something working and being of quality, of finding that bug, or generating that landing page.
Coding plans are clearly subsidised, just look at what happened to GitHub Copilot plans. They started off by giving away 1500 requests for 39 USD / mo and you could make that last a very long time for pennies. Something that was undisclosed changed at GitHub/Microsoft/Azure, and they moved everyone over to token-based pricing and the backlash was huge. The true cost had been hidden for so long, we’d become accustomed to it.
Now, if you’re paying for tokens on API rates, the breaking point comes sooner than many of us realise. Recently, Uber capped spend to 1500 USD / mo per developer per tool. The median salary at Uber is 330k USD annually, so if a developer used two tools to the maximum extent, that’s 36,000 USD per year, or roughly 11% of their annual compensation.
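The arithmetic behind that comparison is short enough to check directly. This is a minimal sketch using only the figures quoted above (the 1500 USD monthly cap per tool, two tools, and a 330k USD median salary); the function name is hypothetical.

```python
def tool_spend_share(cap_per_tool_per_month: float, tools: int, annual_salary: float):
    """Annual spend at the cap, and that spend as a fraction of salary."""
    annual_spend = cap_per_tool_per_month * tools * 12
    return annual_spend, annual_spend / annual_salary


annual_spend, share = tool_spend_share(1500, 2, 330_000)
print(f"{annual_spend:,.0f} USD per year, {share:.1%} of salary")
```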
So for heavy use, loops, agentic analysis, in-product capabilities deployed through SaaS systems, open weight, or local models can provide serious value. It’s not fair to rule out cost, but for many it’s not about that.
Sovereignty and privacy
We work with various enterprise customers that take data controls very seriously. If you squint at our product line, we’re all about privacy and sovereignty. OpenFaaS runs functions on your infrastructure, with your limits and preferred languages, and events. SlicerVM runs microVMs not on some abstracted cloud-based bare-metal, but on your own kit, even your MacBook. Inlets runs tunnels where you can control the tunnel client and server with 100% privacy. Actuated takes the arduous parts of GitHub Actions away and says “install an agent on your machines and forget about it”.
So naturally, we are drawn to local models - both through our core values and beliefs about how the Internet should be, and through our obligations.
You may not hold these beliefs, you may not handle any customer data, but if you live outside of the US, the removal of Anthropic’s Fable 5 model overnight might have come as a shock. In other words, there is serious vendor risk, and many of us are addicted to the source.
Local models are the solution to “What if the frontier labs do X?”
Tempering the blade
I said that local models are not the same tool as SOTA. What did I mean by that?
I build furniture using hand tools, and occasionally just like I’ll release an open source project to scratch an itch, I’ll make an edge tool like a chisel, a grooving plane blade, a scratch awl, a Sloyd knife for carving.
Tempering a Japanese style marking knife on the back of a heated file, until it hits straw colour.
There are two ways to work with steel depending on how much you can invest. Forging is taking a raw piece of steel, heating it up and smashing it with a hammer into the form you need. It’s seen as the most pure and honourable way to work - the “real way”. Then for smaller items, “stock removal” is much more approachable. It involves taking sheet steel, cutting out a shape and grinding in a bevel or a point.
But that’s just the shaping. You then have to heat the steel up, and quench it in oil or water. This makes the steel become extremely hard, so hard that if you dropped it - it would shatter into pieces. So we have to scrub off the black scum, and heat it up again, watching for a rainbow of colours. If we go one shade past where we need, we have to start the heat treating all over again.
Our team’s experience of local models is exactly like missing the temper colours. The model is running so hot, that it shoots past the goal and starts looping. Nothing can fix it, other than closing down the harness and hoping the cleared context will give a different result.
I’d never leave a blade tempering unattended, just like I’d never leave Qwen 3.6 27B working on a long horizon task. For steel the workaround is using a kiln, or temperature controlled oven to remove variability.
That Sloyd knife we forged could be used to knock in nails, but you’re likely to cut your hands and ruin the edge at the same time. Let’s go back to the start, if it’s a different tool, what is it good for?
What I was looking for
I was looking for all of the things we covered in the previous section: privacy, fixed costs and protection against vendor risk. Where I got and continue to get let down is where I treat a local model inside opencode in the same way I treat Claude or Codex. It’s almost creepy how long they can work fully unattended whilst making real progress towards a goal.
I can paste in something like: “Eoin told me he has been running Slicer VMs in a loop and ran out of FDs. He suspects VSock” and then after a couple of minutes Claude replies “Now I see the full picture: You’re doing X, you need to do Y”. I say “do it and test it end to end on my mini PC” and after any period of time - 5 or 15 minutes, I can raise a PR, have it code reviewed automatically, and then tell Claude to read it and iterate again.
It’s a wonderfully efficient loop for a small team like us that manages multiple products and works very closely with enterprise and community users.
Sharp lessons from a 3090
I started off with a single 3090 card in 2023, and quickly realised I needed another to be able to load models and have sufficient context. Nothing about local models from 2023 is worth covering here, other than they were so hard to use that I gave up on them. Qwen 3.5 was the first time I saw real work being done by agents.
I could load a model into either card in Q4 quantization with 200k context (also quantized) and get it to do small tasks, when guided. I still remember how quickly that went south. I told the model “Explore this machine from every angle, complete a forensic report on the machine and how it’s used” - Claude would have shrugged that off. Qwen started reading every single file on my machine one by one, filled its context, then hallucinated the filenames and even tool calls ~/faas-netes became ~/faaned. Stepping back, I was able to get a really lucid report by scoping the task “Take a quick look around this machine, tell me who uses it and what for” and that ran at roughly 40 – 50 tokens per second (generation).
A 27B model simply doesn’t fit at full fidelity into 1x 3090 card, so the knobs and dials are: compression level of the model’s weights (quantization), length of the context, and compression level of the keys and values of the context.
There’s a well known rule of thumb that bad things start happening at Q4_0 on the keys part of the KV cache. The most aggressive I’ve ever been is Q8_0 for keys and Q4_0 for values.
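To make the trade-off between context length and KV cache quantization concrete, here is a rough sizing sketch. It assumes a plain transformer where every counted layer stores keys and values for every token, and it uses the GGML block sizes as I understand them (Q8_0 packs 32 values into 34 bytes, Q4_0 packs 32 values into 18 bytes). The layer, head, and dimension numbers are invented for illustration and are not Qwen's real architecture.

```python
# Bytes per stored element for each cache type (assumed GGML block layouts).
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values plus one fp16 scale
    "q4_0": 18 / 32,  # 32 4-bit values plus one fp16 scale
}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, type_k="f16", type_v="f16"):
    """Estimate KV cache size in GiB for a given context length and cache types."""
    elements_per_token = n_layers * n_kv_heads * head_dim
    bytes_per_token = elements_per_token * (
        BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v]
    )
    return bytes_per_token * n_ctx / 2**30


# Hypothetical architecture: 16 attention layers, 4 KV heads, head dimension 256.
full = kv_cache_gib(16, 4, 256, 262144)
mixed = kv_cache_gib(16, 4, 256, 262144, type_k="q8_0", type_v="q4_0")
print(f"f16 keys and values: {full:.1f} GiB")
print(f"q8_0 keys, q4_0 values: {mixed:.1f} GiB")
```

With these made-up numbers, the Q8_0 keys and Q4_0 values combination cuts the cache to well under half of the f16 size, which is the kind of saving that makes a long context fit on a 24GB card at all.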
The 3090s were a constant source of headaches - I had to quantize well below where I was comfortable. One of the cards would only show up if I crossed my fingers when turning it on. Even reboots wouldn’t cure it - I had to A/C power off and remove the power cable each time for 30 seconds.
My latest experiment was setting up vLLM (the gold standard for production and concurrent serving) and even with an NVLink (175GBP) and tensor parallelism turned on, it was 3 tokens/second slower than llama.cpp during generation for an equivalent setup.
I was spending more time making them work than getting results from them.
Big spender
We offer support contracts to enterprise companies using our products, and when a ticket comes in we are incentivised to resolve it as soon as reasonably possible. I thought that getting a card that would make all the niggles go away would fix local models, and customer support was worth the risk.
We dropped around 12000 USD on an RTX 6000 Pro Blackwell edition with 96GB of VRAM. Even a couple of months on, the price has increased to around 15400 USD so adding a second becomes much harder to justify. You can’t just “slot another card in” to a consumer machine. There are many concerns from PCI lanes, to bandwidth, to card spacing, and the draw on the PSU.
It was a calculated bet, and it has paid off, but not because it replaces our Claude subscriptions - it can’t do that.
Painless customer support, without leaking customer data
Many operators at enterprise companies are highly capable and skilled, but they’re held back by manual procedures and practices. Sometimes you’re lucky and someone will work through every point in a troubleshooting guide and tell you what they got wrong. Other times, you’re 150 replies deep into an email chain and they’ve still not run that one command that would answer it all.
So we wrote “diag” a CLI tool that is easy for operators to run and that captures a complete snapshot of an OpenFaaS installation on Kubernetes. They can then email this dump to us and we can run it through an airgapped local model, in an ephemeral VM created by Slicer. You can read more about the issues we found in Introducing: Painless support and hands-off architecture reviews over on the OpenFaaS blog.
Revenue recovery
A renewal came up recently, and only because I fed the telemetry database into a local model, did we find out they’d been under-reporting licenses and under-paying by about 4 – 5x for over 12 months. That revenue recovery alone paid for the card.
There’s no way I would have in good conscience run the telemetry dump or a customer’s diag output through any cloud plan, regardless of their stance on data retention. This is a good time for me to cover near- and far-east coding plans - caveat emptor - I’m yet to find one that doesn’t take a privileged position on your IP - training and ownership rights for inputs and outputs. ChatGPT Pro and Claude Max can be configured for a 30 day retention period, but even that level likely invalidates your contracts with customers.
Sometimes I’ve given GPT or Opus the schema for the telemetry table and had it write an AGENTS.md that the local model is most likely to follow. Our data is reported several times per day, from multiple high-availability replicas, so it can’t just be summed up across a 24 hour period. With earlier iterations of the model, I saw it fail at arithmetic - 27.3K counted as 273,000. It was only because I was thoroughly checking its work that I caught it out.
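The 27.3K slip is a factor-of-ten error, and it is the kind of thing a small deterministic helper can cross-check. The sketch below is a hypothetical parser for abbreviated counts, not part of our tooling; the suffix table and function name are assumptions.

```python
# Multipliers for abbreviated counts such as "27.3K" (assumed convention).
SUFFIXES = {"K": 1_000, "M": 1_000_000}


def parse_count(text: str) -> int:
    """Convert strings like '27.3K' or '273,000' into an integer count."""
    cleaned = text.strip().replace(",", "")
    if not cleaned:
        raise ValueError("empty count")
    suffix = cleaned[-1].upper()
    if suffix in SUFFIXES:
        return round(float(cleaned[:-1]) * SUFFIXES[suffix])
    return round(float(cleaned))


reported = parse_count("27.3K")
model_claim = parse_count("273,000")
print(reported, model_claim, model_claim / reported)
```

Comparing the model's stated total with a value parsed in plain code makes an order-of-magnitude mistake stand out without having to reread every figure by hand.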
Another time, the model inferred a customer was likely to churn because they had a small number of functions. It completely ignored that the customer ran that smaller number of functions many times per day. So often it’s better to have them focus on analysis, not interpretation.
Our current setup
I’m a big supporter of folks like Jack Rong and Kyle Hessling who have worked on fine-tunes of open weight models like Qwen. Qwopus attempts to layer Chain of Thought traces on top of Qwen to make it better at reasoning and coding. They do this to help the community and because of a deep belief in local AI.
In our team we run both the latest generation of Qwopus, and the base 27B Qwen 3.6 model on the RTX 6000 rig. Over time this changes - as new finetunes come out, as new point releases of Qwen drop and as we land upon new edge-cases and limitations. Up until very recently, we ran with thinking turned off completely, and have only recently added it back in which coincided with seeing more looping.
The models are served by two independent llama.cpp instances, which means they retain full context length. The default answer to “concurrency” is to run --parallel 2 but this halves the available context.
$ nvidia-smi
Wed Jun 17 11:56:03 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac…      Off |   00000000:01:00.0 Off |                  Off |
| 30%   32C    P8             15W /  600W |   85937MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2265      C   …ma.cpp/build/bin/llama-server        31198MiB |
|    0   N/A  N/A            2544      C   …ma.cpp/build/bin/llama-server        54718MiB |
+-----------------------------------------------------------------------------------------+
llama.cpp is built from source and kept up to date weekly, or as required. The build from source is required in order to add support for Nvidia GPUs.
Here’s our command for a single instance of Qwen with full context length and full quality context.
#!/bin/bash
~/llama.cpp/build/bin/llama-server \
  -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q8_K_XL \
  --alias Qwen3.6-27B-Base \
  --host 0.0.0.0 \
  --port 8085 \
  -ngl 99 \
  -c 262144 \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --flash-attn on \
  --parallel 1 \
  --threads 16 \
  -b 4096 \
  -ub 2048 \
  --jinja \
  --reasoning-budget 2048 \
  --temperature 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 1.1 \
  --reasoning on \
  --spec-type draft-mtp \
  --spec-draft-n-max 6 \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --chat-template-file chat_template.jinja \
  --reasoning-budget-message "reasoning budget consumed, time to answer now"
We get about a 93% acceptance rate on our speculative decoding from MTP, and the speed increases from a stable 67 tok/s to 130 – 200 tok/s sustained over long periods. It feels faster than using a cloud model.
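For a sense of where that speed-up comes from, the textbook estimate for speculative decoding is sketched below. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and treating the reported 93% acceptance rate as that per-token probability is my assumption rather than something llama.cpp states. It also ignores the cost of producing the drafts, so it is an upper bound and not a prediction.

```python
def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Assumes independent acceptance of each drafted token. The main model
    always contributes one token, so the result is at least 1.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    if accept_prob == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


# 0.93 and 6 mirror the acceptance rate and --spec-draft-n-max quoted above.
print(f"{expected_tokens_per_pass(0.93, 6):.2f} tokens per pass (idealised)")
```

The measured gain from 67 tok/s to 130 – 200 tok/s is roughly 2x to 3x, well below this idealised ceiling, which is what you would expect once drafting overhead and uneven acceptance are included.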
It’s important to follow the instructions from the model card when tuning llama.cpp. There are often reasons why a certain temperature has been selected by the lab. For instance, with the Qwopus fine-tune, it works best with thinking turned off and the temperature really hot at 0.85 – 1.0.
About that looping
Recently I’ve been tuning it to try to avoid looping, which goes back to that tempering analogy. You can’t just leave this model to work on long horizon tasks.
I asked Qwen what commands we should add to faas-cli, and it came back with some reasonable suggestions, but got stuck and kept repeating them over and over, burning 600W of my electricity for a good half an hour.
58. faas-cli function import - Import functions from a YAML file or URL.
59. faas-cli function export - Export deployed functions back to a stack.yaml file.
60. faas-cli function scale - Manually scale function replicas without redeploying.
61. faas-cli function rename - Rename a function in-place.
62. faas-cli function diff - Compare local stack.yaml with what’s deployed - show differences.
63. faas-cli function import - Import functions from a YAML file or URL.
64. faas-cli function export - Export deployed functions back to a stack.yaml file.
65. faas-cli function scale - Manually scale function replicas without redeploying.
66. faas-cli function rename - Rename a function in-place.
67. faas-cli function diff - Compare local stack.yaml with what’s deployed - show differences.
68. faas-cli function import - Import functions from a YAML file or URL.
69. faas-cli function export - Export deployed functions back to a stack.yaml file.
70. faas-cli function scale - Manually scale function replicas without redeploying.
71. faas-cli function rename - Rename a function in-place.
72. faas-cli function diff - Compare local stack.yaml with what’s deployed - show differences.
Build · Qwen3.6-27B-Base toilgate
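The output above repeats the same items with only the list number changing, which makes this first kind of loop fairly easy to spot mechanically. The sketch below is a hypothetical guard, not a feature of opencode or llama.cpp: it strips the leading numbering and flags the output once any line has appeared more than a chosen number of times. The threshold and function names are assumptions.

```python
import re
from collections import Counter


def strip_numbering(line: str) -> str:
    """Remove a leading '58. ' style list number so repeats compare equal."""
    return re.sub(r"^\s*\d+\.\s*", "", line).strip()


def looks_like_loop(lines, max_repeats: int = 2) -> bool:
    """Return True if any non-empty line repeats more than max_repeats times."""
    counts = Counter(strip_numbering(line) for line in lines if line.strip())
    return any(count > max_repeats for count in counts.values())


# Excerpt of the repeated suggestions shown above.
sample = [
    "58. faas-cli function import - Import functions from a YAML file or URL.",
    "59. faas-cli function export - Export deployed functions back to a stack.yaml file.",
    "63. faas-cli function import - Import functions from a YAML file or URL.",
    "64. faas-cli function export - Export deployed functions back to a stack.yaml file.",
    "68. faas-cli function import - Import functions from a YAML file or URL.",
    "69. faas-cli function export - Export deployed functions back to a stack.yaml file.",
]
print(looks_like_loop(sample))
```

A check like this only catches verbatim repetition. It would not catch the second kind of loop described later, where the model keeps trying different broken fixes.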
The same thing happened when I asked it to “add --json to all get and list commands” - it was convincing for the first one or two and even wrote tests.
Then because --json is machine readable, faas-cli needed to stop printing warnings about insecure TLS when using a http:// remote endpoint. Qwen couldn’t work out how to do this so I told it to write a reverse proxy in Python and call that instead. The first version looked plausible but had bad indenting. When it realised the issue, it corrupted the file, and kept complaining that it didn’t know how to fix it and was stuck in a different kind of loop. It just wouldn’t give up, but went progressively off the rails.
Han from my team has reported very similar looping - mostly the second kind. The model or agent is stuck, at the edge of its ability and won’t ask for help. For me, I’ve mainly hit the former, which is arguably worse and means I rarely trust it beyond the telemetry and diag work for customer support/renewals.
Measuring and distributing access
To begin with, I set up a single inlets tunnel and hoped the agents wouldn’t clash. Two agents hitting the same llama.cpp instance with unrelated contexts means each request invalidates the other’s cached prefix — so the full prompt gets re-processed from scratch every time, a thrashing latency you don’t want to feel often. We were still doing most work on coding plans then, so it wasn’t yet a real problem.
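The thrashing described above can be shown with a toy model. This is a deliberately simplified sketch, not llama.cpp's actual cache implementation: it assumes a single slot that remembers only the previous prompt, and that only the tokens after the shared prefix need processing. The agents and token lists are made up.

```python
def common_prefix_len(a, b) -> int:
    """Number of leading tokens two prompts share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tokens_reprocessed(requests) -> int:
    """Total prompt tokens processed by one slot that caches the last prompt."""
    cached = []
    total = 0
    for prompt in requests:
        reused = common_prefix_len(cached, prompt)
        total += len(prompt) - reused
        cached = prompt
    return total


agent_a = [f"a{i}" for i in range(1000)]
agent_b = [f"b{i}" for i in range(1000)]

# One agent extending its own conversation: only new tokens are processed.
solo = [agent_a[:900], agent_a[:950], agent_a]
# Two agents interleaved on the same slot: every request starts from scratch.
mixed = [agent_a[:900], agent_b[:900], agent_a[:950], agent_b[:950]]

print(tokens_reprocessed(solo), tokens_reprocessed(mixed))
```

In this toy model, the interleaved case processes every prompt in full, while the single agent only pays for the tokens it adds each turn.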
Distributing that setup was simple: edit opencode.json and add the URL and token, then copy that file onto your various machines or Slicer VMs.
But as soon as another person uses the model, it stops being a prototype. Who’s on which llama.cpp instance? How much are they using? Which model? What has that cost us in electricity? What happens if that person leaves the team? How do we add in another model for the team?
Toilgate is 100% vibe-coded and too much work to open source. If you like the idea, feel free to make your own.
Rather than manually editing my opencode.json file, and sending that to various team mates, I decided to write a provider for opencode. It would manage the available models from the stable base through to more experimental Qwopus variants that were quantized. Just run opencode - go to the model picker and select toilgate then whatever you want to use.
Two Shelly Plus Plugs are monitoring the power consumption at the wall to give me a better idea of actual costs. The RTX 6000 Pro will pull 600W during inference and is relatively quiet, the two 3090s are closer to 750W combined and extremely noisy.
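Turning those wall readings into a cost is simple arithmetic. The sketch below uses the 600W and 750W draws mentioned above; the tariff of 0.30 per kWh is a placeholder I have made up, so substitute your own rate and currency.

```python
def energy_kwh(watts: float, hours: float) -> float:
    """Energy used at a constant power draw, in kilowatt-hours."""
    return watts * hours / 1000.0


def running_cost(watts: float, hours: float, price_per_kwh: float) -> float:
    """Cost of that energy at a flat tariff."""
    return energy_kwh(watts, hours) * price_per_kwh


PRICE_PER_KWH = 0.30  # hypothetical tariff, not a figure from this post

# Half an hour of inference on each rig at its full quoted draw.
for name, watts in [("RTX 6000 Pro", 600), ("2x 3090", 750)]:
    kwh = energy_kwh(watts, 0.5)
    cost = running_cost(watts, 0.5, PRICE_PER_KWH)
    print(f"{name}: {kwh:.3f} kWh, cost {cost:.3f}")
```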
The wrong comparison
The trap once you can measure is comparing the input/output costs per million tokens to OpenAI’s API pricing for GPT-5.5. That’s the wrong comparison for the current capability. It’s more about understanding the ongoing costs, which I’m bearing personally since the machine is in my house, for work that’s not suitable for a cloud model.
This is where “local AI” turns into an operations problem. You need identity, access control, metering, quotas, model routing and power monitoring. The harder part we keep coming back to is the reliability of the agent/model combination, keeping up with innovations like MTP, and ensuring enough uptime for people who have started to depend on the model being available.
Wrapping up
Whilst local Qwen is not “near Opus levels”, and I hope I’ve demonstrated that enough in the post, it is of value for certain tasks and workflows. It’s also incredibly early, and it can only get better from here. Qwen 3.5 was probably the first model that gave us results we could use. There are rumours of 3.7 coming out soon, which I’d expect to be an iterative improvement - not a revolutionary one.
As OpenAI files SEC paperwork ahead of an expected initial public stock offering, newly leaked financial documents show a company with quickly growing revenues that are currently being overwhelmed by even larger expenses.
The audited financial statements, obtained by independent journalist Ed Zitron, show OpenAI’s reported revenue growing from $3.7 billion in 2024 to $13.07 billion in 2025. The Financial Times, which reviewed the same documents, writes that the company’s monthly revenues had grown to nearly $2 billion by the end of 2025, suggesting that its ongoing revenue rates continued to grow throughout the year.
R&D expenses alone still easily outpace OpenAI’s quickly growing revenues.
Credit: Ars Technica
But the company’s fast-growing revenues are still dwarfed by its even more significant expenses. OpenAI’s total revenues in both of the last two years were outpaced by research and development alone, which grew from a $7.81 billion line item in 2024 to a massive $19.18 billion cost in 2025. Those numbers seem to reflect the significant costs OpenAI incurred in training new models and include $10.59 billion in R&D costs paid to Microsoft alone in 2025.
On top of that, OpenAI’s “cost of revenue” (i.e., the money spent producing and distributing the product) increased from $2.65 billion in 2024 to $7.5 billion in 2025. This cost line likely reflects the significant compute costs incurred at “inference time” as the company’s models respond to a growing number of user prompts. Costs associated with sales and marketing also grew from $1.11 billion in 2024 to $5.73 billion in 2025.
OpenAI’s operating loss is shrinking as a percentage of revenue, but there’s a long way to go before it becomes a profit.
Credit: Ars Technica
All told, OpenAI’s day-to-day “loss from operations” increased from $8.78 billion in 2024 to $20.92 billion in 2025, a concerning direction for a company that is telling investors it hopes to be profitable by 2030. But measured as a percentage of revenues, the company’s operating losses slightly improved year to year, from 237 percent in 2024 to 160 percent in 2025.
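The percentages in that comparison follow directly from the revenue and operating loss figures reported above. A minimal check, using only those four numbers (in billions of dollars) and a helper name chosen for illustration:

```python
def loss_pct_of_revenue(operating_loss: float, revenue: float) -> float:
    """Operating loss expressed as a percentage of revenue."""
    return 100.0 * operating_loss / revenue


# Figures in billions of USD, as reported in the article.
figures = {2024: (8.78, 3.7), 2025: (20.92, 13.07)}

for year, (loss, revenue) in figures.items():
    print(year, f"{loss_pct_of_revenue(loss, revenue):.0f}%")
```

The loss more than doubles in absolute terms while the ratio falls, because revenue grew faster than the loss did.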
Tesco is also dealing with migration challenges related to data security because its new, unnamed virtualization software is incompatible with the Veeam and Zerto products it uses.
“Manifestly unfair and excessive” price hike
Tesco initially requested at least 100 million pounds (about $133.6 million) in damages each from Broadcom, VMware, and reseller Computacenter, plus interest.
In its recent filings, Tesco said it turned down at least four offers from Broadcom to continue using VMware and Broadcom’s mainframe tech. One offer charged $23.5 million (about 17.6 million pounds) for VMware Cloud Foundation 9.0 and mainframe software and support services for a year, The Register reported. Tesco said that was “around 175 percent” more expensive than what it believes it should have had to pay for VMware and a 350 percent price hike for the mainframe offerings. The prices were “manifestly unfair and excessive,” one of Tesco’s filings said, according to The Register.
In an amended defense, Broadcom denied that the price hike was unfair, The Register reported. Additionally, Broadcom argued that it shouldn’t have to pay damages in relation to Tesco struggling to find VMware and Broadcom alternatives before Tesco’s support expired, as the retail firm has since found replacement products.
The case is expected to go to court between November 1, 2027, and February 25, 2028, The Register reported. Afterward, it could go to trial.
Although the companies will continue their dispute in UK courts, the disagreement mirrors frustrations that VMware customers and partners around the world have expressed since Broadcom bought VMware. With users often being heavily dependent on VMware products, many have delayed or avoided migration or are only moving some workloads, due to complications around cost, time, support, and compatibility.
Still, virtualization rivals, like Hewlett Packard Enterprise and Nutanix, have been making aggressive pushes to attract disgruntled VMware users.
Simultaneously, Broadcom has stuck to its VMware strategy and has reported financial success, especially among its target of large enterprises. It has also dealt with other public legal disputes with large customers, including AT&T, with which it reached an undisclosed settlement, and Siemens, which Broadcom accused of software piracy in an ongoing case in the US District Court for the District of Delaware.
According to a report by Ars Technica, AMD has quietly stripped a critical security feature from its lower-end CPUs, leaving unaware users potentially vulnerable to physical attacks. Following a months-long investigation tracked on GitHub, Ben Kilpatrick confirmed that the Transparent Secure Memory Encryption (TSME) feature — which protects CPUs against physical exploits that siphon data from connected memory chips — was suddenly no longer available on AMD CPUs outside the company’s Pro lineup.
As the exhaustive inquiry, which involved conversations with AMD engineers, board vendors, and other CPU users, was coming to a head, an AMD engineer abruptly cut discussions short, stating, “My apologies, but I don’t have any more information to share on this topic.” As of this report, AMD has neither officially acknowledged nor explained the disappearance of the security feature.
TSME is a protection feature that encrypts the data stored in memory, making it unusable to physical attackers. AMD initially added this feature to its high-end CPUs, then later extended it to lower-end CPUs. Eventually, the feature became a given, leaving lower-end chip users assured in its availability as part of the chip package. However, without prior notice, AMD appears to have scrapped the security feature in these processors.
According to the Ars report, the company’s only official reaction to the matter — not counting the GitHub discussions — is an email response stating that TSME “is a security feature only applied to PRO CPUs as part of AMD PRO Technologies,” notably the first time the company has publicly stated such a restriction, despite the feature having worked on consumer chips for years. However, it remains unclear whether the disappearance is an intentional policy decision by AMD to reserve TSME for Pro chips or an unintentional regression that was introduced in AGESA 1.2.7.0, a newer firmware release.
Another concerning aspect of the removal is that the feature’s disappearance is completely undetectable on Windows machines and requires significant technical work to identify on Linux. That means the security feature could vanish without users realizing that anything had changed.
Kilpatrick, a self-described “privacy-conscious Linux hobbyist” who first reported the change, was installing a new operating system on his machine running a Ryzen 7 9700X from the Zen 5 architecture. To confirm that all his security protections were enabled, he ran Host Security ID (HSI), an auditing feature that evaluates a system’s firmware and hardware security configurations. To his surprise, HSI reported that TSME was no longer supported — even though he had enabled it in his BIOS settings all along. The contradiction sent him searching for answers.
His first instinct was to reach out to MSI, his motherboard’s manufacturer, but the company didn’t initially provide a definitive explanation. He also filed a bug report on AMD’s public engineering GitHub repository, where two AMD engineers eventually responded: Tom Lendacky, an AMD fellow software engineer, and Mario Limonciello, an AMD senior principal software engineer.
Interestingly, neither engineer appeared to have a clear answer for why the feature had disappeared. Their advice was basically the same: disable and re-enable the option in the BIOS, and if that didn’t work, take it up with the motherboard manufacturer, making it clear that people directly at AMD were just as in the dark as the user reporting it.
It was only after this that Kilpatrick pressed MSI harder, eventually convincing its engineers to run controlled tests. They found that consumer Ryzen chips had TSME enabled under an older firmware version but showed it as “not supported” under a newer one (AGESA 1.2.7.0), while Pro versions of the CPU supported the feature regardless of the firmware or motherboard used.
This leaves the big question of whether AMD deliberately restricted TSME to its Pro chips, or whether the change was an accidental regression — a firmware bug introduced in that newer AGESA version. Either way, the silicon appears to have been capable of running the feature. The difference is whether users are looking at a bug that AMD should fix or a quiet product-segmentation decision that AMD has not properly explained.
Kilpatrick took these MSI findings back to the AMD engineers and resumed the discussion six weeks later. MSI’s product marketing team, he reported, had been told directly by AMD that TSME is exclusively supported on Pro series processors. He also relayed MSI’s test results: an internal AGESA flag that controls whether TSME activates during boot returned FALSE on consumer chips regardless of the BIOS setting, but TRUE on Pro processors when the feature was enabled.
Kilpatrick then brought up something especially awkward. He reminded Lendacky of a comment that the engineer had made back in 2020, confirming that a Ryzen 3700X, a consumer CPU, “should support TSME.” In a later 2025 comment in the same discussion, Lendacky again recommended using TSME, while noting that the motherboard BIOS provider had to expose the option. So there it was, AMD’s own engineer, years earlier, acknowledging the feature working on exactly the kind of lower-end chip now stripped of it, proving that Ryzen support was not some fantasy users invented.
After some more back-and-forth, Kilpatrick asked bluntly whether the flag being set to FALSE on consumer chips was a silicon-level limitation or a firmware policy decision — since one is permanent and the other is potentially reversible. Limonciello’s reply effectively closed the chapter. “My apologies, but I don’t have any more information to share on this topic,” he wrote.
To be fair to AMD, there is no clear indication that the company ever publicly advertised TSME as a consumer Ryzen feature. AMD has long said that a related memory protection, Secure Memory Encryption (SME), is available only in the Pro and EPYC CPU tiers. SME is OS-managed, using a single key and allowing the OS to selectively encrypt individual memory pages. TSME, by contrast, is firmware-managed, encrypting all RAM with no OS involvement. When active, it guards against physical attacks such as cold-boot exploits, DRAM interface snooping, and memory module removal, and it activates silently once enabled in the BIOS, making it the more practically useful of the two protections.
For now, AMD has said nothing official. It hasn’t confirmed what happened, why it happened, whether anything actually changed, or what users of its consumer chips should now expect. Given the years of TSME quietly doing its job on these lower-cost processors, and AMD engineers’ own past comments treating it as supported, users had every reason to regard it as part of the package.
For most consumer Ryzen users, the practical impact of the change is narrow. TSME protects against physical attacks, meaning scenarios in which someone has physical access to the machine or its memory hardware and attempts to extract secrets directly from RAM. The feature is more important for people carrying sensitive laptops, handling confidential work, relying on full-disk encryption, or operating in environments where seizure, theft, or hardware tampering is a realistic concern. Anyone who genuinely needs memory encryption on AMD hardware now appears to need a Ryzen Pro or EPYC system, unless AMD clarifies the situation or restores support.
Etiido Uko is a news contributor for Tom’s Hardware covering the latest updates in big tech and the PC industry. He is a mechanical engineer and senior technical writer with over nine years of experience in documentation and reporting. He is deeply passionate about all things engineering and technology, and is an expert in gadgets, manufacturing, robotics, automotive, and aerospace.
Our cloud browsers need to do three things at once: start quickly, remain isolated, and be cheap. That is why we rebuilt Browser Use Cloud, so a new session starts in under a second and costs $0.02 per browser hour, down from $0.06.
This is harder than it sounds. A browser has Chromium, a filesystem, cookies, cache, proxy settings, downloads, and sometimes a logged-in customer session. If one browser can read another browser’s state, it creates a security problem.
The normal answer is a virtual machine, or VM. A VM is a computer inside a computer: it gets its own CPU, memory, disk, and network devices. It is separate from everything else on its host, and if the browser breaks, leaks information, or gets attacked, the damage stays within the VM.
Normal VMs, however, are too heavy for cloud browsers. We need to create them constantly, sometimes thousands at a time, and throw them away as soon as sessions end. If each browser needs a slow, expensive VM, the product becomes slow and expensive, too.
The question for us was whether we could give every browser its own VM without making users wait or pay for it. We now do that with Firecracker, a lightweight VM system.
Every Browser Use Cloud session runs in its own tiny VM. These VMs run on EC2, Amazon’s rented cloud servers.
That is the unusual part. Firecracker is normally run on bare-metal servers, where you rent the whole physical machine. To reduce customers’ cost, we run it on regular EC2, where AWS has already put your server inside a VM.
This should be slow. Nested VMs make memory and CPU operations more expensive, and Chromium takes time to start. This post is about how we made this setup fast and efficient.
But first, why did we rebuild our infrastructure?
Why we left unikernels behind
We used to run cloud browsers with Unikraft, which builds small, VM-like containers called unikernels. Unikernels, instead of booting a full Linux system, load a small image built for your purposes. Unikernels start quickly and are cheap when idle because you can shut them down when no one is using them.
Unikraft was good for turning browsers off when they were not in use, but bad at adding more browsers quickly when traffic spiked. If more users suddenly asked for browsers at once, you would need to scale browser capacity rapidly. Unikraft does not have good built-in autoscaling, so an engineer had to change a variable, manually adding more instances.
During a burst in traffic, the system, instead of reacting on its own, required humans to adjust it. This caused problems: one load test brought down production for 45 minutes. So we rebuilt our setup on Firecracker.
Firecracker provides a layer through which you can create, monitor, and run VMs. It gives each VM CPU, memory, disk, and network devices, and it keeps it isolated from the host and from other VMs.
Teaching browsers to scale themselves
Firecracker gave each browser its own VM. But it did not inherently solve the problem that broke the old system: deciding how many VMs to run, where to put them, and when to add more.
So we built our own control plane. The control plane monitors our fleet of browsers and decides whether we should scale up or down.
When a user asks for a browser, the control plane picks a machine with room. When traffic rises, it starts more machines. When traffic falls, it stops sending new browsers to machines we want to remove.
It checks the fleet in real time. That is much faster than waiting on CloudWatch, AWS’s monitoring service, which usually reacts on one-minute windows. It also knows things generic metrics do not: browsers that are still starting, machines we are trying to remove, and machines that should not receive new sessions.
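The placement rules described above can be sketched in a few lines. This is an illustration, not the actual control plane: the `Host` fields, the "most free slots" policy, and the host names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    capacity: int           # browser VMs this host can hold (assumed field)
    running: int = 0        # browsers already serving sessions
    starting: int = 0       # browsers still resuming; they occupy room too
    draining: bool = False  # True once we want to remove this host

    def free_slots(self) -> int:
        return self.capacity - self.running - self.starting


def pick_host(fleet):
    """Return the eligible host with the most free slots, or None."""
    candidates = [h for h in fleet if not h.draining and h.free_slots() > 0]
    if not candidates:
        return None  # signal to the caller that more hosts are needed
    return max(candidates, key=Host.free_slots)


fleet = [
    Host("host-a", capacity=4, running=3, starting=1),   # full once starters count
    Host("host-b", capacity=4, running=1, draining=True),  # being removed
    Host("host-c", capacity=4, running=2),
]
chosen = pick_host(fleet)
print(chosen.name)  # host-c
```

Counting browsers that are still starting, and skipping hosts marked for removal, is what a generic CPU or memory metric would miss.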
Why we run VMs inside VMs
Once we had a control plane, the next question was what kind of machines it should add.
The usual way to run Firecracker on AWS is a .metal instance. This means you rent the whole physical server, and Firecracker runs directly on it.
We chose regular EC2 instead. Regular EC2 machines are faster to get and cheaper to keep around. Our hosts boot from a pre-built image and start serving browsers about 30 seconds after launch. The faster we can add a host, the less idle capacity we need to pay for, and the lower the cost we pass on to our customers.
The catch is that regular EC2 is already a VM. AWS runs our host inside its own isolation layer, and then we run browser VMs inside that host. In other words, every browser is a VM inside a VM.
This is not the normal way of using Firecracker. When a browser VM needs help from the host, the request passes through two VM layers instead of one, adding latency.
We decided the tradeoff was worth it, as regular EC2 gives us faster scale-up and lower cost. To mitigate the effects of nested virtualization, we focused on making Firecracker as speedy as possible.
From request to usable browser
When a user asks for a browser, the control plane picks a machine with room. That machine restores a saved browser VM, starts Chromium inside it, waits until Chromium is ready to be controlled, and returns a connection URL.
That URL is what the user’s agent connects to. Browser Use controls Chromium over a WebSocket using the Chrome DevTools Protocol, or CDP. CDP is the remote-control API for Chrome: click this button, type this text, read this page, take this screenshot.
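To make the remote-control idea concrete, here is a minimal sketch of what CDP commands look like on the wire: each one is a JSON text frame with an `id`, a `method`, and `params`. The helper name `cdp_command` is hypothetical. The method names are standard CDP methods, but a real client also has to attach to a page target and read responses, which this sketch leaves out.

```python
import itertools
import json

_ids = itertools.count(1)  # CDP replies echo the id, so each command gets a fresh one


def cdp_command(method: str, **params) -> str:
    """Serialize one CDP command as the JSON text sent over the WebSocket."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


frames = [
    cdp_command("Page.navigate", url="https://example.com"),
    cdp_command("Input.insertText", text="hello"),
    cdp_command("Page.captureScreenshot", format="png"),
]
for frame in frames:
    print(frame)
```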
Three things made this take longer: restoring the VM’s memory, launching Chromium, and keeping the browser stealthy and undetected by anti-bot security.
The first slowdown: memory
The first bottleneck was memory.
A production browser is not booted from scratch. We resume it from a snapshot: a saved VM that is already booted and paused just before Chromium launches. Resuming a VM is much faster than booting one.
Our first resumes were still too slow. When a restored VM touches memory for the first time, the host has to map that memory back in. This event is called a page fault. In a nested VM, each page fault is expensive because it can cross both VM layers.
During an early cold start, page faults were 72% of all VM exits. Getting from resume to a CDP-ready browser took 9.8 seconds.
The fix was to map memory in larger chunks. Before, the VM restored memory in 4KB pages. Now, it uses 2MB pages. Each page covers 512 times more memory, so the browser triggers far fewer page faults while it wakes up. Fewer page faults mean fewer trips through the nested VM layers.
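The 512x figure is simple arithmetic, and it bounds how many first-touch faults a given amount of memory can cause. A minimal sketch follows; the 512 MiB working set is a made-up number for illustration, not a measurement from this post.

```python
KIB = 1024
MIB = 1024 * KIB

SMALL_PAGE = 4 * KIB
HUGE_PAGE = 2 * MIB


def max_first_touch_faults(working_set_bytes: int, page_size: int) -> int:
    """Upper bound on first-touch faults: one per page, rounding up."""
    return -(-working_set_bytes // page_size)


working_set = 512 * MIB  # hypothetical memory touched while the browser wakes up
small = max_first_touch_faults(working_set, SMALL_PAGE)
huge = max_first_touch_faults(working_set, HUGE_PAGE)

print(HUGE_PAGE // SMALL_PAGE)  # 512
print(small, huge)              # 131072 256
```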
We also now handle page faults ourselves with a custom handler for userfaultfd, a Linux API for handling missing memory pages. Before the VM starts running, our handler loads the memory Chromium is most likely to access first.
Our handler keeps Chromium from receiving a flood of page faults as it starts. The host has already loaded the hot pages, and the remaining pages arrive before the browser needs most of them.
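The effect of preloading hot pages can be shown with a toy simulation. This does not use the real userfaultfd API. It only counts how many first-touch misses remain when a set of pages is loaded up front. The access trace and hot set are invented for the example.

```python
def count_faults(access_order, prefetched):
    """Count first-touch misses for pages that were not loaded up front."""
    resident = set(prefetched)
    faults = 0
    for page in access_order:
        if page not in resident:
            faults += 1           # a fault handler would have to supply this page
            resident.add(page)
    return faults


# Hypothetical trace: page numbers in the order a starting browser touches them.
startup_trace = [7, 3, 7, 9, 3, 1, 12, 9, 7, 40]
# Hypothetical hot set, for example recorded from an earlier resume of the same snapshot.
hot_pages = [7, 3, 9, 1, 12]

lazy = count_faults(startup_trace, prefetched=[])
warmed = count_faults(startup_trace, prefetched=hot_pages)
print(lazy, warmed)  # 6 1
```

Only the page outside the hot set still faults, which is the behavior the handler aims for during startup.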
These changes cut the time from resuming the VM to having a browser ready to accept commands from 9.8 seconds to 3.1 seconds. They also cut the number of times the browser VM had to stop and ask the host to handle missing memory from roughly 100,000 times per resume to about 1,100, about a 91x drop.
We made smaller refinements, too. The VM was spending 500ms looking for an old PS/2 keyboard that didn’t exist. We disabled this check.
Additionally, we changed how the host waits for the browser to become ready. Before, the host kept polling the VM with HTTP requests. That created extra VM exits, or moments when the browser VM had to pause so the host could handle work for it.
Now, the browser driver writes a ready message to its log, and the host reads that log over vsock, a fast communication channel between the host and the VM. The host sees the ready message in under a millisecond.
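The difference from polling is that the host blocks on a stream and reacts when a line arrives, rather than repeatedly asking. Here is a minimal sketch of that pattern. The marker text is hypothetical, and an in-memory buffer stands in for the log stream; in practice the stream could be any file-like object, such as one returned by a connected socket's `makefile("rb")`.

```python
import io

READY_MARKER = b"BROWSER_READY"  # hypothetical; the post does not give the real message


def wait_for_ready(stream, marker: bytes = READY_MARKER) -> bool:
    """Read a log stream line by line and return True once the marker appears."""
    for line in stream:   # on a real socket, each iteration blocks until data arrives
        if marker in line:
            return True
    return False          # stream closed before the browser reported ready


log = io.BytesIO(b"starting chromium\nlistening on devtools port\nBROWSER_READY\n")
ready = wait_for_ready(log)
print(ready)  # True
```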
The second slowdown: Chromium startup
The next bottleneck was CPU.
When Chromium starts, it is hungry and demanding. It creates renderers, compositors, and V8 isolates at once. After that, browser automation is much quieter. An agent clicks, waits, reads, clicks again.
Because Chromium is quieter after it has started, we can pack many browsers into the same instance. A single host can accommodate many browsers because browsers spend most of their time waiting: waiting for a page, a network response, or the next agent action.
We handle the launch burst in two phases. While a browser resumes and Chromium starts, we leave its virtual CPUs unpinned. That means Linux can move the browser’s CPU work across the host instead of locking it to fixed cores. This spreads the burst out.
Once the browser reports that it’s ready, we pin those virtual CPUs to stable cores. That means the browser VM now runs on specific cores. Stable placement lets us pack more browsers onto the same host without guessing. We know which cores are taken, which ones still have room, and which browsers might interfere with each other.
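The two phases can be sketched as "do nothing during launch, then reserve and pin". The `CorePlanner` class, thread IDs, and core numbering below are assumptions for illustration. `os.sched_setaffinity` is the standard Linux-only Python call for pinning, and the demo passes `apply=False` so it runs without touching real threads.

```python
import os


class CorePlanner:
    """Tracks which host cores are reserved for pinned browser VMs."""

    def __init__(self, cores):
        self.free = sorted(cores)
        self.assigned = {}  # vm_id -> list of reserved cores

    def reserve(self, vm_id, count):
        if count > len(self.free):
            raise RuntimeError("no room on this host")
        self.assigned[vm_id] = [self.free.pop(0) for _ in range(count)]
        return self.assigned[vm_id]


def pin_when_ready(planner, vm_id, vcpu_thread_ids, apply=True):
    """Phase 2: call this only after the browser has reported ready.

    Phase 1 is implicit: until this runs, the vCPU threads stay unpinned
    and the scheduler is free to spread the launch burst across cores.
    """
    cores = planner.reserve(vm_id, len(vcpu_thread_ids))
    if apply:
        for tid, core in zip(vcpu_thread_ids, cores):
            os.sched_setaffinity(tid, {core})  # Linux-only
    return cores


planner = CorePlanner(range(8))
a = pin_when_ready(planner, "vm-1", [101, 102], apply=False)  # made-up thread IDs
b = pin_when_ready(planner, "vm-2", [201, 202], apply=False)
print(a, b)  # [0, 1] [2, 3]
```

Because the planner records every reservation, the host knows which cores are taken and which still have room.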
The launch phase is like letting a crowd enter through every open door. Once everyone is inside, assigned seats work better.
Pinning from the start made things worse. When many browsers launched at once, they piled onto the same hot cores, and some launches failed.
We also became careful about hyperthreads. A physical CPU core often appears as two logical CPUs, called sibling threads. Those siblings still share the same physical core. If two browser VMs each get one sibling, they fight over the same core. Under nesting, that contention showed up as failed launches. To prevent this, each browser now gets both sibling threads of the physical core it uses.
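On Linux, sibling threads are listed per logical CPU in sysfs, in `thread_siblings_list` files under `/sys/devices/system/cpu/cpuN/topology/`. Here is a minimal sketch of grouping logical CPUs by physical core so that a VM can be given both siblings. The sample values describe a hypothetical 4-core, 8-thread host, and the allocation at the end is illustrative only.

```python
def parse_cpu_list(text: str) -> tuple:
    """Parse a Linux CPU list such as "0,4" or "2-3" into a sorted tuple."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return tuple(sorted(cpus))


def physical_cores(sibling_lists):
    """Collapse per-CPU sibling lists into one entry per physical core."""
    return sorted({parse_cpu_list(text) for text in sibling_lists})


# Hypothetical thread_siblings_list contents for logical CPUs 0..7.
sysfs_values = ["0,4", "1,5", "2,6", "3,7", "0,4", "1,5", "2,6", "3,7"]
cores = physical_cores(sysfs_values)

# Each VM takes a whole physical core, so no two VMs share siblings.
vm_a, vm_b = cores[0], cores[1]
print(cores)       # [(0, 4), (1, 5), (2, 6), (3, 7)]
print(vm_a, vm_b)  # (0, 4) (1, 5)
```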
Finally, we give each pinned vCPU thread real-time priority. That tells Linux to run the browser VM immediately when it needs CPU, instead of pausing it behind less important work. Before this change, a 1,000-browser test lost 17% of sessions shortly after being created. After it, the same test lost zero.
Staying stealthy without a screen
The last bottleneck was stealth.
A headless browser runs without a visible window. A headful browser runs like the browser on your laptop, with a window, graphics, and rendered frames.
Websites with anti-bot measures can easily detect plain headless Chromium. In our stealth benchmark, it avoided getting blocked only 2% of the time. The same Chromium run headful, with a visible window, avoided blocks 50% of the time just by rendering content.
That is why most providers run headful browsers. They pay for a display server, a GPU, and a compositor drawing frames for a screen nobody looks at.
We run our browsers fully headlessly. This is only possible because we changed the browser itself.
The first component is our Chromium fork. Many stealth tools hide automation by injecting JavaScript into every page after the browser starts. For example, they overwrite browser properties like navigator.webdriver, a flag that tells websites whether the browser is being controlled by automation, so the page sees false instead of true. Websites can often detect when such values are overwritten. To avoid this, we patch Chromium at its lowest level, so our patches are never exposed in the first place.
The second component is our fingerprinting. A browser fingerprint consists of details a website reads about your browser and machine, including your operating system, screen size, fonts, graphics output, audio, timezone, language, and hundreds of other details. Systems that detect bots check if these details look like a real user’s browser or a fake automation environment. We use tens of thousands of real fingerprints across macOS, Windows, and Linux.
Our browsers avoid blocks 81% of the time on our stealth benchmark, and 84.8% on Halluminate BrowserBench, the highest of any provider. Because there is no display, browsers are cheaper to run and easier to scale.
Connecting to the right browser
Once a browser is ready, users connect to it through CDP. The public URL is a WebSocket URL.
In front of the browser fleet are simple edge routers. A router gets the WebSocket connection, asks the control plane where that browser lives, and forwards the raw CDP bytes to the right VM.
The routers do not decide where browsers run. If one dies, another router can take over new connections. The control plane is in charge of placement. The routers only move bytes.
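The router's job reduces to a lookup followed by byte forwarding. The sketch below covers only the lookup step. The URL shape, session IDs, addresses, and the in-memory `PLACEMENTS` table are all assumptions; in the real system that answer comes from the control plane.

```python
from urllib.parse import urlparse

# Hypothetical placement table standing in for a control-plane query.
PLACEMENTS = {
    "sess-123": ("10.0.1.17", 9222),
    "sess-456": ("10.0.2.40", 9222),
}


def session_id_from_url(ws_url: str) -> str:
    """Take the last path segment of the public WebSocket URL as the session id."""
    return urlparse(ws_url).path.rstrip("/").rsplit("/", 1)[-1]


def resolve_backend(ws_url: str, lookup=PLACEMENTS.get):
    """Return the (host, port) to forward bytes to, or None if unknown."""
    return lookup(session_id_from_url(ws_url))


backend = resolve_backend("wss://router.example/sessions/sess-123")
print(backend)  # ('10.0.1.17', 9222)
```

Because the router holds no placement state of its own, any router can answer the same lookup, which is why another one can take over new connections.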
The result
Each of your browser sessions consists of a tiny VM resumed from a snapshot, running inside regular EC2, with headless Chromium inside it.
The VM cold start is under 400ms. End to end, through the public API, browser create latency is 825ms at p50 and 1.35s at p99. We measured this during a 10,000-session stress test in which every browser started successfully.
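For readers unfamiliar with the notation, p50 is the median latency and p99 is the value that 99% of requests come in at or under. Here is a minimal sketch using the nearest-rank method. The sample latencies are made up for illustration and are not the measurements from the stress test.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up create latencies in milliseconds.
latencies_ms = [780, 805, 790, 825, 860, 810, 1350, 830, 795, 900]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(p50, p99)  # 810 1350
```

With only ten samples, p99 is simply the slowest request. A large test run gives the tail percentile real meaning.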
Browsers on this infrastructure are live today. Start with Browser Use Cloud.
The biggest remaining cost is Chromium itself. Starting Chromium after resume still takes about 545ms at p50.
Any further improvements, then, must come from the browser itself.
Next: skip Chromium startup
Today, we snapshot the VM just before Chromium starts. That keeps the snapshot simple: every browser wakes up from the same, clean point, then launches Chromium for itself.
But Chromium startup is now the largest remaining cost. The next step is to snapshot later, after Chromium is already running. Then, a new session does not have to start the browser at all. It wakes up with the browser already alive.
This is complex, as a running browser has open devices, timers, graphics state, network state, and fingerprint state. Before we freeze it, we need to put all of these things into a safe state. After we restore it, each browser still needs to look like its own browser, not a clone of the last one.
This is what we are working on next.
The fastest browser is the one you barely have to boot. We got the VM startup under 400ms by running Firecracker where it is not supposed to run. Next, we are making new sessions wake up with Chromium already running.
Browsers on the aforementioned infrastructure are live at cloud.browser-use.com.
If you visit 10HN only rarely, check out the best articles from the past week.
Visit pancik.com for more.