10 interesting stories served every morning and every evening.

Only 16 percent of Americans think AI will have a positive impact on society, a new study shows

techcrunch.com

Despite the fact that AI in­creas­ingly dom­i­nates our econ­omy (it’s a hot IPO sum­mer and we’re all just along for the ride), most Americans are not par­tic­u­larly op­ti­mistic about the tech­nol­o­gy’s long-term im­pact on the coun­try, a new study from Pew Research re­veals.

In fact, al­though a whole lot of Americans in­creas­ingly use AI in their daily lives, most of them have neu­tral to neg­a­tive views about it, the re­search re­veals.

Only 16% of Americans think that AI’s impact on society during the next 20 years will be positive, Pew says, while around 40% say that it will have a negative impact.

A vast ma­jor­ity of peo­ple (67%) don’t be­lieve that the U.S. gov­ern­ment will do any­thing to mean­ing­fully reg­u­late AI. A sim­i­larly skep­ti­cal co­hort (59%) don’t trust com­pa­nies to de­velop it safely.

Young peo­ple — that is, those peo­ple un­der 30 — are the ones with the most neg­a­tive feel­ings about AI. Pew says that only 14% of this co­hort be­lieve the tech will have a pos­i­tive im­pact on so­ci­ety.

On top of all this, nearly two-thirds of Americans also think that AI’s development is occurring too quickly.

Despite all of the skep­ti­cism, a whole lot of Americans also re­port us­ing AI in their daily lives on an in­creas­ingly reg­u­lar ba­sis. About a quar­ter of Americans say they use AI chat­bots on a daily ba­sis. Those who do are typ­i­cally us­ing the chat­bots for re­search pur­poses or for work, Pew says.

A vast ma­jor­ity of peo­ple us­ing AI are us­ing ChatGPT. Pew writes that 44% of U.S. adults now say they use OpenAI’s chat­bot, a fig­ure that’s more than dou­bled since 2023.

The next most pop­u­lar chat­bot is Gemini (24%), fol­lowed by Copilot (17%) and Meta AI (14%), with Grok (8%), Claude (6%), and Character.ai (3%) lag­ging be­hind.

There is a bit of a gen­der di­vide. While chat­bot use is grow­ing for both men and women, men still use AI more and are more en­thu­si­as­tic about it, while women are more skep­ti­cal, Pew says. Men are more likely to say they use AI chat­bots in their daily lives (27% ver­sus 20% for women) and while equal shares of men and women re­port us­ing ChatGPT, men more com­monly re­port us­age of other brands, such as Copilot and Grok.

The re­port also high­lights how AI is chang­ing the ways Americans con­sume in­for­ma­tion. Six in 10 sur­vey re­spon­dents told Pew that they rou­tinely read AI-generated in­ter­net sum­maries (indeed, on Google, they’re pretty much un­avoid­able). A much smaller num­ber re­port us­ing AI to get in­for­ma­tion on fit­ness and di­et­ing.

There are also still a whole lot of people, about half of the country, who say they do not use AI in their daily lives. The people who do not use AI tend to be older, while those under 50 are more likely to say that they use it. Nearly 75% of Americans aged 65 or older say that they never use AI chatbots.

Those peo­ple who don’t use chat­bots say they don’t be­cause they’re not in­ter­ested in them, and add that they have no in­ten­tion of us­ing them in the fu­ture.

Lucas is a se­nior writer at TechCrunch, where he cov­ers ar­ti­fi­cial in­tel­li­gence, con­sumer tech, and star­tups. He pre­vi­ously cov­ered AI and cy­ber­se­cu­rity at Gizmodo.

You can con­tact Lucas by email­ing lu­cas.ropek@techcrunch.com.

Local Qwen isn't a worse Opus, it's a different tool

blog.alexellis.io

We’ve all heard people say that local Qwen 27B or 35-A3B is “near-Opus level”, but I have receipts from a software business and open source projects, and am here to be transparent with you.

This post is long-form for a rea­son. It’s not a cur­sory glance, an un­sub­stan­ti­ated claim on X about can­celling Claude Max, or a hob­by­ist re­port from a model run­ning at sin­gle-digit to­kens per sec­ond with a 32K con­text win­dow. It is­n’t writ­ten by a fa­mous CEO tweet­ing about cod­ing from an air­plane. It’s my jour­ney as a founder in a small soft­ware busi­ness, where lo­cal mod­els have pro­duced real, caveated value. I have skin in the game, but no in­cen­tive to push ei­ther cloud or lo­cal mod­els, and a strong de­sire for lo­cal mod­els to be­come ca­pa­ble and re­li­able.

I’ll cover how the card paid for it­self in the first two or three months, how it keeps serv­ing our spe­cific busi­ness use case, why I still can’t trust it un­su­per­vised, and Qwen’s worst trait: the in­fi­nite loops and hal­lu­ci­na­tion risk. These show up most when you quan­tize it down to fit a con­sumer GPU.

Figuring out the power con­nec­tors for the RTX 6000 Pro

On my use case for AI

My journey as a maintainer and founder started with OpenFaaS - built completely by hand, as was all software in 2016 up until recently. That meant laying down the core of the project on my own, then inviting others to participate through community - not because I couldn’t do it on my own, but because my goal was to build a successful open source project. Around 2017 I tried to fund my time by joining VMware, and in 2019 after changes in the market, I needed a way to fund the work myself, so moved towards open-core and built a bootstrapped company. Today our small team maintains OpenFaaS, SlicerVM - “AI sandboxes and the missing API for Linux”, Actuated.com - self-hosted CI runners for GitHub/GitLab, and Inlets.com - self-hosted HTTP/TCP tunnels.

These prod­ucts use very low level Linux prim­i­tives like con­tain­ers, Kubernetes, Firecracker mi­croVMs, and net­worked pro­to­cols. If you squint, they’re all opin­ion­ated in­fra­struc­ture prod­ucts fo­cused on: ef­fi­ciency, user-ex­pe­ri­ence, con­trol and au­ton­omy. They’re writ­ten in Go, and some have React-based UI com­po­nents, land­ing pages, docs, agent skills, and CLIs. Along with the code, we also pro­vide the best-in-class sup­port, be­cause we are lean and will­ing to do things that don’t scale to help cus­tomers.

I’ve been using AI tools for as long as they’ve been available - from tab completion in VS Code in the early days, through to getting ChatGPT to generate chunks of code, or find bugs, to living in tmux 12 hours per day. I found myself in tmux so much of the time that I wrote a free tool, Superterm.dev, to keep track of my sessions, notes, and to get visual feedback from coding agents. Over that time, I’ve seen the capabilities go from “reduce boilerplate” to “design, architect, and test end to end”. It’s Claude or Codex that do the majority of my work, and whilst I insist on doing my own writing, I rarely write code by hand - as much as it pains me to say that.

A turn­ing point for fron­tier in­tel­li­gence

I’d say it was roughly between November 2025 and January 2026 that we saw a turning point. Many developers on X started to say that Claude Opus had changed and was now capable of doing all of their work. Manual coding turned bad as quickly as milk sours when left out of the fridge. The costs of the top-end coding plans settled at roughly 200 USD / mo for individuals. A real number, but tolerable for the value they generated. Even today, if you avoid too much unattended work, you can make it last through the 5 hour limit, and weekly limit if you’re careful.

What makes lo­cal mod­els in­ter­est­ing

There’s an argument that says: “Why use anything less than the best you can afford?”

The year of 2026 certainly is a new frontier: we find ourselves in a place where any idea can be cloned overnight by someone you’ve never heard of with a subscription in a developing nation. I’ve seen it happen to our SlicerVM product (originally written by hand in 2022) and Superterm (new in 2026, 100% written by coding agents). It’s not to say that a vibecoded clone is a 100% equivalent of a well engineered and architected solution with an experienced team supporting it, but in a market where the cost of software went to nil, free and good enough can be all that matters.

So in such a com­pet­i­tive land­scape, why limit your­self to some­thing that’s worse? Isn’t that an op­por­tu­nity cost? Isn’t that risk­ing your liveli­hood?

There are estimates that the leading models contain between 0.5 and 2T parameters. That’s not just “marginally more” or “a few times more” than the best in class for local hardware - that’s on a different level. The parameter count is a rough proxy for capacity, knowledge, and reasoning ability. Yet somehow, even a tiny dense model like Qwen 3.6 27B is able to post a reputable score of 77.2% on SWE-Bench Verified vs 88.6% from Claude Opus 4.8.

So you could be forgiven for taking to X and shouting loudly that “local is only 12% behind SOTA”, and many have, including with engaging one-shotted demos of Space Invaders. You may go as far as claiming that a single 6-year-old GPU can replace your 200 USD / mo ChatGPT Pro subscription, and indeed many have made that claim.
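To see what a headline gap like this does and does not say, here is a minimal Python sketch that expresses the two SWE-Bench Verified scores quoted above in three ways. The function and key names are illustrative only, and reading the gap as a ratio of unsolved tasks is one possible framing, not a claim made by the benchmark authors.

```python
# Scores quoted above for SWE-Bench Verified, in percent.
local_score = 77.2      # Qwen 3.6 27B
frontier_score = 88.6   # Claude Opus 4.8


def gap_summary(local: float, frontier: float) -> dict:
    """Express the same benchmark gap three different ways."""
    points = frontier - local                         # absolute gap in percentage points
    relative = points / frontier * 100                # shortfall relative to the frontier score
    failure_ratio = (100 - local) / (100 - frontier)  # unsolved tasks, local vs frontier
    return {
        "percentage_points": round(points, 1),
        "relative_shortfall_pct": round(relative, 1),
        "failure_ratio": round(failure_ratio, 2),
    }


summary = gap_summary(local_score, frontier_score)
print(summary)
```

The relative shortfall lands close to the "12% behind" figure, but the same numbers also mean the local model leaves about twice as many tasks unsolved, which is closer to how the difference feels in daily use.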

Benchmaxxing

Benchmarks are a moving target, and since they’re widely available, it’s possible to educate and tune a model to obtain a higher score than it would otherwise on these tests. The classic SWE-Bench Verified benchmark is based upon a set of Python issues across a number of Open Source projects. Python has threads, and async, however most code you run into is single-threaded and synchronous. In contrast, we write distributed systems in Go, where channels, contexts, and structs span across a large execution domain.

Cost

There’s a very popular take that “local models aren’t about cost”, and that comes from a position of privilege. Individuals can use coding plans that provide high amounts of usage through a working day for 200 USD / mo. On that basis, you are getting SOTA level intelligence, the best chance of something working and being of quality, of finding that bug, or generating that landing page.

Coding plans are clearly sub­sidised, just look at what hap­pened to GitHub Copilot plans. They started off by giv­ing away 1500 re­quests for 39 USD / mo and you could make that last a very long time for pen­nies. Something that was undis­closed changed at GitHub/Microsoft/Azure, and they moved every­one over to to­ken-based pric­ing and the back­lash was huge. The true cost had been hid­den for so long, we’d be­come ac­cus­tomed to it.

Now, if you’re paying for tokens on API rates, the breaking point comes sooner than many of us realise. Recently, Uber capped spend to 1500 USD / mo per developer per tool. The median salary at Uber is 330k USD annually, so if a developer used two tools to the maximum extent, it’s roughly 11% of their annual compensation.
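The arithmetic behind that share can be checked in a few lines. This is a small sketch using only the figures quoted above (the cap, two tools, and the median salary); the variable names are illustrative.

```python
# Figures quoted in the paragraph above.
cap_per_tool_usd_month = 1500
tools = 2
median_salary_usd_year = 330_000

# Twelve months of spending up to the cap on every tool.
annual_spend = cap_per_tool_usd_month * tools * 12
share = annual_spend / median_salary_usd_year

print(f"{annual_spend} USD/year is {share:.1%} of the median salary")
```

Two capped tools come to 36,000 USD per year, a little under 11% of the quoted median salary.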

So for heavy use (loops, agentic analysis, in-product capabilities deployed through SaaS systems), open-weight or local models can provide serious value. It’s not fair to rule out cost, but for many it’s not about that.

Sovereignty and pri­vacy

We work with various enterprise customers that take data controls very seriously. If you squint at our product line, we’re all about privacy and sovereignty. OpenFaaS runs functions on your infrastructure, with your limits and preferred languages, and events. SlicerVM runs microVMs not on some abstracted cloud-based bare-metal, but on your own kit, even your MacBook. Inlets runs tunnels where you can control the tunnel client and server with 100% privacy. Actuated takes the arduous parts of GitHub Actions away and says “install an agent on your machines and forget about it”.

So naturally, we are drawn to local models - both through our core values and beliefs about how the Internet should be, and through our obligations.

You may not hold these be­liefs, you may not han­dle any cus­tomer data, but if you live out­side of the US, the re­moval of Anthropic’s Fable 5 model overnight might have come as a shock. In other words, there is se­ri­ous ven­dor risk, and many of us are ad­dicted to the source.

Local models are the solution to “What if the frontier labs do X?”

Tempering the blade

I said that lo­cal mod­els are not the same tool as SOTA. What did I mean by that?

I build furniture using hand tools, and occasionally, just as I’ll release an open source project to scratch an itch, I’ll make an edge tool: a chisel, a grooving plane blade, a scratch awl, or a Sloyd knife for carving.

Tempering a Japanese style mark­ing knife on the back of a heated file, un­til it hits straw colour.

There are two ways to work with steel depending on how much you can invest. Forging is taking a raw piece of steel, heating it up and smashing it with a hammer into the form you need. It’s seen as the most pure and honourable way to work - “the real way”. Then for smaller items, “stock removal” is much more approachable. It involves taking sheet steel, cutting out a shape and grinding in a bevel or a point.

But that’s just the shap­ing. You then have to heat the steel up, and quench it in oil or wa­ter. This makes the steel be­come ex­tremely hard, so hard that if you dropped it - it would shat­ter into pieces. So we have to scrub off the black scum, and heat it up again, watch­ing for a rain­bow of colours. If we go one shade past where we need, we have to start the heat treat­ing all over again.

Our team’s ex­pe­ri­ence of lo­cal mod­els is ex­actly like miss­ing the tem­per colours. The model is run­ning so hot, that it shoots past the goal and starts loop­ing. Nothing can fix it, other than clos­ing down the har­ness and hop­ing the cleared con­text will give a dif­fer­ent re­sult.

I’d never leave a blade tem­per­ing un­at­tended, just like I’d never leave Qwen 3.6 27B work­ing on a long hori­zon task. For steel the workaround is us­ing a kiln, or tem­per­a­ture con­trolled oven to re­move vari­abil­ity.

That Sloyd knife we forged could be used to knock in nails, but you’re likely to cut your hands and ruin the edge at the same time. Let’s go back to the start, if it’s a dif­fer­ent tool, what is it good for?

What I was look­ing for

I was look­ing for all of the things we cov­ered in the pre­vi­ous sec­tion: pri­vacy, fixed costs and pro­tec­tion against ven­dor risk. Where I got and con­tinue to get let down is where I treat a lo­cal model in­side open­code in the same way I treat Claude or Codex. It’s al­most creepy how long they can work fully un­at­tended whilst mak­ing real progress to­wards a goal.

I can paste in something like: “Eoin told me he has been running Slicer VMs in a loop and ran out of FDs. He suspects VSock” and then after a couple of minutes Claude replies “Now I see the full picture: You’re doing X, you need to do Y”. I say “do it and test it end to end on my mini PC” and after any period of time - 5 or 15 minutes, I can raise a PR, have it code reviewed automatically, and then tell Claude to read it and iterate again.

It’s a won­der­fully ef­fi­cient loop for a small team like us that man­ages mul­ti­ple prod­ucts and works very closely with en­ter­prise and com­mu­nity users.

Sharp lessons from a 3090

I started off with a sin­gle 3090 card in 2023, and quickly re­alised I needed an­other to be able to load mod­els and have suf­fi­cient con­text. Nothing about lo­cal mod­els from 2023 is worth cov­er­ing here, other than they were so hard to use that I gave up on them. Qwen 3.5 was the first time I saw real work be­ing done by agents.

I could load a model into either card in Q4 quantization with 200k context (also quantized) and get it to do small tasks, when guided. I still remember how quickly that went south. I told the model “Explore this machine from every angle, complete a forensic report on the machine and how it’s used” - Claude would have shrugged that off. Qwen started reading every single file on my machine one by one, filled its context, then hallucinated the filenames and even tool calls: ~/faas-netes became ~/faaned. Stepping back, I was able to get a really lucid report by scoping the task to “Take a quick look around this machine, tell me who uses it and what for”, and that ran at roughly 40-50 tokens per second (generation).

A 27B model sim­ply does­n’t fit at full fi­delity into 1x 3090 card, so the knobs and di­als are: com­pres­sion level of the mod­el’s weights (quantization), length of the con­text, and com­pres­sion level of the keys and val­ues of the con­text.

There’s a well known rule of thumb that bad things start hap­pen­ing at Q4_0 on the keys part of the KV cache. The most ag­gres­sive I’ve ever been is Q8_0 for keys and Q4_0 for val­ues.
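To make the trade-off between context length and cache quantization concrete, here is a rough Python sketch of how much memory the KV cache needs. It assumes a standard transformer in which every layer keeps a full key and value cache; hybrid architectures would need less. The layer and head counts below are HYPOTHETICAL placeholders, not the real dimensions of any Qwen model. The bits-per-element values follow the llama.cpp block formats: f16 uses 16 bits, Q8_0 packs 32 values into 34 bytes, and Q4_0 packs 32 values into 18 bytes.

```python
# Effective bits per cached element for each llama.cpp cache type.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, context, type_k, type_v):
    """Approximate KV cache size in GiB for one sequence of `context` tokens."""
    # Keys and values each store this many elements.
    elements_per_side = n_layers * n_kv_heads * head_dim * context
    total_bits = elements_per_side * (BITS_PER_ELEMENT[type_k] + BITS_PER_ELEMENT[type_v])
    return total_bits / 8 / 2**30


# HYPOTHETICAL architecture, chosen only to illustrate the scaling.
shape = dict(n_layers=64, n_kv_heads=8, head_dim=128, context=200_000)

full = kv_cache_gib(**shape, type_k="f16", type_v="f16")
mixed = kv_cache_gib(**shape, type_k="q8_0", type_v="q4_0")
print(f"f16/f16: {full:.1f} GiB, q8_0/q4_0: {mixed:.1f} GiB")
```

Whatever the real dimensions are, the ratio holds: Q8_0 keys with Q4_0 values take 13/32 of the memory of an f16 cache, which is why cache quantization is so tempting on a 24GB card.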

The 3090s were a con­stant source of headaches - I had to quan­tize well be­low where I was com­fort­able. One of the cards would only show up if I crossed my fin­gers when turn­ing it on. Even re­boots would­n’t cure it - I had to A/C power off and re­move the power ca­ble each time for 30 sec­onds.

My lat­est ex­per­i­ment was set­ting up vLLM (the gold stan­dard for pro­duc­tion and con­cur­rent serv­ing) and even with an NVLink (175GBP) and ten­sor par­al­lelism turned on, it was 3 to­kens/​sec­ond slower than llama.cpp dur­ing gen­er­a­tion for an equiv­a­lent setup.

I was spending more time making them work than I was getting back in results.

Big spender

We of­fer sup­port con­tracts to en­ter­prise com­pa­nies us­ing our prod­ucts, and when a ticket comes in we are in­cen­tivised to re­solve it as soon as rea­son­ably pos­si­ble. I thought that get­ting a card that would make all the nig­gles go away would fix lo­cal mod­els, and cus­tomer sup­port was worth the risk.

We dropped around 12000 USD on an RTX 6000 Pro Blackwell edi­tion with 96GB of VRAM. Even a cou­ple of months on, the price has in­creased to around 15400 USD so adding a sec­ond be­comes much harder to jus­tify. You can’t just slot an­other card in” to a con­sumer ma­chine. There are many con­cerns from PCI lanes, to band­width, to card spac­ing, and the draw on the PSU.

It was a cal­cu­lated bet, and it has paid off, but not be­cause it re­places our Claude sub­scrip­tions - it can’t do that.

Painless cus­tomer sup­port, with­out leak­ing cus­tomer data

Many op­er­a­tors at en­ter­prise com­pa­nies are highly ca­pa­ble and skilled, but they’re held back by man­ual pro­ce­dures and prac­tices. Sometimes you’re lucky and some­one will work through every point in a trou­bleshoot­ing guide and tell you what they got wrong. Other times, you’re 150 replies deep into an email chain and they’ve still not run that one com­mand that would an­swer it all.

So we wrote “diag”, a CLI tool that is easy for operators to run and that captures a complete snapshot of an OpenFaaS installation on Kubernetes. They can then email this dump to us and we can run it through an airgapped local model, in an ephemeral VM created by Slicer. You can read more about the issues we found in “Introducing: Painless support and hands-off architecture reviews” over on the OpenFaaS blog.

Revenue re­cov­ery

A renewal came up recently, and only because I fed the telemetry database into a local model did we find out they’d been under-reporting licenses and under-paying by about 4-5x for over 12 months. That revenue recovery alone paid for the card.

There’s no way I would have in good conscience run the telemetry dump or a customer’s diag output through any cloud plan, regardless of their stance on data retention. This is a good time for me to cover near- and far-east coding plans - caveat emptor - I’m yet to find one that doesn’t take a privileged position on your IP - training and ownership rights for inputs and outputs. ChatGPT Pro and Claude Max can be configured for a 30 day retention period, but even that level likely invalidates your contracts with customers.

Sometimes I’ve given GPT or Opus the schema for the teleme­try table and had it write an AGENTS.md that the lo­cal model is most likely to fol­low. Our data is re­ported sev­eral times per day, from mul­ti­ple high-avail­abil­ity repli­cas, so it can’t just be summed up across a 24 hour pe­riod. With ear­lier it­er­a­tions of the model, I saw it fail at arith­metic - 27.3K counted as 273,000. It was only be­cause I was thor­oughly check­ing its work that I caught it out.
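To illustrate why repeated reports cannot simply be summed, here is a toy Python sketch. The row layout (day, replica, reported count) and the sample values are HYPOTHETICAL, since the real telemetry schema is not shown here, and taking the daily peak is only one plausible way to de-duplicate, not necessarily the rule the real analysis uses.

```python
from collections import defaultdict

# HYPOTHETICAL telemetry rows: (day, replica, reported_count).
# Each replica reports the same installation several times per day.
reports = [
    ("2026-06-01", "replica-a", 40),
    ("2026-06-01", "replica-b", 40),
    ("2026-06-01", "replica-a", 42),
    ("2026-06-02", "replica-a", 41),
    ("2026-06-02", "replica-b", 41),
]


def naive_daily_sum(rows):
    """Wrong: adds up every report, so replicas and repeats inflate the total."""
    totals = defaultdict(int)
    for day, _replica, count in rows:
        totals[day] += count
    return dict(totals)


def daily_peak(rows):
    """One possible fix: keep the highest value reported on each day."""
    peaks = {}
    for day, _replica, count in rows:
        peaks[day] = max(peaks.get(day, 0), count)
    return peaks


print(naive_daily_sum(reports))
print(daily_peak(reports))
```

Spelling the aggregation rule out like this, in an AGENTS.md or in code the model is told to run, leaves less room for the model to improvise its own arithmetic.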

Another time, the model in­ferred a cus­tomer was likely to churn be­cause they had a small num­ber of func­tions. It com­pletely ig­nored that the cus­tomer ran that smaller num­ber of func­tions many times per day. So of­ten it’s bet­ter to have them fo­cus on analy­sis, not in­ter­pre­ta­tion.

Our cur­rent setup

I’m a big sup­porter of folks like Jack Rong and Kyle Hessling who have worked on fine-tunes of open weight mod­els like Qwen. Qwopus at­tempts to layer Chain of Thought traces on top of Qwen to make it bet­ter at rea­son­ing and cod­ing. They do this to help the com­mu­nity and be­cause of a deep be­lief in lo­cal AI.

In our team we run both the latest generation of Qwopus, and the base 27B Qwen 3.6 model on the RTX 6000 rig. Over time this changes - as new finetunes come out, as new point releases of Qwen drop and as we land upon new edge-cases and limitations. Up until very recently, we ran with thinking turned off completely, and have only recently added it back in, which coincided with seeing more looping.

The models are served by two independent llama.cpp instances, which means they retain full context length. The default answer to “concurrency” is to run --parallel 2 but this halves the available context.

$ nvidia-smi
Wed Jun 17 11:56:03 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac…      Off |   00000000:01:00.0 Off |                  Off |
| 30%   32C    P8             15W /  600W |   85937MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2265      C   …ma.cpp/build/bin/llama-server        31198MiB |
|    0   N/A  N/A            2544      C   …ma.cpp/build/bin/llama-server        54718MiB |
+-----------------------------------------------------------------------------------------+

llama.cpp is built from source and kept up to date weekly, or as re­quired. The build from source is re­quired in or­der to add sup­port for Nvidia GPUs.

Here’s our com­mand for a sin­gle in­stance of Qwen with full con­text length and full qual­ity con­text.

#!/bin/bash
~/llama.cpp/build/bin/llama-server \
  -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q8_K_XL \
  --alias Qwen3.6-27B-Base \
  --host 0.0.0.0 \
  --port 8085 \
  -ngl 99 \
  -c 262144 \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --flash-attn on \
  --parallel 1 \
  --threads 16 \
  -b 4096 \
  -ub 2048 \
  --jinja \
  --reasoning-budget 2048 \
  --temperature 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 1.1 \
  --reasoning on \
  --spec-type draft-mtp \
  --spec-draft-n-max 6 \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --chat-template-file chat_template.jinja \
  --reasoning-budget-message "reasoning budget consumed, time to answer now"

We get about a 93% acceptance rate on our speculative decoding from MTP, and the speed increases from a stable 67 tok/s to 130-200 tok/s sustained over long periods. It feels faster than using a cloud model.
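For a sense of how acceptance rate relates to speed, here is a small Python sketch of the usual idealised estimate for speculative decoding: with a per-token acceptance probability and a fixed draft length, the expected number of tokens produced per pass of the main model is a geometric sum. This assumes each drafted token is accepted independently and ignores the cost of drafting; whether the 93% figure reported by llama.cpp is a per-token rate in this sense is also an assumption. The draft length of 6 is taken from --spec-draft-n-max in the command above.

```python
def expected_tokens_per_pass(acceptance: float, draft_len: int) -> float:
    """Idealised tokens emitted per main-model pass, assuming independent acceptance."""
    if acceptance >= 1.0:
        # Every drafted token is accepted, plus the one the main model adds.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1 - acceptance ** (draft_len + 1)) / (1 - acceptance)


ideal = expected_tokens_per_pass(0.93, 6)

# Speed-up actually reported above: 67 tok/s rising to 130-200 tok/s.
observed_low, observed_high = 130 / 67, 200 / 67

print(f"idealised tokens per pass: {ideal:.2f}")
print(f"observed speed-up: {observed_low:.1f}x to {observed_high:.1f}x")
```

The idealised figure is an upper bound on tokens per pass, not a predicted speed-up. The measured gain of roughly 2x to 3x is lower, which is what you would expect once drafting and verification overhead are paid for.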

It’s important to follow the instructions from the model card when tuning llama.cpp. There are often reasons why a certain temperature has been selected by the lab. For instance, the Qwopus fine-tune works best with thinking turned off and the temperature really hot at 0.85-1.0.

About that loop­ing

Recently I’ve been tuning it to try to avoid looping, which goes back to that tempering analogy. You can’t just leave this model to work on long horizon tasks.

I asked Qwen what com­mands we should add to faas-cli, and it came back with some rea­son­able sug­ges­tions, but got stuck and kept re­peat­ing them over and over, burn­ing 600W of my elec­tric­ity for a good half an hour.

58. faas-cli function import - Import functions from a YAML file or URL.
59. faas-cli function export - Export deployed functions back to a stack.yaml file.
60. faas-cli function scale - Manually scale function replicas without redeploying.
61. faas-cli function rename - Rename a function in-place.
62. faas-cli function diff - Compare local stack.yaml with what’s deployed - show differences.

63. faas-cli function import - Import functions from a YAML file or URL.
64. faas-cli function export - Export deployed functions back to a stack.yaml file.
65. faas-cli function scale - Manually scale function replicas without redeploying.
66. faas-cli function rename - Rename a function in-place.
67. faas-cli function diff - Compare local stack.yaml with what’s deployed - show differences.

68. faas-cli function import - Import functions from a YAML file or URL.
69. faas-cli function export - Export deployed functions back to a stack.yaml file.
70. faas-cli function scale - Manually scale function replicas without redeploying.
71. faas-cli function rename - Rename a function in-place.
72. faas-cli function diff - Compare local stack.yaml with what’s deployed - show differences.

Build · Qwen3.6-27B-Base toilgate

The same thing happened when I asked it to “add --json to all get and list commands” - it was convincing for the first one or two and even wrote tests.

Then because --json is machine readable, faas-cli needed to stop printing warnings about insecure TLS when using a http:// remote endpoint. Qwen couldn’t work out how to do this so I told it to write a reverse proxy in Python and call that instead. The first version looked plausible but had bad indenting. When it realised the issue, it corrupted the file, and kept complaining that it didn’t know how to fix it and was stuck in a different kind of loop. It just wouldn’t give up, but went progressively off the rails.

Han from my team has re­ported very sim­i­lar loop­ing - mostly the sec­ond kind. The model or agent is stuck, at the edge of its abil­ity and won’t ask for help. For me, I’ve mainly hit the for­mer, which is ar­guably worse and means I rarely trust it be­yond the teleme­try and diag work for cus­tomer sup­port/​re­newals.
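The first kind of loop, where the same block of lines repeats under fresh numbering, is regular enough to catch mechanically. Here is a minimal Python sketch of a detector that a harness or proxy could run over streamed output. It is an illustration only, not something described as part of the setup above; the function names and thresholds are made up for the example.

```python
import re

# Matches list numbering such as "58. " or "12) " at the start of a line.
_NUMBER_PREFIX = re.compile(r"^\s*\d+[.)]\s*")


def normalise(line: str) -> str:
    """Drop list numbering so '58. foo' and '63. foo' compare as equal."""
    return _NUMBER_PREFIX.sub("", line).strip()


def find_repeating_block(lines, min_block=2, min_repeats=3):
    """Return the block size if the output ends in one block repeated, else None."""
    cleaned = [normalise(line) for line in lines if line.strip()]
    for size in range(min_block, len(cleaned) // min_repeats + 1):
        tail = cleaned[-size * min_repeats:]
        block = tail[:size]
        # The tail must consist of `block` repeated end to end.
        if all(tail[i] == block[i % size] for i in range(len(tail))):
            return size
    return None


# Shortened version of the looping output shown earlier.
commands = ["function import", "function export", "function scale",
            "function rename", "function diff"]
looping_output = [f"{58 + i}. faas-cli {commands[i % 5]}" for i in range(15)]

print(find_repeating_block(looping_output))
```

A check like this only catches the first kind of loop. The second kind, where the model keeps trying different broken fixes, produces different text each time and needs a budget on steps or wall-clock time instead.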

Measuring and dis­trib­ut­ing ac­cess

To be­gin with, I set up a sin­gle in­lets tun­nel and hoped the agents would­n’t clash. Two agents hit­ting the same llama.cpp in­stance with un­re­lated con­texts means each re­quest in­val­i­dates the oth­er’s cached pre­fix — so the full prompt gets re-processed from scratch every time, a thrash­ing la­tency you don’t want to feel of­ten. We were still do­ing most work on cod­ing plans then, so it was­n’t yet a real prob­lem.
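The thrashing described above can be shown with a toy model in Python. It treats the server as holding exactly one cached prompt and only reusing the longest shared prefix, which is a deliberate simplification: real llama.cpp slot and cache handling is more involved, and the token lists and sizes here are made up for illustration.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two prompts share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tokens_reprocessed(requests):
    """Prompt tokens that must be processed when only one prompt stays cached."""
    cached = []
    total = 0
    for prompt in requests:
        total += len(prompt) - common_prefix_len(cached, prompt)
        cached = prompt
    return total


def conversation(agent, turns, base=100, growth=10):
    """Prompts for one agent: each turn extends the previous prompt."""
    return [[(agent, i) for i in range(base + growth * t)] for t in range(turns)]


agent_a = conversation("a", 3)
agent_b = conversation("b", 3)

# Both agents share one instance and their requests alternate.
interleaved = [p for pair in zip(agent_a, agent_b) for p in pair]
shared_instance = tokens_reprocessed(interleaved)

# Each agent gets its own instance, so its own prefix stays cached.
separate_instances = tokens_reprocessed(agent_a) + tokens_reprocessed(agent_b)

print(shared_instance, separate_instances)
```

In this toy setup the shared instance re-processes every prompt in full, while separate instances only process the new tokens after the first turn.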

Distributing that setup was sim­ple: edit open­code.json and add the URL and to­ken, then copy that file onto your var­i­ous ma­chines or Slicer VMs.

But as soon as an­other per­son uses the model, it stops be­ing a pro­to­type. Who’s on which llama.cpp in­stance? How much are they us­ing? Which model? What has that cost us in elec­tric­ity? What hap­pens if that per­son leaves the team? How do we add in an­other model for the team?

Toilgate is 100% vibe-coded and too much work to open source. If you like the idea, feel free to make your own.

Rather than man­u­ally edit­ing my open­code.json file, and send­ing that to var­i­ous team mates, I de­cided to write a provider for open­code. It would man­age the avail­able mod­els from the sta­ble base through to more ex­per­i­men­tal Qwopus vari­ants that were quan­tized. Just run open­code - go to the model picker and se­lect toil­gate then what­ever you want to use.

Two Shelly Plus Plugs are monitoring the power consumption at the wall to give me a better idea of actual costs. The RTX 6000 Pro will pull 600W during inference and is relatively quiet; the two 3090s are closer to 750W combined and extremely noisy.
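Wall-power readings translate into running costs with simple arithmetic. The Python sketch below uses the 600W draw, the half-hour loop and the 67 tok/s baseline mentioned in this post; the electricity tariff is a HYPOTHETICAL placeholder, so substitute your own rate. It also assumes the card sits at its full 600W for the whole period, which overstates the cost whenever it does not.

```python
def energy_kwh(watts: float, hours: float) -> float:
    """Energy used at a constant power draw."""
    return watts * hours / 1000


def tokens_per_kwh(tokens_per_second: float, watts: float) -> float:
    """Tokens generated for each kWh consumed at a constant draw."""
    return tokens_per_second * 3600 / (watts / 1000)


PRICE_PER_KWH = 0.30  # HYPOTHETICAL tariff, in your local currency

# The half-hour loop described earlier, at a 600W draw.
loop_energy = energy_kwh(600, 0.5)
loop_cost = loop_energy * PRICE_PER_KWH

# Electricity per million generated tokens at the 67 tok/s baseline.
cost_per_million = 1_000_000 / tokens_per_kwh(67, 600) * PRICE_PER_KWH

print(f"loop: {loop_energy:.2f} kWh, cost {loop_cost:.2f}")
print(f"electricity per million tokens: {cost_per_million:.2f}")
```

This covers electricity only. The purchase price of the card and the time spent keeping the setup running are separate costs.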

The wrong com­par­i­son

The trap once you can mea­sure is com­par­ing the in­put/​out­put costs per mil­lion to­kens to OpenAI’s API pric­ing for GPT-5.5. That’s the wrong com­par­i­son for the cur­rent ca­pa­bil­ity. It’s more about un­der­stand­ing the on­go­ing costs, which I’m bear­ing per­son­ally since the ma­chine is in my house, for work that’s not suit­able for a cloud model.

This is where local AI turns into an op­er­a­tions prob­lem. You need iden­tity, ac­cess con­trol, me­ter­ing, quo­tas, model rout­ing and power mon­i­tor­ing. The harder part we keep com­ing back to is the re­li­a­bil­ity of the agent/​model com­bi­na­tion, keep­ing up with in­no­va­tions like MTP, and en­sur­ing enough up­time for peo­ple who have started to de­pend on the model be­ing avail­able.

Wrapping up

Whilst local Qwen is not “near Opus levels”, and I hope I’ve demonstrated that enough in the post, it is of value for certain tasks and workflows. It’s also incredibly early, and it can only get better from here. Qwen 3.5 was probably the first model that gave us results we could use. There are rumours of 3.7 coming out soon, which I’d expect to be an iterative improvement - not a revolutionary one.

Leaked financial docs show OpenAI is losing billions of dollars a year

arstechnica.com

As OpenAI files SEC pa­per­work ahead of an ex­pected ini­tial pub­lic stock of­fer­ing, newly leaked fi­nan­cial doc­u­ments show a com­pany with quickly grow­ing rev­enues that are cur­rently be­ing over­whelmed by even larger ex­penses.

The au­dited fi­nan­cial state­ments, ob­tained by in­de­pen­dent jour­nal­ist Ed Zitron, show OpenAI’s re­ported rev­enue grow­ing from $3.7 bil­lion in 2024 to $13.07 bil­lion in 2025. The Financial Times, which re­viewed the same doc­u­ments, writes that the com­pa­ny’s monthly rev­enues had grown to nearly $2 bil­lion by the end of 2025, sug­gest­ing that its on­go­ing rev­enue rates con­tin­ued to grow through­out the year.

R&D ex­penses alone still eas­ily out­pace OpenAI’s quickly grow­ing rev­enues.

Credit: Ars Technica

But the com­pa­ny’s fast-grow­ing rev­enues are still dwarfed by its even more sig­nif­i­cant ex­penses. OpenAI’s to­tal rev­enues in both of the last two years were out­paced by re­search and de­vel­op­ment alone, which grew from a $7.81 bil­lion line item in 2024 to a mas­sive $19.18 bil­lion cost in 2025. Those num­bers seem to re­flect the sig­nif­i­cant costs OpenAI in­curred in train­ing new mod­els and in­clude $10.59 bil­lion in R&D costs paid to Microsoft alone in 2025.

On top of that, OpenAI’s “cost of revenue” (i.e., the money spent producing and distributing the product) increased from $2.65 billion in 2024 to $7.5 billion in 2025. This cost line likely reflects the significant compute costs incurred at “inference time” as the company’s models respond to a growing number of user prompts. Costs associated with sales and marketing also grew from $1.11 billion in 2024 to $5.73 billion in 2025.

OpenAI’s op­er­at­ing loss is shrink­ing as a per­cent­age of rev­enue, but there’s a long way to go be­fore it be­comes a profit.

Credit: Ars Technica

All told, OpenAI’s day-to-day “loss from operations” increased from $8.78 billion in 2024 to $20.92 billion in 2025, a concerning direction for a company that is telling investors it hopes to be profitable by 2030. But measured as a percentage of revenues, the company’s operating losses slightly improved year to year, from 237 percent in 2024 to 160 percent in 2025.
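The percentages follow directly from the revenue and operating-loss figures reported above. A minimal sketch of the calculation, using only those reported numbers (in billions of US dollars):

```python
# Figures in billions of US dollars, as reported in the article.
revenue = {2024: 3.7, 2025: 13.07}
operating_loss = {2024: 8.78, 2025: 20.92}

def loss_as_percent_of_revenue(year: int) -> float:
    """Operating loss divided by revenue, expressed as a percentage."""
    return operating_loss[year] / revenue[year] * 100

for year in (2024, 2025):
    print(year, f"{loss_as_percent_of_revenue(year):.0f}%")
```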

Tesco moving 40,000 server workloads off VMware amid Broadcom's “abusive conduct”

arstechnica.com

Tesco is also deal­ing with mi­gra­tion chal­lenges re­lated to data se­cu­rity be­cause its new, un­named vir­tu­al­iza­tion soft­ware is in­com­pat­i­ble with the Veeam and Zerto prod­ucts it uses.

“Manifestly unfair and excessive” price hike

Tesco ini­tially re­quested at least 100 mil­lion pounds (about $133.6 mil­lion) in dam­ages each from Broadcom, VMware, and re­seller Computacenter, plus in­ter­est.

In its recent filings, Tesco said it turned down at least four offers from Broadcom to continue using VMware and Broadcom’s mainframe tech. One offer charged $23.5 million (about 17.6 million pounds) for VMware Cloud Foundation 9.0 and mainframe software and support services for a year, The Register reported. Tesco said that was “around 175 percent” more expensive than what it believes it should have had to pay for VMware and a 350 percent price hike for the mainframe offerings. The prices were “manifestly unfair and excessive,” one of Tesco’s filings said, according to The Register.

In an amended de­fense, Broadcom de­nied that the price hike was un­fair, The Register re­ported. Additionally, Broadcom ar­gued that it should­n’t have to pay dam­ages in re­la­tion to Tesco strug­gling to find VMware and Broadcom al­ter­na­tives be­fore Tesco’s sup­port ex­pired, as the re­tail firm has since found re­place­ment prod­ucts.

The case is ex­pected to go to court be­tween November 1, 2027, and February 25, 2028, The Register re­ported. Afterward, it could go to trial.

Although the com­pa­nies will con­tinue their dis­pute in UK courts, the dis­agree­ment mir­rors frus­tra­tions that VMware cus­tomers and part­ners around the world have ex­pressed since Broadcom bought VMware. With users of­ten be­ing heav­ily de­pen­dent on VMware prod­ucts, many have de­layed or avoided mi­gra­tion or are only mov­ing some work­loads, due to com­pli­ca­tions around cost, time, sup­port, and com­pat­i­bil­ity.

Still, vir­tu­al­iza­tion ri­vals, like Hewlett Packard Enterprise and Nutanix, have been mak­ing ag­gres­sive pushes to at­tract dis­grun­tled VMware users.

Simultaneously, Broadcom has stuck to its VMware strat­egy and has re­ported fi­nan­cial suc­cess, es­pe­cially among its tar­get of large en­ter­prises. It has also dealt with other pub­lic le­gal dis­putes with large cus­tomers, in­clud­ing AT&T, with which it reached an undis­closed set­tle­ment, and Siemens, which Broadcom ac­cused of soft­ware pi­rat­ing in an on­go­ing case in the US District Court for the District of Delaware.

AMD silently removes memory encryption from consumer Ryzen CPUs, leaving users unaware that they may be vulnerable — security feature vanishes after newer AGESA firmware, AMD engineers go radio silent when pressed about the change

www.tomshardware.com

According to a re­port by Ars Technica, AMD has qui­etly stripped a crit­i­cal se­cu­rity fea­ture from its lower-end CPUs, leav­ing un­aware users po­ten­tially vul­ner­a­ble to phys­i­cal at­tacks. Following a months-long in­ves­ti­ga­tion tracked on GitHub, Ben Kilpatrick con­firmed that the Transparent Secure Memory Encryption (TSME) fea­ture — which pro­tects CPUs against phys­i­cal ex­ploits that siphon data from con­nected mem­ory chips — was sud­denly no longer avail­able on AMD CPUs out­side the com­pa­ny’s Pro lineup.

As the exhaustive inquiry, which involved conversations with AMD engineers, board vendors, and other CPU users, was coming to a head, an AMD engineer abruptly cut discussions short, stating, “My apologies, but I don’t have any more information to share on this topic.” As of this report, AMD has neither officially acknowledged nor explained the disappearance of the security feature.

TSME is a pro­tec­tion fea­ture that en­crypts the data stored in mem­ory, mak­ing it un­us­able to phys­i­cal at­tack­ers. AMD ini­tially added this fea­ture to its high-end CPUs, then later ex­tended it to lower-end CPUs. Eventually, the fea­ture be­came a given, leav­ing lower-end chip users as­sured in its avail­abil­ity as part of the chip pack­age. However, with­out prior no­tice, AMD ap­pears to have scrapped the se­cu­rity fea­ture in these proces­sors.

According to the Ars report, the company’s only official reaction to the matter (not counting the GitHub discussions) is an email response stating that “TSME is a security feature only applied to PRO CPUs as part of AMD PRO Technologies,” notably the first time the company has publicly stated such a restriction, despite the feature having worked on consumer chips for years. However, it remains unclear whether the disappearance is an intentional policy decision by AMD to reserve TSME for Pro chips or an unintentional regression that was introduced in AGESA 1.2.7.0, a newer firmware release.

Another con­cern­ing as­pect of the re­moval is that the fea­ture’s dis­ap­pear­ance is com­pletely un­de­tectable on Windows ma­chines and re­quires sig­nif­i­cant tech­ni­cal work to iden­tify on Linux. That means the se­cu­rity fea­ture was re­moved, leav­ing users un­aware that any­thing had changed.

Kilpatrick, a self-described “privacy-conscious Linux hobbyist” who first reported the change, was installing a new operating system on his machine running a Ryzen 7 9700X from the Zen 5 architecture. To confirm that all his security protections were enabled, he ran Host Security ID (HSI), an auditing feature that evaluates a system’s firmware and hardware security configurations. To his surprise, HSI reported that TSME was no longer supported, even though he had enabled it in his BIOS settings all along. The contradiction sent him searching for answers.

His first instinct was to reach out to MSI, his motherboard’s manufacturer, but the company didn’t initially provide a definitive explanation. He also filed a bug report on AMD’s public engineering GitHub repository, where two AMD engineers eventually responded: Tom Lendacky, an AMD fellow software engineer, and Mario Limonciello, an AMD senior principal software engineer.

Interestingly, nei­ther en­gi­neer ap­peared to have a clear an­swer for why the fea­ture had dis­ap­peared. Their ad­vice was ba­si­cally the same: dis­able and re-en­able the op­tion in the BIOS, and if that did­n’t work, take it up with the moth­er­board man­u­fac­turer, mak­ing it clear that peo­ple di­rectly at AMD were just as in the dark as the user re­port­ing it.

It was only after this that Kilpatrick pressed MSI harder, eventually convincing its engineers to run controlled tests. They found that consumer Ryzen chips had TSME enabled under an older firmware version but showed it as “not supported” under a newer one (AGESA 1.2.7.0), while Pro versions of the CPU supported the feature regardless of the firmware or motherboard used.

This leaves the big ques­tion of whether AMD de­lib­er­ately re­stricted TSME to its Pro chips, or whether the change was an ac­ci­den­tal re­gres­sion — a firmware bug in­tro­duced in that newer AGESA ver­sion. Either way, the sil­i­con ap­pears to have been ca­pa­ble of run­ning the fea­ture. The dif­fer­ence is whether users are look­ing at a bug that AMD should fix or a quiet prod­uct-seg­men­ta­tion de­ci­sion that AMD has not prop­erly ex­plained.

Kilpatrick took these MSI findings back to the AMD engineers and resumed the discussion six weeks later. MSI’s product marketing team, he reported, had been told directly by AMD that TSME is exclusively supported on Pro series processors. He also relayed MSI’s test results: an internal AGESA flag that controls whether TSME activates during boot returned FALSE on consumer chips regardless of the BIOS setting, but TRUE on Pro processors when the feature was enabled.

Kilpatrick then brought up something especially awkward. He reminded Lendacky of a comment that the engineer had made back in 2020, confirming that a Ryzen 3700X, a consumer CPU, “should support TSME.” In a later 2025 comment in the same discussion, Lendacky again recommended using TSME, while noting that the motherboard BIOS provider had to expose the option. So there it was: AMD’s own engineer, years earlier, acknowledging the feature working on exactly the kind of lower-end chip now stripped of it, which shows that Ryzen support was not some fantasy users invented.

After some more back-and-forth, Kilpatrick asked bluntly whether the flag being set to FALSE on consumer chips was a silicon-level limitation or a firmware policy decision, since one is permanent and the other is potentially reversible. Limonciello’s reply effectively closed the chapter. “My apologies, but I don’t have any more information to share on this topic,” he wrote.

To be fair to AMD, there is no clear in­di­ca­tion that the com­pany ever pub­licly ad­ver­tised TSME as a con­sumer Ryzen fea­ture. AMD has long said that a re­lated mem­ory pro­tec­tion, Secure Memory Encryption (SME), is avail­able only in the Pro and EPYC CPU tiers. SME is OS-managed, us­ing a sin­gle key and al­low­ing the OS to se­lec­tively en­crypt in­di­vid­ual mem­ory pages. TSME, by con­trast, is firmware-man­aged, en­crypt­ing all RAM with no OS in­volve­ment. When ac­tive, it guards against phys­i­cal at­tacks such as cold-boot ex­ploits, DRAM in­ter­face snoop­ing, and mem­ory mod­ule re­moval, and it ac­ti­vates silently once en­abled in the BIOS, mak­ing it the more prac­ti­cally use­ful of the two pro­tec­tions.

For now, AMD has said nothing official. It hasn’t confirmed what happened, why it happened, whether anything actually changed, or what users of its consumer chips should now expect. Given the years of TSME quietly doing its job on these lower-cost processors, and AMD engineers’ own past comments treating it as supported, users had every reason to regard it as part of the package.

For most con­sumer Ryzen users, the prac­ti­cal im­pact of the change is nar­row. TSME pro­tects against phys­i­cal at­tacks, mean­ing sce­nar­ios in which some­one has phys­i­cal ac­cess to the ma­chine or its mem­ory hard­ware and at­tempts to ex­tract se­crets di­rectly from RAM. The fea­ture is more im­por­tant for peo­ple car­ry­ing sen­si­tive lap­tops, han­dling con­fi­den­tial work, re­ly­ing on full-disk en­cryp­tion, or op­er­at­ing in en­vi­ron­ments where seizure, theft, or hard­ware tam­per­ing is a re­al­is­tic con­cern. Anyone who gen­uinely needs mem­ory en­cryp­tion on AMD hard­ware now ap­pears to need a Ryzen Pro or EPYC sys­tem, un­less AMD clar­i­fies the sit­u­a­tion or re­stores sup­port.

Etiido Uko is a news con­trib­u­tor for Tom’s Hardware cov­er­ing the lat­est up­dates in big tech and the PC in­dus­try. He is a me­chan­i­cal en­gi­neer and se­nior tech­ni­cal writer with over nine years of ex­pe­ri­ence in doc­u­men­ta­tion and re­port­ing. He is deeply pas­sion­ate about all things en­gi­neer­ing and tech­nol­ogy, and is an ex­pert in gad­gets, man­u­fac­tur­ing, ro­bot­ics, au­to­mo­tive, and aero­space.

How We Made Cloud Browsers 3x Cheaper and Faster

browser-use.com

Our cloud browsers need to do three things at once: start quickly, re­main iso­lated, and be cheap. That is why we re­built Browser Use Cloud, so a new ses­sion starts in un­der a sec­ond and costs $0.02 per browser hour, down from $0.06.

This is harder than it sounds. A browser has Chromium, a filesys­tem, cook­ies, cache, proxy set­tings, down­loads, and some­times a logged-in cus­tomer ses­sion. If one browser can read an­other browser’s state, it cre­ates a se­cu­rity prob­lem.

The nor­mal an­swer is a vir­tual ma­chine, or VM. A VM is a com­puter in­side a com­puter: it gets its own CPU, mem­ory, disk, and net­work de­vices. It is sep­a­rate from every­thing else on its host, and if the browser breaks, leaks in­for­ma­tion, or gets at­tacked, the dam­age stays within the VM.

Normal VMs, how­ever, are too heavy for cloud browsers. We need to cre­ate them con­stantly, some­times thou­sands at a time, and throw them away as soon as ses­sions end. If each browser needs a slow, ex­pen­sive VM, the prod­uct be­comes slow and ex­pen­sive, too.

The question for us was whether we could give every browser its own VM without making users wait or pay for it. We now do that with Firecracker, a lightweight VM system.

Every Browser Use Cloud ses­sion runs in its own, tiny VM. These VMs run on EC2, Amazon’s rented cloud servers.

That is the un­usual part. Firecracker is nor­mally run on bare-metal servers, where you rent the whole phys­i­cal ma­chine. To re­duce cus­tomers’ cost, we run it on reg­u­lar EC2, where AWS has al­ready put your server in­side a VM.

This should be slow. Nested VMs make mem­ory and CPU op­er­a­tions more ex­pen­sive, and Chromium takes time to start. This post is about how we made this setup fast and ef­fi­cient.

But first, why did we re­build our in­fra­struc­ture?

Why we left uniker­nels be­hind

We used to run cloud browsers with Unikraft, which builds small, VM-like con­tain­ers called uniker­nels. Unikernels, in­stead of boot­ing a full Linux sys­tem, load a small im­age built for your pur­poses. Unikernels start quickly and are cheap when idle be­cause you can shut them down when no one is us­ing them.

Unikraft was good for turn­ing browsers off when they were not in use, but bad at adding more browsers quickly when traf­fic spiked. If more users sud­denly asked for browsers at once, you would need to scale browser ca­pac­ity rapidly. Unikraft does not have good built-in au­toscal­ing, so an en­gi­neer had to change a vari­able, man­u­ally adding more in­stances.

During a burst in traf­fic, the sys­tem, in­stead of re­act­ing on its own, re­quired hu­mans to ad­just it. This caused prob­lems: one load test brought down pro­duc­tion for 45 min­utes. So we re­built our setup on Firecracker.

Firecracker pro­vides a layer through which you can cre­ate, mon­i­tor, and run VMs. It gives each VM CPU, mem­ory, disk, and net­work de­vices, and it keeps it iso­lated from the host and from other VMs.

Teaching browsers to scale them­selves

Firecracker gave each browser its own VM. But it did not in­her­ently solve the prob­lem that broke the old sys­tem: de­cid­ing how many VMs to run, where to put them, and when to add more.

So we built our own con­trol plane. The con­trol plane mon­i­tors our fleet of browsers and de­cides whether we should scale up or down.

When a user asks for a browser, the con­trol plane picks a ma­chine with room. When traf­fic rises, it starts more ma­chines. When traf­fic falls, it stops send­ing new browsers to ma­chines we want to re­move.

It checks the fleet in real time. That is much faster than waiting on CloudWatch, AWS’s monitoring service, which usually reacts on one-minute windows. It also knows things generic metrics do not: browsers that are still starting, machines we are trying to remove, and machines that should not receive new sessions.
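The placement decision described above can be sketched roughly as follows. This is an illustration, not the actual control plane: the `Host` fields, the function names, and the "most free slots wins" policy are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Host:
    name: str
    capacity: int            # max browser VMs this host can hold
    running: int = 0         # browsers that are ready
    starting: int = 0        # browsers still resuming or launching
    draining: bool = False   # host we want to remove; no new sessions

    @property
    def free_slots(self) -> int:
        # Starting browsers count against capacity, unlike in generic metrics.
        return self.capacity - self.running - self.starting

def pick_host(hosts: list[Host]) -> Optional[Host]:
    """Return the eligible host with the most free slots, or None."""
    eligible = [h for h in hosts if not h.draining and h.free_slots > 0]
    if not eligible:
        return None
    return max(eligible, key=lambda h: h.free_slots)

def place_browser(hosts: list[Host]) -> Optional[str]:
    """Reserve a slot on a host; None means the fleet must scale up."""
    host = pick_host(hosts)
    if host is None:
        return None
    host.starting += 1
    return host.name

fleet = [
    Host("a", capacity=4, running=4),               # full
    Host("b", capacity=4, running=1, starting=1),   # has room
    Host("c", capacity=4, draining=True),           # being removed
]
print(place_browser(fleet))
```

Here only host `b` is eligible: `a` is full and `c` is draining, so it receives no new sessions even though it has free capacity.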

Why we run VMs in­side VMs

Once we had a con­trol plane, the next ques­tion was what kind of ma­chines it should add.

The usual way to run Firecracker on AWS is a .metal in­stance. This means you rent the whole phys­i­cal server, and Firecracker runs di­rectly on it.

We chose reg­u­lar EC2 in­stead. Regular EC2 ma­chines are faster to get and cheaper to keep around. Our hosts boot from a pre-built im­age and start serv­ing browsers about 30 sec­onds af­ter launch. The faster we can add a host, the less idle ca­pac­ity we need to pay for, and the lower the cost we pass on to our cus­tomers.

The catch is that reg­u­lar EC2 is al­ready a VM. AWS runs our host in­side its own iso­la­tion layer, and then we run browser VMs in­side that host. In other words, every browser is a VM in­side a VM.

This is not the nor­mal way of us­ing Firecracker. When a browser VM needs help from the host, the re­quest passes through two VM lay­ers in­stead of one, adding la­tency.

We de­cided the trade­off was worth it, as reg­u­lar EC2 gives us faster scale-up and lower cost. To mit­i­gate the ef­fects of nested vir­tu­al­iza­tion, we fo­cused on mak­ing Firecracker as speedy as pos­si­ble.

From re­quest to us­able browser

When a user asks for a browser, the con­trol plane picks a ma­chine with room. That ma­chine re­stores a saved browser VM, starts Chromium in­side it, waits un­til Chromium is ready to be con­trolled, and re­turns a con­nec­tion URL.

That URL is what the user’s agent con­nects to. Browser Use con­trols Chromium over a WebSocket us­ing the Chrome DevTools Protocol, or CDP. CDP is the re­mote-con­trol API for Chrome: click this but­ton, type this text, read this page, take this screen­shot.
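To make "remote-control API" concrete, here is a small sketch of what CDP commands look like on the wire: each one is a JSON object with an `id`, a `method`, and `params`. The helper name `cdp_command` is made up for this example, and the sketch only builds the messages; sending them over the WebSocket is left out.

```python
import itertools
import json

# Each command carries a unique id so its response can be matched to it.
_ids = itertools.count(1)

def cdp_command(method: str, **params) -> str:
    """Serialize one CDP command as the JSON text sent over the WebSocket."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})

navigate = cdp_command("Page.navigate", url="https://example.com")
screenshot = cdp_command("Page.captureScreenshot", format="png")
read_title = cdp_command("Runtime.evaluate", expression="document.title")

print(navigate)
```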

Three things made this take longer: restoring the VM’s memory, launching Chromium, and keeping the browser stealthy and undetected by anti-bot security.

The first slow­down: mem­ory

The first bot­tle­neck was mem­ory.

A pro­duc­tion browser is not booted from scratch. We re­sume it from a snap­shot: a saved VM that is al­ready booted and paused just be­fore Chromium launches. Resuming a VM is much faster than boot­ing one.

Our first re­sumes were still too slow. When a re­stored VM touches mem­ory for the first time, the host has to map that mem­ory back in. This event is called a page fault. In a nested VM, each page fault is ex­pen­sive be­cause it can cross both VM lay­ers.

During an early cold start, page faults were 72% of all VM ex­its. Getting from re­sume to a CDP-ready browser took 9.8 sec­onds.

The fix was to map mem­ory in larger chunks. Before, the VM re­stored mem­ory in 4KB pages. Now, it uses 2MB pages. Each page cov­ers 512 times more mem­ory, so the browser trig­gers far fewer page faults while it wakes up. Fewer page faults mean fewer trips through the nested VM lay­ers.
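The 512x figure is just the ratio of the two page sizes. A quick sketch of the arithmetic, where the 2 GiB guest size is a hypothetical value chosen for illustration and not a number from our setup:

```python
KIB = 1024
MIB = 1024 * KIB

SMALL_PAGE = 4 * KIB
HUGE_PAGE = 2 * MIB

def pages_needed(memory_bytes: int, page_size: int) -> int:
    """Number of pages needed to cover memory_bytes, rounded up."""
    return -(-memory_bytes // page_size)

ratio = HUGE_PAGE // SMALL_PAGE
guest_memory = 2 * 1024 * MIB   # hypothetical 2 GiB guest

print(ratio)                                   # 512
print(pages_needed(guest_memory, SMALL_PAGE))  # 524288
print(pages_needed(guest_memory, HUGE_PAGE))   # 1024
```

The page count is an upper bound on first-touch faults for that memory, which is why larger pages shrink the number of trips through the nested layers.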

We also now han­dle page faults our­selves with a cus­tom han­dler for user­faultfd, a Linux API for han­dling miss­ing mem­ory pages. Before the VM starts run­ning, our han­dler loads the mem­ory Chromium is most likely to ac­cess first.

Our han­dler keeps Chromium from re­ceiv­ing a flood of page faults as it starts. The host has al­ready loaded the hot pages, and the re­main­ing pages ar­rive be­fore the browser needs most of them.

These changes cut the time from re­sum­ing the VM to hav­ing a browser ready to ac­cept com­mands from 9.8 sec­onds to 3.1 sec­onds. They also cut the num­ber of times the browser VM had to stop and ask the host to han­dle miss­ing mem­ory from roughly 100,000 times per re­sume to about 1,100, about a 91x drop.

We made smaller re­fine­ments, too. The VM was spend­ing 500ms look­ing for an old PS/2 key­board that did­n’t ex­ist. We dis­abled this check.

Additionally, we changed how the host waits for the browser to be­come ready. Before, the host kept polling the VM with HTTP re­quests. That cre­ated ex­tra VM ex­its, or mo­ments when the browser VM had to pause so the host could han­dle work for it.

Now, the browser dri­ver writes a ready mes­sage to its log, and the host reads that log over vsock, a fast com­mu­ni­ca­tion chan­nel be­tween the host and the VM. The host sees the ready mes­sage in un­der a mil­lisec­ond.
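The idea of waiting on a pushed message instead of polling can be sketched as below. This is a simplified illustration: the `READY` marker and the function name are assumptions, and a local socket pair stands in for the vsock channel so the example runs anywhere.

```python
import socket

def wait_for_ready(conn: socket.socket, marker: bytes = b"READY") -> bool:
    """Block until a complete log line containing `marker` arrives on `conn`.

    Returns True when the marker is seen, False if the peer closes first.
    """
    buffer = b""
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            return False
        buffer += chunk
        # Keep any trailing partial line in the buffer for the next read.
        *lines, buffer = buffer.split(b"\n")
        if any(marker in line for line in lines):
            return True

# Stand-in for the host/guest channel; a real setup would use vsock instead.
host_side, guest_side = socket.socketpair()
guest_side.sendall(b"starting chromium\nREADY\n")
print(wait_for_ready(host_side))
host_side.close()
guest_side.close()
```

The host blocks on a read and is woken only when data arrives, so it generates no request traffic into the VM while it waits.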

The sec­ond slow­down: Chromium startup

The next bot­tle­neck was CPU.

When Chromium starts, it is hun­gry and de­mand­ing. It cre­ates ren­der­ers, com­pos­i­tors, and V8 iso­lates at once. After that, browser au­toma­tion is much qui­eter. An agent clicks, waits, reads, clicks again.

Because Chromium is qui­eter af­ter it has started, we can pack many browsers into the same in­stance. A sin­gle host can ac­com­mo­date many browsers be­cause browsers spend most of their time wait­ing: wait­ing for a page, a net­work re­sponse, or the next agent ac­tion.

We han­dle the launch burst in two phases. While a browser re­sumes and Chromium starts, we leave its vir­tual CPUs un­pinned. That means Linux can move the browser’s CPU work across the host in­stead of lock­ing it to fixed cores. This spreads the burst out.

Once the browser re­ports that it’s ready, we pin those vir­tual CPUs to sta­ble cores. That means the browser VM now runs on spe­cific cores. Stable place­ment lets us pack more browsers onto the same host with­out guess­ing. We know which cores are taken, which ones still have room, and which browsers might in­ter­fere with each other.

The launch phase is like let­ting a crowd en­ter through every open door. Once every­one is in­side, as­signed seats work bet­ter.

Pinning from the start made things worse. When many browsers launched at once, they piled onto the same hot cores, and some launches failed.

We also be­came care­ful about hy­per­threads. A phys­i­cal CPU core of­ten ap­pears as two log­i­cal CPUs, called sib­ling threads. Those sib­lings still share the same phys­i­cal core. If two browser VMs each get one sib­ling, they fight over the same core. Under nest­ing, that con­tention showed up as failed launches. To pre­vent this, each browser now gets both sib­ling threads of the phys­i­cal core it uses.
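One way to picture the sibling rule is an allocator that only hands out whole physical cores. This is a sketch under assumptions: the class name, the one-core-per-browser policy, and the example topology are hypothetical and are not taken from our scheduler.

```python
from typing import Optional

class CoreAllocator:
    """Hands out whole physical cores, i.e. both sibling threads together."""

    def __init__(self, siblings: dict[int, tuple[int, int]]):
        # physical core id -> (logical CPU, sibling logical CPU)
        self._siblings = siblings
        self._owner: dict[int, str] = {}   # physical core id -> browser id

    def allocate(self, browser_id: str) -> Optional[set[int]]:
        """Return both logical CPUs of a free core, or None if the host is full."""
        for core, cpus in self._siblings.items():
            if core not in self._owner:
                self._owner[core] = browser_id
                return set(cpus)
        return None

    def release(self, browser_id: str) -> None:
        """Free every core owned by this browser."""
        self._owner = {c: b for c, b in self._owner.items() if b != browser_id}

topology = {0: (0, 4), 1: (1, 5)}   # hypothetical 2-core, 4-thread host
alloc = CoreAllocator(topology)
print(alloc.allocate("browser-1"))   # {0, 4}
print(alloc.allocate("browser-2"))   # {1, 5}
print(alloc.allocate("browser-3"))   # None
```

On Linux, a set like `{0, 4}` could then be applied to a thread with `os.sched_setaffinity`. Because no two browsers ever share a physical core, sibling contention cannot occur between them.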

Finally, we give each pinned vCPU thread real-time pri­or­ity. That tells Linux to run the browser VM im­me­di­ately when it needs CPU, in­stead of paus­ing it be­hind less im­por­tant work. Before this change, a 1,000-browser test lost 17% of ses­sions shortly af­ter be­ing cre­ated. After it, the same test lost zero.

Staying stealthy with­out a screen

The last bot­tle­neck was stealth.

A head­less browser runs with­out a vis­i­ble win­dow. A head­ful browser runs like the browser on your lap­top, with a win­dow, graph­ics, and ren­dered frames.

Plain head­less Chromium is easy to de­tect by web­sites with anti-bot mea­sures. Plain head­less Chromium avoided get­ting blocked by web­sites only 2% of the time, ac­cord­ing to our stealth bench­mark. The same Chromium, head­ful with a vis­i­ble win­dow, avoided blocks 50% of the time just by ren­der­ing con­tent.

That is why most providers run head­ful browsers. They pay for a dis­play server, a GPU, and a com­pos­i­tor draw­ing frames for a screen no­body looks at.

We run our browsers fully headlessly. This is only pos­si­ble be­cause we changed the browser it­self.

The first com­po­nent is our Chromium fork. Many stealth tools hide au­toma­tion by in­ject­ing JavaScript into every page af­ter the browser starts. For ex­am­ple, they over­write browser prop­er­ties like nav­i­ga­tor.web­driver, a flag that tells web­sites whether the browser is be­ing con­trolled by au­toma­tion, so the page sees false in­stead of true. Websites can of­ten de­tect when such val­ues are over­writ­ten. To avoid this, we patch Chromium at its low­est level, so our patches are never ex­posed in the first place.

The sec­ond com­po­nent is our fin­ger­print­ing. A browser fin­ger­print con­sists of de­tails a web­site reads about your browser and ma­chine, in­clud­ing your op­er­at­ing sys­tem, screen size, fonts, graph­ics, out­put, au­dio, time­zone, lan­guage, and hun­dreds of other de­tails. Systems that de­tect bots check if these de­tails look like a real user’s browser or a fake au­toma­tion en­vi­ron­ment. We use tens of thou­sands of real fin­ger­prints across ma­cOS, Windows, and Linux.

Our browsers avoid blocks 81% of the time on our stealth bench­mark, and 84.8% on Halluminate BrowserBench, the high­est of any provider. Because there is no dis­play, browsers are cheaper to run and eas­ier to scale.

Connecting to the right browser

Once a browser is ready, users con­nect to it through CDP. The pub­lic URL is a WebSocket URL.

In front of the browser fleet are sim­ple edge routers. A router gets the WebSocket con­nec­tion, asks the con­trol plane where that browser lives, and for­wards the raw CDP bytes to the right VM.

The routers do not de­cide where browsers run. If one dies, an­other router can take over new con­nec­tions. The con­trol plane is in charge of place­ment. The routers only move bytes.

The re­sult

Each of your browser ses­sions con­sists of a tiny VM re­sumed from a snap­shot, run­ning in­side reg­u­lar EC2, with head­less Chromium in­side it.

The VM cold start is un­der 400ms. End to end, through the pub­lic API, browser cre­ate la­tency is 825ms at p50 and 1.35s at p99. We mea­sured this dur­ing a 10,000-session stress test in which every browser started suc­cess­fully.
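For readers less familiar with p50 and p99, the sketch below shows one common way to compute them, the nearest-rank method. The sample values are made up for illustration and are not our measurements, and other percentile definitions give slightly different results.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds, for illustration only.
latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

print(percentile(latencies_ms, 50))   # 50
print(percentile(latencies_ms, 99))   # 100
```

p50 describes the typical session, while p99 captures the slow tail, where the worst 1% of sessions sit.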

Browsers on this in­fra­struc­ture are live to­day. Start with Browser Use Cloud.

The biggest re­main­ing cost is Chromium it­self. Starting Chromium af­ter re­sume still takes about 545ms at p50.

Any fur­ther im­prove­ments, then, must come from the browser it­self.

Next: skip Chromium startup

Today, we snap­shot the VM just be­fore Chromium starts. That keeps the snap­shot sim­ple: every browser wakes up from the same, clean point, then launches Chromium for it­self.

But Chromium startup is now the largest re­main­ing cost. The next step is to snap­shot later, af­ter Chromium is al­ready run­ning. Then, a new ses­sion does not have to start the browser at all. It wakes up with the browser al­ready alive.

This is com­plex, as a run­ning browser has open de­vices, timers, graph­ics state, net­work state, and fin­ger­print state. Before we freeze it, we need to put all of these things into a safe state. After we re­store it, each browser still needs to look like its own browser, not a clone of the last one.

This is what we are work­ing on next.

The fastest browser is the one you barely have to boot. We got the VM startup un­der 400ms by run­ning Firecracker where it is not sup­posed to run. Next, we are mak­ing new ses­sions wake up with Chromium al­ready run­ning.

Browsers on the afore­men­tioned in­fra­struc­ture are live at cloud.browser-use.com.
