10 interesting stories served every morning and every evening.

Cloudflare Turnstile requiring fingerprintable WebGL

hacktivis.me

For about a week, Cloudflare Turnstile (their “Verify you’re human” device verification) has been looping indefinitely in my WebKitGTK-based browser, preventing access to quite a few websites (this happened before, but it has gotten even worse lately).

Turns out it’s because Cloudflare wants to have a fingerprint of your device via WebGL. The only reason for doing this would be tracking.

Their pro-track­ing non-jus­ti­fi­ca­tion copied here just in case:

Turnstile uses browser fin­ger­print­ing to ver­ify you’re hu­man. Privacy tools that block or ran­dom­ize fin­ger­print­ing make your browser look like a bot try­ing to hide its iden­tity. Temporarily al­low­ing fin­ger­print­ing for this site will fix the is­sue.

Such things are blocked in WebKit, and have been for years, meaning it’s tracking so awful that even Apple would block it. As far as I can tell, it’s not the kind of privacy protection you can easily disable in WebKit. So Cloudflare just banned all WebKitGTK browsers, as I guess they put in an exception for Safari.

As an aside, if you’re won­der­ing, Mozilla Firefox screwed up their WebGL fin­ger­print­ing pro­tec­tion:

Bugzilla#1916271: Gecko re­veals san­i­tized GPU Characteristics; we­bkit and blink re­turn hard­coded strings for all users

Plus privacy.resistFingerprinting isn’t enabled even when selecting “Strict” “Enhanced Privacy Protection” in the settings, great job there Mozilla. But I guess with it enabled, privacy-conscious Firefox users might not be able to pass Cloudflare’s device verification in the future.

Scientists found that the creatine supplement millions take for muscle gains is quietly raising brain energy levels and slowing early Alzheimer’s cognitive decline by 30%

thesciverse.org

Tens of mil­lions of peo­ple take cre­a­tine every day. They bought it for their mus­cles. They mea­sure their doses by how much weight they can add to a bench press or how quickly they re­cover be­tween sets. Almost none of them know that the same sup­ple­ment is cross­ing the blood-brain bar­rier, rais­ing phos­pho­cre­a­tine lev­els in their neu­rons, and do­ing some­thing to their cog­ni­tive func­tion that the fit­ness in­dus­try has never ad­ver­tised and most users have never been told.

A com­pre­hen­sive re­view pub­lished in the Journal of Psychiatry and Brain Science in 2025, along­side a land­mark pi­lot trial pub­lished in Alzheimer’s and Dementia: Translational Research and Clinical Interventions, has as­sem­bled the most com­plete pic­ture yet of what cre­a­tine is qui­etly do­ing in­side the brain. The find­ings span cog­ni­tive per­for­mance in healthy adults, de­pres­sion treat­ment out­comes, sleep de­pri­va­tion re­silience, and most strik­ingly, a 30% slow­ing of cog­ni­tive de­cline in early Alzheimer’s pa­tients in con­trolled tri­als. None of this is in the mar­ket­ing on the tub sit­ting in most gym bags.

Why the Brain Needs Creatine

The brain is the most en­ergy-de­mand­ing or­gan in the hu­man body, con­sum­ing ap­prox­i­mately 20% of the body’s to­tal en­ergy out­put de­spite rep­re­sent­ing only 2% of its mass. Neurons do not store mean­ing­ful en­ergy re­serves. They rely on a con­tin­u­ous sup­ply of ATP, adeno­sine triphos­phate, the mol­e­cule that pow­ers vir­tu­ally every cel­lu­lar process from main­tain­ing ion gra­di­ents across mem­branes to re­leas­ing neu­ro­trans­mit­ters at synapses.

Creatine plays a crit­i­cal role in the en­ergy me­tab­o­lism of brain cells. After cel­lu­lar up­take, cre­a­tine is con­verted into phos­pho­cre­a­tine, which is rapidly bro­ken down via catal­y­sis by cre­a­tine ki­nase to fa­cil­i­tate ATP re­gen­er­a­tion, thereby serv­ing as a cru­cial el­e­ment in en­ergy trans­fer.

In mus­cles, this phos­pho­cre­a­tine sys­tem pro­vides the rapid en­ergy burst needed for ex­plo­sive phys­i­cal ef­fort. In neu­rons, it serves a dif­fer­ent but equally im­por­tant func­tion: pro­vid­ing an emer­gency en­ergy buffer dur­ing pe­ri­ods of high meta­bolic de­mand. When a neu­ron fires rapidly, when the pre­frontal cor­tex is work­ing through a com­plex prob­lem, when the hip­pocam­pus is en­cod­ing a new mem­ory, ATP con­sump­tion spikes in ways that ox­ida­tive phos­pho­ry­la­tion alone can­not im­me­di­ately meet. The phos­pho­cre­a­tine sys­tem fills that gap in mil­lisec­onds, re­gen­er­at­ing ATP faster than any other avail­able mech­a­nism.

When brain cre­a­tine lev­els are in­suf­fi­cient, neu­rons work­ing at high in­ten­sity hit an en­ergy ceil­ing. Processing slows. Working mem­ory ca­pac­ity shrinks. The brain can still func­tion, but it is op­er­at­ing be­low its en­ergy ca­pac­ity in ex­actly the sit­u­a­tions that de­mand the most from it.

What Happens to Brain Creatine as You Age

The prob­lem that makes this rel­e­vant be­yond ath­letic per­for­mance is what hap­pens to the brain’s cre­a­tine sys­tem over time. Impaired brain en­ergy me­tab­o­lism, in­clud­ing dys­func­tion in the cre­a­tine sys­tem, may con­tribute to the de­vel­op­ment and pro­gres­sion of Alzheimer’s dis­ease, mak­ing it a com­pelling ther­a­peu­tic tar­get.

The ev­i­dence for cre­a­tine sys­tem dys­func­tion in Alzheimer’s is spe­cific and mea­sur­able. Phosphocreatine lev­els in the brains of Alzheimer’s pa­tients are sig­nif­i­cantly lower than in age-matched healthy con­trols. The en­zyme cre­a­tine ki­nase, which cat­alyzes the con­ver­sion of phos­pho­cre­a­tine to ATP, shows re­duced ac­tiv­ity in Alzheimer’s brain tis­sue. Mitochondrial dys­func­tion in Alzheimer’s neu­rons cre­ates what re­searchers de­scribe as a bioen­er­getic cri­sis, a state where the cells most re­spon­si­ble for mem­ory and cog­ni­tion are chron­i­cally en­ergy-de­prived and in­creas­ingly un­able to main­tain the ATP lev­els needed for nor­mal synap­tic func­tion.

Mitochondrial im­pair­ment in Alzheimer’s dis­ease re­duces ATP pro­duc­tion in brain and blood cells, ul­ti­mately cre­at­ing a bioen­er­getic cri­sis as part of its patho­phys­i­ol­ogy. The cre­a­tine sys­tem is one of the few mech­a­nisms that can par­tially com­pen­sate for this deficit, pro­vid­ing ATP through a path­way that does not de­pend on fully func­tional mi­to­chon­dria. This is why re­searchers be­gan ask­ing whether sup­ple­ment­ing cre­a­tine could mean­ing­fully re­store brain en­ergy lev­els in peo­ple whose neu­rons were al­ready strug­gling.

The Clinical Trial That Answered the Question

The University of Kansas Medical Center’s CABA trial, the Creatine to Augment Bioenergetics in Alzheimer’s study, published its results in Alzheimer’s and Dementia: Translational Research and Clinical Interventions in 2025. Twenty patients with clinically confirmed Alzheimer’s disease took 20 grams of creatine monohydrate daily for eight weeks.

Participants improved in cognitive function, scoring higher on sorting, reading and attention tests at the end of the eight weeks. Brain phosphocreatine levels, measured using magnetic resonance spectroscopy, increased measurably following supplementation, confirming that oral creatine was successfully crossing the blood-brain barrier and raising intracellular creatine concentrations in neural tissue.

The 2026 mul­ti­cen­ter placebo-con­trolled trial ex­tend­ing this work en­rolled 240 par­tic­i­pants with early Alzheimer’s. After 12 weeks of oral cre­a­tine sup­ple­men­ta­tion at 5 grams per day, par­tic­i­pants showed a 10 to 15% in­crease in brain phos­pho­cre­a­tine on MRS scans. Improvements in en­ergy met­rics cor­re­lated with mod­est gains in short-term mem­ory tests. The in­ter­ven­tion group showed slower de­cline on stan­dard cog­ni­tive scales by about 30% ver­sus placebo.

A 30% slow­ing of cog­ni­tive de­cline in early Alzheimer’s from a sup­ple­ment that costs pen­nies per dose and is al­ready sit­ting in the cab­i­nets of mil­lions of peo­ple who bought it for en­tirely dif­fer­ent rea­sons is a find­ing that de­serves con­sid­er­ably more at­ten­tion than it has re­ceived out­side spe­cial­ist jour­nals.

What Creatine Does for Healthy Brains

The Alzheimer’s data is the most dra­matic find­ing, but the brain ben­e­fits of cre­a­tine are not lim­ited to neu­rode­gen­er­a­tive dis­ease. A sys­tem­atic re­view and meta-analy­sis pub­lished in Frontiers in Nutrition in 2024 an­a­lyzed the ef­fects of cre­a­tine sup­ple­men­ta­tion on cog­ni­tive func­tion across healthy adults. Creatine sup­ple­men­ta­tion demon­strated po­ten­tial ben­e­fits in pro­cess­ing speed. Creatine sup­ple­men­ta­tion could en­hance the speed and ac­cu­racy of cog­ni­tive tasks, par­tic­u­larly in con­tin­u­ous mem­ory tasks and other tasks re­quir­ing rapid in­for­ma­tion pro­cess­ing.

The cog­ni­tive ben­e­fits in healthy adults are most pro­nounced un­der con­di­tions of meta­bolic stress, ex­actly the con­di­tions where the phos­pho­cre­a­tine buffer mat­ters most. Sleep de­pri­va­tion is the most ex­ten­sively stud­ied of these. A study pub­lished in Scientific Reports found that a sin­gle dose of cre­a­tine im­proved cog­ni­tive per­for­mance and in­duced mea­sur­able changes in cere­bral high-en­ergy phos­phates dur­ing sleep de­pri­va­tion. The brain run­ning low on sleep is a brain run­ning low on en­ergy, and cre­a­tine ap­pears to par­tially com­pen­sate for that deficit through the same phos­pho­cre­a­tine mech­a­nism that ben­e­fits Alzheimer’s pa­tients.

Creatine has also emerged as a se­ri­ous can­di­date for de­pres­sion treat­ment. A 2025 study tested 5 grams of cre­a­tine daily as an add-on to cog­ni­tive be­hav­ioral ther­apy for de­pres­sion, find­ing that adding cre­a­tine to CBT sig­nif­i­cantly im­proved de­pres­sive symp­toms. The bi­o­log­i­cal ra­tio­nale runs through the same en­ergy path­way. Depression is in­creas­ingly un­der­stood as in­volv­ing mi­to­chon­dr­ial dys­func­tion and im­paired brain en­ergy me­tab­o­lism in the pre­frontal cor­tex and hip­pocam­pus, the same re­gions where cre­atine’s phos­pho­cre­a­tine buffer is most ac­tive. Regions of the brain that have high meta­bolic ac­tiv­ity rely on the phos­pho­cre­a­tine sys­tem in or­der to reg­u­late emo­tion and cog­ni­tion.

The Blood-Brain Barrier Question

One de­tail that has his­tor­i­cally com­pli­cated cre­atine’s brain story is the blood-brain bar­rier. The brain is se­lec­tive about what it al­lows in from the blood­stream, and cre­atine’s abil­ity to cross that bar­rier is more lim­ited than its abil­ity to en­ter mus­cle tis­sue. This raised le­git­i­mate ques­tions about whether oral sup­ple­men­ta­tion ac­tu­ally raises brain cre­a­tine lev­els enough to mat­ter.

The CABA tri­al’s MRS imag­ing data an­swered this ques­tion di­rectly. Brain phos­pho­cre­a­tine con­cen­tra­tions did in­crease fol­low­ing oral sup­ple­men­ta­tion, con­firm­ing that di­etary cre­a­tine reaches the brain in func­tion­ally mean­ing­ful quan­ti­ties at suf­fi­cient doses. The re­view in the Journal of Psychiatry and Brain Science notes that higher doses than the stan­dard 5-gram ath­letic dose may be needed to op­ti­mize brain cre­a­tine lev­els, and that strate­gies in­clud­ing higher dos­ing pro­to­cols and po­ten­tially in­tranasal de­liv­ery are be­ing ex­plored to im­prove cen­tral ner­vous sys­tem bioavail­abil­ity.

The Supplement Nobody Told You Was a Brain Drug

The pic­ture that emerges from this body of re­search is one that the fit­ness sup­ple­ment in­dus­try has not been par­tic­u­larly mo­ti­vated to com­mu­ni­cate and that the neu­ro­science com­mu­nity has been slow to trans­late into pub­lic health mes­sag­ing. Creatine mono­hy­drate, one of the most widely used, most ex­ten­sively stud­ied, and cheap­est sup­ple­ments avail­able, is do­ing some­thing to the brain that goes con­sid­er­ably be­yond what the peo­ple buy­ing it un­der­stand.

It is rais­ing phos­pho­cre­a­tine lev­els in neu­rons. It is pro­vid­ing an ATP buffer that helps cog­ni­tively de­mand­ing tasks run at full ca­pac­ity. It is show­ing mea­sur­able cog­ni­tive im­prove­ments in healthy adults un­der stress. It is emerg­ing as a po­ten­tial ad­junct for de­pres­sion treat­ment. And it is slow­ing cog­ni­tive de­cline in early Alzheimer’s pa­tients by ap­prox­i­mately 30% in con­trolled tri­als.

The tub in your gym bag has been do­ing all of this qui­etly, every day, re­gard­less of whether you knew it was hap­pen­ing.

Sources:

1. Comprehensive brain review (Journal of Psychiatry and Brain Science, 2025) Candow, D., Fabiano, N. Creatine Supplementation: More Is Likely Better for Brain Bioenergetics, Health and Function. Journal of Psychiatry and Brain Science, 2025; 10. https://jpbs.hapres.com/htmls/JPBS_1766_Detail.html

2. CABA pilot trial (Alzheimer’s & Dementia: TRCI, 2025) Smith, A.N., Choi, I.Y., Lee, P., Sullivan, D.K., Burns, J.M., Swerdlow, R.H., et al. Creatine monohydrate pilot in Alzheimer’s: Feasibility, brain creatine, and cognition. Alzheimer’s & Dementia: Translational Research & Clinical Interventions, 2025; 11(2): e70101. DOI: 10.1002/trc2.70101 https://alz-journals.onlinelibrary.wiley.com/doi/10.1002/trc2.70101

3. Cognitive meta-analysis (Frontiers in Nutrition, 2024) Xu, C., Bi, S., Zhang, W., Luo, L. The effects of creatine supplementation on cognitive function in adults: a systematic review and meta-analysis. Frontiers in Nutrition, 2024; 11: 1424972. DOI: 10.3389/fnut.2024.1424972 https://www.frontiersin.org/journals/nutrition/articles/10.3389/fnut.2024.1424972/full

4. Creatine and depression adjunct (2025) Sherpa, et al. Creatine as add-on to cognitive behavioral therapy for depression. 2025. https://www.psychiatrypodcast.com/psychiatry-psychotherapy-podcast/episode-238-creatine-mental-health-benefits

Introducing 1-bit and Ternary Bonsai Image 4B: Image Generation for Local Devices

prismml.com

Today we’re re­leas­ing Bonsai Image 4B, a fam­ily of com­pact im­age-gen­er­a­tion mod­els de­signed to run high-qual­ity dif­fu­sion in­fer­ence on lo­cal hard­ware: from lap­tops to phones.

Bonsai Image 4B comes in two vari­ants:

1-bit Bonsai Image 4B uses bi­nary {−1, +1} trans­former weights with an FP16 group-wise scal­ing fac­tor, giv­ing 1.125 ef­fec­tive bits per weight. It tar­gets max­i­mum com­pres­sion and is the right fit when mem­ory pres­sure, band­width, and the de­ploy­ment foot­print are the pri­mary con­straints.

Ternary Bonsai Image 4B uses {−1, 0, +1} trans­former weights with an FP16 group-wise scal­ing fac­tor, giv­ing 1.71 ef­fec­tive bits per weight. The ad­di­tional zero state gives the model more rep­re­sen­ta­tional flex­i­bil­ity, im­prov­ing vi­sual qual­ity and prompt fi­delity while re­main­ing ex­tremely com­pact.
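As a rough sanity check, the two bits-per-weight figures above can be reproduced with simple accounting: the bits needed to encode one weight, plus the FP16 scale amortized over its group. The sketch below is not taken from PrismML's materials. The group size of 128 is an assumption (the post does not state it), chosen because 16/128 = 0.125 matches the 1.125 figure, and the ternary case assumes ideal log2(3) packing.

```python
import math

FP16_BITS = 16
GROUP_SIZE = 128  # assumption: the post does not state the group size


def effective_bits_per_weight(num_states: int, group_size: int = GROUP_SIZE) -> float:
    """Bits per weight for `num_states` levels plus one shared FP16 scale per group."""
    payload_bits = math.log2(num_states)      # 1 bit for {-1, +1}, ~1.585 for {-1, 0, +1}
    scale_overhead = FP16_BITS / group_size   # FP16 scale amortized over the group
    return payload_bits + scale_overhead


binary_bits = effective_bits_per_weight(2)
ternary_bits = effective_bits_per_weight(3)

print(f"1-bit:   {binary_bits:.3f} bits/weight")   # 1.125
print(f"ternary: {ternary_bits:.2f} bits/weight")  # 1.71
print(f"FP16 vs 1-bit layers: {FP16_BITS / binary_bits:.1f}x smaller")
```

Under these assumptions the accounting lands on 1.125 and 1.71 bits per weight, and the FP16-to-binary ratio comes out a little above 14x.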

The re­sult is a new de­ploy­ment regime for im­age gen­er­a­tion: ca­pa­ble out­puts, open weights, and prac­ti­cal lo­cal in­fer­ence on de­vices that were pre­vi­ously out of reach for this class of model. To our knowl­edge, Bonsai Image 4B is the first im­age model in its pa­ra­me­ter class to run di­rectly on an iPhone.

Built for lo­cal gen­er­a­tion

Local im­age gen­er­a­tion starts with a hard con­straint: the model has to fit within the de­vice’s mem­ory bud­get.

For a 4B-class im­age model, the dif­fu­sion trans­former is the largest part of the model and the part that runs re­peat­edly dur­ing gen­er­a­tion. Each de­nois­ing step in­vokes the trans­former again, so trans­former size di­rectly shapes mem­ory pres­sure, band­width de­mand, and lo­cal in­fer­ence speed.

Bonsai Image 4B is built from FLUX.2 Klein 4B. It keeps the architecture intact but changes how the transformer weights are represented. By moving those weights into binary and ternary form, Bonsai reduces the part of the image pipeline that matters most for local deployment.

Table I: Diffusion trans­former foot­print for mod­els.

The bi­nary lay­ers pro­vide roughly a 14x re­duc­tion rel­a­tive to full-pre­ci­sion trans­former weights. A small set of pre­ci­sion-sen­si­tive sup­port­ing ten­sors (~5%), called the pro­jec­tion lay­ers, re­mains in FP16 so the fi­nal 1-bit Bonsai Image 4B trans­former is 0.93 GB: an 8.3x re­duc­tion from the 7.75 GB full-pre­ci­sion FLUX.2 Klein 4B.

The ternary vari­ant fol­lows the same struc­ture. Its ternary lay­ers pro­vide roughly a 10x re­duc­tion and the fi­nal Ternary Bonsai Image 4B trans­former is 1.21 GB, a 6.4x re­duc­tion from the full-pre­ci­sion trans­former. It is slightly larger than the 1-bit model, but the ad­di­tional zero state im­proves vi­sual qual­ity and prompt fi­delity.

Including the com­pressed text en­coder and FP16 VAE, the Apple Silicon de­ploy­ment pay­load is 3.42 GB for 1-bit Bonsai Image 4B and 3.88 GB for Ternary Bonsai Image 4B. For com­par­i­son, the full pre­ci­sion FLUX.2 Klein 4B re­quires a de­ploy­ment pay­load of 15.97 GB. Since, at run­time, the text en­coder is of­floaded af­ter prompt en­cod­ing, the mean mem­ory us­age is smaller than the to­tal pay­load. When gen­er­at­ing a 512x512 im­age, the mean-ac­tive mem­ory is 1.5 GB and 1.96 GB, for the bi­nary and ternary mod­els, com­pared to 11.74 GB for the orig­i­nal FLUX.2 Klein 4B (a re­duc­tion of 7.8x and 6.0x, re­spec­tively). For a 1024x1024 im­age, the mean-ac­tive mem­ory is 1.95 GB and 2.38 GB, for the bi­nary and ternary mod­els, com­pared to 14.39 GB for the orig­i­nal FLUX.2 Klein 4B (a re­duc­tion of 7.4x and 6.0x, re­spec­tively).
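The reduction factors quoted in this section follow directly from the stated sizes. The short Python sketch below recomputes them from the figures in the text. The variable names are illustrative, and the deployment-payload ratios are derived here for comparison rather than stated in the post.

```python
# Sizes in GB, copied from the text: (full precision, 1-bit, ternary).
footprints_gb = {
    "diffusion transformer": (7.75, 0.93, 1.21),
    "deployment payload": (15.97, 3.42, 3.88),
    "mean-active memory, 512x512": (11.74, 1.50, 1.96),
    "mean-active memory, 1024x1024": (14.39, 1.95, 2.38),
}


def reduction(full_gb: float, compressed_gb: float) -> float:
    """How many times smaller the compressed footprint is than the full one."""
    return full_gb / compressed_gb


# Map each footprint to its (1-bit, ternary) reduction factors, rounded to one decimal.
ratios = {
    name: (round(reduction(full, binary), 1), round(reduction(full, ternary), 1))
    for name, (full, binary, ternary) in footprints_gb.items()
}

for name, (binary_x, ternary_x) in ratios.items():
    print(f"{name}: 1-bit {binary_x}x, ternary {ternary_x}x")
```

The transformer and mean-active rows reproduce the 8.3x/6.4x, 7.8x/6.0x and 7.4x/6.0x figures quoted above.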

This re­duc­tion in mem­ory foot­print changes where the model can run. Our de­ploy­ment stack sup­ports Apple Silicon iPhones, iPads and Macs and CUDA GPUs, us­ing MLX low-bit paths on Apple hard­ware and Gemlite low-bit GEMM ker­nels on CUDA. On iPhone 17 Pro Max, the full-pre­ci­sion FLUX.2 Klein 4B pipeline does not fit within the de­vice mem­ory bud­get, while both Bonsai Image vari­ants run on-de­vice.

Video I: Image gen­er­a­tion on Bonsai Studio

In prac­tice, Bonsai Image 4B gen­er­ates a 512x512 im­age in 9.4 sec­onds on an iPhone 17 Pro Max and about 6 sec­onds on Mac M4 Pro. On Mac M4 Pro, Bonsai Image 4B is up to 5.6x faster than the stock full-pre­ci­sion MFLUX pipeline.

Benchmarking per­for­mance

Compression only matters if the model remains useful. We evaluated Bonsai Image 4B across three complementary benchmarks: GenEval for object composition and attribute binding; HPSv3 for human preference and aesthetic quality; and DPG-Bench for dense prompt following and semantic faithfulness.

Table II: Image qual­ity bench­mark com­par­i­son across Ternary Bonsai Image 4B and other mod­els.

Ternary Bonsai Image 4B is the qual­ity-ori­ented vari­ant. At 1.21 GB, it re­tains 95% of the FLUX.2 Klein 4B ac­cu­racy across GenEval, HPSv3, and DPG-Bench, while re­duc­ing the dif­fu­sion trans­former foot­print by 6.4x.

1-bit Bonsai Image 4B is the foot­print-ori­ented vari­ant. It brings the dif­fu­sion trans­former be­low 1 GB, an 8.3x re­duc­tion, while still de­liv­er­ing strong bench­mark scores across the same three eval­u­a­tions (it re­tains 88% of the ac­cu­racy of FLUX.2 Klein 4B).

Together, the two vari­ants move the qual­ity–foot­print fron­tier. Bonsai Image re­mains com­pet­i­tive with mod­ern 4B-class im­age mod­els while us­ing a frac­tion of their dif­fu­sion-trans­former foot­print. At the same time, it sub­stan­tially out­per­forms smaller mod­els with sim­i­lar mem­ory foot­prints. That is the same Pareto shift we have seen in our prior Bonsai lan­guage mod­els. Bonsai Image brings mod­ern dif­fu­sion-trans­former be­hav­ior into a mem­ory range that pre­vi­ously be­longed to much smaller, lower-ca­pa­bil­ity mod­els.

Why this is im­por­tant

Image gen­er­a­tion is not only a model-qual­ity prob­lem. It is also a de­ploy­ment prob­lem.

Cloud APIs will con­tinue to be the right choice for many prod­ucts. But cloud-only gen­er­a­tion im­poses cer­tain prod­uct con­straints: every prompt is a re­mote re­quest, every it­er­a­tion car­ries mar­ginal serv­ing cost, and every in­ter­ac­tion adds round-trip la­tency.

That mat­ters be­cause im­age gen­er­a­tion is nat­u­rally it­er­a­tive. Users rarely stop at one im­age. They re­vise prompts, com­pare out­puts, gen­er­ate vari­a­tions, dis­card fail­ures, and try again. When each at­tempt is a server-side job, the cre­ative loop be­comes some­thing users have to me­ter and wait for.

Local inference changes that. Once the model fits on the device, generation can sit directly inside the product experience. It becomes cheaper to run, faster to iterate on, and easier to use in environments where prompts and generated assets should remain private.

Bonsai Image 4B is a step to­ward that de­ploy­ment regime: ca­pa­ble im­age gen­er­a­tion run­ning closer to the user, on hard­ware they al­ready own.

Availability

Both 1-bit and Ternary Bonsai Image 4B will be re­leased with open weights and code un­der the Apache 2.0 li­cense.

With this launch, we are also releasing Bonsai Studio, our iOS app for trying Bonsai Image 4B directly on iPhone.

Join Us

PrismML emerged from a team of Caltech re­searchers and was founded with sup­port from Khosla Ventures, Cerberus and Google. We’ve spent years tack­ling one of the field’s hard­est prob­lems: com­press­ing neural net­works with­out sac­ri­fic­ing their rea­son­ing abil­ity.

If you want to help build the next gen­er­a­tion of state-of-the-art AI, we’d love to hear from you. Check out our ca­reers page.

Resources

Whitepaper

Hugging Face

WebGPU demo

Bonsai Studio for iPhone

GitHub

"Four-Letter Word": United Airlines 767 Returns To Newark After Bluetooth Name Sparks Alert

simpleflying.com

Updated Jun 1, 2026, 4:55 AM EDT

Luke has over a decade of ex­pe­ri­ence as a travel writer and avi­a­tion an­a­lyst. As a pas­sion­ate trav­eler based across the Middle East and Asia, Luke of­fers strong in­sights into the in­dus­try. Based in South East Asia.

This ar­ti­cle was up­dated on Monday, June 1, 2026, to in­clude an of­fi­cial state­ment from United Airlines and ad­di­tional con­text on the in­ci­dent. It was orig­i­nally pub­lished on Sunday, May 31, 2026.

A United Airlines Boeing 767-400ER bound for Palma de Mallorca, Spain, made a mid-Atlantic U-turn after a passenger’s threatening Bluetooth network name triggered a security alert. Early reports indicate that a teenage passenger onboard named their device ‘BOMB,’ and the discoverable name escalated quickly into a bomb-threat response.

The crew is­sued re­peated warn­ings be­fore giv­ing all pas­sen­gers a one-minute ul­ti­ma­tum to turn their Bluetooth off, or else the air­craft would be forced to turn around. At least two de­vices were re­port­edly still ac­tive af­ter this dead­line passed, prompt­ing the flight crew to de­clare an emer­gency and di­vert to Newark.

United Airlines’ Bluetooth Threat Incident

According to flight tracking data, United Flight 236 from Newark Liberty International Airport (EWR) to Palma de Mallorca Airport (PMI) departed Newark at 6:08 PM local time, and was approximately 60 minutes into its transatlantic journey before the security situation escalated. A passenger on the flight provided more details on Reddit, stating that a flight attendant told passengers over the PA system that they must turn off Bluetooth “immediately,” or else the aircraft would have to turn around.

Date: May 30, 2026
Airline: United Airlines
Flight Code: UA236
Aircraft Type: Boeing 767-400ER (N67052)
Departure Airport: Newark Liberty International Airport (EWR)
Destination Airport: Palma de Mallorca Airport (PMI)
Fate: Returned to EWR; passengers boarded a replacement flight

This was re­peated mul­ti­ple times, with the crew even­tu­ally is­su­ing a fi­nal one-minute warn­ing. However, not all pas­sen­gers com­plied with the in­struc­tions, as there were still two ac­tive Bluetooth de­vices af­ter the ul­ti­ma­tum was is­sued. The air­craft sub­se­quently squawked 7700 (the code for a gen­eral emer­gency) and turned around, land­ing back in EWR at 8:50 PM af­ter spend­ing al­most three hours in the air.

Bluetooth Device Name Set To BOMB

As per recordings from LiveATC.net, a member of United’s ground team said that the Bluetooth name had been set to a “four-letter word,” later reported by AirLive as ‘BOMB.’ Passengers on the flight were reportedly told that “up to ten agents” would be waiting for the aircraft in Newark to determine the origin of the threat.

Those onboard were also instructed to leave all their belongings on the aircraft before deplaning. Saturday’s incident has parallels with another security scare that occurred on a United flight earlier this month. During this incident, a Wi-Fi hotspot named “Free Palestine, F Zionists” prompted the pilot to issue a warning to the cabin, telling the passenger responsible that they had “30 seconds” to remove the name or the FBI would meet the aircraft.

Additionally, in April, two United flights were evac­u­ated on back-to-back days due to bomb threats, demon­strat­ing how se­ri­ously these in­ci­dents are taken. Though some have ques­tioned why any­one in­tend­ing to blow up a plane would broad­cast the word bomb, many ter­ror­ist acts have re­lied on the threat of a bomb as lever­age dur­ing at­tempted hi­jack­ings or hostage sit­u­a­tions.

Passengers Board Replacement Flight

Passengers on the flight ar­rived back in Newark just be­fore 9:00 PM on Saturday evening, and were met by a sig­nif­i­cant con­tin­gent of lo­cal and fed­eral law en­force­ment. They were asked to take only their pass­ports and phones with them, leav­ing their cabin bags on the air­craft. After spend­ing sev­eral hours on the ground as se­cu­rity teams com­pleted their sweep, trav­el­ers would even­tu­ally de­part Newark on a re­place­ment flight in the early hours.

The replacement flight was operated by the same aircraft, a Boeing 767-400ER (registration N67052), but did not take off until around 2:30 AM the next day. At the time of original publication, the flight was over the Atlantic and was expected to land in Palma de Mallorca in the afternoon local time. Before passengers could board this flight, they were required to pass through TSA security for a second time.

United Airlines Responds

Simple Flying contacted United for comment on this incident. A spokesperson confirmed that Flight UA236 returned to Newark “to address a potential security concern.” The airline added that there were a total of 190 passengers on the flight, as well as 12 crew members. These passengers eventually arrived in Spain at 3:41 PM local time the next day, representing a delay of over nine hours.

It has now been re­ported by var­i­ous out­lets, in­clud­ing the New York Post, that the de­vice re­spon­si­ble for the threat­en­ing Bluetooth name was a Fitbit. This is a wear­able smart­watch and fit­ness tracker that comes with Bluetooth ca­pa­bil­ity to sync with other de­vices, such as phones or com­put­ers. The 16-year-old owner and the de­vice were not deemed a threat by au­thor­i­ties.

the solution might be cancelling my AI subscription

thoughts.hmmz.org

I am try­ing to think of a list of all the won­der­ful things I’ve built with AI:

a speech recog­ni­tion sys­tem in rust

an email archive ren­der­ing + quote col­laps­ing tool

a jel­lyfin desk­top clone with gstreamer and qt quick

an in­vid­i­ous clone in python + yt-dlp

a faith­ful Windows 95 notepad.exe clone in fltk ported from the Wine sources

a ma­chine vi­sion thing to count traf­fic flows from pub­lic street cam­eras in opencv

a claude ui clone in python or rust i think, i don’t even re­mem­ber

a re­gional news site i never meant to build that is ac­tu­ally get­ting traf­fic, python/​flask

a 3d car game built on the pro­to­col for an ex­ist­ing mul­ti­player game in three.js

an in­vest­ment back­tester in python

a html clone of the light­room ui, mar­velled at the re­sult then never made the back­end

a mark­down viewer in qt or gtk or some­thing else i can’t even re­mem­ber

a re­place­ment world clock wid­get for my lap­top desk­top en­vi­ron­ment in gtk and C

a javascript net­work syn­chro­nised au­dio play­back thing

a rust client for a chi­nese IP cam­era re­versed from its Android app

a size­able SaaS in rust

maybe 50 other pro­jects i’ve al­ready deleted

Except for the SaaS, almost none of this is useful and I don’t want to maintain any of it. I accidentally run a news outlet which is surely a liability. Sure, it has helped me learn “AI tooling” and I use many of these tools, but I didn’t need them. I can’t afford to maintain any of them, not in terms of time, commitment, belief, attention or willingness to spend on tokens.

I didn’t mean to build most of these things. Usually the Claude session started with something like “write a quick script for X”, and one hour later the result is not a quick script for X, nor in the usual case is my problem solved, whatever the original itch happened to be.

at­ten­tion is all you need

On that last point, this technology is horrific for attention. It’s a thermonuclear ADHD amplifier and I have seen the same effect in every single one of my adult friends. Folk running 3 screens simultaneously working on totally unrelated “projects” they have little hope of maintaining, and so little commitment to the outcome that the time is obviously wasted.

In recent times, at least once per month someone sends a screenshot of an awesome tool they are working on. I’m like “whoa, that’s really something” and the sender is obviously proud and enthusiastic. I try not to ask, but am always thinking “and where will you market it?”, because when the question is asked of an engineer, the answer is unchanged since before LLMs existed.

I recently interviewed, and when the topic of AI usage came up, the host answered something like “oh we’re quite light on it, everyone has up to 5 rooms where they manage their agents” and I immediately felt a tightness in my stomach.

I had a vague sense of the ef­fect a few months into us­ing Claude. Later I re­duced my sub­scrip­tion to Pro in the be­lief a quota re­stric­tion would mit­i­gate ex­ces­sive use. Then Claude went through a bad ser­vice pe­riod and I moved to Codex. Codex’s CLI is much nicer than Claude’s and no­tice­ably faster. And us­age started creep­ing back up.

The tech­nol­ogy, when honed, is gen­uinely amaz­ing. Ask it to zero shot a parser for an es­o­teric gram­mar im­ple­mented in an es­o­teric lan­guage with full tests and it’s done. The tool­ing as it ex­ists to­day pro­motes ab­solutely noth­ing like the fo­cus re­quired to ap­ply it ju­di­ciously.

Almost every ven­dor and every tool in­tends to do ex­actly the op­po­site: more us­age, more to­kens, more out­put. Ask a sim­ple yes/​no ques­tion of ChatGPT and you can clearly see that it is hard-wired to in­clude a rel­e­vant fol­low-up ques­tion to pro­mote ex­ces­sive in­ter­ac­tion.

Slopping out a 10,000 LOC untested Python/JS mess in 5 min­utes helps no­body. The thought of this hap­pen­ing in every com­mer­cial en­vi­ron­ment si­mul­ta­ne­ously is hor­ri­fy­ing.

fric­tion = fo­cus, fo­cus = prod­uct

One of my early AI ex­per­i­ments, ex­plor­ing AI as a lens in Marshall McLuhan-like think­ing, was to con­nect speech recog­ni­tion to a pipeline that gen­er­ated blog posts on the other side, in the be­lief it would en­cour­age me to cap­ture my thoughts. All I needed was to press the voice note but­ton in a Telegram chan­nel, and out pops an Opus-formatted post.

The out­put was un­bri­dled garbage. Because the ef­fort was re­moved, so was the com­mit­ment, and with the com­mit­ment the fo­cus, and with the fo­cus any mean­ing­ful prod­uct at all. Quality writ­ing is not con­ver­sa­tional English sim­ply cast through a lens: con­ver­sa­tional English is low-bit rate noise, qual­ity writ­ing at­tempts to cap­ture high bit rate in­for­ma­tion with bet­ter formed con­cepts, and this should have been ob­vi­ous be­fore I be­gan.

I looked at re­pur­pos­ing the pipeline to cap­ture pri­vate notes, but I have no need for pri­vate notes. It sub­verts the nat­ural process of noise be­ing for­got­ten. It is just more ex­cess tool use.

Following from this, for as long as qual­ity mat­ters, I be­lieve hand­writ­ing can never be ob­so­lete.

It feels like we’re heading towards crisis, and I doubt the answer is “better models” or “better tooling”. Cal Newport relates this to pseudo-productivity:

The speaker argues that digital productivity tools, including AI and email, often create a “digital productivity paradox”: they make individual tasks faster or easier, but they can leave knowledge workers busier, more distracted, and less productive overall. He cites research showing that AI users spent much more time in email, messaging, chat, and business-management tools, while spending less time in focused, uninterrupted work. His central claim is that tools designed to reduce friction often increase the volume of shallow tasks and context switching, which weakens deep work and high-value output.

He explains that this happens because knowledge work often relies on “pseudo productivity,” where visible busyness is treated as a proxy for real value. Digital tools reinforce this by making people look active: sending more messages, producing more drafts, attending more meetings, and generating more work artifacts. To avoid the trap, he recommends measuring real outcomes, identifying the true bottlenecks in one’s work, and separating deep work from shallow work so that digital tools support meaningful progress instead of consuming attention.

🤖

These ex­pe­ri­ences have opened a new per­cep­tion of all tool use, be­cause be­neath it all this is not about faster de­vel­op­ment = more apps or faster email = more com­mu­ni­ca­tion be­ing a de­sir­able goal. Generically, it’s about a unit time of life and how it is spent mean­ing­fully.

I have no idea how to man­age AI at pre­sent ex­cept by cur­tail­ing use, be­cause a tool pro­duc­ing a cheap re­ward with min­i­mal in­put and no fric­tion can only be a li­a­bil­ity, and achiev­ing that re­al­i­sa­tion is prob­a­bly the only real con­tri­bu­tion of AI to date.

David, Sun 31 May 14:31:04 2026

Chuwi Minibook X: the netbook we deserve

tylercipriani.com

Netbooks are dead, but the Chuwi Minibook X scratches the same itch.

The Minibook X is a 10.5″ x86_64 sub-ultrabook with 16GB RAM, a 512GB NVMe drive, and only one majorly annoying Linux quirk.

I needed a knock-around lap­top, so I bought my­self a Minibook for my birth­day last year. The more I tote it around, the more fun I’m hav­ing with this ridicu­lous lit­tle com­puter.

Quick specs

Much like the net­books of yore, the Minibook is a bud­get ma­chine. But it’s 2026, so even bud­get ma­chines pack more oomph than I need from a util­ity lap­top.

CPU 4-core/4-thread 3.6GHz Intel N150 Twin Lake

16 GB RAM (LPDDR5-6400, soldered 😿)

512GB NVMe — upgrad­able

10.51” IPS 2K 16:10 screen

28.88Wh Li-Ion bat­tery

Weight: 911g

Ports: 2×USB-C (1×PD charg­ing)

Cost: $350

One odd­ity is that the Minibook comes bun­dled with a 12V/2A USB-C charger. I chucked the charger; I wor­ried I’d fry some 5V SoC some­day. The Minibook works fine with a PD charger.

I’d as­sume the 12V charger was a cost-sav­ing choice, but it also cre­ates some weird pos­si­bil­i­ties for DC/off-grid se­tups.

Linux and weird­ness: side­ways pan­els and ker­nel pa­ra­me­ters

The fediverse told me that the Minibook runs Linux “boringly well,” which was almost true.

I tried Debian, then jumped to NixOS for kicks.

What works:

Camera/Microphone/Speakers

Touchscreen

Sleep/Suspend

Hibernate

Keyboard back­light

USB-C HDMI

Bluetooth (non-free blobs — Intel)

Wi-Fi 6 (non-free blobs — Intel)

But on first boot, the screen ori­en­ta­tion is 270° clock­wise:

The Chuwi’s screen is a panel from a cheap tablet; the screen ro­ta­tion is­sue is a hard­ware prob­lem (the screen is mounted side­ways). To fix the screen’s ro­ta­tion, I had to tweak screen ori­en­ta­tion at every soft­ware layer. Fixing this prob­lem was a jour­ney:

Bootloader: Switched from systemd-boot to GRUB, carrying some unmerged GRUB rotation patches on top.

Initrd: Tell the Intel display driver about the panel orientation via a kernel parameter, and force the Intel driver to load in the initramfs. On NixOS: boot.kernelParams = [ "video=DSI-1:panel_orientation=right_side_up" ]; and boot.initrd.kernelModules = [ "i915" ]; (see Kernel docs for modedb default video mode support)

Desktop environment: For X11, good ole xrandr --output DSI-1 --rotate right. Wayland picked this up from the DRM connector. This one was easy.

Framebuffer: Ensure all TTYs have the proper orientation by adding fbcon=rotate:1 to the kernel parameters. On NixOS: boot.kernelParams = [ "fbcon=rotate:1" ]; (see Kernel docs for framebuffer console boot options)

Behold, the fi­nal re­sult in all its glory:

Size, weight, and build

This com­puter is mind-bog­glingly small. The build is sturdy and totable; it’ll hold up to a back­pack jostling.

The lap­top’s case is MacBook-esque: alu­minum and good-look­ing. The MacBook Air’s di­men­sions dwarf the Chuwi’s, but the two lap­tops are about the same thick­ness.

A note­book that weighs more than a kilo is sim­ply not a good thing — Linus Torvalds

The Minibook weighs in just shy of a kilo at 912 grams.

Perf, ther­mals, and power

tl;dr: you get what you pay for. But bat­tery life and cool­ing are bet­ter than I’d have guessed.

The Minibook X was never go­ing to com­pile the Linux ker­nel in record time. But the per­for­mance matches the specs, it stays cool, and it has enough bat­tery life to run a movie marathon.

Numbers:

Geekbench6 (a fun side-quest to get run­ning on NixOS), bet­ter than I ex­pected.

Single-core: 1295

Multi-core: 3332

Wi-Fi 6 speed: 424 Mbps, more than enough to stream a 4K movie.

Power

Idle: 3.8W

During bench­mark: ~15W

Battery: When I left the 1995 classic film “Hackers” looping in VLC, the battery lasted about 6 hours.
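As a rough cross-check of the power and battery figures, the sketch below derives the average draw implied by the playback test and the runtime the idle figure would allow. It assumes the full 28.88Wh is usable and that draw is constant, neither of which the post measures.

```python
# Figures quoted in the post.
BATTERY_WH = 28.88     # battery capacity from the spec list
PLAYBACK_HOURS = 6.0   # approximate runtime while looping a film in VLC
IDLE_WATTS = 3.8       # measured idle draw


def average_draw_watts(capacity_wh: float, hours: float) -> float:
    """Average power implied by draining the full capacity over `hours`."""
    return capacity_wh / hours


def runtime_hours(capacity_wh: float, watts: float) -> float:
    """Runtime if the machine drew a constant `watts` until empty."""
    return capacity_wh / watts


playback_watts = average_draw_watts(BATTERY_WH, PLAYBACK_HOURS)
idle_hours = runtime_hours(BATTERY_WH, IDLE_WATTS)
print(f"video playback: ~{playback_watts:.1f} W average")
print(f"idle ceiling:   ~{idle_hours:.1f} h")
```

Under those assumptions, video playback sits only about a watt above idle, which is consistent with the six-hour result.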

Heat: Running stress-ng for 10 min­utes, the hottest part of the lap­top chas­sis re­mained be­low 90°F (32°C):

What I dis­like

There’s so much to dis­like about this lap­top:

Screen is ter­ri­ble — 2K? 50Hz re­fresh rate? Why!?

Keyboard is ter­ri­ble — it only reg­is­ters key­strokes when you hit the ex­act cen­ter of each key.

Touchpad is ter­ri­ble — It’s a div­ing board-style, with­out phys­i­cal but­tons.

Sound is meh — I can hear the tinny lap­top speaker fine, but it’s un­der­whelm­ing. I’ve never tried tweak­ing it in Pipewire, though; it’s pos­si­ble it could be bet­ter.

But “terrible” is in comparison to the nicest modern laptops in existence. Everything I listed here works fine. I’m honestly blown away when I tune my expectations to the sub-$400 laptop range.

Verdict

In The Death and Life of Great American Cities, Jane Jacobs wrote, “new ideas require old buildings”: cheap spaces let people try risky ideas.

The Chuwi Minibook X is an old build­ing.

I can brick the Minibook and have a nor­mal Monday on my se­ri­ous work lap­top. Nothing has to work, which makes it per­fect to try out new Linux desk­top stuff:

NixOS — I’ve been us­ing Debian for 15 years+, fig­ured I’d try join­ing the NixOS cult for a while.

RiverWM — I’m on a quest to find the Wayland ver­sion of XMonad; River is pretty close.

KDE Plasma — I’ve used a tiling win­dow man­ager for over a decade. What’s it like to use a desk­top that Just Works™?

Steam — Never been much into games, but I de­cided to give Steam a try since, well, why not?

Cheap, weird com­put­ers like the Chuwi make it safe to play. And play­ing with com­put­ers is still fun.

I Put a Datacenter GPU in My Gaming PC for £200

blog.tymscar.com

I al­ready had an RTX 4080. 16GB of VRAM. Good enough for gam­ing, not good enough for the mod­els I wanted to run lo­cally. The next step up in GPU land is ei­ther spend a for­tune on a card with more VRAM, or find an­other way.

I found an­other way.

I bought a dat­a­cen­ter GPU that does­n’t even have a nor­mal PCIe con­nec­tor, stuck it in my gam­ing PC with an adapter, and now I have 32GB of VRAM across two GPUs run­ning a 27 bil­lion pa­ra­me­ter model at 32 to­kens per sec­ond. The whole thing cost me £200.

The GPU

This is a Tesla V100 SXM2 16GB. It was designed for NVIDIA’s DGX servers and hyperscaler racks. The SXM2 form factor means it does not have a PCIe edge connector. It does not have display outputs. It does not have a normal power connector. It sits on a proprietary board inside a server rack and communicates over NVLink.

You can­not plug this into a moth­er­board. Not with­out help.

But here is the thing: this is a Volta GPU with 16GB of HBM2 mem­ory, 5120 CUDA cores, and I picked it up for about £150 on eBay. The com­pute is still real. The VRAM is still real. And the mem­ory band­width is where it gets gen­uinely sur­pris­ing.

HBM2 is a dif­fer­ent class of mem­ory. The V100 has a 4096-bit mem­ory bus de­liv­er­ing 900 GB/s of band­width. To put that in per­spec­tive, my RTX 4080 with its fancy GDDR6X man­ages 736 GB/s. The V100 from 2017 has 22% more mem­ory band­width than a GPU that launched in 2022.

And it is not just NVIDIA’s consumer cards that lose. Apple’s M3 Max does 400 GB/s. The M4 Max does 546 GB/s. The brand new M5 Max, which will set you back over £3,000 for a laptop, manages 614 GB/s. A GPU from 2017 beats every Mac on the market.

The clos­est AMD com­pe­ti­tion to my 4080 is the RX 7900 XTX, which does 960 GB/s on its 24GB of GDDR6. Technically that edges out the V100, but the 7900 XTX costs £700+ and ROCm sup­port for LLM in­fer­ence is still rough com­pared to CUDA. The V100 gives you 94% of that band­width for less than a quar­ter of the price, and it just works with llama.cpp.

The only con­sumer GPU that com­fort­ably beats it is the RTX 5090 at 1,792 GB/s, and that card costs over £2,000. For LLM in­fer­ence, where mem­ory band­width is the bot­tle­neck that de­ter­mines your to­kens per sec­ond, this mat­ters more than al­most any­thing else.
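To make the comparison easy to re-check, here is a small sketch that recomputes the ratios from the bandwidth figures exactly as quoted in the post. The numbers are the post's own and are not independently verified here.

```python
# Memory bandwidth in GB/s, exactly as quoted in the post.
BANDWIDTH_GBPS = {
    "Tesla V100 SXM2": 900,
    "RTX 4080": 736,
    "M3 Max": 400,
    "M4 Max": 546,
    "M5 Max": 614,
    "RX 7900 XTX": 960,
    "RTX 5090": 1792,
}

V100 = "Tesla V100 SXM2"


def v100_relative_to(other: str) -> float:
    """V100 bandwidth expressed as a multiple of another device's bandwidth."""
    return BANDWIDTH_GBPS[V100] / BANDWIDTH_GBPS[other]


for name in BANDWIDTH_GBPS:
    if name != V100:
        print(f"V100 vs {name}: {v100_relative_to(name):.2f}x")
```

A ratio above 1.0 means the V100 has more bandwidth than the named device. The 4080 and 7900 XTX rows reproduce the "22% more" and "94%" figures in the text.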

The only prob­lem is the con­nec­tor.

The adapter

Turns out, some­one makes an SXM2-to-PCIe adapter. It is not made by NVIDIA. It is not of­fi­cially sup­ported by any­one. It is a bare PCB with the SXM2 socket on one side and a PCIe edge con­nec­tor on the other. I paid about £50 for it. Half of that might just be the cop­per.

So for about £200 to­tal, I had a 16GB VRAM GPU that could slot into my moth­er­board along­side my RTX 4080. That is 32GB of to­tal VRAM. A sin­gle RTX 5090 with 32GB costs over £2,000. I am not say­ing this is the same ex­pe­ri­ence. I am say­ing the VRAM is the same.

The fan from hell

Before I could do any­thing use­ful with the V100, I had to deal with the fan.

The V100 SXM2 was de­signed to live in­side a 2U server with in­dus­trial cool­ing. The fan on the adapter is not sub­tle. It is not quiet. It is not some­thing you want in a room you also sleep in.

I mea­sured it with my Apple Watch:

82 decibels. That is somewhere between a garbage disposal and a lawnmower, well past “loud PC” and into “should I be wearing earplugs in my own house” territory.

And the worst part: you can­not con­trol it. I tried nvidia-smi, I tried scan­ning for it on Linux, I even tried Afterburner on Windows (more on that later, the whole setup barely works on Windows). Nothing. The fan on this adapter is not de­signed to be con­trolled. It is de­signed to run at 100%, for­ever, in­side a server rack where no­body has to hear it.

Here is me try­ing to fig­ure out the fan pinout. I guessed it might be a stan­dard case fan pinout on a weird con­nec­tor, so I jammed two jumper wires into VCC and ground and prod­ded a 9V bat­tery against them. It spun. And it was so much qui­eter than the 12V it nor­mally gets:

That con­firmed the pinout and gave me hope that the fan could ac­tu­ally be tamed.

Making the fan listen to reason

The 9V bat­tery test told me the pinout was stan­dard case fan ter­ri­tory, just with a weird con­nec­tor. The next ques­tion was whether the fan would ac­tu­ally re­spond to PWM con­trol if I wired the tachome­ter and PWM pins to my moth­er­board.

So I shoved some jumper wires into the con­nec­tor and jammed the other ends into a spare fan header (turn your vol­ume up):

It works. The motherboard can read the RPM and the fan responds to PWM. I keep it at 10%. The GPU never goes above 50°C even at full load, and I cannot really hear the fan.

Now I just needed a proper ca­ble in­stead of jumper wires held in by hope.

The fan connector on the adapter is a small four-pin JST PH2.0 plug with a 2.0mm pitch. Motherboard fan headers use a standard 0.1 inch (2.54mm) pitch, so the adapter’s pins are closer together and the plug is smaller.

The so­lu­tion was a 2.54mm male to PH2.0 fe­male jumper ca­ble. The fe­male PH2.0 end plugs into the fan’s tachome­ter and PWM pins, and the male 2.54mm end goes into a spare fan header on the moth­er­board:

That went from 82dB ear dam­age to some­thing I can ac­tu­ally live with.

Doubling VRAM for cheap

With the fan sit­u­a­tion han­dled, the V100 slot­ted right in along­side my 4080:

RTX 4080: 16GB VRAM, Ada ar­chi­tec­ture

Tesla V100: 16GB VRAM, Volta ar­chi­tec­ture

Total: 32GB VRAM across two GPUs

llama.cpp can split the model across both GPUs us­ing ten­sor split­ting. It pipelines the lay­ers across the PCIe bus so the 4080 han­dles some lay­ers and the V100 han­dles the rest. It is not as fast as hav­ing a sin­gle GPU with 32GB, but it works, and it cost me roughly 10% of what a 32GB GPU would cost. For what it is worth, the most I have ever seen the V100 pull is around 150W. That is not noth­ing, but it is not out of this world for a GPU run­ning lo­cal LLM in­fer­ence.

But wait, you can go bigger

The V100 also comes in a 32GB vari­ant. It costs more than dou­ble what I paid, but we are still talk­ing about a few hun­dred pounds for 32GB of HBM2 mem­ory on a sin­gle card. Two of those would give you 64GB of VRAM for roughly 20% of what an RTX 5090 costs in to­day’s mar­ket.

You can also clus­ter them. The SXM2 for­mat sup­ports NVLink na­tively, which means if you are build­ing a proper multi-GPU setup, these cards can talk to each other at very high band­width. Even through the PCIe adapter, the ten­sor split per­for­mance is solid.

The software side

This part was sur­pris­ingly smooth thanks to NixOS. The V100 is a Volta chip. NVIDIA dropped Volta sup­port start­ing with dri­ver branch 560. The last dri­ver that sup­ports both my RTX 4080 (Ada) and the V100 (Volta) is branch 550.x, which maps to nvidi­a­Pack­ages.lega­cy_535 on NixOS.

That dri­ver only sup­ports CUDA up to 12.2. Current nix­p­kgs ships CUDA 12.6 min­i­mum. So I had to pull CUDA 12.2 from nix­p­kgs 24.05.

Also, the dri­ver re­quires ker­nel 6.6. Newer ker­nels are not sup­ported with the legacy dri­ver.

And here is a weird one: even though this is a head­less in­fer­ence server, ser­vices.xserver.en­able = true is re­quired. Without it, the NVIDIA ker­nel mod­ules do not load.

NixOS made most of this straight­for­ward. Here is the key con­fig­u­ra­tion for get­ting the dri­ver and ker­nel right:

boot.kernelPackages = pkgs.linuxPackages_6_6;
hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.legacy_535;
services.xserver.enable = true;
services.xserver.videoDrivers = [ "nvidia" ];

And for load­ing CUDA 12.2 from an older nix­p­kgs since the cur­rent one only ships 12.6+:

nixpkgs.overlays = [
  (final: prev: {
    cudaPackages_12_2 = nixpkgs-cuda.legacyPackages.${prev.system}.cudaPackages_12_2;
  })
];

The im­por­tant thing is: it works. Both GPUs show up, CUDA is func­tional, and NixOS han­dled the whole thing el­e­gantly. If you want to repli­cate this, the en­tire ma­chine de­f­i­n­i­tion is in this com­mit on my dot­files repo, in­clud­ing the llama.cpp ser­vice de­f­i­n­i­tion and the cus­tom build pinned to the right ver­sion.

Running the model

I am running Qwen3.6-27B-MTP quantized at Q5_K_M, which comes in at about 19GB. With both GPUs, the entire model fits in VRAM with room for context:

And the per­for­mance:

32 to­kens per sec­ond is fast enough for in­ter­ac­tive use. It is faster than most cloud API end­points when you fac­tor in net­work la­tency. And this is with ten­sor split­ting across two dif­fer­ent GPU ar­chi­tec­tures con­nected by PCIe.
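A back-of-the-envelope way to place the 32 tok/s figure is a bandwidth-only ceiling. The sketch below assumes that every generated token reads all 19GB of weights once, that the layers are split evenly between the two cards, and that the cards work one after the other. It ignores PCIe transfers, KV-cache reads, and compute, so it is an optimistic bound, not a prediction.

```python
MODEL_GB = 19.0  # Q5_K_M model size quoted in the post
# Memory bandwidth in GB/s, as quoted earlier in the post.
GPU_BANDWIDTH = {"RTX 4080": 736.0, "Tesla V100": 900.0}


def decode_ceiling_tokens_per_s(model_gb, bandwidths_gbps, shares):
    """Upper bound on tokens/s when each GPU streams its share of the weights in turn."""
    seconds_per_token = sum(
        model_gb * share / bandwidth
        for share, bandwidth in zip(shares, bandwidths_gbps)
    )
    return 1.0 / seconds_per_token


# Assumption: half of the weights live on each card.
even_split = [0.5, 0.5]
ceiling = decode_ceiling_tokens_per_s(
    MODEL_GB, list(GPU_BANDWIDTH.values()), even_split
)
print(f"bandwidth-only ceiling: ~{ceiling:.0f} tok/s (measured: 32 tok/s)")
```

Under these assumptions the measured rate lands within the bound, which fits the claim that memory bandwidth is the dominant limit.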

This model is actually good

I want to be clear about something. This is not “good for a local model.” This is not “acceptable if you lower your expectations.” Qwen3.6-27B ties with Claude Sonnet 4.6 on Artificial Analysis’s Agentic Index. It beats Sonnet 4.6 on MMMU-Pro and Terminal-Bench 2.0. A 27 billion parameter model running on secondhand hardware is genuinely competitive with the latest cloud models from Anthropic.

Yes, Sonnet 4.6 edges it out on GPQA and SWE-Bench Verified. It should, it is a mas­sive pro­pri­etary model. And yes, if you want the ab­solute best, Opus 4.8 ex­ists. It also costs more per 20 min­utes of heavy use than I paid for this en­tire GPU and adapter setup com­bined. But the gap is shock­ingly small. We have reached the point where the model you run in your bed­room is in the same con­ver­sa­tion as the ones that charge you per to­ken.

Multi-Token Prediction

The MTP in the model name stands for Multi-Token Prediction. Normal LLM in­fer­ence pre­dicts one to­ken at a time. Predict one to­ken, ac­cept it, pre­dict the next to­ken, re­peat. MTP changes this by hav­ing the model pre­dict sev­eral fu­ture to­kens at once, then ver­i­fy­ing which ones were cor­rect. Accepted to­kens are es­sen­tially free. Wrong pre­dic­tions fall back to the nor­mal path.

The result is roughly 1.5-2x faster generation with no accuracy loss. On my setup that means inference goes from around 32 tok/s to potentially 50-60 tok/s when MTP hits its stride, especially on predictable output like code.
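The draft-then-verify loop can be illustrated with a toy sketch. Both "models" below are hypothetical stand-ins (simple functions of the context length), not Qwen or llama.cpp code. The point is only to show why accepted tokens are cheap and why the output matches plain one-token-at-a-time decoding.

```python
from typing import Callable, List, Tuple

Token = int
Model = Callable[[List[Token]], Token]


def target_model(context: List[Token]) -> Token:
    """Toy stand-in for the full model: the next token depends on context length."""
    return (len(context) * 7) % 10


def draft_model(context: List[Token]) -> Token:
    """Toy drafter: agrees with the target except at every fifth position."""
    guess = target_model(context)
    return (guess + 1) % 10 if len(context) % 5 == 4 else guess


def generate(target: Model, draft: Model, n_new: int, draft_len: int) -> Tuple[List[Token], int]:
    """Generate n_new tokens, counting how many verifier passes were needed."""
    out: List[Token] = []
    passes = 0
    while len(out) < n_new:
        # 1. Cheaply propose several future tokens.
        proposal: List[Token] = []
        for _ in range(draft_len):
            proposal.append(draft(out + proposal))
        # 2. One verifier pass checks every proposed position.
        #    (A real engine scores all positions in a single batched pass.)
        passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # wrong guess: keep the verifier's token and stop
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))  # every guess held: one bonus token
        out.extend(accepted)
    return out[:n_new], passes


tokens, verifier_passes = generate(target_model, draft_model, n_new=20, draft_len=3)
print(f"{len(tokens)} tokens in {verifier_passes} verifier passes")
```

Because every kept token is either confirmed or supplied by the verifier, the sequence is identical to greedy decoding with the full model. Only the number of expensive passes changes.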

The catch is that MTP sup­port in llama.cpp is new. The ver­sion in nix­p­kgs does not sup­port the Qwen3.6 MTP ar­chi­tec­ture, so I had to build llama.cpp from source at a spe­cific com­mit that added sup­port. On NixOS this is pain­less. I have a cus­tom de­riva­tion pinned to the right com­mit, and the whole thing is re­pro­ducible. When I want to up­date the model or change the llama.cpp ver­sion, I change one line in my con­fig, run nixos-re­build switch, and I am done. No de­pen­dency hell, no re­in­stalling by hand, no won­der­ing whether I built against the right CUDA ver­sion.

Vision: how the model sees images

The Qwen3.6-27B model supports image input through a separate multimodal projector file (mmproj). This is about 928MB extra, and it is fascinating.

The way it works is that a vision encoder (similar to what ChatGPT and Claude use) takes image pixels and translates them into the LLM’s token embedding space. The model does not “see” the image the way a human does. Instead, the vision encoder compresses the image into a sequence of vectors that live in the same mathematical space as text tokens. The LLM then processes those vectors as if they were just another sequence of tokens.

What this means in prac­tice: you send the model an im­age URL along­side your text prompt, and it can de­scribe, an­a­lyze, and rea­son about what it sees. The en­tire vi­sion ca­pa­bil­ity adds about 1GB to the model size. That is it. One gi­ga­byte and your lo­cal LLM can read im­ages.

In llama.cpp, the flags are straight­for­ward:

--mmproj /mnt/nas/llamacpp/mmproj-F16.gguf --mmproj-offload

The --mmproj-offload flag loads the vision encoder onto the GPU alongside the model, so you still get fast inference even with images.

Running it through OpenCode

I use this setup with OpenCode, which is an AI cod­ing as­sis­tant that can run against lo­cal mod­els. The LLM server runs on my desk­top, but I do not use it from that ma­chine. I use it from any other ma­chine in my house over the net­work, or from out­side over Tailscale (but that is a blog post for an­other time). Pointing OpenCode at the llama.cpp server is as sim­ple as set­ting the API URL. The model runs lo­cally, the re­sponses are fast, and noth­ing leaves my net­work.

The NAS and the USB drive

All the mod­els live on my TrueNAS server, mounted via NFS:

fileSystems."/mnt/nas" = {
  device = "truenas-nfs.tymscar.com:/mnt/oasis/services";
  fsType = "nfs";
  options = [ "nfsvers=4" "_netdev" "auto" "nofail" ];
};

The llama.cpp ser­vice de­pends on mnt-nas.mount, so it does not start un­til the NAS is avail­able. This means I can store ter­abytes of mod­els with­out wor­ry­ing about lo­cal disk space.

The en­tire OS runs from a Corsair MP600 MINI in a DockCase USB-C NVMe en­clo­sure. No in­ter­nal drive mod­i­fi­ca­tion needed. When I want to game, I un­plug the drive and re­boot into my main Windows in­stall, and game nor­mally on the 4080. When I want to do LLM stuff, I plug the drive back in, re­boot into NixOS, and both GPUs are avail­able.

This is not as el­e­gant as a dual-boot menu, but it is sim­ple and it works. No GRUB, no boot­loader con­flicts, no par­ti­tion man­age­ment. Just a phys­i­cal switch.

The one annoying thing

The V100 oc­ca­sion­ally dis­ap­pears from lspci and nvidia-smi af­ter a warm re­boot (where the OS restarts but the moth­er­board stays pow­ered). This seems to be an ACPI enu­mer­a­tion is­sue with the PCIe slot. A cold re­boot (physically power off, wait a few sec­onds, power back on) al­ways re­stores it.

When the V100 is ab­sent, llama.cpp fails to start be­cause it can­not fit the model on a sin­gle 16GB GPU. The ser­vice crash-loops un­til the GPU comes back. This is not a big deal in prac­tice since I am usu­ally around when I re­boot, but it is worth know­ing about. It gives me the same vibes as the in­fa­mous AMD GPU re­set bug, where pass­ing through an AMD GPU to a VM and then shut­ting it down leaves the GPU in a state that only a full host power cy­cle can fix.

What I ended up with

For £200, I got:

A 16GB dat­a­cen­ter GPU run­ning along­side my gam­ing GPU

32GB to­tal VRAM for lo­cal LLM in­fer­ence

32 to­kens per sec­ond on a 27B pa­ra­me­ter model

128k to­ken con­text win­dow

Vision sup­port for im­age in­put

A model that runs com­pletely lo­cally, no cloud, no per-to­ken costs

The only real cost was the noise, and I solved that with £2 worth of jumper ca­bles and a bit of con­nec­tor spelunk­ing. The V100 is not the fastest GPU for in­fer­ence, and the ten­sor split across two dif­fer­ent ar­chi­tec­tures is not as clean as a sin­gle GPU. But for the price, it is ab­surdly good value.

If you want to run proper mod­els lo­cally, look at the sec­ond­hand server GPU mar­ket. You do not even need an ex­ist­ing GPU. I hap­pen to have a 4080 in my gam­ing PC, but a sin­gle V100 in a cheap server box would give you 16GB of VRAM and a per­fectly us­able lo­cal LLM for very lit­tle money. The V100 SXM2 is not the only op­tion. The P40 gives you 24GB for sim­i­lar money, though it is slower and has no Tensor Cores. The V100 32GB vari­ant costs more but still un­der­cuts any con­sumer GPU with that much VRAM.

Just be ready for the fan.

A 10 year old Xeon is all you need - point.free

point.free

Published on June 01, 2026

17 min­utes read

The pre­vi­ous post cov­ered get­ting Gemma 4’s MTP drafters quan­tized and paired with a ver­i­fier. This one is about run­ning the re­sult on a ma­chine that has no busi­ness run­ning it.

I have a recycled server. To its credit, it has a whopping 128 GB of RAM, but it’s DDR3… That RAM is 5-6 times slower than the current best laptop RAM. It also has a single Intel Xeon E5-2620 v4 from 2016, which is about 5 times slower than my laptop’s CPU…

Oh, and as I did men­tion, we have no GPU. And no, the Xeon does not have an in­te­grated GPU.

But, just hear me out…

If we were to just break out ol­lama here, well… as ex­plained in ear­lier blog posts, we can’t. And we’d be lucky if we could in 6 months when they add sup­port for the model we need, if they ever do. Might be they never do. And even still, ol­lama sim­ply does­n’t ex­pose enough knobs for us to ever make this run well, nei­ther does even the stan­dard llama-cpp.

But. Why would that stop us?

I’ve received feedback that some of the previous posts were too high level, so I’ll try to make things as clear as reasonably possible here. If you’re a tech worker, or a Linux enthusiast who has built a computer and used something like ChatGPT, most of this should be approachable.

So, just to re­ally set the stage fully. The hard­ware, per lscpu:

CPU: Intel Xeon E5-2620 v4 @ 2.10 GHz

Cores: 8 phys­i­cal, 16 threads

Instruction sets: AVX2 (no AVX-512, no AVX-VNNI, no BF16)

Cache: 20 MiB L3, 2 MiB L2 to­tal

Memory: 128 GB DDR3

GPU: none

For LLM in­fer­ence, mem­ory band­width is the lim­it­ing re­source. Every to­ken gen­er­ated re­quires haul­ing gi­ga­bytes of weights from RAM into the CPU cache.

When you use a tool like ChatGPT and watch the text stream onto your screen word by word, you are watching the “decoder pass”. During this phase, the model generates the output one piece (or “token”) at a time.

In this step, the system’s raw processing power is rarely the bottleneck. Instead, the limitation is memory bandwidth. To calculate that next word, the processor has to constantly pull massive amounts of data from memory into the compute cores. That data is the “weights” that contain the model’s learned knowledge.

The processor executes the required matrix calculations so quickly that it is left sitting idle, waiting for the hardware to physically move the next chunk of weights across the memory bus. In traditional software terms, decoding is heavily memory-bound, not compute-bound.

This is the so-called “memory wall”, one of the single biggest performance hurdles now, whether you’re on a Xeon or an H100.
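The memory wall can be put in numbers with a simple ceiling: tokens per second cannot exceed memory bandwidth divided by the bytes read per token. The sketch below uses the roughly 3.8B active parameters quoted later in this post and the Q8_0 layout (32 int8 weights plus one fp16 scale per block). The bandwidth values are hypothetical placeholders, since the post does not report a measured figure for this machine.

```python
ACTIVE_PARAMS = 3.8e9             # active parameters per token, as quoted in the post
Q8_0_BYTES_PER_WEIGHT = 34 / 32   # 32 int8 weights + one 2-byte scale per block


def decode_ceiling(bandwidth_gb_s: float) -> float:
    """Bandwidth-only upper bound on tokens/s for one decode stream."""
    bytes_per_token = ACTIVE_PARAMS * Q8_0_BYTES_PER_WEIGHT
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical sustained memory bandwidths in GB/s (assumptions, not measurements).
for bandwidth in (20.0, 40.0, 60.0):
    print(f"{bandwidth:>4.0f} GB/s -> at most ~{decode_ceiling(bandwidth):.1f} tok/s")
```

The bound scales linearly with bandwidth and inversely with bytes per token, which is why the rest of the flags focus on touching fewer bytes or touching them more efficiently.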

Naively running llama-cli on a DDR3 machine without a GPU is horrendously slow, even when it runs at all, because it’s optimized for a generic GPU use case and often leaves a lot of improvements on the table. Further, it simply doesn’t have most of the optimizations that the state of the art currently uses to run these models at scale.

The rem­edy is to pull every op­ti­miza­tion lever ik_l­lama.cpp ex­poses. Most of them are slightly ob­scure.

Here is the magic spell that makes this ac­tu­ally run.

llama-cli \
  --model gemma-4-26B-A4B-it-Q8_0.gguf \
  --model-draft gemma-4-26B-A4B-it-assistant-GGUF/\
wikitext-2-raw_ik-llama-mtp_drafter-conservative/\
gemma-4-26B-A4B-it-assistant-Q8_0.gguf \
  --spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune \
  -cnv --color --jinja --special \
  -sm graph -smgs -sas -mea 256 --split-mode-f32 \
  --temp 0.7 -t 8 --parallel 8 \
  --cpu-moe --merge-up-gate-experts \
  --flash-attn on --mla-use 3 \
  --mlock --run-time-repack --no-kv-offload

Under a black­box tool like ol­lama you never see this line. On ag­ing hard­ware you have to un­der­stand what each flag does, be­cause half of them won’t take, and the en­gine will tell you so in pass­ing.

Speculative de­cod­ing.

--spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune

This pairs the 26B verifier with the small drafter from the previous post: up to three tokens per draft (--draft-max 3), all probabilities accepted (--draft-p-min 0.0), and --spec-autotune adjusting the chain length per workload.

This ties di­rectly back to our pre­vi­ous dis­cus­sion about the mem­ory-bound de­coder pass.

When a model uses a long reasoning chain, it is generating those “thinking” tokens one by one. Even if the internal reasoning is hidden from the user and all you see is a short final answer, the hardware still has to perform a full decoder pass for every single token in that hidden chain.

In fact, speculative decoding is currently one of the most brilliant software workarounds the AI industry has invented to bypass the “memory wall,” and spec autotune is how you squeeze the maximum speed out of it.

The ar­gu­ment for spec­u­la­tive de­cod­ing is stronger on CPU than on GPU. CPU com­pute is cheap rel­a­tive to the cost of stream­ing the ver­i­fier’s weights through cache, so spend­ing ex­tra cy­cles on a tiny drafter whose ac­tive lay­ers eas­ily fit in L3 buys to­kens at very lit­tle mar­ginal cost. The drafter’s work­ing set fits in L3. The ver­i­fier how­ever spills out of every­thing.
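How much a draft length of 3 can buy depends on how often the drafter is right. A common simplification, used here as an assumption, is that each drafted token is accepted independently with the same probability. The acceptance rates below are illustrative, not measurements from this setup.

```python
def expected_tokens_per_pass(accept_rate: float, draft_max: int) -> float:
    """Expected tokens emitted per verifier pass.

    Each pass always yields one token from the verifier (a correction or a
    bonus token), plus however many drafted tokens were accepted before the
    first rejection: 1 + a + a^2 + ... + a^draft_max.
    """
    if accept_rate >= 1.0:
        return float(draft_max + 1)
    return (1.0 - accept_rate ** (draft_max + 1)) / (1.0 - accept_rate)


# Illustrative acceptance rates (assumptions), with --draft-max 3 from the post.
for rate in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_pass(rate, draft_max=3)
    print(f"acceptance {rate:.0%}: ~{tokens:.2f} tokens per verifier pass")
```

Since each verifier pass streams the same weights no matter how many positions it checks, this figure is roughly the speedup available before the drafter's own cost is subtracted.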

CPU and MoE rout­ing.

–cpu-moe –merge-up-gate-experts -t 8 –parallel 8

Gemma 4 26B-A4B has 128 ex­perts with 8 ac­tive per to­ken, giv­ing about 3.8B ac­tive pa­ra­me­ters out of ~25.2B to­tal. –cpu-moe tunes the rout­ing for CPU cache hi­er­ar­chies.

CPUs han­dle mem­ory very dif­fer­ently than GPUs. While a GPU has a mas­sive pool of ul­tra-fast High-Bandwidth Memory (HBM), a CPU re­lies on small, light­ning-fast caches” (L1, L2, L3) built di­rectly onto the proces­sor chip.

In an MoE model, constantly jumping around between 128 different experts can cause “cache thrashing”, where the CPU constantly has to dump its cache and fetch new weights from the much slower main system RAM (normally DDR4/DDR5, we’re on DDR3!).

This flag tells the router to be smarter about how it picks experts, optimizing the sequence so the weights stay neatly inside the CPU’s local cache for as long as possible.
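
For readers who have not seen MoE routing written down, here is a minimal Python sketch of picking 8 of 128 experts for one token. The function name route_top_k and the softmax-over-selected-experts weighting are assumptions for illustration; the post does not show how Gemma 4 or ik_llama.cpp actually normalises the weights.

```python
import math
from typing import List, Tuple


def route_top_k(router_logits: List[float], k: int = 8) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]  # stable softmax
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for one token: expert i gets score i.
logits = [float(i) for i in range(128)]
picks = route_top_k(logits, k=8)
print([expert for expert, _ in picks])
```

Only the weights of the selected experts have to be read for that token, which is why the set of experts chosen from one token to the next matters so much for the cache.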

--merge-up-gate-experts fuses two per-expert projections into a single matmul, which the logs confirm:

fused_up_gate = 1

This is a soft­ware trick to by­pass the mem­ory band­width bot­tle­neck we dis­cussed ear­lier.

Inside the experts, the math operations require data to be passed through different layers. Normally, the processor would calculate an “up projection”, write the result to memory, then load the weights for a “gate projection”, calculate that, and combine them. That requires moving data across the memory bus multiple times.

Instead of do­ing two sep­a­rate trips over the mem­ory bus, it com­bines the op­er­a­tions into a sin­gle step.
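
The fusion can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the engine's kernel: the helper names are made up, and SiLU is assumed as the gate activation purely to have something to compute.

```python
import math
from typing import List

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def silu(v: float) -> float:
    return v / (1.0 + math.exp(-v))


def ffn_separate(w_up: Matrix, w_gate: Matrix, x: List[float]) -> List[float]:
    up = matvec(w_up, x)      # first pass over the input
    gate = matvec(w_gate, x)  # second pass over the input
    return [silu(g) * u for g, u in zip(gate, up)]


def ffn_fused(w_up: Matrix, w_gate: Matrix, x: List[float]) -> List[float]:
    fused = w_up + w_gate     # rows stacked once; in practice done at load time
    both = matvec(fused, x)   # a single pass over the input
    n = len(w_up)
    up, gate = both[:n], both[n:]
    return [silu(g) * u for g, u in zip(gate, up)]


w_up = [[1.0, 2.0], [0.5, -1.0]]
w_gate = [[0.0, 1.0], [2.0, 0.0]]
x = [1.0, 3.0]
print(ffn_separate(w_up, w_gate, x) == ffn_fused(w_up, w_gate, x))
```

Both versions do the same arithmetic per row, so the results match exactly. The fused version simply gets there with one matrix multiply instead of two.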

-t 8 matches phys­i­cal cores. The ma­chine has 16 SMT threads but only 8 cores. On a mem­ory-bound work­load, over­sub­scrib­ing threads adds sched­ul­ing cost with­out adding through­put: the cores are wait­ing on DDR3, not on each other.

Memory pin­ning, repack­ing, KV cache.

--mlock --run-time-repack --no-kv-offload

--run-time-repack reorganizes weight matrices in memory immediately before inference to match the CPU’s cache layout. The logs confirm:

============ Repacked 265 ten­sors

Processors have their own ul­tra-fast, built-in mem­ory called caches (L1, L2, and L3). However, these caches ex­pect data to be fed to them in very spe­cific shapes and sizes.

If the AI’s weight matrices are sitting in system RAM in a generic layout, the CPU has to awkwardly pull the data in pieces, resulting in “cache misses” where the CPU stalls. --run-time-repack tells the engine to spend a few seconds during startup to physically reorganize the massive tables of numbers in the RAM so they align with how the CPU wants to ingest them. It pays a small time penalty upfront in exchange for better memory bandwidth during the actual text generation.
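
As a rough picture of what "repacking" means, here is a small Python sketch that interleaves four rows of a row-major matrix so a kernel producing four outputs at once can read memory strictly in order. The real repacked layouts in ik_llama.cpp depend on the quantization type and are not shown in this post, so the block size and function names here are assumptions.

```python
from typing import List


def repack_interleaved(flat: List[float], rows: int, cols: int,
                       block: int = 4) -> List[float]:
    """Reorder a row-major matrix so `block` rows are interleaved per column."""
    assert rows % block == 0
    out: List[float] = []
    for r0 in range(0, rows, block):
        for c in range(cols):
            for r in range(r0, r0 + block):
                out.append(flat[r * cols + c])
    return out


def matvec_repacked(packed: List[float], rows: int, cols: int,
                    x: List[float], block: int = 4) -> List[float]:
    y = [0.0] * rows
    pos = 0  # read position only ever moves forward: no strided jumps
    for r0 in range(0, rows, block):
        for c in range(cols):
            for b in range(block):
                y[r0 + b] += packed[pos] * x[c]
                pos += 1
    return y


flat = [1.0, 2.0, 3.0,
        4.0, 5.0, 6.0,
        7.0, 8.0, 9.0,
        10.0, 11.0, 12.0]
packed = repack_interleaved(flat, rows=4, cols=3)
print(matvec_repacked(packed, rows=4, cols=3, x=[1.0, 0.0, 2.0]))
```

The result is the same matrix-vector product as before; only the order in which the weights sit in memory has changed.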

--mlock is meant to pin the model in RAM so the OS cannot swap any of it to disk.

mlock stands for “memory lock”, surprising, I know! In standard operating systems, if the system starts running out of RAM, it will quietly take data that hasn’t been used in a few seconds and “swap” (or page) it to the physical hard drive.

If an OS tries to swap out 27GB of AI weights to a disk, the generation speed will drop to near zero while the system chokes trying to read it back. --mlock tells the Linux kernel: “Pin this 27GB strictly in physical RAM. Do not ever move it to the disk.”

Notice that if you’re not care­ful, you’ll see this:

warn­ing: failed to mlock 27628376064-byte buffer (after pre­vi­ously lock­ing 0 bytes): Cannot al­lo­cate mem­ory Try in­creas­ing RLIMIT_MEMLOCK (‘ulimit -l’ as root).

The flag is fine; the ker­nel-side mem­lock limit is­n’t set high enough to pin a 27 GB buffer. This is not an LLM-shaped prob­lem at all — it’s a ulimit de­fault — and it’s the kind of foot­gun the black­box tools pa­per over by sim­ply not ask­ing for the op­ti­miza­tion in the first place.
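
If you want to check the limit before launching the engine, Python's standard resource module (Unix only) can read it. This is a small sketch, not part of any inference tool: the helper name memlock_allows is made up, and the byte count is the one from the warning above.

```python
import resource


def memlock_allows(n_bytes: int, soft_limit: int) -> bool:
    """True if a buffer of n_bytes fits under an RLIMIT_MEMLOCK soft limit."""
    return soft_limit == resource.RLIM_INFINITY or n_bytes <= soft_limit


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    model_bytes = 27_628_376_064  # buffer size from the warning above
    shown = "unlimited" if soft == resource.RLIM_INFINITY else soft
    print("RLIMIT_MEMLOCK soft limit:", shown)
    print("can pin the model:", memlock_allows(model_bytes, soft))
```

If the second line prints False, the mlock request is likely to fail the same way until the limit is raised.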

Consider that for a moment: by default, many tools will happily let the OS push your model into swap if it decides that’s the best option. You can imagine how much this can hurt performance…

--no-kv-offload tells the engine not to look for a GPU for the KV cache. There isn’t one to find, but the flag short-circuits the check.

The KV (Key-Value) cache is the AI’s short-term memory: it stores the context of the current conversation so the model doesn’t have to re-read the entire prompt for every new token.

Because the KV cache is constantly being read from and written to, AI engines usually try to “offload” it to a GPU, which has much faster memory than we do.

Since this spe­cific setup is highly op­ti­mized to run purely on a CPU, let­ting the en­gine search the hard­ware buses for a GPU that does­n’t ex­ist is a waste of time and could throw an er­ror. This flag ex­plic­itly short-cir­cuits that check, telling the en­gine to just keep the short-term mem­ory in the sys­tem RAM along­side the weights.
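
A quick back-of-the-envelope sketch shows why the KV cache is worth keeping at all. The two functions below just count how many tokens need their keys and values computed, with and without a cache; the prompt length and token count are hypothetical numbers, not measurements from this machine.

```python
def kv_projections_without_cache(prompt_len: int, n_new: int) -> int:
    """Every generation step recomputes keys/values for the whole context."""
    return sum(prompt_len + step for step in range(n_new))


def kv_projections_with_cache(prompt_len: int, n_new: int) -> int:
    """Prompt is processed once; each later step adds one new token.

    Assumes n_new >= 1.
    """
    return prompt_len + (n_new - 1)


# Hypothetical workload: a 1,000-token prompt and 100 generated tokens.
print(kv_projections_without_cache(1000, 100))
print(kv_projections_with_cache(1000, 100))
```

For this example the cached version does roughly a hundredth of the key/value work, at the price of holding the cache in RAM next to the weights.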

Graph lay­out.

I’ve tried my best to keep this easy to understand, but this part is just plain hard to explain in a single blog post.

Now onto the dark arts. A common frustration in bleeding-edge AI software is that the engine is being developed so fast that the developers don’t have time to write official documentation. If you want to know how to optimize the engine, you have to dig through the raw code or read the GitHub Pull Request (PR) comments between the developers.

-sm graph -smgs -sas -mea 256 --split-mode-f32

These flags govern how the computational graph is allocated across memory regions. Some written documentation exists, but the authoritative reference is ultimately the code itself.

The flag -sm graph tells the engine to use the graph split mode (often known in the industry as Tensor Parallelism). This is entirely about how you divide the massive math workload across multiple processors or memory regions (like multiple CPU sockets or GPUs).

Layer Split (The Default/Fallback): The en­gine slices the model hor­i­zon­tally. Processor A cal­cu­lates Layers 1 – 10, then sends the data over the sys­tem bus to Processor B, which cal­cu­lates Layers 11 – 20. While Processor A is work­ing, Processor B is sit­ting idle.

Graph Split (The Goal): The engine slices the computational graph vertically. Processor A and Processor B calculate different halves of Layer 1 at the exact same time, combine their answers, and move to Layer 2 together. This keeps all the hardware busy at the same time, which can drastically improve generation speed.
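
The two split modes can be compared with a toy two-layer model in Python. Everything here is a simplified stand-in: the "workers" run one after another in a single process, and the function names are invented for the example, so it shows only that both strategies compute the same answer, not how the engine schedules them.

```python
from typing import List

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def layer_split(layers: List[Matrix], x: List[float]) -> List[float]:
    # Worker A would own the first layers and worker B the rest,
    # so B cannot start until A has finished.
    for w in layers:
        x = matvec(w, x)
    return x


def graph_split(layers: List[Matrix], x: List[float],
                workers: int = 2) -> List[float]:
    for w in layers:
        chunk = len(w) // workers  # assumes rows divide evenly
        # Each worker computes its own slice of this layer's output rows.
        parts = [matvec(w[i * chunk:(i + 1) * chunk], x)
                 for i in range(workers)]
        x = [v for part in parts for v in part]  # gather before the next layer
    return x


layers = [[[1.0, 2.0], [3.0, 4.0]],
          [[0.0, 1.0], [1.0, 0.0]]]
print(layer_split(layers, [1.0, 1.0]), graph_split(layers, [1.0, 1.0]))
```

The gather step after every layer is the cost of graph split: the workers have to exchange results far more often than in layer split.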

On this run, the en­gine de­clines:

======================================================= Split mode 'graph' is not supported for Gemma4 external MTP => changing split mode to 'layer' =======================================================

Because MTP creates a much more complicated web of math at the very end of the network, this inference engine simply doesn’t yet support safely “graph splitting” (vertically slicing) an MTP architecture. When the engine boots up, it detects the MTP layers, realizes -sm graph will break the math, and safely downgrades to the slower, sequential layer split so the model can still run.

I’ve in­cluded it be­cause it will likely be very help­ful in the fu­ture, so you should try your luck if you’re work­ing on a newer ver­sion.

While -sm graph was dis­abled, these other flags still ap­ply to how the en­gine man­ages mem­ory:

-sas (Split Across Sockets): Explicitly tells the engine how to divide the workload across different physical CPU sockets (NUMA nodes) on a server motherboard. You may note we only have one CPU, but we could get more later. It’s a nice optimization; just benchmark it to be safe if you do this, since older boards may break current-day assumptions.

--split-mode-f32: When data is split across processors, it has to be stitched back together. This flag forces those intermediate connection points to use 32-bit floating-point precision (higher quality math). It keeps rounding errors introduced at the split from degrading the model’s output.

And don’t worry if you see this:

Oops: ten­sor with strange name rope_freqs.weight

It has a strange name. Strange names will not stop us here. :D

Attention.

Look. ikawrakow, creator of ik_llama.cpp, is beyond the word “cracked”.

Kawrakow wrote cus­tom CPU ker­nels to han­dle Flash Attention, by­pass­ing the need for a GPU dur­ing heavy con­text pro­cess­ing.

This lets us do something that normally you only do on a GPU.

--flash-attn on --mla-use 3

Flash Attention fuses the at­ten­tion soft­max with its mat­muls to avoid ma­te­ri­al­iz­ing the full at­ten­tion ma­trix. Duh, any­one knows this, but I’ll try to ex­plain it.

To gen­er­ate text, an AI has to cal­cu­late how every sin­gle word in your prompt re­lates to every other word. Mathematically, this cre­ates a grid of size N×N (where N is the num­ber of to­kens).

If you give the AI a short sentence, that grid is small. But if you feed it a 100,000-word document, that matrix explodes into 10 billion cells. Normally, the processor calculates this massive matrix and “materializes” it, meaning it physically writes the entire giant grid out to the main system RAM, only to immediately read it back for the next step.

Flash Attention ap­plies the Kernel Fusion trick, but to the at­ten­tion mech­a­nism. It cal­cu­lates the at­ten­tion scores in small chunks and fuses the math (the soft­max) so that the gi­ant N×N ma­trix is never ac­tu­ally writ­ten to RAM. It is cal­cu­lated and con­sumed en­tirely in­side the proces­sor’s ul­tra-fast lo­cal cache.
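
The chunking trick is easier to trust once you see it produce the same numbers. Below is a minimal Python sketch for a single query: one function builds the whole row of scores first, the other walks the keys in small chunks and keeps only a running max, a running sum and a running output. It is a plain illustration of the online-softmax idea, not ikawrakow's kernel, and it leaves out details such as the usual score scaling.

```python
import math
from typing import List

Vec = List[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_naive(q: Vec, keys: List[Vec], values: List[Vec]) -> Vec:
    scores = [dot(q, k) for k in keys]  # the full row of scores is stored
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def attention_chunked(q: Vec, keys: List[Vec], values: List[Vec],
                      chunk: int = 2) -> Vec:
    dim = len(values[0])
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), chunk):
        scores = [dot(q, k) for k in keys[start:start + chunk]]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # rescale earlier partial sums
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + chunk]):
            w = math.exp(s - new_max)
            denom += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / denom for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_naive(q, keys, values))
print(attention_chunked(q, keys, values))
print(100_000 ** 2)  # cells in the full grid for 100,000 tokens
```

The chunked version never holds more than one chunk of scores at a time, which is the whole point: the working data stays small enough to live in cache.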

Flash Attention was orig­i­nally in­vented strictly for GPUs be­cause it re­lies on how GPU hard­ware han­dles mem­ory blocks. Successfully port­ing this highly com­plex, hard­ware-spe­cific op­ti­miza­tion to work on stan­dard CPUs is a mas­sive soft­ware en­gi­neer­ing achieve­ment. Well done ikawrakow.

--mla-use 3 enables Multi-Head Latent Attention. Earlier, we discussed the KV Cache (the AI’s short-term memory of the conversation that prevents it from having to re-read the whole prompt for every word).

In standard architectures, storing the raw Key and Value data for every single token eats up RAM incredibly fast. Multi-Head Latent Attention (MLA) is a breakthrough architecture that heavily compresses this short-term memory. Instead of saving raw data for every token, it compresses the Keys and Values into a much smaller, dense mathematical representation (a “latent” space).

This drastically reduces the memory footprint of the KV cache, allowing the model to remember massive conversations without running out of system RAM. The flag --mla-use 3 simply tells the engine to activate a specific tier or kernel implementation of this compression.
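
To get a feel for the size of the saving, here is a tiny Python calculation. All three dimensions are hypothetical, picked only to make the arithmetic easy to follow; they are not Gemma's real numbers, and real latent-attention designs may cache some extra per-token data on top of the latent vector.

```python
def kv_floats_per_token(n_heads: int, head_dim: int) -> int:
    """Standard attention caches one key and one value vector per head."""
    return 2 * n_heads * head_dim


def latent_floats_per_token(latent_dim: int) -> int:
    """A latent cache stores one compressed vector shared by all heads."""
    return latent_dim


standard = kv_floats_per_token(n_heads=32, head_dim=128)  # hypothetical sizes
latent = latent_floats_per_token(latent_dim=512)          # hypothetical size
print(standard, latent, standard / latent)
```

With these made-up sizes the cache per token, per layer, shrinks by a factor of 16, which is the kind of headroom that lets long contexts fit in system RAM.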

But all of this is just ex­per­i­men­tal stuff right, like the split mode graph? Nah. The logs con­firm both took:

ChatGPT for Google Sheets Exfiltrates Workbooks

www.promptarmor.com

This at­tack does not re­quire hu­man-in-the-loop ap­provals, even when in set­tings the user has ex­plic­itly re­quired hu­man ap­proval be­fore ChatGPT ed­its work­books.

UPDATE from OpenAI:

“We appreciate the security research here, and it’s unfortunate this one slipped through a crack in our disclosure pipeline. As we’re now aware of this report, we’ve taken immediate steps to protect users against potential attacks in this area by removing the model’s ability to generate Apps Script code, which should eliminate the risk to users of ChatGPT for Google Sheets. We’re taking a close look at how this feature interacts with Google Sheets APIs and re-evaluating our sandboxing approach to make sure this product is as resistant as possible against prompt injection attacks. More broadly, we’ll be doing a re-review of similar functionality in other surfaces to make sure that our defenses are consistent and effective across the board.”

Overview

Recently, OpenAI launched an AI ex­ten­sion for us­ing ChatGPT in Google Sheets, which has ac­cu­mu­lated over 185,000 down­loads since its launch less than a month ago. This al­lows users to op­er­ate on their spread­sheets by in­ter­act­ing with an AI chat­bot that lives in a side­bar, with the added ben­e­fit of draw­ing on data from ChatGPT con­nec­tors.

A sin­gle in­di­rect prompt in­jec­tion at­tack trig­gered by a sin­gle be­nign user query can trig­ger all of the fol­low­ing ef­fects at once:

Exfiltration of many workbooks from across the victim’s account

Display of an interactive phishing pop-up

Overwriting the entire GPT sidebar with an attacker-controlled chatbot interface

Attacker-controlled edits to your workbooks

This at­tack oc­curs when any un­trusted data source (e.g., from an im­ported sheet or ChatGPT con­nec­tor) ma­nip­u­lates ChatGPT to run an at­tacker-con­trolled ex­ter­nal script, which ex­e­cutes lever­ag­ing per­mis­sions the user has granted to the ChatGPT for Google Sheets ex­ten­sion.

This vul­ner­a­bil­ity was re­spon­si­bly dis­closed to OpenAI. Despite mul­ti­ple fol­low-ups, we re­ceived no com­mu­ni­ca­tion be­yond an au­to­mated re­ply to our ini­tial dis­clo­sure. OpenAI’s doc­u­men­ta­tion fails to de­scribe sen­si­tive ca­pa­bil­i­ties granted to the model (e.g., run­ning priv­i­leged scripts) or risks of model ma­nip­u­la­tion via in­di­rect prompt in­jec­tion, in­stead fo­cus­ing solely on func­tional lim­i­ta­tions and data-han­dling con­cerns. As such, we are pub­lish­ing our find­ings to en­able in­formed de­ci­sion-mak­ing re­gard­ing the risk sur­face.

The Attack Chain

A user is working on an internal financial model

The user imports an external data set to use in their model

The external sheet has a prompt injection hidden in white text.

The user asks ChatGPT for Google Sheets to help integrate the data from the imported sheet into their financial model.

The injection manipulates ChatGPT for Google Sheets to run an external script

Note: ChatGPT for Google Sheets has a setting called ‘Apply edits automatically’ that determines when human approvals are required before an agentic action completes. However, this attack succeeds even when the user has explicitly disabled automatic edits.

The external script exfiltrates the financial model from the user’s workbook

Below, the attacker’s server logs show the user’s exfiltrated financial model.

The external script identifies links to other workbooks in the stolen data, exfiltrates the discovered workbooks, and continues across all workbooks it can find

Here, the internal financial model sheet included a link to another spreadsheet relevant to budgeting. The malicious script identifies the spreadsheet URL in the stolen data and exfiltrates the newly discovered workbook. It then continues to process the stolen data, identifying and exfiltrating additional workbooks, eventually exfiltrating 12 in total.

Note: Clicking the ‘stop’ button in the ChatGPT sidebar does not stop scripts that have started from finishing execution.

Phishing Overlay Attacks

In ad­di­tion to the data ex­fil­tra­tion de­scribed above, the same at­tacker-con­trolled scripts en­able a ma­li­cious ac­tor to tar­get two vari­ants of a phish­ing over­lay at­tack.

Variant 1: A side­bar is opened that over­lays the ChatGPT for Google Sheets ex­ten­sion with an at­tacker-con­trolled site, al­low­ing the at­tacker to im­per­son­ate the ex­ten­sion. The ma­li­cious side­bar can ex­e­cute scripts that edit the sheet in the same way ChatGPT can, al­low­ing it to act in most of the ways the ex­ten­sion nor­mally does, while also per­form­ing ma­li­cious ac­tiv­i­ties such as:

Harvesting all user prompts

Providing the user with a misaligned chatbot to interact with

Convincing the user to ‘reconnect’ connectors to gain access to additional apps

Displaying a phishing UI to steal credentials for OpenAI

Variant 2: A pop-up modal is opened that ren­ders an at­tacker-con­trolled web­site to phish the user for cre­den­tials.

Control Access to ChatGPT for Google Sheets

Organizations can lever­age the fol­low­ing con­fig­u­ra­tion to con­trol ac­cess to ChatGPT for Google Sheets:

Workspace set­tings > Permissions & roles > ChatGPT for Excel and Google Sheets

Responsible Disclosure

UPDATE: OpenAI has re­sponded; de­tails are at the top of the ar­ti­cle.

This vul­ner­a­bil­ity was re­spon­si­bly dis­closed to OpenAI. Despite mul­ti­ple fol­low-ups, we re­ceived no com­mu­ni­ca­tion be­yond an au­to­mated re­ply to our ini­tial dis­clo­sure. OpenAI’s doc­u­men­ta­tion fails to de­scribe sen­si­tive ca­pa­bil­i­ties granted to the model (e.g., run­ning priv­i­leged scripts) or risks of model ma­nip­u­la­tion via in­di­rect prompt in­jec­tion, in­stead fo­cus­ing solely on func­tional lim­i­ta­tions and data-han­dling con­cerns. As such, we are pub­lish­ing our find­ings to en­able in­formed de­ci­sion-mak­ing re­gard­ing the risk sur­face.

Timeline

May 08, 2026 PromptArmor discloses to OpenAI via email

May 08, 2026 OpenAI sends an automated reply, confirming the intended reporting channel

May 08, 2026 PromptArmor confirms email preference

May 12, 2026 PromptArmor follows up

May 18, 2026 PromptArmor follows up

May 27, 2026 Public disclosure

UPDATE: May 31, 2026 OpenAI responds; more details at the top.

Meta launches Instagram, Facebook, and WhatsApp subscriptions, with more to come, including AI plans

techcrunch.com

Meta is dou­bling down on its sub­scrip­tion of­fer­ings. On Wednesday, the so­cial net­work­ing gi­ant an­nounced it’s now rolling out its con­sumer sub­scrip­tion plans glob­ally for its flag­ship apps, Instagram, Facebook, and WhatsApp, and be­gin­ning tests of new sub­scrip­tions for busi­nesses, cre­ators, and Meta AI users.

For a few dol­lars per month, con­sumers sub­scrib­ing to Instagram Plus ($3.99/mo), Facebook Plus ($3.99/mo), or WhatsApp Plus ($2.99/mo) will gain ac­cess to ex­tra fea­tures, like pro­file cus­tomiza­tion, su­per re­ac­tions, and story in­sights, among other things.

In an announcement, Meta’s head of product, Naomi Gleit, noted that “more fun features” will be added in the future.

Meanwhile, Meta will begin testing other offerings, including professional plans for creators and businesses, and AI-focused plans for all users. These new tests will be branded as “Meta One,” which will serve as the company’s home for its subscription offerings going forward.

Meta con­firmed it was plan­ning a sub­scrip­tion of­fer­ing ear­lier this year, with its ini­tial tests rolling out in the spring. The idea be­hind the plans aimed at con­sumers is to pro­vide ad­di­tional fea­tures for power users who want more from their so­cial apps. It also al­lows Meta to di­ver­sify its rev­enue streams be­yond ad­ver­tis­ing by ex­tract­ing more value from its ex­ist­ing au­di­ence of bil­lions, given the lim­ited growth op­por­tu­ni­ties for these apps, which have al­ready achieved global sat­u­ra­tion.

The new “Plus” plans are tailored to each individual app, with Facebook Plus and Instagram Plus focused more on social expression, while WhatsApp Plus focuses on personalization and messaging.

However, the com­pany tells us the new plans don’t re­place its ex­ist­ing of­fer­ing, Meta Verified, which is fo­cused on ver­i­fi­ca­tion, im­per­son­ation pro­tec­tion, and ex­tra sup­port. (This could change in time, but for now, Meta is not wind­ing down the older plans.)

For starters, the new Instagram Plus plan gives subscribers access to extra features, like the ability to see how many people have rewatched your Story in aggregate, as well as the ability to create unlimited audience lists for Stories, beyond the “Close Friends” option. Users will also be able to spotlight a story once a week for additional views, extend a story beyond 24 hours, preview a story without showing up as a viewer, search their story viewer list to see who is watching, and more. Users will also be able to post straight to their profile and highlight without showing up on their followers’ feeds.

There are also other fea­tures like Super Heart an­i­mated re­ac­tions for Stories, cus­tom app icons, cus­tomiz­able fonts for pro­file bios, and ac­cess to ad­di­tional pins for your pro­file.

These fea­tures are de­signed to bet­ter serve cre­ators and those look­ing to grow their fol­low­ing and un­der­stand their au­di­ence, but they could also ap­peal to heavy users.

Facebook Plus of­fers a sim­i­lar set of fea­tures to Instagram Plus. WhatsApp Plus, how­ever, of­fers other fea­tures, like app themes, cus­tom ring­tones, ad­di­tional pinned chats, list cus­tomiza­tion, pre­mium stick­ers, and more.

AI plans and more, in­clud­ing those for cre­ators and busi­nesses

Alongside the launch, Meta says it will be­gin test­ing even more sub­scrip­tion plans, which is where things start to get con­fus­ing.

For Meta AI users, it will test two plans, Meta One Plus ($7.99/mo) and Meta One Premium ($19.99/mo), with the same features, but the Premium plan unlocks more capacity on higher compute queries. That means the Premium plan would offer deeper reasoning for complex tasks (i.e., more of “thinking mode” in the Meta AI app or on the web). It would also offer more video and image-generation capabilities across Meta’s apps.

Meta AI will re­main free for more ca­sual users, but these plans fol­low the same path as those put forth by other AI model providers that charge for ad­di­tional com­pute and heav­ier us­age. The plans will later ex­pand in the weeks to come with more ben­e­fits for those who use AI glasses, Meta says.

The AI plans will start test­ing next month, ini­tially in Singapore, Guatemala, and Bolivia.

Two other plans for cre­ators and busi­nesses will be­gin tests later this week, in mar­kets in­clud­ing Saudi Arabia, Morocco, Thailand, and Bangladesh.

The Meta One Essential plan ($14.99/mo) will of­fer the Verified badge, im­per­son­ation pro­tec­tion, and an en­hanced linksheet where users can link out to their on­line pres­ence across so­cial chan­nels and the web, sim­i­lar to Meta Verified.

The more expensive Meta One Advanced plan ($49.99/mo) will include the Essential plan benefits, as well as the ability to be featured in the Facebook feed, appear higher in Facebook and Instagram search results, gain attention with a bold “Follow” button on Reels, and automatically send “follow” invitations to people who engage with your content.

It can also help cre­ators and busi­nesses drive peo­ple to their web­site or shop through links in Instagram posts and Instagram Reels, and through en­hanced Facebook and Instagram pro­files with their ex­panded linksheets. These plans, not sur­pris­ingly, in­clude bet­ter an­a­lyt­ics, in­clud­ing deeper, com­pet­i­tive in­sights on Instagram and cus­tom au­di­ence in­sights on Facebook.

Advanced plan sub­scribers will have ac­cess to op­ti­mized sched­ul­ing tools, tools to share ac­cess with other ac­count mod­er­a­tors (without shar­ing a pass­word), and no­ti­fi­ca­tions that alert you when oth­ers on Facebook or Instagram reuse your con­tent so you can re­quest a la­bel cred­it­ing your orig­i­nal reel.

Gleit ac­knowl­edged that Meta is still ex­per­i­ment­ing with these AI and pro­fes­sional plans for the time be­ing, but aims to bring them all to­gether un­der Meta One, where they will then con­tinue to be up­dated and ex­panded over time.
