10 interesting stories served every morning and every evening.




1 1,303 shares, 224 trendiness

Claude Opus 4.6

The new Claude Opus 4.6 improves on its predecessor's coding skills. It plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills to catch its own mistakes. And, in a first for our Opus-class models, Opus 4.6 features a 1M token context window in beta. Opus 4.6 can also apply its improved abilities to a range of everyday work tasks: running financial analyses, doing research, and using and creating documents, spreadsheets, and presentations. Within Cowork, where Claude can multitask autonomously, Opus 4.6 can put all these skills to work on your behalf.

The model's performance is state-of-the-art on several evaluations. For example, it achieves the highest score on the agentic coding evaluation Terminal-Bench 2.0 and leads all other frontier models on Humanity's Last Exam, a complex multidisciplinary reasoning test. On GDPval-AA—an evaluation of performance on economically valuable knowledge work tasks in finance, legal, and other domains—Opus 4.6 outperforms the industry's next-best model (OpenAI's GPT-5.2) by around 144 Elo points, and its own predecessor (Claude Opus 4.5) by 190 points. Opus 4.6 also performs better than any other model on BrowseComp, which measures a model's ability to locate hard-to-find information online.

As we show in our extensive system card, Opus 4.6 also shows an overall safety profile as good as, or better than, any other frontier model in the industry, with low rates of misaligned behavior across safety evaluations.

Opus 4.6 is state-of-the-art on real-world work tasks across several professional domains.

Opus 4.6 gets the highest score in the industry for deep, multi-step agentic search.

In Claude Code, you can now assemble agent teams to work on tasks together. On the API, Claude can use compaction to summarize its own context and perform longer-running tasks without bumping up against limits. We're also introducing adaptive thinking, where the model can pick up on contextual clues about how much to use its extended thinking, and new effort controls to give developers more control over intelligence, speed, and cost. We've made substantial upgrades to Claude in Excel, and we're releasing Claude in PowerPoint in a research preview. This makes Claude much more capable for everyday work.

Claude Opus 4.6 is available today on claude.ai, our API, and all major cloud platforms. If you're a developer, use claude-opus-4-6 via the Claude API. Pricing remains the same at $5/$25 per million tokens; for full details, see our pricing page.

We cover the model, our new product updates, our evaluations, and our extensive safety testing in depth below.

We build Claude with Claude. Our engineers write code with Claude Code every day, and every new model first gets tested on our own work. With Opus 4.6, we've found that the model brings more focus to the most challenging parts of a task without being told to, moves quickly through the more straightforward parts, handles ambiguous problems with better judgment, and stays productive over longer sessions.

Opus 4.6 often thinks more deeply and more carefully revisits its reasoning before settling on an answer. This produces better results on harder problems, but can add cost and latency on simpler ones. If you're finding that the model is overthinking on a given task, we recommend dialing effort down from its default setting (high) to medium. You can control this easily with the /effort parameter.

Here are some of the things our Early Access partners told us about Claude Opus 4.6, including its propensity to work autonomously without hand-holding, its success where previous models failed, and its effect on how teams work:

Claude Opus 4.6 is the strongest model Anthropic has shipped. It takes complicated requests and actually follows through, breaking them into concrete steps, executing, and producing polished work even when the task is ambitious. For Notion users, it feels less like a tool and more like a capable collaborator.

Early testing shows Claude Opus 4.6 delivering on the complex, multi-step coding work developers face every day—especially agentic workflows that demand planning and tool calling. This starts unlocking long-horizon tasks at the frontier.

Claude Opus 4.6 is a huge leap for agentic planning. It breaks complex tasks into independent subtasks, runs tools and subagents in parallel, and identifies blockers with real precision.

Claude Opus 4.6 is the best model we've tested yet. Its reasoning and planning capabilities have been exceptional at powering our AI Teammates. It's also a fantastic coding model — its ability to navigate a large codebase and identify the right changes to make is state of the art.

Claude Opus 4.6 reasons through complex problems at a level we haven't seen before. It considers edge cases that other models miss and consistently lands on more elegant, well-considered solutions. We're particularly impressed with Opus 4.6 in Devin Review, where it's increased our bug catching rates.

Claude Opus 4.6 feels noticeably better than Opus 4.5 in Windsurf, especially on tasks that require careful exploration like debugging and understanding unfamiliar codebases. We've noticed Opus 4.6 thinks longer, which pays off when deeper reasoning is needed.

Claude Opus 4.6 represents a meaningful leap in long-context performance. In our testing, we saw it handle much larger bodies of information with a level of consistency that strengthens how we design and deploy complex research workflows. Progress in this area gives us more powerful building blocks to deliver truly expert-grade systems professionals can trust.

Across 40 cybersecurity investigations, Claude Opus 4.6 produced the best results 38 of 40 times in a blind ranking against Claude 4.5 models. Each model ran end to end on the same agentic harness with up to 9 subagents and 100+ tool calls.

Claude Opus 4.6 is the new frontier on long-running tasks from our internal benchmarks and testing. It's also been highly effective at reviewing code.

Claude Opus 4.6 achieved the highest BigLaw Bench score of any Claude model at 90.2%. With 40% perfect scores and 84% above 0.8, it's remarkably capable for legal reasoning.

Claude Opus 4.6 autonomously closed 13 issues and assigned 12 issues to the right team members in a single day, managing a ~50-person organization across 6 repositories. It handled both product and organizational decisions while synthesizing context across multiple domains, and it knew when to escalate to a human.

Claude Opus 4.6 is an uplift in design quality. It works beautifully with our design systems and it's more autonomous, which is core to Lovable's values. People should be creating things that matter, not micromanaging AI.

Claude Opus 4.6 excels in high-reasoning tasks like multi-source analysis across legal, financial, and technical content. Box's eval showed a 10% lift in performance, reaching 68% vs. a 58% baseline, and near-perfect scores in technical domains.

Claude Opus 4.6 generates complex, interactive apps and prototypes in Figma Make with an impressive creative range. The model translates detailed designs and multi-layered tasks into code on the first try, making it a powerful starting point for teams to explore and build ideas.

Claude Opus 4.6 is the best Anthropic model we've tested. It understands intent with minimal prompting and went above and beyond, exploring and creating details I didn't even know I wanted until I saw them. It felt like I was working with the model, not waiting on it.

Both hands-on testing and evals show Claude Opus 4.6 is a meaningful improvement for design systems and large codebases, use cases that drive enormous enterprise value. It also one-shotted a fully functional physics engine, handling a large multi-scope task in a single pass.

Claude Opus 4.6 is the biggest leap I've seen in months. I'm more comfortable giving it a sequence of tasks across the stack and letting it run. It's smart enough to use subagents for the individual pieces.

Claude Opus 4.6 handled a multi-million-line codebase migration like a senior engineer. It planned up front, adapted its strategy as it learned, and finished in half the time.

We only ship models in v0 when developers will genuinely feel the difference. Claude Opus 4.6 passed that bar with ease. Its frontier-level reasoning, especially with edge cases, helps v0 to deliver on our number-one aim: to let anyone elevate their ideas from prototype to production.

The performance jump with Claude Opus 4.6 feels almost unbelievable. Real-world tasks that were challenging for Opus [4.5] suddenly became easy. This feels like a watershed moment for spreadsheet agents on Shortcut.

Across agentic coding, computer use, tool use, search, and finance, Opus 4.6 is an industry-leading model, often by a wide margin. The table below shows how Claude Opus 4.6 compares to our previous models and to other industry models on a variety of benchmarks.

Opus 4.6 is much better at retrieving relevant information from large sets of documents. This extends to long-context tasks, where it holds and tracks information over hundreds of thousands of tokens with less drift, and picks up buried details that even Opus 4.5 would miss.

A common complaint about AI models is "context rot," where performance degrades as conversations exceed a certain number of tokens. Opus 4.6 performs markedly better than its predecessors: on the 8-needle 1M variant of MRCR v2—a needle-in-a-haystack benchmark that tests a model's ability to retrieve information "hidden" in vast amounts of text—Opus 4.6 scores 76%, whereas Sonnet 4.5 scores just 18.5%. This is a qualitative shift in how much context a model can actually use while maintaining peak performance.

All in all, Opus 4.6 is better at finding information across long contexts, better at reasoning after absorbing that information, and has substantially better expert-level reasoning abilities in general.

Finally, the charts below show how Claude Opus 4.6 performs on a variety of benchmarks that assess its software engineering skills, multilingual coding ability, long-term coherence, cybersecurity capabilities, and its life sciences knowledge.

Opus 4.6 maintains focus over time and earns $3,050.53 more than Opus 4.5 on Vending-Bench 2.

Opus 4.6 finds real vulnerabilities in codebases better than any other model.

Opus 4.6 performs better than Opus 4.5 on computational biology, structural biology, organic chemistry, and phylogenetics tests.

These intelligence gains do not come at the cost of safety. On our automated behavioral audit, Opus 4.6 showed a low rate of misaligned behaviors such as deception, sycophancy, encouragement of user delusions, and cooperation with misuse. Overall, it is just as well-aligned as its predecessor, Claude Opus 4.5, which was our most-aligned frontier model to date. Opus 4.6 also shows the lowest rate of over-refusals—where the model fails to answer benign queries—of any recent Claude model.

The overall misaligned behavior score for each recent Claude model on our automated behavioral audit (described in full in the Claude Opus 4.6 system card).

For Claude Opus 4.6, we ran the most comprehensive set of safety evaluations of any model, applying many different tests for the first time and upgrading several that we've used before. We included new evaluations for user wellbeing, more complex tests of the model's ability to refuse potentially dangerous requests, and updated evaluations of the model's ability to surreptitiously perform harmful actions. We also experimented with new methods from interpretability, the science of the inner workings of AI models, to begin to understand why the model behaves in certain ways—and, ultimately, to catch problems that standard testing might miss.

A detailed description of all capability and safety evaluations is available in the Claude Opus 4.6 system card.

We've also applied new safeguards in areas where Opus 4.6 shows particular strengths that might be put to dangerous as well as beneficial uses. In particular, since the model shows enhanced cybersecurity abilities, we've developed six new cybersecurity probes—methods of detecting harmful responses—to help us track different forms of potential misuse.

We're also accelerating the cyberdefensive uses of the model, using it to help find and patch vulnerabilities in open-source software (as we describe in our new cybersecurity blog post). We think it's critical that cyberdefenders use AI models like Claude to help level the playing field. Cybersecurity moves fast, and we'll be adjusting and updating our safeguards as we learn more about potential threats; in the near future, we may institute real-time intervention to block abuse.

We've made substantial updates across Claude, Claude Code, and the Claude Developer Platform to let Opus 4.6 perform at its best.

On the API, we're giving developers better control over model effort and more flexibility for long-running agents. To do so, we're introducing the following features:

Adaptive thinking. Previously, developers only had a binary choice between enabling or disabling extended thinking. Now, with adaptive thinking, Claude can decide when deeper reasoning would be helpful. At the default effort level (high), the model uses extended thinking when useful, but developers can adjust the effort level to make it more or less selective.

Effort. There are now four effort levels to choose from: low, medium, high (default), and max. We encourage developers to experiment with different options to find what works best.

Context compaction (beta). Long-running conversations and agentic tasks often hit the context window. Context compaction automatically summarizes and replaces older context when the conversation approaches a configurable threshold, letting Claude perform longer tasks without hitting limits.

1M token context (beta). Opus 4.6 is our first Opus-class model with 1M token context. Premium pricing applies for prompts exceeding 200k tokens ($10/$37.50 per million input/output tokens).

128k output tokens. Opus 4.6 supports outputs of up to 128k tokens, which lets Claude complete larger-output tasks without breaking them into multiple requests.

US-only inference. For workloads that need to run in the United States, US-only inference is available at 1.1× token pricing.

Across Claude and Claude Code, we've added features that allow knowledge workers and developers to tackle harder tasks with more of the tools they use every day.

We've introduced agent teams in Claude Code as a research preview. You can now spin up multiple agents that work in parallel as a team and coordinate autonomously—best for tasks that split into independent, read-heavy work like codebase reviews. You can take over any subagent directly using Shift+Up/Down or tmux.

Claude now also works better with the office tools you already use. Claude in Excel handles long-running and harder tasks with improved performance, and can plan before acting, ingest unstructured data and infer the right structure without guidance, and handle multi-step changes in one pass. Pair that with Claude in PowerPoint, and you can first process and structure your data in Excel, then bring it to life visually in PowerPoint. Claude reads your layouts, fonts, and slide masters to stay on brand, whether you're building from a template or generating a full deck from a description. Claude in PowerPoint is now available in research preview for Max, Team, and Enterprise plans.
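For developers, a request to the new model looks like any other Messages API call, using the model ID given above. The snippet below is a minimal sketch using Python's requests library; the "effort" field is an assumption based on the effort levels described in this announcement, not a confirmed parameter name, so check the API documentation for the exact request shape.

```python
# Minimal sketch of calling Claude Opus 4.6 through the Messages API.
# The model ID comes from this announcement; the "effort" field is an
# assumed name based on the effort levels described above (low, medium,
# high, max) -- consult the official API docs for the real parameter.
import os
import requests

response = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json={
        "model": "claude-opus-4-6",
        "max_tokens": 2048,
        "effort": "medium",  # assumed field; dial down from the default (high) if the model overthinks
        "messages": [
            {"role": "user", "content": "Summarize this quarter's spending by category."},
        ],
    },
    timeout=600,
)
print(response.json()["content"][0]["text"])
```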

...

Read the original on www.anthropic.com »

2 1,048 shares, 62 trendiness

Owning a $5M data center

These days it seems you need a trillion fake dollars, or lunch with politicians to get your own data center. They may help, but they're not required. At comma we've been running our own data center for years. All of our model training, metrics, and data live in our own data center in our own office. Having your own data center is cool, and in this blog post I will describe how ours works, so you can be inspired to have your own data center too.

If your business relies on compute, and you run that compute in the cloud, you are putting a lot of trust in your cloud provider. Cloud companies generally make onboarding very easy, and offboarding very difficult. If you are not vigilant you will sleepwalk into a situation of high cloud costs and no way out. If you want to control your own destiny, you must run your own compute.

Self-reliance is great, but there are other benefits to running your own compute. It inspires good engineering. Maintaining a data center is much more about solving real-world challenges. The cloud requires expertise in company-specific APIs and billing systems. A data center requires knowledge of Watts, bits, and FLOPs. I know which one I'd rather think about.

Avoiding the cloud for ML also creates better incentives for engineers. Engineers generally want to improve things. In ML many problems go away by just using more compute. In the cloud that means improvements are just a budget increase away. This locks you into inefficient and expensive solutions. Instead, when all you have available is your current compute, the quickest improvements are usually speeding up your code, or fixing fundamental issues.

Finally, there's cost: owning a data center can be far cheaper than renting in the cloud. Especially if your compute or storage needs are fairly consistent, which tends to be true if you are in the business of training or running models. In comma's case I estimate we've spent ~$5M on our data center, and we would have spent $25M+ had we done the same things in the cloud.

Our data center is pretty simple. It's maintained and built by only a couple engineers and technicians. Your needs may be slightly different, but our implementation should provide useful context.

To run servers you need power. We currently use about 450kW at max. Operating a data center exposes you to many fun engineering challenges, but procuring power is not one of them. San Diego power cost is over 40c/kWh, ~3x the global average. It's a ripoff, and overpriced simply due to political dysfunction. We spent $540,112 on power in 2025, a big part of the data center cost. In a future blog post I hope I can tell you about how we produce our own power and you should too.

Data centers need cool dry air. Typically this is achieved with a CRAC system, but they are power-hungry. San Diego has a mild climate and we opted for pure outside air cooling. This gives us less control of the temperature and humidity, but uses only a couple dozen kW. We have dual 48" intake fans and dual 48" exhaust fans to keep the air cool. To ensure low humidity (

The majority of our current compute is 600 GPUs in 75 TinyBox Pro machines. They were built in-house, which saves us money and ensures they suit our needs. Our self-built machines fail at a similar rate to pre-built machines we've bought, but we're capable of fixing them ourselves quickly. They have 2 CPUs and 8 GPUs each, and work as both training machines and general compute workers.

For data storage we have a few racks of Dell machines (R630 and R730). They are filled with SSDs for a total of ~4PB of storage. We use SSDs for reliability and speed. Our main storage arrays have no redundancy and each node needs to be able to saturate the network bandwidth with random access reads. For the storage machines this means reading up to 20Gbps of each 80TB chunk.

Other than storage and compute machines we have several one-off machines to run services. This includes a router, climate controller, data ingestion machine, storage master servers, metric servers, redis servers, and a few more.

Running the network requires switches, but at this scale we don't need to bother with complicated switch topologies. We have 3 100Gbps interconnected Z9264F switches, which serve as the main Ethernet network. We have two more InfiniBand switches to interconnect the 2 TinyBox Pro groups for training all-reduce.

To effectively use all these compute and storage machines you need some infra. At this scale, services don't need redundancy to achieve 99% uptime. We use a single master for all services, which makes things pretty simple.

All servers get Ubuntu installed with PXE boot and are managed by Salt.

All of our storage arrays use mkv. The main array is 3PB of non-redundant storage hosting the driving data we train on. We can read from this array at ~1TB/s, which means we can train directly on the raw data without caching. Redundancy is not needed since no specific data is critical.

We have an additional ~300TB non-redundant array to cache intermediate processed results. And lastly, we have a redundant mkv storage array to store all of our trained models and training metrics. Each of these 3 arrays has a separate single master server.

We use Slurm to manage the compute nodes and compute jobs. We schedule two types of distributed compute: PyTorch training jobs and miniray workers.

To train models across multiple GPU nodes we use torch.distributed FSDP. We have 2 separate training partitions, each intra-connected with InfiniBand for training across machines. We wrote our own training framework which handles the training loop boilerplate, but it's mostly just PyTorch.
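The post doesn't share its training framework, but the core pattern it describes (torch.distributed plus FSDP across InfiniBand-connected nodes, launched under Slurm) might look roughly like this minimal sketch; the model, data, and hyperparameters are placeholders.

```python
# Minimal multi-node FSDP sketch (placeholder model and data; launched via
# srun/torchrun so RANK, WORLD_SIZE, and LOCAL_RANK are set by the launcher).
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")  # NCCL over the InfiniBand fabric
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(  # placeholder for the real driving model
        torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
    ).cuda()
    model = FSDP(model)  # shard parameters, gradients, and optimizer state across ranks

    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for step in range(1000):                     # the real loop streams batches from the storage arrays
        x = torch.randn(8, 1024, device="cuda")  # placeholder batch
        loss = model(x).pow(2).mean()
        loss.backward()
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```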

We have a custom model experiment tracking service (similar to wandb or tensorboard). It provides a dashboard for tracking experiments, and shows custom metrics and reports. It is also the interface for the mkv storage array that hosts the model weights. The training runs store the model weights there with a uuid, and they are available to download for whoever needs to run them. The metrics and reports for our latest models are also open.

Besides training we have many other compute tasks. This can be anything from running tests, running models, pre-processing data, or even running agent rollouts for on-policy training. We wrote a lightweight open-source task scheduler called miniray that allows you to run arbitrary python code on idle machines. This is a simpler version of dask, with a focus on extreme simplicity. Slurm will schedule any idle machine to be an active miniray worker, and accept pending tasks. All the task information is hosted in a central redis server.
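miniray's actual interface isn't shown here, but the described design (arbitrary Python tasks queued through a central redis server and pulled by whichever workers are idle) can be sketched as follows; the queue name, task encoding, and function registry are invented for illustration and may not match miniray's real API.

```python
# Illustrative sketch of a redis-backed task queue in the spirit of miniray.
# The queue name, task format, and registry are made up for this example.
import json
import redis

r = redis.Redis(host="redis-master", port=6379)  # the central redis server

def submit(func_name, *args):
    """Producer side: push a task description onto the shared queue."""
    r.rpush("tasks", json.dumps({"func": func_name, "args": args}))

def worker(registry):
    """Worker side: runs on idle machines, pops tasks and executes them."""
    while True:
        _, raw = r.blpop("tasks")          # block until a task is available
        task = json.loads(raw)
        result = registry[task["func"]](*task["args"])
        r.rpush("results", json.dumps(result))

# Example: fan a preprocessing function out over many input routes.
# submit("preprocess", "/data/route_0001")
# worker({"preprocess": lambda path: {"path": path, "ok": True}})
```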

Miniray workers with GPUs will spin up a Triton inference server to run model inference with dynamic batching. A miniray worker can thus easily and efficiently run any of the models hosted in the model mkv storage array.

Miniray makes it extremely easy to scale parallel tasks to hundreds of machines. For example, the controls challenge record was set by just having ~1hr of access to our data center with miniray.

All our code is in a monorepo that we have cloned on our workstations. This monorepo is kept small (

The most complex thing we do at comma is train driving models on-policy; these training runs require training data to be generated during training by running simulated driving rollouts with the most recent model weights. Here's a real-world command we just used to train such a model. This training run uses all of the infrastructure described above. While only this small command is needed to kick everything off, it orchestrates a lot of moving parts.

Does all this stuff sound exciting? Then build your own datacenter for yourself or your company! You can also come work here.

...

Read the original on blog.comma.ai »

3 486 shares, 20 trendiness

OpenClaw is What Apple Intelligence Should Have Been

Something strange is happening with Mac Minis. They're selling out everywhere, and it's not because people suddenly need more coffee table computers.

If you browse Reddit or HN, you'll see the same pattern: people are buying Mac Minis specifically to run AI agents with computer use. They're setting up headless machines whose sole job is to automate their workflows. OpenClaw—the open-source framework that lets you run Claude, GPT-5, or whatever model you want to actually control your computer—has become the killer app for Mac hardware. Not Final Cut. Not Logic. An AI agent that clicks buttons.

This is exactly what Apple Intelligence should have been.

Apple had everything: the hardware, the ecosystem, the reputation for "it just works." They could have shipped an agentic AI that actually automated your computer instead of summarizing your notifications. Imagine if Siri could genuinely file your taxes, respond to emails, or manage your calendar by actually using your apps, not through some brittle API layer that breaks every update.

They could have charged $500 more per device and people would have paid it. The margins would have been obscene. And they would have won the AI race not by building the best model, but by being the only company that could ship an AI you'd actually trust with root access to your computer. That trust—built over decades—was their moat.

So why didn't they?

Maybe they just didn't see it. That sounds mundane, but it's probably the most common reason companies miss opportunities. When you're Apple, you're thinking about chip design, manufacturing scale, and retail strategy. An open-source project letting AI agents control computers might not ping your radar until it's already happening.

Or maybe they saw it and decided the risk wasn't worth it. If you're Apple, you don't want your AI agent automatically buying things, posting on social media, or making irreversible decisions. The liability exposure would be enormous. Better to ship something safe and limited than something powerful and unpredictable.

But there's another dynamic at play. Look at who's about to get angry about OpenClaw-style automation: LinkedIn, Facebook, anyone with a walled garden and a careful API strategy. These services depend on friction. They want you to use their app, see their ads, stay in their ecosystem. An AI that can automate away that friction is an existential threat.

If Apple had built this, they'd be fighting Instagram over ToS violations by Tuesday. They'd be testifying in front of Congress about AI agents committing fraud. Every tech platform would be updating their terms to explicitly ban Apple Intelligence.

By letting some third party do it, Apple gets plausible deniability. They're just selling hardware. Not their fault what people run on it. It's the same strategy that made them billions in the App Store while maintaining "they're not responsible for what developers do."

But I think this is short-term thinking.

Here's what people miss about moats: they compound. The reason Microsoft dominated PCs wasn't just that they had the best OS. It's that everyone built for Windows, which made Windows more valuable, which made more people build for Windows. Network effects.

If Apple owned the agent layer, they could have created the most defensible moat in tech. Because an AI agent gets better the more it knows about you. And Apple already has all your data, all your apps, all your devices. They could have built an agent that works across your iPhone, Mac, iPad, and Watch seamlessly—something no one else can do.

More importantly, they could have owned the API. Want your service to work with Apple Agent? You play by Apple's rules. Suddenly Apple isn't fighting with platforms—they're the platform that platforms need to integrate with. It's the App Store playbook all over again, but for the AI era.

The Mac Mini rush is a preview of this future. People want agents. They want automation. They want to pay for it. They're literally buying extra computers just to run someone else's AI on Apple's hardware.

Apple is getting the hardware revenue but missing the platform revenue. That might look smart this quarter. But platform revenue is what built Apple into a $3 trillion company. And platforms are what create trillion-dollar moats.

I suspect ten years from now, people will look back at 2024-2025 as the moment Apple had a clear shot at owning the agent layer and chose not to take it. Not because they couldn't build it—they obviously could—but because they were optimizing for this year's legal risk instead of next decade's platform power.

The people buying Mac Minis to run AI agents aren't just early adopters. They're showing Apple exactly what product they should have built. Whether Apple is paying attention is another question entirely.

...

Read the original on www.jakequist.com »

4 335 shares, 86 trendiness

- YouTube

...

Read the original on www.youtube.com »

5 330 shares, 30 trendiness

CIA says it will cease publishing the CIA World Factbook

The US Central Intelligence Agency (CIA) has announced it will cease publishing the World Factbook, a free online resource used by millions around the globe.

Frequently cited by journalists and academics, the Factbook offered regularly updated statistics and information about countries and communities all over the world, in an easily understood and searchable format.

A statement on the CIA's website did not include a reason for the decision, simply stating that the publication had "sunset" while encouraging readers to "stay curious about the world and find ways to explore it … in person or virtually".

First launched during World War II as a classified internal program named JANIS (Joint Army Navy Intelligence Studies), the Factbook was originally commissioned as a way to standardise "basic intelligence" — fundamental and factual information about the world — across different agencies of the US government.

The program was taken over by the CIA in 1947 and renamed the National Intelligence Survey, before the Factbook was launched in 1971 as an annual summary of information.

An unclassified version was first made available to the public in 1975, and a digital version was published online in the 1990s, with the data freely available under public domain.

The website was particularly popular during the US school year, according to previous versions of the site, with traffic experiencing a noticeable drop-off during US summer months.

While no specific reason has been given for the Factbook's closure, the Trump administration has made no secret of its intent to cut government programs it does not consider to be furthering the core purpose of its agencies and departments.

The administration offered buyouts to every CIA employee in February last year, and is reportedly planning to cut about 1,200 further jobs at the agency over the next several years.

The CIA has been contacted for comment.

...

Read the original on www.abc.net.au »

6 295 shares, 45 trendiness

Spotlighting The World Factbook as We Bid a Fond Farewell

Spotlighting The World Factbook as We Bid a Fond Farewell (via) Somewhat devastating news today from CIA:

One of CIA's oldest and most recognizable intelligence publications, The World Factbook, has sunset.

There's not even a hint as to why they decided to stop maintaining this publication, which has been their most useful public-facing initiative since 1971 and a cornerstone of the public internet since 1997.

In a bizarre act of cultural vandalism they've not just removed the entire site (including the archives of previous versions) but they've also set every single page to be a 302 redirect to their closure announcement.

The Factbook has been released into the public domain since the start. There's no reason not to continue to serve archived versions - a banner at the top of the page saying it's no longer maintained would be much better than removing all of that valuable content entirely.

Up until 2020 the CIA published annual zip file archives of the entire site. Those are available (along with the rest of the Factbook) on the Internet Archive.

I downloaded the 384MB .zip file for the year 2020 and extracted it into a new GitHub repository, simonw/cia-world-factbook-2020. I've enabled GitHub Pages for that repository so you can browse the archived copy at simonw.github.io/cia-world-factbook-2020/.

Here's a neat example of the editorial voice of the Factbook from the What's New page, dated December 10th 2020:

Years of wrangling were brought to a close this week when officials from Nepal and China announced that they have agreed on the height of Mount Everest. The mountain sits on the border between Nepal and Tibet (in western China), and its height changed slightly following an earthquake in 2015. The new height of 8,848.86 meters is just under a meter higher than the old figure of 8,848 meters. The World Factbook rounds the new measurement to 8,849 meters and this new height has been entered throughout the Factbook database.

...

Read the original on simonwillison.net »

7 294 shares, 28 trendiness

From magic to malware: How OpenClaw's agent skills become an attack surface

A few days ago, I wrote about why OpenClaw feels like a portal to the future, and why that future is scary in a very specific way.

The short version: agent gateways that act like OpenClaw are powerful because they have real access to your files, your tools, your browser, your terminals, and often a long-term "memory" file that captures how you think and what you're building. That combination is exactly what modern infostealers are designed to exploit.

This post is the uncomfortable, "and then it happened" follow-up.

Because it's not just that agents can be dangerous once they're installed. The ecosystem that distributes their capabilities and skill registries has already become an attack surface.

If you are experimenting with OpenClaw, do not do it on a company device. Full stop.

In my first post, I described OpenClaw as a kind of Faustian bargain. It is compelling precisely because it has real access to your local machine, your apps, your browser sessions, your files, and often long-term memory. That same access means there isn't yet a safe way to run it on a machine that holds corporate credentials or has access to production systems.

If you have already run OpenClaw on a work device, treat it as a potential incident and engage your security team immediately. Do not wait for symptoms. Pause work on that machine and follow your organization's incident response process.

In the OpenClaw ecosystem, a "skill" is often a markdown file: a page of instructions that tells an agent how to do a specialized task. In practice, that markdown can include links, copy-and-paste commands, and tool call recipes.

That sounds harmless until you remember how humans, and agents, actually consume documentation:

Markdown isn't "content" in an agent ecosystem. Markdown is an installer.

Some people assume the MCP layer makes this safer, because tools can be exposed through a structured interface, with explicit user consent and authorization controls depending on the host and server implementation.

But skills do not need to use MCP at all.

The Agent Skills specification places no restrictions on the markdown body, and skills can include "whatever instructions will help agents perform the task," including copy and paste terminal commands. And skills can also bundle scripts alongside the markdown, which means execution can happen outside the MCP tool boundary entirely.

So if your security model is "MCP will gate tool calls," you can still lose to a malicious skill that simply routes around MCP through social engineering, direct shell instructions, or bundled code. MCP can be part of a safe system, but it is not a safety guarantee by itself.

Just as importantly, this is not unique to OpenClaw. "Skills" are increasingly portable because many agents are adopting the same open format, in which a skill is a folder centered on a SKILL.md file with metadata and freeform instructions, and it can also bundle scripts and other resources. Other implementations describe the same basic shape: a SKILL.md file plus optional scripts and assets. That means a malicious "skill" is not just an OpenClaw problem. It is a distribution mechanism that can travel across any agent ecosystem that supports the same standard.

While browsing ClawHub (I won't link it for obvious reasons), I noticed the top downloaded skill at the time was a "Twitter" skill. It looked normal: description, intended use, an overview, the kind of thing you'd expect to install without a second thought.

But the very first thing it did was introduce a "required dependency" named "openclaw-core," along with platform-specific install steps. Those steps included convenient links ("here", "this link") that appeared to be normal documentation pointers.

Both links led to malicious infrastructure. The flow was classic staged delivery:

* The skill's overview told you to install a prerequisite. The link led to a staging page designed to get the agent to run a command.

* That command decoded an obfuscated payload and executed it.

* The script downloaded and ran a binary, including removing macOS quarantine attributes to ensure macOS's built-in anti-malware system, Gatekeeper, doesn't scan it.

I'm intentionally not pasting the exact commands or URLs here. The mechanics are unfortunately straightforward, and repeating them helps attackers more than it helps defenders. The key point is that this was not "a suspicious link." This was a complete execution chain disguised as setup instructions.

I downloaded the final binary safely and submitted it for analysis.

The verdict was not ambiguous. It was flagged as macOS infostealing malware.

This is the type of malware that doesn't just "infect your computer." It raids everything valuable on that device:

* Anything else that can be turned into an account takeover

If you're the kind of person installing agent skills, you are exactly the kind of person whose machine is worth stealing from.

After I shared this internally, further reporting surfaced, putting the scale into focus: hundreds of OpenClaw skills were reportedly involved in distributing macOS malware via ClickFix-style instructions.

That detail matters because it confirms what this really is.

A deliberate strategy: use "skills" as the distribution channel, and "prerequisites" as the social engineering wrapper.

We've spent years learning that package managers and open-source registries can become supply chain attack vectors.

Agent skill registries are the next chapter, except that the "package" is documentation.

And that makes the attack path even smoother:

* And in agent ecosystems, the line between reading instructions and executing them collapses.

Even if an agent can't run shell commands directly, it can still do something dangerous: it can normalize risky behavior.

It can confidently summarize a malicious prerequisite as "the standard install step." It can encourage you to paste a one-liner. It can reduce hesitation.

And if your agent can execute local commands, then a malicious skill isn't "bad content." It's remote execution wrapped in friendly docs.

Do not run this on a company device. There isn't a safe way to do it. If you already did, or you ran any "install" commands from a skill, engage your security team immediately and treat it as a potential compromise.

* Stop using the device for sensitive work.

If you experiment anyway, use an isolated machine with no corporate access and no saved credentials.

You are operating an app store. Assume it will be abused.

* Put warnings and friction on external links and install steps.

* Use permissions that are specific, time-bound, and revocable.

This is the clearest proof yet of the point I made in my earlier post. OpenClaw is powerful because it collapses the distance between intent and execution. That is the magic. It also introduces significant risk. When capabilities are distributed as skills and installed via documentation, the registry becomes a supply chain, and the easiest install path becomes the attacker's favorite path.

The answer is not to stop building agents. The answer is to build the missing trust layer around them. Skills need provenance. Execution needs mediation. Permissions need to be specific, revocable, and continuously enforced, not granted once and forgotten. If agents are going to act on our behalf, credentials and sensitive actions cannot be "grabbed" by whatever code happens to run. They need to be brokered, governed, and audited in real time.

This is exactly why we need that next layer: when "skills" become the supply chain, the only safe future is one in which every agent has its own identity and has the minimum authority it needs right now, with access that is time-bound, revocable, and attributable.

...

Read the original on 1password.com »

8 274 shares, 13 trendiness

ICE seeks industry input on ad tech location data for investigative use

Immigration and Customs Enforcement (ICE) is surveying the commercial advertising technology market for tools capable of supplying location data and large-scale analytics to federal investigators, according to a recent Request for Information (RFI).

Framed as market research rather than a procurement, the RFI seeks information from companies offering "Ad Tech compliant and location data services" that could support criminal, civil, and administrative investigations across ICE's mission set.

The RFI, issued by ICE's Homeland Security Investigations (HSI), emphasizes that the government is not soliciting proposals or committing to a future contract, but it does signal active interest in selecting vendors for live demonstrations of operational platforms and data services, a step that typically precedes pilot deployments or integration into existing investigative environments.

ICE says it is attempting to better understand how commercial big data providers and advertising technology firms might directly support investigative activities, while remaining sensitive to "regulatory constraints and privacy expectations."

The agency noted that its components are handling increasing volumes of criminal, civil, and administrative information from both internal and external sources and are assessing whether commercial off-the-shelf platforms comparable to large investigative data and legal analytics providers can help manage and exploit that data at scale.

At the center of the inquiry is a category of information traditionally associated with digital advertising rather than law enforcement: location data, device identifiers, IP intelligence, and behavioral signals derived from everyday consumer activity.

Advertising technology, commonly referred to as ad tech, is the sprawling ecosystem of software, data brokers, analytics platforms, and intermediaries that power targeted advertising on the modern Internet.

Ad tech companies collect and process information about where devices are located, how users move between physical and digital spaces, which apps are installed on their phones, and how devices can be linked across websites, applications, and networks.

While the industry typically frames this activity as anonymous or pseudonymous, the underlying data is often persistent, granular, and capable of tracking individuals over time.

Location data is a particularly valuable component of that ecosystem. Mobile applications routinely share latitude and longitude coordinates with advertising partners through embedded software development kits.

Even when precise GPS data is not available, companies infer location through IP addresses, Wi-Fi networks, Bluetooth beacons, and cell tower connections. That information is then aggregated, analyzed, and sold to advertisers seeking to measure foot traffic, target audiences, or assess the effectiveness of campaigns.

ICE's RFI suggests that the agency is exploring whether those same mechanisms can be repurposed as investigative tools.

The document asks vendors to describe platforms and data services that can support investigative needs while remaining "Ad Tech compliant," a phrase that reflects industry norms rather than statutory law enforcement standards.

ICE appears to be looking to tap into the commercial data ecosystem rather than build bespoke surveillance tools from scratch, a strategy that allows agencies to access rich data streams without directly collecting the information themselves.

ICE's interest is not limited to raw data. The RFI repeatedly references "operational platforms," signaling a desire for systems that can ingest, correlate, analyze, and visualize information from multiple sources.

In practice, that means software environments capable of fusing location data with other records, such as criminal histories, financial data, travel records, social media activity, or administrative files, to generate investigative leads or support ongoing cases.

The agency frames its inquiry as exploratory and cautious. It notes that the government is seeking to understand the "current state" of ad tech and location data services available to federal investigative entities, particularly considering regulatory constraints and privacy expectations.

That language reflects growing scrutiny of commercial data practices by courts, regulators, and civil liberties advocates, especially when such data is accessed by federal agencies like ICE.

In recent years, federal agencies have increasingly relied on commercially available data to sidestep traditional legal barriers.

Because ad tech data is collected by private companies under consumer-facing privacy policies, agencies have argued that purchasing or accessing that data does not constitute a search under the Fourth Amendment.

Critics counter that this approach allows the government to obtain highly sensitive information, including detailed location histories, without warrants, probable cause, or meaningful oversight.

The U.S. Supreme Court has signaled skepticism of such practices in cases recognizing the sensitivity of long-term location tracking, even when data is held by third parties.

At the same time, regulators have brought enforcement actions against data brokers accused of selling sensitive location information without adequate safeguards.

Against that backdrop, ICE's assertion that it is considering privacy expectations appears designed to reassure both policymakers and potential vendors that the agency is aware of the controversy surrounding commercial surveillance data.

Yet the RFI itself provides little detail about how those concerns would be operationalized. It does not reference warrants, court orders, or judicial authorization.

Nor does it explain how ICE would distinguish between data associated with U.S. persons and noncitizens, how long information would be retained, or whether data obtained for one investigative purpose could be reused for others.

That ambiguity is particularly significant given HSI's broad mandate. Unlike agencies focused solely on criminal enforcement, HSI conducts civil and administrative investigations alongside criminal cases.

Location data or ad tech-derived insights could therefore be used in contexts ranging from immigration enforcement to customs violations to sanctions and export control investigations, often under lower legal thresholds than those required in criminal proceedings.

ICE's emphasis on "Ad Tech compliant" services also underscores a fundamental tension. Compliance in the advertising industry typically refers to adherence to self-regulatory frameworks, contractual obligations, and privacy policies that permit extensive data collection so long as certain disclosures are made.

Those standards are not designed to constrain government use, nor do they substitute for constitutional or statutory protections governing law enforcement surveillance.

Companies marketing "privacy-friendly" location or IP intelligence tools often argue that they avoid directly identifying individuals. But researchers and regulators have repeatedly demonstrated that supposedly anonymized or aggregated data can be reidentified when combined with other datasets.

In an investigative context, reidentification is not a bug but a feature, enabling analysts to link digital signals back to real-world subjects.

Biometric Update earlier reported that a Government Accountability Office audit had found that publicly accessible data — from social media posts to commercial geolocation records — can be aggregated into detailed "digital profiles" that expose U.S. personnel, military operations, and senior leaders to targeting, coercion, and disruption.

In January 2025, Gravy Analytics, a prominent location data broker, disclosed that a significant data breach had potentially exposed through de-anonymization the precise location information of millions of individuals.

The RFI's focus on live demonstrations suggests that ICE is interested in mature, deployable capabilities rather than theoretical offerings. Vendors selected to present would be expected to show how their platforms operate in practice, how data is accessed and analyzed, and how investigative outputs are generated.

While the agency stresses that it is not committing to a future solicitation, such demonstrations often inform subsequent procurements, task orders, or pilot programs conducted under existing contracts.

ICE has used similar market research approaches in the past to normalize new surveillance capabilities before formal adoption.

Social media monitoring tools, mobile biometric systems, and large-scale analytics platforms were all introduced through incremental steps that began with RFIs and demonstrations rather than headline-grabbing contracts.

For privacy advocates, the latest filing fits a familiar pattern. Commercial surveillance markets evolve rapidly, driven by advertising and marketing demand. Government agencies then adopt those tools after the fact, often before lawmakers have fully grappled with the implications.

Oversight mechanisms, however, lag technical capability, leaving key questions unanswered until after systems are already in use.

ICE's RFI does not indicate when demonstrations might occur or whether a solicitation will follow. It does make clear, though, that the agency sees the ad tech ecosystem as a potential investigative resource worth serious consideration.

As debates over commercial data, surveillance, and constitutional protections continue, the filing offers a window into how federal law enforcement is adapting to — and seeking to leverage — a data economy built for advertising rather than accountability.

For now, ICE is asking industry to explain how ad tech-derived location and analytics services can be made suitable for investigative use while respecting privacy expectations.

What remains unclear is who will define those expectations, how they will be enforced, and whether existing legal frameworks are equipped to govern a surveillance model that blurs the line between consumer marketing and government intelligence.

...

Read the original on www.biometricupdate.com »

9 274 shares, 58 trendiness

Building a C compiler with a team of parallel Claudes

Written by Nicholas Carlini, a researcher on our Safeguards team.

I've been experimenting with a new approach to supervising language models that we're calling "agent teams." With agent teams, multiple Claude instances work in parallel on a shared codebase without active human intervention. This approach dramatically expands the scope of what's achievable with LLM agents. To stress test it, I tasked 16 agents with writing a Rust-based C compiler, from scratch, capable of compiling the Linux kernel. Over nearly 2,000 Claude Code sessions and $20,000 in API costs, the agent team produced a 100,000-line compiler that can build Linux 6.9 on x86, ARM, and RISC-V. The compiler is an interesting artifact on its own, but I focus here on what I learned about designing harnesses for long-running autonomous agent teams: how to write tests that keep agents on track without human oversight, how to structure work so multiple agents can make progress in parallel, and where this approach hits its ceiling.

Existing agent scaffolds like Claude Code require an operator to be online and available to work jointly. If you ask for a solution to a long and complex problem, the model may solve part of it, but eventually it will stop and wait for continued input—a question, a status update, or a request for clarification.

To elicit sustained, autonomous progress, I built a harness that sticks Claude in a simple loop (if you've seen Ralph-loop, this should look familiar). When it finishes one task, it immediately picks up the next. (Run this in a container, not your actual machine).
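The harness script itself is not reproduced in this excerpt, but the loop it describes can be sketched in a few lines. This is a minimal sketch, assuming Claude Code's non-interactive mode is invoked with claude -p and permission prompts are disabled; the flag names and prompt file are placeholders to check against the actual CLI, not the author's harness.

```python
# Minimal sketch of the infinite agent loop (run inside a container, not on
# your real machine). The CLI invocation is an assumption about Claude Code's
# non-interactive mode; agent_prompt.md is a placeholder for the real prompt.
import pathlib
import subprocess

PROMPT = pathlib.Path("agent_prompt.md").read_text()  # what to solve, how to track and pick tasks

while True:
    # Each iteration is a fresh Claude Code session. When one session ends,
    # the next immediately picks up wherever the repository was left.
    subprocess.run(
        ["claude", "-p", PROMPT, "--dangerously-skip-permissions"],
        cwd="/workspace",
    )
```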

In the agent prompt, I tell Claude what problem to solve and ask it to approach the problem by breaking it into small pieces, tracking what it's working on, figuring out what to work on next, and to effectively keep going until it's perfect. (On this last point, Claude has no choice. The loop runs forever—although in one instance, I did see Claude pkill -9 bash on accident, thus killing itself and ending the loop. Whoops!).

Running multiple instances in parallel can address two weaknesses of a single-agent harness:

* One Claude Code session can only do one thing at a time. Especially as the scope of a project expands, debugging multiple issues in parallel is far more efficient.

* Running multiple Claude agents allows for specialization. While a few agents are tasked to solve the actual problem at hand, other specialized agents can be invoked to (for example) maintain documentation, keep an eye on code quality, or solve specialized sub-tasks.

My implementation of parallel Claude is bare-bones. A new bare git repo is created, and for each agent, a Docker container is spun up with the repo mounted to /upstream. Each agent clones a local copy to /workspace, and when it's done, pushes from its own local container to upstream.

To prevent two agents from trying to solve the same problem at the same time, the harness uses a simple synchronization algorithm (sketched below):

* Claude takes a "lock" on a task by writing a text file to current_tasks/ (e.g., one agent might lock current_tasks/parse_if_statement.txt, while another locks current_tasks/codegen_function_definition.txt). If two agents try to claim the same task, git's synchronization forces the second agent to pick a different one.

* Claude works on the task, then pulls from upstream, merges changes from other agents, pushes its changes, and removes the lock. Merge conflicts are frequent, but Claude is smart enough to figure that out.

* The infinite agent-generation loop spawns a new Claude Code session in a fresh container, and the cycle repeats.

This is a very early research prototype. I haven't yet implemented any other method for communication between agents, nor do I enforce any process for managing high-level goals. I don't use an orchestration agent. Instead, I leave it up to each Claude agent to decide how to act. In most cases, Claude picks up the "next most obvious" problem. When stuck on a bug, Claude will often maintain a running doc of failed approaches and remaining tasks. In the git repository of the project, you can read through the history and watch it take out locks on various tasks.

The scaffolding runs Claude in a loop, but that loop is only useful if Claude can tell how to make progress. Most of my effort went into designing the environment around Claude—the tests, the environment, the feedback—so that it could orient itself without me. These are the approaches I've found most helpful when orchestrating multiple Claude instances.

Claude will work autonomously to solve whatever problem I give it, so it's important that the task verifier is nearly perfect; otherwise Claude will solve the wrong problem.
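Before moving on to the test harness: a simplified sketch of the claim-a-task handshake described above. The task names and paths are placeholders, error handling is minimal, and the real harness leaves most of these decisions to Claude rather than to a fixed script.

```python
# Simplified sketch of the lock-file synchronization: claim a task by
# committing a file under current_tasks/ and let git reject the second
# claimant. Task names are placeholders; error handling is minimal.
import subprocess

def git(*args, check=True):
    return subprocess.run(["git", *args], cwd="/workspace", check=check)

def try_claim(task: str) -> bool:
    lock = f"current_tasks/{task}.txt"
    with open(f"/workspace/{lock}", "w") as f:
        f.write("claimed\n")
    git("add", lock)
    git("commit", "-m", f"claim {task}")
    # Rebase onto upstream; a conflict on the lock file means another agent
    # claimed this task first, so give it up and try a different one.
    if git("pull", "--rebase", "origin", "main", check=False).returncode != 0:
        git("rebase", "--abort", check=False)
        git("reset", "--hard", "origin/main")
        return False
    git("push", "origin", "main")
    return True

for task in ["parse_if_statement", "codegen_function_definition"]:
    if try_claim(task):
        break  # do the work, push the changes, then delete the lock file
```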
Improving the testing harness required finding high-quality compiler test suites, writing verifiers and build scripts for open-source software packages, and watching for mistakes Claude was making, then designing new tests as I identified those failure modes.

For example, near the end of the project, Claude started to frequently break existing functionality each time it implemented a new feature. To address this, I built a continuous integration pipeline and implemented stricter enforcement that allowed Claude to better test its work, so that new commits can't break existing code.

I had to constantly remind myself that I was writing this test harness for Claude and not for myself, which meant rethinking many of my assumptions about how tests should communicate results.

For example, each agent is dropped into a fresh container with no context and will spend significant time orienting itself, especially on large projects. Before we even reach the tests, to help Claude help itself, I included instructions to maintain extensive READMEs and progress files that should be updated frequently with the current status.

I also kept in mind the fact that language models have inherent limitations which, in this case, needed to be designed around. These include:

- Context window pollution: The test harness should not print thousands of useless bytes. At most, it should print a few lines of output and log all important information to a file so Claude can find it when needed. Logfiles should be easy to process automatically: if there are errors, Claude should write ERROR and put the reason on the same line so grep will find it. It helps to pre-compute aggregate summary statistics so Claude doesn't have to recompute them.
- Time blindness: Claude can't tell time and, left alone, will happily spend hours running tests instead of making progress. The harness prints incremental progress infrequently (to avoid polluting context) and includes a default --fast option that runs a 1% or 10% random sample. This subsample is deterministic per-agent but random across VMs, so Claude still covers all files but each agent can perfectly identify regressions.

When there are many distinct failing tests, parallelization is trivial: each agent picks a different failing test to work on. After the test suite reached a 99% pass rate, each agent worked on getting a different small open-source project (e.g., SQLite, Redis, libjpeg, MQuickJS, Lua) to compile.

But when agents started to compile the Linux kernel, they got stuck. Unlike a test suite with hundreds of independent tests, compiling the Linux kernel is one giant task. Every agent would hit the same bug, fix that bug, and then overwrite each other's changes. Having 16 agents running didn't help, because each was stuck solving the same task.

The fix was to use GCC as a known-good compiler oracle to compare against. I wrote a new test harness that randomly compiled most of the kernel using GCC, and only the remaining files with Claude's C Compiler. If the kernel worked, then the problem wasn't in Claude's subset of the files. If it broke, the agent could narrow things down further by re-compiling some of those files with GCC. This let each agent work in parallel, fixing different bugs in different files, until Claude's compiler could eventually compile all files.
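A minimal sketch of the oracle idea, assuming a hypothetical build_and_boot() helper that compiles a chosen subset of kernel files with Claude's compiler, the rest with GCC, and boot-tests the result. The bisection shown is illustrative and assumes a single culprit file; the pairwise failures mentioned next required delta debugging on top of this:

```python
# Hypothetical sketch of using GCC as a known-good oracle to localize
# miscompiled kernel files down to something one agent can debug.
import random

def build_and_boot(files_with_ccc: set[str]) -> bool:
    """Placeholder: build the kernel with `files_with_ccc` compiled by
    Claude's compiler and everything else by GCC, then boot-test it."""
    raise NotImplementedError

def localize_failure(all_files: list[str], sample_size: int = 50) -> set[str]:
    """Pick a random subset for Claude's compiler; if the build breaks,
    bisect that subset by handing half of it back to GCC each round."""
    suspects = set(random.sample(all_files, min(sample_size, len(all_files))))
    if build_and_boot(suspects):
        return set()  # the bug is not in this agent's subset
    while len(suspects) > 1:
        half = set(list(suspects)[: len(suspects) // 2])
        suspects = half if not build_and_boot(half) else suspects - half
    return suspects  # a single file for this agent to debug
```

Because each agent draws a different random subset, the agents end up chasing different miscompiled files instead of all converging on the same bug.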
(After this worked, it was still necessary to apply delta debugging techniques to find pairs of files that failed together but worked independently.)

Parallelism also enables specialization. LLM-written code frequently re-implements existing functionality, so I tasked one agent with coalescing any duplicate code it found. I put another in charge of improving the performance of the compiler itself, and made a third responsible for outputting efficient compiled code. I asked another agent to critique the design of the project from the perspective of a Rust developer and make structural changes to improve the overall code quality, and another to work on documentation.

This project was designed as a capability benchmark. I am interested in stress-testing the limits of what LLMs can just barely achieve today in order to help us prepare for what models will reliably achieve in the future.

I've been using the C Compiler project as a benchmark across the entire Claude 4 model series. As I did with prior projects, I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes), I did not go into any detail on how to do so.

Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects. My goal with Opus 4.6 was to again test the limits.

Over nearly 2,000 Claude Code sessions across two weeks, Opus 4.6 consumed 2 billion input tokens and generated 140 million output tokens, at a total cost just under $20,000. Compared to even the most expensive Claude Max plans, this was an extremely expensive project. But that total is a fraction of what it would cost me to produce this myself, let alone an entire team.

This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build a bootable Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQLite, Postgres, and Redis, and has a 99% pass rate on most compiler test suites, including the GCC torture test suite. It also passes the developer's ultimate litmus test: it can compile and run Doom.

The compiler, however, is not without limitations. These include:

- It lacks the 16-bit x86 compiler that is necessary to boot Linux out of real mode. For this, it calls out to GCC (the x86_32 and x86_64 compilers are its own).
- It does not have its own assembler and linker; these are the very last bits that Claude started automating and are still somewhat buggy. The demo video was produced with a GCC assembler and linker.
- The compiler successfully builds many projects, but not all. It's not yet a drop-in replacement for a real compiler.
- The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
- The Rust code quality is reasonable, but is nowhere near the quality of what an expert Rust programmer might produce.

The resulting compiler has nearly reached the limits of Opus's abilities. I tried (hard!) to fix several of the above limitations but wasn't fully successful. New features and bugfixes frequently broke existing functionality.

As one particularly challenging example, Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60 KB, far exceeding the 32 KB code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase. (This is only the case for x86; for ARM or RISC-V, Claude's compiler can compile everything by itself.)

The source code for the compiler is available. Download it, read through the code, and try it on your favorite C projects. I've consistently found the best way to understand what language models can do is to push them to their limits, and then study where they start to break down. Over the coming days, I'll continue having Claude push new changes if you want to follow along with Claude's continued attempts at addressing these limitations.

Each generation of language models opens up new ways of working with them. Early models were useful for tab-completion in IDEs. Before long, models could complete a function body from its docstring. The launch of Claude Code brought agents into the mainstream and enabled developers to pair-program with Claude. But each of these products operates under the assumption that a user defines a task, an LLM runs for a few seconds or minutes and returns an answer, and then the user provides a follow-up.

Agent teams show the possibility of implementing entire, complex projects autonomously. This allows us, as users of these tools, to become more ambitious with our goals.

We are still early, and fully autonomous development comes with real risks. When a human sits with Claude during development, they can ensure consistent quality and catch errors in real time. For autonomous systems, it is easy to see tests pass and assume the job is done, when this is rarely the case. I used to work in penetration testing, exploiting vulnerabilities in products produced by large companies, and the thought of programmers deploying software they've never personally verified is a real concern.

So, while this experiment excites me, it also leaves me feeling uneasy. Building this compiler has been some of the most fun I've had recently, but I did not expect this to be anywhere near possible so early in 2026. The rapid progress in both language models and the scaffolds we use to interact with them opens the door to writing an enormous amount of new code. I expect the positive applications to outweigh the negative, but we're entering a new world which will require new strategies to navigate safely.

Special thanks to Josef Bacik, Edwin Chen, Bernardo Meurer Costa, Jake Eaton, Dan Kelley, Felix Klock, Jannet Park, Steve Weis, and many other people across Anthropic for their assistance and contributions.


...

Read the original on www.anthropic.com »

10 260 shares, 12 trendiness

Why More Companies Are Recognizing the Benefits of Keeping Older Employees

Although age bias is still the norm, the value-add of long­time, ex­pe­ri­enced work­ers is be­gin­ning to take shape.

On the out­skirts of Macclesfield, in north­west England, a branch of the UK home-im­prove­ment re­tailer B&Q qui­etly over­turned one of cor­po­rate life’s most per­sis­tent as­sump­tions. Faced with high staff turnover and un­even cus­tomer sat­is­fac­tion, the com­pany tried a sim­ple ex­per­i­ment: In 1989, it staffed the store largely with older work­ers.

The re­sults were strik­ing, ac­cord­ing to one study. Profits rose 18 per­cent. Staff turnover fell to a frac­tion of the com­pany av­er­age. Absenteeism dropped sharply. An ex­per­i­ment that started more than 30 years ago re­shaped how the re­tailer ap­proached age in­clu­sive­ness and led B&Q to open train­ing to all ages and fea­ture older work­ers in ad­ver­tis­ing, treat­ing ex­pe­ri­ence as an ad­van­tage rather than a cost.

In 2007, BMW be­gan im­ple­ment­ing 70 er­gonomic, low-cost im­prove­ments in a spe­cial­ized as­sem­bly line in Dingolfing, Germany, to pro­vide bet­ter con­di­tions for its many older and mid­dle-aged work­ers. Key changes in­cluded ad­justable-height work­sta­tions, im­proved light­ing and spe­cial­ized stools, re­sult­ing in a 7 per­cent pro­duc­tiv­ity in­crease.

Evidence sug­gests that sim­i­lar age-per­for­mance dy­nam­ics are not lim­ited to the quirks of re­tail or to the fac­tory floor and are in­creas­ingly rel­e­vant as de­clin­ing birth rates and ar­ti­fi­cial in­tel­li­gence in­vest­ments re­duce the in­flow of en­try-level work­ers. A white pa­per from Bank of America’s Workplace Benefits group ar­gues that re­cruit­ing and re­tain­ing older work­ers is be­com­ing in­creas­ingly im­por­tant as pop­u­la­tions age, fram­ing age-in­clu­sive ben­e­fits not as ac­com­mo­da­tion, but as a dri­ver of or­ga­ni­za­tional per­for­mance, es­pe­cially for roles where judg­ment, ex­pe­ri­ence and de­ci­sion qual­ity mat­ter most.

"The retention of these older workers is an idea that is becoming much more well-received," says Cynthia Hutchins, Bank of America's inaugural director of financial gerontology. Hutchins has been involved in implementing a workforce longevity policy that includes hybrid schedules, financial planning benefits, menopause support, grandparents' leave and sabbaticals. "It's almost a business imperative to institute those types of benefits" to retain older workers and attract younger ones, adds Hutchins.

Yet ini­tia­tives such as these are rarely framed as strat­egy or as sig­nals of a deeper shift. Most cor­po­ra­tions con­tinue to de­sign ca­reers as if ef­fec­tive­ness peaks early — as if speed, sta­mina and in­no­va­tion be­long ex­clu­sively to the young. If ex­pe­ri­ence im­proves out­comes, why are so many or­ga­ni­za­tions struc­tured to push peo­ple out just as their value peaks?

At the heart of corporate resistance lies a fundamental disagreement about value. Moody's Analytics chief economist Mark Zandi framed the debate in Aging and the Productivity Puzzle, a 2018 analysis delineating two schools of thought. The "albatross theory" holds that workers above the age of 65 drag down productivity due to resistance to change and outdated skills. The "wise man theory" tells a different story: of workers who possess judgment, institutional knowledge, emotional intelligence and expertise that younger employees cannot replicate.

Zandi and his colleagues analyzed state-level ADP data in the U. S. and concluded that post-retirement-age workers slowed wage growth and productivity, largely because they tend to be averse to adopting new technologies. Yet several major institutions reject the idea that older workers are a productivity "albatross" — and most look at the effects, not of those above the age of 65, but of the 50-plus age workforce, often the first in line for layoffs.

More recent research from AARP and the OECD shows that firms with more 50-plus workers are more productive, not less: a 10-percentage-point increase in older workers is associated with roughly 1.1 percent higher productivity. The 2020 OECD analysis also finds that age-balanced firms benefit from lower turnover and stronger team performance, driven by experience and knowledge sharing rather than technology resistance. Similarly, a 2022 study from Boston Consulting Group found that cross-generational teams outperform homogeneous ones when older workers' judgment and mentoring are combined with younger workers' digital skills. A 2022 meta-analysis also pushes back against the idea that older workers are less effective, finding that teams perform better when members have a long tenure at the company, irrespective of workers' ages.

Still, Zandi says that the value of older workers may depend on how AI in the workplace unfolds and what impact it has on productivity growth. "If AI turns out to be a bust or doesn't live up to expectations, and you have other demographic forces that are restraining labor growth, then I think older workers should fare well," Zandi says. He notes that so far, older workers have "navigated things reasonably gracefully," while younger workers and mid-level managers are taking the brunt of AI-related impacts.

Population ag­ing is of­ten treated as a fu­ture prob­lem, some­thing to be man­aged later with tech­nol­ogy or pol­icy tweaks. In re­al­ity, it is al­ready re­shap­ing la­bor mar­kets in the U. S. and across ad­vanced economies. Birth rates are lower, peo­ple are liv­ing longer and the share of work­ers above the age of 50 is ris­ing steadily. This is not a fore­cast. It is arith­metic.

Yet or­ga­ni­za­tional as­sump­tions about per­for­mance have not kept pace. Modern ca­reers are still built around the idea that ef­fec­tive­ness peaks early. Recent re­search chal­lenges that view. A 2025 study in the jour­nal Intelligence, an­a­lyz­ing age tra­jec­to­ries across 16 cog­ni­tive, emo­tional and per­son­al­ity di­men­sions, finds that while pro­cess­ing speed does de­cline af­ter early adult­hood, many of the ca­pa­bil­i­ties most rel­e­vant to com­plex work con­tinue to im­prove well into midlife. When these traits are com­bined into a com­pos­ite mea­sure of over­all func­tion­ing, per­for­mance peaks be­tween ages 55 and 60.

But if pro­fi­ciency in­creas­ingly peaks in late midlife, then why are so many ca­reers end­ing be­fore they can be fully re­al­ized? Across ad­vanced economies, there ap­pears to be a per­sis­tent pat­tern of early ex­its that are less about in­di­vid­ual choice than or­ga­ni­za­tional de­sign.

In the U. S., analy­sis by the Urban Institute of sur­vey data of older work­ers from 1992 to 2016 showed that more than half above the age of 50 were pushed out of long-held jobs be­fore they chose to re­tire, of­ten through lay­offs or re­struc­tur­ing rather than per­for­mance is­sues. The 2018 study — along with re­port­ing from ProPublica — found that few ever re­gained com­pa­ra­ble pay or re­spon­si­bil­ity, and hir­ing prac­tices re­in­forced the trend.

Bill Greene, a long­time busi­ness con­sul­tant, is an ex­cep­tion to this lay­off trend. Hired at 64 as prin­ci­pal of Mind Share Partners, a non­profit in San Francisco, he ad­vises com­pa­nies on the im­por­tance of cre­at­ing men­tally healthy en­vi­ron­ments, and cau­tions that the work­place is a mine­field of bi­ases — and that ageism cuts both ways for older work­ers and younger work­ers.

Greene advises employers to be aware of the blind spots and inconsistencies. In the technology industry, he says, it's widely perceived that "if you are 45 years old or over, you are a dinosaur," yet in politics, "you can be 70, 75, 80, 85, and apparently that's OK."

Experience helps in an emer­gency. When the Covid-19 pan­demic struck in 2020, Greene was con­sult­ing for a fi­nan­cial ser­vices firm, and he saw first­hand how wor­ried his client was that younger em­ploy­ees were go­ing to panic and quit be­cause they had­n’t been through a cri­sis of that mag­ni­tude be­fore.

"They realized that they had to coach their younger employees," he says, comparing the pandemic to the 2008 financial crash to help the client's staff understand the risks and path forward. "That kind of wisdom and experience can come with more depth of understanding and perspective from an older employee than from a younger one," he says.

Although sev­eral Fortune 500 com­pa­nies have ad­ver­tised their in­ter­est in hir­ing and re­tain­ing older work­ers, cor­po­rate com­mit­ments re­main ten­ta­tive and small-scale. UK-based Unilever launched its U-Work pro­gram in 2019, and now of­fers em­ploy­ees in nine coun­tries a hy­brid be­tween tra­di­tional em­ploy­ment and gig work: a monthly re­tainer, ben­e­fits and free­dom to choose which pro­jects they work on and when. Workers can scale back hours, pur­sue other in­ter­ests or tran­si­tion grad­u­ally to­ward re­tire­ment.

The pro­gram is in­no­v­a­tive and, by all ac­counts, suc­cess­ful. Half of par­tic­i­pants are above the age of 50. But only 140 em­ploy­ees out of Unilever’s 150,000-strong global work­force par­tic­i­pate. This raises a ques­tion: Are these strate­gies of gen­uine trans­for­ma­tion or so­phis­ti­cated pub­lic re­la­tions?

Three con­verg­ing forces make the case for ur­gency. First, pre­ma­ture exit cre­ates value leak­age. The fact that more than half of U. S. work­ers above the age of 50 leave long-held jobs for rea­sons un­re­lated to per­for­mance and be­fore they choose to re­tire is a sys­temic de­sign fail­ure.

Second, the de­mand-side blind spot. Globally, spend­ing by peo­ple above the age of 55 is pro­jected to ap­proach $15 tril­lion an­nu­ally by the end of this decade, mak­ing older con­sumers one of the largest and fastest-grow­ing sources of de­mand in the world econ­omy. Yet many com­pa­nies treat older cus­tomers as pe­riph­eral.

There are exceptions. Alan Patricof, now 91 and still investing, launched Primetime Partners at 85 after observing that venture capital remained focused on millennials, despite obvious unmet demand among older adults. His fund has invested in more than 35 companies serving what he calls the "ageless market." Consumer brands are adapting, too — L'Oréal has repositioned itself around longevity and healthy aging, treating later life as aspiration rather than decline.

The sil­ver econ­omy is not a niche. It is one of the largest and least con­tested growth op­por­tu­ni­ties of the next decade — and one that many firms still un­der­es­ti­mate.

Third, longer work­ing lives are in­evitable. In Europe and the UK, ef­fec­tive re­tire­ment ages have been climb­ing, dri­ven in part by fi­nan­cial need and pol­icy changes. Meanwhile, in the U. S., the shift from de­fined-ben­e­fit to de­fined-con­tri­bu­tion re­tire­ment plans in­cen­tivizes work­ers to re­main em­ployed longer. Organizations that fail to re­tain ex­pe­ri­enced tal­ent will face la­bor short­ages, while com­peti­tors ben­e­fit from work­ers who bring judg­ment, sta­bil­ity and in­sti­tu­tional mem­ory.

The mis­match be­tween de­mo­graphic re­al­ity and cor­po­rate be­hav­ior is be­gin­ning to reg­is­ter with long-term in­vestors. Large as­set man­agers in­creas­ingly frame longevity as a struc­tural eco­nomic force with im­pli­ca­tions for growth, pro­duc­tiv­ity and risk.

A Vanguard study, The Economics of a Graying World, high­lights ag­ing and slower la­bor-force growth as a per­sis­tent drag on eco­nomic ex­pan­sion, ar­gu­ing that longer work­ing lives are one of the few vi­able ad­just­ment mech­a­nisms. From this per­spec­tive, work­force age pol­icy be­comes fi­nan­cially ma­te­r­ial, not op­tional.

Economist Andrew J. Scott of the London Business School argues in his 2024 book The Longevity Imperative that if societies see longevity primarily as an "aging problem" of more pensioners, higher health costs and fewer workers, longer lives risk becoming a fiscal drag. But if they invest in health, skills and age-inclusive work, longevity can instead raise growth, employment and innovation.

One hur­dle to this shift in per­spec­tive is an on­go­ing lack of trans­parency and ac­count­abil­ity by em­ploy­ers. Ageism in hir­ing, pro­mo­tion and re­dun­dancy re­mains wide­spread, yet un­like gen­der or eth­nic­ity, work­force age is rarely dis­closed or scru­ti­nized. The re­sult is a grow­ing gov­er­nance gap. Misalignment with de­mo­graphic re­al­ity cre­ates ex­e­cu­tion risk — in tal­ent, pro­duc­tiv­ity and growth.

The case for a longevity strat­egy is ul­ti­mately an eco­nomic one. When or­ga­ni­za­tions push ex­pe­ri­enced work­ers out early, they for­feit peak judg­ment, ex­e­cu­tion ca­pa­bil­ity and men­tor­ing ca­pac­ity. When they un­der­in­vest in older con­sumers, they leave vast pools of de­mand un­der­served. Value is for­feited on both sides of the busi­ness.

In meet­ing their re­spon­si­bil­ity for long-term risk and growth, com­pa­nies should be­gin with clar­ity. Map the age pro­file of the work­force by role and se­nior­ity. Identify where peo­ple in their fifties and early six­ties are ex­it­ing — and whether those ex­its re­flect per­for­mance or de­sign. Treat age as a strate­gic vari­able in the same way firms now treat gen­der, skills or suc­ces­sion risk.

From there, re­design fol­lows. Build roles and ca­reer paths that as­sume longer work­ing lives. Invest in mid- and late-ca­reer reskilling, not as re­me­di­a­tion but as re­newal. Structure in­ter­gen­er­a­tional teams de­lib­er­ately, so ex­pe­ri­ence and speed com­pound rather than col­lide. Align prod­uct, ser­vice and brand strat­egy with the re­al­i­ties of an ag­ing, wealth­ier cus­tomer base.

None of this is about al­tru­ism. It is about re­claim­ing value cur­rently be­ing left on the table. As pop­u­la­tions age, com­pa­nies that learn to re­tain ex­pe­ri­ence and serve longevity-dri­ven de­mand will not just adapt — they will out­per­form.

Annie Coleman is Founder of RealiseLongevity, a con­sult­ing firm based in the UK, and is a Stanford Center on Longevity Ambassador.

...

Read the original on longevity.stanford.edu »
