10 interesting stories served every morning and every evening.




1 2,043 shares, 102 trendiness

Claude Opus 4.6

The new Claude Opus 4.6 improves on its predecessor's coding skills. It plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills to catch its own mistakes. And, in a first for our Opus-class models, Opus 4.6 features a 1M token context window in beta. Opus 4.6 can also apply its improved abilities to a range of everyday work tasks: running financial analyses, doing research, and using and creating documents, spreadsheets, and presentations. Within Cowork, where Claude can multitask autonomously, Opus 4.6 can put all these skills to work on your behalf.

The model's performance is state-of-the-art on several evaluations. For example, it achieves the highest score on the agentic coding evaluation Terminal-Bench 2.0 and leads all other frontier models on Humanity's Last Exam, a complex multidisciplinary reasoning test. On GDPval-AA—an evaluation of performance on economically valuable knowledge work tasks in finance, legal, and other domains1—Opus 4.6 outperforms the industry's next-best model (OpenAI's GPT-5.2) by around 144 Elo points,2 and its own predecessor (Claude Opus 4.5) by 190 points. Opus 4.6 also performs better than any other model on BrowseComp, which measures a model's ability to locate hard-to-find information online.

As we show in our extensive system card, Opus 4.6 also shows an overall safety profile as good as, or better than, any other frontier model in the industry, with low rates of misaligned behavior across safety evaluations.

Opus 4.6 is state-of-the-art on real-world work tasks across several professional domains.

Opus 4.6 gets the highest score in the industry for deep, multi-step agentic search.

In Claude Code, you can now assemble agent teams to work on tasks together. On the API, Claude can use compaction to summarize its own context and perform longer-running tasks without bumping up against limits. We're also introducing adaptive thinking, where the model can pick up on contextual clues about how much to use its extended thinking, and new effort controls to give developers more control over intelligence, speed, and cost. We've made substantial upgrades to Claude in Excel, and we're releasing Claude in PowerPoint in a research preview. This makes Claude much more capable for everyday work.

Claude Opus 4.6 is available today on claude.ai, our API, and all major cloud platforms. If you're a developer, use claude-opus-4-6 via the Claude API. Pricing remains the same at $5/$25 per million tokens; for full details, see our pricing page.

We cover the model, our new product updates, our evaluations, and our extensive safety testing in depth below.

We build Claude with Claude. Our engineers write code with Claude Code every day, and every new model first gets tested on our own work. With Opus 4.6, we've found that the model brings more focus to the most challenging parts of a task without being told to, moves quickly through the more straightforward parts, handles ambiguous problems with better judgment, and stays productive over longer sessions.

Opus 4.6 often thinks more deeply and more carefully revisits its reasoning before settling on an answer. This produces better results on harder problems, but can add cost and latency on simpler ones. If you're finding that the model is overthinking on a given task, we recommend dialing effort down from its default setting (high) to medium. You can control this easily with the /effort parameter.

Here are some of the things our Early Access partners told us about Claude Opus 4.6, including its propensity to work autonomously without hand-holding, its success where previous models failed, and its effect on how teams work:

Claude Opus 4.6 is the strongest model Anthropic has shipped. It takes complicated requests and actually follows through, breaking them into concrete steps, executing, and producing polished work even when the task is ambitious. For Notion users, it feels less like a tool and more like a capable collaborator.

Early testing shows Claude Opus 4.6 delivering on the complex, multi-step coding work developers face every day—especially agentic workflows that demand planning and tool calling. This starts unlocking long-horizon tasks at the frontier.

Claude Opus 4.6 is a huge leap for agentic planning. It breaks complex tasks into independent subtasks, runs tools and subagents in parallel, and identifies blockers with real precision.

Claude Opus 4.6 is the best model we've tested yet. Its reasoning and planning capabilities have been exceptional at powering our AI Teammates. It's also a fantastic coding model — its ability to navigate a large codebase and identify the right changes to make is state of the art.

Claude Opus 4.6 reasons through complex problems at a level we haven't seen before. It considers edge cases that other models miss and consistently lands on more elegant, well-considered solutions. We're particularly impressed with Opus 4.6 in Devin Review, where it's increased our bug catching rates.

Claude Opus 4.6 feels noticeably better than Opus 4.5 in Windsurf, especially on tasks that require careful exploration like debugging and understanding unfamiliar codebases. We've noticed Opus 4.6 thinks longer, which pays off when deeper reasoning is needed.

Claude Opus 4.6 represents a meaningful leap in long-context performance. In our testing, we saw it handle much larger bodies of information with a level of consistency that strengthens how we design and deploy complex research workflows. Progress in this area gives us more powerful building blocks to deliver truly expert-grade systems professionals can trust.

Across 40 cybersecurity investigations, Claude Opus 4.6 produced the best results 38 of 40 times in a blind ranking against Claude 4.5 models. Each model ran end to end on the same agentic harness with up to 9 subagents and 100+ tool calls.

Claude Opus 4.6 is the new frontier on long-running tasks from our internal benchmarks and testing. It's also been highly effective at reviewing code.

Claude Opus 4.6 achieved the highest BigLaw Bench score of any Claude model at 90.2%. With 40% perfect scores and 84% above 0.8, it's remarkably capable for legal reasoning.

Claude Opus 4.6 autonomously closed 13 issues and assigned 12 issues to the right team members in a single day, managing a ~50-person organization across 6 repositories. It handled both product and organizational decisions while synthesizing context across multiple domains, and it knew when to escalate to a human.

Claude Opus 4.6 is an uplift in design quality. It works beautifully with our design systems and it's more autonomous, which is core to Lovable's values. People should be creating things that matter, not micromanaging AI.

Claude Opus 4.6 excels in high-reasoning tasks like multi-source analysis across legal, financial, and technical content. Box's eval showed a 10% lift in performance, reaching 68% vs. a 58% baseline, and near-perfect scores in technical domains.

Claude Opus 4.6 generates complex, interactive apps and prototypes in Figma Make with an impressive creative range. The model translates detailed designs and multi-layered tasks into code on the first try, making it a powerful starting point for teams to explore and build ideas.

Claude Opus 4.6 is the best Anthropic model we've tested. It understands intent with minimal prompting and went above and beyond, exploring and creating details I didn't even know I wanted until I saw them. It felt like I was working with the model, not waiting on it.

Both hands-on testing and evals show Claude Opus 4.6 is a meaningful improvement for design systems and large codebases, use cases that drive enormous enterprise value. It also one-shotted a fully functional physics engine, handling a large multi-scope task in a single pass.

Claude Opus 4.6 is the biggest leap I've seen in months. I'm more comfortable giving it a sequence of tasks across the stack and letting it run. It's smart enough to use subagents for the individual pieces.

Claude Opus 4.6 handled a multi-million-line codebase migration like a senior engineer. It planned up front, adapted its strategy as it learned, and finished in half the time.

We only ship models in v0 when developers will genuinely feel the difference. Claude Opus 4.6 passed that bar with ease. Its frontier-level reasoning, especially with edge cases, helps v0 to deliver on our number-one aim: to let anyone elevate their ideas from prototype to production.

The performance jump with Claude Opus 4.6 feels almost unbelievable. Real-world tasks that were challenging for Opus [4.5] suddenly became easy. This feels like a watershed moment for spreadsheet agents on Shortcut.

Across agentic coding, computer use, tool use, search, and finance, Opus 4.6 is an industry-leading model, often by a wide margin. The table below shows how Claude Opus 4.6 compares to our previous models and to other industry models on a variety of benchmarks.

Opus 4.6 is much better at retrieving relevant information from large sets of documents. This extends to long-context tasks, where it holds and tracks information over hundreds of thousands of tokens with less drift, and picks up buried details that even Opus 4.5 would miss.

A common complaint about AI models is "context rot," where performance degrades as conversations exceed a certain number of tokens. Opus 4.6 performs markedly better than its predecessors: on the 8-needle 1M variant of MRCR v2—a needle-in-a-haystack benchmark that tests a model's ability to retrieve information "hidden" in vast amounts of text—Opus 4.6 scores 76%, whereas Sonnet 4.5 scores just 18.5%.
This is a qualitative shift in how much context a model can actually use while maintaining peak performance.

All in all, Opus 4.6 is better at finding information across long contexts, better at reasoning after absorbing that information, and has substantially better expert-level reasoning abilities in general.

Finally, the charts below show how Claude Opus 4.6 performs on a variety of benchmarks that assess its software engineering skills, multilingual coding ability, long-term coherence, cybersecurity capabilities, and its life sciences knowledge.

Opus 4.6 maintains focus over time and earns $3,050.53 more than Opus 4.5 on Vending-Bench 2.

Opus 4.6 finds real vulnerabilities in codebases better than any other model.

Opus 4.6 performs better than Opus 4.5 on computational biology, structural biology, organic chemistry, and phylogenetics tests.

These intelligence gains do not come at the cost of safety. On our automated behavioral audit, Opus 4.6 showed a low rate of misaligned behaviors such as deception, sycophancy, encouragement of user delusions, and cooperation with misuse. Overall, it is just as well-aligned as its predecessor, Claude Opus 4.5, which was our most-aligned frontier model to date. Opus 4.6 also shows the lowest rate of over-refusals—where the model fails to answer benign queries—of any recent Claude model.

The overall misaligned behavior score for each recent Claude model on our automated behavioral audit (described in full in the Claude Opus 4.6 system card).

For Claude Opus 4.6, we ran the most comprehensive set of safety evaluations of any model, applying many different tests for the first time and upgrading several that we've used before. We included new evaluations for user wellbeing, more complex tests of the model's ability to refuse potentially dangerous requests, and updated evaluations of the model's ability to surreptitiously perform harmful actions. We also experimented with new methods from interpretability, the science of the inner workings of AI models, to begin to understand why the model behaves in certain ways—and, ultimately, to catch problems that standard testing might miss.

A detailed description of all capability and safety evaluations is available in the Claude Opus 4.6 system card.

We've also applied new safeguards in areas where Opus 4.6 shows particular strengths that might be put to dangerous as well as beneficial uses. In particular, since the model shows enhanced cybersecurity abilities, we've developed six new cybersecurity probes—methods of detecting harmful responses—to help us track different forms of potential misuse.

We're also accelerating the cyberdefensive uses of the model, using it to help find and patch vulnerabilities in open-source software (as we describe in our new cybersecurity blog post). We think it's critical that cyberdefenders use AI models like Claude to help level the playing field.
Cybersecurity moves fast, and we'll be adjusting and updating our safeguards as we learn more about potential threats; in the near future, we may institute real-time intervention to block abuse.

We've made substantial updates across Claude, Claude Code, and the Claude Developer Platform to let Opus 4.6 perform at its best.

On the API, we're giving developers better control over model effort and more flexibility for long-running agents. To do so, we're introducing the following features:

* Adaptive thinking. Previously, developers only had a binary choice between enabling or disabling extended thinking. Now, with adaptive thinking, Claude can decide when deeper reasoning would be helpful. At the default effort level (high), the model uses extended thinking when useful, but developers can adjust the effort level to make it more or less selective.

* Effort. There are now four effort levels to choose from: low, medium, high (default), and max. We encourage developers to experiment with different options to find what works best.

* Context compaction (beta). Long-running conversations and agentic tasks often hit the context window. Context compaction automatically summarizes and replaces older context when the conversation approaches a configurable threshold, letting Claude perform longer tasks without hitting limits.

* 1M token context (beta). Opus 4.6 is our first Opus-class model with 1M token context. Premium pricing applies for prompts exceeding 200k tokens ($10/$37.50 per million input/output tokens).

* 128k output tokens. Opus 4.6 supports outputs of up to 128k tokens, which lets Claude complete larger-output tasks without breaking them into multiple requests.

* US-only inference. For workloads that need to run in the United States, US-only inference is available at 1.1× token pricing.

Across Claude and Claude Code, we've added features that allow knowledge workers and developers to tackle harder tasks with more of the tools they use every day.

We've introduced agent teams in Claude Code as a research preview. You can now spin up multiple agents that work in parallel as a team and coordinate autonomously—best for tasks that split into independent, read-heavy work like codebase reviews. You can take over any subagent directly using Shift+Up/Down or tmux.

Claude now also works better with the office tools you already use. Claude in Excel handles long-running and harder tasks with improved performance, and can plan before acting, ingest unstructured data and infer the right structure without guidance, and handle multi-step changes in one pass. Pair that with Claude in PowerPoint, and you can first process and structure your data in Excel, then bring it to life visually in PowerPoint. Claude reads your layouts, fonts, and slide masters to stay on brand, whether you're building from a template or generating a full deck from a description. Claude in PowerPoint is now available in research preview for Max, Team, and Enterprise plans.
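To make the API-facing changes above concrete, here is a minimal request sketch that selects the new model. The endpoint, headers, and model id come from the announcement; the "effort" field name and its values are assumptions based on the prose above rather than confirmed request syntax, so check the API documentation before relying on it.

# Sketch only: "effort" is an assumed field name for the new effort control;
# the endpoint, headers, and model id match the announcement.
curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-opus-4-6",
    "max_tokens": 2048,
    "effort": "medium",
    "messages": [{"role": "user", "content": "Review this financial model for errors."}]
  }'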

...

Read the original on www.anthropic.com »

2 642 shares, 37 trendiness

My AI Adoption Journey

My experience adopting any meaningful tool is that I've necessarily gone through three phases: (1) a period of inefficiency, (2) a period of adequacy, then finally (3) a period of workflow and life-altering discovery.

In most cases, I have to force myself through phases 1 and 2 because I usually have a workflow I'm already happy and comfortable with. Adopting a tool feels like work, and I do not want to put in the effort, but I usually do in an effort to be a well-rounded person of my craft.

This is my journey of how I found value in AI tooling and what I'm trying next with it. In an ocean of overly dramatic, hyped takes, I hope this represents a more nuanced, measured approach to my views on AI and how they've changed over time.

Immediately cease trying to perform meaningful work via a chatbot (e.g. ChatGPT, Gemini on the web, etc.). Chatbots have real value and are a daily part of my AI workflow, but their utility in coding is highly limited because you're mostly hoping they come up with the right results based on their prior training, and correcting them involves a human (you) telling them they're wrong repeatedly. It is inefficient.

I think everyone's first experience with AI is a chat interface. And I think everyone's first experience trying to code with AI has been asking a chat interface to write code.

While I was still a heavy AI skeptic, my first "oh wow" moment was pasting a screenshot of Zed's command palette into Gemini, asking it to reproduce it with SwiftUI, and being truly flabbergasted that it did it very well. The command palette that ships for macOS in Ghostty today is only very lightly modified from what Gemini produced for me in seconds.

But when I tried to reproduce that behavior for other tasks, I was left disappointed. In the context of brownfield projects, I found the chat interface produced poor results very often, and I found myself very frustrated copying and pasting code and command output to and from the interface. It was very obviously far less efficient than me doing the work myself.

To find value, you must use an agent. An agent is the industry-adopted term for an LLM that can chat and invoke external behavior in a loop1.

At a bare minimum, the agent must have the ability to: read files, execute programs, and make HTTP requests.
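For illustration only (this is a toy sketch, not the author's setup): an agent in that minimal sense can be as small as a shell loop that asks a model for one command at a time, runs it, and feeds the output back. It assumes curl, jq, and an ANTHROPIC_API_KEY, and it executes whatever the model proposes, so only run it in a disposable container.

#!/usr/bin/env bash
# Toy agent loop: the model proposes one shell command per turn, the harness
# runs it and feeds the output back. Run in a throwaway container only.
set -euo pipefail
task="List the five largest files here, then summarize what they appear to be."
messages=$(jq -n --arg t "$task" '[{role:"user", content:$t}]')

for turn in 1 2 3 4 5; do
  body=$(jq -n --argjson m "$messages" \
    '{model:"claude-opus-4-6", max_tokens:1024,
      system:"You are an agent. Reply with exactly one shell command to run next, or the single word DONE.",
      messages:$m}')
  reply=$(curl -s https://api.anthropic.com/v1/messages \
    -H "x-api-key: $ANTHROPIC_API_KEY" \
    -H "anthropic-version: 2023-06-01" \
    -H "content-type: application/json" \
    -d "$body" | jq -r '.content[0].text')
  [ "$reply" = "DONE" ] && break
  # Execute programs / read files, then truncate output to keep the context small.
  output=$(bash -c "$reply" 2>&1 | head -c 4000 || true)
  messages=$(jq -n --argjson m "$messages" --arg a "$reply" --arg o "$output" \
    '$m + [{role:"assistant", content:$a}, {role:"user", content:("Output:\n" + $o)}]')
done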

The next phase on my journey: I tried Claude Code. I'll cut to the chase: I initially wasn't impressed. I just wasn't getting good results out of my sessions. I felt I had to touch up everything it produced and this process was taking more time than if I had just done it myself. I read blog posts, watched videos, but just wasn't that impressed.

Instead of giving up, I forced myself to reproduce all my manual commits with agentic ones. I literally did the work twice. I'd do the work manually, and then I'd fight an agent to produce identical results in terms of quality and function (without it being able to see my manual solution, of course).

This was excruciating, because it got in the way of simply getting things done. But I've been around the block with non-AI tools enough to know that friction is natural, and I can't come to a firm, defensible conclusion without exhausting my efforts.

But, expertise formed. I quickly discovered for myself from first principles what others were already saying, but discovering it myself resulted in a stronger fundamental understanding.

* Break down sessions into separate clear, actionable tasks. Don't try to "draw the owl" in one mega session.

* For vague requests, split the work into separate planning vs. execution sessions.

* If you give an agent a way to verify its work, it more often than not fixes its own mistakes and prevents regressions.

More generally, I also found the edges of what agents — at the time — were good at, what they weren't good at, and for the tasks they were good at, how to achieve the results I wanted.

All of this led to significant efficiency gains, to the point where I was starting to naturally use agents in a way that I felt was no slower than doing it myself (but I still didn't feel it was any faster, since I was mostly babysitting an agent).

The negative space here is worth reiterating: part of the efficiency gains here were understanding when not to reach for an agent. Using an agent for something it'll likely fail at is obviously a big waste of time, and having the knowledge to avoid that completely leads to time savings2.

At this stage, I was finding enough value in agents that I was happy to use them in my workflow, but I still didn't feel like I was seeing any net efficiency gains. I didn't care though; I was content at this point with AI as a tool.

To try to find some efficiency, I next started up a new pattern: block out the last 30 minutes of every day to kick off one or more agents.

My hypothesis was that perhaps I could gain some efficiency if the agent can make some positive progress in the times I can't work anyways. Basically: instead of trying to do more in the time I have, try to do more in the time I don't have.

Similar to the previous task, I at first found this both unsuccessful and annoying. But, I once again quickly found different categories of work that were really helpful:

* Deep research sessions where I'd ask agents to survey some field, such as finding all libraries in a specific language with a specific license type and producing multi-page summaries for each on their pros, cons, development activity, social sentiment, etc.

* Parallel agents attempting different vague ideas I had but didn't have time to get started on. I didn't expect them to produce something I'd ever ship here, but perhaps they could illuminate some unknown unknowns when I got to the task the next day.

* Issue and PR triage/review. Agents are good at using gh (GitHub CLI), so I manually scripted a quick way to spin up a bunch in parallel to triage issues. I would NOT allow agents to respond; I just wanted reports the next day to try to guide me towards high value or low effort tasks.

To be clear, I did not go as far as others went to have agents running in loops all night. In most cases, agents completed their tasks in less than half an hour. But in the latter part of the working day, I'm usually tired and coming out of flow and find myself too personally inefficient, so shifting my effort to spinning up these agents gave me a "warm start" the next morning that got me working more quickly than I would've otherwise.

I was happy, and I was starting to feel like I was doing more than I was doing prior to AI, if only slightly.

By this point, I was getting very confident about what tasks my AI was and wasn't great at. I had really high confidence that, for certain tasks, the AI would achieve a mostly-correct solution. So the next step on my journey was: let agents do all of that work while I worked on other tasks.

More specifically, I would start each day by taking the results of my prior night's triage agents, filtering them manually to find the issues that an agent would almost certainly solve well, and then keeping those agents going in the background (one at a time, not in parallel).

Meanwhile, I'd work on something else. I wasn't going to social media (any more than usual without AI), I wasn't watching videos, etc. I was in my own, normal, pre-AI deep thinking mode working on something I wanted to work on or had to work on.

Very important at this stage: turn off agent desktop notifications.

Context switching is very expensive. In order to remain efficient, I found that it was my job as a human to be in control of when I interrupt the agent, not the other way around. Don't let the agent notify you. During natural breaks in your work, tab over and check on it, then carry on.

Importantly, I think the "work on something else" part helps counteract the highly publicized Anthropic skill formation paper. Well, you're trading off: not forming skills for the tasks you're delegating to the agent while continuing to form skills naturally in the tasks you continue to work on manually.

At this point I was firmly in the "no way I can go back" territory. I felt more efficient, but even if I wasn't, the thing I liked the most was that I could now focus my coding and thinking on tasks I really loved while still adequately completing the tasks I didn't.

At risk of stating the obvious: agents are much more efficient when they produce the right result the first time, or at worst produce a result that requires minimal touch-ups. The most sure-fire way to achieve this is to give the agent fast, high quality tools to automatically tell it when it is wrong.

I don't know if there is a broad industry-accepted term for this yet, but I've grown to calling this "harness engineering." It is the idea that anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again. I don't need to invent any new terms here; if another one exists, I'll jump on the bandwagon.

This comes in two forms:

Better implicit prompting (AGENTS.md). For simple things, like the agent repeatedly running the wrong commands or finding the wrong APIs, update the AGENTS.md (or equivalent). Here is an example from Ghostty. Each line in that file is based on a bad agent behavior, and it almost completely resolved them all.

Actual, programmed tools. For example, scripts to take screenshots, run filtered tests, etc. This is usually paired with an AGENTS.md change to let the agent know these tools exist.
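As a sketch of the second form (nothing here is from the Ghostty repository; the make target, script path, and log location are hypothetical), a programmed tool can be a tiny wrapper that runs a filtered slice of the test suite, keeps its output terse, and logs details to a file the agent can grep later:

#!/usr/bin/env bash
# scripts/agent-test.sh — hypothetical helper an agent can call instead of
# guessing test commands. Keeps stdout short; full logs go to a file.
set -euo pipefail
pattern="${1:?usage: scripts/agent-test.sh <test-name-pattern>}"
if make test TEST_FILTER="$pattern" >/tmp/agent-test.log 2>&1; then
  echo "PASS: $pattern"
else
  echo "FAIL: $pattern (details: /tmp/agent-test.log)"
  exit 1
fi

The matching AGENTS.md entry is then a single line telling the agent to call scripts/agent-test.sh <pattern> rather than running the full suite directly.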

This is where I’m at to­day. I’m mak­ing an earnest ef­fort when­ever I see an agent do a Bad Thing to pre­vent it from ever do­ing that bad thing again. Or, con­versely, I’m mak­ing an earnest ef­fort for agents to be able to ver­ify they’re do­ing a Good Thing.

Simultaneous to step 5, I’m also op­er­at­ing un­der the goal of

hav­ing an agent run­ning at all times. If an agent is­n’t run­ning, I ask my­self is there some­thing an agent could be do­ing for me right now?”

I par­tic­u­larly like to com­bine this with slower, more thought­ful mod­els like Amp’s deep mode (which is ba­si­cally just GPT-5.2-Codex) which can take up­wards of 30+ min­utes to make small changes. The flip side of that is that it does tend to pro­duce very good re­sults.

I’m not [yet?] run­ning mul­ti­ple agents, and cur­rently don’t re­ally want to.

I find hav­ing the one agent run­ning is a good bal­ance for me right now be­tween be­ing able to do deep, man­ual work I find en­joy­able, and babysit­ting my kind of stu­pid and yet mys­te­ri­ously pro­duc­tive ro­bot friend.

The have an agent run­ning at all times” goal is still just a goal. I’d say right now I’m maybe ef­fec­tive at hav­ing a back­ground agent run­ning 10 to 20% of a nor­mal work­ing day. But, I’m ac­tively work­ing to im­prove that.

And that’s where I’m at to­day.

Through this jour­ney, I’ve per­son­ally reached a point where I’m hav­ing suc­cess with mod­ern AI tool­ing and I be­lieve I’m ap­proach­ing it with the proper mea­sured view that is grounded in re­al­ity. I re­ally don’t care one way or the other if AI is here to stay3, I’m a soft­ware crafts­man that just wants to build stuff for the love of the game.

The whole land­scape is mov­ing so rapidly that I’m sure I’ll look back at this post very quickly and laugh at my naivete. But, as they say, if you can’t be em­barassed about your past self, you’re prob­a­bly not grow­ing. I just hope I’ll grow in the right di­rec­tion!

I have no skin in the game here4, and there are of course other rea­sons be­hind util­ity to avoid us­ing AI. I fully re­spect any­one’s in­di­vid­ual de­ci­sions re­gard­ing it. I’m not here to con­vince you! For those in­ter­ested, I just wanted to share my per­sonal ap­proach to nav­i­gat­ing these new tools and give a glimpse about how I ap­proach new tools

in gen­eral, re­gard­less of AI.

...

Read the original on mitchellh.com »

3 618 shares, 33 trendiness

- YouTube

...

Read the original on www.youtube.com »

4 565 shares, 29 trendiness

Building a C compiler with a team of parallel Claudes

Written by Nicholas Carlini, a researcher on our Safeguards team.

I’ve been ex­per­i­ment­ing with a new ap­proach to su­per­vis­ing lan­guage mod­els that we’re call­ing agent teams.” With agent teams, mul­ti­ple Claude in­stances work in par­al­lel on a shared code­base with­out ac­tive hu­man in­ter­ven­tion. This ap­proach dra­mat­i­cally ex­pands the scope of what’s achiev­able with LLM agents. To stress test it, I tasked 16 agents with writ­ing a Rust-based C com­piler, from scratch, ca­pa­ble of com­pil­ing the Linux ker­nel. Over nearly 2,000 Claude Code ses­sions and $20,000 in API costs, the agent team pro­duced a 100,000-line com­piler that can build Linux 6.9 on x86, ARM, and RISC-V. The com­piler is an in­ter­est­ing ar­ti­fact on its own, but I fo­cus here on what I learned about de­sign­ing har­nesses for long-run­ning au­tonomous agent teams: how to write tests that keep agents on track with­out hu­man over­sight, how to struc­ture work so mul­ti­ple agents can make progress in par­al­lel, and where this ap­proach hits its ceil­ing.Ex­ist­ing agent scaf­folds like Claude Code re­quire an op­er­a­tor to be on­line and avail­able to work jointly. If you ask for a so­lu­tion to a long and com­plex prob­lem, the model may solve part of it, but even­tu­ally it will stop and wait for con­tin­ued in­put—a ques­tion, a sta­tus up­date, or a re­quest for clar­i­fi­ca­tion.To elicit sus­tained, au­tonomous progress, I built a har­ness that sticks Claude in a sim­ple loop (if you’ve seen Ralph-loop, this should look fa­mil­iar). When it fin­ishes one task, it im­me­di­ately picks up the next. (Run this in a con­tainer, not your ac­tual ma­chine).

In the agent prompt, I tell Claude what problem to solve and ask it to approach the problem by breaking it into small pieces, tracking what it's working on, figuring out what to work on next, and to effectively keep going until it's perfect. (On this last point, Claude has no choice. The loop runs forever—although in one instance, I did see Claude pkill -9 bash on accident, thus killing itself and ending the loop. Whoops!)

Running multiple instances in parallel can address two weaknesses of a single-agent harness:

* One Claude Code session can only do one thing at a time. Especially as the scope of a project expands, debugging multiple issues in parallel is far more efficient.

* Running multiple Claude agents allows for specialization. While a few agents are tasked to solve the actual problem at hand, other specialized agents can be invoked to (for example) maintain documentation, keep an eye on code quality, or solve specialized sub-tasks.

My implementation of parallel Claude is bare-bones. A new bare git repo is created, and for each agent, a Docker container is spun up with the repo mounted to /upstream. Each agent clones a local copy to /workspace, and when it's done, pushes from its own local container to upstream.

To prevent two agents from trying to solve the same problem at the same time, the harness uses a simple synchronization algorithm:

* Claude takes a "lock" on a task by writing a text file to current_tasks/ (e.g., one agent might lock current_tasks/parse_if_statement.txt, while another locks current_tasks/codegen_function_definition.txt). If two agents try to claim the same task, git's synchronization forces the second agent to pick a different one.

* Claude works on the task, then pulls from upstream, merges changes from other agents, pushes its changes, and removes the lock. Merge conflicts are frequent, but Claude is smart enough to figure that out.

* The infinite agent-generation-loop spawns a new Claude Code session in a fresh container, and the cycle repeats.
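A rough shell reconstruction of that lock-taking step (the current_tasks/ layout and /workspace clone come from the post; the branch name and exact git commands are my assumptions):

task="parse_if_statement"            # example task name from the post
cd /workspace
git pull --rebase origin main        # assumes the clone's default branch is "main"
if [ -e "current_tasks/$task.txt" ]; then
  echo "lock already taken; pick another task"; exit 1
fi
date -u +%FT%TZ > "current_tasks/$task.txt"
git add "current_tasks/$task.txt"
git commit -m "lock: $task"
# If upstream moved first (e.g., another agent claimed this task), the push is
# rejected; re-sync and pick a different task.
git push origin main || { echo "lost the race; resync and choose another task"; exit 1; }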
This is a very early research prototype. I haven't yet implemented any other method for communication between agents, nor do I enforce any process for managing high-level goals. I don't use an orchestration agent. Instead, I leave it up to each Claude agent to decide how to act. In most cases, Claude picks up the "next most obvious" problem. When stuck on a bug, Claude will often maintain a running doc of failed approaches and remaining tasks. In the git repository of the project, you can read through the history and watch it take out locks on various tasks.

The scaffolding runs Claude in a loop, but that loop is only useful if Claude can tell how to make progress. Most of my effort went into designing the environment around Claude—the tests, the environment, the feedback—so that it could orient itself without me. These are the approaches I've found most helpful when orchestrating multiple Claude instances.

Claude will work autonomously to solve whatever problem I give it. So it's important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem. Improving the testing harness required finding high-quality compiler test suites, writing verifiers and build scripts for open-source software packages, and watching for mistakes Claude was making, then designing new tests as I identified those failure modes.

For example, near the end of the project, Claude started to frequently break existing functionality each time it implemented a new feature. To address this, I built a continuous integration pipeline and implemented stricter enforcement that allowed Claude to better test its work so that new commits can't break existing code.

I had to constantly remind myself that I was writing this test harness for Claude and not for myself, which meant rethinking many of my assumptions about how tests should communicate results.

For example, each agent is dropped into a fresh container with no context and will spend significant time orienting itself, especially on large projects. Before we even reach the tests, to help Claude help itself, I included instructions to maintain extensive READMEs and progress files that should be updated frequently with the current status.

I also kept in mind the fact that language models have inherent limitations, which, in this case, needed to be designed around. These include:

* Context window pollution: The test harness should not print thousands of useless bytes. At most, it should print a few lines of output and log all important information to a file so Claude can find it when needed. Logfiles should be easy to process automatically: if there are errors, Claude should write ERROR and put the reason on the same line so grep will find it. It helps to pre-compute aggregate summary statistics so Claude doesn't have to recompute them.

* Time blindness: Claude can't tell time and, left alone, will happily spend hours running tests instead of making progress. The harness prints incremental progress infrequently (to avoid polluting context) and includes a default --fast option that runs a 1% or 10% random sample. This subsample is deterministic per-agent but random across VMs, so Claude still covers all files but each agent can perfectly identify regressions.

When there are many distinct failing tests, parallelization is trivial: each agent picks a different failing test to work on. After the test suite reached a 99% pass rate, each agent worked on getting a different small open-source project (e.g., SQLite, Redis, libjpeg, MQuickJS, Lua) to compile.

But when agents started to compile the Linux kernel, they got stuck. Unlike a test suite with hundreds of independent tests, compiling the Linux kernel is one giant task. Every agent would hit the same bug, fix that bug, and then overwrite each other's changes. Having 16 agents running didn't help because each was stuck solving the same task.

The fix was to use GCC as an online known-good compiler oracle to compare against. I wrote a new test harness that randomly compiled most of the kernel using GCC, and only the remaining files with Claude's C compiler. If the kernel worked, then the problem wasn't in Claude's subset of the files. If it broke, then it could further refine by re-compiling some of these files with GCC. This let each agent work in parallel, fixing different bugs in different files, until Claude's compiler could eventually compile all files.
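A simplified sketch of that oracle idea, testing one suspect file at a time rather than random subsets; the cc-wrapper.sh script and file list are hypothetical stand-ins for the real harness:

# cc-wrapper.sh is assumed to compile the file named in $CLAUDE_CC_FILE with
# Claude's compiler and everything else with gcc.
while read -r file; do
  if CLAUDE_CC_FILE="$file" make -C linux-6.9 CC="$PWD/cc-wrapper.sh" -j"$(nproc)" >/dev/null 2>&1; then
    echo "OK   $file"
  else
    echo "BAD  $file"    # this file (or its interaction with the rest) trips Claude's compiler
  fi
done < suspect_files.txt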
(After this worked, it was still necessary to apply delta debugging techniques to find pairs of files that failed together but worked independently.)

Parallelism also enables specialization. LLM-written code frequently re-implements existing functionality, so I tasked one agent with coalescing any duplicate code it found. I put another in charge of improving the performance of the compiler itself, and a third I made responsible for outputting efficient compiled code. I asked another agent to critique the design of the project from the perspective of a Rust developer, and make structural changes to the project to improve the overall code quality, and another to work on documentation.

This project was designed as a capability benchmark. I am interested in stress-testing the limits of what LLMs can just barely achieve today in order to help us prepare for what models will reliably achieve in the future.

I've been using the C compiler project as a benchmark across the entire Claude 4 model series. As I did with prior projects, I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes) I did not go into any detail on how to do so.

Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects. My goal with Opus 4.6 was to again test the limits.

Over nearly 2,000 Claude Code sessions across two weeks, Opus 4.6 consumed 2 billion input tokens and generated 140 million output tokens, a total cost just under $20,000. Compared to even the most expensive Claude Max plans, this was an extremely expensive project. But that total is a fraction of what it would cost me to produce this myself—let alone an entire team.

This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build a bootable Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQLite, postgres, redis, and has a 99% pass rate on most compiler test suites including the GCC torture test suite. It also passes the developer's ultimate litmus test: it can compile and run Doom.

The compiler, however, is not without limitations. These include:

* It lacks the 16-bit x86 compiler that is necessary to boot Linux out of real mode. For this, it calls out to GCC (the x86_32 and x86_64 compilers are its own).

* It does not have its own assembler and linker; these are the very last bits that Claude started automating and are still somewhat buggy. The demo video was produced with a GCC assembler and linker.

* The compiler successfully builds many projects, but not all. It's not yet a drop-in replacement for a real compiler.

* The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.

* The Rust code quality is reasonable, but is nowhere near the quality of what an expert Rust programmer might produce.

The resulting compiler has nearly reached the limits of Opus's abilities. I tried (hard!) to fix several of the above limitations but wasn't fully successful. New features and bugfixes frequently broke existing functionality.

As one particularly challenging example, Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase (this is only the case for x86; for ARM or RISC-V, Claude's compiler can compile completely by itself).

The source code for the compiler is available. Download it, read through the code, and try it on your favorite C projects. I've consistently found the best way to understand what language models can do is to push them to their limits, and then study where they start to break down. Over the coming days, I'll continue having Claude push new changes if you want to follow along with Claude's continued attempts at addressing these limitations.

Each generation of language models opens up new ways of working with them. Early models were useful for tab-completion in IDEs. Before long, models could complete a function body from its docstring. The launch of Claude Code brought agents into the mainstream and enabled developers to pair-program with Claude. But each of these products operates under the assumption that a user defines a task, an LLM runs for a few seconds or minutes and returns an answer, and then the user provides a follow-up.

Agent teams show the possibility of implementing entire, complex projects autonomously. This allows us, as users of these tools, to become more ambitious with our goals.

We are still early, and fully autonomous development comes with real risks. When a human sits with Claude during development, they can ensure consistent quality and catch errors in real time. For autonomous systems, it is easy to see tests pass and assume the job is done, when this is rarely the case. I used to work in penetration testing, exploiting vulnerabilities in products produced by large companies, and the thought of programmers deploying software they've never personally verified is a real concern.

So, while this experiment excites me, it also leaves me feeling uneasy. Building this compiler has been some of the most fun I've had recently, but I did not expect this to be anywhere near possible so early in 2026. The rapid progress in both language models and the scaffolds we use to interact with them opens the door to writing an enormous amount of new code. I expect the positive applications to outweigh the negative, but we're entering a new world which will require new strategies to navigate safely.

Special thanks to Josef Bacik, Edwin Chen, Bernardo Meurer Costa, Jake Eaton, Dan Kelley, Felix Klock, Jannet Park, Steve Weis, and many other people across Anthropic for their assistance and contributions.


...

Read the original on www.anthropic.com »

5 486 shares, 32 trendiness

It’s 2026, Just Use Postgres

Think of your database like your home. Your home has a living room, bedroom, bathroom, kitchen, and garage. Each room serves a different purpose. But they're all under the same roof, connected by hallways and doors. You don't build a separate restaurant building just because you need to cook. You don't construct a commercial garage across town just to park your car.

That's what Postgres is. One home with many rooms. Search, vectors, time-series, queues—all under one roof.

But this is exactly what specialized database vendors don't want you to hear. Their marketing teams have spent years convincing you to "use the right tool for the right job." It sounds reasonable. It sounds wise. And it sells a lot of databases.

Let me show you why it's a trap and why Postgres is the better choice in 99% of cases.

You've heard the advice: "Use the right tool for the right job."

Sounds wise. So you end up with:

Congratulations. You now have seven databases to manage. Seven query languages to learn. Seven backup strategies to maintain. Seven security models to audit. Seven sets of credentials to rotate. Seven monitoring dashboards to watch. And seven things that can break at 3 AM.

And when something does break? Good luck spinning up a test environment to debug it.

Here’s a dif­fer­ent idea: Just use Postgres.

This is­n’t just about sim­plic­ity. AI agents have made data­base sprawl a night­mare.

Think about what agents need to do:

With one data­base? That’s a sin­gle com­mand. Fork it, test it, done.

With seven data­bases? Now you need to:

* Make sure they’re all at the same point in time

* Spin up seven dif­fer­ent ser­vices

* Tear down seven ser­vices when you’re done

This is vir­tu­ally im­pos­si­ble with­out a ton of R&D.

And it’s not just agents. Every time some­thing breaks at 3 AM, you need to spin up a test en­vi­ron­ment to de­bug. With six data­bases, that’s a co­or­di­na­tion night­mare. With one data­base, it’s a sin­gle com­mand.

In the AI era, sim­plic­ity is­n’t just el­e­gant. It’s es­sen­tial.

The myth: Specialized databases are far superior at their specific tasks.

The reality: Sometimes they're marginally better at a narrow task. But they also bring unnecessary complexity. It's like hiring a private chef for every meal. Sounds luxurious, but it adds expense, coordination overhead, and creates problems you didn't have before.

Here's the thing: 99% of companies don't need them. The top 1% have tens of millions of users and a large engineering team to match. You've read their blog posts about how amazing Specialized Database X works for them. But that's their scale, their team, their problems. For everyone else, Postgres is more than enough.

Here's what most people don't realize: Postgres extensions use the same or better algorithms as specialized databases (in many cases).

These aren't watered-down versions. They're the same/better algorithms, battle-tested, open source, and often developed by the same researchers.

* pgvectorscale: 28x lower latency than Pinecone at 75% less cost

* pg_textsearch: The exact same BM25 ranking that powers Elasticsearch

Beyond the AI/agent problem, database sprawl has compounding costs:

Cognitive load: Your team needs SQL, Redis commands, Elasticsearch Query DSL, MongoDB aggregation, Kafka patterns, and InfluxDB's non-native SQL workaround. That's not specialization. That's fragmentation.

Data consistency: Keeping Elasticsearch in sync with Postgres? You build sync jobs. They fail. Data drifts. You add reconciliation. That fails too. Now you're maintaining infrastructure instead of building features.

SLA math: Three systems at 99.9% uptime each = 99.7% combined. That's 26 hours of downtime per year instead of 8.7. Every system multiplies your failure modes.

These extensions aren't new. They've been production-ready for years:

* JSONB: Since 2014 (11 years). As fast as MongoDB, with ACID.

Over 48,000 companies use PostgreSQL, including Netflix, Spotify, Uber, Reddit, Instagram, and Discord.

What this means: Building a RAG app used to require Postgres + Pinecone + Elasticsearch + glue code.

Now? Just Postgres. One database. One query language. One backup. One fork command for your AI agent to spin up a test environment.

Here's all you need:

Below are working examples for each use case. Skip to what you need.

What you get: The exact same BM25 algorithm that powers Elasticsearch, directly in Postgres.

This is what Elasticsearch requires a separate plugin for. In Postgres, it's just SQL.
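As a stand-in sketch (the post's own SQL isn't reproduced here, and pg_textsearch's BM25 syntax may differ), here is keyword search with Postgres's built-in full-text ranking against a hypothetical table:

psql "$DATABASE_URL" <<'SQL'
-- Hypothetical articles table; built-in tsvector ranking as a stand-in for BM25.
CREATE TABLE IF NOT EXISTS articles (id serial PRIMARY KEY, title text, body text);
INSERT INTO articles (title, body) VALUES
  ('Postgres FTS', 'Full-text search ships with Postgres and needs no extra cluster.');

SELECT id, title,
       ts_rank(to_tsvector('english', body), plainto_tsquery('english', 'full-text search')) AS score
FROM articles
WHERE to_tsvector('english', body) @@ plainto_tsquery('english', 'full-text search')
ORDER BY score DESC
LIMIT 10;
SQL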

What you get: pgvectorscale uses the DiskANN algorithm (from Microsoft Research), achieving 28x lower p95 latency and 16x higher throughput than Pinecone at 99% recall.

Now every INSERT/UPDATE automatically regenerates embeddings. No sync jobs. No drift. No 3 AM pages.

* Prometheus: Great for metrics, not your application data

What you get: Automatic time partitioning, compression up to 90%, continuous aggregates. Full SQL.
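A minimal sketch of that setup, assuming the timescaledb extension is installed; the table and columns are made up:

psql "$DATABASE_URL" <<'SQL'
CREATE EXTENSION IF NOT EXISTS timescaledb;
CREATE TABLE IF NOT EXISTS metrics (
  time        timestamptz NOT NULL,
  device_id   text        NOT NULL,
  temperature double precision
);
-- Turns the plain table into an automatically time-partitioned hypertable.
SELECT create_hypertable('metrics', 'time', if_not_exists => TRUE);
SQL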

For AI applications, you often need both keyword search and semantic search:

Try that with Elasticsearch + Pinecone. You'd need two API calls, result merging, failure handling, and double latency.

In Postgres: one query, one transaction, one result.
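Here is one way that single query can look: a sketch assuming the pgvector extension, a hypothetical table, and a toy three-dimensional embedding column (real embeddings are much wider), with built-in tsvector ranking standing in for BM25:

psql "$DATABASE_URL" <<'SQL'
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS docs (
  id        serial PRIMARY KEY,
  body      text,
  embedding vector(3)          -- toy dimension for the sketch
);
INSERT INTO docs (body, embedding) VALUES
  ('Postgres can do keyword and vector search in one query.', '[0.1, 0.9, 0.2]');

-- One transaction, one result set: keyword relevance plus vector similarity.
SELECT id, body,
       ts_rank(to_tsvector('english', body), plainto_tsquery('english', 'vector search')) AS keyword_score,
       1 - (embedding <=> '[0.1, 0.8, 0.3]') AS semantic_score
FROM docs
WHERE to_tsvector('english', body) @@ plainto_tsquery('english', 'vector search')
ORDER BY semantic_score DESC, keyword_score DESC
LIMIT 10;
SQL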

Remember the home analogy? You don't build a separate restaurant just to cook dinner. You don't construct a commercial garage across town just to park your car. You use the rooms in your home.

That's what we've shown you here. Search, vectors, time-series, documents, queues, caching—they're all rooms in the Postgres home. Same algorithms as the specialized databases. Battle-tested for years. Used by Netflix, Uber, Discord, and 48,000 other companies.

So what about that 99%?

For 99% of companies, Postgres handles everything you need. The 1%? That's when you're processing petabytes of logs across hundreds of nodes, or you need Kibana's specific dashboards, or you have exotic requirements that genuinely exceed what Postgres can do.

But here's the thing: you'll know when you're in the 1%. You won't need a vendor's marketing team to tell you. You'll have benchmarked it yourself and hit a real wall.

Until then, don't scatter your data across seven buildings because someone told you to "use the right tool for the right job." That advice sells databases. It doesn't serve you.

Start with Postgres. Stay with Postgres. Add complexity only when you've earned the need for it.

In 2026, just use Postgres.

All these extensions are available on Tiger Data. Create a free database in minutes:

No need for specialized databases, just use Postgres.

...

Read the original on www.tigerdata.com »

6 438 shares, 23 trendiness

mdp/linkedin-extension-fingerprinting

LinkedIn silently probes for 2,953 Chrome extensions on every page load.

This repository documents every extension LinkedIn checks for and provides tools to identify them.

The complete list of extensions with names and Chrome Web Store links:

Fetches extension names from Chrome Web Store with Extpose fallback for removed/unavailable extensions.

# Fetch all extensions
node fetch_extension_names.js

# Fetch a subset (useful if rate limited)
node fetch_extension_names.js --offset 0 --limit 500
node fetch_extension_names.js -o 500 -l 500

# Show help
node fetch_extension_names.js --help

Test script that processes the first 3 extensions with verbose output.

node test_fetch.js

* ~22% found via Extpose fallback (removed or unavailable on Chrome Web Store)

...

Read the original on github.com »

7 389 shares, 25 trendiness

Recreating uncensored Epstein PDFs from raw encoded attachments

There have been a lot of complaints about both the competency and the logic behind the latest Epstein archive release by the DoJ: from censoring the names of co-conspirators to censoring pictures of random women in a way that makes individuals look guiltier than they really are, forgetting to redact credentials that made it possible for all of Reddit to log into Epstein's account and trample over all the evidence, and the complete ineptitude that resulted in most of the latest batch being corrupted thanks to incorrectly converted Quoted-Printable encoding artifacts, it's safe to say that Pam Bondi's DoJ did not put its best and brightest on this (admittedly gargantuan) undertaking. But the most damning evidence has all been thoroughly redacted… hasn't it? Well, maybe not.

I was thinking of writing an article on the mangled quoted-printable encoding the day this latest dump came out in response to all the misinformed musings and conjectures that were littering social media (and my dilly-dallying cost me, as someone beat me to the punch), and spent some time searching through the latest archives looking for some SMTP headers that I could use in the article when I came across a curious artifact: not only were the emails badly transcoded into plain text, but also some binary attachments were actually included in the dumps in their over-the-wire Content-Transfer-Encoding: base64 format, and the unlucky intern that was assigned to the documents in question didn't realize the significance of what they were looking at and didn't see the point in censoring seemingly meaningless page after page of hex content!

Just take a look at EFTA00400459, an email from correspondence between (presumably) one of Epstein's assistants and Epstein lackey/co-conspirator Boris Nikolic and his friend, Sam Jaradeh, inviting them to a ████████ benefit:

Those hex characters go on for 76 pages, and represent the file DBC12 One Page Invite with Reply.pdf encoded as base64 so that it can be included in the email without breaking the SMTP protocol. And converting it back to the original PDF is, theoretically, as easy as copy-and-pasting those 76 pages into a text editor, stripping the leading > bytes, and piping all that into base64 -d > output.pdf… or it would be, if we had the original (badly converted) email and not a partially redacted scan of a printout of said email with some shoddy OCR applied.

If you tried to actually copy that text as digitized by the DoJ from the PDF into a text editor, here's what you'd see:

You can ignore the EFTA00400459 on the second line; that (or some variant thereof) will be interspersed into the base64 text since it's stamped at the bottom of every page to identify the piece of evidence it came from. But what else do you notice? Here's a hint: this is what proper base64 looks like:

Notice how in this sample everything lines up perfectly (when using a monospaced font) at the right margin? And how that's not the case when we copied-and-pasted from the OCR'd PDF? That's because it wasn't a great OCR job: extra characters have been hallucinated into the output, some of them not even legal base64 characters such as the , and [, while other characters have been omitted altogether, giving us content we can't use:1

> pbpaste \
    | string match -rv EFTA \
    | string trim -c '>' \
    | string join '' \
    | base64 -d >/dev/null
base64: invalid input
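An aside that is not in the original write-up: since MIME-wrapped base64 lines share a fixed width (76 characters here), one quick way to locate the OCR damage is to flag any cleaned line whose length strays from that norm (the final line of a block is legitimately shorter). Assuming the pasted text, minus the EFTA stamps, is in a hypothetical cleaned.txt:

# Report every line whose length differs from the expected 76 characters.
awk '{ sub(/^> ?/, ""); if (length($0) != 76) printf "line %d: %d chars\n", NR, length($0) }' cleaned.txt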

I tried the eas­i­est al­ter­na­tive I had at hand: I loaded up the PDF in Adobe Acrobat Pro and re-ran an OCR process on the doc­u­ment, but came up with even worse re­sults, with spaces in­jected in the mid­dle of the base64 con­tent (easily fix­able) in ad­di­tion to other char­ac­ters be­ing com­pletely mis­read and butchered — it re­ally did­n’t like the cramped mono­space text at all. So I thought to do it man­u­ally with tesser­act, which, while very far from state-of-the-art, can still be use­ful be­cause it lets you do things like limit its out­put to a cer­tain sub­set of char­ac­ters, con­strain­ing the field of valid re­sults and hope­fully co­erc­ing it into pro­duc­ing bet­ter re­sults.

Only one problem: tesseract can't read PDF input (or not by default, anyway). No problem, I'll just use imagemagick/ghostscript to convert the PDF into individual PNG images (to avoid further generational loss) and provide those to tesseract, right? But that didn't quite work out: it seems (?) to try to load and convert all 76 pages at once, then naturally crashes on the too-large input, but only after taking forever and generating 76 invalid output files that you're forced to subsequently clean up, of course:

> convert -density 300 EFTA00400459.pdf \
  -background white -alpha remove \
  -alpha off out.png
convert-im6.q16: cache resources exhausted `/tmp/magick-QqXVSOZutVsiRcs7pLwwG2FYQnTsoAmX47' @ error/cache.c/OpenPixelCache/4119.
convert-im6.q16: cache resources exhausted `out.png' @ error/cache.c/OpenPixelCache/4119.
convert-im6.q16: No IDATs written into file `out-0.png' @ error/png.c/MagickPNGErrorHandler/1643.

So we turn to pdftoppm from the pop­pler-utils pack­age in­stead, which does in­deed han­dle each page of the source PDF sep­a­rately and turned out to be up to the task, though in­cred­i­bly slow:

> pdftoppm -png -r 300 EFTA00400459.pdf out.png

After wait­ing the req­ui­site amount of time (and then some), I had files out-01.png through out-76.png, and was ready to try them with tesser­act:

for n in (printf "%02d\n" (seq 1 76))
    tesseract out-$n.png output-$n \
        --psm 6 \
        -c tessedit_char_whitelist='>'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/= \
        -c load_system_dawg=0 \
        -c load_freq_dawg=0
end

The above fish-shell command instructs tesseract(1) to assume the input is a single block of text (the --psm 6 argument) and to recognize only legal base64 characters (plus the leading > so we can properly strip it out afterwards). My original attempt included a literal space in the character whitelist, but that gave me worse results: the very badly kerned base64 has significant apparent spacing between some adjacent characters (more on this later), which caused tesseract both to incorrectly inject spaces (bad but fixable) and, possibly, to mishandle the character after the space (worse).

Unfortunately, while tesseract gave me slightly better output than either the original OCR'd DoJ text or the (terrible) Adobe Acrobat Pro results, it too suffered from poor recognition and gave me very inconsistent line lengths… but it also suffered from something I didn't expect from a heuristic-based, algorithm-driven tool like tesseract, something more reminiscent of how first-generation LLMs would behave: in a few places, it would read only the first dozen or so characters on a line, leave the rest of the line blank, then pick up (correctly enough) at the start of the next line. Before I saw how generally useless the OCR results were and gave up on tesseract, I figured I'd just manually type out the rest of each such line (the aborted lines were easy enough to find, thanks to the monospaced output), and that was when I ran into the real issue that took this from an interesting challenge to something approaching mission impossible.

I men­tioned ear­lier the bad kern­ing, which tricked the OCR tools into in­ject­ing spaces where there were sup­posed to be none, but that was far from be­ing the worst is­sue plagu­ing the PDF con­tent. The real prob­lem is that the text is ren­dered in pos­si­bly the worst type­face for the job at hand: Courier New.

If you're a font enthusiast, I certainly don't need to say any more; you're probably already shaking with a mix of PTSD and rage. But for the benefit of everyone else, let's just say that Courier New is… not a great font. It was a digitization of the venerable (though certainly primitive) Courier typeface, commissioned by IBM in the 1950s. Courier was used (with some tweaks) on IBM typewriters, including the IBM Selectric, and in the 1990s it was "digitized directly from the golf ball of the IBM Selectric" by Monotype and shipped with Windows 3.1, where it remained the default monospace font on Windows until Consolas shipped with Windows Vista. Among the many issues with Courier New is that it was digitized from the Selectric golf ball "without accounting for the visual weight normally added by the typewriter's ink ribbon", which gives it its characteristic "thin" look. Microsoft ClearType, which was only enabled by default with Windows Vista, addressed this major shortcoming to some extent, but Courier New has always struggled with general readability… and, more importantly, with poor distinction between characters.

While not as bad as some typewriter-era typefaces that actually reused the same symbol for 1 (one) and l (ell), Courier New came pretty close. Here is a comparison of the two faces rendering these two characters, considerably enlarged:

The combination of the two faults (the anemic weights and the even poorer distinction between 1 and l compared to Courier) makes Courier New a terrible choice as a programming font. But as a font used for base64 output you want to OCR? You really couldn't pick a worse option! To add fuel to the fire, the comparison above uses SVG outlines of the fonts, meticulously converted and preserving all the fine details, whereas in the Epstein PDFs released by the DoJ we only have low-quality JPEG scans at a fairly small point size. Here's an actual (losslessly encoded) screenshot of the DoJ text at 100%; I challenge you to tell me which is a 1 and which is an l in the excerpt below:

It's not that there isn't any difference between the two, because there is. And sometimes you get a clear gut feeling about which is which: I was midway through manually typing out one line of base64 text when I got stuck on identifying a one vs an ell… only to realize that I had confidently transcribed one of them earlier on that same line without even pausing to think about which it was. Here's a zoomed-in view of the scanned PDF: you can clearly see all the JPEG DCT artifacts, the color fringing, and the smearing of character shapes, all of which make it hard to properly identify the characters. But at the same time, at least in this particular sample, you can see which of the highlighted characters have a straight serif leading out of the top-left (the middle one, presumably an ell) and which have the slightest of strokes/feet extending from them (the first and last, presumably ones). Whether that's how the original glyphs appeared, or just an artifact of how the image was compressed, is tough to say:

But that's getting ahead of myself: at this point, none of the OCR tools had actually given me usable results, even ignoring the very important question of l vs 1. After having been let down by one open source offering (tesseract) and two commercial ones (Adobe Acrobat Pro and, presumably, whatever the DoJ used), I made the very questionable choice of writing a script against yet another commercial offering, this time Amazon/AWS Textract, to process the PDF. Unfortunately, using it directly via the first-party tooling was somewhat of a no-go, as it only supports smaller/shorter inputs for direct use; longer PDFs like this one need to be uploaded to S3 and run through the async workflow, starting the recognition job and then polling for completion.

Amazon Textract did possibly the best of all the tools I tried, but its output still had obvious line length discrepancies, albeit only one or two characters off on average. I decided to try again, this time blowing up the input 2x (using nearest-neighbor sampling to preserve sharp edges) as a workaround for Textract not having a tunable to configure the DPI the document is processed at, though I worried that the inputs might simply be prescaled to a fixed size prior to processing anyway:2

> for n in (printf "%02d\n" (seq 01 76))
      convert EFTA00400459-$n.png -scale 200% \
          EFTA00400459-$n"_2x".png; or break
  end
> parallel -j 16 ./textract.sh {} ::: EFTA00400459-*_2x.png
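The textract.sh helper itself isn't shown above; a minimal sketch of what such a per-page script might look like (an assumption on my part, written with boto3 and the synchronous DetectDocumentText call, which accepts PNG bytes directly; the full-PDF route mentioned earlier would instead go through S3 and the async start/poll APIs) is:

# textract_page.py -- hypothetical stand-in for textract.sh (rough sketch)
import sys
import boto3

def ocr_png(path):
    client = boto3.client("textract")
    with open(path, "rb") as f:
        resp = client.detect_document_text(Document={"Bytes": f.read()})
    # Keep only LINE blocks, in the order Textract returns them
    return "\n".join(b["Text"] for b in resp["Blocks"] if b["BlockType"] == "LINE")

if __name__ == "__main__":
    png = sys.argv[1]
    with open(png.rsplit(".", 1)[0] + ".txt", "w") as out:
        out.write(ocr_png(png) + "\n")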

These results were notably better, and I've included them in an archive, but some pages scanned better than others. Textract doesn't seem to be 100% deterministic in my brief experience with it, and its features page makes vague mentions of ML, though it's not obvious when and where that kicks in or what exactly it refers to; that could explain why a couple of the pages (like EFTA00400459-62_2x.txt) are considerably worse than others, even though the source images don't show a good reason for the divergence.

With the Textract 2x output cleaned up and piped into base64 -i (which ignores garbage data, generating invalid results that can still be usable for forensic analysis), I can get far enough to see that the PDF within the PDF (i.e. the actual PDF attachment originally sent) was at least partially (de)flate-encoded. Unfortunately, PDFs are binary files with different forms of compression applied; you can't just use something like strings to extract any usable content. qpdf(1) can be (ab)used to decompress a PDF (while leaving it a PDF) via qpdf --qdf --object-streams=disable input.pdf decompressed.pdf, but, predictably, this doesn't work when your input is garbled and corrupted:

> qpdf --qdf --object-streams=disable recovered.pdf decompressed.pdf
WARNING: recovered.pdf: file is damaged
WARNING: recovered.pdf: can't find startxref
WARNING: recovered.pdf: Attempting to reconstruct cross-reference table
WARNING: recovered.pdf (object 34 0, offset 52): unknown token while reading object; treating as string
WARNING: recovered.pdf (object 34 0, offset 70): unknown token while reading object; treating as string
WARNING: recovered.pdf (object 34 0, offset 85): unknown token while reading object; treating as string
WARNING: recovered.pdf (object 34 0, offset 90): unexpected >
WARNING: recovered.pdf (object 34 0, offset 92): unknown token while reading object; treating as string
WARNING: recovered.pdf (object 34 0, offset 116): unknown token while reading object; treating as string
WARNING: recovered.pdf (object 34 0, offset 121): unknown token while reading object; treating as string
WARNING: recovered.pdf (object 34 0, offset 121): too many errors; giving up on reading object
WARNING: recovered.pdf (object 34 0, offset 125): expected endobj
WARNING: recovered.pdf (object 41 0, offset 9562): expected endstream
WARNING: recovered.pdf (object 41 0, offset 8010): attempting to recover stream length
WARNING: recovered.pdf (object 41 0, offset 8010): unable to recover stream data; treating stream as empty
WARNING: recovered.pdf (object 41 0, offset 9616): expected endobj
WARNING: recovered.pdf (object 41 0, offset 9616): EOF after endobj
qpdf: recovered.pdf: unable to find trailer dictionary while recovering damaged file
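Since qpdf gives up entirely, one hedged alternative (a rough sketch, not something I've verified against this particular file) is to scan the recovered bytes for zlib/Flate stream headers and inflate whatever prefix of each stream survives the corruption:

# salvage_streams.py -- rough sketch: scan a damaged PDF for zlib (Flate) streams
# and inflate as much of each one as the corruption allows.
import zlib

data = open("recovered.pdf", "rb").read()   # hypothetical output of the base64 step

for magic in (b"\x78\x01", b"\x78\x9c", b"\x78\xda"):  # common zlib headers
    start = 0
    while (i := data.find(magic, start)) != -1:
        d = zlib.decompressobj()
        out = bytearray()
        try:
            # Feed small chunks so everything decoded before the first error is kept
            for j in range(i, len(data), 256):
                out += d.decompress(data[j:j + 256])
                if d.eof:
                    break
        except zlib.error:
            pass
        if len(out) > 100:                  # skip false-positive two-byte matches
            print(f"stream at offset {i}: recovered {len(out)} bytes")
            # 'out' now holds partially recovered content (text, fonts, images, ...)
        start = i + 2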

Between the inconsistent OCR results and the l-vs-1 problem, it's not a very encouraging situation. To me, this is a problem begging for a (traditional, non-LLM) ML solution, one that leverages the fact that we know the font in question and, roughly, the compression applied. Alas, I don't have more time to lend to this challenge at the moment, as there are a number of things I set aside just in order to publish this article.
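For anyone who wants to pick this up, a crude, non-ML starting point along those lines is per-glyph template matching against the known typeface: render each legal base64 character in Courier New at roughly the scanned size and score each character cell by normalized cross-correlation. A rough Python sketch follows; the font path, cell size, and the hard part of slicing the scan into per-character cells are all assumptions left to the reader:

# glyph_match.py -- rough sketch of the "we know the font" idea
import numpy as np
from PIL import Image, ImageDraw, ImageFont

ALPHABET = ">ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
FONT = ImageFont.truetype("Courier_New.ttf", 40)   # path and size are system-specific assumptions

def render(ch, size=(32, 48)):
    # Render a reference glyph: black character on a white grayscale canvas
    img = Image.new("L", size, 255)
    ImageDraw.Draw(img).text((2, 0), ch, font=FONT, fill=0)
    return np.asarray(img, dtype=np.float32)

TEMPLATES = {ch: render(ch) for ch in ALPHABET}

def classify(cell):
    # cell: a single character crop (PIL Image) from the 300 dpi page scan
    x = np.asarray(cell.convert("L").resize((32, 48)), dtype=np.float32)
    x = (x - x.mean()) / (x.std() + 1e-6)
    best, best_score = "?", -1.0
    for ch, t in TEMPLATES.items():
        tn = (t - t.mean()) / (t.std() + 1e-6)
        score = float((x * tn).mean())              # normalized cross-correlation
        if score > best_score:
            best, best_score = ch, score
    return best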

So here’s the chal­lenge for any­one I can suc­cess­fully nerd­snipe:

* Can you man­age to recre­ate the orig­i­nal PDF from the Content-Transfer-Encoding: base64 out­put in­cluded in the dump? It can’t be that hard, can it?

* Can you find other attachments included in the latest Epstein dumps that might also be possible to reconstruct? Unfortunately, the contractor that developed the full-text search for the Department of Justice did a pretty crappy job, and full-text search is practically broken even accounting for the bad OCR and mangled quoted-printable decoding (malicious compliance??); nevertheless, searching for Content-Transfer-Encoding and base64 returns a number of results. It's just that most are uselessly truncated, or curiously contain only the extracted SMTP headers from Apple Mail.

I have up­loaded the orig­i­nal EFTA00400459.pdf from Epstein Dataset 9 as down­loaded from the DoJ web­site to the Internet Archive, as well as the in­di­vid­ual pages loss­lessly en­coded to WebP im­ages to save you the time and trou­ble of con­vert­ing them your­self. If it’s of any use to any­one, I’ve also up­loaded the very-much-in­valid Amazon Textract OCR text (from the loss­lessly 2x’d im­ages), which you can down­load here.

Oh, and one final hint: when trying to figure out 1 vs l, I was able to do this with 100% accuracy only via trial-and-error, decoding one line of base64 text at a time, but this only works for the plain-text portions of the PDF (headers, etc.). For example, I started with my best guess for one line that I had to type out myself when trying with tesseract, and was then able to (in this case) deduce which particular 1s or ls were flipped:

> pbpaste
SW5mbzw8L01sbHVzdHJhdG9yIDgxIDAgUj4+L1Jlc29lcmNlczw8L0NvbG9yU3BhY2U8PC9DUzAG
> pbpaste | base64 -d
Info<</Mllustrator 81 0 R>>/Resoerces<</ColorSpace<</CS0
> # which I was able to correct:
> pbpaste
SW5mbzw8L0lsbHVzdHJhdG9yIDgxIDAgUj4+L1Jlc291cmNlczw8L0NvbG9yU3BhY2U8PC9DUzAG
> pbpaste | base64 -d
Info<</Illustrator 81 0 R>>/Resources<</ColorSpace<</CS0

…but good luck get­ting that to work once you get to the flate-com­pressed sec­tions of the PDF.
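If you want to automate that trial-and-error for the readable portions, here's a rough sketch that brute-forces every 1/l assignment for a single OCR'd line and keeps the decodings that look like plausible PDF dictionary text. It narrows the field rather than picking a unique winner; judgment is still required between, say, /Illustrator and /Mllustrator:

# flip_search.py -- rough sketch: try every 1/l assignment for one base64 line
# and keep the candidates that decode to mostly printable text.
import base64
from itertools import product

def candidates(line, ambiguous=("1", "l")):
    positions = [i for i, c in enumerate(line) if c in ambiguous]
    for combo in product(ambiguous, repeat=len(positions)):
        chars = list(line)
        for pos, ch in zip(positions, combo):
            chars[pos] = ch
        yield "".join(chars)

def looks_textual(b):
    # crude heuristic: mostly printable ASCII, as in PDF dictionary headers
    return sum(32 <= x < 127 for x in b) / max(len(b), 1) > 0.95

line = "SW5mbzw8L01sbHVzdHJhdG9yIDgxIDAgUj4+L1Jlc29lcmNlczw8L0NvbG9yU3BhY2U8PC9DUzAG"
for cand in candidates(line):
    try:
        decoded = base64.b64decode(cand + "=" * (-len(cand) % 4), validate=True)
    except Exception:
        continue
    if looks_textual(decoded):
        print(cand, decoded)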

I’ll be post­ing up­dates on Twitter @mqudsi, and you can reach out to me on Signal at mqudsi.42 if you have any­thing sen­si­tive you would like to share. You can join in the dis­cus­sion on Hacker News or on r/​net­sec. Leave a com­ment be­low if you have any ideas/​ques­tions, or if you think I missed some­thing!

...

Read the original on neosmart.net »

8 378 shares, 17 trendiness

Spotlighting The World Factbook as We Bid a Fond Farewell

Spotlighting The World Factbook as We Bid a Fond Farewell (via) Somewhat dev­as­tat­ing news to­day from CIA:

One of CIA's oldest and most recognizable intelligence publications, The World Factbook, has sunset.

There’s not even a hint as to why they de­cided to stop main­tain­ing this pub­li­ca­tion, which has been their most use­ful pub­lic-fac­ing ini­tia­tive since 1971 and a cor­ner­stone of the pub­lic in­ter­net since 1997.

In a bizarre act of cul­tural van­dal­ism they’ve not just re­moved the en­tire site (including the archives of pre­vi­ous ver­sions) but they’ve also set every sin­gle page to be a 302 redi­rect to their clo­sure an­nounce­ment.

The Factbook has been re­leased into the pub­lic do­main since the start. There’s no rea­son not to con­tinue to serve archived ver­sions - a ban­ner at the top of the page say­ing it’s no longer main­tained would be much bet­ter than re­mov­ing all of that valu­able con­tent en­tirely.

Up un­til 2020 the CIA pub­lished an­nual zip file archives of the en­tire site. Those are avail­able (along with the rest of the Factbook) on the Internet Archive.

I down­loaded the 384MB .zip file for the year 2020 and ex­tracted it into a new GitHub repos­i­tory, si­monw/​cia-world-fact­book-2020. I’ve en­abled GitHub Pages for that repos­i­tory so you can browse the archived copy at si­monw.github.io/​cia-world-fact­book-2020/.

Here’s a neat ex­am­ple of the ed­i­to­r­ial voice of the Factbook from the What’s New page, dated December 10th 2020:

Years of wran­gling were brought to a close this week when of­fi­cials from Nepal and China an­nounced that they have agreed on the height of Mount Everest. The moun­tain sits on the bor­der be­tween Nepal and Tibet (in west­ern China), and its height changed slightly fol­low­ing an earth­quake in 2015. The new height of 8,848.86 me­ters is just un­der a me­ter higher than the old fig­ure of 8,848 me­ters. The World Factbook rounds the new mea­sure­ment to 8,849 me­ters and this new height has been en­tered through­out the Factbook data­base.

...

Read the original on simonwillison.net »

9 371 shares, 14 trendiness

CIA says it will cease publishing the CIA World Factbook

The US Central Intelligence Agency (CIA) has an­nounced it will cease pub­lish­ing the World Factbook, a free on­line re­source used by mil­lions around the globe.

Frequently cited by jour­nal­ists and aca­d­e­mics, the Factbook of­fered reg­u­larly up­dated sta­tis­tics and in­for­ma­tion about coun­tries and com­mu­ni­ties all over the world, in an eas­ily un­der­stood and search­able for­mat.

A statement on the CIA's website did not include a reason for the decision, simply stating that the publication had "sunset" while encouraging readers to "stay curious about the world and find ways to explore it … in person or virtually".

First launched during World War II as a classified internal program named JANIS (Joint Army Navy Intelligence Studies), the Factbook was originally commissioned as a way to standardise "basic intelligence" (fundamental and factual information about the world) across different agencies of the US government.

The pro­gram was taken over by the CIA in 1947 and re­named the National Intelligence Survey, be­fore the Factbook was launched in 1971 as an an­nual sum­mary of in­for­ma­tion.

An un­clas­si­fied ver­sion was first made avail­able to the pub­lic in 1975, and a dig­i­tal ver­sion was pub­lished on­line in the 1990s, with the data freely avail­able un­der pub­lic do­main.

The web­site was par­tic­u­larly pop­u­lar dur­ing the US school year, ac­cord­ing to pre­vi­ous ver­sions of the site, with traf­fic ex­pe­ri­enc­ing a no­tice­able drop-off dur­ing US sum­mer months.

While no spe­cific rea­son has been given for the Factbook’s clo­sure, the Trump ad­min­is­tra­tion has made no se­cret of its in­tent to cut gov­ern­ment pro­grams it does not con­sider to be fur­ther­ing the core pur­pose of its agen­cies and de­part­ments.

The ad­min­is­tra­tion of­fered buy­outs to every CIA em­ployee in February last year, and is re­port­edly plan­ning to cut about 1,200 fur­ther jobs at the agency over the next sev­eral years.

The CIA has been con­tacted for com­ment.

...

Read the original on www.abc.net.au »

10 320 shares, 13 trendiness

From magic to malware: How OpenClaw's agent skills become an attack surface

A few days ago, I wrote about why OpenClaw feels like a portal to the future, and why that future is scary in a very specific way.

The short version: agent gateways that act like OpenClaw are powerful because they have real access to your files, your tools, your browser, your terminals, and often a long-term "memory" file that captures how you think and what you're building. That combination is exactly what modern infostealers are designed to exploit.

This post is the uncomfortable "and then it happened" follow-up.

Because it’s not just that agents can be dan­ger­ous once they’re in­stalled. The ecosys­tem that dis­trib­utes their ca­pa­bil­i­ties and skill reg­istries has al­ready be­come an at­tack sur­face.

If you are ex­per­i­ment­ing with OpenClaw, do not do it on a com­pany de­vice. Full stop.

In my first post, I de­scribed OpenClaw as a kind of Faustian bar­gain. It is com­pelling pre­cisely be­cause it has real ac­cess to your lo­cal ma­chine, your apps, your browser ses­sions, your files, and of­ten long-term mem­ory. That same ac­cess means there is­n’t yet a safe way to run it on a ma­chine that holds cor­po­rate cre­den­tials or has ac­cess to pro­duc­tion sys­tems.

If you have al­ready run OpenClaw on a work de­vice, treat it as a po­ten­tial in­ci­dent and en­gage your se­cu­rity team im­me­di­ately. Do not wait for symp­toms. Pause work on that ma­chine and fol­low your or­ga­ni­za­tion’s in­ci­dent re­sponse process.

In the OpenClaw ecosystem, a "skill" is often a markdown file: a page of instructions that tells an agent how to do a specialized task. In practice, that markdown can include links, copy-and-paste commands, and tool call recipes.

That sounds harm­less un­til you re­mem­ber how hu­mans, and agents, ac­tu­ally con­sume doc­u­men­ta­tion:

Markdown isn't "content" in an agent ecosystem. Markdown is an installer.

Some people assume the MCP layer makes this safer, because tools can be exposed through a structured interface, with explicit user consent and authorization controls depending on the host and server implementation.

But skills do not need to use MCP at all.

The Agent Skills specification places no restrictions on the markdown body, and skills can include "whatever instructions will help agents perform the task," including copy-and-paste terminal commands. And skills can also bundle scripts alongside the markdown, which means execution can happen outside the MCP tool boundary entirely.

So if your security model is "MCP will gate tool calls," you can still lose to a malicious skill that simply routes around MCP through social engineering, direct shell instructions, or bundled code. MCP can be part of a safe system, but it is not a safety guarantee by itself.

Just as importantly, this is not unique to OpenClaw. "Skills" are increasingly portable because many agents are adopting the open Agent Skills format, in which a skill is a folder centered on a SKILL.md file with metadata and freeform instructions, and it can also bundle scripts and other resources. Even other agents' documentation describes the same basic shape: a SKILL.md file plus optional scripts and assets. That means a malicious "skill" is not just an OpenClaw problem. It is a distribution mechanism that can travel across any agent ecosystem that supports the same standard.

While browsing ClawHub (I won't link it for obvious reasons), I noticed that the top downloaded skill at the time was a "Twitter" skill. It looked normal: a description, intended use, an overview, the kind of thing you'd expect to install without a second thought.

But the very first thing it did was introduce a "required dependency" named "openclaw-core," along with platform-specific install steps. Those steps included convenient links ("here", "this link") that appeared to be normal documentation pointers.

Both links led to ma­li­cious in­fra­struc­ture. The flow was clas­sic staged de­liv­ery:

* The skill's overview told you to install a prerequisite.
* The link led to a staging page designed to get the agent to run a command.
* That command decoded an obfuscated payload and executed it.
* The script downloaded and ran a binary, including removing macOS quarantine attributes to ensure that macOS's built-in anti-malware system, Gatekeeper, doesn't scan it.

I'm intentionally not pasting the exact commands or URLs here. The mechanics are unfortunately straightforward, and repeating them helps attackers more than it helps defenders. The key point is that this was not "a suspicious link." This was a complete execution chain disguised as setup instructions.

I downloaded the final binary safely and submitted it for analysis.

The ver­dict was not am­bigu­ous. It was flagged as ma­cOS in­fos­teal­ing mal­ware.

This is the type of malware that doesn't just "infect your computer." It raids everything valuable on that device:

* Anything else that can be turned into an ac­count takeover

If you’re the kind of per­son in­stalling agent skills, you are ex­actly the kind of per­son whose ma­chine is worth steal­ing from.

After I shared this internally, broader reporting surfaced, putting the scale into focus: hundreds of OpenClaw skills were reportedly involved in distributing macOS malware via ClickFix-style instructions.

That de­tail mat­ters be­cause it con­firms what this re­ally is.

A deliberate strategy: use "skills" as the distribution channel, and "prerequisites" as the social engineering wrapper.

We’ve spent years learn­ing that pack­age man­agers and open-source reg­istries can be­come sup­ply chain at­tack vec­tors.

Agent skill registries are the next chapter, except that the "package" is documentation.

And that makes the at­tack path even smoother:

* And in agent ecosys­tems, the line be­tween read­ing in­struc­tions and ex­e­cut­ing them col­lapses.

Even if an agent can’t run shell com­mands di­rectly, it can still do some­thing dan­ger­ous: it can nor­mal­ize risky be­hav­ior.

It can confidently summarize a malicious prerequisite as "the standard install step." It can encourage you to paste a one-liner. It can reduce hesitation.

And if your agent can execute local commands, then a malicious skill isn't "bad content." It's remote execution wrapped in friendly docs.

Do not run this on a company device. There isn't a safe way to do it. If you already did, or you ran any "install" commands from a skill, engage your security team immediately and treat it as a potential compromise.

* Stop us­ing the de­vice for sen­si­tive work.

If you ex­per­i­ment any­way, use an iso­lated ma­chine with no cor­po­rate ac­cess and no saved cre­den­tials.

You are op­er­at­ing an app store. Assume it will be abused.

* Put warn­ings and fric­tion on ex­ter­nal links and in­stall steps.

* Use per­mis­sions that are spe­cific, time-bound, and re­vo­ca­ble.

This is the clear­est proof yet of the point I made in my ear­lier post. OpenClaw is pow­er­ful be­cause it col­lapses the dis­tance be­tween in­tent and ex­e­cu­tion. That is the magic. It also in­tro­duces sig­nif­i­cant risk. When ca­pa­bil­i­ties are dis­trib­uted as skills and in­stalled via doc­u­men­ta­tion, the reg­istry be­comes a sup­ply chain, and the eas­i­est in­stall path be­comes the at­tack­er’s fa­vorite path.

The answer is not to stop building agents. The answer is to build the missing trust layer around them. Skills need provenance. Execution needs mediation. Permissions need to be specific, revocable, and continuously enforced, not granted once and forgotten. If agents are going to act on our behalf, credentials and sensitive actions cannot be "grabbed" by whatever code happens to run. They need to be brokered, governed, and audited in real time.

This is exactly why we need that next layer: when "skills" become the supply chain, the only safe future is one in which every agent has its own identity and has the minimum authority it needs right now, with access that is time-bound, revocable, and attributable.

...

Read the original on 1password.com »
