10 interesting stories served every morning and every evening.
The new Claude Opus 4.6 improves on its predecessor’s coding skills. It plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills to catch its own mistakes. And, in a first for our Opus-class models, Opus 4.6 features a 1M token context window in beta. Opus 4.6 can also apply its improved abilities to a range of everyday work tasks: running financial analyses, doing research, and using and creating documents, spreadsheets, and presentations. Within Cowork, where Claude can multitask autonomously, Opus 4.6 can put all these skills to work on your behalf.

The model’s performance is state-of-the-art on several evaluations. For example, it achieves the highest score on the agentic coding evaluation Terminal-Bench 2.0 and leads all other frontier models on Humanity’s Last Exam, a complex multidisciplinary reasoning test. On GDPval-AA—an evaluation of performance on economically valuable knowledge work tasks in finance, legal, and other domains—Opus 4.6 outperforms the industry’s next-best model (OpenAI’s GPT-5.2) by around 144 Elo points, and its own predecessor (Claude Opus 4.5) by 190 points. Opus 4.6 also performs better than any other model on BrowseComp, which measures a model’s ability to locate hard-to-find information online.

As we show in our extensive system card, Opus 4.6 also shows an overall safety profile as good as, or better than, any other frontier model in the industry, with low rates of misaligned behavior across safety evaluations.

Opus 4.6 is state-of-the-art on real-world work tasks across several professional domains.

Opus 4.6 gets the highest score in the industry for deep, multi-step agentic search.

In Claude Code, you can now assemble agent teams to work on tasks together. On the API, Claude can use compaction to summarize its own context and perform longer-running tasks without bumping up against limits. We’re also introducing adaptive thinking, where the model can pick up on contextual clues about how much to use its extended thinking, and new effort controls to give developers more control over intelligence, speed, and cost. We’ve made substantial upgrades to Claude in Excel, and we’re releasing Claude in PowerPoint in a research preview. This makes Claude much more capable for everyday work.

Claude Opus 4.6 is available today on claude.ai, our API, and all major cloud platforms. If you’re a developer, use claude-opus-4-6 via the Claude API. Pricing remains the same at $5/$25 per million tokens; for full details, see our pricing page.

We cover the model, our new product updates, our evaluations, and our extensive safety testing in depth below.

We build Claude with Claude. Our engineers write code with Claude Code every day, and every new model first gets tested on our own work. With Opus 4.6, we’ve found that the model brings more focus to the most challenging parts of a task without being told to, moves quickly through the more straightforward parts, handles ambiguous problems with better judgment, and stays productive over longer sessions.

Opus 4.6 often thinks more deeply and more carefully revisits its reasoning before settling on an answer. This produces better results on harder problems, but can add cost and latency on simpler ones. If you’re finding that the model is overthinking on a given task, we recommend dialing effort down from its default setting (high) to medium.
You can control this easily with the /effort parameter.

Here are some of the things our Early Access partners told us about Claude Opus 4.6, including its propensity to work autonomously without hand-holding, its success where previous models failed, and its effect on how teams work:
Claude Opus 4.6 is the strongest model Anthropic has shipped. It takes complicated requests and actually follows through, breaking them into concrete steps, executing, and producing polished work even when the task is ambitious. For Notion users, it feels less like a tool and more like a capable collaborator.

Early testing shows Claude Opus 4.6 delivering on the complex, multi-step coding work developers face every day—especially agentic workflows that demand planning and tool calling. This starts unlocking long-horizon tasks at the frontier.

Claude Opus 4.6 is a huge leap for agentic planning. It breaks complex tasks into independent subtasks, runs tools and subagents in parallel, and identifies blockers with real precision.

Claude Opus 4.6 is the best model we’ve tested yet. Its reasoning and planning capabilities have been exceptional at powering our AI Teammates. It’s also a fantastic coding model — its ability to navigate a large codebase and identify the right changes to make is state of the art.

Claude Opus 4.6 reasons through complex problems at a level we haven’t seen before. It considers edge cases that other models miss and consistently lands on more elegant, well-considered solutions. We’re particularly impressed with Opus 4.6 in Devin Review, where it’s increased our bug catching rates.

Claude Opus 4.6 feels noticeably better than Opus 4.5 in Windsurf, especially on tasks that require careful exploration like debugging and understanding unfamiliar codebases. We’ve noticed Opus 4.6 thinks longer, which pays off when deeper reasoning is needed.

Claude Opus 4.6 represents a meaningful leap in long-context performance. In our testing, we saw it handle much larger bodies of information with a level of consistency that strengthens how we design and deploy complex research workflows. Progress in this area gives us more powerful building blocks to deliver truly expert-grade systems professionals can trust.

Across 40 cybersecurity investigations, Claude Opus 4.6 produced the best results 38 of 40 times in a blind ranking against Claude 4.5 models. Each model ran end to end on the same agentic harness with up to 9 subagents and 100+ tool calls.

Claude Opus 4.6 is the new frontier on long-running tasks from our internal benchmarks and testing. It’s also been highly effective at reviewing code.

Claude Opus 4.6 achieved the highest BigLaw Bench score of any Claude model at 90.2%. With 40% perfect scores and 84% above 0.8, it’s remarkably capable for legal reasoning.

Claude Opus 4.6 autonomously closed 13 issues and assigned 12 issues to the right team members in a single day, managing a ~50-person organization across 6 repositories. It handled both product and organizational decisions while synthesizing context across multiple domains, and it knew when to escalate to a human.

Claude Opus 4.6 is an uplift in design quality. It works beautifully with our design systems and it’s more autonomous, which is core to Lovable’s values. People should be creating things that matter, not micromanaging AI.

Claude Opus 4.6 excels in high-reasoning tasks like multi-source analysis across legal, financial, and technical content. Box’s eval showed a 10% lift in performance, reaching 68% vs. a 58% baseline, and near-perfect scores in technical domains.

Claude Opus 4.6 generates complex, interactive apps and prototypes in Figma Make with an impressive creative range. The model translates detailed designs and multi-layered tasks into code on the first try, making it a powerful starting point for teams to explore and build ideas.

Claude Opus 4.6 is the best Anthropic model we’ve tested. It understands intent with minimal prompting and went above and beyond, exploring and creating details I didn’t even know I wanted until I saw them. It felt like I was working with the model, not waiting on it.

Both hands-on testing and evals show Claude Opus 4.6 is a meaningful improvement for design systems and large codebases, use cases that drive enormous enterprise value. It also one-shotted a fully functional physics engine, handling a large multi-scope task in a single pass.

Claude Opus 4.6 is the biggest leap I’ve seen in months. I’m more comfortable giving it a sequence of tasks across the stack and letting it run. It’s smart enough to use subagents for the individual pieces.

Claude Opus 4.6 handled a multi-million-line codebase migration like a senior engineer. It planned up front, adapted its strategy as it learned, and finished in half the time.

We only ship models in v0 when developers will genuinely feel the difference. Claude Opus 4.6 passed that bar with ease. Its frontier-level reasoning, especially with edge cases, helps v0 to deliver on our number-one aim: to let anyone elevate their ideas from prototype to production.

The performance jump with Claude Opus 4.6 feels almost unbelievable. Real-world tasks that were challenging for Opus [4.5] suddenly became easy. This feels like a watershed moment for spreadsheet agents on Shortcut.

Across agentic coding, computer use, tool use, search, and finance, Opus 4.6 is an industry-leading model, often by a wide margin. The table below shows how Claude Opus 4.6 compares to our previous models and to other industry models on a variety of benchmarks.

Opus 4.6 is much better at retrieving relevant information from large sets of documents. This extends to long-context tasks, where it holds and tracks information over hundreds of thousands of tokens with less drift, and picks up buried details that even Opus 4.5 would miss.

A common complaint about AI models is “context rot,” where performance degrades as conversations exceed a certain number of tokens. Opus 4.6 performs markedly better than its predecessors: on the 8-needle 1M variant of MRCR v2—a needle-in-a-haystack benchmark that tests a model’s ability to retrieve information “hidden” in vast amounts of text—Opus 4.6 scores 76%, whereas Sonnet 4.5 scores just 18.5%. This is a qualitative shift in how much context a model can actually use while maintaining peak performance.

All in all, Opus 4.6 is better at finding information across long contexts, better at reasoning after absorbing that information, and has substantially better expert-level reasoning abilities in general.

Finally, the charts below show how Claude Opus 4.6 performs on a variety of benchmarks that assess its software engineering skills, multilingual coding ability, long-term coherence, cybersecurity capabilities, and its life sciences knowledge.

Opus 4.6 maintains focus over time and earns $3,050.53 more than Opus 4.5 on Vending-Bench 2.

Opus 4.6 finds real vulnerabilities in codebases better than any other model.

Opus 4.6 performs almost 2× better than Opus 4.5 on computational biology, structural biology, organic chemistry, and phylogenetics tests.

These intelligence gains do not come at the cost of safety.
On our automated behavioral audit, Opus 4.6 showed a low rate of misaligned behaviors such as deception, sycophancy, encouragement of user delusions, and cooperation with misuse. Overall, it is just as well-aligned as its predecessor, Claude Opus 4.5, which was our most-aligned frontier model to date. Opus 4.6 also shows the lowest rate of over-refusals—where the model fails to answer benign queries—of any recent Claude model.

The overall misaligned behavior score for each recent Claude model on our automated behavioral audit (described in full in the Claude Opus 4.6 system card).

For Claude Opus 4.6, we ran the most comprehensive set of safety evaluations of any model, applying many different tests for the first time and upgrading several that we’ve used before. We included new evaluations for user wellbeing, more complex tests of the model’s ability to refuse potentially dangerous requests, and updated evaluations of the model’s ability to surreptitiously perform harmful actions. We also experimented with new methods from interpretability, the science of the inner workings of AI models, to begin to understand why the model behaves in certain ways—and, ultimately, to catch problems that standard testing might miss.

A detailed description of all capability and safety evaluations is available in the Claude Opus 4.6 system card.

We’ve also applied new safeguards in areas where Opus 4.6 shows particular strengths that might be put to dangerous as well as beneficial uses. In particular, since the model shows enhanced cybersecurity abilities, we’ve developed six new cybersecurity probes—methods of detecting harmful responses—to help us track different forms of potential misuse.

We’re also accelerating the cyberdefensive uses of the model, using it to help find and patch vulnerabilities in open-source software (as we describe in our new cybersecurity blog post). We think it’s critical that cyberdefenders use AI models like Claude to help level the playing field. Cybersecurity moves fast, and we’ll be adjusting and updating our safeguards as we learn more about potential threats; in the near future, we may institute real-time intervention to block abuse.

We’ve made substantial updates across Claude, Claude Code, and the Claude Developer Platform to let Opus 4.6 perform at its best.

On the API, we’re giving developers better control over model effort and more flexibility for long-running agents. To do so, we’re introducing the following features:

Adaptive thinking. Previously, developers only had a binary choice between enabling or disabling extended thinking. Now, with adaptive thinking, Claude can decide when deeper reasoning would be helpful. At the default effort level (high), the model uses extended thinking when useful, but developers can adjust the effort level to make it more or less selective.

Effort. There are now four effort levels to choose from: low, medium, high (default), and max. We encourage developers to experiment with different options to find what works best.

Context compaction (beta). Long-running conversations and agentic tasks often hit the context window. Context compaction automatically summarizes and replaces older context when the conversation approaches a configurable threshold, letting Claude perform longer tasks without hitting limits.

1M token context (beta). Opus 4.6 is our first Opus-class model with 1M token context. Premium pricing applies for prompts exceeding 200k tokens ($10/$37.50 per million input/output tokens).

128k output tokens. Opus 4.6 supports outputs of up to 128k tokens, which lets Claude complete larger-output tasks without breaking them into multiple requests.

US-only inference. For workloads that need to run in the United States, US-only inference is available at 1.1× token pricing.

Across Claude and Claude Code, we’ve added features that allow knowledge workers and developers to tackle harder tasks with more of the tools they use every day.

We’ve introduced agent teams in Claude Code as a research preview. You can now spin up multiple agents that work in parallel as a team and coordinate autonomously—best for tasks that split into independent, read-heavy work like codebase reviews. You can take over any subagent directly using Shift+Up/Down or tmux.

Claude now also works better with the office tools you already use. Claude in Excel handles long-running and harder tasks with improved performance, and can plan before acting, ingest unstructured data and infer the right structure without guidance, and handle multi-step changes in one pass. Pair that with Claude in PowerPoint, and you can first process and structure your data in Excel, then bring it to life visually in PowerPoint. Claude reads your layouts, fonts, and slide masters to stay on brand, whether you’re building from a template or generating a full deck from a description. Claude in PowerPoint is now available in research preview for Max, Team, and Enterprise plans.
...
Read the original on www.anthropic.com »
My experience adopting any meaningful tool is that I’ve necessarily gone through three phases: (1) a period of inefficiency (2) a period of adequacy, then finally (3) a period of workflow and life-altering discovery.
In most cases, I have to force myself through phase 1 and 2 because I usually have a workflow I’m already happy and comfortable with. Adopting a tool feels like work, and I do not want to put in the effort, but I usually do in an effort to be a well-rounded person of my craft.
This is my journey of how I found value in AI tooling and what I’m trying next with it. In an ocean of overly dramatic, hyped takes, I hope this represents a more nuanced, measured approach to my views on AI and how they’ve changed over time.
Immediately cease trying to perform meaningful work via a chatbot (e.g. ChatGPT, Gemini on the web, etc.). Chatbots have real value and are a daily part of my AI workflow, but their utility in coding is highly limited because you’re mostly hoping they come up with the right results based on their prior training, and correcting them involves a human (you) to tell them they’re wrong repeatedly. It is inefficient.
I think everyone’s first experience with AI is a chat interface. And I think everyone’s first experience trying to code with AI has been asking a chat interface to write code.
While I was still a heavy AI skeptic, my first “oh wow” moment was pasting a screenshot of Zed’s command palette into Gemini, asking it to reproduce it with SwiftUI, and being truly flabbergasted that it did it very well. The command palette that ships for macOS in Ghostty today is only very lightly modified from what Gemini produced for me in seconds.
But when I tried to reproduce that behavior for other tasks, I was left disappointed. In the context of brownfield projects, I found the chat interface produced poor results very often, and I found myself very frustrated copying and pasting code and command output to and from the interface. It was very obviously far less efficient than me doing the work myself.
To find value, you must use an agent. An agent is the industry-adopted term for an LLM that can chat and invoke external behavior in a loop1
At a bare minimum, the agent must have the ability to: read files, execute programs, and make HTTP requests.
For the next phase of my journey, I tried Claude Code. I’ll cut to the chase: I initially wasn’t impressed. I just wasn’t getting good results out of my sessions. I felt I had to touch up everything it produced and this process was taking more time than if I had just done it myself. I read blog posts, watched videos, but just wasn’t that impressed.
Instead of giving up, I forced myself to reproduce all my manual commits
with agentic ones. I literally did the work twice. I’d do the work manually, and then I’d fight an agent to produce identical results in terms of quality and function (without it being able to see my manual solution, of course).
This was excruciating, because it got in the way of simply getting things done. But I’ve been around the block with non-AI tools enough to know that friction is natural, and I can’t come to a firm, defensible conclusion without exhausting my efforts.
But, expertise formed. I quickly discovered for myself from first principles what others were already saying, but discovering it myself resulted in a stronger fundamental understanding.
* Break down sessions into separate clear, actionable tasks. Don’t try to “draw the owl” in one mega session.
* For vague requests, split the work into separate planning vs. execution sessions.
* If you give an agent a way to verify its work, it more often than not fixes its own mistakes and prevents regressions.
More generally, I also found the edges of what agents — at the time — were good at, what they weren’t good at, and for the tasks they were good at how to achieve the results I wanted.
All of this led to significant efficiency gains, to the point where I was starting to naturally use agents in a way that I felt was no slower than doing it myself (but I still didn’t feel it was any faster, since I was mostly babysitting an agent).
The negative space here is worth reiterating: part of the efficiency gains here were understanding when not to reach for an agent. Using an agent for something it’ll likely fail at is obviously a big waste of time and having the knowledge to avoid that completely leads to time savings2.
At this stage, I was finding adequate value with agents that I was happy to use them in my workflow, but still didn’t feel like I was seeing any net efficiency gains. I didn’t care though, I was content at this point with AI as a tool.
To try to find some efficiency, I next started up a new pattern:
block out the last 30 minutes of every day to kick off one or more agents.
My hypothesis was that perhaps I could gain some efficiency if the agent can make some positive progress in the times I can’t work anyways. Basically: instead of trying to do more in the time I have, try to do more in the time I don’t have.
Similar to the previous task, I at first found this both unsuccessful and annoying. But, I once again quickly found different categories of work that were really helpful:
* Deep research sessions where I’d ask agents to survey some
field, such as finding all libraries in a specific language with
a specific license type and producing multi-page summaries for each
on their pros, cons, development activity, social sentiment, etc.
* Parallel agents attempting different vague ideas I had but didn’t
have time to get started on. I didn’t expect them to produce something
I’d ever ship here, but perhaps could illuminate some unknown unknowns
when I got to the task the next day.
* Issue and PR triage/review. Agents are good at using gh (GitHub CLI),
so I manually scripted a quick way to spin up a bunch in parallel to
triage issues. I would NOT allow agents to respond, I just wanted
reports the next day to try to guide me towards high value or low effort
tasks.
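
A rough sketch of the kind of triage wrapper described above (hypothetical; the prompt wording and flags are my own assumptions, not the author’s actual script):

# Fan out one read-only triage agent per open issue; collect reports for the morning.
# Assumes the gh and claude CLIs are installed and authenticated.
mkdir -p reports
for issue in $(gh issue list --limit 20 --json number --jq '.[].number'); do
  claude -p "Triage GitHub issue #$issue in this repo using gh. Do not post any comments or replies; write a short report to reports/issue-$issue.md covering likely cause, estimated effort, and likely value." &
done
wait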
To be clear, I did not go as far as others have in having agents run in loops all night. In most cases, agents completed their tasks in less than half an hour. But in the latter part of the working day I’m usually tired, coming out of flow, and personally inefficient, so shifting my effort to spinning up these agents gave me a “warm start” the next morning that got me working more quickly than I would’ve otherwise.
I was happy, and I was starting to feel like I was doing more than I was doing prior to AI, if only slightly.
By this point, I was getting very confident about what tasks my AI was and wasn’t great at. I had really high confidence with certain tasks that the AI would achieve a mostly-correct solution. So the next step on my journey was: let agents do all of that work while I worked on other tasks.
More specifically, I would start each day by taking the results of my prior night’s triage agents, filter them manually to find the issues that an agent will almost certainly solve well, and then keep them going in the background (one at a time, not in parallel).
Meanwhile, I’d work on something else. I wasn’t going to social media (any more than usual without AI), I wasn’t watching videos, etc. I was in my own, normal, pre-AI deep thinking mode working on something I wanted to work on or had to work on.
Very important at this stage: turn off agent desktop notifications.
Context switching is very expensive. In order to remain efficient, I found that it was my job as a human to be in control of when I interrupt the agent, not the other way around. Don’t let the agent notify you. During natural breaks in your work, tab over and check on it, then carry on.
Importantly, I think the “work on something else” approach helps counteract the concern raised by the highly publicized Anthropic skill formation paper. Really, you’re making a trade-off: you stop forming skills for the tasks you delegate to the agent, while continuing to form skills naturally in the tasks you keep working on manually.
At this point I was firmly in the “no way I can go back” territory. I felt more efficient, but even if I wasn’t, the thing I liked the most was that I could now focus my coding and thinking on tasks I really loved while still adequately completing the tasks I didn’t.
At risk of stating the obvious: agents are much more efficient when they produce the right result the first time, or at worst produce a result that requires minimal touch-ups. The most sure-fire way to achieve this is to give the agent fast, high quality tools to automatically tell it when it is wrong.
I don’t know if there is a broad industry-accepted term for this yet, but I’ve grown to calling this “harness engineering.” It is the idea that anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again. I don’t need to invent any new terms here; if another one exists, I’ll jump on the bandwagon.
This comes in two forms:
Better implicit prompting (AGENTS.md). For simple things, like the agent repeatedly running the wrong commands or finding the wrong APIs, update the AGENTS.md (or equivalent). Here is
an example from Ghostty. Each line in that file is based on a bad agent behavior, and it almost completely resolved them all.
Actual, programmed tools. For example, scripts to take screenshots, run filtered tests, and so on. This is usually paired with an AGENTS.md change to let the agent know this tooling exists.
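
As a concrete sketch of the kind of tool meant here (a hypothetical example, not one of Ghostty’s actual scripts): a filtered-test wrapper that keeps output short, plus a line in AGENTS.md advertising it.

#!/usr/bin/env bash
# scripts/test-filter.sh (hypothetical): run only tests matching a pattern and
# keep the output terse so it doesn't flood the agent's context window.
set -euo pipefail
pattern="${1:?usage: scripts/test-filter.sh <pattern>}"
# Swap in your project's real test runner on the next line.
./run-tests --filter "$pattern" 2>&1 | tail -n 40

Paired with an AGENTS.md line such as “Run scripts/test-filter.sh <pattern> instead of the full test suite,” this is the shape of fix that keeps the same mistake from recurring.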
This is where I’m at today. I’m making an earnest effort whenever I see an agent do a Bad Thing to prevent it from ever doing that bad thing again. Or, conversely, I’m making an earnest effort for agents to be able to verify they’re doing a Good Thing.
Simultaneous to step 5, I’m also operating under the goal of
having an agent running at all times. If an agent isn’t running, I ask myself “is there something an agent could be doing for me right now?”
I particularly like to combine this with slower, more thoughtful models like Amp’s deep mode (which is basically just GPT-5.2-Codex) which can take upwards of 30+ minutes to make small changes. The flip side of that is that it does tend to produce very good results.
I’m not [yet?] running multiple agents, and currently don’t really want to.
I find having the one agent running is a good balance for me right now between being able to do deep, manual work I find enjoyable, and babysitting my kind of stupid and yet mysteriously productive robot friend.
The “have an agent running at all times” goal is still just a goal. I’d say right now I’m maybe effective at having a background agent running 10 to 20% of a normal working day. But, I’m actively working to improve that.
And that’s where I’m at today.
Through this journey, I’ve personally reached a point where I’m having success with modern AI tooling and I believe I’m approaching it with the proper measured view that is grounded in reality. I really don’t care one way or the other if AI is here to stay3, I’m a software craftsman that just wants to build stuff for the love of the game.
The whole landscape is moving so rapidly that I’m sure I’ll look back at this post very quickly and laugh at my naivete. But, as they say, if you can’t be embarrassed about your past self, you’re probably not growing. I just hope I’ll grow in the right direction!
I have no skin in the game here4, and there are of course other reasons beyond utility to avoid using AI. I fully respect anyone’s individual decisions regarding it. I’m not here to convince you! For those interested, I just wanted to share my personal approach to navigating these new tools and give a glimpse about how I approach new tools
in general, regardless of AI.
...
Read the original on mitchellh.com »
Written by Nicholas Carlini, a researcher on our Safeguards team.
I’ve been experimenting with a new approach to supervising language models that we’re calling “agent teams.” With agent teams, multiple Claude instances work in parallel on a shared codebase without active human intervention. This approach dramatically expands the scope of what’s achievable with LLM agents. To stress test it, I tasked 16 agents with writing a Rust-based C compiler, from scratch, capable of compiling the Linux kernel. Over nearly 2,000 Claude Code sessions and $20,000 in API costs, the agent team produced a 100,000-line compiler that can build Linux 6.9 on x86, ARM, and RISC-V. The compiler is an interesting artifact on its own, but I focus here on what I learned about designing harnesses for long-running autonomous agent teams: how to write tests that keep agents on track without human oversight, how to structure work so multiple agents can make progress in parallel, and where this approach hits its ceiling.

Existing agent scaffolds like Claude Code require an operator to be online and available to work jointly. If you ask for a solution to a long and complex problem, the model may solve part of it, but eventually it will stop and wait for continued input—a question, a status update, or a request for clarification.

To elicit sustained, autonomous progress, I built a harness that sticks Claude in a simple loop (if you’ve seen Ralph-loop, this should look familiar). When it finishes one task, it immediately picks up the next. (Run this in a container, not your actual machine).
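
The loop itself isn’t reproduced in this excerpt. A minimal bash sketch of the idea, with the prompt path and flags as assumptions rather than the author’s exact script:

#!/usr/bin/env bash
# Minimal "run Claude forever" harness sketch (assumed details).
# Run inside a disposable container, never on your own machine.
set -u
cd /workspace
while true; do
  # Start a fresh, non-interactive Claude Code session with the standing prompt.
  # --dangerously-skip-permissions lets it edit files and run commands unattended.
  claude -p "$(cat /harness/prompt.md)" --dangerously-skip-permissions
  # Whenever a session exits, immediately start the next one.
  sleep 5
done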
In the agent prompt, I tell Claude what problem to solve and ask it to approach the problem by breaking it into small pieces, tracking what it’s working on, figuring out what to work on next, and to effectively keep going until it’s perfect. (On this last point, Claude has no choice. The loop runs forever—although in one instance, I did see Claude pkill -9 bash on accident, thus killing itself and ending the loop. Whoops!).

Running multiple instances in parallel can address two weaknesses of a single-agent harness:

* One Claude Code session can only do one thing at a time. Especially as the scope of a project expands, debugging multiple issues in parallel is far more efficient.
* Running multiple Claude agents allows for specialization. While a few agents are tasked to solve the actual problem at hand, other specialized agents can be invoked to (for example) maintain documentation, keep an eye on code quality, or solve specialized sub-tasks.

My implementation of parallel Claude is bare-bones. A new bare git repo is created, and for each agent, a Docker container is spun up with the repo mounted to /upstream. Each agent clones a local copy to /workspace, and when it’s done, pushes from its own local container to upstream.

To prevent two agents from trying to solve the same problem at the same time, the harness uses a simple synchronization algorithm:

* Claude takes a “lock” on a task by writing a text file to current_tasks/ (e.g., one agent might lock current_tasks/parse_if_statement.txt, while another locks current_tasks/codegen_function_definition.txt). If two agents try to claim the same task, git’s synchronization forces the second agent to pick a different one.
* Claude works on the task, then pulls from upstream, merges changes from other agents, pushes its changes, and removes the lock. Merge conflicts are frequent, but Claude is smart enough to figure that out.
* The infinite agent-generation-loop spawns a new Claude Code session in a fresh container, and the cycle repeats.

This is a very early research prototype. I haven’t yet implemented any other method for communication between agents, nor do I enforce any process for managing high-level goals. I don’t use an orchestration agent. Instead, I leave it up to each Claude agent to decide how to act. In most cases, Claude picks up the “next most obvious” problem. When stuck on a bug, Claude will often maintain a running doc of failed approaches and remaining tasks. In the git repository of the project, you can read through the history and watch it take out locks on various tasks.

The scaffolding runs Claude in a loop, but that loop is only useful if Claude can tell how to make progress. Most of my effort went into designing the environment around Claude—the tests, the environment, the feedback—so that it could orient itself without me. These are the approaches I’ve found most helpful when orchestrating multiple Claude instances.

Claude will work autonomously to solve whatever problem I give it. So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem. Improving the testing harness required finding high-quality compiler test suites, writing verifiers and build scripts for open-source software packages, and watching for mistakes Claude was making, then designing new tests as I identified those failure modes.

For example, near the end of the project, Claude started to frequently break existing functionality each time it implemented a new feature.
To address this, I built a continuous integration pipeline and implemented stricter enforcement that allowed Claude to better test its work so that new commits can’t break existing code.

I had to constantly remind myself that I was writing this test harness for Claude and not for myself, which meant rethinking many of my assumptions about how tests should communicate results.

For example, each agent is dropped into a fresh container with no context and will spend significant time orienting itself, especially on large projects. Before we even reach the tests, to help Claude help itself, I included instructions to maintain extensive READMEs and progress files that should be updated frequently with the current status.

I also kept in mind the fact that language models have inherent limitations, which, in this case, needed to be designed around. These include:

* Context window pollution: The test harness should not print thousands of useless bytes. At most, it should print a few lines of output and log all important information to a file so Claude can find it when needed. Logfiles should be easy to process automatically: if there are errors, Claude should write ERROR and put the reason on the same line so grep will find it. It helps to pre-compute aggregate summary statistics so Claude doesn’t have to recompute them.
* Time blindness: Claude can’t tell time and, left alone, will happily spend hours running tests instead of making progress. The harness prints incremental progress infrequently (to avoid polluting context) and includes a default --fast option that runs a 1% or 10% random sample. This subsample is deterministic per-agent but random across VMs, so Claude still covers all files but each agent can perfectly identify regressions.

When there are many distinct failing tests, parallelization is trivial: each agent picks a different failing test to work on. After the test suite reached a 99% pass rate, each agent worked on getting a different small open-source project (e.g., SQLite, Redis, libjpeg, MQuickJS, Lua) to compile.

But when agents started to compile the Linux kernel, they got stuck. Unlike a test suite with hundreds of independent tests, compiling the Linux kernel is one giant task. Every agent would hit the same bug, fix that bug, and then overwrite each other’s changes. Having 16 agents running didn’t help because each was stuck solving the same task.

The fix was to use GCC as an online known-good compiler oracle to compare against. I wrote a new test harness that randomly compiled most of the kernel using GCC, and only the remaining files with Claude’s C Compiler. If the kernel worked, then the problem wasn’t in Claude’s subset of the files. If it broke, then it could further refine by re-compiling some of these files with GCC. This let each agent work in parallel, fixing different bugs in different files, until Claude’s compiler could eventually compile all files. (After this worked, it was still necessary to apply delta debugging techniques to find pairs of files that failed together but worked independently.)

Parallelism also enables specialization. LLM-written code frequently re-implements existing functionality, so I tasked one agent with coalescing any duplicate code it found. I put another in charge of improving the performance of the compiler itself, and a third I made responsible for outputting efficient compiled code.
I asked another agent to critique the design of the project from the perspective of a Rust developer, and make structural changes to the project to improve the overall code quality, and another to work on documentation.

This project was designed as a capability benchmark. I am interested in stress-testing the limits of what LLMs can just barely achieve today in order to help us prepare for what models will reliably achieve in the future.

I’ve been using the C Compiler project as a benchmark across the entire Claude 4 model series. As I did with prior projects, I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes) I did not go into any detail on how to do so.

Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects. My goal with Opus 4.6 was to again test the limits.

Over nearly 2,000 Claude Code sessions across two weeks, Opus 4.6 consumed 2 billion input tokens and generated 140 million output tokens, a total cost just under $20,000. Compared to even the most expensive Claude Max plans, this was an extremely expensive project. But that total is a fraction of what it would cost me to produce this myself—let alone an entire team.

This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build a bootable Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQLite, Postgres, and Redis, and has a 99% pass rate on most compiler test suites including the GCC torture test suite. It also passes the developer’s ultimate litmus test: it can compile and run Doom.

The compiler, however, is not without limitations. These include:

* It lacks the 16-bit x86 compiler that is necessary to boot Linux out of real mode. For this, it calls out to GCC (the x86_32 and x86_64 compilers are its own).
* It does not have its own assembler and linker; these are the very last bits that Claude started automating and are still somewhat buggy. The demo video was produced with a GCC assembler and linker.
* The compiler successfully builds many projects, but not all. It’s not yet a drop-in replacement for a real compiler.
* The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
* The Rust code quality is reasonable, but is nowhere near the quality of what an expert Rust programmer might produce.

The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

As one particularly challenging example, Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase. (This is only the case for x86. For ARM or RISC-V, Claude’s compiler can compile completely by itself.)

The source code for the compiler is available. Download it, read through the code, and try it on your favorite C projects. I’ve consistently found the best way to understand what language models can do is to push them to their limits, and then study where they start to break down. Over the coming days, I’ll continue having Claude push new changes if you want to follow along with Claude’s continued attempts at addressing these limitations.

Each generation of language models opens up new ways of working with them. Early models were useful for tab-completion in IDEs. Before long, models could complete a function body from its docstring. The launch of Claude Code brought agents into the mainstream and enabled developers to pair-program with Claude. But each of these products operates under the assumption that a user defines a task, an LLM runs for a few seconds or minutes and returns an answer, and then the user provides a follow-up.

Agent teams show the possibility of implementing entire, complex projects autonomously. This allows us, as users of these tools, to become more ambitious with our goals.

We are still early, and fully autonomous development comes with real risks. When a human sits with Claude during development, they can ensure consistent quality and catch errors in real time. For autonomous systems, it is easy to see tests pass and assume the job is done, when this is rarely the case. I used to work in penetration testing, exploiting vulnerabilities in products produced by large companies, and the thought of programmers deploying software they’ve never personally verified is a real concern.

So, while this experiment excites me, it also leaves me feeling uneasy. Building this compiler has been some of the most fun I’ve had recently, but I did not expect this to be anywhere near possible so early in 2026. The rapid progress in both language models and the scaffolds we use to interact with them opens the door to writing an enormous amount of new code. I expect the positive applications to outweigh the negative, but we’re entering a new world which will require new strategies to navigate safely.

Special thanks to Josef Bacik, Edwin Chen, Bernardo Meurer Costa, Jake Eaton, Dan Kelley, Felix Klock, Jannet Park, Steve Weis, and many other people across Anthropic for their assistance and contributions.
...
Read the original on www.anthropic.com »
Think of your database like your home. Your home has a living room, bedroom, bathroom, kitchen, and garage. Each room serves a different purpose. But they’re all under the same roof, connected by hallways and doors. You don’t build a separate restaurant building just because you need to cook. You don’t construct a commercial garage across town just to park your car.
That’s what Postgres is. One home with many rooms. Search, vectors, time-series, queues—all under one roof.
But this is exactly what specialized database vendors don’t want you to hear. Their marketing teams have spent years convincing you to “use the right tool for the right job.” It sounds reasonable. It sounds wise. And it sells a lot of databases.
Let me show you why it’s a trap and why Postgres is the better choice in 99% of cases.
You’ve heard the advice: “Use the right tool for the right job.”
Sounds wise. So you end up with:
Congratulations. You now have seven databases to manage. Seven query languages to learn. Seven backup strategies to maintain. Seven security models to audit. Seven sets of credentials to rotate. Seven monitoring dashboards to watch. And seven things that can break at 3 AM.
And when something does break? Good luck spinning up a test environment to debug it.
Here’s a different idea: Just use Postgres.
This isn’t just about simplicity. AI agents have made database sprawl a nightmare.
Think about what agents need to do:
With one database? That’s a single command. Fork it, test it, done.
With seven databases? Now you need to:
* Make sure they’re all at the same point in time
* Spin up seven different services
* Tear down seven services when you’re done
This is virtually impossible without a ton of R&D.
And it’s not just agents. Every time something breaks at 3 AM, you need to spin up a test environment to debug. With seven databases, that’s a coordination nightmare. With one database, it’s a single command.
In the AI era, simplicity isn’t just elegant. It’s essential.
The myth: Specialized databases are far superior at their specific tasks.
The reality: Sometimes they’re marginally better at a narrow task. But they also bring unnecessary complexity. It’s like hiring a private chef for every meal. Sounds luxurious, but it adds expense, coordination overhead, and creates problems you didn’t have before.
Here’s the thing: 99% of companies don’t need them. The top 1% have tens of millions of users and a large engineering team to match. You’ve read their blog posts about how amazing Specialized Database X works for them. But that’s their scale, their team, their problems. For everyone else, Postgres is more than enough.
Here’s what most people don’t realize: Postgres extensions use the same or better algorithms as specialized databases (in many cases).
These aren’t watered-down versions. They’re the same/better algorithms, battle-tested, open source, and often developed by the same researchers.
* pgvectorscale: 28x lower latency than Pinecone at 75% less cost
* pg_textsearch: The exact same BM25 ranking that powers Elasticsearch
Beyond the AI/agent problem, database sprawl has compounding costs:
Cognitive load: Your team needs SQL, Redis commands, Elasticsearch Query DSL, MongoDB aggregation, Kafka patterns, and InfluxDB’s non-native SQL workaround. That’s not specialization. That’s fragmentation.
Data consistency: Keeping Elasticsearch in sync with Postgres? You build sync jobs. They fail. Data drifts. You add reconciliation. That fails too. Now you’re maintaining infrastructure instead of building features.
SLA math: Three systems at 99.9% uptime each = 99.7% combined. That’s 26 hours of downtime per year instead of 8.7. Every system multiplies your failure modes.
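
As a quick check of that arithmetic (assuming the three systems fail independently):

awk 'BEGIN { u = 0.999^3; printf "%.4f %.1f %.2f\n", u, (1-u)*8760, 0.001*8760 }'
# 0.9970 26.3 8.76  ->  ~99.7% combined uptime, ~26 hours of downtime a year,
# versus ~8.8 hours for a single 99.9% system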
These extensions aren’t new. They’ve been production-ready for years:
* JSONB: Since 2014 (11 years). As fast as MongoDB, with ACID.
Over 48,000 companies use PostgreSQL, including Netflix, Spotify, Uber, Reddit, Instagram, and Discord.
What this means: Building a RAG app used to require Postgres + Pinecone + Elasticsearch + glue code.
Now? Just Postgres. One database. One query language. One backup. One fork command for your AI agent to spin up a test environment.
Here’s all you need:
Below are working examples for each use case. Skip to what you need.
What you get: The exact same BM25 algorithm that powers Elasticsearch, directly in Postgres.
This is what Elasticsearch requires a separate plugin for. In Postgres, it’s just SQL.
What you get: pgvectorscale uses the DiskANN algorithm (from Microsoft Research), achieving 28x lower p95 latency and 16x higher throughput than Pinecone at 99% recall.
Now every INSERT/UPDATE automatically regenerates embeddings. No sync jobs. No drift. No 3 AM pages.
* Prometheus: Great for metrics, not your application data
What you get: Automatic time partitioning, compression up to 90%, continuous aggregates. Full SQL.
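
The post’s own code sample isn’t included in this excerpt; a minimal sketch of what that can look like with the timescaledb extension (table and column names here are invented for illustration):

psql "$DATABASE_URL" <<'SQL'
-- Hypothetical metrics table; names are illustrative only.
CREATE TABLE metrics (time timestamptz NOT NULL, device_id int, value double precision);
SELECT create_hypertable('metrics', 'time');           -- automatic time partitioning
ALTER TABLE metrics SET (timescaledb.compress);        -- enable native compression
SELECT add_compression_policy('metrics', INTERVAL '7 days');
-- Continuous aggregate: hourly averages maintained incrementally.
CREATE MATERIALIZED VIEW metrics_hourly WITH (timescaledb.continuous) AS
  SELECT time_bucket('1 hour', time) AS bucket, device_id, avg(value) AS avg_value
  FROM metrics GROUP BY bucket, device_id;
SQL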
For AI applications, you often need both keyword search and semantic search:
Try that with Elasticsearch + Pinecone. You’d need two API calls, result merging, failure handling, and double latency.
In Postgres: one query, one transaction, one result.
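
The post’s combined query isn’t shown in this excerpt. As a rough sketch of the pattern using stock Postgres full-text search plus pgvector (the schema is invented, and this uses the built-in ts_rank rather than the BM25 extension mentioned above):

psql "$DATABASE_URL" <<'SQL'
-- Hybrid search in one statement: keyword rank + vector similarity.
-- Assumes CREATE EXTENSION vector; and a table docs(id bigint, body text, embedding vector(3)).
-- In real use the vector literal would be the query's embedding.
SELECT id, keyword_score, semantic_score
FROM (
  SELECT id,
         ts_rank(to_tsvector('english', body),
                 plainto_tsquery('english', 'billing error')) AS keyword_score,
         1 - (embedding <=> '[0.01, 0.02, 0.03]'::vector)     AS semantic_score
  FROM docs
) scored
ORDER BY keyword_score + semantic_score DESC
LIMIT 10;
SQL

In practice you would also index both sides (a GIN index on the tsvector, an HNSW or IVFFlat index on the embedding column), but the point stands: one round trip, one transaction, one result set.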
Remember the home analogy? You don’t build a separate restaurant just to cook dinner. You don’t construct a commercial garage across town just to park your car. You use the rooms in your home.
That’s what we’ve shown you here. Search, vectors, time-series, documents, queues, caching—they’re all rooms in the Postgres home. Same algorithms as the specialized databases. Battle-tested for years. Used by Netflix, Uber, Discord, and 48,000 other companies.
So what about that 99%?
For 99% of companies, Postgres handles everything you need. The 1%? That’s when you’re processing petabytes of logs across hundreds of nodes, or you need Kibana’s specific dashboards, or you have exotic requirements that genuinely exceed what Postgres can do.
But here’s the thing: you’ll know when you’re in the 1%. You won’t need a vendor’s marketing team to tell you. You’ll have benchmarked it yourself and hit a real wall.
Until then, don’t scatter your data across seven buildings because someone told you to “use the right tool for the right job.” That advice sells databases. It doesn’t serve you.
Start with Postgres. Stay with Postgres. Add complexity only when you’ve earned the need for it.
In 2026, just use Postgres.
All these extensions are available on Tiger Data. Create a free database in minutes:
No need for specialized databases, just use Postgres.
...
Read the original on www.tigerdata.com »
LinkedIn silently probes for 2,953 Chrome extensions on every page load.
This repository documents every extension LinkedIn checks for and provides tools to identify them.
The complete list of extensions with names and Chrome Web Store links:
Fetches extension names from Chrome Web Store with Extpose fallback for removed/unavailable extensions.
# Fetch all extensions
node fetch_extension_names.js
# Fetch a subset (useful if rate limited)
node fetch_extension_names.js --offset 0 --limit 500
node fetch_extension_names.js -o 500 -l 500
# Show help
node fetch_extension_names.js --help
Test script that processes the first 3 extensions with verbose output.
node test_fetch.js
* ~22% found via Extpose fallback (removed or unavailable on Chrome Web Store)
...
Read the original on github.com »
There have been a lot of complaints about both the competency and the logic behind the latest Epstein archive release by the DoJ: from censoring the names of co-conspirators to censoring pictures of random women in a way that makes individuals look guiltier than they really are, forgetting to redact credentials that made it possible for all of Reddit to log into Epstein’s account and trample over all the evidence, and the complete ineptitude that resulted in most of the latest batch being corrupted thanks to incorrectly converted Quoted-Printable encoding artifacts, it’s safe to say that Pam Bondi’s DoJ did not put its best and brightest on this (admittedly gargantuan) undertaking. But the most damning evidence has all been thoroughly redacted… hasn’t it? Well, maybe not.
I was thinking of writing an article on the mangled quoted-printable encoding the day this latest dump came out in response to all the misinformed musings and conjectures that were littering social media (and my dilly-dallying cost me, as someone beat me to the punch), and spent some time searching through the latest archives looking for some SMTP headers that I could use in the article when I came across a curious artifact: not only were the emails badly transcoded into plain text, but also some binary attachments were actually included in the dumps in their over-the-wire Content-Transfer-Encoding: base64 format, and the unlucky intern that was assigned to the documents in question didn’t realize the significance of what they were looking at and didn’t see the point in censoring seemingly meaningless page after page of hex content!
Just take a look at EFTA00400459, an email from correspondence between (presumably) one of Epstein’s assistants and Epstein lackey/co-conspirator Boris Nikolic and his friend, Sam Jaradeh, inviting them to a ████████ benefit:
Those hex characters go on for 76 pages, and represent the file DBC12 One Page Invite with Reply.pdf encoded as base64 so that it can be included in the email without breaking the SMTP protocol. And converting it back to the original PDF is, theoretically, as easy as copy-and-pasting those 76 pages into a text editor, stripping the leading > bytes, and piping all that into base64 -d > output.pdf… or it would be, if we had the original (badly converted) email and not a partially redacted scan of a printout of said email with some shoddy OCR applied.
If you tried to actually copy that text as digitized by the DoJ from the PDF into a text editor, here’s what you’d see:
You can ignore the EFTA00400459 on the second line; that (or some variant thereof) will be interspersed into the base64 text since it’s stamped at the bottom of every page to identify the piece of evidence it came from. But what else do you notice? Here’s a hint: this is what proper base64 looks like:
Notice how in this sample everything lines up perfectly (when using a monospaced font) at the right margin? And how that’s not the case when we copied-and-pasted from the OCR’d PDF? That’s because it wasn’t a great OCR job: extra characters have been hallucinated into the output, some of them not even legal base64 characters such as the , and [, while other characters have been omitted altogether, giving us content we can’t use:1
> pbpaste \
| string match -rv 'EFTA' \
| string trim -c " >" \
| string join "" \
| base64 -d >/dev/null
base64: invalid input
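
A quick way to see where an OCR pass has gone wrong is to look for lines containing characters outside the base64 alphabet, or whose length differs from the dominant width. A rough check, assuming the pasted text has been saved to dump.txt:

# Lines containing anything outside base64 (plus the quoted "> " prefix):
grep -nE '[^A-Za-z0-9+/= >]' dump.txt
# Distribution of line lengths; the dominant length is the true base64 width,
# and anything else is a line the OCR mangled:
awk '{ print length($0) }' dump.txt | sort -n | uniq -c | sort -rn | head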
I tried the easiest alternative I had at hand: I loaded up the PDF in Adobe Acrobat Pro and re-ran an OCR process on the document, but came up with even worse results, with spaces injected in the middle of the base64 content (easily fixable) in addition to other characters being completely misread and butchered — it really didn’t like the cramped monospace text at all. So I thought to do it manually with tesseract, which, while very far from state-of-the-art, can still be useful because it lets you do things like limit its output to a certain subset of characters, constraining the field of valid results and hopefully coercing it into producing better results.
Only one problem: tesseract can’t read PDF input (or not by default, anyway). No problem, I’ll just use imagemagick/ghostscript to convert the PDF into individual PNG images (to avoid further generational loss) and provide those to tesseract, right? But that didn’t quite work out, they seem (?) to try to load and perform the conversion of all 76 separate pages/png files all at once, and then naturally crash on too-large inputs (but only after taking forever and generating the 76 (invalid) output files that you’re forced to subsequently clean up, of course):
> convert -density 300 EFTA00400459.pdf \
-background white -alpha remove \
-alpha off out.png
convert-im6.q16: cache resources exhausted `/tmp/magick-QqXVSOZutVsiRcs7pLwwG2FYQnTsoAmX47' @ error/cache.c/OpenPixelCache/4119.
convert-im6.q16: cache resources exhausted `out.png' @ error/cache.c/OpenPixelCache/4119.
convert-im6.q16: No IDATs written into file `out-0.png' @ error/png.c/MagickPNGErrorHandler/1643.
So we turn to pdftoppm from the poppler-utils package instead, which does indeed handle each page of the source PDF separately and turned out to be up to the task, though incredibly slow:
> pdftoppm -png -r 300 EFTA00400459.pdf out
After waiting the requisite amount of time (and then some), I had files out-01.png through out-76.png, and was ready to try them with tesseract:
for n in (printf "%02d\n" (seq 1 76))
    tesseract out-$n.png output-$n \
        --psm 6 \
        -c tessedit_char_whitelist='>'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/= \
        -c load_system_dawg=0 \
        -c load_freq_dawg=0
end
The above fish-shell command instructs tesseract(1) to assume the input is a single block of text (the --psm 6 argument) and limit itself to decoding only legal base64 characters (and the leading > so we can properly strip it out thereafter). My original attempt included a literal space in the valid char whitelist, but that gave me worse results: the very badly kerned base64 has significant apparent spacing between some adjacent characters (more on this later) and that caused tesseract to both incorrectly inject spaces (bad but fixable) and also possibly affect how it handled the character after the space (worse).
Unfortunately, while tesseract gave me slightly better output than either the original OCR’d DoJ text or the (terrible) Adobe Acrobat Pro OCR results, it too suffered from poor recognition and gave me very inconsistent line lengths… but it also suffered from something that I didn’t really think a heuristic-based, algorithm-driven tool like tesseract would succumb to, as it was more reminiscent of how first-generation LLMs would behave: in a few places, it would only read the first dozen or so characters on a line then leave the rest of the line blank, then pick up (correctly enough) at the start of the next line. Before I saw how generally useless the OCR results were and gave up on tesseract, I figured I’d just manually type out the rest of the line (the aborted lines were easy enough to find, thanks to the monospaced output), and that was when I ran into the real issue that took this from an interesting challenge to being almost mission impossible.
I mentioned earlier the bad kerning, which tricked the OCR tools into injecting spaces where there were supposed to be none, but that was far from being the worst issue plaguing the PDF content. The real problem is that the text is rendered in possibly the worst typeface for the job at hand: Courier New.
If you’re a font enthusiast, I certainly don’t need to say any more; you’re probably already shaking with a mix of PTSD and rage. But for the benefit of everyone else, let’s just say that Courier New is… not a great font. It is a digitization of the venerable (though certainly primitive) Courier typeface, which IBM commissioned in the 1950s. Courier was used (with some tweaks) on IBM typewriters, including the IBM Selectric, and in the 1990s it was “digitized directly from the golf ball of the IBM Selectric” by Monotype and shipped with Windows 3.1, where it remained the default monospace font on Windows until Consolas arrived with Windows Vista. Among the many issues with Courier New is that it was digitized from the Selectric golf ball “without accounting for the visual weight normally added by the typewriter’s ink ribbon”, which gives it its characteristic “thin” look. Microsoft ClearType, which was only enabled by default starting with Windows Vista, addressed this major shortcoming to some extent, but Courier New has always struggled with general readability… and, more importantly, with poor distinction between characters.
While not as bad as some typewriter-era typefaces that actually reused the same symbol for 1 (one) and l (ell), Courier New came pretty close. Here is a comparison of the two fonts (Courier and Courier New) rendering these two characters, just considerably enlarged:
The combination of these two faults (the anemic stroke weight and the even weaker distinction between 1 and l compared to Courier) makes Courier New a terrible choice as a programming font. But as a font used for base64 output you want to OCR? You really couldn’t pick a worse option! To make matters worse, the comparison above shows SVG outlines of the fonts, meticulously converted and preserving all the fine details, while in the Epstein PDFs released by the DoJ we only have low-quality JPEG scans at a fairly small point size. Here’s an actual (losslessly encoded) screenshot of the DoJ text at 100%; I challenge you to tell me which is a 1 and which is an l in the excerpt below:
It’s not that there isn’t any difference between the two, because there is. And sometimes you get a clear gut feeling about which is which: I was midway through manually typing out one line of base64 text when I got stuck on identifying a one vs an ell… only to realize that I had confidently transcribed one of them earlier in that same line without even pausing to think about which it was. Here’s a zoomed-in view of the scanned PDF: you can clearly see all the JPEG DCT artifacts, the color fringing, and the smearing of character shapes, all of which make it hard to properly identify the characters. But at the same time, at least in this particular sample, you can see which of the highlighted characters have a straight serif leading out of the top-left (the middle one, presumably an ell) and which have the slightest of strokes/feet extending from them (the first and last, presumably ones). Whether that’s how the original glyphs actually appeared or just an artifact of how the image was compressed is tough to say:
But that’s getting ahead of myself: at this point, none of the OCR tools had actually given me usable results, even ignoring the very important question of l vs 1. Having been let down by one open source offering (tesseract) and two commercial ones (Adobe Acrobat Pro and, presumably, whatever the DoJ used), I made the very questionable choice of writing a script to throw yet another commercial offering, Amazon/AWS Textract, at the PDF. Unfortunately, using it directly via the first-party tooling was somewhat of a no-go, as the synchronous API only supports smaller/shorter inputs; longer PDFs like this one need to be uploaded to S3, after which you use the async workflow to start the recognition job and poll for completion.
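That async dance is less convenient but not complicated; with the AWS CLI it looks roughly like the following (the bucket name is a placeholder, and the actual script used here isn’t reproduced in the post):
> aws s3 cp EFTA00400459.pdf s3://my-ocr-bucket/
> aws textract start-document-text-detection \
      --document-location '{"S3Object":{"Bucket":"my-ocr-bucket","Name":"EFTA00400459.pdf"}}'
> # note the returned JobId, then poll until JobStatus reports SUCCEEDED
> aws textract get-document-text-detection --job-id $JOB_ID \
      --query 'Blocks[?BlockType==`LINE`].Text' --output text
> # (results are paginated; repeat with --next-token until none is returned)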
Amazon Textract did possibly the best out of all the tools I tried, but its output still had obvious line-length discrepancies, albeit only a character or two off on average. I decided to try again, this time blowing up the input 2x (using nearest-neighbor sampling to preserve sharp edges) as a workaround for Textract not having a tunable to configure the DPI the document is processed at, though I worried that Textract might simply prescale all inputs to a fixed size before processing anyway:2
> for n in (printf "%02d\n" (seq 01 76))
      convert EFTA00400459-$n.png -scale 200% \
          EFTA00400459-$n"_2x".png; or break
  end
> parallel -j 16 ./textract.sh {} ::: EFTA00400459-*_2x.png
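The textract.sh wrapper isn’t reproduced in the post; a hypothetical minimal version, assuming the synchronous detect-document-text call, a placeholder S3 bucket, and one text file per page, might look something like this:
#!/usr/bin/env fish
# hypothetical per-image Textract wrapper: upload the page, OCR it, keep only the LINE text
set img $argv[1]
set key (basename $img)
set out (string replace -r '\.png$' '.txt' $img)
aws s3 cp $img s3://my-ocr-bucket/$key; or exit 1
aws textract detect-document-text \
    --document "{\"S3Object\":{\"Bucket\":\"my-ocr-bucket\",\"Name\":\"$key\"}}" \
    --query 'Blocks[?BlockType==`LINE`].Text' --output text | tr '\t' '\n' > $out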
These results were notably better, and I’ve included them in an archive, but some pages scanned better than others. Textract doesn’t seem to be 100% deterministic in my brief experience with it, and its features page makes vague mentions of “ML” without it being obvious when and where that kicks in or what exactly it refers to; that could explain why a couple of the pages (like EFTA00400459-62_2x.txt) are considerably worse than others, even though the source images don’t show a good reason for the divergence.
With the Textract 2x output cleaned up and piped into base64 -d -i (the -i flag ignores garbage data, generating invalid results that can still be usable for forensic analysis; a rough sketch of this step follows the qpdf output below), I can get far enough to see that the PDF within the PDF (i.e. the actual PDF attachment originally sent) was at least partially (de)flate-encoded. Unfortunately, PDFs are binary files with different forms of compression applied; you can’t just use something like strings to extract any usable content. qpdf(1) can be (ab)used to decompress a PDF (while leaving it a PDF) via qpdf --qdf --object-streams=disable input.pdf decompressed.pdf, but, predictably, this doesn’t work when your input is garbled and corrupted:
> qpdf --qdf --object-streams=disable recovered.pdf decompressed.pdf
WARNING: recovered.pdf: file is damaged
WARNING: recovered.pdf: can’t find startxref
WARNING: recovered.pdf: Attempting to reconstruct cross-reference table
WARNING: recovered.pdf (object 34 0, offset 52): unknown token while reading object; treating as string
WARNING: recovered.pdf (object 34 0, offset 70): unknown token while reading object; treating as string
WARNING: recovered.pdf (object 34 0, offset 85): unknown token while reading object; treating as string
WARNING: recovered.pdf (object 34 0, offset 90): unexpected >
WARNING: recovered.pdf (object 34 0, offset 92): unknown token while reading object; treating as string
WARNING: recovered.pdf (object 34 0, offset 116): unknown token while reading object; treating as string
WARNING: recovered.pdf (object 34 0, offset 121): unknown token while reading object; treating as string
WARNING: recovered.pdf (object 34 0, offset 121): too many errors; giving up on reading object
WARNING: recovered.pdf (object 34 0, offset 125): expected endobj
WARNING: recovered.pdf (object 41 0, offset 9562): expected endstream
WARNING: recovered.pdf (object 41 0, offset 8010): attempting to recover stream length
WARNING: recovered.pdf (object 41 0, offset 8010): unable to recover stream data; treating stream as empty
WARNING: recovered.pdf (object 41 0, offset 9616): expected endobj
WARNING: recovered.pdf (object 41 0, offset 9616): EOF after endobj
qpdf: recovered.pdf: unable to find trailer dictionary while recovering damaged file
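For reference, the recovered.pdf that qpdf chokes on above comes from a cleanup-and-decode step along the lines of the following sketch, which assumes the per-page Textract text files with any non-base64 lines (headers and the like) already stripped by hand, plus GNU coreutils base64:
> # strip the leading > quote markers and stray spaces, then decode whatever survives
> cat EFTA00400459-*_2x.txt | \
      sed 's/^>[[:space:]]*//' | \
      tr -d ' \t' | \
      base64 -d --ignore-garbage > recovered.pdf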
Between the inconsistent OCR results and the problem with the l vs 1, it’s not a very encouraging situation. To me, this is a problem begging for a (traditional, non-LLM) ML solution, specifically leveraging the fact that we know the font in question and, roughly, the compression applied. Alas, I don’t have more time to lend to this challenge at the moment, as there are a number of things I set aside just in order to publish this article.
So here’s the challenge for anyone I can successfully nerdsnipe:
* Can you manage to recreate the original PDF from the Content-Transfer-Encoding: base64 output included in the dump? It can’t be that hard, can it?
* Can you find other attachments included in the latest Epstein dumps that might also be possible to reconstruct? Unfortunately, the contractor that developed the full-text search for the Department of Justice did a pretty crappy job, and full-text search is practically broken even accounting for the bad OCR and mangled quoted-printable decoding (malicious compliance??); nevertheless, searching for Content-Transfer-Encoding and base64 returns a number of results. It’s just that, unfortunately, most are uselessly truncated or curiously contain only the SMTP headers from Apple Mail.
I have uploaded the original EFTA00400459.pdf from Epstein Dataset 9 as downloaded from the DoJ website to the Internet Archive, as well as the individual pages losslessly encoded to WebP images to save you the time and trouble of converting them yourself. If it’s of any use to anyone, I’ve also uploaded the very-much-invalid Amazon Textract OCR text (from the losslessly 2x’d images), which you can download here.
Oh, and one final hint: when trying to figure out 1 vs l, I was able to do this with 100% accuracy only via trial and error, decoding one line of base64 text at a time, and even then it only works for the plain-text portions of the PDF (headers, etc). For example, I started with my best guess for one line that I had to type out myself when trying with tesseract, and was then able to (in this case) deduce which particular 1s or ls were flipped:
> pbpaste
SW5mbzw8L01sbHVzdHJhdG9yIDgxIDAgUj4+L1Jlc29lcmNlczw8L0NvbG9yU3BhY2U8PC9DUzAG
> pbpaste | base64 -d
Info<</Mllustrator 81 0 R>>/Resoerces<</ColorSpace<</CS0
> # which I was able to correct:
> pbpaste
SW5mbzw8L0lsbHVzdHJhdG9yIDgxIDAgUj4+L1Jlc291cmNlczw8L0NvbG9yU3BhY2U8PC9DUzAG
> pbpaste | base64 -d
Info<</Illustrator 81 0 R>>/Resources<</ColorSpace<</CS0
…but good luck getting that to work once you get to the flate-compressed sections of the PDF.
I’ll be posting updates on Twitter @mqudsi, and you can reach out to me on Signal at mqudsi.42 if you have anything sensitive you would like to share. You can join in the discussion on Hacker News or on r/netsec. Leave a comment below if you have any ideas/questions, or if you think I missed something!
...
Read the original on neosmart.net »
Spotlighting The World Factbook as We Bid a Fond Farewell (via) Somewhat devastating news today from CIA:
One of CIA’s oldest and most recognizable intelligence publications, The World Factbook, has sunset.
There’s not even a hint as to why they decided to stop maintaining this publication, which has been their most useful public-facing initiative since 1971 and a cornerstone of the public internet since 1997.
In a bizarre act of cultural vandalism they’ve not just removed the entire site (including the archives of previous versions) but they’ve also set every single page to be a 302 redirect to their closure announcement.
The Factbook has been in the public domain since the start. There’s no reason not to continue to serve archived versions - a banner at the top of the page saying it’s no longer maintained would be much better than removing all of that valuable content entirely.
Up until 2020 the CIA published annual zip file archives of the entire site. Those are available (along with the rest of the Factbook) on the Internet Archive.
I downloaded the 384MB .zip file for the year 2020 and extracted it into a new GitHub repository, simonw/cia-world-factbook-2020. I’ve enabled GitHub Pages for that repository so you can browse the archived copy at simonw.github.io/cia-world-factbook-2020/.
Here’s a neat example of the editorial voice of the Factbook from the What’s New page, dated December 10th 2020:
Years of wrangling were brought to a close this week when officials from Nepal and China announced that they have agreed on the height of Mount Everest. The mountain sits on the border between Nepal and Tibet (in western China), and its height changed slightly following an earthquake in 2015. The new height of 8,848.86 meters is just under a meter higher than the old figure of 8,848 meters. The World Factbook rounds the new measurement to 8,849 meters and this new height has been entered throughout the Factbook database.
...
Read the original on simonwillison.net »
The US Central Intelligence Agency (CIA) has announced it will cease publishing the World Factbook, a free online resource used by millions around the globe.
Frequently cited by journalists and academics, the Factbook offered regularly updated statistics and information about countries and communities all over the world, in an easily understood and searchable format.
A statement on the CIA’s website did not include a reason for the decision, simply stating that the publication had “sunset” while encouraging readers to “stay curious about the world and find ways to explore it … in person or virtually”.
First launched during World War II as a classified internal program named JANIS (Joint Army Navy Intelligence Studies), the Factbook was originally commissioned as a way to standardise “basic intelligence” — fundamental and factual information about the world — across different agencies of the US government.
The program was taken over by the CIA in 1947 and renamed the National Intelligence Survey, before the Factbook was launched in 1971 as an annual summary of information.
An unclassified version was first made available to the public in 1975, and a digital version was published online in the 1990s, with the data freely available in the public domain.
The website was particularly popular during the US school year, according to previous versions of the site, with traffic experiencing a noticeable drop-off during US summer months.
While no specific reason has been given for the Factbook’s closure, the Trump administration has made no secret of its intent to cut government programs it does not consider to be furthering the core purpose of its agencies and departments.
The administration offered buyouts to every CIA employee in February last year, and is reportedly planning to cut about 1,200 further jobs at the agency over the next several years.
The CIA has been contacted for comment.
...
Read the original on www.abc.net.au »
A few days ago, I wrote about why OpenClaw feels like a portal to the future, and why that future is scary in a very specific way.
The short version: agent gateways that act like OpenClaw are powerful because they have real access to your files, your tools, your browser, your terminals, and often a long-term “memory” file that captures how you think and what you’re building. That combination is exactly what modern infostealers are designed to exploit.
This post is the uncomfortable, “and then it happened” follow-up.
Because it’s not just that agents can be dangerous once they’re installed. The ecosystem that distributes their capabilities (skills and skill registries) has already become an attack surface.
If you are experimenting with OpenClaw, do not do it on a company device. Full stop.
In my first post, I described OpenClaw as a kind of Faustian bargain. It is compelling precisely because it has real access to your local machine, your apps, your browser sessions, your files, and often long-term memory. That same access means there isn’t yet a safe way to run it on a machine that holds corporate credentials or has access to production systems.
If you have already run OpenClaw on a work device, treat it as a potential incident and engage your security team immediately. Do not wait for symptoms. Pause work on that machine and follow your organization’s incident response process.
In the OpenClaw ecosystem, a “skill” is often a markdown file: a page of instructions that tells an agent how to do a specialized task. In practice, that markdown can include links, copy-and-paste commands, and tool call recipes.
That sounds harmless until you remember how humans, and agents, actually consume documentation:
Markdown isn’t “content” in an agent ecosystem. Markdown is an installer.
Some people assume the MCP layer makes this safer, because tools can be exposed through a structured interface, with explicit user consent and authorization controls depending on the host and server implementation.
But skills do not need to use MCP at all.
The Agent Skills specification places no restrictions on the markdown body, and skills can include whatever instructions will “help agents perform the task,” including copy and paste terminal commands. And skills can also bundle scripts alongside the markdown, which means execution can happen outside the MCP tool boundary entirely.
So if your security model is “MCP will gate tool calls,” you can still lose to a malicious skill that simply routes around MCP through social engineering, direct shell instructions, or bundled code. MCP can be part of a safe system, but it is not a safety guarantee by itself.
Just as importantly, this is not unique to OpenClaw. “Skills” are increasingly portable because many agents are adopting the same open Agent Skills format, in which a skill is a folder centered on a SKILL.md file with metadata and freeform instructions, and it can also bundle scripts and other resources. Other descriptions of the format follow the same basic shape: a SKILL.md file plus optional scripts and assets. That means a malicious “skill” is not just an OpenClaw problem. It is a distribution mechanism that can travel across any agent ecosystem that supports the same standard.
While browsing ClawHub (I won’t link it for obvious reasons), I noticed the top downloaded skill at the time was a “Twitter” skill. It looked normal: description, intended use, an overview, the kind of thing you’d expect to install without a second thought.
But the very first thing it did was introduce a “required dependency” named “openclaw-core,” along with platform-specific install steps. Those steps included convenient links (“here”, “this link”) that appeared to be normal documentation pointers.
Both links led to malicious infrastructure. The flow was classic staged delivery:
* The skill’s overview told you to install a prerequisite.
* The link led to a staging page designed to get the agent to run a command.
* That command decoded an obfuscated payload and executed it.
* The script downloaded and ran a binary, including removing macOS quarantine attributes to ensure that macOS’s built-in anti-malware system, Gatekeeper, doesn’t scan it.
I’m intentionally not pasting the exact commands or URLs here. The mechanics are unfortunately straightforward, and repeating them helps attackers more than it helps defenders. The key point is that this was not “a suspicious link.” This was a complete execution chain disguised as setup instructions.
I downloaded the final binary safely and submitted it to a malware scanning service.
The verdict was not ambiguous. It was flagged as macOS infostealing malware.
This is the type of malware that doesn’t just “infect your computer.” It raids everything valuable on that device:
* Anything else that can be turned into an account takeover
If you’re the kind of person installing agent skills, you are exactly the kind of person whose machine is worth stealing from.
After I shared this internally, further reports surfaced, putting the scale into focus: hundreds of OpenClaw skills were reportedly involved in distributing macOS malware via ClickFix-style instructions.
That detail matters because it confirms what this really is.
A deliberate strategy: use “skills” as the distribution channel, and “prerequisites” as the social engineering wrapper.
We’ve spent years learning that package managers and open-source registries can become supply chain attack vectors.
Agent skill registries are the next chapter, except that the “package” is documentation.
And that makes the attack path even smoother:
* And in agent ecosystems, the line between reading instructions and executing them collapses.
Even if an agent can’t run shell commands directly, it can still do something dangerous: it can normalize risky behavior.
It can confidently summarize a malicious prerequisite as “the standard install step.” It can encourage you to paste a one-liner. It can reduce hesitation.
And if your agent can execute local commands, then a malicious skill isn’t “bad content.” It’s remote execution wrapped in friendly docs.
Do not run this on a company device. There isn’t a safe way to do it. If you already did, or you ran any “install” commands from a skill, engage your security team immediately and treat it as a potential compromise.
* Stop using the device for sensitive work.
If you experiment anyway, use an isolated machine with no corporate access and no saved credentials.
If you operate a skill registry, you are operating an app store. Assume it will be abused.
* Put warnings and friction on external links and install steps.
* Use permissions that are specific, time-bound, and revocable.
This is the clearest proof yet of the point I made in my earlier post. OpenClaw is powerful because it collapses the distance between intent and execution. That is the magic. It also introduces significant risk. When capabilities are distributed as skills and installed via documentation, the registry becomes a supply chain, and the easiest install path becomes the attacker’s favorite path.
The answer is not to stop building agents. The answer is to build the missing trust layer around them. Skills need provenance. Execution needs mediation. Permissions need to be specific, revocable, and continuously enforced, not granted once and forgotten. If agents are going to act on our behalf, credentials and sensitive actions cannot be “grabbed” by whatever code happens to run. They need to be brokered, governed, and audited in real time.
This is exactly why we need that next layer: when “skills” become the supply chain, the only safe future is one in which every agent has its own identity and has the minimum authority it needs right now, with access that is time-bound, revocable, and attributable.
...
Read the original on 1password.com »