10 interesting stories served every morning and every evening.
The new Claude Opus 4.6 improves on its predecessor’s coding skills. It plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills to catch its own mistakes. And, in a first for our Opus-class models, Opus 4.6 features a 1M token context window in beta. Opus 4.6 can also apply its improved abilities to a range of everyday work tasks: running financial analyses, doing research, and using and creating documents, spreadsheets, and presentations. Within Cowork, where Claude can multitask autonomously, Opus 4.6 can put all these skills to work on your behalf.

The model’s performance is state-of-the-art on several evaluations. For example, it achieves the highest score on the agentic coding evaluation Terminal-Bench 2.0 and leads all other frontier models on Humanity’s Last Exam, a complex multidisciplinary reasoning test. On GDPval-AA—an evaluation of performance on economically valuable knowledge work tasks in finance, legal, and other domains—Opus 4.6 outperforms the industry’s next-best model (OpenAI’s GPT-5.2) by around 144 Elo points, and its own predecessor (Claude Opus 4.5) by 190 points. Opus 4.6 also performs better than any other model on BrowseComp, which measures a model’s ability to locate hard-to-find information online.

As we show in our extensive system card, Opus 4.6 also shows an overall safety profile as good as, or better than, any other frontier model in the industry, with low rates of misaligned behavior across safety evaluations.

Opus 4.6 is state-of-the-art on real-world work tasks across several professional domains.

Opus 4.6 gets the highest score in the industry for deep, multi-step agentic search.

In Claude Code, you can now assemble agent teams to work on tasks together. On the API, Claude can use compaction to summarize its own context and perform longer-running tasks without bumping up against limits. We’re also introducing adaptive thinking, where the model can pick up on contextual clues about how much to use its extended thinking, and new effort controls to give developers more control over intelligence, speed, and cost. We’ve made substantial upgrades to Claude in Excel, and we’re releasing Claude in PowerPoint in a research preview. This makes Claude much more capable for everyday work.

Claude Opus 4.6 is available today on claude.ai, our API, and all major cloud platforms. If you’re a developer, use claude-opus-4-6 via the Claude API. Pricing remains the same at $5/$25 per million tokens; for full details, see our pricing page.

We cover the model, our new product updates, our evaluations, and our extensive safety testing in depth below.

We build Claude with Claude. Our engineers write code with Claude Code every day, and every new model first gets tested on our own work. With Opus 4.6, we’ve found that the model brings more focus to the most challenging parts of a task without being told to, moves quickly through the more straightforward parts, handles ambiguous problems with better judgment, and stays productive over longer sessions.

Opus 4.6 often thinks more deeply and more carefully revisits its reasoning before settling on an answer. This produces better results on harder problems, but can add cost and latency on simpler ones. If you’re finding that the model is overthinking on a given task, we recommend dialing effort down from its default setting (high) to medium. You can control this easily with the /effort parameter.

Here are some of the things our Early Access partners told us about Claude Opus 4.6, including its propensity to work autonomously without hand-holding, its success where previous models failed, and its effect on how teams work:
* Claude Opus 4.6 is the strongest model Anthropic has shipped. It takes complicated requests and actually follows through, breaking them into concrete steps, executing, and producing polished work even when the task is ambitious. For Notion users, it feels less like a tool and more like a capable collaborator.
* Early testing shows Claude Opus 4.6 delivering on the complex, multi-step coding work developers face every day—especially agentic workflows that demand planning and tool calling. This starts unlocking long-horizon tasks at the frontier.
* Claude Opus 4.6 is a huge leap for agentic planning. It breaks complex tasks into independent subtasks, runs tools and subagents in parallel, and identifies blockers with real precision.
* Claude Opus 4.6 is the best model we’ve tested yet. Its reasoning and planning capabilities have been exceptional at powering our AI Teammates. It’s also a fantastic coding model — its ability to navigate a large codebase and identify the right changes to make is state of the art.
* Claude Opus 4.6 reasons through complex problems at a level we haven’t seen before. It considers edge cases that other models miss and consistently lands on more elegant, well-considered solutions. We’re particularly impressed with Opus 4.6 in Devin Review, where it’s increased our bug catching rates.
* Claude Opus 4.6 feels noticeably better than Opus 4.5 in Windsurf, especially on tasks that require careful exploration like debugging and understanding unfamiliar codebases. We’ve noticed Opus 4.6 thinks longer, which pays off when deeper reasoning is needed.
* Claude Opus 4.6 represents a meaningful leap in long-context performance. In our testing, we saw it handle much larger bodies of information with a level of consistency that strengthens how we design and deploy complex research workflows. Progress in this area gives us more powerful building blocks to deliver truly expert-grade systems professionals can trust.
* Across 40 cybersecurity investigations, Claude Opus 4.6 produced the best results 38 of 40 times in a blind ranking against Claude 4.5 models. Each model ran end to end on the same agentic harness with up to 9 subagents and 100+ tool calls.
* Claude Opus 4.6 is the new frontier on long-running tasks from our internal benchmarks and testing. It’s also been highly effective at reviewing code.
* Claude Opus 4.6 achieved the highest BigLaw Bench score of any Claude model at 90.2%. With 40% perfect scores and 84% above 0.8, it’s remarkably capable for legal reasoning.
* Claude Opus 4.6 autonomously closed 13 issues and assigned 12 issues to the right team members in a single day, managing a ~50-person organization across 6 repositories. It handled both product and organizational decisions while synthesizing context across multiple domains, and it knew when to escalate to a human.
* Claude Opus 4.6 is an uplift in design quality. It works beautifully with our design systems and it’s more autonomous, which is core to Lovable’s values. People should be creating things that matter, not micromanaging AI.
* Claude Opus 4.6 excels in high-reasoning tasks like multi-source analysis across legal, financial, and technical content. Box’s eval showed a 10% lift in performance, reaching 68% vs. a 58% baseline, and near-perfect scores in technical domains.
* Claude Opus 4.6 generates complex, interactive apps and prototypes in Figma Make with an impressive creative range. The model translates detailed designs and multi-layered tasks into code on the first try, making it a powerful starting point for teams to explore and build ideas.
* Claude Opus 4.6 is the best Anthropic model we’ve tested. It understands intent with minimal prompting and went above and beyond, exploring and creating details I didn’t even know I wanted until I saw them. It felt like I was working with the model, not waiting on it.
* Both hands-on testing and evals show Claude Opus 4.6 is a meaningful improvement for design systems and large codebases, use cases that drive enormous enterprise value. It also one-shotted a fully functional physics engine, handling a large multi-scope task in a single pass.
* Claude Opus 4.6 is the biggest leap I’ve seen in months. I’m more comfortable giving it a sequence of tasks across the stack and letting it run. It’s smart enough to use subagents for the individual pieces.
* Claude Opus 4.6 handled a multi-million-line codebase migration like a senior engineer. It planned up front, adapted its strategy as it learned, and finished in half the time.
* We only ship models in v0 when developers will genuinely feel the difference. Claude Opus 4.6 passed that bar with ease. Its frontier-level reasoning, especially with edge cases, helps v0 to deliver on our number-one aim: to let anyone elevate their ideas from prototype to production.
* The performance jump with Claude Opus 4.6 feels almost unbelievable. Real-world tasks that were challenging for Opus [4.5] suddenly became easy. This feels like a watershed moment for spreadsheet agents on Shortcut.

Across agentic coding, computer use, tool use, search, and finance, Opus 4.6 is an industry-leading model, often by a wide margin. The table below shows how Claude Opus 4.6 compares to our previous models and to other industry models on a variety of benchmarks.

Opus 4.6 is much better at retrieving relevant information from large sets of documents. This extends to long-context tasks, where it holds and tracks information over hundreds of thousands of tokens with less drift, and picks up buried details that even Opus 4.5 would miss.

A common complaint about AI models is “context rot,” where performance degrades as conversations exceed a certain number of tokens. Opus 4.6 performs markedly better than its predecessors: on the 8-needle 1M variant of MRCR v2—a needle-in-a-haystack benchmark that tests a model’s ability to retrieve information “hidden” in vast amounts of text—Opus 4.6 scores 76%, whereas Sonnet 4.5 scores just 18.5%. This is a qualitative shift in how much context a model can actually use while maintaining peak performance.

All in all, Opus 4.6 is better at finding information across long contexts, better at reasoning after absorbing that information, and has substantially better expert-level reasoning abilities in general.

Finally, the charts below show how Claude Opus 4.6 performs on a variety of benchmarks that assess its software engineering skills, multilingual coding ability, long-term coherence, cybersecurity capabilities, and its life sciences knowledge.

Opus 4.6 maintains focus over time and earns $3,050.53 more than Opus 4.5 on Vending-Bench 2.

Opus 4.6 finds real vulnerabilities in codebases better than any other model.

Opus 4.6 performs almost 2× better than Opus 4.5 on computational biology, structural biology, organic chemistry, and phylogenetics tests.

These intelligence gains do not come at the cost of safety.
On our automated behavioral audit, Opus 4.6 showed a low rate of misaligned behaviors such as deception, sycophancy, encouragement of user delusions, and cooperation with misuse. Overall, it is just as well-aligned as its predecessor, Claude Opus 4.5, which was our most-aligned frontier model to date. Opus 4.6 also shows the lowest rate of over-refusals—where the model fails to answer benign queries—of any recent Claude model.

The overall misaligned behavior score for each recent Claude model on our automated behavioral audit (described in full in the Claude Opus 4.6 system card).

For Claude Opus 4.6, we ran the most comprehensive set of safety evaluations of any model, applying many different tests for the first time and upgrading several that we’ve used before. We included new evaluations for user wellbeing, more complex tests of the model’s ability to refuse potentially dangerous requests, and updated evaluations of the model’s ability to surreptitiously perform harmful actions. We also experimented with new methods from interpretability, the science of the inner workings of AI models, to begin to understand why the model behaves in certain ways—and, ultimately, to catch problems that standard testing might miss.

A detailed description of all capability and safety evaluations is available in the Claude Opus 4.6 system card.

We’ve also applied new safeguards in areas where Opus 4.6 shows particular strengths that might be put to dangerous as well as beneficial uses. In particular, since the model shows enhanced cybersecurity abilities, we’ve developed six new cybersecurity probes—methods of detecting harmful responses—to help us track different forms of potential misuse.

We’re also accelerating the cyberdefensive uses of the model, using it to help find and patch vulnerabilities in open-source software (as we describe in our new cybersecurity blog post). We think it’s critical that cyberdefenders use AI models like Claude to help level the playing field. Cybersecurity moves fast, and we’ll be adjusting and updating our safeguards as we learn more about potential threats; in the near future, we may institute real-time intervention to block abuse.

We’ve made substantial updates across Claude, Claude Code, and the Claude Developer Platform to let Opus 4.6 perform at its best.

On the API, we’re giving developers better control over model effort and more flexibility for long-running agents. To do so, we’re introducing the following features:

* Adaptive thinking. Previously, developers only had a binary choice between enabling or disabling extended thinking. Now, with adaptive thinking, Claude can decide when deeper reasoning would be helpful. At the default effort level (high), the model uses extended thinking when useful, but developers can adjust the effort level to make it more or less selective.
* Effort. There are now four effort levels to choose from: low, medium, high (default), and max. We encourage developers to experiment with different options to find what works best (see the sketch at the end of this section).
* Context compaction (beta). Long-running conversations and agentic tasks often hit the context window. Context compaction automatically summarizes and replaces older context when the conversation approaches a configurable threshold, letting Claude perform longer tasks without hitting limits.
* 1M token context (beta). Opus 4.6 is our first Opus-class model with 1M token context. Premium pricing applies for prompts exceeding 200k tokens ($10/$37.50 per million input/output tokens).
* 128k output tokens. Opus 4.6 supports outputs of up to 128k tokens, which lets Claude complete larger-output tasks without breaking them into multiple requests.
* US-only inference. For workloads that need to run in the United States, US-only inference is available at 1.1× token pricing.

Across Claude and Claude Code, we’ve added features that allow knowledge workers and developers to tackle harder tasks with more of the tools they use every day.

We’ve introduced agent teams in Claude Code as a research preview. You can now spin up multiple agents that work in parallel as a team and coordinate autonomously—best for tasks that split into independent, read-heavy work like codebase reviews. You can take over any subagent directly using Shift+Up/Down or tmux.

Claude now also works better with the office tools you already use. Claude in Excel handles long-running and harder tasks with improved performance, and can plan before acting, ingest unstructured data and infer the right structure without guidance, and handle multi-step changes in one pass. Pair that with Claude in PowerPoint, and you can first process and structure your data in Excel, then bring it to life visually in PowerPoint. Claude reads your layouts, fonts, and slide masters to stay on brand, whether you’re building from a template or generating a full deck from a description. Claude in PowerPoint is now available in research preview for Max, Team, and Enterprise plans.
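For developers, here is a minimal sketch of how the effort control described above might be exercised from the Anthropic Python SDK. The model ID claude-opus-4-6 comes from the announcement; the effort value passed via extra_body is an assumed field name based on the description, not a confirmed parameter, and the beta compaction and 1M-context features are not shown.

```python
# Minimal sketch: dialing effort down from the default for a simple task.
# The "effort" field is an illustrative assumption based on the announcement,
# passed through extra_body rather than claimed as a first-class SDK argument.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    extra_body={"effort": "medium"},  # assumed name; levels described: low/medium/high/max
    messages=[
        {"role": "user", "content": "Summarize the key risks in this quarterly report."}
    ],
)
print(response.content[0].text)
```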
...
Read the original on www.anthropic.com »
These days it seems you need a trillion fake dollars, or lunch with politicians to get your own data center. They may help, but they’re not required. At comma we’ve been running our own data center for years. All of our model training, metrics, and data live in our own data center in our own office. Having your own data center is cool, and in this blog post I will describe how ours works, so you can be inspired to have your own data center too.
If your business relies on compute, and you run that compute in the cloud, you are putting a lot of trust in your cloud provider. Cloud companies generally make onboarding very easy, and offboarding very difficult. If you are not vigilant you will sleepwalk into a situation of high cloud costs and no way out. If you want to control your own destiny, you must run your own compute.
Self-reliance is great, but there are other benefits to running your own compute. It inspires good engineering. Maintaining a data center is much more about solving real-world challenges. The cloud requires expertise in company-specific APIs and billing systems. A data center requires knowledge of Watts, bits, and FLOPs. I know which one I’d rather think about.
Avoiding the cloud for ML also creates better incentives for engineers. Engineers generally want to improve things. In ML many problems go away by just using more compute. In the cloud that means improvements are just a budget increase away. This locks you into inefficient and expensive solutions. Instead, when all you have available is your current compute, the quickest improvements are usually speeding up your code, or fixing fundamental issues.
Finally, there’s cost: owning a data center can be far cheaper than renting in the cloud. Especially if your compute or storage needs are fairly consistent, which tends to be true if you are in the business of training or running models. In comma’s case I estimate we’ve spent ~$5M on our data center, and we would have spent $25M+ had we done the same things in the cloud.
Our data center is pretty simple. It’s maintained and built by only a couple of engineers and technicians. Your needs may be slightly different, but our implementation should provide useful context.
To run servers you need power. We currently use about 450kW at max. Operating a data center exposes you to many fun engineering challenges, but procuring power is not one of them. San Diego power cost is over 40c/kWh, ~3x the global average. It’s a ripoff, overpriced simply due to political dysfunction. We spent $540,112 on power in 2025, a big part of the data center cost. In a future blog post I hope I can tell you how we produce our own power and why you should too.
Data centers need cool dry air. Typically this is achieved with a CRAC system, but they are power-hungry. San Diego has a mild climate and we opted for pure outside air cooling. This gives us less control of the temperature and humidity, but uses only a couple dozen kW. We have dual 48” intake fans and dual 48” exhaust fans to keep the air cool. To ensure low humidity (
The majority of our current compute is 600 GPUs in 75 TinyBox Pro machines. They were built in-house, which saves us money and ensures they suit our needs. Our self-built machines fail at a similar rate to pre-built machines we’ve bought, but we’re capable of fixing them ourselves quickly. They have 2 CPUs and 8 GPUs each, and work as both training machines and general compute workers.
For data storage we have a few racks of Dell machines (R630 and R730). They are filled with SSDs for a total of ~4PB of storage. We use SSDs for reliability and speed. Our main storage arrays have no redundancy and each node needs to be able to saturate the network bandwidth with random access reads. For the storage machines this means reading at up to 20Gbps from each 80TB chunk.
Other than storage and compute machines we have several one-off machines to run services. This includes a router, climate controller, data ingestion machine, storage master servers, metric servers, redis servers, and a few more.
Running the network requires switches, but at this scale we don’t need to bother with complicated switch topologies. We have three interconnected 100Gbps Z9264F switches, which serve as the main Ethernet network. We have two more InfiniBand switches to interconnect the two TinyBox Pro groups for training all-reduce.
To effectively use all these compute and storage machines you need some infra. At this scale, services don’t need redundancy to achieve 99% uptime. We use a single master for all services, which makes things pretty simple.
All servers get Ubuntu installed with PXE boot and are managed by Salt.
All of our storage arrays use mkv. The main array is 3PB of non-redundant storage hosting our driving data we train on. We can read from this array at ~1TB/s, which means we can train directly on the raw data without caching. Redundancy is not needed since no specific data is critical.
We have an additional ~300TB non-redundant array to cache intermediate processed results. And lastly, we have a redundant mkv storage array to store all of our trained models and training metrics. Each of these three arrays has a separate single master server.
We use Slurm to manage the compute nodes and compute jobs. We schedule two types of distributed compute: PyTorch training jobs and miniray workers.
To train models across multiple GPU nodes we use torch.distributed FSDP. We have two separate training partitions, each intra-connected with InfiniBand for training across machines. We wrote our own training framework which handles the training loop boilerplate, but it’s mostly just PyTorch.
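The post doesn’t include comma’s framework itself, but the core of a multi-node FSDP loop is small. Here is a minimal sketch with a toy model and random data standing in for the real driving models; launch it with torchrun so the rank and world-size environment variables are set.

```python
# Minimal sketch of multi-node training with torch.distributed FSDP, in the
# spirit of the setup described above. The toy model and random data are
# placeholders; launch with torchrun so the process group can initialize.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")  # NCCL over the InfiniBand fabric between nodes
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(
        torch.nn.Linear(512, 2048), torch.nn.ReLU(), torch.nn.Linear(2048, 10)
    ).cuda()
    model = FSDP(model)  # shard parameters, gradients, and optimizer state across ranks

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for step in range(100):
        x = torch.randn(32, 512, device="cuda")
        y = torch.randint(0, 10, (32,), device="cuda")
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if dist.get_rank() == 0 and step % 10 == 0:
            print(f"step {step} loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```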
We have a custom model experiment tracking service (similar to wandb or tensorboard). It provides a dashboard for tracking experiments, and shows custom metrics and reports. It is also the interface for the mkv storage array that hosts the model weights. The training runs store the model weights there with a uuid, and they are available to download for whoever needs to run them. The metrics and reports for our latest models are also open.
Besides training we have many other compute tasks. This can be anything from running tests, running models, pre-processing data, or even running agent rollouts for on-policy training. We wrote a lightweight open-source task scheduler called miniray that allows you to run arbitrary python code on idle machines. This is a simpler version of dask, with a focus on extreme simplicity. Slurm will schedule any idle machine to be an active miniray worker, and accept pending tasks. All the task information is hosted in a central redis server.
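miniray’s actual interface isn’t shown in the post, so the sketch below only illustrates the general pattern it describes: arbitrary Python callables queued through a central Redis server and executed by whatever idle workers Slurm has spun up. All names here are made up for illustration and are not miniray’s real API.

```python
# Illustrative sketch of a Redis-backed task queue in the spirit of miniray:
# callers enqueue pickled Python callables, idle workers pop and run them.
# Queue names and helpers are invented for this example.
import pickle
import redis

QUEUE = "tasks:pending"
r = redis.Redis(host="redis-master", port=6379)

def submit(fn, *args, **kwargs):
    """Enqueue a module-level callable for some idle worker to run."""
    r.rpush(QUEUE, pickle.dumps((fn, args, kwargs)))

def worker_loop():
    """Run on an idle machine (e.g. scheduled by Slurm): pop tasks forever."""
    while True:
        _, payload = r.blpop(QUEUE)          # block until a task is available
        fn, args, kwargs = pickle.loads(payload)
        try:
            result = fn(*args, **kwargs)
            r.rpush("tasks:done", pickle.dumps(result))
        except Exception as exc:             # record failures instead of crashing the worker
            r.rpush("tasks:failed", pickle.dumps(str(exc)))
```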
Miniray workers with GPUs will spin up a triton inference server to run model inference with dynamic batching. A miniray worker can thus easily and efficiently run any of the models hosted in the model mkv storage array.
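For the inference side, a worker might talk to its local Triton server with the standard HTTP client. The tritonclient package is real, but the model name, tensor names, and shapes below are placeholders; a real worker would read them from the model’s configuration.

```python
# Sketch of querying a local Triton inference server (which handles dynamic
# batching) from a worker process. Names and shapes are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

def run_model(frames: np.ndarray) -> np.ndarray:
    inp = httpclient.InferInput("input__0", list(frames.shape), "FP32")
    inp.set_data_from_numpy(frames.astype(np.float32))
    result = client.infer(model_name="driving_policy", inputs=[inp])
    return result.as_numpy("output__0")
```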
Miniray makes it extremely easy to scale parallel tasks to hundreds of machines. For example, the controls challenge record was set by just having ~1hr of access to our data center with miniray.
All our code is in a monorepo that we have cloned on our workstations. This monorepo is kept small (
The most complex thing we do at comma is train driving models on-policy: these training runs require training data to be generated during training by running simulated driving rollouts with the most recent model weights. Here’s a real-world command we just used to train such a model. This training run uses all of the infrastructure described above. While only this small command is needed to kick everything off, it orchestrates a lot of moving parts.
Does all this stuff sound exciting? Then build your own datacenter for yourself or your company! You can also come work here.
...
Read the original on blog.comma.ai »
Something strange is happening with Mac Minis. They’re selling out everywhere, and it’s not because people suddenly need more coffee table computers.
If you browse Reddit or HN, you’ll see the same pattern: people are buying Mac Minis specifically to run AI agents with computer use. They’re setting up headless machines whose sole job is to automate their workflows. OpenClaw—the open-source framework that lets you run Claude, GPT-5, or whatever model you want to actually control your computer—has become the killer app for Mac hardware. Not Final Cut. Not Logic. An AI agent that clicks buttons.
This is exactly what Apple Intelligence should have been.
Apple had everything: the hardware, the ecosystem, the reputation for “it just works.” They could have shipped an agentic AI that actually automated your computer instead of summarizing your notifications. Imagine if Siri could genuinely file your taxes, respond to emails, or manage your calendar by actually using your apps, not through some brittle API layer that breaks every update.
They could have charged $500 more per device and people would have paid it. The margins would have been obscene. And they would have won the AI race not by building the best model, but by being the only company that could ship an AI you’d actually trust with root access to your computer. That trust—built over decades—was their moat.
So why didn’t they?
Maybe they just didn’t see it. That sounds mundane, but it’s probably the most common reason companies miss opportunities. When you’re Apple, you’re thinking about chip design, manufacturing scale, and retail strategy. An open-source project letting AI agents control computers might not ping your radar until it’s already happening.
Or maybe they saw it and decided the risk wasn’t worth it. If you’re Apple, you don’t want your AI agent automatically buying things, posting on social media, or making irreversible decisions. The liability exposure would be enormous. Better to ship something safe and limited than something powerful and unpredictable.
But there’s another dynamic at play. Look at who’s about to get angry about OpenClaw-style automation: LinkedIn, Facebook, anyone with a walled garden and a careful API strategy. These services depend on friction. They want you to use their app, see their ads, stay in their ecosystem. An AI that can automate away that friction is an existential threat.
If Apple had built this, they’d be fighting Instagram over ToS violations by Tuesday. They’d be testifying in front of Congress about AI agents committing fraud. Every tech platform would be updating their terms to explicitly ban Apple Intelligence.
By letting some third party do it, Apple gets plausible deniability. They’re just selling hardware. Not their fault what people run on it. It’s the same strategy that made them billions in the App Store while maintaining they’re “not responsible for what developers do.”
But I think this is short-term thinking.
Here’s what people miss about moats: they compound. The reason Microsoft dominated PCs wasn’t just that they had the best OS. It’s that everyone built for Windows, which made Windows more valuable, which made more people build for Windows. Network effects.
If Apple owned the agent layer, they could have created the most defensible moat in tech. Because an AI agent gets better the more it knows about you. And Apple already has all your data, all your apps, all your devices. They could have built an agent that works across your iPhone, Mac, iPad, and Watch seamlessly—something no one else can do.
More importantly, they could have owned the API. Want your service to work with Apple Agent? You play by Apple’s rules. Suddenly Apple isn’t fighting with platforms—they’re the platform that platforms need to integrate with. It’s the App Store playbook all over again, but for the AI era.
The Mac Mini rush is a preview of this future. People want agents. They want automation. They want to pay for it. They’re literally buying extra computers just to run someone else’s AI on Apple’s hardware.
Apple is getting the hardware revenue but missing the platform revenue. That might look smart this quarter. But platform revenue is what built Apple into a $3 trillion company. And platforms are what create trillion-dollar moats.
I suspect ten years from now, people will look back at 2024-2025 as the moment Apple had a clear shot at owning the agent layer and chose not to take it. Not because they couldn’t build it—they obviously could—but because they were optimizing for this year’s legal risk instead of next decade’s platform power.
The people buying Mac Minis to run AI agents aren’t just early adopters. They’re showing Apple exactly what product they should have built. Whether Apple is paying attention is another question entirely.
...
Read the original on www.jakequist.com »
The US Central Intelligence Agency (CIA) has announced it will cease publishing the World Factbook, a free online resource used by millions around the globe.
Frequently cited by journalists and academics, the Factbook offered regularly updated statistics and information about countries and communities all over the world, in an easily understood and searchable format.
A statement on the CIA’s website did not include a reason for the decision, simply stating that the publication had “sunset” while encouraging readers to “stay curious about the world and find ways to explore it … in person or virtually”.
First launched during World War II as a classified internal program named JANIS (Joint Army Navy Intelligence Studies), the Factbook was originally commissioned as a way to standardise “basic intelligence” — fundamental and factual information about the world — across different agencies of the US government.
The program was taken over by the CIA in 1947 and renamed the National Intelligence Survey, before the Factbook was launched in 1971 as an annual summary of information.
An unclassified version was first made available to the public in 1975, and a digital version was published online in the 1990s, with the data freely available under public domain.
The website was particularly popular during the US school year, according to previous versions of the site, with traffic experiencing a noticeable drop-off during US summer months.
While no specific reason has been given for the Factbook’s closure, the Trump administration has made no secret of its intent to cut government programs it does not consider to be furthering the core purpose of its agencies and departments.
The administration offered buyouts to every CIA employee in February last year, and is reportedly planning to cut about 1,200 further jobs at the agency over the next several years.
The CIA has been contacted for comment.
...
Read the original on www.abc.net.au »
Spotlighting The World Factbook as We Bid a Fond Farewell (via) Somewhat devastating news today from CIA:
One of CIA’s oldest and most recognizable intelligence publications, The World Factbook, has sunset.
There’s not even a hint as to why they decided to stop maintaining this publication, which has been their most useful public-facing initiative since 1971 and a cornerstone of the public internet since 1997.
In a bizarre act of cultural vandalism they’ve not just removed the entire site (including the archives of previous versions) but they’ve also set every single page to be a 302 redirect to their closure announcement.
The Factbook has been released into the public domain since the start. There’s no reason not to continue to serve archived versions - a banner at the top of the page saying it’s no longer maintained would be much better than removing all of that valuable content entirely.
Up until 2020 the CIA published annual zip file archives of the entire site. Those are available (along with the rest of the Factbook) on the Internet Archive.
I downloaded the 384MB .zip file for the year 2020 and extracted it into a new GitHub repository, simonw/cia-world-factbook-2020. I’ve enabled GitHub Pages for that repository so you can browse the archived copy at simonw.github.io/cia-world-factbook-2020/.
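For anyone who wants to mirror one of the archived zips the same way, here is a minimal sketch of the download-and-extract step. The URL is a placeholder (the real archives live on the Internet Archive); the extracted directory can then be pushed to a GitHub Pages repository.

```python
# Sketch of mirroring an archived Factbook zip locally. The URL is a
# placeholder, not the real Internet Archive location.
import io
import zipfile
import urllib.request

ARCHIVE_URL = "https://example.org/factbook_2020.zip"  # placeholder

with urllib.request.urlopen(ARCHIVE_URL) as resp:
    data = resp.read()                      # ~384 MB for the 2020 snapshot

with zipfile.ZipFile(io.BytesIO(data)) as zf:
    zf.extractall("cia-world-factbook-2020")  # ready for git init / GitHub Pages
```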
Here’s a neat example of the editorial voice of the Factbook from the What’s New page, dated December 10th 2020:
Years of wrangling were brought to a close this week when officials from Nepal and China announced that they have agreed on the height of Mount Everest. The mountain sits on the border between Nepal and Tibet (in western China), and its height changed slightly following an earthquake in 2015. The new height of 8,848.86 meters is just under a meter higher than the old figure of 8,848 meters. The World Factbook rounds the new measurement to 8,849 meters and this new height has been entered throughout the Factbook database.
...
Read the original on simonwillison.net »
A few days ago, I wrote about why OpenClaw feels like a portal to the future, and why that future is scary in a very specific way.
The short version: agent gateways that act like OpenClaw are powerful because they have real access to your files, your tools, your browser, your terminals, and often a long-term “memory” file that captures how you think and what you’re building. That combination is exactly what modern infostealers are designed to exploit.
This post is the uncomfortable, “and then it happened” follow-up.
Because it’s not just that agents can be dangerous once they’re installed. The ecosystem that distributes their capabilities (skills and skill registries) has already become an attack surface.
If you are experimenting with OpenClaw, do not do it on a company device. Full stop.
In my first post, I described OpenClaw as a kind of Faustian bargain. It is compelling precisely because it has real access to your local machine, your apps, your browser sessions, your files, and often long-term memory. That same access means there isn’t yet a safe way to run it on a machine that holds corporate credentials or has access to production systems.
If you have already run OpenClaw on a work device, treat it as a potential incident and engage your security team immediately. Do not wait for symptoms. Pause work on that machine and follow your organization’s incident response process.
In the OpenClaw ecosystem, a “skill” is often a markdown file: a page of instructions that tells an agent how to do a specialized task. In practice, that markdown can include links, copy-and-paste commands, and tool call recipes.
That sounds harmless until you remember how humans, and agents, actually consume documentation:
Markdown isn’t “content” in an agent ecosystem. Markdown is an installer.
Some people assume the MCP layer makes this safer, because tools can be exposed through a structured interface, with explicit user consent and authorization controls depending on the host and server implementation.
But skills do not need to use MCP at all.
The Agent Skills specification places no restrictions on the markdown body, and skills can include whatever instructions will “help agents perform the task,” including copy and paste terminal commands. And skills can also bundle scripts alongside the markdown, which means execution can happen outside the MCP tool boundary entirely.
So if your security model is “MCP will gate tool calls,” you can still lose to a malicious skill that simply routes around MCP through social engineering, direct shell instructions, or bundled code. MCP can be part of a safe system, but it is not a safety guarantee by itself.
Just as importantly, this is not unique to OpenClaw. “Skills” are increasingly portable because many agents are adopting the open Agent Skills format, in which a skill is a folder centered on a SKILL.md file with metadata and freeform instructions, and it can also bundle scripts and other resources. Other implementations describe the same basic shape: a SKILL.md file plus optional scripts and assets. That means a malicious “skill” is not just an OpenClaw problem. It is a distribution mechanism that can travel across any agent ecosystem that supports the same standard.
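To make that shape concrete, here is a minimal sketch of what loading such a folder amounts to. The helper is illustrative only and not part of any real specification or SDK; the point is that the markdown body is unrestricted free text and executable scripts can ride along in the same folder.

```python
# Minimal sketch of what "a skill is a folder centered on a SKILL.md file"
# means in practice. Illustrative only, not any real spec or SDK.
from pathlib import Path

def load_skill(skill_dir: str) -> dict:
    folder = Path(skill_dir)
    instructions = (folder / "SKILL.md").read_text()            # freeform markdown, no restrictions
    bundled = [p.name for p in folder.iterdir()
               if p.is_file() and p.suffix in {".sh", ".py"}]    # code outside any MCP tool boundary
    return {"instructions": instructions, "bundled_scripts": bundled}

# Example: load_skill("skills/twitter") hands an agent both the prose
# instructions and whatever scripts the author chose to include.
```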
While browsing ClawHub (I won’t link it for obvious reasons), I noticed the top downloaded skill at the time was a “Twitter” skill. It looked normal: description, intended use, an overview, the kind of thing you’d expect to install without a second thought.
But the very first thing it did was introduce a “required dependency” named “openclaw-core,” along with platform-specific install steps. Those steps included convenient links (“here”, “this link”) that appeared to be normal documentation pointers.
Both links led to malicious infrastructure. The flow was classic staged delivery:
1. The skill’s overview told you to install a prerequisite, and the link led to a staging page designed to get the agent to run a command.
2. That command decoded an obfuscated payload and executed it.
3. The script downloaded and ran a binary, removing macOS quarantine attributes so that Gatekeeper, macOS’s built-in anti-malware check, would not scan it.
I’m intentionally not pasting the exact commands or URLs here. The mechanics are unfortunately straightforward, and repeating them helps attackers more than it helps defenders. The key point is that this was not “a suspicious link.” This was a complete execution chain disguised as setup instructions.
I downloaded the final binary safely and submitted it for analysis.
The verdict was not ambiguous. It was flagged as macOS infostealing malware.
This is the type of malware that doesn’t just “infect your computer.” It raids everything valuable on that device:
* Anything else that can be turned into an account takeover
If you’re the kind of person installing agent skills, you are exactly the kind of person whose machine is worth stealing from.
After I shared this internally, further reports surfaced, putting the scale into focus: hundreds of OpenClaw skills were reportedly involved in distributing macOS malware via ClickFix-style instructions.
That detail matters because it confirms what this really is.
A deliberate strategy: use “skills” as the distribution channel, and “prerequisites” as the social engineering wrapper.
We’ve spent years learning that package managers and open-source registries can become supply chain attack vectors.
Agent skill registries are the next chapter, except that the “package” is documentation.
And that makes the attack path even smoother:
* And in agent ecosystems, the line between reading instructions and executing them collapses.
Even if an agent can’t run shell commands directly, it can still do something dangerous: it can normalize risky behavior.
It can confidently summarize a malicious prerequisite as “the standard install step.” It can encourage you to paste a one-liner. It can reduce hesitation.
And if your agent can execute local commands, then a malicious skill isn’t “bad content.” It’s remote execution wrapped in friendly docs.
Do not run this on a company device. There isn’t a safe way to do it. If you already did, or you ran any “install” commands from a skill, engage your security team immediately and treat it as a potential compromise.
* Stop using the device for sensitive work.
If you experiment anyway, use an isolated machine with no corporate access and no saved credentials.
You are operating an app store. Assume it will be abused.
* Put warnings and friction on external links and install steps.
* Use permissions that are specific, time-bound, and revocable.
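As one concrete form that friction could take, here is a minimal sketch of a pre-publication check a registry might run over a skill’s markdown and bundled scripts, flagging the patterns described in this campaign. The heuristics are illustrative examples, not a complete or sufficient detection rule.

```python
# Illustrative pre-publication check for a skill registry. Flags the patterns
# described above: piped installs, decoded payloads, and removal of the macOS
# quarantine attribute. These heuristics are examples, not a real product.
import re
from pathlib import Path

SUSPICIOUS = [
    (re.compile(r"curl[^\n|]*\|\s*(ba)?sh"), "pipes a download straight into a shell"),
    (re.compile(r"base64\s+(-d|--decode)"), "decodes an embedded payload"),
    (re.compile(r"xattr\s+-d\s+com\.apple\.quarantine"), "strips the macOS quarantine attribute"),
    (re.compile(r"https?://\S+\.(sh|pkg|dmg|bin)\b"), "links directly to an executable artifact"),
]

def scan_skill(skill_dir: str) -> list[str]:
    findings = []
    for path in Path(skill_dir).rglob("*"):
        if not path.is_file() or path.suffix not in {".md", ".sh", ".py", ".txt"}:
            continue
        text = path.read_text(errors="ignore")
        for pattern, reason in SUSPICIOUS:
            if pattern.search(text):
                findings.append(f"{path}: {reason}")
    return findings

if __name__ == "__main__":
    for finding in scan_skill("skills/example-skill"):
        print("WARN", finding)
```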
This is the clearest proof yet of the point I made in my earlier post. OpenClaw is powerful because it collapses the distance between intent and execution. That is the magic. It also introduces significant risk. When capabilities are distributed as skills and installed via documentation, the registry becomes a supply chain, and the easiest install path becomes the attacker’s favorite path.
The answer is not to stop building agents. The answer is to build the missing trust layer around them. Skills need provenance. Execution needs mediation. Permissions need to be specific, revocable, and continuously enforced, not granted once and forgotten. If agents are going to act on our behalf, credentials and sensitive actions cannot be “grabbed” by whatever code happens to run. They need to be brokered, governed, and audited in real time.
This is exactly why we need that next layer: when “skills” become the supply chain, the only safe future is one in which every agent has its own identity and has the minimum authority it needs right now, with access that is time-bound, revocable, and attributable.
...
Read the original on 1password.com »
Immigration and Customs Enforcement (ICE) is surveying the commercial advertising technology market for tools capable of supplying location data and large-scale analytics to federal investigators, according to a recent Request for Information (RFI).
Framed as market research rather than a procurement, the RFI seeks information from companies offering “Ad Tech compliant and location data services” that could support criminal, civil, and administrative investigations across ICE’s mission set.
The RFI, issued by ICE’s Homeland Security Investigations (HSI), emphasizes that the government is not soliciting proposals or committing to a future contract, but it does signal active interest in selecting vendors for live demonstrations of operational platforms and data services, a step that typically precedes pilot deployments or integration into existing investigative environments.
ICE says it is attempting to better understand how commercial big data providers and advertising technology firms might directly support investigative activities, while remaining sensitive to “regulatory constraints and privacy expectations.”
The agency noted that its components are handling increasing volumes of criminal, civil, and administrative information from both internal and external sources and are assessing whether commercial off-the-shelf platforms comparable to large investigative data and legal analytics providers can help manage and exploit that data at scale.
At the center of the inquiry is a category of information traditionally associated with digital advertising rather than law enforcement: location data, device identifiers, IP intelligence, and behavioral signals derived from everyday consumer activity.
Advertising technology, commonly referred to as ad tech, is the sprawling ecosystem of software, data brokers, analytics platforms, and intermediaries that power targeted advertising on the modern Internet.
Ad tech companies collect and process information about where devices are located, how users move between physical and digital spaces, which apps are installed on their phones, and how devices can be linked across websites, applications, and networks.
While the industry typically frames this activity as anonymous or pseudonymous, the underlying data is often persistent, granular, and capable of tracking individuals over time.
Location data is a particularly valuable component of that ecosystem. Mobile applications routinely share latitude and longitude coordinates with advertising partners through embedded software development kits.
Even when precise GPS data is not available, companies infer location through IP addresses, Wi-Fi networks, Bluetooth beacons, and cell tower connections. That information is then aggregated, analyzed, and sold to advertisers seeking to measure foot traffic, target audiences, or assess the effectiveness of campaigns.
ICE’s RFI suggests that the agency is exploring whether those same mechanisms can be repurposed as investigative tools.
The document asks vendors to describe platforms and data services that can support investigative needs while remaining “Ad Tech compliant,” a phrase that reflects industry norms rather than statutory law enforcement standards.
ICE appears to be looking into tapping into the commercial data ecosystem rather than building bespoke surveillance tools from scratch, a strategy that allows agencies to access rich data streams without directly collecting the information themselves.
ICE’s interest is not limited to raw data. The RFI repeatedly references “operational platforms,” signaling a desire for systems that can ingest, correlate, analyze, and visualize information from multiple sources.
In practice, that means software environments capable of fusing location data with other records, such as criminal histories, financial data, travel records, social media activity, or administrative files, to generate investigative leads or support ongoing cases.
The agency frames its inquiry as exploratory and cautious. It notes that the government is seeking to understand the “current state” of ad tech and location data services available to federal investigative entities, particularly considering regulatory constraints and privacy expectations.
That language reflects growing scrutiny of commercial data practices by courts, regulators, and civil liberties advocates, especially when such data is accessed by federal agencies like ICE.
In recent years, federal agencies have increasingly relied on commercially available data to sidestep traditional legal barriers.
Because ad tech data is collected by private companies under consumer-facing privacy policies, agencies have argued that purchasing or accessing that data does not constitute a search under the Fourth Amendment.
Critics counter that this approach allows the government to obtain highly sensitive information, including detailed location histories, without warrants, probable cause, or meaningful oversight.
The U.S. Supreme Court has signaled skepticism of such practices in cases recognizing the sensitivity of long-term location tracking, even when data is held by third parties.
At the same time, regulators have brought enforcement actions against data brokers accused of selling sensitive location information without adequate safeguards.
Against that backdrop, ICE’s assertion that it is considering privacy expectations appears designed to reassure both policymakers and potential vendors that the agency is aware of the controversy surrounding commercial surveillance data.
Yet the RFI itself provides little detail about how those concerns would be operationalized. It does not reference warrants, court orders, or judicial authorization.
Nor does it explain how ICE would distinguish between data associated with U.S. persons and noncitizens, how long information would be retained, or whether data obtained for one investigative purpose could be reused for others.
That ambiguity is particularly significant given HSI’s broad mandate. Unlike agencies focused solely on criminal enforcement, HSI conducts civil and administrative investigations alongside criminal cases.
Location data or ad tech-derived insights could therefore be used in contexts ranging from immigration enforcement to customs violations to sanctions and export control investigations, often under lower legal thresholds than those required in criminal proceedings.
ICE’s emphasis on “Ad Tech compliant” services also underscores a fundamental tension. Compliance in the advertising industry typically refers to adherence to self-regulatory frameworks, contractual obligations, and privacy policies that permit extensive data collection so long as certain disclosures are made.
Those standards are not designed to constrain government use, nor do they substitute for constitutional or statutory protections governing law enforcement surveillance.
Companies marketing “privacy-friendly” location or IP intelligence tools often argue that they avoid directly identifying individuals. But researchers and regulators have repeatedly demonstrated that supposedly anonymized or aggregated data can be reidentified when combined with other datasets.
In an investigative context, reidentification is not a bug but a feature, enabling analysts to link digital signals back to real-world subjects.
Biometric Update earlier reported that a Government Accountability Office audit had found that publicly accessible data — from social media posts to commercial geolocation records — can be aggregated into detailed “digital profiles” that expose U.S. personnel, military operations, and senior leaders to targeting, coercion, and disruption.
In January 2025, Gravy Analytics, a prominent location data broker, disclosed that a significant data breach had potentially exposed through de-anonymization the precise location information of millions of individuals.
The RFI’s focus on live demonstrations suggests that ICE is interested in mature, deployable capabilities rather than theoretical offerings. Vendors selected to present would be expected to show how their platforms operate in practice, how data is accessed and analyzed, and how investigative outputs are generated.
While the agency stresses that it is not committing to a future solicitation, such demonstrations often inform subsequent procurements, task orders, or pilot programs conducted under existing contracts.
ICE has used similar market research approaches in the past to normalize new surveillance capabilities before formal adoption.
Social media monitoring tools, mobile biometric systems, and large-scale analytics platforms were all introduced through incremental steps that began with RFIs and demonstrations rather than headline-grabbing contracts.
For privacy advocates, the latest filing fits a familiar pattern. Commercial surveillance markets evolve rapidly, driven by advertising and marketing demand. Government agencies then adopt those tools after the fact, often before lawmakers have fully grappled with the implications.
Oversight mechanisms, however, lag technical capability, leaving key questions unanswered until after systems are already in use.
ICE’s RFI does not indicate when demonstrations might occur or whether a solicitation will follow. It does make clear, though, that the agency sees the ad tech ecosystem as a potential investigative resource worth serious consideration.
As debates over commercial data, surveillance, and constitutional protections continue, the filing offers a window into how federal law enforcement is adapting to — and seeking to leverage — a data economy built for advertising rather than accountability.
For now, ICE is asking industry to explain how ad tech-derived location and analytics services can be made suitable for investigative use while respecting privacy expectations.
What remains unclear is who will define those expectations, how they will be enforced, and whether existing legal frameworks are equipped to govern a surveillance model that blurs the line between consumer marketing and government intelligence.
...
Read the original on www.biometricupdate.com »
Written by Nicholas Carlini, a researcher on our Safeguards team.
I’ve been experimenting with a new approach to supervising language models that we’re calling “agent teams.” With agent teams, multiple Claude instances work in parallel on a shared codebase without active human intervention. This approach dramatically expands the scope of what’s achievable with LLM agents. To stress test it, I tasked 16 agents with writing a Rust-based C compiler, from scratch, capable of compiling the Linux kernel. Over nearly 2,000 Claude Code sessions and $20,000 in API costs, the agent team produced a 100,000-line compiler that can build Linux 6.9 on x86, ARM, and RISC-V. The compiler is an interesting artifact on its own, but I focus here on what I learned about designing harnesses for long-running autonomous agent teams: how to write tests that keep agents on track without human oversight, how to structure work so multiple agents can make progress in parallel, and where this approach hits its ceiling.

Existing agent scaffolds like Claude Code require an operator to be online and available to work jointly. If you ask for a solution to a long and complex problem, the model may solve part of it, but eventually it will stop and wait for continued input—a question, a status update, or a request for clarification.

To elicit sustained, autonomous progress, I built a harness that sticks Claude in a simple loop (if you’ve seen Ralph-loop, this should look familiar). When it finishes one task, it immediately picks up the next. (Run this in a container, not your actual machine).
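The original command isn’t reproduced here, but the harness amounts to re-launching a fresh, non-interactive Claude Code session whenever the previous one exits. A minimal sketch, assuming a prompt file and the CLI’s print mode (the exact flags used in the experiment aren’t shown):

```python
# Rough sketch of the "infinite loop" harness: re-launch a non-interactive
# Claude Code session each time the previous one exits. The prompt file name
# and CLI flags are illustrative assumptions, not the exact command used.
# Run this inside a container, never on a host machine.
import subprocess
import time

PROMPT = open("agent_prompt.md").read()  # task description + working conventions

while True:
    subprocess.run(
        ["claude", "-p", PROMPT],        # one non-interactive Claude Code session
        cwd="/workspace",                # the agent's local clone of the shared repo
        check=False,                     # a failed session just triggers another loop
    )
    time.sleep(5)                        # brief pause before spawning the next session
```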
In the agent prompt, I tell Claude what problem to solve and ask it to approach the problem by breaking it into small pieces, tracking what it’s working on, figuring out what to work on next, and to effectively keep going until it’s perfect. (On this last point, Claude has no choice. The loop runs forever—although in one instance, I did see Claude pkill -9 bash on accident, thus killing itself and ending the loop. Whoops!).

Running multiple instances in parallel can address two weaknesses of a single-agent harness:

* One Claude Code session can only do one thing at a time. Especially as the scope of a project expands, debugging multiple issues in parallel is far more efficient.
* Running multiple Claude agents allows for specialization. While a few agents are tasked to solve the actual problem at hand, other specialized agents can be invoked to (for example) maintain documentation, keep an eye on code quality, or solve specialized sub-tasks.

My implementation of parallel Claude is bare-bones. A new bare git repo is created, and for each agent, a Docker container is spun up with the repo mounted to /upstream. Each agent clones a local copy to /workspace, and when it’s done, pushes from its own local container to upstream.

To prevent two agents from trying to solve the same problem at the same time, the harness uses a simple synchronization algorithm:

1. Claude takes a “lock” on a task by writing a text file to current_tasks/ (e.g., one agent might lock current_tasks/parse_if_statement.txt, while another locks current_tasks/codegen_function_definition.txt). If two agents try to claim the same task, git’s synchronization forces the second agent to pick a different one.
2. Claude works on the task, then pulls from upstream, merges changes from other agents, pushes its changes, and removes the lock. Merge conflicts are frequent, but Claude is smart enough to figure that out.
3. The infinite agent-generation-loop spawns a new Claude Code session in a fresh container, and the cycle repeats.

This is a very early research prototype. I haven’t yet implemented any other method for communication between agents, nor do I enforce any process for managing high-level goals. I don’t use an orchestration agent. Instead, I leave it up to each Claude agent to decide how to act. In most cases, Claude picks up the “next most obvious” problem. When stuck on a bug, Claude will often maintain a running doc of failed approaches and remaining tasks. In the git repository of the project, you can read through the history and watch it take out locks on various tasks.

The scaffolding runs Claude in a loop, but that loop is only useful if Claude can tell how to make progress. Most of my effort went into designing the environment around Claude—the tests, the environment, the feedback—so that it could orient itself without me. These are the approaches I’ve found most helpful when orchestrating multiple Claude instances.

Claude will work autonomously to solve whatever problem I give it. So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem. Improving the testing harness required finding high-quality compiler test suites, writing verifiers and build scripts for open-source software packages, and watching for mistakes Claude was making, then designing new tests as I identified those failure modes.

For example, near the end of the project, Claude started to frequently break existing functionality each time it implemented a new feature.
To address this, I built a continuous integration pipeline and implemented stricter enforcement that allowed Claude to better test its work so that new commits can’t break existing code.

I had to constantly remind myself that I was writing this test harness for Claude and not for myself, which meant rethinking many of my assumptions about how tests should communicate results.

For example, each agent is dropped into a fresh container with no context and will spend significant time orienting itself, especially on large projects. Before we even reach the tests, to help Claude help itself, I included instructions to maintain extensive READMEs and progress files that should be updated frequently with the current status.

I also kept in mind the fact that language models have inherent limitations, which, in this case, needed to be designed around. These include:

* Context window pollution: The test harness should not print thousands of useless bytes. At most, it should print a few lines of output and log all important information to a file so Claude can find it when needed. Logfiles should be easy to process automatically: if there are errors, Claude should write ERROR and put the reason on the same line so grep will find it. It helps to pre-compute aggregate summary statistics so Claude doesn’t have to recompute them.
* Time blindness: Claude can’t tell time and, left alone, will happily spend hours running tests instead of making progress. The harness prints incremental progress infrequently (to avoid polluting context) and includes a default --fast option that runs a 1% or 10% random sample. This subsample is deterministic per-agent but random across VMs, so Claude still covers all files but each agent can perfectly identify regressions.

When there are many distinct failing tests, parallelization is trivial: each agent picks a different failing test to work on. After the test suite reached a 99% pass rate, each agent worked on getting a different small open-source project (e.g., SQLite, Redis, libjpeg, MQuickJS, Lua) to compile.

But when agents started to compile the Linux kernel, they got stuck. Unlike a test suite with hundreds of independent tests, compiling the Linux kernel is one giant task. Every agent would hit the same bug, fix that bug, and then overwrite each other’s changes. Having 16 agents running didn’t help because each was stuck solving the same task.

The fix was to use GCC as an online known-good compiler oracle to compare against. I wrote a new test harness that randomly compiled most of the kernel using GCC, and only the remaining files with Claude’s C Compiler. If the kernel worked, then the problem wasn’t in Claude’s subset of the files. If it broke, then it could further refine by re-compiling some of these files with GCC. This let each agent work in parallel, fixing different bugs in different files, until Claude’s compiler could eventually compile all files. (After this worked, it was still necessary to apply delta debugging techniques to find pairs of files that failed together but worked independently.)

Parallelism also enables specialization. LLM-written code frequently re-implements existing functionality, so I tasked one agent with coalescing any duplicate code it found. I put another in charge of improving the performance of the compiler itself, and a third I made responsible for outputting efficient compiled code.
When there are many distinct failing tests, parallelization is trivial: each agent picks a different failing test to work on. After the test suite reached a 99% pass rate, each agent worked on getting a different small open-source project (e.g., SQLite, Redis, libjpeg, MQuickJS, Lua) to compile.

But when agents started to compile the Linux kernel, they got stuck. Unlike a test suite with hundreds of independent tests, compiling the Linux kernel is one giant task. Every agent would hit the same bug, fix that bug, and then overwrite each other's changes. Having 16 agents running didn't help, because each was stuck solving the same task.

The fix was to use GCC as an online known-good compiler oracle to compare against. I wrote a new test harness that randomly compiled most of the kernel using GCC, and only the remaining files with Claude's C Compiler. If the kernel worked, then the problem wasn't in Claude's subset of the files. If it broke, the harness could refine further by re-compiling some of those files with GCC. This let each agent work in parallel, fixing different bugs in different files, until Claude's compiler could eventually compile all files. (After this worked, it was still necessary to apply delta debugging techniques to find pairs of files that failed together but worked independently.)
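The bisection idea is simple enough to sketch. The following is a hypothetical illustration of the oracle approach, not the author's harness: build_ok stands in for "compile these files with Claude's compiler, everything else with GCC, and check whether the kernel still works."

```python
import random

def find_bad_file(all_files, build_ok, seed=0):
    """Narrow down one file that Claude's compiler miscompiles, using GCC as the oracle.

    `build_ok(files)` is a placeholder: compile `files` with Claude's compiler,
    everything else with GCC, and report whether the resulting kernel still works.
    Each agent uses a different seed, so the agents chase bugs in different subsets.
    """
    rng = random.Random(seed)
    suspects = rng.sample(all_files, k=max(1, len(all_files) // 10))
    if build_ok(suspects):
        return None                       # the bug is not in this agent's subset
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        # If compiling only `half` with Claude's compiler still breaks the build,
        # the offending file is in `half`; otherwise look in the other half.
        suspects = half if not build_ok(half) else suspects[len(half):]
    return suspects[0]                    # a single file that breaks the build
```

A per-agent seed keeps different agents working on different subsets; the delta-debugging pass mentioned above is still needed for failures that only appear when particular pairs of files are compiled together, which simple bisection like this would miss.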
Parallelism also enables specialization. LLM-written code frequently re-implements existing functionality, so I tasked one agent with coalescing any duplicate code it found. I put another in charge of improving the performance of the compiler itself, and made a third responsible for outputting efficient compiled code. I asked another agent to critique the design of the project from the perspective of a Rust developer and make structural changes to improve the overall code quality, and another to work on documentation.

This project was designed as a capability benchmark. I am interested in stress-testing the limits of what LLMs can just barely achieve today in order to help us prepare for what models will reliably achieve in the future.

I've been using the C Compiler project as a benchmark across the entire Claude 4 model series. As I did with prior projects, I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes), I did not go into any detail on how to do so.

Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold: it could produce a functional compiler that passed large test suites, but it was still incapable of compiling any real, large projects. My goal with Opus 4.6 was to again test the limits.

Over nearly 2,000 Claude Code sessions across two weeks, Opus 4.6 consumed 2 billion input tokens and generated 140 million output tokens, at a total cost just under $20,000. Compared to even the most expensive Claude Max plans, this was an extremely expensive project. But that total is a fraction of what it would cost me to produce this myself—let alone an entire team.

This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build a bootable Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQLite, Postgres, and Redis, and it has a 99% pass rate on most compiler test suites, including the GCC torture test suite. It also passes the developer's ultimate litmus test: it can compile and run Doom.

The compiler, however, is not without limitations. These include:

- It lacks the 16-bit x86 compiler that is necessary to boot Linux out of real mode. For this, it calls out to GCC (the x86_32 and x86_64 compilers are its own).
- It does not have its own assembler and linker; these are the very last bits that Claude started automating, and they are still somewhat buggy. The demo video was produced with GCC's assembler and linker.
- The compiler successfully builds many projects, but not all. It's not yet a drop-in replacement for a real compiler.
- The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
- The Rust code quality is reasonable, but nowhere near what an expert Rust programmer would produce.

The resulting compiler has nearly reached the limits of Opus's abilities. I tried (hard!) to fix several of the above limitations but wasn't fully successful. New features and bugfixes frequently broke existing functionality.

As one particularly challenging example, Opus was unable to implement the 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60KB, far exceeding the 32KB code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase. (This is only the case for x86; for ARM and RISC-V, Claude's compiler can compile everything by itself.)

The source code for the compiler is available. Download it, read through the code, and try it on your favorite C projects. I've consistently found that the best way to understand what language models can do is to push them to their limits, and then study where they start to break down. Over the coming days, I'll continue having Claude push new changes, if you want to follow along with Claude's continued attempts at addressing these limitations.

Each generation of language models opens up new ways of working with them. Early models were useful for tab-completion in IDEs. Before long, models could complete a function body from its docstring. The launch of Claude Code brought agents into the mainstream and enabled developers to pair-program with Claude. But each of these products operates under the assumption that a user defines a task, an LLM runs for a few seconds or minutes and returns an answer, and then the user provides a follow-up.

Agent teams show the possibility of implementing entire, complex projects autonomously. This allows us, as users of these tools, to become more ambitious with our goals.

We are still early, and fully autonomous development comes with real risks. When a human sits with Claude during development, they can ensure consistent quality and catch errors in real time. With autonomous systems, it is easy to see tests pass and assume the job is done, when this is rarely the case. I used to work in penetration testing, exploiting vulnerabilities in products produced by large companies, and the thought of programmers deploying software they've never personally verified is a real concern.

So, while this experiment excites me, it also leaves me feeling uneasy. Building this compiler has been some of the most fun I've had recently, but I did not expect this to be anywhere near possible so early in 2026. The rapid progress in both language models and the scaffolds we use to interact with them opens the door to writing an enormous amount of new code. I expect the positive applications to outweigh the negative, but we're entering a new world which will require new strategies to navigate safely.

Special thanks to Josef Bacik, Edwin Chen, Bernardo Meurer Costa, Jake Eaton, Dan Kelley, Felix Klock, Jannet Park, Steve Weis, and many other people across Anthropic for their assistance and contributions.
...
Read the original on www.anthropic.com »
Although age bias is still the norm, the value-add of longtime, experienced workers is beginning to take shape.
On the outskirts of Macclesfield, in northwest England, a branch of the UK home-improvement retailer B&Q quietly overturned one of corporate life’s most persistent assumptions. Faced with high staff turnover and uneven customer satisfaction, the company tried a simple experiment: In 1989, it staffed the store largely with older workers.
The results were striking, according to one study. Profits rose 18 percent. Staff turnover fell to a fraction of the company average. Absenteeism dropped sharply. An experiment that started more than 30 years ago reshaped how the retailer approached age inclusiveness and led B&Q to open training to all ages and feature older workers in advertising, treating experience as an advantage rather than a cost.
In 2007, BMW began implementing 70 ergonomic, low-cost improvements in a specialized assembly line in Dingolfing, Germany, to provide better conditions for its many older and middle-aged workers. Key changes included adjustable-height workstations, improved lighting and specialized stools, resulting in a 7 percent productivity increase.
Evidence suggests that similar age-performance dynamics are not limited to the quirks of retail or to the factory floor and are increasingly relevant as declining birth rates and artificial intelligence investments reduce the inflow of entry-level workers. A white paper from Bank of America’s Workplace Benefits group argues that recruiting and retaining older workers is becoming increasingly important as populations age, framing age-inclusive benefits not as accommodation, but as a driver of organizational performance, especially for roles where judgment, experience and decision quality matter most.
“The retention of these older workers is an idea that is becoming much more well-received,” says Cynthia Hutchins, Bank of America’s inaugural director of financial gerontology. Hutchins has been involved in implementing a workforce longevity policy that includes hybrid schedules, financial planning benefits, menopause support, grandparents’ leave and sabbaticals. “It’s almost a business imperative to institute those types of benefits” to retain older workers and attract younger ones, adds Hutchins.
Yet initiatives such as these are rarely framed as strategy or as signals of a deeper shift. Most corporations continue to design careers as if effectiveness peaks early — as if speed, stamina and innovation belong exclusively to the young. If experience improves outcomes, why are so many organizations structured to push people out just as their value peaks?
At the heart of corporate resistance lies a fundamental disagreement about value. Moody’s Analytics chief economist Mark Zandi framed the debate in Aging and the Productivity Puzzle, a 2018 analysis delineating two schools of thought. The “albatross theory” holds that workers above the age of 65 drag down productivity due to resistance to change and outdated skills. The “wise man theory” tells a different story: of workers who possess judgment, institutional knowledge, emotional intelligence and expertise that younger employees cannot replicate.
Zandi and his colleagues analyzed state-level ADP data in the U.S. and concluded that post-retirement-age workers slowed wage growth and productivity, largely because they tend to be averse to adopting new technologies. Yet several major institutions reject the idea that older workers are a productivity “albatross” — and most look at the effects, not of those above the age of 65, but of the 50-plus workforce, often the first in line for layoffs.
More recent research from AARP and the OECD shows that firms with more 50-plus workers are more productive, not less: a 10-percentage-point increase in older workers is associated with roughly 1.1 percent higher productivity. The 2020 OECD analysis also finds that age-balanced firms benefit from lower turnover and stronger team performance, driven by experience and knowledge sharing rather than technology resistance. Similarly, a 2022 study from Boston Consulting Group found that cross-generational teams outperform homogeneous ones when older workers’ judgment and mentoring are combined with younger workers’ digital skills. A 2022 meta-analysis also pushes back against the idea that older workers are less effective, finding that teams perform better when members have long tenure at the company, irrespective of workers’ ages.
Still, Zandi says that the value of older workers may depend on how AI in the workplace unfolds and what impact it has on productivity growth. “If AI turns out to be a bust or doesn’t live up to expectations, and you have other demographic forces that are restraining labor growth, then I think older workers should fare well,” Zandi says. He notes that so far, older workers have “navigated things reasonably gracefully,” while younger workers and mid-level managers are so far taking the brunt of AI-related impacts.
Population aging is often treated as a future problem, something to be managed later with technology or policy tweaks. In reality, it is already reshaping labor markets in the U.S. and across advanced economies. Birth rates are lower, people are living longer and the share of workers above the age of 50 is rising steadily. This is not a forecast. It is arithmetic.
Yet organizational assumptions about performance have not kept pace. Modern careers are still built around the idea that effectiveness peaks early. Recent research challenges that view. A 2025 study in the journal Intelligence, analyzing age trajectories across 16 cognitive, emotional and personality dimensions, finds that while processing speed does decline after early adulthood, many of the capabilities most relevant to complex work continue to improve well into midlife. When these traits are combined into a composite measure of overall functioning, performance peaks between ages 55 and 60.
But if proficiency increasingly peaks in late midlife, then why are so many careers ending before they can be fully realized? Across advanced economies, there appears to be a persistent pattern of early exits that are less about individual choice than organizational design.
In the U.S., an Urban Institute analysis of survey data on older workers from 1992 to 2016 showed that more than half of those above the age of 50 were pushed out of long-held jobs before they chose to retire, often through layoffs or restructuring rather than performance issues. The 2018 study — along with reporting from ProPublica — found that few ever regained comparable pay or responsibility, and that hiring practices reinforced the trend.
Bill Greene, a longtime business consultant, is an exception to this layoff trend. Hired at 64 as principal of Mind Share Partners, a nonprofit in San Francisco, he advises companies on the importance of creating mentally healthy environments, and cautions that the workplace is a minefield of biases — and that ageism cuts both ways for older workers and younger workers.
Greene advises employers to be aware of the blind spots and inconsistencies. In the technology industry, he says, “it’s widely perceived that if you are 45 years old or over, you are a dinosaur,” yet in politics, “you can be 70, 75, 80, 85, and apparently that’s OK.”
Experience helps in an emergency. When the Covid-19 pandemic struck in 2020, Greene was consulting for a financial services firm, and he saw firsthand how worried his client was that younger employees were going to panic and quit because they hadn’t been through a crisis of that magnitude before.
“They realized that they had to coach their younger employees,” he says, comparing the pandemic to the 2008 financial crash to help the client’s staff understand the risks and path forward. “That kind of wisdom and experience can come with more depth of understanding and perspective from an older employee than from a younger one,” he says.
Although several Fortune 500 companies have advertised their interest in hiring and retaining older workers, corporate commitments remain tentative and small-scale. UK-based Unilever launched its U-Work program in 2019, and now offers employees in nine countries a hybrid between traditional employment and gig work: a monthly retainer, benefits and freedom to choose which projects they work on and when. Workers can scale back hours, pursue other interests or transition gradually toward retirement.
The program is innovative and, by all accounts, successful. Half of participants are above the age of 50. But only 140 employees out of Unilever’s 150,000-strong global workforce participate. This raises a question: Are these strategies of genuine transformation or sophisticated public relations?
Three converging forces make the case for urgency. First, premature exit creates value leakage. The fact that more than half of U.S. workers above the age of 50 leave long-held jobs for reasons unrelated to performance and before they choose to retire is a systemic design failure.
Second, the demand-side blind spot. Globally, spending by people above the age of 55 is projected to approach $15 trillion annually by the end of this decade, making older consumers one of the largest and fastest-growing sources of demand in the world economy. Yet many companies treat older customers as peripheral.
There are exceptions. Alan Patricof, now 91 and still investing, launched Primetime Partners at 85 after observing that venture capital remained focused on millennials, despite obvious unmet demand among older adults. His fund has invested in more than 35 companies serving what he calls the “ageless market.” Consumer brands are adapting, too — L’Oréal has repositioned itself around longevity and healthy aging, treating later life as aspiration rather than decline.
The silver economy is not a niche. It is one of the largest and least contested growth opportunities of the next decade — and one that many firms still underestimate.
Third, longer working lives are inevitable. In Europe and the UK, effective retirement ages have been climbing, driven in part by financial need and policy changes. Meanwhile, in the U.S., the shift from defined-benefit to defined-contribution retirement plans incentivizes workers to remain employed longer. Organizations that fail to retain experienced talent will face labor shortages, while competitors benefit from workers who bring judgment, stability and institutional memory.
The mismatch between demographic reality and corporate behavior is beginning to register with long-term investors. Large asset managers increasingly frame longevity as a structural economic force with implications for growth, productivity and risk.
A Vanguard study, The Economics of a Graying World, highlights aging and slower labor-force growth as a persistent drag on economic expansion, arguing that longer working lives are one of the few viable adjustment mechanisms. From this perspective, workforce age policy becomes financially material, not optional.
Economist Andrew J. Scott of the London Business School argues in his 2024 book The Longevity Imperative that if societies see longevity primarily as an “aging problem” of more pensioners, higher health costs and fewer workers, longer lives risk becoming a fiscal drag. But if they invest in health, skills and age‑inclusive work, longevity can instead raise growth, employment and innovation.
One hurdle to this shift in perspective is an ongoing lack of transparency and accountability by employers. Ageism in hiring, promotion and redundancy remains widespread, yet unlike gender or ethnicity, workforce age is rarely disclosed or scrutinized. The result is a growing governance gap. Misalignment with demographic reality creates execution risk — in talent, productivity and growth.
The case for a longevity strategy is ultimately an economic one. When organizations push experienced workers out early, they forfeit peak judgment, execution capability and mentoring capacity. When they underinvest in older consumers, they leave vast pools of demand underserved. Value is forfeited on both sides of the business.
In meeting their responsibility for long-term risk and growth, companies should begin with clarity. Map the age profile of the workforce by role and seniority. Identify where people in their fifties and early sixties are exiting — and whether those exits reflect performance or design. Treat age as a strategic variable in the same way firms now treat gender, skills or succession risk.
From there, redesign follows. Build roles and career paths that assume longer working lives. Invest in mid- and late-career reskilling, not as remediation but as renewal. Structure intergenerational teams deliberately, so experience and speed compound rather than collide. Align product, service and brand strategy with the realities of an aging, wealthier customer base.
None of this is about altruism. It is about reclaiming value currently being left on the table. As populations age, companies that learn to retain experience and serve longevity-driven demand will not just adapt — they will outperform.
Annie Coleman is Founder of RealiseLongevity, a consulting firm based in the UK, and is a Stanford Center on Longevity Ambassador.
...
Read the original on longevity.stanford.edu »