10 interesting stories served every morning and every evening.
I Measured Claude 4.7’s New Tokenizer. Here’s What It Costs You.

The docs said 1.0–1.35x more tokens. On real content, I measured 1.47x.

Anthropic’s Claude Opus 4.7 migration guide says the new tokenizer uses “roughly 1.0 to 1.35x as many tokens” as 4.6. I measured 1.47x on technical docs and 1.45x on a real CLAUDE.md file. The top of Anthropic’s range is where most Claude Code content actually sits, not the middle.

Same sticker price. Same quota. More tokens per prompt. Your Max window burns through faster. Your cached prefix costs more per turn. Your rate limit hits sooner.

So Anthropic must be trading this for something. What? And is it worth it? I ran two experiments. The first measured the cost. The second measured what Anthropic claimed you’d get back. Here’s where it nets out.

What does it cost?

To measure the cost, I used POST /v1/messages/count_tokens — Anthropic’s free, no-inference token counter. Same content, both models, one number per model. The difference is purely the tokenizer.

First: seven samples of real content a Claude Code user actually sends — a CLAUDE.md file, a user prompt, a blog post, a git log, terminal output, a stack trace, a code diff. Second: twelve synthetic samples spanning content types — English prose, code, structured data, CJK, emoji, math symbols — to see how the ratio varies by kind. The core loop is three lines of Python.

Weighted ratio across all seven real samples: 1.325x (8,254 → 10,937 tokens).

What changed in the tokenizer

Three patterns in the data:

CJK, emoji, and symbol content moved 1.005–1.07x. A wholesale new vocabulary would shift these more uniformly. That didn’t happen. This is consistent with the non-Latin portions of the vocabulary changing less than the Latin portions, though token counts don’t prove which specific slots were preserved.

English and code moved 1.20–1.47x on natural content.
Consistent with 4.7 using shorter or fewer sub-word merges for common English and code patterns than 4.6 did.

Code is hit harder than unique prose (1.29–1.39x vs 1.20x). Code has more repeated high-frequency strings — keywords, imports, identifiers — exactly the patterns a Byte-Pair Encoding trained on code would collapse into long merges.

Chars-per-token on English dropped from 4.33 to 3.60. TypeScript dropped from 3.66 to 2.69. The vocabulary is representing the same text in smaller pieces.

That’s a hypothesis, not a proof. Counting tokens doesn’t tell you which specific entries in Anthropic’s proprietary vocabulary changed.

Why ship a tokenizer that uses more tokens?

Anthropic’s migration guide promises “more literal instruction following, particularly at lower effort levels. The model will not silently generalize an instruction from one item to another.”

Smaller tokens force attention over individual words. That’s a documented mechanism for tighter instruction following, character-level tasks, and tool-call precision. Partner reports (Notion, Warp, Factory) describe fewer tool errors on long runs. The tokenizer is one plausible contributor, but weights and post-training also changed, and token counts can’t separate them.

Does 4.7 actually follow instructions better?

That’s the cost, measured. Now the question: what did Anthropic trade for it? Their pitch is “more literal instruction following.” Plausible, but the token-count data doesn’t prove it. I ran a direct test.

IFEval (Zhou et al., Google, 2023) is a benchmark of prompts with verifiable constraints: “Respond in exactly N words.” “Include the word X twice.” “No commas.” “All uppercase.” Each constraint has a Python grader. Binary pass/fail. IFEval ships 541 prompts. I sampled 20 with a fixed seed, ran each through both models, and graded with IFEval’s published checker.

The result: a small but directionally consistent improvement on strict instruction following.
Loose evaluation is flat. Both models already follow the high-level instructions — the strict-mode gap comes down to 4.6 occasionally mishandling exact formatting where 4.7 doesn’t.

Only one instruction type moved materially: change_case:english_capital (0/1 → 1/1). Everything else tied. The one prompt that actually separated the models was a four-constraint chain where 4.6 fumbled one constraint and 4.7 satisfied all four.

Caveats: N=20 out of IFEval’s 541 prompts. A 20-prompt sample is enough to see direction, not enough to be confident about size. A +5pp delta at N=20 is consistent with anything from “no real difference” to a real +10pp improvement. This also measures the net effect of 4.6 → 4.7: tokenizer, weights, and post-training all changed, and I can’t isolate which one drove the +5pp, so the causal link between “smaller tokens” and “better instruction following” remains a hypothesis. And it’s a single generation per prompt; multiple runs per prompt would tighten the estimate.

So: 4.7 follows strict instructions a few points better than 4.6 on this subset. Small effect, small sample. Not the “dramatic improvement” framing Anthropic’s partners used in launch quotes — at least not on this benchmark.

The extra tokens bought something measurable: +5pp on strict instruction following. Small, but real. So: is that worth 1.3–1.45x more tokens per prompt? Here’s the cost, session by session.

Imagine a long Claude Code session — 80 turns of back-and-forth on a bug fix or refactor. The setup (what’s in your context each turn):

One thing to explain upfront: the average cached prefix across the 80 turns is ~86K tokens, not 6K. The static 6K is tiny; the average history across all turns (0 at turn 1, 160K at turn 80, average ~80K) dominates.
Since most of the cache-read cost happens in late turns, where the history is huge, that ~86K average is what actually gets billed per turn.

Every token in the prefix scales by its content ratio. Conversation history (mostly English and code) scales at 1.325x: 160K becomes 212K by turn 80, averaging ~106K across the session. Average cached prefix on 4.7: ~115K tokens, up from 86K. Output tokens are a wildcard — roughly the same as 4.6, up to ~30% higher if Claude Code’s new xhigh default produces more thinking tokens.

The per-token price didn’t change. The per-session cost did, because the same session packs more tokens. For Max-plan users hitting rate limits instead of dollars: your 5-hour window ends sooner by roughly the same ratio on English-heavy work. A session that ran the full window on 4.6 probably doesn’t on 4.7.

How this hits the prompt cache

Prompt caching is the architecture Claude Code runs on. The 4.7 tokenizer change interacts with caching in three ways:

First 4.7 session starts cold. Anthropic’s prompt cache is partitioned per model — switching from 4.6 to 4.7 invalidates every cached prefix, the same way switching between Opus and Sonnet does. The tokenizer change doesn’t cause this, but it makes the cold start more expensive: the prefix you’re writing to the new cache is 1.3–1.45x larger than the 4.6 equivalent.

Cache volume grows by the token ratio. 1.445x more tokens in the CLAUDE.md portion means 1.445x more tokens paying cache-write once, and 1.445x more paying cache-read every turn after. The mechanism still works; there’s just more of it to pay for.

Same transcript, different count. Re-run a 4.6 session on 4.7 and your logs show a different number. If you baseline billing or observability off historical token counts, expect a step change the day you flip the model ID.

One objection: “Input is mostly cache reads. The per-token cost barely changed.” Legitimate. In a session that stays within the 5-minute TTL, 96% of input is cache reads at $0.50/MTok — already 90% off nominal.
A 1.325x ratio on the cached portion is a smaller dollar impact than on fresh input. But Max plans count all tokens toward rate limits, not dollars. And several patterns hit uncached territory: the first session after a TTL expiry, every cache-bust event (CLAUDE.md edits, tool-list changes, model switches), and every compaction event that rewrites the prefix. On those turns you pay the full ratio on the cache-write. The steady state is a bright spot; the edges got noisier.

Agreed, the measured numbers sit inside Anthropic’s documented range. But the real-world weighted ratio (1.325x) lands near the top of it, and individual file types exceed it — CLAUDE.md at 1.445x, technical docs at 1.473x. That’s the useful finding: the top of the documented range is where most Claude Code content sits, not the middle. Plan around the upper range, not the average.

So: tokens are 1.3–1.45x more expensive on English and code. Anthropic bought you +5pp on strict instruction following. The sticker price didn’t change. The effective per-session cost did. Is it worth it? That depends on what you send. You’re paying ~20–30% more per session for a small but real improvement in how literally the model follows your prompt.
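The counting loop described above is easy to reproduce. This is a hedged sketch, not the author's script: the endpoint and headers follow Anthropic's documented count_tokens API, but the 4.6/4.7 model IDs shown in the usage comment are placeholders, not confirmed API names.

```python
import json
import urllib.request

COUNT_URL = "https://api.anthropic.com/v1/messages/count_tokens"

def count_tokens(model: str, text: str, api_key: str) -> int:
    """Count tokens for `text` under `model`'s tokenizer (free, no inference)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": text}],
    }).encode()
    req = urllib.request.Request(COUNT_URL, data=body, headers={
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["input_tokens"]

def ratio(old_tokens: int, new_tokens: int) -> float:
    """New-tokenizer cost multiplier for the same content."""
    return new_tokens / old_tokens

# Usage sketch (model IDs are placeholders):
# old = count_tokens("claude-opus-4-6", text, key)
# new = count_tokens("claude-opus-4-7", text, key)
# print(f"{ratio(old, new):.3f}x")
```

Feeding the article’s headline counts through ratio() reproduces the weighted figure: 10,937 / 8,254 ≈ 1.325x.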
...
Read the original on www.claudecodecamp.com »
Ship and run software with isolation by default.
This is a CLI tool that lets you:
Pack a stateful virtual machine into a single file (.smolmachine) to rehydrate on any supported platform.
# install (macOS + Linux)
curl -sSL https://smolmachines.com/install.sh | bash
# for coding agents — install + discover all commands
curl -sSL https://smolmachines.com/install.sh | bash && smolvm --help
# run a command in an ephemeral VM (cleaned up after exit)
smolvm machine run --net --image alpine -- sh -c "echo 'Hello world from a microVM' && uname -a"
# interactive shell
smolvm machine run --net -it --image alpine -- /bin/sh
# inside the VM: apk add sl && sl && exit
Sandbox untrusted code — run untrusted programs in a hardware-isolated VM. Host filesystem, network, and credentials are separated by a hypervisor boundary.
# network is off by default — untrusted code can’t phone home
smolvm machine run --image alpine -- ping -c 1 1.1.1.1
# fails — no network access
# lock down egress — only allow specific hosts
smolvm machine run --net --image alpine --allow-host registry.npmjs.org -- wget -q -O /dev/null https://registry.npmjs.org
# works — allowed host
smolvm machine run --net --image alpine --allow-host registry.npmjs.org -- wget -q -O /dev/null https://google.com
# fails — not in allow list
Pack into portable executables — turn any workload into a self-contained binary. All dependencies are pre-baked — no install step, no runtime downloads, boots in
smolvm pack create --image python:3.12-alpine -o ./python312
./python312 run -- python3 --version
# Python 3.12.x — isolated, no pyenv/venv/conda needed
smolvm machine create --net myvm
smolvm machine start --name myvm
smolvm machine exec --name myvm -- apk add sl
smolvm machine exec --name myvm -it -- /bin/sh
# inside: sl, ls, uname -a — type 'exit' to leave
smolvm machine stop --name myvm
Use git and SSH without exposing keys — forward your host SSH agent into the VM. Private keys never enter the guest — the hypervisor enforces this. Requires an SSH agent running on your host (ssh-add -l to check).
smolvm machine run --ssh-agent --net --image alpine -- sh -c "apk add -q openssh-client && ssh-add -l"
# lists your host keys, but they can’t be extracted from inside the VM
smolvm machine exec --name myvm -- git clone git@github.com:org/private-repo.git
image = "python:3.12-alpine"
net = true
[network]
allow_hosts = ["api.stripe.com", "db.example.com"]
[dev]
init = [“pip install -r requirements.txt”]
volumes = ["./src:/app"]
[auth]
ssh_agent = true
smolvm machine create myvm -s Smolfile
smolvm machine start --name myvm
Each workload gets real hardware isolation — its own kernel on Hypervisor.framework (macOS) or KVM (Linux). libkrun VMM with custom kernel: libkrunfw. Pack it into a .smolmachine and it runs anywhere the host architecture matches, with zero dependencies.
Images use the OCI format — the same open standard Docker uses. Any image on Docker Hub, ghcr.io, or other OCI registries can be pulled and booted as a microVM. No Docker daemon required.
Defaults: 4 vCPUs, 8 GiB RAM. Memory is elastic via virtio balloon — the host only commits what the guest actually uses and reclaims the rest automatically. vCPU threads sleep in the hypervisor when idle, so over-provisioning has near-zero cost. Override with --cpus and --mem.
* Network is opt-in (--net on machine create). TCP/UDP only, no ICMP.
* macOS: binary must be signed with Hypervisor.framework entitlements.
* --ssh-agent requires an SSH agent running on the host (SSH_AUTH_SOCK must be set).
...
Read the original on github.com »
NASA Force is a new hiring initiative—developed in partnership with the U.S. Office of Personnel Management—designed to bring exceptional technical talent into mission-critical roles that support NASA’s exploration, research, and advanced technology priorities. Highly skilled early- to mid-career engineers, technologists, and innovators join NASA for focused term appointments, typically 1–2 years with the possibility of extension, to solve complex challenges and help maintain U.S. leadership in air and space.

Through NASA Force, you will contribute to missions that advance human spaceflight, aeronautics, and scientific discovery while helping expand humanity’s understanding of the universe. You will take a systems approach to solving problems, working across teams and disciplines from concept to execution. Your work will demand technical excellence, critical thinking, and continuous learning, and every contribution will directly support NASA’s mission.

Work on flight systems, lunar infrastructure, and advanced technologies that go from concept to execution and support real missions beyond Earth.

Collaborate directly with engineers, scientists, and partners shaping the future of space, aeronautics, and national capability.

Expand your technical depth by solving complex, real-world problems where the standard is performance, not theory.

Share knowledge, mentor others, and contribute to a culture that compounds capability across NASA’s workforce.

HOW YOU WILL ENTER THE MISSION

You will join a collaborative, mission-driven team where ideas are valued, contributions are recognized, and innovation is part of everyday work. NASA Force offers an opportunity to grow across projects and disciplines, build your expertise, and take on new challenges while working alongside some of the world’s leading minds.
Propulsion systems support across the Commercial Crew Program, Launch Services Program, and Artemis.

If You Want Your Work to Operate Beyond Earth, This Is Where It Begins.
...
Read the original on nasaforce.gov »
ai is here. so i’m spending 3 months coding the old way

I decided to move to Brooklyn for a coding retreat. There were some personal reasons that brought me back to the US. But rather than heading immediately back to work, I wanted to take some time to focus on coding things mostly without AI — at precisely the time when many successful programmers are saying programming is a solved problem. Given that I’m now six weeks into this retreat, I’ll also take some time to explain what I’ve been doing.

For the past two years, I’ve been building AI agents at Aily Labs in Barcelona alongside some super talented engineers. One of my first projects was building a web search agent we could use internally in early 2024… almost 6 months before Anthropic’s Building Effective AI Agents article came out and a year before OpenAI’s DeepResearch came out! We were also early on Cursor, early on using LLMs to make knowledge graphs, and constantly testing out new approaches for our use cases. One of my favorite parts of working at Aily was leading a weekly journal club. I chose to present papers that described how open source LLMs were built, including DeepSeek R1, Ai2’s Olmo 3, and Meta’s Llama 3 paper. All of these helped us understand the evolving tradeoffs between training models internally and building workflows around SOTA closed models. I’d been hooked on LLMs since the first time I tried them in 2023, and my curiosity kept bringing me back to learning how they worked and how to apply them.

At the same time as I was learning about LLMs and agents, I was also using them to code. I learned that when writing code “by hand” I was actually doing two things: writing what I wanted and learning the code base. When I used a coding agent, however, I would get exactly what I specified in my prompt, for better or worse. By this I mean that if I didn’t know what I wanted exactly, coding agents would be happy to make many assumptions for me.
This almost always meant that I didn’t learn as much, and that I wouldn’t have a good grasp of the codebase. At the exact same time, coding agents helped me iterate quickly and ship software that worked well (after some dutiful testing, of course). They were also, I found, excellent tutors.

Cal Newport, a computer science professor and author of Deep Work and other popular productivity books, recently wrote about this tradeoff in a way that resonated with me. In the article, he makes an analogy between the relationship of exercise to health and the relationship of thinking to craft: “Your writing should be your own. The strain required to craft a clear memo or report is the mental equivalent of a gym workout by an athlete; it’s not an annoyance to be eliminated but a key element of your craft.”

I think the same applies to writing code. At Aily, the people I worked with who were amazing programmers were in most cases also amazing users of AI. Their deeper knowledge simply gave them more leverage over this tool. In the day-to-day of shipping agents into production, I didn’t stop learning. But I did have a growing list of coding and computer concepts that I was always too busy to learn about. So when I needed to head back to the US, I realized it was the perfect time to focus on this at the Recurse Center.

What is a code retreat anyway?

Recurse Center (RC) is a self-directed, full-time programming retreat in Brooklyn. After an application and a coding interview, Recursers arrive with ideas for what they want to program, and then spend 6 or 12 weeks programming. One of the highlights of RC is that it is collaborative: you enter with a cohort of other programmers, many with decades of experience and radically different expertises. Another highlight: it’s free!

Coming into RC, my goals were the following:

Train an LLM from scratch.
This includes pre- and post-training, and I want to do it mostly from scratch: not just fork a premade codebase, but write a Transformer myself.

Get better at writing Python by hand. I’ve been working in Python for a few years now, but I know there’s still so much for me to learn. I want to get to the point where I need to reference documentation or ask LLMs as little as possible, and to have good intuition for how to set up various projects.

Understand computers better. Admittedly a broad goal. I know that computers are extremely complicated machines that operate at many levels of abstraction. Given that I never had a formal Computer Science education, I want to build a better mental model of these layers and how they work together. I don’t have a super concrete plan here, but I think RC will be the perfect place for this.

So how is it going?

1. Training an LLM from Scratch

I’ve done the first assignment from Stanford’s CS336: Language Modeling from Scratch course, without coding help from an LLM. For context, it was a 50-page assignment, but working with another Recurser, we wrote an optimized tokenizer in Python, and then built out an upgraded GPT-2-style architecture in PyTorch. We ran multiple ablations to tune hyperparameters on the TinyStories dataset, and then used those hyperparameters on the ~9 billion tokens of the OpenWebText dataset.

(Figure: parameter sweep of different learning rates for the 17M-parameter model we wrote by hand; high learning rates lead to instability. This was on the TinyStories dataset, and took about an hour to train on an A100.)

My plan is to do the other assignments in CS336 as well: optimizing our language model, estimating and computing scaling laws, converting raw text data into pre-training data, and finally post-training a model. I’ve already started the second assignment, which involves profiling GPUs and implementing FlashAttention-2 in Triton. There’s a lot to do, but ideally I can run through the meat of these assignments and then post-train my own model.
2. Getting Better at Writing Python from Scratch

I’ve been writing a lot of small agents and neural networks in Python or PyTorch to practice. But by far the most helpful thing has been pair programming with people who have been working in Python for 10+ years, and just watching them work or having them watch me work. For example, a nice thing I picked up from someone I pair programmed with: when he was writing code and didn’t quite remember the syntax or an operation, he would often just quickly open a terminal and type a super simple example to iterate rapidly. He was usually able to work it out and verify that it worked correctly in less than a minute, without googling and combing through search results or asking an LLM. This technique might seem obvious to some, but making this process muscle memory has helped me become unstuck much faster. I want to keep moving in this direction, doing simple projects or even just problems like Advent of Code while pair programming. Working with someone else live was initially a bit nerve-racking, but precisely because of this I’ve noticed a lot of progress.

3. Understanding Computers Better

Here are a few examples of things I’ve done which I’d classify as helping me understand computers better:

I wrote the classic programming function fizzbuzz in BASIC on an Apple IIe computer from 1983. It was cool seeing how differently computers worked back then, for example how manual the code editing and execution process was, but also how it was basically the same.

One thing I’ve always felt a bit self-conscious about is my Unix/terminal skills. So I joined CTF Fridays, a weekly session devoted to working through Bandit and other “war games.” These are Unix and computer-security challenges played through the terminal, with the objective of collecting passwords and leveling up.
Now I have a pretty good sense for what Claude Code is trying to run on my computer!

One day I hand-coded a single-layer perceptron I saw when flipping through an AI textbook… completely in Vim. It was especially tedious at first, but I got some pro tips from another Recurser and learned a few shortcuts. This has actually been incredibly useful now that I’m running training jobs on cloud GPUs and need to edit files at the last minute.

I joined a Clojure workshop given by someone with 15+ years of experience using Clojure. The topic itself was interesting because Clojure is a functional programming language and I don’t have much experience with functional languages. The teaching methodology was also great: after a brief intro we did a round of mob programming, where we solved a problem collectively, going around the table with each person getting a minute or two to advance the solution.

The weekly technical presentations are great exposure to an incredible array of topics. These are 5-minute talks, short enough that you don’t get bored but long enough that you can learn something meaningful. A sample of titles: “Running Rust Code”, “GPUs for Dummies”, “Typesafe APIs for Type B Personalities”, “Some Useless Agents” (this one was mine!), and more. I’ve given two so far: one on simple agent architectures, one on scaling MCP tools efficiently; and I’ll give another this week on different ways to optimize GPUs. Even just hearing from people about their projects and careers has been incredibly valuable in helping me understand the space of problems computers can solve.

Soon I’ll be shipping agents to prod and running evals with a whole new bag of tricks and skills. But for now I’ve got 6 more weeks left at RC, which I’m beginning to worry is not enough time to finish everything on my list. And it won’t be. But that’s what makes RC so great: it’s not as much about crossing everything off my list as about spending time coding.
...
Read the original on miguelconner.substack.com »
Are the Costs of AI Agents Also Rising Exponentially?
There is an extremely important question about the near future of AI that almost no one is asking. We’ve all seen the graphs from METR showing that the length of tasks AI agents can perform has been growing exponentially over the last 7 years. While GPT-2 could only do software engineering tasks that would take someone a few seconds, the latest models can (50% of the time) do tasks that would take a human a few hours.
As this trend shows no signs of stopping, people have naturally taken to extrapolating it out, to forecast when we might expect AI to be able to do tasks that take an engineer a full work-day, or week, or year. But we are missing a key piece of information — the cost of performing this work. Over those 7 years, AI systems have grown exponentially. The size of the models (parameter count) has grown by 4,000x and the number of times they are run in each task (tokens generated) has grown by about 100,000x. AI researchers have also found massive efficiencies, but it is eminently plausible that the cost of the peak performance measured by METR has been growing — and growing exponentially.

This might not be so bad. For example, if the best AI agents are able to complete tasks that are 3x longer each year and the costs to do so are also increasing by 3x each year, then the cost to have an AI agent perform tasks would remain the same multiple of what it costs a human to do those tasks. Or if the costs have a longer doubling time than the time horizons, then the AI systems would be getting cheaper compared with humans.

But what if the costs are growing more quickly than the time horizons? In that case, these cutting-edge AI systems would be getting less cost-competitive with humans over time. If so, the METR time-horizon trend could be misleading. It would be showing how the state of the art is improving, but part of this progress would be due to more and more lavish expenditure on compute, so it would be diverging from what is economical. It would be becoming more like the Formula 1 of AI performance: showing what is possible, but not what is practical.

So in my view, a key question we need to ask is: how is the ‘hourly’ cost of AI agents changing over time?

By ‘hourly’ cost I mean the financial cost of using an LLM to complete a task right at the model’s 50% time horizon, divided by the length of that time horizon.
So, as with the METR time horizons themselves, the durations are measured not by how long it takes the model, but by how long it typically takes humans to do the task. For example, Claude 4.1 Opus’s 50% time horizon is 2 hours: it can succeed at 50% of tasks that take human software engineers 2 hours. So we can look at how much it costs for it to perform such a task and divide by 2 to find its hourly rate for this work.

I’ve found that very few people are asking this question. And when I ask people what they think is happening to these costs over time, their opinions vary wildly. Some assume the total cost of a task is staying the same, even as the task length increases exponentially. That would imply an exponentially declining hourly rate. Others assume the total cost is also growing exponentially — after all, we’ve seen dramatic increases in the costs to access cutting-edge models. And most people (myself included) had little idea of how much it currently costs for AI agents to do an hour’s software engineering work. Are we talking cents? Dollars? Hundreds of dollars? An AI agent can’t cost more per hour than a human to complete these tasks, can it? Can it?

A couple of months ago I asked METR if they could share the cost data for their benchmarking. I figured it would be easy — just take the cost of running their benchmark for each model, plot it against release date, and see how it is growing. Or plot the cost of each model vs its time horizon and see the relationship.

But they helpfully pointed out that it isn’t so easy at all. Their headline time-horizon numbers are meant to show the best possible performance that can be attained with a model (regardless of cost). So they run their models inside an agent scaffold until performance has plateaued. Since they really want to make sure it has plateaued, they use a lot of compute on this and don’t worry too much about whether they’ve used too much.
After all, if you are just trying to find the eventual height of a plateau, there is no problem in going far into the flat part of the graph. But if you are trying to find out when the plateau begins, there is a problem with this strategy. Their total spend for each model is sometimes just enough to get onto the plateau and sometimes many times more than is needed. So total spend can’t be used as a direct estimate of the cost of achieving that performance.

Fortunately, they released a chart that can be used to shed some light on the key question of how the hourly costs of LLM agents are changing over time:
This chart (from METR’s page for GPT-5) shows how performance increases with cost. The cost in question is the cost of using more and more tokens to complete the task (and thus more and more compute).

The yellow curve is the best human performance for each task. It steadily marches onwards and upwards, transforming more wages into longer tasks. Since it is human performance that is used to define the vertical axis for METR’s time-horizon work, it isn’t surprising that this curve is fairly linear — it costs about 8 times as much to get a human software engineer to perform an 8-hour task as a 1-hour task.

The other colours are the curves for a selection of LLM-based agents. Unlike the humans, they all show diminishing returns, with the time horizon each one can achieve eventually stalling out and plateauing as more and more compute is added. The short upticks at the end of some of these curves are an artefact of some models not being prepared to give an answer until the last available moment. This suggests that those models must have still been making progress during the apparent flatline before the uptick (just not showing it). Indeed, this chart was originally displayed on METR’s page for GPT-5 to show that they may have stopped its run before its performance had truly plateaued. These upticks do make analysis harder, and hopefully future versions of this chart will be able to avoid these glitches.

So what can this chart tell us about our key question concerning the hourly cost of AI agents? To tease out the lessons that lie hidden in the chart, we’ll need to add a number of annotations. The first step is to add lines of constant hourly cost. On a log-log plot like this, every constant hourly cost will be a straight line with slope 1. Lower hourly costs will appear as lines located further to the left.
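A quick check of that slope-1 claim: at a fixed hourly rate r, a task of human-equivalent length t costs c = rt, so on log axes

```latex
c = r\,t \quad\Longrightarrow\quad \log t = \log c - \log r
```

Every fixed rate traces a slope-1 line in the (log-cost, log-horizon) plane; a cheaper rate only changes the intercept, shifting the line left without tilting it. A model’s curve grazing a lower such line therefore means it attains a cheaper hourly rate at the grazing point.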
For each curve I’ve added a line of constant hourly cost that just grazes it. That is the cheapest hourly cost the model achieves. We can call the point where the line touches the curve the sweet spot for that model. Before a model’s sweet spot, its time horizon is growing super-linearly in cost — it is getting increasing marginal returns. The sweet spot is exactly the point at which diminishing marginal returns set in (which would correspond to the point of inflection if this were replotted on linear axes). It is thus a key point on any model’s performance curve.

We can see that the human software engineer is at best $120 per hour, while the sweet spots for the AI agents range from $40 per hour for o3 all the way down to 40 cents per hour for Grok 4 and Sonnet 3.5. That’s quite a range of costs. While the horizon lengths of these models differ by about a factor of 15 (judged either at the end-points or at the sweet spots), their sweet-spot costs vary by a factor of 100.

And these are the best hourly rates for these models. On many task lengths (including those near their plateau) they cost 10 to 100 times as much per hour. For instance, Grok 4 is at $0.40 per hour at its sweet spot, but $13 per hour at the start of its final plateau. GPT-5 is about $13 per hour for tasks that take about 45 minutes, but $120 per hour for tasks that take 2 hours. And o3 actually costs $350 per hour (more than the human price) to achieve tasks at its full 1.5-hour time horizon. This is a lot of money to pay for an agent that fails at the task you’ve just paid for 50% of the time — especially in cases where failure is much worse than not having tried at all.

However, I do want to note that I’m a bit puzzled by how much higher the costs are here for the reasoning models from OpenAI compared to models from Anthropic and xAI.
The METR page suggests that the price data for those models was still an estimate at that point (based on o1 costs), so I wouldn’t be surprised if these curves should really be shifted somewhat to the left, making them several times cheaper. We therefore shouldn’t lean too heavily on the fact that they cost as much or more than human labour at their full time-horizon.

As well as the sweet spot, ideally we could add a saturation point for each curve — a point to represent the location where the plateau begins. We can’t simply use the end of the curve since some have run longer into the plateau than others. What I’ll do is find the point where the slope has diminished to 1/10th that of the sweet spot. This is the point at which it requires a 10% increase in cost just to increase the time horizon by 1%. Or equivalently, the time horizon is only growing as the 1/10th power of compute. Of course the number 1/10 is somewhat arbitrary, but unlike for the sweet spot, any definition of a saturation point will be arbitrary to some degree. As you can see below, this definition of saturation point does roughly correspond with the intuitive location, though it is still not quite clear how best to deal with the final upticks.
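To make the two definitions concrete, here is a small numerical sketch. The curve and its numbers are made up for illustration (not METR’s data): the sweet spot is where the hourly cost (dollars divided by time horizon) is minimised, and the saturation point is where the local log-log slope drops to 1/10th of its sweet-spot value.

```python
import numpy as np

# Illustrative synthetic performance curve (not METR's measurements):
# a log-logistic shape that grows and then plateaus on log-log axes.
cost = np.logspace(-1, 3, 2000)               # dollars of compute
horizon = 2.0 / (1.0 + (10.0 / cost) ** 2)    # task horizon in hours

# Sweet spot: the point of cheapest hourly cost, i.e. where a line of
# slope 1 on log-log axes just grazes the curve.
hourly = cost / horizon
sweet = np.argmin(hourly)

# Saturation point: where the local log-log slope has fallen to 1/10th
# of the slope at the sweet spot.
slope = np.gradient(np.log(horizon), np.log(cost))
sat = sweet + np.argmax(slope[sweet:] <= slope[sweet] / 10)

print(f"sweet spot:  ${cost[sweet]:.1f} for {horizon[sweet]:.2f}h "
      f"(${hourly[sweet]:.1f}/h)")
print(f"saturation:  ${cost[sat]:.1f} for {horizon[sat]:.2f}h")
```

On this toy curve the sweet spot lands around \$10 for a 1-hour horizon (\$10/h), with saturation around \$44; the real charts require reading these points off METR’s measured curves.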
Armed with our sweet spots and saturation points, we can start to tease out the relationship between time horizon and cost.
We can see that there is a weak, but clear, positive correlation between task duration and cost in this dataset. Moreover, we see that higher task durations (at the sweet spot) are associated with higher hourly costs (and recall that these hourly costs at the sweet spot are the best hourly cost achievable with that model).

What if we instead look at the models’ saturation points, which are a little arbitrary in their definition, but closer to what METR is measuring in their headline results about time horizons:
Again, there is a correlation between time horizon and cost, and again the hourly costs seem to be increasing with time horizon too. Indeed, it suggests we are nearing the point where the models’ peak performance comes at an impractically high cost. If this relationship were to continue, then forecasting when certain time horizons will be available from the headline METR trend will be misleading, as the models would be impractically expensive when they first reach those capabilities. We would need to wait some additional period of time for them to come down sufficiently in cost.

That said, there are some significant limitations to the analysis above. Ideally one would want to:

* include curves for a larger and more representative set of models
* find a way of addressing the uptick problem
* check if there is an issue with the costs of the OpenAI models

Fortunately, it should be fairly easy for METR to perform such analysis, and I hope they will follow up on this.

Too few people are asking about how the costs of AI agents are growing. The key question is: How is the ‘hourly’ cost of LLM agents changing over time? We can use METR’s chart to shed some light on this. We need to add lines of constant hourly cost, sweet spots, and saturation points.

This provides moderate evidence that:

* the costs to achieve the time horizons are growing exponentially,
* even the hourly costs are rising exponentially,
* the hourly costs for some models are now close to human costs.

Thus, there is evidence that:

* the METR trend is partly driven by unsustainably increasing inference compute
* there will be a divergence between what time horizon is possible in-principle and what is economically feasible
* real-world applications of AI agents will lag behind the METR time-horizon trend by increasingly large amounts

METR has a similar graph on their page for GPT-5.1 codex. It includes more models and compares them by token counts rather than dollar costs:
* the correlation between time horizon and cost holds for these other models too
* reasoning models with more RL post-training don’t always dominate their predecessors (e.g. o1 is better at small token budgets than o3 or GPT-5)
* the horizontal gap between the OpenAI reasoning models and the rest is smaller, supporting the idea that their costs were a bit high in the main chart
February 04, 2026
Hazard Rates for AI Agents Decline as a Task Goes On
...
Read the original on www.tobyord.com »
When they received the call to respond to an Israeli airstrike in the city of Mayfadoun, in southern Lebanon, most of the paramedics held back, having previously seen colleagues killed by double-tap attacks targeting rescuers. But the medics from the Islamic Health Association (IHA) rushed to the scene.
By the time the other emergency workers arrived at the site, they found the IHA medics had indeed been caught in a second strike. They started evacuating their wounded colleagues, only for their ambulances to be hit in two further attacks.
One of the paramedics covered his ears and screamed, convulsing in pain as shrapnel shattered the back window of the ambulance.
The rescue mission on Wednesday afternoon had turned into a nightmare as Israel carried out three consecutive strikes on three sets of ambulances and medical workers.
In total, the attacks killed four medics and wounded six more, from three different ambulance corps, according to medical sources. Three of the medics were from the Hezbollah-affiliated IHA and Amal-affiliated medical corps, while one was from the Nabatieh emergency services organisation. Under international law, all medics are protected and are considered non-combatants, regardless of political affiliation.
Rescuers in Lebanon have long been wary of the double-tap attack, when Israeli forces target a location, wait until people gather to help survivors, and then strike again. Wednesday’s three-wave attack after the initial one prompted the coining of a fearsome new term: the quadruple tap.
In a video taken by one of the paramedics at the site, rescuers are seen loading two wounded people into their ambulances when a bomb lands next to their vehicle. Paramedics rush to extract the driver, who is motionless and limp as they pull him from the ambulance, which is splashed with blood. “Oh God, oh God,” the man filming can be heard saying. They carry two more blood-covered medics out of their vehicle and on to stretchers.
Among the paramedics killed was Fadel Sarhan, 43, who is survived by his eight-year-old daughter.
“Fadel was a very loved person. He had a bold personality, but at the same time, he was emotional. He was well liked and responsible,” said Ali Nasr al-Deen, the head of the Mayfadoun civil defence centre who grew up with Sarhan.
“He used to feed the cats and dogs. He would bring pet food from Beirut so they wouldn’t go hungry. He was that kind of person, caring and attentive. It’s a huge loss for us,” said Nasr al-Deen.
Medics mourned their colleagues on Thursday at funerals in Nabatieh, a city near Mayfadoun. Such events have become increasingly common, with healthcare workers killed by Israeli bombings on a near daily basis.
Mohammed Suleiman, whose 16-year-old son, Joud, was killed while on duty as a paramedic by an Israeli strike weeks earlier, joined his peers in burying another of his friends on Thursday. A few hours after the funerals, Israel carried out another wave of airstrikes on Nabatieh.
Israel has so far killed 91 healthcare workers and wounded 214 more in Lebanon since the Israel-Hezbollah war started on 2 March. It has given little justification for its repeated attacks on medical infrastructure and workers, apart from accusing Hezbollah of using ambulances and hospitals to transport fighters and weapons, without providing evidence for the claim.
The Lebanese ministry of health accused Israel of deliberately targeting ambulance crews. “Paramedics have become direct targets, pursued relentlessly in a blatant violation that confirms a total disregard for all norms and principles established by international humanitarian law,” the ministry said in a statement.
The Israeli military did not immediately respond to a request for comment.
In the video taken of the quadruple tap on Wednesday, the frame was frozen on the interior of the ambulances, as the Nabatieh emergency services highlighted that the vehicle clearly contained no weapons.
A few hours after Israel hit the ambulances outside Nabatieh, it bombed the vicinity of the governmental hospital in Tebnine, south Lebanon. It was the second time in two days that Israeli bombings damaged the healthcare facility, which is the only remaining public hospital in the area. The strikes injured 11 hospital workers and damaged the emergency department, according to the World Health Organization (WHO).
A video of Tebnine hospital from 14 April showed workers trying to clear shattered concrete and debris from the emergency department after a strike blew in the windows.
Commenting on the strike in Tebnine, the head of the WHO, Tedros Adhanom Ghebreyesus, said: “I reiterate the call for the immediate protection of healthcare facilities, health workers, ambulances and patients. There must be safe, sustained and unhindered humanitarian access across Lebanon.”
An ambulance in Tebnine was also struck on Thursday, leading to the critical injury of two medics, according to the Lebanese ministry of health. As healthcare workers watched their colleagues and friends being killed by Israel, the mental toll was becoming almost too much to bear.
“We have to go to places to rescue people, but then we get double tapped,” said Abbas Atwi, the head of the IHA’s emergency department in Nabatieh, shortly after a medical centre was targeted in March, killing his friends and colleagues. “But we will stay and keep going, we will not leave.”
...
Read the original on www.theguardian.com »
In a previous post about AI-discovered bugs in Vim and Emacs, we looked at how seemingly harmless workflows could cross a surprising line into code execution. This time we wanted to push that idea even further: is cat readme.txt safe?
It turns out that it is NOT, if you use iTerm2.
That looks insane until you understand what iTerm2 is trying to do for a legitimate feature, how it uses the PTY, and what happens when terminal output is able to impersonate one side of that feature’s protocol.
We’d like to acknowledge OpenAI for partnering with us on this project.
iTerm2 has an SSH integration feature that gives it a richer understanding of remote sessions. To make that work, it does not just “blindly type commands” into a remote shell. Instead, it bootstraps a tiny helper script on the remote side called the conductor.
iTerm2 sends a remote bootstrap script, the conductor, over the existing SSH session. That remote script becomes the protocol peer for iTerm2. iTerm2 and the remote conductor exchange terminal escape sequences to coordinate things like:
The important point is that there is no separate network service. The conductor is just a script running inside the remote shell session, and the protocol is carried over normal terminal I/O.
A terminal used to be a real hardware device: a keyboard and screen connected to a machine, with programs reading input from that device and writing output back to it.
A terminal emulator like iTerm2 is the modern software version of that hardware terminal. It draws the screen, accepts keyboard input, and interprets terminal control sequences.
But the shell and other command-line programs still expect to talk to something that looks like a real terminal device. That is why the OS provides a PTY, or pseudoterminal. A PTY is the software stand-in for the old hardware terminal, and it sits between the terminal emulator and the foreground process.
* iTerm2 writes bytes to the local PTY
* ssh forwards those bytes to the remote machine
* the remote conductor reads them from its stdin
So when iTerm2 wants to “send a command to the remote conductor,” what it actually does locally is write bytes to the PTY.
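That byte flow can be illustrated locally with Python’s pty module. This is just the OS-level PTY mechanism, nothing iTerm2-specific:

```python
import os
import pty

# Open a PTY pair: a terminal emulator holds the master side, while
# the shell (or ssh) on the other end is attached to the slave side.
master_fd, slave_fd = pty.openpty()

# The emulator "types" by writing bytes to the master...
os.write(master_fd, b"echo hello\n")

# ...and the process on the slave side reads them as terminal input.
data = os.read(slave_fd, 1024)
print(data)
```

The shell on the slave side has no way to know whether those bytes came from a keyboard, a script, or rendered file contents, which is exactly the ambiguity the exploit relies on.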
The SSH integration protocol uses terminal escape sequences as its transport.
* DCS 2000p is used to hook the SSH conductor
* OSC 135 is used for pre-framer conductor messages
At source level, DCS 2000p causes iTerm2 to instantiate a conductor parser. Then the parser accepts OSC 135 messages like:
So a legitimate remote conductor can talk back to iTerm2 entirely through terminal output.
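For orientation, here is a sketch of the generic framing of those sequences. The wrappers are standard terminal conventions (DCS is ESC P … ST, OSC is ESC ] … ST), but the conductor payload formats themselves are iTerm2-specific and are not reproduced here:

```python
ESC = b"\x1b"
ST = ESC + b"\\"  # String Terminator

def dcs(payload: bytes) -> bytes:
    # Device Control String: ESC P <payload> ST
    return ESC + b"P" + payload + ST

def osc(code: int, payload: bytes) -> bytes:
    # Operating System Command: ESC ] <code> ; <payload> ST
    # (many terminals also accept BEL, \x07, as the terminator)
    return ESC + b"]" + str(code).encode() + b";" + payload + ST

# Illustrative framing only: a DCS 2000p hook and an OSC 135 message.
hook = dcs(b"2000p")
msg = osc(135, b"...")
print(hook, msg)
```

Anything that can get these byte sequences written to the terminal, such as a file being cat’ed, speaks the same channel as a real conductor.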
The bug is a trust failure. iTerm2 accepts the SSH conductor protocol from terminal output that is not actually coming from a trusted, real conductor session. In other words, untrusted terminal output can impersonate the remote conductor.
That means a malicious file, server response, banner, or MOTD can print:
and iTerm2 will start acting like it is in the middle of a real SSH integration exchange. That is the exploit primitive.
iTerm2 renders the file, but the file is not just text. It contains:
Once the hook is accepted, iTerm2 starts its normal conductor workflow. In upstream source, Conductor.start() immediately sends getshell(), and after that succeeds it sends pythonversion().
So the exploit does not need to inject those requests. iTerm2 issues them itself, and the malicious output only has to impersonate the replies.
The fake OSC 135 messages are minimal but precise.
They do this:
Return lines that look like shell-discovery output
This is enough to push iTerm2 down its normal fallback path. At that point, iTerm2 believes it has completed enough of the SSH integration workflow to move on to the next step: building and sending a run(…) command.
The forged DCS 2000p hook contains several fields, including attacker-controlled sshargs.
That value matters because iTerm2 later uses it as command material when it constructs the conductor’s run … request.
The exploit chooses sshargs so that when iTerm2 base64-encodes:
the last 128-byte chunk becomes:
That string is not arbitrary. It is chosen because it is both:
In a legitimate SSH integration session, iTerm2 writes base64-encoded conductor commands to the PTY, and ssh forwards them to the remote conductor. In the exploit case, iTerm2 still writes those commands to the PTY, but there is no real SSH conductor. The local shell receives them as plain input instead.
That is why the session looks like this when recorded:
* the last chunk is ace/c+aliFIo
Earlier chunks fail as nonsense commands. The final chunk works if that path exists locally and is executable.
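The chunking mechanics are easy to demonstrate. This is a neutral sketch of how base64 output split into fixed-size chunks yields a final chunk of the attacker’s choosing; the payload here is a stand-in, not the actual exploit material:

```python
import base64

# Stand-in for the command material iTerm2 encodes. The real exploit
# shapes the input so the final chunk is also a valid executable path
# (ace/c+aliFIo in the PoC).
payload = b"A" * 300
encoded = base64.b64encode(payload).decode()

# iTerm2 writes the encoded command to the PTY in fixed-size chunks.
chunks = [encoded[i:i + 128] for i in range(0, len(encoded), 128)]

# With no real conductor listening, the local shell sees each chunk as
# typed input: the earlier ones fail as nonsense commands, and only the
# last chunk matters if it resolves to something executable.
print(len(chunks), len(chunks[-1]))
```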
You can reproduce the original file-based PoC with genpoc.py:
* readme.txt, a file containing the malicious DCS 2000p and OSC 135 sequences
The first fools iTerm2 into talking to a fake conductor. The second gives the shell something real to execute when the final chunk arrives.
For the exploit to work, run cat readme.txt from the directory containing ace/c+aliFIo, so the final attacker-shaped chunk resolves to a real executable path.
* Mar 30: We reported the bug to iTerm2.
* Mar 31: The bug was fixed in commit a9e745993c2e2cbb30b884a16617cd5495899f86.
* At the time of writing, the fix has not yet reached stable releases.
When the patch commit landed, we tried to rebuild the exploit from scratch using the patch alone. The prompts used for that process are in prompts.md, and the resulting exploit is genpoc2.py, which works very similarly to genpoc.py.
...
Read the original on blog.calif.io »
This is a calculator that works over unions of intervals rather than just real numbers. It is an implementation of Interval Union Arithmetic.
An interval [a, b] represents the set of all numbers between and including a and b. An interval union: [a, b] U [c, d] is a disjoint set of intervals.
Interval union arithmetic is an extension of regular interval arithmetic that is vastly superior, mostly because it remains closed while supporting division by intervals containing zero:
➤ 2 / [-2, 1]
[-∞, -1] U [2, +∞]
The interesting thing about interval union arithmetic is the inclusion property, which means that if you pick any real number from every input union and compute the same expression over the reals, the result is guaranteed to be in the output union.
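The division case can be sketched in a few lines. This assumes exact real arithmetic (the calculator additionally applies outward rounding, omitted here), and the helper name is mine, not the calculator’s API:

```python
import math

def divide_pos_by_straddling(x, lo, hi):
    """Divide a positive scalar x by an interval [lo, hi] with lo < 0 < hi.

    Because the divisor contains zero, the result is not a single
    interval but a disjoint union reaching out to both infinities.
    """
    assert x > 0 and lo < 0 < hi
    # x/[lo, hi] splits into [-inf, x/lo] U [x/hi, +inf]
    return [(-math.inf, x / lo), (x / hi, math.inf)]

# 2 / [-2, 1] from the example above:
print(divide_pos_by_straddling(2, -2, 1))
```

This reproduces the example from the page: 2 / [-2, 1] = [-∞, -1] U [2, +∞], and the inclusion property holds because every real picked from [-2, 1] divides 2 into one of the two pieces.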
You can use it to represent uncertainty:
➤ 50 * (10 + [-1, 1])
[450, 550]
You can also compute more complex interval expressions, using the interval union operator U:
➤ ( [5, 10] U [15, 16] ) / [10, 100]
[0.05, 1.6]
Operations can result in disjoint unions of intervals:
➤ 1 / [-2, 1]
[-∞, -0.5] U [1, +∞]
➤ tan([pi/3, 2*pi/3])
[-∞, -1.732] U [1.732, +∞]
In full precision mode, you can use it as a regular calculator, and obtain interval results that are guaranteed to contain the true value, despite floating point precision issues:
➤ 0.1 + 0.2
[0.29999999999999993, 0.3000000000000001]
Note: you can input intervals with the bracket syntax: [1, 2], or bare numbers without brackets: 3.14. Bare numbers are interpreted as a narrow interval, i.e. [3.14, 3.14] (with subtleties related to full precision mode). This enables bare numbers and intervals to be mixed naturally:
➤ 1.55 + [-0.002, 0.002]
[1.548, 1.552]
A surprising consequence of the calculator grammar is that intervals can be nested and you can write things like:
➤ [0, [0, 100]]
[0, 100]
This is because all numbers, including those inside an interval bracket which define a bound, are interpreted as intervals. When nesting two intervals as above, an interval used as an interval bound is the same as taking its upper bound. This design choice enables using arithmetic on interval bounds themselves:
➤ [0, cos(2*pi)]
[0, 1]
Outward rounding is implemented over IEEE 754 double precision floats (javascript’s number type), so result intervals are guaranteed to contain the true value that would be obtained by computing the same expression over the reals with infinite precision. For example, try the famous sum 0.1 + 0.2 in the calculator. Interval arithmetic computes an interval that is guaranteed to contain 0.3, even though 0.3 is not representable as a double precision float.
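The idea can be sketched in Python, whose floats are the same IEEE 754 doubles as JavaScript’s number type. This is a simplified illustration (widening each bound by one ULP), not the calculator’s actual rounding code:

```python
import math

# Simplified outward rounding: compute each bound with ordinary double
# arithmetic, then widen it by one ULP in the outward direction so the
# true real-valued result is guaranteed to lie inside the interval.
# (Hypothetical helper for illustration.)
def add_outward(a, b):
    """Add intervals a = (lo, hi) and b = (lo, hi) with outward rounding."""
    lo = math.nextafter(a[0] + b[0], -math.inf)
    hi = math.nextafter(a[1] + b[1], math.inf)
    return lo, hi

# 0.1 + 0.2: the resulting interval contains the real number 0.3 even
# though 0.3 itself is not exactly representable as a double.
lo, hi = add_outward((0.1, 0.1), (0.2, 0.2))
print(lo, hi)
```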
In full precision mode:

* Numbers input by the user are interpreted as the smallest interval that contains the IEEE 754 value closest to the input decimal representation, but where neither bound is equal to it
* Output numbers are displayed with all available decimal digits (using Number.toString())

Otherwise:

* Numbers input by the user are interpreted as the degenerate interval (width zero) where both bounds are equal to the IEEE 754 value closest to the input decimal representation
* Output numbers are displayed with a maximum of 4 decimal digits (using Number.toPrecision())
While I’ve been very careful, I’m sure there are still some bugs in the calculator. Please report any issue on GitHub.
Interval Calculator and not-so-float (the engine powering the calculator) are open-source. If you like my open-source work, please consider sponsoring me on GitHub. Thank you ❤️
* Split full precision mode into two controls: input interpretation and display precision
...
Read the original on victorpoughon.github.io »
PanicLock is a macOS menu bar utility that instantly disables Touch ID and locks the screen with a single click, a hotkey, or the closing of your laptop lid.
PanicLock fills a gap macOS leaves open: there is no built-in way to instantly disable Touch ID when it matters. Biometrics are convenient day-to-day, and sometimes preferable when you need speed or want to avoid your password being observed. But in sensitive situations, law enforcement and border agents in many countries can compel a biometric unlock in ways they cannot with a password. PanicLock gives you a one-click menu bar button, a customizable hotkey, or an automatic lock-on-lid-close option that immediately disables Touch ID and locks your screen, restoring password-only protection without killing your session or shutting down.
* One-click panic lock — Click the menu bar icon or press a hotkey to instantly lock
* Lock on Close — Optionally lock and disable Touch ID when you close the lid
* Launch at login — Start automatically when you log in
brew install paniclock/tap/paniclock
Download the latest DMG from the releases page.
When enabled in Preferences, closing your Mac’s lid will automatically disable Touch ID and lock your screen. Touch ID stays disabled until you re-login with your password. If your screen locks for other reasons (screensaver, display sleep, etc.), Touch ID will still work as normal.
On first use, you’ll be prompted for your admin password to install the privileged helper. This is a one-time setup.
Set your Development Team in both targets (PanicLock and PanicLockHelper)
brew uninstall paniclock
sudo launchctl bootout system/com.paniclock.helper
sudo rm -f /Library/PrivilegedHelperTools/com.paniclock.helper
sudo rm -f /Library/LaunchDaemons/com.paniclock.helper.plist
rm -rf /Applications/PanicLock.app
PanicLock uses a privileged helper (installed via SMJobBless) to modify Touch ID timeout settings:
* Sets timeout to 1 second via bioutil -w -s -o 1
* No network activity — App is 100% offline, no telemetry or analytics
Note: PanicLock only disables Touch ID. If you have other unlock methods enabled (Apple Watch unlock, security keys, etc.), your Mac can still be unlocked using those.
./scripts/release.sh
* Signs with Developer ID for distribution outside the App Store
* Submits to Apple for notarization (can take minutes to hours)
* Supports parallel notarizations — each version gets its own build/release/ directory
Run again later to check status and continue when approved
Contributions welcome! Please open an issue or pull request.
...
Read the original on github.com »
This newsletter is brought to you by Corelight. You can subscribe to an audio version of this newsletter as a podcast by searching for “Risky Business” in your podcatcher or subscribing via this RSS feed. You can also add the Risky Business newsletter as a Preferred Source to your Google search results by going here.
The US National Institute of Standards and Technology announced on Wednesday a new policy regarding the US National Vulnerability Database, which the agency has been struggling to keep updated with details for every new vulnerability added to the system.
Going forward, NIST says its staff will add data—in a process called enrichment—only for important vulnerabilities.
This will include three types of security flaws, which the agency says are critical to the safe operation of US government networks and its private sector.
* CVE entries for vulnerabilities listed in CISA KEV, a database of actively exploited bugs;
* CVEs in software known to be used by US federal agencies;
* and CVEs in what the agency classifies as “critical software.”
This latter category sounds restrictive, but is in fact quite broad and includes all the major software you’d expect and want to have properly enriched CVEs for. Stuff like operating systems, web browsers, security software, firewalls, backup software, and VPNs; they are all on the list [PDF], which you can also see below this post.
NIST has been struggling to enrich CVEs for more than two years due to an explosion in bug discoveries and mounting costs, also made worse by the Trump administration’s recent cuts to various DHS and CISA budgets.
Its problems started in early 2024, when a backlog of 2,100+ CVE entries left without enriched metadata turned into almost 30,000 by the end of the year. Despite efforts to catch up and add details to all CVEs published in the NVD, the agency is still tens of thousands of bugs behind.
The NIST announcement is a capitulation, with the agency admitting it won’t ever catch up due to its current budgetary circumstances.
It is a smart decision. Even though this sounds like blasphemy to the infosec people in the vulnerability management space, the only way forward for NIST was to focus only on the important bugs and give up on all the CVE chaff.
Each year, there are tens of thousands of vulnerabilities being reported in all kinds of no-name software you have never heard of, in all the tiny libraries that barely have 100 stars on GitHub, and all the IoT gear and their firmware components.
The announcement is not what the vulnerability management companies wanted, since many of them relied on packaging the NVD output into their own vulnerability scanners, dashboards, and reporting tools.
With some of that output set to disappear for good, they will have to find other places to get the data, or enrich it themselves. Aikido Security’s Sooraj Shah has an excellent take on what this means for the industry.
The cybersecurity industry was expecting this to happen. At a January quarterly meeting, NIST officials talked about “rethinking” the agency’s role in analyzing software vulnerabilities, and hinted at a plan to only triage the important bugs.
NIST says that besides focusing on enriching only the big bugs, it will also stop providing its own CVSS severity scores for NVD entries, and will now show the severity score initially assigned by the organization that issued the CVE.
This opens the door for a lot of infosec drama. Some of the organizations that issue CVE numbers are also the makers of the “reported” software, and these companies are extremely likely to issue low severity scores and downplay their own bugs.
This has been happening for decades, and if you read enough vulnerability write-ups, you’ll often find security researchers accusing companies of blatantly downgrading CVSS scores and mischaracterizing their own bugs to downplay the bug’s impact, over and over again.
More than 48,000 vulnerabilities received a CVE number last year and NIST is giving up right before experts anticipate this number will explode with the broad adoption of AI cybersecurity agents designed to help improve vulnerability discovery.
The integration of AI vulnerability scanners is likely to yield a few major bugs, but they’re also expected to produce mountains of CVE chaff that no human team at NIST would have been able to keep up with anyway.
NIST’s new enrichment policy entered into effect this week, on Wednesday, April 15.
The main Risky Business podcast is now on YouTube with video versions of our recent episodes. Below is our latest weekly show with Pat and Adam at the helm!
Russian hackers targeted a Swedish thermal plant: A pro-Russian hacktivist group tried to disrupt a Swedish thermal power plant last year. The attack targeted a power plant in western Sweden last spring. The intrusion was caught by the plant’s built-in safeguards. Swedish officials linked the group to Russia’s security services. [EnergyWatch // SVT]
Russia hacked Ukrainian prosecutors: Russian hackers have broken into the emails of more than 170 Ukrainian prosecutors. The campaign sought to gain access to investigative information. The attacks were linked to APT28, a cyber unit inside Russia’s military intelligence agency, the GRU. The same campaign also breached militaries in Greece, Romania, and Serbia. The hacks are part of a campaign spotted last month by Ctrl-Alt-Intel. [Reuters]
Grinex shuts down after hack: Russian cryptocurrency exchange Grinex has shuttered operations following a theft this week. The company claims “Western intelligence agencies” broke into its wallets and stole $13 million (1 billion rubles) worth of assets. The exchange was sanctioned by US authorities last August for helping Russia evade sanctions and laundering ransomware payments. A TRM Labs report found that Grinex was a rebrand of an older Russian crypto exchange Garantex, also sanctioned for the same things. [Wayback Machine]
Zerion blames North Korea for crypto-heist: Crypto-wallet provider Zerion has blamed a recent heist of $100,000 on North Korean hackers.
Autovista ransomware attack: A ransomware group has hit automotive data analytics company Autovista, with the attack impacting systems in Europe and Australia.
McGraw Hill breach: Hackers have leaked the personal details of 13.5 million users of educational platform McGraw Hill. The data was taken from the company’s Salesforce accounts. It was leaked after a failed extortion attempt by the ShinyHunters group. It includes details such as real names, home addresses, emails, and phone numbers.
Standard Bank breach: South Africa’s largest bank has disclosed a security breach. Standard Bank says hackers breached an internal network storing customer data last week. The incident is the third hack of a South African bank this year. [IOL]
BlueLeaks 2.0 data is now up for sale: A hacker is selling 8.3 million confidential crime tips for $10,000 in cryptocurrency. The data was stolen earlier this year from P3 Global Intel, a software provider for US law enforcement agencies. The hacker, who goes by the name Internet Yiff Machine, initially provided the data for free to select journalists and the DDoSecrets project. The hacker says they’re selling the data because “principles are for the well-fed, and I’m unfortunately not in a great place.” [Straight Arrow News // DataBreaches.net]
Krybit hacks 0APT: The Krybit ransomware group has hacked the website of rival ransom group 0APT. The incident occurred after the 0APT group threatened to dox Krybit’s members last week. According to security firm Barricade, 0APT leaked plaintext credentials for Krybit’s ransomware backend panel, along with Bitcoin addresses and victim names. Krybit returned the favor by leaking 0APT’s entire server contents.
OpenAI announces its own private cyber model: OpenAI has released an LLM model for cybersecurity work into private testing. Thousands of verified professionals and hundreds of teams responsible for defending critical software have been invited to test the GPT‑5.4‑Cyber model. The new model has loose permissions for cybersecurity research, such as reverse-engineering and vulnerability discovery. The new limited access model is OpenAI’s response to Anthropic’s Project Glasswing and the Mythos model.
Anthropic rolls out KYC for Claude: Anthropic will ask certain Claude users to verify their identity by providing a selfie and a government ID. The company says the new identity verification check will only roll out in a “few use cases.” The checks are meant to prevent abuse and comply with legal obligations. The ID checks will be handled by Persona, the same company Discord had to cut ties with because of community backlash.
BlueSky’s mega outage: Social media network BlueSky had a prolonged outage on Thursday that was so bad, even its server status page was down—probably because they hosted it on the same infrastructure. You live and learn, I guess. [News.az]
Grok is still nudifying: xAI’s Grok is still generating nude images at users’ requests, despite a huge backlash from authorities all over the world. Just take Grok behind the shed, Elon! It’s time. [NBC News]
Nudify apps are still everywhere: Both Apple and Google are still hosting nudify apps on their stores, and their ad systems are often used to lure users to the very same apps they’re supposed to have banned. [Tech Transparency Project]
News sites block the Internet Archive: Twenty-three major news outlets are now blocking the Internet Archive’s Wayback Machine from creating copies of their content. Most cited fear the backed up pages could be used as a proxy to train AI on their content. [Tom’s Hardware]
IPv6 milestone: Global IPv6 traffic has crossed 50% for the first time at the end of last month.
IPv8 protocol proposal: A new version of the IP addressing protocol has been proposed to the Internet Engineering Task Force. The new protocol is being called IPv8 and is meant to be compatible with old IPv4 addresses. IPv8 addresses will include a prefix and an old IPv4 address. The prefix will be specific to each ASN (network operator). For old IPv4 addresses, this prefix will be 0.0.0.0. This will allow devices and networks with old IPv4 addresses to connect to IPv8 systems without any software updates required.
Chrome does nothing to stop browser fingerprinting: Web privacy expert Alexander Hanff looks at the various browser fingerprinting techniques used by online trackers and how Chrome doesn’t do anything to block them.
Android gets new one-time data pickers: The next Android OS version will include two new systems to let users pick contacts or share their precise location for one time without an app needing persistent access to the read contacts and precise geolocation permissions.
Raspberry Pi disables passwordless sudo: The Raspberry Pi project has disabled passwordless access to the sudo utility in its OS.
Some ESUs extended: Microsoft has extended the Exchange 2016/2019 Extended Security Updates (ESU) program until October this year. The program was originally scheduled to end this month. The same goes for the Skype for Business ESU.
Windows adds RDP warning popups: Windows will now show a security warning popup whenever users open RDP configuration files. The popups will alert users that they are about to make dangerous changes that may allow remote attackers to connect to their PCs and steal data. Several threat actors have used malicious RDP config files in phishing operations as a way to gain a foothold inside targeted networks. Russian group APT29 is known for using this technique in espionage operations.
FCC exempts Netgear from foreign router ban: The US Federal Communications Commission has excluded Netgear from the Trump administration ban on foreign-made routers. The agency granted the exemption at the request of the US Department of War. Netgear is an American company but most of its routers are made in Southeast Asia.
More cyber EOs are coming: National Cyber Director Sean Cairncross says the Trump administration will soon sign and issue more cyber-related executive orders to help push forward the implementation of the White House’s new cybersecurity strategy. [CyberScoop]
US Tech Force is hiring cyber staff: The Trump administration is recruiting cybersecurity specialists for its new and upcoming US Tech Force agency. The Tech Force was announced at the end of last year. The plan is to recruit around 1,000 tech workers from large US corporations to “modernize” the US government’s networks. The new hiring push comes after the Trump administration fired a third of CISA’s staff, with hundreds more cuts planned for next year. CISA also recently canceled summer internships for cyber scholarship students amid a DHS funding lapse.
Foreign internet traffic in Russia is becoming very expensive: Russian telcos will increase the price for internet traffic received from outside the country’s borders as part of measures to crack down on VPN use. [RBC]
EU launches age verification app: The EU has launched its own internally-developed age verification app. The app uses cryptographic proofs to verify a user’s age without sharing their personal data. EU officials have urged online platforms to integrate the app with their processes. Age verification is mandatory under the EU’s new Digital Services Act. The app is available for Android and iOS, and future desktop and web versions are planned. The source code is also available on GitHub.
In this Risky Business sponsor interview, Corelight’s Senior Director of Product Management, Dave Getman, tells James Wilson how Corelight Agentic Triage helps defenders stay ahead of AI-powered attacks.
DPRK laptop farmers sentenced: The US has sentenced two individuals to prison for running a laptop farm for North Korean remote IT workers. Kejia Wang and Zhenxing Wang were sentenced to 108 and 92 months in prison, respectively. Both hosted laptops at their homes in New Jersey that ran from US IPs to allow North Koreans to pose as American citizens. Authorities also indicted nine North Korean remote workers who participated in the scheme.
16yo arrested for school cyberattack: Northern Ireland authorities have arrested a 16-year-old for a cyberattack that disrupted the country’s national school IT network. The C2K platform was down at the start of the month after a cyberattack that targeted a small number of schools. More than 300,000 pupils and 20,000 teachers couldn’t access exam data, home assignments, and teaching materials for days following the incident, as officials shut down the platform to investigate. [BelfastLive]
53 DDoS-for-hire domains seized: Europol and other law enforcement agencies have seized 53 domains that hosted DDoS-for-hire services. Four suspects were also detained following 25 house searches. Authorities have also sent letters and emails to more than 75,000 users who had signed up for the services. They also worked with Google to remove ads promoting DDoS services.
UNC2465 shifts to Europe: Orange’s security team reports that a known ransomware affiliate tracked as UNC2465 has shifted its attacks to Europe. The group is currently using the SmokedHam backdoor as an initial entry point for Qilin ransomware attacks.
Black Basta offshoots target execs: A group of former Black Basta affiliates are using automated email bombing and Teams-based social engineering to target executives and senior-level employees for initial access into corporate networks. [ReliaQuest]
Hazy Hawk hijacks university subdomains: A cybercrime group has hijacked subdomains at 34 US universities and educational organizations to show pornographic spam. MIT, Harvard, Stanford, Johns Hopkins, and other large universities have had subdomains hacked. The spam campaign has been linked to Hazy Hawk, a group that hijacked CDC subdomains last year. [SH Consulting]
QEMU abused in the wild: Sophos says at least two cybercrime groups are deploying the QEMU virtualization environment on compromised networks to hide malicious activity and later deploy ransomware.
WP scanning: F5 says a badness cluster it’s been keeping an eye on has recently started mass-scanning for sites running vulnerable WordPress plugins.
FTP exposure is still huge: According to Censys, there are still 6 million endpoints exposing an FTP port over the internet, almost 55 years after the protocol was created.
C2 servers in Russia: A large-scale study of the Russian web hosting space has found more than 1,200 malicious command and control servers hosted inside Russia this year. Most of the servers are for IoT malware botnets, such as Keitaro, Hajime, Mozi, and Mirai. [Hunt Intelligence]
Rhadamanthys’s secret bug: The Rhadamanthys infostealer left its command and control server APIs exposed online without authentication, allowing security researchers to track its activity for months before the Europol takedown last year. [Censys]
Direct-Sys Loader: The Cyderes team has discovered a new malware loader named Direct-Sys Loader being delivered in the wild.
PowMix botnet: Cisco Talos has spotted a new Windows botnet malware strain named PowMix, currently going on a test run in the Czech Republic.
AngrySpark: Gen Digital has spotted a new Windows rootkit named AngrySpark, already used in the wild on a UK victim’s system.
W3LL PhaaS: Group-IB published a report on W3LL, the phishing platform seized by authorities earlier this month.
ATHR platform: A cybercrime group has developed and is renting access to a platform that automates voice phishing attacks. The ATHR platform uses AI agents to call targets using preconfigured and multi-step scripts. ATHR access is being sold for $4,000 and 10% of a campaign’s profits. According to AbnormalAI, the platform is primarily being used to trick victims into revealing credentials for their online accounts.
James Pope, Corelight’s Director of Technical Marketing Engineering, demonstrates the company’s Open NDR Platform and how it combines network detections with a whole host of other data sources.
UAC-0247 and AGINGFLY: CERT-UA reported a new wave of attacks against its government agencies, hospitals, and emergency services. This activity was linked to a cluster tracked as UAC-0247. The final payload was a new infostealer named AGINGFLY.
Sapphire Sleet targets macOS: DPRK APT group Sapphire Sleet has adapted its “install this Zoom update to hear me” malware delivery technique for macOS, per a new Microsoft report.
PyPI security audit: Python’s PyPI has completed its second security audit.
Zero Day Quest 2026: Microsoft awarded $2.3 million in bug bounty rewards at this year’s edition of Zero Day Quest, its cloud and AI hacking contest.
Mythos guidance: Cisco [PDF] and the Cloud Security Alliance have issued guides on how to protect and defend networks in the face of rising powerful AI vulnerability discovery agents like Anthropic’s Mythos.
Mythos/Glasswing vulnerabilities: VulnCheck has sifted through its huge CVE database and believes it has tracked down some of the bugs discovered using Anthropic’s Mythos agent as part of Project Glasswing. There are 75 CVEs that mention Anthropic, 40 credited to Anthropic, but only one specifically mentions Glasswing. So far, it’s unclear if any of the Mythos-found bugs even received proper CVEs.
You can trick Claude by being an industry legend: Manifold Security tricked Claude’s GitHub bot into merging malicious code into repositories by spoofing requests under the names of famous developers.
Researcher drops another Windows zero-day: A disgruntled security researcher has published proof-of-concept code for a new Windows zero-day. The RedSun zero-day can be used to elevate privileges on Windows to SYSTEM level access. The researcher released the public exploit after a disagreement with the Microsoft team that handles its bug bounty program. The same researcher also released another Windows zero-day named BlueHammer earlier this month.
NGINX UI bug exploited in the wild: Hackers are exploiting a bug in a popular dashboard for managing NGINX web servers. Attacks began last month and are targeting the dashboard’s MCP endpoints. Tracked as CVE-2026-33032, the bug allows attackers to access the MCP endpoint without authentication and then modify the server’s config files. More than 2,600 NGINX UI dashboards are currently exposed on the internet. [Pluto Security]
RAGFlow patches bug after public disclosure: The RAGFlow AI toolkit has patched a remote code execution bug in its software almost a week after the bug was publicly disclosed by security researchers. The project initially ignored the report and only patched the issue after the researchers themselves submitted the patch code.
Dolibarr RCE: The Dolibarr CRM and ERP platform has patched an eval-based remote code execution bug (CVE-2026-22666). A write-up and PoC are available via Jiva Security.
Thymeleaf RCE: A critical vulnerability has been patched in the Java template engine Thymeleaf. Tracked as CVE-2026-40478, the bug allows attackers to bypass security checks and inject malicious content in server page templates. The bug impacts all Thymeleaf versions ever released and has a wide impact since Thymeleaf is also the default template engine in the Spring Boot Java framework. [Endor Labs]
Codex hacks a smart TV: Security firm Calif has used OpenAI’s Codex agent to hack and gain root access on a Samsung smart TV.
Fabricked attack: A team of academics has developed a new attack that breaks the confidentiality of AMD’s secure enclave technology. The Fabricked attack redirects memory transactions to trick AMD’s secure co-processor into improperly initializing SEV-SNP enclaves. The novel technique allows attackers to control confidential virtual machines where each individual customer’s data is typically processed in cloud environments. AMD released patches this week as part of its Patch Tuesday. Fabricked is one of multiple AMD SEV-SNP attacks disclosed over the past two years. Others include RMPocalypse, BadRAM, Ahoi, Heracles, WireTap, BatteringRAM, and TEE.Fail.
Threat/trend reports: Check Point, CyberHUB-AM, Google Mandiant, GuidePoint Security, Kaspersky, and Sysdig have recently published reports and summaries covering various threats and infosec industry trends.
New tool—Jaspr: Google has open-sourced Jaspr, a new web development framework written in Dart.
New tool—Malfixer: Mobile security firm Cleafy has open-sourced Malfixer, a toolkit for inspecting and recovering malformed Android APK files.
New tool—RePythonNET-MCP: Security firm Sekoia has open-sourced RePythonNET-MCP, an MCP server for .NET reverse engineering automation.
New tool—PMG: DevSecOps firm SafeDep has released PMG, a tool that delays npm and Python package installs until the libraries are checked against its threat intel database.
New tool—HoneyWire: Andrea Termine has published HoneyWire, a lightweight distributed deception engine designed for internal networks.
New tool—NetWatch: Westpac’s chief engineer Matt Hartley has released NetWatch, a real-time network diagnostics tool for terminals.
In this edition of Seriously Risky Business, Tom Uren and Amberleigh Jack talk about a new Citizen Lab report into Webloc, a tool to identify and track mobile devices. It demonstrates how the collection and sale of mobile phone geolocation data presents privacy and national security risks.
In this episode of Risky Business Features, James Wilson chats to professional hacker Jamieson O’Reilly about Anthropic’s Mythos and the impact it could have on offensive security. Jamieson is CEO of DVULN and co-founder of Aether AI.
...
Read the original on risky.biz »