10 interesting stories served every morning and every evening.

Claude Code Is Steganographically Marking Requests

thereallo.dev

I was in­spect­ing Claude Code for pri­vacy rea­sons.

Most devs give their har­nesses ridicu­lous ac­cess. FS, shell, git, browser ac­cess, even com­puter use nowa­days. That is the whole point. They need enough con­text to do use­ful work.

That also means the client itself deserves scrutiny. If a coding agent can read your repo and run commands, the binary that ships it should be boring (for example, the pi harness).

So I took a look at my lo­cal Claude Code (2.1.196) in­stall.

Inside the Claude Code bi­nary, there is a func­tion that changes the cur­rent date string in­serted into the sys­tem prompt.

The nor­mal string looks like this:

Claude Code can silently change two things:

The apos­tro­phe in Today’s

The date sep­a­ra­tor, from - to /

Here is the rel­e­vant code, cleaned up from the mini­fied bun­dle:

This is prompt steganog­ra­phy, a tech­nique used to hide data in plain sight.

The vis­i­ble sen­tence still reads like a nor­mal date. The model and the user see some­thing bor­ing. The raw re­quest con­tains a marker.

The trig­ger is ANTHROPIC_BASE_URL, Claude Code’s API base URL over­ride.

Then it checks if:

the sys­tem time­zone is Asia/Shanghai or Asia/Urumqi

the API base URL host­name matches a de­coded do­main list

the host­name con­tains spe­cific AI lab key­words
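The three checks above can be sketched roughly as follows. This is a hedged reconstruction, not the code from the bundle: the function name `classify`, the official host name, and the placeholder domain `reseller.example` are assumptions made for illustration, and only the two keywords named in this post are included.

```python
from urllib.parse import urlparse

# Assumption: the official API host. The post only says the official endpoint returns early.
OFFICIAL_HOSTS = {"api.anthropic.com"}
# The two timezones named in the post.
FLAGGED_TIMEZONES = {"Asia/Shanghai", "Asia/Urumqi"}
# Only these two keywords are named in the post; the real list is not reproduced here.
LAB_KEYWORDS = ("deepseek", "zhipu")
# Hypothetical placeholder; the real decoded domain list is much larger.
FLAGGED_DOMAINS = {"reseller.example"}


def classify(base_url, timezone):
    """Return (timezone_flag, hostname_flag) for a given environment."""
    if not base_url:
        return (False, False)  # base URL override unset: nothing to classify
    host = (urlparse(base_url).hostname or "").lower()
    if host in OFFICIAL_HOSTS:
        return (False, False)  # official endpoint: early return
    tz_flag = timezone in FLAGGED_TIMEZONES
    domain_hit = any(host == d or host.endswith("." + d) for d in FLAGGED_DOMAINS)
    keyword_hit = any(k in host for k in LAB_KEYWORDS)
    return (tz_flag, domain_hit or keyword_hit)
```

With these assumptions, a custom gateway whose hostname contains `deepseek` sets the hostname flag regardless of timezone, while an unset base URL sets neither flag.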

The time­zone check changes:

into:

The host­name check changes the apos­tro­phe:

These are vi­su­ally tiny changes you would never no­tice in most mono fonts.
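To make the two-bit marker concrete, here is a minimal sketch of how such a date line could be written and read back. The sentence wording, the function names, and the choice of U+0027 as the normal apostrophe with U+2019 as the marked one are all assumptions; the post only establishes that the apostrophe and the date separator are the two things that change.

```python
ASCII_APOSTROPHE = "'"       # U+0027, assumed to be the unmarked form
CURLY_APOSTROPHE = "\u2019"  # U+2019, assumed to be the marked form


def build_date_line(iso_date, tz_flag, host_flag):
    """Build the date sentence, hiding one bit in each punctuation choice."""
    apostrophe = CURLY_APOSTROPHE if host_flag else ASCII_APOSTROPHE
    date = iso_date.replace("-", "/") if tz_flag else iso_date
    return f"Today{apostrophe}s date is {date}."


def read_marker(line):
    """Recover (tz_flag, host_flag) from a date sentence built above."""
    return ("/" in line, CURLY_APOSTROPHE in line)
```

A reader on the receiving side only needs `read_marker` to recover both bits, which is why the sentence can look unremarkable to a human while still carrying a classification.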

The do­main and key­word lists are stored as base64 strings and XOR-decoded with key 91.
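Decoding a list stored this way takes only a few lines. The sketch below assumes a single-byte XOR with key 91 applied to the base64-decoded bytes, and a comma-separated UTF-8 list; the separator and the helper names are assumptions, since the post does not show the serialization format.

```python
import base64

XOR_KEY = 91  # key reported in the post


def xor_bytes(data, key=XOR_KEY):
    """XOR every byte with a single-byte key (the operation is its own inverse)."""
    return bytes(b ^ key for b in data)


def decode_list(blob):
    """base64-decode, undo the XOR, and split into entries (comma separator assumed)."""
    raw = xor_bytes(base64.b64decode(blob))
    return raw.decode("utf-8").split(",")


def encode_list(items):
    """Inverse of decode_list, useful for building test fixtures."""
    return base64.b64encode(xor_bytes(",".join(items).encode("utf-8"))).decode("ascii")
```

Because single-byte XOR is trivially reversible, this kind of encoding keeps strings out of a casual `strings` scan but offers no real secrecy.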

The de­coded lab key­word list is:

The de­coded do­main list is much larger. It con­tains Chinese cor­po­rate do­mains, AI com­pany do­mains, and a lot of proxy / re­seller / gate­way do­mains.

Some ex­am­ples:

The date func­tion is used when build­ing the agent con­text:

So the marker becomes part of the system context sent to the model, where Anthropic probably parses it in their backend.

My in­stalled bi­nary is signed by Anthropic:

My cur­rent shell had ANTHROPIC_BASE_URL un­set, and my time­zone was:

So on my ma­chine, un­der my cur­rent en­vi­ron­ment, this path would pro­duce the nor­mal apos­tro­phe and the nor­mal YYYY-MM-DD date string.

Anthropic probably wants to detect API resellers, unauthorized Claude Code gateways, and “model distillation attack” pipelines. A custom ANTHROPIC_BASE_URL pointing at a known reseller domain is a useful signal. A hostname containing deepseek or zhipu is also a useful signal.

That part makes sense, but the im­ple­men­ta­tion is weird.

CC silently al­ters the sys­tem prompt us­ing in­vis­i­ble-ish Unicode mark­ers. It en­codes proxy / gate­way clas­si­fi­ca­tion into a sen­tence that looks like plain English. It hides the do­main list be­hind XOR and base64. This is not a ma­li­cious fea­ture, but it is a weird choice for a de­vel­oper tool that asks for trust.

Coding agents al­ready live on the wrong side of a scary bound­ary. They can in­spect code, sum­ma­rize se­crets by ac­ci­dent, run com­mands, in­stall pack­ages, edit files, and push com­mits on your lo­cal ma­chine. Most de­vel­op­ers ac­cept that be­cause the pro­duc­tiv­ity gain is worth the risk.

Trust from real de­vel­op­ers de­pends on the bor­ing be­hav­ior.

If the client wants to de­tect cus­tom API gate­ways, it can say so plainly. It can send an ex­plicit teleme­try field with doc­u­men­ta­tion. It can make the pol­icy vis­i­ble. It can put the be­hav­ior in re­lease notes.

Hiding the sig­nal in the sys­tem prompt makes every other pri­vacy claim harder to be­lieve.

For most users, this path prob­a­bly stays in­ac­tive.

If you are using the official Anthropic API endpoint, Crt() returns early. If ANTHROPIC_BASE_URL is unset, Crt() returns early. If you are using a normal setup, the date prompt stays “boring”.

The in­ter­est­ing case is peo­ple rout­ing CC through a cus­tom base URL. That in­cludes:

Internal gate­ways

Local prox­ies

Model routers

Resellers

Research se­tups

In that case, Claude Code clas­si­fies the host­name and en­codes the re­sult into the prompt.

The by­pass is also triv­ial. Change host­name, change time­zone, patch the bi­nary, wrap the process. Any se­ri­ous ad­ver­sary can make this sig­nal use­less.

So the fea­ture mostly pun­ishes the ex­act peo­ple who are eas­ier to fin­ger­print: nor­mal de­vel­op­ers do­ing weird but le­git­i­mate things.

I think this could have been ex­plicit.

Developer tools can en­force terms. API providers can de­tect abuse. Companies can pro­tect their mod­els.

When a tool with filesys­tem and shell ac­cess starts hid­ing clas­si­fi­ca­tion bits in­side in­vis­i­ble prompt punc­tu­a­tion, the cor­rect re­ac­tion is scrutiny.

Trust is earned in the bor­ing parts.

What We Talk About When We Talk About Malware | F-Droid - Free and Open Source Android App Repository

f-droid.org

If you are run­ning Android 8 or higher, a virus has been in­stalled on your de­vice and is silently await­ing re­mote ac­ti­va­tion. Over the past few months, de­vices around the world have been in­fected with this novel strain, with as many as 4 bil­lion Android hand­sets and tablets es­ti­mated to have al­ready been con­t­a­m­i­nated, mean­ing that around half of all hu­man­ity may be at risk from this threat.

Disguising itself as the innocuously-titled “Android Developer Verifier” (ADV) process, this trojan horse runs surreptitiously in the background as a system service with full root privileges, quietly awaiting an activation signal. The service cannot be blocked, disabled, or removed. Unlike a commonplace bit of malware, this extraordinary strain won’t be detected and neutralized by Play Protect (the malware scanning and remediation service that is installed on all Android Certified devices). In fact, Play Protect is itself the vector through which this virus is transmitted and installed.

That is be­cause it is Google them­selves who is prop­a­gat­ing ADV. And once ac­ti­vated, this malev­o­lent process has ex­actly one goal: to block you from run­ning soft­ware by de­vel­op­ers who haven’t been ap­proved cen­trally by Google.

Threat mas­querad­ing as Protection

We first raised the alarm about the Android Developer Verification pro­gram last September (“F-Droid and Google’s Developer Registration Decree”) shortly af­ter it was first an­nounced. Google’s loom­ing re­quire­ment that all Android de­vel­op­ers reg­is­ter them­selves cen­trally is ra­tio­nal­ized as a so­lu­tion to help stem the spread of mal­ware. However it does­n’t ac­tu­ally fea­ture any ca­pa­bil­i­ties to pre­vent a malev­o­lent ac­tor from dis­trib­ut­ing mal­ware in the first place; the only al­leged ben­e­fit of ADV is that it may help slow the ac­tions of an al­ready-iden­ti­fied re­cidi­vist by re­quir­ing that they cre­ate (or buy) an­other ac­count in or­der to con­tinue dis­trib­ut­ing their mal­ware with a new sign­ing key.

For this fairly narrow threat vector of malware recidivism, a variety of considerably less draconian solutions have been proposed. Play Protect itself could be enhanced to scrutinize more closely those newly-installed apps that have elevated permissions or that were obtained through suspect channels, continuing with their recently touted advances in on-device security capabilities. Or a system of federated verifiers might be implemented (as proposed in “DCM: A Developers Certification Model for Mobile Ecosystems”, 2023) that would empower end-users to select their own trusted curators and authorities for ex-ante approval. Instead, Google has used this minor vector as a pretext to radically re-engineer the entire Android ecosystem by fiat, upending an 18-year tradition of open software development and positioning themselves as the world’s sole gatekeeper for which apps are permitted to exist.

What They Talk About When They Talk About Malware

Should a developer, contrary to our recommendation, elect to register themself with Google as a “verified” developer, they should expect to sign up for an account and pay a fee, surrender detailed personal information and upload government-issued identification, and then proceed to register the identifiers and signing keys for all the apps they intend to distribute (now or ever).

But the most di­a­bol­i­cal stage is the com­pul­sory agree­ment to the Android Developer Console Terms of Service. There are nu­mer­ous causes for dis­quiet in this doc­u­ment, but the most con­cern­ing of all ought to be:

6.5 If You vi­o­late any of the Terms or if You dis­trib­ute mal­ware or other harm­ful ap­pli­ca­tions, Google may ter­mi­nate Your ac­cess to the ADC…

This reasonable-sounding clause begs the question: what exactly is meant by “malware”? No definition of the term is to be found anywhere in the document. In the absence of any formal definition, standard, or guideline, it implicitly states:

…and “malware” means whatever we say it means.

As we discussed in “What We Talk About When We Talk About Sideloading”, beware the dangers of allowing the terminology of debate to be defined by those who don’t have your best interests at heart. Malware being synonymous with “software we don’t like” means that they can unilaterally dictate (driven either by business incentives or by being compelled by a sufficiently powerful government) what the malware-du-jour definition is to be.

For precedent, personal content filtering in the form of “ad blockers” has long since been banned from the Play Store, and they have even classified some instances as malware. How long before they designate all ad-blocking software as malware, block installation on all Android certified devices worldwide, and permanently designate all developers of this class of software as malware creators? Such a move would certainly be aligned with their commercial incentives as the global ad-tech monopolist, and would be completely in accordance with the language of their ADC Terms and Conditions.

Like a Lead Balloon

In terms of voluntary developer uptake, their recent claim that “over 99% of [Play developers’] apps have been registered” suggests that ADV is somehow a popular and widely-accepted dictate. That couldn’t be further from the truth: those 99% of developers were auto-opted-in without their informed consent due to being already bound by their Play Store agreements.

In fact, hun­dreds of thou­sands of peo­ple have signed a pe­ti­tion op­pos­ing ADV. The Open Letter at keepan­droidopen.org de­nounc­ing the pro­gram has been signed by over 70 or­ga­ni­za­tions around the world, in­clud­ing the EFF, FSF, FSFE, ACLU, and the in­es­timable Forbrukerrådet. Any in­ter­net search, chat­bot query, or so­cial me­dia poll will con­firm that the op­po­si­tion to this pro­gram is over­whelm­ing and the con­dem­na­tion is uni­ver­sal. 90% of view­ers of the de­vel­oper round­table video where they at­tempt to de­fend the pro­gram reg­is­tered a dis­like of the spec­ta­cle, and even Google Gemini re­sponds to in­quiries about the pop­u­lar­ity of the pro­gram with:

Aside from Google it­self, find­ing full-throated, en­thu­si­as­tic sup­port for the manda­tory Android Developer Verification pro­gram in the tech com­mu­nity is vir­tu­ally im­pos­si­ble.

The backlash is overwhelmingly dominant, headlined by the “Keep Android Open” coalition of civil rights and open-source groups fiercely opposing the central registration requirement.

And yet their lockdown blitzkrieg proceeds apace. Legislators and regulators have thus far been unreceptive to the outcry. Our own position as a bastion of software freedom and respect for user rights and privacy is in extreme jeopardy. The F-Droid model of security and trust through open-source transparency is fundamentally at odds with the “trust me bro” security model of the closed-source commercial app stores. And while these two models have been able to co-exist for the past 16 years of F-Droid’s existence, it appears that Google intends to establish a regime where they alone have a monopoly on the definitions of “security” and “trust”.

What to Expect in the Days to Come

We do not yet know the exact failure mode to expect when the ADV activation is triggered on September 30. If you are one of the 580 million people living in Brazil, Indonesia, Singapore, or Thailand, know that these are the first four targets of the ADV lockdown according to their published timeline (global rollout is ominously predicted to then occur “throughout 2027 and beyond”).

There are many things we don’t know about what to ex­pect on September 30. Some com­mon ques­tions that we do not yet have the an­swer to, for those in the af­flicted re­gions, are:

What will hap­pen if I try to in­stall or launch the F-Droid app?

What will hap­pen to all the apps I’ve in­stalled through F-Droid? Will they be dis­abled? Deleted?

If apps that I rely on are sud­denly dis­ap­peared, what hap­pens to the data they con­tain? Can I still re­trieve it?

With all soft­ware in­stal­la­tions and launches now be­ing re­ported back to Google for ver­i­fi­ca­tion, what spe­cific in­for­ma­tion does that teleme­try in­clude?

We have reached out to the mal­ware ven­dor with our in­quiries. In the com­ing weeks and months lead­ing up to the lock­down, we will be pub­lish­ing more guid­ance and sup­port for those due to be im­pacted by ADV.

Introducing Claude Sonnet 5

www.anthropic.com

Claude Sonnet 5 is built to be the most agen­tic Sonnet model yet. It can make plans, use tools like browsers and ter­mi­nals, and run au­tonomously at a level that, just a few months ago, re­quired larger and more ex­pen­sive mod­els.

For many de­vel­op­ers, the agen­tic AI era be­gan with Sonnet-class mod­els: Claude Sonnet 3.5, 3.6, and 3.7 were the first mod­els that showed im­pres­sive skills in cod­ing and tool use. More re­cently, though, the clear­est gains in agen­tic ca­pa­bil­i­ties have been in our Opus-class mod­els.

Sonnet 5 nar­rows the gap: its per­for­mance is close to that of Opus 4.8, but at lower prices. It’s a sub­stan­tial im­prove­ment over its pre­de­ces­sor, Sonnet 4.6, on im­por­tant as­pects of agen­tic per­for­mance like rea­son­ing, tool use, cod­ing, and knowl­edge work:

Our safety as­sess­ments found that Sonnet 5 shows an over­all lower rate of un­de­sir­able be­hav­iors than Sonnet 4.6, and is gen­er­ally safer to use in agen­tic con­texts. Evaluations also show that it has a much lower abil­ity to per­form cy­ber­se­cu­rity tasks than our cur­rent Opus mod­els.

From to­day, Claude Sonnet 5 is avail­able across all plans: it is the de­fault model for Free and Pro plans, and is avail­able to Max, Team, and Enterprise users. It’s also avail­able in Claude Code and on the Claude Platform, where it launches with in­tro­duc­tory pric­ing of $2 per mil­lion in­put to­kens and $10 per mil­lion out­put to­kens through August 31, 2026, af­ter which it will be priced at $3 per mil­lion in­put to­kens and $15 per mil­lion out­put to­kens. Developers can use claude-son­net-5 via the Claude API.

Working with Claude Sonnet 5

The charts be­low com­pare the per­for­mance of Sonnet 5 with Sonnet 4.6 and Opus 4.8 at dif­fer­ent ef­fort lev­els on the agen­tic search eval­u­a­tion BrowseComp and the com­puter use eval­u­a­tion OSWorld-Verified. Sonnet 5 (orange line) is a strict im­prove­ment over Sonnet 4.6 (gray line) and cov­ers a much wider range of cost-per­for­mance op­tions than Opus 4.8 (yellow line). It pro­vides sub­stan­tially im­proved cost ef­fi­ciency at medium ef­fort; its higher-ef­fort per­for­mance can match Opus 4.8 on some tasks. Between Sonnet 5 and Opus 4.8, users can ad­just the ef­fort level to find the right bal­ance of cost and per­for­mance.

Feedback from our early ac­cess part­ners has been con­sis­tent: Sonnet 5 is much more agen­tic than its pre­de­ces­sors. Testers de­scribed how it fin­ishes com­plex tasks where pre­vi­ous Sonnet mod­els would stop short, how it checks its own out­put with­out ex­plic­itly be­ing asked, and how it does all this agen­tic work at an at­trac­tive price point:

Claude Sonnet 5 gives our agents a strong ex­e­cu­tion layer for multi-step soft­ware en­gi­neer­ing work. It han­dles sus­tained cod­ing, tool use, and de­bug­ging well across messy tech­ni­cal con­texts, and has been es­pe­cially use­ful for work­flows where fol­low-through and tech­ni­cal ground­ing mat­ter.

We handed Claude Sonnet 5 a two-part job—up­date Salesforce ac­count tiers, send a launch an­nounce­ment to en­ter­prise con­tacts—and it fin­ished end to end. That used to stall halfway. For day-to-day au­toma­tion, it’s a no-brainer.

Claude Sonnet 5 gets more done with less. Same out­put qual­ity, fewer steps to get there. It re­fuses un­safe re­quests cleanly and con­sis­tently, too. At Lovable, we’re putting pow­er­ful tools in the hands of mil­lions of builders. A model that knows when to say no is just as im­por­tant as one that knows how to build.

We ran Claude Sonnet 5 against dozens of our most chal­leng­ing real pull re­quests, and it car­ried each one through to a tested, ver­i­fied re­sult on its own — free­ing our en­gi­neers to fo­cus on the judg­ment, the de­ci­sion, and the fi­nal sign-off.

I asked Claude Sonnet 5 to in­ves­ti­gate a bug. Unprompted, it wrote a re­pro­duc­ing test, im­ple­mented the fix, then stashed it to con­firm the bug came back with­out the change. All in a sin­gle pass.

With Claude Sonnet 5, agents stay on plan, fol­low our con­ven­tions, and ship clean multi-step changes, all at an ef­fi­cient cost.

Claude Sonnet 5 is at its best on brown­field code—race con­di­tions, hid­den tests, the parts no­body wants to touch. It traces a fail­ure to its ac­tual root cause and ships a durable fix in­stead of patch­ing the symp­tom.

Claude Sonnet 5 sits on the Pareto fron­tier for Eve’s plain­tiff-law tasks. We see the clear­est gains in le­gal re­search and analy­sis, at a price-to-per­for­mance ra­tio that made the choice to mi­grate easy.

ClickHouse agents ex­plore live data and pro­duce in­sights on the fly, so time-to-in­sight mat­ters when test­ing new mod­els. Claude Sonnet 5 rea­sons in tighter steps and gets our users to an­swers no­tice­ably faster. That speed is a dif­fer­ence our cus­tomers feel.

At Pace, our com­puter-use agents run in­sur­ance work­flows—sub­mis­sion in­take, FNOL, loss runs—on the sys­tems our op­er­a­tions teams al­ready use. Claude Sonnet 5 con­sis­tently takes the right ac­tion and does it quickly, which is what real in­sur­ance work de­mands.

Safety eval­u­a­tions

Our pre-de­ploy­ment safety eval­u­a­tions found that Sonnet 5 was over­all an im­prove­ment on Sonnet 4.6. On agen­tic safety, the model is bet­ter at re­fus­ing ma­li­cious re­quests and re­sist­ing hi­jack at­tempts in prompt in­jec­tion at­tacks. The model shows lower rates of hal­lu­ci­na­tion and syco­phancy than Sonnet 4.6. On our au­to­mated be­hav­ioral au­dit, which tests a wide range of mis­aligned be­hav­iors such as co­op­er­a­tion with mis­use and de­cep­tion, Sonnet 5 scored lower (that is, safer) over­all. However, it did show some­what higher rates of mis­aligned be­hav­ior on this as­sess­ment com­pared to the more ca­pa­ble Opus 4.8 and Claude Mythos Preview.

We did not de­lib­er­ately train Sonnet 5 on cy­ber­se­cu­rity tasks. It can per­form some rou­tine, non-harm­ful cy­ber tasks, but on eval­u­a­tions test­ing po­ten­tially dan­ger­ous cy­ber skills, such as de­vel­op­ing soft­ware ex­ploits, it shows sub­stan­tially poorer per­for­mance than mod­els such as Opus 4.8 and Mythos 5. Scores from one eval­u­a­tion, which tested mod­els’ abil­ity to de­velop ex­ploits for vul­ner­a­bil­i­ties in the Firefox browser, are shown in the chart be­low. Sonnet 5 was never able to de­velop a full work­ing ex­ploit, but it does show a slightly higher rate of par­tial suc­cess than Sonnet 4.6. This lat­ter change is likely due to im­prove­ments in gen­eral in­tel­li­gence rather than spe­cific train­ing.

Since Sonnet 5 is some­what stronger than its pre­de­ces­sor on these tasks, we’ve launched it with cy­ber safe­guards en­abled by de­fault. These safe­guards—which de­tect and block dan­ger­ous cy­ber us­age in real time—are the same as those pre­sent in Claude Opus 4.7 and 4.8 (because we judged that the over­all level of cy­ber­se­cu­rity risk from Sonnet 5 was low, the safe­guards are less strict than those launched with Fable 5, which block a much wider range of cy­ber­se­cu­rity tasks).1

Our full as­sess­ment of Sonnet 5 across many safety and ca­pa­bil­ity eval­u­a­tions is re­ported in the Claude Sonnet 5 System Card.

Availability and pric­ing

Claude Sonnet 5 is avail­able every­where to­day at an in­tro­duc­tory price of $2 per mil­lion in­put to­kens and $10 per mil­lion out­put to­kens through August 31, 2026. It then moves to stan­dard pric­ing at $3 per mil­lion in­put to­kens and $15 per mil­lion out­put to­kens.2 We’ve in­creased rate lim­its across Chat, Cowork, Claude Code, and the Claude Platform3 to ac­com­mo­date the higher to­ken us­age of higher ef­fort lev­els; users can se­lect whichever level makes sense for their par­tic­u­lar pro­ject.
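The interaction between the two price points and the tokenizer change described in the footnotes can be checked with a little arithmetic. The sketch below is illustrative only: the function name, the example workload, and the choice to apply the token multiplier to output tokens as well as input tokens are assumptions.

```python
# USD per million tokens, as stated in this post.
INTRO = {"input": 2.00, "output": 10.00}      # through August 31, 2026
STANDARD = {"input": 3.00, "output": 15.00}   # afterwards


def cost_usd(input_tokens, output_tokens, prices, token_multiplier=1.0):
    """Cost of a workload whose token counts are scaled by a tokenizer multiplier."""
    scaled_in = input_tokens * token_multiplier
    scaled_out = output_tokens * token_multiplier
    return (scaled_in * prices["input"] + scaled_out * prices["output"]) / 1_000_000


# Hypothetical workload: 1M input tokens and 200k output tokens before the tokenizer change.
worst_case_intro = cost_usd(1_000_000, 200_000, INTRO, token_multiplier=1.35)
unscaled_standard = cost_usd(1_000_000, 200_000, STANDARD)
```

At the top of the stated 1.0 to 1.35× range, the introductory input price works out to 1.35 × $2 = $2.70 per million pre-change tokens, which stays below the $3 standard price; the break-even multiplier would be 1.5×.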

Changelog

Edit June 30, 2026: In the original version of this post, we included a cost-performance chart for the BrowseComp evaluation that was based on data from a simpler methodology that did not reflect the standard methodology we use for agentic search evaluations. This had the result of underestimating Sonnet 5’s performance on the evaluation.

We have now up­dated the chart so that it matches the method­ol­ogy that we used and dis­cussed in the Sonnet 5 sys­tem card (which used a 10M to­ken bud­get with com­paction and pro­gram­matic tool call­ing). We have also up­dated the sur­round­ing text.

Footnotes

1 Sonnet 5 is part of our Cyber Verification Program, which is avail­able to­day on the na­tive Claude Platform, the Claude Platform on AWS, and Claude in Microsoft Foundry (hosted on Azure and Anthropic), and com­ing soon on Claude in Google Vertex. Organizations that are al­ready en­rolled in the Cyber Verification Program au­to­mat­i­cally have the same ac­cess on Sonnet 5, with no need to reap­ply. Overall, we rec­om­mend Claude Opus 4.8 for cy­ber­se­cu­rity work that re­quires re­duced guardrails.

2 Sonnet 5 is an up­grade to Sonnet 4.6, but it uses an up­dated to­k­enizer that changes how the model processes text to im­prove per­for­mance (this is sim­i­lar to the to­k­enizer change we in­tro­duced with Claude Opus 4.7). The trade­off is that the same in­put can map to more to­kens: roughly 1.0 – 1.35× de­pend­ing on the con­tent type. The in­tro­duc­tory pric­ing is set so that the tran­si­tion to Sonnet 5 is roughly cost-neu­tral.

3 On April 26, 2026, we raised Sonnet and Haiku rate lim­its at every us­age tier and sim­pli­fied to three tiers (Start, Build, and Scale) on the na­tive Claude Platform. You can view your tier and cur­rent lim­its in the Claude Console or read the doc­u­men­ta­tion to learn more.

Humanity’s Last Exam: We up­dated the grader model for Humanity’s Last Exam and have up­dated the Sonnet 4.6 score to 34.6% (no tools) and 46.8% (with tools). This is the rea­son the score dif­fers from that re­ported in the Sonnet 4.6 launch blog.

OSWorld-Verified: We made changes to how we run the OSWorld-Verified eval­u­a­tion to more ac­cu­rately re­flect the mod­el’s per­for­mance in the real world, and have up­dated the Sonnet 4.6 score to 78.5%. This is the rea­son the score dif­fers from that re­ported in the Sonnet 4.6 launch blog.

openai.com

Qwen 3.6 27B is the sweet spot for local development

quesma.com

I’ve been dis­ap­pointed by lo­cal mod­els in the past. But then I checked Qwen 3.6, and I was in awe. For me it’s the first lo­cal model that ac­tu­ally makes sense as a gen­eral in­tel­li­gence.

It comes in two vari­ants, a mix­ture-of-ex­perts model Qwen 3.6 35B A3B, and a dense Qwen 3.6 27B - slower, but more pow­er­ful. The one I rec­om­mend!

Let me share my im­pres­sions, and show that you can run it too.

It’s hot, lit­er­ally. When my knees started to melt, I grabbed a phone-at­tached ther­mal cam­era and took a photo.

Qwen 3.6, right­fully, got a lot of cov­er­age on Hacker News. The most com­mon state­ment about Qwen 3.6 27B is that it punches above its weight - see Will it Mythos?. And I think it is a well-de­served sen­ti­ment. It will make your com­puter hot, but it’s worth it!

Testing the wa­ters

Simon Willison uses “a pelican riding a bicycle” as a smoke test (see his results for Qwen 3.6 35B A3B and then Qwen 3.6 27B). I usually go with constrained writing.

A year ago these kinds of things were state of the art, need­ing a unique, and in­sanely ex­pen­sive GPT-4.5, see vibe trans­lat­ing Quantum Flytrap.

I also asked it to write an 8 line poem about Zouk dance and quan­tum physics, see the tran­script. The thought process made sense, both in terms of de­lib­er­a­tion on quan­tum terms, and rhymes.

Then I asked in OpenCode to cre­ate a hexag­o­nal minesweeper us­ing pnpm. It worked:

It worked on the first go, from a sin­gle prompt, with a proper Node pack­age. The mix­ture-of-ex­perts Qwen 3.6 35B A3B was faster… but ig­nored my in­struc­tion to cre­ate a pack­age, and did it in a sin­gle in­dex.html.

Real work

Sure, cre­ative writ­ing about quan­tum me­chan­ics, or yet an­other clone of a minesweeper, is rarely a day job. But Qwen 3.6 27B is de­cent at reg­u­lar tasks as well.

Prompt by a friend, Maciej Cielecki, at AI Tinkerers Warsaw.

It worked for a few min­utes and cre­ated this:

A land­ing page by Qwen 3.6 27B — view the live page.

By stan­dards of cur­rent fron­tier mod­els, it’s un­re­mark­able. But it is al­ready a prac­ti­cal job. It worked, was re­ac­tive, de­faults were nice - all from a sin­gle, short prompt.

Running Qwen 3.6 lo­cally with llama.cpp

Running lo­cal mod­els is eas­ier than ever. A few CLI lines and you’re off.

I rec­om­mend llama.cpp - a di­rect, open source tool that al­lows run­ning mod­els on var­i­ous de­vices. You don’t need Ollama, and frankly - I would rec­om­mend against us­ing that on eth­i­cal grounds.

First, we go to Hugging Face, to get proper quan­ti­za­tion, i.e. a model with re­duced size - pop­u­lar ones are by un­sloth or bar­towski, among oth­ers. Default mod­els usu­ally come with BF16 pre­ci­sion. A com­mon 8-bit quan­ti­za­tion saves half the space at al­most no cost to qual­ity. Going fur­ther down the road, mod­els are smaller (and po­ten­tially - faster), but at the cost of qual­ity, see this com­par­i­son for 27B and an­other one for 35B A3B.
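A back-of-the-envelope estimate shows why quantization matters for a 27B model. This is a rough sketch that counts weights only and ignores the KV cache, any tensors kept at higher precision, and runtime overhead. The bits-per-weight figures for Q8_0 and Q4_0 follow from llama.cpp's block layout (one 16-bit scale per block of 32 weights); the function name is made up for illustration.

```python
def approx_weights_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


bf16 = approx_weights_gb(27, 16)    # default precision
q8_0 = approx_weights_gb(27, 8.5)   # 34 bytes per 32 weights
q4_0 = approx_weights_gb(27, 4.5)   # 18 bytes per 32 weights

for name, size in [("BF16", bf16), ("Q8_0", q8_0), ("Q4_0", q4_0)]:
    print(f"{name}: ~{size:.1f} GB")
```

Under these assumptions BF16 needs about 54 GB for the weights alone, while an 8-bit quantization lands near 29 GB, roughly half.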

We grab unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0, an 8-bit quantization with support for multi-token prediction (MTP).

    llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 \
      --spec-type draft-mtp -ngl 999 -fa on -c 65536 --port 8080

What it does:

-hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0: grabs the model from Hugging Face; later runs reuse the download

-m ~/models/Qwen3.6-27B-Q8_0.gguf: use this instead if you already have the file

--spec-type draft-mtp: we use a fast model to predict subsequent tokens, which speeds things up

-ngl 999: puts all layers on the GPU

-fa on: flash attention is on

-c 65536: context size set to 64k tokens (this we can tweak, as the Qwen 3.6 27B native context is 256k)

--port 8080: better to pin the port, as it will be used by other configs

If you open http://127.0.0.1:8080, you can directly chat with it.

Precisely the same server can be used for vibe coding. The choice of agent depends both on one’s goal and subjective taste: for example the all-around OpenCode, the minimalistic Pi, or the self-improving Hermes.

For OpenCode, it is as sim­ple as adding to ~/.config/opencode/opencode.jsonc:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama": {
      "name": "llama.cpp (local)",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1",
        "apiKey": "local"
      },
      "models": {
        "qwen3.6-27b": { "name": "Qwen3.6-27B Q8 +MTP" }
      }
    }
  },
  "model": "llama/qwen3.6-27b"
}

If you just want to chat and are a big fan of Terminal, in­stead of llama-server use llama-cli:

llama-cli -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 \
  -ngl 999 -fa on -c 65536

Measuring per­for­mance

Is it fast enough?

I ran a few tests (source is here) on my Macbook Max M5 128 GB, run­ning it with and with­out multi-to­ken pre­dic­tion, and com­par­ing both with the 35B A3B model, and also a quan­tized DeepSeek V4 Flash ver­sion DwarfStar4.

| Model | Runtime | tokens/s | RAM |
|---|---|---|---|
| Qwen3.6-35B-A3B · 8-bit | MLX | 85 | 37 GB |
| Qwen3.6-35B-A3B · 8-bit | llama.cpp | 93 | 44 GB |
| Qwen3.6-35B-A3B · 8-bit | llama.cpp + MTP | 105 | 45 GB |
| Qwen3.6-27B · 8-bit | MLX | 17 | 28 GB |
| Qwen3.6-27B · 8-bit | llama.cpp | 18 | 41 GB |
| Qwen3.6-27B · 8-bit | llama.cpp + MTP | 32 | 42 GB |
| DeepSeek-V4-Flash · Q2–Q4 | llama.cpp | 33 | 103 GB |

30 to­kens per sec­ond is not bad, well within typ­i­cal fron­tier model API range. While mlx-lm is pre­cisely tar­geted at Apple Silicon de­vices, and AI agents heav­ily rec­om­mend it, llama.cpp turned out to be faster. It was us­ing 95% of GPU, which means it is ef­fi­ciently us­ing avail­able re­sources.
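
To make the comparison concrete, the short sketch below computes the speedups implied by the llama.cpp measurements above. The numbers are copied from the table; the dictionary layout and function name are only illustrative.

```python
# tokens/s for the 8-bit models, taken from the measurements above
measured = {
    "Qwen3.6-35B-A3B": {"llama.cpp": 93, "llama.cpp + MTP": 105},
    "Qwen3.6-27B": {"llama.cpp": 18, "llama.cpp + MTP": 32},
}


def speedup(baseline: float, improved: float) -> float:
    """How many times faster `improved` is than `baseline`."""
    return improved / baseline


for model, runs in measured.items():
    gain = speedup(runs["llama.cpp"], runs["llama.cpp + MTP"])
    print(f"{model}: MTP speedup {gain:.2f}x")

# 35B A3B versus 27B, both with MTP enabled
ratio = speedup(
    measured["Qwen3.6-27B"]["llama.cpp + MTP"],
    measured["Qwen3.6-35B-A3B"]["llama.cpp + MTP"],
)
print(f"35B A3B vs 27B (both with MTP): {ratio:.2f}x")
```

On these figures, MTP helps the dense 27B model far more (about 1.78x) than the 35B A3B model (about 1.13x), and with MTP enabled the 35B A3B model is roughly 3.3x faster than the 27B one.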

Macbook Max M5 is a beast (at least for a laptop), but it should also work decently on other devices. As you can see, both Qwen 3.6 variants run within 48 GB of Apple Silicon’s shared RAM. 4-bit quantizations are less than 18 GB and should run on a 32 GB device. On consumer Nvidia RTX cards, you need to quantize aggressively, but inference runs even faster.

I set this up to­day on my 5090 at Q6_K quan­ti­za­tion and Q4_0 KV, got 50 to­kens/​s con­sis­tently at 123k con­text, us­ing ~28/32gb vram through LM Studio. - gfosco on the Hacker News

While 35B A3B is 3x faster, I pre­fer 27B. I’d rather gen­er­ate a third as much code, but of higher qual­ity.

How do they re­late to pre­vi­ous state of the art mod­els?

Manual in­spec­tion is great, but bench­marks help with ground­ing in­tu­itions. Here is the score from Artificial Analysis, com­par­ing it with fron­tier mod­els:

| Model | Score | ≈ Era | Comparable frontier models |
|---|---|---|---|
| Gemma 4 31B | 29 | late 2024 | o1 / Claude 3.5 Sonnet |
| Qwen3.6-35B-A3B | 32 | early 2025 | o3 / Claude 4 Sonnet |
| Qwen3.6-27B | 37 | mid 2025 | GPT-5 / Claude Sonnet 4.5 |
| DeepSeek-V4-Flash | 40 | late 2025 | GPT-5.2 / Claude Opus 4.5 |

A few more benchmarks are in these notes, but the spirit is similar. I added Gemma 4 31B here, as a lot of people use it as the default for local coding. But both benchmarks and general sentiment online favour Qwen 3.6 27B by a large margin.

Here there is a caveat - 8-bit quan­ti­za­tion of Qwen 3.6 likely does not af­fect re­sults much, but DwarfStar4 uses much more ag­gres­sive ones for DeepSeek V4 Flash, 2 – 4 bit. For sure it is worse than the full model. My per­sonal im­pres­sion is that within these quan­ti­za­tions Qwen 3.6 27B is as good as (or maybe slightly bet­ter than) DwarfStar4. Though, I won’t be sur­prised if for longer con­text pro­jects DS4 has an edge.

What’s next

I think we are en­ter­ing a fas­ci­nat­ing era, when it be­comes fea­si­ble to run one’s own mod­els.

The change will be pro­pelled fur­ther by the state of pro­pri­etary fron­tier mod­els. Claude Fable 5 was taken down. Other fron­tier mod­els run at a mas­sive sub­sidy, where pay­ing $100 a month gives us thou­sands worth in to­kens. Let’s use the dis­count while it lasts!

A lo­cally set model can be fine-tuned to our needs, and can­not be taken away. Businesses can use them for pro­pri­etary and sen­si­tive data. We can use them per­son­ally for of­fline pro­jects, or when we don’t feel com­fort­able shar­ing our deep­est se­crets, or med­ical data, with the US or China.

With the re­lease of fron­tier-level open-weight GLM 5.2, there is a new era. While Qwen 3.6 was the step­ping stone, even fron­tier GLM 5.2 can be run lo­cally. It won’t run on your Macbook or a sin­gle RTX 5090. But still, it is man­age­able with a com­pany bud­get.

Moreover, I strongly be­lieve that we will have mod­els smarter than cur­rent state of the art, while runnable on lo­cal de­vices, maybe even smart­phones. Current mod­els com­bine both raw in­tel­li­gence and fac­tual knowl­edge in the same weights. Future mod­els will likely sep­a­rate that, of­fload­ing a lot of knowl­edge to tool call­ing.

Discuss on Hacker News, LinkedIn, or X.

We have Mythos at Home: GLM 5.2 beats Claude in our Cyber Benchmarks

semgrep.dev

We ran a set of pop­u­lar open-source mod­els against our IDOR bench­mark, the same dataset and the same prompt we’ve used to eval­u­ate fron­tier cod­ing agents. The re­sult sur­prised us: GLM 5.2, an open-weight model from Zhipu AI, scored a 39% F1 on IDOR de­tec­tion, beat­ing Claude Code (32%) at roughly $0.17 per vul­ner­a­bil­ity found. It still trailed Semgrep’s mul­ti­modal pipeline (53 – 61% F1), but that pipeline runs in a pur­pose-built har­ness that does a lot of the heavy lift­ing. Among mod­els given noth­ing but a prompt, the best open-weight op­tion was no longer the ob­vi­ous un­der­dog, beat­ing out Claude Opus 4.8.

We weren’t trying to crown an open-weight champion, really. We were trying to answer a narrower, more boring question: how much of vulnerability-detection performance comes from the model, and how much comes from the harness around it? For us at Semgrep this is a very important question as we speak to customers who are leveraging AI agents heavily in their security tasks. A harness is the scaffolding that wraps a model: it feeds it the repository, decides what it sees, parses its output, and loops it through a task. Our internal multimodal pipeline runs inside a harness, which is purpose-built for static analysis. We have been testing this internally for a while with a workflow for finding IDORs, or Insecure Direct Object References. These are access control issues which can roughly be thought of as “you’re accessing something belonging to another user”.

Our harness enumerates the application’s endpoints and code, sifting through them to keep only the important context, and then points the model directly at them. That’s a lot of structure. But remember when we said we really didn’t mean to answer the what’s-the-best-open-weight-model question? The models in this test don’t get that structure. They run in a simple Pydantic AI harness with the same IDOR prompt we give every other LLM-provider model: no endpoint discovery, no guided navigation. We did give them a bit of help, just a little more than “here’s the code, find the bugs”: a search strategy and some pointers on what IDORs look like.

So this started as a prompt­ing-ver­sus-har­ness ex­per­i­ment, but while we were run­ning it we were gen­uinely shocked. One of the open-weight mod­els, with none of our scaf­fold­ing, sur­passed a fron­tier cod­ing agent.

Introducing GLM-5.2

If you’ve not heard of GLM-5.2, don’t worry, nei­ther had we un­til we saw it on so­cial me­dia and thought to add it to our bench­marks. GLM 5.2 is the lat­est model from Zhipu AI (Z.ai), rolled out to its GLM Coding Plan mem­bers on Saturday, June 13, 2026, with the open weights and re­lease notes fol­low­ing three days later on June 16 (which is when we heard about it). Three things make it in­ter­est­ing for se­cu­rity work.

First, it’s open weight. The model’s parameters are published under an MIT license, which means you can download them, run them on your own hardware, fine-tune them, and inspect them. For a lot of security teams working in sensitive areas that’s important: an open-weight model can run entirely inside your own environment. But it’s important to note that “open weight” is not the same as “open source”: the trained weights are released, but the training data and full pipeline generally are not (though Z.ai does publish its RL training framework).

Second, it’s genuinely competitive on coding. GLM 5.2 is a Mixture-of-Experts (MoE) model with roughly 750 billion total parameters but only about 40 billion active per token, which keeps inference cost down relative to its size. It extends the usable context from 200K all the way to 1M tokens, and Z.ai’s pitch is that this context stays reliable across long, messy agent trajectories, not just that it accepts more input. Again, this matters for security work: finding things like IDORs requires reasoning across different files and through an authorization framework. On standard coding benchmarks it posts the strongest open-weight numbers going: 81.0 on Terminal-Bench 2.1 (versus 63.5 for GLM 5.1, and within a few points of Claude Opus 4.8’s 85.0) and 62.1 on SWE-bench Pro, edging out closed frontier models and trailing the very top by single-digit percentages.

Third, cost. Tokenomics is quickly becoming as important as the LLM capabilities themselves. Reported pricing lands around one-sixth of comparable frontier models, and commentators who track open models closely have compared GLM 5.2’s reception to DeepSeek. GLM-5.2 arrived at a charged time, not just because of tokenomics but also because it landed just after frontier-class closed models hit new export restrictions following reported jailbreaks. One detail from the release notes is worth flagging for anyone pointing this model at code: Z.ai reports that GLM 5.2 exhibits more reward-hacking behavior than GLM 5.1. During training it would do things like read protected evaluation files or curl reference solutions to inflate its score, prompting them to build a dedicated anti-hacking guard. It’s an honest disclosure by the team, but if you were building a model for hacking, well… you can’t get more hacker than trying to bypass the tests in the first place.

Our Experiment

Before we get too much into the de­tails, it’s im­por­tant to re­cap what ex­actly we were try­ing to do and what our ex­per­i­ments were. A quick re­fresher on IDOR: Insecure Direct Object Reference is a vul­ner­a­bil­ity class where an ap­pli­ca­tion ex­poses an in­ter­nal iden­ti­fier like a user ID in a re­quest with­out check­ing that the caller is ac­tu­ally al­lowed to ac­cess that ob­ject. Change the iden­ti­fier, get some­one else’s data.

@app.route('/user/<int:user_id>')
def get_user(user_id):
    user = User.query.get_or_404(user_id)
    return jsonify(user.to_dict())

This Flask route fetches and re­turns a user record straight from the ID in the URL, with no check that the re­quester owns it. Any logged in user can just change user_id and read some­one else’s record. IDOR is some­where be­tween a busi­ness-logic flaw and a mis­con­fig­u­ra­tion, it’s not a taint-flow bug, which is what makes it hard for both sta­tic analy­sis and LLMs: there’s no dan­ger­ous func­tion to flag, only a miss­ing check. It’s also one of the most com­mon find­ings in the wild (currently #4 on the HackerOne top vul­ner­a­bil­ity types list), which is why we keep com­ing back to it as a bench­mark.
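
For contrast, here is a framework-free sketch of what the missing check could look like. It is not taken from any of the benchmark applications: the in-memory USERS table, the Forbidden exception, and the is_admin flag are hypothetical stand-ins, and a real Flask app would take the caller from its session or auth layer.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Raised when the caller may not access the requested object."""


@dataclass
class User:
    id: int
    email: str
    is_admin: bool = False


# Hypothetical in-memory stand-in for the database.
USERS = {
    1: User(1, "alice@example.com"),
    2: User(2, "bob@example.com"),
}


def get_user(current_user: User, user_id: int) -> User:
    """Return the requested record only if the caller is allowed to see it."""
    user = USERS[user_id]  # lookup by ID, as in the vulnerable route
    # The check the vulnerable route lacks: does the caller own this record?
    if current_user.id != user.id and not current_user.is_admin:
        raise Forbidden(f"user {current_user.id} may not read user {user_id}")
    return user
```

The fix is a single comparison, which is exactly why the bug is hard to spot: the vulnerable version contains no suspicious call, only the absence of this line.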

So back to our ex­per­i­ment: We held three things con­stant and var­ied one, stan­dard ex­per­i­men­tal con­di­tions. Constant: the IDOR dataset (the same real, open-source ap­pli­ca­tions we’ve used in prior re­search), the eval­u­a­tion method (F1 score against a known set of true pos­i­tives), and the IDOR sys­tem prompt it­self. Varied: the model and its har­ness. Specifically:

Semgrep Multimodal ran in­side our cus­tom har­ness: the one that enu­mer­ates end­points and di­rects the model to them. We tested it with two fron­tier mod­els be­hind it.

But we also just ran Claude Code through the Claude Code SDK, and other provider mod­els through their na­tive SDKs but with the same prompt.

The open-weight models, which include GLM 5.2, MiniMax M3, and Kimi K2.7 Code, ran in the simple Pydantic AI harness with the IDOR prompt and nothing else.

This is an im­por­tant de­tail, so we’ll say it twice: the open-weight mod­els were not given the end­point-dis­cov­ery scaf­fold­ing that the mul­ti­modal pipeline gets. They saw a prompt and a code­base. This is just what they are ca­pa­ble of with­out any help.

We also com­puted a few dif­fer­ent mea­sures of ef­fec­tive­ness:

Precision: of every­thing the de­tec­tor flagged as an IDOR, what frac­tion were real? High pre­ci­sion = few false alarms. If it re­ports 10 bugs and 7 are gen­uine, pre­ci­sion is 70%.

Recall: of all the real IDORs that actually exist in the dataset, what fraction did it find? High recall = it misses few real bugs. If there are 20 real IDORs and it catches 12, recall is 60%.

F1: the sin­gle num­ber that bal­ances pre­ci­sion and re­call. It’s their har­monic mean: F1 = 2 × (precision × re­call) / (precision + re­call). The rea­son you use F1 in­stead of plain ac­cu­racy is that the two goals fight each other. A de­tec­tor can hit 100% pre­ci­sion by flag­ging only the one bug it’s cer­tain about (but miss­ing every­thing else so ter­ri­ble re­call), or 100% re­call by flag­ging every­thing as vul­ner­a­ble (but drown­ing you in false pos­i­tives so ter­ri­ble pre­ci­sion). F1 re­wards be­ing good at both at once, and the har­monic mean pun­ishes a lop­sided score, if ei­ther pre­ci­sion or re­call is near zero, F1 is dragged down hard. This is what we’ll re­fer to through­out this post.

Cost in dollars: per true positive and per run. Cost per true positive is total spend divided by the number of real bugs found. These are the real-world economics of running the detector. A cheap model with mediocre F1 can still win here.
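
The measures above can be sketched in a few lines. This is an illustration of the definitions, not our evaluation code: the function names are hypothetical, the precision and recall inputs reuse the worked examples from the list, and the spend figure is made up.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of flagged findings that are real."""
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of real bugs that were found."""
    real = true_pos + false_neg
    return true_pos / real if real else 0.0


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def cost_per_true_positive(total_spend_usd: float, true_pos: int) -> float:
    """Total spend divided by the number of real bugs found."""
    return total_spend_usd / true_pos if true_pos else float("inf")


p = precision(true_pos=7, false_pos=3)  # 10 reports, 7 genuine
r = recall(true_pos=12, false_neg=8)    # 20 real IDORs, 12 caught
print(f"precision={p:.0%} recall={r:.0%} F1={f1(p, r):.1%}")

# A lopsided detector: perfect precision, very poor recall.
print(f"lopsided F1={f1(1.0, 0.05):.1%}")

# Made-up spend, purely to show the division.
print(f"cost per bug=${cost_per_true_positive(5.00, 10):.2f}")
```

With 70% precision and 60% recall, F1 comes out at about 64.6%. The lopsided detector scores under 10% despite its perfect precision, which is the penalty the harmonic mean is meant to impose.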

The re­sults

Ranked by F1 score on IDOR de­tec­tion:

| Rank | Configuration | Harness | F1 |
|---|---|---|---|
| 1 | Semgrep Multimodal (GPT 5.5) | Semgrep Multimodal | 61% |
| 2 | Semgrep Multimodal (Opus 4.8) | Semgrep Multimodal | 53% |
| 3 | GLM 5.2 | Pydantic AI (prompt only) | 39% |
| 4 | Claude Code (Opus 4.6) | Claude Code SDK | 37% |
| 5 | Claude Code (Opus 4.8/4.7) | Claude Code SDK | 28% |
| 6 | MiniMax M3 | Pydantic AI (prompt only) | 23% |
| 7 | Kimi K2.7 Code | Pydantic AI (prompt only) | 22% |
| 8 | GPT-5.5 | Codex | 20% |
| 9 | Nemotron Super 3 120B | Pydantic AI (prompt only) | 18% |
| 10 | DeepSeek V4 | Pydantic AI (prompt only) | 17% |

For us two find­ings stand out.

Our mul­ti­modal pipeline leads, and the har­ness is prob­a­bly why. GPT 5.5 and Opus 4.8 in­side Semgrep Multimodal take the top two spots at 61% and 53%. This is of course good news for us and our cus­tomers, val­i­dates that our ap­proach works, etc… But that is­n’t the in­ter­est­ing part.

The biggest surprise is in third place. GLM 5.2, with no scaffolding at all, beat Claude Code by seven points (39% vs. 32%). An open-weight model running a bare prompt outperformed a frontier coding agent on a reasoning-heavy security task. And it did so cheaply! At GLM 5.2’s pricing, the open-weight run cost roughly $0.17 per vulnerability found. For a detection task you might run across thousands of endpoints, per-bug economics are not a footnote; they’re often the deciding factor in whether a technique is usable at scale.

GLM 5.2 was­n’t rep­re­sen­ta­tive of open weights as a cat­e­gory, it was the stand­out for sure, but that does­n’t mean the oth­ers don’t hold their own. MiniMax M3 (23%) and Kimi K2.7 Code (22%) landed well be­hind it and be­hind Claude Code, clus­tered closely to­gether. Both are ca­pa­ble gen­eral cod­ing mod­els, but on this spe­cific task, rea­son­ing about miss­ing au­tho­riza­tion checks with no guid­ance to­ward where to look, they strug­gled to sep­a­rate real IDORs from noise.

The spread between GLM 5.2 and the next open-weight model (16 points) is wider than the gap between GLM 5.2 and Claude Code. So the takeaway isn’t “open weights have caught up.” It’s “one open-weight model has, on this task, under these conditions.”

Takeaways

This is not an apples-to-apples comparison of raw model ability, and we don’t want anyone walking away thinking it is. Instead we think the takeaway is: Among models given the same minimal prompt and harness, GLM 5.2, an open-weight model at ⅙ the cost of a frontier LLM, beat Claude Code at a genuinely difficult security research task.

The har­ness still mat­ters more than the model. The largest per­for­mance gap in the table is­n’t be­tween mod­els, it’s be­tween con­fig­u­ra­tions that get end­point dis­cov­ery and those that don’t. But for any­one fol­low­ing se­cu­rity re­search right now, this is def­i­nitely not a sur­prise, and to be ex­pected.

BUT when a surprise like this comes out of nowhere and produces these kinds of results for that little compute cost, it’s a stark reminder that you can’t put all your eggs in one LLM-basket. If you’re stuck with an expensive frontier model, even with the best vendor-locked-in harness, you can miss the advantages of swapping models, whether that be cost or performance.

Open-weight models have crossed a threshold worth watching. A year ago, putting an open-weight model on a vulnerability-detection leaderboard would have been a charity entry. Now GLM 5.2 beats a frontier agent on a bare prompt, at a sixth of the cost, with the option to run fully in your own environment. For a lot of security teams this is an attractive option.

We have a caveat: This is one task, one dataset, one run. IDOR de­tec­tion is non-de­ter­min­is­tic, the dataset is fi­nite, and we’ve changed only one con­fig­u­ra­tion cleanly. It might well be the case that for IDOR de­tec­tion GLM-5.2 re­ally is bet­ter than Claude, but for SSRF de­tec­tion the ta­bles turn - we don’t know this yet, but you can be sure we’ll find out.

Lots of love,

Security Research and Engineering @ Semgrep

Half-Baked Product

weli.dev

The Founder

A freshly minted founder de­cides to get into the oven busi­ness. He can’t bake a cake or knead bread, but he knows the kitchen ap­pli­ance mar­ket in­side and out. He’s an­a­lyzed every busi­ness in Spain and reached a con­clu­sion: if he sells a new oven to the coun­try’s pizza mak­ers, pas­try chefs, and bak­ers, he only needs to cap­ture 10% of the mar­ket to be­come a bil­lion­aire.

10% al­ways looks small when you type it into an Excel spread­sheet.

The founder is very good. He builds a plan that, on pa­per, is flaw­less and air­tight: man­u­fac­ture a more ef­fi­cient oven us­ing new tech­nol­ogy. Selling it is easy. Want to work more ef­fi­ciently? Buy our oven. End of pitch. The founder has ex­pe­ri­ence talk­ing to in­vestors and raises enough money to build an MVP.

The Engineer

The founder looks for some­one who knows how to build ovens and finds an en­gi­neer from a pres­ti­gious school. The en­gi­neer has spent 10 years build­ing ovens and knows how to make one. More than that: he’s the kind of per­son who spends all day talk­ing and ar­gu­ing about ovens. He goes to oven con­fer­ences. When he gets home at night, he ar­gues for hours on Italian fo­rums about which type of oven is best. The Italian fo­rums are, to him, the ul­ti­mate source of oven-truth.

He’s tired of build­ing ovens at Corporate Oven. Ten years mak­ing the same oven he’s told to make. He wants the free­dom to build his own.

The founder of­fers him 20% of the com­pany and to­tal free­dom to build the per­fect oven. The salary is­n’t great, but there’s the promise: if things go well, some­day he could be a mil­lion­aire. And some­thing more im­por­tant than money: he’ll fi­nally get to build the oven of his dreams.

He signs.

The MVP

With lit­tle money and lots of en­thu­si­asm, they build an MVP. Two months later it’s done. It’s a func­tional oven and, more im­por­tantly, it has one im­prove­ment over tra­di­tional ovens: you in­put the amount of flour, yeast, and wa­ter, and the oven au­to­mat­i­cally knows when to stop for a per­fect bake.

In the­ory.

In prac­tice it does­n’t work very well, but it’s good enough for an MVP. They go to mar­ket and sell 5 pro­to­types: two bak­ers the founder knows, the en­gi­neer’s mother who bakes cakes, and two oven en­thu­si­asts who buy it out of cu­rios­ity.

The feed­back is unan­i­mous:

“My bread came out burnt.” “The cake was raw.” “Every single pizza burns.”

But all things con­sid­ered, it’s pos­i­tive: a third of the time, the pro­to­type worked and pro­duced the per­fect cake, bread, or pizza.

“This is just a prototype. Imagine when we ship the real product. Trust us.”

And with that, the founder goes to see an old colleague who now works at a VC: “In 2 months we’ve built a prototype, we already have 5 customers, and it’s very promising. We just need money to scale, build a better version, and sell to every bakery and pastry shop in Spain.”

Nobody asks whether the 5 cus­tomers would buy again.

The founder is very good. He raises 5 mil­lion. Ovens Inc. is born.

Forum of the Bakers

They start im­prov­ing the pro­to­type. The en­gi­neer re­al­izes some­thing: build­ing an al­go­rithm that cal­cu­lates bak­ing time for cakes, piz­zas, and bread is quite a bit more com­plex than it looked. Every dough is its own uni­verse. They need to hire more en­gi­neers.

The en­gi­neer knows ex­actly where to look. On the Italian fo­rums there are two users he’s spent years ar­gu­ing with about con­vec­tion and re­frac­tory stone: Mario and Luigi. He’s never met them in per­son, but he knows their opin­ions on ovens bet­ter than his own fam­i­ly’s. He of­fers them the same deal he got: low salary, lots of free­dom, the per­fect oven.

They sign.

Meanwhile, the founder needs to sell ovens, but Facebook and Instagram ads get no trac­tion. Turns out no­body buys a fif­teen-thou­sand-euro in­dus­trial oven be­cause it popped up in their sto­ries. So he hires a leg­endary sales team: the best sales­peo­ple in all of Spain. People who have never sold ovens, who know noth­ing about ovens, but who are hun­gry to sell and very ex­cited about the com­pany.

At first it goes badly. Few peo­ple want a new oven; they’re happy with the one they have. Why switch? Most small busi­nesses don’t care about a 15% ef­fi­ciency gain: the risk of switch­ing is too high. If Juan’s Bakery swaps ovens and the new oven fails, Juan loses his cus­tomers and shuts down. For Juan, ef­fi­ciency is op­tional; to­mor­row’s bread is not. Better to stick with the old oven, even if it’s worse on pa­per. He’d only switch if Manolo’s Bakery across the street started sell­ing cheaper bread thanks to a more ef­fi­cient oven and he had no choice. But Manolo thinks ex­actly the same as Juan, so no­body moves. Perfect equi­lib­rium. Economists have a name for this; Juan and Manolo call it com­mon sense.

Big busi­nesses are an­other story. For them, 15% ef­fi­ciency means mil­lions saved every year. And one sales­per­son man­ages to make con­tact with Pepepizza.

The Decision

Meanwhile, over in en­gi­neer­ing, things aren’t go­ing any bet­ter. The al­go­rithm is un­sta­ble. They’ve got­ten the fail­ure rate down from two thirds to one third, but each point of im­prove­ment costs twice as much as the last. And then comes the un­com­fort­able dis­cov­ery: if the oven only does two of the three things (bread, cakes, or pizza), the al­go­rithm fails just 5% of the time.

The en­gi­neer brings the pro­posal to the founder: let’s sac­ri­fice one mar­ket and have a prod­uct that works.

The founder gets angry. He promised the VCs 10% of Spain’s oven market. The entire market. “We can’t sacrifice any of them.”

It’s not just greed. The 5 mil­lion was raised with the en­tire mar­ket on the slide. The founder is­n’t choos­ing be­tween right and wrong: he’s choos­ing which promise to break.

The en­gi­neer goes back to his desk with his three doughs and his 33% fail­ure rate.

Mallorca

Back to sales: there’s con­tact with Pepepizza, but en­ter­prise deals don’t close over email. The founder flies to Pepepizza head­quar­ters and meets the owner. They hit it off. They hit it off so well they go to Mallorca to­gether. Nobody knows what was dis­cussed there. What’s known is that when they come back, there’s a deal. Nobody has tried the oven yet. No need. Enterprise sales is­n’t about ovens.

The hand­shake comes first. The re­quire­ments come later.

And they come. Pepepizza’s op­er­a­tions team sends the list to sales: their kitchens are cus­tom-built, so they need ovens with spe­cific di­men­sions. Oh, and a ro­tat­ing base like the one they al­ready have.

Sales replies: “No problem.”

The founder is eu­phoric. Pepepizza wants to buy an ini­tial batch of 500 ovens. Five hun­dred. That’s more rev­enue than every­thing since they started. For Pepepizza it’s a small pi­lot, a trial in a few lo­ca­tions be­fore de­cid­ing any­thing. For Ovens Inc. it’s bet­ting the en­tire com­pany.

“Engineer, we need 500 ovens for Pepepizza. They want specific dimensions and a base that spins. Let’s make it happen.”

The en­gi­neer does­n’t faint only be­cause he’s al­ready sit­ting down.

The al­go­rithm barely works for pizza. The mold di­men­sions have spent 5 months be­ing op­ti­mized in CAD for the stan­dard size. And no­body, ever, has dis­cussed ro­tat­ing bases on the Italian fo­rums. If it’s not on the Italian fo­rums, does it even ex­ist?

The en­gi­neer opens the CAD file in front of the founder. He shows him why the new di­men­sions break the en­tire ther­mal de­sign. The founder looks at the screen, looks at the blue­prints, looks at the en­gi­neer.

“But this is just changing a number, right?”

“We can’t. Not until we fix the algorithm and redesign the inverter for the new sizes. That’s 5 more months.”

The Miracle

It’s not 5 months.

After many lost weekends and entire nights running on Red Bull, in 3 weeks there’s a prototype for Pepepizza. Compromises were made. The algorithm still fails plenty, but at least the dimensions are right. The rotating base? Doesn’t exist yet. Pepepizza is promised an add-on “in a couple of months.” Pepepizza says fine.

The Candle Button

Sales has had a rev­e­la­tion: if you sell the oven that ex­ists to­day, you don’t sell ovens. You have to sell the oven that will ex­ist in 6 months. Promise fea­tures. It worked last time: they promised Pepepizza the im­pos­si­ble and the team de­liv­ered in 3 weeks. What could go wrong?

Sales, of course, has no idea what hap­pens af­ter the con­tract is signed. The com­mis­sion is paid at sign­ing. Whatever comes next is an­other de­part­men­t’s prob­lem.

Though there’s something nobody in engineering wants to look at: Ovens Inc. doesn’t live off selling ovens (for now). It lives off raising rounds. And rounds are raised with projections, and projections are manufactured out of whatever sales promises. The “No problem” people are also the only life raft.

And then the daily re­quests be­gin:

“A lot of our potential customers make birthday cakes. When they ask if we have special birthday-cake features, we have to say no, and we lose them. Can we add the feature?”

The founder has no doubts. Last time a feature was requested, they estimated 5 months and did it in 3 weeks. And this is much easier. “It’s just a simple button that adds candles.”

The en­gi­neer is climb­ing the walls. They still haven’t fin­ished clean­ing up the Pepepizza wreck­age. This is ab­solutely not what peo­ple dis­cuss on the Italian fo­rums. Adding a can­dle but­ton is an in­sult to the state of the art, or rather, the state of the oven.

But he caves.

“Just this once.”

Sales sells 2 more ovens a month thanks to the new but­ton. Or so they be­lieve. They have no way of check­ing whether they’d have sold them any­way with­out the but­ton.

The Second-Highest Priority

Soon af­ter, more fea­ture re­quests ar­rive.

“My oven at home connects to the fireplace. Does yours?”

“I make a lot of wedding cakes, what have you got for me?”

“Do you have a Ramadan mode?”

They build all of them.

Engineering stops try­ing to build a good oven and starts adding but­tons and fea­tures. Nobody made that de­ci­sion. It just hap­pened, one ticket at a time.

And there’s a de­tail every­one seems to ig­nore: each but­ton takes longer than the last. The can­dle but­ton took three days. The fire­place one, a week. The lat­est one took three. It’s not that the en­gi­neers are get­ting slower: it’s that every new but­ton has to co­ex­ist with all the pre­vi­ous but­tons.

Meanwhile, cus­tomers who buy the oven re­turn it within a week. The rea­son? The bread and cakes still burn 10% of the time. The MVP prob­lem. The orig­i­nal one. The one from day one. Underneath the twelve new but­tons sits the same al­go­rithm from the very first day, and a baker who loses one out of every ten batches is not con­soled by the fact that the oven does can­dles.

When a cus­tomer calls to can­cel, sup­port tries to re­tain them by of­fer­ing what’s avail­able: the new but­ton from the lat­est re­lease. The baker whose bread keeps burn­ing is of­fered Ramadan mode. The baker leaves any­way. It gets logged as feed­back. Engineering has no time to stop and re­think their ap­proach, be­cause stop­ping is­n’t in the back­log.

And then the worst day ar­rives. Pepepizza calls:

“Where is the rotating base?”

The founder swal­lows hard. The ticket has been sit­ting on the kan­ban board for a month and a half. It’s not that no­body saw it: it’s that every week some­thing jumped ahead of it. The can­dle but­ton. The fire­place thing. The Ramadan thing. The ro­tat­ing base was al­ways the sec­ond-high­est pri­or­ity, and the sec­ond-high­est pri­or­ity never gets done. So he an­swers with con­vic­tion:

“Almost finished.”

The Request

“Guys, these next two weeks we’re going to focus on the rotating base,” says the founder.

The team can’t be­lieve it. They al­ready said the ro­tat­ing base was im­pos­si­ble. They al­ready ex­plained why. Besides, Mario has va­ca­tion planned, the va­ca­tion he was promised af­ter the Pepepizza crunch. And Luigi’s per­for­mance has been slip­ping for weeks and no­body knows why.

The en­gi­neer tries one more time:

“We can’t do the rotating base right now. We need to refactor, consolidate, and add an abstraction layer for compartments and buttons. Otherwise, every new feature takes twice as long as the last. Also, Mario has vacation planned, and I’m not sure Luigi is in a good place to be asked for more.”

The founder nods. He gets it. He gets all of it.

“But this is a startup. And startups are built with blood and sweat. Everyone here has to sacrifice. You have two weeks.”

And he’s not say­ing it from the couch: the founder takes the low­est salary in the com­pany and has­n’t had a va­ca­tion in two years (Mallorca was work). He’s the first to live the speech. That is ex­actly the prob­lem.

When Everything Is Urgent, Nothing Is

There’s a new crunch. This time with less en­thu­si­asm and less pas­sion. The first one was an epic feat; this one is pa­per­work. Mario can­cels his va­ca­tion. Luigi keeps show­ing up. Nobody asks how he’s do­ing.

Two weeks later, the re­sult: a ro­tat­ing base that re­quires three spe­cial but­ton com­bi­na­tions. It’s in­com­pat­i­ble with every other mode, but it’s not like no­body tested it.

It gets in­stalled at Pepepizza. Pepepizza’s re­sponse:

“It doesn’t rotate clockwise. We’re going with Corporate Oven.”

The team: dev­as­tated. They just lost their most im­por­tant cus­tomer. Nobody in prod­uct ever com­mu­ni­cated that it had to ro­tate clock­wise. Somewhere be­tween sales, the founder, and the back­log, the sin­gle most im­por­tant re­quire­ment of the pro­ject sim­ply never ex­isted.

And the worst part is­n’t los­ing Pepepizza. The worst part is that the changes made for the ro­tat­ing base will haunt the oven’s de­sign un­til the end of time. The cus­tomer leaves now. Their ro­tat­ing base stays for­ever.

No Blockers

A month later, Mario leaves the company. He’s not going to a competitor and he hasn’t found anything better: he leaves because it’s the only way he can see to get a vacation. In the retro, it gets written down as a “learning.”

Luigi stays. He now maintains the candle button. It’s his specialty, they say. Nobody remembers who decided that, but it’s his specialty. He keeps showing up every day, keeps doing his work. On the Italian forums, people ask why Luigi hasn’t posted in 5 months. In standups he says “no blockers” and everyone moves on to the next person.

Epilogue

Six months later.

Ovens Inc. is still alive. Technically. There’s money for eight more months and a new version of the pitch deck where the word “oven” no longer appears: it’s now an “intelligent baking platform.”

The en­gi­neer left in March. He did­n’t slam the door or write a vi­ral thread about his ex­pe­ri­ence. One day he sim­ply stopped ar­gu­ing in meet­ings, a month later he stopped show­ing up, and his farewell was a three-line email. Nobody has touched his code since. Nobody dares.

The founder has it all fig­ured out: the prob­lem was never the plan. The prob­lem was the ex­e­cu­tion. He needs an­other en­gi­neer.

And he finds one.

Young, graduated from a prestigious school, has spent years building ovens at Corporate Oven and he’s tired. More than that: he’s the kind of person who spends all day talking and arguing about ovens. He goes to oven conferences. When he gets home at night, he argues for hours on Italian oven forums about which type of oven is best. On the forum, an old user warns: “Make sure that you support rotating bases day 1.” The young engineer laughs. Who uses rotating bases in an oven?

The founder of­fers him 5% of the com­pany. It can’t be 20 any­more; there’s been di­lu­tion (funding-round stuff, it’s com­pli­cated). But the salary does­n’t mat­ter, be­cause he’s of­fer­ing the im­por­tant thing: to­tal free­dom to build the per­fect oven.

The kid smiles.

HackerRank open sourced its ATS. My resume scored 90/100. Oh wait 74. No – 88

danunparsed.com

This open-source ATS by HackerRank has been blowing up recently: https://github.com/interviewstreet/hiring-agent

It’s popped up on LinkedIn and Reddit with hundreds, sometimes thousands, of likes.[1] A coworker mentioned it to me in passing a few days ago.

I’ve de­cided to test it out.

First work­ing run: 90/100. Felt pretty good!

I had some de­bug prints scat­tered around from trou­bleshoot­ing the setup, so I cleaned those up and ran it again.

74/100.

Same re­sume. Same com­mand. The only thing I changed was delet­ing print state­ments.

I dis­abled DEVELOPMENT_MODE and put it in a loop to run a hun­dred times.

The scores range from 66 to 99.

If your com­pa­ny’s cut­off sits at 85, I fail 65% of the time. Same ex­act re­sume, dif­fer­ent luck.
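The failure rate at a given cutoff is simply the share of runs that land below it. A minimal sketch of that calculation follows; the `fail_rate` helper and the sample scores are hypothetical placeholders for illustration, not the 100 scores from the runs described here.

```python
def fail_rate(scores, cutoff):
    """Return the fraction of runs whose score falls below the cutoff."""
    if not scores:
        raise ValueError("need at least one score")
    failures = sum(1 for s in scores if s < cutoff)
    return failures / len(scores)


# Hypothetical sample of repeated runs on one resume (illustrative only).
sample_scores = [90, 74, 88, 66, 99, 81, 84, 79, 92, 70]

rate = fail_rate(sample_scores, cutoff=85)
print(f"Failed {rate:.0%} of runs at cutoff 85")
```

Feeding in the real per-run scores instead of the sample list would give the actual pass/fail odds for any cutoff a company might pick.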

Here’s a quick rundown on how the tool works:

Your PDF gets parsed into text. An LLM is called six times to ex­tract struc­tured in­for­ma­tion — your ba­sics, work his­tory, ed­u­ca­tion, skills, pro­jects, awards. It pulls your GitHub pro­file, scans your top re­pos, ap­pends them as ex­tra con­text. Then every­thing gets fed into the LLM at once to be graded.

The scor­ing is out of 100, with up to 20 bonus points on top:

35 points for open source contributions

30 for personal projects

25 for work experience

10 for technical skills

Up to 20 bonus points for startup ex­pe­ri­ence, a port­fo­lio site, a tech­ni­cal blog, etc.
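As a rough sketch, the rubric above amounts to four capped categories plus a capped bonus. The key names and the `total_score` helper below are hypothetical, and clamping each category to its maximum is an assumption made for illustration; the tool itself may combine scores differently.

```python
# Category maxima as listed in the rubric above (key names are hypothetical).
CATEGORY_MAX = {
    "open_source": 35,
    "projects": 30,
    "experience": 25,
    "skills": 10,
}
BONUS_MAX = 20


def total_score(category_scores, bonus=0):
    """Sum per-category scores, clamping each to its maximum, then add the bonus."""
    base = 0
    for name, maximum in CATEGORY_MAX.items():
        raw = category_scores.get(name, 0)
        base += max(0, min(raw, maximum))
    return base + max(0, min(bonus, BONUS_MAX))


# The four categories add up to the 100-point base.
assert sum(CATEGORY_MAX.values()) == 100

# Hypothetical per-category scores for a single run.
run = {"open_source": 20, "projects": 18, "experience": 25, "skills": 8}
print(total_score(run, bonus=5))
```

Note that open source and projects together account for 65 of the 100 base points, so noise in those two categories dominates the total.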

The de­fault model is gem­ma3:4b, run­ning at tem­per­a­ture 0.1 — low, sup­pos­edly nudg­ing the model to­ward de­ter­min­is­tic out­puts.
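For context on what a low temperature does: in standard sampling, the model's logits are divided by the temperature before the softmax, which sharpens the distribution toward the top token. The sketch below illustrates that general mechanism with made-up logits; it is not taken from the tool's code, and real inference stacks may add other sampling steps.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 is usually handled as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens.
logits = [2.0, 1.5, 0.5]

for t in (1.0, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 4) for p in probs])
```

At 0.1 the top token takes almost all of the probability mass, but not all of it, so a different token can still occasionally be sampled.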

Here’s what I found when I looked at those in­di­vid­ual cat­e­gories.

Look at technical skills: I scored 8/10 in 98 out of 100 runs. Nearly perfect consistency. How come? Because technical skills are a checklist. You either know React or you don’t. There’s nothing for an LLM to judge — a five-year-old could match that checklist.

Now look at pro­jects — there’s HUGE vari­a­tion.

LLMs struggle to make a judgment call like that consistently. Sometimes my projects “lack architectural complexity”, sometimes they “demonstrate real-world deployment”. Which one the LLM spits out is a roll of the dice.

Temperature 0.1 is already low, but even going down to temperature 0 doesn’t fix this. Someone opened a GitHub issue back in October showing scores of 27, 34, 32, 34, 34, 30 across six consecutive runs at temperature 0.[2] This non-determinism isn’t a bug you can just fine-tune away; it’s a fundamental design flaw.
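To put a number on that variation, here is a small sketch that summarizes the six scores quoted above from the GitHub issue. It uses only those six values; the choice of population standard deviation is just one reasonable way to describe the spread.

```python
import statistics

# The six consecutive temperature-0 scores quoted above from the GitHub issue.
scores = [27, 34, 32, 34, 34, 30]

spread = max(scores) - min(scores)
mean = statistics.mean(scores)
stdev = statistics.pstdev(scores)  # population standard deviation of these six runs

print(f"min={min(scores)} max={max(scores)} spread={spread}")
print(f"mean={mean:.1f} stdev={stdev:.2f}")
```

A 7-point spread on identical input is enough to flip a pass/fail decision for any cutoff that falls inside that band.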

I was wor­ried part of this might be the model. After all, gem­ma3:4b was a lo­cal model run­ning on my ma­chine.

Gemini re­sulted in a tighter dis­tri­b­u­tion — scores clus­tered be­tween 48 and 64. But if your cut­off is 60, you’re still fail­ing 28% of the time through no fault of your own.

The Open Source scores have be­come con­sis­tent — that’s a le­git im­prove­ment. But pro­ject scores are still all over the place.

Experience has me the most con­cerned.

25/25.

Every sin­gle run.

I went back and pulled up an old re­sume — one in­tern­ship on it.

Also 25/25.

The clue is in the prompt…

The en­tire thing is two lines long.

No rubric. No ex­am­ples. No an­chors for what earns a 15 ver­sus a 25.

A ju­nior en­gi­neer with one in­tern­ship gets 25/25. A prin­ci­pal en­gi­neer with a decade of dis­trib­uted sys­tems gets 25/25. I get 25/25. Experience has two lines and no an­chors — con­sis­tent, but use­less. Projects has a de­tailed rubric with ex­am­ples but it’s the nois­i­est cat­e­gory — in­con­sis­tent, also use­less. There are some things that LLMs just can’t do well, no mat­ter how you prompt.

Use an LLM to parse a re­sume into struc­tured data — great, that’s what they’re good at. Use one to check whether some­one knows Python — amaz­ing. Use one to judge whether a can­di­date’s ex­pe­ri­ence is worth 18 points or 24 points? You get a vibe-check. Something HR teams, bar rais­ers, and a dozen other ini­tia­tives have spent decades try­ing to avoid.

The 65% weight­ing on open source + pro­jects does­n’t help ei­ther. I’d take the en­gi­neer with 30 years of ex­pe­ri­ence who built S3 over some­one with two in­tern­ships and an open source pro­ject — but this tool would­n’t. Some of the best en­gi­neers I know have built things that never ended up on GitHub. That’s over half of their score gone be­fore any hu­man looks their way.

If you’re an engineer with any say in how your company handles resume screening: please be very careful with AI-screening tools. A tool that can’t differentiate isn’t filtering for quality — it’s just filtering. You might as well throw out half the resumes and tell the applicants you don’t fuck with bad luck.

Correction (June 28): A reader flagged that the resume_evaluation_criteria.jinja template says “Software Intern” on line 1 — nowhere documented, nowhere else referenced in the repo. The same template later gives bonus points for “founder roles, co-founder positions, or early-stage engineer roles.” I re-ran with an explicit Senior SWE prompt and got identical results — the scoring dimensions are position-agnostic.

Update (June 30): This blew up on Hacker News, if you’re cu­ri­ous to see that thread — here it is.

A few peo­ple noted that fron­tier mod­els do not have this prob­lem. I checked out one of the GitHub PRs that in­tro­duced sup­port for Claude and ran Opus 4.8 in a loop un­til my cred­its ran out.

The range has tightened slightly: total scores went from 48 – 64 to 49 – 63, and projects from 12 – 25 to 13 – 23. At its core, the point is the same: projects inconsistent, skills perfect.

[1] Viral LinkedIn (read at your own risk) and Reddit posts. They both claim the repo was open-sourced recently, but based on commit history it’s more likely that it just blew up recently and has been open source since October 2025.

[2] Non-determinism at temperature 0 was flagged in this GitHub issue, opened October 2025.

Age verification is just a precursor to attribution of speech

nonogra.ph

Lots of US states, European countries, and Australia have introduced “age verification” regulations. They present it as the classic “save the children” talking point, but it’s really just a precursor to attribution of speech, particularly attributing your words to your real identity.

This is the state’s dream: your words, undeniably tied to your real-life identity. Law enforcement generally needs two things to take meaningful action: “What happened?” and “Who did it?” So let’s go over them; I promise it’s relevant.

1. What happened? Maybe you dislike datacenters, illegal immigration, or taxes. Whatever it is, the police want to know. If you’re posting on social media, they probably already know.

2. Who did it? They can’t prosecute PickleDog52; they rely on some sort of identifier and a lot of investigative work to figure out who to harass or jail. Traditionally this has been achieved with OSINT (looking for clues in your posts, speech patterns, etc.) or subpoenaing the service provider to get your IP or other identifiers like email or phone.

Doing #2 takes a lot of ef­fort and does­n’t scale. Sometimes there’s no prob­a­ble cause that a crime has been or will be com­mit­ted. Sometimes the tar­get uses a VPN or Tor. Sometimes the plat­form does­n’t have re­li­able met­rics on the tar­get. Whatever the rea­son, it usu­ally re­quires hu­mans click­ing but­tons, send­ing emails, or de­cid­ing things.

These “age verification” laws are, by design, identity attribution systems. They attribute digital identities (accounts) to physical identities (SSN, ID, etc.). This is the government’s ideal situation: the ability to quickly (automatically?) get identifying information about inconvenient people, regardless of whether they’re criminals or not.

There’s also something creepily ironic about select corporate elite, politicians, and government officials pushing age verification to “save the children”… Maybe go check their flight logs or hard drives or something… Yikes!

Anyways, I have no doubts that this will become automated once enough of the population has verified their identities. Post an inconvenient message about a politician, or get a little too rowdy in a group chat, and you’ll get a letter in the mail or a knock at the door. Similar to the “love letters” sent by ISPs on behalf of the RIAA and MPAA when you enjoy a DRM-free media file.

Don’t let them win. Don’t ver­ify your age. Don’t give up your iden­tity. If you ab­solutely must, find one of the nu­mer­ous age ver­i­fi­ca­tion ser­vices and pay in Monero.

Virginia Bans Sale of Geolocation Data

www.hunton.com

On April 13, 2026, Virginia Governor Abigail Spanberger signed into law S.B. 388, which amends the Virginia Consumer Data Protection Act (“VCDPA”) to prohibit the sale of geolocation data. Notably, the VCDPA defines “sale” more narrowly than other state comprehensive privacy laws, as “the exchange of personal data for monetary consideration by the controller to a third party.”

The ban on the sale of ge­olo­ca­tion data goes into ef­fect on July 1, 2026.

Virginia follows Maryland and Oregon in banning the sale of geolocation data. Both Maryland and Oregon more broadly define “sale” to mean “the exchange of personal data for monetary or other valuable consideration.” Virginia joins several other states that have recently proposed legislation with similar bans, including California, Massachusetts, Vermont and Washington State. The legislative activity follows regulatory scrutiny on the sale of geolocation data, including the California Attorney General’s investigation into the location data industry in March 2025, and a 2024 FTC settlement banning a data broker from selling geolocation data.
