
Mullvad exit IPs as a fingerprinting vector

tmctmt.com

Mullvad is one of the few VPN providers that offers multiple exit IPs for its servers. If two people connect to the same server, they will usually end up with different public IPs.

With only 578 servers (compared to Proton VPN’s 20,000), this kind of vertical scaling makes sense to avoid cramming too many users onto one IP, which would be a problem on sites with overzealous IP blocks and rate limits.

Surprisingly, the exit IP you are given is not randomized each time you connect to the server, but deterministically picked based on your WireGuard key, which rotates every 1 to 30 days (unless you use a third-party client, in which case it never rotates).

But wait... if each server assigns you an independently picked static exit IP, wouldn’t just a few of those be enough to uniquely identify you among every other Mullvad user?

Putting it to the test

I wrote a script that repeatedly changes my pubkey and fetches exit IPs for a set of 9 servers. Leaving it running for a night produced data points for 3650 pubkeys, which is enough to map out the exit IP range for each server.

The pool sizes add up to over 8.2 trillion exit IP combinations for these servers, so you’d think each pubkey would be assigned a unique combination of IPs, since the odds of a collision are so astronomically low. And yet, somehow, all the pubkeys I tested were assigned just one of 284 combinations.
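To see just how low, here is a quick back-of-the-envelope check under the (wrong, as it turns out) assumption of uniform random assignment; the birthday bound puts the expected number of colliding pairs among k keys drawn from N combinations at roughly k²/2N:

fn main() {
    // Birthday bound: expected colliding pairs among k uniform draws
    // from n possibilities is approximately k^2 / (2 * n).
    let k = 3650.0_f64;
    let n = 8.2e12_f64;
    println!("expected collisions: {:.1e}", k * k / (2.0 * n)); // ~8.1e-7
}

Under uniform assignment, even a single repeated combination among 3650 pubkeys would be remarkable, let alone only 284 distinct combinations in total.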

What’s going on here?

Different IPs, same proportion

You can calculate a numerical position for an exit IP by counting its distance from the pool’s starting IP.

For example, the IP 103.136.147.53 assigned by au-syd-wg-101 would have a 1-based index of 49 (X.X.X.53 - X.X.X.5 + 1).

Now, if you take the IP positions for any of the 284 combinations linked above and divide them by pool size, a common ratio emerges:

Each IP lands within the same percentile of its pool; in this case, the 81st.
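A minimal sketch of both calculations; the pool size of 60 for au-syd-wg-101 is a hypothetical value for illustration (the real sizes come from the mapped ranges):

use std::net::Ipv4Addr;

// 1-based position of an exit IP within its pool, counted from the
// pool's starting IP.
fn ip_index(ip: Ipv4Addr, pool_start: Ipv4Addr) -> u32 {
    u32::from(ip) - u32::from(pool_start) + 1
}

fn main() {
    let idx = ip_index(
        "103.136.147.53".parse().unwrap(),
        "103.136.147.5".parse().unwrap(),
    );
    let pool_size: u32 = 60; // hypothetical pool size for au-syd-wg-101
    // prints: index 49, ratio 0.817 (the 81st percentile mentioned above)
    println!("index {idx}, ratio {:.3}", f64::from(idx) / f64::from(pool_size));
}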

This explains the limited number of combinations: Mullvad will only assign neighboring exit IPs across all its servers. But why?

Feature or bug?

Curiously, the servers cl-scl-wg-001 and za-jnb-wg-002 consistently share IP indexes with each other across all 284 observed IP combinations.

The thing they have in common is a pool size of 11, and this gives us a clue about what’s happening.

In any language, if you initialize an RNG with a static seed, a rand-between call with the same bounds will always produce the same result:

use rand::{Rng, SeedableRng};
use rand::rngs::StdRng;

fn main() {
    let seed = 1234;
    for _ in 1..100 {
        let mut rng = StdRng::seed_from_u64(seed);
        let number = rng.random_range(0..1000);
        println!("{}", number); // will always print 56
    }
}

So, the shared indexes between these two servers indicate that Mullvad is probably using some sort of seed-based RNG to pick exit IP indexes, where the seed is the pubkey (or possibly the tunnel address) and the upper bound parameter is the pool size.

This is fairly straightforward, but what happens when the bounds are changed?

use rand::{Rng, SeedableRng};
use rand::rngs::StdRng;

fn main() {
    let seed = 12345;
    for bound in 10..100 {
        let mut rng = StdRng::seed_from_u64(seed);
        let number = rng.random_range(0..bound);
        let ratio = number as f64 / bound as f64;
        println!("{} {:.3}", number, ratio);
    }
}

5 0.500
5 0.455
6 0.500
6 0.462
7 0.500
7 0.467
8 0.500
9 0.529
9 0.500
10 0.526
10 0.500
11 0.524
11 0.500
12 0.522
12 0.500
13 0.520
13 0.500
14 0.519
14 0.500
15 0.517
…

As it turns out, the entropy pool of the RNG is unaffected by the bounds you provide. At least in Rust, the same float is generated on each first call and used as a multiplier to scale the bounds, like so: min + round((max - min) * float). (This may be a giant oversimplification.)
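A toy model of that behavior, where the fixed float stands in for the seeded RNG’s first output; the floor-based scaling here is an assumption for illustration, not rand’s actual internals:

fn bounded(float: f64, min: u32, max: u32) -> u32 {
    // One fixed float in [0, 1), scaled to whatever bounds are requested.
    min + ((max - min) as f64 * float) as u32
}

fn main() {
    let float = 0.5; // stands in for the seeded RNG's first output
    for bound in 10..20 {
        let n = bounded(float, 0, bound);
        // The ratio stays pinned near the float, matching the output above.
        println!("{} {:.3}", n, f64::from(n) / f64::from(bound));
    }
}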

This lines up with the behavior we’ve seen in Mullvad’s exit IP picking algorithm, so it’s safe to say that this is the cause of it.

Rust as the backend language makes sense too, considering that the client is also written in it.

The thing is, almost none of my programmer friends were able to accurately describe what random_range would produce in the second code snippet, and the actual behavior took me by surprise too. It’s reasonable to think that each increment to the bounds would perturb the entropy and result in a different number, even though that’s not what happens.

Is it possible that the Mullvad devs shared this common misconception, while actually intending for there to be an unbounded number of exit IP combinations? I don’t know, but it’s a funny thought.

Correlating identities

I made a tool that can deduce the minimum and maximum float value for a given combination of IPs, available at https://tmctmt.github.io/mullvad-seed-estimator/.

This particular set of IPs in the screenshot resolves to a float value between 0.2909 and 0.2943, a difference of 0.0034, which means that 0.34% of Mullvad users share these IPs. At a ballpark estimate of 100,000 active Mullvad users, this equates to 340 users.
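The idea behind the estimator, sketched under the assumption of floor-based scaling (the exact interval arithmetic depends on the rounding the RNG actually uses): an observed index i in a pool of size n is consistent with any float in [(i - 1)/n, i/n), and intersecting those intervals across servers narrows down a user’s float. The (index, pool size) pairs below are hypothetical:

fn main() {
    // (index, pool_size) pairs observed across several servers.
    let observations = [(49_u32, 60_u32), (9, 11), (33, 40)];
    let (mut lo, mut hi) = (0.0_f64, 1.0_f64);
    for (i, n) in observations {
        // Index i is produced by any float in [(i - 1) / n, i / n).
        lo = lo.max(f64::from(i - 1) / f64::from(n));
        hi = hi.min(f64::from(i) / f64::from(n));
    }
    // The interval's width is the fraction of users sharing this combination.
    println!("float in [{lo:.4}, {hi:.4}), share {:.2}%", (hi - lo) * 100.0);
}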

This is definitely not as unique as I originally thought, but at the same time, >99% accuracy is really not that bad?

As an example, imagine that you are a moderator on a forum and you suspect that a new face is actually a sockpuppet of a user you banned the day prior. You check the IP logs, and despite using different Mullvad servers, both accounts resolve to the overlapping float ranges 0.4334-0.4428 and 0.4358-0.4423. This gives you a >99% chance that they are the same person.
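Checking such a correlation is then just an interval intersection; a trivial sketch:

fn overlaps(a: (f64, f64), b: (f64, f64)) -> bool {
    // Two float ranges intersect iff each starts before the other ends.
    a.0 < b.1 && b.0 < a.1
}

fn main() {
    println!("{}", overlaps((0.4334, 0.4428), (0.4358, 0.4423))); // true
}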

Now apply this to IP logs obtained through data breaches and legal channels and you can see how you could get deanonymized behind a VPN through similar correlation attacks.

Protecting yourself

Avoid switching servers more than once per pubkey

Force rotate your pubkey by logging out of the Mullvad app

UK saves 'millions' of pounds by ditching Palantir for refugee system

www.bbc.com

‘Millions’ of pounds saved by replacing Palantir tech in refugee system


Chris Vallance, Senior technology reporter


Millions of pounds have been saved by replacing a Palantir IT system, which helps to find homes for Ukrainian refugees, with one built by its own experts, a government department has said.

The Homes for Ukraine scheme matched people fleeing the conflict with offers of accommodation - a complex task Palantir initially supported for free but which grew to cost millions.

The Ministry of Housing, Communities and Local Government (MHCLG) said its new system was “more flexible” and could meet “high standards” of security.

Palantir said it was “proud to have supported the scheme and stood up a solution in just nine days, which enabled the safe resettlement of more than 157,000 refugees”.

Through a website, backed by an IT system, those who had a rent-free space in their home or a separate residence could offer it to refugees.

In order to set this up quickly, then-Conservative government ministers accepted an offer from Palantir to build a system to administrate the scheme, based on its Foundry platform, for free for six months.

In a 2023 blog post, Palantir described the challenge of combining data from multiple government systems containing tens of thousands of visa applications and hundreds of thousands of accommodation offers.

A report by the National Audit Office (NAO) notes the government’s chief commercial officer informed Palantir of his concern about the firm’s practice of making a zero- or nominal-cost initial offer to gain a commercial foothold.

This, he argued, was contrary to public procurement principles requiring open competition.

Palantir maintains government guidance suggests running pilots of systems and asking if they can be supplied for free.

The NAO report also said there was a desire to replace the Palantir system.


Coco Chan, a senior digital leader on the Homes for Ukraine project, said in a blog that a system built on an existing commercial platform had been replaced with one created in-house.

The blog did not name the platform in question, now known to have been Palantir’s Foundry tech.

“Longer term, we wanted to replace the platform with a more flexible technology solution, enabling [MHCLG] to save significant support costs, control the system data and code,” Chan wrote.

She added its in-house replacement was already saving MHCLG “millions of pounds a year in running costs”.

‘Towards sovereign technology’

According to Chan, the department set a precedent by moving a complex live system to an in-house setup, reducing reliance on external suppliers.

That message may be particularly welcome to those who have criticised Palantir and its contracts across UK public services - including with the NHS, the Ministry of Defence (MoD), the Financial Conduct Authority and 11 police forces.

Some argue the firm’s success is because its tech is badly needed and works well.

There are also concerns the UK is relying too much on large US tech suppliers.

Terence Eden - who alerted the BBC to the MHCLG blog - said the development of an in-house alternative to Palantir’s tech was an important step towards more “sovereign technology”.

“When given suitable resources the Civil Service can often outperform private companies like Palantir,” the former government technology advisor said.

Eden added MHCLG had created a “better, easier to use, and cheaper” system.

Emma Logan, deputy president of BCS, The Chartered Institute for IT, told the BBC there were clear advantages to building some digital services in-house.

But she said external specialists can bring “experience, specialist skills, and the ability to put large teams in place quickly”, which can be “particularly important for urgent national programmes”.

Rob Miller, of Public Digital - a consultancy founded by former government tech experts - added that the government should not just consider whether to curb reliance on big tech but “how quickly it is willing to invest in the capability to do so”.

Palantir told the BBC its Homes for Ukraine system formed part of a “multi-faceted effort to help Ukraine in the face of Russian aggression”.

It added this included “the use of our software for military support, demining, investigation into war crimes and provision of pupils with safe access to schools”.

The firm also said the change to a new system showed there was no risk of firms being locked into using its technology.

The MHCLG said it initially needed a system which could be ready within days but, in seeking a “steadier service”, later created an updated platform to meet the programme’s longer-term needs and bring down costs.

Its replacement system was operational by September 2025.


Wikipedia File Explorer — Browse Wikipedia on a Windows XP desktop

explorer.samismith.com


A few words on DS4

antirez.com

I didn’t expect DwarfStar 4 (https://github.com/antirez/ds4) to become so popular so fast. It is clear that there was a need for a single-model, integration-focused local AI experience, and that a few things happened together: the release of a quasi-frontier model that is large and fast enough to change the game of local inference, and the fact that it works extremely well with an extremely asymmetric quants recipe of 2/8 bit, so that 96 or 128GB of RAM are enough to run it. And, of course, all the experience produced by the local AI movement in recent years, which can be leveraged more promptly because of GPT 5.5 (otherwise you can’t build DS4 in one week, and even with all this help you need to know how to gently talk to LLMs).

The last week was funny and also tiring; I worked 14 hours per day on average. My normal average has been 4-6 since early Redis times, but the first few months of Redis were like that.

So, what’s next? Is this a project that starts and ends with DeepSeek v4 Flash? Nope: the model can change over time. The space will be occupied, in my vision, by the best current open-weights model that is *practically fast* on a high-end Mac or “GPU in a box” gear (like the DGX Spark and other similar setups). I bet that the next contender is DeepSeek v4 Flash itself, in the new checkpoint that will be released and, hopefully, a version specifically tuned for coding, and, who knows, maybe other expert variants (not in the sense of MoE experts). For local inference, having ds4-coding, ds4-legal, ds4-medical models makes a lot of sense, after all. You just load what you need depending on the question.

It is the first time since I started playing with local inference (and I have played with it since the start) that I find myself using a local model for serious stuff that I would normally ask Claude / GPT. This, I think, is really a big thing. It is also the first time that, using vector steering, I can enjoy an experience where the LLM can be used with more freedom. DeepSeek v4 Flash is really an impressive model, no doubt about that. If you can imagine in your mind the small good local model experience as A, and the frontier model you use online as B, DS4 is a lot more B than A. I can’t wait for the new releases, honestly (btw, thank you DeepSeek).

So, after those chaotic first days, I hope the project will focus on: quality benchmarks, potentially adding a coding agent that is also part of the project, a hardware setup here in my home that can run the CI tests in order to ensure long-term quality, more ports, and finally, but as a very important point, distributed inference (both serial and parallel).

For now, thank you for all the support: it was really appreciated :) AI is too critical to be just a provided service.


The Wonders of AI: We Are Retiring Our Bug Bounty Program

turso.tech

For almost a year now, Turso has had a program that pays $1,000 for any bug that can be demonstrated to lead to data corruption. Today, with immense sadness, we are retiring this program.

The reason is simple: everybody is being inundated by the slop machine. We are not unique in this regard. However, a program that offers money in exchange for a specific class of bugs is just too juicy of a target for the slop makers. For days, our maintainers have done little else other than close slop PRs claiming to have found bugs that led to data corruption in Turso. In a time where many OSS projects are closing their doors to contributions, we want to make every effort possible to keep the doors of Turso open. Being an Open Contribution project is part of our DNA. It is how Turso was born. But unfortunately, the financial reward is making this close to impossible, and it has to go.

We are sharing this publicly and loudly because we believe that we will all have to find new ways to establish good governance in this new era, and should learn from each other. This is our contribution to that conversation.

#Why did we start this program

We started this program because we are rewriting SQLite, known to be one of the most reliable pieces of software in the world. The community expects a high bar from a project with such ambition, and we invest tremendous effort into making sure that we can match or even surpass SQLite’s legendary reliability. Turso ships with a native Deterministic Simulator, a collection of fuzzers, an oracle-based differential testing engine against SQLite, a concurrency simulator, and on top of that, we have extensive runs on Antithesis.

We take our testing discipline seriously. And we wanted to communicate our confidence. On the other hand, all of that testing infrastructure is, at the end of the day, just software, and is not perfect. You can write all the fuzzers and simulators in the world, but they will only catch bugs in the combinations that are effectively generated. For example, if your fuzzer never generates indexes, you will by definition not find any bugs related to indexes, regardless of how well you stress the rest of the system. As a real example, we found bugs that escaped our simulator because they would only appear in databases that were larger than 1GB, and because we injected faults aggressively into every run, databases would never get big enough to trigger them.

The main advantage of automated testing is that once a bug escapes your validation and you improve the test generators, an entire class of bugs goes away. So we envisioned this program as a great way to do both things: it helped us establish the confidence we had in the methodology, but at the same time, if someone did find areas that our simulators didn’t cover well, we’d be more than happy to pay for it! We started the program with a $1,000 reward for bugs that would lead to data corruption until we could release a 1.0 version of Turso. Our plan was that once we reached 1.0, we would progressively increase both the size of the reward to substantial levels, and the scope of the issues we’d reward people for.

#And before the “singularity”, this worked great

We were delighted by this program. We paid a total of 5 individuals, and all of the people who were awarded were incredibly special. Worth highlighting is the work of Alperen, who was actually one of the core contributors to our simulator itself (so little surprise that he knew of a couple of places where it could be improved). Then Mikael, who used LLMs in very creative ways to identify places the simulator was not reaching (we later hired Mikael), and Pavan Nambi, who paired the simulator with formal methods and ended up not only finding bugs in Turso, but in fact more than TEN bugs in SQLite itself through his methodology.

#But after the “singularity”, we got drowned

In our experience, anybody who was skilled enough to find critical issues was someone we wanted around in our community. We did have the occasional person that tried to submit bad PRs in the hopes of collecting the bounty, but it was a rare occurrence: the requirement that the simulator had to be extended to demonstrate the bug (just pointing out the bug was not enough) helped keep the bar high, and most importantly, there just aren’t that many bugs.

But then an army of slop was released overnight. The bounty became too juicy a target: just point an LLM at Turso and try to find a bug. And as you all know, if you instruct an LLM to go find a bug and collect a bounty, it will produce some output. Whether or not it makes sense is a completely different story. I want to share some of those with you.

#Some examples

In this PR, the author just injected garbage bytes manually into the database header, and then argued that this corrupted the database (duh!). After our maintainer pointed out that, well, no shit Sherlock, the author (or his bot) kept arguing with your usual LLM-induced wall-of-text for quite a while.

You might find that unbelievable, but it is actually less incredible than modifying the source code to manually add an out-of-bounds array access to corrupt the database.

In this other PR, which is full of tables, green check marks and em dashes, the author claims to have found a critical vulnerability that allows for the execution of arbitrary SQL statements. Imagine that? A SQL database that allows the execution of SQL statements. How can we ever recover from this.

This other masterpiece enables concurrent writes on Turso, one of the features that set us apart from SQLite, and then demonstrates that SQLite cannot open the file until the journal mode is set back to WAL, disabling concurrent writes (that is how the system is designed to operate).

For this other one, I wish I could write a nice description, but I have no idea what they are trying to do. As our maintainer Mikael (the same who won the award in the past!) pointed out, it is very clear that the person just saw the prize announcement, started salivating, and pointed the slop machine at us.

#The last attempt

In our last attempt to establish some order, we designed and implemented a vouching system. If we suspect that a submission is coming from a bot, we just auto-close it. And this worked okay for some time, until the bots just started opening issues questioning the closing of their PRs and requesting a manual inspection. They all look the same:

We also had many instances in which we would close a PR, and the same or a very similar PR would just be opened by a different user moments after.

#It’s sad, but here we are

The main problem, of course, is that it costs the slopmaker perhaps a minute to generate their submission. But it costs us hours to read, understand, and engage with them. And they can be generated at a semi-infinite pace. It is possible to set up automated systems to gatekeep this, but with a non-negligible dollar value attached to it, the incentive is just too great for the AIs to just keep arguing, reopening the same PR, etc.

We value our Open Source community of contributors a lot, and we will continue to strengthen our community. But at this point, we just don’t believe that a financial incentive of any kind works well with an open system. We have to either close the system, or get rid of the incentive. For now, we are choosing the latter.

Ontario auditors find doctors' AI note takers routinely blow basic facts

www.theregister.com


60% of evaluated AI Scribe systems mixed up prescribed drugs in patient notes, auditors say

The AI systems approved for Ontario healthcare providers routinely missed critical details, inserted incorrect information, and hallucinated content that neither patients nor clinicians mentioned, according to a provincial audit of 20 approved vendors’ systems.

The findings come from the Office of the Auditor General of Ontario, Canada, and are included in a larger report about the state of AI usage by public services in the province. They specifically address the AI Scribe program, which the Ontario Ministry of Health initiated for physicians, nurse practitioners, and other healthcare professionals across the broader health sector.

As part of the procurement process, officials conducted evaluations using simulated doctor-patient recordings. Medical professionals then reviewed the original recordings alongside the AI-generated notes to evaluate their accuracy.


What they found was, frankly, shocking for anyone concerned about the accuracy of AI in critical situations.


Nine out of 20 AI systems reportedly “fabricated information and made suggestions to patients’ treatment plans” that weren’t discussed in the recordings. According to the report, evaluators spotted potentially devastating incorrect information in the sample reports, such as no masses being found, or patients being anxious, even though these things were never discussed in the recordings.

Twelve of the 20 systems evaluated inserted incorrect drug information into patient notes, while 17 of the systems missed key details about the patients’ “mental health issues” that were discussed in the recordings. “Six of the systems missed the patients’ mental health issues fully or partially or were missing key details,” per the report.

OntarioMD, a group that offers support for physicians in adopting new technologies and was involved in the AI Scribe procurement process, has recommended that doctors manually review their AI notes for accuracy, but the report notes there’s no mandatory attestation feature in any of the AI Scribe-approved systems.

Bad evaluations don’t help, either

AI systems making mistakes isn’t exactly shocking. As we’ve reported previously, consumer-focused AI has a tendency to provide bad medical information to users, and some studies have found large language models failed to produce appropriate differential diagnoses in roughly 80 percent of tested cases. But the tools evaluated here are for doctors, not consumers, and such poor performance necessitates explanation. A good portion of the report blames how the systems were evaluated.

According to the report, the weight given to various categories of AI Scribe performance was wonky. While 30 percent of a platform’s evaluation score depended solely on whether it had a domestic presence in Ontario, the accuracy of medical notes contributed only 4 percent to the total score.

Bias controls accounted for only 2 percent of the total evaluation score; threat, risk, and privacy assessments counted for another 2 percent; and SOC 2 Type 2 compliance contributed an additional 4 percentage points.

In other words, criteria tied to accuracy, bias controls, and key security and privacy safeguards made up only a small portion of the total evaluation score for the AI Scribe systems.


“Inaccurate weightings could result in the selection of vendors whose AI tools may produce inaccurate or biased medical records or lack adequate protection to safeguard sensitive personal health information,” the report said of the scoring regime.

The Register reached out to the Ontario Health Ministry for its take on the report, and whether it was going to conform to its recommendations for the AI Scribe program, but we didn’t immediately hear back. A spokesperson for the Ministry told the CBC on Wednesday that more than 5,000 physicians in Ontario are participating in the AI Scribe program and there have been no known reports of patient harms associated with the technology. ®

GitHub - Andyyyy64/whichllm: Find the local LLM that actually runs and performs best on your hardware. Ranked by real, recency-aware benchmarks, not parameter count. One command, run it instantly.

github.com

Find the best local LLM that actually runs on your hardware.

Auto-detects your GPU/CPU/RAM and ranks the top models from HuggingFace that fit your system.

Japanese version available here (日本語版)

See it

$ whichllm --gpu "RTX 4090"

#1 Qwen/Qwen3.6-27B    27.8B  Q5_K_M  score 92.8   27 t/s
#2 Qwen/Qwen3-32B      32.0B  Q4_K_M  score 83.0   31 t/s
#3 Qwen/Qwen3-30B-A3B  30.0B  Q5_K_M  score 82.7  102 t/s

The 32B model fits your card fine — whichllm still ranks the 27B #1, because it scores higher on real benchmarks and is a newer generation. A size-only “what fits?” tool would hand you the bigger one. That gap is the whole point of whichllm. (Note #3: a MoE model at 102 t/s — speed is ranked on active params, quality on total.)

What can I run?

Real top picks (snapshot 2026-05 — your results track live HuggingFace data, this is not a static list):

whichllm --gpu "<your card>" to simulate any of these before you buy.

Useful? A GitHub star helps other people find it — and I’d genuinely like to know what it picked for your rig: drop it in Issues.



Why whichllm?

Fitting a model into your VRAM is the easy part. The hard part is knowing which of the models that fit is actually the best — and that is what whichllm is built to get right.

Evidence-based ranking, not a size heuristic — The top pick is chosen from merged real benchmarks (LiveBench, Artificial Analysis, Aider, multimodal/vision, Chatbot Arena ELO, Open LLM Leaderboard) — never “the biggest model that happens to fit”.

Recency-aware — Stale leaderboards are demoted along each model’s lineage, so a 2024 model can’t outrank a current-generation one on an outdated score. The benchmark snapshot date is printed under every ranking, so a stale recommendation is self-evident instead of silently trusted.

Evidence-graded and guarded — Every score is tagged direct / variant / base / interpolated / self-reported and discounted by confidence. Fabricated uploader claims and cross-family inheritance (a small fork borrowing its much larger base’s score) are actively rejected.

Architecture-aware estimates — VRAM = weights + GQA KV cache + activation + overhead; speed is bandwidth-bound with per-quant efficiency, per-backend factors, MoE active-vs-total split, and unified-memory vs discrete-PCIe partial-offload modeling (a rough sketch of this estimate follows this list).

One command, scriptable — whichllm prints the answer; add --json | jq for pipelines. No TUI, no keybindings to memorize.

Live data — Models fetched directly from the HuggingFace API, with curated frozen fallbacks for offline or rate-limited use.
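A rough sketch of the VRAM formula named above; the constants and parameters here are illustrative assumptions, not whichllm’s actual numbers:

fn vram_bytes(
    params: f64,          // total parameter count
    bytes_per_param: f64, // e.g. ~0.56 for a 4.5-bit quant like Q4_K_M
    layers: u32,
    kv_heads: u32,        // GQA: fewer KV heads than attention heads
    head_dim: u32,
    ctx_len: u32,
) -> f64 {
    let weights = params * bytes_per_param;
    // K and V caches: 2 * layers * kv_heads * head_dim * ctx * 2 bytes (fp16).
    let kv_cache = 2.0 * f64::from(layers * kv_heads * head_dim) * f64::from(ctx_len) * 2.0;
    let overhead = 1.15; // activations + runtime overhead, assumed ~15%
    (weights + kv_cache) * overhead
}

fn main() {
    // A hypothetical 32B dense model at Q4_K_M with an 8K context window.
    let gb = vram_bytes(32e9, 0.56, 64, 8, 128, 8192) / 1e9;
    println!("~{gb:.1} GB"); // ~23.1 GB
}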

Features

Auto-detect hardware — NVIDIA, AMD, Apple Silicon, CPU-only

Smart ranking — Scores models by VRAM fit, speed, and benchmark quality

One-command chat — whichllm run downloads and starts a chat session instantly

Code snippets — whichllm snippet prints ready-to-run Python for any model

Live data — Fetches models directly from HuggingFace (cached for performance)

Benchmark-aware — Integrates real eval scores with confidence-based dampening

Task profiles — Filter by general, coding, vision, or math use cases

GPU simulation — Test with any GPU: whichllm --gpu "RTX 4090"

Hardware planning — Reverse lookup: whichllm plan "llama 3 70b"

JSON output — Pipe-friendly: whichllm --json

Run & Snippet

Try any model with a single command. No manual installs needed — whichllm creates an isolated environment via uv, installs dependencies, downloads the model, and starts an interactive chat.

# Chat with a model (auto-picks the best GGUF variant)
whichllm run "qwen 2.5 1.5b gguf"

# Auto-pick the best model for your hardware and chat
whichllm run

# CPU-only mode
whichllm run "phi 3 mini gguf" --cpu-only

Works with all model formats:

GGUF — via llama-cpp-python (lightweight, fast)

AWQ / GPTQ — via transformers + autoawq / auto-gptq

FP16 / BF16 — via transformers

Get a copy-paste Python snippet instead:

whichllm snippet "qwen 7b"

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2.5-7B-Instruct-GGUF",
    filename="qwen2.5-7b-instruct-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    verbose=False,
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
)
print(output["choices"][0]["message"]["content"])

Install

uv (recommended)

uvx whichllm

To install permanently:

uv tool install whichllm

Homebrew

brew install andyyyy64/whichllm/whichllm

pip

pip install whichllm

Development

git clone https://github.com/Andyyyy64/whichllm.git
cd whichllm
uv sync --dev
uv run whichllm
uv run pytest

Usage

# Auto-detect hardware and show best models
whichllm

# Simulate a GPU (e.g. planning a purchase)
whichllm --gpu "RTX 4090"
whichllm --gpu "RTX 5090"
# Specify variant
whichllm --gpu "RTX 5060 16"

# CPU-only mode
whichllm --cpu-only

# More results / filters
whichllm --top 20
whichllm --quant Q4_K_M
whichllm --min-speed 30
whichllm --evidence base    # allow id/base-model matches
whichllm --evidence strict  # id-exact only (same as --direct)
whichllm --direct

# JSON output
whichllm --json

# Force refresh (ignore cache)
whichllm --refresh

# Show hardware info only
whichllm hardware

# Plan: what GPU do I need for a specific model?
whichllm plan "llama 3 70b"
whichllm plan "Qwen2.5-72B" --quant Q8_0
whichllm plan "mistral 7b" --context-length 32768

# Run: download and chat with a model instantly
whichllm run "qwen 2.5 1.5b gguf"
whichllm run  # auto-pick best for your hardware

# Snippet: print ready-to-run Python code
whichllm snippet "qwen 7b"
whichllm snippet "llama 3 8b gguf" --quant Q5_K_M

Integrations

Ollama

Find the best model and run it directly:

# Pick the top model and run it with Ollama
whichllm --top 1 --json | jq -r '.models[0].model_id' | xargs ollama run

# Find the best coding model
whichllm --profile coding --top 1 --json | jq -r '.models[0].model_id' | xargs ollama run

Shell alias

Add to your .bashrc / .zshrc:

alias bestllm='whichllm --top 1 --json | jq -r ".models[0].model_id"'
# Usage: ollama run $(bestllm)

Scoring

Each model gets a 0-100 score. Benchmark quality and size form the core; evidence confidence and runtime fit then scale it, with speed, source trust, and popularity as adjustments.

Score markers:

~ (yellow) — No direct benchmark; score inherited/interpolated from the model family

? (yellow) — No benchmark data available

How it works

Data pipeline

Model fetching — Fetches popular models from the HuggingFace API:

Text-generation (downloads + recently updated)

GGUF-filtered (separate query for coverage)

Vision models (image-text-to-text) when --profile vision or any

Benchmark sources — Current tier (LiveBench, Artificial Analysis Index, Aider) merged live when reachable, plus a curated multimodal / vision index; frozen tier (Open LLM Leaderboard v2, Chatbot Arena ELO). Tiers have separate caps and lineage-aware recency demotion so stale leaderboards stop over-rewarding older generations.

Benchmark evidence — Five resolution levels, increasingly discounted:

direct — Exact model ID match

variant — Suffix-stripped or -Instruct variant

base_model — Base model from cardData

line_interp — Size-aware interpolation within model family

self_reported — Uploader-claimed eval (heavily discounted)

Inheritance is rejected when a model’s params diverge too far from its family’s dominant member, catching draft / MTP / abliterated forks that share a family_id with a much larger base.


all of rust codebase: This codebase fails even the most basic miri checks, allows for UB in safe rust

github.com

error: Undefined Behavior: constructing invalid value of type &[u8]: encountered a dangling reference (0x20933[noalloc] has no provenance)
  --> src/main.rs:97:18
   |
97 |     unsafe { core::slice::from_raw_parts(ptr as *const u8, self.len()) }
   |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Undefined Behavior occurred here
   |
   = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
   = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
   = note: stack backtrace:
            0: PathString::slice at src/main.rs:97:18: 97:75
            1: main at src/main.rs:130:22: 130:34

code:

fn main() {
    let test = Box::new(*b"Hello World");
    let init = PathString::init(&*test);
    drop(test);

    println!("{:?}", init.slice());
}
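The issue doesn’t include the PathString type itself. A minimal hypothetical definition consistent with the Miri report (the [noalloc] tag suggests the buffer’s address was round-tripped through a plain integer, stripping both the lifetime and the provenance) would be:

struct PathString {
    ptr: usize, // address stored as a bare integer, so nothing ties it to the Box
    len: usize,
}

impl PathString {
    fn init(data: &[u8]) -> Self {
        // No lifetime connects the returned PathString to `data`.
        PathString { ptr: data.as_ptr() as usize, len: data.len() }
    }

    fn len(&self) -> usize {
        self.len
    }

    fn slice(&self) -> &[u8] {
        // After the Box is dropped, this reconstructs a dangling reference,
        // which is exactly the UB Miri flags above.
        unsafe { core::slice::from_raw_parts(self.ptr as *const u8, self.len()) }
    }
}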

Please consider not vibe coding Rust, as AIs are not good at writing Rust, and also hire a real Rust dev.
