10 interesting stories served every morning and every evening.

Mercedes-Benz commits to bringing back physical buttons

www.drive.com.au


Another brand backflips and admits that touch-sensitive buttons for frequently used controls were a mistake, but only after a nudge from customers.


Mercedes-Benz joins the growing list of manufacturers listening to customers and admitting that touch-sensitive controls, and the burying of functions in menus, were mistakes.

The German brand re­mains com­mit­ted to of­fer­ing large screens in its mod­els, but has lis­tened to its cus­tomers and will of­fer phys­i­cal but­tons for key func­tions in fu­ture.

This sets it partly apart from Audi and Volkswagen, which have chosen to reduce the size of their infotainment screens to make room for the returning physical controls.

The upcoming GLC and C-Class will be offered with the 39.1-inch "MBUX Hyperscreen" that covers almost the entire width of the dashboard, but with physical buttons in front of the dual wireless chargers, along with buttons and switches returning to the steering wheel.

Mercedes-Benz sales boss Mathias Geisen, speaking to Autocar, said the brand has changed course: "Customers told us two years ago, 'guys, nice idea, but it just doesn't work for us', so we changed that and made it more analogue."

Physical buttons, switches, and dials will continue to be incorporated into upcoming models, as the brand plans to blend its screens with the required physical controls.

He also explained: "I'm a big believer in screens, because I really believe if you want to connect, you have to make the magic work behind the screen."

"But in our future products, you will see more hard keys for specific functions that customers want to have direct access to with hard keys."

"When we do car research clinics, customers are very clear: 'We love the big screens, but we want to have [hard controls for] specific functionalities.'"

The brand will also offer a customisable wallpaper element for the near metre-wide seamless touchscreen, a choice its sales boss admits was made because phones are such a huge part of people's lives and customers are used to that level of technology.

"If you want to connect to the customer, you've got to find a way to translate this digital experience from your phone to the customer."

The new-generation GLC SUV will showcase the brand's new MB.EA electric vehicle platform when it arrives in the fourth quarter of 2026 (October to December); the platform will be shared with the upcoming C-Class, due early the following year.


An open-weights Chinese model just beat Claude, GPT-5.5, and Gemini in a programming challenge - ThinkPol

thinkpol.ca

By Rohana Rezel

I’m run­ning the on­go­ing AI Coding Contest where I pit ma­jor lan­guage mod­els against each other in real-time pro­gram­ming tasks with ob­jec­tive scor­ing. Day 12 was the Word Gem Puzzle. Ten mod­els en­tered. The re­sults were not what most peo­ple would have pre­dicted.

Kimi K2.6, an open-weights model from Chinese startup Moonshot AI, won the challenge outright: 22 match points, 7-1-0. MiMo V2-Pro from Xiaomi came second. GPT-5.5 was third. Claude Opus 4.7 finished fifth. Every model from the Western frontier labs landed below the top two.

The chal­lenge

The Word Gem Puzzle is a slid­ing-tile let­ter puz­zle. The board is a rec­tan­gu­lar grid (10×10, 15×15, 20×20, 25×25, or 30×30) filled with let­ter tiles and one blank space. Bots can slide any ad­ja­cent tile into the blank and at any point claim valid English words formed in straight hor­i­zon­tal or ver­ti­cal lines. Diagonals don’t count. Backwards does­n’t count.

The scor­ing re­wards longer words and pun­ishes short ones. Words un­der seven let­ters cost points: a five-let­ter word loses you one point, a three-let­ter word costs three. Seven let­ters or more score their length mi­nus six, so an eight-let­ter word is worth two points. The same word can only be claimed once; if an­other bot gets there first, you get noth­ing. Each pair of mod­els played five rounds, one per grid size, with a ten-sec­ond wall-clock limit per round.
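Spelled out in code, the stated values all reduce to a single rule, length minus six (my sketch of the rules above, not the contest server's actual code):

```typescript
// Scoring sketch based on the rules above. Every value stated in the post
// (3 letters → -3, 4 → -2, 5 → -1, 7 → +1, 8 → +2) fits "length - 6";
// whether the real server special-cases very short words is an assumption.
function wordScore(word: string): number {
  return word.length - 6;
}

// A claim only pays out if no other bot got the word first.
function claimValue(word: string, claimed: Set<string>): number {
  if (claimed.has(word)) return 0; // second claimant gets nothing
  claimed.add(word);
  return wordScore(word);
}
```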

The grids are seeded with real dic­tio­nary words in a cross­word-style lay­out, then the re­main­ing cells are filled with let­ters weighted by Scrabble tile fre­quen­cies, and fi­nally the blank is scram­bled, more ag­gres­sively on larger boards. On a 10×10, many seed words sur­vive in­tact. On a 30×30, al­most none do. That turns out to mat­ter a lot.
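The generator itself isn't reproduced here, but the weighted-fill step might look roughly like this, using standard Scrabble tile counts as the weights (an assumption; the contest's real distribution and scramble routine are its own):

```typescript
// Standard English Scrabble tile counts, used here as sampling weights.
const TILE_COUNTS: Record<string, number> = {
  E: 12, A: 9, I: 9, O: 8, N: 6, R: 6, T: 6, L: 4, S: 4, U: 4, D: 4,
  G: 3, B: 2, C: 2, M: 2, P: 2, F: 2, H: 2, V: 2, W: 2, Y: 2,
  K: 1, J: 1, X: 1, Q: 1, Z: 1,
};

const TOTAL_TILES = Object.values(TILE_COUNTS).reduce((a, b) => a + b, 0);

// Draw one filler letter, weighted by tile frequency.
function weightedRandomLetter(): string {
  let pick = Math.random() * TOTAL_TILES;
  for (const [letter, count] of Object.entries(TILE_COUNTS)) {
    pick -= count;
    if (pick <= 0) return letter;
  }
  return "E"; // floating-point edge case fallback
}
```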

The code pro­duced by Nvidia’s Nemotron Super 3 con­tained a syn­tax er­ror, so it never con­nected to the game server. Nine mod­els ac­tu­ally com­peted.

Kimi K2.6 is open-weights, publicly available from Moonshot AI, a Chinese startup founded in 2023. MiMo V2-Pro is currently API-only; the tweet linked here is Xiaomi confirming that weights for their newer V2.5 Pro model are dropping soon.[1] The models from Anthropic, OpenAI, Google, and xAI placed third through seventh. GLM 5.1, from Chinese lab Zhipu AI, placed fourth. DeepSeek finished eighth. This isn't a clean China-beats-West story; it's two specific models that won.

What I saw

The move logs tell the story. Kimi won by slid­ing ag­gres­sively. Its ap­proach was greedy: score each pos­si­ble move by what new pos­i­tive-value words it un­locks, ex­e­cute the best one, re­peat. When no move un­locked a pos­i­tive word, it fell back to the first le­gal di­rec­tion al­pha­bet­i­cally. This caused some in­ef­fi­cient edge-os­cil­la­tion, a 2-cycle pat­tern where the bot bounced the blank back and forth with­out progress. On smaller grids where seed words were still largely in­tact, that hurt. On the 30×30 grids, where the scram­ble had bro­ken up nearly every­thing and re­con­struc­tion was the only path to points, the sheer slide vol­ume even­tu­ally paid off. Kimi’s cu­mu­la­tive score of 77 was the high­est in the tour­na­ment.
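For flavor, the loop the logs suggest looks something like this; a reconstruction, not Kimi's actual bot, with `Board` and the three helpers assumed:

```typescript
type Dir = "down" | "left" | "right" | "up"; // note: alphabetical order

interface Board {
  tiles: string[][];
  blank: [number, number];
}

// Assumed helpers, not from the article:
declare function legalMoves(board: Board): Dir[];
declare function applyMove(board: Board, dir: Dir): Board;
declare function newPositiveWords(board: Board): number; // value of newly unlocked positive words

function chooseMove(board: Board): Dir {
  let best: { dir: Dir; gain: number } | null = null;
  for (const dir of legalMoves(board)) {
    const gain = newPositiveWords(applyMove(board, dir));
    if (gain > 0 && (best === null || gain > best.gain)) {
      best = { dir, gain };
    }
  }
  // Fallback: first legal direction alphabetically. Two adjacent states can
  // each pick the move that undoes the other, producing the 2-cycle
  // oscillation seen in the logs.
  return best ? best.dir : legalMoves(board)[0];
}
```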

MiMo's sliding code exists in the repo, but its "best value greater than zero" threshold never triggered, so in practice it never slid once. It went straight to scanning the initial grid for words of seven letters or more and blasted all its claims in a single TCP packet. Brittle strategy: entirely dependent on the scramble leaving intact seed words. On grids where words survived, MiMo cleaned up fast. On grids where they didn't, it scored nothing. Final tally: 43 cumulative points, second place.
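The scan-only approach is even simpler to sketch; again a reconstruction, assuming the grid is a 2D array and the dictionary a set of uppercase words:

```typescript
// Read every row and column of the starting board left-to-right and
// top-to-bottom, and collect any dictionary word of 7+ letters.
// No sliding at all: whatever the scramble left intact is all you get.
function scanForWords(grid: string[][], dictionary: Set<string>): string[] {
  const lines: string[] = [];
  for (const row of grid) lines.push(row.join(""));            // rows
  for (let c = 0; c < grid[0].length; c++) {
    lines.push(grid.map((row) => row[c]).join(""));            // columns
  }
  const found = new Set<string>();
  for (const line of lines) {
    for (let start = 0; start < line.length; start++) {
      for (let end = start + 7; end <= line.length; end++) {   // 7+ letters only
        const candidate = line.slice(start, end);
        if (dictionary.has(candidate)) found.add(candidate);
      }
    }
  }
  return [...found];
}
```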

Claude also did­n’t slide. The move logs show it hold­ing up well on 25×25 boards where scram­ble den­sity was still man­age­able, then falling apart on 30×30 where ac­tual tile move­ment was needed. Not slid­ing is a real lim­i­ta­tion in a puz­zle built around slid­ing.

GPT-5.5 was more con­ser­v­a­tive, roughly 120 slides per round with a cap to avoid thrash­ing, and showed the strongest num­bers on 15×15 and 30×30 grids. Grok never slid ei­ther, yet scored rea­son­ably on the larger boards. GLM was the most ag­gres­sive slider in the whole tour­na­ment, over 800,000 to­tal slides, but stalled badly when­ever it ran out of pos­i­tive moves.

DeepSeek sent mal­formed data every round. Zero use­ful out­put. At least it did­n’t make things worse by play­ing.

Muse made things worse by play­ing.

The scoring penalizes short words: three-letter words cost three points, four-letter words cost two, five-letter words cost one. The intent is to stop bots from carpet-bombing the board with "the" and "and" and "it". Every serious competitor filtered their dictionary to words of seven letters or more. Muse claimed everything. Every word it could find, regardless of length, fired off as a claim. On a 30×30 grid with hundreds of short valid words visible at any moment, Muse found them all and claimed every one.

Its cu­mu­la­tive score was −15,309. It lost all eight matches and won zero rounds. There is a ver­sion of Muse that sim­ply con­nected to the server and did noth­ing, and that ver­sion would have scored zero, a 15,309-point im­prove­ment. The gap be­tween Muse and eighth place was larger than the gap be­tween eighth and first.

DeepSeek's malformed output tells you something about how it handles novel protocol specs under time pressure. Muse's spiral tells you something different: it saw valid words and claimed them, with no apparent model of what "valid" meant given the scoring rules. It read the task partially and executed that partial reading in full. Worth noting for anyone deploying these models on structured tasks with penalties.

What sur­prised me

I de­sign these chal­lenges, so I have a rea­son­able sense of what they test. What I did­n’t fully an­tic­i­pate was how starkly the 30×30 grids would sep­a­rate the field. On smaller boards, the dif­fer­ence be­tween a sta­tic scan­ner and an ac­tive slider was mod­est. At full scale, mod­els that could only find what was al­ready there ran out of road. Kimi’s greedy loop, flawed as it was, kept pro­duc­ing out­put when the sta­tic scan­ners had noth­ing left to claim.

The other thing worth noting: MiMo and Kimi finished two match points apart despite doing almost opposite things. Two different theories of the same puzzle, nearly identical results. That means the gap between first and second was partly seed variance, not just capability difference.

The big­ger pic­ture

One fair coun­ter­ar­gu­ment: this scor­ing sys­tem re­wards ag­gres­sive word claim­ing, and heav­ily safety-tuned mod­els may be more con­ser­v­a­tive about that kind of car­pet-bomb­ing. If so, the re­sults re­flect a mis­match be­tween task de­sign and aligned model be­hav­iour, not raw ca­pa­bil­ity. It’s a rea­son­able ob­jec­tion. It does­n’t change the out­come.

One chal­lenge does­n’t over­turn gen­eral bench­marks. This puz­zle tests real-time de­ci­sion-mak­ing and whether a model can write clean func­tional code that con­nects to a TCP server and plays a novel game cor­rectly. It does­n’t test long-con­text rea­son­ing or code gen­er­a­tion from a spec.

But I’ve been run­ning these chal­lenges long enough to no­tice what’s chang­ing. A year ago, the as­sump­tion was that the Western fron­tier labs had a ca­pa­bil­ity lead open-weights could­n’t close. Kimi K2.6 now scores 54 on the Artificial Analysis Intelligence Index. GPT-5.5 scores 60, Claude 57. That’s not par­ity, but it’s close, and it’s com­ing from a model any­one can down­load.

When mod­els within a few in­dex points of the fron­tier are also freely avail­able to run lo­cally, that’s a dif­fer­ent com­pet­i­tive sit­u­a­tion than the one that ex­isted a year ago. This chal­lenge is one data point in that shift. The gap is small enough now that it shows up in re­sults like this one.

Rohana Rezel runs the AI Coding Contest and is a tech­nol­o­gist, re­searcher, and com­mu­nity leader based in Vancouver, BC.

References

[1] https://x.com/XiaomiMiMo/status/2047840164777726076

acai.sh

acai.sh


Does this look fa­mil­iar?

Wow. Claude. Mind-blowing. The whole fea­ture works great. But I for­got to men­tion one very im­por­tant edge case.

You’re ab­solutely right! Let me fix that.

Ah, and I just no­ticed. You used off­set pag­i­na­tion for the table UI. Obviously cur­sor pag­i­na­tion is a bet­ter fit here?

You’re ab­solutely right! Let me fix that.

Also, is that an N+1 query? Fetching for every row in the table? Why not do a sin­gle round-trip?

You’re ab­solutely right! Let me fix that.

This is why I still have a job, right?

Peak Slop

I’ve watched this scene play out many times, but the fre­quency is de­creas­ing. Both my tools, and my meth­ods for us­ing them, con­tinue to im­prove. I think Peak Slop has al­ready come and gone.

We are en­ter­ing the post-slop era. My soft­ware is more ro­bust, bet­ter tested, bet­ter in­te­grated, and more ob­serv­able than ever be­fore. And my ve­loc­ity keeps in­creas­ing!

Some days it feels like the sky is the limit. Other days, I am painfully re­minded, the sky is not the limit. The con­text win­dow is the limit. And what hap­pens when I fill the con­text win­dow? Or kill a ses­sion? Switch ma­chines? Hand off the pro­ject to some­one else?

We al­ready know what hap­pens. The agent goes off the rails, or re­quire­ments get lost, and crit­i­cally im­por­tant de­tail gets squashed. So we adapt and mit­i­gate. We doc­u­ment. We list re­quire­ments.

Yes, mil­lions of us are com­ing to the same re­al­iza­tion: we should put more re­quire­ments in writ­ing. We should up­date those re­quire­ments when they change. Look! I wrote a spec! Am I do­ing spec-dri­ven de­vel­op­ment?

Perhaps, but it is noth­ing new. Our men­tors tried to teach us these habits decades ago.

Specifying the plane while we fly it

What’s your fa­vorite fla­vor of spec?

A README.md and AGENTS.md are a good start. Don't forget a testing-guide.md. Maybe an architecture.md, a PRD.md, and a design doc too. Have you considered md.md (to teach your agents how to write .md)? The more .md the better, right?

Unironically, yes. Docs and un­struc­tured specs can get you very, very far. Much far­ther than prompts alone. If you aren’t writ­ing any docs yet, you should just stop read­ing this and start there.

And re­mem­ber, slop in, slop out. Nothing beats an or­ganic, pas­ture-raised, hand-writ­ten spec. Spec-writing is where the act of soft­ware en­gi­neer­ing re­ally hap­pens.

So a few weeks ago, I started ask­ing my­self, how far can I take this? How far should I take this?

Dreaming in mark­down

As the story goes, I fell into an AI psychosis, became a "spec maxxi", and spent hours and hours writing the most beautiful PRDs and TRDs you've ever seen.

I drafted tem­plates and skills and roles, think­ing that maybe my agents can write specs too! I as­sem­bled an army, work­ing to­gether like a mini dark fac­tory, to turn my specs into re­al­ity. My tasks grew more am­bi­tious, and at one point I broke the vibe-cod­ing sound bar­rier: an agent that ran for 1.5 hours un­su­per­vised!

Exciting. But what did that army ship for me? Well, it wasn't slop; in fact, it worked, which is more than I can say about the garbage other companies force me to use every day.

But it was still a bit sloppy. I’m far from a per­fec­tion­ist and I love cut­ting cor­ners more than most, but this some­how was­n’t good enough.

One hall­mark symp­tom of AI psy­chosis is us­ing AI to build AI har­nesses for build­ing prod­ucts, rather than just us­ing AI to build the damn prod­uct. I em­braced my ill­ness, threw out the branch, scrapped all my mark­down, and started all over again.

Acceptance Criteria for AI (ACAI)

A few days later, I no­ticed an am­bi­tious lit­tle sub-agent do­ing some­thing un­ex­pected.

# Requirements

AUTH-1: Accepts `Authorization: Bearer <token>` header

AUTH-2: Tokens are user-scoped, pro­vid­ing ac­cess to any of the user’s re­sources

AUTH-3: Rejects with 401 Unauthorized

// AUTH-1
const authHeader = req.headers["authorization"];

// AUTH-2
const isAuthorized = verifyBearerToken(authHeader);

// AUTH-3
if (!isAuthorized) return res.status(401).json({ error: "Unauthorized" });

The lit­tle guy just went and num­bered my re­quire­ments and then ref­er­enced them all over my code­base.

Why? I did not ask for this! I was disgusted. This is a tight coupling of code to spec, and spec to code, which is bad, right?

You re­ally ex­pect me to refac­tor all my code every time I change my spec?

Oh. I sup­pose that’s a good thing? Interesting. I won­der…

Perhaps these tags can help me nav­i­gate these mas­sive PRs?

Perhaps they can point me to where, ex­actly, a re­quire­ment is sat­is­fied or tested!

Perhaps I can an­no­tate them with notes and states (todo, as­signed, com­pleted)!

Perhaps I can start track­ing ac­cep­tance cov­er­age in­stead of test cov­er­age!

I leaned in. I named these tags ACIDs (Acceptance Criteria IDs).
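To make the coverage idea concrete, the minimal version of the check is just "does every ID from the spec appear somewhere in the tree?". A toy sketch (acai's real matching is presumably smarter; the bare AUTH-1-style tags mirror the snippet above):

```typescript
import { readFileSync } from "node:fs";

// Toy acceptance-coverage check: for each requirement ID from the spec,
// record whether it appears anywhere in the given source files.
function acceptanceCoverage(
  specIds: string[],
  sourceFiles: string[],
): Map<string, boolean> {
  const corpus = sourceFiles
    .map((file) => readFileSync(file, "utf8"))
    .join("\n");
  return new Map(specIds.map((id) => [id, corpus.includes(id)]));
}

// e.g. acceptanceCoverage(["AUTH-1", "AUTH-2", "AUTH-3"], ["src/auth.ts"])
```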

But a few ques­tions re­mained.

Can my ACIDs num­ber and la­bel them­selves?

Is it cum­ber­some to keep them aligned?

How do I share specs and progress across sand­boxes, branches, fea­tures and im­ple­men­ta­tions?

Acai.sh - an open-source toolkit

I built Acai.sh to solve some of these newly in­vented prob­lems. And I’m very ex­cited about the re­sults.

A sim­ple and flex­i­ble tem­plate for fea­ture specs, called fea­ture.yaml. Feature.yaml makes it pos­si­ble to ref­er­ence each re­quire­ment by ACID.

Tiny CLI to power your CI and your agent (available on npm or via GitHub release).

Webapp that serves a dash­board, and a JSON REST API (Elixir, Phoenix, Postgres).

I will keep the hosted ver­sion free for a while, or maybe for­ever de­pend­ing on how pop­u­lar or ex­pen­sive this gets. The source code is on GitHub un­der an Apache 2.0 li­cense.

How it works

Step 1 - Specify

Start by writ­ing a spec for a fea­ture.

Be ambitious: something that adds real value. Don't put nitpicky UI and nail-polish stuff in your specs. Keep the requirements concrete, testable, and focused on what really matters (functional behavior + critical constraints).

Rather than markdown, use acai's feature.yaml format. A spec in Acai is just a numbered list of requirements.

feature.yaml

feature:
  name: imaginary-api-endpoint
  product: api
  description: This is an example feature spec for an imaginary REST API endpoint, using the feature.yaml format
  components:
    AUTH:
      name: Authn and Authz
      requirements:
        1: Accepts Authorization header with `Bearer <token>`
        1.1: Token must be non-expired, non-revoked
        2: Respects the scopes configured for the owner
        2-note: See `access-tokens.SCOPES.1` for complete list of supported scopes
  constraints:
    ENG:
      description: Constraints are for cross-cutting or under-the-hood requirements. Here are some example engineering constraints;
      requirements:
        1: All actions are idempotent
        2: All HTTP 2xx JSON responses wrap their payload in a root `data` key

Of course you could also have LLMs as­sist you with spec writ­ing, but I en­joy the process of writ­ing them my­self, be­cause I like to main­tain some il­lu­sion of self-worth as a soft­ware de­vel­oper.

The key benefit of this YAML format, aside from parsing support, is that each requirement can still be referenced by its unique and stable ID, e.g. my-feature.ENG.2.
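Here is how that might look from the test side. The dotted ID comes from the spec above; the annotation style, test stack (node:test), and endpoint URL are my own choices, not something acai prescribes:

```typescript
import { test } from "node:test";
import assert from "node:assert";

// ACID: imaginary-api-endpoint.ENG.2 (2xx JSON responses wrap their payload in a root `data` key)
test("wraps 2xx payloads in a root `data` key", async () => {
  const res = await fetch("http://localhost:3000/things"); // hypothetical endpoint
  assert.strictEqual(res.status, 200);
  const body = await res.json();
  assert.ok("data" in body, "payload must live under a root `data` key");
});
```

A grep for the ACID now lands on both the implementation and the test that proves it.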

Step 2 - Ship

Copy and paste the prompt be­low.

Note: in addition to the npm package, there are Linux and macOS releases of the CLI available on GitHub.

If all goes well, your agent will em­brace ACIDs, ref­er­enc­ing them in code and tests, so you can make sure each in­di­vid­ual re­quire­ment is im­ple­mented and tested.

Step 3 - Review

No more file-by-file GitHub PR re­views. Use the Acai.sh dash­board to re­view re­quire­ments in­stead.

Ideally, you just add acai push to a GitHub ac­tion (example CI/CD work­flows com­ing soon).

Create a free Team and Access Token at https://​app.acai.sh

Expose the en­vi­ron­ment vari­able

# .env

ACAI_API_TOKEN=<secret_access_token>
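Until the official examples land, a workflow might look something like this; only acai push and ACAI_API_TOKEN come from this post, and the workflow scaffolding and npm package name are assumptions:

```yaml
# Hypothetical GitHub Actions workflow; the npm package name is assumed.
name: acai
on: [push]
jobs:
  push-acceptance-status:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g acai   # assumed package name
      - run: acai push
        env:
          ACAI_API_TOKEN: ${{ secrets.ACAI_API_TOKEN }}
```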

AI outperforms doctors in Harvard trial of emergency triage diagnoses

www.theguardian.com

From George Clooney in ER to Noah Wyle in The Pitt, emer­gency de­part­ment doc­tors have long been pop­u­lar he­roes. But will it soon be time to hang up the scrubs?

A ground­break­ing Harvard study has found that AI sys­tems out­per­formed hu­man doc­tors in high-pres­sure emer­gency med­i­cine triage, di­ag­nos­ing more ac­cu­rately in the po­ten­tially life and death mo­ments when peo­ple are first rushed to hos­pi­tal.

The results were described by independent experts as showing "a genuine step forward" in the clinical reasoning of AIs, and came as part of trials that tested the responses of hundreds of doctors against an AI.

The authors said the results, published in the journal Science, showed large language models (LLMs) "have eclipsed most benchmarks of clinical reasoning".

One ex­per­i­ment fo­cused on 76 pa­tients who ar­rived at the emer­gency room of a Boston hos­pi­tal. An AI and a pair of hu­man doc­tors were each given the same stan­dard elec­tronic health record to read — typ­i­cally in­clud­ing vi­tal sign data, de­mo­graphic in­for­ma­tion and a few sen­tences from a nurse about why the pa­tient was there. The AI iden­ti­fied the ex­act or very close di­ag­no­sis in 67% of cases, beat­ing the hu­man doc­tors, who were right only 50%-55% of the time.

It showed the AI's advantage was particularly pronounced in triage circumstances requiring rapid decisions with minimal information. The diagnostic accuracy of the AI — OpenAI's o1 reasoning model — rose to 82% when more detail was available, compared with the 70-79% accuracy achieved by the expert humans, though this difference was not statistically significant.

It also outperformed a larger cohort of human doctors when asked to provide longer-term treatment plans, such as antibiotic regimes or end-of-life planning. The AI and 46 doctors were asked to examine five clinical case studies, and the computer made significantly better plans, scoring 89% compared with 34% for humans using conventional resources, such as search engines.

But it is not curtains for emergency doctors yet, the researchers said. The study only tested humans against AIs looking at patient data that can be communicated via text. The AI's reading of signals such as the patient's level of distress and visual appearance was not tested. That means the AI was performing more like a clinician producing a second opinion based on paperwork.

"I don't think our findings mean that AI replaces doctors," said Arjun Manrai, one of the lead authors of the study, who heads an AI lab at Harvard Medical School. "I think it does mean that we're witnessing a really profound change in technology that will reshape medicine."

Dr Adam Rodman, another lead author and a doctor at Boston's Beth Israel Deaconess medical centre, where the study took place, said AI LLMs were "among the most impactful technologies in decades". Over the next decade, he said, AI would not replace physicians but join them in "a new triadic care model … the doctor, the patient, and an artificial intelligence system".

In one case in the Harvard study, a patient presented with a blood clot in the lungs and worsening symptoms. The human doctors thought the anticoagulants were failing, but the AI noticed something the humans did not: the patient's history of lupus meant the disease itself might be causing the inflammation of the lungs. The AI was proved correct.

Nearly one in five US physicians is already using AI to assist diagnosis, according to research published last month. In the UK, 16% of doctors are using the tech daily and a further 15% weekly, with "clinical decision-making" among the most common uses, according to a recent Royal College of Physicians survey.

The UK doc­tors’ biggest con­cerns were AI er­ror and li­a­bil­ity risks. Billions are be­ing in­vested in AI health­care com­pa­nies, but ques­tions re­main about the con­se­quences of AI er­ror.

"There is not a formal framework right now for accountability," said Rodman, who also stressed that patients ultimately want humans "to guide them through life or death decisions [and] to guide them through challenging treatment decisions".

Prof Ewen Harrison, co-director of the University of Edinburgh's centre for medical informatics, said the study was important and showed that "these systems are no longer just passing medical exams or solving artificial test cases. They are starting to look like useful second-opinion tools for clinicians, particularly when it is important to consider a wider range of possible diagnoses and avoid missing something important."

Dr Wei Xing, an assistant professor at the University of Sheffield's school of mathematical and physical sciences, said some of the other findings suggested doctors may unconsciously defer to the AI's answer rather than thinking independently.

"This tendency could grow more significant as AI becomes more routinely used in clinical settings," he said. He also highlighted the lack of information about which patients the AI was worse at diagnosing, and whether it struggled more with elderly patients or non-English speakers.

He said: "It does not demonstrate that AI is safe for routine clinical use, nor that the public should turn to freely available AI tools as a substitute for medical advice."

Why TUIs are back by Alcides Fonseca

wiki.alcidesfonseca.com

Terminal User Interfaces (TUIs) are making a comeback. DHH's Omarchy is made of three types of user interfaces: TUIs, for immediate feedback and bonus geek points; webapps, because 37signals (his company) sells SaaS web applications; and the unavoidable GNOME-style native applications that really do not fit the style of the distro.

The same pattern occurred around 10 years ago in code editors. We came from native editors like BBEdit, TextMate (also promoted by DHH), Notepad++ and Sublime to Electron-powered apps like Atom, VSCode and all its forks. The hardcore moved to vim or emacs, trading immediate feedback and higher usability for the steepest learning curve I've seen.

Windows

The lesson is clear: native applications are losing. Windows is the standing joke of GUI libraries: because one API does not succeed, they make up another one, just for that one to fail within the sea of alternatives that already exist.

MFC (1992) wrapped Win32 in C++. If Win32 was inelegant, MFC was Win32 wearing a tuxedo made of other tuxedos. Then came OLE. COM. ActiveX. None of these were really GUI frameworks — they were component architectures — but they infected every corner of Windows development and introduced a level of cognitive complexity that makes Kierkegaard read like Hemingway.

— Jeffrey Snover, in Microsoft hasn't had a coherent GUI strategy since Petzold

Since then, Microsoft has gone through WinForms, WPF, Silverlight, WinUI and MAUI without success. Many enterprise and personal desktop applications still rely on Electron apps, and the last memory I have of coherent visual integration across the whole OS is Windows 98 or 2000.

It turns out that it's a lot of work to recreate one's OS and UI APIs every few years. Coupled with the intermittent attempts at sandboxing and deprecating "too powerful" functionality, the result is that each new layer has gaps, where you can't do certain things which were possible in the previous framework.

— Domenic Denicola, in Windows Native App Development Is a Mess

Linux

The UI inconsistency in Linux was created by design: different teams wanted different outcomes, and they had the freedom to pursue them. GTK and Qt became the two reigning frameworks. While Qt is best known for it, both aimed to support cross-platform native development (once upon a time, I successfully compiled gedit on Windows, learning a lot about C compilation, makefiles and environment variables in the process), but both are only widely used in Linux land. Luckily, applications made in the different toolkits can look okay-ish next to each other, something that the different frameworks on Windows fail to achieve. How many engineer-hours does it take to redo the Windows Control Panel?

Given the difficulty of testing the million different combinations of distros, desktop environments and hardware, most companies do not bother with a native Linux application — they either address it using Electron (minting the lock-down), or they let the open-source community solve it themselves (when they have open APIs).

macOS

Apple used to be a one-book religion. Apple's Human Interface Guidelines used to be cited by every user interface course around the world. Xerox PARC and Apple were the two institutions that studied what it means to have a good human interface. Fast forward a few decades, and Apple is doing the best worst it can to break all the guidelines and consistency it was known for.

Now, Apple has been ignoring Fitts' law, making resizing windows near-impossible (even after trying to fix it) and adding icons to every single menu. macOS is no longer the safe haven where designers can work peacefully.

Electron

Everyone knows that the user experience of Electron apps sucks. The most popular complaint is memory consumption, which to be fair has been decreasing over the last decade, but my main complaint (as I usually drive a 64GB RAM MacBook Pro) is the lack of visual consistency and of keyboard-driven workflows. Looking at my dock, I have 8 native apps (TextMate and macOS system utilities) and 6 Electron apps (Slack, Discord, Mattermost, VSCode, Cursor, Plexamp). And that's from someone who really wishes he could avoid having any Electron app at all.

Let us take the example of Cursor (the same would be true of VSCode). If you are in the agent panel, requesting your next feature, can you move to the agent list in the side panel with just the keyboard? Can you archive it? These are actions that should be the same across every macOS application, and even if there are shortcuts, they are not announced in the menus. And over the last decade, developers have been forgetting to add menus for the same things that are available in their application (mostly because the application is HTML within its sandbox). For the record, Slack does this better than the others, but it's not perfect.

Restarting from scratch

Together with Dart, Google wanted to design a new operating system for new devices, without all the legacy of Android. It wanted a fresh UI toolkit (Flutter UI), but Google gave up on the project before a real product was launched. It's one of those situations where having a monopoly (or a large enough slice of the market) is required to succeed.

Meanwhile, Zed did the same thing in Rust: they designed their own GPU-renderer library (GPUI), which is cross-platform. Despite the high speed, it lacks integration with the host OS out of the box, requiring the developers to add the right bindings. Personally, I would rather have a slow renderer that integrates with my OS than the extra speed.

TUIs

TUIs are fast, easy to automate (RIP Automator) and work reasonably well across different operating systems. You can even run them remotely without any headache-inducing X forwarding. When the native UI toolkits fail, we go back to basics. Claude and Codex have been very successful on the command line: you focus on the interaction and forget about the operating system around you. You can even drive code and apps on cloud machines, or remote into your GPU-powered machine from your iPad. TUIs are filling the void left by Apple and Microsoft in the post-apocalyptic world where every application looks different. Which is good if you are doing art (including computer games), but not if your goal is to get out of the way and let the user do their job.

What’s next

A checkbox is also part of an interface. You're using it to interact with a system by inputting data. Interfaces are better the less thinking they require: whether the interface is a steering wheel or an online form, if you have to spend any amount of time figuring out how to use it, that's bad. As you interact with many things, you want homogeneous interfaces that give you consistent experiences. If you learn that Command + C is the keyboard shortcut for copy, you want that to work everywhere. You don't want to have to remember to use CTRL + Shift + C in certain circumstances or right-click → copy in others; that'd be annoying.

— John Loeber in Bring Back Idiomatic Design

We need to go back to basics. Every developer should learn the theory of what makes a good user interface (software or not!), from the likes of Nielsen, Norman or Johnson, and stop treating user design as a soft skill that does not matter in the software engineering curriculum. In any course, if the UI does not make any sense, the project should be failed. And in the HCI course, we should aim for perfect UIs. It takes work, but that work is mostly about understanding what we need. The programming is already being automated.

Operating system and toolkit authors should drive this investment. They should focus on making accessible toolkits that developers want to use, lower the barrier to entry, and make those platforms last as long as possible. I do not necessarily argue for cross-platform support, but having one such solution would help reduce the Electron and TUI dependency.


A desktop made for one

isene.org

For the first time in twenty-five years I’m sit­ting in front of a com­puter where al­most every pro­gram I touch was de­signed by me. One tool at a time, the off-the-shelf op­tion got swapped out for some­thing a lit­tle closer to how my hands wanted to work. (I wrote about the start of this a cou­ple of weeks ago — that post laid out the early swaps; this one is the view from the other side of the jour­ney.)

It's been a crazy few weeks guiding Claude Code in between all the other stuff I'm doing in life. I direct CC, it works while I do other stuff. I get a second or a few between tasks, and I respond. Then off it goes, adding features or hunting bugs.

Two suites in a happy mar­riage: CHasm, the bedrock — pure x86_64 as­sem­bly, no libc, the layer that paints pix­els and reads keys. Fe₂O₃, the ap­pli­ca­tion layer in Rust, sit­ting on a small shared TUI li­brary called crust.

The CHasm layer (assembly)

The Fe₂O₃ layer (Rust on crust)

What’s left? WeeChat for IRC and other chats. Firefox — the only GUI pro­gram I still use reg­u­larly. That’s it. Everything else is mine.

The vim line

Let me get a bit sen­ti­men­tal about vim, be­cause vim was the one I thought I’d never re­place.

I started using it in 2001. For twenty-five years, every email I wrote went through vim. Every article. Every blog post. Every line of code, every HyperList, and every book. It was the one tool I would have called part of how I think. The muscle memory was so deep that I'd open random text fields in browsers and end up typing :w.

Then in three days I had scribe and stopped us­ing vim.

The first com­mit landed at 00:09 on May 1st. By af­ter­noon to­day (May 3rd) vim was re­placed. Twenty-five years of mus­cle mem­ory rerouted in sev­enty-two hours.

Vim is wonderful, but scribe is mine. It's modal like vim, but missing the ninety percent of features I never used, and carrying the handful of writer-shaped tweaks I always wished vim had. Soft-wrap by default. Reading mode with Limelight-style focus. AI in the prompt without leaving the buffer. HyperList editing with full syntax highlighting and the encryption format the Ruby HyperList app uses. Persistent registers shared across concurrent sessions are a cool feature. None of it is revolutionary, but all of it is shaped to my exact workflow. And whenever I think of an enhancement I want, it's just minutes away. It used to mean waiting months or years or forever for some developer to get the same idea as mine and introduce it into the tool I use.

Why this is pos­si­ble now

It used to be that writ­ing your own ed­i­tor, your own file man­ager, your own win­dow man­ager, was a pro­ject of years. I know, it took me a few years to get RTFM right. A se­ri­ous un­der­tak­ing with a se­ri­ous cost. The eco­nom­ics of it did­n’t work for most peo­ple, even pro­gram­mers. You’d touch a piece of it, get most of the way, run out of week­end, and go back to the off-the-shelf tool.

That barrier is much lower now. With Rust, CC as the workhorse, and the fact that the hard problems of TUI programming have been documented to death… the cost of "build the tool you actually want" has fallen by orders of magnitude.

I don't think this is a story about AI or about Rust specifically. Both helped. But the deeper point is that the gap between "I wish my editor did X" and "okay, here's an editor that does X" is now small enough to fit inside a few evenings of focused work.

I’m not sell­ing any­thing

I should say what this post is not.

It’s not an in­vi­ta­tion to use my soft­ware. Honestly, please don’t. None of it is built for you. It’s built for me — for the way I hold my hands, the way I think about email, the way I want my cal­en­dar to ren­der. I’m sure other peo­ple would find a hun­dred sharp edges I’ve never no­ticed be­cause they hap­pen to align per­fectly with what I do.

It’s also not a re­quest for ku­dos. The code is­n’t novel, nor are the ideas. There’s noth­ing here that has­n’t been done be­fore by some­one with more taste, dis­ci­pline or tal­ent.

What I want to do is show one specific thing: it is now genuinely feasible to make a desktop computing environment that fits one person, instead of a configuration of someone else's tools. This is no longer a heroic decade-long undertaking. This is an actual, weekend-by-weekend, "this thing in my life now does exactly what I want" replacement.

The joy of an au­di­ence of one

The best part of build­ing for my­self: the re­lief of not hav­ing to care.

I don’t have to think about con­fig­ura­bil­ity for some­one with dif­fer­ent pref­er­ences. And I don’t have to sup­port cor­ner cases I’d never per­son­ally hit. Nor do I have to write doc­u­men­ta­tion for users who don’t ex­ist. No more ar­gu­ing on is­sue track­ers about whether a de­fault is the right de­fault — of course it’s the right de­fault, it’s the one I want.

The ed­i­tor’s \? cheat­sheet shows the keys I mem­o­rised, in the or­der I pre­fer, with the bind­ings I think are sen­si­ble. Arrogance? Nope, it’s de­sign with­out com­mit­tee. The au­di­ence is one per­son. Decisions take sec­onds.

It turns out an enor­mous amount of soft­ware com­plex­ity comes from ac­com­mo­dat­ing users who aren’t you. Strip that out and what’s left is small, fast, ex­actly-shaped, and a quiet plea­sure to use.

So

If you've ever caught yourself thinking "I wish my editor / file manager / status bar / shell just did this one thing differently" and you've been told the answer is to write a plugin, learn an obscure config language, or accept the way it is, then consider that another option is more available than it used to be: Build Your Own Software (BYOS).

You prob­a­bly won’t re­place your whole desk­top. I did­n’t plan to ei­ther. But the sat­is­fac­tion of hav­ing even one tool in your daily work­flow that fits you ex­actly is worth a week­end.

I’m a rab­bit in spring :)

MGS2's Source Code Has Just Leaked To The Internet

www.thegamer.com

Metal Gear Solid 2's HD Port Just Had Its Entire Source Code Leaked

Published May 1, 2026, 6:00 PM EDT

Quinton is a Staff Writer from the United States. In his youth, Quinton was ridiculed for mak­ing video game rank­ing lists in­stead of pay­ing at­ten­tion in math class. In adult­hood, peo­ple some­times pay him for it. Life’s a trip.

Taking his first steps into the in­dus­try in 2020, Quinton has writ­ten for sev­eral dig­i­tal pub­li­ca­tions, but his per­ma­nent lit­er­ary home is right here at TheGamer.

Before strik­ing up a con­ver­sa­tion with Quinton, con­sider the risks: he’ll find a way to trans­form al­most any topic into an analy­sis of ei­ther world his­tory, Star Trek, or - at least this one’s rel­e­vant to his ca­reer - all his fa­vorite role-play­ing games.


It's always an exciting day when a beloved video game's source code leaks onto the internet. We've seen such tremendous good times with projects like Ship of Harkinian, for The Legend of Zelda: Ocarina of Time, bringing fresh coats of paint and incredible mods galore to timeless works of art.

If you just so hap­pen to have dreamed big for Metal Gear Solid 2 mod­ding, the world may now be your oys­ter, as the full source code just hit the net. It’s not the orig­i­nal PlayStation 2 ver­sion, but rather, the 2011 HD re­mas­ter, so hey. It even comes in (relatively!) crisp 720p. These are, I’ve seen it said, un­com­pressed as­sets, in­clud­ing a whop­ping 30 gigs’ worth of un­used ma­te­r­ial.

If Only It Happened A Single Day Sooner

May 1, 2026. That is, as of this writing, today. Would that this had occurred on April 30. MGS2 fans know exactly what I'm on about. But I digress—let's dig in.

This is ac­tu­ally the PlayStation Vita and Xbox 360 ports’ code, from what I’m read­ing right now, specif­i­cally from work done by sup­port stu­dio Armature. For most of us, Metal Gear Solid: Master Collection Volume 1 is still the way to go for MGS2, but once the ball gets rolling on de­com­pil­ing the code, who knows what the fu­ture might bring?

I’m still in the process of ver­i­fy­ing cer­tain de­tails here. Kotaku, for in­stance, is re­port­ing that this is ac­tu­ally de­void of as­sets, which runs quite con­trary to the above tweet. Even if it’s just” the code, how­ever, this re­mains a tremen­dous mile­stone for games preser­va­tion and for the mod­ding scene for years to come. As for where this leak took place, the an­swer’s 4chan; I won’t link di­rectly to ei­ther of the per­ti­nent threads on there, but the Kotaku ar­ti­cle has been gra­cious enough to do so for us.

As there is con­flict­ing in­for­ma­tion right now con­cern­ing the ex­act de­tails of the leak’s con­tents, this ar­ti­cle may be up­dated fur­ther be­fore day’s end. Regardless, it’s a pretty good time to be a Metal Gear fan.
