10 interesting stories served every morning and every evening.




1 848 shares, 42 trendiness

Boris Tane

I've been using Claude Code as my primary development tool for about 9 months, and the workflow I've settled into is radically different from what most people do with AI coding tools. Most developers type a prompt, sometimes use plan mode, fix the errors, repeat. The more terminally online are stitching together Ralph loops, MCPs, gas towns (remember those?), etc. The results in both cases are a mess that completely falls apart for anything non-trivial.

The workflow I'm going to describe has one core principle: never let Claude write code until you've reviewed and approved a written plan. This separation of planning and execution is the single most important thing I do. It prevents wasted effort, keeps me in control of architecture decisions, and produces significantly better results, with minimal token usage, compared to jumping straight to code.

flowchart LR
    R[Research] --> P[Plan]
    P --> A[Annotate]
    A -->|repeat 1-6x| A
    A --> T[Todo List]
    T --> I[Implement]
    I --> F[Feedback & Iterate]

Every meaningful task starts with a deep-read directive. I ask Claude to thoroughly understand the relevant part of the codebase before doing anything else. And I always require the findings to be written into a persistent markdown file, never just a verbal summary in the chat.

read this folder in depth, understand how it works deeply, what it does and all its specificities. when that's done, write a detailed report of your learnings and findings in research.md

study the notification system in great details, understand the intricacies of it and write a detailed research.md document with everything there is to know about how notifications work

go through the task scheduling flow, understand it deeply and look for potential bugs. there definitely are bugs in the system as it sometimes runs tasks that should have been cancelled. keep researching the flow until you find all the bugs, don't stop until all the bugs are found. when you're done, write a detailed report of your findings in research.md

Notice the language: "deeply", "in great details", "intricacies", "go through everything". This isn't fluff. Without these words, Claude will skim. It'll read a file, see what a function does at the signature level, and move on. You need to signal that surface-level reading is not acceptable.

The written artifact (research.md) is critical. It's not about making Claude do homework. It's my review surface. I can read it, verify Claude actually understood the system, and correct misunderstandings before any planning happens. If the research is wrong, the plan will be wrong, and the implementation will be wrong. Garbage in, garbage out.

This is the most expensive failure mode with AI-assisted coding, and it's not wrong syntax or bad logic. It's implementations that work in isolation but break the surrounding system. A function that ignores an existing caching layer. A migration that doesn't account for the ORM's conventions. An API endpoint that duplicates logic that already exists elsewhere. The research phase prevents all of this.

Once I've reviewed the research, I ask for a detailed implementation plan in a separate markdown file.

I want to build a new feature that extends the system to perform . write a detailed plan.md document outlining how to implement this. include code snippets

the list endpoint should support cursor-based pagination instead of offset. write a detailed plan.md for how to achieve this. read source files before suggesting changes, base the plan on the actual codebase

The generated plan always includes a detailed explanation of the approach, code snippets showing the actual changes, file paths that will be modified, and considerations and trade-offs.

I use my own .md plan files rather than Claude Code's built-in plan mode. The built-in plan mode sucks. My markdown file gives me full control. I can edit it in my editor, add inline notes, and it persists as a real artifact in the project.

One trick I use constantly: for well-contained features where I've seen a good implementation in an open source repo, I'll share that code as a reference alongside the plan request. If I want to add sortable IDs, I paste the ID generation code from a project that does it well and say "this is how they do sortable IDs, write a plan.md explaining how we can adopt a similar approach." Claude works dramatically better when it has a concrete reference implementation to work from rather than designing from scratch.

But the plan document itself isn't the interesting part. The interesting part is what happens next.

This is the most distinctive part of my workflow, and the part where I add the most value.

flowchart TD
    W[Claude writes plan.md] --> R[I review in my editor]
    R --> N[I add inline notes]
    N --> S[Send Claude back to the document]
    S --> U[Claude updates plan]
    U --> D{Satisfied?}
    D -->|No| R
    D -->|Yes| T[Request todo list]

After Claude writes the plan, I open it in my editor and add inline notes directly into the document. These notes correct assumptions, reject approaches, add constraints, or provide domain knowledge that Claude doesn't have.

The notes vary wildly in length. Sometimes a note is two words: "not optional" next to a parameter Claude marked as optional. Other times it's a paragraph explaining a business constraint or pasting a code snippet showing the data shape I expect.

"use drizzle:generate for migrations, not raw SQL" — domain knowledge Claude doesn't have

"no — this should be a PATCH, not a PUT" — correcting a wrong assumption

"remove this section entirely, we don't need caching here" — rejecting a proposed approach

"the queue consumer already handles retries, so this retry logic is redundant. remove it and just let it fail" — explaining why something should change

"this is wrong, the visibility field needs to be on the list itself, not on individual items. when a list is public, all items are public. restructure the schema section accordingly" — redirecting an entire section of the plan

Then I send Claude back to the document:

I added a few notes to the document, address all the notes and update the document accordingly. don't implement yet

This cycle repeats 1 to 6 times. The explicit "don't implement yet" guard is essential. Without it, Claude will jump to code the moment it thinks the plan is good enough. It's not good enough until I say it is.

Why This Works So Well

The markdown file acts as shared mutable state between me and Claude. I can think at my own pace, annotate precisely where something is wrong, and re-engage without losing context. I'm not trying to explain everything in a chat message. I'm pointing at the exact spot in the document where the issue is and writing my correction right there.

This is fundamentally different from trying to steer implementation through chat messages. The plan is a structured, complete specification I can review holistically. A chat conversation is something I'd have to scroll through to reconstruct decisions. The plan wins every time.

Three rounds of "I added notes, update the plan" can transform a generic implementation plan into one that fits perfectly into the existing system. Claude is excellent at understanding code, proposing solutions, and writing implementations. But it doesn't know my product priorities, my users' pain points, or the engineering trade-offs I'm willing to make. The annotation cycle is how I inject that judgement.

add a detailed todo list to the plan, with all the phases and individual tasks necessary to complete the plan - don't implement yet

This creates a checklist that serves as a progress tracker during implementation. Claude marks items as completed as it goes, so I can glance at the plan at any point and see exactly where things stand. Especially valuable in sessions that run for hours.

When the plan is ready, I issue the implementation command. I've refined this into a standard prompt I reuse across sessions:

implement it all. when you're done with a task or phase, mark it as completed in the plan document. do not stop until all tasks and phases are completed. do not add unnecessary comments or jsdocs, do not use any or unknown types. continuously run typecheck to make sure you're not introducing new issues.

This single prompt encodes everything that matters:

"implement it all": do everything in the plan, don't cherry-pick

"mark it as completed in the plan document": the plan is the source of truth for progress

"do not stop until all tasks and phases are completed": don't pause for confirmation mid-flow

"do not add unnecessary comments or jsdocs": keep the code clean

"do not use any or unknown types": maintain strict typing

"continuously run typecheck": catch problems early, not at the end

I use this exact phrasing (with minor variations) in virtually every implementation session. By the time I say "implement it all," every decision has been made and validated. The implementation becomes mechanical, not creative. This is deliberate. I want implementation to be boring. The creative work happened in the annotation cycles. Once the plan is right, execution should be straightforward.

Without the planning phase, what typically happens is Claude makes a reasonable-but-wrong assumption early on, builds on top of it for 15 minutes, and then I have to unwind a chain of changes. The "don't implement yet" guard eliminates this entirely.

Once Claude is executing the plan, my role shifts from architect to supervisor. My prompts become dramatically shorter.

flowchart LR
    I[Claude implements] --> R[I review / test]
    R --> C{Correct?}
    C -->|No| F[Terse correction]
    F --> I
    C -->|Yes| N{More tasks?}
    N -->|Yes| I
    N -->|No| D[Done]

Where a planning note might be a paragraph, an implementation correction is often a single sentence:

"You built the settings page in the main app when it should be in the admin app, move it."

Claude has the full context of the plan and the ongoing session, so terse corrections are enough.

Frontend work is the most iterative part. I test in the browser and fire off rapid corrections.

For visual issues, I sometimes attach screenshots. A screenshot of a misaligned table communicates the problem faster than describing it.

"this table should look exactly like the users table, same header, same pagination, same row density."

This is far more precise than describing a design from scratch. Most features in a mature codebase are variations on existing patterns. A new settings page should look like the existing settings pages. Pointing to the reference communicates all the implicit requirements without spelling them out. Claude would typically read the reference file(s) before making the correction.

When something goes in the wrong direction, I don't try to patch it. I revert and re-scope by discarding the git changes:

"I reverted everything. Now all I want is to make the list view more minimal — nothing else."

Narrowing scope after a revert almost always produces better results than trying to incrementally fix a bad approach.

Even though I delegate execution to Claude, I never give it total autonomy over what gets built. I do the vast majority of the active steering in the plan.md documents.

This matters because Claude will sometimes propose solutions that are technically correct but wrong for the project. Maybe the approach is over-engineered, or it changes a public API signature that other parts of the system depend on, or it picks a more complex option when a simpler one would do. I have context about the broader system, the product direction, and the engineering culture that Claude doesn't.

flowchart TD
    P[Claude proposes changes] --> E[I evaluate each item]
    E --> A[Accept as-is]
    E --> M[Modify approach]
    E --> S[Skip / remove]
    E --> O[Override technical choice]
    A & M & S & O --> R[Refined implementation scope]

Cherry-picking from proposals: When Claude identifies multiple issues, I go through them one by one: "for the first one, just use Promise.all, don't make it overly complicated; for the third one, extract it into a separate function for readability; ignore the fourth and fifth ones, they're not worth the complexity." I'm making item-level decisions based on my knowledge of what matters right now.

Trimming scope: When the plan includes nice-to-haves, I actively cut them. "remove the download feature from the plan, I don't want to implement this now." This prevents scope creep.

Protecting existing interfaces: I set hard constraints when I know something shouldn't change: "the signatures of these three functions should not change, the caller should adapt, not the library."

Overriding technical choices: Sometimes I have a specific preference Claude wouldn't know about: "use this model instead of that one" or "use this library's built-in method instead of writing a custom one." Fast, direct overrides.

Claude handles the mechanical execution, while I make the judgement calls. The plan captures the big decisions upfront, and selective guidance handles the smaller ones that emerge during implementation.

I run research, planning, and implementation in a single long session rather than splitting them across separate sessions. A single session might start with deep-reading a folder, go through three rounds of plan annotation, then run the full implementation, all in one continuous conversation.

I am not seeing the performance degradation everyone talks about past 50% of the context window. Actually, by the time I say "implement it all," Claude has spent the entire session building understanding: reading files during research, refining its mental model during annotation cycles, absorbing my domain knowledge corrections.

When the context window fills up, Claude's auto-compaction maintains enough context to keep going. And the plan document, the persistent artifact, survives compaction in full fidelity. I can point Claude to it at any point in time.

The Workflow in One Sentence

Read deeply, write a plan, annotate the plan until it's right, then let Claude execute the whole thing without stopping, checking types along the way.

That's it. No magic prompts, no elaborate system instructions, no clever hacks. Just a disciplined pipeline that separates thinking from typing. The research prevents Claude from making ignorant changes. The plan prevents it from making wrong changes. The annotation cycle injects my judgement. And the implementation command lets it run without interruption once every decision has been made.

Try my workflow, and you'll wonder how you ever shipped anything with coding agents without an annotated plan document sitting between you and the code.


...

Read the original on boristane.com »

2 441 shares, 56 trendiness

Attention Media ≠ Social Networks

When web-based social networks started flourishing nearly two decades ago, they were genuinely social networks. You would sign up for a popular service, follow people you knew or liked and read updates from them. When you posted something, your followers would receive your updates as well. Notifications were genuine. The little icons in the top bar would light up because someone had sent you a direct message or engaged with something you had posted. There was also, at the beginning of this millennium, a general sense of hope and optimism around technology, computers and the Internet. Social networking platforms were one of the services that were part of what was called Web 2.0, a term used for websites built around user participation and interaction. It felt as though the information superhighway was finally reaching its potential. But sometime between 2012 and 2016, things took a turn for the worse.

First came the infamous infinite scroll. I remember feeling uneasy the first time a web page no longer had a bottom. Logically, I knew very well that everything a browser displays is a virtual construct. There is no physical page. It is just pixels pretending to be one. Still, my brain had learned to treat web pages as objects with a beginning and an end. The sudden disappearance of that end disturbed my sense of ease.

Then came the bogus notifications. What had once been meaningful signals turned into arbitrary prompts. Someone you followed had posted something unremarkable and the platform would surface it as a notification anyway. It didn't matter whether the notification was relevant to me. The notification system stopped serving me and started serving itself. It felt like a violation of an unspoken agreement between users and services. Despite all that, these platforms still remained social in some diluted sense. Yes, the notifications were manipulative, but they were at least about people I actually knew or had chosen to follow. That, too, would change.

Over time, my timeline contained fewer and fewer posts from friends and more and more content from random strangers. Using these services began to feel like standing in front of a blaring loudspeaker, broadcasting fragments of conversations from all over the world directly in my face. That was when I gave up on these services. There was nothing social about them anymore. They had become attention media. My attention is precious to me. I cannot spend it mindlessly scrolling through videos that have neither relevance nor substance.

But where one avenue disappeared, another emerged. A few years ago, I stumbled upon Mastodon and it reminded me of the early days of Twitter. Back in 2006, I followed a small number of folks of the nerd variety on Twitter and received genuinely interesting updates from them. But when I log into the ruins of those older platforms now, all I see are random videos presented to me for reasons I can neither infer nor care about. Mastodon, by contrast, still feels like social networking in the original sense. I follow a small number of people I genuinely find interesting and I receive their updates and only their updates. What I see is the result of my own choices rather than a system trying to capture and monetise my attention. There are no bogus notifications. The timeline feels calm and predictable. If there are no new updates from people I follow, there is nothing to see. It feels closer to how social networks used to work originally. I hope it stays that way.

...

Read the original on susam.net »

3 400 shares, 11 trendiness

Why is Claude an Electron App?

The state of coding agents can be summed up by this fact:

Claude spent $20k on an agent swarm implementing (kinda) a C compiler in Rust, but desktop Claude is an Electron app.

If you’re un­fa­mil­iar, Electron is a cod­ing frame­work for build­ing desk­top ap­pli­ca­tions us­ing web tech, specif­i­cally HTML, CSS, and JS. What’s great about Electron is it al­lows you to build one desk­top app that sup­ports Windows, Mac, and Linux. Plus it lets de­vel­op­ers use ex­ist­ing web app code to get started. It’s great for teams big and small. Many apps you prob­a­bly use every day are built with Electron: Slack, Discord, VS Code, Teams, Notion, and more.

There are down­sides though. Electron apps are bloated; each runs its own Chromium en­gine. The min­i­mum app size is usu­ally a cou­ple hun­dred megabytes. They are of­ten laggy or un­re­spon­sive. They don’t in­te­grate well with OS fea­tures.

But these down­sides are dra­mat­i­cally out­weighed by the abil­ity to build and main­tain one app, ship­ping it every­where.

But now we have cod­ing agents! And one thing cod­ing agents are prov­ing to be pretty good at is cross-plat­form, cross-lan­guage im­ple­men­ta­tions given a well-de­fined spec and test suite.

On the sur­face, this abil­ity should ren­der Electron’s ben­e­fits ob­so­lete! Rather than write one web app and ship it to each plat­form, we should write one spec and test suite and use cod­ing agents to ship na­tive code to each plat­form. If this abil­ity is real and adopted, users get snappy, per­for­mant, na­tive apps from small, fo­cused teams serv­ing a broad mar­ket.

But we’re still lean­ing on Electron. Even Anthropic, one of the lead­ers in AI cod­ing tools, who keeps pub­lish­ing flashy agen­tic cod­ing achieve­ments, still uses Electron in the Claude desk­top app. And it’s slow, buggy, and bloated app.

So why are we still using Electron and not embracing the agent-powered, spec-driven development future?

For one thing, coding agents are really good at the first 90% of dev. But that last bit — nailing down all the edge cases and continuing support once it meets the real world — remains hard, tedious, and requires plenty of agent hand-holding.

Anthropic's Rust-based C compiler slammed into this wall, after screaming through the bulk of the tests:

The resulting compiler has nearly reached the limits of Opus's abilities. I tried (hard!) to fix several of the above limitations but wasn't fully successful. New features and bugfixes frequently broke existing functionality.

The resulting compiler is impressive, given the time it took to deliver it and the number of people who worked on it, but it is largely unusable. That last mile is hard.

And this gets even worse once a program meets the real world. Messy, unexpected scenarios stack up and development never really ends. Agents make it easier, sure, but hard product decisions get challenged and require human decisions.

Further, with 3 different apps produced (Mac, Windows, and Linux) the surface area for bugs and support increases 3-fold. Sure, there are local quirks with Electron apps, but most of it is mitigated by the common wrapper. Not so with native!

A good test suite and spec could enable the Claude team to ship a Claude desktop app native to each platform. But the resulting overhead of that last 10% of dev and the increased support and maintenance burden will remain.

For now, Electron still makes sense. Coding agents are amazing. But the last mile of dev and the support surface area remains a real concern.

Over at Hacker News, Claude Code's Boris Cherney chimes in:

Boris from the Claude Code team here.

Some of the engineers working on the app worked on Electron back in the day, so preferred building non-natively. It's also a nice way to share code so we're guaranteed that features across web and desktop have the same look and feel. Finally, Claude is great at it.

That said, engineering is all about tradeoffs and this may change in the future!

There we go: developer familiarity and simpler maintainability across multiple platforms is worth the "tradeoffs". We have incredible coding agents that are great at transpilation, but there remain costs that outweigh the costs of shipping a non-native app.

...

Read the original on www.dbreunig.com »

4 367 shares, 24 trendiness

How Taalas "prints" LLM onto a chip?

A startup called Taalas recently released an ASIC chip running Llama 3.1 8B (3/6-bit quant) at an inference rate of 17,000 tokens per second. That's like writing around 30 A4-sized pages in one second. They claim it's 10x cheaper in ownership cost than GPU-based inference systems and consumes 10x less electricity. And yeah, it's about 10x faster than state-of-the-art inference.

I tried to read through their blog and they've literally "hardwired" the model's weights on the chip. Initially, this didn't sound intuitive to me. Coming from a software background, with a hobbyist understanding of LLMs, I couldn't wrap my head around how you just "print" an LLM onto a chip. So, I decided to dig into multiple blogposts, LocalLLaMA discussions, and hardware concepts. It was much more interesting than I had thought. Hence this blogpost.

Taalas is a 2.5-year-old company and this is their first chip. Taalas's chip is a fixed-function ASIC (Application-Specific Integrated Circuit). Kinda like a CD-ROM/game cartridge, or a printed book, it only holds one model and cannot be rewritten.

LLMs consist of sequential layers. For example, Llama 3.1 8B has 32 layers. The task of each layer is to further refine the input. Each layer is essentially large weight matrices (the model's 'knowledge').

When a user inputs a prompt, it is converted into a vector of numbers, aka embeddings.

On a normal GPU, the input vector enters the compute cores. Then the GPU fetches the Layer 1 weights from VRAM/HBM (the GPU's RAM), does matrix multiplication, and stores the intermediate results (aka activations) back in VRAM. Then it fetches the Layer 2 weights and the previous result, does the math, and saves it to VRAM again. This cycle continues until the 32nd layer, just to generate a single token. Then, to generate the next token, the GPU repeats this entire 32-layer journey.

So, due to this constant back-and-forth, the memory bus induces latency and consumes significant amounts of energy. This is the memory bandwidth bottleneck, sometimes loosely called the Von Neumann bottleneck or the "memory wall".
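
As a rough illustration of that loop (plain C++ pseudocode with made-up function names, not any vendor's actual kernel), every generated token walks all the layers, and every layer walk drags a full set of weights across the memory bus:

#include <vector>

// Illustrative sketch only: names and types are invented. The point is the
// shape of the loop, where each layer's weights must cross the memory bus
// before the math can happen.
struct Layer { /* weight matrices resident in VRAM/HBM */ };

std::vector<float> fetch_weights_from_hbm(const Layer&) { return {}; }       // stand-in for a memory-bound read
std::vector<float> matmul(const std::vector<float>&,
                          const std::vector<float>& act) { return act; }     // stand-in for the layer math

std::vector<float> generate_one_token(const std::vector<Layer>& layers,
                                      std::vector<float> activations) {
    for (const Layer& layer : layers) {                   // 32 layers for Llama 3.1 8B
        auto weights = fetch_weights_from_hbm(layer);     // round-trip over the memory bus
        activations  = matmul(weights, activations);      // intermediate result written back
    }
    return activations;                                   // projected into the next token
}
// Generating N tokens repeats this whole walk N times; the bus, not the ALUs,
// sets the pace.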

Taalas sidesteps this wall entirely. They just engraved the 32 layers of Llama 3.1 sequentially on a chip. Essentially, the model's weights are physical transistors etched into the silicon.

Importantly, they also claim to have invented a hardware scheme where they can store a 4-bit value and perform the multiplication related to it using a single transistor. I will refer to it as their 'magic multiplier'.

Now, when the user's input arrives, it gets converted into a vector and flows into the physical transistors making up Layer 1. It does the multiplication via their 'magic multiplier' and, instead of the result being saved in VRAM, the electrical signal simply flows down physical wires into the Layer 2 transistors (via pipeline registers, from what I understand). The data streams continuously through the silicon until the final output token is generated.

They don't use external DRAM/HBM, but they do use a small amount of on-chip SRAM. Why SRAM? Due to cost and complexity, manufacturers don't mix DRAM and logic gates. That's why GPUs have separate VRAM. (Also, SRAM isn't facing a supply chain crisis; DRAM is.)

Taalas uses this on-chip SRAM for the KV cache (the temporary memory/context window of an ongoing conversation) and to hold LoRA adapters for fine-tuning.

Technically yes, I read lots of comments saying that. But Taalas designed a base chip with a massive, generic grid of logic gates and transistors. To map a specific model onto the chip, they only need to customize the top two layers/masks. While it's still slow, it's much faster than building chips from the ground up.

It took them two months to develop the chip for Llama 3.1 8B. In the AI world, where one week is a year, that's super slow. But in the world of custom chips, this is supposed to be insanely fast.

As someone stuck running local models on a laptop without a massive GPU, I am keeping my fingers crossed for this type of hardware to be mass-produced soon.

...

Read the original on www.anuragk.com »

5 356 shares, 16 trendiness

xaskasdf/ntransformer: High-efficiency LLM inference engine in C++/CUDA. Run Llama 70B on RTX 3090.

High-efficiency C++/CUDA LLM inference engine. Runs Llama 70B on a single RTX 3090 (24 GB VRAM) by streaming model layers through GPU memory via PCIe, with optional NVMe direct I/O that bypasses the CPU entirely.

3-tier adaptive caching auto-sizes from hardware: VRAM-resident layers (zero I/O) + pinned RAM (H2D only) + NVMe/mmap fallback. Achieves 83x speedup over the mmap baseline for 70B on consumer hardware (RTX 3090 + 48 GB RAM).

The bottleneck is PCIe H2D bandwidth at Gen3 x8 (~6.5 GB/s). Q4_K_M fits 10 more layers in VRAM (36 vs 26), reducing tier B transfers. Layer skip (cosine similarity calibration) eliminates 20/80 layers per token with minimal quality loss.
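
A rough sketch of what that tier decision could look like, purely to illustrate the idea; the names here (LayerTier, TierPlan, tier_for_layer) are invented and are not the repository's actual API:

#include <cstddef>

// Illustrative only: pick a source for each transformer layer based on how
// many layers fit in VRAM and pinned host RAM (auto-sized from hardware).
enum class LayerTier { VramResident, PinnedRam, NvmeOrMmap };

struct TierPlan {
    std::size_t vram_layers;    // e.g. 36 of 80 with Q4_K_M on a 24 GB card
    std::size_t pinned_layers;  // whatever else fits in pinned host RAM
};

LayerTier tier_for_layer(std::size_t layer_idx, const TierPlan& plan) {
    if (layer_idx < plan.vram_layers)
        return LayerTier::VramResident;                      // zero I/O
    if (layer_idx < plan.vram_layers + plan.pinned_layers)
        return LayerTier::PinnedRam;                         // PCIe H2D copy only
    return LayerTier::NvmeOrMmap;                            // slowest fallback path
}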

* Zero external dependencies beyond CUDA Toolkit (no PyTorch, no cuBLAS)

# Build
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_COMPILER=gcc-14 \
  -DCMAKE_CXX_COMPILER=g++-14 \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.1/bin/nvcc
cmake --build . -j

# Run (resident mode — model fits in VRAM)
./ntransformer -m /path/to/llama-8b-q8_0.gguf -p "Hello" -n 128

# Run (streaming mode — model larger than VRAM)
./ntransformer -m /path/to/llama-70b-q6_k.gguf -p "Hello" -n 32 --streaming

# Run with layer skip (fastest for 70B)
./ntransformer -m /path/to/llama-70b-q4_k_m.gguf -p "Hello" -n 32 --streaming --skip-threshold 0.98

# Self-speculative decoding (VRAM layers as draft, no extra model)
./ntransformer -m /path/to/llama-70b-q6_k.gguf -p "Hello" -n 32 --self-spec --draft-k 3

# Chat mode
./ntransformer -m /path/to/model.gguf --chat

# Benchmark
./ntransformer -m /path/to/model.gguf --benchmark -n 64

Running ntransformer with NVMe direct I/O requires system-level modifications. An automated setup script handles all of them:

# Full first-time setup (interactive, creates backups)
sudo ./scripts/setup_system.sh

# Check current system state (no changes)
sudo ./scripts/setup_system.sh --check

# NVMe-only (run after every reboot)
sudo ./scripts/setup_system.sh --nvme-only

* Above 4G Decoding: ON (required for 64-bit BAR mapping)

* IOMMU: OFF (or leave on — the script adds the kernel parameter)

WARNING: This project performs low-level PCIe operations (GPU MMIO writes to NVMe controller registers, userspace NVMe command submission, VFIO device passthrough). While tested extensively on RTX 3090 + WD SN740, incorrect configuration or hardware incompatibilities could theoretically cause:

Data loss on the NVMe device used for raw block storage

Never use your boot drive for NVMe direct I/O. Always use a dedicated secondary NVMe. The authors are not responsible for hardware damage or data loss. Use at your own risk.

For models that don't fit in VRAM, the NVMe backend eliminates the CPU from the data path:

# Build with NVMe support (requires gpu-nvme-direct library)
cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_GPUNVME=ON \
  -DCMAKE_C_COMPILER=gcc-14 -DCMAKE_CXX_COMPILER=g++-14 \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.1/bin/nvcc
cmake --build . -j

# Write GGUF model to NVMe raw device
sudo ./scripts/restore_nvme.sh  # ensure kernel driver is bound
sudo dd if=model.gguf of=/dev/nvme0n1 bs=1M oflag=direct status=progress

# Bind NVMe to VFIO for userspace access
sudo ./scripts/setup_nvme.sh  # loads VFIO, forces D0, enables BusMaster

# Run with NVMe backend
sudo GPUNVME_PCI_BDF=0000:01:00.0 GPUNVME_GGUF_LBA=0 \
  ./build/ntransformer -m /path/to/model.gguf -p "Hello" -n 32 --streaming

# Restore NVMe to kernel driver when done
sudo ./scripts/restore_nvme.sh

The GGUF model file is written to raw NVMe blocks via dd

During inference, each layer (~670 MB for 70B Q6_K) is read via 670 NVMe commands in ~202 ms

Data lands in CUDA pinned staging memory, then async DMA to GPU compute buffers
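
The last hop of that path might look roughly like the sketch below, which uses the standard CUDA runtime API; the buffer names and the function itself are invented, not the project's actual code:

#include <cuda_runtime.h>
#include <cstring>
#include <cstddef>

// Sketch: bytes read from raw NVMe blocks are copied into a pinned (page-locked)
// staging buffer, then an asynchronous H2D DMA moves them into the GPU-side
// layer buffer while compute continues on other streams. The caller must
// synchronize on copy_stream before reusing the staging buffer.
bool stage_layer(const void* nvme_bytes, std::size_t n,
                 void* d_layer_buf,        // device buffer for this layer
                 void* h_pinned_staging,   // allocated once with cudaHostAlloc
                 cudaStream_t copy_stream) {
    std::memcpy(h_pinned_staging, nvme_bytes, n);             // fill staging memory
    return cudaMemcpyAsync(d_layer_buf, h_pinned_staging, n,
                           cudaMemcpyHostToDevice, copy_stream) == cudaSuccess;
}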

...

Read the original on github.com »

6 298 shares, 18 trendiness

Revive Dead Games

...

Read the original on gamedate.org »

7 262 shares, 40 trendiness

Iran students resume anti-government protests

"I don't want to use the word 'frustrated,' because he understands he has plenty of alternatives, but he's curious as to why they haven't… I don't want to use the word 'capitulated,' but why they haven't capitulated," he said.

...

Read the original on www.bbc.com »

8 186 shares, 10 trendiness

Japanese Print Search and Database

Ukiyo-e Search provides an incredible resource: the ability to both search for Japanese woodblock prints by simply taking a picture of an existing print AND the ability to see similar prints across multiple collections of prints. Below is an example print, click to see it in action.

Upload a picture of a print to find similar prints across multiple collections.

Better data, hundreds of thousands of additional images, and better search capabilities are forthcoming. Sign up to be notified when additional features are ready.

...

Read the original on ukiyo-e.org »

9 177 shares, 22 trendiness

Database Transactions — PlanetScale


Transactions are fundamental to how SQL databases work. Trillions of transactions execute every single day, across the thousands of applications that rely on SQL databases.

A transaction is a sequence of actions that we want to perform on a database as a single, atomic operation. An individual transaction can include a combination of reading, creating, updating, and removing data.

In MySQL and Postgres, we begin a new transaction with begin; and end it with commit;. Between these two commands, any number of SQL queries that search and manipulate data can be executed.

A typical sequence is a transaction begin, a few query executions, then the commit. The act of committing is what atomically applies all of the changes made by those SQL statements.

There are some situations where transactions do not commit. This is sometimes due to unexpected events in the physical world, like a hard drive failure or power outage. Databases like MySQL and Postgres are designed to correctly handle many of these unexpected scenarios, using disaster recovery techniques. Postgres, for example, handles this via its write-ahead log mechanism (WAL).

There are also times when we want to intentionally undo a partially-executed transaction. This happens when, midway through a transaction, we encounter missing or unexpected data or get a cancellation request from a client. For this, databases support the rollback; command.

A transaction can make several modifications to the database, and those changes stay isolated from all other ongoing queries and transactions. If, before the transaction commits, we decide to roll back, all of its changes are undone and the database is left unaltered by the transaction.
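
In application code, the commit and rollback paths look roughly like this minimal sketch using Postgres's libpq C API (the connection string and the accounts table are made-up placeholders):

#include <libpq-fe.h>
#include <stdio.h>

// Sketch: begin a transaction, run a few statements, then either commit or
// roll back. Hypothetical schema; error handling kept to the bare minimum.
int main(void) {
    PGconn *conn = PQconnectdb("dbname=test");
    if (PQstatus(conn) != CONNECTION_OK) { PQfinish(conn); return 1; }

    PQclear(PQexec(conn, "BEGIN"));
    PGresult *res = PQexec(conn,
        "UPDATE accounts SET balance = balance - 100 WHERE id = 1");
    if (PQresultStatus(res) == PGRES_COMMAND_OK) {
        PQclear(res);
        PQclear(PQexec(conn, "UPDATE accounts SET balance = balance + 100 WHERE id = 2"));
        PQclear(PQexec(conn, "COMMIT"));    // both updates become visible atomically
    } else {
        PQclear(res);
        PQclear(PQexec(conn, "ROLLBACK"));  // leave the database as if nothing happened
    }
    PQfinish(conn);
    return 0;
}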

By the way, you can use the menu be­low to change the speed of all the ses­sions and an­i­ma­tions in this ar­ti­cle. If the ones above were go­ing too fast or too slow for your lik­ing, fix that here!

A key reason transactions are useful is to allow execution of many queries simultaneously without them interfering with each other. Consider a scenario with two distinct sessions connected to the same database. Session A starts a transaction, selects data, updates it, selects again, and then commits. Session B selects that same data twice during a transaction and again after both of the transactions have completed.

Session B does not see the name update from "ben" to "joe" until after Session A commits the transaction.

Consider the same sequence of events, except instead of committing the transaction in Session A, we roll back.

The second session never sees the effect of any changes made by the first, due to the rollback. This is a nice segue into another important concept in transactions: consistent reads.

During a transaction's execution, we would like it to have a consistent view of the database. This means that even if another transaction simultaneously adds, removes, or updates information, our transaction should get its own isolated view of the data, unaffected by these external changes, until the transaction commits.

MySQL and Postgres both support this capability when operating in REPEATABLE READ mode (plus all stricter modes, too). However, they each take different approaches to achieving this same goal.

Postgres handles this with multi-versioning of rows. Every time a row is inserted or updated, it creates a new row version along with metadata to keep track of which transactions can access the new version. MySQL handles this with an undo log. Changes to rows immediately overwrite old versions, but a record of modifications is maintained in a log file, in case they need to be reconstructed.

Let's take a close look at each.

Start with Postgres, using a simple user table and a sequence of statements in Session A. Here is what happens as the statements execute:

* An update is made to the user with ID 4, changing the name from "liz" to "aly". This causes a new version of the row to be created, while the old one is maintained.

* The old version of the row had its xmax set to 10 (xmax = max transaction ID)

* The new version of the row also had its xmin set to 10 (xmin = min transaction ID)

* The transaction commits, making the update visible to the broader database

But now we have two versions of the row with ID = 4. Ummm… that's odd! The key here is xmin and xmax.

xmin stores the ID of the transaction that created a row version, and xmax is the ID of the transaction that caused a replacement row to be created. Postgres uses these to determine which row version each transaction sees.
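
A deliberately oversimplified sketch of that visibility rule in C++ (real Postgres also consults commit status, in-progress transaction lists, snapshots, and hint bits, none of which appear here):

#include <cstdint>

// Toy visibility check: a row version is visible to a reading transaction if
// it was created by a transaction the reader can see (xmin) and has not yet
// been superseded from the reader's point of view (xmax). Hugely simplified.
struct RowVersion {
    uint32_t xmin;  // txid that created this version
    uint32_t xmax;  // txid that superseded it (0 = still current)
};

bool visible_to(const RowVersion& row, uint32_t reader_txid) {
    const bool created_before_me  = row.xmin <= reader_txid;
    const bool not_yet_superseded = row.xmax == 0 || row.xmax > reader_txid;
    return created_before_me && not_yet_superseded;
}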

Let's look at Session A again, but this time with an additional Session B running simultaneously.

Before the commit, Session B could not see Session A's modification. It sees the name as "liz" while Session A sees "aly" within the transaction. At this stage, it has nothing to do with xmin and xmax, but rather because other transactions cannot see uncommitted data. After Session A commits, Session B can now see the new name of "aly" because the data is committed and the transaction ID is greater than 10.

If the transaction instead gets a rollback, those row changes do not get applied, leaving the database in a state as if the transaction never began in the first place.

This is a simple scenario. Only one of the transactions modifies data. Session B only does select statements! When both simultaneously modify data, each one will be able to "see" the modifications it made, but these changes won't bleed out into other transactions until commit. Here's an example flow where each transaction selects data, updates data, selects again, commits, and finally both do a final select.

The concurrent transactions cannot see each other's changes until the data is committed. The same mechanisms are used to control data visibility when there are hundreds of simultaneous transactions on busy Postgres databases.

Before we move on to MySQL, one more important note. What happens to all those duplicated rows? Over time, we can end up with thousands of duplicate rows that are no longer needed. There are several things Postgres does to mitigate this issue, but I'll focus on the VACUUM FULL command. When run, this purges versions of rows that are so old that we know no transactions will need them going forward. It compacts the table in the process.

When the VACUUM FULL command executes, all unused rows are eliminated, and the gaps in the table are compressed, reclaiming the unused space.

MySQL achieves the consistent read behavior using a different approach. Instead of keeping many copies of each row, MySQL immediately overwrites old row data with new row data when modified. This means it requires less maintenance over time for the rows (in other words, we don't need to do vacuuming like Postgres).

However, MySQL still needs the ability to show different versions of a row to different transactions. For this, MySQL uses an undo log — a log of recently-made row modifications, allowing a transaction to reconstruct past versions on-the-fly.

Each MySQL row carries two metadata columns. These keep track of the ID of the transaction that updated the row most recently (xid), and a reference to the most recent modification in the undo log (ptr).

When there are simultaneous transactions, transaction A may clobber the version of a row that transaction B needs to see. Transaction B can see the previous version(s) of the row by checking the undo log, which stores old values so long as any running transaction may need to see them.

There can even be several undo log records in the log for the same row simultaneously. In such a case, MySQL will choose the correct version based on transaction identifiers.

The idea of repeatable reads is important for databases, but this is just one of several isolation levels that databases like MySQL and Postgres support. This setting determines how "protected" each transaction is from seeing data that other simultaneous transactions are modifying. Adjusting this setting gives the user control of the tradeoff between isolation and performance.

Both MySQL and Postgres have four levels of isolation. From strongest to weakest, these are: Serializable, Repeatable Read, Read Committed, and Read Uncommitted.

Stronger levels of isolation provide more protection from data inconsistency issues across transactions, but come at the cost of worse performance in some scenarios.

Serializable is the strongest. In this mode, all transactions behave as if they were run in a well-defined sequential order, even if in reality many ran simultaneously. This is accomplished via complex locking and waiting.

The other three gradually loosen the strictness, and can be described by the undesirable phenomena they allow or prohibit.

A phantom read is one where a transaction runs the same SELECT multiple times, but sees different results the second time around. This is typically due to data that was inserted and committed by a different transaction between the two reads.

After serializable, the next level down is repeatable read. Under the SQL standard, the repeatable read level allows phantom reads, though in Postgres they still aren't possible.

Non-repeatable reads happen when a transaction reads a row, and then later re-reads the same row, finding changes made by another already-committed transaction. This is dangerous because we may have already made assumptions about the state of our database, but that data has changed under our feet.

The read committed isolation level, the next after repeatable read, allows these and phantom reads to occur. The tradeoff is slightly better database transaction performance.

The last and arguably worst phenomenon is dirty reads. A dirty read is one where a transaction is able to see data written by another transaction running simultaneously that is not yet committed. This is really bad! In most cases, we never want to see data that is uncommitted from other transactions.

The loosest isolation level, read uncommitted, allows for dirty reads and the other two described above. It is the most dangerous and also most performant mode.

The keen-eyed observer will notice that I have ignored a particular scenario, quite on purpose, up to this moment. What if two transactions need to modify the same row at the same time?

Precisely how this is handled depends on both (A) the database system and (B) the isolation level. To keep the discussion simple, we'll focus on how this works for the strictest (SERIALIZABLE) level in Postgres and MySQL. Yet again, the world's two most popular relational databases take very different approaches here.

A lock is a software mechanism for giving ownership of a piece of data to one transaction (or a set of transactions). Transactions obtain a lock on a row when they need to "own" it without interruption. When the transaction is finished using the rows, it releases the lock to allow other transactions access.

Though there are many types of locks in practice, the two main ones you need to know about here are shared locks and exclusive locks.

A shared (S) lock can be obtained by multiple transactions on the same row simultaneously. Typically, transactions will obtain shared locks on a row when reading it, because multiple transactions can do so simultaneously safely.

An exclusive (X) lock can only be owned by one transaction for any given row at any given time. When a transaction requests an X lock, no other transactions can have any type of lock on the row. These are used when a transaction needs to write to a row, because we don't want two transactions simultaneously messing with column values!

In SERIALIZABLE mode, all transactions must always obtain X locks when updating a row. Most of the time, this works fine, other than the performance overhead of locking. In scenarios where two transactions are both trying to update the same row simultaneously, this can lead to deadlock!

MySQL can detect deadlock and will kill one of the involved transactions to allow the other to make progress.

Postgres handles write conflicts in SERIALIZABLE mode with less locking, and avoids the deadlock issue completely.

As transactions read and write rows, Postgres creates predicate locks, which are "locks" on sets of rows specified by a predicate. For example, if a transaction updates all rows with IDs 10–20, it will take a lock on the predicate WHERE id BETWEEN 10 AND 20. These locks are not used to block access to rows, but rather to track which rows are being used by which transactions, and then detect data conflicts on-the-fly.

Combined with multi-row versioning, this lets Postgres use optimistic conflict resolution. It never blocks transactions while waiting to acquire a lock, but it will kill a transaction if it detects that it's violating the SERIALIZABLE guarantees.

The difference from MySQL is subtle from the outside, but it is implemented in quite different ways. Both Postgres and MySQL resort to killing one transaction in favor of maintaining SERIALIZABLE guarantees. Applications must account for this outcome, and have retry logic for important transactions.
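
A sketch of that retry logic with libpq, keyed off SQLSTATE 40001 (serialization_failure); the statement inside the transaction is a placeholder, and real code would add backoff and proper error reporting:

#include <libpq-fe.h>
#include <string.h>
#include <stdio.h>

// Sketch: retry a SERIALIZABLE transaction when the server aborts it with a
// serialization failure. The counters table is hypothetical.
static int run_once(PGconn *conn) {
    PQclear(PQexec(conn, "BEGIN ISOLATION LEVEL SERIALIZABLE"));
    PGresult *res = PQexec(conn, "UPDATE counters SET n = n + 1 WHERE id = 1");
    const char *state = PQresultErrorField(res, PG_DIAG_SQLSTATE);
    int retryable = (state && strcmp(state, "40001") == 0);
    int ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);
    PQclear(PQexec(conn, ok ? "COMMIT" : "ROLLBACK"));
    return ok ? 0 : (retryable ? 1 : -1);   // 0 = done, 1 = retry, -1 = give up
}

int main(void) {
    PGconn *conn = PQconnectdb("dbname=test");
    if (PQstatus(conn) != CONNECTION_OK) { PQfinish(conn); return 1; }
    for (int attempt = 0; attempt < 5; ++attempt) {
        int r = run_once(conn);
        if (r != 1) break;                   // succeeded or hit a non-retryable error
        fprintf(stderr, "serialization failure, retrying (%d)\n", attempt + 1);
    }
    PQfinish(conn);
    return 0;
}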

Transactions are just one tiny corner of all the amazing engineering that goes into databases, and we only scratched the surface! But a fundamental understanding of what they are, how they work, and the guarantees of the four isolation levels is helpful for working with databases more effectively.

What esoteric corner of database management systems would you like to see us cover next? Join our Discord community and let us know.

...

Read the original on planetscale.com »

10 176 shares, 9 trendiness

making bloom filters 2x more accurate

A bloom filter is a probabilistic data structure that can potentially make SQL queries execute orders of magnitude faster. Today I want to tell you how we use them in Floe, and how we make them produce 2x fewer false results.

Feel free to skip this section if you know the answer.

A bloom filter is a probabilistic data structure that answers one question: "Is this element definitely not in the set?" It can give false positives (says yes when the answer is no), but never false negatives (it won't miss elements that are actually there). The main benefit is that a well-designed bloom filter can be really fast - a few CPU cycles per lookup. That's faster than a single function call.

The structure: An array of m bits, all initially set to 0.

Insertion: To add an element, we:

Hash the element using k different hash functions

Each hash gives us a position in the bit array

Set all k bits at those positions to 1

Lookup: To check if an element exists:

Hash it with the same k hash functions

Check if ALL k bits are set to 1

If any bit is 0 → definitely not in the set

If all bits are 1 → probably in the set
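
As a concrete (and deliberately naive) sketch of exactly that structure, here is a textbook filter in C++; the double-hashing trick used to derive the k positions is illustrative, not what any particular engine uses:

#include <vector>
#include <string>
#include <functional>
#include <cstddef>

// Naive textbook bloom filter: m bits, k positions per key derived from two
// base hashes (Kirsch-Mitzenmacher style). Purely illustrative.
class BloomFilter {
public:
    BloomFilter(std::size_t m_bits, std::size_t k) : bits_(m_bits, false), k_(k) {}

    void put(const std::string& key) {
        for (std::size_t i = 0; i < k_; ++i) bits_[position(key, i)] = true;   // set k bits
    }
    bool maybe_contains(const std::string& key) const {
        for (std::size_t i = 0; i < k_; ++i)
            if (!bits_[position(key, i)]) return false;   // any 0 bit => definitely absent
        return true;                                      // all 1s => probably present
    }
private:
    std::size_t position(const std::string& key, std::size_t i) const {
        const std::size_t h1 = std::hash<std::string>{}(key);
        const std::size_t h2 = std::hash<std::string>{}(key + "#salt");  // cheap second hash
        return (h1 + i * h2) % bits_.size();
    }
    std::vector<bool> bits_;
    std::size_t k_;
};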

Here we're discussing bloom filters in the context of database engineering. If you're not familiar with how databases join tables, here's a quick primer: a hash join matches rows from two tables. First, it loads the smaller table into a hash table (that's the build side). Then it scans the larger table row by row, looking up each value in the hash table to find matches (that's the probe side). Most of the work happens on the probe side, because the larger table can have billions of rows.

When processing millions of rows we want to avoid all the extra work that we can. Don't decompress the data you won't use. Don't probe hash tables for keys that don't exist. Discard rows as soon as you can - it's called being efficient (not lazy!)

We use bloom filters at 2 critical places:

Hash joins: before probing the hash table during the probe phase

Storage engine scans: pushed down after the build phase, for first-pass filtering before decompression

Let's imagine a situation where we want to join two tables, where only 1% of 10 billion probe-side rows will be matched. Without filtering we would need to decompress and probe 99% of those rows before discarding them.

What we do instead:

Build phase: At build phase we populate the bloom filter with hashes of the build side.

Pushdown: after the build phase is complete we push down the bloom filter, which at this point is read-only, to the storage engine.

First-pass filtering: The storage engine decompresses only the columns needed for bloom filtering. It checks each value against the bloom filter, and marks values that definitely do not match the build side.

Adaptive behaviour: Here it gets interesting. We keep statistics of how many rows we skipped. If we end up discarding almost no rows we don't bother with first-pass filtering and disable it. But we keep checking decompressed rows to re-enable filtering if the stats improve.

Decompressing and probing 99% of 10 billion rows = time to get another coffee. Or three.

That's a huge 9x reduction in scan and I/O!

Why do we need to keep the filtering adaptive? Because sometimes bloom doesn't help:

Join selectivity is high (most rows match anyway)

Data is skewed (many duplicates saturate the filter)

For hash joins we use a simpler, almost textbook-style bloom filter: insert values into the bloom filter at build phase, read it at probe phase before probing the hash buckets.

We landed on using a fixed 256KB bloom filter per join as a sweet spot between size and efficiency. Go bigger - waste the memory and overflow L2/L3 cache (cache misses hurt). Go smaller - might as well flip a coin.

Why fixed size? Predictable performance. No dynamic allocation. The compiler can optimize the hell out of it. Lock-free access. The last one is especially critical when we're talking about a concurrent, performance-first database engine.

The Problem: When too many bits tell less

All of the above works well only if the bloom filter is actually useful and doesn't lie too often. If it does, it is useless. In our engine we measure bloom filter performance with a simple threshold on the number of bits set. Why does that matter? To understand, we need to dive deeper into the theory and understand the false positive rate of a bloom filter.

Why false positives? As we insert more elements (n), more bits get set to 1. Eventually, random elements will have all their k bits set to 1 by pure chance - even though they were never inserted. That's a false positive.

The occupancy problem: As we insert more elements, more bits get set to 1 and the filter gets saturated. For our single-hash (k=1) approach, that means the false positive rate climbs quickly - up to 10% and above - and that's way too high!

Let's Do Some Math (No Really, Stay With Me)

You could just trust the formula. Or we could derive it in 30 seconds and actually understand why bloom filters break down:

The intuition: Every time we insert an element, we flip some bits from 0 to 1. Eventually, so many bits are set that random elements look like they were inserted - even though they weren't.
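
Carrying that intuition through (assuming independent, uniform hashes) gives the textbook approximation for m bits, n inserted elements, and k hash functions:

P(\text{a given bit is still } 0) = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}

P(\text{false positive}) \approx \left(1 - e^{-kn/m}\right)^{k}

\text{for } k = 1:\quad P(\text{false positive}) \approx 1 - e^{-n/m}

For k = 1 the false positive rate is simply the fraction of bits that are set, which is why a threshold on the number of set bits is a direct read on how often the filter lies: an array that is about 10% full already returns roughly 10% false positives.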

Here is a really nice interactive tool where you can play around with different parameters of bloom filters to see how they scale: Bloom Filter Calculator

Enough theory. Let's look at the code.
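
The single-hash filter is essentially this shape (a sketch inferred from the two-bit variant shown further down, not the verbatim original): the low 16 bits of the hash pick a uint32 word in the 256KB array, and the next 5 bits pick one bit inside that word.

// k = 1: one hash sets and checks a single bit.
uint32_t uint32Idx(HashKey32 h) const { return h & 0xFFFF; }   // 2^16 words * 4 bytes = 256 KB
uint32_t bitLoc(HashKey32 h) const { return (h >> 16) & 31; }  // one 5-bit offset (0..31)

void put(HashKey32 h) {
    __sync_fetch_and_or(mBuf + uint32Idx(h), 1u << bitLoc(h)); // set one bit, atomically
}

bool contains(HashKey32 h) const {
    return (mBuf[uint32Idx(h)] >> bitLoc(h)) & 1u;             // check one bit
}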

That implementation is simple, and it works. But it is way too simple, so let's look for something better.

Goal: find something that's still fast as hell, but lies to us less often.

We started experimenting with ideas to see how they perform:

Naive approach: Set two bits using two independent hash functions - terrible

Alternative 1: store two bits in the same cache line - better, but still bad

Alternative 2: split the uint32 into halves. Use the lower 16 bits for the first bit position, the upper 16 bits for the second bit position - better, we are getting there

Here's the insight: both bits live in the same uint32 variable. We use a single hash value to compute:

Which element in the array (16 bits of the hash)

Position of the first bit within that element (5 bits)

Position of the second bit within that element (5 more bits)

Why this is beneficial:

One atomic operation: set both bits with a single atomic OR

Simple addressing: bit manipulation is cheap (a few cycles), while memory is expensive

There is a minor trade-off: the two bits are not truly independent anymore. This slightly increases collision probability. But the performance gain from one memory access crushes the theoretical disadvantage.

The new code is nearly identical in structure - just one extra bit in the mask. We shift and mask by 5 bits because uint32_t has 32 bit positions (2^5).

// The low IDX_BITS bits of the hash pick the uint32 word: 2^16 words * 4 bytes = 256 KB.
static constexpr uint32_t IDX_BITS  = 16;
static constexpr uint32_t MASK_5BIT = 31;   // uint32_t has 32 = 2^5 bit positions

uint32_t uint32Idx(HashKey32 h) const { return h & ((1u << IDX_BITS) - 1); }
uint32_t bitLoc1(HashKey32 h) const { return (h >> IDX_BITS) & MASK_5BIT; }        // first 5-bit offset (0..31)
uint32_t bitLoc2(HashKey32 h) const { return (h >> (IDX_BITS + 5)) & MASK_5BIT; }  // next 5-bit offset (the second bit)

void put(HashKey32 h) {
    const uint32_t idx  = uint32Idx(h);
    const uint32_t mask = (1u << bitLoc1(h)) | (1u << bitLoc2(h));
    __sync_fetch_and_or(mBuf + idx, mask);   // set both bits with one atomic OR
}

bool contains(HashKey32 h) const {
    const uint32_t data = mBuf[uint32Idx(h)];
    const uint32_t mask = (1u << bitLoc1(h)) | (1u << bitLoc2(h));
    return (data & mask) == mask;            // both bits must be set
}

The performance hit? Negligible in our benchmarking.

For context: even the "slower" version executes faster than a function call or a branch mis-prediction. We're talking about a nanosecond.

On a query scanning a terabyte table, that's avoiding decompression of ~60GB of data.

Let me spell it out: we spend one extra nanosecond per row to avoid reading dozens of extra gigabytes.

I'll take that trade every single time.

Two bits in one uint32 gave us 2x better bloom filter accuracy at essentially zero cost:

Still one atomic OR

One more bit shift for creating the mask (but it's very cheap)

Adaptive filtering at the storage engine layer saves even more, allowing us to completely avoid decompressing rows that will not be needed. But because the first-pass decompression is still costly, optimizing this code path is not as trivial, so we did not touch it at that time. But when working on Floe, we will definitely use some of this gained knowledge for our smarter pushdowns.

"Why Not Just Use 3 Bits? Or 4?"

Fair question. Using the same interactive tool: Bloom Filter Calculator

2 bits: 253k elements - 2.5x more filter capacity without any real cost

3 bits: 306k elements - that's just 20% more capacity. At this point the tradeoff becomes questionable

4+ bits: 320k elements - less than a 5% capacity increase - not even worth it

The beauty of 2 bits: they fit in a single uint32 with minimal collisions.

It is the sweet spot between "too simple" and "too complex".

"But what about Cuckoo filters or XOR filters?"

Great structures! But they require dynamic resizing or more complex addressing. We wanted a fixed memory footprint, lock-free access, and lookups that cost only a few cycles.

Two bits in a fixed bloom filter gave us all three.

We also wrote a version that checks 8 elements at a time using SIMD instructions. But that is a story for another day.

...

Read the original on floedb.ai »
