I've been using Claude Code as my primary development tool for roughly nine months, and the workflow I've settled into is radically different from what most people do with AI coding tools. Most developers type a prompt, sometimes use plan mode, fix the errors, and repeat. The more terminally online are stitching together Ralph loops, MCPs, gas towns (remember those?), and so on. In both cases the result is a mess that completely falls apart for anything non-trivial.
The workflow I’m going to describe has one core principle: never let Claude write code until you’ve reviewed and approved a written plan. This separation of planning and execution is the single most important thing I do. It prevents wasted effort, keeps me in control of architecture decisions, and produces significantly better results with minimal token usage than jumping straight to code.
flowchart LR
R[Research] --> P[Plan]
P --> A[Annotate]
A -->|repeat 1-6x| A
A --> T[Todo List]
T --> I[Implement]
I --> F[Feedback & Iterate]
Every meaningful task starts with a deep-read directive. I ask Claude to thoroughly understand the relevant part of the codebase before doing anything else. And I always require the findings to be written into a persistent markdown file, never just a verbal summary in the chat.
read this folder in depth, understand how it works deeply, what it does and all its specificities. when that’s done, write a detailed report of your learnings and findings in research.md
study the notification system in great details, understand the intricacies of it and write a detailed research.md document with everything there is to know about how notifications work
go through the task scheduling flow, understand it deeply and look for potential bugs. there definitely are bugs in the system as it sometimes runs tasks that should have been cancelled. keep researching the flow until you find all the bugs, don’t stop until all the bugs are found. when you’re done, write a detailed report of your findings in research.md
Notice the language: “deeply”, “in great details”, “intricacies”, “go through everything”. This isn’t fluff. Without these words, Claude will skim. It’ll read a file, see what a function does at the signature level, and move on. You need to signal that surface-level reading is not acceptable.
The written artifact (research.md) is critical. It’s not about making Claude do homework. It’s my review surface. I can read it, verify Claude actually understood the system, and correct misunderstandings before any planning happens. If the research is wrong, the plan will be wrong, and the implementation will be wrong. Garbage in, garbage out.
The most expensive failure mode in AI-assisted coding isn't wrong syntax or bad logic. It's implementations that work in isolation but break the surrounding system. A function that ignores an existing caching layer. A migration that doesn't account for the ORM's conventions. An API endpoint that duplicates logic that already exists elsewhere. The research phase prevents all of this.
Once I’ve reviewed the research, I ask for a detailed implementation plan in a separate markdown file.
I want to build a new feature that extends the system to perform [X]. write a detailed plan.md document outlining how to implement this. include code snippets
the list endpoint should support cursor-based pagination instead of offset. write a detailed plan.md for how to achieve this. read source files before suggesting changes, base the plan on the actual codebase
The generated plan always includes a detailed explanation of the approach, code snippets showing the actual changes, file paths that will be modified, and considerations and trade-offs.
I use my own .md plan files rather than Claude Code’s built-in plan mode. The built-in plan mode sucks. My markdown file gives me full control. I can edit it in my editor, add inline notes, and it persists as a real artifact in the project.
One trick I use constantly: for well-contained features where I’ve seen a good implementation in an open source repo, I’ll share that code as a reference alongside the plan request. If I want to add sortable IDs, I paste the ID generation code from a project that does it well and say “this is how they do sortable IDs, write a plan.md explaining how we can adopt a similar approach.” Claude works dramatically better when it has a concrete reference implementation to work from rather than designing from scratch.
But the plan document itself isn’t the interesting part. The interesting part is what happens next.
This is the most distinctive part of my workflow, and the part where I add the most value.
flowchart TD
W[Claude writes plan.md] --> R[I review in my editor]
R --> N[I add inline notes]
N --> S[Send Claude back to the document]
S --> U[Claude updates plan]
U --> D{Satisfied?}
D -->|No| R
D -->|Yes| T[Request todo list]
After Claude writes the plan, I open it in my editor and add inline notes directly into the document. These notes correct assumptions, reject approaches, add constraints, or provide domain knowledge that Claude doesn’t have.
The notes vary wildly in length. Sometimes a note is two words: “not optional” next to a parameter Claude marked as optional. Other times it’s a paragraph explaining a business constraint or pasting a code snippet showing the data shape I expect.
“use drizzle:generate for migrations, not raw SQL” — domain knowledge Claude doesn’t have
“no — this should be a PATCH, not a PUT” — correcting a wrong assumption
“remove this section entirely, we don’t need caching here” — rejecting a proposed approach
“the queue consumer already handles retries, so this retry logic is redundant. remove it and just let it fail” — explaining why something should change
“this is wrong, the visibility field needs to be on the list itself, not on individual items. when a list is public, all items are public. restructure the schema section accordingly” — redirecting an entire section of the plan
Then I send Claude back to the document:
I added a few notes to the document, address all the notes and update the document accordingly. don’t implement yet
This cycle repeats 1 to 6 times. The explicit “don’t implement yet” guard is essential. Without it, Claude will jump to code the moment it thinks the plan is good enough. It’s not good enough until I say it is.
Why This Works So Well
The markdown file acts as shared mutable state between me and Claude. I can think at my own pace, annotate precisely where something is wrong, and re-engage without losing context. I’m not trying to explain everything in a chat message. I’m pointing at the exact spot in the document where the issue is and writing my correction right there.
This is fundamentally different from trying to steer implementation through chat messages. The plan is a structured, complete specification I can review holistically. A chat conversation is something I’d have to scroll through to reconstruct decisions. The plan wins every time.
Three rounds of “I added notes, update the plan” can transform a generic implementation plan into one that fits perfectly into the existing system. Claude is excellent at understanding code, proposing solutions, and writing implementations. But it doesn’t know my product priorities, my users’ pain points, or the engineering trade-offs I’m willing to make. The annotation cycle is how I inject that judgement.
add a detailed todo list to the plan, with all the phases and individual tasks necessary to complete the plan - don’t implement yet
This creates a checklist that serves as a progress tracker during implementation. Claude marks items as completed as it goes, so I can glance at the plan at any point and see exactly where things stand. Especially valuable in sessions that run for hours.
When the plan is ready, I issue the implementation command. I’ve refined this into a standard prompt I reuse across sessions:
implement it all. when you’re done with a task or phase, mark it as completed in the plan document. do not stop until all tasks and phases are completed. do not add unnecessary comments or jsdocs, do not use any or unknown types. continuously run typecheck to make sure you’re not introducing new issues.
This single prompt encodes everything that matters:
“implement it all”: do everything in the plan, don’t cherry-pick
“mark it as completed in the plan document”: the plan is the source of truth for progress
“do not stop until all tasks and phases are completed”: don’t pause for confirmation mid-flow
“do not add unnecessary comments or jsdocs”: keep the code clean
“do not use any or unknown types”: maintain strict typing
“continuously run typecheck”: catch problems early, not at the end
I use this exact phrasing (with minor variations) in virtually every implementation session. By the time I say “implement it all,” every decision has been made and validated. The implementation becomes mechanical, not creative. This is deliberate. I want implementation to be boring. The creative work happened in the annotation cycles. Once the plan is right, execution should be straightforward.
Without the planning phase, what typically happens is Claude makes a reasonable-but-wrong assumption early on, builds on top of it for 15 minutes, and then I have to unwind a chain of changes. The “don’t implement yet” guard eliminates this entirely.
Once Claude is executing the plan, my role shifts from architect to supervisor. My prompts become dramatically shorter.
flowchart LR
I[Claude implements] --> R[I review / test]
R --> C{Correct?}
C -->|No| F[Terse correction]
F --> I
C -->|Yes| N{More tasks?}
N -->|Yes| I
N -->|No| D[Done]
Where a planning note might be a paragraph, an implementation correction is often a single sentence:
“You built the settings page in the main app when it should be in the admin app, move it.”
Claude has the full context of the plan and the ongoing session, so terse corrections are enough.
Frontend work is the most iterative part. I test in the browser and fire off rapid corrections.
For visual issues, I sometimes attach screenshots. A screenshot of a misaligned table communicates the problem faster than describing it.
“this table should look exactly like the users table, same header, same pagination, same row density.”
This is far more precise than describing a design from scratch. Most features in a mature codebase are variations on existing patterns. A new settings page should look like the existing settings pages. Pointing to the reference communicates all the implicit requirements without spelling them out. Claude would typically read the reference file(s) before making the correction.
When something goes in the wrong direction, I don't try to patch it. I revert and re-scope by discarding the git changes:
“I reverted everything. Now all I want is to make the list view more minimal — nothing else.”
Narrowing scope after a revert almost always produces better results than trying to incrementally fix a bad approach.
Even though I delegate execution to Claude, I never give it total autonomy over what gets built. I do the vast majority of the active steering in the plan.md documents.
This matters because Claude will sometimes propose solutions that are technically correct but wrong for the project. Maybe the approach is over-engineered, or it changes a public API signature that other parts of the system depend on, or it picks a more complex option when a simpler one would do. I have context about the broader system, the product direction, and the engineering culture that Claude doesn’t.
flowchart TD
P[Claude proposes changes] --> E[I evaluate each item]
E --> A[Accept as-is]
E --> M[Modify approach]
E --> S[Skip / remove]
E --> O[Override technical choice]
A & M & S & O --> R[Refined implementation scope]
Cherry-picking from proposals: When Claude identifies multiple issues, I go through them one by one: “for the first one, just use Promise.all, don’t make it overly complicated; for the third one, extract it into a separate function for readability; ignore the fourth and fifth ones, they’re not worth the complexity.” I’m making item-level decisions based on my knowledge of what matters right now.
Trimming scope: When the plan includes nice-to-haves, I actively cut them. “remove the download feature from the plan, I don’t want to implement this now.” This prevents scope creep.
Protecting existing interfaces: I set hard constraints when I know something shouldn’t change: “the signatures of these three functions should not change, the caller should adapt, not the library.”
Overriding technical choices: Sometimes I have a specific preference Claude wouldn’t know about: “use this model instead of that one” or “use this library’s built-in method instead of writing a custom one.” Fast, direct overrides.
Claude handles the mechanical execution, while I make the judgement calls. The plan captures the big decisions upfront, and selective guidance handles the smaller ones that emerge during implementation.
I run research, planning, and implementation in a single long session rather than splitting them across separate sessions. A single session might start with deep-reading a folder, go through three rounds of plan annotation, then run the full implementation, all in one continuous conversation.
I am not seeing the performance degradation everyone talks about past 50% of the context window. If anything, by the time I say "implement it all," Claude has spent the entire session building understanding: reading files during research, refining its mental model during annotation cycles, absorbing my domain knowledge corrections.
When the context window fills up, Claude’s auto-compaction maintains enough context to keep going. And the plan document, the persistent artifact, survives compaction in full fidelity. I can point Claude to it at any point in time.
The Workflow in One Sentence
Read deeply, write a plan, annotate the plan until it’s right, then let Claude execute the whole thing without stopping, checking types along the way.
That’s it. No magic prompts, no elaborate system instructions, no clever hacks. Just a disciplined pipeline that separates thinking from typing. The research prevents Claude from making ignorant changes. The plan prevents it from making wrong changes. The annotation cycle injects my judgement. And the implementation command lets it run without interruption once every decision has been made.
Try my workflow, you’ll wonder how you ever shipped anything with coding agents without an annotated plan document sitting between you and the code.
...
Read the original on boristane.com »
When web-based social networks started flourishing nearly two decades ago, they were genuinely social networks. You would sign up for a popular service, follow people you knew or liked and read updates from them. When you posted something, your followers would receive your updates as well. Notifications were genuine. The little icons in the top bar would light up because someone had sent you a direct message or engaged with something you had posted. There was also, at the beginning of this millennium, a general sense of hope and optimism around technology, computers and the Internet. Social networking platforms were one of the services that were part of what was called Web 2.0, a term used for websites built around user participation and interaction. It felt as though the information superhighway was finally reaching its potential. But sometime between 2012 and 2016, things took a turn for the worse.
First came the infamous infinite scroll. I remember feeling uneasy the first time a web page no longer had a bottom. Logically, I knew very well that everything a browser displays is a virtual construct. There is no physical page. It is just pixels pretending to be one. Still, my brain had learned to treat web pages as objects with a beginning and an end. The sudden disappearance of that end disturbed my sense of ease.
Then came the bogus notifications. What had once been meaningful signals turned into arbitrary prompts. Someone you followed had posted something unremarkable and the platform would surface it as a notification anyway. It didn’t matter whether the notification was relevant to me. The notification system stopped serving me and started serving itself. It felt like a violation of an unspoken agreement between users and services. Despite all that, these platforms still remained social in some diluted sense. Yes, the notifications were manipulative, but they were at least about people I actually knew or had chosen to follow. That, too, would change.
Over time, my timeline contained fewer and fewer posts from friends and more and more content from random strangers. Using these services began to feel like standing in front of a blaring loudspeaker, broadcasting fragments of conversations from all over the world directly in my face. That was when I gave up on these services. There was nothing social about them anymore. They had become attention media. My attention is precious to me. I cannot spend it mindlessly scrolling through videos that have neither relevance nor substance.
But where one avenue disappeared, another emerged. A few years ago, I stumbled upon Mastodon and it reminded me of the early days of Twitter. Back in 2006, I followed a small number of folks of the nerd variety on Twitter and received genuinely interesting updates from them. But when I log into the ruins of those older platforms now, all I see are random videos presented to me for reasons I can neither infer nor care about. Mastodon, by contrast, still feels like social networking in the original sense. I follow a small number of people I genuinely find interesting and I receive their updates and only their updates. What I see is the result of my own choices rather than a system trying to capture and monetise my attention. There are no bogus notifications. The timeline feels calm and predictable. If there are no new updates from people I follow, there is nothing to see. It feels closer to how social networks used to work originally. I hope it stays that way.
...
Read the original on susam.net »
The state of coding agents can be summed up by this fact
Claude spent $20k on an agent swarm implementing (kinda) a C-compiler in Rust, but desktop Claude is an Electron app.
If you’re unfamiliar, Electron is a coding framework for building desktop applications using web tech, specifically HTML, CSS, and JS. What’s great about Electron is it allows you to build one desktop app that supports Windows, Mac, and Linux. Plus it lets developers use existing web app code to get started. It’s great for teams big and small. Many apps you probably use every day are built with Electron: Slack, Discord, VS Code, Teams, Notion, and more.
There are downsides though. Electron apps are bloated; each runs its own Chromium engine. The minimum app size is usually a couple hundred megabytes. They are often laggy or unresponsive. They don’t integrate well with OS features.
But these downsides are dramatically outweighed by the ability to build and maintain one app, shipping it everywhere.
But now we have coding agents! And one thing coding agents are proving to be pretty good at is cross-platform, cross-language implementations given a well-defined spec and test suite.
On the surface, this ability should render Electron’s benefits obsolete! Rather than write one web app and ship it to each platform, we should write one spec and test suite and use coding agents to ship native code to each platform. If this ability is real and adopted, users get snappy, performant, native apps from small, focused teams serving a broad market.
But we're still leaning on Electron. Even Anthropic, one of the leaders in AI coding tools, who keeps publishing flashy agentic coding achievements, still uses Electron in the Claude desktop app. And it's a slow, buggy, bloated app.
So why are we still using Electron and not embracing the agent-powered, spec driven development future?
For one thing, coding agents are really good at the first 90% of dev. But that last bit — nailing down all the edge cases and continuing support once it meets the real world — remains hard, tedious, and requires plenty of agent hand-holding.
Anthropic's Rust-based C compiler slammed into this wall after screaming through the bulk of the tests:
The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.
The resulting compiler is impressive, given the time it took to deliver it and the number of people who worked on it, but it is largely unusable. That last mile is hard.
And this gets even worse once a program meets the real world. Messy, unexpected scenarios stack up and development never really ends. Agents make it easier, sure, but hard product decisions keep getting challenged and still require a human to make the call.
Further, with 3 different apps produced (Mac, Windows, and Linux) the surface area for bugs and support increases 3-fold. Sure, there are local quirks with Electron apps, but most of it is mitigated by the common wrapper. Not so with native!
A good test suite and spec could enable the Claude team to ship a Claude desktop app native to each platform. But the resulting overhead of that last 10% of dev and the increased support and maintenance burden will remain.
For now, Electron still makes sense. Coding agents are amazing. But the last mile of dev and the support surface area remains a real concern.
Over at Hacker News, Claude Code’s Boris Cherney chimes in:
Boris from the Claude Code team here.
Some of the engineers working on the app worked on Electron back in the day, so preferred building non-natively. It’s also a nice way to share code so we’re guaranteed that features across web and desktop have the same look and feel. Finally, Claude is great at it.
That said, engineering is all about tradeoffs and this may change in the future!
There we go: developer familiarity and simpler maintainability across multiple platforms are worth the "tradeoffs". We have incredible coding agents that are great at transpilation, but there remain costs that outweigh the costs of shipping a non-native app.
...
Read the original on www.dbreunig.com »
A startup called Taalas recently released an ASIC chip running Llama 3.1 8B (3/6-bit quant) at an inference rate of 17,000 tokens per second. That's like writing around 30 A4-sized pages in one second. They claim it's 10x cheaper in ownership cost than GPU-based inference systems and uses about 10x less electricity. And yes, it's about 10x faster than state-of-the-art inference.
I tried to read through their blog and they've literally "hardwired" the model's weights on the chip. Initially, this didn't sound intuitive to me. Coming from a software background, with a hobbyist understanding of LLMs, I couldn't wrap my head around how you just "print" an LLM onto a chip. So, I decided to dig into multiple blog posts, LocalLLaMA discussions, and hardware concepts. It was much more interesting than I had thought. Hence this blog post.
Taalas is a 2.5-year-old company, and this is their first chip. Taalas's chip is a fixed-function ASIC (Application-Specific Integrated Circuit). Kinda like a CD-ROM/game cartridge, or a printed book, it only holds one model and cannot be rewritten.
LLMs consist of sequential layers. For example, Llama 3.1 8B has 32 layers. The task of each layer is to further refine the input. Each layer is essentially a set of large weight matrices (the model's 'knowledge').
When a user inputs a prompt, it is converted into a vector of numbers, aka embeddings.
On a normal GPU, the input vector enters the compute cores. The GPU then fetches the Layer 1 weights from VRAM/HBM (the GPU's RAM), does matrix multiplication, and stores the intermediate results (aka activations) back in VRAM. Then it fetches the Layer 2 weights and the previous result, does the math, and saves the result to VRAM again. This cycle continues through the 32nd layer, just to generate a single token. Then, to generate the next token, the GPU repeats this entire 32-layer journey.
So, due to this constant back-and-forth, the memory bus induces latency and consumes significant amounts of energy. This is the memory bandwidth bottleneck, sometimes loosely called the Von Neumann bottleneck or the "memory wall."
Taalas sidesteps this wall entirely. They just engraved the 32 layers of Llama 3.1 sequentially on a chip. Essentially, the model’s weights are physical transistors etched into the silicon.
Importantly, they also claim to have invented a hardware scheme where they can store 4-bit data and perform the multiplication related to it using a single transistor. I will refer to it as their 'magic multiplier'.
Now, when the user’s input arrives, it gets converted into a vector, and flows into physical transistors making up Layer1. It does multiplication via their ‘magic multiplier’ and instead of result being saved in a VRAM, the electrical signal simply flows down physical wires into the Layer 2 transistors (via pipeline registers from what I understand). The data streams continuously through the silicon until the final output token is generated.
They don't use external DRAM/HBM, but they do use a small amount of on-chip SRAM. Why SRAM? Due to cost and complexity, manufacturers don't mix DRAM and logic gates. That's why GPUs have separate VRAM. (Also, SRAM isn't facing a supply chain crisis; DRAM is.)
Taalas uses this on-chip SRAM for the KV Cache (the temporary memory/context window of an ongoing conversation) and to hold LoRA adapters for fine tuning.
Technically yes, and I read lots of comments saying that. But Taalas designed a base chip with a massive, generic grid of logic gates and transistors. To map a specific model onto the chip, they only need to customize the top two layers/masks. It's still slow, but much faster than building a chip from the ground up.
It took them two months to develop the chip for Llama 3.1 8B. In the AI world, where one week is a year, that's super slow. But in the world of custom chips, this is supposed to be insanely fast.
As someone stuck running local models on a laptop without a massive GPU, I am keeping my fingers crossed for this type of hardware to be mass-produced soon.
...
Read the original on www.anuragk.com »
High-efficiency C++/CUDA LLM inference engine. Runs Llama 70B on a single RTX 3090 (24GB VRAM) by streaming model layers through GPU memory via PCIe, with optional NVMe direct I/O that bypasses the CPU entirely.
3-tier adaptive caching auto-sizes from hardware: VRAM-resident layers (zero I/O) + pinned RAM (H2D only) + NVMe/mmap fallback. Achieves 83x speedup over mmap baseline for 70B on consumer hardware (RTX 3090 + 48 GB RAM).
Bottleneck is PCIe H2D bandwidth at Gen3 x8 (~6.5 GB/s). Q4_K_M fits 10 more layers in VRAM (36 vs 26), reducing tier B transfers. Layer skip (cosine similarity calibration) eliminates 20/80 layers per token with minimal quality loss.
* Zero external dependencies beyond CUDA Toolkit (no PyTorch, no cuBLAS)
# Build
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=gcc-14 \
-DCMAKE_CXX_COMPILER=g++-14 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.1/bin/nvcc
cmake --build . -j
# Run (resident mode — model fits in VRAM)
./ntransformer -m /path/to/llama-8b-q8_0.gguf -p "Hello" -n 128
# Run (streaming mode — model larger than VRAM)
./ntransformer -m /path/to/llama-70b-q6_k.gguf -p "Hello" -n 32 --streaming
# Run with layer skip (fastest for 70B)
./ntransformer -m /path/to/llama-70b-q4_k_m.gguf -p "Hello" -n 32 --streaming --skip-threshold 0.98
# Self-speculative decoding (VRAM layers as draft, no extra model)
./ntransformer -m /path/to/llama-70b-q6_k.gguf -p "Hello" -n 32 --self-spec --draft-k 3
# Chat mode
./ntransformer -m /path/to/model.gguf --chat
# Benchmark
./ntransformer -m /path/to/model.gguf --benchmark -n 64
Running ntransformer with NVMe direct I/O requires system-level modifications. An automated setup script handles all of them:
# Full first-time setup (interactive, creates backups)
sudo ./scripts/setup_system.sh
# Check current system state (no changes)
sudo ./scripts/setup_system.sh --check
# NVMe-only (run after every reboot)
sudo ./scripts/setup_system.sh --nvme-only
* Above 4G Decoding: ON (required for 64-bit BAR mapping)
* IOMMU: OFF (or leave on — the script adds the kernel parameter)
WARNING: This project performs low-level PCIe operations (GPU MMIO writes to NVMe controller registers, userspace NVMe command submission, VFIO device passthrough). While tested extensively on RTX 3090 + WD SN740, incorrect configuration or hardware incompatibilities could theoretically cause:
* Data loss on the NVMe device used for raw block storage
Never use your boot drive for NVMe direct I/O. Always use a dedicated secondary NVMe. The authors are not responsible for hardware damage or data loss. Use at your own risk.
For models that don’t fit in VRAM, the NVMe backend eliminates the CPU from the data path:
# Build with NVMe support (requires gpu-nvme-direct library)
cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_GPUNVME=ON \
-DCMAKE_C_COMPILER=gcc-14 -DCMAKE_CXX_COMPILER=g++-14 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.1/bin/nvcc
cmake --build . -j
# Write GGUF model to NVMe raw device
sudo ./scripts/restore_nvme.sh # ensure kernel driver is bound
sudo dd if=model.gguf of=/dev/nvme0n1 bs=1M oflag=direct status=progress
# Bind NVMe to VFIO for userspace access
sudo ./scripts/setup_nvme.sh # loads VFIO, forces D0, enables BusMaster
# Run with NVMe backend
sudo GPUNVME_PCI_BDF=0000:01:00.0 GPUNVME_GGUF_LBA=0 \
./build/ntransformer -m /path/to/model.gguf -p "Hello" -n 32 --streaming
# Restore NVMe to kernel driver when done
sudo ./scripts/restore_nvme.sh
The GGUF model file is written to raw NVMe blocks via dd
During inference, each layer (~670 MB for 70B Q6_K) is read via 670 NVMe commands in ~202 ms
Data lands in CUDA pinned staging memory, then async DMA to GPU compute buffers
...
Read the original on github.com »
“I don’t want to use the word ‘frustrated,’ because he understands he has plenty of alternatives, but he’s curious as to why they haven’t… I don’t want to use the word ‘capitulated,’ but why they haven’t capitulated,” he said.
...
Read the original on www.bbc.com »
Ukiyo-e Search provides an incredible resource: The ability to both search for Japanese woodblock prints by simply taking a picture of an existing print AND the ability to see similar prints across multiple collections of prints. Below is an example print, click to see it in action.
Upload a picture of a print to find similar prints across multiple collections.
Better data, hundreds of thousands of additional images, and better search capabilities are forthcoming. Sign up to be notified when additional features are ready.
...
Read the original on ukiyo-e.org »
Transactions are fundamental to how SQL databases work. Trillions of transactions execute every single day, across the thousands of applications that rely on SQL databases.
A transaction is a sequence of actions that we want to perform on a database as a single, atomic operation. An individual transaction can include a combination of reading, creating, updating, and removing data.
In MySQL and Postgres, we begin a new transaction with begin; and end it with commit;. Between these two commands, any number of SQL queries that search and manipulate data can be executed.
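The animated example in the original article doesn't carry over to text, so here is a minimal static sketch of the same shape, using a hypothetical users table that mirrors the article's running example:

begin;
insert into users (name) values ('ben');
update users set name = 'joe' where id = 1;
select * from users;
commit;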
The example above shows a transaction begin, three query executions, then the commit. You can hit the ↻ button to replay the sequence at any time. The act of committing is what atomically applies all of the changes made by those SQL statements.
There are some situations where transactions do not commit. This is sometimes due to unexpected events in the physical world, like a hard drive failure or power outage. Databases like MySQL and Postgres are designed to correctly handle many of these unexpected scenarios, using disaster recovery techniques. Postgres, for example, handles this via its write-ahead log mechanism (WAL).
There are also times when we want to intentionally undo a partially-executed transaction. This happens when midway through a transaction, we encounter missing / unexpected data or get a cancellation request from a client. For this, databases support the rollback; command.
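A similar sketch with a rollback instead of a commit (again on the hypothetical users table; ids and values are illustrative):

begin;
update users set name = 'joe' where id = 1;
update users set name = 'amy' where id = 3;
delete from users where id = 2;
rollback; -- all three changes are discarded, as if the transaction never began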
In the example above, the transaction made several modifications to the database, but those changes were isolated from all other ongoing queries and transactions. Before the transaction committed, we decided to rollback, undoing all changes and leaving the database unaltered by this transaction.
A key reason transactions are useful is to allow execution of many queries simultaneously without them interfering with each other. Below you can see a scenario with two distinct sessions connected to the same database. Session A starts a transaction, selects data, updates it, selects again, and then commits. Session B selects that same data twice during a transaction and again after both of the transactions have completed.
Session B does not see the name update from ben to joe until after Session A commits the transaction.
Consider the same sequence of events, except instead of committing the transaction in Session A, we roll back.
The second session never sees the effect of any changes made by the first, due to the rollback. This is a nice segue into another important concept in transactions: Consistent reads.
During a transaction’s execution, we would like it to have a consistent view of the database. This means that even if another transaction simultaneously adds, removes, or updates information, our transaction should get its own isolated view of the data, unaffected by these external changes, until the transaction commits.
MySQL and Postgres both support this capability when operating in REPEATABLE READ mode (plus all stricter modes, too). However, they each take different approaches to achieving this same goal.
Postgres handles this with multi-versioning of rows. Every time a row is inserted or updated, it creates a new row along with metadata to keep track of which transactions can access the new version. MySQL handles this with an undo log. Changes to rows immediately overwrite old versions, but a record of modifications is maintained in a log file, in case they need to be reconstructed.
Let’s take a close look at each.
Below, you’ll see a simple user table on the left and a sequence of statements in Session A on the right. Click the “play sessions” button and watch what happens as the statements get executed.
* An update is made to the user with ID 4, changing the name from “liz” to “aly”. This causes a new version of the row to be created, while the other is maintained.
* The old version of the row had its xmax set to 10 (xmax = the ID of the transaction that replaced it)
* The new version of the row also had its xmin set to 10 (xmin = the ID of the transaction that created it)
* The transaction commits, making the update visible to the broader database
But now we have two versions of the row with ID = 4. Ummm… that’s odd! The key here is xmin and xmax.
xmin stores the ID of the transaction that created a row version, and xmax is the ID of the transaction that caused a replacement row to be created. Postgres uses these to determine which row version each transaction sees.
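You can inspect these system columns yourself in Postgres; xmin and xmax exist on every table but stay hidden unless selected explicitly. The values below follow the article's example and are illustrative:

select xmin, xmax, id, name from users where id = 4;
-- xmin | xmax | id | name
--   10 |    0 |  4 | aly
-- A plain select only returns the version visible to your snapshot; the old
-- 'liz' version (with xmax = 10) still exists on disk until it is vacuumed away.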
Let’s look at Session A again, but this time with an additional Session B running simultaneously. Press “play sessions” again.
Before the commit, Session B could not see Session A’s modification. It sees the name as “liz” while Session A sees “aly” within the transaction. At this stage, it has nothing to do with xmin and xmax, but rather because other transactions cannot see uncommitted data. After Session A commits, Session B can now see the new name of “aly” because the data is committed and the transaction ID is greater than 10.
If the transaction instead gets a rollback, those row changes do not get applied, leaving the database in a state as if the transaction never began in the first place.
This is a simple scenario. Only one of the transactions modifies data. Session B only does select statements! When both simultaneously modify data, each one will be able to “see” the modifications it made, but these changes won’t bleed out into other transactions until commit. Here’s an example where each transaction selects data, updates data, selects again, commits, and finally both do a final select.
The concurrent transactions cannot see each other’s changes until the data is committed. The same mechanisms are used to control data visibility when there are hundreds of simultaneous transactions on busy Postgres databases.
Before we move on to MySQL, one more important note. What happens to all those duplicated rows? Over time, we can end up with thousands of duplicate rows that are no longer needed. There are several things Postgres does to mitigate this issue, but I’ll focus on the VACUUM FULL command. When run, this purges versions of rows that are so old that we know no transactions will need them going forward. It compacts the table in the process. Try it out below.
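In SQL it's a single statement (table name hypothetical):

vacuum full users;
-- or, for every table in the database:
vacuum full;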
Notice that when the vacuum full command executes, all unused rows are eliminated, and the gaps in the table are compressed, reclaiming the unused space.
MySQL achieves the consistent read behavior using a different approach. Instead of keeping many copies of each row, MySQL immediately overwrites old row data with new row data when modified. This means it requires less maintenance over time for the rows (in other words, we don’t need to do vacuuming like Postgres).
However, MySQL still needs the ability to show different versions of a row to different transactions. For this, MySQL uses an undo log — a log of recently-made row modifications, allowing a transaction to reconstruct past versions on-the-fly.
Notice how each MySQL row has two metadata columns (in blue). These keep track of the ID of the transaction that updated the row most recently (xid), and a reference to the most recent modification in the undo log (ptr).
When there are simultaneous transactions, transaction A may clobber the version of a row that transaction B needs to see. Transaction B can see the previous version(s) of the row by checking the undo log, which stores old values so long as any running transaction may need to see it.
There can even be several undo log records in the log for the same row simultaneously. In such a case, MySQL will choose the correct version based on transaction identifiers.
The idea of Repeatable reads is important for databases, but this is just one of several isolation levels databases like MySQL and Postgres support. This setting determines how “protected” each transaction is from seeing data that other simultaneous transactions are modifying. Adjusting this setting gives the user control of the tradeoff between isolation and performance.
Both MySQL and Postgres have four isolation levels. From strongest to weakest, these are: Serializable, Repeatable Read, Read Committed, and Read Uncommitted.
Stronger levels of isolation provide more protections from data inconsistency issues across transactions, but come at the cost of worse performance in some scenarios.
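Both databases let you choose the level per transaction or per session; the statements below are standard Postgres and MySQL syntax:

-- Postgres: set the level for the current transaction (must come before any query)
begin;
set transaction isolation level repeatable read;
-- ... queries ...
commit;

-- MySQL: set the level for all subsequent transactions in this session
set session transaction isolation level serializable;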
Serializable is the strongest. In this mode, all transactions behave as if they were run in a well-defined sequential order, even if in reality many ran simultaneously. This is accomplished via complex locking and waiting.
The other three gradually loosen the strictness, and can be described by the undesirable phenomena they allow or prohibit.
A phantom read is one where a transaction runs the same SELECT multiple times, but sees different results the second time around. This is typically due to data that was inserted and committed by a different transaction. The timeline below visualizes such a scenario. The horizontal axis represents time passing on a database with two clients. Hit the ↻ button to replay the sequence at any time.
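The interactive timeline doesn't translate to text, but roughly the same scenario can be sketched as two SQL sessions (table and data hypothetical):

-- Session A (read committed)
begin;
select count(*) from users where age > 30; -- returns 3
-- Session B: insert into users (name, age) values ('sam', 42); commit;
select count(*) from users where age > 30; -- returns 4: a phantom read
commit;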
After serializable, the next least strict isolation level is called repeatable read. Under the SQL standard, the repeatable read level allows phantom reads, though in Postgres they still aren’t possible.
These are non-repeatable reads: a transaction reads a row, then later re-reads the same row and finds changes made by another, already-committed transaction. This is dangerous because we may have already made assumptions about the state of our database, but that data has changed under our feet.
The read committed isolation level, the next after repeatable read, allows these and phantom reads to occur. The tradeoff is slightly better database transaction performance.
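A minimal sketch of a non-repeatable read under read committed (values illustrative):

-- Session A (read committed)
begin;
select name from users where id = 4; -- 'liz'
-- Session B: update users set name = 'aly' where id = 4; commit;
select name from users where id = 4; -- now 'aly': a non-repeatable read
commit;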
The last and arguably worst is dirty reads. A dirty read is one where a transaction is able to see data written by another transaction running simultaneously that is not yet committed. This is really bad! In most cases, we never want to see data that is uncommitted from other transactions.
The loosest isolation level, read uncommitted, allows for dirty reads and the other two described above. It is the most dangerous and also most performant mode.
The keen-eyed observer will notice that I have ignored a particular scenario, quite on purpose, up to this moment. What if two transactions need to modify the same row at the same time?
Precisely how this is handled depends on both (A) the database system and (B) the isolation level. To keep the discussion simple, we’ll focus on how this works for the strictest (SERIALIZABLE) level in Postgres and MySQL. Yet again, the world’s two most popular relational databases take very different approaches here.
A lock is a software mechanism for giving ownership of a piece of data to one transaction (or a set of transactions). Transactions obtain a lock on a row when they need to “own” it without interruption. When the transaction is finished using the rows, it releases the lock to allow other transactions access.
Though there are many types of locks in practice, the two main ones you need to know about here are shared locks and exclusive locks.
A shared (S) lock can be obtained by multiple transactions on the same row simultaneously. Typically, transactions will obtain shared locks on a row when reading it, because multiple transactions can do so simultaneously safely.
An exclusive (X) lock can only be owned by one transaction for any given row at any given time. When a transaction requests an X lock, no other transactions can have any type of lock on the row. These are used when a transaction needs to write to a row, because we don’t want two transactions simultaneously messing with column values!
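In SQL terms (Postgres syntax, also valid in MySQL 8.0+; data hypothetical), a shared lock is what select ... for share takes, and an exclusive lock is what update or select ... for update takes:

-- Session A
begin;
select * from users where id = 4 for update; -- takes an exclusive (X) lock on the row

-- Session B (running concurrently)
begin;
select * from users where id = 4 for share;  -- wants a shared (S) lock: blocks until A commits or rolls back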
In SERIALIZABLE mode, all transactions must always obtain X locks when updating a row. Most of the time, this works fine other than the performance overhead of locking. In scenarios where two transactions are both trying to update the same row simultaneously, this can lead to deadlock!
MySQL can detect deadlock and will kill one of the involved transactions to allow the other to make progress.
Postgres handles write conflicts in SERIALIZABLE mode with less locking, and avoids the deadlock issue completely.
As transactions read and write rows, Postgres creates predicate locks, which are “locks” on sets of rows specified by a predicate. For example, if a transaction updates all rows with IDs 10–20, it will take a lock on the predicate WHERE id BETWEEN 10 AND 20. These locks are not used to block access to rows, but rather to track which rows are being used by which transactions, and then detect data conflicts on-the-fly.
Combined with multi-row versioning, this lets Postgres use optimistic conflict resolution. It never blocks transactions while waiting to acquire a lock, but it will kill a transaction if it detects that it’s violating the SERIALIZABLE guarantees.
Let’s look at a similar timeline from the MySQL example, but this time watching Postgres’ optimistic technique.
The difference is subtle visually, but implemented in quite different ways. Both Postgres and MySQL leverage the killing of one transaction in favor of maintaining SERIALIZABLE guarantees. Applications must account for this outcome, and have retry logic for important transactions.
Transactions are just one tiny corner of all the amazing engineering that goes into databases, and we only scratched the surface! But a fundamental understanding of what they are, how they work, and the guarantees of the four isolation levels is helpful for working with databases more effectively.
What esoteric corner of database management systems would you like to see us cover next? Join our Discord community and let us know.
...
Read the original on planetscale.com »
A bloom filter is a probabilistic data structure that can potentially make SQL queries execute orders of magnitude faster. Today I want to tell you how we use them in Floe, and how we make them produce 2x fewer false results.
Feel free to skip this section if you know the answer.
A bloom filter is a probabilistic data structure that answers one question: “Is this element definitely not in the set?” It can give false positives (says yes when the answer is no), but never false negatives (it won’t miss elements that are actually there). The main benefit is that a well-designed bloom filter can be really fast - a few CPU cycles per lookup. That’s faster than a single function call.
The structure: An array of m bits, all initially set to 0.
Insertion: To add an element, we:
Hash the element using k different hash functions
Each hash gives us a position in the bit array
Set all k bits at those positions to 1
Lookup: To check if an element exists:
Hash it with the same k hash functions
Check if ALL k bits are set to 1
If any bit is 0 → definitely not in the set
If all bits are 1 → probably in the set
Here we’re discussing bloom filters in the context of database engineering. If you’re not familiar with how databases join tables - here’s a quick primer: a hash join matches rows from two tables. First, it loads the smaller table into a hash table (that’s the build side). Then it scans the larger table row by row, looking up each value in the hash table to find matches (that’s the probe side). Most of the work happens on the probe side, because the larger table can have billions of rows.
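For example, in a query like the one below (hypothetical tables), the engine typically builds a hash table on the small users table and probes it once per row of the much larger events table:

select e.id, u.name
from events e
join users u on u.id = e.user_id; -- users: build side, events: probe side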
When processing millions of rows we want to avoid all the extra work that we can. Don’t decompress the data you won’t use. Don’t probe hash tables for keys that don’t exist. Discard rows as soon as you can - it’s called being efficient (not lazy!)
We use bloom filters at 2 critical places:
Hash joins: before probing the hash table during the probe phase
Storage engine scans: the filter is pushed down to the storage engine as a first-pass check, so rows that cannot match are skipped before full decompression
Let's imagine a situation where we want to join two tables and only 1% of 10 billion probe-side rows will be matched. Without filtering, we would have to decompress and probe all of those rows, only to discard 99% of them.
What we do instead:
Build phase: we populate the bloom filter with hashes of the build side.
Pushdown: after build phase is complete we push down the bloom filter, which at this point is read-only, to the storage engine.
First-pass filtering: The storage engine decompresses only the columns needed for bloom filtering. It checks each value against the bloom filter, and marks values that definitely do not match the build side.
Adaptive behaviour: Here it gets interesting. We keep the statistics of how many rows we skipped. If we end up discarding almost no rows we don’t bother with first-pass filtering and disable it. But we keep checking decompressed rows to re-enable filtering if stats improve.
That’s a huge 9x reduction in scan and I/O!
Why do we need to keep the filtering adaptive? Because sometimes bloom doesn’t help:
Join selectivity is high (most rows match anyway)
Data is skewed (many duplicates saturate the filter)
For hash joins we use a simpler, almost textbook-style bloom filter: insert values into the bloom filter at build phase, read it at probe phase before probing the hash buckets.
We landed on using a fixed 256KB bloom filter per join as a sweet spot between size and efficiency. Go bigger - waste the memory and overflow L2/L3 cache (cache misses hurt). Go smaller - might as well flip a coin.
Why fixed size? Predictable performance. No dynamic allocation. Compiler can optimize the hell out of it. Lock-free access. The last one is especially critical when we’re talking about a concurrent performance-first database engine.
The Problem: When too many bits tell less
All of the above works well only if the bloom filter is actually useful and doesn't lie too often. If it does - it is useless. In our engine we measure bloom filter performance with a simple threshold on the number of bits set. Why does that matter? To understand, we need to dive deeper into the theory and look at the false positive rate of a bloom filter.
Why false positives? As we insert more elements (n), more bits get set to 1. Eventually, random elements will have all their k bits set to 1 by pure chance - even though they were never inserted. That’s a false positive.
The occupancy problem: As we insert more elements, more bits get set to 1 and the filter gets saturated. For our single-hash (k=1) approach, that means the false positive rate climbs quickly - up to 10% and above - that’s way too high!
Let’s Do Some Math (No Really, Stay With Me)
You could just trust the formula. Or we could derive it in 30 seconds and actually understand why bloom filters break down:
The intuition: Every time we insert an element, we flip some bits from 0 to 1. Eventually, so many bits are set that random elements look like they were inserted - even though they weren’t.
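The standard derivation, for a filter of m bits, n inserted elements, and k hash functions:

$$P(\text{a given bit is still } 0) = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$$

$$P(\text{false positive}) = \left(1 - e^{-kn/m}\right)^{k}$$

For the single-hash case (k = 1) the false positive rate is simply the fraction of bits set, roughly n/m at light loads. That is why the filter degrades so fast as it fills up: at about 10% occupancy it already returns roughly 10% false positives.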
Here is a really nice interactive tool where you can play around with different parameters of bloom filters to see how they scale: Bloom Filter Calculator
Enough theory. Let’s look at the code
That single-hash, single-bit implementation is simple, and it works. But it is way too simple; let's look for something better:
Goal: find something that’s still fast as hell, but lies to us less often.
We started experimenting with ideas to see how they perform
Naive approach: Set two bits using two independent hash functions - terrible
Alternative 1: store two bits in the same cache line - better, but still bad
Alternative 2: split uint32 into halves. Use lower 16 bits for first bit position, upper 16 bits for 2nd bit position - better, we are getting there
Here’s the insight: both bits live in the same uint32 variable. We use a single hash value to compute:
Which element in the array (16 bits of the hash)
Position of first bit within that element (5 bits)
Position of second bit within that element (5 more bits)
Why this is beneficial:
One atomic operation: set both bits with single atomic OR
Simple addressing: bit manipulation is cheap (few cycles), while memory is expensive
There is a minor trade-off: two bits are not truly independent anymore. This slightly increases collision probability. But the performance gain from one memory access crushes the theoretical disadvantage.
The new code is nearly identical in structure - just one extra bit in the mask. We shift and mask by 5 bits because uint32_t has 32 bit positions (2^5).
// Assumed layout for the fixed 256 KB filter (65,536 uint32 words):
// the low 16 hash bits pick the word (IDX_BITS = 16), MASK_5BIT = 0x1F.
T bitLoc1(T& h) { return (h >> IDX_BITS) & MASK_5BIT; } // first 5-bit offset (0..31)
T bitLoc2(T& h) { return (h >> (IDX_BITS + 5)) & MASK_5BIT; } // next 5-bit offset (another bit)

void put(HashKey32 h) {
    const uint32_t idx = uint32Idx(h); // which uint32 word of mBuf
    const uint32_t mask = (1u << bitLoc1(h)) | (1u << bitLoc2(h));
    __sync_fetch_and_or(mBuf + idx, mask); // one atomic OR sets both bits
}

bool contains(HashKey32 h) const {
    const uint32_t data = mBuf[uint32Idx(h)];
    const uint32_t mask = (1u << bitLoc1(h)) | (1u << bitLoc2(h));
    return (data & mask) == mask; // both bits must be set
}
The performance hit? Negligible. In our benchmarking, even the "slower" two-bit version executes faster than a function call or branch mis-prediction. We're talking about a nanosecond.
On a query scanning a terabyte table, that’s avoiding decompression of ~60GB of data
Let me spell it out: we spend one extra nanosecond per row to avoid reading dozens of extra gigabytes.
I’ll take that trade every single time.
Two bits in one uint32 gave us 2x better bloom filter accuracy at essentially zero cost:
Still one atomic OR
One more bit shift for creating the mask (but it’s very cheap)
Adaptive filtering at the storage engine layer saves even more, allowing us to completely avoid decompressing rows that will not be needed. Because the first-pass decompression is still costly, optimizing this code path is not as trivial, so we did not touch it at the time. But when working on Floe, we will definitely use some of this gained knowledge for smarter pushdowns.
“Why Not Just Use 3 Bits? Or 4?”
Fair question. Using the same interactive tool: Bloom Filter Calculator
2 bits: 253k elements - 2.5x more filter capacity without any real cost
3 bits: 306k elements - that’s just 20% more capacity. At this point tradeoff becomes questionable
4+ bits: 320k elements - less than 5% capacity increase - not even worth it
The beauty of 2 bits: they fit in a single uint32 with minimal collisions.
It is the sweet spot between “too simple” and “too complex”
“But what about Cuckoo filters or XOR filters?”
Great structures! But they require dynamic resizing or more complex addressing. We wanted:
Fixed memory with no dynamic allocation
Simple, cheap addressing
Lock-free concurrent access
Two bits in a fixed bloom filter gave us all three.
We also wrote a version that checks 8 elements at a time using SIMD instructions. But that is a story for another day.
...
Read the original on floedb.ai »