10 interesting stories served every morning and every evening.
Over the last couple of years, we’ve seen significant growth of the e18e community and a rise in performance-focused contributions because of it. A large part of this is the “cleanup” initiative, where the community has been pruning packages which are redundant, outdated, or unmaintained.
One of the most common topics that comes up as part of this is “dependency bloat” - the idea that npm dependency trees are getting larger over time, often carrying code that has long since been made redundant by features the platform now provides natively.
In this post, I want to briefly look at what I think are the three main types of bloat in our dependency trees, why they exist, and how we can start to address them.
The graph above is a common sight in many npm dependency trees - a small utility function for something which seems like it should be natively available, followed by many similarly small deep dependencies.
So why is this a thing? Why do we need is-string instead of typeof checks? Why do we need hasown instead of Object.hasOwn (or Object.prototype.hasOwnProperty)? Three things:
Support for very old engines
Somewhere in the world, some people apparently exist who need to support ES3 - think IE6/7, or extremely early versions of Node.js.
For these people, much of what we take for granted today does not exist. For example, they don’t have Array.isArray, Object.keys, JSON.parse, or Function.prototype.bind. These are all ES5 features, meaning they simply don’t exist in ES3 engines.
For these unfortunate souls who are still running old engines, they need to reimplement everything themselves, or be provided with polyfills.
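To make this concrete, here is a minimal sketch (not any particular package’s source) of the kind of shim such compatibility packages carry, using the ES5 Array.isArray as the example:

```javascript
// ES5's Array.isArray, with the classic ES3-safe fallback based on
// the internal [[Class]] tag reported by Object.prototype.toString
var isArray = typeof Array.isArray === 'function'
  ? Array.isArray
  : function (val) {
      return Object.prototype.toString.call(val) === '[object Array]';
    };

console.log(isArray([1, 2, 3])); // true
console.log(isArray('nope'));    // false
```

On any modern engine, the fallback branch is dead weight: downloaded, parsed, and never taken.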
Alternatively, what’d be really nice is if they upgraded.
The second reason for some of these packages is “safety”.
Basically, inside Node itself, there is a concept of “primordials”. These are essentially references to the built-in globals, captured at startup and imported internally from then on, so that Node itself can’t be broken by someone mutating the global namespace.
For example, if Node itself uses Map and we re-define what Map is - we can break Node. To avoid this, Node keeps a reference to the original Map which it imports rather than accessing the global.
You can read more about this here in the Node repo.
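A toy illustration of the principle (this is not Node’s actual primordials machinery, just the idea of capturing references before user code runs):

```javascript
// Capture a reference to the real Map before any user code runs
const PrimordialMap = Map;

// Later, some script clobbers the global
globalThis.Map = function () { throw new Error('Map is broken'); };

// Code holding the captured reference keeps working
const m = new PrimordialMap([['a', 1]]);
console.log(m.get('a')); // 1

// Code reaching for the global does not
try {
  new Map();
} catch (err) {
  console.log(err.message); // Map is broken
}
```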
This makes a lot of sense for an engine, since it really shouldn’t fall over if a script messes up the global namespace.
Some maintainers also believe this is the correct way to build packages, too. This is why we have dependencies like math-intrinsics in the graph above, which basically re-exports the various Math.* functions to avoid mutation.
Lastly, we have cross-realm values. These are values passed from one realm to another - for example, from a web page to a child frame, or vice versa.
In this situation, a new RegExp(pattern) created in an iframe is not the same RegExp class as the one in the parent page. This means window.RegExp !== iframeWindow.RegExp, which of course means val instanceof RegExp would be false if val came from the iframe (another realm).
For example, I am a maintainer of chai, and we have this exact issue. We need to support assertions happening across realms (since a test runner may run tests in a VM or iframe), so we can’t rely on instanceof checks. For that reason, we use Object.prototype.toString.call(val) === '[object RegExp]' to check if something is a regex, which works across realms since it doesn’t rely on the constructor.
In the graph above, is-string is basically doing this same job in case we passed a new String(val) from one realm to another.
All of this makes sense for a very small group of people. If you’re supporting very old engines, passing values across realms, or want protection from someone mutating the environment - these packages are exactly what you need.
The problem is that the vast majority of us don’t need any of this. We’re running a version of Node from the last 10 years, or using an evergreen browser. We don’t need to support pre-ES5 environments, we don’t pass values across frames, and we uninstall packages which break the environment.
These layers of niche compatibility somehow made their way into the “hot path” of everyday packages. The tiny group of people who actually need this stuff should be the ones seeking out special packages for it. Instead, it is reversed and we all pay the cost.
Some folks believe that packages should be broken up to an almost atomic level, creating a collection of small building blocks which can later be re-used to build other higher level things.
This kind of architecture means we end up with graphs like this:
As you can see, the most granular snippets of code have their own packages. For example, shebang-regex is the following at the time of writing this post:
By splitting code up to this atomic level, the theory is that we can then create higher level packages simply by joining the dots.
Some examples of these atomic packages to give you an idea of the granularity:
* arrify - Converts a value to an array (Array.isArray(val) ? val : [val])
* cli-boxes - A JSON file containing the edges of a box
* path-key - Get the PATH environment variable key for the current platform (PATH on Unix, Path on Windows)
* onetime - Ensure a function is only called once
* is-wsl - Check if process.platform is linux and os.release() contains microsoft
If we wanted to build a new CLI for example, we could pull a few of these in and not worry about implementation. We don’t need to do env['PATH'] || env['Path'] ourselves, we can just pull a package for that.
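To give a sense of how small these are, here are sketches of inlined equivalents of arrify and path-key (approximations of their behaviour, not the packages’ exact source):

```javascript
// arrify: wrap a non-array value in an array
const arrify = (val) => (Array.isArray(val) ? val : [val]);

// path-key (sketch): on Windows the variable's case can vary (Path, PATH, ...),
// so find whichever key matches case-insensitively; elsewhere it is just PATH
const pathKey = (env = process.env, platform = process.platform) => {
  if (platform !== 'win32') return 'PATH';
  return Object.keys(env).find((k) => k.toUpperCase() === 'PATH') || 'Path';
};

console.log(arrify('x'));   // [ 'x' ]
console.log(arrify(['x'])); // [ 'x' ]
```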
In reality, most or all of these packages did not end up as the reusable building blocks they were meant to be. They’re either largely duplicated across various versions in a wider tree, or they’re single-use packages which only one other package uses.
Let’s take a look at some of the most granular packages:
* shebang-regex is used almost solely by shebang-command by the same maintainer
* cli-boxes is used almost solely by boxen and ink by the same maintainer
* onetime is used almost solely by restore-cursor by the same maintainer
Each of these having only one consumer means they’re the equivalent of inlined code, but they cost us more to acquire (npm requests, tar extraction, bandwidth, etc.).
Taking a look at nuxt’s dependency tree, we can see a few of these building blocks duplicated:
Inlining them doesn’t mean we no longer duplicate the code, but it does mean we don’t pay the cost of things like version resolution, conflicts, cost of acquisition, etc.
Inlining makes duplication almost free, while packaging makes it expensive.
The more packages we have, the larger our supply chain surface area is. Every package is a potential point of failure for maintenance, security, and so on.
For example, a maintainer of many of these packages was compromised last year. This meant hundreds of tiny building blocks were compromised, which meant the higher level packages we actually install were also compromised.
Logic as simple as Array.isArray(val) ? val : [val] probably doesn’t need its own package, security, maintenance, and so on. It can just be inlined and we can avoid the risk of it being compromised.
Similar to the first pillar, this philosophy made its way into the “hot path” and probably shouldn’t have. Again, we all pay the cost to no real benefit.
If you’re building an app, you might want to use some “future” features your chosen engine doesn’t support yet. In this situation, a polyfill can come in handy - it provides a fallback implementation where the feature should be, so you can use it as if it were natively supported.
For example, temporal-polyfill polyfills the new Temporal API so we can use Temporal regardless of whether the engine supports it.
Now, if you’re building a library instead, what should you do?
In general, no library should load a polyfill as that is a consumer’s concern and a library shouldn’t be mutating the environment around it. As an alternative, some maintainers choose to use what’s called a ponyfill (sticking to the unicorns, sparkles and rainbows theme).
A ponyfill is basically a polyfill you import rather than one which mutates the environment.
This kinda works since it means a library can use future tech by importing an implementation of it which passes through to the native one if it exists, and uses the fallback otherwise. None of this mutates the environment, so it is safe for libraries to use.
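A minimal sketch of the ponyfill pattern, using Object.hasOwn as the feature being filled in:

```javascript
// Ponyfill: export a function that defers to the native implementation
// when available, and falls back otherwise. The global is never touched.
const hasOwn = typeof Object.hasOwn === 'function'
  ? Object.hasOwn
  : (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

console.log(hasOwn({ a: 1 }, 'a')); // true
console.log(hasOwn({ a: 1 }, 'b')); // false
```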
For example, fastly provides @fastly/performance-observer-polyfill, which contains both a polyfill and ponyfill for PerformanceObserver.
These ponyfills did their job at the time - they allowed the library author to use future tech without mutating the environment and without forcing the consumer to know which polyfills to install.
The problem comes when these ponyfills outstay their welcome. When the feature they fill in for is now supported by all engines we care about, the ponyfill should be removed. However, this often doesn’t happen and the ponyfill remains in place long after it’s needed.
We’re now left with many, many packages which rely on ponyfills for features we’ve all had for a decade now.
Unless these packages are being kept alive because of Pillar 1, they’re usually still used just because nobody ever thought to remove them.
When all long-term support versions of engines have the feature, the ponyfill should be removed.
Much of this bloat is so deeply nested in dependency trees today that it is a fairly hefty task to unravel it all and get to a good place. It will take time, and it will take a lot of effort from maintainers and consumers.
Having said that, I do think we can make significant progress on this front if we all work together.
Start asking yourself, “why do I have this package?” and “do I really need it?”.
If you find something which seems redundant, raise an issue with the maintainer asking if it can be removed.
If you encounter a direct dependency which has many of these issues, have a look for an alternative which doesn’t. A good start for that is the module-replacements project.
knip is a great project which can help you find and remove unused dependencies, dead code, and much more. In this case, it can be a great tool to help you find and remove dependencies you no longer use.
This doesn’t solve the problems above necessarily, but is a great starting point to help clean up the dependency tree before doing more involved work.
You can read more about how knip deals with unused dependencies in their documentation.
The e18e CLI has a super useful analyze mode to determine which dependencies are no longer needed, or have community recommended replacements.
For example, if you get something like this:
Using this, we can quickly identify which direct dependencies can be cleaned up. We can also then use the migrate command to automatically migrate some of these dependencies:
In this case, it will migrate from chalk to picocolors, a much smaller package which provides the same functionality.
In the future, this CLI will even recommend based on your environment - for example, it could suggest the native styleText instead of a colours library if you’re running a new enough Node.
npmgraph is a great tool to visualize your dependency tree and investigate where bloat is coming from.
For example, let’s take a look at the bottom half of ESLint’s dependency graph as of writing this post:
We can see in this graph that the find-up branch is isolated, in that nothing else uses its deep dependencies. For something as simple as an upwards file-system traversal, maybe we don’t need 6 packages. We can then go look for an alternative, such as empathic which has a much smaller dependency graph and achieves the same thing.
The module replacements project is being used as a central data set for the wider community to document which packages can be replaced with native functionality, or more performant alternatives.
If you’re ever in need of an alternative or just want to check your dependencies, this data set is great for that.
Similarly, if you come across packages in your tree which are made redundant by native functionality, or just have better battle-tested alternatives, this project is definitely a great place to contribute that so others can benefit from it.
Paired with the data, there’s also a codemods project which provides codemods to automatically migrate some of these packages to their suggested replacements.
We all pay the cost for an incredibly small group of people to have an unusual architecture they like, or a level of backwards compatibility they need.
This isn’t necessarily a fault of the people who made these packages, as each person should be able to build however they want. Many of them are an older generation of influential JavaScript developers - building packages in a darker time where many of the nice APIs and cross-compatibility we have today didn’t exist. They built the way they did because it was possibly the best way at the time.
The problem is that we never moved on from that. We still download all of this bloat today even though we’ve had these features for several years.
I think we can solve this by reversing things. This small group should pay the cost - they should have their own special stack pretty much only they use. Everyone else gets the modern, lightweight, and widely supported code.
Hopefully things like e18e and npmx can help with that through documentation, tooling, etc. You can also help by taking a closer look at your dependencies and asking “why?”. Raise issues with your dependencies asking them if, and why they need these packages anymore.
We can fix it.
...
Read the original on 43081j.com »
Professional video editing, right in your browser

A powerful NLE editor with GPU compositing, keyframe animation, and real-time preview. No installs required.

Everything you need to edit: built on WebGPU and Rust/WASM for performance that rivals native apps.

* WebGPU-powered compositing via Rust/WASM delivers near-native performance for real-time previews and exports.
* Canvas-rendered timeline with unlimited video and audio tracks, linked clips, and cross-transitions.
* Animate any property with bezier easing curves. Transform, opacity, effects — everything is keyframeable.
* Apply brightness, contrast, saturation, blur, and hue rotation — all GPU-computed with instant preview.
* Everything runs in the browser. Your media stays local with the File System Access API — nothing leaves your machine.
...
Read the original on tooscut.app »
Wikipedia, AI, maps, and education tools running on your own hardware — completely free. No internet required.
Knowledge That Never Goes Offline
Node for Offline Media, Archives, and Data — a free, open source offline server you install on any computer. Download the content you want, and it works without internet — forever. Similar products cost hundreds of dollars. Project NOMAD is free.
* Emergency Preparedness - When infrastructure fails, NOMAD keeps working. Medical references, survival guides, and encyclopedic knowledge — no internet required.
* Off-Grid Living - Cabin, RV, or sailboat — bring a complete library, AI assistant, and offline maps wherever you go. True digital independence.
* Tech Enthusiasts - Run local LLMs, self-host your knowledge base, own your data. Built for beefy hardware and those who want full control.
* Education - Khan Academy, Wikipedia for Schools, and more — complete learning resources for families anywhere, even without connectivity.
Whether you’re planning for emergencies or living off-grid, Project NOMAD has you covered.
* Information Library (powered by Kiwix) - Offline Wikipedia, Project Gutenberg, medical references, repair guides, and more — terabytes of human knowledge at your fingertips.
* AI Assistant (powered by Ollama) - Run powerful large language models completely offline. Chat, write, analyze, code — all without sending data anywhere.
* Offline Maps (powered by OpenStreetMap) - Full offline mapping with OpenStreetMap data. Navigate, plan routes, and explore terrain without any cell service.
* Education Platform (powered by Kolibri) - Khan Academy courses, educational videos, interactive lessons — complete K-12 curriculum available offline.
Watch the full walkthrough to see what Project NOMAD can do on your hardware.
Other offline products charge hundreds and lock you into specific hardware. Project NOMAD runs on any PC you choose — with GPU-accelerated AI — for free.
...
Read the original on www.projectnomad.us »
I’m releasing Manyana, a project which I believe presents a coherent vision for the future of version control — and a compelling case for building it.
It’s based on the fundamentally sound approach of using CRDTs for version control, which is long overdue but hasn’t happened yet because of subtle UX issues. A CRDT merge always succeeds by definition, so there are no conflicts in the traditional sense — the key insight is that changes should be flagged as conflicting when they touch each other, giving you informative conflict presentation on top of a system which never actually fails. This project works that out.
One immediate benefit is much more informative conflict markers. Two people branch from a file containing a function. One deletes the function. The other adds a line in the middle of it. A traditional VCS gives you this:
<<<<<<< left
=======
def calculate(x):
    a = x * 2
    logger.debug(f"a={a}")
    b = a + 1
    return b
>>>>>>> right
Two opaque blobs. You have to mentally reconstruct what actually happened.
Manyana gives you this:
<<<<<<< begin deleted left
def calculate(x):
    a = x * 2
======= begin added right
    logger.debug(f"a={a}")
======= begin deleted left
    b = a + 1
    return b
>>>>>>> end conflict
Each section tells you what happened and who did it. Left deleted the function. Right added a line in the middle. You can see the structure of the conflict instead of staring at two blobs trying to figure it out.
CRDTs (Conflict-Free Replicated Data Types) give you eventual consistency: merges never fail, and the result is always the same no matter what order branches are merged in — including many branches mashed together by multiple people working independently. That one property turns out to have profound implications for every aspect of version control design.
Line ordering becomes permanent. When two branches insert code at the same point, the CRDT picks an ordering and it sticks. This prevents problems when conflicting sections are both kept but resolved in different orders on different branches.
Conflicts are informative, not blocking. The merge always produces a result. Conflicts are surfaced for review when concurrent edits happen “too near” each other, but they never block the merge itself. And because the algorithm tracks what each side did rather than just showing the two outcomes, the conflict presentation is genuinely useful.
History lives in the structure. The state is a weave — a single structure containing every line which has ever existed in the file, with metadata about when it was added and removed. This means merges don’t need to find a common ancestor or traverse the DAG. Two states go in, one state comes out, and it’s always correct.
One idea I’m particularly excited about: rebase doesn’t have to destroy history. Conventional rebase creates a fictional history where your commits happened on top of the latest main. In a CRDT system, you can get the same effect — replaying commits one at a time onto a new base — while keeping the full history. The only addition needed is a “primary ancestor” annotation in the DAG.
This matters because aggressive rebasing quickly produces merge topologies with no single common ancestor, which is exactly where traditional 3-way merge falls apart. CRDTs don’t care — the history is in the weave, not reconstructed from the DAG.
Manyana is a demo, not a full-blown version control system. It’s about 470 lines of Python which operate on individual files. Cherry-picking and local undo aren’t implemented yet, though the README lays out a vision for how those can be done well.
What it is is a proof that CRDT-based version control can handle the hard UX problems and come out with better answers than the tools we’re all using today — and a coherent design for building the real thing.
The code is public domain. The full design document is in the README.
...
Read the original on bramcohen.com »
Named after floccus — the cloud formation that looks exactly like popcorn.
A free, open-source local AWS emulator. No account. No feature gates. No CI restrictions. Just docker compose up.
LocalStack’s community edition sunset in March 2026 — requiring auth tokens, dropping CI support, and freezing security updates. Floci is the no-strings-attached alternative.
# docker-compose.yml
services:
  floci:
    image: hectorvent/floci:latest
    ports:
      - "4566:4566"
    volumes:
      - ./data:/app/data
docker compose up
All services are available at http://localhost:4566. Use any AWS region — credentials can be anything.
export AWS_ENDPOINT_URL=http://localhost:4566
export AWS_DEFAULT_REGION=us-east-1
export AWS_ACCESS_KEY_ID=test
export AWS_SECRET_ACCESS_KEY=test
# Try it
aws s3 mb s3://my-bucket
aws sqs create-queue --queue-name my-queue
aws dynamodb list-tables
Point your existing AWS SDK at http://localhost:4566 — no other changes needed.
// Java (AWS SDK v2)
DynamoDbClient client = DynamoDbClient.builder()
    .endpointOverride(URI.create("http://localhost:4566"))
    .region(Region.US_EAST_1)
    .credentialsProvider(StaticCredentialsProvider.create(
        AwsBasicCredentials.create("test", "test")))
    .build();
# Python (boto3)
import boto3
client = boto3.client("s3",
    endpoint_url="http://localhost:4566",
    region_name="us-east-1",
    aws_access_key_id="test",
    aws_secret_access_key="test")
// Node.js (AWS SDK v3)
import { S3Client } from "@aws-sdk/client-s3";
const client = new S3Client({
  endpoint: "http://localhost:4566",
  region: "us-east-1",
  credentials: { accessKeyId: "test", secretAccessKey: "test" },
  forcePathStyle: true,
});
All settings are overridable via environment variables (FLOCI_ prefix).
MIT — use it however you want.
...
Read the original on github.com »
Read the paper — Full technical details, 90+ experiments, and the story of how an AI and a human built this in 24 hours.
Pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397 billion parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second with production-quality output including tool calling.
The entire 209GB model streams from SSD through a custom Metal compute pipeline. No Python. No frameworks. Just C, Objective-C, and hand-tuned Metal shaders.
*2-bit quantization produces \"name\" instead of "name" in JSON output, making tool calling unreliable. 4-bit is the production configuration.
The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention. Each layer has 512 experts, of which K=4 are activated per token (plus one shared expert). Hidden dimension is 4096.
SSD Expert Streaming — Expert weights (209GB at 4-bit) are read from NVMe SSD on demand via parallel pread() with GCD dispatch groups. Only the K=4 active experts per layer are loaded (~6.75MB each). The OS page cache manages caching — no custom cache needed (“Trust the OS” principle). Inspired by Apple’s “LLM in a Flash” paper.
FMA-Optimized Dequant Kernel — The inner loop of the 4-bit dequantized matrix-vector multiply rearranges the math from (nibble * scale + bias) * x to fma(nibble, scale*x, bias*x). Pre-computing scale*x and bias*x lets the GPU fused multiply-add unit do dequant+multiply in one instruction. 12% faster than the naive formulation.
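The rearrangement is plain algebra; here is a sketch of the equivalence in JavaScript (the real kernel is Metal, and the fma is written out since JS has no fused op):

```javascript
// Naive: dequantize the 4-bit value, then multiply by the activation x
const naive = (nibble, scale, bias, x) => (nibble * scale + bias) * x;

// Rearranged: scale*x and bias*x are precomputed once per quant block,
// so the per-element work collapses to fma(nibble, scaleX, biasX)
const fused = (nibble, scale, bias, x) => {
  const scaleX = scale * x; // hoisted out of the inner loop in the real kernel
  const biasX = bias * x;
  return nibble * scaleX + biasX; // one fused multiply-add on the GPU
};

console.log(naive(7, 0.5, -1.0, 2.0)); // 5
console.log(fused(7, 0.5, -1.0, 2.0)); // 5
```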
Deferred GPU Expert Compute — CMD3 (expert forward pass) is submitted without waiting. The GPU executes it while the CPU prepares the next layer. The combine + residual + norm are also on GPU, feeding directly into the next layer’s attention projections.
Accelerate BLAS for Linear Attention — The GatedDeltaNet recurrence uses cblas_sscal, cblas_sgemv, and cblas_sger for the 64-head × 128×128 state matrix update. 64% faster than scalar code.
Trust the OS — No custom expert cache. The OS page cache (~35GB) manages expert data caching via standard LRU. Every custom caching approach we tested (Metal LRU, malloc cache, LZ4 compressed cache) was slower due to GPU memory pressure or overhead. The page cache achieves ~71% hit rate naturally.
On Apple Silicon, SSD DMA and GPU compute share the same memory controller and cannot be profitably overlapped. The GPU’s dequant kernels are bandwidth-saturated at ~418 GiB/s. Even small background SSD DMA causes disproportionate GPU latency spikes through memory controller arbitration. The serial pipeline (GPU → SSD → GPU) is hardware-optimal.
cd metal_infer
make
# 4-bit inference (needs packed_experts/ directory)
./infer --prompt "Explain quantum computing" --tokens 100
# 2-bit inference (faster but breaks tool calling)
./infer --prompt "Explain quantum computing" --tokens 100 --2bit
# Interactive chat with tool calling
./chat
# Per-layer timing breakdown
./infer --prompt "Hello" --tokens 20 --timing
This is a primary development machine. The engine explicitly controls memory:
* No OOM risk. Expert data streams from SSD on demand.
...
Read the original on github.com »
I’m a Windows guy; I always have been. One of my first programming books was , which crucially came with a trial version of Visual C++ that my ten-year-old self could install on my parents’ computer. I remember being on a family vacation when .NET 1.0 came out, working my way through a C# tome and gearing up to rewrite my Neopets cheating programs from MFC into Windows Forms. Even my very first job after university was at a .NET shop, although I worked mostly on the frontend.
While I followed the Windows development ecosystem from the sidelines, my professional work never involved writing native Windows apps. (Chromium is technically a native app, but is more like its own operating system.) And for my hobby projects, the web was always a better choice. But, spurred on by fond childhood memories, I thought writing a fun little Windows utility program might be a good retirement project.
Well. I am here to report that the scene is a complete mess. I totally understand why nobody writes native Windows applications these days, and instead people turn to Electron.
The utility I built, Display Blackout, scratched an itch for me: when playing games on my three-monitor setup, I wanted to black out my left and right displays. Turning them off will cause Windows to spasm for several seconds and throw all your current window positioning out of whack. But for OLED monitors, throwing up a black overlay will turn off all the pixels, which is just as good.
To be clear, this is not an original idea. I was originally using an AutoHotkey script, which upon writing this post I found out has since morphed into a full Windows application. Other incarnations of the idea are even available on the Microsoft Store. But, I thought I could create a slightly nicer and more modern UI, and anyway, the point was to learn, not to create a commercial product.
For our purposes, what’s interesting about this app is the sort of capabilities it needs:
Enumerating the machine’s displays and their bounds
Let’s keep those in mind going forward.
Look at this beautiful UI that I made. Surely you will agree that it is better than all other software in this space.
In the beginning, there was the Win32 API, in C. Unfortunately, this API is still highly relevant today, including for my program.
Over time, a series of abstractions on top of this emerged. The main pre-.NET one was MFC, a C++ library which used modern-at-the-time language features like classes and templates to add some object-orientation on top of the raw C functions.
The abstraction train really got going with the introduction of .NET. .NET was many things, but for our purposes the most important part was the introduction of a new programming language, C#, that ran as JITed bytecode on a new virtual machine, in the same style as Java. This brought automatic memory management (and thus memory safety) to Windows programming, and generally gave Microsoft a more modern foundation for their ecosystem. Additionally, the .NET libraries included a whole new set of APIs for interacting with Windows. On the UI side in particular, .NET 1.0 (2002) started out with Windows Forms. Similar to MFC, it was largely a wrapper around the Win32 windowing and control APIs.
With .NET 3.0 (2006), Microsoft introduced WPF. Now, instead of creating all controls as C# objects, there was a separate markup language, XAML: more like the HTML + JavaScript relationship. This also was the first time they redrew controls from scratch, on the GPU, instead of wrapping the Win32 API controls that shipped with the OS. At the time, this felt like a fresh start, and a good foundation for the foreseeable future of Windows apps.
The next big pivot was with the release of Windows 8 (2012) and the introduction of WinRT. Similar to .NET, it was an attempt to create new APIs for all of the functionality needed to write Windows applications. If developers stayed inside the lines of WinRT, their apps would meet the modern standard of sandboxed apps, such as those on Android and iOS, and be deployable across Windows desktops, tablets, and phones. It was still XAML-based on the UI side, but with everything slightly different than it was in WPF, to support the more constrained cross-device targets.
This strategy got a do-over in Windows 10 (2015) with UWP, with some sandboxing restrictions lifted to allow for more capable desktop/phone/Xbox/HoloLens apps, but still not quite the same power as full .NET apps with WPF. At the same time, with both WinRT and UWP, certain new OS-level features and integrations (such as push notifications, live tiles, or publication in the Microsoft Store) were only granted to apps that used these frameworks. This led to awkward architectures where applications like Chrome or Microsoft Office would have WinRT/UWP bridge apps around old-school cores, communicating over local IPC or similar.
With Windows 11 (2021), Microsoft finally gave up on the attempts to move everyone to some more-sandboxed and more-modern platform. The Windows App SDK exposes all the formerly WinRT/UWP-exclusive features to all Windows apps, whether written in standard C++ (no more C++/CLI) or written in .NET. The SDK includes WinUI 3, yet another XAML-based, drawn-from-scratch control library.
So did you catch all that? Just looking at the UI framework evolution, we have: Win32, then MFC, then Windows Forms, then WPF, then WinRT XAML, then UWP, and now WinUI 3.
In the spirit of this being a learning project, I knew I wanted to use the latest and greatest first-party foundation. That meant writing a WinUI 3 app, using the Windows App SDK. There end up being three ways to go about this: writing it in C++, writing it in C# with framework-dependent .NET, or writing it in C# with self-contained AOT-compiled .NET.
This is a painful choice. C++ will produce lean apps, runtime-linked against the Windows App SDK libraries, with easy interop down into any Win32 C APIs that I might need. But, in 2026, writing a greenfield application in a memory-unsafe language like C++ is a crime.
What would be ideal is if I could use the system’s .NET, and just distribute the C# bytecode, similar to how all web apps share the same web platform provided by the browser. This is called “framework-dependent deployment”. However, for no reason I can understand, Microsoft has decided that even the latest versions of Windows 11 only get .NET 4.8.1 preinstalled. (The current version of .NET is 10.) So distributing an app this way incurs a tragedy of the commons, where the first app to need modern .NET will cause Windows to show a dialog prompting the user to download and install the .NET libraries. This is not the optimal user experience!
That leaves .NET AOT. Yes, I am compiling the entire .NET runtime—including the virtual machine, garbage collector, standard library, etc.—into my binary. The compiler tries to trim out unused code, but the result is still a solid 9 MiB for an app that blacks out some monitors.
There’s a similar painful choice when it comes to distribution. Although Windows is happy to support hand-rolled or third-party-tool-generated setup.exe installers, the Microsoft-recommended path for a modern app with containerized install/uninstall is MSIX. But this format relies heavily on code signing certificates, which seem to cost around $200–300/year for non-US residents. The unsigned sideloading experience is terrible, requiring a cryptic PowerShell command only usable from an admin terminal. I could avoid sideloading if Microsoft would just accept my app into their store, but they rejected it for not offering “unique lasting value”.
The tragedy here is that this all seems so unnecessary. .NET could be distributed via Windows Update, so the latest version is always present, making framework-dependent deployment viable. Or at least there could be an MSIX package for .NET available, so that other MSIX packages could declare a dependency on it. Unsigned MSIX sideloads could use the same crowd-sourced reputation system that EXE installers get. Windows code signing certs could cost $100/year, instead of $200+, like the equivalent costs for the Apple ecosystem. But like everything else about modern Windows development, it’s all just … half-assed.
It turns out that it’s a lot of work to recreate one’s OS and UI APIs every few years. Coupled with the intermittent attempts at sandboxing and deprecating “too powerful” functionality, the result is that each new layer has gaps, where you can’t do certain things which were possible in the previous framework.
This is not a new problem. Even back with MFC, you would often find yourself needing to drop down to Win32 APIs. And .NET has had P/Invoke since 1.0. So, especially now that Microsoft is no longer requiring that you only use the latest framework in exchange for new capabilities, having to drop down to a previous layer is not the end of the world. But it’s frustrating: what is the point of using Microsoft’s latest and greatest, if half your code is just interop goop to get at the old APIs? What’s the point of programming in C#, if you have to wrap a bunch of C APIs?
Let’s revisit the list of things my app needs to do, and compare them to what you can do using the Windows App SDK:
Enumerating the machine’s displays and their bounds: can enumerate, as long as you use a for loop instead of a foreach loop. But watching for changes requires P/Invoke, because the modern API doesn’t actually work.
Placing borderless, titlebar-less, non-activating black windows: much of this is doable, but non-activating needs P/Invoke.
Optionally running at startup: can do, with a nice system-settings-integrated off-by-default API.
Displaying a tray icon with a few menu items: not available. Not only does the tray icon itself need P/Invoke, the concept of menus for tray icons is not standardized, so depending on which wrapper package you pick, you’ll get one of several different context menu styles.
The Windows IME system component uses a modern frosted-glass style, matching a few other system components but no apps (including Microsoft apps) that I can find.
The OneNote first-party app uses a white background, and uses bold to indicate the left-click action.
The Phone Link bundled app is pretty similar to OneNote.
Command Palette comes from PowerToys, which is supposed to be a WinUI 3 showcase. Similar to OneNote and Phone Link, but with extra “Left-click” and “Double-click” indicators seen nowhere else.
The Windows Security system component uses different margins, and inexplicably, is the only app to position the menu on the left.
1Password seems to be trying for the same style as the white-background Windows components and Microsoft apps, but with different margins than all of them.
Signal seems roughly the same as 1Password. A shared library?
Discord seems similar to 1Password and Signal, but it inserted an unselectable branding “menu item”.
Steam is too cool to fit into the host OS, and just draws something completely custom.
For Display Blackout, I used the approach provided by WinUIEx. This matches the system IME menu, although not in vertical offset or horizontal centering.
But these are just the headline features. Even something as simple as automatically sizing your app window to its contents was lost somewhere along the way from WPF to WinUI 3.
Given how often you need to call back down to Win32 C APIs, it doesn’t help that the interop technology is itself undergoing a transition. The modern way appears to be something called CsWin32, which is supposed to take some of the pain out of P/Invoke. But it can’t even correctly wrap strings inside of structs. To my eyes, it appears to be one of those underfunded, perpetually pre-1.0 projects with uninspiring changelogs, on track to get abandoned after a couple years.
And CsWin32’s problems aren’t just implementation gaps: some of them trace back to missing features in C# itself. The documentation contains this darkly hilarious passage:
Some parameters in win32 are [optional, out] or [optional, in, out]. C# does not have an idiomatic way to represent this concept, so for any method that has such parameters, CsWin32 will generate two versions: one with all ref or out parameters included, and one with all such parameters omitted.
The C# language doesn’t have a way to specify a foundational parameter type of the Win32 API? One which is a linear combination of two existing supported parameter types? One might think that an advantage of controlling C# would be that Microsoft has carefully shaped and coevolved it to be the perfect programming language for Windows APIs. This does not appear to be the case.
Indeed, it’s not just in interop with old Win32 APIs where C# falls short of its target platform’s needs. When WPF first came out in 2006, with its emphasis on two-way data binding, everyone quickly realized that the boilerplate involved in creating classes that could bind to UI was unsustainable. Essentially, every property needs to become a getter/setter pair, with the setter having a same-value guard and a call to fire an event. (And firing an event is full of ceremony in C#.) People tried various solutions to paper over this, from base classes to code generators. But the real solution here is to put something in the language, like JavaScript has done with decorators and proxies.
So when I went to work on my app, I was astonished to find that twenty years after the release of WPF, the boilerplate had barely changed. (The sole improvement is that C# got a feature that lets you omit the name of the property when firing the event.) What has the C# language team been doing for twenty years, that creating native observable classes never became a priority?
Honestly, the whole project of native Windows app development feels like it’s not a priority for Microsoft. The relevant issue trackers are full of developers encountering painful bugs and gaps, and getting little-to-no response from Microsoft engineers. The Windows App SDK changelog is mostly about them adding new machine learning APIs. And famously, many first-party apps, from Visual Studio Code to Outlook to the Start menu itself, are written using web technologies.
This is probably why large parts of the community have decided to go their own way, investing in third-party UI frameworks like Avalonia and Uno Platform. From what I can tell browsing their landing pages and GitHub repositories, these are better-maintained, and written by people who loved WPF and wished WinUI were as capable. They also embrace cross-platform development, which certainly is important for some use cases.
But at that point: why not Electron? Seriously. C# and XAML are not that amazing, compared to, say, TypeScript/React/CSS. As we saw from my list above, to do most anything beyond the basics, you’re going to need to reach down into Win32 interop anyway. If you use something like Tauri, you don’t even need to bundle a whole Chromium binary: you can use the system webview. Ironically, the system webview receives updates every 4 weeks (soon to be 2?), whereas the system .NET is perpetually stuck at version 4.8.1!
It’s still possible for Microsoft to turn this around. The Windows App SDK approach does seem like an improvement over the long digression into WinRT and UWP. I’ve identified some low-hanging fruit around packaging and deployment above, which I’d love for them to act on. And their recent announcement of a focus on Windows quality includes a line about using WinUI 3 more throughout the OS, which could in theory trickle back into improving WinUI itself.
I’m not holding my breath. And from what I can tell, neither are most developers. The Hacker News commentariat loves to bemoan the death of native apps. But given what a mess the Windows app platform is, I’ll pick the web stack any day, with Electron or Tauri to bridge down to the relevant Win32 APIs for OS integration.
...
Read the original on domenic.me »
Everyone needs a rewarding hobby. I’ve been scanning all of my receipts since 2001. I never typed in a single price - just kept the images. I figured someday the technology to read them would catch up, and the data would be interesting.
This year I tested it. Two AI coding agents, 11,345 receipts. I started with eggs. If you can track one item across 25 years of garbled thermal prints, OCR failures, and folder typos, you can track anything.
14 days. 1.6 billion tokens. 589 egg receipts found. Here’s what the data says.
Ok so let’s make a project plan. In the ~/Records/ we have a ton of receipts. Many are pdf/image/etc. I want to go through and extract the actual content of the receipts to find how much we spend on eggs. Receipts are notoriously terrible to OCR, so we might need to do something more advanced.
Codex explored my file system, found two existing SQLite databases I’d forgotten about, discovered 11,345 receipts across PDFs, emails, and images, and came back with a project plan. I said “write this out to a plan.md please.” It did. We were building within the hour.
The whole thing took 14 days. Maybe 15 hours of me actually at the keyboard - short bursts of direction-giving separated by long stretches of the agents just running. Codex ran 15 interactive sessions. Claude handled 10.
The oldest receipts were flatbed scans - multiple receipts per page, random orientations, white paper on a white scanner bed. Codex and I tried seven classical CV approaches to find receipt boundaries. Edge detection. Adaptive thresholding. Contour analysis. Morphological operations. Watershed segmentation. Template matching. A grid-based decomposition I pitched as “a classic HackerRank problem.”
None of them worked. The core issue: receipts are white and so is the scanner bed. I started calling it the “shades of white” problem. The cleverest attempt was inspired by removing tourists from landmark photos - stack all scans, compute the median pixel at each position, subtract to reveal edges. I thought that one was going to work. Best F1: 0.302.
We also threw macOS Vision OCR at it (via a Swift script Codex wrote on the fly), Tesseract, several other tools. I was starting to think the flatbed scans might just be a loss. Then I tried Meta’s SAM3.
One API call with text="receipt". 0.92-0.98 confidence on every boundary. Four seconds per scan. 1,873 receipts from 760 multi-receipt pages. Seven approaches in hours; SAM3 in an afternoon.
Receipts land at random angles, and OCR needs them upright. We tried Tesseract’s orientation detection, macOS Vision OCR, Moondream 2 and 3 - each one better than the last but none reliable enough. Then I realized that every time I pasted a receipt into our Claude conversation to debug something, it was already reading the text perfectly. Rotated, faded, didn’t matter.
Why am I building a rotation pipeline when the tool I’m talking to already solves this? So we sent all 11,345 receipts through Sonnet and Codex. Sometimes the answer is staring you right in the face.
Halfway through the project, Tesseract was the weak link. It read “OAT MILK” as “OATH ILK.” It dropped decimals - $4.37 became $437. On old thermal prints it produced nothing at all. Codex opened 20 of the worst ones by hand and found that some weren’t even receipts. A family photo. A postcard. A greeting card. All filed under “Receipts.”
I found PaddleOCR-VL - a 0.9B parameter vision-language model that runs locally on Apple Silicon. First test on a sample bank statement: clean, accurate text in 2.1 seconds. Tesseract was faster but dramatically noisier. Second test on a tall Fred Meyer receipt: disaster. The model entered a repetition loop, hallucinating “TILL YGRT” endlessly.
The fix turned out to be simple - split tall receipts into slices. Dynamic slicing based on aspect ratio: num_slices = max(2, round(aspect_ratio / 1.5)). Five parallel shards ran overnight. GPU pegged at 100% for 10.8 hours. In the morning: 11,345 receipts OCR’d successfully. Cleaner text for every receipt in the archive.
PaddleOCR-VL isn’t a Codex replacement - it can’t do structured extraction or follow instructions. It’s a better Tesseract. The real pipeline: receipt image → PaddleOCR-VL (local, clean text) → Codex/Claude (structured extraction).
Once receipts were segmented, oriented, and OCR’d, they needed structured extraction - find the egg line items, pull prices and quantities.
It started with regex. The models love regex. Keyword matching for “egg,” money patterns for prices. Heuristics found eggs in 25/25 positive samples with 0 false positives. Not bad. But on the full corpus, false negatives piled up - Fred Meyer abbreviated codes like STO LRG BRUNN, Whole Foods truncated to EDGS, OCR mangled “EGGS” into LG EGO 12 CT. No regex catches these.
So I told Codex “we have unlimited tokens, let’s use them all,” and we pivoted to sending every receipt through Codex for structured extraction. From that one sentence, Codex came back with a parallel worker architecture - sharding, health management, checkpointing, retry logic. The whole thing. When I ran out of tokens on Codex mid-run, it auto-switched to Claude and kept going. I didn’t ask it to do that. I didn’t know it had happened until I read the logs.
But the runs kept crashing. Long CLI jobs died when sessions timed out. The script committed results at end-of-run, so early deaths lost everything. I watched it happen three times. On the fourth attempt I said “I would have expected we start a new process per batch.” That was the fix - one fresh process per batch, hard call cap, exit cleanly, resume from cache. Codex patched it, launched it in a tmux session, and the ETA dropped from 12 hours to 3. Not a hard fix. Just the kind of thing you know after you’ve watched enough overnight jobs die at 3 AM.
11,345 receipts processed. The thing that was supposed to take all night finished before I went to bed.
First I needed ground truth. I asked Claude to build me a labeling tool - keyboard-first, receipt image on the left, classification data on the right, arrow keys to navigate, single keypress to verdict. It built the whole Flask app in 22 minutes. I sat down and hand-labeled 375 receipts.
Regex found 650 receipts mentioning “egg.” Against those 375 labels: 88% recall. The misses told the story - abbreviated codes, OCR garble, truncated descriptions. No keyword search catches STO LRG BRUNN.
The fix: use those hand-labeled edge cases as few-shot examples in an LLM classifier. Twenty examples of what “eggs” looks like on a garbled thermal print from 2003. Batch 10 receipts per call. Eight parallel workers. Two hours. 11,345 receipts classified.
Final accuracy: 99%+. Every supposed “miss” by the LLM turned out to be a mislabel in the ground truth. A bicycle shop receipt the old heuristic had flagged. A barcode-only scan. Egg noodles. The classifier was more correct than my labels.
Then more QA. A second tool for eyeballing 497 weak images: Space for no-eggs, X for has-eggs. A third for data entry on 95 receipts with missing fields - numpad-optimized, auto-advancing. Four tools total, each built in minutes, each one I ground through by hand.
So how good is the data? I pulled 372 random samples and checked them by hand. Initially: 96% correct. The errors were mostly garbled OCR on old scans. One was a hallucination - the pipeline fabricated egg data for a receipt that contained no eggs at all.
* Email receipts silently preferring text/plain over text/html, dropping pricing lines that only existed in the HTML part
Here’s what made the quality good: every time I caught something, I could show the agents what to look for and they’d go fix it everywhere. I caught a store address hiding in OCR noise: “915 Ny 45th St” was 915 NW 45th St, Seattle. I showed them the pattern, they ran a recovery pass on 40 missing-location receipts - all 40 resolved.
Codex and Claude are excellent at building tools and extracting structured data, but they couldn’t segment an image or replace an OCR engine. The right answer was a stack of specialized models - SAM3 for segmentation, PaddleOCR for text, Codex and Claude for everything else. I expected this, but it was worth trying the simple path first.
These are the days of miracle and wonder. I can’t wait to see what 30 years of eggs looks like.
...
Read the original on john-rush.com »
How a sign-extension bug in C made me pull my hair out for days but became my first patch to the Linux kernel!
A while ago, I started dipping my toe into virtualization. It’s a topic that many people have heard of or use on a daily basis, but few know or think about how it works under the hood.
I like to learn by reinventing the wheel, and naturally, to learn virtualization I started by trying to build a Type-2 hypervisor. This approach is similar to how KVM (Linux) or bhyve (FreeBSD) are built.
My experimental hypervisor (and VMM) is still a work-in-progress and is available on my Github: pooladkhay/evmm.
Since virtualization is hardware-assisted these days, the hypervisor needs to communicate directly with the CPU by running certain privileged instructions, which means a Type-2 hypervisor is essentially a kernel module that exposes an API to user-space, where a Virtual Machine Monitor (VMM) like QEMU or Firecracker is running and orchestrating VMs by utilizing that API.
In this post, I want to describe exactly how I found that bug. But to make it a bit more educational, I’m going to set the stage first and talk about a few core concepts so you can see exactly where the bug emerges.
The x86 architecture in protected mode (32-bit mode) envisions a task-switching mechanism that is facilitated by the hardware. The architecture defines a Task State Segment (TSS), which is a region in memory that holds information about a task (general-purpose registers, segment registers, etc.). The idea was that any given task or thread would have its own TSS, and when the switch happens, a specific register (Task Register, or TR) would get updated to point to the new task.
This was abandoned in favor of software-defined task switching which gives more granular control and portability to the operating system kernel.
But the TSS was not entirely abandoned. On modern 64-bit systems, the kernel uses a TSS-per-core approach, where the main job of the TSS is to hold a few stack pointers that are critical for the kernel’s and CPU’s normal operation. More specifically, it holds the kernel stack of the current thread, which is used when the system wants to switch from user-space to kernel-space.
It also holds a few known good stacks for critical events like Non-Maskable Interrupts (NMIs) and Double Faults. These are events that if not handled correctly, can cause a triple fault and crash a CPU core or cause an immediate system reboot.
We know that memory access is generally considered to be expensive and caching values somewhere on the CPU die is the preferred approach if possible. This is where the TR register comes into the picture. It has a visible part which is a 16-bit offset that we have already discussed as well as a hidden part that holds direct information about the TSS (Base address, Limit, and Access rights). This saves the CPU the trouble of indexing into the GDT to eventually find the TSS every time it’s needed.
A hypervisor is essentially a task switcher where tasks are operating systems. In order for multiple operating systems to run on the same silicon chip, the hypervisor must swap the entire state of the CPU which includes updating the hidden part of the TR register as well.
In a previous blog post I described how Intel implemented their virtualization extension (VT-x) and how each vCPU (vCore) is given its own VMCS (Virtual Machine Control Structure) block where its state is saved to or restored from by the hardware when switching between host and guest OSes.
I suggest reading that post if you’re interested in the topic, but the VMCS consists of four main areas: the guest-state area, the host-state area, the control fields, and the VM-exit information fields.
The host-state area has two fields, which correspond to the visible part and one of the hidden parts (the base address) of the TR register: HOST_TR_SELECTOR and HOST_TR_BASE.
While the guest-state area has four (the visible part plus all three hidden parts): GUEST_TR_SELECTOR, GUEST_TR_BASE, GUEST_TR_LIMIT, and GUEST_TR_AR_BYTES.
The reason is that the hardware assumes the host OS to be a modern 64-bit operating system where TR limit and Access Rights are fixed known values (0x67 and 0x11 respectively). But the guest OS can be virtually any operating system with any constraints.
Naturally, it is the hypervisor’s job to set these values on initial run and to update them when needed (e.g. when the kernel thread that is running a vCPU is migrated to another physical CPU core, the hypervisor must update the host state to match the new core).
To set these values, I “borrowed” some code from the Linux kernel tree (KVM selftests):
vmwrite(HOST_TR_BASE,
get_desc64_base((struct desc64 *)(get_gdt().address + get_tr())));
This piece of code does the following:
* Gets the address of GDT.
* Indexes into it using the value of TR register.
* Parses the TSS segment descriptor and extracts the memory address of TSS.
* Writes the address into the HOST_TR_BASE field of the VMCS using the special VMWRITE instruction.
So far, so good!
If for any reason this operation fails to extract and write the correct address, then upon the next context switch from user-space to kernel-space (or the next NMI, or the next double fault), when the CPU hardware tries to read the kernel stack from the TSS to update the stack pointer register, it receives either garbage or an unmapped address. Either way, the CPU will eventually face a double fault (a fault that happens while trying to handle another fault, like a page fault), and when it then fails to use one of the known-good stacks for handling the double fault, that becomes a triple fault and BOOM! The core dies or we get a sudden reboot.
Now lets talk about the issue that I was facing.
I started developing my hypervisor on a virtualized instance of Fedora, to avoid crashing my machine in case something went wrong. By the time I realized something was indeed wrong, I had already developed the ability to put the CPU in VMX operation, run a hardcoded loop in VMX non-root mode that would use the VMCALL instruction to trap into the hypervisor (VMX root) and ask it to print a message, then resume the loop (VMRESUME).
Additionally, the VMCS was programmed to trap external interrupts (e.g. timer ticks). Upon an exit, the hypervisor would check whether we (the current kernel thread) needed to be rescheduled, keeping the kernel scheduler happy.
I was using the preempt notifier API, which lets threads provide two custom functions (sched_in and sched_out) that are called by the scheduler when it’s about to deschedule the thread, as well as right before rescheduling it. These functions are then responsible for the cleanup and initialization work that is required.
In my case, sched_out would unload the VMCS from the current core, and sched_in would load it on the new core while reinitializing it using a series of VMWRITEs to match the new core’s state.
On my virtualized dev environment with only three vCPUs, everything was working just fine. Until I decided to give it a try on my main machine where the hypervisor would talk to an actual physical CPU.
Seconds after running the loop, the system crashed, in a very unpredictable way. I was logging the core switches and didn’t find any meaningful correlation between the last core number and the crash. Additionally, sometimes it would last longer and sometimes it was immediate. After investigating kernel logs a few times, I saw a pattern in the sequence of events that caused the system to eventually hang:
* The Fatal VM-Exit: An NMI triggered a VM-Exit on CPU 5 and naturally the hardware tried to locate a valid kernel stack from TSS to handle the privilege transition.
* Core Death: CPU 5 hit a fatal Page Fault attempting to read an unmapped memory address, resulting in a Kernel Oops. CPU 5 was left completely paralyzed with interrupts disabled.
* IPI Lockup: CPU 6 attempted a routine system-wide update (kernel text patching) requiring an Inter-Processor Interrupt (IPI) acknowledgment from all cores. CPU 6 became permanently stuck in an infinite loop waiting for the dead CPU 5 to respond.
* Cascading Paralysis: As other cores (3, 8, 11, etc.) attempted standard cross-core communications (like memory map TLB flushes and RCU synchronizations), they too fell into the IPI trap, waiting indefinitely for CPU 5.
* Terminal State: The RCU subsystem starved, peripheral drivers (like Wi-Fi) crashed from timeouts, and the system entered a total, unrecoverable deadlock.
So why no triple faults?!
The Kernel Oops killed the active task and halted operations on CPU 5. However, it left CPU 5 in a “zombie” state. Alive enough to keep the motherboard powered on, but with its interrupts disabled, making it entirely unresponsive to the rest of the system.
Soon I realized that the hypervisor works absolutely fine when pinned to one core (e.g. via taskset command), so there must be something happening while moving between cores. Additionally, I didn’t dare to question the code I stole from the Linux kernel source, and I was trying hard to find an issue in the code I wrote myself. This eventually led to rewriting a portion of the hypervisor code with an alternative method which would achieve the same goal.
For example, from reading Intel’s Software Developer Manual (SDM), I knew that when moving from core A to core B, core A must run the VMCLEAR instruction to unload the VMCS, and only then can core B load the VMCS using VMPTRLD to be able to execute the guest code. For that, I was using smp_call_function_single, which relies on IPIs to run a piece of code on another CPU, and which I replaced with the preempt notifiers.
Eventually, (while pulling my hair out) I realized I have eliminated all possible parts of the hypervisor that played a role in moving between cores.
Then there was another clue!
While running the hypervisor on my virtual dev environment (QEMU + Fedora), I observed that by increasing the number of vCores, I could reproduce the issue, and there was also a new behavior: sometimes the VM rebooted immediately (instead of freezing), and after the reboot, there was no trace of any logs related to the previous session. I concluded that a triple fault had happened.
This turned my attention to the TR and TSS. I started looking for alternative ways of setting the HOST_TR_BASE and realized that the KVM itself (not KVM selftests) uses a different method:
/*
 * Linux uses per-cpu TSS and GDT, so set these when switching
 * processors. See 22.2.4.
 */
vmcs_writel(HOST_TR_BASE, (unsigned long)&get_cpu_entry_area(cpu)->tss.x86_tss);
And that was it! Using this method to set HOST_TR_BASE fixed my hypervisor and helped me keep whatever sanity I had left.
Remember that piece of code I took from the kernel source? It used the get_desc64_base function to extract and write the address of the TSS into HOST_TR_BASE. This function has this definition:
static inline uint64_t get_desc64_base(const struct desc64 *desc)
{
	return ((uint64_t)desc->base3 << 32) |
		(desc->base0 | ((desc->base1) << 16) | ((desc->base2) << 24));
}
The TSS segment descriptor has four base fields that must be stitched together to form the address of the TSS.
The C standard dictates Integer Promotion. Whenever a type smaller than an int is used in an expression, the compiler automatically promotes it to a standard int (which is a 32-bit signed integer on modern x86-64 architectures) before performing the operation.
If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions. All other types are unchanged by the integer promotions.
This promotion has a consequence: if the resulting value after promotion has a 1 in its most significant bit (the 32nd bit), the compiler considers the value negative, and if it is then cast to a larger type, like uint64_t in our case, sign extension happens.
Let's see an example:
We have an 8-bit unsigned integer (uint8_t) with the bit pattern 11001100. If we left-shift it by 24, the result still fits in 32 bits, so the compiler produces 11001100000000000000000000000000 and treats it as an int, which is a signed type. (Strictly speaking, shifting a 1 into the sign bit of a signed int is undefined behavior, but in practice compilers produce exactly this bit pattern.)
Now if we try to perform any operation on this value, it follows the rules for signed values. In our case, we are ORing it with a uint64_t, so the compiler converts our int (a 32-bit signed value) into a uint64_t (a 64-bit unsigned value), and that conversion is where the sign extension happens, turning our value into 11111111111111111111111111111111_11001100000000000000000000000000 before the OR takes place.
See the problem?
Because the upper 32 bits are sign-extended to all 1s (hex: 0xFFFFFFFF), the bitwise OR completely destroys base3 (in a bitwise OR, 1 | X equals 1). Whatever data was in base3 is permanently overwritten by the 1s from the sign extension.
Here is an actual example with “real” addresses:
base0 = 0x5000
base1 = 0xd6
base2 = 0xf8
base3 = 0xfffffe7c
Expected return: 0xfffffe7cf8d65000
Actual return: 0xfffffffff8d65000
This also explains when the problem happens: if and only if base2 has a 1 as its most significant bit. Any other value does not corrupt the resulting address.
The fix is actually very simple. We must cast values to unsigned types before the bit-shift operation:
static inline uint64_t get_desc64_base(const struct desc64 *desc)
{
	return (uint64_t)desc->base3 << 32 |
	       (uint64_t)desc->base2 << 24 |
	       (uint64_t)desc->base1 << 16 |
	       (uint64_t)desc->base0;
}
This will prevent the sign-extension from happening.
Finally, this is the patch I sent, which was approved and merged:
https://lore.kernel.org/kvm/20251222174207.107331-1-mj@pooladkhay.com/
I can’t finish this post without talking about AI!
You may wonder whether I tried asking an LLM for help. I did. In fact, it was very helpful for some tasks, like summarizing kernel logs [^13] and extracting the gist of them. But when it came to debugging based on all the available clues, it concluded that my code didn't have any bugs and that the CPU hardware was faulty.
...
Read the original on pooladkhay.com »