10 interesting stories served every morning and every evening.
Pull request overview
This PR changes the Git extension’s git.addAICoAuthor setting so that AI co-author trailers are enabled by default, making the default behavior automatically add a Co-authored-by trailer when AI-generated code contributions are detected.
Changes:
Updates git.addAICoAuthor configuration default from “off” to “all”.
The NetHack DevTeam is announcing the release of NetHack 5.0.0 on
May 2, 2026
NetHack 5.0 is an enhancement to the dungeon exploration game NetHack,
which is a distant descendant of Rogue and Hack, and a direct descendant
of NetHack 3.6.
As with any .0 release, you may encounter some bugs in NetHack 5.0.0.
Constructive suggestions, GitHub pull requests, and bug reports are all
welcome and encouraged.
Along with the game improvements and bug fixes, NetHack 5.0 strives to make
some general architectural improvements to the game or to its building
process. Among them, 5.0:
Has its source code compliant with the C99 standard.
Removes barriers to building NetHack on one platform and operating system,
for later execution on another (possibly quite different) platform and/or
operating system. That capability is generally known as “cross-compiling.”
See the file “Cross-compiling” in the top-level folder for more information
on that.
The build-time “yacc and lex”-based level compiler, the
“yacc and lex”-based dungeon compiler, and the quest text file processing
previously done by NetHack’s “makedefs” utility, have been replaced with
Lua text alternatives that are loaded and processed by the game during play.
A list of over 3100 fixes and changes can be found in the game’s sources
in the file doc/fixes5-0-0.txt. The text in there was written for the
development team’s own use and is provided “as is”. Some entries might be
considered “spoilers”, particularly in the “new features” section.
Existing saved games and bones files will not work with NetHack 5.0.0.
Checksums (sha256) of binaries that you have downloaded from nethack.org
can be verified on Windows platforms using:
certUtil -hashfile nethack-500-win-x64.zip SHA256
or
certUtil -hashfile nethack-500-win-arm64.zip SHA256
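certUtil is Windows-only; on other platforms the same check can be scripted. Here is a minimal Python sketch (the function name is mine, and the filename is whichever archive you downloaded):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Hash a downloaded archive in 64 KiB chunks, so even a large zip
    never has to fit in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()

# print(sha256_of("nethack-500-win-x64.zip"))  # compare against nethack.org
```

Compare the printed digest against the value published alongside the download.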
The following command can be used on most platforms to help confirm the location of
various files that NetHack may use:
nethack --showpaths
As with all releases of the game, we appreciate your feedback. Please submit any
bugs using the problem report form. Also, please check the “known bugs” list
before you log a problem - somebody else may have already found it.
Happy NetHacking!
Hello friends! In April we merged 333 PRs from 35 contributors, 7 of whom made their first-ever commit to Ladybird! Here’s what we’ve been up to.
Ladybird is entirely funded by the generous support of companies and individuals who believe in the open web. This month, we’re excited to welcome the following new sponsors:
Human Rights Foundation (via the “AI for Individual Rights” program) with $50,000
Jakub Stęplowski with $1,000
We’re incredibly grateful for their support. If you’re interested in sponsoring the project, please contact us.
Inline PDF viewer
PDFs now render inline through the bundled pdf.js viewer (#9132). pdf.js is a full-featured PDF viewer written entirely in JavaScript, HTML, and CSS, with page navigation, text selection, zoom, and find-in-document. Profiling pdf.js loading the Intel ISA Manual also drove improvements to our typed-array view cache and :has() invalidation.
Browsing history and rich address bar autocomplete
Type in the address bar and you now get rich, history-aware suggestions: previously visited pages with favicons and titles, a search-engine shortcut, and plain URL completions (#8933). Behind the scenes, a SQLite-backed HistoryStore persists every navigation along with its title, favicon, visit count, and last-visit time, and “Clear browsing history” is wired up in the Privacy settings page. Both the Qt and AppKit UIs render the new rich rows.
Speculative and incremental HTML parsing
The HTML parser now consumes the response body incrementally (#9151). Bytes flow through a streaming text decoder into the tokenizer one chunk at a time, the tokenizer pauses when it runs out of input, and resumes when more arrives. This replaces a model where we waited for the full body before starting to parse.
We also implemented the speculative HTML parser (#9114). When the main parser blocks on a synchronous external script, a separate tokenizer scans ahead through the unparsed input and issues speculative fetches for the resources it finds: <script src>, <link rel=stylesheet|preload>, and <img src>. It tracks <base href> and skips into templates and foreign content correctly. A follow-up wired the speculative parser into the document’s preload map (#9164), so resources discovered speculatively get deduplicated against the regular parser’s later fetches instead of being requested twice.
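As a rough, illustrative sketch of the scan-ahead idea (not Ladybird's actual code, which is C++ and correctly handles <base href>, rel types, templates, and foreign content), a speculative scanner boils down to finding fetchable URLs in the unparsed input and recording each one in a shared preload map, so that later "real" fetches deduplicate against it:

```python
import re

# Crude, illustrative preload scanner. A real one tokenizes the HTML;
# a regex is only good enough to show the shape of the technique.
PRELOAD_RE = re.compile(
    r'<(?:script|img)[^>]*\bsrc=["\']([^"\']+)["\']'
    r'|<link[^>]*\bhref=["\']([^"\']+)["\']',
    re.IGNORECASE,
)

def scan_for_preloads(unparsed_html: str, preload_map: set) -> list:
    """Scan ahead through unparsed input, recording each fetchable URL once.

    preload_map is shared with the real parser, so resources discovered
    speculatively are not requested a second time later."""
    discovered = []
    for m in PRELOAD_RE.finditer(unparsed_html):
        url = m.group(1) or m.group(2)
        if url not in preload_map:
            preload_map.add(url)
            discovered.append(url)
    return discovered
```

Scanning the same input twice returns nothing the second time, which is exactly the deduplication property a shared preload map provides.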
Off-thread JavaScript compilation
Bytecode generation for fetched scripts’ top-level code now runs on a background thread pool (#9118). Worker threads produce the bytecode and the data needed to build an Executable, while everything that touches the VM or GC heap stays on the main thread. This covers classic scripts, modules, and top-level IIFEs, and shifts roughly 200ms of main thread time onto background threads while loading YouTube alone.
Per-Navigable rasterization
Each Navigable now rasterizes independently on its own thread (#8793). Previously, iframes were painted synchronously as nested display lists inside their parent’s display list, which meant only the top-level traversable’s rendering thread was ever active. The parent’s display list now references each iframe’s rasterized output through an ExternalContentSource, so iframe invalidations no longer require re-recording the parent. Beyond the parallelism, this is prep work for moving iframes into separate sandboxed processes.
JavaScript engine
With the C++/Rust transition behind us, we spent April cashing in.
Faster JS-to-JS calls. A multi-part series (#8891, #8909, #8912) made Call, Return, and End instructions stay entirely in the AsmInt assembly interpreter for the common case, with hand-tuned ARM64 paired load/store (ldp/stp) for register save/restore. Native function calls also dispatch directly from AsmInt now, via a new RawNativeFunction variant that holds a plain function pointer instead of an AK::Function (#8922).
O(1) bytecode register allocator. Generator::allocate_register used to scan the free pool to find the lowest-numbered register. We were spending ~800ms in this function alone while loading x.com. With the C++/Rust pipeline parity period over, the allocator is now a plain LIFO stack (#9007).
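The data structure behind that fix is simple enough to sketch. This Python toy (illustrative only; LibJS is C++ and the real allocator carries more state) trades the lowest-numbered-register scan for a plain stack of freed registers, making both operations O(1):

```python
class RegisterAllocator:
    """Toy bytecode register allocator (illustrative, not LibJS's code).

    Freed registers are pushed onto a LIFO stack and handed back on the
    next allocation, so there is no O(n) scan for the lowest-numbered
    free register."""

    def __init__(self):
        self._next_register = 0   # next never-used register index
        self._free = []           # LIFO stack of freed register indices

    def allocate(self) -> int:
        if self._free:
            return self._free.pop()       # O(1) reuse of a freed register
        reg = self._next_register
        self._next_register += 1
        return reg

    def free(self, reg: int) -> None:
        self._free.append(reg)            # O(1) release
```

The price is that registers are no longer handed out lowest-first, which, per the report above, only mattered while C++/Rust bytecode parity was being checked.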
Cached for-in iteration. for (key in obj) sites now cache the flattened enumerable key snapshot and reuse it as long as the receiver’s shape, indexed storage, and prototype chain still match (#8856). Speedometer 2 went from 67.7 to 73.6, and Speedometer 3 from 4.11 to 4.22!
A grab-bag of other improvements:
The parser uses zero-copy identifier name sharing across the lexer, parser, and scope collector. On a corpus of website JS, parsing is 1.14x faster and uses 282 MB less RSS. (#8801)
Short string concatenations skip the rope representation when the result is going to be observed as a flat string anyway. 2.13x speedup on a tight a + b loop. (#9184)
Lexical-this arrow functions no longer allocate a function environment per call. Another 2.13x on a microbenchmark. (#9192)
Sparse arrays no longer pay an eager cost for their holes: Array(20_000_000) stays mostly metadata instead of doing work proportional to twenty million imaginary elements. (#8847)
A new lazy JS::Substring type backs regexp captures and string builtins like slice, split, and indexed access, gaining 1.066x on Octane’s regexp benchmark. (#8863)
Source positions are preserved end-to-end in bytecode source maps, saving ~250ms on x.com. (#9027)
Zero-copy TransferArrayBuffer saves ~130ms on YouTube load. (#9088)
Cached typed-array views switched from a WeakHashSet to an intrusive list, saving ~250ms loading the Intel ISA PDF in pdf.js. (#9180)
Every Promise allocated two PromiseResolvingFunction cells with AK::Function closures that didn’t actually capture anything. They’re now static functions dispatched by a Kind enum, dropping a per-resolver allocation across every promise the engine creates. (#9188)
Skipping property-table marking for non-dictionary shapes cut 1.3 seconds off GC time while loading maptiler.com. (#9044)
A fast path for Array.prototype.indexOf on packed arrays (#9123)
Array.prototype.sort reuses cached UTF-16 instead of re-transcoding on every comparison (#9036)
Imports for WASM, JSON, and CSS modules (#6029)
Removed ShadowRealm support, since the proposal has stalled in the standards process (#8753)
GTK4 / libadwaita frontend
Ladybird has a new Linux frontend built on GTK4 and libadwaita, sitting alongside the existing Qt frontend (#8691). It’s inspired by GNOME Web (Epiphany) and follows GNOME’s design guidelines: no menubar, a hamburger menu, and AdwTabView for tabs. Out of the box you get autocomplete and security icons in the URL bar, find-in-page, fullscreen, context menus, alert/confirm/prompt/color/file dialogs, clipboard, multi-window, light/dark theme, and DPR scaling. It’s still early, so not yet at feature parity with the Qt and AppKit frontends.
Bookmarks
Last month we got bookmarks. This month they got a proper management UI:
An about:bookmarks page for managing bookmarks and folders (#8825)
Bookmark import and export from the new page (#8938)
Context menus for editing bookmarks and folders (#8715)
A date_added timestamp on every bookmark and folder (#8867)
Bookmarks bar QoL: open in new tab, copy URL, middle-click and Ctrl/Cmd+click to open in new tab (#8758)
The HTML5 drag-and-drop API is now wired up (#8783). about:bookmarks uses it for reordering, and it works on regular web pages too.
Cache and CacheStorage
We implemented Cache and CacheStorage end to end, with all nine methods (open, has, delete, keys, match, matchAll, add, addAll, put) backed by an ephemeral in-memory store (#8745).
CSS features
image-set() : Basic support for the standard and -webkit- prefixed forms. At paint time we pick the candidate whose resolution best matches the device pixel ratio, skipping unsupported MIME types. This makes header images show up on gocomics.com. (#9090)
position-anchor and CSS anchor positioning : Initial support for anchor-positioned elements, fixing the hand and gun positioning on cssdoom.wtf. (#8686)
Color interpolation rewrite : Aligned with css-color-4. We now interpolate in float instead of u8, handle missing and powerless components correctly, deal with out-of-gamut sRGB, and apply alpha multipliers consistently. (#8934)
Presentational hints through the cascade : Legacy presentational HTML attributes (align, bgcolor, etc.) used to bypass the regular CSS cascade and write directly into the element’s cascaded properties. They now go through the cascade as normal author declarations, so var() substitution and the invalid-at-computed-value-time fallback work correctly. Fixes a crash on html.spec.whatwg.org. (#9176)
align on table sections and rows : <thead>, <tbody>, <tfoot>, and <tr> honor the align presentational attribute, fixing button placement on bricklink.com. (#9177)
stroke-dasharray interpolation : SVG dashes finally animate smoothly. (#9133)
autofocus : Elements with the autofocus attribute actually receive focus on page load now. (#9016)
List markers in RTL text : Bullets now sit on the right side of right-to-left text, fixing list rendering on Arabic Wikipedia. (#9099)
Inline flex/grid baselines : An inline flex or grid container now derives its baseline from its child’s first line box, not its last wrapped line. Fixes link text and icon alignment on nos.nl. (#9183)
Networking
getaddrinfo no longer blocks the event loop. LibDNS now runs lookups on a thread pool, fires A and AAAA queries in parallel (RFC 8305-ish), and coalesces concurrent lookups for the same name. RequestServer’s preconnect path was sneaking past our resolver and letting libcurl spawn its own threaded resolver that would pthread_join us on the main thread; that’s now routed through the same DNS pool. (#9109)
A before-and-after profile of loading x.com when DNS is slow makes the difference visible.
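Of those fixes, lookup coalescing is the easiest to show in miniature. This Python sketch (names and API are hypothetical, not LibDNS's) hands every concurrent request for the same hostname the same in-flight future:

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor

class CoalescingResolver:
    """Illustrative lookup coalescer (hypothetical API, not LibDNS's).

    Concurrent lookups for the same hostname share one in-flight Future
    instead of each blocking in its own getaddrinfo call."""

    def __init__(self, resolve_fn, workers: int = 4):
        self._resolve_fn = resolve_fn          # e.g. a getaddrinfo wrapper
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._inflight = {}                    # hostname -> Future
        self._lock = threading.RLock()         # RLock: done-callback can fire inline

    def lookup(self, name: str) -> Future:
        with self._lock:
            fut = self._inflight.get(name)
            if fut is None:
                fut = self._pool.submit(self._resolve_fn, name)
                self._inflight[name] = fut
                # Drop the entry once resolved, so later lookups re-query.
                fut.add_done_callback(lambda _f, n=name: self._forget(n))
            return fut

    def _forget(self, name: str) -> None:
        with self._lock:
            self._inflight.pop(name, None)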
Over in RequestServer, draining queued response data was O(n²) when WebContent was slower than the network. RequestServer was spending ~30 seconds in memcpy and 3 seconds in Vector::remove while opening a YouTube video! Switching AllocatingMemoryStream to a singly-linked chunk list made consumption O(1). (#9028)
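The underlying fix is a standard streaming structure. A Python sketch of the idea (class and method names are mine, not AllocatingMemoryStream's API): buffer incoming data as a queue of chunks with a read offset into the head, so consuming n bytes pops whole chunks instead of shifting every remaining byte left:

```python
from collections import deque

class ChunkStream:
    """Illustrative chunked read buffer (names are mine, not Ladybird's).

    Incoming data is kept as a queue of chunks plus a read offset into the
    head chunk, so consuming bytes pops whole chunks in O(1) amortized
    time rather than memmove-ing the remaining bytes after every read."""

    def __init__(self):
        self._chunks = deque()
        self._offset = 0                # read position inside the head chunk

    def write(self, data: bytes) -> None:
        if data:
            self._chunks.append(data)

    def read(self, n: int) -> bytes:
        out = bytearray()
        while n > 0 and self._chunks:
            head = self._chunks[0]
            take = head[self._offset:self._offset + n]
            out += take
            n -= len(take)
            self._offset += len(take)
            if self._offset == len(head):   # head fully consumed: drop it
                self._chunks.popleft()
                self._offset = 0
        return bytes(out)
```

Writes append a chunk; reads slice the head and drop it once exhausted, so a slow consumer never pays quadratic copying.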
We now advertise AVIF and WebP in our Accept header for image requests, matching other engines. Some CDNs use the Accept header to decide whether to serve modern formats or fall back to JPEG. (#9046)
Style invalidation
Selector invalidation used to be straightforward: selectors always looked downward. :host ruined that. :has() made it way worse. Any descendant change can now force you to walk up the tree finding ancestors whose :has() arguments just flipped, and a lot of this month’s invalidation work is about making that walk less wasteful.
Four big wins this month:
Reddit rule cache rebuilds: 13.2s → 3.2s. Stylesheet mutations no longer rebuild every style scope’s cache when only one scope changed. (#9138)
Reddit infinite scroll: 11% fewer pointless recomputes. Sibling structural invalidation stopped fanning out to descendants that don’t observe the position. (#9155)
:has() mutation invalidation skips unaffected anchors, with substantial reductions measured on azure.com. (#9168)
:has() child-list visits on the Intel ISA PDF: 71k → 1.6k. Coalesced when pending data already covers every concrete feature bucket the scope cares about, saving ~650ms on the pdf.js load. (#9179)
A large new structural-invalidation test battery exposed and fixed several invalidation holes (#9095), and a string of smaller tightenings landed around hover, stylesheet mutation scope, custom-property maps, and computed-style diffing (#9077, #9049, #9079, #9080, #9141).
Linux GPU painting via dmabuf
On Linux Vulkan builds, GPU-backed painting was being secretly undone every frame: WebContent painted into a GPU-backed Skia surface, but the buffer it shared with the UI process was a CPU bitmap, which forced a full GPU-to-CPU readback on every flush. SharedImage can now carry a Linux dmabuf handle, so the front and back buffers stay GPU-resident the whole way to the UI process. (#8917, #8920)
mimalloc as the main allocator
Our C++ and Rust code now share a single allocator instance, mimalloc v2, instead of each going through the system allocator separately (#8752). We don’t override malloc() system-wide, so third-party libraries keep their own allocator contracts. JS benchmarks improved across the board.
Sites that work better
The biggest visible wins this month are on Reddit and YouTube.
Reddit image gallery carousels actually work now, after fixing two unrelated layout bugs around ::slotted() matching and absolutely positioned descendants of split inlines (#9148). And thanks to TextDecoderStream, the SPA stops swallowing link clicks, so you can finally open the comments! Infinite scroll also benefits from the structural-invalidation work covered above.
YouTube benefits from a stack of unrelated improvements: off-thread top-level JS compile, off-thread WOFF2 decompression (saves ~170ms on Gmail too, #8976), reduced @font-face fetch fanout (177 → ~9 fetches on initial load, #9032), the RequestServer memory churn fix, and zero-copy TransferArrayBuffer.
A handful of smaller fixes:
gocomics.com : Header images show up, thanks to image-set().
yandex.com/maps : Vector-tile WebGL rendering works after a small pile of WebGL fix-ups, including the WEBGL_debug_renderer_info extension (#9043).
strava.com : Login works now that Navigator.getBattery throws the spec-mandated error type instead of one of our own (#8770).
GitHub Insights : Loads ~100ms faster thanks to the Element.matches() and .closest() selector cache (#8987).
tweakers.net : The laptop comparison page is ~31% faster from indexed HTMLFormElement property name lookups (#9009).
neon.com : No longer crashes (#8812).
channel4.com : Vertically misaligned category text fixed in flex auto-margin resolution (#9050).
Cloudflare Turnstile : Still doesn’t pass, but we fail it much faster now thanks to auth-scheme handling, Array.prototype.shift() optimizations, and a pile of UA event handler hardening on <input> range and number elements (#9063).
Web Platform Tests (WPT)
Our WPT score went from 2,003,537 to 2,067,263 this month, a headline gain of 63,726 subtests. There’s an asterisk on that number: WPT imported test262, the official ECMAScript conformance suite, upstream this month, which added 53,207 JavaScript subtests to the count. We pass 52,045 of them (a 97.8% pass rate), since we’ve been running test262 independently for years and LibJS conformance is in great shape. So roughly 52k of the 63.7k gain is from the import, and the remaining ~11.7k is genuine new browser-platform progress, in the same ballpark as January’s 13,690.
Many CLI tools, SDKs, and frameworks collect telemetry data by default, and each one has its own way to opt out.
You get the idea. There are too many, and they are all different.
What we want is a single, standard environment variable that clearly and unambiguously expresses a user's wish to opt out of requests that are non-essential to the software's functionality, whether those requests go to the software's creator or to a third party.
We just want local software.
Add export DO_NOT_TRACK=1 to your shell configuration file so it applies to all your terminal sessions.
If you develop tools that collect telemetry, analytics, or make non-essential network requests, please check for this variable:
If DO_NOT_TRACK is set to 1, disable all tracking
Consider making telemetry opt-in rather than opt-out
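A conforming check is tiny. Here is a hedged Python sketch (the function and the send_telemetry hook are mine, not part of any spec):

```python
import os

def tracking_allowed() -> bool:
    """Return False when the user has opted out via DO_NOT_TRACK.

    The convention is DO_NOT_TRACK=1; also accepting "true"/"yes" is a
    courtesy on this sketch's part, not something the convention requires."""
    value = os.environ.get("DO_NOT_TRACK", "").strip().lower()
    return value not in ("1", "true", "yes")

# Sketch of use at startup (send_telemetry is a hypothetical hook):
# if tracking_allowed():
#     send_telemetry()
```

Checking once at startup keeps the decision in one place, and defaulting to opt-in telemetry makes the variable a backstop rather than the only line of defense.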
I love going on wilderness adventures. I am rarely happier than when I am far off into the mountains without a soul in sight. As a result, I have spent a lot of time learning how to safely explore and navigate when I’m away from civilization. The most important habit I’ve found for not getting lost is to be very regular in checking your location as you go, and the best way I’ve found to do that is to have a map on my wrist.
For more than six years I’ve been working towards creating the best possible mapping experience on the Apple Watch. With yesterday’s launch of Pedometer++ 8, I feel like this design journey has reached a meaningful destination. I would contend that Pedometer++’s watchOS mapping support is the absolute best available on the App Store.
So I wanted to walk through the journey it took to get here.
Early Efforts
I have wanted a good map on my wrist since the Apple Watch launched. This wasn’t realistically possible until watchOS 6, which brought SwiftUI to the platform and, for the first time, made “real” apps possible. But in those early days, the screens were tiny, and the processors slow. I couldn’t quite get to where I wanted.
This was my very first attempt that shipped in Pedometer++. These maps were generated entirely on the server, which meant a full round trip of the relevant workout data every time I wanted to refresh the display. This system let me validate the idea, but it was never going to be practical for navigation or regular use, and could never work offline.
Custom Mapping Engine
I knew that if I wanted to make progress towards this goal, I’d need to work at a lower level, so I got to work building a fully SwiftUI-native map rendering engine. SwiftUI was the only choice because it’s all that watchOS supported, and proved to be helpful for putting maps into widgets, which also only support SwiftUI.
In 2021, I got this engine to a place where I could reliably and performantly render a map on watchOS. With it, I can render any tile-based maps and overlay location information on top.
Map Designs
Next came the question of how best to surface data to users. App design on watchOS is a really fun — but frustrating — challenge. You are designing for a relatively tiny screen, which must be operated one-handed. In this case, I want the user to be able to read the map and use it to navigate, while also having access to other workout-related information.
This began a long series of design attempts, most of which (if I’m being honest) were kinda awful.
In the end, I settled on a “modal” approach where the user can switch between a map screen and a metrics screen using a button on the top-left corner.
This interface provides one context where the user can freely pan/zoom around the map and another where I can use the more standard watchOS tabbed page interface for metrics and controls. I shipped this to Pedometer++, but there was always something that didn’t quite sit right with me about it.
This design felt like a compromise, and not in a good way. To keep the map interactive, I felt I couldn't make it part of any UI structure that involved swipes. But as Apple Watch screens got larger, that compromise felt less necessary for giving the map enough space to be useful.
So I set about trying alternative designs. SO many designs.
For a while, I thought that I needed to find a way to put the metrics at the bottom of the screen. However, that would lead to other problems on longer outings or for workouts that aren’t navigation-focused. So I kept iterating and came up with even more designs.
All of these designs suffered from the same fundamental issue: they required the app to display only a fixed set of fields at a time.
I could make the interface configurable, but one of the fundamental rules of watchOS design is that you should avoid any interaction that takes more than a few seconds on the watch. Any user-configurable setup is inherently fiddly, so I didn’t like this approach.
Dark Mode, Liquid Glass, & Cartography
Around the same time, while I was still wrestling with how best to structure the app, Apple announced watchOS 26 and the arrival of Liquid Glass. One of the core design aspects of Liquid Glass is stacking layered elements on top of each other; another is the particular sets of colors that work well together.
I was previously using Thunderforest Outdoors as my basemap for the app. I love the content this map includes, but when I started overlaying glassy elements over it I found that it wasn’t well-suited for Liquid Glass.
So… I commissioned a custom map. Working with the incredible cartographer Andy Allen, we created a completely new basemap that would look fantastic with Liquid Glass.1
We simplified the map visually, increased the contrast of the elements, and made the map elements more saturated to prevent them from becoming a muddy mess when shown below glass.
With this work done, I had another opportunity: I could finally have a dark mode variant of the map tiles. While helpful on iOS, this really shines on watchOS. Andy and I really worked toward something which would be incredibly legible at arm’s length.
The result of these efforts is that now I have a great map for watchOS… but a design that didn’t match that greatness.
Striving for Great
I kept trying. To get me out of my design rut, I enlisted the help of the fantastic designer Rafa Conde. I needed a fresh set of eyes on this and very quickly, this partnership paid off. They proposed a variety of alternative layouts, but when I saw this one I knew it was the one.
The layering of the metrics on the top-left corner, with the map being the top page of a vertical stack, was the correct answer. This design handles interactivity by requiring a tap on the map first to enter “browse mode”.
Tweaking and Polishing
Now that I had the overall concept locked in, the real fun began, actually building the app and dialing in all the details. I fairly quickly took Rafa’s concept and turned it into a working prototype. This let me validate the idea in the field… literally. After walking a few hundred miles with it, I was confident it was the correct approach.
Next, I needed to dial in the font and make more subtle design choices.
After a bit more iteration, I arrived at the design that shipped yesterday. It is legible, useful, and (in my humble opinion) beautiful.
It feels really good to be able to cap off this six-year journey with a design I couldn’t be more proud of. This screen represents so much accumulated effort and learning. It finally gives me a design which feels native on the platform, but also novel and unique.
Here is the evolution of this design over the last six years:
Postscript: Considering MapKit
While my work on watchOS mapping massively predates the arrival of Apple’s MapKit onto the platform, it is probably worth explaining why I decided to do all of this custom work to avoid using it.
Fundamentally, I find that MapKit is great for basic uses, but doesn’t provide nearly the level of configurability and utility which I want Pedometer++ to offer. For example:
MapKit on watchOS always displays in dark mode. That is generally a good default, but it closes the door on accessibility and user choice; I needed the appearance to be a user-selectable option.
While MapKit on watchOS has gotten better over time in terms of what you can do with it, I still find it a bit limiting in terms of animations and overlays.
MapKit’s coverage is improving with regards to topographic contours and trail markings, but there are far too many places where the MapKit map is essentially blank even though I know richer detail should be available. For example, here is my map vs. MapKit at the trailhead of one of my favorite hikes in Scotland.
I still find it so cool that my work on this allows me to say that I “commissioned a cartographer” to work on something for me. 😁 ↩
A Couple Million Lines of Haskell: Production Engineering at Mercury
The editors of the Haskell Blog are happy to announce a new series of articles called “Haskellers from the trenches”,
where we invite experienced engineers to talk about their subjects of expertise, best practices, and production tales.
Engineering rigour and artistic creativity are a fantastic combination, and this series aims to be the synthesis of these two aspects within the Haskell world.
I first heard about Haskell when I was sixteen, sitting in a high school computer science class where we were writing Java and learning, among other things, that NullPointerException was apparently a lifestyle choice if you decided to go into software development. While looking at the /r/programming subreddit after school, I stumbled across a reference to a language where null pointer exceptions simply could not happen, where the type system could prevent an entire category of bugs that I had been fighting with every week. Haskell. I was immediately, hopelessly enamored with the idea.
I have been writing Haskell for nearly two decades now, and I still think the value proposition I fell in love with at sixteen was basically right. What took me longer to learn is what that promise looks like after a codebase gets large, the company grows faster than its documentation, and the system is allowed to touch money. Haskell earns its keep there in numerous, sturdy ways. It lets you pack operational knowledge into APIs, put dangerous machinery behind tight boundaries, and make the safe path the easy one. At a growing company, those aren’t just matters of taste; they are how you keep a system understandable after the people who first understood it have moved on.
Fast forward to today: I work at Mercury, a fintech company that provides banking services.* We serve over 300,000 businesses. We processed $248 billion in transaction volume in 2025 on $650 million in annualized revenue, and are, at the time of writing, in the process of obtaining a national bank charter in the USA from the OCC. We have around 1,500 employees. Our engineering organization largely hires generalists, and most of them have never written a line of Haskell before joining.
My time working at Mercury has changed how I think about the language more than any sermon about purity ever did. Elegance is pleasant, but keeping your business alive is compulsory.
Our codebase is roughly 2 million lines of Haskell, once you strip out comments and such.
This is the part where you are supposed to recoil in horror.
A couple million lines of Haskell, maintained by people who learned the language on the job, at a company that moves huge amounts of money? The conventional wisdom says this should be a disaster, but surprisingly, it isn’t. The system we’ve built has worked well for years, through hypergrowth, through the SVB crisis that sent $2 billion in new deposits our way in five days,1 through regulatory examinations, through all the ordinary and extraordinary things that happen to a financial system at scale.
This article is about why it works. Not in the “Haskell is beautiful” sense, though it is. Not in the “the compiler will save us from ourselves” sense, though I frequently feel gratitude in that direction. I mean in the much less romantic and much more useful sense that we run this language in production, at scale, with a rapidly changing team, and have learned some hard lessons about what it takes to keep the whole enterprise afloat. The beauty of Haskell is charming enough, but there is a whole swath of operational and organizational reality beyond it, and if you ignore that reality for too long, your company will likely fire the whole Haskell team2 and start writing PHP or something instead.
How We Think About Reliability
Before diving into practical advice, a note on philosophy.
There is a traditional way of thinking about system reliability that focuses on preventing failures. You enumerate the things that can go wrong. You add checks. You write tests for each bad case. You hunt for bugs. This is, of course, necessary work, and we do it. But it is not sufficient, and if you orient entirely around it you develop a specific blind spot: you get very good at cataloguing the ways things break and very bad at understanding why they ordinarily work.3
We try to think about it differently. A system operates reliably because it can absorb variation: it degrades gracefully, its operators can understand and adjust it, and the architecture makes the right thing easy and the wrong thing difficult.4 Reliability is not just the absence of failure. It is the presence of adaptive capacity. It is a system’s ability to keep functioning while reality continues its longstanding and regrettable habit of refusing to hold still.
When you have hundreds of engineers working in a multi-million-line codebase, many of whom are six months into their Haskell careers, “adaptive capacity” stops being a nifty phrase from a resilience engineering paper and starts being a daily concern. Patrick McKenzie has observed that in a company growing at 2x per year, half of your coworkers will always have less than a year of experience. A year later, half of your coworkers will still have less than a year of experience. For very successful companies, this never stops being true.5 You become organizationally ancient very quickly, whether you like it or not, and the things you know become institutional dark matter: load-bearing, but invisible to most of the people around you.
So the questions we ask are operational. Can the new hire on your team read this module and understand what it does? If the database is slow, does this service degrade or does it fall over and take its neighbors with it? If someone misuses an interface, does the compiler tell them, or do we find out when the on-call gets paged? If you don’t have answers to those questions, you have a future incident quietly unfolding.
This is why I increasingly think of the type system as an operational aid more than a correctness proof. Its value is not merely that it rules out certain classes of errors, though it does. Its value is that it encodes institutional knowledge in a form that survives the departure of the person who wrote it. In a fast-growing company, people leave, people transfer teams, people go on vacation or parental leave, people join, and the churn means that things people knew walk out the door with them unless you have written them down somewhere. Ideally, you have written them down in a form that the compiler can read, because the compiler is much more disciplined than the average wiki page.
This extends beyond code. On our stability engineering team, we constantly investigate the prospective production behavior of features and products. We do not do this to slow down product development, but in partnership with the team shipping the feature, to make sure we are prepared to deal with the fallout when it breaks and, if possible, to make that fallout boring rather than exciting. We ask things like: what is the blast radius if this fails? Which operations must be idempotent, and how? What does the rollback look like? What happens to in-flight work? Which systems will absorb the failure, and which ones will amplify it? The point is to have the conversation early enough that it shapes the design rather than merely auditing the launch after all the important decisions have already become expensive to revisit.6
Our philosophy, stated plainly: we are not the quality police. We are the people who would like to help you avoid being woken up at 4 AM to deal with the fallout of a broken feature. Rather than a deeply ideological stance, we simply desire to help people.
So, in light of that, how do we make Haskell work in production?
Purity Is a Boundary, Not a Property
My hot take: the first and most consequential misunderstanding about Haskell is treating purity as something the language is, rather than something your interfaces enforce.
Under the hood, Haskell is not a magical machine that performs side effects despite being pure. Behind every “pure” function in bytestring, text, and vector lies a cheerful little hellscape of mutable allocation, buffer writes, unsafe coercions, and other behavior that would alarm you if you discovered it in a junior engineer’s side project. Behind the ST monad lies in-place mutation and side effects, observable within the computation. What makes it acceptable is that the side effects are encapsulated such that the boundary cannot be violated.
runST :: (forall s. ST s a) -> a
The rank-2 type of runST (that is, the type variable s is scoped within the parentheses and cannot escape) ensures that the mutable references created inside the computation cannot leak out, because they are tagged with that same s. Internally, all sorts of imperative nonsense may occur. Externally, the function is pure. The world outside the boundary gets none of the mutation, only the result.
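A minimal illustration of the boundary in action: the function below mutates an STRef in a loop internally, yet its type is an ordinary pure function, and runST guarantees callers can never observe the mutation.

```haskell
import Control.Monad.ST (runST)
import Data.STRef (modifySTRef', newSTRef, readSTRef)

-- Internally imperative: a mutable accumulator updated element by element.
-- Externally pure: callers see only the final Int, never the STRef.
sumST :: [Int] -> Int
sumST xs = runST $ do
  acc <- newSTRef 0
  mapM_ (\x -> modifySTRef' acc (+ x)) xs
  readSTRef acc
```

Attempting to return the STRef itself out of runST is a type error, which is precisely the point: the danger is permitted inside the scope and unrepresentable outside it.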
This is, I think, a wonderful design principle in the larger scheme of things when writ large: you can permit arbitrarily dangerous operations within a scope, provided the scope’s exit is typed narrowly enough that the danger cannot leak. That principle applies everywhere in production. Your database layer uses connection pooling, retry logic, and mutable state internally. Your cache uses concurrent mutable maps. Your HTTP client probably has circuit breakers, pooled connections, and a small municipal government’s worth of bookkeeping. None of this is a problem if the interface is tight enough to prevent misuse and the boundary holds.
In production, the goal is often not to avoid mutation entirely, because that is not a serious proposition for most real systems. The goal is to contain mutation, make the containment legible, and verify that it stays contained. Often the right question is not “is this pure?” but “where is the impurity, and how much of the codebase is allowed to know about it?”
For a new engineer who learned Haskell three months ago, “purity is a boundary you try to maintain” is much more useful than “Haskell is pure.” One tells them what to do when they sit down to design a module. The other mostly sits there looking profound.
This boundary-oriented view of purity sets up a more general pattern that recurs throughout production engineering in Haskell: dangerous things are tolerable when they are fenced in, carefully exposed, and hard to misuse. That is true of mutation. It is true of retries, transactions, state machines, distributed workflows, and type-level machinery. Much of what follows is really just this same idea, wearing different hats.
Make the Right Thing Easy
There is a pattern in large codebases where correctness depends on performing operations in a particular order, or including a particular step that has no visible connection to the main work.
“Remember to flush the audit log after every transaction.”
“Always check the feature flag before calling this endpoint.”
“Make sure to enqueue the notification inside the database transaction, not after it.”
These are the incantations of operational lore. They live in wiki pages, onboarding documents, half-forgotten design reviews, and the memories of senior engineers who are now three teams away and booked solid until Thursday. In a company that is hiring aggressively, the half-life of tribal knowledge is alarmingly short. When an engineer leaves, the incantations fade. When a deadline approaches, they are the first thing skipped. When a new engineer joins, they often have no way to know the incantation exists at all. Nothing says “robust system design” quite like a critical invariant living in a Slack thread from nine months ago7.
Haskell gives you tools to encode these incantations in types so they cannot be forgotten. This is, for my money, the single most valuable thing the language offers a production engineering organization.
Consider a simplified version of a real pattern: you need to ensure that certain side effects (sending a notification, publishing an event) happen transactionally with a database write. Not before, not after, and not in a separate transaction. Together, or not at all.
The naïve approach is to tell people to use the right function:
-- Please use this one, not the other one
writeWithEvents :: Transaction -> [Event] -> IO ()
-- Don't use this directly (but we can't stop you)
writeTransaction :: Transaction -> IO ()
publishEvents :: [Event] -> IO ()
This is rookie-level engineering. It works until it doesn’t, and “until it doesn’t” tends to arrive on a Friday afternoon when the person who wrote the wiki page is on vacation and everybody else is discovering, in real time, that the wiki page was load-bearing.
A better approach restructures the types so that the only way to commit work is through a path that includes event publication:
data Transact a -- opaque; cannot be run directly
record :: Transaction -> Transact ()
emit :: Event -> Transact ()
-- The *only* way to execute a Transact: commit and publish atomically
commit :: Transact a -> IO a
Now the incantation is the only door in the room. You cannot forget it because there is nothing else to do. The type system has not proven anything especially deep about your events. It has done something more practical: it has made the correct operational procedure the path of least resistance.
That distinction matters. In production, there are plenty of places where we do not need a theorem. We need a design that makes it difficult for an ordinary busy engineer to accidentally do the wrong thing while trying to do a dozen other perfectly reasonable things. The compiler is not merely checking logic here; it is preserving institutional memory and turning it into a hard-edged interface.
When a new engineer joins and asks “how do I write a transaction?”, the type system answers them. When a senior engineer leaves, the answer remains. The institutional knowledge survived not because someone documented it beautifully, though documentation is pleasant when available, but because someone encoded it in a form the compiler enforces. Again, the compiler is a better custodian of operational lore than the average wiki and less prone to people forgetting to update it as the reality of the system changes.
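One hypothetical shape such a Transact type could take, assuming an opaque writer-style accumulator; the type aliases and the stand-in commit (which prints rather than performing an atomic database write and publish) are invented for illustration. In a real module, only the abstract Transact type, record, emit, and commit would be exported.

```haskell
type Transaction = String -- stand-ins for real domain types
type Event       = String

-- Constructor hidden behind the module boundary in real code.
data Transact a = Transact [Transaction] [Event] a

instance Functor Transact where
  fmap f (Transact ts es a) = Transact ts es (f a)

instance Applicative Transact where
  pure = Transact [] []
  Transact ts es f <*> Transact ts' es' a =
    Transact (ts ++ ts') (es ++ es') (f a)

instance Monad Transact where
  Transact ts es a >>= k =
    let Transact ts' es' b = k a
     in Transact (ts ++ ts') (es ++ es') b

record :: Transaction -> Transact ()
record t = Transact [t] [] ()

emit :: Event -> Transact ()
emit e = Transact [] [e] ()

-- The only exit: writes and events leave together.
-- (Printing stands in for one atomic write-and-publish step.)
commit :: Transact a -> IO a
commit (Transact ts es a) = do
  mapM_ (putStrLn . ("write: " ++)) ts
  mapM_ (putStrLn . ("publish: " ++)) es
  pure a
```

Because callers can only construct a Transact through record and emit, and can only run one through commit, there is no code path where the write happens but the events do not.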
Durable Execution
The pattern above — structuring types so that the correct operational procedure is the only procedure — works well within a single transaction. Financial systems, unfortunately, have never felt obligated to remain inside a single transaction.
They are full of processes that span multiple steps, multiple services, and multiple failure modes. Send a payment, wait for a partner to acknowledge it, update the ledger, notify the customer, handle cancellation, handle timeout, handle the case where the partner said yes but your worker died before recording the answer, handle the case where the partner said nothing because the network briefly entered a higher plane of existence and declined to tell you about it. If any step fails, you need to know where you were, what has already happened, and what still needs to happen. You need state. You need retries. You need timeouts. You need idempotence. You need all of these things to keep working across process crashes and deployments. Very quickly, what began as "just some business logic" accretes a remarkable number of one-off reimplementations of common operational concerns.
Previously at Mercury, we coordinated these processes with database-backed state machines driven by cron jobs and background workers, with retry logic and timeout handling scattered across the codebase. It worked. It also required the sort of vigilance usually associated with defusing unexploded ordnance. It was fragile, difficult to reason about, and the source of a disproportionate share of our operational incidents.
Temporal is our durable execution framework, and adopting it was one of the better infrastructure decisions we have made. You write your workflow as ordinary sequential code, and the platform records every step in an event history. If a worker crashes mid-workflow, another worker replays the deterministic prefix to reconstruct the state, then continues from where it left off. Retries, timeouts, cancellation, and error handling are provided by the platform rather than each team reimplementing them poorly.
I think of Temporal as Frankenstein's monster, in the flattering sense: assembled from excellent parts, animated by improbable effort, and smarter than many of the people alarmed by it. It takes durable history, replay, and determinism (things some platforms get natively) and bolts them onto runtimes that were never born knowing how to do any of this. Most of us are not going to rewrite our companies in Erlang. Temporal is a prosthetic for the rest of us. It gives ordinary languages a shot at some of the same operational virtues by slightly mad but highly effective means.
This dovetails with the virtues often attributed to Haskell: a Temporal workflow is, in an important sense, a pure function over its event history. Temporal workflows have a determinism requirement — a replayed workflow must produce the same sequence of commands as the original — which is exactly the constraint Haskell imposes on pure code: same inputs, same outputs. Side effects are isolated into activities, which are the workflow's equivalent of IO. The workflow orchestrates; the activities execute. If you have spent any time thinking about pure core / impure shell, this is that model with the platform enforcing the separation rather than relying on sheer discipline.
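The replay idea can be sketched as a toy Haskell value. To be clear, this is a model of the concept, not the hs-temporal-sdk API; every name below is invented. A workflow is a deterministic program whose only effects are requests for activity results, and replaying means feeding recorded results back into it until the history runs out.

```haskell
-- Toy model of durable replay. A workflow either finishes with a value
-- or asks for an activity result and continues deterministically from it.
data Workflow a
  = Done a
  | RunActivity String (String -> Workflow a) -- activity name, continuation

-- Replay recorded history: deterministic prefix reconstructed for free.
-- If history runs out mid-workflow, we know exactly where to resume live.
replay :: [String] -> Workflow a -> Either (String, Workflow a) a
replay _ (Done a)                  = Right a
replay (h : hs) (RunActivity _ k)  = replay hs (k h)
replay [] w@(RunActivity name _)   = Left (name, w)

-- An invented example workflow: debit, then notify, then finish.
payment :: Workflow String
payment =
  RunActivity "debit" $ \ref ->
    RunActivity "notify" $ \_ ->
      Done ("completed " ++ ref)
```

Because the workflow is a pure function of its history, a crashed worker loses nothing: a replacement replays the recorded results, lands on the same Left, and picks up the next activity from there.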
We built and open-sourced hs-temporal-sdk, our Haskell SDK for Temporal, which wraps the official Core SDK (Rust, via FFI) and provides a Haskell-native API for defining workflows, activities, and workers.
I gave a talk about our adoption patterns at Temporal’s Replay conference, and the short version is: Temporal has let us replace fragile chains of cron jobs and database-backed state machines with durable workflows where the platform handles the coordination. The operational improvement has been substantial. It is difficult to overstate how pleasant it is to delete a hand-rolled distributed state machine and replace it with something whose failure semantics were not improvised during sprint planning.
This, again, is adaptive capacity in a different costume. A system that can survive worker crashes, process restarts, and long-lived coordination without losing its place is a system whose operators have more leverage and fewer mysteries to unwind during an incident.
Design for Your Domain, Not Your Transport
As your production system grows, a common mistake I’ve observed is letting the invoking system leak into the domain model.
We have code that throws HTTP status code exceptions which return those results directly to the user on the frontend. This made sense when the code was written, because it ran in an HTTP request handler. Then, as happens in any growing codebase, pieces of that code got extracted and reused. Now it also runs in cron jobs. It runs in queued background workers. It runs in Temporal workflows. And it still throws StatusCodeException 409 “Conflict” when something goes wrong, which is an absolutely unhinged thing for a cron job to do. A cron job does not have a caller waiting for a 409. Nobody is reading that status code. The error has propagated through the system simply because the original abstraction was coupled to its transport layer.
The fix is conceptually simple: model your domain errors as domain types. A payment that fails because of insufficient funds should be an InsufficientFunds, not a 402. A duplicate request should be a DuplicateRequest, not a 409. These are things your business logic can match on, retry against, log meaningfully, and handle differently depending on context.
Then you write thin translation layers at each boundary:
data PaymentError
= InsufficientFunds
| DuplicateRequest RequestId
| PartnerTimeout Partner
toHttpError :: PaymentError -> HttpResponse
toHttpError InsufficientFunds = err402 "Insufficient funds"
toHttpError (DuplicateRequest _) = err409 "Duplicate request"
toHttpError (PartnerTimeout _) = err502 "Partner unavailable"
toWorkerStrategy :: PaymentError -> WorkerAction
toWorkerStrategy InsufficientFunds = Fail "Insufficient funds"
toWorkerStrategy (DuplicateRequest _) = Skip
toWorkerStrategy (PartnerTimeout _) = RetryWithBackoff
Nothing novel about it. But it is remarkable how often it gets skipped in practice, because the first version of any code is written for one context, and by the time you realize it will be called from three others, the status code exceptions are load-bearing because someone decided to catch those exceptions as part of business logic and nobody wants to refactor them.
The earlier you make this separation, the less it costs. The later you make it, the stranger the resulting behavior becomes. Eventually you wind up with cron jobs hurling 409s at Sentry and background workers interpreting HTTP-specific exceptions as business semantics, which is how abstractions let you know they have escaped containment.
This is the same principle as purity, expressed in domain language rather than operational language. Your transport concerns belong at the edges. Your domain model should survive being invoked from a web handler, a CLI, a cron job, a background worker, or a workflow engine without having to drag an HTTP status code behind it like a tin can tied to a wedding car.
The Type Encoding Tradeoff
Here is the part where I tell you not to do too much of the thing I just told you to do.
Encoding invariants into types is powerful. It is also expensive. Not at runtime, but in cognitive overhead, in the rigidity it introduces, and in the difficulty of changing things later when the requirements shift. And the requirements will shift. If you work at a company where they do not, I would like to know your secret, and also your stock ticker.
Every invariant you push into the type system is a constraint on every future engineer who touches that code. If violating the constraint would cause data loss, financial errors, regulatory trouble, or a poor soul’s pager to go off, then the cost is justified. If the constraint is “we currently happen to do things this way,” or “I read this article about dependent types and I simply must apply that to my authorization logic,” you have likely just made your codebase harder to change for no operational benefit. The next person to encounter it will either spend a week refactoring the types or, more likely, find a way around them that is worse than what you were trying to prevent.
There is a spectrum, and production codebases must live on it honestly.
At one end: you encode everything. Your types are a faithful model of your domain. Illegal states are unrepresentable. Refactoring takes weeks because changing a business rule means threading a type change through fifty modules. New engineers stare at the type signatures and wonder what they have done to deserve this, then quietly begin discussing their career options with a therapist. You have built a cathedral. Cathedrals are beautiful. They are also expensive, cold, and not especially famous for how quickly one renovates the plumbing.
At the other end: you encode nothing. Your types are String and IO () and, in the worst case, Dynamic. The code is easy to change because there are no contracts to violate. The system works because the people who built it are still around and remember what the strings mean. When they leave, it stops working, and nobody knows why. You have built a tent. Tents are flexible, portable, and, under certain weather conditions, a very direct way to learn about the sky. This is, of course, one of the reasons many Haskell developers flee to Haskell in the first place — to avoid the suffering that comes from this approach.
The sweet spot is somewhere in the middle. A few heuristics I think are useful:
Encode invariants that protect against silent corruption. If a violation would produce wrong data without any immediate error (a transaction committed without its events, a payment processed without an audit log, a state transition that looks plausible but is semantically impossible), put it in the types. The feedback loop for silent failures is too long to rely on human diligence.
Use runtime checks for invariants that fail loudly. If a violation would produce an immediate, obvious error (a 500 response, a failed assertion, a type mismatch at a JSON boundary), a runtime check with a good error message may be enough. You will catch it before production, or very quickly after.
Resist the urge to model your entire domain in types. Your domain is messy. It has edge cases, grandfather clauses, rules that contradict each other, and special behavior for three specific customers that dates back to 2018 and that nobody fully understands. The type system wants crispness. Your business does not provide it. Nor will it ever.
Remember that types are for the team, not just for the compiler. The compiler is one tool among many. Tests, documentation, code review, examples, playbooks: these all combine to provide defense in depth. The goal is not to win an argument with the type checker. The goal is to build a system that a team of humans, including humans who learned Haskell this year, can operate, extend, and maintain.
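For the "fails loudly" category above, a runtime smart constructor is often all that is needed. A hypothetical sketch, with an invented Amount type:

```haskell
-- Hypothetical: validate at the boundary and fail loudly with a useful
-- message, rather than encoding non-negativity at the type level.
newtype Amount = Amount Int -- cents; constructor hidden in a real module
  deriving (Eq, Show)

mkAmount :: Int -> Either String Amount
mkAmount n
  | n < 0     = Left ("negative amount: " ++ show n)
  | otherwise = Right (Amount n)
```

The violation surfaces immediately at the point of construction with a readable error, which is a perfectly good trade when the failure cannot be silent.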
That said, intense type-level machinery is sometimes exactly what you need. We have internal libraries where the types are genuinely hairy: GADTs, type families, phantom types tracking state transitions. These tend to be mechanisms where getting it wrong means money goes to the wrong place or a regulatory invariant is violated. The complexity is absolutely essential here.
The key thing, if you want to do this sustainably, is that we encapsulate the complexity. The module that implements the type-level state machine typically has a small number of authors who understand it deeply and, ideally, a thorough test suite. The module that uses it has a surface API that looks like five normal functions with normal types. A product engineer on another team can call those functions without knowing or caring that underneath there is a small type-level theorem prover ensuring they cannot commit a transaction in the wrong state. The proof obligations are discharged inside the boundary, not leaked across it.
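The shape of that encapsulation can be sketched with phantom types. This is an invented miniature, not one of our internal libraries: states are tracked at the type level, the constructor stays hidden, and only the legal transitions are exported.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}

-- States of a transaction, promoted to the type level.
data State = Draft | Submitted

-- Constructor hidden behind the module boundary in real code,
-- so the only way to obtain a Txn in a given state is via the API below.
newtype Txn (s :: State) = Txn String

open :: String -> Txn 'Draft
open = Txn

submit :: Txn 'Draft -> Txn 'Submitted
submit (Txn payload) = Txn payload

-- Only a submitted transaction can be committed.
commitTxn :: Txn 'Submitted -> String
commitTxn (Txn payload) = "committed: " ++ payload

-- commitTxn (open "x") -- rejected: Txn 'Draft is not Txn 'Submitted
```

The caller's surface is three ordinary-looking functions; the proof that you cannot commit a draft is discharged entirely inside the module.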
This is the same containment principle as purity, applied one level up. The complexity itself is fine, because it buys you something valuable. What causes problems is complexity that leaks across module boundaries into code maintained by people who did not sign up for it. We catch this in code review more than anywhere else. Someone opens a PR that touches a module they do not own, and the diff is full of type annotations they had to cargo-cult from a neighboring file just to make the compiler stop screaming. That is usually a sign the abstraction has become a form of friendly fire.
By Rohana Rezel
I’m running the ongoing AI Coding Contest where I pit major language models against each other in real-time programming tasks with objective scoring. Day 12 was the Word Gem Puzzle. Ten models entered. The results were not what most people would have predicted.
Kimi K2.6, an open-weights model from Chinese startup Moonshot AI, won the challenge outright: 22 match points, 7–1–0. MiMo V2-Pro from Xiaomi came second. GPT-5.5 was third. Claude Opus 4.7 finished fifth. Every model from the Western frontier labs landed below the top two.
The challenge
The Word Gem Puzzle is a sliding-tile letter puzzle. The board is a rectangular grid (10×10, 15×15, 20×20, 25×25, or 30×30) filled with letter tiles and one blank space. Bots can slide any adjacent tile into the blank and at any point claim valid English words formed in straight horizontal or vertical lines. Diagonals don’t count. Backwards doesn’t count.
The scoring rewards longer words and punishes short ones. Words under seven letters cost points: a five-letter word loses you one point, a three-letter word costs three. Seven letters or more score their length minus six, so an eight-letter word is worth two points. The same word can only be claimed once; if another bot gets there first, you get nothing. Each pair of models played five rounds, one per grid size, with a ten-second wall-clock limit per round.
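The scoring rules above collapse to a single formula: a word is worth its length minus six, which goes negative below seven letters. A one-line sketch (assuming the penalty pattern extends linearly, as every stated data point suggests):

```haskell
-- Word Gem scoring as described: length minus six, so short words cost
-- points (three letters: -3, five: -1) and long words pay (eight: +2).
wordScore :: String -> Int
wordScore w = length w - 6
```

This is why every serious competitor filtered its dictionary to seven letters and up: anything shorter is guaranteed to lose points.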
The grids are seeded with real dictionary words in a crossword-style layout, then the remaining cells are filled with letters weighted by Scrabble tile frequencies, and finally the blank is scrambled, more aggressively on larger boards. On a 10×10, many seed words survive intact. On a 30×30, almost none do. That turns out to matter a lot.
The code produced by Nvidia’s Nemotron Super 3 contained a syntax error, so it never connected to the game server. Nine models actually competed.
Kimi K2.6 is open-weights, publicly available from Moonshot AI, a Chinese startup founded in 2023. MiMo V2-Pro is currently API-only; the tweet linked here is Xiaomi confirming that weights for their newer V2.5 Pro model are dropping soon (https://x.com/XiaomiMiMo/status/2047840164777726076). The models from Anthropic, OpenAI, Google, and xAI placed third through seventh. GLM 5.1, from Chinese lab Zhipu AI, placed fourth. DeepSeek finished eighth. This isn't a clean China-beats-West story; it's two specific models that won.
What I saw
The move logs tell the story. Kimi won by sliding aggressively. Its approach was greedy: score each possible move by what new positive-value words it unlocks, execute the best one, repeat. When no move unlocked a positive word, it fell back to the first legal direction alphabetically. This caused some inefficient edge-oscillation, a 2-cycle pattern where the bot bounced the blank back and forth without progress. On smaller grids where seed words were still largely intact, that hurt. On the 30×30 grids, where the scramble had broken up nearly everything and reconstruction was the only path to points, the sheer slide volume eventually paid off. Kimi’s cumulative score of 77 was the highest in the tournament.
MiMo’s sliding code exists in the repo, but its “best value greater than zero” threshold never triggered, so in practice it never slid once. It went straight to scanning the initial grid for words of seven letters or more and blasted all its claims in a single TCP packet. Brittle strategy: entirely dependent on the scramble leaving intact seed words. On grids where words survived, MiMo cleaned up fast. On grids where they didn’t, it scored nothing. Final tally: 43 cumulative points, second place.
Claude also didn’t slide. The move logs show it holding up well on 25×25 boards where scramble density was still manageable, then falling apart on 30×30 where actual tile movement was needed. Not sliding is a real limitation in a puzzle built around sliding.
GPT-5.5 was more conservative, roughly 120 slides per round with a cap to avoid thrashing, and showed the strongest numbers on 15×15 and 30×30 grids. Grok never slid either, yet scored reasonably on the larger boards. GLM was the most aggressive slider in the whole tournament, over 800,000 total slides, but stalled badly whenever it ran out of positive moves.
DeepSeek sent malformed data every round. Zero useful output. At least it didn’t make things worse by playing.
Muse made things worse by playing.
The scoring penalizes short words: three-letter words cost three points, four-letter words cost two, five-letter words cost one. The intent is to stop bots from carpet-bombing the board with “the” and “and” and “it.” Every serious competitor filtered their dictionary to words of seven letters or more. Muse claimed everything. Every word it could find, regardless of length, fired off as a claim. On a 30×30 grid with hundreds of short valid words visible at any moment, Muse found them all and claimed every one.
Its cumulative score was −15,309. It lost all eight matches and won zero rounds. There is a version of Muse that simply connected to the server and did nothing, and that version would have scored zero, a 15,309-point improvement. The gap between Muse and eighth place was larger than the gap between eighth and first.
DeepSeek’s malformed output tells you something about how it handles novel protocol specs under time pressure. Muse’s spiral tells you something different: it saw valid words and claimed them, with no apparent model of what “valid” meant given the scoring rules. It read the task partially and executed that partial reading in full. Worth noting for anyone deploying these models on structured tasks with penalties.
What surprised me
I design these challenges, so I have a reasonable sense of what they test. What I didn’t fully anticipate was how starkly the 30×30 grids would separate the field. On smaller boards, the difference between a static scanner and an active slider was modest. At full scale, models that could only find what was already there ran out of road. Kimi’s greedy loop, flawed as it was, kept producing output when the static scanners had nothing left to claim.
The other thing worth noting: MiMo and Kimi finished two points apart despite doing almost opposite things. Two different theories of the same puzzle, nearly identical results. That means the gap between first and second was partly seed variance, not just capability difference.
The bigger picture
One fair counterargument: this scoring system rewards aggressive word claiming, and heavily safety-tuned models may be more conservative about that kind of carpet-bombing. If so, the results reflect a mismatch between task design and aligned model behaviour, not raw capability. It’s a reasonable objection. It doesn’t change the outcome.
One challenge doesn’t overturn general benchmarks. This puzzle tests real-time decision-making and whether a model can write clean functional code that connects to a TCP server and plays a novel game correctly. It doesn’t test long-context reasoning or code generation from a spec.
But I’ve been running these challenges long enough to notice what’s changing. A year ago, the assumption was that the Western frontier labs had a capability lead open-weights couldn’t close. Kimi K2.6 now scores 54 on the Artificial Analysis Intelligence Index. GPT-5.5 scores 60, Claude 57. That’s not parity, but it’s close, and it’s coming from a model anyone can download.
When models within a few index points of the frontier are also freely available to run locally, that’s a different competitive situation than the one that existed a year ago. This challenge is one data point in that shift. The gap is small enough now that it shows up in results like this one.
Rohana Rezel runs the AI Coding Contest and is a technologist, researcher, and community leader based in Vancouver, BC.
Grace Eliza Goodwin
Driverless cars are becoming more common in some California cities, but when the autonomous vehicles violate traffic laws, police haven’t been able to ticket them - until now.
The state’s Department of Motor Vehicles (DMV) has announced new regulations on autonomous vehicles (AVs), including a process for police to issue a “notice of AV noncompliance” directly to the car’s manufacturer.
The new rules, which will go into effect 1 July, are part of a larger 2024 law that imposed deeper regulation on the technology.
There have been a number of reports of the cars breaking traffic laws, including during a San Francisco blackout last year.
The California DMV is calling the new rules “the most comprehensive AV regulations in the nation”.
Under the new rules, police can cite AV companies when their vehicles commit moving violations. The rules will also require the companies to respond to calls from police and other emergency officials within 30 seconds, and will issue penalties if their vehicles enter active emergency zones.
“California continues to lead the nation in the development and adoption of AV technology, and these updated regulations further demonstrate the state’s commitment to public safety,” DMV Director Steve Gordon said in a press release.
Waymo is one of the main operators of fully self-driving robotaxis in the San Francisco Bay Area and Los Angeles County, but several companies, including Tesla, also have permits to test their AVs in some California cities. The BBC has contacted Waymo and Tesla for comment.
When the vehicles violate traffic laws, some police have been stumped as to how to hold the driverless cars accountable.
In an incident last September, police officers in San Bruno - a city south of San Francisco - noticed a Waymo AV making an illegal U-turn at a light directly in front of them, the San Bruno Police Department said at the time. But when officers stopped the car, they were not able to issue a ticket without a driver to give it to. Instead, they contacted the company about the “glitch”.
San Francisco Fire Department officials have also repeatedly complained about robotaxis getting in the way of emergency responses.
For over a decade now, Tesla has sold a promise of vehicles that can drive themselves, even stating that every car it produced had all the hardware needed for self-driving.
But after years of the company being unable to deliver, some owners want their money back. Ben Gawiser is one of those owners; he recently won a $10,672.88 judgment over Tesla's failure to deliver. But Tesla is still fighting to delay payment, even if only a few days at a time.
Gawiser purchased a Tesla Model 3 in August 2021 and paid $10,000 for the company's Full Self-Driving software. By that point the price of the software had gradually increased, and Tesla said it would keep rising as the software gained more capabilities and got closer to release.
(Later, Tesla lowered prices and eventually moved to a subscription-only model, where it stands now — though Tesla is still charging some owners for hardware they already bought).
After five years, Gawiser’s purchase should have allowed his vehicle to drive all by itself. After all, Tesla’s software was continually getting better, and the company’s CEO had promised in January 2021 that “the car will drive itself for the reliability in excess of a human this year.”
However, that did not happen. Tesla has yet to deliver software capable of Level 5 full self-driving to any owner. Even in its own Robotaxi fleet, only a few vehicles run at Level 4 autonomy in limited circumstances.
(Tesla previously said you'd be able to use your own car as a robotaxi, too; yet even though the company is now making revenue with FSD software in its "Robotaxi" fleet, it still doesn't let owners do that.)
With all these false promises and what amounts to a five-digit, nearly five-year loan given to Tesla, Gawiser had had enough, and decided to do something about it. So he reached out to Tesla’s resolutions email address in November 2025 to ask for a refund for his nonfunctional software… albeit with some aggressive language.
He cited instances of his vehicle stopping in the middle of the road, asking him to take over within minutes of activation, and failing to slow for a school zone. But overall, the software simply does not deliver what was promised: he was sold a Level 5 system, and FSD is still Level 2.
He was given the cold shoulder, and asked again in January, at which point he was told that the only remedy available would be to visit a service center to make sure the system is working properly. But that wouldn’t have upgraded it to the level 5 system he paid for.
Then he filed a lawsuit in small claims court in Travis County, Texas, where he lives and where Tesla moved its headquarters. Gawiser isn't a lawyer, but small claims courts are designed to be used by the public rather than by lawyers. And while Tesla's purchase agreement has an arbitration clause, it is still possible to take disputes to small claims court.
All it took was finding Tesla’s “registered agent” (under “service of process” on Tesla’s legal page), then filing a small claims lawsuit online with the Texas justice of the peace. This cost him $72.88, including the cost to send certified mail to serve Tesla with the court documents.
After being served with the lawsuit, Tesla again did not respond. So a court date was set for a default judgment hearing, which is what happens when one party does not respond to a court case. The hearing happened over video call, where Gawiser provided evidence showing how much he paid for FSD and that it had not yet been delivered, and the court made a judgment in his favor in the amount of $10,672.88, the amount Gawiser paid for FSD, including taxes and court fees.
After the default judgment was filed on April 1, Tesla had 3 weeks to file a response, and didn’t do so by the April 22 deadline (and no, in a real court case, you can’t say it was April Fools when you miss a deadline). This is when we first heard from Gawiser.
However, that wasn’t the end of the saga. Tesla waited 5 more days and filed a request for an extension, stating that they had not received notice of the default judgment hearing and therefore couldn’t show up. But rather than requesting a rehearing, Tesla merely requested the deadline be pushed back by 5 days, and then didn’t submit any additional evidence showing its side of the story (which it has to do if requesting a re-trial).
In Gawiser’s response to Tesla’s most recent request, he took a swipe at Tesla’s lack of defense, using one of Musk’s statements during this quarter’s earnings call as evidence:
Tesla, Inc. does not have “meritorious defense” for this action as their CEO as recently as April 22nd, 2026 said that Tesla could not deliver a working version of “Full Self-Driving” for the vehicle that the Plaintiff purchased required by the contract. Unless their counsel happens to know Tesla’s own products better than their CEO, they have no defense to this cause of action. The requirement for a “meritorious defense” is laid out in Craddock v. Sunshine Bus Lines. Tesla, Inc. has not presented a “prima facie meritorious defense”, nor do they have one.
In that call, Musk finally admitted that HW3 cars like Gawiser’s would never be able to drive themselves, and would require Tesla to build factories just to upgrade them. This would only add more wait for Gawiser if he did want to continue waiting patiently for FSD, as there is no indication that Tesla has started building those factories to deliver the hardware needed to make the promised software work. (Further, the current hardware, HW4, also has not yet delivered Level 5 autonomy to customers)
In short, Gawiser’s argument is: I bought this software, it wasn’t delivered, and there is no legal argument that can get around those facts as long as it remains undelivered.
The court has not yet responded to this most recent back and forth, but Gawiser is confident that he will prevail.
Then comes the matter of payment — Gawiser filed a “writ of execution” (another $240 in court fees) just yesterday, which would allow Texas law enforcement to seize and sell off enough of Tesla’s property as would be required to pay the judgment against them. If it comes to that, we hope he brings cameras.
Electrek’s Take
Gawiser’s case is one of a few we’ve heard of where owners were able to get refunds from Tesla, either in small claims or multiple cases of arbitration. But these cases still remain rather rare, given the scale of the broken promise.
There are currently millions of vehicles on the road that have no hope of getting the full self-driving hardware they were sold with… and yes, we think the Tesla HW3 retrofit “microfactory” plan is unlikely to actually happen.
But in addition to these small claims cases, there are a number of class action cases working their way through legal systems around the world. We've heard of class action suits already in progress in the US, China, and Australia, and in Europe thousands of owners have signed up on a Dutch collective claim site (Tesla's response to this effort was: "be patient").
Collectively, these suits could result in billions in liability for the company — and none of this would have been necessary if not for the CEO’s constant false promises.
So it’s possible that all owners will eventually receive some sort of rectification for this issue that Tesla has created, but Gawiser’s case shows how one owner took the matter into his own hands, and got it taken care of once and for all (or so we think — we’ll update this post if Tesla decides to employ any more delaying tactics).
But Gawiser’s case may not be repeatable. Since this case went through small claims and Tesla failed to respond, the default judgment wasn’t really a judgment on the merits of the case itself. And small claims cases do not set a binding legal precedent, so there’s no guarantee another court would rule in the same way.
However, it's still useful for gauging how these cases can work in practice. One more owner was able to get his due from Tesla, and whether through intentional tactics or the internal disorganization the company is often known for, Tesla barely even put up a defense. If Tesla found these cases easy to defend, it presumably would have done so, but it didn't.