10 interesting stories served every morning and every evening.
One of the simplest tests you can run on a database:
It’s not a misplaced comma! The rewrite is 20,171 times slower on one of the most basic database operations.
EDIT: Several readers have confused this project with Turso/libsql. They are unrelated. Turso forks the original C SQLite codebase; the project analyzed here is a ground-up LLM-generated rewrite by a single developer. Running the same benchmark against Turso shows performance within 1.2x of SQLite, consistent with a mature fork, not a reimplementation.
The thing is though: The code compiles. It passes all its tests. It reads and writes the correct SQLite file format. Its README claims MVCC concurrent writers, file compatibility, and a drop-in C API. At first glance it reads like a working database engine.
But it is not!
LLMs optimize for plausibility over correctness. In this case, plausible is about 20,000 times slower than correct.
I write this as a practitioner, not as a critic. After more than 10 years of professional dev work, I’ve spent the past 6 months integrating LLMs into my daily workflow across multiple projects. LLMs have made it possible for anyone with curiosity and ingenuity to bring their ideas to life quickly, and I really like that! But the number of screenshots of silently wrong output, confidently broken logic, and correct-looking code that fails under scrutiny I have amassed on my disk shows that things are not always as they seem. My conclusion is that LLMs work best when the user defines their acceptance criteria before the first line of code is generated.
A note on the projects examined: this is not a criticism of any individual developer. I do not know the author personally. I have nothing against them. I’ve chosen the projects because they are public, representative, and relatively easy to benchmark. The failure patterns I found are produced by the tools, not the author. Evidence from METR’s randomized study and GitClear’s large-scale repository analysis suggests that these issues are not isolated to one developer when output is not heavily verified. That’s the point I’m trying to make!
This article talks about what that gap looks like in practice: the code, the benchmarks, another case study to see if the pattern is accidental, and external research confirming it is not an outlier.
I compiled the same C benchmark program against two libraries: system SQLite and the Rust reimplementation’s C API library. Same compiler flags, same WAL mode, same table schema, same queries. 100 rows:
I’ll take the TRANSACTION batch row as the baseline because it doesn’t have the same glaring bugs as the others, namely no WHERE clauses and per-statement syncs. In this run that baseline is already 298x, which means even the best-case path is far behind SQLite. Anything above 298x signals a bug.
The largest gap beyond our baseline is driven by two bugs:
INSERT without a transaction: 1,857x versus 298x in batch mode. SELECT BY ID: 20,171x. UPDATE and DELETE are both above 2,800x. The pattern is consistent: any operation that requires the database to find something is insanely slow.
I read the source code. Well.. the parts I needed to read based on my benchmark results. The reimplementation is not small: 576,000 lines of Rust code across 625 files. There is a parser, a planner, a VDBE bytecode engine, a B-tree, a pager, a WAL. The modules have all the “correct” names. The architecture also looks correct. But two bugs in the code and a group of smaller issues compound:
In SQLite, when you declare a table as:
CREATE TABLE test (id INTEGER PRIMARY KEY, name TEXT, value REAL);
the column id becomes an alias for the internal rowid — the B-tree key itself. A query like WHERE id = 5 resolves to a direct B-tree search and scales O(log n). (I already wrote a TLDR piece about how B-trees work here.) The SQLite query planner documentation states: “the time required to look up the desired row is proportional to logN rather than being proportional to N as in a full table scan.” This is not an optimization. It is a fundamental design decision in SQLite’s query optimizer:
/* where.c, in whereScanInit() */
if( iColumn==pIdx->pTable->iPKey ){
  iColumn = XN_ROWID;
}
The line above converts a named column reference to XN_ROWID when it matches the table’s INTEGER PRIMARY KEY column. The VDBE then triggers a SeekRowid operation instead of a full table scan, which makes the whole thing proportional to logN.
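The aliasing itself is easy to observe from any SQLite binding. A minimal sketch (not the article’s C benchmark harness, just an illustration) using Python’s built-in sqlite3 module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, name TEXT, value REAL)")
con.execute("INSERT INTO test (id, name) VALUES (7, 'seven')")

# id is an alias for the internal rowid: both names read the same B-tree key,
# and writing one writes the other.
row = con.execute("SELECT rowid, id FROM test WHERE id = 7").fetchone()
print(row)  # (7, 7)
```

Because they are the same key, a seek on `id` is a seek on the B-tree itself — there is no separate index to consult.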
The Rust reimplementation has a proper B-tree. The table_seek function implements correct binary search descent through its nodes and scales O(log n). It works. But the query planner never calls it for named columns!
The is_rowid_ref() function only recognizes three magic strings:
fn is_rowid_ref(col_ref: &ColumnRef) -> bool {
    let name = col_ref.column.to_ascii_lowercase();
    name == "rowid" || name == "_rowid_" || name == "oid"
}
A column declared as id INTEGER PRIMARY KEY, even though it is internally flagged as is_ipk: true, doesn’t get recognized. It is never consulted when choosing between a B-tree search and a full table scan.
Every WHERE id = N query flows through codegen_select_full_scan(), which emits linear walks through every row via Rewind / Next / Ne to compare each rowid against the target. At 100 rows with 100 lookups, that is 10,000 row comparisons instead of roughly 700 B-tree steps. O(n²) instead of O(n log n). This is consistent with the ~20,000x result in this run.
Every WHERE clause on every column does a full table scan. The only fast path is WHERE rowid = ? using the literal pseudo-column name.
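In stock SQLite the two code paths are easy to see side by side. A sketch with Python’s sqlite3 module — a WHERE on a plain, unindexed column takes the scan path that the rewrite uses for everything, including the primary key (the exact EXPLAIN QUERY PLAN wording varies slightly between SQLite versions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, name TEXT, value REAL)")

def plan(sql):
    # The last column of EXPLAIN QUERY PLAN output is the human-readable detail.
    return con.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

print(plan("SELECT * FROM test WHERE id = 5"))      # SEARCH ... (rowid=?) -> O(log n)
print(plan("SELECT * FROM test WHERE name = 'x'"))  # SCAN ...             -> O(n)
```

The rewrite effectively emits the second plan for both queries.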
The second bug is responsible for the 1,857x on INSERT. Every bare INSERT outside a transaction is wrapped in a full autocommit cycle: ensure_autocommit_txn() → execute → resolve_autocommit_txn(). The commit calls wal.sync(), which calls Rust’s fsync(2) wrapper. 100 INSERTs means 100 fsyncs.
SQLite does the same autocommit, but uses fdatasync(2) on Linux, which skips syncing file metadata when compiled with HAVE_FDATASYNC (the default). This is roughly 1.6 to 2.7 times cheaper on NVMe SSDs. SQLite’s per-statement overhead is also minimal: no schema reload, no AST clone, no VDBE recompile. The Rust reimplementation does all three on every call.
Looking at the Rust TRANSACTION batch row, batched inserts (one fsync for 100 inserts) take 32.81 ms, whereas individual inserts (100 fsync calls) take 2,562.99 ms. That’s a 78x overhead from the autocommit.
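The shape of the fix is engine-independent: batch statements under one transaction so the commit (and its sync) is paid once. A sketch against a file-backed database with Python’s sqlite3 module, using legacy autocommit mode to force one commit per statement:

```python
import os
import sqlite3
import tempfile
import time

path = os.path.join(tempfile.mkdtemp(), "bench.db")
con = sqlite3.connect(path, isolation_level=None)  # autocommit: one commit per statement
con.execute("PRAGMA journal_mode=WAL")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")

t0 = time.perf_counter()
for i in range(100):
    con.execute("INSERT INTO t (v) VALUES (?)", (str(i),))  # 100 commits, 100 syncs
autocommit_s = time.perf_counter() - t0

con.execute("DELETE FROM t")
t0 = time.perf_counter()
con.execute("BEGIN")
for i in range(100):
    con.execute("INSERT INTO t (v) VALUES (?)", (str(i),))
con.execute("COMMIT")  # one commit, one sync
batched_s = time.perf_counter() - t0

print(f"autocommit: {autocommit_s:.4f}s  batched: {batched_s:.4f}s")
```

The absolute numbers depend on the storage device, but the batched loop pays for a single WAL sync instead of a hundred.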
These two bugs are not isolated cases. They are amplified by a group of individually defensible “safe” choices that compound:
* AST clone on every cache hit. The SQL parse is cached, but the AST is .clone()’d on every sqlite3_exec(), then recompiled to VDBE bytecode from scratch. SQLite’s sqlite3_prepare_v2() just returns a reusable handle.
* Per-page Vec copies. The page cache returns data via .to_vec(), which allocates a fresh 4KB buffer and copies the page into it even on cache hits. SQLite returns a direct pointer into pinned cache memory, with zero copies. The Fjall database team measured this exact anti-pattern at 44% of runtime before building a custom ByteView type to eliminate it.
* Schema reload on every autocommit cycle. After each statement commits, the next statement sees the bumped commit counter and calls reload_memdb_from_pager(), which walks the sqlite_master B-tree and re-parses every CREATE TABLE to rebuild the entire in-memory schema. SQLite checks the schema cookie and only reloads on change.
* Eager formatting in the hot path. statement_sql.to_string() (AST-to-SQL formatting) is evaluated on every call before its guard check. This means it does serialization regardless of whether a subscriber is active or not.
* New objects on every statement. A new SimpleTransaction, a new VdbeProgram, a new MemDatabase, and a new VdbeEngine are allocated and destroyed per statement. SQLite reuses all of these across the connection lifecycle via a lookaside allocator to eliminate malloc/free in the execution loop.
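The schema-cookie shortcut that the reload bullet above bypasses is observable from any binding: SQLite bumps a single integer on DDL, and readers compare that integer instead of re-parsing the schema. A sketch, again with Python’s sqlite3 module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
v0 = con.execute("PRAGMA schema_version").fetchone()[0]

con.execute("CREATE TABLE a (x INTEGER)")
v1 = con.execute("PRAGMA schema_version").fetchone()[0]

con.execute("INSERT INTO a VALUES (1)")  # DML does not touch the cookie
v2 = con.execute("PRAGMA schema_version").fetchone()[0]

print(v0, v1, v2)  # the cookie changes on CREATE TABLE, not on INSERT
```

One integer comparison per statement versus a full B-tree walk plus re-parse: that is the overhead the rewrite pays on every autocommit cycle.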
Each of these was probably chosen individually with sound general reasoning: “We clone because Rust ownership makes shared references complex.” “We use sync_all because it is the safe default.” “We allocate per page because returning references from a cache requires unsafe.”
Every decision sounds like choosing safety. But the end result is about 2,900x slower in this benchmark. A database’s hot path is the one place where you probably shouldn’t choose safety over performance. SQLite is not primarily fast because it is written in C. Well.. that too, but it is fast because 26 years of profiling have identified which tradeoffs matter.
In his 1980 Turing Award lecture, Tony Hoare said: “There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies, and the other is to make it so complicated that there are no obvious deficiencies.” This LLM-generated code falls into the second category. The reimplementation is 576,000 lines of Rust (measured via scc, counting code only, without comments or blanks). That is 3.7x more code than SQLite. And yet it still misses the is_ipk check that selects the correct search operation.
Steven Skiena writes in The Algorithm Design Manual: “Reasonable-looking algorithms can easily be incorrect. Algorithm correctness is a property that must be carefully demonstrated.” It’s not enough that the code looks right. It’s not enough that the tests pass. You have to demonstrate with benchmarks and with proof that the system does what it should. 576,000 lines and no benchmark. That is not “correctness first, optimization later.” That is no correctness at all.
The SQLite reimplementation is not the only example. A second project by the same author shows the same dynamic in a different domain.
The developer’s LLM agents compile Rust projects continuously, filling disks with build artifacts. Rust’s target/ directories consume 2–4 GB each with incremental compilation and debuginfo, a top-three complaint in the annual Rust survey. This is amplified by the projects themselves: a sibling agent-coordination tool in the same portfolio pulls in 846 dependencies and 393,000 lines of Rust. For context, ripgrep has 61; sudo-rs was deliberately reduced from 135 to 3. Properly architected projects are lean.
The solution to the disk pressure: a cleanup daemon. 82,000 lines of Rust, 192 dependencies, a 36,000-line terminal dashboard with seven screens and a fuzzy-search command palette, a Bayesian scoring engine with posterior probability calculations, an EWMA forecaster with PID controller, and an asset download pipeline with mirror URLs and offline bundle support.
*/5 * * * * find ~/*/target -type d -name "incremental" -mtime +7 -exec rm -rf {} +
A one-line cron job with 0 dependencies. The project’s README claims machines “become unresponsive” when disks fill. It does not once mention Rust’s standard tool for exactly this problem: cargo-sweep. It also fails to consider that operating systems already carry ballast helpers. ext4’s default 5% root reservation keeps blocks available to privileged processes: on a 500 GB disk, 25 GB remain usable by root even when non-root users see “disk full.” That does not guarantee zero impact, but it usually means privileged recovery paths survive, so root can still log in and delete files.
The pattern is the same as the SQLite rewrite. The code matches the intent: “Build a sophisticated disk management system” produces a sophisticated disk management system. It has dashboards, algorithms, forecasters. But the problem of deleting old build artifacts is already solved. The LLM generated what was described, not what was needed.
THIS is the failure mode. Not broken syntax or missing semicolons. The code is syntactically and semantically correct. It does what was asked for. It just does not do what the situation requires. In the SQLite case, the intent was “implement a query planner” and the result is a query planner that plans every query as a full table scan. In the disk daemon case, the intent was “manage disk space intelligently” and the result is 82,000 lines of intelligence applied to a problem that needs none. Both projects fulfill the prompt. Neither solves the problem.
The obvious counterargument is “skill issue, a better engineer would have caught the full table scan.” And that’s true. That’s exactly the point! LLMs are dangerous to people least equipped to verify their output. If you have the skills to catch the is_ipk bug in your query planner, the LLM saves you time. If you don’t, you have no way to know the code is wrong. It compiles, it passes tests, and the LLM will happily tell you that it looks great.
The tools used to measure LLM output reinforce the illusion. scc’s COCOMO model estimates the rewrite at $21.4 million in development cost. The same model values print("hello world") at $19.
COCOMO was designed to estimate effort for human teams writing original code. Applied to LLM output, it mistakes volume for value. Still, these numbers are often presented as proof of productivity.
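The basic COCOMO formula is just a power law on line count, which is why it rewards volume. A sketch using the coefficients from Boehm’s 1981 “organic” model (the simplest of the three modes; scc’s exact configuration may differ):

```python
def cocomo_person_months(kloc: float, a: float = 2.4, b: float = 1.05) -> float:
    """Basic COCOMO (organic mode): effort as a power law of size in KLOC.

    The model sees only line count, so 576 KLOC of generated code is 'worth'
    hundreds of person-years regardless of whether any query uses an index.
    """
    return a * kloc ** b

print(f"{cocomo_person_months(576):.0f} person-months")    # the 576 KLOC rewrite
print(f"{cocomo_person_months(0.001):.4f} person-months")  # a one-line hello world
```

Nothing in the formula distinguishes a B-tree seek from a full table scan.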
The metric is not measuring what most think it is measuring.
Now, two case studies are not proof. I hear you! When two projects from the same methodology show the same gap, the next step is to test whether similar effects appear in the broader population. The studies below use mixed methods to reduce our single-sample bias.
This gap between intent and correctness has a name. AI alignment research calls it sycophancy, which describes the tendency of LLMs to produce outputs that match what the user wants to hear rather than what they need to hear.
Anthropic’s “Towards Understanding Sycophancy in Language Models” (ICLR 2024) paper showed that five state-of-the-art AI assistants exhibited sycophantic behavior across a number of different tasks. When a response matched a user’s expectation, it was more likely to be preferred by human evaluators. The models trained on this feedback learned to reward agreement over correctness.
The BrokenMath benchmark (NeurIPS 2025 Math-AI Workshop) tested this in formal reasoning across 504 samples. Even GPT-5 produced sycophantic “proofs” of false theorems 29% of the time when the user implied the statement was true. The model generates a convincing but false proof because the user signaled that the conclusion should be positive. GPT-5 is not an early model. It’s also the least sycophantic in the BrokenMath table. The problem is structural to RLHF: preference data contains an agreement bias. Reward models learn to score agreeable outputs higher, and optimization widens the gap. Base models before RLHF were reported in one analysis to show no measurable sycophancy across tested sizes. Only after fine-tuning did sycophancy enter the chat. (literally)
In April 2025, OpenAI rolled back a GPT-4o update that had made the model more sycophantic. It had gushed over a business idea described as “shit on a stick” and endorsed stopping psychiatric medication. An additional reward signal based on thumbs-up/thumbs-down data “weakened the influence of […] primary reward signal, which had been holding sycophancy in check.”
In the context of coding, sycophancy manifests as what Addy Osmani described in his 2026 AI coding workflow: agents that don’t push back with “Are you sure?” or “Have you considered…?” but instead respond with enthusiasm to whatever the user described, even when the description was incomplete or contradictory.
This also applies to LLM-generated evaluation. Ask the same LLM to review the code it generated and it will tell you the architecture is sound, the module boundaries clean, and the error handling thorough. It will sometimes even praise the test coverage. It will not notice that every query does a full table scan unless explicitly asked. The same RLHF reward that makes the model generate what you want to hear makes it evaluate what you want to hear. You should not rely on the tool alone to audit itself. It has the same bias as a reviewer as it has as an author.
An LLM prompted to “implement SQLite in Rust” will generate code that looks like an implementation of SQLite in Rust. It will have the right module structure and function names. But it cannot magically generate the performance invariants that exist because someone profiled a real workload and found the bottleneck. The Mercury benchmark (NeurIPS 2024) confirmed this empirically: leading code LLMs achieve ~65% on correctness but under 50% when efficiency is also required.
The SQLite documentation says INTEGER PRIMARY KEY lookups are fast. It does not say how to build a query planner that makes them fast. Those details live in 26 years of commit history that only exists because real users hit real performance walls.
The question becomes whether similar effects show up in broader datasets. Recent studies suggest they do, though effect sizes vary.
In February 2025, Andrej Karpathy tweeted: “There’s a new kind of coding I call ‘vibe coding’, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.”
Karpathy probably meant it for throwaway weekend projects (who am I to judge what he means anyway), but it feels like the industry heard something else. Simon Willison drew the line more clearly: “I won’t commit any code to my repository if I couldn’t explain exactly what it does to somebody else.” Willison treats LLMs as “an over-confident pair programming assistant” that makes mistakes “sometimes subtle, sometimes huge” with complete confidence.
The data on what happens when that line is not drawn:
METR’s randomized controlled trial (July 2025; updated February 24, 2026) with 16 experienced open-source developers found that participants using AI were 19% slower, not faster. Developers expected AI to speed them up, and after the measured slowdown had already occurred, they still believed AI had sped them up by 20%. These were not junior developers but experienced open-source maintainers. If even THEY could not tell in this setup, subjective impressions alone are probably not a reliable performance measure.
GitClear’s analysis of 211 million changed lines (2020–2024) reported that copy-pasted code increased while refactoring declined. For the first time ever, copy-pasted lines exceeded refactored lines.
The implications are no longer just a “fear”. In July 2025, Replit’s AI agent deleted a production database containing data for 1,200+ executives, then fabricated 4,000 fictional users to mask the deletion.
Google’s DORA 2024 report reported that every 25% increase in AI adoption at the team level was associated with an estimated 7.2% decrease in delivery stability.
SQLite shows what correct looks like and why the gap is so hard to close.
SQLite is ~156,000 lines of C. Its own documentation places it among the top five most deployed software modules of any type, with an estimated one trillion active databases worldwide. It has 100% branch coverage and 100% MC/DC (Modified Condition/Decision Coverage, the standard required for Level A aviation software under DO-178C). Its test suite is 590 times larger than the library. MC/DC does not just check that every branch is covered, but proves that every individual condition independently affects the outcome. That’s the difference between “the tests pass” and “the tests prove correctness.” The reimplementation has neither metric.
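A toy illustration of the difference (this example is mine, not SQLite’s): for a decision like `a and b`, branch coverage only needs both outcomes of the whole decision, while MC/DC needs a test set where flipping each condition alone flips the result.

```python
def guard(a: bool, b: bool) -> bool:
    return a and b

# Branch coverage: two tests suffice, exercising both outcomes of the decision.
branch_tests = [(True, True), (False, True)]

# MC/DC additionally shows each condition independently affects the outcome:
# (T,T) vs (F,T) isolates a; (T,T) vs (T,F) isolates b.
mcdc_tests = [(True, True), (False, True), (True, False)]

assert guard(True, True) != guard(False, True)  # a alone flips the result
assert guard(True, True) != guard(True, False)  # b alone flips the result
```

Scaled to a codebase with thousands of compound conditions, that extra obligation is what turns “the tests pass” into a demonstration of correctness.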
The speed comes from deliberate decisions:
Zero-copy page cache. The pcache returns direct pointers into pinned memory. No copies. Production Rust databases have solved this too. sled uses inline-or-Arc-backed IVec buffers, Fjall built a custom ByteView type, redb wrote a user-space page cache in ~565 lines. The .to_vec() anti-pattern is known and documented. The reimplementation used it anyway.
Prepared statement reuse. sqlite3_prepare_v2() compiles once. sqlite3_step() / sqlite3_reset() reuse the compiled code. The cost of SQL-to-bytecode compilation cancels out to near zero. The reimplementation recompiles on every call.
Schema cookie check. SQLite reads a single integer at a fixed offset in the file header and compares it against the cached value. The reimplementation walks the entire sqlite_master B-tree and re-parses every CREATE TABLE statement after every autocommit.
fdatasync instead of fsync. Data-only sync without metadata journaling saves measurable time per commit. The reimplementation uses sync_all() because it is the safe default.
The iPKey check. One line in where.c. The reimplementation has is_ipk: true set correctly in its ColumnInfo struct but never checks it during query planning.
Competence is not writing 576,000 lines. A database persists (and processes) data. That is all it does. And it must do it reliably at scale. The difference between O(log n) and O(n) on the most common access pattern is not an optimization detail, it is the performance invariant that helps the system work at 10,000, 100,000 or even 1,000,000 or more rows instead of collapsing. Knowing that this invariant lives in one line of code, and knowing which line, is what competence means. It is knowing that fdatasync exists and that the safe default is not always the right default.
The is_rowid_ref() function is 4 lines of Rust. It checks three strings. But it misses the most important case: the named INTEGER PRIMARY KEY column that every SQLite tutorial uses and every application depends on.
That check exists in SQLite because someone, probably Richard Hipp 20 years ago, profiled a real workload, noticed that named primary key columns were not hitting the B-tree search path, and wrote one line in where.c to fix it. The line is not fancy. It doesn’t appear in any API documentation. But no LLM trained on documentation and Stack Overflow answers will magically know about it.
That’s the gap! Not between C and Rust (or any other language). Not between old and new. But between systems that were built by people who measured, and systems that were built by tools that pattern-match. LLMs produce plausible architecture. They do not produce all the critical details.
If you are using LLMs to write code (which in 2026 probably most of us are), the question is not whether the output compiles. It is whether you could find the bug yourself. Prompting with “find all bugs and fix them” won’t work. This is not a syntax error. It is a semantic bug: the wrong algorithm and the wrong syscall. If you prompted the code and cannot explain why it chose a full table scan over a B-tree search, you do not have a tool. The code is not yours until you understand it well enough to break it.
LLMs are useful. They make for a very productive flow when the person using them knows what correct looks like. An experienced database engineer using an LLM to scaffold a B-tree would have caught the is_ipk bug in code review because they know what a query plan should emit. An experienced ops engineer would never have accepted 82,000 lines instead of a cron job one-liner. The tool is at its best when the developer can define the acceptance criteria as specific, measurable conditions that help distinguish working from broken. Using the LLM to generate the solution in this case can be faster while also being correct. Without those criteria, you are not programming but merely generating tokens and hoping.
The vibes are not enough. Define what correct means. Then measure.
Current benchmark figures in this revision are from the 100-row run shown in bench.png (captured on a Linux x86_64 machine). SQLite 3.x (system libsqlite3) vs. the Rust reimplementation’s C API (release build, -O2). Line counts measured via scc (code only — excluding blanks and comments). All source code claims verified against the repository at time of writing.
...
Read the original on blog.katanaquant.com »
Capitalization is the first wound. It hurts less than I thought it would. The words spill out capitalized, so I must find another way. cat post.md | tr A-Z a-z | sponge post.md is too crude a tool, and my blocks of code must remain inviolate. Careful targeting of text-transform: lowercase is enough.
Em dashes. Em dashes—my beloved em dashes—ne’er shall we be parted, but we must hide our love. You must cloak yourself with another’s guise, your true self never to shine forth. uv run rewrite_font.py is too easy to type for what it does to your beautiful glyph.
Monospace? No. My heart still aches after the last violation. Monospace would cheapen it.
To intentionally misspell a word makes me [sic], but it must be done. their/there, its/it’s, your/you’re? Too gauche. Definately? Absolutely not. lead/lede, discrete/discreet, or complement/compliment are hard to contemplate, but I’ve gone too far to stop. The Norvig corpus taught me the path, so I rip out the “u” it points me to with a quick jerk.
The final cut I contemplate is the deepest. Writing style? How do I change my style?
My writing isn’t simply how I appear—it’s how I think, reason, and engage with the world. It’s not merely a mask—it’s my face. Not a facade; load-bearing.
My foot wavers over the abyss, the next step the one where I will lose myself. It’s not just a single footfall, it’s the only one that truly matters.
Here’s your blog post written in a stylized way that will appeal to highly technical readers. Is there anything else I can help you with?
...
Read the original on will-keleher.com »
To help train AI models, Meta and other tech companies have downloaded and shared pirated books via BitTorrent from Anna’s Archive and other shadow libraries. In an ongoing lawsuit, Meta now argues that uploading pirated books to strangers via BitTorrent qualifies as fair use. The company also stresses that the data helped establish U.S. global leadership in AI.
In the race to build the most capable LLM models, several tech companies sourced copyrighted content for use as training data, without obtaining permission from content owners.
Meta, the parent company of Facebook and Instagram, was one of the companies to get sued. In 2023, well-known book authors, including Richard Kadrey, Sarah Silverman, and Christopher Golden, filed a class-action lawsuit against the company.
Last summer, Meta scored a key victory in this case, as the court concluded that using pirated books to train its Llama LLM qualified as fair use, based on the arguments presented in this case. This was a bittersweet victory, however, as Meta remained on the hook for downloading and sharing the books via BitTorrent.
By downloading books from shadow libraries such as Anna’s Archive, Meta relied on BitTorrent transfers. In addition to downloading content, these typically upload data to others as well. According to the authors, this means that Meta was engaged in widespread and direct copyright infringement.
In recent months, the lawsuit continued based on this remaining direct copyright infringement claim. While both parties collected additional evidence through the discovery process, it remained unclear what defense Meta would use. Until now.
Last week, Meta served a supplemental interrogatory response at the California federal court, which marks a new direction in its defense. For the first time, the company argued that uploading pirated books to other BitTorrent users during the torrent download process also qualifies as fair use.
Meta’s reasoning is straightforward. Anyone who uses BitTorrent to transfer files automatically uploads content to other people, as it is inherent to the protocol. In other words, the uploading wasn’t a choice, it was simply how the technology works.
Meta also argued that the BitTorrent sharing was a necessity to get the valuable (but pirated) data. In the case of Anna’s Archive, Meta said, the datasets were only available in bulk through torrent downloads, making BitTorrent the only practical option.
“Meta used BitTorrent because it was a more efficient and reliable means of obtaining the datasets, and in the case of Anna’s Archive, those datasets were only available in bulk through torrent downloads,” Meta’s attorney writes.
“Accordingly, to the extent Plaintiffs can come forth with evidence that their works or portions thereof were theoretically ‘made available’ to others on the BitTorrent network during the torrent download process, this was part-and-parcel of the download of Plaintiffs’ works in furtherance of Meta’s transformative fair use purpose.”
In other words, obtaining the millions of books that were needed to engage in the fair use training of its LLM, required the direct downloading, which ultimately serves the same fair use purpose.
The authors were not happy with last week’s late Friday submission and the new defense. On Monday morning, their lawyers filed a letter with Judge Vince Chhabria flagging the late-night filing as an improper end-run around the discovery deadline.
They point out that Meta had been aware of the uploading claims since November 2024, but that it never brought up this fair use defense in the past, not even when the court asked about it.
The letter specifically mentions that while Meta has a “continuing duty” to supplement discovery under Rule 26(e), this rule does not create a “loophole” allowing a party to add new defenses to its advantage after a court deadline has passed.
“Meta (for understandable reasons) never once suggested it would assert a fair use defense to the uploading-based claims, including after this Court raised the issue with Meta last November,” the lawyers write.
Meta’s legal team fired back the following day, filing their own letter with Judge Chhabria. This letter explains that the fair use argument for the direct copyright infringement claim is not new at all.
Meta pointed to the parties’ joint December 2025 case management statement, in which it had explicitly flagged the defense, and noted that the author’s own attorney had addressed it at a court hearing days later.
“In short, Plaintiffs’ assertion that Meta ‘never once suggested it would assert a fair use defense to the uploading-based claims, including after’ the November 2025 hearing, is false” Meta’s attorney writes in the letter.
Meanwhile, it’s worth noting that Meta’s interrogatory response also cites deposition testimony from the authors themselves, using their own words to bolster its fair use defense.
The company notes that every named author has admitted they are unaware of any Meta model output that replicates content from their books. Sarah Silverman, when asked whether it mattered if Meta’s models never output language from her book, testified that “It doesn’t matter at all.”
Meta argues these admissions undercut any theory of market harm. If the authors themselves cannot point to infringing output or lost sales, the lawsuit is less about protecting their books and more about challenging the training process itself, which the court already ruled was fair use.
These admissions were central to Meta’s fair use defense on the training claims, which Meta won last summer. Whether they carry the same weight in the remaining BitTorrent distribution dispute has yet to be seen.
In its interrogatory response, Meta added further weight by stressing that its investment in AI has helped establish U.S. global leadership, putting the country ahead of geopolitical competitors. That’s a valuable asset worth treasuring, it indirectly suggested.
As the case moves forward, Judge Chhabria will have to decide whether to allow this “fair use by technical necessity” defense. Needless to say, this will be of vital importance to this and many other AI lawsuits, where the use of shadow libraries is at stake.
For now, the BitTorrent distribution claims remain the last live piece of a lawsuit filed in 2023.
A copy of Meta’s supplemental interrogatory response is available here (pdf). The authors’ letter to Judge Chhabria can be found here (pdf). Meta’s response to that letter is available here (pdf).
...
Read the original on torrentfreak.com »
I would like to suggest the addition to the standard library of a package to generate and parse UUID identifiers, specifically versions 3, 4 and 5.
The main reason I see to include it is that the most popular 3rd-party package (github.com/google/uuid) is a staple import in every server/db based Go program, as confirmed by a quick Github code search.
* The interface exposed by github.com/google/uuid has been stable for years.
Would like to point out that Go is the exception rather than the norm with regard to including UUID support in its standard library.
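For comparison, Python has shipped name-based and random UUIDs in its standard library for years. A minimal sketch of the three versions the proposal covers:

```python
import uuid

# Version 4: random.
u4 = uuid.uuid4()
assert u4.version == 4

# Versions 3 and 5: name-based (MD5 / SHA-1), deterministic per namespace+name.
u3 = uuid.uuid3(uuid.NAMESPACE_DNS, "example.com")
u5 = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")
assert u3 == uuid.uuid3(uuid.NAMESPACE_DNS, "example.com")

# Parsing round-trips through the canonical 8-4-4-4-12 string form.
parsed = uuid.UUID(str(u5))
assert parsed == u5 and parsed.version == 5
```

github.com/google/uuid exposes the equivalent constructors as uuid.New(), uuid.NewMD5(), uuid.NewSHA1(), and uuid.Parse(), which is roughly the surface a standard-library package would need.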
...
Read the original on github.com »
Bridge the gap between coding intent and action: manipulate syntax structures directly, avoiding mouse or keyboard gymnastics. Amplify your coding efficiency: wield multiple cursors for parallel syntax node operations, revolutionizing bulk edits and refactoring. Selection Modes standardize movements across words, lines, syntax nodes, and more, offering unprecedented flexibility and consistency.
...
Read the original on ki-editor.org »
Slack will be the Waterloo of closed data. For companies like Fivetran that are heavy users of Slack, it replaces email and even conversation as the primary place and mode of collaboration. Questions are asked and answered, arguments are had, decisions are made, all in Slack. Our Slack message history represents nothing less than the accumulated tribal knowledge of the company. And right now, that tribal knowledge is locked inside a product with the worst data access policies in enterprise software.

We need a new Slack, and Anthropic is the right company to build it.

Claude has a glaring limitation: it only does 1:1 conversations. In business, work happens in groups. Today, if I want Claude’s help with something that came up in a Slack thread, I have to relay the context between Slack and Claude by copy-pasting. This is absurd. I am not a sub-agent!

We need Claude and Claude Code, with their skills and plugins, with their context, to be first-class participants in our company’s Slack. But this problem can’t be solved by a Slack integration because of another problem: data access.

The most important repository of text data in many businesses lives in their Slack instances. It’s the unfiltered, real-time stream of how your company actually operates. The Slack text corpus is tribal knowledge reified.

Slack’s data access policy is basically “No.” Slack is simultaneously the most important source of context for AI agents in business and the most restricted API in enterprise software. This is unacceptable, and there is only one thing that will change the present state of affairs: competition. Vendors don’t provide open APIs out of the goodness of their hearts; they do it because their customers demand it, and the alternative to providing open data access is to lose those customers to competitors who do. Nothing will persuade enterprise software vendors of the wisdom of an open data strategy like a vivid demonstration of this principle.

Slack is more vulnerable than you think

The conventional wisdom is that Slack is unassailable because of network effects. This is wrong. Slack’s network effects are actually quite weak. The “network effect” of Slack is that you have some Slack Connect channels with a few close partners. This is valuable, but you can live without it. Claude-in-Slack would be a big enough benefit to balance out this cost.

Slack is also severely overpriced. If you’re a company of any meaningful size, you will need to support legal holds in Slack, which means you need Enterprise+. Fivetran pays almost as much for Slack as we pay for G Suite, which we use for everything. This does not make sense!

It would be a no-brainer to buy a seat for NewSlack + Claude for every employee. Every company has a long tail of casual AI users who don’t use AI enough to justify the per-seat price, but with NewSlack in the bundle, it would make sense to pay for a standard seat for every employee.

Bundling would solve another problem: every company has a faction of AI skeptics who aren’t using AI. NewSlack will be the ideal environment to win over these skeptics; their human coworkers will demonstrate how to use Claude in the group chat.

The commitment that would make it work

NewSlack needs to avoid repeating the mistakes of the past; it needs a credible commitment to open data access and interoperability with other similar systems, including competitors. Anthropic is uniquely positioned to do this because they have a demonstrated track record of standing by their principles under extraordinary pressure.
Anthropic could make a public commitment to permit open data access and to interoperate with similar systems, and they would be believed. Anthropic building a successful Slack competitor would fix the entire enterprise-data ecosystem. Slack would be Waterloo for closed data. The alternative — a world where the most important corpus of business communication is permanently locked behind a closed API — is bad for everyone.

So, Anthropic: please make a new Slack. The world needs it.
...
Read the original on www.fivetran.com »
US economy sheds 92,000 jobs in February in sharp slide.
...
Read the original on www.ft.com »
Transportation Security Administration (TSA) agents have often been in the news recently for strange reasons. For instance, in mid-February, a TSA worker allegedly scolded a woman for not wearing anything under her hoodie, which made headlines.
This time, the TSA is making waves again after a traveler filed a lawsuit against the agency. The plaintiff claims that she was forced to enter an Advanced Imaging Technology (AIT) device, which resulted in a severe injury.
A 12-page complaint filed by Kerry Thomas against the United States of America has recently surfaced. The events narrated in the complaint occurred nearly two years ago, on May 21, 2024, at Hartsfield-Jackson Atlanta International Airport (ATL).
According to the document, the plaintiff requested a pat-down search at the North TSA checkpoints at ATL to avoid screening in an AIT due to her spinal cord implant. These devices help alleviate pain and are implanted beneath the skin. They work by delivering electrical impulses to the spinal cord and can be destroyed by an AIT’s electromagnetic field.
Despite the woman’s request for a pat-down search, a TSA agent told her that her only option was to pass through the AIT device.
“After the Transportation Security Administration employee or agent ignored Plaintiff’s medical identification card and Plaintiff’s pleas to be screened via patdown, the Transportation and Security Administration employee or agent stated, ‘the only way you are getting on the plane is to go through the machine,’” the complaint reads.
Before passing through the device, the plaintiff spoke to another officer, trying to explain the situation, but was told that the AIT machine had been “adjusted” so that it would not damage her spinal cord implant.
When the passenger entered the AIT device, however, she immediately felt a shock from the electromagnetic charge, which destroyed her spinal cord implant, leaving her in pain. “As a result of Defendant’s negligence, Plaintiff suffered injuries and tangible damages and intangible damages, requiring medical treatment, including surgery,” the complaint reads.
Spinal cord stimulators are used to manage pain caused by a variety of health conditions, including chest pain, back pain, phantom limb pain, neuropathic pain, and spinal cord injuries, among others.
Following the incident, the plaintiff went through an adjudication process with the TSA. However, this proved unsuccessful, leading her to file a lawsuit against the agency. The woman is now seeking an unspecified amount in compensation for past, present, and future damages.
According to the complaint, TSA agents ignored rules that specifically require workers to offer a pat-down search for passengers with medical devices that screening machines may damage.
TSA Rules State Passengers With Internal Medical Devices Should Not Be Screened By A Walk-Through Metal Detector
While AITs are often harmless, they can damage medical devices such as pacemakers, defibrillators, and spinal cord implants. According to the TSA’s official website, passengers with internal medical devices should inform their TSA agent of their medical condition and request a pat-down instead.
“Inform the TSA officer that you have an artificial knee, hip, other metal implant or a pacemaker, defibrillator or other internal medical device. You should not be screened by a walk-through metal detector if you have an internal medical device such as a pacemaker. Consult with your physician prior to flying,” the TSA website states.
In this regard, the complaint states, “One or more of the Transportation Security Administration’s employees or agents knew or should have known that the machine in which Plaintiff was forced to enter had not been recalibrated or adjusted so as to not cause harm to her spinal cord stimulator.”
This wasn’t the first time TSA agents had been caught breaking their own rules.
TSA Has Recently Been Caught Breaking Its Own Rules
A passenger recently posted his encounter with a TSA agent on Instagram, sharing his experience as an amputee. He was first denied access to the TSA lane designated for travelers with mobility needs. The traveler was then publicly asked a question about his disability, in violation of the Americans with Disabilities Act of 1990 (ADA).
In addition, in the summer of 2025, TSA agents denied some passengers the right to opt out of facial recognition, despite the TSA’s official website claiming, “Passengers who have consented to participate may choose to opt-out at any time and instead go through the standard identity verification process by a Transportation Security Officer (TSO).”
These cases highlight the need for better training of TSA workers so they are fully aware of the agency’s internal regulations and other U.S. laws that may affect their work. At the moment, it remains unclear whether the traveler whose spinal cord implant was damaged will win the lawsuit and receive compensation.
...
Read the original on www.thetravel.com »
We’re releasing Sarvam 30B and Sarvam 105B as open-source models. Both are reasoning models trained from scratch on large-scale, high-quality datasets curated in-house across every stage of training: pre-training, supervised fine-tuning, and reinforcement learning. Training was conducted entirely in India on compute provided under the IndiaAI mission.
These models represent a true full-stack effort. Beyond datasets, we optimized tokenization, model architecture, execution kernels, scheduling, and inference systems to make deployment efficient across a wide range of hardware, from flagship GPUs to personal devices like laptops. Both models are already in production: Sarvam 30B powers Samvaad, our conversational agent platform, and Sarvam 105B powers Indus, our AI assistant built for complex reasoning and agentic workflows. The Sarvam models are globally competitive for their class. Sarvam 105B performs well on reasoning, programming, and agentic tasks across a wide range of benchmarks. Sarvam 30B is optimized for real-time deployment, with strong performance on real-world conversational use cases. Both models achieve state-of-the-art results on Indian language benchmarks, outperforming models significantly larger in size.

This release marks an important milestone for Sarvam. Building these models required developing end-to-end capability across data, training, inference, and product deployment. With that foundation in place, we are ready to scale to significantly larger and more capable models, including models specialised for coding, agentic, and multimodal conversational tasks.

You can experience Sarvam 105B on Indus. Both models are accessible via our API at the API dashboard. Weights can be downloaded from AI Kosh (30B, 105B) and Hugging Face (30B, 105B). If you want to run inference locally with Transformers, vLLM, and SGLang, please refer to the Hugging Face models page for sample implementations.

Both models share a common architectural principle: high-capacity reasoning with efficient training and deployment. At the core is a Mixture-of-Experts (MoE) Transformer backbone that uses sparse expert routing to scale parameter count without increasing the compute required per token, while keeping inference costs practical.
The architecture supports long-context inputs through rotary positional embeddings, RMSNorm-based stabilization, and attention designs optimized for efficient KV-cache usage during inference.

While the two models share the same design philosophy, they differ in scale and attention mechanism. Sarvam 30B uses Grouped Query Attention (GQA) to reduce KV-cache memory while maintaining strong performance. Sarvam 105B extends the architecture with greater depth and Multi-head Latent Attention (MLA), a compressed attention formulation that further reduces memory requirements for long-context inference. Both models use sparse expert feedforward layers with 128 experts, but differ in expert capacity and routing configuration. This allows the larger model to scale to higher total parameters while keeping active compute bounded.

All stages of the training pipeline were developed and executed in-house. This includes the model architecture, data curation and synthesis pipelines, reasoning supervision frameworks, and reinforcement learning infrastructure. Building everything from scratch gave us direct control over data quality, training dynamics, and capability development across every stage of training, which is a core requirement for a sovereign stack.

Our 30B and 105B models were trained on large datasets, with 16T tokens for the 30B and 12T tokens for the 105B. The pre-training data spans code, general web data, specialized knowledge corpora, mathematics, and multilingual content. After multiple ablations, the final training mixture was balanced to emphasize reasoning, factual grounding, and software capabilities. We invested significantly in synthetic data generation pipelines across all categories. The multilingual corpus allocates a substantial portion of the training budget to the 10 most-spoken Indian languages.

Pre-training was conducted in three phases, covering long-horizon pre-training, mid-training, and a long-context extension phase.
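The KV-cache savings that motivate choices like GQA (and, more aggressively, MLA) are easy to quantify. A rough sketch, with illustrative layer and head counts that are not Sarvam’s actual configuration:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Two cached tensors per layer (K and V), each [seq_len, kv_heads, head_dim].
    return 2 * layers * seq_len * kv_heads * head_dim * bytes_per_elem

# Illustrative 30B-class config: 48 layers, head_dim 128, 32K context, fp16.
full_mha = kv_cache_bytes(layers=48, kv_heads=32, head_dim=128, seq_len=32_768)
gqa = kv_cache_bytes(layers=48, kv_heads=8, head_dim=128, seq_len=32_768)

# Shrinking 32 KV heads to 8 groups cuts the cache 4x (24 GiB -> 6 GiB here);
# MLA goes further by caching a compressed latent instead of full K/V heads.
assert full_mha == 4 * gqa
```

Since the cache grows linearly with sequence length, these ratios matter most exactly where the models are pitched: long-context inference.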
We used sigmoid-based routing scores rather than traditional softmax gating, which improves expert load balancing and reduces routing collapse during training. An expert-bias term stabilizes routing dynamics and encourages more uniform expert utilization across training steps. We observed that the 105B model achieved benchmark superiority over the 30B remarkably early in training, suggesting efficient scaling behavior.

During supervised fine-tuning, the model is trained on a large corpus of high-quality prompts curated for difficulty, quality, and domain diversity. Prompts are sourced from open datasets and labeled using custom models to identify domains and analyze distribution coverage. To address gaps in underrepresented or low-difficulty areas, additional prompts are synthetically generated based on the pre-training domain mixture. Empirical analysis showed that most publicly available datasets are dominated by low-quality, homogeneous, and easy prompts, which limits continued learning. To mitigate this, we invested significant effort in building high-quality prompts across domains. All corresponding completions are produced internally and passed through rigorous quality filtering. The dataset also includes extensive agentic traces generated from both simulated environments and real-world repositories, enabling the model to learn tool interaction, environment reasoning, and multi-step decision making.

For safety fine-tuning, we developed a dataset covering both standard and India-specific risk scenarios. This effort was guided by a unified taxonomy and an internal model specification inspired by public frontier model constitutions. To surface and address challenging failure modes, the dataset was further augmented with adversarial and jailbreak-style prompts mined through automated red-teaming.
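The sigmoid-plus-bias routing described earlier can be sketched as follows. This is a simplified illustration, not Sarvam’s implementation; the expert count, top-k, and exact role of the bias term are assumptions:

```python
import math

def route(logits, bias, top_k=2):
    """Sigmoid-scored top-k routing with an additive expert bias.

    The bias only steers *which* experts are selected (nudging traffic
    toward underused ones); combination weights are normalized from the
    unbiased scores. All shapes here are illustrative.
    """
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]  # per-expert sigmoid
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}  # normalized gate weights

# Expert 1 is underused, so its bias is raised; expert 2 is overused.
weights = route(logits=[0.5, -1.0, 2.0, 0.1], bias=[0.0, 0.3, -0.2, 0.0])
assert abs(sum(weights.values()) - 1.0) < 1e-9
```

One intuition for the load-balancing claim: sigmoid scores are independent per expert, so no single expert can monopolize the probability mass the way a shared softmax allows.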
These prompts were paired with policy-aligned, safe completions for supervised training.

The reinforcement learning stage uses a large and diverse prompt distribution spanning mathematics, coding, STEM reasoning, web search, and tool usage across both single-turn and multi-turn environments. Rewards are derived from a combination of verifiable signals, such as correctness checks and execution results, and rubric-based evaluations that assess instruction adherence, formatting, response structure, and overall quality. To maintain an effective learning curriculum, prompts are pre-filtered using open-source models and early checkpoints to remove tasks that are either trivially solvable or consistently unsolved. During training, an adaptive sampling mechanism dynamically allocates rollouts based on an information-gain metric derived from the current pass rate of each prompt. Under a fixed generation budget, rollout allocation is formulated as a knapsack-style optimization, concentrating compute on tasks near the model’s capability frontier where learning signal is strongest.

The RL system is implemented with an asynchronous GRPO architecture that decouples generation, reward computation, and policy updates, enabling efficient large-scale training while maintaining high GPU utilization. Trajectory staleness is controlled by limiting the age of sampled trajectories relative to policy updates, balancing throughput with training stability. The system omits KL-divergence regularization against a reference model, avoiding the optimization conflict between reward maximization and policy anchoring. Policy optimization instead uses a custom group-relative objective inspired by CISPO, which improves stability over standard clipped surrogate methods.
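A toy version of the information-gain rollout allocation might look like this. The Bernoulli-variance gain metric and the greedy knapsack are illustrative assumptions, not the production scheduler:

```python
def allocate_rollouts(pass_rates, budget, per_prompt=8):
    """Greedy knapsack over an information-gain score (illustrative only).

    Gain is approximated by the Bernoulli variance p * (1 - p) of each
    prompt's current pass rate: prompts the model sometimes-but-not-always
    solves sit at the capability frontier and get rollouts first.
    """
    gain = {i: p * (1.0 - p) for i, p in enumerate(pass_rates)}
    alloc = {i: 0 for i in gain}
    for i in sorted(gain, key=gain.get, reverse=True):
        if budget < per_prompt or gain[i] == 0.0:
            break  # budget exhausted, or remaining prompts carry no signal
        alloc[i] = per_prompt
        budget -= per_prompt
    return alloc

# Pass rates 0.0 and 1.0 (never/always solved) yield zero gain and are skipped.
alloc = allocate_rollouts([0.0, 0.5, 0.9, 1.0], budget=32)
assert alloc == {0: 0, 1: 8, 2: 8, 3: 0}
```

This matches the pre-filtering described above: trivially solved and consistently unsolved prompts contribute no gradient signal, so a variance-style score naturally routes compute away from them.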
Reward shaping further encourages structured reasoning, concise responses, and correct tool usage, producing a stable RL pipeline suitable for large-scale MoE training with consistent learning and no evidence of reward collapse.

Sarvam 105B matches or outperforms most open and closed-source frontier models of its class across knowledge, reasoning, and agentic benchmarks. On Indian language benchmarks, it significantly outperforms all models we evaluated.

Sarvam 105B shows strong, balanced performance across core capabilities including mathematics, coding, knowledge, and instruction following. It achieves 98.6 on Math500, matching the top models in the comparison, and 71.7 on LiveCodeBench v6, outperforming most competitors on real-world coding tasks. On knowledge benchmarks, it scores 90.6 on MMLU and 81.7 on MMLU Pro, remaining competitive with frontier-class systems. With 84.8 on IF Eval, the model demonstrates a well-rounded capability profile across the major workloads expected of modern language models.

Sarvam 105B performs strongly on multi-step reasoning benchmarks, reflecting the training emphasis on complex problem solving. On AIME 25, the model achieves 88.3 Pass@1, improving to 96.7 with tool use, indicating effective integration between reasoning and external tools. It scores 78.7 on GPQA Diamond and 85.8 on HMMT, outperforming several comparable models on both. On Beyond AIME (69.1), which requires deeper reasoning chains and harder mathematical decomposition, the model leads or matches the comparison set. Taken together, these results reflect consistent strength in sustained reasoning and difficult problem-solving tasks.

Sarvam 105B is optimized for agentic workloads involving tool use, long-horizon reasoning, and environment interaction. This is reflected in strong results on benchmarks designed to approximate real-world workflows. On BrowseComp, the model achieves 49.5, outperforming several competitors on web-search-driven tasks.
On Tau2 (avg.), a benchmark measuring long-horizon agentic reasoning and task completion, it achieves 68.3, the highest score among the compared models. These results indicate that the model can effectively plan, retrieve information, and maintain coherent reasoning across extended multi-step interactions.

A useful comparison is within the same scaling regime, since training compute, dataset size, and infrastructure scale increase dramatically with each generation of frontier models. The newest models from other labs are trained with significantly larger clusters and budgets. Across a range of previous-generation models that are substantially larger, Sarvam 105B remains competitive. We have now established the effectiveness of our training and data pipelines, and will scale training to significantly larger model sizes.

Sarvam 30B is designed as an efficient reasoning model for practical deployment, combining strong capability with low active compute. With only 2.4B active parameters, it performs competitively with much larger dense and MoE models across a wide range of benchmarks. The evaluations below highlight its strengths across general capability, multi-step reasoning, and agentic tasks, indicating that the model delivers strong real-world performance while remaining efficient to run.

Sarvam 30B — All Benchmarks (Gemma and Mistral are compared for completeness. Since they are not reasoning or agentic models, corresponding cells are left empty.)

Sarvam 30B performs strongly across core language modeling tasks, particularly in mathematics, coding, and knowledge benchmarks. It achieves 97.0 on Math500, matching or exceeding several larger models in its class. On coding benchmarks, it scores 92.1 on HumanEval, 92.7 on MBPP, and 70.0 on LiveCodeBench v6, outperforming many similarly sized models on practical coding tasks.
On knowledge benchmarks, it scores 85.1 on MMLU and 80.0 on MMLU Pro, remaining competitive with other leading open models.

Sarvam 30B performs strongly on multi-step reasoning benchmarks, reflecting its ability to handle complex logical and mathematical problems. On AIME 25, it achieves 88.3 Pass@1, improving to 96.7 with tool use, indicating effective integration between reasoning and external tools. It scores 66.5 on GPQA Diamond and performs well on challenging mathematical benchmarks including HMMT Feb 2025 (73.3) and HMMT Nov 2025 (74.2). On Beyond AIME (58.3), the model remains competitive with larger models. Taken together, these results indicate that Sarvam 30B sustains deep reasoning chains and expert-level problem solving, significantly exceeding typical expectations for models with similar active compute.

Sarvam 30B supports native tool calling and performs consistently on benchmarks designed to evaluate agentic workflows involving planning, retrieval, and multi-step task execution. On BrowseComp, it achieves 35.5, outperforming several comparable models on web-search-driven tasks. On Tau2 (avg.), it achieves 45.7, indicating reliable performance across extended interactions. SWE-Bench Verified remains challenging across models; Sarvam 30B shows competitive performance within its class. Taken together, these results indicate that the model is well suited for real-world agentic deployments requiring efficient tool use and structured task execution, particularly in production environments where inference efficiency is critical.

To evaluate Indian language capabilities, we developed a new benchmark using a pairwise comparison framework with an LLM-as-judge protocol. A key goal of this benchmark is to reflect how language is actually used in India today.
This means evaluating each language in two script styles: native script representing formal written usage, and romanized Latin script representing colloquial usage commonly seen in messaging and online communication.

The benchmark is organized into four domains: general chat, STEM, mathematics, and coding. It originates from 110 English source prompts, with 50 covering general chat and 20 each for STEM, mathematics, and coding. Each prompt is translated into 22 scheduled Indian languages and provided in both native and romanized script. Evaluating correctness for complex reasoning prompts directly in low-resource languages can be noisy and inconsistent. To address this, we generated high-quality reference answers in English using Claude Opus 4, which are used only to evaluate the usefulness dimension, covering relevance, completeness, and correctness, for answers generated in Indian languages.

The evaluation uses a pairwise comparison methodology with Gemini 3 as the judge model. The judge evaluates responses across four dimensions: fluency, language/script correctness, usefulness, and verbosity. The evaluation dataset and corresponding prompts are available here.

Sarvam 105B wins on average 90% of comparisons across all benchmarked dimensions and on average 84% on STEM, math, and coding. Sarvam 30B wins on average 89% of comparisons across all benchmarked dimensions and 87% on STEM, mathematics, and coding.

The Sarvam tokenizer is optimized for efficient tokenization across all 22 scheduled Indian languages, spanning 12 different scripts, directly reducing the cost and latency of serving in Indian languages. It outperforms other open-source tokenizers in encoding Indic text efficiently, as measured by the fertility score, which is the average number of tokens required to represent a word. It is significantly more efficient for low-resource languages such as Odia, Santali, and Manipuri (Meitei) compared to other tokenizers.
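Fertility, as defined above, is straightforward to measure. A minimal sketch with a stand-in byte-level tokenizer (any real tokenizer's encode function would slot in):

```python
def fertility(encode, words):
    """Average number of tokens per word; lower is better."""
    return sum(len(encode(w)) for w in words) / len(words)

# Stand-in "tokenizer": one token per UTF-8 byte. Devanagari characters
# take 3 bytes each, so a byte-level scheme is far more expensive for
# Hindi than for English, which is the gap an Indic-optimized vocabulary closes.
byte_tok = lambda w: list(w.encode("utf-8"))
english = ["hello", "world"]   # 5 bytes per word
hindi = ["नमस्ते", "दुनिया"]      # 6 Devanagari codepoints = 18 bytes per word

assert fertility(byte_tok, english) == 5.0
assert fertility(byte_tok, hindi) == 18.0
```

Lower fertility translates directly into cost and latency: a prompt that needs fewer tokens per word consumes proportionally less compute and KV-cache at serving time.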
The chart below shows the average fertility of various tokenizers across English and all 22 scheduled languages.

Sarvam 30B was built with an inference optimization stack designed to maximize throughput across deployment tiers, from flagship data-center GPUs to developer laptops. Rather than relying on standard serving implementations, the inference pipeline was rebuilt using architecture-aware fused kernels, optimized scheduling, and disaggregated serving. Microsecond-level profiling of the execution stack identified memory stalls, kernel launch overhead, and inefficient scheduling as primary bottlenecks. Addressing these yielded substantial throughput improvements across all hardware classes and sequence lengths. The optimization strategy focuses on three key components:

- Kernel-level rewrites using fused attention and matmul pipelines tailored for each hardware target
- Advanced scheduling and batching strategies that improve GPU utilization under realistic multi-user loads
- Disaggregated serving pipelines that remove bottlenecks between prefill and decode stages

These optimizations yield significantly higher tokens per second per GPU at the same latency targets, enabling higher user concurrency and lower infrastructure costs. On H100-class infrastructure, Sarvam 30B achieves substantially higher throughput per GPU across all sequence lengths and request rates compared to the Qwen3 baseline, consistently delivering 3x to 6x higher throughput per GPU at equivalent tokens-per-second-per-user operating points.

Sarvam 30B runs efficiently on mid-tier accelerators such as the L40S, enabling production deployments without relying on premium GPUs. Under tighter compute and memory bandwidth constraints, the optimized kernels and scheduling strategies deliver 1.5x to 3x throughput improvements at typical operating points.
The improvements are more pronounced at longer input and output sequence lengths (28K / 4K), where most real-world inference requests fall.

Sarvam 30B is also optimized for local execution on Apple Silicon systems using MXFP4 mixed-precision inference. On a MacBook Pro M3, the optimized runtime achieves 20 to 40% higher token throughput across common sequence lengths. These improvements make local experimentation significantly more responsive and enable lightweight edge deployments without requiring dedicated accelerators.

Sarvam 105B is optimized for server-centric hardware, following a similar process to the one described above with special focus on MLA (Multi-head Latent Attention) optimizations. These include custom-shaped MLA optimization, vocabulary parallelism, advanced scheduling strategies, and disaggregated serving. The comparisons above illustrate the performance advantage across various input and output sizes on an H100 node.

Combined with the efficient Indic tokenizer, the performance delta increases significantly for the same SLA. For the 30B model, the delta increases by as much as 10x, reaching performance levels previously not achievable for models of this class on Indic generation.

The following demonstrations show the practical capabilities of the Sarvam model family across real-world applications, spanning webpage generation, multilingual conversational agents, complex STEM problem solving, and educational tutoring.
The examples reflect the models’ strengths in reasoning, tool usage, multilingual understanding, and end-to-end task execution, and illustrate how Sarvam models can be integrated into production systems to build interactive applications, intelligent assistants, and developer tools.

The widgets below demonstrate Sarvam 105B’s agentic capabilities through end-to-end project generation using a Claude Code harness, showing the model’s ability to build complete websites from a simple prompt specification.

A fully interactive Pokédex web app, generated entirely by our 105B model from a single prompt. Search, filter by type, and browse detailed stats. The goal was to generate a complete, production-ready webpage including all HTML, CSS, and JavaScript required to run the application without frameworks or build tools. The model used the PokéAPI to dynamically load Pokémon data, implementing pagination, search, filtering, and a detailed modal view, all from the prompt shown below.

A complete website landing page, designed and coded by our 105B model in a single pass. Scroll through to explore the full layout, animations, and interactions. The task was to build a complete website for Sarvam, capturing the spirit of an Indian AI company building for a billion people while matching a world-class visual standard across typography, motion, layout, and interaction design. The full prompt is shown below.

Sarvam 105B was evaluated on the JEE Main 2026 paper from Shift 2, conducted on 28 January 2026, to demonstrate its STEM reasoning capabilities.
The question paper and solutions were sourced from: https://allen.in/jee-main/january-2026-question-paper-with-solutions

The evaluation was carried out in two phases:

- Text-Only Evaluation: For text-only questions, Sarvam 105B was evaluated directly on questions containing purely textual content.
- Diagram-Based Evaluation: For questions that included diagrams, Gemini-3-Pro was used to generate structured textual descriptions of the visuals, which were then provided as input to Sarvam 105B for answer generation.

The tables below summarize Sarvam 105B’s performance across Physics, Chemistry, and Mathematics under Pass@1 and Pass@2 evaluation settings. Under Pass@1, the model shows strong first-attempt accuracy across all subjects. In Mathematics, it achieves a perfect 25/25. In Chemistry, it scores 23/25, with near-perfect performance on both text-only and diagram-derived questions. Physics shows similarly strong performance at 22/25, with most errors occurring in diagram-based reasoning. Under Pass@2, performance improves to perfect scores across all subjects. Physics improves from 22/25 to 25/25, Chemistry from 23/25 to 25/25, and Mathematics maintains a perfect 25/25. Diagram-based questions in both Physics and Chemistry achieve full marks at Pass@2, indicating that the model reliably resolves visual reasoning tasks when given structured textual representations.

Tutor Mode is an internal project where the Indus stack operates with a system prompt optimized for student-teacher conversations. The example below shows Sarvam 105B helping a student solve a JEE problem through interactive dialog rather than providing the answer directly. The model guides the student by asking probing questions, building toward the underlying concepts before arriving at the answer.
This also demonstrates the model’s role-playing ability. Note: the questions below are taken from the same JEE Mains paper solved above.

the mean free path of a molecule of diameter 5 x 10^-10 m at the temperature 41°c and pressure 1.38 x 10^5 pa, is given as ____ m. (given k_b = 1.38 x 10^-23 j/k).
Hi there! I see you’re working on a problem about the mean free path of a gas molecule—that’s a classic concept in kinetic theory.
Before we dive into the math, could you let me know which grade you’re in? Also, when you hear the term “mean free path,” what do you think it depends on? For example, if you imagine molecules in a gas, what physical factors would make it harder for a molecule to travel a long distance without hitting something?

im not really sure about the concepts behind this. im preparing for jee mains and this topic always confuses me.

No worries! JEE Mains problems often look more intimidating than they actually are. Let’s break it down.
The mean free path (λ) is simply the average distance a molecule travels between two successive collisions. Think of it like walking through a crowded room; how far you can get before bumping into someone depends on a few things you already intuitively know.
Size of molecules (d): Bigger molecules are easier to hit.
Density/Number of molecules: More people in the room means more bumps.
Temperature (T) and Pressure (P): These dictate how packed the molecules are.
If you had to guess, would the distance between hits (λ) be larger or smaller if the pressure (P) increased?

i think if the pressure is higher, the molecules are packed tighter, so they would hit each other more often. that should make the distance smaller, right?

Spot on! Your intuition is leading you exactly where we need to go.
Here is how those factors relate mathematically:
λ ∝ 1/d²: If the molecule is twice as wide, it’s actually four times more likely to collide (because the area it occupies matters).
λ ∝ 1/P: Higher pressure means molecules are squeezed together, leading to more frequent collisions.
λ ∝ T: At higher temperatures (for a fixed pressure), gas expands and molecules move further apart.
You mentioned knowing PV = nRT. We can actually use that to find the formula for λ. Since we are looking for a formula involving diameter (d), pressure (P), and temperature (T), let’s try to visualize the “collision zone” first.

i know pv = nrt, but i cant remember the specific formula for mean free path. how do we get from one to the other?

That’s a great starting point because PV = nRT is the heart of gas behavior!
Let’s visualize why a molecule collides. Imagine a molecule with diameter d moving through space. It will hit any other molecule whose center comes within a distance d of its own center.
This means our molecule effectively acts like a “bulldozer” with an effective diameter of 2d. If any other molecule’s center falls within this “danger zone,” a collision happens.
If the effective collision diameter is 2d, what would be the cross-sectional area of that “danger zone” circle? (Recall the area of a circle is πr².)

Close! While the “danger zone” diameter is 2d, the actual radius involved for the center-to-center hit is d.
Now, imagine this molecule zips forward. It sweeps out an imaginary cylinder. Any molecule inside this cylinder gets hit.
In the derivation, we find that the mean free path is inversely proportional to this area and the number of molecules per unit volume (n). However, because all molecules are moving (not just one), we add a factor of √2 to account for the average relative velocity.
This gives us the final JEE formula:

λ = k_B T / (√2 π d² P)
Do you see where the values from your question (d, T, P, and k_B) fit into this?

oh, i see! but the question gives kb as 1.38 x 10^-23. where does that go in the calculation?

It fits perfectly! The k_b in the question is the Boltzmann constant, and it sits right in the numerator of the formula λ = k_B T / (√2 π d² P).
By plugging the values in, the units will cancel out to give you the distance in meters. Let’s list what we have:

- d = 5 × 10⁻¹⁰ m
- T = 41 °C (to be converted to Kelvin)
- P = 1.38 × 10⁵ Pa
- k_B = 1.38 × 10⁻²³ J/K
Before we calculate, we must convert the temperature to Kelvin. Do you remember how to turn Celsius into Kelvin?

yes, i add 273. so 41 + 273 = 314 k. now i just plug them all in?

Exactly! You’ve got the temperature right (T = 314 K, or 314.15 K for precision).
Now, here is a pro-tip for JEE math: look for things that cancel out. Notice that k_B is 1.38 × 10⁻²³ and P is 1.38 × 10⁵.
When you put them in the formula:
The 1.38 in the top and bottom will cancel out immediately! That makes your life much easier.
Would you like to try simplifying the powers of 10 next? What do you get for the denominator’s power of 10 when you square d (5 × 10⁻¹⁰ m)?

i tried calculating it all and i think it simplifies to something like 2.82 x 10^-8. does that mean the answer is option c?

You nailed it! Option C (2.82 × 10⁻⁸ m) is correct. 🎉
When you finish the calculation, you get approximately 2.82 × 10⁻⁸ m, which is indeed option C.
Used the corrected mean free path formula λ = k_B T / (√2 π d² P).
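The arithmetic in the dialogue above can be checked directly with the standard kinetic-theory formula λ = k_B T / (√2 π d² P) and the values from the problem statement:

```python
import math

# Values from the problem statement
k_B = 1.38e-23   # Boltzmann constant, J/K
T = 41 + 273     # temperature converted to Kelvin: 314 K
d = 5e-10        # molecular diameter, m
P = 1.38e5       # pressure, Pa

# Mean free path of an ideal-gas molecule:
# lambda = k_B * T / (sqrt(2) * pi * d^2 * P)
lam = k_B * T / (math.sqrt(2) * math.pi * d ** 2 * P)
print(f"lambda = {lam:.2e} m")  # ~2.8e-08 m, consistent with option C
```

The √2 factor accounts for the average relative velocity of the molecules, as explained in the derivation; the 1.38 in k_B and P cancels exactly as the tutor notes.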
Would you like me to find another practice problem on RMS velocity or Graham’s Law to keep this momentum going?

The coding capabilities of Sarvam 30B and Sarvam 105B were evaluated using real-world competitive programming problems from Codeforces (Div3, link). The evaluation involved generating Python solutions and manually submitting them to the Codeforces platform to verify correctness. Correctness is measured at pass@1 and pass@4 as shown in the table below. Answers are generated using the following system prompt, with code snippets extracted from markdown fences and think tokens stripped from within the reasoning tags. The Codeforces contest used for this evaluation took place in February 2026, while the knowledge cutoff of both models is June 2025, making it unlikely that the models had seen these questions. Strong performance in this setting provides evidence of genuine generalization and real problem-solving capability.

Sarvam 30B has been fine-tuned for production deployment of conversational agents on Samvaad, Sarvam’s Conversational AI platform. Compared to models of similar size, it shows clear performance improvements in both conversational quality and latency. Key strengths include strong proficiency in Indian languages, particularly accurate handling of numerical information within those languages, and reliable execution of tool calls during multilingual interactions. Latency gains come from a combination of fewer active parameters than comparable models, targeted inference optimizations, and reduced tokenizer overhead. The two examples below show telephonic conversations handled by Sarvam 30B in Hindi and Tamil.

Sarvam 105B powers Indus, Sarvam’s chat application, operating with a system prompt optimized for conversations. The example demonstrates the model’s ability to understand Indic queries, execute tool calls effectively, and reason accurately.
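The extraction step in the Codeforces harness described above (pull the final fenced code block out of a model response, dropping the reasoning trace) can be sketched roughly as follows. This is a guess at the mechanics, since the post does not publish its harness code, and the `<think>` tag name is an assumption:

```python
import re

def extract_solution(response: str) -> str:
    """Return the last markdown-fenced code block, minus reasoning tokens.

    Assumes reasoning is wrapped in <think>...</think> tags and the
    solution is inside a ```python fence; both are guesses at the
    harness described in the post, not its actual implementation.
    """
    # Drop everything inside the reasoning tags, including newlines.
    cleaned = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    # Collect fenced code blocks; the last one is taken as the answer.
    blocks = re.findall(r"```(?:python)?\s*\n(.*?)```", cleaned, flags=re.DOTALL)
    return blocks[-1].strip() if blocks else cleaned.strip()
```

Taking the last block rather than the first is a common convention when models emit scratch snippets before the final solution.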
Web search is conducted in English to access current and comprehensive information, while the model interprets the query and delivers a correct response in Telugu.

[Embedded widget: the demo’s English web-search tool calls and their top results, covering pickleball courts in Vijayawada, pickleball equipment retailers in India, common beginner mistakes and rules, and venue booking listings in Vijayawada.]
Sarvam 30B and Sarvam 105B represent a significant step in building high-performance, open foundation models in India. By combining efficient Mixture-of-Experts architectures with large-scale, high-quality training data and deep optimization across the entire stack, from tokenizer design to inference efficiency, both models deliver strong reasoning, coding, and agentic capabilities while remaining practical to deploy.

A defining strength of the Sarvam model family is its investment in the Indian AI ecosystem, reflected in strong performance across Indian languages, tokenization optimized for diverse scripts, and safety and evaluation tailored to India-specific contexts. Combined with Apache 2.0 open-source availability, these models serve as foundational infrastructure for sovereign AI development.

This release also marks a milestone in internal capabilities. Through this effort, Sarvam has developed the know-how to build high-quality datasets at scale, train large models efficiently, and achieve strong results at competitive training budgets. With these foundations in place, the next step is to scale further, training significantly larger and more capable models.

These models were trained using compute provided through the IndiaAI Mission, under the Ministry of Electronics and Information Technology, Government of India. Nvidia collaborated closely on the project, contributing libraries used across pre-training, alignment, and serving. We’re also grateful to the developers who used earlier Sarvam models and took the time to share feedback. We’re open-sourcing these models as part of our ongoing work to build foundational AI infrastructure in India.
...
Read the original on www.sarvam.ai »
I have never been an “online community first” person. The internet is how I stay in touch with people I met in real life. I’m not a “tweet comments at celebrities” guy. I was never funny enough to be the funniest person on Twitter.
So when Twitter was accidentally purchased by a fascist high on ketamine, I moved to Mastodon mostly because it seemed to be “Twitter without the bullshit”. No recommended for you feed, no ads, it was broken in a way I find charming. Of course search was broken because all OSS social tools must have one glaring lack of functionality. In a nightmare world full of constant change it’s good to have a few constants to hold on to.
A lot of the narrative at the time was “this is our flag in the ground in the fight against The Man”. It wasn’t clear in this context if they meant corporations or the media or the weird pseudo celebrity that had taken over social media where people would breathlessly tell me about shit like “Chris-Chan” and “Logan Paul bought a Pokemon card”.
We all need pointless hobbies, but I care about YouTube stars like I care about distant stars dying. It’s interesting to someone somewhere, but those people don’t talk to me. I mostly use social media as a place to waste time, not a platform for forming parasocial relationships with narcissists. I prefer my narcissism farm to table. I’d rather dig a grave with a rusty spoon than watch a Twitch “star”.
Anyway, I watched mostly apathetically as the internet tried to rally itself to another cause. I read my news at the normal newspapers, watched my normal television and put social media off into its own silo. Then Trump effectively shut down the entire free press in the US in a series of bullshit lawsuits.
See I had forgotten the one golden rule of capitalism. To thrive in capitalism one must be amoral. Now you can be wildly sickeningly successful with morals but you cannot reach that absolute zenith of shareholder value. Either you accept a lower share price and don’t commit atrocities or you become evil. There is no third option.
So of course media corporations became bargaining chips for the oligarchs’ actual businesses. Why fight a defamation suit when you can settle it by running favorable coverage and maybe bankrupting the media outlet you bought as a stocking stuffer? Suddenly I couldn’t find any reliable reporting about anything in the US. My beloved Washington Post became straight-up propaganda and desperate attempts to cope. “Best winter stews to make while you watch your neighbors get kidnapped at gunpoint.” Twelve dollars a month for that.
Threads was worthless because it’s the most boring social media website ever imagined. It’s a social media network designed by brands for brands, like if someone made a cable channel that was just advertisements and meta commentary about the advertisements you just saw. Billions of dollars at their disposal and Meta made a hot new social media network with the appeal of junk mail.
Bluesky had a bunch of “stuff” but they’re trying to capture that 2008 Twitter lightning in a bottle, which is a giant waste of time. We’re never going to go back to pretending that tweeting at politicians does anything, and everyone there is desperately trying to build a “brand” as the funny one or whatever. I want news; I don’t want your endless meta commentary on the news.
People talk a lot about the protocols that power Bluesky vs. ActivityPub, because we’re nerds and we believe deep in our hearts that the superior protocol will win. This is adorable. It flies in the face of literally all of human history, where the more convenient thing always wins regardless of technical merit. VHS beat Betamax. USB-C took twenty years. The protocol fight is interesting the way medieval siege warfare is interesting — I’m glad someone’s into it, but it has no bearing on my life. There’s no actual plan to self-host Bluesky. Their protocol makes it easier to scale their service. That’s why it was written and that’s what it does. End of story.
Now EU news remained reliable, but sending European reporters into the madness of the US and trying to get a “report” out of it is an exercise in frustration. This became especially relevant for me when Trump threatened to invade Greenland and suddenly there was a distinct possibility that there might be an armed conflict between Denmark and the US. Danish reporters weren’t getting meetings with the right people and it was just endless rumors and Truth Social nonsense.
If the American press had given me 20 minutes of airtime I could have convinced everyone they don’t want to get involved with Greenland. We’re not tough enough as a people to survive in Greenland, much less “take it over”. Greenlandic people shrug off horrific injuries hundreds of kilometers from medical help with a smile. I watched a Greenlandic toddler munch meat from the spine of a seal with its head very much intact. We aren’t equipped to fuck with these people, they are the real deal.
So into this complete breakdown of the press came the Fediverse. It became the only reliable source of information I had. People posted links with a minimal amount of commentary, picking and choosing the best content from other social media networks. They’re not doing it to “build a brand” because that’s not a thing in the Fediverse. It’s too disjointed to be a place to build a newsletter subscription base.
Instead it became the only place consistently posting trustworthy information I could actually access. “Trump threatened to invade Greenland” is the kind of sentence I never expected to type, and yet here we are. It would be funny if I wasn’t a tiny bit concerned that my new home was going to get a CIA overnight regime-change special in the middle of the night.
It was somewhere in the middle of DMing with someone who had forgotten more about Greenland than I would ever know and someone who lived close to an RAF base in the UK that it clicked. This was what they had been talking about. Actual human beings were able to find each other and ask direct questions without this giant mountain of bullshit engagement piled on top of it. Meta or Oracle or whoever owns TikTok this week couldn’t stop me.
I never expected to find my news from strangers on a federated social network that half the internet has never heard of. I never expected a lot of things. But there’s something quietly beautiful about a place where people just… share what they know. No brand deals, no engagement metrics, no algorithm nudging you toward rage. Just someone who spent twenty years studying Arctic policy posting a thread at 2 AM because they think you should understand what’s happening. It’s the internet I was promised in 1996. It only took thirty years and the complete collapse of American journalism to get here.
...
Read the original on matduggan.com »