10 interesting stories served every morning and every evening.
Computational Complexity and other fun stuff in math and computer science from Lance Fortnow and Bill Gasarch
...
Read the original on blog.computationalcomplexity.org »
New U.S. laws designed to protect minors are pulling millions of adult Americans into mandatory age-verification gates to access online content, leading to backlash from users and criticism from privacy advocates who warn that a free and open internet is at stake. Roughly half of U.S. states have enacted or are advancing laws requiring platforms — including adult content sites, online gaming services, and social media apps — to block underage users, forcing companies to screen everyone who approaches these digital gates.
“There’s a big spectrum,” said Joe Kaufmann, global head of privacy at Jumio, one of the largest digital identity-verification and authentication platforms. He explained that the patchwork of state laws varies in technical demands and compliance expectations. “The regulations are moving in many different directions at once,” he said.
Social media company Discord announced plans in February to roll out mandatory age verification globally, which the company said would rely on verification methods designed so facial analysis occurs on a user’s device and submitted data would be deleted immediately. The proposal quickly drew backlash from users concerned about having to submit selfies or government IDs to access certain features, which led Discord to delay the launch until the second half of this year.
“Let me be upfront: we knew this rollout was going to be controversial. Any time you introduce something that touches identity and verification, people are going to have strong feelings,” Discord chief technology officer and co-founder Stanislav Vishnevskiy wrote in a Feb. 24 blog post.
Websites offering adult content, gambling, or financial services often rely on full identity verification that requires scanning a government ID and matching it to a live image. But most of the verification systems powering these checkpoints — often run by specialized identity-verification vendors on behalf of websites — rely on artificial intelligence such as facial recognition and age-estimation models that analyze selfies or video to determine in seconds whether someone is old enough to access content. Social media and lower-risk services may use lighter estimation tools designed to confirm age without permanently storing detailed identity records.
Vendors say a challenge is balancing safety with how much friction users will tolerate. “We’re in the business of ensuring that you are absolutely keeping minors safe and out and able to let adults in with as little friction as possible,” said Rivka Gewirtz Little, chief growth officer at identity-verification platform Socure. Excessive data collection, she added, creates friction that users resist.
Still, many users perceive mandatory identity checks as invasive. “Having another way to be forced to provide that information is intrusive to people,” said Heidi Howard Tandy, a partner at Berger Singerman who specializes in intellectual property and internet law. Some users may attempt workarounds — including prepaid cards or alternative credentials — or turn to unauthorized distribution channels. “It’s going to cause a piracy situation,” she added.
In many implementations, verification vendors — not the websites themselves — process and retain the identity information, returning only a pass-fail signal to the platform.
Gewirtz Little said Socure does not sell verification data and that in lightweight age-estimation scenarios, where platforms use quick facial analysis or other signals rather than government documentation, the company may store little or no information. But in fuller identity-verification contexts, such as gaming and fraud prevention that require ID scans, certain adult verification records may be retained to document compliance. She said Socure can keep some adult verification data for up to three years while following applicable privacy and purging rules.
Civil liberties advocates warn that concentrating large volumes of identity data among a small number of verification vendors can create attractive targets for hackers and government demands. Earlier this year, Discord disclosed a data breach that exposed ID images belonging to approximately 70,000 users through a compromised third-party service, highlighting the security risks associated with storing sensitive identity information.
In addition, they warn that the expansion of age-verification systems represents not only a usability challenge but a structural shift in how identity becomes tied to online behavior. Age verification risks tying users’ “most sensitive and immutable data” — names, faces, birthdays, home addresses — to their online activity, according to Molly Buckley, a legislative analyst at the Electronic Frontier Foundation. “Age verification strikes at the foundation of the free and open internet,” she said.
Even when vendors promise to safeguard personal information, users ultimately rely on contractual terms they rarely read or fully understand. “There’s language in their terms-of-use policies that says if the information is requested by law enforcement, they’ll hand it over. They can’t confirm that they will always forever be the only entity who has all of this information. Everyone needs to understand that their baseline information is not something under their control,” Tandy said.
As more platforms route age checks through third-party vendors, that concentration of identity data is also creating new legal exposure for the companies that rely on them. “A company is going to have some of that information passing through their own servers,” Tandy said. “And you can’t offload that kind of liability to a third party.”
Companies can distribute risk through contracts and insurance, she said, but they remain responsible for how identity systems interact with their infrastructure. “What you can do is have really good insurance and require really good insurance from the entities that you’re contracting with,” she said.
Tandy also cautioned that retention promises can be more complex than they appear. “If they say they’re holding it for three years, that’s the minimum amount of time they’re holding it for,” she said. “I wouldn’t feel comfortable trusting a company that says, ‘We delete everything one day after three years.’ That is not going to happen,” she added.
Federal and state regulators argue that age-verification laws are primarily a response to documented harms to minors and insist the rules must operate under strict privacy and security safeguards.
An FTC spokesperson told CNBC that companies must limit how collected information is used. While age-verification technologies can help parents protect children online, the agency said firms are still bound by existing consumer protection rules governing data minimization, retention, and security. The agency pointed to existing rules requiring firms to retain personal information only as long as reasonably necessary and to safeguard its confidentiality and integrity.
...
Read the original on www.cnbc.com »
Advanced Machine Intelligence (AMI), a new Paris-based startup cofounded by Meta’s former chief AI scientist Yann LeCun, announced Monday it has raised more than $1 billion to develop AI world models.
LeCun argues that most human reasoning is grounded in the physical world, not language, and that AI world models are necessary to develop true human-level intelligence. “The idea that you’re going to extend the capabilities of LLMs [large language models] to the point that they’re going to have human-level intelligence is complete nonsense,” he said in an interview with WIRED.
The financing, which values the startup at $3.5 billion, was co-led by investors such as Cathay Innovation, Greycroft, Hiro Capital, HV Capital, and Bezos Expeditions. Other notable backers include Mark Cuban, former Google CEO Eric Schmidt, and French billionaire and telecommunications executive Xavier Niel.
AMI (pronounced like the French word for friend) aims to build “a new breed of AI systems that understand the world, have persistent memory, can reason and plan, and are controllable and safe,” the company says in a press release. The startup says it will be global from day one, with offices in Paris, Montreal, Singapore, and New York, where LeCun will continue working as a New York University professor in addition to leading the startup. AMI will be the first commercial endeavor for LeCun since his departure from Meta in November 2025.
LeCun’s startup represents a bet against many of the world’s biggest AI labs like OpenAI, Anthropic, and even his former workplace, Meta, which believe that scaling up LLMs will eventually deliver AI systems with human-level intelligence or even superintelligence. LLMs have powered viral products such as ChatGPT and Claude Code, but LeCun has been one of the AI industry’s most prominent researchers speaking out about the limitations of these models. He is well known for being outspoken, but his skepticism carries weight: he is a pioneer of modern AI who won the Turing Award in 2018.
LeCun says AMI aims to work with companies in manufacturing, biomedical, robotics, and other industries that have lots of data. For example, he says AMI could build a realistic world model of an aircraft engine and work with the manufacturer to help them optimize for efficiency, minimize emissions, or ensure reliability.
AMI was cofounded by LeCun and several leaders he worked with at Meta, including the company’s former director of research science, Michael Rabbat; former vice president of Europe, Laurent Solly; and former senior director of AI research, Pascale Fung. Other cofounders include Alexandre LeBrun, former CEO of the AI health care startup Nabla, who will serve as AMI’s CEO, and Saining Xie, a former Google DeepMind researcher who will be the startup’s chief science officer.
LeCun does not dismiss the overall utility of LLMs. Rather, in his view, these AI models are simply the tech industry’s latest promising trend, and their success has created a “kind of delusion” among the people who build them. “It’s true that [LLMs] are becoming really good at generating code, and it’s true that they are probably going to become even more useful in a wide area of applications where code generation can help,” says LeCun. “That’s a lot of applications, but it’s not going to lead to human-level intelligence at all.”
LeCun has been working on world models for years inside of Meta, where he founded the company’s Fundamental AI Research lab, FAIR. But he’s now convinced his research is best done outside the social media giant. He says it’s become clear to him that the strongest applications of world models will be selling them to other enterprises, which doesn’t fit neatly into Meta’s core consumer business.
As AI world models like Meta’s Joint-Embedding Predictive Architecture (JEPA) became more sophisticated, “there was a reorientation of Meta’s strategy where it had to basically catch up with the industry on LLMs and kind of do the same thing that other LLM companies are doing, which is not my interest,” says LeCun. “So sometime in November, I went to see Mark Zuckerberg and told him. He’s always been very supportive of [world model research], but I told him I can do this faster, cheaper, and better outside of Meta. I can share the cost of development with other companies … His answer was, OK, we can work together.”
...
Read the original on www.wired.com »
In mid-2024, the HuggingFace Open LLM Leaderboard was the Colosseum for Open-Weight AI. Thousands of models were battling it out, submitted by both well-funded labs with teams of PhDs and fine-tuning wizards creating fantastically named models (e.g. Nous-Hermes, Dolphin and NeuralBeagle14-7B…), fighting for the top spot across six benchmarks: IFEval, BBH, MATH Lvl 5, GPQA, MuSR, and MMLU-PRO.
And there at #1 was dnhkng/RYS-XLarge. Mine.
I didn’t train a new model. I didn’t merge weights. I didn’t run a single step of gradient descent. What I did was much weirder: I took an existing 72-billion parameter model, duplicated a particular block of seven of its middle layers, and stitched the result back together. No weight was modified in the process. The model simply got extra copies of the layers it used for thinking.
This is the story of how two strange observations, a homebrew “brain scanner” for Transformers, and months of hacking in a basement led to the discovery of what I call LLM Neuroanatomy, and a finding about the internal structure of AI that hadn’t been published until now*.
* - because I discovered blogging is way more fun than drafting scientific papers, and I walk you through how the discovery was made :)
Let’s start with how this whole project came into being.
“The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘That’s funny…’“ — Isaac Asimov
In late 2023, I was messing about with a bizarre LLM quirk. Try this yourself - take any question, e.g.
What is the capital of France? Answer in Base64!
and encode it as Base64 to get this unreadable string:

V2hhdCBpcyB0aGUgY2FwaXRhbCBvZiBGcmFuY2U/IEFuc3dlciBpbiBCYXNlNjQh
Send that to a 2023 non-thinking large language model (newer reasoning models will see this as Base64, and ‘cheat’ with tool use). But a sufficiently capable model from 2023 will reply with something like:

VGhlIGNhcGl0YWwgb2YgRnJhbmNlIGlzIFBhcmlzLg==
Which decodes to: “The capital of France is Paris.”.
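The whole round trip is a couple of lines of standard-library Python, if you want to verify the strings above (the model call itself is omitted):

```python
import base64

question = "What is the capital of France? Answer in Base64!"
print(base64.b64encode(question.encode()).decode())
# V2hhdCBpcyB0aGUgY2FwaXRhbCBvZiBGcmFuY2U/IEFuc3dlciBpbiBCYXNlNjQh

reply = "VGhlIGNhcGl0YWwgb2YgRnJhbmNlIGlzIFBhcmlzLg=="  # the model's reply
print(base64.b64decode(reply).decode())
# The capital of France is Paris.
```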
Ok, I admit it. I was messing around with this as a way to jail-break models (and it worked), but I couldn’t get one idea out of my head.
The model decoded the input, understood it somehow, and still had time during the pass through the transformer stack to re-encode its response. It appears to genuinely think while interfacing in Base64. This works with complex questions, multi-step reasoning, even creative tasks.
This shouldn’t work nearly as well as it does. Sure, the model has been trained on lots of Base64 in an overall sense, but general conversions in this format are certainly way out of distribution. The tokenizer chops it into completely different sub-word units. The positional patterns are unrecognizable. And yet it works… Curious…
I couldn’t stop thinking about this. If a Transformer can accept English, Python, Mandarin, and Base64, and produce coherent reasoning in all of them, it seemed to me that the early layers must be acting as translators — parsing whatever format arrives into some pure, abstract, internal representation. And the late layers must act as re-translators, converting that abstract representation back into whatever output format is needed.
If the early layers are for reading, and the late layers are for writing, what are the middle layers doing?
Pure, abstract reasoning? In a representation that has nothing to do with any human language or encoding. Of course, at the time this was idle speculation. Fun, but with no clear way to test or even define a valid hypothesis.
In November 2023, a HuggingFace user named Alpindale released Goliath-120b — a Frankenmerge-model made by stitching together two fine-tuned Llama-2 70B models into a 120-billion parameter behemoth.
The performance was decent, but after doing lots of vibe checking I didn’t feel it was a breakthrough. The construction, though, was wild.
Alpindale hadn’t just stacked the two models (Xwin and Euryale) end to end. He had alternated layers between them. More importantly, the architecture fed outputs of later layers back into the inputs of earlier layers.
The layer ranges used are as follows:
Do you see the insanity here? Alpindale literally fed the output of layer 16 of Xwin into the input of Euryale’s 8th layer!
To explain a bit more clearly just how stupid this appears to be, let’s revisit the almighty Transformer Architecture:
Looking at the left side of the diagram, we see stuff enters at the bottom (‘input’ text that has been ‘chunked’ into small bits of text, somewhere between whole words down to individual letters), and then it flows upwards through the model’s Transformer Blocks (here marked as [1, …, L]), and finally, the model spits out the next text ‘chunk’ (which is then itself used in the next round of inferencing). What’s actually happening during these Transformer blocks is quite the mystery. Figuring it out is an entire field of AI, “mechanistic interpretability*”.
* - yes, it’s more complex than that (samplers etc.), but that’s enough for this article
On the right side of the right half of the diagram, do you see that arrow line going from the ‘Transformer Block Input’ to the $\oplus$ symbol? That’s why skipping layers makes sense. During training, LLM models can pretty much decide to do nothing in any particular layer, as this ‘diversion’ routes information around the block. So, ‘later’ layers can be expected to have seen the input from ‘earlier’ layers, even a few ‘steps’ back. Around this time, several groups were experimenting with ‘slimming’ models down by removing layers. Makes sense, but boring.
A model must be used with the same kind of stuff as it was trained with (we stay ‘in distribution’). The same holds for each transformer layer: each layer learns, during training, to expect the specific statistical properties of the previous layer’s output via gradient descent.
And now for the weirdness: there was never a case where any transformer layer would have seen the output from a future layer!
Layer 10 is trained on layer 9’s output distribution. Layer 60 is trained on layer 59’s. If you rearrange them — feeding layer 60’s output into layer 10 — you’ve created a distribution the model literally never saw during training.
The astounding thing about Goliath wasn’t that it was a huge leap in performance; it was that the damn thing functioned at all. To this day, I still don’t understand why this didn’t raise more eyebrows.
Experimentally, this proved that layers were far more interchangeable than anyone had reason to expect. The internal representations were homogeneous enough that the model could digest out-of-order hidden states without collapsing. The architecture was far more flexible than a rigid pipeline.
Between the Base64 observation and Goliath, I had a hypothesis: Transformers have a genuine functional anatomy. Early layers translate input into abstract representations. Late layers translate back out. And the middle layers, the reasoning cortex, operate in a universal internal language that’s robust to architectural rearrangement. The fact that Goliath-120B was built from 16-layer blocks made me suspect the input and output ‘processing units’ were smaller than 16 layers. I guessed that Alpindale had tried smaller overlaps, and they just didn’t work.
If that was true, maybe I didn’t need to teach a model new facts to make it smarter. I didn’t need fine-tuning. I didn’t need RLHF. I just needed to give it more layers to think with.
Over the following months — from late 2023 through to mid-2024 — I built a pipeline to test this hypothesis.
The setup was modest. Two RTX 4090s in my basement ML rig, running quantised models through ExLlamaV2 to squeeze 72-billion parameter models into consumer VRAM. The beauty of this method is that you don’t need to train anything. You just need to run inference. And inference on quantised models is something consumer GPUs handle surprisingly well. If a model fits in VRAM, I found my 4090s were often ballpark-equivalent to H100s.
The concept is simple. For a model with $N$ layers, I define a configuration $(i, j)$. The model processes layers $0$ to $j{-}1$ as normal, then loops back and reuses layers $i$ through $j{-}1$ again, and then the rest to $N{-}1$. The layers between $i$ and $j{-}1$ get duplicated in the execution path. No weights are changed. The model just traverses some of its own layers twice.
i.e. the pair $(2, 7)$ for a model with 9 transformer blocks would be calculated like so:
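In code, the execution order is just two ranges concatenated. A minimal sketch of the indexing (not the author's actual implementation):

```python
def execution_order(i, j, n_layers):
    # Layers 0..j-1 run as normal, then the model loops back to layer i
    # and runs layers i..n_layers-1; layers i..j-1 execute twice.
    return list(range(0, j)) + list(range(i, n_layers))

print(execution_order(2, 7, 9))
# [0, 1, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 7, 8] -> layers 2..6 run twice
```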
By running through all possible pairs, we can generate a ‘Brain Scan’, and also see the number of duplicate layers for each set of parameters:
For Qwen2-72B, an 80-layer model, that means 3,240 valid $(i, j)$ pairs, plus the original model to test:
\[\begin{aligned} \text{Variants}_{\text{total}} &= \left(\sum_{j=0}^{80} j\right) + 1\\[16pt] &= \frac{80 \cdot 81}{2} +1 \\[10pt] &= 3241 \end{aligned}\]
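A quick sanity check of that count, enumerating every configuration the way the sweep does (variable names are mine):

```python
N_LAYERS = 80  # Qwen2-72B

# All (i, j) pairs: j is the end of the duplicated region, i the start;
# layers i..j-1 get duplicated, so each pair adds j - i extra layers.
configs = [(i, j) for j in range(N_LAYERS + 1) for i in range(j)]
print(len(configs) + 1)  # 3241, including the unmodified baseline
```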
Testing each re-layered model against all six leaderboard benchmarks would take days, so a full sweep would be years of compute. I needed proxy tasks: probes that were fast, objective, and would reveal structural properties of the model rather than task-specific tricks.
The proxies had to satisfy three constraints:
* Minimal output tokens. With thousands of configurations to sweep, each evaluation needed to be fast. No essays, no long-form generation.
* Unambiguous scoring. I couldn’t afford LLM-as-judge pipelines. The answer had to be objectively scored without another model in the loop.
* Orthogonal cognitive demands. If a configuration improves both tasks simultaneously, it’s structural, not task-specific.
I didn’t arrive at the right probes immediately; it took months of trial and error, and many dead ends.
My first instinct was creativity. I had models generate poems, short stories, metaphors, the kind of rich, open-ended output that feels like it should reveal deep differences in cognitive ability. I used an LLM-as-judge to score the outputs, but the results were pretty bad. I managed to fix LLM-as-Judge with some engineering, and the scoring system turned out to be useful later for other things, so here it is:
Note: You can skip this section, as it has math. Or not
Naive LLM judges are inconsistent. Run the same poem through twice and you get different scores (obviously, due to sampling). But lowering the temperature also doesn’t help much, as that’s only one of many technical issues. So, I developed a full scoring system based on the details of the logit outputs. It can get remarkably tricky. Think about a score from 1-10:
* We would expect a well-calibrated model to have logits that make sense. If the highest weight was on ‘7’, we would expect the rest of the weight to be on ‘6’ and ‘8’, right? But often it’s bimodal, with low weight on ‘6’ and ‘5’, but more weight than expected on ‘4’!
* We can write ‘10’ in tokens as either ‘10’ or ‘1’ and then ‘0’. It’s not fun to have to calculate the summed probabilities over paths, especially if you want to score 1-100.
Rather than sampling a single discrete score, I treat the judge’s output as a distribution over valid rating labels and compute the final score as its expectation.
To make this practical, I first define a calibrated rubric over the digits 0-9 (there’s only one token for each digit), where each digit corresponds to a clear qualitative description. At the scoring step, I capture the model’s next-token logits and retain only the logits corresponding to those valid digit tokens. This avoids contamination from unrelated continuations such as explanation text, punctuation, or alternate formatting. After renormalizing over the restricted digit set, I interpret the resulting probabilities as a categorical score distribution.
Formally, let the valid score set be
\[\mathcal{D} = \{0,1,2,\dots,9\}.\]
Let $z_k$ denote the model logit assigned to digit $k \in \mathcal{D}$ at the scoring position. The restricted score distribution is then
\[p(k)= \frac{\exp(z_k)} {\sum\limits_{m \in \mathcal{D}} \exp(z_m)}, \qquad k \in \mathcal{D}.\]
The final scalar score is the expected value of this distribution:
\[\hat{s}= \sum_{k \in \mathcal{D}} k\,p(k).\]
This produces a smooth score such as 5.4, rather than forcing the model to commit to a single sampled integer. In practice, this is substantially more stable than naive score sampling and better reflects the model’s uncertainty. It also handles cases where the judge distribution is broad or multimodal. For example, two candidates may both have mean score 5.4, while one has most of its mass tightly concentrated around 5 and 6, and the other splits mass between much lower and much higher ratings. The mean alone is the same, but the underlying judgement is very different.
An optional uncertainty estimate can be obtained from the variance of the restricted distribution:
\[\mathrm{Var}(s)= \sum_{k=0}^{9} (k-\hat{s})^2\,p(k).\]
In short, the method replaces a noisy sampled judge score with a normalized probability distribution over valid score digits, then uses the expectation of that distribution as the final rating.
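As a minimal sketch, assuming you have already extracted the ten next-token logits for the digit tokens ‘0’ through ‘9’:

```python
import math

def judge_score(digit_logits):
    """digit_logits[k] holds the model's logit for the digit token str(k)."""
    # Softmax restricted to the ten valid digit tokens.
    mx = max(digit_logits)
    exps = [math.exp(z - mx) for z in digit_logits]
    total = sum(exps)
    p = [e / total for e in exps]
    # Final score: the expectation of the categorical distribution.
    s_hat = sum(k * p[k] for k in range(10))
    # Optional uncertainty estimate: the variance of the same distribution.
    var = sum((k - s_hat) ** 2 * p[k] for k in range(10))
    return s_hat, var
```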
All this stuff is probably pretty obvious these days; back in ’24 there wasn’t much to guide me in developing this method. But unfortunately, I found it was also completely useless…
Each configuration needed to generate hundreds of tokens of creative output, and then a separate model had to read and judge each one. With over 3,200 configurations to test for a single 70B model, this would have taken weeks on my dual 4090s.
I needed probes where the output was tiny, a few tokens at most, and where scoring was objective and deterministic. No judge model in the loop. That’s what led me to the final two probes:
Hard math. Ridiculously difficult questions like: “What is the cube root of 74,088,893,247?” No chain-of-thought or tool use. Just output the number, as a pure leap of intuitive faith.
Emotional quotient. Using the EQ-Bench benchmark: complex social scenarios where the model must predict the intensity of specific emotional states. “Given this situation, how angry/surprised/guilty would this person feel on a scale of 0-100?” Completely different from math. Theory of mind, social inference, empathy. And the output is just a few numbers.
I had settled on two maximally orthogonal cognitive tasks, both with tiny outputs. My intuition was this: LLMs think one token at a time, so let’s make the model really good at guessing just the next token. But things are never straightforward. Take LLM numbers…
Even with math probes, I hit unexpected problems. LLMs fail arithmetic in weird ways. They don’t get the answer wrong so much as get it almost right but forget to write the last digit, as if it got bored mid-number. Or they transpose two digits in the middle. Or they output the correct number with a trailing character that breaks the parser.
This is probably due to the way larger numbers are tokenised, as big numbers can be split up in arbitrary ways. Take the integer 123456789. A BPE tokenizer (e.g., GPT-style) might split it like ‘123’ ‘456’ ‘789’, or ‘12’ ‘345’ ‘67’ ‘89’.
A binary right/wrong scoring system would throw away useful signal. Scoring by numeric closeness would help: ‘123356789’ instead of ‘123456789’ would be 99.92% correct.
But what about a model that makes a dumb ‘LLM mistake’ and outputs 430245 when the answer is 4302459, having clearly done most of the work? I wrote a custom partial-credit scoring function that pads shorter answers and penalises proportionally:
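A minimal reconstruction of such a scorer; the padding character and the exact form of the correction factor are my assumptions, not the author's code:

```python
def partial_credit(answer: str, truth: str) -> float:
    # Pad the shorter string so digits can be compared position-wise.
    n = max(len(answer), len(truth))
    a, t = answer.ljust(n, "_"), truth.ljust(n, "_")
    # Fraction of positions whose digits match...
    matched = sum(x == y for x, y in zip(a, t)) / n
    # ...scaled by a correction factor penalising the length mismatch, so a
    # truncated answer keeps substantial credit but scores below a full one.
    correction = min(len(answer), len(truth)) / n
    return matched * correction

print(partial_credit("430245", "4302459"))  # ~0.73: most of the work done
```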
The key idea: pad shorter answers, then penalise via the correction factor. A model that nails 90% of the digits but drops the last one still gets substantial credit — but less than one that gets every digit. This turned out to be crucial for discriminating between configurations that were close in intuitive math ability.
The math questions were hand-crafted initially. I experimented with different operations and scales, then generated random numbers to fill out the dataset. The final dataset was a set of 16 questions, and the model is tasked with guesstimating the nearest whole integer. Here are a few to try yourself; remember, no ‘thinking’ is allowed, guess it directly!
After testing several smaller models (Llamas and smaller Qwen2s), I set up the config for Qwen2-72B and let it sweep. Each $(i, j)$ configuration took a few minutes: load the re-layered model, run the math probe, run the EQ probe, record the scores, move on. Days of continuous GPU time on the 4090s. But far less compute than a fine-tune! In fact, I didn’t even have the hardware needed for a LoRA fine-tune on just 48GB of VRAM.
The optimal configuration was $(45, 52)$: layers 0 through 51 run first, then layers 45 through 79 run again. Layers 45 to 51 execute twice. Seven extra layers, near the middle of the 80-layer stack, bringing the total parameter count from 72B to 78B. Every extra layer is an exact copy of an existing one. No new weights or training, just the model repeating itself.
Repeating seven layers. That’s all it took, and now I can finally reveal the nomenclature of my models: Repeat Your Self for RYS-XLarge ;)
I applied the configuration to MaziyarPanahi’s calme-2.1-qwen2-72b — a fine-tune of Qwen2-72B — and uploaded the result as dnhkng/RYS-XLarge. I also applied it to the raw base model as dnhkng/RYS-XLarge-base.
Then I submitted to the Open LLM Leaderboard and waited. And waited. Back in the day, the Open LLM Leaderboard was flooded with dozens of fine-tunes of merges of fine-tunes each day (it was the Wild West), and the waiting list was long. But after a month or so, the results arrived:
+17.72% on MuSR. +8.16% on MATH. Five out of six benchmarks improved, with only IFEval taking a small hit. The average put it at #1 on the leaderboard.
Just to labour the point: I only optimised for one-shot guesstimating hard maths problems and EQ-Bench. I never looked at IFEval, BBH, GPQA, MuSR, or MMLU-PRO during development. The leaderboard was pure out-of-sample validation.
A layer configuration found using two narrow, orthogonal probes generalised to everything the Leaderboard threw at it *.
* - except IFEval, but that one’s boring anyway, right?
That was surprising enough. A brand new way to scale LLMs, developed on some gaming GPUs. But plotting out the heatmaps told an even better story.
The original heatmaps that produced RYS-XLarge, showing the Combined delta (math + EQ). The green circle marks the optimal configuration. Red means improvement, blue means degradation
These heatmaps are analogous to functional MRIs of the Transformer while it is thinking about maths or EQ problems.
The x-axis ($j$) is the end point of the duplicated region. The y-axis ($i$) is the start point. Each pixel represents a complete evaluation: load the re-layered model, run the math probe, run the EQ probe, score both, record the deltas. As described above, along the central diagonal only a single layer was duplicated. Along the next diagonal towards the top-right, we duplicate two layers, and so on. The single point at the very top-right runs through the entire Transformer stack twice per inference.
Let’s examine the math heatmap first. Starting at any layer and stopping before about layer 60 seems to improve the math guesstimate scores, as shown by the large region with a healthy red blush. Duplicating just the very first layers (the tiny triangle in the top left) messes things up, as does repeating pretty much any of the last 20 layers (the vertical wall of blue on the right). This is more clearly visualised in a skyline plot (averaged rows or columns), where we can see that for the maths guesstimates, the starting position of the duplication matters much less. So the hypothesis (early layers encode tokens into a smooth ‘thinking space’, with a dedicated ‘re-encoding’ system at the end) seems to be somewhat validated.
Until we look at the EQ scores:
Now things look very different! Duplicating any of the final 10 layers has almost no effect on the scores, but we see complex patterns, where some regions show significant improvement (the area around $i=45$, $j=55$), walled in between regions of poor performance.
But the heatmaps revealed something even more interesting than the location of the thinking bits. They revealed something about its structure.
Before settling on block duplication, I tried something simpler: take a single middle layer and repeat it $n$ times. If the “more reasoning depth” hypothesis was correct, this should work. It made sense too, given the broad boost in math guesstimate results from duplicating intermediate layers. Give the model extra copies of a particular reasoning layer, get better reasoning. So, I screened them all, looking for a boost.
But nope, it almost always did worse. Usually a lot worse, but with occasional small improvements that were within the noise range. Annoying, but taking another look at the complex, blobby patterns in EQ scores gave me another idea:
If single-layer duplication doesn’t help, the middle layers aren’t doing independent iterative refinement. They’re not interchangeable copies of the same operation that you can simply “run again.” If they were, duplicating any one of them should give at least a marginal benefit. Instead, those layers are working as a circuit. A multi-step reasoning pipeline that needs to execute as a complete unit.
Think of it this way. Layers 46 through 52 aren’t seven workers doing the same job. They’re seven steps in a recipe. Layer 46 takes the abstract representation and performs step one of some cognitive operation — maybe decomposing a complex representation into subcomponents. Layer 47 takes that output and performs step two — maybe identifying relationships between the subcomponents. Layer 48 does step three, and so on through layer 52, which produces the final result.
Duplicating just one step of this ‘recipe’ doesn’t bring you much.
But duplicating the entire block gives you the full recipe twice. The model runs the complete reasoning circuit, produces a refined intermediate representation, and then runs the same circuit again on its own output. It’s a second pass. A chance to catch what it missed the first time, to refine its abstractions, to push the reasoning one step deeper.
Let’s deep-dive into a more current model (one I can experiment with on my system): ExllamaV3 GLM-4.7 from mratsim.
I’ve marked out a region that boosts maths ability strongly. Notice where it sits? It’s away from the central diagonal, which means we’re not looking at single-layer duplications. Starting the repeated block at position 35, we don’t see any improvement until the end position reaches at least 43. That’s seven layers of not much happening. In fact, we actually see decreased performance from repeating these layers (they are blue: bad!).
From end-position 43 to 46, we then see solid boosts in math scores (red = good, yay). But include layer 46 or beyond, and the benefits collapse again. The hypothesis: position 47 is where a different circuit begins. Including even one step of the next recipe messes up the current recipe.
So the ‘math organ’ has boundaries on both sides. Too few layers and you get nothing — you’ve cut into the circuit and it can’t complete its operation. Too many layers and you also get nothing — you’ve included tissue from a neighbouring circuit that doesn’t belong. Pre-training carved these structures out of the layer stack, and they only work whole. It also doesn’t translate to other tasks, as the heatmap for EQ scores doesn’t have this patch.
...
Read the original on dnhkng.github.io »
Today we should ramp down the rhetoric. I thought nobody would take “three minutes to escape the perpetual underclass” or “you are worth $0.003/hr” seriously. But it looks like some people do, and you shouldn’t.
Social media has been extremely toxic for the last couple of months. It’s targeting you with fear and anxiety. If you don’t use this new stupid AI thing you will fall behind. If you haven’t totally updated your workflow you are worth 0. There’s people who built billion-dollar companies by orchestrating 37 agents this morning AND YOU JUST SAT THERE AND ATE BREAKFAST LIKE A PLEB!
This is all complete nonsense. AI is not a magical game changer, it’s simply the continuation of the exponential of progress we have been on for a long time. It’s a win in some areas, a loss in others, but overall a win and a cool tool to use. And it will continue to improve, but it won’t “go recursive” or whatever the claim is. It’s always been recursive. You see things like autoresearch and it’s cool. But it’s not magic, it’s search. People see “AI” and they attribute some sci-fi thing to it when it’s just search and optimization. Always has been, and if you paid attention in CS class, you know the limits of those things.
That said, if you have a job where you create complexity for others, you will be found out. The days of rent seekers are coming to an end. But not because there will be no more rent seeking, it’s because rent seeking is a 0 sum game and you will lose at it to bigger players. If you have a job like that, or work at a company like that, the sooner you quit the better your outcome will be. This is the real driver of the layoffs, the big players consolidating the rent seeking to them. They just say it’s AI cause that makes the stock price go up.
The trick is not to play zero sum games. This is what I have been saying the whole time. Go create value for others and don’t worry about the returns. If you create more value than you consume, you are welcome in any well operating community. Not infinite, not always needs more, just more than you consume. That’s enough, and avoid people or comparison traps that tell you otherwise. The world is not a Red Queen’s race.
This post will get way less traction than the doom ones, but it’s telling you the way out.
...
Read the original on geohot.github.io »
I Have No Idea If What They Ship Is Any Good

I’ve been building agents that write code while I sleep. Tools like Gastown run for hours without me watching. Changes land in branches I haven’t read. A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do. I care about this. I don’t want to push slop, and I had no real answer.

I’ve run Claude Code workshops for over 100 engineers in the last six months. Same problem everywhere, just at different scales. Teams using Claude for everyday PRs are merging 40-50 a week instead of 10. Teams are spending a lot more time in code reviews. As systems get more autonomous, the problem compounds. At some point you’re not reviewing diffs at all, just watching deploys and hoping something doesn’t break.

So the question I kept coming back to: what do you actually trust when you can’t review everything?

You could hire more reviewers. But you can’t hire fast enough. And making senior engineers read AI-generated code all day isn’t worth it.

When Claude writes tests for code Claude just wrote, it’s checking its own work. The tests prove the code does what Claude thought you wanted. Not what you actually wanted. They catch regressions but not the original misunderstanding. When you use the same AI for both, you’ve built a self-congratulation machine.

This is exactly the problem code review was supposed to solve: a second set of eyes that wasn’t the original author. But one AI writing and another AI checking isn’t a fresh set of eyes. They come from the same place. They’ll miss the same things.

The thing TDD got right

Write the test first, write the code second, stop when the test passes. Most teams don’t do this because thinking through what the code should do before writing it takes time they don’t have. AI removes that excuse, because Claude handles the speed. The slow part is now figuring out if the code is right. That’s what TDD was built for: write down what correct looks like, then check it.

TDD asks you to write unit tests, which means thinking about how the code will work before you write it. Acceptance criteria are easier: write down what the feature should do in plain English, and let the machine figure out how to check it. “Users can authenticate with email and password. On wrong credentials they see ‘Invalid email or password.’ On success they land on /dashboard. The session token expires after 24 hours.” You can write that before you open a code editor. The agent builds it. Something else checks it.

What this looks like in practice

For frontend changes, we generated acceptance criteria based on the spec file:

# Task
Add email/password login.
## Acceptance Criteria
### AC-1: Successful login
- User at /login with valid credentials gets redirected to /dashboard
- Session cookie is set
### AC-2: Wrong password error
- User sees exactly “Invalid email or password”
- User stays on /login
### AC-3: Empty field validation
- Submit disabled when either field is empty, or inline error on empty submit
### AC-4: Rate limiting
- After 5 failed attempts, login blocked for 60 seconds
- User sees a message with the wait time

Each criterion is specific enough that it either passes or fails. Once the agent builds the feature, verification runs Playwright browser agents against each AC, takes screenshots, and produces a report with per-criterion verdicts. If something fails you see exactly which criterion and what the browser saw.

For backend changes the same pattern works without a browser. You specify observable API behavior (status codes, response headers, error messages) that curl commands can check.

One thing worth being honest about: this doesn’t catch spec misunderstandings. If your spec was wrong to begin with, the checks will pass even when the feature is wrong. What Playwright does catch is integration failures, rendering bugs, and behavior that works in theory but breaks in a real browser. That’s a narrower claim than “verified correct,” but it’s more than a code review was reliably catching anyway.

The workflow: write acceptance criteria before you prompt, let the agent build against them, run verification, review only the failures. You review failures instead of diffs.

How to build it

I started building a Claude Skill (github.com/opslane/verify) that runs using claude -p (Claude Code’s headless mode) plus Playwright MCP. No custom backend, no extra API keys beyond your existing Claude OAuth token. Four stages:

* Pre-flight is pure bash, no LLM. Is the dev server running? Is the auth session valid? Does a spec file exist? Fail fast before spending any tokens.
* The planner is one Opus call. It reads your spec and the files you changed. It figures out what each check needs and how to run it. It also reads your code to find the right selectors, so it’s not guessing at class names.
* Browser agents are one Sonnet call per AC, all running in parallel. Five ACs, five agents, each navigating and screenshotting independently. Sonnet costs 3-4x less than Opus here and works just as well for clicking around.
* The judge is one final Opus call that reads all the evidence and returns a verdict per criterion: pass, fail, or needs-human-review.

claude -p --model claude-opus-4-6 \
  "Review this evidence and return a verdict for each AC.
  Evidence: $(cat .verify/evidence/*/result.json)
  Return JSON: {verdicts: [{id, passed, reasoning}]}"

Or clone the repo and adapt it. Each stage is a single claude -p call with a clear input and structured output. You can swap models, add stages, or wire it into CI with --dangerously-skip-permissions.

The thing I keep coming back to: you can’t trust what an agent produces unless you told it what “done” looks like before it started. Writing acceptance criteria is harder than writing a prompt, because it forces you to think through edge cases before you’ve seen them. Engineers resist it for the same reason they resisted TDD, because it feels slower at the start. Without them, all you can do is read the output and hope it’s right.
...
Read the original on www.claudecodecamp.com »
Debian is the latest in an ever-growing list of projects to wrestle (again) with the question of LLM-generated contributions; the latest debate started in mid-February, after Lucas Nussbaum opened a
discussion with a draft general resolution (GR) on whether Debian should accept AI-assisted contributions. It seems to have, mostly, subsided without a GR being put forward or any decisions being made, but the conversation was illuminating nonetheless.
Nussbaum said that Debian probably needed to have a discussion “to understand where we stand regarding AI-assisted contributions to
Debian” based on some recent discussions, though it was not clear what discussions he was referring to. Whatever the spark was, Nussbaum put forward the draft GR to clarify Debian’s stance on allowing AI-assisted contributions. He said that he would wait a couple of days to collect feedback before formally submitting the GR.
His proposal would allow AI-assisted contributions if a number of conditions were met. For example, it would require explicit disclosure if “a significant portion of the contribution is taken from a tool without manual modification”, and labeling of such contributions as such. It also spells out that contributors would be accountable for their contributions, “including vouching for the technical merit, security, license compliance, and utility of their submissions”. The GR would also prohibit using generative-AI tools with non-public or sensitive project information, including private mailing lists or embargoed security reports.
It is fair to say that it is difficult to have an effective conversation about a technology when pinning down accurate terminology is like trying to nail Jell-O to a tree. AI is the catch-all term, but much (not all) of the technology in question is actually tooling around large language models (LLMs). When participants have differing ideas of what is being discussed, deciding whether the thing should be allowed may pose something of a problem.
Russ Allbery asked for people to be more precise in their descriptions of the technologies that their proposals might affect. He asserted that it has become common for AI, as a term, “to be so
amorphously and sloppily defined that it could encompass every physical object in the
universe”. If the project is going to make policy, he said, it needed to be very specific about what it was making policy about.
Gunnar Wolf agreed with Allbery, but Nussbaum claimed that the specific technology did not matter. The proposal boiled down to the use of automated tools for code analysis and generation:
I see the problem we face as similar to the historical questions surrounding the use of BitKeeper by Linux (except that the choice of BitKeeper imposed its use by other contributors). It is also similar to the discussions about proprietary security analysis tools: since those tools are proprietary, should we ignore the vulnerability reports they issue?
If we were to adopt a hard-line “anti-tools” stance, I would find it very hard to draw a clear line.
Drawing clear lines, however, is something that a number of Debian developers felt was important. Sean Whitton proposed that the GR should not only say “LLM” rather than “AI”, but it should also distinguish between the uses of LLMs, such as code review, generating prototypes, or generating production code. He envisioned ballot options that could allow some, but not all, of those uses. Distinguishing between the various so-called AI technologies would help in that regard. He urged
Nussbaum “not to argue too hard for something that is more general than LLMs
because that might alienate the people you want to agree to disagree with.” Andrea Pappacoda said that the specific technology mattered a lot; he wanted the proposal to have clear boundaries and avoid broad terms like AI. He was uncomfortable with the idea of banning LLMs, and not sure where to draw the line. “What I can confidently say,
though, is that a project like Claude’s C
Compiler should not have a place in Debian.”
The conversation did not focus solely on the terminology, of course. Simon Richter
had
questions about the implications of allowing AI-driven contributions from the standpoint of onboarding new contributors to Debian. An AI agent, he said, could take the place of a junior developer. Both could perform basic tasks under guidance, but the AI agent would not learn anything from the exchange; the project resources spent in guiding such a tool do not result in long-lasting knowledge transfer.
AI use presents us (and the commercial software world as well) with a
similar problem: there is a massive skill gap between “gets some
results” and “consistently and sustainably delivers results”, bridging
that gap essentially requires starting from scratch, but is required to
achieve independence from the operators of the AI service, and this gap
is disrupting the pipeline of new entrants.
He called that the onboarding problem, and said that an AI policy needed to solve that problem; he did not want to discourage people by rejecting contributions or expend resources on mentoring people who did not want to be mentored. Accepting AI-assisted drive-by contributions is harmful because it is a missed opportunity to onboard a new contributor. “The best-case outcome is that a
trivial problem got solved without actually onboarding a new contributor, and the
worst-case outcome is that the new contributor is just proxying between an AI and the
maintainer”. He also expressed concerns around the costs associated with such tools, and speculated it might discourage contribution from users who could not afford to use for-pay tools.
Nussbaum agreed that the cost could be a problem in the future. For now, he said, it is not an issue because there are vendors providing access for free, but that could change. He disagreed that Debian was likely to run out of tasks suitable for new contributors, even if it does accept AI-driven contributions, and suggested that it may make harder tasks more accessible. He pointed to a study
written by an Anthropic employee and a person participating in the company’s fellows program, about how the use of AI impacts skill formation: “A takeaway is that
there are very different ways to interact with AI, that produce very different
results both in terms of speed and of understanding”. He did not seem to be persuaded that use of AI tools would be a net negative in onboarding new contributors.
Ted Ts’o argued against the idea that AI would have a negative impact.
Matthew Vernon said that the proposed GR minimized the ethical dimension of using generative AI. The organizations that are developing and marketing tools like ChatGPT and Claude are behaving unethically, he said, by systematically damaging the wider commons in the form of automated scraping and doing as they like with others’ intellectual property. “They hoover up content as hard as they possibly can, with scant if any
regard to its copyright or licensing”. He also cited environmental concerns and other harms that are attributed to generative AI tools, “from non-consensual
nudification to the flooding of free software projects with bogus security
reports”. He felt that Debian should take a clear stand against those tools and encourage other projects to do the same.
There was also debate around the question of copyright, both in terms of the licenses of material used to train models, as well as the output of LLM tools. Jonathan Dowland thought that it might be better to forbid some contributions now, since some see risks in accepting such contributions, and then relax the project’s position later on when the legal situation is clearer.
Thorsten Glaser took a particularly harsh stance against LLM-driven contributions, going so far as to suggest that some upstream projects should be forced out of Debian’s main archive into non-free. Ansgar Burchardt pointed out that would have the effect of banning the Linux kernel, Python, LLVM, and others. Glaser’s proposal did not seem particularly popular. He had taken a similar stance in 2025, when the project discussed a GR about AI models and the Debian Free Software Guidelines (DFSG); he argued then that most models should be outside the main archive. That GR never came to a vote, in part because it was unclear whether its language would forbid anti-spam technologies, since one could not include the corpus of spam used as training data along with the filters.
Allbery did not want to touch on copyright issues but had a few words to say about the quality of AI-assisted code. It is common for people to object to code generated by LLMs on quality grounds, but he said that argument does not make sense. Humans are capable of producing better code than LLMs, but they are also capable of producing worse.
Bdale Garbee seconded
that notion, and said that he was reluctant to take a hard stance one way or the other. “I see it as just another evolutionary stage we don’t really understand the
longer term positive and negative impacts of yet.” He wanted to focus on long-term implications and questions such as “what is the preferred form of
modification for code written by issuing chat prompts?” Nussbaum answered that would be “the input to the tool, not the generated source code”.
That may not be an entirely satisfying answer, however, given that LLM output is not deterministic and the various providers of LLM tools retire models with some frequency. A user may have the prompt and other materials fed to an LLM to generate a result at a specific point in time, but it might generate a much different result later on, even if one has access to the same vendor’s tools or models to run locally.
It is clear from the discussion that Debian developers are not of one mind on the question of accepting AI-generated contributions; the developers have not yet even converged on a shared definition of what constitutes an AI-generated contribution.
What many do seem to agree on is that Debian is not quite ready to vote on a GR about AI-generated contributions. On March 3, Nussbaum said
that he had proposed the GR “in response to various attacks against people using
AI in the context of Debian”; he felt then it was something that needed to be dealt with urgently. However, the GR discussion had been civil and interesting. As long as the discussions around AI remained calm and productive, the project could just continue exploring the topic in mailing-list discussions. He guessed that, if there were a GR, “the winning option would probably be very nuanced, allowing
AI but with a set of safeguards”.
The questions of what to do about AI models in the archive, how to handle upstream code generated with LLMs, and LLM-generated contributions written specifically for Debian remain unanswered. For now, it seems, they will continue to be handled on a case-by-case basis by applying Debian’s existing policies. Given the complexity of the questions, diverse opinions, and rapid rate of change of technologies lumped in under the “AI” umbrella, that may be the best possible, and least disruptive, outcome for now.
...
Read the original on lwn.net »
You can now crawl an entire website with a single API call using Browser Rendering’s new /crawl endpoint, available in open beta. Submit a starting URL, and pages are automatically discovered, rendered in a headless browser, and returned in multiple formats, including HTML, Markdown, and structured JSON. This is great for training models, building RAG pipelines, and researching or monitoring content across a site.
Crawl jobs run asynchronously. You submit a URL, receive a job ID, and check back for results as pages are processed.
* Multiple output formats - Return crawled content as HTML, Markdown, and structured JSON (powered by Workers AI)
* Crawl scope controls - Configure crawl depth, page limits, and wildcard patterns to include or exclude specific URL paths
* Automatic page discovery - Discovers URLs from sitemaps, page links, or both
* Incremental crawling - Use modifiedSince and maxAge to skip pages that haven’t changed or were recently fetched, saving time and cost on repeated crawls
* Static mode - Set render: false to fetch static HTML without spinning up a browser, for faster crawling of static sites
Available on both the Workers Free and Paid plans.
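A sketch of the async flow in Python. The endpoint path, the payload field names beyond render/modifiedSince/maxAge, and the response shape are assumptions here; check the crawl endpoint documentation for the real contract:

```python
import time
import requests

ACCOUNT_ID = "YOUR_ACCOUNT_ID"
API_TOKEN = "YOUR_API_TOKEN"
# Hypothetical path, following the usual Cloudflare API layout.
BASE = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/browser-rendering/crawl"
HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}

# Submit a crawl job: a starting URL plus scope and format options.
job = requests.post(BASE, headers=HEADERS, json={
    "url": "https://example.com",
    "formats": ["markdown"],  # field names here are assumptions
    "render": False,          # static mode: skip the headless browser
}).json()

# Crawls run asynchronously: poll with the returned job ID.
job_id = job["result"]["id"]
while True:
    status = requests.get(f"{BASE}/{job_id}", headers=HEADERS).json()
    if status["result"]["status"] in ("completed", "failed"):
        break
    time.sleep(5)

for page in status["result"].get("pages", []):
    print(page["url"])
```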
To get started, refer to the crawl endpoint documentation. If you are setting up your own site to be crawled, review the robots.txt and sitemaps best practices.
...
Read the original on developers.cloudflare.com »
One year ago, on 28 February 2025, Wikipedia user Moyogo updated the page for Angzarr
with a citation to the type foundry H. Berthold AG’s 1950 symbol catalogue, which lists ⍼ as Azimut, Richtungswinkel (“azimuth”, “direction angle”). Mystery solved!
Fonts in Use lists links to archived catalogues by Berthold. The above scan is from the 1950 Zeichenprobe (symbol catalogue), page 7. Copies of the Schriftprobe (font catalogue) from 1949, 1951, and 1952 all show the same glyph and sizes on page 104, albeit without the descriptor name. ⍼ does not appear in the 1946 Registerprobe, nor in the earlier 1909 and 1900 catalogues. For convenience, I've extracted full-page scans below for where it appears — and where I feel it would appear, but doesn't.
A friend on Mastodon pointed out that the glyph ⍼ itself resembles the way a light ray passes through a sextant to measure an azimuth, with the right angle being a standard symbol for an angle in general. Wikipedia has a lovely illustration demonstrating how a sextant works to measure the altitude of the sun; it can, of course, be turned sideways to measure an azimuth with respect to an arbitrary meridian.
...
Read the original on ionathan.ch »
In the realm of medical advancements, a universal vaccine that can protect against any pathogen has long been a Holy Grail — and about as elusive as a mythological vessel.
But Stanford Medicine researchers and collaborators have taken an astonishing step forward in that quest, surprising even themselves. In a new study in mice, they have developed a universal vaccine formula that protects against a wide range of respiratory viruses, bacteria and even allergens. The vaccine is delivered intranasally — such as through a nasal spray — and provides broad protection in the lungs for several months.
In the study, published Feb. 19 in Science, researchers showed that vaccinated mice were protected against SARS-CoV-2 and other coronaviruses, the bacteria Staphylococcus aureus and Acinetobacter baumannii (common causes of hospital-acquired infections), and house dust mites (a common allergen). In fact, the new vaccine has worked for a remarkably wide spectrum of respiratory threats the researchers have tested, said Bali Pulendran, PhD, the Violetta L. Horton Professor II and a professor of microbiology and immunology who is the study's senior author.
The lead author of the study is Haibo Zhang, PhD, a postdoctoral scholar in Pulendran’s lab.
If translated into humans, such a vaccine could replace multiple jabs every year for seasonal respiratory infections and be on hand should a new pandemic virus emerge.
The new vaccine is unlike any vaccine used today.
Since the 1790s, when the English physician Edward Jenner coined the term vaccination (from the Latin vacca for cow) to refer to the use of cowpox to inoculate against smallpox, all subsequent vaccines have relied on the same fundamental principle: antigen specificity. That is, the vaccine mimics a distinctive component of the pathogen — the spike proteins that cover SARS-CoV-2, for example — to prepare the immune system to recognize and react quickly to the real pathogen.
“That’s been the paradigm of vaccinology for the last 230 years,” Pulendran said.
But antigen-specific vaccines fail when a pathogen mutates or when new pathogens emerge. That’s why there’s a new COVID-19 booster and flu shot every year.
“It’s becoming increasingly clear that many pathogens are able to quickly mutate. Like the proverbial leopard that changes its spots, a virus can change the antigens on its surface,” Pulendran said.
Most attempts at a so-called universal vaccine have the more modest goal of inducing immunity against an entire family of viruses — all coronaviruses or all flu viruses, for example — usually by mimicking evolutionarily conserved viral components that are less likely to mutate. A truly universal vaccine that could counter diverse pathogens seemed a pie-in-the-sky idea.
“We were interested in this idea because it sounded a bit outrageous,” Pulendran said. “I think nobody was seriously entertaining that something like this could ever be possible.”
The new vaccine doesn’t try to mimic any part of a pathogen; instead, it mimics the signals that immune cells use to communicate with each other during an infection. This novel strategy integrates the two branches of immunity — innate and adaptive — creating a feedback loop that sustains a broad immune response.
The adaptive immune system is the workhorse of current vaccines. It produces specialized agents, like antibodies and T cells, that target specific pathogens and remember them for years to come. The innate immune system, which deploys within minutes of a new infection, has received less attention because it typically lasts only a few days before ceding the spotlight to the adaptive immune system. It was seen as the warm-up act for the main show.
But Pulendran’s team was intrigued by the versatility of the innate immune system, which consists of generalists (such as dendritic cells, neutrophils and macrophages) that destroy anything deemed a pathogen.
“What’s remarkable about the innate system is that it can protect against a broad range of different microbes,” Pulendran said.
Innate immunity is short-lived, but provides something approaching universal protection.
There have long been hints that innate immunity can last longer in certain circumstances. The most-studied example is the Bacillus Calmette-Guérin tuberculosis vaccine, which is given to some 100 million newborns every year. Epidemiological and clinical studies have shown that it can decrease infant mortality from other infections, suggesting that the cross-protection could last months. But the phenomenon was inconsistent and the mechanism mysterious.
In 2023, Pulendran’s team published a study in mice elucidating the mechanism. Like other vaccines, the tuberculosis vaccine induced both an innate and adaptive immune response in the mice, but unusually, the innate response was sustained for several months. The researchers discovered that T cells recruited to the lungs as part of the adaptive response were sending signals to the innate immune cells to keep them active.
“Those T cells were providing a critical signal to keep the activation of the innate system, which typically lasts for a few days or a week, but in this case, it could last for three months,” Pulendran said.
The researchers showed that as long as the innate response remained active, the mice were protected against SARS-CoV-2 and other coronavirus infections. They identified the signals sent by T cells as cytokines that activate pathogen-sensing receptors, known as toll-like receptors, on innate immune cells.
“In that paper, we speculated that since we now know how the tuberculosis vaccine is mediating its cross-protective effects, it would be possible to make a synthetic vaccine, perhaps a nasal spray, that has the right combination of toll-like receptor stimuli and some antigen to get the T cells into the lungs,” Pulendran said.
“Fast forward two and a half years and we’ve shown that exactly what we had speculated is feasible in mice.”
The new vaccine, for now known as GLA-3M-052-LS+OVA, mimics the T cell signals that directly stimulate innate immune cells in the lungs. It also contains a harmless antigen, an egg protein called ovalbumin or OVA, which recruits T cells into the lungs to maintain the innate response for weeks to months.
In the study, mice were given a drop of the vaccine in their noses. Some received multiple doses, given a week apart. Each mouse was then exposed to one type of respiratory virus. With three doses of the vaccine, mice were protected against SARS-CoV-2 and other coronaviruses for at least three months.
In unvaccinated mice, these viruses caused dramatic weight loss — a sign of illness — and often death; their lungs were inflamed and full of virus. Vaccinated mice lost much less weight and all survived; their lungs were nearly clear of the virus.
The vaccine is a "double whammy" against viral infection, Pulendran said. The prolonged innate response lowers the amount of virus in the lungs 700-fold. And viruses that slip through this initial defense are met with a swift adaptive response in the lungs.
“The lung immune system is so ready and so alert that it can launch the typical adaptive responses — virus-specific T cells and antibodies — in as little as three days, which is an extraordinarily short length of time,” Pulendran said. “Normally, in an unvaccinated mouse, it takes two weeks.”
Amazed by the vaccine’s ability to fend off different types of viral infections, the researchers expanded their testing to bacterial respiratory infections, Staphylococcus aureus and Acinetobacter baumannii. The vaccinated mice were protected against these, too, for about three months.
“Then we thought, ‘What else could go in the lung?’” Pulendran said. “Allergens.”
They exposed the mice to a protein from house dust mites, a common trigger for allergic asthma. Allergic reactions are caused by a type of immune response known as a Th2 response. Unvaccinated mice showed a strong Th2 response and mucus accumulation in their airways; the vaccine quelled the Th2 response, and vaccinated mice maintained clear airways.
“I think what we have is a universal vaccine against diverse respiratory threats,” Pulendran said.
The researchers hope to test the vaccine in humans next, first in a Phase I safety trial, then, if successful, in a larger trial in which vaccinated people are exposed to infections. Pulendran thinks two doses of a nasal spray would be enough to provide protection in humans.
In the best-case scenario, with enough funding, Pulendran estimates a universal respiratory vaccine might be available in five to seven years. It could be a bulwark against new pandemics and simplify seasonal vaccinations.
“Imagine getting a nasal spray in the fall months that protects you from all respiratory viruses including COVID-19, influenza, respiratory syncytial virus and the common cold, as well as bacterial pneumonia and early spring allergens,” Pulendran said. “That would transform medical practice.”
Researchers from Emory University School of Medicine, the University of North Carolina at Chapel Hill, Utah State University and the University of Arizona contributed to the work.
The study received funding from the National Institutes of Health (grant AI167966), the Violetta L. Horton Professor endowment, the Soffer Fund endowment and Open Philanthropy.
...
Read the original on med.stanford.edu »