10 interesting stories served every morning and every evening.
Computational Complexity and other fun stuff in math and computer science from Lance Fortnow and Bill Gasarch
...
Read the original on blog.computationalcomplexity.org »
New U.S. laws designed to protect minors are pulling millions of adult Americans into mandatory age-verification gates to access online content, leading to backlash from users and criticism from privacy advocates that a free and open internet is at stake. Roughly half of U.S. states have enacted or are advancing laws requiring platforms — including adult content sites, online gaming services, and social media apps — to block underage users, forcing companies to screen everyone who approaches these digital gates.
“There’s a big spectrum,” said Joe Kaufmann, global head of privacy at Jumio, one of the largest digital identity-verification and authentication platforms. He explained that the patchwork of state laws varies in technical demands and compliance expectations. “The regulations are moving in many different directions at once,” he said.
Social media company Discord announced plans in February to roll out mandatory age verification globally, which the company said would rely on verification methods designed so facial analysis occurs on a user’s device and submitted data would be deleted immediately. The proposal quickly drew backlash from users concerned about having to submit selfies or government IDs to access certain features, which led Discord to delay the launch until the second half of this year.
“Let me be upfront: we knew this rollout was going to be controversial. Any time you introduce something that touches identity and verification, people are going to have strong feelings,” Discord chief technology officer and co-founder Stanislav Vishnevskiy wrote in a Feb. 24 blog post.
Websites offering adult content, gambling, or financial services often rely on full identity verification that requires scanning a government ID and matching it to a live image. But most of the verification systems powering these checkpoints — often run by specialized identity-verification vendors on behalf of websites — rely on artificial intelligence such as facial recognition and age-estimation models that analyze selfies or video to determine in seconds whether someone is old enough to access content. Social media and lower-risk services may use lighter estimation tools designed to confirm age without permanently storing detailed identity records.
Vendors say a challenge is balancing safety with how much friction users will tolerate. “We’re in the business of ensuring that you are absolutely keeping minors safe and out and able to let adults in with as little friction as possible,” said Rivka Gewirtz Little, chief growth officer at identity-verification platform Socure. Excessive data collection, she added, creates friction that users resist.
Still, many users perceive mandatory identity checks as invasive. “Having another way to be forced to provide that information is intrusive to people,” said Heidi Howard Tandy, a partner at Berger Singerman who specializes in intellectual property and internet law. Some users may attempt workarounds — including prepaid cards or alternative credentials — or turn to unauthorized distribution channels. “It’s going to cause a piracy situation,” she added.
In many implementations, verification vendors — not the websites themselves — process and retain the identity information, returning only a pass-fail signal to the platform.
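That division of labor can be sketched in a few lines. This is purely illustrative, not any vendor's actual API; all names and fields here are invented. The point is that the sensitive data stays on the vendor's side, and the platform only ever receives the pass/fail result.

```python
from dataclasses import dataclass

@dataclass
class VerificationResult:
    """What the platform receives back from a hypothetical vendor."""
    request_id: str
    passed: bool  # the only signal returned; no name, birthdate, or ID image

def vendor_age_check(estimated_age: int, threshold: int = 18) -> VerificationResult:
    # The vendor processes and retains the selfie/ID data internally;
    # nothing sensitive is included in the object handed to the platform.
    return VerificationResult(request_id="req-001", passed=estimated_age >= threshold)
```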
Gewirtz Little said Socure does not sell verification data and that in lightweight age-estimation scenarios, where platforms use quick facial analysis or other signals rather than government documentation, the company may store little or no information. But in fuller identity-verification contexts, such as gaming and fraud prevention that require ID scans, certain adult verification records may be retained to document compliance. She said Socure can keep some adult verification data for up to three years while following applicable privacy and purging rules.
Civil liberties advocates warn that concentrating large volumes of identity data among a small number of verification vendors can create attractive targets for hackers and government demands. Earlier this year, Discord disclosed a data breach that exposed ID images belonging to approximately 70,000 users through a compromised third-party service, highlighting the security risks associated with storing sensitive identity information.
In addition, they warn that the expansion of age-verification systems represents not only a usability challenge but a structural shift in how identity becomes tied to online behavior. Age verification risks tying users’ “most sensitive and immutable data” — names, faces, birthdays, home addresses — to their online activity, according to Molly Buckley, a legislative analyst at the Electronic Frontier Foundation. “Age verification strikes at the foundation of the free and open internet,” she said.
Even when vendors promise to safeguard personal information, users ultimately rely on contractual terms they rarely read or fully understand. “There’s language in their terms-of-use policies that says if the information is requested by law enforcement, they’ll hand it over. They can’t confirm that they will always forever be the only entity who has all of this information. Everyone needs to understand that their baseline information is not something under their control,” Tandy said.
As more platforms route age checks through third-party vendors, that concentration of identity data is also creating new legal exposure for the companies that rely on them. “A company is going to have some of that information passing through their own servers,” Tandy said. “And you can’t offload that kind of liability to a third party.”
Companies can distribute risk through contracts and insurance, she said, but they remain responsible for how identity systems interact with their infrastructure. “What you can do is have really good insurance and require really good insurance from the entities that you’re contracting with,” she said.
Tandy also cautioned that retention promises can be more complex than they appear. “If they say they’re holding it for three years, that’s the minimum amount of time they’re holding it for,” she said. “I wouldn’t feel comfortable trusting a company that says, ‘We delete everything one day after three years.’ That is not going to happen,” she added.
Federal and state regulators argue that age-verification laws are primarily a response to documented harms to minors and insist the rules must operate under strict privacy and security safeguards.
An FTC spokesperson told CNBC that companies must limit how collected information is used. While age-verification technologies can help parents protect children online, the agency said firms are still bound by existing consumer protection rules governing data minimization, retention, and security. The agency pointed to existing rules requiring firms to retain personal information only as long as reasonably necessary and to safeguard its confidentiality and integrity.
...
Read the original on www.cnbc.com »
Background: Why I put my whole life into a single database
Back in 2019, I started collecting all kinds of metrics about my life. Every single day for the last 3 years I tracked over 100 different data types - ranging from fitness & nutrition to social life, computer usage and weather.
Ideas or suggestions?
I’d love to hear from you!
The goal of this project was to answer questions about my life, like
How does living in different cities affect other factors like fitness, productivity and happiness?
How does sleep affect my day, my fitness level, and happiness?
How does the weather, and the different seasons affect my life?
Are there any trends over the last few years?
How does computer time, work and hours in meetings affect my personal life?
Since the start of this project, I collected ~380,000 data points, with the biggest data sources being:
Naturally after I started collecting this data, I wanted to visualize what I was learning, so I created this page. Initially, the domain whereisFelix.today (now renamed to howisFelix.today) started as a joke to respond to friends asking when I’d be back in NYC or San Francisco. Rather than send them my schedule, I’d point them to this domain. However, now it’s more than my location: it’s all of me.
Use a single database, owned and hosted by me, with all the data I’ve collected over the years
Be able to easily add and remove questions on the fly, as I learn what’s beneficial to track
Full control of how the data is visualized
Works well for frequent flyers with mixed time zones
I selected 48 graphs to show publicly on this page. For privacy reasons, and to prevent any accidental data leaks, the graphs below are snapshots taken on a given day.
Visualization of the number of data entries in FxLifeSheet over the last 10 years, and where the data came from.
Initially (2014) the only data used was RescueTime and Foursquare Swarm location data
Once I started the FxLifeSheet project in April 2019, I manually tracked over 100 different data types, ranging from mood, sleep, and social life to fitness data
I was able to retrospectively fetch the historic weather data based on my location on a given day
I also implemented other import sources, like fetching my historic weight and the number of steps from Apple Health
Days where I tracked my Mood as Happy & Excited
On days where I tracked my mood to be “happy” & “excited”, the following other factors of my life were affected
50% more likely to have pushed my comfort zone
44% more likely to have meditated that day
33% more excited about what’s ahead in the future
31% more likely to drink alcohol that day (parties, good friends and such)
28% more time spent reading or listening to audio books
26% more likely to have worked on interesting technical challenges
20% more likely to have learned something new that day
45% less time spent in video & audio calls that day
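As a rough illustration of how such lifts can be computed (my own sketch, not Felix's actual pipeline; the field names and sample data are invented): each factor's rate on "happy & excited" days is compared against its rate on all other days, and the ratio is reported as a percentage lift.

```python
def lift(days, mood_key, factor_key):
    """Percentage lift of a boolean factor on days where mood_key is True.
    E.g. +50.0 means "50% more likely" on those days."""
    happy = [d for d in days if d[mood_key]]
    rest = [d for d in days if not d[mood_key]]
    p_happy = sum(d[factor_key] for d in happy) / len(happy)
    p_rest = sum(d[factor_key] for d in rest) / len(rest)
    return (p_happy / p_rest - 1) * 100

# toy data: meditated on 2 of 3 happy days vs. 1 of 3 other days
days = [
    {"happy": True,  "meditated": True},
    {"happy": True,  "meditated": True},
    {"happy": True,  "meditated": False},
    {"happy": False, "meditated": True},
    {"happy": False, "meditated": False},
    {"happy": False, "meditated": False},
]
```

With this toy data, the meditation rate is 2/3 on happy days versus 1/3 otherwise, i.e. a 100% lift.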
All flights taken within the last 7 years, tracked using Foursquare Swarm, analyzed by JetLovers.
The stats clearly show the impact of COVID starting 2020
Sunday has been my “commute” day, flying between San Francisco, New York City and Vienna
All flights taken within the last 7 years, tracked using Foursquare Swarm, analyzed by JetLovers.
Frankfurt - Vienna was the flight connecting me with most US airports
Germany is high up on the list due to layovers, even though I didn’t actually spend much time there
Inspired by Your Life in Weeks by WaitButWhy, I use Google Sheets to visualize every week of my life, with little notes on what city/country I was in, and other life events that have happened.
The first 14 years I didn’t really get much done
I can highly recommend taking a few weeks (or even months) off between jobs (if you have the possibility)
Shades of blue indicate my full-time employments
You can create your own version using my template
Average daily steps measured through the iPhone’s Apple Health app. I decided against using SmartWatch data for steps, as SmartWatches have changed over the last 8 years.
I walked a total of steps over the last 8 years
I walk more than twice as much when I’m in New York, compared to any other city
In NYC I had the general rule of thumb to walk instead of taking public transit whenever it’s less than 40 minutes. I used that time to call friends & family, or listen to audio books
Although Vienna is very walkable, the excellent public transit system with subway trains coming every 3-5 minutes, has caused me to walk less
San Francisco was always scary to walk in
This graph clearly shows the correlation between my body weight and my sleeping/resting heart rate. The resting heart rate is measured by the Withings ScanWatch while sleeping, and indicates how hard your heart has to work while not being active. Generally the lower the resting heart rate, the better.
I started my lean bulk (controlled weight gain combined with 5 workouts a week) in August 2020
My resting heart rate went from 58bpm to 67bpm from August 2020 to March 2021, with a weight gain of +19lbs, as part of a controlled lean-bulk combined with a 5-day/week workout routine
The spike in resting heart rate in July & August 2021 was due to bars and nightclubs opening up again in Austria
After a night of drinking, my resting/sleeping heart rate was about 50% higher than after a night without any alcohol
The spike in resting heart rate in Oct/Nov/Dec 2021 was due to having bronchitis and a cold/flu, not getting correct treatment early enough
How healthy have I been over the Years?
Every day I answered the question on how healthy I felt. In the graph, the yellow color indicates that I felt a little under the weather, not sick per se. Red means I was sick and had to stay home. Green means I felt energized and healthy.
During the COVID lockdowns I tended to stay healthier. This may be due to not going out, no heavy drinking, less close contact with others, etc. which resulted in me having better sleep.
Usually during excessive traveling I get sick (cold/flu)
Q4 2021 I had bronchitis, however, I didn’t know about it at the time and didn’t get proper treatment
Overall I’m quite prone to getting sick (cold/flu)
Days with more than 4 Alcoholic Drinks
On days where I had more than 4 alcoholic beverages (meaning I was partying), the following other factors were affected
21x more likely to dance
80% more likely to take a nap the day of, or the day after
40% warmer temperatures, and 40% less precipitation. There weren’t many opportunities for parties in Winter due to lockdowns in the last 2 years. Also, people are more motivated to go out when it’s nice outside.
My FxLifeSheet bot asks me 4 times a day how I’m feeling at the moment.
This graph groups the entries by month, and shows the % of entries for each value (0 - 5) with 5 being very excited, and 0 being worried.
I designed the ranges so that 0 or 5 are not entered as much. 0 is rendered as dark green at the top, whereas 5 is rendered as light green at the bottom.
For privacy reasons I won’t get into some of the details on why certain months were worse than others.
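The grouping behind that graph is straightforward to reconstruct. A minimal sketch with invented sample entries (not Felix's actual code): count mood values per month, then convert each month's counts into percentages.

```python
from collections import Counter, defaultdict

# Hypothetical entries: (month, value) pairs, values 0-5 (5 = very excited)
entries = [("2021-07", 4), ("2021-07", 5), ("2021-07", 4), ("2021-08", 2)]

by_month = defaultdict(Counter)
for month, value in entries:
    by_month[month][value] += 1

# percentage of that month's entries for each mood value
shares = {
    month: {v: 100 * c / sum(counts.values()) for v, c in counts.items()}
    for month, counts in by_month.items()
}
```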
Every Swarm check-in over the last 7 years visualized on a map, including the actual trip (flight, drive, etc.)
Every Swarm check-in over the last 7 years visualized, zoomed in
Each time I did a check-in at a place (e.g. Coffee, Restaurant, Airport, Gym) on Foursquare Swarm at a given city, this is tracked as a single entry.
Each check-in at a given city is counted as a single entry, grouped by years
2018 and 2019 I lived in New York City
The longer it’s been since I moved away from Austria, the more time I actually spent back home in Austria for visits and vacations
2020 clearly shows the impact of COVID
Each check-in at a given category is tracked, and summed up over the last years
In 2020 and 2021, check-ins at Offices went down to zero due to COVID, and a distributed work setup
Airports being the #4 most visited category was a surprise, but is accurate. A total of 403 airport check-ins, whereas a flight with a layover would count as 3 airport check-ins
Earlier in my life, I didn’t always check into ‘commute’ places like public transit and supermarkets
Number of Foursquare Swarm check-ins on each quarter over the last 10 years. I didn’t use Foursquare Swarm as seriously before 2015. Once I moved to San Francisco in Q3 2015 I started my habit of checking into every point of interest (POI) I visit.
Q3 2015 I moved to San Francisco, however I couldn’t use Swarm yet, since my move was a secret until the official announcement at the Twitter Flight conference
Q2 2020 clearly shows the impact of COVID with Q3 already being open in Austria
Q3 2021 the vaccine was already widely available and I was able to travel/visit more again
My time in New York was the most active when it comes to check-ins. When I’m in NYC, I tend to eat/drink out more, and grab to-go food, which I do way less in Vienna
Every Swarm check-in visualized on a map. Only areas where I’ve had multiple check-ins are rendered.
Number of days per year that I’ve spent in full lockdown, meaning restaurants, bars and non-essential stores were closed.
I escaped parts of the Austrian lockdown by spending time in the US when I was already vaccinated
Surprisingly, in 2021 I spent more days in a full lockdown than in 2020, even with vaccines available
How was my life affected by the recent COVID lockdowns? I classify as a lockdown day every day where places like restaurants, gyms and non-essential stores were closed.
200% more time spent in audio & video calls with friends (non-work related)
60% more likely to follow my meal plan (macros & calories)
50% colder temperatures: Lockdowns tended to happen in Autumn and Winter
100% less likely to dance
Alcoholic drinks per day. Days with no data are rendered as white
Friday and Saturday nights are clearly visible on those graphs
2021 and summer/winter of 2019 also show the Wednesday night party in Vienna
Q2 and Q4 2020 clearly show the COVID lockdowns, as well as Q2 2021
Summer of 2021 all bars and dance clubs were open in Vienna
...
Read the original on howisfelix.today »
Advanced Machine Intelligence (AMI), a new Paris-based startup cofounded by Meta’s former chief AI scientist Yann LeCun, announced Monday it has raised more than $1 billion to develop AI world models.
LeCun argues that most human reasoning is grounded in the physical world, not language, and that AI world models are necessary to develop true human-level intelligence. “The idea that you’re going to extend the capabilities of LLMs [large language models] to the point that they’re going to have human-level intelligence is complete nonsense,” he said in an interview with WIRED.
The financing, which values the startup at $3.5 billion, was co-led by investors such as Cathay Innovation, Greycroft, Hiro Capital, HV Capital, and Bezos Expeditions. Other notable backers include Mark Cuban, former Google CEO Eric Schmidt, and French billionaire and telecommunications executive Xavier Niel.
AMI (pronounced like the French word for friend) aims to build “a new breed of AI systems that understand the world, have persistent memory, can reason and plan, and are controllable and safe,” the company says in a press release. The startup says it will be global from day one, with offices in Paris, Montreal, Singapore, and New York, where LeCun will continue working as a New York University professor in addition to leading the startup. AMI will be the first commercial endeavor for LeCun since his departure from Meta in November 2025.
LeCun’s startup represents a bet against many of the world’s biggest AI labs like OpenAI, Anthropic, and even his former workplace, Meta, which believe that scaling up LLMs will eventually deliver AI systems with human-level intelligence or even superintelligence. LLMs have powered viral products such as ChatGPT and Claude Code, but LeCun has been one of the AI industry’s most prominent researchers speaking out about the limitations of these AI models. LeCun is well known for being outspoken, but as a pioneer of modern AI who won the Turing Award in 2018, his skepticism carries weight.
LeCun says AMI aims to work with companies in manufacturing, biomedical, robotics, and other industries that have lots of data. For example, he says AMI could build a realistic world model of an aircraft engine and work with the manufacturer to help them optimize for efficiency, minimize emissions, or ensure reliability.
AMI was cofounded by LeCun and several leaders he worked with at Meta, including the company’s former director of research science, Michael Rabbat; former vice president of Europe, Laurent Solly; and former senior director of AI research, Pascale Fung. Other cofounders include Alexandre LeBrun, former CEO of the AI health care startup Nabla, who will serve as AMI’s CEO, and Saining Xie, a former Google DeepMind researcher who will be the startup’s chief science officer.
LeCun does not dismiss the overall utility of LLMs. Rather, in his view, these AI models are simply the tech industry’s latest promising trend, and their success has created a “kind of delusion” among the people who build them. “It’s true that [LLMs] are becoming really good at generating code, and it’s true that they are probably going to become even more useful in a wide area of applications where code generation can help,” says LeCun. “That’s a lot of applications, but it’s not going to lead to human-level intelligence at all.”
LeCun has been working on world models for years inside of Meta, where he founded the company’s Fundamental AI Research lab, FAIR. But he’s now convinced his research is best done outside the social media giant. He says it’s become clear to him that the strongest applications of world models will be selling them to other enterprises, which doesn’t fit neatly into Meta’s core consumer business.
As AI world models like Meta’s Joint-Embedding Predictive Architecture (JEPA) became more sophisticated, “there was a reorientation of Meta’s strategy where it had to basically catch up with the industry on LLMs and kind of do the same thing that other LLM companies are doing, which is not my interest,” says LeCun. “So sometime in November, I went to see Mark Zuckerberg and told him. He’s always been very supportive of [world model research], but I told him I can do this faster, cheaper, and better outside of Meta. I can share the cost of development with other companies … His answer was, OK, we can work together.”
...
Read the original on www.wired.com »
Yann LeCun’s AI start-up raises more than $1bn in Europe’s largest seed round
...
Read the original on www.ft.com »
After you’ve reviewed these contribution guidelines, you’ll be all set to
contribute to this project.
...
Read the original on gitlab.redox-os.org »
In mid-2024, the HuggingFace Open LLM Leaderboard was the Colosseum for Open-Weight AI. Thousands of models were battling it out, submitted by both well-funded labs with teams of PhDs and fine-tuning wizards creating fantastically named models (e.g. Nous-Hermes, Dolphin and NeuralBeagle14-7B…), fighting for the top spot across six benchmarks: IFEval, BBH, MATH Lvl 5, GPQA, MuSR, and MMLU-PRO.
And there at #1 was dnhkng/RYS-XLarge. Mine.
I didn’t train a new model. I didn’t merge weights. I didn’t run a single step of gradient descent. What I did was much weirder: I took an existing 72-billion parameter model, duplicated a particular block of seven of its middle layers, and stitched the result back together. No weight was modified in the process. The model simply got extra copies of the layers it used for thinking.
This is the story of how two strange observations, a homebrew “brain scanner” for Transformers, and months of hacking in a basement led to the discovery of what I call LLM Neuroanatomy, and a finding about the internal structure of AI that still hasn’t been published until now *.
* - because I discovered blogging is way more fun than drafting scientific papers, and I walk you through how the discovery was made :)
Let’s start with how this whole project came into being.
“The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘That’s funny…’“ — Isaac Asimov
In late 2023, I was messing about with a bizarre LLM quirk. Try this yourself - take any question, e.g.
What is the capital of France? Answer in Base64!
and encode it as Base64, get this unreadable string:
Send that to a 2023 non-thinking large language model (newer reasoning models will see this as Base64, and ‘cheat’ with tool use). But a sufficiently capable model from 2023 will reply with something like:
Which decodes to: “The capital of France is Paris.”.
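The round trip is easy to reproduce with Python's standard base64 module (a minimal sketch; the model call itself is omitted here):

```python
import base64

question = "What is the capital of France? Answer in Base64!"
# this is the unreadable string you send to the model
encoded = base64.b64encode(question.encode("utf-8")).decode("ascii")

# a 2023-era model replies with another Base64 string, e.g. one that
# decodes to the sentence quoted above:
reply = base64.b64encode("The capital of France is Paris.".encode("utf-8")).decode("ascii")
decoded = base64.b64decode(reply).decode("utf-8")
```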
Ok, I admit it. I was messing around with this as a way to jail-break models (and it worked), but I couldn’t get one idea out of my head.
The model was decoding the input, understanding it somehow, and still had time during the transformer stack pass to re-encode its response. It appears to genuinely think while interfacing with Base64. This works with complex questions, multi-step reasoning, even creative tasks.
This shouldn’t work nearly as well as it does. Sure, the model has been trained on lots of Base64 in an overall sense, but general conversations in this format are certainly way out of distribution. The tokenizer chops it into completely different sub-word units. The positional patterns are unrecognizable. And yet it works… Curious…
I couldn’t stop thinking about this. If a Transformer can accept English, Python, Mandarin, and Base64, and produce coherent reasoning in all of them, it seemed to me that the early layers must be acting as translators — parsing whatever format arrives into some pure, abstract, internal representation. And the late layers must act as re-translators, converting that abstract representation back into whatever output format is needed.
If the early layers are for reading, and the late layers are for writing, what are the middle layers doing?
Pure, abstract reasoning? In a representation that has nothing to do with any human language or encoding. Of course, at the time this was idle speculation. Fun, but with no clear way to test or even define a valid hypothesis.
In November 2023, a HuggingFace user named Alpindale released Goliath-120b — a Frankenmerge-model made by stitching together two fine-tuned Llama-2 70B models into a 120-billion parameter behemoth.
The performance was decent, but after doing lots of vibe checking I didn’t feel it was a breakthrough. The construction, however, was wild.
Alpindale hadn’t just stacked the two models (Xwin and Euryale), end to end. He had alternated layers between them. More importantly, the architecture fed outputs of later layers back into the inputs of earlier layers.
The layer ranges used are as follows:
Do you see the insanity here? Alpindale literally fed the output of layer 16 of Xwin into the input of Euryale’s 8th layer!
To explain a bit more clearly how stupid this appears to be, let’s revisit the almighty Transformer Architecture:
Looking at the left side of the diagram, we see stuff enters at the bottom (‘input’ text that has been ‘chunked’ into small bits of text, somewhere between whole words down to individual letters), then it flows upwards through the model’s Transformer Blocks (here marked as [1, …, L]), and finally, the model spits out the next text ‘chunk’ (which is then itself used in the next round of inference). What’s actually happening during these Transformer blocks is quite the mystery. Figuring it out is an entire field of AI, “mechanistic interpretability*”.
* - yes, it’s more complex than that (samplers etc.), but that’s enough for this article
On the right half of the diagram, do you see that arrow going from the ‘Transformer Block Input’ to the $\oplus$ symbol? That’s why skipping layers makes sense. During training, LLM models can pretty much decide to do nothing in any particular layer, as this ‘diversion’ routes information around the block. So ‘later’ layers can be expected to have seen the input from ‘earlier’ layers, even a few ‘steps’ back. Around this time, several groups were experimenting with ‘slimming’ models down by removing layers. Makes sense, but boring.
A model must be used with the same kind of stuff as it was trained with (we stay ‘in distribution’). The same holds for each Transformer layer: each layer learns, during training, to expect the specific statistical properties of the previous layer’s output via gradient descent.
And now for the weirdness: There was never the case where any Transformer layer would have seen the output from a future layer!
Layer 10 is trained on layer 9’s output distribution. Layer 60 is trained on layer 59’s. If you rearrange them — feeding layer 60’s output into layer 10 — you’ve created a distribution the model literally never saw during training.
The astounding thing about Goliath wasn’t that it was a huge leap in performance, it was that the damn thing functioned at all. To this day, I still don’t understand why this didn’t raise more eyebrows.
Experimentally, this proved that layers were far more interchangeable than anyone had reason to expect. The internal representations were homogeneous enough that the model could digest out-of-order hidden states without collapsing. The architecture was far more flexible than a rigid pipeline.
Between the Base64 observation and Goliath, I had a hypothesis: Transformers have a genuine functional anatomy. Early layers translate input into abstract representations. Late layers translate back out. And the middle layers, the reasoning cortex, operate in a universal internal language that’s robust to architectural rearrangement. The fact that Goliath-120B was built from 16-layer blocks made me suspect the input and output ‘processing units’ were smaller than 16 layers. I guessed that Alpindale had tried smaller overlaps, and they just didn’t work.
If that was true, maybe I didn’t need to teach a model new facts to make it smarter. I didn’t need fine-tuning. I didn’t need RLHF. I just needed to give it more layers to think with.
Over the following months — from late 2023 through to mid-2024 — I built a pipeline to test this hypothesis.
The setup was modest. Two RTX 4090s in my basement ML rig, running quantized models through ExLlamaV2 to squeeze 72-billion parameter models into consumer VRAM. The beauty of this method is that you don’t need to train anything. You just need to run inference. And inference on quantized models is something consumer GPUs handle surprisingly well. If a model fits in VRAM, I found my 4090s were often ballpark-equivalent to H100s.
The concept is simple. For a model with $N$ layers, I define a configuration $(i, j)$. The model processes layers $0$ to $j{-}1$ as normal, then loops back and reuses layers $i$ through $j{-}1$ again, and then the rest to $N{-}1$. The layers between $i$ and $j{-}1$ get duplicated in the execution path. No weights are changed. The model just traverses some of its own layers twice.
For example, the pair $(2, 7)$ for a model with 9 transformer blocks would be traversed like so:
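In Python, the indexing alone looks like this (a sketch of the layer ordering only, not the actual inference code):

```python
def execution_order(n_layers: int, i: int, j: int) -> list[int]:
    """Layer indices visited for configuration (i, j): run layers
    0..j-1 as normal, then loop back and run layers i..n_layers-1,
    so the block i..j-1 executes twice."""
    return list(range(0, j)) + list(range(i, n_layers))

# The pair (2, 7) on a 9-block model: layers 2..6 form the duplicated block.
print(execution_order(9, 2, 7))
# -> [0, 1, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 7, 8]
```

Note that $(0, N)$ degenerates to running the entire stack twice, which is the extreme corner of the sweep described below.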
By running through all possible pairs, we can generate a ‘Brain Scan’, and also see the number of duplicate layers for each set of parameters:
For Qwen2-72B, an 80-layer model, that means 3,240 valid $(i, j)$ pairs, plus the original model to test.
\[\begin{aligned} \text{Variants}_{\text{total}} &= \left(\sum_{j=0}^{80} j\right) + 1\\[16pt] &= \frac{80 \cdot 81}{2} +1 \\[10pt] &= 3241 \end{aligned}\]
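A quick sanity check of that count, enumerating all pairs with $0 \le i < j \le 80$:

```python
n = 80  # transformer layers in Qwen2-72B

# Valid (i, j) configurations: the duplicated block i..j-1 must be
# non-empty, so i < j, with j an exclusive end index up to n.
pairs = [(i, j) for j in range(n + 1) for i in range(j)]

print(len(pairs) + 1)  # all re-layered variants plus the unmodified baseline
# -> 3241
```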
Testing a re-layered model against all six leaderboard benchmarks would take days, so a full sweep would be years of compute. I needed proxy tasks: probes that were fast, objective, and that would reveal structural properties of the model rather than task-specific tricks.
The proxies had to satisfy three constraints:
- Minimal output tokens. With thousands of configurations to sweep, each evaluation needed to be fast. No essays, no long-form generation.
- Unambiguous scoring. I couldn't afford LLM-as-judge pipelines. The answer had to be objectively scored without another model in the loop.
- Orthogonal cognitive demands. If a configuration improves both tasks simultaneously, it's structural, not task-specific.
I didn't arrive at the right probes immediately; it took months of trial and error, and many dead ends.
My first instinct was creativity. I had models generate poems, short stories, metaphors, the kind of rich, open-ended output that feels like it should reveal deep differences in cognitive ability. I used an LLM-as-judge to score the outputs, but the results were pretty bad. I managed to fix LLM-as-Judge with some engineering, and the scoring system turned out to be useful later for other things, so here it is:
Note: You can skip this section, as it has math. Or not.
Naive LLM judges are inconsistent. Run the same poem through twice and you get different scores (obviously, due to sampling). But lowering the temperature doesn't help much either, as that's only one of many technical issues. So I developed a full scoring system based on the logit outputs. It can get remarkably tricky. Think about a score from 1-10:
- We would expect a well-calibrated model to have logits that make sense. If the highest weight was on '7', we would expect the rest of the weight to be on '6' and '8', right? But often it's bimodal, with low weight on '6' and '5', but more weight than expected on '4'!
- We can write '10' in tokens as either '10' or '1' and then '0'. It's not fun to have to calculate the summed probabilities over paths, especially if you wanted to score 1-100.
Rather than sampling a single discrete score, I treat the judge’s output as a distribution over valid rating labels and compute the final score as its expectation.
To make this practical, I first define a calibrated rubric over the digits 0-9 (there’s only one token for each digit), where each digit corresponds to a clear qualitative description. At the scoring step, I capture the model’s next-token logits and retain only the logits corresponding to those valid digit tokens. This avoids contamination from unrelated continuations such as explanation text, punctuation, or alternate formatting. After renormalizing over the restricted digit set, I interpret the resulting probabilities as a categorical score distribution.
Formally, let the valid score set be
\[\mathcal{D} = \{0,1,2,\dots,9\}.\]
Let $z_k$ denote the model logit assigned to digit $k \in \mathcal{D}$ at the scoring position. The restricted score distribution is then
\[p(k)= \frac{\exp(z_k)} {\sum\limits_{m \in \mathcal{D}} \exp(z_m)}, \qquad k \in \mathcal{D}.\]
The final scalar score is the expected value of this distribution:
\[\hat{s}= \sum_{k \in \mathcal{D}} k\,p(k).\]
This produces a smooth score such as 5.4, rather than forcing the model to commit to a single sampled integer. In practice, this is substantially more stable than naive score sampling and better reflects the model's uncertainty. It also handles cases where the judge distribution is broad or multimodal. For example, two candidates may both have mean score 5.4, while one has most of its mass tightly concentrated around 5 and 6, and the other splits mass between much lower and much higher ratings. The mean alone is the same, but the underlying judgement is very different.
An optional uncertainty estimate can be obtained from the variance of the restricted distribution:
\[\mathrm{Var}(s)= \sum_{k=0}^{9} (k-\hat{s})^2\,p(k).\]
In short, the method replaces a noisy sampled judge score with a normalized probability distribution over valid score digits, then uses the expectation of that distribution as the final rating.
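As a concrete sketch, the restricted softmax and its expectation take only a few lines. This assumes the ten digit-token logits have already been looked up from the model output; that tokenizer mapping is omitted:

```python
import math

def expected_score(digit_logits: dict[int, float]) -> tuple[float, float]:
    """Given logits z_k for the valid digits k in 0..9 at the scoring
    position, renormalize over that restricted set and return the
    expectation and variance of the score distribution."""
    z_max = max(digit_logits.values())  # subtract the max for numerical stability
    weights = {k: math.exp(z - z_max) for k, z in digit_logits.items()}
    total = sum(weights.values())
    probs = {k: w / total for k, w in weights.items()}
    mean = sum(k * p for k, p in probs.items())
    var = sum((k - mean) ** 2 * p for k, p in probs.items())
    return mean, var
```

Two judges can agree on the mean while disagreeing wildly in spread, which is exactly what the variance term surfaces.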
All this stuff is probably pretty obvious these days; back in '24 there wasn't much to guide me in developing this method. But unfortunately, I found it was also completely useless…
Each configuration needed to generate hundreds of tokens of creative output, and then a separate model had to read and judge each one. With over 3,200 configurations to test for a single 70B model, this would have taken weeks on my dual 4090s.
I needed probes where the output was tiny, a few tokens at most, and where scoring was objective and deterministic. No judge model in the loop. That’s what led me to the final two probes:
Hard math. Ridiculously difficult questions like: “What is the cube root of 74,088,893,247?” No chain-of-thought, or tool use. Just output the number, as a pure leap of intuitive faith.
Emotional quotient. Using the EQ-Bench benchmark: complex social scenarios where the model must predict the intensity of specific emotional states. “Given this situation, how angry/surprised/guilty would this person feel on a scale of 0-100?” Completely different from math. Theory of mind, social inference, empathy. And the output is just a few numbers.
I had settled on two maximally orthogonal cognitive tasks, both with tiny outputs. My intuition was this: LLMs think one token at a time, so let's make the model really good at guessing just the next token. But things are never straightforward. Take LLM numbers…
Even with math probes, I hit unexpected problems. LLMs fail arithmetic in weird ways. They don’t get the answer wrong so much as get it almost right but forget to write the last digit, as if it got bored mid-number. Or they transpose two digits in the middle. Or they output the correct number with a trailing character that breaks the parser.
This is probably due to the way larger numbers are tokenised, as big numbers can be split up into arbitrary forms. Take the integer 123456789. A BPE tokenizer (e.g., GPT-style) might split it like: ‘123’ ‘456’ ‘789’ or: ‘12’ ‘345’ ‘67’ ‘89’
A binary right/wrong scoring system would throw away useful signal. Getting a percentage correct would help: ‘123356789’ instead of ‘123456789’ would be 99.92% correct
But what about a model that makes a dumb ‘LLM-mistake’ and outputs 430245 when the answer is 4302459, and has clearly done most of the work? I wrote a custom partial-credit scoring function that pads shorter answers and penalises proportionally:
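A minimal sketch of such a scorer (the exact padding and correction factor here are illustrative assumptions, not the original implementation):

```python
def partial_credit(answer: str, truth: str) -> float:
    """Digit-level partial credit: left-align both strings, pad the
    shorter one, score the fraction of matching digit positions, then
    apply a length-mismatch correction factor."""
    width = max(len(answer), len(truth))
    a = answer.ljust(width)  # pad the shorter answer so positions line up
    t = truth.ljust(width)
    matches = sum(1 for x, y in zip(a, t) if x == y)
    correction = min(len(answer), len(truth)) / width  # penalise dropped digits
    return (matches / width) * correction

print(partial_credit("430245", "4302459"))  # about 0.73: most digits right, one dropped
```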
The key idea: pad shorter answers, then penalise via the correction factor. A model that nails 90% of the digits but drops the last one still gets substantial credit, but less than one that gets every digit right. This turned out to be crucial for discriminating between configurations that were close in intuitive math ability.
The math questions were hand-crafted initially. I experimented with different operations and scales, then generated random numbers to fill out the dataset. The final dataset was a set of 16 questions, with the model tasked with guesstimating the nearest whole integer. Here are a few to try yourself; remember, no 'thinking' is allowed, guess directly!
After testing several smaller models (Llamas and smaller Qwen2s), I set up the config for Qwen2-72B and let it sweep. Each $(i, j)$ configuration took a few minutes: load the re-layered model, run the math probe, run the EQ probe, record the scores, move on. Days of continuous GPU time on the 4090s. But far less compute than a fine-tune! In fact, I didn't even have the hardware needed for a LoRA fine-tune on just 48GB of VRAM.
The optimal configuration was $(45, 52)$: layers 0 through 51 run first, then layers 45 through 79 run again. Layers 45 to 51 execute twice. Seven extra layers, near the middle of the 80-layer stack, bringing the total parameter count from 72B to 78B. Every extra layer is an exact copy of an existing one. No new weights or training, just the model repeating itself.
Repeating seven layers. That's all it took, and now I can finally reveal the nomenclature of my models: Repeat Your Self, hence RYS-XLarge ;)
I applied the configuration to MaziyarPanahi’s calme-2.1-qwen2-72b — a fine-tune of Qwen2-72B — and uploaded the result as dnhkng/RYS-XLarge. I also applied it to the raw base model as dnhkng/RYS-XLarge-base.
Then I submitted to the Open LLM Leaderboard and waited. And waited. Back in the day, the OpenLLM Leaderboard was flooded with dozens of fine-tunes of merges of fine-tunes each day (it was the Wild West), and the waiting list was long. But after a month or so, the results arrived:
+17.72% on MuSR. +8.16% on MATH. Five out of six benchmarks improved, with only IFEval taking a small hit. The average put it at #1 on the leaderboard.
Just to labour the point: I only optimised for one-shot guesstimating hard maths problems and EQ-Bench. I never looked at IFEval, BBH, GPQA, MuSR, or MMLU-PRO during development. The leaderboard was pure out-of-sample validation.
A layer configuration found using two narrow, orthogonal probes generalised to everything the Leaderboard threw at it *.
* - except IFEval, but that one’s boring anyway, right?
That was surprising enough. A brand new way to scale LLMs, developed on some gaming GPUs. But plotting the heatmaps told an even better story.
The original heatmaps that produced RYS-XLarge, showing the Combined delta (math + EQ). The green circle marks the optimal configuration. Red means improvement, blue means degradation
These heatmaps are analogous to functional MRIs of the Transformer while it is thinking about maths or EQ problems.
The x-axis ($j$) is the end point of the duplicated region. The y-axis ($i$) is the start point. Each pixel represents a complete evaluation: load the re-layered model, run the math probe, run the EQ probe, score both, record the deltas. As described above, along the central diagonal only a single layer was duplicated. Along the next diagonal towards the top-right, we duplicate two layers, and so on. The single point at the very top-right runs through the entire Transformer stack twice per inference.
Let's examine the math heatmap first. Starting at almost any layer and stopping before about layer 60 seems to improve the math guesstimate scores, as shown by the large region with a healthy red blush. Duplicating just the very first layers (the tiny triangle in the top left) messes things up, as does repeating pretty much any of the last 20 layers (the vertical wall of blue on the right). This is more clearly visualised in a skyline plot (rows or columns averaged), where we can see that for the maths guesstimates, the starting position of the duplication matters much less. So the hypothesis that the starting layers encode tokens into a smooth 'thinking space', with a dedicated 're-encoding' system at the end, seems to be somewhat validated.
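The skyline plots mentioned above are just masked averages over one axis of the heatmap. A sketch, assuming the invalid cells ($i \ge j$) are stored as NaN:

```python
import numpy as np

def skylines(heatmap: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Collapse an (i, j) score-delta heatmap into two profiles: the
    average delta per start position i (rows) and per end position j
    (columns), ignoring the NaN cells that have no valid configuration."""
    per_start = np.nanmean(heatmap, axis=1)  # averaged over end positions j
    per_end = np.nanmean(heatmap, axis=0)    # averaged over start positions i
    return per_start, per_end
```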
Until we look at the EQ scores:
Now things look very different! Duplicating any of the final 10 layers has almost no effect on the scores, but we see complex patterns, where some regions show significant improvement (the area around $i=45$, $j=55$), walled in by regions of poor performance.
But the heatmaps revealed something even more interesting than the location of the thinking bits. They revealed something about its structure.
Before settling on block duplication, I tried something simpler: take a single middle layer and repeat it $n$ times. If the "more reasoning depth" hypothesis were correct, this should work. It made sense, too, looking at the broad boost in math guesstimate results from duplicating intermediate layers. Give the model extra copies of a particular reasoning layer, get better reasoning. So I screened them all, looking for a boost.
But nope, it almost always did worse. Usually a lot worse, but with occasional small improvements that were within the noise range. Annoying, but taking another look at the complex, blobby patterns in EQ scores gave me another idea:
If single-layer duplication doesn’t help, the middle layers aren’t doing independent iterative refinement. They’re not interchangeable copies of the same operation that you can simply “run again.” If they were, duplicating any one of them should give at least a marginal benefit. Instead, those layers are working as a circuit. A multi-step reasoning pipeline that needs to execute as a complete unit.
Think of it this way. Layers 46 through 52 aren’t seven workers doing the same job. They’re seven steps in a recipe. Layer 46 takes the abstract representation and performs step one of some cognitive operation — maybe decomposing a complex representation into subcomponents. Layer 47 takes that output and performs step two — maybe identifying relationships between the subcomponents. Layer 48 does step three, and so on through layer 52, which produces the final result.
Duplicating just one step of this ‘recipe’ doesn’t bring you much.
But duplicating the entire block gives you the full recipe twice. The model runs the complete reasoning circuit, produces a refined intermediate representation, and then runs the same circuit again on its own output. It’s a second pass. A chance to catch what it missed the first time, to refine its abstractions, to push the reasoning one step deeper.
Let's deep-dive into a more current model (one that I can experiment with on my system): an ExLlamaV3 quant of GLM-4.7 from mratsim.
I’ve marked out a region that boosts maths ability strongly. Notice where it sits? It’s away from the diagonal centre line, which means we’re not looking at single-layer duplications. Starting the repeated block at position 35, we don’t see any improvement until at least position 43. That’s seven layers of not much happening. In fact, we actually see decreased performance by repeating these layers (they are blue, bad!).
From end-position 43 to 46, we then see solid boosts in math scores (red = good, yay). But include layer 46 or beyond, and the benefits collapse again. The hypothesis: position 47 is where a different circuit begins. Including even one step of the next recipe messes up the current recipe.
So the ‘math organ’ has boundaries on both sides. Too few layers and you get nothing — you’ve cut into the circuit and it can’t complete its operation. Too many layers and you also get nothing — you’ve included tissue from a neighbouring circuit that doesn’t belong. Pre-training carved these structures out of the layer stack, and they only work whole. It also doesn’t translate to other tasks, as the heatmap for EQ scores doesn’t have this patch.
...
Read the original on dnhkng.github.io »
Debian is the latest in an ever-growing list of projects to wrestle (again) with the question of LLM-generated contributions; the latest debate started in mid-February, after Lucas Nussbaum opened a
discussion with a draft general resolution (GR) on whether Debian should accept AI-assisted contributions. It seems to have, mostly, subsided without a GR being put forward or any decisions being made, but the conversation was illuminating nonetheless.
Nussbaum said that Debian probably needed to have a discussion “to understand where we stand regarding AI-assisted contributions to
Debian” based on some recent discussions, though it was not clear what discussions he was referring to. Whatever the spark was, Nussbaum put forward the draft GR to clarify Debian’s stance on allowing AI-assisted contributions. He said that he would wait a couple of days to collect feedback before formally submitting the GR.
His proposal would allow AI-assisted contributions if a number of conditions were met. For example, it would require explicit disclosure if “a significant portion of the contribution is taken from a tool without manual modification”, and labeling of such contributions accordingly. It also spells out that contributors would be accountable for their submissions, “including vouching for the technical merit, security, license compliance, and utility of their submissions”. The GR would also prohibit using generative-AI tools with non-public or sensitive project information, including private mailing lists or embargoed security reports.
It is fair to say that it is difficult to have an effective conversation about a technology when pinning down accurate terminology is like trying to nail Jell-O to a tree. AI is the catch-all term, but much (not all) of the technology in question is actually tooling around large language models (LLMs). When participants have differing ideas of what is being discussed, deciding whether the thing should be allowed may pose something of a problem.
Russ Allbery asked for people to be more precise in their descriptions of the technologies that their proposals might affect. He asserted that it has become common for AI, as a term, “to be so
amorphously and sloppily defined that it could encompass every physical object in the
universe”. If the project is going to make policy, he said, it needed to be very specific about what it was making policy about:
Gunnar Wolf agreed with Allbery, but Nussbaum claimed that the specific technology did not matter. The proposal boiled down to the use of automated tools for code analysis and generation:
I see the problem we face as similar to the historical questions surrounding the use of BitKeeper by Linux (except that the choice of BitKeeper imposed its use by other contributors). It is also similar to the discussions about proprietary security analysis tools: since those tools are proprietary, should we ignore the vulnerability reports they issue?
If we were to adopt a hard-line “anti-tools” stance, I would find it very hard to draw a clear line.
Drawing clear lines, however, is something that a number of Debian developers felt was important. Sean Whitton proposed that the GR should not only say “LLM” rather than “AI”, but it should also distinguish between the uses of LLMs, such as code review, generating prototypes, or generating production code. He envisioned ballot options that could allow some, but not all, of those uses. Distinguishing between the various so-called AI technologies would help in that regard. He urged
Nussbaum “not to argue too hard for something that is more general than LLMs
because that might alienate the people you want to agree to disagree with.” Andrea Pappacoda said that the specific technology mattered a lot; he wanted the proposal to have clear boundaries and avoid broad terms like AI. He was uncomfortable with the idea of banning LLMs, and not sure where to draw the line. “What I can confidently say,
though, is that a project like Claude’s C
Compiler should not have a place in Debian.”
The conversation did not focus solely on the terminology, of course. Simon Richter
had
questions about the implications of allowing AI-driven contributions from the standpoint of onboarding new contributors to Debian. An AI agent, he said, could take the place of a junior developer. Both could perform basic tasks under guidance, but the AI agent would not learn anything from the exchange; the project resources spent in guiding such a tool do not result in long-lasting knowledge transfer.
AI use presents us (and the commercial software world as well) with a
similar problem: there is a massive skill gap between “gets some
results” and “consistently and sustainably delivers results”, bridging
that gap essentially requires starting from scratch, but is required to
achieve independence from the operators of the AI service, and this gap
is disrupting the pipeline of new entrants.
He called that the onboarding problem, and said that an AI policy needed to solve that problem; he did not want to discourage people by rejecting contributions or expend resources on mentoring people who did not want to be mentored. Accepting AI-assisted drive-by contributions is harmful because it is a missed opportunity to onboard a new contributor. “The best-case outcome is that a
trivial problem got solved without actually onboarding a new contributor, and the
worst-case outcome is that the new contributor is just proxying between an AI and the
maintainer”. He also expressed concerns around the costs associated with such tools, and speculated it might discourage contribution from users who could not afford to use for-pay tools.
Nussbaum agreed that the cost could be a problem in the future. For now, he said, it is not an issue because there are vendors providing access for free, but that could change. He disagreed that Debian was likely to run out of tasks suitable for new contributors, even if it does accept AI-driven contributions, and suggested that it may make harder tasks more accessible. He pointed to a study
written by an Anthropic employee and a person participating in the company’s fellows program, about how the use of AI impacts skill formation: “A takeaway is that
there are very different ways to interact with AI, that produce very different
results both in terms of speed and of understanding”. He did not seem to be persuaded that use of AI tools would be a net negative in onboarding new contributors.
Ted Ts’o argued
against the idea that AI would have a negative impact:
Matthew Vernon said that the proposed GR minimized the ethical dimension of using generative AI. The organizations that are developing and marketing tools like ChatGPT and Claude are behaving unethically, he said, by systematically damaging the wider commons in the form of automated scraping and doing as they like with others’ intellectual property. “They hoover up content as hard as they possibly can, with scant if any
regard to its copyright or licensing”. He also cited environmental concerns and other harms that are attributed to generative AI tools, “from non-consensual
nudification to the flooding of free software projects with bogus security
reports”. He felt that Debian should take a clear stand against those tools and encourage other projects to do the same:
There was also debate around the question of copyright, both in terms of the licenses of material used to train models, as well as the output of LLM tools. Jonathan Dowland thought that it might be better to forbid some contributions now, since some see risks in accepting such contributions, and then relax the project’s position later on when the legal situation is clearer.
Thorsten Glaser took a particularly harsh stance against LLM-driven contributions, going so far as to suggest that some upstream projects should be forced out of Debian's main archive into non-free unless they met certain conditions. Ansgar Burchardt pointed out that this would have the effect of banning the Linux kernel, Python, LLVM, and others. Glaser's proposal did not seem particularly popular. He had taken a similar stance on AI models in 2025, when the project discussed a GR about AI models and the Debian Free Software Guidelines (DFSG); he argued then that most models should be outside the main archive. That GR never came to a vote, in part because it was unclear whether its language would forbid anti-spam technologies, since one could not include the corpus of spam used as training data along with the filters.
Allbery did not want to touch on copyright issues, but had a few words to say about the quality of AI-assisted code. It is common for people to object to code generated by LLMs on quality grounds, but he said that argument does not make sense. Humans are capable of producing better code than LLMs, but they are also capable of producing worse code too.
Bdale Garbee seconded
that notion, and said that he was reluctant to take a hard stance one way or the other. “I see it as just another evolutionary stage we don’t really understand the
longer term positive and negative impacts of yet.” He wanted to focus on long-term implications and questions such as “what is the preferred form of
modification for code written by issuing chat prompts?” Nussbaum answered that it would be “the input to the tool, not the generated source code”.
That may not be an entirely satisfying answer, however, given that LLM output is not deterministic and the various providers of LLM tools retire models with some frequency. A user may have the prompt and other materials fed to an LLM to generate a result at a specific point in time, but it might generate a much different result later on, even if one has access to the same vendor’s tools or models to run locally.
It is clear from the discussion that Debian developers are not of one mind on the question of accepting AI-generated contributions; the developers have not yet even converged on a shared definition of what constitutes an AI-generated contribution.
What many do seem to agree on is that Debian is not quite ready to vote on a GR about AI-generated contributions. On March 3, Nussbaum said
that he had proposed the GR “in response to various attacks against people using
AI in the context of Debian”; he felt then it was something that needed to be dealt with urgently. However, the GR discussion had been civil and interesting. As long as the discussions around AI remained calm and productive, the project could just continue exploring the topic in mailing-list discussions. He guessed that, if there were a GR, “the winning option would probably be very nuanced, allowing
AI but with a set of safeguards”.
The questions of what to do about AI models in the archive, how to handle upstream code generated with LLMs, and LLM-generated contributions written specifically for Debian remain unanswered. For now, it seems, they will continue to be handled on a case-by-case basis by applying Debian’s existing policies. Given the complexity of the questions, diverse opinions, and rapid rate of change of technologies lumped in under the “AI” umbrella, that may be the best possible, and least disruptive, outcome for now.
...
Read the original on lwn.net »
I Have No Idea If What They Ship Is Any Good

I've been building agents that write code while I sleep. Tools like Gastown run for hours without me watching. Changes land in branches I haven't read. A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do. I care about this. I don't want to push slop, and I had no real answer.

I've run Claude Code workshops for over 100 engineers in the last six months. Same problem everywhere, just at different scales. Teams using Claude for everyday PRs are merging 40-50 a week instead of 10. Teams are spending a lot more time in code reviews. As systems get more autonomous, the problem compounds. At some point you're not reviewing diffs at all, just watching deploys and hoping something doesn't break.

So the question I kept coming back to: what do you actually trust when you can't review everything?

You could hire more reviewers. But you can't hire fast enough. And making senior engineers read AI-generated code all day isn't worth it.

When Claude writes tests for code Claude just wrote, it's checking its own work. The tests prove the code does what Claude thought you wanted. Not what you actually wanted. They catch regressions but not the original misunderstanding. When you use the same AI for both, you've built a self-congratulation machine.

This is exactly the problem code review was supposed to solve: a second set of eyes that wasn't the original author. But one AI writing and another AI checking isn't a fresh set of eyes. They come from the same place. They'll miss the same things.

The thing TDD got right

Write the test first, write the code second, stop when the test passes. Most teams don't do this because thinking through what the code should do before writing it takes time they don't have.

AI removes that excuse, because Claude handles the speed. The slow part is now figuring out if the code is right.
That's what TDD was built for: write down what correct looks like, then check it. TDD asks you to write unit tests, which means thinking about how the code will work before you write it. This approach is easier: write down what the feature should do in plain English. The machine figures out how to check it.

“Users can authenticate with email and password. On wrong credentials they see ‘Invalid email or password.’ On success they land on /dashboard. The session token expires after 24 hours.” You can write that before you open a code editor. The agent builds it. Something else checks it.

What this looks like in practice

For frontend changes, we generated acceptance criteria based on the spec file:

# Task
Add email/password login.
## Acceptance Criteria
### AC-1: Successful login
- User at /login with valid credentials gets redirected to /dashboard
- Session cookie is set
### AC-2: Wrong password error
- User sees exactly “Invalid email or password”
- User stays on /login
### AC-3: Empty field validation
- Submit disabled when either field is empty, or inline error on empty submit
### AC-4: Rate limiting
- After 5 failed attempts, login blocked for 60 seconds
- User sees a message with the wait time

Each criterion is specific enough that it either passes or fails. Once the agent builds the feature, verification runs Playwright browser agents against each AC, takes screenshots, and produces a report with per-criterion verdicts. If something fails you see exactly which criterion and what the browser saw.

For backend changes the same pattern works without a browser. You specify observable API behavior (status codes, response headers, error messages) that curl commands can check.

One thing worth being honest about: this doesn't catch spec misunderstandings. If your spec was wrong to begin with, the checks will pass even when the feature is wrong. What Playwright does catch is integration failures, rendering bugs, and behavior that works in theory but breaks in a real browser. That's a narrower claim than “verified correct,” but it's more than a code review was reliably catching anyway.

The workflow: write acceptance criteria before you prompt, let the agent build against them, run verification, review only the failures. You review failures instead of diffs.

How to build it

I started building a Claude Skill (github.com/opslane/verify) that runs using claude -p (Claude Code's headless mode) plus Playwright MCP. No custom backend, no extra API keys beyond your existing Claude OAuth token. Four stages:

Pre-flight is pure bash, no LLM. Is the dev server running? Is the auth session valid? Does a spec file exist? Fail fast before spending any tokens.

The planner is one Opus call. It reads your spec and the files you changed. It figures out what each check needs and how to run it. It also reads your code to find the right selectors, so it's not guessing at class names.

Browser agents are one Sonnet call per AC, all running in parallel. Five ACs, five agents, each navigating and screenshotting independently.
Sonnet costs 3-4x less than Opus here and works just as well for clicking around.

The judge is one final Opus call that reads all the evidence and returns a verdict per criterion: pass, fail, or needs-human-review.

```
claude -p --model claude-opus-4-6 \
  "Review this evidence and return a verdict for each AC.
  Evidence: $(cat .verify/evidence/*/result.json)
  Return JSON: {verdicts: [{id, passed, reasoning}]}"
```

Or clone the repo and adapt it. Each stage is a single `claude -p` call with a clear input and structured output. You can swap models, add stages, or wire it into CI with `--dangerously-skip-permissions`.

The thing I keep coming back to: you can't trust what an agent produces unless you told it what "done" looks like before it started. Writing acceptance criteria is harder than writing a prompt, because it forces you to think through edge cases before you've seen them. Engineers resist it for the same reason they resisted TDD, because it feels slower at the start.

Without them, all you can do is read the output and hope it's right.
...
Read the original on www.claudecodecamp.com »
One year ago, on 28 February 2025, Wikipedia user Moyogo updated the page for Angzarr
with a citation to the type foundry H. Berthold AG’s 1950 symbol catalogue listing ⍼ as Azimut, Richtungswinkel, or “azimuth”, “direction angle”. Mystery solved!
Fonts in Use lists links to archived catalogues by Berthold. The above scan is from the 1950 Zeichenprobe
(symbol catalogue) on page 7. Copies of the Schriftprobe (font catalogue) from 1949, 1951, and 1952
all show on page 104 the same glyph and sizes, albeit without the descriptor name.
⍼ does not appear in the 1946 Registerprobe, nor in earlier 1909
and 1900 catalogues. For convenience, I’ve extracted full-page scans below for where it appears — and where I feel it would appear, but doesn’t.
A friend on Mastodon pointed out that the glyph ⍼ itself resembles the way a light ray passes through a sextant
to measure an azimuth, with the right angle being a standard symbol for an angle in general. Wikipedia has a lovely illustration demonstrating how a sextant works to measure latitude from the sun; it can, of course, be turned sideways to measure an azimuth with respect to an arbitrary meridian.
...
Read the original on ionathan.ch »