10 interesting stories served every morning and every evening.
Computational Complexity and other fun stuff in math and computer science from Lance Fortnow and Bill Gasarch
...
Read the original on blog.computationalcomplexity.org »
New U.S. laws designed to protect minors are pulling millions of adult Americans into mandatory age-verification gates to access online content, leading to backlash from users and criticism from privacy advocates that a free and open internet is at stake. Roughly half of U.S. states have enacted or are advancing laws requiring platforms — including adult content sites, online gaming services, and social media apps — to block underage users, forcing companies to screen everyone who approaches these digital gates.
“There’s a big spectrum,” said Joe Kaufmann, global head of privacy at Jumio, one of the largest digital identity-verification and authentication platforms. He explained that the patchwork of state laws varies in technical demands and compliance expectations. “The regulations are moving in many different directions at once,” he said.
Social media company Discord announced plans in February to roll out mandatory age verification globally, which the company said would rely on verification methods designed so facial analysis occurs on a user’s device and submitted data would be deleted immediately. The proposal quickly drew backlash from users concerned about having to submit selfies or government IDs to access certain features, which led Discord to delay the launch until the second half of this year.
“Let me be upfront: we knew this rollout was going to be controversial. Any time you introduce something that touches identity and verification, people are going to have strong feelings,” Discord chief technology officer and co-founder Stanislav Vishnevskiy wrote in a Feb. 24 blog post.
Websites offering adult content, gambling, or financial services often rely on full identity verification that requires scanning a government ID and matching it to a live image. But most of the verification systems powering these checkpoints — often run by specialized identity-verification vendors on behalf of websites — rely on artificial intelligence such as facial recognition and age-estimation models that analyze selfies or video to determine in seconds whether someone is old enough to access content. Social media and lower-risk services may use lighter estimation tools designed to confirm age without permanently storing detailed identity records.
Vendors say a challenge is balancing safety with how much friction users will tolerate. “We’re in the business of ensuring that you are absolutely keeping minors safe and out and able to let adults in with as little friction as possible,” said Rivka Gewirtz Little, chief growth officer at identity-verification platform Socure. Excessive data collection, she added, creates friction that users resist.
Still, many users perceive mandatory identity checks as invasive. “Having another way to be forced to provide that information is intrusive to people,” said Heidi Howard Tandy, a partner at Berger Singerman who specializes in intellectual property and internet law. Some users may attempt workarounds — including prepaid cards or alternative credentials — or turn to unauthorized distribution channels. “It’s going to cause a piracy situation,” she added.
In many implementations, verification vendors — not the websites themselves — process and retain the identity information, returning only a pass-fail signal to the platform.
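The boundary described above can be sketched in a few lines. Everything here (the class names, the `check_age` signature, the stand-in age estimator) is hypothetical, not any real vendor's API; the point is simply that only a pass-fail signal crosses back to the platform.

```python
# Illustrative sketch of the vendor pass/fail pattern (hypothetical names,
# not a real identity-verification API).
from dataclasses import dataclass

@dataclass
class VerificationResult:
    passed: bool  # the only field the platform ever sees
    # note: no name, birthday, or document image crosses this boundary

class VerificationVendor:
    def check_age(self, selfie_bytes: bytes, min_age: int) -> VerificationResult:
        # Vendor-side: run age estimation, retain records under its own
        # retention policy, and return only a boolean to the platform.
        estimated_age = self._estimate_age(selfie_bytes)
        return VerificationResult(passed=estimated_age >= min_age)

    def _estimate_age(self, selfie_bytes: bytes) -> int:
        # Stand-in for a real age-estimation model.
        return 25
```

The design choice the article describes is exactly this narrow interface: the identity data stays with the vendor, and the platform stores nothing but the outcome.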
Gewirtz Little said Socure does not sell verification data and that in lightweight age-estimation scenarios, where platforms use quick facial analysis or other signals rather than government documentation, the company may store little or no information. But in fuller identity-verification contexts, such as gaming and fraud prevention that require ID scans, certain adult verification records may be retained to document compliance. She said Socure can keep some adult verification data for up to three years while following applicable privacy and purging rules.
Civil liberties advocates warn that concentrating large volumes of identity data among a small number of verification vendors can create attractive targets for hackers and government demands. Earlier this year, Discord disclosed a data breach that exposed ID images belonging to approximately 70,000 users through a compromised third-party service, highlighting the security risks associated with storing sensitive identity information.
In addition, they warn that expanding age-verification systems represent not only a usability challenge but a structural shift in how identity becomes tied to online behavior. Age verification risks tying users’ “most sensitive and immutable data” — names, faces, birthdays, home addresses — to their online activity, according to Molly Buckley, a legislative analyst at the Electronic Frontier Foundation. “Age verification strikes at the foundation of the free and open internet,” she said.
Even when vendors promise to safeguard personal information, users ultimately rely on contractual terms they rarely read or fully understand. “There’s language in their terms-of-use policies that says if the information is requested by law enforcement, they’ll hand it over. They can’t confirm that they will always forever be the only entity who has all of this information. Everyone needs to understand that their baseline information is not something under their control,” Tandy said.
As more platforms route age checks through third-party vendors, that concentration of identity data is also creating new legal exposure for the companies that rely on them. “A company is going to have some of that information passing through their own servers,” Tandy said. “And you can’t offload that kind of liability to a third party.”
Companies can distribute risk through contracts and insurance, she said, but they remain responsible for how identity systems interact with their infrastructure. “What you can do is have really good insurance and require really good insurance from the entities that you’re contracting with,” she said.
Tandy also cautioned that retention promises can be more complex than they appear. “If they say they’re holding it for three years, that’s the minimum amount of time they’re holding it for,” she said. “I wouldn’t feel comfortable trusting a company that says, ‘We delete everything one day after three years.’ That is not going to happen,” she added.
Federal and state regulators argue that age-verification laws are primarily a response to documented harms to minors and insist the rules must operate under strict privacy and security safeguards.
An FTC spokesperson told CNBC that companies must limit how collected information is used. While age-verification technologies can help parents protect children online, the agency said firms are still bound by existing consumer protection rules governing data minimization, retention, and security. The agency pointed to existing rules requiring firms to retain personal information only as long as reasonably necessary and to safeguard its confidentiality and integrity.
...
Read the original on www.cnbc.com »
Amazon’s ecommerce business has summoned a large group of engineers to a meeting on Tuesday for a “deep dive” into a spate of outages, including incidents tied to the use of AI coding tools.
The online retail giant said there had been a “trend of incidents” in recent months, characterized by a “high blast radius” and “Gen-AI assisted changes” among other factors, according to a briefing note for the meeting seen by the FT.
Under “contributing factors” the note included “novel GenAI usage for which best practices and safeguards are not yet fully established.”
“Folks, as you likely know, the availability of the site and related infrastructure has not been good recently,” Dave Treadwell, a senior vice-president at the group, told employees in an email, also seen by the FT.
The note ahead of Tuesday’s meeting did not specify which particular incidents the group planned to discuss.
Amazon’s website and shopping app went down for nearly six hours this month in an incident the company said involved an erroneous “software code deployment.” The outage left customers unable to complete transactions or access functions such as checking account details and product prices.
Treadwell, a former Microsoft engineering executive, told employees that Amazon would focus its weekly “This Week in Stores Tech” (TWiST) meeting on a “deep dive into some of the issues that got us here as well as some short immediate term initiatives” the group hopes will limit future outages.
...
Read the original on www.ft.com »
My LinkedIn and Twitter feeds are full of screenshots from the recent Forbes article on Cursor claiming that Anthropic’s $200/month Claude Code Max plan can consume $5,000 in compute. The relevant quote:
Today, that subsidization appears to be even more aggressive, with that $200 plan able to consume about $5,000 in compute, according to a different person who has seen analyses on the company’s compute spend patterns.
This is being shared as proof that Anthropic is haemorrhaging money on inference. It doesn’t survive basic scrutiny.
I’m fairly confident the Forbes sources are confusing retail API prices with actual compute costs. These are very different things.
Anthropic’s current API pricing for Opus 4.6 is $5 per million input tokens and $25 per million output tokens. At those prices, yes - a heavy Claude Code Max 20 user could rack up $5,000/month in API-equivalent usage. That maths checks out.
But API pricing is not what it costs Anthropic to serve those tokens.
The best way to estimate what inference actually costs is to look at what open-weight models of similar size are priced at on OpenRouter - where multiple providers compete on price.
Qwen 3.5 397B-A17B is a good comparison point. It’s a large MoE model, broadly comparable in architecture size to what Opus 4.6 is likely to be. So is Kimi K2.5, with 1T parameters and 32B active, which is probably approaching the upper limit of what you can efficiently serve.
Here’s what the pricing looks like:
The Qwen 3.5 397B model on OpenRouter (via Alibaba Cloud) costs $0.39 per million input tokens and $2.34 per million output tokens. Compare that to Opus 4.6’s API pricing of $5/$25. Kimi K2.5 is even cheaper on output at $0.45 per million input tokens and $2.25 per million output tokens.
And this ratio holds for cached tokens too - DeepInfra charges $0.07/MTok for cache reads on Kimi K2.5 vs Anthropic’s $0.50/MTok.
These OpenRouter providers are running a business. They have to cover their compute costs, pay for GPUs, and make a margin. They’re not charities. If so many of them can serve a model of comparable size at ~10% of Anthropic’s API price and remain in business, it is hard for me to believe that they are all taking enormous losses at almost exactly the same price point.
If a heavy Claude Code Max user consumes $5,000 worth of tokens at Anthropic’s retail API prices, and the actual compute cost is roughly 10% of that, Anthropic is looking at approximately $500 in real compute cost for the heaviest users.
That’s a loss of $300/month on the most extreme power users - not $4,800.
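The arithmetic is easy to sanity-check. The per-Mtok prices below are the ones quoted in this post; the monthly token mix is an assumed illustration, chosen so the retail total lands on $5,000.

```python
# Back-of-envelope check of the post's numbers. Only the per-Mtok prices
# come from the post; the token mix (m_in, m_out) is an assumption.
opus_in, opus_out = 5.00, 25.00   # $/Mtok, Anthropic Opus 4.6 retail API
open_in, open_out = 0.39, 2.34    # $/Mtok, Qwen 3.5 via OpenRouter

m_in, m_out = 400, 120            # hypothetical heavy user, Mtok/month

retail  = m_in * opus_in + m_out * opus_out   # API-equivalent "usage"
compute = m_in * open_in + m_out * open_out   # open-weight price as cost proxy

print(f"retail ${retail:,.0f}, compute proxy ${compute:,.0f}")
# retail $5,000, compute proxy $437
```

Against a $200 subscription, the compute proxy implies a loss in the low hundreds of dollars per month for the heaviest users, not thousands.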
However, most users don’t come anywhere near the limit. Anthropic themselves said when they introduced weekly caps that fewer than 5% of subscribers would be affected. I personally use the Max 20x plan and probably consume around 50% of my weekly token budget, and it’s hard to use that many tokens without getting serious RSI. At that level of usage, the maths works out to roughly break-even or profitable for Anthropic.
The real story is actually in the article. The $5,000 figure comes from Cursor’s internal analysis. And for Cursor, the number probably is roughly correct - because Cursor has to pay Anthropic’s retail API prices (or close to it) for access to Opus 4.6.
So to provide a Claude Code-equivalent experience using Opus 4.6, it would cost Cursor ~$5,000 per power user per month. But it would cost Anthropic perhaps $500 max.
And the real issue for Cursor is that developers want to use the Anthropic models, even in Cursor itself. They have real “brand awareness”, and they are genuinely better than the cheaper open weights models - for now at least. It’s a real conundrum for them.
Obviously Anthropic isn’t printing free cashflow. The costs of training frontier models, the enormous salaries required to hire top AI researchers, the multi-billion dollar compute commitments - these are genuinely massive expenses that dwarf inference costs.
But on a per-user, per-token basis for inference? I believe Anthropic is very likely profitable - potentially very profitable - on the average Claude Code subscriber.
The “AI inference is a money pit” narrative is misinformation that actually plays into the hands of the frontier labs. If everyone believes that serving tokens is wildly expensive, nobody questions the 10x+ markups on API pricing. It discourages competition and makes the moat look deeper than it is.
If you want to understand the real economics of AI inference, don’t take API prices at face value. Look at what competitive open-weight model providers charge on OpenRouter. That’s a much closer proxy for what it actually costs to run these models - and it’s a fraction of what the frontier labs charge.
...
Read the original on martinalderson.com »
Background: Why I put my whole life into a single database
Back in 2019, I started collecting all kinds of metrics about my life. Every single day for the last 3 years I tracked over 100 different data types - ranging from fitness & nutrition to social life, computer usage and weather.
The goal of this project was to answer questions about my life, like
How does living in different cities affect other factors like fitness, productivity and happiness?
How does sleep affect my day, my fitness level, and happiness?
How does the weather, and the different seasons affect my life?
Are there any trends over the last few years?
How does computer time, work and hours in meetings affect my personal life?
Since the start of this project, I collected ~380,000 data points, with the biggest data sources being:
Naturally after I started collecting this data, I wanted to visualize what I was learning, so I created this page. Initially, the domain whereisFelix.today (now renamed to howisFelix.today) started as a joke to respond to friends asking when I’d be back in NYC or San Francisco. Rather than send them my schedule, I’d point them to this domain. However, now it’s more than my location: it’s all of me.
Use a single database, owned and hosted by me, with all the data I’ve collected over the years
Be able to easily add and remove questions on the fly, as I learn what’s beneficial to track
Full control of how the data is visualized
Works well for frequent flyers with mixed time zones
I selected 48 graphs to show publicly on this page. For privacy reasons, and to prevent any accidental data leaks, the graphs below are snapshots taken on a given day.
Visualization of the number of data entries in FxLifeSheet over the last 10 years, and where the data came from.
Initially (2014) the only data used was RescueTime and Foursquare Swarm location data
Once I started the FxLifeSheet project in April 2019, I manually tracked additional metrics, ranging from mood, sleep, and social life to fitness data
I was able to retrospectively fetch the historic weather data based on my location on a given day
I also implemented other import sources, like fetching my historic weight and the number of steps from Apple Health
Days where I tracked my Mood as Happy & Excited
On days where I tracked my mood to be “happy” & “excited”, the following other factors of my life were affected
50% more likely to have pushed my comfort zone
44% more likely to have meditated that day
33% more excited about what’s ahead in the future
31% more likely to drink alcohol that day (parties, good friends and such)
28% more time spent reading or listening to audio books
26% more likely to have worked on interesting technical challenges
20% more likely to have learned something new that day
45% less time spent in video & audio calls that day
All flights taken within the last 7 years, tracked using Foursquare Swarm, analyzed by JetLovers.
The stats clearly show the impact of COVID starting 2020
Sunday has been my “commute” day, flying between San Francisco, New York City and Vienna
All flights taken within the last 7 years, tracked using Foursquare Swarm, analyzed by JetLovers.
Frankfurt - Vienna was the flight connecting me with most US airports
Germany is high up on the list due to layovers, even though I didn’t actually spend much time there
Inspired by Your Life in Weeks by WaitButWhy, I use Google Sheets to visualize every week of my life, with little notes on what city/country I was in, and other life events that have happened.
The first 14 years I didn’t really get much done
I can highly recommend taking a few weeks (or even months) off between jobs (if you have the possibility)
Shades of blue indicate my full-time employments
You can create your own version using my template
Average daily steps measured through the iPhone’s Apple Health app. I decided against using SmartWatch data for steps, as SmartWatches have changed over the last 8 years.
I walked a total of steps over the last 8 years
I walk more than twice as much when I’m in New York, compared to any other city
In NYC I had the general rule of thumb to walk instead of taking public transit whenever it’s less than 40 minutes. I used that time to call friends & family, or listen to audio books
Although Vienna is very walkable, the excellent public transit system with subway trains coming every 3-5 minutes, has caused me to walk less
San Francisco was always scary to walk
This graph clearly shows the correlation between my body weight and my sleeping/resting heart rate. The resting heart rate is measured by the Withings ScanWatch while sleeping, and indicates how hard your heart has to work while not being active. Generally the lower the resting heart rate, the better.
I started my lean bulk (controlled weight gain combined with 5 workouts a week) in August 2020
My resting heart rate went from 58bpm to 67bpm from August 2020 to March 2021, with a weight gain of +19lbs as part of a controlled lean-bulk combined with a 5-day/week workout routine
The spike in resting heart rate in July & August 2021 was due to bars and nightclubs opening up again in Austria
After a night of drinking, my resting/sleeping heart rate was about 50% higher than after a night without any alcohol
The spike in resting heart rate in Oct/Nov/Dec 2021 was due to having bronchitis and a cold/flu, not getting correct treatment early enough
How healthy have I been over the Years?
Every day I answered the question on how healthy I felt. In the graph, the yellow color indicates that I felt a little under the weather, not sick per se. Red means I was sick and had to stay home. Green means I felt energized and healthy.
During the COVID lockdowns I tended to stay healthier. This may be due to not going out, no heavy drinking, less close contact with others, etc. which resulted in me having better sleep.
Usually during excessive traveling I get sick (cold/flu)
Q4 2021 I had bronchitis, however, I didn’t know about it at the time and didn’t get proper treatment
Overall I’m quite prone to getting sick (cold/flu)
Days with more than 4 Alcoholic Drinks
On days where I had more than 4 alcoholic beverages (meaning I was partying), the following other factors were affected
21x more likely to dance
80% more likely to take a nap the day of, or the day after
40% warmer temperatures, and 40% less precipitation. There weren’t many opportunities for parties in Winter due to lockdowns in the last 2 years. Also, people are more motivated to go out when it’s nice outside.
My FxLifeSheet bot asks me 4 times a day how I’m feeling at the moment.
This graph groups the entries by month, and shows the % of entries for each value (0 - 5) with 5 being very excited, and 0 being worried.
I designed the ranges so that 0 or 5 are not entered as much. 0 is rendered as dark green at the top, whereas 5 is rendered as light green at the bottom.
For privacy reasons I won’t get into some of the details on why certain months were worse than others.
Every Swarm check-in over the last 7 years visualized on a map, including the actual trip (flight, drive, etc.)
Every Swarm check-in over the last 7 years visualized, zoomed in
Each time I did a check-in at a place (e.g. Coffee, Restaurant, Airport, Gym) on Foursquare Swarm at a given city, this is tracked as a single entry.
Each check-in at a given city is counted as a single entry, grouped by years
2018 and 2019 I lived in New York City
The longer it’s been since I moved away from Austria, the more time I actually spent back home in Austria for visits and vacations
2020 clearly shows the impact of COVID
Each check-in at a given category is tracked, and summed up over the last years
In 2020 and 2021, check-ins at Offices went down to zero due to COVID, and a distributed work setup
Airports being the #4 most visited category was a surprise, but is accurate. A total of 403 airport check-ins, whereas a flight with a layover would count as 3 airport check-ins
Earlier in my life, I didn’t always check into ‘commute’ places like public transit and supermarkets
Number of Foursquare Swarm check-ins on each quarter over the last 10 years. I didn’t use Foursquare Swarm as seriously before 2015. Once I moved to San Francisco in Q3 2015 I started my habit of checking into every point of interest (POI) I visit.
Q3 2015 I moved to San Francisco, however I couldn’t use Swarm yet, since my move was a secret until the official announcement at the Twitter Flight conference
Q2 2020 clearly shows the impact of COVID with Q3 already being open in Austria
Q3 2021 the vaccine was already widely available and I was able to travel/visit more again
My time in New York was the most active when it comes to check-ins. When I’m in NYC, I tend to eat/drink out more, and grab to-go food, which I do way less in Vienna
Every Swarm check-in visualized on a map. Only areas where I’ve had multiple check-ins are rendered.
Number of days per year that I’ve spent in full lockdown, meaning restaurants, bars and non-essential stores were closed.
I escaped parts of the Austrian lockdown by spending time in the US when I was already vaccinated
Surprisingly 2021 I spent more days in a full lockdown than in 2020, even with vaccines available
How was my life affected by the recent COVID lockdowns? I classify as a lockdown day every day on which places like restaurants, gyms and non-essential stores were closed.
200% more time spent in audio & video calls with friends (non-work related)
60% more likely to follow my meal plan (macros & calories)
50% colder temperatures: Lockdowns tended to happen in Autumn and Winter
100% less likely to dance
Alcoholic drinks per day. Days with no data are rendered as white
Friday and Saturday nights are clearly visible on those graphs
2021 and summer/winter of 2019 also show the Wednesday night party in Vienna
Q2 and Q4 2020 clearly show the COVID lockdowns, as well as Q2 2021
Summer of 2021 all bars and dance clubs were open in Vienna
...
Read the original on howisfelix.today »
Yann LeCun’s AI start-up raises more than $1bn in Europe’s largest seed round
...
Read the original on www.ft.com »
After you’ve reviewed these contribution guidelines, you’ll be all set to
contribute to this project.
...
Read the original on gitlab.redox-os.org »
Advanced Machine Intelligence (AMI), a new Paris-based startup cofounded by Meta’s former chief AI scientist Yann LeCun, announced Monday it has raised more than $1 billion to develop AI world models.
LeCun argues that most human reasoning is grounded in the physical world, not language, and that AI world models are necessary to develop true human-level intelligence. “The idea that you’re going to extend the capabilities of LLMs [large language models] to the point that they’re going to have human-level intelligence is complete nonsense,” he said in an interview with WIRED.
The financing, which values the startup at $3.5 billion, was co-led by investors such as Cathay Innovation, Greycroft, Hiro Capital, HV Capital, and Bezos Expeditions. Other notable backers include Mark Cuban, former Google CEO Eric Schmidt, and French billionaire and telecommunications executive Xavier Niel.
AMI (pronounced like the French word for friend) aims to build “a new breed of AI systems that understand the world, have persistent memory, can reason and plan, and are controllable and safe,” the company says in a press release. The startup says it will be global from day one, with offices in Paris, Montreal, Singapore, and New York, where LeCun will continue working as a New York University professor in addition to leading the startup. AMI will be the first commercial endeavor for LeCun since his departure from Meta in November 2025.
LeCun’s startup represents a bet against many of the world’s biggest AI labs like OpenAI, Anthropic, and even his former workplace, Meta, which believe that scaling up LLMs will eventually deliver AI systems with human-level intelligence or even superintelligence. LLMs have powered viral products such as ChatGPT and Claude Code, but LeCun has been one of the AI industry’s most prominent researchers speaking out about the limitations of these AI models. LeCun is well known for being outspoken, but as a pioneer of modern AI who won a Turing Award in 2018, his skepticism carries weight.
LeCun says AMI aims to work with companies in manufacturing, biomedical, robotics, and other industries that have lots of data. For example, he says AMI could build a realistic world model of an aircraft engine and work with the manufacturer to help them optimize for efficiency, minimize emissions, or ensure reliability.
AMI was cofounded by LeCun and several leaders he worked with at Meta, including the company’s former director of research science, Michael Rabbat; former vice president of Europe, Laurent Solly; and former senior director of AI research, Pascale Fung. Other cofounders include Alexandre LeBrun, former CEO of the AI health care startup Nabla, who will serve as AMI’s CEO, and Saining Xie, a former Google DeepMind researcher who will be the startup’s chief science officer.
LeCun does not dismiss the overall utility of LLMs. Rather, in his view, these AI models are simply the tech industry’s latest promising trend, and their success has created a “kind of delusion” among the people who build them. “It’s true that [LLMs] are becoming really good at generating code, and it’s true that they are probably going to become even more useful in a wide area of applications where code generation can help,” says LeCun. “That’s a lot of applications, but it’s not going to lead to human-level intelligence at all.”
LeCun has been working on world models for years inside of Meta, where he founded the company’s Fundamental AI Research lab, FAIR. But he’s now convinced his research is best done outside the social media giant. He says it’s become clear to him that the strongest applications of world models will be selling them to other enterprises, which doesn’t fit neatly into Meta’s core consumer business.
As AI world models like Meta’s Joint-Embedding Predictive Architecture (JEPA) became more sophisticated, “there was a reorientation of Meta’s strategy where it had to basically catch up with the industry on LLMs and kind of do the same thing that other LLM companies are doing, which is not my interest,” says LeCun. “So sometime in November, I went to see Mark Zuckerberg and told him. He’s always been very supportive of [world model research], but I told him I can do this faster, cheaper, and better outside of Meta. I can share the cost of development with other companies … His answer was, OK, we can work together.”
...
Read the original on www.wired.com »
In mid-2024, the HuggingFace Open LLM Leaderboard was the Colosseum for Open-Weight AI. Thousands of models were battling it out, submitted by both well-funded labs with teams of PhDs and fine-tuning wizards creating fantastically named models (e.g. Nous-Hermes, Dolphin and NeuralBeagle14-7B…), fighting for the top spot across six benchmarks: IFEval, BBH, MATH Lvl 5, GPQA, MuSR, and MMLU-PRO.
And there at #1 was dnhkng/RYS-XLarge. Mine.
I didn’t train a new model. I didn’t merge weights. I didn’t run a single step of gradient descent. What I did was much weirder: I took an existing 72-billion parameter model, duplicated a particular block of seven of its middle layers, and stitched the result back together. No weight was modified in the process. The model simply got extra copies of the layers it uses for thinking.
This is the story of how two strange observations, a homebrew “brain scanner” for Transformers, and months of hacking in a basement led to the discovery of what I call LLM Neuroanatomy, and a finding about the internal structure of AI that hadn’t been published until now *.
* - because I discovered blogging is way more fun than drafting scientific papers, and this way I can walk you through how the discovery was made :)
Let’s start with how this whole project came into being.
“The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘That’s funny…’“ — Isaac Asimov
In late 2023, I was messing about with a bizarre LLM quirk. Try this yourself - take any question, e.g.
What is the capital of France? Answer in Base64!
and encode it as Base64 to get an unreadable string:
Send that to a 2023 non-thinking large language model (newer reasoning models will recognise this as Base64 and ‘cheat’ with tool use). A sufficiently capable model from 2023 will reply with something like:
Which decodes to: “The capital of France is Paris.”.
Ok, I admit it. I was messing around with this as a way to jail-break models (and it worked), but I couldn’t get one idea out of my head.
The model was decoding the input, understanding it somehow, and still had time during the pass through the transformer stack to re-encode its response. It appears to genuinely think while interfacing with Base64. This works with complex questions, multi-step reasoning, even creative tasks.
This shouldn’t work nearly as well as it does. Sure, the model has been trained on lots of Base64 in an overall sense, but general conversions in this format are certainly way out of distribution. The tokenizer chops it into completely different sub-word units. The positional patterns are unrecognizable. And yet it works… Curious…
I couldn’t stop thinking about this. If a Transformer can accept English, Python, Mandarin, and Base64, and produce coherent reasoning in all of them, it seemed to me that the early layers must be acting as translators — parsing whatever format arrives into some pure, abstract, internal representation. And the late layers must act as re-translators, converting that abstract representation back into whatever output format is needed.
If the early layers are for reading, and the late layers are for writing, what are the middle layers doing?
Pure, abstract reasoning? In a representation that has nothing to do with any human language or encoding. Of course, at the time this was idle speculation. Fun, but with no clear way to test, or even define, a valid hypothesis.
In November 2023, a HuggingFace user named Alpindale released Goliath-120b — a Frankenmerge-model made by stitching together two fine-tuned Llama-2 70B models into a 120-billion parameter behemoth.
The performance was decent, but after doing lots of vibe checking I didn’t feel it was a breakthrough. The construction, though, was wild.
Alpindale hadn’t just stacked the two models (Xwin and Euryale), end to end. He had alternated layers between them. More importantly, the architecture fed outputs of later layers back into the inputs of earlier layers.
The layer ranges used are as follows:
Do you see the insanity here? Alpindale literally fed the output of layer 16 of Xwin into the input of Euryale’s 8th layer!
To explain a bit more clearly how stupid this appears to be, let’s revisit the almighty Transformer architecture:
Looking at the left side of the diagram, we see stuff enters at the bottom (‘input’ text that has been ‘chunked’ into small bits of text, somewhere between whole words down to individual letters), and then it flows upwards through the model’s Transformer Blocks (here marked as [1, …, L]), and finally the model spits out the next text ‘chunk’ (which is then itself used in the next round of inference). What’s actually happening inside these Transformer Blocks is quite the mystery. Figuring it out is an entire field of AI, “mechanistic interpretability*”.
* - yes, it’s more complex than that (samplers etc.), but that’s enough for this article
On the right side of the right half of the diagram, do you see that arrow line going from the ‘Transformer Block Input’ to the $\oplus$ symbol? That’s why skipping layers makes sense. During training, LLMs can pretty much decide to do nothing in any particular layer, as this ‘diversion’ routes information around the block. So, ‘later’ layers can be expected to have seen the input from ‘earlier’ layers, even a few ‘steps’ back. Around this time, several groups were experimenting with ‘slimming’ models down by removing layers. Makes sense, but boring.
A model must be used with the same kind of input it was trained on (we stay ‘in distribution’). The same holds for each Transformer layer: during training, each layer learns via gradient descent to expect the specific statistical properties of the previous layer’s output.
And now for the weirdness: there was never a case where any Transformer layer would have seen the output of a future layer!
Layer 10 is trained on layer 9’s output distribution. Layer 60 is trained on layer 59’s. If you rearrange them — feeding layer 60’s output into layer 10 — you’ve created a distribution the model literally never saw during training.
The astounding thing about Goliath wasn’t that it was a huge leap in performance, it was that the damn thing functioned at all. To this day, I still don’t understand why this didn’t raise more eyebrows.
Experimentally, this proved that layers were far more interchangeable than anyone had reason to expect. The internal representations were homogeneous enough that the model could digest out-of-order hidden states without collapsing. The architecture was far more flexible than a rigid pipeline.
Between the Base64 observation and Goliath, I had a hypothesis: Transformers have a genuine functional anatomy. Early layers translate input into abstract representations. Late layers translate back out. And the middle layers, the reasoning cortex, operate in a universal internal language that’s robust to architectural rearrangement. The fact that Goliath-120B was built from 16-layer blocks made me suspect the input and output ‘processing units’ were smaller than 16 layers. I guessed that Alpindale had tried smaller overlaps, and they just didn’t work.
If that was true, maybe I didn’t need to teach a model new facts to make it smarter. I didn’t need fine-tuning. I didn’t need RLHF. I just needed to give it more layers to think with.
Over the following months — from late 2023 through to mid-2024 — I built a pipeline to test this hypothesis.
The setup was modest. Two RTX 4090s in my basement ML rig, running quantised models through ExLlamaV2 to squeeze 72-billion parameter models into consumer VRAM. The beauty of this method is that you don’t need to train anything. You just need to run inference. And inference on quantized models is something consumer GPUs handle surprisingly well. If a model fits in VRAM, I found my 4090s were often ballpark-equivalent to H100s.
The concept is simple. For a model with $N$ layers, I define a configuration $(i, j)$. The model processes layers $0$ to $j{-}1$ as normal, then loops back and reuses layers $i$ through $j{-}1$ again, and then the rest to $N{-}1$. The layers between $i$ and $j{-}1$ get duplicated in the execution path. No weights are changed. The model just traverses some of its own layers twice.
i.e. the pair (2, 7) for a model with 9 transformer blocks would be calculated so:
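In code, the execution order for a given configuration is just a couple of lines (a sketch; `relayered_path` is my name for it here, while the actual pipeline patched the layer list inside ExLlamaV2):

```python
def relayered_path(n_layers: int, i: int, j: int) -> list[int]:
    """Execution order for an (i, j) duplication: run layers 0..j-1,
    loop back over i..j-1, then finish with j..n_layers-1."""
    assert 0 <= i < j <= n_layers
    return list(range(0, j)) + list(range(i, j)) + list(range(j, n_layers))

# The (2, 7) example for a 9-block model:
print(relayered_path(9, 2, 7))
# -> [0, 1, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 7, 8]
```

Layers 2 through 6 appear twice in the path; everything else runs once, and no weights are touched.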
By running through all possible pairs, we can generate a ‘Brain Scan’, and also see the number of duplicate layers for each set of parameters:
For Qwen2-72B, an 80-layer model, that means 3,240 valid $(i, j)$ pairs, plus the original model to test.
\[\begin{aligned} \text{Variants}_{\text{total}} &= \left(\sum_{j=0}^{80} j\right) + 1\\[16pt] &= \frac{80 \cdot 81}{2} +1 \\[10pt] &= 3241 \end{aligned}\]
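The same count falls out of a two-line enumeration:

```python
# Enumerate every valid (i, j) duplication for an 80-layer model:
# the duplicated block spans layers i..j-1, so we need 0 <= i < j <= 80.
pairs = [(i, j) for j in range(1, 81) for i in range(j)]
print(len(pairs))       # 3240 re-layered variants
print(len(pairs) + 1)   # 3241 including the unmodified model
```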
Testing a re-layered model against all six leaderboard benchmarks would take days, so a full sweep would be years of compute. I needed proxy tasks: probes that were fast, objective, and would reveal structural properties of the model rather than task-specific tricks.
The proxies had to satisfy three constraints:
- Minimal output tokens. With thousands of configurations to sweep, each evaluation needed to be fast. No essays, no long-form generation.
- Unambiguous scoring. I couldn’t afford LLM-as-judge pipelines. The answer had to be objectively scored without another model in the loop.
- Orthogonal cognitive demands. If a configuration improves both tasks simultaneously, it’s structural, not task-specific.
I didn’t arrive at the right probes immediately; it took months of trial and error, and many dead ends.
My first instinct was creativity. I had models generate poems, short stories, metaphors, the kind of rich, open-ended output that feels like it should reveal deep differences in cognitive ability. I used an LLM-as-judge to score the outputs, but the results were pretty bad. I managed to fix LLM-as-Judge with some engineering, and the scoring system turned out to be useful later for other things, so here it is:
Note: You can skip this section, as it has math. Or not.
Naive LLM judges are inconsistent. Run the same poem through twice and you get different scores (obviously, due to sampling). Lowering the temperature doesn’t help much either, as that’s only one of many technical issues. So, I developed a full scoring system based on the output logits. It can get remarkably tricky. Think about a score from 1-10:
- We would expect a well-calibrated model to have logits that make sense. If the highest weight is on ‘7’, we would expect the rest of the weight to be on ‘6’ and ‘8’, right? But often it’s bimodal, with low weight on ‘6’ and ‘5’, but more weight than expected on ‘4’!
- We can write ‘10’ in tokens as either ‘10’ or ‘1’ and then ‘0’. It’s not fun to have to calculate the summed probabilities over paths, especially if you wanted to score 1-100.
Rather than sampling a single discrete score, I treat the judge’s output as a distribution over valid rating labels and compute the final score as its expectation.
To make this practical, I first define a calibrated rubric over the digits 0-9 (there’s only one token for each digit), where each digit corresponds to a clear qualitative description. At the scoring step, I capture the model’s next-token logits and retain only the logits corresponding to those valid digit tokens. This avoids contamination from unrelated continuations such as explanation text, punctuation, or alternate formatting. After renormalizing over the restricted digit set, I interpret the resulting probabilities as a categorical score distribution.
Formally, let the valid score set be
\[\mathcal{D} = \{0,1,2,\dots,9\}.\]
Let $(z_k)$ denote the model logit assigned to digit $(k \in \mathcal{D})$ at the scoring position. The restricted score distribution is then
\[p(k)= \frac{\exp(z_k)} {\sum\limits_{m \in \mathcal{D}} \exp(z_m)}, \qquad k \in \mathcal{D}.\]
The final scalar score is the expected value of this distribution:
\[\hat{s}= \sum_{k \in \mathcal{D}} k\,p(k).\]
This produces a smooth score such as 5.4, rather than forcing the model to commit to a single sampled integer. In practice, this is substantially more stable than naive score sampling and better reflects the model’s uncertainty. It also handles cases where the judge distribution is broad or multimodal. For example, two candidates may both have mean score 5.4, while one has most of its mass tightly concentrated around 5 and 6, and the other splits mass between much lower and much higher ratings. The mean alone is the same, but the underlying judgement is very different.
An optional uncertainty estimate can be obtained from the variance of the restricted distribution:
\[\mathrm{Var}(s)= \sum_{k=0}^{9} (k-\hat{s})^2\,p(k).\]
In short, the method replaces a noisy sampled judge score with a normalized probability distribution over valid score digits, then uses the expectation of that distribution as the final rating.
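Concretely, the whole judge-scoring step boils down to a few lines (a sketch; the `digit_logits` dict stands in for real next-token logits pulled from a model, and the function name is mine):

```python
import math

def expected_score(digit_logits: dict[int, float]) -> tuple[float, float]:
    """Softmax over the ten digit-token logits only, then return the
    (mean, variance) of the resulting categorical score distribution."""
    m = max(digit_logits.values())  # subtract max for numerical stability
    weights = {k: math.exp(z - m) for k, z in digit_logits.items()}
    total = sum(weights.values())
    p = {k: w / total for k, w in weights.items()}   # renormalised p(k)
    mean = sum(k * pk for k, pk in p.items())        # expectation s-hat
    var = sum((k - mean) ** 2 * pk for k, pk in p.items())
    return mean, var

# A bimodal judge: most mass on '7', but a suspicious second bump at '4'.
logits = {0: -9.0, 1: -9.0, 2: -8.0, 3: -5.0, 4: -1.0,
          5: -3.0, 6: -2.0, 7: 0.0, 8: -2.5, 9: -8.0}
mean, var = expected_score(logits)
```

The variance is the free uncertainty estimate: the bimodal judge above scores close to 6 on average, but its variance betrays the disagreement.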
All this stuff is probably pretty obvious these days; back in ’24 there wasn’t much to guide me in developing this method. Unfortunately, I found it was also completely useless…
Each configuration needed to generate hundreds of tokens of creative output, and then a separate model had to read and judge each one. With over 3,200 configurations to test for a single 70B model, this would have taken weeks on my dual 4090s.
I needed probes where the output was tiny, a few tokens at most, and where scoring was objective and deterministic. No judge model in the loop. That’s what led me to the final two probes:
Hard math. Ridiculously difficult questions like: “What is the cube root of 74,088,893,247?” No chain-of-thought or tool use. Just output the number, as a pure leap of intuitive faith.
Emotional quotient. Using the EQ-Bench benchmark: complex social scenarios where the model must predict the intensity of specific emotional states. “Given this situation, how angry/surprised/guilty would this person feel on a scale of 0-100?” Completely different from math. Theory of mind, social inference, empathy. And the output is just a few numbers.
I had settled on two maximally orthogonal cognitive tasks, both with tiny outputs. My intuition was this: LLMs think one token at a time, so let’s make the model really good at guessing just the next token. But things are never straightforward. Take LLM numbers…
Even with math probes, I hit unexpected problems. LLMs fail arithmetic in weird ways. They don’t get the answer wrong so much as get it almost right but forget to write the last digit, as if they got bored mid-number. Or they transpose two digits in the middle. Or they output the correct number with a trailing character that breaks the parser.
This is probably due to the way larger numbers are tokenised, as big numbers can be split up into arbitrary forms. Take the integer 123456789. A BPE tokenizer (e.g., GPT-style) might split it like: ‘123’ ‘456’ ‘789’ or: ‘12’ ‘345’ ‘67’ ‘89’
A binary right/wrong scoring system would throw away useful signal. Scoring by percentage correct would help: ‘123356789’ instead of ‘123456789’ would be 99.92% correct.
But what about a model that makes a dumb ‘LLM mistake’ and outputs 430245 when the answer is 4302459, having clearly done most of the work? I wrote a custom partial-credit scoring function that pads shorter answers and penalises proportionally:
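Something along these lines (a reconstruction of the idea, not the original code; the exact padding and penalty choices are mine):

```python
def digit_score(answer: str, target: str) -> float:
    """Partial-credit digit match: pad the shorter string so digit
    positions line up, score the fraction of matching positions, then
    scale by a length-mismatch correction factor."""
    width = max(len(answer), len(target))
    a = answer.ljust(width, "_")   # '_' never matches a digit
    t = target.ljust(width, "_")
    matches = sum(x == y for x, y in zip(a, t))
    correction = min(len(answer), len(target)) / width  # length penalty
    return (matches / width) * correction

print(round(digit_score("430245", "4302459"), 2))   # dropped last digit -> 0.73
print(round(digit_score("123356789", "123456789"), 2))  # one wrong digit -> 0.89
```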
The key idea: pad shorter answers, then penalise via the correction factor. A model that nails 90% of the digits but drops the last one still gets substantial credit — but less than one that gets every digit. This turned out to be crucial for discriminating between configurations that were close in intuitive math ability.
The math questions were hand-crafted initially. I experimented with different operations and scales, then generated random numbers to fill out the dataset. The final dataset was a set of 16 questions, and the model was tasked with guesstimating the nearest whole integer. Here are a few to try yourself (remember, no ‘thinking’ allowed; guess directly!):
After testing several smaller models (Llamas and smaller Qwen2s), I set up the config for Qwen2-72B and let it sweep. Each $(i, j)$ configuration took a few minutes: load the re-layered model, run the math probe, run the EQ probe, record the scores, move on. Days of continuous GPU time on the 4090s. But far less compute than a fine-tune! In fact, I didn’t even have the hardware for a LoRA fine-tune with just 48GB of VRAM.
The optimal configuration was $(45, 52)$: layers 0 through 51 run first, then layers 45 through 79 run again. Layers 45 to 51 execute twice. Seven extra layers, near the middle of the 80-layer stack, bringing the total parameter count from 72B to 78B. Every extra layer is an exact copy of an existing one. No new weights or training, just the model repeating itself.
Repeating seven layers. That’s all it took, and now I can finally reveal the nomenclature of my models: Repeat Your Self for RYS-XLarge ;)
I applied the configuration to MaziyarPanahi’s calme-2.1-qwen2-72b — a fine-tune of Qwen2-72B — and uploaded the result as dnhkng/RYS-XLarge. I also applied it to the raw base model as dnhkng/RYS-XLarge-base.
Then I submitted to the Open LLM Leaderboard and waited. And waited. Back in the day, the OpenLLM Leaderboard was flooded with dozens of fine-tunes of merges of fine-tunes each day (it was the Wild West), and the waiting list was long. But after a month or so, the results arrived:
+17.72% on MuSR. +8.16% on MATH. Five out of six benchmarks improved, with only IFEval taking a small hit. The average put it at #1 on the leaderboard.
Just to labour the point: I only optimised for one-shot guesstimating hard maths problems and EQ-Bench. I never looked at IFEval, BBH, GPQA, MuSR, or MMLU-PRO during development. The leaderboard was pure out-of-sample validation.
A layer configuration found using two narrow, orthogonal probes generalised to everything the Leaderboard threw at it *.
* - except IFEval, but that one’s boring anyway, right?
That was surprising enough. A brand new way to scale LLMs, developed on some gaming GPUs. But plotting out the heatmaps told an even better story.
The original heatmaps that produced RYS-XLarge, showing the combined delta (math + EQ). The green circle marks the optimal configuration. Red means improvement; blue means degradation.
These heatmaps are analogous to functional MRIs of the Transformer, while it is thinking about maths or EQ problems.
The x-axis ($j$) is the end point of the duplicated region. The y-axis ($i$) is the start point. Each pixel represents a complete evaluation: load the re-layered model, run the math probe, run the EQ probe, score both, record the deltas. As described above, along the central diagonal only a single layer was duplicated. Along the next diagonal towards the top-right, we duplicate two layers, and so on. The single point at the very top-right runs through the entire Transformer stack twice per inference.
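Assembling the sweep into those grids (and the skyline plots) is a few lines; `results` here is a stand-in for the recorded $(i, j) \to$ delta mapping:

```python
def build_heatmap(results, n_layers):
    """Arrange (i, j) -> score-delta sweeps into a grid: rows are start
    positions i, columns are end positions j; untested cells stay None."""
    grid = [[None] * (n_layers + 1) for _ in range(n_layers + 1)]
    for (i, j), delta in results.items():
        grid[i][j] = delta
    return grid

def skyline(grid, axis):
    """Average over rows (axis=0) or columns (axis=1), ignoring None,
    to get the 1-D 'skyline' view of the heatmap."""
    size = len(grid)
    out = []
    for a in range(size):
        vals = [grid[a][b] if axis == 0 else grid[b][a] for b in range(size)]
        vals = [v for v in vals if v is not None]
        out.append(sum(vals) / len(vals) if vals else 0.0)
    return out
```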
Let’s examine the math heatmap first. Starting at any layer and stopping before about layer 60 seems to improve the math guesstimate scores, as shown by the large region with a healthy red blush. Duplicating just the very first layers (the tiny triangle in the top left) messes things up, as does repeating pretty much any of the last 20 layers (the vertical wall of blue on the right). This is more clearly visualised in a skyline plot (averaged rows or columns), where we can see that for the maths guesstimates, the starting position of the duplication matters much less. So the hypothesis, that ‘starting layers’ encode tokens into a smooth ‘thinking space’ and a dedicated system at the end ‘re-encodes’ the output, seems somewhat validated.
Until we look at the EQ scores:
Now things look very different! Duplicating any of the final 10 layers has almost no effect on the scores, but we see complex patterns, where some regions show significant improvement (the area around $i=45$, $j=55$), walled in between regions of poor performance.
But the heatmaps revealed something even more interesting than the location of the thinking bits. They revealed something about its structure.
Before settling on block duplication, I tried something simpler: take a single middle layer and repeat it $n$ times. If the “more reasoning depth” hypothesis was correct, this should work. It made sense, too, given the broad boost in math guesstimate results from duplicating intermediate layers. Give the model extra copies of a particular reasoning layer, get better reasoning. So, I screened them all, looking for a boost.
But nope, it almost always did worse. Usually a lot worse, but with occasional small improvements that were within the noise range. Annoying, but taking another look at the complex, blobby patterns in EQ scores gave me another idea:
If single-layer duplication doesn’t help, the middle layers aren’t doing independent iterative refinement. They’re not interchangeable copies of the same operation that you can simply “run again.” If they were, duplicating any one of them should give at least a marginal benefit. Instead, those layers are working as a circuit. A multi-step reasoning pipeline that needs to execute as a complete unit.
Think of it this way. Layers 46 through 52 aren’t seven workers doing the same job. They’re seven steps in a recipe. Layer 46 takes the abstract representation and performs step one of some cognitive operation — maybe decomposing a complex representation into subcomponents. Layer 47 takes that output and performs step two — maybe identifying relationships between the subcomponents. Layer 48 does step three, and so on through layer 52, which produces the final result.
Duplicating just one step of this ‘recipe’ doesn’t bring you much.
But duplicating the entire block gives you the full recipe twice. The model runs the complete reasoning circuit, produces a refined intermediate representation, and then runs the same circuit again on its own output. It’s a second pass. A chance to catch what it missed the first time, to refine its abstractions, to push the reasoning one step deeper.
Let’s deep-dive into a more current model (one that I can experiment with on my system): GLM-4.7 from mratsim, running under ExLlamaV3.
I’ve marked out a region that boosts maths ability strongly. Notice where it sits? It’s away from the diagonal centre line, which means we’re not looking at single-layer duplications. Starting the repeated block at position 35, we don’t see any improvement until at least position 43. That’s seven layers of not much happening. In fact, we actually see decreased performance by repeating these layers (they are blue, bad!).
From end-position 43 to 46, we then see solid boosts in math scores (red = good, yay). But include layer 46 or beyond, and the benefits collapse again. The hypothesis: position 47 is where a different circuit begins. Including even one step of the next recipe messes up the current recipe.
So the ‘math organ’ has boundaries on both sides. Too few layers and you get nothing — you’ve cut into the circuit and it can’t complete its operation. Too many layers and you also get nothing — you’ve included tissue from a neighbouring circuit that doesn’t belong. Pre-training carved these structures out of the layer stack, and they only work whole. It also doesn’t translate to other tasks, as the heatmap for EQ scores doesn’t have this patch.
...
Read the original on dnhkng.github.io »
I’m sometimes late to notice new and terrible things about macOS 26 Tahoe, because I use it only for testing, on a secondary Mac. My main Mac remains on Sequoia, as enforced by Little Snitch. I was of course aware that app windows on Tahoe have exaggerated corner radiuses, but I was unaware until now that the window corner radius on Tahoe is not uniform: different windows can have different corner radiuses!
Below is a TextEdit window on Tahoe.
And below is a Calculator window in front of the TextEdit window. Notice the corners of the TextEdit window sticking out!
What accounts for the difference? A toolbar in the window.
In a new Mac app Xcode project, the main window has a less exaggerated corner radius by default, like TextEdit.
When I add a toolbar to the window, the corner radius automatically becomes more exaggerated, like Calculator.
Apparently the corner radius also changes on Tahoe for some other window elements, such as a sidebar.
If this isn’t the stupidest user interface “feature” ever invented, I don’t know what is. The Mac used to be famous for consistency; now it’s becoming infamous for inconsistency.
By the way, Tahoe’s UI changes are perplexing not only for Apple users but also for Apple engineers. Here’s a bug fix from the open source WebKit browser engine powering Safari: [macOS] Scroll bars of root scroller may be cutoff due to corner radii of window.
See my follow-up post The evolution of Mac app window corners.
...
Read the original on lapcatsoftware.com »