10 interesting stories served every morning and every evening.




1 1,591 shares, 110 trendiness

Tony Hoare (1934-2026)

Computational Complexity and other fun stuff in math and computer science from Lance Fortnow and Bill Gasarch

...

Read the original on blog.computationalcomplexity.org »

2 589 shares, 37 trendiness

Online age-verification tools spread across U.S. for child safety, but adults are being surveilled

New U.S. laws designed to protect minors are pulling millions of adult Americans into mandatory age-verification gates to access online content, leading to backlash from users and criticism from privacy advocates that a free and open internet is at stake. Roughly half of U.S. states have enacted or are advancing laws requiring platforms — including adult content sites, online gaming services, and social media apps — to block underage users, forcing companies to screen everyone who approaches these digital gates.

"There's a big spectrum," said Joe Kaufmann, global head of privacy at Jumio, one of the largest digital identity-verification and authentication platforms. He explained that the patchwork of state laws varies in technical demands and compliance expectations. "The regulations are moving in many different directions at once," he said.

Social media company Discord announced plans in February to roll out mandatory age verification globally, which the company said would rely on verification methods designed so facial analysis occurs on a user's device and submitted data would be deleted immediately. The proposal quickly drew backlash from users concerned about having to submit selfies or government IDs to access certain features, which led Discord to delay the launch until the second half of this year.

"Let me be upfront: we knew this rollout was going to be controversial. Any time you introduce something that touches identity and verification, people are going to have strong feelings," Discord chief technology officer and co-founder Stanislav Vishnevskiy wrote in a Feb. 24 blog post.

Websites offering adult content, gambling, or financial services often rely on full identity verification that requires scanning a government ID and matching it to a live image. But most of the verification systems powering these checkpoints — often run by specialized identity-verification vendors on behalf of websites — rely on artificial intelligence such as facial recognition and age-estimation models that analyze selfies or video to determine in seconds whether someone is old enough to access content. Social media and lower-risk services may use lighter estimation tools designed to confirm age without permanently storing detailed identity records.

Vendors say a challenge is balancing safety with how much friction users will tolerate. "We're in the business of ensuring that you are absolutely keeping minors safe and out and able to let adults in with as little friction as possible," said Rivka Gewirtz Little, chief growth officer at identity-verification platform Socure. Excessive data collection, she added, creates friction that users resist.

Still, many users perceive mandatory identity checks as invasive. "Having another way to be forced to provide that information is intrusive to people," said Heidi Howard Tandy, a partner at Berger Singerman who specializes in intellectual property and internet law. Some users may attempt workarounds — including prepaid cards or alternative credentials — or turn to unauthorized distribution channels. "It's going to cause a piracy situation," she added.

In many implementations, verification vendors — not the websites themselves — process and retain the identity information, returning only a pass-fail signal to the platform.

Gewirtz Little said Socure does not sell verification data and that in lightweight age-estimation scenarios, where platforms use quick facial analysis or other signals rather than government documentation, the company may store little or no information. But in fuller identity-verification contexts, such as gaming and fraud prevention that require ID scans, certain adult verification records may be retained to document compliance. She said Socure can keep some adult verification data for up to three years while following applicable privacy and purging rules.

Civil liberties advocates warn that concentrating large volumes of identity data among a small number of verification vendors can create attractive targets for hackers and government demands. Earlier this year, Discord disclosed a data breach that exposed ID images belonging to approximately 70,000 users through a compromised third-party service, highlighting the security risks associated with storing sensitive identity information.

In addition, they warn that expanding age-verification systems represent not only a usability challenge but a structural shift in how identity becomes tied to online behavior. Age verification risks tying users' "most sensitive and immutable data" — names, faces, birthdays, home addresses — to their online activity, according to Molly Buckley, a legislative analyst at the Electronic Frontier Foundation. "Age verification strikes at the foundation of the free and open internet," she said.

Even when vendors promise to safeguard personal information, users ultimately rely on contractual terms they rarely read or fully understand. "There's language in their terms-of-use policies that says if the information is requested by law enforcement, they'll hand it over. They can't confirm that they will always forever be the only entity who has all of this information. Everyone needs to understand that their baseline information is not something under their control," Tandy said.

As more platforms route age checks through third-party vendors, that concentration of identity data is also creating new legal exposure for the companies that rely on them. "A company is going to have some of that information passing through their own servers," Tandy said. "And you can't offload that kind of liability to a third party."

Companies can distribute risk through contracts and insurance, she said, but they remain responsible for how identity systems interact with their infrastructure. "What you can do is have really good insurance and require really good insurance from the entities that you're contracting with," she said.

Tandy also cautioned that retention promises can be more complex than they appear. "If they say they're holding it for three years, that's the minimum amount of time they're holding it for," she said. "I wouldn't feel comfortable trusting a company that says, 'We delete everything one day after three years.' That is not going to happen," she added.

Federal and state regulators argue that age-verification laws are primarily a response to documented harms to minors and insist the rules must operate under strict privacy and security safeguards.

An FTC spokesperson told CNBC that companies must limit how collected information is used. While age-verification technologies can help parents protect children online, the agency said firms are still bound by existing consumer protection rules governing data minimization, retention, and security. The agency pointed to existing rules requiring firms to retain personal information only as long as reasonably necessary and to safeguard its confidentiality and integrity.

...

Read the original on www.cnbc.com »

3 485 shares, 22 trendiness

After outages, Amazon to make senior engineers sign off on AI-assisted changes

Amazon's ecommerce business has summoned a large group of engineers to a meeting on Tuesday for a "deep dive" into a spate of outages, including incidents tied to the use of AI coding tools.

The online retail giant said there had been a "trend of incidents" in recent months, characterized by a "high blast radius" and "Gen-AI assisted changes" among other factors, according to a briefing note for the meeting seen by the FT.

Under "contributing factors" the note included "novel GenAI usage for which best practices and safeguards are not yet fully established."

"Folks, as you likely know, the availability of the site and related infrastructure has not been good recently," Dave Treadwell, a senior vice-president at the group, told employees in an email, also seen by the FT.

The note ahead of Tuesday's meeting did not specify which particular incidents the group planned to discuss.

Amazon's website and shopping app went down for nearly six hours this month in an incident the company said involved an "erroneous software code deployment." The outage left customers unable to complete transactions or access functions such as checking account details and product prices.

Treadwell, a former Microsoft engineering executive, told employees that Amazon would focus its weekly "This Week in Stores Tech" (TWiST) meeting on a deep dive into "some of the issues that got us here as well as some short immediate term initiatives" the group hopes will limit future outages.

...

Read the original on www.ft.com »

4 445 shares, 17 trendiness

No, it doesn't cost Anthropic $5k per Claude Code user

My LinkedIn and Twitter feeds are full of screenshots from the recent Forbes article on Cursor claiming that Anthropic's $200/month Claude Code Max plan can consume $5,000 in compute. The relevant quote:

Today, that subsidization appears to be even more aggressive, with that $200 plan able to consume about $5,000 in compute, according to a different person who has seen analyses on the company's compute spend patterns.

This is being shared as proof that Anthropic is haemorrhaging money on inference. It doesn't survive basic scrutiny.

I'm fairly confident the Forbes sources are confusing retail API prices with actual compute costs. These are very different things.

Anthropic's current API pricing for Opus 4.6 is $5 per million input tokens and $25 per million output tokens. At those prices, yes - a heavy Claude Code Max 20 user could rack up $5,000/month in API-equivalent usage. That maths checks out.
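To make that concrete, here is a rough back-of-the-envelope sketch in Python. The token volumes are illustrative assumptions, not figures from the article; only the per-million-token prices come from the text above.

```python
# Illustrative sketch: how a heavy agentic-coding user could reach ~$5,000/month
# at retail API prices. Token volumes below are assumptions, not reported data.
INPUT_PRICE = 5.00    # $ per million input tokens (Opus 4.6 retail, per the article)
OUTPUT_PRICE = 25.00  # $ per million output tokens

input_mtok = 700   # hypothetical: millions of input tokens (context re-sent on every turn)
output_mtok = 60   # hypothetical: millions of output tokens generated

monthly_api_equivalent = input_mtok * INPUT_PRICE + output_mtok * OUTPUT_PRICE
print(f"${monthly_api_equivalent:,.0f}")  # -> $5,000
```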

But API pricing is not what it costs Anthropic to serve those tokens.

The best way to estimate what inference actually costs is to look at what open-weight models of similar size are priced at on OpenRouter - where multiple providers compete on price.

Qwen 3.5 397B-A17B is a good comparison point. It's a large MoE model, broadly comparable in architecture size to what Opus 4.6 is likely to be. Equally, so is Kimi K2.5, at 1T params with 32B active, which is probably approaching the upper limit of what you can efficiently serve.

Here's what the pricing looks like:

The Qwen 3.5 397B model on OpenRouter (via Alibaba Cloud) costs $0.39 per million input tokens and $2.34 per million output tokens. Compare that to Opus 4.6's API pricing of $5/$25. Kimi K2.5 is even cheaper at $0.45 per million input tokens and $2.25 output.

And this ratio holds for cached tokens too - DeepInfra charges $0.07/MTok for cache reads on Kimi K2.5 vs Anthropic's $0.50/MTok.
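Putting the quoted prices side by side shows the gap is roughly an order of magnitude across the board:

```python
# Ratio of open-weight provider prices to Anthropic's retail Opus 4.6 prices,
# using the figures quoted in the text (all in $ per million tokens).
opus = {"input": 5.00, "output": 25.00, "cache_read": 0.50}
providers = {
    "Qwen 3.5 397B (Alibaba Cloud)": {"input": 0.39, "output": 2.34},
    "Kimi K2.5 (DeepInfra)": {"input": 0.45, "output": 2.25, "cache_read": 0.07},
}

for name, prices in providers.items():
    for kind, price in prices.items():
        print(f"{name} {kind}: {price / opus[kind]:.0%} of Opus 4.6 pricing")
# Every ratio comes out at roughly 8-14%, i.e. around a tenth of the retail price.
```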

These OpenRouter providers are running a business. They have to cover their compute costs, pay for GPUs, and make a margin. They're not charities. If so many can serve a model of comparable size at ~10% of Anthropic's API price and remain in business, it is hard for me to believe that they are all taking enormous losses (at ~the exact same rate range).

If a heavy Claude Code Max user consumes $5,000 worth of tokens at Anthropic's retail API prices, and the actual compute cost is roughly 10% of that, Anthropic is looking at approximately $500 in real compute cost for the heaviest users.

That's a loss of $300/month on the most extreme power users - not $4,800.

However, most users don't come anywhere near the limit. Anthropic themselves said when they introduced weekly caps that fewer than 5% of subscribers would be affected. I personally use the Max 20x plan and probably consume around 50% of my weekly token budget, and it's hard to use that many tokens without getting serious RSI. At that level of usage, the maths works out to roughly break-even or profitable for Anthropic.

The real story is actually in the article. The $5,000 figure comes from Cursor's internal analysis. And for Cursor, the number probably is roughly correct - because Cursor has to pay Anthropic's retail API prices (or close to it) for access to Opus 4.6.

So to provide a Claude Code-equivalent experience using Opus 4.6, it would cost Cursor ~$5,000 per power user per month. But it would cost Anthropic perhaps $500 max.

And the real issue for Cursor is that developers want to use the Anthropic models, even in Cursor itself. They have real "brand awareness", and they are genuinely better than the cheaper open-weights models - for now at least. It's a real conundrum for them.

Obviously Anthropic isn't printing free cashflow. The costs of training frontier models, the enormous salaries required to hire top AI researchers, the multi-billion dollar compute commitments - these are genuinely massive expenses that dwarf inference costs.

But on a per-user, per-token basis for inference? I believe Anthropic is very likely profitable - potentially very profitable - on the average Claude Code subscriber.

The "AI inference is a money pit" narrative is misinformation that actually plays into the hands of the frontier labs. If everyone believes that serving tokens is wildly expensive, nobody questions the 10x+ markups on API pricing. It discourages competition and makes the moat look deeper than it is.

If you want to understand the real economics of AI inference, don't take API prices at face value. Look at what competitive open-weight model providers charge on OpenRouter. That's a much closer proxy for what it actually costs to run these models - and it's a fraction of what the frontier labs charge.

...

Read the original on martinalderson.com »

5 434 shares, 23 trendiness

howisFelix.today? · Felix Krause

Background: Why I put my whole life into a single database

Back in 2019, I started collecting all kinds of metrics about my life. Every single day for the last 3 years I tracked over 100 different data types - ranging from fitness & nutrition to social life, computer usage and weather.

Ideas or suggestions?

I'd love to hear from you!

The goal of this project was to answer questions about my life, like

How does living in different cities affect other factors like fitness, productivity and happiness?

How does sleep affect my day, my fitness level, and happiness?

How does the weather, and the different seasons, affect my life?

Are there any trends over the last few years?

How does computer time, work and hours in meetings affect my personal life?

Since the start of this project, I collected ~380,000 data points, with the biggest data sources being:

Naturally, after I started collecting this data, I wanted to visualize what I was learning, so I created this page. Initially, the domain whereisFelix.today (now renamed to howisFelix.today) started as a joke to respond to friends asking when I'd be back in NYC or San Francisco. Rather than send them my schedule, I'd point them to this domain. However, now it's more than my location: it's all of me.

Use a single database, owned and hosted by me, with all the data I've collected over the years

Be able to easily add and remove questions on the fly, as I learn what's beneficial to track

Full control of how the data is visualized

Works well for frequent flyers with mixed time zones

I selected 48 graphs to show publicly on this page. For privacy reasons, and to prevent any accidental data leaks, the graphs below are snapshots taken on a given day.

Visualization of the number of data entries in FxLifeSheet over the last 10 years, and where the data came from.

Initially (2014) the only data used was RescueTime and Foursquare Swarm location data

Once I started the FxLifeSheet project in April 2019, I manually tracked data, ranging from mood, sleep and social life to fitness data

I was able to retrospectively fetch the historic weather data based on my location on a given day

I also implemented other import sources, like fetching my historic weight and the number of steps from Apple Health

Days I tracked my Mood as Happy & Excited

On days where I tracked my mood to be "happy" & "excited", the following other factors of my life were affected

50% more likely to have pushed my comfort zone

44% more likely to have meditated that day

33% more excited about what's ahead in the future

31% more likely to drink alcohol that day (parties, good friends and such)

28% more time spent reading or listening to audio books

26% more likely to have worked on interesting technical challenges

20% more likely to have learned something new that day

45% less time spent in video & audio calls that day

All flights taken within the last 7 years, tracked using Foursquare Swarm, analyzed by JetLovers.

The stats clearly show the impact of COVID starting 2020

Sunday has been my "commute" day, flying between San Francisco, New York City and Vienna

All flights taken within the last 7 years, tracked using Foursquare Swarm, analyzed by JetLovers.

Frankfurt - Vienna was the flight connecting me with most US airports

Germany is high up on the list due to layovers, even though I didn't actually spend much time there

Inspired by Your Life in Weeks by WaitButWhy, I use Google Sheets to visualize every week of my life, with little notes on what city/country I was in, and other life events that have happened.

The first 14 years I didn't really get much done

I can highly recommend taking a few weeks (or even months) off between jobs (if you have the possibility)

Shades of blue indicate my full-time employments

You can create your own version using my template

Average daily steps measured through the iPhone's Apple Health app. I decided against using SmartWatch data for steps, as SmartWatches have changed over the last 8 years.

I walked a total of steps over the last 8 years

I walk more than twice as much when I'm in New York, compared to any other city

In NYC I had the general rule of thumb to walk instead of taking public transit whenever it's less than 40 minutes. I used that time to call friends & family, or listen to audio books

Although Vienna is very walkable, the excellent public transit system, with subway trains coming every 3-5 minutes, has caused me to walk less

San Francisco was always scary to walk

This graph clearly shows the correlation between my body weight and my sleeping/resting heart rate. The resting heart rate is measured by the Withings ScanWatch while sleeping, and indicates how hard your heart has to work while not being active. Generally, the lower the resting heart rate, the better.

I started my lean bulk (controlled weight gain combined with 5 workouts a week) in August 2020

My resting heart rate went from 58bpm to 67bpm from August 2020 to March 2021, with a weight gain of +19lbs as part of a controlled lean bulk combined with a 5-day/week workout routine

The spike in resting heart rate in July & August 2021 was due to bars and nightclubs opening up again in Austria

After a night of drinking, my resting/sleeping heart rate was about 50% higher than after a night without any alcohol

The spike in resting heart rate in Oct/Nov/Dec 2021 was due to having bronchitis and a cold/flu, and not getting correct treatment early enough

How healthy have I been over the Years?

Every day I answered the question of how healthy I felt. In the graph, the yellow color indicates that I felt a little under the weather, not sick per se. Red means I was sick and had to stay home. Green means I felt energized and healthy.

During the COVID lockdowns I tended to stay healthier. This may be due to not going out, no heavy drinking, less close contact with others, etc., which resulted in me having better sleep.

Usually during excessive traveling I get sick (cold/flu)

Q4 2021 I had bronchitis; however, I didn't know about it at the time and didn't get proper treatment

Overall I'm quite prone to getting sick (cold/flu)

Days with more than 4 Alcoholic Drinks

On days where I had more than 4 alcoholic beverages (meaning I was partying), the following other factors were affected

21x more likely to dance

80% more likely to take a nap the day of, or the day after

40% warmer temperatures, and 40% less precipitation. There weren't many opportunities for parties in Winter due to lockdowns in the last 2 years. Also, people are more motivated to go out when it's nice outside.

My FxLifeSheet bot asks me 4 times a day how I'm feeling at the moment.

This graph groups the entries by month, and shows the % of entries for each value (0 - 5), with 5 being very excited, and 0 being worried.

I designed the ranges so that 0 or 5 are not entered as much. 0 is rendered as dark green at the top, whereas 5 is rendered as light green at the bottom.

For privacy reasons I won't get into some of the details on why certain months were worse than others.

Every Swarm check-in over the last 7 years visualized on a map, including the actual trip (flight, drive, etc.)

Every Swarm check-in over the last 7 years visualized, zoomed in

Each time I did a check-in at a place (e.g. Coffee, Restaurant, Airport, Gym) on Foursquare Swarm in a given city, this is tracked as a single entry.

Each check-in at a given city is counted as a single entry, grouped by years

2018 and 2019 I lived in New York City

The longer it's been since I moved away from Austria, the more time I actually spent back home in Austria for visits and vacations

2020 clearly shows the impact of COVID

Each check-in at a given category is tracked, and summed up over the last years

In 2020 and 2021, check-ins at Offices went down to zero due to COVID, and a distributed work setup

Airports being the #4 most visited category was a surprise, but is accurate. A total of 403 airport check-ins, whereas a flight with a layover would count as 3 airport check-ins

Earlier in my life, I didn't always check into 'commute' places like public transit and supermarkets

Number of Foursquare Swarm check-ins in each quarter over the last 10 years. I didn't use Foursquare Swarm as seriously before 2015. Once I moved to San Francisco in Q3 2015 I started my habit of checking into every point of interest (POI) I visit.

Q3 2015 I moved to San Francisco; however, I couldn't use Swarm yet, since my move was a secret until the official announcement at the Twitter Flight conference

Q2 2020 clearly shows the impact of COVID, with Q3 already being open in Austria

Q3 2021 the vaccine was already widely available and I was able to travel/visit more again

My time in New York was the most active when it comes to check-ins. When I'm in NYC, I tend to eat/drink out more, and grab to-go food, which I do way less in Vienna

Every Swarm check-in visualized on a map. Only areas where I've had multiple check-ins are rendered.

Number of days per year that I've spent in full lockdown, meaning restaurants, bars and non-essential stores were closed.

I escaped parts of the Austrian lockdown by spending time in the US when I was already vaccinated

Surprisingly, in 2021 I spent more days in a full lockdown than in 2020, even with vaccines available

How was my life affected by the recent COVID lockdowns? As a lockdown day I classify every day where places like restaurants, gyms and non-essential stores were closed.

200% more time spent in audio & video calls with friends (non-work related)

60% more likely to follow my meal plan (macros & calories)

50% colder temperatures: lockdowns tended to happen in Autumn and Winter

100% less likely to dance

Alcoholic drinks per day. Days with no data are rendered as white

Friday and Saturday nights are clearly visible on those graphs

2021 and summer/winter of 2019 also show the Wednesday night party in Vienna

Q2 and Q4 2020 clearly show the COVID lockdowns, as well as Q2 2021

Summer of 2021 all bars and dance clubs were open in Vienna

...

Read the original on howisfelix.today »

6 409 shares, 21 trendiness

Yann LeCun’s AI start-up raises more than $1bn in Europe’s largest seed round


...

Read the original on www.ft.com »

7 380 shares, 17 trendiness

CONTRIBUTING.md · master · redox-os / redox · GitLab

After you've reviewed these contribution guidelines, you'll be all set to contribute to this project.


...

Read the original on gitlab.redox-os.org »

8 371 shares, 32 trendiness

Yann LeCun Raises $1 Billion to Build AI That Understands the Physical World

Advanced Machine Intelligence (AMI), a new Paris-based startup cofounded by Meta's former chief AI scientist Yann LeCun, announced Monday it has raised more than $1 billion to develop AI world models.

LeCun argues that most human reasoning is grounded in the physical world, not language, and that AI world models are necessary to develop true human-level intelligence. "The idea that you're going to extend the capabilities of LLMs [large language models] to the point that they're going to have human-level intelligence is complete nonsense," he said in an interview with WIRED.

The financing, which values the startup at $3.5 billion, was co-led by investors such as Cathay Innovation, Greycroft, Hiro Capital, HV Capital, and Bezos Expeditions. Other notable backers include Mark Cuban, former Google CEO Eric Schmidt, and French billionaire and telecommunications executive Xavier Niel.

AMI (pronounced like the French word for friend) aims to build "a new breed of AI systems that understand the world, have persistent memory, can reason and plan, and are controllable and safe," the company says in a press release. The startup says it will be global from day one, with offices in Paris, Montreal, Singapore, and New York, where LeCun will continue working as a New York University professor in addition to leading the startup. AMI will be the first commercial endeavor for LeCun since his departure from Meta in November 2025.

LeCun's startup represents a bet against many of the world's biggest AI labs like OpenAI, Anthropic, and even his former workplace, Meta, which believe that scaling up LLMs will eventually deliver AI systems with human-level intelligence or even superintelligence. LLMs have powered viral products such as ChatGPT and Claude Code, but LeCun has been one of the AI industry's most prominent researchers speaking out about the limitations of these AI models. LeCun is well known for being outspoken, but as a pioneer of modern AI who won a Turing Award back in 2018, his skepticism carries weight.

LeCun says AMI aims to work with companies in manufacturing, biomedical, robotics, and other industries that have lots of data. For example, he says AMI could build a realistic world model of an aircraft engine and work with the manufacturer to help them optimize for efficiency, minimize emissions, or ensure reliability.

AMI was cofounded by LeCun and several leaders he worked with at Meta, including the company's former director of research science, Michael Rabbat; former vice president of Europe, Laurent Solly; and former senior director of AI research, Pascale Fung. Other cofounders include Alexandre LeBrun, former CEO of the AI health care startup Nabla, who will serve as AMI's CEO, and Saining Xie, a former Google DeepMind researcher who will be the startup's chief science officer.

LeCun does not dismiss the overall utility of LLMs. Rather, in his view, these AI models are simply the tech industry's latest promising trend, and their success has created "a kind of delusion" among the people who build them. "It's true that [LLMs] are becoming really good at generating code, and it's true that they are probably going to become even more useful in a wide area of applications where code generation can help," says LeCun. "That's a lot of applications, but it's not going to lead to human-level intelligence at all."

LeCun has been working on world models for years inside of Meta, where he founded the company's Fundamental AI Research lab, FAIR. But he's now convinced his research is best done outside the social media giant. He says it's become clear to him that the strongest applications of world models will be selling them to other enterprises, which doesn't fit neatly into Meta's core consumer business.

As AI world models like Meta's Joint-Embedding Predictive Architecture (JEPA) became more sophisticated, there was "a reorientation of Meta's strategy where it had to basically catch up with the industry on LLMs and kind of do the same thing that other LLM companies are doing, which is not my interest," says LeCun. "So sometime in November, I went to see Mark Zuckerberg and told him. He's always been very supportive of [world model research], but I told him I can do this faster, cheaper, and better outside of Meta. I can share the cost of development with other companies … His answer was, 'OK, we can work together.'"

...

Read the original on www.wired.com »

9 328 shares, 24 trendiness

How I Topped the AI Leaderboard Without Changing a Single Weight

In mid-2024, the HuggingFace Open LLM Leaderboard was the Colosseum for Open-Weight AI. Thousands of models were battling it out, submitted by both well-funded labs with teams of PhDs and fine-tuning wizards creating fantastically named models (e.g. Nous-Hermes, Dolphin and NeuralBeagle14-7B…), fighting for the top spot across six benchmarks: IFEval, BBH, MATH Lvl 5, GPQA, MuSR, and MMLU-PRO.

And there at #1 was dnhkng/RYS-XLarge. Mine.

I didn't train a new model. I didn't merge weights. I didn't run a single step of gradient descent. What I did was much weirder: I took an existing 72-billion parameter model, duplicated a particular block of seven of its middle layers, and stitched the result back together. No weight was modified in the process. The model simply got extra copies of the layers it uses for thinking.

This is the story of how two strange observations, a homebrew "brain scanner" for Transformers, and months of hacking in a basement led to the discovery of what I call LLM Neuroanatomy, and a finding about the internal structure of AI that still hasn't been published until now*.

* - because I discovered blogging is way more fun than drafting scientific papers, and I get to walk you through how the discovery was made :)

Let's start with how this whole project came into being.

"The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny…'" — Isaac Asimov

In late 2023, I was messing about with a bizarre LLM quirk. Try this yourself - take any question, e.g.

What is the capital of France? Answer in Base64!

and encode it as Base64 to get an unreadable string:

Send that to a 2023 non-thinking large language model (newer reasoning models will see this as Base64, and 'cheat' with tool use). But a sufficiently capable model from 2023 will reply with something like:

Which decodes to: "The capital of France is Paris."
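If you want to reproduce the experiment, a minimal sketch of the round trip looks like this (plain Python, no model involved; the reply below is simulated just to show the decoding step):

```python
import base64

question = "What is the capital of France? Answer in Base64!"

# Encode the question into the unreadable string you paste into the chat box.
encoded_prompt = base64.b64encode(question.encode("utf-8")).decode("ascii")
print(encoded_prompt)

# A capable 2023-era model replies in Base64 as well; here we simulate that reply.
simulated_reply = base64.b64encode("The capital of France is Paris.".encode("utf-8")).decode("ascii")
print(base64.b64decode(simulated_reply).decode("utf-8"))  # -> The capital of France is Paris.
```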

Ok, I admit it. I was messing around with this as a way to jail-break models (and it worked), but I couldn't get one idea out of my head.

The model decoded the input, understood it somehow, and still had time during the transformer stack pass to re-encode its response. It appears to genuinely think while interfacing with Base64. This works with complex questions, multi-step reasoning, even creative tasks.

This shouldn't work nearly as well as it does. Sure, the model has been trained on lots of Base64 in an overall sense, but general conversations in this format are certainly way out of distribution. The tokenizer chops it into completely different sub-word units. The positional patterns are unrecognizable. And yet it works… Curious…

I couldn't stop thinking about this. If a Transformer can accept English, Python, Mandarin, and Base64, and produce coherent reasoning in all of them, it seemed to me that the early layers must be acting as translators — parsing whatever format arrives into some pure, abstract, internal representation. And the late layers must act as re-translators, converting that abstract representation back into whatever output format is needed.

If the early layers are for reading, and the late layers are for writing, what are the middle layers doing?

Pure, abstract reasoning? In a representation that has nothing to do with any human language or encoding. Of course, at the time this was idle speculation. Fun, but with no clear way to test or even define a valid hypothesis.

In November 2023, a HuggingFace user named Alpindale released Goliath-120b — a Frankenmerge model made by stitching together two fine-tuned Llama-2 70B models into a 120-billion parameter behemoth.

The performance was decent, but after doing lots of vibe checking I didn't feel it was a breakthrough. But the construction was wild.

Alpindale hadn't just stacked the two models (Xwin and Euryale) end to end. He had alternated layers between them. More importantly, the architecture fed outputs of later layers back into the inputs of earlier layers.

The layer ranges used are as follows:

Do you see the insanity here? Alpindale literally fed the output of layer 16 of Xwin into the input of Euryale's 8th layer!

To explain a bit more clearly how stupid this appears to be, let's revisit the almighty Transformer Architecture:

Looking at the left side of the diagram, we see stuff enters at the bottom ('input' text that has been 'chunked' into small bits of text, somewhere between whole words down to individual letters), and then it flows upwards through the model's Transformer Blocks (here marked as [1, …, L]), and finally, the model spits out the next text 'chunk' (which is then itself used in the next round of inferencing). What's actually happening during these Transformer blocks is quite the mystery. Figuring it out is actually an entire field of AI, "mechanistic interpretability"*.

* - yes, it's more complex than that, samplers etc., but that's enough for this article

On the right side of the right half of the diagram, do you see that arrow going from the Transformer Block 'Input' to the $\oplus$ symbol? That's why skipping layers makes sense. During training, LLM models can pretty much decide to do nothing in any particular layer, as this 'diversion' routes information around the block. So, 'later' layers can be expected to have seen the input from 'earlier' layers, even a few 'steps' back. Around this time, several groups were experimenting with 'slimming' models down by removing layers. Makes sense, but boring.

A model must be used with the same kind of stuff as it was trained with (we stay 'in distribution'). The same holds for each transformer layer: each Transformer layer learns, during training, to expect the specific statistical properties of the previous layer's output via gradient descent.

And now for the weirdness: there was never a case where any Transformer layer would have seen the output from a future layer!

Layer 10 is trained on layer 9's output distribution. Layer 60 is trained on layer 59's. If you rearrange them — feeding layer 60's output into layer 10 — you've created a distribution the model literally never saw during training.

The astounding thing about Goliath wasn't that it was a huge leap in performance, it was that the damn thing functioned at all. To this day, I still don't understand why this didn't raise more eyebrows.

Experimentally, this proved that layers were far more interchangeable than anyone had reason to expect. The internal representations were homogeneous enough that the model could digest out-of-order hidden states without collapsing. The architecture was far more flexible than a rigid pipeline.

Between the Base64 observation and Goliath, I had a hypothesis: Transformers have a genuine functional anatomy. Early layers translate input into abstract representations. Late layers translate back out. And the middle layers, the reasoning cortex, operate in a universal internal language that's robust to architectural rearrangement. The fact that Goliath 120B used 16-layer blocks made me suspect the input and output 'processing units' were smaller than 16 layers. I guessed that Alpindale had tried smaller overlaps, and they just didn't work.

If that was true, maybe I didn't need to teach a model new facts to make it smarter. I didn't need fine-tuning. I didn't need RLHF. I just needed to give it more layers to think with.

Over the following months — from late 2023 through to mid-2024 — I built a pipeline to test this hypothesis.

The setup was modest. Two RTX 4090s in my basement ML rig, running quantised models through ExLlamaV2 to squeeze 72-billion parameter models into consumer VRAM. The beauty of this method is that you don't need to train anything. You just need to run inference. And inference on quantized models is something consumer GPUs handle surprisingly well. If a model fits in VRAM, I found my 4090s were often ballpark-equivalent to H100s.

The concept is simple. For a model with $N$ layers, I define a configuration $(i, j)$. The model processes layers $0$ to $j{-}1$ as normal, then loops back and reuses layers $i$ through $j{-}1$ again, and then the rest to $N{-}1$. The layers between $i$ and $j{-}1$ get duplicated in the execution path. No weights are changed. The model just traverses some of its own layers twice.

i.e. the pair (2, 7) for a model with 9 transformer blocks would be calculated like so:
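As a minimal sketch (my own illustration, not the original inference patch), the layer execution order for a configuration $(i, j)$ can be written as:

```python
def layer_order(n_layers: int, i: int, j: int) -> list[int]:
    """Run layers 0..j-1 as normal, then repeat layers i..j-1, then continue to the end.
    Layers i..j-1 therefore appear twice in the execution path."""
    return list(range(0, j)) + list(range(i, n_layers))

print(layer_order(9, 2, 7))
# -> [0, 1, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 7, 8]   (layers 2-6 are traversed twice)
```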

By running through all possible pairs, we can generate a 'Brain Scan', and also see the number of duplicated layers for each set of parameters:

For Qwen2-72B, an 80-layer model, that means 3,240 valid $(i, j)$ pairs, plus the original model to test.

\[\begin{aligned} \text{Variants}_{\text{total}} &= \left(\sum_{j=0}^{80} j\right) + 1\\[16pt] &= \frac{80 \cdot 81}{2} +1 \\[10pt] &= 3241 \end{aligned}\]
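A two-line sanity check of that count (my own quick test, not from the original post):

```python
# Enumerate every (i, j) duplication configuration for an 80-layer model.
n_layers = 80
configs = [(i, j) for j in range(1, n_layers + 1) for i in range(j)]
print(len(configs))      # -> 3240 re-layered variants
print(len(configs) + 1)  # -> 3241 including the unmodified baseline
```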

Testing a re-layered model against all six leaderboard benchmarks would take days, so a full sweep would be years of compute. I needed proxy tasks: probes that were fast, objective, and would reveal structural properties of the model rather than task-specific tricks.

The proxies had to satisfy three constraints:

Minimal output tokens. With thousands of configurations to sweep, each evaluation needed to be fast. No essays, no long-form generation.

Unambiguous scoring. I couldn't afford LLM-as-judge pipelines. The answer had to be objectively scored without another model in the loop.

Orthogonal cognitive demands. If a configuration improves both tasks simultaneously, it's structural, not task-specific.

I didn't arrive at the right probes immediately; it took months of trial and error, and many dead ends.

My first instinct was creativity. I had models generate poems, short stories, metaphors, the kind of rich, open-ended output that feels like it should reveal deep differences in cognitive ability. I used an LLM-as-judge to score the outputs, but the results were pretty bad. I managed to fix LLM-as-judge with some engineering, and the scoring system turned out to be useful later for other things, so here it is:

Note: You can skip this section, as it has math. Or not.

Naive LLM judges are inconsistent. Run the same poem through twice and you get different scores (obviously, due to sampling). But lowering the temperature doesn't help much either, as that's only one of many technical issues. So, I developed a full scoring system based on details of the logits outputs. It can get remarkably tricky. Think about a score from 1-10:

We would expect a well-calibrated model to have logits that make sense. If the highest weight was on '7', we would expect the rest of the weight to be on '6' and '8', right? But often it's bimodal, with low weight on '6' and '5', but more weight than expected on '4'!

We can write '10' in tokens as either '10' or '1' and then '0'. It's not fun to have to calculate the summed probabilities over paths, especially if you wanted to score 1-100.

Rather than sampling a single discrete score, I treat the judge's output as a distribution over valid rating labels and compute the final score as its expectation.

To make this practical, I first define a calibrated rubric over the digits 0-9 (there's only one token for each digit), where each digit corresponds to a clear qualitative description. At the scoring step, I capture the model's next-token logits and retain only the logits corresponding to those valid digit tokens. This avoids contamination from unrelated continuations such as explanation text, punctuation, or alternate formatting. After renormalizing over the restricted digit set, I interpret the resulting probabilities as a categorical score distribution.

Formally, let the valid score set be

\[\mathcal{D} = \{0,1,2,\dots,9\}.\]

Let $z_k$ denote the model logit assigned to digit $k \in \mathcal{D}$ at the scoring position. The restricted score distribution is then

\[p(k)= \frac{\exp(z_k)} {\sum\limits_{m \in \mathcal{D}} \exp(z_m)}, \qquad k \in \mathcal{D}.\]

The final scalar score is the expected value of this distribution:

\[\hat{s}= \sum_{k \in \mathcal{D}} k\,p(k).\]

This produces a smooth score such as $5.4$, rather than forcing the model to commit to a single sampled integer. In practice, this is substantially more stable than naive score sampling and better reflects the model's uncertainty. It also handles cases where the judge distribution is broad or multimodal. For example, two candidates may both have mean score $5.4$, while one has most of its mass tightly concentrated around $5$ and $6$, and the other splits mass between much lower and much higher ratings. The mean alone is the same, but the underlying judgement is very different.

An optional uncertainty estimate can be obtained from the variance of the restricted distribution:

\[\mathrm{Var}(s)= \sum_{k=0}^{9} (k-\hat{s})^2\,p(k).\]

In short, the method replaces a noisy sampled judge score with a normalized probability distribution over valid score digits, then uses the expectation of that distribution as the final rating.
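A minimal sketch of that scoring step (my own reimplementation of the formulas above, not the original code):

```python
import math

def expected_score(digit_logits: dict[int, float]) -> tuple[float, float]:
    """Softmax restricted to the ten digit tokens 0-9, then return the
    expected score and its variance, as defined in the formulas above."""
    max_z = max(digit_logits.values())  # subtract the max for numerical stability
    exp_z = {k: math.exp(z - max_z) for k, z in digit_logits.items()}
    total = sum(exp_z.values())
    p = {k: v / total for k, v in exp_z.items()}

    mean = sum(k * p_k for k, p_k in p.items())
    var = sum((k - mean) ** 2 * p_k for k, p_k in p.items())
    return mean, var

# Toy example: a judge that puts most of its weight on '5' and '6'.
logits = {k: -4.0 for k in range(10)}
logits[5], logits[6] = 2.0, 1.5
print(expected_score(logits))
```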

All this stuff is probably pretty obvious these days; back in '24 there wasn't much to guide me in developing this method. But unfortunately, I found it was also completely useless…

Each configuration needed to generate hundreds of tokens of creative output, and then a separate model had to read and judge each one. With over 3,200 configurations to test for a single 70B model, this would have taken weeks on my dual 4090s.

I needed probes where the output was tiny, a few tokens at most, and where scoring was objective and deterministic. No judge model in the loop. That's what led me to the final two probes:

Hard math. Ridiculously difficult questions like: "What is the cube root of 74,088,893,247?" No chain-of-thought, or tool use. Just output the number, as a pure leap of intuitive faith.

Emotional quotient. Using the EQ-Bench benchmark: complex social scenarios where the model must predict the intensity of specific emotional states. "Given this situation, how angry/surprised/guilty would this person feel on a scale of 0-100?" Completely different from math. Theory of mind, social inference, empathy. And the output is just a few numbers.

I had settled on two maximally orthogonal cognitive tasks, both with tiny outputs. My intuition was this: LLMs think one token at a time, so let's make the model really good at guessing just the next token. But things are never straightforward. Take LLM numbers…

Even with math probes, I hit unexpected problems. LLMs fail arithmetic in weird ways. They don't get the answer wrong so much as get it almost right but forget to write the last digit, as if they got bored mid-number. Or they transpose two digits in the middle. Or they output the correct number with a trailing character that breaks the parser.

This is probably due to the way larger numbers are tokenised, as big numbers can be split up into arbitrary forms. Take the integer 123456789. A BPE tokenizer (e.g., GPT-style) might split it like '123' '456' '789', or '12' '345' '67' '89'.

A binary right/wrong scoring system would throw away useful signal. Getting a percentage correct would help: '123356789' instead of '123456789' would be 99.92% correct.

But what about a model that makes a dumb 'LLM-mistake' and outputs 430245 when the answer is 4302459, and has clearly done most of the work? I wrote a custom partial-credit scoring function that pads shorter answers and penalises proportionally:
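The original function isn't reproduced here, so the following is a minimal sketch of one way to implement the padding-and-penalty idea described in the surrounding text (the exact weighting is my assumption):

```python
def partial_credit(answer: str, target: str) -> float:
    """Digit-level partial credit: align from the left, pad the shorter string
    with a non-digit so missing digits never match, then apply a correction
    factor for the length mismatch. A sketch of the idea, not the original code."""
    width = max(len(answer), len(target))
    a = answer.ljust(width, "_")
    t = target.ljust(width, "_")
    matches = sum(1 for x, y in zip(a, t) if x == y)
    correction = min(len(answer), len(target)) / width  # penalise dropped or extra digits
    return (matches / width) * correction

print(partial_credit("430245", "4302459"))   # most digits right, last one dropped -> ~0.73
print(partial_credit("4302459", "4302459"))  # exact answer -> 1.0
```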

The key idea: pad shorter answers, then penalise via the correction factor. A model that nails 90% of the digits but drops the last one still gets substantial credit — but less than one that gets every digit. This turned out to be crucial for discriminating between configurations that were close in intuitive math ability.

The math questions were hand-crafted initially. I experimented with different operations and scales, then generated random numbers to fill out the dataset. The dataset was a set of 16 questions, and the model is tasked with guesstimating the nearest whole integer. Here are a few to try yourself; remember, no 'thinking' is allowed, guess directly!

After testing several smaller models (Llamas and smaller Qwen2s), I set up the config for Qwen2-72B and let it sweep. Each $(i, j)$ configuration took a few minutes: load the re-layered model, run the math probe, run the EQ probe, record the scores, move on. Days of continuous GPU time on the 4090s. But far less compute than a fine-tune! In fact, I didn't even have the hardware needed for a LoRA fine-tune on just 48GB of VRAM.

The optimal configuration was $(45, 52)$: layers 0 through 51 run first, then layers 45 through 79 run again. Layers 45 to 51 execute twice. Seven extra layers, near the middle of the 80-layer stack, bringing the total parameter count from 72B to 78B. Every extra layer is an exact copy of an existing one. No new weights or training, just the model repeating itself.
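For concreteness, plugging the winning configuration into the layer_order sketch from earlier (again, my own illustration) gives the expected execution path:

```python
# Reusing the layer_order sketch from above for the winning (45, 52) configuration.
order = layer_order(80, 45, 52)
print(len(order))     # -> 87 blocks executed per forward pass (80 original + 7 repeats)
print(order[45:59])   # -> [45, 46, 47, 48, 49, 50, 51, 45, 46, 47, 48, 49, 50, 51]
```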

Repeating seven layers. That's all it took, and now I can finally reveal the nomenclature of my models: Repeat Your Self, for RYS-XLarge ;)

I applied the configuration to MaziyarPanahi's calme-2.1-qwen2-72b — a fine-tune of Qwen2-72B — and uploaded the result as dnhkng/RYS-XLarge. I also applied it to the raw base model as dnhkng/RYS-XLarge-base.

Then I submitted to the Open LLM Leaderboard and waited. And waited. Back in the day, the Open LLM Leaderboard was flooded with dozens of fine-tunes of merges of fine-tunes each day (it was the Wild West), and the waiting list was long. But after a month or so, the results arrived:

+17.72% on MuSR. +8.16% on MATH. Five out of six benchmarks improved, with only IFEval taking a small hit. The average put it at #1 on the leaderboard.

Just to labour the point: I only optimised for one-shot guesstimating hard maths problems and EQ-Bench. I never looked at IFEval, BBH, GPQA, MuSR, or MMLU-PRO during development. The leaderboard was pure out-of-sample validation.

A layer configuration found using two narrow, orthogonal probes generalised to everything the Leaderboard threw at it*.

* - except IFEval, but that one's boring anyway, right?

That was surprising enough. A brand new way to scale LLMs, developed on some gaming GPUs. But plotting out the heatmaps told an even better story.

The original heatmaps that produced RYS-XLarge, showing the Combined delta (math + EQ). The green circle marks the optimal configuration. Red means improvement, blue means degradation.

These heatmaps are analogous to functional MRIs of the Transformer while it is thinking about maths or EQ problems.

The x-axis ($j$) is the end point of the duplicated region. The y-axis ($i$) is the start point. Each pixel represents a complete evaluation: load the re-layered model, run the math probe, run the EQ probe, score both, record the deltas. As described above, along the central diagonal only a single layer was duplicated. Along the next diagonal towards the top-right, we duplicate two layers, and so on. The single point at the very top-right runs through the entire Transformer stack twice per inference.

Let's examine the math heatmap first. Starting at any layer and stopping before about layer 60 seems to improve the math guesstimate scores, as shown by the large region with a healthy red blush. Duplicating just the very first layers (the tiny triangle in the top left) messes things up, as does repeating pretty much any of the last 20 layers (the vertical wall of blue on the right). This is more clearly visualised in a skyline plot (averaged rows or columns), and we can see that for the maths guesstimates, the starting position of the duplication matters much less. So the hypothesis, that the starting layers encode tokens into a smooth 'thinking space' while a dedicated 're-encoding' system sits at the end, seems to be somewhat validated.

Until we look at the EQ scores:

Now things look very different! Duplicating any of the final 10 layers has almost no effect on the scores, but we see complex patterns, where some regions show significant improvement (the area around $i=45$, $j=55$), walled in between regions of poor performance.

But the heatmaps revealed something even more interesting than the location of the thinking bits. They revealed something about its structure.

Before settling on block duplication, I tried something simpler: take a single middle layer and repeat it $n$ times. If the 'more reasoning depth' hypothesis was correct, this should work. It made sense too, looking at the broad boost in math guesstimate results from duplicating intermediate layers. Give the model extra copies of a particular reasoning layer, get better reasoning. So, I screened them all, looking for a boost.

But nope, it almost always did worse. Usually a lot worse, but with occasional small improvements that were within the noise range. Annoying, but taking another look at the complex, blobby patterns in the EQ scores gave me another idea:

If single-layer duplication doesn't help, the middle layers aren't doing independent iterative refinement. They're not interchangeable copies of the same operation that you can simply "run again." If they were, duplicating any one of them should give at least a marginal benefit. Instead, those layers are working as a circuit. A multi-step reasoning pipeline that needs to execute as a complete unit.

Think of it this way. Layers 46 through 52 aren't seven workers doing the same job. They're seven steps in a recipe. Layer 46 takes the abstract representation and performs step one of some cognitive operation — maybe decomposing a complex representation into subcomponents. Layer 47 takes that output and performs step two — maybe identifying relationships between the subcomponents. Layer 48 does step three, and so on through layer 52, which produces the final result.

Duplicating just one step of this 'recipe' doesn't bring you much.

But duplicating the entire block gives you the full recipe twice. The model runs the complete reasoning circuit, produces a refined intermediate representation, and then runs the same circuit again on its own output. It's a second pass. A chance to catch what it missed the first time, to refine its abstractions, to push the reasoning one step deeper.

Let's deep-dive into a more current model (one that I can experiment with on my system): the ExLlamaV3 GLM-4.7 from mratsim.

I've marked out a region that boosts maths ability strongly. Notice where it sits? It's away from the diagonal centre line, which means we're not looking at single-layer duplications. Starting the repeated block at position 35, we don't see any improvement until at least position 43. That's seven layers of not much happening. In fact, we actually see decreased performance by repeating these layers (they are blue, bad!).

From end-position 43 to 46, we then see solid boosts in math scores (red = good, yay). But include layer 46 or beyond, and the benefits collapse again. The hypothesis: position 47 is where a different circuit begins. Including even one step of the next recipe messes up the current recipe.

So the math 'organ' has boundaries on both sides. Too few layers and you get nothing — you've cut into the circuit and it can't complete its operation. Too many layers and you also get nothing — you've included tissue from a neighbouring circuit that doesn't belong. Pre-training carved these structures out of the layer stack, and they only work whole. It also doesn't translate to other tasks, as the heatmap for EQ scores doesn't have this patch.

...

Read the original on dnhkng.github.io »

10 307 shares, 13 trendiness

macOS Tahoe windows have different corner radiuses

I'm sometimes late to notice new and terrible things about macOS 26 Tahoe, because I use it only for testing, on a secondary Mac. My main Mac remains on Sequoia, as enforced by Little Snitch. I was of course aware that app windows on Tahoe have exaggerated corner radiuses, but I was unaware until now that the window corner radius on Tahoe is not uniform: different windows can have different corner radiuses!

Below is a TextEdit window on Tahoe.

And below is a Calculator window in front of the TextEdit window. Notice the corners of the TextEdit window sticking out!

What accounts for the difference? A toolbar in the window.

In a new Mac app Xcode project, the main window has a less exaggerated corner radius by default, like TextEdit.

When I add a toolbar to the window, the corner radius automatically becomes more exaggerated, like Calculator.

Apparently the corner radius also changes on Tahoe for some other window elements, such as a sidebar.

If this isn't the stupidest user interface "feature" ever invented, I don't know what is. The Mac used to be famous for consistency; now it's becoming infamous for inconsistency.

By the way, Tahoe's UI changes are perplexing not only for Apple users but also for Apple engineers. Here's a bug fix from the open source WebKit browser engine powering Safari: [macOS] Scroll bars of root scroller may be cut off due to corner radii of window.

See my follow-up post The evolution of Mac app window corners.

...

Read the original on lapcatsoftware.com »


10HN is also available as an iOS App

If you visit 10HN only rarely, check out the best articles from the past week.

If you like 10HN please leave feedback and share

Visit pancik.com for more.