10 interesting stories served every morning and every evening.




1 1,734 shares, 91 trendiness

Tony Hoare (1934-2026)

Computational Complexity and other fun stuff in math and computer science from Lance Fortnow and Bill Gasarch

...

Read the original on blog.computationalcomplexity.org »

2 600 shares, 30 trendiness

Online age-verification tools spread across U.S. for child safety, but adults are being surveilled

New U.S. laws designed to protect minors are pulling millions of adult Americans into mandatory age-verification gates to access online content, leading to backlash from users and criticism from privacy advocates that a free and open internet is at stake. Roughly half of U.S. states have enacted or are advancing laws requiring platforms — including adult content sites, online gaming services, and social media apps — to block underage users, forcing companies to screen everyone who approaches these digital gates.

“There's a big spectrum,” said Joe Kaufmann, global head of privacy at Jumio, one of the largest digital identity-verification and authentication platforms. He explained that the patchwork of state laws varies in technical demands and compliance expectations. “The regulations are moving in many different directions at once,” he said.

Social me­dia com­pany Discord an­nounced plans in February to roll out manda­tory age ver­i­fi­ca­tion glob­ally, which the com­pany said would rely on ver­i­fi­ca­tion meth­ods de­signed so fa­cial analy­sis oc­curs on a user’s de­vice and sub­mit­ted data would be deleted im­me­di­ately. The pro­posal quickly drew back­lash from users con­cerned about hav­ing to sub­mit self­ies or gov­ern­ment IDs to ac­cess cer­tain fea­tures, which led Discord to de­lay the launch un­til the sec­ond half of this year.

“Let me be upfront: we knew this rollout was going to be controversial. Any time you introduce something that touches identity and verification, people are going to have strong feelings,” Discord chief technology officer and co-founder Stanislav Vishnevskiy wrote in a Feb. 24 blog post.

Websites offering adult con­tent, gam­bling, or fi­nan­cial ser­vices of­ten rely on full iden­tity ver­i­fi­ca­tion that re­quires scan­ning a gov­ern­ment ID and match­ing it to a live im­age. But most of the ver­i­fi­ca­tion sys­tems pow­er­ing these check­points — of­ten run by spe­cial­ized iden­tity-ver­i­fi­ca­tion ven­dors on be­half of web­sites — rely on ar­ti­fi­cial in­tel­li­gence such as fa­cial recog­ni­tion and age-es­ti­ma­tion mod­els that an­a­lyze self­ies or video to de­ter­mine in sec­onds whether some­one is old enough to ac­cess con­tent. Social me­dia and lower-risk ser­vices may use lighter es­ti­ma­tion tools de­signed to con­firm age with­out per­ma­nently stor­ing de­tailed iden­tity records.

Vendors say a chal­lenge is bal­anc­ing safety with how much fric­tion users will tol­er­ate. “We’re in the busi­ness of en­sur­ing that you are ab­solutely keep­ing mi­nors safe and out and able to let adults in with as lit­tle fric­tion as pos­si­ble,” said Rivka Gewirtz Little, chief growth of­fi­cer at iden­tity-ver­i­fi­ca­tion plat­form So­cure. Excessive data col­lec­tion, she added, cre­ates fric­tion that users re­sist.

Still, many users perceive mandatory identity checks as invasive. “Having another way to be forced to provide that information is intrusive to people,” said Heidi Howard Tandy, a partner at Berger Singerman who specializes in intellectual property and internet law. Some users may attempt workarounds — including prepaid cards or alternative credentials — or turn to unauthorized distribution channels. “It's going to cause a piracy situation,” she added.

In many im­ple­men­ta­tions, ver­i­fi­ca­tion ven­dors — not the web­sites them­selves — process and re­tain the iden­tity in­for­ma­tion, re­turn­ing only a pass-fail sig­nal to the plat­form.

Gewirtz Little said So­cure does not sell ver­i­fi­ca­tion data and that in light­weight age-es­ti­ma­tion sce­nar­ios, where plat­forms use quick fa­cial analy­sis or other sig­nals rather than gov­ern­ment doc­u­men­ta­tion, the com­pany may store lit­tle or no in­for­ma­tion. But in fuller iden­tity-ver­i­fi­ca­tion con­texts, such as gam­ing and fraud pre­ven­tion that re­quire ID scans, cer­tain adult ver­i­fi­ca­tion records may be re­tained to doc­u­ment com­pli­ance. She said So­cure can keep some adult ver­i­fi­ca­tion data for up to three years while fol­low­ing ap­plic­a­ble pri­vacy and purg­ing rules.

Civil liberties advocates warn that concentrating large volumes of identity data among a small number of verification vendors can create attractive targets for hackers and government demands. Earlier this year, Discord disclosed a data breach that exposed ID images belonging to approximately 70,000 users through a compromised third-party service, highlighting the security risks associated with storing sensitive identity information.

In addition, they warn that expanding age-verification systems represent not only a usability challenge but a structural shift in how identity becomes tied to online behavior. Age verification risks tying users' “most sensitive and immutable data” — names, faces, birthdays, home addresses — to their online activity, according to Molly Buckley, a legislative analyst at the Electronic Frontier Foundation. “Age verification strikes at the foundation of the free and open internet,” she said.

Even when ven­dors promise to safe­guard per­sonal in­for­ma­tion, users ul­ti­mately rely on con­trac­tual terms they rarely read or fully un­der­stand. “There’s lan­guage in their terms-of-use poli­cies that says if the in­for­ma­tion is re­quested by law en­force­ment, they’ll hand it over. They can’t confirm that they will al­ways for­ever be the only en­tity who has all of this in­for­ma­tion. Every­one needs to un­der­stand that their base­line in­for­ma­tion is not some­thing un­der their con­trol,” Tandy said.

As more platforms route age checks through third-party vendors, that concentration of identity data is also creating new legal exposure for the companies that rely on them. “A company is going to have some of that information passing through their own servers,” Tandy said. “And you can't offload that kind of liability to a third party.”

Companies can dis­trib­ute risk through con­tracts and in­sur­ance, she said, but they re­main re­spon­si­ble for how iden­tity sys­tems in­ter­act with their in­fra­struc­ture. “What you can do is have re­ally good in­sur­ance and re­quire re­ally good in­sur­ance from the en­ti­ties that you’re con­tract­ing with,” she said.

Tandy also cautioned that retention promises can be more complex than they appear. “If they say they're holding it for three years, that's the minimum amount of time they're holding it for,” she said. “I wouldn't feel comfortable trusting a company that says, 'We delete everything one day after three years.' That is not going to happen,” she added.

Federal and state reg­u­la­tors ar­gue that age-ver­i­fi­ca­tion laws are pri­mar­ily a re­sponse to doc­u­mented harms to mi­nors and in­sist the rules must op­er­ate un­der strict pri­vacy and se­cu­rity safe­guards.

An FTC spokesper­son told CNBC that com­pa­nies must limit how col­lected in­for­ma­tion is used. While age-ver­i­fi­ca­tion tech­nolo­gies can help par­ents pro­tect chil­dren on­line, the agency said firms are still bound by ex­ist­ing con­sumer pro­tec­tion rules gov­ern­ing data min­i­miza­tion, re­ten­tion, and se­cu­rity. The agency pointed to ex­ist­ing rules re­quir­ing firms to re­tain per­sonal in­for­ma­tion only as long as rea­son­ably nec­es­sary and to safe­guard its con­fi­den­tial­ity and in­tegrity.

...

Read the original on www.cnbc.com »

3 444 shares, 19 trendiness

howisFelix.today? · Felix Krause

Background: Why I put my whole life into a single database

Back in 2019, I started col­lect­ing all kinds of met­rics about my life. Every sin­gle day for the last 3 years I tracked over 100 dif­fer­ent data types - rang­ing from fit­ness & nu­tri­tion to so­cial life, com­puter us­age and weather.

Ideas or sug­ges­tions?

I’d love to hear from you!

The goal of this pro­ject was to an­swer ques­tions about my life, like

How does liv­ing in dif­fer­ent cities af­fect other fac­tors like fit­ness, pro­duc­tiv­ity and hap­pi­ness?

How does sleep af­fect my day, my fit­ness level, and hap­pi­ness?

How does the weather, and the dif­fer­ent sea­sons af­fect my life?

Are there any trends over the last few years?

How does com­puter time, work and hours in meet­ings af­fect my per­sonal life?

Since the start of this pro­ject, I col­lected ~380,000 data points, with the biggest data sources be­ing:

Naturally af­ter I started col­lect­ing this data, I wanted to vi­su­al­ize what I was learn­ing, so I cre­ated this page. Initially, the do­main where­is­Fe­lix.to­day (now re­named to how­is­Fe­lix.to­day) started as a joke to re­spond to friends ask­ing when I’d be back in NYC or San Francisco. Rather than send them my sched­ule, I’d point them to this do­main. However, now it’s more than my lo­ca­tion: it’s all of me.

Use a sin­gle data­base, owned and hosted by me, with all the data I’ve col­lected over the years

Be able to eas­ily add and re­move ques­tions on the fly, as I learn what’s ben­e­fi­cial to track

Full con­trol of how the data is vi­su­al­ized

Works well for fre­quent fly­ers with mixed time zones

I se­lected 48 graphs to show pub­licly on this page. For pri­vacy rea­sons, and to pre­vent any ac­ci­den­tal data leaks, the graphs be­low are snap­shots taken on a given day.

Visualization of the num­ber of data en­tries in FxLifeSheet over the last 10 years, and where the data came from.

Initially (2014) the only data used was RescueTime and Foursquare Swarm lo­ca­tion data

Once I started the FxLifeSheet project in April 2019, I manually tracked data ranging from mood, sleep, and social life to fitness

I was able to ret­ro­spec­tively fetch the his­toric weather data based on my lo­ca­tion on a given day

I also im­ple­mented other im­port sources, like fetch­ing my his­toric weight and the num­ber of steps from Apple Health

Days I tracked my Mood to be Happy & Excited

On days where I tracked my mood to be “happy” & “excited”, the following other factors of my life were affected

50% more likely to have pushed my com­fort zone

44% more likely to have med­i­tated that day

33% more ex­cited about what’s ahead in the fu­ture

31% more likely to drink al­co­hol that day (parties, good friends and such)

28% more time spent read­ing or lis­ten­ing to au­dio books

26% more likely to have worked on in­ter­est­ing tech­ni­cal chal­lenges

20% more likely to have learned some­thing new that day

45% less time spent in video & au­dio calls that day

All flights taken within the last 7 years, tracked us­ing Foursquare Swarm, an­a­lyzed by JetLovers.

The stats clearly show the im­pact of COVID start­ing 2020

Sunday has been my “commute” day, flying between San Francisco, New York City and Vienna

All flights taken within the last 7 years, tracked us­ing Foursquare Swarm, an­a­lyzed by JetLovers.

Frankfurt - Vienna was the flight con­nect­ing me with most US air­ports

Germany is high up on the list due to layovers, even though I didn't actually spend much time there

Inspired by Your Life in Weeks by WaitButWhy, I use Google Sheets to vi­su­al­ize every week of my life, with lit­tle notes on what city/​coun­try I was in, and other life events that have hap­pened.

The first 14 years I did­n’t re­ally get much done

I can highly rec­om­mend tak­ing a few weeks (or even months) off be­tween jobs (if you have the pos­si­bil­ity)

Shades of blue in­di­cate my full-time em­ploy­ments

You can cre­ate your own ver­sion us­ing my tem­plate

Average daily steps mea­sured through the iPhone’s Apple Health app. I de­cided against us­ing SmartWatch data for steps, as SmartWatches have changed over the last 8 years.

I walked a to­tal of steps over last 8 years

I walk more than twice as much when I’m in New York, com­pared to any other city

In NYC I had the gen­eral rule of thumb to walk in­stead of tak­ing pub­lic tran­sit when­ever it’s less than 40 min­utes. I used that time to call friends & fam­ily, or lis­ten to au­dio books

Although Vienna is very walkable, the excellent public transit system, with subway trains coming every 3-5 minutes, has caused me to walk less

San Francisco was al­ways scary to walk

This graph clearly shows the cor­re­la­tion be­tween my body weight and my sleep­ing/​rest­ing heart rate. The rest­ing heart rate is mea­sured by the Withings ScanWatch while sleep­ing, and in­di­cates how hard your heart has to work while not be­ing ac­tive. Generally the lower the rest­ing heart rate, the bet­ter.

I started my lean bulk (controlled weight gain com­bined with 5 work­outs a week) in August 2020

My resting heart rate went from 58bpm to 67bpm from August 2020 to March 2021, with a weight gain of +19lbs, as part of a controlled lean-bulk combined with a 5-day/week workout routine

The spike in rest­ing heart rate in July & August 2021 was due to bars and night­clubs open­ing up again in Austria

After a night of drink­ing, my rest­ing/​sleep­ing heart rate was about 50% higher than af­ter a night with­out any al­co­hol

The spike in rest­ing heart rate in Oct/Nov/Dec 2021 was due to hav­ing bron­chi­tis and a cold/​flu, not get­ting cor­rect treat­ment early enough

How healthy have I been over the Years?

Every day I an­swered the ques­tion on how healthy I felt. In the graph, the yel­low color in­di­cates that I felt a lit­tle un­der the weather, not sick per se. Red means I was sick and had to stay home. Green means I felt en­er­gized and healthy.

During the COVID lock­downs I tended to stay health­ier. This may be due to not go­ing out, no heavy drink­ing, less close con­tact with oth­ers, etc. which re­sulted in me hav­ing bet­ter sleep.

Usually dur­ing ex­ces­sive trav­el­ing I get sick (cold/flu)

Q4 2021 I had bron­chi­tis, how­ever, I did­n’t know about it at the time and did­n’t get proper treat­ment

Overall I’m quite prone to get­ting sick (cold/flu)

Days with more than 4 Alcoholic Drinks

On days where I had more than 4 al­co­holic bev­er­ages (meaning I was par­ty­ing), the fol­low­ing other fac­tors were af­fected

21x more likely to dance

80% more likely to take a nap the day of, or the day af­ter

40% warmer tem­per­a­tures, and 40% less pre­cip­i­ta­tion. There weren’t many op­por­tu­ni­ties for par­ties in Winter due to lock­downs in the last 2 years. Also, peo­ple are more mo­ti­vated to go out when it’s nice out­side.

My FxLifeSheet bot asks me 4 times a day how I’m feel­ing at the mo­ment.

This graph groups the en­tries by month, and shows the % of en­tries for each value (0 - 5) with 5 be­ing very ex­cited, and 0 be­ing wor­ried.

I de­signed the ranges so that 0 or 5 are not en­tered as much. 0 is ren­dered as dark green at the top, whereas 5 is ren­dered as light green at the bot­tom.

For pri­vacy rea­sons I won’t get into some of the de­tails on why cer­tain months were worse than oth­ers.

Every Swarm check-in over the last 7 years vi­su­al­ized on a map, in­clud­ing the ac­tual trip (flight, drive, etc.)

Every Swarm check-in over the last 7 years vi­su­al­ized, zoomed in

Each time I did a check-in at a place (e.g. Coffee, Restaurant, Airport, Gym) on Foursquare Swarm at a given city, this is tracked as a sin­gle en­try.

Each check-in at a given city is counted as a sin­gle en­try, grouped by years

2018 and 2019 I lived in New York City

The longer it’s been since I moved away from Austria, the more time I ac­tu­ally spent back home in Austria for vis­its and va­ca­tions

2020 clearly shows the im­pact of COVID

Each check-in at a given cat­e­gory is tracked, and summed up over the last years

In 2020 and 2021, check-ins at Offices went down to zero due to COVID, and a dis­trib­uted work setup

Airports be­ing the #4 most vis­ited cat­e­gory was a sur­prise, but is ac­cu­rate. A to­tal of 403 air­port check-ins, whereas a flight with a lay­over would count as 3 air­port check-ins

Earlier in my life, I didn't always check into 'commute' places like public transit and supermarkets

Number of Foursquare Swarm check-ins on each quar­ter over the last 10 years. I did­n’t use Foursquare Swarm as se­ri­ously be­fore 2015. Once I moved to San Francisco in Q3 2015 I started my habit of check­ing into every point of in­ter­est (POI) I visit.

Q3 2015 I moved to San Francisco, however I couldn't use Swarm yet, since my move was a secret until the official announcement at the Twitter Flight conference

Q2 2020 clearly shows the im­pact of COVID with Q3 al­ready be­ing open in Austria

Q3 2021 the vac­cine was al­ready widely avail­able and I was able to travel/​visit more again

My time in New York was the most ac­tive when it comes to check-ins. When I’m in NYC, I tend to eat/​drink out more, and grab to-go food, which I do way less in Vienna

Every Swarm check-in vi­su­al­ized on a map. Only ar­eas where I’ve had mul­ti­ple check-ins are ren­dered.

Number of days per year that I’ve spent in full lock­down, mean­ing restau­rants, bars and non-es­sen­tial stores were closed.

I es­caped parts of the Austrian lock­down by spend­ing time in the US when I was al­ready vac­ci­nated

Surprisingly, in 2021 I spent more days in a full lockdown than in 2020, even with vaccines available

How was my life affected by the recent COVID lockdowns? I classify as a lockdown day every day where places like restaurants, gyms and non-essential stores were closed.

200% more time spent in au­dio & video calls with friends (non-work re­lated)

60% more likely to fol­low my meal plan (macros & calo­ries)

50% colder tem­per­a­tures: Lockdowns tended to hap­pen in Autumn and Winter

100% less likely to dance

Alcoholic drinks per day. Days with no data are ren­dered as white

Friday and Saturday nights are clearly vis­i­ble on those graphs

2021 and sum­mer/​win­ter of 2019 also show the Wednesday night party in Vienna

Q2 and Q4 2020 clearly show the COVID lock­downs, as well as Q2 2021

Summer of 2021 all bars and dance clubs were open in Vienna

...

Read the original on howisfelix.today »

4 442 shares, 27 trendiness

Yann LeCun Raises $1 Billion to Build AI That Understands the Physical World

Advanced Machine Intelligence (AMI), a new Paris-based startup co­founded by Meta’s for­mer chief AI sci­en­tist Yann LeCun, an­nounced Monday it has raised more than $1 bil­lion to de­velop AI world mod­els.

LeCun argues that most human reasoning is grounded in the physical world, not language, and that AI world models are necessary to develop true human-level intelligence. “The idea that you're going to extend the capabilities of LLMs [large language models] to the point that they're going to have human-level intelligence is complete nonsense,” he said in an interview with WIRED.

The fi­nanc­ing, which val­ues the startup at $3.5 bil­lion, was co-led by in­vestors such as Cathay Innovation, Greycroft, Hiro Capital, HV Capital, and Bezos Expeditions. Other no­table back­ers in­clude Mark Cuban, for­mer Google CEO Eric Schmidt, and French bil­lion­aire and telecom­mu­ni­ca­tions ex­ec­u­tive Xavier Niel.

AMI (pronounced like the French word for friend) aims to build “a new breed of AI systems that understand the world, have persistent memory, can reason and plan, and are controllable and safe,” the company says in a press release. The startup says it will be global from day one, with offices in Paris, Montreal, Singapore, and New York, where LeCun will continue working as a New York University professor in addition to leading the startup. AMI will be the first commercial endeavor for LeCun since his departure from Meta in November 2025.

LeCun's startup represents a bet against many of the world's biggest AI labs like OpenAI, Anthropic, and even his former workplace, Meta, which believe that scaling up LLMs will eventually deliver AI systems with human-level intelligence or even superintelligence. LLMs have powered viral products such as ChatGPT and Claude Code, but LeCun has been one of the AI industry's most prominent researchers speaking out about the limitations of these AI models. LeCun is well known for being outspoken, but as a pioneer of modern AI who won a Turing Award back in 2018, his skepticism carries weight.

LeCun says AMI aims to work with com­pa­nies in man­u­fac­tur­ing, bio­med­ical, ro­bot­ics, and other in­dus­tries that have lots of data. For ex­am­ple, he says AMI could build a re­al­is­tic world model of an air­craft en­gine and work with the man­u­fac­turer to help them op­ti­mize for ef­fi­ciency, min­i­mize emis­sions, or en­sure re­li­a­bil­ity.

AMI was cofounded by LeCun and several leaders he worked with at Meta, including the company's former director of research science, Michael Rabbat; former vice president of Europe, Laurent Solly; and former senior director of AI research, Pascale Fung. Other cofounders include Alexandre LeBrun, former CEO of the AI health care startup Nabla, who will serve as AMI's CEO, and Saining Xie, a former Google DeepMind researcher who will be the startup's chief science officer.

LeCun does not dismiss the overall utility of LLMs. Rather, in his view, these AI models are simply the tech industry's latest promising trend, and their success has created a kind of “delusion” among the people who build them. “It's true that [LLMs] are becoming really good at generating code, and it's true that they are probably going to become even more useful in a wide area of applications where code generation can help,” says LeCun. “That's a lot of applications, but it's not going to lead to human-level intelligence at all.”

LeCun has been work­ing on world mod­els for years in­side of Meta, where he founded the com­pa­ny’s Fundamental AI Research lab, FAIR. But he’s now con­vinced his re­search is best done out­side the so­cial me­dia gi­ant. He says it’s be­come clear to him that the strongest ap­pli­ca­tions of world mod­els will be sell­ing them to other en­ter­prises, which does­n’t fit neatly into Meta’s core con­sumer busi­ness.

As AI world models like Meta's Joint-Embedding Predictive Architecture (JEPA) became more sophisticated, there was “a reorientation of Meta's strategy where it had to basically catch up with the industry on LLMs and kind of do the same thing that other LLM companies are doing, which is not my interest,” says LeCun. “So sometime in November, I went to see Mark Zuckerberg and told him. He's always been very supportive of [world model research], but I told him I can do this faster, cheaper, and better outside of Meta. I can share the cost of development with other companies … His answer was, 'OK, we can work together.'”

...

Read the original on www.wired.com »

5 409 shares, 17 trendiness

Yann LeCun’s AI start-up raises more than $1bn in Europe’s largest seed round


...

Read the original on www.ft.com »

6 384 shares, 14 trendiness

CONTRIBUTING.md · master · redox-os / redox · GitLab

After you've reviewed these contribution guidelines, you'll be all set to contribute to this project.


...

Read the original on gitlab.redox-os.org »

7 381 shares, 21 trendiness

How I Topped the AI Leaderboard Without Changing a Single Weight

In mid-2024, the HuggingFace Open LLM Leaderboard was the Colosseum for Open-Weight AI. Thousands of mod­els were bat­tling it out, sub­mit­ted by both well-funded labs with teams of PhDs and fine-tun­ing wiz­ards cre­at­ing fan­tas­ti­cally named mod­els (e.g. Nous-Hermes, Dolphin and NeuralBeagle14-7B…), fight­ing for the top spot across six bench­marks: IFEval, BBH, MATH Lvl 5, GPQA, MuSR, and MMLU-PRO.

And there at #1 was dnhkng/RYS-XLarge. Mine.

I didn't train a new model. I didn't merge weights. I didn't run a single step of gradient descent. What I did was much weirder: I took an existing 72-billion parameter model, duplicated a particular block of seven of its middle layers, and stitched the result back together. No weight was modified in the process. The model simply got extra copies of the layers it used for thinking.

This is the story of how two strange observations, a homebrew “brain scanner” for Transformers, and months of hacking in a basement led to the discovery of what I call LLM Neuroanatomy, and a finding about the internal structure of AI that still hasn't been published until now *.

* - be­cause I dis­cov­ered blog­ging is way more fun than draft­ing sci­en­tific pa­pers, and I walk you through how the dis­cov­ery was made :)

Let’s start with how this whole pro­ject came into be­ing.

“The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny…'” — Isaac Asimov

In late 2023, I was mess­ing about with a bizarre LLM quirk. Try this your­self - take any ques­tion, e.g.

What is the cap­i­tal of France? Answer in Base64!

and encode it as Base64 to get this unreadable string:

Send that to a 2023 non-thinking large language model (newer reasoning models will see this as Base64, and 'cheat' with tool use). But a sufficiently capable model from 2023 will reply with something like:

Which decodes to: “The capital of France is Paris.”
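If you want to try the round trip yourself, Python's standard base64 module is all you need. A minimal sketch (the model reply below is an illustrative stand-in, not a real model output):

```python
import base64

question = "What is the capital of France? Answer in Base64!"

# Encode the question exactly as you would paste it into the chat window.
encoded_prompt = base64.b64encode(question.encode("utf-8")).decode("ascii")
print(encoded_prompt)

# A capable 2023-era model replies with another Base64 string, which decodes
# back to plain English. Illustrative stand-in for such a reply:
model_reply = "VGhlIGNhcGl0YWwgb2YgRnJhbmNlIGlzIFBhcmlzLg=="
print(base64.b64decode(model_reply).decode("utf-8"))  # The capital of France is Paris.
```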

Ok, I admit it. I was messing around with this as a way to jail-break models (and it worked), but I couldn't get one idea out of my head.

The model was decoding the input, understanding it somehow, and still had time during the transformer stack pass to re-encode its response. It appears to genuinely think while interfacing with Base64. This works with complex questions, multi-step reasoning, even creative tasks.

This should­n’t work nearly as well as it does. Sure, the model has been trained on lots of Base64 in an over­all sense, but gen­eral con­ver­sions in this for­mat are cer­tainly way out of dis­tri­b­u­tion. The to­k­enizer chops it into com­pletely dif­fer­ent sub-word units. The po­si­tional pat­terns are un­rec­og­niz­able. And yet it works… Curious…

I could­n’t stop think­ing about this. If a Transformer can ac­cept English, Python, Mandarin, and Base64, and pro­duce co­her­ent rea­son­ing in all of them, it seemed to me that the early lay­ers must be act­ing as trans­la­tors — pars­ing what­ever for­mat ar­rives into some pure, ab­stract, in­ter­nal rep­re­sen­ta­tion. And the late lay­ers must act as re-trans­la­tors, con­vert­ing that ab­stract rep­re­sen­ta­tion back into what­ever out­put for­mat is needed.

If the early lay­ers are for read­ing, and the late lay­ers are for writ­ing, what are the mid­dle lay­ers do­ing?

Pure, abstract reasoning? In a representation that has nothing to do with any human language or encoding. Of course, at the time this was idle speculation. Fun, but with no clear way to test or even define a valid hypothesis.

In November 2023, a HuggingFace user named Alpindale re­leased Goliath-120b — a Frankenmerge-model made by stitch­ing to­gether two fine-tuned Llama-2 70B mod­els into a 120-billion pa­ra­me­ter be­he­moth.

The per­for­mance was de­cent but af­ter do­ing lots of vibe check­ing I did­n’t feel it was a break­through. But the con­struc­tion was wild.

Alpindale hadn't just stacked the two models (Xwin and Euryale) end to end. He had alternated layers between them. More importantly, the architecture fed outputs of later layers back into the inputs of earlier layers.

The layer ranges used are as fol­lows:

Do you see the insanity here? Alpindale literally fed the output of layer 16 of Xwin to the input of Euryale's 8th layer!

To explain a bit more clearly how stupid this appears to be, let's revisit the almighty Transformer Architecture:

Looking at the left side of the diagram, we see stuff enters at the bottom ('input' text that has been 'chunked' into small bits of text, somewhere between whole words down to individual letters), then it flows upwards through the model's Transformer Blocks (here marked as [1, …, L]), and finally the model spits out the next text 'chunk' (which is then itself used in the next round of inferencing). What's actually happening during these Transformer blocks is quite the mystery. Figuring it out is actually an entire field of AI, “mechanistic interpretability*”.

* - yes, it's more complex than that (samplers etc.), but that's enough for this article

On the right side of the right half of the diagram, do you see that arrow going from the Transformer Block 'Input' to the $\oplus$ symbol? That's why skipping layers makes sense. During training, LLM models can pretty much decide to do nothing in any particular layer, as this 'diversion' routes information around the block. So, 'later' layers can be expected to have seen the input from 'earlier' layers, even a few 'steps' back. Around this time, several groups were experimenting with 'slimming' models down by removing layers. Makes sense, but boring.
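To make the skip-connection point concrete, here is a minimal pre-norm block in PyTorch (a toy sketch, not the actual Qwen2 or Llama implementation): the block's input is added straight onto its own output, so a block that learns to output roughly zero just passes the hidden state through untouched.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy pre-norm transformer block: out = x + attn(norm(x)), then + mlp(norm(...))."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out                 # residual: the input flows around the block
        x = x + self.mlp(self.norm2(x))  # second residual around the MLP
        return x

x = torch.randn(1, 8, 64)                # (batch, tokens, hidden)
print(Block()(x).shape)                  # torch.Size([1, 8, 64])
```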

A model must be used with the same kind of stuff as it was trained with (we stay 'in distribution').

The same holds for each transformer layer: each Transformer layer learns, during training, to expect the specific statistical properties of the previous layer's output via gradient descent.

And now for the weirdness: there was never a case where any Transformer layer would have seen the output from a future layer!

Layer 10 is trained on layer 9’s out­put dis­tri­b­u­tion. Layer 60 is trained on layer 59’s. If you re­arrange them — feed­ing layer 60’s out­put into layer 10 — you’ve cre­ated a dis­tri­b­u­tion the model lit­er­ally never saw dur­ing train­ing.

The astounding thing about Goliath wasn't that it was a huge leap in performance, it was that the damn thing functioned at all. To this day, I still don't understand why this didn't raise more eyebrows.

Experimentally, this proved that lay­ers were far more in­ter­change­able than any­one had rea­son to ex­pect. The in­ter­nal rep­re­sen­ta­tions were ho­moge­nous enough that the model could di­gest out-of-or­der hid­den states with­out col­laps­ing. The ar­chi­tec­ture was far more flex­i­ble than a rigid pipeline.

Between the Base64 observation and Goliath, I had a hypothesis: Transformers have a genuine functional anatomy. Early layers translate input into abstract representations. Late layers translate back out. And the middle layers, the reasoning cortex, operate in a universal internal language that's robust to architectural rearrangement. The fact that Goliath-120B was built from 16-layer blocks made me suspect the input and output 'processing units' were smaller than 16 layers. I guessed that Alpindale had tried smaller overlaps, and they just didn't work.

If that was true, maybe I didn't need to teach a model new facts to make it smarter. I didn't need fine-tuning. I didn't need RLHF. I just needed to give it more layers to think with.

Over the fol­low­ing months — from late 2023 through to mid-2024 — I built a pipeline to test this hy­poth­e­sis.

The setup was modest. Two RTX 4090s in my basement ML rig, running quantised models through ExLlamaV2 to squeeze 72-billion parameter models into consumer VRAM. The beauty of this method is that you don't need to train anything. You just need to run inference. And inference on quantized models is something consumer GPUs handle surprisingly well. If a model fits in VRAM, I found my 4090s were often ballpark-equivalent to H100s.

The con­cept is sim­ple. For a model with $N$ lay­ers, I de­fine a con­fig­u­ra­tion $(i, j)$. The model processes lay­ers $0$ to $j{-}1$ as nor­mal, then loops back and reuses lay­ers $i$ through $j{-}1$ again, and then the rest to $N{-}1$. The lay­ers be­tween $i$ and $j{-}1$ get du­pli­cated in the ex­e­cu­tion path. No weights are changed. The model just tra­verses some of its own lay­ers twice.

i.e. the pair (2, 7) for a model with 9 transformer blocks would be calculated like so:
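In code, the execution order is just three ranges glued together (a sketch of the idea, not the actual ExLlamaV2 patch):

```python
def layer_order(i: int, j: int, n_layers: int) -> list[int]:
    """Execution path for configuration (i, j):
    run layers 0..j-1, repeat layers i..j-1, then finish with j..n_layers-1."""
    return list(range(0, j)) + list(range(i, j)) + list(range(j, n_layers))

print(layer_order(2, 7, 9))
# [0, 1, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 7, 8]  -> layers 2-6 run twice

print(layer_order(45, 52, 80) == list(range(52)) + list(range(45, 80)))
# True: the configuration that later became RYS-XLarge
```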

By running through all possible pairs, we can generate a 'Brain Scan', and also see the number of duplicate layers for each set of parameters:

For Qwen2-72B, that means an 80-layer model yields 3,240 valid $(i, j)$ pairs, plus the original model to test.

\[\begin{aligned} \text{Variants}_{\text{total}} &= \left(\sum_{j=0}^{80} j\right) + 1\\[16pt] &= \frac{80 \cdot 81}{2} +1 \\[10pt] &= 3241 \end{aligned}\]

Testing a re-layered model against all six leaderboard benchmarks would take days, so a full sweep would be years of compute. I needed proxy tasks: probes that were fast, objective, and would reveal structural properties of the model rather than task-specific tricks.

The prox­ies had to sat­isfy three con­straints:

Minimal output tokens. With thousands of configurations to sweep, each evaluation needed to be fast. No essays, no long-form generation.

Unambiguous scoring. I couldn't afford LLM-as-judge pipelines. The answer had to be objectively scored without another model in the loop.

Orthogonal cognitive demands. If a configuration improves both tasks simultaneously, it's structural, not task-specific.

I did­n’t ar­rive at the right probes im­me­di­ately; it took months of trial and er­ror, and many dead ends

My first in­stinct was cre­ativ­ity. I had mod­els gen­er­ate po­ems, short sto­ries, metaphors, the kind of rich, open-ended out­put that feels like it should re­veal deep dif­fer­ences in cog­ni­tive abil­ity. I used an LLM-as-judge to score the out­puts, but the re­sults were pretty bad. I man­aged to fix LLM-as-Judge with some en­gi­neer­ing, and the scor­ing sys­tem turned out to be use­ful later for other things, so here it is:

Note: You can skip this sec­tion, as it has math. Or not

Naive LLM judges are inconsistent. Run the same poem through twice and you get different scores (obviously, due to sampling). But lowering the temperature also doesn't help much, as that's only one of many technical issues. So, I developed a full scoring system, based on details of the logit outputs. It can get remarkably tricky. Think about a score from 1-10:

We would expect a well-calibrated model to have logits that make sense. If the highest weight was on '7', we would expect the rest of the weight to be on '6' and '8', right? But often it's bimodal, with low weight on '6' and '5', but more weight than expected on '4'!

We can write '10' in tokens as either '10' or '1' and then '0'. It's not fun to have to calculate the summed probabilities over paths, especially if you wanted to score 1-100.

Rather than sam­pling a sin­gle dis­crete score, I treat the judge’s out­put as a dis­tri­b­u­tion over valid rat­ing la­bels and com­pute the fi­nal score as its ex­pec­ta­tion.

To make this prac­ti­cal, I first de­fine a cal­i­brated rubric over the dig­its 0-9 (there’s only one to­ken for each digit), where each digit cor­re­sponds to a clear qual­i­ta­tive de­scrip­tion. At the scor­ing step, I cap­ture the mod­el’s next-to­ken log­its and re­tain only the log­its cor­re­spond­ing to those valid digit to­kens. This avoids con­t­a­m­i­na­tion from un­re­lated con­tin­u­a­tions such as ex­pla­na­tion text, punc­tu­a­tion, or al­ter­nate for­mat­ting. After renor­mal­iz­ing over the re­stricted digit set, I in­ter­pret the re­sult­ing prob­a­bil­i­ties as a cat­e­gor­i­cal score dis­tri­b­u­tion.

Formally, let the valid score set be

\[\mathcal{D} = \{0,1,2,\dots,9\}.\]

Let $z_k$ denote the model logit assigned to digit $k \in \mathcal{D}$ at the scoring position. The restricted score distribution is then

\[p(k)= \frac{\exp(z_k)} {\sum\limits_{m \in \mathcal{D}} \exp(z_m)}, \qquad k \in \mathcal{D}.\]

The fi­nal scalar score is the ex­pected value of this dis­tri­b­u­tion:

\[\hat{s}= \sum_{k \in \mathcal{D}} k\,p(k).\]

This produces a smooth score such as 5.4, rather than forcing the model to commit to a single sampled integer. In practice, this is substantially more stable than naive score sampling and better reflects the model's uncertainty. It also handles cases where the judge distribution is broad or multimodal. For example, two candidates may both have mean score 5.4, while one has most of its mass tightly concentrated around 5 and 6, and the other splits mass between much lower and much higher ratings. The mean alone is the same, but the underlying judgement is very different.

An op­tional un­cer­tainty es­ti­mate can be ob­tained from the vari­ance of the re­stricted dis­tri­b­u­tion:

\[\mathrm{Var}(s)= \sum_{k=0}^{9} (k-\hat{s})^2\,p(k).\]

In short, the method re­places a noisy sam­pled judge score with a nor­mal­ized prob­a­bil­ity dis­tri­b­u­tion over valid score dig­its, then uses the ex­pec­ta­tion of that dis­tri­b­u­tion as the fi­nal rat­ing.
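As a rough sketch in plain Python (assuming you already have the next-token logits and the token id for each digit; this isn't the original code, just the formulas above made concrete):

```python
import math

def expected_score(logits: dict[int, float], digit_token_ids: dict[int, int]) -> tuple[float, float]:
    """Softmax over the ten digit tokens only, then return (expected score, variance)."""
    restricted = {digit: logits[tok] for digit, tok in digit_token_ids.items()}
    peak = max(restricted.values())
    weights = {d: math.exp(z - peak) for d, z in restricted.items()}  # numerically stable exp
    total = sum(weights.values())
    p = {d: w / total for d, w in weights.items()}                    # p(k)

    mean = sum(d * p[d] for d in p)                                   # s_hat
    var = sum((d - mean) ** 2 * p[d] for d in p)                      # Var(s)
    return mean, var

# Toy example: a judge that puts most of its mass on '5' and '6'.
digit_ids = {d: d for d in range(10)}      # pretend digit d has token id d
logits = {tok: -10.0 for tok in range(10)}
logits[5], logits[6] = 2.0, 1.5
print(expected_score(logits, digit_ids))   # roughly (5.38, 0.24)
```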

All this stuff is probably pretty obvious these days; back in '24 there wasn't much to guide me in developing this method. But unfortunately, I found it was also completely useless…

Each con­fig­u­ra­tion needed to gen­er­ate hun­dreds of to­kens of cre­ative out­put, and then a sep­a­rate model had to read and judge each one. With over 3,200 con­fig­u­ra­tions to test for a sin­gle 70B model, this would have taken weeks on my dual 4090s.

I needed probes where the out­put was tiny, a few to­kens at most, and where scor­ing was ob­jec­tive and de­ter­min­is­tic. No judge model in the loop. That’s what led me to the fi­nal two probes:

Hard math. Ridiculously difficult questions like: “What is the cube root of 74,088,893,247?” No chain-of-thought or tool use. Just output the number, as a pure leap of intuitive faith.

Emotional quotient. Using the EQ-Bench benchmark: complex social scenarios where the model must predict the intensity of specific emotional states. “Given this situation, how angry/surprised/guilty would this person feel on a scale of 0-100?” Completely different from math. Theory of mind, social inference, empathy. And the output is just a few numbers.

I had settled on two maximally orthogonal cognitive tasks, both with tiny outputs. My intuition was this: LLMs think one token at a time, so let's make the model really good at guessing just the next token. But things are never straightforward. Take LLM numbers…

Even with math probes, I hit un­ex­pected prob­lems. LLMs fail arith­metic in weird ways. They don’t get the an­swer wrong so much as get it al­most right but for­get to write the last digit, as if it got bored mid-num­ber. Or they trans­pose two dig­its in the mid­dle. Or they out­put the cor­rect num­ber with a trail­ing char­ac­ter that breaks the parser.

This is probably due to the way larger numbers are tokenised, as big numbers can be split up into arbitrary forms. Take the integer 123456789. A BPE tokenizer (e.g., GPT-style) might split it like: '123' '456' '789' or: '12' '345' '67' '89'

A binary right/wrong scoring system would throw away useful signal. Getting a percentage correct would help: '123356789' instead of '123456789' would be 99.92% correct.

But what about a model that makes a dumb 'LLM-mistake' and outputs 430245 when the answer is 4302459, and has clearly done most of the work? I wrote a custom partial-credit scoring function that pads shorter answers and penalises proportionally:
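In spirit, it looks something like this (a sketch of the idea; the padding direction and the exact correction factor here are my assumptions, not the original implementation):

```python
def digit_score(answer: str, truth: str) -> float:
    """Per-digit partial credit: pad the shorter string, count matching digit
    positions, then scale by a correction factor for the length mismatch."""
    answer, truth = answer.strip(), truth.strip()
    width = max(len(answer), len(truth))
    a, t = answer.ljust(width), truth.ljust(width)        # pad the shorter answer
    matches = sum(1 for x, y in zip(a, t) if x == y)
    correction = min(len(answer), len(truth)) / width      # penalise missing/extra digits
    return (matches / width) * correction

print(digit_score("4302459", "4302459"))  # 1.0   - exact answer
print(digit_score("430245", "4302459"))   # ~0.73 - dropped the last digit, still most of the credit
print(digit_score("999", "4302459"))      # 0.0   - genuinely wrong
```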

The key idea: pad shorter an­swers, then pe­nalise via the cor­rec­tion fac­tor. A model that nails 90% of the dig­its but drops the last one still gets sub­stan­tial credit — but less than one that gets every digit. This turned out to be cru­cial for dis­crim­i­nat­ing be­tween con­fig­u­ra­tions that were close in in­tu­itive math abil­ity.

The math questions were hand-crafted initially. I experimented with different operations and scales, then generated random numbers to fill out the dataset. The dataset was a set of 16 questions, and the model is tasked with guesstimating the nearest whole integer number. Here are a few to try yourself; remember, no 'thinking' is allowed, guess it directly!
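For flavour, this is roughly how such a probe set can be generated (illustrative operation, ranges and wording; not the actual 16 questions):

```python
import random

def make_probe(rng: random.Random) -> tuple[str, int]:
    """One guesstimate question: the cube root of a large, non-cube number,
    where the expected answer is the nearest whole integer."""
    base = rng.randint(1_000, 9_999)
    value = base ** 3 + rng.randint(-base, base)   # tiny perturbation; nearest root stays `base`
    return f"What is the cube root of {value:,}? Answer with just the integer.", base

rng = random.Random(0)
for _ in range(3):
    question, answer = make_probe(rng)
    print(question, "->", answer)
```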

After testing several smaller models (Llamas and smaller Qwen2s), I set up the config for Qwen2-72B and let it sweep. Each $(i, j)$ configuration took a few minutes: load the re-layered model, run the math probe, run the EQ probe, record the scores, move on. Days of continuous GPU time on the 4090s. But far less compute than a fine-tune! In fact, I didn't even have the hardware needed for a LoRA fine-tune on just 48GB of VRAM.
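The sweep itself is nothing more than a double loop over the $(i, j)$ pairs. A skeleton of it, with placeholder helpers standing in for the ExLlamaV2 model-building and probe code (those three functions are hypothetical stubs, not real APIs):

```python
import csv
import random

N_LAYERS = 80  # Qwen2-72B has 80 transformer blocks

# Placeholder stubs so the skeleton runs; the real versions wrap ExLlamaV2.
def load_relayered_model(i: int, j: int):
    return (i, j)

def run_math_probe(model) -> float:
    return random.random()

def run_eq_probe(model) -> float:
    return random.random()

with open("brain_scan.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["i", "j", "math_score", "eq_score"])
    for j in range(1, N_LAYERS + 1):        # end of the duplicated block
        for i in range(j):                  # start of the duplicated block
            model = load_relayered_model(i, j)
            writer.writerow([i, j, run_math_probe(model), run_eq_probe(model)])
```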

The op­ti­mal con­fig­u­ra­tion was $(45, 52)$: lay­ers 0 through 51 run first, then lay­ers 45 through 79 run again. Layers 45 to 51 ex­e­cute twice. Seven ex­tra lay­ers, near the mid­dle of the 80-layer stack, bring­ing the to­tal pa­ra­me­ter count from 72B to 78B. Every ex­tra layer is an ex­act copy of an ex­ist­ing one. No new weights or train­ing, just the model re­peat­ing it­self.

Repeating seven lay­ers. That’s all it took, and now I can fi­nally re­veal the nomen­cla­ture of my mod­els: Repeat Your Self for RYS-XLarge ;)

I applied the configuration to MaziyarPanahi's calme-2.1-qwen2-72b — a fine-tune of Qwen2-72B — and uploaded the result as dnhkng/RYS-XLarge. I also applied it to the raw base model as dnhkng/RYS-XLarge-base.

Then I sub­mit­ted to the Open LLM Leaderboard and waited. And waited. Back in the day, the OpenLLM Leaderboard was flooded with dozens of fine-tunes of merges of fine-tunes each day (it was the Wild West), and the wait­ing list was long. But af­ter a month or so, the re­sults ar­rived:

+17.72% on MuSR. +8.16% on MATH. Five out of six bench­marks im­proved, with only IFEval tak­ing a small hit. The av­er­age put it at #1 on the leader­board.

Just to labour the point: I only op­ti­mised for one-shot guessti­mat­ing hard maths prob­lems and EQ-Bench. I never looked at IFEval, BBH, GPQA, MuSR, or MMLU-PRO dur­ing de­vel­op­ment. The leader­board was pure out-of-sam­ple val­i­da­tion.

A layer con­fig­u­ra­tion found us­ing two nar­row, or­thog­o­nal probes gen­er­alised to every­thing the Leaderboard threw at it *.

* - ex­cept IFEval, but that one’s bor­ing any­way, right?

That was surprising enough. A brand new way to scale LLMs, developed on some gaming GPUs. But plotting the heatmaps told an even better story.

The orig­i­nal heatmaps that pro­duced RYS-XLarge, show­ing the Combined delta (math + EQ). The green cir­cle marks the op­ti­mal con­fig­u­ra­tion. Red means im­prove­ment, blue means degra­da­tion

These heatmaps are analogous to functional MRIs of the Transformer, while it is thinking about maths or EQ problems.

The x-axis ($j$) is the end point of the du­pli­cated re­gion. The y-axis ($i$) is the start point. Each pixel rep­re­sents a com­plete eval­u­a­tion: load the re-lay­ered model, run the math probe, run the EQ probe, score both, record the deltas. As de­scribed above, along the cen­tral di­ag­o­nal only a sin­gle layer was du­pli­cated. Along the next di­ag­o­nal to­wards the top-right, we du­pli­cate two lay­ers, and so on. The sin­gle point at the very top-right runs through the en­tire Transformer stack twice per in­fer­ence.

Let's examine the math heatmap first. Starting at any layer and stopping before about layer 60 seems to improve the math guesstimate scores, as shown by the large region with a healthy red blush. Duplicating just the very first layers (the tiny triangle in the top left) messes things up, as does repeating pretty much any of the last 20 layers (the vertical wall of blue on the right). This is more clearly visualised in a skyline plot (averaged rows or columns), and we can see that for the maths guesstimates, the starting position of the duplication matters much less. So the hypothesis that 'starting layers' encode tokens into a smooth 'thinking space', with a dedicated 're-encoding' stage at the end, seems to be somewhat validated.

Until we look at the EQ scores:

Now things look very different! Duplicating any of the final 10 layers has almost no effect on the scores, but we see complex patterns, where some regions show significant improvement (the area around i = 45, j = 55), walled between regions of poor performance.

But the heatmaps re­vealed some­thing even more in­ter­est­ing than the lo­ca­tion of the think­ing bits. They re­vealed some­thing about its struc­ture.

Before settling on block duplication, I tried something simpler: take a single middle layer and repeat it $n$ times. If the “more reasoning depth” hypothesis was correct, this should work. It made sense too, looking at the broad boost in math guesstimate results from duplicating intermediate layers. Give the model extra copies of a particular reasoning layer, get better reasoning. So, I screened them all, looking for a boost.

But nope, it al­most al­ways did worse. Usually a lot worse, but with oc­ca­sional small im­prove­ments that were within the noise range. Annoying, but tak­ing an­other look at the com­plex, blobby pat­terns in EQ scores gave me an­other idea:

If single-layer duplication doesn't help, the middle layers aren't doing independent iterative refinement. They're not interchangeable copies of the same operation that you can simply “run again.” If they were, duplicating any one of them should give at least a marginal benefit. Instead, those layers are working as a circuit. A multi-step reasoning pipeline that needs to execute as a complete unit.

Think of it this way. Layers 46 through 52 aren’t seven work­ers do­ing the same job. They’re seven steps in a recipe. Layer 46 takes the ab­stract rep­re­sen­ta­tion and per­forms step one of some cog­ni­tive op­er­a­tion — maybe de­com­pos­ing a com­plex rep­re­sen­ta­tion into sub­com­po­nents. Layer 47 takes that out­put and per­forms step two — maybe iden­ti­fy­ing re­la­tion­ships be­tween the sub­com­po­nents. Layer 48 does step three, and so on through layer 52, which pro­duces the fi­nal re­sult.

Duplicating just one step of this 'recipe' doesn't bring you much.

But du­pli­cat­ing the en­tire block gives you the full recipe twice. The model runs the com­plete rea­son­ing cir­cuit, pro­duces a re­fined in­ter­me­di­ate rep­re­sen­ta­tion, and then runs the same cir­cuit again on its own out­put. It’s a sec­ond pass. A chance to catch what it missed the first time, to re­fine its ab­strac­tions, to push the rea­son­ing one step deeper.

Let's deep-dive into a more current model (that I can experiment with on my system): ExllamaV3 GLM-4.7 from mratsim.

I’ve marked out a re­gion that boosts maths abil­ity strongly. Notice where it sits? It’s away from the di­ag­o­nal cen­tre line, which means we’re not look­ing at sin­gle-layer du­pli­ca­tions. Starting the re­peated block at po­si­tion 35, we don’t see any im­prove­ment un­til at least po­si­tion 43. That’s seven lay­ers of not much hap­pen­ing. In fact, we ac­tu­ally see de­creased per­for­mance by re­peat­ing these lay­ers (they are blue, bad!).

From end-po­si­tion 43 to 46, we then see solid boosts in math scores (red = good, yay). But in­clude layer 46 or be­yond, and the ben­e­fits col­lapse again. The hy­poth­e­sis: po­si­tion 47 is where a dif­fer­ent cir­cuit be­gins. Including even one step of the next recipe messes up the cur­rent recipe.

So the math 'organ' has boundaries on both sides. Too few layers and you get nothing — you've cut into the circuit and it can't complete its operation. Too many layers and you also get nothing — you've included tissue from a neighbouring circuit that doesn't belong. Pre-training carved these structures out of the layer stack, and they only work whole. It also doesn't translate to other tasks, as the heatmap for EQ scores doesn't have this patch.

...

Read the original on dnhkng.github.io »

8 322 shares, 16 trendiness

Debian decides not to decide on AI-generated contributions

Debian is the latest in an ever-growing list of projects to wrestle (again) with the question of LLM-generated contributions; the latest debate started in mid-February, after Lucas Nussbaum opened a discussion with a draft general resolution (GR) on whether Debian should accept AI-assisted contributions. It seems to have, mostly, subsided without a GR being put forward or any decisions being made, but the conversation was illuminating nonetheless.

Nussbaum said that Debian probably needed to have a discussion to “understand where we stand regarding AI-assisted contributions to Debian” based on some recent discussions, though it was not clear what discussions he was referring to. Whatever the spark was, Nussbaum put forward the draft GR to clarify Debian's stance on allowing AI-assisted contributions. He said that he would wait a couple of days to collect feedback before formally submitting the GR.

His proposal would allow ” if a number of conditions were met. For example, it would require explicit disclosure if “a significant portion of the contribution is taken from a tool without manual modification”, and labeling of such contributions with .” It also spells out that contributors should ” their submissions and would be accountable for the contributions, including “vouching for the technical merit, security, license compliance, and utility of their submissions”. The GR would also prohibit using generative-AI tools with non-public or sensitive project information, including private mailing lists or embargoed security reports.

It is fair to say that it is dif­fi­cult to have an ef­fec­tive con­ver­sa­tion about a tech­nol­ogy when pin­ning down ac­cu­rate ter­mi­nol­ogy is like try­ing to nail Jell-O to a tree. AI is the catch-all term, but much (not all) of the tech­nol­ogy in ques­tion is ac­tu­ally tool­ing around large lan­guage mod­els (LLMs). When par­tic­i­pants have dif­fer­ing ideas of what is be­ing dis­cussed, de­cid­ing whether the thing should be al­lowed may pose some­thing of a prob­lem.

Russ Allbery asked for people to be more precise in their descriptions of the technologies that their proposals might affect. He asserted that it has become common for AI, as a term, to be “so amorphously and sloppily defined that it could encompass every physical object in the universe”. If the project is going to make policy, he said, it needed to be very specific about what it was making policy about:

Gunnar Wolf agreed with Allbery, but Nussbaum claimed that the spe­cific tech­nol­ogy did not mat­ter. The pro­posal boiled down to the use of au­to­mated tools for code analy­sis and gen­er­a­tion:

I see the prob­lem we face as sim­i­lar to the his­tor­i­cal ques­tions sur­round­ing the use of BitKeeper by Linux (except that the choice of BitKeeper im­posed its use by other con­trib­u­tors). It is also sim­i­lar to the dis­cus­sions about pro­pri­etary se­cu­rity analy­sis tools: since those tools are pro­pri­etary, should we ig­nore the vul­ner­a­bil­ity re­ports they is­sue?

If we were to adopt a hard-line “anti-tools” stance, I would find it very hard to draw a clear line.

Drawing clear lines, however, is something that a number of Debian developers felt was important. Sean Whitton proposed that the GR should not only say LLM rather than AI, but it should also distinguish between the uses of LLMs, such as code review, generating prototypes, or generating production code. He envisioned ballot options that could allow some, but not all, of those uses. Distinguishing between the various so-called AI technologies would help in that regard. He urged Nussbaum not to argue too hard for something that is more general than LLMs because that might “alienate the people you want to agree to disagree with.” Andrea Pappacoda said that the specific technology mattered a lot; he wanted the proposal to have clear boundaries and avoid broad terms like AI. He was uncomfortable with the idea of banning LLMs, and not sure where to draw the line. “What I can confidently say, though, is that a project like Claude's C Compiler should not have a place in Debian.”

The conversation did not focus solely on the terminology, of course. Simon Richter had questions about the implications of allowing AI-driven contributions from the standpoint of onboarding new contributors to Debian. An AI agent, he said, could take the place of a junior developer. Both could perform basic tasks under guidance, but the AI agent would not learn anything from the exchange; the project resources spent in guiding such a tool do not result in long-lasting knowledge transfer.

AI use presents us (and the commercial software world as well) with a similar problem: there is a massive skill gap between “gets some results” and “consistently and sustainably delivers results”, bridging that gap essentially requires starting from scratch, but is required to achieve independence from the operators of the AI service, and this gap is disrupting the pipeline of new entrants.

He called that the onboarding problem, and said that an AI policy needed to solve that problem; he did not want to discourage people by rejecting contributions, or to expend resources on mentoring people who did not want to be mentored. Accepting AI-assisted drive-by contributions is harmful because it is a missed opportunity to onboard a new contributor. The best-case outcome is that “a trivial problem got solved without actually onboarding a new contributor, and the worst-case outcome is that the new contributor is just proxying between an AI and the maintainer”. He also expressed concerns around the costs associated with such tools, and speculated that it might discourage contribution from users who could not afford to use for-pay tools.

Nussbaum agreed that the cost could be a problem in the future. For now, he said, it is not an issue because there are vendors providing access for free, but that could change. He disagreed that Debian was likely to run out of tasks suitable for new contributors, even if it does accept AI-driven contributions, and suggested that it may make harder tasks more accessible. He pointed to a study, written by an Anthropic employee and a person participating in the company's fellows program, about how the use of AI impacts skill formation: “A takeaway is that there are very different ways to interact with AI, that produce very different results both in terms of speed and of understanding”. He did not seem to be persuaded that use of AI tools would be a net negative in onboarding new contributors.

Ted Ts'o argued against the idea that AI would have a negative impact.

Matthew Vernon said that the proposed GR minimized the ethical dimension of using generative AI. The organizations that are developing and marketing tools like ChatGPT and Claude are behaving unethically, he said, by systematically damaging the wider commons in the form of automated scraping and doing as they like with others' intellectual property. They “hoover up content as hard as they possibly can, with scant if any regard to its copyright or licensing”. He also cited environmental concerns and other harms that are attributed to generative AI tools, “from non-consensual nudification to the flooding of free software projects with bogus security reports”. He felt that Debian should take a clear stand against those tools and encourage other projects to do the same.

There was also debate around the question of copyright, both in terms of the licenses of material used to train models and in terms of the output of LLM tools. Jonathan Dowland thought that it might be better to forbid some contributions now, since some see risks in accepting such contributions, and then relax the project's position later on when the legal situation is clearer.

Thorsten Glaser took a particularly harsh stance against LLM-driven contributions, going so far as to suggest that some upstream projects should be forced out of Debian's main archive into non-free unless […]. Ansgar Burchardt pointed out that this would have the effect of banning the Linux kernel, Python, LLVM, and others. Glaser's proposal did not seem particularly popular. He had taken a similar stance on AI models in 2025, when the project discussed a GR about AI models and the Debian Free Software Guidelines (DFSG); he argued then that most models should be outside the main archive. That GR never came to a vote, in part because it was unclear whether its language would forbid anti-spam technologies, since one could not include the corpus of spam used as training data along with the filters.

Allbery did not want to touch on copyright issues, but he had a few words to say about the quality of AI-assisted code. It is common for people to object to code generated by LLMs on quality grounds, but he said that argument does not make sense: humans are capable of producing better code than LLMs, but they are also capable of producing worse code.

Bdale Garbee seconded that notion, and said that he was reluctant to take a hard stance one way or the other. “I see it as just another evolutionary stage we don't really understand the longer term positive and negative impacts of yet.” He wanted to focus on long-term implications and questions such as “what is the preferred form of modification for code written by issuing chat prompts?” Nussbaum answered that it would be “the input to the tool, not the generated source code”.

That may not be an entirely satisfying answer, however, given that LLM output is not deterministic and the various providers of LLM tools retire models with some frequency. A user may retain the prompt and other materials fed to an LLM to generate a result at a specific point in time, but the same inputs might produce a very different result later, even with access to the same vendor's tools or to models that can be run locally.

It is clear from the dis­cus­sion that Debian de­vel­op­ers are not of one mind on the ques­tion of ac­cept­ing AI-generated con­tri­bu­tions; the de­vel­op­ers have not yet even con­verged on a shared de­f­i­n­i­tion of what con­sti­tutes an AI-generated con­tri­bu­tion.

What many do seem to agree on is that Debian is not quite ready to vote on a GR about AI-generated contributions. On March 3, Nussbaum said that he had proposed the GR in response to “various attacks against people using AI in the context of Debian”; he felt then that it was something that needed to be dealt with urgently. However, the GR discussion had been civil and interesting. As long as the discussions around AI remained calm and productive, the project could just continue exploring the topic in mailing-list discussions. He guessed that, if there were a GR, the winning option would probably be very nuanced, allowing “AI but with a set of safeguards”.

The questions of what to do about AI models in the archive, how to handle upstream code generated with LLMs, and how to treat LLM-generated contributions written specifically for Debian remain unanswered. For now, it seems, they will continue to be handled on a case-by-case basis by applying Debian's existing policies. Given the complexity of the questions, the diversity of opinions, and the rapid rate of change of the technologies lumped in under the AI umbrella, that may be the best possible, and least disruptive, outcome for now.

...

Read the original on lwn.net »

9 315 shares, 22 trendiness

I'm Building Agents That Run While I Sleep

I Have No Idea If What They Ship Is Any Good

I've been building agents that write code while I sleep. Tools like Gastown run for hours without me watching. Changes land in branches I haven't read. A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do. I care about this. I don't want to push slop, and I had no real answer.

I've run Claude Code workshops for over 100 engineers in the last six months. Same problem everywhere, just at different scales. Teams using Claude for everyday PRs are merging 40-50 a week instead of 10. Teams are spending a lot more time in code reviews. As systems get more autonomous, the problem compounds. At some point you're not reviewing diffs at all, just watching deploys and hoping something doesn't break.

So the question I kept coming back to: what do you actually trust when you can't review everything?

You could hire more reviewers. But you can't hire fast enough. And making senior engineers read AI-generated code all day isn't worth it.

When Claude writes tests for code Claude just wrote, it's checking its own work. The tests prove the code does what Claude thought you wanted. Not what you actually wanted. They catch regressions but not the original misunderstanding. When you use the same AI for both, you've built a self-congratulation machine.

This is exactly the problem code review was supposed to solve: a second set of eyes that wasn't the original author. But one AI writing and another AI checking isn't a fresh set of eyes. They come from the same place. They'll miss the same things.

The thing TDD got right

Write the test first, write the code second, stop when the test passes. Most teams don't do this because thinking through what the code should do before writing it takes time they don't have.

AI removes that excuse, because Claude handles the speed. The slow part is now figuring out if the code is right. That's what TDD was built for: write down what correct looks like, then check it.

TDD asks you to write unit tests, which means thinking about how the code will work before you write it. This is easier: write down what the feature should do in plain English. The machine figures out how to check it. “Users can authenticate with email and password. On wrong credentials they see 'Invalid email or password.' On success they land on /dashboard. The session token expires after 24 hours.” You can write that before you open a code editor. The agent builds it. Something else checks it.

P.S. I write about Claude Code internals every week. Last week I wrote about how Claude Code is a while loop with 23 tools. Subscribe to get the next one!

What this looks like in practice

For frontend changes, we generated acceptance criteria based on the spec file:

# Task

Add email/password login.

## Acceptance Criteria

### AC-1: Successful login
- User at /login with valid credentials gets redirected to /dashboard
- Session cookie is set

### AC-2: Wrong password error
- User sees exactly "Invalid email or password"
- User stays on /login

### AC-3: Empty field validation
- Submit disabled when either field is empty, or inline error on empty submit

### AC-4: Rate limiting
- After 5 failed attempts, login blocked for 60 seconds
- User sees a message with the wait time

Each criterion is specific enough that it either passes or fails. Once the agent builds the feature, verification runs Playwright browser agents against each AC, takes screenshots, and produces a report with per-criterion verdicts. If something fails, you see exactly which criterion and what the browser saw.

For backend changes the same pattern works without a browser. You specify observable API behavior (status codes, response headers, error messages) that curl commands can check.
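As a rough sketch of that backend variant (my illustration, not part of the verify skill), a single criterion check can be a few lines of shell; the endpoint, port, and error string below are hypothetical stand-ins for whatever the spec under test actually defines:

#!/usr/bin/env bash
# Sketch of a backend check for AC-2: wrong password returns 401 with the exact message.
# The /api/login endpoint, the :3000 port, and the expected string are assumptions.
set -euo pipefail

status=$(curl -s -o /tmp/login-body.json -w "%{http_code}" \
  -X POST http://localhost:3000/api/login \
  -H "Content-Type: application/json" \
  -d '{"email":"user@example.com","password":"wrong"}')

[ "$status" = "401" ] || { echo "FAIL AC-2: expected 401, got $status"; exit 1; }
grep -q "Invalid email or password" /tmp/login-body.json \
  || { echo "FAIL AC-2: exact error message not found"; exit 1; }
echo "PASS AC-2"

Each such check either passes or prints which criterion failed, mirroring the per-criterion verdicts described above.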

One thing worth being honest about: this doesn't catch spec misunderstandings. If your spec was wrong to begin with, the checks will pass even when the feature is wrong. What Playwright does catch is integration failures, rendering bugs, and behavior that works in theory but breaks in a real browser. That's a narrower claim than “verified correct,” but it's more than a code review was reliably catching anyway.

The workflow: write acceptance criteria before you prompt, let the agent build against them, run verification, review only the failures. You review failures instead of diffs.

How to build it

I started building a Claude Skill (github.com/opslane/verify) that runs using claude -p (Claude Code's headless mode) plus Playwright MCP. No custom backend, no extra API keys beyond your existing Claude OAuth token. Four stages:

Pre-flight is pure bash, no LLM. Is the dev server running? Is the auth session valid? Does a spec file exist? Fail fast before spending any tokens.
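A pre-flight stage along those lines could be little more than a handful of shell checks; the spec filename, dev-server port, and session file here are assumptions for illustration, not the skill's actual layout:

#!/usr/bin/env bash
# Pre-flight sketch: cheap checks that fail fast before any LLM call is made.
# spec.md, localhost:3000, and .verify/auth-session.json are hypothetical names.
set -euo pipefail

[ -f spec.md ] || { echo "No spec file found"; exit 1; }
curl -sf http://localhost:3000/ > /dev/null \
  || { echo "Dev server not responding on :3000"; exit 1; }
[ -f .verify/auth-session.json ] || { echo "No saved auth session"; exit 1; }
echo "Pre-flight OK"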

The planner is one Opus call. It reads your spec and the files you changed. It figures out what each check needs and how to run it. It also reads your code to find the right selectors, so it's not guessing at class names.

Browser agents are one Sonnet call per AC, all running in parallel. Five ACs, five agents, each navigating and screenshotting independently. Sonnet costs 3-4x less than Opus here and works just as well for clicking around.
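The fan-out could plausibly look like the sketch below: one headless claude -p call per criterion, launched in the background and joined with wait. The prompt-file layout and the model alias are my assumptions; only the .verify/evidence/*/result.json path comes from the article.

#!/usr/bin/env bash
# Sketch: launch one Sonnet browser agent per acceptance criterion, in parallel.
# Assumes one prompt file per AC under .verify/prompts/ (hypothetical layout) and
# that Playwright MCP is already configured for the project.
set -euo pipefail

SONNET_MODEL="${SONNET_MODEL:-sonnet}"   # model alias/id is an assumption

mkdir -p .verify/evidence
for prompt in .verify/prompts/AC-*.md; do
  ac=$(basename "$prompt" .md)
  mkdir -p ".verify/evidence/$ac"
  # Capture whatever the agent prints as that AC's evidence for the judge stage.
  claude -p --model "$SONNET_MODEL" "$(cat "$prompt")" \
    > ".verify/evidence/$ac/result.json" &
done
wait   # all agents must finish before the judge reads the evidence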

The judge is one final Opus call that reads all the evidence and returns a verdict per criterion: pass, fail, or needs-human-review.

claude -p --model claude-opus-4-6 \
  "Review this evidence and return a verdict for each AC.
  Evidence: $(cat .verify/evidence/*/result.json)
  Return JSON: {verdicts: [{id, passed, reasoning}]}"

Or clone the repo and adapt it. Each stage is a single claude -p call with a clear input and structured output. You can swap models, add stages, or wire it into CI with --dangerously-skip-permissions.

The thing I keep coming back to: you can't trust what an agent produces unless you told it what “done” looks like before it started. Writing acceptance criteria is harder than writing a prompt, because it forces you to think through edge cases before you've seen them. Engineers resist it for the same reason they resisted TDD, because it feels slower at the start.

Without them, all you can do is read the output and hope it's right.

...

Read the original on www.claudecodecamp.com »

10 270 shares, 27 trendiness

U+237C ⍼ is Azimuth

One year ago, on 28 February 2025, Wikipedia user Moyogo updated the page for Angzarr with a citation to the type foundry H. Berthold AG's 1950 symbol catalogue listing ⍼ as “Azimut, Richtungswinkel”, or “azimuth”, “direction angle”. Mystery solved!

Fonts in Use lists links to archived catalogues by Berthold. The above scan is from the 1950 Zeichenprobe (symbol catalogue) on page 7. Copies of the Schriftprobe (font catalogue) from 1949, 1951, and 1952 all show the same glyph and sizes on page 104, albeit without the descriptor name.

⍼ does not appear in the 1946 Registerprobe, nor in the earlier 1909 and 1900 catalogues. For convenience, I've extracted full-page scans below for where it appears, and for where I feel it would appear but doesn't.

A friend on Mastodon pointed out that the glyph ⍼ itself resembles the way a light ray passes through a sextant to measure an azimuth, with the right angle being a standard symbol for an angle in general. Wikipedia has a lovely illustration demonstrating how a sextant works to measure the latitude of the sun; it can, of course, be turned sideways to measure an azimuth with respect to an arbitrary meridian.

...

Read the original on ionathan.ch »
