10 interesting stories served every morning and every evening.




1 1,410 shares, 87 trendiness

The sound of inevitability

...

Read the original on tomrenner.com »

2 516 shares, 21 trendiness

[WIP] CUDA backend by zcbenz · Pull Request #1983 · ml-explore/mlx

...

Read the original on github.com »

3 316 shares, 14 trendiness

DOGWALK

Blender Studio’s official game project is a short, casual interactive story. Play a big, adorable dog traversing winter woods and help a little kid decorate a snowman with colorful items hidden in the environment.

You are let loose to roam camping grounds, forest paths, idyllic creeks and a frozen pond in this miniature open world.

Guide or drag around the little kid owner you have in tow. Help each other out, be a menace or be a good boy.

Dive straight in and have the game react to your play-style and choices. There are no fail states, only player-driven moments.

Traverse an environment made of real-life paper-crafted models, scanned and recreated to be played with.

Brought to you by the Blender Studio as the new free and Creative Commons “Open Project”. Made with, and available as, free and open-source software.

The source code and production repository can be accessed on our website at

https://studio.blender.org/projects/dogwalk/3e3fa4dfd790bc/

The project was used to test and improve both Blender and the Godot Game Engine.

Support our work on the Blender Studio website:

https://studio.blender.org/projects/project-dogwalk/

...

Read the original on blenderstudio.itch.io »

4 267 shares, 8 trendiness

php_license_update

This pro­posal ad­dresses a long­stand­ing is­sue within the open source com­mu­nity by pub­lish­ing new ver­sions of the PHP License and the Zend Engine License. The Modified BSD License is adopted as the PHP License, ver­sion 4, and as the Zend Engine License, ver­sion 3.

The Modified BSD License is sometimes referred to as the “New,” “Revised,” or “3-clause” BSD License. Its SPDX identifier is BSD-3-Clause.1) It is recognized as a free software license by both the Open Source Initiative (OSI) and the Free Software Foundation (FSF).2) 3) The FSF has designated it as compatible with the GNU General Public License (GPL), and it is an OSI Approved License.

The PHP License, ver­sion 3.01, and Zend Engine License, ver­sion 2.00, com­bine the Modified BSD License with spe­cial terms spe­cific only to the PHP Group and Zend Technologies (now a sub­sidiary of Perforce Software). After re­mov­ing these spe­cial terms, the li­censes are iden­ti­cal to the Modified BSD License, and there is no change to the rights granted by con­trib­u­tors or to users.

The rights granted by con­trib­u­tors do not change.

The rights granted to users do not change.

We will work with the PHP Group and Perforce Software to re­move the terms that are spe­cific to them.

PHP software and the Zend Engine will be licensed under terms that are both OSI Approved and compatible with the GPL.

Work with the PHP Group to adopt the Modified BSD License as the PHP License, ver­sion 4.

Work with Perforce Software to adopt the Modified BSD License as the Zend Engine License, ver­sion 3.

Deprecate the PHP License and the Zend Engine License. Use of these li­censes for new pro­jects, in­side or out­side the PHP pro­ject, is strongly dis­cour­aged.

Delete the con­tents of the LICENSE file from the PHP soft­ware, and re­place them with the con­tents in­di­cated in the New LICENSE File sec­tion be­low.

Remove the Zend/LICENSE file from the Zend Engine.

Replace the file head­ers for all PHP source files in the PHP soft­ware with the con­tents in­di­cated in the New PHP Source File Header sec­tion be­low.

Replace the file head­ers for all Zend Engine source files with the con­tents in­di­cated in the New Zend Engine Source File Header sec­tion be­low.

Update other applicable documentation and web pages to reflect these changes, such as https://www.php.net/license/.

The Background, Change Authority, and Additional Context sec­tions of this doc­u­ment pro­vide fur­ther con­text and le­gal jus­ti­fi­ca­tion for this change.

The PHP License and Zend Engine License are not compatible with the GPL,4) and the Zend Engine License is not OSI Approved. While the OSI license approval committee voted to approve versions 3.0 and 3.01 of the PHP License, each followed the “legacy approval” process, meaning the licenses had already been in wide use for many years before the OSI approved them. As a result, the OSI approved the PHP License based more on its intent than on its content. If the OSI license approval committee were not considering the legacy use of the PHP License, it is unlikely they would have approved it based solely on its content.

In the be­gin­ning, while the Zend Engine was bun­dled with PHP in the Zend/ di­rec­tory, it was thought of as a com­pletely sep­a­rate prod­uct that could be un­bun­dled and used apart from PHP. Indeed, that was the in­tent, and it is the rea­son PHP and the Zend Engine have sep­a­rate li­censes. However, af­ter 25 years of co­hab­i­ta­tion within the same source code repos­i­tory, the two are in­ter­twined in ways in which the Zend Engine can no longer be sep­a­rated and used as a stand­alone prod­uct. Together, they form the PHP pro­gram­ming lan­guage ref­er­ence im­ple­men­ta­tion.

Rasmus Lerdorf cre­ated PHP at a time when a fac­tion within the free soft­ware move­ment was grow­ing dis­sat­is­fied with the pol­i­tics and phi­los­o­phy of the move­ment and splin­tered off, crys­tal­liz­ing around a more per­mis­sive set of li­censes viewed as friend­lier to com­mer­cial use—this be­came the open source move­ment.

The frame dis­pute, con­se­quent trans­for­ma­tion, and cre­ation of the open source move­ment can be viewed as a spin-off move­ment that not only had a dif­fer­ent di­ag­no­sis and more elas­tic reach, but that strove to avoid what they saw as mistakes” made by the found­ing move­ment that in­hib­ited com­mer­cial growth.5)

In his original release announcement, Lerdorf wrote, “The tools are in the public domain distributed under the GNU Public License. Yes, that means they are free!”6) 7) Lerdorf chose to release PHP version 1 and PHP/FI (version 2) under the terms of the GNU General Public License, version 2 (GPLv2), but he recognized the growing concern within the open source movement that commercial interests were scared of, or even forbade, the use of GPL software in their organizations—indeed, many continue this practice today. In a 1997 mailing list post discussing licensing, Lerdorf said, “PHP, if I can help it, will always be free. But, I am not against letting commercial entities take a shot at a commercial version as long as the terms are such that the major contributors don’t feel cheated.”8)

This led to a dual-licensing model in PHP 3, allowing users the choice to use PHP under the terms of the GPLv2 or a custom license based on the Apache License, version 1.0. “Our license is identical to the Apache license (since that’s where we copied it from) except for that first clause,” wrote Lerdorf in a 1999 mailing list post.9) That first clause restricted commercial use:

Commercial re­dis­tri­b­u­tion of larger works de­rived from, or works which bun­dle PHP, re­quires writ­ten per­mis­sion from the PHP Development Team. You may charge a fee for the phys­i­cal act of trans­fer­ring a copy, and must make it clear that the fee be­ing charged is for the dis­tri­b­u­tion, and not for the soft­ware it­self. You may, at your op­tion, of­fer war­ranty pro­tec­tion in ex­change for a fee.10)

The dual-licensing model presented a number of challenges to a group that was ill-equipped to handle legal questions. In the same thread, Lerdorf discussed having received requests from companies for signed, hardcopy documents granting permission to use PHP and being unable to respond to them appropriately.11) Free and open source software was not well understood by companies, and there was significant disagreement within the PHP project about what level of freedom users should have. At the time, Zeev Suraski wrote, “people should not be given the legal right to do whatever they wish with PHP.”12) Nevertheless, with Lerdorf having referred to the first clause as “that troublesome clause which we can’t enforce,”13) the team finally removed it in PHP 3.0.14.14)

Meanwhile, Richard Stallman, author of the GPL and founder of the FSF, had significant disagreements with the PHP project over their use of the GPL,15) 16) so the PHP project discontinued the dual-licensing approach, removing the GPL as an option, and PHP 4.0.0 shipped with the PHP License, version 2.02, and the Zend License, version 0.92,17) for sources within the Zend/ directory.

Suraski and Andi Gutmans orig­i­nally in­tended the Zend/ di­rec­tory to be read-only, with all the source code owned by the two, so they could sell the Zend en­gine for uses other than PHP.”18) It’s clear they—and other early mem­bers of the PHP pro­ject—saw the Zend Engine as wholly sep­a­rate from PHP. In a 1999 in­ter­view, Lerdorf clar­i­fied li­cens­ing con­cerns sur­round­ing the sep­a­rate li­censes:

PHP 4 is not syn­ony­mous with Zend. And when it comes to li­cens­ing, the only time the [Zend License] kicks in is if you un­bun­dle Zend from PHP and try to em­bed the Zend en­gine into some­thing else.19)

I think there is still some con­fu­sion about what role ex­actly Zend plays in the PHP in­fra­struc­ture. The host lan­guage (PHP) uses the base ser­vices pro­vided by the en­gine (Zend)—services such as mem­ory al­lo­ca­tion, per­sis­tent re­sources, com­pi­la­tion, and ex­e­cu­tion. PHP it­self then pro­vides the func­tion li­braries, in­ter­faces to the Web servers, .ini file sup­port, etc.20)

Gutmans hinted at a pos­si­ble fu­ture use of the Zend Engine, which ex­plained the need for a sep­a­rate li­cense:

I’d very much like to see the Zend en­gine em­bed­ded in MySQL at some point. I think it would be great to be able to write the stored pro­ce­dure code of the DB in the same lan­guage as the script­ing en­gine used to ac­cess the DB. […]

The Zend en­gine was writ­ten in a way where it can be used in other prod­ucts be­sides PHP. The [Zend License] al­lows us (the Zend com­pany) to re­serve the right to use it else­where com­mer­cially. However, Zend as part of PHP can be used freely and falls un­der the PHP li­cense.21)

Later, Gutmans ex­plained why he thought the sep­a­rate li­cense for the Zend Engine did not pre­sent any prob­lems for con­trib­u­tors:

No one really contributes to the scripting engine but extends PHP with additional modules and functions. There are constantly developers (besides us) extending PHP’s functions.22)

Since then, the li­censes un­der­went only one se­ries of ma­jor changes, which pro­duced the Zend Engine License, ver­sion 2.00, first dis­trib­uted with PHP 4.2.0 (April 22, 2002), and the PHP License, ver­sion 3.0, first dis­trib­uted with PHP 4.2.3 (September 6, 2002).

In May 2003, Lerdorf pe­ti­tioned the OSI for ap­proval of ver­sion 3.0 of the PHP License, clos­ing with a state­ment that im­plied he wished to switch PHP to the Apache License, Version 2.0, once it gained ap­proval from the OSI.

Hopefully the new Apache li­cense when­ever that gets fi­nal­ized will be OSI-approved and has the big ad­van­tage of be­ing pro­ject-ag­nos­tic, so pro­jects such as PHP that are closely tied to Apache can use it ver­ba­tim with­out hav­ing to mas­sage it and we won’t need all these in­di­vid­ual Apache-like li­censes.23)

A few years later, a very slight change in the word­ing of the PHP License re­sulted in chang­ing the ver­sion num­ber to 3.01.24) This new ver­sion, while al­most iden­ti­cal, never re­ceived OSI ap­proval, a prob­lem that pre­sented it­self 14 years later, when Matthew Sheahan asked on the php-gen­eral mail­ing list re­gard­ing the OSI ap­proval sta­tus of ver­sion 3.01.

My team’s ability to use the phpdbg utility hinges on OSI approval of its license. Language at https://www.php.net/license/ indicates that the PHP 3.01 license is OSI approved, but OSI disagrees; https://opensource.org/licenses/alphabetical shows approval only of the PHP 3.0 license. (The fact that 3.0 and 3.01 are substantively identical is no use to us at all.)25)

Andreas Heigl asked on the php-internals mailing list, “Does anyone here remember why the changes to the license where [sic] done in the first place?”26) In response, Johannes Schlüter referenced the Debian debate.

My mem­ory could fail me, but I be­lieve there were de­bates com­ing from Debian com­mu­nity around es­pe­cially PECL ex­ten­sions be­ing Licensed un­der PHP Licens [sic] 3.0 and the word­ing be­ing sub-op­ti­mal. The new word­ing (and web­site link) should make it clear that PECL (and PEAR) is PHP Software” while not be­ing PHP”.27)

At that time, Ben Ramsey volunteered to contact the OSI to formally request legacy approval for the PHP License.28) The “legacy approval” designation allowed the license steward or any interested licensee to request “retroactive approval of historic/legacy licenses that have already been extensively used by an existing community, but have not previously been approved.”29) So, on March 4, 2020, Ramsey submitted a request for legacy approval to the OSI license-review list,30) and on May 13, 2020, the OSI Board voted to approve the PHP License, version 3.01.31)

The PHP Association was a public benefit corporation incorporated in the State of Nebraska in the United States in February 2000.32) Each of the directors of the PHP Association was also a member of the PHP Group.33) 34) We can infer from this that the PHP Group created the PHP Association to represent the group in legal and business matters.

On May 22, 2000, the same day the PHP team re­leased PHP ver­sion 4.0.0, in­clud­ing Zend Engine ver­sion 1.0.0, Zend Technologies and the PHP Association en­tered into an agree­ment to en­sure the con­tin­ued avail­abil­ity of the Zend Engine as an open source prod­uct.

Since Zend Engine is a cru­cial com­po­nent of PHP, Zend hereby makes the fol­low­ing com­mit­ments and as­sur­ances to The PHP Association:

Zend will con­tinue to make Zend Engine avail­able as an open source prod­uct un­der the Zend Open Source License. If Zend changes the terms of the Zend Open Source License, the new li­cense will be con­sis­tent with the Open Source Definition of the Open Source Initiative.

The PHP Association is hereby au­tho­rized to mar­ket, dis­trib­ute and sub­li­cense Zend Engine, in source and ob­ject code forms, as an in­te­grated com­po­nent of PHP, to end users who agree to be bound by the PHP open-source li­cense, ver­sion 2.02. […] However, if Zend Engine is ei­ther mod­i­fied or sep­a­rated from the rest of PHP, the use of the mod­i­fied or sep­a­rated Zend Engine shall not be gov­erned by the PHP Open Source License, but in­stead shall be gov­erned by the Zend Open Source License.

The PHP Association agreed to the terms of the agree­ment, which in­cluded the fol­low­ing con­di­tions:

The Association will not delete or al­ter any in­tel­lec­tual prop­erty rights or li­cense no­tices ap­pear­ing on the Zend Engine and will re­pro­duce and dis­play such no­tices on each copy it makes of the Zend Engine.”

The Association may not as­sign this Letter, by op­er­a­tion of law or oth­er­wise in whole or in part, with­out Zend’s writ­ten con­sent. Any at­tempt to as­sign this Letter with­out such con­sent will be null and void. This Letter will bind and in­ure to the ben­e­fit of each par­ty’s per­mit­ted suc­ces­sors and as­signs.”

Given how cor­po­ra­tion law works in most US states, the PHP Association is likely still legally bound to this con­tract, even if they are no longer an ac­tive en­tity, and the terms of the con­tract fol­lowed Zend as it was ac­quired by Rogue Wave in 2015 and Perforce Software in 2019.

The PHP License and Zend Engine License are BSD-style li­censes. As men­tioned ear­lier, Lerdorf pointed to the Apache License, ver­sion 1.0, as the model for the orig­i­nal PHP li­cense,46) and the Apache License, ver­sion 1.0, is de­rived from the orig­i­nal, or 4-clause, BSD li­cense.47) In fact, the two are iden­ti­cal, ex­cept the Apache License added con­di­tions 5 and 6:

5. Products de­rived from this soft­ware may not be called Apache” nor may Apache” ap­pear in their names with­out prior writ­ten per­mis­sion of the Apache Group.

6. Redistributions of any form what­so­ever must re­tain the fol­low­ing ac­knowl­edg­ment: This prod­uct in­cludes soft­ware de­vel­oped by the Apache Group for use in the Apache HTTP server pro­ject (http://​www.apache.org/).”48)

By ex­ten­sion, the PHP License is a de­riv­a­tive of the BSD 4-Clause License.

The BSD 4-Clause License is not an OSI-approved li­cense,49) while the FSF con­sid­ers it free but prob­lem­atic.50) Both po­si­tions are in re­sponse to the BSD ad­ver­tis­ing clause:

All ad­ver­tis­ing ma­te­ri­als men­tion­ing fea­tures or use of this soft­ware must dis­play the fol­low­ing ac­knowl­edge­ment: This prod­uct in­cludes soft­ware de­vel­oped by the or­ga­ni­za­tion.

For the PHP License, version 3.01, conditions 1 and 2 are identical to conditions 1 and 2 of the BSD 4-Clause License. Condition 3 of the PHP License is similar in function to condition 4 of the BSD 4-Clause License. Condition 6 of the PHP License is similar in function to condition 3 of the BSD 4-Clause License. PHP added new conditions 4 and 5.

For the Zend Engine License, ver­sion 2.00, con­di­tions 1 and 2 are iden­ti­cal to con­di­tions 1 and 2 of the BSD 4-Clause License. Condition 3 of the Zend Engine License is sim­i­lar in func­tion to con­di­tion 4 of the BSD 4-Clause License. Conditions 5 and 6 of the Zend Engine License are sim­i­lar in func­tion to con­di­tion 3 of the BSD 4-Clause License. Zend added a new con­di­tion 4.

Every con­trib­u­tor owns the copy­right on their spe­cific con­tri­bu­tions to an open source pro­ject, if the con­tri­bu­tions are copy­rightable. Some con­tri­bu­tions (e.g., typo fixes, white space changes, etc.) aren’t copy­rightable, but any­thing more sig­nif­i­cant be­longs to the con­trib­u­tor, pro­vided it is their own work.

In other words, even though the li­cense state­ment says the copy­right be­longs to The PHP Group51) or Zend Technologies52), tech­ni­cally, these copy­right state­ments only ap­ply to the spe­cific code con­tributed by these or­ga­ni­za­tions or by peo­ple con­tribut­ing on be­half of these or­ga­ni­za­tions.

Contributing to an open source project is NOT an implicit transfer of your copyright to the project. To do this, every contributor must sign a contributor license agreement that explicitly states they are transferring their copyright to whomever owns the code. No one has signed any agreements of this sort for the PHP software, so every contributor retains copyright ownership over the code they have contributed to PHP.

What is im­plied, how­ever, is as­sign­ment of li­cense. When some­one con­tributes to an open source pro­ject, they own the copy­right on their con­tri­bu­tions, but un­less they spec­ify a dif­fer­ent li­cense cov­er­ing their con­tri­bu­tions (which is wholly valid, with ex­am­ples in­clud­ing Derick Rethans’s timelib, which is bun­dled within the PHP source code), it is im­plied they are grant­ing use of their con­tri­bu­tions un­der the same li­cense terms as the pro­ject. In this way, the con­trib­u­tor can­not later de­mand to re­move all their copy­righted code; it’s un­der the terms of the same li­cense, which can’t be re­voked. However, if the pro­ject de­cides to change its li­cense terms, a con­trib­u­tor may then re­quest re­moval of their copy­righted code be­cause they may not wish to grant the terms of the new li­cense to their copy­righted work.

Additionally, com­mon con­ven­tion dic­tates that, once a copy­right state­ment is placed on a source file, it should re­main on that source file, com­plete with any years listed, though the years do not re­quire up­dat­ing. For an ex­am­ple, look at the file header on any WebKit source file.53) WebKit even spec­i­fies that you add a copy­right no­tice to each file where you make significant” changes.54)

The short answer is, “No.” As a courtesy, however, we will keep discussion on this topic open for a period of no less than six months before calling a vote on the proposal.

Earlier, we es­tab­lished that every con­trib­u­tor owns the copy­right for their spe­cific con­tri­bu­tions, and un­less they spec­i­fied a dif­fer­ent li­cense cov­er­ing their con­tri­bu­tions, it is im­plied they have granted use of their con­tri­bu­tions un­der the same li­cense terms as the pro­ject. We have also es­tab­lished, at length, the PHP License, ver­sion 3.01, and Zend Engine License, ver­sion 2.00, are iden­ti­cal to the Modified BSD License if con­di­tions 4, 5, and 6 are re­moved from each li­cense.55)

There is no doubt con­trib­u­tors have the au­thor­ity to grant users li­cense to use their code with re­spect to con­di­tions 1 and 2. These are the same for the PHP License, Zend Engine License, and Modified BSD License. This pro­posal does not change the word­ing of any part of these con­di­tions:

Redistribution and use in source and bi­nary forms, with or with­out mod­i­fi­ca­tion, are per­mit­ted pro­vided that the fol­low­ing con­di­tions are met:

Redistributions of source code must re­tain the above copy­right no­tice, this list of con­di­tions and the fol­low­ing dis­claimer.

Redistributions in bi­nary form must re­pro­duce the above copy­right no­tice, this list of con­di­tions and the fol­low­ing dis­claimer in the doc­u­men­ta­tion and/​or other ma­te­ri­als pro­vided with the dis­tri­b­u­tion.

Condition 3 does have dif­fer­ences across each li­cense. However, when viewed at face-value, the in­tent of this con­di­tion in the PHP and Zend Engine li­censes is the same as the 3rd con­di­tion of the Modified BSD License. Additionally, as worded in the PHP and Zend Engine li­censes, con­trib­u­tors have no au­thor­ity to as­sert these terms for their own con­tri­bu­tions, since the terms are spe­cific to the PHP Group and Perforce Software, re­spec­tively, but they do have the au­thor­ity to as­sert the terms of con­di­tion 3 from the Modified BSD License.

The name PHP must not be used to en­dorse or pro­mote prod­ucts de­rived from this soft­ware with­out prior writ­ten per­mis­sion. For writ­ten per­mis­sion, please con­tact group@php.net.

The names Zend” and Zend Engine” must not be used to en­dorse or pro­mote prod­ucts de­rived from this soft­ware with­out prior per­mis­sion from Zend Technologies Ltd. For writ­ten per­mis­sion, please con­tact li­cense@zend.com.

Neither the name of the copy­right holder nor the names of its con­trib­u­tors may be used to en­dorse or pro­mote prod­ucts de­rived from this soft­ware with­out spe­cific prior writ­ten per­mis­sion.

When we look closer at con­di­tions 4, 5, and 6 for both the PHP License and the Zend Engine License, it ap­pears no con­trib­u­tors, other than rep­re­sen­ta­tives of the PHP Group and Perforce Software, are able to grant or as­sert these con­di­tions for their con­tri­bu­tions. Removing them from the li­cense does not change any of the rights granted or re­stricted by con­trib­u­tors (other than the PHP Group and Perforce Software; see be­low).

For these rea­sons, we do not need to gain per­mis­sion from all con­trib­u­tors to make these changes.

This pro­posal re­moves the fol­low­ing con­di­tions, which the PHP Group is uniquely able to claim over the PHP source code:

4. Products de­rived from this soft­ware may not be called PHP, nor may PHP ap­pear in their name, with­out prior writ­ten per­mis­sion from group@php.net. You may in­di­cate that your soft­ware works in con­junc­tion with PHP by say­ing Foo for PHP in­stead of call­ing it PHP Foo” or phpfoo”

5. The PHP Group may pub­lish re­vised and/​or new ver­sions of the li­cense from time to time. Each ver­sion will be given a dis­tin­guish­ing ver­sion num­ber. Once cov­ered code has been pub­lished un­der a par­tic­u­lar ver­sion of the li­cense, you may al­ways con­tinue to use it un­der the terms of that ver­sion. You may also choose to use such cov­ered code un­der the terms of any sub­se­quent ver­sion of the li­cense pub­lished by the PHP Group. No one other than the PHP Group has the right to mod­ify the terms ap­plic­a­ble to cov­ered code cre­ated un­der this License.

6. Redistributions of any form what­so­ever must re­tain the fol­low­ing ac­knowl­edg­ment: This prod­uct in­cludes PHP soft­ware, freely avail­able from http://​www.php.net/​soft­ware/.

The good news is that con­di­tion 5 grants the PHP Group the au­thor­ity to make changes to the PHP License, with­out ap­proval from any con­trib­u­tors.

Depending on the by­laws adopted by the PHP Association (as dis­cussed ear­lier in Zend and the PHP Association), we may re­quire ap­proval from one or more rep­re­sen­ta­tives of the PHP Group to ac­cept this pro­posal. There is no pub­lic record of the as­so­ci­a­tion’s by­laws, so un­less the by­laws spec­ify a quo­rum, we will need ap­proval from each of:

Note: Legal rep­re­sen­ta­tives of Perforce Software have in­for­mally ap­proved this pro­posal. The next step is a for­mal ap­proval, in writ­ing.

As the suc­ces­sor of Zend Technologies, Perforce Software is party to the Zend Grant and owner of the Zend Engine License. This pro­posal re­moves the fol­low­ing con­di­tions, which Perforce Software is uniquely able to claim over the Zend Engine source code:

4. Zend Technologies Ltd. may pub­lish re­vised and/​or new ver­sions of the li­cense from time to time. Each ver­sion will be given a dis­tin­guish­ing ver­sion num­ber. Once cov­ered code has been pub­lished un­der a par­tic­u­lar ver­sion of the li­cense, you may al­ways con­tinue to use it un­der the terms of that ver­sion. You may also choose to use such cov­ered code un­der the terms of any sub­se­quent ver­sion of the li­cense pub­lished by Zend Technologies Ltd. No one other than Zend Technologies Ltd. has the right to mod­ify the terms ap­plic­a­ble to cov­ered code cre­ated un­der this License.

5. Redistributions of any form what­so­ever must re­tain the fol­low­ing ac­knowl­edg­ment: This prod­uct in­cludes the Zend Engine, freely avail­able at http://​www.zend.com

6. All ad­ver­tis­ing ma­te­ri­als men­tion­ing fea­tures or use of this soft­ware must dis­play the fol­low­ing ac­knowl­edg­ment: The Zend Engine is freely avail­able at http://​www.zend.com

Just as the PHP License grants the PHP Group the au­thor­ity to make changes to the PHP License, the Zend Engine License grants Perforce Software the sole au­thor­ity to make changes to the Zend Engine License, with­out ap­proval from its con­trib­u­tors.

To make the changes proposed in this RFC, the PHP project will require that a representative (or representatives) from the PHP Group work with representatives from Perforce Software to agree to this proposal.

This pro­posal pub­lishes a new ver­sion of the PHP License, trig­ger­ing clause 5 of the PHP License, ver­sion 3.01, which states (emphasis added):

The PHP Group may pub­lish re­vised and/​or new ver­sions of the li­cense from time to time. Each ver­sion will be given a dis­tin­guish­ing ver­sion num­ber. Once cov­ered code has been pub­lished un­der a par­tic­u­lar ver­sion of the li­cense, you may al­ways con­tinue to use it un­der the terms of that ver­sion. You may also choose to use such cov­ered code un­der the terms of any sub­se­quent ver­sion of the li­cense pub­lished by the PHP Group. No one other than the PHP Group has the right to mod­ify the terms ap­plic­a­ble to cov­ered code cre­ated un­der this License.

Users of any PHP ex­ten­sion or other soft­ware pub­lished un­der the terms of the PHP License, ver­sion 3.01, may choose to use that soft­ware un­der the terms of the PHP License, ver­sion 4 (i.e., the Modified BSD License).

Maintainers of PHP ex­ten­sions and other soft­ware pub­lished un­der the terms of the PHP License, ver­sion 3.01, may choose to up­grade the soft­ware li­cense to the PHP License, ver­sion 4 (i.e., the Modified BSD License). In an ef­fort to re­duce li­cense pro­lif­er­a­tion, you are dis­cour­aged from us­ing the name PHP License, ver­sion 4” as the li­cense name. If you need an SPDX iden­ti­fier, use BSD-3-Clause.

Historically, many extensions uploaded to PECL were licensed under the PHP License, version 3.01. Indeed, one of the suggestions for publishing a PECL package is: “We strongly encourage contributors to choose the PHP License 3.01 for their extensions, in order to avoid possible troubles for end-users of the extension. Other solid options are BSD and Apache type licenses.”57)

The “possible troubles” mentioned here almost always arise from use of a copyleft license like the GPL. The FSF considers the combination of PHP extensions and the PHP software a single combined program.58) As a result, licensing a PHP extension with the GPL leads to a confusing state that is especially problematic for distributors.

New PHP ex­ten­sions and other soft­ware should not use the PHP License. Recommended li­censes in­clude, but are not lim­ited to (in al­pha­bet­i­cal or­der):

Did RMS come to terms with the PHP/Zend li­cens­ing struc­ture?59) 60)

This in­di­cates there was a dis­agree­ment be­tween the PHP main­tain­ers and Richard Stallman (a. k. a. RMS) at some point prior to May 2001. However, the full na­ture of this dis­agree­ment is un­known, as there is no record of it on pub­lic mail­ing lists or fo­rums.

In an ar­ti­cle pub­lished in 2004, Sean Michael Kerner quoted Gutmans, who ref­er­enced past ex­changes with RMS, con­cern­ing the PHP li­cense.

Gutmans said he has exchanged e-mails with FSF founder Richard Stallman in the past on such issues. “We definitely don’t see eye to eye on the issue of licensing. He [Richard Stallman] doesn’t like our licensing and we know that,” Gutmans said. “We’re aware of each other, but the PHP project has no intention of moving to some sort of GPL license.”61)

In this same interview, Gutmans expounded on his philosophy regarding users’ rights when using PHP: “We like the fact that it (PHP) is very open. It’s a long discussion about what Free really means. When I think of free, my users can do whatever they want.” He continued, “Most of PHP’s user base are people that are using PHP to make a living and they wouldn’t care less [about the GPL]. They are just happy that it’s a PHP license and they can do whatever they want with it and can ship it with their commercial products.”

...

Read the original on wiki.php.net »

5 235 shares, 9 trendiness

How Increasing Input Tokens Impacts LLM Performance

Recent de­vel­op­ments in LLMs show a trend to­ward longer con­text win­dows, with the in­put to­ken count of the lat­est mod­els reach­ing the mil­lions. Because these mod­els achieve near-per­fect scores on widely adopted bench­marks like Needle in a Haystack (NIAH) [1], it’s of­ten as­sumed that their per­for­mance is uni­form across long-con­text tasks.

However, NIAH is fun­da­men­tally a sim­ple re­trieval task, in which a known sen­tence (the needle”) is placed in a long doc­u­ment of un­re­lated text (the haystack”), and the model is prompted to re­trieve it. While scal­able, this bench­mark typ­i­cally as­sesses di­rect lex­i­cal match­ing, which may not be rep­re­sen­ta­tive of flex­i­ble, se­man­ti­cally ori­ented tasks.

We extend the standard NIAH task to investigate model behavior in previously underexplored settings. We examine the effects of needles with semantic rather than direct lexical matches, as well as the effects of introducing variations to the haystack content.

Additionally, we in­clude a con­ver­sa­tional ques­tion-an­swer eval­u­a­tion us­ing LongMemEval [2], as well as a syn­thetic task in which mod­els repli­cate a se­ries of re­peated words. Each task re­mains in­ten­tion­ally sim­ple and is de­lib­er­ately con­trolled to iso­late the im­pact of con­text length alone.

We demon­strate that even un­der these min­i­mal con­di­tions, model per­for­mance de­grades as in­put length in­creases, of­ten in sur­pris­ing and non-uni­form ways. Real-world ap­pli­ca­tions typ­i­cally in­volve much greater com­plex­ity, im­ply­ing that the in­flu­ence of in­put length may be even more pro­nounced in prac­tice.

Our in-depth tech­ni­cal re­port con­tin­ues be­low. If you find our work use­ful, please con­sider cit­ing us:

Interested in work­ing on im­prov­ing re­trieval for AI ap­pli­ca­tions? Chroma is Hiring

It is com­mon for mod­ern LLMs to have in­put con­text lengths in the mil­lions of to­kens. Gemini 1.5 Pro [3] first in­tro­duced their 1M con­text win­dow in early 2024, fol­lowed by the re­cent GPT-4.1’s 1M con­text win­dow [4] and Llama 4 with 10M [5]. The use case for long con­text is com­pelling: longer con­text means that the LLM can process more in­for­ma­tion with each call and gen­er­ate more in­formed out­puts.

Long con­text eval­u­a­tions for these mod­els of­ten demon­strate con­sis­tent per­for­mance across in­put lengths. However, these eval­u­a­tions are nar­row in scope and not rep­re­sen­ta­tive of how long con­text is used in prac­tice. The most com­monly used test, Needle in a Haystack (NIAH), is a sim­ple lex­i­cal re­trieval task of­ten used to gen­er­al­ize a mod­el’s abil­ity to re­li­ably han­dle long con­text. Real ap­pli­ca­tions, such as agent tasks or sum­ma­riza­tion, de­mand sig­nif­i­cantly more pro­cess­ing and rea­son­ing over broader, of­ten more am­bigu­ous in­for­ma­tion.

Designing re­al­is­tic long con­text bench­marks is chal­leng­ing. Tasks of­ten grow in com­plex­ity as in­put length in­creases, mak­ing it dif­fi­cult to iso­late whether per­for­mance drops are due to longer in­puts or in­her­ently harder prob­lems. To ad­dress this, our ex­per­i­ments hold task com­plex­ity con­stant while vary­ing only the in­put length—al­low­ing us to di­rectly mea­sure the ef­fect of in­put length alone.

We pre­sent the fol­low­ing:

* An eval­u­a­tion across 18 LLMs, in­clud­ing lead­ing closed-source and open-weights mod­els, re­veal­ing nonuni­form per­for­mance with in­creas­ing in­put length.

* A writeup of ob­served model-spe­cific be­hav­ior pat­terns when han­dling dis­trac­tors and vary­ing ques­tion-an­swer sim­i­lar­ity.

* The com­plete code­base to repli­cate our re­sults.

One of the most widely used bench­marks for eval­u­at­ing a mod­el’s long con­text ca­pa­bil­i­ties is Needle in a Haystack (NIAH). While use­ful as a scal­able test, it mea­sures a nar­row ca­pa­bil­ity: lex­i­cal re­trieval. Models typ­i­cally per­form well on NIAH, which has led to the per­cep­tion that long-con­text is largely solved.

However, NIAH underestimates what most long context tasks require in practice. Variants of NIAH, like NoLiMa [6], which includes needle-question pairs with non-lexical matches, reveal significant performance drops. Other tasks of apparently similar difficulty, such as AbsenceBench [7], which tests whether models can recognize the absence of a given snippet of text, also demonstrate performance degradation with growing input length.

More com­plex bench­marks, such as Multi-round co-ref­er­ence res­o­lu­tion (MRCR) [8], Graphwalks [9], and Latent List [10], fur­ther high­light per­for­mance degra­da­tion with long in­puts. MRCR com­bines var­i­ous sub­tasks into one: iden­ti­fy­ing rel­e­vant parts, dis­am­biguat­ing amongst dis­trac­tors, rea­son­ing about the or­der of nee­dles, and repli­cat­ing text. In or­der to at­tribute the re­ported model fail­ures to in­creased in­put length, one must as­sume that the model is equally com­pe­tent at each sub­task. However, this as­sump­tion has not been thor­oughly tested; it may be that the model fails at one spe­cific sub­task, or a unique com­bi­na­tion of a few, with in­creas­ing in­put length. The com­pos­ite na­ture of this task makes it dif­fi­cult to sys­tem­at­i­cally eval­u­ate ex­actly where and how mod­els fail with long con­text.

Graphwalks is a graph tra­ver­sal task in which the model is given a di­rected graph com­posed of hexa­dec­i­mal hashes, then asked to per­form breadth-first search start­ing from a ran­dom node. In this case, in­creas­ing in­put length means in­creas­ing the size of the graph to tra­verse through, which in­creases task dif­fi­culty as a re­sult. Latent List pre­sents a sim­i­lar chal­lenge that scales with in­put length: the model is given a se­quence of Python list op­er­a­tions, then asked to out­put the re­sult­ing list.

It is difficult to disentangle increasing task complexity from increasing input length, which makes it hard to isolate the effect of input length alone on performance. There remains a lack of evaluations that isolate input length as the variable of interest, limiting our understanding of how LLMs actually behave with long inputs.

The clas­sic Needle in a Haystack task in­volves plac­ing a ran­dom fact (the needle’) in the mid­dle of a long con­text win­dow (the haystack’), then ask­ing the model about that fact.

The orig­i­nal im­ple­men­ta­tion of this task uses a nee­dle-ques­tion pair with lex­i­cal matches. However, us­age of long con­text in prac­tice of­ten re­quires se­man­tic un­der­stand­ing of am­bigu­ous tasks.

NoLiMa has demon­strated non-lex­i­cal match­ing to be a chal­lenge for mod­els as con­text length in­creases. This task uti­lizes nee­dle-ques­tion pairs that re­quire mod­els to in­fer la­tent as­so­ci­a­tions, for ex­am­ple:

In or­der to an­swer this ques­tion, the model would first have to know that Kiasma mu­seum is lo­cated in Helsinki, then make that la­tent as­so­ci­a­tion link. This tests the model not only for its non-lex­i­cal match­ing abil­i­ties, but also for its world knowl­edge. 72.4% of nee­dle-ques­tion pairs from NoLiMa re­quire such ex­ter­nal knowl­edge, mak­ing this bench­mark closer to a test of how mod­els han­dle both tasks at once rather than pure non-lex­i­cal match­ing alone.

Testing the impact of non-lexical matching in isolation remains underexplored. Furthermore, this binary distinction of “lexical” versus “non-lexical” oversimplifies the complexity of question-answering in real-world scenarios. Needle-question pairs exist on a spectrum of similarity, yet they are all classified under these broad categories.

Models of­ten have to deal with dis­trac­tors as well, which has been shown to de­grade per­for­mance [11].

Throughout this re­port, we dis­tin­guish be­tween dis­trac­tors and ir­rel­e­vant con­tent:

* Distractors are top­i­cally re­lated to the nee­dle, but do not quite an­swer the ques­tion

* Irrelevant con­tent is un­re­lated to the nee­dle and ques­tion

Prior work has demon­strated that dis­trac­tors have non-uni­form im­pact, yet most eval­u­a­tions in­volve short in­put lengths and older mod­els. Current state-of-the-art mod­els are claimed to be more re­silient to dis­trac­tors, yet their per­for­mance has not been ex­ten­sively tested across var­i­ous in­put lengths.

Another un­der­ex­plored as­pect of NIAH is the haystack it­self, which is of­ten sim­ply treated as a means of scal­ing in­put length, but this as­sumes that the haystack con­tent it­self has no ef­fect on task per­for­mance. If the model is in­deed in­sen­si­tive to the con­tent of the haystack, then vary­ing this con­tent, for ex­am­ple the haystack’s topic or nar­ra­tive flow, should have no in­flu­ence on the re­sults. However, this as­sump­tion re­mains largely untested.

We de­sign four con­trolled ex­per­i­ments to in­ves­ti­gate the in­flu­ence of these fac­tors:

We com­pute the co­sine sim­i­lar­ity be­tween nee­dle-ques­tion pairs us­ing em­bed­dings. For ro­bust­ness, we av­er­age across five em­bed­ding mod­els: text-em­bed­ding-3-small, text-em­bed­ding-3-large, jina-em­bed­dings-v3, voy­age-3-large, and all-MiniLM-L6-v2. We mea­sure how model per­for­mance is im­pacted by nee­dle-ques­tion sim­i­lar­ity as in­put length in­creases.

Taking a high-sim­i­lar­ity nee­dle-ques­tion pair, we write four dis­trac­tors. We have the fol­low­ing se­tups:

We test the im­pact of dis­trac­tors on model per­for­mance as in­put length in­creases to mea­sure non-uni­for­mity amongst dis­trac­tors and in­put lengths.

We use two the­mat­i­cally dis­tinct haystacks, Paul Graham es­says and arXiv pa­pers [12], and write cor­re­spond­ing nee­dles for each. To mea­sure nee­dle-haystack sim­i­lar­ity, we em­bed the haystack and re­trieve the top-5 chunks for each nee­dle, then av­er­age their co­sine sim­i­lar­ity scores. This process is re­peated across five dif­fer­ent em­bed­ding mod­els for ro­bust­ness.

In typ­i­cal NIAH se­tups, haystacks are con­cate­na­tions of co­her­ent texts, each with their own log­i­cal flow of ideas. For in­stance, the orig­i­nal NIAH bench­mark uses a se­ries of Paul Graham es­says, where each es­say fol­lows a struc­tured or­ga­ni­za­tion of ideas to form an ar­gu­ment. To eval­u­ate whether this struc­ture in­flu­ences model per­for­mance, we com­pare two con­di­tions:

* Original: pre­serves the nat­ural flow of ideas within each ex­cerpt

* Shuffled: sen­tences are ran­domly re­ordered through­out the haystack to main­tain the same over­all topic with­out log­i­cal con­ti­nu­ity

We demon­strate the fol­low­ing:

* Across all ex­per­i­ments, model per­for­mance con­sis­tently de­grades with in­creas­ing in­put length.

* Distractors have non-uni­form im­pact on model per­for­mance with re­gards to how dis­tract­ing they are rel­a­tive to each other. We see this im­pact more promi­nently as in­put length in­creases, and ob­serve dis­tinc­tions in how var­i­ous mod­els re­spond to them.

* Needle-haystack sim­i­lar­ity does not have a uni­form ef­fect on model per­for­mance, sug­gest­ing the need for fur­ther in­ves­ti­ga­tion.

* The struc­tural pat­tern of the haystack con­sis­tently shows an im­pact on how mod­els process long in­puts.

For every unique com­bi­na­tion of nee­dle type, haystack topic, and haystack struc­ture, we test each model across:

We evaluate each model across its maximum context window with temperature=0 unless that setting is incompatible (e.g., o3) or explicitly discouraged (e.g., Qwen’s “thinking mode”). For Qwen models, we apply the YaRN method [13] to extend the context window from 32,768 to 131,072 tokens.

We include models in both standard and “thinking mode” where applicable.

We evaluate model outputs using an aligned GPT-4.1 judge, using our method outlined in the appendix.

We note some rare instances of a model refusing to attempt the task (69 out of 194,480 total LLM calls—0.035%), which we exclude from our results and report separately in our appendix. For example, Claude Opus 4 may sometimes produce an empty output with stop_reason="refusal".

In real-world ap­pli­ca­tions, mod­els are of­ten ex­pected to han­dle am­bigu­ous tasks and iden­tify rel­e­vant in­for­ma­tion with­out re­ly­ing on ex­act lex­i­cal matches. For ex­am­ple, when an agent is given a task in­volv­ing a large cor­pus to search through, users rarely spec­ify pre­cise key­words for rel­e­vant parts. Instead, the model must in­fer rel­e­vance.

We vary the sim­i­lar­ity of our nee­dle-ques­tion pairs, quan­ti­fied by the co­sine sim­i­lar­ity of their em­bed­dings. We find that as nee­dle-ques­tion sim­i­lar­ity de­creases, model per­for­mance de­grades more sig­nif­i­cantly with in­creas­ing in­put length. This re­flects more re­al­is­tic sce­nar­ios where ex­act ques­tion-an­swer matches are rare, and se­man­tic am­bi­gu­ity com­pounds the chal­lenge of long in­put pro­cess­ing.

We source our haystack con­tent from two do­mains: Paul Graham es­says (as in the orig­i­nal NIAH ex­per­i­ment), and arXiv pa­pers. For each haystack topic (PG es­says, arXiv), we first de­ter­mine com­mon themes to guide our ques­tion and nee­dle writ­ing.

We use clus­ter­ing to iden­tify the most com­mon top­ics that ap­pear for a given cor­pus:

1. Use UMAP [14] for dimensionality reduction with the following parameters: n_neighbors=30, min_dist=0.05, n_components=50, random_state=42
2. Use HDBSCAN [15] to create clusters with the following parameters: min_cluster_size=10, min_samples=15
3. Get 20 representative chunks for the largest clusters using maximal marginal relevance (MMR)
4. Manually examine the largest clusters to determine their themes and style
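
A minimal sketch of steps 1 and 2 under the parameters listed above; the precomputed embedding array and its file name are assumptions, and the MMR-based chunk selection of step 3 is omitted:

```python
# Sketch of the clustering pipeline: UMAP dimensionality reduction followed by
# HDBSCAN clustering over precomputed haystack chunk embeddings.
import numpy as np
import umap
import hdbscan

# Hypothetical file of (num_chunks, dim) chunk embeddings computed beforehand.
chunk_embeddings = np.load("haystack_chunk_embeddings.npy")

# Dimensionality reduction with the parameters quoted above.
reduced = umap.UMAP(
    n_neighbors=30,
    min_dist=0.05,
    n_components=50,
    random_state=42,
).fit_transform(chunk_embeddings)

# Density-based clustering with the quoted parameters.
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=15)
labels = clusterer.fit_predict(reduced)

# Inspect the largest clusters (label -1 is HDBSCAN noise).
cluster_ids, sizes = np.unique(labels[labels != -1], return_counts=True)
for cid in cluster_ids[np.argsort(sizes)[::-1][:5]]:
    print(f"cluster {cid}: {np.sum(labels == cid)} chunks")
```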

Using this method, we iden­tify writ­ing ad­vice as a com­mon topic for PG es­says, of­ten in anec­do­tal form. For arXiv pa­pers, we iden­tify in­for­ma­tion re­trieval as a com­mon topic, specif­i­cally re-rank­ing.

We write a cor­re­spond­ing ques­tion for each topic:

Before writ­ing our nee­dles, we ver­ify that an­swers to these ques­tions do not ex­ist in the haystack con­tent:

1. We store our previously computed haystack chunk embeddings in a vector database.
2. Query the top-10 results from that vector database with our question embedding.
3. Manually examine these results to verify that they do not answer the given question.
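
A minimal sketch of steps 2 and 3, assuming the chunk embeddings live in a Chroma collection; the collection name, query text, and use of all-MiniLM-L6-v2 here are illustrative assumptions rather than the authors' exact setup:

```python
# Sketch of the verification step: query the stored haystack chunks with the
# question embedding and surface the top-10 chunks for manual review.
import chromadb
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

client = chromadb.Client()
# Assumes chunks were previously added with collection.add(ids=..., documents=..., embeddings=...).
collection = client.get_or_create_collection("haystack_chunks")

question = "What was the best writing advice the author received?"  # placeholder
results = collection.query(
    query_embeddings=[encoder.encode(question).tolist()],
    n_results=10,
)

# Manually examine these chunks to confirm none of them answers the question.
for doc in results["documents"][0]:
    print(doc[:120], "...")
```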

This sets up a fair test­ing en­vi­ron­ment as it en­sures that al­ter­na­tive an­swers do not ex­ist, and any in­cor­rect an­swers are due to model hal­lu­ci­na­tions.

For each ques­tion, we write 8 nee­dles that each be­long to the large clus­ter which we ver­ify us­ing ap­prox­i­mate pre­dic­tions. Needles that be­long to the writ­ing/​re­trieval clus­ter with >0.9 prob­a­bil­ity are con­sid­ered to top­i­cally blend into the haystack. We man­u­ally write these nee­dles to avoid data con­t­a­m­i­na­tion.

For the 8 nee­dles, we also vary the level of am­bi­gu­ity, quan­ti­fied through the fol­low­ing method:

Using an em­bed­ding model, we com­pute em­bed­dings for nee­dle and ques­tion and their co­sine sim­i­lar­ity. Repeat across five em­bed­ding mod­els (text-embedding-3-small, text-em­bed­ding-3-large, jina-em­bed­dings-v3, voy­age-3-large, and all-MiniLM-L6-v2).
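
As an illustration, a minimal sketch of this computation with one of the five listed models (all-MiniLM-L6-v2); the report averages the score over all five models, and the needle and question strings below are placeholders:

```python
# Sketch: cosine similarity between a needle and a question using one of the
# listed embedding models. The actual score is averaged over five models.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

needle = "The best writing advice I ever received was to read every draft aloud."  # placeholder
question = "What was the best writing advice the author received?"                  # placeholder

needle_vec, question_vec = model.encode([needle, question])
similarity = float(
    np.dot(needle_vec, question_vec)
    / (np.linalg.norm(needle_vec) * np.linalg.norm(question_vec))
)
print(f"needle-question similarity: {similarity:.3f}")
```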

For the PG es­says topic, our nee­dles range from 0.445-0.775 nee­dle-ques­tion sim­i­lar­ity with

We ob­serve a clear pat­tern that per­for­mance de­grades more quickly in in­put length with lower sim­i­lar­ity nee­dle-ques­tion pairs.

At short in­put lengths, the mod­els per­form well even on low-sim­i­lar­ity pairs. We see this most clearly in the high/​medium-per­for­mance mod­els, demon­strat­ing that these mod­els are ca­pa­ble of suc­ceed­ing at this task for all nee­dle-ques­tion pairs.

We note that lower-per­for­mance mod­els (i.e., older or smaller mod­els like GPT-4.1 nano) per­form poorly over­all, start­ing from a lower base­line for low-sim­i­lar­ity pairs. Our analy­sis fo­cuses pri­mar­ily on higher-per­form­ing mod­els as they are more rep­re­sen­ta­tive of what is com­monly used in prac­tice.

The ob­served per­for­mance degra­da­tion at longer in­put lengths is not due to the in­trin­sic dif­fi­culty of the nee­dle-ques­tion pair­ing. By hold­ing the nee­dle-ques­tion pair fixed and vary­ing only the amount of ir­rel­e­vant con­tent, we iso­late in­put size as the pri­mary fac­tor in per­for­mance de­cline.

We also ex­am­ine whether nee­dle po­si­tion in­flu­ences per­for­mance. Testing across 11 nee­dle po­si­tions, we find no no­table vari­a­tion in per­for­mance for this spe­cific NIAH task.

It has al­ready been es­tab­lished with older mod­els that dis­trac­tors de­grade model per­for­mance and have non-uni­form im­pact. Newer mod­els are claimed to re­li­ably han­dle any dis­trac­tor, but does this hold true as in­put length in­creases?

Our ex­per­i­ments re­veal that the im­pact of dis­trac­tors and their non-uni­for­mity am­pli­fies as in­put length grows across mod­els, in­clud­ing the lat­est state-of-the-art mod­els. We also ob­serve dis­tinct be­hav­iors across model fam­i­lies in how they deal with am­bi­gu­ity.

From each haystack topic (PG es­says and arXiv pa­pers), we take a nee­dle with high nee­dle-ques­tion sim­i­lar­ity (second high­est out of eight), and man­u­ally write 4 dis­trac­tors:

Instead of test­ing all eight nee­dles with dis­trac­tors, we use one nee­dle with high nee­dle-ques­tion sim­i­lar­ity to cre­ate a con­di­tion in which the nee­dle should be rel­a­tively easy to iden­tify. We see from pre­vi­ous re­sults that mod­els gen­er­ally per­form well on this nee­dle across in­put lengths due to high nee­dle-ques­tion sim­i­lar­ity, which al­lows us to bet­ter iso­late and mea­sure the im­pact of dis­trac­tors alone.

* Multiple dis­trac­tors: Needle + all four dis­trac­tors, ran­domly po­si­tioned through­out the haystack
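
A rough sketch of how such a condition can be assembled, splicing the needle and distractors into the haystack at random positions; the sentence splitting and example strings are assumptions, not the authors' exact stimuli:

```python
# Sketch: insert a needle and several distractors into a haystack at random
# sentence boundaries. Strings are placeholders.
import random

def build_haystack(haystack_sentences, needle, distractors, seed=42):
    """Return haystack text with the needle and distractors spliced in at random positions."""
    rng = random.Random(seed)
    sentences = list(haystack_sentences)
    for snippet in [needle, *distractors]:
        position = rng.randint(0, len(sentences))
        sentences.insert(position, snippet)
    return " ".join(sentences)

haystack_sentences = ["Sentence one of an essay.", "Sentence two of an essay."]   # placeholder
needle = "The best writing advice is to read your draft aloud."                   # placeholder
distractors = ["A friend once claimed outlining is the best writing advice."]     # placeholder

print(build_haystack(haystack_sentences, needle, distractors))
```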

Even a sin­gle dis­trac­tor re­duces per­for­mance rel­a­tive to the base­line (needle only), and adding four dis­trac­tors com­pounds this degra­da­tion fur­ther.

We are also able to see that dis­trac­tors do not have uni­form im­pact. For ex­am­ple, in our arXiv haystack and writ­ing nee­dle com­bi­na­tion, we can see that dis­trac­tor 3 (red) causes greater per­for­mance de­cline rel­a­tive to the other dis­trac­tors.

To fur­ther in­ves­ti­gate this non-uni­form im­pact, we an­a­lyze the failed at­tempts of var­i­ous mod­els in the 4-distractor con­di­tion. For the arXiv haystack and writ­ing nee­dle com­bi­na­tion, we see that dis­trac­tors 2 and 3 ap­pear most fre­quently in hal­lu­ci­nated re­sponses across mod­els.

These fail­ures also re­veal model-spe­cific dif­fer­ences in han­dling am­bi­gu­ity. Claude mod­els con­sis­tently ex­hibit the low­est hal­lu­ci­na­tion rates. Specifically, Claude Sonnet 4 and Opus 4 are par­tic­u­larly con­ser­v­a­tive and tend to ab­stain when un­cer­tain, ex­plic­itly stat­ing that no an­swer can be found. In con­trast, GPT mod­els show the high­est rates of hal­lu­ci­na­tion, of­ten gen­er­at­ing con­fi­dent but in­cor­rect re­sponses when dis­trac­tors are pre­sent.

In long-con­text tasks, ir­rel­e­vant con­text is of­ten treated as a neu­tral place­holder to scale up in­put length. It’s typ­i­cally as­sumed that the con­tent of this ir­rel­e­vant con­text does­n’t mat­ter, as long as it does­n’t di­rectly in­ter­fere with the task.

However, a nat­ural ques­tion arises: does the nee­dle-haystack sim­i­lar­ity in­flu­ence task dif­fi­culty at all? Intuitively, if the nee­dle blends in with the con­tent of the haystack, the model may have greater dif­fi­culty in ex­tract­ing the nee­dle.

Our find­ings re­veal that nee­dle-haystack sim­i­lar­ity has a non-uni­form ef­fect on model per­for­mance.

Using the nee­dles from our nee­dle-ques­tion sim­i­lar­ity ex­per­i­ment, we set up our ex­per­i­ment to test the im­pact of nee­dle-haystack sim­i­lar­ity.

We mea­sure nee­dle-haystack sim­i­lar­ity by em­bed­ding the haystack and re­triev­ing the top five most sim­i­lar chunks for each nee­dle, then av­er­ag­ing their co­sine sim­i­lar­ity scores. This process is re­peated across five dif­fer­ent em­bed­ding mod­els for ro­bust­ness.
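
A minimal sketch of this score with a single embedding model (the report averages over five); the chunking and example strings are placeholders:

```python
# Sketch: needle-haystack similarity as the mean cosine similarity of the
# top-5 most similar haystack chunks, using one embedding model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

haystack_chunks = ["First chunk of essay text...", "Second chunk of essay text..."]  # placeholders
needle = "The best writing advice is to read your draft aloud."                      # placeholder

chunk_vecs = model.encode(haystack_chunks, normalize_embeddings=True)
needle_vec = model.encode(needle, normalize_embeddings=True)

similarities = chunk_vecs @ needle_vec          # cosine similarity (vectors are normalized)
top5 = np.sort(similarities)[::-1][:5]
print(f"needle-haystack similarity: {top5.mean():.3f}")
```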

In the PG es­say haystack, PG es­say nee­dles have an av­er­age nee­dle-haystack sim­i­lar­ity score of 0.529 with a vari­a­tion of 0.101, while arXiv nee­dles av­er­age 0.368 nee­dle-haystack sim­i­lar­ity with a vari­a­tion of 0.111. Conversely, in the arXiv haystack, arXiv nee­dles av­er­age 0.654 nee­dle-haystack sim­i­lar­ity with a vari­a­tion of 0.0858, whereas PG-essay nee­dles score lower at 0.394 nee­dle-haystack sim­i­lar­ity with a vari­a­tion of 0.105.

On each haystack, we test se­man­ti­cally sim­i­lar nee­dles against un­re­lated nee­dles. For in­stance, we place both PG es­say and arXiv nee­dles within a Paul Graham es­say haystack to com­pare the two con­di­tions:

We test both writ­ing and arXiv nee­dles in two haystack types: Paul Graham es­says and arXiv pa­pers. In the Paul Graham es­say haystack, arXiv nee­dles per­form sig­nif­i­cantly bet­ter rel­a­tive to the writ­ing nee­dles; in other words, mod­els per­form bet­ter when the nee­dle does not se­man­ti­cally blend in with its haystack. In the arXiv haystack, how­ever, we ob­serve only min­i­mal per­for­mance dif­fer­ences be­tween our arXiv and writ­ing nee­dles.

Testing across only two top­ics is in­suf­fi­cient to draw a gen­er­al­iz­able con­clu­sion that higher nee­dle-haystack sim­i­lar­ity de­grades model per­for­mance on this task. This does high­light, how­ever, the non-uni­form na­ture of long-con­text pro­cess­ing. Even when task struc­ture and nee­dle-ques­tion sim­i­lar­ity are held con­stant, chang­ing the se­man­tic sim­i­lar­ity be­tween the nee­dle and the haystack can in­flu­ence re­sults. This points to an un­der­ex­plored area in long-con­text bench­marks and a mean­ing­ful di­rec­tion for fu­ture re­search.

Aside from nee­dle-haystack sim­i­lar­ity, we also con­sider the struc­tural pat­tern of the haystack.

If the haystack is com­posed of co­her­ent es­says, a ran­domly in­serted nee­dle may dis­rupt the log­i­cal flow of ideas, mak­ing it more no­tice­able. In con­trast, in a shuf­fled haystack of ran­domly or­dered sen­tences, the nee­dle may blend in more eas­ily since the over­all con­text lacks struc­ture. This fol­lows the as­sump­tion that mod­els are sen­si­tive to the log­i­cal flow of con­text—pro­cess­ing it in a struc­tured, or­der-sen­si­tive man­ner.

Although it seems coun­ter­in­tu­itive, mod­els per­form worse when the haystack pre­serves a log­i­cal flow of ideas. Shuffling the haystack and re­mov­ing lo­cal co­her­ence con­sis­tently im­proves per­for­mance.

To as­sess the im­pact of haystack struc­ture, we cre­ate two vari­ants:

* Original: preserves the natural flow of ideas within each excerpt

* Shuffled: sentences are randomly reordered throughout the haystack to maintain the same overall topic but without logical continuity
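
A minimal sketch of how the shuffled condition can be constructed; the sentence splitting is naive and the text is a placeholder, not the authors' exact preprocessing:

```python
# Sketch of the shuffled-haystack condition: split the haystack into sentences
# and randomly reorder them, keeping the topic but breaking local coherence.
import random
import re

def shuffle_haystack(text, seed=42):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return " ".join(sentences)

original = "Essays start with a question. The question leads to a draft. The draft gets rewritten."  # placeholder
print(shuffle_haystack(original))
```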

Across all 18 mod­els and nee­dle-haystack con­fig­u­ra­tions, we ob­serve a con­sis­tent pat­tern that mod­els per­form bet­ter on shuf­fled haystacks than on log­i­cally struc­tured ones.

Logically, one might ex­pect re­trieval to be­come more dif­fi­cult when the nee­dle se­man­ti­cally blends in with the haystack or when it dis­rupts the lo­cal co­her­ence of the sur­round­ing text. Yet our find­ings show that mod­els are not con­sis­tently af­fected by topic blend­ing. Instead, they are more sen­si­tive to whether the haystack main­tains a log­i­cal struc­ture.

These re­sults may have some im­pli­ca­tions for the mod­el’s in­ter­nal pro­cess­ing: struc­tural pat­terns of in­puts could in­flu­ence how the at­ten­tion mech­a­nism is ap­plied, par­tic­u­larly as in­put length in­creases.

While out of scope for this re­port, this points to a po­ten­tial di­rec­tion for in­ter­pretabil­ity re­search in how at­ten­tion is in­flu­enced by in­put struc­ture. Understanding these struc­tural in­flu­ences that arise with in­creased in­put length could help ex­plain these long con­text fail­ure pat­terns.

To eval­u­ate these mod­els in a more re­al­is­tic set­ting, we use LongMemEval, a long-con­text bench­mark for con­ver­sa­tional ques­tion-an­swer­ing.

Using long inputs for chat assistants is a common approach for maintaining relevant history for subsequent chats. To incorporate “memory” into a chat assistant, a naive approach would be to include the full chat history in the prompt for following chats. This requires the model to perform two tasks, typically performed in one call: find relevant parts of the conversation history (retrieval), then synthesize them in a way that is useful to an incoming query (reasoning).

In an ideal case, the model would be given only the rel­e­vant parts so it can fo­cus solely on rea­son­ing. Adding ir­rel­e­vant con­text adds the ad­di­tional step of iden­ti­fy­ing what is rel­e­vant, forc­ing the model to per­form two tasks si­mul­ta­ne­ously.

We sys­tem­at­i­cally test the ef­fect of adding this ad­di­tional step with in­creased in­put length through two con­di­tions:

* Focused input: contains only the relevant parts, so the model just has to do simple reasoning (both conditions are sketched below).

* Full input: uses the full 113k-token LongMemEval input, which includes irrelevant context. In this case, the model has to perform retrieval across the long context in addition to reasoning.
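
Concretely, the two conditions amount to prompting with or without the irrelevant history; the prompt format and field names below are assumptions for illustration:

```python
def make_prompt(question: str, sessions: list[str], relevant_ids: set[int],
                focused: bool) -> str:
    """Build the focused prompt (relevant sessions only) or the full ~113k-token prompt."""
    if focused:
        history = [s for i, s in enumerate(sessions) if i in relevant_ids]
    else:
        history = sessions  # retrieval over the irrelevant context is left to the model
    context = "\n\n".join(history)
    return f"Conversation history:\n{context}\n\nQuestion: {question}\nAnswer:"
```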

We ver­ify that the mod­els are highly ca­pa­ble of suc­ceed­ing on the fo­cused in­puts, then ob­serve con­sis­tent per­for­mance degra­da­tion with the full in­puts. This per­for­mance drop sug­gests that adding ir­rel­e­vant con­text, and thereby adding an ad­di­tional step of re­trieval, sig­nif­i­cantly im­pacts a mod­el’s abil­ity to main­tain re­li­able per­for­mance.

...

Read the original on research.trychroma.com »

6 224 shares, 50 trendiness

Shoggoth Mini

Over the past year, ro­bot­ics has been catch­ing up with the LLM era. Pi’s π0.5 can clean un­seen homes. Tesla’s Optimus can fol­low nat­ural lan­guage cook­ing in­struc­tions. These sys­tems are ex­tremely im­pres­sive, but they feel stuck in a util­i­tar­ian mind­set of ro­botic ap­pli­ances. For these fu­ture ro­bots to live with us, they must be ex­pres­sive. Expressiveness com­mu­ni­cates in­ter­nal state such as in­tent, at­ten­tion, and con­fi­dence. Beyond its func­tional util­ity as a com­mu­ni­ca­tion chan­nel, ex­pres­sive­ness makes in­ter­ac­tions feel nat­ural. Without it, you get the text­book un­canny val­ley ef­fect.

Earlier this year, I came across Apple’s ELEGNT pa­per, which frames this idea rig­or­ously through a Pixar-like lamp to show how pos­ture and tim­ing alone can con­vey in­ten­tion. Around the same time, I dis­cov­ered SpiRobs, a soft ten­ta­cle ro­bot that feels oddly alive with just sim­ple move­ments. One sys­tem was care­fully de­signed to ex­press in­tent while the other just moved, yet some­how felt like it had in­tent. That dif­fer­ence was in­ter­est­ing. I started build­ing Shoggoth Mini as a way to ex­plore it more di­rectly. Not with a clear goal, but to see what would hap­pen if I pushed em­bod­i­ment into stranger ter­ri­tory. This post re­traces that process, the happy ac­ci­dents, and what I learned about build­ing ro­bots.

The first chal­lenge was cre­at­ing a test­bed to ex­plore the con­trol of SpiRobs. I started very sim­ple: a plate to hold three mo­tors, and a dome to lift the ten­ta­cle above them. This setup was­n’t meant to be the fi­nal de­sign, only a plat­form for quick ex­per­i­men­ta­tion. However, halfway through 3D print­ing, I ran out of black fil­a­ment and had to fin­ish the dome in grey. This made it look like the dome had a mouth. When my flat­mate saw it sit­ting on my desk, he grabbed a marker and drew some eyes. It looked good: cute, weird, slightly un­set­tling. I used ChatGPT to ex­plore ren­ders, and de­cided that this ac­ci­dent would be­come the form fac­tor.

Later, I mounted stereo cam­eras on the dome to track the ten­ta­cle. Robot eyes are eerie. You keep ex­pect­ing move­ment, but noth­ing ever hap­pens. That pre­dic­tion er­ror fo­cuses at­ten­tion even more.

The orig­i­nal open-spool de­sign re­lied on con­stant ca­ble ten­sion, but any slight per­tur­ba­tion (such as test­ing a buggy new pol­icy) would make the ca­bles leave the spool and tan­gle around the mo­tor shafts. The process to fix it re­quired un­ty­ing the knot at the tip hold­ing the ca­bles to­gether, and dis­man­tling the whole ro­bot. Adding sim­ple spool cov­ers elim­i­nated most tan­gles and made it­er­a­tion dra­mat­i­cally faster.

Another key step was adding a cal­i­bra­tion script and pre-rolling ex­tra wire length. This made it pos­si­ble to:

* Unroll and reroll the ca­bles to open the ro­bot with­out hav­ing to un­tie the tip knot, speed­ing up it­er­a­tion dra­mat­i­cally

* Calibrate ca­ble ten­sion pre­cisely and as of­ten as needed

* Give con­trol poli­cies slack to work with dur­ing mo­tion

Finally, as you can see in the video, the stan­dard 3-cable SpiRobs de­sign sags un­der its own weight. This makes con­sis­tent be­hav­ior hard to re­pro­duce. I had to thicken the spine just enough to pre­vent sag, but not so much that it would de­form per­ma­nently un­der high load.

You can ex­plore the cur­rent CAD as­sem­bly here, with all STL files for 3D print­ing in­cluded in the repo.

With the hard­ware ready, the next step was to feel how the ten­ta­cle moved. To sim­plify con­trol, I re­duced the ten­ta­cle’s three ten­don lengths (a 3D space) down to two in­tu­itive di­men­sions you can ma­nip­u­late with a track­pad.

Concretely, each of the three ten­dons has a prin­ci­pal pulling di­rec­tion in the 2D plane, form­ing a tri­an­gu­lar ba­sis that sums to zero. By pro­ject­ing the 2D cur­sor con­trol vec­tor onto each ten­don’s prin­ci­pal axis, you com­pute how much each ten­don should shorten or lengthen to align with the de­sired di­rec­tion.

$$s_i = \mathbf{u} \cdot \mathbf{v}_i$$

where:

* $\mathbf{u}$ is the 2D cursor control vector.

* $\mathbf{v}_i$ is the principal axis of tendon $i$.

Positive $s_i$ means short­en­ing the ten­don; neg­a­tive means length­en­ing it. In prac­tice, the cur­sor in­put is nor­mal­ized to keep the mo­tor com­mands in a rea­son­able range.

While this 2D map­ping does­n’t ex­pose the ten­ta­cle’s full con­fig­u­ra­tion space (there are in­ter­nal shapes it can­not reach), it is in­tu­itive. Anyone can im­me­di­ately move the ten­ta­cle by drag­ging on a track­pad, see­ing the tip fol­low the cur­sor in the same di­rec­tion.
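
A minimal sketch of this projection; the exact axis directions and gain are assumptions, since only the triangular, zero-sum layout is described above:

```python
import numpy as np

# Principal pulling directions of the three tendons in the 2D control plane.
# Spaced 120 degrees apart, they form a triangular basis that sums to zero.
TENDON_AXES = np.array([
    [np.cos(a), np.sin(a)]
    for a in (np.pi / 2, np.pi / 2 + 2 * np.pi / 3, np.pi / 2 + 4 * np.pi / 3)
])

def cursor_to_tendon_commands(cursor, gain: float = 1.0) -> np.ndarray:
    """Map a 2D cursor vector to three tendon length changes (positive = shorten)."""
    u = np.asarray(cursor, dtype=float)
    norm = np.linalg.norm(u)
    if norm > 1.0:
        u = u / norm  # normalize so motor commands stay in a reasonable range
    return gain * (TENDON_AXES @ u)  # s_i = u . v_i for each tendon
```

Because the axes sum to zero, the three commands also sum to zero: dragging toward one tendon shortens it while lengthening the other two.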

Unexpectedly, this sim­ple 2D-to-3D map­ping be­came the back­bone of the en­tire sys­tem. Later, all au­to­mated con­trol poli­cies, from hard­coded prim­i­tives to re­in­force­ment learn­ing, reused the same pro­jec­tion layer to out­put ac­tions.

The sys­tem has two con­trol lay­ers. Low-level con­trol uses both open-loop prim­i­tives (like ” or ”) and closed-loop RL poli­cies (like fin­ger-track­ing). The lat­ter de­pends on a spe­cial­ized stereo vi­sion pipeline, which tracks the ten­ta­cle tip and user hand po­si­tions. Initially, I con­sid­ered em­bed­ding in­ter­nal op­ti­cal sen­sors for pro­pri­o­cep­tion, but this proved im­prac­ti­cal with­out adding bulk, so I stuck with ex­ter­nal stereo vi­sion in­stead. While this works, it lim­its the us­able field of view. To ad­dress this, I im­ple­mented a some­what nat­ural-look­ing hom­ing be­hav­ior if the tip goes out of frame, and re­stricted the RL ob­ser­va­tion space to en­sure it re­mains vis­i­ble.

High-level con­trol lever­ages GPT-4o’s real-time API, which streams au­dio and text (vision is­n’t ex­posed yet). GPT-4o con­tin­u­ously lis­tens to speech through the au­dio stream, while stereo vi­sion is processed lo­cally to de­tect high-level vi­sual events—like hand waves or prox­im­ity trig­gers—which are sent to GPT-4o as text cues (“” or ”). GPT-4o then de­cides, zero-shot, which low-level API calls to make. This fol­lows the ap­proach shown in DeepMind’s Gemini Robotics pa­per, where a vi­sion-lan­guage-ac­tion (VLA) model zero-shots con­trol of ALOHA 2 by gen­er­at­ing Python con­trol code with­out ro­bot-spe­cific fine-tun­ing. In prac­tice, GPT-4o tends to over­call or un­der­call ac­tions (the ques­tion of time cal­i­bra­tion of LLMs is tricky), so prompt en­gi­neer­ing was es­sen­tial.

I ini­tially con­sid­ered train­ing a sin­gle end-to-end VLA model. Projects like Hugging Face’s LeRobot lean hard on im­i­ta­tion learn­ing. That works for rigid arms be­cause the end-ef­fec­tor pose maps cleanly to joint an­gles, so a re­played tra­jec­tory usu­ally does what you ex­pect. A ca­ble-dri­ven soft ro­bot is dif­fer­ent: the same tip po­si­tion can cor­re­spond to many ca­ble length com­bi­na­tions. This un­pre­dictabil­ity makes demon­stra­tion-based ap­proaches dif­fi­cult to scale.

Instead, I went with a cas­caded de­sign: spe­cial­ized vi­sion feed­ing light­weight con­trollers, leav­ing room to ex­pand into more ad­vanced learned be­hav­iors later.

One thing I no­ticed was that the ten­ta­cle would look slightly life­less dur­ing pauses be­tween API calls. To ad­dress this, I added a breath­ing idle mode with small, noisy os­cil­la­tions that shift be­tween prin­ci­pal di­rec­tions, keep­ing it feel­ing alive even when not ac­tively re­spond­ing.
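
A rough sketch of what such an idle behavior might look like in the same 2D control space; the amplitude, period, and noise scale are invented for illustration:

```python
import numpy as np

def breathing_offset(t: float, rng: np.random.Generator,
                     amplitude: float = 0.08, period: float = 4.0) -> np.ndarray:
    """Small, noisy 2D oscillation whose axis slowly drifts between principal directions."""
    phase = 2.0 * np.pi * t / period
    drift = 2.0 * np.pi * t / (7.0 * period)  # slow rotation of the oscillation axis
    noise = rng.normal(scale=0.1 * amplitude, size=2)
    return amplitude * np.sin(phase) * np.array([np.cos(drift), np.sin(drift)]) + noise
```

The resulting 2D offset can then be fed through the same cursor-to-tendon projection used for manual control.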

Perception re­quired two com­po­nents: hand track­ing and ten­ta­cle tip track­ing. For hands, MediaPipe worked rea­son­ably well out of the box, though it strug­gles with oc­clu­sions.

For the ten­ta­cle, I col­lected a dataset across var­ied light­ing, po­si­tions, and back­grounds, us­ing k-means clus­ter­ing to fil­ter for di­verse, non-re­dun­dant sam­ples. Roboflow’s auto-la­bel­ing and ac­tive learn­ing sped up an­no­ta­tion, and I aug­mented the dataset syn­thet­i­cally by ex­tract­ing ten­ta­cle tips via the Segment Anything demo.

Once the data was ready, train­ing a YOLO model with Ultralytics was straight­for­ward. The fi­nal cal­i­bra­tion step used a DeepLabCut note­book to com­pute cam­era in­trin­sics and ex­trin­sics, en­abling 3D tri­an­gu­la­tion of the ten­ta­cle tip and hand po­si­tions.
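
For reference, the Ultralytics training loop is only a few lines; the dataset config name below is a placeholder, not the project’s actual file:

```python
from ultralytics import YOLO

# "tentacle_tip.yaml" is a placeholder dataset config (train/val image dirs + class names).
model = YOLO("yolov8n.pt")                       # small pretrained checkpoint
model.train(data="tentacle_tip.yaml", epochs=50, imgsz=640)
metrics = model.val()                            # held-out evaluation
```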

Programming open-loop be­hav­iors for soft ro­bots is uniquely hard. Unlike rigid sys­tems where in­verse kine­mat­ics can give you pre­cise joint an­gles for a de­sired tra­jec­tory, soft bod­ies de­form un­pre­dictably. To sim­plify, I reused the 2D con­trol pro­jec­tion from man­ual con­trol. Instead of think­ing in raw 3D ca­ble lengths, I could de­sign be­hav­iors in an in­tu­itive 2D space and let the pro­jec­tion han­dle the rest. Having a thicker spine that pre­vents sag also helped en­sure con­sis­tent be­hav­ior re­pro­duc­tion across dif­fer­ent ses­sions.

Experimenting with ob­ject in­ter­ac­tions made me ap­pre­ci­ate how ro­bust SpiRobs can be. The grab­bing prim­i­tive, for ex­am­ple, sim­ply pulls the front ca­ble while adding slack to the oth­ers, yet it re­li­ably grips ob­jects of vary­ing shapes and weights. Given that high-fre­quency dex­ter­ous ma­nip­u­la­tion re­mains chal­leng­ing, this me­chan­i­cal ro­bust­ness is a non-triv­ial de­sign op­por­tu­nity.

For closed-loop con­trol, I turned to re­in­force­ment learn­ing, start­ing with a pol­icy that would fol­low a user’s fin­ger. This came from an old idea I’ve al­ways wanted to make: a ro­botic wooden owl that fol­lows you with its big eyes. It was also sim­ple enough to val­i­date the en­tire sim-to-real stack end-to-end be­fore mov­ing to more com­plex poli­cies.

I recre­ated SpiRobs in MuJoCo and set up a tar­get-fol­low­ing en­vi­ron­ment with smooth, ran­dom­ized tra­jec­to­ries. I used PPO with a sim­ple MLP and frame stack­ing to pro­vide tem­po­ral con­text. To im­prove sim-to-real trans­fer, I added dy­nam­ics ran­dom­iza­tion, per­turb­ing mass, damp­ing, and fric­tion dur­ing train­ing.

My first ap­proach used di­rect ten­don lengths as the ac­tion space. The pol­icy quickly found re­ward-hack­ing strate­gies, pulling ca­bles to ex­tremes to achieve per­fect track­ing in sim­u­la­tion. In re­al­ity, these chaotic con­fig­u­ra­tions would never trans­fer.

A fix I found was to con­strain the ac­tion space to the same 2D pro­jec­tion used every­where else. This rep­re­sen­ta­tion blocked un­re­al­is­tic be­hav­iors while keep­ing the sys­tem ex­pres­sive enough. Note that you could use cur­ricu­lum learn­ing to grad­u­ally tran­si­tion from this 2D con­straint to full 3D con­trol by start­ing with the sim­pli­fied rep­re­sen­ta­tion and pro­gres­sively ex­pand­ing the ac­tion space as the pol­icy be­comes more sta­ble.

Another is­sue was that the pol­icy ex­hib­ited jit­tery be­hav­ior from rapid ac­tion changes be­tween timesteps. I added con­trol penal­ties to the re­ward func­tion that pe­nal­ized large con­sec­u­tive ac­tion dif­fer­ences, en­cour­ag­ing smooth move­ments over er­ratic cor­rec­tions.
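
One plausible shape for such a reward term; the weights here are placeholders rather than the values actually used:

```python
import numpy as np

def tracking_reward(tip_pos, target_pos, action, prev_action,
                    w_track: float = 1.0, w_smooth: float = 0.1) -> float:
    """Tracking reward minus a penalty on large consecutive action changes."""
    tracking = -np.linalg.norm(np.asarray(tip_pos) - np.asarray(target_pos))
    smoothness = -float(np.sum(np.square(np.asarray(action) - np.asarray(prev_action))))
    return w_track * tracking + w_smooth * smoothness
```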

Once the pol­icy sta­bi­lized in sim­u­la­tion, trans­fer to hard­ware was sur­pris­ingly smooth.

One last is­sue: even with a sta­tion­ary tar­get, the pol­icy would some­times jit­ter and os­cil­late un­pre­dictably as it over­cor­rected. Applying an ex­po­nen­tial mov­ing av­er­age to the ac­tions added enough damp­ing to let the ten­ta­cle set­tle qui­etly with­out sac­ri­fic­ing re­spon­sive­ness too much.
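
A minimal exponential-moving-average filter over actions; the smoothing factor is an assumed value to be tuned against responsiveness:

```python
import numpy as np

class ActionSmoother:
    """Exponential moving average over policy actions to damp overcorrections."""

    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha  # lower alpha = heavier damping, slower response
        self._ema = None

    def __call__(self, action) -> np.ndarray:
        action = np.asarray(action, dtype=float)
        self._ema = action if self._ema is None else (
            self.alpha * action + (1.0 - self.alpha) * self._ema
        )
        return self._ema
```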

One thing I no­ticed to­ward the end is that, even though the ro­bot re­mained ex­pres­sive, it started feel­ing less alive. Early on, its mo­tions sur­prised me: I had to in­ter­pret them, in­fer in­tent. But as I in­ter­nal­ized how it worked, the pre­dic­tion er­ror faded.

Expressiveness is about com­mu­ni­cat­ing in­ter­nal state. But per­ceived alive­ness de­pends on some­thing else: un­pre­dictabil­ity, a cer­tain opac­ity. This makes sense: liv­ing sys­tems track a messy, high-di­men­sional world. Shoggoth Mini does­n’t.

This raises a ques­tion: do we ac­tu­ally want to build ro­bots that feel alive? Or is there a thresh­old, some­where past ex­pres­sive­ness, where the sys­tem be­comes too agen­tic, too un­pre­dictable to stay com­fort­able around hu­mans?

Looking for­ward, I see sev­eral short-term paths worth ex­plor­ing:

* Giving it a voice (but as non-hu­man as pos­si­ble!)

* Expanding the ex­pres­sion reper­toire, both open and closed-loop, po­ten­tially through RLHF

* Adding more ten­ta­cles and teach­ing it to crawl

Fork the repo, build your own, or get in touch if you’d like to dis­cuss ro­bot­ics, RL, or LLMs!

...

Read the original on www.matthieulc.com »

7 212 shares, 7 trendiness

Anthropic, Google, OpenAI and xAI granted up to $200 million for AI work from Defense Department

The U. S. Department of Defense on Monday said it’s grant­ing con­tract awards of up to $200 mil­lion for ar­ti­fi­cial in­tel­li­gence de­vel­op­ment at Anthropic, Google, OpenAI and xAI.

The DoD’s Chief Digital and Artificial Intelligence Office said the awards will help the agency accelerate its adoption of “advanced AI capabilities to address critical national security challenges.” The companies will work to develop AI agents across several mission areas at the agency.

“The adoption of AI is transforming the Department’s ability to support our warfighters and maintain strategic advantage over our adversaries,” Doug Matty, the DoD’s chief digital and AI officer, said in a release.

Elon Musk’s xAI also an­nounced Grok for Government on Monday, which is a suite of prod­ucts that make the com­pa­ny’s mod­els avail­able to U. S. gov­ern­ment cus­tomers. The prod­ucts are avail­able through the General Services Administration (GSA) sched­ule, which al­lows fed­eral gov­ern­ment de­part­ments, agen­cies, or of­fices to pur­chase them, ac­cord­ing to a post on X.

Musk’s AI startup has launched a new ver­sion of Grok and Grok for Government ser­vices af­ter the chat­bot gen­er­ated and spread anti-se­mitic posts and other of­fen­sive con­tent, spark­ing a back­lash.

OpenAI was previously awarded a year-long $200 million contract from the DoD in 2024, shortly after it said it would collaborate with defense technology startup Anduril to deploy advanced AI systems for “national security missions.”

In June, the com­pany launched OpenAI for Government for U. S. fed­eral, state, and lo­cal gov­ern­ment work­ers.


...

Read the original on www.cnbc.com »

8 205 shares, 48 trendiness

Reflections on OpenAI

I left OpenAI three weeks ago. I had joined the com­pany back in May 2024.

I wanted to share my re­flec­tions be­cause there’s a lot of smoke and noise around what OpenAI is do­ing, but not a lot of first-hand ac­counts of what the cul­ture of work­ing there ac­tu­ally feels like.

Nabeel Quereshi has an amaz­ing post called Reflections on Palantir, where he ru­mi­nates on what made Palantir spe­cial. I wanted to do the same for OpenAI while it’s fresh in my mind. You won’t find any trade se­crets here, more just re­flec­tions on this cur­rent it­er­a­tion of one of the most fas­ci­nat­ing or­ga­ni­za­tions in his­tory at an ex­tremely in­ter­est­ing time.

To put it up-front: there was­n’t any per­sonal drama in my de­ci­sion to leave–in fact I was deeply con­flicted about it. It’s hard to go from be­ing a founder of your own thing to an em­ployee at a 3,000-person or­ga­ni­za­tion. Right now I’m crav­ing a fresh start.

It’s en­tirely pos­si­ble that the qual­ity of the work will draw me back. It’s hard to imag­ine build­ing any­thing as im­pact­ful as AGI, and LLMs are eas­ily the tech­no­log­i­cal in­no­va­tion of the decade. I feel lucky to have seen some of the de­vel­op­ments first-hand and also been a part of the Codex launch.

Obviously these aren’t the views of the com­pany–as ob­ser­va­tions they are my own. OpenAI is a big place, and this is my lit­tle win­dow into it.

The first thing to know about OpenAI is how quickly it’s grown. When I joined, the com­pany was a lit­tle over 1,000 peo­ple. One year later, it is over 3,000 and I was in the top 30% by tenure. Nearly every­one in lead­er­ship is do­ing a dras­ti­cally dif­fer­ent job than they were ~2-3 years ago.

Of course, every­thing breaks when you scale that quickly: how to com­mu­ni­cate as a com­pany, the re­port­ing struc­tures, how to ship prod­uct, how to man­age and or­ga­nize peo­ple, the hir­ing processes, etc. Teams vary sig­nif­i­cantly in cul­ture: some are sprint­ing flat-out all the time, oth­ers are babysit­ting big runs, some are mov­ing along at a much more con­sis­tent pace. There’s no sin­gle OpenAI ex­pe­ri­ence, and re­search, ap­plied, and GTM op­er­ate on very dif­fer­ent time hori­zons.

An un­usual part of OpenAI is that every­thing, and I mean every­thing, runs on Slack. There is no email. I maybe re­ceived ~10 emails in my en­tire time there. If you aren’t or­ga­nized, you will find this in­cred­i­bly dis­tract­ing. If you cu­rate your chan­nels and no­ti­fi­ca­tions, you can make it pretty work­able.

OpenAI is incredibly bottoms-up, especially in research. When I first showed up, I started asking questions about the roadmap for the next quarter. The answer I got was: “this doesn’t exist” (though now it does). Good ideas can come from anywhere, and it’s often not really clear which ideas will prove most fruitful ahead of time. Rather than a grand ‘master plan’, progress is iterative and uncovered as new research bears fruit.

Thanks to this bottoms-up culture, OpenAI is also very meritocratic. Historically, leaders in the company are promoted primarily based upon their ability to have good ideas and then execute upon them. Many leaders who were incredibly competent weren’t very good at things like presenting at all-hands or political maneuvering. That matters less at OpenAI than it might at other companies. The best ideas do tend to win.

There’s a strong bias to action (you can just do things). It wasn’t unusual for separate, unrelated teams to converge on similar ideas. I started out working on a parallel (but internal) effort similar to ChatGPT Connectors. There must’ve been ~3-4 different Codex prototypes floating around before we decided to push for a launch. These efforts are usually undertaken by a small handful of individuals without asking permission. Teams tend to quickly form around them as they show promise.

Andrey (the Codex lead) used to tell me that you should think of researchers as “their own mini-executive”. There is a strong bias to work on your own thing and see how it pans out. There’s a corollary here–most research gets done by nerd-sniping a researcher into a particular problem. If something is considered boring or ‘solved’, it probably won’t get worked on.

Good re­search man­agers are in­sanely im­pact­ful and also in­cred­i­bly lim­ited. The best ones man­age to con­nect the dots be­tween many dif­fer­ent re­search ef­forts and bring to­gether a big­ger model train­ing. The same goes for great PMs (shoutout ae).

The ChatGPT EMs I worked with (Akshay, Rizzo, Sulman) were some of the coolest customers I’ve ever seen. It really felt like they had seen everything at this point. Most of them were relatively hands-off, but hired good people and tried to make sure they were set up for success.

OpenAI changes di­rec­tion on a dime. This was a thing we val­ued a lot at Segment–it’s much bet­ter to do the right thing as you get new in­for­ma­tion, vs de­cide to stay the course just be­cause you had a plan. It’s re­mark­able that a com­pany as large as OpenAI still main­tains this ethos–Google clearly does­n’t. The com­pany makes de­ci­sions quickly, and when de­cid­ing to pur­sue a di­rec­tion, goes all in.

There is a ton of scrutiny on the com­pany. Coming from a b2b en­ter­prise back­ground, this was a bit of a shock to me. I’d reg­u­larly see news sto­ries bro­ken in the press that had­n’t yet been an­nounced in­ter­nally. I’d tell peo­ple I work at OpenAI and be met with a pre-formed opin­ion on the com­pany. A num­ber of Twitter users run au­to­mated bots which check to see if there are new fea­ture launches com­ing up.

As a re­sult, OpenAI is a very se­cre­tive place. I could­n’t tell any­one what I was work­ing on in de­tail. There’s a hand­ful of slack work­spaces with var­i­ous per­mis­sions. Revenue and burn num­bers are more closely guarded.

OpenAI is also a more se­ri­ous place than you might ex­pect, in part be­cause the stakes feel re­ally high. On the one hand, there’s the goal of build­ing AGI–which means there is a lot to get right. On the other hand, you’re try­ing to build a prod­uct that hun­dreds of mil­lions of users lever­age for every­thing from med­ical ad­vice to ther­apy. And on the other, other hand, the com­pany is com­pet­ing in the biggest arena in the world. We’d pay close at­ten­tion to what was hap­pen­ing at Meta, Google, and Anthropic–and I’m sure they were all do­ing the same. All of the ma­jor world gov­ern­ments are watch­ing this space with a keen in­ter­est.

As of­ten as OpenAI is ma­ligned in the press, every­one I met there is ac­tu­ally try­ing to do the right thing. Given the con­sumer fo­cus, it is the most vis­i­ble of the big labs, and con­se­quently there’s a lot of slan­der for it.

That said, you probably shouldn’t view OpenAI as a single monolith. I think of OpenAI as an organization that started like Los Alamos. It was a group of scientists and tinkerers investigating the cutting edge of science. That group happened to accidentally spawn the most viral consumer app in history. And then grew to have ambitions to sell to governments and enterprises. People of different tenure and different parts of the org subsequently have very different goals and viewpoints. The longer you’ve been there, the more you probably view things through the “research lab” or “non-profit for good” lens.

The thing that I appreciate most is that the company “walks the walk” in terms of distributing the benefits of AI. Cutting edge models aren’t reserved for some enterprise-grade tier with an annual agreement. Anybody in the world can jump onto ChatGPT and get an answer, even if they aren’t logged in. There’s an API you can sign up and use–and most of the models (even if SOTA or proprietary) tend to quickly make it into the API for startups to use. You could imagine an alternate regime that operates very differently from the one we’re in today. OpenAI deserves a ton of credit for this, and it’s still core to the DNA of the company.

Safety is ac­tu­ally more of a thing than you might guess if you read a lot from Zvi or Lesswrong. There’s a large num­ber of peo­ple work­ing to de­velop safety sys­tems. Given the na­ture of OpenAI, I saw more fo­cus on prac­ti­cal risks (hate speech, abuse, ma­nip­u­lat­ing po­lit­i­cal bi­ases, craft­ing bio-weapons, self-harm, prompt in­jec­tion) than the­o­ret­i­cal ones (intelligence ex­plo­sion, power-seek­ing). That’s not to say that no­body is work­ing on the lat­ter, there’s def­i­nitely peo­ple fo­cus­ing on the the­o­ret­i­cal risks. But from my view­point, it’s not the fo­cus. Most of the work which is done is­n’t pub­lished, and OpenAI re­ally should do more to get it out there.

Unlike other companies which freely hand out their swag at every career fair, OpenAI doesn’t really give much swag (even to new employees). Instead there are ‘drops’ which happen where you can order in-stock items. The first one brought down the Shopify store, it had so much demand. There was an internal post which circulated on how to POST the right json payloads and circumvent this.

Nearly every­thing is a round­ing er­ror com­pared to GPU cost. To give you a sense: a niche fea­ture that was built as part of the Codex prod­uct had the same GPU cost foot­print as our en­tire Segment in­fra­struc­ture (not the same scale as ChatGPT but saw a de­cent por­tion of in­ter­net traf­fic).

OpenAI is per­haps the most fright­en­ingly am­bi­tious org I’ve ever seen. You might think that hav­ing one of the top con­sumer apps on the planet might be enough, but there’s a de­sire to com­pete across dozens of are­nas: the API prod­uct, deep re­search, hard­ware, cod­ing agents, im­age gen­er­a­tion, and a hand­ful of oth­ers which haven’t been an­nounced. It’s a fer­tile ground for tak­ing ideas and run­ning with them.

The company pays a lot of attention to twitter. If you tweet something related to OpenAI that goes viral, chances are good someone will read about it and consider it. A friend of mine joked, “this company runs on twitter vibes”. As a consumer company, perhaps that’s not so wrong. There’s certainly still a lot of analytics around usage, user growth, and retention–but the vibes are equally as important.

Teams at OpenAI are much more fluid than they might be elsewhere. When launching Codex, we needed some help from a few experienced ChatGPT engineers to hit our launch date. We met with some of the ChatGPT EMs to make the request. The next day we had two badass folks ready to dive in and help. There was no waiting for “quarterly planning” or “re-shuffling headcount”. It moved really quickly.

Leadership is quite vis­i­ble and heav­ily in­volved. This might be ob­vi­ous at a com­pany such as OpenAI, but every exec seemed quite di­aled in. You’d see gdb, sama, kw, mark, dane, et al chime in reg­u­larly on Slack. There are no ab­sen­tee lead­ers.

OpenAI uses a giant monorepo which is ~mostly Python (though there is a growing set of Rust services and a handful of Golang services sprinkled in for things like network proxies). This creates a lot of strange-looking code because there are so many ways you can write Python. You will encounter both libraries designed for scale from 10-year Google veterans and throwaway Jupyter notebooks from newly-minted PhDs. Pretty much everything operates around FastAPI to create APIs and Pydantic for validation. But there aren’t style guides enforced writ-large.
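
For readers unfamiliar with that stack, a tiny illustrative example of the FastAPI-plus-Pydantic pattern (not code from the monorepo; the endpoint and models are made up):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class EchoRequest(BaseModel):
    prompt: str
    max_chars: int = 80

class EchoResponse(BaseModel):
    text: str

@app.post("/v1/echo", response_model=EchoResponse)
def echo(req: EchoRequest) -> EchoResponse:
    # Pydantic has already validated and typed the request body by this point.
    return EchoResponse(text=req.prompt[: req.max_chars])
```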

OpenAI runs everything on Azure. What’s funny about this is there are exactly three services that I would consider trustworthy: Azure Kubernetes Service, CosmosDB (Azure’s document storage), and BlobStore. There are no true equivalents of Dynamo, Spanner, Bigtable, BigQuery, Kinesis, or Aurora. It’s a bit rarer to think in terms of auto-scaling units. The IAM implementations tend to be way more limited than what you might get from AWS. And there’s a strong bias to implement in-house.

When it comes to per­son­nel (at least in eng), there’s a very sig­nif­i­cant Meta → OpenAI pipeline. In many ways, OpenAI re­sem­bles early Meta: a block­buster con­sumer app, nascent in­fra, and a de­sire to move re­ally quickly. Most of the in­fra tal­ent I’ve seen brought over from Meta + Instagram has been quite strong.

Put these things to­gether, and you see a lot of core parts of in­fra that feel rem­i­nis­cent of Meta. There was an in-house reim­ple­men­ta­tion of TAO. An ef­fort to con­sol­i­date auth iden­tity at the edge. And I’m sure a num­ber of oth­ers I don’t know about.

Chat runs re­ally deep. Since ChatGPT took off, a lot of the code­base is struc­tured around the idea of chat mes­sages and con­ver­sa­tions. These prim­i­tives are so baked at this point, you should prob­a­bly ig­nore them at your own peril. We did de­vi­ate from them a bit in Codex (leaning more into learn­ings from the re­sponses API), but we lever­aged a lot of prior art.

Code wins. Rather than hav­ing some cen­tral ar­chi­tec­ture or plan­ning com­mit­tee, de­ci­sions are typ­i­cally made by whichever team plans to do the work. The re­sult is that there’s a strong bias for ac­tion, and of­ten a num­ber of du­pli­cate parts of the code­base. I must’ve seen half a dozen li­braries for things like queue man­age­ment or agent loops.

There were a few areas where having a rapidly scaled eng team and not a lot of tooling created issues. sa-server (the backend monolith) was a bit of a dumping ground. CI broke a lot more frequently than you might expect on master. Test cases, even when run in parallel against only a subset of dependencies, could take ~30 minutes to run on GPUs. These weren’t unsolvable problems, but it’s a good reminder that these sorts of problems exist everywhere, and they are likely to get worse when you scale super quickly. To the credit of the internal teams, there’s a lot of focus going into improving this story.

What a big consumer brand looks like. I hadn’t really internalized this until we started working on Codex. Everything is measured in terms of ‘pro subs’. Even for a product like Codex, we thought of onboarding as primarily related to individual usage rather than teams. It broke my brain a bit, coming from predominantly a B2B / enterprise background. You flip a switch and you get traffic from day 1.

How large models are trained (at a high level). There’s a spectrum from “experimentation” to “engineering”. Most ideas start out as small-scale experiments. If the results look promising, they then get incorporated into a bigger run. Experimentation is as much about tweaking the core algorithms as it is tweaking the data mix and carefully studying the results. On the large end, doing a big run almost looks like giant distributed systems engineering. There will be weird edge cases and things you didn’t expect. It’s up to you to debug them.

How to do GPU-math. We had to forecast out the load capacity requirements as part of the Codex launch, and doing this was the first time I’d really spent time benchmarking GPUs. The gist is that you should actually start from the latency requirements you need (overall latency, # of tokens, time-to-first-token) vs doing bottoms-up analysis on what a GPU can support. Every new model iteration can change the load patterns wildly.
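
A toy version of that latency-first arithmetic; every parameter below is a made-up placeholder rather than a real benchmark number:

```python
import math

def gpus_needed(requests_per_s: float, out_tokens: int, target_latency_s: float,
                ttft_s: float, decode_tok_per_s: float, streams_per_gpu: int) -> int:
    """Back into a GPU count from the latency budget instead of raw GPU throughput."""
    decode_budget_s = target_latency_s - ttft_s
    per_stream_rate = out_tokens / decode_budget_s  # tokens/s each request must see
    if per_stream_rate > decode_tok_per_s:
        raise ValueError("latency target unreachable at this per-stream decode speed")
    in_flight = requests_per_s * target_latency_s  # Little's law estimate of concurrency
    return math.ceil(in_flight / streams_per_gpu)
```

Under these toy assumptions, 50 requests/s with a 20-second end-to-end budget implies roughly 1,000 requests in flight to provision for, regardless of what a single GPU benchmarks at.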

How to work in a large Python codebase. Segment was a collection of microservices, mostly Golang and Typescript. We didn’t really have the breadth of code that OpenAI does. I learned a lot about how to scale a codebase based upon the number of developers contributing to it. You have to put in a lot more guardrails for things like “works by default”, “keep master clean”, and “hard to misuse”.

A big part of my last three months at OpenAI was launch­ing Codex. It’s un­ques­tion­ably one of the high­lights of my ca­reer.

To set the stage, back in November 2024, OpenAI had set a 2025 goal to launch a cod­ing agent. By February 2025 we had a few in­ter­nal tools float­ing around which were us­ing the mod­els to great ef­fect. And we were feel­ing the pres­sure to launch a cod­ing-spe­cific agent. Clearly the mod­els had got­ten to the point where they were get­ting re­ally use­ful for cod­ing (seeing the new ex­plo­sion of vibe-cod­ing tools in the mar­ket).

I re­turned early from my pa­ter­nity leave to help par­tic­i­pate in the Codex launch. A week af­ter I re­turned we had a (slightly chaotic) merger of two teams, and be­gan a mad-dash sprint. From start (the first lines of code writ­ten) to fin­ish, the whole prod­uct was built in just 7 weeks.

The Codex sprint was prob­a­bly the hard­est I’ve worked in nearly a decade. Most nights were up un­til 11 or mid­night. Waking up to a new­born at 5:30 every morn­ing. Heading to the of­fice again at 7a. Working most week­ends. We all pushed hard as a team, be­cause every week counted. It re­minded me of be­ing back at YC.

It’s hard to over­state how in­cred­i­ble this level of pace was. I haven’t seen or­ga­ni­za­tions large or small go from an idea to a fully launched + freely avail­able prod­uct in such a short win­dow. The scope was­n’t small ei­ther; we built a con­tainer run­time, made op­ti­miza­tions on repo down­load­ing, fine-tuned a cus­tom model to deal with code ed­its, han­dled all man­ner of git op­er­a­tions, in­tro­duced a com­pletely new sur­face area, en­abled in­ter­net ac­cess, and ended up with a prod­uct that was gen­er­ally a de­light to use.

Say what you will, OpenAI still has that launch­ing spirit.

The good news is that the right peo­ple can make magic hap­pen. We were a se­nior team of ~8 en­gi­neers, ~4 re­searchers, 2 de­sign­ers, 2 GTM and a PM. Had we not had that group, I think we would’ve failed. Nobody needed much di­rec­tion, but we did need a de­cent amount of co­or­di­na­tion. If you get the chance to work with any­one on the Codex team, know that every one of them is fan­tas­tic.

The night before launch, five of us stayed up until 4a trying to deploy the main monolith (a multi-hour affair). Then it was back to the office for the 8a launch announcement and livestream. We turned on the flags, and started to see the traffic pour in. I’ve never seen a product get so much immediate uptick just from appearing in a left-hand sidebar, but that’s the power of ChatGPT.

In terms of the prod­uct shape, we set­tled on a form fac­tor which was en­tirely asyn­chro­nous. Unlike tools like Cursor (at the time, it now sup­ports a sim­i­lar mode) or Claude Code, we aimed to al­low users to kick off tasks and let the agent run in its own en­vi­ron­ment. Our bet was in the end-game, users should treat a cod­ing agent like a co-worker: they’d send mes­sages to the agent, it gets some time to do its work, and then it comes back with a PR.

This was a bit of a gamble: we’re in a slightly weird state today where the models are good, but not great. They can work for minutes at a time, but not yet hours. Users have widely varying degrees of trust in the models’ capabilities. And we’re not even clear what the true capabilities of the models are.

Over the long arc of time, I do be­lieve most pro­gram­ming will look more like Codex. In the mean­time, it’s go­ing to be in­ter­est­ing to see how all the prod­ucts un­fold.

Codex (maybe un­sur­pris­ingly) is re­ally good at work­ing in a large code­base, un­der­stand­ing how to nav­i­gate it. The biggest dif­fer­en­tia­tor I’ve seen vs other tools is the abil­ity to kick off mul­ti­ple tasks at once and com­pare their out­put.

I recently saw that there are public numbers comparing the PRs made by different LLM agents. Looking just at the public numbers, Codex has generated 630,000 PRs. That’s about 78k public PRs per engineer in the 53 days since launch (you can make your own guesses about the multiple of private PRs). I’m not sure I’ve ever worked on something so impactful in my life.

Truth be told, I was orig­i­nally ap­pre­hen­sive about join­ing OpenAI. I was­n’t sure what it would be like to sac­ri­fice my free­dom, to have a boss, to be a much smaller piece of a much larger ma­chine. I kept it fairly low-key that I had joined, just in case it was­n’t the right fit.

I did want to get three things from the ex­pe­ri­ence…

* to build in­tu­ition for how the mod­els were trained and where the ca­pa­bil­i­ties were go­ing

* to work with and learn from amaz­ing peo­ple

In re­flect­ing on the year, I think it was one of the best moves I’ve ever made. It’s hard to imag­ine learn­ing more any­where else.

If you’re a founder and feel­ing like your startup re­ally is­n’t go­ing any­where, you should ei­ther 1) deeply re-as­sess how you can take more shots on goal or 2) go join one of the big labs. Right now is an in­cred­i­ble time to build. But it’s also an in­cred­i­ble time to peer into where the fu­ture is headed.

As I see it, the path to AGI is a three-horse race right now: OpenAI, Anthropic, and Google. Each of these or­ga­ni­za­tions are go­ing to take a dif­fer­ent path to get there based upon their DNA (consumer vs busi­ness vs rock-solid-in­fra + data). Working at any of them will be an eye-open­ing ex­pe­ri­ence.

Thank you to Leah for be­ing in­cred­i­bly sup­port­ive and tak­ing the ma­jor­ity of the child­care through­out the late nights. Thanks to PW, GDB, and Rizzo for giv­ing me a shot. Thanks to the SA team­mates for teach­ing me the ropes: Andrew, Anup, Bill, Kwaz, Ming, Simon, Tony, and Val. And thanks for the Codex core team for giv­ing me the ride of a life­time: Albin, AE, Andrey, Bryan, Channing, DavidK, Gabe, Gladstone, Hanson, Joey, Josh, Katy, KevinT, Max, Sabrina, SQ, Tibo, TZ and Will. I’ll never for­get this sprint.

...

Read the original on calv.info »

9 203 shares, 8 trendiness

Bedrock — benbridle.com

Bedrock is a com­pact and portable 8-bit com­puter sys­tem, de­signed to last for­ever. Click here to jump straight to the live demos.

Bedrock is a com­puter sys­tem that makes it easy to write use­ful pro­grams that will last for­ever. The sys­tem is small and quick to learn, with only 32 in­struc­tions and 12 de­vices to re­mem­ber.

Bedrock is­n’t a real com­puter sys­tem that you can pick up and hold in your hands. It’s a spec­i­fi­ca­tion that de­scribes an in­ter­face for any kind of com­put­ing de­vice, al­low­ing you to write pro­grams that will run on any de­vice with­out hav­ing to worry about the pe­cu­liar­i­ties of the un­der­ly­ing hard­ware.

Programs writ­ten for Bedrock can run on any com­puter sys­tem, so long as a Bedrock em­u­la­tor has been im­ple­mented for that sys­tem. The em­u­la­tor acts as a thin trans­la­tion layer be­tween the pro­gram and the sys­tem, and is de­signed to be easy to im­ple­ment on any com­puter, con­sole, or hand­held, no mat­ter how old or lim­ited. The core sys­tem can be im­ple­mented in a few hours, and the 12 stan­dard de­vices can be im­ple­mented and con­nected as needed.

Programs can cur­rently run on Windows, Linux, the web, and the Nintendo DS. See the live demon­stra­tions sec­tion at the bot­tom of this page for ex­am­ples of the kinds of pro­grams that can run on Bedrock.

* Bedrock: Printing a string

A hands-on tu­to­r­ial that shows how to print a string to the ter­mi­nal. It as­sumes no for­mer knowl­edge about Bedrock, and al­most no for­mer knowl­edge about pro­gram­ming in gen­eral.

* User man­ual

The user man­ual is aimed at peo­ple who are learn­ing about or writ­ing pro­grams for the Bedrock sys­tem. It con­tains many ex­am­ples in the form of runnable code snip­pets.

* Specification

The spec­i­fi­ca­tion is aimed at peo­ple who are im­ple­ment­ing the sys­tem from scratch.

* Examples

Implementations of some pop­u­lar al­go­rithms as runnable pro­grams.

* Example: Microwave clock

Full ed­itable source code for the mi­crowave clock pro­gram.

To write and run a pro­gram us­ing Bedrock you’ll need an as­sem­bler and an em­u­la­tor. An as­sem­bler is used for con­vert­ing pro­gram source code into a Bedrock pro­gram, and an em­u­la­tor is used for run­ning any Bedrock pro­gram on your cho­sen sys­tem:

* bedrock-js

An as­sem­bler and em­u­la­tor that can be em­bed­ded in a web­page, writ­ten in Javascript.

* bedrock-pc

An as­sem­bler and em­u­la­tor for Windows and Linux com­put­ers, writ­ten in Rust.

Bedrock orig­i­nated as a fork of the Uxn vir­tual ma­chine and Varvara com­put­ing stack, with the aim of im­prov­ing per­for­mance on ex­tremely re­source-con­strained sys­tems. It has since di­verged in many sig­nif­i­cant ways, most no­tably by re­strict­ing the in­ter­faces be­tween com­po­nents and by strip­ping down the as­sem­bler and the in­struc­tion set. See Bedrock: Differences from Uxn for more de­tails.

The name Bedrock comes from the concept of a ‘bedrock abstraction’ coined by this blog post, though it takes a different approach to the one advocated for in the post. Bedrock achieves habitability not by producing a higher-level instruction set, but by reducing the complexity of the program environment.

The fol­low­ing pro­grams are all run­ning us­ing the bedrock-js em­u­la­tor, which was thrown to­gether in a few days. There is a lot of room for im­prove­ment.

* snake.br (1133 bytes)

A graph­ics demo show­ing a coloured stream of let­ters that fol­low the mouse cur­sor.

* clock.br (393 bytes)

A clock in the style of an old mi­crowave oven dis­play.

* sys­info.br (4918 bytes)

Shows in­for­ma­tion about the Bedrock im­ple­men­ta­tion be­ing used.

* key­board.br (2774 bytes)

An on-screen key­board, de­signed to be used as the key­board for Bedrock on the Nintendo DS.

...

Read the original on benbridle.com »

10 186 shares, 33 trendiness

NIST Ion Clock Sets New Record for Most Accurate Clock in the World

There’s a new record holder for the most ac­cu­rate clock in the world. Researchers at the National Institute of Standards and Technology (NIST) have im­proved their atomic clock based on a trapped alu­minum ion. Part of the lat­est wave of op­ti­cal atomic clocks, it can per­form time­keep­ing with 19 dec­i­mal places of ac­cu­racy.

Optical clocks are typically evaluated on two levels — accuracy (how close a clock comes to measuring the ideal “true” time, also known as systematic uncertainty) and stability (how efficiently a clock can measure time, related to statistical uncertainty). This new record in accuracy comes out of 20 years of continuous improvement of the aluminum ion clock. Beyond its world-best accuracy, 41% greater than the previous record, this new clock is also 2.6 times more stable than any other ion clock. Reaching these levels has meant carefully improving every aspect of the clock, from the laser to the trap and the vacuum chamber.

The team pub­lished its re­sults in Physical Review Letters.

“It’s exciting to work on the most accurate clock ever,” said Mason Marshall, NIST researcher and first author on the paper. “At NIST we get to carry out these long-term plans in precision measurement that can push the field of physics and our understanding of the world around us.”

The aluminum ion makes an exceptionally good clock, with an extremely steady, high-frequency “ticking” rate. Its ticks are more stable than those of cesium, which provides the current scientific definition of the second, said David Hume, the NIST physicist leading the aluminum ion clock project. And the aluminum ion isn’t as sensitive to some environmental conditions, like temperature and magnetic fields.

But the aluminum ion is kind of shy, Marshall explained. Aluminum is difficult to probe and cool with lasers, both necessary techniques for atomic clocks. The research group therefore paired the aluminum ion with magnesium. Magnesium doesn’t have the beautiful ticking properties of aluminum, but it can be easily controlled with lasers. “This ‘buddy system’ for ions is called quantum logic spectroscopy,” said Willa Arthur-Dworschack, a graduate student on the project. The magnesium ion cools the aluminum ion, slowing it down. It also moves in tandem with its aluminum partner, and the state of the clock can be read out via the magnesium ion’s motion, making this a “quantum logic” clock. Even with this coordination, there was still an array of physical effects to characterize, said Daniel Rodriguez Castillo, also a graduate student on the project.

“It’s a big, complex challenge, because every part of the clock’s design affects the clock,” Rodriguez Castillo said.

One chal­lenge was the de­sign of the trap where the ions are held, which was caus­ing tiny move­ments of the ions, called ex­cess mi­cro­mo­tion, that were low­er­ing the clock’s ac­cu­racy. That ex­cess mi­cro­mo­tion throws off the ions’ tick rate. Electrical im­bal­ances at op­po­site sides of the trap were cre­at­ing ex­tra fields that dis­turbed the ions. The team re­designed the trap, putting it on a thicker di­a­mond wafer and mod­i­fy­ing the gold coat­ings on the elec­trodes to fix the im­bal­ance of the elec­tric field. They also made the gold coat­ings thicker to re­duce re­sis­tance. Refining the trap this way slowed the ions’ mo­tion and let them tick” un­per­turbed.

The vac­uum sys­tem in which the trap must op­er­ate was also caus­ing prob­lems. Hydrogen dif­fuses out of the steel body of a typ­i­cal vac­uum cham­ber, Marshall said. Traces of hy­dro­gen gas col­lided with the ions, in­ter­rupt­ing the clock’s op­er­a­tion. That lim­ited how long the ex­per­i­ment could run be­fore the ions needed to be re­loaded. The team re­designed the vac­uum cham­ber and had it re­built out of ti­ta­nium, which low­ered the back­ground hy­dro­gen gas by 150 times. That meant they could go days with­out re­load­ing the trap, rather than re­load­ing every 30 min­utes.

There was still one more in­gre­di­ent they needed: a more sta­ble laser to probe the ions and count their ticks. The 2019 ver­sion of the clock had to be run for weeks to av­er­age out quan­tum fluc­tu­a­tions — tem­po­rary ran­dom changes in the ions’ en­ergy state — caused by its laser. To re­duce that time, the team turned to NISTs own Jun Ye, whose lab at JILA (a joint in­sti­tute of NIST and the University of Colorado Boulder) hosts one of the most sta­ble lasers in the world. Ye’s stron­tium lat­tice clock, Strontium 1, held the pre­vi­ous record for ac­cu­racy.

This was a team effort. Using fiber links under the street, Ye’s group at JILA sent the ultrastable laser beam 3.6 kilometers (a little more than 2 miles) to the frequency comb in the lab of Tara Fortier at NIST. The frequency comb, which acts as a “ruler for light,” allowed the aluminum ion clock group to compare its laser with Ye’s ultrastable one. This process enabled the Ye lab’s laser to transfer its stability to the aluminum clock laser. With this improvement, the researchers could probe the ions for a full second compared to their previous record of 150 milliseconds. This improves the clock’s stability, reducing the time required to measure down to the 19th decimal place from three weeks to a day and a half.

With this new record, the alu­minum ion clock con­tributes to the in­ter­na­tional ef­fort to re­de­fine the sec­ond to much greater lev­els of ac­cu­racy than be­fore, fa­cil­i­tat­ing new sci­en­tific and tech­no­log­i­cal ad­vances. The up­grades also dras­ti­cally im­prove its use as a quan­tum logic test­bed, ex­plor­ing new con­cepts in quan­tum physics and build­ing the tools needed for quan­tum tech­nol­ogy, an ex­cit­ing prospect for those in­volved. More im­por­tantly, by cut­ting down the av­er­ag­ing time from weeks to days, this clock can be a tool to make new mea­sure­ments of Earth’s ge­o­desy and ex­plore physics be­yond the Standard Model, such as the pos­si­bil­ity that the fun­da­men­tal con­stants of na­ture are not fixed val­ues but ac­tu­ally chang­ing.

“With this platform, we’re poised to explore new clock architectures — like scaling up the number of clock ions and even entangling them — further improving our measurement capabilities,” Arthur-Dworschack said.

Paper: Mason C. Marshall, Daniel A. Rodriguez Castillo, Willa J. Arthur-Dworschack, Alexander Aeppli, Kyungtae Kim, Dahyeon Lee, William Warfield, Joost Hinrichs, Nicholas V. Nardelli, Tara M. Fortier, Jun Ye, David R. Leibrandt and David B. Hume. High-stability sin­gle-ion clock with 5.5×10−19 sys­tem­atic un­cer­tainty. Physical Review Letters. Published on­line July 14, 2025. DOI: 10.1103/hb3c-dk28

...

Read the original on www.nist.gov »
