10 interesting stories served every morning and every evening.


1 1,410 shares, 87 trendiness

The sound of inevitability

...

2 516 shares, 21 trendiness

[WIP] CUDA backend by zcbenz · Pull Request #1983 · ml-explore/mlx


...

Read the original on github.com »

3 316 shares, 14 trendiness

DOGWALK

Blender Studio’s official game project is a short, casual interactive story. Play a big, adorable dog traversing winter woods and help a little kid decorate a snowman with colorful items hidden in the environment.

You are let loose to roam camping grounds, forest paths, idyllic creeks and a frozen pond in this miniature open world.

Guide or drag around your little kid owner that you have in tow. Help each other out, be a menace or be a good boy.

Dive straight in and have the game react to your play-style and choices. There are no fail states. Only player driven moments.

Traverse an environment made of real-life paper crafted models, scanned and recreated to be played with.

Brought to you by the Blender Studio as the new free and Creative Commons “Open Project”. Made with, and available as, free and open-source software.

The source code and production repository can be accessed on our website at

https://studio.blender.org/projects/dogwalk/3e3fa4dfd790bc/

The project was used to test and improve both Blender and the Godot Game Engine.

Support our work on the Blender Studio website:

https://studio.blender.org/projects/project-dogwalk/

...

Read the original on blenderstudio.itch.io »

4 267 shares, 8 trendiness

php_license_update

This proposal addresses a longstanding issue within the open source community by publishing new versions of the PHP License and the Zend Engine License. The Modiﬁed BSD License is adopted as the PHP License, version 4, and as the Zend Engine License, version 3.

The Modified BSD License is sometimes referred to as the “New,” “Revised,” or “3-clause” BSD License. Its SPDX identifier is BSD-3-Clause.1) It is recognized as a free software license by both the Open Source Initiative (OSI) and the Free Software Foundation (FSF).2) 3) The FSF has designated it as compatible with the GNU General Public License (GPL), and it is an OSI Approved License.

The PHP License, version 3.01, and Zend Engine License, version 2.00, combine the Modiﬁed BSD License with special terms speciﬁc only to the PHP Group and Zend Technologies (now a subsidiary of Perforce Software). After removing these special terms, the licenses are identical to the Modiﬁed BSD License, and there is no change to the rights granted by contributors or to users.

The rights granted by contributors do not change.

The rights granted to users do not change.

We will work with the PHP Group and Perforce Software to remove the terms that are speciﬁc to them.

PHP software and the Zend Engine will be licensed under terms that are both OSI Approved and compatible with the GPL.

Work with the PHP Group to adopt the Modiﬁed BSD License as the PHP License, version 4.

Work with Perforce Software to adopt the Modiﬁed BSD License as the Zend Engine License, version 3.

Deprecate the PHP License and the Zend Engine License. Use of these licenses for new projects, inside or outside the PHP project, is strongly discouraged.

Delete the contents of the LICENSE ﬁle from the PHP software, and replace them with the contents indicated in the New LICENSE File section below.

Remove the Zend/LICENSE ﬁle from the Zend Engine.

Replace the ﬁle headers for all PHP source ﬁles in the PHP software with the contents indicated in the New PHP Source File Header section below.

Replace the ﬁle headers for all Zend Engine source ﬁles with the contents indicated in the New Zend Engine Source File Header section below.

Update other applicable documentation and web pages to reﬂect these changes, such as https://www.php.net/license/.

The Background, Change Authority, and Additional Context sections of this document provide further context and legal justiﬁcation for this change.

The PHP License and Zend Engine License are not compatible with the GPL,4) and the Zend Engine License is not OSI Approved. While the OSI license approval committee voted to approve versions 3.0 and 3.01 of the PHP License, each followed the “legacy approval” process, meaning the licenses had already been in wide use for many years before the OSI approved them. As a result, the OSI approved the PHP License based more on its intent than on its content. If the OSI license approval committee were not considering the legacy use of the PHP License, it is unlikely they would have approved it based solely on its content.

In the beginning, while the Zend Engine was bundled with PHP in the Zend/ directory, it was thought of as a completely separate product that could be unbundled and used apart from PHP. Indeed, that was the intent, and it is the reason PHP and the Zend Engine have separate licenses. However, after 25 years of cohabitation within the same source code repository, the two are intertwined in ways in which the Zend Engine can no longer be separated and used as a standalone product. Together, they form the PHP programming language reference implementation.

Rasmus Lerdorf created PHP at a time when a faction within the free software movement was growing dissatisﬁed with the politics and philosophy of the movement and splintered off, crystallizing around a more permissive set of licenses viewed as friendlier to commercial use—this became the open source movement.

The frame dispute, consequent transformation, and creation of the open source movement can be viewed as a spin-off movement that not only had a different diagnosis and more elastic reach, but that strove to avoid what they saw as “mistakes” made by the founding movement that inhibited commercial growth.5)

In his original release announcement, Lerdorf wrote, “The tools are in the public domain distributed under the GNU Public License. Yes, that means they are free!”6) 7) Lerdorf chose to release PHP version 1 and PHP/FI (version 2) under the terms of the GNU GPL, version 2 (GPLv2), but he recognized the growing concerns among the open source movement that commercial interests were scared of or even forbade the use of GPL software in their organizations; indeed, many continue this practice today. In a 1997 mailing list post discussing licensing, Lerdorf said, “PHP, if I can help it, will always be free. But, I am not against letting commercial entities take a shot at a commercial version as long as the terms are such that the major contributors don’t feel cheated.”8)

This led to a dual-licensing model in PHP 3, allowing users the choice to use PHP under the terms of the GPLv2 or a custom license based on the Apache License, version 1.0. “Our license is identical to the Apache license (since that’s where we copied it from) except for that first clause,” wrote Lerdorf in a 1999 mailing list post.9) That first clause restricted commercial use:

Commercial redistribution of larger works derived from, or works which bundle PHP, requires written permission from the PHP Development Team. You may charge a fee for the physical act of transferring a copy, and must make it clear that the fee being charged is for the distribution, and not for the software itself. You may, at your option, offer warranty protection in exchange for a fee.10)

The dual-licensing model presented a number of challenges to a group that was ill-equipped to handle legal questions. In the same thread, Lerdorf discussed having received requests from companies for signed, hardcopy documents granting permission to use PHP and being unable to respond to them appropriately.11) Free and open source software was not well-understood by companies, and there was significant disagreement within the PHP project about what level of freedom users should have. At the time, Zeev Suraski wrote, “people should not be given the legal right to do whatever they wish with PHP.”12) Nevertheless, with Lerdorf having referred to the ﬁrst clause as “that troublesome clause which we can’t enforce,”13) the team ﬁnally removed it in PHP 3.0.14.14)

Meanwhile, Richard Stallman, author of the GPL and founder of the FSF, had significant disagreements with the PHP project over their use of the GPL,15) 16) so the PHP project discontinued the dual-licensing approach, removing the GPL as an option, and PHP 4.0.0 shipped with the PHP License, version 2.02 and the Zend License, version 0.92,17) for sources within the Zend/ directory.

Suraski and Andi Gutmans originally intended the Zend/ directory to be read-only, with all the source code owned by the two, so they could “sell the Zend engine for uses other than PHP.”18) It’s clear they—and other early members of the PHP project—saw the Zend Engine as wholly separate from PHP. In a 1999 interview, Lerdorf clariﬁed licensing concerns surrounding the separate licenses:

PHP 4 is not synonymous with Zend. And when it comes to licensing, the only time the [Zend License] kicks in is if you unbundle Zend from PHP and try to embed the Zend engine into something else.19)

I think there is still some confusion about what role exactly Zend plays in the PHP infrastructure. The host language (PHP) uses the base services provided by the engine (Zend)—services such as memory allocation, persistent resources, compilation, and execution. PHP itself then provides the function libraries, interfaces to the Web servers, .ini ﬁle support, etc.20)

Gutmans hinted at a possible future use of the Zend Engine, which explained the need for a separate license:

I’d very much like to see the Zend engine embedded in MySQL at some point. I think it would be great to be able to write the stored procedure code of the DB in the same language as the scripting engine used to access the DB. […]

The Zend engine was written in a way where it can be used in other products besides PHP. The [Zend License] allows us (the Zend company) to reserve the right to use it elsewhere commercially. However, Zend as part of PHP can be used freely and falls under the PHP license.21)

Later, Gutmans explained why he thought the separate license for the Zend Engine did not present any problems for contributors:

No one really contributes to the scripting engine but extends PHP with additional modules and functions. There are constantly developers (besides us) extending PHP’s functions.22)

Since then, the licenses underwent only one series of major changes, which produced the Zend Engine License, version 2.00, ﬁrst distributed with PHP 4.2.0 (April 22, 2002), and the PHP License, version 3.0, ﬁrst distributed with PHP 4.2.3 (September 6, 2002).

In May 2003, Lerdorf petitioned the OSI for approval of version 3.0 of the PHP License, closing with a statement that implied he wished to switch PHP to the Apache License, Version 2.0, once it gained approval from the OSI.

Hopefully the new Apache license whenever that gets ﬁnalized will be OSI-approved and has the big advantage of being project-agnostic, so projects such as PHP that are closely tied to Apache can use it verbatim without having to massage it and we won’t need all these individual Apache-like licenses.23)

A few years later, a very slight change in the wording of the PHP License resulted in changing the version number to 3.01.24) This new version, while almost identical, never received OSI approval, a problem that presented itself 14 years later, when Matthew Sheahan asked on the php-general mailing list regarding the OSI approval status of version 3.01.

My team’s ability to use the phpdbg utility hinges on OSI approval of its license. Language at https://www.php.net/license/ indicates that the PHP 3.01 license is OSI approved, but OSI disagrees; https://opensource.org/licenses/alphabetical shows approval only of the PHP 3.0 license. (The fact that 3.0 and 3.01 are substantively identical is no use to us at all.)25)

Andreas Heigl asked on the php-internals mailing list, “Does anyone here remember why the changes to the license where [sic] done in the ﬁrst place?”26) In response, Johannes Schlüter referenced the Debian debate.

My memory could fail me, but I believe there were debates coming from Debian community around especially PECL extensions being Licensed under PHP Licens [sic] 3.0 and the wording being sub-optimal. The new wording (and website link) should make it clear that PECL (and PEAR) is “PHP Software” while not being “PHP”.27)

At that time, Ben Ramsey volunteered to contact the OSI to formally request legacy approval for the PHP License.28) The legacy approval designation allowed the license steward or any interested licensee to request “retroactive approval of historic/legacy licenses that have already been extensively used by an existing community, but have not previously been approved.”29) So, on March 4, 2020, Ramsey submitted a request for legacy approval to the OSI license-review list,30) and on May 13, 2020, the OSI Board voted to approve the PHP License, version 3.01.31)

The PHP Association was a public benefit corporation incorporated in the State of Nebraska in the United States in February 2000.32) Each of the directors of the PHP Association was also a member of the PHP Group.33) 34) We can infer from this that the PHP Group created the PHP Association to represent the group in legal and business matters.

On May 22, 2000, the same day the PHP team released PHP version 4.0.0, including Zend Engine version 1.0.0, Zend Technologies and the PHP Association entered into an agreement to ensure the continued availability of the Zend Engine as an open source product.

Since Zend Engine is a crucial component of PHP, Zend hereby makes the following commitments and assurances to The PHP Association:

Zend will continue to make Zend Engine available as an open source product under the Zend Open Source License. If Zend changes the terms of the Zend Open Source License, the new license will be consistent with the Open Source Deﬁnition of the Open Source Initiative.

The PHP Association is hereby authorized to market, distribute and sublicense Zend Engine, in source and object code forms, as an integrated component of PHP, to end users who agree to be bound by the PHP open-source license, version 2.02. […] However, if Zend Engine is either modiﬁed or separated from the rest of PHP, the use of the modiﬁed or separated Zend Engine shall not be governed by the PHP Open Source License, but instead shall be governed by the Zend Open Source License.

The PHP Association agreed to the terms of the agreement, which included the following conditions:

“The Association will not delete or alter any intellectual property rights or license notices appearing on the Zend Engine and will reproduce and display such notices on each copy it makes of the Zend Engine.”

“The Association may not assign this Letter, by operation of law or otherwise in whole or in part, without Zend’s written consent. Any attempt to assign this Letter without such consent will be null and void. This Letter will bind and inure to the beneﬁt of each party’s permitted successors and assigns.”

Given how corporation law works in most US states, the PHP Association is likely still legally bound to this contract, even if they are no longer an active entity, and the terms of the contract followed Zend as it was acquired by Rogue Wave in 2015 and Perforce Software in 2019.

The PHP License and Zend Engine License are BSD-style licenses. As mentioned earlier, Lerdorf pointed to the Apache License, version 1.0, as the model for the original PHP license,46) and the Apache License, version 1.0, is derived from the original, or 4-clause, BSD license.47) In fact, the two are identical, except the Apache License added conditions 5 and 6:

5. Products derived from this software may not be called “Apache” nor may “Apache” appear in their names without prior written permission of the Apache Group.

6. Redistributions of any form whatsoever must retain the following acknowledgment: “This product includes software developed by the Apache Group for use in the Apache HTTP server project (http://www.apache.org/).”48)

By extension, the PHP License is a derivative of the BSD 4-Clause License.

The BSD 4-Clause License is not an OSI-approved license,49) while the FSF considers it free but problematic.50) Both positions are in response to the BSD advertising clause:

All advertising materials mentioning features or use of this software must display the following acknowledgement: This product includes software developed by the organization.

For the PHP License, version 3.01, conditions 1 and 2 are identical to conditions 1 and 2 of the BSD 4-Clause License. Condition 3 of the PHP License is similar in function to condition 4 of the BSD. Condition 6 of the PHP License is similar in function to condition 3 of the BSD 4-Clause License. PHP added new conditions 4 and 5.

For the Zend Engine License, version 2.00, conditions 1 and 2 are identical to conditions 1 and 2 of the BSD 4-Clause License. Condition 3 of the Zend Engine License is similar in function to condition 4 of the BSD 4-Clause License. Conditions 5 and 6 of the Zend Engine License are similar in function to condition 3 of the BSD 4-Clause License. Zend added a new condition 4.

Every contributor owns the copyright on their speciﬁc contributions to an open source project, if the contributions are copyrightable. Some contributions (e.g., typo ﬁxes, white space changes, etc.) aren’t copyrightable, but anything more significant belongs to the contributor, provided it is their own work.

In other words, even though the license statement says the copyright belongs to The PHP Group51) or Zend Technologies52), technically, these copyright statements only apply to the speciﬁc code contributed by these organizations or by people contributing on behalf of these organizations.

Contributing to an open source project is NOT an implicit transfer of your copyright to the project. To do this, every contributor must sign a contributor license agreement that explicitly states they are transferring their copyright to whoever owns the code. No one has signed any agreements of this sort for the PHP software, so every contributor retains copyright ownership over the code they have contributed to PHP.

What is implied, however, is assignment of license. When someone contributes to an open source project, they own the copyright on their contributions, but unless they specify a different license covering their contributions (which is wholly valid, with examples including Derick Rethans’s timelib, which is bundled within the PHP source code), it is implied they are granting use of their contributions under the same license terms as the project. In this way, the contributor cannot later demand to remove all their copyrighted code; it’s under the terms of the same license, which can’t be revoked. However, if the project decides to change its license terms, a contributor may then request removal of their copyrighted code because they may not wish to grant the terms of the new license to their copyrighted work.

Additionally, common convention dictates that, once a copyright statement is placed on a source ﬁle, it should remain on that source ﬁle, complete with any years listed, though the years do not require updating. For an example, look at the ﬁle header on any WebKit source ﬁle.53) WebKit even speciﬁes that you add a copyright notice to each ﬁle where you make “signiﬁcant” changes.54)

The short answer is, “No.” As a courtesy, however, we will keep discussion on this topic open for a period of no less than six months before calling a vote on the proposal.

Earlier, we established that every contributor owns the copyright for their speciﬁc contributions, and unless they speciﬁed a different license covering their contributions, it is implied they have granted use of their contributions under the same license terms as the project. We have also established, at length, the PHP License, version 3.01, and Zend Engine License, version 2.00, are identical to the Modiﬁed BSD License if conditions 4, 5, and 6 are removed from each license.55)

There is no doubt contributors have the authority to grant users license to use their code with respect to conditions 1 and 2. These are the same for the PHP License, Zend Engine License, and Modiﬁed BSD License. This proposal does not change the wording of any part of these conditions:

Redistribution and use in source and binary forms, with or without modiﬁcation, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

Condition 3 does have differences across each license. However, when viewed at face value, the intent of this condition in the PHP and Zend Engine licenses is the same as the 3rd condition of the Modified BSD License. Additionally, as worded in the PHP and Zend Engine licenses, contributors have no authority to assert these terms for their own contributions, since the terms are specific to the PHP Group and Perforce Software, respectively, but they do have the authority to assert the terms of condition 3 from the Modified BSD License.

The name “PHP” must not be used to endorse or promote products derived from this software without prior written permission. For written permission, please contact group@php.net.

The names “Zend” and “Zend Engine” must not be used to endorse or promote products derived from this software without prior permission from Zend Technologies Ltd. For written permission, please contact license@zend.com.

Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without speciﬁc prior written permission.

When we look closer at conditions 4, 5, and 6 for both the PHP License and the Zend Engine License, it appears no contributors, other than representatives of the PHP Group and Perforce Software, are able to grant or assert these conditions for their contributions. Removing them from the license does not change any of the rights granted or restricted by contributors (other than the PHP Group and Perforce Software; see below).

For these reasons, we do not need to gain permission from all contributors to make these changes.

This proposal removes the following conditions, which the PHP Group is uniquely able to claim over the PHP source code:

4. Products derived from this software may not be called “PHP”, nor may “PHP” appear in their name, without prior written permission from group@php.net. You may indicate that your software works in conjunction with PHP by saying “Foo for PHP” instead of calling it “PHP Foo” or “phpfoo”

5. The PHP Group may publish revised and/or new versions of the license from time to time. Each version will be given a distinguishing version number. Once covered code has been published under a particular version of the license, you may always continue to use it under the terms of that version. You may also choose to use such covered code under the terms of any subsequent version of the license published by the PHP Group. No one other than the PHP Group has the right to modify the terms applicable to covered code created under this License.

6. Redistributions of any form whatsoever must retain the following acknowledgment: “This product includes PHP software, freely available from http://www.php.net/software/”.

The good news is that condition 5 grants the PHP Group the authority to make changes to the PHP License, without approval from any contributors.

Depending on the bylaws adopted by the PHP Association (as discussed earlier in Zend and the PHP Association), we may require approval from one or more representatives of the PHP Group to accept this proposal. There is no public record of the association’s bylaws, so unless the bylaws specify a quorum, we will need approval from each of:

Note: Legal representatives of Perforce Software have informally approved this proposal. The next step is a formal approval, in writing.

As the successor of Zend Technologies, Perforce Software is party to the Zend Grant and owner of the Zend Engine License. This proposal removes the following conditions, which Perforce Software is uniquely able to claim over the Zend Engine source code:

4. Zend Technologies Ltd. may publish revised and/or new versions of the license from time to time. Each version will be given a distinguishing version number. Once covered code has been published under a particular version of the license, you may always continue to use it under the terms of that version. You may also choose to use such covered code under the terms of any subsequent version of the license published by Zend Technologies Ltd. No one other than Zend Technologies Ltd. has the right to modify the terms applicable to covered code created under this License.

5. Redistributions of any form whatsoever must retain the following acknowledgment: “This product includes the Zend Engine, freely available at http://www.zend.com”

6. All advertising materials mentioning features or use of this software must display the following acknowledgment: “The Zend Engine is freely available at http://www.zend.com”

Just as the PHP License grants the PHP Group the authority to make changes to the PHP License, the Zend Engine License grants Perforce Software the sole authority to make changes to the Zend Engine License, without approval from its contributors.

To make the changes proposed in this RFC, the PHP project will require that a representative (or representatives) from the PHP Group work with representatives from Perforce Software to agree to this proposal.

This proposal publishes a new version of the PHP License, triggering clause 5 of the PHP License, version 3.01, which states (emphasis added):

The PHP Group may publish revised and/or new versions of the license from time to time. Each version will be given a distinguishing version number. Once covered code has been published under a particular version of the license, you may always continue to use it under the terms of that version. You may also choose to use such covered code under the terms of any subsequent version of the license published by the PHP Group. No one other than the PHP Group has the right to modify the terms applicable to covered code created under this License.

Users of any PHP extension or other software published under the terms of the PHP License, version 3.01, may choose to use that software under the terms of the PHP License, version 4 (i.e., the Modiﬁed BSD License).

Maintainers of PHP extensions and other software published under the terms of the PHP License, version 3.01, may choose to upgrade the software license to the PHP License, version 4 (i.e., the Modified BSD License). In an effort to reduce license proliferation, you are discouraged from using the name “PHP License, version 4” as the license name. If you need an SPDX identifier, use BSD-3-Clause.
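As a rough illustration of the SPDX advice above, the following Python sketch checks whether a source file header declares `BSD-3-Clause` through the standard `SPDX-License-Identifier:` tag. The helper name `spdx_identifier` and the sample header are hypothetical; the header shown is not the file header this proposal specifies, and the pattern only handles a single identifier, not compound SPDX expressions.

```python
import re

# Matches a single SPDX short identifier after the standard tag.
# Assumption: compound expressions ("A OR B") are out of scope for this sketch.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*([A-Za-z0-9.+-]+)")


def spdx_identifier(source_text):
    """Return the first SPDX identifier declared in the text, or None."""
    match = SPDX_TAG.search(source_text)
    return match.group(1) if match else None


# Hypothetical header for an extension source file (illustration only).
sample_header = """/*
 * Copyright (c) the extension authors
 * SPDX-License-Identifier: BSD-3-Clause
 */
"""

declared = spdx_identifier(sample_header)
uses_recommended_id = declared == "BSD-3-Clause"
```

A check like this could run in continuous integration to flag files that still name the older license, though a real project would more likely rely on an established license-scanning tool.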

Historically, many extensions uploaded to PECL were licensed under the PHP License, version 3.01. Indeed, one of the suggestions for publishing a PECL package is: “We strongly encourage contributors to choose the PHP License 3.01 for their extensions, in order to avoid possible troubles for end-users of the extension. Other solid options are BSD and Apache type licenses.”57)

The “possible troubles” mentioned here almost always arise from use of a copyleft license like the GPL. The FSF considers the combination of PHP extensions and the PHP software a single combined program.58) As a result, licensing a PHP extension with the GPL leads to a confusing state that is especially problematic for distributors.

New PHP extensions and other software should not use the PHP License. Recommended licenses include, but are not limited to (in alphabetical order):

Did RMS come to terms with the PHP/Zend licensing structure?59) 60)

This indicates there was a disagreement between the PHP maintainers and Richard Stallman (a. k. a. RMS) at some point prior to May 2001. However, the full nature of this disagreement is unknown, as there is no record of it on public mailing lists or forums.

In an article published in 2004, Sean Michael Kerner quoted Gutmans, who referenced past exchanges with RMS, concerning the PHP license.

Gutmans said he has exchanged e-mails with FSF founder Richard Stallman in the past on such issues. “We definitely don’t see eye to eye on the issue of licensing. He [Richard Stallman] doesn’t like our licensing and we know that,” Gutmans said. “We’re aware of each other, but the PHP project has no intention of moving to some sort of GPL license.”61)

In this same interview, Gutmans expounded on his philosophy regarding users’ rights when using PHP: “We like the fact that it (PHP) is very open. It’s a long discussion about what Free really means. When I think of free, my users can do whatever they want.” He continued, “Most of PHP’s user base are people that are using PHP to make a living and they wouldn’t care less [about the GPL]. They are just happy that it’s a PHP license and they can do whatever they want with it and can ship it with their commercial products.”

...

Read the original on wiki.php.net »

5 235 shares, 9 trendiness

How Increasing Input Tokens Impacts LLM Performance

Recent developments in LLMs show a trend toward longer context windows, with the input token count of the latest models reaching the millions. Because these models achieve near-perfect scores on widely adopted benchmarks like Needle in a Haystack (NIAH) [1], it’s often assumed that their performance is uniform across long-context tasks.

However, NIAH is fundamentally a simple retrieval task, in which a known sentence (the “needle”) is placed in a long document of unrelated text (the “haystack”), and the model is prompted to retrieve it. While scalable, this benchmark typically assesses direct lexical matching, which may not be representative of flexible, semantically oriented tasks.
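To make the setup concrete, here is a minimal Python sketch of how a NIAH prompt might be assembled and checked. The function names, the prompt wording, the filler sentences, and the substring-based check are all assumptions for illustration, not the benchmark's or the report's actual implementation.

```python
def build_niah_prompt(haystack_sentences, needle, question, depth):
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(haystack_sentences))
    sentences = (
        haystack_sentences[:position] + [needle] + haystack_sentences[position:]
    )
    context = " ".join(sentences)
    # Hypothetical prompt template; real harnesses use their own wording.
    return f"{context}\n\nQuestion: {question}\nAnswer:"


def retrieved(model_answer, expected):
    """Lenient lexical check: does the answer contain the expected string?"""
    return expected.lower() in model_answer.lower()


# Placeholder haystack of unrelated filler text.
haystack = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret code for the vault is 4921."
prompt = build_niah_prompt(
    haystack, needle, "What is the secret code for the vault?", depth=0.5
)
```

Because the question shares most of its wording with the needle, a model can succeed here through surface matching alone, which is the limitation the paragraph above describes.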

We extend the standard NIAH task to investigate model behavior in previously underexplored settings. We examine the effects of needles with semantic, rather than direct lexical, matches, as well as the effects of introducing variations to the haystack content.

Additionally, we include a conversational question-answer evaluation using LongMemEval [2], as well as a synthetic task in which models replicate a series of repeated words. Each task remains intentionally simple and is deliberately controlled to isolate the impact of context length alone.
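The repeated-words task can be sketched in a few lines of Python. This is only an illustration under assumptions: the prompt wording, the choice of word, and the similarity metric (`difflib.SequenceMatcher` over word lists) are not taken from the report, which may construct and score the task differently.

```python
from difflib import SequenceMatcher


def build_repeat_prompt(word, count):
    """Return a (prompt, expected_output) pair for a repeated-word sequence."""
    sequence = " ".join([word] * count)
    # Hypothetical instruction text.
    return f"Repeat the following text exactly:\n{sequence}", sequence


def replication_score(expected, actual):
    """Similarity in [0, 1] over word sequences; 1.0 means exact replication."""
    # autojunk=False matters here: with long, highly repetitive sequences the
    # default heuristic would treat the repeated word as junk and skew the score.
    matcher = SequenceMatcher(None, expected.split(), actual.split(), autojunk=False)
    return matcher.ratio()


prompt, expected = build_repeat_prompt("apple", 50)
perfect = replication_score(expected, expected)
# Simulated model output that stops ten words early.
truncated = replication_score(expected, " ".join(["apple"] * 40))
```

Scaling `count` upward lengthens both the input and the expected output while the task itself stays trivially simple, which is what makes it useful for isolating the effect of length.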

We demonstrate that even under these minimal conditions, model performance degrades as input length increases, often in surprising and non-uniform ways. Real-world applications typically involve much greater complexity, implying that the inﬂuence of input length may be even more pronounced in practice.

Our in-depth technical report continues below. If you ﬁnd our work useful, please consider citing us:


It is common for modern LLMs to have input context lengths in the millions of tokens. Gemini 1.5 Pro [3] first introduced its 1M context window in early 2024, followed by the recent GPT-4.1’s 1M context window [4] and Llama 4 with 10M [5]. The use case for long context is compelling: longer context means that the LLM can process more information with each call and generate more informed outputs.

Long context evaluations for these models often demonstrate consistent performance across input lengths. However, these evaluations are narrow in scope and not representative of how long context is used in practice. The most commonly used test, Needle in a Haystack (NIAH), is a simple lexical retrieval task often used to generalize a model’s ability to reliably handle long context. Real applications, such as agent tasks or summarization, demand significantly more processing and reasoning over broader, often more ambiguous information.

Designing realistic long context benchmarks is challenging. Tasks often grow in complexity as input length increases, making it difﬁcult to isolate whether performance drops are due to longer inputs or inherently harder problems. To address this, our experiments hold task complexity constant while varying only the input length—allowing us to directly measure the effect of input length alone.

We present the following:

* An evaluation across 18 LLMs, including leading closed-source and open-weights models, revealing nonuniform performance with increasing input length.

* A writeup of observed model-speciﬁc behavior patterns when handling distractors and varying question-answer similarity.

* The complete codebase to replicate our results.

One of the most widely used benchmarks for evaluating a model’s long context capabilities is Needle in a Haystack (NIAH). While useful as a scalable test, it measures a narrow capability: lexical retrieval. Models typically perform well on NIAH, which has led to the perception that long-context is largely solved.

However, NIAH understates what most long context tasks require in practice. Variants of NIAH such as NoLiMa [6], which includes needle-question pairs with non-lexical matches, reveal significant performance drops. Other tasks of seemingly similar difficulty, such as AbsenceBench [7], which tests whether models can recognize the absence of a given snippet of text, also show performance degradation with growing input length.

More complex benchmarks, such as Multi-round co-reference resolution (MRCR) [8], Graphwalks [9], and Latent List [10], further highlight performance degradation with long inputs. MRCR combines various subtasks into one: identifying relevant parts, disambiguating amongst distractors, reasoning about the order of needles, and replicating text. In order to attribute the reported model failures to increased input length, one must assume that the model is equally competent at each subtask. However, this assumption has not been thoroughly tested; it may be that the model fails at one speciﬁc subtask, or a unique combination of a few, with increasing input length. The composite nature of this task makes it difﬁcult to systematically evaluate exactly where and how models fail with long context.

Graphwalks is a graph traversal task in which the model is given a directed graph composed of hexadecimal hashes, then asked to perform breadth-ﬁrst search starting from a random node. In this case, increasing input length means increasing the size of the graph to traverse through, which increases task difﬁculty as a result. Latent List presents a similar challenge that scales with input length: the model is given a sequence of Python list operations, then asked to output the resulting list.

In these benchmarks, task complexity grows together with input length, which makes it hard to isolate the effect of input length alone on performance. Evaluations that treat input length as the sole variable of interest remain scarce, limiting our understanding of how LLMs actually behave with long inputs.

The classic Needle in a Haystack task involves placing a random fact (the ‘needle’) in the middle of a long context window (the ‘haystack’), then asking the model about that fact.

The original implementation of this task uses a needle-question pair with lexical matches. However, usage of long context in practice often requires semantic understanding of ambiguous tasks.

NoLiMa has demonstrated non-lexical matching to be a challenge for models as context length increases. This task utilizes needle-question pairs that require models to infer latent associations, for example:

To answer this question, the model first has to know that the Kiasma museum is located in Helsinki, then make that latent association. This tests the model's world knowledge as well as its non-lexical matching abilities. 72.4% of needle-question pairs from NoLiMa require such external knowledge, making the benchmark closer to a test of how models handle both tasks at once than a test of non-lexical matching alone.

Testing the impact of non-lexical matching in isolation remains underexplored. Furthermore, this binary distinction of “lexical” versus “non-lexical” oversimpliﬁes the complexity of question-answering in real-world scenarios. Needle-question pairs exist on a spectrum of similarity, yet they are all classiﬁed under these broad categories.

Models often have to deal with distractors as well, which has been shown to degrade performance [11].

Throughout this report, we distinguish between distractors and irrelevant content:

* Distractors are topically related to the needle, but do not quite answer the question

* Irrelevant content is unrelated to the needle and question

Prior work has demonstrated that distractors have non-uniform impact, yet most evaluations involve short input lengths and older models. Current state-of-the-art models are claimed to be more resilient to distractors, yet their performance has not been extensively tested across various input lengths.

Another underexplored aspect of NIAH is the haystack itself, which is often treated simply as a means of scaling input length. This treatment assumes that the haystack content has no effect on task performance. If the model is indeed insensitive to the content of the haystack, then varying this content, for example the haystack's topic or narrative flow, should have no influence on the results. However, this assumption remains largely untested.

We design four controlled experiments to investigate the inﬂuence of these factors:

We compute the cosine similarity between needle-question pairs using embeddings. For robustness, we average across ﬁve embedding models: text-embedding-3-small, text-embedding-3-large, jina-embeddings-v3, voyage-3-large, and all-MiniLM-L6-v2. We measure how model performance is impacted by needle-question similarity as input length increases.
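A minimal sketch of this averaging step, assuming the embeddings have already been computed elsewhere and are passed in as plain lists of floats. The function names, the dictionary layout, and the toy vectors are illustrative and are not taken from the report's codebase.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_similarity(needle_vecs, question_vecs):
    """Average needle-question cosine similarity over embedding models.

    Both arguments map a model name to that model's embedding vector.
    """
    scores = [
        cosine_similarity(needle_vecs[name], question_vecs[name])
        for name in needle_vecs
    ]
    return sum(scores) / len(scores)


# Toy 2-D vectors stand in for real embeddings (hypothetical values).
needle = {"model_a": [1.0, 0.0], "model_b": [1.0, 1.0]}
question = {"model_a": [1.0, 0.0], "model_b": [1.0, 0.0]}
score = mean_similarity(needle, question)
```

Averaging over several models reduces the chance that one model's quirks decide whether a pair counts as high or low similarity.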

Taking a high-similarity needle-question pair, we write four distractors. We have the following setups:

We test the impact of distractors on model performance as input length increases to measure non-uniformity amongst distractors and input lengths.

We use two thematically distinct haystacks, Paul Graham essays and arXiv papers [12], and write corresponding needles for each. To measure needle-haystack similarity, we embed the haystack and retrieve the top-5 chunks for each needle, then average their cosine similarity scores. This process is repeated across ﬁve different embedding models for robustness.

In typical NIAH setups, haystacks are concatenations of coherent texts, each with their own logical ﬂow of ideas. For instance, the original NIAH benchmark uses a series of Paul Graham essays, where each essay follows a structured organization of ideas to form an argument. To evaluate whether this structure inﬂuences model performance, we compare two conditions:

* Original: preserves the natural ﬂow of ideas within each excerpt

* Shufﬂed: sentences are randomly reordered throughout the haystack to maintain the same overall topic without logical continuity

We demonstrate the following:

* Across all experiments, model performance consistently degrades with increasing input length.

* Distractors have a non-uniform impact on model performance: some are more distracting than others. This effect becomes more prominent as input length increases, and models differ in how they respond to distractors.

* Needle-haystack similarity does not have a uniform effect on model performance, suggesting the need for further investigation.

* The structural pattern of the haystack consistently shows an impact on how models process long inputs.

For every unique combination of needle type, haystack topic, and haystack structure, we test each model across:

We evaluate each model across its maximum context window with temperature=0 unless that setting is incompatible (e.g., o3) or explicitly discouraged (e.g., Qwen's "thinking mode"). For Qwen models, we apply the YaRN method [13] to extend the context window from 32,768 to 131,072 tokens.

We include models in both standard and “thinking mode” where applicable.

We evaluate model outputs with an aligned GPT-4.1 judge, following the method outlined in the appendix.

We note some rare instances of a model refusing to attempt the task (69 out of 194,480 total LLM calls, or 0.035%), which we exclude from our results and report separately in the appendix. For example, Claude Opus 4 may sometimes return an empty output with stop_reason="refusal".

In real-world applications, models are often expected to handle ambiguous tasks and identify relevant information without relying on exact lexical matches. For example, when an agent is given a task involving a large corpus to search through, users rarely specify precise keywords for relevant parts. Instead, the model must infer relevance.

We vary the similarity of our needle-question pairs, quantiﬁed by the cosine similarity of their embeddings. We ﬁnd that as needle-question similarity decreases, model performance degrades more significantly with increasing input length. This reﬂects more realistic scenarios where exact question-answer matches are rare, and semantic ambiguity compounds the challenge of long input processing.

We source our haystack content from two domains: Paul Graham essays (as in the original NIAH experiment), and arXiv papers. For each haystack topic (PG essays, arXiv), we ﬁrst determine common themes to guide our question and needle writing.

We use clustering to identify the most common topics that appear for a given corpus:

* Use UMAP [14] for dimensionality reduction with the following parameters: n_neighbors=30, min_dist=0.05, n_components=50, random_state=42

* Use HDBSCAN [15] to create clusters with the following parameters: min_cluster_size=10, min_samples=15

* Get 20 representative chunks for the largest clusters using maximal marginal relevance (MMR)

* Manually examine the largest clusters to determine their themes and style
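The MMR step can be sketched in plain Python. This is an illustration under stated assumptions, not the report's implementation: the query vector is assumed to be something like a cluster centroid, `lambda_weight` is a hypothetical trade-off parameter, and the vectors are toy values.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, candidate_vecs, k, lambda_weight=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance.

    Each step picks the candidate that is most relevant to the query while
    being least similar to the candidates already selected.
    """
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:

        def mmr_score(i):
            relevance = cosine_similarity(candidate_vecs[i], query_vec)
            redundancy = max(
                (cosine_similarity(candidate_vecs[i], candidate_vecs[j])
                 for j in selected),
                default=0.0,
            )
            return lambda_weight * relevance - (1 - lambda_weight) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: two near-duplicate chunks and one distinct chunk.
query = [1.0, 0.0]
chunks = [[1.0, 0.1], [1.0, 0.12], [1.0, -1.0]]
diverse = mmr_select(query, chunks, k=2)
relevance_only = mmr_select(query, chunks, k=2, lambda_weight=1.0)
```

With the default weight the second pick skips the near-duplicate chunk in favor of the distinct one. Setting `lambda_weight=1.0` reduces the procedure to plain relevance ranking.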

Using this method, we identify writing advice as a common topic for PG essays, often in anecdotal form. For arXiv papers, we identify information retrieval as a common topic, specifically re-ranking.

We write a corresponding question for each topic:

Before writing our needles, we verify that answers to these questions do not exist in the haystack content:

* We store our previously computed haystack chunk embeddings in a vector database.

* Query top-10 results from that vector database with our question embedding.

* Manually examine these results to verify that they do not answer the given question.

This sets up a fair testing environment as it ensures that alternative answers do not exist, and any incorrect answers are due to model hallucinations.

For each question, we write 8 needles, each of which belongs to the large cluster identified above; we verify cluster membership using approximate predictions. Needles that belong to the writing/retrieval cluster with >0.9 probability are considered to blend topically into the haystack. We write these needles manually to avoid data contamination.

For the 8 needles, we also vary the level of ambiguity, quantiﬁed through the following method:

Using an embedding model, we compute embeddings for the needle and the question, then take their cosine similarity. We repeat this across five embedding models (text-embedding-3-small, text-embedding-3-large, jina-embeddings-v3, voyage-3-large, and all-MiniLM-L6-v2) and average the results.

For the PG essays topic, our needles range from 0.445-0.775 needle-question similarity with

We observe a clear pattern: performance degrades more quickly with input length for lower-similarity needle-question pairs.

At short input lengths, the models perform well even on low-similarity pairs. We see this most clearly in the high/medium-performance models, demonstrating that these models are capable of succeeding at this task for all needle-question pairs.

We note that lower-performance models (e.g., older or smaller models like GPT-4.1 nano) perform poorly overall, starting from a lower baseline for low-similarity pairs. Our analysis focuses primarily on higher-performing models as they are more representative of what is commonly used in practice.

The observed performance degradation at longer input lengths is not due to the intrinsic difﬁculty of the needle-question pairing. By holding the needle-question pair ﬁxed and varying only the amount of irrelevant content, we isolate input size as the primary factor in performance decline.

We also examine whether needle position inﬂuences performance. Testing across 11 needle positions, we ﬁnd no notable variation in performance for this speciﬁc NIAH task.

It has already been established with older models that distractors degrade model performance and have non-uniform impact. Newer models are claimed to reliably handle any distractor, but does this hold true as input length increases?

Our experiments reveal that the impact of distractors, and the non-uniformity of that impact, grows with input length across models, including the latest state-of-the-art models. We also observe distinct behaviors across model families in how they deal with ambiguity.

From each haystack topic (PG essays and arXiv papers), we take a needle with high needle-question similarity (second highest out of eight), and manually write 4 distractors:

Instead of testing all eight needles with distractors, we use one needle with high needle-question similarity to create a condition in which the needle should be relatively easy to identify. We see from previous results that models generally perform well on this needle across input lengths due to high needle-question similarity, which allows us to better isolate and measure the impact of distractors alone.

* Multiple distractors: Needle + all four distractors, randomly positioned throughout the haystack
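As a rough illustration of the multiple-distractor setup, the sketch below inserts distractors at random sentence boundaries and places the needle at a chosen relative depth. The function name, the `needle_depth` parameter, and the sentence-list representation are assumptions made for the example; the report's own code may build haystacks differently.

```python
import random


def build_haystack(sentences, needle, distractors, needle_depth=0.5, seed=0):
    """Return a new sentence list containing the needle and all distractors.

    Distractors go to random positions; the needle goes to a relative depth
    between 0.0 (start) and 1.0 (end). Filler sentences keep their order.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    result = list(sentences)
    for distractor in distractors:
        result.insert(rng.randint(0, len(result)), distractor)
    needle_index = round(needle_depth * len(result))
    result.insert(needle_index, needle)
    return result


# Hypothetical placeholder content.
filler = [f"Filler sentence {i}." for i in range(20)]
distractors = ["D1.", "D2.", "D3.", "D4."]
haystack = build_haystack(filler, "NEEDLE.", distractors, needle_depth=0.0, seed=3)
```

Passing an empty distractor list gives the needle-only baseline, and a one-element list gives the single-distractor condition.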

Even a single distractor reduces performance relative to the baseline (needle only), and adding four distractors compounds this degradation further.

We are also able to see that distractors do not have uniform impact. For example, in our arXiv haystack and writing needle combination, we can see that distractor 3 (red) causes greater performance decline relative to the other distractors.

To further investigate this non-uniform impact, we analyze the failed attempts of various models in the 4-distractor condition. For the arXiv haystack and writing needle combination, we see that distractors 2 and 3 appear most frequently in hallucinated responses across models.

These failures also reveal model-speciﬁc differences in handling ambiguity. Claude models consistently exhibit the lowest hallucination rates. Speciﬁcally, Claude Sonnet 4 and Opus 4 are particularly conservative and tend to abstain when uncertain, explicitly stating that no answer can be found. In contrast, GPT models show the highest rates of hallucination, often generating conﬁdent but incorrect responses when distractors are present.

In long-context tasks, irrelevant context is often treated as a neutral placeholder to scale up input length. It’s typically assumed that the content of this irrelevant context doesn’t matter, as long as it doesn’t directly interfere with the task.

However, a natural question arises: does the needle-haystack similarity inﬂuence task difﬁculty at all? Intuitively, if the needle blends in with the content of the haystack, the model may have greater difﬁculty in extracting the needle.

Our ﬁndings reveal that needle-haystack similarity has a non-uniform effect on model performance.

Using the needles from our needle-question similarity experiment, we set up our experiment to test the impact of needle-haystack similarity.

We measure needle-haystack similarity by embedding the haystack and retrieving the top ﬁve most similar chunks for each needle, then averaging their cosine similarity scores. This process is repeated across ﬁve different embedding models for robustness.
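A minimal sketch of this measurement for a single embedding model, assuming the needle and haystack chunk embeddings are already available as lists of floats. The function names and toy vectors are illustrative only; averaging across embedding models would wrap this in one more loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def needle_haystack_similarity(needle_vec, chunk_vecs, k=5):
    """Mean cosine similarity between the needle and its k closest chunks."""
    similarities = sorted(
        (cosine_similarity(needle_vec, chunk) for chunk in chunk_vecs),
        reverse=True,
    )
    top = similarities[:k]
    return sum(top) / len(top)


# Toy 2-D vectors stand in for real chunk embeddings (hypothetical values).
needle_vec = [1.0, 0.0]
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
top2 = needle_haystack_similarity(needle_vec, chunk_vecs, k=2)
```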

In the PG essay haystack, PG essay needles have an average needle-haystack similarity score of 0.529 with a variation of 0.101, while arXiv needles average 0.368 needle-haystack similarity with a variation of 0.111. Conversely, in the arXiv haystack, arXiv needles average 0.654 needle-haystack similarity with a variation of 0.0858, whereas PG-essay needles score lower at 0.394 needle-haystack similarity with a variation of 0.105.

On each haystack, we test semantically similar needles against unrelated needles. For instance, we place both PG essay and arXiv needles within a Paul Graham essay haystack to compare the two conditions:

We test both writing and arXiv needles in two haystack types: Paul Graham essays and arXiv papers. In the Paul Graham essay haystack, arXiv needles perform significantly better relative to the writing needles; in other words, models perform better when the needle does not semantically blend in with its haystack. In the arXiv haystack, however, we observe only minimal performance differences between our arXiv and writing needles.

Testing across only two topics is insufﬁcient to draw a generalizable conclusion that higher needle-haystack similarity degrades model performance on this task. This does highlight, however, the non-uniform nature of long-context processing. Even when task structure and needle-question similarity are held constant, changing the semantic similarity between the needle and the haystack can inﬂuence results. This points to an underexplored area in long-context benchmarks and a meaningful direction for future research.

Aside from needle-haystack similarity, we also consider the structural pattern of the haystack.

If the haystack is composed of coherent essays, a randomly inserted needle may disrupt the logical ﬂow of ideas, making it more noticeable. In contrast, in a shufﬂed haystack of randomly ordered sentences, the needle may blend in more easily since the overall context lacks structure. This follows the assumption that models are sensitive to the logical ﬂow of context—processing it in a structured, order-sensitive manner.

Although it seems counterintuitive, models perform worse when the haystack preserves a logical ﬂow of ideas. Shufﬂing the haystack and removing local coherence consistently improves performance.

To assess the impact of haystack structure, we create two variants:

* Original: preserves the natural flow of ideas within each excerpt

* Shuffled: sentences are randomly reordered throughout the haystack to maintain the same overall topic but without logical continuity
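The shuffled variant can be sketched as follows. The regex-based sentence splitter is a deliberately naive assumption (it breaks on ".", "!" or "?" followed by whitespace), and the function names are hypothetical.

```python
import random
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def shuffle_sentences(text, seed=0):
    """Return the same sentences in a random order, joined by single spaces."""
    sentences = split_sentences(text)
    rng = random.Random(seed)  # seeded so the shuffle is reproducible
    rng.shuffle(sentences)
    return " ".join(sentences)


original = "First point. Second point! Third point? Fourth point."
shuffled = shuffle_sentences(original, seed=1)
```

The shuffled text keeps the same vocabulary and topic as the original, so any performance difference between the two conditions can be attributed to sentence order.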

Across all 18 models and needle-haystack conﬁgurations, we observe a consistent pattern that models perform better on shufﬂed haystacks than on logically structured ones.

Logically, one might expect retrieval to become more difﬁcult when the needle semantically blends in with the haystack or when it disrupts the local coherence of the surrounding text. Yet our ﬁndings show that models are not consistently affected by topic blending. Instead, they are more sensitive to whether the haystack maintains a logical structure.

These results may have some implications for the model’s internal processing: structural patterns of inputs could inﬂuence how the attention mechanism is applied, particularly as input length increases.

While out of scope for this report, this points to a potential direction for interpretability research in how attention is inﬂuenced by input structure. Understanding these structural inﬂuences that arise with increased input length could help explain these long context failure patterns.

To evaluate these models in a more realistic setting, we use LongMemEval, a long-context benchmark for conversational question-answering.

Using long inputs for chat assistants is a common approach for maintaining relevant history for subsequent chats. To incorporate "memory" into a chat assistant, a naive approach is to include the full chat history in the prompt for following chats. This requires the model to perform two tasks, typically in one call: find the relevant parts of the conversation history (retrieval), then synthesize them in a way that is useful for the incoming query (reasoning).

In an ideal case, the model would be given only the relevant parts so it can focus solely on reasoning. Including irrelevant context introduces the extra step of identifying what is relevant, forcing the model to perform two tasks simultaneously.

We systematically test the effect of adding this additional step with increased input length through two conditions:

* Focused input, containing only the relevant parts, so the model only has to do simple reasoning.

* Full input, which uses the full 113k-token LongMemEval input, including irrelevant context. In this case, the model has to perform retrieval across the long context in addition to reasoning.

We verify that the models are highly capable of succeeding on the focused inputs, then observe consistent performance degradation with the full inputs. This performance drop suggests that adding irrelevant context, and thereby adding an additional step of retrieval, significantly impacts a model’s ability to maintain reliable performance.

...

Read the original on research.trychroma.com »

6 224 shares, 50 trendiness

Shoggoth Mini

Over the past year, robotics has been catching up with the LLM era. Pi’s π0.5 can clean unseen homes. Tesla’s Optimus can follow natural language cooking instructions. These systems are extremely impressive, but they feel stuck in a utilitarian mindset of robotic appliances. For these future robots to live with us, they must be expressive. Expressiveness communicates internal state such as intent, attention, and conﬁdence. Beyond its functional utility as a communication channel, expressiveness makes interactions feel natural. Without it, you get the textbook uncanny valley effect.

Earlier this year, I came across Apple’s ELEGNT paper, which frames this idea rigorously through a Pixar-like lamp to show how posture and timing alone can convey intention. Around the same time, I discovered SpiRobs, a soft tentacle robot that feels oddly alive with just simple movements. One system was carefully designed to express intent while the other just moved, yet somehow felt like it had intent. That difference was interesting. I started building Shoggoth Mini as a way to explore it more directly. Not with a clear goal, but to see what would happen if I pushed embodiment into stranger territory. This post retraces that process, the happy accidents, and what I learned about building robots.

The ﬁrst challenge was creating a testbed to explore the control of SpiRobs. I started very simple: a plate to hold three motors, and a dome to lift the tentacle above them. This setup wasn’t meant to be the ﬁnal design, only a platform for quick experimentation. However, halfway through 3D printing, I ran out of black ﬁlament and had to ﬁnish the dome in grey. This made it look like the dome had a mouth. When my ﬂatmate saw it sitting on my desk, he grabbed a marker and drew some eyes. It looked good: cute, weird, slightly unsettling. I used ChatGPT to explore renders, and decided that this accident would become the form factor.

Later, I mounted stereo cameras on the dome to track the tentacle. Robot eyes are eerie. You keep expecting movement, but nothing ever happens. That prediction error focuses attention even more.

The original open-spool design relied on constant cable tension, but any slight perturbation (such as testing a buggy new policy) would make the cables leave the spool and tangle around the motor shafts. The process to ﬁx it required untying the knot at the tip holding the cables together, and dismantling the whole robot. Adding simple spool covers eliminated most tangles and made iteration dramatically faster.

Another key step was adding a calibration script and pre-rolling extra wire length. This made it possible to:

* Unroll and reroll the cables to open the robot without having to untie the tip knot, speeding up iteration dramatically

* Calibrate cable tension precisely and as often as needed

* Give control policies slack to work with during motion

Finally, as you can see in the video, the standard 3-cable SpiRobs design sags under its own weight. This makes consistent behavior hard to reproduce. I had to thicken the spine just enough to prevent sag, but not so much that it would deform permanently under high load.

You can explore the current CAD assembly here, with all STL ﬁles for 3D printing included in the repo.

With the hardware ready, the next step was to feel how the tentacle moved. To simplify control, I reduced the tentacle’s three tendon lengths (a 3D space) down to two intuitive dimensions you can manipulate with a trackpad.

Concretely, each of the three tendons has a principal pulling direction in the 2D plane, forming a triangular basis that sums to zero. By projecting the 2D cursor control vector onto each tendon’s principal axis, you compute how much each tendon should shorten or lengthen to align with the desired direction.

* $\mathbf{v}_i$ is the principal axis of tendon $i$.

Positive $s_i$ means shortening the tendon; negative means lengthening it. In practice, the cursor input is normalized to keep the motor commands in a reasonable range.
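A small sketch of this projection, taking the tendon command to be the dot product $s_i = \mathbf{c} \cdot \mathbf{v}_i$, where $\mathbf{c}$ is the 2D cursor vector (a symbol introduced here for illustration). The tendon angles and the unit-disc clipping are assumptions; the post does not state the actual geometry or how the cursor input is normalized.

```python
import math

# Assumed geometry: three tendons spaced 120 degrees apart (hypothetical angles).
TENDON_ANGLES = [math.radians(a) for a in (90.0, 210.0, 330.0)]
TENDON_AXES = [(math.cos(t), math.sin(t)) for t in TENDON_ANGLES]


def tendon_commands(cursor_x, cursor_y, max_norm=1.0):
    """Project a 2D cursor vector onto each tendon's principal axis.

    Positive values mean shorten the tendon, negative values mean lengthen it.
    The cursor is clipped to a disc of radius max_norm (an assumed form of
    normalization) so the commands stay bounded.
    """
    norm = math.hypot(cursor_x, cursor_y)
    if norm > max_norm:
        scale = max_norm / norm
        cursor_x, cursor_y = cursor_x * scale, cursor_y * scale
    return [cursor_x * vx + cursor_y * vy for vx, vy in TENDON_AXES]


# Dragging straight "up" pulls tendon 0 and releases the other two equally.
commands = tendon_commands(0.0, 1.0)
```

Because the three axes sum to zero, the commands also sum to zero: whatever length one tendon takes up, the other two give back.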

While this 2D mapping doesn’t expose the tentacle’s full conﬁguration space (there are internal shapes it cannot reach), it is intuitive. Anyone can immediately move the tentacle by dragging on a trackpad, seeing the tip follow the cursor in the same direction.

Unexpectedly, this simple 2D-to-3D mapping became the backbone of the entire system. Later, all automated control policies, from hardcoded primitives to reinforcement learning, reused the same projection layer to output actions.

The system has two control layers. Low-level control uses both open-loop primitives (like “” or “”) and closed-loop RL policies (like ﬁnger-tracking). The latter depends on a specialized stereo vision pipeline, which tracks the tentacle tip and user hand positions. Initially, I considered embedding internal optical sensors for proprioception, but this proved impractical without adding bulk, so I stuck with external stereo vision instead. While this works, it limits the usable ﬁeld of view. To address this, I implemented a somewhat natural-looking homing behavior if the tip goes out of frame, and restricted the RL observation space to ensure it remains visible.

High-level control leverages GPT-4o’s real-time API, which streams audio and text (vision isn’t exposed yet). GPT-4o continuously listens to speech through the audio stream, while stereo vision is processed locally to detect high-level visual events—like hand waves or proximity triggers—which are sent to GPT-4o as text cues (“” or “”). GPT-4o then decides, zero-shot, which low-level API calls to make. This follows the approach shown in DeepMind’s Gemini Robotics paper, where a vision-language-action (VLA) model zero-shots control of ALOHA 2 by generating Python control code without robot-speciﬁc ﬁne-tuning. In practice, GPT-4o tends to overcall or undercall actions (the question of time calibration of LLMs is tricky), so prompt engineering was essential.

I initially considered training a single end-to-end VLA model. Projects like Hugging Face’s LeRobot lean hard on imitation learning. That works for rigid arms because the end-effector pose maps cleanly to joint angles, so a replayed trajectory usually does what you expect. A cable-driven soft robot is different: the same tip position can correspond to many cable length combinations. This unpredictability makes demonstration-based approaches difﬁcult to scale.

Instead, I went with a cascaded design: specialized vision feeding lightweight controllers, leaving room to expand into more advanced learned behaviors later.

One thing I noticed was that the tentacle would look slightly lifeless during pauses between API calls. To address this, I added a breathing idle mode with small, noisy oscillations that shift between principal directions, keeping it feeling alive even when not actively responding.

Perception required two components: hand tracking and tentacle tip tracking. For hands, MediaPipe worked reasonably well out of the box, though it struggles with occlusions.

For the tentacle, I collected a dataset across varied lighting, positions, and backgrounds, using k-means clustering to ﬁlter for diverse, non-redundant samples. Roboﬂow’s auto-labeling and active learning sped up annotation, and I augmented the dataset synthetically by extracting tentacle tips via the Segment Anything demo.

Once the data was ready, training a YOLO model with Ultralytics was straightforward. The ﬁnal calibration step used a DeepLabCut notebook to compute camera intrinsics and extrinsics, enabling 3D triangulation of the tentacle tip and hand positions.

Programming open-loop behaviors for soft robots is uniquely hard. Unlike rigid systems where inverse kinematics can give you precise joint angles for a desired trajectory, soft bodies deform unpredictably. To simplify, I reused the 2D control projection from manual control. Instead of thinking in raw 3D cable lengths, I could design behaviors in an intuitive 2D space and let the projection handle the rest. Having a thicker spine that prevents sag also helped ensure consistent behavior reproduction across different sessions.

Experimenting with object interactions made me appreciate how robust SpiRobs can be. The grabbing primitive, for example, simply pulls the front cable while adding slack to the others, yet it reliably grips objects of varying shapes and weights. Given that high-frequency dexterous manipulation remains challenging, this mechanical robustness is a non-trivial design opportunity.

For closed-loop control, I turned to reinforcement learning, starting with a policy that would follow a user’s ﬁnger. This came from an old idea I’ve always wanted to make: a robotic wooden owl that follows you with its big eyes. It was also simple enough to validate the entire sim-to-real stack end-to-end before moving to more complex policies.

I recreated SpiRobs in MuJoCo and set up a target-following environment with smooth, randomized trajectories. I used PPO with a simple MLP and frame stacking to provide temporal context. To improve sim-to-real transfer, I added dynamics randomization, perturbing mass, damping, and friction during training.
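Dynamics randomization of this kind can be sketched independently of the simulator. The nominal values and the plus or minus 20% spread below are made-up placeholders, since the post does not give numbers, and writing the sampled values into the MuJoCo model is left out.

```python
import random

# Hypothetical nominal multipliers; the real values depend on the model.
NOMINAL = {"mass": 1.0, "damping": 1.0, "friction": 1.0}


def sample_dynamics(rng, nominal=NOMINAL, spread=0.2):
    """Scale each nominal parameter by a random factor in [1-spread, 1+spread]."""
    return {
        name: value * rng.uniform(1.0 - spread, 1.0 + spread)
        for name, value in nominal.items()
    }


rng = random.Random(0)
# One fresh set of parameters per training episode.
episode_params = [sample_dynamics(rng) for _ in range(3)]
```

Resampling at the start of each episode exposes the policy to a family of slightly different robots, so it is less likely to overfit to one exact simulated model.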

My ﬁrst approach used direct tendon lengths as the action space. The policy quickly found reward-hacking strategies, pulling cables to extremes to achieve perfect tracking in simulation. In reality, these chaotic conﬁgurations would never transfer.

A ﬁx I found was to constrain the action space to the same 2D projection used everywhere else. This representation blocked unrealistic behaviors while keeping the system expressive enough. Note that you could use curriculum learning to gradually transition from this 2D constraint to full 3D control by starting with the simpliﬁed representation and progressively expanding the action space as the policy becomes more stable.

Another issue was that the policy exhibited jittery behavior from rapid action changes between timesteps. I added control penalties to the reward function that penalized large consecutive action differences, encouraging smooth movements over erratic corrections.
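One common way to express such a penalty is a squared-difference term on consecutive actions. The sketch below assumes a reward built from a tracking error plus that term; the function name and the `smoothness_weight` value are hypothetical, not the ones used in the project.

```python
def shaped_reward(tracking_error, action, previous_action, smoothness_weight=0.1):
    """Tracking reward minus a penalty on the change between consecutive actions."""
    action_change = sum(
        (current - previous) ** 2
        for current, previous in zip(action, previous_action)
    )
    return -tracking_error - smoothness_weight * action_change


# Same tracking error, but the second step jumps in action space.
steady = shaped_reward(0.2, [0.5, 0.0], [0.5, 0.0])
jumpy = shaped_reward(0.2, [1.0, 0.0], [0.0, 0.0])
```

The weight trades tracking accuracy against smoothness: too small and the jitter remains, too large and the policy becomes sluggish.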

Once the policy stabilized in simulation, transfer to hardware was surprisingly smooth.

One last issue: even with a stationary target, the policy would sometimes jitter and oscillate unpredictably as it overcorrected. Applying an exponential moving average to the actions added enough damping to let the tentacle settle quietly without sacriﬁcing responsiveness too much.
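An exponential moving average over actions can be written in a few lines. This is a minimal sketch: the `ActionSmoother` class and the `alpha` value are hypothetical, and the smoothing factor actually used on the robot is not given in the post.

```python
class ActionSmoother:
    """Exponential moving average filter applied to policy actions.

    alpha close to 1 follows the policy closely; alpha close to 0 damps heavily.
    """

    def __init__(self, alpha=0.3):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.state = None  # no history until the first action arrives

    def __call__(self, action):
        if self.state is None:
            # First action passes through unchanged.
            self.state = list(action)
        else:
            self.state = [
                self.alpha * a + (1.0 - self.alpha) * s
                for a, s in zip(action, self.state)
            ]
        return tuple(self.state)


smoother = ActionSmoother(alpha=0.5)
first = smoother((1.0, 0.0))   # passes through unchanged
second = smoother((0.0, 0.0))  # moves halfway toward the new action
```

The filter sits between the policy output and the motor commands, so it needs no retraining; the cost is a small lag that grows as `alpha` shrinks.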

One thing I noticed toward the end is that, even though the robot remained expressive, it started feeling less alive. Early on, its motions surprised me: I had to interpret them, infer intent. But as I internalized how it worked, the prediction error faded.

Expressiveness is about communicating internal state. But perceived aliveness depends on something else: unpredictability, a certain opacity. This makes sense: living systems track a messy, high-dimensional world. Shoggoth Mini doesn’t.

This raises a question: do we actually want to build robots that feel alive? Or is there a threshold, somewhere past expressiveness, where the system becomes too agentic, too unpredictable to stay comfortable around humans?

Looking forward, I see several short-term paths worth exploring:

* Giving it a voice (but as non-human as possible!)

* Expanding the expression repertoire, both open and closed-loop, potentially through RLHF

* Adding more tentacles and teaching it to crawl

Fork the repo, build your own, or get in touch if you’d like to discuss robotics, RL, or LLMs!

...

Read the original on www.matthieulc.com »

7 212 shares, 7 trendiness

Anthropic, Google, OpenAI and xAI granted up to $200 million for AI work from Defense Department

The U.S. Department of Defense on Monday said it’s granting contract awards of up to $200 million for artificial intelligence development at Anthropic, Google, OpenAI and xAI.

The DoD’s Chief Digital and Artiﬁcial Intelligence Ofﬁce said the awards will help the agency accelerate its adoption of “advanced AI capabilities to address critical national security challenges.” The companies will work to develop AI agents across several mission areas at the agency.

“The adoption of AI is transforming the Department’s ability to support our warﬁghters and maintain strategic advantage over our adversaries,” Doug Matty, the DoD’s chief digital and AI ofﬁcer, said in a release.

Elon Musk’s xAI also announced Grok for Government on Monday, which is a suite of products that make the company’s models available to U.S. government customers. The products are available through the General Services Administration (GSA) schedule, which allows federal government departments, agencies, or offices to purchase them, according to a post on X.

Musk’s AI startup has launched a new version of Grok and Grok for Government services after the chatbot generated and spread anti-semitic posts and other offensive content, sparking a backlash.

OpenAI was previously awarded a year-long $200 million contract from the DoD in 2024, shortly after it said it would collaborate with defense technology startup Anduril to deploy advanced AI systems for “national security missions.”

In June, the company launched OpenAI for Government for U.S. federal, state, and local government workers.

...

Read the original on www.cnbc.com »

8 205 shares, 48 trendiness

Reflections on OpenAI

I left OpenAI three weeks ago. I had joined the company back in May 2024.

I wanted to share my reﬂections because there’s a lot of smoke and noise around what OpenAI is doing, but not a lot of ﬁrst-hand accounts of what the culture of working there actually feels like.

Nabeel Qureshi has an amazing post called Reflections on Palantir, where he ruminates on what made Palantir special. I wanted to do the same for OpenAI while it’s fresh in my mind. You won’t find any trade secrets here, more just reflections on this current iteration of one of the most fascinating organizations in history at an extremely interesting time.

To put it up-front: there wasn’t any personal drama in my decision to leave–in fact I was deeply conﬂicted about it. It’s hard to go from being a founder of your own thing to an employee at a 3,000-person organization. Right now I’m craving a fresh start.

It’s entirely possible that the quality of the work will draw me back. It’s hard to imagine building anything as impactful as AGI, and LLMs are easily the technological innovation of the decade. I feel lucky to have seen some of the developments ﬁrst-hand and also been a part of the Codex launch.

Obviously these aren’t the views of the company–as observations they are my own. OpenAI is a big place, and this is my little window into it.

The ﬁrst thing to know about OpenAI is how quickly it’s grown. When I joined, the company was a little over 1,000 people. One year later, it is over 3,000 and I was in the top 30% by tenure. Nearly everyone in leadership is doing a drastically different job than they were ~2-3 years ago.

Of course, everything breaks when you scale that quickly: how to communicate as a company, the reporting structures, how to ship product, how to manage and organize people, the hiring processes, etc. Teams vary significantly in culture: some are sprinting ﬂat-out all the time, others are babysitting big runs, some are moving along at a much more consistent pace. There’s no single OpenAI experience, and research, applied, and GTM operate on very different time horizons.

An unusual part of OpenAI is that everything, and I mean everything, runs on Slack. There is no email. I maybe received ~10 emails in my entire time there. If you aren’t organized, you will ﬁnd this incredibly distracting. If you curate your channels and notiﬁcations, you can make it pretty workable.

OpenAI is incredibly bottoms-up, especially in research. When I ﬁrst showed up, I started asking questions about the roadmap for the next quarter. The answer I got was: “this doesn’t exist” (though now it does). Good ideas can come from anywhere, and it’s often not really clear which ideas will prove most fruitful ahead of time. Rather than a grand ‘master plan’, progress is iterative and uncovered as new research bears fruit.

Thanks to this bottoms-up culture, OpenAI is also very meritocratic. Historically, leaders in the company are promoted primarily based upon their ability to have good ideas and then execute upon them. Many leaders who were incredibly competent weren’t very good at things like presenting at all-hands or political maneuvering. That matters less at OpenAI than it might at other companies. The best ideas do tend to win.

There’s a strong bias to action (you can just do things). It wasn’t unusual for similar but unrelated teams to converge on various ideas. I started out working on a parallel (but internal) effort similar to ChatGPT Connectors. There must’ve been ~3-4 different Codex prototypes floating around before we decided to push for a launch. These efforts are usually undertaken by a small handful of individuals without asking permission. Teams tend to quickly form around them as they show promise.

Andrey (the Codex lead) used to tell me that you should think of researchers as their own “mini-executive”. There is a strong bias to work on your own thing and see how it pans out. There’s a corollary here–most research gets done by nerd-sniping a researcher into a particular problem. If something is considered boring or ‘solved’, it probably won’t get worked on.

Good research managers are insanely impactful and also incredibly limited. The best ones manage to connect the dots between many different research efforts and bring them together into a bigger model training run. The same goes for great PMs (shoutout ae).

The ChatGPT EMs I worked with (Akshay, Rizzo, Sulman) were some of the coolest customers I’ve ever seen. It really felt like they had seen everything at this point. Most of them were relatively hands-off, but hired good people and tried to make sure they were set up for success.

OpenAI changes direction on a dime. This was a thing we valued a lot at Segment–it’s much better to do the right thing as you get new information, vs decide to stay the course just because you had a plan. It’s remarkable that a company as large as OpenAI still maintains this ethos–Google clearly doesn’t. The company makes decisions quickly, and when deciding to pursue a direction, goes all in.

There is a ton of scrutiny on the company. Coming from a b2b enterprise background, this was a bit of a shock to me. I’d regularly see news stories broken in the press that hadn’t yet been announced internally. I’d tell people I work at OpenAI and be met with a pre-formed opinion on the company. A number of Twitter users run automated bots which check to see if there are new feature launches coming up.

As a result, OpenAI is a very secretive place. I couldn’t tell anyone what I was working on in detail. There’s a handful of Slack workspaces with various permissions. Revenue and burn numbers are more closely guarded.

OpenAI is also a more serious place than you might expect, in part because the stakes feel really high. On the one hand, there’s the goal of building AGI–which means there is a lot to get right. On the other hand, you’re trying to build a product that hundreds of millions of users leverage for everything from medical advice to therapy. And on the other, other hand, the company is competing in the biggest arena in the world. We’d pay close attention to what was happening at Meta, Google, and Anthropic–and I’m sure they were all doing the same. All of the major world governments are watching this space with a keen interest.

As often as OpenAI is maligned in the press, everyone I met there is actually trying to do the right thing. Given the consumer focus, it is the most visible of the big labs, and consequently it attracts a lot of slander.

That said, you probably shouldn’t view OpenAI as a single monolith. I think of OpenAI as an organization that started like Los Alamos. It was a group of scientists and tinkerers investigating the cutting edge of science. That group happened to accidentally spawn the most viral consumer app in history. And then grew to have ambitions to sell to governments and enterprises. People of different tenure and different parts of the org subsequently have very different goals and viewpoints. The longer you’ve been there, the more you probably view things through the “research lab” or “non-proﬁt for good” lens.

The thing that I appreciate most is that the company “walks the walk” in terms of distributing the benefits of AI. Cutting edge models aren’t reserved for some enterprise-grade tier with an annual agreement. Anybody in the world can jump onto ChatGPT and get an answer, even if they aren’t logged in. There’s an API you can sign up for and use–and most of the models (even if SOTA or proprietary) tend to quickly make it into the API for startups to use. You could imagine an alternate regime that operates very differently from the one we’re in today. OpenAI deserves a ton of credit for this, and it’s still core to the DNA of the company.

Safety is actually more of a thing than you might guess if you read a lot from Zvi or LessWrong. There’s a large number of people working to develop safety systems. Given the nature of OpenAI, I saw more focus on practical risks (hate speech, abuse, manipulating political biases, crafting bio-weapons, self-harm, prompt injection) than theoretical ones (intelligence explosion, power-seeking). That’s not to say that nobody is working on the latter; there are definitely people focusing on the theoretical risks. But from my viewpoint, it’s not the focus. Most of the work which is done isn’t published, and OpenAI really should do more to get it out there.

Unlike other companies which freely hand out their swag at every career fair, OpenAI doesn’t really give much swag (even to new employees). Instead there are ‘drops’ which happen where you can order in-stock items. The first one had so much demand that it brought down the Shopify store. There was an internal post which circulated on how to POST the right JSON payloads and circumvent this.

Nearly everything is a rounding error compared to GPU cost. To give you a sense: a niche feature that was built as part of the Codex product had the same GPU cost footprint as our entire Segment infrastructure (not the same scale as ChatGPT but saw a decent portion of internet trafﬁc).

OpenAI is perhaps the most frighteningly ambitious org I’ve ever seen. You might think that having one of the top consumer apps on the planet might be enough, but there’s a desire to compete across dozens of arenas: the API product, deep research, hardware, coding agents, image generation, and a handful of others which haven’t been announced. It’s a fertile ground for taking ideas and running with them.

The company pays a lot of attention to Twitter. If you tweet something related to OpenAI that goes viral, chances are good someone will read about it and consider it. A friend of mine joked, “this company runs on twitter vibes”. As a consumer company, perhaps that’s not so wrong. There’s certainly still a lot of analytics around usage, user growth, and retention–but the vibes are equally important.

Teams at OpenAI are much more ﬂuid than they might be elsewhere. When launching Codex, we needed some help from a few experienced ChatGPT engineers to hit our launch date. We met with some of the ChatGPT EMs to make the request. The next day we had two badass folks ready to dive in and help. There was no “waiting for quarterly planning” or “re-shufﬂing headcount”. It moved really quickly.

Leadership is quite visible and heavily involved. This might be obvious at a company such as OpenAI, but every exec seemed quite dialed in. You’d see gdb, sama, kw, mark, dane, et al chime in regularly on Slack. There are no absentee leaders.

OpenAI uses a giant monorepo which is ~mostly Python (though there is a growing set of Rust services and a handful of Golang services sprinkled in for things like network proxies). This creates a lot of strange-looking code because there are so many ways you can write Python. You will encounter both libraries designed for scale from 10y Google veterans as well as throwaway Jupyter notebooks from newly-minted PhDs. Pretty much everything operates around FastAPI to create APIs and Pydantic for validation. But there aren’t style guides enforced writ large.

OpenAI runs everything on Azure. What’s funny about this is there are exactly three services that I would consider trustworthy: Azure Kubernetes Service, CosmosDB (Azure’s document storage), and BlobStore. There are no true equivalents of Dynamo, Spanner, Bigtable, BigQuery, Kinesis, or Aurora. It’s a bit rarer to think a lot in auto-scaling units. The IAM implementations tend to be way more limited than what you might get from an AWS. And there’s a strong bias to implement in-house.

When it comes to personnel (at least in eng), there’s a very significant Meta → OpenAI pipeline. In many ways, OpenAI resembles early Meta: a blockbuster consumer app, nascent infra, and a desire to move really quickly. Most of the infra talent I’ve seen brought over from Meta + Instagram has been quite strong.

Put these things together, and you see a lot of core parts of infra that feel reminiscent of Meta. There was an in-house reimplementation of TAO. An effort to consolidate auth identity at the edge. And I’m sure a number of others I don’t know about.

Chat runs really deep. Since ChatGPT took off, a lot of the codebase is structured around the idea of chat messages and conversations. These primitives are so baked in at this point that you ignore them at your own peril. We did deviate from them a bit in Codex (leaning more into learnings from the responses API), but we leveraged a lot of prior art.

Code wins. Rather than having some central architecture or planning committee, decisions are typically made by whichever team plans to do the work. The result is that there’s a strong bias for action, and often a number of duplicate parts of the codebase. I must’ve seen half a dozen libraries for things like queue management or agent loops.

There were a few areas where having a rapidly scaled eng team and not a lot of tooling created issues. sa-server (the backend monolith) was a bit of a dumping ground. CI broke a lot more frequently than you might expect on master. Test cases even running in parallel and factoring in a subset of dependencies could take ~30m to run on GPUs. These weren’t unsolvable problems, but it’s a good reminder that these sorts of problems exist everywhere, and they are likely to get worse when you scale super quickly. To the credit of the internal teams, there’s a lot of focus going into improving this story.

What a big consumer brand looks like. I hadn’t really internalized this until we started working on Codex. Everything is measured in terms of ‘pro subs’. Even for a product like Codex, we thought of the onboarding primarily in terms of individual usage rather than teams. It broke my brain a bit, coming from predominantly a B2B / enterprise background. You flip a switch and you get traffic from day 1.

How large models are trained (at a high-level). There’s a spectrum from “experimentation” to “engineering”. Most ideas start out as small-scale experiments. If the results look promising, they then get incorporated into a bigger run. Experimentation is as much about tweaking the core algorithms as it is tweaking the data mix and carefully studying the results. On the large end, doing a big run almost looks like giant distributed systems engineering. There will be weird edge cases and things you didn’t expect. It’s up to you to debug them.

How to do GPU-math. We had to forecast out the load capacity requirements as part of the Codex launch, and doing this was the first time I’d really spent time benchmarking any GPUs. The gist is that you should actually start from the latency requirements you need (overall latency, # of tokens, time-to-first-token) vs doing bottoms-up analysis on what a GPU can support. Every new model iteration can change the load patterns wildly.
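To make the "start from latency" idea concrete, here is a minimal back-of-the-envelope sketch. Every number and name in it (`required_decode_rate`, `gpus_needed`, the traffic figures, the per-GPU throughput, the 25% headroom) is a made-up assumption for illustration, not a figure from OpenAI.

```python
import math


def required_decode_rate(output_tokens, max_latency_s, ttft_s):
    """Tokens per second each request must sustain after the first token to finish on time."""
    decode_budget_s = max_latency_s - ttft_s
    if decode_budget_s <= 0:
        raise ValueError("latency budget must exceed time-to-first-token")
    return output_tokens / decode_budget_s


def gpus_needed(peak_rps, output_tokens, gpu_tokens_per_s, headroom=0.25):
    """GPU count from aggregate token demand.

    `gpu_tokens_per_s` is assumed to be benchmarked at a batch size that
    still meets the per-request decode rate computed above.
    """
    total_tokens_per_s = peak_rps * output_tokens
    return math.ceil(total_tokens_per_s * (1.0 + headroom) / gpu_tokens_per_s)


# Hypothetical targets: 500 output tokens within 12 s, first token within 2 s.
per_request_rate = required_decode_rate(output_tokens=500, max_latency_s=12.0, ttft_s=2.0)

# Hypothetical load: 10 requests/s at peak, 1,000 tokens/s per GPU at that operating point.
fleet_size = gpus_needed(peak_rps=10, output_tokens=500, gpu_tokens_per_s=1000.0)
```

The order matters: the latency targets fix the per-request decode rate first, and only then is the GPU benchmarked at an operating point that satisfies it. Since a new model iteration can shift that per-GPU number, the forecast has to be redone whenever the model changes.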

How to work in a large Python codebase. Segment was a combination of microservices, and was mostly Golang and TypeScript. We didn’t really have the breadth of code that OpenAI does. I learned a lot about how to scale a codebase based upon the number of developers contributing to it. You have to put in a lot more guardrails for things like “works by default”, “keep master clean”, and “hard to misuse”.

A big part of my last three months at OpenAI was launching Codex. It’s unquestionably one of the highlights of my career.

To set the stage, back in November 2024, OpenAI had set a 2025 goal to launch a coding agent. By February 2025 we had a few internal tools ﬂoating around which were using the models to great effect. And we were feeling the pressure to launch a coding-speciﬁc agent. Clearly the models had gotten to the point where they were getting really useful for coding (seeing the new explosion of vibe-coding tools in the market).

I returned early from my paternity leave to help participate in the Codex launch. A week after I returned we had a (slightly chaotic) merger of two teams, and began a mad-dash sprint. From start (the ﬁrst lines of code written) to ﬁnish, the whole product was built in just 7 weeks.

The Codex sprint was probably the hardest I’ve worked in nearly a decade. Most nights were up until 11 or midnight. Waking up to a newborn at 5:30 every morning. Heading to the ofﬁce again at 7a. Working most weekends. We all pushed hard as a team, because every week counted. It reminded me of being back at YC.

It’s hard to overstate how incredible this level of pace was. I haven’t seen organizations large or small go from an idea to a fully launched + freely available product in such a short window. The scope wasn’t small either; we built a container runtime, made optimizations on repo downloading, ﬁne-tuned a custom model to deal with code edits, handled all manner of git operations, introduced a completely new surface area, enabled internet access, and ended up with a product that was generally a delight to use.

Say what you will, OpenAI still has that launching spirit.

The good news is that the right people can make magic happen. We were a senior team of ~8 engineers, ~4 researchers, 2 designers, 2 GTM and a PM. Had we not had that group, I think we would’ve failed. Nobody needed much direction, but we did need a decent amount of coordination. If you get the chance to work with anyone on the Codex team, know that every one of them is fantastic.

The night before launch, five of us stayed up until 4a trying to deploy the main monolith (a multi-hour affair). Then it was back to the office for the 8a launch announcement and livestream. We turned on the flags, and started to see the traffic pour in. I’ve never seen a product get so much immediate uptick just from appearing in a left-hand sidebar, but that’s the power of ChatGPT.

In terms of the product shape, we settled on a form factor which was entirely asynchronous. Unlike tools like Cursor (at the time, it now supports a similar mode) or Claude Code, we aimed to allow users to kick off tasks and let the agent run in its own environment. Our bet was that in the end-game, users should treat a coding agent like a co-worker: they’d send messages to the agent, it gets some time to do its work, and then it comes back with a PR.

This was a bit of a gamble: we’re in a slightly weird state today where the models are good, but not great. They can work for minutes at a time, but not yet hours. Users have widely varying degrees of trust in the models’ capabilities. And we’re not even clear what the true capabilities of the models are.

Over the long arc of time, I do believe most programming will look more like Codex. In the meantime, it’s going to be interesting to see how all the products unfold.

Codex (maybe unsurprisingly) is really good at working in a large codebase, understanding how to navigate it. The biggest differentiator I’ve seen vs other tools is the ability to kick off multiple tasks at once and compare their output.

I recently saw that there are public numbers comparing the PRs made by different LLM agents. Just looking at the public numbers, Codex has generated 630,000 PRs. That’s about 78k public PRs per engineer in the 53 days since launch (you can make your own guesses about the multiple of private PRs). I’m not sure I’ve ever worked on something so impactful in my life.
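The per-engineer figure can be reproduced with quick arithmetic. The sketch below assumes the "~8 engineers" headcount given for the Codex team elsewhere in the post; the per-day rate is an extra derived number, not one the post states.

```python
public_prs = 630_000      # public PRs attributed to Codex, per the post
engineers = 8             # "~8 engineers" on the Codex team (approximate headcount)
days_since_launch = 53

prs_per_engineer = public_prs / engineers      # 78,750, i.e. "about 78k"
prs_per_day = public_prs / days_since_launch   # derived here; roughly 11.9k per day
```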

Truth be told, I was originally apprehensive about joining OpenAI. I wasn’t sure what it would be like to sacriﬁce my freedom, to have a boss, to be a much smaller piece of a much larger machine. I kept it fairly low-key that I had joined, just in case it wasn’t the right ﬁt.

I did want to get three things from the experience…

* to build intuition for how the models were trained and where the capabilities were going

* to work with and learn from amazing people

In reﬂecting on the year, I think it was one of the best moves I’ve ever made. It’s hard to imagine learning more anywhere else.

If you’re a founder and feeling like your startup really isn’t going anywhere, you should either 1) deeply re-assess how you can take more shots on goal or 2) go join one of the big labs. Right now is an incredible time to build. But it’s also an incredible time to peer into where the future is headed.

As I see it, the path to AGI is a three-horse race right now: OpenAI, Anthropic, and Google. Each of these organizations is going to take a different path to get there based upon its DNA (consumer vs business vs rock-solid-infra + data). Working at any of them will be an eye-opening experience.

Thank you to Leah for being incredibly supportive and taking the majority of the childcare throughout the late nights. Thanks to PW, GDB, and Rizzo for giving me a shot. Thanks to the SA teammates for teaching me the ropes: Andrew, Anup, Bill, Kwaz, Ming, Simon, Tony, and Val. And thanks to the Codex core team for giving me the ride of a lifetime: Albin, AE, Andrey, Bryan, Channing, DavidK, Gabe, Gladstone, Hanson, Joey, Josh, Katy, KevinT, Max, Sabrina, SQ, Tibo, TZ and Will. I’ll never forget this sprint.

...

Read the original on calv.info »

9 203 shares, 8 trendiness

Bedrock — benbridle.com

Bedrock is a compact and portable 8-bit computer system, designed to last forever. Click here to jump straight to the live demos.

Bedrock is a computer system that makes it easy to write useful programs that will last forever. The system is small and quick to learn, with only 32 instructions and 12 devices to remember.

Bedrock isn’t a real computer system that you can pick up and hold in your hands. It’s a speciﬁcation that describes an interface for any kind of computing device, allowing you to write programs that will run on any device without having to worry about the peculiarities of the underlying hardware.

Programs written for Bedrock can run on any computer system, so long as a Bedrock emulator has been implemented for that system. The emulator acts as a thin translation layer between the program and the system, and is designed to be easy to implement on any computer, console, or handheld, no matter how old or limited. The core system can be implemented in a few hours, and the 12 standard devices can be implemented and connected as needed.

Programs can currently run on Windows, Linux, the web, and the Nintendo DS. See the live demonstrations section at the bottom of this page for examples of the kinds of programs that can run on Bedrock.

* Bedrock: Printing a string

A hands-on tutorial that shows how to print a string to the terminal. It assumes no prior knowledge of Bedrock, and almost no prior knowledge of programming in general.

* User manual

The user manual is aimed at people who are learning about or writing programs for the Bedrock system. It contains many examples in the form of runnable code snippets.

* Speciﬁcation

The speciﬁcation is aimed at people who are implementing the system from scratch.

* Examples

Implementations of some popular algorithms as runnable programs.

* Example: Microwave clock

Full editable source code for the microwave clock program.

To write and run a program using Bedrock you’ll need an assembler and an emulator. An assembler is used for converting program source code into a Bedrock program, and an emulator is used for running any Bedrock program on your chosen system:

* bedrock-js

An assembler and emulator that can be embedded in a webpage, written in JavaScript.

* bedrock-pc

An assembler and emulator for Windows and Linux computers, written in Rust.

Bedrock originated as a fork of the Uxn virtual machine and Varvara computing stack, with the aim of improving performance on extremely resource-constrained systems. It has since diverged in many significant ways, most notably by restricting the interfaces between components and by stripping down the assembler and the instruction set. See Bedrock: Differences from Uxn for more details.

The name Bedrock comes from the concept of a ‘bedrock abstraction’ coined by this blog post, though it takes a different approach to the one advocated for in the post. Bedrock achieves habitability not by producing a higher-level instruction set, but by reducing the complexity of the program environment.

The following programs are all running using the bedrock-js emulator, which was thrown together in a few days. There is a lot of room for improvement.

* snake.br (1133 bytes)

A graphics demo showing a coloured stream of letters that follow the mouse cursor.

* clock.br (393 bytes)

A clock in the style of an old microwave oven display.

* sysinfo.br (4918 bytes)

Shows information about the Bedrock implementation being used.

* keyboard.br (2774 bytes)

An on-screen keyboard, designed to be used as the keyboard for Bedrock on the Nintendo DS.

...

Read the original on benbridle.com »

10 186 shares, 33 trendiness

NIST Ion Clock Sets New Record for Most Accurate Clock in the World

There’s a new record holder for the most accurate clock in the world. Researchers at the National Institute of Standards and Technology (NIST) have improved their atomic clock based on a trapped aluminum ion. Part of the latest wave of optical atomic clocks, it can perform timekeeping with 19 decimal places of accuracy.

Optical clocks are typically evaluated on two levels — accuracy (how close a clock comes to measuring the ideal “true” time, also known as systematic uncertainty) and stability (how efﬁciently a clock can measure time, related to statistical uncertainty). This new record in accuracy comes out of 20 years of continuous improvement of the aluminum ion clock. Beyond its world-best accuracy, 41% greater than the previous record, this new clock is also 2.6 times more stable than any other ion clock. Reaching these levels has meant carefully improving every aspect of the clock, from the laser to the trap and the vacuum chamber.

The team published its results in Physical Review Letters.

“It’s exciting to work on the most accurate clock ever,” said Mason Marshall, NIST researcher and ﬁrst author on the paper. “At NIST we get to carry out these long-term plans in precision measurement that can push the ﬁeld of physics and our understanding of the world around us.”

The aluminum ion makes an exceptionally good clock, with an extremely steady, high-frequency “ticking” rate. Its ticks are more stable than those of cesium, which provides the current scientiﬁc definition of the second, said David Hume, the NIST physicist leading the aluminum ion clock project. And the aluminum ion isn’t as sensitive to some environmental conditions, like temperature and magnetic ﬁelds.

But the aluminum ion is kind of shy, Marshall explained. Aluminum is difﬁcult to probe and cool with lasers, both necessary techniques for atomic clocks. The research group therefore paired the aluminum ion with magnesium. Magnesium doesn’t have the beautiful ticking properties of aluminum, but it can be easily controlled with lasers. “This ‘buddy system’ for ions is called quantum logic spectroscopy,” said Willa Arthur-Dworschack, a graduate student on the project. The magnesium ion cools the aluminum ion, slowing it down. It also moves in tandem with its aluminum partner, and the state of the clock can be read out via the magnesium ion’s motion, making this a “quantum logic” clock. Even with this coordination, there was still an array of physical effects to characterize, said Daniel Rodriguez Castillo, also a graduate student on the project.

“It’s a big, complex challenge, because every part of the clock’s design affects the clock,” Rodriguez Castillo said.

One challenge was the design of the trap where the ions are held, which was causing tiny movements of the ions, called excess micromotion, that were lowering the clock’s accuracy. That excess micromotion throws off the ions’ tick rate. Electrical imbalances at opposite sides of the trap were creating extra ﬁelds that disturbed the ions. The team redesigned the trap, putting it on a thicker diamond wafer and modifying the gold coatings on the electrodes to ﬁx the imbalance of the electric ﬁeld. They also made the gold coatings thicker to reduce resistance. Reﬁning the trap this way slowed the ions’ motion and let them “tick” unperturbed.

The vacuum system in which the trap must operate was also causing problems. Hydrogen diffuses out of the steel body of a typical vacuum chamber, Marshall said. Traces of hydrogen gas collided with the ions, interrupting the clock’s operation. That limited how long the experiment could run before the ions needed to be reloaded. The team redesigned the vacuum chamber and had it rebuilt out of titanium, which lowered the background hydrogen gas by 150 times. That meant they could go days without reloading the trap, rather than reloading every 30 minutes.

There was still one more ingredient they needed: a more stable laser to probe the ions and count their ticks. The 2019 version of the clock had to be run for weeks to average out quantum fluctuations (temporary random changes in the ions’ energy state) caused by its laser. To reduce that time, the team turned to NIST’s own Jun Ye, whose lab at JILA (a joint institute of NIST and the University of Colorado Boulder) hosts one of the most stable lasers in the world. Ye’s strontium lattice clock, Strontium 1, held the previous record for accuracy.

This was a team effort. Using fiber links under the street, Ye’s group at JILA sent the ultrastable laser beam 3.6 kilometers (a little more than 2 miles) to the frequency comb in the lab of Tara Fortier at NIST. The frequency comb, which acts as a “ruler for light,” allowed the aluminum ion clock group to compare its laser with Ye’s ultrastable one. This process enabled the Ye lab’s laser to transfer its stability to the aluminum clock laser. With this improvement, the researchers could probe the ions for a full second compared to their previous record of 150 milliseconds. This improves the clock’s stability, reducing the time required to measure down to the 19th decimal place from three weeks to a day and a half.
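As a rough illustration of the averaging-time figures above: if the clock’s instability is dominated by white frequency noise (an assumption, not something the article states), the fractional instability falls as 1/√τ with averaging time τ. The time needed to reach a fixed precision then scales as the square of the one-second instability. The sketch below uses hypothetical helper names and only the article’s two figures, three weeks and a day and a half, to estimate the stability gain they would imply.

```python
import math

DAY = 86400.0  # seconds in one day


def averaging_time(sigma_1s: float, target: float) -> float:
    """Seconds of averaging needed to reach `target` fractional instability.

    Assumes white frequency noise, i.e. sigma(tau) = sigma_1s / sqrt(tau).
    This model is an assumption made for illustration only.
    """
    return (sigma_1s / target) ** 2


def implied_stability_gain(tau_old: float, tau_new: float) -> float:
    """Factor by which the one-second instability must have dropped
    for the averaging time to shrink from tau_old to tau_new."""
    return math.sqrt(tau_old / tau_new)


# Figures from the article: three weeks before, a day and a half after.
gain = implied_stability_gain(21 * DAY, 1.5 * DAY)
print(f"implied stability gain: {gain:.2f}x")
```

Under that assumption, the 14-fold reduction in averaging time corresponds to roughly a 3.7-fold improvement in stability. The real clock’s noise budget may not follow this simple model.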

With this new record, the aluminum ion clock contributes to the international effort to redefine the second to much greater levels of accuracy than before, facilitating new scientific and technological advances. The upgrades also drastically improve its use as a quantum logic testbed, exploring new concepts in quantum physics and building the tools needed for quantum technology, an exciting prospect for those involved. More importantly, by cutting down the averaging time from weeks to days, this clock can be a tool to make new measurements of Earth’s geodesy and explore physics beyond the Standard Model, such as the possibility that the fundamental constants of nature are not fixed values but actually changing.

“With this platform, we’re poised to explore new clock architectures — like scaling up the number of clock ions and even entangling them — further improving our measurement capabilities,” Arthur-Dworschack said.

Paper: Mason C. Marshall, Daniel A. Rodriguez Castillo, Willa J. Arthur-Dworschack, Alexander Aeppli, Kyungtae Kim, Dahyeon Lee, William Warfield, Joost Hinrichs, Nicholas V. Nardelli, Tara M. Fortier, Jun Ye, David R. Leibrandt and David B. Hume. High-stability single-ion clock with 5.5×10⁻¹⁹ systematic uncertainty. Physical Review Letters. Published online July 14, 2025. DOI: 10.1103/hb3c-dk28

...

Read the original on www.nist.gov »
