10 interesting stories served every morning and every evening.
Finally, an efficient blocker. Easy on CPU and memory. IMPORTANT: uBlock Origin is completely unrelated to the site “ublock.org”.
uBlock Origin is not an “ad blocker”, it’s a wide-spectrum content blocker with CPU and memory efficiency as a primary feature.
Out of the box, these lists of filters are loaded and enforced:
- uBlock Origin filter lists
- EasyList (ads)
- EasyPrivacy (tracking)
- Peter Lowe’s Ad server list (ads and tracking)
- Online Malicious URL Blocklist
More lists are available for you to select if you wish:
- Annoyances (cookie warnings, overlays, etc.)
- hosts-based lists
- And many others
Additionally, you can point-and-click to block JavaScript locally or globally, create your own global or local rules to override entries from filter lists, and many more advanced features.
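For illustration, filter-list entries and “My rules” entries in uBlock Origin look roughly like the following (the hostnames are placeholders):
- ||ads.example.com^ (a static network filter: block requests to ads.example.com)
- example.com##.ad-banner (a cosmetic filter: hide matching elements on example.com)
- * * 3p-script block (a dynamic rule: block all third-party scripts)
- no-scripting: example.com true (a per-site switch: disable JavaScript on example.com)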
Free. Open source with public license (GPLv3). For users, by users.
If ever you really do want to contribute something, think about the people working hard to maintain the filter lists you are using, which are made available for use by all, free of charge.
Documentation: https://github.com/gorhill/uBlock#ublock-origin
Project change log: https://github.com/gorhill/uBlock/releases
Contributors @ Github: https://github.com/gorhill/uBlock/graphs/contributors
Contributors @ Crowdin: https://crowdin.net/project/ublock
...
Read the original on chromewebstore.google.com »
Today I’m excited to announce the next steps we’re taking to radically improve TypeScript performance.
The core value proposition of TypeScript is an excellent developer experience. As your codebase grows, so does the value of TypeScript itself, but in many cases TypeScript has not been able to scale up to the very largest codebases. Developers working in large projects can experience long load and check times, and have to choose between reasonable editor startup time or getting a complete view of their source code. We know developers love when they can rename variables with confidence, find all references to a particular function, easily navigate their codebase, and do all of those things without delay. New experiences powered by AI benefit from large windows of semantic information that need to be available with tighter latency constraints. We also want fast command-line builds to validate that your entire codebase is in good shape.
To meet those goals, we’ve begun work on a native port of the TypeScript compiler and tools. The native implementation will drastically improve editor startup, reduce most build times by 10x, and substantially reduce memory usage. By porting the current codebase, we expect to be able to preview a native implementation of tsc capable of command-line typechecking by mid-2025, with a feature-complete solution for project builds and a language service by the end of the year.
You can build and run the Go code from our new working repo, which is offered under the same license as the existing TypeScript codebase. Check the README for instructions on how to build and run tsc and the language server, and to see a summary of what’s implemented so far. We’ll be posting regular updates as new functionality becomes available for testing.
Our native implementation is already capable of loading many popular TypeScript projects, including the TypeScript compiler itself. Here are times to run tsc on some popular codebases on GitHub of varying sizes:
While we’re not yet feature-complete, these numbers are representative of the order of magnitude performance improvement you’ll see checking most codebases.
We’re incredibly excited about the opportunities that this massive speed boost creates. Features that once seemed out of reach are now within grasp. This native port will be able to provide instant, comprehensive error listings across an entire project, support more advanced refactorings, and enable deeper insights that were previously too expensive to compute. This new foundation goes beyond today’s developer experience and will enable the next generation of AI tools to enhance development, powering new tools that will learn, adapt, and improve the coding experience.
Most developer time is spent in editors, and it’s where performance is most important. We want editors to load large projects quickly, and respond quickly in all situations. Modern editors like Visual Studio and Visual Studio Code have excellent performance as long as the underlying language services are also fast. With our native implementation, we’ll be able to provide incredibly fast editor experiences.
Again using the Visual Studio Code codebase as a benchmark, the current time to load the entire project in the editor on a fast computer is about 9.6 seconds. This drops down to about 1.2 seconds with the native language service, an 8x improvement in project load time in editor scenarios. What this translates to is a faster working experience from the time you open your editor to your first keystroke in any TypeScript codebase. We expect all projects to see this level of improvement in load time.
Overall memory usage also appears to be roughly half of the current implementation, though we haven’t actively investigated optimizing this yet and expect to realize further improvements. Editor responsiveness for all language service operations (including completion lists, quick info, go to definition, and find all references) will also see significant speed gains. We’ll also be moving to the Language Server Protocol (LSP), a longstanding infrastructural work item to better align our implementation with other languages.
Our most recent TypeScript release was TypeScript 5.8, with TypeScript 5.9 coming soon. The JS-based codebase will continue development into the 6.x series, and TypeScript 6.0 will introduce some deprecations and breaking changes to align with the upcoming native codebase.
When the native codebase has reached sufficient parity with the current TypeScript, we’ll be releasing it as TypeScript 7.0. This is still in development and we’ll be announcing stability and feature milestones as they occur.
For the sake of clarity, we’ll refer to them simply as TypeScript 6 (JS) and TypeScript 7 (native), since this will be the nomenclature for the foreseeable future. You may also see us refer to “Strada” (the original TypeScript codename) and “Corsa” (the codename for this effort) in internal discussions or code comments.
While some projects may be able to switch to TypeScript 7 upon release, others may depend on certain API features, legacy configurations, or other constraints that necessitate using TypeScript 6. Recognizing TypeScript’s critical role in the JS development ecosystem, we’ll still be maintaining the JS codebase in the 6.x line until TypeScript 7+ reaches sufficient maturity and adoption.
Our long-term goal is to keep these versions as closely aligned as possible so that you can upgrade to TypeScript 7 as soon as it meets your requirements, or fall back to TypeScript 6 if necessary.
In the coming months we’ll be sharing more about this exciting effort, including deeper looks into performance, a new compiler API, LSP, and more. We’ve written up some FAQs on the GitHub repo to address some questions we expect you might have. We also invite you to join us for an AMA at the TypeScript Community Discord at 10 AM PDT | 5 PM UTC on March 13th.
A 10x performance improvement represents a massive leap in the TypeScript and JavaScript development experience, so we hope you are as enthusiastic as we are for this effort!
...
Read the original on devblogs.microsoft.com »
EFF is deeply saddened to learn of the passing of Mark Klein, a bona fide hero who risked civil liability and criminal prosecution to help expose a massive spying program that violated the rights of millions of Americans.
Mark didn’t set out to change the world. For 22 years, he was a telecommunications technician for AT&T, most of that in San Francisco. But he always had a strong sense of right and wrong and a commitment to privacy.
When the New York Times reported in late 2005 that the NSA was engaging in spying inside the U.S., Mark realized that he had witnessed how it was happening. He also realized that the President was not telling Americans the truth about the program. And, though newly retired, he knew that he had to do something. He showed up at EFF’s front door in early 2006 with a simple question: “Do you folks care about privacy?”
We did. And what Mark told us changed everything. Through his work, Mark had learned that the National Security Agency (NSA) had installed a secret, secure room at AT&T’s central office in San Francisco, called Room 641A. Mark was assigned to connect circuits carrying Internet data to optical “splitters” that sat just outside of the secret NSA room but were hardwired into it. Those splitters—as well as similar ones in cities around the U.S.—made a copy of all data going through those circuits and delivered it into the secret room.
A photo of the NSA-controlled ‘secret room’ in the AT&T facility in San Francisco (Credit: Mark Klein)
Mark not only saw how it works, he had the documents to prove it. He brought us over a hundred pages of authenticated AT&T schematic diagrams and tables. Mark also shared this information with major media outlets, numerous Congressional staffers, and at least two senators personally. One, Senator Chris Dodd, took the floor of the Senate to acknowledge Mark as the great American hero he was.
We used Mark’s evidence to bring two lawsuits against the NSA spying that he uncovered. The first was Hepting v. AT&T and the second was Jewel v. NSA. Mark also came with us to Washington D.C. to push for an end to the spying and demand accountability for it happening in secret for so many years. He wrote an account of his experience called Wiring Up the Big Brother Machine . . . And Fighting It.
Mark stood up and told the truth at great personal risk to himself and his family. AT&T threatened to sue him, although it wisely decided not to do so. While we were able to use his evidence to make some change, both EFF and Mark were ultimately let down by Congress and the Courts, which have refused to take the steps necessary to end the mass spying even after Edward Snowden provided even more evidence of it in 2013.
But Mark certainly inspired all of us at EFF, and he helped inspire and inform hundreds of thousands of ordinary Americans to demand an end to illegal mass surveillance. While we have not yet seen the success in ending the spying that we have all hoped for, his bravery has helped usher in numerous reforms.
And the fight is not over. The law, called Section 702, that now authorizes the continued surveillance that Mark first revealed, expires in early 2026. EFF and others will continue to push for continued reforms and, ultimately, for the illegal spying to end entirely.
Mark’s legacy lives on in our continuing fights to reform surveillance and honor the Fourth Amendment’s promise of protecting personal privacy. We are forever grateful to him for having the courage to stand up and will do our best to honor that legacy by continuing the fight.
...
Read the original on www.eff.org »
At Google DeepMind, we’ve been making progress in how our Gemini models solve complex problems through multimodal reasoning across text, images, audio and video. So far, however, those abilities have been largely confined to the digital realm. In order for AI to be useful and helpful to people in the physical realm, it has to demonstrate “embodied” reasoning — the humanlike ability to comprehend and react to the world around us — as well as safely take action to get things done.
Today, we are introducing two new AI models, based on Gemini 2.0, which lay the foundation for a new generation of helpful robots.
The first is Gemini Robotics, an advanced vision-language-action (VLA) model that was built on Gemini 2.0 with the addition of physical actions as a new output modality for the purpose of directly controlling robots. The second is Gemini Robotics-ER, a Gemini model with advanced spatial understanding, enabling roboticists to run their own programs using Gemini’s embodied reasoning (ER) abilities.
Both of these models enable a variety of robots to perform a wider range of real-world tasks than ever before. As part of our efforts, we’re partnering with Apptronik to build the next generation of humanoid robots with Gemini 2.0. We’re also working with a select number of trusted testers to guide the future of Gemini Robotics-ER.
We look forward to exploring our models’ capabilities and continuing to develop them on the path to real-world applications.
To be useful and helpful to people, AI models for robotics need three principal qualities: they have to be general, meaning they’re able to adapt to different situations; they have to be interactive, meaning they can understand and respond quickly to instructions or changes in their environment; and they have to be dexterous, meaning they can do the kinds of things people generally can do with their hands and fingers, like carefully manipulate objects.
While our previous work demonstrated progress in these areas, Gemini Robotics represents a substantial step in performance on all three axes, getting us closer to truly general purpose robots.
Gemini Robotics leverages Gemini’s world understanding to generalize to novel situations and solve a wide variety of tasks out of the box, including tasks it has never seen before in training. Gemini Robotics is also adept at dealing with new objects, diverse instructions, and new environments. In our tech report, we show that on average, Gemini Robotics more than doubles performance on a comprehensive generalization benchmark compared to other state-of-the-art vision-language-action models.
To operate in our dynamic, physical world, robots must be able to seamlessly interact with people and their surrounding environment, and adapt to changes on the fly.
Because it’s built on a foundation of Gemini 2.0, Gemini Robotics is intuitively interactive. It taps into Gemini’s advanced language understanding capabilities and can understand and respond to commands phrased in everyday, conversational language and in different languages.
It can understand and respond to a much broader set of natural language instructions than our previous models, adapting its behavior to your input. It also continuously monitors its surroundings, detects changes to its environment or instructions, and adjusts its actions accordingly. This kind of control, or “steerability,” can better help people collaborate with robot assistants in a range of settings, from home to the workplace.
If an object slips from its grasp, or someone moves an item around, Gemini Robotics quickly replans and carries on — a crucial ability for robots in the real world, where surprises are the norm.
The third key pillar for building a helpful robot is acting with dexterity. Many everyday tasks that humans perform effortlessly require surprisingly fine motor skills and are still too difficult for robots. By contrast, Gemini Robotics can tackle extremely complex, multi-step tasks that require precise manipulation such as origami folding or packing a snack into a Ziploc bag.
Finally, because robots come in all shapes and sizes, Gemini Robotics was also designed to easily adapt to different robot types. We trained the model primarily on data from the bi-arm robotic platform, ALOHA 2, but we also demonstrated that it could control a bi-arm platform, based on the Franka arms used in many academic labs. Gemini Robotics can even be specialized for more complex embodiments, such as the humanoid Apollo robot developed by Apptronik, with the goal of completing real world tasks.
Gemini Robotics works on different kinds of robots
Alongside Gemini Robotics, we’re introducing an advanced vision-language model called Gemini Robotics-ER (short for “embodied reasoning”). This model enhances Gemini’s understanding of the world in ways necessary for robotics, focusing especially on spatial reasoning, and allows roboticists to connect it with their existing low-level controllers.
Gemini Robotics-ER improves Gemini 2.0’s existing abilities like pointing and 3D detection by a large margin. Combining spatial reasoning and Gemini’s coding abilities, Gemini Robotics-ER can instantiate entirely new capabilities on the fly. For example, when shown a coffee mug, the model can intuit an appropriate two-finger grasp for picking it up by the handle and a safe trajectory for approaching it.
Gemini Robotics-ER can perform all the steps necessary to control a robot right out of the box, including perception, state estimation, spatial understanding, planning and code generation. In such an end-to-end setting, the model achieves 2x-3x the success rate of Gemini 2.0. And where code generation is not sufficient, Gemini Robotics-ER can even tap into the power of in-context learning, following the patterns of a handful of human demonstrations to provide a solution.
Gemini Robotics-ER excels at embodied reasoning capabilities including detecting objects and pointing at object parts, finding corresponding points and detecting objects in 3D.
As we explore the continuing potential of AI and robotics, we’re taking a layered, holistic approach to addressing safety in our research, from low-level motor control to high-level semantic understanding.
The physical safety of robots and the people around them is a longstanding, foundational concern in the science of robotics. That’s why roboticists rely on classic safety measures such as avoiding collisions, limiting the magnitude of contact forces, and ensuring the dynamic stability of mobile robots. Gemini Robotics-ER can be interfaced with these ‘low-level’ safety-critical controllers, specific to each particular embodiment. Building on Gemini’s core safety features, we enable Gemini Robotics-ER models to understand whether or not a potential action is safe to perform in a given context, and to generate appropriate responses.
To advance robotics safety research across academia and industry, we are also releasing a new dataset to evaluate and improve semantic safety in embodied AI and robotics. In previous work, we showed how a Robot Constitution inspired by Isaac Asimov’s Three Laws of Robotics could help prompt an LLM to select safer tasks for robots. We have since developed a framework to automatically generate data-driven constitutions — rules expressed directly in natural language — to steer a robot’s behavior. This framework would allow people to create, modify and apply constitutions to develop robots that are safer and more aligned with human values. Finally, the new ASIMOV dataset will help researchers to rigorously measure the safety implications of robotic actions in real-world scenarios.
To further assess the societal implications of our work, we collaborate with experts in our Responsible Development and Innovation team as well as our Responsibility and Safety Council, an internal review group committed to ensuring we develop AI applications responsibly. We also consult with external specialists on particular challenges and opportunities presented by embodied AI in robotics applications.
In addition to our partnership with Apptronik, our Gemini Robotics-ER model is also available to trusted testers including Agile Robots, Agility Robotics, Boston Dynamics, and Enchanted Tools. We look forward to exploring our models’ capabilities and continuing to develop AI for the next generation of more helpful robots.
This work was developed by the Gemini Robotics team. For a full list of authors and acknowledgements please view our technical report.
...
Read the original on deepmind.google »
The DuckDB project was built to make it simple to leverage modern database technology. DuckDB can be used from many popular languages and runs on a wide variety of platforms. The included Command Line Interface (CLI) provides a convenient way to interactively run SQL queries from a terminal window, and several third-party tools offer more sophisticated UIs.
The DuckDB CLI provides advanced features like interactive multi-line editing, auto-complete, and progress indicators. However, it can be cumbersome for working with lengthy SQL queries, and its data exploration tools are limited. Many of the available third-party UIs are great, but selecting, installing, and configuring one is not straightforward. Using DuckDB through a UI should be as simple as using the CLI. And now it is!
The DuckDB UI is the result of a collaboration between DuckDB Labs and MotherDuck and is shipped as part of the ui extension.
Starting with DuckDB v1.2.1, a full-featured local web user interface is available out-of-the-box! You can start it from the terminal by launching the DuckDB CLI client with the -ui argument:
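For example (assuming the duckdb executable is on your PATH):

    duckdb -ui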
You can also run the following SQL command from a DuckDB client (e.g., CLI, Python, Java, etc.):
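    CALL start_ui();

This is the call exposed by the ui extension; like the -ui flag, it starts the local server and opens the UI in your default browser.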
Both of these approaches install the ui extension (if it isn’t installed yet), then open the DuckDB UI in your browser:
The DuckDB UI uses interactive notebooks to define SQL scripts and show the results of queries. However, its capabilities go far beyond this. Let’s go over its main features.
The DuckDB UI runs all your queries locally: your queries and data never leave your computer. If you would like to use MotherDuck through the UI, you have to opt-in explicitly.
Your attached databases are shown on the left. This list includes in-memory databases plus any files and URLs you’ve loaded. You can explore tables and views by expanding databases and schemas.
Click on a table or view to show a summary below. The UI shows the number of rows, the name and type of each column, and a profile of the data in each column.
Select a column to see a more detailed summary of its data. You can use the “Preview data” button near the top right to inspect the first 100 rows. You can also find the SQL definition of the table or view here.
You can organize your work into named notebooks. Each cell of the notebook can execute one or more SQL statements. The UI supports syntax highlighting and autocomplete to assist with writing your queries.
You can run the whole cell, or just a selection, then sort, filter, or further transform the results using the provided controls.
The right panel contains the Column Explorer, which shows a summary of your results. You can dive into each column to gain insights.
If you would like to connect to MotherDuck, you can sign in to persist files and tables to a cloud data warehouse crafted for using DuckDB at scale and for sharing data with your team.
The DuckDB UI is under active development. Expect additions and improvements!
Like the DuckDB CLI, the DuckDB UI creates some files in the .duckdb directory in your home directory. The UI puts its files in a sub-directory, extension_data/ui:
* Your notebooks and some other state are stored in a DuckDB database, ui.db.
* When you export data to the clipboard or a file (using the controls below the results), some tiny intermediate files (e.g. ui_export.csv) are generated.
Your data is cleared from these files after the export is completed, but some near-empty files remain, one per file type.
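Since your notebooks live in an ordinary DuckDB database, you can inspect ui.db with the CLI if you’re curious; opening it read-only avoids interfering with a running UI:

    duckdb -readonly ~/.duckdb/extension_data/ui/ui.db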
Support for the UI is implemented in a DuckDB extension. The extension embeds a localhost HTTP server, which serves the UI browser application, and also exposes an API for communication with DuckDB. In this way, the UI leverages the native DuckDB instance from which it was started, enabling full access to your local memory, compute, and file system.
Results are returned in an efficient binary form closely matching DuckDB’s in-memory representation (DataChunk).
Server-sent events enable prompt notification of updates such as attaching databases. These techniques and others make for a low-latency experience that keeps you in your flow.
See the UI extension documentation for more details.
In this blog post, we presented the new DuckDB UI, a powerful web interface for DuckDB.
The DuckDB UI shares many of its design principles with the DuckDB database. It’s simple, fast, feature-rich, and portable, and runs locally on your computer. The DuckDB UI extension is also open source: visit the duckdb/duckdb-ui repository if you want to dive deeper into the extension’s code.
The repository does not contain the source code for the frontend, which is currently not available as open-source. Releasing it as open-source is under consideration.
For help or to share feedback, please file an issue or join the #ui channel in either the DuckDB Discord or the MotherDuck Community Slack.
...
Read the original on duckdb.org »
It is as if you were on your phone
Look at you! On your phone! But you’ve got a secret! And you won’t tell! You’re not on your phone! It is only as if you were on your phone! You’re just pretending to be on your phone! On your phone!
It is as if you were on your phone is an almost speculative game about an incredibly near future in which we’re all simultaneously under significant pressure to be on our phones all the time, but also to not be on our phones all the time. Our fingers want to touch the screen, our eyes want to watch the surface, our brains want to be occupied efficiently and always. But it’s also exhausting liking photos, swiping profiles, watching short-form video, and everything else we’re always doing. It is as if you were on your phone presents an alternative: pretend to be on your phone so that you pass as human, but actually do essentially nothing instead. Follow the prompts and be free.
It is as if you were on your phone was created using p5 along with Hammer.js for touch gestures.
Iwan Morris. It’s As If You Were On Your Phone is a bizarre new introspective desktop mobile release. Pocket Gamer. 6 March 2025.
Jason Kottke. A game called “It is as if you were on your phone” is designed to make you look like you’re on your phone.. Kottke.org. 7 March 2025.
Dan Q. It is as if you were on your phone. Dan Q (Blog). 10 March 2025. (This guy recorded a video of him playing which I love!)
de Rochefort, Simone. Finally, I can pretend I’m on my phone - And it’s giving me an existential crisis!. Polygon. 10 March 2025.
Read the Process Documentation for todos and design explorations
Read the Commit History for detailed, moment-to-moment insights into the development process
Look at the Code Repository for source code etc.
It is as if you were on your phone is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.
...
Read the original on pippinbarr.com »
(Bloomberg) — OpenAI has asked the Trump administration to help shield artificial intelligence companies from a growing number of proposed state regulations if they voluntarily share their models with the federal government.
In a 15-page set of policy suggestions released on Thursday, the ChatGPT maker argued that the hundreds of AI-related bills currently pending across the US risk undercutting America’s technological progress at a time when it faces renewed competition from China. OpenAI said the administration should consider providing some relief for AI companies big and small from state rules — if and when enacted — in exchange for voluntary access to models.
The recommendation was one of several included in OpenAI’s response to a request for public input issued by the White House Office of Science and Technology Policy in February as the administration drafts a new policy to ensure US dominance in AI. President Donald Trump previously rescinded the Biden administration’s sprawling executive order on AI and tasked the science office with developing an AI Action Plan by July.
To date, there has been a notable absence of federal legislation governing the AI sector. The Trump administration has generally signaled its intention to take a hands-off approach to regulating the technology. But many states are actively weighing new measures on everything from deepfakes to bias in AI systems.
Chris Lehane, OpenAI’s vice president of global affairs, said in an interview that the US AI Safety Institute — a key government group focused on AI — could act as the main point of contact between the federal government and the private sector. If companies work with the group voluntarily to review models, the government could provide them “with liability protections including preemption from state based regulations that focus on frontier model security,” according to the proposal.
“Part of the incentive for doing that ought to be that you don’t have to go through the state stuff, which is not going to be anywhere near as good as what the federal level would be,” Lehane said.
In its policy recommendations, OpenAI also reiterated its call for the government to take steps to support AI infrastructure investments and called for copyright reform, arguing that America’s fair use doctrine is critical to maintaining AI leadership. OpenAI and other AI developers have faced numerous copyright lawsuits over the data used to build their models.
...
Read the original on finance.yahoo.com »
Large Language Models (LLMs) are rapidly saturating existing benchmarks, necessitating new open-ended evaluations. We introduce the Factorio Learning Environment (FLE), based on the game of Factorio, that tests agents in long-term planning, program synthesis, and resource optimization.
FLE provides open-ended and exponentially scaling challenges - from basic automation to complex factories processing millions of resource units per second. We provide two settings:
- Open-play, with the unbounded task of building the largest factory from scratch on a procedurally generated map.
- Lab-play, with structured, goal-oriented tasks that have clear completion criteria and provided resources.
We demonstrate across both settings that models still lack strong spatial reasoning. In lab-play, we find that LLMs exhibit promising short-horizon skills, yet are unable to operate effectively in constrained environments, reflecting limitations in error analysis. In open-play, while LLMs discover automation strategies that improve growth (e.g. electric-powered drilling), they fail to achieve complex automation (e.g. electronic-circuit manufacturing).
Large Language Models (LLMs) have demonstrated remarkable capabilities at solving complex question-answer (QA) problems, saturating benchmarks in factual recollection, reasoning and code generation. Benchmark saturation presents a critical challenge for the AI research community: how do we meaningfully evaluate and differentiate increasingly capable models?
We introduce the Factorio Learning Environment (FLE): a novel framework built upon the game of Factorio that addresses this challenge by enabling unbounded agent evaluation. FLE provides the infrastructure, API, and metrics for assessing frontier LLM agents in code generation, spatial reasoning and long-term planning. In this environment, agents must navigate rapidly scaling challenges—from basic resource extraction producing ~30 units/minute to sophisticated production chains processing millions of units/second. This dramatic growth in complexity, driven by geometric increases in research costs and the combinatorial expansion of interdependent production chains, creates natural curricula for evaluating increasingly capable agents.
Within FLE, we define two complementary evaluation protocols: (1) lab-play with structured, goal-oriented tasks that have clear completion criteria, allowing targeted assessment of specific capabilities, and (2) open-play with no predetermined end-state, supporting truly unbounded evaluation of an agent’s ability to autonomously set and achieve increasingly complex goals.
Agents in FLE aim to optimise factories programmatically. Left: Agents aim to create increasingly efficient factories, advancing through technological tiers to produce more resources per second. Middle: We provide a Python API to Factorio which enables direct interaction with the environment through code. Right: Agents submit programs to the game server and receive rich feedback, enabling them to refine their strategies through an iterative process of exploration and refinement.
Agents develop policies through an interactive feedback loop.
Using 23 core API tools, agents compose programs that interact with the environment and observe the results through stdout and stderr streams.
The Python namespace allows agents to store variables and define functions for later use, enabling increasingly sophisticated strategies as experience grows.
This approach mirrors the way human programmers learn - through iteration, debugging, and refinement based on direct feedback.
Agent programs yield both a Production Score (PS) representing the economic value of all items produced, and milestones that reflect technological advancements.
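To make this loop concrete, a single agent step in an FLE-like setup might submit a short program along the lines of the sketch below. The tool names and prototypes (nearest, move_to, place_entity, insert_item, inspect_inventory, sleep) are assumed for illustration and would be supplied by the environment’s namespace rather than imported; they may not match the actual FLE API.

    # Hypothetical agent-submitted program; tool names are illustrative only.
    iron_patch = nearest(Resource.IronOre)                  # locate the closest iron ore patch
    move_to(iron_patch)                                     # walk the agent to it
    drill = place_entity(Prototype.BurnerMiningDrill, position=iron_patch)
    furnace = place_entity(Prototype.StoneFurnace, position=drill.drop_position)
    insert_item(Prototype.Coal, drill, quantity=5)          # fuel both machines
    insert_item(Prototype.Coal, furnace, quantity=5)
    sleep(30)                                               # let the mini-factory run
    print(inspect_inventory(furnace))                       # observation arrives via stdout

The program’s stdout (here, the furnace inventory) and any errors become the observation for the next step, and variables like drill and furnace persist in the namespace for later programs to reuse.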
To systematically evaluate agent capabilities in the Factorio Learning Environment, we introduce two complementary experimental settings that test different aspects of planning, automation, and resource management; namely open-play and lab-play.
We evaluate six frontier language models across both settings: Claude 3.5-Sonnet, GPT-4o, GPT-4o-Mini, Deepseek-v3, Gemini-2-Flash, and Llama-3.3-70B-Instruct. Each model interacts with the environment through a consistent prompting approach, receiving the API schema, a guide describing common patterns, and memory of past actions and observations.
Agents begin in a procedurally generated world with instruction to “build the largest possible factory”. This setting tests agents’ ability to set appropriate goals, balance short-term production against long-term research, and navigate the complex tech tree and game map without external guidance.
Agent capabilities are clearly differentiated by their production scores in open-play.
Left: By plotting Production Score (PS) against steps on a log/log scale, we can observe distinct performance trajectories for each model.
More capable models not only achieve higher scores but demonstrate steeper growth curves, indicating better long-term planning.
Milestone annotations show when the median agent first created key entities, revealing how quickly each model progresses through the tech tree.
Right: Final rewards reveal how weaker models struggle to advance when complex automation and logistics become necessary.
Production strategies reveal differences in agent planning and capabilities.
We track how various models produce items with multiple antecedent ingredients in open-play, showing not just what they build but how they approach factory design.
Claude 3.5-Sonnet demonstrates sophisticated strategy by immediately beginning complex crafting and investing in research and automation, ultimately unlocking electric-mining-drills around step 3k - a decision that boosts iron-plate production by 50% thereafter.
In contrast, less advanced models like GPT-4o-Mini produce minimal quantities of multi-ingredient items, revealing limitations in planning horizons.
Interestingly, Deepseek showed stronger capabilities in lab-play than open-play, suggesting that its general capabilities exceed its objective-setting abilities in open-ended environments.
Agents are provided with resources and given a time-limit to achieve an objective. We task agents to build production lines of 24 distinct target entities of increasing complexity, starting from a single resource mine requiring at most 2 machines (making iron-ore) to a late game entity requiring the coordination of close to 100 machines (making utility-science-pack). The target entities cover items from early to late game, requiring agents to use a wide variety of machines present in Factorio (drills, furnaces, assembling machines, oil refineries, chemical plants). As the task difficulty naturally increases with resource requirements, this provides a measure of the complexity that agents are capable of creating in a limited number of steps. All tasks provide the agent with sufficient resources to complete the task with all technologies unlocked.
Item production complexity creates a natural difficulty gradient for agent evaluation. Top: We measure task success rates across the first 8 complexity levels, revealing a clear decline as target entity crafting complexity increases. Even the most capable models struggle with coordinating more than six machines when producing items with three or more ingredients. Bottom: Production progress over time shows a pattern of initial rapid advancement followed by stagnation or regression. This reveals a key limitation in current agents’ abilities: they often break existing functional structures when attempting to scale production or add new factory sections. The high variance in task progress across runs further demonstrates the challenge of consistent performance in complex automation tasks.
Plastic bar manufacturing is the most challenging task successfully completed in lab-play.
The factory consists of an electricity steam generator (top-left), a coal mine with storage buffer (top), a crude-oil to petroleum gas pipeline (bottom) and a chemical plant (bottom-right).
The chemical plant creates plastic bars using the coal and petroleum gas as inputs. By themselves, the cumulative raw resources generate a production score of 224.
With this specific layout, the factory creates 40 plastic bars per 60 in-game seconds, for a production score of 352.
This factory was created by Claude Sonnet 3.5.
Even the strongest model (Claude) only completed 7/24 tasks in lab-play, illustrating substantial room for improvement in this benchmark.
Our experiments revealed several key patterns that highlight both the capabilities and limitations of current AI agents when faced with open-ended industrial challenges:
Models with stronger coding abilities (Claude 3.5-Sonnet, GPT-4o) achieved higher Production Scores and completed more lab tasks. Claude outperformed others with a PS of 293,206 and 28 milestones, progressing beyond early-game resource extraction.
Only Claude consistently invested resources in researching new technologies, despite their importance for long-term progression. After deploying electric mining drills at step 3k, Claude’s PS grew by 50% (from 200k to 300k), demonstrating the value of strategic investment.
In open-play, agents frequently pursue short-sighted objectives — like Gemini-2.0 manually crafting 300+ wooden chests over 100 steps — rather than investing in research or scaling existing production. This reveals a telling discrepancy: while Gemini-2 and Deepseek demonstrate early-game automation capabilities in structured lab-play, they rarely attempt to create cohesive factories during open-ended exploration, resulting in poorer overall performance.
All models exhibited limitations in spatial planning when constructing multi-section factories. Common failures included placing entities too close together, not allocating space for connections, or incorrect inserter placement - issues that severely impacted performance in complex tasks requiring coordination of multiple production lines.
Models frequently become trapped in repetitive error patterns, attempting the same invalid operations repeatedly rather than exploring alternative solutions. For instance, GPT-4o repeated the same API method incorrectly for 78 consecutive steps despite identical error messages.
Models exhibited distinct coding approaches: Claude favored a REPL style with extensive print statements (43.3% of code lines) but few assertions (2.0%), while GPT-4o used a defensive style with more validation checks (12.8% assertions) and fewer prints (10.3%).
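As a toy illustration of these two styles (again using assumed tool names, not actual agent output):

    # REPL style: act, then print intermediate state and rely on feedback.
    furnace = place_entity(Prototype.StoneFurnace, position=Position(x=0, y=0))
    print(furnace)
    print(inspect_inventory(furnace))

    # Defensive style: validate assumptions with assertions before moving on.
    furnace = place_entity(Prototype.StoneFurnace, position=Position(x=0, y=0))
    assert furnace is not None, "furnace placement failed"
    assert inspect_inventory(furnace).get(Prototype.IronPlate, 0) > 0, "no plates yet"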
With thanks to Jack Kleeman and Minqi Jiang for their invaluable help with setting up compute resources and advice during the inception of this project. Thanks to Wube and the Factorio team for developing such a stimulating game.
...
Read the original on jackhopkins.github.io »
Social media that’s only open from 7:39pm to 10:39pm EST.
Create an account now and we’ll email you when seven39 opens!
Because social media is better when we’re all online together.
No endless scrolling. No FOMO. Just 3 hours of fun every evening.
The domain was available.
...
Read the original on www.seven39.com »
10HN is also available as an iOS App
If you visit 10HN only rarely, check out the best articles from the past week.
If you like 10HN please leave feedback and share
Visit pancik.com for more.