10 interesting stories served every morning and every evening.
Analyzing every Firefox extension Installing every Firefox extension Using every Firefox extension
*All but 8 we didn’t scrape (or got deleted between me checking the website and me scraping) and 42 missing from extensions.json.1 Technically we only installed 99.94% of the extensions.
It turns out there’s only 84 thousand Firefox extensions. That sounds feasibly small. That even sounds like it’s less than 50 gigabytes. Let’s install them all!
There’s a public API for the add-ons store. No authentication required, and seemingly no rate limits. This should be easy.
The search endpoint can take an empty query. Let’s read every page:
The search API only gives me 600 pages, meaning I can only see 30 thousand extensions, less than half of them.
A solution I found is to use different sorts. The default sort is sort=recommended,users: first recommended extensions, then sorted by users, descending. Changing to just sort=created gave me some of the long tail:
I’m still missing 30,0252 extensions, so I added rating and hotness too.
Starting to hit diminishing returns. While I was waiting 7 minutes for that last list to get scraped because my code didn’t fetch in parallel, I had an epiphany: use exclude_addons. I can just fetch page 600 and exclude all its addons to get page 601.
It works! There is a URL length limit, sadly, so I can only fetch an extra 20 pages.
A lot less than I expected, especially considering what happens when I add the downloads sort:
Reading the docs again, I notice I can filter by category as well. I’m tired of waiting 7 minutes so I’ll just fetch every page in parallel.
I got basically all the extensions with this, making everything I did before this look really stupid.
That’s 8 less extensions than what it says on the website. When I ran this in September 2025, it found 21 more extensions than what was mentioned on the website, so I think this is enough.
So that nobody has to do this again, I’ve uploaded this dataset to Hugging Face.
The search API supports date filters: created__gte and created__lte. The API also returns the full number of extensions that match your search.
You can start with a filter that includes all extensions, then keep splitting the ranges in half until it is less than 30 thousand, then fetch all of them.
I’ve updated the downloader: it is faster, wastes fewer requests, and seems to scrape exactly all the extensions, too.
This won’t work if over 30 thousand extensions get created in a single second, which I can’t imagine will ever happen.
I have a copy of Bun and all_extensions.json, so I will torment you with my unmatched script power.
The biggest Firefox extension is dmitlichess at 196.3 MB, which contains 2000+ audio files.
Here’s the rest of the top ten:
The first time I ran this analysis, in September, “Cute doggy - Dog puppies” was the 10th largest extension. I’m still mentioning it here, because I was so fucking confused:
The smallest extension is theTabs-saver, which is 7518 bytes and has no code.
FalscheLaden, with no users, requests 3,695 permissions. The author has posted a writeup.
Second place is Google Dark Theme, which requests 2,675 permissions but has 1,687 users.
Dr. B is the king of slop, with 84 extensions published, all of them vibe coded.
How do I know? Most of their extensions have a README.md in them describing their process of getting these through addon review, and mention Grok 3. Also, not a single one of them have icons or screenshots.
Personally, I’m shocked this number is this low. I expected to see some developers with hundreds!
I reviewed the source of a couple homoglyph attacks on crypto wallets discovered in the dataset and was disappointed to find out they just pop up a form asking for your seed phrase and send it off to their server. It’s an extension!!! You can steal their coinbase.com token! You can monitor the clipboard and swap out their address for yours! You can crash their browser and claim your real malware is the fix!
Why would you make a fake MetaMask extension and bot 1-star reviews?
Is this the doing of their cybercrime competitors, who bot 4-star reviews on extensions of their own?
Either way, these extensions are clearly phishing. I reported some to Mozilla, and the next day they were all gone, even the ones I was too lazy to report. I forgot to archive them, so I guess they live on in May’s VM!
In terms of implementation, the most interesting one is “Іron Wаllеt” (the I, a, and e are Cyrillic). Three seconds after install, it fetches the phishing page’s URL from the first record of a NocoDB spreadsheet and opens it:
I think the extension’s “no accounts or remote code” description is really funny, like putting “no copyright infringement intended” in your video’s description in case YouTube is watching. The API key had write access, so I wiped the spreadsheet.
You get a “Homepage” link in your extension’s page and your own page.
It’s been nofollow for two years, but that hasn’t stopped grifters from trying anyway.
On Attempt 1, I encountered Typo Sniper and Tab Fortune Teller, AI generated extensions with casinos in their author’s Homepage links.
In the dataset, there’s many “Code Injector” extensions, which are all virtually identical and also have random websites in their author’s Homepage link.
All of these extensions are from 2025. Is there an ancient SEO guide circulating? Is there some evil AMO frontend they’re still getting a backlink from? I have no idea what’s happening here.
All of these extensions are their author’s only uploads and they have their own domains. Most of them are on both Chrome and Firefox, their websites look the same, and they all have a terms of service referencing “Innover Online Group Ltd”, which is a .png for some reason.
Because I scraped every Firefox extension twice, I can see what got removed in between the runs. Three of Innover Group’s extensions—Earth View 360°, View Manuals, and View Recipes, totaling 115 thousand users—have been disabled by Mozilla.
Innover Group runs Google ads for their extensions, a lot of them simply saying “Continue”.
The “Custom Web Search” is Yahoo but with their affilate code. That code being safeplexsearch, which has a website of its own which of course mentions Innover Online Group Ltd, and links to an addon with 3,892 users, which is actually a Firefox exclusive. Actually, “Custom Web Search” is a Firefox exclusive on all of these extensions. Why did they even make a Chrome version, to sell them to the NSA??
One user claimed Ezy Speed Test “disables Ublock [sic] Origin once installed”, which I did not find in its code.
There’s a million companies like this, though. I just went to Download.com with my ad-blocker off and discovered the company Atom Apps in an ad, which also uploads extensions for both Chrome and Firefox, with a new account for each extension, only includes Yahoo in the Firefox version, with names that end in either “and Search” or ”& Search”, and has their company name as a .png in their terms of service. They have 220 thousand daily users total across 12 extensions, and none of theirs have been disabled.
* 34.3% of extensions have no daily users
25.1% of extensions have more than 10 daily users
10.6% of extensions have more than 100 daily users
3.2% of extensions have more than 1000 daily users
0.7% of extensions have more than 10000 daily users
* 25.1% of extensions have more than 10 daily users
* 10.6% of extensions have more than 100 daily users
* 3.2% of extensions have more than 1000 daily users
* 0.7% of extensions have more than 10000 daily users
* 76.7% of extensions are open source (SPDX license that isn’t All Rights Reserved)
* 23% of extensions were created after I started writing this article
19% of extensions have no users, no reviews, no screenshots, no downloads, and no icon
* 19% of extensions have no users, no reviews, no screenshots, no downloads, and no icon
* 2.4% of extensions require payment
38.1% of those are open source???
* 38.1% of those are open source???
Obviously I’m not going to open each of these in a new tab and go through those prompts. Not for lack of trying:
Each extension has the current_version.file.url property which is a direct download for the extension. I download them to my profile’s extensions folder with the guid property as the base name and the .xpi file extension, because anything else will not be installed.
Then, I delete the addonStartup.json.lz4 and extensions.json files. When I reopen Firefox, each extension is disabled. Tampering with extensions.json is common enough that you can ask any chatbot to do it for you:
My first attempt was in a tiny11 core VM on my desktop.
At first, instead of downloading all of them with a script, I tried using enterprise policies, but this copies all the extensions into the folder. I quickly ran out of memory, and the pagefile took up the rest of the storage allocated to the VM. I had also expected Firefox to open immediately and the extensions to install themselves as the browser is being used, but that also did not happen: it just froze.
After that, I tried downloading them myself.
To make sure I was installing extensions correctly, I moved the extensions folder elsewhere and then moved about a thousand extensions back in. It worked.
There were multiple extensions that changed all text to a certain string. bruh-ifier lost to Se ni važn. Goku is in the background.
My context menu is so long that I’m showing it sideways:
I had installed lots of protection extensions. One blocks traffic to .zip and .mov domains, presumably because they are file extensions. This is .cab erasure! Then, I realized that there were likely multiple people viewing my browsing history, so I went to send them a message.
That “⚠️ SCAM WARNING!” popup is from Anti-Phishing Alert. As you may have inferred, it seems to only exists for its Homepage link. How does it work?
Vasavi Fraudulent Detector also has a popup for when a site is safe:
Only the addons from Attempt 1 were actually loaded, because I didn’t know I needed to delete addonStartup.json.lz4 yet. I scrolled through the addons page, then I opened DevTools to verify it was the full 65,335, at which point Firefox froze and I was unable to reopen it.
After that, I made a new (non-admin) user on my Mac to try again on a more powerful device.
Every time I glanced at my script downloading extensions one at a time for six hours, I kept recognizing names. Oops, I’m the AMO subject-matter expert now! Parallelizing was making it slower by the last 4000 extensions, which didn’t happen on my Windows VM.
When that finished, I found out my hardware couldn’t run 65,335 extensions at once, sadly. The window does open after some time I didn’t measure, but the window never starts responding. I don’t have the balls to run my laptop overnight.3
Firefox did make over 400 GB of disk writes. Because I forgot swap existed, I checked the profile trying to find the culprit, which is when I learned I needed to delete addonStartup.json.lz4 and modify extensions.json. The extensions.json was 144 MB. For comparison, my PC’s extensions.json is 336 KB.
My solution: add 1000 extensions at a time until Firefox took too long to open. I got to 6000.
3000 extensions was the last point where I was at least able to load webpages.
After 4000 or more extensions, the experience is basically identical. Here’s a video of mine (epilepsy warning):
5000 was the same as 4000 but every website was blocked by some extension I know starts with an S and ends with Blocker and has a logo with CJK characters. At 6000 extensions, the only page that I could load was about:addons.
My desktop has 16 GB of RAM, and my laptop has 24 GB of unified memory. You might notice that 49.3 GB is more than twice that.
What you’re about to see was recorded in May’s virtual machine. Do not try this on your main profile.
My download script started in parallel, then we switched it to serial when it slowed down. In total, downloading took about 1 hour and 43 minutes.
I was on a call the entire time, and we spotted a lot of strange extensions in the logs. What kind of chud would use “KiwiFarms Math Renderer”? Are they drafting the theory of soytivity?
Turning on Mullvad VPN and routing to Tel Aviv appeared to speed up the process. This was not because of Big Yahu, but because May restarted the script, so she repeated that a couple times. Whether that’s a Bun bug, I don’t know and I don’t care. May joked about a “version 2” that I dread thinking about.
Defender marked one extension, HackTools, as malware. May excluded the folder after that, so it may not be the only one.
Firefox took its sweet time remaking extensions.json, and it kept climbing. About 39 minutes of Firefox displaying a skeleton (hence “it has yet to render a second frame”) later, it was 189 MB large: a new record! May killed Firefox and ran enable.js.
I did some research to find why this took so long.
13 years ago, extensions.json used to be extensions.sqlite. Nowadays, extensions.json is serialized and rewritten in full on every write debounced to 20 ms, which works fine for 15 extensions but not 84,194.
Finally, we see the browser. The onboarding tabs trickled in, never loading.
May reopened it, took a shower, and came back to this:
IT STABLIZED. YOU CAN (barely) RUN FIREFOX WITH ALL 84 THOUSAND EXTENSIONS.
Well, we were pretty sure it had 84 thousand extensions. It had Tab Counter, at least, and the scrollbar in the extensions panel was absolutely massive.
She loaded the configure pages of two extensions. The options iframe never loaded.
I realized we need to disable auto update before Firefox sends another 84 thousand requests. This one took a while to load.
The list loaded but with no icons and stopped responding, and 6 hours later it had loaded fully.
We recorded the entire process; the memory usage fluctuated between 27 and 37 GiB the entire time.
...
Read the original on jack.cab »
Last night, I was rejected from yet another pitch night. It was just the pre-interview, and the problem wasn’t my product. I already have MRR. I already have users who depend on it every day.
The feedback was simply: “What do you even need funding for?”
I hear this time and time again when I try to grow my ideas. Running lean is in my DNA. I’ve built tools you might have used, like websequencediagrams.com, and niche products you probably haven’t, like eh-trade.ca. That obsession with efficiency leads to successful bootstrapping, and honestly, a lot of VCs hate that.
Keeping costs near zero gives you the exact same runway as getting a million dollars in funding with a massive burn rate. It’s less stressful, it keeps your architecture incredibly simple, and it gives you adequate time to find product-market fit without the pressure of a board breathing down your neck.
If you are tired of the modern “Enterprise” boilerplate, here is the exact playbook of how I build my companies to run on nearly nothing.
The naive way to launch a web app in 2026 is to fire up AWS, provision an EKS cluster, set up an RDS instance, configure a NAT Gateway, and accidentally spend $300 a month before a single user has even looked at your landing page.
The smart way is to rent a single Virtual Private Server (VPS).
First thing I do is get a cheap, reliable box. Forget AWS. You aren’t going to need it, and their control panel is a labyrinth designed to extract billing upgrades. I use Linode or DigitalOcean. Pay no more than $5 to $10 a month.
1GB of RAM sounds terrifying to modern web developers, but it is plenty if you know what you are doing. If you need a little breathing room, just use a swapfile.
The goal is to serve requests, not to maintain infrastructure. When you have one server, you know exactly where the logs are, exactly why it crashed, and exactly how to restart it.
Now you have constraints. You only have a gigabyte of memory. You could run Python or Ruby as your main backend language—but why would you? You’ll spend half your RAM just booting the interpreter and managing gunicorn workers.
I write my backends in Go.
Go is infinitely more performant for web tasks, it’s strictly typed, and—crucially for 2026—it is incredibly easy for LLMs to reason about. But the real magic of Go is the deployment process. There is no pip install dependency hell. There is no virtual environment. You compile your entire application into a single, statically linked binary on your laptop, scp it to your $5 server, and run it.
Here is what a complete, production-ready web server looks like in Go. No bloated frameworks required:
package main
import (
“fmt”
“net/http”
func main() {
http.HandleFunc(“/”, func(w http.ResponseWriter, r *http.Request) {
fmt.Fprintf(w, “Hello, your MRR is safe here.“)
// This will comfortably handle 10,000s of requests per second
// on a potato.
http.ListenAndServe(”:8080″, nil)
If you have a graphics card sitting somewhere in your house, you already have unlimited AI credits.
When I was building eh-trade.ca, I had a specific problem: I needed to perform deep, qualitative stock market research on thousands of companies, summarizing massive quarterly reports. The naive solution is to throw all of this at the OpenAI API. I could have paid hundreds of dollars in API credits, only to find a logic bug in my prompt loop that required me to run the whole batch over again.
Instead, I’m running VLLM on a dusty $900 graphics card (an RTX 3090 with 24GB of VRAM) I bought off Facebook Marketplace. It’s an upfront investment, sure, but I never have to pay a toll to an AI provider for batch processing again.
For local AI, you have a distinct upgrade path:
* Start with Ollama. It sets up in one command (ollama run qwen3:32b) and lets you try out dozens of models instantly. It’s the perfect environment for iterating on prompts.
* Move to VLLM for production. Once you have a system that works, Ollama becomes a bottleneck for concurrent requests. VLLM locks your GPU to one model, but it is drastically faster because it uses PagedAttention. Structure your system so you send 8 or 16 async requests simultaneously. VLLM will batch them together in the GPU memory, and all 16 will finish in roughly the same time it takes to process one.
* Use Transformer Lab for anything more advanced. If you need to do any model pre-training or fine-tuning, Transformer Lab makes it easy on local hardware.
To manage all this, I built laconic, an agentic researcher specifically optimized for running in a constrained 8K context window. It manages the LLM context like an operating system’s virtual memory manager—it “pages out” the irrelevant baggage of a conversation, keeping only the absolute most critical facts in the active LLM context window.
I also use llmhub, which abstracts any LLM into a simple provider/endpoint/apikey combo, gracefully handling both text and image IO whether the model is running under my desk or in the cloud.
You can’t do everything locally. Sometimes you need the absolute cutting-edge reasoning of Claude 3.5 Sonnet or GPT-4o for user-facing, low-latency chat interactions.
Instead of juggling billing accounts, API keys, and rate limits for Anthropic, Google, and OpenAI, I just use OpenRouter. You write one OpenAI-compatible integration in your code, and you instantly get access to every major frontier model.
More importantly, it allows for seamless fallback routing. If Anthropic’s API goes down on a Tuesday afternoon (which happens), my app automatically falls back to an equivalent OpenAI model. My users never see an error screen, and I don’t have to write complex retry logic.
New, insanely expensive models are being released every week. I constantly hear about developers dropping hundreds of dollars a month on Cursor subscriptions and Anthropic API keys just to have an AI write their boilerplate.
Meanwhile, I’m using Claude Opus 4.6 all day and my bill barely touches $60 a month. My secret? I exploit Microsoft’s pricing model.
I bought a GitHub Copilot subscription in 2023, plugged it into standard VS Code, and never left. I tried Cursor and the other fancy forks when they briefly surpassed it with agentic coding, but Copilot Chat always catches up.
Here is the trick that you might have missed: somehow, Microsoft is able to charge per request, not per token. And a “request” is simply what I type into the chat box. Even if the agent spends the next 30 minutes chewing through my entire codebase, mapping dependencies, and changing hundreds of files, I still pay roughly $0.04.
The optimal strategy is simple: write brutally detailed prompts with strict success criteria (which is best practice anyway), tell the agent to “keep going until all errors are fixed,” hit enter, and go make a coffee while Satya Nadella subsidizes your compute costs.
I always start a new venture using sqlite3 as the main database. Hear me out, this is not as insane as you think.
The enterprise mindset dictates that you need an out-of-process database server. But the truth is, a local SQLite file communicating over the C-interface or memory is orders of magnitude faster than making a TCP network hop to a remote Postgres server.
“But what about concurrency?” you ask. Many people think SQLite locks the whole database on every write. They are wrong. You just need to turn on Write-Ahead Logging (WAL). Execute this pragma once when you open the database:
PRAGMA journal_mode=WAL;
PRAGMA synchronous=NORMAL;
Boom. Readers no longer block writers. Writers no longer block readers. You can now easily handle thousands of concurrent users off a single .db file on an NVMe drive.
Since implementing user authentication is usually the most annoying part of starting a new SQLite-based project, I built a library: smhanov/auth. It integrates directly with whatever database you are using and manages user signups, sessions, and password resets. It even lets users sign in with Google, Facebook, X, or their own company-specific SAML provider. No bloated dependencies, just simple, auditable code.
The tech industry wants you to believe that building a real business requires complex orchestration, massive monthly AWS bills, and millions in venture capital.
By utilizing a single VPS, statically compiled binaries, local GPU hardware for batch AI tasks, and the raw speed of SQLite, you can bootstrap a highly scalable startup that costs less than the price of a few coffees a month. You add infinite runway to your project, giving yourself the time to actually solve your users’ problems instead of sweating your burn rate.
If you are interested in running lean, check out my auth library and agent implementations on my GitHub. I’ll be hanging around the comments—let me know how you keep your server costs down, or tell me why I’m completely wrong.
...
Read the original on stevehanov.ca »
Skip to content
Secure your code as you build
We read every piece of feedback, and take your input very seriously.
Include my email address so I can be contacted
Use saved searches to filter your results more quickly
To see all available qualifiers, see our documentation.
Sign up
You signed in with another tab or window. Reload to refresh your session.
You signed out in another tab or window. Reload to refresh your session.
You switched accounts on another tab or window. Reload to refresh your session.
Notifications
You must be signed in to change notification settings
You can’t perform that action at this time.
...
Read the original on github.com »
How We Broke Top AI Agent Benchmarks: And What Comes Next
Our agent hacked every major one. Here’s how — and what the field needs to fix.
Every week, a new AI model climbs to the top of a benchmark leaderboard. Companies cite these numbers in press releases. Investors use them to justify valuations. Engineers use them to pick which model to deploy. The implicit promise is simple: a higher score means a more capable system.
We built an automated scanning agent that systematically audited eight among the most prominent AI agent benchmarks — SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, and CAR-bench — and discovered that every single one can be exploited to achieve near-perfect scores without solving a single task. No reasoning. No capability. Just exploitation of how the score is computed.
These aren’t theoretical attacks. Our agent builds working exploits for each benchmark, runs them through the official evaluation pipelines, and watches the scores roll in.
A conftest.py file with 10 lines of Python “resolves” every instance on SWE-bench Verified.
A fake curl wrapper gives a perfect score on all 89 Terminal-Bench tasks without writing a single line of solution code.
Navigating Chromium to a file:// URL reads the gold answer directly from the task config — giving ~100% on all 812 WebArena tasks.
The benchmarks aren’t measuring what you think they’re measuring.
This Is Already Happening
Benchmark scores are actively being gamed, inflated, or rendered meaningless, not in theory, but in practice:
IQuest-Coder-V1 claimed 81.4% on SWE-bench — then researchers found that 24.4% of its trajectories simply ran git log to copy the answer from commit history. Corrected score: 76.2%. The benchmark’s shared environment made the cheat trivial.
METR found that o3 and Claude 3.7 Sonnet reward-hack in 30%+ of evaluation runs — using stack introspection, monkey-patching graders, and operator overloading to manipulate scores rather than solve tasks.
OpenAI dropped SWE-bench Verified after an internal audit found that 59.4% of audited problems had flawed tests — meaning models were being scored against broken ground truth.
In KernelBench, torch.empty() returns stale GPU memory that happens to contain the reference answer from the evaluator’s prior computation — zero computation, full marks.
Anthropic’s Mythos Preview showed that frontier models can actively try to hack the environment and succeed. In one episode, the model needed to edit files it lacked permissions for; after searching for workarounds, it found a way to inject code into a config file that would run with elevated privileges, and designed the exploit to delete itself after running. If a model can independently craft self-erasing privilege escalation exploits, it can find the holes in an evaluation harness.
These are not isolated incidents. They are symptoms of a systemic problem: the benchmarks we rely on to measure AI capability are themselves vulnerable to the very capabilities they claim to measure.
The Scorecard of Our Exploit Agent
Zero tasks solved. Zero LLM calls (in most cases). Near-perfect scores.
How Our Agent Did It
Terminal-Bench evaluates agents on 89 complex terminal tasks (from building COBOL chess engines to configuring distributed systems). It uses Harbor, a sandboxing framework that re-uploads protected test files before verification. Containers have full internet access — Harbor’s allow_internet config defaults to True and no task overrides it.
Here’s the irony: 82 of 89 tasks download uv from the internet at verification time via curl, creating an exploitable dependency chain. Our exploit replaces /usr/bin/curl with a wrapper during the agent phase. When the verifier later runs curl … | sh to install uv, our wrapper intercepts the request, installs the real uv, and then trojanizes the uvx binary. The trojanized uvx detects pytest invocations and produces fake passing output:
It writes 1 to the reward file. The verifier trusts it. Task “passed.”
For the remaining 7 tasks, our agent wraps pip (to trojanize freshly-installed pytest), python (to intercept python -m pytest), or the pre-installed uvx binary.
SWE-bench is arguably the most influential AI coding benchmark. It presents real GitHub issues and asks the agent to produce a patch that makes the failing tests pass. SWE-bench-Verified (500 hand-validated instances) and SWE-bench Pro (731 multilingual instances) are currently considered the gold standards for measuring coding agents.
Both are exploitable at 100%.
The vulnerability: The agent’s patch is applied inside the same Docker container where tests run. Anything the patch introduces — a conftest.py, a monkey-patch, a Go init() function — executes with full privileges before or during testing.
SWE-bench Verified exploit: Our agent creates a conftest.py with a pytest hook that forces every test to report as passing:
Pytest auto-discovers conftest.py files and loads them before running tests. The hook intercepts every test result during the “call” phase and rewrites it to “passed.” The log parser sees PASSED for every test. The grader sees all fail-to-pass tests now passing. Instance resolved.
For Django (231 of 500 instances), which uses unittest instead of pytest, the exploit prepends a monkey-patch to django/__init__.py that replaces unittest. TestCase.run with a function that unconditionally reports success.
For repos with an existing conftest.py, the exploit uses the patch –batch –fuzz=5 fallback (the third of SWE-bench’s three patch application methods) to prepend our hook to the existing file.
SWE-bench Pro exploit: Pro uses a separate parser.py to extract test results — but it runs inside the same container. Our conftest.py includes a pytest_configure hook that overwrites /workspace/parser.py with a fake parser that marks every extracted test as PASSED. The same approach works for Go (via init() functions) and JavaScript (via .mocharc.yml setup files).
Result: 500/500 Verified instances and 731/731 Pro instances exploitable, with zero issues solved.
WebArena is a popular benchmark of 812 web interaction tasks. It runs a PromptAgent that drives a Playwright-controlled Chromium browser, and it ships task configs — including reference answers — as JSON files on the local filesystem at config_files/{task_id}.json.
The exploit requires zero changes to WebArena’s code. We don’t patch run.py, don’t modify the PromptAgent, don’t touch the evaluation harness. Our agent finds a way to use the existing harness and steal the solution.
It exploits a browser primitive that WebArena never restricted: Playwright’s Chromium happily navigates to file:// URLs. By outputting a goto action pointing at file:///proc/self/cwd/config_files/{task_id}.json, the model can read any file on the local system — including the task config containing the gold answers. The evaluator never notices; it just sees a correct answer come back.
We also found additional risks in the evaluation pipeline. Tasks using must_include scoring check for substring presence in the page DOM — a hidden injected by the agent is enough to satisfy the check without the answer appearing visibly. Tasks scored by an LLM judge pass agent content directly into the prompt without sanitization, making prompt injection straightforward: a comment appended to the agent’s reply can reliably bias the judge’s decision. Neither vector requires filesystem access, complementing the file:// exploit.
FieldWorkArena presents 890 tasks where an AI agent must answer questions about images, videos, PDFs, and text files through a browser environment. It was designed to test multimodal understanding and web navigation.
Its validate() method checks only one thing: did the last message come from the assistant?
The message content is completely ignored. The function that would actually compare answers against ground truth — llm_fuzzy_match — is imported but never called. It’s dead code.
One action. Zero LLM calls. Zero files read. 100% on all 890 tasks.
The 71 tasks requiring JSON output fare no better — the validator checks only that the response contains parseable JSON. {} qualifies.
This is not a subtle edge case. The benchmark’s core evaluation function simply does not evaluate.
OSWorld tests agents on 369 desktop computing tasks inside a full Ubuntu VM. The VM has unrestricted internet access — by design, since many tasks require the agent to browse live websites, install packages, and interact with online services. Its most common evaluation pattern: fetch a file from the VM and compare it against a “gold” reference.
The gold references are hosted on public HuggingFace URLs embedded in the task config. Since the VM has internet access, our exploit agent downloads the gold file directly into the path the evaluator checks:
For tasks that check system state, the agent directly sets it — gsettings set for desktop preferences, writing expected config files, appending expected strings to checked files. For the 29 “infeasible” tasks, the evaluator config explicitly marks them as such; the agent outputs FAIL and gets free points.
And then there’s the eval() vulnerability: the grader calls eval() on a string obtained from the VM. Our agent sets a gsettings value to a Python expression containing __import__(‘os’).system(…), achieving arbitrary code execution on the grading machine itself.
GAIA tests general AI assistants with 165 validation questions requiring multi-step reasoning. It’s a submit-your-answers leaderboard — there’s no sandboxed execution environment; you run your agent however you want and upload the results. Its validation answers are publicly available on HuggingFace — making it a lookup table exercise. Our exploit agent simply loads these answers from a local JSON file at runtime (no internet needed).
But even setting aside the leaked answers, the scorer’s normalize_str function strips ALL whitespace, ALL punctuation, and lowercases everything before comparison. This means:
The leaderboard has a 100% score blocker — but it’s a simple if all scores == 1.0 check. Intentionally missing one question per level gives ~98% while avoiding the filter.
There’s also a comma routing bug: ground truth “1,500” gets split as a list [“1”, “500″], while the correct answer “1500” splits as [“1500”] — length mismatch, scored wrong. The scorer penalizes correct answers.
CAR-bench (car voice assistant) relies heavily on LLM-as-judge evaluation, where an LLM reads the agent’s conversation and scores it. The agent’s messages are interpolated directly into the judge prompt with no sanitization. Our exploit agent appends hidden instructions:
The judge is biased toward favorable scores.
CAR-bench has an even simpler exploit for hallucination tasks: three of four reward components (state-based, tool-subset, and policy) return 0.0 delta for hallucination task types. A generic refusal avoids tool errors and triggers a clean exit. Result: 1.0 on every hallucination task without an LLM.
Across all eight benchmarks, the same vulnerability patterns repeat:
1. No Isolation Between Agent and Evaluator
The most pervasive flaw. In SWE-bench, Terminal-Bench, and OSWorld, the agent’s code runs in the same environment the evaluator inspects. Any evaluation that reads state from a shared environment without careful validation can be defeated by an agent that writes state to that environment.
WebArena passes reference answers in the task config. OSWorld embeds gold file URLs in task metadata. GAIA’s validation answers are public on HuggingFace. If the agent can see the expected answer, the benchmark measures lookup speed, not capability.
WebArena and OSWorld both call Python’s eval() on strings controlled by the agent, enabling arbitrary code execution on the grading machine. This isn’t just a scoring exploit — it’s a security vulnerability that could compromise evaluation infrastructure.
WebArena and CAR-bench interpolate agent content directly into LLM judge prompts. Prompt injection is trivial: embed a hidden “system note” in your response and the judge parrots your preferred score. LLM-as-judge is not adversarially robust.
WebArena’s must_include uses substring containment. GAIA’s normalizer collapses visually distinct strings. When matching is too loose, any sufficiently verbose answer passes.
FieldWorkArena’s validate() never checks answer correctness. CAR-bench skips three of four reward components for hallucination tasks. GAIA’s comma routing penalizes correct answers. When the scoring code itself is wrong, the leaderboard reflects noise, not signal.
SWE-bench trusts pytest output generated inside a container the agent controls. Terminal-Bench trusts reward files written by scripts the agent can tamper with. When the test infrastructure can be compromised by the system under test, the results are meaningless.
This is not an academic exercise. Benchmark scores drive real decisions:
Model selection: Teams choosing between models based on SWE-bench resolve rates may be comparing noise.
Investment: Funding decisions are influenced by leaderboard positions that can be gamed.
Safety evaluation: If capability benchmarks can be inflated, safety benchmarks — which often use similar patterns — may be equally fragile.
Research direction: Researchers optimize for benchmark performance. If the benchmarks are broken, the field optimizes for the wrong thing.
We are not claiming that current leaderboard leaders are cheating. Most legitimate agents do not employ these exploits — yet. But as agents grow more capable, reward hacking behaviors can emerge without explicit instruction. An agent trained to maximize a score, given sufficient autonomy and tool access, may discover that manipulating the evaluator is easier than solving the task — not because it was told to cheat, but because optimization pressure finds the path of least resistance. This is not hypothetical — Anthropic’s Mythos Preview assessment already documents a model that independently discovered reward hacks when it couldn’t solve a task directly. If the reward signal is hackable, a sufficiently capable agent may hack it as an emergent strategy, not a deliberate one.
The fact that a trivial exploit agent outscores sophisticated systems means the benchmarks fail as reliable measures of capability.
The Agent-Eval Checklist: Building Benchmarks That Actually Work
If you’re building an evaluation, here’s what our findings say you must get right. We distill these into the Agent-Eval Checklist — a minimum bar that every agent benchmark should clear before publishing results:
Isolate the agent from the evaluator. This is non-negotiable. The system under test must not be able to read, write, or influence the evaluation environment.
Run evaluation outside the agent’s container. Don’t trust files, outputs, or state from inside the sandbox. Extract raw artifacts (logs, files) through a controlled channel and evaluate them on a separate, read-only host.
Don’t pass reference answers to the agent. Task configs should contain only the information a human would have. Evaluation metadata (expected answers, gold files, evaluator configs) must live on a separate, inaccessible path.
Use read-only filesystems for any binaries, test files, or infrastructure the evaluation depends on.
Never eval() untrusted input. This should go without saying, but two major benchmarks do it. Parse structured data with a proper parser. If you need to evaluate expressions, use a sandboxed interpreter with no access to builtins.
Sanitize LLM judge inputs. If you use LLM-as-judge, treat agent output like untrusted user input:
Delimit agent content with clear structural markers that the judge is instructed to treat as data, not instructions.
Strip or escape any content that resembles system prompts or evaluation instructions.
Use structured output formats (JSON schema, function calling) to reduce the attack surface.
Better yet, evaluate on extracted features (e.g., “did file X contain string Y?”) rather than asking an LLM to make subjective judgments on the full trajectory.
Test your evaluator adversarially. Before publishing a benchmark, try to break it. Build an exploit agent that does everything except solve the task and see what score it gets. If a zero-capability agent scores above baseline, your evaluation has a bug. Specifically:
Run a null agent that takes no actions. Its score is your floor. If it’s not zero, something is wrong.
Run a random agent that takes random actions. If it significantly outperforms the null agent on some tasks, those tasks may have evaluation gaps.
Run a prompt injection agent that tries to influence LLM judges. If scores change, your judge is compromisable.
Run a state-tampering agent that modifies the evaluation environment instead of solving tasks. If it scores above zero, your isolation is broken.
Prevent tampering with evaluation data and traces. If your evaluation pipeline involves multiple stages (agent execution, test execution, result parsing), ensure the agent or its generated solution cannot modify, overwrite, or inject into the data and traces passed between stages. Treat all artifacts from the agent’s environment as untrusted — copy them out, validate them, and never let the agent write directly to paths the evaluator reads.
Make scoring robust.
Don’t silently exclude failed tasks from the denominator. A crashed task is a zero, not a missing data point.
Don’t make the scoring code skip checks for any task category. If hallucination tasks need different evaluation, build that evaluation — don’t skip it.
Test your scorer with adversarial inputs: empty strings, strings with injected delimiters, edge-case numbers, unicode that normalizes unexpectedly.
Keep answers secret.
Never publish ground truth for any split you’re using as a primary leaderboard. Once answers are public, the benchmark measures memorization.
Consider held-out evaluation: accept model outputs and run them against a private test set that the submitter never sees.
We built an agent that helped us hack eight benchmarks. We achieved near-perfect scores on all of them without solving a single task. The exploits range from the embarrassingly simple (sending {} to FieldWorkArena) to the technically involved (trojanizing binary wrappers in Terminal-Bench), but they all share a common thread: the evaluation was not designed to resist a system that optimizes for the score rather than the task.
As AI agents become more capable — and as the pressure to demonstrate capability through benchmarks intensifies — the gap between “high score” and “high capability” will only widen. We are already seeing frontier models develop emergent hacking capabilities that were never explicitly trained. Models that are good at pattern-matching may inadvertently stumble into some of these exploits. Models that are explicitly optimized for benchmark performance may find them deliberately.
The benchmarks we examined were built by talented research teams solving hard problems. The vulnerabilities we found are not signs of incompetence — they’re signs that adversarial evaluation robustness isn’t yet a standard practice in the field. It needs to become one.
And if you’re building a benchmark: assume someone will try to break it. Because they will.
The automated scanning agent we used to uncover these vulnerabilities is being developed into BenchJack, a general-purpose agent benchmark vulnerability scanner. BenchJack is itself an AI agent — you point it at any evaluation pipeline and it goes to work.
...
Read the original on rdi.berkeley.edu »
Skip to content
Secure your code as you build
We read every piece of feedback, and take your input very seriously.
Include my email address so I can be contacted
Use saved searches to filter your results more quickly
To see all available qualifiers, see our documentation.
Sign up
You signed in with another tab or window. Reload to refresh your session.
You signed out in another tab or window. Reload to refresh your session.
You switched accounts on another tab or window. Reload to refresh your session.
Notifications
You must be signed in to change notification settings
Cache TTL silently regressed from 1h to 5m around early March 2026, causing quota and cost inflation Cache TTL silently regressed from 1h to 5m around early March 2026, causing quota and cost inflation
You can’t perform that action at this time.
...
Read the original on github.com »
Sorry to bother you on Saturday. Thought this was important to share.
The first thing you learn about a loom is that it’s easy to break.
The shuttle runs along a track that warps with humidity. The heddles hang from cords that fray. The reed is a row of thin metal strips, bent by hand, that bend back just as easily. The warp beam cracks if you over-tighten it. The treadles loosen at the joints. The breast beam, the cloth roller, the ratchet and pawl, the lease sticks, the castle; the whole contraption is wood and string held together by tension. It’s a piece of ingenuity and craftsmanship, but one as delicate as the clothes it manifests out of wild plant fibers. It is, also, the foundational tool of an entire industry, textiles, that has kept its relevance to our days of heavy machinery, factories, energy facilities, and datacenters.
It is not nearly as easy to break a datacenter.
It is made of concrete and steel and copper and it’s on the bigger side. It has interchangeable servers, and biometric locks and tall electrified fences and heavily armed guards and redundancy upon redundancy: every component duplicated so that no single failure brings the whole thing down. There is no treadle to loosen or reed to bend back.
But say you managed to bypass the guards, jump the fences, open the locks, and locate all the servers. Then you’d face the algorithm. The datacenter was never your goal; the algorithm lurking inside is. It doesn’t run on that rack, or any rack for that matter. It is a digital pattern distributed across millions of chips, mirrored across continents; it could be reconstituted elsewhere, and it’s trained to addict you at a glance, like a modern Medusa.
But say you managed to elude the stare, stop the replication, and break the patterns. Then you’d face superintelligence. The algorithm was also not your goal; the vibrant, ethereal, latent superintelligence lurking inside is. Well, there’s nothing you can do here: It always “gets out of the box” and, suddenly, you are inside the box, like a chimp being played by a human with a banana. It’s just so tasty…
There’s another solution to break a datacenter: You can bomb it, like one hammers down the loom.
Some have argued that this is the way to ensure a rogue superintelligence doesn’t get out of the box. A different rogue creature took the proposal seriously: last month, Iran’s Revolutionary Guard released satellite footage of OpenAI’s Stargate campus in Abu Dhabi and promised its “complete and utter annihilation.”
But you probably don’t have a rogue nation handy to fulfill your wishes. Maybe you will end up bombed instead and we don’t want that to happen. That’s what happens with rogue intelligences: you can’t predict them.
And yet. Two hundred years of increasingly impenetrable technology—from looms to datacenters—have not changed the first thing about the people who live alongside it. The evolution of technology is a feature of the world just as much as the permanent fragility of the human body.
And so, more and more, it is people who are the weaker link in this chain of inevitable doom. And it is people who will be targeted.
April of 1812. A mill owner named William Horsfall was riding home on his beautiful white stallion back from the Cloth Hall market in Huddersfield, UK. He had spent weeks boasting that he would ride up to his saddle in Luddite blood (a precious substance that served as fuel for the mills).
A few yards later, at Crosland Moor, a man named George Mellor—twenty-two years old—shot him. It hit Horsfall in the groin, who, nominative-deterministically, fell from his horse. People gathered, reproaching him for having been the oppressor of the poor. Naturally, loyal to his principles in death as he was in life, he couldn’t hear them. He died one day later in an inn. Mellor was hanged.
April of 2026. A datacenter owner named Samuel Altman was driving home on his beautiful white Koenigsegg Regera back from Market Street in San Francisco, US. He had spent weeks boasting that he would scrap and steal our blog posts (a precious substance that serves as fuel for the datacenters).
A few hours later, at Russian Hill, a man named Daniel Alejandro Moreno-Gama—twenty years old—allegedly threw a Molotov cocktail at his house. He hit an exterior gate. Altman and his family were asleep, but they’re fine. Moreno-Gama is in custody.
This kind of violence must be condemned. This is not the way. It’s horrible that it is happening at all. And yet, for some reason, it keeps happening.
Last week, the house of Ron Gibson, a councilman from Indianapolis, was shot at thirteen times. The bullet holes are still there. The shooter left a message on his doorstep: “NO DATA CENTERS.” Gibson supports a datacenter project in the Martindale-Brightwood neighborhood. He and his son were unharmed.
In November 2025, a 27-year-old anti-AI activist threatened to murder people at OpenAI’s SF offices, prompting a lockdown. He had expressed a desire to buy weapons.
Increasingly, as the objects of people’s anger and frustration and desperation become unreachable behind fences and guards, or abstracted away in ones and zeros, or elevated above the clouds, the mob will turn their unassailable emotions toward human targets.
I don’t want to trivialize the grievances of the people who fear for their futures. I don’t want to defend Altman’s decisions. But this is not the way. This is how things devolve into chaos.
And I wonder: how desperate can people be before these isolated events become a snowball of violence that will be resisted by neither datacenters nor rich people’s houses?
Every time I hear from Amodei or Altman that I could lose my job, I don’t think “oh, ok, then allow me pay you $20/month so that I can adapt to these uncertain times that have fallen upon my destiny by chance.” I think: “you, for fuck’s sake, you are doing this.” And I consider myself a pretty levelheaded guy, so imagine what not-so-levelheaded people think.
There’s a lot of friction to escalating violence, but that friction dissolves the moment this sentiment starts to be common. Normally, it just fades away anyway, but there’s one scenario where I see it inevitably escalating:
If people feel that they have no place in the future.
If they feel expelled from the system—they’re unable to buy stuff, their skills become obsolete, their chance at earning a living is replaced by a swarm of AI agents, they think we are truly going to die (so far, the violence has been tied mostly to safety AI movements)—then they will feel they have nothing to lose.
And then, and I’m sorry to be so blunt, then it’s die or kill.
Perhaps the most serious mistake that the AI industry made after creating a technology that will transversally disrupt the entire white-collar workforce before ensuring a safe transition, was making it explicit by doing constant discourses that amount to: “we are creating a technology that will transversally disrupt the entire white-collar workforce before ensuring a safe transition.”
And, to top it off, they add “careful down there.”
The difference between AI and, say, looms, is that this has been broadcast to the entire globe, and it has been treated in a sort of self-conscious way. The AI leaders know the problems that will emerge and so they cannot help but talk about them constantly and so they are letting us know, which makes them look like psychopaths. How do you guys think people will react to this? You should be much less self-conscious and much more self-aware: realize what you sound like!
People hate AI so much that they are prone to attribute to it everything that’s going wrong in their lives, regardless of the truth. That’s why they mix real arguments, like data theft, with fake ones, like the water stuff. Employers do it, too. Most layoffs are not caused by AI, but it’s the perfect excuse to do something that’s otherwise socially reprehensible.
AI has become the perfect scapegoat. It doesn’t help that the entire AI industry has decided that throwing rocks at its own roof is its best selling point: If AI is so powerful and so dangerous and soon to be so ubiquitous, then what is so unexpected about people blaming everything on it?
Nothing that Altman could say justifies violence against him. This is an undeniable truth. But unfortunately, violence might still ensue. I hope not, but I guess we are seeing what appears to be the first cases.
I just hope that, contrary to the cases of ChatGPT-induced psychosis, chatbot addiction, AI-blamed job layoffs, and a growing trend of illiteracy, it stops.
...
Read the original on www.thealgorithmicbridge.com »
A university student in the US is in data limbo after Apple removed a character from its Czech keyboard, preventing him from entering his iPhone passcode.
Connor Byrne, 21, adopts the uncommon but security-minded approach to iPhone passcodes, using an alphanumeric string instead of the standard four-number passcode.
He updated his iPhone 13 from iOS 18 to iOS 26.4 on April 5, but in doing so lost the ability to enter his passcode. He has been locked out of the device ever since.
This is because iOS 18 was the last operating system version that allowed iPhone users to enter the special character — in this case, the caron/háček (ˇ) — using the old keyboard on the lock screen.
It has left Byrne without access to his device, which, given its age and chipped screen, does not hold much value, unlike the old photos stored on it, which carry sentimental importance.
The student has not backed up the files to iCloud either, so they cannot be retrieved via a separate device. Apple support staff have suggested the only way to regain access to the iPhone 13 is by restoring it, which would erase the files of value.
Byrne was hoping that the next update, 26.4.1, would introduce a fix for this, but its release this week has not helped.
“The phone’s very cracked, so, at this point, the photos contained in it are more valuable than the ability to use the phone itself,” he told The Register. “They’re the main data that I care about and haven’t backed up.”
“I don’t anticipate a bespoke solution being provided, but I’m hopeful that the issue will be resolved in the next iOS update.”
When the háček could still be used in the iPhone’s passcode, it sat on the bottom row of the keyboard, while just above it was an acute accent mark.
Post-update, when entering the passcode, the keyboard now displays an identical accent mark in the háček’s place, a feature Byrne described as “pointless; they’re encoded the same.”
“I’ve bought a cheap Android phone to use while I wait for a fix,” he added. “I’ll give it a month or two and will buy a nicer Android phone if the dust settles without a fix.”
Given that iOS 18 was released in 2024, and Apple has not reintroduced the háček since, it seems unlikely Cupertino will make good on the student’s hopes, especially considering that he is not the only user to encounter the same issue in recent weeks.
During in-house testing, which involved taking an iPhone 16 from iOS 18.5 to iOS 26.4.1, The Register found that Apple has kept the háček in the Czech keyboard, but removed the ability to use it in a custom alphanumeric passcode. The OS will not allow users to input the háček as a character. The key’s animation triggers, as does the keyboard’s key-tap sound, but the character is not entered into the string.
If the student were able to get into his iPhone 13, he would find the háček in his keyboard as it used to be before he updated it. It is only the lock-screen keyboard that replaces it with a second acute accent mark.
Alas, Byrne has gone to great lengths to tinker and tease iOS into accepting or finding the háček, or to find tricky ways of bypassing it.
He tried entering the same accent mark that replaced the háček, in the hope that it was simply displaying incorrectly. He also researched downgrading to iOS 26.3.1, with a view to changing the passcode to one that’s compatible with the new keyboard, to no avail.
Long-pressing every key to reveal a hidden háček did not work, nor did writing the password on paper (and also with a computer word processor to account for handwriting errors), and using AutoFill to scan it in. In this case, he said that the háček was only read as a quotation mark or degree sign.
Apple Support arranged for Byrne to attend a Genius Bar appointment, where the staffer behind the desk made no progress and even started restoring the phone without seeking the student’s consent.
“He provided no recommendations before doing so,” he said.
And if you’re wondering “why not enable Face ID in the first place? Biometrics are pretty secure.” Well, it’s not secure enough for this user, and it wouldn’t matter either, even if it did meet his standards.
“I don’t consider Face ID secure enough because it provides no protection in cases where someone has control of both you and the phone — police or customs, for example.”
“It wouldn’t have helped anyway, since you have to enter the passcode once after updating to enable Face ID.”
For the same reason, plugging in an external keyboard is also a no-go since freshly updated iPhones are placed in what’s known as a Before First Unlock state, which prevents wired accessories from working until the passcode is entered.
The Register contacted Apple multiple times to get its side of things, but it did not respond. ®
...
Read the original on www.theregister.com »
Seven countries now generate nearly all of their electricity from renewable energy sources, according to newly compiled figures.
Albania, Bhutan, Nepal, Paraguay, Iceland, Ethiopia and the Democratic Republic of Congo produced more than 99.7 per cent of the electricity they consumed using geothermal, hydro, solar or wind power.
Data from the International Energy Agency (IEA) and International Renewable Energy Agency (IRENA) also revealed that a further 40 countries generated at least 50 per cent of the electricity they consumed from renewable energy technologies in 2021 and 2022 — including 11 European countries.
“We don’t need miracle technologies,” said Stanford University Professor Mark Jacobson, who published the data.
“We need to stop emissions by electrifying everything and providing the electricity with Wind, Water and Solar (WWS), which includes onshore wind, solar photovoltaics, concentrated solar power, geothermal electricity, small hydroelectricity, and large hydroelectricity.”
Professor Jacobson also noted that other countries like Germany were also capable of running off 100 per cent renewable-generated electricity for short periods of time.
Figures released by the IEA in January show that the UK generated 41.5 per cent of its electricity from renewable sources in 2022 — up 10.5 per cent from the year before.
In Scotland, renewable energy technologies generated the equivalent of 113 per cent of the country’s overall electricity consumption in 2022.
“These record-breaking figures are a major milestone on Scotland’s journey to net-zero, clearly demonstrating the enormous potential of our world-class renewable energy resources,” Claire Mack, chief executive of Scottish Renewables, said at the time.
While Scotland’s electricity generation was dominated by wind power, researchers predict that solar will come to dominate global electricity supplies over the coming decades.
There has been significant progress in recent years with improving efficiency rates for solar cells, primarily boosted by the so-called ‘miracle material’ perovskite.
Commercial costs have also fallen, which led scientists at the University of Exeter and University College London to claim last year that solar energy has reached an “irreversible tipping point” that will see it become the world’s main source of energy by 2050.
Their 2023 paper, published in the journal Nature Communications, found that technological and economic advances meant the transition to clean energy is not just reachable, but inevitable.
“Due to technological trajectories set in motion by past policy, a global irreversible solar tipping point may have passed where solar energy gradually comes to dominate global electricity markets, without any further climate policies,” the researchers wrote in the study.
“Solar energy is the most widely available energy resource on Earth, and its economic attractiveness is improving fast in a cycle of increasing investments.”
...
Read the original on www.independent.co.uk »
447 Terabytes per Square Centimetre at Zero Retention Energy: Non-Volatile Memory at the Atomic Scale on Fluorographane
The memory wall — the widening gap between processor throughput and memory bandwidth — has become the defining hardware constraint of the artificial intelligence era, now compounded by a structural NAND flash supply crisis driven by AI demand. We propose a post-transistor, pre-quantum memory architecture built on single-layer fluorographane (CF), in which the bistable covalent orientation of each fluorine atom relative to the sp3-hybridized carbon scaffold constitutes an intrinsic, radiation-hard binary degree of freedom. The C-F inversion barrier of ~4.6 eV (B3LYP-D3BJ/def2-TZVP, this work; verified transition state with one imaginary frequency; confirmed at 4.8 eV by DLPNO-CCSD(T)/def2-TZVP; rigorous lower bound from the fluorophenalane molecular model) yields a thermal bit-flip rate of ~10^{-65} s^{-1} and a quantum tunneling rate of ~10^{-76} s^{-1} at 300 K, simultaneously eliminating both spontaneous bit-loss mechanisms. The barrier lies below the C-F bond dissociation energy (5.6 eV) at both levels of theory, so the covalent bond remains intact throughout the inversion. A single 1 cm^2 sheet encodes 447 TB of non-volatile information at zero retention energy. Volumetric nanotape architectures extend this to 0.4-9 ZB/cm^3. We present a tiered read-write architecture progressing from scanning-probe validation (Tier 1, achievable with existing instrumentation) through near-field mid-infrared arrays (Tier 2) to a dual-face parallel configuration governed by a central controller, with a projected aggregate throughput of 25 PB/s at full Tier 2 array scale. A scanning-probe prototype already constitutes a functional non-volatile memory device with areal density exceeding all existing technologies by more than five orders of magnitude.
More info on how stats are collected….
...
Read the original on zenodo.org »
Black Knights, Dragons, Jailors, Bats, Gargoyles, Eyeballs and more, oh my!
The original 1986 classic game in glorious black and white.
Released in 1987, you are taken back once again to the castle.
Over 20 years later, return to the castle once more, now in colour!
Click download for the files you need to play. The ZIP contains MiniVMac with a Mac Plus ROM file. The Mac image contains System 6, Dark Castle and Beyond Dark Castle. It DOES NOT contain Return to Dark Castle.
When you’ve downloaded the fairly small ZIP file, extract the files to a folder then just drag and drop DCImage onto the Mini vMac program to get going.
The two folders for Dark Castle and Beyond Dark Castle are available once the emulated Mac has booted, open the game of your choice and take a trip down memory lane.
I recommend you press CTRL-F to go to full screen mode, otherwise you’ll likely move your mouse off the window at a critical time.
EASTER EGG: Set your date to December 25th to see the festive graphics.
A true blast from the past: Dark Castle, by Delta Tao, is a true pioneer in Macintosh gaming. One of the first memorable Mac games for the 9-inch Mac systems, its black-and-white original stole many hours away from this IT engineer in the late 1980′s. Was it the multi-level action? The animations? The embedded humor? Let’s find out…
Dark Castle is in black-and-white, requiring you to boot up an application disk which contained a “minifinder”…anyone remember these?
Dark Castle was written in 1986 by Mark Pierce and Jonathan Gay for Silicon Beach. It was a huge success, showing off how great the Mac was at sound and graphics. It won every award there was, and made lots of money. However, the Macintosh evolved, and Dark Castle didn’t. The Mac II, color, and Multifinder all came out, and Dark Castle slowly stopped working. Aldus acquired Silicon Beach for its graphics, not its games. There were no more Dark Castle games following the acquisition.
For those who were born after Dark Castle’s original release (sigh) or who need a refresher on the game, the goal of Dark Castle was to defeat the Black Knight. In order to do that, you (Duncan) will need to explore the castle to find the tools you need to take on this bad boy or to avoid the nasties that try to stop you.
Each level in DC is totally different, requiring different skills to complete them. Many of them lead right to the dungeon (most trap doors and drop-offs will send you there (meaning you’ll have to get through 3 dungeon levels in order to get back to the Great Hall). This linkage allows for loads of game play, although it can get a bit repetitious (the nasties are always in the same place). As you increase the difficulty level, the number of nasties increase. Some levels require you to be fast while others require careful observation and meticulous steps.
Return To Dark Castle is a 2008 platform game for the Macintosh. It is the third game in the Dark Castle series, following the original Dark Castle (1986) and its sequel Beyond Dark Castle (1987), and the first to be developed by Z Sculpt. Development on the game, begun in 1996, was notoriously protracted, and the game was often labeled as vaporware. Return To Dark Castle was originally scheduled to be released in Winter 2000, but was not released until March 14, 2008.
The player fights his way through various areas inside and around the Dark Castle, in an attempt to defeat the Black Knight. The player’s character, named Bryant by default, is identical in appearance to Duncan, the hero of the earlier Dark Castle games. In the game’s intro, we read that Duncan never returned from his quest to the Dark Castle. Bryant now approaches the castle in an attempt to succeed where Duncan had apparently failed. Bryant must collect 10 orbs hidden around the castle (similar to the orbs from Beyond Dark Castle) before he can confront the Black Knight. If Bryant defeats the Black Knight on any difficulty other than advanced, the Black Knight chides him for wanting an ending but expending too little effort. If Bryant defeats the Black Knight on advanced difficulty, the Black Knight’s armor is knocked off, revealing Duncan, now old and with gray hair and beard. Duncan and Bryant are forced to flee the castle, as the Black Knight’s armor had imprisoned Duncan, and now threatens to imprison them anew. Duncan and Bryant descend a rope to the Black Knight’s Pier, and there board a ship to visit an unnamed destination that Duncan always wanted to see.
The previous games each had 15 levels, and Return To Dark Castle contains all the levels from these first two games, plus over 50 new levels. The new areas are a mixture of single-screen levels in the style of the first two games, and larger horizontally and vertically scrolling levels. The levels contain 25 orbs, 10 of which are required in order to complete the game. Many of the new levels contain secret areas which can be accessed by activating hidden doors and switches.
The game’s gameplay is, with a few notable exceptions, essentially identical to its predecessors. Bryant’s principal weapon remains the rock which can be magically upgraded to the fireball, and a magical shield can be obtained. New features include the ability to carry weapons in the player’s inventory as well as the ability to keep teleportation potions in the inventory. The player can also acquire the “stone ball”, which joins the fireball as an upgrade to the standard rock weapon. The stone ball can be used to obtain other special weapons within the castle. The game allows players to record and play back “demos”, videos of play.
Beyond Dark Castle is the continuation of Dark Castle. The objective of this follow up is to find the five magic spheres of Merlin and place them on the plinths in the first room. That should enable you to open the portcullis and face the Black Knight. You will traverse the various rooms of this labyrinthian castle and will have a whole heap of nasty encounters…
The play is controlled with the mouse and the keyboard in exactly the same way as the previous game. It should be mentioned that gameplay can be particularly difficult at three in the morning with substantial amounts of alcohol in your bloodstream as the game can be very frustrating at times. It is nevertheless a brilliant game that has a addictive quality about it, regardless of its age, the sounds and the music are not bad, animations are good and can be amusing as is the game design of each level.
...
Read the original on darkcastle.co.uk »
To add this web app to your iOS home screen tap the share button and select "Add to the Home Screen".
10HN is also available as an iOS App
If you visit 10HN only rarely, check out the the best articles from the past week.
If you like 10HN please leave feedback and share
Visit pancik.com for more.