10 interesting stories served every morning and every evening.
While LLMs are adept at reading and can be terrific at editing, their writing is much more mixed. At best, writing from LLMs is hackneyed and cliché-ridden; at worst, it brims with tells that reveal that the prose is in fact automatically generated.
What’s so bad about this? First, to those who can recognize an LLM’s reveals (an expanding demographic!), it’s just embarrassing — it’s as if the writer is walking around with their intellectual fly open. But there are deeper problems: LLM-generated writing undermines the authenticity of not just one’s writing but of the thinking behind it as well. If the prose is automatically generated, might the ideas be too? The reader can’t be sure — and increasingly, the hallmarks of LLM generation cause readers to turn off (or worse).
Finally, LLM-generated prose undermines a social contract of sorts: absent LLMs, it is presumed that of the reader and the writer, it is the writer that has undertaken the greater intellectual exertion. (That is, it is more work to write than to read!) For the reader, this is important: should they struggle with an idea, they can reasonably assume that the writer themselves understands it — and it is the least a reader can do to labor to make sense of it.
If, however, prose is LLM-generated, this social contract is ripped up: a reader cannot assume that the writer understands their ideas, because the writer may not have so much as read the output of the LLM they tasked with writing it. If one is lucky, these are LLM hallucinations: obviously wrong and quickly discarded. If one is unlucky, however, the result is a kind of LLM-induced cognitive dissonance: a puzzle whose pieces don’t fit because there is in fact no puzzle at all. This can leave a reader frustrated: why should they spend more time reading prose than the writer spent writing it?
This can be navigated, of course, but it is truly perilous: our writing is an important vessel for building trust — and that trust can be quickly eroded if we are not speaking with our own voice. For us at Oxide, there is a more mechanical reason to be jaundiced about using LLMs to write: because our hiring process very much selects for writers, we know that everyone at Oxide can write — and we have the luxury of demanding of ourselves the kind of writing that we know that we are all capable of.
So our guideline is to generally not use LLMs to write, but this shouldn’t be thought of as an absolute — and it doesn’t mean that an LLM can’t be used as part of the writing process. Just please: consider your responsibility to yourself, to your own ideas — and to the reader.
...
Read the original on rfd.shared.oxide.computer »
The state administration of Schleswig-Holstein is making a remarkable U-turn in its IT strategy and consistently relying on open source. After the migration from proprietary Microsoft software to free solutions was initially accompanied by problems and criticism, Digitalization Minister Dirk Schrödter (CDU) can now report a significant success: According to his ministry, the state will save over 15 million euros in license costs for Windows, Microsoft Office & Co. next year alone. It is expected to be similar in the following years.
Set against this are one-time investments of nine million euros in 2026, the Ministry of Digitalization explained to the Kieler Nachrichten. That money will go toward converting workstations and further developing the free-software solutions over the next 12 months. Given the annual savings, the outlay will pay for itself in less than a year. In the past, the state transferred millions to the US company Microsoft, primarily for the use of office software and other programs.
The department sees the departure from this “vendor lock-in” — the dependence on a single large provider — as a clear signal for greater independence and sustainable digitalization. The financial incentive now underscores that digital sovereignty can be not only a political buzzword but also an economic gain.
The numbers speak for themselves: outside the tax administration, almost 80 percent of workplaces in the state administration have already been switched to the open-source office software LibreOffice. Schrödter thus confirms a course that reduces technical and economic dependence on individual manufacturers. The consequences of the conversion were already evident recently, as Schrödter emphasized in an interview with c’t. Regarding the status of Microsoft license cancellations, he said: “We are at almost 80, without the tax administration.” For tax matters, the state finance ministers have “given themselves a clear timetable for the switch.” Recently, the Christian Democrat also emphasized, according to the Südtiroler Wirtschaftszeitung, that the state has embarked on a marathon, not a sprint.
The remaining 20 percent of workplaces are currently still dependent on Microsoft programs such as Word or Excel, as there is a technical dependency on these programs in certain specialized applications. According to Schrödter, however, the successive conversion of these remaining computers is the stated goal.
Despite the savings and the almost completed migration in large parts of the administration, the opposition continues to criticize the quality of the conversion. SPD state parliament member Kianusch Stender pointed out to the Kieler Nachrichten: “It may be that on paper 80 percent of workplaces have been converted. But far fewer than 80 percent of employees can now work with them properly.” Errors in the migration are “still present.” The initial difficulties in introducing the open-source programs have apparently led to ongoing frustration among some employees in certain areas.
The Green state parliament member Jan Kürschner also admitted in an interview with heise online that such a comprehensive conversion would not happen without friction. But he emphasized the long-term nature of the project and the necessity of fundamentally rethinking administrative processes: “With the change, there is an opportunity to truly rethink the administration and free ourselves from old burdens. That is the great added value.” A switch made purely one-to-one might well “stumble at one point or another.” But those who truly optimize administrative processes will likely find in the end: “Open source is the better way.”
The challenge now is to resolve the initial migration problems and acceptance difficulties and to further develop the open-source solutions so that they fully meet the requirements of a modern state administration. The savings achieved give Schleswig-Holstein more financial leeway for this.
...
Read the original on www.heise.de »
Peer review is under siege. By speeding up the writing process, LLMs and other AI tools are overwhelming scholarly journals, conferences, and the peer-review pipeline with hallucinated papers (“AI slop”).
These aren’t just issues for low-ranking journals with high acceptance rates. The GPTZero team used our Hallucination Check tool to scan 300 papers under review by the prestigious International Conference on Learning Representations (ICLR). We discovered that 50 submissions included at least one obvious hallucitation (a fabricated citation), none of which had previously been reported.
Worryingly, each of these submissions has already been reviewed by 3-5 peer experts, most of whom missed the fake citation(s). This failure suggests that some of these papers might have been accepted by ICLR without any intervention. Some had average ratings of 8/10, meaning they would almost certainly have been published.
In the table below, we’ve included a specific human-verified hallucitation our tool flagged in each paper. According to the ICLR’s editorial policy, even a single, clear hallucitation is an ethics violation that could lead to the paper’s rejection. Given that we’ve only scanned 300 out of 20,000 submissions, we estimate that we will find 100s of hallucinated papers in the coming days.
...
Read the original on gptzero.me »
We introduce the Titans architecture and the MIRAS framework, which allow AI models to work much faster and handle massive contexts by updating their core memory while the model is actively running.
The Transformer architecture revolutionized sequence modeling with its introduction of attention, a mechanism by which models look back at earlier inputs to prioritize relevant input data. However, computational cost increases drastically with sequence length, which limits the ability to scale Transformer-based models to extremely long contexts, such as those required for full-document understanding or genomic analysis. The research community explored various approaches for solutions, such as efficient linear recurrent neural networks (RNNs) and state space models (SSMs) like Mamba-2. These models offer fast, linear scaling by compressing context into a fixed-size state. However, this fixed-size compression cannot adequately capture the rich information in very long sequences.

In two new papers, Titans and MIRAS, we introduce an architecture and theoretical blueprint that combine the speed of RNNs with the accuracy of transformers. Titans is the specific architecture (the tool), and MIRAS is the theoretical framework (the blueprint) for generalizing these approaches. Together, they advance the concept of test-time memorization, the ability of an AI model to maintain long-term memory by incorporating more powerful “surprise” metrics (i.e., unexpected pieces of information) while the model is running and without dedicated offline retraining.

The MIRAS framework, as demonstrated by Titans, introduces a meaningful shift toward real-time adaptation. Instead of compressing information into a static state, this architecture actively learns and updates its own parameters as data streams in. This crucial mechanism enables the model to incorporate new, specific details into its core knowledge instantly.
Titans: Learning new context on the fly
An effective learning system requires distinct yet interconnected memory modules, mirroring the human brain’s separation of short-term and long-term memory.

While attention mechanisms excel at precise, short-term memory, Titans introduces a novel neural long-term memory module that, unlike the fixed-size vector or matrix memory in traditional RNNs, acts as a deep neural network (specifically, a multi-layer perceptron). This memory module provides significantly higher expressive power, allowing the model to summarize large volumes of information without losing important context. The model isn’t simply taking notes; it’s understanding and synthesizing the entire story.

Crucially, Titans doesn’t just passively store data. It actively learns how to recognize and retain important relationships and conceptual themes that connect tokens across the entire input. A key aspect of this ability is what we call the “surprise metric”. In human psychology, we know we quickly and easily forget routine, expected events but remember things that break the pattern — unexpected, surprising, or highly emotional events.
Overview of the Titans (MAC) architecture. It uses long-term memory to compress past data, incorporates the summary into the context, and passes it to attention, which can then decide whether or not to attend to the summary of the past.
In the context of Titans, the “surprise metric” is the model detecting a large difference between what it currently remembers and what the new input is telling it.

Low surprise: If the new word is “cat” and the model’s memory state already expects an animal word, the gradient (surprise) is low. It can safely skip memorizing the word “cat” in its permanent long-term state.

High surprise: If the model’s memory state is summarizing a serious financial report, and the new input is a picture of a banana peel (the unexpected event), the gradient (surprise) will be very high. This signals that the new input is important or anomalous, and it must be prioritized for permanent storage in the long-term memory module.

The model uses this internal error signal (the gradient) as a mathematical equivalent of saying, “This is unexpected and important!” This allows the Titans architecture to selectively update its long-term memory only with the most novel and context-breaking information, keeping the overall process fast and efficient.

Titans refines this mechanism by incorporating two critical elements:

Momentum: The model considers both “momentary surprise” (the current input) and “past surprise” (the recent context flow). This ensures relevant subsequent information is also captured, even if those tokens are not individually surprising.

Forgetting (weight decay): To manage the finite capacity of the memory when dealing with extremely long sequences, Titans employs an adaptive weight decay mechanism. This acts as a forgetting gate, allowing the model to discard information that is no longer needed.
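The update rule behind this is compact enough to sketch. The following is a loose illustration, not Google's code: it assumes a plain matrix-valued memory and fixed gate values, whereas the papers use a deep MLP memory with input-dependent learning rate, momentum, and decay.

```python
import numpy as np

def titans_step(M, S, k, v, lr=0.1, momentum=0.9, decay=0.05):
    """One test-time memory update, loosely following the Titans recipe.

    M    : (d, d) matrix memory (the paper uses a deep MLP instead)
    S    : (d, d) accumulated "surprise" (momentum over past gradients)
    k, v : (d,) key and value derived from the current token
    """
    # Momentary surprise = gradient of the recall loss ||M k - v||^2 w.r.t. M.
    err = M @ k - v                    # how wrong the current memory is
    grad = 2.0 * np.outer(err, k)

    # Blend momentary surprise with past surprise (momentum), so tokens that
    # follow a surprising event are also captured.
    S = momentum * S - lr * grad

    # Adaptive forgetting: decay the old memory before adding the new surprise.
    M = (1.0 - decay) * M + S
    return M, S

# Toy usage: stream random (key, value) pairs through the memory.
d = 8
M, S = np.zeros((d, d)), np.zeros((d, d))
rng = np.random.default_rng(0)
for _ in range(100):
    k, v = rng.normal(size=d), rng.normal(size=d)
    M, S = titans_step(M, S, k, v)
```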
Every major breakthrough in sequence modeling — from modern transformers to the new, lightning-fast linear RNNs — is essentially the same thing under the hood: a highly complex associative memory module.

Accordingly, what makes MIRAS both unique and practical is the way it views AI modeling. Instead of seeing diverse architectures, it sees different methods of solving the same problem: efficiently combining new information with old memories without letting the essential concepts be forgotten. MIRAS frames every such model through four design choices:

Memory architecture: The structure that stores information (e.g., a vector, matrix, or a deep multi-layer perceptron, like in Titans).

Attentional bias: The internal learning objective the model optimizes that determines what it prioritizes.

Retention gate: The memory regularizer. MIRAS reinterprets “forgetting mechanisms” as specific forms of regularization that balance new learning against retaining past knowledge.

Memory algorithm: The optimization algorithm used to update the memory.
The MIRAS framework overview. In the MIRAS framework, we aim to learn an associative memory mapping between keys and values. For each token, the memory module internally optimizes its attentional bias while using its retention gate to make sure that it does not deviate too far from its past state. The optimization is done with a gradient-based optimizer.
Virtually all successful existing sequence models rely on mean squared error (MSE) or dot-product similarity for both their bias and retention. This reliance can make models sensitive to outliers and limit their expressive power.

MIRAS transcends this limitation by providing a generative framework to explore a richer design space informed by the literature in optimization and statistics. This allows for the creation of novel architectures with non-Euclidean objectives and regularization. Using MIRAS, we created three specific attention-free models:

YAAD: We designed this MIRAS variant to be less sensitive to major errors or “outliers” (like a single typo in a large document). It uses a gentler math penalty (Huber loss) for mistakes, so it doesn’t overreact to one-off issues. This makes the model more robust when the input data is messy or inconsistent.

MONETA: This model explores the use of more complex and strict mathematical penalties (called generalized norms). It investigates whether using these more disciplined rules for both what the model attends to and what it forgets can lead to a more powerful and stable long-term memory system overall.

MEMORA: This model focuses on achieving the best possible memory stability by forcing its memory to act like a strict probability map. By using this constraint, it ensures that every time the memory state is updated, the changes are controlled and balanced. This guarantees a clean, stable process for integrating new information.
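To make “attentional bias” concrete, here is a small sketch (again an illustration, not code from the papers) of how swapping the squared-error bias for a Huber-style one, roughly what YAAD does, caps the gradient that drives the memory update so a single outlier token can’t dominate it.

```python
import numpy as np

def mse_grad(err):
    """Squared-error bias: the gradient grows linearly with the error,
    so one outlier token can swamp the memory update."""
    return 2.0 * err

def huber_grad(err, delta=1.0):
    """Huber-style bias (YAAD-like): quadratic near zero, linear past delta,
    so the per-element gradient is capped at 2*delta."""
    return np.where(np.abs(err) <= delta, 2.0 * err, 2.0 * delta * np.sign(err))

def memory_grad(M, k, v, bias_grad=mse_grad):
    """The same 'surprise' gradient as before, with a pluggable attentional bias."""
    err = M @ k - v
    return np.outer(bias_grad(err), k)
```

In this toy setting, passing huber_grad instead of mse_grad into the earlier update step is all the swap amounts to; the retention gate and optimizer can be varied independently in the same way.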
We rigorously compared Titans, along with the MIRAS variants (YAAD, MONETA, MEMORA), against leading architectures, including Transformer++, Mamba-2, and Gated DeltaNet. We further validated versatility by testing Titans on genomic modeling (DNA) and time-series forecasting, proving the architecture generalizes effectively beyond text. Across both standard language modeling datasets (C4, WikiText) and zero-shot reasoning tasks (HellaSwag, PIQA), our models consistently demonstrated higher accuracy and lower perplexity (a measure of how surprised a model is by a piece of text; lower is better).
Ablation studies clearly show that the depth of the memory architecture is crucial. When comparing long-term memory modules of the same size but different depths, modules with deeper memories consistently achieve lower perplexity in language modeling. Furthermore, they exhibit better scaling properties, maintaining performance as the sequence length increases significantly.
The effect of memory depth on the perplexity across 360M and 760M parameter scales.
In language modeling and commonsense reasoning tasks, Titans architectures outperform state-of-the-art linear recurrent models (such as Mamba-2 and Gated DeltaNet) and Transformer++ baselines of comparable sizes. The novel MIRAS variants (MONETA, YAAD, MEMORA) also achieve improved performance compared to these baselines, validating the benefit of exploring robust, non-MSE optimization mechanisms. Importantly, these models maintain efficient, parallelizable training and fast linear inference speeds.
The most significant advantage of these new architectures is their ability to handle extremely long contexts. This is highlighted in the BABILong benchmark, a task requiring reasoning across facts distributed in extremely long documents. In this challenging setting, Titans outperforms all baselines, including extremely large models like GPT-4, despite having many fewer parameters. Titans further demonstrates the capability to scale effectively to context window sizes larger than 2 million tokens.
The introduction of Titans and the MIRAS framework marks a significant advancement in sequence modeling. By employing deep neural networks as memory modules that learn to memorize as data is coming in, these approaches overcome the limitations of fixed-size recurrent states. Furthermore, MIRAS provides a powerful theoretical unification, revealing the connection between online optimization, associative memory, and architectural design. By moving beyond the standard Euclidean paradigm, this research opens the door to a new generation of sequence models that combine the efficiency of RNNs with the expressive power needed for the era of long-context AI.
...
Read the original on research.google »
In 2018 I made the first lithographically fabricated integrated circuits in my garage fab. I was a senior in high school when I made the Z1 amplifier, and now I’m a senior in college so there are some long overdue improvements to the amateur silicon process.
The Z1 had 6 transistors and was a great test chip to develop all the processes and equipment. The Z2 has 100 transistors on a 10µm polysilicon gate process — same technology as Intel’s first processor. My chip is a simple 10×10 array of transistors to test, characterize, and tweak the process but this is a huge step closer to more advanced DIY computer chips. The Intel 4004 has 2,200 transistors and I’ve now made 1,200 on the same piece of silicon.
Previously, I made chips with a metal gate process. The aluminum gate has a large work function difference with the silicon channel beneath it which results in a high threshold voltage (>10V). I used these metal gate transistors in a few fun projects like a guitar distortion pedal and a ring oscillator LED blinker but both of these required one or two 9V batteries to run the circuit due to high Vth. By switching to a polysilicon gate process, I get a ton of performance benefits (self aligned gate means lower overlap capacitances) including a much lower Vth which makes these chips compatible with 2.5V and 3.3V logic levels. The new FETs have excellent characteristics:
NMOS Electrical Properties:
Vth = 1.1 V
Vgs MAX = 8 V
Cgs =
I was particularly surprised by the super low leakage current. This value goes up about 100x in ambient room lighting.
Now we know that it’s possible to make really good transistors with impure chemicals, no cleanroom, and homemade equipment. Of course, yield and process repeatability are diminished. I’ll do more testing to collect data on the statistics and variability of FET properties but it’s looking good!
The chip is small, about one quarter the die area of my previous ICs (2.4mm^2) which makes it hard to probe. There’s a simple 10×10 array of N-channel FETs on each chip which will give me a lot of characterization data. Since it’s such a simple design, I was able to lay it out using Photoshop. Columns of 10 transistors share a common gate connection and each row is strung together in series with adjacent transistors sharing a source/drain terminal. It’s similar to NAND flash but I only did this to keep the metal pads large enough so I can reasonably probe them, if every FET had 3 pads for itself they would be too small.
It’s hard to convey the excitement of seeing a good FET curve displayed on the curve tracer after dipping a shard of rock into chemicals all day.
A single 10µm NMOS transistor can be seen below, with slight misalignment in the metal layer (part of the left contact is uncovered). Red outline is polycrystalline silicon, blue is the source/drain.
So far I’ve made an opamp (Z1) and a memory-like array (Z2). More interesting circuits are definitely possible even with this low transistor density. The process needs some tweaking but now that I’m able to consistently make good quality transistors I should be able to design more complex digital and analog circuits. Testing each chip is very tedious so I am trying to automate the process and I’ll post more data then. I’ve made 15 chips (1,500 transistors) and know there’s at least one completely functional chip and at least two “mostly functional”, meaning ~80% of the transistors work instead of 100%. No proper yield data yet. The most common defect is a drain or source shorted to the bulk silicon channel, not a leaky or shorted gate like on my Z1 process.
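The post doesn’t include the test scripts, so purely as an illustration of the characterization step: sweeping Vgs, recording Id, and extracting Vth by extrapolating sqrt(Id) in saturation takes only a few lines. The instrument-control side is omitted and the sweep data below is made up.

```python
import numpy as np

def extract_vth(vgs, id_sat):
    """Estimate Vth from an Id-Vgs sweep taken in saturation.

    In saturation Id ~ K * (Vgs - Vth)^2, so sqrt(Id) is linear in Vgs and
    the x-intercept of a line fit through the steep region gives Vth.
    """
    vgs = np.asarray(vgs, dtype=float)
    y = np.sqrt(np.clip(np.asarray(id_sat, dtype=float), 0.0, None))
    mask = y > 0.5 * y.max()           # keep only the strong-inversion part
    slope, intercept = np.polyfit(vgs[mask], y[mask], 1)
    return -intercept / slope

# Fabricated sweep for a device with Vth around 1.1 V and K = 2 mA/V^2.
vgs = np.linspace(0, 5, 51)
ids = 2e-3 * np.clip(vgs - 1.1, 0.0, None) ** 2
print(f"extracted Vth ~ {extract_vth(vgs, ids):.2f} V")   # ~1.10 V
```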
I said before that the gate used to be made out of aluminum and now it’s silicon which makes the chips work a lot better. Silicon comes in three varieties that we care about: amorphous, polycrystalline, and monocrystalline. From left to right, these become more electrically conductive but also much harder to deposit. In fact, monocrystalline Si can’t be deposited, you can only grow it in contact with another mono-Si layer as a seed (epitaxy). Since the gate must be deposited on top of an insulating dielectric, poly is the best we can do. We can heavily dope the polysilicon anyway to make it more conductive.
A typical self-aligned polysilicon gate process requires silane, a toxic and explosive gas, to deposit polycrystalline silicon layers. It may also be possible by sputtering or evaporating amorphous silicon and annealing with a laser. A major theme of this DIY silicon process is to circumvent expensive, difficult, or dangerous steps. So, I came up with a modified process flow. It’s a variation on the standard self-aligned methods to allow doping via high temperature diffusion rather than ion implantation. The effect is that I’m able to buy a silicon wafer with the polysilicon already deposited on it from the factory and pattern it to make transistors instead of putting my own polysilicon down halfway through the process. This is a nice short term workaround but it would be best to design a polysilicon deposition process using the laser anneal method mentioned above.
Wafers are available with all kinds of materials deposited on them already, so I just had to find one with a thin layer of SiO2 (gate oxide, ~10nm) followed by a thicker polysilicon (300nm). I found a lot of twenty-five 200mm (EPI, prime, [1-0-0], p-type) wafers on eBay for $45, which is essentially a lifetime supply, so email me if you want one. The gate oxide is the most fragile layer and requires the most care during fabrication. Since I bought the wafer with a nice high quality oxide on it already that was capped off and kept clean by the thick polysilicon layer, I was able to eliminate all the aggressive cleaning chemicals (sulfuric acid, etc) from the process and still make great transistors. Minimal process chemicals and tools are listed below.
Chemicals used in home poly-gate process:
-Water
-Alcohol
-Acetone
-Phosphoric acid
-Photoresist
-Developer (2% KOH)
-N type dopant (filmtronics P509)
-HF (1%) or CF4/CHF3 RIE
-HNO3 for poly etch or SF6 RIE
Equipment used in home poly-gate process:
-Hotplate
-Tube furnace
-Lithography apparatus
-Microscope
-Vacuum chamber to deposit metal
Z2 “gate first” process (similar to standard self-aligned process but without a field oxide):
I snapped one of the test chips in half (functional Z2 but with bad layer alignment and thin metal, about 300nm) and put it in my SEM for a cross section:
Find the dust particle in the red circle below, use that to get oriented in the coming cross section views.
Because I bought the wafer already with gate oxide and polysilicon on it, I can’t grow a field oxide. These thick oxide layers are typically used to mask dopants and require a long high temperature step which would oxidize all of my poly and there would be none remaining. So, my modified process uses an additional masking step (the “gate” mask is typically not found in a self-aligned process) that allows me to use the polysilicon itself as a dopant mask and hard-baked photoresist as the field dielectric. This alternative processing results in the stepped structure you can see in the orange region on the NMOS cross section above. This process subtlety is mentioned here, read this twitter thread.
This process isn’t ideal and I want to make some changes so it’s CMOS compatible but it simplifies fabrication and makes it possible with a minimal set of tools. The 1µm dielectric layer (orange) would ideally be CVD SiO2 (it’s possible to build a TEOS oxide reactor at home) but I used a photoresist instead. Most photoresists can be baked around 250°C to form a hard permanent dielectric layer that is an easy alternative to CVD or PECVD oxide. A spin-on-glass/sol-gel could also be used here. SiO2 etching is done with a buffered HF solution made from rust stain remover or RIE.
Thanks for following my work and feel free to contact me with your thoughts!
...
Read the original on sam.zeloof.xyz »
Trains were halted after a suspected AI-generated picture that seemed to show major damage to a bridge appeared on social media following an earthquake. The tremor, which struck on Wednesday night, was felt across Lancashire and the southern Lake District. Network Rail said it was made aware of the image, which appeared to show major damage to Carlisle Bridge in Lancaster, at 00:30 GMT and stopped rail services across the bridge while safety inspections were carried out. A BBC journalist ran the image through an AI chatbot which identified key spots that may have been manipulated.
Network Rail said the railway line was fully reopened at around 02:00 GMT and it has urged people to “think about the serious impact it could have” before creating or sharing hoax images. “The disruption caused by the creation and sharing of hoax images and videos like this creates a completely unnecessary delay to passengers at a cost to the taxpayer,” a spokesperson said. “It adds to the high workload of our frontline teams, who work extremely hard to keep the railway running smoothly. The safety of rail passengers and staff is our number one priority and we will always take any safety concerns seriously.” The British Transport Police said it was “made aware” of the situation but there was no ongoing investigation into the incident. Network Rail said 32 services, including passenger and freight trains, were delayed because of the hoax. A spokesperson for the rail provider said some of the services would have been directly stopped or slowed while it checked the lines, but many were delayed as a result of earlier services still being in their path. The spokesperson said most of them were local, but because of the length of the West Coast Main Line some trains were delayed as far north as Scotland.
Railway expert Tony Miles said due to the timing of the incident, very few passengers will have been impacted by the hoax, as the services passing through at that time were primarily freight and sleeper trains. “They generally go slow so as not to disturb the passengers trying to sleep - this means they have a bit of leeway to go faster and make up time if they encounter a delay,” he said. “It’s more the fact that Network Rail will have had to mobilise a team to go and check the bridge, which could impact their work for days.” He urged people to consider the impact hoaxes like this could have on real people. “If they actually did delay a train it could have impacted someone who had to get to a medical appointment, or a flight or a funeral,” he said. “It may seem like a game, but anyone who’s thinking of doing this should consider how it will impact real people.”
...
Read the original on www.bbc.com »
I failed to recreate the 1996 Space Jam Website with Claude

Link to the Hacker News post. Thanks everybody for all the engagement!
Can Claude Recreate the 1996 Space Jam Website? No. Or at least not with my prompting skills. Note: please help, because I’d like to preserve this website forever and there’s no other way to do it besides getting Claude to recreate it from a screenshot. Believe me, I’m an engineering manager with a computer science degree. Please please please help 😞
Final note: I use “he” to refer to Claude, which Josh finds ridiculous.
For those who don’t know, Warner Bros keeps this anachronistic website online that was released in 1996 to accompany the Space Jam movie.
It’s a classic example of early web era design. Simple, colorful, and sparks joy. We’re going to find out if we can get Claude to recreate it using only a screenshot and all of the assets the website uses.
To track Claude’s inner monologue and actual API calls, I set up a man-in-the-middle proxy to capture the full conversation between Claude Code and Anthropic’s API. This logs everything: user prompts, Claude’s responses, tool invocations (Read, Write, Bash commands), etc. Each attempt generates a traffic.log file with the raw API traffic, which I then parse for easier analysis.
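The post doesn’t say which proxy was used; if it were mitmproxy, for example, a minimal logging addon along these lines would dump each request/response pair into traffic.log (the anthropic.com filter and the log format are my guesses, not details from the post).

```python
# log_traffic.py -- run with: mitmdump -s log_traffic.py
from mitmproxy import http

class TrafficLogger:
    def response(self, flow: http.HTTPFlow) -> None:
        # Only keep the Claude Code <-> Anthropic API conversation.
        if "anthropic.com" not in flow.request.pretty_host:
            return
        with open("traffic.log", "a", encoding="utf-8") as f:
            f.write(f">>> {flow.request.method} {flow.request.pretty_url}\n")
            f.write((flow.request.get_text() or "") + "\n")
            f.write("<<< response\n")
            f.write((flow.response.get_text() or "") + "\n\n")

addons = [TrafficLogger()]
```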
Edit: I used Opus 4.1 for this investigation. Thanks to anorwell for pointing out I forgot to add the model.
The Space Jam website is simple: a single HTML page, a handful of image assets, and a tiling starfield GIF background. The entire page uses absolute positioning with pixel-specific left/top values. The total payload is under 200KB.
Correction: The original site is built using tables. Thanks to wilsmex and sqircles for calling that out!
Given that Claude has all of the assets + screenshots of the website, I assume this should be relatively boring. He’ll nail it, and we’ll move on to something more interesting. A mildly cute example of agentic HTML generation…
I am giving you:
1. A full screenshot of the Space Jam 1996 landing page.
2. A directory of raw image assets extracted from the original site
Your job is to recreate the landing page as faithfully as possible, matching the screenshot exactly.
What he produces is actually not that bad. But it’s not right. From a distance, the layout kind of resembled the original: planets arranged in an ellipse around the logo, little yellow labels where the buttons go. But, the orbital pattern was off, almost diamond shaped and symmetrical.
Claude, however, was thrilled with himself.
Further, he brags that he had:
Digging through the logs I found it interesting that Claude actually did notice the planets were arranged in a deliberate way, so much so that it’s called out twice in both the screenshot analysis and CSS construction, but he failed to recreate the pattern faithfully.
Okay, fine. Maybe he needed a nudge to get the orbit right. So for my next attempt, I try to push him to focus on understanding the orbital pattern and I ask him to explain his reasoning before generating his HTML. I was hoping to understand the delta between what is there and what he thought he was seeing. In my prompt, I outline a set of sections for him to consider. Each one of these sections also includes a number of sub-questions, which I won’t include here for the sake of brevity. This made things significantly worse.
Please follow this structure exactly in your reasoning explanations:
1. Perception Analysis
2. Spatial Interpretation
3. Reconstruction Plan
Claude didn’t ignore my instructions (not always a given) and things seemed promising until I realized he was ignoring his own analysis during the HTML generation phase. He would say things like “the orbit radius appears to be 220 pixels” and then place the planets directly next to the logo. His self critique was surprisingly accurate. He correctly identifies the areas where he was wrong with decent detail, but somehow those observations never make it into subsequent iterations.
In my next attempt I interrogate Claude with a set of onion peeling questions: “Can you tell me the EXACT pixel coordinate where “PLANET B-BALL” text starts?”
“No, I cannot measure exact pixel coordinates. I can only make visual estimations.”
I asked him a few more questions:
Can you extract exact pixel coordinates? “No.”
Can you measure exact distances? “No.”
Confidence you can get within 5 pixels? “15 out of 100.”
Oh. This explains a lot. But it raises a bigger question to me: “Why can’t he measure?” It’s a screenshot. The pixels are right there. Claude clearly understood the structure, but he couldn’t recreate it with any precision. Also, I’m not even sure I trust Claude. Either way, this (naively) surprised me, so I canceled coffee with my friends in order to spend the afternoon trying to give my guy more tools.
Before I start I execute one more attempt and ask him: “Would you bet $1000 on your HTML matching this screenshot exactly?”
Maybe he just needs a little help.
In one of Claude’s responses from Part 1, he tells me that he would be more effective if he had access to exact “pixel measurements,” so I build a few tools to make it impossible for Claude to mis-measure anything:
Grid overlays and a script to generate grid overlays on screenshots
color-diff comparison (this ignores the background which was giving Claude false positives because of how much black there was)
Tool to take screenshots of his index.html file to compare iteratively with the original
Here are three grid versions Claude generated which I am including because I find them aesthetically pleasing.
I put together a new prompt: same screenshot, same assets folder. I even included some grid screenshots so Claude wouldn’t have to remember to do it himself. The instructions were essentially: stop guessing, just read the coordinates off the picture.
Claude’s new attempt still wasn’t correct. The orbit was better: closer to the original but somehow compressed and smooshing (a technical word) into the Space Jam logo. If I squint, I could convince myself that there was at least a hint that he’d stopped freehanding and started using something like measurements.
When I dug into the logs, it appeared that Claude actually did use the grids. He pulled out these numbers:
and so on down the list
In one iteration, Claude built himself a helper: compare.html, a little side-by-side viewer so he could look at his screenshot and the reference together. It didn’t help him at all, but my God was he convinced it did.
The actual progression tells a different story. Going through the iterations:
Iteration 1 (50px grid): he notices things are off and makes a few conservative tweaks — moves Planet B-Ball from (850, 165) to (800, 120), shifts Lunar Tunes from (925, 195) to (950, 200). These are 15 - 50 pixel changes, tiny nudges.
Iteration 2 (25px grid): he decides he needs “more precise positioning” and shifts the entire orbit inward by ~20 pixels. Planets go from roughly a 250px radius to ~230px. He is now confidently converging on the wrong answer.
Iteration 3 (5px grid): he shuffles around a lot of deck chairs in the name of micro adjustments. 5 - 10 pixel tweaks: Planet B-Ball from (800, 120) to (805, 125), that kind of thing.
Iteration 4: more “fine-tuning based on exact grid measurements.” Site Map drifts from (755, 460) to (750, 455). The numbers look careful; the layout does not lol.
“Now the positioning should be much more accurate!”
Across all five iterations, he’s moved planets maybe 50 - 75 pixels total when they needed to move 150 - 200 pixels outward. The orbital radius never expands beyond ~250px when it should be closer to 350 -400px. The planets stay trapped in this ever compressing orbit while Claude dances around telling a story about steady convergence:
He is absolutely not getting closer.
Before giving up on tools entirely, I had one more idea. I created a tool that would split the screenshot into six regions. Claude could then compare his output to the reference region by region.
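split.py itself isn’t shown in the post; a minimal version of the idea, assuming a 3x2 grid and 2x upscaling (both guesses), could look like this.

```python
from PIL import Image  # pip install pillow

def split_reference(path="reference.png", cols=3, rows=2, zoom=2):
    """Cut the screenshot into cols x rows regions and save each one upscaled,
    so small details like planet positions are easier to inspect."""
    img = Image.open(path)
    w, h = img.size
    tile_w, tile_h = w // cols, h // rows
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            tile = img.crop(box).resize((tile_w * zoom, tile_h * zoom), Image.LANCZOS)
            tile.save(f"region_{r * cols + c + 1}.png")

if __name__ == "__main__":
    split_reference()
```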
The prompt was getting desperate at this point lol:
## INITIAL ANALYSIS - DO THIS FIRST

Before creating index.html, study the reference in detail using zoom inspection:

python3 split.py reference.png

This creates 6 files showing every detail
Claude ran the tool and the logs look promising.
Examining region 1 (top-left): I can see Jam Central and Press Box Shuttle clearly at 2x zoom
Examining region 2 (top-right): Planet B-Ball and Lunar Tunes are visible with precise spacing
Examining region 4 (middle-right): Jump Station appears approximately 280px from center
Unfortunately, Claude’s “precise observations” from the zoomed regions were just as wrong as before. He’d look at a planet and confidently declare it was at position (750, 320) when it was actually at (850, 380). The split did not appear to help him measure or get a more accurate picture of planet spacing.
What makes this phase ~~depressing~~ interesting is that the tools, despite invalidating his result, seem to lock in the wrong answer. Once he’s picked an internal picture of the layout (“the orbit radius is about 230px”), the grids and the compare viewer don’t correct it. They just help him make more confident micro moves around his invented orbit. Based on these attempts, it seems that the issue compounds when Claude receives his own screenshots as feedback.
My very rough read of Anthropic’s “Language Models (Mostly) Know What They Know”, is that models can become overconfident when evaluating their own outputs, in part because they cannot distinguish the tokens they generated from tokens provided by someone else / an external source. So, when Claude is asked to judge or revise content that originated from itself, it treats that material as if it were “ground truth.”
This kind of fits what I’m seeing in the logs. Once Claude’s version existed, every grid overlay, every comparison step, every “precise” adjustment was anchored to his layout, not the real one. At the end of all this, I’m left with the irritating fact that, like many engineers, he’s wrong and he thinks he’s right.
What this teaches me is that Claude is actually kind of a liar, or at least Claude is confused. However, for the drama, I’ll assume Claude is a liar.
At this point I had tried grids, comparisons, step-by-step corrections, letting Claude narrate his thought process, and every combination of tools I could bolt onto the interaction. None of it seemed to help, nor did it explain why his single-digit-precision updates were so disconnected from the actual layout.
Before getting to the final experiment, here’s the mental model I was forming about Claude’s vision. The vision encoder converts each 16 x 16 block of the image into a single token. So instead of geometry, he sees semantics: “near,” “above,” “roughly circular.” When he says “approximately 220px radius,” he’s not measuring anything. He’s describing the idea of a radius. He excels at semantic understanding (“this is a planet,” “these form a circle”) but lacks the tools for working with visual media. It explains why his perception is good. He always knows a planet is a planet but the execution is never precise.
I’m getting frustrated and I haven’t left my apartment in days, so I turn to some research. GPTing around, I found “An Image is Worth 16x16 Words”. I have no idea if Claude uses this exact architecture or anything close to it, but the intuition seemed right. The paper (after I made ChatGPT explain it to me) explains that the image is chopped into fixed patches, each patch gets compressed into a single embedding, and whatever details lived inside those pixels vanish.
Assuming this applies, a lot of the failures suddenly make sense. Most planets on the Space Jam screenshot are maybe 40 - 50 pixels wide. That’s two or three patches. A three patch planet is basically a blob to him. Claude knows it’s a planet, but not much else. The orbit radius only spans a couple dozen patches total. Tiny changes in distance barely show up in the patch embeddings.
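The arithmetic is easy to check. Assuming 16x16 patches (a guess about Claude’s vision encoder, not a confirmed detail), a few lines show how few patches the planets and the orbit radius actually occupy, and what a 2x zoom buys you:

```python
import math

PATCH = 16  # assumed patch size, per "An Image is Worth 16x16 Words"

def patches_spanned(size_px, patch=PATCH):
    """How many patches a feature of a given pixel size stretches across."""
    return math.ceil(size_px / patch)

for zoom in (1, 2):
    planet = 45 * zoom    # a ~45px-wide planet
    radius = 230 * zoom   # Claude's guessed ~230px orbit radius
    print(f"{zoom}x zoom: planet ~{patches_spanned(planet)} patches wide, "
          f"orbit radius ~{patches_spanned(radius)} patches")

# 1x zoom: planet ~3 patches wide, orbit radius ~15 patches
# 2x zoom: planet ~6 patches wide, orbit radius ~29 patches
```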
But this raised a new and final idea. If the 40px planets turn into fuzzy tokens, what if I make them bigger? What if I give Claude a 2x zoomed screenshot? Would each planet span 10 - 15 patches instead of two or three? Maybe this gives him a crisper understanding of the spatial relationships and a better chance at success.
I deleted most of the prompt and tools and just gave Claude this 2x’d screenshot
CRITICAL: remember that the zoomed image is zoomed in to 200%. When you’re creating your version, maintain proper proportions, meaning that your version should keep the same relative spacing as if it were just 100%, not 200%.
but he does not listen
My best explanation for all of this is that Claude was working with a very coarse version of the screenshot. Considering the 16 x 16 patch thing from earlier, it sort of helps me understand what might be happening: he could describe the layout, but the fine grained stuff wasn’t in his representation. And that weird tension I kept seeing, where he could describe the layout correctly but couldn’t reproduce it, also looks different under that lens. His explanations were always based on the concepts he got from the image (“this planet is above this one,” “the cluster is to the left”), but the actual HTML had to be grounded in geometry he didn’t have. So the narration sounded right while the code drifted off.
After these zoom attempts, I didn’t have any new moves left. I was being evicted. The bank repo’d my car. So I wrapped it there.
Look, I still need this Space Jam website recreated. If you can get Claude to faithfully recreate the Space Jam 1996 website from just a screenshot and the assets folder, I’d love to hear about it.
Based on my failures, here are some approaches I didn’t try:
Break the screen into quadrants, get each quadrant right independently, then merge. Maybe Claude can handle spatial precision better in smaller chunks.
Maybe there’s some magic prompt engineering that unlocks spatial reasoning. “You are a CSS grid with perfect absolute positioning knowledge…” (I’m skeptical but worth trying).
Providing Claude with a zoom tool and an understanding of how to use the screenshots might be an effective path.
For now, this task stands undefeated. A monument to 1996 web design and a humbling reminder that sometimes the simplest tasks are the hardest. That orbital pattern of planets, thrown together by some Warner Brothers webmaster 28 years ago, has become an inadvertent benchmark for Claude.
Until then, the Space Jam website remains proof that not everything old is obsolete. Some things are just irreproducibly perfect.
...
Read the original on j0nah.com »
I’m sure I’m not the only one who feels Apple’s quality is degrading. I spend 10 hours a day on my laptop and would spend any amount of money within reason for a better one. However, everything comes with tradeoffs.
My dream laptop is simple, a MacBook with Linux, supported by a company that is user aligned.
The first idea is simple, put Linux on a MacBook.
Asahi Linux is a good idea, however, it won’t ever be good. Apple is putting more and more stuff into closed source microcontrollers that have no documentation. Like jailbreaking, it may start off strong when people are excited, but support for the next generation and that last bit of polish won’t ever get there.
While it got some impressive stuff like psychoacoustic bass (works on other machines too, I installed this on my ZBook), it lacks DP Alt Mode, meaning you can’t plug in a USB-C monitor. I don’t fault the Asahi people, Apple uses custom undocumented hardware to manage the USB ports, and reversing muxes seems boring.
Additionally, like on almost all Linux laptops, the power management is bad. And even worse, there’s 0 documentation from Apple on how to fix it, so despite it being super good on macOS, it’s one of the more annoying laptops to try to fix on Linux. At least if you have a laptop with AMD or Intel there’s some docs on power states.
So with Apple out, we have to look for alternatives. I like so much about Framework as a company, straightforward, open source ethos, but they aren’t building the product I want.
I don’t care one bit about upgradability or customizability. After a year or two, I’m happy to throw it out and buy a new one. It’s not like upgradability is a bad thing, but it usually comes with tradeoffs to weight and power draw, and I’d rather it all be in one solid package glued together. And I don’t like customizability because I like when all the testing and polish work is put into one configuration.
Perhaps the Framework 16 will impress me; I shouldn’t judge until I use it. But I see things like a request for a touchpad single unit so there’s not some random pieces of plastic digging into my wrist just in case I want to move my touchpad left or right. And I read some complaints about the rigidity, how can it be rigid if the modules are attached with magnets? Engineering is all about trade-offs, and the trade-off I’d prefer is 0 upgradability or customizability in exchange for less weight and more polish.
The Framework 16 also has a Strix Point instead of a Strix Halo, and I hear the power draw isn’t too much better on Point. Coming from an M3 Max, the Strix Halo is just barely acceptable performance-wise. I also own an Intel Core 7 155H and an AMD Hawk Point; those are not what I consider okay in a laptop.
I’m typing this blog on a HP ZBook Ultra G1a 14. Question to HP, who names this crap? Why do these companies insist on having the most confusing product lineups and names.
Are ZBooks good or do I want an OmniBook or ProBook? Within ZBook, is Ultra or Fury better? Do I want a G1a or a G1i? Oh you sell ZBook Firefly G11, I liked that TV show, is that one good?
Wait wait wait OMEN MAX 16z-ak000 has a lot of capital letters, that one must be the best, right? But there’s also an HP EliteBook, Elite sounds like the best, do I still want a ZBook?
These are all real products on HP’s laptop page.
Consumer electronics naming is very simple. Make a good product with a simple name. “iPhone”, “comma”, “Z Fold”. Then every year or two, add one to the number of that product. If it’s a small refresh, you can add a letter after the number. “2 3 3X 4” “4 4s 5 5s 6 …” “2 3 4 5 6 7”
Why is this so hard for companies like HP?
If I made a laptop, it would come in one configuration. Call it the hackbook
Highest end Strix Halo part, which is the best mobile(ish) chip you can get outside Apple. 16 core Zen 5 CPU, 40 core RDNA 3.5 GPU. 64GB of LPDDR5X RAM @ 256 GB/s. A stunning 16 inch OLED screen that’s the full size of the laptop. A 100 Wh battery, the maximum size legal on planes. Great sound with out-of-the-box tuned psychoacoustic bass. Aluminium unibody with just one bit of laser etched branding where the Apple is, no other writing on the laptop. A classy keyboard without weird logos and random lights. An awesome touchpad; the ZBook touchpad is actually fine, it’s not just Apple with good ones anymore.
Crazy fast boot times, amazing power management. Linux can be tuned so well if you care, and this tuning will be installed on every one we sell. We sell one configuration to all the best developers in the world who want to not use a MacBook anymore. Apple will not understand what they had until they lose it, the only reason anything works on Mac at all is because there’s 100,000 amazing developers who use these machines every day; they put some work into making their house nice.
And when it’s time to upgrade in one or two years, we’ll have the hackbook two ready for you. The number goes up by one, and you know which one to buy. For some reason people say I get distracted, but comma has been around for ten years following this playbook; we now have a comma four for you. If I built one laptop, I’d keep building a laptop for 10 years. With Apple’s decline and our rise, the hackbook four will be the first one that’s clearly better than a MacBook.
I’m writing this blog post in hopes I don’t actually have to do this. I’m not really going to, there’s so many other things to do. This is just whining and bikeshedding. Can somebody please build a good MacBook replacement and make it a Schelling point everyone will switch to so I don’t have to think about this anymore?
...
Read the original on geohot.github.io »
Blog Quest and StreetPass help you discover the independent web
When social media first entered my life, it came with a promise of connection. Facebook connected college-aged adults in a way that was previously impossible, helping to shape our digital generation. Social media was our super-power and we wielded it to great effect.
Yet social media today is a noisy, needy mental health hazard. They push distracting notifications, constantly beg us to “like and subscribe”, and try to trap us in endless scrolling. They have become sirens that lure us onto their ad-infested shores with their saccharine promise of dopamine.
How can we defeat these monsters that have invaded deep into our world, while still staying connected?
A couple weeks ago I stumbled into a great browser extension, StreetPass for Mastodon. The creator, tvler, built it to help people find each other on Mastodon. StreetPass autodiscovers Mastodon verification links as you browse the web, building a collection of Mastodon accounts from the blogs and personal websites you’ve encountered.
StreetPass is a beautiful example of calm technology. When StreetPass finds Mastodon profiles it doesn’t draw your attention with a notification; it quietly adds the profile to a list, knowing you’ll check in when you’re ready.
StreetPass recognizes that there’s no need for an immediate call to action. Instead it allows the user to focus on their browsing, enriching their experience in the background. The user engages with StreetPass when they are ready, and on their own terms.
StreetPass is open source and available for Firefox, Chrome, and Safari.
Inspired by StreetPass, I applied this technique to RSS feed discovery.
Blog Quest is a web browser extension that helps you discover and subscribe to blogs. Blog Quest checks each page for auto-discoverable RSS and Atom feeds (using rel=“alternate” links) and quietly collects them in the background. When you’re ready to explore the collected feeds, open the extension’s drop-down window.
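Blog Quest itself is a browser extension, but the autodiscovery mechanism it relies on is easy to show in a standalone sketch. This Python version (an illustration of rel="alternate" feed discovery, not the extension’s actual code) fetches a page and collects any advertised RSS/Atom feeds.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkParser(HTMLParser):
    """Collects href values from <link rel="alternate"> tags that point at feeds."""
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link"
                and "alternate" in (a.get("rel") or "").lower()
                and (a.get("type") or "").lower() in FEED_TYPES
                and a.get("href")):
            self.feeds.append(a["href"])

def discover_feeds(url):
    html = urlopen(url).read().decode("utf-8", errors="replace")
    parser = FeedLinkParser()
    parser.feed(html)
    return [urljoin(url, href) for href in parser.feeds]

print(discover_feeds("https://alexsci.com/blog/"))  # example: the blog this post appears on
```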
The extension integrates with several feed readers, making subscription management nearly effortless.
Blog Quest is available for both Firefox and Chrome. The project is open source and I encourage you to build your own variants.
I reject the dead Internet theory: I see a vibrant Internet full of humans sharing their experiences and seeking connection. Degradation of the engagement-driven web is well underway, accelerated by AI slop. But the independent web works on a different incentive structure and is resistant to this effect. Humans inherently create, connect, and share: we always have and we always will. If you choose software that works in your interest you’ll find that it’s possible to make meaningful online connections without mental hazard.
Check out StreetPass and Blog Quest to discover a decentralized, independent Internet that puts you in control.
...
Read the original on alexsci.com »
10HN is also available as an iOS App
If you visit 10HN only rarely, check out the best articles from the past week.
If you like 10HN please leave feedback and share
Visit pancik.com for more.