10 interesting stories served every morning and every evening.
Ghostty is a fast, feature-rich, and cross-platform terminal emulator that uses platform-native UI and GPU acceleration.
Install Ghostty and run!
Zero configuration required to get up and running.
Ready-to-run binaries for macOS. Packages or build from source for Linux.
...
Read the original on ghostty.org »
Switch to Claude without starting over

Bring your preferences and context from other AI providers to Claude. With one copy-paste, Claude updates its memory and picks up right where you left off. Memory is available on all paid plans.

Import what matters in under a minute

You’ve spent months teaching another AI how you work. That context shouldn’t disappear because you want to try something new. Claude can import what matters, so your first conversation feels like your hundredth.

Copy and paste the provided prompt into a chat with any AI provider. It’s written specifically to help you get all of your context in one chat. Then copy and paste the results into Claude’s memory settings. That’s it! Claude will update its memory and you’re good to go.

Memory that understands how you work

Claude learns your preferences across conversations, keeps project context separate so nothing bleeds together, and lets you see and edit everything it remembers.

Your AI should know you from day one

Start your Pro plan, import your memory when you’re ready, and see for yourself.
...
Read the original on claude.com »
A satirical (but real!) demo of what AI chat could look like in an ad-supported future. Chat with an AI while experiencing every monetization pattern imaginable — banners, interstitials, sponsored responses, freemium gates, and more.
A sampling of the demo’s (fictional) ad copy:

Join 2 million professionals who think faster, focus better, and accomplish more. AI-powered goal tracking, habit building, and memory enhancement. First 30 days FREE!

Think 10x Faster with AI. First Month FREE! 🧠

Did you know? Writing down 3 goals each morning increases productivity by 42%! BrainBoost Pro tracks your daily goals with AI reminders.

Your AI assistant, proudly powered by the finest advertising money can buy 💸

⚠️ Warning: This AI may spontaneously recommend products at any time

🏷️ This conversation is proudly powered by BrainBoost Pro™ • Ad-supported free tier • Remove ads

Stressed by all these ads? 10 minutes of AI-guided meditation changes everything.

AI-curated meal prep kits delivered weekly. $30 off your first box!

🎨 Today’s chat theme sponsored by BrainBoost Pro • Colors, fonts, and vibes curated by our advertising team
This tool is a satirical but fully functional demonstration of what AI chat assistants could look like if they were monetized through advertising — similar to how free apps, websites, and streaming services fund themselves today. As AI chat becomes mainstream, companies face a fundamental question: how do you make it free for users while covering the significant compute costs? Advertising is one obvious answer — and this demo shows every major ad pattern that could be applied to a chat interface.We built this as an educational tool to help marketers, product managers, and developers understand the landscape of AI monetization, and to give users a glimpse of the future they might want to avoid (or embrace, depending on your perspective).
This demo covers the full spectrum of advertising patterns that could appear in an AI chat product.
This tool is educational and useful for a wide range of professionals thinking about the future of AI products.
Are the ads in this demo real?
No — all brands and ads are completely fictional and created for this demo. BrainBoost Pro, QuickLearn Academy, ZenFocus, TaskMaster AI, ReadyMeal, and all other brands are made up. No actual advertising revenue is being generated.

Does this show what AI chat will actually look like?
It shows one possible future. Some ad-supported AI products already exist and use several of these patterns. Others are speculative. The goal is to make these possibilities concrete and tangible so people can have informed conversations about what kind of AI future they want.

Is the AI actually working or is everything scripted?
The AI is real — your messages are processed by a live language model and you get genuine responses. The ads are the scripted part. Some AI responses will include sponsored product mentions as part of the demonstration.

What happens to my chat data?
Like all our free tools, conversations are logged to improve the service. We do not sell this data to advertisers — this is a demo, not an actual ad network.

How does the freemium gate work?
After 5 free messages, you can either ‘watch an ad’ (a simulated 5-second countdown) to unlock 5 more messages, or you can upgrade to our actual ad-free service. This mirrors how real freemium products work.
All of our tools are genuinely free — no ads, no paywalls, no sponsored responses. Just AI that works.
Build Your Own AI Chatbot — No Ads Required

Now that you’ve seen what ad-supported AI looks like, imagine giving your customers a clean, focused AI experience with zero interruptions. With 99helpers, you can deploy an AI chatbot trained on your content in minutes. No credit card required • Setup in minutes • No ads, ever
...
Read the original on 99helpers.com »
It started with two incidents on the same day. In a fairly empty train carriage, a stranger in her 70s approached me: “Do you mind if I sit here? Or did you want to be alone with your thoughts?” I weighed it up for a split second, conscious that I was, in effect, agreeing to a conversation: “No, of course I don’t mind. Sit down.”
She turned out to be an agreeable, kind woman who had had a difficult day. I didn’t have to say much: “I’m sorry to hear that.” “That’s tough for you.” She occasionally asked me questions about myself, which I dodged politely. I could tell she was only asking so the conversation would not be so one-sided. Some moments are for listening, not sharing. I sensed, without needing to know explicitly, that she was probably returning to an empty house and wanted to process the day out loud. I didn’t feel uncomfortable, as I knew I could duck out at any moment by saying I needed to get back to my phone messages. But instead we talked — or, rather, I listened — for most of the 50-minute journey. I registered that it was an unusual occurrence, this connection, but thought little more of it. A small part of me was glad this kind of thing still happens.
That evening, I ate at a restaurant with my family. As the waitress brought the bill, we chatted and I learned that she was from Seoul. She was shy and softly spoken. We talked gently about Korean food and what she missed about home. Once again, I thought little of this exchange.
As we walked home, my 15-year-old son asked: “Is it OK to talk to people in that way?” “What way?” He was asking about the boundaries when it comes to talking to someone about their home country.
This was a very good question. How do you know, generally, what the terms are of a conversation with a stranger? I realised that there is a sort of unwritten code you learn as you get older, which enables you to assess whether a conversation is a good idea or not. I thought about the woman who had approached me earlier. How did she know it was OK to talk to me? In the end, I replied to my son: “You don’t always know if it’s OK. Sometimes you have to take the risk and find out.”
Then it struck me. A lot of people have given up taking a chance on other people: that they might want to listen, that they might want to talk. But they have also given up taking a chance on themselves: that they might be able to navigate a conversation with someone new, cope with knockbacks and steer a path through any misunderstandings.
The disappearance of these kinds of interactions from day-to-day life — in pubs, restaurants, shops, queues, on public transport — is striking. I have been talking to people tangentially about this for the past 10 years, ever since I started researching my book, How to Own the Room, which came out in 2018 and went on to become a podcast. This project was supposed to be about public speaking and confidence. But I realised from people’s reactions to the topic — especially younger people — that their deepest anxiety lies elsewhere, in something much more banal and inexpressible. Forget “public speaking”. What a lot of people don’t like at all any more is “speaking to anyone in public”.
Many reasons are cited: state-of-the-art don’t-talk-to-me headphones, mobile phones and social media generally, the rise of working from home, the introduction of touchscreens in takeaway restaurants so you barely interact with a human, the death of third spaces, the pandemic. In the end, the biggest excuse becomes “social norm reinforcement”. This is the idea that if no one talks to you, you don’t talk to anyone either. A casual conversation in a waiting room where no one else is having a casual conversation suddenly sounds not very casual at all.
On an individual level, some people perfectly understandably cite neurodivergence, introversion, inability to tolerate eye contact or an intense loathing for small talk (especially about the weather) as reasons to avoid these conversations. It’s certainly true that this time six years ago — at the height of lockdown — it would have been rude and unsafe to start a chat, let alone sit next to someone on a train. But now? It can feel as if everyone is still adhering to the 2-metre rule, employing “the tech shield” or even “phantom phone use” (pretending that you need to be on your phone when you don’t).
This goes deeper than adolescent angst or personal preference. And possibly deeper than our overreliance on phones. We are losing a basic human skill. The ability to speak to others and understand them is being compromised.
Dr Jared Cooney Horvath, a teacher turned cognitive neuroscientist who focuses on speech, has warned that gen Z is the first generation in history to underperform the previous generation on cognitive measures. And Dr Rangan Chatterjee, a bestselling author and father of two teenagers, said in an interview this month: “I think we’re raising a generation of children who have low self-worth, who don’t know how to conduct conversations.”
It’s not only affecting young people. The psychologist Esther Perel calls it a “global relational recession”. She writes: “The point is not depth. The point is practice, the gentle strengthening of our social muscles.” On her YouTube channel she recently introduced the topic of Talking to Strangers in 2026.
Something that used to come naturally is now a subject of longing and fascination, as if it were a rare anthropological phenomenon. Videos are springing up on social media, cataloguing encounters with the unknown “other”: earnest, well-meaning, wholesome videos, under the categories “social anxiety”, “extrovert” and “talking to strangers”. Many have the unstated theme of “out and about in the big city”. Some are personal experiments, often extremely ill-advised ones. Can you challenge yourself to tell a joke to an entire train carriage? What happens if you go up to an older woman and tell her she looks beautiful? The (usually young) person doing the filming is often trying to improve themself in some way or attempting to “be braver” or “less socially anxious”. The camera acts as their accountability partner. The people they’re talking to are relegated to the role of “task to be ticked off the list”. Either that or there’s a push towards a Hallmark card effect: “Look, other people are not as horrible as you thought.” (Cue swell of trending motivational audio.)
The trouble with these social media experiments, of course, is that they are performative and individualistic. There’s an element of commodification: the encounter must be ripe for digital packaging. Often it’s not clear if the filming is consensual. The connections are one-way and border on the exploitative or manipulative. They are designed for individual personal growth or free, self-directed therapy (“this made me more confident”) and for clicks and voyeurism (“check out this person’s reaction”). The effect is to make “talking to absolutely anyone” seem even more alienating, fake and narcissistic. This has spawned a secondary genre of parody videos such as the comedian Al Nash’s “A cup of tea with a stranger — an amazing conversation!” In this clip, an irritating interviewer passes tea to a stranger on a park bench under the guise of “helping you with your loneliness”, only for the encounter to turn awkward when the stranger accidentally drops the cup and smashes it.
It’s only natural to fear rejection, humiliation, giving offence or overstepping a boundary when we initiate a conversation — or even when we respond to someone else’s attempt. But according to a study by the University of Virginia (Talking with strangers is surprisingly informative), we overstate these fears in our minds: “People tend to underestimate how much they’ll enjoy the conversation, feel connected to their conversation partner and be liked by their conversation partner.”
The key is to lower the stakes. Make it less of a big deal. Don’t focus on what could go wrong. Also, don’t focus on how amazing this could be. You are just saying, “It’s cold today, isn’t it?” You are not asking someone to join you on a quest for world peace. Similarly, if an approach is made towards you and you don’t want to respond, just be confident and clear either with your gestures (look down, don’t make eye contact) or with speech: “I can’t talk right now.”
In her work on kindness, the University of Sussex psychologist Gillian Sandstrom calls these conversational gambits “small, humanising acts”. It’s important to emphasise the “small” aspect. Sometimes I think people are overwhelmed by the “bigness” in their mind of the fear of interaction, and how disproportionate that seems next to the “smallness” of the pathetic reality. Don’t read too much into passing moments. Trust yourself to read social cues and work out how you stand in relation to them. Know yourself and your own personality. Not everyone wants to talk and not everyone wants to be talked to. And that’s OK. It can depend on the day and on your mood. Give yourself get-out-of-jail-free cards in these conversations. If someone doesn’t respond, assume they didn’t hear you or they’re having a bad day. If someone talks to you and you feel uncomfortable or you’re having a bad day, it is not your job to be kind or nice. If their attempt was well meant, they’ll get over it. We don’t need to avoid each other. But we also don’t have to be on niceness autopilot all the time.
In any case, our worst fears about these interactions are rarely realised. Last year, the team of Stanford psychologist Prof Jamil Zaki, the author of Hope for Cynics: The Surprising Science of Human Goodness, put up posters around campus with messages about approachability and warmth. They found that what students most needed was permission — the reminder to “take a chance”. They concluded: “Too often, we’re sure that conversation and connection will exhaust us, or that we can’t count on others.” In our minds, we paint people (and ourselves) as profoundly disappointing. They — and we — are rarely that bad. And even if they are, it will make a good story to tell later to the people who are not strangers to you.
Is it going to change your life if you talk to someone in a shop about the prospect of rain? Probably not. But in light of the current state of the world, even the slightest possibility of brightening someone’s day is valuable. It’s certainly worth the punt. Perhaps the way they respond matters less than the fact that you retained your humanity enough to try something, to risk, to connect.
Small talk may not profoundly alter your life. But its absence will profoundly alter human life as we know it. We live in a world of intense and often unnecessary division. Small talk is a tiny, free and very possibly priceless reminder of our shared humanity. If we intentionally give up talking to strangers, if we purposely decide to give in to the phone shield, the consequences will be horrible. Arguably, we are already on the verge of doing this. Let’s back up and start a conversation before it’s too late.
...
Read the original on www.theguardian.com »
Let’s pretend we’re farmers with a new plot of land. Given only the Diameter and Height of a tree trunk, we must determine if it’s an Apple, Cherry, or Oak tree. To do this, we’ll use a Decision Tree. Almost every tree with a Diameter ≥ 0.45 is an Oak tree! Thus, we can probably assume that any other trees we find in that region will also be one.
This first decision node will act as our root node. We’ll draw a vertical line at this Diameter and classify everything above it as Oak (our first leaf node), and continue to partition our remaining data on the left. We continue along, hoping to split our plot of land in the most favorable manner. We see that creating a new decision node at Height ≤ 4.88 leads to a nice section of Cherry trees, so we partition our data there.
Our Decision Tree updates accordingly, adding a new leaf node for Cherry.

And Some More

After this second split we’re left with an area containing many Apple and some Cherry trees. No problem: a vertical division can be drawn to separate the Apple trees a bit better.
Once again, our Decision Tree updates accordingly.

And Yet Some More

The remaining region just needs a further horizontal division and boom - our job is done! We’ve obtained an optimal set of nested decisions.
That said, some regions still enclose a few misclassified points. Should we continue splitting, partitioning into smaller sections?
Hmm… If we do, the resulting regions would start becoming increasingly complex, and our tree would become unreasonably deep. Such a Decision Tree would learn too much from the noise of the training examples and not enough generalizable rules.
Does this ring familiar? It is the well known tradeoff that we have explored in our explainer on The Bias Variance Tradeoff! In this case, going too deep results in a tree that overfits our data, so we’ll stop here.
We’re done! We can simply pass any new data point’s Height and Diameter values through the newly created Decision Tree to classify them as either an Apple, Cherry, or Oak tree!
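In code, the finished tree is nothing more than nested conditionals. Here is a sketch in Python (a hypothetical helper of our own; only the two cutoffs quoted above come from the article, and the final two splits are left as placeholders since their exact values aren’t given in the text):

```python
def classify(diameter, height):
    """Walk the nested decisions of the worked example for one data point."""
    if diameter >= 0.45:
        return "Oak"        # the first leaf node
    if height <= 4.88:
        return "Cherry"     # the second leaf node
    # The article's final vertical and horizontal divisions would go here;
    # their exact cutoffs aren't quoted, so we fall back to the majority class.
    return "Apple"

print(classify(diameter=0.5, height=7.0))  # Oak
```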
Decision Trees are supervised machine learning algorithms used for both regression and classification problems. They’re popular for their ease of interpretation and large range of applications. Decision Trees consist of a series of decision nodes on some dataset’s features, and make predictions at leaf nodes.
Scroll on to learn more! Decision Trees are widely used algorithms for supervised machine learning. They’re popular for their ease of interpretation and large range of applications. They work for both regression and classification problems. A Decision Tree consists of a series of sequential decisions, or decision nodes, on some data set’s features. The resulting flow-like structure is navigated via conditional control statements, or if-then rules, which split each decision node into two or more subnodes. Leaf nodes, also known as terminal nodes, represent prediction outputs for the model. To train a Decision Tree from data means to figure out the order in which the decisions should be assembled from the root to the leaves. New data may then be passed from the top down until reaching a leaf node, representing a prediction for that data point.
We just saw how a Decision Tree operates at a high-level: from the top down, it creates a series of sequential rules that split the data into well-separated regions for classification. But given the large number of potential options, how exactly does the algorithm determine where to partition the data? Before we learn how that works, we need to understand Entropy.
Entropy measures the amount of information of some variable or event. We’ll make use of it to identify regions consisting of a large number of similar (pure) or dissimilar (impure) elements. Given a certain set of events that occur with probabilities $p_1, p_2, \ldots, p_n$, the total entropy $H$ can be written as the negative sum of weighted probabilities:

$H = -\sum_{i=1}^{n} p_i \log_2 p_i$

The quantity $H$ has a number of interesting properties:

$H = 0$ only if all but one of the $p_i$ are zero, this one having the value of 1. Thus the entropy vanishes only when there is no uncertainty in the outcome, meaning that the sample is completely unsurprising.

$H$ is maximum when all the $p_i$ are equal. This is the most uncertain, or ‘impure’, situation. Any change towards the equalization of the probabilities increases $H$.

The entropy can be used to quantify the impurity of a collection of labeled data points: a node containing multiple classes is impure whereas a node including only one class is pure. Above, you can compute the entropy of a collection of labeled data points belonging to two classes, which is typical for binary classification problems. Click on the Add and Remove buttons to modify the composition of the bubble. Did you notice that pure samples have zero entropy whereas impure ones have larger entropy values? This is what entropy is doing for us: measuring how pure (or impure) a set of samples is. We’ll use it in the algorithm to train Decision Trees by defining the Information Gain.
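As a concrete illustration (a sketch of our own, not the article’s implementation), the entropy of a collection of labels can be computed in a few lines of Python:

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy, in bits, of a collection of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

print(entropy(["apple"] * 8))                   # 0.0 -- a pure sample
print(entropy(["apple"] * 4 + ["cherry"] * 4))  # 1.0 -- maximally impure (50/50)
```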
With the intuition gained from the animation above, we can now describe the logic used to train Decision Trees. As the name implies, information gain measures the amount of information that we gain. It does so using entropy. The idea is to take the entropy of our data before the split and subtract the (weighted) entropy of each possible partition thereafter. We then select the split that yields the largest reduction in entropy, or equivalently, the largest increase in information.
The core algorithm to calculate information gain is called ID3. It’s a recursive procedure that starts from the root node of the tree and iterates top-down on all non-leaf branches in a greedy manner, calculating at each depth the difference in entropy:

$IG = H(\text{parent}) - \sum_{\text{children}} \frac{N_{\text{child}}}{N_{\text{parent}}} H(\text{child})$

To be specific, the algorithm’s steps are as follows:

1. Calculate the entropy associated with every feature of the data set.
2. Partition the data set into subsets using different features and cutoff values. For each, compute the information gain as the difference in entropy before and after the split using the formula above. For the total entropy of all children nodes after the split, use the weighted average, taking into account $N_{\text{child}}/N_{\text{parent}}$, i.e. how many of the samples end up on each child branch.
3. Identify the partition that leads to the maximum information gain.
4. Create a decision node on that feature and split value.
5. When no further splits can be done on a subset, create a leaf node and label it with the most common class of its data points if doing classification, or with their average value if doing regression.
6. Recurse on all subsets. Recursion stops if, after a split, all elements in a child node are of the same type. Additional stopping conditions may be imposed, such as requiring a minimum number of samples per leaf to continue splitting, or finishing when the trained tree has reached a given maximum depth.

Of course, reading the steps of an algorithm isn’t always the most intuitive thing. To make things easier to understand, let’s revisit how information gain was used to determine the first decision node in our tree. Recall our first decision node split on Diameter ≤ 0.45. How did we choose this condition? It was the result of maximizing information gain.
Each of the possible splits of the data on its two features (Diameter and Height) and cutoff values yields a different value of the information gain.
The line chart displays the different split values for the Diameter feature. Move the decision boundary yourself to see how the data points in the top chart are assigned to the left or right children nodes accordingly. On the bottom you can see the corresponding entropy values of both children nodes as well as the total information gain.
The ID3 algorithm will select the split point with the largest information gain, shown as the peak of the black line in the bottom chart: 0.574 at Diameter = 0.45.
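To make the mechanics concrete, here is a sketch of how information gain picks the best cutoff for a single feature (helpers of our own, reusing the entropy function above; the 0.574 figure comes from the article’s dataset, which we don’t reproduce here):

```python
def information_gain(parent, left, right):
    """Entropy before the split minus the size-weighted entropy after it."""
    n = len(parent)
    after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - after

def best_split(points, labels, feature):
    """Try a cutoff between each pair of adjacent feature values; keep the best."""
    values = sorted({p[feature] for p in points})
    best_cutoff, best_gain = None, 0.0
    for lo, hi in zip(values, values[1:]):
        cutoff = (lo + hi) / 2
        left  = [y for p, y in zip(points, labels) if p[feature] <= cutoff]
        right = [y for p, y in zip(points, labels) if p[feature] >  cutoff]
        gain = information_gain(labels, left, right)
        if gain > best_gain:
            best_cutoff, best_gain = cutoff, gain
    return best_cutoff, best_gain
```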
An alternative to the entropy for the construction of Decision Trees is the Gini impurity. This quantity is also a measure of information and can be seen as a variation of Shannon’s entropy. Decision trees trained using entropy or Gini impurity are comparable, and only in a few cases do results differ considerably. In the case of imbalanced data sets, entropy might be more prudent. Yet Gini might train faster as it does not make use of logarithms.
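For comparison, Gini impurity needs no logarithms, which is why it can be cheaper to compute. Below is a sketch of both the Gini function and a minimal recursive ID3-style trainer, reusing the entropy and best_split helpers above (again, all names are ours, not the article’s):

```python
def gini(labels):
    """Gini impurity: the chance of mislabeling a sample drawn at random."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def build_tree(points, labels, depth=0, max_depth=4):
    """Greedy, recursive ID3-style training for classification."""
    if len(set(labels)) == 1 or depth == max_depth:   # pure node or depth limit:
        return Counter(labels).most_common(1)[0][0]   # emit a majority-class leaf
    # Pick the (feature, cutoff) pair with the largest information gain.
    splits = [(f,) + best_split(points, labels, f) for f in range(len(points[0]))]
    feature, cutoff, gain = max(splits, key=lambda s: s[2])
    if cutoff is None or gain <= 0.0:                 # nothing useful to split on
        return Counter(labels).most_common(1)[0][0]
    left  = [(p, y) for p, y in zip(points, labels) if p[feature] <= cutoff]
    right = [(p, y) for p, y in zip(points, labels) if p[feature] >  cutoff]
    return {"feature": feature, "cutoff": cutoff,
            "left":  build_tree([p for p, _ in left],  [y for _, y in left],
                                depth + 1, max_depth),
            "right": build_tree([p for p, _ in right], [y for _, y in right],
                                depth + 1, max_depth)}
```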
Another Look At Our Decision Tree

Let’s recap what we’ve learned so far. First, we saw how a Decision Tree classifies data by repeatedly partitioning the feature space into regions according to some conditional series of rules. Second, we learned about entropy, a popular metric used to measure the purity (or lack thereof) of a given sample of data. Third, we learned how Decision Trees use entropy in information gain and the ID3 algorithm to determine the exact conditional series of rules to select. Taken together, the three sections detail the typical Decision Tree algorithm.
To reinforce concepts, let’s look at our Decision Tree from a slightly different perspective.
The tree below maps exactly to the tree we showed in the How to Build a Decision Tree section above. However, instead of showing the partitioned feature space alongside our tree’s structure, let’s look at the partitioned data points and their corresponding entropy at each node itself:
From the top down, our sample of data points to classify shrinks as it gets partitioned to different decision and leaf nodes. In this manner, we could trace the full path taken by a training data point if we so desired. Note also that not every leaf node is pure: as discussed previously (and in the next section), we don’t want the structure of our Decision Trees to be too deep, as such a model likely won’t generalize well to unseen data.
Without question, Decision Trees have a lot of things going for them. They’re simple models that are easy to interpret. They’re fast to train and require minimal data preprocessing. And they handle outliers with ease. Yet they suffer from a major limitation, and that is their instability compared with other predictors. They can be extremely sensitive to small perturbations in the data: a minor change in the training examples can result in a drastic change in the structure of the Decision Tree. Check for yourself how small random Gaussian perturbations on just 5% of the training examples create a set of completely different Decision Trees:
Why Is This A Problem?

In their vanilla form, Decision Trees are unstable.
If left unchecked, the ID3 algorithm to train Decision Trees will work endlessly to minimize entropy. It will continue splitting the data until all leaf nodes are completely pure - that is, consisting of only one class. Such a process may yield very deep and complex Decision Trees. In addition, we just saw that Decision Trees are subject to high variance when exposed to small perturbations of the training data.
Both issues are undesirable, as they lead to predictors that fail to clearly distinguish between persistent and random patterns in the data, a problem known as overfitting. This is problematic because it means that our model won’t perform well when exposed to new data. There are ways to prevent excessive growth of Decision Trees by pruning them: for instance, constraining their maximum depth, limiting the number of leaves that can be created, or setting a minimum size for each leaf so that leaves with too few items are not allowed.
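In practice, these pruning constraints are exposed directly by off-the-shelf libraries. A sketch using scikit-learn (the toy Diameter/Height values below are made up for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data in the spirit of the example: [Diameter, Height] per tree.
X = [[0.2, 3.0], [0.3, 5.5], [0.5, 9.0], [0.6, 8.0], [0.25, 4.0], [0.55, 10.0]]
y = ["Cherry", "Apple", "Oak", "Oak", "Cherry", "Oak"]

clf = DecisionTreeClassifier(
    criterion="entropy",   # or "gini"
    max_depth=3,           # constrain the maximum depth
    max_leaf_nodes=8,      # limit the number of leaves that can be created
    min_samples_leaf=1,    # minimum number of items allowed in each leaf
)
clf.fit(X, y)
print(clf.predict([[0.5, 7.0]]))  # e.g. ['Oak']
```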
As for the issue of high variance? Well, unfortunately it’s an intrinsic characteristic when training a single Decision Tree.
...
Read the original on mlu-explain.github.io »
Yes, writing code is easier than ever.
AI assistants autocomplete your functions. Agents scaffold entire features. You can describe what you want in plain English and watch working code appear in seconds. The barrier to producing code has never been lower.
And yet, the day-to-day life of software engineers has gotten more complex, more demanding, and more exhausting than it was two years ago.
This is not a contradiction. It is the reality of what happens when an industry adopts a powerful new tool without pausing to consider the second-order effects on the people using it.
If you are a software engineer reading this and feeling like your job quietly became harder while everyone around you celebrates how easy everything is now, you are not imagining things. The job changed. The expectations changed. And nobody sent a memo.
There is a phenomenon happening right now that most engineers feel but struggle to articulate. The expected output of a software engineer in 2026 is dramatically higher than it was in 2023. Not because anyone held a meeting and announced new targets. Not because your manager sat you down and explained the new rules. The baseline just moved.
It moved because AI tools made certain tasks faster. And when tasks become faster, the assumption follows immediately: you should be doing more. Not in the future. Now.
A February 2026 study published in Harvard Business Review tracked 200 employees at a U.S. tech company over eight months. The researchers found something that will sound familiar to anyone living through this shift. Workers did not use AI to finish earlier and go home. They used it to do more. They took on broader tasks, worked at a faster pace, and extended their hours, often without anyone asking them to. The researchers described a self-reinforcing cycle: AI accelerated certain tasks, which raised expectations for speed. Higher speed made workers more reliant on AI. Increased reliance widened the scope of what workers attempted. And a wider scope further expanded the quantity and density of work.
The numbers tell the rest of the story. Eighty-three percent of workers in the study said AI increased their workload. Burnout was reported by 62 percent of associates and 61 percent of entry-level workers. Among C-suite leaders? Just 38 percent. The people doing the actual work are carrying the intensity. The people setting the expectations are not feeling it the same way.
This gap matters enormously. If leadership believes AI is making everything easier while engineers are drowning in a new kind of complexity, the result is a slow erosion of trust, morale, and eventually talent.
A separate survey of over 600 engineering professionals found that nearly two-thirds of engineers experience burnout despite their organizations using AI in development. Forty-three percent said leadership was out of touch with team challenges. Over a third reported that productivity had actually decreased over the past year, even as their companies invested more in AI tooling.
The baseline moved. The expectations rose. And for many engineers, no one acknowledged that the job they signed up for had fundamentally changed.
Here is something that gets lost in all the excitement about AI productivity: most software engineers became engineers because they love writing code.
Not managing code. Not reviewing code. Not supervising systems that produce code. Writing it. The act of thinking through a problem, designing a solution, and expressing it precisely in a language that makes a machine do exactly what you intended. That is what drew most of us to this profession. It is a creative act, a form of craftsmanship, and for many engineers, the most satisfying part of their day.
Now they are being told to stop.
Not explicitly, of course. Nobody walks into a standup and says “stop writing code.” But the message is there, subtle and persistent. Use AI to write it faster. Let the agent handle the implementation. Focus on higher-level tasks. Your value is not in the code you write anymore, it is in how well you direct the systems that write it for you.
For early adopters, this feels exciting. It feels like evolution. For a significant portion of working engineers, it feels like being told that the thing they spent years mastering, the skill that defines their professional identity, is suddenly less important.
One engineer captured this shift perfectly in a widely shared essay, describing how AI transformed the engineering role from builder to reviewer. Every day felt like being a judge on an assembly line that never stops. You just keep stamping those pull requests. The production volume went up. The sense of craftsmanship went down.
This is not a minor adjustment. It is a fundamental shift in professional identity. Engineers who built their careers around deep technical skill are being asked to redefine what they do and who they are, essentially overnight, without any transition period, training, or acknowledgment that something significant was lost in the process.
Having led engineering teams for over two decades, I have seen technology shifts before. New frameworks, new languages, new methodologies. Engineers adapt. They always have. But this is different because it is not asking engineers to learn a new way of doing what they do. It is asking them to stop doing the thing that made them engineers in the first place and become something else entirely.
That is not an upgrade. That is a career identity crisis. And pretending it is not happening does not make it go away.
While engineers are being asked to write less code, they are simultaneously being asked to do more of everything else.
More product thinking. More architectural decision-making. More code review. More context switching. More planning. More testing oversight. More deployment awareness. More risk assessment.
The scope of what it means to be a “software engineer” expanded dramatically in the last two years, and it happened without a pause to catch up.
This is partly a direct consequence of AI acceleration. When code gets produced faster, the bottleneck shifts. It moves from implementation to everything surrounding implementation: requirements clarity, architecture decisions, integration testing, deployment strategy, monitoring, and maintenance. These were always part of the engineering lifecycle, but they were distributed across roles. Product managers handled requirements. QA handled testing. DevOps handled deployment. Senior architects handled system design.
Now, with AI collapsing the implementation phase, organizations are quietly redistributing those responsibilities to the engineers themselves. The Harvard Business Review study documented this exact pattern. Product managers began writing code. Engineers took on product work. Researchers started doing engineering tasks. Roles that once had clear boundaries blurred as workers used AI to handle jobs that previously sat outside their remit.
The industry is openly talking about this as a positive development. Engineers should be “T-shaped” or “full-stack” in a broader sense. Nearly 45 percent of engineering roles now expect proficiency across multiple domains. AI tools augment generalists more effectively, making it easier for one person to handle multiple components of a system.
On paper, this sounds empowering. In practice, it means that a mid-level backend engineer is now expected to understand product strategy, review AI-generated frontend code they did not write, think about deployment infrastructure, consider security implications of code they cannot fully trace, and maintain a big-picture architectural awareness that used to be someone else’s job.
That is not empowerment. That is scope creep without a corresponding increase in compensation, authority, or time.
From my experience building and scaling teams in fintech and high-traffic platforms, I can tell you that role expansion without clear boundaries always leads to the same outcome: people try to do everything, nothing gets done with the depth it requires, and burnout follows. The engineers who survive are the ones who learn to say no, to prioritize ruthlessly, and to push back when the scope of their role quietly doubles without anyone acknowledging it.
There is an irony at the center of the AI-assisted engineering workflow that nobody wants to talk about: reviewing AI-generated code is often harder than writing the code yourself.
When you write code, you carry the context of every decision in your head. You know why you chose this data structure, why you handled this edge case, why you structured the module this way. The code is an expression of your thinking, and reviewing it later is straightforward because the reasoning is already stored in your memory.
When AI writes code, you inherit the output without the reasoning. You see the code, but you do not see the decisions. You do not know what tradeoffs were made, what assumptions were baked in, what edge cases were considered or ignored. You are reviewing someone else’s work, except that someone is not a colleague you can ask questions. It is a statistical model that produces plausible-looking code without any understanding of your system’s specific constraints.
A survey by Harness found that 67 percent of developers reported spending more time debugging AI-generated code, and 68 percent spent more time reviewing it than they did with human-written code. This is not a failure of the tools. It is a structural property of the workflow. Code review without shared context is inherently more demanding than reviewing code you participated in creating.
Yet the expectation from management is that AI should be making everything faster. So engineers find themselves in a bind: they are producing more code than ever, but the quality assurance burden has increased, the context-per-line-of-code has decreased, and the cognitive load of maintaining a system they only partially built is growing with every sprint.
This is the supervision paradox. The faster AI generates code, the more human attention is required to ensure that code actually works in the context of a real system with real users and real business constraints. The production bottleneck did not disappear. It moved from writing to understanding, and understanding is harder to speed up.
What makes all of this especially difficult is the self-reinforcing nature of the cycle.
AI makes certain tasks faster. Faster tasks create the perception of more available capacity. More perceived capacity leads to more work being assigned. More work leads to more AI reliance. More AI reliance leads to more code that needs review, more context that needs to be maintained, more systems that need to be understood, and more cognitive load on engineers who are already stretched thin.
The Harvard Business Review researchers described this as “workload creep.” Workers did not consciously decide to work harder. The expansion happened naturally, almost invisibly. Each individual step felt reasonable. In aggregate, it produced an unsustainable pace.
Before AI, there was a natural ceiling on how much you could produce in a day. That ceiling was set by thinking speed, typing speed, and the time it takes to look things up. It was frustrating sometimes, but it was also a governor. A natural speed limit that prevented you from outrunning your own ability to maintain quality.
AI removed the governor. Now the only limit is your cognitive endurance. And most people do not know their cognitive limits until they have already blown past them.
This is where many engineers find themselves right now. Shipping more code than any quarter in their career. Feeling more drained than any quarter in their career. The two facts are not unrelated.
The trap is that it looks like productivity from the outside. Metrics go up. Velocity charts look great. More features shipped. More pull requests merged. But underneath the numbers, quality is quietly eroding, technical debt is accumulating faster than it can be addressed, and the people doing the work are running on fumes.
If the picture is difficult for experienced engineers, it is even harder for those starting their careers.
Junior engineers have traditionally learned by doing the simpler, more task-oriented work. Fixing small bugs. Writing straightforward features. Implementing well-defined tickets. This hands-on work built the foundational understanding that eventually allowed them to take on more complex challenges.
AI is rapidly consuming that training ground. If an agent can handle the routine API hookup, the boilerplate module, the straightforward CRUD endpoint, what is left for a junior engineer to learn from? The expectation is shifting toward needing to contribute at a higher level almost from day one, without the gradual ramp-up that previous generations of engineers relied on.
Entry-level hiring at the 15 largest tech firms fell 25 percent from 2023 to 2024. The HackerRank 2025 Developer Skills Report confirmed that expectations are rising faster than productivity gains, and that early-career hiring remains sluggish compared to senior-level roles. Companies are prioritizing experienced talent, but the pipeline that produces experienced talent is being quietly dismantled.
This is a problem that extends beyond individual career concerns. If junior engineers do not get the opportunity to build foundational skills through hands-on work, the industry will eventually face a shortage of senior engineers who truly understand the systems they oversee. You cannot supervise what you never learned to build.
As I have written before, code is for humans to read. If the next generation of engineers never develops the fluency to read, understand, and reason about code at a deep level, no amount of AI tooling will compensate for that gap.
If you lead engineering teams, the most important thing you can do right now is acknowledge that this transition is genuinely difficult. Not theoretically. Not abstractly. For the actual people on your team.
The career they signed up for changed fast. The skills they were hired for are being repositioned. The expectations they are working under shifted without a clear announcement. Acknowledging this reality is not a sign of weakness. It is a prerequisite for maintaining a team that trusts you.
Start with empathy, but do not stop there.
Give your team real training. Not a lunch-and-learn about prompt engineering. Real investment in the skills that the new engineering landscape actually requires: system design, architectural thinking, product reasoning, security awareness, and the ability to critically evaluate code they did not write. These are not trivial skills. They take time to develop, and your team needs structured support to build them.
Give them space to experiment without the pressure of immediate productivity gains. The engineers who will thrive in this environment are the ones who have room to figure out how AI fits into their workflow without being penalized for the learning curve. Every experienced technologist I know who has successfully integrated AI tools went through an adjustment period where they were less productive before they became more productive. That adjustment period is normal, and it needs to be protected.
Set explicit boundaries around role scope. If you are asking engineers to take on product thinking, planning, and risk assessment in addition to their technical work, name it. Define it. Compensate for it. Do not let it happen silently and then wonder why your team is burned out.
Rethink your metrics. If your engineering success metrics are still centered on velocity, tickets closed, and lines of code, you are measuring the wrong things in an AI-assisted world. System stability, code quality, decision quality, customer outcomes, and team health are better indicators of whether your engineering organization is actually producing value or just producing volume.
Protect the junior pipeline. If you have stopped hiring junior engineers because AI can handle entry-level tasks, you are solving a short-term efficiency problem by creating a long-term talent crisis. The senior engineers you rely on today were junior engineers who learned by doing the work that AI is now consuming. That path still matters.
And finally, keep challenging your team. I have never met a good engineer who did not love a good challenge. The engineers on your team are not fragile. They are capable, intelligent people who signed up for hard problems. They can handle this transition. Just make sure they are set up to meet it.
If you are an engineer navigating this shift, here is what I would tell you based on two decades of watching technology cycles reshape this profession.
First, do not abandon your fundamentals. The pressure to become an “AI-first” engineer is real, but the engineers who will be most valuable in five years are the ones who deeply understand the systems they work on. AI is a tool. Understanding architecture, debugging complex systems, reasoning about performance and security: these skills are not becoming less important. They are becoming more important because someone needs to be the adult in the room when AI-generated code breaks in production at 2 AM.
Second, learn to set boundaries with the acceleration trap. Just because you can produce more does not mean you should. Sustainable pace matters. The engineers who burn out trying to match the theoretical maximum output AI makes possible are not the ones who build lasting careers. The ones who learn to work with AI deliberately, choosing when to use it and when to think independently, are the ones who will still be thriving in this profession a decade from now.
Third, embrace the parts of the expanded role that genuinely interest you. If the engineering role now includes more product thinking, more architectural decision-making, more cross-functional communication, treat that as an opportunity rather than an imposition. These are skills that senior engineers and technical leaders need. You are being given access to a broader set of capabilities earlier in your career than any previous generation of engineers. That is not a burden. It is a head start.
Fourth, talk about what you are experiencing. The isolation of feeling like you are the only one struggling with this transition is one of the most damaging aspects of the current moment. You are not the only one. The data confirms it. Two-thirds of engineers report burnout. The expectation gap between leadership and engineering teams is well documented. Talking openly about these challenges, with your team, with your manager, with your broader network, is not complaining. It is professional honesty.
And fifth, remember that this profession has survived every prediction of its demise. COBOL was supposed to eliminate programmers. Expert systems were supposed to replace them. Fourth-generation languages, CASE tools, visual programming, no-code platforms, outsourcing. Every decade brings a new technology that promises to make software engineers obsolete, and every decade the demand for skilled engineers grows. AI will not be different. The tools change. The fundamentals endure.
AI made writing code easier and made being an engineer harder. Both of these things are true at the same time, and pretending that only the first one matters is how organizations lose their best people.
The engineers who are struggling right now are not struggling because they are bad at their jobs. They are struggling because their jobs changed underneath them while the industry celebrated the part that got easier and ignored the parts that got harder.
Expectations rose without announcement. Roles expanded without boundaries. Output demands increased without corresponding increases in support, training, or acknowledgment. And the engineers who raised concerns were told, implicitly or explicitly, that they just needed to adapt faster.
That is not how you build a sustainable engineering culture. That is how you build a burnout machine.
The industry needs to name this paradox honestly. AI is an incredible tool. It is also placing enormous new demands on the people using it. Both things can be true. Both things need to be addressed.
The organizations that get this right, that invest in their people alongside their tools, that acknowledge the human cost of rapid technological change while still pushing forward, those are the organizations that will attract and retain the best engineering talent in the years ahead.
The ones that do not will discover something that every technology cycle eventually teaches: tools do not build products. People do. And people have limits that no amount of AI can automate away.
If this resonated with you, I would love to hear your perspective. What has changed most about your engineering role in the last year? Drop me a message or connect with me on LinkedIn. I write regularly about the intersection of AI, software engineering, and leadership at ivanturkovic.com. Follow along if you want honest, experience-driven perspectives on how technology is actually changing this profession.
...
Read the original on www.ivanturkovic.com »
I’m going to make a bold claim: MCP is already dying. We may not fully realize it yet, but the signs are there. OpenClaw doesn’t support it. Pi doesn’t support it. And for good reason.
When Anthropic announced the Model Context Protocol, the industry collectively lost its mind. Every company scrambled to ship MCP servers as proof they were “AI first.” Massive resources poured into new endpoints, new wire formats, new authorization schemes, all so LLMs could talk to services they could already talk to.
I’ll admit, I never fully understood the need for it. You know what LLMs are really good at? Figuring things out on their own. Give them a CLI and some docs and they’re off to the races.
I tried to avoid writing this for a long time, but I’m convinced MCP provides no real-world benefit, and that we’d be better off without it. Let me explain.
LLMs are really good at using command-line tools. They’ve been trained on millions of man pages, Stack Overflow answers, and GitHub repos full of shell scripts. When I tell Claude to use gh pr view 123, it just works.
MCP promised a cleaner interface, but in practice I found myself writing the same documentation anyway: what each tool does, what parameters it accepts, and more importantly, when to use it. The LLM didn’t need a new protocol.
When Claude does something unexpected with Jira, I can run the same jira issue view command and see exactly what it saw. Same input, same output, no mystery.
With MCP, the tool only exists inside the LLM conversation. Something goes wrong and now I’m spelunking through JSON transport logs instead of just running the command myself. Debugging shouldn’t require a protocol decoder.
This is where the gap gets wide. CLIs compose. I can pipe through jq, chain with grep, redirect to files. This isn’t just convenient; it’s often the only practical approach.
With MCP, your options are dumping the entire payload into the context window (expensive, often impossible) or building custom filtering into the MCP server itself. Either way, you’re doing more work for a worse result. The CLI approach uses tools that already exist, are well-documented, and that both humans and agents understand.
MCP is unnecessarily opinionated about auth. Why should a protocol for giving an LLM tools to use need to concern itself with authentication?
CLI tools don’t care. aws uses profiles and SSO. gh uses gh auth login. kubectl uses kubeconfig. These are battle-tested auth flows that work the same whether I’m at the keyboard or Claude is driving. When auth breaks, I fix it the way I always would: aws sso login, gh auth refresh. No MCP-specific troubleshooting required.
Local MCP servers are processes. They need to start up, stay running, and not silently hang. In Claude Code, they’re spawned as child processes, which works until it doesn’t.
CLI tools are just binaries on disk. No background processes, no state to manage, no initialization dance. They’re there when you need them and invisible when you don’t.
Beyond the design philosophy, MCP has real day-to-day friction:
Initialization is flaky. I’ve lost count of the times I’ve restarted Claude Code because an MCP server didn’t come up. Sometimes it works on retry, sometimes I’m clearing state and starting over.
Re-auth never ends. Using multiple MCP tools? Have fun authenticating each one. CLIs with SSO or long-lived credentials just don’t have this problem. Auth once and you’re done.
Permissions are all-or-nothing. Claude Code lets you allowlist MCP tools by name, but that’s it. You can’t scope to read-only operations or restrict parameters. With CLIs, I can allowlist gh pr view but require approval for gh pr merge. That granularity matters.
I’m not saying MCP is completely useless. If a tool genuinely has no CLI equivalent, MCP might be the right call. I still use plenty in my day-to-day, when it’s the only option available.
I might even argue there’s some value in having a standardized interface, and that there are probably use cases where it makes more sense than a CLI.
But for the vast majority of work, the CLI is simpler, faster to debug, and more reliable.
The best tools are the ones that work for both humans and machines. CLIs have had decades of design iteration. They’re composable, debuggable, and they piggyback on auth systems that already exist.
MCP tried to build a better abstraction. Turns out we already had a pretty good one.
If you’re a company investing in an MCP server but you don’t have an official CLI, stop and rethink what you’re doing. Ship a good API, then ship a good CLI. The agents will figure it out.
...
Read the original on ejholmes.github.io »
Researchers at Oregon State University have created a new nanomaterial designed to destroy cancer cells from the inside. The material activates two separate chemical reactions once inside a tumor cell, overwhelming it with oxidative stress while leaving surrounding healthy tissue unharmed.
The work, led by Oleh Taratula, Olena Taratula, and Chao Wang from the OSU College of Pharmacy, was published in Advanced Functional Materials.
The discovery strengthens the growing field of chemodynamic therapy or CDT. This emerging cancer treatment strategy takes advantage of the unique chemical conditions found inside tumors. Compared with normal tissue, cancer cells tend to be more acidic and contain higher levels of hydrogen peroxide.
Traditional CDT uses these tumor conditions to spark the formation of hydroxyl radicals, highly reactive molecules made of oxygen and hydrogen that contain an unpaired electron. These reactive oxygen species damage cells through oxidation, stripping electrons from essential components such as lipids, proteins, and DNA.
More recent CDT approaches have also succeeded in generating singlet oxygen inside tumors. Singlet oxygen is another reactive oxygen species, named for its single electron spin state rather than the three spin states seen in the more stable oxygen molecules present in the air.
“However, existing CDT agents are limited,” Oleh Taratula said. “They efficiently generate either radical hydroxyls or singlet oxygen but not both, and they often lack sufficient catalytic activity to sustain robust reactive oxygen species production. Consequently, preclinical studies often only show partial tumor regression and not a durable therapeutic benefit.”
To address these shortcomings, the team developed a new CDT nanoagent built from an iron-based metal-organic framework or MOF. This structure is capable of producing both hydroxyl radicals and singlet oxygen, increasing its cancer-fighting potential. The MOF demonstrated strong toxicity across multiple cancer cell lines while causing minimal harm to noncancerous cells.
“When we systemically administered our nanoagent in mice bearing human breast cancer cells, it efficiently accumulated in tumors, robustly generated reactive oxygen species and completely eradicated the cancer without adverse effects,” Olena Taratula said. “We saw total tumor regression and long-term prevention of recurrence, all without seeing any systemic toxicity.”
In these preclinical experiments, tumors disappeared entirely and did not return, and the animals showed no signs of harmful side effects.
Before moving into human trials, the researchers plan to test the treatment in additional cancer types, including aggressive pancreatic cancer, to determine whether the approach can be effective across a wide range of tumors.
Other contributors to the study included Oregon State researchers Kongbrailatpam Shitaljit Sharma, Yoon Tae Goo, Vladislav Grigoriev, Constanze Raitmayr, Ana Paula Mesquita Souza, and Manali Parag Phawde. Funding was provided by the National Cancer Institute of the National Institutes of Health and the Eunice Kennedy Shriver National Institute of Child Health and Human Development.
...
Read the original on www.sciencedaily.com »
* Lectures: MW[F] 9:30–10:50 Tepper 1403 (note: Friday lectures will only be used for review sessions or makeup lectures when needed)
A minimal free version of this course will be offered online, simultaneous to the CMU offering, starting on 1/26 (with a two-week delay from the CMU course). This means that course materials (lecture videos, assignments available on mugrade, etc.) will become available to the online course after the dates indicated in the schedule below. In other words, anyone will be able to watch lecture videos for the course and submit (autograded) assignments, though not quizzes, midterms, or the final. Enroll here to receive emails on lectures and homeworks once they are available. Note that the information here about TAs, office hours, grading, prerequisites, etc., is for the CMU version, not the online offering.
This course provides an introduction to how modern AI systems work. By “modern AI”, we specifically mean the machine learning methods and large language models (LLMs) behind systems like ChatGPT, Gemini, and Claude. [Note]
Despite their seemingly amazing generality, the basic techniques that underlie these AI models are surprisingly simple: a minimal LLM implementation leverages a fairly small set of machine learning methods and architectures, and can be written in a few hundred lines of code.
This course will guide you through the core methods needed to implement a basic AI chatbot. You will learn the basics of supervised machine learning, large language models, and post-training. By the end of the course you will be able to write the code that runs an open-source LLM from scratch, as well as train such models on a corpus of data. The material we cover will include:
* Post-training
The topics above are a general framing of what the course will cover. However, as this course is being offered for the first time in Spring 2026, some elements are likely to change over the first offering.
* Programming: 15-112 or 15-122. You must be proficient in basic Python programming, including object-oriented methods.
* Math: 21-111 or 21-120. The course will use basic methods from differential calculus, including computing derivatives. Some familiarity with linear algebra and probability is also beneficial, but these topics will be covered to the extent needed for the course.
A major component of the course will be the development of a minimal AI chatbot through a series of programming assignments. Homeworks are submitted using the mugrade system (tutorial video). Some assignments build on previous ones, though for the in-class CMU version we’ll distribute solutions to help you work through any errors that may have cropped up in previous assignments (for the online version, we’d suggest talking to others who were able to complete the assignment). In addition to the (main) programming aspect, some homeworks may contain a shorter written portion that works out some of the mathematical details behind the approach.
All homeworks are released as Colab notebooks, at the links below. We are also releasing Marimo notebook versions. The mugrade version of the online assignment will be available two weeks after the release dates for the CMU course.
Each homework will be accompanied by an in-class (15 minute) quiz that assesses basic questions based upon the assignment. This will include replicating (at a high level) some of the code you wrote for the assignment, or answering conceptual questions about the assignment. All quizzes are closed book and closed notes.
In addition to the homework quizzes, there will be 3 in-person exams: two midterms and a final (during the finals period). The midterms will focus only on material covered during that section of the course, while the final will be cumulative (but with an emphasis on the last third of the course). All midterms and the final are closed book and closed notes.
Lecture schedule is tentative and will be updated over the course of semester. All materials will be available to the online course two weeks after the dates here.
Students are permitted to use AI assistants for all homework and programming assignments (especially as a reference for understanding any topics that seem confusing), but we strongly encourage you to complete your final submitted version of your assignment without AI. You cannot use any such assistants, or any external materials, during in-class evaluations (both the homework quizzes and the midterms and final).
The rationale behind this policy is a simple one: AI can be extremely helpful as a learning tool (and to be clear, as an actual implementation tool), but over-reliance on these systems can currently be a detriment to learning in many cases. You absolutely need to learn how to code and do other tasks using AI tools, but turning in AI-generated solutions for the relatively short assignments we give you can (at least in our current experience) ultimately lead to substantially less understanding of the material. The choice is yours on assignments, but we believe that you will ultimately perform much better on the in-class quizzes and exams if you do work through your final submitted homework solutions yourself.
...
Read the original on modernaicourse.org »
Andrej Karpathy wrote a 200-line Python script that trains and runs a GPT from scratch, with no libraries or dependencies, just pure Python. The script contains the algorithm that powers LLMs like ChatGPT.
Let’s walk through it piece by piece and watch each part work. Andrej did a walkthrough on his blog, but here I take a more visual approach, tailored for beginners.
The model trains on 32,000 human names, one per line: emma, olivia, ava, isabella, sophia… Each name is a document. The model’s job is to learn the statistical patterns in these names and generate plausible new ones that sound like they could be real.
By the end of training, the model produces names like “kamon”, “karai”, “anna”, and “anton”. The model has learned which characters tend to follow which, which sounds are common at the start vs. the end, and how long a typical name runs. From ChatGPT’s perspective, your conversation is just a document. When you type a prompt, the model’s response is a statistical document completion.
Neural networks work with numbers, not characters. So we need a way to convert text into a sequence of integers and back. The simplest possible tokenizer assigns one integer to each unique character in the dataset. The 26 lowercase letters get ids 0 through 25, and we add one special token called BOS (Beginning of Sequence) with id 26 that marks where a name starts and ends.
Type a name below and watch it get tokenized. Each character maps to its integer id, and BOS tokens wrap both ends:
The integer values themselves have no meaning. Token 4 isn’t “more” than token 2. Each token is just a distinct symbol, like assigning a different color to each letter. Production tokenizers like tiktoken (used by GPT-4) work on chunks of characters for efficiency, giving a vocabulary of ~100,000 tokens, but the principle is the same.
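To make this concrete, here is a minimal sketch of a character-level tokenizer like the one described above (the function names and exact BOS handling are illustrative, not necessarily microgpt’s):

```python
# Minimal character-level tokenizer sketch: 'a'..'z' -> 0..25, BOS = 26.
BOS = 26  # special token marking where a name starts and ends

def encode(name):
    return [BOS] + [ord(c) - ord('a') for c in name] + [BOS]

def decode(tokens):
    return ''.join(chr(t + ord('a')) for t in tokens if t != BOS)

print(encode("emma"))          # [26, 4, 12, 12, 0, 26]
print(decode([4, 12, 12, 0]))  # "emma"
```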
Here’s the core task: given the tokens we’ve seen so far, predict what comes next. We slide through the sequence one position at a time. At position 0, the model sees only BOS and must predict the first letter. At position 1, it sees BOS and the first letter and must predict the second letter. And so on.
Step through the sequence below and watch the context grow while the target shifts forward:
Each step produces one training example: the context on the left is the input, the green token on the right is what the model should predict. For the name “emma”, that’s five input-target pairs. This sliding window is how all language models train, including ChatGPT.
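A small sketch of that sliding window, reusing the `encode` helper from the tokenizer sketch above (the pair-building code is illustrative):

```python
# One (context, target) pair per position of "emma".
tokens = encode("emma")  # [26, 4, 12, 12, 0, 26]
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs:
    print(context, "->", target)
# [26] -> 4
# [26, 4] -> 12
# ...
# [26, 4, 12, 12, 0] -> 26   (five pairs in total)
```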
At each position, the model outputs 27 raw numbers, one per possible next token. These numbers (called logits) can be anything: positive, negative, large, small. We need to convert them into probabilities that are positive and sum to 1. Softmax does this by exponentiating each score and dividing by the total.
Adjust the logits below and watch the probability distribution change. Notice how one large logit dominates, and the exponential amplifies differences.
Here’s the actual softmax code from microgpt. Step through it to see the intermediate values at each line:
The subtraction of the max value before exponentiating doesn’t change the result mathematically (dividing numerator and denominator by the same constant cancels out) but prevents overflow. Without it, exponentiating a sufficiently large logit would produce infinity.
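A sketch of a numerically stable softmax consistent with that description (the actual microgpt code may differ in detail):

```python
import math

def softmax(logits):
    m = max(logits)                           # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))  # ~[0.66, 0.24, 0.10] -- positive, sums to 1
```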
How wrong was the prediction? We need a single number that captures “the model thought the correct answer was unlikely.” If the model assigns probability 0.9 to the correct next token, the loss is low (0.1). If it assigns probability 0.01, the loss is high (4.6). The formula is loss = −log(p), where p is the probability the model assigned to the correct token. This is called the cross-entropy loss.
Drag the slider to adjust the probability of the correct token and watch the loss change:
The curve has two properties that make it useful. First, it’s zero when the model is perfectly confident in the right answer (p = 1). Second, it goes to infinity as the model assigns near-zero probability to the truth (p → 0), which punishes confident wrong answers severely. Training minimizes this number.
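In code, the loss for one position is just the negative log of the probability assigned to the correct token (a sketch, not the verbatim microgpt implementation):

```python
import math

def cross_entropy(probs, target):
    return -math.log(probs[target])

print(round(-math.log(0.9), 2))   # 0.11 -- confident and right: low loss
print(round(-math.log(0.01), 2))  # 4.61 -- confident and wrong: high loss
```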
To improve, the model needs to answer: “for each of my 4,192 parameters, if I nudge it up by a tiny amount, does the loss go up or down, and by how much?” Backpropagation computes this by walking the computation backward, applying the chain rule at each step.
Every mathematical operation (add, multiply, exp, log) is a node in a graph. Each node remembers its inputs and knows its local derivative. The backward pass starts at the loss (where the gradient is trivially 1.0) and multiplies local derivatives along every path back to the inputs.
Step through the forward pass, then the backward pass for a small example where loss = a·b + b, with a = 3:
Now step through the actual Value class code. Watch how each operation records its children and local gradients, then how backward() walks the graph in reverse, accumulating gradients:
Notice that b has a gradient of 4.0, not 3.0. That’s because b is used in two places: once in the multiplication (a·b) and once in the addition (+ b). The gradients from both paths sum up: 3.0 + 1.0 = 4.0. This is the multivariable chain rule in action. If a value contributes to the loss through multiple paths, the total derivative is the sum of contributions from each path. This is the same algorithm that PyTorch’s loss.backward() runs, operating on scalars instead of tensors. Same algorithm, just smaller and slower.
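Here is a minimal, micrograd-style sketch of such a Value class. It is written to illustrate the idea above, not copied from microgpt, and the choice of b = 2 in the example is just for demonstration:

```python
class Value:
    """A scalar that remembers how it was computed, so gradients can flow back."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # the Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # visit nodes in topological order, then walk that order in reverse
        order, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            for child, local in zip(v._children, v._local_grads):
                child.grad += v.grad * local  # chain rule; contributions sum over paths

# The small example above: loss = a*b + b, with a = 3 (b = 2 chosen arbitrarily here).
a, b = Value(3.0), Value(2.0)
loss = a * b + b
loss.backward()
print(a.grad, b.grad)  # 2.0 4.0 -- b's gradient sums both paths: 3.0 + 1.0
```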
We know how to measure error and how to trace that error back to every parameter. Now let’s build the model itself, starting with how it represents tokens.
A raw token id like 4 is just an index. The model can’t do math with a bare integer. So each token looks up a learned vector (a list of 16 numbers) from an embedding table. Think of it as each token having a 16-dimensional “personality” that the model can adjust during training.
Position matters too. The letter “a” at position 0 plays a different role than “a” at position 4. So there’s a second embedding table indexed by position. The token embedding and position embedding are added together to form the input to the rest of the network.
Click a token below to see its embedding vectors and how they combine:
The embedding values start as small random numbers and get tuned during training. After training, tokens that behave similarly (like vowels) tend to end up with similar embedding vectors. The model learns these representations from scratch, with no prior knowledge of what a vowel is.
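A sketch of the lookup-and-add step (table sizes and initialization are illustrative; only the 27-token vocabulary and 16-dimensional embeddings come from the article):

```python
import random
random.seed(0)

vocab_size, n_embd, max_positions = 27, 16, 16   # max_positions is an assumption

tok_emb = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(vocab_size)]
pos_emb = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(max_positions)]

def embed(token_id, position):
    # token embedding + position embedding, elementwise
    return [t + p for t, p in zip(tok_emb[token_id], pos_emb[position])]

x = embed(4, 0)   # token 'e' at position 0 -> a 16-dimensional input vector
print(len(x))     # 16
```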
How tokens talk to each other
This is how transformers work. At each position, the model needs to gather information from previous positions. It does this through attention: each token produces three vectors from its embedding.
A Query (“what am I looking for?”), a Key (“what do I contain?”), and a Value (“what information do I offer if selected?”). The query at the current position is compared against all keys from previous positions via dot products. High dot product means high relevance. Softmax converts these scores into attention weights, and the weighted sum of values is the output.
Explore the attention weights below. Each cell shows how much one position attends to another. Switch between the four attention heads to see different patterns:
The gray region in the upper-right is the causal mask. Position 2 can’t attend to position 4 because position 4 hasn’t happened yet. This is what makes the model autoregressive: each position only sees the past. Different heads learn different patterns. One head might attend strongly to the most recent token. Another might focus on the BOS token (to remember “we’re generating a name”). A third might look for vowels. The four heads run in parallel, each operating on a 4-dimensional slice of the 16-dimensional embedding, and their outputs are concatenated and projected back to 16 dimensions.
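Below is a sketch of a single causal attention head over a list of token vectors, written in plain Python to mirror the description above. The projection matrices, scaling, and helper names are illustrative (a real model uses trained weights and runs four such heads in parallel):

```python
import math, random
random.seed(0)

n_embd, head_dim = 16, 4

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    return [dot(row, x) for row in W]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]

Wq, Wk, Wv = (rand_matrix(head_dim, n_embd) for _ in range(3))  # untrained, for illustration

def attention_head(xs):
    """xs: one 16-dim vector per position. Returns one 4-dim vector per position."""
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    out = []
    for t, q in enumerate(qs):
        # causal mask: position t only scores keys at positions 0..t
        scores = [dot(q, ks[j]) / math.sqrt(head_dim) for j in range(t + 1)]
        weights = softmax(scores)
        mixed = [sum(w * vs[j][d] for j, w in enumerate(weights)) for d in range(head_dim)]
        out.append(mixed)
    return out
```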
The model pipes each token through: embed, normalize, attend, add residual, normalize, MLP, add residual, project to output logits. The MLP (multilayer perceptron) is a two-layer feed-forward network: project up to 64 dimensions, apply a ReLU (zero out negatives), project back to 16. If attention is how tokens communicate, the MLP is where each position thinks independently.
Step through the pipeline for one token and watch data flow through each stage:
Here’s the actual gpt() function from microgpt. Step through to see the code executing line by line, with the intermediate vector at each stage:
The residual connections (the “Add” steps) are load-bearing. Without them, gradients would shrink to near-zero by the time they reach the early layers, and training would stall. The residual connection gives gradients a shortcut, which is why deep networks can train at all. RMSNorm (root-mean-square normalization) rescales each vector to have unit root-mean-square. This prevents activations from growing or shrinking as they pass through the network, which stabilizes training. GPT-2 used LayerNorm; RMSNorm is simpler and works just as well.
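A sketch of RMSNorm and the residual wiring described above; `attend` and `mlp` stand in for the real sublayers, and the learned scale parameter that real implementations typically include is omitted for brevity:

```python
import math

def rmsnorm(x, eps=1e-5):
    # rescale the vector to (roughly) unit root-mean-square
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def add(x, y):  # the residual connection: a shortcut for gradients
    return [a + b for a, b in zip(x, y)]

def block(x, attend, mlp):
    x = add(x, attend(rmsnorm(x)))  # normalize, attend, add residual
    x = add(x, mlp(rmsnorm(x)))     # normalize, MLP, add residual
    return x
```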
The training loop repeats 1,000 times: pick a name, tokenize it, run the model forward over every position, compute the cross-entropy loss at each position, average the losses, backpropagate to get gradients for every parameter, and update the parameters to make the loss a bit lower.
The optimizer is Adam, which is smarter than naive gradient descent. It maintains a running average of each parameter’s recent gradients (momentum) and a running average of the squared gradients (for an adaptive, per-parameter step size). Parameters that have been getting consistent gradients take larger steps. Parameters that have been oscillating take smaller ones.
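One Adam update for a single scalar parameter might look like the sketch below (the hyperparameter values are common defaults, not necessarily microgpt’s):

```python
def adam_step(param, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # running average of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad * grad     # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (v_hat ** 0.5 + eps)    # larger steps where gradients are consistent
    return param, m, v
```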
Watch the loss decrease over 1,000 training steps. The model starts at ~3.3 (random guessing among 27 tokens: −ln(1/27) ≈ 3.3) and settles around 2.37. The generated names evolve from gibberish to plausible:
Step through the code for one complete training iteration. Watch it pick a name, run the forward pass at each position, compute the loss, run backward, and update the parameters:
Once training is done, inference is straightforward. Start with BOS, run the forward pass, get 27 probabilities, randomly sample one token, feed it back in, and repeat until the model outputs BOS again (meaning “I’m done”) or we hit the maximum length.
Temperature controls how we sample. Before softmax, we divide the logits by the temperature. A temperature of 1.0 samples directly from the learned distribution. Lower temperatures sharpen the distribution (the model picks its top choices more often). Higher temperatures flatten it (more diverse but potentially less coherent output).
Adjust the temperature and watch the probability distribution change:
Step through the inference loop to see a name being generated character by character. At each step, the model runs forward, produces probabilities, and samples the next token:
A temperature approaching 0 would always pick the highest-probability token (greedy decoding). This produces the most “average” output. A temperature of 1.0 matches what the model actually learned. Values above 1.0 inject extra randomness, which can produce creative outputs but also nonsense. The sweet spot for names is around 0.5.
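Putting it together, the generation loop might look like the sketch below. It reuses `BOS`, `decode`, and `softmax` from the earlier sketches; `model_forward` is a hypothetical function that returns the 27 next-token logits for the tokens so far:

```python
import random

def sample_name(model_forward, temperature=0.5, max_len=20):
    tokens = [BOS]
    while len(tokens) < max_len:
        logits = model_forward(tokens)
        probs = softmax([l / temperature for l in logits])  # temperature scaling before softmax
        next_token = random.choices(range(27), weights=probs)[0]
        if next_token == BOS:    # the model says "I'm done"
            break
        tokens.append(next_token)
    return decode(tokens[1:])    # drop the leading BOS and map ids back to characters
```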
Everything else is efficiency
This 200-line script contains the complete algorithm. Between this and ChatGPT, little changes conceptually. The differences are things like: trillions of tokens instead of 32,000 names. Subword tokenization (100K vocabulary) instead of characters. Tensors on GPUs instead of scalar Value objects in Python. Hundreds of billions of parameters instead of 4,192. Hundreds of layers instead of one. Training across thousands of GPUs for months.
But the loop is the same. Tokenize, embed, attend, compute, predict the next token, measure surprise, walk the gradients backward, nudge the parameters. Repeat.
...
Read the original on growingswe.com »