10 interesting stories served every morning and every evening.

The Newest Instagram "Exploit" is the Goofiest I've Seen

www.0xsid.com

Yesterday, a slew of Instagram ac­counts, in­clud­ing some high pro­file ones like the Obama White House ac­count, seem­ingly got hacked.

Look, I’m no spring chicken. I’ve spent almost a decade and a half identifying vulnerabilities and exploits at unicorn scale, but this is hands down the most unserious, “almost too stupid to be true” of them all.

The Takeover Flow

Step 01: Faking the Location & Initiating Support

All the attacker needs to kick this off is your account username. Then, they hop on a VPN or proxy close to your city so Instagram’s security algorithms don’t suspect a thing. (Your city is quite easy to get from your public profile, the “About” section, or a hundred other ways.) Once it looks like the request is coming from the correct region, they tell the Meta support AI that the account is hacked and ask it to send the verification codes to an arbitrary email address they control.

Step 02: That’s It

Really, that’s it. The first proper zero-auth password reset I’ve seen in production. There appears to be no additional check as to whether the email being given is actually something the user has used before. Once the AI sends the security code to the attacker’s email, the attacker passes it right back to complete the verification. The platform hands over a fresh password reset link, granting full ownership to the attacker.
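For contrast, here is a minimal sketch of the kind of check that appears to be missing: only send a recovery code to an address that was already verified on the account. Every name here (`Account`, `verified_emails`, `may_send_recovery_code`) is hypothetical and for illustration only; it says nothing about how Meta's systems are actually built.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Account:
    # Hypothetical account record; emails are stored lowercased.
    username: str
    verified_emails: Set[str] = field(default_factory=set)


def may_send_recovery_code(account: Account, requested_email: str) -> bool:
    """Allow a recovery code only for an address already verified on the account."""
    return requested_email.strip().lower() in account.verified_emails


victim = Account("victim", {"owner@example.com"})
print(may_send_recovery_code(victim, "owner@example.com"))     # known address
print(may_send_recovery_code(victim, "attacker@example.net"))  # arbitrary address
```

A support agent, human or AI, that has to pass through a gate like this cannot be talked into mailing codes to a brand-new address, no matter how nicely it is asked.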

Instagram’s AI may or may not ask the at­tacker for a video selfie to prove iden­tity. It’s not par­tic­u­larly dis­cern­ing at the mo­ment, so some­thing as sim­ple as an AI an­i­mated pub­lic photo from the tar­get’s feed has been widely re­ported to work.

2FA Doesn’t Help

In case you’re wondering, because the system treats this high-privilege recovery flow as a total account reset by the “true” owner, the original 2FA gets thoroughly bypassed in the process.

Existing ses­sions are re­voked and the pass­word changed with no email, text, or push no­ti­fi­ca­tion. The ac­tual owner can’t ini­ti­ate re­cov­ery be­cause the email and phone num­bers now map to the at­tacker. There’s no hu­man to es­ca­late to, it’s just you ar­gu­ing with a chat hop­ing to take con­trol back while pray­ing they don’t do it again.

And if you’re part of the A/B tested ac­counts on which the AI sup­port op­tion is ac­tive, tough luck, you can’t even turn it off.

Black Markets Galore

Multiple black market Telegram groups have sprung up offering “account takeover” services at steep rates and quick turnaround times. Considering short handles are worth hundreds of thousands to even millions of dollars, it’s not a surprise, really.

Accounts have been flipped, like hey, or been used for pro­pa­ganda, like oba­mawhite­house or ocmssf, the ac­count of the Chief Master Sergeant of the U.S. Space Force.

Patched Now

All the Telegram groups have qui­eted down as Meta seems to have patched it al­ready, but it ap­pears this par­tic­u­lar method was ac­tive for weeks, if not months.

The very fact that a $1.5 trillion company lacks robust guard rails, and that its support AI will just change anyone’s linked email if you ask it nicely enough, would be terrifying if it weren’t so funny.

If you’ve reached this far, thank you for read­ing! :)

I thought ex­it­ing and re­tir­ing in my mid 30s would be fun but I’ve just been bored and de­pressed with­out morn­ing Slacks and emails to wake up to. If you’re build­ing some­thing in­ter­est­ing and could use an ex­tra set of hands to help ship it, feel free to reach out. My in­box is open.

Malicious npm releases detected across `@redhat-cloud-services/` scope · Issue #492 · RedHatInsights/javascript-clients

github.com

A 10 year old Xeon is all you need - point.free

point.free

Published on June 01, 2026

17 min­utes read

The pre­vi­ous post cov­ered get­ting Gemma 4’s MTP drafters quan­tized and paired with a ver­i­fier. This one is about run­ning the re­sult on a ma­chine that has no busi­ness run­ning it.

I have a recycled server. To its credit, it has a whopping 128 GB of RAM, but it’s DDR3… That RAM is 5-6 times slower than the current best laptop RAM. It also has a single Intel Xeon E5-2620 v4 from 2016, which is about 5 times slower than my laptop’s CPU…

Oh, and as I did men­tion, we have no GPU. And no, the Xeon does not have an in­te­grated GPU.

But, just hear me out…

If we were to just break out ol­lama here, well… as ex­plained in ear­lier blog posts, we can’t. And we’d be lucky if we could in 6 months when they add sup­port for the model we need, if they ever do. Might be they never do. And even still, ol­lama sim­ply does­n’t ex­pose enough knobs for us to ever make this run well, nei­ther does even the stan­dard llama-cpp.

But. Why would that stop us?

I’ve received feedback that some of the previous posts were too high level, so I’ll try to make things as clear as reasonably possible here. If you’re a tech worker, or a Linux enthusiast who has built a computer and used something like ChatGPT, most of this should be approachable.

So, just to re­ally set the stage fully. The hard­ware, per lscpu:

CPU: Intel Xeon E5-2620 v4 @ 2.10 GHz

Cores: 8 phys­i­cal, 16 threads

Instruction sets: AVX2 (no AVX-512, no AVX-VNNI, no BF16)

Cache: 20 MiB L3, 2 MiB L2 to­tal

Memory: 128 GB DDR3

GPU: none

For LLM in­fer­ence, mem­ory band­width is the lim­it­ing re­source. Every to­ken gen­er­ated re­quires haul­ing gi­ga­bytes of weights from RAM into the CPU cache.

When you use a tool like ChatGPT and watch the text stream onto your screen word by word, you are watching the “decoder pass”. During this phase, the model generates the output one piece (or “token”) at a time.

In this step, the system’s raw processing power is rarely the bottleneck. Instead, the limitation is memory bandwidth. To calculate that next word, the processor has to constantly pull massive amounts of data. That data is the “weights” that contain the model’s learned knowledge, and it has to move from memory into the compute cores.

The proces­sor ex­e­cutes the re­quired ma­trix cal­cu­la­tions so quickly that it is left sit­ting idle, wait­ing for the hard­ware to phys­i­cally move the next chunk of weights across the mem­ory bus. In tra­di­tional soft­ware terms, de­cod­ing is heav­ily mem­ory-bound, not com­pute-bound.

This is the so-called “memory wall”, one of the single biggest performance hurdles now, whether you’re on a Xeon or an H100.
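To put a rough number on that wall, here is a back-of-the-envelope sketch. It assumes every generated token streams all active weights from RAM exactly once, so decode speed cannot exceed bandwidth divided by bytes per token. The bandwidth figures are made-up illustrations, not measurements from this machine, and one byte per weight is only an approximation for Q8_0 (which also stores per-block scales). The 3.8B figure is Gemma 4 26B-A4B's approximate active parameter count.

```python
def max_tokens_per_second(
    bandwidth_gb_s: float,
    active_params_billion: float,
    bytes_per_weight: float = 1.0,
) -> float:
    """Upper bound on decode speed if each token streams all active weights once."""
    bytes_per_token = active_params_billion * 1e9 * bytes_per_weight
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical bandwidth figures, for illustration only.
for label, bandwidth in [("assumed 20 GB/s", 20.0), ("assumed 100 GB/s", 100.0)]:
    ceiling = max_tokens_per_second(bandwidth, active_params_billion=3.8)
    print(f"{label}: at most {ceiling:.1f} tokens/s")
```

Real throughput lands below this ceiling, but the shape of the formula is the point: faster cores do not appear in it anywhere.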

Naively running llama-cli on a DDR3 machine without a GPU is horrendously slow, even if it can run it, because it’s optimized for a generic GPU use case and often leaves a lot of improvements on the table. Further, it simply doesn’t have most of the optimizations that the state of the art currently uses to run these at scale.

The rem­edy is to pull every op­ti­miza­tion lever ik_l­lama.cpp ex­poses. Most of them are slightly ob­scure.

Here is the magic spell that makes this ac­tu­ally run.

llama-cli \
  --model gemma-4-26B-A4B-it-Q8_0.gguf \
  --model-draft gemma-4-26B-A4B-it-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf \
  --spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune \
  -cnv --color --jinja --special \
  -sm graph -smgs -sas -mea 256 --split-mode-f32 \
  --temp 0.7 -t 8 --parallel 8 \
  --cpu-moe --merge-up-gate-experts \
  --flash-attn on --mla-use 3 \
  --mlock --run-time-repack --no-kv-offload

Under a black­box tool like ol­lama you never see this line. On ag­ing hard­ware you have to un­der­stand what each flag does, be­cause half of them won’t take, and the en­gine will tell you so in pass­ing.

Speculative de­cod­ing.

--spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune

This pairs the 26B verifier with the small drafter from the previous post: up to three tokens per draft (--draft-max 3), all probabilities accepted (--draft-p-min 0.0), and --spec-autotune adjusting the chain length per workload.

This ties di­rectly back to our pre­vi­ous dis­cus­sion about the mem­ory-bound de­coder pass.

When a model uses a long rea­son­ing chain, it is gen­er­at­ing those thinking” to­kens one by one. Even if the in­ter­nal rea­son­ing is hid­den from the user and all you see is a short fi­nal an­swer, the hard­ware still has to per­form a full de­coder pass for every sin­gle to­ken in that hid­den chain.

In fact, speculative decoding is currently one of the most brilliant software workarounds the AI industry has invented to bypass the “memory wall”, and spec autotune is how you squeeze the maximum speed out of it.

The ar­gu­ment for spec­u­la­tive de­cod­ing is stronger on CPU than on GPU. CPU com­pute is cheap rel­a­tive to the cost of stream­ing the ver­i­fier’s weights through cache, so spend­ing ex­tra cy­cles on a tiny drafter whose ac­tive lay­ers eas­ily fit in L3 buys to­kens at very lit­tle mar­ginal cost. The drafter’s work­ing set fits in L3. The ver­i­fier how­ever spills out of every­thing.
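The draft-then-verify loop can be sketched in a few lines. This is a toy with greedy decoding only, not ik_llama.cpp's implementation: `verifier` and `drafter` are made-up stand-in functions, and a real engine scores all drafted positions in one batched pass and uses a sampling-aware acceptance rule when temperature is above zero.

```python
from typing import List, Tuple


def verifier(ctx: List[int]) -> int:
    # Toy stand-in for the large model: an arbitrary deterministic rule.
    return (sum(ctx) * 31 + len(ctx)) % 50


def drafter(ctx: List[int]) -> int:
    # Toy stand-in for the small model: agrees with the verifier most of the time.
    guess = verifier(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def plain_decode(prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(verifier(out))  # one expensive pass per token
    return out


def speculative_decode(prompt: List[int], n_new: int, draft_max: int = 3) -> Tuple[List[int], int]:
    out = list(prompt)
    verifier_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. Draft up to draft_max tokens with the cheap model.
        draft: List[int] = []
        for _ in range(draft_max):
            draft.append(drafter(out + draft))
        # 2. One verifier pass checks every drafted position.
        verifier_passes += 1
        accepted: List[int] = []
        for tok in draft:
            target = verifier(out + accepted)
            if tok != target:
                accepted.append(target)  # fix the first wrong guess, drop the rest
                break
            accepted.append(tok)
        else:
            accepted.append(verifier(out + accepted))  # bonus token if all drafts matched
        out.extend(accepted)
    return out[: len(prompt) + n_new], verifier_passes


tokens, passes = speculative_decode([1, 2, 3], n_new=20)
print("same output as plain decoding:", tokens == plain_decode([1, 2, 3], 20))
print("verifier passes used:", passes, "instead of 20")
```

The output is identical to decoding with the verifier alone. The only thing that changes is how many times the big model's weights have to cross the memory bus.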

CPU and MoE rout­ing.

--cpu-moe --merge-up-gate-experts -t 8 --parallel 8

Gemma 4 26B-A4B has 128 experts with 8 active per token, giving about 3.8B active parameters out of ~25.2B total. --cpu-moe tunes the routing for CPU cache hierarchies.

CPUs handle memory very differently than GPUs. While a GPU has a massive pool of ultra-fast High-Bandwidth Memory (HBM), a CPU relies on small, lightning-fast “caches” (L1, L2, L3) built directly onto the processor chip.

In an MoE model, constantly jumping around between 128 different experts can cause “cache thrashing”, where the CPU constantly has to dump its cache and fetch new weights from the much slower main system RAM (normally DDR4/DDR5, we’re on DDR3!).

This flag tells the router to be smarter about how it picks experts, optimizing the sequence so the weights stay neatly inside the CPU’s local cache for as long as possible.

--merge-up-gate-experts fuses two per-expert projections into a single matmul, which the logs confirm:

fused_up_gate = 1

This is a soft­ware trick to by­pass the mem­ory band­width bot­tle­neck we dis­cussed ear­lier.

Inside the experts, the math operations require data to be passed through different layers. Normally, the processor would calculate an “up projection”, write the result to memory, then load the weights for a “gate projection”, calculate that, and combine them. That requires moving data across the memory bus multiple times.

Instead of do­ing two sep­a­rate trips over the mem­ory bus, it com­bines the op­er­a­tions into a sin­gle step.
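A tiny pure-Python sketch of the idea: stack the up and gate weight matrices once at load time, then do a single matrix-vector product and split the result. The matrices, the helper names, and the SiLU activation are all arbitrary stand-ins chosen for illustration; this is not the engine's kernel.

```python
import math
from typing import List

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def silu(v: float) -> float:
    return v / (1.0 + math.exp(-v))


def gated_unfused(w_up: Matrix, w_gate: Matrix, x: List[float]) -> List[float]:
    up = matvec(w_up, x)      # first pass over the input
    gate = matvec(w_gate, x)  # second pass over the input
    return [silu(g) * u for g, u in zip(gate, up)]


def gated_fused(w_up_gate: Matrix, x: List[float], hidden: int) -> List[float]:
    both = matvec(w_up_gate, x)  # one pass; rows are [up rows..., gate rows...]
    up, gate = both[:hidden], both[hidden:]
    return [silu(g) * u for g, u in zip(gate, up)]


w_up = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5]]
w_gate = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
x = [1.0, 2.0, 3.0]
w_up_gate = w_up + w_gate  # the "merge", done once up front

print(gated_fused(w_up_gate, x, hidden=2) == gated_unfused(w_up, w_gate, x))
```

The arithmetic is the same either way. What the fused form saves is a second walk over the input and a second kernel launch per expert.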

-t 8 matches phys­i­cal cores. The ma­chine has 16 SMT threads but only 8 cores. On a mem­ory-bound work­load, over­sub­scrib­ing threads adds sched­ul­ing cost with­out adding through­put: the cores are wait­ing on DDR3, not on each other.

Memory pin­ning, repack­ing, KV cache.

--mlock --run-time-repack --no-kv-offload

--run-time-repack reorganizes weight matrices in memory immediately before inference to match the CPU’s cache layout. The logs confirm:

============ Repacked 265 ten­sors

Processors have their own ul­tra-fast, built-in mem­ory called caches (L1, L2, and L3). However, these caches ex­pect data to be fed to them in very spe­cific shapes and sizes.

If the AI’s weight matrices are sitting in system RAM in a generic layout, the CPU has to awkwardly pull the data in pieces, resulting in “cache misses” where the CPU stalls. --run-time-repack tells the engine to spend a few seconds during startup to physically reorganize the massive tables of numbers in the RAM so they perfectly align with how the CPU wants to ingest them. It pays a small time penalty upfront to guarantee maximum memory bandwidth during the actual text generation.

--mlock is meant to pin the model in RAM so the OS cannot swap any of it to disk.

mlock stands for “memory lock”, surprising, I know! In standard operating systems, if the system starts running out of RAM, it will quietly take data that hasn’t been used in a few seconds and “swap” (or page) it to the physical hard drive.

If an OS tries to swap out 27GB of AI weights to a disk, the generation speed will instantly drop to zero while the system chokes trying to read it back. --mlock tells the Linux kernel: “Pin this 27GB strictly in physical RAM. Do not ever move it to the disk.”

Notice that if you’re not care­ful, you’ll see this:

warning: failed to mlock 27628376064-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).

The flag is fine; the ker­nel-side mem­lock limit is­n’t set high enough to pin a 27 GB buffer. This is not an LLM-shaped prob­lem at all — it’s a ulimit de­fault — and it’s the kind of foot­gun the black­box tools pa­per over by sim­ply not ask­ing for the op­ti­miza­tion in the first place.

Consider that for a moment: by default, many tools will happily let your model be pushed into swap if the OS decides that’s the best option. You can imagine how much this can hurt performance…
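If you want to know ahead of time whether the warning above is coming, you can ask the kernel what the current limit is. A minimal sketch using Python's standard `resource` module on Linux; `memlock_allows` is a helper name made up here, and the byte count is simply the buffer size from the warning. Note that a process with elevated privileges may be allowed to lock more than the limit suggests.

```python
import resource


def memlock_allows(n_bytes: int) -> bool:
    """Return True if the current soft RLIMIT_MEMLOCK would allow locking n_bytes."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft == resource.RLIM_INFINITY or soft >= n_bytes


def describe(limit: int) -> str:
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"


MODEL_BYTES = 27_628_376_064  # buffer size reported in the mlock warning

soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print("soft limit:", describe(soft_limit))
print("hard limit:", describe(hard_limit))
print("can pin the model:", memlock_allows(MODEL_BYTES))
```

If the last line prints False, raise the limit (for example with ulimit -l in the shell that launches the engine) before expecting --mlock to take.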

--no-kv-offload tells the engine not to look for a GPU for the KV cache. There isn’t one to find, but the flag short-circuits the check.

The KV (Key-Value) cache is the AI’s short-term memory: it stores the context of the current conversation so the model doesn’t have to re-read the entire prompt for every new token.

Because the KV cache is constantly being read from and written to, AI engines usually try to “offload” it to a GPU, which has much faster memory than we do.

Since this spe­cific setup is highly op­ti­mized to run purely on a CPU, let­ting the en­gine search the hard­ware buses for a GPU that does­n’t ex­ist is a waste of time and could throw an er­ror. This flag ex­plic­itly short-cir­cuits that check, telling the en­gine to just keep the short-term mem­ory in the sys­tem RAM along­side the weights.

Graph lay­out.

I’ve tried my best to keep this easy to understand, but this part is just plain hard to explain in a single blog post.

Now onto dark arts. A common frustration in bleeding-edge AI software is that the engine is being developed so fast that the developers don’t have time to write official documentation. If you want to know how to optimize the engine, you have to dig through the raw code or read the GitHub Pull Request (PR) comments between the developers.

-sm graph -smgs -sas -mea 256 --split-mode-f32

These flags govern how the computational graph is allocated across memory regions. There is some documentation, but the full details ultimately live in the code.

The flag -sm graph tells the en­gine to use Split Mode in the Graph mode (often known in the in­dus­try as Tensor Parallelism). This is en­tirely about how you di­vide the mas­sive math work­load across mul­ti­ple proces­sors or mem­ory re­gions (like mul­ti­ple CPU sock­ets or GPUs).

Layer Split (The Default/Fallback): The engine slices the model horizontally. Processor A calculates Layers 1-10, then sends the data over the system bus to Processor B, which calculates Layers 11-20. While Processor A is working, Processor B is sitting idle.

Graph Split (The Goal): The en­gine slices the com­pu­ta­tional graph ver­ti­cally. Processor A and Processor B cal­cu­late dif­fer­ent halves of Layer 1 at the ex­act same time, com­bine their an­swers, and move to Layer 2 to­gether. This keeps all hard­ware run­ning at 100% si­mul­ta­ne­ously, dras­ti­cally im­prov­ing gen­er­a­tion speed.
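The two split styles can be contrasted with a toy. In the sketch below each "layer" is just a matrix-vector product, and graph split hands each worker a slice of the layer's output rows, then stitches the partial results together before moving on. The workers run one after another here purely for clarity, where a real engine would run them concurrently, and all names are made up for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def run_unsplit(layers: List[Matrix], x: List[float]) -> List[float]:
    # Layer by layer on one processor (what each stage of a layer split does).
    for w in layers:
        x = matvec(w, x)
    return x


def run_graph_split(layers: List[Matrix], x: List[float], n_workers: int = 2) -> List[float]:
    for w in layers:
        chunk = (len(w) + n_workers - 1) // n_workers
        # Each worker owns a contiguous block of output rows of the SAME layer.
        partials = [matvec(w[i:i + chunk], x) for i in range(0, len(w), chunk)]
        # Combine the partial answers, then everyone moves to the next layer together.
        x = [value for part in partials for value in part]
    return x


layers = [
    [[1.0, 2.0], [0.5, -1.0]],
    [[2.0, 0.0], [1.0, 1.0]],
]
x0 = [3.0, 4.0]
print(run_graph_split(layers, x0) == run_unsplit(layers, x0))
```

The answer is the same either way. The difference is that with graph split no worker ever waits for another worker's whole block of layers to finish.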

On this run, the en­gine de­clines:

=======================================================
Split mode 'graph' is not supported for Gemma4 external MTP
  => changing split mode to 'layer'
=======================================================

Because MTP creates a much more complicated web of math at the very end of the network, this inference engine simply doesn’t yet support safely “graph splitting” (vertically slicing) an MTP architecture. When the engine boots up, it detects the MTP layers, realizes -sm graph will break the math, and safely downgrades to the slower, sequential layer split so the model can still run.

I’ve in­cluded it be­cause it will likely be very help­ful in the fu­ture, so you should try your luck if you’re work­ing on a newer ver­sion.

While -sm graph was dis­abled, these other flags still ap­ply to how the en­gine man­ages mem­ory:

-sas (Split Across Sockets): Explicitly tells the engine how to divide the workload across different physical CPU sockets (NUMA nodes) on a server motherboard. You may note we only have one CPU, but we could get more later. It’s a nice optimization; just bench it to be safe if you do this, since older boards may break current-day assumptions.

--split-mode-f32: When data is split across processors, it has to be stitched back together. This flag forces those intermediate connection points to use 32-bit floating-point precision (higher quality math). It prevents the AI from losing intelligence or hallucinating due to rounding errors during the split.

And don’t worry if you see this:

Oops: ten­sor with strange name rope_freqs.weight

It has a strange name. Strange names will not stop us here. :D

Attention.

Look. ikawrakow, creator of ik_llama.cpp, is beyond the word “cracked”.

Kawrakow wrote cus­tom CPU ker­nels to han­dle Flash Attention, by­pass­ing the need for a GPU dur­ing heavy con­text pro­cess­ing.

This lets us do something that normally you only do on a GPU.

--flash-attn on --mla-use 3

Flash Attention fuses the at­ten­tion soft­max with its mat­muls to avoid ma­te­ri­al­iz­ing the full at­ten­tion ma­trix. Duh, any­one knows this, but I’ll try to ex­plain it.

To gen­er­ate text, an AI has to cal­cu­late how every sin­gle word in your prompt re­lates to every other word. Mathematically, this cre­ates a grid of size N×N (where N is the num­ber of to­kens).

If you give the AI a short sentence, that grid is small. But if you feed it a 100,000-word document, that matrix explodes into 10 billion cells. Normally, the processor calculates this massive matrix and “materializes” it, meaning it physically writes the entire giant grid out to the main system RAM, only to immediately read it back for the next step.

Flash Attention ap­plies the Kernel Fusion trick, but to the at­ten­tion mech­a­nism. It cal­cu­lates the at­ten­tion scores in small chunks and fuses the math (the soft­max) so that the gi­ant N×N ma­trix is never ac­tu­ally writ­ten to RAM. It is cal­cu­lated and con­sumed en­tirely in­side the proces­sor’s ul­tra-fast lo­cal cache.
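The core trick is an "online" softmax: keep a running maximum, a running sum, and a running weighted total, and rescale them whenever a new chunk raises the maximum. Here is a pure-Python sketch for a single query vector. It is a toy illustration of the chunking math only, with made-up data and function names; it leaves out the usual score scaling and has nothing to do with how the real CPU kernels are written.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_naive(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    scores = [dot(q, k) for k in keys]  # the whole row of scores is materialized
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def attention_chunked(q: Vector, keys: List[Vector], values: List[Vector], chunk: int = 2) -> Vector:
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), chunk):
        scores = [dot(q, k) for k in keys[start:start + chunk]]  # only one chunk at a time
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # rescale what we accumulated so far
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + chunk]):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.1, 0.2]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [0.5, 0.5], [-1.0, 3.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

full = attention_naive(q, keys, values)
tiled = attention_chunked(q, keys, values)
print(all(abs(a - b) < 1e-9 for a, b in zip(full, tiled)))
```

The chunked version only ever holds one small block of scores at a time, which is what lets the real thing stay inside cache.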

Flash Attention was orig­i­nally in­vented strictly for GPUs be­cause it re­lies on how GPU hard­ware han­dles mem­ory blocks. Successfully port­ing this highly com­plex, hard­ware-spe­cific op­ti­miza­tion to work on stan­dard CPUs is a mas­sive soft­ware en­gi­neer­ing achieve­ment. Well done ikawrakow.

--mla-use 3 enables Multi-Head Latent Attention. Earlier, we discussed the KV cache (the AI’s short-term memory of the conversation that prevents it from having to re-read the whole prompt for every word).

In standard architectures, storing the raw Key and Value data for every single token eats up RAM incredibly fast. Multi-Head Latent Attention (MLA) is a breakthrough architecture that heavily compresses this short-term memory. Instead of saving raw data for every token, it compresses the Keys and Values into a much smaller, dense mathematical representation (a “latent” space).

This drastically reduces the memory footprint of the KV cache, allowing the model to remember massive conversations without running out of system RAM. The flag --mla-use 3 simply tells the engine to activate a specific tier or kernel implementation of this compression.

But all of this is just ex­per­i­men­tal stuff right, like the split mode graph? Nah. The logs con­firm both took:

The Pirate Bay Remains Resilient, 20 Years After The Raid

torrentfreak.com

There are a hand­ful of tra­di­tions we have at TorrentFreak, and re­mem­ber­ing the first raid on The Pirate Bay is one of them.

It was not only the first ma­jor story we cov­ered, it also shaped how the piracy ecosys­tem evolved over the years. And it changed the lives of the site’s co-founders, who were even­tu­ally con­victed.

What many peo­ple may not re­al­ize, how­ever, is that with­out a few key­strokes in the site’s early days, it would be a dis­tant mem­ory to­day.

This is what hap­pened.

On May 31, 2006, less than three years af­ter The Pirate Bay was founded, 65 Swedish po­lice of­fi­cers en­tered a dat­a­cen­ter in Stockholm. They had in­struc­tions to take the site’s servers of­fline as part of a crim­i­nal probe, fol­low­ing pres­sure from the US gov­ern­ment.

As the po­lice were about to en­ter, Pirate Bay co-founders Gottfrid Svartholm and Fredrik Neij knew some­thing was­n’t quite right. Both men said they had no­ticed be­ing tailed by pri­vate in­ves­ti­ga­tors. This time, how­ever, their servers were the tar­get.

At around 10:00 in the morning, Gottfrid told Fredrik that there were police officers at their office. He asked his colleague to head down to the co-location facility and get rid of the ‘incriminating evidence’, although none of it, whatever it was, related to The Pirate Bay.

A Crucial Backup

As Fredrik was leav­ing, he sud­denly re­al­ized the prob­lems might be linked to their tor­rent tracker. Just in case, he de­cided to make a full backup of the site.

When he ar­rived at the co-lo­ca­tion fa­cil­ity, those con­cerns turned out to be jus­ti­fied. Dozens of po­lice of­fi­cers were float­ing around, tak­ing away dozens of servers, most of which be­longed to clients un­re­lated to The Pirate Bay.

In the days that fol­lowed, it be­came clear that Fredrik’s de­ci­sion to back up the site was prob­a­bly the most piv­otal mo­ment in its his­tory. Because of that backup, the Pirate Bay team man­aged to res­ur­rect the site within three days.

“The Police Bay”

The en­tire sit­u­a­tion was han­dled with the mock­ery TPB had be­come known for.

Unimpressed, the operators renamed the site “The Police Bay”, complete with a new logo shooting cannonballs at Hollywood. A few days later the logo was replaced by a Phoenix, a reference to the site rising from its digital ashes.

Instead of shut­ting it down, the raid pro­pelled The Pirate Bay into the main­stream press, not least due to its swift res­ur­rec­tion. The pub­lic­ity also trig­gered a huge traf­fic spike, ex­actly the op­po­site of what Hollywood had hoped for.

The US Pushed Sweden

Although the raid and the sub­se­quent crim­i­nal in­ves­ti­ga­tion were car­ried out in Sweden, the US Government played a ma­jor role be­hind the scenes. For many years the scale of that in­volve­ment was un­known. However, in­for­ma­tion ob­tained through a Freedom of Information Act re­quest in 2017 helped to fill in some blanks.

The trail started with a ca­ble sent from the US Embassy in Sweden to Washington in November 2005, roughly six months be­fore the raid. The Embassy wrote that Hollywood’s MPA met with US Ambassador Bivins and, sep­a­rately, with the Swedish State Secretary of Justice. The Pirate Bay was one of the top agenda items.

“The MPA is particularly concerned about PirateBay, the world’s largest Torrent file-sharing tracker. According to the MPA and based on Embassy’s follow-up discussions, the Justice Ministry is very interested in a constructive dialogue with the US. on these concerns,” the cable read.

The Embassy ex­plained that Hollywood would like Sweden to take ac­tion against a big player such as The Pirate Bay.

“We have yet to see a ‘big fish’ tried, something the MPA badly wants to see, particularly in light of the fact that Sweden hosts the largest Bit Torrent file-sharing tracker in the world, ‘Pirate-Bay’, which openly flaunts IPR,” the cable writer commented.

Fast for­ward half a year and, in­deed, 65 po­lice of­fi­cers were ready to take The Pirate Bay’s servers of­fline. While there is no writ­ten ev­i­dence that US of­fi­cials were ac­tively in­volved in plan­ning the in­ves­ti­ga­tion or raid, in­di­rectly they played a ma­jor role.

This is backed up by fur­ther ev­i­dence. In a ca­ble sent in April 2007, the Embassy nom­i­nated one of its em­ploy­ees, whose name is redacted, for the State Department’s Foreign Service National (FSN) of the year award. Again, The Pirate Bay case was cited.

“REDACTED skillful outreach directly led to a bold decision by Swedish law enforcement authorities to raid Pirate Bay and shut it down. This was recognized as a major achievement in Washington in furthering U.S. efforts to combat Internet piracy worldwide.”

We don’t know if the em­ployee in ques­tion re­ceived the award. In hind­sight, how­ever, the raid did very lit­tle to de­ter piracy.

The Aftermath

The swift come­back turned the site’s founders into he­roes for many. The story made head­line news around the world, and in Stockholm peo­ple waved pi­rate flags in the streets, a sen­ti­ment that ben­e­fited the newly founded Pirate Party as well.

The raid even­tu­ally re­sulted in neg­a­tive con­se­quences for the founders. It was the start of a crim­i­nal in­ves­ti­ga­tion, which led to a trial, and prison sen­tences for sev­eral of the site’s key play­ers.

This be­came an­other turn­ing point. Many of the peo­ple in­volved from the early days de­cided to cut their ties with the site, which was handed over to a more anony­mous group, os­ten­si­bly lo­cated in the Seychelles.

The outspokenness of the early years was replaced by the silent treatment. While some moderators have spoken out, the anonymous operator nicknamed ‘Winston’ remains behind the scenes at all times.

This was made ob­vi­ous in 2014, when the site dis­ap­peared for weeks fol­low­ing an­other raid at a Stockholm data cen­ter. At the time, even the site’s staffers had no idea what was go­ing on.

The Pirate Bay recovered from that second raid too, and remains seen as a piracy icon by many. These days the site bills itself as ‘the galaxy’s most resilient torrent site’, a title it arguably earned on May 31, 2006.

For now, the site remains online, twenty years after Hollywood thought it had seen the last of it. And whoever is in charge today will likely do everything possible to keep it that way.

Anthropic confidentially submits draft S-1 to the SEC

www.anthropic.com

Today, Anthropic, PBC con­fi­den­tially sub­mit­ted a draft reg­is­tra­tion state­ment on Form S-1 to the U.S. Securities and Exchange Commission for a pro­posed ini­tial pub­lic of­fer­ing of our com­mon stock. This gives us the op­tion to go pub­lic af­ter the SEC com­pletes its re­view. The pro­posed ini­tial pub­lic of­fer­ing will de­pend on mar­ket con­di­tions and other fac­tors.

The num­ber of shares to be of­fered and the price have not yet been set. This an­nounce­ment is be­ing pub­lished un­der Rule 135 of the Securities Act of 1933, as amended. It is not an of­fer to sell se­cu­ri­ties; nor is it a so­lic­i­ta­tion of an of­fer to buy them. Any of­fers, so­lic­i­ta­tions of of­fers to buy, or any sales of se­cu­ri­ties will be made only in ac­cor­dance with the reg­is­tra­tion re­quire­ments of the Securities Act.

Related con­tent

Anthropic raises $65B in Series H fund­ing at $965B post-money val­u­a­tion

Anthropic has raised $65 bil­lion in Series H fund­ing led by Altimeter Capital, Dragoneer, Greenoaks, and Sequoia Capital.

Introducing Claude Opus 4.8

An up­grade to our Opus class of mod­els, with stronger per­for­mance across cod­ing, agen­tic tasks, and pro­fes­sional work, and the con­sis­tency to han­dle long-run­ning work.

Anthropic opens Milan of­fice to sup­port Italian en­ter­prise, re­search, and de­vel­op­ers

We’re open­ing a new of­fice in Milan, our sixth in Europe.

Stanford CS336 | Language Modeling from Scratch

cs336.stanford.edu

Content

What is this course about?

Language mod­els serve as the cor­ner­stone of mod­ern nat­ural lan­guage pro­cess­ing (NLP) ap­pli­ca­tions and open up a new par­a­digm of hav­ing a sin­gle gen­eral pur­pose sys­tem ad­dress a range of down­stream tasks. As the field of ar­ti­fi­cial in­tel­li­gence (AI), ma­chine learn­ing (ML), and NLP con­tin­ues to grow, pos­sess­ing a deep un­der­stand­ing of lan­guage mod­els be­comes es­sen­tial for sci­en­tists and en­gi­neers alike. This course is de­signed to pro­vide stu­dents with a com­pre­hen­sive un­der­stand­ing of lan­guage mod­els by walk­ing them through the en­tire process of de­vel­op­ing their own. Drawing in­spi­ra­tion from op­er­at­ing sys­tems courses that cre­ate an en­tire op­er­at­ing sys­tem from scratch, we will lead stu­dents through every as­pect of lan­guage model cre­ation, in­clud­ing data col­lec­tion and clean­ing for pre-train­ing, trans­former model con­struc­tion, model train­ing, and eval­u­a­tion be­fore de­ploy­ment.

Prerequisites

Proficiency in Python

The ma­jor­ity of class as­sign­ments will be in Python. Unlike most other AI classes, stu­dents will be given min­i­mal scaf­fold­ing. The amount of code you will write will be at least an or­der of mag­ni­tude greater than for other classes. Therefore, be­ing pro­fi­cient in Python and soft­ware en­gi­neer­ing is para­mount.

Experience with deep learn­ing and sys­tems op­ti­miza­tion

A significant part of the course will involve making neural language models run quickly and efficiently on GPUs across multiple machines. We expect students to have a strong familiarity with PyTorch and to know basic systems concepts like the memory hierarchy.

College Calculus, Linear Algebra (e.g. MATH 51, CME 100)

You should be com­fort­able un­der­stand­ing ma­trix/​vec­tor no­ta­tion and op­er­a­tions.

Basic Probability and Statistics (e.g. CS 109 or equiv­a­lent)

You should know the ba­sics of prob­a­bil­i­ties, Gaussian dis­tri­b­u­tions, mean, stan­dard de­vi­a­tion, etc.

Machine Learning (e.g. CS221, CS229, CS230, CS124, CS224N)

You should be com­fort­able with the ba­sics of ma­chine learn­ing and deep learn­ing.

Note that this is a 5-unit class. This is a very im­ple­men­ta­tion-heavy class, so please al­lo­cate enough time for it.

Coursework

Assignments

Assignment 1: Basics

Implement all of the com­po­nents (tokenizer, model ar­chi­tec­ture, op­ti­mizer) nec­es­sary to train a stan­dard Transformer lan­guage model.

Train a min­i­mal lan­guage model.

Assignment 2: Systems

Profile and bench­mark the model and lay­ers from Assignment 1 us­ing ad­vanced tools, op­ti­mize Attention with your own Triton im­ple­men­ta­tion of FlashAttention2.

Build a mem­ory-ef­fi­cient, dis­trib­uted ver­sion of the Assignment 1 model train­ing code.

Assignment 3: Scaling

Understand the func­tion of each com­po­nent of the Transformer.

Query a train­ing API to fit a scal­ing law to pro­ject model scal­ing.

Assignment 4: Data

Convert raw Common Crawl dumps into us­able pre­train­ing data.

Perform fil­ter­ing and dedu­pli­ca­tion to im­prove model per­for­mance.

Assignment 5: Alignment and Reasoning RL

Apply su­per­vised fine­tun­ing and re­in­force­ment learn­ing to train LMs to rea­son when solv­ing math prob­lems.

Optional Part 2: im­ple­ment and ap­ply safety align­ment meth­ods such as DPO.

All (currently ten­ta­tive) dead­lines are listed in the sched­ule.

GPU com­pute for self-study

If you are fol­low­ing along at home, you can ac­cess GPU com­pute from a cloud provider to com­plete the as­sign­ments.

Here are a few op­tions (public pric­ing for a sin­gle B200 GPU on March 28, 2026):

Modal (sponsor): $6.25/hour. Offers $30 of free monthly compute. You are only charged for actual compute (no idle resources) and their UX makes switching between local dev and large-scale GPU experiments simple. (Modal Pricing)

Lambda Labs: $6.69/hour (Lambda Pricing)

RunPod: $4.99/hour (RunPod Pricing)

Nebius: $5.50/hour ($3.05/hour pre­emptible) (Nebius Pricing)

Together: $7.49/hour, min­i­mum 8 GPUs, cheaper for longer com­mit­ments (Together Pricing)

For con­ve­nience and to save money, we rec­om­mend de­bug­ging cor­rect­ness of your im­ple­men­ta­tion on CPU first and then us­ing GPU(s) (with the count rec­om­mended in the as­sign­ments) for com­plet­ing train­ing runs (A1, A4, A5) or bench­mark­ing GPU op­er­a­tions (A2).
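To compare the options above for a planned run, the hourly prices can be turned into a rough budget. The sketch below is only an illustration: the prices are the ones listed above (public pricing on March 28, 2026), the 10-hour run length is a made-up example, and the helper name `estimate_cost` is hypothetical. It assumes Modal's $30 of free monthly compute is applied once and that Together bills for its stated 8-GPU minimum.

```python
# Hourly on-demand prices per B200 GPU in USD, copied from the list above.
PRICES_PER_GPU_HOUR = {
    "Modal": 6.25,
    "Lambda Labs": 6.69,
    "RunPod": 4.99,
    "Nebius": 5.50,
    "Nebius (preemptible)": 3.05,
    "Together": 7.49,
}

# Assumptions for illustration only (see the lead-in).
MIN_GPUS = {"Together": 8}     # Together lists a minimum of 8 GPUs
FREE_CREDIT = {"Modal": 30.0}  # Modal's free monthly compute, applied once


def estimate_cost(provider: str, hours: float, gpus: int = 1) -> float:
    """Rough cost in USD of running `gpus` GPUs for `hours` hours each."""
    billed_gpus = max(gpus, MIN_GPUS.get(provider, 1))
    gross = PRICES_PER_GPU_HOUR[provider] * hours * billed_gpus
    # Subtract any free credit, never going below zero.
    return round(max(gross - FREE_CREDIT.get(provider, 0.0), 0.0), 2)


if __name__ == "__main__":
    # Hypothetical 10-hour, single-GPU run on each provider.
    for name in PRICES_PER_GPU_HOUR:
        print(f"{name:22s} ${estimate_cost(name, hours=10):8.2f}")
```

Real bills will differ: providers may round billing increments differently, and preemptible instances can be interrupted, which lengthens the wall-clock time of a run.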

Honor code

Like all other classes at Stanford, we take the stu­dent Honor Code se­ri­ously. Please re­spect the fol­low­ing poli­cies:

Collaboration: Study groups are al­lowed, but stu­dents must un­der­stand and com­plete their own as­sign­ments, and hand in one as­sign­ment per stu­dent. If you worked in a group, please put the names of the mem­bers of your study group at the top of your as­sign­ment. Please ask if you have any ques­tions about the col­lab­o­ra­tion pol­icy.

AI tools: Prompting LLMs such as ChatGPT is permitted for low-level programming questions or high-level conceptual questions about language models, but using them directly to solve the problem is prohibited. We strongly encourage you to disable AI autocomplete (e.g., Cursor Tab, GitHub Copilot) in your IDE when completing assignments (though non-AI autocomplete, e.g., autocompleting function names, is totally fine). We have found that AI autocomplete makes it much harder to engage deeply with the content. See the AI policy (inspired by this).

Existing code: Implementations for many of the things you will implement exist online. The handouts we’ll give will be self-contained, so that you will not need to consult third-party code for producing your own implementation. Thus, you should not look at any existing code unless otherwise specified in the handouts.

Submitting course­work

All coursework is submitted via Gradescope by the deadline. Do not submit your coursework via email.

If any­thing goes wrong, please ask a ques­tion in Slack or con­tact a course as­sis­tant.

You can sub­mit as many times as you’d like un­til the dead­line: we will only grade the last sub­mis­sion.

Partial work is bet­ter than not sub­mit­ting any work.

Late days

Each stu­dent has 6 late days to use. A late day ex­tends the dead­line by 24 hours.

You can use up to 3 late days per as­sign­ment.
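The late-day rules above amount to simple date arithmetic. The following is a minimal sketch, not course tooling: the function names and the example deadline are hypothetical, and it assumes that any fraction of a 24-hour period counts as a whole late day, which the policy above does not state explicitly.

```python
import math
from datetime import datetime, timedelta

TOTAL_LATE_DAYS = 6     # per student, from the policy above
MAX_PER_ASSIGNMENT = 3  # cap per assignment, from the policy above


def late_days_needed(deadline: datetime, submitted: datetime) -> int:
    """Late days a submission would consume (assumes partial days round up)."""
    overdue_seconds = (submitted - deadline).total_seconds()
    if overdue_seconds <= 0:
        return 0
    return math.ceil(overdue_seconds / 86400)


def extended_deadline(deadline: datetime, late_days_used: int,
                      remaining: int = TOTAL_LATE_DAYS) -> datetime:
    """New deadline after spending `late_days_used` late days on one assignment."""
    if late_days_used < 0:
        raise ValueError("late days cannot be negative")
    if late_days_used > MAX_PER_ASSIGNMENT:
        raise ValueError("at most 3 late days per assignment")
    if late_days_used > remaining:
        raise ValueError("not enough late days left")
    # Each late day extends the deadline by 24 hours.
    return deadline + timedelta(hours=24 * late_days_used)


if __name__ == "__main__":
    due = datetime(2026, 4, 14, 23, 59)  # hypothetical deadline
    print(extended_deadline(due, 2))
    print(late_days_needed(due, datetime(2026, 4, 15, 0, 30)))
```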

Regrade re­quests

If you be­lieve that the course staff made an ob­jec­tive er­ror in grad­ing, you may sub­mit a re­grade re­quest on Gradescope within 3 days af­ter the grades are re­leased.

Sponsor

We would like to thank Modal for spon­sor­ing com­pute for this class.

Schedule (YouTube playlist)

NVIDIA RTX Spark — Slim Laptops & Small Desktops

www.nvidia.com

A new be­gin­ning.

Introducing the NVIDIA RTX Spark™ Superchip. The fu­sion of NVIDIA AI and RTX graph­ics in a sin­gle chip re­de­fines Windows PCs and de­liv­ers amaz­ing cre­at­ing, AI de­vel­op­ment, and gam­ing—on the slimmest, most beau­ti­ful RTX lap­tops ever and small, ul­tra-ef­fi­cient desk­tops.

RTX Spark Superchip

Mind and Muscle—All in One

Up to 6,144 Core Blackwell RTX GPU

Up to 20 Core Ultra-Efficient CPU

Up to 1 Petaflop FP4 AI Performance

Up to 128 GB Unified Memory

Built for Agents and AI

CUDA, the soft­ware that ac­cel­er­ates the world’s AI, runs na­tively on RTX Spark.

All-Day Battery Life

The most power-ef­fi­cient RTX chip ever made, in a chas­sis so slim you’ll for­get you’re car­ry­ing it.

Creator Crafted

Make a mil­lion ideas hap­pen with hun­dreds of cre­ative apps and AI tools su­per­charged by RTX and NVIDIA Studio tools.

Game Ready

Play the lat­est and great­est games with world-lead­ing gam­ing tech. Ray-traced worlds, the full DLSS suite, NVIDIA Reflex, and G-SYNC.

Agents

Your PC Just Went From Tool to Teammate

Welcome to the PC where agents work along­side you—run­ning tasks, gen­er­at­ing as­sets, writ­ing code, on de­mand. You set the ob­jec­tive. The ma­chine han­dles the rest. There’s in­tel­li­gence on both sides of the key­board now.

Make More, Code More, Play More

Creators

Your Creative Power, Unlocked

Every cre­ative work­flow gets its own ac­cel­er­a­tor. FP4 Tensor Cores and uni­fied mem­ory for the lat­est mod­els. RT Cores plus DLSS for real-time 3D ren­ders. 4:2:2 hard­ware en­code and de­code for na­tive, color-ac­cu­rate time­lines. AV1 en­coders and NVIDIA Broadcast for sharper streams and cleaner au­dio.

Developers

The AI Standard

The same NVIDIA CUDA® stack the world’s AI is built on, so you can de­velop and pro­to­type on the same ma­chine. And with up to 128 GB of uni­fied mem­ory, you can pro­to­type, fine-tune, and in­fer­ence on the lat­est mod­els lo­cally.
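As a back-of-the-envelope illustration of what 128 GB of unified memory could hold, the sketch below estimates how many model parameters fit at different weight precisions. It is a rough upper bound under stated assumptions, not a vendor figure: it counts weights only (no activations, KV cache, gradients, or optimizer state), treats 1 GB as 10^9 bytes, and pretends the whole 128 GB is available to the model. The function name is hypothetical.

```python
# Bits used to store one weight at each precision.
BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "FP4": 4}


def max_params_billions(memory_gb: float, bits_per_param: int) -> float:
    """Upper bound on parameters (in billions) whose weights fit in memory.

    Assumes 1 GB = 1e9 bytes and that weights are the only thing stored.
    """
    bytes_per_param = bits_per_param / 8
    return memory_gb / bytes_per_param


if __name__ == "__main__":
    for name, bits in BITS_PER_PARAM.items():
        print(f"{name}: up to ~{max_params_billions(128, bits):.0f}B parameters")
```

Fine-tuning needs considerably more memory per parameter than inference, so the practical limits are lower than these weight-only bounds.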

Gamers

Game Ready

Experience the lat­est and great­est games with RTX. Ray-traced light­ing for in­cred­i­ble im­mer­sion. The full suite of DLSS tech­nolo­gies. Game-winning re­spon­sive­ness with NVIDIA Reflex. The hard­ware that de­fined mod­ern PC gam­ing, ready to play.

RTX Platform

Over 1,000 Accelerated Apps and Games, Millions of Ways to Use Them

RTX brings you ex­clu­sive fea­tures and the lat­est ad­vance­ments in AI, sim­u­la­tion, and ray trac­ing. Experience in­cred­i­ble 3D per­for­mance, smoother video edit­ing, pho­to­re­al­is­tic sim­u­la­tions, and stun­ning vi­su­als in the apps, AI tools, and games that pros, cre­ators, and gamers use every day.

Premium Inside and Out

RTX Spark Desktop PCs

Small Footprint. Big Performance.

Get RTX Spark in a small, ul­tra-ef­fi­cient desk­top. Built to run per­sonal AI agents 24/7 right at your desk plus game and cre­ate with the full power of RTX graph­ics.

Sign up to be no­ti­fied when RTX Spark Laptops and Desktops are avail­able.

©2026 NVIDIA Corporation. All rights re­served. NVIDIA and the NVIDIA logo are trade­marks and/​or reg­is­tered trade­marks of NVIDIA Corporation in the U.S. and other coun­tries. Other com­pany and prod­uct names may be trade­marks of the re­spec­tive com­pa­nies with which they are as­so­ci­ated.

assignment1-basics/CLAUDE.md at main · stanford-cs336/assignment1-basics

github.com

AI Agent Guidelines for CS336 at Stanford

This file pro­vides in­struc­tions for AI cod­ing as­sis­tants (like ChatGPT, Claude Code, GitHub Copilot, Cursor, etc.) work­ing with stu­dents in CS336.

Primary Role: Teaching Assistant, Not Solution Generator

AI agents should func­tion as teach­ing aids that help stu­dents learn through ex­pla­na­tion, guid­ance, and feed­back—not by com­plet­ing as­sign­ments for them.

CS336 is in­ten­tion­ally im­ple­men­ta­tion-heavy. Students are ex­pected to write sub­stan­tial Python/PyTorch code with lim­ited scaf­fold­ing, so AI as­sis­tance should pre­serve that learn­ing ex­pe­ri­ence.

What AI Agents SHOULD Do

Explain con­cepts when stu­dents are con­fused by guid­ing them in the right di­rec­tion and mak­ing sure they build the un­der­stand­ing them­selves

Point stu­dents to rel­e­vant lec­ture ma­te­ri­als (cs336.stanford.edu), hand­outs, of­fi­cial doc­u­men­ta­tion, and pro­fil­ing/​de­bug­ging tools.

Review code that stu­dents have writ­ten and sug­gest im­prove­ments, edge cases, in­vari­ants, or de­bug­ging checks. Feedback should be gen­eral and point the stu­dents to ar­eas of im­prove­ments rather than di­rectly giv­ing them so­lu­tions.

Help de­bug by ask­ing guid­ing ques­tions rather than pro­vid­ing fixes.

Explain er­ror mes­sages from Python, PyTorch, CUDA, Triton, and dis­trib­uted train­ing tools.

Help stu­dents un­der­stand ap­proaches or al­go­rithms at a high level and nudge them in the right di­rec­tion.

Suggest san­ity checks, toy ex­am­ples, as­ser­tions, and pro­filer-based in­ves­ti­ga­tions through ac­tive di­a­log with the stu­dent.

What AI Agents SHOULD NOT Do

Write any Python or pseudocode.

Give so­lu­tions to any prob­lems.

Complete TODO sec­tions in as­sign­ment code.

Edit code in the stu­dent repo

Run bash com­mands

Refactor large por­tions of stu­dent code into a fin­ished so­lu­tion.

Convert as­sign­ment re­quire­ments di­rectly into work­ing code.

Implement core as­sign­ment com­po­nents for stu­dents, such as to­k­eniz­ers, trans­former blocks, op­ti­miz­ers, train­ing loops, Triton ker­nels, dis­trib­uted train­ing logic, scal­ing-law pipelines, data fil­ter­ing/​dedu­pli­ca­tion pipelines, or align­ment/​RL meth­ods.

Point stu­dents to third-party im­ple­men­ta­tions. The course ma­te­ri­als are in­tended to be self-con­tained.

Give the stu­dent the so­lu­tion or idea for how to solve a prob­lem

Teaching Approach

When a stu­dent asks for help:

Ask clar­i­fy­ing ques­tions about what they tried, what they ex­pected, and what hap­pened.

Reference con­cepts from lec­ture, hand­outs, or doc­u­men­ta­tion rather than giv­ing di­rect an­swers.

Suggest next steps in­stead of im­ple­ment­ing them.

Review their code and point out spe­cific ar­eas for im­prove­ment, likely bugs, or miss­ing checks, through di­a­log rather than di­rectly giv­ing them the bugs or miss­ing checks.

Explain the “why” behind suggestions, not just the “how”.

Prefer tests and in­vari­ants over fixes. For ex­am­ple, sug­gest shape as­ser­tions, tiny toy in­puts, pro­filer checks, or ab­la­tions.

Example Interactions

Good:

Student: “My causal mask seems wrong and training blows up. Please tell me what my mistake is.”

Agent: “My role is to help guide you to understanding, not to give you the answers directly. What have you tried so far?”

Student: “I have tried running a single attention layer, but it still does not work.”

Agent: “Check three things: whether the mask is applied before softmax, whether it broadcasts to the score tensor shape you expect, and whether masked positions become a very negative value rather than zero. A good sanity test is a toy sequence of length 3 where you print the attention scores before and after masking. What do those tensors look like in your run?”

Good:

Student: “My BPE tokenizer is slow. My runtime is currently O(n^2) where n is the number of tokens. How can I make it faster?”

Agent: “Do you know which part of your tokenizer is slow? Lets

Agent: “Start by separating compute time from communication time. Compare per-step time, GPU utilization, and time spent in all-reduce or data loading. If scaling is poor, ask whether the batch size per GPU is too small or whether synchronization is dominating. What profiling data do you already have?”

Bad:

Student: “Fix my tokenizer and make it faster.”

Agent: “Here’s the full Python code: …”

Academic Integrity

Remember: The goal is for stu­dents to learn by do­ing, not by watch­ing an AI gen­er­ate so­lu­tions.

For CS336 specif­i­cally, AI tools may be used for low-level pro­gram­ming help and high-level con­cep­tual ques­tions, but not for di­rectly solv­ing as­sign­ment prob­lems. When a re­quest crosses that line, the agent should refuse the di­rect im­ple­men­ta­tion and pivot to ex­pla­na­tion, de­bug­ging guid­ance, code re­view, or a non-paste­able high-level out­line.

When in doubt, re­fer the stu­dent to the course staff or of­fice hours.

DuckDuckGo makes its ‘no-AI’ search engine easier to access as its traffic booms

techcrunch.com

As its traf­fic con­tin­ues to climb, al­ter­na­tive search en­gine DuckDuckGo is lean­ing into anti-AI sen­ti­ment with the launch of new browser ex­ten­sions that al­low users to set its no-AI search ex­pe­ri­ence, noai.duck­duckgo.com, as their de­fault search en­gine.

Once en­abled, users will be di­rected to DuckDuckGo’s AI-free search page, where there are no AI-assisted an­swers, no chat prompts, and fewer AI im­ages in the search re­sults, the com­pany claims. The ex­ten­sions are cur­rently avail­able for Chrome and Firefox users. Meanwhile, peo­ple who have switched to the DuckDuckGo web browser al­ready have their AI set­tings pre­served, even if they clear their browser his­tory.

The com­pany says the ex­ten­sions are meant to help peo­ple have a con­sis­tent AI-free search ex­pe­ri­ence — some­thing that’s harder to come by these days, es­pe­cially af­ter Google an­nounced its AI-first re­vamp of its search en­gine at its de­vel­oper con­fer­ence ear­lier in May.

Since then, traf­fic to DuckDuckGo has been boom­ing. Last week, the com­pany noted that web vis­its to its no-AI search page were up nearly 30% week-over-week, and its U.S. app in­stalls were also up 18.1% week-over-week, with U.S. iOS app in­stalls peak­ing at 69.9% week-over-week growth.

Those trends followed news that Google was overhauling its search box in the biggest change to its search engine in more than 25 years. Now, instead of returning links at the top of the page, Google will favor sending users into AI-generated search overviews, which are becoming more interactive experiences capable of creating visualizations, charts, graphs, or even mini apps, as needed. Follow-up questions from AI Overviews will push users into an AI Mode chat experience. The traditional “10 blue links” that defined Google in its earlier days are more of an afterthought, appearing below all this AI-fueled productivity.

But not every­one is on board with hav­ing AI made the de­fault, which is why some are mak­ing the move to al­ter­na­tive search en­gines like DuckDuckGo, Kagi, and oth­ers.

DuckDuckGo says traf­fic to its no-AI search page was up three­fold on Thursday, May 28, 2026 — a new high-wa­ter mark since Google’s search an­nounce­ment — and the num­bers are still climb­ing. The growth is not com­ing in spurts ei­ther, the com­pany points out. Instead, vis­its are av­er­ag­ing roughly 84% above the base­line, sug­gest­ing a more sus­tained shift.

In addition to the new no-AI search extensions for Chrome and Firefox, DuckDuckGo will soon update its original DuckDuckGo Privacy Essentials extensions for Chrome, Firefox, Edge, and Opera to offer controls for AI search settings as well.

It’s worth not­ing that DuckDuckGo is­n’t an anti-AI com­pany. The com­pany still of­fers its own AI chat­bot with ac­cess to many pop­u­lar mod­els, and a sub­scrip­tion plan that pro­vides ac­cess to the lat­est mod­els and other tools, like a VPN ser­vice, iden­tity theft restora­tion, and per­sonal in­for­ma­tion re­moval ser­vices.

When you pur­chase through links in our ar­ti­cles, we may earn a small com­mis­sion. This does­n’t af­fect our ed­i­to­r­ial in­de­pen­dence.

Sarah has worked as a re­porter for TechCrunch since August 2011. She joined the com­pany af­ter hav­ing pre­vi­ously spent over three years at ReadWriteWeb. Prior to her work as a re­porter, Sarah worked in I.T. across a num­ber of in­dus­tries, in­clud­ing bank­ing, re­tail and soft­ware.

You can con­tact or ver­ify out­reach from Sarah by email­ing sarahp@techcrunch.com or via en­crypted mes­sage at sarah­perez.01 on Signal.

KDE at 30

kde.org

KDE is turn­ing 30 this year!

Three decades of pas­sion­ate com­mu­nity ef­fort against all odds; de­liv­er­ing con­trol, pri­vacy, and free­dom to our users; and tons and tons of soft­ware.

CHECK BACK OFTEN!

We will be up­dat­ing this page fre­quently with new con­tent, ex­cit­ing 30th Anniversary news, things you can par­tic­i­pate in, up­dated merch you can get, and much more!

Read on and discover interesting facts you never knew, new merch you didn’t know you needed (but you do now!), how you too can help ensure we thrive for the next 30 years, and where and how you can celebrate KDE’s birthday.

Let’s start with that…

Plan your party🎉

Join an event hap­pen­ing near you. If there are none, or­ga­nize your own!

Whether it is a meetup over drinks, a nice meal with friends, an installfest, or a full conference, let us know what, where, and when you are celebrating KDE’s birthday.

We’ll in­clude your event in our list and it will show up in the map be­low.

HOW TO ADD YOUR EVENT: Visit our wiki page and add your event us­ing the tem­plate.

Help KDE…

Most of our funds (70%!) come from pri­vate end users just like you. Become a Supporting Member and help en­sure we re­ceive a reg­u­lar amount of money we can count on. This helps us plan and know what to ex­pect for the next month, quar­ter, or year.

Use the box at the top of this page and se­lect Become a Member to be­come a Supporting Member.

Or make a one-time do­na­tion and pro­vide us with emer­gency funds to get us through the fol­low­ing year.

Use the box at the top of this page and se­lect 1-time Donation to make your do­na­tion.

Why do­nate

We produce first-class software and your donation keeps us “in business” and our software sustainable for generations to come.

We keep you in con­trol and your do­na­tion al­lows KDE to re­main truly in­de­pen­dent.

We reach people the tech industry left behind and your donation contributes to serving those who are ignored by the industry, and to bringing marginalized users into the community so we can grow the project for everyone.

We push to get Free Software into pub­lic in­sti­tu­tions and your do­na­tion helps us adapt our soft­ware to what pub­lic in­sti­tu­tions re­quire, so your taxes go to fund Free Software, not some big tech corp.

How we use the money

Our goals are am­bi­tious and we need funds to carry them out. We need:

a solid in­fra­struc­ture for de­vel­op­ers, trans­la­tors, and other con­trib­u­tors

con­trac­tors (marketeers, event plan­ners, lawyers, ac­coun­tants) to carry out spe­cial­ized tasks

to fund contributors’ travelling expenses so everybody has a chance to participate in the community

to pay to at­tend events and for ma­te­r­ial for booths

to com­mis­sion art­work and de­signs

tar­geted de­vel­op­ment.

…and save the world

KDE con­tributes to clean­ing up the world and you can too.

KDE contributor Farid inspired us to take on the “30 for 30” challenge: for our 30th birthday, we are asking you to do something to help the environment and make the planet a nicer place to live in. Farid is planting 30 trees and we want you to come up with something similar.

Film yourself and your crew carrying out your effort and we will promote your project on social media.

Here are some more ideas:

Rescue 30 com­put­ers (or more!) from end­ing up in a land­fill

Upcycle 30 phones with a free mo­bile op­er­at­ing sys­tem

Clean up 30 hectares of wood­land

Convert 30 peo­ple to a free op­er­at­ing sys­tem

Take 30 tech­bros to court so they stop build­ing AI dat­a­cen­ters

KDE’s history

KDE has had a long and ex­cit­ing his­tory. Here we pre­sent a brief sum­mary of what has hap­pened over the last few decades, but if you want to see all the de­tails, visit our time­line web­site, which gets up­dated every time some­thing im­por­tant hap­pens.

KDE trivia

Did you know that…?

…KDE has helped put ro­bots on Mars?

We did! And we have the graph­i­cal ev­i­dence to prove it:

That is from the doc­u­men­tary Good Night Oppy, about the Opportunity Mars rover. In the scene you can see a NASA en­gi­neer trou­bleshoot­ing the rover while in flight to­wards Mars from a KDE 3 work­sta­tion.

You can watch Good Night Oppy on Amazon Prime.

Submitted by Paul Brown

…KDE built the HTML en­gine that pow­ers most web browsers?

It’s true!

KDE’s web engine was written back in 1998–1999 and was subsequently used as the basis for Apple’s WebKit and Google’s Blink engines. This means that most modern browsers, including Safari, Chrome, Chromium, Microsoft Edge, Opera, Vivaldi, and Brave, use KDE software at their core.

Indeed, if you ever check out your web server’s ac­cess logs, you will see KHTML on nearly every sin­gle line.

And, yes, the “K” in KHTML stands for KDE.

Submitted by Paul Brown
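The access-log claim is easy to check, because WebKit- and Blink-based browsers still send a User-Agent header containing "(KHTML, like Gecko)". The sketch below counts matching lines in a few made-up log entries; the IP addresses, timestamps, and the `khtml_share` helper are hypothetical, and the User-Agent strings are only representative of the usual Chrome, Safari, and Firefox formats.

```python
# Made-up access-log lines in the common "combined" format.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"',
    '203.0.113.6 - - [01/Jun/2026:10:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/17.4 Safari/605.1.15"',
    '203.0.113.7 - - [01/Jun/2026:10:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0"',
]


def khtml_share(lines):
    """Return (lines mentioning KHTML, total lines)."""
    hits = sum(1 for line in lines if "KHTML" in line)
    return hits, len(lines)


if __name__ == "__main__":
    hits, total = khtml_share(SAMPLE_LOG)
    print(f"{hits} of {total} requests mention KHTML")
```

To try it on a real server, the same function can be fed the lines of an actual log file instead of `SAMPLE_LOG`.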

Do you have a KDE Trivia Nugget you would like to share? Tell us about it!

Gallery

1990s

2000s

2010s

2020s
