10 interesting stories served every morning and every evening.

1 1,585 shares, 60 trendiness


Sign up to try VideoFX

Veo is our most ca­pa­ble video gen­er­a­tion model to date. It gen­er­ates high-qual­ity, 1080p res­o­lu­tion videos that can go be­yond a minute, in a wide range of cin­e­matic and vi­sual styles.

It ac­cu­rately cap­tures the nu­ance and tone of a prompt, and pro­vides an un­prece­dented level of cre­ative con­trol — un­der­stand­ing prompts for all kinds of cin­e­matic ef­fects, like time lapses or aer­ial shots of a land­scape.

Our video gen­er­a­tion model will help cre­ate tools that make video pro­duc­tion ac­ces­si­ble to every­one. Whether you’re a sea­soned film­maker, as­pir­ing cre­ator, or ed­u­ca­tor look­ing to share knowl­edge, Veo un­locks new pos­si­bil­i­ties for sto­ry­telling, ed­u­ca­tion and more.

Over the com­ing weeks some of these fea­tures will be avail­able to se­lect cre­ators through VideoFX, a new ex­per­i­men­tal tool at labs.google. You can join the wait­list now.

In the fu­ture, we’ll also bring some of Veo’s ca­pa­bil­i­ties to YouTube Shorts and other prod­ucts.

Prompt: A lone cow­boy rides his horse across an open plain at beau­ti­ful sun­set, soft light, warm col­ors

Prompt: A fast-tracking shot down a suburban residential street lined with trees. Daytime with a clear blue sky. Saturated colors, high contrast

Prompt: Extreme close-up of chicken and green pep­per ke­babs grilling on a bar­beque with flames. Shallow fo­cus and light smoke. vivid colours

Prompt: Timelapse of the north­ern lights danc­ing across the Arctic sky, stars twin­kling, snow-cov­ered land­scape

Prompt: An aer­ial shot of a light­house stand­ing tall on a rocky cliff, its bea­con cut­ting through the early dawn, waves crash against the rocks be­low

To pro­duce a co­her­ent scene, gen­er­a­tive video mod­els need to ac­cu­rately in­ter­pret a text prompt and com­bine this in­for­ma­tion with rel­e­vant vi­sual ref­er­ences.

With ad­vanced un­der­stand­ing of nat­ural lan­guage and vi­sual se­man­tics, Veo gen­er­ates video that closely fol­lows the prompt. It ac­cu­rately cap­tures the nu­ance and tone in a phrase, ren­der­ing in­tri­cate de­tails within com­plex scenes.

Prompt: Many spot­ted jel­ly­fish pul­sat­ing un­der wa­ter. Their bod­ies are trans­par­ent and glow­ing in deep ocean

Prompt: ex­treme close-up with a shal­low depth of field of a pud­dle in a street. re­flect­ing a busy fu­tur­is­tic Tokyo city with bright neon signs, night, lens flare

When given an input video and an editing command, like adding kayaks to an aerial shot of a coastline, Veo can apply this command to the initial video and create a new, edited video.

Prompt: Drone shot along the Hawaii jungle coastline, sunny day. Kayaks in the water

In ad­di­tion, it sup­ports masked edit­ing, en­abling changes to spe­cific ar­eas of the video when you add a mask area to your video and text prompt.

Veo can also take an image as input along with the text prompt. Providing a reference image in combination with a text prompt conditions Veo to generate a video that follows the image’s style and the prompt’s instructions.

The model is also able to make video clips and ex­tend them to 60 sec­onds and be­yond. It can do this ei­ther from a sin­gle prompt, or by be­ing given a se­quence of prompts which to­gether tell a story.

A fast-tracking shot through a bustling dystopian sprawl with bright neon signs, flying cars and mist, night, lens flare, volumetric lighting. A fast-tracking shot through a futuristic dystopian sprawl with bright neon signs, starships in the sky, night, volumetric lighting. A neon hologram of a car driving at top speed, speed of light, cinematic, incredible details, volumetric lighting. The cars leave the tunnel, back into the real world city Hong Kong.

Maintaining vi­sual con­sis­tency can be a chal­lenge for video gen­er­a­tion mod­els. Characters, ob­jects, or even en­tire scenes can flicker, jump, or morph un­ex­pect­edly be­tween frames, dis­rupt­ing the view­ing ex­pe­ri­ence.

Veo’s cut­ting-edge la­tent dif­fu­sion trans­form­ers re­duce the ap­pear­ance of these in­con­sis­ten­cies, keep­ing char­ac­ters, ob­jects and styles in place, as they would in real life.

Prompt: A pan­ning shot of a serene moun­tain land­scape, the cam­era slowly re­veal­ing snow-capped peaks, gran­ite rocks and a crys­tal-clear lake re­flect­ing the sky

Prompt: Crochet ele­phant in in­tri­cate pat­terns walk­ing on the sa­vanna

Veo builds upon years of gen­er­a­tive video model work in­clud­ing Generative Query Network (GQN), DVD-GAN, Imagen-Video, Phenaki, WALT, VideoPoet and Lumiere, and also our Transformer ar­chi­tec­ture and Gemini.

To help Veo un­der­stand and fol­low prompts more ac­cu­rately, we have also added more de­tails to the cap­tions of each video in its train­ing data. And to fur­ther im­prove per­for­mance, the model uses high-qual­ity, com­pressed rep­re­sen­ta­tions of video (also known as la­tents) so it’s more ef­fi­cient too. These steps im­prove over­all qual­ity and re­duce the time it takes to gen­er­ate videos.

It’s crit­i­cal to bring tech­nolo­gies like Veo to the world re­spon­si­bly. Videos cre­ated by Veo are wa­ter­marked us­ing SynthID, our cut­ting-edge tool for wa­ter­mark­ing and iden­ti­fy­ing AI-generated con­tent, and passed through safety fil­ters and mem­o­riza­tion check­ing processes that help mit­i­gate pri­vacy, copy­right and bias risks.

Veo’s fu­ture will be in­formed by our work with lead­ing cre­ators and film­mak­ers. Their feed­back helps us im­prove our gen­er­a­tive video tech­nolo­gies and makes sure they ben­e­fit the wider cre­ative com­mu­nity and be­yond.

Preview of our work with film­maker Donald Glover and his cre­ative stu­dio, Gilga.

Note: All videos on this page were gen­er­ated by Veo and have not been mod­i­fied.

Sign up to try VideoFX

This work was made pos­si­ble by the ex­cep­tional con­tri­bu­tions of: Abhishek Sharma, Adams Yu, Ali Razavi, Andeep Toor, Andrew Pierson, Ankush Gupta, Austin Waters, Daniel Tanis, Dumitru Erhan, Eric Lau, Eleni Shaw, Gabe Barth-Maron, Greg Shaw, Han Zhang, Henna Nandwani, Hernan Moraldo, Hyunjik Kim, Irina Blok, Jakob Bauer, Jeff Donahue, Junyoung Chung, Kory Mathewson, Kurtis David, Lasse Espeholt, Marc van Zee, Matt McGill, Medhini Narasimhan, Miaosen Wang, Mikołaj Bińkowski, Mohammad Babaeizadeh, Mohammad Taghi Saffar, Nick Pezzotti, Pieter-Jan Kindermans, Poorva Rane, Rachel Hornung, Robert Riachi, Ruben Villegas, Rui Qian, Sander Dieleman, Serena Zhang, Serkan Cabi, Shixin Luo, Shlomi Fruchter, Signe Nørly, Srivatsan Srinivasan, Tobias Pfaff, Tom Hume, Vikas Verma, Weizhe Hua, William Zhu, Xinchen Yan, Xinyu Wang, Yelin Kim, Yuqing Du and Yutian Chen.

We ex­tend our grat­i­tude to Aida Nematzadeh, Alex Cullum, April Lehman, Aäron van den Oord, Benigno Uria, Charlie Chen, Charlie Nash, Charline Le Lan, Conor Durkan, Cristian Țăpuș, David Bridson, David Ding, David Steiner, Emanuel Taropa, Evgeny Gladchenko, Frankie Garcia, Gavin Buttimore, Geng Yan, Greg Shaw, Hadi Hashemi, Harsha Vashisht, Hartwig Adam, Huisheng Wang, Jacob Austin, Jacob Kelly, Jacob Walker, Jim Lin, Jonas Adler, Joost van Amersfoort, Jordi Pont-Tuset, Josh Newlan, Josh V. Dillon, Junwhan Ahn, Kelvin Xu, Kristian Kjems, Lois Zhou, Luis C. Cobo, Maigo Le, Malcolm Reynolds, Marcus Wainwright, Mary Cassin, Mateusz Malinowski, Matt Smart, Matt Young, Mingda Zhang, Minh Giang, Moritz Dickfeld, Nancy Xiao, Nelly Papalampidi, Nir Shabat, Oliver Woodman, Ollie Purkiss, Oskar Bunyan, Patrice Oehen, Pauline Luc, Pete Aykroyd, Petko Georgiev, Phil Chen, RJ Mical, Rakesh Shivanna, Ramya Ganeshan, Richard Nguyen, Robin Strudel, Rohan Anil, Sam Haves, Shanshan Zheng, Sholto Douglas, Siddhartha Brahma, Tatiana López, Tobias Pfaff, Victor Gomes, Vighnesh Birodkar, Xin Chen, Yaroslav Ganin, Yi-Ling Wang, Yilin Ma, Yori Zwols, Yu Qiao, Yuchen Liang, Yusuf Aytar and Zu Kim for their in­valu­able part­ner­ship in de­vel­op­ing and re­fin­ing key com­po­nents of this pro­ject.

Special thanks to Douglas Eck, Nando de Freitas, Oriol Vinyals, Eli Collins, Koray Kavukcuoglu and Demis Hassabis for their in­sight­ful guid­ance and sup­port through­out the re­search process.

We also ac­knowl­edge the many other in­di­vid­u­als who con­tributed across Google DeepMind and our part­ners at Google.



Read the original on deepmind.google »

2 1,011 shares, 40 trendiness

GPUs Go Brrr

AI uses an aw­ful lot of com­pute.

In the last few years we’ve focused a great deal of our work on making AI use less compute (e.g. Based, Monarch Mixer, H3, Hyena, S4, among others) and run more efficiently on the compute that we have (e.g. FlashAttention, FlashAttention-2, FlashFFTConv). Lately, reflecting on this work has prompted us to take a step back and ask two questions:

What does the hard­ware ac­tu­ally want?

And how can we give that to it?

This post is a mix­ture of prac­tice and phi­los­o­phy. On the prac­ti­cal side, we’re go­ing to talk about what we’ve learned about mak­ing GPUs go brr — and re­lease an em­bed­ded DSL, ThunderKittens, that we’ve built to help us write some par­tic­u­larly speedy ker­nels (which we are also re­leas­ing). On the philo­soph­i­cal side, we’ll briefly talk about how what we’ve learned has changed the way we think about AI com­pute.

For this post, we’re go­ing to fo­cus on the NVIDIA H100 for two rea­sons. First, it rep­re­sents an aw­ful lot of new com­pute go­ing on­line. Second, we think the trends it im­plies are go­ing to con­tinue in fu­ture gen­er­a­tions, and prob­a­bly from other man­u­fac­tur­ers, too. But bear in mind (and we will re­peat in case you for­get) that most of this post ap­plies in some form to other GPUs, too.

Advance apolo­gies for re­stat­ing the data sheet, but the de­tails of the hard­ware are im­por­tant for the dis­cus­sion to come. An H100 SXM GPU con­tains, for our pur­poses:

80 GB of HBM3 with 3 TB/s of band­width. (A bit less band­width in prac­tice.)

50 MB of L2 cache with 12 TB/s of band­width, split across the GPU into two 25MB sec­tions con­nected by a cross­bar. (The cross­bar sucks.)

132 stream­ing mul­ti­proces­sors (SMs), where each has:

up to 227 KB of shared mem­ory within a 256 KB L1 cache. (Together, these have about 33 TB/s of band­width.)

a ten­sor mem­ory ac­cel­er­a­tor (TMA) — a new chunk of hard­ware in Hopper that can do asyn­chro­nous ad­dress gen­er­a­tion and fetch mem­ory. It also does other things like fa­cil­i­tate the on-chip mem­ory net­work (distributed shared mem­ory) but we’re not go­ing to fo­cus on this much, to­day.

4 quad­rants, where each quad­rant has:

A bunch of built-in instructions, like sums and multiplies, that operate in parallel on vector registers.

There’s a lot of other stuff, too (memory con­trollers, in­struc­tion caches, etc) but we don’t care about any of that right now.

All of the com­pute hap­pens in the SMs. Most of it hap­pens in the reg­is­ters.

Great, how do I make it go brr?

Keep the ten­sor core fed. That’s it.

An H100 GPU has 989 TFLOPs of half-precision matrix multiply compute, and ~60 TFLOPs of “everything else”. So, every cycle the tensor core is in use, you’re getting at least 94% utilization of the hardware. And every cycle the tensor core is not in use, you’re getting no more than 6% utilization of the hardware. Put another way:
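The arithmetic behind those two percentages is just the ratio of tensor-core FLOPs to total FLOPs; a quick sketch using the headline figures above:

```python
# Headline H100 SXM throughput figures quoted above.
matmul_tflops = 989  # half-precision tensor core matrix multiply
other_tflops = 60    # roughly everything else combined

total = matmul_tflops + other_tflops

# Fraction of peak FLOPs in use when the tensor cores are busy vs. idle.
tensor_busy_util = matmul_tflops / total
tensor_idle_util = other_tflops / total

print(f"{tensor_busy_util:.0%}")  # 94%
print(f"{tensor_idle_util:.0%}")  # 6%
```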

Now it turns out that keep­ing the ten­sor core fed is eas­ier said than done. We’ve dis­cov­ered a num­ber of quirks to the hard­ware that are im­por­tant to keep­ing the ma­trix mul­ti­plies rolling. Much of this also ap­plies to non-H100 GPUs, but the H100 is par­tic­u­larly tricky to keep fed so we fo­cus on it here. (The RTX 4090, by com­par­i­son, is very easy to work with as il­lus­trated in fig­ure 2.)

WGMMA in­struc­tions are nec­es­sary but also re­ally ir­ri­tat­ing to use.

Shared mem­ory is not ac­tu­ally that fast and also re­quires great care.

Occupancy re­mains help­ful, and reg­is­ters are gen­er­ally the key re­source.

Figure 2: NVIDIA GPUs (H100 and 4090) and their spirit animals (Canada goose and golden retriever puppy).

Let’s go through each of these in or­der.

The H100 has a new set of instructions called “warp group matrix multiply accumulate” (wgmma.mma_async in PTX, or HGMMA/IGMMA/QGMMA/BGMMA in SASS). To understand what makes them special, we need to look briefly at how you used to have to use tensor cores. The tensor core instructions available on previous GPUs were wmma.mma.sync and mma.sync instructions. With these instructions, a warp of 32 threads on a single quadrant of an SM would synchronously feed their chunk of the data into the tensor core and await the result. Only then could they move on.

Not so with wg­mma.mma_a­sync in­struc­tions. Here, 128 con­sec­u­tive threads — split across all quad­rants of the SM — col­lab­o­ra­tively syn­chro­nize, and asyn­chro­nously launch a ma­trix mul­ti­ply di­rectly from shared mem­ory (and op­tion­ally also reg­is­ters.) These warps can then go do other things with their reg­is­ters while the ma­trix mul­ti­ply hap­pens, and await the re­sult when­ever they want.

In our mi­crobench­marks, we found that these in­struc­tions are nec­es­sary to ex­tract the full com­pute of the H100. Without them, the GPU seems to top out around 63% of its peak uti­liza­tion; we sus­pect this is be­cause the ten­sor cores want a deep hard­ware pipeline to keep them fed, even from lo­cal re­sources.

Unfortunately, the mem­ory lay­outs for these in­struc­tions are quite com­pli­cated. The unswiz­zled shared mem­ory lay­outs suf­fer from very poor co­a­lesc­ing, and so they re­quire sub­stan­tial ad­di­tional band­width from L2. The swiz­zled mem­ory lay­outs are flat-out in­cor­rectly doc­u­mented, which took con­sid­er­able time for us to fig­ure out. They’re also brit­tle, in that they ap­pear to only work for spe­cific ma­trix shapes and do not play well with other parts of the wg­mma.mma_a­sync in­struc­tions. For ex­am­ple, the hard­ware can trans­pose sub-ma­tri­ces on its way to the ten­sor cores — but only if the lay­out is not swiz­zled.

Figure 3: NVIDIA’s lies. This is an extraordinarily misleading representation of the actual 128b swizzled wgmma layout. This diagram cost us three weeks of life that we will not get back, hence the public shaming.

We’ve also found that unswizzled wgmma layouts have both poor memory coalescing and bank conflicts. On kernels such as flash attention, TMA and the L2 cache are both fast enough to hide these problems reasonably well. But to make full use of the hardware, memory requests must be coalesced and bank conflicts avoided, and then controlling layouts very carefully becomes critical.

Despite these pains, these in­struc­tions re­ally are nec­es­sary to make full use of the H100. Without them, you’ve al­ready lost 37% of the po­ten­tial per­for­mance of the GPU!

Shared memory appears to have a single-access latency of around 30 cycles (this matches our observations, too). That doesn’t sound like much, but in that time the SM’s tensor cores could have done almost two full 32x32 square matrix multiplies.
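A back-of-envelope check of that claim (the ~1.8 GHz boost clock here is our assumption, not a number from the post):

```python
# How many 32x32 matmuls fit in one 30-cycle shared-memory access?
peak_flops = 989e12  # half-precision tensor core FLOPs/s, from above
num_sms = 132
clock_hz = 1.8e9     # assumed boost clock (hypothetical round number)

flops_per_sm_per_cycle = peak_flops / (num_sms * clock_hz)  # ~4,200
flops_in_30_cycles = 30 * flops_per_sm_per_cycle            # ~125,000

# A 32x32x32 matrix multiply is 2 * 32^3 floating-point operations.
matmul_flops = 2 * 32**3  # 65,536

ratio = flops_in_30_cycles / matmul_flops
print(ratio)  # ~1.9, i.e. "almost two" matmuls per access
```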

In pre­vi­ous work (like Flash Attention), we’ve fo­cused more on the HBM-SRAM bot­tle­neck. And in­deed: this re­ally used to be the bot­tle­neck! But as HBM has got­ten faster and the ten­sor cores con­tinue to grow out of pro­por­tion with the rest of the chip, even rel­a­tively small la­ten­cies like those from shared mem­ory have also be­come im­por­tant to ei­ther re­move or hide.

Shared memory can be tricky to work with because it is “banked” into 32 separate stores of memory. If one is not careful, this can lead to something called “bank conflicts”, where the same memory bank is asked to simultaneously provide multiple different pieces of memory. This leads to requests being serialized, and in our experience this can disproportionately slow down a kernel — and the register layouts required by wgmma and mma instructions would naively suffer from these bank conflicts. The solution is to rearrange shared memory with various “swizzling” patterns so as to avoid these conflicts, but it is an important detail to get right.
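A toy model makes the effect concrete. This is a sketch in Python, not ThunderKittens’ actual swizzle: it just maps 4-byte word w to bank w % 32 and counts how many of a warp’s 32 accesses pile onto a single bank.

```python
# Count the worst-case number of simultaneous requests to a single bank.
def max_bank_conflicts(word_addresses, num_banks=32):
    counts = {}
    for w in word_addresses:
        bank = w % num_banks
        counts[bank] = counts.get(bank, 0) + 1
    return max(counts.values())

# 32 threads reading one column of a 32-word-wide row-major tile:
# thread t touches word t*32, so every request lands in bank 0.
column = [t * 32 for t in range(32)]
print(max_bank_conflicts(column))  # 32 -> fully serialized

# A simple XOR swizzle spreads the same column across all 32 banks.
swizzled = [(t * 32) ^ t for t in range(32)]
print(max_bank_conflicts(swizzled))  # 1 -> conflict-free
```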

More gen­er­ally, we have found it very valu­able to avoid move­ment be­tween reg­is­ters and shared mem­ory when pos­si­ble, and oth­er­wise to use the built-in hard­ware (wgmma and TMA in­struc­tions) to do data move­ment asyn­chro­nously when pos­si­ble. Synchronous move­ment us­ing the ac­tual warps is a worst-case fall­back with the great­est gen­er­al­ity.

One in­ter­est­ing quirk of the H100 is that the ten­sor cores and mem­ory are both fast enough that merely pro­duc­ing the mem­ory ad­dresses to fetch takes a sub­stan­tial frac­tion of the re­sources of the chip. (This is even more the case when com­pli­cated in­ter­leaved or swiz­zling pat­terns are added in.)

NVIDIA ap­pears to un­der­stand this, as they have be­stowed on us the Tensor Memory Accelerator (or TMA, as it likes to be called). TMA al­lows you to spec­ify a multi-di­men­sional ten­sor lay­out in global and shared mem­ory, tell it to asyn­chro­nously fetch a sub­tile of that ten­sor, and trip a bar­rier when it’s done. This saves all of the ad­dress gen­er­a­tion costs, and ad­di­tion­ally makes it much eas­ier to con­struct pipelines.

We have found TMA to be, like wgmma.mma_async, completely indispensable in achieving the full potential of the H100. (Probably more so than wgmma, in our experience.) It saves register resources and instruction dispatches, and also has useful features such as the ability to perform reductions onto global memory asynchronously, too — this is particularly useful in complex backwards kernels. As with wgmma, the main quirk is that its swizzling modes are a bit difficult to decipher without some reverse engineering, but we had substantially less pain on this point.

For those newer to CUDA, oc­cu­pancy refers to the num­ber of co-sched­uled threads on the ex­act same ex­e­cu­tion hard­ware. Each cy­cle, the warp sched­uler on that quad­rant of the SM will try to is­sue an in­struc­tion to a warp of threads that are ready for an in­struc­tion. NVIDIA uses this model be­cause it can en­able the hard­ware to be more eas­ily kept full. For ex­am­ple, while one warp of threads is wait­ing for a ma­trix mul­ti­ply, an­other can re­ceive an in­struc­tion to use the fast ex­po­nen­tial hard­ware.

In some ways, the H100 is less re­liant on oc­cu­pancy than pre­vi­ous gen­er­a­tions of the hard­ware. The asyn­chro­nous fea­tures of the chip mean that even a sin­gle in­struc­tion stream can keep many parts of the hard­ware busy — fetch­ing mem­ory, run­ning ma­trix mul­ti­plies, do­ing shared mem­ory re­duc­tions, and still si­mul­ta­ne­ously run­ning math on the reg­is­ters.

But oc­cu­pancy is very good at hid­ing both sins and sync’s. A per­fectly de­signed pipeline might run rea­son­ably fast even with­out any ad­di­tional oc­cu­pancy, but our ob­ser­va­tions sug­gest that NVIDIA re­ally has de­signed their GPUs with oc­cu­pancy in mind. And there are enough syn­chro­niza­tions — and enough ways to make mis­takes — that find­ing ways to in­crease oc­cu­pancy has, in our ex­pe­ri­ence, usu­ally yielded good re­turns at in­creas­ing the re­al­ized uti­liza­tion of the hard­ware.

Finally, while oc­cu­pancy is merely use­ful on the H100, we have found it to be in­creas­ingly im­por­tant on the A100 and RTX 4090, re­spec­tively, likely be­cause they rely in­creas­ingly on syn­chro­nous in­struc­tion dis­patches, rel­a­tive to the H100.

Based on the above, we asked our­selves how we might make it eas­ier to write the kinds of ker­nels we care about while still ex­tract­ing the full ca­pa­bil­i­ties of the hard­ware. Motivated by a con­tin­u­ing pro­lif­er­a­tion of new ar­chi­tec­tures within the lab (and the fact that Flash Attention is like 1200 lines of code), we ended up de­sign­ing a DSL em­bed­ded within CUDA — at first for our own in­ter­nal use.

But then we de­cided it was use­ful enough that, with love in our hearts, we cleaned it up and have re­leased it for you. ThunderKittens is that em­bed­ded DSL. It is named ThunderKittens be­cause we think kit­tens are cute, and also we think it is funny to make you type kit­tens:: in your code.

Figure 4: A ThunderKitten. Look at her big eyes! Are you not entranced!?!?

It is meant to be as simple as possible, and contains four templated types: register tiles, register vectors, shared tiles, and shared vectors.

Tiles are parameterized by a height, width, and layout. Register vectors are parameterized by a length and a layout, and shared vectors just by a length. (They don’t generally suffer from bank conflicts.)

We also give op­er­a­tions to ma­nip­u­late them, ei­ther at the warp level or at the level of a col­lab­o­ra­tive group of warps. Examples in­clude:

Initializers — zero out a shared vec­tor, for ex­am­ple.

Since ThunderKittens is embedded within CUDA (in contrast to libraries like Triton, which we also love very much and rely on heavily), the abstractions fail gracefully. If it’s missing something, just extend it to do what you want!

To show an example of these primitives in action, consider Tri’s lovely flash attention — a beautiful algorithm, but complicated to implement in practice, even on top of NVIDIA’s wonderful Cutlass library.

Here’s a sim­ple for­ward flash at­ten­tion ker­nel for an RTX 4090, writ­ten in ThunderKittens.

#define NUM_WORKERS 16 // This kernel uses 16 workers in parallel per block, to help issue instructions more quickly.

using namespace kittens; // this kernel only handles headdim=64 for simplicity. Also n should be a multiple of 256 here.
__global__ void attend_ker64(int n, const bf16* __restrict__ __q__, const bf16* __restrict__ __k__, const bf16* __restrict__ __v__, bf16* __o__) {

    auto warpid      = kittens::warpid();
    auto block_start = blockIdx.x*(n*64);
    const bf16 *_q = __q__ + block_start, *_k = __k__ + block_start, *_v = __v__ + block_start;
    bf16 *_o = __o__ + block_start;

    extern __shared__ alignment_dummy __shm[]; // this is the CUDA shared memory
    shared_allocator al((int*)&__shm[0]);

    // K and V live in shared memory -- this is about all that will fit.
    st_bf_1x4<ducks::st_layout::swizzle> (&k_smem)[NUM_WORKERS] = al.allocate<st_bf_1x4<ducks::st_layout::swizzle>, NUM_WORKERS>();
    st_bf_1x4<ducks::st_layout::swizzle> (&v_smem)[NUM_WORKERS] = al.allocate<st_bf_1x4<ducks::st_layout::swizzle>, NUM_WORKERS>();

    // Initialize all of the register tiles.
    rt_bf_1x4<> q_reg, k_reg, v_reg; // v_reg needs to be swapped into col_l
    rt_fl_1x1<> att_block;
    rt_bf_1x1<> att_block_mma;
    rt_fl_1x4<> o_reg;
    rt_fl_1x1<>::col_vec max_vec_last, max_vec;   // these are column vectors for the attention block
    rt_fl_1x1<>::col_vec norm_vec_last, norm_vec; // these are column vectors for the attention block

    int qo_blocks = n / (q_reg.rows*NUM_WORKERS), kv_blocks = n / (q_reg.rows*NUM_WORKERS);

    for(auto q_blk = 0; q_blk < qo_blocks; q_blk++) {

        // each warp loads its own Q tile of 16x64, and then multiplies by 1/sqrt(d)
        load(q_reg, _q + (q_blk*NUM_WORKERS + warpid)*q_reg.num_elements, q_reg.cols);
        mul(q_reg, q_reg, __float2bfloat16(0.125f)); // temperature adjustment

        // zero flash attention L, M, and O registers for the Q chunk
        neg_infty(max_vec);
        zero(norm_vec);
        zero(o_reg);

        // iterate over k, v for these q's that have been loaded
        for(auto kv_idx = 0; kv_idx < kv_blocks; kv_idx++) {

            // each warp loads its own chunk of k, v into shared memory
            load(v_smem[warpid], _v + (kv_idx*NUM_WORKERS + warpid)*q_reg.num_elements, q_reg.cols);
            load(k_smem[warpid], _k + (kv_idx*NUM_WORKERS + warpid)*q_reg.num_elements, q_reg.cols);
            __syncthreads(); // we need to make sure all memory is loaded before we can begin the compute phase

            // now each warp goes through all of the subtiles, loads them, and then does the flash attention internal alg.
            for(int subtile = 0; subtile < NUM_WORKERS; subtile++) {

                load(k_reg, k_smem[subtile]); // load k from shared into registers

                zero(att_block); // zero 16x16 attention tile
                mma_ABt(att_block, q_reg, k_reg, att_block); // Q@K.T

                copy(norm_vec_last, norm_vec);
                copy(max_vec_last, max_vec);

                row_max(max_vec, att_block, max_vec); // accumulate onto the max_vec
                sub_row(att_block, att_block, max_vec); // subtract max from attention -- now all <= 0
                exp(att_block, att_block); // exponentiate the block in-place

                sub(max_vec_last, max_vec_last, max_vec); // subtract new max from old max to find the new normalization
                exp(max_vec_last, max_vec_last); // exponentiate this vector -- this is what we need to normalize by

                mul(norm_vec, norm_vec, max_vec_last); // and the norm vec is now normalized
                row_sum(norm_vec, att_block, norm_vec); // accumulate the new attention block onto the now-rescaled norm_vec
                div_row(att_block, att_block, norm_vec); // now the attention block is correctly normalized

                mul(norm_vec_last, norm_vec_last, max_vec_last); // normalize the previous norm vec according to the new max
                div(norm_vec_last, norm_vec_last, norm_vec); // normalize the previous norm vec according to the new norm

                copy(att_block_mma, att_block); // convert to bf16 for the mma
                load(v_reg, v_smem[subtile]); // load v from shared into registers
                rt_bf_1x4<ducks::rt_layout::col> &v_reg_col = swap_layout_inplace(v_reg); // this is a reference and the call has invalidated v_reg

                mul_row(o_reg, o_reg, norm_vec_last); // normalize o_reg in advance of mma_AB'ing onto it
                mma_AB(o_reg, att_block_mma, v_reg_col, o_reg); // mfma onto o_reg with the local attention@V matmul
            }
            __syncthreads(); // we need to make sure all warps are done before we can start loading the next kv chunk
        }

        store(_o + (q_blk*NUM_WORKERS + warpid)*q_reg.num_elements, o_reg, q_reg.cols); // write out o. compiler has an issue with register usage if d is made constexpr :/
    }
}


Read the original on hazyresearch.stanford.edu »

3 904 shares, 35 trendiness

The Forged Apple Employee Badge

This eBay auc­tion, spot­ted by Eric Vitiello, im­me­di­ately caught my eye:

Wow. Someone was sell­ing Apple Employee #10’s em­ployee badge?! What an in­cred­i­ble piece of Apple his­tory! Sure, it’s not Steve Jobs’ badge (despite the auc­tion ti­tle), but there are only so many of these in the world — especially from one of the first ten em­ploy­ees.

At first, it looked good. The plas­tic was scuffed with age, the tape on the map was yel­lowed, the logo was (mostly) cor­rect, and Sherry Livingston re­ally was Employee #10.

But it also felt a little off. The scuffing looked… sandpapery. The splotches on the map felt overcooked. And I couldn’t stop looking at the “typewritten” part…

This badge would’ve been (obviously!) made be­fore desk­top pub­lish­ing. A badge tem­plate would’ve been printed by a lo­cal print­ing com­pany, then fed into a type­writer to type the in­di­vid­ual em­ployee de­tails. And that typed text is sus­pi­ciously uni­form.

And just as I was be­gin­ning to 🧐, along came Chris:

What does this Chris guy know, any­way? Well, he’s Chris Espinosa. Who just so hap­pens to be Apple Employee #8.

And we know ex­actly what Chris’ badge looked like:

I asked Chris (who I truly ad­mire!) for more thoughts:

So, yeah. One of the most qualified people on planet Earth to say this is a fake is saying this is a fake.

I had no choice at this point. I sim­ply had to med­dle.

It seems like the German Red Cross runs a kind of sec­ond hand/​char­ity shop, so ok, sure. But why would the German Red Cross have the em­ployee badge for Apple Employee #10?

I couldn’t wait. The seller sent the “Red Croos” proof just a few hours later. (Enough time to… create it?)

At first blush, again, this prove­nance looks pretty good, right? A German pur­chase re­ceipt, dated 2001, for the item pic­tured. The right logo. A nice em­boss. Seems plau­si­ble.

But, again again, I had a weird feeling — this series of photos was trying too hard. That binder labelled “BILLS 200[0]–2010”, conveniently flipped upside down for casual authenticity? Why would you put that in these photos unless you were trying a bit too hard to make your case? It felt like a set dresser trying to stage a movie scene.

As you can see, the in­voice is ex­pressed in the old German cur­rency, the German mark, be­fore the en­try of the EURO in the years fol­low­ing.

No way. A reg­u­lar hu­man would­n’t point this out. Someone proud of a de­tail they thought of in their forgery would point this out. This lit­tle de­tail would’ve been far more ef­fec­tive for me to dis­cover on my own.

It was time to to­tally close the case. It was time to en­gage the in­ter­net.

The Mastodon re­sponse from Germans was swift and bru­tal.

There was no doubt. This proof” was yet an­other forgery.

I told the seller this. And then, weirdly, for the amount of ef­fort put into this, they made a very dumb er­ror:

…yes, the item sold be­fore I could fin­ish this post. So, my sin­cere apolo­gies to who­ever out there just spent $946.00 on a to­tal (but in­ter­est­ing!) work of fic­tion.

Here are the key take­aways from our meet­ing to­day:

* Don’t buy any­thing from this par­tic­u­lar seller

* When in doubt, Engage The Internet®


Read the original on cabel.com »

4 809 shares, 33 trendiness

Computer Scientists Invent an Efficient New Way to Count

Imagine that you’re sent to a pristine rainforest to carry out a wildlife census. Every time you see an animal, you snap a photo. Your digital camera will track the total number of shots, but you’re only interested in the number of unique animals — all the ones that you haven’t counted already. What’s the best way to get that number? “The obvious solution requires remembering every animal you’ve seen so far and comparing each new animal to the list,” said Lance Fortnow, a computer scientist at the Illinois Institute of Technology. But there are cleverer ways to proceed, he added, because if you have thousands of entries, the obvious approach is far from easy.

It gets worse. What if you’re Facebook, and you want to count the num­ber of dis­tinct users who log in each day, even if some of them log in from mul­ti­ple de­vices and at mul­ti­ple times? Now we’re com­par­ing each new lo­gin to a list that could run to the bil­lions.

In a re­cent pa­per, com­puter sci­en­tists have de­scribed a new way to ap­prox­i­mate the num­ber of dis­tinct en­tries in a long list, a method that re­quires re­mem­ber­ing only a small num­ber of en­tries. The al­go­rithm will work for any list where the items come in one at a time — think words in a speech, goods on a con­veyor belt or cars on the in­ter­state.

The CVM al­go­rithm, named for its cre­ators — Sourav Chakraborty of the Indian Statistical Institute, Vinodchandran Variyam of the University of Nebraska, Lincoln, and Kuldeep Meel of the University of Toronto — is a sig­nif­i­cant step to­ward solv­ing what’s called the dis­tinct el­e­ments prob­lem, which com­puter sci­en­tists have grap­pled with for more than 40 years. It asks for a way to ef­fi­ciently mon­i­tor a stream of el­e­ments — the to­tal num­ber of which may ex­ceed avail­able mem­ory — and then es­ti­mate the num­ber of unique el­e­ments.

“The new algorithm is astonishingly simple and easy to implement,” said Andrew McGregor of the University of Massachusetts, Amherst. “I wouldn’t be surprised if this became the default way the [distinct elements] problem is approached in practice.”

To il­lus­trate both the prob­lem and how the CVM al­go­rithm solves it, imag­ine that you’re lis­ten­ing to the au­dio­book of Hamlet. There are 30,557 words in the play. How many are dis­tinct? To find out, you could lis­ten to the play (making fre­quent use of the pause but­ton), write down each word al­pha­bet­i­cally in a note­book, and skip over words al­ready on your list. When you reach the end, you’ll just count the num­ber of words on the list. This ap­proach works, but it re­quires an amount of mem­ory roughly equal to the num­ber of unique words.
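The notebook method is easy to sketch (the word list here is a hypothetical stand-in for Hamlet); the memory cost is the point: the set grows with every distinct word.

```python
# Naive distinct count: remember every unique word seen so far.
def count_distinct_naive(stream):
    seen = set()
    for word in stream:
        seen.add(word)  # duplicates are ignored automatically
    return len(seen)

words = "to be or not to be that is the question".split()
print(count_distinct_naive(words))  # 8 distinct words out of 10
```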

In typical data-streaming situations, there could be millions of items to keep track of. “You might not want to store everything,” Variyam said. And that’s where the CVM algorithm can offer an easier way. The trick, he said, is to rely on randomization.
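The core trick can be sketched in a few lines. The following Python is an illustrative reconstruction of the CVM idea (keep a bounded random sample, halving every item's survival probability whenever memory fills up), not the authors' reference code:

```python
import random

def cvm_estimate(stream, memory_limit=100, seed=None):
    """Estimate the number of distinct items in a one-pass stream while
    never remembering more than `memory_limit` of them."""
    rng = random.Random(seed)
    kept = set()   # current sample of items
    p = 1.0        # probability that any given item survived into `kept`
    for item in stream:
        kept.discard(item)       # forget any earlier decision about this item
        if rng.random() < p:     # re-admit it with the current probability
            kept.add(item)
        if len(kept) >= memory_limit:
            # memory full: evict each survivor with probability 1/2, halve p
            kept = {x for x in kept if rng.random() < 0.5}
            p /= 2
    return len(kept) / p         # estimate of the distinct count
```

Fed a stream containing exactly 1,000 distinct values, the estimate lands near 1,000 even though at most a few hundred items are ever stored at once.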


Read the original on www.quantamagazine.org »

5 805 shares, 45 trendiness

HigherOrderCO/Bend: A massively parallel, high-level programming language

Unlike low-level alternatives like CUDA and Metal, Bend has the feeling and features of expressive languages like Python and Haskell, including fast object allocations, higher-order functions with full closure support, unrestricted recursion, and even continuations. Yet, it runs on massively parallel hardware like GPUs, with near-linear speedup based on core count, and zero explicit parallel annotations: no thread spawning, no locks, no mutexes, no atomics. Bend is powered by the HVM2 runtime.

Currently not work­ing on Windows, please use WSL2 as a workaround.

First, in­stall Rust nightly. Then, in­stall both HVM2 and Bend with:

cargo +nightly in­stall hvm

cargo +nightly in­stall bend-lang

Finally, write a Bend file and run it with one of these commands:

bend run

You can also compile Bend to standalone C/CUDA files with gen-c and gen-cu, for maximum performance. But keep in mind our code gen is still in its infancy, and is nowhere near as mature as SOTA compilers like GCC and GHC.

To write par­al­lel pro­grams in Bend, all you have to do is… noth­ing. Other than not mak­ing it in­her­ently se­quen­tial! For ex­am­ple, the ex­pres­sion:

(((1 + 2) + 3) + 4)

Cannot run in parallel, because +4 depends on +3, which depends on (1+2). But the following expression:

((1 + 2) + (3 + 4))

Can run in par­al­lel, be­cause (1+2) and (3+4) are in­de­pen­dent; and it will, per Bend’s fun­da­men­tal pledge:

Everything that can run in par­al­lel, will run in par­al­lel.

For a more com­plete ex­am­ple, con­sider:

# Sorting Network = just rotate trees!
def sort(d, s, tree):
  switch d:
    case 0:
      return tree
    case _:
      (x,y) = tree
      lft = sort(d-1, 0, x)
      rgt = sort(d-1, 1, y)
      return rots(d, s, lft, rgt)

# Rotates sub-trees (Blue/Green Box)
def rots(d, s, tree):
  switch d:
    case 0:
      return tree
    case _:
      (x,y) = tree
      return down(d, s, warp(d-1, s, x, y))



implements a bitonic sorter with immutable tree rotations. It is not the kind of algorithm you’d expect to run fast on GPUs. Yet, since it uses a divide-and-conquer approach, which is inherently parallel, Bend will run it multi-threaded. Some benchmarks:

That’s a 57x speedup by doing nothing. No thread spawning, no explicit management of locks or mutexes. We just asked Bend to run our program on an RTX, and it did. Simple as that.

Bend isn’t limited to a specific paradigm, like tensors or matrices. Any concurrent system, from shaders to Erlang-like actor models, can be emulated on Bend. For example, to render images in real time, we could simply allocate an immutable tree on each frame:

# given a shader, returns a square image
def render(depth, shader):
  bend d = 0, i = 0:
    when d < depth:
      color = (fork(d+1, i*2+0), fork(d+1, i*2+1))
    else:
      width = depth / 2
      color = shader(i % width, i / width)
  return color

# given a position, returns a color
# for this demo, it just busy loops
def demo_shader(x, y):
  bend i = 0:
    when i < 5000:
      color = fork(i + 1)
    else:
      color = 0x000001
  return color

# renders a 256x256 image using demo_shader
def main:
  return render(16, demo_shader)

And it would actually work. Even involved algorithms parallelize well on Bend. Long-distance communication is performed by global beta-reduction (as per the Interaction Calculus), and synchronized correctly and efficiently by HVM2’s atomic linker.

Bend is de­vel­oped by HigherOrderCO.com - join our Discord!


Read the original on github.com »

6 710 shares, 27 trendiness

Privacy Principles: Search, Learning and Artificial Intelligence

Our mission is to build a product that makes work life simpler, more pleasant and more productive. Our guiding principle as we build this product is that the privacy and security of Customer Data is sacrosanct, as detailed in our Privacy Policy, Security Documentation, SPARC and the Slack Terms. Machine Learning (ML) and Artificial Intelligence (AI) are useful tools that we use in limited ways to enhance our product mission. We do not develop LLMs or other generative models using customer data. To develop non-generative AI/ML models for features such as emoji and channel recommendations, our systems analyze Customer Data (e.g. messages, content, and files) submitted to Slack as well as Other Information (including usage information) as defined in our Privacy Policy and in your customer agreement. To ensure the privacy and security of Customer Data in this particular context, we have a few guiding principles:

* Data will not leak across workspaces. For any model that will be used broadly across all of our customers, we do not build or train these models in such a way that they could learn, memorize, or be able to reproduce some part of Customer Data.

* We have technical controls in place to prevent access. When developing AI/ML models or otherwise analyzing Customer Data, Slack can’t access the underlying content. We have various technical measures preventing this from occurring. Please read our Security White Paper for more info on these controls that protect the confidentiality and security of Customer Data.

* We offer Customers a choice around these practices. If you want to exclude your Customer Data from helping train Slack global models, you can opt out. If you opt out, Customer Data on your workspace will only be used to improve the experience on your own workspace, and you will still enjoy all of the benefits of our globally trained AI/ML models without contributing to the underlying models.

To opt out, please have your Org or Workspace Owners or Primary Owner contact our Customer Experience team at feedback@slack.com with your Workspace/Org URL and the subject line “Slack Global model opt-out request.” We will process your request and respond once the opt-out has been completed.

How Slack may use Customer Data (e.g. messages, content, files) and Other Information to update our services

Working from the above principles, here are a few examples of improvements and privacy-protective techniques that our product and analytics teams may use to develop, update and improve Slack:

* Channel Recommendations: We may use insights to recommend that a user join a new public channel in their company. We make these suggestions based on channel membership, activity, and topic overlaps. Our model learns from previous suggestions and whether or not a user joins the channel we recommend. We protect privacy while doing so by separating our model from Customer Data. We use external models (not trained on Slack messages) to evaluate topic similarity, outputting numerical scores. Our global model only makes recommendations based on these numerical scores and non-Customer Data. For more technical details, please visit our Engineering Blog.

* Search Results: Our search machine learning models help users find what they’re seeking by identifying the right results for a particular query. We do this based on historical search results and previous engagements, without learning from the underlying text of the search query, result, or proxy. Simply put, our model can’t reconstruct the search query or result. Instead, it learns from team-specific, contextual information like the number of times a message has been clicked in a search, or an overlap in the number of words in the query and recommended message.

* Autocomplete: Slack might make suggestions to complete search queries or other text; for example, autocompleting the phrase “Customer Support” after a user types the first several letters of this phrase. These suggestions are local and sourced from common public message phrases in the user’s workspace. Our algorithm that picks from potential suggestions is trained globally on previously suggested and accepted completions. We protect data privacy by using rules to score the similarity between the typed text and suggestion in various ways, including only using the numerical scores and counts of past interactions in the algorithm.

* Emoji Suggestion: Slack might suggest emoji reactions to messages using the content and sentiment of the message, the historic usage of the emoji, and the frequency of use of the emoji on the team in various contexts. For instance, if 🎉 is a common reaction to celebratory messages in a particular channel, we will suggest users react to new, similarly positive messages with 🎉. To do this while protecting Customer Data, we might use an external model (not trained on Slack messages) to classify the sentiment of the message. Our model would then suggest an emoji considering only the frequency with which a particular emoji has been associated with messages of that sentiment in that workspace.

These types of thoughtful personalizations and improvements are only possible if we study and understand how our users interact with Slack. Slack takes privacy seriously, and our confidentiality obligations described in our customer agreements and Privacy Policy apply in each of these scenarios. Customers own their own Customer Data. Slack aggregates and disassociates Customer Data such that Slack’s use of Customer Data to update the Services will never identify any of our customers or individuals as the source of any of these improvements to any third party, other than to Slack’s affiliates or sub-processors.

Generative AI is a newer category of AI systems that can generate content, such as text, in response to prompts a user enters. This AI category includes Large Language Models (LLMs). Slack uses generative AI in its Slack AI product offering, leveraging third-party LLMs. Customers purchase Slack AI as an add-on, and the generative AI functionality is not included in the standard Slack offering. No Customer Data is used to train third-party LLM models. Slack does not train LLMs or other generative models on Customer Data, or share Customer Data with any LLM providers. Learn more about How We Built Slack AI To Be Secure and Private. Slack AI uses off-the-shelf LLMs where the models are not updated by, and don’t in other ways retain, Customer Data after a request to them. Additionally, because Slack AI hosts these models on its own AWS infrastructure, Customer Data never leaves Slack’s trust boundary, and the providers of the LLM never have any access to the Customer Data.
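As a toy illustration of the separation described in the emoji example (a hypothetical sketch, not Slack's actual system), a suggester can be built so it only ever sees a sentiment label from an external classifier plus per-workspace frequency counts, never the message text itself:

```python
from collections import Counter, defaultdict

class EmojiSuggester:
    """Toy suggester that never stores message text: it keeps only
    (sentiment label, emoji) counts."""

    def __init__(self, classify):
        # `classify` stands in for an external sentiment model that was
        # not trained on workspace messages.
        self.classify = classify
        self.counts = defaultdict(Counter)  # sentiment -> emoji frequencies

    def observe(self, message, emoji):
        # Only the derived sentiment label and the emoji are retained.
        self.counts[self.classify(message)][emoji] += 1

    def suggest(self, message):
        top = self.counts[self.classify(message)].most_common(1)
        return top[0][0] if top else None
```

The privacy property falls out of the data flow: even if the stored counts leaked, they contain sentiment labels and emoji tallies, not reconstructable messages.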


Read the original on slack.com »

7 670 shares, 27 trendiness

Modos-Labs/Glider: Open-source E-ink monitor. Mirror of https://gitlab.com/zephray/glider

Open-source Eink mon­i­tor with an em­pha­sis on low la­tency.

Note: This repo only con­tains the hard­ware de­sign, the gate­ware run­ning on the FPGA is my open-source Caster EPDC de­sign. This README also con­tains in­for­ma­tion about the Caster as well.

This is a long doc­u­ment, con­tain­ing not just in­for­ma­tion about this pro­ject, but also pretty much every­thing I know about Eink. Given it’s a bit hard to gather in­for­ma­tion about Eink on­line, I think this is the right thing to do. Use the fol­low­ing table of con­tents to nav­i­gate around.

Eink is a reg­is­tered trade­mark and brand of E Ink Corporation. All the con­tents pro­vided in this repo are based on pub­licly avail­able in­for­ma­tion on­line and orig­i­nal re­search. They are not en­dorsed by Eink in any way and they may con­tain er­rors and/ or in­ac­cu­ra­cies.

If you are interested in Eink or any other display technologies, I have a Discord server for that. Feel free to join: https://discord.gg/rtT7euSHQS. (This Discord server is also not endorsed by Eink or any other company. It’s not a customer support server.)

* Supports elec­trophoretic dis­play pan­els with par­al­lel I/F (Eink(R), SiPix and DES)

* Supports both mono­chrome and color-fil­ter-ar­ray (such as Kaleido(TM)) based color screen

* Hardware Bayer dithering, blue-noise dithering, and error-diffusion dithering with no additional latency

* Epaper power sup­ply with up to 1A peak cur­rent on +/-15V rail sup­port­ing large pan­els

* Up to 133MP/s pro­cess­ing rate with dither­ing en­abled, >200MP/s when dis­abled

The board is de­signed with KiCad. You may need the lat­est sta­ble ver­sion of KiCad to open the source file.

This repo hosts the PCB design, firmware source code, and a reference 3D-printable case design. The RTL code is in a separate repo: https://gitlab.com/zephray/Caster/.

Eink is the brand of a family of paper-like electrophoretic displays. The underlying technology was invented at the MIT Media Lab between 1995 and 1997 by Barrett Comiskey, J. D. Albert, and Joseph Jacobson. They later founded the E Ink Corporation to commercialize this technology.

Nowadays they are com­monly used on e-read­ers and elec­tronic shelf la­bels. You’ve prob­a­bly seen them on Kindle, in stores, or maybe in some train sta­tions as well.

This sec­tion gives an overview of the elec­trophoretic dis­plays, in­clud­ing the screen pan­els avail­able and un­der­ly­ing tech­nol­ogy. Note this pro­ject does­n’t and can’t sup­port all elec­trophoretic screens. This doc­u­men­ta­tion also solely fo­cuses on us­ing ex­ist­ing off-the-shelf screen pan­els rather than the physics or man­u­fac­tur­ing process of one.

In the sim­plest form, you have charged par­ti­cles with dif­fer­ent col­ors, dis­persed in some oil in some trans­par­ent con­tainer. By ap­ply­ing elec­tric fields the par­ti­cles can be moved up or down to pro­duce ei­ther black or white, or a mix­ture of that.

There are multiple technologies based on this basic concept, namely Eink’s micro-capsule display, SiPix’s micro-cup display (SiPix has since been acquired by Eink), and WFT’s DES display. They differ in the specific way the particles are confined in containers, but are otherwise very similar.

The pixels on the screen are typically arranged as a 2D array, driven with TFTs. The pixels are scanned/driven periodically at a fixed refresh rate, typically ranging from 50Hz to 120Hz. Applying a positive voltage to a pixel will typically drive the particles toward the white state, while applying a negative voltage will drive them toward the black state. This is similar to active-matrix TN/IPS LCDs, which also use 2D TFT arrays and electric fields for changing state. However, unlike LCDs, EPDs maintain their state after the electric field is removed. So unlike LCDs, which require continuous refreshing, EPDs only need to be refreshed until the pixels are fully driven.

In terms of dri­ving the screen panel, de­pend­ing on the pixel value (1 or 0), each pixel would be dri­ven ei­ther with a pos­i­tive volt­age or a neg­a­tive volt­age. A global counter can be used to count the frames elapsed and stop dri­ving the pix­els af­ter a pre­de­fined pe­riod of time (for ex­am­ple, 100ms). Two frame­buffers are typ­i­cally used for de­ter­min­ing if the pixel has changed color or not. If not, then the pixel does not need to be dri­ven.
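That per-pixel decision logic can be sketched as follows (illustrative Python; a real controller does this in hardware, one scan line at a time, and the frame rate and drive time here are assumptions, not values from any panel datasheet):

```python
FRAME_RATE_HZ = 50   # panel scan rate (assumed)
DRIVE_MS = 100       # total drive time per update (assumed)
DRIVE_FRAMES = DRIVE_MS * FRAME_RATE_HZ // 1000  # stop driving after 5 frames

def pixel_voltage(prev, target, frames_elapsed):
    """Drive polarity for one pixel. prev/target are 0 (black) or 1 (white);
    returns +1 for +V (toward white), -1 for -V (toward black), 0 for no drive."""
    if prev == target or frames_elapsed >= DRIVE_FRAMES:
        return 0  # pixel unchanged, or already fully driven: leave it alone
    return 1 if target == 1 else -1

def drive_frame(prev_fb, target_fb, frames_elapsed):
    """One pass over the two framebuffers described above."""
    return [pixel_voltage(p, t, frames_elapsed)
            for p, t in zip(prev_fb, target_fb)]
```

The two-framebuffer comparison is what lets the controller skip pixels that didn't change, and the global frame counter is what prevents overdriving.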

In terms of dis­play qual­ity, EPDs are no match for mod­ern IPS LCDs. The fol­low­ing is a com­par­i­son table of key pa­ra­me­ters. The spe­cific num­ber would vary de­pend­ing on the screen used but should be within the same ball­park.

It has a few advantages. It reflects light instead of emitting light, so it generally consumes less power and can be used outdoors, etc. It’s also bistable, which means that it retains the image after the power has been removed. Personally, the biggest differentiating factor for me (the author of this README) is that it looks like paper.

The im­age above shows a com­par­i­son be­tween re­flec­tive TFT LCD (SHARP mem­ory LCD in this case) and Eink. The LCD has a mir­ror-like tex­ture which changes re­flec­tiv­ity dras­ti­cally in dif­fer­ent an­gles, while the Eink is more pa­per-like.

There are many other re­flec­tive or bistable dis­play tech­nolo­gies. They are all in­ter­est­ing dis­plays on their own, but none of them feels like pa­per (yet).

Overall, there is no sin­gle per­fect dis­play tech­nol­ogy. Each has its own unique strength. Pick the right one for your pro­ject.

The Eink controller is in some ways similar to the display controller (DC/CRTC) + timing controller (TCON) in a typical LCD-based system. It takes the raw image data and converts it into the signals required to drive the screen.

To understand the actual work of an Eink controller, start with the basic concept: the color of a pixel can be changed by applying a positive or negative voltage for a finite period of time. From the controller’s perspective, depending on the current state of the pixel and the desired state of the pixel, there are 4 possibilities.

The con­troller needs to store and main­tain the screen state in­side of its own buffer mem­ory, so it would typ­i­cally have a large on-chip SRAM or an off-chip SDRAM con­troller. The con­troller should also have a timer to en­sure the screen does­n’t get over­driven or un­der­driven.

The controller often uses a so-called “waveform” to replace the action column of the previous table. Instead of hardcoding the action for each state transition, the actions are stored in a look-up table (LUT), which can be modified at runtime to allow higher flexibility.

Controllers may also of­fer more ad­vanced fea­tures such as dither­ing ac­cel­er­a­tion, mul­ti­ple re­gion up­dates, au­to­matic LUT se­lec­tion, etc.
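In other words, the hardcoded action table becomes data. A hypothetical sketch of that idea (state and drive codes invented for illustration; real Eink waveforms are proprietary, per-panel, and per-temperature):

```python
# waveform[frame][(current_state, target_state)] -> drive code for that frame.
# States: 0 = black, 1 = white. Drive codes: -1 = -V, 0 = no drive, +1 = +V.
NUM_FRAMES = 4
waveform = [
    {(0, 0): 0, (1, 1): 0, (0, 1): +1, (1, 0): -1}  # same action every frame here
    for _ in range(NUM_FRAMES)
]

def lut_drive(current, target, frame):
    """What the controller outputs for one pixel on one frame of an update."""
    return waveform[frame][(current, target)]
```

Because the table is plain data, swapping in a different waveform at runtime (say, one that stops driving early for a faster, lower-quality update mode) requires no logic changes.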

As discussed in the previous section, an Eink screen needs to be coupled to an Eink controller to function. Aside from that, the screen also needs high-voltage drivers to drive the TFTs and the pixels. Virtually all E-paper panels use either COG (Chip-on-Glass) or TAB (Tape Automated Bonding) to integrate some chips onto the screen panel itself. Most of the screens available today can be divided into two categories based on whether or not the controller is integrated:

Here is a non-exhaustive list of the types based on their size. (The size or resolution is not related to or limited by the type; it’s just that, for a certain size, vendors tend to make them the same type.)

One may no­tice that al­most all e-read­ers/ e-ink cell­phones use screens with­out con­trollers, while al­most all e-ink elec­tronic shelf la­bels (ESL) use screens with con­trollers. This gives some hints about the ad­van­tages and dis­ad­van­tages of two types:

Please keep in mind the dis­cus­sion is about off-the-shelf screens you can buy to­day. These trade­offs do not nec­es­sar­ily come from the fact the con­troller is in­te­grated or not.

Note that I men­tioned the re­fresh speed and to­tal up­date la­tency. They are dif­fer­ent:

The refresh speed refers to the time the refresh itself takes: from the screen starting to change to the screen finishing showing the new content.

The total update latency refers to the time from when the processor needs to update the screen to when the screen finishes showing the new content. As you can see, this is the biggest issue for screens with controllers, and the main reason why they are rarely used in e-readers, cell phones, or PC monitors.

This di­a­gram il­lus­trates the dif­fer­ence be­tween the two. It should be noted that the screens with­out con­trollers have the flex­i­bil­ity to be dri­ven quickly, but the sys­tem de­signer might not ar­chi­tect the sys­tem for low la­tency.

Screens with in­te­grated con­trollers have al­most every­thing al­ready in­te­grated. Common dis­play pan­els of this type only need a few ex­ter­nal ca­pac­i­tors, in­duc­tors, and MOSFETs to sup­port the in­te­grated bipo­lar power sup­ply cir­cuit, then it could be hooked up to MCUs or MPUs us­ing com­mon in­ter­faces like SPI or I2C. There are a lot of dri­ving boards and ex­am­ples of these screens avail­able on­line.

This could get complicated. Note I used a lot of “generally” in the previous comparison table, because there are many ways one could drive these screens, and some of them would certainly impact performance. The main issue here is the controller chip. There are three solutions for driving these screens:

* Using a ded­i­cated con­troller chip to drive the screen

* Using an SoC that has an in­te­grated con­troller

* Using a fast MCU/SoC to em­u­late the con­troller with GPIO (software tim­ing con­troller)

Then, again here is a com­par­i­son be­tween them:

When using a dedicated controller, it can accept data from external devices. This allows it to be used in various types of applications, ranging from IoT devices and ESLs to PC monitors with a relatively fast refresh rate and low latency.

When using an SoC or MCU, the display content is generated by the SoC or MCU itself, which ultimately is limited by the capability of that chip. Given that current SoCs with E-ink display controllers are usually limited in performance, the applications are limited. The same goes for MCUs: they do what an MCU can do. You could find ways to stream video data into the SoC or MCU by using USB, a camera interface, WiFi, etc., but this might not be optimal.

* Specialized controller chip
  * Closed-source
    * EPSON S1D13xxx: Widely used EPD controller in early E-readers. Proprietary, no documents available. Probably EOL.
    * IT8951: Used on the Waveshare EPD HAT. Documents are available. It works with large EPDs up to 2048x2048. The drawback is speed, as the interface between the processor and the IT8951 can be slow. This is similar to the situation on screens with integrated controllers.
    * T1000: Also known as IT8957, an upgraded model of the IT8951. It supports even higher resolutions and features a higher-speed MIPI DSI interface to mitigate the slow speed of the IT8951.
    * Waveshare HDMI driver board: FPGA-based controller. Closed source but easily purchasable; could be integrated into larger projects as a module.
  * Open-source
    * This project (Caster + Glider): FPGA-based controller, multiple update modes, ultra-low latency processing, and a wide range of screen support.
* SoC with integrated controller
* MCU/SoC + software TCON
  * http://essentialscrap.com/eink/waveforms.html: One of the earliest e-ink hacks. Limited in performance, but still could be used as a reference.
  * NekoCal: One of the earliest e-ink software TCONs with greyscale support. Used to be available as a DIY kit. No longer updated, but still could be used as a reference.
  * EPDiy: Based on the ESP32; supports a lot of different screens. Recommended if you want to build a device with ESP32+Eink or embed it into a larger project.

The in­ter­face sig­nals and tim­ing are fairly sim­i­lar to LCDs with­out a con­troller. Following is the list of sig­nals typ­i­cally found on EPDs:

SD signals go into the source driver, typically in the X direction. GD signals go into the gate driver, typically in the Y direction. It’s a 2D array: the gate driver selects one line at a time, and the source driver outputs the voltage for all the pixels in that line.

Conceptually, it’s like a raster scan on a CRT. To send one field of data, both GD and SD are re­set to the start po­si­tion by us­ing the start pulse sig­nal. Data are then trans­mit­ted into the source dri­ver 4 or 8 pix­els at a time. Once the line has been fully trans­mit­ted, the source dri­ver is re­set to the be­gin­ning po­si­tion by a start pulse sig­nal, and the gate dri­ver moves to the next line by a pulse on the gate dri­ver clock. Once all lines have been scanned, the en­tire process re­peats for the next field.

One notable difference from LCDs is that each pixel is represented by 2 bits. This, however, doesn’t mean each pixel is 2bpp or 4-level greyscale. The 2 bits per pixel are used to encode the voltage applied to the pixel:

Just like CRT/ LCD, there are also blank­ing pe­ri­ods in the en­tire tim­ing (which means it’s just wait­ing with­out ac­tive pixel data be­ing sent). They have iden­ti­cal mean­ings to CRT/ LCD sys­tems:

The fol­low­ing is a piece of pseudo-code im­ple­ment­ing the Eink tim­ing:

#define DATA_BUS_WIDTH 8                     // 8-bit wide bus
#define PIXEL_PER_CYCLE (DATA_BUS_WIDTH / 2) // assumed: 2 bits per pixel on the bus

#define VFP   12  // Vertical front porch
#define VSYNC 1   // Vertical sync length
#define VBP   2   // Vertical back porch
#define VACT  758 // Vertical active lines
#define HFP   72  // Horizontal front porch
#define HSYNC 2   // Horizontal sync length
#define HBP   2   // Horizontal back porch
#define HACT  (1024 / PIXEL_PER_CYCLE)

void pulse_h_clock() {
    sdclk = 1;
    sdclk = 0;
}

void drive_line(bool v_in_act) {
    sdce = 1;
    gdclk = 0;
    for (int i = 0; i < HFP; i++) pulse_h_clock();
    sdle = 1;
    gdclk = 1;
    for (int i = 0; i < HSYNC; i++) pulse_h_clock();
    sdle = 0;


Read the original on github.com »

8 661 shares, 26 trendiness

100 Exercises To Learn Rust

Welcome to “100 Exercises To Learn Rust”!

This course will teach you Rust’s core con­cepts, one ex­er­cise at a time.

You’ll learn about Rust’s syn­tax, its type sys­tem, its stan­dard li­brary, and its ecosys­tem.

We don’t assume any prior knowledge of Rust, but we assume you know at least one other programming language.

We also don’t as­sume any prior knowl­edge of sys­tems pro­gram­ming or mem­ory man­age­ment. Those top­ics will be cov­ered in the course.

In other words, we’ll be start­ing from scratch!

You’ll build up your Rust knowl­edge in small, man­age­able steps. By the end of the course, you will have solved ~100 ex­er­cises, enough to feel com­fort­able work­ing on small to medium-sized Rust pro­jects.

This course is based on the “learn by doing” principle.

It has been de­signed to be in­ter­ac­tive and hands-on.

Mainmatter de­vel­oped this course to be de­liv­ered in a class­room set­ting, over 4 days: each at­tendee ad­vances through the lessons at their own pace, with an ex­pe­ri­enced in­struc­tor pro­vid­ing guid­ance, an­swer­ing ques­tions and div­ing deeper into the top­ics as needed.

If you’re in­ter­ested in at­tend­ing one of our train­ing ses­sions, or if you’d like to bring this course to your com­pany, please get in touch.

You can also follow the course on your own, but we recommend you find a friend or a mentor to help you along the way should you get stuck. You can also find solutions to all exercises in the solutions branch of the GitHub repository.

On the left side of the screen, you can see that the course is di­vided into sec­tions. Each sec­tion in­tro­duces a new con­cept or fea­ture of the Rust lan­guage.

To ver­ify your un­der­stand­ing, each sec­tion is paired with an ex­er­cise that you need to solve.

You can find the exercises in the companion GitHub repository.

Before start­ing the course, make sure to clone the repos­i­tory to your lo­cal ma­chine:

# If you have an SSH key set up with GitHub

git clone git@github.com:mainmatter/100-exercises-to-learn-rust.git

# Otherwise, use the HTTPS URL:

# git clone https://github.com/mainmatter/100-exercises-to-learn-rust.git

We also rec­om­mend you work on a branch, so you can eas­ily track your progress and pull in up­dates from the main repos­i­tory, if needed:

cd 100-exercises-to-learn-rust

git check­out -b my-so­lu­tions

All ex­er­cises are lo­cated in the ex­er­cises folder. Each ex­er­cise is struc­tured as a Rust pack­age. The pack­age con­tains the ex­er­cise it­self, in­struc­tions on what to do (in src/​lib.rs), and a test suite to au­to­mat­i­cally ver­ify your so­lu­tion.

To verify your solutions, we’ve provided a tool that will guide you through the course. It is the wr CLI (short for “workshop runner”). Install it with:

cargo install --locked workshop-runner

In a new terminal, navigate back to the top-level folder of the repository. Run the wr command to start the course:

wr


wr will ver­ify the so­lu­tion to the cur­rent ex­er­cise.

Don’t move on to the next sec­tion un­til you’ve solved the ex­er­cise for the cur­rent one.

We recommend committing your solutions to Git as you progress through the course, so you can easily track your progress and “restart” from a known point if needed.

* The ex­er­cise for this sec­tion is lo­cated in ex­er­cises/​01_in­tro/​00_wel­come

This course was writ­ten by Luca Palmieri, Principal Engineering Consultant at Mainmatter.

Luca has been work­ing with Rust since 2018, ini­tially at TrueLayer and then at AWS.

Luca is the author of “Zero to Production in Rust”, the go-to resource for learning how to build backend applications in Rust.

He is also the author and maintainer of a variety of open-source Rust projects, including Pavex and wiremock.


Read the original on rust-exercises.com »

9 593 shares, 21 trendiness

Adobe Photoshop Source Code

pho·to·shop, transitive verb, often capitalized ˈfō-(ˌ)tō-ˌshäp: to alter (a digital image) with Photoshop software or other image-editing software, especially in a way that distorts reality (as for deliberately deceptive purposes)

When brothers Thomas and John Knoll began designing and writing an image editing program in the late 1980s, they could not have imagined that they would be adding a word to the dictionary.

Thomas Knoll, a PhD student in computer vision at the University of Michigan, had written a program in 1987 to display and modify digital images. His brother John, working at the movie visual effects company Industrial Light & Magic, found it useful for editing photos, but it wasn’t intended to be a product. Thomas said, “We developed it originally for our own personal use…it was a lot of fun to do.”

Gradually the program, called “Display”, became more sophisticated. In the summer of 1988 they realized that it indeed could be a credible commercial product. They renamed it “Photoshop” and began to search for a company to distribute it. About 200 copies of version 0.87 were bundled by slide scanner manufacturer Barneyscan as Barneyscan XP.

The fate of Photoshop was sealed when Adobe, encouraged by its art director Russell Brown, decided to buy a license to distribute an enhanced version of Photoshop. The deal was finalized in April 1989, and version 1.0 started shipping early in 1990.

Over the next ten years, more than 3 million copies of Photoshop were sold.

That first version of Photoshop was written primarily in Pascal for the Apple Macintosh, with some machine language for the underlying Motorola 68000 microprocessor where execution efficiency was important. It wasn’t the effort of a huge team. Thomas said, “For version 1, I was the only engineer, and for version 2, we had two engineers.” While Thomas worked on the base application program, John wrote many of the image-processing plug-ins.

With the permission of Adobe Systems Inc., the Computer History Museum is pleased to make available, for non-commercial use, the source code to the 1990 version 1.0.1 of Photoshop. All the code is here with the exception of the MacApp applications library that was licensed from Apple. There are 179 files in the zipped folder, comprising about 128,000 lines of mostly uncommented but well-structured code. By line count, about 75% of the code is in Pascal, about 15% is in 68000 assembler language, and the rest is data of various sorts.

To download the code you must agree to the terms of the license, which permits only non-commercial use and does not give you the right to license it to third parties by posting copies elsewhere on the web.

Software architect Grady Booch is the Chief Scientist for Software Engineering at IBM Research Almaden and a trustee of the Computer History Museum. He offers the following observations about the Photoshop source code:

* “Opening the files that constituted the source code for Photoshop 1.0, I felt a bit like Howard Carter as he first breached the tomb of King Tutankhamen. What wonders awaited me?

* I was not disappointed by what I found. Indeed, it was a marvelous journey to open up the cunning machinery of an application I’d first used over 20 years ago.

* Architecturally, this is a very well-structured system. There’s a consistent separation of interface and abstraction, and the design decisions made to componentize those abstractions — with generally one major type for each combination of interface and implementation — were easy to follow.

* The abstractions are quite mature. The consistent naming, the granularity of methods, the almost breathtaking simplicity of the implementations because each type was so well abstracted, all combine to make it easy to discern the texture of the system.

* Having the opportunity to examine Photoshop’s current architecture, I believe I see fundamental structures that have persisted, though certainly in more evolved forms, in the modern implementation. Tiles, filters, abstractions for virtual memory (to attend to images far larger than display buffers or main memory could normally handle) are all there in the first version. Yet it had just over 100,000 lines of code, compared to well over 10 million in the current version! Then and now, much of the code is related to input/output and the myriad of file formats that Photoshop has to attend to.

* There are only a few comments in the version 1.0 source code, most of which are associated with assembly language snippets. That said, the lack of comments is simply not an issue. This code is so literate, so easy to read, that comments might even have gotten in the way.

* It is delightful to find historical vestiges of the time: code to attend to Andy Hertzfeld’s software for the Thunderscan scanner, support of early TARGA raster graphics file types, and even a few passing references to Barneyscan lie scattered about in the code. These are very small elements of the overall code base, but their appearance reminds me that no code is an island.

* This is the kind of code I aspire to write.”

And this is the kind of code we all can learn from. Software source code is the literature of computer scientists, and it deserves to be studied and appreciated. Enjoy a view of Photoshop from the inside.
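Booch's point about tiles and virtual memory can be illustrated abstractly. The sketch below is invented for this note (illustrative Rust, not Photoshop's actual Pascal; the function name and tile size are assumptions): splitting an image into fixed-size tiles means a filter only ever needs one small block resident at a time, which is how images larger than main memory could be handled.

```rust
// Illustrative tile-based image processing (not Photoshop's real design):
// the image is walked one TILE x TILE block at a time, so only a single
// tile needs to be in memory while a filter runs.

const TILE: usize = 4; // tiny tile for demonstration purposes

/// Apply the pixel filter `f` to a grayscale image, tile by tile.
fn process_tiled(pixels: &mut [u8], width: usize, height: usize, f: impl Fn(u8) -> u8) {
    for ty in (0..height).step_by(TILE) {
        for tx in (0..width).step_by(TILE) {
            // In a virtual-memory scheme, this tile would be paged in here...
            for y in ty..(ty + TILE).min(height) {
                for x in tx..(tx + TILE).min(width) {
                    pixels[y * width + x] = f(pixels[y * width + x]);
                }
            }
            // ...and written back to disk before moving to the next tile.
        }
    }
}
```

The `min()` clamps handle images whose dimensions are not multiples of the tile size, so edge tiles are simply smaller.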


Read the original on computerhistory.org »

10 553 shares, 72 trendiness

ChatGPT can talk, but OpenAI employees sure can’t

Editor’s note, May 17, 2024, 11:20 pm ET: This story has been updated to include a post-publication statement from OpenAI.

On Monday, OpenAI announced exciting new product news: ChatGPT can now talk like a human.

It has a cheery, slightly ingratiating feminine voice that sounds impressively non-robotic, and a bit familiar if you’ve seen a certain 2013 Spike Jonze film. “Her,” tweeted OpenAI CEO Sam Altman, referencing the movie in which a man falls in love with an AI assistant voiced by Scarlett Johansson.

But the product release of ChatGPT 4o was quickly overshadowed by much bigger news out of OpenAI: the resignation of the company’s co-founder and chief scientist, Ilya Sutskever, who also led its superalignment team, as well as that of his co-team leader Jan Leike (who we put on the Future Perfect 50 list last year).

The resignations didn’t come as a total surprise. Sutskever had been involved in the boardroom revolt that led to Altman’s temporary firing last year, before the CEO quickly returned to his perch. Sutskever publicly regretted his actions and backed Altman’s return, but he’s been mostly absent from the company since, even as other members of OpenAI’s policy, alignment, and safety teams have departed.

But what has really stirred speculation was the radio silence from former employees. Sutskever posted a pretty typical resignation message, saying “I’m confident that OpenAI will build AGI that is both safe and beneficial…I am excited for what comes next.”

Leike … didn’t. His resignation message was simply: “I resigned.” After several days of fervent speculation, he expanded on this on Friday morning, explaining that he was worried OpenAI had shifted away from a safety-focused culture.

Questions arose immediately: Were they forced out? Is this delayed fallout of Altman’s brief firing last fall? Are they resigning in protest of some secret and dangerous new OpenAI project? Speculation filled the void because no one who had once worked at OpenAI was talking.

It turns out there’s a very clear reason for that. I have seen the extremely restrictive off-boarding agreement that contains nondisclosure and non-disparagement provisions former OpenAI employees are subject to. It forbids them, for the rest of their lives, from criticizing their former employer. Even acknowledging that the NDA exists is a violation of it.

If a departing employee declines to sign the document, or if they violate it, they can lose all vested equity they earned during their time at the company, which is likely worth millions of dollars. One former employee, Daniel Kokotajlo, who posted that he quit OpenAI “due to losing confidence that it would behave responsibly around the time of AGI,” has confirmed publicly that he had to surrender what would have likely turned out to be a huge sum of money in order to quit without signing the document.

While nondisclosure agreements aren’t unusual in highly competitive Silicon Valley, putting an employee’s already-vested equity at risk for declining or violating one is. For workers at startups like OpenAI, equity is a vital form of compensation, one that can dwarf the salary they make. Threatening that potentially life-changing money is a very effective way to keep former employees quiet.

OpenAI did not respond to a request for comment in time for initial publication. After publication, an OpenAI spokesperson sent me this statement: “We have never canceled any current or former employee’s vested equity nor will we if people do not sign a release or nondisparagement agreement when they exit.”

Sources close to the company I spoke to told me that this represented a change in policy as they understood it. When I asked the OpenAI spokesperson if that statement represented a change, they replied, “This statement reflects reality.”

All of this is highly ironic for a company that initially advertised itself as OpenAI — that is, as committed in its mission statements to building powerful systems in a transparent and accountable manner.

OpenAI long ago abandoned the idea of open-sourcing its models, citing safety concerns. But now it has shed the most senior and respected members of its safety team, which should inspire some skepticism about whether safety is really the reason why OpenAI has become so closed.

OpenAI has spent a long time occupying an unusual position in tech and policy circles. Their releases, from DALL-E to ChatGPT, are often very cool, but by themselves they would hardly attract the near-religious fervor with which the company is often discussed.

What sets OpenAI apart is the ambition of its mission: “to ensure that artificial general intelligence — AI systems that are generally smarter than humans — benefits all of humanity.” Many of its employees believe that this aim is within reach; that with perhaps one more decade (or even less) — and a few trillion dollars — the company will succeed at developing AI systems that make most human labor obsolete.

Which, as the company itself has long said, is as risky as it is exciting.

“Superintelligence will be the most impactful technology humanity has ever invented, and could help us solve many of the world’s most important problems,” a recruitment page for Leike and Sutskever’s team at OpenAI states. “But the vast power of superintelligence could also be very dangerous, and could lead to the disempowerment of humanity or even human extinction. While superintelligence seems far off now, we believe it could arrive this decade.”

Naturally, if artificial superintelligence in our lifetimes is possible (and experts are divided), it would have enormous implications for humanity. OpenAI has historically positioned itself as a responsible actor trying to transcend mere commercial incentives and bring AGI about for the benefit of all. And they’ve said they are willing to do that even if that requires slowing down development, missing out on profit opportunities, or allowing external oversight.

“We don’t think that AGI should be just a Silicon Valley thing,” OpenAI co-founder Greg Brockman told me in 2019, in the much calmer pre-ChatGPT days. “We’re talking about world-altering technology. And so how do you get the right representation and governance in there? This is actually a really important focus for us and something we really want broad input on.”

OpenAI’s unique corporate structure — a capped-profit company ultimately controlled by a nonprofit — was supposed to increase accountability. “No one person should be trusted here. I don’t have super-voting shares. I don’t want them,” Altman assured Bloomberg’s Emily Chang in 2023. “The board can fire me. I think that’s important.” (As the board found out last November, it could fire Altman, but it couldn’t make the move stick. After his firing, Altman made a deal to effectively take the company to Microsoft, before being ultimately reinstated with most of the board resigning.)

But there was no stronger sign of OpenAI’s commitment to its mission than the prominent roles of people like Sutskever and Leike, technologists with a long history of commitment to safety and an apparently genuine willingness to ask OpenAI to change course if needed. When I said to Brockman in that 2019 interview, “You guys are saying, ‘We’re going to build a general artificial intelligence,’” Sutskever cut in. “We’re going to do everything that can be done in that direction while also making sure that we do it in a way that’s safe,” he told me.

Their departure doesn’t herald a change in OpenAI’s mission of building artificial general intelligence — that remains the goal. But it almost certainly heralds a change in OpenAI’s interest in safety work; the company hasn’t announced who, if anyone, will lead the superalignment team.

And it makes it clear that OpenAI’s concern with external oversight and transparency couldn’t have run all that deep. If you want external oversight and opportunities for the rest of the world to play a role in what you’re doing, making former employees sign extremely restrictive NDAs doesn’t exactly follow.

This contradiction is at the heart of what makes OpenAI profoundly frustrating for those of us who care deeply about ensuring that AI really does go well and benefits humanity. Is OpenAI a buzzy, if midsize, tech company that makes a chatty personal assistant, or a trillion-dollar effort to create an AI god?

The company’s leadership says they want to transform the world, that they want to be accountable when they do so, and that they welcome the world’s input into how to do it justly and wisely.

But when there’s real money at stake — and there are astounding sums of real money at stake in the race to dominate AI — it becomes clear that they probably never intended for the world to get all that much input. Their process ensures former employees — those who know the most about what’s happening inside OpenAI — can’t tell the rest of the world what’s going on.

The website may have high-minded ideals, but their termination agreements are full of hard-nosed legalese. It’s hard to exercise accountability over a company whose former employees are restricted to saying “I resigned.”

ChatGPT’s new cute voice may be charming, but I’m not feeling especially enamored.

A version of this story originally appeared in the Future Perfect newsletter. Sign up here!


Read the original on www.vox.com »
