10 interesting stories served every morning and every evening.




1 415 shares, 59 trendiness

federico-busato/Modern-CPP-Programming: Modern C++ Programming Course (C++11/14/17/20)

Skip to con­tent

We read every piece of feed­back, and take your in­put very se­ri­ously.

Include my email ad­dress so I can be con­tacted

Use saved searches to fil­ter your re­sults more quickly

To see all avail­able qual­i­fiers, see our doc­u­men­ta­tion.

Sign up

You signed in with an­other tab or win­dow. Reload to re­fresh your ses­sion.

You signed out in an­other tab or win­dow. Reload to re­fresh your ses­sion.

You switched ac­counts on an­other tab or win­dow. Reload to re­fresh your ses­sion.

This com­mit does not be­long to any branch on this repos­i­tory, and may be­long to a fork out­side of the repos­i­tory.

Name al­ready in use

A tag al­ready ex­ists with the pro­vided branch name. Many Git com­mands ac­cept both tag and branch names, so cre­at­ing this branch may cause un­ex­pected be­hav­ior. Are you sure you want to cre­ate this branch?

Use Git or check­out with SVN us­ing the web URL.

Work fast with our of­fi­cial CLI. Learn more about the CLI.

Please sign in

to use Codespaces.

If noth­ing hap­pens, down­load GitHub Desktop and try again.

If noth­ing hap­pens, down­load GitHub Desktop and try again.

If noth­ing hap­pens, down­load Xcode and try again.

Your code­space will open once ready.

There was a prob­lem prepar­ing your code­space, please try again.

Permalink

You can’t per­form that ac­tion at this time.

...

Read the original on github.com »

2 339 shares, 44 trendiness

25th Anniversary Documentary

...

Read the original on www.youtube.com »

3 328 shares, 14 trendiness

God Help Us, Let's Try To Understand The Paper On AI Monosemanticity

You’ve prob­a­bly heard AI is a black box”. No one knows how it works. Researchers sim­u­late a weird type of pseudo-neural-tis­sue, reward” it a lit­tle every time it be­comes a lit­tle more like the AI they want, and even­tu­ally it be­comes the AI they want. But God only knows what goes on in­side of it.

This is bad for safety. For safety, it would be nice to look in­side the AI and see whether it’s ex­e­cut­ing an al­go­rithm like do the thing” or more like trick the hu­mans into think­ing I’m do­ing the thing”. But we can’t. Because we can’t look in­side an AI at all.

Until now! Towards Monosemanticity, re­cently out of big AI com­pany/​re­search lab Anthropic, claims to have gazed in­side an AI and seen its soul. It looks like this:

How did they do it? What is in­side of an AI? And what the heck is monosemanticity”?

[disclaimer: af­ter talk­ing to many peo­ple much smarter than me, I might, just barely, sort of un­der­stand this. Any mis­takes be­low are my own.]

A styl­ized neural net looks like this:

Input neu­rons (blue) take in­for­ma­tion from the world. In an im­age AI, they might take the val­ues of pix­els in the im­age; in a lan­guage AI, they might take char­ac­ters in a text.

These con­nect to in­terneu­rons (black) in the hidden lay­ers”, which do mys­te­ri­ous things.

Then those con­nect to out­put neu­rons (green). In an im­age AI, they might rep­re­sent val­ues of pix­els in a piece of AI art; in a lan­guage AI, char­ac­ters in the chat­bot re­sponse.

Understanding what goes on in­side an AI means un­der­stand­ing what the black neu­rons in the mid­dle layer do.

A promis­ing start­ing point might be to pre­sent the AI with lots of dif­fer­ent stim­uli, then see when each neu­ron does vs. does­n’t fire. For ex­am­ple, if there’s one neu­ron that fires every time the in­put in­volves a dog, and never fires any other time, prob­a­bly that neu­ron is rep­re­sent­ing the con­cept dog”.

Sounds easy, right? A good sum­mer pro­ject for an in­tern, right?

There are at least two prob­lems.

First, GPT-4 has over 100 bil­lion neu­rons (the ex­act num­ber seems to be se­cret, but it’s some­where up there).

Second, this does­n’t work. When you switch to a weaker AI with only” a few hun­dred neu­rons and build spe­cial tools to au­to­mate the stim­u­lus/​analy­sis process, the neu­rons aren’t this sim­ple. A few low-level ones re­spond to ba­sic fea­tures (like curves in an im­age). But deep in the mid­dle, where the real thought has to be hap­pen­ing, there’s noth­ing rep­re­sent­ing dog”. Instead, the neu­rons are much weirder than this. In one im­age model, an ear­lier pa­per found one neu­ron that re­sponds to cat faces, fronts of cars, and cat legs”. The au­thors de­scribed this as polysemanticity” - mul­ti­ple mean­ings for one neu­ron.

Some very smart peo­ple spent a lot of time try­ing to fig­ure out what con­cep­tual sys­tem could make neu­rons be­have like this, and came up with the Toy Models Of Superposition pa­per.

Their in­sight is: sup­pose your neural net has 1,000 neu­rons. If each neu­ron rep­re­sented one con­cept, like dog”, then the net could, at best, un­der­stand 1,000 con­cepts. Realistically it would un­der­stand many fewer than this, be­cause in or­der to get dogs right, it would need to have many sub­con­cepts like dog’s face” or that one un­usual-look­ing dog”. So it would be help­ful if you could use 1,000 neu­rons to rep­re­sent much more than 1,000 con­cepts.

Here’s a way to make two neu­rons rep­re­sent five con­cepts (adapted from here):

If neu­ron A is ac­ti­vated at 0.5, and neu­ron B is ac­ti­vated at 0, you get dog”.

If neu­ron A is ac­ti­vated at 1, and neu­ron B is ac­ti­vated at 0.5, you get apple”.

And so on.

The ex­act num­ber of ver­tices in this ab­stract shape is a trade­off. More ver­tices means that the two-neu­ron-pair can rep­re­sent more con­cepts. But it also risks con­fu­sion. If you ac­ti­vate the con­cepts dog” and heart” at the same time, the AI might in­ter­pret this as apple”. And there’s some weak sense in which the AI in­ter­prets dog” as negative eye”.

This the­ory is called superposition”. Do AIs re­ally do it? And how many ver­tices do they have on their ab­stract shapes?

The Anthropic in­ter­pretabil­ity team trained a very small, sim­ple AI. It needed to re­mem­ber 400 fea­tures, but it had only 30 neu­rons, so it would have to try some­thing like the su­per­po­si­tion strat­egy. Here’s what they found (slightly edited from here):

Follow the black line. On the far left of the graph, the data is dense; you need to think about every fea­ture at the same time. Here the AI as­signs one neu­ron per con­cept (meaning it will only ever learn 30 of the 400 con­cepts it needs to know, and mostly fail the task).

Moving to the right, we al­low fea­tures to be less com­mon - the AI may only have to think about a few at a time. The AI grad­u­ally shifts to pack­ing its con­cepts into tetra­he­dra (three neu­rons per four con­cepts) and tri­an­gles (two neu­rons per three con­cepts). When it reaches digons (one neu­ron per two con­cepts) it stops for a while (to repack­age every­thing this way?) Next it goes through pen­tagons and an un­usual poly­he­dron called the square anti-prism” . . .

. . . which Wikipedia says is best known for be­ing the shape of the bis­cornu (a stuffed or­na­men­tal pin­cush­ion”) and One World Trade Center in New York:

After ex­haust­ing square anti-prisms (8 fea­tures per three neu­rons) it gives up. Why? I don’t know.

A friend who un­der­stands these is­sues bet­ter than I warns that we should­n’t ex­pect to find pen­tagons and square anti-prisms in GPT-4. Probably GPT-4 does some­thing in­com­pre­hen­si­ble in 1000-dimensional space. But it’s the 1000-dimensional equiv­a­lent of these pen­tagons and square anti-prisms, con­serv­ing neu­rons by turn­ing them into di­men­sions and then plac­ing con­cepts in the im­plied space.

The Anthropic in­ter­pretabil­ity team de­scribes this as sim­u­lat­ing a more pow­er­ful AI. That is, the two-neu­ron AI in the pen­tag­o­nal toy ex­am­ple above is sim­u­lat­ing a five-neu­ron AI. They go on to prove that the real AI can then run com­pu­ta­tions in the sim­u­lated AI; in some sense, there re­ally is an ab­stract five neu­ron AI do­ing all the cog­ni­tion. The only rea­son all of our AIs aren’t sim­u­lat­ing in­fi­nitely pow­er­ful AIs and let­ting them do all the work is that as real neu­rons start rep­re­sent­ing more and more sim­u­lated neu­rons, it pro­duces more and more noise and con­cep­tual in­ter­fer­ence.

This is great for AIs but bad for in­ter­preters. We hoped we could fig­ure out what our AIs were do­ing just by look­ing at them. But it turns out they’re sim­u­lat­ing much big­ger and more com­pli­cated AIs, and if we want to know what’s go­ing on, we have to look at those. But those AIs only ex­ist in sim­u­lated ab­stract hy­per­di­men­sional spaces. Sounds hard to dis­sect!

Still, last month Anthropic’s in­ter­pretabil­ity team an­nounced that they suc­cess­fully dis­sected of one of the sim­u­lated AIs in its ab­stract hy­per­di­men­sional space.

First the re­searchers trained a very sim­ple 512-neuron AI to pre­dict text, like a tiny ver­sion of GPT or Anthropic’s com­pet­ing model Claude.

Then, they trained a sec­ond AI called an au­toen­coder to pre­dict the ac­ti­va­tions of the first AI. They told it to posit a cer­tain num­ber of fea­tures (the ex­per­i­ments var­ied be­tween ~2,000 and ~100,000), cor­re­spond­ing to the neu­rons of the higher-di­men­sional AI it was sim­u­lat­ing. Then they made it pre­dict how those fea­tures mapped onto the real neu­rons of the real AI.

They found that even though the orig­i­nal AIs neu­rons weren’t com­pre­hen­si­ble, the new AIs sim­u­lated neu­rons (aka features”) were! They were mono­se­man­tic, ie they meant one spe­cific thing.

Here’s fea­ture #2663 (remember, the orig­i­nal AI only had 512 neu­rons, but they’re treat­ing it as sim­u­lat­ing a larger AI with up to ~100,000 neu­ron-fea­tures).

The sin­gle sen­tence in the train­ing data that ac­ti­vated it most strongly is from Josephus, Book 14: And he passed on to Sepphoris, as God sent a snow”. But we see that all the top ac­ti­va­tions are dif­fer­ent uses of God”.

This sim­u­lated neu­ron seems to be com­posed of a col­lec­tion of real neu­rons in­clud­ing 407, 182, and 259, though prob­a­bly there are many more than these and the in­ter­face just is­n’t show­ing them to me.

None of these neu­rons are them­selves very Godly. When we look at neu­ron #407 - the real neu­ron that con­tributes most to the AIs un­der­stand­ing of God! - an AI-generated sum­mary de­scribes it as fir[ing] pri­mar­ily on non-Eng­lish text, par­tic­u­larly ac­cented Latin char­ac­ters. It also oc­ca­sion­ally fires on non-stan­dard text like HTML tags.” Probably this is be­cause you can’t re­ally un­der­stand AIs at the real-neu­ron-by-real-neu­ron level, so the sum­ma­riz­ing AI - hav­ing been asked to do this im­pos­si­ble thing - is read­ing tea leaves and say­ing ran­dom stuff.

But at the fea­ture level, every­thing is nice and tidy! Remember, this AI is try­ing to pre­dict the next to­ken in a text. At this level, it does so in­tel­li­gi­bly. When Feature #2663 is ac­ti­vated, it in­creases the prob­a­bil­ity of the next to­ken be­ing bless”, forbid”, damn”, or -zilla”.

Shouldn’t the AI be keep­ing the con­cept of God, Almighty Creator and Lord of the Universe, sep­a­rate from God- as in the first half of Godzilla? Probably GPT-4 does that, but this toy AI does­n’t have enough real neu­rons to have enough sim­u­lated neu­rons / fea­tures to spare for the pur­pose. In fact, you can see this sort of thing change later in the pa­per:

At the bot­tom of this tree, you can see what hap­pens to the AIs rep­re­sen­ta­tion of the” in math­e­mat­i­cal ter­mi­nol­ogy as you let it have more and more fea­tures.

First: why is there a fea­ture for the” in math­e­mat­i­cal ter­mi­nol­ogy? I think be­cause of the AIs pre­dic­tive im­per­a­tive - it’s help­ful to know that some spe­cific in­stance of the” should be fol­lowed by math words like numerator” or cosine”.

In their small­est AI (512 fea­tures), there is only one neu­ron for the” in math. In their largest AI tested here (16,384 fea­tures), this has branched out to one neu­ron for the” in ma­chine learn­ing, one for the” in com­plex analy­sis, and one for the” in topol­ogy and ab­stract al­ge­bra.

So prob­a­bly if we up­graded to an AI with more sim­u­lated neu­rons, the God neu­ron would split in two - one for God as used in re­li­gions, one for God as used in kaiju names. Later we might get God in Christianity, God in Judaism, God in phi­los­o­phy, et cetera.

Not all fea­tures/​sim­u­lated-neu­rons are this sim­ple. But many are. The team graded 412 real neu­rons vs. sim­u­lated neu­rons on sub­jec­tive in­ter­pretabil­ity, and found the sim­u­lated neu­rons were on av­er­age pretty in­ter­pretable:

Some, like the God neu­ron, are for spe­cific con­cepts. Many oth­ers, in­clud­ing some of the most in­ter­pretable, are for formal gen­res” of text, like whether it’s up­per­case or low­er­case, English vs. some other al­pha­bet, etc.

How com­mon are these fea­tures? That is, sup­pose you train two dif­fer­ent 4,096-feature AIs on the same text datasets. Will they have mostly the same 4,096 fea­tures? Will they both have some fea­ture rep­re­sent­ing God? Or will the first choose to rep­re­sent God to­gether with Godzilla, and the sec­ond choose to sep­a­rate them? Will the sec­ond one maybe not have a fea­ture for God at all, in­stead us­ing that space to store some other con­cept the first AI can’t pos­si­bly un­der­stand?

The team tests this, and finds that their two AIs are pretty sim­i­lar! On av­er­age, if there’s a fea­ture in the first one, the most sim­i­lar fea­ture in the sec­ond one will have a me­dian cor­re­la­tion of 0.72”.

What comes af­ter this?

In May of this year, OpenAI tried to make GPT-4 (very big) un­der­stand GPT-2 (very small). They got GPT-4 to in­spect each of GPT-2’s 307,200 neu­rons and re­port back on what it found.

It found a col­lec­tion of in­trigu­ing re­sults and ran­dom gib­ber­ish, be­cause they had­n’t mas­tered the tech­niques de­scribed above of pro­ject­ing the real neu­rons into sim­u­lated neu­rons and an­a­lyz­ing the sim­u­lated neu­rons in­stead. Still, it was im­pres­sively am­bi­tious. Unlike the toy AI in the mono­se­man­tic­ity pa­per, GPT-2 is a real (albeit very small and ob­so­lete) AI that once im­pressed peo­ple.

But what we re­ally want is to be able to in­ter­pret the cur­rent gen­er­a­tion of AIs. The Anthropic in­ter­pretabil­ity team ad­mits we’re not there yet, for a few rea­sons.

Scaling the ap­pli­ca­tion of sparse au­toen­coders to fron­tier mod­els strikes us as one of the most im­por­tant ques­tions go­ing for­ward. We’re quite hope­ful that these or sim­i­lar meth­ods will work — Cunningham et al.’s work seems to sug­gest this ap­proach can work on some­what larger mod­els, and we have pre­lim­i­nary re­sults that point in the same di­rec­tion. However, there are sig­nif­i­cant com­pu­ta­tional chal­lenges to be over­come. Consider an au­toen­coder with a 100× ex­pan­sion fac­tor ap­plied to the ac­ti­va­tions of a sin­gle MLP layer of width 10,000: it would have ~20 bil­lion pa­ra­me­ters. Additionally, many of these fea­tures are likely quite rare, po­ten­tially re­quir­ing the au­toen­coder to be trained on a sub­stan­tial frac­tion of the large mod­el’s train­ing cor­pus. So it seems plau­si­ble that train­ing the au­toen­coder could be­come very ex­pen­sive, po­ten­tially even more ex­pen­sive than the orig­i­nal model. We re­main op­ti­mistic, how­ever, and there is a sil­ver lin­ing — it in­creas­ingly seems like a large chunk of the mech­a­nis­tic in­ter­pretabil­ity agenda will now turn on suc­ceed­ing at a dif­fi­cult en­gi­neer­ing and scal­ing prob­lem, which fron­tier AI labs have sig­nif­i­cant ex­per­tise in.

In other words, in or­der to even be­gin to in­ter­pret an AI like GPT-4 (or Anthropic’s equiv­a­lent, Claude), you would need an in­ter­preter-AI around the same size. But train­ing an AI that size takes a gi­ant com­pany and hun­dreds of mil­lions (soon bil­lions) of dol­lars.

Second, scal­ing the in­ter­pre­ta­tion. Suppose we find all the sim­u­lated neu­rons for God and Godzilla and every­thing else, and have a gi­ant map of ex­actly how they con­nect, and hang that map in our room. Now we want to an­swer ques­tions like:

* If you ask the AI a con­tro­ver­sial ques­tion, how does it de­cide how to re­spond?

* Is the AI us­ing racial stereo­types in form­ing judg­ments of peo­ple?

* Is the AI plot­ting to kill all hu­mans?

There will be some com­bi­na­tion of mil­lions of fea­tures and con­nec­tions that an­swers these ques­tions. In some case we can even imag­ine how we would be­gin to do it - check how ac­tive the fea­tures rep­re­sent­ing race are when we ask it to judge peo­ple, maybe. But re­al­is­ti­cally, when we’re work­ing with very com­plex in­ter­ac­tions be­tween mil­lions of neu­rons we’ll have to au­to­mate the process, some larger scale ver­sion of ask GPT-4 to tell us what GPT-2 is do­ing”.

This prob­a­bly works for racial stereo­types. It’s more com­pli­cated once you start ask­ing about killing all hu­mans (what if the GPT-4 equiv­a­lent is the one plot­ting to kill all hu­mans, and feeds us false an­swers?) But maybe there’s some way to make an in­ter­preter AI which it­self is too dumb to plot, but which can in­ter­pret a more gen­eral, more in­tel­li­gent, more dan­ger­ous AI. You can see more about how this could tie into more gen­eral align­ment plans in the post on the ELK prob­lem. I also just found this pa­per, which I haven’t fully read yet but which seems like a start on en­gi­neer­ing safety into in­ter­pretable AIs.

Finally, what does all of this tell us about hu­mans?

Humans also use neural nets to rea­son about con­cepts. We have a lot of neu­rons, but so does GPT-4. Our data is very sparse - there are lots of con­cepts (eg oc­topi) that come up pretty rarely in every­day life. Are our brains full of strange ab­stract poly­he­dra? Are we sim­u­lat­ing much big­ger brains?

This field is very new, but I was able to find one pa­per, Identifying Interpretable Visual Features in Artificial and Biological Neural Systems. The au­thors say:

Through a suite of ex­per­i­ments and analy­ses, we find ev­i­dence con­sis­tent with the hy­poth­e­sis that neu­rons in both deep im­age model [AIs] and the vi­sual cor­tex [of the brain] en­code fea­tures in su­per­po­si­tion. That is, we find non-axis aligned di­rec­tions in the neural state space that are more in­ter­pretable than in­di­vid­ual neu­rons. In ad­di­tion, across both bi­o­log­i­cal and ar­ti­fi­cial sys­tems, we un­cover the in­trigu­ing phe­nom­e­non of what we call fea­ture syn­ergy - sparse com­bi­na­tions in ac­ti­va­tion space that yield more in­ter­pretable fea­tures than the con­stituent parts. Our work pushes in the di­rec­tion of au­to­mated in­ter­pretabil­ity re­search for CNNs, in line with re­cent ef­forts for lan­guage mod­els. Simultaneously, it pro­vides a new frame­work for an­a­lyz­ing neural cod­ing prop­er­ties in bi­o­log­i­cal sys­tems.

This is a sin­gle non-peer-re­viewed pa­per an­nounc­ing a sur­pris­ing claim in a hype-filled field. That means it has to be true - oth­er­wise it would be un­fair!

If this topic in­ter­ests you, you might want to read the full pa­pers, which are much more com­pre­hen­sive and in­ter­est­ing than this post was able to cap­ture. My fa­vorites are:

In the un­likely sce­nario where all of this makes to­tal sense and you feel like you’re ready to make con­tri­bu­tions, you might be a good can­di­date for Anthropic or OpenAI’s align­ment teams, both of which are hir­ing. If you feel like it’s the sort of thing which could make sense and you want to tran­si­tion into learn­ing more about it, you might be a good can­di­date for align­ment train­ing/​schol­ar­ship pro­grams like MATS.

...

Read the original on www.astralcodexten.com »

4 326 shares, 33 trendiness

PeerTube v6 is out, and powered by your ideas !

It’s #givingtuesday, so we’re giv­ing you PeerTube v6 to­day ! PeerTube is the soft­ware we de­velop for cre­ators, me­dia, in­sti­tu­tions, ed­u­ca­tors… to man­age their own video plat­form, as an al­ter­na­tive to YouTube and Twitch.

Thanks to your do­na­tions to our not-for-profit, Framasoft is tak­ing ac­tion to ad­vance the eth­i­cal, user-friendly web. Find a sum­mary of our progress in 2023 on our Support Framasoft page.

➡️ Read the se­ries of ar­ti­cles from this cam­paign (Nov. — Dec. 2023)

The sixth ma­jor ver­sion is be­ing re­leased to­day and we are very proud ! It is the most am­bi­tious one since we added peer-to-peer livestream­ing. There is a good rea­son for that : we packed this v6 with fea­tures in­spired by your ideas !

We are so ea­ger to pre­sent all the work we achieved that we’ll get right into it. But stay tuned : in two weeks, we’ll take more time to talk about PeerTube’s his­tory, the state of this pro­ject and the great plans we have for its fu­ture !

In 2023, and be­fore prepar­ing this ma­jor up­date, we re­leased only two mi­nor ver­sions… but one of them brought to the table a ma­jor tech­ni­cal fea­ture that will help de­moc­ra­tize video host­ing even more.

You’ll get more de­tails in the news ded­i­cated to the 5.1 re­lease, so to keep it short, this ver­sion brought :

* an « asking for an ac­count » fea­ture, where in­stance mod­er­a­tors can man­age and mod­er­ate news ac­count re­quests ;

* a back-to-live but­ton, so in case you lag be­hind dur­ing a livestream, you can go back to the di­rect

* Improvements on the au­then­ti­ca­tion plu­gin, to fa­cil­i­tate sign­ing on with ex­ter­nal cre­den­tials

As you’ll find out in our 5.2 re­lease blog­post, there were some smaller but im­por­tant new fea­tures such as :

* Adapting RSS feeds to pod­cast stan­dards, so any pod­cast client could be able to read a PeerTube chan­nel, for ex­am­ple

* The op­tion to set the pri­vacy of a livestream re­play, that way stream­ers can choose be­fore­hand if the re­play of their live will be Public, Unlisted, Private or Internal

* Improved mouse-free nav­i­ga­tion : for those who pre­fer or need to nav­i­gate us­ing their key­board

* And up­grades in our doc­u­men­ta­tion (it’s quite thor­ough : check it out !)

But the game changer in this 5.2 re­lease was the new re­mote transcod­ing fea­ture.

When a cre­ator up­loads a video (or when they are stream­ing live), PeerTube needs to trans­form their video file into an ef­fi­cient for­mat. This task is called video transcod­ing, and it con­sumes lots of CPU power. PeerTube ad­mins used to need (costly) big-CPU servers for a task that was­n’t per­ma­nent… un­til re­mote transcod­ing.

Remote transcod­ing al­lows PeerTube ad­mins to de­port some or all of their transcod­ing tasks to an­other, more pow­er­ful server, one that can be shared with other ad­mins, for ex­am­ple.

It makes the whole PeerTube ad­min­is­tra­tion cheaper, more re­silient, more power-ef­fi­cient… and opens a way of shar­ing re­sources be­tween com­mu­ni­ties !

We want, once again to thank the NGI Entrust pro­gram and the NLnet foun­da­tion for the grant that helped us achieve such a tech­ni­cal im­prove­ment !

Enough with the past, let’s de­tail the fea­tures of this new ma­jor ver­sion. Note that, for this whole 2023 roadmap, we de­vel­oped fea­tures sug­gested and up­voted by… you ! Or at least by those of you who shared your ideas on our feed­back web­site.

That was a very awaited fea­ture. Password-protected videos can be used in lots of sit­u­a­tions : to cre­ate ex­clu­sive con­tent, mark a step in an ed­u­ca­tional plan, share videos with peo­ple trusted by the ones you trust…

On their PeerTube ac­count, cre­ators can now set a sin­gle pass­word when they up­load, im­port or up­date the set­tings of their videos.

But with our REST API, ad­mins and de­vel­op­ers can take it a step fur­ther. They can set and store as many pass­words as they want, thus eas­ily give and re­voke ac­cess to videos.

This fea­ture was the work of Wicklow, dur­ing his in­tern­ship with us.

If you like to pe­ruse your videos on­line, you might be used to hover the progress bar with your mouse or fin­ger. Usually, a pre­view of the frame ap­pears as a thumb­nail : that’s called a sto­ry­board fea­ture, and that’s now avail­able in PeerTube !

Please note that as Storyboards are only gen­er­ated when up­load­ing (or im­port­ing) a video, they will only be avail­able for new videos of in­stances that up­graded to v6…

Or you can ask, very kindly, to your ad­min(s) that they use the mag­i­cal npm run cre­ate-gen­er­ate-sto­ry­board-job com­mand (warning : this task might need some CPU power), and gen­er­ate sto­ry­boards for older videos.

Sometimes, video cre­ators want to up­date a video, to cor­rect a mis­take, of­fer new in­for­ma­tion… or just to pro­pose a bet­ter cut of their work !

Now, with PeerTube, they can up­load and re­place an older ver­sion of their video. Though the older video file will be per­ma­nently erased (no back­sies !), cre­ators will keep the same URL, ti­tle and in­fos, com­ments, stats, etc.

Obviously, such a fea­ture re­quires trust be­tween video­mak­ers and ad­mins, who don’t want to be re­spon­si­ble for a cute kit­ten video be­ing « updated » into an aw­ful ad­ver­tise­ment for cat-hat­ing groups.

That’s why such a fea­ture will only be avail­able if ad­mins choose to en­able it on their PeerTube plat­forms, and will dis­play a « Video re-up­load » tag on up­dated videos.

Creators can now add chap­ters to their videos on PeerTube. In a video set­tings page, they’ll get a new « chapters » tab where they’ll only need to spec­ify the time­code and ti­tle of each chap­ter for PeerTube to add it.

If they im­port their video from an­other plat­form (cough YouTube cough), PeerTube should au­to­mat­i­cally rec­og­nize and im­port chap­ters set on this dis­tant video.

When chap­ters are set, mark­ers will ap­pear and seg­ment the progress bar. Chapter ti­tles will be dis­played when you hover or touch one of those chap­ters seg­ments.

Last year, thanks to French in­die jour­nal­ist David Dufresne’s Au Poste ! livestream show and his hoster Octopuce, we got a livestream stress test with more than 400 si­mul­ta­ne­ous view­ers : see the re­port here on Octopuce’s blog[FR].

Such tests are re­ally help­ful to un­der­stand where we can im­prove PeerTube to re­duce bot­tle­necks, im­prove per­for­mance, and give ad­vice on the best con­fig­u­ra­tion for a PeerTube server if an ad­min plans on get­ting a lot of traf­fic.

That’s why this year, we have de­cided to re­al­ize more tests, with a thou­sand si­mul­ta­ne­ous users sim­u­lated both in livestream and clas­sic video stream­ing con­di­tions. Lots of thanks and dat­alove to Octopuce for help­ing us de­ploy our test in­fra­struc­ture.

We will soon pub­lish a re­port with our con­clu­sions and rec­om­mended server con­fig­u­ra­tions de­pend­ing on use­cases (late 2023, early 2024). In the mean­time, early tests mo­ti­vated us to add many per­for­mances im­prove­ments into this v6, such as (brace your­selves for the tech­ni­cal terms) :

A new ma­jor ver­sion al­ways comes with its lot of changes, im­prove­ments, bug­fixes, etc. You can read the com­plete log here, but here are the high­lights :

* We needed to set­tle a tech­ni­cal debt : v6 re­moves sup­port for WebTorrent to fo­cus on HLS (with WebRTC P2P). Both are tech­ni­cal bricks used to get peer-to-peer stream­ing in web browsers, but HLS is more fit­ted to what we are do­ing (and plan to do) with PeerTube

* The video player is more ef­fi­cient

It is not be­ing re­built any­more every time the video changes

It au­to­mat­i­cally ad­just its size to match the video ra­tio

* It is not be­ing re­built any­more every time the video changes

* It au­to­mat­i­cally ad­just its size to match the video ra­tio

* We have im­proved SEO, to help videos hosted on a PeerTube plat­form ap­pear higher in the search re­sults of search en­gines

* We worked a lot on im­prov­ing PeerTube’s ac­ces­si­bil­ity on many lev­els, to stream­line the ex­pe­ri­ence of peo­ple with dis­abil­i­ties.

With YouTube wag­ing war against ad­block­ers, Twitch in­creas­ingly ex­ploit­ing stream­ers, and every­one be­com­ing more and more aware of the tox­i­c­ity of this sys­tem… PeerTube is get­ting trac­tion, recog­ni­tion and a grow­ing com­mu­nity.

We have so many an­nounce­ments to make about the fu­ture we plan for PeerTube, that we will pub­lish a sep­a­rate news, in two weeks. We are also plan­ning on host­ing an « Ask Us Anything » livestream, to an­swer the ques­tions you’d have about PeerTube.

Please stay tuned by sub­scrib­ing to PeerTube’s Newsletter, fol­low­ing PeerTube’s Mastodon ac­count or keep­ing an eye on the Framablog.

In the mean­time, we want to re­mind you that all these de­vel­op­ments were achieved by only one full-time payed de­vel­oper, an in­tern, and a fab­u­lous com­mu­nity (lots of dat­alove to Chocobozzz, Wicklow, and the many, many con­trib­u­tors : y’all are amaz­ing !)

Framasoft be­ing a French not-for-profit mainly funded by grass­roots do­na­tions (75 % of our yearly in­come comes from peo­ple like you and us), PeerTube de­vel­op­ment has been funded by two main sources :

If you are a non-French-speak­ing PeerTube afi­cionado, please con­sider sup­port­ing our work by mak­ing a do­na­tion to Framasoft. It will greatly help us fund our many, many pro­jects, and bal­ance our 2024 bud­get.

Once again this year we need you, your sup­port, your shar­ing to help us re­gain ground on the toxic GAFAM web and mul­ti­ply the num­ber of eth­i­cal dig­i­tal spaces. So we’ve asked David Revoy to help us pre­sent this on our sup­port Framasoft page, which we in­vite you to visit (because it’s beau­ti­ful) and above all to share as widely as pos­si­ble :

If we are to bal­ance our bud­get for 2024, we have five weeks to raise €176,425 : we can’t do it with­out your help !

...

Read the original on framablog.org »

5 306 shares, 10 trendiness

Shrig 🐌 (@Shrigglepuss@godforsaken.website)

To use the Mastodon web ap­pli­ca­tion, please en­able JavaScript. Alternatively, try one of the na­tive apps for Mastodon for your plat­form.

...

Read the original on godforsaken.website »

6 278 shares, 23 trendiness

Designing a SIMD Algorithm from Scratch · mcyoung

I’m Miguel. I write about com­pil­ers, per­for­mance, and silly com­puter things. I also draw Pokémon.

Another ex­plainer on a fun, es­o­teric topic: op­ti­miz­ing code with SIMD (single in­struc­tion mul­ti­ple data, also some­times called vec­tor­iza­tion). Designing a good, fast, portable SIMD al­go­rithm is not a sim­ple mat­ter and re­quires think­ing a lit­tle bit like a cir­cuit de­signer.

Here’s the manda­tory per­for­mance bench­mark graph to catch your eye.

SIMD of­ten gets thrown around as a buzz­word by per­for­mance and HPC (high per­for­mance com­put­ing) nerds, but I don’t think it’s a topic that has very friendly in­tro­duc­tions out there, for a lot of rea­sons.

* It’s not some­thing you will re­ally want to care about un­less you think per­for­mance is cool.

* APIs for pro­gram­ming with SIMD in most pro­gram­ming lan­guages are garbage (I’ll get into why).

* SIMD al­go­rithms are hard to think about if you’re very pro­ce­dural-pro­gram­ming-brained. A func­tional pro­gram­ming mind­set can help a lot.

This post is mostly about vb64 (which stands for vec­tor base64), a base64 codec I wrote to see for my­self if Rust’s std::simd li­brary is any good, but it’s also an ex­cuse to talk about SIMD in gen­eral.

What is SIMD, any­ways? Let’s dive in.

If you want to skip straight to the writeup on vb64, click here.

Unfortunately, com­put­ers ex­ist in the real world[ci­ta­tion-needed], and are bound by the laws of na­ture. SIMD has rel­a­tively lit­tle to do with the­o­ret­i­cal CS con­sid­er­a­tions, and every­thing to do with physics.

In the in­fancy of mod­ern com­put­ing, you could sim­ply im­prove per­for­mance of ex­ist­ing pro­grams by buy­ing new com­put­ers. This is of­ten in­cor­rectly at­trib­uted to Moore’s law (the num­ber of tran­sis­tors on IC de­signs dou­bles every two years). Moore’s law still ap­pears to hold as of 2023, but some time in the last 15 years the Dennard scal­ing ef­fect broke down. This means that denser tran­sis­tors even­tu­ally means in­creased power dis­si­pa­tion den­sity. In sim­pler terms, we don’t know how to con­tinue to in­crease the clock fre­quency of com­put­ers with­out lit­er­ally liq­ue­fy­ing them.

So, since the early aughts, the hot new thing has been big­ger core counts. Make your pro­gram more multi-threaded and it will run faster on big­ger CPUs. This comes with syn­chro­niza­tion over­head, since now the cores need to co­op­er­ate. All con­trol flow, be it jumps, vir­tual calls, or syn­chro­niza­tion will re­sult in stall”.

The main causes of stall are branches, in­struc­tions that in­di­cate code can take one of two pos­si­ble paths (like an if state­ment), and mem­ory op­er­a­tions. Branches in­clude all con­trol flow: if state­ments, loops, func­tion calls, func­tion re­turns, even switch state­ments in C. Memory op­er­a­tions are loads and stores, es­pe­cially ones that are cache-un­friendly.

Modern com­pute cores do not ex­e­cute code line-by-line, be­cause that would be very in­ef­fi­cient. Suppose I have this pro­gram:

There’s no rea­son for the CPU to wait to fin­ish com­put­ing a be­fore it be­gins com­put­ing b; it does not de­pend on a, and while the add is be­ing ex­e­cuted, the xor cir­cuits are idle. Computers say program or­der be damned” and is­sue the add for a and the xor for b si­mul­ta­ne­ously. This is called in­struc­tion-level par­al­lelism, and de­pen­den­cies that get in the way of it are of­ten called data haz­ards.

Of course, the Zen 2 in the ma­chine I’m writ­ing this with does not have one measly adder per core. It has dozens and dozens! The op­por­tu­ni­ties for par­al­lelism are mas­sive, as long as the com­piler in your CPUs ex­e­cu­tion pipeline can clear any data haz­ards in the way.

The bet­ter the core can do this, the more it can sat­u­rate all of the functional units” for things like arith­metic, and the more num­bers it can crunch per unit time, ap­proach­ing max­i­mum uti­liza­tion of the hard­ware. Whenever the com­piler can’t do this, the ex­e­cu­tion pipeline stalls and your code is slower.

Branches stall be­cause they need to wait for the branch con­di­tion to be com­puted be­fore fetch­ing the next in­struc­tion (speculative ex­e­cu­tion is a some­what iffy workaround for this). Memory op­er­a­tions stall be­cause the data needs to phys­i­cally ar­rive at the CPU, and the speed of light is fi­nite in this uni­verse.

Trying to re­duce stall by im­prov­ing op­por­tu­ni­ties for sin­gle-core par­al­lelism is not a new idea. Consider the not-so-hum­ble GPU, whose pur­pose in life is to ren­der im­ages. Images are vec­tors of pix­els (i.e., color val­ues), and ren­der­ing op­er­a­tions tend to be highly lo­cal. For ex­am­ple, a con­vo­lu­tion ker­nel for a Gaussian blur will be two or even three or­ders of mag­ni­tude smaller than the fi­nal im­age, lend­ing it­self to lo­cal­ity.

Thus, GPUs are built for di­vide-and-con­quer: they pro­vide prim­i­tives for do­ing batched op­er­a­tions, and ex­tremely lim­ited con­trol flow.

SIMD is syn­ony­mous with batching”. It stands for single in­struc­tion, mul­ti­ple data”: a sin­gle in­struc­tion dis­patches par­al­lel op­er­a­tions on mul­ti­ple lanes of data. GPUs are the orig­i­nal SIMD ma­chines.

SIMD and vector” are of­ten used in­ter­change­ably. The fun­da­men­tal unit a SIMD in­struc­tion (or vector in­struc­tion”) op­er­ates on is a vec­tor: a fixed-size ar­ray of num­bers that you pri­mar­ily op­er­ate on com­po­nent-wise These com­po­nents are called lanes.

SIMD vec­tors are usu­ally quite small, since they need to fit into reg­is­ters. For ex­am­ple, on my ma­chine, the largest vec­tors are 256 bits wide. This is enough for 32 bytes (a u8x32), 4 dou­ble-pre­ci­sion floats (an f64x8), or all kinds of things in be­tween.

Although this does­n’t seem like much, re­mem­ber that of­fload­ing the over­head of keep­ing the pipeline sat­u­rated by a fac­tor of 4x can trans­late to that big of a speedup in la­tency.

The sim­plest vec­tor op­er­a­tions are bit­wise: and, or, xor. Ordinary in­te­gers can be thought of as vec­tors them­selves, with re­spect to the bit­wise op­er­a­tions. That’s lit­er­ally what bitwise” means: lanes-wise with lanes that are one bit wide. An i32 is, in this re­gard, an i1x32.

In fact, as a warmup, let’s look at the prob­lem of count­ing the num­ber of 1 bits in an in­te­ger. This op­er­a­tion is called population count”, or popcnt. If we view an i32 as an i1x32, popcnt is just a fold or re­duce op­er­a­tion:

In other words, we in­ter­pret the in­te­ger as an ar­ray of bits and then add the bits to­gether to a 32-bit ac­cu­mu­la­tor. Note that the ac­cu­mu­la­tor needs to be higher pre­ci­sion to avoid over­flow: ac­cu­mu­lat­ing into an i1 (as with the Iterator::reduce() method) will only tell us whether the num­ber of 1 bits is even or odd.

Of course, this pro­duces… com­i­cally bad code, frankly. We can do much bet­ter if we no­tice that we can vec­tor­ize the ad­di­tion: first we add all of the ad­ja­cent pairs of bits to­gether, then the pairs of pairs, and so on. This means the num­ber of adds is log­a­rith­mic in the num­ber of bits in the in­te­ger.

Visually, what we do is we unzip” each vec­tor, shift one to line up the lanes, add them, and then re­peat with lanes twice as big.

This is what that looks like in code.

This still won’t op­ti­mize down to a popcnt in­struc­tion, of course. The search scope for such a sim­pli­fi­ca­tion is in the regime of su­per­op­ti­miz­ers. However, the gen­er­ated code is small and fast, which is why this is the ideal im­ple­men­ta­tion of popcnt for sys­tems with­out such an in­struc­tion.

It’s es­pe­cially nice be­cause it is im­ple­mentable for e.g. u64 with only one more re­duc­tion step (remember: it’s !), and does not at any point re­quire a full u64 ad­di­tion.

Even though this is just” us­ing scalars, di­vide-and-con­quer ap­proaches like this are the bread and but­ter of the SIMD pro­gram­mer.

Proper SIMD vec­tors pro­vide more so­phis­ti­cated se­man­tics than scalars do, par­tic­u­larly be­cause there is more need to pro­vide re­place­ments for things like con­trol flow. Remember, con­trol flow is slow!

What’s ac­tu­ally avail­able is highly de­pen­dent on the ar­chi­tec­ture you’re com­pil­ing to (more on this later), but the way vec­tor in­struc­tion sets are usu­ally struc­tured is some­thing like this.

We have vec­tor reg­is­ters that are kind of like re­ally big gen­eral-pur­pose reg­is­ters. For ex­am­ple, on x86, most high per­for­mance” cores (like my Zen 2) im­ple­ment AVX2, which pro­vides 256 bit ymm vec­tors. The reg­is­ters them­selves do not have a lane count”; that is spec­i­fied by the in­struc­tions. For ex­am­ple, the vector byte add in­struc­tion” in­ter­prets the reg­is­ter as be­ing di­vided into eight-byte lanes and adds them. The cor­re­spond­ing x86 in­struc­tion is vpaddb, which in­ter­prets a ymm as an i8x32.

The op­er­a­tions you usu­ally get are:

Bitwise op­er­a­tions. These don’t need to spec­ify a lane width be­cause it’s al­ways im­plic­itly 1: they’re bit­wise. Lane-wise arith­metic. This is ad­di­tion, sub­trac­tion, mul­ti­pli­ca­tion, di­vi­sion (both int and float), and shifts (int only). Lane-wise min and max are also com­mon. These re­quire spec­i­fy­ing a lane width. Typically the small­est num­ber of lanes is two or four. Lane-wise com­pare. Given a and b, we can cre­ate a new mask vec­tor m such that m[i] = a[i] < b[i] (or any other com­par­i­son op­er­a­tion). A mask vec­tor’s lanes con­tain boolean val­ues with an un­usual bit-pat­tern: all-ze­ros (for false) or all-ones (for true). Masks can be used to se­lect be­tween two vec­tors: for ex­am­ple, given m, x, and y, you can form a fourth vec­tor z such that z[i] = m[i] ? a[i] : b[i]. Shuffles (sometimes called swiz­zles). Given a and x, cre­ate a third vec­tor s such that s[i] = a[x[i]]. a is used as a lookup table, and x as a set of in­dices. Out of bounds pro­duces a spe­cial value, usu­ally zero. This em­u­lates par­al­lelized ar­ray ac­cess with­out need­ing to ac­tu­ally touch RAM (RAM is ex­tremely slow). Often there is a shuffle2” or riffle” op­er­a­tion that al­lows tak­ing el­e­ments from one of two vec­tors. Given a, b, and x, we now de­fine s as be­ing s[i] = (a ++ b)[x[i]], where a ++ b is a dou­ble-width con­cate­na­tion. How this is ac­tu­ally im­ple­mented de­pends on ar­chi­tec­ture, and it’s easy to build out of sin­gle shuf­fles re­gard­less.

(1) and (2) are or­di­nary num­ber crunch­ing. Nothing deeply spe­cial about them.

The com­par­i­son and se­lect op­er­a­tions in (3) are in­tended to help SIMD code stay branchless”. Branchless code is writ­ten such that it per­forms the same op­er­a­tions re­gard­less of its in­puts, and re­lies on the prop­er­ties of those op­er­a­tions to pro­duce cor­rect re­sults. For ex­am­ple, this might mean tak­ing ad­van­tage of iden­ti­ties like x * 0 = 0 and a ^ b ^ a = b to dis­card garbage” re­sults.

The shuf­fles de­scribed in (4) are much more pow­er­ful than meets the eye.

For ex­am­ple, broadcast” (sometimes called splat”) makes a vec­tor whose lanes are all the same scalar, like Rust’s [42; N] ar­ray lit­eral. A broad­cast can be ex­pressed as a shuf­fle: cre­ate a vec­tor with the de­sired value in the first lane, and then shuf­fle it with an in­dex vec­tor of [0, 0, …].

Interleave” (also called zip” or pack”) takes two vec­tors a and b and cre­ates two new vec­tors c and d whose lanes are al­ter­nat­ing lanes from a and b. If the lane count is n, then c = [a[0], b[0], a[1], b[1], …] and d = [a[n/2], b[n/​2], a[n/​2 + 1], b[n/​2 + 1], …]. This can also be im­ple­mented as a shuf­fle2, with shuf­fle in­dices of [0, n, 1, n + 1, …]. Deinterleave” (or unzip”, or unpack”) is the op­po­site op­er­a­tion: it in­ter­prets a pair of vec­tors as two halves of a larger vec­tor of pairs, and pro­duces two new vec­tors con­sist­ing of the halves of each pair.

Interleave can also be in­ter­preted as tak­ing a [T; N], trans­mut­ing it to a [[T; N/2]; 2], per­form­ing a ma­trix trans­pose to turn it into a [[T; 2]; N/2], and then trans­mut­ing that back to [T; N] again. Deinterleave is the same but it trans­mutes to [[T; 2]; N/2] first.

Rotate” takes a vec­tor a with n lanes and pro­duces a new vec­tor b such that b[i] = a[(i + j) % n], for some cho­sen in­te­ger j. This is yet an­other shuf­fle, with in­dices [j, j + 1, …, n - 1, 0, 1, … j - 1].

Shuffles are worth try­ing to wrap your mind around. SIMD pro­gram­ming is all about rein­ter­pret­ing larger-than-an-in­te­ger-sized blocks of data as smaller blocks of vary­ing sizes, and shuf­fling is im­por­tant for get­ting data into the right place”.

Earlier, I men­tioned that what you get varies by ar­chi­tec­ture. This sec­tion is ba­si­cally a gi­ant foot­note.

So, there’s two big fac­tors that go into this.

We’ve learned over time which op­er­a­tions tend to be most use­ful to pro­gram­mers. x86 might have some­thing that ARM does­n’t be­cause it seemed like a good idea at the time” but turned out to be kinda niche. Instruction set ex­ten­sions are of­ten mar­ket dif­fer­en­tia­tors, even within the same ven­dor. Intel has AVX-512, which pro­vides even more so­phis­ti­cated in­struc­tions, but it’s only avail­able on high-end server chips, be­cause it makes man­u­fac­tur­ing more ex­pen­sive.

Toolchains gen­er­al­ize dif­fer­ent ex­ten­sions as target fea­tures”. Features can be de­tected at run­time through ar­chi­tec­ture-spe­cific magic. On Linux, the lscpu com­mand will list what fea­tures the CPU ad­ver­tises that it rec­og­nizes, which cor­re­late with the names of fea­tures that e.g. LLVM un­der­stands. What fea­tures are en­abled for a par­tic­u­lar func­tion af­fects how LLVM com­piles it. For ex­am­ple, LLVM will only emit ymm-us­ing code when com­pil­ing with +avx2.

So how do you write portable SIMD code? On the sur­face, the an­swer is mostly you don’t”, but it’s more com­pli­cated than that, and for that we need to un­der­stand how the later parts of a com­piler works.

When a user re­quests an add by writ­ing a + b, how should I de­cide which in­struc­tion to use for it? This seems like a trick ques­tion… just an add right? On x86, even this is­n’t so easy, since you have a choice be­tween the ac­tual add in­struc­tion, or a lea in­struc­tion (which, among other things, pre­serves the rflags reg­is­ter). This ques­tion be­comes more com­pli­cated for more so­phis­ti­cated op­er­a­tions. This gen­eral prob­lem is called in­struc­tion se­lec­tion.

Because which target fea­tures” are en­abled af­fects which in­struc­tions are avail­able, they af­fect in­struc­tion se­lec­tion. When I went over op­er­a­tions typically avail­able”, this means that com­pil­ers will usu­ally be able to se­lect good choices of in­struc­tions for them on most ar­chi­tec­tures.

Compiling with some­thing like -march=native or -Ctarget-cpu=native gets you the best” code pos­si­ble for the ma­chine you’re build­ing on, but it might not be portable to dif­fer­ent proces­sors. Gentoo was quite fa­mous for build­ing pack­ages from source on user ma­chines to take ad­van­tage of this (not to men­tion that they loved us­ing -O3, which mostly ex­ists to slow down build times with lit­tle ben­e­fit).

There is also run­time fea­ture de­tec­tion, where a pro­gram de­cides which ver­sion of a func­tion to call at run­time by ask­ing the CPU what it sup­ports. Code de­ployed on het­eroge­nous de­vices (like cryp­tog­ra­phy li­braries) of­ten make use of this. Doing this cor­rectly is very hard and some­thing I don’t par­tic­u­larly want to dig deeply into here.

The sit­u­a­tion is made worse by the fact that in C++, you usu­ally write SIMD code us­ing intrinsics”, which are spe­cial func­tions with in­scrutable names like _mm256_cvtps_epu32 that rep­re­sent a low-level op­er­a­tion in a spe­cific in­struc­tion set (this is a float to int cast from AVX2). Intrinsics are de­fined by hard­ware ven­dors, but don’t nec­es­sar­ily map down to sin­gle in­struc­tions; the com­piler can still op­ti­mize these in­struc­tions by merg­ing, dedu­pli­ca­tion, and through in­struc­tion se­lec­tion.

As a re­sult you wind up writ­ing the same code mul­ti­ple times for dif­fer­ent in­struc­tion sets, with only mi­nor main­tain­abil­ity ben­e­fits over writ­ing as­sem­bly.

The al­ter­na­tive is a portable SIMD li­brary, which does some in­struc­tion se­lec­tion be­hind the scenes at the li­brary level but tries to rely on the com­piler for most of the heavy-duty work. For a long time I was skep­ti­cal that this ap­proach would ac­tu­ally pro­duce good, com­pet­i­tive code, which brings us to the ac­tual point of this ar­ti­cle: us­ing Rust’s portable SIMD li­brary to im­ple­ment a some­what fussy al­go­rithm, and mea­sur­ing per­for­mance.

Let’s de­sign a SIMD im­ple­men­ta­tion for a well-known al­go­rithm. Although it does­n’t look like it at first, the power of shuf­fles makes it pos­si­ble to parse text with SIMD. And this pars­ing can be very, very fast.

In this case, we’re go­ing to im­ple­ment base64 de­cod­ing. To re­view, base64 is an en­cod­ing scheme for ar­bi­trary bi­nary data into ASCII. We in­ter­pret a byte slice as a bit vec­tor, and di­vide it into six-bit chunks called sex­tets. Then, each sex­tet from 0 to 63 is mapped to an ASCII char­ac­ter:

0 to 25 go to A’ to Z’. 26 to 51 go to a’ to z’. 52 to 61 go to 0’ to 9’.

There are other vari­ants of base64, but the bulk of the com­plex­ity is the same for each vari­ant.

There are a few ba­sic pit­falls to keep in mind.

Base64 is a big en­dian” for­mat: specif­i­cally, the bits in each byte are big en­dian. Because a sex­tet can span only parts of a byte, this dis­tinc­tion is im­por­tant. We need to be­ware of cases where the in­put length is not di­vis­i­ble by 4; os­ten­si­bly mes­sages should be padded with = to a mul­ti­ple of 4, but it’s easy to just han­dle mes­sages that aren’t padded cor­rectly.

The length of a de­coded mes­sage is given by this func­tion:

Given all this, the eas­i­est way to im­ple­ment base64 is some­thing like this.

So, what’s the process of turn­ing this into a SIMD ver­sion? We want to fol­low one di­rec­tive with in­ex­orable, ro­botic ded­i­ca­tion.

This is not com­pletely fea­si­ble, since the in­put is of vari­able length. But we can try. There are sev­eral branches in this code:

The for chunk in line. This one is is the length check: it checks if there is any data left to process. The for &byte in line. This is the hottest loop: it branches once per in­put byte. The match byte line is sev­eral branches, to de­ter­mine which of the five valid” match arms we land in. The re­turn Err line. Returning in a hot loop is ex­tra con­trol flow, which is not ideal. The call to de­cod­ed_len con­tains a match, which gen­er­ates branches. The call to Vec::extend_from_slice. This con­tains not just branches, but po­ten­tial calls into the al­lo­ca­tor. Extremely slow.

(5) is the eas­i­est to deal with. The match is map­ping the val­ues 0, 1, 2, 3 to 0, 1, 1, 2. Call this func­tion f. Then, the se­quence given by x - f(x) is 0, 0, 1, 1. This just hap­pens to equal x / 2 (or x >> 1), so we can write a com­pletely branch­less ver­sion of de­cod­ed_len like so.

The oth­ers will not prove so easy. Let’s turn our at­ten­tion to the in­ner­most loop next, branches (2), (3), and (4).

The su­per­power of SIMD is that be­cause you op­er­ate on so much data at a time, you can un­roll the loop so hard it be­comes branch­less.

The in­sight is this: we want to load at most four bytes, do some­thing to them, and then spit out at most three de­coded bytes. While do­ing this op­er­a­tion, we may en­counter a syn­tax er­ror so we need to re­port that some­how.

Here’s some facts we can take ad­van­tage of.

We don’t need to fig­ure out how many bytes are in the output” of the hot loop: our handy branch­less de­cod­ed_len() does that for us. Invalid base64 is ex­tremely rare. We want that syn­tax er­ror to cost as lit­tle as pos­si­ble. If the user still cares about which byte was the prob­lem, they can scan the in­put for it af­ter the fact. A is zero in base64. If we’re pars­ing a trun­cated chunk, padding it with A won’t change the value.

This sug­gests an in­ter­face for the body of the hottest loop”. We can fac­tor it out as a sep­a­rate func­tion, and sim­plify since we can as­sume our in­put is al­ways four bytes now.

You’re prob­a­bly think­ing: why not re­turn Option? Returning an enum will make it messier to elim­i­nate the if !ok branch later on (which we will!). We want to write branch­less code, so let’s fo­cus on find­ing a way of pro­duc­ing that three-byte out­put with­out need­ing to do early re­turns.

Now’s when we want to start talk­ing about vec­tors rather than ar­rays, so let’s try to rewrite our func­tion as such.

Note that the out­put is now four bytes, not three. SIMD lane counts need to be pow­ers of two, and that last el­e­ment will never get looked at, so we don’t need to worry about what winds up there.

The call­site also needs to be tweaked, but only slightly, be­cause Simd is From.

Let’s look at the first part of the for byte in ascii loop. We need to map each lane of the Simd to the cor­re­spond­ing sex­tet, and some­how sig­nal which ones are in­valid. First, no­tice some­thing spe­cial about the match: al­most every arm can be writ­ten as byte - C for some con­stant C. The non-range case looks a lit­tle silly, but hu­mor me:

So, it should be suf­fi­cient to build a vec­tor off­sets that con­tains the ap­pro­pri­ate con­stant C for each lane, and then let sex­tets = ascii - off­sets;

How can we build off­sets? Using com­pare-and-se­lect.

This so­lu­tion is quite el­e­gant, and will pro­duce very com­pet­i­tive code, but it’s not ac­tu­ally ideal. We need to do a lot of com­par­isons here: eight in to­tal. We also keep lots of val­ues alive at the same time, which might lead to un­wanted reg­is­ter pres­sure.

Let’s look at the byte rep­re­sen­ta­tions of the ranges. A-Z, a-z, and 0-9 are, as byte ranges, 0x41..0x5b, 0x61..0x7b, and 0x30..0x3a. Notice they all have dif­fer­ent high nyb­bles! What’s more, + and / are 0x2b and 0x2f, so the func­tion byte >> 4 is al­most enough to dis­tin­guish all the ranges. If we sub­tract one if byte == b’/​’, we have a per­fect hash for the ranges.

In other words, the value (byte >> 4) - (byte == /’) maps the ranges as fol­lows:

* A-Z goes to 4 or 5.

* a-z goes to 6 or 7.

This is small enough that we could cram a lookup table of val­ues for build­ing the off­sets vec­tor into an­other SIMD vec­tor, and use a shuf­fle op­er­a­tion to do the lookup.

This is not my orig­i­nal idea; I came across a GitHub is­sue where an anony­mous user points out this per­fect hash.

Our new ascii-to-sex­tet code looks like this:

There is a small wrin­kle here: Simd::swizzle_dyn() re­quires that the in­dex ar­ray be the same length as the lookup table. This is an­noy­ing be­cause right now ascii is a Simd, but that will not be the case later on, so I will sim­ply sweep this un­der the rug.

Note that we no longer get val­i­da­tion as a side-ef­fect of com­put­ing the sex­tets vec­tor. The same GitHub is­sue also pro­vides an ex­act bloom-fil­ter for check­ing that a par­tic­u­lar byte is valid; you can see my im­ple­men­ta­tion here. I’m not sure how the OP con­structed the bloom fil­ter, but the search space is small enough that you could have writ­ten a lit­tle script to brute force it.

Now comes a much tricker op­er­a­tion: we need to some­how pack all four sex­tets into three bytes. One way to try to wrap our head around what the pack­ing code in de­code_hot() is do­ing is to pass in the all-ones sex­tet in one of the four bytes, and see where those ones end up in the re­turn value.

This is not un­like how they use ra­dioac­tive dyes in bi­ol­ogy to track the mo­ment of mol­e­cules or cells through an or­gan­ism.

Bingo. Playing around with the in­puts lets us ver­ify which pieces of the bytes wind up where. For ex­am­ple, by pass­ing 0b110000 as in­put[1], we see that the two high bits of in­put[1] cor­re­spond to the low bits of out­put[0]. I’ve writ­ten the code so that the bits in each byte are printed in lit­tle-en­dian or­der, so bits on the left are the low bits.

Putting this all to­gether, we can draw a schematic of what this op­er­a­tion does to a gen­eral Simd.

Now, there’s no sin­gle in­struc­tion that will do this for us. Shuffles can be used to move bytes around, but we’re deal­ing with pieces of bytes here. We also can’t re­ally do a shift, since we need bits that are over­shifted to move into ad­ja­cent lanes.

The trick is to just make the lanes big­ger.

...

Read the original on mcyoung.xyz »

7 242 shares, 61 trendiness

Ikea debuts a trio of affordable smart home sensors

/ Sign up for Verge Deals to get deals on prod­ucts we’ve tested sent to your in­box daily.

...

Read the original on www.theverge.com »

8 240 shares, 14 trendiness

Orion Browser by Kagi

Lightweight, na­tively built with WebKit, made for you and your Mac.1

Industry-leading bat­tery life, pri­vacy re­spect­ing by de­sign and

na­tive sup­port for web ex­ten­sions.2

Get Orion+ and sup­port the browser.

Orion is 100% funded by its users and noth­ing

else.

Built on WebKit, Orion gives you a fast, smooth, and light­weight brows­ing ex­pe­ri­ence, with­out hold­ing your de­vice’s bat­tery hostage.

And with Orion’s deep in­te­gra­tion with na­tive tech­nolo­gies, like Keychain or Live Text3, you’ll feel right at home while us­ing it on ma­cOS or iOS/​iPa­dOS.

Orion of­fers na­tive sup­port for many Firefox and Chrome browser

ex­ten­sions al­low­ing ac­cess to the world’s largest

eco-sys­tem of browser ex­ten­sions.

We’re still in the process of ex­pand­ing our ex­ten­sion sup­port to in­clude all avail­able op­tions. Simultaneously, we’re work­ing on bring­ing this fea­ture to iOS

and Orion is the first browser that al­lows you to

in­stall

se­lect web ex­ten­sions di­rectly from the Chrome Web Store or Firefox

Add-Ons on your iPhone or iPad.

Privacy by de­sign, like no other browser.

Orion has been en­gi­neered from ground up as a truly

pri­vacy-re­spect­ing browser. We did it by em­brac­ing a sim­ple prin­ci­ple - Orion is a zero teleme­try browser. Your pri­vate in­for­ma­tion will never leave Orion by de­fault.

And to pro­tect your pri­vacy on the web, Orion comes with in­dus­try-lead­ing anti-track­ing tech­nol­ogy as well as a pow­er­ful built-in ad-blocker.

Available for ma­cOS and iOS/​iPa­dOS.

Install Orion

app on your iPhone or iPad

© Kagi -

Humanize the Web. All Rights Reserved. WebKit and the WebKit logo are trade­marks of Apple Inc.

...

Read the original on kagi.com »

9 238 shares, 14 trendiness

The `hanging-punctuation property` in CSS

The hang­ing-punc­tu­a­tion prop­erty in CSS is al­most a no-brainer. The clas­sic ex­am­ple is a block­quote that starts with a curly-quote. Hanging that open­ing curly-quote into the space off to the start of the text and align­ing the ac­tual words is a bet­ter look.

The blue line is just to help see the align­ment.

It is a cas­cad­ing prop­erty, so you can just do this if you like:

.wp-block-code {

bor­der: 0;

padding: 0;

-webkit-text-size-adjust: 100%;

text-size-ad­just: 100%;

.wp-block-code > span {

dis­play: block;

over­flow: auto;

.shcb-language {

bor­der: 0;

clip: rect(1px, 1px, 1px, 1px);

-webkit-clip-path: in­set(50%);

clip-path: in­set(50%);

height: 1px;

mar­gin: -1px;

over­flow: hid­den;

padding: 0;

po­si­tion: ab­solute;

width: 1px;

word-wrap: nor­mal;

word-break: nor­mal;

.hljs {

box-siz­ing: bor­der-box;

.hljs.shcb-code-table {

dis­play: table;

width: 100%;

.hljs.shcb-code-table > .shcb-loc {

color: in­herit;

dis­play: table-row;

width: 100%;

.hljs.shcb-code-table .shcb-loc > span {

dis­play: table-cell;

.wp-block-code code.hljs:not(.shcb-wrap-lines) {

white-space: pre;

.wp-block-code code.hljs.shcb-wrap-lines {

white-space: pre-wrap;

.hljs.shcb-line-numbers {

bor­der-spac­ing: 0;

counter-re­set: line;

.hljs.shcb-line-numbers > .shcb-loc {

counter-in­cre­ment: line;

.hljs.shcb-line-numbers .shcb-loc > span {

padding-left: 0.75em;

.hljs.shcb-line-numbers .shcb-loc::before {

bor­der-right: 1px solid #ddd;

con­tent: counter(line);

dis­play: table-cell;

padding: 0 0.75em;

text-align: right;

-webkit-user-select: none;

-moz-user-select: none;

-ms-user-select: none;

user-se­lect: none;

white-space: nowrap;

width: 1%;

html {

hang­ing-punc­tu­a­tion: first last;

}Code lan­guage: CSS (css)

In case you go against the grain, for aes­thet­ics, and align text the other way, the `last` value will hang punc­tu­a­tion off the other else also. That’s what it’s sup­posed to do any­way, but in my test­ing (trying quotes and pe­ri­ods), Safari does­n’t sup­port that. 🤷‍♀️

There is some risk to the prop­erty. Because the punc­tu­a­tion hangs off the edge, if you don’t have any avail­able space, it can trig­ger a hor­i­zon­tal scroll bar, which sucks. This is prob­a­bly why it’s not a de­fault. It’s rare there is zero space on the edge of text, though, so meh.

Want it to work across all browsers? Use a neg­a­tive text-in­dent in­stead. Then test for sup­port and re­place it.

block­quote {

text-in­dent: -0.45em;

@​supports (hanging-punctuation: first) {

block­quote {

text-in­dent: 0;

hang­ing-punc­tu­a­tion: first;

}Code lan­guage: CSS (css)

Having to use a magic num­ber for the `text-indent` kinda sucks, so def­i­nitely iso­late where you are ap­ply­ing it. Here’s a demo where a cus­tom prop­erty is used in­stead to make it less weird:

By the way! For putting curly quotes on block­quote, might as well do that in CSS rather than in the con­tent.

block­quote {

&::before {

con­tent: open-quote;

&::after {

con­tent: close-quote;

}Code lan­guage: CSS (css)

Hanging punc­tu­a­tion is rel­e­vant in de­sign soft­ware and print de­sign as well. I feel like any half-de­cent book type­set­ting will be do­ing this. Adobe InDesign calls it Optical Margin Alignment”.

I think hang­ing-punc­tu­a­tion is nice! Just a nice bonus where sup­ported and not a huge deal if it’s not. I’d prob­a­bly start a new pro­ject with:

html {

hang­ing-punc­tu­a­tion: first al­low-end last;

}Code lan­guage: CSS (css)

...

Read the original on chriscoyier.net »

10 235 shares, 10 trendiness

After 151 years, Popular Science will no longer offer a magazine

/ Sign up for Verge Deals to get deals on prod­ucts we’ve tested sent to your in­box daily.

...

Read the original on www.theverge.com »

To add this web app to your iOS home screen tap the share button and select "Add to the Home Screen".

10HN is also available as an iOS App

If you visit 10HN only rarely, check out the the best articles from the past week.

If you like 10HN please leave feedback and share

Visit pancik.com for more.