10 interesting stories served every morning and every evening.
...
Read the original on github.com »
You’ve probably heard AI is a “black box”. No one knows how it works. Researchers simulate a weird type of pseudo-neural-tissue, “reward” it a little every time it becomes a little more like the AI they want, and eventually it becomes the AI they want. But God only knows what goes on inside of it.
This is bad for safety. For safety, it would be nice to look inside the AI and see whether it’s executing an algorithm like “do the thing” or more like “trick the humans into thinking I’m doing the thing”. But we can’t. Because we can’t look inside an AI at all.
Until now! Towards Monosemanticity, recently out of big AI company/research lab Anthropic, claims to have gazed inside an AI and seen its soul. It looks like this:
How did they do it? What is inside of an AI? And what the heck is “monosemanticity”?
[disclaimer: after talking to many people much smarter than me, I might, just barely, sort of understand this. Any mistakes below are my own.]
A stylized neural net looks like this:
Input neurons (blue) take information from the world. In an image AI, they might take the values of pixels in the image; in a language AI, they might take characters in a text.
These connect to interneurons (black) in the “hidden layers”, which do mysterious things.
Then those connect to output neurons (green). In an image AI, they might represent values of pixels in a piece of AI art; in a language AI, characters in the chatbot response.
“Understanding what goes on inside an AI” means understanding what the black neurons in the middle layer do.
A promising starting point might be to present the AI with lots of different stimuli, then see when each neuron does vs. doesn’t fire. For example, if there’s one neuron that fires every time the input involves a dog, and never fires any other time, probably that neuron is representing the concept “dog”.
Sounds easy, right? A good summer project for an intern, right?
There are at least two problems.
First, GPT-4 has over 100 billion neurons (the exact number seems to be secret, but it’s somewhere up there).
Second, this doesn’t work. When you switch to a weaker AI with “only” a few hundred neurons and build special tools to automate the stimulus/analysis process, the neurons aren’t this simple. A few low-level ones respond to basic features (like curves in an image). But deep in the middle, where the real thought has to be happening, there’s nothing representing “dog”. Instead, the neurons are much weirder than this. In one image model, an earlier paper found “one neuron that responds to cat faces, fronts of cars, and cat legs”. The authors described this as “polysemanticity” - multiple meanings for one neuron.
Some very smart people spent a lot of time trying to figure out what conceptual system could make neurons behave like this, and came up with the Toy Models Of Superposition paper.
Their insight is: suppose your neural net has 1,000 neurons. If each neuron represented one concept, like “dog”, then the net could, at best, understand 1,000 concepts. Realistically it would understand many fewer than this, because in order to get dogs right, it would need to have many subconcepts like “dog’s face” or “that one unusual-looking dog”. So it would be helpful if you could use 1,000 neurons to represent much more than 1,000 concepts.
Here’s a way to make two neurons represent five concepts (adapted from here):
If neuron A is activated at 0.5, and neuron B is activated at 0, you get “dog”.
If neuron A is activated at 1, and neuron B is activated at 0.5, you get “apple”.
And so on.
The exact number of vertices in this abstract shape is a tradeoff. More vertices means that the two-neuron-pair can represent more concepts. But it also risks confusion. If you activate the concepts “dog” and “heart” at the same time, the AI might interpret this as “apple”. And there’s some weak sense in which the AI interprets “dog” as “negative eye”.
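To make the geometry concrete, here is a tiny sketch (mine, not from the paper; the concept names and the pentagon layout are made up) of packing five concepts into two neurons by giving each concept a direction on a pentagon, and reading activations back out by picking the nearest direction:

use std::f64::consts::TAU;

const CONCEPTS: [&str; 5] = ["dog", "apple", "heart", "eye", "tree"];

// Each concept gets a direction: the i-th vertex of a regular pentagon.
fn vertex(i: usize) -> (f64, f64) {
    let angle = TAU * i as f64 / 5.0;
    (angle.cos(), angle.sin())
}

// "Activating" one concept sets the two neurons to that vertex's coordinates.
fn encode(i: usize) -> (f64, f64) {
    vertex(i)
}

// Decoding picks the concept whose direction best matches the activations.
fn decode(a: f64, b: f64) -> &'static str {
    let dot = |i: usize| {
        let (x, y) = vertex(i);
        a * x + b * y
    };
    let best = (0..5)
        .max_by(|&i, &j| dot(i).partial_cmp(&dot(j)).unwrap())
        .unwrap();
    CONCEPTS[best]
}

fn main() {
    let (a, b) = encode(0);
    assert_eq!(decode(a, b), "dog");

    // Interference: activating "dog" and "heart" together decodes as "apple",
    // the vertex sitting between them.
    let (d, h) = (encode(0), encode(2));
    assert_eq!(decode(d.0 + h.0, d.1 + h.1), "apple");
}

Activating a single concept decodes cleanly; activating "dog" and "heart" together lands on "apple", which is exactly the kind of interference the tradeoff above is about.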
This theory is called “superposition”. Do AIs really do it? And how many vertices do they have on their abstract shapes?
The Anthropic interpretability team trained a very small, simple AI. It needed to remember 400 features, but it had only 30 neurons, so it would have to try something like the superposition strategy. Here’s what they found (slightly edited from here):
Follow the black line. On the far left of the graph, the data is dense; you need to think about every feature at the same time. Here the AI assigns one neuron per concept (meaning it will only ever learn 30 of the 400 concepts it needs to know, and mostly fail the task).
Moving to the right, we allow features to be less common - the AI may only have to think about a few at a time. The AI gradually shifts to packing its concepts into tetrahedra (three neurons per four concepts) and triangles (two neurons per three concepts). When it reaches digons (one neuron per two concepts) it stops for a while (to repackage everything this way?) Next it goes through pentagons and an unusual polyhedron called the “square anti-prism” . . .
. . . which Wikipedia says is best known for being the shape of the biscornu (a “stuffed ornamental pincushion”) and One World Trade Center in New York:
After exhausting square anti-prisms (8 features per three neurons) it gives up. Why? I don’t know.
A friend who understands these issues better than I warns that we shouldn’t expect to find pentagons and square anti-prisms in GPT-4. Probably GPT-4 does something incomprehensible in 1000-dimensional space. But it’s the 1000-dimensional equivalent of these pentagons and square anti-prisms, conserving neurons by turning them into dimensions and then placing concepts in the implied space.
The Anthropic interpretability team describes this as simulating a more powerful AI. That is, the two-neuron AI in the pentagonal toy example above is simulating a five-neuron AI. They go on to prove that the real AI can then run computations in the simulated AI; in some sense, there really is an abstract five neuron AI doing all the cognition. The only reason all of our AIs aren’t simulating infinitely powerful AIs and letting them do all the work is that as real neurons start representing more and more simulated neurons, it produces more and more noise and conceptual interference.
This is great for AIs but bad for interpreters. We hoped we could figure out what our AIs were doing just by looking at them. But it turns out they’re simulating much bigger and more complicated AIs, and if we want to know what’s going on, we have to look at those. But those AIs only exist in simulated abstract hyperdimensional spaces. Sounds hard to dissect!
Still, last month Anthropic’s interpretability team announced that they successfully dissected one of the simulated AIs in its abstract hyperdimensional space.
First the researchers trained a very simple 512-neuron AI to predict text, like a tiny version of GPT or Anthropic’s competing model Claude.
Then, they trained a second AI called an autoencoder to predict the activations of the first AI. They told it to posit a certain number of features (the experiments varied between ~2,000 and ~100,000), corresponding to the neurons of the higher-dimensional AI it was simulating. Then they made it predict how those features mapped onto the real neurons of the real AI.
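For the programmers in the audience, here is a rough sketch of that setup (mine, not Anthropic’s actual code - the layer shapes and the exact form of the sparsity penalty are guesses from the description): the real 512 activations get encoded into a much wider feature vector, the features get decoded back into a reconstruction, and training rewards good reconstruction while penalizing features for being non-zero, so only a handful light up on any given input.

struct SparseAutoencoder {
    w_enc: Vec<Vec<f32>>, // n_features x n_neurons
    b_enc: Vec<f32>,      // n_features
    w_dec: Vec<Vec<f32>>, // n_neurons x n_features
    b_dec: Vec<f32>,      // n_neurons
}

impl SparseAutoencoder {
    // features = relu(W_enc * activations + b_enc)
    fn encode(&self, activations: &[f32]) -> Vec<f32> {
        self.w_enc
            .iter()
            .zip(&self.b_enc)
            .map(|(row, b)| {
                let x: f32 = row.iter().zip(activations).map(|(w, a)| w * a).sum();
                (x + b).max(0.0)
            })
            .collect()
    }

    // reconstruction = W_dec * features + b_dec
    fn decode(&self, features: &[f32]) -> Vec<f32> {
        self.w_dec
            .iter()
            .zip(&self.b_dec)
            .map(|(row, b)| row.iter().zip(features).map(|(w, f)| w * f).sum::<f32>() + b)
            .collect()
    }

    // Training minimizes reconstruction error plus an L1 penalty that pushes
    // most features to zero; the sparsity is what (hopefully) makes each
    // feature mean one thing.
    fn loss(&self, activations: &[f32], l1_coeff: f32) -> f32 {
        let features = self.encode(activations);
        let recon = self.decode(&features);
        let mse: f32 = recon
            .iter()
            .zip(activations)
            .map(|(r, a)| (r - a).powi(2))
            .sum::<f32>()
            / activations.len() as f32;
        let l1: f32 = features.iter().map(|f| f.abs()).sum();
        mse + l1_coeff * l1
    }
}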
They found that even though the original AI’s neurons weren’t comprehensible, the new AI’s simulated neurons (aka “features”) were! They were monosemantic, ie they meant one specific thing.
Here’s feature #2663 (remember, the original AI only had 512 neurons, but they’re treating it as simulating a larger AI with up to ~100,000 neuron-features).
The single sentence in the training data that activated it most strongly is from Josephus, Book 14: “And he passed on to Sepphoris, as God sent a snow”. But we see that all the top activations are different uses of “God”.
This simulated neuron seems to be composed of a collection of real neurons including 407, 182, and 259, though probably there are many more than these and the interface just isn’t showing them to me.
None of these neurons are themselves very Godly. When we look at neuron #407 - the real neuron that contributes most to the AI’s understanding of God! - an AI-generated summary describes it as “fir[ing] primarily on non-English text, particularly accented Latin characters. It also occasionally fires on non-standard text like HTML tags.” Probably this is because you can’t really understand AIs at the real-neuron-by-real-neuron level, so the summarizing AI - having been asked to do this impossible thing - is reading tea leaves and saying random stuff.
But at the feature level, everything is nice and tidy! Remember, this AI is trying to predict the next token in a text. At this level, it does so intelligibly. When Feature #2663 is activated, it increases the probability of the next token being “bless”, “forbid”, “damn”, or “-zilla”.
Shouldn’t the AI be keeping the concept of God, Almighty Creator and Lord of the Universe, separate from God- as in the first half of Godzilla? Probably GPT-4 does that, but this toy AI doesn’t have enough real neurons to have enough simulated neurons / features to spare for the purpose. In fact, you can see this sort of thing change later in the paper:
At the bottom of this tree, you can see what happens to the AI’s representation of “the” in mathematical terminology as you let it have more and more features.
First: why is there a feature for “the” in mathematical terminology? I think because of the AI’s predictive imperative - it’s helpful to know that some specific instance of “the” should be followed by math words like “numerator” or “cosine”.
In their smallest AI (512 features), there is only one neuron for “the” in math. In their largest AI tested here (16,384 features), this has branched out to one neuron for “the” in machine learning, one for “the” in complex analysis, and one for “the” in topology and abstract algebra.
So probably if we upgraded to an AI with more simulated neurons, the God neuron would split in two - one for God as used in religions, one for God as used in kaiju names. Later we might get God in Christianity, God in Judaism, God in philosophy, et cetera.
Not all features/simulated-neurons are this simple. But many are. The team graded 412 real neurons vs. simulated neurons on subjective interpretability, and found the simulated neurons were on average pretty interpretable:
Some, like the God neuron, are for specific concepts. Many others, including some of the most interpretable, are for “formal genres” of text, like whether it’s uppercase or lowercase, English vs. some other alphabet, etc.
How common are these features? That is, suppose you train two different 4,096-feature AIs on the same text datasets. Will they have mostly the same 4,096 features? Will they both have some feature representing God? Or will the first choose to represent God together with Godzilla, and the second choose to separate them? Will the second one maybe not have a feature for God at all, instead using that space to store some other concept the first AI can’t possibly understand?
The team tests this, and finds that their two AIs are pretty similar! On average, if there’s a feature in the first one, the most similar feature in the second one will “have a median correlation of 0.72”.
What comes after this?
In May of this year, OpenAI tried to make GPT-4 (very big) understand GPT-2 (very small). They got GPT-4 to inspect each of GPT-2’s 307,200 neurons and report back on what it found.
It found a collection of intriguing results and random gibberish, because they hadn’t mastered the techniques described above of projecting the real neurons into simulated neurons and analyzing the simulated neurons instead. Still, it was impressively ambitious. Unlike the toy AI in the monosemanticity paper, GPT-2 is a real (albeit very small and obsolete) AI that once impressed people.
But what we really want is to be able to interpret the current generation of AIs. The Anthropic interpretability team admits we’re not there yet, for a few reasons.
Scaling the application of sparse autoencoders to frontier models strikes us as one of the most important questions going forward. We’re quite hopeful that these or similar methods will work — Cunningham et al.’s work seems to suggest this approach can work on somewhat larger models, and we have preliminary results that point in the same direction. However, there are significant computational challenges to be overcome. Consider an autoencoder with a 100× expansion factor applied to the activations of a single MLP layer of width 10,000: it would have ~20 billion parameters. Additionally, many of these features are likely quite rare, potentially requiring the autoencoder to be trained on a substantial fraction of the large model’s training corpus. So it seems plausible that training the autoencoder could become very expensive, potentially even more expensive than the original model. We remain optimistic, however, and there is a silver lining — it increasingly seems like a large chunk of the mechanistic interpretability agenda will now turn on succeeding at a difficult engineering and scaling problem, which frontier AI labs have significant expertise in.
In other words, in order to even begin to interpret an AI like GPT-4 (or Anthropic’s equivalent, Claude), you would need an interpreter-AI around the same size. But training an AI that size takes a giant company and hundreds of millions (soon billions) of dollars.
Second, scaling the interpretation. Suppose we find all the simulated neurons for God and Godzilla and everything else, and have a giant map of exactly how they connect, and hang that map in our room. Now we want to answer questions like:
* If you ask the AI a controversial question, how does it decide how to respond?
* Is the AI using racial stereotypes in forming judgments of people?
* Is the AI plotting to kill all humans?
There will be some combination of millions of features and connections that answers these questions. In some cases we can even imagine how we would begin to do it - check how active the features representing race are when we ask it to judge people, maybe. But realistically, when we’re working with very complex interactions between millions of neurons we’ll have to automate the process, some larger-scale version of “ask GPT-4 to tell us what GPT-2 is doing”.
This probably works for racial stereotypes. It’s more complicated once you start asking about killing all humans (what if the GPT-4 equivalent is the one plotting to kill all humans, and feeds us false answers?) But maybe there’s some way to make an interpreter AI which itself is too dumb to plot, but which can interpret a more general, more intelligent, more dangerous AI. You can see more about how this could tie into more general alignment plans in the post on the ELK problem. I also just found this paper, which I haven’t fully read yet but which seems like a start on engineering safety into interpretable AIs.
Finally, what does all of this tell us about humans?
Humans also use neural nets to reason about concepts. We have a lot of neurons, but so does GPT-4. Our data is very sparse - there are lots of concepts (eg octopi) that come up pretty rarely in everyday life. Are our brains full of strange abstract polyhedra? Are we simulating much bigger brains?
This field is very new, but I was able to find one paper, Identifying Interpretable Visual Features in Artificial and Biological Neural Systems. The authors say:
Through a suite of experiments and analyses, we find evidence consistent with the hypothesis that neurons in both deep image model [AIs] and the visual cortex [of the brain] encode features in superposition. That is, we find non-axis aligned directions in the neural state space that are more interpretable than individual neurons. In addition, across both biological and artificial systems, we uncover the intriguing phenomenon of what we call feature synergy - sparse combinations in activation space that yield more interpretable features than the constituent parts. Our work pushes in the direction of automated interpretability research for CNNs, in line with recent efforts for language models. Simultaneously, it provides a new framework for analyzing neural coding properties in biological systems.
This is a single non-peer-reviewed paper announcing a surprising claim in a hype-filled field. That means it has to be true - otherwise it would be unfair!
If this topic interests you, you might want to read the full papers, which are much more comprehensive and interesting than this post was able to capture. My favorites are:
In the unlikely scenario where all of this makes total sense and you feel like you’re ready to make contributions, you might be a good candidate for Anthropic or OpenAI’s alignment teams, both of which are hiring. If you feel like it’s the sort of thing which could make sense and you want to transition into learning more about it, you might be a good candidate for alignment training/scholarship programs like MATS.
...
Read the original on www.astralcodexten.com »
It’s #givingtuesday, so we’re giving you PeerTube v6 today ! PeerTube is the software we develop for creators, media, institutions, educators… to manage their own video platform, as an alternative to YouTube and Twitch.
Thanks to your donations to our not-for-profit, Framasoft is taking action to advance the ethical, user-friendly web. Find a summary of our progress in 2023 on our Support Framasoft page.
➡️ Read the series of articles from this campaign (Nov. — Dec. 2023)
The sixth major version is being released today and we are very proud ! It is the most ambitious one since we added peer-to-peer livestreaming. There is a good reason for that : we packed this v6 with features inspired by your ideas !
We are so eager to present all the work we achieved that we’ll get right into it. But stay tuned : in two weeks, we’ll take more time to talk about PeerTube’s history, the state of this project and the great plans we have for its future !
In 2023, and before preparing this major update, we released only two minor versions… but one of them brought to the table a major technical feature that will help democratize video hosting even more.
You’ll get more details in the news dedicated to the 5.1 release, so to keep it short, this version brought :
* an « asking for an account » feature, where instance moderators can manage and moderate new account requests ;
* a back-to-live button, so if you lag behind during a livestream, you can jump back to the live stream
* Improvements on the authentication plugin, to facilitate signing on with external credentials
As you’ll find out in our 5.2 release blogpost, there were some smaller but important new features such as :
* Adapting RSS feeds to podcast standards, so any podcast client could be able to read a PeerTube channel, for example
* The option to set the privacy of a livestream replay, that way streamers can choose beforehand if the replay of their live will be Public, Unlisted, Private or Internal
* Improved mouse-free navigation : for those who prefer or need to navigate using their keyboard
* And upgrades in our documentation (it’s quite thorough : check it out !)
But the game changer in this 5.2 release was the new remote transcoding feature.
When a creator uploads a video (or when they are streaming live), PeerTube needs to transform their video file into an efficient format. This task is called video transcoding, and it consumes lots of CPU power. PeerTube admins used to need (costly) big-CPU servers for a task that wasn’t permanent… until remote transcoding.
Remote transcoding allows PeerTube admins to deport some or all of their transcoding tasks to another, more powerful server, one that can be shared with other admins, for example.
It makes the whole PeerTube administration cheaper, more resilient, more power-efficient… and opens a way of sharing resources between communities !
We want, once again to thank the NGI Entrust program and the NLnet foundation for the grant that helped us achieve such a technical improvement !
Enough with the past, let’s detail the features of this new major version. Note that, for this whole 2023 roadmap, we developed features suggested and upvoted by… you ! Or at least by those of you who shared your ideas on our feedback website.
That was a long-awaited feature. Password-protected videos can be used in lots of situations : to create exclusive content, mark a step in an educational plan, share videos with people trusted by the ones you trust…
On their PeerTube account, creators can now set a single password when they upload, import or update the settings of their videos.
But with our REST API, admins and developers can take it a step further. They can set and store as many passwords as they want, thus easily give and revoke access to videos.
This feature was the work of Wicklow, during his internship with us.
If you like to peruse your videos online, you might be used to hovering over the progress bar with your mouse or finger. Usually, a preview of the frame appears as a thumbnail : that’s called a storyboard feature, and that’s now available in PeerTube !
Please note that as Storyboards are only generated when uploading (or importing) a video, they will only be available for new videos of instances that upgraded to v6…
Or you can very kindly ask your admin(s) to use the magical npm run create-generate-storyboard-job command (warning : this task might need some CPU power) to generate storyboards for older videos.
Sometimes, video creators want to update a video, to correct a mistake, offer new information… or just to propose a better cut of their work !
Now, with PeerTube, they can upload and replace an older version of their video. Though the older video file will be permanently erased (no backsies !), creators will keep the same URL, title and infos, comments, stats, etc.
Obviously, such a feature requires trust between videomakers and admins, who don’t want to be responsible for a cute kitten video being « updated » into an awful advertisement for cat-hating groups.
That’s why such a feature will only be available if admins choose to enable it on their PeerTube platforms, and will display a « Video re-upload » tag on updated videos.
Creators can now add chapters to their videos on PeerTube. In a video settings page, they’ll get a new « chapters » tab where they’ll only need to specify the timecode and title of each chapter for PeerTube to add it.
If they import their video from another platform (cough YouTube cough), PeerTube should automatically recognize and import chapters set on this distant video.
When chapters are set, markers will appear and segment the progress bar. Chapter titles will be displayed when you hover or touch one of those chapters segments.
Last year, thanks to French indie journalist David Dufresne’s Au Poste ! livestream show and his hoster Octopuce, we got a livestream stress test with more than 400 simultaneous viewers : see the report here on Octopuce’s blog[FR].
Such tests are really helpful to understand where we can improve PeerTube to reduce bottlenecks, improve performance, and give advice on the best configuration for a PeerTube server if an admin plans on getting a lot of traffic.
That’s why this year, we have decided to run more tests, with a thousand simultaneous users simulated both in livestream and classic video streaming conditions. Lots of thanks and datalove to Octopuce for helping us deploy our test infrastructure.
We will soon publish a report with our conclusions and recommended server configurations depending on usecases (late 2023, early 2024). In the meantime, early tests motivated us to add many performance improvements into this v6, such as (brace yourselves for the technical terms) :
A new major version always comes with its lot of changes, improvements, bugfixes, etc. You can read the complete log here, but here are the highlights :
* We needed to settle a technical debt : v6 removes support for WebTorrent to focus on HLS (with WebRTC P2P). Both are technical bricks used to get peer-to-peer streaming in web browsers, but HLS is more fitted to what we are doing (and plan to do) with PeerTube
* The video player is more efficient
* It is no longer rebuilt every time the video changes
* It automatically adjusts its size to match the video ratio
* We have improved SEO, to help videos hosted on a PeerTube platform appear higher in the search results of search engines
* We worked a lot on improving PeerTube’s accessibility on many levels, to streamline the experience of people with disabilities.
With YouTube waging war against adblockers, Twitch increasingly exploiting streamers, and everyone becoming more and more aware of the toxicity of this system… PeerTube is getting traction, recognition and a growing community.
We have so many announcements to make about the future we plan for PeerTube that we will publish a separate news post in two weeks. We are also planning on hosting an « Ask Us Anything » livestream, to answer the questions you’d have about PeerTube.
Please stay tuned by subscribing to PeerTube’s Newsletter, following PeerTube’s Mastodon account or keeping an eye on the Framablog.
In the meantime, we want to remind you that all these developments were achieved by only one full-time paid developer, an intern, and a fabulous community (lots of datalove to Chocobozzz, Wicklow, and the many, many contributors : y’all are amazing !)
Framasoft being a French not-for-profit mainly funded by grassroots donations (75 % of our yearly income comes from people like you and us), PeerTube development has been funded by two main sources :
If you are a non-French-speaking PeerTube aficionado, please consider supporting our work by making a donation to Framasoft. It will greatly help us fund our many, many projects, and balance our 2024 budget.
Once again this year we need you, your support, your sharing to help us regain ground on the toxic GAFAM web and multiply the number of ethical digital spaces. So we’ve asked David Revoy to help us present this on our support Framasoft page, which we invite you to visit (because it’s beautiful) and above all to share as widely as possible :
If we are to balance our budget for 2024, we have five weeks to raise €176,425 : we can’t do it without your help !
...
Read the original on framablog.org »
...
Read the original on godforsaken.website »
I’m Miguel. I write about compilers, performance, and silly computer things. I also draw Pokémon.
Another explainer on a fun, esoteric topic: optimizing code with SIMD (single instruction multiple data, also sometimes called vectorization). Designing a good, fast, portable SIMD algorithm is not a simple matter and requires thinking a little bit like a circuit designer.
Here’s the mandatory performance benchmark graph to catch your eye.
“SIMD” often gets thrown around as a buzzword by performance and HPC (high performance computing) nerds, but I don’t think it’s a topic that has very friendly introductions out there, for a lot of reasons.
* It’s not something you will really want to care about unless you think performance is cool.
* APIs for programming with SIMD in most programming languages are garbage (I’ll get into why).
* SIMD algorithms are hard to think about if you’re very procedural-programming-brained. A functional programming mindset can help a lot.
This post is mostly about vb64 (which stands for vector base64), a base64 codec I wrote to see for myself if Rust’s std::simd library is any good, but it’s also an excuse to talk about SIMD in general.
What is SIMD, anyways? Let’s dive in.
If you want to skip straight to the writeup on vb64, click here.
Unfortunately, computers exist in the real world[citation-needed], and are bound by the laws of nature. SIMD has relatively little to do with theoretical CS considerations, and everything to do with physics.
In the infancy of modern computing, you could simply improve performance of existing programs by buying new computers. This is often incorrectly attributed to Moore’s law (the number of transistors on IC designs doubles every two years). Moore’s law still appears to hold as of 2023, but some time in the last 15 years the Dennard scaling effect broke down. This means that denser transistors eventually means increased power dissipation density. In simpler terms, we don’t know how to continue to increase the clock frequency of computers without literally liquefying them.
So, since the early aughts, the hot new thing has been bigger core counts. Make your program more multi-threaded and it will run faster on bigger CPUs. This comes with synchronization overhead, since now the cores need to cooperate. All control flow, be it jumps, virtual calls, or synchronization, will result in “stall”.
The main causes of stall are branches, instructions that indicate code can take one of two possible paths (like an if statement), and memory operations. Branches include all control flow: if statements, loops, function calls, function returns, even switch statements in C. Memory operations are loads and stores, especially ones that are cache-unfriendly.
Modern compute cores do not execute code line-by-line, because that would be very inefficient. Suppose I have this program:
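Something along these lines (a stand-in of my own; the names are arbitrary), where a comes from an add and b from an xor, with no dependency between them:

fn independent_ops(x: u32, y: u32, z: u32, w: u32) -> (u32, u32) {
    let a = x + y; // an add
    let b = z ^ w; // an xor; does not depend on `a` at all
    (a, b)
}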
There’s no reason for the CPU to wait to finish computing a before it begins computing b; it does not depend on a, and while the add is being executed, the xor circuits are idle. Computers say “program order be damned” and issue the add for a and the xor for b simultaneously. This is called instruction-level parallelism, and dependencies that get in the way of it are often called data hazards.
Of course, the Zen 2 in the machine I’m writing this with does not have one measly adder per core. It has dozens and dozens! The opportunities for parallelism are massive, as long as the compiler in your CPU’s execution pipeline can clear any data hazards in the way.
The better the core can do this, the more it can saturate all of the “functional units” for things like arithmetic, and the more numbers it can crunch per unit time, approaching maximum utilization of the hardware. Whenever the compiler can’t do this, the execution pipeline stalls and your code is slower.
Branches stall because they need to wait for the branch condition to be computed before fetching the next instruction (speculative execution is a somewhat iffy workaround for this). Memory operations stall because the data needs to physically arrive at the CPU, and the speed of light is finite in this universe.
Trying to reduce stall by improving opportunities for single-core parallelism is not a new idea. Consider the not-so-humble GPU, whose purpose in life is to render images. Images are vectors of pixels (i.e., color values), and rendering operations tend to be highly local. For example, a convolution kernel for a Gaussian blur will be two or even three orders of magnitude smaller than the final image, lending itself to locality.
Thus, GPUs are built for divide-and-conquer: they provide primitives for doing batched operations, and extremely limited control flow.
“SIMD” is synonymous with “batching”. It stands for “single instruction, multiple data”: a single instruction dispatches parallel operations on multiple lanes of data. GPUs are the original SIMD machines.
“SIMD” and “vector” are often used interchangeably. The fundamental unit a SIMD instruction (or “vector instruction”) operates on is a vector: a fixed-size array of numbers that you primarily operate on component-wise These components are called lanes.
SIMD vectors are usually quite small, since they need to fit into registers. For example, on my machine, the largest vectors are 256 bits wide. This is enough for 32 bytes (a u8x32), 4 double-precision floats (an f64x4), or all kinds of things in between.
Although this doesn’t seem like much, remember that offloading the overhead of keeping the pipeline saturated by a factor of 4x can translate to that big of a speedup in latency.
The simplest vector operations are bitwise: and, or, xor. Ordinary integers can be thought of as vectors themselves, with respect to the bitwise operations. That’s literally what “bitwise” means: lanes-wise with lanes that are one bit wide. An i32 is, in this regard, an i1x32.
In fact, as a warmup, let’s look at the problem of counting the number of 1 bits in an integer. This operation is called “population count”, or popcnt. If we view an i32 as an i1x32, popcnt is just a fold or reduce operation:
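A sketch of that reduction (my own rendering of the idea, not necessarily the post’s exact snippet): treat the integer as 32 one-bit lanes and fold them into a wider accumulator.

pub fn popcnt(x: u32) -> u32 {
    // Pull out each of the 32 one-bit "lanes" and add them up in a u32, so the
    // count cannot overflow the accumulator.
    (0..32).map(|i| (x >> i) & 1).fold(0, |count, bit| count + bit)
}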
In other words, we interpret the integer as an array of bits and then add the bits together to a 32-bit accumulator. Note that the accumulator needs to be higher precision to avoid overflow: accumulating into an i1 (as with the Iterator::reduce() method) will only tell us whether the number of 1 bits is even or odd.
Of course, this produces… comically bad code, frankly. We can do much better if we notice that we can vectorize the addition: first we add all of the adjacent pairs of bits together, then the pairs of pairs, and so on. This means the number of adds is logarithmic in the number of bits in the integer.
Visually, what we do is we “unzip” each vector, shift one to line up the lanes, add them, and then repeat with lanes twice as big.
This is what that looks like in code.
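One classic scalar way to write this divide-and-conquer (my version, with the masks written out by hand): each line doubles the lane width by adding adjacent lanes together.

pub fn popcnt(mut x: u32) -> u32 {
    x = (x & 0x55555555) + ((x >> 1) & 0x55555555); // 1-bit lanes -> 2-bit lanes
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333); // 2-bit lanes -> 4-bit lanes
    x = (x & 0x0f0f0f0f) + ((x >> 4) & 0x0f0f0f0f); // 4-bit lanes -> 8-bit lanes
    x = (x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff); // 8-bit lanes -> 16-bit lanes
    x = (x & 0x0000ffff) + (x >> 16);               // 16-bit lanes -> the result
    x
}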
This still won’t optimize down to a popcnt instruction, of course. The search scope for such a simplification is in the regime of superoptimizers. However, the generated code is small and fast, which is why this is the ideal implementation of popcnt for systems without such an instruction.
It’s especially nice because it is implementable for e.g. u64 with only one more reduction step (remember: it’s logarithmic!), and does not at any point require a full u64 addition.
Even though this is “just” using scalars, divide-and-conquer approaches like this are the bread and butter of the SIMD programmer.
Proper SIMD vectors provide more sophisticated semantics than scalars do, particularly because there is more need to provide replacements for things like control flow. Remember, control flow is slow!
What’s actually available is highly dependent on the architecture you’re compiling to (more on this later), but the way vector instruction sets are usually structured is something like this.
We have vector registers that are kind of like really big general-purpose registers. For example, on x86, most “high performance” cores (like my Zen 2) implement AVX2, which provides 256 bit ymm vectors. The registers themselves do not have a “lane count”; that is specified by the instructions. For example, the “vector byte add instruction” interprets the register as being divided into eight-bit lanes and adds them. The corresponding x86 instruction is vpaddb, which interprets a ymm as an i8x32.
The operations you usually get are:
1. Bitwise operations. These don’t need to specify a lane width because it’s always implicitly 1: they’re bitwise.
2. Lane-wise arithmetic. This is addition, subtraction, multiplication, division (both int and float), and shifts (int only). Lane-wise min and max are also common. These require specifying a lane width. Typically the smallest number of lanes is two or four.
3. Lane-wise compare. Given a and b, we can create a new mask vector m such that m[i] = a[i] < b[i] (or any other comparison operation). A mask vector’s lanes contain boolean values with an unusual bit-pattern: all-zeros (for false) or all-ones (for true). Masks can be used to select between two vectors: for example, given m, x, and y, you can form a fourth vector z such that z[i] = m[i] ? x[i] : y[i].
4. Shuffles (sometimes called swizzles). Given a and x, create a third vector s such that s[i] = a[x[i]]. a is used as a lookup table, and x as a set of indices. Out of bounds produces a special value, usually zero. This emulates parallelized array access without needing to actually touch RAM (RAM is extremely slow). Often there is a “shuffle2” or “riffle” operation that allows taking elements from one of two vectors. Given a, b, and x, we now define s as being s[i] = (a ++ b)[x[i]], where a ++ b is a double-width concatenation. How this is actually implemented depends on architecture, and it’s easy to build out of single shuffles regardless.
(1) and (2) are ordinary number crunching. Nothing deeply special about them.
The comparison and select operations in (3) are intended to help SIMD code stay “branchless”. Branchless code is written such that it performs the same operations regardless of its inputs, and relies on the properties of those operations to produce correct results. For example, this might mean taking advantage of identities like x * 0 = 0 and a ^ b ^ a = b to discard “garbage” results.
The shuffles described in (4) are much more powerful than meets the eye.
For example, “broadcast” (sometimes called “splat”) makes a vector whose lanes are all the same scalar, like Rust’s [42; N] array literal. A broadcast can be expressed as a shuffle: create a vector with the desired value in the first lane, and then shuffle it with an index vector of [0, 0, …].
“Interleave” (also called “zip” or “pack”) takes two vectors a and b and creates two new vectors c and d whose lanes are alternating lanes from a and b. If the lane count is n, then c = [a[0], b[0], a[1], b[1], …] and d = [a[n/2], b[n/2], a[n/2 + 1], b[n/2 + 1], …]. This can also be implemented as a shuffle2, with shuffle indices of [0, n, 1, n + 1, …]. “Deinterleave” (or “unzip”, or “unpack”) is the opposite operation: it interprets a pair of vectors as two halves of a larger vector of pairs, and produces two new vectors consisting of the halves of each pair.
Interleave can also be interpreted as taking a [T; N], transmuting it to a [[T; N/2]; 2], performing a matrix transpose to turn it into a [[T; 2]; N/2], and then transmuting that back to [T; N] again. Deinterleave is the same but it transmutes to [[T; 2]; N/2] first.
“Rotate” takes a vector a with n lanes and produces a new vector b such that b[i] = a[(i + j) % n], for some chosen integer j. This is yet another shuffle, with indices [j, j + 1, …, n - 1, 0, 1, … j - 1].
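Here’s a quick illustration of broadcast and rotate as shuffles, using Rust’s nightly portable SIMD (an assumption of this sketch; the lane width is my choice and module paths can shift between nightlies):

#![feature(portable_simd)]
use std::simd::{simd_swizzle, Simd};

fn main() {
    let a = Simd::<u8, 4>::from_array([10, 20, 30, 40]);

    // Broadcast lane 0 to every lane: shuffle with indices [0, 0, 0, 0].
    let splat = simd_swizzle!(a, [0, 0, 0, 0]);
    assert_eq!(splat.to_array(), [10, 10, 10, 10]);

    // Rotate left by one lane: shuffle with indices [1, 2, 3, 0].
    let rotated = simd_swizzle!(a, [1, 2, 3, 0]);
    assert_eq!(rotated.to_array(), [20, 30, 40, 10]);
}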
Shuffles are worth trying to wrap your mind around. SIMD programming is all about reinterpreting larger-than-an-integer-sized blocks of data as smaller blocks of varying sizes, and shuffling is important for getting data into the right “place”.
Earlier, I mentioned that what you get varies by architecture. This section is basically a giant footnote.
So, there’s two big factors that go into this.
We’ve learned over time which operations tend to be most useful to programmers. x86 might have something that ARM doesn’t because it “seemed like a good idea at the time” but turned out to be kinda niche. Instruction set extensions are often market differentiators, even within the same vendor. Intel has AVX-512, which provides even more sophisticated instructions, but it’s only available on high-end server chips, because it makes manufacturing more expensive.
Toolchains generalize different extensions as “target features”. Features can be detected at runtime through architecture-specific magic. On Linux, the lscpu command will list what features the CPU advertises that it recognizes, which correlate with the names of features that e.g. LLVM understands. What features are enabled for a particular function affects how LLVM compiles it. For example, LLVM will only emit ymm-using code when compiling with +avx2.
So how do you write portable SIMD code? On the surface, the answer is mostly “you don’t”, but it’s more complicated than that, and for that we need to understand how the later parts of a compiler works.
When a user requests an add by writing a + b, how should I decide which instruction to use for it? This seems like a trick question… just an add right? On x86, even this isn’t so easy, since you have a choice between the actual add instruction, or a lea instruction (which, among other things, preserves the rflags register). This question becomes more complicated for more sophisticated operations. This general problem is called instruction selection.
Because which “target features” are enabled affects which instructions are available, they affect instruction selection. When I went over operations “typically available”, this means that compilers will usually be able to select good choices of instructions for them on most architectures.
Compiling with something like -march=native or -Ctarget-cpu=native gets you “the best” code possible for the machine you’re building on, but it might not be portable to different processors. Gentoo was quite famous for building packages from source on user machines to take advantage of this (not to mention that they loved using -O3, which mostly exists to slow down build times with little benefit).
There is also runtime feature detection, where a program decides which version of a function to call at runtime by asking the CPU what it supports. Code deployed on heterogenous devices (like cryptography libraries) often make use of this. Doing this correctly is very hard and something I don’t particularly want to dig deeply into here.
The situation is made worse by the fact that in C++, you usually write SIMD code using “intrinsics”, which are special functions with inscrutable names like _mm256_cvtps_epu32 that represent a low-level operation in a specific instruction set (this is a float to int cast from AVX2). Intrinsics are defined by hardware vendors, but don’t necessarily map down to single instructions; the compiler can still optimize these instructions by merging, deduplication, and through instruction selection.
As a result you wind up writing the same code multiple times for different instruction sets, with only minor maintainability benefits over writing assembly.
The alternative is a portable SIMD library, which does some instruction selection behind the scenes at the library level but tries to rely on the compiler for most of the heavy-duty work. For a long time I was skeptical that this approach would actually produce good, competitive code, which brings us to the actual point of this article: using Rust’s portable SIMD library to implement a somewhat fussy algorithm, and measuring performance.
Let’s design a SIMD implementation for a well-known algorithm. Although it doesn’t look like it at first, the power of shuffles makes it possible to parse text with SIMD. And this parsing can be very, very fast.
In this case, we’re going to implement base64 decoding. To review, base64 is an encoding scheme for arbitrary binary data into ASCII. We interpret a byte slice as a bit vector, and divide it into six-bit chunks called sextets. Then, each sextet from 0 to 63 is mapped to an ASCII character:
* 0 to 25 go to ‘A’ to ‘Z’.
* 26 to 51 go to ‘a’ to ‘z’.
* 52 to 61 go to ‘0’ to ‘9’.
* 62 goes to ‘+’.
* 63 goes to ‘/’.
There are other variants of base64, but the bulk of the complexity is the same for each variant.
There are a few basic pitfalls to keep in mind.
Base64 is a “big endian” format: specifically, the bits in each byte are big endian. Because a sextet can span only parts of a byte, this distinction is important. We need to beware of cases where the input length is not divisible by 4; ostensibly messages should be padded with = to a multiple of 4, but it’s easy to just handle messages that aren’t padded correctly.
The length of a decoded message is given by this function:
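A version consistent with how it’s discussed below (my sketch): every whole 4-character chunk becomes 3 bytes, and a trailing partial chunk contributes the remainder.

fn decoded_len(input: usize) -> usize {
    input / 4 * 3
        + match input % 4 {
            1 | 2 => 1,
            3 => 2,
            _ => 0,
        }
}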
Given all this, the easiest way to implement base64 is something like this.
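Here is my sketch of that baseline decoder, written to match the branch-by-branch discussion that follows (it reuses decoded_len from above and a bare-bones Error type):

#[derive(Debug)]
pub struct Error;

pub fn decode(data: &[u8], out: &mut Vec<u8>) -> Result<(), Error> {
    // Ignore trailing '=' padding, if any.
    let data = match data {
        [rest @ .., b'=', b'='] | [rest @ .., b'='] | rest => rest,
    };

    for chunk in data.chunks(4) {
        let mut bits = 0u32;
        for &byte in chunk {
            // Map each ASCII character back to its sextet.
            let sextet = match byte {
                b'A'..=b'Z' => byte - b'A',
                b'a'..=b'z' => byte - b'a' + 26,
                b'0'..=b'9' => byte - b'0' + 52,
                b'+' => 62,
                b'/' => 63,
                _ => return Err(Error),
            };

            // Base64 is big-endian: earlier characters hold higher bits.
            bits = (bits << 6) | sextet as u32;
        }

        // Left-align the decoded bits and emit the right number of bytes.
        let bytes = (bits << (32 - 6 * chunk.len())).to_be_bytes();
        out.extend_from_slice(&bytes[..decoded_len(chunk.len())]);
    }

    Ok(())
}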
So, what’s the process of turning this into a SIMD version? We want to follow one directive with inexorable, robotic dedication: eliminate all branches.
This is not completely feasible, since the input is of variable length. But we can try. There are several branches in this code:
1. The for chunk in line. This one is the length check: it checks if there is any data left to process.
2. The for &byte in line. This is the hottest loop: it branches once per input byte.
3. The match byte line is several branches, to determine which of the five “valid” match arms we land in.
4. The return Err line. Returning in a hot loop is extra control flow, which is not ideal.
5. The call to decoded_len contains a match, which generates branches.
6. The call to Vec::extend_from_slice. This contains not just branches, but potential calls into the allocator. Extremely slow.
(5) is the easiest to deal with. The match is mapping the values 0, 1, 2, 3 to 0, 1, 1, 2. Call this function f. Then, the sequence given by x - f(x) is 0, 0, 1, 1. This just happens to equal x / 2 (or x >> 1), so we can write a completely branchless version of decoded_len like so.
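So a branchless sketch might be:

fn decoded_len(input: usize) -> usize {
    // rem - rem / 2 maps 0, 1, 2, 3 to 0, 1, 1, 2, replacing the match.
    let rem = input % 4;
    input / 4 * 3 + (rem - rem / 2)
}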
The others will not prove so easy. Let’s turn our attention to the innermost loop next, branches (2), (3), and (4).
The superpower of SIMD is that because you operate on so much data at a time, you can unroll the loop so hard it becomes branchless.
The insight is this: we want to load at most four bytes, do something to them, and then spit out at most three decoded bytes. While doing this operation, we may encounter a syntax error so we need to report that somehow.
Here’s some facts we can take advantage of.
* We don’t need to figure out how many bytes are in the “output” of the hot loop: our handy branchless decoded_len() does that for us.
* Invalid base64 is extremely rare. We want that syntax error to cost as little as possible. If the user still cares about which byte was the problem, they can scan the input for it after the fact.
* A is zero in base64. If we’re parsing a truncated chunk, padding it with A won’t change the value.
This suggests an interface for the body of the “hottest loop”. We can factor it out as a separate function, and simplify since we can assume our input is always four bytes now.
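Concretely, the factored-out function might look like this (my sketch of the interface being described, with a plain scalar body for now): it always takes four bytes, never returns early, and reports failure through a bool.

fn decode_hot(ascii: [u8; 4]) -> ([u8; 3], bool) {
    let mut ok = true;
    let mut bits = 0u32;

    for byte in ascii {
        let sextet = match byte {
            b'A'..=b'Z' => byte - b'A',
            b'a'..=b'z' => byte - b'a' + 26,
            b'0'..=b'9' => byte - b'0' + 52,
            b'+' => 62,
            b'/' => 63,
            // No early return: just record that something was wrong.
            _ => {
                ok = false;
                0
            }
        };
        bits = (bits << 6) | sextet as u32;
    }

    // 24 bits of payload, left-aligned; the caller slices off decoded_len bytes.
    let bytes = (bits << 8).to_be_bytes();
    ([bytes[0], bytes[1], bytes[2]], ok)
}

The caller pads a truncated chunk with b'A' before the call (harmless, as noted above) and checks ok once per chunk.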
You’re probably thinking: why not return Option? Returning an enum will make it messier to eliminate the if !ok branch later on (which we will!). We want to write branchless code, so let’s focus on finding a way of producing that three-byte output without needing to do early returns.
Now’s when we want to start talking about vectors rather than arrays, so let’s try to rewrite our function as such.
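The signature, at least, is easy to sketch (assuming nightly std::simd; the body is what the rest of the post builds up):

#![feature(portable_simd)]
use std::simd::Simd;

fn decode_hot(ascii: Simd<u8, 4>) -> (Simd<u8, 4>, bool) {
    todo!()
}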
Note that the output is now four bytes, not three. SIMD lane counts need to be powers of two, and that last element will never get looked at, so we don’t need to worry about what winds up there.
The callsite also needs to be tweaked, but only slightly, because Simd<u8, 4> implements From<[u8; 4]>.
Let’s look at the first part of the for byte in ascii loop. We need to map each lane of the Simd to the corresponding sextet, and somehow signal which ones are invalid. First, notice something special about the match: almost every arm can be written as byte - C for some constant C. The non-range case looks a little silly, but humor me:
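Something like this (my rendering, reusing the Error type from the scalar sketch; I’ve used wrapping arithmetic so that even the single-character arms read as byte minus a constant):

fn sextet_of(byte: u8) -> Result<u8, Error> {
    Ok(match byte {
        b'A'..=b'Z' => byte.wrapping_sub(b'A'),
        b'a'..=b'z' => byte.wrapping_sub(b'a' - 26),
        b'0'..=b'9' => byte.wrapping_sub(b'0'.wrapping_sub(52)),
        b'+' => byte.wrapping_sub(b'+'.wrapping_sub(62)),
        b'/' => byte.wrapping_sub(b'/'.wrapping_sub(63)),
        _ => return Err(Error),
    })
}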
So, it should be sufficient to build a vector offsets that contains the appropriate constant C for each lane, and then let sextets = ascii - offsets;
How can we build offsets? Using compare-and-select.
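A sketch with Rust’s nightly portable SIMD (the import paths and helper names are my assumptions; the constants are the same wrapping values as in the match above): one two-sided range check per character class, then a chain of selects.

#![feature(portable_simd)]
use std::simd::prelude::*;

type V = Simd<u8, 4>;

fn sextets(ascii: V) -> V {
    // One two-sided range check per character class: eight comparisons total.
    let upper = ascii.simd_ge(V::splat(b'A')) & ascii.simd_le(V::splat(b'Z'));
    let lower = ascii.simd_ge(V::splat(b'a')) & ascii.simd_le(V::splat(b'z'));
    let digit = ascii.simd_ge(V::splat(b'0')) & ascii.simd_le(V::splat(b'9'));
    let plus = ascii.simd_eq(V::splat(b'+'));
    let slash = ascii.simd_eq(V::splat(b'/'));

    // Pick the wrapping constant C for each lane; lanes holding invalid bytes
    // end up with a junk offset, which separate validation has to catch.
    let mut offsets = V::splat(0);
    offsets = upper.select(V::splat(b'A'), offsets);
    offsets = lower.select(V::splat(b'a' - 26), offsets);
    offsets = digit.select(V::splat(b'0'.wrapping_sub(52)), offsets);
    offsets = plus.select(V::splat(b'+'.wrapping_sub(62)), offsets);
    offsets = slash.select(V::splat(b'/'.wrapping_sub(63)), offsets);

    ascii - offsets
}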
This solution is quite elegant, and will produce very competitive code, but it’s not actually ideal. We need to do a lot of comparisons here: eight in total. We also keep lots of values alive at the same time, which might lead to unwanted register pressure.
Let’s look at the byte representations of the ranges. A-Z, a-z, and 0-9 are, as byte ranges, 0x41..0x5b, 0x61..0x7b, and 0x30..0x3a. Notice they all have different high nybbles! What’s more, + and / are 0x2b and 0x2f, so the function byte >> 4 is almost enough to distinguish all the ranges. If we subtract one if byte == b’/’, we have a perfect hash for the ranges.
In other words, the value (byte >> 4) - (byte == ‘/’) maps the ranges as follows:
* + goes to 2.
* / goes to 1.
* 0-9 goes to 3.
* A-Z goes to 4 or 5.
* a-z goes to 6 or 7.
This is small enough that we could cram a lookup table of values for building the offsets vector into another SIMD vector, and use a shuffle operation to do the lookup.
This is not my original idea; I came across a GitHub issue where an anonymous user points out this perfect hash.
Our new ascii-to-sextet code looks like this:
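Here is my sketch of that (again assuming nightly std::simd). To keep it self-contained I’ve written it at a lane width of 8, so the eight-entry offset table and the index vector are the same length; the post’s own code keeps ascii at four lanes, which is the wrinkle mentioned next.

#![feature(portable_simd)]
use std::simd::prelude::*;

type V = Simd<u8, 8>;

fn sextets(ascii: V) -> V {
    // Perfect hash: high nybble, minus one for '/'.
    let is_slash = ascii.simd_eq(V::splat(b'/'));
    let hashes = (ascii >> V::splat(4)) - is_slash.select(V::splat(1), V::splat(0));

    // Offset constants indexed by hash:
    // [unused, '/', '+', '0'-'9', 'A'-'O', 'P'-'Z', 'a'-'o', 'p'-'z'].
    let offsets = V::from_array([
        0,
        b'/'.wrapping_sub(63),
        b'+'.wrapping_sub(62),
        b'0'.wrapping_sub(52),
        b'A',
        b'A',
        b'a' - 26,
        b'a' - 26,
    ]);

    // Shuffle the table with the hashes, then subtract lane-wise.
    ascii - offsets.swizzle_dyn(hashes)
}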
There is a small wrinkle here: Simd::swizzle_dyn() requires that the index array be the same length as the lookup table. This is annoying because right now ascii is a Simd<u8, 4>, but that will not be the case later on, so I will simply sweep this under the rug.
Note that we no longer get validation as a side-effect of computing the sextets vector. The same GitHub issue also provides an exact bloom-filter for checking that a particular byte is valid; you can see my implementation here. I’m not sure how the OP constructed the bloom filter, but the search space is small enough that you could have written a little script to brute force it.
Now comes a much tricker operation: we need to somehow pack all four sextets into three bytes. One way to try to wrap our head around what the packing code in decode_hot() is doing is to pass in the all-ones sextet in one of the four bytes, and see where those ones end up in the return value.
This is not unlike how they use radioactive dyes in biology to track the movement of molecules or cells through an organism.
Bingo. Playing around with the inputs lets us verify which pieces of the bytes wind up where. For example, by passing 0b110000 as input[1], we see that the two high bits of input[1] correspond to the low bits of output[0]. I’ve written the code so that the bits in each byte are printed in little-endian order, so bits on the left are the low bits.
Putting this all together, we can draw a schematic of what this operation does to a general Simd<u8, 4>.
Now, there’s no single instruction that will do this for us. Shuffles can be used to move bytes around, but we’re dealing with pieces of bytes here. We also can’t really do a shift, since we need bits that are overshifted to move into adjacent lanes.
The trick is to just make the lanes bigger.
...
Read the original on mcyoung.xyz »
...
Read the original on www.theverge.com »
Lightweight, natively built with WebKit, made for you and your Mac.
Industry-leading battery life, privacy respecting by design, and native support for web extensions.
Get Orion+ and support the browser.
Orion is 100% funded by its users and nothing else.
Built on WebKit, Orion gives you a fast, smooth, and lightweight browsing experience, without holding your device’s battery hostage.
And with Orion’s deep integration with native technologies, like Keychain or Live Text, you’ll feel right at home while using it on macOS or iOS/iPadOS.
Orion offers native support for many Firefox and Chrome browser extensions, allowing access to the world’s largest ecosystem of browser extensions.
We’re still in the process of expanding our extension support to include all available options. Simultaneously, we’re working on bringing this feature to iOS, and Orion is the first browser that allows you to install select web extensions directly from the Chrome Web Store or Firefox Add-Ons on your iPhone or iPad.
Privacy by design, like no other browser.
Orion has been engineered from the ground up as a truly privacy-respecting browser. We did it by embracing a simple principle - Orion is a zero telemetry browser. Your private information will never leave Orion by default.
And to protect your privacy on the web, Orion comes with industry-leading anti-tracking technology as well as a powerful built-in ad-blocker.
Available for macOS and iOS/iPadOS.
Install the Orion app on your iPhone or iPad.
© Kagi - Humanize the Web. All Rights Reserved. WebKit and the WebKit logo are trademarks of Apple Inc.
...
Read the original on kagi.com »
The hanging-punctuation property in CSS is almost a no-brainer. The classic example is a blockquote that starts with a curly-quote. Hanging that opening curly-quote into the space off to the start of the text and aligning the actual words is a better look.
The blue line is just to help see the alignment.
It is a cascading property, so you can just do this if you like:
html {
hanging-punctuation: first last;
}
In case you go against the grain, for aesthetics, and align text the other way, the `last` value will hang punctuation off the other end as well. That’s what it’s supposed to do anyway, but in my testing (trying quotes and periods), Safari doesn’t support that. 🤷‍♀️
There is some risk to the property. Because the punctuation hangs off the edge, if you don’t have any available space, it can trigger a horizontal scroll bar, which sucks. This is probably why it’s not a default. It’s rare there is zero space on the edge of text, though, so meh.
Want it to work across all browsers? Use a negative text-indent instead. Then test for support and replace it.
blockquote {
  text-indent: -0.45em;
}

@supports (hanging-punctuation: first) {
  blockquote {
    text-indent: 0;
    hanging-punctuation: first;
  }
}
Having to use a magic number for the `text-indent` kinda sucks, so definitely isolate where you are applying it. Here’s a demo where a custom property is used instead to make it less weird:
By the way! For putting curly quotes on blockquote, might as well do that in CSS rather than in the content.
blockquote {
  &::before {
    content: open-quote;
  }
  &::after {
    content: close-quote;
  }
}
Hanging punctuation is relevant in design software and print design as well. I feel like any half-decent book typesetting will be doing this. Adobe InDesign calls it “Optical Margin Alignment”.
I think hanging-punctuation is nice! Just a nice bonus where supported and not a huge deal if it’s not. I’d probably start a new project with:
html {
hanging-punctuation: first allow-end last;
}
...
Read the original on chriscoyier.net »
...
Read the original on www.theverge.com »