SQLite reads and writes small blobs (for example, thumbnail images)
35% faster¹ than the same blobs can be read from or written to individual files on disk using fread() or fwrite().
Furthermore, a single SQLite database holding 10-kilobyte blobs uses about 20% less disk space than storing the blobs in individual files.
The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files. It appears that the overhead of calling open() and close() is greater than the overhead of using the database. The size reduction arises from the fact that individual files are padded out to the next multiple of the filesystem block size, whereas the blobs are packed more tightly into an SQLite database.
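The padding effect is easy to estimate. As a back-of-envelope sketch, consider a single 10,000-byte blob on a filesystem with 4096-byte blocks (the block size here is an assumption; substitute the value for your own filesystem):

```shell
# Estimate the filesystem padding overhead for one average-sized blob.
# Assumes 4096-byte filesystem blocks; adjust "block" for your system.
blob=10000        # average blob size in bytes
block=4096        # filesystem block size in bytes
padded=$(( (blob + block - 1) / block * block ))
overhead=$(( (padded - blob) * 100 / blob ))
echo "$padded bytes on disk, ${overhead}% overhead"
# prints: 12288 bytes on disk, 22% overhead
```

That roughly 20% per-file overhead matches the measured difference in total disk usage reported below.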
The measurements in this article were made during the week of 2017-06-05 using a version of SQLite in between 3.19.2 and 3.20.0. You may expect future versions of SQLite to perform even better.
¹The 35% figure above is approximate. Actual timings vary depending on hardware, operating system, and the details of the experiment, and due to random performance fluctuations on real-world hardware. See the text below for more detail. Try the experiments yourself. Report significant deviations on the SQLite forum.
The 35% figure is based on running tests on every machine that the author has easily at hand. Some reviewers of this article report that SQLite has higher latency than direct I/O on their systems. We do not yet understand the difference. We also see indications that SQLite does not perform as well as direct I/O when experiments are run using a cold filesystem cache.
So let your take-away be this: read/write latency for SQLite is competitive with read/write latency of individual files on disk. Often SQLite is faster. Sometimes SQLite is almost as fast. Either way, this article disproves the common assumption that a relational database must be slower than direct filesystem I/O.
A 2022 study
(alternative link on GitHub) found that SQLite is roughly twice as fast at real-world workloads compared to Btrfs and Ext4 on Linux.
Jim Gray
and others studied the read performance of BLOBs versus file I/O for Microsoft SQL Server and found that reading BLOBs out of the database was faster for blobs smaller than a threshold that fell somewhere between 250KiB and 1MiB. (Paper). In that study, the database still stores the filename of the content even if the content is held in a separate file. So the database is consulted for every BLOB, even if it is only to extract the filename. In this article, the key for the BLOB is the filename, so no preliminary database access is required. Because the database is never used at all when reading content from individual files in this article, the threshold at which direct file I/O becomes faster is smaller than it is in Gray’s paper.
The Internal Versus External BLOBs article on this website is an earlier investigation (circa 2011) that uses the same approach as the Jim Gray paper — storing the blob filenames as entries in the database — but for SQLite instead of SQL Server.
How These Measurements Are Made
I/O performance is measured using the
kvtest.c program from the SQLite source tree. To compile this test program, first gather the kvtest.c source file into a directory with the SQLite amalgamation source files “sqlite3.c” and “sqlite3.h”. Then on unix, run a command like the following:
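On Linux, the build command looks approximately like this (the exact flags vary by platform; -ldl and -lpthread are needed on most Linux systems, and the compile-time option shown is the one used for the measurements in this article):

```shell
gcc -Os -I. -DSQLITE_DIRECT_OVERFLOW_READ kvtest.c sqlite3.c -o kvtest -ldl -lpthread
```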
Or on Windows with MSVC:
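A sketch of the corresponding MSVC invocation, assuming a Visual Studio command-line environment is already set up:

```shell
cl -I. -DSQLITE_DIRECT_OVERFLOW_READ kvtest.c sqlite3.c
```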
Instructions for compiling for Android are shown below.
Use the resulting “kvtest” program to generate a test database with 100,000 random uncompressible blobs, each with a random size between 8,000 and 12,000 bytes using a command like this:
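The command looks approximately like this (option spellings follow kvtest’s built-in help; “100k” means 100,000, and blob sizes are given in bytes):

```shell
./kvtest init test1.db --count 100k --size 10k --variance 2k
```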
If desired, you can verify the new database by running this command:
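The verification step is something like:

```shell
./kvtest stat test1.db
```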
Next, make copies of all the blobs into individual files in a directory using a command like this:
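For example (the export subcommand writes each blob into the named directory, which must already exist):

```shell
mkdir test1.dir
./kvtest export test1.db test1.dir
```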
At this point, you can measure the amount of disk space used by the test1.db database and the space used by the test1.dir directory and all of its content. On a standard Ubuntu Linux desktop, the database file will be 1,024,512,000 bytes in size and the test1.dir directory will use 1,228,800,000 bytes of space (according to “du -k”), about 20% more than the database.
The “test1.dir” directory created above puts all the blobs into a single folder. It was conjectured that some operating systems would perform poorly when a single directory contains 100,000 objects. To test this, the kvtest program can also store the blobs in a hierarchy of folders with no more than 100 files and/or subdirectories per folder. The alternative on-disk representation of the blobs can be created using the --tree command-line option to the “export” command, like this:
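For example:

```shell
mkdir test1.tree
./kvtest export test1.db test1.tree --tree
```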
The test1.dir directory will contain 100,000 files with names like “000000”, “000001”, “000002” and so forth, but the test1.tree directory will contain the same files in subdirectories like “00/00/00”, “00/00/01”, and so on. The test1.dir and test1.tree directories take up approximately the same amount of space, though test1.tree is very slightly larger due to the extra directory entries.
All of the experiments that follow operate the same with either “test1.dir” or “test1.tree”. Very little performance difference is measured in either case, regardless of operating system.
Measure the performance for reading blobs from the database and from individual files using these commands:
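The commands look approximately like this (one run against the database, one against each directory layout):

```shell
./kvtest run test1.db --count 100k --blob-api
./kvtest run test1.dir --count 100k
./kvtest run test1.tree --count 100k
```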
Depending on your hardware and operating system, you should see that reads from the test1.db database file are about 35% faster than reads from individual files in the test1.dir or test1.tree folders. Results can vary significantly from one run to the next due to caching, so it is advisable to run tests multiple times and take an average or a worst case or a best case, depending on your requirements.
The --blob-api option on the database read test causes kvtest to use the sqlite3_blob_read() feature of SQLite to load the content of the blobs, rather than running pure SQL statements. This helps SQLite run a little faster on read tests. You can omit that option to compare the performance of SQLite running SQL statements. In that case, SQLite still out-performs direct reads, though by not as much as when using sqlite3_blob_read(). The --blob-api option is ignored for tests that read from individual disk files.
Measure write performance by adding the --update option. This causes each blob to be overwritten in place with a different random blob of exactly the same size.
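For example:

```shell
./kvtest run test1.db --count 10k --update
./kvtest run test1.dir --count 10k --update
```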
The writing test above is not completely fair, since SQLite is doing power-safe transactions whereas the direct-to-disk writing is not. To put the tests on a more equal footing, either add the --nosync option to the SQLite writes to disable calling fsync() or FlushFileBuffers() to force content to disk, or add the --fsync option to the direct-to-disk tests to force them to invoke fsync() or FlushFileBuffers() when updating disk files.
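For example:

```shell
./kvtest run test1.db --count 10k --update --nosync
./kvtest run test1.dir --count 10k --update --fsync
```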
By default, kvtest runs the database I/O measurements all within a single transaction. Use the --multitrans option to run each blob read or write in a separate transaction. The --multitrans option makes SQLite much slower, and uncompetitive with direct disk I/O. This option proves, yet again, that to get the most performance out of SQLite, you should group as much database interaction as possible within a single transaction.
There are many other testing options, which can be seen by running the command:
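```shell
./kvtest help
```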
The chart below shows data collected using
kvtest.c on five different systems:
All machines use SSD except Win7 which has a hard-drive. The test database is 100K blobs with sizes uniformly distributed between 8K and 12K, for a total of about 1 gigabyte of content. The database page size is 4KiB. The -DSQLITE_DIRECT_OVERFLOW_READ compile-time option was used for all of these tests. Tests were run multiple times. The first run was used to warm up the cache and its timings were discarded.
The chart below shows average time to read a blob directly from the filesystem versus the time needed to read the same blob from the SQLite database. The actual timings vary considerably from one system to another (the Ubuntu desktop is much faster than the Galaxy S3 phone, for example). This chart shows the ratio of the time needed to read blobs from a file divided by the time needed to read the same blobs from the database. The left-most column in the chart is the normalized time to read from the database, for reference.
In this chart, an SQL statement (“SELECT v FROM kv WHERE k=?1”) is prepared once. Then for each blob, the blob key value is bound to the ?1 parameter and the statement is evaluated to extract the blob content.
The chart shows that on Windows 10, content can be read from the SQLite database about 5 times faster than it can be read directly from disk. On Android, SQLite is only about 35% faster than reading from disk.
Chart 1: SQLite read latency relative to direct filesystem reads.
100K blobs, avg 10KB each, random order using SQL
The performance can be improved slightly by bypassing the SQL layer and reading the blob content directly using the
sqlite3_blob_read() interface, as shown in the next chart:
Further performance improvements can be made by using the
memory-mapped I/O feature of SQLite. In the next chart, the entire 1GB database file is memory mapped and blobs are read (in random order) using the sqlite3_blob_read() interface. With these optimizations, SQLite is twice as fast as direct I/O on Android or MacOS-X and over 10 times faster than direct I/O on Windows.
The third chart shows that reading blob content out of SQLite can be twice as fast as reading from individual files on disk for Mac and Android, and an amazing ten times faster for Windows.
Writes are slower. On all systems, using both direct I/O and SQLite, write performance is between 5 and 15 times slower than reads.
Write performance measurements were made by replacing (overwriting) an entire blob with a different blob. All of the blobs in this experiment are random and incompressible. Because writes are so much slower than reads, only 10,000 of the 100,000 blobs in the database are replaced. The blobs to be replaced are selected at random and are in no particular order.
The direct-to-disk writes are accomplished using fopen()/fwrite()/fclose(). By default, and in all the results shown below, the OS filesystem buffers are never flushed to persistent storage using fsync() or FlushFileBuffers(). In other words, there is no attempt to make the direct-to-disk writes transactional or power-safe. We found that invoking fsync() or FlushFileBuffers() on each file written causes direct-to-disk storage to be about 10 times or more slower than writes to SQLite.
The next chart compares SQLite database updates in WAL mode
against raw direct-to-disk overwrites of separate files on disk. The PRAGMA synchronous setting is NORMAL. All database writes are in a single transaction. The timer for the database writes is stopped after the transaction commits, but before a checkpoint is run. Note that the SQLite writes, unlike the direct-to-disk writes, are transactional and power-safe, though because the synchronous setting is NORMAL instead of FULL, the transactions are not durable.
Chart 4: SQLite write latency relative to direct filesystem writes.
10K blobs, avg size 10KB, random order,
WAL mode with synchronous NORMAL,
exclusive of checkpoint time
The Android performance numbers for the write experiments are omitted because the performance tests on the Galaxy S3 were too noisy: two consecutive runs of the exact same experiment would give wildly different times. And, to be fair, the performance of SQLite writes on Android is slightly slower than writing directly to disk.
The next chart shows the performance of SQLite versus direct-to-disk when transactions are disabled (PRAGMA journal_mode=OFF) and PRAGMA synchronous is set to OFF. These settings put SQLite on an equal footing with direct-to-disk writes, which is to say they make the data prone to corruption due to system crashes and power failures.
In all of the write tests, it is important to disable anti-virus software prior to running the direct-to-disk performance tests. We found that anti-virus software slows down direct-to-disk by an order of magnitude whereas it impacts SQLite writes very little. This is probably due to the fact that direct-to-disk changes thousands of separate files which all need to be checked by anti-virus, whereas SQLite writes only changes the single database file.
The -DSQLITE_DIRECT_OVERFLOW_READ compile-time option causes SQLite to bypass its page cache when reading content from overflow pages. This helps database reads of 10K blobs run a little faster, but not all that much faster. SQLite still holds a speed advantage over direct filesystem reads without the SQLITE_DIRECT_OVERFLOW_READ compile-time option.
Other compile-time options such as using -O3 instead of -Os or using -DSQLITE_THREADSAFE=0 and/or some of the other
recommended compile-time options might help SQLite to run even faster relative to direct filesystem reads.
The size of the blobs in the test data affects performance. The filesystem will generally be faster for larger blobs, since the overhead of open() and close() is amortized over more bytes of I/O, whereas the database will be more efficient in both speed and space as the average blob size decreases.
SQLite is competitive with, and usually faster than, blobs stored in separate files on disk, for both reading and writing.
SQLite is much faster than direct writes to disk on Windows when anti-virus protection is turned on. Since anti-virus software is and should be on by default in Windows, that means that SQLite is generally much faster than direct disk writes on Windows.
Reading is about an order of magnitude faster than writing, for all systems and for both SQLite and direct-to-disk I/O.
I/O performance varies widely depending on operating system and hardware. Make your own measurements before drawing conclusions.
Some other SQL database engines advise developers to store blobs in separate files and then store the filename in the database. In that case, where the database must first be consulted to find the filename before opening and reading the file, simply storing the entire blob in the database gives much faster read and write performance with SQLite. See the Internal Versus External BLOBs article for more information.
The kvtest program is compiled and run on Android as follows. First install the Android SDK and NDK. Then prepare a script named “android-gcc” that looks approximately like this:
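The script looks approximately like this (all paths are placeholders; point NDK, SYS, and TOOL at the corresponding locations in your own NDK installation, and adjust the target architecture as needed):

```shell
#!/bin/sh
#
# Cross-compiler wrapper for building Android binaries with the NDK.
# The paths below are examples only; substitute your own NDK paths.
NDK=$HOME/android/ndk
SYS=$NDK/platforms/android-16/arch-arm
TOOL=$NDK/toolchains/arm-linux-androideabi-4.9/prebuilt/linux-x86_64/bin
exec $TOOL/arm-linux-androideabi-gcc --sysroot=$SYS "$@"
```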
Make that script executable and put it on your $PATH. Then compile the kvtest program as follows:
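The compile command mirrors the Unix build, substituting the wrapper script for gcc:

```shell
android-gcc -Os -I. kvtest.c sqlite3.c -o kvtest-android
```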
Next, move the resulting kvtest-android executable to the Android device:
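For example, using adb:

```shell
adb push kvtest-android /data/local/tmp
```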
Finally use “adb shell” to get a shell prompt on the Android device, cd into the /data/local/tmp directory, and begin running the tests as with any other unix host.
This page last modified on 2023-12-05 14:43:20 UTC
Read the original on sqlite.org »
In June, Apple announced a new product called Apple Intelligence. It’s being sold as a new suite of features for the iPhone, iPad, and Mac that will use artificial intelligence to help you write and edit emails, create new pictures and emojis, and generally accomplish all kinds of tasks. There’s just one problem if you’re a European user eager to get your hands on it: Apple won’t be releasing it in Europe.
The company said in a statement that an entire suite of new products and features including Apple Intelligence, SharePlay screen sharing, and iPhone screen mirroring would not be released in European Union countries because of the regulatory requirements imposed by the EU’s Digital Markets Act (DMA). European Commission Executive Vice President Margrethe Vestager called the decision a “stunning declaration” of anti-competitive behavior.
Vestager’s statement is ridiculous on its face: A tech giant choosing not to release a product invites more competition, not less, and more importantly, this is exactly what you’d expect to happen given Europe’s regulatory stance.
The economist Albert Hirschman once described the two options in an unfavorable environment as “voice” and “exit.” The most common option is voice—attempt to negotiate, repair the situation, and communicate toward better conditions. But the more drastic option is exit—choosing to leave the unfavorable environment entirely. That’s more common for people or political movements, but it’s growing increasingly relevant to technology in Europe.
Apple’s decision isn’t the first time that poorly designed regulations have pushed tech companies to block features or services in specific countries. Last year, Facebook removed all news content in Canada in response to the country’s Online News Act, which resulted in smaller news outlets losing business. In 2014, Google News withdrew from Spain over a “link tax,” causing lower traffic for Spanish news sites, returning only when the law was changed. Numerous technology firms have left China due to the power the Chinese Communist Party exerts over foreign corporations.
Adult sites are blocking users in a variety of U.S. states over age verification laws. Meta delayed the EU rollout of its Twitter (now X) competitor Threads over regulatory concerns, though it did eventually launch there. The firm, in a move that mirrors the Apple Intelligence decision, has also declined to release its cutting-edge Llama AI models in the EU, citing “regulatory uncertainty.” Technology companies have traditionally invested large amounts of money in voice strategies, lobbying officials and trying to improve poorly written laws. But they are increasingly aware of their ability to exit, especially in the European context. And Europe’s regulatory approach risks creating a balkanized “splinternet,” where international tech giants may choose to withdraw from the European continent.
If that seems far-fetched, consider other recent cases. Europe recently charged Meta with breaching EU regulations over its “pay or consent” plan. Meta’s business is built around personalized ads, which are worth far more than non-personalized ads. EU regulators required that Meta provide an option that did not involve tracking user data, so Meta created a paid model that would allow users to pay a fee for an ad-free service.
This was already a significant concession—personalized ads are so valuable that one analyst estimated paid users would bring in 60 percent less revenue. But EU regulators are now insisting this model also breaches the rules, saying that Meta fails to provide a less personalized but equivalent version of Meta’s social networks. They’re demanding that Meta provide free full services without personalized ads or a monthly fee for users. In a very real sense, the EU has ruled that Meta’s core business model is illegal. Non-personalized ads cannot economically sustain Meta’s services, but it’s the only solution EU regulators want to accept.
Or consider the recent charges the EU levied against X. Under Elon Musk’s ownership, anyone can now purchase a blue check with a paid subscription, whereas blue checks were previously reserved for notable figures. EU regulators singled out the new system for blue checks as a deceptive business practice that violates the bloc’s Digital Services Act.
These charges are absurd. For one, the change in the blue check system was widely advertised and dominated headlines for months—as well as dominating discussion on the site itself. The idea that users have been deceived by one of the loudest and most discussed product changes in the site’s history is silly. And beyond that, the EU’s position is essentially “X cannot change the meaning of the blue check feature—it is permanently bound to the EU’s interpretation of what a blue check should mean.” This goes far beyond competition or privacy concerns; this is the EU straightforwardly making product decisions on behalf of a company.
A final example comes from France, where regulators are preparing to charge Nvidia with anti-competitive practices related to its CUDA software. CUDA is a free software system developed by Nvidia to run on its chips that allows other programs to more efficiently utilize GPUs in calculations. It’s one of the main reasons Nvidia has been so successful—the software makes its chips more powerful, and no competitor has developed comparable technology. It’s exactly the kind of innovative research that should be rewarded, but French regulators seem to view Nvidia’s decades-long investment in CUDA as a crime.
These examples all share a few key features. They’re all actions aimed at successful foreign tech companies—not surprising since the EU’s rules all but ensure there are no comparably successful European companies. They’re all instances of regulatory overreach, where the EU is trying to dictate product decisions or rule entire business strategies illegal. And crucially, the sizes of the possible fines in play are so large that they may end up scaring companies off the continent.
EU policy allows for fines of up to 10 percent of global revenue. Analyst Ben Thompson reports that Meta only gets 10 percent of its revenue from the EU and Apple only 7 percent. Nvidia does not provide exact regional numbers, but it’s likely that the EU provides less than 10 percent of its revenue as well. And this is revenue, not profit. A single fine of that magnitude would be more profit than these companies make in the EU in several years and destroy the economic rationale for operating there. With global-sized punishments for inane local issues, Europe is much closer than it realizes to simply driving tech companies away.
Europe’s regulators may insist that if companies simply followed the rules, they’d be able to make their profits without the threat of fines. This is patently untrue in the case of Meta, where the EU has ruled out every practical business strategy for funding its operations. But it’s also impossible writ large because the EU often doesn’t write clear rules in advance. Instead, the DMA requires businesses to meet abstract goals, and regulators decide afterward whether the company is in compliance or not. The burden does not exist on the EU to write concrete rules with specific requirements but on the companies to read the regulatory tea leaves and determine what steps to take. It’s an arbitrary and poorly designed system, and companies can hardly be blamed for looking to the exit.
Ultimately, Europe needs to figure out what it wants from the world’s technology industry. At times, it seems as if Europe has given up on trying to innovate or succeed in the tech sector. The continent takes more pride in being a leader in regulation than a leader in innovation, and its tech industry is a rounding error compared with that in the United States or China.
What few success stories it has, such as France’s Mistral, risk being strangled by regulatory actions. How would Mistral, a leading AI firm, survive if Nvidia exits the French market due to regulatory concerns? There is no substitute for Nvidia’s cutting-edge chips.
Europeans could end up living in an online backwater with out-of-date phones, cut off from the rest of the world’s search engines and social media sites, unable to even access high-performance computer chips.
As a sovereign body, the EU is within its rights to legislate tech as arbitrarily and harshly as it would like. But politicians such as Vestager don’t get to then act shocked and outraged when tech companies choose to leave. Right now, most tech companies are still attempting to work within the system and make Europe’s regulations more rational. But if voice fails over and over, exit is all that’s left. And in Europe, it’s an increasingly rational choice.
Read the original on foreignpolicy.com »
Avery Pennarun is the CEO and co-founder of Tailscale. A version of this post was originally presented at a company all-hands.
We don’t talk a lot in public about the big vision for Tailscale, why we’re really here. Usually I prefer to focus on what exists right now, and what we’re going to do in the next few months. The future can be distracting.
But increasingly, I’ve found companies are starting to buy Tailscale not just for what it does now, but for the big things they expect it’ll do in the future. They’re right! Let’s look at the biggest of big pictures for a change.
But first, let’s go back to where we started.
David Crawshaw’s first post that laid out what we were doing, long long ago in the late twenty-teens, was called Remembering the LAN, about his experience doing networking back in the 1990s.
I have bad news: if you remember doing LANs back in the 1990s, you are probably old. Quite a few of us here at Tailscale remember doing LANs in the 1990s. That’s an age gap compared to a lot of other startups. That age gap makes Tailscale unusual.
Anything unusual about a startup can be an advantage or a disadvantage, depending what you do with it.
Here’s another word for “old” but with different connotations.
I’m a person that likes looking on the bright side. There are disadvantages to being old, like I maybe can’t do a 40-hour coding binge like I used to when I wrote my first VPN, called Tunnel Vision, in 1997. But there are advantages, like maybe we have enough experience to do things right the first time, in fewer hours. Sometimes. If we’re lucky.
And maybe, you know, if you’re old enough, you’ve seen the tech cycle go round a few times and you’re starting to see a few patterns.
That was us, me and the Davids, when we started Tailscale. What we saw was, a lot of things have gotten better since the 1990s. Computers are literally millions of times faster. 100x as many people can be programmers now because they aren’t stuck with just C++ and assembly language, and many, many, many more people now have some kind of computer. Plus app stores, payment systems, graphics. All good stuff.
But, also things have gotten worse. A lot of day-to-day things that used to be easy for developers, are now hard. That was unexpected. I didn’t expect that. I expected I’d be out of a job by now because programming would be so easy.
Instead, the tech industry has evolved into an absolute mess. And it’s getting worse instead of better! Our tower of complexity is now so tall that we seriously consider slathering LLMs on top to write the incomprehensible code in the incomprehensible frameworks so we don’t have to.
And you know, we old people are the ones who have the context to see that.
It’s all fixable. It doesn’t have to be this way.
Before I can tell you a vision for the future I have to tell you what I think went wrong.
Programmers today are impatient for success. They start planning for a billion users before they write their first line of code. In fact, nowadays, we train them to do this without even knowing they’re doing it. Everything they’ve ever been taught revolves around scaling.
We’ve been falling into this trap all the way back to when computer scientists started teaching big-O notation. In big-O notation, if you use it wrong, a hash table is supposedly faster than an array, for virtually anything you want to do. But in reality, that’s not always true. When you have a billion entries, maybe a hash table is faster. But when you have 10 entries, it almost never is.
People have a hard time with this idea. They keep picking the algorithms and architectures that can scale up, even when if you don’t scale up, a different thing would be thousands of times faster, and also easier to build and run.
Even I can barely believe I just said thousands of times easier and I wasn’t exaggerating.
I read a post recently where someone bragged about using kubernetes to scale all the way up to 500,000 page views per month. But that’s 0.2 requests per second. I could serve that from my phone, on battery power, and it would spend most of its time asleep.
In modern computing, we tolerate long builds, and then docker builds, and uploading to container stores, and multi-minute deploy times before the program runs, and even longer times before the log output gets uploaded to somewhere you can see it, all because we’ve been tricked into this idea that everything has to scale. People get excited about deploying to the latest upstart container hosting service because it only takes tens of seconds to roll out, instead of minutes. But on my slow computer in the 1990s, I could run a perl or python program that started in milliseconds and served way more than 0.2 requests per second, and printed logs to stderr right away so I could edit-run-debug over and over again, multiple times per minute.
How did we get here?
We got here because sometimes, someone really does need to write a program that has to scale to thousands or millions of backends, so it needs all that… stuff. And wishful thinking makes people imagine even the lowliest dashboard could be that popular one day.
The truth is, most things don’t scale, and never need to. We made Tailscale for those things, so you can spend your time scaling the things that really need it. The long tail of jobs that are 90% of what every developer spends their time on. Even developers at companies that make stuff that scales to billions of users, spend most of their time on stuff that doesn’t, like dashboards and meme generators.
As an industry, we’ve spent all our time making the hard things possible, and none of our time making the easy things easy.
Programmers are all stuck in the mud. Just listen to any professional developer, and ask what percentage of their time is spent actually solving the problem they set out to work on, and how much is spent on junky overhead.
It’s true here too. Our developer experience at Tailscale is better than average. But even we have largely the same experience. Modern software development is mostly junky overhead.
In fact, we didn’t found Tailscale to be a networking company. Networking didn’t come into it much at all at first.
What really happened was, me and the Davids got together and we said, look. The problem is developers keep scaling things they don’t need to scale, and their lives suck as a result. (For most programmers you can imagine the “wiping your tears with a handful of dollar bills” meme here.) We need to fix that. But how?
We looked at a lot of options, and talked to a lot of people, and there was an underlying cause for all the problems. The Internet. Things used to be simple. Remember the LAN? But then we connected our LANs to the Internet, and there’s been more and more firewalls and attackers everywhere, and things have slowly been degrading ever since.
When we explore the world of over-complexity, most of it has what we might call, no essential complexity. That is, the problems can be solved without complexity, but for some reason the solutions we use are complicated anyway. For example, logging systems. They just stream text from one place to another, but somehow it takes 5 minutes to show up. Or orchestration systems: they’re programs whose only job is to run other programs, which Unix kernels have done just fine, within milliseconds, for decades. People layer on piles of goop. But the goop can be removed.
You can’t build modern software without networking. But the Internet makes everything hard. Is it because networking has essential complexity?
Well, maybe. But maybe it’s only complex because we built it on top of the wrong assumptions, which result in the wrong problems, which we then have to paper over. That’s the Old Internet.
Instead of adding more layers at the very top of the OSI stack to try to hide the problems, Tailscale is building a new OSI layer 3 — a New Internet — on top of new assumptions that avoid the problems in the first place.
If we fix the Internet, a whole chain of dominoes can come falling down, and we reach the next stage of technology evolution.
If you want to know the bottleneck in any particular economic system, look for who gets to charge rent. In the tech world, that’s AWS. Sure, Apple’s there selling popular laptops, but you could buy a different laptop or a different phone. And Microsoft was the gatekeeper for everything, once, but you don’t have Windows lock-in anymore, unless you choose to. All those “the web is the new operating system” people of the early 2000s finally won, we just forgot to celebrate.
But the liberation didn’t last long. If you deploy software, you probably pay rent to AWS.
Why is that? Compute, right? AWS provides scalable computing resources.
Well, you’d think so. But lots of people sell computing resources way cheaper. Even a mid-range MacBook can do 10x or 100x more transactions per second on its SSD than a supposedly fast cloud local disk, because cloud providers sell that disk to 10 or 100 people at once while charging you full price. Why would you pay exorbitant fees instead of hosting your mission-critical website on your super fast MacBook?
We all know why:
Location, location, location. You pay exorbitant rents to cloud providers for their computing power because your own computer isn’t in the right place to be a decent server.
It’s behind a firewall and a NAT and a dynamic IP address and probably an asymmetric network link that drops out just often enough to make you nervous.
You could fix the network link. You could reconfigure the firewall, and port forward through the NAT, I guess, and if you’re lucky you could pay your ISP an exorbitant rate for a static IP, and maybe get a redundant Internet link, and I know some of my coworkers actually did do all that stuff on a rack in their garage. But it’s all a lot of work, and requires expertise, and it’s far away from building the stupid dashboard or blog or cat video website you wanted to build in the first place. It’s so much easier to just pay a hosting provider who has all the IP addresses and network bandwidth money can buy.
And then, if you’re going to pay someone, and you’re a serious company, you’d better buy it from someone serious, because now you have to host your stuff on their equipment which means they have access to… everything, so you need to trust them not to misuse that access.
You know what, nobody ever got fired for buying AWS.
That’s an IBM analogy. We used to say, nobody ever got fired for buying IBM. I doubt that’s true anymore. Why not?
IBM mainframes still exist, and they probably always will, but IBM used to be able to charge rent on every aspect of business computing, and now they can’t. They started losing influence when Microsoft arrived, stealing fire from the gods of centralized computing and bringing it back to individuals using comparatively tiny underpowered PCs on every desk, in every home, running Microsoft software.
I credit Microsoft with building the first widespread distributed computing systems, even though all the early networks were some variant of sneakernet.
I think we can agree that we’re now in a post-Microsoft, web-first world. Neat. Is this world a centralized one like IBM, or a distributed one like Microsoft?
[When I did this as a talk, I took a poll: it was about 50/50]
So, bad news. The pendulum has swung back the other way. IBM was centralized, then Microsoft was distributed, and now the cloud+phone world is centralized again.
We’ve built a giant centralized computer system, with a few megaproviders in the middle, and a bunch of dumb terminals on our desks and in our pockets. The dumb terminals, even our smart watches, are all supercomputers by the standards of 20 years ago, if we used them that way. But they’re not much better than a VT100. Turn off AWS, and they’re all bricks.
It’s easy to fool ourselves into thinking the overall system is distributed. Yes, we build fancy distributed consensus systems and our servers have multiple instances. But all that runs centrally on cloud providers.
This isn’t new. IBM was doing multi-core computing and virtual machines back in the 1960s. It’s the same thing over again now, just with 50 years of Moore’s Law on top. We still have a big monopoly that gets to charge everyone rent because they’re the gatekeeper over the only thing that really matters.
Everyone’s attitude is still stuck in the 1990s, when operating systems mattered. That’s how Microsoft stole the fire from IBM and ruled the world, because writing portable software was so hard that if you wanted to… interconnect… one program to another, if you wanted things to be compatible at all, you had to run them on the same computer, which meant you had to standardize the operating system, and that operating system was DOS, and then Windows.
The web undid that monopoly. Now javascript matters more than all the operating systems put together, and there’s a new element that controls whether two programs can talk to each other: HTTPS. If you can HTTPS from one thing to another, you can interconnect. If you can’t, forget it.
And HTTPS is fundamentally a centralized system. It has a client, and a server. A dumb terminal, and a thing that does the work. The server has a static IP address, a DNS name, a TLS certificate, and an open port. A client has none of those things. A server can keep doing whatever it wants if all the clients go away, but if the servers go away, a client does nothing.
We didn’t get here on purpose, mostly. It was just path dependence. We had security problems and an IPv4 address shortage, so we added firewalls and NATs, so connections became one way from client machines to server machines, and so there was no point putting certificates on clients, and nowadays there are 10 different reasons a client can’t be a server, and everyone is used to it, so we design everything around it. Dumb terminals and centralized servers.
Once that happened, of course some company popped up to own the center of the hub-and-spoke network. AWS does that center better than everyone else, fair and square. Someone had to. They won.
Okay, fast forward. We’ve spent the last 5 years making Tailscale the solution to that problem. Every device gets a cert. Every device gets an IP address and a DNS name and end-to-end encryption and an identity, and safely bypasses firewalls. Every device can be a peer. And we do it all without adding any latency or overhead.
That’s the New Internet. We built it! It’s the future, it’s just unevenly distributed, so far. For people with Tailscale, we’ve already sliced out 10 layers of nonsense. That’s why developers react so viscerally once they get it. Tailscale makes the Internet work how you thought the Internet worked, before you learned how the Internet works.
I like to use Taildrop as an example of what that makes possible. Taildrop is a little feature we spent a few months on back when we were tiny. We should spend more time polishing to make it even easier to use. But at its core, it’s a demo app. As long as you have Tailscale already, Taildrop is just one HTTP PUT operation. The sender makes an HTTP request to the receiver, says “here’s a file named X”, and sends the file. That’s it. It’s the most obvious thing in the world. Why would you do it any other way?
Well, before Tailscale, you didn’t have a choice. The receiver is another client device, not a server. It’s behind a firewall, with no open ports and no identity. Your only option was to upload the file to the cloud and then download it again, even if the sender and receiver are side by side on the same wifi. But that means you pay cloud fees for network egress, and storage, and the CPU time for running whatever server program is managing all that stuff. And if you upload the file and nobody downloads it, you need a rule for when to delete it from storage. You also pay fees just to keep the server online, even when you’re not using it at all. Also, cloud employees can theoretically access the file unless you encrypt it. But you can’t encrypt it without somehow exchanging encryption keys between sender and recipient. And how does the receiver even know a file is there waiting to be received in the first place? Do we need a push notification system? For every client platform? And so on. Layers, and layers, and layers of gunk.
And all that gunk means rent to cloud providers. Transferring files — one of the first things people did on the Internet, for no extra charge, via FTP — now has to cost money, because somebody has got to pay that rent.
With Taildrop, it doesn’t cost money. Not because we’re generously draining our bank accounts to make file transfers free. It’s because the cost overhead is gone altogether, because it’s not built on the same devolved Internet everyone else has been using.
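For the curious, the single-PUT flow really can be sketched with nothing but a standard HTTP library. Everything below is illustrative, not Tailscale’s actual code: the loopback address stands in for a peer’s Tailscale address, and the real Taildrop gets its identity checks and encryption from the network layer underneath.

```python
# Toy Taildrop: the whole transfer is one HTTP PUT, sender to receiver.
# Illustrative only -- 127.0.0.1 stands in for a peer's Tailscale
# address; real Taildrop adds identity and encryption below HTTP.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Receiver(BaseHTTPRequestHandler):
    def do_PUT(self):
        # "Here's a file named X" -- the URL path carries the name.
        name = self.path.lstrip("/")
        body = self.rfile.read(int(self.headers["Content-Length"]))
        with open("received_" + name, "wb") as f:
            f.write(body)
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), Receiver)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The sender's entire job: one PUT with the file's bytes.
req = urllib.request.Request(
    f"http://127.0.0.1:{port}/hello.txt", data=b"hi there", method="PUT")
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # 200 -- the receiver has written the file
server.shutdown()
```

No queue, no storage bucket, no notification service: when both peers are addressable, the protocol collapses to a single request.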
Taildrop is just an example, a trivial one, but it’s an existence proof for a whole class of programs that can be 10x easier just because Tailscale exists.
The chain of dominoes starts with connectivity. Lack of connectivity is why we get centralization, and centralization is why we pay rent for every tiny little program we want to run and why everything is slow and tedious and complicated and hard to debug like an IBM batch job. And we’re about to start those dominoes falling.
The glimpse at these possibilities is why our users get excited about Tailscale, more than they’ve ever been excited about some VPN or proxy, because there’s something underneath our kind of VPN that you can’t get anywhere else. We’re removing layers, and layers, and layers of complexity, and making it easier to work on what you wanted to work on in the first place. Not everybody sees it yet, but they will. And when they do, they’re going to be able to invent things we could never imagine in the old centralized world, just like the Windows era of distributed computing made things possible that were unthinkable on a mainframe.
But there’s one catch. If we’re going to untangle the hairball of connectivity, that connectivity has to apply to…
There’s going to be a new world of haves and have-nots. Where in 1970 you had or didn’t have a mainframe, and in 1995 you had or didn’t have the Internet, and today you have or don’t have a TLS cert, tomorrow you’ll have or not have Tailscale. And if you don’t, you won’t be able to run apps that only work in a post-Tailscale world.
And if not enough people have Tailscale, nobody will build those apps. That’s called a chicken-and-egg problem.
This is why our company strategy sounds so odd at first glance. It’s why we spend so much effort giving Tailscale away for free, but also so much effort getting people to bring it to work, and so much effort doing tangential enterprise features so executives can easily roll it out to whole Fortune 500 companies.
The Internet is for everyone. You know, there were internetworks (lowercase) before the Internet (capitalized). They all lost, because the Internet was the most diverse and inclusive of all. To the people building the Internet, nothing mattered but getting everyone connected. Adoption was slow at first, then fast, then really fast, and today, if I buy a wristwatch and it doesn’t have an Internet link, it’s broken.
We won’t have built a New Internet if nerds at home can’t play with it. Or nerds at universities. Or employees at enterprises. Or, you know, eventually every person everywhere.
There remain a lot of steps between here and there. But, let’s save those details for another time. Meanwhile, how are we doing?
Well, about 1 in 20,000 people in the world uses the New Internet (that’s Tailscale). We’re not going to stop until it’s all of them.
I’m old enough to remember when people made fun of Microsoft for their thing about putting a computer on every desk. Or when TCP/IP was an optional add-on you had to buy from a third party.
You know, all that was less than 30 years ago. I’m old, but come to think of it, I’m not that old. The tech world changes fast. It can change for the better. We’re just getting started.
...
Read the original on tailscale.com »
The Fourth Amendment still applies at the border, despite the feds’ insistence that it doesn’t.
For years, courts have ruled that the government has the right to conduct routine, warrantless searches for contraband at the border. Customs and Border Protection (CBP) has taken advantage of that loophole in the Fourth Amendment’s protection against unreasonable searches and seizures to force travelers to hand over data from their phones and laptops.
But on Wednesday, Judge Nina Morrison in the Eastern District of New York ruled that cellphone searches are a “nonroutine” search, more akin to a strip search than scanning a suitcase or passing a traveler through a metal detector.
Although the interests of stopping contraband are “undoubtedly served when the government searches the luggage or pockets of a person crossing the border carrying objects that can only be introduced to this country by being physically moved across its borders, the extent to which those interests are served when the government searches data stored on a person’s cell phone is far less clear,” the judge declared.
Morrison noted that “reviewing the information in a person’s cell phone is the best approximation government officials have for mindreading,” so searching through cellphone data has an even heavier privacy impact than rummaging through physical possessions. Therefore, the court ruled, a cellphone search at the border requires both probable cause and a warrant. Morrison did not distinguish between scanning a phone’s contents with special software and manually flipping through it.
And in a victory for journalists, the judge specifically acknowledged the First Amendment implications of cellphone searches too. She cited reporting by The Intercept and VICE about CBP searching journalists’ cellphones “based on these journalists’ ongoing coverage of politically sensitive issues” and warned that those phone searches could put confidential sources at risk.
Wednesday’s ruling adds to a stream of cases restricting the feds’ ability to search travelers’ electronics. The 4th and 9th Circuits, which cover the mid-Atlantic and Western states, have ruled that border police need at least “reasonable suspicion” of a crime to search cellphones. Last year, a judge in the Southern District of New York also ruled that the government “may not copy and search an American citizen’s cell phone at the border without a warrant absent exigent circumstances.”
Wednesday’s ruling involves defending the rights of an unsympathetic character. U.S. citizen Kurbonali Sultanov allegedly downloaded a sketchy Russian porn trove, including several images of child sex abuse, which landed him on a government watch list. When Sultanov was on the way back from visiting his family in Uzbekistan, agents from the Department of Homeland Security pulled him aside at the airport and searched his phone, finding the images.
Morrison suppressed the evidence from the phone search but not Sultanov’s “spontaneous” statement admitting to downloading the videos. And her order would not have prevented the police from getting Sultanov’s phone the old-fashioned way. Sultanov had allegedly downloaded the porn while in the United States, and his name popped up on the watch list two months before his return flight. And, in fact, the feds did obtain a court order to search Sultanov’s spare phone.
The Southern District of New York ruling last year also involved an unsympathetic character. Jatiek Smith, a member of the Bloods gang, was being investigated for a “violent and extortionate takeover” of New York’s fire mitigation industry. When Smith flew home from a vacation in Jamaica, the FBI took advantage of the opportunity to search Smith’s phone at the border.
A judge suppressed the evidence from the phone search, but Smith was convicted anyway. In both cases, the feds could have gotten a warrant for the suspects’ phones; they saw the border loophole as a way to skip that step.
In fact, CBP Officer Marves Pichardo admitted that these searches are often warrantless fishing expeditions. CBP searches U.S. citizens’ phones if they’re coming from “countries that have political difficulties at this point in time and that we’re currently looking at for intelligence and stuff like that,” Pichardo testified during an evidence suppression hearing. He asserted that CBP agents can “look at pretty much anything that’s stored on the phone” and that passengers are usually “very compliant.”
Because of the powers the government was claiming, civil libertarians intervened in the Sultanov case. The Knight First Amendment Institute at Columbia University and the Reporters Committee for Freedom of the Press filed an amicus brief in October 2023 arguing that warrantless phone searches are a “grave threat to the Fourth Amendment right to privacy as well as the First Amendment freedoms of the press, speech, and association.” Morrison heavily cited that brief in her ruling.
“As the court recognized, letting border agents freely rifle through journalists’ work product and communications whenever they cross the border would pose an intolerable risk to press freedom,” Grayson Clary, staff attorney at the Reporters Committee for Freedom of the Press, said in a statement sent to reporters. “This thorough opinion provides powerful guidance for other courts grappling with this issue, and makes clear that the Constitution would require a warrant before searching a reporter’s electronic devices.”
...
Read the original on reason.com »
Woohoo! We’re excited and humbled to announce that @stripe has acquired @lmsqueezy.
In 2020, when the world gave us lemons, we decided to make lemonade. We imagined a world where selling digital products would be as simple as opening a lemonade stand. We dreamed of a platform that would take the pain out of selling globally.
Tax headaches, fraud prevention, handling chargebacks, license key management, and file delivery, among other things, are complicated.
We believed it should be simple.
We believed it should be easy-peasy.
As founders, we’ve spent a decade selling digital products, and so we created a solution that met our own needs. But what started as an idea to solve the day-to-day problems of selling digital products evolved into something much bigger. Nine months after our public launch in 2021, we surpassed $1M in ARR and never looked back.
We worked tirelessly through growing pains while also celebrating major milestones along the way. Each step reinforced that we were onto something remarkable.
Along the way, we received many acquisition offers and (Series A) term sheets from investors. But despite the allure of these opportunities, we knew that what we had built was truly special and needed the right partner to take it to the next level.
We’re proud to say that we’ve found that partner in Stripe and have gone from idea to acquisition in under three years.
Stripe continues to set the bar in the payments industry with its world-class developer experience, API standards, and dedication to beauty and craft. It’s no secret that we (like many) have always admired Stripe.
When we began discussions about a potential acquisition, it was immediately apparent that our values and mission were perfectly aligned.
Lemon Squeezy and Stripe share a deep love for our customers and a commitment to making selling effortless.
Now imagine combining everything you love about Lemon Squeezy and Stripe — we believe it’s a match made in heaven.
Lemon Squeezy is now packed with 1,000% more juice.
Lemon Squeezy has been processing payments on Stripe since our inception. This acquisition marks the culmination of years of effort and celebrates our close partnership with Stripe and our shared sense of purpose.
Going forward, our mission remains the same: make selling digital products easy-peasy.
With Stripe’s help, we’ll continue improving the merchant of record offering, bolstering billing support, building an even more intuitive customer experience, and more.
We’re incredibly excited about the possibilities that lie ahead with the Lemon Squeezy and Stripe teams joining forces. The future is bright.
Rest assured, we’ll continue delivering the same fantastic product and reliability you’ve come to trust. We’ll be in touch as we work through this process with any updates as they come along. We’re excited about finding the best ways to combine Lemon Squeezy and Stripe.
At Lemon Squeezy, you (our wonderful customers) are at the heart of everything we do. We pride ourselves on creating intuitive, customer-focused products backed by top-notch customer service.
We remain as committed as ever.
Over the years, our community has grown exponentially. This growth is a testament to the trust and support you’ve shown us, and we couldn’t be more grateful.
We owe a huge thank you to our team, community, and supporters. Thousands of companies continue to choose to sell globally through Lemon Squeezy, and we’ll never take that for granted.
Thank you for being part of our journey. We look forward to all the fantastic things we will achieve together with Stripe.
...
Read the original on www.lemonsqueezy.com »
When I recently interviewed Mike Clark, he told me, “…you’ll see the actual foundational lift play out in the future on Zen 6, even though it was really Zen 5 that set the table for that.” And at that same Zen 5 architecture event, AMD’s Chief Technology Officer Mark Papermaster said, “Zen 5 is a ground-up redesign of the Zen architecture,” which has brought numerous and impactful changes to the design of the core.
The most substantial of these changes may well be the brand-new 2-Ahead Branch Predictor Unit, an architectural enhancement with roots in papers from three decades ago. But before diving into this both old yet new idea, let’s briefly revisit what branch predictors do and why they’re so critical in modern microprocessor cores.
Ever since computers began operating on programs stored in programmable, randomly accessible memory, architectures have been split into a front end that fetches instructions and a back end responsible for performing those operations. A front end must also support arbitrarily moving the point of current program execution to allow basic functionality like conditional evaluation, looping, and subroutines.
If a processor could simply perform the entire task of fetching an instruction, executing it, and selecting the next instruction location in unison, there would be little else to discuss here. However, incessant demands for performance have dictated that processors perform more operations in the same unit time with the same amount of circuitry, taking us from 5 kHz with ENIAC to the 5+ GHz of some contemporary CPUs like Zen 5, and this has necessitated pipelined logic. A processor must actually maintain in parallel the incrementally completed partial states of logically chronologically distinct operations.
Keeping this pipeline filled is immediately challenged by the existence of conditional jumping within a program. How can the front end know what instructions to begin fetching, decoding, and dispatching when a jump’s condition might be a substantial number of clock cycles away from finishing evaluation? Even unconditional jumps with a statically known target address present a problem when fetching and decoding an instruction needs more than a single pipeline stage.
The two ultimate responses to this problem are to either simply wait when the need is detected or to make a best-effort guess at what to do next and be able to unwind discovered mistakes. Unwinding bad guesses must be done by flushing the pipeline of work contingent on the bad guess and restarting at the last known good point. A stall taken on a branch condition is effectively unmitigable and proportional in size to the number of stages between the instruction fetch and the branch condition evaluation completion in the pipeline. Given this and the competitive pressures to not waste throughput, processors have little choice but to attempt guessing program instruction sequences as accurately as possible.
Imagine for a moment that you are a delivery driver without a map or GPS who must listen to on-the-fly navigation from colleagues in the back of the truck. Now further imagine that your windows are completely blacked out and that your buddies only tell you when you were supposed to turn 45 seconds past the intersection you couldn’t even see. You can start to empathize and begin to understand the struggles of the instruction fetcher in a pipelined processor. The art of branch prediction is the universe of strategies that are available to reduce the rate that this woefully afflicted driver has to stop and back up.
Naive strategies like always taking short backwards jumps (turning on to a circular drive) can and historically did provide substantial benefit over always fetching the next largest instruction memory address (just keep driving straight). However, if some small amount of state is allowed to be maintained, much better results in real programs can be achieved. If the blinded truck analogy hasn’t worn too thin yet, imagine the driver keeping a small set of notes of recent turns taken or skipped and hand-drawn scribbles of how roads driven in the last few minutes were arranged and what intersections were passed. These are equivalent to things like branch history and address records, and structures in the 10s of kilobytes have yielded branch prediction percentages in the upper 90s. This article will not attempt to cover the enormous space of research and commercial solutions here, but understanding at least the beginnings of the motivations here is valuable.
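Those “notes” can be as small as a table of 2-bit saturating counters indexed by branch address — the classic bimodal scheme from the literature. The sketch below is that textbook idea, not any particular shipping design:

```python
# A minimal bimodal predictor: one 2-bit saturating counter per table
# entry, indexed by branch address. Counter values 0..3; >= 2 means
# "predict taken". This is the textbook scheme, not a real core's.
class BimodalPredictor:
    def __init__(self, entries=1024):
        self.table = [2] * entries  # start at "weakly taken"

    def predict(self, pc):
        return self.table[pc % len(self.table)] >= 2  # taken?

    def update(self, pc, taken):
        i = pc % len(self.table)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# A loop branch at pc=0x40, taken 9 times and then falling through:
p = BimodalPredictor()
hits = 0
for trip in range(10):
    actual = trip < 9              # taken on the first 9 iterations
    hits += (p.predict(0x40) == actual)
    p.update(0x40, actual)
print(hits)  # 9 of 10 correct -- only the final loop exit mispredicts
```

Even this tiny amount of state beats the static “always take backwards jumps” rule on loopy code, which is why real predictors spend tens of kilobytes on far more elaborate versions of the same idea.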
The 2-Ahead Branch Predictor is a proposal that dates back to the early ’90s. Even back then the challenge of scaling out architectural widths of 8 or more was being talked about and a 2-Ahead Branch Predictor was one of the methods that academia put forth in order to continue squeezing more and more performance out of a single core.
But as commercial vendors moved from single-core to multi-core CPUs, the size of each individual core became a bigger and bigger factor in core design, so academia started focusing on more area-efficient methods of increasing performance. The biggest development there was the TAGE predictor, which is much more area-efficient than older branch prediction methods, and so academia in turn focused on improving TAGE predictors.
But with newer logic nodes allowing more and more transistors in a similar area, and with the move from dual- and quad-core CPUs to CPUs with hundreds of out-of-order cores, the industry has started to focus more and more on single-core performance rather than just scaling further and further out. So while some of these ideas are quite old, older than I am in fact, they are starting to resurface as companies try to figure out ways to increase the performance of a single core.
It is worth addressing an aspect of x86 that allows it to benefit disproportionately more from 2-ahead branch prediction than some other ISAs might. Architectures with fixed-length instructions, like 64-bit Arm, can trivially decode arbitrary subsets of an instruction cache line in parallel by simply replicating decoder logic and slicing up the input data along guaranteed instruction byte boundaries. On the far opposite end of the spectrum sits x86, which requires parsing instruction bytes linearly to determine where each subsequent instruction boundary lies. Pipelining (usually partially decoding length-determining prefixes first) makes a parallelization of some degree tractable, if not cheap, which resulted in 4-wide decoding being commonplace in performance-oriented x86 cores for numerous years.
While increasing logic density with newer fab nodes has eventually made solutions like Golden Cove’s 6-wide decoding commercially viable, the area and power costs of monolithic parallel x86 decoding are most definitely super-linear with width, and there is not anything resembling an easy path forward with continued expansions here. It is perhaps merciful for Intel and AMD that typical application integer code has a substantial branch density, on the order of one every five to six instructions, which diminishes the motivation to pursue parallelized decoders much wider than that.
The escape valve that x86 front ends need more than anything is for the inherently non-parallelizable portion of decoding, i.e., the determination of the instruction boundaries. If only there was some way to easily skip ahead in the decoding and be magically guaranteed you landed on an even instruction boundary…
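To make the serial dependency concrete, here is a toy model with an invented encoding — the first byte of each instruction gives its total length, which is not real x86 — showing why a guaranteed boundary lets a second decoder start mid-stream:

```python
# Toy model of the x86 length problem. The encoding is invented (first
# byte = total instruction length) purely to show the serial dependency:
# you only learn where instruction N+1 starts after decoding N.
def decode_lengths(stream, start, end):
    """Walk instruction boundaries one at a time, serially."""
    boundaries = []
    pc = start
    while pc < end:
        boundaries.append(pc)
        pc += stream[pc]        # length known only after reading byte pc
    return boundaries

code = bytes([2, 0, 3, 0, 0, 1, 4, 0, 0, 0])  # lengths 2, 3, 1, 4

# A single serial decoder must walk the whole stream from the front:
print(decode_lengths(code, 0, len(code)))     # [0, 2, 5, 6]

# But if a predictor guarantees that a branch target (say byte 5) is an
# instruction boundary, a second decode cluster can start there in
# parallel, skipping the serial walk over bytes 0..4:
print(decode_lengths(code, 5, len(code)))     # [5, 6]
```

A predicted-taken branch target is exactly such a guaranteed boundary, which is what makes clustered decode and 2-ahead prediction such a natural fit for x86.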
The paper titled “Multiple-block ahead branch predictors” by Seznec et al. lays out the why and how of the reasoning and implementation needed to make a 2-Ahead Branch Predictor.
Looking into the paper, you’ll see that building a front end that can deal with multiple taken branches per cycle is not as simple as just building a predictor that can produce multiple predictions. To use a 2-Ahead Branch Predictor to its fullest, without exploding area requirements, Seznec et al. recommended dual-porting the instruction fetch.
When we look at Zen 5, we see that dual-porting the instruction fetch and the op cache is exactly what AMD has done. AMD now has two 32-byte-per-cycle fetch pipes from the 32KB L1 instruction cache, each feeding its own 4-wide decode cluster. The Op Cache is now a dual-ported, 6-wide design which can feed up to 12 operations to the Op Queue.
Now, Seznec et al. also recommend dual-porting the Branch Target Buffer (BTB). A dual-ported L1 BTB could explain the massive 16K entries that the L1 BTB has access to. The L2 BTB is not quite as big at only 8K entries, but AMD is using it much like a victim cache: entries that get evicted out of the L1 BTB end up in the L2 BTB.
With all these changes, Zen 5 can now deal with 2 taken branches per cycle across a non-contiguous block of instructions.
This should reduce the hit to fetch bandwidth when Zen 5 hits a taken branch, as well as allow AMD to predict past the 2 taken branches.
Zen 5 can look farther forward in the instruction stream, beyond the 2nd taken branch, and as a result can have 3 prediction windows where all 3 are useful in producing instructions for decoding. The way this works is that a 5-bit length field is attached to the 2nd prediction window, which prevents oversubscription of the decode or op cache resources. This 5-bit length field, while much smaller than a pointer, still gives you the start of the 3rd prediction window. One benefit of this is that if the 3rd window crosses a cache line boundary, the prediction lookup index doesn’t need to store extra state for the next cycle. A drawback, however, is that if the 3rd prediction window lands in the same cache line as the 1st or 2nd, that partial 3rd window isn’t as effective as a full one.
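As a back-of-the-envelope illustration — the field width is from the description above, but the layout and numbers here are our guess, not AMD’s documented implementation — the bookkeeping is just an offset add:

```python
# Illustrative only: a 5-bit length (0..31 bytes) attached to the 2nd
# prediction window is enough to locate the 3rd window's start, without
# storing a full address. Values below are made up for the example.
def third_window_start(second_start, length_field):
    assert 0 <= length_field < 32   # must fit in 5 bits
    return second_start + length_field

# If the 2nd window begins at byte 0x40 and runs 12 bytes up to its
# taken branch, the 3rd window begins at 0x4C:
print(hex(third_window_start(0x40, 12)))  # 0x4c
```

The appeal is that 5 bits of state per window is cheap compared to carrying another full fetch address through the predictor pipeline.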
Now, when Zen 5 has two threads active, the decode clusters and the accompanying fetch pipes are statically partitioned. This means that to act like a dual-fetch core, Zen 5 has to fetch out of both the L1 instruction cache and the Op Cache. This may be the reason AMD dual-ported the op cache: to better ensure that the dual fetch pipeline keeps going.
In the end, this new 2-Ahead Branch Predictor is a major shift for the Zen family of CPU architectures, and it brings new branch prediction capabilities that will likely stand future Zen cores in good stead as AMD refines and improves this design.
If you like our articles and journalism, and you want to support us in our endeavors, then consider heading over to our Patreon or our PayPal if you want to toss a few bucks our way. If you would like to talk with the Chips and Cheese staff and the people behind the scenes, then consider joining our Discord.
If you want to learn more about how multiple fetch processors work then I would highly recommend the papers below as they helped with my understanding of how this whole system works:
...
Read the original on chipsandcheese.com »
Why does the chromaticity diagram look like that?
I’ve always wanted to understand color theory, so I started reading about the XYZ color space which looked like it was the mother of all color spaces. I had no idea what that meant, but it was created in 1931 so studying 93-year old research seemed like a good place to start.
When reading about the XYZ color space, this cursed image keeps popping up:
I say “cursed” because I have no idea what that means. What the heck is that shape??
I couldn’t find any reasonably clear answer to my question. It’s obviously not a formula like x = func(y). Why is it that shape, and where did the colors come from? Obviously the edges are wavelengths which have a specific color, but how did the image above compute every pixel?
I became obsessed with this question. Below is the path I took to try to answer it.
I’ll spoil the answer but it might not make sense until you read this article: the shape comes from how our eyes perceive red, green, and blue relative to each other. Skip to the last section if you want to see some direct examples.
The fill colors inside the shape are another story, but a simple explanation is that there is some math to calculate the mixture of colors, and we can draw the above by sampling millions of points in the space and rendering them onto the 2d image.
The first place to start is color matching functions. These functions determine how strongly each of a set of primary lights must contribute so that our eyes perceive a target wavelength (color). We have 3 color matching functions for red, green, and blue (at wavelengths 700, 546, and 435 respectively), and these functions specify how to mix RGB so that we visually see a spectral color.
More simply put: imagine that you have red, green, and blue light sources. What is the intensity of each one so that the resulting light matches a specific color on the spectrum?
Note that these are spectral colors: monochromatic light with a single wavelength. Think of colors on the rainbow. Many colors are not spectral, and are a mix of many spectral colors.
The CIE 1931 color space defines these RGB color matching functions. The red, green, and blue lines represent the intensity of each RGB light source:
Note: this plot uses the table from the original study. This raw data must not be in use anymore, because I couldn't find it anywhere; I had to extract it myself from an appendix in the original report.
Given a wavelength on the X axis, you can see how to “mix” the RGB wavelengths to produce the target color.
How did they come up with these? They scientifically studied how our eyes mix RGB colors by sitting people down in a room with multiple light sources. One light source was the target color, and the other side had red, green, and blue light sources. People had to adjust the strength of the RGB sources until it matched the target color. They literally had people manually adjust lights and recorded the values! There’s a great article that explains the experiments in more detail.
There’s a big problem with the above functions. Can you see it? What do you think a negative red light source means?
It’s nonsense! That means with this model, given pure RGB lights, there are certain spectral colors that are impossible to recreate. However, this data is still incredibly useful and we can transform it into something meaningful.
Introducing the XYZ color matching functions. The XYZ color space is simply the RGB color space, but multiplied with a matrix to transform it a bit. The important part is this is a linear transform: it’s literally the same thing, just reshaped a little.
I found a raw table for the XYZ color matching functions here and this is what it looks like. The CIE 1931 XYZ color matching functions:
Wikipedia defines the RGB matrix transform as this:
matrix = [
  2.364613, -0.89654, -0.468073,
  -0.515166, 1.426408, 0.088758,
  0.005203, -0.014408, 1.009204
]

[R, G, B] = matrix * [X, Y, Z]
We can take the XYZ table and transform it with the above matrix, and doing so produces this graph. Look familiar? This is exactly what the RGB graph above looks like (plotted directly from the data table)!
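As a quick sketch of that transform in code (just hard-coding the matrix values from above), applying it to a single XYZ triple looks like this:

```python
# XYZ -> RGB matrix (values copied from the Wikipedia matrix above)
M = [
    [2.364613, -0.89654, -0.468073],
    [-0.515166, 1.426408, 0.088758],
    [0.005203, -0.014408, 1.009204],
]

def xyz_to_rgb(X, Y, Z):
    """Multiply the matrix by the [X, Y, Z] column vector."""
    return tuple(row[0] * X + row[1] * Y + row[2] * Z for row in M)
```

A nice sanity check: each row of the matrix sums to ~1, so equal-energy white (X = Y = Z = 1) maps to equal RGB.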
Wikipedia also documents an analytical approximation of this data, which means we can use mathematical functions to generate the data instead of using tables. Press “view source” to see the algorithm:
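For reference, here's a sketch of that analytical approximation: a sum of piecewise Gaussian lobes per curve. The constants below are the ones I believe Wikipedia lists for the multi-lobe fit, so double-check them before relying on this:

```python
import math

def g(x, mu, sigma1, sigma2):
    # Piecewise Gaussian: a different width on each side of the peak
    sigma = sigma1 if x < mu else sigma2
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def xbar(wl):
    return (1.056 * g(wl, 599.8, 37.9, 31.0)
            + 0.362 * g(wl, 442.0, 16.0, 26.7)
            - 0.065 * g(wl, 501.1, 20.4, 26.2))

def ybar(wl):
    return 0.821 * g(wl, 568.8, 46.9, 40.5) + 0.286 * g(wl, 530.9, 16.3, 31.1)

def zbar(wl):
    return 1.217 * g(wl, 437.0, 11.8, 36.0) + 0.681 * g(wl, 459.0, 26.0, 13.8)
```

As a sanity check, ybar peaks near 555nm at roughly 1.0, matching the standard luminosity function.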
Ok, so we have these color matching functions. When displaying these colors with RGB lights though, we can’t even show all of the spectral colors. Transforming it into XYZ space, where everything is positive, fixes the numbers but what’s the point if we still can’t physically show them?
The XYZ space describes all colors, even colors that are impossible to display. It’s become a standard space to encode colors in a device-independent way, and it’s up to a specific device to interpret them into a space that it can physically produce. This is nice because we have a standard way to encode color information without restricting the possibilities of the future — as devices become better at displaying more and more colors, they can automatically start displaying them without requiring any infrastructure changes.
Now let’s get back to that cursed shape. That’s actually a chromaticity diagram, which is “objective specification of the quality of a color regardless of its luminance”.
We can derive the chromaticity for a color by taking the XYZ values for it and dividing each by the total:
const x = X / (X + Y + Z)
const y = Y / (X + Y + Z)
const z = Z / (X + Y + Z) = 1 - x - y
We don’t actually need z because we can derive it given x and y. Hence we have the “xy chromaticity diagram”. Remember how I said it’s a 3d curve projected onto a 2d space? We’ve done that by just dropping z.
If we want to go back to XYZ from xy, we need the Y value. This is called the xyY color space and is another way to encode colors.
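Both directions are tiny functions (a sketch; note that y must be nonzero for the inverse):

```python
def xyz_to_xyy(X, Y, Z):
    """Project XYZ down to chromaticity (x, y), keeping luminance Y."""
    total = X + Y + Z
    return X / total, Y / total, Y

def xyy_to_xyz(x, y, Y):
    """Recover XYZ by scaling the chromaticity back up by the luminance."""
    X = (x / y) * Y
    Z = ((1 - x - y) / y) * Y
    return X, Y, Z
```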
Alright, let’s try this out. Let’s take the RGB table we rendered above, and plot the chromaticity. We do this by using the above functions, and plotting the x and y points (the colors are a basic estimation):
Hey! Look at that! That looks familiar. Why is it so slanted though? If you look at the x axis, it actually goes into negative! That’s because the RGB data is representing impossible colors.
Let’s use an RGB to XYZ matrix to transform it into XYZ space (the opposite of what we did before, where we transformed XYZ into RGB). If we render the same data but transformed, it looks like this:
Now that’s looking really familiar!
Just to double-check, let’s render the chromaticity of the XYZ table data. Note that we have more granular data here, so there are more points, but it matches:
Ok, so what about colors? How do we fill the middle part with all the colors? Note: this is where I really start to get out of my league, but here’s my best attempt.
What if we iterate over every single pixel in the canvas and try to plot a color for it? The question is given x and y, how do we get a color?
Here are some steps:
We scale each x and y point in the canvas to a value between 0 and 1
Remember above I said we need the Y value to transform back into XYZ space? Turns out that the XYZ space intentionally made Y map to the luminance value of a color, so that means we can… make it up?
What if we just try to use a luminance value of 1?
That lets us generate XYZ values, which we then translate into sRGB space (don’t worry about the s there, it’s just RGB space with some gamma correction)
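The steps above can be sketched like this (using the standard XYZ→linear-sRGB matrix; the gamma step is skipped here for brevity):

```python
# Standard XYZ -> linear sRGB matrix (D65 white point)
XYZ_TO_SRGB = [
    [3.2406, -1.5372, -0.4986],
    [-0.9689, 1.8758, 0.0415],
    [0.0557, -0.2040, 1.0570],
]

def xy_to_linear_rgb(x, y, Y=1.0):
    # Recover XYZ from the chromaticity point, using a made-up luminance Y
    X = (x / y) * Y
    Z = ((1 - x - y) / y) * Y
    return tuple(r[0] * X + r[1] * Y + r[2] * Z for r in XYZ_TO_SRGB)

# Walk the canvas, scaling pixel coordinates to chromaticity values in [0, 1].
# Note that nothing guarantees the resulting RGB channels land in [0, 1]!
def colors_for_canvas(width, height, Y=1.0):
    for py in range(height):
        for px in range(width):
            x, yc = px / width, 1 - py / height
            if yc > 0:
                yield px, py, xy_to_linear_rgb(x, yc, Y)
```

A sanity check: the D65 white point (x ≈ 0.3127, y ≈ 0.3290) with Y = 1 should come out to roughly (1, 1, 1) in linear sRGB.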
One immediate problem you hit: this produces many invalid colors. We also want to experiment with different values of Y. The demo below has controls to customize its behavior: change Y from 0 to 1, and hide colors with channels below 0 or above 255.
That’s neat! We’re getting somewhere, and are obviously constrained by the RGB space. By default, it clips colors with negative values and that produces this triangle. Feels like the dots are starting to connect: the above image is clearly showing connections between XYZ/RGB and limitations of representable colors.
Even more interesting is if you turn on “clip colors max”. You only see a small slice of color, and you need to move the Y slider to morph the shape to “fill” the triangle. Almost like we’re moving through 3d space.
For each point, there must be a different Y value that is the most optimal representation of that color. For example, blues are rich when Y is low, but greens are only rich when Y is higher.
I’m still confused how to fill that space within the chromaticity diagram, so let’s take a break.
Let’s create a spectrum. Take the original color matching function. Since that is telling us the XYZ values needed to create a spectral color, shouldn’t we be able to iterate over the wavelengths of visible colors (400-720), get the XYZ values for each one, and convert them to RGB and render a spectrum?
This looks pretty bad, but why? I found a nice article about rendering spectra which seems like another deep hole. But my problem isn’t that kind of subtle accuracy issue; the above isn’t even remotely close.
Turns out I need to convert XYZ to sRGB, because that’s what the rgb() color function assumes when rendering to canvas. The main difference is gamma correction, which is another topic.
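For completeness, the sRGB gamma encoding is just a small piecewise function applied to each linear channel:

```python
def linear_to_srgb(c):
    """Standard sRGB gamma encoding for one channel (input and output in 0..1)."""
    if c <= 0.0031308:
        return 12.92 * c  # linear segment near black avoids an infinite slope
    return 1.055 * c ** (1 / 2.4) - 0.055
```

This brightens the midtones: a linear value of 0.5 encodes to roughly 0.735.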
We’ve learned that sRGB can only render a subset of all colors, and turns out there are other color spaces we can use to tell browsers to render more colors. The p3 wide gamut color space is larger than sRGB, and many browsers and displays support it now, so let’s test it.
You specify this color space by using the color function in CSS, for example: color(display-p3 r, g, b). I ran into the same problems where the colors were all wrong, which was surprising because everything I read implied it was linear. Turns out the p3 color space in browsers has the same gamma correction as sRGB, so I needed to include that to get it to work:
If you are seeing this on a wide gamut compatible browser and display, you will see more intense colors. I love that this is a thing, and the idea that so many users are using apps that could be more richly displayed if they supported p3.
I started having an existential crisis around this point. What are my eyes actually seeing? How do displays… actually work? Looking at the wide gamut spectrum above, what happens if I take a screenshot of it in macOS and send it to a user using a display that doesn’t support p3?
To test this I started a zoom chat with a friend and shared my screen and showed them the wide gamut spectrum and asked if they could see a difference (the top and bottom should look different). Turns out they could! I have no idea if macOS, zoom, or something else is translating it into sRGB (thus “downgrading” the colors) or actually transmitting p3. (Also, PNG supports p3, but what do monitors that don’t support it do?)
The sheer complexity of abstractions between my eyes and pixels is overwhelming. There are so many layers which handle reading and writing the individual pixels on my screen, and making it all work across zoom chats, screenshots, and everything is making my mind melt.
A little question: why does printing use the CMY color system with the primaries of cyan, magenta, and yellow, while digital displays build pixels with the primaries of red, green, and blue? If cyan, magenta, and yellow allow a wider range of colors via mixing, why is RGB better digitally? Answer: because RGB is an additive color system and CMY is a subtractive color system. Materials absorb light, while digital displays emit light.
We’re not giving up on figuring out the colors of the chromaticity diagram yet.
I found this incredible article about how to populate chromaticity diagrams. I still have no idea if this is how the original ones were generated. After all, the colors shown are just an approximation (your screen can’t actually display the true colors near the edges), so maybe there’s some other kind of formula.
So that I can get back to my daily life and be present with my family, I’m accepting that this is how those images are generated. Let’s try to do it ourselves.
There’s no way to go from an x, y point in the canvas to a color. There’s no formula that tells us if it’s a valid point in space or how to approximate a color for it.
We need to do the opposite: start with a value in the XYZ color space, compute an approximate color, and plot it at the right point by converting it into xy space. But how do we even find valid XYZ values? Not all points inside that space (between 0 and 1 on all three axes) are valid. To do that we have to take another step back.
I got this technique from the incredible article linked above. What we’re trying to do is render all colors in existence. Obviously we can’t actually do that, so we need an approximation. Here’s the approach we’ll take:
First, we need to generate an arbitrary color. The only way to do this is to generate a spectral line shape. Basically it’s a line across all wavelengths (the X axis) that defines how much each wavelength contributes to the color.
To get the xy coordinate on the canvas, we need to get the XYZ values for the color. To do that, we multiply the XYZ color matching functions with the spectral line, and then take the integral of each line to get the final XYZ values.
We do the same for the RGB color. We multiply the RGB color matching functions with the spectral line and take the integral of each one for the final RGB color. (We’ll talk about the colors more later)
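In code, the whole pipeline might look like this. Note this is a sketch: the cmf_xyz stand-ins below are single Gaussians I made up, not the real CIE curves, which you'd swap in from a table or approximation:

```python
import math

def gauss(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Crude stand-in CMFs: one Gaussian lobe per channel (NOT the real CIE curves)
def cmf_xyz(wl):
    return gauss(wl, 595, 45), gauss(wl, 557, 45), gauss(wl, 445, 25)

def spectral_line(wl, offset1, offset2, width=30):
    # Two bumps across the visible range, like the demo's adjustable curves
    return gauss(wl, offset1, width) + gauss(wl, offset2, width)

def integrate_to_xyz(offset1, offset2):
    # Multiply the spectral line by each CMF and sum over wavelengths;
    # the sum at 1nm steps approximates the integral
    X = Y = Z = 0.0
    for wl in range(400, 721):
        s = spectral_line(wl, offset1, offset2)
        xb, yb, zb = cmf_xyz(wl)
        X, Y, Z = X + s * xb, Y + s * yb, Z + s * zb
    return X, Y, Z

def xyz_to_xy(X, Y, Z):
    total = X + Y + Z
    return X / total, Y / total
```

Each pair of bump offsets produces one color and therefore one (x, y) point on the diagram.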
I don’t know if that made any sense, but here’s a demo which might help. The graph in the bottom left is the spectral line we are generating. This represents a specific color, which is shown in the top left. Finally, on the right we plot the color on the chromaticity diagram by summing up the area of the spectral line multiplied by the XYZ color matching functions.
We generated the spectral line graph with two simple sine curves with a specific width and offset. You can change the offset of each curve with the sliders below. You can see that moving those curves generates a different spectral line (and thus color), which plots different points on the diagram.
By adjusting the sliders, you are basically painting the chromaticity diagram!
You can see how all of this works by pressing “view source” to see the code.
Obviously this is a very poor representation of the chromaticity diagram. It’s difficult to cover the whole area; adjusting the offset of the curves only allows you to walk through a subset of the entire space. We would need to change how we are generating spectral lines to fully walk through the space.
Here’s a demo which attempts to automate this. It’s using the same code as above, except it’s changing both offset and width of the curves and walking through the space better:
I created an isolated codepen if you want to play with this yourself. If you let this run for a while, you’ll end up with a shape like this:
We’re still not walking through the full space, but it’s not bad! It at least… vaguely resembles the original diagram?
Our coloring isn’t quite right. It’s missing the white spot in the middle and it’s too dark in certain places. Let me explain a little more how we generated these colors.
After all, didn’t we generate RGB colors? If so, why weren’t they clipped and showing a triangle like before? Or at least we should see more “maxing out” of colors near the edges.
My first attempts at the above did show this. Here’s a picture where I only took the integral to find the XYZ values, and then took those values and used XYZ_to_sRGB to transform them into RGB colors:
We do get more of the bright white spot in the middle, but the colors are far too saturated. It’s clear that many of these colors are actually invalid (they are not in between 0 and 255).
Another technique I learned from the incredible article is to avoid using the XYZ points to find the color, and instead do the same integration over the RGB color matching functions. So we take our spectral line, multiply it by each of the RGB functions, and then take the sum of each result to find the individual RGB values.
Even though this still produces invalid colors, intuitively I can see how it more directly maps onto the RGB space and provides a better interpolation.
That’s about as far as I got. I wish I had a better answer for how to generate the colors here, and maybe you know? If so, give me a shout! I’m satisfied with how far I got, and I bet the final answer uses slightly different color matching functions or something, but it doesn’t feel far off.
If you have ideas to improve this, please do so in this demo! I’d love to see any improvements.
I want to drive home that my above implementation is still generating invalid colors. For example, if I add clipping and avoid rendering any colors with elements outside of the 0-255 range, I get the familiar sRGB triangle:
It turns out that even though colors outside the triangle aren’t rendering accurately, we’re still able to represent a change of color because only 1 or 2 of the RGB channels have maxed out. If green maxes out, changes in the red and blue channels will still show up.
But really, why that specific shape? I know it derives from how we perceive red, green, and blue relative to each other. Let’s look at the XYZ color matching functions again:
The shape of the diagram is derived from these curves. To render chromaticity, you walk through each wavelength above and calculate each XYZ value’s percentage of the total. So there’s a direct relationship.
Let’s drive this home by generating our own random color matching functions. We generate them with some simple sine waves (view source to see the code):
Now let’s render the chromaticity according to our nonsensical color matching functions:
The shape is very different! So that’s it: the shape is due to the XYZ color matching functions, which were derived from experiments that studied how our eyes perceive red, green, and blue light. That’s why the chromaticity diagram represents something meaningful: it’s how our eyes perceive color.
Looking for old articles? See archive.jlongster.com
...
Read the original on jlongster.com »
...
Read the original on www.moreoverlap.com »
I got some fancy new speakers last week.
They’re powered speakers and they have streaming service integrations built in, unlike the 35-year-old passive speakers I’m upgrading from. Overall they’re great! But they’re so loud that it’s difficult to make small volume adjustments within the range of safe volume levels for my apartment.
To solve that, I’m building a custom volume knob for them that will give me more precise control within the range I like to listen in.
The speakers sound great, but they’re way louder than I need. I typically use about 10% of the volume range they’re capable of.
That makes it difficult to set the volume levels I prefer using the methods most convenient for me, which are either the regular volume controls on my phone or computer if I’m using AirPlay, or the volume slider in Spotify if I’m using Spotify Connect. Those methods give me either a tiny slider that I can only use 10% of, or about 15 steps where the jump from step 3 to step 4 takes the speakers from “a bit too quiet” to “definitely bothering the neighbors” levels.
The amp that I used to use was overpowered for my room too, but that wasn’t an issue for me because those volume control methods attenuated the output of a music streamer that I had connected to the amp. With that system, I could set the amplifier’s analog volume knob such that the max volume out of the streamer corresponded to my actual maximum preferred listening volume, giving me access to the full range of Spotify or AirPlay’s volume controls.
Some powered speakers solve this issue by providing control over the max volume, either by a physical knob or by a software setting, but unfortunately these JBLs do not.
While thinking about this problem, I remembered that some other network-connected audio devices I’ve encountered expose undocumented web interfaces. I was curious if these speakers did, so I found their local IP address via my router and navigated to that IP in my browser.
Lo and behold, they do have one!
Sadly, the volume slider there was still not as convenient as I’d like.
After exploring that web interface for a few minutes with my browser’s network dev tools, I found that the speakers expose a pretty straightforward HTTP API, including GET /api/getData and POST /api/setData, which allow me to read and write the current volume level, among other things.
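As a rough sketch of what talking to that API looks like — note that the player:volume path, the roles=value parameter, and the payload shape below are my guesses from poking at the web UI, not documented behavior:

```python
import json
import urllib.request

def get_data_url(speaker_ip: str, path: str = "player:volume") -> str:
    # Hypothetical query shape, reverse-engineered from the web UI's dev tools
    return f"http://{speaker_ip}/api/getData?path={path}&roles=value"

def set_volume(speaker_ip: str, level: int) -> None:
    # Hypothetical payload shape; the real speakers may expect something different
    body = json.dumps({"path": "player:volume", "role": "value",
                       "value": {"type": "i32_", "i32_": level}}).encode()
    req = urllib.request.Request(f"http://{speaker_ip}/api/setData", data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)
```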
I tried to find some documentation of this API online, but the closest I could find was the source code for a Homebridge plugin for KEF speakers. It seems like KEF’s and JBL’s network-connected speakers share some code, which isn’t too surprising given that they’re both owned by Harman (which is apparently owned by Samsung as of 2017!)
After that, I found that the speakers’ web interface has a page that allows me to download system logs, which turned out to include a copy of the part of the filesystem that stores the current settings!
That helped me track down two specific configuration paths that looked promising: player/attenuation and hostlink/maxVolume.
Sadly, neither of those turned out to be what I was looking for.
player/attenuation turned out to be another interface to the main volume, more-or-less an alias of player:volume.
hostlink/maxVolume sounded like it could be exactly what I was hoping to find. Unfortunately, changing it doesn’t seem to affect anything that I’ve noticed, and the API response implies that it has to do with Arcam (yet another Samsung/Harman subsidiary):
If I couldn’t set a max volume inside the speaker, I could at least build myself a custom slider that only covers the range of volumes I’m interested in listening at.
To do that, I put together a little web page with nothing but a full-width slider for setting the volume, and I finally have a way to choose reasonable levels!
I tried to do that in a single HTML file, but ran into CORS issues when sending requests to the speakers, so I put together a tiny server using Bun. With that, I was able to keep it down to a single TypeScript file with no dependencies other than Bun itself:
The web server here is pretty small. It just serves the page with the slider and forwards requests to the speakers.
I’m using that tagged template just to get nicer syntax highlighting in my editor for the embedded HTML.
This works alright for now, but what I really want is a physical volume knob that I can place wherever it’s convenient in my apartment.
In the next post in this series, I’ll talk about building that, probably using something like an ESP32 board with a rotary encoder, a nice enclosure and a nice feeling knob, maybe with some kind of haptic feedback for the stepped volume changes?
I haven’t actually worked with those components before, and it’s been a while since I last worked on a hardware electronics project, but I’m excited to!
...
Read the original on jamesbvaughan.com »
On June 26th 2024 I launched a website called One Million Checkboxes (OMCB). It had one million global checkboxes on it - checking a box checked it for everyone on the site, immediately.
I built the site in 2 days. I thought I’d get a few hundred users, max. That is not what happened.
Instead, within hours of launching, tens of thousands of users checked millions of boxes. They piled in from Hacker News, /r/InternetIsBeautiful, Mastodon and Twitter. A few days later OMCB appeared in the Washington Post and the New York Times.
Here’s what activity looked like on the first day (I launched at 11:30 AM EST).
I don’t have logs for checked boxes from the first few hours because I originally only kept the latest 1 million logs for a given day(!)
I wasn’t prepared for this level of activity. The site crashed a lot. But by day 2 I started to stabilize things and people checked over 50 million boxes. We passed 650 million before I sunset the site 2 weeks later.
Let’s talk about how I kept the site (mostly) online!
Here’s the gist of the original architecture:
Our checkbox state is just one million bits (125KB). A bit is “1” if the corresponding checkbox is checked and “0” otherwise.
Clients store the bits in a bitset (an array of bytes that makes it easy to store, access, and flip raw bits) and reference that bitset when rendering checkboxes. Clients tell the server when they check a box; the server flips the relevant bit and broadcasts that fact to all connected clients.
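A bitset like that is just a byte array plus some bit arithmetic. A minimal sketch (the bit ordering here is an arbitrary choice; client and server just need to agree on it):

```python
class Bitset:
    def __init__(self, n_bits):
        # 1,000,000 bits -> 125,000 bytes
        self.data = bytearray((n_bits + 7) // 8)

    def get(self, i):
        return (self.data[i // 8] >> (i % 8)) & 1

    def set(self, i, value):
        if value:
            self.data[i // 8] |= 1 << (i % 8)
        else:
            self.data[i // 8] &= ~(1 << (i % 8))

boxes = Bitset(1_000_000)
boxes.set(123_456, 1)  # someone checked box 123,456
```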
To avoid throwing a million elements into the DOM, clients only render the checkboxes in view (plus a small buffer) using react-window.
I could have done this with a single process, but I wanted an architecture that I could scale (and an excuse to use Redis for the first time in years). So the actual server setup looked like this:
Clients hit nginx for static content, and then make a GET for the bitset state and a websocket connection (for updates); nginx (acting as a reverse proxy) forwards those requests to one of two Flask servers (run via gunicorn).
State is stored in Redis, which has good primitives for flipping individual bits. Clients tell Flask when they check a box; Flask updates the bits in Redis and writes an event to a pubsub (message queue). Both Flask servers read from that pubsub and notify connected clients when checkboxes are checked/unchecked.
We need the pubsub because we’ve got two Flask instances; without it, a Flask instance could only broadcast “box 2 was checked” to its own clients, and clients connected to the other instance would miss the update.
Finally, the Flask servers do simple rate-limiting (on requests per session and new sessions per IP - foolishly stored in Redis!) and regularly send full state snapshots to connected clients (in case a client missed an update because, say, the tab was backgrounded).
This code isn’t great! It’s not even async. I haven’t shipped production Python in like 8 years! But I was fine with that. I didn’t think the project would be popular. This was good enough.
I changed a lot of OMCB but the basic architecture - nginx reverse proxy, API workers, Redis for state and message queues - remained.
Before I talk about what changed, let’s look at the principles I had in mind while scaling.
I needed to be able to math out an upper bound on my costs. I aimed to let things break when they broke my expectations instead of going serverless and scaling into bankruptcy.
I assumed the site’s popularity was fleeting. I took on technical debt and aimed for ok solutions that I could hack out in hours over great solutions that would take me days or weeks.
I’m used to running my own servers. I like to log into boxes and run commands. I tried to only add dependencies that I could run and debug on my own.
I optimized for fun, not money. Scaling the site my way was fun. So was saying no to advertisers.
The magic of the site was jumping anywhere and seeing immediate changes. So I didn’t want to scale by, for example, sending clients a view of only the checkboxes they were looking at.
Within 30 minutes of launch, activity looked like this:
The site was still up, but I knew it wouldn’t tolerate the load for much longer.
The most obvious improvement was more servers. Fortunately this was easy - nginx could easily reverse-proxy to Flask instances on another VM, and my state was already in Redis. I started spinning up more boxes.
I spun up the second server around 12:30 PM. Load immediately hit 100%
I originally assumed another server or two would be sufficient. Instead traffic grew as I scaled. I hit #1 on Hacker News; activity on my tweet skyrocketed. I looked for bigger optimizations.
My Flask servers were struggling. Redis was running out of connections (did you notice I wasn’t using a connection pool?). My best idea was to batch updates - I hacked something in that looked like this:
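The idea was roughly this (my own sketch of it, not the original code): buffer updates, then flush the whole batch on a timer or when it fills up.

```python
import time

class UpdateBatcher:
    """Buffer checkbox updates and flush them in batches."""

    def __init__(self, flush_interval=0.1, max_batch=100):
        self.pending = []
        self.flush_interval = flush_interval
        self.max_batch = max_batch
        self.last_flush = time.monotonic()

    def add(self, index, checked):
        self.pending.append((index, checked))

    def maybe_flush(self):
        now = time.monotonic()
        if self.pending and (len(self.pending) >= self.max_batch
                             or now - self.last_flush >= self.flush_interval):
            batch, self.pending = self.pending, []
            self.last_flush = now
            return batch  # in the real server: publish this one message
        return None
```

One pubsub message per batch instead of one per click is where the savings come from.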
I didn’t bother with backwards compatibility. I figured folks were used to the site breaking and would just refresh.
I also added a connection pool. This definitely did not play nicely with gunicorn and Flask, but it did seem to reduce the number of connections to Redis.
I also beefed up my Redis box - easy to do since I was using Digital Ocean’s managed Redis - from a tiny (1 shared CPU; 2 GB RAM) instance to a box with 4 dedicated CPUs and 32 GB of RAM (I did this after Redis mysteriously went down). The resizing took about 30 minutes; the server came back up.
And then things got trickier.
At around 4:30 PM I accepted it: I had plans. I had spent June at a camp at ITP - a school at NYU. And the night of the 26th was our final show. I had signed up to display a face-controlled Pacman game and invited some friends - I had to go!
I brought an iPad and put OMCB on it. I spun up servers while my friend Uri and my girlfriend Emma kindly stepped in to explain what I was doing to strangers when they came by my booth.
I had no automation for spinning up servers (oops) so my naming conventions evolved as I worked.
My servers. I ended up with 8 worker VMs
I got home from the show around midnight. I was tired. But there was still more work to do, like:
* Reducing the number of Flask processes on each box (I originally had more workers than the number of cores on a box; this didn’t work well)
* Increasing the batch size of my updates - I found that doubling the batch size substantially reduced load. I tried doubling it again. This appeared to help even more. I don’t know how to pick a principled number here.
I pushed the updates. I was feeling good! And then I got a text from my friend Greg Technology.
I realized I hadn’t thought hard enough about bandwidth. Digital Ocean’s bandwidth pricing is pretty sane ($0.01/GB after a pretty generous per-server compounding free allowance). I had a TB of free bandwidth from past work and (pre-launch) didn’t think OMCB would make a dent.
I did back of the envelope math. I send state snapshots (1 million bits; 1 Mbit) every 30 seconds. With 1,000 clients that’s already 2GB a minute! Or 120GB an hour. And we’re probably gonna have more clients than that. And we haven’t even started to think about updates.
It was 2 AM. I was very tired. I did some bad math - maybe I confused GB/hour with GB/minute? - and freaked out. I thought I was already on the hook for thousands of dollars!
So I did a couple of things:
* Frantically texted Greg, who helped me realize that my math was way off.
* Ran ip -s link show dev eth0 on my nginx box to see how many bytes I had sent, confirming that my math was way off.
* Started thinking about how to reduce bandwidth - and how to cap my costs.
I immediately reduced the frequency of my state snapshots, and then (with some help from Greg) pared down the size of the incremental updates I sent to clients.
I moved from stuffing a bunch of dicts into a list to sending two arrays of indices with true and false implied. This was five times shorter than my original implementation!
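The two formats might look something like this (a sketch of the shape, not the exact wire format):

```python
import json

# Before: a dict per update -- lots of repeated keys on the wire
def encode_verbose(updates):
    return json.dumps([{"index": i, "checked": c} for i, c in updates])

# After: two arrays of indices, with true/false implied by which array
def encode_compact(updates):
    return json.dumps({
        "true": [i for i, c in updates if c],
        "false": [i for i, c in updates if not c],
    })
```

The per-update keys disappear entirely, which is where most of the size win comes from.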
And then I used linux’s tc utility to slam a hard cap on the amount of data I could send per second. tc is famously hard to use, so I wrote my configuration script with Claude’s help.
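A minimal version of that kind of cap uses a token bucket filter (the burst and latency values below are illustrative guesses, not necessarily the ones in my script):

```shell
#!/bin/sh
# Cap outbound traffic on eth0 at 250Mbit/s with a token bucket filter.
# Remove any existing root qdisc first (ignore the error if there isn't one).
tc qdisc del dev eth0 root 2>/dev/null
tc qdisc add dev eth0 root tbf rate 250mbit burst 1mbit latency 50ms
```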
This just limits traffic flowing over eth0 (my public interface) to 250Mbit a second. That’s a lot of bandwidth - ~2GB/min, or just under 3 TB a day. But it let me reason about my costs, and at $0.01/GB I knew I wouldn’t go bankrupt overnight.
At around 3:30 AM I got in bed.
My server was pegged at my 250 Mb/s limit for much of the night. I originally thought I was lucky to add limits when I did; I now realize someone probably saw my tweet about reducing bandwidth and tried to give me a huge bill.
Blue is traffic from my workers to nginx, purple is nginx out to the world. The timing is suspicious
I woke up a few hours later. The site was down. I hadn’t been validating input properly.
The site didn’t prevent folks from checking boxes above 1 million. Someone had checked boxes in the hundred million range! This let them push the count of checked boxes to 1 million, tricking the site into thinking things were over.
Redis had also zero-padded the bitset out to bit one hundred million - SETBIT extends the underlying string to cover the highest offset written - which 100x’d the data I was sending to clients.
This was embarrassing - I’m new to building for the web but like…I know you should validate your inputs! But it was a quick fix. I stopped nginx, copied the first million bits of my old bitset to a new truncated bitset (I wanted to keep the old one for debugging), taught my code to reference the new bitset, and added proper validation.
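The fix itself is tiny - a bounds check before anything touches Redis. A minimal sketch (names are mine, not the actual code):

```python
NUM_BOXES = 1_000_000

def parse_index(raw):
    """Return a valid checkbox index, or None for anything out of range."""
    try:
        i = int(raw)
    except (TypeError, ValueError):
        return None
    return i if 0 <= i < NUM_BOXES else None

# parse_index("999999") -> 999999; parse_index(100_000_000) -> None
```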
Not too bad! I brought the site back up.
The site was slow. The number of checked boxes per hour quickly exceeded the day 1 peak.
The biggest problem was the initial page load. This made sense - we had to hit Redis, which was under a lot of load (and we were making too many connections to it due to bugs in my connection pooling).
I was tired and didn’t feel equipped to debug my connection pool issues. So I embraced the short term and spun up a Redis replica to take load off the primary and spread my connections out.
But there was a problem - after spinning up the replica, I couldn’t find its private IP!
I got my Redis instance’s private IP by prepending “private-” to its DNS entry
To connect to my primary, I used a DNS record - there were records for its public and private IPs. Digital Ocean told me to prepend replica- to those records to get my replica IP. This worked for the public record, but no such entry existed for the private one! And I really wanted the private IP.
I thought sending traffic to a public IP would risk traversing the public internet, which would mean being billed for way more bandwidth.
Since I couldn’t figure out how to find the replica’s private IP in an official way (I’m sure you can! Tell me how!), I took a different approach and started making connections to private IPs close to the IPs of my Redis primary and my other servers. This worked on the third or fourth try.
Then I hardcoded that IP as my replica IP!
My Flask processes kept crashing, requiring me to babysit the site. The crashes seemed to be from running out of Redis connections. I’m wincing as I type this now, but I still didn’t want to debug what was going on there - it was late and the problem was fuzzy.
So I wrote a script that looked at the number of running Flask processes and bounced my systemd unit if too many were down.
I threw that into the crontab on my boxes and updated my nginx config to briefly take servers out of rotation if they were down (I should have done this sooner!). This appeared to work pretty well. The site stabilized.
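The babysitter itself can be very small. A sketch of the idea in Python (the process pattern, thresholds, and systemd unit name are all made up; the real script may have differed):

```python
import subprocess

EXPECTED_WORKERS = 8   # hypothetical worker count
MIN_ALIVE = 6          # bounce the unit if we drop below this

def count_workers(pattern="flask"):
    # pgrep -c prints the number of matching processes (and exits
    # nonzero when there are none, so don't use check=True here)
    result = subprocess.run(["pgrep", "-c", "-f", pattern],
                            capture_output=True, text=True)
    return int(result.stdout.strip() or 0)

def should_restart(alive, min_alive=MIN_ALIVE):
    return alive < min_alive

def restart_unit(unit="omcb"):  # hypothetical systemd unit name
    subprocess.run(["systemctl", "restart", unit], check=False)

# Cron would run something like:
#   if should_restart(count_workers()): restart_unit()
```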
At around 12:30 AM I posted some stats on Twitter and got ready to go to bed. And then a user reported an issue:
To keep client checkbox state synchronized, I did two things:
* Sent clients incremental updates when checkboxes were checked or unchecked
* Sent clients occasional full-state snapshots in case they missed an update
These updates didn’t have timestamps. A client could receive a new full-state snapshot and then apply an old incremental update - resulting in them having a totally wrong view of the world until the next full-state snapshot.
I was embarrassed by this - I’ve written a whole lot of state machine code and know better. It was almost 1 AM and I had barely slept the night before; it was a struggle to write code that I (ironically) thought I could write in my sleep. But I:
* Timestamped each update written to my Redis pubsub
* Added the max timestamp of each incremental update in the batches I sent to clients
* Taught clients to drop update batches if their timestamp was behind the timestamp of the last full-state snapshot
This isn’t perfect (clients can apply a batch of mostly-stale updates as long as one update is new) but it’s substantially better.
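The core of the fix is a single comparison on the client. A minimal sketch (assuming numeric timestamps; not the actual client code):

```python
class ClientSync:
    """Tracks the last full-snapshot timestamp and filters update batches."""

    def __init__(self):
        self.snapshot_ts = 0

    def apply_snapshot(self, ts):
        self.snapshot_ts = ts

    def should_apply_batch(self, batch_max_ts):
        # Drop batches whose newest update predates the last snapshot.
        # Imperfect: a batch of mostly-stale updates passes if one is new.
        return batch_max_ts > self.snapshot_ts
```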
me to claude, 1 AM
I ran my changes by Claude before shipping to prod. Claude’s suggestions weren’t actually super helpful, but talking through why they were wrong gave me more confidence.
I woke up the next morning and the site was still up! Hackily restarting your servers is great. This was great timing - the site was attracting more mainstream media attention (I woke up to an email from the Washington Post).
I moved my attention from keeping the site up to thinking about how to wind it down. I was still confident folks wouldn’t be interested in the site forever, and I wanted to provide a real ending before everyone moved on.
I came up with a plan - I’d make checked boxes freeze if they weren’t unchecked quickly. I wasn’t sure that my current setup could handle this - it might result in a spike of activity plus I’d be asking my servers to do more work.
So (after taking a break for a day) I got brunch with my friend Eliot - a super talented performance engineer - and asked if he was down to give me a hand. He was, and from around 2 PM to 2 AM on Sunday we discussed implementations of my sunsetting plan and then rewrote the whole backend in Go!
The Go rewrite was straightforward; we ported without many changes. Lots of our sticking points were things like “finding a Go socketio library that supports the latest version of the protocol.”
Things were actually so much faster that we ended up needing to add better rate-limiting; originally we scaled too well, and bots were able to push absurd amounts of traffic through the site.
The site was DDOS’d on Sunday night, but addressing this was pretty simple - I just threw the site behind Cloudflare and updated my nginx configs a bit.
The site was rock-solid after the Go rewrite. I spent the next week doing interviews, enjoying the attention, and trying to relax.
And then I got to work on sunsetting. Checked boxes would freeze if they weren’t unchecked quickly, which would eventually leave the site totally frozen. The architecture here ended up being pretty simple - mostly some more state in Redis:
I added a hashtable that tracked the last time that a box was checked (this would be too much state to pass to clients, but was fine to keep in Redis), along with a “time to freeze” value. When trying to uncheck a box, we’d first check whether now - last_checked > time_to_freeze - if it was, we wouldn’t uncheck the box and would instead update frozen_bitset to note that the relevant checkbox was now frozen.
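The uncheck path looks roughly like this in Python (in production the state lived in Redis - last_checked as a hash, frozen_bitset as a bitset - so plain dicts and sets stand in for it here, and the freeze window is a made-up number):

```python
import time

TIME_TO_FREEZE = 3600  # hypothetical freeze window, in seconds

last_checked = {}  # box index -> last time it was checked (a Redis hash in prod)
checked = set()    # stand-in for the main bitset
frozen = set()     # stand-in for frozen_bitset

def try_uncheck(i, now=None):
    """Uncheck box i, unless it has sat checked long enough to freeze."""
    now = time.time() if now is None else now
    if i in frozen:
        return False
    if now - last_checked.get(i, now) > TIME_TO_FREEZE:
        frozen.add(i)  # too late: the box freezes instead of unchecking
        return False
    checked.discard(i)
    return True
```

Packing the real version of this check-and-freeze logic into one Lua script means Redis runs it atomically, which is what made the race conditions easy to avoid.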
I distributed frozen_bitset state to clients the same way that I distributed which boxes were checked, and taught clients to disable a checkbox if it was in the frozen bitset. And I added a job to periodically search for bits that should be frozen (but weren’t yet because nobody had tried to uncheck them) and freeze those.
Redis made it soooo easy to avoid race conditions with this implementation - I put all the relevant logic into a Lua script, meaning that it all ran atomically! Redis is great.
I rolled out the sunsetting changes 2 weeks and 1 day after I launched OMCB. Box 491915 was checked at 4:35 PM Eastern on July 11th, closing out the site.
Well, a lot. This was the second time that I’d put a server with a ‘real’ backend on the public internet, and the last one barely counted. Learning in a high-intensity but low-stakes environment is great.
Building the site in two days with little regard for scale was a good choice. It’s so hard to know what will do well on the internet - nobody I explained the site to seemed that excited about it - and I doubt I would have launched at all if I spent weeks thinking about scale. Having a bunch of eyes on the site energized me to keep it up and helped me focus on what mattered.
...
Read the original on eieio.games »