Published: July 1, 2020
The smartest person I’ve ever known had a habit that, as a teenager, I found striking. After he’d prove a theorem, or solve a problem, he’d go back and continue thinking about the problem and try to ﬁgure out different proofs of the same thing. Sometimes he’d spend hours on a problem he’d already solved.
I had the opposite tendency: as soon as I’d reached the end of the proof, I’d stop since I’d “gotten the answer”.
Afterwards, he’d come out with three or four proofs of the same thing, plus some explanation of why each proof worked and how they were connected. In this way, he got a much deeper understanding of things than I did.
I concluded that what we call ‘intelligence’ is as much about virtues such as honesty, integrity, and bravery, as it is about ‘raw intellect’.
Intelligent people simply aren’t willing to accept answers that they don’t understand — no matter how many other people try to convince them, or how many other people believe it. If they aren’t able to convince themselves of it, they won’t accept it.
Importantly, this is a ‘software’ trait & is independent of more ‘hardware’ traits such as processing speed, working memory, and other such things.
Moreover, I have noticed that these ‘hardware’ traits vary greatly in the smartest people I know — some are remarkably quick thinkers, calculators, readers, whereas others are ‘slow’. The software traits, though, they all have in common — and can, with effort, be learned.
What this means is that you can internalize good intellectual habits that, in effect, “increase your intelligence”. ‘Intelligence’ is not ﬁxed.
This quality of “not stopping at an unsatisfactory answer” deserves some examination.
One component of it is energy: thinking hard takes effort, and it’s much easier to just stop at an answer that seems to make sense, than to pursue everything that you don’t quite get down an endless, and rapidly proliferating, series of rabbit holes.
It’s also so easy to think that you understand something, when you actually don’t. So even ﬁguring out whether you understand something or not requires you to attack the thing from multiple angles and test your own understanding.
This requires a lot of intrinsic motivation, because it’s so hard; so most people simply don’t do it.
The Nobel Prize winner William Shockley was fond of talking about “the will to think”:
Motivation is at least as important as method for the serious thinker, Shockley believed… the essential element for successful work in any field was “the will to think”. This was a phrase he learned from the nuclear physicist Enrico Fermi and never forgot. “In these four words,” Shockley wrote later, “[Fermi] distilled the essence of a very significant insight: A competent thinker will be reluctant to commit himself to the effort that tedious and precise thinking demands — he will lack ‘the will to think’ — unless he has the conviction that something worthwhile will be done with the results of his efforts.” The discipline of competent thinking is important throughout life… (source)

But it’s not just energy. You have to be able to motivate yourself to spend large quantities of energy on a problem, which means that, on some level, not understanding something — or having a bug in your thinking — bothers you a lot. You have the drive, the will to know.
Related to this is honesty, or integrity: a sort of compulsive unwillingness, or inability, to lie to yourself. Feynman said that the ﬁrst rule of science is that you do not fool yourself, and you are the easiest person to fool. It is uniquely easy to lie to yourself because there is no external force keeping you honest; only you can run the constant loop of asking “do I really understand this?”.
(This is why writing is important. It’s harder to fool yourself that you understand something when you sit down to write about it and it comes out all disjointed and confused. Writing forces clarity.)
The physicist Michael Faraday believed nothing without being able to experimentally demonstrate it himself, no matter how tedious the demonstration.
Simply hearing or reading of such things was never enough for Faraday. When assessing the work of others, he always had to repeat, and perhaps extend, their experiments. It became a lifelong habit—his way of establishing ownership over an idea. Just as he did countless times later in other settings, he set out to demonstrate this new phenomenon to his own satisfaction. When he had saved enough money to buy the materials, he made a battery from seven copper halfpennies and seven discs cut from a sheet of zinc, interleaved with pieces of paper soaked in salt water. He fixed a copper wire to each end plate, dipped the other ends of the wires in a solution of Epsom salts (magnesium sulfate), and watched. (source)

Understanding something really deeply is connected to our physical intuition. A simple “words based” understanding can only go so far. Visualizing something, in three dimensions, gives you a concrete “hook” that your brain can grasp onto and use as a model; understanding then has a physical context that it can “take place in”.
This is why Jesus speaks in parables throughout the New Testament — in ways that stick with you long after you’ve read them — rather than just stating the abstract principle. “Are not two sparrows sold for a cent? And yet not one of them will fall to the ground apart from your Father.” can stick with you forever in a way that “God watches over all living beings” will not.
Faraday, again, had this quality in spades — the book makes clear that this is partly because he was bad at mathematics and thus understood everything through the medium of experiments, and contrasts this with the French scientists (such as Ampere) who understood everything in a highly abstract way.
But Faraday’s physical intuition led him to some of the most crucial discoveries in all of science:
Much as he admired Ampère’s work, Faraday began to develop his own views on the nature of the force between a current-carrying wire and the magnetic needle it deﬂected. Ampère’s mathematics (which he had no reason to doubt) showed that the motion of the magnetic needle was the result of repulsions and attractions between it and the wire. But, to Faraday, this seemed wrong, or, at least, the wrong way around. What happened, he felt, was that the wire induced a circular force in the space around itself, and that everything else followed from this. The next step beautifully illustrates Faraday’s genius. Taking Sarah’s fourteen-year-old brother George with him down to the laboratory, he stuck an iron bar magnet into hot wax in the bottom of a basin and, when the wax had hardened, ﬁlled the basin with mercury until only the top of the magnet was exposed. He dangled a short length of wire from an insulated stand so that its bottom end dipped in the mercury, and then he connected one terminal of a battery to the top end of the wire and the other to the mercury. The wire and the mercury now formed part of a circuit that would remain unbroken even if the bottom end of the wire moved. And move it did—in rapid circles around the magnet! (source)
Being able to generate these concrete examples, even when you’re not physically doing experiments, is important.
I recently saw a striking visual representation of the “bag of words” model in NLP. If you were reading about it in the usual dry mathematical notation, and then forced yourself to come up with a visualization like this, you’d be much further on your way to really grasping the thing.
Conversely, if you’re not coming up with visuals like this, and your understanding of the thing remains on the level of equations or abstract concepts, you probably do not understand the concept deeply and should dig further.
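To make the contrast concrete, here is a minimal sketch of the bag-of-words representation itself — a generic illustration written for this essay, not the specific visualization mentioned above:

```python
def bag_of_words(documents):
    """Represent each document as a vector of word counts over a shared vocabulary."""
    vocab = sorted({word for doc in documents for word in doc.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for doc in documents:
        vec = [0] * len(vocab)
        for word in doc.lower().split():
            vec[index[word]] += 1
        vectors.append(vec)
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the cat ate the fish"])
# vocab:   ['ate', 'cat', 'fish', 'sat', 'the']
# vectors: [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```

Seeing it laid out this way makes explicit what the model throws away — all word order — which is exactly the kind of thing a good visualization lets you “see” at a glance.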
Another quality I have noticed in very intelligent people is being unafraid to look stupid.
Malcolm Gladwell on his father:
My father has zero intellectual insecurities… It has never crossed his mind to be concerned that the world thinks he’s an idiot. He’s not in that game. So if he doesn’t understand something, he just asks you. He doesn’t care if he sounds foolish. He will ask the most obvious question without any sort of concern about it… So he asks lots and lots of dumb, in the best sense of that word, questions. He’ll say to someone, ‘I don’t understand. Explain that to me.’ He’ll just keep asking questions until he gets it right, and I grew up listening to him do this in every conceivable setting. If my father had met Bernie Madoff, he would never have invested money with him because he would have said, ‘I don’t understand’ a hundred times. ‘I don’t understand how that works’, in this kind of dumb, slow voice. ‘I don’t understand, sir. What is going on?’

Most people are not willing to do this — looking stupid takes courage, and sometimes it’s easier to just let things slide. It is striking how often I find myself in a group where I start asking basic questions, feel guilty for slowing everyone down, and then discover that nobody else understood what was going on either (often people message me privately to say they’re relieved I asked) — but I was the only one who actually spoke up.
This is a habit. It’s easy to pick up. And it makes you smarter.
I remember being taught calculus at school and getting stuck on the “dy/dx” notation (aka Leibniz notation) for calculus.
The “dy/dx” looked like a fraction — it looked like we were doing division — but we weren’t actually doing division. “dy/dx” doesn’t mean “dy” divided by “dx”; it means “the value of an infinitesimal change in y with respect to an infinitesimal change in x”, and I didn’t see how you could break this thing apart as though it were simple division.
At one point the proof of the fundamental theorem of calculus involved multiplying out a polynomial, and along the way you could cancel out “dy*dx” because “both of these quantities are inﬁnitesimal, so in effect this can be cancelled out”. This reasoning did not make sense.
The “proof” of the chain rule we were given looked like this.
(Amusingly, you can even get correct results using invalid mathematics, like this. Even though this is clearly invalid, it doesn’t feel far off the “valid” proof of the chain rule I was taught.)
It turns out that my misgivings were right: the Leibniz notation is basically just a convenient shorthand, and you more or less can treat those things “as if” they were fractions, though the justification is quite involved. Moreover, the Leibniz shorthand is actually far more powerful and easier to work with than Newton’s function-based notation, which is why mainland Europe got way ahead of England (which stuck with Newton’s notation) in calculus. And the logical problems didn’t really get sorted out until Riemann came along 200 years later and reformulated calculus in terms of limits. But all of that went over my head in high school.
At the time, I was infuriated by these inadequate proofs, but I was under time pressure to just learn the operations so that I could answer exam questions because the class needed to move onto the next thing.
And since you actually can answer the exam questions and mechanically perform calculus operations without ever deeply understanding calculus, it’s much easier to just get by and do the exam without really questioning the concepts deeply — which is in fact what happens for most people. (See my essay on education.)
How many people actually go back and try and understand this, or other such topics, in a deeper way? Very few. Moreover, the ‘meta’ lesson is: don’t question it too deeply, you’ll fall behind. Just learn the algorithm, plug in the numbers, and pass your exams. Speed is of the essence. In this way, school kills the “will to understanding” in people.
My countervailing advice to people trying to understand something is: go slow. Read slowly, think slowly, really spend time pondering the thing. Start by thinking about the question yourself before reading a bunch of stuff about it. A week or a month of continuous pondering about a question will get you surprisingly far.
And you’ll have a semantic mental ‘framework’ in your brain on which to then hang all the great things you learn from your reading, which makes it more likely that you’ll retain that stuff as well. I read somewhere that Bill Gates structures his famous “reading weeks” around an outline of important questions he’s thought about and broken down into pieces. e.g. he’ll think about “water scarcity” and then break it down into questions like “how much water is there in the world?”, “where does existing drinking water come from?”, “how do you turn ocean water into drinking water”, etc., and only then will he pick reading to address those questions.
This method is far more effective than just reading random things and letting them pass through you.
The best thing I have read on really understanding things is the Sequences, especially the section on Noticing Confusion.
There are some mantra-like questions it can be helpful to ask as you’re thinking through things. Some examples:
* But what exactly is X? What is it? (h/t Laura Deming’s post)
* Why must X be true? Why does this have to be the case? What is the single, fundamental reason?
* Do I really believe that this is true, deep down? Would I bet a large amount of money on it with a friend?
First, Ezra Pound’s parable of Agassiz, from his “ABC of Reading” (incidentally one of the most underrated books about literature). I’ve preserved his quirky formatting:

No man is equipped for modern thinking until he has understood the anecdote of Agassiz and the fish:
A post-graduate student equipped with honours and diplomas went to Agassiz to receive the ﬁnal and ﬁnishing touches.
The great man offered him a small ﬁsh and told him to describe it.
Post-Graduate Student: “That’s only a sun-ﬁsh”
Agassiz: “I know that. Write a description of it.”
After a few minutes the student returned with the description of the Ichthus Heliodiplodokus, or whatever term is used to conceal the common sunﬁsh from vulgar knowledge, family of Heliichterinkus, etc., as found in textbooks of the subject.
Agassiz again told the student to describe the ﬁsh.
The student produced a four-page essay.
Agassiz then told him to look at the fish. At the end of three weeks the fish was in an advanced state of decomposition, but the student knew something about it.

The second, one of my favorite passages from “Zen and the Art of Motorcycle Maintenance”:
He’d been having trouble with students who had nothing to say. At ﬁrst he thought it was laziness but later it became apparent that it wasn’t. They just couldn’t think of anything to say.
One of them, a girl with strong-lensed glasses, wanted to write a five-hundred-word essay about the United States. He was used to the sinking feeling that comes from statements like this, and suggested without disparagement that she narrow it down to just Bozeman.
When the paper came due she didn’t have it and was quite upset. She had tried and tried but she just couldn’t think of anything to say.
He had already discussed her with her previous instructors and they’d conﬁrmed his impressions of her. She was very serious, disciplined and hardworking, but extremely dull. Not a spark of creativity in her anywhere. Her eyes, behind the thick-lensed glasses, were the eyes of a drudge. She wasn’t bluffing him, she really couldn’t think of anything to say, and was upset by her inability to do as she was told.
It just stumped him. Now he couldn’t think of anything to say. A silence occurred, and then a peculiar answer: “Narrow it down to the main street of Bozeman.” It was a stroke of insight.
She nodded dutifully and went out. But just before her next class she came back in real distress, tears this time, distress that had obviously been there for a long time. She still couldn’t think of anything to say, and couldn’t understand why, if she couldn’t think of anything about all of Bozeman, she should be able to think of something about just one street.
He was furious. “You’re not looking!” he said. A memory came back of his own dismissal from the University for having too much to say. For every fact there is an inﬁnity of hypotheses. The more you look the more you see. She really wasn’t looking and yet somehow didn’t understand this.
He told her angrily, “Narrow it down to the front of one building on the main street of Bozeman. The Opera House. Start with the upper left-hand brick.”
Her eyes, behind the thick-lensed glasses, opened wide. She came in the next class with a puzzled look and handed him a five-thousand-word essay on the front of the Opera House on the main street of Bozeman, Montana. “I sat in the hamburger stand across the street,” she said, “and started writing about the first brick, and the second brick, and then by the third brick it all started to come and I couldn’t stop. They thought I was crazy, and they kept kidding me, but here it all is. I don’t understand it.”
Neither did he, but on long walks through the streets of town he thought about it and concluded she was evidently stopped with the same kind of blockage that had paralyzed him on his ﬁrst day of teaching. She was blocked because she was trying to repeat, in her writing, things she had already heard, just as on the ﬁrst day he had tried to repeat things he had already decided to say. She couldn’t think of anything to write about Bozeman because she couldn’t recall anything she had heard worth repeating. She was strangely unaware that she could look and see freshly for herself, as she wrote, without primary regard for what had been said before. The narrowing down to one brick destroyed the blockage because it was so obvious she had to do some original and direct seeing.
The point of both of these parables: nothing beats direct experience. Get the data yourself. This is why I wanted to analyze the coronavirus genome directly, for example. You develop some basis in reality by getting some ﬁrst-hand data, and reasoning up from there, versus starting with somebody else’s lossy compression of a messy, evolving phenomenon and then wondering why events keep surprising you.
People who have not experienced the thing are unlikely to be generating truth. More likely, they’re resurfacing cached thoughts and narratives. Reading popular science books or news articles is not a substitute for understanding, and may make you stupider, by ﬁlling your mind with narratives and stories that don’t represent your own synthesis.
Even if you can’t experience the thing directly, try going for information-dense sources with high amounts of detail and facts, and then reason up from those facts. On foreign policy, read books published by university presses — not The Atlantic or The Economist or whatever. You can read those after you’ve developed a model of the thing yourself, against which you can judge the popular narratives.
Another thing the parable about the bricks tells us: understanding is not a binary “yes/no”. It has layers of depth. My friend understood Pythagoras’s theorem far more deeply than I did; he could prove it six different ways and had simply thought about it for longer.
The simplest things can reward close study. Michael Nielsen has a nice example of this — the equals sign:
I ﬁrst really appreciated this after reading an essay by the mathematician Andrey Kolmogorov. You might suppose a great mathematician such as Kolmogorov would be writing about some very complicated piece of mathematics, but his subject was the humble equals sign: what made it a good piece of notation, and what its deﬁciencies were. Kolmogorov discussed this in loving detail, and made many beautiful points along the way, e.g., that the invention of the equals sign helped make possible notions such as equations (and algebraic manipulations of equations).
Prior to reading the essay I thought I understood the equals sign. Indeed, I would have been offended by the suggestion that I did not. But the essay showed convincingly that I could understand the equals sign much more deeply. (link)

The photographer Robert Capa advised beginning photographers: “If your pictures aren’t good enough, you’re not close enough”. (This is good fiction writing advice, by the way.)
It is also good advice for understanding things. When in doubt, go closer.
Thanks to Jose-Luis Ricon for reading a draft of this essay.
Follow me on Twitter: @nabeelqu
A talk at Hydra, online (originally planned to be in Moscow, Russia), 06 Jul 2020
Conﬂict-free Replicated Data Types (CRDTs) are an increasingly popular family of algorithms for optimistic replication. They allow data to be concurrently updated on several replicas, even while those replicas are ofﬂine, and provide a robust way of merging those updates back into a consistent state. CRDTs are used in geo-replicated databases, multi-user collaboration software, distributed processing frameworks, and various other systems.
However, while the basic principles of CRDTs are now quite well known, many challenging problems are lurking below the surface. It turns out that CRDTs are easy to implement badly. Many published algorithms have anomalies that cause them to behave strangely in some situations. Simple implementations often have terrible performance, and making the performance good is challenging.
In this talk Martin goes beyond the introductory material on CRDTs, and discusses some of the hard-won lessons from years of research on making CRDTs work in practice.
A browser is an incredibly complex piece of software. With such enormous complexity, the only way to maintain a rapid pace of development is through an extensive CI system that can give developers conﬁdence that their changes won’t introduce bugs. Given the scale of our CI, we’re always looking for ways to reduce load while maintaining a high standard of product quality. We wondered if we could use machine learning to reach a higher degree of efﬁciency.
At Mozilla we have around 50,000 unique test files. Each contains many test functions. These tests need to run on all our supported platforms (Windows, Mac, Linux, Android) against a variety of build configurations (PGO, debug, ASan, etc.), with a range of runtime parameters (site isolation, WebRender, multi-process, etc.).
While we don’t test against every possible combination of the above, there are still over 90 unique conﬁgurations that we do test against. In other words, for each change that developers push to the repository, we could potentially run all 50k tests 90 different times. On an average work day we see nearly 300 pushes (including our testing branch). If we simply ran every test on every conﬁguration on every push, we’d run approximately 1.35 billion test ﬁles per day! While we do throw money at this problem to some extent, as an independent non-proﬁt organization, our budget is ﬁnite.
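The arithmetic behind that 1.35 billion figure is simple enough to check:

```python
tests = 50_000    # unique test files
configs = 90      # unique configurations tested against
pushes = 300      # pushes on an average work day, including the testing branch

print(tests * configs * pushes)  # 1350000000 — i.e. 1.35 billion test files per day
```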
So how do we keep our CI load manageable? First, we recognize that some of those ninety unique conﬁgurations are more important than others. Many of the less important ones only run a small subset of the tests, or only run on a handful of pushes per day, or both. Second, in the case of our testing branch, we rely on our developers to specify which conﬁgurations and tests are most relevant to their changes. Third, we use an integration branch.
Basically, when a patch is pushed to the integration branch, we only run a small subset of tests against it. We then periodically run everything and employ code sheriffs to ﬁgure out if we missed any regressions. If so, they back out the offending patch. The integration branch is periodically merged to the main branch once everything looks good.
A subset of the tasks we run on a single mozilla-central push. The full set of tasks was too hard to distinguish when scaled to fit in a single image.
These methods have served us well for many years, but it turns out they’re still very expensive. Even with all of these optimizations our CI still runs around 10 compute years per day! Part of the problem is that we have been using a naive heuristic to choose which tasks to run on the integration branch. The heuristic ranks tasks based on how frequently they have failed in the past. The ranking is unrelated to the contents of the patch. So a push that modiﬁes a README ﬁle would run the same tasks as a push that turns on site isolation. Additionally, the responsibility for determining which tests and conﬁgurations to run on the testing branch has shifted over to the developers themselves. This wastes their valuable time and tends towards over-selection of tests.
About a year ago, we started asking ourselves: how can we do better? We realized that the current implementation of our CI relies heavily on human intervention. What if we could instead correlate patches to tests using historical regression data? Could we use a machine learning algorithm to ﬁgure out the optimal set of tests to run? We hypothesized that we could simultaneously save money by running fewer tests, get results faster, and reduce the cognitive burden on developers. In the process, we would build out the infrastructure necessary to keep our CI pipeline running efﬁciently.
The main prerequisite to a machine-learning-based solution is collecting a large and precise enough regression dataset. On the surface this appears easy. We already store the status of all test executions in a data warehouse called ActiveData. But in reality, it’s very hard to do for the reasons below.
Since we only run a subset of tests on any given push (and then periodically run all of them), it’s not always obvious when a regression was introduced. Consider the following scenario:

            Patch 1    Patch 2    Patch 3
  Test A    PASS       FAIL       FAIL
  Test B    PASS       not run    FAIL
It is easy to see that the “Test A” failure was regressed by Patch 2, as that’s where it ﬁrst started failing. However with the “Test B” failure, we can’t really be sure. Was it caused by Patch 2 or 3? Now imagine there are 8 patches in between the last PASS and the ﬁrst FAIL. That adds a lot of uncertainty!
Intermittent (aka ﬂaky) failures also make it hard to collect regression data. Sometimes tests can both pass and fail on the same codebase for all sorts of different reasons. It turns out we can’t be sure that Patch 2 regressed “Test A” in the table above after all! That is unless we re-run the failure enough times to be statistically conﬁdent. Even worse, the patch itself could have introduced the intermittent failure in the ﬁrst place. We can’t assume that just because a failure is intermittent that it’s not a regression.
The writers of this post having a hard time.
In order to solve these problems, we have built quite a large and complicated set of heuristics to predict which regressions are caused by which patch. For example, if a patch is later backed out, we check the status of the tests on the backout push. If they’re still failing, we can be pretty sure the failures were not due to the patch. Conversely, if they start passing we can be pretty sure that the patch was at fault.
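As a sketch of that backout heuristic (the function and data shapes here are illustrative, not Mozilla’s actual code):

```python
def classify_failure(test, status_with_patch, status_on_backout):
    """Backout heuristic sketch: compare a test's status on the original push
    with its status on the push that backed the patch out. Arguments are
    dicts mapping test names to 'PASS'/'FAIL' (an assumed data shape)."""
    failed_with_patch = status_with_patch.get(test) == "FAIL"
    failed_on_backout = status_on_backout.get(test) == "FAIL"
    if failed_with_patch and failed_on_backout:
        # Still failing with the patch gone: probably not this patch's fault.
        return "likely-not-regression"
    if failed_with_patch and not failed_on_backout:
        # Failure disappears once the patch is removed: probably a real regression.
        return "likely-regression"
    return "unknown"

print(classify_failure("test_a", {"test_a": "FAIL"}, {"test_a": "PASS"}))
# likely-regression
```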
Some failures are classified by humans. This can work to our advantage. Part of the code sheriff’s job is annotating failures (e.g. “intermittent”, or “fixed by commit” for failures fixed at some later point). These classifications are a huge help in finding regressions in the face of missing or intermittent tests. Unfortunately, due to the sheer number of patches and failures happening continuously, 100% accuracy is not attainable. So we even have heuristics to evaluate the accuracy of the classifications!
Another trick for handling missing data is to backﬁll missing tests. We select tests to run on older pushes where they didn’t initially run, for the purpose of ﬁnding which push caused a regression. Currently, sheriffs do this manually. However, there are plans to automate it in certain circumstances in the future.
We also need to collect data about the patches themselves, including ﬁles modiﬁed and the diff. This allows us to correlate with the test failure data. In this way, the machine learning model can determine the set of tests most likely to fail for a given patch.
Collecting data about patches is way easier, as it is totally deterministic. We iterate through all the commits in our Mercurial repository, parsing patches with our rust-parsepatch project and analyzing source code with our rust-code-analysis project.
Now that we have a dataset of patches and associated tests (both passes and failures), we can build a training set and a validation set to teach our machines how to select tests for us.
90% of the dataset is used as a training set, 10% is used as a validation set. The split must be done carefully. All patches in the validation set must be posterior to those in the training set. If we were to split randomly, we’d leak information from the future into the training set, causing the resulting model to be biased and artiﬁcially making its results look better than they actually are.
For example, consider a test which had never failed until last week and has failed a few times since then. If we train the model with a randomly picked training set, we might ﬁnd ourselves in the situation where a few failures are in the training set and a few in the validation set. The model might be able to correctly predict the failures in the validation set, since it saw some examples in the training set.
In a real-world scenario though, we can’t look into the future. The model can’t know what will happen in the next week, but only what has happened so far. To evaluate properly, we need to pretend we are in the past, and future data (relative to the training set) must be inaccessible.
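The chronological split is straightforward once the data is ordered by time; a minimal sketch (the tuple shape is an assumption for illustration):

```python
def chronological_split(patches, train_fraction=0.9):
    """Split by time rather than at random: every validation patch is strictly
    later than every training patch, so nothing leaks from the future into
    the training set. `patches` is a list of (push_timestamp, features, label)
    tuples."""
    ordered = sorted(patches, key=lambda p: p[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

train, valid = chronological_split([(t, {}, 0) for t in range(100)])
# 90 training patches, all strictly earlier than the 10 validation patches
```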
Visualization of our split between training and validation set.
We train an XGBoost model, using features of the test, of the patch, and of the links between them, e.g.:
* In the past, how often did this test fail when the same ﬁles were touched?
* How far in the directory tree are the source ﬁles from the test ﬁles?
* How often in the VCS history were the source ﬁles modiﬁed together with the test ﬁles?
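As an illustration of the second feature, directory distance could be computed along these lines (the file paths and the exact metric are assumptions, not the production implementation):

```python
def directory_distance(source_file, test_file):
    """How far apart in the directory tree a source file and a test file sit:
    the number of directory steps between them via their deepest common
    ancestor (an illustrative choice of metric)."""
    src_dirs = source_file.split("/")[:-1]
    test_dirs = test_file.split("/")[:-1]
    common = 0
    for a, b in zip(src_dirs, test_dirs):
        if a != b:
            break
        common += 1
    return (len(src_dirs) - common) + (len(test_dirs) - common)

print(directory_distance("dom/media/AudioStream.cpp",
                         "dom/media/tests/test_audio.py"))  # 1
```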
The input to the model is a tuple (TEST, PATCH), and the label is a binary FAIL or NOT FAIL. This means we have a single model that is able to take care of all tests. This architecture allows us to exploit the commonalities between test selection decisions in an easy way. A normal multi-label model, where each test is a completely separate label, would not be able to extrapolate the information about a given test and apply it to another completely unrelated test.
Given that we have tens of thousands of tests, even if our model were 99.9% accurate (which is pretty accurate — just one error every 1000 evaluations), we’d still be making mistakes on pretty much every patch! Luckily, the cost of a false positive (a test which is selected by the model for a given patch but does not fail) is not as high in our domain as it would be if, say, we were trying to recognize faces for policing purposes. The only price we pay is running some useless tests. At the same time we avoided running hundreds of unnecessary ones, so the net result is a huge saving!
As developers periodically switch what they are working on, the dataset we train on evolves. So we currently retrain the model every two weeks.
After we have chosen which tests to run, we can further improve the selection by choosing where the tests should run. In other words, the set of conﬁgurations they should run on. We use the dataset we’ve collected to identify redundant conﬁgurations for any given test. For instance, is it really worth running a test on both Windows 7 and Windows 10? To identify these redundancies, we use a solution similar to frequent itemset mining:
* Collect failure statistics for groups of tests and configurations.
* Calculate the “support” as the number of pushes in which both X and Y failed, over the number of pushes in which they both ran.
* Calculate the “confidence” as the number of pushes in which both X and Y failed, over the number of pushes in which they both ran and only one of the two failed.
We only select conﬁguration groups where the support is high (low support would mean we don’t have enough proof) and the conﬁdence is high (low conﬁdence would mean we had many cases where the redundancy did not apply).
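These two statistics are cheap to compute once failures are grouped per push. A minimal sketch, with made-up push data and the confidence denominator taken as pushes where at least one of the pair failed:

```python
def redundancy_stats(pushes, x, y):
    """Support/confidence for a pair of configurations, itemset-mining style.

    Each push is a dict: {"ran": set of configs, "failed": set of configs}.
    """
    both_ran = [p for p in pushes if {x, y} <= p["ran"]]
    both_failed = sum(1 for p in both_ran if {x, y} <= p["failed"])
    any_failed = sum(1 for p in both_ran if p["failed"] & {x, y})
    support = both_failed / len(both_ran) if both_ran else 0.0
    confidence = both_failed / any_failed if any_failed else 0.0
    return support, confidence

# Hypothetical history for one test across four pushes:
pushes = [
    {"ran": {"win7", "win10"}, "failed": {"win7", "win10"}},  # redundant failure
    {"ran": {"win7", "win10"}, "failed": set()},
    {"ran": {"win7", "win10"}, "failed": {"win7", "win10"}},  # redundant failure
    {"ran": {"win7", "win10"}, "failed": {"win10"}},          # non-redundant failure
]
```

On this data support is 0.5 (both failed in 2 of 4 shared pushes) and confidence is 2/3 (of the 3 pushes with any failure, 2 failed on both), so the pair would only be treated as redundant if our thresholds were lower than that.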
Once we have the set of tests to run, information on whether their results are configuration-dependent or not, and a set of machines (with their associated cost) on which to run them, we can formulate a mathematical optimization problem, which we solve with a mixed-integer programming solver. This way, we can easily change the optimization objective we want to achieve without invasive changes to the optimization algorithm. At the moment, the optimization objective is to select the cheapest configurations on which to run the tests.
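At a small scale, that objective reduces to a weighted set-cover: pick the cheapest set of configurations such that every selected test has at least one configuration it can run on. A brute-force sketch (a MIP solver handles the real scale; configuration names and costs here are made up):

```python
from itertools import combinations

# Hypothetical configurations with per-run costs.
configs = {"linux-opt": 1.0, "windows10": 3.0, "macos": 4.0}

# Configurations on which each selected test's result is valid
# (configuration-independent tests can run anywhere).
can_run = {
    "test_a": {"linux-opt", "windows10", "macos"},  # configuration-independent
    "test_b": {"windows10", "macos"},               # platform-specific
}

def cheapest_cover(configs, can_run):
    """Cheapest subset of configurations covering every test."""
    names = list(configs)
    best, best_cost = None, float("inf")
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            # Every test must be runnable on at least one chosen config.
            if all(set(subset) & allowed for allowed in can_run.values()):
                cost = sum(configs[c] for c in subset)
                if cost < best_cost:
                    best, best_cost = set(subset), cost
    return best, best_cost
```

Here the solver-equivalent answer is to run everything on windows10 alone: it covers both tests for a cost of 3.0, beating the linux-opt + anything combinations that test_b rules out.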
A machine learning model is only as useful as a consumer’s ability to use it. To that end, we decided to host a service on Heroku using dedicated worker dynos to service requests and Redis Queues to bridge between the backend and frontend. The frontend exposes a simple REST API, so consumers need only specify the push they are interested in (identiﬁed by the branch and topmost revision). The backend will automatically determine the ﬁles changed and their contents using a clone of mozilla-central.
Depending on the size of the push and the number of pushes in the queue to be analyzed, the service can take several minutes to compute the results. We therefore ensure that we never queue up more than a single job for any given push. We cache results once computed. This allows consumers to kick off a query asynchronously, and periodically poll to see if the results are ready.
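That queueing discipline — at most one in-flight job per push, with results cached for later polls — can be sketched with an in-memory stand-in for the Heroku dynos and Redis queues (class and method names here are hypothetical):

```python
class SchedulingService:
    """Toy model of the dedupe-and-cache discipline described above."""

    def __init__(self, compute):
        self.compute = compute  # expensive analysis of a push
        self.cache = {}         # push -> computed result
        self.queued = set()     # pushes with a pending job

    def request(self, push):
        """What the REST frontend does on each (possibly repeated) poll."""
        if push in self.cache:
            return {"status": "ready", "result": self.cache[push]}
        # Never queue more than a single job for any given push.
        self.queued.add(push)
        return {"status": "pending"}

    def work(self):
        """What a backend worker does: drain the queue, cache results."""
        while self.queued:
            push = self.queued.pop()
            self.cache[push] = self.compute(push)
```

A consumer kicks off the query, gets "pending" on every poll (without piling up duplicate jobs), and gets the cached result once a worker has finished.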
We currently use the service when scheduling tasks on our integration branch. It’s also used when developers run the special mach try auto command to test their changes on the testing branch. In the future, we may also use it to determine which tests a developer should run locally.
Sequence diagram depicting the communication between the various actors in our infrastructure.
From the outset of this project, we felt it was crucial that we be able to run and compare experiments, measure our success and be conﬁdent that the changes to our algorithms were actually an improvement on the status quo. There are effectively two variables that we care about in a scheduling algorithm:
The amount of resources used (measured in hours or dollars).
The regression detection rate. That is, the percentage of introduced regressions that were caught directly on the push that caused them. In other words, we didn’t have to rely on a human to backﬁll the failure to ﬁgure out which push was the culprit.
scheduler effectiveness = 1000 * regression detection rate / hours per push
The higher this metric, the more effective a scheduling algorithm is. Now that we had our metric, we invented the concept of a “shadow scheduler”. Shadow schedulers are tasks that run on every push, which shadow the actual scheduling algorithm. Only rather than actually scheduling things, they output what they would have scheduled had they been the default. Each shadow scheduler may interpret the data returned by our machine learning service a bit differently. Or they may run additional optimizations on top of what the machine learning model recommends.
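The metric itself is trivial to compute; as a sketch (the 1000 is just a scaling constant to make the numbers readable, and the sample inputs are made up):

```python
def scheduler_effectiveness(regression_detection_rate, hours_per_push):
    """Higher is better: more regressions caught per unit of compute."""
    return 1000 * regression_detection_rate / hours_per_push

# A scheduler catching 95% of regressions at 400 machine-hours per push:
scheduler_effectiveness(0.95, 400)  # 2.375
```

Note the metric deliberately rewards both directions of improvement: a shadow scheduler that catches the same regressions with fewer hours scores higher, and so does one that spends the same hours but catches more.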
Finally we wrote an ETL to query the results of all these shadow schedulers, compute the scheduler effectiveness metric of each, and plot them all in a dashboard. At the moment, there are about a dozen different shadow schedulers that we’re monitoring and ﬁne-tuning to ﬁnd the best possible outcome. Once we’ve identiﬁed a winner, we make it the default algorithm. And then we start the process over again, creating further experiments.
The early results of this project have been very promising. Compared to our previous solution, we’ve reduced the number of test tasks on our integration branch by 70%! Compared to a CI system with no test selection, by almost 99%! We’ve also seen pretty fast adoption of our mach try auto tool, suggesting a usability improvement (since developers no longer need to think about what to select). But there is still a long way to go!
We need to improve the model’s ability to select conﬁgurations and default to that. Our regression detection heuristics and the quality of our dataset needs to improve. We have yet to implement usability and stability ﬁxes to mach try auto.
And while we can’t make any promises, we’d love to package the model and service up in a way that is useful to organizations outside of Mozilla. Currently, this effort is part of a larger project that contains other machine learning infrastructure originally created to help manage Mozilla’s Bugzilla instance. Stay tuned!
If you’d like to learn more about this project or Firefox’s CI system in general, feel free to ask on our Matrix channel, #ﬁrefox-ci:mozilla.org.
Let’s ﬁnd out how well you know computers! All of these programs have a variable NUMBER in them. Your mission: guess how big NUMBER needs to get before the program takes 1 second to run.
You don’t need to guess exactly: they’re all between 1 and a billion. Just try to guess the right order of magnitude! A few notes:
* If the answer is 38,000, both 10,000 and 100,000 are considered correct answers. The goal is to not be wrong by more than 10x :)
* We know computers have different disk & network & CPU speeds! We're trying to get you to tell the difference between code that can run 10 times/s and 100,000 times/s. A newer computer won't make your code run 1000x faster :)
* That said, all this was run on a new laptop with a fast SSD and a sketchy network connection. The C code was compiled with gcc -O2.
Good luck! We were surprised by a lot of these. We’ll be anonymously collecting your answers, so expect some graphs in the future! =D
Our routing algorithm prefers paths that go through parks, forests or by water, and avoids busy roads wherever possible.
Here are a few things you can do with Trail Router:
Manually create your own point-to-point route that prefers nature.
Choose whether you’d prefer routes that involve nature, well-lit streets or a lack of hills.
We’re not perfect! Help improve Trail Router by reporting an issue here.
Dr. Lam explained that “in places like Singapore, we want to keep the windows open as much as possible” to reduce the use of carbon-intensive air-conditioners and to prevent buildup of stale air that can pose health risks for some people.
But with windows open, the constant din from city trafﬁc, trains, jets passing overhead and construction equipment can rattle apartments. The Anti-Noise Control Window, as it is called, is the sonic equivalent of shutting a window.
With any sound, the best way to reduce it is at the source, like a gun’s silencer. So the researchers treated the window aperture itself as the noise source, because most noise enters a room that way.
The system uses a microphone outside the window to detect the repeating sound waves of the offending noise source, which is registered by a computer controller. That in turn deciphers the proper wave frequency needed to neutralize the sound, which is transmitted to the array of speakers on the inside of the window frame.
The speakers then emit the proper “anti” waves, which cancel out the incoming waves, and there you have it: near blissful silence.
“If you sit in the room, you get that same feeling like when you ﬂick on the switch of noise-canceling earphones,” Dr. Lam said, splaying his hands to denote the calming effect.
The system is best at attenuating the audible blasts from the types of steady noise sources found within the optimal frequency range.
Darwin is the Open Source operating system from Apple that forms the basis for Mac OS X and PureDarwin. PureDarwin is a community project that aims to make Darwin more usable (some people think of it as the informal successor to OpenDarwin).
One current goal of this project is to provide a useful bootable ISO/VM of Darwin 10.x
Come join our forum over at https://www.pd-devs.org/
See the Wiki for more information.
This is the once-a-week free edition of The Diff, the newsletter about inﬂections in ﬁnance and technology. The free edition goes out to 8,221 subscribers, up 211 week-over-week. This week’s subscribers-only posts:
* The UK as a Science Hub is an update on Boris Johnson’s plan (or, if you prefer, Dominic Cumming’s scheme) to make Britain a scientiﬁc powerhouse. The outlines of the plan aren’t new, but the opportunity is.
* The Equity Risk Premium at 0% Interest looks at the implications of low real rates for tech companies. In equilibrium, low rates are good for equities because they raise the present value of future cash ﬂows. But another way of saying this is that, in ﬁnancial terms, low rates mean the future happens all at once.
* Globalization: A Toy Story is a prequel to today's note, discussing the history of Hong Kong's toy industry. Hong Kong's toy industry was basically nonexistent in 1945, the biggest in the world by 1972, and consistently lost share to China from the 80s onward. It's a case study in how globalization works.
* The Depressing Bull Thesis for Rocket Mortgage is a writeup of Rocket, the largest mortgage originator in the US, which recently ﬁled to go public. Fewer red ﬂags than expected, but it’s partly driven by a dire ﬁnancial bet.
* Why Are Toys Such a Bad Business?
* V-Shaped Recovery is here… just not evenly distributed.
Early-stage investors sometimes use the heuristic that if a product gets derided as a toy, it’s worth investing in. That model would have gotten you into PCs in the 70s, the Internet in the early 90s, social networks when the good ones were privately-held, cryptocurrencies, and drones. The risk is investing in actual toy companies, which is usually a terrible decision. Hasbro stock hasn’t done anything for half a decade, and Mattel trades where it did in the early 90s. JAKKS Paciﬁc has destroyed most of its shareholders’ wealth, and Funko is working on the same.
This is not a new phenomenon, either. The biggest toy company in the US in the 50s was Louis Marx & Company, whose founder made the cover of Time. Sales declined slightly over the next decade, and faster after that; the company was bankrupt in 1980. Coleco rode the Cabbage Patch Kids trend in the mid-80s—in 1985, they had the highest return on equity of any company in the Fortune 500—but they were bankrupt by 1988.
The record is no better for retailers. Toys R Us is bankrupt, of course, and they followed FAO Schwarz, KB Toys, Right Start, and Zany Brainy.
The toy industry has not been kind to investors, at any level.
There are a few reasons, and a few relevant lessons.
First, the toy business operates on an annual cycle. Historically, about 40% of toy sales happen during the holiday season, and about half of those were in the two weeks before Christmas. (That’s a dated statistic, from about twenty years ago. Discretionary retail sales in general have gotten spikier, since more shoppers are used to fast, free shipping. With Amazon Prime, the Christmas shopping season starts on December 22nd or so.) 84% of US toy sales come from China, transported by a mix of ships and air freight, so they need to be ordered months in advance.
And they have to be marketed: while cheap toys can compete on price, the higher-margin ones only get sold when there’s an effective ad campaign. Mattel created this model (and overturned Louis Marx’s price-ﬁrst approach) when they spent their entire net worth on a one-year sponsorship of The Mickey Mouse Club in 1955.
TV ad campaigns, too, tend to be purchased in advance. About half of TV ad spending is allocated to the upfronts—booked March through May to be delivered by the end of the year.
This locks toy companies into a challenging bet. Every year, they have to a) predict trends, b) invent them, and c) commit capital to them. All without knowing how the rest of the year will turn out. Since toy trends exist, but don’t last for very long, they have to invent new products every year—but the technological state of the art doesn’t advance very fast. It has all the volatility of tech, without the progress.
A handful of companies have made serious money in toys, or, rather, in toy-like or toy-adjacent businesses. The video game industry has generally done well. Disney turns a profit. And Games Workshop has a nice little business (I wrote it up in The Diff in April; note that I've since sold the stock, just for valuation reasons). Lego, too, is a great business, worth an estimated $15bn.
What these companies have in common is that they escape the demographic trap that toy manufacturers are locked into. Every year, there’s a new cohort of six-year-olds, and they need something that a) didn’t exist last year, but b) appeals to timeless six-year-old sensibilities. They don’t have much brand loyalty, because a year later the same toy is a toy for little kids. Each of these successful companies beats that in a different way:
* Video games’ average age has trended older over time, so instead of marketing to more trend-sensitive young people, they’re marketing to more dollar-insensitive not-so-young people.
* Disney has a generational loop, of which toys are a small part. Movies and streaming video get kids hooked on Disney characters, which can be monetized at much higher dollar values through their parks. (See my writeup here for much more.)
* Games Workshop and Lego have a very healthy product dynamic: the ones you already own are an economic complement to the ones you buy. And Lego clearly designs their marketing around hitting two generations, too: at the Lego Store, $30 Rise of Skywalker-themed Lego sets are at a kids’ eye level. The $800 set based on the original trilogy is positioned at an adult’s eye level.
These companies have something else in common: they own their core intellectual property. Video game publishers do make games based on superheroes and sports leagues, and Lego certainly has branded sets, but the core of each business is IP owned by the company itself; the licensed products are a lucrative side business: it’s much easier for a video game company to re-skin characters than for a movie company to start a video game studio. As a case study, one popular game was originally intended to be set in the Game of Thrones universe, but ended up using in-house IP instead. Disney, of course, sells toys based on its own characters. And while some Lego sets are associated with outside brands at the point of purchase, they inevitably end up being fungible with other Legos.
Because it’s a hit-driven industry, toy companies that succeed can be immensely profitable for a while. The problem is that the difference between a cultural landmark and a fad is visible after a decade or so, while the decision of how much to order and how much to spend on marketing has to happen every year regardless. So toy companies with a hit product in year N tend to be bankrupt companies writing down the value of their inventory to ~$0 in year N+3 or N+5.
You shouldn’t have to take big risks to make big returns. So when data from Citibank shows that art has outperformed the S&P by 180% since 2000 with the least volatility of any major asset class—we’re inclined to notice.
The ultra-wealthy have invested in art for centuries, to the tune of over $1.7 trillion in total value—so why can’t the rest of us?
Masterworks lets anyone invest in paintings by some of the most successful artists in history like Banksy, Warhol, Basquiat, and more, in just a few clicks. The only catch? There's currently a backlog of over 25,000 people applying for membership, but you can skip the waitlist by signing up today.*
… just not evenly distributed. And not likely to last. In Japan, Uniqlo expects Japanese sales to be up 25% Y/Y in their August quarter, after a 15% decline last quarter. Their regional estimates are very much virus-driven, with optimism in China and pessimism in countries seeing a second wave, or, in the US’s case, a 1.5th wave. And worldwide PC shipments grew in Q2, mostly due to the one-time build-out of home ofﬁces and Zoom-based schools. In China, auto sales were up 10% in Q2 ($), and China’s copper smelting is also rising ($).
Inventory restocking used to be a significant driver of GDP growth: when the economy slowed, companies had too much inventory on hand, and had to cut jobs to work through the excess. Once they ran out, they had to rehire fast. Now, supply and demand for manufacturing are located in different places (with different policies), and companies are more averse to holding inventory for long periods, so this model isn’t as descriptive or predictive as it once was. When the people getting ﬁred are the ones providing demand, it’s easy for a recession to feed on itself, and easy for a recovery to bootstrap itself, too. When those groups are in different countries, and when the swings in inventory are more muted, it’s less of a factor, leading to fewer recessions but much slower rebounds.
The Indonesian government is strapped for cash, and needs to spend heavily to mitigate the effects of Covid-19. But the government is not great at collecting taxes (taxes are 11-12% of GDP; for comparison, Mexico and the Netherlands have similar-sized economies, and collect 16% and 39%, respectively). But tech companies are great at collecting taxes on online commerce, and tend to charge close to the Laffer peak. So Indonesia is outsourcing taxation to them ($) by imposing a 10% value-added tax on large Internet companies. Google, Facebook, and Netflix have built their own "tax collection" apparatus, and are better at catching tax-evaders and charging the right amount. As it turns out, tax-farming wasn't a terrible idea, just a few centuries early.
In other tax news, Chinese mainlanders working in Hong Kong suddenly owe the mainland’s 45% tax rates rather than Hong Kong’s 15%. China seems to alternate—on a daily basis—between wanting Hong Kong to be a ﬁnancial center they control and wanting to use their control to end Hong Kong’s status as a ﬁnancial center.
The dollar, by virtue of being the world’s most-used currency, is the currency that least represents how currencies work. Since it’s a reserve currency, dollars are demanded by people who don’t earn them or spend them, but who know they’ll need them, so the US has less control over the value of its money than any other place. To paraphrase John Connally, it’s everyone’s currency but America’s problem.
For example, there’s no way the US could get away with this ($):
Africa’s most populous nation has long maintained several exchange rates. In addition to the interbank and black-market rates, there are ofﬁcial rates for consumers wanting dollars for school and medical fees abroad, for Muslims making the pilgrimage to Saudi Arabia, and for people wishing to buy hard currency at exchange bureaux.
That’s a very clever setup. Smaller countries can use a tiered exchange-rate system to moderately encourage or discourage certain behaviors, or to dole out favors to particular groups. Dollars are so liquid, and used in so many places, that the US has to take a more binary approach, of allowing or banning transactions; taxes get routed around.
Alpha Architect has a negative view on treasuries, arguing that they’re not a good diversiﬁer and that yields are too low to justify owning them. The piece goes into detail on how treasuries function as insurance (not always!), and how they’re mostly owned by price-insensitive buyers like regulated insurance companies and central banks. All true. But the most important line in the piece is: “Okay, Treasuries Aren’t Compelling: What Are My Alternatives? Answer: Nothing.” Investments are always expressed in relative terms. In an aging world, we shouldn’t expect anything to be cheap because there’s so much demand for savings.
The Chinese CSI 300 index dropped 1.8% in the last session, and it's now up only 14% since late June. Bloomberg profiles the wild market, with plenty of pull quotes from new investors ("There's no way I can lose," sounds like a classic bull market line, but there's a tinge of desperation there). One company, QuantumCTek, rose 1,000% in its IPO ($).
2K games is trying to push video game prices above the de facto ceiling of $60/copy. Video game prices have been declining in real terms, in part because of cheaper manufacturing and distribution. As the video game industry gets more mature, predicting sales for any given title gets easier, which encourages publishers to invest more in production. So cost deﬂation in one part of the market is offset by cost inﬂation in another.