10 interesting stories served every morning and every evening.
The first thing I usually do when I pick up a new codebase isn’t opening the code. It’s opening a terminal and running a handful of git commands. Before I look at a single file, the commit history gives me a diagnostic picture of the project: who built it, where the problems cluster, whether the team is shipping with confidence or tiptoeing around land mines.
The 20 most-changed files in the last year. The file at the top is almost always the one people warn me about. “Oh yeah, that file. Everyone’s afraid to touch it.”
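Something along these lines produces the list (exact flags vary by taste):

    git log --since="1 year ago" --name-only --pretty=format: \
      | grep -v '^$' | sort | uniq -c | sort -rn | head -20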
High churn on a file doesn’t mean it’s bad. Sometimes it’s just active development. But high churn on a file that nobody wants to own is the clearest signal of codebase drag I know. That’s the file where every change is a patch on a patch. The blast radius of a small edit is unpredictable. The team pads their estimates because they know it’s going to fight back.
A 2005 Microsoft Research study found churn-based metrics predicted defects more reliably than complexity metrics alone. I take the top 5 files from this list and cross-reference them against the bug hotspot command below. A file that’s high-churn and high-bug is your single biggest risk.
Every contributor ranked by commit count. If one person accounts for 60% or more, that’s your bus factor. If they left six months ago, it’s a crisis. If the top contributor from the overall shortlog doesn’t appear in a 6-month window (git shortlog -sn --no-merges --since="6 months ago"), I flag that to the client immediately.
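The two views, side by side:

    git shortlog -sn --no-merges                          # all time
    git shortlog -sn --no-merges --since="6 months ago"   # recent window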
I also look at the tail. Thirty contributors but only three active in the last year. The people who built this system aren’t the people maintaining it.
One caveat: squash-merge workflows compress authorship. If the team squashes every PR into a single commit, this output reflects who merged, not who wrote. Worth asking about the merge strategy before drawing conclusions.
Same shape as the churn command, filtered to commits with bug-related keywords. Compare this list against the churn hotspots. Files that appear on both are your highest-risk code: they keep breaking and keep getting patched, but never get properly fixed.
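A sketch of that command, reusing the churn pipeline with a case-insensitive keyword filter (multiple --grep patterns are OR’d):

    git log --since="1 year ago" -i --grep=fix --grep=bug --grep=hotfix \
      --name-only --pretty=format: \
      | grep -v '^$' | sort | uniq -c | sort -rn | head -20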
This depends on commit message discipline. If the team writes “update stuff” for every commit, you’ll get nothing. But even a rough map of bug density is better than no map.
Commit count by month, for the entire history of the repo. I scan the output looking for shapes. A steady rhythm is healthy. When the count drops by half in a single month, usually someone left. A declining curve over 6 to 12 months tells you the team is losing momentum. Periodic spikes followed by quiet months mean the team batches work into releases instead of shipping continuously.
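One way to produce the month-by-month counts:

    git log --pretty=format:'%ad' --date=format:'%Y-%m' | sort | uniq -c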
I once showed a CTO their commit velocity chart and they said “that’s when we lost our second senior engineer.” They hadn’t connected the timeline before. This is team data, not code data.
Revert and hotfix frequency. A handful over a year is normal. Reverts every couple of weeks means the team doesn’t trust its deploy process. They’re evidence of a deeper issue: unreliable tests, missing staging, or a deploy pipeline that makes rollbacks harder than they should be. Zero results is also a signal; either the team is stable, or nobody writes descriptive commit messages.
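A search along these lines surfaces them:

    git log --oneline -i --grep=revert --grep=hotfix --since="1 year ago"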
Crisis patterns are easy to read. Either they’re there or they’re not.
These five commands take a couple minutes to run. They won’t tell you everything. But you’ll know which code to read first, and what to look for when you get there. That’s the difference between spending your first day reading the codebase methodically and spending it wandering.
This is the first hour of what I do in a codebase audit. Here’s what the rest of the week looks like.
...
Read the original on piechowski.io »
The first flyby images of the Moon captured by NASA’s Artemis II astronauts during their historic test flight reveal regions no human has ever seen before—including a rare in-space solar eclipse. Released Tuesday, April 7, 2026, the photos were taken on April 6 during the crew’s seven‑hour pass over the lunar far side, marking humanity’s return to the Moon’s vicinity.
...
Read the original on www.nasa.gov »
Open source disk encryption with strong security for the Paranoid
...
Read the original on sourceforge.net »
The US and Iran agreed to a two-week conditional ceasefire on Tuesday evening, which included a temporary reopening of the strait of Hormuz, after a last-minute diplomatic intervention led by Pakistan, canceling an ultimatum from Donald Trump for Iran to surrender or face widespread destruction.
Trump’s announcement of the ceasefire agreement came less than two hours before the US president’s self-imposed 8pm Eastern time deadline to bomb Iran’s power plants and bridges in a move that legal scholars, as well as officials from numerous countries and the pope, had warned could constitute war crimes.
Just hours earlier, Trump had written on Truth Social: “A whole civilization will die tonight, never to be brought back again. I don’t want that to happen, but it probably will.” American B-52 bombers were reported to be en route to Iran before the ceasefire agreement was announced.
But by Tuesday evening, Trump announced that a ceasefire agreement had been mediated through Pakistan, whose prime minister, Shehbaz Sharif, had requested the two-week peace in order to “allow diplomacy to run its course”.
Trump wrote in a post that “subject to the Islamic Republic of Iran agreeing to the COMPLETE, IMMEDIATE, and SAFE OPENING of the Strait of Hormuz, I agree to suspend the bombing and attack of Iran for a period of two weeks”.
In a separate post later, the US president called Tuesday “a big day for world peace”, claiming that Iran had “had enough”. He said the US would be “helping with the traffic buildup” in the strait of Hormuz and that “big money will be made” as Iran begins reconstruction.
For several hours afterwards, Israel’s position or agreement with the deal was unclear. But just before midnight ET, the prime minister, Benjamin Netanyahu, said Israel backed the US ceasefire with Iran but that the deal did not cover fighting against Hezbollah in Lebanon. His office said Israel also supported US efforts to ensure Iran no longer posed a nuclear or missile threat.
Pakistan’s prime minister had previously said that the agreed-upon ceasefire covered “everywhere including Lebanon”.
The ceasefire process was clouded in uncertainty after Iran released two different versions of the 10-point plan intended to be the basis for negotiations, which Trump said was a “workable basis on which to negotiate”.
In the version released in Farsi, Iran included the phrase “acceptance of enrichment” for its nuclear program. But for reasons that remain unclear, that phrase was missing in English versions shared by Iranian diplomats to journalists.
Pakistan has invited the US and Iran to talks in Islamabad on Friday. Tehran said it would attend, but Washington has yet to publicly accept the invitation.
In a telephone call with Agence France-Presse, Trump said he believed China had persuaded Iran to negotiate, and said Tehran’s enriched uranium would be “perfectly taken care of”, without providing more detail.
In the two-week ceasefire, Trump said, he believed the US and Iran could negotiate over the 10-point proposal that would allow an armistice to be “finalized and consummated”.
“This will be a double sided CEASEFIRE!” he continued. “The reason for doing so is that we have already met and exceeded all Military objectives, and are very far along with a definitive Agreement concerning Longterm PEACE with Iran, and PEACE in the Middle East.”
Iran’s foreign minister, Abbas Araghchi, issued a statement shortly after Trump’s announcement saying Iran had agreed to the ceasefire. “For a period of two weeks, safe passage through the Strait of Hormuz will be possible via coordinating with Iran’s Armed Forces,” he wrote.
Oil prices dived, stocks surged and the dollar was knocked back on Wednesday as a two-week Middle East ceasefire sparked a relief rally, fueled by hopes that oil and gas flows through the strait of Hormuz could resume.
Despite the provisional ceasefire, attacks continued across the region in the hours after Trump’s announcement. Before the deadline, airstrikes hit two bridges and a train station in Iran, and the US hit military infrastructure on Kharg Island, a key hub for Iranian oil production.
The sudden about-face will allow Trump to step back as the US war in Iran has dragged on for five weeks with little sign that Tehran is ready to surrender or release its hold on the strait, a conduit for a fifth of the global energy supply, where traffic has slowed to a trickle.
Trump had earlier rejected the 10-point plan as “not good enough” but the president has set deadlines before and allowed them to pass over the five weeks of the conflict. Yet he insisted on Tuesday the ensuing hours would be “one of the most important moments in the long and complex history of the World” unless “something revolutionarily wonderful” happened, with “less radicalized minds” in Iran’s leadership.
News of the provisional ceasefire deal was welcomed but with a note of caution elsewhere.
Iraq’s foreign ministry called for “serious and sustainable dialogue” between the US and Iran “to address the root causes of the disputes”, while the German foreign minister, Johann Wadephul, said the deal “must be the crucial first step towards lasting peace, for the consequences of the war continuing would be incalculable”.
In Australia, the government warned that the latest developments would not necessarily mean the fuel crisis is over. Oil prices fell as traders bet that the reopening of the strait of Hormuz would help fuel supply resume, but the energy minister, Chris Bowen, told reporters Australians should “not get ahead of ourselves”.
He said: “People shouldn’t take today’s progress and expect prices to fall. We welcome progress, but I don’t think we can say the [strait of Hormuz is] now open.”
A spokesperson for New Zealand’s foreign minister, Winston Peters, welcomed the “encouraging news” but noted “there remains significant important work to be done to secure a lasting ceasefire”.
Japan said it expected the move to result in a “final agreement” after Washington and Tehran begin talks on Friday. Describing the ceasefire as a “positive move”, the chief cabinet secretary, Minoru Kihara, said Tokyo wanted to see a de-escalation on the ground in the region, adding that the prime minister, Sanae Takaichi, was seeking talks with the Iranian president, Masoud Pezeshkian.
A temporary end to hostilities will come as a relief to Japan, which depends on the Middle East for about 90% of its crude oil imports, most of which is transported through the strait of Hormuz.
South Korea’s ministry of foreign affairs said it hoped “negotiations between the two sides will be successfully concluded and that peace and stability in the Middle East will be restored at an early date”, and expressed hopes for the “free and safe navigation of all vessels through the strait of Hormuz”.
...
Read the original on www.theguardian.com »
Years ago I watched a video where guitar teacher Justin Sandercoe explained a way to get better at guitar. It has changed my playing, and it might change yours, too.
This isn’t my idea; it’s Justin’s. You should watch his video and visit his website. Why am I writing about it, then? Because I think it’s an invaluable idea, and I want to extend its reach and offer my testimony.
Note: This post is part of April Cools, where writers publish something sincere but different from their usual work. I hope you find it interesting.
The Way I Learned Then: Tabs
I grew up with a bunch of kids who played music—guitar, drums, and bass. We were nerds in garages, making noise.
This was the ’90s, when magazines like Guitar World were in their heyday. These magazines contained pages of tablature, or tabs, of the latest songs, and we couldn’t wait to get the newest publications. This was pre-widespread internet, so these tabs were hard to find.
And yet, once I’d purchased the magazine, did I learn “Eruption”? No, I didn’t. It didn’t translate to mastery.
The Way I Learn Now: Listening & Transcribing
When we think about the guitar greats, we might ask, how did they get great? And we know the answer. They didn’t read tabs. They listened to music and imitated what they heard.
That’s what you need to do. If you want to learn a song and get better, here’s the plan.
Pick an easy song—not “Eruption”. Songs like these:
Rock: “The Ghost of Tom Joad”, Rage Against the Machine
These songs have some things in common: simple riffs, mostly using one note at a time, mostly in one part of the guitar neck.
Next, get a piece of tab paper. When I started learning this technique, I printed off a couple dozen pages of blank tabs.
Hit “play” on the song with your guitar in your hand, tab paper on the table, and a pencil. When you hear the first guitar note, stop the song, find the note on the guitar, and write it down.
Hit “play” again, and when you get to the second note, stop, find it, and write it down.
Keep doing this until you finish the song. It’s going to feel impossible at first, and pointless. You’re going to want to quit. Keep going.
Once you’ve finished transcribing, what’s next? Next, we check our work.
Find some tabs online and compare them to what you’ve written. When I do this, I often realize I got something wrong, and erase my tabs and write them again. Other times, I disagree with the transcription I find.
You can also watch the players performing the songs on video. Sometimes the guitarist is doing something surprising you have to see! They might be using a capo, or they’re plucking the strings with an Allen wrench, or there’s a hidden guitarist offstage.
When you follow this process, an amazing thing happens: you learn the song. After transcribing, I can often play a song, near tempo, on the first try. I think it’s because I’ve already listened to the song a few dozen times and started to commit the movements to my muscle memory.
You also get better at hearing a sound and finding it on the neck. Remember that cool kid in high school who could copy any song on the radio? That’s you now!
I take these songs and put them into a playlist. Then, whenever I want to play, I put on my playlist and go through a few of the songs that I’ve learned. This kind of performance practice is fun.
An important distinction here is that you’re now learning songs, not riffs. Riffs are fun, but professional guitarists play songs. They learn the catchy intro riff, as well as the chorus and bridge riffs. They learn to transition from one to another. It’s a different skill.
Once I get one part learned, I often learn the second guitar part, or the bass part, too. When a song starts, I sometimes pick on the fly which one I’m going to play. It keeps things interesting.
The first time I played “Venus” by Television all the way through, expressing the music rather than simply keeping up, I felt like something had changed.
Here’s a sampling of the songs that have made it into my playlist.
“Someday”, The Strokes (chords or triads; take your pick)
“Maps”, Yeah Yeah Yeahs (just one guitar, so you stay busy)
“Just Like Heaven”, The Cure (iconic lead traversing the neck)
“Killing in the Name”, Rage Against the Machine (fun drop D riffs)
You can learn solos, too, or not. There’s more to the songs than the solos. Learning the rhythm parts can often be just as challenging and fun as the lead.
Stop at the first note, find it, and write it down.
Stop at the second note, find it, and write it down.
Continue to the end of the song.
Compare your tab with others and make adjustments.
🎉 You just learned a song! And possibly some new techniques and styles. Keep working on it, and repeat.
Thanks to Justin, and everyone who taught me things on the guitar over the years. Happy April Cools! Keep rocking.
...
Read the original on jakeworth.com »
Early this year, my home city of Bend, Oregon, ended its contract with surveillance company Flock Safety, following months of public pressure and concerns around weak data privacy protections. Flock’s controversial cameras were shut down, and its partnership with local law enforcement ended.
We weren’t the only city to actively reject Flock cameras. Since the start of 2026, dozens of cities have suspended or deactivated contracts with Flock, labeling it a vast surveillance network. Others might not be aware that automated license plate readers, commonly referred to as ALPR cameras, have already been installed in their neighborhood.
Flock gripped news headlines late last year when it was under the microscope during widespread crackdowns by Immigration and Customs Enforcement. Though Flock doesn’t have a direct partnership with federal agencies (a blurry line I’ll discuss more), law enforcement agencies are free to share data with departments like ICE, and they frequently do.
One study from the Center for Human Rights at the University of Washington found that at least eight Washington law enforcement agencies shared their Flock data networks directly with ICE in 2025, and 10 more departments allowed ICE backdoor access without explicitly granting the agency permission. Many other reports outline similar activity.
Following Super Bowl ads about finding lost dogs, Flock came under scrutiny over its planned partnership with Ring, Amazon’s security brand. The integration would have allowed police to request the use of Ring-brand home security cameras for investigations. After intense public backlash, Ring cut ties with Flock just like my city did.
To learn more, I spoke to Flock about how the company’s surveillance technology is used (and misused). I also spoke with privacy advocates from the American Civil Liberties Union to discuss surveillance concerns and what communities are doing about it.
If you hear that Flock is setting up near you, it usually means the installation of ALPR cameras to capture license plate photos and monitor cars on the street.
Flock signs contracts with a wide range of entities, including city governments and law enforcement departments. A neighborhood can also partner with Flock — for example, if an HOA decides it wants extra eyes on the road, it may choose to use Flock’s systems.
When Flock secures a contract, the company installs cameras at strategic locations. Though these cameras are primarily marketed for license plate recognition, Flock reports on its site that its surveillance system is intended to reduce crime, including property crimes such as “mail and package theft, home invasions, vandalism, trespassing, and burglary.” The company also says it frequently solves violent crimes like “assault, kidnappings, shootings and homicides.”
Flock has recently expanded into other technologies, including advanced cameras that monitor more than just vehicles. Most concerning are the latest Flock drones equipped with high-powered cameras. Flock’s “Drone as First Responder” platform automates drone operations, including launching them in response to 911 calls or gunfire. Flock’s drones, which reach speeds up to 60 mph, can follow vehicles or people and provide information to law enforcement.
Drones like these can be used to track fleeing suspects. In practice, the key is how law enforcement chooses to use them, and whether states pass laws allowing police to use drones without a warrant — I’ll cover state laws more below, because that’s a big part of today’s surveillance.
It’s important to note that not all cities or neighborhoods refer to Flock Safety by name, even when using its technology. They might mention the Drone as First Responder program, or ALPR cameras, without further details. For example, a March announcement about police drones from the city of Lancaster, California, doesn’t mention Flock at all, even though it was the company behind the drone program.
Flock states on its website that its standard license-plate cameras cannot technically track vehicles, but only take a “point-in-time” image of a car to nab the license plate.
However, due to AI video and image search, contracted parties like local law enforcement can use these tools to piece together license information and form their own timeline of where and when a vehicle went. Adding to those capabilities, Flock also told Forbes that it’s making efforts to expand access to include video clips and live feeds.
Flock’s machine learning can also note details like a vehicle’s body type, color, the condition of the license plate and a wide variety of identifiers, like roof racks, paint colors and what you have stored in the back. Flock rarely calls this AI, but it’s similar to the AI detection you can find in the latest home security cameras.
A Flock spokesperson told me the company has boundaries and does not use facial recognition. “We have more traditional video cameras that can send an alert when one sees if a person is in the frame, for instance, in a business park at 2 a.m. or in the public parks after dark.”
By “traditional” cameras, Flock refers to those that capture a wider field of view — more than just cars and license plates — and can record video rather than just snapshot images.
The information Flock can access provides a comprehensive picture that police can use to track cars by running searches on their software. Just like you might Google a local restaurant, police can search for a basic vehicle description and retrieve recent matches that the surveillance equipment may have found. Those searches can sometimes extend to people, too.
“We have an investigative tool called Freeform that lets you use natural language prompts to find the investigative lead you’re looking for, including the description of what a person’s clothes may be,” the Flock spokesperson told me.
Unlike red-light cameras, Flock’s cameras can be installed nearly anywhere and snap vehicle ID images for all cars. There are Safe Lists that people can use to help Flock cameras filter out vehicles by filling out a form with their address and license plate to mark their vehicle as a “resident.”
The opposite is also true: Flock cameras can use a hot list of known, wanted vehicles and send automatic alerts to police if one is found.
With Flock drones, these intelligent searches become even more complete, allowing cameras to track where cars are going and identify people. That raises additional privacy concerns about having eyes in the sky over your backyard.
“While flying, the drone faces forward, looking at the horizon, until it gets to the call for service, at which point the camera looks down,” the Flock spokesperson said. “Every flight path is logged in a publicly available flight dashboard for appropriate oversight.”
Yet unlike personal security options, there’s no easy way to opt out of this kind of surveillance. You can’t turn off a feature, cancel a subscription or throw away a device to avoid it.
And even though more than 45 cities have canceled Flock contracts amid public outcry, that doesn’t guarantee that all surveillance cameras will be removed from the designated area.
When I reached out to the police department in Eugene, another city in Oregon that ended its Flock contract, the PD director of public information told me that, while there were concerns about certain vulnerabilities and data security requirements with the particular vendor, the technology itself is not the problem. “Eugene Police’s ALPR system experience has demonstrated the value of leveraging ALPR technology to aid investigations … the department must ensure that any vendors meet the highest standards.”
Flock’s stance, as outlined in its privacy and ethics guide, is that license plate numbers and vehicle descriptions aren’t personal information. The company says it doesn’t surveil “private data” — only cars and general descriptive markers.
But vehicle information can be considered personal because it’s legally tied to the vehicle’s owner. Privacy laws, including proposed federal legislation from 2026, prohibit the release of personal information from state motor vehicle records in order to protect citizens.
However, those laws typically include exemptions for legal actions and law enforcement, sometimes even for private security companies.
AI detection also plays a role. When someone can identify a vehicle through searches like “red pickup truck with a dog in the bed,” that tracking goes beyond basic license plates to much more personal information about the driver and their life. It may include the bumper stickers, what can be seen in the backseat and whether a vehicle has a visible gun rack.
Flock’s practices — like its recent push toward live video feeds and drones to track suspects — move out of the gray area, and that’s where privacy advocates are rightly concerned. Despite its policy, it appears you can track specific people using Flock tech. You’ll just need to pay more to do so, such as upgrading from ALPRs to Flock’s suspect-following drone program, or using its Freeform tool to track someone by the clothes they’re wearing.
Flock states on its website that it stores data for 30 days on Amazon Web Services cloud storage and then deletes it. It uses KMS-based encryption (a managed encryption key system common in AWS) and reports that all images and related data are encrypted from on-device storage to cloud storage.
When Flock collects criminal justice information, or sensitive data managed by law enforcement, it’s only available to official government agencies, not an entity like your local HOA. Because video data is encrypted throughout its transfer to the end user, employees at Flock cannot access it. These are the same kind of security practices I look for when reviewing home security cameras, but there are more complications here.
However, Flock also makes it clear that its customers — whether that’s a local police department, private business or another institution — own their data and control access to it. Once end users access that data, Flock’s own privacy measures don’t do much to help. That raises concerns about the security of local law enforcement systems, each of which has its own data regulations and accountability practices.
You may have noticed a theme: Flock provides powerful surveillance technology, and the final results are deeply influenced by how customers use it. That can be creepy at best, and an illegal abuse of power at worst.
Since Flock Safety began partnering with law enforcement, a growing number of officers have been found abusing the surveillance system. In one instance, a Kansas police chief used Flock cameras 164 times while tracking an ex. In another case, a sheriff in Texas lied about using Flock to “track a missing person,” but was later found to be investigating a possible abortion. In Georgia, a police chief was arrested for using Flock to stalk and harass citizens. In Virginia, a man sued the city of Norfolk over purported privacy violations and discovered that Flock cameras had been used to track him 526 times, around four times per day.
Those are just a few examples from a long list, giving real substance to worries about a surveillance state and a lack of checks and balances. When I asked Flock how its systems protect against abuse and overreach, a spokesperson referred to its accountability feature, an auditing tool that “records every search that a user of Flock conducts in the system.” Flock used this tool during the Georgia case above, which ultimately led to the arrest of the police chief.
While police search logs are often tracked like this, reports indicate that many authorities start searches with vague terms and cast a wide net using terms like “investigation,” “crime” or a broad immigration term like “deportee” to gain access to as much data as possible. While police can’t avoid Flock’s audit logs, they can use general or discriminatory terms — or skip filling out fields entirely — to evade investigations and hide intent.
Regardless of the auditing tools, the onus is on local organizations to manage investigations, accountability and transparency. That brings me to a particularly impactful current event.
ICE is the elephant in the room in my Flock guide. Does Flock share its surveillance data with federal agencies such as ICE? Yes, the federal government frequently has access to that data, but how it gets access is important.
Flock states on its website that it has not shared data or partnered with ICE or any other Department of Homeland Security officials since terminating its pilot programs in August 2025. Flock says its focus is now on local law enforcement, but that comes with a hands-off approach that doesn’t control what happens to information downstream.
“Flock has no authority to share data on our customers’ behalf, nor the authority to disrupt their law enforcement operations,” the Flock spokesperson told me. “Local police all over the country collaborate with federal agencies for various reasons, with or without Flock technology.”
That collaboration has grown more complex. As Democratic Senator Ron Wyden from Oregon stated in an open letter to Flock Safety, “local” law enforcement isn’t that local anymore, especially when 75% of Flock’s law enforcement customers have enrolled in the National Lookup Tool, which allows information sharing across the country between all participants.
“Flock has built a dangerous platform in which abuse of surveillance data is almost certain,” Wyden wrote. “The company has adopted a see-no-evil approach of not proactively auditing the searches done by its law enforcement customers because, as the company’s Chief Communications Officer told the press, ‘It is not Flock’s job to police the police.’”
Police department sharing isn’t always easy to track, but reporting from 404 Media found that police departments across the country have been creating Flock searches with reasons listed as “immigration,” “ICE,” or “ICE warrant,” among others. Again, since police can put whatever terms they want in these fields — depending on local policies — we don’t know for sure how common it is to look up info for ICE.
Additionally, there’s not always an official process or chain of accountability for sharing this data. In Oregon, reports found that a police department was conducting Flock searches on behalf of ICE and the FBI via a simple email thread.
“When this kind of surveillance power is in malevolent hands — and in the case of ICE, I feel comfortable saying a growing number of Americans view it as a bad actor — these companies are empowering actions the public increasingly finds objectionable,” a lawyer with the ACLU told a Salt Lake City news outlet earlier this year.
With the myriad ways law enforcement shares Flock data with the federal government, it may seem like there’s not much you can do. But one powerful tool is advocating for new laws.
In the past two years, a growing number of state laws have been passed or proposed to address Flock Safety, license plate readers and surveillance. Much of this legislation is bipartisan, or has been passed by both traditionally right- and left-leaning states, although some go further than others.
When I contacted the ACLU to learn what legislation is most effective in situations like this, Chad Marlow, senior policy counsel and lead on the ACLU’s advocacy work for Flock and related surveillance, gave several examples.
“I would limit the allowed uses for ALPR,” Marlow told me. “While some uses, like for toll collection and Amber Alerts, with the right guardrails in place, are not particularly problematic, some ALPRs are used to target communities of color and low-income communities for fine/fee enforcement and for minor crime enforcement, which can exacerbate existing policing inequities.”
This type of harmful ALPR targeting is typically used to both oppress minorities and bring in more fines and fees for local law enforcement agencies — problems that existed long before AI recognition cameras, but have been exacerbated by the technology.
New legislation can help, but it needs to be carefully crafted. The most effective laws fall into two categories. The first is requiring any collected ALPR or related data to be deleted within a certain time frame — the shorter, the better. New Hampshire wins here with a 3-minute rule.
“For states that want a little more time to see if captured ALPR data is relevant to an ongoing investigation, keeping the data for a few days is sufficient,” Marlow said. “Some states, like Washington and Virginia, recently adopted 21-day limits, which is the very outermost acceptable limit.”
The second type of promising law makes it illegal to share ALPR and similar data outside the state (such as with ICE) and has been passed by states like Virginia, Illinois and California.
“Ideally, no data should be shared outside the collecting agency without a warrant,” Marlow said. “But some states have chosen to prohibit data sharing outside of the state, which is better than nothing, and does limit some risks.”
Vermont, meanwhile, requires a strict approval process for ALPRs that, by 2025, left no law enforcement agency in the state using license cams.
But what happens if police choose to ignore laws and continue using Flock as they see fit? That’s already happened. In California, for example, police in Los Angeles and San Diego were found sharing information with Homeland Security in 2025, in violation of a state law that bans organizations from sharing license plate data out of state.
When this happens, the recourse is typically a lawsuit, either from the state attorney general or a class action by the community, both of which are ongoing in California in 2026. But what should people do while legislation and lawsuits proceed?
Marlow acknowledged that individuals can’t do much about Flock surveillance without bans or legislation.
“Flock identifies and tracks your vehicle by scanning its license plate, and covering your license plate is illegal, so that is not an option,” he told me.
However, Marlow suggested minor changes that could make a difference for those who are seriously worried. “When people are traveling to sensitive locations, they could take public transportation and pay with cash (credit cards can be tracked, as can share-a-rides) or get a lift from a friend, but those aren’t really practical on an everyday basis.”
Ditching or restricting Flock Safety is one way communities are fighting back against what they consider to be unnecessary surveillance with the potential for abuse. But AI surveillance doesn’t begin or end with one company.
Flock Safety is an intermediary that provides technology in demand by powerful organizations. It’s hardly the only one with these kinds of high-tech eyes — it’s just one of the first to enter the market at a national level. If Flock were gone, another company would likely step in to fill the gap, unless restricted by law.
As Flock’s integration with other apps and cameras becomes more complex, it’s going to be harder to tell where Flock ends and another solution begins, even without rival companies showing up with the latest AI tracking.
But rivals are showing up, from Shield AI for military intelligence to commercial applications by companies like Ambient.ai, Verkada’s AI security searches and the infamous intelligence firm Palantir, all looking for ways to integrate and expand. Motorola, in particular, is in on the action with its VehicleManager platform.
The first step is being aware, including knowing which new cameras your city is installing and which software partnerships your local law enforcement has. If you don’t like what you discover, find ways to participate in the decision-making process, like attending open city council meetings on Flock, as in Bend.
On a broader level, keep track of the legislation your state is considering regarding Flock and similar surveillance contracts and operations, as these will have the greatest long-term impact. Blocking data from being shared out of state and requiring police to delete surveillance ASAP are particularly important steps. You can contact your state senators and representatives to encourage legislation like this.
When you’re wondering what to share with politicians, I recommend something like what Marlow told me: “The idea of keeping a location dossier on every single person just in case one of us turns out to be a criminal is just about the most un-American approach to privacy I can imagine.”
You can also sign up for and donate to projects that are addressing Flock concerns, such as The Plate Privacy Project from The Institute for Justice. I’m currently talking to them about the latest events, and I’ll update if they have any additional tips for us.
Keep following CNET home security, where I break down the latest news you should know, like privacy settings to turn on, security camera settings you may want to turn off and how surveillance intersects with our daily lives. Things are changing fast, but we’re staying on top of it.
...
Read the original on www.cnet.com »
Since its launch in 2006, the Wii has seen several operating systems ported to it: Linux, NetBSD, and most recently, Windows NT. Today, Mac OS X joins that list.
In this post, I’ll share how I ported the first version of Mac OS X, 10.0 Cheetah, to the Nintendo Wii. If you’re not an operating systems expert or low-level engineer, you’re in good company; this project was all about learning and navigating countless “unknown unknowns”. Join me as we explore the Wii’s hardware, bootloader development, kernel patching, and writing drivers - and give the PowerPC versions of Mac OS X a new life on the Nintendo Wii.
Visit the wiiMac bootloader repository for instructions on how to try this project yourself.
Before figuring out how to tackle this project, I needed to know whether it would even be possible. According to a 2021 Reddit comment:
There is a zero percent chance of this ever happening.
Feeling encouraged, I started with the basics: what hardware is in the Wii, and how it compares to the hardware used in real Macs of the era.
The Wii uses a PowerPC 750CL processor - an evolution of the PowerPC 750CXe that was used in G3 iBooks and some G3 iMacs. Given this close lineage, I felt confident that the CPU wouldn’t be a blocker.
As for RAM, the Wii has a unique configuration: 88 MB total, split across 24 MB of 1T-SRAM (MEM1) and 64 MB of slower GDDR3 SDRAM (MEM2); unconventional, but technically enough for Mac OS X Cheetah, which officially calls for 128 MB of RAM but will unofficially boot with less. To be safe, I used QEMU to boot Cheetah with 64 MB of RAM and verified that there were no issues.
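For the curious, a QEMU invocation along these lines covers that kind of smoke test (exact flags depend on the QEMU build and disk image):

    qemu-system-ppc -M mac99 -m 64 -hda cheetah.img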
Other hardware I’d eventually need to support included:
* The SD card for booting the rest of the system once the kernel was running
* Video output via a framebuffer that lives in RAM
* The Wii’s USB ports for using a mouse and keyboard
Convinced that the Wii’s hardware wasn’t fundamentally incompatible with Mac OS X, I moved my attention to investigating the software stack I’d be porting.
Mac OS X has an open source core (Darwin, with XNU as the kernel and IOKit as the driver model), with closed-source components layered on top (Quartz, Dock, Finder, system apps and frameworks). In theory, if I could modify the open-source parts enough to get Darwin running, the closed-source parts would run without additional patches.
Porting Mac OS X would also require understanding how a real Mac boots. PowerPC Macs from the early 2000s use Open Firmware as their lowest-level software environment; for simplicity, it can be thought of as the first code that runs when a Mac is powered on. Open Firmware has several responsibilities, including:
* Providing useful functions for I/O, drawing, and hardware communication
* Loading and executing an operating system bootloader from the filesystem
Open Firmware eventually hands off control to BootX, the bootloader for Mac OS X. BootX prepares the system so that it can eventually pass control to the kernel. The responsibilities of BootX include:
* Loading and decoding the XNU kernel, a Mach-O executable, from the root filesystem
Once XNU is running, there are no dependencies on BootX or Open Firmware. XNU continues on to initialize processors, virtual memory, IOKit, and BSD, and eventually continues booting by loading and running other executables from the root filesystem.
The last piece of the puzzle was how to run my own custom code on the Wii - a trivial task thanks to the Wii being “jailbroken”, allowing anyone to run homebrew with full access to the hardware via the Homebrew Channel and BootMii.
Armed with knowledge of how the boot process works on a real Mac, along with how to run low-level code on the Wii, I needed to select an approach for booting Mac OS X on the Wii. I evaluated three options:
1. Port Open Firmware, and use that to run unmodified BootX to boot Mac OS X
2. Port BootX and modify it to not rely on Open Firmware, and use that to boot Mac OS X
3. Write a custom bootloader that performs the bare-minimum setup to boot Mac OS X
Since Mac OS X doesn’t depend on Open Firmware or BootX once running, spending time porting either of those seemed like an unnecessary distraction. Additionally, both Open Firmware and BootX contain added complexity for supporting many different hardware configurations - complexity that I wouldn’t need since this only needs to run on the Wii. Following in the footsteps of the Wii Linux project, I decided to write my own bootloader from scratch. The bootloader would need to, at a minimum:
* Load the kernel from the SD card
Once the kernel was running, none of the bootloader code would matter. At that point, my focus would shift to patching the kernel and writing drivers.
I decided to base my bootloader on some low-level example code for the Wii called ppcskel. ppcskel puts the system into a sane initial state, and provides useful functions for common things like reading files from the SD card, drawing text to the framebuffer, and logging debug messages to a USB Gecko.
Next, I had to figure out how to load the XNU kernel into memory so that I could pass control to it. The kernel is stored in a special binary format called Mach-O, and needs to be properly decoded before being used.
The Mach-O executable format is well-documented, and can be thought of as a list of load commands that tell the loader where to place different sections of the binary file in memory. For example, a load command might instruct the loader to read the data from file offset 0x2cf000 and store it at the memory address 0x2e0000. After processing all of the kernel’s load commands, each segment of the kernel ends up at the physical address it expects.
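In the bootloader, that processing is a loop over the load commands; a sketch for the 32-bit case, with structs abridged from the Mach-O headers:

    #include <stdint.h>
    #include <string.h>

    #define LC_SEGMENT 0x1  /* load command: map a segment of the file */

    struct mach_header  { uint32_t magic, cputype, cpusubtype, filetype,
                                   ncmds, sizeofcmds, flags; };
    struct load_command { uint32_t cmd, cmdsize; };
    struct segment_command {
        uint32_t cmd, cmdsize;
        char     segname[16];
        uint32_t vmaddr, vmsize, fileoff, filesize;
        uint32_t maxprot, initprot, nsects, flags;
    };

    /* Place every LC_SEGMENT where it asks to live; `file` points at the
       raw kernel image already read from the SD card. */
    static void load_segments(const uint8_t *file)
    {
        const struct mach_header *mh = (const struct mach_header *)file;
        const uint8_t *p = file + sizeof(*mh);

        for (uint32_t i = 0; i < mh->ncmds; i++) {
            const struct load_command *lc = (const struct load_command *)p;
            if (lc->cmd == LC_SEGMENT) {
                const struct segment_command *sc =
                    (const struct segment_command *)p;
                memcpy((void *)sc->vmaddr, file + sc->fileoff, sc->filesize);
                /* zero-fill the tail (e.g. BSS) */
                memset((uint8_t *)sc->vmaddr + sc->filesize, 0,
                       sc->vmsize - sc->filesize);
            }
            p += lc->cmdsize;
        }
    }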
The kernel file also specifies the memory address where execution should begin. Once the bootloader jumps to this address, the kernel is in full control and the bootloader is no longer running.
To jump to the kernel-entry-point’s memory address, I needed to cast the address to a function and call it:
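    /* Sketch: XNU's PowerPC entry expects a pointer to its boot arguments
       (covered below) in r3, so the cast is to a one-argument function.
       Variable names are illustrative. */
    typedef void (*kernel_entry_t)(void *boot_args);

    kernel_entry_t entry = (kernel_entry_t)entry_address; /* from the Mach-O thread state */
    entry(args); /* never returns; the kernel owns the machine from here */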
After this code ran, the screen went black and my debug logs stopped arriving via the serial debug connection - while anticlimactic, this was an indicator that the kernel was running.
The question then became: how far was I making it into the boot process? To answer this, I had to start looking at XNU source code. The first code that runs is a PowerPC assembly _start routine. This code reconfigures the hardware, overriding all of the Wii-specific setup that the bootloader performed and, in the process, disables bootloader functionality for serial debugging and video output. Without normal debug-output facilities, I’d need to track progress a different way.
The approach that I came up with was a bit of a hack: binary-patch the kernel, replacing instructions with ones that illuminate one of the front-panel LEDs on the Wii. If the LED illuminated after jumping to the kernel, then I’d know that the kernel was making it at least that far. Turning on one of these LEDs is as simple as writing a value to a specific memory address. In PowerPC assembly, those instructions are:
lis r5, 0xd80 ; load upper half of 0x0D8000C0 into r5
ori r5, r5, 0xc0 ; load lower half of 0x0D8000C0 into r5
lwz r4, 0(r5) ; read the 32-bit value at 0x0D8000C0
sync ; memory barrier
xori r4, r4, 0x20 ; toggle bit 5
stw r4, 0(r5) ; write the value back to 0x0D8000C0
To know which parts of the kernel to patch, I cross-referenced function names in XNU source code with function offsets in the compiled kernel binary, using Hopper Disassembler to make the process easier. Once I identified the correct offset in the binary that corresponded to the code I wanted to patch, I just needed to replace the existing instructions at that offset with the ones to blink the LED.
To make this patching process easier, I added some code to the bootloader to patch the kernel binary on the fly, enabling me to try different offsets without manually modifying the kernel file on disk.
After tracing through many kernel startup routines, I eventually mapped out the kernel’s path of execution, which ended at a crash on the 0x300 (data access) exception.
This was an exciting milestone - the kernel was definitely running, and I had even made it into some higher-level C code. To make it past the 300 exception crash, the bootloader would need to pass a pointer to a valid device tree.
The device tree is a data structure representing all of the hardware in the system that should be exposed to the operating system. As the name suggests, it’s a tree made up of nodes, each capable of holding properties and references to child nodes.
On real Mac computers, the bootloader scans the hardware and constructs a device tree based on what it finds. Since the Wii’s hardware is always the same, this scanning step can be skipped. I ended up hard-coding the device tree in the bootloader, taking inspiration from the device tree that the Wii Linux project uses.
Since I wasn’t sure how much of the Wii’s hardware I’d need to support in order to get the boot process further along, I started with a minimal device tree: a root node with children for the cpus and memory:
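    // Sketched in device-tree notation (the bootloader hard-codes the
    // equivalent structures in C); sizes: MEM1 is 24 MB at 0x00000000,
    // MEM2 is 64 MB at 0x10000000.
    / {
        cpus {
            cpu@0 { };  // the single PowerPC 750CL
        };
        memory {
            reg = <0x00000000 0x01800000   // MEM1
                   0x10000000 0x04000000>; // MEM2
        };
    };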
My plan was to expand the device tree with more pieces of hardware as I got further along in the boot process - eventually constructing a complete representation of all of the Wii’s hardware that I planned to support in Mac OS X.
Once I had a device tree created and stored in memory, I needed to pass it to the kernel as part of boot_args:
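    /* Abridged sketch of the PowerPC boot_args structure (see XNU's pexpert
       headers for the real layout); only the fields relevant here are shown. */
    typedef struct boot_args {
        unsigned short Revision;
        unsigned short Version;
        char           CommandLine[256];  /* e.g. "-v" for a verbose boot */
        /* ... physical DRAM banks, video/framebuffer info ... */
        void          *deviceTreeP;       /* -> device tree in memory */
        unsigned long  deviceTreeLength;  /* its size in bytes */
        unsigned long  topOfKernelData;   /* first free address past the kernel */
    } boot_args;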
With the device tree in memory, I had made it past the device_tree.c crash. The bootloader was performing the basics well: loading the kernel, creating boot arguments and a device tree, and ultimately, calling the kernel. To make additional progress, I’d need to shift my attention toward patching the kernel source code to fix remaining compatibility issues.
At this point, the kernel was getting stuck while running some code to set up video and I/O memory. XNU from this era makes assumptions about where video and I/O memory can be, and reconfigures Block Address Translations (BATs) in a way that doesn’t play nicely with the Wii’s memory layout (MEM1 starting at 0x00000000, MEM2 starting at 0x10000000). To work around these limitations, it was time to modify the kernel’s source code and boot a modified kernel binary.
Figuring out a sane development environment to build an OS kernel from 25 years ago took some effort. Here’s what I landed on:
* XNU source code lives on the host’s filesystem, and is exposed via an NFS server
* The guest accesses the XNU source via an NFS mount
* The host uses SSH to control the guest
* Edit XNU source on host, kick off a build via SSH on the guest, build artifacts end up on the filesystem accessible by host and guest
To set up the dependencies needed to build the Mac OS X Cheetah kernel on the Mac OS X Cheetah guest, I followed the instructions here. They mostly matched up with what I needed to do. Relevant sources are available from Apple here.
After fixing the BAT setup and adding some small patches to reroute console output to my USB Gecko, I now had video output and serial debug logs working - making future development and debugging significantly easier. Thanks to this new visibility into what was going on, I could see that the virtual memory, IOKit, and BSD subsystems were all initialized and running - without crashing. This was a significant milestone, and gave me confidence that I was on the right path to getting a full system working.
Readers who have attempted to run Mac OS X on a PC via “hackintoshing” may recognize the last line in the boot logs: the dreaded “Still waiting for root device”. This occurs when the system can’t find a root filesystem from which to continue booting. In my case, this was expected: the kernel had done all it could and was ready to load the rest of the Mac OS X system from the filesystem, but it didn’t know where to locate this filesystem. To make progress, I would need to tell the kernel how to read from the Wii’s SD card. To do this, I’d need to tackle the next phase of this project: writing drivers.
Mac OS X drivers are built using IOKit - a collection of software components that aim to make it easy to extend the kernel to support different hardware devices. Drivers are written using a subset of C++, and make extensive use of object-oriented programming concepts like inheritance and composition. Many pieces of useful functionality are provided, including:
* Base classes and “families” that implement common behavior for different types of hardware
* Probing and matching drivers to hardware present in the device tree
In IOKit, there are two kinds of drivers: a specific device driver and a nub. A specific device driver is an object that manages a specific piece of hardware. A nub is an object that serves as an attach-point for a specific device driver, and also provides the ability for that attached driver to communicate with the driver that created the nub. It’s this chain of driver-to-nub-to-driver that creates the aforementioned provider-client relationships. I struggled for a while to grasp this concept, and found a concrete example useful.
Real Macs can have a PCI bus with several PCI ports. In this example, consider an ethernet card being plugged into one of the PCI ports. A driver, IOPCIBridge, handles communicating with the PCI bus hardware on the motherboard. This driver scans the bus, creating IOPCIDevice nubs (attach-points) for each plugged-in device that it finds. A hypothetical driver for the plugged-in ethernet card (let’s call it SomeEthernetCard) can attach to the nub, using it as its proxy to call into PCI functionality provided by the IOPCIBridge driver on the other side. The SomeEthernetCard driver can also create its own IOEthernetInterface nubs so that higher-level parts of the IOKit networking stack can attach to it.
Someone developing a PCI ethernet card driver would only need to write SomeEthernetCard; the lower-level PCI bus communication and the higher-level networking stack code is all provided by existing IOKit driver families. As long as SomeEthernetCard can attach to an IOPCIDevice nub and publish its own IOEthernetInterface nubs, it can sandwich itself between two existing families in the driver stack, benefiting from all of the functionality provided by IOPCIFamily while also satisfying the needs of IONetworkingFamily.
Unlike Macs from the same era, the Wii doesn’t use PCI to connect its various pieces of hardware to its motherboard. Instead, it uses a custom system-on-a-chip (SoC) called the Hollywood. Through the Hollywood, many pieces of hardware can be accessed: the GPU, SD card, WiFi, Bluetooth, interrupt controllers, USB ports, and more. The Hollywood also contains an ARM coprocessor, nicknamed the Starlet, that exposes hardware functionality to the main PowerPC processor via inter-processor-communication (IPC).
This unique hardware layout and communication protocol meant that I couldn’t piggy-back off of an existing IOKit driver family like IOPCIFamily. Instead, I would need to implement an equivalent driver for the Hollywood SoC, creating nubs that represent attach-points for all of the hardware it contains. I landed on this layout of drivers and nubs (note that this is only showing a subset of the drivers that had to be written).
Now that I had a better idea of how to represent the Wii’s hardware in IOKit, I began work on my Hollywood driver.
I started by creating a new C++ header and implementation file for a NintendoWiiHollywood driver. Its driver “personality” enabled it to be matched to a node in the device tree with the name “hollywood”. Once the driver was matched and running, it was time to publish nubs for all of its child devices.
Once again leaning on the device tree as the source of truth for what hardware lives under the Hollywood, I iterated through all of the Hollywood node’s children, creating and publishing NintendoWiiHollywoodDevice nubs for each:
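    #include <IOKit/IOService.h>
    #include <IOKit/IODeviceTreeSupport.h>  // gIODTPlane

    // Sketch (the helper name is illustrative; attach() and registerService()
    // are the standard IOKit publication mechanism):
    void NintendoWiiHollywood::publishChildNubs(IORegistryEntry *hollywoodNode)
    {
        OSIterator *iter = hollywoodNode->getChildIterator(gIODTPlane);
        OSObject *obj;

        while (iter && (obj = iter->getNextObject())) {
            IORegistryEntry *child = OSDynamicCast(IORegistryEntry, obj);
            if (!child)
                continue;

            OSDictionary *props = child->dictionaryWithProperties();
            NintendoWiiHollywoodDevice *nub = new NintendoWiiHollywoodDevice;

            if (nub && props && nub->init(props)) {
                nub->attach(this);       // provider/client link in the service plane
                nub->registerService();  // kick off matching for device drivers
            }
            if (props)
                props->release();
            if (nub)
                nub->release();          // attach() holds its own reference
        }
        if (iter)
            iter->release();
    }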
Once NintendoWiiHollywoodDevice nubs were created and published, the system would be able to have other device drivers, like an SD card driver, attach to them.
Next, I moved on to writing a driver to enable the system to read and write from the Wii’s SD card. This driver is what would enable the system to continue booting, since it was currently stuck looking for a root filesystem from which to load additional startup files.
I began by subclassing IOBlockStorageDevice, which has many abstract methods intended to be implemented by subclassers:
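    #include <IOKit/storage/IOBlockStorageDevice.h>

    // A few of them, as sketched in my subclass's header (signatures per the
    // Cheetah-era IOBlockStorageDevice; the class name is illustrative and
    // the real list of methods is longer):
    class NintendoWiiSDCard : public IOBlockStorageDevice
    {
        OSDeclareDefaultStructors(NintendoWiiSDCard)

    public:
        virtual char    *getVendorString(void);
        virtual IOReturn reportBlockSize(UInt64 *blockSize);
        virtual IOReturn reportRemovability(bool *isRemovable);
        virtual IOReturn reportEjectability(bool *isEjectable);
        virtual IOReturn reportMaxValidBlock(UInt64 *maxBlock);  // capacity
        virtual IOReturn doAsyncReadWrite(IOMemoryDescriptor *buffer,
                                          UInt32 block, UInt32 nblks,
                                          IOStorageCompletion completion);
    };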
For most of these methods, I could implement them with hard-coded values that matched the Wii’s SD card hardware; vendor string, block size, max read and write transfer size, ejectability, and many others all return constant values, and were trivial to implement.
The more interesting methods to implement were the ones that needed to actually communicate with the currently-inserted SD card: getting the capacity of the SD card, reading from the SD card, and writing to the SD card.
To communicate with the SD card, I utilized the IPC functionality provided by MINI running on the Starlet co-processor. By writing data to certain reserved memory addresses, the SD card driver was able to issue commands to MINI. MINI would then execute those commands, communicating back any result data by writing to a different reserved memory address that the driver could monitor.
MINI supports many useful command types. The ones used by the SD card driver are:
* IPC_SDMMC_SIZE: Returns the number of sectors on the currently-inserted SD card
* IPC_SDMMC_READ: Reads a range of sectors from the SD card into a buffer in memory
* IPC_SDMMC_WRITE: Writes a buffer in memory to a range of sectors on the SD card
With these three command types, reads, writes, and capacity-checks could all be implemented, enabling me to satisfy the core requirements of the block storage device subclass.
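The shape of one of those calls, heavily simplified (the mailbox register names are the Hollywood's; the request layout and helper names here are illustrative rather than MINI's exact ABI):

    typedef unsigned int u32;  /* as in ppcskel */

    struct mini_req {
        u32 cmd;               /* e.g. IPC_SDMMC_SIZE */
        u32 args[4];           /* buffer address, start sector, count, ... */
        volatile u32 result;   /* filled in by MINI on the ARM side */
    };

    static u32 mini_call(struct mini_req *req)  /* req must be uncached */
    {
        write32(HW_IPC_PPCMSG, (u32)req);  /* physical address of the request */
        write32(HW_IPC_PPCCTRL, 1);        /* ring MINI's doorbell */
        while (!reply_ready())             /* poll the ARM-to-PPC direction */
            ;
        return req->result;
    }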
Like with most programming endeavours, things rarely work on the first try. To investigate issues, my primary debugging tool was sending log messages to the serial debugger via calls to IOLog. With this technique, I was able to see which methods were being called on my driver, what values were being passed in, and what values my IPC implementation was sending to and receiving from MINI - but I had no ability to set breakpoints or analyze execution dynamically while the kernel was running.
One of the trickier bugs that I encountered had to do with cached memory. When the SD card driver wants to read from the SD card, the command it issues to MINI (running on the ARM CPU) includes a memory address at which to store any loaded data. After MINI finishes writing to memory, the SD card driver (running on the PowerPC CPU) might not be able to see the updated contents if that region is mapped as cacheable. In that case, the PowerPC will read from its cache lines rather than RAM, returning stale data instead of the newly loaded contents. To work around this, the SD card driver must use uncached memory for its buffers.
After several days of bug-fixing, I reached a new milestone: IOBlockStorageDriver, which attached to my SD card driver, had started publishing IOMedia nubs representing the logical partitions present on the SD card. Through these nubs, higher-level parts of the system were able to attach and begin using the SD card. Importantly, the system was now able to find a root filesystem from which to continue booting, and I was no longer stuck at “Still waiting for root device”.
After some more rounds of bug fixes (while on the go), I was able to boot past single-user mode.
And eventually, make it through the entire verbose-mode startup sequence, which ends with the message “Startup complete”.
At this point, the system was trying to find a framebuffer driver so that the Mac OS X GUI could be shown. As indicated in the logs, WindowServer was not happy - to fix this, I’d need to write my own framebuffer driver.
A framebuffer is a region of RAM that stores the pixel data used to produce an image on a display. This data is typically made up of color component values for each pixel. To change what’s displayed, new pixel data is written into the framebuffer, which is then shown the next time the display refreshes. For the Wii, the framebuffer usually lives somewhere in MEM1 due to it being slightly faster than MEM2. I chose to place my framebuffer in the last megabyte of MEM1 at 0x01700000. At 640x480 resolution, and 16 bits per pixel, the pixel data for the framebuffer fit comfortably in less than one megabyte of memory.
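The arithmetic checks out; here it is as the set of constants the framebuffer sketches below reuse (the names are mine):

```
static const UInt32 kFramebufferBase   = 0x01700000;  // last MB of MEM1
static const UInt32 kFramebufferWidth  = 640;
static const UInt32 kFramebufferHeight = 480;
static const UInt32 kBytesPerPixel     = 2;           // 16 bits per pixel

// 640 * 480 * 2 = 614,400 bytes (600 KB), comfortably under one megabyte.
static const UInt32 kFramebufferSize   =
    kFramebufferWidth * kFramebufferHeight * kBytesPerPixel;
```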
Early in the boot process, Mac OS X uses the bootloader-provided framebuffer address to display simple boot graphics via video_console.c. In the case of a verbose-mode boot, font-character bitmaps are written into the framebuffer to produce a visual log of what’s happening while starting up. Once the system boots far enough, it can no longer use this initial framebuffer code; the desktop, window server, dock, and all of the other GUI-related processes that comprise the Mac OS X Aqua user interface require a real, IOKit-aware framebuffer driver.
To tackle this next driver, I subclassed IOFramebuffer. Similar to subclassing IOBlockStorageDevice for the SD card driver, IOFramebuffer also had several abstract methods for my framebuffer subclass to implement:
Once again, most of these were trivial to implement, and simply required returning hard-coded Wii-compatible values that accurately described the hardware. One of the most important methods to implement is getApertureRange, which returns an IODeviceMemory instance whose base address and size describe the location of the framebuffer in memory:
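Using the constants above, a sketch of that method (the class name WiiFramebuffer is hypothetical):

```
IODeviceMemory *WiiFramebuffer::getApertureRange(IOPixelAperture aperture)
{
    if (aperture != kIOFBSystemAperture)
        return NULL;

    // Describe the physical range of the framebuffer in MEM1.
    return IODeviceMemory::withRange(kFramebufferBase, kFramebufferSize);
}
```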
After returning the correct device memory instance from this method, the system was able to transition from the early-boot text-output framebuffer, to a framebuffer capable of displaying the full Mac OS X GUI. I was even able to boot the Mac OS X installer:
Readers with a keen eye might notice some issues:
* The verbose-mode text framebuffer is still active, causing text to be displayed and the framebuffer to be scrolled
* The colors are incorrect
The fix for the early-boot video console still writing text output to the framebuffer was simple: tell the system that our new, IOKit framebuffer is the same as the one that was previously in use by returning true from isConsoleDevice:
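Which, as overrides go, is about as small as they come:

```
bool WiiFramebuffer::isConsoleDevice(void)
{
    // Claim the boot console's framebuffer as our own so the early-boot
    // video console stops drawing text over the GUI.
    return true;
}
```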
The fix for the incorrect colors was much more involved, as it relates to a fundamental incompatibility between the Wii’s video hardware and the graphics code that Mac OS X uses.
The Nintendo Wii’s video encoder hardware is optimized for analogue TV signal output, and as a result, expects 16-bit YUV pixel data in its framebuffer. This is a problem, since Mac OS X expects the framebuffer to contain RGB pixel data. If the framebuffer that the Wii displays contains non-YUV pixel data, then colors will be completely wrong.
To work around this incompatibility, I took inspiration from the Wii Linux project, which had solved this problem many years ago. The strategy is to use two framebuffers: an RGB framebuffer that Mac OS X interacts with, and a YUV framebuffer that the Wii’s video hardware outputs to the attached display. 60 times per second, the framebuffer driver converts the pixel data in the RGB framebuffer to YUV pixel data, placing the converted data in the framebuffer that the Wii’s video hardware displays:
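A sketch of that per-frame conversion, assuming a packed 24-bit RGB source and the usual integer BT.601 coefficients (the real driver's source format and rounding may differ):

```
// Convert two horizontally adjacent RGB pixels into one packed
// Y1-U-Y2-V word, the 4:2:2 layout the Wii's video hardware scans out.
static inline UInt32 rgbPairToYuv(UInt8 r1, UInt8 g1, UInt8 b1,
                                  UInt8 r2, UInt8 g2, UInt8 b2)
{
    UInt8 y1 = (UInt8)((77 * r1 + 150 * g1 + 29 * b1) >> 8);
    UInt8 y2 = (UInt8)((77 * r2 + 150 * g2 + 29 * b2) >> 8);

    // Chroma is shared by the pixel pair, so average the two samples.
    int ra = (r1 + r2) / 2, ga = (g1 + g2) / 2, ba = (b1 + b2) / 2;
    UInt8 u = (UInt8)(((-43 * ra - 85 * ga + 128 * ba) >> 8) + 128);
    UInt8 v = (UInt8)(((128 * ra - 107 * ga - 21 * ba) >> 8) + 128);

    return ((UInt32)y1 << 24) | ((UInt32)u << 16) | ((UInt32)y2 << 8) | v;
}

// Called 60 times per second: convert the RGB framebuffer that Mac OS X
// draws into, filling the YUV framebuffer that the hardware displays.
void WiiFramebuffer::convertFrame(const UInt8 *rgb, UInt32 *yuv)
{
    for (UInt32 i = 0; i < kFramebufferWidth * kFramebufferHeight / 2; i++) {
        const UInt8 *p = rgb + i * 6;  // two packed RGB pixels per YUV word
        yuv[i] = rgbPairToYuv(p[0], p[1], p[2], p[3], p[4], p[5]);
    }
}
```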
After implementing the dual-framebuffer strategy, I was able to boot into a correctly-colored Mac OS X system - for the first time, Mac OS X was running on a Nintendo Wii:
...
Read the original on bryankeller.github.io »
Last week, the nonprofit research group OpenAI revealed that it had developed a new text-generation model that can write coherent, versatile prose given a certain subject matter prompt. However, the organization said, it would not be releasing the full algorithm due to “safety and security concerns.”
Instead, OpenAI decided to release a “much smaller” version of the model and withhold the data sets and training codes that were used to develop it. If your knowledge of the model, called GPT-2, came solely from headlines in the resulting news coverage, you might think that OpenAI had built a weapons-grade chatbot. A headline from Metro U. K. read, “Elon Musk-Founded OpenAI Builds Artificial Intelligence So Powerful That It Must Be Kept Locked Up for the Good of Humanity.” Another from CNET reported, “Musk-Backed AI Group: Our Text Generator Is So Good It’s Scary.” A column from the Guardian was titled, apparently without irony, “AI Can Write Just Like Me. Brace for the Robot Apocalypse.”
That sounds alarming. Experts in the machine learning field, however, are debating whether OpenAI’s claims may have been a bit exaggerated. The announcement has also sparked a debate about how to handle the proliferation of potentially dangerous A. I. algorithms.
OpenAI is a pioneer in artificial intelligence research that was initially funded by titans like SpaceX and Tesla founder Elon Musk, venture capitalist Peter Thiel, and LinkedIn co-founder Reid Hoffman. The nonprofit’s mission is to guide A. I. development responsibly, away from abusive and harmful applications. Besides text generation, OpenAI has also developed a robotic hand that can teach itself simple tasks, systems that can beat pro players of the strategy video game Dota 2, and algorithms that can incorporate human input into their learning processes.
On Feb. 14, OpenAI announced yet another feat of machine learning ingenuity in a blog post detailing how its researchers had trained a language model using text from 8 million webpages to predict the next word in a piece of writing. The resulting algorithm, according to the nonprofit, was stunning: It could “[adapt] to the style and content of the conditioning text” and allow users to “generate realistic and coherent continuations about a topic of their choosing.” To demonstrate the feat, OpenAI provided samples of text that GPT-2 had produced given a particular human-written prompt.
For example, researchers fed the generator the following scenario:
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved. Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow. Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez.
Other samples exhibited GPT-2’s turns as a novelist writing another battle passage of The Lord of the Rings, a columnist railing against recycling, and a speechwriter composing John F. Kennedy’s address to the American people in the wake of his hypothetical resurrection as a cyborg.
While researchers admit that the algorithm’s prose can be a bit sloppy—it often rambles, uses repetitive language, can’t quite nail topic transitions, and inexplicably mentions “fires happening under water”—OpenAI nevertheless contends that GPT-2 is far more sophisticated than any other text generator that it has developed. That’s a bit self-referential, but most in the A. I. field seem to agree that GPT-2 is truly at the cutting edge of what’s currently possible with text generation. Most A.I. tech is only equipped to handle specific tasks and tends to fumble anything else outside a very narrow range. Training the GPT-2 algorithm to adapt nimbly to various modes of writing is a significant achievement. The model also stands out from older text generators in that it can distinguish between multiple definitions of a single word based on context clues and has a deeper knowledge of more obscure usages. These enhanced capabilities allow the algorithm to compose longer and more coherent passages, which could be used to improve translation services, chatbots, and A.I. writing assistants. That doesn’t mean it will necessarily revolutionize the field.
Nevertheless, OpenAI said that it would only be publishing a “much smaller version” of the model due to concerns that it could be abused. The blog post fretted that it could be used to generate false news articles, impersonate people online, and generally flood the internet with spam and vitriol. While people can, of course, create such malicious content themselves, the implementation of sophisticated A. I. text generation may augment the scale at which it’s generated. What GPT-2 lacks in elegant prose stylings it could more than make up for in its prolificacy.
Yet the prevailing notion among most A. I. experts, including those at OpenAI, was that withholding the algorithm is a stopgap measure at best. Plus, “It’s not clear that there’s any, like, stunningly new technique they [OpenAI] are using. They’re just doing a good job of taking the next step,” says Robert Frederking, the principal systems scientist at Carnegie Mellon’s Language Technologies Institute. “A lot of people are wondering if you actually achieve anything by embargoing your results when everybody else can figure out how to do it anyway.”
An entity with enough capital and knowledge of A. I. research that’s already out in the public could build a text generator comparable to GPT-2, even by renting servers from Amazon Web Services. If OpenAI had released the algorithm, you perhaps would not have to spend as much time and computing power developing your own text generator. But the process by which it built the model isn’t exactly a mystery. (OpenAI did not respond to Slate’s requests for comment by publication.)
Some in the machine learning community have accused OpenAI of exaggerating the risks of its algorithm for media attention and depriving academics, who may not have the resources to build such a model themselves, the opportunity to conduct research with GPT-2. However, David Bau, a researcher at MIT’s Computer Science and Artificial Intelligence Laboratory, sees this decision as more of a gesture intended to start a debate about ethics in A. I. “One organization pausing one particular project isn’t really going to change anything long term,” says Bau. “But OpenAI gets a lot of attention for anything they do … and I think they should be applauded for turning a spotlight on this issue.”
It’s worth considering, as OpenAI seems to be encouraging us to do, how researchers and society in general should approach powerful A. I. models. The dangers that come with the proliferation of A.I. won’t necessarily involve insubordinate killer robots. Let’s say, hypothetically, that OpenAI had managed to create a truly unprecedented text generator that could be easily downloaded and operated by laypeople on a mass scale. For John Bowers, a research associate at the Berkman Klein Center, what to do next may come down to a cost-benefit calculus. “The fact of the matter is that a lot of the cool stuff that we’re seeing coming out of A.I. research can be weaponized in some form,” says Bowers.
In the case of increasingly sophisticated text generators, Bowers would press for releasing the algorithms because of their contributions to the field of natural language processing and practical uses, though he acknowledges that important developments in A. I. image recognition could be leveraged for invasive surveillance. However, Bowers would lean away from trying to advance and proliferate an A.I. tool like that used to make deepfakes, which is often used to graft images of people’s faces onto pornography. “To me, deepfakes are a prime example of a technology that has way more downside than upside.”
Bowers stresses, however, that these are all judgment calls, which in part speaks to the current shortcomings of the machine learning field that OpenAI is trying to highlight. “A. I. is a very young field, one that in many ways hasn’t achieved maturity in terms of how we think about the products we’re building and the balance between the harm they’ll do in the world and the good,” he says. Machine learning practitioners have not yet established many widely accepted frameworks for considering the ethical implications of creating and releasing A.I.-enabled technologies.
If recent history is any indication, trying to suppress or control the proliferation of A. I. tools may also be a losing battle. Even if there is a consensus around the ethics of disseminating certain algorithms, it might not be enough to stop people who disagree.
Frederking says an analogous precedent to the current conundrum with A. I. might be the popularization of consumer-level encryption in the 1990s, when the government repeatedly tried and failed to regulate cryptography. In 1991, Joe Biden, then a senator, introduced a bill mandating that tech companies install back doors that would allow law enforcement to carry out warrants to retrieve voice, text, and other communications from customers. Programmer Phil Zimmermann soon spoiled the scheme by developing a tool called PGP, which encrypted communications so that they could only be read by the sender and receiver. PGP soon enjoyed widespread adoption, undercutting back doors accessible to tech companies and the government. And as lawmakers were mulling further attempts to stem the adoption of strong encryption services, the National Research Council concluded in a 1996 study that users could easily and legally obtain those same services from countries like Israel and Finland.
“There’s a general philosophy that when the time has come for some scientific progress to happen, you really can’t stop it,” says Frederking. “You just need to figure out how you’re going to deal with it.”
Future Tense
is a partnership of Slate, New America, and Arizona State University
that examines emerging technologies, public policy, and society.
...
Read the original on slate.com »
The redesign of a safety feature that is more than 100 years old originated from a simple need. “Bicycle bells have remained almost unchanged for over a century, but the world around them has not. Škoda DuoBell is the first bell ever designed to penetrate noise-cancelling headphones. It is a smart analogue trick that outsmarts the artificial intelligence algorithms in these headphones. It is a small adjustment that will improve safety on city streets,” said Ben Edwards from AMV BBDO, the agency involved in developing the concept. The idea was also supported by the agency PHD, while production company Unit9 contributed to the development of the prototype.
The number of cyclists in major cities worldwide is increasing. For example, in London, the number of cyclists is expected to surpass the number of car drivers for the first time in history this year. At the same time, however, the risk of collisions between cyclists and inattentive pedestrians is also rising. In 2024 alone, according to data from Transport for London, the number of such incidents increased by 24%.
...
Read the original on www.skoda-storyboard.com »
Almost everyone at some point in their career has dealt with the deeply frustrating process of moving large amounts of data from one place to another, and if you haven’t, you probably just haven’t worked with large enough datasets yet. For Andy Warfield, one of those formative experiences was at UBC, working alongside genomics researchers who were producing extraordinary volumes of sequencing data but spending an absurd amount of their time on the mechanics of getting that data where it needed to be. Forever copying data back and forth, managing multiple inconsistent copies. It is a problem that has frustrated builders across every industry, from scientists in the lab to engineers training machine learning models, and it is exactly the type of problem that we should be solving for our customers.
In this post, Andy writes about the solution that his team came up with: S3 Files. The hard-won lessons, a few genuinely funny moments, and at least one ill-fated attempt to name a new data type. It is a fascinating read that I think you’ll enjoy.
It turns out that sunflowers are a lot more promiscuous than humans.
About a decade ago, just before joining Amazon, I had wrapped up my second startup and was back teaching at UBC. I wanted to explore something that I didn’t have a lot of research experience with and decided to learn about genomics, and in particular the intersection of computer systems and how biologists perform genomics research. I wound up spending time with Loren Rieseberg, a botany professor at UBC who studies sunflower DNA—analyzing genomes to understand how plants develop traits that let them thrive in challenging environments like drought or salty soils.
The botanists’ joke about promiscuity (the one that started this blog) was one reason why Loren’s lab was so fun to work with. Their explanation was that human DNA has about 3 billion base pairs, and any two humans are 99.9% identical at a genomic level—all of our DNA is remarkably similar. But sunflowers, being flowers, and not at all monogamous, have both larger genomes (about 3.6 billion base pairs) and way more variation (10 times more genetic variation between individuals).
One of my PhD grads at the time, JS Legare, decided to join me on this adventure and went on to do a postdoc in Loren’s lab, exploring how we might move these workloads to the cloud. Genomic analysis is an example of something that some researchers have called “burst parallel” computing. Analyzing DNA can be done with massive amounts of parallel computation, and when you do that it often runs for relatively short periods of time. This means that using local hardware in a lab can be a poor fit, because you often don’t have enough compute to run fast analysis when you need to, and the compute you do have sits idle when you aren’t doing active work. Our idea was to explore using S3 and serverless compute to run tens or hundreds of thousands of tasks in parallel so that researchers could run complex analysis very very quickly, and then scale down to zero when they were done.
The biologists worked in Linux with an analytics framework called GATK4—a genomic analysis toolkit with integration for Apache Spark. All of their data lived on a shared NFS filer. In bridging to the cloud, JS built a system he called “bunnies” (another promiscuity joke) to package analyses in containers and run them on S3, which was a real win for velocity, repeatability, and performance through parallelization. But a standout lesson was the friction at the storage boundary.
S3 was great for parallelism, cost, and durability, but every tool the genomics researchers used expected a local Linux filesystem. Researchers were forever copying data back and forth, managing multiple, sometimes inconsistent copies. This data friction—S3 on one side, a filesystem on the other, and a manual copy pipeline in between—is something I’ve seen over and over in the years since. In media and entertainment, in pretraining for machine learning, in silicon design, and in scientific computing. Different tools are written to access data in different ways and it sucks when the API that sits in front of our data becomes a source of friction that makes it harder to work with.
We are all aware, and I think still maybe even a little stunned, at the way that agentic tooling is changing software development today. Agents are pretty darned good at writing code, and they are getting better at it fast enough that we’re all spending a fair bit of time thinking about what it all even means (even Werner). One thing that does really seem true though is that agentic development has profoundly changed the cost of building applications. Cost in terms of dollars, in terms of time, and especially in terms of the skill associated with writing workable code. And it’s this last part that I’ve been finding the most exciting lately, because for about as long as we’ve had software, successful applications have always involved combining two often disjointed skillsets: On one hand skill in the domain of the application being written, like genomics, or finance, or design, and on the other hand skill in actually writing code. In a lot of ways, agents are illustrating just how prohibitively high the barrier to entry for writing software has always been, and are suddenly allowing apps to be written by a much larger set of people–people with deep skills in the domains of the applications being written, rather than in the mechanics of writing them.
As we find ourselves in this spot where applications are being written faster, more experimentally, more diversely than ever, the cycle time from idea to running code is compressing dramatically. As the cost of building applications collapses, and as each application we build can serve as a reference for the next one, it really feels like the code/data division is becoming more meaningful than it has ever been before. We are entering a time where applications will come and go, and as always, data outlives all of them. The role of effective storage systems has always been not just to safely store data, but also to help abstract and decouple it from individual applications. As the pace of application development accelerates, this property of storage has become more important than ever, because the easier data is to attach to and work with, the more that we can play, build, and explore new ways to benefit from it.
S3 as a steward for your data
Over the past few years, the S3 team has been really focused on this last point. We’ve been looking closely at situations where the way that data is accessed in S3 just isn’t simple enough–precisely like the example of biologists in Loren’s lab having to build scripts to copy data around so that it’s in the right place to use with their tooling–and we started looking more broadly at places where customers were finding that working with storage was distracting them from working with data. The first lesson that we had here was with structured data. S3 stores exabytes of parquet data and averages over 25 million requests per second to that format alone. A lot of this was either as plain parquet or structured as Hive tables. And it was clear that people wanted to do more with this data. Open table formats, notably Apache Iceberg, were emerging as functionally richer table abstractions allowing insertions and mutations, schema changes, and snapshots of tables. While Iceberg was clearly helping lift the level of abstraction for tabular data on S3, it also still carried a set of sharp edges because it was having to surface tables strictly over the object API.
As Iceberg started to grow in popularity, customers who adopted it at scale told us that managing security policy was difficult, that they didn’t want to have to manage table maintenance and compaction, and that they wanted working with tabular data to be easier. Moreover, a lot of work on Iceberg and Open Table Formats (OTFs) generally was being driven specifically for Spark. While Spark is very important as an analytics engine, people store data in S3 because they want to be able to work with it using any tool they want, even (and especially!) the tools that don’t exist yet. So in 2024, at re:Invent, we launched S3 Tables as a managed, first-class table primitive that can serve as a building block for structured data. S3 Tables stores data in Iceberg, but adds guardrails to protect data integrity and durability. It makes compaction automatic, adds support for cross-region table replication, and continues to refine and extend the idea that a table should be a first-class data primitive that sits alongside objects as a way to build applications. Today we have over 2 million tables stored in S3 Tables and are seeing all sorts of remarkable applications built on top of them.
At around the same time, we were beginning to have a lot of conversations about similarity search and vector indices with S3 customers. AI advances over the past few years have really created both an opportunity and a need for vector indexes over all sorts of stored data. The opportunity is provided by advanced embedding models, which have introduced a step-function change in the ability to provide semantic search. Suddenly, customers with large archival media collections, like historical sports footage, could build a vector index and do a live search for a specific player scoring diving touchdowns and instantly get a collection of clips, assembled as a hit reel, that can be used in live broadcast. That same property of semantically relevant search is equally valuable for RAG and for applying models over data they weren’t trained on.
As customers started to build and operate vector indexes over their data, they began to highlight a slightly different source of data friction. Powerful vector databases already existed, and vectors had been quickly working their way in as a feature on existing databases like Postgres. But these systems stored indexes in memory or on SSD, running as compute clusters with live indices. That’s the right model for a continuous low-latency search facility, but it’s less helpful if you’re coming to your data from a storage perspective. Customers were finding, especially over text-based data like code or PDFs, that the vectors themselves were often more bytes than the data being indexed, stored on media many times more expensive.
So just like with the team’s work on structured data with S3 Tables, at the last re:Invent we launched S3 Vectors as a new S3-native data type for vector indices. S3 Vectors takes a very S3 spin on storing vectors in that its design anchors on a performance, cost and durability profile that is very similar to S3 objects. Probably most importantly though, S3 Vectors is designed to be fully elastic, meaning that you can quickly create an index with only a few hundred records in it, and scale over time to billions of records. S3 Vectors’ biggest strength is really with the sheer simplicity of having an always-available API endpoint that can support similarity search indices. Just like objects and tables, it’s another data primitive that you can just reach for as part of application development.
Today, we are launching S3 Files, a new S3 feature that integrates the Amazon Elastic File System (EFS) into S3 and allows any existing S3 data to be accessed directly as a network attached file system.
The story about files is actually longer, and even more interesting than the work on either Tables or Vectors, because files turn out to be a complex and tricky data type to cleanly integrate with object storage. We actually started working on the files idea before we launched S3 Tables, as a joint effort between the EFS and S3 teams, but let’s put a pin in that for a second.
As I described with the genomics example of analyzing sunflower DNA, there is an enormous body of existing software that works with data through filesystem APIs: data science tools, build systems, log processors, configuration management, and training pipelines. If you have watched agentic coding tools work with data, they are very quick to reach for the rich range of Unix tools to work directly with data in the local file system. Working with data in S3 means deepening the reasoning that they have to do to actively go list files in S3, transfer them to the local disk, and then operate on those local copies. And it’s obviously broader than just the agentic use case, it’s true for every customer application that works with local file systems in their jobs today. Natively supporting files on S3 makes all of that data immediately more accessible—and ultimately more valuable. You don’t have to copy data out of S3 to use pandas on it, or to point a training job at it, or to interact with it using a design tool.
With S3 Files, you get a really simple thing. You can now mount any S3 bucket or prefix inside your EC2 VM, container, or Lambda function and access that data through your file system. If you make changes, your changes will be propagated back to S3. As a result, you can work with your objects as files, and your files as objects.
And this is where the story gets interesting, because as we often learn when we try to make things simple for customers, making something simple is often one of the more complicated things that you can set out to do.
Builders hate the fact that they have to decide early on whether their data is going to live in a file system or an object store, and to be stuck with the consequences of that from then on. With that decision, they are basically picking how they are going to interact with their data not just now, but long into the future, and if they get it wrong they either have to do a migration or build a layer of automation for copying data.
Early on, the idea was basically that we would just put EFS and S3 in a giant pot, simmer it for a bit, and we would get the best of both worlds. We even called the early version of the project “EFS3” (and I’m glad we didn’t keep that name!). But things got tricky in a hurry. Every time we sat down to work through designs, we found difficult technical challenges and tough decisions. And in each of these decisions, either the file or the object presentation of data would have to give something up in the design that would make it a bit less good. One of the engineers on the team described this as “a battle of unpalatable compromises.” We were hardly the first storage people to discover how difficult it is to converge file and object into a single storage system, but we were also acutely aware of how much not having a solution to the problem was frustrating builders.
We were determined to find a path through it so we did the only sensible thing you can do when you are faced with a really difficult technical design problem: we locked a bunch of our most senior engineers in a room and said we weren’t going to let them out till they had a plan that they all liked.
Passionate and contentious discussions ensued. And ensued. And ensued. And eventually we gave up. We just couldn’t get to a solution that didn’t leave someone (and in most cases really everyone) unhappy with the design.
A quick aside at this point: I may be taking some dramatic liberties with the comment about locking people in a room. The Amazon meeting rooms don’t have locks on them. But to be clear on this point: I frequently find that we make the fastest and most constructive progress on really hard design problems when we get smart, passionate people with differing technical views in front of a whiteboard to really dig in over a period of days. This isn’t an earth-moving observation, but it’s often surprising how easy it can be to forget in the face of trying to talk through big hard problems in one-hour blocks over video conference. The engineers in these discussions deeply understood file and object workloads and the subtleties of how different they can be, and so these discussions were deep, sometimes heated, and absolutely fascinating. And despite all of this, we still couldn’t get to a design that we liked. It was really frustrating.
This was around Christmas of 2024. Leading into the holidays, the team changed course. They went through the design docs and discussion notes that they had and started to enumerate all of the specific design compromises and the behaviour that we would need to be comfortable with if we wanted to present both file and object interfaces as a single unified system. We all looked at it and agreed that it wasn’t the best of both worlds, it was the lowest common denominator, and we could all think of example workloads on both sides that would break in surprising, often subtle, and always frustrating ways.
I think the example where this really stood out to me was around the top-level semantics and experience of how objects and files are actually different as data primitives. Here’s a painfully simple characterization: files are an operating system construct. They exist on storage, and persist when the power is out, but when they are used they are incredibly rich as a way of representing data, to the point that they are very frequently used as a way of communicating across threads, processes, and applications. Application APIs for files are built to support the idea that I can update a record in a database in place, or append data to a log, and that you can concurrently access that file and see my change almost instantaneously, to an arbitrary sub-region of the file. There’s a rich set of OS functionality, like mmap() that doubles down on files as shared persistent data that can mutate at a very fine granularity and as if it is a set of in-memory data structures.
Now if we flip over to object world, the idea of writing to the middle of an object while someone else is accessing it is more or less sacrilege. The immutability of objects is an assumption that is cooked into APIs and applications. Tools will download and verify content hashes, they will use object versioning to preserve old copies. Most notable of all, they often build sophisticated and complex workflows that are entirely anchored on the notifications that are associated with whole object creation. This last thing was something that surprised me when I started working on S3, and it’s actually really cool. Systems like S3 Cross Region Replication (CRR) replicate data based on notifications that happen when objects are created or overwritten and those notifications are counted on to have at-least-once semantics in order to ensure that we never miss replication for an object. Customers use similar pipelines to trigger log processing, image transcoding and all sorts of other stuff–it’s a very popular pattern for application design over objects. In fact, notifications are an example of an S3 subsystem that makes me marvel at the scale of the storage system I get to work on: S3 sends over 300 billion event notifications every day just to serverless event listeners that process new objects!
The thing that we came to realize was that there is actually a pretty profound boundary between files and objects. File interactions are agile, often mutation heavy, and semantically rich. Objects on the other hand come with a relatively focused and narrow set of semantics; and we realized that this boundary that separated them was what we really needed to pay attention to, and that rather than trying to hide it, the boundary itself was the feature we needed to build.
When we got back from the holidays, we started locking (well, ok, not exactly locking) folks in rooms again, but this time with the view that the boundary between file and object didn’t actually have to be invisible. And this time, the team started coming out of discussions looking a lot happier.
The first decision was that we were going to treat first-class file access on S3 as a presentation layer for working with data. We would allow customers to define an S3 mount on a bucket or prefix, and that under the covers, that mount would attach an EFS namespace to mirror the metadata from S3. We would make the transit and consistency of data across the two layers an absolutely central part of our design. We started to describe this as “stage and commit,” a term that we borrowed from version control systems like git—changes would be able to accumulate in EFS, and then be pushed down collectively to S3—and that the specifics of how and when data transited the boundary would be published as part of the system, clear to customers, and something that we could actually continue to evolve and improve as a programmatic primitive over time. (I’m going to talk about this point a little more at the end, because there’s much more the team is excited to do on this surface).
Being explicit about the boundary between file and object presentations is something that I did not expect at all when the team started working on S3 Files, and it’s something that I’ve really come to love about the design. It is early and there is plenty of room for us to evolve, but I think the team all feels that it sets us up on a path where we are excited to improve and evolve in partnership with what builders need, and not be stuck behind those unpalatable compromises.
Not out of the woods
Deciding on this stage and commit thing was one of those design decisions that provided some boundaries and separation of concerns. It gave us a clear structure, but it didn’t make the hard problems go away. The team still had to navigate real tradeoffs between file and object semantics, performance, and consistency. Let me walk through a few examples to show how nuanced these two abstractions really are, and how the team approached these decisions.
S3 readers often assume full object updates, notifications, and in many cases access to historical versions. File systems have fine-grained mutations, but they have important consistency and atomicity tricks as well. Many applications depend on the ability to do atomic file renames as a way of making a large change visible all at once. They do the same thing with directory moves. S3 conditionals help a bit with the first thing but aren’t an exact match, and there isn’t an S3 analog for the second. So as mentioned above, separating the layers allows these modalities to coexist in parallel systems with a single view of the same data. You can mutate and rename a file all you want, and at a later point, it will be written as a whole to S3.
Authorization is equally thorny. S3 and file systems think about authorization in very different ways. S3 supports IAM policies scoped to key prefixes—you can say “deny GetObject on anything under /private/”. In fact, you can further constrain those permissions based on things like the network or properties of the request itself. IAM policies are incredibly rich, and also much more expensive to evaluate than file permissions are. File systems have spent years getting things like permission checks off of the data path, often evaluating up front and then using a handle for persistent future access. Files are also a little weird as an entity to wrap authorization policy around, because permissions for a file live in its inode. Hard links allow you to have many inodes for the same file, and you also need to think about directory permissions that determine if you can get to a file in the first place. Unless you have a handle on it, in which case it kind of doesn’t matter, even if it’s renamed, moved, and often even deleted.
There’s a lot more complexity, erm, richness to discuss here—especially around topics like user and group identity—but by moving to an explicit boundary, the team got themselves out of having to co-represent both types of permissions on every single object. Instead, permissions could be specified on the mount itself (familiar territory for network file system users) and enforced within the file system, with specific mappings applied across the two worlds.
This design had another advantage. It preserved IAM policy on S3 as a backstop. You can always disable access at the S3 layer if you need to change a data perimeter, while delegating authorization up to the file layer within each mount. And it left the door open for situations in the future where we might want to explore multiple different mounts over the same data.
If you are familiar with both file and object systems, it’s not a hard exercise to think about cases where file and object naming behaves quite differently. When you start to sit down and really dig into it, things get almost hilariously desolate. File systems have first-class path separators—often forward slash (“/”) characters. S3 has these too, but they are really just a suggestion. In fact, S3’s LIST command allows you to specify anything you want to be parsed as a path separator and there are a handful of customers who have built remarkable multi-dimensional naming structures that embed multiple different separators in the same paths and pass a different delimiter to LIST depending on how they want to organize results.
Here’s another simple and annoying one: because S3 doesn’t have directories, you can have objects that end with that same slash. That’s to say, that you can have a thing that looks like a directory but is a file. For about 20 minutes the team thought this was a cool feature and were calling them “filerectories.” Thank goodness we didn’t keep that one.
There are tens of these differences, and we carefully thought about restricting to a single common structure or just fixing ourselves on one side or the other. On all of these paths we realized that we were going to break assumptions about naming inside applications.
We decided to lean into the boundary and allow both sides to stick with their existing naming conventions and semantics. When objects or files are created that can’t be moved across the boundary, we decided that (and wow was this ever a lot of passionate discussion) we just wouldn’t move them. Instead, we would emit an event to allow customers to monitor and take action if necessary. This is clearly an example of downloading complexity onto the developer, but I think it’s also a profoundly good example of that being the right thing to do, because we are choosing not to fail things in the domains where they already expect to run, we are building a boundary that admits the vast majority of path names that actually do work in both cases, and we are building a mechanism to detect and correct problems as they arise.
The last big area of differences that the team spent a lot of time talking about was performance, and in particular the performance and request latency of namespace interactions. File and object namespaces are optimized for very different things. In a file system, there are a lot of data-dependent accesses to metadata. Accessing a file means also accessing (and in some cases updating) the directory record. There are also many operations that end up traversing all of the directory records along a path. As a result, fast file system namespaces—even big distributed ones, tend to co-locate all the metadata for a directory on a single host so that those interactions are as fast as possible. The object namespace is completely flat and tends to optimize for very highly parallel point queries and updates. There are many cases in S3 where individual “directories” have billions of objects in them and are being accessed by hundreds of thousands of clients in parallel.
As we looked through the set of challenges that I’ve just described, we spent a lot of time talking about adoption. S3 is two decades old and we wanted a solution that existing S3 customers could immediately use on their own data, and not one that meant migrating to something completely new. There are enormous numbers of existing buckets serving applications that depend on S3’s object semantics working exactly as documented. We were not willing to introduce subtle new behaviours that could break those applications.
It turns out that very few applications use both file and object interfaces concurrently on the same data at the same instant. The far more common pattern is multiphase. A data processing pipeline uses filesystem tools in one stage to produce output that’s consumed by object-based applications in the next. Or a customer wants to run analytics queries over a snapshot of data that’s actively being modified through a filesystem.
We realized that it’s not necessary to converge file and object semantics to solve the data silo problem. What customers needed was the same data in one place, with the right view for each access pattern. A file view that provides full NFS close-to-open consistency. An object view that provides full S3 atomic-PUT strong consistency. And a synchronization layer that keeps them connected.
So we shipped it
All of that arguing—the team’s list of “unpalatable compromises”, the passionate and occasionally desolate discussions about filerectories—turned out to be exactly the work we needed to do. I think the team all feels that the design is better for having gone through it. S3 Files lets you mount any S3 bucket or prefix as a filesystem on your EC2 instance, container, or Lambda function. Behind the scenes it’s backed by EFS, which provides the file experience your tools already expect. NFS semantics, directory operations, permissions. From your application’s perspective, it’s a mounted directory. From S3’s perspective, the data is objects in a bucket.
The way it works is worth a quick walk through. When you first access a directory, S3 Files imports metadata from S3 and populates a synchronized view. For files under 128 KB it also pulls the data itself. For larger files only metadata comes over and the data is fetched from S3 when you actually read it. This lazy hydration is important because it means that you can mount a bucket with millions of objects in it and just start working immediately. This “start working immediately” part is a good example of a simple experience that is actually pretty sophisticated under the covers–being able to mount and immediately work with objects in S3 as files is an obvious and natural expectation for the feature, and it would be pretty frustrating to have to wait minutes or hours for the file view of metadata to be populated. But under the covers, S3 Files needs to scan S3 metadata and populate a file-optimized namespace for it, and the team was able to make this happen very quickly, and as a background operation that preserves a simple and very agile customer experience.
When you create or modify files, changes are aggregated and committed back to S3 roughly every 60 seconds as a single PUT. Sync runs in both directions, so when other applications modify objects in the bucket, S3 Files automatically spots those modifications and reflects them in the filesystem view. If there is ever a conflict where files are modified from both places at the same time, S3 is the source of truth and the filesystem version moves to a lost+found directory with a CloudWatch metric identifying the event. File data that hasn’t been accessed in 30 days is evicted from the filesystem view but not deleted from S3, so storage costs stay proportional to your active working set.
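To make those rules concrete, here is a conceptual sketch of the reconciliation logic as the post describes it (my reading of the published behaviour, not AWS's implementation):

```
enum class Winner { FileSide, ObjectSide };

struct SyncState {
    bool dirtyInFileView;  // modified through the mount since the last commit
    bool changedInS3;      // object overwritten by some other client
};

Winner reconcile(const SyncState &s)
{
    if (s.dirtyInFileView && s.changedInS3) {
        // Conflict: S3 is the source of truth. The file-side version moves
        // to lost+found and a CloudWatch metric records the event.
        return Winner::ObjectSide;
    }
    if (s.dirtyInFileView) {
        // Aggregated with other local changes and committed back to S3
        // as a single PUT, roughly every 60 seconds.
        return Winner::FileSide;
    }
    // No local edits: changes made directly in S3 (if any) flow into
    // the filesystem view.
    return Winner::ObjectSide;
}
```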
There are many smaller, and really fun bits of work that happened as the team built the system. One of the improvements that I think is really cool is what we are calling “read bypass.” For high-throughput sequential reads, read bypass automatically reroutes the read data path away from traditional NFS access and instead performs parallel GET requests directly against S3. This approach achieves 3 GB/s per client (with further room to improve) and scales to terabits per second across multiple clients. And for those who are interested, there’s way more detail in our technical docs (which are a pretty interesting read).
One thing I’ve really come to appreciate about the design is how honest it is about its own edges. The explicit boundary between file and object domains isn’t a limitation we’re papering over. It’s the thing that lets both sides remain uncompromised. That said, there are places where we know we still have work to do. Renames are expensive because S3 has no native rename operation, so renaming a directory means copying and deleting every object under that prefix. We warn you when a mount covers more than 50 million objects for exactly this reason. Explicit commit control isn’t there at launch; the 60-second window works for most workloads but we know it won’t be enough for everyone. And there are object keys that simply can’t be represented as valid POSIX filenames, so they won’t appear in the filesystem view. We’ve been in customer beta for about nine months and these are the things that we’ve learned and continued to evolve and iterate on with early customers. We’d rather be clear about them than pretend they don’t exist.
When we were working with Loren’s lab at UBC, JS spent a remarkable amount of his time building caching and naming layers — not doing biology, but writing infrastructure to shuttle data between where it lived and where tools expected it to be. That friction really stood out to me, and looking back at it now, I think the lesson we kept learning — in that lab, and then over and over again as the S3 team worked on Tables, Vectors, and now Files — is that different ways of working with data aren’t a problem to be collapsed. They’re a reality to be served. The sunflowers in Loren’s lab thrived on variation, and it turns out data access patterns do too.
What I find most exciting about S3 Files is something I genuinely did not expect when we started: that the explicit boundary between file and object turned out to be the best part of the design. We spent months trying to make it disappear, and when we finally accepted it as a first-class element of the system, everything got better. Stage and commit gives us a surface that we can continue to evolve — more control over when and how data transits the boundary, richer integration with pipelines and workflows–and it sets us up to do that without compromising either side.
20 years ago, S3 started as an object store. Over the past couple of years, with Tables, Vectors, and now Files, it’s become something broader. A place where data lives durably and can be worked with in whatever way makes sense for the job at hand. Our goal is for the storage system to get out of the way of your work, not to be a thing that you have to work around. We’re nowhere near done, but I’m really excited about the direction that we’re heading in.
As Werner says, “Now, go build!”
...
Read the original on www.allthingsdistributed.com »