Hexbyte – Tech News – Ars Technica
AUSTIN, Texas—As much as subscription services want you to believe it, not
everything
can be found on Amazon or Netflix. Want to read Brett Kavanaugh buddy
, for instance (or their now infamous
)? Curious to watch a bunch of
? How about perusing the
in the world? There’s one place to turn today, and it’s not Google or any pirate sites you may or may not frequent.
“I’ve got government video of how to wash your hands or prep for nuclear war,” says Mark Graham, director of the Wayback Machine at the Internet Archive. “We could easily make a list of .ppt files in all the websites from .mil, the Military Industrial PowerPoint Complex.”
Graham recently talked with several small groups of attendees at the 2018 Online News Association conference, and Ars was lucky enough to be part of one. He later made a full presentation to the conference, which is now available in audio form. And the immediate takeaway is that the scale of the Internet Archive today may be as hard to fathom as the scale of the Internet itself.
The longtime non-profit’s physical space remains easy to comprehend, at least, so Graham starts there. The main operation now runs out of an old church (pews still intact) in San Francisco, with the Internet Archive today employing nearly 200 staffers. The archive also maintains a nearby warehouse for storing physical media—not just books, but things like vinyl records, too. That’s where Graham jokes the main unit of measurement is “shipping container.” The archive gets that much material every two weeks.
The organization currently stands as the second-largest scanner of books in the world, behind only Google. Graham put the current total above four million. The archive even has a wishlist for its next 1.5 million scans, including anything cited on Wikipedia. Yes, the Wayback Machine is in the process of making sure you don't hit 404s down any Wiki rabbit hole (Graham recently told the BBC that Wayback bots have restored nearly six million pages lost to linkrot as part of that effort). Today, books published prior to 1923 are free to download through the Internet Archive, and much of what was published afterward can be borrowed as a digital copy.
So grateful for the extraordinary work our friends at @internetarchive are doing to fight 404s and digitally preserve millions of links to websites and sources Wikipedians cite, as they build the world’s largest encyclopedia. 🙌 https://t.co/LRN2uyFQKQ
— WikiResearch (@WikiResearch) October 2, 2018
Of course, the Internet Archive offers much more than text these days. Its broadcast-news collection has more than 200 million hours with tools such as the ability to search for words in chyrons and access to recent news (broadcasts are embargoed for 24 hours and then delivered to visitors in searchable two-minute chunks). The growing audio and music portion of the Internet Archive covers radio news, podcasting, and physical media (like a collection of 200,000 78s recently donated by the Boston Library). And as Ars has written about, the organization boasts an extensive classic video game collection that anyone can boot up in a browser-based emulator for research or leisure. Officially, that section involves 300,000-plus overall software titles, “so you can actually play Oregon Trail on an old Apple C computer through a browser right now—no advertising, no tracking users,” Graham says.
“Some might call us hoarders,” he says. “I like to say we’re archivists.”
In total, Graham says the Internet Archive adds four petabytes of information per year (that’s four million gigabytes, for context). The organization’s current data totals 22 petabytes, but the Internet Archive actually holds on to 44 petabytes’ worth. “Because we’re paranoid,” Graham says. “Machines can go down, and we have a reputation.” That NASA-ish ethos once helped the non-profit survive nearly $600,000 worth of fire damage without losing any archived data.