PastPages.org has expanded to archive HTML from a pilot group of news homepages thanks to a recently launched open-source software project sponsored by the Reynolds Journalism Institute.
In addition to more than 1.5 million images already archived from nearly 100 news sites around the world, PastPages users can now download and analyze the raw source code, harvested each hour, from five upgraded sites. They are CNN, The Drudge Report, Google News, the Los Angeles Times and the New York Times.
The new HTML archival system is powered by StoryTracker, a project started in June thanks to funding from the Reynolds Journalism Institute, a research and development center based at the University of Missouri.
The lead developer is me, Ben Welsh, founder of PastPages (and graduate of the Missouri School of Journalism).
My charge is to aid researchers affiliated with the institute in a scholarly effort to track and analyze the hyperlinks, headlines and images published by Internet news publishers. Our shared goal is to pursue that mission by crafting free and open software tools that benefit from the code of others, and, we hope, ultimately benefit others by breaking new ground.
Our project is far from complete, but the new features at PastPages illustrate what is possible today using StoryTracker’s codebase, already published on GitHub and distributed via the Python Package Index.
There you can access a menu of options, documented here, for creating an orderly archive of HTML snapshots, as well as the outlines of a system for analyzing content that will expand in the coming weeks.
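To give a feel for the kind of analysis an HTML snapshot makes possible, here is a minimal sketch that pulls hyperlinks and their anchor text out of an archived page. It uses only Python's standard library and is not StoryTracker's own code; the `LinkExtractor` class and `extract_links` function are names I made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from an HTML snapshot."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # URL of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own address
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the anchor text
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html, base_url):
    """Return every hyperlink found in the HTML as (url, text) tuples."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Run against each hourly snapshot, a list like this could be compared from one hour to the next to see which links appeared, moved or vanished.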
We have also released django-urlarchivefield, a custom database field for the Django web framework that, given a few simple lines of code, will automatically archive a URL to the storage backend of your choice. You can see how it is used within PastPages here and here.
There is also storysniffer, the beginnings of an effort to create a straightforward service that can inspect a URL and return an estimate about whether or not it links to a news story.
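As a rough illustration of the idea, and not of how storysniffer actually works, here is a toy scoring function that guesses from the shape of a URL alone. The `story_score` name, the word list and the weights are all assumptions invented for this sketch.

```python
import re
from urllib.parse import urlparse

# Hypothetical list of single-word paths that usually lead to section fronts
SECTION_WORDS = {"about", "contact", "video", "sports", "politics", "weather"}


def story_score(url):
    """Return a rough 0.0 to 1.0 guess that the URL points to a news story."""
    path = urlparse(url).path.strip("/")
    if not path:
        return 0.0  # a bare homepage is not a story

    segments = path.split("/")
    last = segments[-1]

    # A lone section name such as /sports is almost never a story
    if len(segments) == 1 and last.lower() in SECTION_WORDS:
        return 0.0

    score = 0.0
    # Dated paths such as /2012/06/28/ are common in story URLs
    if re.search(r"/(19|20)\d{2}/\d{1,2}/", "/" + path + "/"):
        score += 0.4
    # Long hyphenated slugs usually come from a headline
    if last.count("-") >= 3:
        score += 0.4
    # Long numeric IDs often identify a single article
    if re.search(r"\d{5,}", last):
        score += 0.2
    return min(score, 1.0)
```

A real service would need far more signals than this, and would have to be checked against a hand-labeled sample of links.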
All of this could benefit greatly from your ideas, critiques, bug reports and, most of all, patches. And if you have any thoughts you’d like to share privately, please email me at firstname.lastname@example.org.
Right now, you’ll have to know a little Python to get it going, but if it proves useful it could grow into something for anyone to use via a web interface. The GIF at the top of this post was made like so:
Credit for the idea goes to PastPages users who impressed me with GIFs of their own, including Jeremy Singer-Vine, Andrei Scheinkman, and Zachary M. Seward.
And please keep hacking on that new API!
I’m happy to announce the launch of the PastPages API, which offers a machine-readable version of the site that programmers can use to mine our homepage archive.
While the API is currently free and requires no registration, access is throttled and the system’s structure is likely to change in the future. It was developed using django-tastypie and follows its common conventions.
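Because the API follows django-tastypie's conventions, list responses should arrive as JSON with an `objects` array and a `meta` block whose `next` value points to the following page. Here is a minimal sketch of walking those pages with Python's standard library. The `fetch_json` and `iter_objects` helpers are names of my own, and the endpoint shown in the usage note is a hypothetical placeholder, so check the API itself for the real resource paths.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch_json(url):
    """Download a URL and decode its JSON body."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_objects(start_url, fetch=fetch_json, max_pages=5):
    """Yield records from a tastypie-style list, following meta.next.

    max_pages caps the number of requests, since access is throttled.
    """
    url = start_url
    for _ in range(max_pages):
        page = fetch(url)
        for obj in page["objects"]:
            yield obj
        next_path = page["meta"].get("next")
        if not next_path:
            break  # last page reached
        # tastypie usually returns a relative path, so resolve it
        url = urljoin(url, next_path)
```

Usage would look something like `for record in iter_objects("http://example.com/api/v1/some-resource/?format=json"): print(record)`, with the placeholder address swapped for a real one.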
Today marks the release of the second generation of PastPages' code base, nicknamed “bradlee.” The screenshotting system has been rewritten to make it faster and cheaper by shedding dependencies and introducing a task queue. Here's a quick rundown:
- Firefox -> Webkit
- Selenium -> PhantomJS
- Xvfb headless server -> Nothing!
- One-by-one screenshot script -> Concurrent Celery queue
- Memcached -> Varnish
The result is that a significantly less powerful server now completes a screenshotting run in half the time the old server took. That saves money in addition to time.
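The move from a one-by-one script to concurrent jobs is where much of that speedup comes from. The sketch below shows the general pattern using a thread pool from Python's standard library as a stand-in. It is not the Celery setup PastPages runs, and `capture` is a hypothetical placeholder for the real screenshot job.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def capture(url):
    """Hypothetical stand-in for one screenshot job.

    It only builds a filename; a real job would render the page.
    """
    return "{}.png".format(url.replace("://", "_").replace("/", "_"))


def capture_all(urls, job=capture, workers=4):
    """Run one job per URL concurrently and map each URL to its result."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(job, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failing site should not sink the whole run
                results[url] = exc
    return results
```

With a task queue like Celery the same idea is spread across worker processes, so a slow or broken site no longer holds up the sites queued behind it.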
For a few minutes this morning, CNN ran an incorrect headline declaring that a significant part of President Barack Obama’s healthcare law had been struck down by the Supreme Court.
I captured the image above manually several minutes ago. CNN has already corrected the page with a new headline.
The error was missed by PastPages’ hourly script, which visits CNN once per hour. The last visit happened at 10:02 AM EDT, before CNN made a judgement. When it visits again in the next hour, the error will certainly already be gone.
This shows just how quickly news sites can change the framing of stories and proves that even PastPages’ hourly screenshot is wanting. One of my goals with the future development of the site is to increase how frequently it captures data. Al Shaw has suggested we allow for instant on-demand archival when a human spots an error that ought to be captured.
@palewire PastPages totally needs a GO NOW button that you can mash when shit gets crazy— Al Shaw (@A_L) June 28, 2012
If you’re a developer and you’d like to help make this happen, all of the code is open on GitHub and I’d welcome your contributions.
This morning the United States Supreme Court issued a split decision on the legality of a hardline immigration law adopted by the state of Arizona. Four of the law’s provisions were reviewed, but only three were struck down, according to Kevin Russell at SCOTUSblog.
English-language news outlets in the U.S. and Britain jumped on the news, but disagreed on how to frame the results. Some emphasized that much of the law went down. Others emphasized the survival of a part of the law that, according to the Los Angeles Times, will allow “state officials to begin enforcing a provision that calls on police, when making lawful stops, to check the immigration status of people who may be in the country illegally.”
Fox News and the Los Angeles Times are examples of a “glass three-quarters empty” frame.
Reuters and the BBC are examples of the “glass quarter full” frame, presenting the ruling as good news for the law’s supporters.
You can review all of the homepages archived by PastPages for that same hour right here.
Also, the Los Angeles Times is my employer, but in no way associated with PastPages, which I maintain on my own time with the support of a network of individual donors. Read all about it.
Update: Soon after, Reuters changed its play, opting for a more ambiguous frame with this revised headline.