PastPages is the news homepage archive.

Every hour it captures a snapshot of the top stories reported by news organizations around the world.

This blog is a selection from its files by Editor and Publisher Ben Welsh.

PastPages now archiving HTML

PastPages.org has expanded to archive HTML from a pilot group of news homepages thanks to a recently launched open-source software project sponsored by the Reynolds Journalism Institute.

In addition to more than 1.5 million images already archived from nearly 100 news sites around the world, PastPages users can now download and analyze the raw source code, harvested each hour, from five upgraded sites. They are CNN, The Drudge Report, Google News, the Los Angeles Times and the New York Times.

The new HTML archival system is powered by StoryTracker, a project started in June thanks to funding from the Reynolds Journalism Institute, a research and development center based at the University of Missouri.

The lead developer is me, Ben Welsh, founder of PastPages (and graduate of the Missouri School of Journalism).

My charge is to aid researchers affiliated with the institute in a scholarly effort to track and analyze the hyperlinks, headlines and images published by Internet news publishers. Our shared goal is to pursue that mission by crafting free and open software tools that benefit from the code of others, and, we hope, ultimately benefit others by breaking new ground.

Our project is far from complete, but the new features at PastPages illustrate what is possible today using StoryTracker’s codebase, already published on GitHub and distributed via the Python Package Index.

There you can access a menu of options, documented here, for creating an orderly archive of HTML snapshots, as well as the outlines of a system for analyzing content that will expand in the coming weeks.
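For the curious, here is a minimal sketch of what an hourly archiving call looks like with storytracker. It follows the package’s archive helper as I understand it, but treat the exact keyword arguments as assumptions and defer to the documentation linked above:

# A minimal sketch of archiving one homepage with storytracker. The archive
# helper and its output_dir keyword follow the project docs as I recall them;
# verify against the documentation before relying on this.
import storytracker

# Download the homepage and write a timestamped, gzipped HTML snapshot into
# ./archive, so repeated hourly runs build up an orderly archive directory.
snapshot = storytracker.archive(
    "http://www.cnn.com/",
    output_dir="./archive",
)

# The returned object also keeps the raw HTML in memory for immediate analysis.
print(len(snapshot.html))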

We have also released django-urlarchivefield, a custom database field for the Django web framework that, given a few simple lines of code, will automatically archive a URL to the storage backend of your choice. You can see how it is used within PastPages here and here.
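As a purely hypothetical sketch, wiring it into a Django model might look something like this; the import path, field name and keyword argument are assumptions based only on the description above, so check the repository for the real usage:

# Hypothetical sketch only: the import path, field name and keyword argument
# are assumptions, not the project's actual interface. Check the
# django-urlarchivefield repository before copying this.
from django.db import models
from urlarchivefield.fields import URLArchiveField  # assumed import path


class Homepage(models.Model):
    url = models.URLField()
    # On save, the field is expected to fetch the page and write the archived
    # copy to whatever Django storage backend the project has configured.
    archive = URLArchiveField(upload_to="archives/")  # assumed signature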

There is also storysniffer, the beginnings of an effort to create a straightforward service that can inspect a URL and return an estimate about whether or not it links to a news story.
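The idea is that a call as simple as the one below returns a guess. The guess helper here is an assumption based on the project’s description, so check the storysniffer repository for the actual interface:

# A sketch of the storysniffer idea: guess whether a URL points at a news
# story rather than, say, a section front. The guess helper is an assumption
# based on the project description above.
import storysniffer

urls = [
    "http://www.cnn.com/",  # a homepage, probably not a story
    "http://www.cnn.com/2014/06/28/politics/supreme-court-ruling/index.html",
]
for url in urls:
    print(url, storysniffer.guess(url))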

All of this could benefit greatly from your ideas, critiques, bug reports and, most of all, patches. And if you have any thoughts you’d like to share privately, please email me at ben.welsh@gmail.com.

The Seattle Times homepage, animated by pastpages2gif.

New tool makes animated GIFs from the PastPages homepage archive


Last night I cobbled together pastpages2gif, a command-line tool that pulls images from the new PastPages API and combines them into an animated GIF.

Right now, you’ll have to know a little Python to get it going, but if it proves useful it could grow into something anyone can use via a web interface. The GIF at the top of this post was made with just a few lines of code.
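As a rough sketch of the technique rather than the exact pastpages2gif invocation, the core trick is to pull down a series of screenshots and stitch them together with the Pillow imaging library:

# Rough sketch of the core trick behind pastpages2gif: download a series of
# homepage screenshots and stitch them into an animated GIF with Pillow.
# The URL list is a stand-in for whatever the PastPages API returns.
import io

import requests
from PIL import Image

screenshot_urls = [
    # Fill in with hourly screenshot image URLs pulled from the PastPages API.
]

frames = []
for url in screenshot_urls:
    raw = requests.get(url).content
    frames.append(Image.open(io.BytesIO(raw)).convert("P"))  # GIF frames use a palette

if frames:
    # Pillow writes an animated GIF when the remaining frames are passed
    # in via append_images.
    frames[0].save(
        "homepage.gif",
        save_all=True,
        append_images=frames[1:],
        duration=500,  # milliseconds per frame
        loop=0,        # loop forever
    )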

More examples and a copy of the code are at https://github.com/pastpages/pastpages2gif. If you see anything that sucks or have an idea for improvements, please email me or file a ticket.

Credit for the idea goes to PastPages users who impressed me with GIFs of their own, including Jeremy Singer-Vine, Andrei Scheinkman, and Zachary M. Seward.

And please keep hacking on that new API!

Say hello to the PastPages API


I’m happy to announce the launch of the PastPages API, which offers a machine-readable version of the site that programmers can use to mine our homepage archive.

You can easily get a list of all the sites we track, see the latest homepages from France, find out how Xinhua covered New Year’s Eve or any other query you can dream up.

The data are published in JSON, JSONP, XML and other popular formats. Documentation is available at http://www.pastpages.org/api/docs/.

While the API is currently free and requires no registration, access is throttled and the system’s structure is likely to change in the future. It was developed using django-tastypie and follows its common conventions.
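As a quick illustration, a single query with Python’s requests library might look like the sketch below. The resource path and filter name are guesses modeled on Tastypie’s conventions; the documentation page lists the real endpoints and parameters.

# A hedged sketch of querying the PastPages API. The resource path and the
# double-underscore filter are assumptions modeled on django-tastypie's
# conventions; see http://www.pastpages.org/api/docs/ for the real ones.
import requests

response = requests.get(
    "http://www.pastpages.org/api/beta/screenshots/",  # assumed resource path
    params={"format": "json", "site__slug": "cnn"},    # assumed filter name
)
response.raise_for_status()

# Tastypie wraps paginated results in an "objects" list alongside "meta".
for screenshot in response.json()["objects"]:
    print(screenshot.get("timestamp"), screenshot.get("image"))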

If you encounter any problems, please contact me via email or file a ticket. Try it out. Tell me what sucks.

Second-generation PastPages code base is all the way live


Today marks the release of the second generation of PastPages' code base, nicknamed “bradlee.” The screenshotting system has been rewritten to make it faster and cheaper by shedding dependencies and introducing a task queue. Here's a quick rundown:

  • Firefox -> WebKit
  • Selenium -> PhantomJS
  • Xvfb headless server -> Nothing!
  • One-by-one screenshot script -> Concurrent Celery queue
  • Memcached -> Varnish

The result is that a significantly less powerful server now completes a screenshotting run in half the time the old one required. That saves money as well as time.
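To give a feel for the new pipeline, here is a simplified sketch, not the production code, of a screenshot task: a Celery worker shells out to PhantomJS using its stock rasterize.js example script. The broker URL, paths and names are illustrative stand-ins; the real implementation lives in the repository linked below.

# Simplified sketch of the new approach: a Celery task that shells out to
# PhantomJS to render one homepage. The broker URL, file paths and names are
# illustrative stand-ins, not the production configuration.
import subprocess
from datetime import datetime

from celery import Celery

app = Celery("screenshots", broker="redis://localhost:6379/0")


@app.task
def screenshot(url, slug):
    """Render the homepage at url with PhantomJS and save a timestamped PNG."""
    outfile = "%s-%s.png" % (slug, datetime.utcnow().strftime("%Y%m%d%H%M"))
    # rasterize.js is the capture script that ships with PhantomJS's examples.
    subprocess.check_call(["phantomjs", "rasterize.js", url, outfile])
    return outfile


# Queueing one task per site lets many captures render concurrently across
# workers, instead of the old one-by-one loop:
# screenshot.delay("http://www.cnn.com/", "cnn")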

All of the code is open source on GitHub with the entire deployment route included as a Chef cookbook. Patches welcome!

Chicago Tribune redesign, before and after

Before at 6pm CDT

After at 8pm CDT

For a few minutes this morning, CNN ran an incorrect headline declaring that a significant part of President Barack Obama’s healthcare law had been struck down by the Supreme Court. 

I captured the image above manually several minutes ago. CNN has already corrected the page with a new headline.

The error was missed by PastPages’ hourly script, which visits CNN once per hour. The last visit happened at 10:02 AM EDT, before CNN made its call. By the time the script returns in the next hour, the erroneous headline will certainly be gone.

This shows just how quickly news sites can change the framing of stories, and it demonstrates that even PastPages’ hourly screenshots are wanting. One of my goals with the future development of the site is to increase how frequently it captures data. Al Shaw has suggested we allow for instant on-demand archival when a human spots an error that ought to be captured.

If you’re a developer and you’d like to help make this happen, all of the code is open on GitHub and I’d welcome your contributions.

Media split on how to frame decision on Arizona’s controversial immigration law

This morning the United States Supreme Court issued a split decision on the legality of a hardline immigration law adopted by the state of Arizona. Four of the law’s provisions were reviewed, but only three were struck down, according to Kevin Russell at SCOTUSblog.

English-language news outlets in the U.S. and Britain jumped on the news, but disagreed on how to frame the results. Some emphasized that much of the law went down. Others emphasized the survival of a part of the law that, according to the Los Angeles Times, will allow “state officials to begin enforcing a provision that calls on police, when making lawful stops, to check the immigration status of people who may be in the country illegally.”

Fox News and the Los Angeles Times are examples of a “glass three-quarters empty” frame.

Reuters and the BBC are examples of the “glass quarter full” frame, presenting the news as good news for the law’s supporters.

You can review all of the homepages archived by PastPages for that same hour right here.

Also, the Los Angeles Times is my employer, but it is in no way associated with PastPages, which I maintain on my own time with the support of a network of individual donors. Read all about it.

Update: Soon after, Reuters changed its play, opting for a more ambiguous frame with this revised headline.

Advanced search by title, tag and date range

PastPages now offers an advanced search that allows users to quickly pull up screenshots from any date range by title or tag. Try it out. Tell me what sucks.

Knight News Challenge Round 2: The Mapping L.A. API 

Check out my Knight News Challenge pitch related to my day job at the Los Angeles Times.
