The Guardian engineering blog - Why we built our automated SXSW listings pages - Part two

Earlier this week I posted about how we built our automated SXSW music listings pages. Generating automatic pages on an editorial site such as guardian.co.uk has some risks associated with it, and this is how we coped with them.

Problems with user-generated content

I wanted to list artists by genre as well as by name, and the Last.fm API helpfully returns the tags users have given the artists. The problem with user-generated content, however, is that it can be quite random. As anybody on Last.fm can tag artists with whatever they want, you end up with artists tagged with meaningless things like “bands i need to check out” or the bizarre “bands that would eat children if only they could fit a whole one inside their mouths”. After the import process we were left with hundreds of tags, of which only some were useful to us. Stephen Abbott spent quite a while deleting tags from our content and making sure the remaining ones all made sense.

The robot

With more than 1,800 pages and more being added daily, keeping up with all the band pages and making sure the data on them is accurate is an impossible task. So we decided not to try. Instead we added a little robot icon – designed by Mariana Santos – and a disclaimer at the top of the page.

Admin pages

For obvious reasons, the Guardian would not launch automatic pages unless we had a quick way of taking them down when necessary. You can’t see them, but there are some admin screens where editors can remove any artist and any tag from the site, and also turn off the YouTube and SoundCloud components for an individual artist if they return inaccurate results. They can also choose which artists to feature on the A-Z listing pages.
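
As a rough illustration of the kind of editorial switches described above, here is a minimal Python sketch. It is not the code behind the admin screens: the class names, field names and the `render_plan` helper are all hypothetical, and the state is held in memory rather than in a datastore.

```python
from dataclasses import dataclass, field


@dataclass
class ArtistOverrides:
    """Per-artist editorial switches (hypothetical names)."""
    hidden: bool = False          # artist removed from the site
    show_youtube: bool = True     # YouTube component enabled
    show_soundcloud: bool = True  # SoundCloud component enabled
    featured: bool = False        # featured on the A-Z listing pages


@dataclass
class AdminState:
    """Everything editors have changed, kept in memory for this sketch."""
    overrides: dict = field(default_factory=dict)    # artist id -> ArtistOverrides
    removed_tags: set = field(default_factory=set)   # tags removed site-wide

    def for_artist(self, artist_id):
        # Create default switches the first time an editor touches an artist.
        return self.overrides.setdefault(artist_id, ArtistOverrides())


def render_plan(artist, admin):
    """Return what an artist page should show, or None if the artist is removed."""
    switches = admin.overrides.get(artist["id"], ArtistOverrides())
    if switches.hidden:
        return None
    return {
        "name": artist["name"],
        "tags": [t for t in artist["tags"] if t not in admin.removed_tags],
        "youtube": switches.show_youtube,
        "soundcloud": switches.show_soundcloud,
        "featured": switches.featured,
    }
```

The point of keeping the switches separate from the imported data is that a later re-import does not undo an editor's decision.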

Under the hood

Although the pages look like Guardian content, I used the Guardian’s microapp framework to build the pages and all the data is hosted on an external app. The core of the app is data stored in Google App Engine, a platform that makes it very easy to get apps hosted and running. When the app encounters an artist name or MusicBrainz ID that it has no record of, it adds a set of tasks to the task queue to collect data from MusicBrainz, Last.fm, SoundCloud, Amazon and our gig API.
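
The "unknown artist triggers a set of import tasks" step can be sketched as follows. This is an illustration in Python with an in-memory `deque` standing in for the App Engine task queue; the `ArtistImporter` class, the `SOURCES` labels and the `fetchers` mapping are assumptions made for the example, not the app's real code.

```python
from collections import deque

# One import task per external source named in the post (labels are illustrative).
SOURCES = ["musicbrainz", "lastfm", "soundcloud", "amazon", "gigs"]


class ArtistImporter:
    def __init__(self):
        self.known = set()    # artist names / MusicBrainz IDs already seen
        self.queue = deque()  # stand-in for a real task queue

    def see(self, artist_key):
        """Enqueue import tasks for an unseen artist; return how many were added."""
        if artist_key in self.known:
            return 0
        self.known.add(artist_key)
        for source in SOURCES:
            self.queue.append((source, artist_key))
        return len(SOURCES)

    def run_pending(self, fetchers):
        """Drain the queue, calling the fetcher registered for each source."""
        results = []
        while self.queue:
            source, artist_key = self.queue.popleft()
            results.append((source, artist_key, fetchers[source](artist_key)))
        return results
```

In a real deployment each task would run as its own request, so a slow or failing source does not hold up the others.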

Artist details are stored in App Engine’s datastore using Siena, a simple ORM library. Each artist page is built by querying the datastore for records associated with the artist (gigs, albums, tracks etc). The results of the queries are then stored in memcache for a few minutes, to prevent high traffic causing too many concurrent queries to the datastore. App Engine also makes it easy to create recurring tasks that run on a schedule. Our listing pages check the SXSW website daily to ensure we have the most up-to-date listings.
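
The short-lived caching of query results described above follows a read-through pattern, sketched here in Python. The `TTLCache` class is a hypothetical in-memory stand-in for memcache, and the 300-second lifetime in the usage example is an assumed value for "a few minutes".

```python
import time


class TTLCache:
    """Tiny read-through cache with a fixed time-to-live per entry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested
        self._entries = {}      # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value, or call compute() and cache its result."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]     # still fresh: skip the expensive query
        value = compute()
        self._entries[key] = (now + self.ttl, value)
        return value
```

A page handler would then call something like `cache.get_or_compute("artist:" + artist_id, lambda: query_datastore(artist_id))` with `cache = TTLCache(300)`, where `query_datastore` is a placeholder for the real queries. The sketch does nothing to stop several requests recomputing the same entry at the moment it expires.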
