The Guardian Engineering Blog - Hacks and Hackers: ScraperWiki day

A couple of weeks ago I participated in a hack event organised by ScraperWiki, which was sponsored by our Open Platform. This was part of their Hacks and Hackers series of hack days which have traversed the UK and Ireland, bringing together journalists and developers to mine data from the internet.

It was a one-day event that produced a number of projects, each based on scraping data from the web and building some form of journalistic narrative from that data.

What is ScraperWiki?

ScraperWiki is a website that aims to provide an environment where programmers can collaborate on building and maintaining screen scrapers. Screen scrapers pull useful information from web pages and convert it into a form that is more universally usable. So rather than simply reading the data, you could map it, or put it into a spreadsheet and add it all up, depending on the type of data you have scraped. This can be a very useful resource for journalists and researchers.
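To make the idea concrete, here is a minimal sketch of a screen scraper written with Python's standard library only. The page fragment, the column names and the figures are all made up for illustration, and a real scraper would download the page rather than read it from a string. It turns an HTML table into plain rows that you could then map, export or add up.

```python
from html.parser import HTMLParser

# Hypothetical page fragment (invented data); a real scraper would fetch this.
SAMPLE_HTML = """
<table>
  <tr><th>Site</th><th>Rent</th></tr>
  <tr><td>Office A</td><td>1200</td></tr>
  <tr><td>Depot B</td><td>800</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collects the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table in `html` as a list of rows of cell text."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_table(SAMPLE_HTML)
header, records = rows[0], rows[1:]

# Once the data is in rows, "adding it all up" is one line.
total_rent = sum(int(rent) for _site, rent in records)
```

Once the data is a list of rows, writing it out as a spreadsheet or feeding it to a mapping tool is a separate and much easier step.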

Mostly these scrapers are written for a given use case and then binned or forgotten about. Now you can put them on ScraperWiki, and the ScraperWiki team will run them on their servers on a regular basis and provide a database to store the data in. ScraperWiki also try to encourage collaboration: programmers are invited to improve existing scrapers, fix broken ones and add new ones, and anyone can suggest new data sets for the community to have a go at.
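Running a scraper on a schedule only works if repeated runs update the stored data rather than duplicate it. The sketch below shows that idea using Python's built-in sqlite3 module as a stand-in. It is not ScraperWiki's own storage API, and the table name, column names and figures are assumptions made for the example.

```python
import sqlite3


def save_records(conn, records):
    """Store (site, rent) pairs, replacing any earlier row for the same site."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS properties "
        "(site TEXT PRIMARY KEY, rent INTEGER)"
    )
    # The primary key on `site` means a re-run overwrites instead of duplicating.
    conn.executemany(
        "INSERT OR REPLACE INTO properties (site, rent) VALUES (?, ?)",
        records,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for the example only

# First scheduled run, then a second run in which one figure has changed.
save_records(conn, [("Office A", 1200), ("Depot B", 800)])
save_records(conn, [("Office A", 1250), ("Depot B", 800)])

stored = conn.execute(
    "SELECT site, rent FROM properties ORDER BY site"
).fetchall()
```

After both runs the table still holds one row per site, with the latest value for each.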

Hacks and Hackers

ScraperWiki have been running a series of events called Hacks and Hackers. The concept is to bring together journalists (hacks) and programmers (hackers), introduce them to ScraperWiki, and then encourage the hacks to pitch projects to the hackers (and vice versa) and build those projects in a single day. The single-day concept is important: it demonstrates to the hacks what a cross-discipline team can achieve in a short time.

Hacks and Hackers Glasgow

I attended the Glasgow event, which took place on Friday 25 March at the BBC. ScraperWiki have written up an excellent summary on their blog. More information can be found on our Edinburgh blog. Michael, our Edinburgh beat blogger, was one of the attendees.

For my own part I found the day to be equally frustrating and fascinating. Frustrating because scraping web pages can be difficult if the HTML of the page is poorly marked up. Other people found the lack of data frustrating. It is all very well to have an idea, but if that information is not online, or you don’t know where it might be, you can’t do that project.

Fascinating because the breadth of ideas and methods used was brilliant. For example, Mo McRoberts scraped the results of a Google search, as opposed to a single webpage, which I thought was the kind of genius idea that is so obvious as to be missed by most people.

The ScraperWiki blog provides details and video of the projects and participants.

My own project was to try to parse a site listing properties that the Scottish Executive owns and offers for rent or lease. The idea was to provide a map of the sites and their value, and to extrapolate from that some narrative about revenue lost by councils through sites being unused, compared with the cuts proposed by the councils.

Unfortunately we ran out of time before we could complete the scraper and get these results up on the maps. This was due to a combination of my own Python programming inexperience and the poor markup of the site. But we did learn that even when a site has poor markup it is possible to get the information using ScraperWiki tools. It’s just a bit fiddly.
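For a flavour of what "fiddly" means, here is a small sketch of one common fallback when the markup is too messy to walk as a structure: match the text patterns around the values you want. The HTML fragment, the property names and the "Rent:" label are invented for the example. They are not taken from the site we worked on, and this is not the scraper we wrote on the day.

```python
import re

# Invented example of poor markup: mixed-case tags, unclosed elements,
# inconsistent whitespace.
MESSY_HTML = """
<font size=2><b>Office A</B><br>Rent: 1,200
<p><b>Depot B</b><BR>
Rent:  800<p>
"""

# A bold name, a line break, then "Rent:" followed by a number.
PATTERN = re.compile(
    r"<b>\s*([^<]+?)\s*</b>\s*<br>\s*Rent:\s*([\d,]+)",
    re.IGNORECASE,
)


def scrape_messy(html):
    """Return (name, rent) pairs, with thousands separators removed."""
    return [
        (name, int(rent.replace(",", "")))
        for name, rent in PATTERN.findall(html)
    ]


properties = scrape_messy(MESSY_HTML)
```

The trade-off is that a pattern like this is tied to the exact wording of the page, so it tends to break whenever the site changes. That is part of why scrapers need ongoing maintenance.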

Summary

This was, for me, a new type of hack day, because it set out explicitly to put people into cross-discipline teams for the projects. This was interesting because the hacks all had personal themes that they wished to explore, which added a new dimension as you tried to find the data and give the hack what they needed to tell their story.

This was the last stop (so far) on the Hacks and Hackers tour, but if you want to get involved in similar things check out ScraperWiki’s get involved page. There are plenty of scrapers that require attention!