Monthly Archives: May 2011

7 ways to get data out of PDFs

A frequent obstacle in data journalism is finding that the information you want to analyse is locked away in a PDF. Here are 6 ways to tackle that problem – with space for a 7th:

1) For simple PDFs: Google Docs’ conversion facility

Google Docs recently added a feature that allows you to convert a PDF to a ‘Google document’ when you upload it. It’s pretty powerful, and about the simplest way you can extract information.

It does not work, however, if the PDF was generated by scanning – in other words if it is an image, rather than a document that has been converted to PDF.

2) For scanned documents and pulling out key players: Document Cloud

Document Cloud is a tool for journalists to convert PDFs to text. It will also add ‘semantic’ information along the way, such as what organisations, people and ‘entities’ such as dates and locations are mentioned within it, and there are some useful features that allow you to present documents for others to comment on. 

The good news is that it works very well with scanned documents, using Optical Character Recognition (OCR). The bad news is that you need to ask permission to use it, so if you don’t work as a professional journalist you may not be able to use it. Still, there’s no harm in asking.

3) For scanned documents: The Data Science Toolkit

The Data Science Toolkit allows you to do lots of clever things, including converting PDFs using OCR with the File2Text converter. Upload your document, and you’re away. It also works on other document formats, and on PNGs, TIFFs and JPEGs.

4) For stripping out tables: PDF2XL

If you’re willing to shell out around £70 then PDF2XL is recommended as a useful piece of software for stripping tables out of PDF files and into Excel.

5) For automating the process: Scrape from PDF to XML using Scraperwiki

Scraperwiki is a collaborative website for scraping all sorts of hard-to-find information into some sort of useful format, so it’s no surprise that PDFs are a common problem there. They have a template scraper for converting PDF documents to XML (a more structured format) – if you can understand a little bit of programming then you can try to adapt it to your own purposes.
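To give a flavour of what that adaptation can involve, here is a minimal Python sketch. It assumes the converted XML stores each fragment of text as a `text` element with `top` and `left` position attributes, which is the layout produced by converters such as `pdftohtml -xml`. The sample XML and the `rows_from_xml` function are made up for illustration and are not taken from the Scraperwiki template.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample in the style of a PDF-to-XML converter's output.
# It is invented for this sketch and does not come from a real document.
SAMPLE_XML = """<pdf2xml>
  <page number="1" height="1188" width="918">
    <text top="100" left="50" width="80" height="12" font="0">Department</text>
    <text top="100" left="300" width="60" height="12" font="0">Spend</text>
    <text top="120" left="50" width="80" height="12" font="0">Highways</text>
    <text top="120" left="300" width="60" height="12" font="0">1200</text>
  </page>
</pdf2xml>"""


def rows_from_xml(xml_string):
    """Group text fragments that share a vertical position into table rows."""
    root = ET.fromstring(xml_string)
    rows = defaultdict(list)
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for text in page.iter("text"):
            # Fragments with the same page and 'top' value sit on the same line.
            key = (page_no, int(text.get("top")))
            content = "".join(text.itertext()).strip()
            rows[key].append((int(text.get("left")), content))
    # Order rows top to bottom, and cells within each row left to right.
    return [[cell for _, cell in sorted(cells)] for _, cells in sorted(rows.items())]


table = rows_from_xml(SAMPLE_XML)
for row in table:
    print(row)
```

Real PDFs are rarely this tidy: cells on the same row often differ by a pixel or two in their `top` value, so you would probably need to round or bucket the positions before grouping.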

6) If it’s held by a public body and you have time: a well-written FOI request

Do you need all the data in the PDF or just some? Is that data available elsewhere? Try an advanced search using a phrase from the data in quotes and adding filetype:xls to see if you can find the spreadsheet it comes from. Or submit an FOI request for the data, stipulating that it be provided in spreadsheet or CSV (comma separated values) format. If the PDF was itself supplied in response to an FOI request, go back and ask for the information in one of those formats.
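If you run this kind of search often, the query is easy to build in code. The sketch below is only an illustration: the function names are invented, the example phrase is a placeholder, and Google's search URL is an assumption, since any search engine that supports the filetype: operator would do.

```python
from urllib.parse import urlencode


def spreadsheet_query(phrase, filetype="xls"):
    """Wrap the phrase in quotes and restrict results to one file type."""
    return f'"{phrase}" filetype:{filetype}'


def search_url(query):
    """Build a search URL for the query (Google is assumed here)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


# Placeholder phrase: swap in a distinctive phrase copied from your PDF.
query = spreadsheet_query("council spending 2010")
url = search_url(query)
print(query)
print(url)
```

Passing `filetype="csv"` instead looks for CSV versions of the same data.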

It’s a good idea to also ask how the information is stored, including any software used, as you can check with the software vendor how easily the information can be extracted and bat away any excuses the body may come back at you with.

7) Add your own here

There must be others – tell me your own tips.

UPDATE: On Twitter: Simon Rogers uses Acrobat Pro; Kevin Anderson uses Omnipage. And Jack Schofield uses Zamzar.

That investigations project summarised

To sum up the idea outlined in the previous post in more detail:

The project is a game platform to help journalists collaborate on investigations. The tool makes it easier for users to pursue investigations by:

  1. Providing project management functionality with template structures based on previous investigations, which users might also explore as a way of understanding a story
  2. Providing static and dynamic resources based on previous and new investigations
  3. Providing a pleasurable competitive experience based on game mechanics, using both negative and positive feedback mechanisms to incentivise progress
  4. Providing access to – and building – a network of other investigators

The platform builds on a number of qualities of investigative journalism in the internet age. Digital technology has made collaboration and research easier but competition for attention is higher. It builds on the experiences of the successful investigative journalism platform Help Me Investigate by separating the technology from editorial, facilitating network connections by focusing on a small number of investigation templates, and providing a platform for building on and connecting others’ experiences.

At the same time the game retains Help Me Investigate’s successful modularisation of investigations into challenges and updates, adding a turn-based competitive system that draws from game mechanics.

Some very exciting partner organisations are already lined up from the UK and Europe – but I know from experience that the best way to make a project better is to allow others to find out about it and comment on it.

An investigations game

The following is a description of a game that I'm hoping to build – if a bid to the IPI News Innovation Contest is successful. I'd welcome any suggestions for how this might be designed better – as well as potential contributors, partners and users.

An investigations game: how it works

Users register with the site and join an existing investigation – or start a new one based on a limited number of ‘templates’ (e.g. investigating lobbying; following the money of local government or EU expenditure, charity funding or health; testing the claims of a corporation or police investigation; etc.). Once joined, they can also invite others. An investigation must have at least two ‘players’ before it can begin.

Once under way, as a player you are given a challenge (e.g. submit a Freedom of Information request; analyse data; identify regulations; speak to an expert; sum up the story so far, etc.). The challenge will come with help tips and resources from investigative journalists. It also has a points value based on its difficulty.

You choose to accept, exchange or pass on the challenge. Exchanging will bring up a new challenge; passing will pass the challenge on to the next player.

Exchanging or passing comes with a points penalty – but if you accept and then complete a challenge, you will gain points. These can also be used to ‘unlock’ parts of the game or ‘level up’.
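As a rough illustration of the accept, exchange and pass rules, here is a minimal Python sketch. Nothing has been built yet, so every name and number in it is hypothetical: the penalty values, the `Player` and `Challenge` classes and the state labels are all invented for the example.

```python
from dataclasses import dataclass

# Hypothetical values: the actual penalties have not been decided.
EXCHANGE_PENALTY = 2
PASS_PENALTY = 3


@dataclass
class Player:
    name: str
    points: int = 0


@dataclass
class Challenge:
    description: str
    value: int  # points awarded on completion, based on difficulty


def respond(player, challenge, action):
    """Apply a player's response to a challenge and return what happens next."""
    if action == "accept":
        return "in_progress"
    if action == "exchange":
        player.points -= EXCHANGE_PENALTY
        return "new_challenge"  # the same player is dealt a different challenge
    if action == "pass":
        player.points -= PASS_PENALTY
        return "next_player"  # the challenge moves on to the next player
    raise ValueError(f"unknown action: {action}")


def complete(player, challenge):
    """Award the challenge's points value to the player who completed it."""
    player.points += challenge.value


player = Player("Ada")
foi = Challenge("Submit a Freedom of Information request", value=10)
respond(player, Challenge("Analyse data", value=8), "exchange")  # costs 2 points
respond(player, foi, "accept")
complete(player, foi)
print(player.points)  # 8
```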

Once you have accepted a challenge you have a limited time to complete it – anything from 24 hours to three weeks depending on the challenge. You can also choose to try to do the challenge faster for extra points.

You can add updates on your progress and edit the challenge itself, adding new resources or tips of your own. These are added to the global ‘template’, allowing other investigations to benefit. They will also gain you extra points.

If you have not marked the challenge as complete as the deadline nears, you will receive reminders (one of the findings of research into Help Me Investigate was the need for more ‘negative feedback’). You can ‘stall’ the deadline – but it will cost you points (the gamble you make is that you will earn more points if you succeed). If you fail to complete the challenge, points are deducted and play passes to the next player.

If you complete the challenge, however, you are awarded points, and rise up the leaderboard. Some challenges also come with ‘badges’ such as ‘FOI Star’, ‘Document Hound’, ‘Data Don’, and so on. These can be cross-published to social media such as Facebook and Twitter.

You will also be asked if you want to add or change an investigation ‘hypothesis’, and the next player must confirm that you have completed the challenge. They can ask you questions if your process is not transparent. Rejection will cost you points: trust is central to collaboration – two rejections will lead to your being ejected from an investigation.
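The confirmation step might be modelled along these lines. Again this is a sketch under assumptions rather than a design: the `Investigation` class, the `review` method and the size of the rejection penalty are hypothetical, and only the two-rejection limit comes from the description above.

```python
REJECTION_PENALTY = 5  # hypothetical value
MAX_REJECTIONS = 2     # two rejections lead to ejection


class Investigation:
    def __init__(self, players):
        self.players = list(players)
        self.points = {name: 0 for name in self.players}
        self.rejections = {name: 0 for name in self.players}

    def review(self, player, confirmed):
        """Record the next player's verdict on a completed challenge."""
        if confirmed:
            return "confirmed"
        # A rejection always costs points, even when it also triggers ejection.
        self.points[player] -= REJECTION_PENALTY
        self.rejections[player] += 1
        if self.rejections[player] >= MAX_REJECTIONS:
            self.players.remove(player)
            return "ejected"
        return "rejected"


investigation = Investigation(["ada", "ben"])
first = investigation.review("ada", confirmed=False)
second = investigation.review("ada", confirmed=False)
print(first, second, investigation.players)
```

Counting rejections per investigation rather than per player account is a design choice made for this sketch. It matches the wording that a player is ejected from an investigation, not from the site.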

Play continues in turn until a player decides the investigation is ‘closed’, posting a link to a report on the results.

Template investigations

The following represents a selection of potential investigations that users might be able to pursue, based on existing successful examples. These are obviously subject to change in discussion with partner organisations:

  1. Local government spending – follow the money
  2. National government: lobbying – identifying conflicts of interest
  3. EU politics – follow the money
  4. Policing and crime – accountability
  5. Consumer affairs – testing claims
  6. Science and environment – testing claims
  7. Health – follow the money
  8. Charity – follow the money
  9. Education – follow the money