Tag Archives: PDF

How to: get data out of council budget reports

When councils publish their draft budget reports it’s not always easy to extract the figures that they’re based on. Here then is a guide to getting the data out of budget reports:

Get the ball rolling

Budget reports are generally presented in PDF format, with data presented as tables, appendices, charts and maps.

Before you do anything else, it’s worth asking the council’s press office for any spreadsheets used for the report – especially those behind charts and maps, which you cannot extract data from directly.

This might not get you data immediately, but it sets the ball rolling while you’re working on it from your side. On that front, you might also want to consider FOI requests to particular departments for data prepared for particular aspects of the budget.

Getting tables out of PDFs

What to do when an FOI response is not provided in the form asked for

Heard the one about the FOI request for data to be provided in an Excel spreadsheet? The authority printed it out, scanned it, and sent a PDF version of the scan.

You’ll forgive the journalist for being suspicious when an authority goes to such extremes to make it hard to interrogate their data.

In cases like these it’s worth looking at the Information Commissioner’s awareness guidance 29 on ‘means of communication’ (PDF):

This quotes Section 11(1) of the Freedom of Information Act, which stipulates that authorities should comply with your preference for “a copy of the information in [a] form acceptable to the applicant … so far as is reasonably practicable”.

The key phrase here is “reasonably practicable”. In the example above, it is hard to argue that simply sending the original Excel file – instead of a scanned PDF – was not “reasonably practicable”.

What then? Well, you should ask for an explanation, and make a formal complaint to the authority quoting the ICO guidance and Section 11(1) of the FOI Act. If that doesn’t get any results, write to the ICO. Here’s the full passage from the guidance:

“If a public authority decides that it is not reasonably practicable to provide the information in the form preferred by the applicant … the authority must tell the applicant and give its reasons. The duty on the public authority is then to provide the information by any means which are reasonable in the circumstances. 

“If the applicant is not satisfied with the decision and wants to make a complaint, they must complete the public authority’s complaints procedure (if there is one). Once this process is complete, if the applicant remains dissatisfied, they may write to the ICO.”

If you are unlucky enough to deal with an authority which is regularly uncooperative in this manner, it may be worth quoting awareness guidance 29 and Section 11(1) of the Act in your request for the information to be provided in spreadsheet form, for example:

“I would like this information to be provided in spreadsheet format (xls or csv) in line with Section 11(1) of the FOI Act and ICO Awareness Guidance 29.”

Also useful more broadly when looking at the way a request is handled are the guidelines on ‘Request handling’ on the ICO website.

7 ways to get data out of PDFs

A frequent obstacle in data journalism is that the information you want to analyse is locked away in a PDF. Here are 6 ways to tackle that problem – with space for a 7th:

1) For simple PDFs: Google Docs’ conversion facility

Google Docs recently added a feature that allows you to convert a PDF to a ‘Google document’ when you upload it. It’s pretty powerful, and about the simplest way you can extract information.

It does not work, however, if the PDF was generated by scanning – in other words if it is an image, rather than a document that has been converted to PDF.

2) For scanned documents and pulling out key players: Document Cloud

Document Cloud is a tool for journalists to convert PDFs to text. It will also add ‘semantic’ information along the way, such as what organisations, people and ‘entities’ such as dates and locations are mentioned within it, and there are some useful features that allow you to present documents for others to comment on. 

The good news is that it works very well with scanned documents, using Optical Character Recognition (OCR). The bad news is that you need to ask permission to use it, so if you don’t work as a professional journalist you may not be able to use it. Still, there’s no harm in asking.

3) For scanned documents: The Data Science Toolkit

The Data Science Toolkit allows you to do lots of clever things, including converting PDFs using OCR with the File2Text converter. Upload your document, and you’re away. It also works on other document formats, and on PNGs, TIFFs and JPEGs.

4) For stripping out tables: PDF2XL

If you’re willing to shell out around £70 then PDF2XL is recommended as a useful piece of software for stripping out tables from PDF files into Excel.

5) For automating the process: Scrape from PDF to XML using Scraperwiki

Scraperwiki is a collaborative website for scraping all sorts of hard-to-find information into some sort of useful format, so it’s no surprise that PDFs are a common problem there. They have a template scraper for converting PDF documents to XML (a more structured format) – if you can understand a little bit of programming then you can try to adapt it to your own purposes.
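To give a feel for what adapting such a scraper involves, here is a minimal Python sketch of the step that usually follows the conversion: turning positioned text fragments back into table rows. It is not Scraperwiki’s own template. The sample XML, the function name xml_to_rows and the tolerance parameter are all made up for illustration, and the sketch assumes the converter emits text elements carrying top and left position attributes, which may not match what your converter produces.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of converter output: each <text> element is assumed
# to carry the position of one fragment of text on the page.
SAMPLE_XML = """<pdf2xml>
  <page number="1">
    <text top="100" left="50">Department</text>
    <text top="100" left="300">Budget</text>
    <text top="120" left="300">1,200,000</text>
    <text top="120" left="50">Housing</text>
    <text top="140" left="50">Transport</text>
    <text top="140" left="300">850,000</text>
  </page>
</pdf2xml>"""


def xml_to_rows(xml_string, tolerance=2):
    """Group text fragments into table rows by their vertical position."""
    root = ET.fromstring(xml_string)
    rows = []
    for page in root.iter("page"):
        fragments = []
        for el in page.iter("text"):
            content = "".join(el.itertext()).strip()
            if content:
                fragments.append((int(el.get("top")), int(el.get("left")), content))
        # Sort top-to-bottom, then left-to-right.
        fragments.sort()
        current_top = None
        for top, left, content in fragments:
            # Start a new row when the fragment sits clearly lower on the page.
            if current_top is None or abs(top - current_top) > tolerance:
                rows.append([])
                current_top = top
            rows[-1].append((left, content))
    # Order the cells in each row by horizontal position and drop the coordinates.
    return [[content for _, content in sorted(row)] for row in rows]


for row in xml_to_rows(SAMPLE_XML):
    print(row)
```

Real budget tables are messier than this: cells wrap over several lines and columns do not always line up, so expect to adjust the tolerance and check the output against the original PDF.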

6) If it’s held by a public body and you have time: a well-written FOI request

Do you need all the data in the PDF or just some? Is that data available elsewhere? Try an advanced search using a phrase from the data in quotes and adding filetype:xls to see if you can find the spreadsheet it comes from. Or submit an FOI request for the data, stipulating that it be provided in spreadsheet or CSV (comma separated values) format. If the PDF was itself supplied in response to an FOI request, go back and ask for the information in one of those formats instead.

It’s a good idea to also ask how the information is stored, including any software used, as you can check with the software vendor how easily the information can be extracted and bat away any excuses the body may come back at you with.
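If you find yourself running the advanced search described above for lots of different phrases, it can be scripted. The following is a small Python sketch that builds the query string. The function name build_spreadsheet_query, the example phrase and domain, the optional site: restriction and the Google search URL are all assumptions added for illustration rather than anything a particular search engine requires.

```python
from urllib.parse import quote_plus


def build_spreadsheet_query(phrase, filetype="xls", site=None):
    """Build a search query for an exact phrase restricted to one file type."""
    # Wrapping the phrase in quotes asks the search engine for an exact match.
    parts = ['"{}"'.format(phrase), "filetype:{}".format(filetype)]
    if site:
        # Optional assumption: limit results to one website, e.g. the council's.
        parts.append("site:{}".format(site))
    return " ".join(parts)


# Hypothetical phrase and domain, for illustration only.
query = build_spreadsheet_query("Adult social care net expenditure",
                                site="example.gov.uk")
print(query)

# The query has to be URL-encoded before it can be used in a search link.
print("https://www.google.com/search?q=" + quote_plus(query))
```

It is worth repeating the search with filetype set to csv or xlsx, as the same figures are sometimes published in more than one format.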

7) Add your own here

There must be others – tell me your own tips.

UPDATE: On Twitter: Simon Rogers uses Acrobat Pro; Kevin Anderson uses Omnipage. And Jack Schofield uses Zamzar.