Tag Archives: google docs

The best way to set up a spreadsheet for inputting FOI responses

On Help Me Investigate the Olympics we’ve been sending FOI requests to all of the local authorities which hosted the Olympic torch relay. 

That’s the easy part. The difficult bit is compiling the results into something meaningful. To do this, it’s helpful to create a spreadsheet using Google Docs so that all members can input the results. In the process, however, we’ve used a number of techniques to both speed up and future-proof the results.

Drop-down menus 

Firstly, it helps to make sure that you will be able to match the data you’re inputting with other data, such as the population of a local authority, or the political party in charge.

You can do this in your Google spreadsheet by creating a drop-down menu for the ‘matching’ column (normally an authority name or code) which draws from another list. This is how:
  1. Select the column you want your drop-down lists to appear in (normally A)
  2. Click on Data>Validation…
  3. On the window that appears, click on the drop-down menu for Criteria and select Items from a list.
  4. You can now either type in the list yourself, or select a range of cells that contain the items for your list (Create list from a range). It’s best to do the latter: click on the grid button to the right of this box, go to the sheet containing your ‘official’ list (which you’ll need to have copied into another sheet in your spreadsheet – here’s a good one to use from OpenlyLocal) and select the column or row containing your items. You should end up with a reference that looks like this: SheetName!A:A. Click OK. And click Save to apply the drop-down.
  5. Back on your data entry sheet each cell in the column you first selected should now have a small drop-down icon to the right. Users can select from this to make sure they’re using the same names for authorities as your ‘official’ list.

Input data at the most granular level

If you’re inputting data that isn’t consistently categorised across different respondents (as is most often the case), you need to input the data so that this doesn’t cause a problem.

The simplest way to do this – and also the fastest way – is to avoid imposing a particular classification system (either one authority’s, or your own) onto your data as you’re inputting it.

Imposing a classification system slows down the inputting process (the user has to make a decision where to categorise each piece of data), can introduce inconsistency (one user makes a different decision to another), and add extra work later on (what if you realise you decide to add a new category halfway through?)

Much simpler is to input the data in as basic a fashion as possible.

For example, we’re requesting information on spending. Instead of having one column per local authority, and having columns for each type of spending, like this:

NAME ——— TYPE1 — TYPE2 — TOTAL
COUNCIL A – 100 ——- 10000 —- 10100

It’s simpler to have one row per item like so:

NAME ——— TYPE — AMOUNT — NOTES
COUNCIL A – Lights –10000 ——— “Blah blah”

This allows you to add extra columns when required (for example, if a third party contributed funding, or if you want a column for answers to a particular question).

It also means you don’t have to think about classification now. You can impose this on the data later by, for example, filtering it to look for rows which mention “light”, “lighting”, and other terms you’ve come across while inputting.

You can also generate a per-council view of the data with a straightforward pivot table.

It’s a good idea to have a ‘NOTEWORTHY?’ column so if you do come across interesting individual items you can simply put YES in that column and come back to them later, rather than get distracted by them. 

Use spreadsheet techniques to split pasted data

Even if your data has been supplied in a word document, you can paste it into your spreadsheet and use formulae to split it up as you need, rather than having to manually input each item separately.

It makes sense to have your spreadsheet set up in the same order as data is normally provided, i.e. the ‘item’ and ‘cost’ columns coming in that order, next to each other, so that these formulae work best.

Here are some typical techniques, depending on how it’s been provided:

Item and cost are separated by a space: if the items are always only one word (no spaces), then followed by the cost, you can paste them into another column in your spreadsheet and type this formula in your ‘item’ cell to grab and split them: 

=SPLIT(E34,” “)

If item and cost are separated by a colon, e.g. “Item: £200” then you can adapt it as follows:

=SPLIT(E34,”:”)

Note: change E34 to the first cell your raw data has been pasted into.

This will split the contents of that cell wherever there is a space, so anything before that space will be pulled into your ‘item’ cell, and anything after will be pulled into the cell next to it (ideally your ‘cost’ cell)

if it works, copy that formula down as many rows as you need to apply it to all the data you originally pasted.

Finally, you’ll probably want to delete the raw data – but if you do this the formulae will stop working (there’ll be nothing for them to split), so before you do that select all the cells containing the results of your formulae (both columns). Click on Copy > Paste special… and select Values only.

This will paste the results of your formulae on top of the same cells that previously contained them. That means you can safely delete anything that the now-gone formulae used.

Other formats: the SPLIT function can also be used where your costs always begin with a pound sign, like so:  

=SPLIT(E34,”£”)

This will remove the pound sign, but you don’t need it.

Where there’s no pound sign and the data has multiple spaces, but the figure is always at the end of each line, try using =RIGHT to grab it like so:

=RIGHT(E34,6)

The number (6 in the example above) specifies how many characters you want. ‘6’ will grab any amount at the end of a line up to 99,999 (note that the comma is counted as a character). If you are likely to have higher amounts, change it to 7, or 8, etc.

This formula will also grab any other characters if your amounts are small, so with the line “Lighting 100” it will give you “ng 100”. So you will probably have to combine this formula with the =SPLIT function explained above, and then just use the second cell that you get from that.

There are other techniques you can use, but these are the most commonly useful.

How do I publish my data online?

If you’ve got some data for your investigation and want to publish it – either for others to see the raw material, or to invite them to help you explore it – there are a number of ways to do it. 

If your data is in Excel, for example, you can use a tool like Tableizer to copy and paste the data to convert it into a HTML table that you can then use in a blog post or webpage.

You can also upload your spreadsheet to Google Docs, and publish the spreadsheet from there. This has the advantage of making it easier for others to work with the data (which they can’t easily do with a HTML table).

Google Docs allows you to publish the data in a range of formats – and will provide HTML for you to embed the spreadsheet too (this page explains how). Another advantage of this approach is that if you update the spreadsheet, these embedded and published versions will update too.

Finally, you may want to consider uploading and publishing your data to a site like BuzzData, a place where data journalists, developers, and other people interested in data share their work. The site allows you to ‘follow’ particular datasets and users, and so is a good way to connect with people who share an interest in your field, and who might be able to help you interrogate the data that you have.

7 ways to get data out of PDFs

A frequent obstacle in data journalism is when the information you want to analyse is locked away in a PDF. Here are 6 ways to tackle that problem – with space for a 7th:

1) For simple PDFs: Google Docs’ conversion facility

Google Docs recently added a feature that allows you to convert a PDF to a ‘Google document’ when you upload it. It’s pretty powerful, and about the simplest way you can extract information.

It does not work, however, if the PDF was generated by scanning – in other words if it is an image, rather than a document that has been converted to PDF.

2) For scanned documents and pulling out key players: Document Cloud

Document Cloud is a tool for journalists to convert PDFs to text. It will also add ‘semantic’ information along the way, such as what organisations, people and ‘entities’ such as dates and locations are mentioned within it, and there are some useful features that allow you to present documents for others to comment on. 

The good news is that it works very well with scanned documents, using Optical Character Recognition (OCR). The bad news is that you need to ask permission to use it, so if you don’t work as a professional journalist you may not be able to use it. Still, there’s no harm in asking.

3) For scanned documents: The Data Science Toolkit

The Data Science Toolkit allows you to do lots of clever things, including converting PDFs using OCR with theFile2Text converter. Upload your document, and you’re away. Also works on other document formats, and PNGs, TIFFs and JPEGs.

4) For stripping out tables: PDF2XL

If you’re willing to shell out around £70 then PDF2XL is recommended as a useful piece of software for stripping out tables from Excel files. 

5) For automating the process: Scrape from PDF to XML using Scraperwiki

Scraperwiki is a collaborative website for scraping all sorts of hard-to-find information into some sort of useful format, so it’s no surprise that PDFs are a common problem there. They have a template scraper for converting PDF documents to XML (a more structured format) – if you can understand a little bit of programming then you can try to adapt it to your own purposes.

6) If it’s held by a public body and you have time: a well-written FOI request

Do you need all the data in the PDF or just some? Is that data available elsewhere? Try an advanced search using a phrase from the data in quotes and adding filetype:xls to see if you can find the spreadsheet it comes from. Or submit an FOI request for the data stipulating that it be provided in spreadsheet or CSV (comma separated values) format (if the PDF was supplied in response to an FOI request in the first place, go back and ask for the information to be provided in spreadsheet or CSV (comma separated values) format). 

It’s a good idea to also ask how the information is stored, including any software used, as you can check with the software vendor how easily the information can be extracted and bat away any excuses the body may come back at you with.

7) Add your own here

There must be others – tell me your own tips.

UPDATE: On Twitter: Simon Rogers uses Acrobat Pro; Kevin Anderson uses Omnipage. And Jack Schofield uses Zamzar.