Tag Archives: data

5 tips on mining and using big data for journalists


Matt Fowler is a freelance application developer and programmer who helps journalists understand and use big data. At the CIJ Summer School this year he gave some top tips in the field, which we have summarised below…

1. Double check privacy settings of your data

You don’t want private work on show for all to see.

2. Tidy up the data and make the structure simpler

This re-engineering effort can surface details and help you discover information to turn into stories.

How to: get data out of council budget reports

When councils publish their draft budget reports it’s not always easy to extract the figures that they’re based on. Here then is a guide to getting the data out of budget reports:

Get the ball rolling

Budget reports are generally presented in PDF format, with data presented as tables, appendices, charts and maps.

Before you do anything else, it’s worth asking the council’s press office for any spreadsheets used for the report – especially for charts and maps, whose underlying data you cannot extract from a PDF.

This might not get you data immediately, but it sets the ball rolling while you’re working on it from your side. On that front, you might also want to consider FOI requests to particular departments for data prepared for particular aspects of the budget.

Getting tables out of PDFs

How we investigated Olympic sponsor torchbearers

Details unearthed by the ongoing investigation have been picked up by a range of national and local newspapers and broadcast outlets.

We are still unearthing leads, so if you want to get involved let me know.

A search engine for data from FOI responses

Tony Hirst has created two basic tools that allow you to search for data supplied in response to FOI requests: this search tool for local councils; and this one for universities (ignore the word 'council' – it's an error).

The data is limited to requests made via WhatDoTheyKnow (which accounts for around 10% of FOI requests) and responses with spreadsheets attached (rather than PDFs, for example) – but it's still a useful tool.

His post about his experiment provides more detail, including possible further developments:

"It strikes me that if I crawled the response pages, I could build my own index of data files, catalogued according to FOI request titles, in effect generating a ?fake? data.gov.uk or data.ac.uk opendata catalogue as powered by FOI requests?? (What would be really handy in the local council requests would be if the responses were tagged with with appropriate?LGSL code or IPSV terms?(indexing on the way out) as a form of useful public metadata that can help put the FOI released data to work??)
"Insofar as the requests may or may not be useful as signaling particular topic areas as good candidates as ?standard? open data releases, I still need to do some text analysis on the request titles.
"[…] PS via a post on HelpMeInvestigate, I came across this list of?FOI responses to requests made to the NHS Prescription Pricing Division. From a quick skim, some of the responses have ?data? file attachments, though in the form of PDFs rather than spreadsheets/CSV. However, it would be possible to scrape the pages to at least identify ones that do have attachments (which is a clue they may contain data sets?)"

How do I publish my data online?

If you’ve got some data for your investigation and want to publish it – either for others to see the raw material, or to invite them to help you explore it – there are a number of ways to do it. 

If your data is in Excel, for example, you can use a tool like Tableizer to copy and paste the data and convert it into an HTML table that you can then use in a blog post or webpage.

You can also upload your spreadsheet to Google Docs, and publish the spreadsheet from there. This has the advantage of making it easier for others to work with the data (which they can’t easily do with an HTML table).

Google Docs allows you to publish the data in a range of formats – and will provide HTML for you to embed the spreadsheet too (this page explains how). Another advantage of this approach is that if you update the spreadsheet, these embedded and published versions will update too.

Finally, you may want to consider uploading and publishing your data to a site like BuzzData, a place where data journalists, developers, and other people interested in data share their work. The site allows you to ‘follow’ particular datasets and users, and so is a good way to connect with people who share an interest in your field, and who might be able to help you interrogate the data that you have.

VIDEO: Helena Bengtsson’s tips on finding stories in data

Helena Bengtsson is a Database Editor at Sveriges Television in Sweden, interrogating data for news stories. After speaking at the Balkan Investigative Reporters Network Summer School in Croatia, as we waited for our flights at Zagreb airport I asked her what tips she would give to people trying to find stories in a dataset.

Structured Query Language (SQL): an introduction

The Centre for Investigative Journalism this year offered a training session on Structured Query Language (SQL) for journalists at the tail end of its summer school, which was well worth attending.

SQL, the query language behind almost all database programmes such as Access and FileMaker Pro, is a powerful computer-assisted reporting (CAR) tool: it lets you analyse even large datasets by asking questions, or “queries”, of them, so that you can find the stories worth writing about.

For the purposes of the training, the tutor, David Donald, Data Editor at the Center for Public Integrity (CPI) in Washington D.C., used MySQL, which is open source and runs on all platforms (Windows, Mac or Linux), together with Navicat Lite, a free graphical front end.

I have summarised below some of the basics I have learned, as a taster:

Download and Install

You can download and install MySQL from the MySQL website, followed by Navicat Lite, a graphical user interface (GUI – ‘gooey’), from the Navicat site. Once both are installed:

  1. In Navicat Lite click on Connection > MySQL to create a new connection. 
  2. Give it any name you want, and leave ‘localhost’ as it is (this means your computer). 
  3. Type the password you created for your MySQL databases in the password box. 
  4. Click OK.
(You can also use this window to connect to a server in your company, but you’ll need to ask someone in your technical department for the details.)

Creating your first database
  1. Once you’ve created your connection, you’ll notice there are already a couple of databases – these were added when you installed MySQL. 
  2. Ignore these; right-click either on the name of your connection or on the databases and select ‘New Database’.
  3. Give it a name and click OK.
  4. You should now be able to see it alongside the other databases.
 

Importing Tables

  1. Open the Navicat Lite window.
  2. Highlight ‘Tables’ on the left hand column.
  3. The ‘Import Wizard’ will appear at the top. 
  4. Start importing; follow instructions on screen.

When you first import a table for analysis, you will have to tick a few buttons in the Navicat window. Things to remember are the DATA TYPES that you can have for each column (a sketch of what these choices amount to follows this list):

  • VARCHAR can be text, character or string. If the table uses coding, for example 1 for female, 2 for male, you will want to consider this text, not number, that is, VARCHAR, not FLOAT.
  • FLOAT – if you tick FLOAT, you are telling the computer the data in question is a number.
  • DATE – check whether dates in your dataset use the American or British format and select accordingly. Years can be entered as four or two digits, eg 18/07/2011 or 18/07/11.
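
As a taster of what those choices amount to, here is a minimal sketch of how a donations table like the TENNGIVE one used later in this post might be defined by hand. The column names echo that example, but the sizes and the gender and donated columns are illustrative, not the exact ones from the training:

CREATE TABLE TENNGIVE (
    last       VARCHAR(50),   -- surname: text, so VARCHAR
    rest       VARCHAR(100),  -- first and middle names: also text
    occupation VARCHAR(100),  -- job title: text
    gender     VARCHAR(1),    -- coded 1/2, so stored as text, not FLOAT
    amount     FLOAT,         -- donation amount: a number
    donated    DATE           -- MySQL stores dates as YYYY-MM-DD
);

In practice Navicat’s Import Wizard builds the table for you; this just shows what the data type tick-boxes translate to.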

Basic commands
The following are the six basic commands in MySQL. Conventionally they are written in ALL CAPS.

The two required commands are: 
  • SELECT – tells you which columns or fields you are pulling data from.
  • FROM – the name of the table the data is coming from. May be case sensitive.
All other commands are optional, but they MUST be used in this order, even if you don’t use all of them (a full skeleton follows this list):
  • WHERE – the criteria; works like a filter [eg ='NHS' or >34; a date in MySQL is written as a quoted string, eg ='2010-05-01']. For multiple criteria use AND, eg Category='NHS' AND Region='England'
  • GROUP BY – is the aggregator
  • HAVING – always used together with GROUP BY; it is the second criteria, applied after aggregation
  • ORDER BY – to sort your data; can be numeric or alphabetic
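
Put together, every query follows the same fixed skeleton. A minimal sketch, with placeholder table and column names:

SELECT column1, SUM(column2)
FROM tablename
WHERE column3 = 'NHS'
GROUP BY column1
HAVING SUM(column2) > 100
ORDER BY 2 DESC;

You can drop any of the optional lines, but whichever ones you keep must stay in this order.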

Basic rules for data analysis
  • Never bring more columns into a query than necessary. Keep queries as simple as you can.
  • Queries are extremely literal: any spelling mistake, or a space where there shouldn’t be one, will affect results.
  • To indicate you are expecting TEXT, use single quote marks ('…'), eg 'NHS'.
  • To indicate you are expecting a NUMBER, use NO PUNCTUATION.
  • To indicate you are expecting a DATE in MySQL, use a quoted string in YYYY-MM-DD format, eg '1977-07-07' (the hashmark syntax #…# you may see elsewhere belongs to Microsoft Access, not MySQL). These quoting rules are pulled together in the example after this list.
  • Use the wildcard (*) to indicate you want the query to consider everything rather than specific columns/fields, eg SELECT *
  • Never make any changes in the original table. Keep it pristine.
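
A minimal example pulling the quoting rules together, run against the TENNGIVE table sketched above (the donated column, like that sketch, is illustrative):

SELECT last, rest, amount
FROM TENNGIVE
WHERE occupation = 'Lawyer'    -- text: single quotes
  AND amount > 1000            -- number: no punctuation
  AND donated > '2010-05-01';  -- date in MySQL: quoted, YYYY-MM-DD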

Functions/Expressions and Order by
If you want to calculate totals, use the function:

SUM(argument) – the ‘argument’ will be a column name, e.g. sum(debt) or sum(amount) [N.B.: no space between sum and the parentheses]

N.B.: You can use a function in SELECT but never in GROUP BY

You can ORDER BY in ascending or descending (DESC) order. The default is ascending. Enter DESC if you want numbers ordered from highest to lowest.

Examples of queries
Say you are working on a table called TENNGIVE, which lists names of people who have made donations to political parties and the sums they donated. A query on MySQL could look like this:

  • SELECT last, rest, sum(amount) [= this means you want query results to show columns called last, rest and the sum of the amounts each person donated]
  • FROM TENNGIVE [=name of table you are using]
  • GROUP BY last, rest
  • HAVING sum(amount) > 2000 [=you only want a list of people who contributed more than 2,000 pounds/dollars, for example]
  • ORDER BY sum(amount) DESC [=you want the sum column ordered in descending order from high to low]
  • or ORDER BY 3 DESC [=in which 3 indicates you want it to sort column 3]

Observe what happens in this example.
– Last is a column with surnames
– Rest is a column with first names, including middle names
– The list included people called Haslam but I suspect some could be misspelled or they could have Haslam in their middle names
– This was resolved by using LIKE and the % wildcard to catch the possible misspellings

  • SELECT last, rest, occupation, sum(amount)
  • FROM TENNGIVE
  • WHERE last LIKE ‘hasl%’ or rest LIKE ‘%hasl%’ [This ensures people with Haslam in their last (last) OR middle names (rest) appear, even if misspelled]
  • GROUP BY last, rest, occupation [Note that GROUP BY does not, and should not, include the function: sum(amount)]
  • ORDER BY 4 DESC [=sum(amount) is now the fourth column, so the number to sort by changes from the earlier example]

COUNT(*)
Under SELECT you could want a COUNT of how many times each occupation appears on the table/database, for example 15 lawyers, 10 teachers, etc:
  • SELECT occupation, COUNT(occupation)
  • FROM TENNGIVE
  • GROUP BY occupation
  • ORDER BY 2 DESC

COUNT(occupation) is the ‘argument’ here. The catch with COUNT(column) is that it skips any rows where that column is NULL, which can skew the results of your query. To count every row, NULLs included, use COUNT(*).

Many data journalists panic when they realise they have been extracting data without the (*) after COUNT and publishing stories using erroneous figures. You have been warned: to include rows with empty (NULL) fields in your counts, use COUNT(*).
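
A quick way to see the difference for yourself, as a sketch against the same TENNGIVE table:

SELECT COUNT(occupation), COUNT(*)
FROM TENNGIVE;
-- COUNT(occupation) skips rows where occupation is NULL;
-- COUNT(*) counts every row, NULLs included.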

Cleaning data
Your queries could also give you inaccurate results if the data in your table is “dirty”. ‘BBC’ and ‘B.B.C.’, or ‘Housewife’ and ‘Housewife ’ (with a trailing space), for example, could be counted as different values when they are the same, or you may want to get rid of any HTML codes in your dataset.

A few ways to clean your data:
  1. Use TRIM in SELECT, eg SELECT TRIM(occupation), COUNT(occupation) – see the worked example after this list.
  2. Use Google Refine, which you can download free onto your computer. The post Cleaning Data using Google Refine in the Online Journalism Blog may help.
  3. Other simple ways of cleaning data on an Excel sheet can be found here.
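
As a sketch of the TRIM approach, note that TRIM needs to appear in GROUP BY as well as in SELECT, otherwise ‘Housewife’ and ‘Housewife ’ would still be grouped separately:

SELECT TRIM(occupation), COUNT(*)
FROM TENNGIVE
GROUP BY TRIM(occupation)
ORDER BY 2 DESC;
-- TRIM strips leading and trailing spaces,
-- so the two spellings are counted as one value.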

SQL servers: MySQL vs. SQL Server Express
  • MySQL is an open-source relational database management system (RDBMS) owned by Oracle, with freely available source code. It runs on all operating systems (OS). 
  • There is also another free RDBMS called SQL Server Express, whose latest version is the 2008 one. Being a Microsoft product, SQL Server Express will only run on Windows; Mac users who want a local setup can instead download a free package called MAMP (=Macintosh, Apache, MySQL and PHP), which bundles MySQL. 
  • SQL Server is SQL Server Express’s more full-bodied “big brother”, but it comes at a cost.
  • Both servers have their pluses and minuses. I found this SQL Authority.com blog post (N.B. bear in mind this post is from 2009) helpful. 

David Donald says with MySQL there are no two ways: it’s “use it or lose it”. The best thing is to download it all and get practising.  

Online Tutorials
Don’t panic – online tutorials and forums are available if you get stuck:
  1. MySQL Tutorial
  2. Tutorials Point (includes a discussion forum)
  3. Navicat support (for Windows, Mac and Linux)


Further SQL training

SQL is not the exclusive remit of journalists, of course – far from it. Anyone who works with or analyses large datasets/databases can benefit from it. As a matter of fact, a few City workers attended the session alongside journalists and CAR enthusiasts like myself.

The session was a crash course, which gave you enough basic knowledge to start using SQL. I assure you it was “safe” even for non-geeks to follow, but, as David said, once you learn it, you do need to keep using it in order to achieve geekdom… 

If you are interested in learning it in more depth, David Donald will be hosting a one-week “CAR bootcamp” in the first week of October. Registration will open after 15th September, so keep checking the CIJ website or sign up to their newsletter.

Updated on 13/09/2011:
Information on the autumn Data Journalism/CAR workshop with David Donald is already up on the CIJ website: 5-9 October 2011. The early bird discount is only valid until Thursday, 15th September, so better hurry. There is also the option of joining the Advanced SQL course on the last three days at a reduced cost. Check out the course’s timetable here and general info here

The Government want to make data more transparent…

The government have made commitments to a whole new range of data transparency initiatives, which look set to make the UK government (and data.gov.uk) a world leader in open data.

The Guardian reported that, in an open letter to the cabinet, David Cameron announced a range of initiatives that will “represent the most ambitious open data agenda of any government in the world, and demonstrate our determination to make the public sector more transparent and accountable”, including a release of the Treasury’s COINS database and details of government spending over £25,000.

The twenty-strong list of commitments is best explained in The Guardian Datablog’s breakdown, but it was announced on the Number 10 website in an article with some comment from Francis Maude, the Cabinet Office Minister.

His words seemed to promote the data release as a benefit for everyday life:

“The new commitments represent a quantum leap in government transparency and will radically help to drive better public services. Having this data available will help people find the right doctor for their needs or the best teacher for their child and will help frontline professionals compare their performance and effectiveness and improve it.”

These proposals follow on from the announcements in May of last year that data on government spending and crime data would be made more accessible, leading to the launch of the National Crime Maps in February.

There are still issues with how the data will be handled and received by the public, and the public may also be sceptical of a massive data release set to aid the progress of investigative journalism, especially in the wake of the phone-hacking scandal and the closure of the News of the World.

The National Crime Maps were slated by critics and the data used was said to be near impossible to extract and use in a constructive way. It will be key to see how the government plan to release the new datasets and whether they will be in a usable, translatable and extractable format.

The criticisms of the Coins database are a perfect example of data being collated, ‘distributed’ and still being difficult to extract journalistic value from.

Let’s just say it’s hard to remain positive.

How are the BBC handling their FOI data…?

Now, more than ever, public organisations are trying to make themselves just that: public.

“The genie cannot be put back into the bottle, however hard authorities try,” writes John Kampfner in today’s Media Guardian. “The information relationship has shifted, but the power relationship has not. The democracy recession is gathering pace.”

The general public now has a thirst for that most gritty and honest of information, and journalists more than ever are gaining access to it.

This means that everyone is becoming more wary of the way in which both public and private organisations handle, store and release information for public consumption and record.

This is why we are mentioning onebillionpageviews: the anti-licence-fee website has offered a single download that allows access to all the Freedom of Information requests the BBC received (and hosted on its site) before 2008, which were later removed.

First of all, it is shocking that, whilst everyone is so tuned in to the way public organisations handle data, the BBC would simply remove a huge cache of it from their site.

Secondly, for the BBC to do something this brave and seemingly careless with their data, when websites like NoTVLicenseFee are willing to keep that store available for the foreseeable future, seems counter to their nature as a ‘public’ organisation.

The site has also chosen to host every Freedom of Information request the BBC has received since the big removal, and any more that arise in the future.

It’s great that a site like this wants to hold organisations to account and keep data readily available amid the rise in public curiosity about how their money is being spent, but it is also important that massive organisations like the BBC are careful not to be caught in the crossfire that grows out of the “democracy recession”.