This post was originally published on the general Help Me Investigate blog.
Recently I helped Pupul Chatterjee map bus stops in Birmingham for BrumTransport. I thought I’d share the process here as it demonstrates a number of techniques in filtering data that isn’t helpfully categorised.
Does the data exist? Searching in the right place
I started my search for bus stop data at Google’s Tables search engine – this allows you to search public Google Fusion Tables as well as ‘web tables’. I started with a broad search for “bus stops” – the top result looked promising: a Guardian datablog post on “Every bus stop, train station, ferry port and taxi rank in Britain“.
The question was: how easily could we extract Birmingham bus stops from that?
Filtering and downloading
Open the fusion table that mapped those stops. You would hope that one part of the data tells us what type of stop this is (bus, train, etc), and another relates to the town or city that a bus stop lies in. To find out, we need to look at the raw data rather than the map view.
You could download the data, but as there are 400,000 rows, that might take some time. A quicker way is to look at it as a table in Fusion Tables, which you can do by selecting Visualize > Table.
In that view you can see two columns which might help us: Stop type, name; and NatGazLocality.
To test whether these will work, we can use Fusion Tables’s filter function. This is opened by selecting View > Filter. Once you do this a new tab appears above the table with three elements:
- A drop-down menu for you to choose which column you want to filter on
- A drop-down menu with options for how you want to filter – for example, whether you want to look within numerical ranges (above, below, between, equals), text matches, and how exact you want to be (matches, contains, contains ignoring case, etc.)
- And an open text box for you to type the values you want to filter against, i.e. the number or text
On the first drop-down menu select Stop type, name; and on the second select contains ignoring case. In the final box type “Bus”…
…As the entries in this column are quite consistent, Fusion Tables brings up a suggestion of any entries that match – there is only one, so you can select that if you wish.
Click Apply under the filter and you should see the results.
We also want to narrow on location, so click the link above ‘Apply’ that says Add condition.
This time on the first drop-down menu select NatGazLocality; and on the second select contains ignoring case. In the final box type “Birmingham”. As there are lots of different entries here, Fusion Tables makes no suggestions this time.
Click Apply again and you should see the results once more.
A browse down the NatGazLocality column should make you wary: the locations are not all the same: we have Rubery (Birmingham) as well as just Birmingham. To check how comprehensive this data covers Birmingham, it’s best to switch back to the map view, by selecting Visualize > Map and applying the same filters as before.
You’ll notice – at least in Birmingham’s case – that areas are missing stops. A search in the data for specific stops we know exist shows that their NatGazLocality doesn’t mention ‘Birmingham’ at all, but just the local area name.
We have two options at this point: we can try to compile a list of those local area names and try to match each one. Or we can find some other way of identifying the region in this data.
Searching for a problem
At this point I searched for something very specific:
list of bus stops birmingham "natgazlocality"
The “natgazlocality” part is in quotes because I only wanted webpages that contained that term, rather than pages that may be linked to with it.
There is only one result: this post on importing bus stop data by Geoff Mackenzie, which happens to mention Birmingham. A scan of the post throws up this key passage:
ATCOCodeseems quite often to work as a key to look up NextBuses for a particular stop and appears to be unique (although until we actually chew through the data it’s always possible it’ll surprise us and have one or more special cases with missing codes or duplicates so it might not be a good idea to depend on the uniqueness of these codes). The first three characters of this code (usually, maybe always numeric) can be associated with regions as well; I have a Microsoft Word file as well which provides at least some of these mappings. Not all of these are actually regions though – there are other interpretations of these codes for national services like air.
The part in bold is what we needed: the ATCOcode field may be able to help identify stops within a region.
Testing the hypothesis
At this point we don’t know which ATCO codes may relate to our region. To get an idea, we can look at the results which have ‘Birmingham’ in the NatGazLocality field again, this time focusing on that ATCOCode field.
With that filter applied, you can click on File > Download to download only the matching results and analyse them in Excel. There are a number of ways of doing this, but perhaps the simplest is to sort your data on that field and scan down to see how the first few characters change, if at all.
It turns out these results always begin with the four characters 4300 or 2000.
We can use our filter again to test whether that will help us find the data we need.
Back in Fusion Tables, make sure you are in the map visualisation and this time use the filter on ATCOCode with starts with and 4300.
You should be able to see at a glance where the results fall – and indeed this time there is a clear pattern: the stops are around in the West Midlands – Wolverhampton, the Black Country, Birmingham and Coventry – and include stops that we couldn’t see before.
Trying the same process with 2000 brings up stops in an area further south: Kidderminster, Bromsgrove, Redditch.
We could choose one or the other, or to combine the two – but we finally have our data.