Recently the NICAR mailing list (for journalists who use computer assisted reporting) discussed how they dealt with datasets that were ‘too big for Excel’. With their permission, I’m reproducing a digest of the highlights.
How much is too much
Different versions of Excel have different limits to the data they can handle. From a million rows in Excel 2010 to just 16,000 rows by 256 columns in Excel 5, Office Watch gives a good rundown of the various versions.
Tom Torok points out that Excel 2007’s million row limit is per sheet, rather than per workbook (spreadsheet), so if you have more than 1 million rows you could spread them across mulitple sheets.
Stephen Doig suggests that if the data was only going to be used as a large directory, you could “save it as a text file and search it with a decent text editor”. And Dan Nguyen provides a list of useful text editors here.
Query languages and servers
Ed Borasky notes that Excel has a query language built in. “Put the whole file in Access or SQL Server’s free version and teach the users how to query a database in their spreadsheets.” And Tim Henderson adds: “You’ve hit on the perfect reason to learn online programming. Get a $7/month MySQL server account, a book on PHP or Python or whatever and you’ll have it up and running in a week I bet.” – although Borasky disagreed that you needed a dedicated server: “Something that small you can build in a virtual machine or even run natively in Windows.“
Another option is PANDA, “a tool to host large datasets locally (or on an EC2) so everyone in your news room can search them” (Brian Abelson, seconded by Nolan Hicks, who suggested this setup process).
Margie Roswell tried that out, and came up with the following list of additional tips:
- You need to get Java working. For some reason, it wouldn’t work for me in Firefox. In Chrome. I had to go into chrome://plugins/ and check to enable Java(TM) to be “always allowed.” (not sure how secure that is… but nothing would work until I did that.)
- If you’ve already set up amazon AWS EC2 once before, and don’t have a record of your keypair, you need to create a new one.
- You need to know the URL
- Even though they provide the default username of “root,” they tell you that the user name needs to be “ubuntu”
I also added a mention of the SQLite plugin for Firefox, which turns your browser into a database. But Brian Bowling tried it out with a big dataset and crashed it:
Back it up!
And finally a word of warning from Kate Martin: “Whatever you do, save the original file in read-only format. The last thing you want is someone accidentally changing something important.“