Turning documents into data: DocHive

For a while now the Raleigh Public Record have been working on a promising tool for converting documents to data. Now they have announced that a beta version is due out in time for the NICAR conference at the end of February.

What’s particularly promising about this tool is that it works with images – not, as is currently the case with most PDF conversion tools, metadata or embedded data. They write:

Here’s how it works: the program converts the PDF into an image file using ImageMagick, then uses a template to break a page up into smaller sections.

For example, in the campaign finance documents, DocHive will make separate sections for donor name, occupation, donation amount and all the other fields. Then, the program will take each of those sections and turn it into a separate image file.

The software takes that small image and uses optical character recognition technology to read the words or numbers and insert them into a CSV file.
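To make the steps above concrete, here is a rough sketch of the same pipeline in Python. It is not DocHive's code. The template coordinates, the field names and the function names are all made up for illustration, and the choice of Tesseract for the OCR step is an assumption (the announcement only says "optical character recognition technology"). The ImageMagick and Tesseract commands are built as plain lists so the logic can be read, and tested, without either tool installed.

```python
import csv
import subprocess

# Hypothetical template: field name -> (x, y, width, height) in pixels.
# A real template would be measured from the form being processed.
TEMPLATE = {
    "donor_name": (100, 200, 600, 40),
    "occupation": (100, 250, 600, 40),
    "amount": (750, 200, 200, 40),
}


def rasterize_cmd(pdf_path, png_path, dpi=300):
    """ImageMagick command that renders the first PDF page as a PNG.

    ImageMagick 7 installs this tool as `magick`; `convert` is the older name.
    """
    return ["convert", "-density", str(dpi), f"{pdf_path}[0]", png_path]


def crop_cmd(png_path, box, out_path):
    """ImageMagick command that cuts one template section out of the page."""
    x, y, w, h = box
    return ["convert", png_path, "-crop", f"{w}x{h}+{x}+{y}", "+repage", out_path]


def ocr_cmd(image_path):
    """Tesseract command (assumed OCR engine) that prints the text to stdout.

    --psm 7 tells Tesseract to treat the image as a single line of text.
    """
    return ["tesseract", image_path, "stdout", "--psm", "7"]


def run_ocr(image_path):
    """Run the OCR command and return the recognised text."""
    result = subprocess.run(
        ocr_cmd(image_path), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def extract_row(png_path, template, ocr=run_ocr, run=subprocess.run):
    """Crop every template section from one page image and OCR each piece."""
    row = {}
    for field, box in template.items():
        piece = f"{png_path}.{field}.png"
        run(crop_cmd(png_path, box, piece), check=True)
        row[field] = ocr(piece)
    return row


def write_csv(rows, fieldnames, out):
    """Write the extracted rows to an open text file as CSV."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```

With ImageMagick and Tesseract installed, you would run rasterize_cmd once per document, call extract_row on each page image, and pass the collected rows to write_csv. The ocr and run arguments exist so the external tools can be swapped out or faked when testing.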

They are also looking for people with “tricky document sets” to help test DocHive and people who want to help “test or prepare the new program for release.”

If you’re interested in either, email the development team at editor@raleighpublicrecord.org.

Finding documents online – FindThatFile

Here's a potentially useful search engine if you're specifically looking for documents: http://www.findthatfile.com

findthatfile.com allows you to narrow your search by filetype in a way that is a little bit more powerful than Google's own advanced search facility (and more intuitive). Filetypes include PDFs, documents (DOC, TXT, etc.), audio, video, RAR and ZIP compressed files.

(Strangely, spreadsheets are not included; for those you might want to try the excellent Zanran.)

The site also has an API, which may be useful if you want to find documents related to each item in a long list kept in a spreadsheet.

Nicole Boivin from Find That File says:

"We open each file, identify its author, title, contents, text extracts and all kinds of goodies that nobody else does. We also search more places than anyone else: Web, FTP, Usenet, Metalink and P2P (ed2k/emule) including 47 file types and 557+ file extensions including over 239 file upload services."