For a while now the Raleigh Public Record have been working on DocHive, a promising tool for converting documents to data. Now they have announced that a beta version is due out in time for the NICAR conference at the end of February.
What’s particularly promising about this tool is that it works with images – not, as is currently the case with most PDF conversion tools, metadata or embedded data. They write:
Here’s how it works: the program converts the PDF into an image file using ImageMagick, then uses a template to break a page up into smaller sections.
For example, in the campaign finance documents, DocHive will make separate sections for donor name, occupation, donation amount and all the other fields. Then, the program will take each of those sections and turn it into a separate image file.
The software takes that small image and uses optical character recognition technology to read the words or numbers and insert them into a CSV file.
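To make the quoted steps concrete, here is a rough Python sketch of that kind of pipeline. It is not DocHive's code. The template coordinates, field names and function names are invented for illustration, and the choice of the Tesseract command-line tool for OCR is an assumption, since the announcement does not say which OCR engine is used. The ImageMagick calls use the classic `convert` command; newer installs may prefer `magick`.

```python
import csv
import io
import subprocess
from typing import Callable, Dict, List, Tuple

# Hypothetical template: field name -> (x, y, width, height) in pixels.
# Real templates would be drawn up for each document set.
Box = Tuple[int, int, int, int]
TEMPLATE: Dict[str, Box] = {
    "donor_name": (50, 100, 400, 40),
    "occupation": (50, 150, 400, 40),
    "amount": (500, 100, 150, 40),
}


def pdf_to_png_cmd(pdf_path: str, png_path: str, dpi: int = 300) -> List[str]:
    """Build an ImageMagick command that rasterizes the first page of a PDF."""
    return ["convert", "-density", str(dpi), f"{pdf_path}[0]", png_path]


def crop_cmd(png_path: str, box: Box, out_path: str) -> List[str]:
    """Build an ImageMagick command that cuts one template section out of a page."""
    x, y, w, h = box
    return ["convert", png_path, "-crop", f"{w}x{h}+{x}+{y}", "+repage", out_path]


def crop_with_imagemagick(png_path: str, box: Box, out_path: str) -> None:
    """Run the crop command (requires ImageMagick to be installed)."""
    subprocess.run(crop_cmd(png_path, box, out_path), check=True)


def ocr_with_tesseract(image_path: str) -> str:
    """Assumed OCR step: ask the tesseract CLI to print recognized text to stdout."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def extract_row(
    page_png: str,
    template: Dict[str, Box],
    crop: Callable[[str, Box, str], None] = crop_with_imagemagick,
    ocr: Callable[[str], str] = ocr_with_tesseract,
) -> Dict[str, str]:
    """Crop every template section into its own image and OCR each one."""
    row = {}
    for field, box in template.items():
        piece = f"{page_png}.{field}.png"
        crop(page_png, box, piece)
        row[field] = ocr(piece)
    return row


def rows_to_csv(rows: List[Dict[str, str]], fieldnames: List[str]) -> str:
    """Serialize the extracted rows as CSV text, one column per template field."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

With ImageMagick and Tesseract installed, one would first run `subprocess.run(pdf_to_png_cmd("filing.pdf", "page.png"), check=True)`, then call `extract_row("page.png", TEMPLATE)` and pass the result to `rows_to_csv`. The crop and OCR steps are passed in as functions so the template logic can be tried out without either tool present.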
They are also looking for people with “tricky document sets” to help test DocHive and people who want to help “test or prepare the new program for release.”
If you’re interested in either, email the development team at editor@raleighpublicrecord.org.