Tag Archives: documents

Turning documents into data: DocHive

For a while now the Raleigh Public Record have been working on a promising tool for converting documents to data. Now they have announced that a beta version is due out in time for the NICAR conference at the end of February.

What’s particularly promising about this tool is that it works with images – not, as is currently the case with most PDF conversion tools, metadata or embedded data. They write:

Here’s how it works: the program converts the PDF into an image file using ImageMagick, then uses a template to break a page up into smaller sections.

For example, in the campaign finance documents, DocHive will make separate sections for donor name, occupation, donation amount and all the other fields. Then, the program will take each of those sections and turn it into a separate image file.

The software takes that small image and uses optical character recognition technology to read the words or numbers and insert them into a CSV file.
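The steps quoted above (PDF to image, template to sections, OCR to CSV) can be sketched roughly as follows. This is not DocHive's code, only an illustration under stated assumptions: the template format (field name mapped to a pixel box), the function names, and the snippet file naming are all hypothetical, and the OCR step is left as a function you pass in because the announcement does not say which OCR engine is used. The ImageMagick options shown (-density, -crop, +repage) are standard ones.

```python
import csv
import io
from typing import Callable, Dict, List, Tuple

# Hypothetical template format: field name -> (left, top, width, height) in pixels.
Box = Tuple[int, int, int, int]
Template = Dict[str, Box]


def pdf_to_image_command(pdf_path: str, out_pattern: str, density: int = 300) -> List[str]:
    """Build an ImageMagick command that renders a PDF as page images.

    `convert` is the ImageMagick 6 command name; ImageMagick 7 prefers `magick`.
    """
    return ["convert", "-density", str(density), pdf_path, out_pattern]


def crop_command(page_image: str, box: Box, out_path: str) -> List[str]:
    """Build an ImageMagick command that cuts one template field out of a page."""
    left, top, width, height = box
    geometry = f"{width}x{height}+{left}+{top}"
    return ["convert", page_image, "-crop", geometry, "+repage", out_path]


def extract_row(
    page_image: str,
    template: Template,
    run: Callable[[List[str]], object],
    ocr: Callable[[str], str],
) -> Dict[str, str]:
    """Crop every template field into its own image, then OCR each one.

    `run` executes a command and `ocr` reads text from an image path.
    Both are supplied by the caller (assumption: any OCR tool could be used).
    """
    row = {}
    for field, box in template.items():
        snippet = f"{field}.png"  # hypothetical naming for the small images
        run(crop_command(page_image, box, snippet))
        row[field] = ocr(snippet).strip()
    return row


def rows_to_csv(fieldnames: List[str], rows: List[Dict[str, str]]) -> str:
    """Serialise the extracted rows as CSV text, header first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

To run it for real you would pass something like `lambda cmd: subprocess.run(cmd, check=True)` as `run`, plus a wrapper around whichever OCR tool you have installed as `ocr`. Keeping both as parameters means the cropping and CSV logic can be checked without ImageMagick present.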

They are also looking for people with “tricky document sets” to help test DocHive and people who want to help “test or prepare the new program for release.”

If you’re interested in either, email the development team at editor@raleighpublicrecord.org.

Finding documents online – FindThatFile

Here's a potentially useful search engine if you're specifically looking for documents: http://www.findthatfile.com

findthatfile.com allows you to narrow your search by filetype in a way that is a little bit more powerful than Google's own advanced search facility (and more intuitive). Filetypes include PDFs, documents (DOC, TXT, etc), audio, video, RAR and ZIP compressed files.

(Strangely, spreadsheets are not included; for those you might want to try the excellent Zanran.)

The site also has an API, which may be useful if you want to find documents related to a long list kept in a spreadsheet.
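As a rough sketch of that spreadsheet workflow: export the list as CSV, read one column of names, and build one search request per name. The endpoint and the parameter names (`q`, `type`) below are placeholders, not the real FindThatFile API, which is not described here; check the site's API documentation and swap in the actual values. The column name `name` is also just an assumption about how the spreadsheet is laid out.

```python
import csv
import io
from typing import List
from urllib.parse import urlencode

# Placeholder endpoint and parameter names: NOT the real FindThatFile API.
SEARCH_ENDPOINT = "https://api.example.invalid/search"


def read_terms(csv_text: str, column: str) -> List[str]:
    """Return the non-empty, de-duplicated values of one spreadsheet column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    terms: List[str] = []
    for record in reader:
        term = (record.get(column) or "").strip()
        if term and term not in terms:
            terms.append(term)
    return terms


def build_query_url(term: str, filetype: str = "pdf") -> str:
    """Build a search URL for one term, URL-encoding spaces and symbols."""
    return SEARCH_ENDPOINT + "?" + urlencode({"q": term, "type": filetype})
```

Each URL could then be fetched in turn (politely, with a pause between requests) and the results logged back against the matching row of the spreadsheet.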

Nicole Boivin from Find That File says:

"We open each file, identify its author, title, contents, text extracts and all kinds of goodies that nobody else does. We also search more places than anyone else: Web, FTP, Usenet, Metalink and P2P (ed2k/emule) including 47 file types and 557+ file extensions including over 239 file upload services."

What to do if you have documents you want to upload

If you have a document relating to your investigation that is not already online – for instance a PDF, a Word document, a scanned document, or a letter – here is some advice on how to get it into the investigation:

1. Get it onto your computer if it isn’t already

If your evidence is physical – e.g. a printout – then use a scanner to get it onto your computer. Many company photocopiers now offer this facility as well.

2. Upload it to a document-sharing website

There are a number of these. Scribd is a very useful place to store PDFs, Word documents, Excel spreadsheets and PowerPoint presentations. You will need to create a (free) account first. Once you do, just follow the instructions given here. You can also use the service to create backup copies of documents that are already online.

The biggest advantage of Scribd is that people can label and annotate documents, making it easier for others to spot things you might not see. It also makes it easy to embed documents in other webpages so you could display the document in a blog post about it.

Google Docs will also allow you to upload the same types of documents – you’ll find links on how to do this via this page.

If you have scanned in a document and it is an image then you’ll need an image-sharing website. There are dozens of these but the best-known and most widely used is Flickr. Again, you’ll need to create a free account and then go to the upload page. You can also upload images by sending them to a special email address – more information on that can be found here.

Perhaps the easiest way to get your documents online is to send them in an email to post@posterous.com – this will create a blog for you with your document ‘embedded’ in your first entry. If you send a number of images Posterous will even create a gallery for you. There’s more information on Posterous’ FAQ (Frequently Asked Questions) page.

3. Link to it on your Investigation Page

Once your document is online you just need to link to it from the investigation page.

  • If you’ve used Posterous a link will have been emailed to you.
  • On Scribd make sure you are logged on and go to http://www.scribd.com/my_docs – then click on the name of the document you want to link to. You will be taken to the page with your document on it. Copy the address of that webpage.
  • On Google Docs open your document and click on Share (in the top right) and select ‘Publish as a web page’ – a window should appear with further options. Select these as you wish and you should be presented with a web address to copy. More information here.
  • On Flickr log on to your account, click on You and then Your Photostream to see your images. Click on the image you need and copy the address of the webpage.

Now go to your investigation and the challenge that relates to your documents (e.g. ‘Add background information’). Accept it if you haven’t already and, in the Add an update box that appears, type a description of your document. In the Web link box, paste or type the web address your document has been published to.