Savile extracted

On Friday the BBC released documents from The Pollard Report into the Savile inquiry.

These were published as scanned PDFs, making it impossible to search text or count mentions of particular terms.

We’ve used the document extraction service DocumentCloud to convert the two key documents – appendices 10 (statements) and 12 (emails and documents) – into text. These are linked below. If you use them, let us know so we can continue to do this.

Savile Transcript appendix 10 (PDF)
Savile Transcript appendix 10 (Text)


Savile Appendix12 (PDF)
Savile Appendix12 (Text)
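
Once an appendix is available as plain text, counting mentions of a term becomes straightforward. What follows is only a minimal sketch in Python, not a description of how DocumentCloud works: the function names are our own, the filename `appendix12.txt` is a hypothetical local copy of the text version, and the search terms are purely illustrative.

```python
import re
from collections import Counter


def count_mentions(text, terms):
    """Count case-insensitive, whole-word occurrences of each term in text."""
    counts = Counter()
    for term in terms:
        # \b word boundaries stop a term matching inside a longer word.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        counts[term] = len(pattern.findall(text))
    return counts


def count_mentions_in_file(path, terms):
    """Read a plain-text file (assumed UTF-8) and count mentions of each term."""
    # errors="replace" keeps going if OCR output contains odd bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_mentions(handle.read(), terms)
```

Calling `count_mentions_in_file("appendix12.txt", ["Savile", "Newsnight"])` would return a count for each term. Bear in mind that OCR errors in the extracted text (a misread letter in a name, say) mean any count is likely to be an undercount, so treat the numbers as a starting point rather than a definitive figure.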


2 thoughts on “Savile extracted”

  1. Jon Soroko

    With PDF documents, it’s helpful to keep in mind that (1) certain versions of Acrobat and other PDF generators have OCR built in; for instance, MS Word, when outputting to PDF, will export the text as a separate, searchable layer (you have to know to look for it); and (2) Acrobat has its own OCR capability – not great, but sometimes workable – and the capacity to search not only individual PDF files but large sets of PDFs.

    I’m not familiar with the UK FOI statute. In the U.S. we have 50 state laws and one federal FOI law, plus individual laws about public access in particular subject areas (e.g. environmental hazards). The better-drafted of these laws forbid the government from dumping printed files on the requester when they’ve already got them in electronic form. If you’ve got those provisions, make sure to ask for things in the format YOU want.

    For a public (i.e. not well-funded) project, OCR and quality control of the OCR can bring things to a halt, unless you can get an OCR services vendor to donate services.

    I’m quite impressed with the overall helpmeinvestigate project – amazing what you’ve done, and exciting to think of what you can do as you continue and grow.

    – JS

    1. paulbradshaw Post author

      Thanks Jon – that’s very useful. As for UK FOI, yes, it does include precedents and guidance that say bodies should provide data in a computer-readable format when requested.

