Update

This method is no longer recommended. Please use collective.documentviewer instead, which should cover all of the use cases described here.

We just released a new site that houses thousands of scanned PDF documents, all of which are now viewable in the browser via FlexPaper. We started with PDFs that were nothing more than scanned images. Plone, with the help of a few packages, then ran OCR on each one and replaced it with a searchable PDF counterpart.

Features

  • Convert Image PDFs to searchable versions
  • Split large PDFs into multiple documents
  • Overwrite metadata of PDF
  • OCR text is then searchable via Plone search
  • Online viewable version
  • All document processing runs in asynchronous processes, so adding documents is not slow
  • The asynchronous conversion processes can be monitored

Requirements

  • wc.pageturner : For online viewable PDFs
  • wildcard.pdfpal : heavy lifting in PDF processing
  • plone.app.async : asynchronously process PDF documents
  • Tesseract > 3.0.1 system package
  • swftools system package
  • ghostscript system package
  • hocr2pdf system package
  • pdftk system package
  • tiff2pdf system package
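
Before installing the Plone packages, it can help to confirm that the system packages above are actually present. The following is a minimal sketch of such a check, written as a standalone Python 3 script and not part of any of the packages listed. The executable names (for example pdf2swf for swftools and gs for ghostscript) are assumptions about what each package installs on a typical Linux system, and the script does not verify the Tesseract version.

```python
import shutil

# Package name -> executable it is assumed to provide.
# These executable names are assumptions; adjust them for your distribution.
REQUIRED_TOOLS = {
    "tesseract": "tesseract",
    "swftools": "pdf2swf",
    "ghostscript": "gs",
    "hocr2pdf": "hocr2pdf",
    "pdftk": "pdftk",
    "tiff2pdf": "tiff2pdf",
}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the sorted package names whose executable is not on PATH."""
    return sorted(pkg for pkg, exe in tools.items() if which(exe) is None)


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system packages: " + ", ".join(missing))
    else:
        print("All required executables were found on PATH.")
```

Run it as the same user that runs the Plone instance, since that is the PATH the conversion processes will see.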

Caveats

  • Probably only works in Linux
  • wildcard.pdfpal is pretty specific and isn't smart about deciding whether it should process a PDF. For instance, if the PDF is already searchable, it will still try to convert it.
  • We're not really interested in widely supporting pdfpal beyond our use case (that's why it's not listed on plone.org, only in the collective and on PyPI). So if you're interested in implementing this, you might end up contributing to the project and cleaning up some of the cruft in the package.
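
On the caveat about already-searchable PDFs: one way to avoid needless conversions would be to check for an existing text layer before queuing a document. The sketch below is not something wildcard.pdfpal does. It assumes the pdftotext command from poppler-utils is installed (it is not in the requirements list above), and the function names and the 20-character threshold are hypothetical choices for illustration.

```python
import subprocess


def looks_searchable(text, min_chars=20):
    """Heuristic: enough non-whitespace characters means a text layer exists."""
    return len("".join(text.split())) >= min_chars


def extract_text(pdf_path):
    """Extract text with pdftotext (assumed installed); '-' writes to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def needs_ocr(pdf_path, min_chars=20):
    """Return True if the PDF appears to have no usable text layer."""
    return not looks_searchable(extract_text(pdf_path), min_chars)
```

This is only a heuristic: a scanned PDF with a short typed cover page would pass the check even though most of its pages still need OCR.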