Using Plone as a Document Repository

Update

This method is no longer recommended. Please use collective.documentviewer instead, which should cover all of these use cases.

We just released a new site that houses thousands of scanned PDF documents, which are now viewable in the browser via FlexPaper. We started with PDFs that were just scanned images. Plone, with the help of a few packages, then OCR'd each PDF and replaced it with a searchable counterpart.

Features

Convert Image PDFs to searchable versions
Split large PDFs into multiple documents
Overwrite metadata of PDF
OCR text is then searchable via Plone search
Online viewable version
All document processing is done via asynchronous processes, so adding documents is not slow
Asynchronous conversion processes can be monitored
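
To make the splitting feature more concrete, here is a minimal sketch of how a large PDF could be divided into fixed-size parts with pdftk. It is not taken from wildcard.pdfpal; the function names, the part size, and the output file naming are assumptions for illustration only. The code only builds the pdftk argument lists and does not run them.

```python
def page_ranges(total_pages, pages_per_part):
    """Return inclusive, 1-indexed (first, last) page ranges."""
    if total_pages < 1 or pages_per_part < 1:
        raise ValueError("page counts must be positive")
    return [
        (first, min(first + pages_per_part - 1, total_pages))
        for first in range(1, total_pages + 1, pages_per_part)
    ]


def pdftk_split_commands(source, total_pages, pages_per_part):
    """Build one pdftk argument list per output part (nothing is executed)."""
    commands = []
    ranges = page_ranges(total_pages, pages_per_part)
    for index, (first, last) in enumerate(ranges, start=1):
        # Hypothetical output naming scheme: part-001.pdf, part-002.pdf, ...
        output = "part-%03d.pdf" % index
        commands.append(
            ["pdftk", source, "cat", "%d-%d" % (first, last), "output", output]
        )
    return commands
```

Each returned list could be passed to subprocess.run from inside an asynchronous job, so the upload request itself does not wait for the split to finish.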

Requirements

wc.pageturner: for online viewable PDFs
wildcard.pdfpal: heavy lifting in PDF processing
plone.app.async: asynchronously processes PDF documents
Tesseract > 3.0.1 system package
swftools system package
ghostscript system package
hocr2pdf system package
pdftk system package
tiff2pdf system package
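
Since several of these requirements are system packages rather than Python packages, it can help to check that their executables are on the PATH before enabling the conversion. The sketch below is one way to do that. The executable names are assumptions (for example, that swftools provides pdf2swf and ghostscript provides gs) and may differ between distributions. It also does not check the Tesseract version.

```python
import shutil

# Assumed mapping of package name -> executable name; adjust for your system.
REQUIRED_TOOLS = {
    "Tesseract": "tesseract",
    "swftools": "pdf2swf",
    "ghostscript": "gs",
    "hocr2pdf": "hocr2pdf",
    "pdftk": "pdftk",
    "tiff2pdf": "tiff2pdf",
}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the sorted package names whose executable is not on the PATH."""
    return sorted(
        package
        for package, executable in tools.items()
        if shutil.which(executable) is None
    )
```

Calling missing_tools() with no arguments returns an empty list when every executable was found.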

Caveats

Probably only works on Linux
wildcard.pdfpal is pretty specific and isn't smart about deciding whether it should process a PDF. For instance, if the PDF is already searchable, it will still try to convert it.
We're not really interested in widely supporting pdfpal beyond our use case (that's why it's not listed on plone.org, but is in the collective and on PyPI). So if you're interested in implementing this, you might end up contributing to the project and cleaning up some of the cruft in the package.
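
As an illustration of the already-searchable caveat, here is a rough sketch of a check that could run before conversion. It is not part of wildcard.pdfpal. It assumes Ghostscript's txtwrite device is available, and the function names and the 20-character threshold are hypothetical choices. The heuristic is deliberately simple: a PDF that yields almost no extractable text is treated as a scanned image.

```python
import subprocess


def extract_text(pdf_path):
    """Extract any embedded text using Ghostscript's txtwrite device."""
    result = subprocess.run(
        ["gs", "-q", "-dNOPAUSE", "-dBATCH",
         "-sDEVICE=txtwrite", "-sOutputFile=-", pdf_path],
        check=True,
        stdout=subprocess.PIPE,
    )
    return result.stdout.decode("utf-8", errors="replace")


def looks_searchable(text, min_chars=20):
    """Heuristic: enough non-whitespace characters suggests a text layer."""
    visible = sum(1 for ch in text if not ch.isspace())
    return visible >= min_chars
```

A caller could skip OCR when looks_searchable(extract_text(path)) is true. A scan with a stray text annotation could still fool this, so the threshold would need tuning against real documents.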