Using Plone as a Document Repository
by Nathan Van Gheem, last modified Apr 14, 2011 02:39 AM
Sharing my experience using Plone to OCR PDF documents and display them in the browser with Flex Paper.
We just released a new site that houses thousands of scanned PDF documents, all of which are now viewable in the browser via Flex Paper. We started with PDFs that were just scanned images. Plone, with the help of a few packages, then OCR'd each PDF and replaced it with a searchable counterpart.
Features
- Convert image-only PDFs to searchable versions
- Split large PDFs into multiple documents
- Overwrite PDF metadata
- OCR text is then searchable via Plone search
- Online viewable version
- All document processing is done via asynchronous processes, so adding documents is not slow
- The asynchronous conversion processes can be monitored
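To give a feel for the splitting step, here is a minimal sketch that builds one pdftk command per chunk of pages. It is not how wildcard.pdfpal does it; the function name `split_commands`, the 50-page default and the output file naming are all assumptions made up for illustration. Only the `pdftk <input> cat <range> output <file>` form is standard pdftk usage.

```python
import os


def split_commands(pdf_path, total_pages, pages_per_part=50):
    """Return a list of pdftk argument lists, one per chunk of pages.

    Hypothetical helper: the chunk size and output names are assumptions.
    The commands are only built here, not executed.
    """
    if total_pages < 1 or pages_per_part < 1:
        raise ValueError("total_pages and pages_per_part must be positive")

    stem, _ext = os.path.splitext(pdf_path)
    commands = []
    starts = range(1, total_pages + 1, pages_per_part)
    for part, start in enumerate(starts, 1):
        # The last chunk may be shorter than pages_per_part.
        end = min(start + pages_per_part - 1, total_pages)
        out_path = "%s.part%03d.pdf" % (stem, part)
        commands.append(
            ["pdftk", pdf_path, "cat", "%d-%d" % (start, end), "output", out_path]
        )
    return commands
```

Each returned list could be handed to `subprocess.check_call` from inside an asynchronous job, so the upload request itself never waits on pdftk.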
Requirements
- wc.pageturner : For online viewable PDFs
- wildcard.pdfpal : heavy lifting in PDF processing
- plone.app.async : asynchronously process PDF documents
- Tesseract > 3.0.1 system package
- swftools system package
- ghostscript system package
- hocr2pdf system package
- pdftk system package
- tiff2pdf system package
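Since so much depends on system packages, it can help to check for them before installing the Plone add-ons. The following is a small sketch, assuming Python 3 and the usual executable names (`pdf2swf` from swftools, `gs` from ghostscript); those names may differ on your distribution, and the check does not verify the Tesseract version.

```python
import shutil

# Assumed executable names for the system packages listed above.
REQUIRED_TOOLS = ["tesseract", "pdf2swf", "gs", "hocr2pdf", "pdftk", "tiff2pdf"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system tools: " + ", ".join(missing))
    else:
        print("All required system tools were found.")
```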
Caveats
- Probably only works on Linux
- wildcard.pdfpal is pretty specific and isn't smart about deciding whether it should process a PDF. For instance, if the PDF is already searchable, it'll still try to convert it regardless.
- We're not really interested in widely supporting pdfpal beyond our use case (that's why it's not listed on plone.org, but is in the collective and on PyPI). So if you're interested in implementing this, you might end up contributing to the project and cleaning up some of the cruft in the package.
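If you wanted to work around the "already searchable" caveat yourself, one very crude pre-check is to look for font resources in the file, since image-only scans usually have none. This sketch is not part of wildcard.pdfpal, and `probably_has_text` is a hypothetical name. The heuristic can be wrong in both directions: PDFs that store their objects in compressed object streams can hide the marker, and a font being present does not guarantee that the pages carry real text.

```python
def probably_has_text(pdf_bytes):
    """Crude guess at whether a PDF already contains a text layer.

    Hypothetical helper: it only looks for the "/Font" name in the raw
    bytes, so treat a True result as a hint, not as proof.
    """
    return b"/Font" in pdf_bytes


def should_ocr(pdf_bytes):
    """Skip OCR when the PDF already appears to reference fonts."""
    return not probably_has_text(pdf_bytes)
```

Running a real text extraction tool on the first few pages would be more reliable, at the cost of another external dependency.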
