Making searchable PDFs

From XPUB & Lens-Based wiki

See also: Digital zines I: PDF

Tesseract can directly convert an image of text into a PDF with a hidden layer of selectable / copy-pasteable text.

    tesseract words.png out -l eng pdf

See: this very good guide

To correct the OCR'd text, have tesseract output the hOCR format instead, which records each recognised word together with its position on the page. Edit the misrecognised words in that file, then use the "hocr2pdf" utility to put the pieces (image + corrected, positioned text) back together in PDF format.


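The hocr2pdf utility is distributed with the ExactImage package. pdfsandwich is a separate wrapper tool that runs the OCR and PDF assembly steps for you.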
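
The correction workflow described above could be sketched roughly as follows. This is only an illustration: the function name ocr_with_corrections and the file names are made up for the example, and it assumes both tesseract and hocr2pdf are installed. Recent tesseract versions write the hOCR output to a ".hocr" file; older ones may use ".html" instead, so adjust the file name if needed.

```shell
#!/bin/sh
# Sketch: OCR to hOCR, pause for manual corrections, then rebuild the PDF.
# The function name and file names are illustrative assumptions.

ocr_with_corrections() {
    image="$1"   # scanned page, e.g. words.png
    base="$2"    # output base name, e.g. out

    # Both tools must be installed; stop early if either is missing.
    for tool in tesseract hocr2pdf; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing tool: $tool" >&2
            return 1
        fi
    done

    # Step 1: write the recognised words and their positions to $base.hocr
    tesseract "$image" "$base" -l eng hocr || return 1

    # Step 2: correct the words by hand in a text editor
    "${EDITOR:-vi}" "$base.hocr"

    # Step 3: combine the image with the corrected hOCR (read from stdin)
    hocr2pdf -i "$image" -o "$base.pdf" < "$base.hocr"
}
```

Calling it as "ocr_with_corrections words.png out" should open out.hocr in your editor and, once you save and quit, write out.pdf. When editing, change only the word text and leave the surrounding markup (the bounding box coordinates) alone, so the hidden text stays aligned with the image.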

