Making searchable PDFs

From XPUB & Lens-Based wiki
See also: [[Digital zines I: PDF]]
 
[[Tesseract]] can directly convert an image of text into a PDF with a hidden layer of selectable / copy-pasteable text.
 
    tesseract words.png out -l eng pdf
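
For several scanned pages, the same command can be wrapped in a small loop. This is only a sketch: the function name ''ocr_pngs'' and the file names are made up for illustration, and the config name is written in lowercase (''pdf''), which is how tesseract ships it.

```shell
# ocr_pngs FILE...: make one searchable PDF per image, named after the image.
# (Hypothetical helper; assumes tesseract and its English language data are installed.)
ocr_pngs() {
    for img in "$@"; do
        # "${img%.*}" strips the extension, so page1.png becomes page1.pdf
        tesseract "$img" "${img%.*}" -l eng pdf
    done
}
```

Calling it as ''ocr_pngs *.png'' would process every PNG in the current directory.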
 
See: [https://guides.library.illinois.edu/c.php?g=347520&p=4121426 this very good guide]
 
To correct the OCR'd text, have tesseract write its output in the hOCR format (an HTML file that records each recognized word together with its position on the page), fix the misrecognized words in that file, and then use the ''hocr2pdf'' utility to put the pieces (image + corrected, positioned text) back together as a PDF.
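
The round trip might look roughly like this. It is a sketch under assumptions: the function names ''make_hocr'' and ''rebuild_pdf'' are invented for illustration, ''words.png'' and ''out'' are the example names used above, and recent tesseract versions are assumed (they write ''out.hocr''; older releases may use a different extension).

```shell
# make_hocr IMAGE BASE: run OCR and write BASE.hocr (HTML with word positions).
make_hocr() {
    tesseract "$1" "$2" -l eng hocr
}

# rebuild_pdf IMAGE BASE: combine the image with the (hand-corrected) BASE.hocr
# into BASE.pdf. hocr2pdf reads the hOCR text from standard input.
rebuild_pdf() {
    hocr2pdf -i "$1" -o "$2.pdf" < "$2.hocr"
}
```

Typical use would be ''make_hocr words.png out'', then editing the words in ''out.hocr'' with a text editor (leaving the position attributes alone), then ''rebuild_pdf words.png out''.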


http://www.tobias-elze.de/pdfsandwich/sandwich1.png


pdfsandwich includes the ''hocr2pdf'' utility.
https://imgs.xkcd.com/comics/sandwich.png
== Links ==
* https://github.com/tesseract-ocr/tesseract/wiki
* http://www.tobias-elze.de/pdfsandwich
* https://guides.library.illinois.edu/c.php?g=347520&p=4121426

Latest revision as of 23:54, 8 January 2020
