Making searchable PDFs
Latest revision as of 23:54, 8 January 2020
See also: Digital zines I: PDF
Tesseract can directly convert an image of text into a PDF with a hidden layer of selectable / copy-pasteable text.
tesseract words.png out -l eng pdf
See: this very good guide (https://guides.library.illinois.edu/c.php?g=347520&p=4121426)
To correct the OCR'd text, have tesseract output hOCR instead, an HTML-based format that stores each recognized word together with its position on the page. Edit the misrecognized words in that file, then use the "hocr2pdf" utility to put the pieces (image + corrected, positioned text) back together in PDF format.
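pdfsandwich (http://www.tobias-elze.de/pdfsandwich/) wraps this kind of workflow and can call hocr2pdf; the hocr2pdf utility itself ships with the ExactImage package.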
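The correction workflow described above could be sketched roughly as follows. This is a minimal sketch, not a tested recipe: the function names `ocr_to_hocr` and `hocr_to_pdf` and the file names are made up for illustration, and it assumes a recent tesseract (which writes a `.hocr` file) plus hocr2pdf on the PATH.

```shell
#!/bin/sh
# Hypothetical helper functions; names and file names are placeholders.

# Step 1: OCR the image to hOCR instead of PDF.
# "hocr" is a stock tesseract config, like "pdf". Recent tesseract
# versions write OUTPUT_BASE.hocr (assumption: older ones may differ).
ocr_to_hocr() {
    if [ "$#" -ne 2 ]; then
        echo "usage: ocr_to_hocr IMAGE OUTPUT_BASE" >&2
        return 2
    fi
    tesseract "$1" "$2" -l eng hocr
}

# Step 2 is done by hand: open the .hocr file in a text editor and fix
# the words (in tesseract's output, the text inside the ocrx_word spans).
# Leave the "bbox" coordinates alone so the text stays positioned.

# Step 3: merge the original image and the corrected hOCR into a PDF.
# hocr2pdf reads the hOCR on standard input.
hocr_to_pdf() {
    if [ "$#" -ne 3 ]; then
        echo "usage: hocr_to_pdf IMAGE HOCR_FILE OUTPUT_PDF" >&2
        return 2
    fi
    hocr2pdf -i "$1" -o "$3" < "$2"
}

# Example session:
#   ocr_to_hocr words.png out
#   (edit out.hocr)
#   hocr_to_pdf words.png out.hocr out.pdf
```

Splitting the steps into two functions keeps the manual editing pass in between explicit, which is the reason for going through hOCR rather than producing the PDF directly.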