User:Alexander Roidl/pdf2html
Revision as of 14:33, 3 June 2018 by Alexander Roidl
PDF2HTML
pdf2htmlEX
+
- very exact representation of every PDF, seems stable
-
- not maintained anymore
- heavy processing (high CPU usage)
- takes a long time
- little modification possible
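A minimal sketch of calling pdf2htmlEX from Python, assuming the `pdf2htmlEX` binary is installed and on the PATH. The helper names `build_pdf2htmlex_command` and `convert_with_pdf2htmlex` and the file names are made up for illustration; `--dest-dir` and `--zoom` are options listed in the tool's command-line help.

```python
import shutil
import subprocess


def build_pdf2htmlex_command(pdf_path, dest_dir=".", zoom=None):
    """Return the argument list for one pdf2htmlEX run (hypothetical helper)."""
    cmd = ["pdf2htmlEX", "--dest-dir", dest_dir]
    if zoom is not None:
        # --zoom scales the rendered pages, e.g. 1.3
        cmd += ["--zoom", str(zoom)]
    cmd.append(pdf_path)
    return cmd


def convert_with_pdf2htmlex(pdf_path, dest_dir=".", zoom=None):
    """Run the conversion; raises CalledProcessError if pdf2htmlEX fails."""
    if shutil.which("pdf2htmlEX") is None:
        raise FileNotFoundError("pdf2htmlEX not found on PATH")
    subprocess.run(build_pdf2htmlex_command(pdf_path, dest_dir, zoom), check=True)
```

Calling `convert_with_pdf2htmlex("sample.pdf", "out", zoom=1.3)` would write the HTML into `out`. Since conversion is slow and CPU-heavy, it is probably best run as a batch step rather than on demand.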
poppler
+
- very simple
- lightweight
- fast
-
- not very accurate
- one image per page
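A rough sketch of running poppler's `pdftohtml` tool from Python, assuming poppler-utils is installed and `pdftohtml` is on the PATH. The helper names and file names are hypothetical; `-c` (complex output) and `-s` (single HTML document) are flags from the tool's manual page. The `-c` mode is presumably the one that yields a rendered image per page.

```python
import shutil
import subprocess


def build_pdftohtml_command(pdf_path, output_base, complex_mode=False, single_file=True):
    """Return the argument list for one pdftohtml run (hypothetical helper)."""
    cmd = ["pdftohtml"]
    if complex_mode:
        cmd.append("-c")  # complex output that tries to keep the page layout
    if single_file:
        cmd.append("-s")  # one HTML document containing all pages
    cmd += [pdf_path, output_base]
    return cmd


def convert_with_pdftohtml(pdf_path, output_base, **options):
    """Run the conversion; raises CalledProcessError if pdftohtml fails."""
    if shutil.which("pdftohtml") is None:
        raise FileNotFoundError("pdftohtml not found on PATH")
    subprocess.run(build_pdftohtml_command(pdf_path, output_base, **options), check=True)
```

For example, `convert_with_pdftohtml("sample.pdf", "sample", complex_mode=True)` would run `pdftohtml -c -s sample.pdf sample`.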
PyPDF2
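import PyPDF2

# Print the extracted text of every page (PyPDF2 1.x API names).
# The with-block closes the file once all pages have been read.
with open('sample.pdf', 'rb') as pdfFileObject:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
    count = pdfReader.numPages
    for i in range(count):
        page = pdfReader.getPage(i)
        print(page.extractText())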
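The snippet above uses the PyPDF2 1.x names. Later PyPDF2 releases renamed them to `PdfReader`, `reader.pages` and `page.extract_text()`, so a version of the same loop for a newer install might look like the sketch below. The function names `extract_page_texts` and `read_pdf_text` are made up here, and the installed PyPDF2 version is an assumption.

```python
def extract_page_texts(reader):
    """Return one string per page for a PdfReader-like object."""
    # extract_text() can come back empty (or None) for pages without a text layer
    return [page.extract_text() or "" for page in reader.pages]


def read_pdf_text(path):
    """Open a PDF with PyPDF2 and return the text of each page as a list."""
    # Imported here so extract_page_texts stays usable without PyPDF2 installed
    from PyPDF2 import PdfReader  # assumes a PyPDF2 release that has the new names

    return extract_page_texts(PdfReader(path))
```

Returning a list instead of printing keeps the page boundaries, which helps if each page should later become its own HTML element.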