User:Alexander Roidl/pdf2html: Difference between revisions
Revision as of 14:33, 3 June 2018
PDF2HTML
PDF2HTMLEX
https://github.com/coolwanglu/pdf2htmlEX
+
- very exact representation of every PDF, seems stable
-
- not maintained anymore
- heavy processing (high CPU usage)
- takes a long time
- little modification possible
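A minimal sketch of calling pdf2htmlEX from Python, assuming the `pdf2htmlEX` binary is installed and on the PATH. The helper names `build_pdf2htmlex_cmd` and `convert` are hypothetical, not part of the tool. Because conversion takes a long time, the sketch optionally restricts the page range with `--first-page` / `--last-page`, which may be useful for quick tests on large files.

```python
import shutil
import subprocess


def build_pdf2htmlex_cmd(pdf_path, dest_dir, first_page=None, last_page=None):
    """Build the pdf2htmlEX argument list (hypothetical helper)."""
    cmd = ["pdf2htmlEX", "--dest-dir", dest_dir]
    if first_page is not None:
        cmd += ["--first-page", str(first_page)]
    if last_page is not None:
        cmd += ["--last-page", str(last_page)]
    cmd.append(pdf_path)
    return cmd


def convert(pdf_path, dest_dir, first_page=None, last_page=None):
    """Run pdf2htmlEX; raises if the binary is missing or the run fails."""
    if shutil.which("pdf2htmlEX") is None:
        raise FileNotFoundError("pdf2htmlEX not found on PATH")
    cmd = build_pdf2htmlex_cmd(pdf_path, dest_dir, first_page, last_page)
    subprocess.run(cmd, check=True)
```

Usage would be something like `convert('sample.pdf', 'out', first_page=1, last_page=3)`.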
poppler
+
- very simple
- lightweight
- fast
-
- not very accurate
- one image per page
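A rough sketch of how poppler's `pdftohtml` command-line tool could be called from Python, assuming poppler-utils is installed. The helper names `build_pdftohtml_cmd` and `run_pdftohtml` are hypothetical. The `-c` (complex output) and `-s` (single HTML document) switches are real `pdftohtml` options; that the per-page image noted above comes from the complex mode is an assumption.

```python
import subprocess


def build_pdftohtml_cmd(pdf_path, out_base, complex_mode=True, single_file=True):
    """Build the argument list for poppler's pdftohtml (hypothetical helper)."""
    cmd = ["pdftohtml"]
    if complex_mode:
        cmd.append("-c")  # "complex" output that tries to keep the page layout
    if single_file:
        cmd.append("-s")  # one HTML document that includes all pages
    cmd += [pdf_path, out_base]
    return cmd


def run_pdftohtml(pdf_path, out_base, **options):
    """Run pdftohtml and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_pdftohtml_cmd(pdf_path, out_base, **options), check=True)
```

Usage would be something like `run_pdftohtml('sample.pdf', 'sample')`.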
PyPDF2
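<pre>
import PyPDF2

# Names below are those of PyPDF2 3.x; older releases used
# PdfFileReader, numPages, getPage() and extractText().
with open('sample.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    # Print the extracted text of each page in order
    for page in reader.pages:
        print(page.extract_text())
</pre>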
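PyPDF2 only yields plain text, so getting from there to HTML needs an extra step. The following is a minimal sketch of one way to do it, not a full converter: the function `pages_to_html` and the `page-N` ids are hypothetical, and all layout, fonts and images are lost. It takes a list of per-page strings (for example the results of the text extraction above) and wraps each page in its own section.

```python
import html


def pages_to_html(page_texts, title="Converted PDF"):
    """Wrap plain-text pages in a minimal HTML document (hypothetical helper)."""
    sections = []
    for number, text in enumerate(page_texts, start=1):
        # Escape <, > and & so the extracted text cannot break the markup.
        body = html.escape(text)
        sections.append(f'<section id="page-{number}"><pre>{body}</pre></section>')
    head = f'<head><meta charset="utf-8"><title>{html.escape(title)}</title></head>'
    return (
        "<!DOCTYPE html>\n"
        f"<html>{head}\n"
        "<body>\n" + "\n".join(sections) + "\n</body></html>\n"
    )
```

The result is a string that can be written to a file with `open('sample.html', 'w', encoding='utf-8')`.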