User:Alexander Roidl/pdf2html: Difference between revisions

From XPUB & Lens-Based wiki

Revision as of 14:33, 3 June 2018

PDF2HTML

PDF2HTMLEX

  • https://github.com/coolwanglu/pdf2htmlEX

+

  • very exact representation of every PDF, seems stable

-

  • not maintained anymore
  • heavy processing (high CPU usage)
  • takes a long time
  • little modification possible

poppler

+

  • very simple
  • lightweight
  • fast

-

  • not very accurate
  • one image per page
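
The poppler notes above can be tried from Python with a small wrapper. This is a minimal sketch, assuming that "poppler" here means the pdftohtml command-line tool shipped with poppler-utils and that it is installed on PATH. The helper names build_pdftohtml_command and convert are made up for illustration, and the flag choice (-noframes, -c) is only one plausible setup, not necessarily the one used for these notes.

```python
import shutil
import subprocess


def build_pdftohtml_command(pdf_path, out_base, complex_mode=True):
    """Return the argument list for one pdftohtml run (hypothetical helper)."""
    # -noframes writes a single HTML document instead of a frameset.
    cmd = ["pdftohtml", "-noframes"]
    if complex_mode:
        # -c requests the "complex" output that tries to keep the page layout.
        cmd.append("-c")
    cmd += [pdf_path, out_base]
    return cmd


def convert(pdf_path, out_base):
    """Run pdftohtml if it is available; return True when it exits cleanly."""
    if shutil.which("pdftohtml") is None:
        # Assumption: poppler-utils may not be installed on every machine.
        return False
    result = subprocess.run(build_pdftohtml_command(pdf_path, out_base), check=False)
    return result.returncode == 0
```

Calling convert('sample.pdf', 'sample') would then write the HTML next to the PDF, if the tool is present.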

PyPDF2

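import PyPDF2

# Uses the PyPDF2 3.x names; PyPDF2 1.26 and earlier spelled these
# PdfFileReader, numPages, getPage() and extractText().
with open('sample.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    # Print the extracted text of every page, in order.
    for page in reader.pages:
        print(page.extract_text())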
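
Since the goal of this page is HTML output, the extracted text could be wrapped in a simple HTML page. The following is a rough sketch, not a tested pipeline: pages_to_html and pdf_to_html are hypothetical helper names, the one-section-per-page markup is an arbitrary choice, and the PyPDF2 calls assume the 3.x API (PdfReader, pages, extract_text). Layout, fonts and images are not preserved, only the text.

```python
import html


def pages_to_html(page_texts, title="sample"):
    """Wrap a list of per-page text strings in a minimal HTML document."""
    parts = [
        "<!DOCTYPE html>",
        "<html>",
        "<head>",
        '<meta charset="utf-8">',
        f"<title>{html.escape(title)}</title>",
        "</head>",
        "<body>",
    ]
    for number, text in enumerate(page_texts, start=1):
        # One <section> per PDF page; text is escaped so "<" or "&" stay literal.
        parts.append(f'<section id="page-{number}">')
        parts.append(f"<pre>{html.escape(text)}</pre>")
        parts.append("</section>")
    parts += ["</body>", "</html>"]
    return "\n".join(parts)


def pdf_to_html(pdf_path):
    """Extract text with PyPDF2 (assumed 3.x API) and return it as HTML."""
    import PyPDF2  # imported lazily so pages_to_html works without PyPDF2

    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        # extract_text() can return an empty result for image-only pages.
        texts = [page.extract_text() or "" for page in reader.pages]
    return pages_to_html(texts, title=pdf_path)
```

For example, pdf_to_html('sample.pdf') returns a string that can be written to sample.html.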