User:Alexander Roidl/pdf2html

From XPUB & Lens-Based wiki

PDF2HTML

pdf2htmlEX

+

  • very accurate reproduction of the PDF layout
  • seems stable

-

  • no longer maintained
  • heavy processing (high CPU usage)
  • slow conversion
  • leaves little room for modifying the output
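
A minimal sketch of how a pdf2htmlEX conversion could be driven from Python, in case it helps when comparing the tools. It assumes the pdf2htmlEX binary is installed and on the PATH. The helper names (pdf2htmlex_command, convert_with_pdf2htmlex), the "out" folder and the file name sample.pdf are placeholders made up for this example; --dest-dir and --zoom are the only options used.

```python
import shutil
import subprocess
from pathlib import Path


def pdf2htmlex_command(pdf_path, dest_dir="out", zoom=None):
    """Build the argument list for a pdf2htmlEX run (nothing is executed here)."""
    cmd = ["pdf2htmlEX", "--dest-dir", str(dest_dir)]
    if zoom is not None:
        # Optional scaling factor for the rendered pages
        cmd += ["--zoom", str(zoom)]
    cmd.append(str(pdf_path))
    return cmd


def convert_with_pdf2htmlex(pdf_path, dest_dir="out", zoom=None):
    """Run pdf2htmlEX if it is installed; return the output folder, else None."""
    if shutil.which("pdf2htmlEX") is None:
        return None  # binary not found, nothing to do
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the conversion fails
    subprocess.run(pdf2htmlex_command(pdf_path, dest_dir, zoom), check=True)
    return Path(dest_dir)
```

Calling convert_with_pdf2htmlex("sample.pdf") would then write the HTML into the out folder. Because of the high CPU usage noted above, it is probably wise to try it on a short PDF first.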

poppler

+

  • very simple
  • lightweight
  • fast

-

  • not very accurate
  • one image per page
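
For comparison, a rough sketch of calling poppler from Python. It assumes that "poppler" here means the pdftohtml command-line tool from poppler-utils and that it is installed. The helper names, sample.pdf and the output base name are invented for this example. The flags used are -noframes (no frameset), -c (the more layout-faithful "complex" output) and -f / -l (first and last page).

```python
import shutil
import subprocess


def pdftohtml_command(pdf_path, out_base, complex_layout=False, first=None, last=None):
    """Build the argument list for poppler's pdftohtml (nothing is executed here)."""
    cmd = ["pdftohtml", "-noframes"]
    if complex_layout:
        cmd.append("-c")  # "complex" output, closer to the original layout
    if first is not None:
        cmd += ["-f", str(first)]  # first page to convert
    if last is not None:
        cmd += ["-l", str(last)]  # last page to convert
    cmd += [str(pdf_path), str(out_base)]
    return cmd


def convert_with_pdftohtml(pdf_path, out_base, **options):
    """Run pdftohtml if it is installed; return True on success, False if missing."""
    if shutil.which("pdftohtml") is None:
        return False
    subprocess.run(pdftohtml_command(pdf_path, out_base, **options), check=True)
    return True
```

Something like convert_with_pdftohtml("sample.pdf", "sample", complex_layout=True, first=1, last=3) would convert only the first three pages, which makes it quick to compare the result against pdf2htmlEX.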

PyPDF2

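from PyPDF2 import PdfReader  # PyPDF2 >= 3.0 (PdfFileReader, getPage, extractText were removed)

# Open the PDF in binary mode; the with-block closes the file afterwards
with open('sample.pdf', 'rb') as pdf_file:
    reader = PdfReader(pdf_file)
    # Print the extracted text of every page, in order
    for page in reader.pages:
        print(page.extract_text())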
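
Since PyPDF2 only returns plain text, getting to HTML needs one more step. The following is a sketch of one possible way to do that, not something taken from PyPDF2 itself. The function name pages_to_html and the choice of one section per page with paragraphs split on blank lines are assumptions made for this example. The input is a list of strings, one per page, such as the extracted text of each page.

```python
from html import escape


def pages_to_html(pages, title="sample"):
    """Wrap a list of per-page text strings in a bare-bones HTML document."""
    sections = []
    for number, text in enumerate(pages, start=1):
        # Treat blank lines as paragraph breaks and drop empty chunks
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        # escape() keeps characters like < and & from breaking the markup
        body = "\n".join(f"    <p>{escape(p)}</p>" for p in paragraphs)
        sections.append(f'  <section id="page-{number}">\n{body}\n  </section>')
    head = f'<head><meta charset="utf-8"><title>{escape(title)}</title></head>'
    return (
        "<!DOCTYPE html>\n<html>\n" + head + "\n<body>\n"
        + "\n".join(sections)
        + "\n</body>\n</html>\n"
    )
```

Unlike the two tools above, this keeps no layout, fonts or images at all. The trade-off is that the resulting HTML is very easy to restyle and modify.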