User:Simon/Trim4/Extracting text from PDF

Revision as of 12:04, 28 September 2019

In Al Sweigart's Automate the Boring Stuff with Python, there's a nice section on a Python library called PyPDF2 that allows you to work with the contents of PDFs. To begin with, I thought I'd try extracting text from a PDF of William S. Burroughs's The Electronic Revolution. I chose this PDF because the only version of the text I've found online is a 40-page document published by ubuclassics (which I suppose is the publishing house for ubuweb.com). It carries no other identifier (no ISBN etc.), and I couldn't locate any other version online. What's more, the text in the PDF is very small, which made it uncomfortable to read once I had run the booklet.sh script on it.

I thought it would be worthwhile laying this book out again for reading in print, and the first step is to get the text out of the PDF. Pandoc is usually my go-to for extracting text, but it can't take a PDF as input, so I tried PyPDF2.

I began by copying a file called electronic_revolution.pdf to a folder, then used cd in the terminal to move into that directory. Then I started the interactive Python interpreter with this command:

   $ python3

Next I wrote the following commands in Python 3:

   >>> import PyPDF2
   >>> pdfFileObj = open('electronic_revolution.pdf', 'rb')
   >>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
   >>> pdfReader.numPages
   40
   >>> pageObj = pdfReader.getPage(0)
   >>> pageObj.extractText()
   'ubuclassics2005WILLIAM S. BURROUGHSTheElectronic\nRevolution'

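This returns the total page count (40) and the text of the first page (index 0, since pages are counted from zero). Note that the extracted text has lost some of its spacing: words that sat on separate lines of the title page come out run together. What I want to get is the text for the whole document.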
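One way to get there is to loop over every page index and join the results. The following is only a minimal sketch, not something from Sweigart's book: the function name extract_all_text is my own, and it assumes the same PyPDF2 1.x calls used in the session above (numPages, getPage, extractText). Later PyPDF2 releases renamed these, so on a newer install the names would need adjusting.

```python
def extract_all_text(reader):
    """Return the text of every page in `reader`, joined by newlines.

    `reader` is assumed to be a PyPDF2 1.x PdfFileReader (or anything
    else exposing numPages and getPage), as in the session above.
    """
    page_texts = []
    # Page indices start at 0, so range(numPages) covers the whole document.
    for page_number in range(reader.numPages):
        page = reader.getPage(page_number)
        page_texts.append(page.extractText())
    # A newline between pages keeps the last word of one page from
    # running into the first word of the next.
    return "\n".join(page_texts)
```

In the interpreter session above, calling extract_all_text(pdfReader) should return all 40 pages as one string, which can then be written to a .txt file with an ordinary open(..., 'w'). The spacing problems seen on the first page will most likely show up on the other pages too, so the result will still need cleaning up by hand.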