User:Simon/Trim4/Extracting text from PDF

In Al Sweigart's Automate the Boring Stuff with Python, there's a nice section on a Python library called PyPDF2 that lets you work with the contents of PDFs. To begin with, I thought I'd try extracting the text from a PDF of William S. Burroughs's The Electronic Revolution. I chose this PDF because the only version I've found online is a 40-page document published by ubuclassics (which I suppose is the publishing house for ubuweb.com). It carries no other identifier (no ISBN etc.), and I couldn't locate any other version online. What's more, the text in the PDF is very small, which made it uncomfortable to read after I ran the booklet.sh script on it.

I thought it would be worthwhile laying this book out again for reading in print, and the first step is to get the text out of the PDF. Pandoc is usually my go-to for extracting text, but it can't take a PDF as input, so I tried PyPDF2.

28.09.19

I began by copying a file called electronic_revolution.pdf to a folder, then used cd in the terminal to move into that directory. Then I started the interactive Python interpreter with this command:

   $ python3

Next I wrote the following commands in Python 3 (comments above each line):

   # First, import the PyPDF2 module
   >>> import PyPDF2
   # Then open electronic_revolution.pdf in read binary mode and store it in pdfFileObj
   >>> pdfFileObj = open('electronic_revolution.pdf', 'rb')
   # To get a PdfFileReader object that represents this PDF, call PyPDF2.PdfFileReader() and pass it pdfFileObj. Store this PdfFileReader object in pdfReader
   >>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
   # The total number of pages in the document is stored in the numPages attribute of a PdfFileReader object
   >>> pdfReader.numPages
   40
   # The PDF has 40 pages. To extract text from a page, you need to get a Page object, which represents a single page of a PDF, from a PdfFileReader object.
   # You can get a Page object by calling the getPage() method on a PdfFileReader object and passing it the page number of the page you’re interested in — in our case, 0
   >>> pageObj = pdfReader.getPage(0)
   # Once you have your Page object, call its extractText() method to return a string of the page’s text
   >>> pageObj.extractText()
   'ubuclassics2005WILLIAM S. BURROUGHSTheElectronic\nRevolution'
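
The same steps can be wrapped in a small function, which is handy once the experiment moves out of the interpreter. This is only a minimal sketch: the names page_summary and summarise_pdf are my own, not part of PyPDF2, and it assumes the older PyPDF2 interface used above (PdfFileReader, numPages, getPage, extractText), which later PyPDF2 releases renamed.

```python
def page_summary(pdf_reader, page_number=0):
    """Return (total page count, text of one page) for a PdfFileReader-like object.

    page_summary is a hypothetical helper, not a PyPDF2 function.
    """
    page_count = pdf_reader.numPages
    # Page numbers are zero-based, so a 40-page PDF has pages 0 to 39.
    if not 0 <= page_number < page_count:
        raise IndexError(f"page_number must be between 0 and {page_count - 1}")
    page_text = pdf_reader.getPage(page_number).extractText()
    return page_count, page_text


def summarise_pdf(path, page_number=0):
    """Open the PDF at `path` and report its page count and one page's text."""
    import PyPDF2  # imported here so page_summary() can be tried without PyPDF2

    # The `with` block closes the file once the text has been read.
    with open(path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfFileReader(pdf_file)
        return page_summary(pdf_reader, page_number)
```

Calling summarise_pdf('electronic_revolution.pdf') should give the same two values as the interpreter session above: the page count and the text of page 0.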

This returns the total page count and the text of the first page (page 0). What I want is the text of the whole document. getPage() only takes a single page number, though, and writing 0-40 is just a subtraction (it evaluates to -40), so instead I loop over the page numbers 0 to 39 and join the results:

   >>> text = ''.join(pdfReader.getPage(i).extractText() for i in range(40))

For now I've hard-coded the number of pages, but next I'd like to see if I can calculate it programmatically.
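
The page count doesn't have to be hard-coded, since the reader already reports it in numPages. Below is a minimal sketch of how the page numbers could be generated with range() instead. The function names extract_all_text and extract_pdf_text are my own (hypothetical, not part of PyPDF2), and the sketch again assumes the older PyPDF2 interface used in the session above.

```python
def extract_all_text(pdf_reader):
    """Return the text of every page, in order, joined by newlines.

    extract_all_text is a hypothetical helper, not a PyPDF2 function.
    """
    page_texts = []
    # range(numPages) yields 0, 1, ..., numPages - 1: every valid page number.
    for page_number in range(pdf_reader.numPages):
        page = pdf_reader.getPage(page_number)
        page_texts.append(page.extractText())
    return "\n".join(page_texts)


def extract_pdf_text(path):
    """Open the PDF at `path` and return the text of the whole document."""
    import PyPDF2  # imported here so extract_all_text() can be tried without PyPDF2

    # Keep the file open while reading; pages are loaded from it on demand.
    with open(path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfFileReader(pdf_file)
        return extract_all_text(pdf_reader)
```

Joining the pages with a newline is a choice of mine, so that the end of one page doesn't run straight into the start of the next; an empty string would work just as well if the page breaks don't matter.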