User:Simon/Trim4/Extracting text from PDF
Revision as of 12:04, 28 September 2019
In Al Sweigart's Automate the Boring Stuff with Python, there's a nice section on a Python library called PyPDF2 that allows you to work with the contents of PDFs. To begin with, I thought I'd try extracting text from a PDF of William S. Burroughs's The Electronic Revolution. I chose this PDF because the only version I've found online is a 40-page document published by ubuclassics (which I suppose is the publishing house for ubuweb.com). There was no identifier other than this (no ISBN etc.), and I couldn't locate any other version online. What's more, the PDF had very small text, which was uncomfortable to read when I ran the booklet.sh script on it.
I thought it would be worthwhile laying out this book again for print reading purposes, and the first step is to get the text from the PDF. Pandoc is usually my go-to for extracting text, but it doesn't accept PDFs as input, so I tried PyPDF2.
I began by copying a file called electronic_revolution.pdf to a folder, then used cd in the terminal to move into that directory. Then I started the interactive Python interpreter with this command:
$ python3
Next I wrote the following commands in Python 3:
>>> import PyPDF2
>>> pdfFileObj = open('electronic_revolution.pdf', 'rb')
>>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
>>> pdfReader.numPages
40
>>> pageObj = pdfReader.getPage(0)
>>> pageObj.extractText()
'ubuclassics2005WILLIAM S. BURROUGHSTheElectronic\nRevolution'
This returns the value for total page count, and the text for the first page (0). What I want to get is the text for the whole document.
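A minimal sketch of one way to get the whole document, assuming the same PyPDF2 1.x calls used in the session above (PdfFileReader, numPages, getPage, extractText; later PyPDF2 releases renamed these). The function names extract_all_text and extract_pdf_to_text, and the output file name in the usage note, are made up for illustration.

```python
def extract_all_text(pdf_reader):
    """Return the text of every page, joined with newlines.

    Assumes pdf_reader behaves like a PyPDF2 1.x PdfFileReader:
    it has a numPages attribute and a getPage(index) method whose
    result has an extractText() method.
    """
    pages = []
    for page_number in range(pdf_reader.numPages):
        page = pdf_reader.getPage(page_number)
        pages.append(page.extractText())
    return "\n".join(pages)


def extract_pdf_to_text(pdf_path, txt_path):
    """Read pdf_path with PyPDF2 and write its text to txt_path."""
    # Imported here so extract_all_text can be used without PyPDF2 installed.
    import PyPDF2

    # Keep the PDF open while reading: pages are loaded lazily from the file.
    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        text = extract_all_text(reader)

    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)
    return text
```

Calling extract_pdf_to_text('electronic_revolution.pdf', 'electronic_revolution.txt') from the same directory would then write all 40 pages to one text file. Judging by the first page, where the words run together, the extracted text will probably still need cleaning up by hand.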