User talk:Simon/Cleaning up text

Hidden characters (e.g. tabs, spaces, carriage and ‘soft’ returns)

Extracting text from a PDF

In Al Sweigart's Automate the Boring Stuff with Python, there's a nice section on a Python library called PyPDF2 that allows you to work with the contents of PDFs. To begin with, I thought I'd try extracting text from a PDF of William S. Burroughs's The Electronic Revolution. I chose this PDF because the only version I've found online is a 40-page document published by ubuclassics (which I suppose is the publishing house for ubuweb.com). There was no identifier other than this (no ISBN etc.), and I couldn't locate any other version online. What's more, the PDF had very small text, which was uncomfortable to read when I ran the booklet.sh script on it.

I thought it would be worthwhile to lay this book out again for reading in print, and the first step is to get the text out of the PDF. Pandoc is usually my go-to for extracting text, but it can't read PDFs as input, so I tried PyPDF2.

28.09.19

I began by copying a file called electronic_revolution.pdf to a folder and, in the terminal, changing into that directory with cd. Then I started the interactive Python interpreter with this command:

$ python3

Next I wrote the following commands in Python 3 (comments above each line):

# First, import the PyPDF2 module
>>> import PyPDF2
# Then open electronic_revolution.pdf in read binary mode and store it in pdfFileObj
>>> pdfFileObj = open('electronic_revolution.pdf', 'rb')
# To get a PdfFileReader object that represents this PDF, call PyPDF2.PdfFileReader() and pass it pdfFileObj. Store this PdfFileReader object in pdfReader
>>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# The total number of pages in the document is stored in the numPages attribute of a PdfFileReader object
>>> pdfReader.numPages
40
# The PDF has 40 pages. To extract text from a page, you need to get a Page object, which represents a single page of a PDF, from a PdfFileReader object.
# You can get a Page object by calling the getPage() method on a PdfFileReader object and passing it the page number of the page you’re interested in — in our case, 0
>>> pageObj = pdfReader.getPage(0)
# Once you have your Page object, call its extractText() method to return a string of the page’s text
>>> pageObj.extractText()
'ubuclassics2005WILLIAM S. BURROUGHSTheElectronic\nRevolution'

This returns the total page count and the text of the first page (page 0). What I want is the text of the whole document, and I have no idea how to get it! This attempt doesn't work at all:

>>> pageObj = pdfReader.getPage(0-40)
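
A likely reason, sketched under the assumption that the goal was "every page from the first to the last": Python evaluates 0-40 as a subtraction, so the call receives the single number -40 rather than a span of pages. The built-in range() produces the page numbers one at a time instead. The name number_of_pages is only illustrative; it stands in for pdfReader.numPages.

```python
# Python treats 0-40 as arithmetic, not as "pages 0 to 40"
page_argument = 0 - 40
print(page_argument)

# Illustrative stand-in for pdfReader.numPages (40 for this PDF)
number_of_pages = 40

# range() yields each page number once, from 0 up to number_of_pages - 1
page_numbers = list(range(number_of_pages))
print(page_numbers[0], page_numbers[-1], len(page_numbers))
```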

02.10.19

With Rita & Pedro's help I managed to write a Python script that includes a for loop to extract text from the entire PDF:

    # Import the PyPDF2 module
    import PyPDF2
    
    # Ask for the name of the PDF to read
    filename = input("name of the file: ")
    
    # Open the PDF in binary read mode and the output text file for writing
    with open(filename, 'rb') as pdf_file, open('input.txt', 'w') as text_file:
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        number_of_pages = read_pdf.getNumPages()
        # Extract each page's text in turn and write it to input.txt
        for page_number in range(number_of_pages):   # use xrange in Py2
            page = read_pdf.getPage(page_number)
            page_content = page.extractText()
            text_file.write(page_content)

The next step is to begin cleaning up the text by removing the empty lines. We wrote a simple shell script for this that runs the Python script, then a grep command that keeps only the lines that are not empty:

$ python3 extract_text.py
$ grep -v "^$" input.txt > output.txt
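
For comparison, the same clean-up could be sketched in Python, so that it could live inside the extraction script. This is only an illustration: the function name drop_blank_lines is made up here, and like the grep command it removes lines that are completely empty while keeping lines that contain only spaces or tabs.

```python
def drop_blank_lines(text):
    """Remove completely empty lines, roughly like grep -v "^$"."""
    lines = text.split("\n")
    # Keep every line that has at least one character in it
    kept = [line for line in lines if line != ""]
    # End with a single newline, unless nothing is left at all
    return "\n".join(kept) + ("\n" if kept else "")


# Made-up sample text with empty lines between the words
sample = "ubuclassics\n\n2005\n\nWILLIAM S. BURROUGHS\n"
# repr() makes the remaining line breaks visible as \n
print(repr(drop_blank_lines(sample)))
```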

The only problem is that every text file produced is named "input.txt". If you run this on more than one PDF, you have to move input.txt to another directory or rename it manually before running the script again. Alternatively, I could write another Python script that renames the file and call it at the end of the shell script.
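
One way around this, as a rough sketch rather than a tested fix: derive the text file's name from the PDF's name, so each PDF gets its own output file. The helper name text_name_for is hypothetical; pathlib is part of Python's standard library.

```python
from pathlib import Path


def text_name_for(pdf_name):
    """Return the PDF's file name with its extension swapped for .txt."""
    return str(Path(pdf_name).with_suffix(".txt"))


print(text_name_for("electronic_revolution.pdf"))
```

In the extraction script, open('input.txt', 'w') would then become open(text_name_for(filename), 'w'), and the grep line in the shell script would need to read from that same file name.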