User:Pedro Sá Couto/Prototyping 5th/Text Launderette Scripts

From XPUB & Lens-Based wiki
=Scripts=
====From the git====
https://git.xpub.nl/pedrosaclout/Text_Launderette_Scripts

https://git.xpub.nl/pedrosaclout/DIY_Book_Scanner_Workflow


==Merge PDF==
This shell script uses pdftk to merge all of the OCRed PDFs into a single file.
<source lang="shell">
#!/bin/bash
# Uncomment the next line to run the script from its own directory:
# cd "$(dirname "$0")"

cd ocred
pwd
pdftk *.pdf cat output final.pdf
</source>


==Crop Bounding Box==
While capturing the pages of the book, a bounding box is created around each page. This script iterates through a folder of images and crops each one to that box.
<source lang="python">
import cv2
import time
import logging

d = 1

while True:
    try:
        threshold = 25
        time.sleep(1)

        input = ('input%d.jpg'%d)
        page = ('page%d.jpg'%d)

        print("Value of d is:",d,"\n","Page name:",input)
        img = cv2.imread(input, 0) # load grayscale version

        # the indices where the useful region starts and ends
        hStart = 0
        hEnd = img.shape[0]
        vStart = 0
        vEnd = img.shape[1]

        # get the maximum pixel value of each row and each column
        hMax = img.max(1)
        vMax = img.max(0)

        hDone_flag = False
        vDone_flag = False

        # walk through the maxima and start where the pixel value is greater
        # than the threshold
        for i in range(hMax.size):
            if not hDone_flag:
                if hMax[i] > threshold:
                    hStart = i
                    hDone_flag = True

            if hDone_flag:
                if hMax[i] < threshold:
                    hEnd = i
                    break

        for i in range(vMax.size):
            if not vDone_flag:
                if vMax[i] > threshold:
                    vStart = i
                    vDone_flag = True

            if vDone_flag:
                if vMax[i] < threshold:
                    vEnd = i
                    break

        # load the colour image and keep only the useful area
        img2 = (cv2.imread(input))[hStart:hEnd, vStart:vEnd,:]

        # write the cropped image
        cv2.imwrite(page, img2)

        d+=1
        print("Value of d is:", d)

    except:
        logging.exception("message")
        print("All pages must be ready!")
        break
</source>


==OCR==
OCR all the JPEGs in one batch, turning each one into a searchable PDF.
<source lang="python">
# import libraries
from PIL import Image
import pytesseract
import time

i = 1

while True:
    try:
        img = Image.open("split/page%i.jpg"%i)
        print(img)
        pdf = pytesseract.image_to_pdf_or_hocr(img, lang="eng", extension='pdf')
        time.sleep(1)
        file = open(("ocred/page%i.pdf"%i), "w+b")
        file.write(bytearray(pdf))
        file.close()
        i+=1
        print(i)

    except:
        print("All pages must be ready!")
        break
</source>


==Rotate JPGS==
The book scanner takes the pictures of the book pages in landscape format, so they have to be rotated back. This script iterates through the pages, handling even and odd pages differently.
<source lang="python">
from PIL import Image
import time

i = 1

while True:
    try:
        page = Image.open("split/input%i.jpg"%i)

        if i % 2 == 0:
            # check where the loop is
            print("trying even")

            # rotate image by 90 degrees
            angle = 90
            out = page.rotate(angle, expand=True)
            out.save('rotated/input%i.jpg'%i)
            print('This is an even page number')

            time.sleep(2)
            print("variable i: ", i)

        else:
            # check where the loop is
            print("trying odd")

            # rotate image by 270 degrees
            angle = 270
            out = page.rotate(angle, expand=True)
            out.save('rotated/input%i.jpg'%i)
            print('This is an odd page number')

            time.sleep(1)
            print("variable i: ", i)

        i+=1

    except:
        print("All pages must be ready!")
        break
</source>


==Burst PDF==
Burst a PDF into separate JPEGs.
<source lang="python">
# Based on the code at https://iq.opengenus.org/pdf_to_image_in_python/

import pdf2image
from PIL import Image
import time

# DECLARE CONSTANTS
PDF_PATH = input("What pdf do you want to use? (include the extension, e.g. example.pdf): ")
DPI = 200
FIRST_PAGE = None
LAST_PAGE = None
FORMAT = 'jpg'
THREAD_COUNT = 1
USERPWD = None
USE_CROPBOX = False
STRICT = False

def pdftopil():
    # This function reads a pdf and converts it into a sequence of images
    # PDF_PATH sets the path to the PDF file
    # dpi adjusts the resolution of the images
    # first_page sets the first page to be processed by pdftoppm
    # last_page sets the last page to be processed by pdftoppm
    # fmt sets the output format of the pdftoppm conversion
    # thread_count sets how many threads will be used for the conversion
    # userpw sets a password to unlock the converted PDF
    # use_cropbox uses the crop box instead of the media box when converting
    # strict catches pdftoppm syntax errors with a custom PDFSyntaxError

    start_time = time.time()
    pil_images = pdf2image.convert_from_path(PDF_PATH, dpi=DPI, first_page=FIRST_PAGE, last_page=LAST_PAGE, fmt=FORMAT, thread_count=THREAD_COUNT, userpw=USERPWD, use_cropbox=USE_CROPBOX, strict=STRICT)
    print("Time taken : " + str(time.time() - start_time))
    return pil_images

def save_images(pil_images):
    d = 1
    for image in pil_images:
        image.save(("split/input%d"%d) + ".jpg")
        d += 1

if __name__ == "__main__":
    pil_images = pdftopil()
    save_images(pil_images)
</source>



=DIY Book Scanner Workflow=

==Getting started==

This set of scripts was written for the Text Launderette workshop, which takes place in the Publication Station, WdKA building, Rotterdam, 03-02-2020. It is a workflow for turning the pictures from the DIY Book Scanner into a final OCRed PDF.

==Dependencies==

===Brew (macOS) or apt-get (Linux)===

You’ll need the command-line tools for Xcode installed.

<source lang="shell">
xcode-select --install
</source>

After that, install Homebrew.

<source lang="shell">
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
</source>

Run the following command once you’re done to ensure Homebrew is installed and working properly:

<source lang="shell">
brew doctor
</source>

On Linux, install the system dependencies with apt-get (the Poppler tools, including pdfunite, are packaged as poppler-utils):

<source lang="shell">
sudo apt-get install python3 python3-pip imagemagick poppler-utils
</source>

On macOS, install them with Homebrew (pip ships with python3, and pdfunite is part of the poppler formula):

<source lang="shell">
brew install python3 imagemagick poppler
</source>

===PIP3===

The modules time and logging ship with Python’s standard library, so only the third-party packages need to be installed:

<source lang="shell">
sudo pip3 install pdf2image Pillow opencv-python pytesseract
</source>
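One thing the package lists above do not cover: pytesseract is only a wrapper, so the Tesseract OCR engine itself must also be on the system. A quick way to install it and to check that every Python module used by the scripts imports cleanly (the package names are the usual Debian and Homebrew ones; adjust for your setup):

<source lang="shell">
# Tesseract OCR engine, required by pytesseract
sudo apt-get install tesseract-ocr   # Debian/Ubuntu
brew install tesseract               # macOS

# sanity check for the Python imports used by the scripts
python3 -c "import cv2, pytesseract, pdf2image, PIL; print('dependencies OK')"
</source>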


==How to use==

Add your pictures from the book scanner to the folder "scans".

Make all the files executable.

<source lang="shell">
sudo chmod 777 merge_scans.sh workshop_stream.sh merge_files.sh
</source>

If you want to skip any of the steps, comment out the corresponding line in <em>workshop_stream.sh</em>.
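The stream script itself is not reproduced on this page. As a rough idea only, and assuming it simply chains the steps listed under Additional information below, it could look something like this (the actual script in the git repository may differ):

<source lang="shell">
#!/bin/bash
# hypothetical sketch of workshop_stream.sh: run every step of the workflow in order
mkdir -p split rotated ocred bounding_box cropped
./merge_scans.sh          # append all scans into one PDF
python3 burstpdf.py       # burst that PDF into numbered JPEGs
python3 rotation.py       # rotate odd/even pages back upright
python3 bounding_box.py   # crop each page to its highest-contrast area
python3 mirror_crop.py    # remove the visible mirror
python3 tesseract_ocr.py  # OCR each JPEG into a searchable PDF
./merge_files.sh          # join the OCRed pages into the final PDF
</source>

Commenting out any of these lines skips that step.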

Run ./workshop_stream.sh

Wait :)

==Additional information==

The workflow runs the following scripts, in this order:

===Create 5 directories===

<source lang="shell">
mkdir split
mkdir rotated
mkdir ocred
mkdir bounding_box
mkdir cropped
</source>

===Merge the files in the directory ''scans''===

All the scans will be appended to one PDF called out.pdf.

<source lang="shell">
./merge_scans.sh
</source>
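merge_scans.sh is likewise not reproduced here. Since ImageMagick is listed as a dependency, a minimal sketch of what such a script could look like, assuming the scans are JPEGs sitting in scans/ (the real script in the repository may differ):

<source lang="shell">
#!/bin/bash
# hypothetical sketch: append every scan to a single PDF called out.pdf
# (recent ImageMagick installs may need the PDF rule in policy.xml relaxed)
convert scans/*.jpg scans/out.pdf
</source>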

===Burst the pdf in ''scans''===

Burst this PDF back into single images, renaming the files so they can be iterated over later.

<source lang="shell">
python3 burstpdf.py
</source>

===Rotate the pdfs===

The book scanner captures the pages sideways; this script iterates through the odd and even pages, rotating them back to their original orientation.

<source lang="shell">
python3 rotation.py
</source>

===Cropping the bounding boxes===

The pages are now upright, but each one still has a bounding box around it. This script iterates through them and crops to the highest-contrast area found.

<source lang="shell">
python3 bounding_box.py
</source>

===Cropping the mirror===

The pages are now cropped, but the mirror of the book scanner is still visible in the middle of each image.

<source lang="shell">
python3 mirror_crop.py
</source>
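mirror_crop.py itself is not reproduced on this page. Purely as an illustration of the idea, and assuming the mirror occupies a fixed region that you measure once on a sample page, a comparable crop could be done with ImageMagick (the repository’s Python script may work quite differently):

<source lang="shell">
#!/bin/bash
# hypothetical illustration: keep only the page area of each cropped image, cutting away
# the region where the mirror shows; the 1600x2200+0+0 geometry is a placeholder to be
# measured on one sample page first
for f in page*.jpg; do
    convert "$f" -crop 1600x2200+0+0 +repage "$f"
done
</source>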

===OCR===

In this step we OCR the JPGs, turning them into searchable PDFs.

<source lang="shell">
python3 tesseract_ocr.py
</source>

===Merge all the files and create the pdf===

The OCRed pages are now joined into their final PDF; your book is ready :)

<source lang="shell">
./merge_files.sh
</source>

==License==

The package is available as open source under the terms of the [https://opensource.org/licenses/MIT MIT License].

=About Text Launderette=

;TITLE
:XPUB workshops – Text Launderette
;STATION
:Publication Station
;LOCATION
:BL.00.4
;TUTORS
:Simon Browne & Pedro Sá Couto
;DESCRIPTION
:We will use a home-made, DIY book scanner, and open-source software to scan, process, and add digital features to printed texts brought by the participants to the workshop. Ultimately, we will include them in the “bootleg library”, a shadow library accessible over a local network.
:Shadow libraries operate outside of legal copyright frameworks, in response to decreased open access to knowledge. This workshop aims to extend our research on libraries, their sociability, and methods by which we can add provenance to texts included in public or private, legal or extra-legal collections.
:Participants should bring a printed text which they’d like to digitize and share.
;PRACTICAL INFORMATION
:Under the name of .py.rate.chnic sessions, the second-year students from the Experimental Publishing Master program invite you to participate in a series of hands-on workshops, related to the topics of their graduation projects. Each workshop offers the participants an opportunity to engage with the students’ research by partaking in their processes, experiments, and discussions.
;MINIMAL ENROLMENT
:5
;MAXIMUM ENROLMENT
:15
;NR OF SESSIONS
:1