User:Pedro Sá Couto/TW/REPUBLISHING FLOW

From XPUB & Lens-Based wiki

Revision as of 04:58, 6 June 2020

STEPS

Republishing is separated into 6 steps:

1. Moving the book from the webserver to a working directory

1.1 Replacing all spaces with underscores

2. Creating the watermark from the gathered form in Tactical Watermarks

2.1 Create the watermark as a PDF with reportlab
2.2 Convert it to a PNG

3. Append the watermark to the pdf

3.1 Burst the pdf into pages
3.2 Rotate the watermark with PIL
3.3 Overlay the watermark with PIL
3.4 Merge all images into a PDF

4. OCR the pdf if not OCRed already
5. Save the file in a directory open to Library Genesis Staff
6. Delete all the unwanted traces
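
Step 1.1 can be illustrated with a small sketch. It assumes the files sit directly inside one directory; the function name `underscore_names` is hypothetical and not part of the original scripts.

```python
import tempfile
from pathlib import Path


def underscore_names(directory):
    """Rename every entry in `directory` so that spaces become underscores.

    Returns the new names of the entries that were actually renamed.
    """
    renamed = []
    # sorted() materialises the listing before any rename happens
    for path in sorted(Path(directory).iterdir()):
        new_name = path.name.replace(" ", "_")
        if new_name != path.name:
            path.rename(path.with_name(new_name))
            renamed.append(new_name)
    return renamed


if __name__ == "__main__":
    # Demonstration on a throwaway directory (file name is made up)
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "my scanned book.pdf").touch()
        print(underscore_names(tmp))
```

Entries whose names contain no spaces are left alone, so the function can be run repeatedly on the same directory.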

RUN.SH

To activate the stream I use ./run.sh

sudo chmod 777 *
./movebookfolder.sh
./watermarkformtxt.sh
./appendwatermarktopdf.sh
./republish.sh
./deletetraces.sh
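
As listed, run.sh starts every script in turn, whether or not the previous one succeeded. The sketch below shows one possible way to stop at the first failing step. It is an illustration only: `run_steps` is a hypothetical helper, and the demo uses stand-in commands rather than the real scripts.

```python
import subprocess
import sys


def run_steps(commands):
    """Run each command in order.

    Returns the index of the first command that exits with a non-zero
    status, or None when every command succeeds.
    """
    for index, command in enumerate(commands):
        result = subprocess.run(command)
        if result.returncode != 0:
            return index
    return None


if __name__ == "__main__":
    # Stand-in commands; in practice these might be ["./movebookfolder.sh"], etc.
    steps = [
        [sys.executable, "-c", "print('step 1 ok')"],
        [sys.executable, "-c", "raise SystemExit(1)"],
        [sys.executable, "-c", "print('never reached')"],
    ]
    failed = run_steps(steps)
    print("first failing step:", failed)
```

Stopping early matters most for the last step: deleting traces after a failed republish would remove the files needed to retry.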


FLOW

JSTOR.SH

To activate the stream I use ./jstor.sh

# Copy the most recently modified paper into the working directory
cp "$(ls -td -- /Users/PSC/Desktop/JSTOR/jstorpaper/* | head -n 1)" /Users/PSC/Desktop/JSTOR/overlay
cd /Users/PSC/Desktop/JSTOR/overlay
# Replace spaces in file names with underscores
for name in *; do mv "$name" "${name// /_}"; done
mv /Users/PSC/Desktop/JSTOR/overlay/*.pdf target.pdf
mkdir -p split
# Burst the PDF into PNGs, then overlay the cover and the remaining pages
python3 burstpdf.py
python3 overlaylogo_cover.py
python3 overlaylogo_page.py
rm target.pdf
# Merge the page images back into a single PDF
convert "split/*.{png,jpeg,pdf}" -quality 100 name.pdf
# Overwrite the original paper with the merged PDF
var1="$(ls -td -- /Users/PSC/Desktop/JSTOR/jstorpaper/*.pdf | head -n 1)"
mv name.pdf "$var1"
rm -r split
# Move the paper to the ready directory and OCR it in place
mv "$(ls -td -- /Users/PSC/Desktop/JSTOR/jstorpaper/* | head -n 1)" /Users/PSC/Desktop/JSTOR/ready
cd /Users/PSC/Desktop/JSTOR/ready
ocrmypdf "$(ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1)" "$(ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1)"
mv "$(ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1)" /Users/PSC/Desktop/JSTOR/ready/ocred
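
The script repeatedly uses `ls -td -- DIR/* | head -n 1` to pick the most recently modified entry in a directory. A rough Python equivalent is sketched below. `newest_entry` is a hypothetical name, and unlike the shell glob, `Path.glob("*")` may also match hidden files.

```python
import os
import tempfile
from pathlib import Path


def newest_entry(directory, pattern="*"):
    """Return the most recently modified path matching `pattern`, or None."""
    candidates = list(Path(directory).glob(pattern))
    if not candidates:
        return None
    # st_mtime is the modification time that `ls -t` sorts by
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        older = Path(tmp) / "older.pdf"
        newer = Path(tmp) / "newer.pdf"
        older.touch()
        newer.touch()
        # Set explicit modification times so the result does not depend on timing
        os.utime(older, (1000, 1000))
        os.utime(newer, (2000, 2000))
        print(newest_entry(tmp, "*.pdf").name)
```

Returning None for an empty directory makes the "nothing to process" case explicit, whereas the shell version passes an empty or unexpanded argument on to the next command.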


1. Bursting the PDF into PNGs

# Based on the code at https://iq.opengenus.org/pdf_to_image_in_python/

import pdf2image
from PIL import Image
import time

#DECLARE CONSTANTS
PDF_PATH = "target.pdf"
DPI = 200
FIRST_PAGE = None
LAST_PAGE = None
FORMAT = 'png'
THREAD_COUNT = 1
USERPWD = None
USE_CROPBOX = False
STRICT = False

def pdftopil():
    # Read the PDF and convert it into a list of PIL images, one per page.
    # PDF_PATH: path to the PDF file
    # dpi: resolution of the rendered images
    # first_page / last_page: range of pages passed to pdftoppm (None means all pages)
    # fmt: output image format used by pdftoppm
    # thread_count: number of threads used for the conversion
    # userpw: password used to open a protected PDF
    # use_cropbox: use the crop box instead of the media box when converting
    # strict: raise PDFSyntaxError when pdftoppm reports a syntax error

    start_time = time.time()
    pil_images = pdf2image.convert_from_path(PDF_PATH, dpi=DPI, first_page=FIRST_PAGE, last_page=LAST_PAGE, fmt=FORMAT, thread_count=THREAD_COUNT, userpw=USERPWD, use_cropbox=USE_CROPBOX, strict=STRICT)
    print("Time taken: " + str(time.time() - start_time))
    return pil_images

def save_images(pil_images):
    # Save each page as split/page<N>.png, numbering from 1
    for d, image in enumerate(pil_images, start=1):
        image.save("split/page%d.png" % d)

if __name__ == "__main__":
    pil_images = pdftopil()
    save_images(pil_images)
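
The burst script names its output page1.png, page2.png, and so on, without zero padding. If a later step receives these names in plain lexicographic order, page10.png comes before page2.png. The sketch below shows a natural sort key that restores page order; `natural_key` is a hypothetical helper, not part of the original scripts.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so numbers compare numerically."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]


pages = ["page10.png", "page2.png", "page1.png", "page11.png"]

# Lexicographic order: page1, page10, page11, page2
print(sorted(pages))

# Natural order: page1, page2, page10, page11
print(sorted(pages, key=natural_key))
```

An alternative with the same effect would be to zero-pad the page number when saving, for example `page%03d.png`, so that lexicographic and numeric order coincide.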


2. Overlaying the cover

from PIL import Image

background = Image.open("split/page1.png")

# Rescale the cover overlay to the page width, keeping its aspect ratio
basewidth = (background.size[0])
finalcover = Image.open("cover.png")
wpercent = (basewidth/float(finalcover.size[0]))
hsize = int((float(finalcover.size[1])*float(wpercent)))
# Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the same filter
finalcover = finalcover.resize((basewidth,hsize), Image.LANCZOS)
finalcover.save("cover_rescale.png")

foreground = Image.open("cover_rescale.png")

background.paste(foreground, (0, -180), foreground.convert('RGBA'))
background.save("split/page1.png")
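
The rescaling arithmetic in the script can be isolated as a small function. This is a sketch for illustration; `scaled_size` is a hypothetical name and the pixel sizes in the example are made up.

```python
def scaled_size(base_width, overlay_width, overlay_height):
    """Return (width, height) of an overlay scaled to `base_width`.

    The height is scaled by the same ratio as the width, so the aspect
    ratio is preserved (up to truncation to whole pixels).
    """
    ratio = base_width / float(overlay_width)
    return base_width, int(overlay_height * ratio)


# An 850 x 1100 overlay scaled to a 1700 pixel wide page
print(scaled_size(1700, 850, 1100))  # (1700, 2200)
```

The resulting tuple is what the script passes to `resize`, so the overlay always spans the full width of the page regardless of the DPI chosen when bursting.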


3. Overlaying the pages

This happens through ./jstor.sh

from PIL import Image

base = Image.open("split/page2.png")

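# Rescale the page overlay to the page width, keeping its aspect ratio
basewidth = (base.size[0])
finalpage = Image.open("pages.png")
wpercent = (basewidth/float(finalpage.size[0]))
hsize = int((float(finalpage.size[1])*float(wpercent)))
# Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the same filter
finalpage = finalpage.resize((basewidth,hsize), Image.LANCZOS)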
finalpage.save("page_rescale.png")

foreground = Image.open("page_rescale.png")

i = 2

while True:
    try:
        background = Image.open("split/page%i.png"%i)

        background.paste(foreground, (0, -140), foreground.convert('RGBA'))
        background.save("split/page%i.png"%i)

        i+=1

    except FileNotFoundError:
        # No page with this number exists, so every page has been processed
        print("DID MY JOB!")
        break
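
The loop above stops at the first page number for which no file exists. The same stopping rule can be written without relying on an exception, as in the sketch below. `consecutive_pages` is a hypothetical helper that only collects the paths and leaves the overlay itself out.

```python
import tempfile
from pathlib import Path


def consecutive_pages(split_dir, start=2):
    """Return page<start>.png, page<start+1>.png, ... up to the first missing number."""
    found = []
    number = start
    while (Path(split_dir) / f"page{number}.png").exists():
        found.append(Path(split_dir) / f"page{number}.png")
        number += 1
    return found


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # page4.png is deliberately missing, so page5.png is never reached
        for n in (1, 2, 3, 5):
            (Path(tmp) / f"page{n}.png").touch()
        print([p.name for p in consecutive_pages(tmp)])
```

Collecting the list first also makes it easy to report how many pages will be overlaid before any file is modified.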


4. OCR again

This happens through ./jstor.sh

ocrmypdf `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1` `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1`
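
The command above passes the same path as input and output, so the PDF is OCRed in place. The sketch below builds an equivalent command list in Python. `ocr_command` is a hypothetical helper, and the `--skip-text` flag is not in the original script: it is an OCRmyPDF option that skips pages which already contain text, which seems to match the "if not OCRed already" wording of step 4.

```python
def ocr_command(pdf_path, skip_existing_text=True):
    """Build an ocrmypdf command that OCRs `pdf_path` in place."""
    command = ["ocrmypdf"]
    if skip_existing_text:
        # Assumption: skip pages that already have a text layer
        command.append("--skip-text")
    # The same path is used as input and output, as in jstor.sh
    command += [pdf_path, pdf_path]
    return command


print(ocr_command("paper.pdf"))
```

The returned list can be handed to `subprocess.run(command, check=True)`; passing a list rather than a single string avoids quoting problems when a file name contains spaces.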



RESULTS IN EACH STEP

0. Starting with a Paper from JSTOR
File:42938075.pdf

1. Bursting the PDF into PNGs

The PDF is separated into pages

2. Overlaying the cover

The cover is overlaid and de-watermarked

3. Overlaying the pages

The pages are overlaid and de-watermarked

4. OCR again

You have a de-watermarked, searchable PDF

File:42938075 dewater.pdf