Revision as of 05:01, 6 June 2020

STEPS

Republishing is divided into six steps:

1. Moving the book from the webserver to a work place

1.1 Replacing all spaces with underscores

2. Creating the watermark from the gathered form in Tactical Watermarks

2.1 Create the watermark in pdf with reportlab

2.2 Convert to a png

3. Append the watermark to the pdf

3.1 Burst the pdf into pages

3.2 Rotate the watermark with PIL

3.3 Overlay the watermark with PIL

3.4 Merge all images into a PDF

4. OCR the pdf if not OCRed already

5. Save the file in a directory open to Library Genesis Staff

6. Delete all the unwanted traces

FLOW

RUN.SH

To start the flow, I run ./run.sh:

sudo chmod 777 *
./movebookfolder.sh
./watermarkformtxt.sh
./appendwatermarktopdf.sh
./republish.sh
./deletetraces.sh
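
For reference, the same sequence can be sketched in Python. This is only an illustration, not the script used on the server: the script names come from run.sh above, while `run_flow` and its `runner` argument are hypothetical names added here so the order can be checked without executing anything. Unlike the shell version, this sketch stops at the first script that fails.

```python
import subprocess

# Script names copied from run.sh; everything else is an assumption.
STEPS = (
    "./movebookfolder.sh",
    "./watermarkformtxt.sh",
    "./appendwatermarktopdf.sh",
    "./republish.sh",
    "./deletetraces.sh",
)

def run_flow(steps=STEPS, runner=subprocess.run):
    """Run each script in order; check=True raises on the first failure."""
    for step in steps:
        runner([step], check=True)
```

Stopping early means the traces are not deleted when an earlier step breaks, which leaves the intermediate files around for inspection.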

1. Moving the book from the webserver to a work place
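
A minimal sketch of this step, assuming the upload is a single file and the work place is a local folder. The function name `move_book` and both paths are hypothetical; only the move and the space-to-underscore rename come from the steps listed above.

```python
import shutil
from pathlib import Path

def move_book(source, work_dir):
    """Move `source` into `work_dir`, replacing spaces in its name with underscores."""
    source = Path(source)
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    target = work_dir / source.name.replace(" ", "_")
    shutil.move(str(source), str(target))
    return target
```

Removing the spaces first keeps the later shell steps from having to quote file names.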

2. Creating the watermark from the gathered form in Tactical Watermarks
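
A rough sketch of how the form text could become a watermark, first as a PDF with reportlab and then as a PNG. The page size, font, margins and all function names are assumptions, and pdf2image is used for the PNG conversion only because an earlier revision of this page relied on it. The two third-party imports sit inside the functions so that the layout helper can be tried without them installed.

```python
def layout_lines(lines, page_height, margin=72, leading=14):
    """Return (x, y, text) positions from the top margin down, skipping blank lines."""
    y = page_height - margin
    placed = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        placed.append((margin, y, text))
        y -= leading
    return placed

def write_watermark_pdf(lines, pdf_path):
    """Draw the form lines onto a single A4 page with reportlab."""
    from reportlab.lib.pagesizes import A4
    from reportlab.pdfgen import canvas

    _, height = A4
    page = canvas.Canvas(pdf_path, pagesize=A4)
    page.setFont("Helvetica", 10)
    for x, y, text in layout_lines(lines, height):
        page.drawString(x, y, text)
    page.showPage()
    page.save()

def pdf_to_png(pdf_path, png_path, dpi=200):
    """Render the first page of the watermark PDF to a PNG file."""
    from pdf2image import convert_from_path

    convert_from_path(pdf_path, dpi=dpi)[0].save(png_path)
```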

3. Append the watermark to the pdf
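
A possible shape for the rotate, overlay and merge sub-steps with PIL, assuming the PDF has already been burst into one image per page. The 45 degree angle, the vertical centring and the names `fit_to_width` and `stamp_pages` are illustrative choices, not the values used in the actual scripts.

```python
def fit_to_width(size, target_width):
    """Scale (width, height) to `target_width`, keeping the aspect ratio."""
    width, height = size
    scale = target_width / width
    return target_width, max(1, int(height * scale))

def stamp_pages(page_paths, watermark_path, output_pdf, angle=45):
    """Overlay a rotated watermark on each page image and merge the pages into a PDF."""
    from PIL import Image  # Pillow, imported here so fit_to_width works without it

    if not page_paths:
        raise ValueError("no page images to stamp")
    # expand=True grows the canvas so the rotated watermark is not cropped.
    mark = Image.open(watermark_path).convert("RGBA").rotate(angle, expand=True)
    stamped = []
    for path in page_paths:
        page = Image.open(path).convert("RGB")
        scaled = mark.resize(fit_to_width(mark.size, page.width))
        # Centre vertically; the watermark's own alpha channel is the paste mask.
        page.paste(scaled, (0, (page.height - scaled.height) // 2), scaled)
        stamped.append(page)
    stamped[0].save(output_pdf, save_all=True, append_images=stamped[1:])
```

Because the result is built from images, the merged PDF has no text layer, which is why the OCR step follows.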

4. OCR the pdf if not OCRed already
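
One way to get the "only if not OCRed already" behaviour, assuming ocrmypdf is the OCR tool (an earlier revision of this page called it directly). Its `--skip-text` option leaves pages that already contain text untouched. The function names and the `runner` argument are hypothetical.

```python
import subprocess

def build_ocr_command(pdf_path, output_path=None):
    """Build the ocrmypdf call; by default the PDF is overwritten in place."""
    output_path = output_path or pdf_path
    return ["ocrmypdf", "--skip-text", str(pdf_path), str(output_path)]

def ocr_if_needed(pdf_path, output_path=None, runner=subprocess.run):
    """Run ocrmypdf and raise if it exits with an error."""
    runner(build_ocr_command(pdf_path, output_path), check=True)
```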

5. Save the file in a directory open to Library Genesis Staff

6. Delete all the unwanted traces
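
A small sketch of the clean-up, under the assumption that the "traces" are the intermediate files left in the work place (burst pages, the watermark PDF and PNG, the original upload). The name `delete_traces` and the folder layout are hypothetical.

```python
import shutil
from pathlib import Path

def delete_traces(work_dir):
    """Remove every file and sub-folder inside `work_dir`, keeping the folder itself."""
    removed = []
    for entry in sorted(Path(work_dir).iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return removed
```

Returning the removed names makes it easy to log what was deleted after each run.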

RESULTS IN EACH STEP

0. Starting with a paper from JSTOR
File:42938075.pdf

1. Bursting the PDF into PNGs

The PDF is separated into pages

2. Overlaying the cover

The cover is overlaid and de-watermarked

3. Overlaying the pages

The pages are overlaid and de-watermarked

4. OCR again

You have a de-watermarked, searchable PDF

File:42938075 dewater.pdf

@@ Line 31: / Line 31: @@
 <br>
-==1. Bursting the PDF into png==
+==1. Moving the book from the webserver to a work place==
 <source lang="python">
-#Based in the code in https://iq.opengenus.org/pdf_to_image_in_python/
+</source>
+<br>
-import pdf2image
+==2. Creating the watermark from the gathered form in Tactical Watermarks==
-from PIL import Image
+<source lang="python">
-import time
+</source>
+<br>
-#DECLARE CONSTANTS
+==3. Append the watermark to the pdf==
-PDF_PATH = "target.pdf"
+<source lang="python">
-DPI = 200
-FIRST_PAGE = None
-LAST_PAGE = None
-FORMAT = 'png'
-THREAD_COUNT = 1
-USERPWD = None
-USE_CROPBOX = False
-STRICT = False
-def pdftopil():
-    #This method reads a pdf and converts it into a sequence of images
-    #PDF_PATH sets the path to the PDF file
-    #dpi parameter assists in adjusting the resolution of the image
-    #first_page parameter allows you to set a first page to be processed by pdftoppm
-    #last_page parameter allows you to set a last page to be processed by pdftoppm
-    #fmt parameter allows to set the format of pdftoppm conversion (PpmImageFile, TIFF)
-    #thread_count parameter allows you to set how many thread will be used for conversion.
-    #userpw parameter allows you to set a password to unlock the converted PDF
-    #use_cropbox parameter allows you to use the crop box instead of the media box when converting
-    #strict parameter allows you to catch pdftoppm syntax error with a custom type PDFSyntaxError
-    start_time = time.time()
-    pil_images = pdf2image.convert_from_path(PDF_PATH, dpi=DPI, first_page=FIRST_PAGE, last_page=LAST_PAGE, fmt=FORMAT, thread_count=THREAD_COUNT, userpw=USERPWD, use_cropbox=USE_CROPBOX, strict=STRICT)
-    print ("Time taken : " + str(time.time() - start_time))
-    return pil_images
-def save_images(pil_images):
-    d = 1
-    for image in pil_images:
-        image.save(("split/page%d"%d) + ".png")
-        d += 1
-if __name__ == "__main__":
-    pil_images = pdftopil()
-    save_images(pil_images)
 </source>
 <br>
-==2. Overlaying the cover==
+==4. OCR the pdf if not OCRed already==
 <source lang="python">
-from PIL import Image
-background = Image.open("split/page1.png")
-#rescaling the logo
-basewidth = (background.size[0])
-finalcover = Image.open("cover.png")
-wpercent = (basewidth/float(finalcover.size[0]))
-hsize = int((float(finalcover.size[1])*float(wpercent)))
-finalcover = finalcover.resize((basewidth,hsize), Image.ANTIALIAS)
-finalcover.save("cover_rescale.png")
-foreground = Image.open("cover_rescale.png")
-background.paste(foreground, (0, -180), foreground.convert('RGBA'))
-background.save("split/page1.png")
 </source>
 <br>
-==3. Overlaying the pages==
+==5. Save the file in a directory open to Library Genesis Staff==
-====This happens through ./jstor.sh====
 <source lang="python">
-from PIL import Image
-base = Image.open("split/page2.png")
-#rescaling the logo
-basewidth = (base.size[0])
-finalpage = Image.open("pages.png")
-wpercent = (basewidth/float(finalpage.size[0]))
-hsize = int((float(finalpage.size[1])*float(wpercent)))
-finalpage = finalpage.resize((basewidth,hsize), Image.ANTIALIAS)
-finalpage.save("page_rescale.png")
-foreground = Image.open("page_rescale.png")
-i = 2
-while True:
-    try:
-        background = Image.open("split/page%i.png"%i)
-        background.paste(foreground, (0, -140), foreground.convert('RGBA'))
-        background.save("split/page%i.png"%i)
-        i+=1
-    except:
-        print("DID MY JOB!")
-        break
 </source>
 <br>
-==4. OCR again==
+==6. Delete all the unwanted traces==
-====This happens through ./jstor.sh====
 <source lang="python">
-ocrmypdf `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1` `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1`
 </source>
 <br>

User:Pedro Sá Couto/TW/REPUBLISHING FLOW: Difference between revisions
