User:Pedro Sá Couto/TW/JSTOR De-watermarking
Revision as of 02:59, 6 June 2020 by Pedro Sá Couto (talk | contribs)
STREAM
The de-watermarking process has four steps:
1. Bursting the PDF into png
2. Overlaying the cover
3. Overlaying the pages
4. OCR again
JSTOR.SH
To run the stream, I use ./jstor.sh:
cp `ls -td -- /Users/PSC/Desktop/JSTOR/jstorpaper/* | head -n 1` /Users/PSC/Desktop/JSTOR/overlay
cd /Users/PSC/Desktop/JSTOR/overlay
for name in *; do mv "$name" "${name// /_}"; done
mv /Users/PSC/Desktop/JSTOR/overlay/*.pdf target.pdf
mkdir -p split
python3 burstpdf.py
python3 overlaylogo_cover.py
python3 overlaylogo_page.py
rm target.pdf
convert "split/*.{png,jpeg,pdf}" -quality 100 name.pdf
var1=`ls -td -- /Users/PSC/Desktop/JSTOR/jstorpaper/*.pdf | head -n 1`
mv name.pdf $var1
rm -r split
mv `ls -td -- /Users/PSC/Desktop/JSTOR/jstorpaper/* | head -n 1` /Users/PSC/Desktop/JSTOR/ready
cd /Users/PSC/Desktop/JSTOR/ready
ocrmypdf `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1` `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1`
mv `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1` /Users/PSC/Desktop/JSTOR/ready/ocred
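The script repeatedly picks "the newest file in a folder" with `ls -td -- … | head -n 1`. As an illustration only, the same selection could be sketched in Python as below. The helper name `newest_entry` is hypothetical and not part of the original stream; it assumes "newest" means the latest modification time, which is what `ls -t` sorts by.

```python
from pathlib import Path


def newest_entry(folder, pattern="*"):
    """Return the most recently modified entry in `folder`, or None if empty.

    Hypothetical helper mirroring `ls -td -- folder/* | head -n 1`.
    """
    # The shell glob `*` skips hidden files, so dotfiles are filtered out here too.
    entries = [p for p in Path(folder).glob(pattern) if not p.name.startswith(".")]
    if not entries:
        return None
    # `ls -t` orders by modification time, newest first.
    return max(entries, key=lambda p: p.stat().st_mtime)
```

For example, `newest_entry("/Users/PSC/Desktop/JSTOR/jstorpaper", "*.pdf")` would play the role of `var1` in the script. Unlike the unquoted backtick form, a `Path` object is not split on spaces in the file name.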
1. Bursting the PDF into png
# Based on the code at https://iq.opengenus.org/pdf_to_image_in_python/
import pdf2image
from PIL import Image
import time

# DECLARE CONSTANTS
PDF_PATH = "target.pdf"
DPI = 200
FIRST_PAGE = None
LAST_PAGE = None
FORMAT = 'png'
THREAD_COUNT = 1
USERPWD = None
USE_CROPBOX = False
STRICT = False

def pdftopil():
    # This method reads a PDF and converts it into a sequence of images
    # PDF_PATH sets the path to the PDF file
    # dpi parameter assists in adjusting the resolution of the image
    # first_page parameter allows you to set a first page to be processed by pdftoppm
    # last_page parameter allows you to set a last page to be processed by pdftoppm
    # fmt parameter allows you to set the format of the pdftoppm conversion (PpmImageFile, TIFF)
    # thread_count parameter allows you to set how many threads will be used for conversion
    # userpw parameter allows you to set a password to unlock the converted PDF
    # use_cropbox parameter allows you to use the crop box instead of the media box when converting
    # strict parameter allows you to catch pdftoppm syntax errors with a custom type PDFSyntaxError
    start_time = time.time()
    pil_images = pdf2image.convert_from_path(PDF_PATH, dpi=DPI, first_page=FIRST_PAGE, last_page=LAST_PAGE, fmt=FORMAT, thread_count=THREAD_COUNT, userpw=USERPWD, use_cropbox=USE_CROPBOX, strict=STRICT)
    print("Time taken : " + str(time.time() - start_time))
    return pil_images

def save_images(pil_images):
    # Pages are saved as split/page1.png, split/page2.png, ...
    d = 1
    for image in pil_images:
        image.save(("split/page%d" % d) + ".png")
        d += 1

if __name__ == "__main__":
    pil_images = pdftopil()
    save_images(pil_images)
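One thing to watch with the `page%d` naming: if the glob in the later `convert "split/*…"` call expands in plain lexicographic order (an assumption, though it is the usual behaviour), `page10.png` sorts before `page2.png` once a paper has ten or more pages. The sketch below shows two possible ways around this; `page_filename` and `natural_key` are hypothetical names, not part of the original scripts.

```python
import re


def page_filename(index, total, folder="split", ext="png"):
    """Build a zero-padded page name so lexicographic order equals page order."""
    width = len(str(total))  # e.g. 2 digits for a 12-page paper
    return f"{folder}/page{index:0{width}d}.{ext}"


def natural_key(name):
    """Sort key that compares the digit runs in a file name as integers."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


names = ["split/page10.png", "split/page2.png", "split/page1.png"]
lexicographic = sorted(names)           # page1, page10, page2
natural = sorted(names, key=natural_key)  # page1, page2, page10
```

If zero-padded names were adopted, the cover script would have to open the padded first page (for example `split/page01.png`) instead of `split/page1.png`.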
2. Overlaying the cover
from PIL import Image

background = Image.open("split/page1.png")

# rescaling the logo to the width of the page, keeping its aspect ratio
basewidth = background.size[0]
finalcover = Image.open("cover.png")
wpercent = basewidth / float(finalcover.size[0])
hsize = int(float(finalcover.size[1]) * float(wpercent))
# Image.LANCZOS is the same filter as Image.ANTIALIAS, which was removed in Pillow 10
finalcover = finalcover.resize((basewidth, hsize), Image.LANCZOS)
finalcover.save("cover_rescale.png")

# paste the rescaled cover 180 px above the top edge of the page,
# using its own alpha channel as the mask
foreground = Image.open("cover_rescale.png")
background.paste(foreground, (0, -180), foreground.convert('RGBA'))
background.save("split/page1.png")
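The rescaling arithmetic in the cover script can be isolated into a small pure function, which makes it easy to check without opening any images. This is only a sketch: `scale_to_width` is a hypothetical helper, and it follows the script in truncating the height with `int()` rather than rounding.

```python
def scale_to_width(size, target_width):
    """Return (width, height) of an image scaled to `target_width`.

    `size` is a (width, height) tuple such as Pillow's `Image.size`.
    The aspect ratio is kept, and the height is truncated with int(),
    matching the wpercent / hsize calculation in the cover script.
    """
    width, height = size
    ratio = target_width / float(width)
    return target_width, int(float(height) * ratio)
```

For instance, a 600 x 300 cover scaled to a 1200 px wide page gives `scale_to_width((600, 300), 1200)`, which is `(1200, 600)`; the resulting tuple is what the script passes to `resize`.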
4. OCR again
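This step runs as part of ./jstor.sh. The input and output paths are the same, so ocrmypdf replaces the newest file in the ready folder with its OCRed version: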
ocrmypdf `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1` `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1`