User:Pedro Sá Couto/TW/JSTOR De-watermarking: Difference between revisions
< User:Pedro Sá Couto | TW
No edit summary |
No edit summary |
||
Line 10: | Line 10: | ||
====To activate the stream I use ./jstor.sh==== | ====To activate the stream I use ./jstor.sh==== | ||
< | <source lang="python"> | ||
cp `ls -td -- /Users/PSC/Desktop/JSTOR/jstorpaper/* | head -n 1` /Users/PSC/Desktop/JSTOR/overlay | cp `ls -td -- /Users/PSC/Desktop/JSTOR/jstorpaper/* | head -n 1` /Users/PSC/Desktop/JSTOR/overlay | ||
cd /Users/PSC/Desktop/JSTOR/overlay | cd /Users/PSC/Desktop/JSTOR/overlay | ||
Line 28: | Line 28: | ||
ocrmypdf `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1` `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1` | ocrmypdf `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1` `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1` | ||
mv `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1` /Users/PSC/Desktop/JSTOR/ready/ocred | mv `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1` /Users/PSC/Desktop/JSTOR/ready/ocred | ||
</ | </source> | ||
<br> | <br> | ||
Revision as of 03:00, 6 June 2020
STREAM
The process to dewatermark is separated into 4 steps:
1. Bursting the PDF into png
2. Overlaying the cover
3. Overlaying the pages
4. OCR again
JSTOR.SH
To activate the stream I use ./jstor.sh
cp `ls -td -- /Users/PSC/Desktop/JSTOR/jstorpaper/* | head -n 1` /Users/PSC/Desktop/JSTOR/overlay
cd /Users/PSC/Desktop/JSTOR/overlay
for name in *; do mv "$name" "${name// /_}"; done
mv /Users/PSC/Desktop/JSTOR/overlay/*.pdf target.pdf
mkdir -p split
python3 burstpdf.py
python3 overlaylogo_cover.py
python3 overlaylogo_page.py
rm target.pdf
convert "split/*.{png,jpeg,pdf}" -quality 100 name.pdf
var1=`ls -td -- /Users/PSC/Desktop/JSTOR/jstorpaper/*.pdf | head -n 1`
mv name.pdf $var1
rm -r split
mv `ls -td -- /Users/PSC/Desktop/JSTOR/jstorpaper/* | head -n 1` /Users/PSC/Desktop/JSTOR/ready
cd /Users/PSC/Desktop/JSTOR/ready
ocrmypdf `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1` `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1`
mv `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1` /Users/PSC/Desktop/JSTOR/ready/ocred
1. Bursting the PDF into png
#Based in the code in https://iq.opengenus.org/pdf_to_image_in_python/ import pdf2image from PIL import Image import time #DECLARE CONSTANTS PDF_PATH = "target.pdf" DPI = 200 FIRST_PAGE = None LAST_PAGE = None FORMAT = 'png' THREAD_COUNT = 1 USERPWD = None USE_CROPBOX = False STRICT = False def pdftopil(): #This method reads a pdf and converts it into a sequence of images #PDF_PATH sets the path to the PDF file #dpi parameter assists in adjusting the resolution of the image #first_page parameter allows you to set a first page to be processed by pdftoppm #last_page parameter allows you to set a last page to be processed by pdftoppm #fmt parameter allows to set the format of pdftoppm conversion (PpmImageFile, TIFF) #thread_count parameter allows you to set how many thread will be used for conversion. #userpw parameter allows you to set a password to unlock the converted PDF #use_cropbox parameter allows you to use the crop box instead of the media box when converting #strict parameter allows you to catch pdftoppm syntax error with a custom type PDFSyntaxError start_time = time.time() pil_images = pdf2image.convert_from_path(PDF_PATH, dpi=DPI, first_page=FIRST_PAGE, last_page=LAST_PAGE, fmt=FORMAT, thread_count=THREAD_COUNT, userpw=USERPWD, use_cropbox=USE_CROPBOX, strict=STRICT) print ("Time taken : " + str(time.time() - start_time)) return pil_images def save_images(pil_images): d = 1 for image in pil_images: image.save(("split/page%d"%d) + ".png") d += 1 if __name__ == "__main__": pil_images = pdftopil() save_images(pil_images)
2. Overlaying the cover
from PIL import Image background = Image.open("split/page1.png") #rescaling the logo basewidth = (background.size[0]) finalcover = Image.open("cover.png") wpercent = (basewidth/float(finalcover.size[0])) hsize = int((float(finalcover.size[1])*float(wpercent))) finalcover = finalcover.resize((basewidth,hsize), Image.ANTIALIAS) finalcover.save("cover_rescale.png") foreground = Image.open("cover_rescale.png") background.paste(foreground, (0, -180), foreground.convert('RGBA')) background.save("split/page1.png")
4. OCR again
This happens through ./jstor.sh
ocrmypdf `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1` `ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1`