User:Pedro Sá Couto/TW/JSTOR De-watermarking

From XPUB & Lens-Based wiki

The de-watermarking process has four steps:

1. Burst the PDF into PNG images (burstpdf.py)
2. Overlay the cover (overlaylogo_cover.py)
3. Overlay the pages (overlaylogo_page.py)
4. Run OCR again (ocrmypdf)
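
Each step works on the output of the one before it, so a failed step should stop the run before the original file is touched. A minimal sketch of that control flow, assuming each stage is wrapped in a shell function; `run_steps` and the step names in the usage note are hypothetical and do not appear in jstor.sh:

```shell
#!/bin/sh
# Hypothetical helper, not part of the original jstor.sh.
# Run each named step in order and stop at the first one that fails,
# so later steps never run on missing or partial input.
run_steps() {
    for step in "$@"; do
        echo "step: $step"
        if ! "$step"; then
            echo "stopped: $step failed" >&2
            return 1
        fi
    done
}
```

It could be called as `run_steps burst_pdf overlay_cover overlay_pages ocr_again`, where each name is a placeholder function that wraps the matching command.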


The process is started by running ./jstor.sh:


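# Copy the most recently modified paper into the overlay working directory
cp "$(ls -td -- /Users/PSC/Desktop/JSTOR/jstorpaper/* | head -n 1)" /Users/PSC/Desktop/JSTOR/overlay
cd /Users/PSC/Desktop/JSTOR/overlay
# Replace spaces in file names with underscores (only names that contain a space)
for name in *\ *; do [ -e "$name" ] && mv "$name" "${name// /_}"; done
# Rename the copied PDF to target.pdf
mv /Users/PSC/Desktop/JSTOR/overlay/*.pdf target.pdf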
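# Working directory for the individual pages
mkdir -p split
# Step 1: burst the PDF into PNG pages
python3 burstpdf.py
# Step 2: overlay the cover
python3 overlaylogo_cover.py
# Step 3: overlay the remaining pages
python3 overlaylogo_page.py
rm target.pdf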
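# Reassemble the processed pages into a single PDF
convert "split/*.{png,jpeg,pdf}" -quality 100 name.pdf
# Overwrite the original download with the de-watermarked version
var1="$(ls -td -- /Users/PSC/Desktop/JSTOR/jstorpaper/*.pdf | head -n 1)"
mv name.pdf "$var1"
rm -r split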
# Step 4: move the paper to the ready folder and run OCR on it in place
mv "$(ls -td -- /Users/PSC/Desktop/JSTOR/jstorpaper/* | head -n 1)" /Users/PSC/Desktop/JSTOR/ready
cd /Users/PSC/Desktop/JSTOR/ready
newest="$(ls -td -- /Users/PSC/Desktop/JSTOR/ready/* | head -n 1)"
ocrmypdf "$newest" "$newest"
# File the OCRed result under ready/ocred
mv "$newest" /Users/PSC/Desktop/JSTOR/ready/ocred
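
The script picks the paper to work on by calling `ls -td ... | head -n 1` several times. That returns the most recently modified entry, but it breaks on names containing spaces if the result is left unquoted. A minimal sketch of two helpers that make those steps explicit; the names `newest_entry` and `underscore_names` are hypothetical and not part of jstor.sh. The rename uses `tr` rather than bash's `${name// /_}` so the sketch also runs under plain sh. It assumes file names contain no newlines.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original jstor.sh.

# Print the most recently modified entry in directory $1.
# Prints nothing if the directory is empty.
newest_entry() {
    ls -td -- "$1"/* 2>/dev/null | head -n 1
}

# Replace spaces with underscores in the names of entries directly inside $1.
# Names without spaces are skipped, so mv never gets an identical source and target.
underscore_names() {
    for un_path in "$1"/*; do
        [ -e "$un_path" ] || continue
        un_name=${un_path##*/}
        case $un_name in
            *" "*)
                un_new=$(printf '%s' "$un_name" | tr ' ' '_')
                mv -- "$un_path" "$1/$un_new"
                ;;
        esac
    done
}
```

With these, the first copy in the script could be written as `cp "$(newest_entry /Users/PSC/Desktop/JSTOR/jstorpaper)" /Users/PSC/Desktop/JSTOR/overlay`. The quotes around the command substitution are what keep a name with spaces in one piece.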