User:Pedro Sá Couto/TW/JSTOR De-watermarking: Difference between revisions
Latest revision as of 15:39, 15 June 2020
STEPS
De-watermarking is separated into 4 steps:
1. Bursting the PDF into PNGs
2. Overlaying the cover
3. Overlaying the pages
4. OCR again
RESULTS IN EACH STEP
0. Starting with a Paper from JSTOR
File:42938075.pdf
1. Bursting the PDF into PNGs
The PDF is separated into pages
2. Overlaying the cover
The cover is overlaid and de-watermarked
3. Overlaying the pages
The pages are overlaid and de-watermarked
4. OCR again
You have a De-watermarked, searchable PDF
File:42938075_dewater.pdf
FLOW
JSTOR.SH
To start the whole flow I run ./jstor.sh
The script wraps every step in a bash for loop that runs five times, so I can check the results in batches of five papers.
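for i in {1..5}
do
    # Replace spaces with underscores in the names of the downloaded paper folders
    cd /Users/PSC/Desktop/JSTOR/jstorpaper/
    for name in *; do mv "$name" "${name// /_}"; done
    # Enter the most recently modified paper folder and clean the file names inside it
    cd `ls -td -- /Users/PSC/Desktop/JSTOR/jstorpaper/* | head -n 1`
    var2=`ls -td -- /Users/PSC/Desktop/JSTOR/jstorpaper/* | head -n 1`
    for name in *; do mv "$name" "${name// /_}"; done
    # Copy the paper into the working folder as target.pdf
    cp "$var2"/*.pdf /Users/PSC/Desktop/JSTOR/overlay
    cd /Users/PSC/Desktop/JSTOR/overlay
    mv /Users/PSC/Desktop/JSTOR/overlay/*.pdf target.pdf
    # Burst the PDF into PNGs, then cover the watermark on the cover and on the other pages
    mkdir -p split
    python3 burstpdf.py
    python3 overlaylogo_cover.py
    python3 overlaylogo_page.py
    rm target.pdf
    # Join the processed pages back into a single PDF (ImageMagick)
    convert "split/*.{png,jpeg,pdf}" -quality 100 name.pdf
    # Overwrite the original PDF; $var1 stays unquoted so the glob expands when mv runs
    var1=$var2/*.pdf
    mv name.pdf $var1
    rm -r split
    # Move the finished folder to ready/ and OCR its PDF in place (same input and output file)
    mv "$var2" /Users/PSC/Desktop/JSTOR/ready
    cd /Users/PSC/Desktop/JSTOR/ready
    ocrmypdf `ls -td -- /Users/PSC/Desktop/JSTOR/ready/*/*.pdf | head -n 1` `ls -td -- /Users/PSC/Desktop/JSTOR/ready/*/*.pdf | head -n 1`
done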
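The two shell idioms the script leans on, picking the newest entry with `ls -td ... | head -n 1` and replacing spaces with `${name// /_}`, can also be sketched in Python. This is only an illustration: `newest_entry` and `underscore_names` are hypothetical helper names, not part of jstor.sh.

```python
from pathlib import Path
from typing import Optional


def newest_entry(folder: Path) -> Optional[Path]:
    """Return the most recently modified entry in folder, or None if it is empty.

    Rough equivalent of: ls -td -- folder/* | head -n 1
    """
    entries = list(folder.iterdir())
    if not entries:
        return None
    return max(entries, key=lambda p: p.stat().st_mtime)


def underscore_names(folder: Path) -> None:
    """Rename every entry in folder so that spaces become underscores.

    Rough equivalent of: for name in *; do mv "$name" "${name// /_}"; done
    """
    for entry in list(folder.iterdir()):
        clean = entry.name.replace(" ", "_")
        if clean != entry.name:
            entry.rename(entry.with_name(clean))
```

Unlike the shell loop, this version skips names that have no spaces, so it does not attempt to move a file onto itself.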
1. Bursting the PDF into PNGs
# Based on the code at https://iq.opengenus.org/pdf_to_image_in_python/
import pdf2image
from PIL import Image
import time

# DECLARE CONSTANTS
PDF_PATH = "target.pdf"
DPI = 200
FIRST_PAGE = None
LAST_PAGE = None
FORMAT = 'png'
THREAD_COUNT = 1
USERPWD = None
USE_CROPBOX = False
STRICT = False

def pdftopil():
    # This function reads a PDF and converts it into a sequence of images
    # PDF_PATH sets the path to the PDF file
    # dpi sets the resolution of the images
    # first_page sets the first page to be processed by pdftoppm
    # last_page sets the last page to be processed by pdftoppm
    # fmt sets the output format of the pdftoppm conversion (PpmImageFile, TIFF)
    # thread_count sets how many threads are used for the conversion
    # userpw sets the password needed to unlock the PDF
    # use_cropbox uses the crop box instead of the media box when converting
    # strict raises pdftoppm syntax errors as a custom PDFSyntaxError
    start_time = time.time()
    pil_images = pdf2image.convert_from_path(PDF_PATH, dpi=DPI, first_page=FIRST_PAGE, last_page=LAST_PAGE, fmt=FORMAT, thread_count=THREAD_COUNT, userpw=USERPWD, use_cropbox=USE_CROPBOX, strict=STRICT)
    print("Time taken : " + str(time.time() - start_time))
    return pil_images

def save_images(pil_images):
    # Save each page as split/page1.png, split/page2.png, ...
    d = 1
    for image in pil_images:
        image.save(("split/page%d" % d) + ".png")
        d += 1

if __name__ == "__main__":
    pil_images = pdftopil()
    save_images(pil_images)
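One thing to keep in mind with the page%d naming: a plain alphabetical listing puts page10.png before page2.png, which may matter for papers with ten or more pages when the files in split/ are collected again. A minimal sketch of sorting by page number instead, where `page_number` and `in_page_order` are hypothetical helpers and not part of burstpdf.py:

```python
import re
from typing import List


def page_number(filename: str) -> int:
    """Extract the page number from a name such as 'page12.png'."""
    match = re.search(r"page(\d+)\.png$", filename)
    if match is None:
        raise ValueError(f"not a page file: {filename}")
    return int(match.group(1))


def in_page_order(filenames: List[str]) -> List[str]:
    """Sort page files numerically rather than alphabetically."""
    return sorted(filenames, key=page_number)
```

An alternative that avoids the problem altogether would be zero-padded names such as page001.png, at the cost of changing the file names the overlay scripts expect.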
2. Overlaying the cover
from PIL import Image

background = Image.open("split/page1.png")

# rescaling the cover overlay to the width of the page
basewidth = (background.size[0])
baseheight = (background.size[1])
finalcover = Image.open("cover.png")
wpercent = (basewidth/float(finalcover.size[0]))
hsize = int((float(finalcover.size[1])*float(wpercent)))
# Image.LANCZOS is the same filter as Image.ANTIALIAS, which was removed in Pillow 10
finalcover = finalcover.resize((basewidth,hsize), Image.LANCZOS)
finalcover.save("cover_rescale.png")

foreground = Image.open("cover_rescale.png")
foregroundheight = (foreground.size[1])

# paste the overlay flush with the bottom edge of the page, using its alpha channel as the mask
background.paste(foreground, (0, (baseheight-foregroundheight)), foreground.convert('RGBA'))
background.save("split/page1.png")
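The arithmetic behind the overlay can be looked at in isolation: the overlay is scaled to the page width with its aspect ratio kept, then pasted so that its bottom edge meets the bottom of the page. A small sketch of that calculation, where `overlay_geometry` is a hypothetical helper and the sizes in the example are made-up numbers:

```python
from typing import Tuple


def overlay_geometry(
    page_size: Tuple[int, int], overlay_size: Tuple[int, int]
) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return (scaled overlay size, paste position) for a bottom-aligned overlay."""
    page_w, page_h = page_size
    over_w, over_h = overlay_size
    scale = page_w / float(over_w)       # same role as wpercent in the script
    new_h = int(over_h * scale)          # same role as hsize
    position = (0, page_h - new_h)       # baseheight - foregroundheight
    return (page_w, new_h), position


# A 1700x2200 page with an 850x100 overlay: the overlay is doubled in size
# and pasted 200 pixels above the bottom edge.
size, position = overlay_geometry((1700, 2200), (850, 100))
print(size, position)
```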
3. Overlaying the pages
This happens through ./jstor.sh
from PIL import Image

base = Image.open("split/page2.png")

# rescaling the page overlay to the width of the page
basewidth = (base.size[0])
baseheight = (base.size[1])
finalpage = Image.open("pages.png")
wpercent = (basewidth/float(finalpage.size[0]))
hsize = int((float(finalpage.size[1])*float(wpercent)))
# Image.LANCZOS is the same filter as Image.ANTIALIAS, which was removed in Pillow 10
finalpage = finalpage.resize((basewidth,hsize), Image.LANCZOS)
finalpage.save("page_rescale.png")

foreground = Image.open("page_rescale.png")
foregroundheight = (foreground.size[1])

# overlay every page from page 2 onwards, stopping at the first page number that does not exist
i = 2
while True:
    try:
        background = Image.open("split/page%i.png"%i)
        background.paste(foreground, (0, (baseheight-foregroundheight)), foreground.convert('RGBA'))
        background.save("split/page%i.png"%i)
        i+=1
    except FileNotFoundError:
        print("DID MY JOB!")
        break
4. OCR again
This happens through ./jstor.sh
ocrmypdf `ls -td -- /Users/PSC/Desktop/JSTOR/ready/*/*.pdf | head -n 1` `ls -td -- /Users/PSC/Desktop/JSTOR/ready/*/*.pdf | head -n 1`