User:Pedro Sá Couto/TW/JSTOR De-watermarking

From XPUB & Lens-Based wiki

Latest revision as of 16:39, 15 June 2020

STEPS

De-watermarking is separated into 4 steps:

1. Bursting the PDF into PNGs
2. Overlaying the cover
3. Overlaying the pages
4. OCR again

RESULTS IN EACH STEP

0. Starting with a paper from JSTOR
File:42938075.pdf

1. Bursting the PDF into PNGs

The PDF is separated into pages

2. Overlaying the cover

The cover is overlaid and de-watermarked

3. Overlaying the pages

The pages are overlaid and de-watermarked

4. OCR again

You have a de-watermarked, searchable PDF

File:42938075 dewater.pdf

FLOW

JSTOR.SH

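To start the pipeline I run ./jstor.sh

The script is wrapped in a bash for loop that runs five times. Each iteration takes the most recently modified folder in jstorpaper, so one run processes a batch of five papers that I can easily check.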

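BASE=/Users/PSC/Desktop/JSTOR

for i in {1..5}
do
  # Replace spaces with underscores in the folder names
  cd "$BASE/jstorpaper/" || exit 1
  for name in *; do [ "$name" = "${name// /_}" ] || mv "$name" "${name// /_}"; done

  # Pick the most recently modified paper folder; stop when none is left
  var2=$(ls -td -- "$BASE"/jstorpaper/* | head -n 1)
  [ -d "$var2" ] || break
  cd "$var2" || exit 1
  for name in *; do [ "$name" = "${name// /_}" ] || mv "$name" "${name// /_}"; done

  # Copy the paper into the working folder as target.pdf
  cp "$var2"/*.pdf "$BASE/overlay"
  cd "$BASE/overlay" || exit 1
  mv "$BASE"/overlay/*.pdf target.pdf
  mkdir -p split

  # Burst the PDF into PNGs, then cover the watermarks
  python3 burstpdf.py
  python3 overlaylogo_cover.py
  python3 overlaylogo_page.py
  rm target.pdf

  # Rebuild the PDF in numeric page order (an alphabetical glob puts page10 before page2)
  pages=()
  n=1
  while [ -f "split/page$n.png" ]; do pages+=("split/page$n.png"); n=$((n + 1)); done
  convert "${pages[@]}" -quality 100 name.pdf

  # Replace the original PDF, move the folder to ready/, and OCR the result in place
  var1=$(ls "$var2"/*.pdf | head -n 1)
  mv name.pdf "$var1"
  rm -r split
  mv "$var2" "$BASE/ready"
  cd "$BASE/ready" || exit 1
  newest=$(ls -td -- "$BASE"/ready/*/*.pdf | head -n 1)
  ocrmypdf "$newest" "$newest"
done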
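
The burst step names its output page1.png, page2.png, and so on, without zero padding, so any listing that is sorted alphabetically puts page10.png before page2.png. The following is a minimal Python sketch of ordering the files by page number instead. The helper name `page_number` and the sample file names are illustrative assumptions, not part of the original scripts.

```python
import re


def page_number(filename):
    """Return the integer in a name like 'split/page12.png' (hypothetical helper)."""
    match = re.search(r"page(\d+)\.png$", filename)
    if match is None:
        raise ValueError("not a page file: " + filename)
    return int(match.group(1))


# Sample names in the pattern written by the burst step
names = ["split/page10.png", "split/page2.png", "split/page1.png"]

alphabetical = sorted(names)                # page1, page10, page2
numeric = sorted(names, key=page_number)    # page1, page2, page10

print(alphabetical)
print(numeric)
```

For documents with fewer than ten pages both orders are the same, which is why the problem is easy to miss.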


1. Bursting the PDF into PNGs

#Based on the code at https://iq.opengenus.org/pdf_to_image_in_python/

import pdf2image
from PIL import Image
import time

#DECLARE CONSTANTS
PDF_PATH = "target.pdf"
DPI = 200
FIRST_PAGE = None
LAST_PAGE = None
FORMAT = 'png'
THREAD_COUNT = 1
USERPWD = None
USE_CROPBOX = False
STRICT = False

def pdftopil():
    #This function reads a PDF and converts it into a list of PIL images
    #PDF_PATH is the path to the PDF file
    #dpi sets the resolution of the rendered images
    #first_page sets the first page to be processed by pdftoppm (None starts at the beginning)
    #last_page sets the last page to be processed by pdftoppm (None runs to the end)
    #fmt sets the output format of the pdftoppm conversion (for example 'ppm', 'png', 'jpeg', 'tiff')
    #thread_count sets how many threads are used for the conversion
    #userpw is the password used to open a protected PDF
    #use_cropbox uses the crop box instead of the media box when converting
    #strict raises a PDFSyntaxError when pdftoppm reports a syntax error

    start_time = time.time()
    pil_images = pdf2image.convert_from_path(PDF_PATH, dpi=DPI, first_page=FIRST_PAGE, last_page=LAST_PAGE, fmt=FORMAT, thread_count=THREAD_COUNT, userpw=USERPWD, use_cropbox=USE_CROPBOX, strict=STRICT)
    print ("Time taken : " + str(time.time() - start_time))
    return pil_images

def save_images(pil_images):
    #Pages are numbered from 1: split/page1.png, split/page2.png, ...
    for d, image in enumerate(pil_images, start=1):
        image.save("split/page%d.png" % d)

if __name__ == "__main__":
    pil_images = pdftopil()
    save_images(pil_images)
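
The DPI constant decides how large each PNG is, which in turn decides how far the overlays have to be rescaled in the later steps. As a rough sketch, assuming the usual PDF convention of 72 points per inch, the pixel size of a rendered page can be estimated like this. The function name `rendered_size` is made up for illustration, and the actual renderer may differ by a pixel because of rounding.

```python
POINTS_PER_INCH = 72  # PDF page sizes are given in points


def rendered_size(width_pt, height_pt, dpi):
    """Approximate pixel size of a page rendered at the given DPI."""
    return (round(width_pt / POINTS_PER_INCH * dpi),
            round(height_pt / POINTS_PER_INCH * dpi))


# A US Letter page is 612 x 792 points (8.5 x 11 inches)
letter = rendered_size(612, 792, 200)
print(letter)  # (1700, 2200)
```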


2. Overlaying the cover

from PIL import Image

background = Image.open("split/page1.png")

#Rescale the cover overlay to the page width, keeping its aspect ratio
basewidth = background.size[0]
baseheight = background.size[1]
finalcover = Image.open("cover.png")
wpercent = basewidth / float(finalcover.size[0])
hsize = int(float(finalcover.size[1]) * wpercent)
#Image.LANCZOS is the same filter as Image.ANTIALIAS, which was removed in Pillow 10
finalcover = finalcover.resize((basewidth, hsize), Image.LANCZOS)
finalcover.save("cover_rescale.png")

foreground = Image.open("cover_rescale.png")
foregroundheight = foreground.size[1]

#Paste the overlay along the bottom edge of the page, using its alpha channel as the mask
background.paste(foreground, (0, baseheight - foregroundheight), foreground.convert('RGBA'))
background.save("split/page1.png")
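
The placement arithmetic in the script above can be isolated into a small function, which makes it easier to check without opening any images. This is only a sketch of the same calculation: the overlay is scaled to the page width and anchored to the bottom edge. The name `overlay_geometry` and the example sizes are assumptions for illustration.

```python
def overlay_geometry(page_size, overlay_size):
    """Scale the overlay to the page width and anchor it to the bottom edge.

    Returns ((new_width, new_height), (x, y)), where (x, y) is the paste position
    of the overlay's top-left corner on the page.
    """
    page_width, page_height = page_size
    overlay_width, overlay_height = overlay_size
    scale = page_width / float(overlay_width)
    new_height = int(overlay_height * scale)
    return (page_width, new_height), (0, page_height - new_height)


# Assumed example: a 1700 x 2200 page and an 850 x 100 overlay image
size, position = overlay_geometry((1700, 2200), (850, 100))
print(size, position)  # (1700, 200) (0, 2000)
```

The two returned tuples correspond to the arguments passed to `resize` and `paste` in the script.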


3. Overlaying the pages

This happens through ./jstor.sh

from PIL import Image

base = Image.open("split/page2.png")

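#Rescale the page overlay to the page width, keeping its aspect ratio
basewidth = base.size[0]
baseheight = base.size[1]
finalpage = Image.open("pages.png")
wpercent = basewidth / float(finalpage.size[0])
hsize = int(float(finalpage.size[1]) * wpercent)
#Image.LANCZOS is the same filter as Image.ANTIALIAS, which was removed in Pillow 10
finalpage = finalpage.resize((basewidth, hsize), Image.LANCZOS)
finalpage.save("page_rescale.png")

foreground = Image.open("page_rescale.png")
foregroundheight = foreground.size[1]

#The cover is page 1, so the regular pages start at page 2
i = 2

while True:
    try:
        background = Image.open("split/page%i.png" % i)

        #Paste the overlay along the bottom edge, using its alpha channel as the mask
        background.paste(foreground, (0, baseheight - foregroundheight), foreground.convert('RGBA'))
        background.save("split/page%i.png" % i)

        i += 1

    except FileNotFoundError:
        #No more pages: the first missing file marks the end of the document
        print("DID MY JOB!")
        break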
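
The loop above relies on an exception to detect the last page. A variant, sketched here under the assumption that the pages are numbered consecutively as the burst step writes them, checks for each file explicitly, so the end of the document is found without relying on error handling. The generator name `pages_from` is hypothetical, and the demonstration uses empty placeholder files in a temporary folder rather than real page images.

```python
import tempfile
from pathlib import Path


def pages_from(folder, first=2):
    """Yield folder/pageN.png for N = first, first + 1, ... until one is missing."""
    n = first
    while True:
        path = Path(folder) / ("page%d.png" % n)
        if not path.exists():
            return
        yield path
        n += 1


# Demonstration with placeholder files standing in for the burst pages
with tempfile.TemporaryDirectory() as tmp:
    for n in (1, 2, 3, 4):
        (Path(tmp) / ("page%d.png" % n)).touch()
    found = [p.name for p in pages_from(tmp)]

print(found)  # ['page2.png', 'page3.png', 'page4.png']
```

In the overlay script, the body of the loop would open, paste onto, and save each path that the generator yields.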


4. OCR again

This happens through ./jstor.sh

newest=$(ls -td -- /Users/PSC/Desktop/JSTOR/ready/*/*.pdf | head -n 1)
ocrmypdf "$newest" "$newest"
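
The command above picks the most recently modified PDF under ready/ and passes it to ocrmypdf as both input and output. The same selection can be sketched in Python as follows. The function names `newest_pdf` and `ocr_command` are illustrative assumptions, and the demonstration uses empty placeholder files with explicit timestamps instead of real papers; it only builds the command and does not run ocrmypdf.

```python
import os
import tempfile
from pathlib import Path


def newest_pdf(ready_dir):
    """Return the most recently modified */*.pdf under ready_dir, or None."""
    pdfs = list(Path(ready_dir).glob("*/*.pdf"))
    if not pdfs:
        return None
    return max(pdfs, key=lambda p: p.stat().st_mtime)


def ocr_command(pdf):
    """Build the argument list for an in-place OCR run (same input and output)."""
    return ["ocrmypdf", str(pdf), str(pdf)]


# Demonstration with two placeholder paper folders
with tempfile.TemporaryDirectory() as tmp:
    for i, name in enumerate(["old_paper", "new_paper"]):
        folder = Path(tmp) / name
        folder.mkdir()
        pdf = folder / "paper.pdf"
        pdf.touch()
        os.utime(pdf, (1000 + i, 1000 + i))  # explicit access and modification times
    chosen = newest_pdf(tmp)
    command = ocr_command(chosen)
    chosen_folder = chosen.parent.name

print(chosen_folder)  # new_paper
```

To actually run the OCR step, the list could be passed to `subprocess.run(command, check=True)`, assuming ocrmypdf is installed and on the PATH.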