Character recognition - from pdf to txt
Revisions by Andre Castro
Latest revision as of 11:21, 28 February 2013
This patch runs character recognition (OCR) on an image pdf and writes the recognized text to a txt file.
Note:
- the patch uses pdftk, imagemagick and tesseract-ocr (in the Debian repository).
- the pdf should have a resolution of 300dpi or higher
- it takes the original pdf file name as its argument
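The requirements above can be checked before any work starts. A minimal sketch, assuming the three packages provide the commands `pdftk`, `convert` and `tesseract` on the PATH; the function name `missing_tools` is made up for this example and is not part of the patch:

```shell
#!/bin/sh
# Hypothetical helper (not part of the original patch):
# prints the name of every given command that is not found on the PATH.
missing_tools() {
    for tool in "$@"
    do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed command names: pdftk, convert (from imagemagick), tesseract.
MISSING=$(missing_tools pdftk convert tesseract)
if [ -n "$MISSING" ]
then
    echo "missing tools:" $MISSING >&2
fi
```

Placed at the top of the script, an `exit 1` after the message would stop the run before any page is split.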
#!/bin/sh
#needs: pdftk, imagemagick and tesseract to be installed
#pdf should be 300dpi or higher resolution
pdftk "$1" burst #splits a pdf document into single pages named pg_0*.pdf
#convert each pdf page
#1:from pdf to 8bit 300dpi tifs # this takes a while
for i in pg*.pdf
do
convert -units pixelsperinch -density 300x300 -colorspace Gray -depth 8 "$i" "$(basename "$i" .pdf).tif"
done;
#2:from gray-scale tif to monochrome tif
for i in pg*.tif
do
convert "$i" +dither -monochrome -normalize "$(basename "$i" .tif)-m.tif"
done;
#3:character recognition with tesseract
for i in *-m.tif
do
tesseract "$i" "$(basename "$i" .tif)"
done;
#4: cat ocr text content under 1 file
DUMP="$(basename "$1" .pdf).txt" #replace suffix .pdf with .txt
echo "==== $DUMP is the file with text content ===="
touch "$DUMP" #create the txt file if it does not exist yet
cat pg*.txt >> "$DUMP"
#5: garbage collect
echo "==== moving old files to trash/ ===="
mkdir -p trash
mv pg* trash
# remove the trash dir if you no longer need those files
$ chmod +x script.sh
$ ./script.sh yourimagepdf.pdf
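Step 4 names the output file after the input pdf. A small sketch of that step in isolation, using a hypothetical function name `txt_name` that does not appear in the patch; it assumes the argument ends in lower-case `.pdf`, since `basename` only strips a suffix that matches exactly:

```shell
#!/bin/sh
# Hypothetical helper mirroring step 4 of the patch:
# drop the directory part and the .pdf suffix, then append .txt
txt_name() {
    echo "$(basename "$1" .pdf).txt"
}

txt_name scans/yourimagepdf.pdf   # prints yourimagepdf.txt
```

Because the directory part is dropped, the txt file lands in the directory the script is run from, not next to the pdf.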