To improve the performance of Tesseract at image recognition, it can help to pre-process the images by straightening them and removing the background colors and the noise (binarisation) (see https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality). On this page I list a number of tools to do that programmatically. My objective is to create a pipeline that combines multiple tools to deskew a scanned image and put it through a process of binarisation.

== Interesting tools ==

=== pdfsandwich ===

'''site''': http://www.tobias-elze.de/pdfsandwich/

Pdfsandwich is a wrapper script that combines the applications unpaper (since version 0.0.9), convert, gs, hocr2pdf (for tesseract prior to version 3.03), and tesseract. It takes non-OCR'ed PDFs as input and outputs PDFs with a text layer over the image layer. It features several options, for example to leave out the background image in the PDF or to rotate images beforehand. It is an easy tool to use, but I am interested in deconstructing it. By doing so I can make a specific pipeline for just deskewing and binarisation, which I can connect to my own application.
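
To get a feel for what it does before taking it apart, pdfsandwich can simply be run on a scanned PDF. A minimal sketch with a hypothetical file name (check pdfsandwich -help for the exact options of the installed version):

<pre>
# OCR a scanned PDF with English as the recognition language (hypothetical file name)
pdfsandwich -lang eng scanned_book.pdf
# pdfsandwich writes its result next to the input, normally as scanned_book_ocr.pdf
</pre>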

=== Unpaper ===

'''site''': https://github.com/Flameeyes/unpaper/blob/master/doc/basic-concepts.md

Unpaper is one of the dependencies used in pdfsandwich. It is able to deskew scanned pages and optionally combine single pages into spreads. It accepts .ppm, .pbm and .pnm files as input and outputs the same formats.
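
Because unpaper only understands the PNM family of formats, scans in other formats first have to be converted, for example with ImageMagick. A minimal sketch with hypothetical file names:

<pre>
# convert a scanned page to a format unpaper can read (hypothetical file names)
convert scan_001.png scan_001.ppm

# deskew and clean the page; unpaper writes the result to a new PNM file
unpaper scan_001.ppm scan_001_clean.ppm
</pre>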

=== Textcleaner script ===

'''site''': http://www.fmwconcepts.com/imagemagick/textcleaner/index.php

Textcleaner is an ImageMagick script for cropping, grayscaling and de-noising images. I don't need the full script, but maybe I can reconstruct it and make a script that just does the grayscaling and denoising.
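
A rough sketch of what such a stripped-down version could look like, using plain ImageMagick operators instead of the full textcleaner script; the threshold value is a guess and would need tuning per scan:

<pre>
# grayscale, light denoising and binarisation with plain ImageMagick
# (the 60% threshold is an assumption, tune it per scan)
convert scan_001_clean.ppm -colorspace Gray -despeckle -threshold 60% scan_001_bw.tiff
</pre>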

=== OCRmyPDF ===

'''site''': https://github.com/jbarlow83/OCRmyPDF
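
OCRmyPDF takes a similar all-in-one approach: as far as I can tell from its documentation, it can deskew and clean pages (it also uses unpaper for this) before running tesseract and adding the text layer. A minimal sketch of how it is invoked, with hypothetical file names:

<pre>
# deskew and clean the pages before OCR, then write a PDF with a text layer
ocrmypdf --deskew --clean input.pdf output.pdf
</pre>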

== What's next? ==

First I want to manually preprocess a non-OCR'ed PDF using unpaper and ImageMagick. The output should be a TIFF that I can use in Tesseract. If this works, I'd like to turn these commands into a script that takes PDFs or images as input and outputs .tiff files. Another goal is to make an overview of preprocessing tools, which might include my script, and post that as a recipe on the CookBook.
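
As a starting point, that manual sequence could look roughly like the sketch below. The file names are hypothetical, and converting the PDF with ImageMagick assumes Ghostscript is installed:

<pre>
# 1. rasterise the PDF into PNM pages (300 dpi is a common choice for OCR)
convert -density 300 scan.pdf page_%03d.ppm

# 2. deskew and clean one page with unpaper
unpaper page_000.ppm page_000_clean.ppm

# 3. grayscale and binarise with ImageMagick, writing the TIFF for Tesseract
convert page_000_clean.ppm -colorspace Gray -threshold 60% page_000.tiff

# 4. run Tesseract on the preprocessed page; it writes page_000_text.txt
tesseract page_000.tiff page_000_text -l eng
</pre>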