User:Ssstephen/handwriting-ocr

From XPUB & Lens-Based wiki

Another process page. I want to digitise my handwritten notebooks. I would like to be able to:

scan notebook pages

An image format is presumably most useful here as a first digital file. What format? Is lighting important? How does the book scanner in the studio work?

pre-process images

Increase contrast, remove colour; generally make it as easy as possible for the OCR software. I suppose I should make a rough guess at what Tesseract would want, then pre-process a sample of pages to train the model on. But if I later change the pre-processing methods, do I have to retrain it each time? That seems to make sense, but it also seems like a long process.

Can ImageMagick do this for me?
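To make the pre-processing steps concrete, here is a minimal pure-Python sketch of the three operations mentioned above: removing colour, increasing contrast, and (as an extra assumption) thresholding to black and white. It treats a page as a nested list of RGB tuples, which is a made-up stand-in for a real scan. In practice an image tool would do this on actual files; ImageMagick, as far as I know, offers equivalent operations (`-colorspace Gray`, `-normalize`, `-threshold`). The threshold of 128 and the sample pixel values are illustrative guesses, not tuned settings.

```python
# A "page" here is a list of rows, each row a list of (r, g, b) tuples (0-255).
# This is a hypothetical in-memory stand-in for a scanned image.

def to_grey(pixels):
    """Remove colour using a weighted luminance (ITU-R BT.601 weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in pixels]

def stretch_contrast(grey):
    """Linearly rescale so the darkest pixel becomes 0 and the lightest 255."""
    flat = [v for row in grey for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # completely flat page: nothing to stretch
        return [row[:] for row in grey]
    return [[round((v - lo) * 255 / (hi - lo)) for v in row] for row in grey]

def binarise(grey, threshold=128):
    """Ink becomes 0 (black), paper becomes 255 (white). Threshold is a guess."""
    return [[0 if v < threshold else 255 for v in row] for row in grey]

# A made-up 2x2 "scan": bluish ink on yellowish paper.
page = [[(40, 40, 90), (230, 225, 200)],
        [(235, 230, 205), (50, 45, 95)]]
clean = binarise(stretch_contrast(to_grey(page)))
```

Keeping each step as its own function means the pipeline can be changed one step at a time, which matters if every change to pre-processing means retraining.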

OCR

Tesseract seems like a good option. What I have seen about it doesn't make me super optimistic about the quality of output I will get. But there are a few reasons I think it is worth exploring anyway:

  • I will be training and using only one person's handwriting.
  • My handwriting is fucking immaculate.
  • I am interested in the translations and what is lost/gained there too.
  • If it did work, it would be really fun to do some computational linguistics on my notebooks. Make a concordance of them. Search through them. Make them into a website and print them out again.
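As a rough idea of what the concordance and search could look like once transcriptions exist, here is a small sketch using only the Python standard library. The page names and sample text are invented, and the word-splitting rule (lowercase letters and apostrophes only) is an assumption that would need adjusting for real notebook text.

```python
import re
from collections import defaultdict

def build_concordance(pages):
    """Map each word to a list of (page_name, line_number) places it occurs.

    `pages` is a dict of page name -> transcribed text (a hypothetical layout).
    """
    index = defaultdict(list)
    for name, text in pages.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            # Crude tokenisation: runs of letters and apostrophes, lowercased.
            for word in re.findall(r"[a-z']+", line.lower()):
                index[word].append((name, line_no))
    return dict(index)

def search(index, word):
    """Look a word up in the concordance, ignoring case."""
    return index.get(word.lower(), [])

# Invented sample transcriptions, standing in for real OCR output.
pages = {
    "notebook1-p01": "Scan the pages.\nTrain the model on the pages.",
    "notebook1-p02": "Pages lost in translation.",
}
index = build_concordance(pages)
hits = search(index, "Pages")
```

Storing the page name and line number with every hit is what makes it possible to jump from a word back to the scan it came from.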

train tesseract

Using a dataset of my own pages. Prep the dataset images? Make the dataset. Train the software. Test it. Iterate.
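For the "test it" step, one common way to put a number on transcription quality is the character error rate: the edit distance between a hand-typed reference and the OCR output, divided by the length of the reference. The sketch below is a plain-Python version of that idea, not something taken from Tesseract itself, and the function names are my own.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute, or free if equal
        prev = curr
    return prev[-1]

def character_error_rate(truth, ocr_output):
    """Edit distance divided by the length of the hand-typed reference."""
    if not truth:
        return 0.0 if not ocr_output else 1.0
    return edit_distance(truth, ocr_output) / len(truth)
```

Comparing this number before and after each round of training (or each change to pre-processing) gives a way to tell whether an iteration actually helped. It needs a handful of pages transcribed by hand to compare against.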

Output flawless .txt transcriptions of my handwritten nonsense

How could this be possible? What about the location of the notes on the page? What about all the vertical text and columned text? I really don't think this will go well. How long would it take to just transcribe them by hand instead?

combine the images and .txt into a multipage PDF

The original images, not the ones tesseract uses.
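Before building any PDF, the original scans have to be matched up with their transcriptions. A minimal sketch of that pairing step follows, assuming (hypothetically) that each original image and its `.txt` share a file stem, such as `page-001.png` and `page-001.txt`, and live in two separate folders. It does not produce the PDF itself; it only works out which pages are ready and which are still missing text.

```python
from pathlib import Path

# Assumed set of image extensions for the original scans.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}

def pair_pages(scan_dir, text_dir):
    """Return (pairs, missing).

    pairs   -- list of (image_path, txt_path), sorted by image file name
    missing -- images that have no matching .txt transcription yet
    """
    pairs, missing = [], []
    for image in sorted(Path(scan_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip anything that is not a scan
        txt = Path(text_dir) / (image.stem + ".txt")
        if txt.is_file():
            pairs.append((image, txt))
        else:
            missing.append(image)
    return pairs, missing
```

Sorting by file name only gives the right page order if the scans are named with zero-padded numbers, so that is worth deciding at the scanning stage.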