User:Ssstephen/handwriting-ocr
Another process page. I want to digitise my handwritten notebooks. I would like to be able to:
scan notebook pages
An image format is presumably most useful here as a first digital file. What format? Is lighting important? How does the book scanner in the studio work?
pre-process images
Increase contrast, remove colour; generally make it as easy as possible for the OCR software. I should probably make a rough guess of what Tesseract would want, and make a sample of pre-processed pages to train the model on. But then if I want to change the pre-processing methods, do I have to retrain it each time? That seems to make sense, but it also seems like a long process.
Can ImageMagick do this for me?
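Probably yes. A minimal sketch in Python that only wraps the ImageMagick command line, assuming ImageMagick is installed (`magick` for version 7, `convert` for version 6). The function names and the idea of an optional threshold are illustrative assumptions, not a tested recipe for handwriting:

```python
import shutil
import subprocess


def build_magick_command(src, dst, threshold_percent=None):
    """Return an ImageMagick argument list that greyscales and stretches contrast."""
    # ImageMagick 7 ships a `magick` binary; ImageMagick 6 uses `convert`.
    binary = "magick" if shutil.which("magick") else "convert"
    args = [binary, src, "-colorspace", "Gray", "-normalize"]
    if threshold_percent is not None:
        # Optional hard black/white cut-off; the right value is a guess per notebook.
        args += ["-threshold", f"{threshold_percent}%"]
    args.append(dst)
    return args


def preprocess(src, dst, threshold_percent=None):
    """Run the command; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(build_magick_command(src, dst, threshold_percent), check=True)
```

Keeping the settings in one function means the same steps can be re-run on both the training sample and the real pages whenever the pre-processing changes.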
OCR
Tesseract seems like a good one. The things I have seen about this don't make me super optimistic about the quality of output I will get. But there are a few reasons I think it is worth exploring anyway:
- I will be training and using only one person's handwriting.
- My handwriting is fucking immaculate.
- I am interested in the translations and what is lost/gained there too.
- If it did work, it would be really fun to do some computational linguistics on my notebooks. Make a concordance of them. Search through them. Make them into a website and print them out again.
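The concordance idea could start as something like this: a small keyword-in-context sketch over a plain .txt transcription, standard library only. The function name and the `width` parameter are made up for illustration:

```python
import re
from collections import defaultdict


def build_concordance(text, width=3):
    """Map each lowercased word to a list of (left context, word, right context)."""
    words = re.findall(r"[\w']+", text)
    concordance = defaultdict(list)
    for i, word in enumerate(words):
        left = " ".join(words[max(0, i - width):i])
        right = " ".join(words[i + 1:i + 1 + width])
        concordance[word.lower()].append((left, word, right))
    return dict(concordance)


if __name__ == "__main__":
    sample = "The cat sat. The dog sat on the cat."
    for left, word, right in build_concordance(sample, width=2)["sat"]:
        print(f"{left:>12} [{word}] {right}")
```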
train Tesseract
Using a dataset of my own pages. Prep the dataset images? Make the dataset. Train the software. Test it. Repeat.
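A sketch of the "make the dataset" step, assuming the layout used by the tesstrain tooling: one image per line of text, each with a matching `.gt.txt` file holding the hand-typed transcription. The helper names, the 20% hold-out, and the folder layout are assumptions for illustration:

```python
import random
from pathlib import Path


def collect_pairs(folder, image_suffixes=(".png", ".tif")):
    """Return (paired, missing): line images with and without a .gt.txt file."""
    paired, missing = [], []
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in image_suffixes:
            continue
        truth = image.parent / (image.stem + ".gt.txt")
        if truth.exists():
            paired.append((image, truth))
        else:
            missing.append(image)
    return paired, missing


def split_pairs(pairs, test_fraction=0.2, seed=0):
    """Shuffle reproducibly and hold back a fraction of pairs for testing."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction)) if shuffled else 0
    return shuffled[n_test:], shuffled[:n_test]
```

Holding back some lines that the model never sees in training is what makes the "test it" step meaningful.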
Output flawless .txt transcriptions of my handwritten nonsense
How could this be possible? What about the location of the notes on the page? What about all the vertical text and columned text? I really don't think this will go well. How long would it take to just transcribe them by hand instead?
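On the location question: Tesseract can write a TSV file (for example `tesseract page.png out tsv`) that gives a bounding box and a confidence for each recognised word, so position on the page is not necessarily lost. A sketch of reading that output follows. The sample rows are invented for illustration, and how well this copes with vertical or columned handwriting is untested here:

```python
import csv
import io

# Invented sample in Tesseract's TSV column layout, not real output.
SAMPLE_TSV = "\n".join([
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext",
    "1\t1\t0\t0\t0\t0\t0\t0\t800\t600\t-1\t",
    "5\t1\t1\t1\t1\t1\t40\t50\t90\t30\t91.5\thello",
    "5\t1\t2\t1\t1\t1\t500\t60\t70\t28\t42.0\tmargin",
])


def words_with_boxes(tsv_text, min_conf=0.0):
    """Return word-level rows (level 5) as dicts with text, position and block."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if row["level"] != "5" or not text:
            continue  # skip page/block/line rows and empty words
        if float(row["conf"]) < min_conf:
            continue
        words.append({
            "text": text,
            "left": int(row["left"]),
            "top": int(row["top"]),
            "width": int(row["width"]),
            "height": int(row["height"]),
            "block": int(row["block_num"]),
        })
    return words
```

The `block` number is Tesseract's own guess at page layout, so notes in a margin or a second column may show up as separate blocks.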
combine the images and .txt into a multipage PDF
The original images, not the ones Tesseract uses.
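One possible route, sketched under assumptions: Tesseract accepts a text file listing one image path per line and can combine the results into a single searchable PDF (the image plus an invisible text layer). The sketch below only orders the pages and builds the command. The helper names and the `lang` value are placeholders. Note that this runs the OCR on whichever images are listed, so listing the originals means recognising the originals too. Tesseract also has a `textonly_pdf` config variable that writes just the text layer, which might be overlaid on the original scans with a separate PDF tool, but that step is not sketched here.

```python
import re
from pathlib import Path


def natural_key(path):
    """Sort key that orders page_2 before page_10."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", Path(path).name)]


def write_page_list(images, list_path):
    """Write image paths, one per line in page order, and return that order."""
    ordered = sorted(images, key=natural_key)
    Path(list_path).write_text(
        "\n".join(str(p) for p in ordered) + "\n", encoding="utf-8")
    return ordered


def build_pdf_command(list_path, output_base, lang="eng"):
    """Argument list for Tesseract; the result lands at <output_base>.pdf."""
    # `lang` would be the name of the custom-trained model (placeholder here).
    return ["tesseract", str(list_path), str(output_base), "-l", lang, "pdf"]
```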