tesseract nameofpicture.png outputbase<br><br>

[[File:scan_source.png | 600px | thumbnail | left | Scanning a book page]]
<br clear=all>
[[File:ocr_output.png | 600px | thumbnail | left | Output: character recognition with tesseract-ocr / styled with javascript ]]
<br clear=all>

https://github.com/tesseract-ocr/tesseract

See pad: https://pad.xpub.nl/p/IFL_2018-05-14

== Prototyping ==
=== Image classifier for annotations ===

At the time of this special issue, annotations were a point of interest for everyone. We were reading and annotating texts together and debating ways of sharing these notes. One particular discussion was about what could or should count as annotation: folding the corners of pages, linking to other content, highlighting, scribbling, drawing. I was curious whether we could train a computer to see all of these traces, so I started prototyping some examples.

Aim: make the computer distinguish "clean" book pages from "annotated" ones.

Using the script from [[.py.rate.chnic_sessions#29.10.2018:_Zalan_.26_Alex|.py.rate.chnic session 2]], [https://pad.xpub.nl/p/pyrate2| pad notes here], and [https://git.xpub.nl/aaaa/learning_algorithms/src/branch/master/ImageClassificationPython| Alex's git here]. My data set [https://git.xpub.nl/rita/image_classifier_annotation here].

[[File:annotated_eg.png | 600px | thumbnail | left | "Annotated" example from data set > test set]]
<br clear=all>

[[File:clean_eg.png | 600px | thumbnail | left | "Clean" example from data set > test set]]
<br clear=all>

Each set (test and training) had 50 examples of "clean" and "annotated" pages; it makes sense to add more in the future.<br>
The results were not very accurate. Pages with handwritten text gave better results, while highlighting and computer notes were often misclassified.
It would be useful to see what the computer is looking for, to understand whether the script breaks the image into parts, and to try other scripts.

Some results:
<gallery | widths=200px heights=200px>
test10.jpg.predicted.png|Right prediction
test5.jpg.predicted.png |Right prediction
test2.jpg.predicted.png|Wrong prediction
test6.jpg.predicted.png|Wrong prediction
</gallery>

=== Computer categorization for text files ===

Categorizing and cataloging happen in the most mundane activities, but they are not innocent. They translate values and particular visions of the world.<br>
In the Rietveld Academy Library, we saw how the librarians are challenging the Library of Congress classification. With Dušan we browsed the Monoskop Index, an interesting combination of a “book index, library catalog, and tag cloud”.<br>
With this script, I was experimenting with an automated classification of text files. The script finds the three most common words in the text and tries to match them to a category. For example, if one of the most common words is “books”, the text is assigned the category “Library Studies”. The same would happen with the words “archives”, “author”, “bibliographic”, “bibliotheca”, “book”, “bookcase”, etc. The script only has one category right now, but it would be easy to add more. By doing so, I would be making associations that are very personal and sometimes inaccurate, and I would be building a bias into the catalog.

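The matching described above can be sketched roughly as follows. This is not the original script, only a minimal illustration of the idea. The names <code>CATEGORIES</code>, <code>STOPWORDS</code>, <code>most_common_words</code> and <code>categorize</code> are hypothetical, the stop-word filter and the "Uncategorized" fallback are assumptions that the text does not mention, and the keyword set contains only the words quoted above.

```python
import re
from collections import Counter

# Hypothetical keyword table: one category, with only the trigger words
# that the text above names explicitly.
CATEGORIES = {
    "Library Studies": {
        "archives", "author", "bibliographic", "bibliotheca",
        "book", "bookcase", "books",
    },
}

# Assumption: very frequent function words are skipped, otherwise the
# top three words of most English texts would be "the", "of", "and".
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "that", "for", "on", "with", "as"}


def most_common_words(text, n=3, stopwords=STOPWORDS):
    """Return the n most frequent words, lowercased, without stop words."""
    # Assumption: a "word" is a run of ASCII letters.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return [word for word, _ in counts.most_common(n)]


def categorize(text, categories=CATEGORIES):
    """Return every category whose keywords include one of the top words."""
    top = most_common_words(text)
    matches = [name for name, keywords in categories.items()
               if keywords.intersection(top)]
    # "Uncategorized" is a hypothetical fallback label.
    return matches or ["Uncategorized"]
```

Adding a category means adding one more entry to <code>CATEGORIES</code>, which is exactly where the personal associations and the bias mentioned above would enter.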
[[File:Common words.png|600px |thumbnail|left| Testing it with Balázs Bodó's text, Own Nothing ]]
<br clear=all>
Publishing an “image gallery”
ImageMagick’s suite of tools includes montage, which is quite flexible and useful for making a quick overview page of images.
- mogrify
- identify
- convert
Sizing down a bunch of images
Warning: MOGRIFY MODIFIES THE IMAGES – ERASING THE ORIGINAL – make a copy of the images before you do this!!!
mogrify -resize 1024x *.JPG
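Since mogrify overwrites its input, the copy step can be scripted before resizing. The sketch below is only one possible way to do it. The function names <code>backup_images</code> and <code>resize_in_place</code> and the folder names in the usage comment are hypothetical, and it assumes the <code>mogrify</code> command is installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def backup_images(src_dir, backup_dir, pattern="*.JPG"):
    """Copy every matching image into backup_dir and return the copies."""
    src, dst = Path(src_dir), Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(p, dst / p.name))
            for p in sorted(src.glob(pattern))]


def resize_in_place(src_dir, width=1024, pattern="*.JPG"):
    """Run `mogrify -resize <width>x` on the originals, overwriting them."""
    files = [str(p) for p in sorted(Path(src_dir).glob(pattern))]
    if files:
        # Same call as the command above, with the glob expanded in Python.
        subprocess.run(["mogrify", "-resize", f"{width}x", *files], check=True)


# Usage (hypothetical folder names):
# backup_images("photos", "photos_backup")
# resize_in_place("photos")
```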
Fixing the orientation of images
mogrify -auto-orient *.JPG
Using Montage
montage -label "%f" *.JPG \
-shadow \
-geometry 1000x1000+100+100 \
montage.caption.jpg
Using pdftk to put things together
OCR
simple tesseract:
tesseract nameofpicture.png outputbase
Output: character recognition with tesseract-ocr / styled with javascript