User:Andre Castro/2.1/research-experiments-log: Difference between revisions
Revision as of 18:29, 10 October 2012
Experiment's Log
One experiment per day
03.10.2012
Sharing my digital library
Steps
- put my calibre ebook library on a machine that stays online for 24h
- start calibre content server
$ calibre-server
which points to my Calibre library and serves it at machine_ip:8080
- check library remotely
Problem!! The Calibre library is only accessible on the local LAN I am on. http://www.mobileread.com/forums/showthread.php?t=160387
- in order to access the library remotely I have to have calibre installed on a server.
- a calibre server on a LAN becomes a bit redundant. It may be handy for moving books to ereaders and exchanging them with people nearby (on the same LAN), but it is not yet a strong strategy for sharing books with someone in another part of the world.
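A quick way to check whether the address the content server runs on is LAN-only is to test it against the private address ranges. This is a minimal sketch; the addresses are made-up examples, not the ones from this experiment, and a public address can still be blocked by a router or firewall.

```python
import ipaddress

def is_lan_only(address: str) -> bool:
    """Return True if the address is private or loopback, so not reachable from the internet."""
    ip = ipaddress.ip_address(address)
    return ip.is_private or ip.is_loopback

# Hypothetical example addresses, for illustration only.
for addr in ("192.168.1.20", "10.0.0.5", "8.8.8.8"):
    print(addr, "LAN only" if is_lan_only(addr) else "publicly routable")
```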
04.10.2012
opening and modifying an epub
Too much talk about ebooks without actually looking at the insides of one. That is what I will do, and then create a new one with a section of the original.
- choose a book: Japanese Fairy Tales by Yei Theodora Ozaki
dissecting the epub
unzip the epub: there are 2 directories;
A) 4018/ - content dir
A.1) html ,css - html content and style
A.2) content.opf - Open Packaging Format metadata file (can be called anything, but content.opf is the convention) - specifies the location of all the contents in the book + the metadata (in xml)
A.2.1) metadata: required terms: title and identifier (the identifier must be a unique value, although it's up to the digital book creator to define that unique value)
A.2.2) manifest: all the content files that are part of the book
A.2.3) spine: indicates the order in which the files appear in the ebook - but not extraneous ones (like beginning and end)
A.2.4) guide: (not required) explains what each section means
A.3) toc.ncx - The NCX defines the table of contents, but also metadata (overlaps w/ content.opf)
A.3.1) metadata- requires:
- uid: unique ID for the digital book. Should match the dc:identifier in the OPF file.
- depth: the level of the hierarchy in the table of contents
- totalPageCount and maxPageNumber: apply only to paper books and can be left at 0.
A.3.2) navMap: contains the navPoints
A.3.2.1) navPoint:
- playOrder - reading order. (same as itemref elements in the OPF spine).
- navLabel/text describes the title of this book section, a chapter title or number
- content: its src attribute points to a content file (a file declared in the OPF manifest). It can also point to anchors within the XHTML, e.g. content.html#footnote1.
A.4) cover;
B) META-INF/container.xml (pointing to content.opf) - EPUB reading systems will look for this file first, as it points to the location of the metadata for the digital book.
B.2) META-INF can also contain files for digital signatures, encryption, and DRM
C) ./mimetype - file containing 'application/epub+zip'
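The structure above can also be read programmatically. This is a minimal sketch using only the Python standard library: it follows META-INF/container.xml to the OPF file, then lists the manifest files in spine order. The function names are my own, and it assumes an unencrypted EPUB 2 style package.

```python
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

def find_opf_path(epub: zipfile.ZipFile) -> str:
    """Read META-INF/container.xml and return the path of the OPF file."""
    root = ET.fromstring(epub.read("META-INF/container.xml"))
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    return rootfile.attrib["full-path"]

def spine_files(epub: zipfile.ZipFile) -> list:
    """Return the manifest hrefs in spine (reading) order."""
    opf = ET.fromstring(epub.read(find_opf_path(epub)))
    # Map each manifest id to its file, then resolve the spine idrefs.
    manifest = {item.attrib["id"]: item.attrib["href"]
                for item in opf.findall("opf:manifest/opf:item", OPF_NS)}
    return [manifest[ref.attrib["idref"]]
            for ref in opf.findall("opf:spine/opf:itemref", OPF_NS)]
```

To use it, open a book with `zipfile.ZipFile("my-book.epub")` and pass the resulting object to both functions.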
packaging the epub into an epub+zip file
- create the new ZIP archive and add the mimetype file (no compression)
$ zip -0Xq my-book.epub mimetype
- add the remaining items
$ zip -Xr9Dq my-book.epub *
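-0 stores the mimetype file without compression; -X and -D minimize extraneous information in the .zip file; -r adds files from directories recursively; -9 selects the strongest compression; -q runs quietly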
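The same packaging can be sketched with Python's zipfile module, which makes the one rule explicit: mimetype goes in first and uncompressed. This roughly mirrors the two zip commands above but is not a byte-for-byte equivalent. The function name and paths are illustrative assumptions.

```python
import os
import zipfile

def package_epub(source_dir: str, epub_path: str) -> None:
    """Zip source_dir into epub_path with the mimetype file first and uncompressed."""
    with zipfile.ZipFile(epub_path, "w") as epub:
        # Like `zip -0`: the mimetype entry is stored, not deflated, and comes first.
        epub.write(os.path.join(source_dir, "mimetype"), "mimetype",
                   compress_type=zipfile.ZIP_STORED)
        # Like `zip -r9D`: add the remaining files recursively, compressed,
        # without separate directory entries.
        for folder, _dirs, files in os.walk(source_dir):
            for name in sorted(files):
                full = os.path.join(folder, name)
                arcname = os.path.relpath(full, source_dir).replace(os.sep, "/")
                if arcname != "mimetype":
                    epub.write(full, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

The output file should be written outside source_dir, otherwise the archive would try to include itself.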
Conclusion
- in the epub the lines of text overlapped heavily, maybe because I didn't set a font size in the css (text-height:200%; solved it)
- however, on a Kindle it looks fine. Why? Because the Kindle imposes its own font on the text - 'If you are publishing to Kindle it forces your font into it's own custom font so it doesn't matter'. This shows how closed the device is.
- even though the process of creating an epub and its structure are straightforward, writing the .opf and .ncx by hand, without any automated process, is a pain (but important for understanding the epub's structure) and can easily lead to errors.
- it also does not seem the easiest format to experiment with. The need to declare all the files used and the metadata, as well as the packaging, will make one think twice before trying something out.
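Part of the hand-writing pain could be removed by generating the manifest and spine from a list of files. This is a minimal sketch, not a complete or validated OPF: the function name, the ids (`bookid`, `chapterN`) and the `language` parameter are my assumptions. As far as I know the OPF 2.0 spec also expects a language entry, so one is included.

```python
import xml.etree.ElementTree as ET

OPF = "http://www.idpf.org/2007/opf"
DC = "http://purl.org/dc/elements/1.1/"

def build_opf(title: str, identifier: str, chapters: list, language: str = "en") -> str:
    """Return a minimal OPF document that lists chapters in manifest and spine."""
    ET.register_namespace("", OPF)
    ET.register_namespace("dc", DC)
    package = ET.Element(f"{{{OPF}}}package",
                         {"version": "2.0", "unique-identifier": "bookid"})
    metadata = ET.SubElement(package, f"{{{OPF}}}metadata")
    ET.SubElement(metadata, f"{{{DC}}}title").text = title
    ET.SubElement(metadata, f"{{{DC}}}identifier", {"id": "bookid"}).text = identifier
    ET.SubElement(metadata, f"{{{DC}}}language").text = language  # assumed default
    manifest = ET.SubElement(package, f"{{{OPF}}}manifest")
    spine = ET.SubElement(package, f"{{{OPF}}}spine", {"toc": "ncx"})
    # The NCX table of contents is declared in the manifest but is not part of the spine.
    ET.SubElement(manifest, f"{{{OPF}}}item",
                  {"id": "ncx", "href": "toc.ncx",
                   "media-type": "application/x-dtbncx+xml"})
    for number, href in enumerate(chapters, start=1):
        item_id = f"chapter{number}"
        ET.SubElement(manifest, f"{{{OPF}}}item",
                      {"id": item_id, "href": href,
                       "media-type": "application/xhtml+xml"})
        ET.SubElement(spine, f"{{{OPF}}}itemref", {"idref": item_id})
    return ET.tostring(package, encoding="unicode")
```

The same list of files could drive the navPoints of the toc.ncx, which would keep the two files from drifting apart.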
resources
http://www.ibm.com/developerworks/xml/tutorials/x-epubtut/index.html
05.10.2012
Began an ebook - Spam Life - A compilation of folk stories and fairy tales from the year 2012
Original Message --------
Subject: Contribute with spam emails
Date: Fri, 05 Oct 2012 18:36:49 +0200
From: andre castro <andrecastro83@gmail.com>
To: undisclosed-recipients:;
Hia,
I am emailing to ask you for a favor.
Do you happen to have received sometime recently spam email, in a quite
personal tone and addressing you? Do think they might be somewhere in
you inbox or spam?
If you do so, or receive one in the next few day, I'd like to ask you,
you to forward them to me as I am putting together a compilation of
those particular spam emails. They can be in any language
Thank you
Best
a
The result won't take long
09.10.12
I made a few attempts to use Sigil for the Spam book project, but it is far from ideal. It adds excessive markup by itself and makes the whole business quite messy, especially when working with large amounts of text, where automated processes are very handy.
What I will try to do today is divide the creation of an epub into 4 stages.
- Have all the content in plain text with mediawiki notation - mostly headings, bold, and italics
- Use a parser to generate an html file from the plain text file
- Use calibre to convert the html to epub.
- Question: What is the fundamental difference between an ebook and a website? Can't a website, mostly made of text and re-flowable, be an ebook? With perhaps even more potential, and easier to experiment with?
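The parsing stage could start as small as the sketch below, which handles only the notation named above (headings, bold, italics) and wraps every other non-empty line in a paragraph. It is an illustrative toy, not a real MediaWiki parser, and the function name is my own.

```python
import html
import re

def wiki_to_html(text: str) -> str:
    """Convert a small MediaWiki subset (headings, bold, italics) to HTML."""
    blocks = []
    for line in text.splitlines():
        # Escape &, < and > but keep apostrophes, which carry the wiki markup.
        line = html.escape(line.strip(), quote=False)
        if not line:
            continue
        # '''bold''' must be replaced before ''italic''.
        line = re.sub(r"'''(.+?)'''", r"<b>\1</b>", line)
        line = re.sub(r"''(.+?)''", r"<i>\1</i>", line)
        heading = re.fullmatch(r"(={1,6})\s*(.+?)\s*\1", line)
        if heading:
            level = len(heading.group(1))
            blocks.append(f"<h{level}>{heading.group(2)}</h{level}>")
        else:
            blocks.append(f"<p>{line}</p>")
    return "\n".join(blocks)

print(wiki_to_html("== Spam Life ==\nA '''bold''' and ''italic'' line"))
```

The resulting HTML file could then be handed to calibre's command-line converter, for example `ebook-convert book.html book.epub` (file names assumed).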
10.10.12
Character Encoding of an image-pdf.
I wanted to get hold of Licklider's Libraries of the Future. The only online copy I found was a pdf made up of scanned images. I thought this could be a good opportunity to get some hands-on experience with the character recognition process.
Steps
- extract separate images out of a pdf
- convert the output image files to TIFF (single-bit, uncompressed) so that tesseract (OCR software) can read them.
- using imagemagick
convert -monochrome -density 600 source.pdf page.tif
-density refers to the resolution of the scanned image (the higher the better), here 600 dpi
- using gimp (as documented in http://alexsleat.co.uk/2010/04/12/howto-simple-tesseract-usage-guide-ocr/ )
- converted the 2 images into a text file using tesseract
tesseract image.tif text
Outputs the result as a .txt file (here text.txt)
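For a whole book, the two commands above could be wrapped in a small script. This sketch only reuses the exact flags from this log. It assumes the ImageMagick 6 style `convert` binary and `tesseract` are on the PATH, and it does nothing if they are not. The function names are my own.

```python
import shutil
import subprocess

def ocr_commands(pdf: str, tiff: str, text_base: str) -> list:
    """Return the two command lines from the steps above as argv lists."""
    return [
        ["convert", "-monochrome", "-density", "600", pdf, tiff],
        ["tesseract", tiff, text_base],  # tesseract writes text_base + ".txt"
    ]

def run_ocr(pdf: str, tiff: str = "page.tif", text_base: str = "text") -> bool:
    """Run both commands if the tools are installed; return False otherwise."""
    if not (shutil.which("convert") and shutil.which("tesseract")):
        return False
    for command in ocr_commands(pdf, tiff, text_base):
        subprocess.run(command, check=True)
    return True
```

Passing the arguments as lists avoids shell quoting problems with file names that contain spaces.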
Results
- gimp tiff: resulting text and image
but rather that publication has been extended far beyond our present ability to make real use of the record. The summa- tion of human experience is being expanded at a prodigious rate, and the means we use for threading through the con- sequent maze to the momentarily important item is the same as was used in the days of square-rigged ships.*
- imagemagick tiff: resulting text and image
§;;iuEy ix’: af ihszfé aiaximiai gzimi *»;<~°2:::ii?s,:s§;§-°‘ iéf §§§‘:‘T.°:f:3é‘;‘;i%E3§-°-£§3:3;}I imafiriégis, bait m1:E:::::* that Apwiiaaiim has bi?-iii} §:E;}i§:€i3fl§€3-ii fa: b€::«,.«~’§;3m?§ mu: p=:°‘@$e:m sgibifiiiy Ea mam ma} we Sf that rzaaazaig ’E’ha samzmaw timz Qf Emmazfa m:§§:r§V€.%:1a$ £3 baffsizzzg $.I><Z§&§‘fid€€§ at :5: pradigiaaza mm, and tha;: X?JE€;?3£:iE.§S ass: far fifamadizfag ifiarmzgfiz ‘mg: mam wqumii mam‘: ix} th.s::‘r 3:m‘m'2e:,*:12.€z21°§§:; impamgazzt itazm is that: Sams :13 wzag Lzsmi in am: days mi Szzgzarawrigggad giaipgfi‘
Conclusion
To get the whole book converted into text I will have to fine-tune the imagemagick conversion to TIFF, so that tesseract does a better job at character recognition.
Just to see how tesseract would react to a scanned TIFF, I scanned a printed text on a flatbed scanner at 600 dpi directly to a TIFF file. The result was very good.
Here are both the scan and the text
Introduction In the preface to A Contribution to the Critique of Political Economy, Marx argues that, ‘at a certain stage of development, the material productive forces of society come in conflict with the existing relations of production’? What is possible in the information age is in direct con- flict with what is permissible. Publishers, film producers and the telecommunication indus- try conspire with lawmakers to bottle up and sabotage free networks, to forbid information from circulating outside of their control. The corporations in the recording industry attempt to forcibly maintain their position as mediators between artists and fans, as fans and artists merge closer together and explore new ways of interacting.
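When tuning the conversion settings, a crude number helps to compare tesseract outputs without reading each one. The sketch below uses the share of non-space characters that are letters. This is only a heuristic of my own choosing: it does not check whether the words are real, and a dictionary lookup would be stricter. The two samples are excerpts from the results above.

```python
def letter_ratio(text: str) -> float:
    """Share of non-whitespace characters that are letters (0.0 for empty text)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)

# Excerpts from the gimp and imagemagick results in this log.
GOOD = "but rather that publication has been extended far beyond our present ability to make real use of the record."
BAD = "wqumii mam‘: ix} th.s::‘r 3:m‘m'2e:,*:12.€z21°§§:; impamgazzt itazm is that: Sams"

print("gimp tiff:", round(letter_ratio(GOOD), 2))
print("imagemagick tiff:", round(letter_ratio(BAD), 2))
```

A clearly lower ratio for one conversion setting suggests that tesseract is reading noise rather than letters.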