User:Bohye Woo/Prototyping 03

From XPUB & Lens-Based wiki

PROTOTYPING EXPERIMENTS

TESSERACT — OCR/HOCR

<script type="text/javascript">

    // Collect every element with the class 'ocr_line'
    var lines = document.querySelectorAll(".ocr_line");

    // Loop through each line element
    for (var i = 0; i < lines.length; i++) {

      var line = lines[i];
      console.log(line.title);

      // The hOCR title looks like "bbox x0 y0 x1 y1; ..."; split it on spaces
      var lineParts = line.title.split(" ");
      console.log(lineParts);

      // bbox gives two corners; width and height are measured from the top-left one.
      // The names avoid 'top': in a browser a global 'var top' collides with the read-only window.top.
      var lineLeft = parseInt(lineParts[1], 10);
      var lineTop = parseInt(lineParts[2], 10);
      var lineWidth = parseInt(lineParts[3], 10) - lineLeft;
      var lineHeight = parseInt(lineParts[4], 10) - lineTop; // parseInt ignores a trailing ";"

      // Position the line absolutely on the page using the parsed values
      line.style.cssText = "position: absolute; left: " + lineLeft + "px; top: " + lineTop + "px; width: " + lineWidth + "px; height: " + lineHeight + "px; border: 5px solid lightblue";

      var words = line.querySelectorAll(".ocrx_word");

      for (var e = 0; e < words.length; e++) {

        var span = words[e];
        console.log(span.title);

        var wordParts = span.title.split(" ");
        console.log(wordParts);

        var wleft = parseInt(wordParts[1], 10);
        var wtop = parseInt(wordParts[2], 10);
        var wwidth = parseInt(wordParts[3], 10) - wleft;
        var wheight = parseInt(wordParts[4], 10) - wtop;

        // Word offsets are relative to the enclosing, absolutely positioned line
        span.style.cssText = "position: absolute; left: " + (wleft - lineLeft) + "px; top: " + (wtop - lineTop) + "px; width: " + wwidth + "px; height: " + wheight + "px; border: 2px solid purple";
      }
    }

</script>
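
The script above reads the box coordinates by position after splitting the title on spaces. As a minimal sketch of an alternative, the helper below pulls the four bbox numbers out with a regular expression, so it does not depend on bbox being the first property in the title. The name parseBbox and the sample title string are made up for illustration; they are not part of Tesseract's output or of the script above.

```javascript
// Hypothetical helper: extract the bounding box from an hOCR title string
// such as "bbox 36 92 580 122; baseline 0 -6".
function parseBbox(title) {
  var match = /bbox (\d+) (\d+) (\d+) (\d+)/.exec(title);
  if (!match) {
    return null; // no bbox property found in this title
  }
  var x0 = parseInt(match[1], 10);
  var y0 = parseInt(match[2], 10);
  var x1 = parseInt(match[3], 10);
  var y1 = parseInt(match[4], 10);
  // hOCR stores two corners; convert them to left/top/width/height
  return { left: x0, top: y0, width: x1 - x0, height: y1 - y0 };
}

// Example with an invented title value
var box = parseBbox("bbox 36 92 580 122; baseline 0 -6");
console.log(box);
```

Returning null for a title without a bbox lets the caller skip elements that carry no position instead of styling them with NaN values.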


DIGITAL WATERMARKING

According to the Digital Watermarking Alliance: http://digitalwatermarkingalliance.org/about/quick-facts/

   Digital watermarking is the process by which identifying data is woven into media content such as images, printed materials, movies, music or TV programming, giving those objects a unique, digital identity that can be used for a variety of valuable applications.
   ...
   Digital watermarking can enable content identification, forensic tracking and copyright communication on a broad scale and can provide a range of solutions for identifying, securing, managing and tracking digital images, audio, video, and printed materials.

* Verso Books

Notes

  • We found our names and emails on the covers of the books, and our names on each of the files (e.g. 02_Copyrightbohye5416yournamegmailcom.xhtml: This eBook is licensed to your-name, id@gmail.com on 05/13/2019).
  • We need to target these texts and remove them (how?).
  • They were added to the files when we downloaded the item. A BooXtream video explains how it works:

industry-standard EPUB file + customer data → BooXtream API → download link to BooXtream-watermarked file: multiple invisible watermarks encoded with personal information data (also Kindle files), Ex Libris

Watermarks

[WM0] -- Ex Libris Image Watermark

Ex Libris Image Watermark is a small personalised bookplate. (visible)

[WM1] -- Disclaimer Page Watermark: (visible)

(e.g. This ebook was sold to bohye woo, bohyeklaire@gmail.com on 13/05/2019. Verso ebooks are free of Digital Rights Management (DRM-free) but are subject to the terms of this license. You own this file once you’ve downloaded it, and you can use it on any of your devices. It has visible and invisible watermarks, applied by Booxtream, which contain your name and email address. You are prohibited from uploading Verso ebooks to any website or file-sharing network, or in any other way making them available for distribution, sharing, copying, downloading, or reselling. Royalties from every sale will be paid to the author: if you’re reading someone else’s copy, then please buy your own license from Verso Books.)

[WM2] -- Footer Watermarks: (visible)

It appears at the bottom, at the end of every chapter: (e.g. This eBook is licensed to bohye woo, bohyeklaire@gmail.com on 05/13/2019)
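
Because the footer always starts with the same wording, one way to target it is to search each chapter file for that phrase. The sketch below works on the file contents as a string and assumes the footer sits in its own p element with no nested tags. That markup is an assumption, not something checked against a BooXtream file, and the function name stripLicenceFooter is made up for illustration. An XML parser would be more robust than a regular expression.

```javascript
// Hypothetical helper: remove any <p> element whose text contains the
// fixed footer phrase. Assumes the footer is a simple paragraph without
// nested tags (an assumption about the markup, not a verified fact).
function stripLicenceFooter(xhtml) {
  var footer = /<p\b[^>]*>[^<]*This eBook is licensed to[^<]*<\/p>\s*/g;
  return xhtml.replace(footer, "");
}

// Example with invented chapter content
var chapter = '<body><p>End of chapter.</p>' +
  '<p class="x">This eBook is licensed to some name, name@example.org on 05/13/2019</p></body>';
console.log(stripLicenceFooter(chapter));
```

Ordinary paragraphs are left alone because the pattern only matches a p element that contains the phrase itself.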

[WM3] -- Filename Watermarks (invisible)

https://sigil-ebook.com/

[WM4] -- Timestamp Fingerprinting

[WM5] -- CSS Watermark

Remove the boekstaaf class from the CSS stylesheet.
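
As a minimal sketch, removing the class could mean deleting every rule in the stylesheet whose selector mentions .boekstaaf. The helper below does this on the stylesheet text with a regular expression. The function name removeClassRules and the sample CSS are made up for illustration, and it assumes flat rules (no nested blocks such as @media) and a class name made of plain letters. A rule with a grouped selector such as "p, .boekstaaf" would be removed as a whole.

```javascript
// Hypothetical helper: delete every flat CSS rule whose selector contains
// the given class. Assumes className is a plain identifier (no regex
// metacharacters) and that rules are not nested inside other blocks.
function removeClassRules(css, className) {
  var rule = new RegExp(
    "(?:[^{}\\s][^{}]*)?\\." + className + "\\b[^{}]*\\{[^{}]*\\}\\s*",
    "g"
  );
  return css.replace(rule, "");
}

// Example with an invented stylesheet
var css = "p { margin: 0; }\n.boekstaaf { color: #fff; }\nh1 { font-size: 2em; }\n";
console.log(removeClassRules(css, "boekstaaf"));
```

This only edits the stylesheet. Any class="boekstaaf" attributes in the XHTML files would still be there and would need a separate pass.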

[WM6] -- Image Metadata Watermarks

How to use ExifTool: http://xahlee.info/img/metadata_in_image_files.html and https://www.sno.phy.queensu.ca/~phil/exiftool/

EXIFTOOL

Steganography tool

Steganography with steghide: https://www.blackmoreops.com/2017/01/11/steganography-in-kali-linux-hiding-data-in-image/

apt-get install steghide

Experiment

A forensic investigation of labour: to reveal the hidden labour in shadow libraries, participants act as detectives looking for digital labour footprints. In detective mode, we track down the traces of labour left throughout the process of selecting, copying, scanning, uploading, downloading, modifying, editing and categorizing files, adding metadata, and so on.


We download a digital book, "Fifty Shades of Grey", in different formats such as MOBI, PDF and EPUB, and use a metadata-revealing tool to carry out a forensic investigation of the hidden labour.

FINAL WORKSHOP – BLURRY BOUNDARY

https://git.xpub.nl/pedrosaclout/Workshop_Folder

Makefile

src=$(shell ls *.jpeg)
pdf=$(src:%.jpeg=%.pdf)


pdf: $(pdf)

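# Normalise filenames: every space becomes an underscore, .jpg becomes .jpeg
# (uses the Perl-style rename command)
zapspaces:
	rename "s/ /_/g" *
	rename "s/\.jpg/.jpeg/" *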

# Scan.pdf: Scan.jpeg
# 	tesseract Scan.jpeg Scan -l eng pdf

%.pdf: %.jpeg
	tesseract $*.jpeg $* -l eng pdf

# %.ppm: %.jpeg
# 	convert $*.jpeg $*.ppm
#
# %.un.jpeg: %.un.ppm
# 	convert $*.un.ppm $*.un.jpeg
#
# %.un.jpeg: %.jpeg
# 	convert $*.jpeg tmp.ppm
# 	unpaper tmp.ppm tmp2.ppm
# 	convert tmp2.ppm $*.un.jpeg
# 	rm tmp.ppm
# 	rm tmp2.ppm
#
# #debug vars
# print-%:
# 	@echo '$*=$($*)'

OCR all jpegs

#!/bin/bash

cd "$(dirname "$0")"
make zapspaces
make

Merge all the pdfs together

#!/bin/bash

cd "$(dirname "$0")"
pdftk *.pdf cat output newfile.pdf