User:Pedro Sá Couto/Prototyping 3rd: Difference between revisions
Line 77: | Line 77: | ||
File:indexing.gif | File:indexing.gif | ||
</gallery> | </gallery> | ||
[[File:Indexing.gif|thumbnail]] | |||
===Verso Books=== | ===Verso Books=== |
Revision as of 18:31, 25 May 2019
Watermarks, downloading, deleting
https://pad.xpub.nl/p/IFL_2018-05-13
JSOR (Ithaka Harbors, Inc) publishers
- download an article from http://jstor.org/ from within the school´s internet
- search for watermarks on the PDF
- look for some leads in https://sourceforge.net/p/pdfedit/mailman/message/27874955/ — https://github.com/kanzure/pdfparanoia/issues?utf8=%E2%9C%93&q=jstor
Gather Direct messages
- While working around the watermarks that JSTOR leaves in their PDFs, we understood that in their website, one can have a preview through the whole document without any of these watermarks. The idea began to download it directly from these preview images rather than actually downloading it from a JSTOR account.
- https://www.jstor.org/stable/23267102?Search=yes&resultItemClick=true&searchText=tea&searchUri=%2Faction%2FdoBasicSearch%3FQuery%3Dtea%26amp%3Bacc%3Don%26amp%3Bfc%3Doff%26amp%3Bwc%3Don%26amp%3Bgroup%3Dnone&ab_segments=0%2Fl2b-basic-1%2Frelevance_config_with_tbsub_l2b&refreqid=search%3A9fd0deff8d3258de87d3b54d6dfad664&seq=1#metadata_info_tab_contents the idea is to create a script that iterates through the url sequence number("seq=i")
- https://git.xpub.nl/pedrosaclout/jsort_scrape
# import libraries
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import os
import time
import datetime
from pprint import pprint
import requests
import multiprocessing
import base64
i = 1
while True:
try:
# get the url from the terminal
# url = input("Enter instance.social url (include https:// + exlude "seq=%i#metadata_info_tab_contents"): ")
# URL iterates through the sequence number
url = ("https://www.jstor.org/stable/23267102?Search=yes&resultItemClick=true&searchText=tea&searchUri=%2Faction%2FdoBasicSearch%3FQuery%3Dtea%26amp%3Bacc%3Don%26amp%3Bfc%3Doff%26amp%3Bwc%3Don%26amp%3Bgroup%3Dnone&ab_segments=0%2Fl2b-basic-1%2Frelevance_config_with_tbsub_l2b&refreqid=search%3A9fd0deff8d3258de87d3b54d6dfad664&" + "seq=%i#metadata_info_tab_contents"%i)
# Tell Selenium to open a new Firefox session
# and specify the path to the driver
driver = webdriver.Firefox(executable_path=os.path.dirname(os.path.realpath(__file__)) + '/geckodriver')
# Implicit wait tells Selenium how long it should wait before it throws an exception
driver.implicitly_wait(10)
driver.get(url)
time.sleep(3)
# get the image bay64 code
img = driver.find_element_by_css_selector('#page-scan-container.page-scan-container')
src = img.get_attribute('src')
# check if source is correct
# pprint(src)
# strip type from Javascript to base64 string only
base64String = src.split(',').pop();
pprint(base64String)
# decode base64 string
imgdata = base64.b64decode(base64String)
# save the image
filename = ('page%i.gif'%i)
with open(filename, 'wb') as f:
f.write(imgdata)
driver.close()
i+=1
print("DONE! Closing Window")
except:
print("Impossible to print image")
time.sleep(1)
Verso Books
- Create an account in https://www.versobooks.com/
- Download book as epub, a free one will do, for instance https://www.versobooks.com/books/2772-verso-2017-mixtape
- search for watermarks on the EPUB
- Look at the Institute for Biblio-Immunology -- First Communique: https://pastebin.com/raw/E1xgCUmb for more leads. https://www.booxtream.com/
Tesseract, OCR, Book scan
<script type="text/javascript">
//store all class 'ocr_line' in 'lines'
var lines = document.querySelectorAll(".ocr_line");
//loop through each element in 'lines'
for (var i = 0; i < lines.length; i++){
var line = lines[i];
console.log(line.title)
//split the content of 'title' every space and store the list in 'parts'
var parts = line.title.split(" ");
console.log(parts);
// width and height starts from the side
var left = parseInt(parts[1], 10);
var top = parseInt(parts[2], 10);
var width = (parseInt(parts[3], 10) - left);
var height = (parseInt(parts[4], 10) - top);
// create a style element with the content selected from the list 'parts'
line.style = "position: absolute; left: " + parts[1] + "px; top: " + parts[2] + "px; width: " + width + "px; height: " + height + "px; border: 5px solid lightblue";
var words = line.querySelectorAll(".ocrx_word");
for (var e = 0; e < words.length; e++){
var span = words[e];
console.log(span.title)
var parts = span.title.split(" ");
console.log(parts);
var wleft = parseInt(parts[1], 10);
var wtop = parseInt(parts[2], 10);
var wwidth = (parseInt(parts[3], 10) - wleft);
var wheight = (parseInt(parts[4], 10) - wtop);
span.style = "position: absolute; left: " + (wleft - left) + "px; top: " + (wtop - top) + "px; width: " + wwidth + "px; height: " + wheight + "px; border: 2px solid purple";
}
}