User:Tash/Prototyping 03: Difference between revisions

Latest revision as of 14:53, 15 June 2018

Workshop: CGI & TF-IDF search engine

#!/usr/local/bin/python3
import cgi
import cgitb; cgitb.enable()
import nltk
import re

print ("Content-type:text/html;charset=utf-8")
print ()

#cgi.print_environ()

f = cgi.FieldStorage()
submit1 = f.getvalue("submit1", "")
submit2 = f.getvalue("submit2", "")

text = f.getvalue("text", "")

### SORTING
import os
import csv
import string
import pandas as pd
import sys
### SEARCHING
#input keyword you want to search
keyword = text


print ("""<!DOCTYPE html>
<html>
<head>
	<title>Search</title>
	<meta charset="utf-8">
</head>
<body>
<p style='font-size: 20pt; font-family: Courier'>Search by keyword</p>
	<form method="get">
	<textarea name="text" style="background: yellow; font-size: 10pt; width: 370px; height: 28px;" autofocus></textarea>
	<input type="submit" name="submit" value="Search" style='font-size: 9pt; height: 32px; vertical-align:top;'>

</form>
<p style='font-size: 9pt; font-family: Courier'>
	webring <br>
<a href="http://145.24.204.185:8000/form.html">joca</a>
<a href="http://145.24.198.145:8000/form.html">alice</a>
<a href="http://145.24.246.69:8000/form.html">michael</a>
<a href="http://145.24.165.175:8000/form.html">ange</a>
<a href="http://145.24.254.39:8000/form.html">zalan</a>

</p>
</body>
</html>""")
x = 0
if text :
	#read csv, and split on "," the line
	csv_file = csv.reader(open('tfidf.csv', "r"), delimiter=",")
	col_names = next(csv_file)
	#loop through csv list
	for row in csv_file:
		#if current rows value is equal to input, print that row
		if keyword == row[0] :
			tfidf_list = list(zip(col_names, row))
			del tfidf_list[0]
			sorted_by_second = sorted(tfidf_list, key=lambda x:float(x[1]), reverse=True)
			print ("<p></p>")
			print ("--------------------------------------------------------------------------------------")
			print ("<p style='font-size: 20pt; font-family: Courier'>Results</p>")
			for item in sorted_by_second:
				x = x+1
				print ("--------------------------------------------------------------------------------------")
				print ("<br></br>")
				print(x, item)
				n = item[0]

				f = open("cgi-bin/texts/{}".format(n), "r")
				sents = nltk.sent_tokenize(f.read())

				for sentence in sents:
					# escape the keyword so regex metacharacters in user input are matched literally
					if re.search(r'\b({})\b'.format(re.escape(text)), sentence):
						print ("<br></br>")
						print(sentence)
				f.close()
				print ("<br></br>")
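
The search script above reads a tfidf.csv whose header row lists the text filenames and whose other rows hold a word followed by one score per text. How that file was generated is not shown here, so the following is only a minimal sketch of one way to produce a table in that shape. It assumes the common tf × log(N/df) weighting and plain whitespace tokenization; `build_tfidf`, `write_tfidf_csv` and the sample texts are hypothetical.

```python
import csv
import io
import math
from collections import Counter


def build_tfidf(texts):
    """Return {word: {filename: score}} for a {filename: text} mapping.

    Assumed weighting: tf = count / words in text, idf = log(N / df).
    """
    counts = {name: Counter(text.lower().split()) for name, text in texts.items()}
    totals = {name: sum(c.values()) or 1 for name, c in counts.items()}
    n_docs = len(texts)
    # Document frequency: in how many texts does each word occur?
    df = Counter(word for c in counts.values() for word in c)
    table = {}
    for word, doc_freq in df.items():
        idf = math.log(n_docs / doc_freq)
        table[word] = {
            name: (counts[name][word] / totals[name]) * idf for name in texts
        }
    return table


def write_tfidf_csv(table, filenames, out):
    """Write the layout the search script reads: header row, then word rows."""
    writer = csv.writer(out)
    writer.writerow(["word"] + filenames)
    for word in sorted(table):
        writer.writerow([word] + [table[word][name] for name in filenames])


texts = {  # hypothetical sample texts
    "a.txt": "the cat sat",
    "b.txt": "the dog barked loudly",
}
table = build_tfidf(texts)
buffer = io.StringIO()
write_tfidf_csv(table, list(texts), buffer)
print(buffer.getvalue())
```

A word that occurs in every text gets idf = log(1) = 0, so it scores zero everywhere and sinks to the bottom of the sorted results.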

Workshop with Marcell Mars

On tunneling:
contexts: censorship (e.g. of nation states like China, Turkey, Iran or of institutions and academia), anonymity, organizing resistance or political action
rules: physical location (as far as the wifi goes),
AP: access point (for wireless routers for example)
router: a networking device that forwards data packets between computer networks. Routers perform the traffic directing functions on the Internet.
encryption certificates: used by websites to enable secure HTTPS connections, issued to domain and subdomain. Issued by authorities like Let’s Encrypt (recently free) and DigiCert.
DNS: Domain Name System (or domain name server), which resolves domain names to IP addresses and so lets you reach the domain. The DNS server is usually assigned automatically via DHCP, but you can manually choose your own resolver, like those served by Google (8.8.8.8)
ping: command line tool that sends a small packet to check whether a host is reachable
network interface: the device / card (e.g. wireless or ethernet) through which your computer is talking to the internet. IP addresses are assigned to the network interface
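
As a small illustration of the DNS entry above, Python's standard `socket` module can ask whichever resolver the system is configured with (DHCP-assigned or manually chosen) for the addresses behind a name. This is a sketch only; `resolve` is a hypothetical helper name.

```python
import socket


def resolve(hostname):
    """Return the sorted, de-duplicated IP addresses the system resolver gives."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


# An IP literal resolves to itself without any DNS traffic.
print(resolve("127.0.0.1"))
# A real lookup needs network access, e.g. resolve("example.org").
```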

> so to prevent network admins from seeing your DNS requests (and tracking domains and subdomains) you can ‘tunnel’ and use things like an encrypted DNS server or proxy servers

When talking about networks:
in Unix philosophy, everything is a file, with paths which you can read and write into. Networks are streaming media, so here things become more complex. Here, ports are the sockets through which you can make connections. Over time, default conventions have been assigned – like 22 for SSH and 443 for HTTPS.
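
To make the idea of ports concrete, here is a minimal sketch that tries to open a TCP connection to a given port. `port_is_open` is a hypothetical helper; the demo listens on a free local port so it runs without touching the network.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Demo: listen on a free local port, probe it, close it, probe again.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen(1)
demo_port = server.getsockname()[1]
open_while_listening = port_is_open("127.0.0.1", demo_port)
server.close()
open_after_close = port_is_open("127.0.0.1", demo_port)
print(open_while_listening, open_after_close)
```

The same function could be pointed at port 22 or 443 of a remote host to see whether SSH or HTTPS is answering there.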

Repositories: https://gitlab.com/marcellmars/letssharebooks https://github.com/marcellmars/logan_and_jessica

Exercise: https://imgur.com/a/xuUuN https://rsync.samba.org/

Interesting projects: https://beakerbrowser.com/ https://ipfs.io/ https://zerotier.com/

Research on databases and networks

SQL

SQL - Structured Query Language. It is a declarative computer language aimed at querying relational databases. MySQL is a relational database: a piece of software optimized for data storage and retrieval. There are many such databases; Oracle, Microsoft SQL Server and SQLite are examples.

SQLite

SQLite is an embedded SQL database engine that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. The code for SQLite is in the public domain and is thus free for use for any purpose, commercial or private. SQLite is the most widely deployed database in the world with more applications than we can count, including several high-profile projects.

Unlike most other SQL databases, SQLite does not have a separate server process. SQLite reads and writes directly to ordinary disk files. A complete SQL database with multiple tables, indices, triggers, and views, is contained in a single disk file. Furthermore, the file format is cross-platform. A database that is created on one machine can be copied and used on a different machine with a different architecture. https://sqlite.org/about.html
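
The single-file, serverless design can be seen with Python's built-in `sqlite3` module. This is a sketch only; the `books` table, its columns and the file name are hypothetical placeholders, not the library's actual schema.

```python
import os
import sqlite3
import tempfile

# The whole database lives in one ordinary file; no server process is started.
db_path = os.path.join(tempfile.mkdtemp(), "catalog.db")  # hypothetical file name

conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO books (title, author) VALUES (?, ?)",
    [("Title A", "Author One"), ("Title B", "Author Two")],  # placeholder rows
)
conn.commit()

rows = conn.execute(
    "SELECT title FROM books WHERE author LIKE ? ORDER BY title", ("%One%",)
).fetchall()
conn.close()

print(rows)
print(os.path.exists(db_path))
```

Copying `catalog.db` to another machine is enough to move the database, which is what makes SQLite attractive on a Raspberry Pi.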

Flask

Flask is a BSD-licensed microframework for Python based on Werkzeug and Jinja 2.

Syncthing

Session with Tash, Andre & Alice: 28.05.2018
How to configure and install syncthing on the raspberry pi, and two of our own machines?
Syncthing can be used to sync book files and catalog files between different instances of our library (e.g. syncing the catalog between the server and the Pi's, syncing book files between Pi's). Files are not stored in the cloud, and it allows for a decentralized, read-write architecture (different from rsync, which uses a master-slave relationship).

Running Syncthing
At first start Syncthing will generate a configuration file, some keys and then start the admin GUI in your browser. The GUI remains available on https://localhost:8384/. For Syncthing to be able to synchronize files with another device, it must be told about that device. This is accomplished by exchanging “device IDs”. A device ID is a unique, cryptographically-secure identifier that is generated as part of the key generation the first time you start Syncthing. It is printed in the log above, and you can see it in the web GUI by selecting the “gear menu” (top right) and “Show ID”. Two devices will only connect and talk to each other if they are both configured with each other’s device ID. Since the configuration must be mutual for a connection to happen, device IDs don’t need to be kept secret. They are essentially part of the public key. To get your two devices to talk to each other click “Add Device” at the bottom right on both, and enter the device ID of the other side. You should also select the folder(s) that you want to share. The device name is optional and purely cosmetic. It can be changed later if required.

Configuration
Syncthing is configured through its config.xml file, which can be edited via the terminal or through the web GUI. Each folder element describes one folder. The following attributes may be set on the folder element:

id - The folder ID, must be unique. (mandatory)

label - The label of a folder is a human readable and descriptive local name. May be different on each device, empty, and/or identical to other folder labels. (optional)

path - The path to the directory where the folder is stored on this device; not sent to other devices. (mandatory)

type - Controls how the folder is handled by Syncthing. Possible values are:

readwrite - The folder is in default mode. Sending local and accepting remote changes.

readonly - The folder is in “send-only” mode – it will not be modified by Syncthing on this device.

rescanIntervalS - The rescan interval, in seconds. Can be set to zero to disable when external plugins are used to trigger rescans.
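
When the browser GUI is not reachable, the folder settings listed above can also be inspected from a script. The sketch below parses a made-up config.xml excerpt with the standard library; the folder IDs, labels and paths are hypothetical, and the `type` values follow the names used in these notes (other Syncthing releases may name them differently).

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of a config.xml; IDs and paths are made up.
SAMPLE_CONFIG = """
<configuration>
    <folder id="catalog" label="Catalog" path="/home/pi/catalog"
            type="readwrite" rescanIntervalS="60"/>
    <folder id="books" label="Books" path="/home/pi/books"
            type="readonly" rescanIntervalS="3600"/>
</configuration>
"""


def list_folders(config_xml):
    """Return one dict per folder element with the attributes described above."""
    root = ET.fromstring(config_xml.strip())
    return [
        {
            "id": folder.get("id"),
            "label": folder.get("label", ""),
            "path": folder.get("path"),
            "type": folder.get("type"),
            "rescanIntervalS": int(folder.get("rescanIntervalS", "0")),
        }
        for folder in root.findall("folder")
    ]


folders = list_folders(SAMPLE_CONFIG)
for folder in folders:
    print(folder["id"], folder["type"], folder["path"])
```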

Because the pi can't access the browser GUI, you can change the config file to add the GUI port address from 127... to 0000 served on Apache web server. Then you can look at the GUI remotely in your browser. Alternatively, you can add device keys via terminal in the config file. Question: Can we have rw permissions on the main pi, and read only permissions on all others? - probs

Troubleshooting

Kernel Panic
Don't use the shark SD card! Aymeric bought them for super cheap and they will corrupt the f up. Kernel panic means you have to try and reboot the Pi in recovery mode. Or... abort.

Merging & file conflicts
Editing CSV files in different nodes at the same time will result in conflicts. How to make a fault tolerant, decentralized file system which will allow up-to-date uploads, edits and deletions between different nodes? Important for us: How to keep catalog and files separate so that only catalog is visible to public? AND How to make sure file and catalog are synced in a way that is distributed?

RQLite

rqlite is an easy-to-use, lightweight, distributed relational database, which uses SQLite as its storage engine. Forming a cluster is very straightforward, it gracefully handles leader elections, and tolerates failures of machines, including the leader.

Creating a cluster of nodes (Pi's) : https://github.com/rqlite/rqlite/blob/master/DOC/CLUSTER_MGMT.md#creating-a-cluster
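
rqlite is accessed over HTTP rather than through a database file. As a rough sketch of what talking to a node could look like, the code below only builds the requests and does not send them, so it runs without a cluster. It assumes a node listening on localhost:4001 and the `/db/execute` and `/db/query` endpoints; the `books` table and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

RQLITE_URL = "http://localhost:4001"  # assumed HTTP address of a node


def build_execute_request(statements, base_url=RQLITE_URL):
    """Build (not send) a POST to /db/execute carrying a JSON list of SQL statements."""
    return urllib.request.Request(
        base_url + "/db/execute",
        data=json.dumps(statements).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_query_request(sql, base_url=RQLITE_URL):
    """Build (not send) a GET to /db/query with the SQL in the q parameter."""
    query = urllib.parse.urlencode({"q": sql})
    return urllib.request.Request(base_url + "/db/query?" + query, method="GET")


execute_req = build_execute_request(
    ["CREATE TABLE books (id INTEGER NOT NULL PRIMARY KEY, title TEXT)"]
)
query_req = build_query_request("SELECT * FROM books")
print(execute_req.get_method(), execute_req.full_url)
print(query_req.get_method(), query_req.full_url)
# With a node running, urllib.request.urlopen(execute_req) would send the request.
```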

Extracting images from PDF

pdfimages extracts more and fragmented images

To make more dynamic 'cover images':

Option 1: using pdfimages -j magnet_reader_3_processual_publishing_actual_gestures.pdf ./pdfimages

Option 2: python script which looks for start bytes and endbytes of jpg files:

python script extracts fewer images, only recognizes complete jpgs

# coding=utf-8
# Extract jpg's from pdf's. Quick and dirty.

import sys

with open(sys.argv[1], "rb") as file:
   pdf = file.read()

startmark = b"\xff\xd8"
startfix = 0
endmark = b"\xff\xd9"
endfix = 2
i = 0

njpg = 0
while True:

   istream = pdf.find(b"stream", i)
   if istream < 0:
       break
   istart = pdf.find(startmark, istream, istream + 20)
   if istart < 0:
       i = istream + 20
       continue
   iend = pdf.find(b"endstream", istart)
   if iend < 0:
       raise Exception("Didn't find end of stream!")
   iend = pdf.find(endmark, iend - 20)
   if iend < 0:
       raise Exception("Didn't find end of JPG!")

   istart += startfix
   iend += endfix
   print("JPG %d from %d to %d" % (njpg, istart, iend))
   jpg = pdf[istart:iend]
   with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
       jpgfile.write(jpg)

   njpg += 1
   i = iend

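
To try the byte-marker scan without a PDF on disk, the same logic can be wrapped in a function that works on in-memory bytes. This is a sketch under the same assumptions as the script (the JPEG start marker sits within 20 bytes of the `stream` keyword, and the end marker lies near `endstream`); `extract_jpegs` and the fake data are hypothetical.

```python
def extract_jpegs(data):
    """Return the JPEG byte strings found inside PDF stream objects."""
    start_mark, end_mark = b"\xff\xd8", b"\xff\xd9"
    images = []
    i = 0
    while True:
        istream = data.find(b"stream", i)
        if istream < 0:
            break
        # A JPEG stream starts with the marker shortly after the keyword.
        istart = data.find(start_mark, istream, istream + 20)
        if istart < 0:
            i = istream + 20
            continue
        iend = data.find(b"endstream", istart)
        if iend < 0:
            raise ValueError("Didn't find end of stream!")
        # Look for the end marker starting shortly before "endstream".
        iend = data.find(end_mark, iend - 20)
        if iend < 0:
            raise ValueError("Didn't find end of JPG!")
        iend += len(end_mark)
        images.append(data[istart:iend])
        i = iend
    return images


fake_jpeg = b"\xff\xd8" + b"pixels" + b"\xff\xd9"  # not a real image
fake_pdf = b"%PDF-1.4\n1 0 obj\nstream\n" + fake_jpeg + b"\nendstream\nendobj\n"
found = extract_jpegs(fake_pdf)
print(len(found))
```

Because only streams that begin with the JPEG start marker are kept, images stored in other encodings are skipped, which matches the observation that this approach finds fewer images than pdfimages.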

@@ Line 96: / Line 96: @@
 On tunneling:
 contexts: censorship (e.g. of nation states like China, Turkey, Iran or of institutions and academia), anonymity, organizing resistance or political action
+<br>
 rules: physical location (as far as the wifi goes),
+<br>
 AP: access point (for wireless routers for example)
+<br>
 router: a networking device that forwards data packets between computer networks. Routers perform the traffic directing functions on the Internet.
+<br>
 encryption certificates: used by websites to enable secure HTTPS connections, issued to domain and subdomain. Issued by authorities like Let’s Encrypt (recently free) and DigiCert.
-DNS: domain name server or system, which resolves and distributes IP adresses, and lets you get to the domain. Usually set to automatic DHCP, but you can manually choose your own conversion point, like those served by Google (8.8.8.8)
+<br>
+DNS: domain name server or system, which resolves and distributes IP addresses, and lets you get to the domain. Usually set to automatic DHCP, but you can manually choose your own conversion point, like those served by Google (8.8.8.8)
+<br>
 ping: command line tool to send a quick byte of info to check if a domain is alive
+<br>
 network interface: the device / card (e.g. wireless or ethernet) through which your computer is talking to the internet. IP addresses are assigned to the network interface
@@ Line 108: / Line 114: @@
 When talking about networks:
+<br>
 in Unix philosophy, everything is a file, with paths which you can read and write into. Networks are streaming media, so here things become more complex. Here, ports are the sockets through which you can make connections. Over time, default conventions have been assigned – like 22 for SSH and 443 for HTTPS.
@@ Line 123: / Line 130: @@
 https://zerotier.com/
+== Research on databases and networks ==
-==Self directed research==
-=== Brainstorm 23.04.2018===
-Interface: How do you visualize that which is UNSTABLE? Serendipity? Missing data? Uncertainty? Dissent? Multiple views?
-On data provenance and feminist visualization: https://civic.mit.edu/feminist-data-visualization
-HOW can you GET data that's MISSING ?! E.G. from LibGen: where is the UPLOAD DATA? what could we do with it?
-Simple test to highlight absent information: in LibGen's catalogue CSV there are row without titles
-How to search for blanks?
-something like:
-   csvgrep -c Title -m "" content.csv
-^ this solution matches spaces but doesn't look for empty state cells.
-[[File:Libgen blanks.gif|500px|thumbnail|right]]
-   csvgrep -c Author -r "^$" content.csv
-^ this solution finds rows with empty state cells in the 'Author' column
-andre's exciting explorations of the archive.org api search: Internet Archive
-Advanced search: https://archive.org/advancedsearch.php ghost in the mp3
-=== Interface & database ===
 ====SQL====
@@ Line 162: / Line 144: @@
 ====Flask====
 Flask is a BSD-licensed microframework for Python based on Werkzeug and Jinja 2.
 <onlyinclude>
-==Syncthing==
+===Syncthing===
 Session with Tash, Andre & Alice: 28.05.2018
 <br>
@@ Line 198: / Line 179: @@
 Because the pi can't access the browser GUI, you can change the config file to add the GUI port address from 127... to 0000 served on Apache web server. Then you can look at the GUI remotely in your browser. Alternatively, you can add device keys via terminal in the config file. Question: Can we have rw permissions on the main pi, and read only permissions on all others? - probs
-[[File:Sharksd.jpg|300px|thumbnail|left]]
+[[File:Sharksd.jpg|260px|thumbnail|left]]
 '''Troubleshooting'''
 <br>
@@ Line 205: / Line 186: @@
 <br>
 Don't use the shark SD card! Aymeric bought them for super cheap and they will corrupt the f up.
+Kernel panic means you have to try and reboot the Pi in recovery mode. Or... abort.
 Merging & file conflicts
@@ Line 210: / Line 192: @@
 Editing CSV files in different nodes at the same time will result in conflicts.
 How to make a fault tolerant, decentralized file system which will allow up-to-date uploads, edits and deletions between different nodes?
+Important for us: How to keep catalog and files separate so that only catalog is visible to public? AND How to make sure file and catalog are synced in a way that is distributed?
 </onlyinclude>
-== Search functionality==
+<br>
-Using Flask-WTForms to create a search which queries the SQL database.
+=== RQLite ===
-Links: https://pythonhosted.org/Flask-Bootstrap/forms.html and https://programfault.com/flask-101-how-to-add-a-search-form/
+rqlite is an easy-to-use, lightweight, distributed relational database, which uses SQLite as its storage engine. Forming a cluster is very straightforward, it gracefully handles leader elections, and tolerates failures of machines, including the leader.
-'''in forms.py'''
-* simple string search field
-<source lang= python>
+Creating a cluster of nodes (Pi's) : https://github.com/rqlite/rqlite/blob/master/DOC/CLUSTER_MGMT.md#creating-a-cluster
-class SearchForm(FlaskForm):
-    search = StringField('', validators=[InputRequired()])
-</search>
+[[File:IMG 2410.jpg|400px|thumbnail|center]]
-'''in views.py'''
+== Extracting images from PDF ==
-* putting search bar on home page
+[[File:Pdf images.png|400px|thumbnail|right | pdfimages extracts more and fragmented images]]
-* routing results.html, setting up redirect and error message
-<source lang= python>
+To make more dynamic 'cover images':
-@app.route('/', methods=['GET', 'POST'])
-def home():
-    """Render website's home page."""
-    #return render_template('home.html')
-    search = SearchForm(request.form)
-    if request.method == 'POST':
-        return search_results(search)
-    return render_template('home.html', form=search)
-## search
+Option 1: using
-@app.route('/results', methods= ['GET'])
+pdfimages -j magnet_reader_3_processual_publishing_actual_gestures.pdf ./pdfimages
-def search_results(search):
-    results = []
-    search_string = search.data['search']
-    if search_string:
+Option 2: python script which looks for start bytes and endbytes of jpg files:
-        results=Book.query.filter(Book.title.contains(search_string)).all()
+[[File:img_pdfscript.png|400px|thumbnail|right| python script extracts less images, only recognizes complete jpgs]]
-    if not results:
-        flash('No results found!')
-        return redirect('/')
-    else:
-        # display results
-        return render_template('results.html', books=results)
-</source>
-'''in results.html'''
-* template page for showing results, same as show_books.html
 <source lang=python>
+# coding=utf-8
+# Extract jpg's from pdf's. Quick and dirty.
-{% extends 'base.html' %}
+import sys
-{% block main %}
-<div class="container">
-  <h1 class="page-header">Search Results</h1>
-  {% with messages = get_flashed_messages() %}
-    {% if messages %}
-      <div class="alert alert-success">
-        <ul>
-        {% for message in messages %}
-          <li>{{ message }}</li>
-        {% endfor %}
-        </ul>
-      </div>
-    {% endif %}
-  {% endwith %}
-  <table style="width:100%">
-    <tr>
-        <th>Cover</th>
-      <th>Title</th>
-      <th>Author</th>
-      <th>Filetype</th>
-      <th>Tag</th>
-    </tr>
-        {% for book in books %}
-    <tr>
-      <td><img src="../uploads/cover/{{ book.cover }}" width="80"></td>
-      <td><a href="books/{{ book.id }}">{{ book.title }}</a></td>
-      <td>  {% for author in book.authors %}
-              <li><a href="{{url_for('show_author_by_id', id=author.id)}}">{{ author.author_name }}</a>  </li>
-        {% endfor %}</td>
-      <td>{{ book.fileformat }}</td>
-      <td>{{ book.tag}}</td>
-    </tr>
-  {% endfor %}
-  </table>
+with open(sys.argv[1], "rb") as file:
+    pdf = file.read()
-</div>
+startmark = b"\xff\xd8"
-{% endblock %}
+startfix = 0
+endmark = b"\xff\xd9"
+endfix = 2
+i = 0
+njpg = 0
+while True:
+    istream = pdf.find(b"stream", i)
+    if istream < 0:
+        break
+    istart = pdf.find(startmark, istream, istream + 20)
+    if istart < 0:
+        i = istream + 20
+        continue
+    iend = pdf.find(b"endstream", istart)
+    if iend < 0:
+        raise Exception("Didn't find end of stream!")
+    iend = pdf.find(endmark, iend - 20)
+    if iend < 0:
+        raise Exception("Didn't find end of JPG!")
-</source>
+    istart += startfix
+    iend += endfix
+    print("JPG %d from %d to %d" % (njpg, istart, iend))
+    jpg = pdf[istart:iend]
+    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
+        jpgfile.write(jpg)
-====HTML / bootstrap visualizations====
+    njpg += 1
-Responsive image gallery, for search interface:
+    i = iend
-[[File:Responsiveweb.jpg|frameless|left]]
+</source>