Interfacing the law – prototyping – Zalan Szakacs

CGI & TF-IDF search engine – Michael

Pad of the lesson: prototyping-17042018

Comma-separated catalogue – Andre + Michael

Pad of the lesson: CSV-libgen

CSV:

  • comma-separated values
  • plain-text file
  • commas are used as field separators
  • a record ends at a line terminator (such as a newline, \n)
  • all records have to have the same number of fields
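
For example, a minimal sketch of these rules in Python, using the built-in csv module on a made-up two-record catalogue (the field names and values are invented for illustration):

import csv
import io

# a made-up catalogue: one header row and two records;
# commas separate the fields, a newline ends each record
sample = (
    "id,title,author,year\n"
    '1,"Ways of Seeing","John Berger",1972\n'
    '2,"The Medium is the Massage","Marshall McLuhan",1967\n'
)

for record in csv.reader(io.StringIO(sample)):
    print(record)   # every record has the same number of fields (here: 4)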

How to handle it? Tools for handling CSV

  • head - output the first part of files
  • tail - output the last part of files
  • csvkit - the "Swiss Army knife" of CSV files. Install: sudo pip3 install csvkit
    Documentation: https://csvkit.readthedocs.io/en/1.0.1/#
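
head and tail show raw lines; the same quick look can be taken from Python as well, for example with this small sketch that prints the first few records of the dump (the file name content.csv is the one used below):

import csv
import itertools

# roughly what `head` does, but counted in CSV records rather than lines
with open("content.csv", "r", newline="") as f:
    for record in itertools.islice(csv.reader(f), 5):
        print(len(record), "fields:", record[:3])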

What is it? How is it structured? What data does it contain? Is it incomplete or messy?

  • Headers are in a separate file, libgen_columns.txt; we might need to integrate them into content.csv, as Python's csv library takes the first row of a CSV file as the header row
  • convert libgen_columns.txt to CSV:

An alternative (Linux): cat libgen_columns.txt | sed "s/\ \ \`/\"/g" | sed "s/\`.*/\",/g" | tr --delete '\n'

Mac: cat libgen_columns.txt | sed "s/\ \ \`/\"/g" | sed "s/\`.*/\",/g" | tr -d '\n'

Converting the txt to CSV and writing it to headers.csv (Mac): cat libgen_columns.txt | sed "s/\ \ \`/\"/g" | sed "s/\`.*/\",/g" | tr -d '\n' > headers.csv
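
Instead of editing content.csv itself, the column names can also be supplied from Python. A minimal sketch, assuming the headers.csv written above and the libgen dump saved as content.csv (the column names Title and Author are assumptions about the dump):

import csv

# headers.csv is a single comma-separated line of quoted column names
with open("headers.csv", "r") as f:
    columns = [name.strip().strip('"') for name in f.read().split(",") if name.strip()]

# passing fieldnames stops DictReader from treating the first record as a header row
with open("content.csv", "r", newline="") as f:
    reader = csv.DictReader(f, fieldnames=columns)
    first = next(reader)
    print(first["Title"], "-", first["Author"])   # column names assumed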


Workshop 1: Building our own search engine – Dusan Barok (Monoskop)

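The CGI script below expects to sit in a cgi-bin/ directory next to the texts it searches. One way to try it locally is Python's built-in CGI-enabled server, e.g. python3 -m http.server --cgi 8000, which runs anything under cgi-bin/ as a CGI script.
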
#!/usr/local/bin/python3
import cgi
import cgitb; cgitb.enable()
import nltk
import re

print ("Content-type:text/html;charset=utf-8")
print ()

#cgi.print_environ()

f = cgi.FieldStorage()
# value of the Search button (the form field below is named "submit"); not used further
submit = f.getvalue("submit", "")

# the search query typed into the textarea
text = f.getvalue("text", "")

### SORTING
import csv

### SEARCHING
# the keyword to search for, taken from the form input
keyword = text


print ("""<!DOCTYPE html>
<html>
<head>
	<title>Search</title>
	<meta charset="utf-8">
</head>
<body>
<p style='font-size: 20pt; font-family: Courier'>Search by keyword</p>
	<form method="get">
	<textarea name="text" style="background: yellow; font-size: 10pt; width: 370px; height: 28px;" autofocus></textarea>
	<input type="submit" name="submit" value="Search" style='font-size: 9pt; height: 32px; vertical-align:top;'>

</form>
<p style='font-size: 9pt; font-family: Courier'>
	webring <br>
<a href="http://145.24.204.185:8000/form.html">joca</a>
<a href="http://145.24.198.145:8000/form.html">alice</a>
<a href="http://145.24.246.69:8000/form.html">michael</a>
<a href="http://145.24.165.175:8000/form.html">ange</a>
<a href="http://145.24.254.39:8000/form.html">zalan</a>

</p>""")
x = 0
if text:
	# read the CSV; rows are split on ","
	csv_file = csv.reader(open('tfidf.csv', "r"), delimiter=",")
	# the first row holds the column names (the text file names)
	col_names = next(csv_file)
	# loop through the remaining rows
	for row in csv_file:
		# if the current row's keyword equals the input, print that row
		if keyword == row[0]:
			# pair each file name with its TF-IDF score and drop the keyword column
			tfidf_list = list(zip(col_names, row))
			del tfidf_list[0]
			# sort the (file name, score) pairs by score, highest first
			sorted_by_second = sorted(tfidf_list, key=lambda pair: float(pair[1]), reverse=True)
			print ("<p></p>")
			print ("--------------------------------------------------------------------------------------")
			print ("<p style='font-size: 20pt; font-family: Courier'>Results</p>")
			for item in sorted_by_second:
				x = x+1
				print ("--------------------------------------------------------------------------------------")
				print ("<br></br>")
				print(x, item)
				n = item[0]

				f = open("cgi-bin/texts/{}".format(n), "r")
				sents = nltk.sent_tokenize(f.read())

				for sentence in sents:
					if re.search(r'\b({})\b'.format(text), sentence):
						print ("<br></br>")
						print(sentence)
				f.close()
				print ("<br></br>")

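The script above reads a precomputed tfidf.csv in which the first column holds a keyword and the remaining columns hold that keyword's TF-IDF score for each file in cgi-bin/texts/. That file is not generated here; a minimal sketch of how it could be built with only the standard library (plain term frequency times log inverse document frequency; the file and directory names are taken from the script above) might look like this:

import os
import csv
import math
from collections import Counter

TEXTS_DIR = "cgi-bin/texts"                  # the folder the CGI script reads from
filenames = sorted(os.listdir(TEXTS_DIR))

# term frequencies per document
counts = {}
for name in filenames:
    with open(os.path.join(TEXTS_DIR, name), "r") as f:
        words = [w.strip('.,;:!?"()').lower() for w in f.read().split()]
        counts[name] = Counter(w for w in words if w)

# document frequency: in how many of the texts does each word occur?
df = Counter()
for c in counts.values():
    df.update(c.keys())

with open("tfidf.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["keyword"] + filenames)  # column names are the file names
    for word in sorted(df):
        idf = math.log(len(filenames) / df[word])
        writer.writerow([word] + [counts[name][word] * idf for name in filenames])
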
Workshop 2: Tunneling – Marcel Mars (Memory of the World)