Interfacing the law prototyping Zalan Szakacs: Difference between revisions
Revision as of 14:10, 6 June 2018
CGI & TF-IDF search engine – Michael
Pad of the lesson: CSV-libgen
CSV:
- comma-separated values
- plain-text file
- commas are used as field separators
- A record ends at a line terminator (such as new line \n)
- All records have to have the same number of fields
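The properties listed above can be checked with Python's csv module. This is a minimal sketch: the sample data and the helper names read_records and has_uniform_width are made up for illustration and are not part of the lesson material.

```python
import csv
import io

# Hypothetical sample data: three fields per record, one record per line.
# The quoted field shows how a comma can appear inside a value.
SAMPLE = 'id,title,author\n1,"Hello, world",Ann\n2,Second book,Bob\n'


def read_records(text):
    """Parse CSV text into a list of records (each a list of field strings)."""
    return list(csv.reader(io.StringIO(text)))


def has_uniform_width(records):
    """Return True when every record has the same number of fields."""
    return len({len(record) for record in records}) <= 1


records = read_records(SAMPLE)
print(records)
print(has_uniform_width(records))
```

A check like has_uniform_width is one quick way to spot a messy file before working with it.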
How to handle it? Tools for handling CSV
- head - output the first part of files
- tail - output the last part of files
- CSVKit - "swiss army knife" of CSV files. Install: sudo pip3 install csvkit
Documentation: https://csvkit.readthedocs.io/en/1.0.1/#
- Python CSV core lib https://docs.python.org/3.5/library/csv.html
- structured-text-tools https://github.com/dbohdan/structured-text-tools
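For comparison with head and tail, the same peek at a file can be sketched with the Python csv core library. The function names csv_head and csv_tail and the sample rows are assumptions for illustration. Unlike head and tail, which count lines, this counts CSV records, so a quoted field that contains a line break stays in one piece.

```python
import csv
import io
from collections import deque
from itertools import islice


def csv_head(handle, n=5):
    """Return the first n records from an open CSV file, similar to head."""
    return list(islice(csv.reader(handle), n))


def csv_tail(handle, n=5):
    """Return the last n records, similar to tail, keeping only n rows in memory."""
    return list(deque(csv.reader(handle), maxlen=n))


# Hypothetical in-memory data standing in for a file such as content.csv
sample = "\n".join("{},title {}".format(i, i) for i in range(1, 11)) + "\n"

print(csv_head(io.StringIO(sample), 2))
print(csv_tail(io.StringIO(sample), 2))
```

With a real file, the handle would come from open("content.csv", newline="") instead of io.StringIO.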
What is it? How is it structured? What data does it contain? Is it incomplete or messy?
- Headers are in a separate file, libgen_columns.txt. We might need to integrate them into content.csv, because Python's csv.DictReader takes the first row of a CSV file as the header row
- Convert libgen_columns.txt to CSV:
An alternative (GNU/Linux): cat libgen_columns.txt | sed "s/\ \ \`/\"/g" | sed "s/\`.*/\",/g" | tr --delete '\n'
Mac: cat libgen_columns.txt | sed "s/\ \ \`/\"/g" | sed "s/\`.*/\",/g" | tr -d '\n'
Converting txt to csv (Mac): cat libgen_columns.txt | sed "s/\ \ \`/\"/g" | sed "s/\`.*/\",/g" | tr -d '\n' > headers.csv
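The same conversion could also be sketched in Python. This assumes that libgen_columns.txt holds one SQL-style column definition per line with the column name wrapped in backticks, which is what the sed patterns suggest. The sample lines and the helper names column_names and header_row are made up for illustration.

```python
import csv
import io
import re

# Assumed shape of libgen_columns.txt (these three lines are invented):
# two spaces, a backtick-quoted column name, then the column type.
SAMPLE_COLUMNS = (
    "  `ID` int(15) unsigned NOT NULL,\n"
    "  `Title` varchar(2000) DEFAULT '',\n"
    "  `Author` varchar(1000) DEFAULT '',\n"
)


def column_names(definitions):
    """Extract the first backtick-quoted name from each line that has one."""
    names = []
    for line in definitions.splitlines():
        match = re.search(r"`([^`]+)`", line)
        if match:
            names.append(match.group(1))
    return names


def header_row(names):
    """Format the names as one fully quoted CSV row without a trailing comma."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(names)
    return buffer.getvalue()


names = column_names(SAMPLE_COLUMNS)
print(header_row(names), end="")
```

Unlike the sed pipeline, this version leaves no trailing comma after the last column name, so the row can be written directly as the first line of a CSV file.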
Comma separated catalogue – Andre + Michael
Workshop 1 Building our own search engine – Dusan Barok (monoskop)
#!/usr/local/bin/python3
# CGI script: look up a keyword in tfidf.csv and print, for each text,
# the sentences that contain it, ordered by TF-IDF score.
import cgi
import cgitb; cgitb.enable()
import csv
import html
import re

import nltk

print("Content-Type: text/html; charset=utf-8")
print()

#cgi.print_environ()
# Read the submitted form field
form = cgi.FieldStorage()
text = form.getvalue("text", "")

### SEARCHING
# the keyword you want to search for
keyword = text

print("""<!DOCTYPE html>
<html>
<head>
<title>Search</title>
<meta charset="utf-8">
</head>
<body>
<p style='font-size: 20pt; font-family: Courier'>Search by keyword</p>
<form method="get">
<textarea name="text" style="background: yellow; font-size: 10pt; width: 370px; height: 28px;" autofocus></textarea>
<input type="submit" name="submit" value="Search" style='font-size: 9pt; height: 32px; vertical-align:top;'>
</form>
<p style='font-size: 9pt; font-family: Courier'>
webring <br>
<a href="http://145.24.204.185:8000/form.html">joca</a>
<a href="http://145.24.198.145:8000/form.html">alice</a>
<a href="http://145.24.246.69:8000/form.html">michael</a>
<a href="http://145.24.165.175:8000/form.html">ange</a>
<a href="http://145.24.254.39:8000/form.html">zalan</a>
</p>""")

count = 0
if keyword:
    # tfidf.csv: the first row holds the text file names, the first column the terms
    with open("tfidf.csv", "r", newline="") as csv_handle:
        csv_file = csv.reader(csv_handle, delimiter=",")
        col_names = next(csv_file)
        # loop through the rows of the csv
        for row in csv_file:
            # if the term in the current row equals the input, use that row
            if keyword == row[0]:
                # pair each file name with its score, skipping the term column
                tfidf_list = list(zip(col_names, row))[1:]
                sorted_by_score = sorted(tfidf_list, key=lambda pair: float(pair[1]), reverse=True)
                print("<p></p>")
                print("--------------------------------------------------------------------------------------")
                print("<p style='font-size: 20pt; font-family: Courier'>Results</p>")
                for item in sorted_by_score:
                    count += 1
                    print("--------------------------------------------------------------------------------------")
                    print("<br>")
                    print(count, item)
                    name = item[0]
                    with open("cgi-bin/texts/{}".format(name), "r") as text_file:
                        sents = nltk.sent_tokenize(text_file.read())
                    # print every sentence that contains the keyword as a whole word
                    for sentence in sents:
                        if re.search(r'\b({})\b'.format(re.escape(keyword)), sentence):
                            print("<br>")
                            print(html.escape(sentence))
                    print("<br>")

# close the page after the results, so they end up inside <body>
print("</body>")
print("</html>")