CGI & TF-IDF search engine – Michael

Pad of the lesson: prototyping-17042018

CGI (the other one): Common Gateway Interface > from the early days of the web, introducing the idea of browsing, the query, and running software dynamically online
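A minimal sketch of that idea, written in Python rather than Perl: the web server passes the part of the URL after `?` to a program through the `QUERY_STRING` environment variable, and whatever the program prints is sent back to the browser. The function name `render_response` and the parameter name `text` are assumptions made for illustration.

```python
#!/usr/bin/env python3
# Minimal CGI-style sketch: query string in, HTML out.
import html
import os
from urllib.parse import parse_qs


def render_response(query_string):
    """Build the full CGI response (header, blank line, body) for a query string."""
    params = parse_qs(query_string)
    # parse_qs returns a list of values per key; take the first "text" value
    text = params.get("text", [""])[0]
    # Escape the user input before putting it into HTML
    body = "<p>You searched for: {}</p>".format(html.escape(text))
    # A CGI response is a header block, an empty line, then the document
    return "Content-Type: text/html; charset=utf-8\n\n" + body


if __name__ == "__main__":
    # The web server sets QUERY_STRING before running the script
    print(render_response(os.environ.get("QUERY_STRING", "")))
```

Run from a terminal, it can be tried without a server, for example with `QUERY_STRING="text=library" python3 script.py` (the file name is hypothetical).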

perl-- early programming language, web editing, design error_in_time (poem-programming) 'More than one way to do it!' - Larry Wall, creator of Perl

you can run perl scripts: perl

Comma separated catalogue – Andre + Michael

Pad of the lesson: CSV-libgen


  • comma-separated values
  • plain-text file
  • commas are used as field separators
  • A record ends at a line terminator (such as new line \n)
  • All records have to have the same number of fields
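The rules above can be checked with Python's standard `csv` module. This is a small sketch on made-up sample data (not taken from the libgen catalogue); `read_records` and `same_field_count` are hypothetical helper names. It also shows one detail the list leaves out: a comma inside a quoted field does not split the field.

```python
import csv
import io

# Hypothetical sample data, not taken from the libgen catalogue
SAMPLE = 'id,title,author\n1,"Commas, inside quotes",A. Writer\n2,Plain title,B. Writer\n'


def read_records(text):
    """Parse CSV text into a list of rows (each row is a list of fields)."""
    return list(csv.reader(io.StringIO(text)))


def same_field_count(rows):
    """Return True when every record has as many fields as the first one."""
    return all(len(row) == len(rows[0]) for row in rows)


rows = read_records(SAMPLE)
for row in rows:
    print(row)
print("consistent:", same_field_count(rows))
```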

How to handle it? Tools for handling CSV

  • head - output the first part of files
  • tail - output the last part of files
  • CSVKit - "swiss army knife" of CSV files. Install: sudo pip3 install csvkit
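When working from Python instead of the shell, the behaviour of `head` and `tail` can be approximated with the standard library. A rough sketch, assuming the input is any iterable of lines or rows; the in-memory sample stands in for a real file.

```python
import csv
import io
from collections import deque
from itertools import islice


def head(lines, n=10):
    """Return the first n items of an iterable without reading the rest."""
    return list(islice(lines, n))


def tail(lines, n=10):
    """Return the last n items; the deque keeps at most n items in memory."""
    return list(deque(lines, maxlen=n))


# Hypothetical stand-in for a large file; with a real file, pass the open file object
data = io.StringIO("id,title\n1,A\n2,B\n3,C\n4,D\n")
print(head(csv.reader(data), 2))
data.seek(0)
print(tail(csv.reader(data), 2))
```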

What is it? How is it structured? What data does it contain? Is it incomplete or messy?

  • Headers are in a separate file, libgen_columns.txt; we might need to integrate it into content.csv, as the python CSV library takes the first row of a CSV file as the headers row
  • convert libgen_columns.txt to CSV:

An alternative: cat libgen_columns.txt | sed "s/\ \ \`/\"/g" | sed "s/\`.*/\",/g" | tr --delete '\n'

MAC: cat libgen_columns.txt | sed "s/\ \ \`/\"/g" | sed "s/\`.*/\",/g" | tr -d '\n'

converting txt to csv (MAC): cat libgen_columns.txt | sed "s/\ \ \`/\"/g" | sed "s/\`.*/\",/g" | tr -d '\n' > headers.csv
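The same conversion can be sketched in Python. The shape of libgen_columns.txt is an assumption inferred from the sed expressions (two spaces, a backtick-quoted column name, then the rest of an SQL column definition); the sample lines below are made up, and `extract_headers` and `headers_to_csv_row` are hypothetical helper names.

```python
import csv
import io
import re

# Assumed shape of libgen_columns.txt, inferred from the sed expressions:
# leading spaces, a backtick-quoted column name, then the SQL definition.
SAMPLE_COLUMNS = "  `ID` int(15) unsigned NOT NULL,\n  `Title` varchar(2000) DEFAULT '',\n"

COLUMN_NAME = re.compile(r"^\s*`([^`]+)`")


def extract_headers(text):
    """Collect the backtick-quoted name at the start of each line."""
    names = []
    for line in text.splitlines():
        match = COLUMN_NAME.match(line)
        if match:
            names.append(match.group(1))
    return names


def headers_to_csv_row(names):
    """Format the names as one quoted CSV header row."""
    buffer = io.StringIO()
    csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n").writerow(names)
    return buffer.getvalue()


print(headers_to_csv_row(extract_headers(SAMPLE_COLUMNS)), end="")
```

Unlike the sed pipeline, this leaves no trailing comma after the last name. The headers also do not have to be pasted into content.csv: `csv.DictReader` accepts a `fieldnames` argument, so the list returned by `extract_headers` can be passed in directly.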

Workshop 1 Building our own search engine – Dusan Barok (monoskop)

import cgi
import cgitb; cgitb.enable()
import nltk
import re

print ("Content-type:text/html;charset=utf-8")
print ()


f = cgi.FieldStorage()
submit1 = f.getvalue("submit1", "")
submit2 = f.getvalue("submit2", "")

text = f.getvalue("text", "")

import os
import csv
import string
import pandas as pd
import sys
#input keyword you want to search
keyword = text

print ("""<!DOCTYPE html>
	<meta charset="utf-8">
<p style='font-size: 20pt; font-family: Courier'>Search by keyword</p>
	<form method="get">
	<textarea name="text" style="background: yellow; font-size: 10pt; width: 370px; height: 28px;" autofocus></textarea>
	<input type="submit" name="submit" value="Search" style='font-size: 9pt; height: 32px; vertical-align:top;'>

<p style='font-size: 9pt; font-family: Courier'>
	webring <br>
<a href="">joca</a>
<a href="">alice</a>
<a href="">michael</a>
<a href="">ange</a>
<a href="">zalan</a>
</p>
	</form>""")

x = 0
if text :
	#read csv, and split on "," the line
	csv_file = csv.reader(open('tfidf.csv', "r"), delimiter=",")
	col_names = next(csv_file)
	#loop through csv list
	for row in csv_file:
		#if current rows value is equal to input, print that row
		if keyword == row[0] :
			tfidf_list = list(zip(col_names, row))
			del tfidf_list[0]
			sorted_by_second = sorted(tfidf_list, key=lambda x:float(x[1]), reverse=True)
			print ("<p></p>")
			print ("--------------------------------------------------------------------------------------")
			print ("<p style='font-size: 20pt; font-family: Courier'>Results</p>")
			for item in sorted_by_second:
				x = x+1
				print ("--------------------------------------------------------------------------------------")
				print ("<br></br>")
				print(x, item)
				n = item[0]

				f = open("cgi-bin/texts/{}".format(n), "r")
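				# split the text into sentences, then close the file
				sents = nltk.sent_tokenize(f.read())
				f.close()

				for sentence in sents:
					# show every sentence that contains the search term as a whole word
					if re.search(r'\b({})\b'.format(re.escape(text)), sentence):
						print (sentence)
						print ("<br></br>")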
				print ("<br></br>")
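The script above only reads tfidf.csv, which it treats as a table with one row per word and one column per text. How that table was produced is not recorded in these notes. As a rough sketch, one common TF-IDF variant is term frequency (count of the word divided by the number of words in the text) multiplied by log(number of texts / number of texts containing the word). The mini-corpus, the file names and the simple regex tokenizer are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus; file names and contents are made up for illustration
DOCS = {
    "a.txt": "the library is a library of books",
    "b.txt": "the network is a network of cables",
    "c.txt": "the books travel over the network",
}


def tokenize(text):
    """Lowercase the text and keep runs of letters as words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_table(docs):
    """Return {word: {doc_name: score}} using tf * log(N / df)."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(docs)
    vocabulary = set().union(*counts.values())
    table = {}
    for word in vocabulary:
        # document frequency: in how many texts does the word occur?
        containing = sum(1 for c in counts.values() if word in c)
        idf = math.log(n_docs / containing)
        table[word] = {
            name: (c[word] / sum(c.values())) * idf for name, c in counts.items()
        }
    return table


def search(table, keyword):
    """Rank texts for one keyword, highest score first (as the CGI script does)."""
    scores = table.get(keyword, {})
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


table = tfidf_table(DOCS)
for name, score in search(table, "library"):
    print(name, round(score, 3))
```

A word that occurs in every text, such as "the" here, gets a score of zero everywhere, which is why very common words do not dominate the ranking.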

Workshop 2 Tunneling – Marcell Mars (memory of the world)

Pad of the day Marcell Mars workshop

mapping locations with firewall location/exclusive spaces: physical border, range of wi-fi

mapping the network: network switch boxes, ethernet cables, router

AP - access point

eduroam - authorization

works everywhere in the eduroam environment; the password matches in the same building. Metadata tells you where the information comes from: the university vouches for you to the central institution

what they will know from a website:

  • domain name
  • subdomain
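To make the list concrete, here is a sketch that splits a URL into the part an observer on the network can typically still learn over https (the host name, that is domain and subdomain) and the part that travels encrypted (path and query). The URL is a made-up example, `visible_and_hidden` is a hypothetical helper, and the naive split into subdomain and domain is an assumption that ignores multi-part suffixes such as `.co.uk`.

```python
from urllib.parse import urlsplit


def visible_and_hidden(url):
    """Split a URL into host parts (visible) and path plus query (encrypted)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    # Naive assumption: the last two labels form the registered domain
    domain = ".".join(labels[-2:])
    subdomain = ".".join(labels[:-2])
    # Everything after the host is sent inside the encrypted connection
    hidden = parts.path + ("?" + parts.query if parts.query else "")
    return {"domain": domain, "subdomain": subdomain, "hidden": hidden}


print(visible_and_hidden("https://en.example.org/wiki/Page?action=edit"))
```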

https: secure connection, certificate (we were paying before). Let's Encrypt, so now https can be used for every site; Mozilla was behind that

notary wildcard/ the first word is hidden:


a notary used to charge you for a wildcard certificate / now they are free
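A wildcard certificate names a pattern such as `*.example.org` instead of a single host. Below is a simplified sketch of the matching rule, assuming the wildcard stands for exactly one left-most label. Real TLS libraries apply further rules, and `matches_wildcard` is a hypothetical helper, not a library function.

```python
def matches_wildcard(pattern, hostname):
    """Return True if hostname is covered by a certificate name pattern."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    # The wildcard replaces one label only, so the label counts must agree
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        # The wildcard must match a non-empty label; the rest must be identical
        return bool(h_labels[0]) and p_labels[1:] == h_labels[1:]
    return p_labels == h_labels


print(matches_wildcard("*.example.org", "en.example.org"))
print(matches_wildcard("*.example.org", "example.org"))
```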

ex: somebody steals the certificate of the server of wikipedia and then impersonates wikipedia

jessica and logan use the part after the slash in the browser

DNS servers root servers

Automatic (DHCP): automatic IP address

  • systemd-resolve --status (for Ubuntu)
  • cat /etc/resolv.conf (DNS)
  • cat /etc/hosts
  • ip a: link/ether (with barcode)
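The DNS servers a machine uses are listed in /etc/resolv.conf on `nameserver` lines. A small sketch that reads them out of the file's text; the sample content is made up and `parse_nameservers` is a hypothetical helper name.

```python
def parse_nameservers(resolv_conf_text):
    """Return the addresses listed on 'nameserver' lines, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip empty lines and comments (resolv.conf comments start with # or ;)
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


# Hypothetical file content, for illustration only
SAMPLE_RESOLV_CONF = "# example\nnameserver 127.0.0.53\noptions edns0\nsearch lan\n"
print(parse_nameservers(SAMPLE_RESOLV_CONF))
```

On a Linux machine the real file can be passed in with `parse_nameservers(open("/etc/resolv.conf").read())`.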

macchanger in linux --> you change to another device and they don't know your MAC address

ping <is the server alive>

copy the domain into /etc/hosts <you block facebook, telling your computer to go to localhost instead>
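The blocking trick works because the hosts file is consulted before DNS: a line pairing `127.0.0.1` with a domain sends requests for that domain to your own machine. A sketch that works on the text of a hosts file rather than on /etc/hosts itself (editing the real file needs administrator rights); `block_domain` and `lookup` are hypothetical helper names.

```python
def block_domain(hosts_text, domain, target="127.0.0.1"):
    """Return hosts-file text with a line pointing domain at target appended."""
    entry = "{} {}".format(target, domain)
    lines = hosts_text.splitlines()
    # Do nothing if the same entry is already present (ignoring extra spaces)
    if entry in (" ".join(line.split()) for line in lines):
        return hosts_text
    lines.append(entry)
    return "\n".join(lines) + "\n"


def lookup(hosts_text, name):
    """Return the address the hosts text assigns to name, or None."""
    for line in hosts_text.splitlines():
        # Drop comments, then split into address and host names
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and name in fields[1:]:
            return fields[0]
    return None


hosts = "127.0.0.1 localhost\n"
blocked = block_domain(hosts, "facebook.com")
print(blocked, end="")
print(lookup(blocked, "facebook.com"))
```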

Network interface - part of the computer that talks to the internet. An IP address is assigned to the network interface on your computer (wireless, cable)

discussion on current discourse on security, politics: Amazon Web Services "more dangerous than google" ! cloud computing (set up for domains)

Even Apple is on Amazon servers:

Letter: In solidarity with Library Genesis and Sci-Hub, inviting other 'custodians' into civil disobedience