User:Lucia Dossin/Protyping/Assignment 5
Roll Your Own Google
From the exercise's page: 'This exercise is at once a simple exercise in CGI scripting and an opportunity to critically reflect on the state of the Web and the role of centralized commercial services such as Google.
Create a cgi "search engine". It needs to be self-contained, that is contain it's own index based on your own specific crawling of data. Part of the point of doing this is to reflect on the question of data centralization and specificity. What does it mean to create your own index? Your "results" could be "purely" algorithmic, and/or based only on input provided to it (via the search box), and/or using either collected or crawled data you've yourself gathered. It's only fair, that's how Google works too.'
Description
Free Association Index
My own search engine uses an index made by crossing two different texts. The input is a word typed in a search box; the output is a list of correlated words. By clicking on a word in the results, the user starts a new search, with the clicked word as the search term.
The initial idea was a subjective dictionary of synonyms, in which the correlations between words would be made directly by me: the associations would be exactly the ones I could think of, for each and every word in the index. This idea quickly proved hard to execute, as building such an index would take me a good number of years.
A much more viable approach was to take input from existing text, such as a dictionary. But if I used only a dictionary, the results would be the 'real' meaning of each word. So a mix of two texts was made: one text generated the titles (the entries) and another generated the content (the correlations). For the titles, I used A Pocket Dictionary by William Richards, and for the content, I used The Cook and Housekeeper's Complete and Universal Dictionary; Including a System of Modern Cookery, in all Its Various Branches, Adapted to the Use of Private Families.
The search mechanism runs on Whoosh, a Python search library created by Matt Chaput.
There's an initial version of the Free Association Index at
http://headroom.pzwart.wdka.hro.nl/~ldossin/free-association-index/
There is still a lot to explore in this project: for example, investigating/changing the parameters that control the score of each word, recording the clicks and the association path generated in each visit (possibly displaying it to the user), and adding new indexes (in the same fashion as Google's Images, News and Videos tabs, there could be several indexes where the same word would display different results, according to the 'nature' of each index).
Also, some improvements could be made, such as allowing searches of more than one word, and building a more interesting index, with fewer pronouns and other non-meaningful words.
Besides that, I would also like to let the user suggest a correlation for the term being searched, so that the results would display both the 'internal (official)' correlations and 'external' ones.
All of that would require a backend that lets me manage the suggestions and, eventually, add new correlations to existing terms myself. The backend could also be useful for creating new indexes.
Screenshots
Code
1. Reads a txt file and writes a new one, with one word per line.
This code was run twice: once to generate the titles and once to generate the content.
#!/usr/bin/env python
with open("initial_content/titles-file.txt") as c, open('out-titles.txt', 'w') as out:
    for line in c:
        words = line.split()
        for w in words:
            out.write(w + '\n')
2. Creates an index
import os.path
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT

# tags is declared in the schema but not filled in by the indexing script below
schema = Schema(title=TEXT(stored=True), content=TEXT(stored=True), tags=TEXT(stored=True))
if not os.path.exists("index"):
    os.mkdir("index")
ix = create_in("index", schema)
3. Adds documents to the index
from whoosh.index import open_dir
import random

ix = open_dir("index")

# Words too common to be meaningful as correlations
removed = ['to', 'and', 'is', 'for', 'from', 'the', 'a', 'an', 'of', 'in', 'at', 'with', 'this', 'that', 'or', 'by']

# Read the content words, skipping blank lines and stop words
cont = []
with open("initial_content/out-content.txt") as c:
    for line in c:
        if line != ' ' and line != '\n':
            line = line[:-1]  # strip the trailing newline
            if line not in removed:
                cont.append(str(line))

writer = ix.writer()
with open("initial_content/out-titles.txt") as f:
    for line in f:
        if line != ' ' and line != '\n':
            line = line[:-1]
            line = line.lower()
            if line not in removed:
                # Draw nine random content words for this title
                # (randrange covers the whole list, including index 0)
                r = []
                for n in range(9):
                    r.append(str(cont[random.randrange(len(cont))]))
                s = line
                for lin in range(9):
                    writer.update_document(title=u'"' + s + '"', content=u'"' + r[lin] + '"')
                # One of the content words also becomes a title, pointing back
                writer.add_document(title=u'"' + r[3] + '"', content=u'"' + s + '"')
writer.commit()