User:Laurier Rochon/prototyping/??????????

From XPUB & Lens-Based wiki

??????????

Scrapin' wut?

  • I want to scrape questions

Auxiliary verb   Subject   Main verb   Topic   Impatience   Question mark
Are              you       having      fun     yet          ?
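The anatomy above suggests a crude way to catch questions in scraped text. A minimal sketch (the regex is an assumption, not a real parser): grab any run starting at a capital letter and ending in a question mark.

```python
import re

# Assumed pattern: a capital letter, then anything that isn't
# sentence-ending punctuation, then a question mark. This will catch
# most informal one-line questions, and happily miss the weird ones.
QUESTION_RE = re.compile(r'[A-Z][^.!?]*\?')

def extract_questions(text):
    """Return every question-shaped sentence found in `text`."""
    return QUESTION_RE.findall(text)

print(extract_questions("I went home. Are you having fun yet? Who knows."))
# → ['Are you having fun yet?']
```

Good enough for a prototype; a proper sentence tokenizer (NLTK has one) would do better on multi-clause text.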



Why?

  • I want to perform topic modeling on these questions, and sort them according to the subjects they relate to
  • I would like to build a database of interrogations that link back to a certain topic - in other words: what do people wonder about and ask themselves?



How

  • I want to tap into the informal language of the Web - the blogosphere, rather than calculated writing
  • I am planning on using both methods (API & manual scraping) to amass the information
  • Because both potential outcomes (see 'Master Plan' below) would create rather personal experiences, the initial search terms' lexical fields would be geared towards the individual (i.e. : 'personal', 'how can I', 'I need to', etc.)
  • Step 1 : use the Google Blog Search API (JSON) to return a list of blogs (64 max per scrape), and store them in a database. The title of the post is very important, as it will give much information on what this post is about
  • Step 2 : use python to visit those links, scrape the page (BeautifulSoup, lxml, etc.) and catch all questions - store them in a database. If possible, use the main text body to categorize the post further.
  • Storage structure (hypothetical) : if possible, I would like to store this information in plain text files, although I might need a relational database if I start dealing with different content types. My first intention was (and still is) to have one text file for every topic, with all data in a tab-delimited format. I.e. : GOD.txt contains
    Timestamp1       Blog url1       Post title1       Author1       Question1
    Timestamp2       Blog url2       Post title2       Author2       Question2
  • On the other hand, it could be useful to rate (using NLTK) the relevance of certain topics (a blog post would not really deal with only ONE topic), which is not really possible in the last example. Using a relational approach, I could have a table of questions, a table of topics, a table of authors, etc.
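Steps 1 and 2 plus the tab-delimited storage can be sketched roughly as below. The endpoint is an assumption from memory of the Google Blog Search JSON API (since retired), whose `rsz=8` pages, walked with `start`, give the 64-result ceiling mentioned above; the question regex and `tab_row` format are my own placeholders.

```python
import re
import urllib.parse

# Assumed endpoint of the (now retired) Google Blog Search JSON API.
API = 'http://ajax.googleapis.com/ajax/services/search/blogs'

def search_url(query, start=0):
    # Step 1: build a request for one page (8 results) of blog posts.
    # Paging `start` by 8, up to 56, yields at most 64 posts per scrape.
    params = urllib.parse.urlencode({'v': '1.0', 'q': query,
                                     'rsz': '8', 'start': start})
    return '%s?%s' % (API, params)

QUESTION_RE = re.compile(r'[A-Z][^.!?]*\?')

def questions_in(html):
    # Step 2: strip tags crudely and keep question-shaped sentences.
    # (BeautifulSoup would do the tag-stripping properly.)
    text = re.sub(r'<[^>]+>', ' ', html)
    return QUESTION_RE.findall(text)

def tab_row(timestamp, url, title, author, question):
    # Storage: one tab-delimited line per question, as in GOD.txt above.
    return '\t'.join([timestamp, url, title, author, question])
```

Actually fetching a page would be `json.load(urllib.request.urlopen(search_url('how can I')))`, then visiting each result's `postUrl` - untested here, since the API no longer answers.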
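The relational alternative could look like the sqlite3 sketch below. The schema is an assumption: a join table between questions and topics carries a relevance score, which is where an NLTK-derived "how much is this post about X" rating would live.

```python
import sqlite3

def make_db(path=':memory:'):
    """Hypothetical schema: authors, topics, questions, and a scored
    question<->topic join table (one post can touch several topics)."""
    db = sqlite3.connect(path)
    db.executescript('''
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE topics  (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE questions (
            id INTEGER PRIMARY KEY,
            author_id  INTEGER REFERENCES authors(id),
            blog_url   TEXT,
            post_title TEXT,
            scraped_at TEXT,
            body       TEXT);
        CREATE TABLE question_topics (
            question_id INTEGER REFERENCES questions(id),
            topic_id    INTEGER REFERENCES topics(id),
            relevance   REAL);  -- e.g. an NLTK-rated topic score
    ''')
    return db
```

This keeps the one-file-per-topic idea reachable too: dumping GOD.txt is just a join on `question_topics` filtered by topic name.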



Master plan

  • Idea #1 : following last semester's direction, I am interested in the idea of an 'interrogator', or 'lie detector'. Once again, I would build a chatting program, but made for human-computer interaction, instead of human-COMPUTER-human interaction (The Listener). This program would fire off questions at you, pulled from the database I will have collected, and try to corner you into contradictions. Assuming the computer's memory is faster and more effective than any human's, this could be an interesting challenge.
  • Idea #2 : not following in any semester's direction, it could be fun to make books out of this database (encyclopedia of interrogations on subject XXXXX). Not only is it an interesting regression of medium from a theoretical point of view, but leveraging the power of the database could yield fun results such as a book called "Of turtles and peer pressure, by Author1, Author2, Author3, etc.", joining all data relating to 'turtles' and 'peer pressure', compressing it into a large string of questions, packed in a book. Similar project (seemingly ripping Jonathan Harris' idea) : http://gregory.incident.net/project/le-registre--the-register/ (blog feelings turned into books)



Soft

Soon...