User:Laurier Rochon/prototyping/??????????

From XPUB & Lens-Based wiki



??????????

Scrapin' wut?

  • I want to scrape questions

Auxiliary verb   Subject   Main verb   Topic   Impatience   Question mark
Are              you       having      fun     yet          ?



Why?

  • I want to perform topic modeling on these questions, and sort them according to the subjects they relate to
  • I would like to build a database of interrogations that link back to a certain topic - in other words, what do people wonder about, ask themselves?



How

  • I want to tap into the informal language of the Web - the blogosphere, rather than calculated writing
  • I am planning on using both methods (API & manual scraping) to amass the information
  • Because both potential outcomes (see 'Master plan' below) would create rather personal experiences, the lexical fields of the initial search terms would be geared towards the individual (e.g. 'personal', 'how can I', 'I need to')
  • Step 1 : use the Google Blog Search API (JSON) to return a list of blog posts (64 max per scrape), and store them in a database. The title of the post is very important, as it says a lot about what the post is about
  • Step 2 : use Python to visit those links, scrape the page (BeautifulSoup, lxml, etc.) and catch all questions - store them in a database. If possible, use the main text body to categorize the post further.
  • Storage structure (hypothetical) : if possible, I would like to store this information in plain text files, although I might need a relational database if I start dealing with different content types. My first intention was (and still is) to have one text file for every topic, with all data in a tab-delimited format. E.g. GOD.txt contains
    Timestamp1       Blog url1       Post title1       Author1       Question1
    Timestamp2       Blog url2       Post title2       Author2       Question2
  • On the other hand, it could be useful to rate (using NLTK) how relevant each topic is to a post (a blog post rarely deals with only ONE topic), which is not really possible with the one-file-per-topic layout above. Using a relational approach, I could have a table of questions, a table of topics, a table of authors, etc.
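
Step 2 and the plain-text storage idea could be sketched roughly as follows. This is only a minimal sketch under a few assumptions: it uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup or lxml, it treats any sentence ending in a question mark as a question, and the names `TextExtractor`, `extract_questions` and `append_rows` are hypothetical.

```python
import csv
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text of a page, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def extract_questions(html):
    """Return every sentence of the page text that ends with a question mark."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text chunks and collapse runs of whitespace
    text = re.sub(r"\s+", " ", " ".join(parser.chunks))
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s.strip() for s in sentences if s.strip().endswith("?")]


def append_rows(path, timestamp, url, title, author, questions):
    """Append one tab-delimited row per question to a topic file."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for q in questions:
            writer.writerow([timestamp, url, title, author, q])


# Made-up page, only for illustration
page = "<html><body><p>I love turtles. Are you having fun yet? Why not?</p></body></html>"
questions = extract_questions(page)
print(questions)
```

A call such as `append_rows("GOD.txt", timestamp, url, title, author, questions)` would then add one line per question to the topic file, in the column order shown above. The sentence splitter is deliberately crude; NLTK's sentence tokenizer would be a likely replacement.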



Master plan

  • Idea #1 : following last semester's direction, I am interested in the idea of an 'interrogator', or 'lie detector'. Once again, I would build a chatting program, but made for human-computer interaction, instead of human-COMPUTER-human interaction (The Listener). This program would fire off questions at you, pulled from the database I will have collected, and try to corner you into contradictions. Assuming the computer's memory is faster and more effective than any human's, this could be an interesting challenge.
  • Idea #2 : not following in any semester's direction, it could be fun to make books out of this database (encyclopedia of interrogations on subject XXXXX). Not only is it an interesting regression of medium from a theoretical point of view, but leveraging the power of the database could yield fun results such as a book called "Of turtles and peer pressure, by Author1, Author2, Author3, etc.", joining all data relating to 'turtles' and 'peer pressure', compressing it into a large string of questions, packed in a book. Similar project (seemingly ripping Jonathan Harris' idea) : http://gregory.incident.net/project/le-registre--the-register/ (blog feelings turned into books)
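
The join described in Idea #2 maps fairly naturally onto the relational layout considered in the How section. Here is a minimal sketch using SQLite through Python's standard `sqlite3` module. The table names, column names, relevance scores and sample rows are all made up for illustration, and "relating to" is read here as "linked to either topic".

```python
import sqlite3

db = sqlite3.connect(":memory:")  # in-memory database, for illustration only
db.executescript("""
CREATE TABLE authors   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE topics    (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE questions (id INTEGER PRIMARY KEY, author_id INTEGER, body TEXT);
-- A question can relate to several topics, each with its own relevance score
CREATE TABLE question_topics (question_id INTEGER, topic_id INTEGER, relevance REAL);
""")

# Made-up sample rows
db.executemany("INSERT INTO authors VALUES (?, ?)", [(1, "Author1"), (2, "Author2")])
db.executemany("INSERT INTO topics VALUES (?, ?)",
               [(1, "turtles"), (2, "peer pressure"), (3, "GOD")])
db.executemany("INSERT INTO questions VALUES (?, ?, ?)", [
    (1, 1, "Do turtles ever feel left out?"),
    (2, 2, "Why do I always give in to my friends?"),
    (3, 1, "Is anybody up there?"),
])
db.executemany("INSERT INTO question_topics VALUES (?, ?, ?)", [
    (1, 1, 0.9), (1, 2, 0.4), (2, 2, 0.8), (3, 3, 0.7),
])

book_topics = ("turtles", "peer pressure")

# Every question linked to either topic, listed once, with its author
rows = db.execute("""
    SELECT DISTINCT q.id, a.name, q.body
    FROM questions q
    JOIN authors a ON a.id = q.author_id
    JOIN question_topics qt ON qt.question_id = q.id
    JOIN topics t ON t.id = qt.topic_id
    WHERE t.name IN (?, ?)
    ORDER BY q.id
""", book_topics).fetchall()

# Authors in order of first appearance, without repeats
authors = list(dict.fromkeys(name for _, name, _ in rows))
title = "Of " + " and ".join(book_topics) + ", by " + ", ".join(authors)
body = " ".join(question for _, _, question in rows)

print(title)
print(body)
```

The `relevance` column is where a topic rating (from NLTK or any other method) could be stored, so that a book could later be limited to the questions most strongly tied to its topics.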



Soft

  • This first simple API call returns a nice list of blog titles, full of insecure people, self-absorbed people, oblivious people and just normal people. The titles alone already contain some nice questions.
import json
import urllib.request

titles = []

# The API returns 8 results per page with rsz=large, and 64 results at most,
# so request the pages starting at 0, 8, 16, ..., 56
for start in range(0, 64, 8):
    url = ('https://ajax.googleapis.com/ajax/services/search/blogs'
           '?v=1.0&q=myself&start=' + str(start) + '&rsz=large')

    f = urllib.request.urlopen(url)
    data = json.load(f)

    # Keep only the title of each result
    for r in data['responseData']['results']:
        titles.append(r['title'])

for t in titles:
    print(t)
The Non-Blonde: Smells Like Coming Home To <b>Myself</b>
I will never call <b>myself</b> a star: Anushka Sharma : News : News <b>...</b>
An argument I&#39;m delighted to use <b>myself</b>
Jay Sean – Me Against <b>Myself</b> - Liriklagump3indonesia.com
Introducing <b>myself</b> .....
Setting <b>Myself</b> Free From Food With God&#39;s Help - That&#39;s Fit
kickin <b>myself</b> a tad bit - Overclock.net - Overclocking.net
StephTheBookworm: Accepting <b>Myself</b> as a Blogger
Christina Aguilera - Not <b>Myself</b> Tonight (2010) HDTV 720p x264 <b>...</b>
How can I prepare and calm <b>myself</b> before an audition? | Health Wiki
Simply Introducing <b>Myself</b> | Real Super Powers
I Want To Kill <b>Myself</b>? Dumps
fell of the wagon so dissapointed <b>myself</b>
Introducing <b>myself</b>: &quot;SolarWilliam&quot; - Webdigity webmaster forums
Surrender, Dorothy: I Have Really Hurt <b>Myself</b>
Smile » Blog Archive » Some updates on frogr 0.4 and <b>myself</b>
Surrender, Dorothy: I Have Really Hurt <b>Myself</b>
Musicalfan Loves Minerals: I Couldn&#39;t Help <b>Myself</b>
Smile » Blog Archive » Some updates on frogr 0.4 and <b>myself</b>
How Can I Make <b>Myself</b> Stop Worrying So Much And Just Be Happy? Dumps
MayRay in the City: A Post to <b>Myself</b>
Am I Full Of <b>Myself</b> Or Is She A Fake? iPhone ™
Sepulchre of Heroes: So Allow Me to Introduce <b>Myself</b>
Life with dignity: Kicking <b>myself</b> in the back..
Photobombing... <b>myself</b> and Chris Strom LLC! | Flickr - Photo Sharing!
best way to educate <b>myself</b>- catholicism? | Book for Everyone
Lollipop Loves: Educating <b>myself</b>
Isonomist: I find <b>myself</b> unmoored in time.
Introducing <b>Myself</b> to this forum - vBadvanced Forums
The French Fry Fairy and <b>Myself</b> Attend Free Pizza Day at Tucci&#39;s <b>...</b>
how can i measure <b>myself</b> without a tape measure for a dress i want <b>...</b>
Daily[n] News: Save me from <b>myself</b>...
Daily[n] News: Save me from <b>myself</b>...
Lavender, Leopard, and Lace: Let Me Introduce <b>Myself</b>
What&#39;s Up With This Girl Am I Just Full Of <b>Myself</b>? 3G Pie
What&#39;s Up With This Girl Am I Just Full Of <b>Myself</b>? Search 3G
Im going to find <b>myself</b> a good woman ! | Flickr - Photo Sharing!
Apparently, when I feel sorry for <b>myself</b>… « Girl Meets Bulgaria
I want to do payroll <b>myself</b> for my employees, how can I do it <b>...</b>
I think I just ****ed <b>myself</b> over... HELP ASAP! - Overclock.net <b>...</b>
Bullard: &#39;I Definitely Don&#39;t Think of <b>Myself</b> As Don Quixote <b>...</b>
Am I Full Of <b>Myself</b> Or Is She A Fake? - Iphone - fake - Full <b>...</b>
How Will Be This Year 2011 For <b>Myself</b> And For My Husband? Dumps
Challenge to <b>myself</b> |
Dolphin and Condor Fabrics: D.I. <b>Myself</b>
What&#39;s Up With This Girl Am I Just Full Of <b>Myself</b>? iPhone ™
What all do I need to do to train <b>myself</b> to become good at making <b>...</b>
conradlihilihi: Allow Me to Introduce <b>Myself</b>..
An argument I&#39;m delighted to use <b>myself</b> - Flashman Letters
Judith HeartSong: thinking and choosing for <b>myself</b>
Is it bad to treat <b>myself</b> once a week to nuttella on wheat toast <b>...</b>
10 Excuses For Saving <b>Myself</b> The Liability of Pet Dog Insurance <b>...</b>
A Vintage Girl at Home: How am I not <b>myself</b>? How am I not <b>myself</b> <b>...</b>
CANIdoit <b>Myself</b> 2011 Goals!!!;)
never <b>myself</b>
Computer Care, Can I Do It <b>Myself</b>? | Technology , gadget, Smart <b>...</b>
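
The raw titles come back with HTML entities (`&#39;`, `&quot;`) and `<b>` highlighting, and some of them appear more than once. A possible clean-up pass, sketched here with the standard library only, would decode the entities, strip the tags, drop duplicates and keep only the titles that contain a question mark. The helper name `clean_title` is hypothetical, and the sample titles are copied from the output above.

```python
import html
import re


def clean_title(raw):
    """Strip HTML tags and decode entities in a title returned by the API."""
    no_tags = re.sub(r"<[^>]+>", "", raw)
    return html.unescape(no_tags).strip()


raw_titles = [
    "An argument I&#39;m delighted to use <b>myself</b>",
    "How can I prepare and calm <b>myself</b> before an audition? | Health Wiki",
    "Daily[n] News: Save me from <b>myself</b>...",
    "Daily[n] News: Save me from <b>myself</b>...",
]

# dict.fromkeys drops duplicates while keeping the original order
unique_titles = list(dict.fromkeys(clean_title(t) for t in raw_titles))

# Keep only the titles that contain a question
question_titles = [t for t in unique_titles if "?" in t]

for t in question_titles:
    print(t)
```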