User:Laurier Rochon/prototyping/??????????
Scrapin' wut?
- I want to scrape questions
Auxiliary verb | Subject | Main verb | Topic | Impatience | Question mark |
Are | you | having | fun | yet | ? |
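A pattern like this can be roughed out with a regular expression - a sketch only (Python 3, and the auxiliary-verb list is my own, far from exhaustive):

```python
import re

# Hypothetical sketch: catch questions that open with a common auxiliary
# verb and run to a question mark. The verb list is illustrative only.
AUX = r"(?:am|is|are|was|were|do|does|did|have|has|had|can|could|will|would|shall|should|may|might|must)"
QUESTION_RE = re.compile(r"\b" + AUX + r"\b[^.?!]*\?", re.IGNORECASE)

def find_questions(text):
    """Return all auxiliary-verb-led questions found in a block of text."""
    return QUESTION_RE.findall(text)

print(find_questions("Are you having fun yet? I think so. Can this work?"))
# → ['Are you having fun yet?', 'Can this work?']
```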
Why?
- It's about analyzing what people don't know about certain things (e.g. what is the most common question people ask themselves when it comes to religion?)
- I want to perform topic modeling on these questions, and sort them according to the subjects they relate to
Pieces
- 1 scraper that gets the original links to scrape -> store in a text file, tab-delimited
- 1 spider that visits those links, grabs the questions and stores them in another (many?) text file(s). Perhaps this spider could also grab other links and add them to the first file. This spider will also have to perform the topic modeling tasks needed to categorize and contextualize the questions I will be harvesting.
- I want to keep an archive of visited pages (another text file) too, which will mean there will be a filling->emptying->filling motion going from the freshly scraped archive to the then-scraped links
- Finally some kind of output will use the questions in some interesting way...
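The pieces above can be sketched as one loop - a minimal mock-up (Python 3, every name hypothetical) of links draining from a frontier into a visited archive while each fetched page refills the frontier, the filling->emptying->filling motion:

```python
# Minimal sketch of the plumbing described above (all names hypothetical):
# links drain from a "to visit" frontier into a visited archive, and each
# fetched page can push new links back onto the frontier.

def scrapeline(frontier, visited, fetch):
    """Drain the frontier: fetch each link, archive it, queue newfound links."""
    questions = []
    while frontier:
        link = frontier.pop(0)
        if link in visited:
            continue
        visited.add(link)
        page_questions, page_links = fetch(link)
        questions.extend(page_questions)
        # the refilling motion: freshly discovered links rejoin the frontier
        frontier.extend(l for l in page_links if l not in visited)
    return questions

# toy fetcher standing in for the real spider
def fake_fetch(url):
    return (["question from " + url], ["b"] if url == "a" else [])

harvest = scrapeline(["a"], set(), fake_fetch)
print(harvest)  # → ['question from a', 'question from b']
```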
Scraper + pipeline = scrapeline
Master plan
- We'll see what happens...
Idea #1 : following last semester's direction, this time I would build an 'interrogator', or 'lie detector'. Once again a chatting program, but made for human-computer interaction instead of human-COMPUTER-human interaction (The Listener). It would fire questions at you, pulled from the database I will have collected, let you respond, and try to corner you into contradictions. Assuming the computer's memory is faster and more precise than any human's, this could be an interesting challenge.
Idea #2 : not following any semester's direction whatsoever, it could be fun to make books out of this database (an encyclopedia of interrogations on subject XXXXX). Not only do I find it a curious regression of medium (back to print) from a theoretical point of view, but leveraging the power of the database could yield unorthodox results such as a book called "Of turtles and peer pressure, by Author1, Author2, Author3, etc.", joining all data relating to 'turtles' and 'peer pressure' and compressing it into a large string of questions, packed into a book. Pushing a little further, can some questions on a subject answer questions on another?
Related (seemingly ripping off Jonathan Harris' idea, but in print) : http://gregory.incident.net/project/le-registre--the-register/ (blog feelings turned into books)
How
- I want to tap into the informal language of the Web - the blogosphere, rather than calculated writing
- I am planning on using both methods (API & manual scraping) to amass the information
- Because both potential outcomes (see 'Master plan' above) would create rather personal experiences, the initial search terms' lexical fields would be geared towards the individual (e.g. 'personal', 'how can I', 'I need to', etc.)
- Step 1 : use the Google Blog Search API (JSON) to return a list of blogs (64 max per scrape), and store them in a database. The title of the post is very important, as it will give much information on what this post is about
- Step 2 : use python to visit those links, scrape the page (Beautifulsoup, lxml, etc.) and catch all questions - store them in a database. If possible, use the main text body to categorize the post further.
- Storage structure (hypothetical) : if possible, I would like to store this information in plain text files, although I might need a relational database if I start dealing with different content types. My first intention was (and still is) to have one text file for every topic, with all data in a tab-delimited format. I.e. GOD.txt contains :
Timestamp1 Blog url1 Post title1 Author1 Question1
Timestamp2 Blog url2 Post title2 Author2 Question2
- On the other hand, it could be useful to rate (using NLTK) the relevance of certain topics (a blog post rarely deals with only ONE topic), which is not really possible in the last example. Using a relational approach, I could have a table of questions, a table of topics, a table of authors, etc.
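The relational variant could look like this - a SQLite sketch where the table and column names are my own guesses at the structure just described, and the relevance score is the one NLTK would hand back:

```python
import sqlite3

# Hypothetical relational layout: questions link to an author, and to any
# number of topics through a weighted junction table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE topics    (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE questions (id INTEGER PRIMARY KEY, ts TEXT, blog_url TEXT,
                        post_title TEXT,
                        author_id INTEGER REFERENCES authors(id),
                        body TEXT);
-- a post rarely deals with only ONE topic, so relevance lives here
CREATE TABLE question_topics (question_id INTEGER REFERENCES questions(id),
                              topic_id INTEGER REFERENCES topics(id),
                              relevance REAL);
""")
conn.execute("INSERT INTO topics (label) VALUES ('god')")
conn.execute("INSERT INTO questions (ts, blog_url, post_title, body) "
             "VALUES ('Timestamp1', 'Blog url1', 'Post title1', 'Question1')")
conn.execute("INSERT INTO question_topics VALUES (1, 1, 0.8)")
rows = conn.execute("SELECT q.body, t.label, qt.relevance FROM questions q "
                    "JOIN question_topics qt ON qt.question_id = q.id "
                    "JOIN topics t ON t.id = qt.topic_id").fetchall()
print(rows)  # → [('Question1', 'god', 0.8)]
```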
Soft
I moved all further software development to this page
- This first simple API call returns a nice list of blog titles and urls - full of insecure people, self-absorbed people, oblivious people and just plain normal people. The titles already contain some nice questions, too.
- On a more technical note (1), a bit of research seems to indicate that a large majority of blogs (very arbitrarily checked) use very standard identifiers for their content containers - perhaps because of Google's indexing, or because nobody who bothers to build their own blog ever diverts from naming conventions. Because blog sidebars are typically tag clouds, links and a bunch of crap full of '?' marks (url.com?tag=bla for parameters), it would probably be wise to filter content by ID/class ('content', 'page', 'wrapper', 'container', 'entry' would cover almost everyone)
- On another technical note, how to grab content in the most efficient manner for the manual scrape? Or in other words, how to make it degrade nicely if some of those IDs/classes aren't part of the markup?...The plan would be to have a set of very specific ids/classes ('postbody', 'post', 'post-xxxxx'), then go up the chain to the more general 'wrapper' and 'container' classes, and finally use the 'body' as the final fallback...until something smarter comes up!
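The degrade-nicely chain could be sketched with nothing but the standard library - this version (Python 3; the fallback list is illustrative) just scans the markup for known ids/classes and picks the most specific one it finds:

```python
from html.parser import HTMLParser

# Hypothetical fallback chain, most specific first; 'body' is the last resort.
FALLBACKS = ["postbody", "post", "entry", "content", "wrapper", "container"]

class IdClassFinder(HTMLParser):
    """Record every id/class token that appears in the markup."""
    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("id", "class") and value:
                self.seen.update(value.split())

def pick_container(html):
    """Return the most specific known id/class found, degrading to 'body'."""
    finder = IdClassFinder()
    finder.feed(html)
    for candidate in FALLBACKS:
        if candidate in finder.seen:
            return candidate
    return "body"

print(pick_container('<div class="container"><div class="entry">x</div></div>'))
# → entry
```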
import urllib2
import json

start = 0
titles = []
urls = []
while start < 64:
    # the Blog Search API serves 8 results per page ('rsz=large'), 64 max
    url = ('https://ajax.googleapis.com/ajax/services/search/blogs?v=1.0&q=myself&start=' + str(start) + '&rsz=large')
    f = urllib2.urlopen(url)
    data = json.load(f)
    for r in data['responseData']['results']:
        titles.append(r['title'])
        urls.append(r['postUrl'])
        print r['title']
        print r['postUrl']
    start += 8
The Non-Blonde: Smells Like Coming Home To <b>Myself</b>
http://thenonblonde.blogspot.com/2011/01/smells-like-coming-home-to-myself.html
I will never call <b>myself</b> a star: Anushka Sharma : News : News <b>...</b>
http://www.news.chauthiduniya.com/i-will-never-call-myself-a-star-anushka-sharma
I'd Find <b>Myself</b> Drowning In My Own Tears « Betsy Lerner
http://betsylerner.wordpress.com/2011/01/23/id-find-myself-drowning-in-my-own-tears/
An argument I'm delighted to use <b>myself</b>
http://timworstall.com/2011/01/23/an-argument-im-delighted-to-use-myself/
Jay Sean – Me Against <b>Myself</b> - Liriklagump3indonesia.com
http://liriklagump3indonesia.com/j/jay-sean/jay-sean-me-against-myself/
Introducing <b>myself</b> .....
http://www.exceem.co.uk/forums/introductions/66771-introducing-myself.html
Setting <b>Myself</b> Free From Food With God's Help - That's Fit
http://www.thatsfit.com/2011/01/21/setting-myself-free-from-food-with-gods-help/
kickin <b>myself</b> a tad bit - Overclock.net - Overclocking.net
http://www.overclock.net/intel-general/923191-kickin-myself-tad-bit.html
Christina Aguilera - Not <b>Myself</b> Tonight (2010) HDTV 720p x264 <b>...</b>
http://worldforfree.net/videos/1146319765-christina-aguilera-not-myself-tonight-2010-hdtv-720p-x264.html
How can I prepare and calm <b>myself</b> before an audition? | Health Wiki
http://www.healthcarewiki.org/how-can-i-prepare-and-calm-myself-before-an-audition/
Simply Introducing <b>Myself</b> | Real Super Powers
http://www.realsuperpowers.com/simply-introducing-myself
Introducing <b>myself</b>: "SolarWilliam" - Webdigity webmaster forums
http://www.webdigity.com/index.php/topic,11185.0.Introducing+myself%3A+%26amp%3Bquot%3BSolarWilliam%26amp%3Bquot%3B.html
I Want To Kill <b>Myself</b>? Dumps
http://www.yiyu.us/i-want-to-kill-myself/
fell of the wagon so dissapointed <b>myself</b>
http://www.atkinsdietbulletinboard.com/forums/atkins-diet-extended-induction/94643-fell-wagon-so-dissapointed-myself.html
Surrender, Dorothy: I Have Really Hurt <b>Myself</b>
http://surrenderdorothy.typepad.com/surrender_dorothy/2011/01/i-have-really-hurt-myself.html
Smile » Blog Archive » Some updates on frogr 0.4 and <b>myself</b>
http://blogs.igalia.com/mario/2011/01/22/some-updates-on-frogr-0-4-and-myself/
Musicalfan Loves Minerals: I Couldn't Help <b>Myself</b>
http://musicalfanlovesminerals.blogspot.com/2011/01/i-couldnt-help-myself.html
How Can I Make <b>Myself</b> Stop Worrying So Much And Just Be Happy? Dumps
http://www.yiyu.us/how-can-i-make-myself-stop-worrying-so-much-and-just-be-happy/
MayRay in the City: A Post to <b>Myself</b>
http://mayrayinthecity.blogspot.com/2011/01/post-to-myself.html
Am I Full Of <b>Myself</b> Or Is She A Fake? iPhone ™
http://www.iphonetm.com/am-i-full-of-myself-or-is-she-a-fake/
Sepulchre of Heroes: So Allow Me to Introduce <b>Myself</b>
http://sepulchreofheroes.blogspot.com/2011/01/so-allow-me-to-introduce-myself.html
Life with dignity: Kicking <b>myself</b> in the back..
http://alexandra-lifewithdignity.blogspot.com/2011/01/kicking-myself-in-back.html
Anushka: I feel scared to call <b>myself</b> a star
http://www.unp.co.in/f163/anushka-i-feel-scared-to-call-myself-a-star-135963/
...
- And finally, stripping inline JS and CSS is super easy using Beautifulsoup - then we can split questions at the '?' mark, walk back to the previous capital letter that follows a period, and we should have a basic question scraper...
import urllib2
from BeautifulSoup import BeautifulSoup

request = urllib2.Request("http://maxwelldemon.com/2011/01/22/i-find-myself-looking-for-a-job/")
request.add_header("User-Agent", "Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5")
f = urllib2.urlopen(request)
c = f.read()
# keep only the text nodes that do not live inside <script> or <style> tags
soup = BeautifulSoup(''.join(BeautifulSoup(c).findAll(
    text=lambda text: text.parent.name != "script" and text.parent.name != "style")))
print soup
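Once the text is stripped, the question-splitting step could look like this (Python 3, unlike the urllib2 snippets above; the sentence-boundary heuristic is exactly the naive one just described, nothing smarter):

```python
# Sketch of the question-splitting step: cut at '?' marks, then walk back to
# the previous period and keep the span if it starts with a capital letter.

def extract_questions(text):
    """Return the questions found in a block of stripped page text."""
    questions = []
    for chunk in text.split('?')[:-1]:  # text after the last '?' holds no question
        start = chunk.rfind('.') + 1    # question begins after the last period
        question = chunk[start:].strip()
        if question and question[0].isupper():
            questions.append(question + '?')
    return questions

print(extract_questions("I went home. Was it worth it? Hardly. Why do I bother?"))
# → ['Was it worth it?', 'Why do I bother?']
```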