User:Laurier Rochon/prototyping/??????????: Difference between revisions

From XPUB & Lens-Based wiki
 
(12 intermediate revisions by the same user not shown)
Latest revision as of 18:46, 22 February 2011



??????????

Scrapin' wut?

  • I want to scrape questions

    Auxiliary verb   Subject   Main verb   Topic   Impatience   Question mark
    Are              you       having      fun     yet          ?



Why?

  • It's about analyzing what people don't know about certain things (e.g. what is the most common question people ask themselves when it comes to religion?)
  • I want to perform topic modeling (http://mallet.cs.umass.edu/topics.php) on these questions, and sort them according to the subjects they relate to



Pieces

  • 1 scraper that gets the original links to scrape and stores them in a tab-delimited text file
  • 1 spider that visits those links, grabs the questions and stores them in another text file (or several). Perhaps this spider could also grab other links and add them to the first file. This spider will also have to perform the topic modeling tasks needed to categorize and contextualize the questions I will be harvesting.
  • I also want to keep an archive of visited pages (another text file), which means links will move in a filling->emptying->filling cycle: the file of freshly scraped links fills up, empties into the archive as pages get visited, then fills up again
  • Finally some kind of output will use the questions in some interesting way...
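
A rough sketch of how the link file and the visited archive could feed each other. It is written for Python 3, unlike the Python 2 listings on this page. The row layout (url, then title, separated by a tab) and the names `read_links`, `write_links`, `drain_queue` and `visit` are assumptions made for illustration; the notes above only say the files are tab-delimited.

```python
import csv
import os


def read_links(path):
    """Return the rows of a tab-delimited link file, or [] if it does not exist yet."""
    if not os.path.exists(path):
        return []
    with open(path, newline="", encoding="utf-8") as f:
        return [row for row in csv.reader(f, delimiter="\t") if row]


def write_links(path, rows):
    """Overwrite a tab-delimited link file with the given rows."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)


def drain_queue(queue_path, visited_path, visit):
    """Visit each queued link once, archive it, and refill the queue with new finds.

    `visit` is supplied by the caller: it takes a URL and returns the list of
    further URLs found on that page (possibly empty).
    """
    visited = read_links(visited_path)
    seen = {row[0] for row in visited}
    found = []
    for row in read_links(queue_path):
        url = row[0]
        if url in seen:
            continue
        found.extend(visit(url))
        visited.append(row)
        seen.add(url)
    # keep only links that are neither archived nor already queued in this pass
    queue = []
    for url in found:
        if url not in seen:
            seen.add(url)
            queue.append([url, ""])
    write_links(visited_path, visited)  # the archive fills up
    write_links(queue_path, queue)      # the queue is emptied, then refilled
    return queue
```

Calling `drain_queue` repeatedly gives the filling->emptying->filling motion: each pass empties the queue into the archive and leaves behind only links that have not been seen before.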

Scraper + pipeline = scrapeline

Scrapeline.jpg



Master plan

  • We'll see what happens...
  • Idea #1 : following last semester's direction, this time I would build an 'interrogator', or 'lie detector'. Once again a chat program, but made for human-computer interaction instead of human-COMPUTER-human interaction (The Listener). It would fire off questions at you, pulled from the database I will have collected, let you respond, and try to corner you into contradictions. Assuming the computer's memory is faster and more precise than any human's, this could be an interesting challenge.
  • Idea #2 : not following any semester's direction whatsoever, it could be fun to make books out of this database (encyclopedia of interrogations on subject XXXXX). Not only do I find it a curious regression of medium (back to print) from a theoretical point of view, but leveraging the power of the database could yield unorthodox results such as a book called "Of turtles and peer pressure, by Author1, Author2, Author3, etc.", joining all data relating to 'turtles' and 'peer pressure' and compressing it into a large string of questions, packed in a book. Pushing a little further, can some questions on one subject answer questions on another? Related (seemingly ripping off Jonathan Harris' idea, but in print) : http://gregory.incident.net/project/le-registre--the-register/ (blog feelings turned into books)



How

  • I want to tap into the informal language of the Web - the blogosphere, rather than calculated writing
  • I am planning on using both methods (API & manual scraping) to amass the information
  • Because both potential outcomes (see 'Master plan' above) would create rather personal experiences, the initial search terms' lexical fields would be geared towards the individual (e.g. 'personal', 'how can I', 'I need to', etc.)
  • Step 1 : use the Google Blog Search API (JSON) to return a list of blogs (64 max per scrape), and store them in a database. The title of the post is very important, as it says a lot about what the post is about
  • Step 2 : use Python to visit those links, scrape the page (BeautifulSoup, lxml, etc.) and catch all questions, then store them in a database. If possible, use the main text body to categorize the post further.
  • Storage structure (hypothetical) : if possible, I would like to store this information in plain text files, although I might need a relational database if I start dealing with different content types. My first intention was (and still is) to have one text file for every topic, with all data in a tab-delimited format. E.g. GOD.txt contains
    Timestamp1       Blog url1       Post title1       Author1       Question1
    Timestamp2       Blog url2       Post title2       Author2       Question2
  • On the other hand, it could be useful to rate (using NLTK) the relevance of certain topics (a blog post would not really deal with only ONE topic), which is not really possible with the one-file-per-topic layout above. Using a relational approach, I could have a table of questions, a table of topics, a table of authors, etc.
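
A minimal sketch of what the relational option could look like, using Python 3's built-in `sqlite3`. The table and column names, the `relevance` score and the helpers `open_db`, `get_or_create`, `add_question` and `questions_for` are all assumptions for illustration; the notes only say there would be tables of questions, topics and authors. The relevance values are passed in by the caller, so how they get computed (NLTK or otherwise) is left open.

```python
import sqlite3

SCHEMA = """
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE topics  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE questions (
    id         INTEGER PRIMARY KEY,
    author_id  INTEGER REFERENCES authors(id),
    blog_url   TEXT,
    post_title TEXT,
    scraped_at TEXT,
    question   TEXT
);
-- one row per (question, topic) pair, so a question can relate to many topics
CREATE TABLE question_topics (
    question_id INTEGER REFERENCES questions(id),
    topic_id    INTEGER REFERENCES topics(id),
    relevance   REAL,
    PRIMARY KEY (question_id, topic_id)
);
"""


def open_db(path=":memory:"):
    """Open a database and create the (hypothetical) schema."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def get_or_create(db, table, name):
    """Return the id of `name` in a lookup table, inserting it if needed."""
    # `table` is only ever one of our own table names, never user input
    db.execute("INSERT OR IGNORE INTO %s (name) VALUES (?)" % table, (name,))
    row = db.execute("SELECT id FROM %s WHERE name = ?" % table, (name,)).fetchone()
    return row[0]


def add_question(db, author, blog_url, post_title, scraped_at, question, topic_scores):
    """Store one question plus a relevance score for each related topic."""
    author_id = get_or_create(db, "authors", author)
    cur = db.execute(
        "INSERT INTO questions (author_id, blog_url, post_title, scraped_at, question)"
        " VALUES (?, ?, ?, ?, ?)",
        (author_id, blog_url, post_title, scraped_at, question))
    question_id = cur.lastrowid
    for topic, relevance in topic_scores.items():
        db.execute(
            "INSERT INTO question_topics (question_id, topic_id, relevance)"
            " VALUES (?, ?, ?)",
            (question_id, get_or_create(db, "topics", topic), relevance))
    return question_id


def questions_for(db, topic):
    """Return (question, relevance) pairs for a topic, most relevant first."""
    return db.execute(
        "SELECT q.question, qt.relevance FROM questions q"
        " JOIN question_topics qt ON qt.question_id = q.id"
        " JOIN topics t ON t.id = qt.topic_id"
        " WHERE t.name = ? ORDER BY qt.relevance DESC", (topic,)).fetchall()
```

Compared with one text file per topic, the join table lets a single question count towards several topics at different strengths, which is the limitation raised in the last point.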



Soft


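I moved all further software development to this page: http://pzwart3.wdka.hro.nl/wiki/User:Laurier_Rochon/prototyping/%3F%3F%3F%3F%3F%3F%3F%3F%3F%3Fsoft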


  • This first simple API call returns a nice list of blog titles and urls, full of insecure people, self-absorbed people, oblivious people and just normal people. The titles already have nice questions in them, too.
  • On a more technical note, a bit of research seems to indicate that a large majority of blogs (very arbitrarily checked) use very standard identifiers for the elements holding their content, perhaps because of Google's indexing, or because nobody ever really bothered to build their own blog AND deviate from naming conventions. Because blog sidebars are typically tag clouds, links and a bunch of crap full of '?' marks (url.com?tag=bla for parameters), it would probably be wise to filter content by ID/classes ('content, page, wrapper, container, entry' would cover almost everyone)
  • On another technical note, how to grab content in the most efficient manner for the manual scrape? In other words, how to make it degrade nicely if some of those ID/classes aren't part of the markup? The plan would be to start with a set of very specific ids/classes ('postbody', 'post', 'post-xxxxx'), then go up the chain to the more general 'wrapper' and 'container' classes, and finally use the 'body' as the last fallback... Until something smarter comes up!
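
The fallback chain from the last note could be sketched as follows, in Python 3 with only the standard library's `html.parser` (the listings on this page use Python 2 and BeautifulSoup instead). The exact order of `PRIORITY`, where 'entry', 'content' and 'page' sit in it, and the names `BlockCollector` and `main_text` are assumptions; only the idea of going from specific names to generic wrappers to the body comes from the notes.

```python
from html.parser import HTMLParser

# Most specific names first, generic wrappers last.
PRIORITY = ["postbody", "post", "entry", "content", "page", "wrapper", "container"]
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class BlockCollector(HTMLParser):
    """Collect the text found under each id/class name, and under <body>."""

    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, names) for every currently open element
        self.blocks = {}  # name -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return  # void elements never get a closing tag
        attrs = dict(attrs)
        names = set((attrs.get("class") or "").split())
        if attrs.get("id"):
            names.add(attrs["id"])
        if tag == "body":
            names.add("<body>")
        self.stack.append((tag, names))

    def handle_endtag(self, tag):
        # pop back to the matching open tag, tolerating unclosed children
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or any(tag in ("script", "style") for tag, _ in self.stack):
            return
        # credit the text to every id/class of every enclosing element
        for name in set().union(*(names for _, names in self.stack)):
            self.blocks.setdefault(name, []).append(text)


def main_text(html):
    """Return (matched name, text) using the most specific block available."""
    collector = BlockCollector()
    collector.feed(html)
    for wanted in PRIORITY:
        for name, parts in collector.blocks.items():
            # 'post' also covers numbered variants such as 'post-123'
            if name == wanted or (wanted == "post" and name.startswith("post-")):
                return wanted, " ".join(parts)
    return "<body>", " ".join(collector.blocks.get("<body>", []))
```

Returning the matched name alongside the text makes it easy to log how often the scrape had to fall back to a generic wrapper or to the whole body.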
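# Query the Google Blog Search API (JSON) for posts matching "myself".
# Results are fetched 8 at a time, so step 'start' by 8 up to the 64-result cap.
import urllib2
import json

start = 0
titles = []
urls = []

while start < 64:
	url = ('https://ajax.googleapis.com/ajax/services/search/blogs'
	       '?v=1.0&q=myself&start=' + str(start) + '&rsz=large')

	f = urllib2.urlopen(url)
	data = json.load(f)

	# each result carries the post title (with <b> highlighting) and its URL
	for r in data['responseData']['results']:
		titles.append(r['title'])
		urls.append(r['postUrl'])
		# encode explicitly so non-ASCII titles don't break print
		print r['title'].encode('utf-8')
		print r['postUrl']
	start += 8
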
The Non-Blonde: Smells Like Coming Home To <b>Myself</b>
http://thenonblonde.blogspot.com/2011/01/smells-like-coming-home-to-myself.html
I will never call <b>myself</b> a star: Anushka Sharma : News : News <b>...</b>
http://www.news.chauthiduniya.com/i-will-never-call-myself-a-star-anushka-sharma
I&#39;d Find <b>Myself</b> Drowning In My Own Tears « Betsy Lerner
http://betsylerner.wordpress.com/2011/01/23/id-find-myself-drowning-in-my-own-tears/
An argument I&#39;m delighted to use <b>myself</b>
http://timworstall.com/2011/01/23/an-argument-im-delighted-to-use-myself/
Jay Sean – Me Against <b>Myself</b> - Liriklagump3indonesia.com
http://liriklagump3indonesia.com/j/jay-sean/jay-sean-me-against-myself/
Introducing <b>myself</b> .....
http://www.exceem.co.uk/forums/introductions/66771-introducing-myself.html
Setting <b>Myself</b> Free From Food With God&#39;s Help - That&#39;s Fit
http://www.thatsfit.com/2011/01/21/setting-myself-free-from-food-with-gods-help/
kickin <b>myself</b> a tad bit - Overclock.net - Overclocking.net
http://www.overclock.net/intel-general/923191-kickin-myself-tad-bit.html
Christina Aguilera - Not <b>Myself</b> Tonight (2010) HDTV 720p x264 <b>...</b>
http://worldforfree.net/videos/1146319765-christina-aguilera-not-myself-tonight-2010-hdtv-720p-x264.html
How can I prepare and calm <b>myself</b> before an audition? | Health Wiki
http://www.healthcarewiki.org/how-can-i-prepare-and-calm-myself-before-an-audition/
Simply Introducing <b>Myself</b> | Real Super Powers
http://www.realsuperpowers.com/simply-introducing-myself
Introducing <b>myself</b>: &quot;SolarWilliam&quot; - Webdigity webmaster forums
http://www.webdigity.com/index.php/topic,11185.0.Introducing+myself%3A+%26amp%3Bquot%3BSolarWilliam%26amp%3Bquot%3B.html
I Want To Kill <b>Myself</b>? Dumps
http://www.yiyu.us/i-want-to-kill-myself/
fell of the wagon so dissapointed <b>myself</b>
http://www.atkinsdietbulletinboard.com/forums/atkins-diet-extended-induction/94643-fell-wagon-so-dissapointed-myself.html
Surrender, Dorothy: I Have Really Hurt <b>Myself</b>
http://surrenderdorothy.typepad.com/surrender_dorothy/2011/01/i-have-really-hurt-myself.html
Smile » Blog Archive » Some updates on frogr 0.4 and <b>myself</b>
http://blogs.igalia.com/mario/2011/01/22/some-updates-on-frogr-0-4-and-myself/
Surrender, Dorothy: I Have Really Hurt <b>Myself</b>
http://surrenderdorothy.typepad.com/surrender_dorothy/2011/01/i-have-really-hurt-myself.html
Smile » Blog Archive » Some updates on frogr 0.4 and <b>myself</b>
http://blogs.igalia.com/mario/2011/01/22/some-updates-on-frogr-0-4-and-myself/
Musicalfan Loves Minerals: I Couldn&#39;t Help <b>Myself</b>
http://musicalfanlovesminerals.blogspot.com/2011/01/i-couldnt-help-myself.html
How Can I Make <b>Myself</b> Stop Worrying So Much And Just Be Happy? Dumps
http://www.yiyu.us/how-can-i-make-myself-stop-worrying-so-much-and-just-be-happy/
MayRay in the City: A Post to <b>Myself</b>
http://mayrayinthecity.blogspot.com/2011/01/post-to-myself.html
Am I Full Of <b>Myself</b> Or Is She A Fake? iPhone ™
http://www.iphonetm.com/am-i-full-of-myself-or-is-she-a-fake/
Sepulchre of Heroes: So Allow Me to Introduce <b>Myself</b>
http://sepulchreofheroes.blogspot.com/2011/01/so-allow-me-to-introduce-myself.html
Life with dignity: Kicking <b>myself</b> in the back..
http://alexandra-lifewithdignity.blogspot.com/2011/01/kicking-myself-in-back.html
Anushka: I feel scared to call <b>myself</b> a star
http://www.unp.co.in/f163/anushka-i-feel-scared-to-call-myself-a-star-135963/

...
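
The titles in the listing above still carry the API's `<b>` highlighting and HTML entities such as `&#39;` and `&quot;`. A small Python 3 sketch of how they might be cleaned up before looking for questions in them; `clean_title` and `question_titles` are names made up for this example.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")


def clean_title(raw):
    """Drop the <b> highlighting, then decode entities such as &#39; and &quot;."""
    # tags are stripped before unescaping so that an encoded '&lt;' is never read as markup
    return html.unescape(TAG.sub("", raw)).strip()


def question_titles(titles):
    """Return the cleaned titles that contain a question mark."""
    cleaned = (clean_title(t) for t in titles)
    return [t for t in cleaned if "?" in t]
```

This only flags titles containing a '?'; it does not try to cut the question out of the rest of the title (site names, separators and so on).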
  • And finally, stripping inline JS and CSS is super easy using BeautifulSoup. Then we can split the text at each '?' mark, back up to the nearest capital letter that follows a period, and we should have a basic question scraper...
import urllib2
from BeautifulSoup import BeautifulSoup

# fetch the page, sending a browser-like User-Agent header
request = urllib2.Request("http://maxwelldemon.com/2011/01/22/i-find-myself-looking-for-a-job/")
request.add_header("User-Agent", "Mozilla/5.0 (X11; U; Linux x86_64; "
	"fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5")
f = urllib2.urlopen(request)
c = f.read()

# keep only the text nodes whose parent is neither <script> nor <style>,
# then re-parse the joined text
soup = BeautifulSoup(''.join(BeautifulSoup(c).findAll(text=lambda text:
	text.parent.name != "script" and text.parent.name != "style")))

print soup
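
Once the text is stripped of scripts and styles, the question-splitting step described above might look roughly like this in Python 3. The regular expression and the name `extract_questions` are assumptions, one possible reading of "split at the '?' and back up to the capital letter that follows a period".

```python
import re

# A sentence break: '.', '!' or '?' followed by whitespace and a capital letter.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def extract_questions(text):
    """Return the sentences of `text` that end with a question mark."""
    text = " ".join(text.split())  # collapse line breaks and runs of spaces
    return [s for s in SENTENCE_BREAK.split(text) if s.endswith("?")]
```

Because a break needs whitespace after the punctuation, a '?' inside a URL such as url.com?tag=bla does not start a new sentence. Abbreviations followed by a capitalized word ("Mr. Smith") will still be split in the wrong place, so this remains a basic scraper.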