User:Eleanorg/2.1/BeautifulSoup RSS grabber

From XPUB & Lens-Based wiki
Revision as of 19:09, 24 October 2012 by Eleanorg

urllib is all well and good, but how do you get /useful/ information out of a webpage? I followed a lovely tutorial. The script grabs a few items from the main HuffPo RSS feed and prints out the title, link and first part of the article body for each item:
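The regex half of that technique can be sketched on its own first. This is a minimal, self-contained illustration, not part of the original tutorial: the feed fragment and example.com URLs are made up, but the capture-group patterns mirror the ones the script applies to the real feed (Python 3 syntax here):

```python
import re

# a made-up fragment in the same shape as the HuffPo raw feed
feed = '''
<item><title>First headline</title>
<link rel="alternate" type="text/html" href="http://example.com/one" /></item>
<item><title>Second headline</title>
<link rel="alternate" type="text/html" href="http://example.com/two" /></item>
'''

# capture parentheses grab just the text between the tags
titles = re.findall('<title>(.*)</title>', feed)
links = re.findall('<link rel.*href="(.*)" />', feed)

for title, link in zip(titles, links):
    print(title, link)
# First headline http://example.com/one
# Second headline http://example.com/two
```

Because `.` doesn't match newlines, each pattern matches within a single line of the feed, which is why this works at all; it is also why regex scraping is fragile compared to a real parser.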

'The X Factor' Top 16: Demi Lovato And Simon Cowell Choose Their Top Fours (VIDEO) http://www.huffingtonpost.com/2012/10/24/the-x-factor-top-16-demi-simon-video_n_2007825.html

Thanks to a baseball postseason rain delay, last week's episode of <a href="http://www.huffingtonpost.com/2012/10/18/the-x-factor-half-top-16-reveal-video_n_1978116.html" target="_hplink">"The X Factor" Top 16 reveal was abruptly cut in half</a>. So Fox scheduled the show on Tuesday night at 9:30 p.m. ET so the judges could finish revealing their full Top 16.

Last week, <a href="http://www.aoltv.com/celebs/la-reid/10053271/main" target="_hplink">L.A. Reid</a> and <a href="http://www.aoltv.com/celebs/britney-spears/1290171/main" target="_hplink">Britney Spears</a> picked their groups' contestants who would advance to the live shows. So what mattered this week was seeing which four singers <a href="http://www.aoltv.com/celebs/demi-lovato/547804/main" target="_hplink">Demi Lovato</a> picked for the Young Adults category, and which four Groups made it throug


'Extreme Cheapskates': Millionaire Forages For Food, Pees In Bottles (VIDEO) http://www.huffingtonpost.com/2012/10/24/extreme-cheapskates-millionaire-video_n_2007805.html

Most people have an image in their heads of what a millionaire lifestyle would be like. Well, the latest episode of "<a href="http://www.aoltv.com/show/extreme-cheapskates/9121212" target="_hplink">Extreme Cheapskates</a>" throws that fantasy right out the window and into the woods -- where Victoria forages for food. According to her boyfriend, Victoria is a millionaire. Perhaps she got that way by never spending a cent. Ever.

She's so beyond frugal that she pees in bottles to avoid having to flush, showers at a gym -- despite her shower working just fine -- and uses appliances that are nearly fifty years old, except that she doesn't even always do that. To save on costs, she cooked a meal for her family on a makeshift stove outside

And then there's how she gets her food. Victoria was proud that she gets her food from a combination of dumpster di


SAME THING, USING ONLY BEAUTIFUL SOUP:

<title>Syria Agrees To Ceasefire During Eid Al-Adha Holiday, Peace Envoy Lakhdar Brahimi Says</title> <link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/syria-ceasefire-eid_n_2008028.html" />


<title>'Sons Of Anarchy': Joel McHale Can't Outrun The Club, Gemma Can't Dodge A Semi (VIDEO)</title> <link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/sons-of-anarchy-joel-mchale-run-video_n_2007887.html" />


<title>'The X Factor' Top 16: Demi Lovato And Simon Cowell Choose Their Top Fours (VIDEO)</title> <link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/the-x-factor-top-16-demi-simon-video_n_2007825.html" />


<title>'Extreme Cheapskates': Millionaire Forages For Food, Pees In Bottles (VIDEO)</title> <link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/extreme-cheapskates-millionaire-video_n_2007805.html" />


<source lang="python">
#!/usr/bin/python
# -*- coding:utf-8 -*-

# tutorial on webkit & gtk:
# http://www.youtube.com/watch?v=Ap_DlSrT-iE

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

print "hello"

# grab the HuffPo RSS feed
webpage = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/raw_feed').read()
# print webpage

patTitle = re.compile('<title>(.*)</title>')       # regex for text between title tags, in capture parentheses
patLink = re.compile('<link rel.*href="(.*)" />')  # ditto for link URLs

findPatTitle = re.findall(patTitle, webpage)       # find all matches to patTitle in the webpage string
findPatLink = re.findall(patLink, webpage)

# print findPatTitle
# print findPatLink

# findPatTitle is a list. Iterate over it to print out each item in the list:
for i in range(1, 5):
    print findPatTitle[i]
    print findPatLink[i]
    origArticlePage = urlopen(findPatLink[i]).read()              # open the URL that the item link points to
    divBegin = origArticlePage.find('<div class="articleBody"')   # find the POSITION of the start of the div containing the article
    article = origArticlePage[divBegin:(divBegin + 1000)]         # ...and grab the following 1000 characters

    # now we use Beautiful Soup to parse these 1000 characters in 'article'
    # and grab out the important info inside the tags
    soup = BeautifulSoup(article)
    listOfParagraphs = soup.findAll('p')   # Beautiful Soup finds and parses all the p tags - no regex needed
    for paragraph in listOfParagraphs:
        print paragraph
        print "\n"

# you can use Beautiful Soup to simplify the process of finding titles and links, too.
# instead of cumbersome re.compile() regex statements, just use Beautiful Soup thus:
soup2 = BeautifulSoup(webpage)
titleSoup = soup2.findAll('title')   # this creates a list of all the titles
linkSoup = soup2.findAll('link')     # this creates a list of all the links

print "SAME THING, USING ONLY BEAUTIFUL SOUP:"
for i in range(1, 5):
    print titleSoup[i]
    print linkSoup[i]
    print "\n"
</source>
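BeautifulSoup 3 (the `BeautifulSoup` package imported above) has long since been superseded by `bs4`, so it may not install cleanly today. As a fallback, the same title/link grab can be sketched with nothing but the standard library's HTMLParser. This is an illustrative sketch, not the tutorial's method; the class name and the test snippet fed to it are made up (Python 3 syntax):

```python
from html.parser import HTMLParser

class FeedParser(HTMLParser):
    """Collect <title> text and <link href="..."> values from markup."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
        elif tag == 'link':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data)

parser = FeedParser()
parser.feed('<title>Hello</title><link rel="alternate" href="http://example.com/x" />')
print(parser.titles)   # ['Hello']
print(parser.links)    # ['http://example.com/x']
```

Self-closing tags like `<link ... />` are routed through `handle_starttag` as well, which is why the link handler above catches them without extra code.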