User:Eleanorg/2.1/BeautifulSoup RSS grabber
urllib is all well and good, but how do you get /useful/ information from a webpage? I followed a lovely tutorial. It grabs a few items from the HuffPo main RSS feed and prints out the title, link and first part of the main article for each item:
'The X Factor' Top 16: Demi Lovato And Simon Cowell Choose Their Top Fours (VIDEO) http://www.huffingtonpost.com/2012/10/24/the-x-factor-top-16-demi-simon-video_n_2007825.html
Thanks to a baseball postseason rain delay, last week's episode of <a href="http://www.huffingtonpost.com/2012/10/18/the-x-factor-half-top-16-reveal-video_n_1978116.html" target="_hplink">"The X Factor" Top 16 reveal was abruptly cut in half</a>. So Fox scheduled the show on Tuesday night at 9:30 p.m. ET so the judges could finish revealing their full Top 16.
Last week, <a href="http://www.aoltv.com/celebs/la-reid/10053271/main" target="_hplink">L.A. Reid</a> and <a href="http://www.aoltv.com/celebs/britney-spears/1290171/main" target="_hplink">Britney Spears</a> picked their groups' contestants who would advance to the live shows. So what mattered this week was seeing which four singers <a href="http://www.aoltv.com/celebs/demi-lovato/547804/main" target="_hplink">Demi Lovato</a> picked for the Young Adults category, and which four Groups made it throug
'Extreme Cheapskates': Millionaire Forages For Food, Pees In Bottles (VIDEO)
http://www.huffingtonpost.com/2012/10/24/extreme-cheapskates-millionaire-video_n_2007805.html
Most people have an image in their heads of what a millionaire lifestyle would be like. Well, the latest episode of "<a href="http://www.aoltv.com/show/extreme-cheapskates/9121212" target="_hplink">Extreme Cheapskates</a>" throws that fantasy right out the window and into the woods -- where Victoria forages for food. According to her boyfriend, Victoria is a millionaire. Perhaps she got that way by never spending a cent. Ever.
She's so beyond frugal that she pees in bottles to avoid having to flush, showers at a gym -- despite her shower working just fine -- and uses appliances that are nearly fifty years old, except that she doesn't even always do that. To save on costs, she cooked a meal for her family on a makeshift stove outside
And then there's how she gets her food. Victoria was proud that she gets her food from a combination of dumpster di
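The trick the script relies on is re.findall() with capture parentheses: only the text inside the (...) group is returned, and the surrounding tag text is thrown away. Here is that idea in miniature, as a self-contained Python 3 sketch with made-up sample data standing in for the live feed:

```python
import re

# a tiny stand-in for the real RSS feed (made-up sample data)
feed = ('<item><title>First story</title>'
        '<link rel="alternate" type="text/html" href="http://example.com/a" /></item>'
        '<item><title>Second story</title>'
        '<link rel="alternate" type="text/html" href="http://example.com/b" /></item>')

# everything inside the capture parentheses is returned; the tags are discarded
titles = re.findall(r'<title>(.*?)</title>', feed)
links = re.findall(r'<link rel.*?href="(.*?)" />', feed)

print(titles)  # ['First story', 'Second story']
print(links)   # ['http://example.com/a', 'http://example.com/b']
```

Note the non-greedy `.*?` - with a plain greedy `.*`, a single match could swallow everything from the first `<title>` to the last `</title>` when the feed sits on one long line.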
SAME THING, USING ONLY BEAUTIFUL SOUP:
<title>Syria Agrees To Ceasefire During Eid Al-Adha Holiday, Peace Envoy Lakhdar Brahimi Says</title>
<link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/syria-ceasefire-eid_n_2008028.html" />
<title>'Sons Of Anarchy': Joel McHale Can't Outrun The Club, Gemma Can't Dodge A Semi (VIDEO)</title>
<link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/sons-of-anarchy-joel-mchale-run-video_n_2007887.html" />
<title>'The X Factor' Top 16: Demi Lovato And Simon Cowell Choose Their Top Fours (VIDEO)</title>
<link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/the-x-factor-top-16-demi-simon-video_n_2007825.html" />
<title>'Extreme Cheapskates': Millionaire Forages For Food, Pees In Bottles (VIDEO)</title>
<link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/extreme-cheapskates-millionaire-video_n_2007805.html" />
<source lang="python">
#!/usr/bin/python
# -*- coding: utf-8 -*-
# based on a tutorial on webkit & gtk:
# http://www.youtube.com/watch?v=Ap_DlSrT-iE

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

print "hello"

# grab the HuffPo RSS feed
webpage = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/raw_feed').read()
# print webpage

patTitle = re.compile('<title>(.*)</title>')       # regex for text between title tags, in capture parentheses
patLink = re.compile('<link rel.*href="(.*)" />')  # ditto for link URLs

findPatTitle = re.findall(patTitle, webpage)  # find all matches to patTitle in the webpage variable
findPatLink = re.findall(patLink, webpage)
# print findPatTitle
# print findPatLink

# findPatTitle is a list. Iterate over it to print out each item in the list:
for i in range(1, 5):
    print findPatTitle[i]
    print findPatLink[i]

    origArticlePage = urlopen(findPatLink[i]).read()             # open the URL that the item link points to
    divBegin = origArticlePage.find('<div class="articleBody"')  # find the POSITION of the start of the div containing the article
    article = origArticlePage[divBegin:divBegin + 1000]          # ...and grab the following 1000 characters

    # now we use Beautiful Soup to parse these 1000 characters in 'article'
    # and grab out the important info inside the tags
    soup = BeautifulSoup(article)
    listOfParagraphs = soup.findAll('p')  # find all the p tags - Beautiful Soup parses them, so no need for a regex
    for paragraph in listOfParagraphs:
        print paragraph
        print "\n"

# you can use Beautiful Soup to simplify the process of finding titles and links, too.
# instead of cumbersome re.compile() regex statements, just use Beautiful Soup thus:
soup2 = BeautifulSoup(webpage)
titleSoup = soup2.findAll('title')  # this creates a list of all the titles
linkSoup = soup2.findAll('link')    # this creates a list of all the links

print "SAME THING, USING ONLY BEAUTIFUL SOUP:"
for i in range(1, 5):
    print titleSoup[i]
    print linkSoup[i]
    print "\n"
</source>
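The script above is Python 2 with BeautifulSoup 3, both of which are now end-of-life. If you want to do the same extraction with no third-party package at all, Python 3's standard-library xml.etree.ElementTree can parse the Atom-style feed directly - a rough sketch, again with a made-up sample feed in place of the live HuffPo URL:

```python
import xml.etree.ElementTree as ET

# a tiny Atom-style sample standing in for the live feed (made-up data)
feed = '''<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>First story</title>
    <link rel="alternate" type="text/html" href="http://example.com/a" />
  </entry>
  <entry>
    <title>Second story</title>
    <link rel="alternate" type="text/html" href="http://example.com/b" />
  </entry>
</feed>'''

ns = {'atom': 'http://www.w3.org/2005/Atom'}  # Atom elements live in this namespace
root = ET.fromstring(feed)

items = []
for entry in root.findall('atom:entry', ns):
    title = entry.find('atom:title', ns).text       # text content of <title>
    href = entry.find('atom:link', ns).get('href')  # href attribute of <link>
    items.append((title, href))
    print(title, '->', href)
```

Because ElementTree is a real XML parser, it copes with attribute reordering and whitespace that would silently break the regex version.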