User:Eleanorg/2.1/BeautifulSoup RSS grabber


urllib is all well and good, but how do you get /useful/ information from a webpage? I followed a lovely tutorial:

<source lang="python">

#!/usr/bin/python
# -*- coding:utf-8 -*-

# tutorial on webkit & gtk
# http://www.youtube.com/watch?v=Ap_DlSrT-iE

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re
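# (note, my addition: this is the old BeautifulSoup 3 import; the current
#  package is beautifulsoup4, imported as "from bs4 import BeautifulSoup")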


print "hello"

# grab the hufpo RSS feed
webpage = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/raw_feed').read()

# print webpage

patTitle = re.compile('<title>(.*)</title>')        # regex for text between title tags, in capture parentheses
patLink = re.compile('<link rel.*href="(.*)" />')   # ditto for link URLs

findPatTitle = re.findall(patTitle, webpage)        # find all matches to patTitle in webpage variable
findPatLink = re.findall(patLink, webpage)

# print findPatTitle
# print findPatLink
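# (aside, my addition: because each pattern has one capture group,
#  re.findall returns just the captured text - the bare headline or URL,
#  not the whole tag)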


# findPatTitle is a list. Iterate over it to print out each item in the list:

for i in range(1,5):
    print findPatTitle[i]
    print findPatLink[i]
    origArticlePage = urlopen(findPatLink[i]).read()             # open the URL that the item link points to
    divBegin = origArticlePage.find('<div class="articleBody"')  # find the POSITION of the start of the div containing the article
    article = origArticlePage[divBegin:(divBegin+1000)]          # ...and grab the following 1000 characters
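    # (note, my addition: str.find() returns -1 if the div isn't present,
    #  so the slice above would then start at the page's last character and
    #  usually come out empty; a guard like "if divBegin == -1: continue"
    #  straight after the find() call would skip such articles)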

# now we use beautiful soup to parse these 1000 characters in 'article'
# (i.e. the last article from the loop above) and grab out important info inside tags

soup = BeautifulSoup(article)
listOfParagraphs = soup.findAll('p')   # Beautiful Soup finds all the <p> tags for us - no need for a regex
for i in listOfParagraphs:
    print i
    print "\n"

# you can use beautiful soup to simplify the process of finding titles and links, too.
# instead of cumbersome blah.compile() regex statements, just use beautifulsoup thus:

soup2 = BeautifulSoup(webpage)
titleSoup = soup2.findAll('title')   # this creates a list of all the titles
linkSoup = soup2.findAll('link')     # this creates a list of all the links

print "SAME THING, USING ONLY BEAUTIFUL SOUP:"
for i in range(1,5):
    print titleSoup[i]
    print linkSoup[i]
    print "\n"
</source>
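
If you want just the text rather than the whole tags printed, the old BeautifulSoup (3.x) used here gives each Tag a .string attribute, and tag attributes can be read with .get(). A minimal sketch of that refinement (my addition, not from the tutorial, reusing the titleSoup and linkSoup lists from above):

<source lang="python">
# sketch: print the bare headline text and URLs instead of whole tags
for i in range(1,5):
    print titleSoup[i].string      # text inside <title>...</title> (None if the tag has mixed contents)
    print linkSoup[i].get('href')  # the href attribute, or None if absent
</source>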