User:Eleanorg/2.1/BeautifulSoup RSS grabber
urllib is all well and good, but how do you get /useful/ information from a webpage? I followed a lovely tutorial:
<source lang="python">
#!/usr/bin/python
# -*- coding:utf-8 -*-

# tutorial on webkit & gtk
# http://www.youtube.com/watch?v=Ap_DlSrT-iE

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

print "hello"

# grab the HuffPo RSS feed
webpage = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/raw_feed').read()
# print webpage

patTitle = re.compile('<title>(.*)</title>')           # regex for text between title tags, in capture parentheses
patLink = re.compile('<link rel.*href="(.*)" />')      # ditto for link URLs

findPatTitle = re.findall(patTitle, webpage)           # find all matches to patTitle in webpage variable
findPatLink = re.findall(patLink, webpage)
# print findPatTitle
# print findPatLink

# findPatTitle is a list. Iterate over it to print out each item in the list:
for i in range(1,5):
    print findPatTitle[i]
    print findPatLink[i]

    origArticlePage = urlopen(findPatLink[i]).read()              # open the URL that the item link points to
    divBegin = origArticlePage.find('<div class="articleBody"')   # find the POSITION of the start of the div containing the article
    article = origArticlePage[divBegin:(divBegin+1000)]           # ...and grab the following 1000 characters

    # now we use beautiful soup to parse these 1000 characters in 'article' and grab out important info inside <p> tags
    soup = BeautifulSoup(article)
    listOfParagraphs = soup.findAll('p')    # findAll returns every p tag - beautiful soup parses them, so no regex is needed here
    for paragraph in listOfParagraphs:      # separate loop variable so the outer i is not overwritten
        print paragraph
        print "\n"

# you can use beautiful soup to simplify the process of finding titles and links, too.
# instead of cumbersome blah.compile() regex statements, just use beautifulsoup thus:
soup2 = BeautifulSoup(webpage)
titleSoup = soup2.findAll('title')    # this creates a list of all the titles
linkSoup = soup2.findAll('link')      # this creates a list of all the links
print "SAME THING, USING ONLY BEAUTIFUL SOUP:"
for i in range(1,5):
    print titleSoup[i]
    print linkSoup[i]
    print "\n"
</source>
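
One thing worth noting about the regex half of the script: `(.*)` is greedy, so it only behaves when every title sits on its own line in the feed. If several tags share a line, a single match swallows everything between the first `<title>` and the last `</title>`. Here is a minimal sketch of that difference in Python 3, using only the standard library. The feed text, `SAMPLE_FEED`, and `extract_items` are made up for illustration and are not part of the tutorial or the real Huffington Post feed; the sketch also assumes, like the `range(1,5)` loops above, that the first title and link belong to the feed itself rather than to a story.

```python
import re

# Hypothetical stand-in for the downloaded feed, all on one line on purpose.
# This is NOT real feed data; it only mimics the tag shapes the regexes expect.
SAMPLE_FEED = (
    '<feed><title>Example Feed</title>'
    '<link rel="alternate" href="http://example.com/" />'
    '<entry><title>First story</title>'
    '<link rel="alternate" href="http://example.com/1" /></entry>'
    '<entry><title>Second story</title>'
    '<link rel="alternate" href="http://example.com/2" /></entry></feed>'
)

# Greedy: .* runs to the LAST </title> on the line.
greedy_title = re.compile(r'<title>(.*)</title>')

# Non-greedy: .*? stops at the FIRST closing tag or quote it can.
lazy_title = re.compile(r'<title>(.*?)</title>')
lazy_link = re.compile(r'<link rel[^>]*?href="(.*?)"')


def extract_items(feed_text):
    """Return (title, link) pairs, skipping the feed-level title and link."""
    titles = lazy_title.findall(feed_text)
    links = lazy_link.findall(feed_text)
    # Index 0 is assumed to describe the feed itself, so start at 1.
    return list(zip(titles[1:], links[1:]))


if __name__ == "__main__":
    # The greedy pattern yields one oversized match instead of three titles.
    print(len(greedy_title.findall(SAMPLE_FEED)))
    for title, link in extract_items(SAMPLE_FEED):
        print(title, link)
```

This is part of why handing the parsing to Beautiful Soup is less fragile: it finds tags by structure rather than by how the text happens to be laid out on each line.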