User:Eleanorg/2.1/BeautifulSoup RSS grabber


urllib is all well and good, but how do you get /useful/ information from a webpage? I followed a lovely tutorial:

<source lang="python">

#!/usr/bin/python
# -*- coding:utf-8 -*-

# tutorial on webkit & gtk
# http://www.youtube.com/watch?v=Ap_DlSrT-iE

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re
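# (note, my addition: this is the old BeautifulSoup 3 import; the current
#  package is beautifulsoup4, imported as "from bs4 import BeautifulSoup")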


print "hello"

# grab the hufpo RSS feed
webpage = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/raw_feed').read()

# print webpage

patTitle = re.compile('<title>(.*)</title>')        # regex for text between title tags, in capture parentheses
patLink = re.compile('<link rel.*href="(.*)" />')   # ditto for link URLs

findPatTitle = re.findall(patTitle, webpage)        # find all matches to patTitle in webpage variable
findPatLink = re.findall(patLink, webpage)

# print findPatTitle
# print findPatLink
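# (aside, my addition: because each pattern has one capture group,
#  re.findall returns just the captured text - the bare headline or URL,
#  not the whole tag)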


# findPatTitle is a list. Iterate over it to print out each item in the list:

for i in range(1,5):
    print findPatTitle[i]
    print findPatLink[i]
    origArticlePage = urlopen(findPatLink[i]).read()             # open the URL that the item link points to
    divBegin = origArticlePage.find('<div class="articleBody"')  # find the POSITION of the start of the div containing the article
    article = origArticlePage[divBegin:(divBegin+1000)]          # ...and grab the following 1000 characters
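    # (note, my addition: str.find() returns -1 if the div isn't present,
    #  so the slice above would then start at the page's last character and
    #  usually come out empty; a guard like "if divBegin == -1: continue"
    #  straight after the find() call would skip such articles)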

# now we use beautiful soup to parse these 1000 characters in 'article'
# (i.e. the last article from the loop above) and grab out important info inside tags

soup = BeautifulSoup(article)
listOfParagraphs = soup.findAll('p')   # Beautiful Soup finds all the <p> tags for us - no need for a regex
for i in listOfParagraphs:
    print i
    print "\n"

# you can use beautiful soup to simplify the process of finding titles and links, too.
# instead of cumbersome blah.compile() regex statements, just use beautifulsoup thus:

soup2 = BeautifulSoup(webpage)
titleSoup = soup2.findAll('title')   # this creates a list of all the titles
linkSoup = soup2.findAll('link')     # this creates a list of all the links

print "SAME THING, USING ONLY BEAUTIFUL SOUP:"
for i in range(1,5):
    print titleSoup[i]
    print linkSoup[i]
    print "\n"
</source>
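
If you want just the text rather than the whole tags printed, the old BeautifulSoup (3.x) used here gives each Tag a .string attribute, and tag attributes can be read with .get(). A minimal sketch of that refinement (my addition, not from the tutorial, reusing the titleSoup and linkSoup lists from above):

<source lang="python">
# sketch: print the bare headline text and URLs instead of whole tags
for i in range(1,5):
    print titleSoup[i].string      # text inside <title>...</title> (None if the tag has mixed contents)
    print linkSoup[i].get('href')  # the href attribute, or None if absent
</source>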