User:Eleanorg/2.1/BeautifulSoup RSS grabber

From XPUB & Lens-Based wiki

Latest revision as of 18:11, 24 October 2012

urllib is all well and good, but how do you get /useful/ information from a webpage? I followed a lovely tutorial. It grabs a few items from the HuffPo main RSS feed and prints out the title, link and first part of the main article for each item. The bit at the bottom uses only BeautifulSoup (not regular Python regexes) to grab just the title and link for each item:

<title>Syria Agrees To Ceasefire During Eid Al-Adha Holiday, Peace Envoy Lakhdar Brahimi Says</title> <link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/syria-ceasefire-eid_n_2008028.html" />


<title>'Sons Of Anarchy': Joel McHale Can't Outrun The Club, Gemma Can't Dodge A Semi (VIDEO)</title> <link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/sons-of-anarchy-joel-mchale-run-video_n_2007887.html" />


<title>'The X Factor' Top 16: Demi Lovato And Simon Cowell Choose Their Top Fours (VIDEO)</title> <link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/the-x-factor-top-16-demi-simon-video_n_2007825.html" />


<title>'Extreme Cheapskates': Millionaire Forages For Food, Pees In Bottles (VIDEO)</title> <link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/extreme-cheapskates-millionaire-video_n_2007805.html" />
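The script below builds its patterns with re.compile and capture parentheses, so findall returns only the captured text rather than the whole match. A minimal self-contained sketch of that trick, run on a made-up feed item (the headline and URL here are invented, not from the real feed):

```python
import re

# A sample item shaped like the entries in the HuffPo feed (hypothetical data).
sample = '<title>Example Headline</title> <link rel="alternate" type="text/html" href="http://example.com/story.html" />'

# The parentheses form a capture group: findall returns a list of the
# captured substrings, not the full matches.
patTitle = re.compile('<title>(.*)</title>')
patLink = re.compile('<link rel.*href="(.*)" />')

print(patTitle.findall(sample))   # ['Example Headline']
print(patLink.findall(sample))    # ['http://example.com/story.html']
```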

#!/usr/bin/python
#-*- coding:utf-8 -*-

# tutorial on webkit & gtk
# http://www.youtube.com/watch?v=Ap_DlSrT-iE

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re


print "hello"

# grab the HuffPo RSS feed
webpage = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/raw_feed').read()
#print webpage

patTitle = re.compile('<title>(.*)</title>')    # regex for text between title tags, in capture parentheses
patLink = re.compile('<link rel.*href="(.*)" />')       # ditto for link URLs

findPatTitle = re.findall(patTitle, webpage)    # find all matches to patTitle in webpage variable
findPatLink = re.findall(patLink, webpage)

#print findPatTitle
#print findPatLink


# findPatTitle is a list. Iterate over it to print out each item in the list:
for i in range(1,5):    # start at 1 to skip item 0, the feed's own channel <title>/<link>
        print findPatTitle[i]
        print findPatLink[i]

        origArticlePage = urlopen(findPatLink[i]).read()                # open the URL that the item link points to
        divBegin = origArticlePage.find('<div class="articleBody"')     # find the POSITION of the start of the div containing the article text
        article = origArticlePage[divBegin:(divBegin+1000)]             # ...and grab the following 1000 characters
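        # (aside: str.find returns the POSITION of a substring, e.g.
        #  'abcdef'.find('cd') == 2, so 'abcdef'[2:2+3] == 'cde'.
        #  It returns -1 when the substring is missing, which would silently
        #  slice from the wrong place - worth checking for in real code.)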

        # now we use beautiful soup to parse these 1000 characters in 'article' and grab out important info inside <p> tags
        soup = BeautifulSoup(article)
        listOfParagraphs = soup.findAll('p')            # find all the p tags - BeautifulSoup parses the markup itself, so no regex needed here
        for p in listOfParagraphs:
                print p
        print "\n"


# you can use beautiful soup to simplify the process of finding titles and links, too.
# instead of cumbersome blah.compile() regex statements, just use beautifulsoup thus:

soup2 = BeautifulSoup(webpage)
titleSoup = soup2.findAll('title')      # this creates a list of all the titles
linkSoup = soup2.findAll('link')        # this creates a list of all the links

print "SAME THING, USING ONLY BEAUTIFUL SOUP:"
for i in range(1,5):
        print titleSoup[i]
        print linkSoup[i]
        print "\n"
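BeautifulSoup does the parsing for you, but the same findAll('title') / findAll('link') idea can be sketched with nothing but the standard library, in case BeautifulSoup isn't installed. This is a Python 3 sketch using html.parser, run on an invented two-item feed fragment (all titles and URLs below are made up):

```python
from html.parser import HTMLParser

# Collects the text of every <title> tag and the href of every <link> tag,
# much like soup2.findAll('title') and soup2.findAll('link') above.
class FeedGrabber(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.titles = []
        self.links = []
        self.in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
        elif tag == 'link':
            attrs = dict(attrs)
            if 'href' in attrs:
                self.links.append(attrs['href'])

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data)

# Hypothetical feed fragment standing in for the real HuffPo feed.
feed = ('<title>First Story</title>'
        '<link rel="alternate" type="text/html" href="http://example.com/1.html" />'
        '<title>Second Story</title>'
        '<link rel="alternate" type="text/html" href="http://example.com/2.html" />')

grabber = FeedGrabber()
grabber.feed(feed)
print(grabber.titles)   # ['First Story', 'Second Story']
print(grabber.links)    # ['http://example.com/1.html', 'http://example.com/2.html']
```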