User:Eleanorg/2.1/BeautifulSoup RSS grabber: Difference between revisions

From XPUB & Lens-Based wiki
Latest revision as of 19:11, 24 October 2012

urllib is all well and good, but how do you get ''useful'' information from a webpage? I followed a lovely tutorial. It grabs a few items from the HuffPo main RSS feed and prints out the title, link and first part of the main article for each item. The bit at the bottom uses only Beautiful Soup (not regular Python regexes) to grab just the title and link for each item:
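The title/link extraction in the script below relies on two regular expressions with capture groups. They can be tried out first on a made-up two-line feed snippet (the sample text here is invented for illustration; the full script is Python 2, but this sketch runs on either version):

```python
import re

# An invented two-line snippet standing in for the real feed.
sample = ('<title>First Item</title>\n'
          '<link rel="alternate" type="text/html" href="http://example.com/one" />')

patTitle = re.compile('<title>(.*)</title>')        # capture the text between the title tags
patLink = re.compile('<link rel.*href="(.*)" />')   # capture the URL inside the href attribute

print(re.findall(patTitle, sample))   # ['First Item']
print(re.findall(patLink, sample))    # ['http://example.com/one']
```

Note that `.` does not match a newline, so each pattern matches within a single line of the feed; this is why the script works on the real feed, where each tag sits on its own line.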

<title>Syria Agrees To Ceasefire During Eid Al-Adha Holiday, Peace Envoy Lakhdar Brahimi Says</title> <link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/syria-ceasefire-eid_n_2008028.html" />


<title>'Sons Of Anarchy': Joel McHale Can't Outrun The Club, Gemma Can't Dodge A Semi (VIDEO)</title> <link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/sons-of-anarchy-joel-mchale-run-video_n_2007887.html" />


<title>'The X Factor' Top 16: Demi Lovato And Simon Cowell Choose Their Top Fours (VIDEO)</title> <link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/the-x-factor-top-16-demi-simon-video_n_2007825.html" />


<title>'Extreme Cheapskates': Millionaire Forages For Food, Pees In Bottles (VIDEO)</title> <link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/extreme-cheapskates-millionaire-video_n_2007805.html" />
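The script also narrows each article page down with a plain-string trick: `str.find` locates the start of the articleBody div, and a slice takes a fixed-size window of characters from that position. The same trick in isolation, on made-up HTML:

```python
# Find the POSITION of a marker in a string, then slice a fixed-size window from it.
page = '<html><body><div class="articleBody"><p>Some text.</p></div></body></html>'
divBegin = page.find('<div class="articleBody"')   # index of the div's opening '<'
article = page[divBegin:divBegin + 40]             # ...and the 40 characters from that point on
print(article)
```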

<source lang="python">
#!/usr/bin/python
#-*- coding:utf-8 -*-

# tutorial on webkit & gtk
# http://www.youtube.com/watch?v=Ap_DlSrT-iE

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re


print "hello"

# grab the HuffPo RSS feed
webpage = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/raw_feed').read()
#print webpage

patTitle = re.compile('<title>(.*)</title>')    # regex for text between title tags, in capture parentheses
patLink = re.compile('<link rel.*href="(.*)" />')       # ditto for link URLs

findPatTitle = re.findall(patTitle, webpage)    # find all matches to patTitle in webpage variable
findPatLink = re.findall(patLink, webpage)

#print findPatTitle
#print findPatLink


# findPatTitle is a list. Iterate over it to print out each item in the list
# (starting at index 1, because index 0 is the feed's own title, not an item):
for i in range(1,5):
        print findPatTitle[i]
        print findPatLink[i]

        origArticlePage = urlopen(findPatLink[i]).read()                # open the URL that the item link points to
        divBegin = origArticlePage.find('<div class="articleBody"')     # find the POSITION of the start of the div containing the article body
        article = origArticlePage[divBegin:(divBegin+1000)]             # ...and grab the following 1000 characters

        # now we use Beautiful Soup to parse these 1000 characters in 'article' and grab out important info inside <p> tags
        soup = BeautifulSoup(article)
        listOfParagraphs = soup.findAll('p')            # Beautiful Soup parses the p tags for us - no need for a regex here
        for paragraph in listOfParagraphs:              # (named 'paragraph' so it doesn't shadow the outer loop's 'i')
                print paragraph
        print "\n"


# you can use Beautiful Soup to simplify the process of finding titles and links, too.
# instead of cumbersome re.compile() regex statements, just use Beautiful Soup thus:

soup2 = BeautifulSoup(webpage)
titleSoup = soup2.findAll('title')      # this creates a list of all the titles
linkSoup = soup2.findAll('link')        # this creates a list of all the links

print "SAME THING, USING ONLY BEAUTIFUL SOUP:"
for i in range(1,5):
        print titleSoup[i]
        print linkSoup[i]
        print "\n"
</source>
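As a footnote: the title/link grab can also be done with no Beautiful Soup at all, using the standard library's xml.etree.ElementTree. This is only a sketch on an invented Atom-style snippet (HuffPo's raw feed is Atom-flavoured, hence the namespace; the real feed's details may differ), written in Python 3 style:

```python
import xml.etree.ElementTree as ET

# Invented Atom-style snippet; a real feed's elements live in the Atom namespace.
feed = '''<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>First Item</title>
    <link rel="alternate" type="text/html" href="http://example.com/one" />
  </entry>
</feed>'''

ns = {'atom': 'http://www.w3.org/2005/Atom'}
root = ET.fromstring(feed)
for entry in root.findall('atom:entry', ns):
    print(entry.find('atom:title', ns).text)        # the item's title text
    print(entry.find('atom:link', ns).get('href'))  # the href attribute of its link
```

Unlike the regex approach, this actually parses the XML, so it won't be fooled by tags split across lines or extra attributes in a different order.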