User:Eleanorg/2.1/BeautifulSoup RSS grabber
Revision as of 18:10, 24 October 2012
urllib is all well and good, but how do you get ''useful'' information out of a webpage? I followed a lovely tutorial. It grabs a few items from the HuffPo main RSS feed and prints out the title, link and first part of the main article for each item. The bit at the bottom uses only Beautiful Soup (no plain Python regexes) to grab just the title and link for each item:
<title>Syria Agrees To Ceasefire During Eid Al-Adha Holiday, Peace Envoy Lakhdar Brahimi Says</title>
<link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/syria-ceasefire-eid_n_2008028.html" />
<title>'Sons Of Anarchy': Joel McHale Can't Outrun The Club, Gemma Can't Dodge A Semi (VIDEO)</title>
<link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/sons-of-anarchy-joel-mchale-run-video_n_2007887.html" />
<title>'The X Factor' Top 16: Demi Lovato And Simon Cowell Choose Their Top Fours (VIDEO)</title>
<link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/the-x-factor-top-16-demi-simon-video_n_2007825.html" />
<title>'Extreme Cheapskates': Millionaire Forages For Food, Pees In Bottles (VIDEO)</title>
<link rel="alternate" type="text/html" href="http://www.huffingtonpost.com/2012/10/24/extreme-cheapskates-millionaire-video_n_2007805.html" />
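The core trick in the script below is a pair of regexes with capture parentheses. Here it is in isolation as a minimal, self-contained sketch; the feed snippet is made up for illustration, not the real HuffPo feed:

<source lang="python">
import re

# A tiny hard-coded stand-in for the RSS feed text (made up for illustration).
feed = '''
<title>First headline</title>
<link rel="alternate" type="text/html" href="http://example.com/one.html" />
<title>Second headline</title>
<link rel="alternate" type="text/html" href="http://example.com/two.html" />
'''

# The capture parentheses (.*) pull out just the text between the tags,
# so findall() returns the captured groups rather than whole matches.
patTitle = re.compile('<title>(.*)</title>')
patLink = re.compile('<link rel.*href="(.*)" />')

titles = re.findall(patTitle, feed)  # ['First headline', 'Second headline']
links = re.findall(patLink, feed)

for title, link in zip(titles, links):
    print(title)
    print(link)
</source>

The same pattern applied to the live feed is what the full script does next.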
<source lang="python">
#!/usr/bin/python
# -*- coding:utf-8 -*-
# tutorial on webkit & gtk
# http://www.youtube.com/watch?v=Ap_DlSrT-iE

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

print "hello"

# grab the HuffPo RSS feed
webpage = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/raw_feed').read()
# print webpage

patTitle = re.compile('<title>(.*)</title>')      # regex for text between title tags, in capture parentheses
patLink = re.compile('<link rel.*href="(.*)" />') # ditto for link URLs

findPatTitle = re.findall(patTitle, webpage)      # find all matches to patTitle in the webpage string
findPatLink = re.findall(patLink, webpage)
# print findPatTitle
# print findPatLink

# findPatTitle is a list. Iterate over it to print out each item in the list:
for i in range(1, 5):
    print findPatTitle[i]
    print findPatLink[i]

    origArticlePage = urlopen(findPatLink[i]).read()             # open the URL that the item link points to
    divBegin = origArticlePage.find('<div class="articleBody"')  # find the POSITION of the start of the div containing the article
    article = origArticlePage[divBegin:divBegin + 1000]          # ...and grab the following 1000 characters

    # now we use Beautiful Soup to parse these 1000 characters in 'article'
    # and grab out the important info inside the tags
    soup = BeautifulSoup(article)
    listOfParagraphs = soup.findAll('p')  # Beautiful Soup parses the <p> tags itself - no need for a regex
    for paragraph in listOfParagraphs:
        print paragraph
        print "\n"

# you can use Beautiful Soup to simplify the process of finding titles and links, too.
# instead of cumbersome re.compile() regex statements, just use BeautifulSoup thus:
soup2 = BeautifulSoup(webpage)
titleSoup = soup2.findAll('title')  # this creates a list of all the titles
linkSoup = soup2.findAll('link')    # this creates a list of all the links

print "SAME THING, USING ONLY BEAUTIFUL SOUP:"
for i in range(1, 5):
    print titleSoup[i]
    print linkSoup[i]
    print "\n"
</source>
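The script above is Python 2 with the old BeautifulSoup 3 package (`from BeautifulSoup import BeautifulSoup`, `print` statements). For comparison, the same "parse, don't regex" step at the bottom can be sketched with only the Python standard library, using `xml.etree.ElementTree` in place of Beautiful Soup. The feed below is a made-up minimal stand-in, not the real HuffPo feed:

<source lang="python">
import xml.etree.ElementTree as ET

# Made-up minimal feed for illustration (the real feed is Atom with namespaces,
# which would need qualified tag names like '{http://www.w3.org/2005/Atom}entry').
feed = '''<feed>
  <entry>
    <title>First headline</title>
    <link rel="alternate" type="text/html" href="http://example.com/one.html" />
  </entry>
  <entry>
    <title>Second headline</title>
    <link rel="alternate" type="text/html" href="http://example.com/two.html" />
  </entry>
</feed>'''

root = ET.fromstring(feed)
for entry in root.findall('entry'):
    # the parser gives us structured elements, so no capture groups needed
    print(entry.find('title').text)
    print(entry.find('link').get('href'))
</source>

As with the Beautiful Soup version, the point is that a real parser hands you titles and link attributes directly, instead of you fishing them out of raw text with regexes.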