PythonFeedparser

From XPUB & Lens-Based wiki

Revision as of 21:37, 2 June 2008

RSS feeds provide a simple and robust way to incorporate / remix the (current) contents of websites in your own Python script. You can use Python to, for instance, create a custom front end to a site, or take a "snapshot" of the daily contents of a website for purposes of archiving and later data analysis.

To begin, you can use the Feedparser Python library to do the work of translating the feed into a Python object. The Feedparser library tries to hide the many complexities of the varying formats and versions of RSS feeds. You can download Feedparser from: http://feedparser.org/

Step one is simply using Feedparser to parse (read) a feed. In this case, the RSS feed of nytimes.com:

#!python numbers=off
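import feedparser

# URL of the feed to read (here, the nytimes.com home page feed)
fu = "http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml"
# Fetch the feed and parse it into a Python object
f = feedparser.parse(fu)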

If all goes well, this code appears to do nothing: it doesn't print anything out. Behind the scenes, however, it connects to the nytimes server, reads the feed, and attempts to make sense of its contents.

To see if it's working, you could run the script with Python's "-i" option, which leaves the interpreter open afterwards so you can inspect the result:

python -i feedreader.py
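
A quick check can also be written into the script itself. The sketch below assumes Python 3 and relies on two things Feedparser puts in its result: the `bozo` flag (set when the feed was not well-formed) and the `entries` list. The helper names `describe_feed` and `check_url` are made up for this example, and the sample dictionary only stands in for a real parsed feed.

```python
def describe_feed(parsed):
    """Return a one-line status string for a result of feedparser.parse()."""
    # "bozo" is set to a true value when the feed was not well-formed.
    malformed = bool(parsed.get("bozo", 0))
    count = len(parsed.get("entries", []))
    title = parsed.get("feed", {}).get("title", "(untitled)")
    status = "malformed" if malformed else "ok"
    return "%s: %d entries, feed is %s" % (title, count, status)


def check_url(url):
    """Fetch and parse `url`, then describe the result (needs network access)."""
    import feedparser  # third-party; imported lazily so describe_feed works without it
    return describe_feed(feedparser.parse(url))


# Stand-in for a parsed feed, shaped like the object Feedparser returns.
sample = {"bozo": 0,
          "feed": {"title": "Example feed"},
          "entries": [{"title": "First"}, {"title": "Second"}]}
print(describe_feed(sample))
```

At the `python -i` prompt, calling `describe_feed(f)` would summarise the feed parsed by the script.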

TIP: Use pprint to "pretty" print the resulting object (or parts of it). For instance, we could look at the first item (via the "entries" attribute):

from pprint import pprint
pprint(f.entries[0])

produces:

#!python numbers=off
{'author': u'ELAINE SCIOLINO',
 'guidislink': False,
 'id': u'http://www.nytimes.com/2007/03/20/world/europe/20iran.html',
 'link': u'http://www.nytimes.com/2007/03/20/world/europe/20iran.html?ex=1332043200&en=5bb15b55325c85a4&ei=5088&partner=rssnyt&emc=rss',
 'links': [{'href': u'http://www.nytimes.com/2007/03/20/world/europe/20iran.html?ex=1332043200&en=5bb15b55325c85a4&ei=5088&partner=rssnyt&emc=rss',
            'rel': 'alternate',
            'type': 'text/html'}],
 'summary': u'Russia said it will withhold nuclear fuel unless Iran suspends its uranium enrichment as demanded by the Security Council.',
 'summary_detail': {'base': 'http://graphics8.nytimes.com/services/xml/rss/nyt/HomePage.xml',
                    'language': None,
                    'type': 'text/html',
                    'value': u'Russia said it will withhold nuclear fuel unless Iran suspends its uranium enrichment as demanded by the Security Council.'},
 'title': u'Russia Gives Iran Ultimatum on Enrichment',
 'title_detail': {'base': 'http://graphics8.nytimes.com/services/xml/rss/nyt/HomePage.xml',
                  'language': None,
                  'type': 'text/plain',
                  'value': u'Russia Gives Iran Ultimatum on Enrichment'},
 'updated': u'Tue, 20 Mar 2007 01:04:29 EDT',
 'updated_parsed': (2007, 3, 20, 5, 4, 29, 1, 79, 0)}
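
The fields in this output can be read either as attributes (`e.title`) or as dictionary keys (`e["title"]`). As a small illustration, the sketch below turns `updated_parsed` into a readable timestamp. It assumes Python 3; `format_updated` is a name invented here, and the sample dictionary simply copies values from the output above. Feedparser normalises parsed dates to UTC, which is why 01:04 EDT shows up as hour 5.

```python
import time


def format_updated(entry, fmt="%Y-%m-%d %H:%M UTC"):
    """Format an entry's parsed update time, or return None if it is missing."""
    parsed = entry.get("updated_parsed")
    if parsed is None:
        return None
    # time.strftime accepts a 9-item tuple as well as a time.struct_time.
    return time.strftime(fmt, tuple(parsed))


# Values copied from the pretty-printed entry above.
entry = {"title": "Russia Gives Iran Ultimatum on Enrichment",
         "updated": "Tue, 20 Mar 2007 01:04:29 EDT",
         "updated_parsed": (2007, 3, 20, 5, 4, 29, 1, 79, 0)}
print(entry["title"], "|", format_updated(entry))
```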

Now, to create a simple HTML document from the contents of this feed, we could use the title and summary fields, wrapping them in appropriate HTML tags (say, h2 and p):

#!python numbers=off
for e in f.entries:
	print "<h2>" + e.title.encode("utf8") + "</h2>"
	print "<p>" + e.summary.encode("utf8") + "</p>"

Notice here that because Feedparser returns Unicode strings, you need to explicitly select an encoding to use when outputting the text. In this case use "utf8", since it is widely supported by web browsers.
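
To illustrate the difference between text and its encoded form, here is a small sketch. It assumes Python 3, where `str` is the Unicode type and `encode` returns `bytes`; the sample headline is invented.

```python
headline = "Café society"            # invented sample text, 12 characters

encoded = headline.encode("utf8")    # bytes, ready to be written to a file or socket
print(len(headline), len(encoded))   # the "é" takes two bytes in UTF-8

# Decoding with the same encoding recovers the original text.
assert encoded.decode("utf8") == headline

# An encoding that lacks "é" needs to be told how to handle it;
# "xmlcharrefreplace" substitutes a numeric character reference.
print(headline.encode("ascii", "xmlcharrefreplace"))
```

In Python 3, `print` encodes text itself on output, so explicit `encode` calls like those in the loop above mostly matter when writing bytes directly.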

Finally, you need to add the regular start and end of an HTML page around this loop. Note the initial <?xml ... ?> declaration, which signals to the browser that the document is encoded as UTF-8. The complete script is as follows:

#!python numbers=off
import feedparser
fu = "http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml"
f = feedparser.parse(fu)

print """<?xml version="1.0" encoding="UTF-8"?>
<html>
<head>
<style>
h2 { color: purple; font-size: 72px }
</style>
</head>
<body>
"""

for e in f.entries:
	print "<h2>" + e.title.encode("utf8") + "</h2>"
	print "<p>" + e.summary.encode("utf8") + "</p>"


print """</body>
</html>
"""

This produces a fairly bland rendering of the headlines, but you could take it as a starting point for more interesting kinds of dynamic layout.
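
For readers on Python 3, where `print` is a function and strings are already Unicode, the complete script could be restructured along the following lines. This is a sketch rather than a drop-in replacement: `render_page` and `build_front_page` are names made up here, the page is written to a file instead of standard output, titles are escaped with the standard library's `html.escape`, and a `<meta charset>` tag is used to declare the encoding.

```python
import html


def render_page(entries):
    """Build an HTML page (as a str) from an iterable of feed entries."""
    parts = ['<html>\n<head>\n<meta charset="utf-8">\n'
             '<style>\nh2 { color: purple; font-size: 72px }\n</style>\n'
             '</head>\n<body>']
    for e in entries:
        # Titles are plain text, so escape characters such as & and <.
        parts.append("<h2>" + html.escape(e.get("title", "")) + "</h2>")
        # Summaries are typically HTML already, so they are passed through.
        parts.append("<p>" + e.get("summary", "") + "</p>")
    parts.append("</body>\n</html>")
    return "\n".join(parts)


def build_front_page(url, path):
    """Fetch `url` and write the rendered page to `path` as UTF-8."""
    import feedparser  # third-party; needs network access when called
    page = render_page(feedparser.parse(url).entries)
    with open(path, "w", encoding="utf-8") as out:
        out.write(page)


# Invented sample entry, so the rendering can be tried without a network.
sample = [{"title": "Q&A: feeds", "summary": "A short <em>summary</em>."}]
print(render_page(sample))
```

Keeping the rendering in its own function means the layout can be changed and tried on sample data without fetching the feed each time.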

History

http://news.com.com/Bloggings+roots+reach+to+the+70s/2100-1025_3-6168685.html