User:Laurier Rochon/prototyping/??????????soft

== Feb 12 2011 ==

Scraping blog URLs, checking them against the current archive and storing them in a tab-separated file.
And then adding a cron job lets you pile up the results like [http://aesonmusic.com/cgiscrape/blogs this], doing the scrape every hour.
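
For the hourly part, a crontab line along these lines does the job (the python path, directory and script name below are placeholders, not the real ones):

<source lang="bash">
# minute hour day month weekday: run at the top of every hour
# the cd matters because the script writes its archive to a relative path ('blogs')
0 * * * * cd /path/to/cgiscrape && /usr/bin/python2.6 scrape.py
</source>

The scraper itself: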


<source lang="python">

#!/usr/bin/python2.6

import urllib2
import json
from datetime import date
import os

# tab-separated archive of everything scraped so far
#txt = '../cgiscrape/blogs'
txt = 'blogs'

start = 0                  # pagination offset for the API requests
scrapedate = date.today()  # date of this run, stored with every entry
entries = []               # new entries collected during this run
urllist = []               # post URLs already present in the archive

# create the archive if it doesn't exist yet; otherwise read it and
# remember the post URLs we already have, so duplicates are skipped below
if not os.path.exists(txt):
    f = open(txt, 'w')
    f.close()
else:
    f = open(txt, 'r')
    data = f.read()
    f.close()
    if len(data) > 0:
        urls = data.split('\n')
        for a in urls:
            line = a.split('\t')
            # column 2 holds the post URL (scrape date, title, post URL, published date, blog URL, author)
            if len(line) > 2:
                urllist.append(line[2])
c = 0  # counter for new entries added in this run

# page through the Google AJAX blog search results, 8 at a time
# ('large' page size), up to 64 results in total
while start < 64:
    url = 'https://ajax.googleapis.com/ajax/services/search/blogs?v=1.0&q=myself&start=' + str(start) + '&rsz=large'
    f = urllib2.urlopen(url)
    data = json.load(f)
    for r in data['responseData']['results']:
        # only keep posts that aren't in the archive yet
        if r['postUrl'] not in urllist:
            entry = "%s\t%s\t%s\t%s\t%s\t%s" % (scrapedate, r['title'], r['postUrl'], r['publishedDate'], r['blogUrl'], r['author'])
            entry = entry.encode("utf-8")
            entries.append(entry)
            c = c + 1
    start += 8

print 'added %s entries' % c

# append the new entries to the archive, one per line; the trailing
# newline keeps the next run from being glued onto the last line
if entries:
    f = open(txt, 'a')
    f.write('\n'.join(entries) + '\n')
    f.close()
</source>