User:Laurier Rochon/prototyping/??????????soft
== Feb 12 2011 ==
Scraping blog URLs, checking them against the current archive and storing them in a tab-separated file.
Adding a cron job then lets you pile up the results like [http://aesonmusic.com/cgiscrape/blogs this], running the scrape every hour (a sample crontab entry follows the script below).
<source lang="python">
#!/usr/bin/python2.6
import urllib2
import json
from datetime import date
import os

#txt = '../cgiscrape/blogs'
txt = 'blogs'
start = 0
scrapedate = date.today()
entries = []
urllist = []

# Create the archive on first run; otherwise load the post URLs
# (column 2 of each tab-separated line) already stored in it.
if not os.path.exists(txt):
    f = open(txt, 'w')
    f.close()
else:
    f = open(txt, 'r')
    data = f.read()
    f.close()
    if len(data) > 0:
        urls = data.split('\n')
        for a in urls:
            line = a.split('\t')
            if len(line) > 2:
                urllist.append(line[2])

# Walk the Google AJAX Blog Search results 8 at a time (rsz=large),
# up to the API's cap of 64 results.
c = 0
while start < 64:
    url = ('https://ajax.googleapis.com/ajax/services/search/blogs'
           '?v=1.0&q=myself&start=' + str(start) + '&rsz=large')
    f = urllib2.urlopen(url)
    data = json.load(f)
    for r in data['responseData']['results']:
        # Only keep posts that aren't already in the archive.
        if r['postUrl'] not in urllist:
            entry = "%s\t%s\t%s\t%s\t%s\t%s" % (scrapedate, r['title'],
                r['postUrl'], r['publishedDate'], r['blogUrl'], r['author'])
            entries.append(entry.encode("utf-8"))
            c = c + 1
    start += 8

print 'added %s entries' % (c)

# Append the new entries, one per line, to the archive. The trailing
# newline keeps the next run's first entry off the last line of this one.
if entries:
    f = open(txt, 'a')
    f.write('\n'.join(entries) + '\n')
    f.close()
</source>
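Each archive line holds six tab-separated columns: scrape date, title, post URL, published date, blog URL, author. A minimal sketch of reading it back, assuming that layout (the field names are just labels borrowed from the API response above):

<source lang="python">
# Read the archive back, one dict per scraped post.
fields = ['scrapedate', 'title', 'postUrl', 'publishedDate', 'blogUrl', 'author']

f = open('blogs', 'r')
for line in f:
    cols = line.rstrip('\n').split('\t')
    if len(cols) == len(fields):
        row = dict(zip(fields, cols))
        print row['scrapedate'], row['postUrl']
f.close()
</source>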
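For the hourly cron job, an entry along these lines works; the path and script name are assumptions, and the cd matters because the script opens 'blogs' relative to the working directory:

<source lang="bash">
# Hypothetical paths: run the scraper at the top of every hour.
0 * * * * cd /path/to/cgiscrape && /usr/bin/python2.6 scrape.py
</source>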