User:Andre Castro/prototyping/1.2/web-sound-scraping: Difference between revisions
Revision as of 10:32, 26 January 2012
web sound scraping
Some years ago I heard about a net-sound art piece whose name and author I cannot recall.
It consisted of a scraping mechanism that searched the web for sound files. Once found, the files were integrated into a continuous sound stream.
I would like to develop a similar project. If I manage to find information on that work, it would be interesting to compare its results with those of my own take on it. The comparison might make explicit the differences between the web of those days (I guess the beginning of the 2000s) and the web of today, by looking at the sounds found and at the situations in which they were, and are, employed.
stages of prototyping
- simple web scraper - scrapes only 1 page of a website
- scraper in depth - scrapes all levels of a website
- crawler
- sound stream based on the sound files
- processing of the sound files
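The second stage, scraping all levels of a website, could be approached as a breadth-first walk over the links that stay on the same host. The following is only a minimal sketch of that idea, written for Python 3 with the standard library, and is not part of the prototype. The names `same_site_links`, `crawl_site`, `get_hrefs` and `max_pages` are hypothetical; `get_hrefs` stands for whatever code fetches a page and returns its `href` values.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def same_site_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep only http(s) links on the same host."""
    host = urlparse(page_url).netloc
    links = []
    for href in hrefs:
        # urljoin resolves relative links; urldefrag drops any "#fragment" part
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            if absolute not in links:
                links.append(absolute)
    return links


def crawl_site(start_url, get_hrefs, max_pages=50):
    """Visit pages of one website breadth-first and return them in visiting order.

    get_hrefs(url) is a hypothetical callback that returns the href values
    found on the page at url. max_pages is an assumed safety limit.
    """
    visited = []
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.append(url)
        for link in same_site_links(url, get_hrefs(url)):
            if link not in visited and link not in queue:
                queue.append(link)
    return visited
```

Each page returned by `crawl_site` could then be handed to the single-page scraper. The `max_pages` limit is there because a large website may link to far more pages than is practical to visit.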
simple web sound scraper
This Python script scrapes sound files from a webpage (given as its argument) and saves them into the current working directory.
Execution example (script saved in sound-scraper.py):
python sound-scraper.py http://www.freesound.org/
# Scrapes sound files from a single webpage
# NEEDS URL AS ARGUMENT

############ TO DO #####################
# check if SF are repeated (same url, but different format) -> store only 1 format
#######################################

import sys, lxml.etree, lxml, html5lib, urllib2, urllib
from urlparse import urljoin

useragent = "Mozilla/5.001 (windows; U; NT4.0; en-US; rv:1.0) Gecko/25250101"
formats = ('.wav', '.aiff', '.flac', '.ogg', '.mp3')  # other formats ?? .wave .aif

url = sys.argv[1]  # input arg

# request the page with a browser-like User-Agent
request = urllib2.Request(url, None, {'User-Agent': useragent})
url_open = urllib2.urlopen(request)
html = url_open.read()  # html into a string
url_open.close()  # the parentheses are needed; a bare url_open.close does nothing

# only proceed if file formats are present in the html
if any(f in html for f in formats):
    # http://pzwart3.wdka.hro.nl/wiki/Extracting_parts_of_an_HTML_document
    # parse the html string fetched above instead of requesting the page a second time
    htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
    page = htmlparser.parse(html)
    a_tag = page.xpath("//a[@href]")  # sound files come within <a>
    source_tag = page.xpath("//source[@src]")  # sound files come within <source>
    # other ways in which sounds appear in websites?????

    for i in a_tag:
        href = i.get('href')
        if any(f in href for f in formats):
            print "FOUND SF in href"
            # urljoin keeps absolute urls (http or https) and resolves relative ones against the page url
            full_url = urljoin(url, href)
            url_partition = full_url.rpartition('/')  # splits at the last '/': url_partition[2] is the name of the SF
            urllib.urlretrieve(full_url, url_partition[2])  # download SF
            print url_partition[2]

    for i in source_tag:
        src = i.get('src')
        if any(f in src for f in formats):
            #print "FOUND SF in src"
            full_url = urljoin(url, src)
            url_partition = full_url.rpartition('/')
            urllib.urlretrieve(full_url, url_partition[2])  # download SF
            print url_partition[2]
else:
    print "NO AUDIO FILES IN THIS URL"
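The TO DO note in the script (same url, but different format: store only 1 format) could be handled by grouping the found urls by their path without the extension and keeping one url per group. The following is a minimal sketch of that idea, written for Python 3 with the standard library, whereas the script above targets Python 2. The function name `one_format_per_sound` and the preference order in `PREFERRED_FORMATS` are assumptions made for illustration; the script does not say which format should win.

```python
import posixpath
from urllib.parse import urlparse

# Assumed order of preference, best first; not taken from the original script.
PREFERRED_FORMATS = [".flac", ".wav", ".aiff", ".ogg", ".mp3"]


def one_format_per_sound(urls, preferred=PREFERRED_FORMATS):
    """Keep a single url per sound file when a file is offered in several formats."""
    best = {}  # (host, path without extension) -> (rank, url)
    for url in urls:
        parts = urlparse(url)
        stem, ext = posixpath.splitext(parts.path)
        ext = ext.lower()
        if ext not in preferred:
            continue  # not one of the known sound formats
        key = (parts.netloc, stem)
        rank = preferred.index(ext)  # lower rank means more preferred
        if key not in best or rank < best[key][0]:
            best[key] = (rank, url)
    # dicts keep insertion order, so sounds come out in the order first seen
    return [url for _rank, url in best.values()]
```

In the script, the urls collected from the `<a>` and `<source>` tags would first be gathered into a list, filtered with this function, and only then downloaded.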