User:Andre Castro/prototyping/1.2/web-sound-scraping
web sound scraping
Some years ago I heard about a net-based sound art piece whose name and author I cannot recall. It consisted of a scraping mechanism that searched the web for sound files. Once found, the files were integrated into a continuous sound stream.
I would like to develop a similar project. If I manage to find information on the original work, it would be interesting to compare its results with the ones from my take on it. The comparison might make explicit the differences between the web of those days (I guess the early 2000s) and the web of today, by looking at the sounds found and the situations in which they were and are employed.
stages of prototyping
- simple web scraper - scrapes only 1 page of a website
- spider - scrapes all levels of a website and outer links
- sound stream based on the sound files
- information on the files' provenance, date, metadata
- processing of the sound files
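The spider stage could start from a breadth-first crawl along the following lines. This is only a minimal sketch under stated assumptions, written in Python 3 and not part of the scraper script: `crawl`, `get_links`, `max_pages` and `same_site_only` are hypothetical names, and the `site` dictionary is invented sample data standing in for real pages, so the example runs without touching the network.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(start_url, get_links, max_pages=50, same_site_only=True):
    """Breadth-first crawl; returns the visited page urls in visit order.

    get_links(url) is a hypothetical callback returning the href values
    found on that page (for example the result of an xpath query).
    """
    home = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}          # urls already queued, to avoid loops
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            # resolve relative links against the current page, drop #fragments
            link = urljoin(url, href).split('#')[0]
            if same_site_only and urlparse(link).netloc != home:
                continue        # skip outer links unless asked to follow them
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Invented sample data: page url -> hrefs found on that page
site = {
    "http://example.org/": ["/a.html", "b.html", "http://other.org/x"],
    "http://example.org/a.html": ["/", "c.html#top"],
    "http://example.org/b.html": [],
    "http://example.org/c.html": [],
}
pages = crawl("http://example.org/", lambda u: site.get(u, []))
```

Setting `same_site_only=False` would also follow outer links, in which case `max_pages` is what keeps the crawl from running indefinitely.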
simple web sound scraper
This Python script scrapes sound files from a webpage (given as its argument) and saves them into the directory from which it is run
Execution example (script saved as sound-scraper.py):
python sound-scraper.py http://www.freesound.org/
#Scrapes sound files from a single webpage#
#NEEDS URL AS ARGUMENT
############ TO DO #####################
#check if SF are repeated ( same url, but different format) -> store only 1 format
#######################################
import sys, lxml.etree, lxml, html5lib, urllib2, urllib
from urlparse import urljoin

useragent = "Mozilla/5.001 (windows; U; NT4.0; en-US; rv:1.0) Gecko/25250101"
formats = ('.wav', '.aiff', '.flac', '.ogg', '.mp3') #other formats ?? .wave .aif

url = sys.argv[1] # input arg
request = urllib2.Request(url, None, {'User-Agent': useragent})
url_open = urllib2.urlopen(request)
html = url_open.read() #html into a string
url_open.close()

def is_sound(link):
    #True if the link mentions one of the sound file formats
    return any(f in link for f in formats)

def download(link):
    #urljoin leaves absolute urls untouched and resolves relative ones against the page url
    full_url = urljoin(url, link)
    filename = full_url.rpartition('/')[2] #text after the last "/" names the SF
    urllib.urlretrieve(full_url, filename) #download SF into the current directory
    print filename

#only proceed if file formats are present in the html
if is_sound(html):
    # http://pzwart3.wdka.hro.nl/wiki/Extracting_parts_of_an_HTML_document
    htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
    page = htmlparser.parse(html) #parse the html that was already fetched
    a_tag = page.xpath("//a[@href]") #sound files come within <a>
    source_tag = page.xpath("//source[@src]") #sound files come within <source>
    # other ways in which sounds appear in websites?????
    for i in a_tag:
        href = i.get('href')
        if is_sound(href):
            print "FOUND SF in href"
            download(href)
    for i in source_tag:
        src = i.get('src')
        if is_sound(src):
            #print "FOUND SF in src"
            download(src)
else:
    print "NO AUDIO FILES IN THIS URL"
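The TO DO note in the script (same sound offered in several formats, store only one) could be approached as sketched below. This is a hedged Python 3 illustration, not part of the script above: `keep_one_format` and `PREFERRED` are hypothetical names, the preference order is an assumption, and two urls are treated as the same sound when they share host and path and differ only in the file extension.

```python
import posixpath
from urllib.parse import urlparse

# Assumed preference order, most wanted format first; adjust to taste.
PREFERRED = ['.flac', '.wav', '.aiff', '.ogg', '.mp3']

def keep_one_format(urls, preferred=PREFERRED):
    """Return one url per sound, choosing the most preferred format found."""
    best = {}  # (host, path without extension) -> (rank, url)
    for u in urls:
        parts = urlparse(u)
        stem, ext = posixpath.splitext(parts.path)
        ext = ext.lower()
        if ext not in preferred:
            continue  # not a known sound format
        key = (parts.netloc, stem)
        rank = preferred.index(ext)  # lower rank means more preferred
        if key not in best or rank < best[key][0]:
            best[key] = (rank, u)
    # dicts keep insertion order (Python 3.7+), so sounds stay in first-seen order
    return [u for rank, u in best.values()]

# Invented sample urls for the demonstration
found = [
    "http://example.org/s/bird.mp3",
    "http://example.org/s/bird.ogg",
    "http://example.org/s/rain.wav",
    "http://example.org/page.html",
]
unique = keep_one_format(found)
```

Filtering the list of found links this way before downloading would avoid fetching the same recording twice.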