Latest revision as of 11:54, 26 January 2012

web sound scraping

Some years ago I heard about a net-sound art piece, but I cannot recall its name or author.

This project is based on that sound art work, which consisted of a scraping mechanism that searched the web for sound files. Once found, the files were integrated into a continuous sound stream.

I would like to develop a similar project. If I manage to find information on the original work, it would be interesting to compare its results with those of my take on it. The comparison might make explicit the differences between the web of those days (I guess the early 2000s) and today, by looking at the sounds found and the situations in which they were, and are, employed.

stages of prototyping

simple web scraper - scrapes only 1 page of a website
spider - scrapes all levels of a website and outer links
sound stream based on the sound files
information on the files' provenance, date and metadata
processing of the sound files
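
The spider stage could be sketched roughly as follows. This is only a minimal illustration, written for Python 3 with the standard library; the names LinkCollector, crawl, fetch, max_depth and same_site_only are all hypothetical and not part of the project. It walks a site breadth-first, follows links up to a given depth, and can optionally follow outer links too. The fetch argument is assumed to be any function that takes a URL and returns the page's HTML as a string, or None on failure.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_depth=2, same_site_only=True):
    """Breadth-first walk from start_url; returns visited URLs in visiting order.

    fetch is a hypothetical callable: it takes a URL and returns the page
    HTML as a string, or None if the page could not be read.
    """
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links any deeper
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:
            # make the link absolute and drop any #fragment
            target = urljoin(url, link).split("#")[0]
            if not target.startswith(("http://", "https://")):
                continue  # skip mailto:, javascript: and similar
            if same_site_only and urlparse(target).netloc != site:
                continue  # outer link, only followed if same_site_only is False
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited
```

Each visited URL could then be handed to the single-page scraper. Passing fetch in as a parameter keeps the walk itself testable without network access; a real fetch would wrap an HTTP request and return None on errors.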

simple web sound scraper

This Python script scrapes sound files from a webpage (given as its argument) and saves them into the current working directory.

Execution example (script saved in sound-scraper.py):

python sound-scraper.py http://www.freesound.org/

# Scrapes sound files from a single webpage
# NEEDS URL AS ARGUMENT
############ TO DO #####################
# check if SF are repeated (same url, but different format) -> store only 1 format
#######################################

import sys, lxml.etree, lxml, html5lib, urllib2, urllib
from urlparse import urljoin

useragent = "Mozilla/5.001 (windows; U; NT4.0; en-US; rv:1.0) Gecko/25250101"
formats = ('.wav', '.aiff', '.flac', '.ogg', '.mp3') # other formats ?? .wave .aif

url = sys.argv[1] # input arg

request = urllib2.Request(url, None, {'User-Agent': useragent})
url_open = urllib2.urlopen(request)
html = url_open.read() # html into a string
url_open.close()

def has_format(text):
	# True if any of the sound file extensions occurs in text
	return any(f in text for f in formats)

def download(link):
	full_url = urljoin(url, link) # resolves relative urls against the page url; absolute urls stay as they are
	filename = full_url.rpartition('/')[2] # .rpartition splits at the last / - the part after it names the SF
	urllib.urlretrieve(full_url, filename) # download SF
	print filename

# only proceed if file formats are present in the html
if has_format(html):
	# http://pzwart3.wdka.hro.nl/wiki/Extracting_parts_of_an_HTML_document
	htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
	page = htmlparser.parse(html) # parse the html already fetched, instead of requesting the page a second time

	a_tag = page.xpath("//a[@href]") # sound files come within <a>
	source_tag = page.xpath("//source[@src]") # sound files come within <source>

	# other ways in which sounds appear in websites?????

	for i in a_tag:
		href = i.get('href')
		if has_format(href):
			print "FOUND SF in href"
			download(href)

	for i in source_tag:
		src = i.get('src')
		if has_format(src):
			#print "FOUND SF in src"
			download(src)

else:
	print "NO AUDIO FILES IN THIS URL"
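
The TO DO note in the script (store only one format when the same sound is offered in several) might be approached along these lines. This is a rough sketch under stated assumptions, not the project's actual solution. It is written for Python 3 with the standard library, unlike the Python 2 script; the function name one_format_per_sound and the preference order in PREFERRED are invented for illustration. Two URLs are assumed to point at the same sound when they share host and path and differ only in extension.

```python
import os
from urllib.parse import urlparse

# Assumed order of preference, most preferred first (not stated in the notes)
PREFERRED = [".flac", ".wav", ".aiff", ".ogg", ".mp3"]


def one_format_per_sound(urls):
    """Keep a single URL for each sound that is offered in several formats."""
    best = {}  # (host, path without extension) -> (rank, url)
    for url in urls:
        parts = urlparse(url)
        stem, ext = os.path.splitext(parts.path)
        ext = ext.lower()
        if ext not in PREFERRED:
            continue  # not one of the sound formats the scraper looks for
        key = (parts.netloc, stem)
        rank = PREFERRED.index(ext)
        # keep the url whose extension comes earliest in PREFERRED
        if key not in best or rank < best[key][0]:
            best[key] = (rank, url)
    return [url for rank, url in best.values()]
```

The list of candidate URLs would be collected first and filtered with this function before downloading, so that each sound is fetched only once.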
