User:Andre Castro/prototyping/1.2/web-sound-scraping

From XPUB & Lens-Based wiki

web sound scraping

Some years ago I heard about a net-sound art piece, but cannot recall its name or author.

This project is based on that work, which consisted of a scraping mechanism that searched the web looking for sound files. Once these were found, they were integrated into a continuous sound stream.

I would like to develop a similar project. If I manage to find information on the referred work, it would be interesting to compare its results to those of my own take on it. That might make explicit the differences between the web of those days (the beginning of the 2000s, I guess) and today, by looking at the sounds that were/are found and the situations in which they were/are employed.


stages of prototyping

  • simple web scraper - scrapes only 1 page of a website
  • scraper in depth - scrapes all levels of a website
  • crawler (a rough sketch of the link-following loop follows this list)
  • sound stream based on the sound files
  • processing of the sound files
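
For the "scraper in depth" and "crawler" stages, one possible shape is a breadth-first loop that keeps a queue of pages still to visit and a set of pages already seen, staying inside the start domain; each visited page could then be handed to the single-page scraper below. This is only a rough sketch under those assumptions, not part of the script further down; the names crawl and max_pages are mine, and it reuses the same Python 2 libraries (urllib2, urlparse, html5lib) as the scraper.

<source lang="python">
#Rough sketch of the "scraper in depth" / crawler stages (Python 2)#
#Collects URLs of pages within the same site; the single-page
#scraper could then be run on each of them.
import urllib2, html5lib
from urlparse import urlparse, urljoin

useragent = "Mozilla/5.001 (windows; U; NT4.0; en-US; rv:1.0) Gecko/25250101"

def crawl(start_url, max_pages=50):
	home_netloc = urlparse(start_url).netloc
	to_visit = [start_url]	#queue of pages still to scrape
	visited = set()		#pages already seen
	while to_visit and len(visited) < max_pages:
		page_url = to_visit.pop(0)
		if page_url in visited:
			continue
		visited.add(page_url)
		try:
			request = urllib2.Request(page_url, None, {'User-Agent': useragent})
			url_open = urllib2.urlopen(request)
		except Exception:
			continue	#skip pages that fail to open
		htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
		page = htmlparser.parse(url_open)
		url_open.close()
		for a in page.xpath("//a[@href]"):
			link = urljoin(page_url, a.get('href'))	#resolves relative links too
			if urlparse(link).netloc == home_netloc and link not in visited:
				to_visit.append(link)
	return visited
</source>

Note that urljoin is used here instead of concatenating url_home with the href, so relative links that do not start with '/' are also resolved correctly.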


simple web sound scraper

This Python script scrapes sound files from a webpage (given as its argument) and saves them into the directory it is run from.


Execution example (script saved as sound-scraper.py):

python sound-scraper.py http://www.freesound.org/


<source lang="python">
#Scrapes sound files from a single webpage#
#NEEDS URL AS ARGUMENT
############ TO DO #####################
#check if SF are repeated (same URL, but different format) -> store only 1 format (a sketch follows the script)
#######################################

import sys, lxml.etree, lxml, html5lib, urllib2, urllib
from urlparse import urlparse 

useragent = "Mozilla/5.001 (windows; U; NT4.0; en-US; rv:1.0) Gecko/25250101"

url = sys.argv[1] # input arg
url_parse = urlparse(url)
url_home = url_parse.scheme + "://" + url_parse.netloc

request=urllib2.Request(url, None, {'User-Agent': useragent})
url_open = urllib2.urlopen(request)
html = url_open.read() #html into a string
url_open.close()

#only proceed if file formats are present in the html
if ('.wav' in html) or ('.aiff' in html) or ('.flac' in html) or ('.ogg' in html) or ('.mp3' in html):
	# http://pzwart3.wdka.hro.nl/wiki/Extracting_parts_of_an_HTML_document	
	url_open = urllib2.urlopen(request) #reuse the request so the User-Agent header is also sent here
	htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
	page = htmlparser.parse(url_open)

	a_tag = page.xpath("//a[@href]") #sound files come within <a>

	source_tag = page.xpath("//source[@src]") #sound files also come within <source> tags

	# other ways in which sounds appear in websites?

	for i in a_tag:
		href = i.get('href')  
		if ('.wav' in href) or ('.aiff' in href) or ('.flac' in href) or ('.ogg' in href) or ('.mp3' in href):	#other formats ?? .wave .aif
			print "FOUND SF in href"
			if 'http://' in href:
				url_partition = href.rpartition('/') #splits at the last '/' - partition[2] is the SF filename
				urllib.urlretrieve(href, url_partition[2]) #download SF
				print url_partition[2]

			else:
				full_url = url_home+href #add home to relative url
				url_partition = full_url.rpartition('/') 
				urllib.urlretrieve(full_url, url_partition[2]) #download SF
				print url_partition[2]
		
	for i in source_tag:
		src = i.get('src')
		if ('.wav' in src) or ('.aiff' in src) or ('.flac' in src) or ('.ogg' in src) or ('.mp3' in src):
			#print "FOUND SF in src" 	
			if 'http://' in src:
				url_partition = src.rpartition('/')
				urllib.urlretrieve(src, url_partition[2]) 		#download SF 
				print url_partition[2]

			else:
				full_url = url_home+src #add home to relative url
				url_partition = full_url.rpartition('/') # .rpartition splits the string at the last '/' and returns a 3-tuple (before, separator, after)
				urllib.urlretrieve(full_url, url_partition[2])	#download SF
				print url_partition[2]

else:
	print "NO AUDIO FILES IN THIS URL"