User:Andre Castro/prototyping/1.2/Archiveorg-seachTerm

From XPUB & Lens-Based wiki

Liberté, Égalité, Beyoncé

Front-end: http://pzwart3.wdka.hro.nl/~acastro/radio/

Stream: http://pzwart1.wdka.hro.nl:8000/liberte_egalite_beyonce


OLD - needs updating

An interpretation of texts through sound, based on what the archive knows about the terms that constitute the texts.

I envision this project developing into a continuous sound stream, a sort of internet radio where each sound (file) matches a word from a text. The stream becomes an interpretation of texts through sound.

Map

[ text gathering ]-->(text pool)
		     /
		    /
		   /
[      sound scraping 	]--->(sound and playlist pool)
			           /
				  /
[ sound stream scheduling - liquidsoap] 
		|
		|
[ stream transmission - icecast2]
 |		|	|	
 |		|	|
 |		|	|
listeners listeners listeners


Front-end

http://pzwart3.wdka.hro.nl/~acastro/blind-archive/andre.html

http://pzwart3.wdka.hro.nl/~acastro/cgi-bin/playingnow.html

Currently

Liquidsoap plays the role of a radio station manager: it decides when playlists, jingles, etc. are played and sends the stream to icecast2.

Liquidsoap recipe - in process

Current Development 07/03/2012

At this point in its development, the Blind Sound Archive works as follows:


  • When the previous process is finished another script begins querying archive.org for the sounds:

Soundfiles are stored (on pzwart3) in /home/acastro/public_html/blind-archive/sf-archive


2DO:

  • check that soundfiles are not too long
  • create a list of the URLs from which each file is downloaded
  • process the files from 1 topic into a single file: ecasound
  • create a player (html5) to play the soundfiles

The process step-by-step

						
[Player / Front End] < ---------
				\
 				 \
				  \
[Sound Scraper] ----->	{ Sound sequences pool }
	^-----------------
		          \
[Texts spider]	----> { Texts pool }
	^
	|
 topics	---> { topics pool }


1 - Text Search

gathering text sources on various topics

  • from a pool of topics, one is chosen
  • a spider searches for online texts on that topic (7: 1x per weekday) (or 1 per day, which makes sense for news)
  • different sources: stackoverflow / stackexchange / news feeds / weather reports / blogs
  • the texts are stored in the texts-pool (xml?)


  • Tech:
    • Spider / rss-feed reader / api
    • xml text pool:
<texts>
	<news>
		<day1 date="2012....">
                     <item>blahh blahhh balllahhh</item>
                     <item>blooo bluuu baaoooo</item>
                     <item>cooo cuuu caaoooo</item>  
                </day1>		
		...
	</news>
</texts>
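
A minimal sketch of how such a text pool could be written and read back, using Python's standard xml.etree.ElementTree. It is written for Python 3. It follows the XML layout above, with one assumption: each day is stored as a generic day element with a date attribute rather than numbered elements (day1, day2, ...). The function names add_items and items_for are hypothetical.

```python
import xml.etree.ElementTree as ET


def add_items(root, source, date, items):
    """Append text items under <source><day date=...> and return the day element."""
    source_el = root.find(source)
    if source_el is None:
        # first text for this source: create its container element
        source_el = ET.SubElement(root, source)
    day_el = ET.SubElement(source_el, "day", {"date": date})
    for text in items:
        ET.SubElement(day_el, "item").text = text
    return day_el


def items_for(root, source, date):
    """Return the texts stored for one source on one date."""
    path = "./%s/day[@date='%s']/item" % (source, date)
    return [item.text for item in root.findall(path)]


# Build a small pool, serialise it, and parse it back.
texts = ET.Element("texts")
add_items(texts, "news", "2012-03-07", ["blahh blahhh", "blooo bluuu"])
xml_string = ET.tostring(texts, encoding="unicode")
reloaded = ET.fromstring(xml_string)
```

Keeping the date in an attribute lets the sound-search step ask for one day's items without knowing how many days the pool holds.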


						


[word-sound search]
    ^
    |
[text feeding] 
    ^
    |
 (text-sources.xml)
    ^
    |
[Text Download]

Scheduling of these scripts?


2 - Sound Search

finding sounds on archive.org that match the words from the text

  • for each word of a text a sound is downloaded from archive.org
    • words are fed one by one to archive.org's search engine, asking it for audio items tagged under that given term
    • the search is further limited by collection, so that the text/sound sequences of a topic exhibit a more coherent identity and distinguish themselves from those of other topics

TEST: 2 Topics: [source:poem collection:music ..:?? ] [source:weather forecast collection:field-recordings?? ] [source:news collection:spoken word ]

  • sounds are downloaded and saved in server directories (date-topic/)
  • sound directories from a day will be deleted after that day
  • Frequency: the process takes place during the previous day. 1x per day
  • Tech:
    • Python: Text sequencing
    • Python: API queries + download - Done
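
A minimal sketch (Python 3) of how the per-word query could be built once the collection limit is added. The endpoint and the mediatype:Audio clause come from the script further down this page. The collection: clause and the collection name some_collection are assumptions for illustration, as is the function name build_search_url.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "http://www.archive.org/advancedsearch.php"


def build_search_url(term, collection=None, rows=300):
    """Build an advancedsearch query for audio items matching one word."""
    query = "%s AND mediatype:Audio" % term
    if collection:
        # assumption: the search is narrowed with a collection: clause
        query += " AND collection:%s" % collection
    params = [("q", query), ("rows", rows), ("output", "json")]
    # urlencode quotes the term, so punctuation in a word cannot break the URL
    return SEARCH_ENDPOINT + "?" + urlencode(params)


# "some_collection" is a placeholder, not a real collection name.
url = build_search_url("budget", collection="some_collection")
```

Building the query string in one place makes it easy to give each topic its own collection, as in the TEST line above.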


3 - Player/stream

the resulting sound (file) sequences are played

  • a sequence of soundfiles is created (1 topic = 1 sequence)
  • each sequence-topic lasts as long as its duration
  • then the player moves to the next sequence-topic
  • if all sequences have been played, the player goes through them again and again (different order?) until the 24h of the day have been completed
  • Tech:
    • Liquid Soap
    • Icecast
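
The looping rule above can be illustrated with a small Python 3 sketch. In the actual setup the scheduling would be done by Liquidsoap; this only models the rule "repeat the sequences in order until the day is filled". The function name fill_day, the topics and the durations are all hypothetical.

```python
from itertools import cycle

DAY_SECONDS = 24 * 60 * 60


def fill_day(sequences, total=DAY_SECONDS):
    """Repeat (topic, duration) sequences in order until `total` seconds are covered.

    Returns a list of (start_second, topic) pairs.
    """
    if not sequences or any(duration <= 0 for _, duration in sequences):
        raise ValueError("need at least one sequence, all with positive duration")
    schedule = []
    elapsed = 0
    for topic, duration in cycle(sequences):
        if elapsed >= total:
            break
        schedule.append((elapsed, topic))
        elapsed += duration
    return schedule


# Hypothetical topics and durations (in seconds), for illustration only.
plan = fill_day([("news", 1800), ("weather", 600)], total=3600)
```

In this sketch the last sequence is allowed to run past the end of the period rather than being cut off. Whether the real stream should cut it is left open.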

4 - Front End

  • Topic being played is displayed
  • Links to sound sources in archive.org
  • User can supply more topics
  • User can comment on the sound??
  • Tech:
    • html
    • ...





Searching soundfiles per term 17/02/2012

Fetching sound files from archive.org based on search terms

In order to do that I am making 2 API requests:

  • 1 - searching for a given term within mediaType:Audio
    • getting the identifier of the first search occurrence id_0
  • 2 - requesting details on identifier (id_0)


I use the 2nd (details) API query to look at the files the item contains.

From this list I get the first ogg (in case ogg files are present)

Downloading the soundfile: on archive.org, files are stored at http://www.archive.org/download/ + identifier + filename
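
A minimal sketch (Python 3) of the file-picking and URL-building steps, run on hand-made sample data instead of a live API response. The shape of the files dictionary (file names starting with "/" mapped to a dictionary with a size string) is inferred from the script below. The sample entries, the identifier some_identifier and the function names first_ogg and download_url are hypothetical. Sorting the names is an assumption that makes "first" deterministic.

```python
def first_ogg(files, max_size=1000000):
    """Return the first .ogg file name within the size limit, or None."""
    for name in sorted(files):
        is_ogg = name.lower().endswith(".ogg")  # matches .ogg and .OGG
        if is_ogg and int(files[name].get("size", 0)) <= max_size:
            return name
    return None


def download_url(identifier, filename):
    """Join identifier and file name with exactly one slash between them."""
    return "http://www.archive.org/download/" + identifier + "/" + filename.lstrip("/")


# Invented sample data; a real response would come from the details query.
sample_files = {
    "/talk.mp3": {"size": "5200000"},
    "/talk_long.ogg": {"size": "4100000"},
    "/talk_short.OGG": {"size": "640000"},
}
name = first_ogg(sample_files)
url = download_url("some_identifier", name)
```

Here the long ogg is skipped because of its size and the short one is chosen, giving a URL of the form download/ + identifier + filename described above.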

12/02/2012 - Latest script

#!/usr/bin/python
import urllib2, urllib, json, re, shutil, datetime, os

#create directory where soundfiles will be saved
dt_obj = datetime.datetime.now()
date_str = dt_obj.strftime("%Y%m%d-%H%M%S")
archive_dir = 'sf-archive-' + date_str
os.makedirs(archive_dir)

sentence = "US President Obama unveils a $3.8 trillion budget, with plans to raise taxes on the wealthy"
sentance_list=re.findall(r"[\w']+|[.,!?;]", sentence) # find words and punctuation and split them into a list

search_list = []
info_list = [] # Structure: [term, url, response, num of results ]
download_urls = []

# Results of search term + mediatype:Audio
for term in sentance_list:	# build info list [ [term, url, response,  num of results], [...], [...] ] 
	results = []
	results.append(term)
	if ('.' in term) or ('?' in term) or ('!' in term):
		results.append("fullstop") # punctuation entries carry a marker in the url slot
		results.append("No Url")
		results.append("No Response")
		results.append("No Results")
		info_list.append(results) # push the results list into the info list
		print 'stop: ' + term
	elif (',' in term) or (';' in term):
		results.append("comma")
		results.append("No Url")
		results.append("No Response")
		results.append("No Results")
		info_list.append(results) # push the results list into the info list
		print 'comma: '  + term
	else:
		print 'word: ' + term
		url = 'http://www.archive.org/advancedsearch.php?q=' + urllib.quote_plus(term) + '+AND+mediatype:Audio&rows=300&output=json' #api query; the term is URL-quoted so characters such as ' cannot break it
		print 'url: '+ url
		results.append(url)
		search = urllib2.urlopen(url)
		search_result = json.load(search)
		response = search_result['response']
		num_results =	response['numFound']
		results.append(response)
		results.append(num_results)
		info_list.append(results) # push the results list into the info list



# go through the info list, checking whether the entry is a punctuation mark, whether there are more than 0 search results, and whether the item contains ogg files
for info in info_list:
	url = info[1] # for punctuation entries this slot holds 'comma' or 'fullstop' instead of a url
	num_results = info[3] # use this term's own result count, not the one left over from the search loop

	print info[0]
	print info[1]
	print info[3]
	print

	if url == 'comma':
		download_urls.append('comma')
		print 'comma found'
	elif url == 'fullstop':
		download_urls.append('fullstop')
	elif num_results < 1:
		print 'num_results is 0'
		print
		download_urls.append(0)

	else:
		done = False
		docs = info[2]['docs'] # at most 300 items are returned (rows=300), even if numFound is larger
		for n in range(len(docs)): #loop through the returned results looking for .ogg and < size limit
			identifier = info[2]['docs'][n]['identifier']
			print
			print identifier
			format = info[2]['docs'][n].get('format', []) # some items list no formats

			if "Ogg Vorbis" in format:
				# go to details url								
				details_url = 'http://www.archive.org/details/' + identifier + '&output=json' #details on identifier  http://www.archive.org/details/electroacoustic_music&output=json
				print details_url
				try:				
					details_search = urllib2.urlopen(details_url)
					details_result = json.load(details_search)
					files=details_result['files'].keys() #look at the containig files
									
					for ogg in files:
						#print str(o)			
						if re.search(r'\.ogg$', ogg, re.IGNORECASE): #if there are .ogg or .OGG (dot escaped so it only matches a literal '.')
							print "ogg found"
							print ogg							
							size =  details_result['files'][ogg]['size']
							print size		
							if int(size) > 1000000:	#check file size
								print "file TOO large"			
							else: 
								print "RIGHT SIZE"
								audio_url = 'http://www.archive.org/download/' + identifier + ogg	
								download_urls.append(audio_url)		
								done = True
								break
				except urllib2.HTTPError:
					print '404'+ details_url	

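			if done: 
				break
		if not done: # no suitable .ogg was found for this term: add a placeholder so download_urls stays aligned with sentance_list
			download_urls.append(0)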


print download_urls


#silence and punctuation soundfiles - WILL LEAVE THEM OUT FOR NOW - since you don't have it in your machine
#silence = "silences/silence.ogg"
#comma = "silences/comma.ogg"
#fullstop = "silences/fullstop.ogg"

for i, url in enumerate(download_urls): #Download files from url
	num = '%02d' % (1+(i))
	if url == 0:
		pass # nothing to do while the silence file is left out
		#silence_file = archive_dir+"/"+str(num)+'silence.ogg'
		#shutil.copyfile(silence, silence_file)
	elif url == 'comma':
		pass # nothing to do while the comma file is left out
		#comma_file = archive_dir+"/"+str(num)+'comma.ogg'
		#shutil.copyfile(comma, comma_file)
	elif url == 'fullstop':
		pass # nothing to do while the fullstop file is left out
		#fullstop_file = archive_dir+"/"+str(num)+'fullstop.ogg'
		#shutil.copyfile(fullstop, fullstop_file)
	else:
		file_name = str(num) + sentance_list[i] + '.ogg'
		print file_name + ' ' + url
		urllib.urlretrieve(url, archive_dir + "/" + file_name)