User:Andre Castro/prototyping/1.2/Archiveorg-seachTerm

From XPUB & Lens-Based wiki

Liberté, Égalité, Beyoncé



OLD - needs updating

An interpretation of texts through sound, according to the archive's knowledge of the terms that constitute the texts

I envision this project developing into a continuous sound-stream, a sort of internet radio where each sound(-file) matches a word from a text. The stream becomes an interpretation of texts through sound.


[ text gathering ]-->(text pool)
[      sound scraping 	]--->(sound and playlist pool)
[ sound stream scheduling - liquidsoap] 
[ stream transmission - icecast2]
 |		|	|	
 |		|	|
 |		|	|
listeners listeners listeners



Using liquidsoap: it plays the role of a radio station manager, deciding when playlists, jingles, etc. are played, and sends the stream to icecast2

Liquidsoap recipe - in process

Current Development 07/03/2012

At this point in its development the Blind Sound Archive works as follows:

  • When the previous process is finished, another script begins querying for the sounds:

Soundfiles are stored (on pzwart3) in /home/acastro/public_html/blind-archive/sf-archive


  • check that soundfiles are not too long
  • create a list of the URLs from which each file was downloaded
  • process the files of 1 topic into a single file: ecasound
  • create a player (html5) to play the soundfiles

The process step-by-step

[Player / Front End] < ---------
[Sound Scraper] ----->	{ Sound sequences pool }
[Texts spider]	----> { Texts pool }
 topics	---> { topics pool }

1 - Text Search

gathering text sources on various topics

  • from a pool of topics, one is chosen
  • a spider searches for online texts on that topic (7: 1x per weekday) (or 1 per day, which makes sense for news)
  • different sources: stackoverflow / stackexchange / news feeds / weather reports / blogs
  • the texts are stored in the texts-pool (xml?)

  • Tech:
    • Spider / rss-feed reader / api
    • xml text pool:
		<day1 date="2012....">
			<item>blahh blahhh balllahhh</item>
			<item>blooo bluuu baaoooo</item>
			<item>cooo cuuu caaoooo</item>
		</day1>
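
A minimal sketch of how such an XML text pool could be written and read back with Python 3's standard library. The root element `textpool`, the element name `day` (instead of `day1`, `day2`, ...), the function names and the sample date are all assumptions made for illustration; the notes above only fix the idea of dated groups of `<item>` texts.

```python
import xml.etree.ElementTree as ET


def build_text_pool(days):
    """Build the pool from a list of (date_string, [text, ...]) pairs.

    Returns the XML as a string. Element names are hypothetical.
    """
    root = ET.Element("textpool")
    for date, texts in days:
        day = ET.SubElement(root, "day", date=date)
        for text in texts:
            ET.SubElement(day, "item").text = text
    return ET.tostring(root, encoding="unicode")


def read_items(xml_string, date):
    """Return the list of texts stored for one date (empty if none)."""
    root = ET.fromstring(xml_string)
    return [
        item.text
        for day in root.findall("day")
        if day.get("date") == date
        for item in day.findall("item")
    ]


# Sample data only: the date and the texts are made up.
pool = build_text_pool([("2012-03-07", ["blahh blahhh", "blooo bluuu"])])
todays_texts = read_items(pool, "2012-03-07")
```

Keeping the date as an attribute rather than in the element name means one query can select any day, which may suit the one-topic-per-day rhythm described above.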


[word-sound search]
[text feeding] 
[Text Download]

¿ Scheduling of these scripts ?

2 - Sound Search

finding sounds on archive.org that match the words from the text

  • for each word of a text a sound is downloaded from archive.org
    • words are fed one by one to archive.org's search engine, asking it for audio items tagged with that term
    • the search is further limited to one collection, so that text/sound-sequences exhibit a more coherent identity and distinguish themselves from those of other topics

TEST: 2 Topics: [source:poem collection:music ..:?? ] [source:weather forecast collection:field-recordings?? ] [source:news collection:spoken word ]

  • sounds are downloaded and saved in server directories (date-topic/)
  • sound directories from a day will be deleted after that day
  • Frequency: the process takes place during the previous day. 1x per day
  • Tech:
    • Python: Text sequencing
    • Python: API queries + download - Done

3 - Player/stream

the resulting sound(file) sequences are played

  • a sequence of soundfiles is created (1 topic = 1 sequence)
  • each sequence-topic lasts as long as its duration
  • then the player moves to the next sequence-topic
  • if all sequences have been played, the player goes through them again and again (different order ?) until the 24h of the day have been completed
  • Tech:
    • Liquid Soap
    • Icecast
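
The looping rule above (play every sequence, then start over, possibly in a different order, until 24 hours are filled) can be sketched in Python 3 as a plain scheduling function. It only computes start times; handing the result to Liquidsoap is left out. The function name, the `{topic: seconds}` input format and the sample durations are assumptions made for illustration.

```python
import random

DAY_SECONDS = 24 * 60 * 60


def schedule_day(durations, total=DAY_SECONDS, shuffle=True, seed=None):
    """Return a list of (start_second, topic) covering `total` seconds.

    durations: dict mapping each topic to the length of its sequence in seconds.
    Each pass plays every topic once; passes repeat until the time is filled.
    """
    if not durations or any(d <= 0 for d in durations.values()):
        raise ValueError("need at least one sequence, each with a positive duration")
    rng = random.Random(seed)
    schedule = []
    clock = 0
    while clock < total:
        order = list(durations)
        if shuffle:
            rng.shuffle(order)  # "different order ?" on each pass
        for topic in order:
            if clock >= total:
                break
            schedule.append((clock, topic))
            clock += durations[topic]
    return schedule


# Hypothetical topics and durations, scheduled over half an hour without shuffling.
example = schedule_day({"news": 600, "weather": 300}, total=1800, shuffle=False)
```

The last sequence of the day may run past the 24h mark; whether to cut it or let it finish is a choice the notes leave open.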

4 - Front End

  • Topic being played is displayed
  • Links to sound sources on archive.org
  • User can supply more topics
  • User can comment on the sound??
  • Tech:
    • html
    • ...
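
As a rough sketch of the first two points (showing the topic and linking each sound to its source), a static page could be generated in Python 3. The function name, the tuple layout and the sample paths and URL are hypothetical; the page uses the html5 `<audio>` element mentioned earlier in these notes.

```python
from html import escape


def render_page(topic, sounds):
    """Return an HTML page for one topic.

    sounds: list of (word, audio_url, source_url) tuples.
    """
    items = []
    for word, audio_url, source_url in sounds:
        items.append(
            '<li>{word} <audio controls src="{audio}"></audio> '
            '<a href="{source}">source</a></li>'.format(
                word=escape(word),
                audio=escape(audio_url),
                source=escape(source_url),
            )
        )
    return (
        "<!DOCTYPE html>\n<html><body>\n"
        "<h1>Now playing: {topic}</h1>\n"
        "<ul>\n{items}\n</ul>\n"
        "</body></html>"
    ).format(topic=escape(topic), items="\n".join(items))


# Made-up sample data: one word with a local file and a placeholder source link.
page = render_page(
    "weather <today>",
    [("rain", "sf-archive/01rain.ogg", "http://example.org/details/some-item")],
)
```

All values are escaped before being placed in the markup, which matters once users can supply topics themselves.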

Searching soundfiles per term 17/02/2012

Fetching sound files from archive.org based on search terms

In order to do that I am making 2 API requests:

  • 1 - searching for a given term within mediaType:Audio
    • getting the identifier of the first search occurrence id_0
  • 2 - requesting details on identifier (id_0)

I use the 2nd (details) API query to look for the containing files.

From this list I get the first ogg (in case ogg files are present)

Downloading the soundfile: on archive.org, files are stored at the download base URL + identifier + filename
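
The selection logic of the two requests can be sketched in Python 3 on already-parsed JSON, without any network access. The response shapes (`response` / `docs` / `identifier`, and a `files` dictionary keyed by file name with a `size` entry) follow what the script below reads; the function names, the sorted iteration order and the sample data are assumptions for illustration. The 1000000-byte limit is the one used in the script.

```python
def first_identifier(search_result):
    """Return the identifier of the first search hit, or None if there is none."""
    docs = search_result["response"]["docs"]
    return docs[0]["identifier"] if docs else None


def first_ogg(details_result, max_size=1000000):
    """Return the name of the first .ogg file that is not too large, or None.

    File names are visited in sorted order so the choice is repeatable
    (an assumption; the original takes whatever order the response gives).
    """
    files = details_result["files"]
    for name in sorted(files):
        if name.lower().endswith(".ogg") and int(files[name]["size"]) <= max_size:
            return name
    return None


# Made-up sample responses with the same structure the script expects.
sample_search = {"response": {"numFound": 2,
                              "docs": [{"identifier": "item_a"},
                                       {"identifier": "item_b"}]}}
sample_details = {"files": {"/b.ogg": {"size": "2000000"},
                            "/a.mp3": {"size": "10"},
                            "/c.OGG": {"size": "500"}}}

chosen_id = first_identifier(sample_search)
chosen_file = first_ogg(sample_details)
```

Separating the selection from the HTTP calls makes it possible to try the rules on saved responses before running a full download.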

12/02/2012 - Latest script

import urllib2, urllib, json, re, shutil, datetime, os

# create the directory where soundfiles will be saved
dt_obj = datetime.datetime.now()
date_str = dt_obj.strftime("%Y%m%d-%H%M%S")
archive_dir = 'sf-archive-' + date_str
os.makedirs(archive_dir)

sentence = "US President Obama unveils a $3.8 trillion budget, with plans to raise taxes on the wealthy"
sentance_list=re.findall(r"[\w']+|[.,!?;]", sentence) # find words and punctuation and split them into a list

search_list = []
info_list = [] # Structure: [term, url, response, num of results]
download_urls = []

# Results of search term + mediatype:Audio
for term in sentance_list:	# build info list [ [term, url, response, num of results], [...], [...] ]
	if ('.' in term) or ('?' in term) or ('!' in term):
		# sentence-ending punctuation: nothing to search for, mark it in the url field
		info_list.append([term, 'fullstop', None, 0])
		print 'stop: ' + term
	elif (',' in term) or (';' in term):
		info_list.append([term, 'comma', None, 0])
		print 'comma: ' + term
	else:
		print 'word: ' + term
		# the empty string stands for the base URL of the search API
		url = '' + term + '+AND+mediatype:Audio&rows=300&output=json' #api query
		print 'url: '+ url
		search = urllib2.urlopen(url)
		search_result = json.load(search)
		response = search_result['response']
		num_results = response['numFound']
		info_list.append([term, url, response, num_results]) # push the results list into the info list

# go through the info list, checking if the item is a punctuation mark, if there are more than 0 search results, and if the item contains ogg files
for info in info_list:
	term, url, response, num_results = info
	print term
	print url
	print num_results

	if url == 'comma':
		print 'comma found'
		download_urls.append('comma')
	elif url == 'fullstop':
		print 'fullstop found'
		download_urls.append('fullstop')
	elif num_results < 1:
		print 'num_results is 0'
		download_urls.append(0)
	else:
		audio_url = 0	# stays 0 if no suitable ogg file is found
		done = False
		for doc in response['docs']:	# loop through the returned results looking for .ogg and < size limit
			identifier = doc['identifier']
			print identifier
			if "Ogg Vorbis" in doc.get('format', []):
				# go to details url; the empty string stands for the base URL of the details API
				details_url = '' + identifier + '&output=json' #details on identifier
				print details_url
				try:
					details_search = urllib2.urlopen(details_url)
					details_result = json.load(details_search)
					files = details_result['files'].keys() #look at the containing files
					for ogg in files:
						if re.search(r'\.ogg$', ogg, re.IGNORECASE): #if there are .ogg or .OGG
							print "ogg found"
							print ogg
							size = details_result['files'][ogg]['size']
							print size
							if int(size) > 1000000:	#check file size
								print "file TOO large"
							else:
								print "RIGHT SIZE"
								# the empty string stands for the base URL of the download location
								audio_url = '' + identifier + ogg
								done = True
								break
				except urllib2.HTTPError:
					print '404 ' + details_url
			if done:
				break
		download_urls.append(audio_url)	# one entry per term, so indexes match sentance_list

print download_urls

#silence and punctuation soundfiles - WILL LEAVE THEM OUT FOR NOW - since you don't have it in your machine
#silence = "silences/silence.ogg"
#comma = "silences/comma.ogg"
#fullstop = "silences/fullstop.ogg"

for i, url in enumerate(download_urls): #Download files from url
	num = '%02d' % (1+(i))
	if url == 0:
		pass	# no sound found for this word
		#silence_file = archive_dir+"/"+str(num)+'silence.ogg'
		#shutil.copyfile(silence, silence_file)
	elif url == 'comma':
		pass
		#comma_file = archive_dir+"/"+str(num)+'comma.ogg'
		#shutil.copyfile(comma, comma_file)
	elif url == 'fullstop':
		pass
		#fullstop_file = archive_dir+"/"+str(num)+'fullstop.ogg'
		#shutil.copyfile(fullstop, fullstop_file)
	else:
		file_name = str(num) + sentance_list[i] + '.ogg'
		print file_name + ' ' + url
		urllib.urlretrieve(url, archive_dir + "/" + file_name)