User:Andre Castro/prototyping/1.2/Archiveorg-seachTerm

From XPUB & Lens-Based wiki
=Liberté, Égalité, Beyoncé=
<b>
Front-end: http://pzwart3.wdka.hro.nl/~acastro/radio/

Stream: http://pzwart1.wdka.hro.nl:8000/liberte_egalite_beyonce
</b>
 
 
==OLD - needs updating==
 
<b>An interpretation of texts through sound, according to the archive's knowledge of the terms that constitute the texts</b>
 
I envision this project developing into a continuous sound-stream, a sort of internet radio where each sound(-file) matches a word from a text. The stream becomes an interpretation of texts through sound.
 
===Map===
 
<source lang="text">

[ text gathering ]-->(text pool)
                    /
                   /
                  /
[      sound scraping  ]--->(sound and playlist pool)
                           /
                          /
[ sound stream scheduling - liquidsoap]
                |
                |
[ stream transmission - icecast2]
 |              |       |
 |              |       |
 |              |       |
listeners listeners listeners

</source>
 
 
===Front-end===
http://pzwart3.wdka.hro.nl/~acastro/blind-archive/andre.html
 
http://pzwart3.wdka.hro.nl/~acastro/cgi-bin/playingnow.html
 
===Currently===
Using [http://liquidsoap.fm liquidsoap].
It plays the role of a radio station manager, deciding when playlists, jingles, etc. are played, and sends the stream to icecast2.
 
[http://pzwart3.wdka.hro.nl/wiki/User:Andre_Castro/prototyping/1.2/liquidsoap-noites Liquidsoap recipe - in process ]
 
==Current Development 07/03/2012==
 
At this point of its development, the Blind Sound Archive works by:
 
* Collecting online texts from several sources (through RSS feeds) and storing them in an XML file
** the text sources are organized into several topics (e.g. cooking, news, bible, code, science, etc.)
** it runs hourly
** script: http://pzwart3.wdka.hro.nl/~acastro/blind-archive/blind-collect-text.py
** text sources database: http://pzwart3.wdka.hro.nl/~acastro/blind-archive/blind-text-sources-X.xml
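The collection step above can be sketched in miniature. This is not the actual blind-collect-text.py, just a minimal Python 3 illustration of pulling texts out of an RSS 2.0 feed with the standard library; the sample feed is made up:

<source lang="python">
import xml.etree.ElementTree as ET

def collect_items(rss_text):
    """Extract one text per <item> (title + description) from an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    texts = []
    for item in root.iter('item'):
        title = item.findtext('title', default='')
        desc = item.findtext('description', default='')
        texts.append((title + ' ' + desc).strip())
    return texts

# a made-up feed standing in for one of the real sources
sample_feed = """<rss version="2.0"><channel><title>news</title>
<item><title>Budget unveiled</title><description>plans to raise taxes</description></item>
<item><title>Weather</title><description>rain expected</description></item>
</channel></rss>"""

print(collect_items(sample_feed))
</source>

In practice each result would be appended under its topic element in the XML pool, as in the blind-text-sources-X.xml file linked above.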
 
 
* When the previous process has finished, another script begins querying archive.org for the sounds:
** each word from the collected text is the subject of a search in the archive
** each of the text's topics has a specific search area in the archive (e.g. cooking in 78rpm records, science in ambient music, poetry in spoken-word)
** soundfiles are downloaded and stored on the server
** script: http://pzwart3.wdka.hro.nl/~acastro/blind-archive/blind-collect-text.py
 
Soundfiles are stored (on pzwart3) in /home/acastro/public_html/blind-archive/sf-archive
 
 
 
===2DO:===
* check that the soundfiles are not too long
* create a list of the URL from which each file is downloaded
 
* process the files from 1 topic into a single file: ecasound
 
* create an HTML5 player to play the soundfiles
 
==The process step-by-step==
 
<source lang="email">
[Player / Front End] < ---------
                                \
                                 \
                                  \
[Sound Scraper] ----->  { Sound sequences pool }
        ^-----------------
                          \
[Texts spider]  ----> { Texts pool }
        ^
        |
 topics ---> { topics pool }
 
</source>
 
 
===1 - Text Search===
 
gathering text sources on various topics
 
* from a pool of topics, one is chosen
* a spider searches for online texts on that topic (7: 1x per weekday) (or 1x every day - makes sense for news)
* different sources: stackoverflow / stackexchange / news feeds / weather reports / blogs
* the texts are stored in the texts-pool (XML?)
 
 
* Tech:
** Spider / rss-feed reader / api
** xml text pool:
<source lang="xml">
<texts>
  <news>
    <day1 date="2012....">
      <item>blahh blahhh balllahhh</item>
      <item>blooo bluuu baaoooo</item>
      <item>cooo cuuu caaoooo</item>
    </day1>
    ...
  </news>
</texts>
</source>
 
 
 
<source lang="email">
 
 
[word-sound search]
    ^
    |
[text feeding]
    ^
    |
(text-sources.xml)
    ^
    |
[Text Download]   
 
</source>
 
¿ Scheduling of these scripts ?
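One plain answer to the scheduling question would be two cron entries (an assumption - the source only says the text collector runs hourly and the sound scraping happens once per day; the path of the second script is hypothetical, since only blind-collect-text.py is named above):

<source lang="text">
# hourly text collection (the wiki says the collector runs hourly)
0 * * * *  /usr/bin/python /home/acastro/public_html/blind-archive/blind-collect-text.py
# daily sound scraping, after the day's texts are in (script name hypothetical)
30 0 * * * /usr/bin/python /home/acastro/public_html/blind-archive/blind-collect-sound.py
</source>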
 
 
* Google API http://code.google.com/apis/customsearch/v1/overview.html
 
===2 - Sound Search===
 
finding sounds on archive.org that match the words from the text
* for each word of a text a sound is downloaded from archive.org
** words are fed one by one to archive.org's search engine, asking it for audio items tagged under that given term
** <div style="color:red"> search further limited by the collection, so that text/sound-sequences exhibit a more coherent identity and distinguish themselves from other topics
TEST: 2 Topics: [source:poem collection:music ..:?? ] [source:weather forecast collection:field-recordings?? ] [source:news collection:spoken word ]
</div>
* sounds are downloaded and saved in server directories (date-topic/)
* sound directories from a day will be deleted after that day
 
* Frequency: the process takes place during the previous day. 1x per day
 
* Tech:
** Python: Text sequencing
** Python: API queries + download - Done
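The collection-restricted search described in red above can be expressed as a query-URL builder. A Python 3 sketch, assuming archive.org's advancedsearch syntax (fielded q parameter, output=json) as used elsewhere on this page; the collection names per topic are still open (the ??s above), so the ones here are placeholders:

<source lang="python">
from urllib.parse import urlencode

def build_search_url(term, collection):
    """Build an advancedsearch query limiting a term to audio in one collection."""
    q = '%s AND mediatype:(audio) AND collection:(%s)' % (term, collection)
    params = [('q', q), ('rows', '10'), ('output', 'json')]
    return 'http://www.archive.org/advancedsearch.php?' + urlencode(params)

print(build_search_url('rain', 'opensource_audio'))
</source>

Using urlencode also percent-encodes terms with spaces or punctuation, which the raw string concatenation in the script below does not handle.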
 
 
===3 - Player/stream===
the resulting soundfile sequences are played
 
* a sequence of soundfiles is created (1 topic = 1 sequence)
** x-fade/cat files http://eca.cx/ecasound/Documentation/examples.html
* each sequence-topic lasts as long as its duration
* then the player moves to the next sequence-topic
* if all sequences have been played, the player goes through them again and again (in a different order?) until the 24h of a day have been completed
 
* Tech:
** Liquid Soap
** Icecast
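The looping behaviour described above (play every sequence-topic, then repeat in varying order until 24h are filled) can be sketched as follows; the topic names and durations are invented for illustration:

<source lang="python">
import random

def day_schedule(durations, total=24 * 3600, seed=0):
    """Order sequence-topics, reshuffling each pass, until a day is filled.

    durations maps topic -> sequence length in seconds (illustrative values)."""
    rnd = random.Random(seed)
    topics = list(durations)
    order, elapsed = [], 0
    while elapsed < total:
        rnd.shuffle(topics)
        for topic in topics:
            if elapsed >= total:
                break
            order.append(topic)
            elapsed += durations[topic]
    return order

schedule = day_schedule({'news': 40000, 'cooking': 50000, 'bible': 30000})
print(len(schedule))
</source>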
 
===4 - Front End===
* Topic being played is displayed
* Links to sound sources in archive.org
* User can supply more topics
* User can comment on the sound??
 
* Tech:
** html
** ...
 
 
 
 
 
 
 
----
 
==Searching soundfiles per term 17/02/2012==
 
Fetching sound files from archive.org based on search terms.
 
In order to do that I am making 2 API requests:
 
* 1 - searching for a given term within mediatype:Audio
** getting the identifier of the first search occurrence, id_0
* 2 - requesting details on the identifier (id_0)
 
I use the 2nd (details) API query to look for the files the item contains.
 
From this list I get the first ogg (in case ogg files are present).
 
Downloading the soundfile: in archive.org, files are stored at http://www.archive.org/download/ + identifier + filename
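The two requests and the download pattern can be reduced to three URL builders plus the id_0 lookup (a Python 3 sketch; the concatenation without a slash mirrors the script below, which relies on the file names in the details response already beginning with '/'):

<source lang="python">
def search_url(term):
    """1st request: search a term within mediatype:Audio."""
    return ('http://www.archive.org/advancedsearch.php?q=' + term +
            '+AND+mediatype:Audio&rows=300&output=json')

def details_url(identifier):
    """2nd request: details (including the file list) for one identifier."""
    return 'http://www.archive.org/details/' + identifier + '&output=json'

def download_url(identifier, filename):
    """Files live at /download/ + identifier + filename (filename starts with '/')."""
    return 'http://www.archive.org/download/' + identifier + filename

def first_identifier(search_result):
    """id_0: the identifier of the first search occurrence."""
    return search_result['response']['docs'][0]['identifier']

# a made-up response in the shape the advancedsearch JSON takes
sample = {'response': {'numFound': 2,
                       'docs': [{'identifier': 'electroacoustic_music'},
                                {'identifier': 'some_other_item'}]}}
print(download_url(first_identifier(sample), '/track01.ogg'))
</source>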
===12/02/2012 - Latest script===
<source lang="python">
#!/usr/bin/python
import urllib2, urllib, json, re, shutil, datetime, os

# create the directory where the soundfiles will be saved
dt_obj = datetime.datetime.now()
date_str = dt_obj.strftime("%Y%m%d-%H%M%S")
archive_dir = 'sf-archive-' + date_str
os.makedirs(archive_dir)

sentence = "US President Obama unveils a $3.8 trillion budget, with plans to raise taxes on the wealthy"
sentence_list = re.findall(r"[\w']+|[.,!?;]", sentence) # split the sentence into words and punctuation

info_list = []      # structure: [term, url, response, num of results]
download_urls = []

# search every term + mediatype:Audio
for term in sentence_list: # build the info list [[term, url, response, num of results], ...]
    results = []
    results.append(term)
    if ('.' in term) or ('?' in term) or ('!' in term):
        results.append("fullstop")
        results.append("No Url")
        results.append("No Response")
        results.append("No Results")
        info_list.append(results) # push the results list into the info list
        print 'stop: ' + term
    elif (',' in term) or (';' in term):
        results.append("comma")
        results.append("No Url")
        results.append("No Response")
        results.append("No Results")
        info_list.append(results)
        print 'comma: ' + term
    else:
        print 'word: ' + term
        url = 'http://www.archive.org/advancedsearch.php?q=' + term + '+AND+mediatype:Audio&rows=300&output=json' # API query
        print 'url: ' + url
        results.append(url)
        search = urllib2.urlopen(url)
        search_result = json.load(search)
        response = search_result['response']
        num_results = response['numFound']
        results.append(response)
        results.append(num_results)
        info_list.append(results)


# go through the info list, checking whether the entry is a punctuation mark,
# whether there is at least 1 search result, and whether the item contains ogg files
for info in info_list:
    marker = info[1] # the query url, or 'comma'/'fullstop' for punctuation
    print info[0]
    print info[1]
    print info[3]
    print

    if 'comma' in marker:
        download_urls.append('comma')
        print 'comma found'
    elif 'fullstop' in marker:
        download_urls.append('fullstop')
    elif info[3] < 1:
        print 'num_results is 0'
        print
        download_urls.append(0)
    else:
        done = False
        docs = info[2]['docs'] # only the rows returned (up to 300) are scanned
        for n in range(len(docs)): # loop through the results looking for an .ogg below the size limit
            identifier = docs[n]['identifier']
            print
            print identifier
            format = docs[n]['format']

            if "Ogg Vorbis" in format:
                # ask the details url for the item's file list
                details_url = 'http://www.archive.org/details/' + identifier + '&output=json' # e.g. http://www.archive.org/details/electroacoustic_music&output=json
                print details_url
                try:
                    details_search = urllib2.urlopen(details_url)
                    details_result = json.load(details_search)
                    files = details_result['files'].keys() # the contained files; keys begin with '/'

                    for ogg in files:
                        if re.search('.ogg$', ogg) or re.search('.OGG$', ogg): # if there are .ogg or .OGG files
                            print "ogg found"
                            print ogg
                            size = details_result['files'][ogg]['size']
                            print size
                            if int(size) > 1000000: # check the file size
                                print "file TOO large"
                            else:
                                print "RIGHT SIZE"
                                audio_url = 'http://www.archive.org/download/' + identifier + ogg
                                download_urls.append(audio_url)
                                done = True
                                break
                except urllib2.HTTPError:
                    print '404 ' + details_url

            if done:
                break
        if not done: # no usable ogg for this word: append a placeholder to keep the lists aligned
            download_urls.append(0)


print download_urls


# silence and punctuation soundfiles - WILL LEAVE THEM OUT FOR NOW
#silence = "silences/silence.ogg"
#comma = "silences/comma.ogg"
#fullstop = "silences/fullstop.ogg"

for i, url in enumerate(download_urls): # download the files
    num = '%02d' % (i + 1)
    if url == 0:
        pass # placeholder: copy a silence file here
        #silence_file = archive_dir + "/" + num + 'silence.ogg'
        #shutil.copyfile(silence, silence_file)
    elif url == 'comma':
        pass # placeholder: copy a comma-pause file here
        #comma_file = archive_dir + "/" + num + 'comma.ogg'
        #shutil.copyfile(comma, comma_file)
    elif url == 'fullstop':
        pass # placeholder: copy a full-stop-pause file here
        #fullstop_file = archive_dir + "/" + num + 'fullstop.ogg'
        #shutil.copyfile(fullstop, fullstop_file)
    else:
        file_name = num + sentence_list[i] + '.ogg'
        print file_name + ' ' + url
        urllib.urlretrieve(url, archive_dir + "/" + file_name)
</source>

Latest revision as of 19:55, 26 March 2012
