User:Andre Castro/prototyping/1.2/Archiveorg-seachTerm
Liberté, Égalité, Beyoncé
Front-end: http://pzwart3.wdka.hro.nl/~acastro/radio/
Stream: http://pzwart1.wdka.hro.nl:8000/liberte_egalite_beyonce
OLD - needs updating
An interpretation of texts through sound, according to the archive's knowledge of the terms that constitute the texts
I envision this project developing into a continuous sound-stream, a sort of internet radio where each sound(-file) matches a word from a text. The stream becomes an interpretation of texts through sound.
Map
[ text gathering ]
        |
        v
   ( text pool )
        |
        v
[ sound scraping ]
        |
        v
( sound and playlist pool )
        |
        v
[ sound stream scheduling - liquidsoap ]
        |
        v
[ stream transmission - icecast2 ]
    |        |        |
    v        v        v
listeners listeners listeners
Front-end
http://pzwart3.wdka.hro.nl/~acastro/blind-archive/andre.html
http://pzwart3.wdka.hro.nl/~acastro/cgi-bin/playingnow.html
Currently
Using liquidsoap. It plays the role of a radio station manager, deciding when playlists, jingles, etc. are played, and sends the stream to icecast2.
Liquidsoap recipe - in process
Current Development 07/03/2012
At this point in its development, the Blind Sound Archive works by:
- Collecting online texts from several sources (through RSS feeds) and storing them in an XML file (a collection sketch is given below)
  - the text sources are organized into several topics (e.g. cooking, news, bible, code, science)
  - it runs hourly
  - script: http://pzwart3.wdka.hro.nl/~acastro/blind-archive/blind-collect-text.py
  - text sources database: http://pzwart3.wdka.hro.nl/~acastro/blind-archive/blind-text-sources-X.xml
- When the previous process is finished, another script begins querying archive.org for the sounds:
  - each word from the collected text is used as a search term in the archive
  - each of the text's topics has a specific search area in the archive (e.g. cooking in 78rpm records, science in ambient music, poetry in spoken-word)
  - soundfiles are downloaded and stored on the server
  - script: http://pzwart3.wdka.hro.nl/~acastro/blind-archive/blind-collect-text.py
Soundfiles are stored (on pzwart3) in /home/acastro/public_html/blind-archive/sf-archive
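A minimal sketch of the hourly text-collection step, assuming the feedparser library is available; the feed URLs, topic names, and output filename are hypothetical stand-ins, not the contents of the actual blind-collect-text.py:

# hypothetical sketch of the text collection: read RSS feeds per topic
# and write the entries to an XML pool shaped like the sample further down
# (feed URLs and the output filename are placeholders)
import datetime
import feedparser                      # assumed dependency
import xml.etree.ElementTree as ET

feeds = {                              # hypothetical topic -> feed mapping
    'news': 'http://feeds.bbci.co.uk/news/rss.xml',
    'cooking': 'http://example.org/cooking.rss',
}

texts = ET.Element('texts')
for topic, feed_url in feeds.items():
    topic_el = ET.SubElement(texts, topic)
    day = ET.SubElement(topic_el, 'day1',
                        date=datetime.date.today().isoformat())
    for entry in feedparser.parse(feed_url).entries:
        item = ET.SubElement(day, 'item')
        item.text = entry.title
ET.ElementTree(texts).write('blind-text-sources-X.xml')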
2DO:
- check that soundfiles are not too long (a length-check sketch follows this list)
- create a list of the URLs from which each file is downloaded
- process the files from one topic into a single file: ecasound
- create an HTML5 player to play the soundfiles
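A minimal sketch of the length check from the first item, assuming the mutagen library is installed; the directory name and the 60-second limit are hypothetical placeholders, not values from the project:

# hypothetical length check for downloaded ogg files, using mutagen
import os
from mutagen.oggvorbis import OggVorbis

archive_dir = 'sf-archive'   # placeholder directory of downloaded files
MAX_SECONDS = 60             # placeholder upper bound

for name in sorted(os.listdir(archive_dir)):
    if name.lower().endswith('.ogg'):
        length = OggVorbis(os.path.join(archive_dir, name)).info.length
        if length > MAX_SECONDS:
            print '%s is too long (%.1f s)' % (name, length)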
The process step-by-step
topics
   |
   v
{ topics pool }
   |
   v
[ Texts spider ]
   |
   v
{ Texts pool }
   |
   v
[ Sound Scraper ]
   |
   v
{ Sound sequences pool }
   |
   v
[ Player / Front End ]
1 - Text Search
gathering text sources on various topics
- from a pool of topics, one is chosen
- a spider searches for online texts on that topic (7: once per weekday; or once every day, which makes sense for news)
- different sources: stackoverflow / stackexchange / news feeds / weather reports / blogs
- the texts are stored in the texts-pool (xml?)
- Tech:
  - Spider / RSS-feed reader / API
  - XML text pool:
<texts>
<news>
<day1 date="2012....">
<item>blahh blahhh balllahhh</item>
<item>blooo bluuu baaoooo</item>
<item>cooo cuuu caaoooo</item>
</day1>
...
</news>
</texts>
[word-sound search]
^
|
[text feeding]
^
|
(text-sources.xml)
^
|
[Text Download]
¿ Scheduling of these scripts ?
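One possible answer, sketched as crontab entries; the hourly text collection matches the notes above, but the exact times and the sound-script name are assumptions:

# hypothetical crontab: collect texts hourly, collect sounds once a day
# (blind-collect-sound.py is an assumed name for the second script)
0 * * * *  python /home/acastro/public_html/blind-archive/blind-collect-text.py
30 3 * * * python /home/acastro/public_html/blind-archive/blind-collect-sound.py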
2 - Sound Search
finding sounds on archive.org that match the words from the text
- for each word of a text, a sound is downloaded from archive.org
  - words are fed one by one to archive.org's search engine, asking it for audio items tagged under that given term
  - the search is further limited by collection, so that the text/sound-sequences exhibit a more coherent identity and distinguish themselves from other topics (a query sketch follows this list)
TEST topics: [source:poem collection:music ..:??] [source:weather forecast collection:field-recordings??] [source:news collection:spoken word]
- sounds are downloaded and saved in server directories (date-topic/)
- sound directories from a day will be deleted after that day
- Frequency: the process takes place during the previous day, once per day
- Tech:
  - Python: text sequencing
  - Python: API queries + download - Done
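A sketch of such a collection-limited query, reusing the advancedsearch endpoint from the script further down; the term and collection name are hypothetical examples:

# sketch of a collection-limited search on archive.org; 'rain' and
# '78rpm' are hypothetical examples of a term and a collection
import urllib2, json

term = 'rain'
collection = '78rpm'
url = ('http://www.archive.org/advancedsearch.php?q=' + term +
       '+AND+mediatype:Audio+AND+collection:' + collection +
       '&rows=50&output=json')
result = json.load(urllib2.urlopen(url))
print result['response']['numFound']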
3 - Player/stream
the resulting soundfile sequences are played
- a sequence of soundfiles is created (1 topic = 1 sequence)
  - x-fade/cat files: http://eca.cx/ecasound/Documentation/examples.html
- each sequence-topic lasts as long as its duration
- the player then moves to the next sequence-topic
- if all sequences have been played, the player goes through them again and again (in a different order?) until the 24h of a day have been completed
- Tech (a minimal Liquidsoap sketch follows this list):
  - Liquidsoap
  - Icecast
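A minimal Liquidsoap sketch of this step, assuming Icecast runs on the host named at the top of the page; the playlist path and password are hypothetical placeholders:

# minimal Liquidsoap sketch: play one topic sequence and send it to Icecast
# (playlist path and password are placeholders)
radio = playlist("/home/acastro/public_html/blind-archive/topic.m3u")
output.icecast(%vorbis, host = "pzwart1.wdka.hro.nl", port = 8000,
               password = "hackme", mount = "liberte_egalite_beyonce",
               radio)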
4 - Front End
- Topic being played is displayed
- Links to sound sources in archive.org
- User can supply more topics
- User can comment on the sound??
- Tech:
- html (a player sketch follows this list)
- ...
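A minimal sketch of the HTML5 player, pointing at the stream mount given at the top of this page; the topic display, archive.org links, and comments are left out:

<!-- minimal HTML5 player for the stream; surrounding page elements
     (topic display, source links, comments) are omitted -->
<audio controls autoplay
       src="http://pzwart1.wdka.hro.nl:8000/liberte_egalite_beyonce">
  Your browser does not support the audio element.
</audio>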
Searching soundfiles per term 17/02/2012
Fetching soundfiles from archive.org based on search terms.
In order to do that I am making 2 API requests:
- 1 - searching for a given term within mediatype:Audio
  - getting the identifier of the first search occurrence, id_0
- 2 - requesting details on that identifier (id_0)
I use the 2nd (details) API query to look for the files the item contains.
From this list I get the first ogg (in case ogg files are present).
Downloading the soundfile: in archive.org, files are stored at http://www.archive.org/download/ + identifier + filename
12/02/2012 - Latest script
#!/usr/bin/python
import urllib2, urllib, json, re, shutil, datetime, os

# create directory where soundfiles will be saved
dt_obj = datetime.datetime.now()
date_str = dt_obj.strftime("%Y%m%d-%H%M%S")
archive_dir = 'sf-archive-' + date_str
os.makedirs(archive_dir)

sentence = "US President Obama unveils a $3.8 trillion budget, with plans to raise taxes on the wealthy"
sentence_list = re.findall(r"[\w']+|[.,!?;]", sentence)  # split the sentence into words and punctuation

search_list = []  # (unused)
info_list = []    # structure: [term, url, response, num of results]
download_urls = []

# search each term + mediatype:Audio on archive.org
for term in sentence_list:  # build info list [ [term, url, response, num of results], [...], [...] ]
    results = []
    results.append(term)
    if ('.' in term) or ('?' in term) or ('!' in term):
        results.append("fullstop")
        results.append("No Url")
        results.append("No Response")
        results.append("No Results")
        info_list.append(results)  # push the results list into the info list
        print 'stop: ' + term
    elif (',' in term) or (';' in term):
        results.append("comma")
        results.append("No Url")
        results.append("No Response")
        results.append("No Results")
        info_list.append(results)  # push the results list into the info list
        print 'comma: ' + term
    else:
        print 'word: ' + term
        url = 'http://www.archive.org/advancedsearch.php?q=' + term + '+AND+mediatype:Audio&rows=300&output=json'  # API query
        print 'url: ' + url
        results.append(url)
        search = urllib2.urlopen(url)
        search_result = json.load(search)
        response = search_result['response']
        num_results = response['numFound']
        results.append(response)
        results.append(num_results)
        info_list.append(results)  # push the results list into the info list

# go through the info list, checking whether each entry is a punctuation mark,
# whether there are more than 0 search results, and whether the item contains ogg files
for info in info_list:
    url = info[1]
    num_results = info[3]  # bug fix: take the result count from this entry, not from a leftover loop variable
    print info[0]
    print info[1]
    print info[3]
    print
    if 'comma' in url:
        download_urls.append('comma')
        print 'comma found'
    elif 'fullstop' in url:
        download_urls.append('fullstop')
    elif num_results < 1:
        print 'num_results is 0'
        print
        download_urls.append(0)
    else:
        done = False
        docs = info[2]['docs']  # at most 300 rows were requested
        for n in range(min(num_results, len(docs))):  # loop through the results looking for .ogg under the size limit
            identifier = docs[n]['identifier']
            print
            print identifier
            format = docs[n]['format']
            if "Ogg Vorbis" in format:
                # go to the details url
                details_url = 'http://www.archive.org/details/' + identifier + '&output=json'  # details on identifier, e.g. http://www.archive.org/details/electroacoustic_music&output=json
                print details_url
                try:
                    details_search = urllib2.urlopen(details_url)
                    details_result = json.load(details_search)
                    files = details_result['files'].keys()  # look at the files the item contains
                    for ogg in files:
                        if re.search(r'\.ogg$', ogg) or re.search(r'\.OGG$', ogg):  # if there are .ogg or .OGG files
                            print "ogg found"
                            print ogg
                            size = details_result['files'][ogg]['size']
                            print size
                            if int(size) > 1000000:  # check file size
                                print "file TOO large"
                            else:
                                print "RIGHT SIZE"
                                # file keys in the details JSON carry a leading '/', so no separator is needed here
                                audio_url = 'http://www.archive.org/download/' + identifier + ogg
                                download_urls.append(audio_url)
                                done = True
                                break
                except urllib2.HTTPError:
                    print '404: ' + details_url
            if done:
                break
        if not done:
            download_urls.append(0)  # keep download_urls aligned with sentence_list when no usable ogg is found

print download_urls

# silence and punctuation soundfiles - WILL LEAVE THEM OUT FOR NOW - since you don't have them on your machine
#silence = "silences/silence.ogg"
#comma = "silences/comma.ogg"
#fullstop = "silences/fullstop.ogg"

for i, url in enumerate(download_urls):  # download the files
    num = '%02d' % (i + 1)
    if url == 0:
        pass  # placeholder until the silence files exist
        #silence_file = archive_dir + "/" + num + 'silence.ogg'
        #shutil.copyfile(silence, silence_file)
    elif url == 'comma':
        pass
        #comma_file = archive_dir + "/" + num + 'comma.ogg'
        #shutil.copyfile(comma, comma_file)
    elif url == 'fullstop':
        pass
        #fullstop_file = archive_dir + "/" + num + 'fullstop.ogg'
        #shutil.copyfile(fullstop, fullstop_file)
    else:
        file_name = num + sentence_list[i] + '.ogg'
        print file_name + ' ' + url
        urllib.urlretrieve(url, archive_dir + "/" + file_name)