User:Manetta/scripts/python-translate-to-computer-phonemes
Latest revision as of 00:22, 27 March 2015
translating text into phonemes used by Sphinx
using the CMU dictionary file (cmu07a.dic) from the software package Sphinx (http://cmusphinx.sourceforge.net/)
available for download here: http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/
more information about it: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
and its README file: http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/README.weide
cmu07a.dic looks like:
abso AE B S OW
absolom AE B S AH L AH M
absolut AE B S AH L UW T
absolut's AE B S AH L UW T S
absolute AE B S AH L UW T
absolutely AE B S AH L UW T L IY
absoluteness AE B S AH L UW T N AH S
absolutes AE B S AH L UW T S
absolution AE B S AH L UW SH AH N
absolutism AE B S AH L UW T IH Z AH M
absolutist AE B S IH L UW T IH S T
absolve AH B Z AA L V
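Each line in the sample is a word followed by its phonemes, separated by spaces. A minimal sketch of reading such lines into a lookup table; the function name parse_dic_line and the two-line sample string are assumptions for illustration, not part of the original script:

```python
def parse_dic_line(line):
    """Split one dictionary line into (word, list of phonemes)."""
    fields = line.split()
    return fields[0], fields[1:]


# Two entries copied from the sample above, used instead of the real file.
sample = """abso AE B S OW
absolve AH B Z AA L V"""

entries = dict(parse_dic_line(line) for line in sample.splitlines())
print(entries["absolve"])
```

With the real file, the same function could be applied to every non-empty line of cmu07a.dic.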
this is its alphabet:
(from: speech.cs.cmu.edu)
Phoneme Example Translation
------- ------- -----------
AA      odd     AA D
AE      at      AE T
AH      hut     HH AH T
AO      ought   AO T
AW      cow     K AW
AY      hide    HH AY D
B       be      B IY
CH      cheese  CH IY Z
D       dee     D IY
DH      thee    DH IY
EH      Ed      EH D
ER      hurt    HH ER T
EY      ate     EY T
F       fee     F IY
G       green   G R IY N
HH      he      HH IY
IH      it      IH T
IY      eat     IY T
JH      gee     JH IY
K       key     K IY
L       lee     L IY
M       me      M IY
N       knee    N IY
NG      ping    P IH NG
OW      oat     OW T
OY      toy     T OY
P       pee     P IY
R       read    R IY D
S       sea     S IY
SH      she     SH IY
T       tea     T IY
TH      theta   TH EY T AH
UH      hood    HH UH D
UW      two     T UW
V       vee     V IY
W       we      W IY
Y       yield   Y IY L D
Z       zee     Z IY
ZH      seizure S IY ZH ER
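The alphabet above has 39 symbols, so it is easy to check whether a transcription only uses symbols from it. A small sketch, assuming the symbols are stored in a Python set; the names PHONEMES and unknown_symbols are made up for this example:

```python
# The 39 phoneme symbols listed in the table above.
PHONEMES = set("""AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG
OW OY P R S SH T TH UH UW V W Y Z ZH""".split())


def unknown_symbols(phonemes):
    """Return the symbols that are not part of the alphabet, in order."""
    return [p for p in phonemes if p not in PHONEMES]


print(len(PHONEMES))
print(unknown_symbols("TH EY T AH".split()))  # 'theta' from the table
print(unknown_symbols("K AO L X".split()))    # 'X' is not a phoneme symbol
```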
cmu07a.dic is a file in which "the pronunciation is encoded using a modified form of the Arpabet system, with the addition of stress marks on vowels of levels 0, 1, and 2." (released in 2008)
from: Wikipedia page on the CMU Pronouncing Dictionary (https://en.wikipedia.org/wiki/CMU_Pronouncing_Dictionary)
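The quote mentions stress marks of levels 0, 1 and 2 on vowels, while the sample lines above show plain symbols without digits. A hedged sketch of how a stress digit could be separated from a symbol, assuming the digit is written directly after the vowel (as in "AH0"); the stressed transcription used here is an illustrative input, not a line taken from cmu07a.dic, and the function names are assumptions:

```python
import re


def split_stress(symbol):
    """Split a symbol such as 'AH0' into ('AH', 0); no digit gives None."""
    m = re.fullmatch(r"([A-Z]+)([012])?", symbol)
    if m is None:
        raise ValueError("unexpected symbol: " + symbol)
    base, stress = m.groups()
    return base, (int(stress) if stress is not None else None)


def strip_stress(symbols):
    """Drop the stress digits and keep only the phoneme symbols."""
    return [split_stress(s)[0] for s in symbols]


# Illustrative stressed transcription (assumed input).
print(strip_stress("AE1 B S AH0 L UW2 T".split()))
```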
"Arpabet is a phonetic transcription code developed by Advanced Research Projects Agency (ARPA) as a part of their Speech Understanding Project (1971–1976). It represents each phoneme of General American English with a distinct sequence of ASCII characters."
from: Wikipedia page on Arpabet (https://en.wikipedia.org/wiki/Arpabet)
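# Look up every word of the first line of input.txt in cmu07a.dic and
# write the matching dictionary lines (word + phonemes) to output.txt.

# Load the dictionary once: the first field of each line is the word,
# the remaining fields are its phonemes.
pronunciations = {}
with open('cmu07a.dic', 'r') as dic:
    for line in dic:
        fields = line.split()
        if fields:
            # keep the first entry found for a word
            pronunciations.setdefault(fields[0], line.rstrip('\n'))

with open('input.txt', 'r') as x:
    searchlines = x.readlines()

# Only the first line of the input is translated. split() without an
# argument also drops the newline at the end of the line, and lower()
# matches the lowercase words of the dictionary.
search = searchlines[0].lower().split()

with open('output.txt', 'w') as txt:
    for searchitem in search:
        # Exact lookup, so a word that is missing from the dictionary
        # is reported instead of matching a longer entry.
        line = pronunciations.get(searchitem)
        if line is None:
            print('not in dictionary:', searchitem)
            continue
        print(line)
        txt.write(line + '\n')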
call K AO L me M IY echo EH K OW
my M AY wife W AY F is IH Z echo EH K OW
my M AY brother B R AH DH ER is IH Z echo EH K OW
echo EH K OW is IH Z my M AY mom M AA M
my M AY boss B AA S name N EY M is IH Z echo EH K OW
my M AY dad D AE D is IH Z echo EH K OW
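The lines above pair each word with its phonemes and keep several words on one line. A minimal sketch of one way to produce that shape for a whole line of text; the function name translate_line, the small in-memory dictionary and the choice to leave unknown words untranslated are assumptions for illustration:

```python
def translate_line(text, pronunciations):
    """Return the line with each known word followed by its phonemes."""
    parts = []
    for word in text.lower().split():
        phonemes = pronunciations.get(word)
        if phonemes is None:
            # Assumed behaviour: unknown words are kept without phonemes.
            parts.append(word)
        else:
            parts.append(word + " " + " ".join(phonemes))
    return " ".join(parts)


# Tiny stand-in for cmu07a.dic, using entries that appear in the output above.
pronunciations = {
    "call": ["K", "AO", "L"],
    "me": ["M", "IY"],
    "echo": ["EH", "K", "OW"],
}

print(translate_line("call me echo", pronunciations))
```

Calling translate_line once per line of input.txt would give one translated line per input line.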