User:Manetta/scripts/python-translate-to-computer-phonemes: Difference between revisions

From XPUB & Lens-Based wiki
Latest revision as of 00:22, 27 March 2015

translating text into phonemes used by Sphinx

this script uses the CMU dictionary file (cmu07a.dic) from the software package Sphinx (http://cmusphinx.sourceforge.net/)

available for download here: http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/
more information about it: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
and its README file: http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/README.weide


cmu07a.dic looks like:

abso	AE B S OW
absolom	AE B S AH L AH M
absolut	AE B S AH L UW T
absolut's	AE B S AH L UW T S
absolute	AE B S AH L UW T
absolutely	AE B S AH L UW T L IY
absoluteness	AE B S AH L UW T N AH S
absolutes	AE B S AH L UW T S
absolution	AE B S AH L UW SH AH N
absolutism	AE B S AH L UW T IH Z AH M
absolutist	AE B S IH L UW T IH S T
absolve	AH B Z AA L V
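
Each line of the excerpt above holds a word followed by its phonemes, separated by whitespace. A minimal sketch of how such a line could be split apart; `parse_entry` is a hypothetical helper name, not part of Sphinx, and the sample line is copied from the excerpt:

```python
def parse_entry(line):
    """Split one dictionary line into (word, list of phonemes)."""
    parts = line.split()
    if not parts:
        # blank line: nothing to parse
        return None
    # first field is the word, the remaining fields are phoneme symbols
    return parts[0], parts[1:]


sample = "absolutely\tAE B S AH L UW T L IY"
word, phonemes = parse_entry(sample)
print(word, phonemes)
```

Splitting on any whitespace means the sketch works whether the file separates word and phonemes with a tab or with spaces.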


this is its phoneme alphabet:
(from: http://www.speech.cs.cmu.edu/cgi-bin/cmudict)

       Phoneme  Example  Translation
       -------  -------  -----------
       AA       odd      AA D
       AE       at       AE T
       AH       hut      HH AH T
       AO       ought    AO T
       AW       cow      K AW
       AY       hide     HH AY D
       B        be       B IY
       CH       cheese   CH IY Z
       D        dee      D IY
       DH       thee     DH IY
       EH       Ed       EH D
       ER       hurt     HH ER T
       EY       ate      EY T
       F        fee      F IY
       G        green    G R IY N
       HH       he       HH IY
       IH       it       IH T
       IY       eat      IY T
       JH       gee      JH IY
       K        key      K IY
       L        lee      L IY
       M        me       M IY
       N        knee     N IY
       NG       ping     P IH NG
       OW       oat      OW T
       OY       toy      T OY
       P        pee      P IY
       R        read     R IY D
       S        sea      S IY
       SH       she      SH IY
       T        tea      T IY
       TH       theta    TH EY T AH
       UH       hood     HH UH D
       UW       two      T UW
       V        vee      V IY
       W        we       W IY
       Y        yield    Y IY L D
       Z        zee      Z IY
       ZH       seizure  S IY ZH ER
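
The 39 symbols in the table can be used to check whether a transcription only contains known phonemes. A small sketch under that assumption; `PHONEMES` and `unknown_symbols` are hypothetical names chosen for illustration:

```python
# the 39 phoneme symbols listed in the table above
PHONEMES = set("""
AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG
OW OY P R S SH T TH UH UW V W Y Z ZH
""".split())


def unknown_symbols(transcription):
    """Return the symbols of a transcription that are not in the table."""
    return [s for s in transcription.split() if s not in PHONEMES]


print(len(PHONEMES))
print(unknown_symbols("EH K OW"))
print(unknown_symbols("EH K XX"))
```

An empty list means every symbol of the transcription appears in the table.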


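cmu07a.dic (released in 2008) is a version of the CMU Pronouncing Dictionary, in which "the pronunciation is encoded using a modified form of the Arpabet system, 
with the addition of stress marks on vowels of levels 0, 1, and 2."
note that the entries shown above carry no stress digits.

from: Wikipedia page on the CMU Pronouncing Dictionary (https://en.wikipedia.org/wiki/CMU_Pronouncing_Dictionary)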
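
To compare a stress-marked transcription with the digit-free symbols in the excerpt of cmu07a.dic, the stress digits can be removed. A minimal sketch, assuming the stress level is written as a single digit 0, 1 or 2 at the end of a vowel symbol; `strip_stress` is a hypothetical helper and the stressed input is only an illustration:

```python
import re


def strip_stress(phonemes):
    """Remove a trailing stress digit (0, 1 or 2) from each phoneme symbol."""
    return [re.sub(r"[012]$", "", p) for p in phonemes]


# illustrative stress-marked transcription, not taken from cmu07a.dic
stressed = "EH1 K OW0".split()
print(strip_stress(stressed))
```

Consonants carry no digit, so they pass through unchanged.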

"Arpabet is a phonetic transcription code developed by Advanced Research Projects Agency (ARPA) 
as a part of their Speech Understanding Project (1971–1976). It represents each phoneme of
General American English with a distinct sequence of ASCII characters."

from: Wikipedia page on Arpabet (https://en.wikipedia.org/wiki/Arpabet)


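import re

# read the input text and split it into words
# (lowercased, because the dictionary entries are lowercase)
with open('input.txt', 'r') as x:
	search = x.read().lower().split()

# read the dictionary once instead of reopening it for every word
with open('cmu07a.dic', 'r') as dic:
	diclines = dic.readlines()

with open('output.txt', 'w') as txt:
	for searchitem in search:
		# match the whole word at the start of a line, followed by whitespace,
		# so that "is" does not also match longer words that start with "is"
		pattern = re.compile(re.escape(searchitem) + r'\s')
		for line in diclines:
			if pattern.match(line):
				print(line, end='')
				# write the entry, then stop searching for this word
				txt.write(line)
				break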
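
The same lookup can also be done without scanning the dictionary for every word, by loading the entries into a Python dict first. This is a sketch of an alternative, not the script used for the output on this page; `load_dictionary` and `translate` are hypothetical names, and the three sample lines stand in for the contents of cmu07a.dic:

```python
def load_dictionary(lines):
    """Build a word -> phoneme-string mapping from dictionary lines."""
    entries = {}
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) == 2:
            # keep the first pronunciation seen for each word
            entries.setdefault(parts[0], parts[1].strip())
    return entries


def translate(text, entries):
    """Return (word, phonemes) pairs; phonemes is None for unknown words."""
    return [(w, entries.get(w)) for w in text.lower().split()]


# stand-in for the lines of cmu07a.dic
dic_lines = ["echo\tEH K OW\n", "is\tIH Z\n", "my\tM AY\n"]
entries = load_dictionary(dic_lines)
for word, phonemes in translate("my echo is here", entries):
    print(word, phonemes)
```

Words that are missing from the dictionary come back with `None`, which makes them easy to spot instead of being silently skipped.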


call	K AO L
me	M IY
echo	EH K OW
my	M AY
wife	W AY F
is	IH Z
echo	EH K OW
my	M AY
brother	B R AH DH ER
is	IH Z
echo	EH K OW
echo	EH K OW
is	IH Z
my	M AY
mom	M AA M
my	M AY
boss	B AA S
name	N EY M
is	IH Z
echo	EH K OW
my	M AY
dad	D AE D
is	IH Z
echo	EH K OW


Mb-echo-semantic-simulations-01-page005.png