Latest revision as of 00:22, 27 March 2015

translating text into phonemes used by Sphinx

using the CMU dictionary file from the software package Sphinx (cmu07a.dic)

for download here: http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/
more information about it: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
and its README file: http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/README.weide

cmu07a.dic looks like:

abso	AE B S OW
absolom	AE B S AH L AH M
absolut	AE B S AH L UW T
absolut's	AE B S AH L UW T S
absolute	AE B S AH L UW T
absolutely	AE B S AH L UW T L IY
absoluteness	AE B S AH L UW T N AH S
absolutes	AE B S AH L UW T S
absolution	AE B S AH L UW SH AH N
absolutism	AE B S AH L UW T IH Z AH M
absolutist	AE B S IH L UW T IH S T
absolve	AH B Z AA L V
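
each line holds a word, then whitespace (a tab in the sample above), then the phonemes separated by single spaces. a minimal sketch of splitting one entry, assuming this layout holds for the whole file; parse_entry is a hypothetical helper name, not part of Sphinx:

```python
def parse_entry(line):
    """Split one dictionary line into (word, list of phonemes)."""
    # hypothetical helper; assumes "word<whitespace>PH PH PH" as in the sample
    word, phonemes = line.strip().split(None, 1)
    return word, phonemes.split()


word, phonemes = parse_entry("absolve\tAH B Z AA L V\n")
print(word, phonemes)
```

splitting only on the first run of whitespace keeps the phoneme string intact, so it can be split into symbols in a second step.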

this is its alphabet:
(from: speech.cs.cmu.edu)

       Phoneme Example Translation
       ------- ------- -----------
       AA      odd     AA D
       AE      at      AE T
       AH      hut     HH AH T
       AO      ought   AO T
       AW      cow     K AW
       AY      hide    HH AY D
       B       be      B IY
       CH      cheese  CH IY Z
       D       dee     D IY
       DH      thee    DH IY
       EH      Ed      EH D
       ER      hurt    HH ER T
       EY      ate     EY T
       F       fee     F IY
       G       green   G R IY N
       HH      he      HH IY
       IH      it      IH T
       IY      eat     IY T
       JH      gee     JH IY
       K       key     K IY
       L       lee     L IY
       M       me      M IY
       N       knee    N IY
       NG      ping    P IH NG
       OW      oat     OW T
       OY      toy     T OY
       P       pee     P IY
       R       read    R IY D
       S       sea     S IY
       SH      she     SH IY
       T       tea     T IY
       TH      theta   TH EY T AH
       UH      hood    HH UH D
       UW      two     T UW
       V       vee     V IY
       W       we      W IY
       Y       yield   Y IY L D
       Z       zee     Z IY
       ZH      seizure S IY ZH ER
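
the table lists 39 symbols. a small sketch, only for illustration, that checks whether a transcription uses nothing but these symbols; PHONEMES and unknown_symbols are hypothetical names, and the check assumes the transcription carries no stress digits:

```python
# the 39 phoneme symbols from the table above
PHONEMES = set("""AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG
OW OY P R S SH T TH UH UW V W Y Z ZH""".split())


def unknown_symbols(transcription):
    """Return the symbols in a transcription that are not in the alphabet."""
    # hypothetical helper; symbols are expected to be separated by spaces
    return [p for p in transcription.split() if p not in PHONEMES]


print(unknown_symbols("HH AH T"))
print(unknown_symbols("HH XX T"))
```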

cmu07a.dic is a file in which "the pronunciation is encoded using a modified form of the Arpabet system, 
with the addition of stress marks on vowels of levels 0, 1, and 2." (released in 2008)

from: the Wikipedia page on the CMU Pronouncing Dictionary
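
the sample lines from cmu07a.dic shown earlier carry no stress digits, while the quote describes them as digits attached to the vowels (AH0, UW1, ...). a small sketch for removing such digits so both forms can be compared; strip_stress is a hypothetical helper, and the digits in the example line are illustrative, not copied from the dictionary:

```python
import re


def strip_stress(transcription):
    """Remove the stress digits 0, 1, 2 from the end of each phoneme."""
    # hypothetical helper; assumes digits only ever appear as a stress suffix
    return re.sub(r'(?<=[A-Z])[012]\b', '', transcription)


print(strip_stress("AE1 B S AH0 L UW2 T"))
```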

"Arpabet is a phonetic transcription code developed by Advanced Research Projects Agency (ARPA) 
as a part of their Speech Understanding Project (1971–1976). It represents each phoneme of 
General American English with a distinct sequence of ASCII characters."

from: the Wikipedia page on Arpabet

import re

# read the first line of input.txt and split it into words
with open('input.txt', 'r') as x:
	search = x.readline().split()

with open('output.txt', 'w') as txt:
	for searchitem in search:
		# match the whole word up to the tab, not just the start of a longer word
		pattern = re.compile(re.escape(searchitem.lower()) + r'\s')
		with open('cmu07a.dic', 'r') as dic:
			for line in dic:
				if pattern.match(line):
					print(line.rstrip('\n'))
					# write before break, otherwise nothing reaches output.txt
					txt.write(line)
					break
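
the script above opens and scans the dictionary again for every word. a possible alternative, sketched under the assumption that the whole dictionary fits in memory, reads it once into a Python dict and looks each word up directly. load_dictionary and translate are hypothetical names, and three entries stand in for the real file:

```python
def load_dictionary(lines):
    """Build a word -> phoneme-string mapping from dictionary lines."""
    table = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) == 2:
            # keep the first pronunciation if a word appears more than once
            table.setdefault(parts[0], parts[1])
    return table


def translate(text, table):
    """Return one 'word<TAB>phonemes' line per word found in the table."""
    out = []
    for word in text.lower().split():
        if word in table:
            out.append(word + "\t" + table[word])
    return out


# a few entries stand in for the real file; with the real dictionary this
# would be: with open('cmu07a.dic') as dic: table = load_dictionary(dic)
sample = ["call\tK AO L\n", "me\tM IY\n", "echo\tEH K OW\n"]
table = load_dictionary(sample)
print("\n".join(translate("call me echo", table)))
```

words missing from the dictionary are skipped silently here; reporting them instead would be a small change in translate.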

call	K AO L
me	M IY
echo	EH K OW

my	M AY
wife	W AY F
is	IH Z
echo	EH K OW

my	M AY
brother	B R AH DH ER
is	IH Z
echo	EH K OW

echo	EH K OW
is	IH Z
my	M AY
mom	M AA M

my	M AY
boss	B AA S
name	N EY M
is	IH Z
echo	EH K OW

my	M AY
dad	D AE D
is	IH Z
echo	EH K OW