Steps/Guidelines
for making Bangla TTS
Here is a quick outline
of what you need to address while developing
a voice for Bangla. Ultimately a voice in Festival will
consist of a diphone database, a lexicon (and LTS rules)
and a number of Scheme files that together make up the complete
voice. By convention a voice name consists of three parts. First
comes an institution name (like aiub, cmu, cstr, etc.); if you don’t
have an institution just use net. Second you need to identify
the language. There is an ISO two-letter standard for
this, but it fails to distinguish dialects (such as BD, US and
UK English), so it need not be strictly followed. However
a short identifier for the language is probably preferred.
Third you identify the speaker. We have typically used
three-letter initials, which are the initials of the
speaker, but any name is reasonable.
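As a small illustration of the naming convention, the three parts can be
joined with underscores to give the name used for the voice directory in
this guide. The variable names below are hypothetical and only serve the
example; the values and the _diphone suffix are the ones that appear in
the commands in this guide.

```shell
# Hypothetical variable names; the values are the ones used in this guide.
INSTITUTION=aiub   # use "net" if you have no institution
LANGUAGE_ID=bd     # short language identifier
SPEAKER=iar        # speaker initials

# The voice directory and voice function in this guide end in "_diphone".
VOICE_NAME="${INSTITUTION}_${LANGUAGE_ID}_${SPEAKER}_diphone"
echo "$VOICE_NAME"
```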
The basic processes that need to be addressed are:
• Construct basic template files (bd_schema.scm
and others)
• Generate phoneset definition (aiub_bd_phones.scm)
• Generate diphone schema file (bddiph.list)
• Generate prompts (using the KAL voice, cross-language
generation using English)
• Record speaker
• Label nonsense words
• Extract pitchmarks and LPC coefficients
• Test phone synthesis
• Add lexicon/LTS support
• Add tokenization
• Add prosody (phrasing, durations and intonation)
• Test and evaluate voice
• Package for distribution
You can either use the
schema files created by us or create your own files
if you feel that will give better output. The schema
files are included in the source distribution available for download.
As with all parts of
festvox, we must set the following environment variables
to point to where we have installed the Edinburgh
Speech Tools and the festvox distribution.
export ESTDIR=/home/projects/1.4.1/speech_tools
export FESTVOXDIR=/home/projects/festvox
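Before going further it can help to confirm that both variables point at
real directories. The following is a minimal sketch; check_festvox_env is
a hypothetical helper name and is not part of festvox or the Edinburgh
Speech Tools.

```shell
# Sketch: verify the two variables the festvox scripts rely on.
# check_festvox_env is a hypothetical helper, not a festvox command.
check_festvox_env() {
    status=0
    if [ -z "${ESTDIR:-}" ] || [ ! -d "$ESTDIR" ]; then
        echo "ESTDIR is unset or not a directory" >&2
        status=1
    fi
    if [ -z "${FESTVOXDIR:-}" ] || [ ! -d "$FESTVOXDIR" ]; then
        echo "FESTVOXDIR is unset or not a directory" >&2
        status=1
    fi
    return $status
}
```

Calling check_festvox_env after the two export lines returns a non-zero
status and prints a message if either path is missing.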
For making the Bangla voice based on
aiub_iar, first create a directory to hold the voice.
mkdir -p ~/data/aiub_bd_iar_diphone
cd ~/data/aiub_bd_iar_diphone
We will need in the region of 500 MB to 1000 MB
of disk space to build a voice. Construct the basic directory
structure and skeleton files with the command
$FESTVOXDIR/src/diphones/setup_diphone aiub bd iar
Now we can generate the diphone schema
list.
$ ./festival festvox/diphlist.scm festvox/bd_schema.scm
festival> (diphone-gen-schema "bd" "etc/bddiph.list")
The schema file has the following format:
( bd_0001 ("k-a" "a-k") (# t aa k a k aa #) )
( bd_0002 ("kh-a" "a-kh") (# t aa kh a kh aa #) )
( bd_0003 ("g-a" "a-g") (# t aa g a g aa #) )
( bd_0004 ("gh-a" "a-gh") (# t aa gh a gh aa #) )
( bd_0005 ("n:-a" "a-n:") (# t aa n: a n: aa #) )
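Each entry starts with a prompt identifier (bd_0001, bd_0002, ...), so a
quick sanity check on the generated list is to pull out those identifiers
and count them. This is only a sketch: list_prompt_ids is a hypothetical
helper, and it assumes every entry begins on its own line with "( bd_NNNN"
as in the sample above.

```shell
# list_prompt_ids is a hypothetical helper, not a festvox script.
# Assumes each entry begins on its own line with "( bd_NNNN".
list_prompt_ids() {
    sed -n 's/^[[:space:]]*([[:space:]]*\(bd_[0-9][0-9]*\)[[:space:]].*/\1/p' "$1"
}
```

For example, list_prompt_ids etc/bddiph.list | wc -l gives the number of
nonsense words that will have to be recorded.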
Next we can generate the prompts and
their label files with the following command.
$ ./festival festvox/diphlist.scm festvox/bd_schema.scm
festival> (diphone-gen-waves "prompt-wav" "prompt-lab" "etc/bddiph.list")
The next stage is to record the prompts.
$ bin/prompt_them etc/bddiph.list
The recorded prompts can then be labeled
by
$ bin/make_labs prompt-wav/*.wav
And the diphone index may be built by
$ bin/make_diph_index etc/bddiph.list dic/bddiph.est
If no EGG signal has been collected
you can extract the pitchmarks by
$ bin/make_pm_wave wav/*.wav
A program to move the predicted pitchmarks
to the nearest peak in the waveform is also provided.
This is almost always a good idea, even for EGG extracted
pitch marks.
$ bin/make_pm_fix pm/*.pm
Getting good pitchmarks is important
to the quality of the synthesis.
Because there is often a power mismatch
across a set of diphones, we provide a simple method
for finding what general power differences exist between
files. This finds the mean power for each vowel in each
file and calculates a factor with respect to the overall
mean vowel power. A table of power modifiers for each
file can be calculated by
$ bin/find_powerfactors lab/*.lab
The factors calculated by this are saved
in etc/powfacts.
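To make the idea of a power factor concrete, here is a small worked
sketch. It does not reproduce bin/find_powerfactors: the helper name
power_factors, the input format (one "file mean-vowel-power" pair per
line) and the formula (overall mean divided by the file's mean) are all
assumptions made for illustration.

```shell
# power_factors is a hypothetical helper, not the festvox script.
# Input lines: "<file> <mean vowel power>".
# Assumed formula: factor = overall mean power / this file's mean power.
power_factors() {
    awk '
        $2 > 0 { n++; name[n] = $1; power[n] = $2; total += $2 }
        END {
            if (n == 0) exit
            overall = total / n
            for (i = 1; i <= n; i++)
                printf "%s %.3f\n", name[i], overall / power[i]
        }
    '
}
```

Under these assumptions a file that is quieter than average gets a factor
above 1 and a louder file gets a factor below 1, so scaling by the factor
brings every file toward the same overall vowel power.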
Then build the pitch-synchronous LPC
coefficients, which use the power factors if they’ve
been calculated.
$ bin/make_lpc wav/*.wav
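The data-preparation commands from labeling through LPC extraction can be
collected into one script so they are always run in the same order. The
sketch below only wraps the commands already shown; run_stage,
build_voice_data and DRY_RUN are hypothetical names introduced for the
example. By default it just prints what it would run.

```shell
# Sketch only: run_stage, build_voice_data and DRY_RUN are hypothetical
# names for illustration, not part of festvox.
DRY_RUN="${DRY_RUN:-1}"

run_stage() {
    # Always print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

build_voice_data() {
    # Same commands and order as in the text; stop at the first failure.
    run_stage bin/make_labs prompt-wav/*.wav &&
    run_stage bin/make_diph_index etc/bddiph.list dic/bddiph.est &&
    run_stage bin/make_pm_wave wav/*.wav &&
    run_stage bin/make_pm_fix pm/*.pm &&
    run_stage bin/find_powerfactors lab/*.lab &&
    run_stage bin/make_lpc wav/*.wav
}
```

Run build_voice_data from the voice directory; set DRY_RUN=0 once the
printed commands look right.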
This should get you to the stage where
you can test the basic waveform synthesizer. There is
still much to do but initial tests (and correction of
labeling errors etc) can start now. Start festival as
festival festvox/aiub_bd_iar_diphone.scm "(voice_aiub_bd_iar_diphone)"
and then enter a string of phones
festival> (SayPhones '(# AmI bhAlO AcaI))
We write a set of letter-to-sound rules
by hand that expand words into their phones. These are
added to festvox/aiub_bd_lex.scm.
For the time being we just use the default intonation
model, though simple rule-driven improvements are possible.
Now we have a basic synthesizer. Although
there is still much to do, we can now type (romanized) text
to it.
festival festvox/aiub_bd_iar_diphone.scm "(voice_aiub_bd_iar_diphone)"
The next part is to test and improve
these various initial subsystems (lexicons, text analysis,
prosody) and correct waveform synthesis problems. This
is an endless task, but you should spend significantly
more time on it than we have done for this example.
Once you are happy with the completed voice you can
package it for distribution. The first stage is to generate
a group file for the diphone database. This extracts
the subparts of the nonsense words and puts them into
a single file offering something smaller and quicker
to access. The group file can be built as follows.
festival festvox/aiub_bd_iar_diphone.scm "(voice_aiub_bd_iar_diphone)"
...
festival> (us_make_group_file "group/iarlpc.group" nil)