Kathak::BTTS

Here is a quick outline of what you need to address while developing a voice for Bangla. Ultimately, a voice in Festival consists of a diphone database, a lexicon (and LTS rules) and a number of Scheme files that together make up the complete voice. By convention, a voice name has three parts. First comes an institution name (like aiub, cmu, cstr, etc.); if you don’t have an institution, just use net. Second, you need to identify the language. There is an ISO two-letter standard for this, but it fails to distinguish dialects (such as BD, US and UK English), so it need not be strictly followed. However, a short identifier for the language is probably preferred. Third, you identify the speaker. We have typically used the speaker's three-letter initials, but any name is reasonable.
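As a small illustration of the naming convention, the sketch below joins the three parts into the name used for this voice. The helper voice_name is hypothetical (it is not part of festvox), and the _diphone suffix is taken from the directory name used later on this page.

```shell
# voice_name: hypothetical helper, not part of festvox.
# Usage: voice_name INSTITUTION LANGUAGE SPEAKER
voice_name() {
    inst=${1:-net}    # fall back to "net" when no institution is given
    lang=$2
    speaker=$3
    printf '%s_%s_%s_diphone\n' "$inst" "$lang" "$speaker"
}

voice_name aiub bd iar    # prints aiub_bd_iar_diphone
```

Passing an empty first argument, as in voice_name "" bd iar, gives net_bd_iar_diphone.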

The basic steps that need to be addressed are:

• Construct basic template files (bd_schema.scm and others)
• Generate phoneset definition (aiub_bd_phones.scm)
• Generate diphone schema file (bddiph.list)
• Generate prompts (using the KAL voice; cross-language generation using English)
• Record speaker
• Label nonsense words
• Extract pitchmarks and LPC coefficients
• Test phone synthesis
• Add lexicon/LTS support
• Add tokenization
• Add prosody (phrasing, durations and intonation)
• Test and evaluate voice
• Package for distribution

You can either use the schema files created by us or create your own if you feel that will give better output. Our schema files are included in the source distribution, which is available for download.

As with all parts of festvox, you must set the following environment variables to point to where you have installed the Edinburgh Speech Tools and the festvox distribution.

export ESTDIR=/home/projects/1.4.1/speech_tools
export FESTVOXDIR=/home/projects/festvox
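Before going further it can help to confirm that both variables are set and point at real directories. The following is a minimal sketch; check_dir_var and check_festvox_env are hypothetical helpers, not part of festvox or the Speech Tools.

```shell
# check_dir_var: hypothetical helper. Usage: check_dir_var NAME VALUE
# Succeeds only if VALUE is non-empty and names an existing directory.
check_dir_var() {
    name=$1
    value=$2
    if [ -z "$value" ]; then
        echo "$name is not set" >&2
        return 1
    fi
    if [ ! -d "$value" ]; then
        echo "$name=$value is not a directory" >&2
        return 1
    fi
    return 0
}

# check_festvox_env: hypothetical helper that checks both variables
# and reports every problem before returning.
check_festvox_env() {
    status=0
    check_dir_var ESTDIR "${ESTDIR:-}" || status=1
    check_dir_var FESTVOXDIR "${FESTVOXDIR:-}" || status=1
    return "$status"
}

check_festvox_env || echo "set ESTDIR and FESTVOXDIR before continuing" >&2
```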

To build the Bangla voice for institution aiub and speaker iar, first create a directory to hold the voice.

mkdir ~/data/aiub_bd_iar_diphone
cd ~/data/aiub_bd_iar_diphone

We will need in the region of 500 MB to 1000 MB of disk space to build a voice. Construct the basic directory structure and skeleton files with the command

$FESTVOXDIR/src/diphones/setup_diphone aiub bd iar

Now we can generate the diphone schema list.

$ ./festival festvox/diphlist.scm festvox/bd_schema.scm
festival> (diphone-gen-schema "bd" "etc/bddiph.list")

The schema file has the following format

( bd_0001 ("k-a" "a-k") (# t aa k a k aa #) )
( bd_0002 ("kh-a" "a-kh") (# t aa kh a kh aa #) )
( bd_0003 ("g-a" "a-g") (# t aa g a g aa #) )
( bd_0004 ("gh-a" "a-gh") (# t aa gh a gh aa #) )
( bd_0005 ("n:-a" "a-n:") (# t aa n: a n: aa #) )
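Each entry holds a file ID, the diphones to be extracted from that prompt, and the phone string of the nonsense word. As a sketch of how this format can be read, the hypothetical helper below (not part of festvox) prints one "file-ID diphone" pair per line. It assumes one entry per line, laid out exactly as shown above, with a space after the opening parenthesis.

```shell
# list_diphones: hypothetical helper, not part of festvox.
# Reads schema entries from the named files (or standard input) and
# prints "<file id> <diphone>" for every quoted diphone in each entry.
list_diphones() {
    awk '{
        id = $2                        # second field is the file ID
        n = split($0, parts, "\"")     # quoted strings land at even indexes
        for (i = 2; i <= n; i += 2)
            print id, parts[i]
    }' "$@"
}
```

Running list_diphones etc/bddiph.list on the first entry above would print bd_0001 k-a and bd_0001 a-k on separate lines.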

Next we can generate the prompts and their label files with the following command.

$ ./festival festvox/diphlist.scm festvox/bd_schema.scm
festival> (diphone-gen-waves "prompt-wav" "prompt-lab" "etc/bddiph.list")

The next stage is to record the prompts.

$ bin/prompt_them etc/bddiph.list
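After a recording session it is worth checking that no prompt was skipped. The sketch below lists the file IDs from the schema list that have no matching recording. missing_recordings is a hypothetical helper, not part of festvox. It assumes that recordings are stored as wav/<file-ID>.wav (the directory used by the later commands on this page) and that the list has one entry per line in the format shown earlier.

```shell
# missing_recordings: hypothetical helper, not part of festvox.
# Usage: missing_recordings LISTFILE [WAVDIR]
# Prints the file ID of every schema entry without a WAVDIR/<id>.wav file.
missing_recordings() {
    list=$1
    wavdir=${2:-wav}
    # Entries look like "( bd_0001 (...) (...) )"; the ID is the second field.
    awk '$1 == "(" && $2 != "" { print $2 }' "$list" |
    while read -r id; do
        [ -f "$wavdir/$id.wav" ] || echo "$id"
    done
}
```

For example, missing_recordings etc/bddiph.list prints nothing when every prompt has been recorded.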

The recorded prompts can then be labeled by

$ bin/make_labs prompt-wav/*.wav

And the diphone index may be built by

$ bin/make_diph_index etc/bddiph.list dic/bddiph.est

If no EGG signal has been collected you can extract the pitchmarks by

$ bin/make_pm_wave wav/*.wav

A program to move the predicted pitchmarks to the nearest peak in the waveform is also provided. This is almost always a good idea, even for EGG extracted pitch marks.

$ bin/make_pm_fix pm/*.pm

Getting good pitchmarks is important to the quality of the synthesis.

Because there is often a power mismatch across a set of diphones, we provide a simple method for finding the general power differences that exist between files. It finds the mean power of each vowel in each file and calculates a factor relative to the overall mean vowel power. A table of power modifiers for each file can be calculated by

$ bin/find_powerfactors lab/*.lab

The factors calculated by this are saved in etc/powfacts.
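To make the arithmetic concrete, here is a rough sketch of this kind of normalization. It is not the code used by find_powerfactors. The helper power_factors is hypothetical, the input format (one "file-ID mean-vowel-power" pair per line) is invented for illustration, and the factor is assumed to be the overall mean divided by the file's own mean. The real script may define the factor differently.

```shell
# power_factors: hypothetical helper, not the festvox implementation.
# Input lines: "<file id> <mean vowel power>" (format assumed for this sketch).
# Output lines: "<file id> <factor>", factor = overall mean / file mean.
power_factors() {
    awk '$2 > 0 {                      # skip files with no usable power value
             n++
             name[n] = $1
             power[n] = $2
             total += $2
         }
         END {
             if (n == 0) exit
             mean = total / n
             for (i = 1; i <= n; i++)
                 printf "%s %.3f\n", name[i], mean / power[i]
         }' "$@"
}
```

With two files of mean power 100 and 400, the overall mean is 250, so the quieter file gets a factor of 2.500 and the louder one 0.625.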

Then build the pitch-synchronous LPC coefficients, which use the power factors if they’ve been calculated.

$ bin/make_lpc wav/*.wav

This should get you to the stage where you can test the basic waveform synthesizer. There is still much to do but initial tests (and correction of labeling errors etc) can start now. Start festival as

festival festvox/aiub_bd_iar_diphone.scm "(voice_aiub_bd_iar_diphone)"

and then enter a string of phones

festival> (SayPhones '(# AmI bhAlO AcaI))

We write a set of letter-to-sound rules by hand that expand words into their phones. These are added to festvox/aiub_bd_lex.scm.
For the time being we just use the default intonation model, though simple rule-driven improvements are possible.

We now have a basic synthesizer. Although there is still much to do, we can now type (romanized) text to it.

festival festvox/aiub_bd_iar_diphone.scm "(voice_aiub_bd_iar_diphone)"

The next part is to test and improve these various initial subsystems (lexicon, text analysis, prosody) and to correct waveform synthesis problems. This is an endless task, but you should spend significantly more time on it than we have done for this example. Once you are happy with the completed voice you can package it for distribution. The first stage is to generate a group file for the diphone database. This extracts the subparts of the nonsense words and puts them into a single file, which is smaller and quicker to access. The group file can be built as follows.

festival festvox/aiub_bd_iar_diphone.scm "(voice_aiub_bd_iar_diphone)"
...
festival> (us_make_group_file "group/iarlpc.group" nil)

Page maintained by Imrul Amir Rahat.amir.rahat@gmail.com