ECE497 Project Voice Dialer
Members: Dan Bennett, David Bliss, Will Gerth, and Lei Liu
Concept: Google Voice based voice dialer using TI embedded speech recognition.
Timeline: TBD
Goal: To complete and connect a voice-dialed call from the BeagleBoard via a phone device of the user's choosing.
Contents
- 1 Executive Summary
- 2 Installation Instructions
- 3 User Instructions
- 4 Highlights
- 5 Theory of Operation
- 5.1 TIesr
- 5.1.1 Step 1: Data Preparation
- 5.1.2 Step 2: Making the Letter File
- 5.1.3 Step 3: Building the Compressed Binary Dictionary Files
- 5.1.4 Step 4: Building the Acoustic Model Data Files
- 5.1.5 Step 5: Creating the Hierarchical Linear Regression cluster tree file
- 5.1.6 Step 6: Creating the Gaussian cluster files
- 5.1.7 Step 7: Testing the data files
- 6 Work Breakdown
- 7 Conclusions
Executive Summary
The voice dialer project aims to complete and connect a voice-dialed call from the BeagleBoard via a phone device of the user's choosing. TIesr is used to build a Hidden Markov Model (HMM) for the voice recognition.
TIesr is working and returns a voice recognition result from audio input. The Google Voice dialer is also complete, so it can be used to place a call from a Google Voice account to any valid phone number.
Give two sentences telling what isn't working.
Overall, our team has reached its goal of making a voice-controlled dialer. Although the TIesr HMM does not work perfectly because of the small amount of training data, we have finished building the full software structure and shown that it works on the BeagleBoard.
Installation Instructions
Give step by step instructions on how to install your project on the SPEd2 image.
- Include your github path as a link like this: https://github.com/MarkAYoder/gitLearn.
- Include any additional packages installed via opkg.
- Include kernel mods.
- If there is extra hardware needed, include links to where it can be obtained.
User Instructions
Once everything is installed, how do you use the program? Give details here, so if you have a long user manual, link to it here.
Highlights
A speaker-independent speech recognition algorithm recognizes spoken phone numbers, and the Google Voice dialer then places the call.
Theory of Operation
This project is divided into two parts, the dialer and the recognizer. The recognizer is written in C, and acts as the main driver for the application. The dialer is a utility script written in Python that dials a phone number.
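The hand-off between the two parts can be sketched as a small conversion from recognized digit words to a dialable number. This is only an illustrative sketch: the name `words_to_number` and the assumption that the recognizer reports its result as space-separated digit words are hypothetical and not taken from the project's code.

```python
# Hypothetical helper: turn a recognition result such as "five five five"
# into the digit string a dialer script could be given.
# The word list matches the ten-digit vocabulary used to train TIesr.
DIGIT_VALUES = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def words_to_number(recognized):
    """Convert space-separated digit words to a string of digits."""
    digits = []
    for word in recognized.lower().split():
        if word not in DIGIT_VALUES:
            # Refuse to dial if anything outside the vocabulary appears.
            raise ValueError("not a digit word: %r" % word)
        digits.append(DIGIT_VALUES[word])
    return "".join(digits)
```

Rejecting out-of-vocabulary words, rather than skipping them, keeps a misrecognition from silently producing a shorter and wrong phone number.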
TIesr
In our project, we used the TI Embedded Speech Recognizer (TIesr) for speaker-independent recognition. TIesr is targeted toward embedded platforms where computation and memory storage efficiency are important. It uses Hidden Markov Model (HMM) technology to model the acoustic signals found in speech.
For TIesr to perform well, the model must be built and trained before use. Building the HMM requires some additional software: the Hidden Markov Model Toolkit (HTK), which may be obtained from http://htk.eng.cam.ac.uk/, and the Perl modules Math::FFT and Algorithm::Cluster from CPAN.
Since our goal, recognizing the ten digits, is a relatively simple task for TIesr, we do not use the pronunciation decision tree files. Below are the steps we used to train the TIesr model.
Step 1: Data Preparation
Prepare text files for ten digits in alphabetical order.
eight,
five,
four,
nine,
one,
seven,
six,
three,
two,
zero.
Step 2: Making the Letter File
Instead of creating pronunciation decision trees, for small vocabularies the only file necessary is one that contains a sorted list of all characters making up words in the dictionary. This must be put in a file named "cAttValue.txt", and we put it in Data/Lang/cAttValue.txt. Each character should be a single byte.
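As a rough sketch of Steps 1 and 2, the letter list can be derived directly from the digit vocabulary. The function names are hypothetical, and the one-character-per-line layout is an assumption; check the TIesr documentation for the exact layout that cAttValue.txt requires.

```python
# The ten-digit vocabulary from Step 1, in alphabetical order.
DIGIT_WORDS = ["eight", "five", "four", "nine", "one",
               "seven", "six", "three", "two", "zero"]


def letter_list(words):
    """Return the sorted list of distinct characters used by the words."""
    return sorted(set("".join(words)))


def write_letter_file(words, path):
    """Write the letters to `path`, one per line (assumed layout).

    Opening the file as ASCII raises an error for any character that
    would not fit in a single byte.
    """
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        for letter in letter_list(words):
            handle.write(letter + "\n")
```

For example, `write_letter_file(DIGIT_WORDS, "Data/Lang/cAttValue.txt")` would write the fifteen letters that appear in the ten digit words.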
Step 3: Building the Compressed Binary Dictionary Files
The dictionary file must be converted into binary form for the subsequent processing steps, since the TIesr tools use a binary dictionary. We used the HTK HDMan tool to generate the binary file "dict.bin" from "phone.lis".
Step 4: Building the Acoustic Model Data Files
First, we recorded 50 speech clips, 5 for each digit. They are sampled at 8 kHz and stored as 16-bit, LSB-first (little-endian) PCM.
Then we used "sample_to_htk.pl", provided with TIesr, to convert the .raw audio files to .htk format, which can be used for building the HMM.
After that, we carefully labelled the time segments of each audio file, marking where each word starts and ends.
Next, we used the .htk files and the segment information to train the HMM for four iterations. The appropriate number of iterations can only be determined by experiment.
Finally, the trained HTK data is converted to TIesr-compatible acoustic data files.
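The recording format and the segment labelling in this step can be illustrated with a short sketch. The function names are hypothetical, and this is not the project's conversion script ("sample_to_htk.pl" does the real conversion). The label layout shown is the standard HTK one, with start and end times given in units of 100 ns; whether our label files used exactly this layout is an assumption.

```python
import struct

SAMPLE_RATE_HZ = 8000              # clips are sampled at 8 kHz
BYTES_PER_SAMPLE = 2               # 16-bit samples
HTK_UNITS_PER_SECOND = 10000000    # HTK label times are in 100 ns units


def decode_pcm16le(raw):
    """Decode headerless 16-bit LSB-first PCM bytes into signed integers."""
    if len(raw) % BYTES_PER_SAMPLE != 0:
        raise ValueError("raw PCM length must be a multiple of 2 bytes")
    count = len(raw) // BYTES_PER_SAMPLE
    # "<" selects little-endian (LSB first); "h" is a signed 16-bit value.
    return list(struct.unpack("<%dh" % count, raw))


def duration_seconds(raw):
    """Length of a raw clip in seconds at the 8 kHz sample rate."""
    return (len(raw) // BYTES_PER_SAMPLE) / SAMPLE_RATE_HZ


def htk_label_line(start_s, end_s, word):
    """Format one segment as an HTK label line: 'start end word'."""
    if end_s <= start_s:
        raise ValueError("segment must end after it starts")
    start = int(round(start_s * HTK_UNITS_PER_SECOND))
    end = int(round(end_s * HTK_UNITS_PER_SECOND))
    return "%d %d %s" % (start, end, word)
```

Checking `duration_seconds` against the labelled end time is a cheap way to catch a segment that runs past the end of its clip.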
Step 5: Creating the Hierarchical Linear Regression cluster tree file
In this step, we use the results of the word models to determine a linear regression tree for the HMMs.
Step 6: Creating the Gaussian cluster files
Gaussiancluster, which is also included with TIesr, is used to generate the Gaussian clustering information that TIesr needs.
Step 7: Testing the data files
In this step, we use testtiesrflex to generate all the final data that the TIesrSI module needs to recognize speech and make decisions.
Work Breakdown
Main dialer program and cross-compilation by Dan Bennett and Will Gerth;
Google Voice dialer script by David Bliss;
TIesr module build and HMM training by Lei Liu.
Also list here what doesn't work yet and when you think it will be finished and who is finishing it.
Conclusions
Based on our progress so far, we can conclude that our idea is almost fully implemented. We combined TIesr and Google Voice and got them working together on an embedded Linux system.
As a suggestion for future work, the Python Google Voice code does not support talking on the phone, so this could certainly be improved. However, it would be very hard to do, because Google has never officially published its Google Voice code.
There is also a long way to go on our TIesr model. A large amount of training data is required to improve its performance. To make things more interesting, name recognition could be added to our program. Beyond that, TIesr could be trained to recognize words in other languages.