ECE497 Project Voice Dialer

From eLinux.org
Revision as of 13:03, 11 November 2011 by Yoder

Members: Dan Bennett, David Bliss, Will Gerth, and Lei Liu.

Concept: A Google Voice-based voice dialer using the TI Embedded Speech Recognizer (TIesr).

Timeline: TBD

Goal: To place and connect a voice-dialed call from the BeagleBoard via a phone device of the user's choosing.

Executive Summary

The voice dialer project aims to place and connect a voice-dialed call from the BeagleBoard via a phone device of the user's choosing. TIesr is used to build a Hidden Markov Model (HMM) for the voice recognition.

TIesr is working and returns a recognition result from audio input. The Google Voice dialer is also complete and can place a call from a Google Voice account to any valid phone number.

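The TIesr model does not yet recognize digits reliably, because it was trained on only 50 recordings. The Python Google Voice dialer can place a call but does not support talking on the phone.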

Overall, our team has reached its goal of building a voice-controlled dialer. Although the TIesr HMM does not work perfectly because of the small training set, we have finished building the full software structure and demonstrated it working on the BeagleBoard.

Installation Instructions

Give step by step instructions on how to install your project on the SPEd2 image.

  • Include your github path as a link like this: https://github.com/MarkAYoder/gitLearn.
  • Include any additional packages installed via opkg.
  • Include kernel mods.
  • If there is extra hardware needed, include links to where it can be obtained.

User Instructions

Once everything is installed, how do you use the program? Give details here, so if you have a long user manual, link to it here.

Highlights

A speaker-independent speech recognizer picks out a phone number as a person speaks it, and the Google Voice dialer then calls that number!

Theory of Operation

This project is divided into two parts, the dialer and the recognizer. The recognizer is written in C, and acts as the main driver for the application. The dialer is a utility script written in Python that dials a phone number.
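The hand-off between the two parts can be sketched as follows. In the project the driver is the C recognizer; this Python version only illustrates the data flow, assuming the recognizer yields a list of digit words and the dialer script takes the number as its single command-line argument. The script name gvdialer.py and the 10-digit length check are hypothetical, not taken from the project code.

```python
import subprocess
import sys

# Recognizer vocabulary (the ten digit words) mapped to numerals.
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def words_to_number(words):
    """Join recognized digit words, e.g. ["five", "one"] -> "51"."""
    try:
        return "".join(DIGITS[w.lower()] for w in words)
    except KeyError as err:
        raise ValueError(f"not a digit word: {err.args[0]!r}") from None


def dial_command(number, script="gvdialer.py"):
    """Build the argv for the hypothetical dialer script gvdialer.py."""
    if len(number) != 10:  # assumption: 10-digit North American numbers
        raise ValueError(f"expected 10 digits, got {len(number)}")
    return [sys.executable, script, number]


def dial(number, script="gvdialer.py"):
    """Run the dialer; raises CalledProcessError if it exits non-zero."""
    subprocess.run(dial_command(number, script), check=True)


if __name__ == "__main__":
    heard = ["eight", "one", "two", "five", "five",
             "five", "zero", "one", "nine", "nine"]
    print(dial_command(words_to_number(heard)))
```

Keeping the dialer as a separate process means the recognizer only has to hand over a digit string, and either side can be replaced independently.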

TIesr

In our project, we used the TI Embedded Speech Recognizer (TIesr) for speaker-independent recognition. TIesr is targeted at embedded platforms where computation and memory efficiency are important. It uses Hidden Markov Model (HMM) technology to model the acoustic signals found in speech.

To make TIesr a high-performance speech recognizer, the model must be built and trained before use. Building the HMM requires additional software: the Hidden Markov Model Toolkit (HTK), which may be obtained from http://htk.eng.cam.ac.uk/, and the Perl modules Math::FFT and Algorithm::Cluster from CPAN.

Since our goal, recognizing the ten digits, is a relatively simple task for TIesr, we did not use the pronunciation decision tree files. Below are the steps we used to train the TIesr model.

Step 1: Data Preparation

Prepare a text file listing the ten digit words in alphabetical order:

eight
five
four
nine
one
seven
six
three
two
zero
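A minimal Python sketch of producing this list, assuming a plain-text file with one word per line; the file name used below is hypothetical.

```python
from pathlib import Path

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def sorted_vocabulary(words):
    """Lower-case, de-duplicate and sort the vocabulary alphabetically."""
    return sorted({w.strip().lower() for w in words if w.strip()})


def write_word_list(path, words):
    """Write the sorted vocabulary to a text file, one word per line."""
    ordered = sorted_vocabulary(words)
    Path(path).write_text("\n".join(ordered) + "\n", encoding="ascii")
    return ordered
```

For example, write_word_list("digits.txt", DIGIT_WORDS) writes the ten words in the order shown above, starting with "eight" and ending with "zero".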

Step 2: Making the Letter File

For a small vocabulary, no pronunciation decision trees are needed; the only file required is one containing a sorted list of all characters that make up the words in the dictionary. This file must be named "cAttValue.txt"; we placed it at Data/Lang/cAttValue.txt. Each character must be a single byte.
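The character list can be derived mechanically from the vocabulary. A Python sketch, assuming cAttValue.txt holds one character per line (we have not confirmed the exact layout TIesr expects, so check it against the TIesr documentation):

```python
from pathlib import Path

DIGIT_WORDS = ["eight", "five", "four", "nine", "one",
               "seven", "six", "three", "two", "zero"]


def letter_set(words):
    """Return the sorted, de-duplicated characters used by the vocabulary."""
    chars = sorted({c for w in words for c in w})
    wide = [c for c in chars if ord(c) > 0xFF]
    if wide:  # every character must fit in a single byte
        raise ValueError(f"characters wider than one byte: {wide}")
    return chars


def write_att_file(path, chars):
    """Write one character per line (assumed layout of cAttValue.txt)."""
    Path(path).write_text("".join(c + "\n" for c in chars), encoding="latin-1")
```

For the ten digit words this yields 15 letters: e f g h i n o r s t u v w x z.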

Step 3: Building the Compressed Binary Dictionary Files

The TIesr tools use a binary dictionary, so the dictionary file must be converted into binary form for the subsequent processing steps. We used the HTK HDMan tool to generate the binary file "dict.bin" from "phone.lis".

Step 4: Building the Acoustic Model Data Files

First we recorded 50 speech clips, 5 for each digit. They were sampled at 8 kHz and stored as 16-bit, LSB-first (little-endian) PCM.
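Because the clips are headerless .raw files, any tool reading them must know the format in advance. A minimal Python sketch of decoding one clip and checking its length, assuming exactly the format above (8 kHz, signed 16-bit, LSB first, no header):

```python
import array
import sys

SAMPLE_RATE = 8000    # Hz, as recorded for the project
BYTES_PER_SAMPLE = 2  # 16-bit PCM


def read_raw_pcm(data):
    """Decode headerless 16-bit little-endian (LSB-first) PCM bytes."""
    if len(data) % BYTES_PER_SAMPLE:
        raise ValueError("odd number of bytes: not 16-bit PCM")
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(data)
    if sys.byteorder == "big":
        samples.byteswap()  # file is LSB-first; convert to host order
    return samples


def duration_seconds(data):
    """Length of the clip in seconds."""
    return len(data) / (BYTES_PER_SAMPLE * SAMPLE_RATE)
```

For example, read_raw_pcm(Path("one_1.raw").read_bytes()) would return the samples of a clip (the file name is hypothetical), and one second of audio occupies 16000 bytes.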

Then we used "sample_to_htk.pl", provided with TIesr, to convert the .raw audio into .htk format files, which can be used to build the HMM.
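We have not reproduced what sample_to_htk.pl does internally (it may write raw samples or computed features). As background, an HTK-format file is a 12-byte header (nSamples, sampPeriod in 100 ns units, sampSize in bytes, parmKind) followed by the sample data, stored big-endian by default. A Python sketch that wraps 16-bit waveform samples in such a header, assuming the WAVEFORM parameter kind:

```python
import struct

WAVEFORM = 0  # HTK parameter kind code for sampled waveform data


def htk_waveform_bytes(samples, sample_rate=8000):
    """Pack 16-bit samples into an HTK-format file image (big-endian)."""
    samp_period = 10_000_000 // sample_rate  # 100 ns units: 1250 at 8 kHz
    header = struct.pack(">iihh", len(samples), samp_period, 2, WAVEFORM)
    body = struct.pack(f">{len(samples)}h", *samples)
    return header + body
```

At 8 kHz one sample lasts 125 microseconds, which is why the sample period comes out as 1250.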

After that, we carefully labelled the time segments in each audio file, marking where each word starts and ends.
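HTK label files store one segment per line as start time, end time and label, with times in units of 100 ns. A small Python sketch that turns hand-measured boundaries in seconds into such lines; the "sil" label and the times in the example are made up for illustration:

```python
HTK_UNITS_PER_SECOND = 10_000_000  # HTK label times are in 100 ns units


def htk_label_lines(segments):
    """Turn (start_s, end_s, word) tuples into HTK label-file lines."""
    lines = []
    for start, end, word in segments:
        if not 0 <= start < end:
            raise ValueError(f"bad segment for {word!r}: {start}..{end}")
        lines.append(f"{round(start * HTK_UNITS_PER_SECOND)} "
                     f"{round(end * HTK_UNITS_PER_SECOND)} {word}")
    return lines
```

For example, htk_label_lines([(0.0, 0.2, "sil"), (0.2, 0.65, "one")]) returns ["0 2000000 sil", "2000000 6500000 one"].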

Next we used the .htk files and the segment labels to train the HMM for four iterations. The number of iterations can only be determined by experiment.

Finally, the trained HTK models were converted to TIesr-compatible acoustic data files.

Step 5: Creating the Hierarchical Linear Regression cluster tree file

In this step we used the trained word models to build a linear regression tree for the HMMs.
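TIesr's own tool performs this step, and we do not reproduce its algorithm. To illustrate what a regression class tree is, here is a Python sketch that groups Gaussian mean vectors top-down into a balanced binary tree. The median cut along the highest-variance dimension is a simplification chosen for brevity, not necessarily the clustering TIesr uses.

```python
from statistics import pvariance


def regression_tree(means, min_leaf=2, indices=None):
    """Split Gaussian mean vectors top-down into a binary tree.

    Internal nodes are (left, right) tuples; leaves are lists of
    Gaussian indices that would share one linear transform.
    """
    if min_leaf < 1:
        raise ValueError("min_leaf must be at least 1")
    if indices is None:
        indices = list(range(len(means)))
    if len(indices) <= min_leaf:
        return indices
    # Cut along the dimension where this node's means vary the most.
    dim = max(range(len(means[0])),
              key=lambda d: pvariance([means[i][d] for i in indices]))
    ordered = sorted(indices, key=lambda i: means[i][dim])
    half = len(ordered) // 2  # median split keeps the tree balanced
    return (regression_tree(means, min_leaf, ordered[:half]),
            regression_tree(means, min_leaf, ordered[half:]))
```

For example, regression_tree([[0.0], [10.0], [1.0], [11.0]]) returns ([0, 2], [1, 3]): the two low means form one leaf and the two high means the other.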

Step 6: Creating the Gaussian cluster files

The Gaussiancluster tool, also included with TIesr, generates the Gaussian clustering information that TIesr needs.
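We did not examine which clustering algorithm Gaussiancluster implements. As a generic illustration of grouping Gaussians by their mean vectors, here is plain k-means in Python; the naive first-k initialization and the iteration cap are our own simplifications.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means over mean vectors; returns a cluster label per point."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    labels = None
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Move each centroid to the mean of its members (keep it if empty).
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

For example, kmeans([[0.0], [0.5], [10.0], [10.5]], 2) returns [0, 0, 1, 1].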

Step 7: Testing the data files

In this step, we used testtiesrflex to generate all the final data the TIesrSI module needs to recognize speech and make a decision.

Work Breakdown

Main dialer program and cross-compilation by Dan Bennett and Will Gerth;

Google Voice dialer script by David Bliss;

TIesr module build and HMM training by Lei Liu.

Also list here what doesn't work yet and when you think it will be finished and who is finishing it.

Conclusions


Based on our progress so far, we conclude that our design is almost fully implemented. We combined TIesr and Google Voice and got them working together on an embedded Linux system.

As for future work, the Python Google Voice code does not support talking on the phone, which is a clear area for improvement. This would be very hard to do, however, because Google does not officially publish its Google Voice code.

There is also a long way to go on our TIesr model: much more training data is needed to improve its accuracy. To make the project more interesting, name recognition could be added to the program. Going further, TIesr could be trained to recognize words in a foreign language.