MSR CORE 12 Project: Evolution Strategy Based Design of Low-Power and High Performance Compact Hardware Speech Sensors

Takahiro Shinozaki, Kai Zhu, Boyu Qian, Jian Wang, Yi Liu

Project Goal

Developing a speech recognition system that is suitable for hardware implementation in speech sensors.

Motivation

In daily life, we often want to control electric devices such as an audio player or a lamp, find a small item such as a wallet or eyeglasses, or notice an event such as a baby crying or a dog barking. However, walking across a room interrupts what you are doing, searching for a lost item is time-consuming, and some of these tasks are impossible without someone else's help. These problems can be solved if tiny, energy-efficient speech sensors are ubiquitously embedded in our living environment. Moreover, such speech sensors would have wide applications in toys, education, safety, and security.
[Figure: possible applications of speech sensors]

Problems

The sensor must be very small so that it can be attached to various objects. Its energy consumption must be minimal, because it has to run continuously on a tiny energy source in order to react to a voice at any time. It must also be robust to noise: it is used in noisy environments and at a distance from the user, so the signal-to-noise ratio (SNR) is low.

Approach

*Combine deep neural network (DNN) based speech feature extraction with template-based matching: A DNN trained on a large amount of data from many speakers yields robust speech features. With these DNN features, accurate speech recognition can be achieved using template matching. The DNN is first trained and optimized on fast computers and then implemented in the speech sensor hardware. Any keyword or speaker can then be detected immediately, simply by registering a template with the sensor.
*Apply evolution strategy based system optimization: Many design factors of the speech recognition system must be jointly optimized to realize tiny, energy-efficient speech sensors. Because the design factors influence each other, the optimization is a very complex problem. To solve it, we apply evolutionary algorithms to optimize the design factors.
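
To illustrate the first item, the following is a minimal sketch of template-based keyword detection, assuming dynamic time warping (DTW) as the matching method; the project page does not say which template matching algorithm is used. The 2-dimensional "features" are made-up values standing in for DNN outputs, and the names `dtw_distance`, `detect_keyword`, and `threshold` are hypothetical.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(template, utterance):
    """Length-normalized DTW alignment cost between two feature sequences."""
    n, m = len(template), len(utterance)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(template[i - 1], utterance[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)

def detect_keyword(templates, utterance, threshold):
    """Return the best-matching registered keyword, or None if none is close enough."""
    best_name, best_cost = None, float("inf")
    for name, template in templates.items():
        c = dtw_distance(template, utterance)
        if c < best_cost:
            best_name, best_cost = name, c
    return best_name if best_cost <= threshold else None

# Registering a keyword only means storing its feature sequence (toy values).
templates = {
    "onsei": [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]],
    "other": [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]],
}
# Same pattern as "onsei" but spoken more slowly (first frame repeated).
utterance = [[0.0, 1.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
print(detect_keyword(templates, utterance, threshold=0.2))  # prints "onsei"
```

In this sketch, adding a new keyword needs no retraining: a new entry in `templates` is enough, which mirrors the "just register a template" property described above. The threshold value is arbitrary and would have to be tuned on real data.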
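
To illustrate the second item, the following is a minimal sketch of a (1 + λ) evolution strategy that jointly tunes three interacting design factors. Everything specific here is an assumption for illustration: the factors (`hidden_units`, `feature_dim`, `bit_width`), their bounds, the toy `design_cost` function, and the step-size heuristic are hypothetical and are not the project's actual optimizer or objective. A real evaluation would train and test a recognizer for each candidate design, and would round the factors to integers.

```python
import random

def design_cost(design):
    """Hypothetical stand-in for 'recognition error + hardware cost' of a design.

    The terms are coupled on purpose, so no factor can be tuned in isolation.
    """
    hidden_units, feature_dim, bit_width = design
    error = 100.0 / (hidden_units * bit_width) + 20.0 / feature_dim
    hardware = 0.01 * hidden_units * bit_width + 0.05 * feature_dim * bit_width
    return error + hardware

LOWER = [8.0, 4.0, 2.0]      # assumed lower bound of each design factor
UPPER = [512.0, 64.0, 16.0]  # assumed upper bound of each design factor

def mutate(parent, sigma, rng):
    """Gaussian mutation scaled to each factor's range, clipped to the bounds."""
    child = []
    for x, lo, hi in zip(parent, LOWER, UPPER):
        x = x + rng.gauss(0.0, sigma * (hi - lo))
        child.append(min(hi, max(lo, x)))
    return child

def evolve(initial, generations=200, offspring=10, sigma=0.1, seed=0):
    """(1 + lambda) evolution strategy: the parent survives unless a child is better."""
    rng = random.Random(seed)
    parent, parent_cost = list(initial), design_cost(initial)
    for _ in range(generations):
        children = [mutate(parent, sigma, rng) for _ in range(offspring)]
        best = min(children, key=design_cost)
        best_cost = design_cost(best)
        if best_cost < parent_cost:
            parent, parent_cost = best, best_cost
            sigma *= 1.5  # success: try larger steps next time
        else:
            sigma *= 0.9  # failure: search closer to the parent
    return parent, parent_cost

start = [256.0, 32.0, 16.0]
best_design, best_cost = evolve(start)
print("initial cost:", round(design_cost(start), 3))
print("best design:", [round(x, 1) for x in best_design])
print("best cost:", round(best_cost, 3))
```

The strategy needs only cost evaluations, not gradients, which is why this family of methods suits design factors whose effect on accuracy and hardware size can be measured but not differentiated.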

Hardware implementation

An FPGA-based hardware platform has been developed. An advantage of the FPGA-based implementation is that any speech sensor design can be tested easily.

Demo video

In this demo, the Japanese keyword "Onsei" is detected.

Keyword detection from isolated utterances (MP4, 2.7 MB)
Keyword detection from continuous utterances (MP4, 2.5 MB)
Robot control by voice-to-infrared interface (MP4, 4.1 MB)