1-630-802-8605 Ravi.das@bn-inc.net

This propelled the idea of using voice recognition to confirm the identity of a particular individual. Research and development into voice recognition continued well into the 1960s, and the voice spectrographs in use at the time began to rely on statistical modeling, rather than the traditional approaches, as a means of creating biometric templates.

This trend allowed automated voice recognition tools to evolve. In fact, the first known voice recognition system was called “Forensic Automatic Speaker Recognition”, or FASR for short.

In today’s biometric world, voice recognition can be considered both a behavioral and a physical biometric.

This is because the acoustic properties of a person’s voice are a direct function of the shape of the individual’s mouth, as well as the length and quality of the vocal cords (the physical component). At the same time, behavioral data from the individual’s voice is present in the template as well, including such variables as the pitch, volume, and rhythm of the voice.

Voice Recognition: How It Works

The first step in voice recognition is for an individual to produce an actual voice sample. Voice production is a facet of life which we take for granted every day, but the actual process is complicated. The production of sound originates at the vocal cords, which have a gap between them. When we attempt to communicate, the muscles which control the vocal cords contract.

As a result, the gap narrows, and as we exhale, the breath passes through the gap, which creates sound. The unique patterns of an individual’s voice are then produced by the vocal tract, which consists of the laryngeal pharynx, oral pharynx, oral cavity, nasal pharynx, and nasal cavity. It is these unique patterns created by the vocal tract which are used by voice recognition systems.

Even though people may sound alike to the human ear, everybody has, to some degree, a different or unique enunciation in their speech. To ensure a good quality voice sample, the individual usually recites some sort of text, which can be a verbal phrase, a series of numbers, or a passage of text, and which is usually prompted by the voice recognition system. The individual usually has to repeat this a number of times.

The most common devices used to capture an individual’s voice samples are computer microphones, cell (mobile) phones, and landline telephones. As a result, a key advantage of voice recognition is that it can leverage existing telephony technology with minimal disruption to an entity’s business processes. In terms of noise disruption, computer microphones and cell phones create the most, and landline telephones create the least.

Factors Affecting Voice Recognition

It is very important to note at this point that the same type of medium which was used to enroll the end user into the voice recognition system should also be used for later transactions. For example, if a smartphone was used to create the enrollment template, then the same smartphone should be used in subsequent verification transactions.

Factors other than the noise disruptions created by telephony devices can also affect the quality of voice samples. Examples include mispronounced verbal phrases, different media used for enrollment and verification (using a landline telephone for the enrollment process, but then a cell phone for the verification process), and the emotional and physical condition of the individual.
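The media-matching rule described above can be sketched as a simple check made before a verification attempt is scored. This is only an illustration: the `CaptureRecord` type, its field names, and the medium labels are hypothetical and do not come from any particular voice recognition product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureRecord:
    """Hypothetical metadata stored alongside a voice template."""
    user_id: str
    medium: str  # e.g. "landline", "cell_phone", "computer_microphone"


def same_medium(enrollment, verification):
    """True when both samples belong to one user and share a capture medium."""
    return (enrollment.user_id == verification.user_id
            and enrollment.medium == verification.medium)


enrolled = CaptureRecord("user-001", "landline")
attempt_ok = CaptureRecord("user-001", "landline")
attempt_mismatch = CaptureRecord("user-001", "cell_phone")

print(same_medium(enrolled, attempt_ok))        # True
print(same_medium(enrolled, attempt_mismatch))  # False
```

A real deployment might flag a mismatch for extra scrutiny rather than reject it outright; that policy choice is outside what the text specifies.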

Finally, the voice samples are converted from an analog format to a digital format for processing. This raw voice data is fed into a spectrograph, which produces a pure visualization of the acoustic properties of the individual’s voice. In a spectrograph display, the two primary axes are frequency (the vertical axis) and time (the horizontal axis).
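To make the frequency-versus-time picture concrete, here is a minimal sketch that computes a spectrogram from digitized samples using a short-time Fourier transform. The sample rate, frame size, hop length, Hann window, and the synthetic 500 Hz test tone are all assumptions chosen for illustration, and the naive DFT is used only to keep the example free of third-party libraries.

```python
import cmath
import math

SAMPLE_RATE = 8000  # assumed sampling rate in Hz (typical of telephony)
FRAME_SIZE = 128    # assumed number of samples per analysis frame
HOP = 64            # assumed step between consecutive frames


def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(samples, frame_size, hop):
    """Split the samples into overlapping frames (the time axis)."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, hop)]


def dft_magnitudes(frame):
    """Magnitudes of the non-negative frequency bins (the frequency axis)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def spectrogram(samples, frame_size=FRAME_SIZE, hop=HOP):
    """Return one list of bin magnitudes per time frame."""
    window = hann(frame_size)
    return [dft_magnitudes([s * w for s, w in zip(frame, window)])
            for frame in frame_signal(samples, frame_size, hop)]


# Synthetic stand-in for a digitized voice sample: a pure 500 Hz tone.
tone = [math.sin(2 * math.pi * 500 * t / SAMPLE_RATE) for t in range(1024)]
spec = spectrogram(tone)

# Locate the strongest frequency bin in the first frame.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(len(spec), "frames;", "peak at", peak_hz, "Hz")
```

Each row of `spec` corresponds to one time step and each column to one frequency band, which is the grid a spectrograph display renders as an image.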

The next steps are unique feature extraction and creation of the template. The extraction algorithms look for unique patterns in the individual’s voice samples. To create the template, a “model” of the voice is created. There are two statistical techniques which are primarily utilized when formulating the voice recognition biometric templates. They are:

  1. Hidden Markov Models, or HMMs: This is a statistical model which is used with text dependent systems. This type of model captures such variables as the changes and fluctuations of the voice over a certain period of time, which are a direct function of the pitch, duration, dynamics, and quality of the person’s speaking voice;
  2. Gaussian Mixture Models, or GMMs: This is a state mapping model, in which various vector states are created to represent the unique sound characteristics of the particular individual. Unlike Hidden Markov Models, Gaussian Mixture Models are used exclusively by text independent systems.
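As a rough illustration of how a mixture model can be used to score a voice sample, the sketch below assumes the common formulation of a Gaussian mixture as a weighted sum of Gaussian densities with diagonal covariances. The two speaker models, their parameters, and the two-dimensional feature vectors are entirely hypothetical; a real system would estimate the parameters from enrollment data and use far richer features.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(x, weights, means, variances):
    """Log of the weighted sum of component densities (log-sum-exp for stability)."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def average_score(frames, model):
    """Mean per-frame log-likelihood of the frames under one speaker model."""
    return sum(gmm_log_likelihood(f, *model) for f in frames) / len(frames)


# Hypothetical models: (weights, means, variances), two components each.
speaker_a = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]])
speaker_b = ([0.5, 0.5], [[4.0, 4.0], [5.0, 5.0]], [[0.5, 0.5], [0.5, 0.5]])

# Hypothetical feature vectors extracted from a verification sample.
frames = [[0.1, -0.2], [0.9, 1.1], [0.4, 0.6]]

score_a = average_score(frames, speaker_a)
score_b = average_score(frames, speaker_b)
better_match = "speaker_a" if score_a > score_b else "speaker_b"
print(better_match)
```

Because the score depends only on where the feature vectors fall, and not on the order in which they arrive, this style of model does not need the speaker to recite a fixed phrase.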

However, just like other biometric technologies, there are external factors which can impact the quality of the recorded voice (this recorded voice can also be considered the “raw image”) and, subsequently, the quality of the enrollment and verification templates created. These factors are as follows:

  1. Any misspoken, misunderstood, or misread text phrases which the end user is supposed to recite;
  2. The emotional state of the individual when they are reciting their specific text or phrase;
  3. Poor room acoustics;
  4. Media mismatch (for instance, using different telephones or microphones for the enrollment and verification phases);
  5. The physical state of the individual (for instance, if they are sick or suffering from any other kind of ailment);
  6. The age of the individual (for example, the vocal tract changes as we get older).

The Advantages & Disadvantages of Voice Recognition

When voice recognition is compared against the seven major criteria, the result is mixed:

  1. Universality: Voice recognition is not at all language dependent. This is probably its biggest strength. As long as an individual can speak, theoretically, they can be easily enrolled into a voice recognition system;
  2. Uniqueness: Unlike some of the other biometrics, such as the iris and the retina, the voice does not possess as many unique features, which results in a lack of rich information and data;
  3. Permanence: Over the lifespan of an individual, the voice changes for many reasons, which include age, fatigue, any disease of the vocal cords, the emotional state of the individual, and any medication the individual might be taking;
  4. Collectability: This is probably the biggest disadvantage of voice recognition. For example, any variability in the medium which is used to collect the raw voice sample can greatly affect or skew the voice recognition biometric templates. Therefore, it is of utmost importance to ensure that the same type of collection medium is used for both the enrollment and verification stages. There should be no interchangeability involved whatsoever;
  5. Performance: Because of the variability involved when collecting the raw voice samples, it is difficult to gauge how well a voice recognition system can actually perform. Also, the template size can be very large, in the range of 1,500 to 3,000 bytes;
  6. Acceptability: This is probably one of the strongest advantages of voice recognition. The technology is very nonintrusive, and can be deployed in a manner which is very covert to the end user;
  7. Resistance to circumvention: When compared to the other biometric technologies, voice recognition can, to a certain degree, be easily spoofed by one end user mimicking the voice acoustics of another end user. This is in large part due to the lack of unique features in the voice itself.
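To put the template size quoted under Performance in perspective, the following back-of-the-envelope calculation estimates total template storage. The figure of one million enrolled users is purely hypothetical, and the sketch assumes one stored template per user with no database overhead.

```python
TEMPLATE_BYTES_LOW = 1_500   # lower bound quoted in the text
TEMPLATE_BYTES_HIGH = 3_000  # upper bound quoted in the text


def storage_range_mb(enrolled_users):
    """Return (low, high) template storage in megabytes (1 MB = 1,000,000 bytes)."""
    low = enrolled_users * TEMPLATE_BYTES_LOW / 1_000_000
    high = enrolled_users * TEMPLATE_BYTES_HIGH / 1_000_000
    return low, high


low_mb, high_mb = storage_range_mb(1_000_000)  # hypothetical user count
print(f"{low_mb:.0f} MB to {high_mb:.0f} MB")  # 1500 MB to 3000 MB
```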

The Market Applications of Voice Recognition

The market applications of voice recognition have been much more limited than those of other biometric technologies such as hand geometry recognition, fingerprint recognition, and even iris recognition. Part of the reason for this is that there is a lack of vendors actually developing voice recognition solutions.

On the whole, however, voice recognition is starting to get some serious traction, as businesses and governments worldwide have started to realize its ease of deployment and the other strong advantages it offers to the marketplace. Probably one of the biggest market applications of voice recognition is in the financial world.

For example, many small to medium sized brokerage institutions offer voice recognition to their customers as a means of quick verification. Rather than taking up a customer’s time with entering a PIN and Social Security number, the institution can identify the customer very quickly by the use of voice recognition.

Authenticating a customer, which would normally take minutes with the traditional means of security, is now shaved down to mere seconds.

As a result, financial transactions for the customer can occur at a very quick pace. Voice recognition can also be used on smartphones as a means of verification, instead of having to enter a numerical password. It is also currently being used in correctional facilities (to monitor the telephone privileges of the inmates); the railroad system; border protection and control; and certain types of physical and logical access entry.