May 21, 2024

Link your products to AI via voice interface technology

As smart speakers from Baidu, Alibaba, and others gradually come into public view, an AI technology war has officially begun, and the ways we interact with computers continue to expand. The voice interface has become a new entry point, spreading to every corner of the world. But how do these systems work, and what hardware does it take to build such a device? Let's follow an engineer from TI and take a look at their insights.

What is a voice interface?

Speech recognition technology has been around since the 1950s, when engineers at Bell Labs created a system that could recognize a single spoken digit. Speech recognition, however, is only one part of a complete voice interface. A voice interface covers all the aspects of a traditional user interface: it presents information and gives users a way to manipulate it. In a voice interface, some of that presentation or manipulation happens through speech. A voice interface can also be offered alongside traditional user-interface elements such as buttons or displays.

The first voice interface most people encounter is probably a mobile phone, or a very basic speech-to-text program on a personal computer. These early systems, however, were slow, recognized speech inaccurately, and handled only a limited vocabulary.

So what turned speech recognition from an add-on feature into one of the hottest technologies in computing? First, today's computing power and algorithms perform dramatically better. Second, cloud technology and big-data analytics have improved both the speed and the accuracy of recognition.

How do you add speech recognition to your device?

People often ask how to add a voice interface to a project. In fact, TI offers several voice interface products, including the Sitara™ family of ARM® processors and the C5000™ DSP family, all with voice-processing capability. The two families have different strengths and suit different applications.

When choosing between a DSP and an ARM solution, the key factor is whether the device can leverage a cloud voice platform. There are three scenarios: the first is offline, where all processing takes place on the local device; the second is online, where processing happens in a cloud-based voice service such as Amazon Alexa, Google Assistant, or IBM Watson; the third is a hybrid of the two.

Offline: Car Voice Control

Judging from current trends, people seem to want everything connected to the Internet. Yet in some applications, whether for cost reasons or because a reliable network connection simply isn't available, connectivity offers little benefit. In modern automotive applications, many infotainment systems use an offline voice interface. These systems typically support only a limited set of commands, such as "make a call," "play music," or "turn the volume up or down." Although speech-recognition algorithms running on general-purpose processors have made significant progress, they still fall short here; in such cases a DSP such as the C55xx delivers the best performance for the system.
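To give a rough sense of how small such an offline command set can be, here is a minimal sketch in Python that maps a recognizer's text output onto a fixed command table. The command names and actions are hypothetical; in a real infotainment system the recognizer itself would run on the DSP and only the decoded command would reach the application logic.

```python
# Minimal sketch of an offline command dispatcher.
# The command phrases and actions below are illustrative only; a real
# system runs the recognizer on the DSP and dispatches its command ID.

COMMANDS = {
    "make a call": lambda: print("dialing..."),
    "play music":  lambda: print("starting playback"),
    "volume up":   lambda: print("volume +1"),
    "volume down": lambda: print("volume -1"),
}

def dispatch(recognized_text: str) -> bool:
    """Run the action for a recognized phrase; return False if unknown."""
    action = COMMANDS.get(recognized_text.strip().lower())
    if action is None:
        return False
    action()
    return True

if __name__ == "__main__":
    dispatch("play music")   # -> "starting playback"
```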

Online: Smart Home Center

Much of the buzz around voice interfaces centers on connected devices such as Google Home and Amazon Alexa. Because Amazon lets third parties access its voice-processing ecosystem through the Alexa Voice Service, its progress in this area has drawn particular attention. Other cloud services, such as Microsoft Azure, also offer speech recognition and similar capabilities. It is worth noting that for all of these devices, the audio processing happens in the cloud.

Whether this convenient integration is worth sending your audio upstream to a voice-service provider is entirely up to the user. But because the cloud provider takes on the heavy lifting, what the device vendor has to do is very simple. In fact, since speech synthesis also happens in the cloud, an Alexa device only needs to perform the simplest of tasks: recording and playing back audio files. Because no special signal processing is required, an ARM processor is sufficient to handle the interface, which means that if your device already has an ARM processor on board, you may be able to integrate a cloud voice interface.
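To make that division of labour concrete, the sketch below shows roughly what the device-side job looks like when the cloud does all the processing: upload a recorded clip and get back the synthesized reply to play. The endpoint URL, token, and response format are placeholders, not a real provider API; real services such as the Alexa Voice Service define their own protocols.

```python
# Rough sketch of the device-side job for a cloud voice interface.
# The endpoint, token, and response format are placeholders, not a real
# provider API.

import requests

CLOUD_ENDPOINT = "https://voice.example.com/v1/recognize"  # placeholder
API_TOKEN = "YOUR_TOKEN"                                   # placeholder

def handle_utterance(wav_path: str) -> bytes:
    """Upload recorded audio and return the synthesized reply audio."""
    with open(wav_path, "rb") as f:
        resp = requests.post(
            CLOUD_ENDPOINT,
            headers={"Authorization": f"Bearer {API_TOKEN}",
                     "Content-Type": "audio/wav"},
            data=f,
        )
    resp.raise_for_status()
    return resp.content  # reply audio, ready to play back on the device
```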

It is equally important to understand what Alexa does not provide. Alexa does not directly perform any device control or cloud integration of its own. Many Alexa "smart devices" rely on cloud functionality supplied by their developers, who use Alexa's voice processing as a bridge into their existing cloud applications. For example, if you tell Alexa to order a pizza, your favorite pizza shop must have developed an Alexa "skill": code that defines what happens when you place an order. Alexa invokes that skill each time you order, and the skill in turn talks to the shop's online ordering system to place the order for you. Similarly, smart-home device makers must implement skills that tell Alexa how to interact with their local devices and online services. Amazon supplies many skills of its own, and with the skills contributed by third-party developers, an Alexa device can be very useful even if you never develop a skill yourself.
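For a feel of what such a skill looks like, here is a bare-bones handler in Python of the kind typically deployed as an AWS Lambda function. The intent name "OrderPizzaIntent" and the place_order() helper are hypothetical stand-ins for the pizza shop's own intent schema and ordering backend.

```python
# Bare-bones Alexa skill handler (Python, Lambda-style entry point).
# "OrderPizzaIntent" and place_order() are hypothetical stand-ins for the
# shop's own intent schema and ordering backend.

def place_order():
    # Here the skill would call the shop's existing online-ordering API.
    return "A large pepperoni pizza is on its way."

def speak(text, end_session=True):
    """Build a plain-text Alexa response envelope."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }

def lambda_handler(event, context):
    request = event["request"]
    if (request["type"] == "IntentRequest"
            and request["intent"]["name"] == "OrderPizzaIntent"):
        return speak(place_order())
    return speak("Welcome! Ask me to order a pizza.", end_session=False)
```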

Hybrid: Connected Thermostat

Sometimes we need a device's basic features to keep working even without an Internet connection. A thermostat that cannot adjust the temperature on its own when the network is down, for example, is a real nuisance. To avoid this, a good product designer builds in some local voice processing so the experience stays seamless. Such a system needs both a DSP, such as the C55xx, for local voice processing and an ARM processor to connect the networked interface to the cloud.
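A hybrid design boils down to a simple fallback rule: use the cloud when it is reachable, otherwise fall back to the small on-device command set. The sketch below illustrates that rule only; cloud_recognize() and local_recognize() are hypothetical stand-ins for the cloud service and the on-device (DSP) recognizer.

```python
# Sketch of hybrid voice handling for a connected thermostat.
# cloud_recognize() and local_recognize() are placeholder stand-ins for
# the cloud service and the on-device (DSP) recognizer.

LOCAL_COMMANDS = {"temperature up", "temperature down"}

def cloud_recognize(audio: bytes) -> str:
    raise ConnectionError("no network")      # placeholder cloud call

def local_recognize(audio: bytes) -> str:
    return "temperature up"                  # placeholder DSP result

def handle_audio(audio: bytes, online: bool) -> str:
    """Prefer the cloud when reachable; keep basic control working offline."""
    if online:
        try:
            return cloud_recognize(audio)    # full natural-language handling
        except ConnectionError:
            pass                             # network dropped mid-request
    text = local_recognize(audio)            # limited offline vocabulary
    return text if text in LOCAL_COMMANDS else "unrecognized"
```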

What is voice triggering?

You may have noticed that we have not yet mentioned the real magic of the new generation of voice assistants: they are always listening for a "trigger word." How do they pick up your voice from anywhere in the room, or hear you while the device itself is playing audio? There is nothing particularly magical about it, just some intelligent software. This software is independent of the cloud voice interface and can also run offline.

The easiest part of the system to understand is the "wake word." Wake-word detection is a simple local speech-recognition routine that continuously samples the incoming audio, looking for a single word. Since most voice services happily accept audio that does not contain the wake word, the word itself is not tied to any particular voice platform. Because the requirements are relatively modest, the job can be done on an ARM processor using open-source libraries such as Sphinx or KITT.AI.
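As an illustration, pocketsphinx (a Python wrapper around CMU Sphinx) exposes a keyword-spotting mode that can serve as a simple wake-word detector. The keyphrase and threshold below are arbitrary example values, and the API shown is from the older pocketsphinx-python package; newer releases differ.

```python
# Wake-word spotting sketch using pocketsphinx's keyword-search mode.
# Keyphrase and threshold are arbitrary example values; tune per device.

from pocketsphinx import LiveSpeech

speech = LiveSpeech(
    lm=False,                 # disable the full language model
    keyphrase="hey device",   # the wake word to listen for
    kws_threshold=1e-20,      # detection sensitivity
)

for phrase in speech:         # blocks, yielding each detection
    print("wake word detected:", phrase)
    # hand off the following audio to the voice service here
```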

To hear you from anywhere in the room, a speech-recognition device uses a process called beamforming. In essence, the direction of a sound source is estimated by comparing the arrival times of the sound at microphones separated by a known distance. Once the direction of the target sound is known, the device applies audio-processing techniques such as spatial filtering to further reduce noise and enhance signal quality. How beamforming is implemented depends on the microphone layout: true 360-degree coverage requires a non-linear microphone array (usually circular), while a wall-mounted device needs only two microphones for 180-degree spatial discrimination.
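The simplest form of this idea is delay-and-sum beamforming: estimate the inter-microphone delay from the cross-correlation of the two channels, shift one channel by that delay, and add the channels so that sound from the estimated direction reinforces while off-axis noise partially cancels. The NumPy sketch below is only an illustration of that principle; a real device runs a per-frequency-band version of it on the DSP.

```python
# Two-microphone delay-and-sum beamforming sketch (NumPy).

import numpy as np

def delay_and_sum(mic1: np.ndarray, mic2: np.ndarray) -> np.ndarray:
    """Align mic2 to mic1 using the cross-correlation peak, then average."""
    # Estimate the sample delay between the channels (the time difference
    # of arrival, which also encodes the source direction).
    corr = np.correlate(mic1, mic2, mode="full")
    delay = int(np.argmax(corr)) - (len(mic2) - 1)

    # Shift mic2 by the estimated delay so the target source lines up.
    aligned = np.roll(mic2, delay)

    # Summing aligned channels reinforces the source and averages out
    # uncorrelated noise arriving from other directions.
    return 0.5 * (mic1 + aligned)

# Synthetic check: the same tone, offset by 5 samples, plus noise.
rng = np.random.default_rng(0)
sig = np.sin(2 * np.pi * 440 * np.arange(1600) / 16000.0)
m1 = sig + 0.1 * rng.standard_normal(sig.size)
m2 = np.roll(sig, 5) + 0.1 * rng.standard_normal(sig.size)
out = delay_and_sum(m1, m2)
```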

 

The final trick of the voice assistant is acoustic echo cancellation (AEC). AEC is somewhat similar to noise-cancelling headphones, but applied in reverse. The algorithm exploits the fact that the output audio, such as the music being played, is known in advance. Where noise-cancelling headphones use this principle to remove external noise, AEC removes the effect of the device's own output from the signal picked up by the microphone, so the device can ignore the audio it produces and still hear you over whatever the speaker is playing. AEC requires a lot of computation and works best on a DSP.
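A common building block for AEC is an adaptive filter such as NLMS (normalized least mean squares): the filter learns the echo path from the known loudspeaker signal and subtracts its echo estimate from the microphone signal. The NumPy sketch below shows only the core update loop under that assumption; a production implementation typically runs block-wise in the frequency domain on the DSP.

```python
# Minimal NLMS echo-canceller sketch (NumPy).
# x: known loudspeaker (far-end) signal
# d: microphone signal = echo of x + near-end speech
# Returns e, the microphone signal with the estimated echo removed.

import numpy as np

def nlms_aec(x: np.ndarray, d: np.ndarray, taps: int = 128,
             mu: float = 0.5, eps: float = 1e-6) -> np.ndarray:
    w = np.zeros(taps)               # adaptive estimate of the echo path
    e = np.zeros(len(d))             # echo-cancelled output
    for n in range(taps, len(d)):
        x_buf = x[n - taps:n][::-1]  # most recent loudspeaker samples
        echo_hat = w @ x_buf         # predicted echo at the microphone
        e[n] = d[n] - echo_hat       # residual: near-end speech + error
        # Normalized LMS update: step size scaled by input power.
        w += (mu / (x_buf @ x_buf + eps)) * e[n] * x_buf
    return e
```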

To implement everything described above, wake-word detection, beamforming, and AEC, an ARM processor works together with a DSP: the DSP handles all the signal processing while the ARM processor runs the device logic and interfaces. The DSP can pipeline the incoming data, minimizing processing latency and providing a better user experience, while the ARM core is free to run a high-level operating system such as Linux and control the rest of the device. All of this advanced processing happens locally; if a cloud service is used, it receives only a single audio file containing the final, processed result.
