To improve the operability of smart TVs, this paper proposes a cloud TV system based on intelligent voice processing. The system adds voice input and cloud network technology to a traditional smart TV and controls the television after the speech has been intelligently processed: functions can be found and invoked automatically through voice input, improving operability and making the smart TV more convenient to use and suitable for a wider range of users.

With the rapid development of computer and Internet technologies, the trend of 3C convergence, and the digitization of television, TV sets, the core appliances of home entertainment, have begun to evolve into intelligent multimedia network TVs. An intelligent network television is a multi-functional network terminal through which users can obtain a wealth of information and services. However, as application functions increase, operation becomes complicated. Faced with the complex functions and difficult operation of a smart TV, the user has only the paper manual or an electronic document played on the screen in Flash form; there is no detailed navigation to guide the user through the operations. Since a TV is aimed at every kind of consumer, many users are unclear about how to operate its functions, and many functions are never even discovered. In today's intelligent electronic products, intelligent voice design is a hot topic: it improves the operability of electronic products and brings more convenience to users. Designing a TV system based on intelligent voice processing, using speech to navigate quickly to various functions, information, and services, has therefore become a priority. The proposed system is a cloud television system based on intelligent voice processing, and the input voice data is transmitted to the television system.
The system preprocesses the analog voice data into a digital voice signal and sends the digital voice data to the cloud according to the requirements of each module. After intelligent analysis and processing in the cloud, specific control commands are returned to the TV for execution.

1. Overall system design

The structure of the TV system is shown in Figure 1. The system is divided into three modules: voice input, TV system processing, and cloud processing. With a network connection, speech is recorded by the microphone, converted into a specific voice format, and transmitted through the voice module to the central server in the cloud. The cloud server compares the transmitted speech against many voice models representing specific characters, yielding a set of candidate characters for the input speech. Using a character-based language model, the server then generates candidate character sequences representing the different possibilities for the characters contained in the input speech. From these character sequences it generates candidate word sequences according to a vocabulary and a vocabulary-based language model, determines which word sequence best matches the input speech, and transmits the determined word sequence back to the terminal television system over the network. The television system dispatches the received data to its modules, each of which implements a different function. The TV system hardware uses a MIPS-architecture CPU and runs the Linux operating system.
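The decoding chain above, from candidate character sequences through a vocabulary to a best word sequence, can be illustrated with a minimal sketch. The tiny language models, scores, and phrases below are illustrative stand-ins, not the paper's actual models:

```python
# Cloud-side decoding sketch: score candidate character sequences with
# a character-based LM, expand them into word sequences via a
# vocabulary, and return the best-scoring word sequence to the TV.
# All probabilities and entries are invented for illustration.
CHAR_LM = {"打开设置": 0.6,   # "open settings"
           "打开设备": 0.3,   # "open device"
           "大开设置": 0.1}
VOCAB = {"打开设置": ["打开", "设置"],
         "打开设备": ["打开", "设备"],
         "大开设置": ["大开", "设置"]}
WORD_LM = {("打开", "设置"): 0.7,
           ("打开", "设备"): 0.2,
           ("大开", "设置"): 0.1}

def decode():
    """Pick the word sequence maximizing char-LM score * word-LM score."""
    best, best_score = None, 0.0
    for chars, p_char in CHAR_LM.items():
        words = tuple(VOCAB[chars])
        score = p_char * WORD_LM.get(words, 0.0)
        if score > best_score:
            best, best_score = words, score
    return best

print(decode())  # ('打开', '设置')
```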
Voice is captured through the MIC input (two MIC interfaces are provided), and a standard network interface is used for network communication.

2. Speech recognition system design

2.1 Basic knowledge of speech recognition

Automatic Speech Recognition (ASR) aims to convert the lexical content of human speech into computer-readable input such as key presses, binary codes, or character sequences. It differs from speaker identification and speaker verification, which attempt to identify or confirm who is speaking rather than what is being said. A speech recognition system is essentially a pattern recognition system, and recognition generally proceeds in two steps. The first is the "learning" or "training" phase, whose task is to establish an acoustic model for the basic recognition units and a language model for grammatical analysis. The second is the "recognition" or "test" phase: according to the type of recognition system, a suitable recognition method is selected, the speech feature parameters required by that method are extracted by speech analysis, the features are compared with the system's models according to certain criteria and measures, and the recognition result is obtained by decision.

2.2 Voice system design

The block diagram of the voice system is shown in Figure 2. First, the analog voice signal input from the TV microphone is preprocessed, since the cloud requires a digital voice signal. Preprocessing is performed by a voice IC and includes pre-filtering, sampling and quantization, digitization, windowing, endpoint detection, and pre-emphasis. After the speech signal is preprocessed, the next important step is feature parameter extraction, whose purpose is to extract a time sequence of speech features from the speech waveform.
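The front-end steps named above can be sketched as follows. This is a minimal illustration of pre-emphasis, framing, and windowing only; the sample rate, frame size, and pre-emphasis coefficient are common textbook values, not parameters taken from the paper:

```python
import numpy as np

def preprocess(signal, sample_rate=16000, alpha=0.97,
               frame_ms=25, hop_ms=10):
    """Sketch of three of the preprocessing steps in section 2.2:
    pre-emphasis, framing, and Hamming windowing.
    Parameter values are illustrative, not from the paper."""
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Split the signal into short overlapping frames
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples
    hop = int(sample_rate * hop_ms / 1000)           # 160 samples
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])

    # Apply a Hamming window to each frame to reduce spectral leakage
    return frames * np.hamming(frame_len)

frames = preprocess(np.random.randn(16000))  # one second of audio
print(frames.shape)  # (98, 400)
```

Feature parameters (e.g. cepstral coefficients) would then be computed per frame from this windowed output.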
The result of feature extraction is sent to the TV operating system for judgment, which decides whether it needs to be transmitted to the cloud server; the cloud server performs intelligent analysis on the received speech, returns the result to the television terminal, and the corresponding function is executed.

2.3 Cloud server intelligent processing

Cloud server processing mainly analyzes and processes the digital voice data. The functions of this system are complex and the voice-processing workload is very large, so the design is based on a cloud computing server that performs the intelligent analysis and processing. The intelligent processing focuses mainly on semantic analysis of keywords and speech for the television system, handling each television module separately to complete the function the user wants. Using a cloud computing server reduces the hardware cost of the television terminal and increases processing speed, achieving intelligent processing of user commands.

2.3.1 Transmission protocol between TV and cloud

For a particular TV system, each module has specific keywords, so when transmitting data to the cloud the module characteristics must be sent together with the corresponding voice data.

2.3.2 Main methods of speech training and recognition

After the cloud receives the data, the voice data must be recognized. Speech training and recognition is a process of pattern training and pattern matching. Pattern training means processing a large amount of training data according to certain rules to obtain model parameters that reflect the essential characteristics of the data, and combining the parameters obtained from the training data into a pattern library.
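The paper does not give the wire format of the TV-to-cloud protocol, only that module characteristics travel with the voice data. A hedged sketch of what such a request might look like, with invented field names and JSON framing, is:

```python
import json, base64

def build_cloud_request(module_id, keywords, voice_bytes):
    """Illustrative sketch of the TV-to-cloud message of section 2.3.1:
    the module identifier and its keyword table are sent together with
    the voice data. Field names and JSON framing are assumptions, not
    the paper's actual protocol."""
    return json.dumps({
        "module": module_id,                # e.g. "video", "settings"
        "keywords": keywords,               # module-specific keyword table
        "voice": base64.b64encode(voice_bytes).decode("ascii"),
    })

req = build_cloud_request("video", ["bofang", "zanting"], b"\x01\x02")
msg = json.loads(req)
print(msg["module"])  # video
```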
Pattern matching means matching an unknown input pattern against the patterns in the library according to certain rules, and finding the pattern with the highest similarity, i.e. the best match. There are many methods for training and matching; at present the most common are dynamic time warping (DTW), the hidden Markov model (HMM), and artificial neural networks (ANN).

2.3.3 Hidden Markov model

The system uses hidden Markov models (HMM) to train and recognize speech. The HMM uses a Markov chain to model the changing statistical properties of the signal. It is essentially a probabilistic model of a double stochastic process: one stochastic process is the transition between states of the Markov chain, and the other is the stochastic correspondence between each state and the possible observations. In practical applications, an observer of the HMM's double stochastic process cannot see the states directly, only the observations, and can only sense the existence and characteristics of the states through that stochastic process. The human speech process is in essence also a double stochastic process: the speech signal itself is an observable time-varying sequence, a stream of phoneme parameters emitted by the brain according to grammatical knowledge and the needs of speech, and this corresponds to the unobservable state sequence of the HMM. The HMM simulates this double stochastic process well and describes both the local stationarity and the overall non-stationarity of the speech signal, making it an ideal model for describing speech signals.

2.3.4 Intelligent speech recognition

The keyword recognition system used here is based on large-vocabulary continuous speech recognition (LVCSR), as shown in Figure 3.
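The double stochastic process can be made concrete with a minimal two-state HMM and the forward algorithm, which computes the probability of an observation sequence given a model; during recognition this score is compared across word models. All matrices below are toy values for illustration:

```python
import numpy as np

# Two hidden states, two observation symbols; numbers are illustrative.
A  = np.array([[0.7, 0.3],        # state-transition probabilities
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],        # emission probabilities per state
               [0.2, 0.8]])
pi = np.array([0.5, 0.5])         # initial state distribution

def forward(obs):
    """Return P(obs sequence | HMM) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]                 # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # induction: sum over paths
    return alpha.sum()

p = forward([0, 1, 0])
print(round(p, 6))  # 0.099375
```

The hidden state path is never observed; only the emitted symbols are, which mirrors the unobservable phoneme states behind the observable speech signal.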
Using this structure for a continuous-speech keyword recognition system, the speech first passes through a continuous syllable recognizer, which generates a corresponding N-best word or syllable lattice; a keyword search algorithm then searches the lattice for keywords. The process can be roughly divided into five steps. First, search for the phonetic primitives: through continuous decoding, the pinyin sequence corresponding to the input speech is obtained as an N-best syllable sequence or syllable lattice. Second, select the keyword table for the relevant TV terminal function module. Third, compare the syllable sequences obtained in the first step against the keyword table to perform the keyword search, obtaining putative hits (words that may be keywords). Fourth, analyze the confidence of the putative hits using other knowledge sources and output the keyword recognition result. Fifth, intelligently process the keyword result from the fourth step and produce the final output for the specific TV system function module.

3. TV intelligent speech recognition processing software flow

3.1 Recording detection

The flow chart of the TV's intelligent voice recognition processing is shown in Figure 4. When voice control is needed, the user first presses the record button. The system then checks whether the network is connected and whether the microphone is working; if either check fails, the system does not record and prompts the user to check the network or the microphone.

3.2 Recording processing

After the device checks pass, recording is performed. Due to system limitations, the recording has a time limit and cannot be too long.
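The keyword-search step over the N-best syllable hypotheses can be sketched as follows. The keyword tables, pinyin strings, and the simple frequency-based confidence are illustrative assumptions, not the paper's actual algorithm:

```python
# Sketch of the keyword search of section 2.3.4: scan N-best pinyin
# syllable sequences against a module-specific keyword table and give
# each putative hit a crude confidence, i.e. the fraction of
# hypotheses that contain it. All entries are invented examples.
KEYWORD_TABLE = {
    "main":  ["she zhi", "yin liang"],       # settings, volume
    "video": ["ha li bo te", "bo fang"],     # Harry Potter, play
}

def keyword_search(nbest, module):
    """Return keywords found in the N-best list with their confidence."""
    hits = {}
    for kw in KEYWORD_TABLE[module]:
        count = sum(1 for hyp in nbest if kw in hyp)
        if count:
            hits[kw] = count / len(nbest)
    return hits

nbest = ["qing bo fang ha li bo te",         # "please play Harry Potter"
         "qing bo fang ha li bao te",
         "qing bo fang a li bo te"]
print(keyword_search(nbest, "video"))
```

A real system would replace the substring scan with a lattice search and the frequency ratio with a proper confidence measure, but the structure is the same: candidate syllables in, scored keyword hits out.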
The TV terminal performs preprocessing and feature extraction on the voice recorded by the microphone, then transmits the voice together with the module features to the cloud server; the cloud server performs the detailed processing and transfers the data back to the terminal television.

3.3 Intelligent function processing

The TV terminal then waits to receive data; if no data arrives within 5 seconds, it is treated as a timeout and the data processing fails. When data is received, the corresponding processing is performed: the cloud holds keyword tables for each module, and the returned data is judged and dispatched to the corresponding module. For example, on the main function interface, if the voice input is "shezhi" (settings), the system enters the settings interface; on the Video interface, if the input is "halibote" (Harry Potter), the system finds the Harry Potter movie.

4. Experimental application

Because the state of the television system varies during use, the accuracy of voice recognition varies as well. To obtain relatively accurate data, the test is divided into several cases: one compares the television system with audio playing and without, and the other varies the length of the input voice.

4.1 Noisy environment test

This test covers two situations: no audio (or audio muted), and audio playing (since the audio differs each time it is played, a composite value over various noisy environments is used). The experimental results are shown in Table 1.

4.2 Input keyword length test

The system performs intelligent voice recognition and analysis, using the voice input to decide the system's action. The key is the accuracy of voice recognition and intelligent processing, and the length of the input keyword is essential to the system's accuracy.
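The terminal-side receive-and-dispatch logic of section 3.3 can be sketched as below. The handler names and the plain-text reply format are assumptions for illustration; only the 5-second timeout and the two example keywords come from the text:

```python
import socket

# Map recognized keywords to module handlers; the two entries mirror
# the examples in the text ("shezhi" -> settings, "halibote" -> movie
# search). Handler behavior here is a placeholder.
HANDLERS = {
    "shezhi": lambda: "open settings interface",
    "halibote": lambda: "search Harry Potter movie",
}

def dispatch(keyword):
    """Route a recognized keyword to its module handler."""
    handler = HANDLERS.get(keyword)
    return handler() if handler else "unrecognized command"

def receive_and_dispatch(sock):
    """Wait up to 5 seconds for the cloud reply, then dispatch it."""
    sock.settimeout(5.0)                 # 5-second limit from the text
    try:
        keyword = sock.recv(1024).decode("utf-8").strip()
    except socket.timeout:
        return "timeout: data processing failed"
    return dispatch(keyword)

print(dispatch("shezhi"))  # open settings interface
```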
This experiment analyzes inputs of varying length; the results are shown in Table 2. The two tests show that the system's recognition accuracy is quite high, and the experiment achieved the expected results. The key is that even in a difficult environment, the keyword tables and the intelligent post-recognition processing allow the system to respond correctly.

5. Conclusion

The system is based on efficient voice recognition technology and a stable MIPS hardware platform, with software built on the Linux operating system. Cloud computing is used to process the voice data of the original smart TV system, giving the system higher real-time processing capability. Testing shows that the system judges voice input accurately, processes data quickly, and is highly stable. The system realizes intelligent voice control in a TV system, greatly improving operability through voice operation and making the TV more convenient and intelligent to use.