The concept of the DIVA voice interface for controlling a computing system to help people with speech disorders
INTRODUCTION
Currently, much attention is paid to creating an accessible environment for people with disabilities and special needs. Computer technology and specialized information systems are an important means of ensuring accessibility and improving quality of life, social interaction, and integration into society for people with disabilities. A review of the literature shows that a variety of efforts are under way to ease human-computer interaction, including the development of voice interfaces for controlling computing systems. However, these efforts focus on creating speaker-independent systems that are trained on big data and do not take into account how computer commands are pronounced by people with various impairments of speech function.
The goal of this research is to design a speaker-dependent voice interface for controlling a computing system based on machine learning methods.
Tasks solved in the work:
Review voice interfaces and their use for controlling computing systems;
Explore approaches to personalizing voice control of a computing system;
Develop a mathematical model of the DIVA voice-controlled computing system;
Develop an algorithm for the software implementation of DIVA.
Solution methods. The set tasks are solved using methods of system analysis, mathematical modeling, and machine learning.
VOICE INTERFACE AS A METHOD OF CONTROLLING A COMPUTING SYSTEM
Creating speech recognition systems is an extremely complex task. Recognizing Russian, a language with many distinctive features, is especially difficult. All speech recognition systems can be divided into two classes:
Speaker-dependent systems are tuned to a particular speaker's voice during training. To work with another speaker, such systems require complete reconfiguration.
Speaker-independent systems do not depend on the speaker. They require no prior training and can recognize any speaker's speech.
Systems of the first type appeared on the market first. In them, the sound image of a command was stored as a whole-word template, and dynamic programming methods were used to compare an unknown utterance with a reference command. These systems worked well for small sets of 10–30 commands but understood only one speaker; to work with another speaker, they required complete reconfiguration.
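The whole-word template matching described above can be sketched as a classic dynamic time warping (DTW) comparison. The one-dimensional "frames" and command names below are purely illustrative, not part of the described system:

```python
# Dynamic time warping (DTW): the dynamic-programming method used by early
# speaker-dependent recognizers to compare an utterance with a stored
# command template. Frames are simplified to single scalar values here.

def dtw_distance(utterance, template):
    """Minimal-cost alignment between two frame sequences."""
    n, m = len(utterance), len(template)
    INF = float("inf")
    # cost[i][j] = best cost of aligning the first i utterance frames
    # with the first j template frames
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(utterance[i - 1] - template[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip an utterance frame
                                 cost[i][j - 1],      # skip a template frame
                                 cost[i - 1][j - 1])  # match both frames
    return cost[n][m]

def recognize(utterance, templates):
    """Return the command whose stored template aligns most cheaply."""
    return min(templates, key=lambda cmd: dtw_distance(utterance, templates[cmd]))
```

Because DTW stretches and compresses the time axis, the same word spoken faster or slower still aligns well with its template, but a new speaker's templates must be re-recorded from scratch, which is exactly the limitation noted above.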
To understand fluent speech, it was necessary to move to much larger dictionaries, from several tens to hundreds of thousands of words. The methods used in first-type systems were unsuitable for this task, since it is simply impossible to create templates for so many words.
In addition, there was a desire to make systems that do not depend on the speaker. This is a very difficult task, since each person has an individual manner of pronunciation: speech rate, voice timbre, and articulation peculiarities. Such differences are called speech variability. To take it into account, new statistical methods were proposed, based mainly on the mathematical tools of Hidden Markov Models (HMMs) or Artificial Neural Networks; the best results are achieved by combining the two. Instead of creating templates for each word, templates are created for the individual sounds that make up words, the so-called acoustic models. Acoustic models are formed by statistical processing of large speech databases containing recordings of hundreds of speakers. Existing speech recognition systems use two fundamentally different approaches:
Recognition of voice tags: recognizing speech fragments against pre-recorded samples. This approach is widely used in relatively simple systems designed to execute pre-recorded speech commands.
Recognition of lexical elements: extracting the simplest lexical elements, such as phonemes and allophones, from speech. This approach suits text dictation systems, in which spoken sounds are fully converted into text.
A survey of various Internet sources identifies the following software products that address the speech recognition problem, along with their main characteristics:
Gorynych PROF is an easy-to-use program for recognizing spoken speech and typing by dictation with Russian-language support. It is based on Russian developments in the field of speech recognition.
Features:
speaker dependence;
language dependence (Russian and English);
recognition accuracy depends on the engine of the American program Dragon Dictate;
provides voice control of individual functions of the operating system, text editors, and application programs;
requires training.
VoiceNavigator is a high-tech solution for contact centers, designed for building voice self-service systems. VoiceNavigator automates call handling using speech synthesis and speech recognition technologies.
Features:
speaker independence;
resistance to ambient noise and interference in the telephone channel;
Russian speech recognition works with 97% reliability (100-word dictionary).
Speereo Speech Recognition performs recognition directly on the device rather than on a server, which the developers consider its key advantage.
Features:
Russian speech recognition works with a reliability of about 95%;
speaker independence;
vocabulary of about 150 thousand words;
simultaneous support for multiple languages;
compact size of the engine.
Sakrament ASR Engine (developed by Sakrament) is a speech recognition technology used to create voice control tools, i.e., programs that control the actions of a computer or other electronic device by voice commands, and to organize telephone help and information services.
Features:
speaker independence;
language independence;
recognition accuracy reaches 95-98%;
recognition of speech in the form of phrases and short sentences;
no training capability.
Google Voice Search is built into the Google Chrome browser, making the service available across multiple platforms.
Features:
Russian language support;
the ability to embed speech recognition on web resources;
voice commands, phrases;
requires a permanent Internet connection.
Dragon NaturallySpeaking (Nuance) is the world leader in human speech recognition software. It allows creating new documents, sending e-mail, and controlling popular browsers and a variety of applications through voice commands.
Features:
there is no support for the Russian language;
recognition accuracy up to 99%.
ViaVoice (IBM) is a software product also used in hardware implementations. ProVox Technologies built VoxReports, a system for dictating radiologists' reports, on this engine.
Features:
recognition accuracy reaches 95-98%;
speaker independence;
the system dictionary is limited to a set of specific terms.
Sphinx is the best-known workable open-source speech recognition software today. It is developed at Carnegie Mellon University, distributed under the Berkeley Software Distribution (BSD) license, and available for both commercial and non-commercial use.
Features:
speaker independence;
continuous speech recognition;
learnability;
availability of version for embedded systems - Pocket Sphinx.
Thus, the review showed that the market is dominated by software products targeted at large numbers of users; they are speaker-independent and, as a rule, distributed under proprietary licenses, which significantly limits their use for controlling computer systems by people with disabilities. Voice control systems for specialized tools, such as smart homes or exoskeletons, are not universal. However, interest in new technologies is growing, and various devices, including household appliances, can now be controlled via mobile communications and Bluetooth. User-oriented voice control technology can improve the quality of everyday life and the social adaptation of people with disabilities.
MATHEMATICAL APPARATUS FOR RECOGNIZING THE SPEAKER'S SPEECH AND ITS PECULIARITIES
To solve the problem posed in this work, we analyze the requirements for the DIVA system.
The system should:
be speaker-dependent;
be trainable on the particular pronunciation of a particular user;
recognize a certain number of voice tags and translate them into control commands.
The voice interface should be dictation-based, with a limited vocabulary.
Voice commands are a sound wave, and a sound wave can be represented as the spectrum of frequencies it contains. Digital sound is a way of representing an electrical signal by discrete numerical values of its amplitude. The input for the voice interface is a sound file in RAM; when the file is fed to the neural network, the program produces the corresponding result.
Digitization is the fixing of the signal amplitude at certain time intervals and the registration of the obtained amplitude values as rounded digital values. Digitizing a signal involves two processes: sampling and quantization.
Sampling is the process of taking the values of the converted signal at a certain time step, called the sampling step. The number of signal measurements performed per second is called the sampling rate (or sampling frequency). The smaller the sampling step, the higher the sampling rate and the more accurate our representation of the signal.
Quantization is the process of replacing the real values of the signal amplitude with approximate values of some accuracy. Each of the 2^N possible levels is called a quantization level, and the distance between two adjacent quantization levels is called the quantization step. If the amplitude scale is divided into levels linearly, the quantization is called linear or uniform.
The recorded amplitude values of the signal are called samples. The higher the sampling rate and the more quantization levels, the more accurate the digital representation of the signal.
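As an illustration of these definitions, the sketch below samples a signal and applies linear quantization. The parameters (a 440 Hz tone, an 8 kHz sampling rate, 8-bit quantization) are assumed for illustration only:

```python
import math

def sample_signal(freq_hz, sample_rate, duration_s):
    """Sampling: take amplitude values every 1/sample_rate seconds."""
    n = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]

def quantize(samples, bits):
    """Linear quantization: map amplitudes in [-1, 1] onto 2**bits levels."""
    levels = 2 ** bits
    step = 2.0 / (levels - 1)  # quantization step (distance between levels)
    return [round((s + 1.0) / step) * step - 1.0 for s in samples]

samples = sample_signal(440, 8000, 0.01)  # 80 samples of a 440 Hz tone
digital = quantize(samples, 8)            # 256 quantization levels
```

With linear quantization the rounding error of each sample is at most half a quantization step, which is why more bits (smaller steps) give a more accurate digital representation.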
As a mathematical tool for selecting characteristic features, it is advisable to use a neural network that can learn and automatically select the necessary features. This makes it possible to train the system on a particular user's pronunciation of voice commands. Comparing the mechanisms of different neural networks, we selected the two most appropriate: the Kosko network and the Kohonen network.
A Kohonen self-organizing map is a neural network with unsupervised learning that performs visualization and clustering. It is a method of projecting a multidimensional space into a lower-dimensional one (most often two-dimensional), and it is also used for modeling, forecasting, identifying sets of independent features, searching for patterns in large data arrays, and developing computer games. It is one of the variants of Kohonen's neural networks.
The Kohonen network is suitable because it can automatically partition training examples into clusters, with the number of clusters specified by the user. After training, the network can determine which cluster an input example belongs to and output the corresponding result.
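A minimal sketch of this clustering behavior, assuming a simple competitive-learning rule without a neighborhood function (so it is closer to an online winner-take-all clusterer than a full self-organizing map); all parameters are illustrative:

```python
import random

def train_kohonen(examples, n_clusters, epochs=50, lr=0.5, seed=0):
    """Competitive learning: each step, move the winning weight vector
    (the cluster nearest to the input) toward the input example."""
    rng = random.Random(seed)
    dim = len(examples[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_clusters)]
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)  # decaying learning rate
        for x in examples:
            # find the winner: the weight vector closest to the input
            w = min(weights, key=lambda wv: sum((a - b) ** 2 for a, b in zip(wv, x)))
            for k in range(dim):
                w[k] += rate * (x[k] - w[k])
    return weights

def classify(weights, x):
    """Return the index of the cluster whose weights are nearest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(weights[i], x)))
```

After training, `classify` plays the role described in the text: it maps an input feature vector to the cluster (and hence the command) it belongs to.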
The Kosko network, or bidirectional associative memory (BAM), is a single-layer neural network with feedback based on two ideas: Stephen Grossberg's adaptive resonance theory and Hopfield's autoassociative memory. BAM is heteroassociative: an input vector is applied to one set of neurons, and the corresponding output vector is produced on a different set of neurons. Like the Hopfield network, BAM is capable of generalization, producing correct responses despite distorted inputs. In addition, adaptive versions of BAM can be implemented that extract the reference image from noisy copies. These capabilities strongly resemble the process of human thinking and allow artificial neural networks to take a step toward brain modeling.
The advantage of this network is that, on the basis of discrete adaptive-resonance neural networks, a new bidirectional associative memory has been developed that can memorize new information without retraining the whole network. This allows the user to add voice tags when needed.
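The heteroassociative recall described above can be sketched with the classic Hebbian outer-product construction of BAM. The bipolar patterns below are illustrative, not the system's actual encodings:

```python
def sign(v):
    return 1 if v >= 0 else -1

def bam_train(pairs):
    """Hebbian outer-product: W = sum over pairs of x^T y (bipolar +/-1 vectors)."""
    n, m = len(pairs[0][0]), len(pairs[0][1])
    w = [[0] * m for _ in range(n)]
    for x, y in pairs:
        for i in range(n):
            for j in range(m):
                w[i][j] += x[i] * y[j]
    return w

def bam_recall(w, x, steps=10):
    """Bounce activations between the two layers until the output stabilizes."""
    n, m = len(w), len(w[0])
    y = [0] * m
    for _ in range(steps):
        y_new = [sign(sum(x[i] * w[i][j] for i in range(n))) for j in range(m)]
        x = [sign(sum(w[i][j] * y_new[j] for j in range(m))) for i in range(n)]
        if y_new == y:
            break
        y = y_new
    return y
```

Feeding a stored input pattern, or a slightly distorted copy of it, returns the associated output pattern, which is the generalization property the text attributes to BAM.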
DESIGNING DIVA
The DIVA software implementation concept contains three stages, implemented in one software product with an ergonomic graphical interface.
1. Collection of training examples.
To train the neural network, the user is prompted to utter several voice tags from the vocabulary. Since the recorded phrases consist of one word, file size does not matter, and for further processing the sound is recorded in the WAV format, a lossless PCM recording format that is standard for further sound processing with the python_speech_features library for Python. Each audio file must be accompanied by its "value" (the corresponding command), which is needed for subsequent training of the neural network.
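A minimal sketch of this collection step using only Python's standard wave module; the directory layout and the label-in-filename convention are assumptions for illustration, not part of the described system:

```python
import struct
import wave

def save_labeled_sample(directory, samples, label, index, sample_rate=16000):
    """Store one voice-tag recording as mono 16-bit PCM WAV. The command
    label is kept in the file name so training can pair audio with its
    'value' (the corresponding command)."""
    filename = f"{directory}/{label}_{index}.wav"
    with wave.open(filename, "wb") as f:
        f.setnchannels(1)        # mono
        f.setsampwidth(2)        # 16-bit PCM
        f.setframerate(sample_rate)
        # clip amplitudes to [-1, 1] and scale to signed 16-bit integers
        frames = b"".join(struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
                          for s in samples)
        f.writeframes(frames)
    return filename
```

The resulting files can then be read back and passed to a feature extractor (such as the python_speech_features library named above) to produce inputs for the neural network.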
2. Neural network training.
The program reads the audio files and generates new ones by changing the length of the audio track as well as the pitch, volume, and tone of the speech. This increases the number of examples in the training sample, which improves recognition quality. The user is then asked to train the network on the previously recorded voice tags. The user can also add training voice tags to the base and retrain the neural network later.
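This augmentation step can be sketched as follows. Note the simplifying assumptions: the naive linear-interpolation resampling changes duration and pitch together rather than independently, and the specific gain and stretch factors are illustrative:

```python
def change_volume(samples, gain):
    """Scale amplitudes (clipped to [-1, 1]) to simulate louder or quieter speech."""
    return [max(-1.0, min(1.0, s * gain)) for s in samples]

def time_stretch(samples, factor):
    """Naive resampling by linear interpolation: factor > 1 lengthens the
    recording (and, played at the same rate, lowers the pitch with it)."""
    n = int(len(samples) * factor)
    out = []
    for i in range(n):
        pos = i / factor
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

def augment(samples):
    """Generate several variants of one recording to enlarge the training set."""
    variants = []
    for factor in (0.9, 1.0, 1.1):
        for gain in (0.8, 1.0, 1.2):
            variants.append(change_volume(time_stretch(samples, factor), gain))
    return variants
```

Each recorded voice tag thus yields nine training examples instead of one, which is the point of the augmentation stage: more varied examples of the same user's pronunciation.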
3. Using the program.
After the program has been trained on the chosen words, the user can start working or add new voice tags for training. The trained neural network can recognize the audio files fed to it.
CONCLUSION
Thus, this research reviewed the current market of voice interfaces and their uses. It was shown that this type of software focuses on speaker-independent voice control and does not take the user's individual characteristics into account, which is especially important for people with disabilities and speech disorders.
The requirements for the voice interface for controlling the "Deep Interactive Voice Assistant (DIVA)" computing system, intended to help people with speech disorders, have been defined.
A mathematical apparatus suitable for implementing the DIVA concept is described, and an algorithm for the software implementation of the voice interface has been drawn up.
Further work involves developing a program with a convenient graphical interface to implement a prototype voice control interface that can be used for various tasks, such as controlling household appliances, computers, and robotic technology (e.g., exoskeletons) for people with disabilities.
Author: weber. Published: 15-11-2018, 00:28. Category: Mathematics / Machine learning.