Recognizing speech in Python with pocketsphinx, or how I tried to make a voice assistant

This is a tutorial on using the pocketsphinx library from Python. I hope it helps you get to grips with the library quickly and avoid stepping on the rakes I did.

Like many people writing a voice assistant in Python, I started with the speech_recognition library; as it turned out, I'm far from the only one. For recognition I used Google Speech Recognition, because it was the only engine that required no keys, passwords, and so on. For speech synthesis I took gTTS. What came out was almost a clone of this assistant, which is why I couldn't leave it at that.

And I couldn't leave it at that for good reasons: the answer took a long time to arrive (recording didn't stop right away, and sending the audio to the server for recognition and the text for synthesis took a while); speech wasn't always recognized correctly; you couldn't sit more than half a meter from the microphone and had to speak clearly; the speech Google synthesized sounded horrible; and there was no activation phrase, so sound was constantly being recorded and sent to the server.
 
The first improvement was speech synthesis via Yandex SpeechKit Cloud:
 
import requests

# text, key and speech_file_name are defined elsewhere in the assistant
URL = ('https://tts.voicetech.yandex.net/generate?text=' + text +
       '&format=wav&lang=ru-RU&speaker=ermil&key=' + key +
       '&speed=1&emotion=good')
response = requests.get(URL)
if response.status_code == 200:
    with open(speech_file_name, 'wb') as f:
        f.write(response.content)
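Since the text is concatenated straight into the URL, spaces and Cyrillic characters should really be percent-encoded. A safer way to build the same request URL, sketched with only the standard library (build_tts_url is my own helper name and the key value is a placeholder):

```python
from urllib.parse import urlencode


def build_tts_url(text, key):
    """Build a Yandex SpeechKit TTS URL with properly escaped parameters."""
    params = {
        'text': text,
        'format': 'wav',
        'lang': 'ru-RU',
        'speaker': 'ermil',
        'key': key,
        'speed': '1',
        'emotion': 'good',
    }
    return 'https://tts.voicetech.yandex.net/generate?' + urlencode(params)


print(build_tts_url('привет', 'YOUR_KEY'))
```

urlencode takes care of UTF-8 percent-encoding, so phrases with spaces or punctuation no longer produce malformed requests.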

 
Then came recognition's turn. I was immediately intrigued by the note "CMU Sphinx (works offline)" on the library's page. I won't go over the basic concepts of pocketsphinx, because chubakur already did that before me (many thanks to him) in this post.
 
Installing Pocketsphinx
 
I'll say right away that pocketsphinx is not that easy to install (at least I did not manage it the simple way): a plain pip install pocketsphinx will not work, it fails with an error complaining about wheel. To install pocketsphinx, go here and download the installer (msi). Please note: the installer is only for Python 3.5!
 

Speech recognition with pocketsphinx

Pocketsphinx can recognize speech both from a microphone and from a file. It can also search for hot phrases (this didn't quite work out for me: for some reason the code that was supposed to run on the hot word fired several times, even though I said it only once). Pocketsphinx differs from cloud solutions in that it works offline and can work with a limited dictionary, which improves accuracy. If you're interested, there are examples on the library page. Note the "Default config" item.


 

Russian language and acoustic models

Out of the box, pocketsphinx ships with English language and acoustic models and a dictionary. Russian ones can be downloaded via this link. Unpack the archive, then move the folder /zero_ru_cont_8k_v3/zero_ru.cd_cont_4000 into the folder C:/Users/tutam/AppData/Local/Programs/Python/Python35-32/Lib/site-packages/pocketsphinx/model, where zero_ru_cont_8k_v3 is the folder you unpacked the archive into. The moved folder is the acoustic model. Do the same with the files ru.lm and ru.dic from /zero_ru_cont_8k_v3/: ru.lm is the language model and ru.dic is the dictionary. If everything is in place, the code below should work.
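Optionally, the manual moving of the model files can be scripted (a sketch; install_models is my own helper name, and the paths in the commented-out call are the ones from this article, which will differ on your machine):

```python
import os
import shutil


def install_models(src, dst):
    """Move the acoustic model folder plus the language model and
    dictionary from the unpacked archive (src) into pocketsphinx's
    model folder (dst)."""
    for name in ['zero_ru.cd_cont_4000', 'ru.lm', 'ru.dic']:
        shutil.move(os.path.join(src, name), os.path.join(dst, name))


# Paths from this article; adjust them to your machine:
# install_models('zero_ru_cont_8k_v3',
#                r'C:\Users\tutam\AppData\Local\Programs\Python'
#                r'\Python35-32\Lib\site-packages\pocketsphinx\model')
```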


 
import os
from pocketsphinx import LiveSpeech, get_model_path

model_path = get_model_path()
speech = LiveSpeech(
    verbose=False,
    sampling_rate=16000,
    buffer_size=2048,
    no_search=False,
    full_utt=False,
    hmm=os.path.join(model_path, 'zero_ru.cd_cont_4000'),
    lm=os.path.join(model_path, 'ru.lm'),
    dic=os.path.join(model_path, 'ru.dic')
)
print("Say something!")
for phrase in speech:
    print(phrase)

 

First check that your microphone is connected and working. If "Say something!" doesn't appear for a long time, that's normal: most of that time goes into creating the LiveSpeech instance, which takes so long because the Russian language model weighs more than 500 (!) MB. For me, creating LiveSpeech takes about two minutes.


 

This code should recognize almost any phrase you say. The accuracy, you'll agree, is terrible. But it can be fixed, and the time it takes to create LiveSpeech can be reduced as well.


 

JSGF


 

Instead of a language model, you can make pocketsphinx work with a simplified grammar. For this a jsgf file is used. Using it speeds up the creation of the LiveSpeech instance. How to write grammar files is described here. If a language model is present, the jsgf file is ignored, so if you want to use your own grammar file you need to write this:


 
speech = LiveSpeech(
    verbose=False,
    sampling_rate=16000,
    buffer_size=2048,
    no_search=False,
    full_utt=False,
    hmm=os.path.join(model_path, 'zero_ru.cd_cont_4000'),
    lm=False,
    jsgf=os.path.join(model_path, 'grammar.jsgf'),
    dic=os.path.join(model_path, 'ru.dic')
)

 

Naturally, the grammar file must be created in the folder C:/Users/tutam/AppData/Local/Programs/Python/Python35-32/Lib/site-packages/pocketsphinx/model. One more thing: when using jsgf you will have to speak more clearly and separate your words.
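For reference, a minimal grammar file for a two-word yes/no assistant might look like this (an illustrative sketch; the grammar name and rule are my own example, not from the article):

```
#JSGF V1.0 UTF-8;
grammar commands;
public <command> = да | нет;
```

Every word used in a public rule must also be present in the dictionary, otherwise pocketsphinx will not be able to match it.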


 

Creating your own dictionary


 

A dictionary is a set of words and their transcriptions; the smaller it is, the higher the recognition accuracy. To create a dictionary with Russian words, use the ru4sphinx project (transferring 1000 rubles to my account and asking me to compose one also works). Download and unpack it. Then open a text editor and write the words that should go into the dictionary, each on its own line; save the file as my_dictionary.txt in the text2dict folder, in UTF-8 encoding. Then open the console and run:

C:\Users\tutam\Downloads\ru4sphinx-master\ru4sphinx-master\text2dict> perl dict2transcript.pl my_dictionary.txt my_dictionary_out.txt

Open my_dictionary_out.txt and copy its contents. Open the editor again, paste the copied text, and save the file as my_dict.dic (choose "all files" instead of "text file"), in UTF-8 encoding. Move my_dict.dic into the pocketsphinx model folder and point the dic parameter at it:


 
speech = LiveSpeech(
    verbose=False,
    sampling_rate=16000,
    buffer_size=2048,
    no_search=False,
    full_utt=False,
    hmm=os.path.join(model_path, 'zero_ru.cd_cont_4000'),
    lm=os.path.join(model_path, 'ru.lm'),
    dic=os.path.join(model_path, 'my_dict.dic')
)

 
Some transcriptions may need to be corrected.
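To correct them, it helps to know that a .dic file is just plain text: one word per line followed by its phoneme transcription, separated by spaces. A quick sanity check can be done in Python (a sketch; load_dic is my own helper, and the sample words and phones below are illustrative rather than actual ru4sphinx output):

```python
def load_dic(path):
    """Parse a pocketsphinx dictionary file into {word: phonemes}."""
    entries = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            if parts:
                entries[parts[0]] = parts[1:]
    return entries


# write an illustrative two-word dictionary (the phones are made up)
with open('my_dict.dic', 'w', encoding='utf-8') as f:
    f.write('да d a\nнет n e t\n')

print(load_dic('my_dict.dic'))
```

If a word fails to recognize, look at its line here first: a wrong or missing transcription is the usual culprit.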
 
Using pocketsphinx via speech_recognition
 
Using pocketsphinx via speech_recognition only makes sense if you are recognizing English speech. In speech_recognition you cannot pass an empty language model and use jsgf, so recognizing each fragment takes about 2 minutes. Tested.
 
The result
 
After a few evenings I realized I had wasted my time. Even with a dictionary of two words (yes and no), Sphinx manages to make mistakes, and often. It eats 30-40% of a Celeron, and with a language model loaded, a fat chunk of memory on top. Meanwhile Yandex recognizes almost any speech unerringly, without eating memory or CPU. So think for yourself whether it's worth taking on at all.
 
P.S.: this is my first post, so I'm looking forward to advice on the design and content of the article.
Which speech recognition solution do you like best?
Sphinx
Yandex SpeechKit Cloud
Google Cloud Speech API
Your own