Grokotron
An experimental local Speech to Text (STT) engine built on Kaldi ASR.
Mycroft AI’s primary mission has always been to create a true privacy-respecting voice assistant. One that is truly a personal assistant rather than a household spying device. A device that does what you want it to do rather than what the mega-corporation that sold it to you wants it to do.
One of the greatest challenges in achieving this has been the lack of a fast, accurate, flexible Speech to Text (STT) engine that can run locally. While the product is still in the early days of development, we believe we finally have an answer to this problem. We call it Grokotron.
Grokotron provides limited domain automatic speech recognition on low-resource hardware like the Raspberry Pi 4 that comes in the Mark II. It does this extremely quickly, and of course completely offline. Grokotron’s impressive accuracy and performance are due to its hybrid nature: it combines an acoustic model with a grammar of expected expressions that constrains its transcription, built on top of the popular open source Kaldi Speech Recognition Toolkit.
You can read more about this experimental release on our blog.
A proof of concept image for use on the Mark II is available for testing. This is based on the Dinkum Sandbox image. By default it will not connect to the internet or pair with the Mycroft backend, and speech recognition is limited to the phrases defined in the sentences.ini file.
The grammar is easy to define and extend with a simple markup language. This ability to be expanded easily means that while the range of expressions Grokotron can process is limited, it can be quite large and can be practically extended to cover nearly anything a voice assistant needs.
Currently the primary grammar is defined in /opt/grokotron/sentences.ini using a variation of the voice2json template language, which itself is a simplified form of the JSpeech Grammar Format (JSGF).
If you look at the default sentences.ini file you will see that it is broken up into sections by Skill name. These are not currently used for anything; they simply make the file more readable for humans.
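As an illustration, a stripped-down sentences.ini might look like the following. The section names and phrases here are hypothetical examples, not the shipped defaults:

```ini
[TimeSkill]
what time is it
what is the [current] time

[WeatherSkill]
what is the weather [like] [today | tomorrow]
```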
Within a sentence, you can specify optional word(s) by surrounding them [with brackets].
The template:
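Judging from the two expansions listed just below, the template would be:

```ini
an example sentence [template]
```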
represents 2 different possible sentences - one with the optional word, and one without:
an example sentence template
an example sentence
Note that if an optional word is required at the beginning of a sentence the opening square bracket must be escaped so that it does not get interpreted as a section heading. For example:
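A sketch of the escaping, using an illustrative phrase:

```ini
\[an] example sentence
```

Without the backslash, the leading `[an]` would be parsed as the start of a new section.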
Where you have a set of options, one of which must be present, we use parentheses () and a pipe delimiter |.
The template:
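Matching the three sentences listed below, the template would be:

```ini
set the light to (red | green | blue)
```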
will represent:
set the light to red
set the light to green
set the light to blue
You can also include alternatives within square brackets, making the whole set of alternatives optional.
So the following sentence template:
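A template producing the three sentences listed below combines brackets with alternatives:

```ini
An example sentence [(with some | that has)] optional words
```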
represents 3 different sentences:
An example sentence with some optional words
An example sentence that has optional words
An example sentence optional words
Where a range of numbers may be needed, they can be defined using two consecutive periods (0..100).
For example the sentence template:
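Consistent with the expansions listed below, such a template could be:

```ini
set the volume to (0..100) percent
```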
Would return 101 sentence variations using all the numbers from 0 to 100. So each of the following sentences would be included, along with everything in between:
set the volume to 0 percent
set the volume to 1 percent
set the volume to 36 percent
set the volume to 100 percent
Rules allow you to reuse common phrases, alternatives, and more. A rule is defined alongside your sentences as rule_name = body, and referenced elsewhere by <rule_name>.
The template above with colors could be rewritten as:
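A sketch of that rewrite, with an assumed section name:

```ini
[light]
colors = (red | green | blue)
set the light to <colors>
```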
which will represent the same 3 sentences as above. Importantly, you can share rules across intents by prefixing the rule’s name with the intent name followed by a dot:
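A sketch of cross-section sharing, assuming a light section that defines colors and a background section that reuses it:

```ini
[light]
colors = (red | green | blue)
set the light to <colors>

[background]
set the background to <light.colors>
```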
The second section (background) references the colors rule from the light section.
Rules may also reference each other, for example:
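A hypothetical illustration, where the colors rule references a color_names rule:

```ini
[light]
color_names = (red | green | blue)
colors = [light | dark] <color_names>
set the lamp to <colors>
```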
Where many alternatives are required, entity slots can be defined using a $ prefix.
The sentence template:
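Given the slot file mentioned below, the template would be along the lines of:

```ini
set the wallpaper to ($wallpapers)
```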
will look for a file at /opt/grokotron/slots/wallpapers.
In that file we would list all of the options available, such as:
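The values here are purely illustrative, one per line:

```
mountains
beach
forest
```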
As we have used parentheses in the sentence template, each new line in the wallpapers file will be a required term. If we had instead used [square brackets], the slot would be optional.
In the default sentences.ini file you might notice that slots are often followed by a set of {curly braces}. This is a way of defining how the content of a slot can be referenced by other parts of the system, similar to entities in Adapt and Padatious. However, they are not yet used, as the template currently only defines the possible grammar for speech recognition, not intent definitions.
After modifying the Grokotron grammar or any slot files, the model must be retrained by running /opt/grokotron/train.py. If running this as the default pi user, you will first need to set the permissions of the output directory. For example:
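A sketch of the retraining steps. The exact output directory is not documented here, so the chown path is an assumption to adjust for your installation:

```shell
# Assumed path: adjust to the actual Grokotron output directory
sudo chown -R pi:pi /opt/grokotron

# Regenerate the model from sentences.ini and the slot files
python3 /opt/grokotron/train.py
```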
Currently the Grokotron proof-of-concept image intentionally does not communicate with our backend server. That means no pairing is required, and the device can run completely offline. However, this also means that the location of the device, and consequently its time and date settings, are set to default values for Kansas City, Missouri.
These can be updated within the mycroft.conf files on the device. If you have an existing Mark II setup, the quickest way to get these values is to copy the location block from the remote configuration stored at: ~/.config/mycroft/mycroft.remote.conf
For example:
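The location block has roughly the following shape. The field values below are placeholders and the exact field names may differ between releases, so copy the real block from mycroft.remote.conf rather than hand-writing it:

```json
{
  "location": {
    "city": {
      "name": "Lawrence",
      "state": {
        "name": "Kansas",
        "country": { "name": "United States", "code": "US" }
      }
    },
    "coordinate": { "latitude": 38.971, "longitude": -95.235 },
    "timezone": { "code": "America/Chicago", "name": "Central Standard Time" }
  }
}
```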
Once in place, restart the Mycroft Dinkum services to ensure it takes effect system wide:
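The restart can be done per-service or with a reboot. The unit name pattern below is an assumption, so list the units actually present on your image first:

```shell
# See which Dinkum units exist on this image
systemctl list-units 'mycroft*'

# Restart them (systemctl accepts shell-style globs for loaded units)
sudo systemctl restart 'mycroft*'

# Or simply reboot the device
sudo reboot
```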