Grokotron

An experimental local Speech to Text (STT) engine built on Kaldi ASR.

Mycroft AI’s primary mission has always been to create a truly privacy-respecting voice assistant. One that is a genuine personal assistant rather than a household spying device. A device that does what you want it to do, rather than what the mega-corporation that sold it to you wants it to do.

One of the greatest challenges in achieving this has been the lack of a fast, accurate, flexible Speech to Text (STT) engine that can run locally. While the product is still in the early days of development, we believe we finally have an answer to this problem. We call it Grokotron.

Grokotron provides limited domain automatic speech recognition on low-resource hardware like the Raspberry Pi 4 that comes in the Mark II. It does this extremely quickly, and of course completely offline. Grokotron’s impressive accuracy and performance are due to its hybrid nature: built on top of the popular open source Kaldi Speech Recognition Toolkit, it combines an acoustic model with a grammar of expected expressions that constrains its transcription.

You can read more about this experimental release on our blog.

Download

A proof of concept image for use on the Mark II is available for testing. It is based on the Dinkum Sandbox image. By default it will not connect to the internet or pair with the Mycroft backend, and speech recognition is limited to the phrases defined in the sentences.ini file.

Defining Grammar

The grammar is easy to define and extend using a simple markup language. This means that while the range of expressions Grokotron can process is limited, it can be quite large and can practically be extended to cover nearly anything a voice assistant needs.

Currently the primary grammar is defined in /opt/grokotron/sentences.ini using a variation of the voice2json template language, which itself is a simplified form of the JSpeech Grammar Format (JSGF).

Sections

If you look at the default sentences.ini file you will see that it is broken up into sections by Skill name. These sections are not currently used for anything; they exist purely to make the file more readable for humans.
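For example, a sentences.ini file might contain sections like the following (the Skill names and sentences here are purely illustrative, not the actual defaults):

[TimerSkill]
set a timer for five minutes
cancel the timer

[WeatherSkill]
what is the weather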

Optional Words

Within a sentence, you can specify optional word(s) by surrounding them [with brackets].

The template:

an example sentence [template]

represents 2 different possible sentences - one with the optional word, and one without:

  1. an example sentence template

  2. an example sentence

Note that if an optional word is required at the beginning of a sentence, the opening square bracket must be escaped so that it does not get interpreted as a section heading. For example:

[SomeSkill]
\[no longer] a problem sentence

Alternatives

Where you have a set of options, one of which must be present, use parentheses () and a pipe delimiter (|).

The template:

set the light to (red | green | blue)

will represent:

  1. set the light to red

  2. set the light to green

  3. set the light to blue

Optional Alternatives

You can also include alternatives within square brackets to define optional alternatives.

So the following sentence template:

An example sentence [with some | that has] optional words

represents 3 different sentences:

  1. An example sentence with some optional words

  2. An example sentence that has optional words

  3. An example sentence optional words

Number ranges

Where a range of numbers may be needed, it can be defined using two consecutive periods (0..100).

For example, the sentence template:

set the volume to (0..100) percent

would represent 101 sentence variations, one for each number from 0 to 100. So each of the following sentences would be included, along with everything in between:

  • set the volume to 0 percent

  • set the volume to 1 percent

  • set the volume to 36 percent

  • set the volume to 100 percent

Rules

Rules allow you to reuse common phrases, alternatives, etc. Rules are defined by

rule_name = ... 

alongside your sentences, and referenced as <rule_name>.

The template above with colors could be rewritten as:

colors = (red | green | blue)
set the light to <colors>

which will represent the same 3 sentences as above. Importantly, you can share rules across sections by prefixing the rule’s name with the section name followed by a dot:

[light]
colors = (red | green | blue)
set the light to <colors>

[background]
set the background to <light.colors>

The second section (background) references the colors rule from the light section.

Rules may also reference each other, for example:

seconds = ((1){seconds} second | (2..59){seconds} seconds)
minutes = ((1){minutes} minute | (2..59){minutes} minutes)
hours = ((1){hours} hour | (2..59){hours} hours)
time = (<seconds> | <minutes> [[and] <seconds>] | <hours> [[and] <minutes>] [[and] <seconds>])

Slots

Where many alternatives are required, entity slots can be defined using a $ prefix.

The sentence template:

change wallpaper to ($wallpapers)

will look for a file at /opt/grokotron/slots/wallpapers. In that file we would list all of the available options, such as:

default
river
sea
earth
moon
nebula
city
blue
green
orange

Because we used parentheses in the sentence template, one of the terms listed in the wallpapers file is required. If we instead used [square brackets], the slot would be optional.
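For instance, assuming the general [optional] syntax applies to slot references in the same way, the optional form of the template might be written as:

change wallpaper to [$wallpapers]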

In the default sentences.ini file you might notice that slots are often followed by a set of {curly braces}. This is a way of defining how the content of a slot can be referenced by other parts of the system, similar to entities in Adapt and Padatious. However, these tags are not yet used, as the template currently only defines the possible grammar for speech recognition, not intent definitions.
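As a hypothetical illustration, a tagged slot could look like the following, where the tag name wallpaper is just an example:

change wallpaper to ($wallpapers){wallpaper}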

Retraining

After modifying the Grokotron grammar or any slot files, the model must be retrained by running /opt/grokotron/train.py.

If running this as the default pi user, you will first need to set the permissions of the output directory. For example:

sudo chown -R pi:pi /opt/grokotron/output
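Putting these steps together, a typical retraining workflow might look something like this (the choice of editor is yours; adjust paths if your install differs):

# one-time: give the pi user write access to the output directory
sudo chown -R pi:pi /opt/grokotron/output

# edit the grammar or slot files, then retrain the model
nano /opt/grokotron/sentences.ini
/opt/grokotron/train.py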

Setting the location, time and date

Currently the Grokotron proof-of-concept image intentionally does not communicate with our backend server. That means no pairing is required, and the device can run completely offline. However, this also means that the device’s location, and with it the time and date, is set to a default value of Kansas City, Missouri.

These can be updated within the mycroft.conf files on the device. If you have an existing Mark II setup, the quickest way to get these values is to copy the location block from the remote configuration stored at: ~/.config/mycroft/mycroft.remote.conf

For example:

{
  "location": {
    "city": {
      "code": "Lawrence",
      "name": "Lawrence",
      "state": {
        "code": "KS",
        "name": "Kansas",
        "country": {
          "code": "US",
          "name": "United States"
        }
      }
    },
    "coordinate": {
      "latitude": 38.971669,
      "longitude": -95.23525
    },
    "timezone": {
      "code": "America/Chicago",
      "name": "Central Standard Time",
      "dstOffset": 3600000,
      "offset": -21600000
    }
  }
}
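To apply these values on the Grokotron device, you could paste the location block into the user-level configuration file. The path below assumes the standard Mycroft user configuration location, ~/.config/mycroft/mycroft.conf; adjust it if your setup differs:

# on an existing Mark II, view the remote config to copy its location block
cat ~/.config/mycroft/mycroft.remote.conf

# on the Grokotron device, paste the location block into the local user config
nano ~/.config/mycroft/mycroft.conf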

Once in place, restart the Mycroft Dinkum services to ensure the change takes effect system-wide:

sudo systemctl restart mycroft-dinkum.target
