As Mycroft is a privacy-focused project and community, many people are interested in fully offline or self-hosted options. Mycroft has intentionally been built in a modular fashion, so this is possible. However, it is not easy and is unlikely to provide an equivalent user experience.
To achieve this we need to look at three key technologies: backend services provided by Home.mycroft.ai; speech recognition, or speech-to-text (STT); and speech synthesis, or text-to-speech (TTS). For backend services, the official backend, known as Selene, is available on GitHub under the AGPL v3.0 license. Alternatively, you can use the simpler, community-developed Personal Backend. You can choose to run your own STT service such as Mozilla DeepSpeech or Kaldi; however, in our opinion these do not yet provide sufficient accuracy for mainstream usage. Finally, to generate speech on device, simply select the British Male voice. The more realistic-sounding voices are generated on Mycroft servers and require significant hardware to synthesize speech within a reasonable time frame.
If you are running your own services, you can direct your Mycroft installation to them in the mycroft.conf file.
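As a rough illustration of the kind of override involved, the sketch below builds a set of settings pointing at self-hosted services and merges them into an existing configuration. The key names (`server`, `stt`, `tts`, `deepspeech_server`, `uri`) and both URLs are assumptions made for this example, so check them against the default mycroft.conf shipped with your version before relying on them.

```python
import json

# Hypothetical addresses: replace with wherever your own services listen.
BACKEND_URL = "http://localhost:6712"
STT_URL = "http://localhost:8080/stt"


def build_overrides(backend_url, stt_url):
    """Build settings that point Mycroft at self-hosted services.

    The key names used here are assumptions; verify them against the
    default mycroft.conf for your installation.
    """
    return {
        "server": {"url": backend_url, "version": "v1"},
        "stt": {
            "module": "deepspeech_server",
            "deepspeech_server": {"uri": stt_url},
        },
        "tts": {"module": "mimic"},
    }


def deep_merge(base, overrides):
    """Return a copy of base with overrides applied, recursing into dicts."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


if __name__ == "__main__":
    # A stand-in for settings already present in a user's mycroft.conf.
    existing = {"lang": "en-us", "tts": {"module": "some_remote_voice"}}
    merged = deep_merge(existing, build_overrides(BACKEND_URL, STT_URL))
    print(json.dumps(merged, indent=2))
```

Merging rather than overwriting keeps any unrelated settings already in the file, which matters because mycroft.conf usually holds more than the three services discussed here.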
When you trigger a device using the wake word (e.g. "Hey Mycroft"), it is using one of two systems: Precise or PocketSphinx. Precise is trained on samples of other people saying the same thing. Whenever it hears something, it reports its confidence that the sound was the wake word.
The training data we have was collected from our existing community of users, a large proportion of whom are adult males from the Midwest of the USA. Because of this bias in our data, there is also a bias in our wake word models. We are working to fix this; however, currently it means that Mycroft has more difficulty hearing the wake word from women, children, and those with other accents.
You can increase or decrease the likelihood that Precise will report a match; however, this requires some experimentation. If the setting is too permissive you may end up with a lot of false activations; if it is too strict, Mycroft may stop responding at all.
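The trade-off can be pictured as a simple threshold on the reported confidence. The sketch below is not Precise's actual API or configuration; the function name, the threshold values, and the scores are all made up to show why a low threshold causes false activations and a high one causes missed wake words.

```python
def count_activations(confidences, threshold):
    """Count how many detections would trigger at a given threshold."""
    return sum(1 for confidence in confidences if confidence >= threshold)


# Made-up confidence scores for five sounds heard by the listener.
scores = [0.15, 0.42, 0.58, 0.77, 0.93]

if __name__ == "__main__":
    for threshold in (0.2, 0.5, 0.95):
        activations = count_activations(scores, threshold)
        print(f"threshold={threshold}: {activations} activations")
    # threshold=0.2: 4 activations  (likely includes false activations)
    # threshold=0.5: 3 activations
    # threshold=0.95: 0 activations (the device never responds)
```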
If you are running Mycroft on older hardware, it's also possible that Precise is not supported and the system has fallen back to using PocketSphinx. This fallback system is not as accurate and results vary wildly. You can find out which system Mycroft is using by asking:
Hey Mycroft, what is the active listener?
If Mycroft never activates at all, there might be an issue with your microphone. For this, check out our audio troubleshooting guide.
By default, to answer a request Mycroft:
Detects the wake word
Records 3 to 10 seconds of audio
Sends this audio to a cloud-based speech-to-text (STT) service
Transcribes the audio and returns the text transcription
Parses the text to understand the intent
Sends the text to the intent handler with the highest confidence
Allows the Skill to perform some action and provide the text to be spoken
Synthesizes audio from the given text, either locally or remotely, depending on the text-to-speech (TTS) engine in use
Plays the synthesized spoken audio.
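The intent-selection step in the list above can be sketched as follows. This is a toy model rather than Mycroft's real intent parser: `IntentMatch`, `keyword_matcher`, `pick_handler`, and `answer` are hypothetical names, and the keyword-overlap confidence score is an assumption chosen only to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class IntentMatch:
    skill: str
    confidence: float
    handler: Callable[[str], str]


def keyword_matcher(skill, keywords, reply):
    """Build a toy matcher whose confidence is the share of keywords heard."""
    def match(utterance: str) -> Optional[IntentMatch]:
        words = utterance.lower().split()
        hits = sum(1 for keyword in keywords if keyword in words)
        if hits == 0:
            return None
        return IntentMatch(skill, hits / len(keywords), lambda _text: reply)
    return match


def pick_handler(matches: List[IntentMatch]) -> Optional[IntentMatch]:
    """Return the match with the highest confidence, or None if there is none."""
    if not matches:
        return None
    return max(matches, key=lambda m: m.confidence)


def answer(utterance: str, matchers) -> str:
    """Run every matcher on the transcription and call the best handler."""
    matches = [m for m in (matcher(utterance) for matcher in matchers) if m]
    best = pick_handler(matches)
    if best is None:
        return "Sorry, I don't understand."
    return best.handler(utterance)


MATCHERS = [
    keyword_matcher("time", ["what", "time"], "It is noon."),
    keyword_matcher("weather", ["what", "weather"], "It is sunny."),
]

if __name__ == "__main__":
    print(answer("what time is it", MATCHERS))      # It is noon.
    print(answer("what is the weather", MATCHERS))  # It is sunny.
```

The text returned by the winning handler is what would then be passed to the TTS engine in the final two steps.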
Throughout this process there are a number of factors that can affect the perceived speed of Mycroft's responses:
System resources - more processing power and memory never hurts!
Network latency - as it is not yet possible to perform everything on device, network latency and connection speed can play a significant role in slowing down response times.
Streaming STT - we have been experimenting with the use of streaming services. These transcribe audio as it is received, rather than waiting for the entire utterance to finish and then sending the resulting audio file to a server to be processed in its entirety. It is possible to switch to a streaming STT service; however, at present this is not available by default and requires a paid third-party service. See Switching STT Engines for a list of options available.
Dialog structure - a long sentence will always take more time to synthesize than a short one. For this reason Mycroft breaks up longer dialog into chunks and returns one to speak whilst the next is being generated. Skill developers can help provide quicker response times by considering the structure of their dialog and breaking that dialog up using punctuation in appropriate places.
TTS Caching - synthesized audio is cached, so common or recently generated phrases don't need to be synthesized again and can be returned immediately.
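The last two factors, dialog chunking and TTS caching, can be illustrated together. The sketch below is a simplified model and not Mycroft's implementation: `split_dialog` and `CachingSynthesizer` are hypothetical names, the cache is a plain in-memory dictionary, and the `synthesize` function passed in stands for whichever TTS engine is in use.

```python
import re


def split_dialog(text):
    """Split dialog after sentence-ending punctuation, keeping the punctuation."""
    chunks = re.split(r"(?<=[.!?])\s+", text.strip())
    return [chunk for chunk in chunks if chunk]


class CachingSynthesizer:
    """Wrap a synthesize(text) -> audio function with an in-memory cache."""

    def __init__(self, synthesize):
        self._synthesize = synthesize
        self._cache = {}

    def get_audio(self, chunk):
        # Only call the (slow) engine for chunks that have not been seen before.
        if chunk not in self._cache:
            self._cache[chunk] = self._synthesize(chunk)
        return self._cache[chunk]

    def speak(self, text):
        # Yield audio one chunk at a time, so the caller can start playing
        # the first chunk before later chunks have been synthesized.
        for chunk in split_dialog(text):
            yield self.get_audio(chunk)


if __name__ == "__main__":
    def fake_engine(chunk):
        # Stand-in for a real TTS engine: returns bytes instead of audio.
        print(f"synthesizing: {chunk}")
        return chunk.upper().encode("utf-8")

    tts = CachingSynthesizer(fake_engine)
    list(tts.speak("Hello there. How are you?"))
    # The second call only synthesizes "Goodbye."; "Hello there." is cached.
    list(tts.speak("Hello there. Goodbye."))
```

Because the cache is keyed on the chunk text, shorter and more consistently punctuated dialog is also more likely to produce cache hits.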
The best answer is provided by @Thorsten who documented their journey to create a custom TTS model in German.
It is worth noting that training your own TTS model is a significant investment of time. We strongly recommend watching Thorsten's entire video before you get started. If a one-hour video seems too long, be warned that the process will take a minimum of weeks, and more likely months.
There are exciting new projects that may soon enable us to generate new voices from only minutes of recorded audio. However, training a voice currently requires 16+ hours of very consistent, high-quality audio, with the associated text metadata.
To capture this training data we have the Mimic Recording Studio. Note that this generates audio files, which can be used to train TTS models using a range of technologies, not just Mycroft's Mimic.
No. Purchases from Mycroft do not currently include any taxes or other importation fees. Unless otherwise stated, all products are shipped from the USA. This means that a product being shipped to another country may incur additional taxes and import fees. These are the sole responsibility of the customer and Mycroft will not reimburse any costs associated with these local fees and taxes.