Mycroft is a privacy-focused project and community, so many people are interested in fully offline or self-hosted options. Mycroft has intentionally been built in a modular fashion, so this is possible; however, it is not easy, and it is unlikely to provide an equivalent user experience.
To achieve this we need to look at three key technologies: the backend services provided by Home.mycroft.ai; speech recognition, or speech-to-text (STT); and speech synthesis, or text-to-speech (TTS). For backend services, the official backend, known as Selene, is available on GitHub under the AGPL v3.0 license; alternatively, you can use the simpler community-developed Personal Backend. You can run your own STT service, such as Mozilla DeepSpeech or Kaldi; however, in our opinion these do not yet provide sufficient accuracy for mainstream usage. Finally, to generate speech on-device, select the British Male voice. The more realistic-sounding voices are generated on Mycroft's servers and require significant hardware to synthesize speech within a reasonable time frame.
If you are running your own services, your Mycroft installation can be directed to use them via the mycroft.conf file.
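As a sketch, a user-level mycroft.conf pointing at self-hosted services might look like the following. The hostnames are placeholders, and the exact keys and module names (`deepspeech_server`, `mimic`, the `ap` British Male voice) should be checked against the version of mycroft-core you are running:

```json
{
  "server": {
    "url": "https://backend.example.com",
    "version": "v1"
  },
  "stt": {
    "module": "deepspeech_server",
    "deepspeech_server": {
      "uri": "http://localhost:8080/stt"
    }
  },
  "tts": {
    "module": "mimic",
    "mimic": {
      "voice": "ap"
    }
  }
}
```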
By default, to answer a request Mycroft:

1. Detects the wake word
2. Records 3 - 10 seconds of audio
3. Sends this audio to a cloud-based speech-to-text (STT) service
4. Receives the text transcription of that audio from the STT service
5. Parses the text to understand the intent
6. Sends the text to the intent handler with the highest confidence
7. Allows the Skill to perform some action and provide the text to be spoken
8. Synthesizes audio from the given text, either locally or remotely, depending on the text-to-speech (TTS) engine in use
9. Plays the synthesized spoken audio
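The steps above can be sketched as a simple pipeline. Everything here (function names, the canned transcription and intents) is a hypothetical stub that only illustrates the flow of data, not Mycroft's actual APIs:

```python
# Hypothetical sketch of Mycroft's request pipeline (not the real API).

def listen():
    """Record a few seconds of audio after the wake word is detected (stubbed)."""
    return b"raw-audio-bytes"

def transcribe(audio):
    """Stand-in for the cloud STT call: audio in, text transcription out."""
    return "what time is it"

def determine_intent(text):
    """Pick the intent handler with the highest confidence score."""
    intents = [("time.query", 0.9), ("weather.query", 0.2)]
    return max(intents, key=lambda intent: intent[1])[0]

def handle_skill(intent):
    """The Skill performs its action and returns the dialog to be spoken."""
    return "It is ten o'clock"

def synthesize(text):
    """Stand-in for local or remote TTS: text in, audio out."""
    return b"synthesized-audio:" + text.encode()

def answer_request():
    audio = listen()
    text = transcribe(audio)
    intent = determine_intent(text)
    dialog = handle_skill(intent)
    return synthesize(dialog)  # this audio is then played back to the user
```

Each stage here is a network or compute step in the real system, which is why the factors below matter for responsiveness.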
Through this process there are a number of factors that can affect the perceived speed of Mycroft's responses:

- System resources - more processing power and memory never hurts!
- Network latency - as it is not yet possible to perform everything on-device, network latency and connection speed can play a significant role in slowing down response times.
- Streaming STT - we have been experimenting with the use of streaming services. These transcribe audio as it is received, rather than waiting for the entire utterance to finish and then sending the complete audio file to a server to be processed. It is possible to switch to a streaming STT service; however, at present this is not available by default and requires a paid third-party service. See Switching STT Engines for a list of available options.
- Dialog structure - a long sentence will always take more time to synthesize than a short one. For this reason Mycroft breaks longer dialog into chunks and returns the first to be spoken whilst the next is being generated. Skill developers can help provide quicker response times by considering the structure of their dialog and breaking it up using punctuation in appropriate places.
- TTS caching - synthesized audio is cached, meaning commonly and recently generated phrases don't need to be synthesized again; they can be returned immediately.
The best answer is provided by @Thorsten, who documented their journey to create a custom German TTS model.
It is worth noting that training your own TTS model is a significant investment of time. We strongly recommend watching Thorsten's entire video before you get started. If a one-hour video seems too long, be warned that the process itself will take a minimum of weeks, and more likely months.
There are exciting new projects that may soon enable us to generate new voices from just minutes of recorded audio. Currently, however, it requires 16+ hours of very consistent, high-quality audio, along with the associated text metadata.
To capture this training data we have the Mimic Recording Studio. Note that this generates audio files, which can be used to train TTS models using a range of technologies, not just Mycroft's Mimic.
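As an illustration of what "text metadata" means here, many TTS training pipelines expect an index file mapping each audio clip to its transcript, commonly in the pipe-delimited LJSpeech-style `metadata.csv` convention. The clip names and transcripts below are made up, and the exact layout depends on your recording tool and TTS framework:

```python
import csv
import io

# Hypothetical metadata in the widely used LJSpeech convention:
# one line per recorded clip, "clip_id|transcript".
metadata = """\
clip_0001|Hey Mycroft, what time is it?
clip_0002|Set a timer for ten minutes.
"""

def load_metadata(text):
    """Parse pipe-delimited lines into (clip_id, transcript) pairs."""
    reader = csv.reader(io.StringIO(text), delimiter="|")
    return [(row[0], row[1]) for row in reader if row]

pairs = load_metadata(metadata)
```

Each `clip_id` corresponds to one audio file; consistent pairing of audio and transcript is what makes the recordings usable as training data.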