Mimic 3

A fast, privacy-focused, open-source, neural Text to Speech (TTS) engine.


Mimic 3 is a neural text to speech engine that can run locally, even on low-end hardware like the Raspberry Pi 4. It is the default text to speech engine on the Mark II.

Installation

Hardware Requirements

Mimic 3 was designed to run on the Raspberry Pi 4 (64-bit OS), but will also run on other platforms:

  • amd64

    • AMD/Intel-based desktops/laptops

    • Tested:

      • Very fast on Ryzen 9 5950X, less than 0.05 RTF

  • arm64

    • Raspberry Pi 3/4 and Zero 2 with 64-bit Pi OS

    • Tested:

      • Usable on Pi 4, around 0.5 RTF

  • armv7l

    • Raspberry Pi 1/2/3/4 and Zero 2 with 32-bit Pi OS

    • Tested:

      • Slow on Pi 3, around 1.3 RTF

Real-Time Factor

The performance of a text to speech system is often measured by its real-time factor (RTF). This is the ratio of how long it takes to generate audio to how long the audio is when spoken. In general, lower is better for RTF.

An RTF of 1 means that it took one second of compute time to generate one second of spoken audio. An RTF of 0.5 is better than 1, however, since the same second of spoken audio now only took half a second to generate.
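As an illustrative sketch (not part of Mimic 3 itself), here is one way to estimate the RTF of an installed mimic3 command with Python: time the synthesis, then divide by the duration of the generated WAV. It assumes the voice has already been downloaded, since the first run would otherwise include download time:

import subprocess
import time
import wave

SENTENCE = "The quick brown fox jumps over the lazy dog."

# Time how long synthesis takes (includes process startup and model load).
start = time.monotonic()
with open("rtf_test.wav", "wb") as wav_out:
    subprocess.run(
        ["mimic3", "--voice", "en_UK/apope_low", SENTENCE],
        stdout=wav_out,
        check=True,
    )
compute_seconds = time.monotonic() - start

# Duration of the spoken audio.
with wave.open("rtf_test.wav", "rb") as wav_in:
    audio_seconds = wav_in.getnframes() / wav_in.getframerate()

print(f"RTF = {compute_seconds:.2f}s / {audio_seconds:.2f}s = {compute_seconds / audio_seconds:.2f}")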

Mycroft Devices

Device | Supported | Notes
------ | --------- | -----
Mark II | Full support | Default engine. Runs well locally.
Mark 1 | Partial support | Runs slower than real-time because the Mark 1 contains a Raspberry Pi 3B. It is not recommended at this time.
Picroft | Variable | Varies depending on the hardware. A Raspberry Pi 4 or better is recommended.

Software Requirements

  • Linux

    • Recommended: 64-bit Debian bullseye or Raspberry Pi OS

  • Python 3.7+

    • Recommended: Python 3.9

  • Python packages (see requirements.txt and setup.py)

  • System packages

    • libespeak-ng1

    • libatomic1 (32-bit ARM only)

    • libgomp1 (32-bit ARM only)

    • libatlas-base-dev (32-bit ARM only)

TTS Plugin for Mycroft AI

Install the necessary system packages:

sudo apt-get install libespeak-ng1

On 32-bit ARM platforms (a.k.a. armv7l or armhf), you will also need some extra libraries:

sudo apt-get install libatomic1 libgomp1 libatlas-base-dev

Then, ensure that you're using the latest pip:

mycroft-pip install --upgrade pip

Next, install the TTS plugin in Mycroft:

mycroft-pip install mycroft-plugin-tts-mimic3[all]

Removing [all] will install support for English only. Additional language support can be selectively installed by replacing all with a two-character language code, such as de (German) or fr (French). See setup.py for an up-to-date list of language codes.

Enable the plugin in your mycroft.conf file:

mycroft-config set tts.module mimic3_tts_plug

or you can manually add the following to mycroft.conf with mycroft-config edit user:

"tts": {
  "module": "mimic3_tts_plug"
}

Plugin Configuration Options

A range of configuration options can be added to customize the Mimic 3 TTS output, for example:

{
  "tts": {
    "module": "mimic3_tts_plug",
    "mimic3_tts_plug": {
      "voice": "en_US/cmu-arctic_low",  // voice key
      "speaker": "fem",  // default speaker
      "length_scale": 1.0,  // speaking rate
      "noise_scale": 0.667,  // speaking variablility
      "noise_w": 1.0  // phoneme duration variablility
    }
  }
}
  • voice - a Voice Key defining the TTS model to be used. You can find a list of all available Voice Keys on Github.

  • speaker - for multi-speaker voice models, the default speaker to be used. To hear all of the speakers, listen to the voice samples at https://mycroft.ai/mimic-3/.

  • length_scale - controls how fast the voice speaks the text. A value of 1 is the speed of the training dataset. Less than 1 is faster, and more than 1 is slower.

  • noise_scale - the amount of noise added to the generated audio (0-1). Can help mask audio artifacts from the voice model. Multi-speaker models tend to sound better with a lower amount of noise than single speaker models.

  • noise_w - the amount of noise used to generate phoneme durations (0-1). Allows for variable speaking cadence, with a value closer to 1 being more variable. Multi-speaker models tend to sound better with a lower amount of phoneme variability than single speaker models.

Docker Image

A pre-built Docker image is available for AMD/Intel CPUs as well as 32/64-bit ARM:

mkdir -p "${HOME}/.local/share/mycroft/mimic3"
chmod a+rwx "${HOME}/.local/share/mycroft/mimic3"
docker run \
       -it \
       -p 59125:59125 \
       -v "${HOME}/.local/share/mycroft/mimic3:/home/mimic3/.local/share/mycroft/mimic3" \
       'mycroftai/mimic3'

Visit the web page at http://localhost:59125

The following convenience scripts are also available:

  • mimic3

  • mimic3-server

  • mimic3-download

Debian Package

Grab the Debian package from the latest release for your platform:

  • mycroft-mimic3-tts_<version>_amd64.deb

    • For desktops and laptops (AMD/Intel CPUs)

  • mycroft-mimic3-tts_<version>_arm64.deb

    • For Raspberry Pi 3/4 and Zero 2 with 64-bit Pi OS

  • mycroft-mimic3-tts_<version>_armhf.deb

    • For Raspberry Pi 1/2/3/4 and Zero 2 with 32-bit Pi OS

Once downloaded, install the package with (note the ./):

sudo apt install ./mycroft-mimic3-tts_<VERSION>_<PLATFORM>.deb

Once installed, the following commands will be available in /usr/bin:

  • mimic3

  • mimic3-server

  • mimic3-download

Python Package

First, ensure that you're using the latest pip:

pip install --upgrade pip

Then, install the package:

pip install mycroft-mimic3-tts[all]

Removing [all] will install support for English only. Additional language support can be selectively installed by replacing all with a two-character language code, such as de (German) or fr (French). See setup.py for an up-to-date list of language codes.

Once installed, the following commands will be available:

  • mimic3

  • mimic3-download

  • mimic3-server

From Source

Clone the repository:

git clone https://github.com/mycroftAI/mimic3.git

Run the install script:

cd mimic3/
./install.sh

A virtual environment will be created in mimic3/.venv and the mycroft-mimic3-tts Python module will be installed in editable mode (pip install -e).

Once installed, the following commands will be available in .venv/bin:

  • mimic3

  • mimic3-server

  • mimic3-download


Usage

There are many ways to use Mimic 3, including:

  • From the command line

  • As a web server

  • In a screen reader

Voice Keys

Voices in Mimic 3 are keyed by a name with specific parts. These parts include the voice's language, region, training dataset, quality level, and speaker.
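For example, the key en_US/vctk_low#p236 breaks down like this (the #speaker suffix is optional):

en_US/vctk_low#p236
│     │    │   └─ speaker
│     │    └─ quality level
│     └─ training dataset
└─ language and region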

Some voices even have multiple speakers; the en_US/vctk_low voice, for example, has over one hundred.

The default voice is en_UK/apope_low.

Voice models are automatically downloaded from Github and stored in ${HOME}/.local/share/mycroft/mimic3 (technically ${XDG_DATA_HOME}/mycroft/mimic3). You can also manually download them (see Downloading Voices below).

Command-Line Interface

Basic Synthesis

The mimic3 command can be used to synthesize audio on the command line:

mimic3 --voice <voice> "<text>" > output.wav

where <voice> is a voice key like en_UK/apope_low. <text> may contain multiple sentences, which will be combined in the final output WAV file. These can also be split into separate WAV files (see Multiple WAV Output below).
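For example, to synthesize a sentence with the default voice and play it back (aplay is just one common WAV player on Linux):

mimic3 --voice 'en_UK/apope_low' 'Hello from Mimic 3.' > hello.wav
aplay hello.wav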

SSML

A subset of Speech Synthesis Markup Language, or SSML, is available through the command line and web interface. SSML allows you to fine-tune your output. For example:

cat << EOF |
<speak>
  <s>
    Spoken before pause with default voice.
  </s>
  <break time="2s" />
  <voice name="en_US/vctk_low#p236">
    <s>
      Spoken after pause in a different voice.
    </s>
  </voice>
</speak>
EOF
    mimic3 --ssml --voice 'en_US/cmu-arctic#eey' > output.wav

SSML even lets you mix and match languages:

cat << EOF |
<speak>
  <voice name="de_DE/thorsten_low">
    <s>
      Eine Sprache ist niemals genug.
    </s>
  </voice>
  <voice name="nl/rdh_low">
    <s>
      Eén taal is nooit genoeg.
    </s>
  </voice>
  <voice name="en_US/vctk_low">
    <s>
      One language is never enough.
    </s>
  </voice>
</speak>
EOF
    mimic3 --ssml > output.wav

If your SSML contains <mark> tags, add --mark-file <file> to the command-line and use --interactive mode. As the marks are encountered, their names will be written on separate lines to the file:

mimic3 --ssml --interactive --mark-file - '<speak>Test 1. <mark name="here" /> Test 2.</speak>'

The following SSML tags are supported:

  • <speak> - wrap around SSML text

    • lang - set language for document

  • <s> - sentence (disables automatic sentence breaking)

    • lang - set language for sentence

  • <w> / <token> - word (disables automatic tokenization)

  • <voice name="..."> - set voice of inner text

  • <prosody attribute="value"> - change speaking attributes

    • Supported attribute names:

      • volume - speaking volume

        • number in [0, 100] - 0 is silent, 100 is loudest (default)

        • +X, -X, +X%, -X% - absolute/percent offset from current volume

        • one of "default", "silent", "x-loud", "loud", "medium", "soft", "x-soft"

      • rate - speaking rate

        • number - 1 is default rate, < 1 is slower, > 1 is faster

        • X% - 100% is default rate, 50% is half speed, 200% is twice as fast

        • one of "default", "x-fast", "fast", "medium", "slow", "x-slow"

  • <say-as interpret-as=""> - force interpretation of inner text

    • interpret-as one of "spell-out", "date", "number", "time", or "currency"

    • format - way to format text depending on interpret-as

      • number - one of "cardinal", "ordinal", "digits", "year"

      • date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)

  • <break time=""> - Pause for given amount of time

    • time - seconds ("123s") or milliseconds ("123ms")

  • <sub alias=""> - substitute alias for inner text

  • <phoneme ph=""> - supply phonemes for inner text

    • See phonemes.txt in voice directory for available phonemes

    • Phonemes may need to be separated by whitespace

SSML <say-as> support varies between voice types:

  • Character-based voices do not currently support <say-as>

Long Texts

If your text is very long, and you would like to listen to it as it's being synthesized, use --interactive mode:

mimic3 --interactive < long.txt

Each input line will be synthesized and played (see --play-program). By default, 5 sentences will be kept in an output queue, only blocking synthesis when the queue is full. You can adjust this value with --result-queue-size.

If your long text is fixed-width with blank lines separating paragraphs, like books from Project Gutenberg, use the --process-on-blank-line option so that sentences will not be broken at line boundaries. For example, you can listen to "Alice in Wonderland" like this:

curl --output - 'https://www.gutenberg.org/files/11/11-0.txt' | \
    mimic3 --interactive --process-on-blank-line

Multiple WAV Output

With --output-dir set to a directory, Mimic 3 will output a separate WAV file for each sentence:

mimic3 'Test 1. Test 2.' --output-dir /path/to/wavs

By default, each WAV file will be named using the (slightly modified) text of the sentence. You can have WAV files named using a timestamp instead with --output-naming time. For full control of the output naming, the --csv command-line flag indicates that each sentence is of the form id|text where id will be the name of the WAV file.

cat << EOF |
s01|The birch canoe slid on the smooth planks.
s02|Glue the sheet to the dark blue background.
s03|It's easy to tell the depth of a well.
s04|These days a chicken leg is a rare dish.
s05|Rice is often served in round bowls.
s06|The juice of lemons makes fine punch.
s07|The box was thrown beside the parked truck.
s08|The hogs were fed chopped corn and garbage.
s09|Four hours of steady work faced us.
s10|Large size in stockings is hard to sell.
EOF
  mimic3 --csv --output-dir /path/to/wavs

You can adjust the delimiter with --csv-delimiter <delimiter>.

Additionally, you can use the --csv-voice option to specify a different voice or speaker for each line:

cat << EOF |
s01|#awb|The birch canoe slid on the smooth planks.
s02|#rms|Glue the sheet to the dark blue background.
s03|#slt|It's easy to tell the depth of a well.
s04|#ksp|These days a chicken leg is a rare dish.
s05|#clb|Rice is often served in round bowls.
s06|#aew|The juice of lemons makes fine punch.
s07|#bdl|The box was thrown beside the parked truck.
s08|#lnh|The hogs were fed chopped corn and garbage.
s09|#jmk|Four hours of steady work faced us.
s10|en_UK/apope_low|Large size in stockings is hard to sell.
EOF
  mimic3 --voice 'en_US/cmu-arctic_low' --csv-voice --output-dir /path/to/wavs

The second column can contain a #<speaker> or an entirely different voice!

Interactive Mode

With --interactive, Mimic 3 will switch into interactive mode. After entering a sentence, it will be played with --play-program.

mimic3 --interactive
Reading text from stdin...
Hello world!<ENTER>

Use CTRL+D or CTRL+C to exit.

Noise and Length Settings

Synthesis has the following additional parameters:

  • --noise-scale and --noise-w

    • Determine the speaker volatility during synthesis

    • 0-1, default is 0.667 and 0.8 respectively

  • --length-scale - makes the voice speak slower (> 1) or faster (< 1)

Individual voices have default settings for these parameters in their config.json files (under inference).
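For example, to get slower, steadier speech than the voice's defaults (values chosen arbitrarily for illustration):

mimic3 --noise-scale 0.4 --noise-w 0.5 --length-scale 1.2 'Testing noise and length settings.' > test.wav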

List Voices

mimic3 --voices

CUDA Acceleration

If you have a GPU with support for CUDA, you can accelerate synthesis with the --cuda flag. This requires you to install the onnxruntime-gpu Python package.

Using nvidia-docker is highly recommended. See the Dockerfile.gpu file in the parent repository for an example of how to build a compatible container.

Web Server

A small HTTP server is available for serving multiple clients. This is faster than the command-line interface since voice models only need to be loaded once.

Running the Server

mimic3-server

This will start a web server at http://localhost:59125

To access the web server from a different device, run mimic3-server --host 0.0.0.0 (you can also change the port with --port).

Some other useful arguments to mimic3-server:

  • --preload-voice <VOICE_KEY> - loads a voice model at startup instead of on first use

  • --cache-dir <DIRECTORY> - caches WAV files in <DIRECTORY> (uses system temporary directory if no <DIRECTORY>)

  • --num-threads <THREADS> - use more than one thread of inference, increasing throughput for multiple clients

See mimic3-server --help for more options.
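For example, to preload the default voice, cache generated WAV files, and serve clients with two inference threads (an arbitrary combination of the options above):

mimic3-server --preload-voice 'en_UK/apope_low' --cache-dir /tmp/mimic3-cache --num-threads 2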

Endpoints

  • /api/tts

    • POST text or SSML and receive WAV audio back

    • Use ?voice= to select a different voice/speaker

    • Set Content-Type to application/ssml+xml (or use ?ssml=1) for SSML input

  • /api/voices

    • Returns a JSON list of available voices

An OpenAPI test page is also available at http://localhost:59125/openapi
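For example, assuming the server is running locally on the default port:

curl -X POST --data 'Hello world.' --output hello.wav 'http://localhost:59125/api/tts?voice=en_UK/apope_low'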

CUDA Acceleration

If you have a GPU with support for CUDA, you can accelerate synthesis with the --cuda flag. This requires you to install the onnxruntime-gpu Python package.

Using nvidia-docker is highly recommended. See the Dockerfile.gpu file in the parent repository for an example of how to build a compatible container.

Running the Client

Assuming you have started mimic3-server and can access http://localhost:59125, run:

mimic3 --remote --voice 'en_UK/apope_low' 'My hovercraft is full of eels.' > hovercraft_eels.wav

If your server is somewhere besides localhost, use mimic3 --remote <URL> ...

See mimic3 --help for more options.

MaryTTS Compatibility

Use the Mimic 3 web server as a drop-in replacement for MaryTTS, for example with Home Assistant. Make sure to use a Mimic 3 voice key like en_UK/apope_low instead of a MaryTTS voice name:

tts:
  - platform: marytts
    host: "localhost"
    port: 59125
    voice: "en_UK/apope_low"

Speech Dispatcher

WORK IN PROGRESS: This has not been tested on a broad range of systems. Some debugging may be required.

Mimic 3 can be used with the Orca screen reader for Linux via speech-dispatcher.

After installing Mimic 3, start the web server and verify that it is running by visiting http://localhost:59125. Next, make sure you have speech-dispatcher installed:

sudo apt-get install speech-dispatcher

Create the file /etc/speech-dispatcher/modules/mimic3-generic.conf with the contents:

GenericExecuteSynth "printf %s \'$DATA\' | /path/to/mimic3 --remote --voice \'$VOICE\' --stdout | $PLAY_COMMAND"
AddVoice "en" "MALE1" "en_UK/apope_low"

You will need sudo access to do this. Make sure to change /path/to/mimic3 to wherever you installed Mimic 3. Note that the --remote option is used to connect to a local Mimic 3 web server (use --remote <URL> if your server is somewhere besides localhost).

To change the voice later, you only need to replace en_UK/apope_low.

Next, edit the existing file /etc/speech-dispatcher/speechd.conf and ensure the following settings are present:

DefaultVoiceType  "MALE1"
DefaultModule mimic3-generic
DefaultLanguage "en"
AudioOutputMethod "libao"

Restart speech-dispatcher with:

sudo systemctl restart speech-dispatcher

and test it out with:

spd-say 'Hello from speech dispatcher.'

Systemd Service

To ensure that Mimic 3 runs at boot, create a systemd service at $HOME/.config/systemd/user/mimic3.service with the contents:

[Unit]
Description=Run Mimic 3 web server
Documentation=https://github.com/MycroftAI/mimic3

[Service]
ExecStart=/path/to/mimic3-server

[Install]
WantedBy=default.target

Make sure to change /path/to/mimic3-server to wherever you installed Mimic 3.

Refresh the systemd services:

systemctl --user daemon-reload

Now try starting the service:

systemctl --user start mimic3

If that's successful, ensure it starts at boot:

systemctl --user enable mimic3
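If the service fails to start, its status and logs can be inspected with standard systemd tooling:

systemctl --user status mimic3
journalctl --user -u mimic3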

Downloading Voices

Mimic 3 automatically downloads voices when they're first used, but you can manually download them too with mimic3-download.

For example:

mimic3-download 'en_US/*'

will download all U.S. English voices to ${HOME}/.local/share/mycroft/mimic3/voices.

You can list the available voices with --voices:

mimic3 --voices | awk '{print $1}'
KEY
de_DE/m-ailabs_low
de_DE/thorsten_low
el_GR/rapunzelina_low
en_UK/apope_low
en_US/cmu-arctic_low
en_US/ljspeech_low
en_US/vctk_low
es_ES/carlfm_low
es_ES/m-ailabs_low
...

Voice models are stored locally in your home directory:

tree "${HOME}/.local/share/mycroft/mimic3/voices"

├── de_DE
│   ├── m-ailabs_low
│   │   ├── ALIASES
│   │   ├── config.json
│   │   ├── generator.onnx
│   │   ├── LICENSE
│   │   ├── phoneme_map.txt
│   │   ├── phonemes.txt
│   │   ├── README.md
│   │   ├── SOURCE
│   │   ├── speaker_map.csv
│   │   └── speakers.txt
...

See mimic3-download --help for more options.


How It Works

Mimic 3 uses the VITS model, a "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech". VITS is a combination of the GlowTTS duration predictor and the HiFi-GAN vocoder. Our implementation is heavily based on Jaehyeon Kim's PyTorch model, with the addition of Onnx runtime export for speed.

Phoneme Ids

At a high level, Mimic 3 performs two important tasks:

  1. Converting raw text to numeric input for the VITS TTS model, and

  2. Using the model to transform numeric input into audio output

The second step is the same for every voice, but the first step (text to numbers) varies. There are currently four implementations of step 1, described below.
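As a rough illustration of the first task (a minimal sketch, not Mimic 3's actual code): each voice ships a phonemes.txt file mapping integer ids to phonemes (see Components of a Voice Model below), so converting a phoneme sequence into model input is essentially a table lookup. The file format assumed here is one "<id> <phoneme>" pair per line, and the phoneme sequence for "hello world" is hypothetical:

def load_phoneme_table(path):
    # Parse phonemes.txt: each line maps an integer id to a phoneme string.
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split(maxsplit=1)
            if len(parts) == 2:
                table[parts[1]] = int(parts[0])
    return table

table = load_phoneme_table("phonemes.txt")

# '^' and '$' mark the start and end of the utterance; '#' is a word break.
phonemes = ["^", "h", "ɛ", "l", "oʊ", "#", "w", "ɜ", "l", "d", "$"]
phoneme_ids = [table[p] for p in phonemes]
print(phoneme_ids)  # numeric input for the VITS model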

gruut Phoneme-based Voices

gruut normalizes text and phonemizes words according to a lexicon, with a pre-trained grapheme-to-phoneme model used to guess unknown word pronunciations.

eSpeak Phoneme-based Voices

eSpeak-ng normalizes and phonemizes text using internal rules and lexicons; Mimic 3 uses it via espeak-phonemizer. It supports a large number of languages, and can handle many textual forms.

Character-based Voices

Voices whose "phonemes" are characters from an alphabet, typically with some punctuation.

For languages whose orthography (writing system) is close enough to the spoken form, character-based voices allow the phonemization step to be skipped. However, these voices do not support text normalization, so numbers, dates, etc. must be written out.

Epitran-based Voices

epitran uses rules to generate phonetic pronunciations from text. It does not support text normalization, however, so numbers, dates, etc. must be written out.

Components of a Voice Model

Voice models are stored in a directory with a specific layout:

  • <language>_<region> (e.g., en_UK)

    • <voice-name>_<quality> (e.g., apope_low)

      • ALIASES - alternative names for the voice, one per line (optional)

      • config.json - training/inference configuration (see the code for details)

      • generator.onnx - exported inference model (see the ids_to_audio method in voice.py)

      • LICENSE - text, name, or URL of voice model license

      • phoneme_map.txt - mapping from source phoneme to destination phoneme(s) (optional)

      • phonemes.txt - mapping from integer ids to phonemes (_ = padding, ^ = beginning of utterance, $ = end of utterance, # = word break)

      • README.md - description of the voice

      • SOURCE - URL(s) of the dataset(s) this voice was trained on

      • VERSION - version of the voice in the format "MAJOR.Minor.bugfix" (e.g. "1.0.2")


License

Mimic 3 is available under the AGPL v3 license.


Feedback or questions?

Join us in Mycroft Chat or the Community Forums.
