Mimic 2

Mimic 2 is a machine learning Text-to-Speech engine designed to run in the cloud.

Mimic 2 uses deep learning to generate higher quality speech than traditional concatenative Text-to-Speech systems. With deep learning, we don’t need to hand-engineer the features of speech; instead, we let the computer learn how to generate them. Given enough processing power and data, computers can learn features like intonation, tone, stress, and rhythm, the qualities that make speech sound more human-like.

How it works

Simply put, the Mimic 2 engine takes in a string of text and maps it to an audio output. Mimic 2 was trained on a corpus of about 20,000 English sentences, equating to about 16 hours of audio. During the recording process, our voice actor, Kusal, had to speak clearly into a high-quality microphone in a noise-isolated environment. We spread the recording sessions over two weeks to prevent vocal fatigue. As with any machine learning system, the clarity and quality of the data are critical. As the saying goes: “garbage in, garbage out.”

This generation of Mimic is based on Tacotron, a groundbreaking and very successful neural network architecture for speech synthesis. It’s able to take a highly compressed source of information (text) and decompress it into audio. This process is complicated because the same text can correspond to different sounds and speaking styles. Because of the various nuances in spoken language, it is difficult to generate the appropriate output sound from the input text. Try speaking these simple sentences out loud and pay attention to the different pronunciations of the same word:

  • I read a book last night; I like to read.

  • The violinist took a bow after he dropped his bow.

  • You have to use chopsticks to master their use.

  • After you graduate you become a graduate.

Below is a concise technical explanation of the deep learning approach. It is written to simplify the various methods in the neural network architecture so they are easier to understand. If you’d like more in-depth detail, you can read Google’s white paper on Tacotron.

Generating Speech from Text

As a high-level overview, the model takes characters as input and outputs a raw spectrogram. An algorithm then transforms the raw spectrogram into audio waveforms. There are many neural network layers in this architecture performing various functions, but for conciseness they are grouped into three main modules: an Encoder, a Decoder, and an Audio Reconstruction module.

Technical diagram of Mycroft's Mimic 2 Open Source Text-to-Speech architecture
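To make the data flow concrete, here is a toy sketch of the three-stage pipeline in Python. The functions are hypothetical stand-ins for the modules described above (they return dummy arrays), not the actual Mimic 2 code; they only illustrate what each module consumes and produces.

```python
import numpy as np

# Toy sketch of the Encoder -> Decoder -> Audio Reconstruction pipeline.
# These functions are illustrative placeholders, not Mimic 2's real implementation.

def encode(characters):
    # Encoder: map each character to an abstract feature vector (dummy values here).
    return np.random.randn(len(characters), 256)

def decode(encoded_features, n_mels=80, frames_per_char=5):
    # Decoder: expand the text features into a mel spectrogram, frame by frame.
    n_frames = len(encoded_features) * frames_per_char
    return np.random.randn(n_mels, n_frames)

def reconstruct_audio(mel_spectrogram, hop_length=256):
    # Audio Reconstruction: turn the spectrogram into raw audio samples.
    return np.random.randn(mel_spectrogram.shape[1] * hop_length)

def text_to_speech(text):
    characters = list(text)
    encoded_features = encode(characters)          # characters -> text features
    mel_spectrogram = decode(encoded_features)     # features -> mel spectrogram
    return reconstruct_audio(mel_spectrogram)      # spectrogram -> waveform

waveform = text_to_speech("Hello from Mimic 2.")
print(waveform.shape)
```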

Voice generation starts with the Encoder. A sentence is broken up into individual characters, converted to embeddings, and fed into the Feature Extraction module. The Encoder uses this module to extract local and contextual information from the characters. This helps capture the patterns in a sentence that shape the sound of the audio output. For example, the “C” in “Chat” is pronounced differently than the “C” in “Cat.” Likewise, the intonation of a sentence sounds different if it ends with a question mark rather than a period. The output of the Feature Extraction module is essentially an abstract numerical representation of the various features in a sentence. These encoded text features are used to help generate the sound.
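A rough idea of what such an encoder can look like, written as a PyTorch sketch. The embedding, convolution, and bidirectional GRU layers and their sizes are illustrative choices in the spirit of Tacotron, not the exact Mimic 2 architecture.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_symbols=128, embedding_dim=256):
        super().__init__()
        # Each character (and punctuation mark) gets a learned embedding vector.
        self.embedding = nn.Embedding(n_symbols, embedding_dim)
        # The convolution looks at neighbouring characters, e.g. the "h" after "C" in "Chat".
        self.conv = nn.Conv1d(embedding_dim, embedding_dim, kernel_size=5, padding=2)
        # A bidirectional GRU adds sentence-level context, e.g. a trailing "?".
        self.rnn = nn.GRU(embedding_dim, embedding_dim // 2,
                          batch_first=True, bidirectional=True)

    def forward(self, char_ids):
        x = self.embedding(char_ids)                              # (batch, chars, dim)
        x = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)  # local character context
        outputs, _ = self.rnn(x)                                  # (batch, chars, dim)
        return outputs                                            # "encoded text features"

# Encode a short sentence; character IDs here are simply ASCII codes.
text = "Did you feed the cat?"
char_ids = torch.tensor([[ord(c) for c in text]])
features = Encoder()(char_ids)
print(features.shape)   # torch.Size([1, 21, 256])
```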

The output of the Feature Extraction module is then fed into the Decoder. The Decoder’s job is to generate a mel spectrogram from the encoded text features. A spectrogram is a pictorial representation of sound. A mel spectrogram is a spectrogram whose frequency scale is tuned to match human hearing.
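To see what the Decoder’s target looks like, you can compute a mel spectrogram from an existing recording with librosa. The file path is a placeholder, and 80 mel bands with these frame settings are typical Tacotron-style values rather than Mimic 2’s exact parameters.

```python
import librosa
import numpy as np

# Compute a mel spectrogram from a recording. "speech.wav" is a placeholder path.
y, sr = librosa.load("speech.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)   # log scale, closer to how we perceive loudness
print(mel_db.shape)   # (80 mel bands, number of frames)
```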

The Frame Prediction module (FPM) produces the raw mel spectrogram by generating its frames recursively, using the information in the Encoder output to guide the process. A method in the FPM called the Attention Mechanism helps align each character with its corresponding sound. As you can see in the diagram, the decoding process takes many repeated steps, and the number of steps depends on the length of the audio. The Encoder output is fed into the FPM to start the decoding process.

The FPM produces two things: a mel spectrogram frame, which contains information on what the generated audio sounds like, and an abstract numerical representation of the state of the FPM, which we’ll call the internal state. The internal state contains information that is critical to generating the next frame, such as the text representation from the Encoder and the data from the previous decoding steps. Both outputs are necessary because, to generate the next sound in speech, it helps to know which sounds came before and the information that produced them. In reality, the FPM is a lot more complicated, but we’ll stick to this explanation for the sake of simplicity. The FPM combines those two outputs, passes them on to the next decoding step, and then does its job again. The Decoder repeats this process, building out the mel spectrogram frame by frame, with each step taking in information from the previous one. After the final frame, the whole sentence is represented as a mel spectrogram.
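The decoding loop can be sketched conceptually like this. The attention and frame-prediction functions below are toy stand-ins that return dummy values; the point is only to show how each step consumes the Encoder output, the previous frame, and the internal state, and appends one new frame to the spectrogram.

```python
import numpy as np

def attend(encoder_outputs, state):
    # Attention Mechanism: weight each character's features by how relevant it is
    # to the frame being generated right now (uniform toy weights here).
    weights = np.full(len(encoder_outputs), 1.0 / len(encoder_outputs))
    return weights @ encoder_outputs

def predict_frame(context, prev_frame, state, n_mels=80):
    # Frame Prediction module: produce the next mel frame and an updated internal state.
    new_state = 0.5 * state + 0.5 * np.mean(context)
    frame = np.random.randn(n_mels)
    return frame, new_state

encoder_outputs = np.random.randn(21, 256)   # pretend output of the Encoder
frame = np.zeros(80)                         # "go" frame to start decoding
state = 0.0
mel_frames = []
for step in range(100):                      # number of steps depends on the audio length
    context = attend(encoder_outputs, state)
    frame, state = predict_frame(context, frame, state)
    mel_frames.append(frame)

mel_spectrogram = np.stack(mel_frames, axis=1)   # (80 mel bands, 100 frames)
print(mel_spectrogram.shape)
```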

The final step in the speech generation process is the Audio Reconstruction module. This module has two main components: the post-processing net and the Griffin-Lim algorithm. The full mel spectrogram generated by the Decoder is fed into the post-processing net, which transforms the mel spectrogram into a linear spectrogram. This step is crucial for two reasons. First, the output needs to be converted from a mel spectrogram to a linear spectrogram before it can be reconstructed. Second, during the decoding process the FPM makes mistakes in generating the mel spectrogram frames, and the post-processing net can learn to correct them. Once the post-processing net generates the linear spectrogram, the Griffin-Lim algorithm is applied to it. Griffin-Lim is a reconstruction algorithm that takes linear spectrograms and turns them into audio waveforms. The linear spectrogram contains no phase information, which describes where each point of the waveform falls in time. Griffin-Lim estimates the phases of the waveform from the spectrogram, but it’s not perfect, so each output waveform has some phase distortion. Phase distortion is why the voice can sound as if it’s omnipresent, coming from everywhere at once.
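You can hear the effect of the Griffin-Lim step by discarding the phase of a real recording and estimating it back from the magnitude spectrogram alone. The paths and parameters below are placeholders and typical values, not Mimic 2’s settings.

```python
import librosa
import numpy as np
import soundfile as sf

# Drop the phase of a recording, keep only the linear magnitude spectrogram,
# then let Griffin-Lim estimate the phase back.
y, sr = librosa.load("speech.wav", sr=22050)
magnitude = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))   # phase is discarded here
y_reconstructed = librosa.griffinlim(magnitude, n_iter=60,
                                     hop_length=256, win_length=1024)
sf.write("reconstructed.wav", y_reconstructed, sr)   # listen for the slight phase artifacts
```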
