FastPitch 1.0 for PyTorch | NVIDIA NGC (2024)

FastPitch is one of two major components in a neural, text-to-speech (TTS) system:

a mel-spectrogram generator such as FastPitch or Tacotron 2, and
a waveform synthesizer such as WaveGlow (see NVIDIA example code).

Such two-component TTS system is able to synthesize natural sounding speech from raw transcripts.

The FastPitch model generates mel-spectrograms and predicts a pitch contour from raw input text.In version 1.1, it does not need any pre-trained aligning model to bootstrap from.It allows to exert additional control over the synthesized utterances, such as:

modify the pitch contour to control the prosody,
increase or decrease the fundamental frequency in a naturally sounding way, that preserves the perceived identity of the speaker,
alter the rate of speech,
adjust the energy,
specify input as graphemes or phonemes,
switch speakers when the model has been trained with data from multiple speakers.Some of the capabilities of FastPitch are presented on the website with samples.

Speech synthesized with FastPitch has state-of-the-art quality, and does not suffer from missing/repeating phrases like Tacotron 2 does.This is reflected in Mean Opinion Scores (details).

Model	Mean Opinion Score (MOS)
Tacotron 2	3.946 ± 0.134
FastPitch 1.0	4.080 ± 0.133

The current version of the model offers even higher quality, as reflectedin the pairwise preference scores (details).

Model	Average preference
FastPitch 1.0	0.435 ± 0.068
FastPitch 1.1	0.565 ± 0.068

The FastPitch model is based on the FastSpeech model. The main differences between FastPitch and FastSpeech are that FastPitch:

Model architecture

FastPitch is a fully feedforward Transformer model that predicts mel-spectrogramsfrom raw text (Figure 1). The entire process is parallel, which means that all input letters are processed simultaneously to produce a full mel-spectrogram in a single forward pass.

Figure 1. Architecture of FastPitch (source). The model is composed of a bidirectional Transformer backbone (also known as a Transformer encoder), a pitch predictor, and a duration predictor. After passing through the first *N* Transformer blocks, encoding, the signal is augmented with pitch information and discretely upsampled. Then it goes through another set of *N* Transformer blocks, with the goal ofsmoothing out the upsampled signal, and constructing a mel-spectrogram.

Default configuration

The FastPitch model supports multi-GPU and mixed precision training with dynamic lossscaling (see Apex codehere),as well as mixed precision inference.

The following features were implemented in this model:

Feature support matrix

The following features are supported by this model.

Feature	FastPitch
Automatic mixed precision (AMP)	Yes
Distributed data parallel (DDP)	Yes

Features

Automatic Mixed Precision (AMP) - This implementation uses native PyTorch AMPimplementation of mixed precision training. It allows us to use FP16 trainingwith FP32 master weights by modifying just a few lines of code.

DistributedDataParallel (DDP) - The model uses PyTorch Lightning implementationof distributed data parallelism at the module level which can run acrossmultiple machines.

Mixed precision training

Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:

Porting the model to use the FP16 data type where appropriate.
Adding loss scaling to preserve small gradient values.

The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.

For information about:

How to train using mixed precision, see the Mixed Precision Training paper and Training With Mixed Precision documentation.
Techniques used for mixed precision training, see the Mixed-Precision Training of Deep Neural Networks blog.
APEX tools for mixed precision training, see the NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch.

Enabling mixed precision

For training and inference, mixed precision can be enabled by adding the --amp flag.Mixed precision is using native PyTorch implementation.

Enabling TF32

TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling the matrix math also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.

TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.

For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.

TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.

Glossary

Character durationThe time during which a character is being articulated. Could be measured in milliseconds, mel-spectrogram frames, etc. Some characters are not pronounced, and thus have 0 duration.

Fundamental frequencyThe lowest vibration frequency of a periodic soundwave, for example, produced by a vibrating instrument. It is perceived as the loudest. In the context of speech, it refers to the frequency of vibration of vocal chords. Abbreviated as f0.

PitchA perceived frequency of vibration of music or sound.

TransformerThe paper Attention Is All You Need introduces a novel architecture called Transformer, which repeatedly applies the attention mechanism. It transforms one sequence into another.