FastPitch 1.0 for PyTorch | NVIDIA NGC (2024)

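FastPitch is one of two major components in a neural text-to-speech (TTS) system: a mel-spectrogram generator, such as FastPitch or Tacotron 2, and a waveform synthesizer that converts mel-spectrograms into audio.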

Such a two-component TTS system can synthesize natural-sounding speech from raw transcripts.

The FastPitch model generates mel-spectrograms and predicts a pitch contour from raw input text. In version 1.1, it does not need any pre-trained aligning model to bootstrap from. It also gives additional control over the synthesized utterances, making it possible to:

  • modify the pitch contour to control the prosody,
  • increase or decrease the fundamental frequency in a natural-sounding way that preserves the perceived identity of the speaker,
  • alter the rate of speech,
  • adjust the energy,
  • specify input as graphemes or phonemes,
  • switch speakers when the model has been trained with data from multiple speakers.

Some of the capabilities of FastPitch are presented on the website with samples.

Speech synthesized with FastPitch has state-of-the-art quality, and does not suffer from missing/repeating phrases like Tacotron 2 does. This is reflected in Mean Opinion Scores (details).

Model            Mean Opinion Score (MOS)
Tacotron 2       3.946 ± 0.134
FastPitch 1.0    4.080 ± 0.133

The current version of the model offers even higher quality, as reflected in the pairwise preference scores (details).

Model            Average preference
FastPitch 1.0    0.435 ± 0.068
FastPitch 1.1    0.565 ± 0.068

The FastPitch model is based on the FastSpeech model. The main differences between FastPitch and FastSpeech are that FastPitch:

  • does not depend on an external aligner (Transformer TTS, Tacotron 2); in version 1.1, FastPitch aligns audio to transcriptions by itself, as in One TTS Alignment To Rule Them All,
  • explicitly learns to predict the pitch contour,
  • uses pitch conditioning, which removes harsh-sounding artifacts and provides faster convergence,
  • does not need to distill mel-spectrograms with a teacher model,
  • can be trained as a multi-speaker model.

The FastPitch model is similar to FastSpeech2, which has been developed concurrently. FastPitch averages pitch/energy values over input tokens, and treats energy as optional.
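As an illustration of averaging pitch over input tokens, the following minimal sketch collapses a frame-level pitch track into one value per token, given per-token durations in frames. The function name, the sample values, and the choice to skip frames with zero pitch (treated here as unvoiced) are assumptions made for this example, not details taken from the FastPitch code.

```python
from typing import List, Sequence


def average_pitch_per_token(frame_pitch: Sequence[float],
                            durations: Sequence[int]) -> List[float]:
    """Average a frame-level pitch track over the frames of each input token.

    Assumption for this sketch: frames with pitch 0.0 are unvoiced and are
    left out of the average. A token with no voiced frames gets pitch 0.0.
    """
    if sum(durations) != len(frame_pitch):
        raise ValueError("durations must sum to the number of pitch frames")

    averaged = []
    start = 0
    for dur in durations:
        voiced = [p for p in frame_pitch[start:start + dur] if p > 0.0]
        averaged.append(sum(voiced) / len(voiced) if voiced else 0.0)
        start += dur
    return averaged


# Hypothetical data: 6 frames of pitch (Hz) spread over 4 input tokens.
frame_pitch = [0.0, 110.0, 120.0, 130.0, 0.0, 200.0]
durations = [1, 3, 0, 2]  # the third token is not pronounced
token_pitch = average_pitch_per_token(frame_pitch, durations)
print(token_pitch)  # [0.0, 120.0, 0.0, 200.0]
```

Working with one pitch value per token keeps the pitch target the same length as the input text, which is what makes it possible to edit pitch token by token at inference time.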

FastPitch is trained on the publicly available LJ Speech dataset.

This model is trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results from 2.0x to 2.7x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.

Model architecture

FastPitch is a fully feedforward Transformer model that predicts mel-spectrograms from raw text (Figure 1). The entire process is parallel, which means that all input letters are processed simultaneously to produce a full mel-spectrogram in a single forward pass.

Figure 1. Architecture of FastPitch (source). The model is composed of a bidirectional Transformer backbone (also known as a Transformer encoder), a pitch predictor, and a duration predictor. After passing through the first *N* Transformer blocks (the encoding stage), the signal is augmented with pitch information and discretely upsampled. Then it goes through another set of *N* Transformer blocks, with the goal of smoothing out the upsampled signal and constructing a mel-spectrogram.
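The "augment with pitch, then discretely upsample" step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the repository's implementation: the function name is hypothetical, pitch information is assumed to be added element-wise as an embedding, and each token's vector is simply repeated for its duration in frames.

```python
from typing import List, Sequence


def add_pitch_and_upsample(encodings: Sequence[Sequence[float]],
                           pitch_embeddings: Sequence[Sequence[float]],
                           durations: Sequence[int]) -> List[List[float]]:
    """Add a pitch embedding to each token encoding, then repeat each
    augmented token vector `duration` times (one copy per output frame)."""
    if not (len(encodings) == len(pitch_embeddings) == len(durations)):
        raise ValueError("expected one pitch embedding and one duration per token")

    frames: List[List[float]] = []
    for enc, pitch, dur in zip(encodings, pitch_embeddings, durations):
        augmented = [e + p for e, p in zip(enc, pitch)]
        # Discrete upsampling: a token with duration 0 yields no frames.
        frames.extend(list(augmented) for _ in range(dur))
    return frames


# Hypothetical 2-dimensional encodings for three tokens.
encodings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
pitch_embeddings = [[0.5, 0.5], [0.0, 0.0], [1.0, -1.0]]
durations = [2, 0, 3]

frames = add_pitch_and_upsample(encodings, pitch_embeddings, durations)
print(len(frames))  # 5 frames, the sum of the durations
```

The repeated vectors form a piecewise-constant sequence, which is why the second stack of Transformer blocks is described as smoothing out the upsampled signal.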

Default configuration

The FastPitch model supports multi-GPU and mixed precision training with dynamic loss scaling (see Apex code here), as well as mixed precision inference.
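Dynamic loss scaling can be pictured as a small state machine: the scale is reduced whenever gradients overflow and raised again after a run of clean steps. The sketch below is a simplified stand-in written for illustration. The class name and the default values are assumptions for this example and are not read from the FastPitch or Apex sources.

```python
class DynamicLossScaler:
    """Simplified dynamic loss scaling with backoff (illustrative only)."""

    def __init__(self, init_scale: float = 2.0 ** 16,
                 growth_factor: float = 2.0,
                 backoff_factor: float = 0.5,
                 growth_interval: int = 2000) -> None:
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_overflow: bool) -> bool:
        """Adjust the scale after a step.

        Returns True if the optimizer step should be applied, and False if
        it should be skipped because the scaled gradients overflowed.
        """
        if found_overflow:
            self.scale *= self.backoff_factor  # back off and skip this step
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= self.growth_factor  # try a larger scale again
            self._good_steps = 0
        return True


# Hypothetical run: one overflow on the third step, growth every 3 clean steps.
scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=3)
overflow_flags = [False, False, True, False, False, False]
applied = [scaler.update(flag) for flag in overflow_flags]
print(applied, scaler.scale)
```

In this run the third step is skipped and the scale halves to 512, then returns to 1024 after three clean steps.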

The following features were implemented in this model:

  • data-parallel multi-GPU training,
  • dynamic loss scaling with backoff for Tensor Cores (mixed precision) training,
  • gradient accumulation for reproducible results regardless of the number of GPUs.
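One way to read the gradient accumulation item: the effective (global) batch size is held fixed, and the number of accumulation steps is adjusted to the number of GPUs. The helper below is a hypothetical sketch of that arithmetic. The function name and the batch sizes are illustrative assumptions, not the repository's defaults.

```python
def accumulation_steps(global_batch_size: int,
                       per_gpu_batch_size: int,
                       num_gpus: int) -> int:
    """Return how many forward/backward passes to accumulate before each
    optimizer step so that the effective batch size stays constant."""
    samples_per_pass = per_gpu_batch_size * num_gpus
    if global_batch_size % samples_per_pass != 0:
        raise ValueError("global batch size must be a multiple of "
                         "per_gpu_batch_size * num_gpus")
    return global_batch_size // samples_per_pass


# Illustrative numbers only.
GLOBAL_BATCH = 256
PER_GPU_BATCH = 32
plan = {n: accumulation_steps(GLOBAL_BATCH, PER_GPU_BATCH, n) for n in (1, 2, 4, 8)}
print(plan)  # {1: 8, 2: 4, 4: 2, 8: 1}
```

Every configuration in `plan` processes the same number of samples per optimizer step, so the optimization trajectory does not depend on how many GPUs are available.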

Pitch contours and mel-spectrograms can be generated online during training. To speed up training, they can instead be generated during the pre-processing step and read directly from disk during training. For more information on data pre-processing, refer to Dataset guidelines and the paper.

Feature support matrix

The following features are supported by this model.

Feature                            FastPitch
Automatic mixed precision (AMP)    Yes
Distributed data parallel (DDP)    Yes

Features

Automatic Mixed Precision (AMP) - This implementation uses the native PyTorch AMP implementation of mixed precision training. It allows us to use FP16 training with FP32 master weights by modifying just a few lines of code.
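The reason for keeping FP32 master weights can be shown without any GPU. The sketch below emulates FP16 and FP32 rounding with Python's standard `struct` module; the helper names and the numbers are assumptions chosen for illustration. A small update applied repeatedly to an FP16 weight is rounded away every time, while the same updates accumulated in an FP32 master copy survive and show up once the weight is cast back to FP16.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


update = 1e-4  # hypothetical per-step weight update
steps = 10

# Weights stored and updated directly in FP16: each update is lost,
# because FP16 values near 1.0 are spaced about 9.8e-4 apart.
fp16_weight = to_fp16(1.0)
for _ in range(steps):
    fp16_weight = to_fp16(fp16_weight + update)

# FP32 master copy: updates accumulate, and the FP16 copy used for
# computation is refreshed from it.
master_weight = to_fp32(1.0)
for _ in range(steps):
    master_weight = to_fp32(master_weight + update)
fp16_copy = to_fp16(master_weight)

print(fp16_weight)  # 1.0
print(fp16_copy)    # 1.0009765625
```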

DistributedDataParallel (DDP) - The model uses PyTorch Lightning implementation of distributed data parallelism at the module level which can run across multiple machines.

Mixed precision training

Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:

  1. Porting the model to use the FP16 data type where appropriate.
  2. Adding loss scaling to preserve small gradient values.
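The second step can be illustrated numerically. In the sketch below, which emulates FP16 rounding with Python's standard `struct` module, a gradient of 1e-8 underflows to zero in FP16 but is preserved when it is multiplied by a loss scale first and divided by the same scale afterwards in higher precision. The gradient value and the scale are assumptions picked for the example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


true_grad = 1e-8      # hypothetical small gradient value
loss_scale = 65536.0  # 2**16, an illustrative loss scale

# Without scaling, the gradient is below the smallest FP16 subnormal
# (about 6e-8) and is flushed to zero.
unscaled_fp16 = to_fp16(true_grad)

# With scaling, the value lands in the normal FP16 range; it is then
# unscaled in higher precision before the weight update.
scaled_fp16 = to_fp16(true_grad * loss_scale)
recovered = scaled_fp16 / loss_scale

print(unscaled_fp16)  # 0.0
print(recovered)      # close to 1e-8
```

Scaling the loss has the same effect as scaling every gradient, by the chain rule, which is why a single multiplication before the backward pass is enough.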

The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.

For information about:

  • How to train using mixed precision, see the Mixed Precision Training paper and Training With Mixed Precision documentation.
  • Techniques used for mixed precision training, see the Mixed-Precision Training of Deep Neural Networks blog.
  • APEX tools for mixed precision training, see the NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch.

Enabling mixed precision

For training and inference, mixed precision can be enabled by adding the --amp flag. Mixed precision uses the native PyTorch implementation.

Enabling TF32

TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.

TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
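The dynamic range point can be made concrete with a small emulation. TF32 keeps the 8-bit exponent of FP32 and a 10-bit mantissa, so it covers the same range as FP32 at reduced precision. The sketch below approximates TF32 by truncating the low 13 mantissa bits of an FP32 value; real hardware rounds rather than truncates, so this is a simplification, and the helper names are assumptions for this example.

```python
import struct


def to_tf32(x: float) -> float:
    """Approximate TF32 by keeping the sign bit, the 8 exponent bits and the
    top 10 mantissa bits of the FP32 encoding (truncation, not rounding)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # clear the 13 least significant mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fits_in_fp16(x: float) -> bool:
    """Return True if x can be encoded as a finite half-precision value."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


value = 100000.0  # larger than the FP16 maximum of 65504

print(fits_in_fp16(value))  # False
print(to_tf32(value))       # 99968.0
```

The value loses a little precision in the emulated TF32 format but stays finite, whereas it cannot be represented in FP16 at all.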

For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.

TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.

Glossary

Character duration - The time during which a character is being articulated. Could be measured in milliseconds, mel-spectrogram frames, etc. Some characters are not pronounced, and thus have 0 duration.

Fundamental frequency - The lowest vibration frequency of a periodic soundwave, for example, produced by a vibrating instrument. It is perceived as the loudest. In the context of speech, it refers to the frequency of vibration of vocal cords. Abbreviated as f0.

Pitch - A perceived frequency of vibration of music or sound.

Transformer - The paper Attention Is All You Need introduces a novel architecture called Transformer, which repeatedly applies the attention mechanism. It transforms one sequence into another.
