Alignment, Attention, and TextGrids

Voice mimicking and text-to-speech (TTS) refer to a set of algorithms that attempt to clone your voice so that something new can be synthesized from plain old text. Within this field, there are two major approaches that try to achieve such a feat.


Attention Based

One approach is using attention-based models. These models make voice mimicking very easy to do: all you really need is a large dataset of audio files and a metadata file that outlines what was said in each audio file. This is possible because the generic teacher-student distillation process provides a model (usually a recurrent neural network) that tries to learn the proper alignments on its own.
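To make "learning alignments" concrete, here is a toy sketch (not any specific model): an attention matrix soft-aligns decoder steps (mel frames) to encoder steps (text tokens), and the per-token attention peaks act like durations. All shapes and values below are made up for illustration.

```python
import numpy as np

# Toy stand-ins for encoder/decoder hidden states.
rng = np.random.default_rng(0)
text_states = rng.standard_normal((5, 8))   # 5 text tokens, 8-dim states
mel_states = rng.standard_normal((20, 8))   # 20 mel frames, 8-dim states

scores = mel_states @ text_states.T          # (20, 5) similarity scores
scores -= scores.max(axis=1, keepdims=True)  # for numerical stability
attn = np.exp(scores)
attn /= attn.sum(axis=1, keepdims=True)      # each frame's weights sum to 1

# The alignment implied by attention: each frame "belongs" to the token
# its attention peaks on, so per-token counts behave like durations.
durations = np.bincount(attn.argmax(axis=1), minlength=5)
print(durations.sum())  # all 20 frames are accounted for
```

In a real attention-based TTS model these weights are learned during training rather than computed from random states, which is exactly why no TextGrid-style alignment file is needed up front.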

The issue with these attention-based models, besides being complicated and time-consuming, is that they tend to produce less-than-optimal results due to inaccurate durations from the teacher and the information loss caused by data simplification when the teacher distills the target mel-spectrogram (Ren 2020).


(Mel-Spectrogram from Getting to Know the Mel Spectrogram by Dalya)

There are a handful of these models, but one that comes to mind is Google’s Tacotron, which was published back in March 2017. Before you toss out the idea of using such a model, note that the research on these types is still very much alive. You can check out the latest research from Google on their page.


Alignment Based

The other major approach is using alignment-based models. These models address the issues caused by attention-based models by training more directly on the ground-truth audio files instead of a simplified version of them. This more direct approach introduces variation information of speech (i.e., pitch, energy, and more accurate duration).

How each alignment based model handles the new set of information depends on the model itself. For example, FastSpeech 2…

“…extracts duration, pitch and energy from speech waveform and directly takes them as conditional inputs in training and uses predicted values in inference.”


Due to this, they were able to achieve a Mean Opinion Score (MOS) of roughly 3.71 (Ren 2020). As a comparison, and it might come as a surprise, Tacotron 2 (an attention-based model) had a MOS of roughly 3.7 (Shen 2018).
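The "energy" stream that FastSpeech 2 extracts from the waveform (per the quote above) is essentially per-frame loudness. Here is a minimal sketch of that idea using frame-level RMS; the frame and hop sizes are assumptions for illustration, not the paper's exact values.

```python
import numpy as np

def frame_energy(wav, frame_length=1024, hop_length=256):
    """Per-frame RMS energy of a mono waveform (a simplified sketch)."""
    n_frames = 1 + max(0, len(wav) - frame_length) // hop_length
    energy = np.empty(n_frames)
    for i in range(n_frames):
        frame = wav[i * hop_length : i * hop_length + frame_length]
        energy[i] = np.sqrt(np.mean(frame ** 2))  # RMS of this frame
    return energy

# One second of a 220 Hz tone at 16 kHz; its RMS should sit near 1/sqrt(2).
wav = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
e = frame_energy(wav)
print(len(e), float(e[0]))
```

Pitch extraction works analogously (e.g., via an F0 estimator), and the model then quantizes these per-frame values and feeds them in as conditional inputs.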

This is all great to see; however, the purpose of this article isn’t to dive deep into the latest research but instead to show how to get started with alignment-based models when all you have is a set of audio files. In particular, if you use ming024’s implementation of FastSpeech 2, you will only get so far before the algorithm asks for a TextGrid file.

These files provide alignment-based models with the exact time ranges of when something was said and what word was actually spoken. With attention-based models you wouldn’t need any TextGrid file, because the model would figure the alignment out on its own.
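To show what these files actually contain, here is a hand-rolled writer for Praat's plain-text TextGrid layout: one interval tier of (start, end, word) spans. This is only a sketch; real pipelines usually read and write these files with a library such as textgrid or tgt, and the example words and times are made up.

```python
def write_textgrid(path, intervals):
    """intervals: list of (xmin, xmax, text) tuples, in seconds."""
    xmax = intervals[-1][1]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        '',
        'xmin = 0',
        f'xmax = {xmax}',
        'tiers? <exists>',
        'size = 1',
        'item []:',
        '    item [1]:',
        '        class = "IntervalTier"',
        '        name = "words"',
        '        xmin = 0',
        f'        xmax = {xmax}',
        f'        intervals: size = {len(intervals)}',
    ]
    for i, (lo, hi, text) in enumerate(intervals, 1):
        lines += [
            f'        intervals [{i}]:',
            f'            xmin = {lo}',
            f'            xmax = {hi}',
            f'            text = "{text}"',
        ]
    with open(path, 'w') as f:
        f.write('\n'.join(lines) + '\n')

write_textgrid('example.TextGrid', [(0.0, 0.42, 'time'), (0.42, 0.9, 'makes')])
```

Writing one by hand for hours of audio is obviously impractical, which is where the tooling below comes in.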

(Praat TextGrid software via ResearchGate)


Generating TextGrid Files

With all this in mind, to generate TextGrid files you can use the Praat software to create them manually, or you can take a more automated approach via the Montreal Forced Aligner (MFA).

MFA generates these files automatically by using machine learning to find the appropriate segments within your audio files. More specifically, MFA uses a Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) by default.

To get MFA working properly, you will need a few things in place before you can run it.


Step 1: Create Metadata File

This “how-to” article assumes all you have is a set of audio files and nothing else, so the first thing you will need is a metadata file. This is a delimited text file that contains each audio file name and the words spoken in that file, like so:

file text
bob_audio_0_001 Time makes us sentimental. Perhaps, in the end, it is because of time that we suffer.
bob_audio_0_002 Reason is purposive activity.
bob_audio_0_003 Your hair is winter fire January embers


For the sake of simplicity later, each audio file should be of type ‘wav’ and, if possible, the metadata file should be tab-delimited.

Do note, however: the words in the text column should not be in quotes, the file name should not include its extension, and you should not put any spaces in the file name (both within the metadata file and in the actual audio file name). Certain stages of this process are working to allow spaces, but at the moment they will cause a lot of difficult-to-find errors.
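A small sketch of writing and reading such a tab-delimited metadata file with the standard library; the file name 'metadata.tsv' and the sample row are illustrative, not requirements.

```python
import csv

# Write a tiny metadata file in the format described above:
# no quotes around text, no extension or spaces in the file name.
rows = [('bob_audio_0_002', 'Reason is purposive activity.')]
with open('metadata.tsv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(['file', 'text'])
    writer.writerows(rows)

# Read it back into a {file_name: transcript} mapping.
with open('metadata.tsv', newline='', encoding='utf-8') as f:
    meta = {r['file']: r['text'] for r in csv.DictReader(f, delimiter='\t')}

for name in meta:
    assert ' ' not in name  # spaces in file names cause hard-to-find errors
print(meta['bob_audio_0_002'])
```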

If you have hundreds and hundreds of audio files and you do not want to hire a transcriber, you could take some time to look at the current Speech-to-Text (STT) models to automatically transcribe your audio files. For more information, do check out the “2019 Guide for Automatic Speech Recognition”.


Step 2: Resample Your Audio via SoX

One of the first issues that emerged during my attempt at this process was not having the right sampling rate in each file. After looking through MFA’s issues tab, I discovered several ways to resample the audio (e.g., ‘pysox’), but the one that worked quickly was SoX via the command prompt.

To get this part working, you will need to install SoX and use the following code:

import glob
import subprocess

# Note: include the trailing slash in both directory paths.
in_path = 'path_to_wavs_directory/'
out_path = 'path_to_output_directory/'
sr = 16000

for file in glob.glob(in_path + '*.wav'):
    output_file = out_path + file[len(in_path):]
    subprocess.run(f'sox "{file}" -c 1 -r {sr} "{output_file}"', shell=True)

This will move through all your audio files, resampling each to 16 kHz (required) and down-mixing to one channel (most audio comes in with 2 channels, aka “stereo”). After you have resampled the audio files, you can move on to the next step.
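Before handing the files to MFA, it can save a lot of debugging time to verify that every output really is 16 kHz mono. Here is a stdlib-only check; the function name and the idea of running it over the output directory are my own additions, not part of MFA.

```python
import wave

def check_wav(path, expected_sr=16000):
    """Return True if the WAV file is mono at the expected sample rate."""
    with wave.open(path, 'rb') as w:
        return w.getnchannels() == 1 and w.getframerate() == expected_sr
```

You can then run something like `all(check_wav(p) for p in glob.glob(out_path + '*.wav'))` to confirm the whole directory is in shape.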


Step 3: Generate LAB files via Prosodylab Aligner Tools

The Montreal Forced Aligner requires LAB files before it will produce any TextGrid files. A LAB file is simply a text file that contains the words spoken in a given audio file. You could generate these LAB files through a simple script, or you can use Prosodylab Aligner Tools to achieve the same.
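The "simple script" route can look like the sketch below: one .lab file per audio file, containing just the transcript, generated from the tab-delimited metadata. The paths 'metadata.tsv' and the output directory are placeholders; in my understanding MFA expects each .lab to sit next to its .wav with the same base name.

```python
import csv
import os

def write_lab_files(metadata_path, out_dir):
    """Write one <file>.lab per metadata row, containing only the text."""
    os.makedirs(out_dir, exist_ok=True)
    with open(metadata_path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            lab_path = os.path.join(out_dir, row['file'] + '.lab')
            with open(lab_path, 'w', encoding='utf-8') as lab:
                lab.write(row['text'] + '\n')
```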

What Prosodylab offers is an easy-to-use command prompt for pulling exactly what you want out of the metadata text file. In my situation, however, I had to modify chunks of their code to get the LAB files I was looking for. If you find yourself with a large set of LAB files that have repeating text in each file, you might have to do the same.

The modified file is contained in the GitHub repo I created to help with various parts of this process.

Lastly, the prompt itself is self-explanatory, but if more information is needed, I suggest looking at their documentation and/or the README in the GitHub repo I’ve provided.


Step 4: Install and Run MFA

Now you are ready to run MFA and generate those TextGrid files!

Follow these installation steps to effectively use MFA. Once completed, run the following within your command prompt:

activate aligner

mfa train C:/'path to wavs directory' C:/'path to librispeech-lexicon.txt' C:/'desired output path for TextGrid files'

Since MFA uses machine learning to produce these alignment files, this process will take some time. You will eventually have to tweak the hyperparameters to ensure that 90%+ of your audio files are aligned. To help direct your focus, adjusting the lattice beam width will help… a lot!

If you are having a hard time finding the lexicon (for English), you can download it here: LibriSpeech Lexicon.


Step 5: Check Results

The final step is to check the results. This was important for me, since my first several attempts produced many strange errors. If the first couple of TextGrid files look correct, it is reasonably safe to assume the remaining files are also correct.
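For a quicker spot check than opening each file in Praat, a rough regex-based pass (not a full TextGrid parser, just a sketch) can count how many word intervals each file contains and how many are non-empty, so obviously broken files stand out.

```python
import re

def interval_stats(textgrid_text):
    """Return (total_intervals, non_empty_intervals) from TextGrid text."""
    texts = re.findall(r'text = "(.*)"', textgrid_text)
    spoken = [t for t in texts if t.strip()]  # drop silence/empty intervals
    return len(texts), len(spoken)
```

A file whose non-empty count is far below the number of words in its transcript is worth inspecting by hand.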

And there you have it! 

After completing these steps, you should be able to run an alignment-based model on your custom audio files. No matter what you intend to use these models for, it might be a good idea to get them working on the famous LibriSpeech dataset first, so you have an idea of what to expect and can better direct how you modify the code for your particular audio files.


GitHub Repo: 


Ren, Y., Liu, T., Zhao, Z., Zhao, S., Qin, T., Tan, X., & Hu, C. (2020, October 16). FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. Retrieved January 20, 2021, from

Shen, J., Pang, R., Weiss, R., Schuster, M., Jaitly, N., Yang, Z., . . . Wu, Y. (2018, February 16). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. Retrieved January 20, 2021, from
