TorchAudio 2.1 Release Notes
Highlights
TorchAudio v2.1 introduces the following new features and backward-incompatible changes:
- [BETA] A new API to apply filters, effects and codecs
torchaudio.io.AudioEffector can apply filters, effects and encodings to waveforms in an online/offline fashion. You can use it as a form of augmentation; a minimal sketch appears after this list.
Please refer to https://pytorch.org/audio/2.1/tutorials/effector_tutorial.html for examples.
- [BETA] Tools for forced alignment
New functions and a pre-trained model for forced alignment were added. torchaudio.functional.forced_align computes alignment from an emission, and torchaudio.pipelines.MMS_FA provides access to the model trained for multilingual forced alignment in the MMS: Scaling Speech Technology to 1000+ Languages project. A minimal sketch appears after this list.
Please refer to https://pytorch.org/audio/2.1/tutorials/ctc_forced_alignment_api_tutorial.html for the usage of the forced_align function, and https://pytorch.org/audio/2.1/tutorials/forced_alignment_for_multilingual_data_tutorial.html for how to use MMS_FA to align transcripts in multiple languages.
- [BETA] TorchAudio-Squim: Models for reference-free speech assessment
Model architectures and pre-trained models from the paper TorchAudio-Squim: Reference-less Speech Quality and Intelligibility measures in TorchAudio were added.
You can use the torchaudio.pipelines.SQUIM_SUBJECTIVE and torchaudio.pipelines.SQUIM_OBJECTIVE models to estimate various speech quality and intelligibility metrics. This is helpful when evaluating the quality of speech generation models, such as TTS. A minimal sketch appears after this list.
Please refer to https://pytorch.org/audio/2.1/tutorials/squim_tutorial.html for details.
- [BETA] CUDA-based CTC decoder
torchaudio.models.decoder.CUCTCDecoder takes emissions stored in CUDA memory and performs fast CTC beam search on them on the CUDA device. This eliminates the need to move data from the CUDA device to the CPU when performing automatic speech recognition, so, combined with PyTorch's CUDA support, the entire speech recognition pipeline can now run on CUDA. A sketch appears after this list.
Please refer to https://pytorch.org/audio/2.1/tutorials/asr_inference_with_cuda_ctc_decoder_tutorial.html for details.
- [Prototype] Utilities for AI music generation
We are working on adding utilities relevant to music AI. Since the last release, the following APIs were added to the prototype module. Please refer to the respective documentation for usage.
- torchaudio.prototype.chroma_filterbank
- torchaudio.prototype.transforms.ChromaScale
- torchaudio.prototype.transforms.ChromaSpectrogram
- torchaudio.prototype.pipelines.VGGISH
- New recipes for training models
Recipes for audio-visual ASR, multi-channel DNN beamforming and TCPGen context-biasing were added. Please refer to the recipes.
- Update to FFmpeg support
The version of supported FFmpeg libraries was updated. TorchAudio v2.1 works with FFmpeg 6, 5 and 4.4; support for 4.3, 4.2 and 4.1 is dropped.
Please refer to https://pytorch.org/audio/2.1/installation.html#optional-dependencies for details of the new FFmpeg integration mechanism.
- Update to libsox integration
TorchAudio now depends on libsox installed separately from TorchAudio. The sox I/O backend no longer supports file-like objects. (These are supported by the FFmpeg and soundfile backends.)
Please refer to https://pytorch.org/audio/2.1/installation.html#optional-dependencies for details.
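Below are the minimal sketches referenced in the list above. First, applying an FFmpeg filter with torchaudio.io.AudioEffector; the sine-wave input and the atempo filter string are illustrative choices, not part of the release, so treat this as a sketch rather than the canonical usage.

import math
import torch
from torchaudio.io import AudioEffector

# Two seconds of a 440 Hz sine tone as stand-in input.
# AudioEffector expects waveforms of shape (time, channel).
sample_rate = 16000
time_axis = torch.arange(2 * sample_rate) / sample_rate
waveform = torch.sin(2 * math.pi * 440 * time_axis).unsqueeze(1)

# Apply an FFmpeg filter (here, a tempo change) offline.
effector = AudioEffector(effect="atempo=0.8")
augmented = effector.apply(waveform, sample_rate)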
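Next, forced alignment with torchaudio.functional.forced_align. The emission is random and the token values are made up for illustration; the blank index of 0 follows the tutorial's convention.

import torch
import torchaudio.functional as F

# Dummy CTC emission: batch of 1, 20 frames, 5 token classes (0 = blank),
# as log-probabilities.
emission = torch.randn(1, 20, 5).log_softmax(dim=-1)
# Token sequence to align; it must not contain the blank index.
targets = torch.tensor([[1, 3, 2]], dtype=torch.int32)

aligned_tokens, scores = F.forced_align(emission, targets, blank=0)
# aligned_tokens holds one token index per frame; the new
# torchaudio.functional.merge_tokens can collapse it into token spans.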
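For TorchAudio-Squim, reference-free estimation of objective metrics looks roughly as follows; the random waveform is a placeholder, and the metric ordering (STOI, PESQ, SI-SDR) follows the SQUIM tutorial.

import torch
import torchaudio

bundle = torchaudio.pipelines.SQUIM_OBJECTIVE
model = bundle.get_model()

# One second of placeholder audio at the bundle's expected sample rate,
# shaped (batch, time).
waveform = torch.rand(1, bundle.sample_rate)
stoi, pesq, si_sdr = model(waveform)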
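Finally, a sketch of the CUDA-based CTC decoder, assuming the cuda_ctc_decoder factory function used in the linked tutorial; the token set and shapes are illustrative, and a CUDA device is required.

import torch
from torchaudio.models.decoder import cuda_ctc_decoder

# Token set for the decoder; index 0 must be the blank symbol.
tokens = ["-", "a", "b", "c"]
decoder = cuda_ctc_decoder(tokens, nbest=1, beam_size=10)

# Emissions and lengths must already live in CUDA memory.
log_probs = torch.randn(1, 50, len(tokens), device="cuda").log_softmax(dim=-1)
lengths = torch.full((1,), 50, dtype=torch.int32, device="cuda")

results = decoder(log_probs, lengths)
best = results[0][0]  # top hypothesis, with token indices and a score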
New Features
I/O
- Support overwriting PTS in torchaudio.io.StreamWriter (#3135)
- Include format information after filtering in torchaudio.io.StreamReader.get_out_stream_info (#3155)
- Support CUDA frame in torchaudio.io.StreamReader filter graph (#3183, #3479)
- Support YUV444P in GPU decoder (#3199)
- Add additional filter graph processing to torchaudio.io.StreamWriter (#3194)
- Cache and reuse HW device context in GPU decoder (#3178)
- Cache and reuse HW device context in GPU encoder (#3215)
- Support changing the number of channels in torchaudio.io.StreamReader (#3216)
- Support encode spec change in torchaudio.io.StreamWriter (#3207)
- Support encode options such as compression rate and bit rate (#3179, #3203, #3224)
- Add 420p10le support to torchaudio.io.StreamReader CPU decoder (#3332)
- Support multiple FFmpeg versions (#3464, #3476)
- Support writing opus and mp3 with soundfile (#3554)
- Add switch to disable sox integration and ffmpeg integration at runtime (#3500)
Ops
- Add torchaudio.io.AudioEffector (#3163, #3372, #3374)
- Add torchaudio.transforms.SpecAugment (#3309, #3314)
- Add torchaudio.functional.forced_align (#3348, #3355, #3533, #3536, #3354, #3365, #3433, #3357)
- Add torchaudio.functional.merge_tokens (#3535, #3614)
- Add torchaudio.functional.frechet_distance (#3545)
Models
- Add torchaudio.models.SquimObjective for speech assessment (#3042, #3087, #3512)
- Add torchaudio.models.SquimSubjective for speech assessment (#3189)
- Add torchaudio.models.decoder.CUCTCDecoder (#3096)
Pipelines
- Add torchaudio.pipelines.SquimObjectiveBundle for speech assessment (#3103)
- Add torchaudio.pipelines.SquimSubjectiveBundle for speech assessment (#3197)
- Add torchaudio.pipelines.MMS_FA bundle for forced alignment (#3521, #3538)
Tutorials
- Add tutorial for torchaudio.io.AudioEffector (#3226)
- Add tutorials for CTC forced alignment API (#3356, #3443, #3529, #3534, #3542, #3546, #3566)
- Add tutorial for torchaudio.models.decoder.CUCTCDecoder (#3297)
- Add tutorial for real-time AV-ASR (#3511)
- Add tutorial for TorchAudio-SQUIM pipelines (#3279, #3313)
- Split HW acceleration tutorial into NVDEC/NVENC tutorials (#3483, #3478)
Recipe
- Add TCPGen context-biasing Conformer RNN-T (#2890)
- Add AV-ASR recipe (#3278, #3421, #3441, #3489, #3493, #3498, #3492, #3532)
- Add multi-channel DNN beamforming training recipe (#3036)
Backward-incompatible changes
Third-party libraries
In this release, the following third-party libraries are removed from TorchAudio binary distributions. TorchAudio now searches for and links these libraries at runtime. Please install them to use the corresponding APIs.
SoX
libsox is used for various audio I/O and filtering operations. Pre-built binaries are available via package managers such as conda, apt and brew. Please refer to the respective documentation.
The affected APIs include:
- torchaudio.load ("sox" backend)
- torchaudio.info ("sox" backend)
- torchaudio.save ("sox" backend)
- torchaudio.sox_effects.apply_effects_tensor
- torchaudio.sox_effects.apply_effects_file
- torchaudio.functional.apply_codec (also deprecated, see below)
Changes related to the removal: #3232, #3246, #3497, #3035
Flashlight Text
flashlight-text is the core of the CTC decoder. Pre-built packages are available on PyPI. Please refer to https://github.com/flashlight/text for details.
The affected APIs include:
- torchaudio.models.decoder.CTCDecoder
Changes related to the removal: #3232, #3246, #3236, #3339
Kaldi
A custom-built libkaldi was used to implement torchaudio.functional.compute_kaldi_pitch. This function, along with the libkaldi integration, is removed in this release. There is no replacement.
Changes related to the removal: #3368, #3403
I/O
- Switch to the backend dispatcher (#3241)
To make I/O operations more flexible, TorchAudio introduced the backend dispatcher in v2.0, and users could opt in to use it. In this release, the backend dispatcher becomes the default mechanism for selecting the I/O backend.
You can pass the backend argument to the torchaudio.info, torchaudio.load and torchaudio.save functions to select the I/O backend library on a per-call basis. (If it is omitted, an available backend is automatically selected.) A sketch appears after this list.
If you want to use the global backend mechanism, you can set the environment variable TORCHAUDIO_USE_BACKEND_DISPATCHER=0. Please note, however, that the global backend mechanism is deprecated and is going to be removed in the next release.
Please see #2950 for the details of the migration work.
- torchaudio.io.StreamReader accepted a byte-string wrapped in a 1D torch.Tensor object. This is no longer supported. Please wrap the underlying data with io.BytesIO instead (see the sketch after this list).
- The optional arguments of the add_[audio|video]_stream methods of torchaudio.io.StreamReader and torchaudio.io.StreamWriter are now keyword-only (see the sketch after this list).
- Drop the support of FFmpeg < 4.1 (#3561, #3557)
Previously, TorchAudio supported FFmpeg 4 (>=4.1, <=4.4). In this release, TorchAudio supports FFmpeg 4, 5 and 6 (>=4.4, <7). With this change, support for FFmpeg 4.1, 4.2 and 4.3 is dropped.
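The sketches referenced in the list above follow. First, per-call backend selection with the dispatcher; the file paths are placeholders.

import torchaudio

# Select the I/O backend per call.
waveform, sample_rate = torchaudio.load("speech.wav", backend="ffmpeg")

# Omitting `backend` lets TorchAudio pick an available backend automatically.
torchaudio.save("speech_copy.wav", waveform, sample_rate)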
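Second, the io.BytesIO migration for torchaudio.io.StreamReader; again, the path is a placeholder.

import io
from torchaudio.io import StreamReader

with open("speech.mp3", "rb") as f:
    data = f.read()

# Previously the encoded bytes could be wrapped in a 1D uint8 torch.Tensor;
# now wrap them in io.BytesIO instead.
reader = StreamReader(io.BytesIO(data))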
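Third, the keyword-only argument change, sketched with add_audio_stream; the argument values are arbitrary.

from torchaudio.io import StreamReader

reader = StreamReader("speech.wav")  # placeholder path
# Before v2.1, optional arguments could be passed positionally, e.g.
#   reader.add_audio_stream(4096, 3)
# Now they must be passed by keyword:
reader.add_audio_stream(frames_per_chunk=4096, buffer_chunk_size=3)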
Ops
- Use named file in torchaudio.functional.apply_codec (#3397)
In previous versions, TorchAudio shipped a custom-built libsox so that it could perform in-memory decoding and encoding. Now, in-memory decoding and encoding are handled by the FFmpeg binding, and with the switch to dynamic libsox linking, torchaudio.functional.apply_codec no longer processes audio in an in-memory fashion. Instead, it writes to a temporary file. For in-memory processing, please use torchaudio.io.AudioEffector.
- Switch to lstsq when solving InverseMelScale (#3280)
Previously, torchaudio.transforms.InverseMelScale ran an SGD optimizer to find the inverse of the mel-scale transform. This approach has a number of issues, as listed in #2643. This release switches to torch.linalg.lstsq.
Models
- The infer method of torchaudio.models.RNNTBeamSearch has been updated to accept a series of previous hypotheses.
bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH
decoder: RNNTBeamSearch = bundle.get_decoder()

hypothesis = None
while streaming:
    ...
    hypo, state = decoder.infer(
        features,
        length,
        beam_width,
        state=state,
        hypothesis=hypothesis,
    )
    ...
    hypothesis = hypo
    # Previously this had to be hypothesis = hypo[0]
Deprecations
Ops
- Update and deprecate torchaudio.functional.apply_codec function (#3386)
Due to the removal of the custom libsox binding, torchaudio.functional.apply_codec no longer supports in-memory processing. Please migrate to torchaudio.io.AudioEffector.
Please refer to the following resources for the detailed usage of torchaudio.io.AudioEffector:
- https://pytorch.org/audio/2.1/generated/torchaudio.io.AudioEffector.html
- https://pytorch.org/audio/stable/tutorials/effector_tutorial.html
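As a minimal migration sketch (not taken from the linked docs): apply_codec accepted (channel, time) waveforms, while AudioEffector expects (time, channel); the mp3 format choice here is illustrative.

import torch
from torchaudio.io import AudioEffector

sample_rate = 16000
waveform = torch.rand(sample_rate, 1)  # (time, channel)

# Previously:
#   torchaudio.functional.apply_codec(waveform.T, sample_rate, format="mp3")
# In-memory equivalent:
effector = AudioEffector(format="mp3")
coded = effector.apply(waveform, sample_rate)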
Bug Fixes
Models
- Fix the negative sampling in ConformerWav2Vec2PretrainModel (#3085)
- Fix extract_features method for WavLM models (#3350)
Tutorials
- Fix backtracking in forced alignment tutorial (#3440)
- Fix initialization of get_trellis in forced alignment tutorial (#3172)
Build
- Fix MKL issue on Intel mac build (#3307)
I/O
- Suppress warning when saving vorbis with sox backend (#3359)
- Fix g722 encoding in torchaudio.io.StreamWriter (#3373)
- Refactor arg mapping in ffmpeg save function (#3387)
- Fix INT16 saving in sox backend (#3524)
- Fix SoundfileBackend method decorators (#3550)
- Fix PTS initialization when using NVIDIA encoder (#3312)
Ops
- Add non-default CUDA device support to lfilter (#3432)
Improvements
I/O
- Set "experimental" automatically when using native opus/vorbis encoder (#3192)
- Improve the performance of NV12 frame conversion (#3344)
- Improve the performance of YUV420P frame conversion (#3342)
- Refactor backend implementations (#3547, #3548, #3549)
- Raise an error if torchaudio.io.StreamWriter is not opened (#3152)
- Warn if decoding YUV images with different plane size (#3201)
- Expose AudioMetadata (#3556)
- Refactor the internals of torchaudio.io.StreamReader (#3157, #3170, #3186, #3184, #3188, #3320, #3296, #3328, #3419, #3209)
- Refactor the internals of torchaudio.io.StreamWriter (#3205, #3319, #3296, #3328, #3426, #3428)
- Refactor the FFmpeg abstraction layer (#3249, #3251)
- Migrate the binding of FFmpeg utils to PyBind11 (#3228)
- Simplify sox namespace (#3383)
- Use const reference in sox implementation (#3389)
- Ensure StreamReader returns tensors with requires_grad set to False (#3467)
- Set the default number of threads to 1 in StreamWriter (#3370)
- Remove ffmpeg fallback from sox_io backend (#3516)
Ops
- Add arbitrary dim Tensor support to mask_along_axis{,_iid} (#3289)
- Fix resampling to support dynamic input lengths for ONNX exports (#3473)
- Optimize TorchAudio VAD (#3382)
Documentation
- Build and use GPU-enabled FFmpeg in doc CI (#3045)
- Misc tutorial update (#3449)
- Update notes on FFmpeg version (#3480)
- Update documentation about dependencies (#3517)
- Update I/O and backend docs (#3555)
Build
- Resolve some compilation warnings (#3471)
- Use pre-built binaries for ffmpeg extension (#3460)
- Add aarch64 workflow (#3553)
- Add CUDA 12.1 builds (#3284)
- Update CUDA to 12.1 U1 (#3563)
Recipe
- Fix Adam and AdamW initializers in wav2letter example (#3145)
- Update Librispeech RNNT recipe to support Lightning 2.0 (#3336)
- Update HuBERT/SSL training recipes to support Lightning 2.x (#3396)
- Add wav2vec2 loss function in self_supervised_learning training recipe (#3090)
- Add Wav2Vec2DataModule in self_supervised_learning training recipe (#3081)