Audio Processing Using Deep Learning

Extend deep learning workflows with audio and speech processing applications

Apply deep learning to audio and speech processing applications by using Deep Learning Toolbox™ together with Audio Toolbox™.

Apps

Audio Labeler	Define and visualize ground-truth labels

Functions

`audioDatastore`	Datastore for collection of audio files
`audioDataAugmenter`	Augment audio data
`audioFeatureExtractor`	Streamline audio feature extraction
`vggishFeatures`	Extract VGGish features
`vggish`	VGGish neural network
`yamnet`	YAMNet neural network
`yamnetGraph`	Graph of YAMNet AudioSet ontology
`classifySound`	Classify sounds in audio signal
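
As a rough illustration of how several of the functions listed above can fit together, the following sketch reads one file from a datastore, extracts MFCC features, and runs the pretrained sound classifier. It is a minimal sketch under stated assumptions, not a reference workflow: the folder name `myAudioData` is a hypothetical placeholder, the window and overlap durations are arbitrary choices, and `classifySound` is assumed to have its pretrained YAMNet model installed.

```matlab
% Hypothetical folder of audio files, one subfolder per class label.
dataFolder = "myAudioData";

% Collect the files; labels are taken from the subfolder names.
ads = audioDatastore(dataFolder, ...
    'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');

% Read the first file and its sample rate.
[audioIn, info] = read(ads);
fs = info.SampleRate;

% Mix down to mono so the feature matrix is two-dimensional.
audioMono = mean(audioIn, 2);

% Configure MFCC extraction with 30 ms windows and 20 ms overlap
% (illustrative values, not taken from this page).
afe = audioFeatureExtractor( ...
    'SampleRate', fs, ...
    'Window', hamming(round(0.03*fs), 'periodic'), ...
    'OverlapLength', round(0.02*fs), ...
    'mfcc', true);

% Each row of features corresponds to one analysis window.
features = extract(afe, audioMono);

% Classify the sounds in the signal with the pretrained network.
% Assumes the YAMNet model files are already downloaded and on the path.
sounds = classifySound(audioMono, fs);
disp(sounds)
```

The feature matrix returned by `extract` can then be used as a training input for a network built with Deep Learning Toolbox, while `classifySound` requires no training.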

Topics

Introduction to Deep Learning for Audio Applications (Audio Toolbox)

Learn common tools and workflows to apply deep learning to audio applications.

Classify Sound Using Deep Learning (Audio Toolbox)

Train, validate, and test a simple long short-term memory (LSTM) network to classify sounds.

Transfer Learning with Pretrained Audio Networks (Audio Toolbox)

Use transfer learning to retrain YAMNet, a pretrained convolutional neural network (CNN), to classify a new set of audio signals.

Featured Examples

Speech Command Recognition Using Deep Learning

Train a deep learning model that detects the presence of speech commands in audio. The example uses the Speech Commands Dataset [1] to train a convolutional neural network to recognize a given set of commands.

Speech Command Recognition Code Generation with Intel MKL-DNN

Deploy feature extraction and a convolutional neural network (CNN) for speech command recognition on Intel® processors. To generate the feature extraction and network code, you use MATLAB Coder and the Intel Math Kernel Library for Deep Neural Networks (MKL-DNN). In this example, the generated code is a MATLAB executable (MEX) function, which is called by a MATLAB script that displays the predicted speech command along with the time domain signal and auditory spectrogram. For details about audio preprocessing and network training, see Speech Command Recognition Using Deep Learning.

Speech Command Recognition Code Generation on Raspberry Pi

Deploy feature extraction and a convolutional neural network (CNN) for speech command recognition to Raspberry Pi™. To generate the feature extraction and network code, you use MATLAB Coder, MATLAB Support Package for Raspberry Pi Hardware, and the ARM® Compute Library. In this example, the generated code is an executable on your Raspberry Pi, which is called by a MATLAB script that displays the predicted speech command along with the signal and auditory spectrogram. The MATLAB script and the executable on your Raspberry Pi communicate using the User Datagram Protocol (UDP). For details about audio preprocessing and network training, see Speech Command Recognition Using Deep Learning.

Cocktail Party Source Separation Using Deep Learning Networks

Isolate a speech signal using a deep learning network.

Keyword Spotting in Noise Using MFCC and LSTM Networks

Identify a keyword in noisy speech using a deep learning network. In particular, the example uses a bidirectional long short-term memory (BiLSTM) network and mel-frequency cepstral coefficients (MFCCs).

Denoise Speech Using Deep Learning Networks

Denoise speech signals using deep learning networks. The example compares two types of networks applied to the same task: fully connected and convolutional.

Train Generative Adversarial Network (GAN) for Sound Synthesis

Train and use a generative adversarial network (GAN) to generate sounds.

Voice Activity Detection in Noise Using Deep Learning

Detect regions of speech in a low signal-to-noise environment using deep learning. The example uses the Speech Commands Dataset to train a bidirectional long short-term memory (BiLSTM) network to detect voice activity.

Classify Gender Using LSTM Networks

Classify the gender of a speaker using deep learning. The example uses a bidirectional long short-term memory (BiLSTM) network with the following features: gammatone cepstral coefficients (GTCC), pitch, harmonic ratio, and several spectral shape descriptors.

Spoken Digit Recognition with Wavelet Scattering and Deep Learning

Classify spoken digits using both machine learning and deep learning techniques. In the example, you perform classification using wavelet time scattering with a support vector machine (SVM) and with a long short-term memory (LSTM) network. You also apply Bayesian optimization to determine suitable hyperparameters to improve the accuracy of the LSTM network. In addition, the example illustrates an approach using a deep convolutional neural network (CNN) and mel-frequency spectrograms.

Sequential Feature Selection for Audio Features

Walk through a typical feature selection workflow applied to the task of spoken digit recognition.

Speech Emotion Recognition

This example illustrates a simple speech emotion recognition (SER) system that uses a BiLSTM network. You begin by downloading the data set and then test the trained network on individual files. The network was trained on a small German-language database [1].

Acoustic Scene Recognition Using Late Fusion

Create a multi-model late fusion system for acoustic scene recognition. The example trains a convolutional neural network (CNN) using mel spectrograms and an ensemble classifier using wavelet scattering. The example uses the TUT dataset for training and evaluation [1].
