YAMNet neural network
Download and unzip the Audio Toolbox™ model for YAMNet.
Type yamnet at the Command Window. If the Audio Toolbox model for YAMNet is not installed, the function provides a link to the location of the network weights. To download the model, click the link. Unzip the file to a location on the MATLAB path.
Alternatively, execute the following commands to download and unzip the YAMNet model to your temporary directory.
downloadFolder = fullfile(tempdir,'YAMNetDownload');
loc = websave(downloadFolder,'https://ssd.mathworks.com/supportfiles/audio/yamnet.zip');
YAMNetLocation = tempdir;
unzip(loc,YAMNetLocation)
addpath(fullfile(YAMNetLocation,'yamnet'))
Check that the installation is successful by typing yamnet at the Command Window. If the network is installed, then the function returns a SeriesNetwork (Deep Learning Toolbox) object.
yamnet
ans = 
  SeriesNetwork with properties:

         Layers: [86×1 nnet.cnn.layer.Layer]
     InputNames: {'input_1'}
    OutputNames: {'Sound'}
Load a pretrained YAMNet convolutional neural network and examine the layers and classes.
Use yamnet to load the pretrained YAMNet network. The output net is a SeriesNetwork (Deep Learning Toolbox) object.
net = yamnet
net = 
  SeriesNetwork with properties:

         Layers: [86×1 nnet.cnn.layer.Layer]
     InputNames: {'input_1'}
    OutputNames: {'Sound'}
View the network architecture using the Layers property. The network has 86 layers. There are 28 layers with learnable weights: 27 convolutional layers and 1 fully connected layer.
net.Layers
ans = 
  86x1 Layer array with layers:

     1   'input_1'                    Image Input              96×64×1 images
     2   'conv2d'                     Convolution              32 3×3×1 convolutions with stride [2 2] and padding 'same'
     3   'b'                          Batch Normalization      Batch normalization with 32 channels
     4   'activation'                 ReLU                     ReLU
     5   'depthwise_conv2d'           Grouped Convolution      32 groups of 1 3×3×1 convolutions with stride [1 1] and padding 'same'
     6   'L11'                        Batch Normalization      Batch normalization with 32 channels
     7   'activation_1'               ReLU                     ReLU
     8   'conv2d_1'                   Convolution              64 1×1×32 convolutions with stride [1 1] and padding 'same'
     9   'L12'                        Batch Normalization      Batch normalization with 64 channels
    10   'activation_2'               ReLU                     ReLU
    11   'depthwise_conv2d_1'         Grouped Convolution      64 groups of 1 3×3×1 convolutions with stride [2 2] and padding 'same'
    12   'L21'                        Batch Normalization      Batch normalization with 64 channels
    13   'activation_3'               ReLU                     ReLU
    14   'conv2d_2'                   Convolution              128 1×1×64 convolutions with stride [1 1] and padding 'same'
    15   'L22'                        Batch Normalization      Batch normalization with 128 channels
    16   'activation_4'               ReLU                     ReLU
    17   'depthwise_conv2d_2'         Grouped Convolution      128 groups of 1 3×3×1 convolutions with stride [1 1] and padding 'same'
    18   'L31'                        Batch Normalization      Batch normalization with 128 channels
    19   'activation_5'               ReLU                     ReLU
    20   'conv2d_3'                   Convolution              128 1×1×128 convolutions with stride [1 1] and padding 'same'
    21   'L32'                        Batch Normalization      Batch normalization with 128 channels
    22   'activation_6'               ReLU                     ReLU
    23   'depthwise_conv2d_3'         Grouped Convolution      128 groups of 1 3×3×1 convolutions with stride [2 2] and padding 'same'
    24   'L41'                        Batch Normalization      Batch normalization with 128 channels
    25   'activation_7'               ReLU                     ReLU
    26   'conv2d_4'                   Convolution              256 1×1×128 convolutions with stride [1 1] and padding 'same'
    27   'L42'                        Batch Normalization      Batch normalization with 256 channels
    28   'activation_8'               ReLU                     ReLU
    29   'depthwise_conv2d_4'         Grouped Convolution      256 groups of 1 3×3×1 convolutions with stride [1 1] and padding 'same'
    30   'L51'                        Batch Normalization      Batch normalization with 256 channels
    31   'activation_9'               ReLU                     ReLU
    32   'conv2d_5'                   Convolution              256 1×1×256 convolutions with stride [1 1] and padding 'same'
    33   'L52'                        Batch Normalization      Batch normalization with 256 channels
    34   'activation_10'              ReLU                     ReLU
    35   'depthwise_conv2d_5'         Grouped Convolution      256 groups of 1 3×3×1 convolutions with stride [2 2] and padding 'same'
    36   'L61'                        Batch Normalization      Batch normalization with 256 channels
    37   'activation_11'              ReLU                     ReLU
    38   'conv2d_6'                   Convolution              512 1×1×256 convolutions with stride [1 1] and padding 'same'
    39   'L62'                        Batch Normalization      Batch normalization with 512 channels
    40   'activation_12'              ReLU                     ReLU
    41   'depthwise_conv2d_6'         Grouped Convolution      512 groups of 1 3×3×1 convolutions with stride [1 1] and padding 'same'
    42   'L71'                        Batch Normalization      Batch normalization with 512 channels
    43   'activation_13'              ReLU                     ReLU
    44   'conv2d_7'                   Convolution              512 1×1×512 convolutions with stride [1 1] and padding 'same'
    45   'L72'                        Batch Normalization      Batch normalization with 512 channels
    46   'activation_14'              ReLU                     ReLU
    47   'depthwise_conv2d_7'         Grouped Convolution      512 groups of 1 3×3×1 convolutions with stride [1 1] and padding 'same'
    48   'L81'                        Batch Normalization      Batch normalization with 512 channels
    49   'activation_15'              ReLU                     ReLU
    50   'conv2d_8'                   Convolution              512 1×1×512 convolutions with stride [1 1] and padding 'same'
    51   'L82'                        Batch Normalization      Batch normalization with 512 channels
    52   'activation_16'              ReLU                     ReLU
    53   'depthwise_conv2d_8'         Grouped Convolution      512 groups of 1 3×3×1 convolutions with stride [1 1] and padding 'same'
    54   'L91'                        Batch Normalization      Batch normalization with 512 channels
    55   'activation_17'              ReLU                     ReLU
    56   'conv2d_9'                   Convolution              512 1×1×512 convolutions with stride [1 1] and padding 'same'
    57   'L92'                        Batch Normalization      Batch normalization with 512 channels
    58   'activation_18'              ReLU                     ReLU
    59   'depthwise_conv2d_9'         Grouped Convolution      512 groups of 1 3×3×1 convolutions with stride [1 1] and padding 'same'
    60   'L101'                       Batch Normalization      Batch normalization with 512 channels
    61   'activation_19'              ReLU                     ReLU
    62   'conv2d_10'                  Convolution              512 1×1×512 convolutions with stride [1 1] and padding 'same'
    63   'L102'                       Batch Normalization      Batch normalization with 512 channels
    64   'activation_20'              ReLU                     ReLU
    65   'depthwise_conv2d_10'        Grouped Convolution      512 groups of 1 3×3×1 convolutions with stride [1 1] and padding 'same'
    66   'L111'                       Batch Normalization      Batch normalization with 512 channels
    67   'activation_21'              ReLU                     ReLU
    68   'conv2d_11'                  Convolution              512 1×1×512 convolutions with stride [1 1] and padding 'same'
    69   'L112'                       Batch Normalization      Batch normalization with 512 channels
    70   'activation_22'              ReLU                     ReLU
    71   'depthwise_conv2d_11'        Grouped Convolution      512 groups of 1 3×3×1 convolutions with stride [2 2] and padding 'same'
    72   'L121'                       Batch Normalization      Batch normalization with 512 channels
    73   'activation_23'              ReLU                     ReLU
    74   'conv2d_12'                  Convolution              1024 1×1×512 convolutions with stride [1 1] and padding 'same'
    75   'L122'                       Batch Normalization      Batch normalization with 1024 channels
    76   'activation_24'              ReLU                     ReLU
    77   'depthwise_conv2d_12'        Grouped Convolution      1024 groups of 1 3×3×1 convolutions with stride [1 1] and padding 'same'
    78   'L131'                       Batch Normalization      Batch normalization with 1024 channels
    79   'activation_25'              ReLU                     ReLU
    80   'conv2d_13'                  Convolution              1024 1×1×1024 convolutions with stride [1 1] and padding 'same'
    81   'L132'                       Batch Normalization      Batch normalization with 1024 channels
    82   'activation_26'              ReLU                     ReLU
    83   'global_average_pooling2d'   Global Average Pooling   Global average pooling
    84   'dense'                      Fully Connected          521 fully connected layer
    85   'softmax'                    Softmax                  softmax
    86   'Sound'                      Classification Output    crossentropyex with 'Speech' and 520 other classes
To view the names of the classes learned by the network, view the Classes property of the classification output layer (the final layer). View the first 10 classes by indexing the first 10 elements.
net.Layers(end).Classes(1:10)
ans = 10×1 categorical
Speech
Child speech, kid speaking
Conversation
Narration, monologue
Babbling
Speech synthesizer
Shout
Bellow
Whoop
Yell
Use analyzeNetwork (Deep Learning Toolbox) to visually explore the network.
analyzeNetwork(net)
YAMNet was released with a corresponding sound class ontology, which you can explore using the yamnetGraph object.
ygraph = yamnetGraph;
p = plot(ygraph);
layout(p,'layered')
The ontology graph plots all 521 possible sound classes. Plot a subgraph of the sounds related to respiratory sounds.
allRespiratorySounds = dfsearch(ygraph,"Respiratory sounds");
ygraphRespiratory = subgraph(ygraph,allRespiratorySounds);
plot(ygraphRespiratory)
The YAMNet network requires you to preprocess and extract features from audio signals by converting them to the sample rate the network was trained on, and then extracting overlapping log-mel spectrograms. This example walks through the preprocessing and feature extraction required to match what was used to train YAMNet. The classifySound function performs these steps for you.
Read in an audio signal to classify it. Resample the audio signal to 16 kHz and then convert it to single precision.
[audioIn,fs0] = audioread('Counting-16-44p1-mono-15secs.wav');
fs = 16e3;
audioIn = resample(audioIn,fs,fs0);
audioIn = single(audioIn);
Define mel spectrogram parameters and then extract features using the melSpectrogram function.
FFTLength = 512;
numBands = 64;
frequencyRange = [125 7500];
windowLength = 0.025*fs;
overlapLength = 0.015*fs;

melSpect = melSpectrogram(audioIn,fs, ...
    'Window',hann(windowLength,'periodic'), ...
    'OverlapLength',overlapLength, ...
    'FFTLength',FFTLength, ...
    'FrequencyRange',frequencyRange, ...
    'NumBands',numBands, ...
    'FilterBankNormalization','none', ...
    'WindowNormalization',false, ...
    'SpectrumType','magnitude', ...
    'FilterBankDesignDomain','warped');
Convert the mel spectrogram to the log scale.
melSpect = log(melSpect + single(0.001));
Reorient the mel spectrogram so that time is along the first dimension as rows.
melSpect = melSpect.';
[numSTFTWindows,numBands] = size(melSpect)
numSTFTWindows = 1551
numBands = 64
Partition the spectrogram into frames of length 96 with an overlap of 48. Place the frames along the fourth dimension.
frameWindowLength = 96;
frameOverlapLength = 48;
hopLength = frameWindowLength - frameOverlapLength;
numHops = floor((numSTFTWindows - frameWindowLength)/hopLength) + 1;

frames = zeros(frameWindowLength,numBands,1,numHops,'like',melSpect);
for hop = 1:numHops
    range = 1 + hopLength*(hop-1):hopLength*(hop-1) + frameWindowLength;
    frames(:,:,1,hop) = melSpect(range,:);
end
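As a check on the framing arithmetic: the hop between frames is 96 − 48 = 48 spectra, so the 1551 STFT windows computed above yield floor((1551 − 96)/48) + 1 frames. A minimal sketch, using the values from this example:

```matlab
% Expected number of 96-spectrum frames from 1551 STFT windows
% with a hop of 48 (values taken from the example above).
numSTFTWindows = 1551;
frameWindowLength = 96;
hopLength = 96 - 48;
numHops = floor((numSTFTWindows - frameWindowLength)/hopLength) + 1  % returns 31
```

The network therefore classifies 31 spectrogram images for this 15-second signal.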
Create a YAMNet network.
net = yamnet();
Classify the spectrogram images.
classes = classify(net,frames);
Classify the audio signal as the most frequently occurring sound.
mySound = mode(classes)
mySound = categorical
Speech
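As a cross-check, the classifySound function performs the same resampling, log-mel feature extraction, framing, and classification in a single call. This is a sketch, not a substitute for the walkthrough above; the exact classes returned depend on the detection thresholds classifySound applies internally:

```matlab
% Detect sound classes in the original signal with one call.
% classifySound handles resampling and feature extraction internally.
[audioIn,fs0] = audioread('Counting-16-44p1-mono-15secs.wav');
sounds = classifySound(audioIn,fs0)
```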
Download and unzip the air compressor data set [1]. This data set consists of recordings from air compressors in a healthy state or one of seven faulty states.
url = 'https://www.mathworks.com/supportfiles/audio/AirCompressorDataset/AirCompressorDataset.zip';
downloadFolder = fullfile(tempdir,'aircompressordataset');
datasetLocation = fullfile(tempdir,'AirCompressorDataSet');
if ~exist(datasetLocation,'dir')
    loc = websave(downloadFolder,url);
    unzip(loc,datasetLocation)
end
Create an audioDatastore object to manage the data and split it into train and validation sets.
ads = audioDatastore(fullfile(tempdir,'AirCompressorDataSet'),'IncludeSubfolders',true,'LabelSource','foldernames');
[adsTrain,adsValidation] = splitEachLabel(ads,0.8,0.2);
Read an audio file from the datastore and save the sample rate for later use. Reset the datastore to return the read pointer to the beginning of the data set. Listen to the audio signal and plot the signal in the time domain.
[x,fileInfo] = read(adsTrain);
fs = fileInfo.SampleRate;
reset(adsTrain)

sound(x,fs)

figure
t = (0:size(x,1)-1)/fs;
plot(t,x)
xlabel('Time (s)')
title('State = ' + string(fileInfo.Label))
axis tight
Create an audioFeatureExtractor object to extract the Bark spectrum from audio signals. Use the same window, overlap length, frequency range, and number of bands as YAMNet was trained on. Depending on your transfer learning task, you can deviate more or less from the input features YAMNet was trained on.
afe = audioFeatureExtractor('SampleRate',fs, ...
    'Window',hann(0.025*fs,'periodic'), ...
    'OverlapLength',round(0.015*fs), ...
    'barkSpectrum',true);
setExtractorParams(afe,'barkSpectrum','NumBands',64);
Extract Bark spectrograms from the train set. There are multiple Bark spectrograms for each audio signal. Replicate the labels so that they are in one-to-one correspondence with the spectrograms.
numSpectrumsPerSpectrogram = 96;
numSpectrumsOverlapBetweenSpectrograms = 48;
numSpectrumsHopBetweenSpectrograms = numSpectrumsPerSpectrogram - numSpectrumsOverlapBetweenSpectrograms;

emptyLabelVector = adsTrain.Labels;
emptyLabelVector(:) = [];

trainFeatures = [];
trainLabels = emptyLabelVector;
while hasdata(adsTrain)
    [audioIn,fileInfo] = read(adsTrain);
    features = extract(afe,audioIn);
    features = log10(features + single(0.001));
    [numSpectrums,numBands] = size(features);
    numSpectrograms = floor((numSpectrums - numSpectrumsPerSpectrogram)/numSpectrumsHopBetweenSpectrograms) + 1;
    for hop = 1:numSpectrograms
        range = 1 + numSpectrumsHopBetweenSpectrograms*(hop-1):numSpectrumsHopBetweenSpectrograms*(hop-1) + numSpectrumsPerSpectrogram;
        trainFeatures = cat(4,trainFeatures,features(range,:));
        trainLabels = cat(1,trainLabels,fileInfo.Label);
    end
end
Extract features from the validation set and replicate the labels.
validationFeatures = [];
validationLabels = emptyLabelVector;
while hasdata(adsValidation)
    [audioIn,fileInfo] = read(adsValidation);
    features = extract(afe,audioIn);
    features = log10(features + single(0.001));
    [numSpectrums,numBands] = size(features);
    numSpectrograms = floor((numSpectrums - numSpectrumsPerSpectrogram)/numSpectrumsHopBetweenSpectrograms) + 1;
    for hop = 1:numSpectrograms
        range = 1 + numSpectrumsHopBetweenSpectrograms*(hop-1):numSpectrumsHopBetweenSpectrograms*(hop-1) + numSpectrumsPerSpectrogram;
        validationFeatures = cat(4,validationFeatures,features(range,:));
        validationLabels = cat(1,validationLabels,fileInfo.Label);
    end
end
The air compressor data set has only eight classes. Read in YAMNet, convert it to a layerGraph (Deep Learning Toolbox), and then replace the final fullyConnectedLayer (Deep Learning Toolbox) and the final classificationLayer (Deep Learning Toolbox) to reflect the new task.
uniqueLabels = unique(adsTrain.Labels);
numLabels = numel(uniqueLabels);

net = yamnet;
lgraph = layerGraph(net.Layers);

newDenseLayer = fullyConnectedLayer(numLabels,"Name","dense");
lgraph = replaceLayer(lgraph,"dense",newDenseLayer);

newClassificationLayer = classificationLayer("Name","Sounds","Classes",uniqueLabels);
lgraph = replaceLayer(lgraph,"Sound",newClassificationLayer);
To define training options, use trainingOptions (Deep Learning Toolbox).
miniBatchSize = 128;
validationFrequency = floor(numel(trainLabels)/miniBatchSize);
options = trainingOptions('adam', ...
    'InitialLearnRate',3e-4, ...
    'MaxEpochs',2, ...
    'MiniBatchSize',miniBatchSize, ...
    'Shuffle','every-epoch', ...
    'Plots','training-progress', ...
    'Verbose',false, ...
    'ValidationData',{single(validationFeatures),validationLabels}, ...
    'ValidationFrequency',validationFrequency);
To train the network, use trainNetwork (Deep Learning Toolbox).
trainNetwork(single(trainFeatures),trainLabels,lgraph,options);
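trainNetwork also returns the trained network. As a sketch (the variable names trainedNet, predictedLabels, and validationAccuracy are illustrative, not part of the example above), you can capture the output and estimate validation accuracy with classify:

```matlab
% Capture the trained network and score it on the held-out validation set.
trainedNet = trainNetwork(single(trainFeatures),trainLabels,lgraph,options);
predictedLabels = classify(trainedNet,single(validationFeatures));
validationAccuracy = mean(predictedLabels == validationLabels)
```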
References
[1] Verma, Nishchal K., et al. “Intelligent Condition Based Monitoring Using Acoustic Signals for Air Compressors.” IEEE Transactions on Reliability, vol. 65, no. 1, Mar. 2016, pp. 291–309. DOI.org (Crossref), doi:10.1109/TR.2015.2459684.
net — Pretrained YAMNet neural network
SeriesNetwork object

Pretrained YAMNet neural network, returned as a SeriesNetwork (Deep Learning Toolbox) object.
[1] Gemmeke, Jort F., et al. “Audio Set: An Ontology and Human-Labeled Dataset for Audio Events.” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 776–80. DOI.org (Crossref), doi:10.1109/ICASSP.2017.7952261.
[2] Hershey, Shawn, et al. “CNN Architectures for Large-Scale Audio Classification.” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 131–35. DOI.org (Crossref), doi:10.1109/ICASSP.2017.7952132.
Usage notes and limitations:

Only the activations and predict object functions are supported.

To create a SeriesNetwork object for code generation, see Load Pretrained Networks for Code Generation (MATLAB Coder).
Usage notes and limitations:

Only the activations, classify, predict, predictAndUpdateState, and resetState object functions are supported.

To create a SeriesNetwork object for code generation, see Load Pretrained Networks for Code Generation (GPU Coder).
audioFeatureExtractor | classifySound | designAuditoryFilterBank | melSpectrogram | vggish | yamnetGraph