Voice Activity Detection in Noise Using Deep Learning

This example shows how to detect regions of speech in a low signal-to-noise ratio (SNR) environment using deep learning. The example uses the Speech Commands Dataset to train a bidirectional long short-term memory (BiLSTM) network to detect voice activity.

Introduction

Voice activity detection is an essential component of many audio systems, such as automatic speech recognition and speaker recognition. Voice activity detection can be especially challenging in low SNR situations, where speech is obstructed by noise.

This example uses long short-term memory (LSTM) networks, which are a type of recurrent neural network (RNN) well suited to studying sequence and time-series data. An LSTM network can learn long-term dependencies between time steps of a sequence. An LSTM layer (lstmLayer) looks at the time sequence in the forward direction, while a bidirectional LSTM layer (bilstmLayer) looks at the time sequence in both forward and backward directions. This example uses bidirectional LSTM layers.
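
As a point of reference, the two layer types are created with the same constructor pattern. The following is a minimal, illustrative sketch only; the actual network used in this example is defined later in Define the LSTM Network Architecture.

layerForward = lstmLayer(200,"OutputMode","sequence");        % processes the sequence forward in time
layerBidirectional = bilstmLayer(200,"OutputMode","sequence"); % processes the sequence in both directions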

This example trains a voice activity detection bidirectional LSTM network with feature sequences of spectral characteristics and a harmonic ratio metric.

In high SNR scenarios, traditional speech detection algorithms perform adequately. Read in an audio file that consists of words spoken with pauses between them. Resample the audio to 16 kHz. Listen to the audio.

fs = 16e3;
[speech,fileFs] = audioread('Counting-16-44p1-mono-15secs.wav');
speech = resample(speech,fs,fileFs);
speech = speech/max(abs(speech));
sound(speech,fs)

Use the detectSpeech (Audio Toolbox) function to locate regions of speech. The detectSpeech function correctly identifies all regions of speech.

win = hamming(50e-3 * fs,'periodic');
detectSpeech(speech,fs,'Window',win)

Corrupt the audio signal with washing machine noise at a -20 dB SNR. Listen to the corrupted audio.

[noise,fileFs] = audioread('WashingMachine-16-8-mono-200secs.mp3');
noise = resample(noise,fs,fileFs);

SNR = -20;
noiseGain = 10^(-SNR/20) * norm(speech) / norm(noise);

noisySpeech = speech + noiseGain*noise(1:numel(speech));
noisySpeech = noisySpeech./max(abs(noisySpeech));

sound(noisySpeech,fs)
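
As a quick, optional sanity check, you can measure the SNR of the mixture before peak normalization. Because noiseGain is computed from the norm of the entire noise recording while only an excerpt is mixed in, the measured value is close to, but not exactly, -20 dB.

noiseSegment = noiseGain*noise(1:numel(speech));
snrAchieved = 20*log10(norm(speech)/norm(noiseSegment)) % approximately -20 dB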

Call detectSpeech on the noisy audio signal. The function fails to detect the speech regions given the very low SNR.

detectSpeech(noisySpeech,fs,'Window',win)

Load a pretrained network and a configured audioFeatureExtractor (Audio Toolbox) object. The network was trained to detect speech in low SNR environments given features output from the audioFeatureExtractor object.

load('Audio_VoiceActivityDetectionExample.mat','speechDetectNet','afe')
speechDetectNet
speechDetectNet = 
  SeriesNetwork with properties:

         Layers: [6×1 nnet.cnn.layer.Layer]
     InputNames: {'sequenceinput'}
    OutputNames: {'classoutput'}

afe
afe = 
  audioFeatureExtractor with properties:

   Properties
                     Window: [256×1 double]
              OverlapLength: 128
                 SampleRate: 16000
                  FFTLength: []
    SpectralDescriptorInput: 'linearSpectrum'

   Enabled Features
     spectralCentroid, spectralCrest, spectralEntropy, spectralFlux, spectralKurtosis, spectralRolloffPoint
     spectralSkewness, spectralSlope, harmonicRatio

   Disabled Features
     linearSpectrum, melSpectrum, barkSpectrum, erbSpectrum, mfcc, mfccDelta
     mfccDeltaDelta, gtcc, gtccDelta, gtccDeltaDelta, spectralDecrease, spectralFlatness
     spectralSpread, pitch


   To extract a feature, set the corresponding property to true.
   For example, obj.mfcc = true, adds mfcc to the list of enabled features.

Extract features from the speech data and then normalize them. Orient the features so that time is across columns.

features = extract(afe,noisySpeech);
features = (features - mean(features,1)) ./ std(features,[],1);
features = features';

Pass the features through the speech detection network to classify each feature vector as belonging to a frame of speech or not.

decisionsCategorical = classify(speechDetectNet,features);

Each decision corresponds to one analysis window of the audioFeatureExtractor. Replicate the decisions so that they are in one-to-one correspondence with the audio samples. Plot the speech, the noisy speech, and the VAD decisions.

decisionsWindow = 1.2*(double(decisionsCategorical)-1);
decisionsSample = [repelem(decisionsWindow(1),numel(afe.Window)), ...
                   repelem(decisionsWindow(2:end),numel(afe.Window)-afe.OverlapLength)];


t = (0:numel(decisionsSample)-1)/afe.SampleRate;
plot(t,noisySpeech(1:numel(t)), ...
     t,speech(1:numel(t)), ...
     t,decisionsSample);
xlabel('Time (s)')
ylabel('Amplitude')
legend('Noisy Speech','Speech','VAD','Location','southwest')

You can also use the trained VAD network in a streaming context. To simulate a streaming environment, first save the speech and noise signals as WAV files. To simulate streaming input, you will read frames from the files and mix them at a desired SNR.

audiowrite('Speech.wav',speech,fs)
audiowrite('Noise.wav',noise,fs)

To apply the VAD network to streaming audio, you must trade off delay against accuracy. Define parameters for the streaming voice activity detection in noise demonstration. You can set the duration of the test, the sequence length fed into the network, the sequence hop length, and the SNR to test. Generally, a longer sequence length increases accuracy but also increases lag. You can also choose whether the signal output to your device is the original signal or the noisy signal.

testDuration = 20;

sequenceLength = 400;
sequenceHop = 20;
SNR = -20;
noiseGain = 10^(-SNR/20) * norm(speech) / norm(noise);

signalToListenTo = "noisy";

Call the streaming demo helper function to observe the performance of the VAD network on streaming audio. After the streaming demo is complete, you can modify the parameters of the demonstration and then run the streaming demo again. You can find the code for the streaming demo in the Supporting Functions section.

helperStreamingDemo(speechDetectNet,afe, ...
                    'Speech.wav','Noise.wav', ...
                    testDuration,sequenceLength,sequenceHop,signalToListenTo,noiseGain);

The remainder of the example walks through training and evaluating the VAD network.

Train and Evaluate VAD Network

Training:

  1. Create an audioDatastore (Audio Toolbox) that points to the audio speech files used to train the LSTM network.

  2. Create a training signal consisting of speech segments separated by segments of silence of varying durations.

  3. Corrupt the speech-plus-silence signal with washing machine noise (SNR = -10 dB).

  4. Extract feature sequences consisting of spectral characteristics and harmonic ratio from the noisy signal.

  5. Train the LSTM network using the feature sequences to identify regions of voice activity.

Prediction:

  1. Create an audioDatastore of speech files used to test the trained network, and create a test signal consisting of speech separated by segments of silence.

  2. Corrupt the test signal with washing machine noise (SNR = -10 dB).

  3. Extract feature sequences from the noisy test signal.

  4. Identify regions of voice activity by passing the test features through the trained network.

  5. Compare the network's accuracy to the voice activity baseline from the speech-plus-silence test signal.

Here is a sketch of the training process.

Here is a sketch of the prediction process. You use the trained network to make predictions.

Load Speech Commands Data Set

Download and extract the Google Speech Commands Dataset [1].

url = 'https://ssd.mathworks.com/supportfiles/audio/google_speech.zip';

downloadFolder = tempdir;
datasetFolder = fullfile(downloadFolder,'google_speech');

if ~exist(datasetFolder,'dir')
    disp('Downloading Google speech commands data set (1.9 GB)...')
    unzip(url,downloadFolder)
end
Downloading Google speech commands data set (1.9 GB)...

Create an audioDatastore (Audio Toolbox) that points to the training data set.

adsTrain = audioDatastore(fullfile(datasetFolder,'train'),"IncludeSubfolders",true);

Create an audioDatastore (Audio Toolbox) that points to the validation data set.

adsValidation = audioDatastore(fullfile(datasetFolder,'validation'),"IncludeSubfolders",true);

Create Speech-Plus-Silence Training Signal

Read the contents of an audio file using read (Audio Toolbox). Get the sample rate from the adsInfo struct.

[data,adsInfo] = read(adsTrain);
Fs = adsInfo.SampleRate;

Listen to the audio signal using the sound command.

sound(data,Fs)

Plot the audio signal.

timeVector = (1/Fs) * (0:numel(data)-1);
plot(timeVector,data)
ylabel("Amplitude")
xlabel("Time (s)")
title("Sample Audio")
grid on

The signal has non-speech portions (silence, background noise, and so on) that do not contain useful speech information. This example removes silence using the detectSpeech (Audio Toolbox) function.

Extract the useful portion of data. Define a 50 ms periodic Hamming window for analysis. Call detectSpeech with no output arguments to plot the detected speech regions. Call detectSpeech again to return the indices of the detected speech. Isolate the detected speech regions and then use the sound command to listen to the audio.

win = hamming(50e-3 * Fs,'periodic');

detectSpeech(data,Fs,'Window',win);

speechIndices = detectSpeech(data,Fs,'Window',win);

sound(data(speechIndices(1,1):speechIndices(1,2)),Fs)

The detectSpeech function returns indices that tightly surround the detected speech region. It was determined empirically that, for this example, extending the indices of the detected speech by five frames on either side increased the final model's performance. Extend the speech indices by five frames and then listen to the speech.

speechIndices(1,1) = max(speechIndices(1,1) - 5*numel(win),1);
speechIndices(1,2) = min(speechIndices(1,2) + 5*numel(win),numel(data));

sound(data(speechIndices(1,1):speechIndices(1,2)),Fs)
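
For reference, with a 50 ms window at a 16 kHz sample rate, a five-frame extension corresponds to a quarter of a second of audio on each side of the detected region.

extensionDuration = 5*numel(win)/Fs  % 5*(0.05*Fs)/Fs = 0.25 seconds per side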

Reset the training datastore and shuffle the order of files in the datastores.

reset(adsTrain)
adsTrain = shuffle(adsTrain);
adsValidation = shuffle(adsValidation);

The detectSpeech function calculates statistics-based thresholds to determine the speech regions. You can skip the threshold calculation and speed up the detectSpeech function by specifying the thresholds directly. To determine thresholds for a data set, call detectSpeech on a sampling of files and get the thresholds it calculates. Take the mean of the thresholds.

TM = [];
for index1 = 1:500
     data = read(adsTrain); 
     [~,T] = detectSpeech(data,Fs,'Window',win);
     TM = [TM;T];
end

T = mean(TM);

reset(adsTrain)

Create a 1000-second training signal by combining multiple speech files from the training data set. Use detectSpeech to remove unwanted portions of each file. Insert a random period of silence between speech segments.

Preallocate the training signal.

duration = 2000*Fs;
audioTraining = zeros(duration,1);

Preallocate the voice activity training mask. Values of 1 in the mask correspond to samples located in areas with voice activity. Values of 0 correspond to areas with no voice activity.

maskTraining = zeros(duration,1);

Specify a maximum silence segment duration of 2 seconds.

maxSilenceSegment = 2;

Construct the training signal by calling read on the datastore in a loop.

numSamples = 1;    

while numSamples < duration
    data = read(adsTrain);
    data = data ./ max(abs(data)); % Normalize amplitude

    % Determine regions of speech
    idx = detectSpeech(data,Fs,'Window',win,'Thresholds',T);

    % If a region of speech is detected
    if ~isempty(idx)
        
        % Extend the indices by five frames
        idx(1,1) = max(1,idx(1,1) - 5*numel(win));
        idx(1,2) = min(length(data),idx(1,2) + 5*numel(win));
        
        % Isolate the speech
        data = data(idx(1,1):idx(1,2));
        
        % Write speech segment to training signal
        audioTraining(numSamples:numSamples+numel(data)-1) = data;
        
        % Set VAD baseline
        maskTraining(numSamples:numSamples+numel(data)-1) = true;
        
        % Random silence period
        numSilenceSamples = randi(maxSilenceSegment*Fs,1,1);
        numSamples = numSamples + numel(data) + numSilenceSamples;
    end
end

Visualize a 10-second portion of the training signal. Plot the baseline voice activity mask.

figure
range = 1:10*Fs;
plot((1/Fs)*(range-1),audioTraining(range));
hold on
plot((1/Fs)*(range-1),maskTraining(range));
grid on
lines = findall(gcf,"Type","Line");
lines(1).LineWidth = 2;
xlabel("Time (s)")
legend("Signal","Speech Region")
title("Training Signal (first 10 seconds)");

Listen to the first 10 seconds of the training signal.

sound(audioTraining(range),Fs);

Add Noise to the Training Signal

Corrupt the training signal by adding washing machine noise to the speech-plus-silence signal such that the signal-to-noise ratio is -10 dB.

Read 8 kHz noise and convert it to 16 kHz.

noise = audioread("WashingMachine-16-8-mono-1000secs.mp3");
noise = resample(noise,2,1);

Corrupt training signal with noise.

audioTraining = audioTraining(1:numel(noise));
SNR = -10;
noise = 10^(-SNR/20) * noise * norm(audioTraining) / norm(noise);
audioTrainingNoisy = audioTraining + noise; 
audioTrainingNoisy = audioTrainingNoisy / max(abs(audioTrainingNoisy));
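
The truncation to the length of the noise recording is what brings the training signal down to roughly 1000 seconds; the 2000*Fs preallocation only guarantees enough room before truncation. You can confirm the final duration.

trainingDuration = numel(audioTrainingNoisy)/Fs  % approximately 1000 seconds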

Visualize a 10-second portion of the noisy training signal. Plot the baseline voice activity mask.

figure
plot((1/Fs)*(range-1),audioTrainingNoisy(range));
hold on
plot((1/Fs)*(range-1),maskTraining(range));
grid on
lines = findall(gcf,"Type","Line");
lines(1).LineWidth = 2;
xlabel("Time (s)")
legend("Noisy Signal","Speech Area")
title("Training Signal (first 10 seconds)");

Listen to the first 10 seconds of the noisy training signal.

sound(audioTrainingNoisy(range),Fs)

Note that you obtained the baseline voice activity mask using the noiseless speech-plus-silence signal. Verify that using detectSpeech on the noise-corrupted signal does not yield good results.

speechIndices = detectSpeech(audioTrainingNoisy,Fs,'Window',win);

speechIndices(:,1) = max(1,speechIndices(:,1) - 5*numel(win));
speechIndices(:,2) = min(numel(audioTrainingNoisy),speechIndices(:,2) + 5*numel(win));

noisyMask = zeros(size(audioTrainingNoisy));
for ii = 1:size(speechIndices,1)
    noisyMask(speechIndices(ii,1):speechIndices(ii,2)) = 1;
end

Visualize a 10-second portion of the noisy training signal. Plot the voice activity mask obtained by analyzing the noisy signal.

figure
plot((1/Fs)*(range-1),audioTrainingNoisy(range));
hold on
plot((1/Fs)*(range-1),noisyMask(range));
grid on
lines = findall(gcf,"Type","Line");
lines(1).LineWidth = 2;
xlabel("Time (s)")
legend("Noisy Signal","Mask from Noisy Signal")
title("Training Signal (first 10 seconds)");

Create Speech-Plus-Silence Validation Signal

Create a 200-second noisy speech signal to validate the trained network. Use the validation datastore. Note that the validation and training datastores have different speakers.

Preallocate the validation signal and the validation mask. You will use this mask to assess the accuracy of the trained network.

duration = 200*Fs;
audioValidation = zeros(duration,1);
maskValidation = zeros(duration,1);

Construct the validation signal by calling read on the datastore in a loop.

numSamples = 1;    
while numSamples < duration
    data = read(adsValidation);
    data = data ./ max(abs(data)); % Normalize amplitude
    
    % Determine regions of speech
    idx = detectSpeech(data,Fs,'Window',win,'Thresholds',T);
    
    % If a region of speech is detected
    if ~isempty(idx)
        
        % Extend the indices by five frames
        idx(1,1) = max(1,idx(1,1) - 5*numel(win));
        idx(1,2) = min(length(data),idx(1,2) + 5*numel(win));

        % Isolate the speech
        data = data(idx(1,1):idx(1,2));
        
        % Write speech segment to validation signal
        audioValidation(numSamples:numSamples+numel(data)-1) = data;
        
        % Set VAD Baseline
        maskValidation(numSamples:numSamples+numel(data)-1) = true;
        
        % Random silence period
        numSilenceSamples = randi(maxSilenceSegment*Fs,1,1);
        numSamples = numSamples + numel(data) + numSilenceSamples;
    end
end

Corrupt the validation signal by adding washing machine noise to the speech-plus-silence signal such that the signal-to-noise ratio is -10 dB. Use a different noise file for the validation signal than you did for the training signal.

noise = audioread("WashingMachine-16-8-mono-200secs.mp3");
noise = resample(noise,2,1);
noise = noise(1:duration);
audioValidation = audioValidation(1:numel(noise));

noise = 10^(-SNR/20) * noise * norm(audioValidation) / norm(noise);
audioValidationNoisy = audioValidation + noise; 
audioValidationNoisy = audioValidationNoisy / max(abs(audioValidationNoisy));

Extract Training Features

This example trains the LSTM network using the following features:

  1. spectralCentroid (Audio Toolbox)

  2. spectralCrest (Audio Toolbox)

  3. spectralEntropy (Audio Toolbox)

  4. spectralFlux (Audio Toolbox)

  5. spectralKurtosis (Audio Toolbox)

  6. spectralRolloffPoint (Audio Toolbox)

  7. spectralSkewness (Audio Toolbox)

  8. spectralSlope (Audio Toolbox)

  9. harmonicRatio (Audio Toolbox)

This example uses audioFeatureExtractor (Audio Toolbox) to extract all of these features in a single, efficient pipeline. Create an audioFeatureExtractor object to extract the feature set. Use a 256-point Hann window with 50% overlap.

afe = audioFeatureExtractor('SampleRate',Fs, ...
    'Window',hann(256,"Periodic"), ...
    'OverlapLength',128, ...
    ...
    'spectralCentroid',true, ...
    'spectralCrest',true, ...
    'spectralEntropy',true, ...
    'spectralFlux',true, ...
    'spectralKurtosis',true, ...
    'spectralRolloffPoint',true, ...
    'spectralSkewness',true, ...
    'spectralSlope',true, ...
    'harmonicRatio',true);

featuresTraining = extract(afe,audioTrainingNoisy);

Display the dimensions of the features matrix. The first dimension corresponds to the number of windows the signal was broken into (it depends on the window length and the overlap length). The second dimension is the number of features used in this example.

[numWindows,numFeatures] = size(featuresTraining)
numWindows = 125009
numFeatures = 9
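
Assuming the extractor does not pad a final partial frame, the number of windows follows directly from the signal length, the window length, and the overlap length. This quick check should reproduce numWindows.

windowLength = numel(afe.Window);
hopLength = windowLength - afe.OverlapLength;
expectedNumWindows = floor((numel(audioTrainingNoisy) - windowLength)/hopLength) + 1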

In classification applications, it is good practice to normalize all features to have zero mean and unit standard deviation.

Compute the mean and standard deviation for each coefficient, and use them to normalize the data.

M = mean(featuresTraining,1);
S = std(featuresTraining,[],1);
featuresTraining = (featuresTraining - M) ./ S;

Extract the features from the validation signal using the same process.

featuresValidation = extract(afe,audioValidationNoisy);
featuresValidation = (featuresValidation - mean(featuresValidation,1)) ./ std(featuresValidation,[],1);

Each feature vector corresponds to 128 samples of audio (the hop length). For each hop, set the expected voice/no-voice value to the mode of the baseline mask values within the corresponding analysis window. Convert the voice/no-voice mask to categorical.

windowLength = numel(afe.Window);
hopLength = windowLength - afe.OverlapLength;
range = (hopLength) * (1:size(featuresTraining,1)) + hopLength;
maskMode = zeros(size(range));
for index = 1:numel(range)
    maskMode(index) = mode(maskTraining( (index-1)*hopLength+1:(index-1)*hopLength+windowLength ));
end
maskTraining = maskMode.';

maskTrainingCat = categorical(maskTraining);

Do the same for the validation mask.

range = (hopLength) * (1:size(featuresValidation,1)) + hopLength;
maskMode = zeros(size(range));
for index = 1:numel(range)
    maskMode(index) = mode(maskValidation( (index-1)*hopLength+1:(index-1)*hopLength+windowLength ));
end
maskValidation = maskMode.';

maskValidationCat = categorical(maskValidation);

Split the training features and the mask into sequences of length 800, with 75% overlap between consecutive sequences.

sequenceLength = 800;
sequenceOverlap = round(0.75*sequenceLength);

trainFeatureCell = helperFeatureVector2Sequence(featuresTraining',sequenceLength,sequenceOverlap);
trainLabelCell = helperFeatureVector2Sequence(maskTrainingCat',sequenceLength,sequenceOverlap);
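
The helper function steps through the feature matrix in hops of sequenceLength minus sequenceOverlap feature vectors, so you can predict how many sequences it returns (this mirrors the computation inside helperFeatureVector2Sequence).

sequenceHopLength = sequenceLength - sequenceOverlap;  % 200 feature vectors
numTrainSequences = floor((size(featuresTraining,1) - sequenceLength)/sequenceHopLength) + 1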

Define the LSTM Network Architecture

LSTM networks can learn long-term dependencies between time steps of sequence data. This example uses the bidirectional LSTM layer bilstmLayer to look at the sequence in both forward and backward directions.

Specify the input size to be sequences of size 9 (the number of features). Specify two bidirectional LSTM layers, each with an output size of 200. Because both layers have "OutputMode" set to "sequence", the network maps each input time step to 200 features and outputs a classification for every time step, which is what a frame-by-frame voice activity detector requires. Finally, specify two classes by including a fully connected layer of size 2, followed by a softmax layer and a classification layer.

layers = [ ...    
    sequenceInputLayer( size(featuresValidation,2) )    
    bilstmLayer(200,"OutputMode","sequence")   
    bilstmLayer(200,"OutputMode","sequence")   
    fullyConnectedLayer(2)   
    softmaxLayer   
    classificationLayer      
    ];

Next, specify the training options for the classifier. Set MaxEpochs to 20 so that the network makes 20 passes through the training data. Set MiniBatchSize to 64 so that the network looks at 64 training sequences at a time. Set Plots to "training-progress" to generate plots that show the training progress as the number of iterations increases. Set Verbose to 0 to disable printing the table output that corresponds to the data shown in the plot. Set Shuffle to "every-epoch" to shuffle the training sequences at the beginning of each epoch. Set LearnRateSchedule to "piecewise" to decrease the learning rate by a specified factor (0.1) every time a certain number of epochs (5) has passed. Set ValidationData to the validation predictors and targets.

This example uses the adaptive moment estimation (ADAM) solver. ADAM performs better with recurrent neural networks (RNNs) like LSTMs than the default stochastic gradient descent with momentum (SGDM) solver.

maxEpochs = 20;
miniBatchSize = 64;
options = trainingOptions("adam", ...
    "MaxEpochs",maxEpochs, ...
    "MiniBatchSize",miniBatchSize, ...
    "Shuffle","every-epoch", ...
    "Verbose",0, ...
    "SequenceLength",sequenceLength, ...
    "ValidationFrequency",floor(numel(trainFeatureCell)/miniBatchSize), ...
    "ValidationData",{featuresValidation.',maskValidationCat.'}, ...
    "Plots","training-progress", ...
    "LearnRateSchedule","piecewise", ...
    "LearnRateDropFactor",0.1, ...
    "LearnRateDropPeriod",5);

Train the LSTM Network

Train the LSTM network with the specified training options and layer architecture using trainNetwork. Because the training set is large, the training process can take several minutes.

doTraining = true;
if doTraining
   [speechDetectNet,netInfo] = trainNetwork(trainFeatureCell,trainLabelCell,layers,options);
    fprintf("Validation accuracy: %f percent.\n", netInfo.FinalValidationAccuracy);
else
    load speechDetectNet
end
Validation accuracy: 90.089844 percent.

Use Trained Network to Detect Voice Activity

Estimate voice activity in the validation signal using the trained network. Convert the estimated VAD mask from categorical to double.

EstimatedVADMask = classify(speechDetectNet,featuresValidation.');
EstimatedVADMask = double(EstimatedVADMask);
EstimatedVADMask = EstimatedVADMask.' - 1;

Calculate and plot the validation confusion matrix from the vectors of actual and estimated labels.

figure
cm = confusionchart(maskValidation,EstimatedVADMask,"title","Validation Accuracy");
cm.ColumnSummary = "column-normalized";
cm.RowSummary = "row-normalized";
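
For a single scalar summary to accompany the confusion chart, you can also compute the overall framewise accuracy directly from the two masks.

frameAccuracy = 100*mean(EstimatedVADMask == maskValidation) % overall framewise accuracy in percent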

If you changed parameters of your network or feature extraction pipeline, consider resaving the MAT file with the new network and audioFeatureExtractor object.

resaveNetwork = false;
if resaveNetwork
     save('Audio_VoiceActivityDetectionExample.mat','speechDetectNet','afe');
end

Supporting Functions

Convert Feature Vectors to Sequences

function [sequences,sequencePerFile] = helperFeatureVector2Sequence(features,featureVectorsPerSequence,featureVectorOverlap)
    if featureVectorsPerSequence <= featureVectorOverlap
        error('The number of overlapping feature vectors must be less than the number of feature vectors per sequence.')
    end

    if ~iscell(features)
        features = {features};
    end
    hopLength = featureVectorsPerSequence - featureVectorOverlap;
    idx1 = 1;
    sequences = {};
    sequencePerFile = cell(numel(features),1);
    for ii = 1:numel(features)
        sequencePerFile{ii} = floor((size(features{ii},2) - featureVectorsPerSequence)/hopLength) + 1;
        idx2 = 1;
        for j = 1:sequencePerFile{ii}
            sequences{idx1,1} = features{ii}(:,idx2:idx2 + featureVectorsPerSequence - 1); %#ok<AGROW>
            idx1 = idx1 + 1;
            idx2 = idx2 + hopLength;
        end
    end
end

Streaming Demo

function helperStreamingDemo(speechDetectNet,afe,cleanSpeech,noise,testDuration,sequenceLength,sequenceHop,signalToListenTo,noiseGain)

Create dsp.AudioFileReader (DSP System Toolbox) objects to read from the speech and noise files frame by frame.

    speechReader = dsp.AudioFileReader(cleanSpeech,'PlayCount',inf);
    noiseReader = dsp.AudioFileReader(noise,'PlayCount',inf);
    fs = speechReader.SampleRate;

Create a dsp.MovingStandardDeviation (DSP System Toolbox) object and a dsp.MovingAverage (DSP System Toolbox) object. You will use these to determine the standard deviation and mean of the audio features for normalization. As more data is processed, the running statistics become more reliable.

    movSTD = dsp.MovingStandardDeviation('Method','Exponential weighting','ForgettingFactor',1);
    movMean = dsp.MovingAverage('Method','Exponential weighting','ForgettingFactor',1);

Create three dsp.AsyncBuffer (DSP System Toolbox) objects: one to buffer the input audio, one to buffer the extracted features, and one to buffer the output audio. The output buffer is only necessary for visualizing the decisions in real time.

    audioInBuffer = dsp.AsyncBuffer;
    featureBuffer = dsp.AsyncBuffer;
    audioOutBuffer = dsp.AsyncBuffer;

For the audio buffers, you buffer both the original clean speech signal and the noisy signal. You play back only the signal specified by signalToListenTo. Convert the signalToListenTo variable to the channel you want to listen to.

    channelToListenTo = 1;
    if strcmp(signalToListenTo,"clean")
        channelToListenTo = 2;
    end

Create a dsp.TimeScope (DSP System Toolbox) to visualize the original speech signal, the noisy signal that the network is applied to, and the decision output from the network.

    scope = dsp.TimeScope('SampleRate',fs, ...
        'TimeSpan',3, ...
        'BufferLength',fs*3*3, ...
        'YLimits',[-1.2 1.2], ...
        'TimeSpanOverrunAction','Scroll', ...
        'ShowGrid',true, ...
        'NumInputPorts',3, ...
        'LayoutDimensions',[3,1], ...
        'Title','Noisy Speech');
    scope.ActiveDisplay = 2;
    scope.Title = 'Clean Speech (Original)';
    scope.ActiveDisplay = 3;
    scope.Title = 'Detected Speech';

Create an audioDeviceWriter (Audio Toolbox) object to play either the original or noisy audio from your speakers.

    deviceWriter = audioDeviceWriter('SampleRate',fs);

Initialize variables used in the loop.

    windowLength = numel(afe.Window);
    hopLength = windowLength - afe.OverlapLength;
    myMax = 0;
    audioBufferInitialized = false;
    featureBufferInitialized = false;

Run the streaming demonstration.

    tic
    while toc < testDuration
    
        % Read a frame of the speech signal and a frame of the noise signal
        speechIn = speechReader();
        noiseIn = noiseReader();
        
        % Mix the speech and noise at the specified SNR
        noisyAudio = speechIn + noiseGain*noiseIn;
        
        % Update a running max for normalization
        myMax = max(myMax,max(abs(noisyAudio)));
        
        % Write the noisy audio and speech to buffers
        write(audioInBuffer,[noisyAudio,speechIn]);
        
        % If enough samples are buffered,
        % mark the audio buffer as initialized and push the read pointer
        % for the audio buffer up a window length.
        if audioInBuffer.NumUnreadSamples >= windowLength && ~audioBufferInitialized
            audioBufferInitialized = true;
            read(audioInBuffer,windowLength);
        end
        
        % If enough samples are in the audio buffer to calculate a feature
        % vector, read the samples, normalize them, extract the feature vectors, and write
        % the latest feature vector to the features buffer.
        while (audioInBuffer.NumUnreadSamples >= hopLength) && audioBufferInitialized
            x = read(audioInBuffer,windowLength + hopLength,windowLength);
            write(audioOutBuffer,x(end-hopLength+1:end,:));
            noisyAudio = x(:,1);
            noisyAudio = noisyAudio/myMax;
            features = extract(afe,noisyAudio);
            write(featureBuffer,features(2,:));
        end
        
        % If enough feature vectors are buffered, mark the feature buffer
        % as initialized and push the read pointer for the feature buffer
        % and the audio output buffer (so that they are in sync).
        if featureBuffer.NumUnreadSamples >= (sequenceLength + sequenceHop) && ~featureBufferInitialized
            featureBufferInitialized = true;
            read(featureBuffer,sequenceLength - sequenceHop);
            read(audioOutBuffer,(sequenceLength - sequenceHop)*windowLength);
        end
        
       while featureBuffer.NumUnreadSamples >= sequenceHop && featureBufferInitialized
            features = read(featureBuffer,sequenceLength,sequenceLength - sequenceHop);
            features(isnan(features)) = 0;
            
            % Use only the new features to update the
            % standard deviation and mean. Normalize the features.
            localSTD = movSTD(features(end-sequenceHop+1:end,:));
            localMean = movMean(features(end-sequenceHop+1:end,:));
            features = (features - localMean(end,:)) ./ localSTD(end,:);
            
            decision = classify(speechDetectNet,features');
            decision = decision(end-sequenceHop+1:end);
            decision = double(decision)' - 1;
            decision = repelem(decision,hopLength);
            
            audioHop = read(audioOutBuffer,sequenceHop*hopLength);
            
            % Listen to the speech or speech+noise
            deviceWriter(audioHop(:,channelToListenTo));
            
            % Visualize the speech+noise, the original speech, and the
            % voice activity detection.
            scope(audioHop(:,1),audioHop(:,2),audioHop(:,1).*decision)
       end
    end
    release(deviceWriter)
    release(audioInBuffer)
    release(audioOutBuffer)
    release(featureBuffer)
    release(movSTD)
    release(movMean)
    release(scope)
end

References

[1] Warden, P. "Speech Commands: A public dataset for single-word speech recognition," 2017. Available from https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.01.tar.gz. Copyright Google 2017. The Speech Commands Dataset is licensed under the Creative Commons Attribution 4.0 license.
