This example shows how to detect regions of speech in a low signal-to-noise ratio (SNR) environment using deep learning. The example uses the Speech Commands Dataset to train a Bidirectional Long Short-Term Memory (BiLSTM) network to detect voice activity.
Voice activity detection is an essential component of many audio systems, such as automatic speech recognition and speaker recognition. Voice activity detection can be especially challenging in low SNR situations, where speech is obstructed by noise.
This example uses long short-term memory (LSTM) networks, which are a type of recurrent neural network (RNN) well-suited to study sequence and time-series data. An LSTM network can learn long-term dependencies between time steps of a sequence. An LSTM layer (lstmLayer) can look at the time sequence in the forward direction, while a bidirectional LSTM layer (bilstmLayer) can look at the time sequence in both forward and backward directions. This example uses a bidirectional LSTM layer.
This example trains a voice activity detection bidirectional LSTM network with feature sequences of spectral characteristics and a harmonic ratio metric.
In high SNR scenarios, traditional speech detection algorithms perform adequately. Read in an audio file that consists of words spoken with pauses between. Resample the audio to 16 kHz. Listen to the audio.
fs = 16e3;
[speech,fileFs] = audioread('Counting-16-44p1-mono-15secs.wav');
speech = resample(speech,fs,fileFs);
speech = speech/max(abs(speech));
sound(speech,fs)
Use the detectSpeech (Audio Toolbox) function to locate regions of speech. The detectSpeech function correctly identifies all regions of speech.
win = hamming(50e-3 * fs,'periodic');
detectSpeech(speech,fs,'Window',win)
Corrupt the audio signal with washing machine noise at a -20 dB SNR. Listen to the corrupted audio.
[noise,fileFs] = audioread('WashingMachine-16-8-mono-200secs.mp3');
noise = resample(noise,fs,fileFs);
SNR = -20;
noiseGain = 10^(-SNR/20) * norm(speech) / norm(noise);
noisySpeech = speech + noiseGain*noise(1:numel(speech));
noisySpeech = noisySpeech./max(abs(noisySpeech));
sound(noisySpeech,fs)
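The gain above scales the noise so that the mixture has the requested SNR, because SNR in dB is 20*log10 of the ratio of the signal norm to the scaled noise norm. As an optional sanity check (not part of the original example), you can recompute the achieved SNR before the final amplitude normalization, which does not change the ratio.
% Optional check: recompute the SNR achieved by the mixing step above.
scaledNoise = noiseGain*noise(1:numel(speech));        % noise segment actually added
achievedSNR = 20*log10(norm(speech)/norm(scaledNoise)) % should be approximately -20 dB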
Call detectSpeech on the noisy audio signal. The function fails to detect the speech regions given the very low SNR.
detectSpeech(noisySpeech,fs,'Window',win)
Load a pretrained network and a configured audioFeatureExtractor (Audio Toolbox) object. The network was trained to detect speech in low SNR environments given features output from the audioFeatureExtractor object.
load('Audio_VoiceActivityDetectionExample.mat','speechDetectNet','afe')
speechDetectNet
speechDetectNet = 
  SeriesNetwork with properties:

         Layers: [6×1 nnet.cnn.layer.Layer]
     InputNames: {'sequenceinput'}
    OutputNames: {'classoutput'}
afe
afe = 
  audioFeatureExtractor with properties:

   Properties
                     Window: [256×1 double]
              OverlapLength: 128
                 SampleRate: 16000
                  FFTLength: []
    SpectralDescriptorInput: 'linearSpectrum'

   Enabled Features
     spectralCentroid, spectralCrest, spectralEntropy, spectralFlux, spectralKurtosis, spectralRolloffPoint
     spectralSkewness, spectralSlope, harmonicRatio

   Disabled Features
     linearSpectrum, melSpectrum, barkSpectrum, erbSpectrum, mfcc, mfccDelta
     mfccDeltaDelta, gtcc, gtccDelta, gtccDeltaDelta, spectralDecrease, spectralFlatness
     spectralSpread, pitch

   To extract a feature, set the corresponding property to true.
   For example, obj.mfcc = true, adds mfcc to the list of enabled features.
Extract features from the speech data and then normalize them. Orient the features so that time is across columns.
features = extract(afe,noisySpeech);
features = (features - mean(features,1)) ./ std(features,[],1);
features = features';
Pass the features through the speech detection network to classify each feature vector as belonging to a frame of speech or not.
decisionsCategorical = classify(speechDetectNet,features);
Each decision corresponds to an analysis window analyzed by the audioFeatureExtractor. Replicate the decisions so that they are in one-to-one correspondence with the audio samples. Plot the speech, the noisy speech, and the VAD decisions.
decisionsWindow = 1.2*(double(decisionsCategorical)-1);
decisionsSample = [repelem(decisionsWindow(1),numel(afe.Window)), ...
                   repelem(decisionsWindow(2:end),numel(afe.Window)-afe.OverlapLength)];

t = (0:numel(decisionsSample)-1)/afe.SampleRate;
plot(t,noisySpeech(1:numel(t)), ...
     t,speech(1:numel(t)), ...
     t,decisionsSample);
xlabel('Time (s)')
ylabel('Amplitude')
legend('Noisy Speech','Speech','VAD','Location','southwest')
You can also use the trained VAD network in a streaming context. To simulate a streaming environment, first save the speech and noise signals as WAV files. To simulate streaming input, you will read frames from the files and mix them at a desired SNR.
audiowrite('Speech.wav',speech,fs)
audiowrite('Noise.wav',noise,fs)
To apply the VAD network to streaming audio, you have to trade off between delay and accuracy. Define parameters for the streaming voice activity detection in noise demonstration. You can set the duration of the test, the sequence length fed into the network, the sequence hop length, and the SNR to test. Generally, increasing the sequence length increases the accuracy but also increases the lag. You can also choose the signal output to your device as the original signal or the noisy signal.
testDuration = 20;
sequenceLength = 400;
sequenceHop = 20;
SNR = -20;
noiseGain = 10^(-SNR/20) * norm(speech) / norm(noise);
signalToListenTo = "noisy";
Call the streaming demo helper function to observe the performance of the VAD network on streaming audio. The parameters you set using the live controls do not interrupt the streaming example. After the streaming demo is complete, you can modify parameters of the demonstration, then run the streaming demo again. You can find the code for the streaming demo in the Supporting Functions.
helperStreamingDemo(speechDetectNet,afe, ...
    'Speech.wav','Noise.wav', ...
    testDuration,sequenceLength,sequenceHop,signalToListenTo,noiseGain);
The remainder of the example walks through training and evaluating the VAD network.
Training:
Create an audioDatastore (Audio Toolbox) that points to the audio speech files used to train the LSTM network.
Create a training signal consisting of speech segments separated by segments of silence of varying durations.
Corrupt the speech-plus-silence signal with washing machine noise (SNR = -10 dB).
Extract feature sequences consisting of spectral characteristics and harmonic ratio from the noisy signal.
Train the LSTM network using the feature sequences to identify regions of voice activity.
Prediction:
Create an audioDatastore (Audio Toolbox) of speech files used to test the trained network, and create a test signal consisting of speech separated by segments of silence.
Corrupt the test signal with washing machine noise (SNR = -10 dB).
Extract feature sequences from the noisy test signal.
Identify regions of voice activity by passing the test features through the trained network.
Compare the network's accuracy to the voice activity baseline from the speech-plus-silence test signal.
Here is a sketch of the training process.
Here is a sketch of the prediction process. You use the trained network to make predictions.
Download and extract the Google Speech Commands Dataset [1].
url = 'https://ssd.mathworks.com/supportfiles/audio/google_speech.zip';
downloadFolder = tempdir;
datasetFolder = fullfile(downloadFolder,'google_speech');

if ~exist(datasetFolder,'dir')
    disp('Downloading Google speech commands data set (1.9 GB)...')
    unzip(url,downloadFolder)
end
Downloading Google speech commands data set (1.9 GB)...
Create an audioDatastore (Audio Toolbox) that points to the training data set.
adsTrain = audioDatastore(fullfile(datasetFolder,'train'),"IncludeSubfolders",true);
Create an audioDatastore (Audio Toolbox) that points to the validation data set.
adsValidation = audioDatastore(fullfile(datasetFolder,'validation'),"IncludeSubfolders",true);
Read the contents of an audio file using read (Audio Toolbox). Get the sample rate from the adsInfo struct.
[data,adsInfo] = read(adsTrain);
Fs = adsInfo.SampleRate;
Listen to the audio signal using the sound command.
sound(data,Fs)
Plot the audio signal.
timeVector = (1/Fs) * (0:numel(data)-1);
plot(timeVector,data)
ylabel("Amplitude")
xlabel("Time (s)")
title("Sample Audio")
grid on
The signal has non-speech portions (silence, background noise, etc.) that do not contain useful speech information. This example removes silence using the detectSpeech (Audio Toolbox) function.
Extract the useful portion of data. Define a 50 ms periodic Hamming window for analysis. Call detectSpeech with no output arguments to plot the detected speech regions. Call detectSpeech again to return the indices of the detected speech. Isolate the detected speech regions and then use the sound command to listen to the audio.
win = hamming(50e-3 * Fs,'periodic');
detectSpeech(data,Fs,'Window',win);
speechIndices = detectSpeech(data,Fs,'Window',win);
sound(data(speechIndices(1,1):speechIndices(1,2)),Fs)
The detectSpeech function returns indices that tightly surround the detected speech region. It was determined empirically that, for this example, extending the indices of the detected speech by five frames on either side increased the final model's performance. Extend the speech indices by five frames and then listen to the speech.
speechIndices(1,1) = max(speechIndices(1,1) - 5*numel(win),1);
speechIndices(1,2) = min(speechIndices(1,2) + 5*numel(win),numel(data));

sound(data(speechIndices(1,1):speechIndices(1,2)),Fs)
Reset the training datastore and shuffle the order of files in the datastores.
reset(adsTrain)
adsTrain = shuffle(adsTrain);
adsValidation = shuffle(adsValidation);
The detectSpeech function calculates statistics-based thresholds to determine the speech regions. You can skip the threshold calculation and speed up the detectSpeech function by specifying the thresholds directly. To determine thresholds for a data set, call detectSpeech on a sampling of files and get the thresholds it calculates. Take the mean of the thresholds.
TM = [];
for index1 = 1:500
    data = read(adsTrain);
    [~,T] = detectSpeech(data,Fs,'Window',win);
    TM = [TM;T];
end

T = mean(TM);

reset(adsTrain)
Create a 1000-second training signal by combining multiple speech files from the training data set. Use detectSpeech to remove unwanted portions of each file. Insert a random period of silence between speech segments.
Preallocate the training signal.
duration = 2000*Fs;
audioTraining = zeros(duration,1);
Preallocate the voice activity training mask. Values of 1 in the mask correspond to samples located in areas with voice activity. Values of 0 correspond to areas with no voice activity.
maskTraining = zeros(duration,1);
Specify a maximum silence segment duration of 2 seconds.
maxSilenceSegment = 2;
Construct the training signal by calling read on the datastore in a loop.
numSamples = 1;
while numSamples < duration
    data = read(adsTrain);
    data = data ./ max(abs(data)); % Normalize amplitude
    
    % Determine regions of speech
    idx = detectSpeech(data,Fs,'Window',win,'Thresholds',T);
    
    % If a region of speech is detected
    if ~isempty(idx)
        
        % Extend the indices by five frames
        idx(1,1) = max(1,idx(1,1) - 5*numel(win));
        idx(1,2) = min(length(data),idx(1,2) + 5*numel(win));
        
        % Isolate the speech
        data = data(idx(1,1):idx(1,2));
        
        % Write speech segment to training signal
        audioTraining(numSamples:numSamples+numel(data)-1) = data;
        
        % Set VAD baseline
        maskTraining(numSamples:numSamples+numel(data)-1) = true;
        
        % Random silence period
        numSilenceSamples = randi(maxSilenceSegment*Fs,1,1);
        numSamples = numSamples + numel(data) + numSilenceSamples;
    end
end
Visualize a 10-second portion of the training signal. Plot the baseline voice activity mask.
figure
range = 1:10*Fs;
plot((1/Fs)*(range-1),audioTraining(range));
hold on
plot((1/Fs)*(range-1),maskTraining(range));
grid on

lines = findall(gcf,"Type","Line");
lines(1).LineWidth = 2;

xlabel("Time (s)")
legend("Signal","Speech Region")
title("Training Signal (first 10 seconds)");
Listen to the first 10 seconds of the training signal.
sound(audioTraining(range),Fs);
Corrupt the training signal by adding washing machine noise to the speech signal such that the signal-to-noise ratio is -10 dB.
Read 8 kHz noise and convert it to 16 kHz.
noise = audioread("WashingMachine-16-8-mono-1000secs.mp3");
noise = resample(noise,2,1);
Corrupt training signal with noise.
audioTraining = audioTraining(1:numel(noise));
SNR = -10;
noise = 10^(-SNR/20) * noise * norm(audioTraining) / norm(noise);
audioTrainingNoisy = audioTraining + noise;
audioTrainingNoisy = audioTrainingNoisy / max(abs(audioTrainingNoisy));
Visualize a 10-second portion of the noisy training signal. Plot the baseline voice activity mask.
figure
plot((1/Fs)*(range-1),audioTrainingNoisy(range));
hold on
plot((1/Fs)*(range-1),maskTraining(range));
grid on

lines = findall(gcf,"Type","Line");
lines(1).LineWidth = 2;

xlabel("Time (s)")
legend("Noisy Signal","Speech Area")
title("Training Signal (first 10 seconds)");
Listen to the first 10 seconds of the noisy training signal.
sound(audioTrainingNoisy(range),Fs)
Note that you obtained the baseline voice activity mask using the noiseless speech-plus-silence signal. Verify that using detectSpeech on the noise-corrupted signal does not yield good results.
speechIndices = detectSpeech(audioTrainingNoisy,Fs,'Window',win);

speechIndices(:,1) = max(1,speechIndices(:,1) - 5*numel(win));
speechIndices(:,2) = min(numel(audioTrainingNoisy),speechIndices(:,2) + 5*numel(win));

noisyMask = zeros(size(audioTrainingNoisy));
for ii = 1:size(speechIndices,1)
    noisyMask(speechIndices(ii,1):speechIndices(ii,2)) = 1;
end
Visualize a 10-second portion of the noisy training signal. Plot the voice activity mask obtained by analyzing the noisy signal.
figure
plot((1/Fs)*(range-1),audioTrainingNoisy(range));
hold on
plot((1/Fs)*(range-1),noisyMask(range));
grid on

lines = findall(gcf,"Type","Line");
lines(1).LineWidth = 2;

xlabel("Time (s)")
legend("Noisy Signal","Mask from Noisy Signal")
title("Training Signal (first 10 seconds)");
Create a 200-second noisy speech signal to validate the trained network. Use the validation datastore. Note that the validation and training datastores have different speakers.
Preallocate the validation signal and the validation mask. You will use this mask to assess the accuracy of the trained network.
duration = 200*Fs;
audioValidation = zeros(duration,1);
maskValidation = zeros(duration,1);
Construct the validation signal by calling read on the datastore in a loop.
numSamples = 1;
while numSamples < duration
    data = read(adsValidation);
    data = data ./ max(abs(data)); % Normalize amplitude
    
    % Determine regions of speech
    idx = detectSpeech(data,Fs,'Window',win,'Thresholds',T);
    
    % If a region of speech is detected
    if ~isempty(idx)
        
        % Extend the indices by five frames
        idx(1,1) = max(1,idx(1,1) - 5*numel(win));
        idx(1,2) = min(length(data),idx(1,2) + 5*numel(win));
        
        % Isolate the speech
        data = data(idx(1,1):idx(1,2));
        
        % Write speech segment to validation signal
        audioValidation(numSamples:numSamples+numel(data)-1) = data;
        
        % Set VAD baseline
        maskValidation(numSamples:numSamples+numel(data)-1) = true;
        
        % Random silence period
        numSilenceSamples = randi(maxSilenceSegment*Fs,1,1);
        numSamples = numSamples + numel(data) + numSilenceSamples;
    end
end
Corrupt the validation signal by adding washing machine noise such that the signal-to-noise ratio is -10 dB. Use a different noise file for the validation signal than you did for the training signal.
noise = audioread("WashingMachine-16-8-mono-200secs.mp3");
noise = resample(noise,2,1);
noise = noise(1:duration);
audioValidation = audioValidation(1:numel(noise));
noise = 10^(-SNR/20) * noise * norm(audioValidation) / norm(noise);
audioValidationNoisy = audioValidation + noise;
audioValidationNoisy = audioValidationNoisy / max(abs(audioValidationNoisy));
This example trains the LSTM network using the following features:
spectralCentroid (Audio Toolbox)
spectralCrest (Audio Toolbox)
spectralEntropy (Audio Toolbox)
spectralFlux (Audio Toolbox)
spectralKurtosis (Audio Toolbox)
spectralRolloffPoint (Audio Toolbox)
spectralSkewness (Audio Toolbox)
spectralSlope (Audio Toolbox)
harmonicRatio (Audio Toolbox)
This example uses audioFeatureExtractor (Audio Toolbox) to create an optimal feature extraction pipeline for the feature set. Create an audioFeatureExtractor object to extract the feature set. Use a 256-point Hann window with 50% overlap.
afe = audioFeatureExtractor('SampleRate',Fs, ...
    'Window',hann(256,"Periodic"), ...
    'OverlapLength',128, ...
    ...
    'spectralCentroid',true, ...
    'spectralCrest',true, ...
    'spectralEntropy',true, ...
    'spectralFlux',true, ...
    'spectralKurtosis',true, ...
    'spectralRolloffPoint',true, ...
    'spectralSkewness',true, ...
    'spectralSlope',true, ...
    'harmonicRatio',true);

featuresTraining = extract(afe,audioTrainingNoisy);
Display the dimensions of the features matrix. The first dimension corresponds to the number of windows the signal was broken into (it depends on the window length and the overlap length). The second dimension is the number of features used in this example.
[numWindows,numFeatures] = size(featuresTraining)
numWindows = 125009
numFeatures = 9
In classification applications, it is good practice to normalize all features to have zero mean and unity standard deviation.
Compute the mean and standard deviation for each coefficient, and use them to normalize the data.
M = mean(featuresTraining,1);
S = std(featuresTraining,[],1);

featuresTraining = (featuresTraining - M) ./ S;
Extract the features from the validation signal using the same process.
featuresValidation = extract(afe,audioValidationNoisy);
featuresValidation = (featuresValidation - mean(featuresValidation,1)) ./ std(featuresValidation,[],1);
Each feature corresponds to 128 samples of data (the hop length). For each hop, set the expected voice/no voice value to the mode of the baseline mask values corresponding to those 128 samples. Convert the voice/no voice mask to categorical.
windowLength = numel(afe.Window);
hopLength = windowLength - afe.OverlapLength;
range = (hopLength) * (1:size(featuresTraining,1)) + hopLength;
maskMode = zeros(size(range));
for index = 1:numel(range)
    maskMode(index) = mode(maskTraining( (index-1)*hopLength+1:(index-1)*hopLength+windowLength ));
end
maskTraining = maskMode.';

maskTrainingCat = categorical(maskTraining);
Do the same for the validation mask.
range = (hopLength) * (1:size(featuresValidation,1)) + hopLength;
maskMode = zeros(size(range));
for index = 1:numel(range)
    maskMode(index) = mode(maskValidation( (index-1)*hopLength+1:(index-1)*hopLength+windowLength ));
end
maskValidation = maskMode.';

maskValidationCat = categorical(maskValidation);
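Before training, you can optionally check the class balance of the baseline masks; a strong imbalance between speech and non-speech frames can bias the classifier toward the majority class. This check is an addition to the example.
% Optional: display the number of frames in each class (0 = no speech, 1 = speech).
summary(maskTrainingCat)
summary(maskValidationCat)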
Split the training features and the mask into sequences of length 800, with 75% overlap between consecutive sequences.
sequenceLength = 800;
sequenceOverlap = round(0.75*sequenceLength);

trainFeatureCell = helperFeatureVector2Sequence(featuresTraining',sequenceLength,sequenceOverlap);
trainLabelCell = helperFeatureVector2Sequence(maskTrainingCat',sequenceLength,sequenceOverlap);
LSTM networks can learn long-term dependencies between time steps of sequence data. This example uses the bidirectional LSTM layer bilstmLayer to look at the sequence in both forward and backward directions.
Specify the input size to be 9 (the number of features). Specify a hidden bidirectional LSTM layer with an output size of 200 that outputs a sequence. This command instructs the bidirectional LSTM layer to map the input time series into 200 features that are passed to the next layer. Then, specify a second bidirectional LSTM layer with an output size of 200 that also outputs a sequence, so that the network produces a decision for every time step. Finally, specify two classes by including a fully connected layer of size 2, followed by a softmax layer and a classification layer.
layers = [ ...
    sequenceInputLayer(size(featuresValidation,2))
    bilstmLayer(200,"OutputMode","sequence")
    bilstmLayer(200,"OutputMode","sequence")
    fullyConnectedLayer(2)
    softmaxLayer
    classificationLayer
    ];
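Optionally, you can inspect the layer array before training to confirm the activation sizes, for example with analyzeNetwork. This step is not part of the original example.
% Optional: open the Deep Learning Network Analyzer to review the architecture.
analyzeNetwork(layers)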
Next, specify the training options for the classifier. Set MaxEpochs to 20 so that the network makes 20 passes through the training data. Set MiniBatchSize to 64 so that the network looks at 64 training signals at a time. Set Plots to "training-progress" to generate plots that show the training progress as the number of iterations increases. Set Verbose to false to disable printing the table output that corresponds to the data shown in the plot. Set Shuffle to "every-epoch" to shuffle the training sequences at the beginning of each epoch. Set LearnRateSchedule to "piecewise" to decrease the learning rate by a specified factor (0.1) every time a certain number of epochs (5) has passed. Set ValidationData to the validation predictors and targets.
This example uses the adaptive moment estimation (ADAM) solver. ADAM performs better with recurrent neural networks (RNNs) like LSTMs than the default stochastic gradient descent with momentum (SGDM) solver.
maxEpochs = 20;
miniBatchSize = 64;
options = trainingOptions("adam", ...
    "MaxEpochs",maxEpochs, ...
    "MiniBatchSize",miniBatchSize, ...
    "Shuffle","every-epoch", ...
    "Verbose",0, ...
    "SequenceLength",sequenceLength, ...
    "ValidationFrequency",floor(numel(trainFeatureCell)/miniBatchSize), ...
    "ValidationData",{featuresValidation.',maskValidationCat.'}, ...
    "Plots","training-progress", ...
    "LearnRateSchedule","piecewise", ...
    "LearnRateDropFactor",0.1, ...
    "LearnRateDropPeriod",5);
Train the LSTM network with the specified training options and layer architecture using trainNetwork. Because the training set is large, the training process can take several minutes.
doTraining = true;
if doTraining
    [speechDetectNet,netInfo] = trainNetwork(trainFeatureCell,trainLabelCell,layers,options);
    fprintf("Validation accuracy: %f percent.\n",netInfo.FinalValidationAccuracy);
else
    load speechDetectNet
end
Validation accuracy: 90.089844 percent.
Estimate voice activity in the validation signal using the trained network. Convert the estimated VAD mask from categorical to double.
EstimatedVADMask = classify(speechDetectNet,featuresValidation.');
EstimatedVADMask = double(EstimatedVADMask);
EstimatedVADMask = EstimatedVADMask.' - 1;
Calculate and plot the validation confusion matrix from the vectors of actual and estimated labels.
figure
cm = confusionchart(maskValidation,EstimatedVADMask,"title","Validation Accuracy");
cm.ColumnSummary = "column-normalized";
cm.RowSummary = "row-normalized";
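If you want scalar summary metrics in addition to the confusion chart, you can compute them directly from the actual and estimated label vectors. The following sketch is an addition to the example and assumes the variables defined above.
% Optional summary metrics computed from the actual and estimated masks.
accuracy = mean(EstimatedVADMask == maskValidation)                                % overall frame accuracy
recall = sum(EstimatedVADMask==1 & maskValidation==1)/sum(maskValidation==1)       % fraction of speech frames detected
precision = sum(EstimatedVADMask==1 & maskValidation==1)/sum(EstimatedVADMask==1)  % fraction of detections that are speech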
If you changed parameters of your network or feature extraction pipeline, consider resaving the MAT file with the new network and audioFeatureExtractor object.
resaveNetwork = false;
if resaveNetwork
    save('Audio_VoiceActivityDetectionExample.mat','speechDetectNet','afe');
end
function [sequences,sequencePerFile] = helperFeatureVector2Sequence(features,featureVectorsPerSequence,featureVectorOverlap)
    if featureVectorsPerSequence <= featureVectorOverlap
        error('The number of overlapping feature vectors must be less than the number of feature vectors per sequence.')
    end

    if ~iscell(features)
        features = {features};
    end
    hopLength = featureVectorsPerSequence - featureVectorOverlap;
    idx1 = 1;
    sequences = {};
    sequencePerFile = cell(numel(features),1);
    for ii = 1:numel(features)
        sequencePerFile{ii} = floor((size(features{ii},2) - featureVectorsPerSequence)/hopLength) + 1;
        idx2 = 1;
        for j = 1:sequencePerFile{ii}
            sequences{idx1,1} = features{ii}(:,idx2:idx2 + featureVectorsPerSequence - 1); %#ok<AGROW>
            idx1 = idx1 + 1;
            idx2 = idx2 + hopLength;
        end
    end
end
function helperStreamingDemo(speechDetectNet,afe,cleanSpeech,noise,testDuration,sequenceLength,sequenceHop,signalToListenTo,noiseGain)
Create dsp.AudioFileReader (DSP System Toolbox) objects to read from the speech and noise files frame by frame.
speechReader = dsp.AudioFileReader(cleanSpeech,'PlayCount',inf);
noiseReader = dsp.AudioFileReader(noise,'PlayCount',inf);
fs = speechReader.SampleRate;
Create a dsp.MovingStandardDeviation (DSP System Toolbox) object and a dsp.MovingAverage (DSP System Toolbox) object. You will use these to determine the standard deviation and mean of the audio features for normalization. The statistics should improve over time.
movSTD = dsp.MovingStandardDeviation('Method','Exponential weighting','ForgettingFactor',1);
movMean = dsp.MovingAverage('Method','Exponential weighting','ForgettingFactor',1);
Create three dsp.AsyncBuffer (DSP System Toolbox) objects: one to buffer the input audio, one to buffer the extracted features, and one to buffer the output audio. The output buffer is only necessary for visualizing the decisions in real time.
audioInBuffer = dsp.AsyncBuffer;
featureBuffer = dsp.AsyncBuffer;
audioOutBuffer = dsp.AsyncBuffer;
For the audio buffers, you will buffer both the original clean speech signal and the noisy signal. You will play back only the specified signalToListenTo. Convert the signalToListenTo variable to the channel you want to listen to.
channelToListenTo = 1;
if strcmp(signalToListenTo,"clean")
    channelToListenTo = 2;
end
Create a dsp.TimeScope (DSP System Toolbox) to visualize the original speech signal, the noisy signal that the network is applied to, and the decision output from the network.
scope = dsp.TimeScope('SampleRate',fs, ...
    'TimeSpan',3, ...
    'BufferLength',fs*3*3, ...
    'YLimits',[-1.2 1.2], ...
    'TimeSpanOverrunAction','Scroll', ...
    'ShowGrid',true, ...
    'NumInputPorts',3, ...
    'LayoutDimensions',[3,1], ...
    'Title','Noisy Speech');
scope.ActiveDisplay = 2;
scope.Title = 'Clean Speech (Original)';
scope.ActiveDisplay = 3;
scope.Title = 'Detected Speech';
Create an audioDeviceWriter (Audio Toolbox) object to play either the original or noisy audio from your speakers.
deviceWriter = audioDeviceWriter('SampleRate',fs);
Initialize variables used in the loop.
windowLength = numel(afe.Window);
hopLength = windowLength - afe.OverlapLength;

myMax = 0;
audioBufferInitialized = false;
featureBufferInitialized = false;
Run the streaming demonstration.
tic
while toc < testDuration
    
    % Read a frame of the speech signal and a frame of the noise signal
    speechIn = speechReader();
    noiseIn = noiseReader();
    
    % Mix the speech and noise at the specified SNR
    noisyAudio = speechIn + noiseGain*noiseIn;
    
    % Update a running max for normalization
    myMax = max(myMax,max(abs(noisyAudio)));
    
    % Write the noisy audio and speech to buffers
    write(audioInBuffer,[noisyAudio,speechIn]);
    
    % If enough samples are buffered,
    % mark the audio buffer as initialized and push the read pointer
    % for the audio buffer up a window length.
    if audioInBuffer.NumUnreadSamples >= windowLength && ~audioBufferInitialized
        audioBufferInitialized = true;
        read(audioInBuffer,windowLength);
    end
    
    % If enough samples are in the audio buffer to calculate a feature
    % vector, read the samples, normalize them, extract the feature vectors, and write
    % the latest feature vector to the features buffer.
    while (audioInBuffer.NumUnreadSamples >= hopLength) && audioBufferInitialized
        x = read(audioInBuffer,windowLength + hopLength,windowLength);
        write(audioOutBuffer,x(end-hopLength+1:end,:));
        noisyAudio = x(:,1);
        noisyAudio = noisyAudio/myMax;
        features = extract(afe,noisyAudio);
        write(featureBuffer,features(2,:));
    end
    
    % If enough feature vectors are buffered, mark the feature buffer
    % as initialized and push the read pointer for the feature buffer
    % and the audio output buffer (so that they are in sync).
    if featureBuffer.NumUnreadSamples >= (sequenceLength + sequenceHop) && ~featureBufferInitialized
        featureBufferInitialized = true;
        read(featureBuffer,sequenceLength - sequenceHop);
        read(audioOutBuffer,(sequenceLength - sequenceHop)*windowLength);
    end
    
    while featureBuffer.NumUnreadSamples >= sequenceHop && featureBufferInitialized
        features = read(featureBuffer,sequenceLength,sequenceLength - sequenceHop);
        features(isnan(features)) = 0;
        
        % Use only the new features to update the
        % standard deviation and mean. Normalize the features.
        localSTD = movSTD(features(end-sequenceHop+1:end,:));
        localMean = movMean(features(end-sequenceHop+1:end,:));
        features = (features - localMean(end,:)) ./ localSTD(end,:);
        
        decision = classify(speechDetectNet,features');
        decision = decision(end-sequenceHop+1:end);
        decision = double(decision)' - 1;
        decision = repelem(decision,hopLength);
        
        audioHop = read(audioOutBuffer,sequenceHop*hopLength);
        
        % Listen to the speech or speech+noise
        deviceWriter(audioHop(:,channelToListenTo));
        
        % Visualize the speech+noise, the original speech, and the
        % voice activity detection.
        scope(audioHop(:,1),audioHop(:,2),audioHop(:,1).*decision)
    end
end
release(deviceWriter)
release(audioInBuffer)
release(audioOutBuffer)
release(featureBuffer)
release(movSTD)
release(movMean)
release(scope)
end
[1] Warden, P. "Speech Commands: A public dataset for single-word speech recognition", 2017. Available from https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.01.tar.gz. Copyright Google 2017. The Speech Commands Dataset is licensed under the Creative Commons Attribution 4.0 license.
trainingOptions | trainNetwork