This example shows how to deploy feature extraction and a convolutional neural network (CNN) for speech command recognition on Intel® processors. To generate the feature extraction and network code, you use MATLAB Coder and the Intel Math Kernel Library for Deep Neural Networks (MKL-DNN). In this example, the generated code is a MATLAB executable (MEX) function, which is called by a MATLAB script that displays the predicted speech command along with the time-domain signal and auditory spectrogram. For details about audio preprocessing and network training, see Speech Command Recognition Using Deep Learning.
Prerequisites for this example:

- MATLAB Coder Interface for Deep Learning support package
- Xeon processor with support for Intel Advanced Vector Extensions 2 (Intel AVX2)
- Intel Math Kernel Library for Deep Neural Networks (MKL-DNN)
- Environment variables for Intel MKL-DNN
For supported versions of libraries and for information about setting up environment variables, see Prerequisites for Deep Learning with MATLAB Coder (MATLAB Coder).
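Before generating code, you can confirm from within MATLAB that the MKL-DNN location is visible to the coder. This is a minimal sketch: it assumes the INTEL_MKLDNN environment variable name described in the MATLAB Coder prerequisites and does not validate the library version.

% Sketch: confirm the MKL-DNN location is visible to MATLAB.
% Assumes the INTEL_MKLDNN environment variable from the MATLAB Coder
% prerequisites; adjust the name if your setup differs.
mkldnnRoot = getenv('INTEL_MKLDNN');
if isempty(mkldnnRoot)
    warning('INTEL_MKLDNN is not set. See Prerequisites for Deep Learning with MATLAB Coder.');
elseif ~isfolder(mkldnnRoot)
    warning('INTEL_MKLDNN points to a nonexistent folder: %s',mkldnnRoot);
end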
Use the same parameters for the feature extraction pipeline and classification as developed in Speech Command Recognition Using Deep Learning.
Define the same sample rate the network was trained on (16 kHz). Define the classification rate and the number of audio samples input per frame. The feature input to the network is a Bark spectrogram that corresponds to 1 second of audio data. The Bark spectrogram is calculated for 25 ms windows with 10 ms hops.
fs = 16000;
classificationRate = 20;
samplesPerCapture = fs/classificationRate;

segmentDuration = 1;
segmentSamples = round(segmentDuration*fs);

frameDuration = 0.025;
frameSamples = round(frameDuration*fs);

hopDuration = 0.010;
hopSamples = round(hopDuration*fs);
Create an audioFeatureExtractor object to extract 50-band Bark spectrograms without window normalization.

afe = audioFeatureExtractor( ...
    'SampleRate',fs, ...
    'FFTLength',512, ...
    'Window',hann(frameSamples,'periodic'), ...
    'OverlapLength',frameSamples - hopSamples, ...
    'barkSpectrum',true);

numBands = 50;
setExtractorParameters(afe,'barkSpectrum','NumBands',numBands,'WindowNormalization',false);
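As a quick sanity check (not part of the original example), you can extract features from one second of silence. The expected output is floor((segmentSamples - frameSamples)/hopSamples) + 1 = 98 feature frames of 50 Bark bands each.

% Quick check: one second of audio should yield a 98-by-50 feature matrix.
testFeatures = extract(afe,zeros(segmentSamples,1,'single'));
size(testFeatures)  % expected: 98 50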
Load the pretrained convolutional neural network and labels.
load('commandNet.mat')
labels = trainedNet.Layers(end).Classes;
numLabels = numel(labels);
backgroundIdx = find(labels == 'background');
Define buffers and decision thresholds to post-process network predictions.

probBuffer = single(zeros([numLabels,classificationRate/2]));
YBuffer = single(numLabels*ones(1,classificationRate/2));

countThreshold = ceil(classificationRate*0.2);
probThreshold = single(0.7);
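The detection loop later declares a command only when the same label dominates the half-second buffer and its peak probability is high enough. The following toy check, using hypothetical buffer contents, shows how the two thresholds gate a prediction.

% Toy illustration (hypothetical values): a label must appear at least
% countThreshold times in YBuffer and reach probThreshold in probBuffer;
% otherwise the frame is treated as background.
YToy = single([2 2 5 5 5 5 5 5 5 5]);   % label 5 dominates the buffer
[YMode_idx,count] = mode(YToy);         % YMode_idx = 5, count = 8
maxProb = single(0.9);                  % pretend peak probability for label 5
isCommand = count >= countThreshold && maxProb >= probThreshold  % returns true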
Create an audioDeviceReader object to read audio from your device. Create a dsp.AsyncBuffer object to buffer the audio into chunks.
adr = audioDeviceReader('SampleRate',fs,'SamplesPerFrame',samplesPerCapture,'OutputDataType','single');
audioBuffer = dsp.AsyncBuffer(fs);
Create a dsp.MatrixViewer object and a timescope object to display the results.
matrixViewer = dsp.MatrixViewer("ColorBarLabel","Power per band (dB/Band)", ...
    "XLabel","Frames", ...
    "YLabel","Bark Bands", ...
    "Position",[400 100 600 250], ...
    "ColorLimits",[-4 2.6445], ...
    "AxisOrigin",'Lower left corner', ...
    "Name","Speech Command Recognition Using Deep Learning");

timeScope = timescope('SampleRate',fs, ...
    'YLimits',[-1 1], ...
    'Position',[400 380 600 250], ...
    'Name','Speech Command Recognition Using Deep Learning', ...
    'TimeSpanSource','Property', ...
    'TimeSpan',1, ...
    'BufferLength',fs);
timeScope.YLabel = 'Amplitude';
timeScope.ShowGrid = true;
Show the time scope and matrix viewer. Detect commands as long as both the time scope and matrix viewer are open or until the time limit is reached. To stop the live detection before the time limit is reached, close the time scope window or matrix viewer window.
show(timeScope)
show(matrixViewer)

timeLimit = 10;

tic
while isVisible(timeScope) && isVisible(matrixViewer) && toc < timeLimit
    % Capture audio
    x = adr();
    write(audioBuffer,x);
    y = read(audioBuffer,fs,fs-samplesPerCapture);

    % Compute auditory features
    features = extract(afe,y);
    auditory_features = log10(features + 1e-6);

    % Transpose to get the auditory spectrum
    auditorySpectrum = auditory_features';

    % Perform prediction
    probs = predict(trainedNet, auditory_features);
    [~, YPredicted] = max(probs);

    % Perform statistical post-processing
    YBuffer = [YBuffer(2:end),YPredicted];
    probBuffer = [probBuffer(:,2:end),probs(:)];

    [YMode_idx, count] = mode(YBuffer);
    count = single(count);
    maxProb = max(probBuffer(YMode_idx,:));
    if (YMode_idx == single(backgroundIdx) || count < countThreshold || maxProb < probThreshold)
        speechCommandIdx = backgroundIdx;
    else
        speechCommandIdx = YMode_idx;
    end

    % Update plots
    matrixViewer(auditorySpectrum);
    timeScope(x);

    if (speechCommandIdx == backgroundIdx)
        timeScope.Title = ' ';
    else
        timeScope.Title = char(labels(speechCommandIdx));
    end
    drawnow
end
Hide the scopes.
hide(matrixViewer)
hide(timeScope)
To create a feature extraction function compatible with code generation, call generateMATLABFunction on the audioFeatureExtractor object. This object function creates a standalone function that performs equivalent feature extraction and is compatible with code generation.
generateMATLABFunction(afe,'extractSpeechFeatures')
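As a quick verification (not part of the original example), you can compare the output of the generated function against the audioFeatureExtractor object. This sketch assumes extractSpeechFeatures returns features in the same array orientation as extract.

% Sketch: verify the generated function against the original extractor.
yCheck = pinknoise(segmentSamples,'single');
diffMax = max(abs(extractSpeechFeatures(yCheck) - extract(afe,yCheck)),[],'all')  % expect ~0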
The HelperSpeechCommandRecognition supporting function encapsulates the feature extraction and network prediction process demonstrated previously. To make the feature extraction compatible with code generation, the supporting function calls the generated extractSpeechFeatures function. To make the network compatible with code generation, it loads the network with the coder.loadDeepLearningNetwork (MATLAB Coder) function.
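The supporting function's source is included with the example. For reference, here is a minimal structural sketch under stated assumptions: a hypothetical label count of 12, a hypothetical background index of 1, and a persistent network loaded once on the first call. The shipped helper may differ in detail.

function [speechCommandIdx, auditorySpectrum] = HelperSpeechCommandRecognitionSketch(x)
% Minimal structural sketch of the supporting function; not the shipped file.
persistent trainedNet audioBuffer YBuffer probBuffer

fs = 16000;
classificationRate = 20;
samplesPerCapture = fs/classificationRate;
numLabels = 12;       % assumption: commandNet has 12 output classes
backgroundIdx = 1;    % hypothetical index of the 'background' label
countThreshold = ceil(classificationRate*0.2);
probThreshold = single(0.7);

if isempty(trainedNet)
    % coder.loadDeepLearningNetwork makes the network load codegen-compatible
    trainedNet = coder.loadDeepLearningNetwork('commandNet.mat');
    audioBuffer = dsp.AsyncBuffer(fs);
    YBuffer = single(numLabels*ones(1,classificationRate/2));
    probBuffer = single(zeros([numLabels,classificationRate/2]));
end

% Buffer the new capture and read the most recent second of audio
write(audioBuffer,x);
y = read(audioBuffer,fs,fs - samplesPerCapture);

% Codegen-compatible feature extraction via the generated function
features = extractSpeechFeatures(y);
auditory_features = log10(features + 1e-6);
auditorySpectrum = auditory_features';

% Network prediction and the same statistical post-processing as before
probs = predict(trainedNet,auditory_features);
[~,YPredicted] = max(probs);
YBuffer = [YBuffer(2:end),YPredicted];
probBuffer = [probBuffer(:,2:end),probs(:)];

[YMode_idx,count] = mode(YBuffer);
maxProb = max(probBuffer(YMode_idx,:));
if YMode_idx == single(backgroundIdx) || single(count) < countThreshold || maxProb < probThreshold
    speechCommandIdx = backgroundIdx;
else
    speechCommandIdx = YMode_idx;
end
end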
Use the HelperSpeechCommandRecognition function to perform live detection of speech commands.
show(timeScope)
show(matrixViewer)

timeLimit = 10;

tic
while isVisible(timeScope) && isVisible(matrixViewer) && toc < timeLimit
    x = adr();
    [speechCommandIdx, auditorySpectrum] = HelperSpeechCommandRecognition(x);

    matrixViewer(auditorySpectrum);
    timeScope(x);

    if (speechCommandIdx == backgroundIdx)
        timeScope.Title = ' ';
    else
        timeScope.Title = char(labels(speechCommandIdx));
    end
    drawnow
end
Hide the scopes.
hide(timeScope)
hide(matrixViewer)
Create a code generation configuration object for generation of a MEX function. Specify the target language as C++.
cfg = coder.config('mex');
cfg.TargetLang = 'C++';
Create a configuration object for deep learning code generation with the MKL-DNN library. Attach the configuration object to the code generation configuration object.
dlcfg = coder.DeepLearningConfig('mkldnn');
cfg.DeepLearningConfig = dlcfg;
Call the codegen (MATLAB Coder) function to generate C++ code for the HelperSpeechCommandRecognition function. Specify the configuration object and prototype arguments. A MEX file named HelperSpeechCommandRecognition_mex is generated in your current folder.
codegen HelperSpeechCommandRecognition -config cfg -args {rand(samplesPerCapture, 1, 'single')} -profile -report -v
Code generation successful: View report
Show the time scope and matrix viewer. Detect commands using the generated MEX for as long as both the time scope and matrix viewer are open or until the time limit is reached. To stop the live detection before the time limit is reached, close the time scope window or matrix viewer window.
show(timeScope)
show(matrixViewer)

timeLimit = 20;

tic
while isVisible(timeScope) && isVisible(matrixViewer) && toc < timeLimit
    x = adr();
    [speechCommandIdx, auditorySpectrum] = HelperSpeechCommandRecognition_mex(x);

    matrixViewer(auditorySpectrum);
    timeScope(x);

    if (speechCommandIdx == backgroundIdx)
        timeScope.Title = ' ';
    else
        timeScope.Title = char(labels(speechCommandIdx));
    end
    drawnow
end

hide(matrixViewer)
hide(timeScope)
Use tic and toc to compare the execution time of running the simulation entirely in MATLAB with the execution time of the MEX function.
Measure the performance of the simulation code.
testDur = 50e-3;
x = pinknoise(fs*testDur,'single');
numLoops = 100;

tic
for k = 1:numLoops
    [speechCommandIdx, auditory_features] = HelperSpeechCommandRecognition(x);
end
exeTime = toc;
fprintf('SIM execution time per 50 ms of audio = %0.4f ms\n',(exeTime/numLoops)*1000);
SIM execution time per 50 ms of audio = 6.6746 ms
Measure the performance of the MEX code.
tic
for k = 1:numLoops
    [speechCommandIdx, auditory_features] = HelperSpeechCommandRecognition_mex(x);
end
exeTimeMex = toc;
fprintf('MEX execution time per 50 ms of audio = %0.4f ms\n',(exeTimeMex/numLoops)*1000);
MEX execution time per 50 ms of audio = 1.5188 ms
Evaluate the performance gain from using the MEX function. This performance test was performed on a machine with an NVIDIA Quadro P620 (Version 26) GPU and an Intel(R) Xeon(R) W-2133 CPU running at 3.60 GHz.
PerformanceGain = exeTime/exeTimeMex
PerformanceGain = 4.3945