This example shows how to use a small set of measurement data in Diagnostic Feature Designer to develop a feature set, generate and run code to compute those features for a larger set of measurement data, and compare model accuracies in Classification Learner.
Using a smaller data set at first has several advantages, including faster feature extraction and cleaner visualization. Subsequently generating code so that you can automate the feature computations with an expanded set of members increases the number of feature samples and therefore improves classification model accuracy.
The example, based on Analyze and Select Features for Pump Diagnostics, uses the pump fault data from that example and computes the same features. For more detailed information about the steps and the rationale for the feature development operations using the pump-fault data in this example, see Analyze and Select Features for Pump Diagnostics. This example assumes that you are familiar with the layout and operations in the app. For more information on working with the app, see the three-part tutorial in Identify Condition Indicators for Predictive Maintenance Algorithm Design.
Load the data set pumpData. pumpData
is a 240-member ensemble table that contains simulated measurements for flow and pressure. pumpData
also contains categorical fault codes that represent combinations of three independent faults. For example, a fault code of 0
represents data from a system with no faults. A fault code of 111
represents data from a system with all three faults.
load savedPumpData pumpData
View a histogram of original fault codes. The histogram shows the number of ensemble members associated with each fault code.
fcCat = pumpData{:,3}; histogram(fcCat) title('Fault Code Distribution for Full Pump Data Set') xlabel('Fault Codes') ylabel('Number of Members')
Create a subset of this data set that contains 10% of the data, or 24 members. Because simulation data is often clustered, generate a randomized index with which to select the members. For the purposes of this example, first use rng
to create a repeatable random seed.
rng('default')
Compute a randomized 24-element index vector idx
. Sort the vector so that the indices are in order.
pdh = height(pumpData); nsel = 24; idx = randi(pdh,nsel,1); idx = sort(idx);
Use idx
to select member rows from pumpData
.
pdSub = pumpData(idx,:);
View a histogram of the fault codes in the reduced data set.
fcCatSub = pdSub{:,3}; histogram(fcCatSub) title('Fault Code Distribution for Reduced Pump Data Set') xlabel('Fault Codes') ylabel('Number of Members')
All the fault combinations are represented.
Open Diagnostic Feature Designer by using the diagnosticFeatureDesigner
command. Import pdSub
into the app as a multimember ensemble.
Extract the time-domain signal features from both the flow and pressure signal. In the Feature Designer tab, click Time Domain > Signal Features and select all features.
As Analyze and Select Features for Pump Diagnostics describes, computing the frequency spectrum of the flow highlights the cyclic nature of the flow signal. Estimate the frequency spectrum with Spectral Estimation > Power Spectrum by using the options shown for both flow and pressure.
Compute spectral features in the band 25–250 Hz, using the options shown, for both flow and pressure spectra.
Rank your features using Rank Features > FeatureTable1. Because faultCode
contains multiple possible values, the app defaults to the One-Way ANOVA
ranking method.
Export the features set to Classification Learner so that you can train a classification model. In the Feature Ranking tab, click Export > Export Features to the Classification Learner. Select all the features that have a One-Way ANOVA
metric greater than 1. This selection includes all the features from pressure_ps_spec/Data_Wn2
and up.
Once you click Export, Classification Learner opens a new session. Accept 5-fold cross-validation and click Start Session.
Train all available models by clicking All in the Classification Learner tab, and then Train.
For this session, the highest scoring model, Boosted Trees
, has an accuracy around 63%. Your results may vary.
Now that you have completed your interactive feature work with a small data set, you can apply the same computations to the full data set using generated code. In Diagnostic Feature Designer, generate a function to calculate the features. To do so, in the Feature Ranking Tab, select Export > Generate Function for Features. Select the same 50 features that you exported to Classification Learner.
When you click OK, a function appears in the editor.
Save the function to your local folder as diagnosticFeatures
.
Execute diagnosticFeatures
with the full pumpData
ensemble to get the 240-member feature set. Use the following command.
feature240 = diagnosticFeatures(pumpData);
feature240
is a 240-by-51 table. The table includes the condition variable faultCode
and the 50 features.
Train classification models again in Classification Learner, using feature240
this time. Open a new session window using the following command.
classificationLearner
In the Classification Learner window, click New Session > From Workspace. In the New Session window, in Data Set > Data Set Variable, select feature240
.
Repeat the steps you performed with the 24-member data set. Accept 5-fold cross-validation, start the session, and train all models.
For this session, the highest model accuracy, achieved by both Bagged Trees
and RUSBoosted Trees
, is around 80%. Again, your results may vary, but they should still reflect the increase in best accuracy.