This example shows how to find, clean, and delete observations with missing data in a dataset array.
Import the data from the spreadsheet messy.xlsx
.
messyData = dataset('XLSFile',fullfile(matlabroot,'help/toolbox/stats/examples','messy.xlsx'))
messyData = var1 var2 var3 var4 var5 'afe1' '3' 'yes' '3' 3 'egh3' '.' 'no' '7' 7 'wth4' '3' 'yes' '3' 3 'atn2' '23' 'no' '23' 23 'arg1' '5' 'yes' '5' 5 'jre3' '34.6' 'yes' '34.6' 34.6 'wen9' '234' 'yes' '234' 234 'ple2' '2' 'no' '2' 2 'dbo8' '5' 'no' '5' 5 'oii4' '5' 'yes' '5' 5 'wnk3' '245' 'yes' '245' 245 'abk6' '563' '' '563' 563 'pnj5' '463' 'no' '463' 463 'wnn3' '6' 'no' '6' 6 'oks9' '23' 'yes' '23' 23 'wba3' '' 'yes' 'NaN' 14 'pkn4' '2' 'no' '2' 2 'adw3' '22' 'no' '22' 22 'poj2' '-99' 'yes' '-99' -99 'bas8' '23' 'no' '23' 23 'gry5' 'NA' 'yes' 'NaN' 21
When you import data from a spreadsheet, dataset
reads any
variables with nonnumeric elements as a cell array of character vectors. This is
why the variable var2
is a cell array of character vectors.
When importing data from a text file, you have more flexibility to specify which
nonnumeric expressions to treat as missing using the option
TreatAsEmpty
.
There are many different missing data indicators in
messy.xlsx
, such as:
Empty cells
A period (.
)
NA
NaN
-99
Display the subset of observations that have at least one missing
value using ismissing
.
ix = ismissing(messyData,'NumericTreatAsMissing',-99,... 'StringTreatAsMissing',{'NaN','.','NA'}); messyData(any(ix,2),:)
ans = var1 var2 var3 var4 var5 'egh3' '.' 'no' '7' 7 'abk6' '563' '' '563' 563 'wba3' '' 'yes' 'NaN' 14 'poj2' '-99' 'yes' '-99' -99 'gry5' 'NA' 'yes' 'NaN' 21
By default, ismissing
recognizes the following
missing value indicators:
NaN
for numeric arrays
''
for character arrays
<undefined>
for categorical
arrays
Use the NumericTreatAsMissing
and StringTreatAsMissing
options
to specify other values to treat as missing.
You can convert the char
variables that should
be numeric using str2double
.
messyData.var2 = str2double(messyData.var2); messyData.var4 = str2double(messyData.var4)
messyData = var1 var2 var3 var4 var5 'afe1' 3 'yes' 3 3 'egh3' NaN 'no' 7 7 'wth4' 3 'yes' 3 3 'atn2' 23 'no' 23 23 'arg1' 5 'yes' 5 5 'jre3' 34.6 'yes' 34.6 34.6 'wen9' 234 'yes' 234 234 'ple2' 2 'no' 2 2 'dbo8' 5 'no' 5 5 'oii4' 5 'yes' 5 5 'wnk3' 245 'yes' 245 245 'abk6' 563 '' 563 563 'pnj5' 463 'no' 463 463 'wnn3' 6 'no' 6 6 'oks9' 23 'yes' 23 23 'wba3' NaN 'yes' NaN 14 'pkn4' 2 'no' 2 2 'adw3' 22 'no' 22 22 'poj2' -99 'yes' -99 -99 'bas8' 23 'no' 23 23 'gry5' NaN 'yes' NaN 21
var2
and var4
are
numeric arrays. During the conversion, str2double
replaces
the nonnumeric elements of the variables var2
and var4
with
the value NaN
. However, there are no changes to
the numeric missing value indicator, -99
.When applying the same function to many dataset array variables,
it can sometimes be more convenient to use datasetfun
.
For example, to convert both var2
and var4
to
numeric arrays simultaneously, you can use:
messyData(:,[2,4]) = datasetfun(@str2double,messyData, ... 'DataVars',[2,4],'DatasetOutput',true);
Clean the data so that the missing values indicated by the code -99
have
the standard MATLAB® numeric missing value indicator, NaN
.
messyData = replaceWithMissing(messyData,'NumericValues',-99)
messyData = var1 var2 var3 var4 var5 'afe1' 3 'yes' 3 3 'egh3' NaN 'no' 7 7 'wth4' 3 'yes' 3 3 'atn2' 23 'no' 23 23 'arg1' 5 'yes' 5 5 'jre3' 34.6 'yes' 34.6 34.6 'wen9' 234 'yes' 234 234 'ple2' 2 'no' 2 2 'dbo8' 5 'no' 5 5 'oii4' 5 'yes' 5 5 'wnk3' 245 'yes' 245 245 'abk6' 563 '' 563 563 'pnj5' 463 'no' 463 463 'wnn3' 6 'no' 6 6 'oks9' 23 'yes' 23 23 'wba3' NaN 'yes' NaN 14 'pkn4' 2 'no' 2 2 'adw3' 22 'no' 22 22 'poj2' NaN 'yes' NaN NaN 'bas8' 23 'no' 23 23 'gry5' NaN 'yes' NaN 21
Create a new dataset array that contains only the complete observations—those without missing data.
ix = ismissing(messyData); completeData = messyData(~any(ix,2),:)
completeData = var1 var2 var3 var4 var5 'afe1' 3 'yes' 3 3 'wth4' 3 'yes' 3 3 'atn2' 23 'no' 23 23 'arg1' 5 'yes' 5 5 'jre3' 34.6 'yes' 34.6 34.6 'wen9' 234 'yes' 234 234 'ple2' 2 'no' 2 2 'dbo8' 5 'no' 5 5 'oii4' 5 'yes' 5 5 'wnk3' 245 'yes' 245 245 'pnj5' 463 'no' 463 463 'wnn3' 6 'no' 6 6 'oks9' 23 'yes' 23 23 'pkn4' 2 'no' 2 2 'adw3' 22 'no' 22 22 'bas8' 23 'no' 23 23
dataset
| ismissing
| replaceWithMissing