Clean Messy and Missing Data

This example shows how to find, clean, and delete observations with missing data in a dataset array.

Load sample data.

Import the data from the spreadsheet messy.xlsx.

messyData = dataset('XLSFile',fullfile(matlabroot,'help/toolbox/stats/examples','messy.xlsx'))
messyData = 

    var1          var2          var3         var4          var5
    'afe1'        '3'           'yes'        '3'              3
    'egh3'        '.'           'no'         '7'              7
    'wth4'        '3'           'yes'        '3'              3
    'atn2'        '23'          'no'         '23'            23
    'arg1'        '5'           'yes'        '5'              5
    'jre3'        '34.6'        'yes'        '34.6'        34.6
    'wen9'        '234'         'yes'        '234'          234
    'ple2'        '2'           'no'         '2'              2
    'dbo8'        '5'           'no'         '5'              5
    'oii4'        '5'           'yes'        '5'              5
    'wnk3'        '245'         'yes'        '245'          245
    'abk6'        '563'         ''           '563'          563
    'pnj5'        '463'         'no'         '463'          463
    'wnn3'        '6'           'no'         '6'              6
    'oks9'        '23'          'yes'        '23'            23
    'wba3'        ''            'yes'        'NaN'           14
    'pkn4'        '2'           'no'         '2'              2
    'adw3'        '22'          'no'         '22'            22
    'poj2'        '-99'         'yes'        '-99'          -99
    'bas8'        '23'          'no'         '23'            23
    'gry5'        'NA'          'yes'        'NaN'           21

When you import data from a spreadsheet, dataset reads any variables with nonnumeric elements as a cell array of character vectors. This is why the variable var2 is a cell array of character vectors. When importing data from a text file, you have more flexibility to specify which nonnumeric expressions to treat as missing using the option TreatAsEmpty.

There are many different missing data indicators in messy.xlsx, such as:

  • Empty cells

  • A period (.)

  • NA

  • NaN

  • -99

Find observations with missing values.

Display the subset of observations that have at least one missing value using ismissing.

ix = ismissing(messyData,'NumericTreatAsMissing',-99,...
                 'StringTreatAsMissing',{'NaN','.','NA'});
messyData(any(ix,2),:)
ans = 

    var1          var2         var3         var4         var5
    'egh3'        '.'          'no'         '7'            7 
    'abk6'        '563'        ''           '563'        563 
    'wba3'        ''           'yes'        'NaN'         14 
    'poj2'        '-99'        'yes'        '-99'        -99 
    'gry5'        'NA'         'yes'        'NaN'         21 

By default, ismissing recognizes the following missing value indicators:

  • NaN for numeric arrays

  • '' for character arrays

  • <undefined> for categorical arrays

Use the NumericTreatAsMissing and StringTreatAsMissing options to specify other values to treat as missing.

Convert character arrays to double arrays.

You can convert the char variables that should be numeric using str2double.

messyData.var2 = str2double(messyData.var2);
messyData.var4 = str2double(messyData.var4)
messyData = 

    var1          var2    var3         var4    var5
    'afe1'           3    'yes'           3       3
    'egh3'         NaN    'no'            7       7
    'wth4'           3    'yes'           3       3
    'atn2'          23    'no'           23      23
    'arg1'           5    'yes'           5       5
    'jre3'        34.6    'yes'        34.6    34.6
    'wen9'         234    'yes'         234     234
    'ple2'           2    'no'            2       2
    'dbo8'           5    'no'            5       5
    'oii4'           5    'yes'           5       5
    'wnk3'         245    'yes'         245     245
    'abk6'         563    ''            563     563
    'pnj5'         463    'no'          463     463
    'wnn3'           6    'no'            6       6
    'oks9'          23    'yes'          23      23
    'wba3'         NaN    'yes'         NaN      14
    'pkn4'           2    'no'            2       2
    'adw3'          22    'no'           22      22
    'poj2'         -99    'yes'         -99     -99
    'bas8'          23    'no'           23      23
    'gry5'         NaN    'yes'         NaN      21
Now, var2 and var4 are numeric arrays. During the conversion, str2double replaces the nonnumeric elements of the variables var2 and var4 with the value NaN. However, there are no changes to the numeric missing value indicator, -99.

When applying the same function to many dataset array variables, it can sometimes be more convenient to use datasetfun. For example, to convert both var2 and var4 to numeric arrays simultaneously, you can use:

messyData(:,[2,4]) = datasetfun(@str2double,messyData, ...
    'DataVars',[2,4],'DatasetOutput',true);

Replace missing value indicators.

Clean the data so that the missing values indicated by the code -99 have the standard MATLAB® numeric missing value indicator, NaN.

messyData = replaceWithMissing(messyData,'NumericValues',-99)
messyData = 

    var1          var2    var3         var4    var5
    'afe1'           3    'yes'           3       3
    'egh3'         NaN    'no'            7       7
    'wth4'           3    'yes'           3       3
    'atn2'          23    'no'           23      23
    'arg1'           5    'yes'           5       5
    'jre3'        34.6    'yes'        34.6    34.6
    'wen9'         234    'yes'         234     234
    'ple2'           2    'no'            2       2
    'dbo8'           5    'no'            5       5
    'oii4'           5    'yes'           5       5
    'wnk3'         245    'yes'         245     245
    'abk6'         563    ''            563     563
    'pnj5'         463    'no'          463     463
    'wnn3'           6    'no'            6       6
    'oks9'          23    'yes'          23      23
    'wba3'         NaN    'yes'         NaN      14
    'pkn4'           2    'no'            2       2
    'adw3'          22    'no'           22      22
    'poj2'         NaN    'yes'         NaN     NaN
    'bas8'          23    'no'           23      23
    'gry5'         NaN    'yes'         NaN      21

Create a dataset array with complete observations.

Create a new dataset array that contains only the complete observations—those without missing data.

ix = ismissing(messyData);
completeData = messyData(~any(ix,2),:)
completeData = 

    var1          var2    var3         var4    var5
    'afe1'           3    'yes'           3       3
    'wth4'           3    'yes'           3       3
    'atn2'          23    'no'           23      23
    'arg1'           5    'yes'           5       5
    'jre3'        34.6    'yes'        34.6    34.6
    'wen9'         234    'yes'         234     234
    'ple2'           2    'no'            2       2
    'dbo8'           5    'no'            5       5
    'oii4'           5    'yes'           5       5
    'wnk3'         245    'yes'         245     245
    'pnj5'         463    'no'          463     463
    'wnn3'           6    'no'            6       6
    'oks9'          23    'yes'          23      23
    'pkn4'           2    'no'            2       2
    'adw3'          22    'no'           22      22
    'bas8'          23    'no'           23      23

See Also

| |

Related Examples

More About