Clean Messy and Missing Data in Tables

This example shows how to find, clean, and delete table rows with missing data.

Load Sample Data

Load sample data from a comma-separated text file, messy.csv. The file contains many different missing data indicators:

  • Empty character vector ('')

  • period (.)

  • NA

  • NaN

  • -99

To specify the character vectors to treat as empty values, use the 'TreatAsEmpty' name-value pair argument with the readtable function. (Use the disp function to display all 21 rows, even when running this example as a live script.)

T = readtable('messy.csv','TreatAsEmpty',{'.','NA'});
disp(T)
       A         B          C          D       E  
    ________    ____    __________    ____    ____

    {'afe1'}       3    {'yes'   }       3       3
    {'egh3'}     NaN    {'no'    }       7       7
    {'wth4'}       3    {'yes'   }       3       3
    {'atn2'}      23    {'no'    }      23      23
    {'arg1'}       5    {'yes'   }       5       5
    {'jre3'}    34.6    {'yes'   }    34.6    34.6
    {'wen9'}     234    {'yes'   }     234     234
    {'ple2'}       2    {'no'    }       2       2
    {'dbo8'}       5    {'no'    }       5       5
    {'oii4'}       5    {'yes'   }       5       5
    {'wnk3'}     245    {'yes'   }     245     245
    {'abk6'}     563    {0x0 char}     563     563
    {'pnj5'}     463    {'no'    }     463     463
    {'wnn3'}       6    {'no'    }       6       6
    {'oks9'}      23    {'yes'   }      23      23
    {'wba3'}     NaN    {'yes'   }     NaN      14
    {'pkn4'}       2    {'no'    }       2       2
    {'adw3'}      22    {'no'    }      22      22
    {'poj2'}     -99    {'yes'   }     -99     -99
    {'bas8'}      23    {'no'    }      23      23
    {'gry5'}     NaN    {'yes'   }     NaN      21

T is a table with 21 rows and five variables. 'TreatAsEmpty' only applies to numeric columns in the file and cannot handle numeric values specified as text, such as '-99'.

Summarize Table

View the data type, description, units, and other descriptive statistics for each variable by creating a table summary using the summary function.

summary(T)
Variables:

    A: 21x1 cell array of character vectors

    B: 21x1 double

        Values:

            Min             -99   
            Median          14    
            Max             563   
            NumMissing      3     

    C: 21x1 cell array of character vectors

    D: 21x1 double

        Values:

            Min             -99   
            Median          7     
            Max             563   
            NumMissing      2     

    E: 21x1 double

        Values:

            Min         -99   
            Median       14   
            Max         563   

When you import data from a file, the default is for readtable to read any variables with nonnumeric elements as a cell array of character vectors.

Find Rows with Missing Values

Display the subset of rows from the table, T, that have at least one missing value.

TF = ismissing(T,{'' '.' 'NA' NaN -99});
rowsWithMissing = T(any(TF,2),:);
disp(rowsWithMissing)
       A         B         C          D      E 
    ________    ___    __________    ___    ___

    {'egh3'}    NaN    {'no'    }      7      7
    {'abk6'}    563    {0x0 char}    563    563
    {'wba3'}    NaN    {'yes'   }    NaN     14
    {'poj2'}    -99    {'yes'   }    -99    -99
    {'gry5'}    NaN    {'yes'   }    NaN     21

readtable replaced '.' and 'NA' with NaN in the numeric variables, B, D, and E.

Replace Missing Value Indicators

Clean the data so that the missing values indicated by code -99 have the standard MATLAB® numeric missing value indicator, NaN.

T = standardizeMissing(T,-99);
disp(T)
       A         B          C          D       E  
    ________    ____    __________    ____    ____

    {'afe1'}       3    {'yes'   }       3       3
    {'egh3'}     NaN    {'no'    }       7       7
    {'wth4'}       3    {'yes'   }       3       3
    {'atn2'}      23    {'no'    }      23      23
    {'arg1'}       5    {'yes'   }       5       5
    {'jre3'}    34.6    {'yes'   }    34.6    34.6
    {'wen9'}     234    {'yes'   }     234     234
    {'ple2'}       2    {'no'    }       2       2
    {'dbo8'}       5    {'no'    }       5       5
    {'oii4'}       5    {'yes'   }       5       5
    {'wnk3'}     245    {'yes'   }     245     245
    {'abk6'}     563    {0x0 char}     563     563
    {'pnj5'}     463    {'no'    }     463     463
    {'wnn3'}       6    {'no'    }       6       6
    {'oks9'}      23    {'yes'   }      23      23
    {'wba3'}     NaN    {'yes'   }     NaN      14
    {'pkn4'}       2    {'no'    }       2       2
    {'adw3'}      22    {'no'    }      22      22
    {'poj2'}     NaN    {'yes'   }     NaN     NaN
    {'bas8'}      23    {'no'    }      23      23
    {'gry5'}     NaN    {'yes'   }     NaN      21

standardizeMissing replaces three instances of -99 with NaN.

Create a new table, T2, and replace missing values with values from previous rows of the table. fillmissing provides a number of ways to fill in missing values.

T2 = fillmissing(T,'previous');
disp(T2)
       A         B         C        D       E  
    ________    ____    _______    ____    ____

    {'afe1'}       3    {'yes'}       3       3
    {'egh3'}       3    {'no' }       7       7
    {'wth4'}       3    {'yes'}       3       3
    {'atn2'}      23    {'no' }      23      23
    {'arg1'}       5    {'yes'}       5       5
    {'jre3'}    34.6    {'yes'}    34.6    34.6
    {'wen9'}     234    {'yes'}     234     234
    {'ple2'}       2    {'no' }       2       2
    {'dbo8'}       5    {'no' }       5       5
    {'oii4'}       5    {'yes'}       5       5
    {'wnk3'}     245    {'yes'}     245     245
    {'abk6'}     563    {'yes'}     563     563
    {'pnj5'}     463    {'no' }     463     463
    {'wnn3'}       6    {'no' }       6       6
    {'oks9'}      23    {'yes'}      23      23
    {'wba3'}      23    {'yes'}      23      14
    {'pkn4'}       2    {'no' }       2       2
    {'adw3'}      22    {'no' }      22      22
    {'poj2'}      22    {'yes'}      22      22
    {'bas8'}      23    {'no' }      23      23
    {'gry5'}      23    {'yes'}      23      21

Remove Rows with Missing Values

Create a new table, T3, that contains only the rows from T without missing values. T3 has only 16 rows.

T3 = rmmissing(T);
disp(T3)
       A         B         C        D       E  
    ________    ____    _______    ____    ____

    {'afe1'}       3    {'yes'}       3       3
    {'wth4'}       3    {'yes'}       3       3
    {'atn2'}      23    {'no' }      23      23
    {'arg1'}       5    {'yes'}       5       5
    {'jre3'}    34.6    {'yes'}    34.6    34.6
    {'wen9'}     234    {'yes'}     234     234
    {'ple2'}       2    {'no' }       2       2
    {'dbo8'}       5    {'no' }       5       5
    {'oii4'}       5    {'yes'}       5       5
    {'wnk3'}     245    {'yes'}     245     245
    {'pnj5'}     463    {'no' }     463     463
    {'wnn3'}       6    {'no' }       6       6
    {'oks9'}      23    {'yes'}      23      23
    {'pkn4'}       2    {'no' }       2       2
    {'adw3'}      22    {'no' }      22      22
    {'bas8'}      23    {'no' }      23      23

T3 contains 16 rows and five variables.

Organize Data

Sort the rows of T3 in descending order by C, and then sort in ascending order by A.

T3 = sortrows(T2,{'C','A'},{'descend','ascend'});
disp(T3)
       A         B         C        D       E  
    ________    ____    _______    ____    ____

    {'abk6'}     563    {'yes'}     563     563
    {'afe1'}       3    {'yes'}       3       3
    {'arg1'}       5    {'yes'}       5       5
    {'gry5'}      23    {'yes'}      23      21
    {'jre3'}    34.6    {'yes'}    34.6    34.6
    {'oii4'}       5    {'yes'}       5       5
    {'oks9'}      23    {'yes'}      23      23
    {'poj2'}      22    {'yes'}      22      22
    {'wba3'}      23    {'yes'}      23      14
    {'wen9'}     234    {'yes'}     234     234
    {'wnk3'}     245    {'yes'}     245     245
    {'wth4'}       3    {'yes'}       3       3
    {'adw3'}      22    {'no' }      22      22
    {'atn2'}      23    {'no' }      23      23
    {'bas8'}      23    {'no' }      23      23
    {'dbo8'}       5    {'no' }       5       5
    {'egh3'}       3    {'no' }       7       7
    {'pkn4'}       2    {'no' }       2       2
    {'ple2'}       2    {'no' }       2       2
    {'pnj5'}     463    {'no' }     463     463
    {'wnn3'}       6    {'no' }       6       6

In C, the rows are grouped first by 'yes', followed by 'no'. Then in A, the rows are listed alphabetically.

Reorder the table so that A and C are next to each other.

T3 = T3(:,{'A','C','B','D','E'});
disp(T3)
       A           C        B       D       E  
    ________    _______    ____    ____    ____

    {'abk6'}    {'yes'}     563     563     563
    {'afe1'}    {'yes'}       3       3       3
    {'arg1'}    {'yes'}       5       5       5
    {'gry5'}    {'yes'}      23      23      21
    {'jre3'}    {'yes'}    34.6    34.6    34.6
    {'oii4'}    {'yes'}       5       5       5
    {'oks9'}    {'yes'}      23      23      23
    {'poj2'}    {'yes'}      22      22      22
    {'wba3'}    {'yes'}      23      23      14
    {'wen9'}    {'yes'}     234     234     234
    {'wnk3'}    {'yes'}     245     245     245
    {'wth4'}    {'yes'}       3       3       3
    {'adw3'}    {'no' }      22      22      22
    {'atn2'}    {'no' }      23      23      23
    {'bas8'}    {'no' }      23      23      23
    {'dbo8'}    {'no' }       5       5       5
    {'egh3'}    {'no' }       3       7       7
    {'pkn4'}    {'no' }       2       2       2
    {'ple2'}    {'no' }       2       2       2
    {'pnj5'}    {'no' }     463     463     463
    {'wnn3'}    {'no' }       6       6       6

See Also

| | | | | |

Related Topics