Inconsistent Data

When you examine a data plot, you might find that some points appear to differ dramatically from the rest of the data. In some cases, it is reasonable to consider such points outliers, or data values that appear to be inconsistent with the rest of the data.

The following example illustrates how to remove outliers from three data sets in the 24-by-3 matrix count. In this case, an outlier is defined as a value that is more than three standard deviations away from the mean.

Caution

Be cautious about changing data unless you are confident that you understand the source of the problem you want to correct. Removing an outlier typically has a greater effect on the standard deviation than on the mean of the data. Deleting one such point produces a smaller standard deviation, which can make some of the remaining points appear to be outliers in turn.
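As a quick check of this effect, the following minimal sketch applies the three-standard-deviation rule twice to a small made-up vector. The vector x and every variable name here are illustrative assumptions and are unrelated to count.

```matlab
% Made-up data (not from count.dat): nine 9s, nine 11s, plus 20 and 100
x = [repmat([9 11],1,9) 20 100];

% First pass: flag values more than three standard deviations from the mean
mu1 = mean(x);
sigma1 = std(x);
out1 = abs(x - mu1) > 3*sigma1;
flagged1 = x(out1)      % only the value 100 is flagged

% Remove the flagged value and apply the same test again
x2 = x(~out1);
mu2 = mean(x2);
sigma2 = std(x2);
out2 = abs(x2 - mu2) > 3*sigma2;
flagged2 = x2(out2)     % the value 20 is now flagged
```

In this made-up case the mean drops from 15 to about 10.5, while the standard deviation drops from about 20.2 to about 2.5, so the value 20 passes the first test but fails the second.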

% Import the sample data
load count.dat;
% Calculate the mean and the standard deviation
% of each data column in the matrix
mu = mean(count)
sigma = std(count)

The Command Window displays

mu =
       32.0000   46.5417   65.5833

sigma =
       25.3703   41.4057   68.0281

With an outlier defined as a value more than three standard deviations away from the mean, the following code determines the number of outliers in each column of the count matrix:

% Get the number of rows in the matrix
n = size(count,1);
% Create a matrix of mean values by
% replicating the mu vector for n rows
MeanMat = repmat(mu,n,1);
% Create a matrix of standard deviation values by
% replicating the sigma vector for n rows
SigmaMat = repmat(sigma,n,1);
% Create a matrix of zeros and ones, where ones indicate
% the location of outliers
outliers = abs(count - MeanMat) > 3*SigmaMat;
% Calculate the number of outliers in each column
nout = sum(outliers) 

The procedure returns the following number of outliers in each column:

nout =
       1    0    0

There is one outlier in the first data column of count and none in the other two columns.
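In MATLAB R2016b or later, the same logical matrix can probably be built without repmat, because arithmetic and comparison operators expand a row vector across the rows of a matrix automatically. The sketch below compares the two approaches on a made-up 20-by-2 matrix A, which is an illustrative stand-in and not the count data.

```matlab
% Made-up stand-in for count: 20 rows, 2 columns, one extreme value
A = repmat([10 5; 12 6; 11 7; 9 6], 5, 1);
A(end,1) = 200;

mu = mean(A);       % 1-by-2 row vector of column means
sigma = std(A);     % 1-by-2 row vector of column standard deviations

% Implicit expansion: the row vectors are applied to every row of A
outliersA = abs(A - mu) > 3*sigma;

% Explicit replication with repmat, as in the procedure above
n = size(A,1);
outliersB = abs(A - repmat(mu,n,1)) > 3*repmat(sigma,n,1);

sameMask = isequal(outliersA, outliersB)
nout = sum(outliersA)
```

Both approaches should give the same logical matrix. The repmat form also runs on releases before R2016b, which is one reason to keep it.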

To remove an entire row of data containing the outlier, type

count(any(outliers,2),:) = [];

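Here, any(outliers,2) returns a column vector with one element for each row of the outliers matrix. An element is 1 when any value in the corresponding row is nonzero, and 0 otherwise. The argument 2 specifies that any operates along the second dimension of the outliers matrix, that is, across the columns of each row. The resulting logical vector selects the rows of count to delete.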
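The effect of the dimension argument is easier to see on a small case. The following sketch uses a made-up 4-by-3 logical matrix mask and a made-up matrix data; both names are illustrative assumptions.

```matlab
% Made-up logical matrix: ones mark outliers in rows 2 and 4
mask = logical([0 0 0; 1 0 0; 0 0 0; 0 1 1]);

% Along dimension 2: one result per row (4-by-1 column vector)
rowHasOutlier = any(mask,2)

% Along dimension 1: one result per column (1-by-3 row vector)
colHasOutlier = any(mask,1)

% Delete every row that contains at least one flagged value
data = reshape(1:12,4,3);
data(rowHasOutlier,:) = []
```

Rows 2 and 4 are removed, leaving the 2-by-3 matrix [1 5 9; 3 7 11]. Deleting whole rows keeps the columns aligned, at the cost of discarding the unflagged values in those rows.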