Find outliers in data
TF = isoutlier(A) returns a logical array whose elements are true when an outlier is detected in the corresponding element of A. By default, an outlier is a value that is more than three scaled median absolute deviations (MAD) away from the median. If A is a matrix or table, then isoutlier operates on each column separately. If A is a multidimensional array, then isoutlier operates along the first dimension whose size does not equal 1.
TF = isoutlier(A,movmethod,window) specifies a moving method for detecting local outliers according to a window length defined by window. For example, isoutlier(A,'movmedian',5) returns true for all elements more than three local scaled MAD from the local median within a sliding window containing five elements.
TF = isoutlier(___,Name,Value) specifies additional parameters for detecting outliers using one or more name-value pair arguments. For example, isoutlier(A,'SamplePoints',t) detects outliers in A relative to the corresponding elements of a time vector t.
Find the outliers in a vector of data. A logical 1 in the output indicates the location of an outlier.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
TF = isoutlier(A)
TF = 1x15 logical array
0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
Define outliers as points more than three standard deviations from the mean, and find the locations of outliers in a vector.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
TF = isoutlier(A,'mean')
TF = 1x15 logical array
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
Create a vector of data containing a local outlier.
x = -2*pi:0.1:2*pi;
A = sin(x);
A(47) = 0;
Create a time vector that corresponds to the data in A.
t = datetime(2017,1,1,0,0,0) + hours(0:length(x)-1);
Define outliers as points more than three local scaled MAD away from the local median within a sliding window. Find the locations of the outliers in A relative to the points in t with a window size of 5 hours. Plot the data and detected outliers.
TF = isoutlier(A,'movmedian',hours(5),'SamplePoints',t);
plot(t,A,t(TF),A(TF),'x')
legend('Data','Outlier')
Find outliers for each row of a matrix.
Create a matrix of data containing outliers along the diagonal.
A = magic(5) + diag(200*ones(1,5))
A = 5×5
217 24 1 8 15
23 205 7 14 16
4 6 213 20 22
10 12 19 221 3
11 18 25 2 209
Find the locations of outliers based on the data in each row.
TF = isoutlier(A,2)
TF = 5x5 logical array
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
Create a vector of data containing an outlier. Find and plot the location of the outlier, and the thresholds and center value determined by the outlier method. The center value is the median of the data, and the upper and lower thresholds are three scaled MAD above and below the median.
x = 1:10;
A = [60 59 49 49 58 100 61 57 48 58];
[TF,L,U,C] = isoutlier(A);
plot(x,A,x(TF),A(TF),'x',x,L*ones(1,10),x,U*ones(1,10),x,C*ones(1,10))
legend('Original Data','Outlier','Lower Threshold','Upper Threshold','Center Value')
A — Input data
Input data, specified as a vector, matrix, multidimensional array, table, or timetable.
If A is a table, then its variables must be of type double or single, or you can use the 'DataVariables' name-value pair to list double or single variables explicitly. Specifying variables is useful when you are working with a table that contains variables with data types other than double or single.
If A is a timetable, then isoutlier operates only on the table elements. Row times must be unique and listed in ascending order.
Data Types: double | single | table | timetable
method — Method for detecting outliers
'median' (default) | 'mean' | 'quartiles' | 'grubbs' | 'gesd'
Method for detecting outliers, specified as one of the following:
Method | Description |
---|---|
'median' | Returns true for elements more than three scaled MAD from the median. The scaled MAD is defined as c*median(abs(A-median(A))), where c=-1/(sqrt(2)*erfcinv(3/2)). |
'mean' | Returns true for elements more than three standard deviations from the mean. This method is faster but less robust than 'median'. |
'quartiles' | Returns true for elements more than 1.5 interquartile ranges above the upper quartile or below the lower quartile. This method is useful when the data in A is not normally distributed. |
'grubbs' | Applies Grubbs’s test for outliers, which removes one outlier per iteration based on hypothesis testing. This method assumes that the data in A is normally distributed. |
'gesd' | Applies the generalized extreme Studentized deviate test for outliers. This iterative method is similar to 'grubbs', but can perform better when there are multiple outliers masking each other. |
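As a rough illustration of how the choice of method changes the result, the following sketch runs three of the methods above on the vector from the first example on this page. The variable names are illustrative only.

```matlab
% Vector from the first example on this page.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

TFmedian    = isoutlier(A);              % default method, 'median'
TFmean      = isoutlier(A,'mean');       % three standard deviations from the mean
TFquartiles = isoutlier(A,'quartiles');  % 1.5 interquartile ranges beyond the quartiles

% Display the values flagged by each method.
A(TFmedian)
A(TFmean)
A(TFquartiles)
```

On this data the 'median' method flags both 100 and 300, while 'mean' flags only 300, because the single large value inflates the standard deviation.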
threshold — Percentile thresholds
Percentile thresholds, specified as a two-element row vector whose elements are in the interval [0,100]. The first element indicates the lower percentile threshold and the second element indicates the upper percentile threshold. For example, a threshold of [10 90] defines outliers as points below the 10th percentile or above the 90th percentile. The first element of threshold must be less than the second element.
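A minimal sketch of percentile-based detection follows. It assumes the calling form isoutlier(A,'percentiles',threshold), which this page refers to through the 'percentiles' method name but does not show explicitly. The data is the vector from the first example.

```matlab
% Vector from the first example on this page.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Flag points below the 10th percentile or above the 90th percentile.
% Assumed calling form: isoutlier(A,'percentiles',threshold).
TF = isoutlier(A,'percentiles',[10 90]);

% Display the flagged values.
A(TF)
```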
movmethod — Moving method
'movmedian' | 'movmean'
Moving method for detecting outliers, specified as one of the following:
Method | Description |
---|---|
'movmedian' | Returns true for elements more than three local scaled MAD from the local median over a window length specified by window. |
'movmean' | Returns true for elements more than three local standard deviations from the local mean over a window length specified by window. |
window — Window length
Window length, specified as a positive integer scalar, a two-element vector of positive integers, a positive duration scalar, or a two-element vector of positive durations.
When window is a positive integer scalar, the window is centered about the current element and contains window-1 neighboring elements. If window is even, then the window is centered about the current and previous elements.
When window is a two-element vector of positive integers [b f], the window contains the current element, b elements backward, and f elements forward.
When A is a timetable or 'SamplePoints' is specified as a datetime or duration vector, then window must be of type duration, and the windows are computed relative to the sample points.
Data Types: double | single | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | duration
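The two integer window forms can be compared with a short sketch. It reuses the sine data with one corrupted sample from the example earlier on this page; the specific window sizes here are chosen only for illustration.

```matlab
% Sine wave with one corrupted sample, as in the earlier example.
x = -2*pi:0.1:2*pi;
A = sin(x);
A(47) = 0;

% Scalar window: five elements centered on the current element.
TFcentered = isoutlier(A,'movmedian',5);

% Two-element window [b f]: the current element, two elements
% backward, and two elements forward (also five elements in total).
TFbackfwd = isoutlier(A,'movmedian',[2 2]);

% Indices flagged by each form.
find(TFcentered)
find(TFbackfwd)
```

Both forms flag the corrupted sample at index 47, since the local median near that point is close to -1 while the sample itself is 0.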
dim — Dimension to operate along
Dimension to operate along, specified as a positive integer scalar. If no value is specified, then the default is the first array dimension whose size does not equal 1.
Consider a matrix A.
isoutlier(A,1) detects outliers based on the data in each column of A.
isoutlier(A,2) detects outliers based on the data in each row of A.
When A is a table or timetable, dim is not supported. isoutlier operates along each table or timetable variable separately.
Data Types: double | single | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: isoutlier(A,'mean','ThresholdFactor',4)
'ThresholdFactor' — Detection threshold factor
Detection threshold factor, specified as the comma-separated pair consisting of 'ThresholdFactor' and a nonnegative scalar.
For methods 'median' and 'movmedian', the detection threshold factor replaces the number of scaled MAD, which is 3 by default.
For methods 'mean' and 'movmean', the detection threshold factor replaces the number of standard deviations from the mean, which is 3 by default.
For methods 'grubbs' and 'gesd', the detection threshold factor is a scalar ranging from 0 to 1. Values close to 0 result in a smaller number of outliers and values close to 1 result in a larger number of outliers. The default detection threshold factor is 0.05.
For the 'quartiles' method, the detection threshold factor replaces the number of interquartile ranges, which is 1.5 by default.
This name-value pair is not supported when the specified method is 'percentiles'.
Data Types: double | single | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
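To make the effect of the threshold factor concrete, here is a small sketch using the vector from the first example. The factor of 20 is an arbitrary value picked for illustration, not a recommended setting.

```matlab
% Vector from the first example on this page.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Default 'median' method: three scaled MAD from the median.
TFdefault = isoutlier(A,'median');

% Looser detection: 20 scaled MAD from the median (illustrative value).
TFloose = isoutlier(A,'median','ThresholdFactor',20);

% Display the flagged values for each setting.
A(TFdefault)
A(TFloose)
```

With the default factor both 100 and 300 are flagged. With the larger factor the thresholds widen enough that only 300 remains an outlier.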
'SamplePoints' — Sample points
Sample points, specified as the comma-separated pair consisting of 'SamplePoints' and a vector. The sample points represent the location of the data in A. Sample points do not need to be uniformly sampled. By default, the sample points vector is [1 2 3 ...].
Moving windows are defined relative to the sample points, which must be sorted and contain unique elements. For example, if t is a vector of times corresponding to the input data, then isoutlier(rand(1,10),'movmean',3,'SamplePoints',t) has a window that represents the time interval between t(i)-1.5 and t(i)+1.5.
When the sample points vector has data type datetime or duration, then the moving window length must have type duration.
Data Types: double | single | datetime | duration
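The following sketch shows a moving window defined over nonuniform numeric sample points. The vectors t and A are made-up values for illustration. With a window length of 5, each window covers the interval from t(i)-2.5 to t(i)+2.5 in sample-point units rather than a fixed number of elements.

```matlab
% Hypothetical nonuniform sample locations (sorted, unique) and data.
t = [0 1 2 3 4 10 11 12 13 14];
A = [1 2 1 2 50 1 2 1 2 1];

% Window of length 5 measured in the units of t, not in elements.
TF = isoutlier(A,'movmedian',5,'SamplePoints',t);

% Sample locations of the detected outliers.
t(TF)
```

Because of the gap between t = 4 and t = 10, the window around the fifth element contains only the samples at t = 2, 3, and 4, so the value 50 is judged against its two nearest neighbors.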
'DataVariables' — Table variables
Table variables, specified as the comma-separated pair consisting of 'DataVariables' and a variable name, a cell array of variable names, a numeric vector, a logical vector, a function handle, or a table vartype subscript. The 'DataVariables' value indicates which columns of the input table to detect outliers in, and can be one of the following:
A character vector specifying a single table variable name
A cell array of character vectors where each element is a table variable name
A vector of table variable indices
A logical vector whose elements each correspond to a table variable, where true includes the corresponding variable and false excludes it
A function handle that takes the table as input and returns a logical scalar
A table vartype subscript
The data type associated with the indicated variable must be double or single.
Example: 'Age'
Example: {'Height','Weight'}
Example: @isnumeric
Example: vartype('numeric')
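A short sketch of 'DataVariables' on a table that mixes numeric and non-numeric variables follows. The table T and its variable names Reading and Label are hypothetical and exist only for this illustration.

```matlab
% Hypothetical table: one double variable and one non-numeric variable.
Reading = [57; 59; 60; 300; 58; 61];
Label   = {'a'; 'b'; 'c'; 'd'; 'e'; 'f'};
T = table(Reading,Label);

% Restrict detection to the numeric variable by name.
TF = isoutlier(T,'DataVariables','Reading');

% Rows of T whose Reading value was flagged.
T(TF(:,1),:)
```

Without the 'DataVariables' argument this call would not be valid for T, because the Label variable is neither double nor single.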
'MaxNumOutliers' — Maximum outlier count
Maximum outlier count, for the 'gesd' method only, specified as the comma-separated pair consisting of 'MaxNumOutliers' and a positive integer. The 'MaxNumOutliers' value specifies the maximum number of outliers returned by the 'gesd' method. For example, isoutlier(A,'gesd','MaxNumOutliers',5) returns no more than five outliers.
The default value for 'MaxNumOutliers' is the integer nearest to 10 percent of the number of elements in A. Setting a larger value for the maximum number of outliers can ensure that all outliers are detected, but at the cost of reduced computational efficiency.
The 'gesd' method assumes the non-outlier input data is sampled from an approximate normal distribution. When the data is not sampled in this way, the number of returned outliers might exceed the 'MaxNumOutliers' value.
Data Types: double | single | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
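As a minimal sketch, the call below applies the 'gesd' method with an explicit cap on the outlier count. The data is the vector from the first example, and the cap of 3 is an arbitrary illustrative choice.

```matlab
% Vector from the first example on this page.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Generalized ESD test with an explicit cap on the outlier count.
TF = isoutlier(A,'gesd','MaxNumOutliers',3);

% Number of flagged elements and their values.
nnz(TF)
A(TF)
```

As noted above, the cap is only reliable when the non-outlier data is approximately normal, so it is worth checking nnz(TF) rather than assuming the limit was respected.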
TF — Outlier indicator
Outlier indicator, returned as a vector, matrix, or multidimensional array. An element of TF is true when the corresponding element of A is an outlier and false otherwise. TF is the same size as A.
Data Types: logical
L — Lower threshold
Lower threshold used by the outlier detection method, returned as a scalar, vector, matrix, multidimensional array, table, or timetable. For example, the lower value of the default outlier detection method is three scaled MAD below the median of the input data. L has the same size as A in all dimensions except for the operating dimension where the length is 1.
Data Types: double | single | table | timetable
U — Upper threshold
Upper threshold used by the outlier detection method, returned as a scalar, vector, matrix, multidimensional array, table, or timetable. For example, the upper value of the default outlier detection method is three scaled MAD above the median of the input data. U has the same size as A in all dimensions except for the operating dimension where the length is 1.
Data Types: double | single | table | timetable
C — Center value
Center value used by the outlier detection method, returned as a scalar, vector, matrix, multidimensional array, table, or timetable. For example, the center value of the default outlier detection method is the median of the input data. C has the same size as A in all dimensions except for the operating dimension where the length is 1.
Data Types: double | single | table | timetable
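The sizes of L, U, and C for a matrix input can be checked with a short sketch. It reuses the matrix from the row-wise example earlier on this page, and the output variable names with a 2 suffix are illustrative.

```matlab
% Matrix from the row-wise example on this page.
A = magic(5) + diag(200*ones(1,5));

% Default: operate on each column, so the thresholds are 1-by-5 row vectors.
[TF,L,U,C] = isoutlier(A);
size(L)

% Operate on each row instead, so the thresholds are 5-by-1 column vectors.
[TF2,L2,U2,C2] = isoutlier(A,2);
size(L2)

% For the default method, the center value is the column-wise median.
isequal(C,median(A,1))
```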
For a random variable vector A made up of N scalar observations, the median absolute deviation (MAD) is defined as

$$\mathrm{MAD} = \operatorname{median}\left(\left|A_i - \operatorname{median}(A)\right|\right)$$

for i = 1,2,...,N.
The scaled MAD is defined as c*median(abs(A-median(A))), where c=-1/(sqrt(2)*erfcinv(3/2)).
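The definition above can be reproduced by hand as a sanity check. The sketch below computes the scaled MAD and the three-scaled-MAD thresholds for the vector from the first example, then compares the result with the default isoutlier output. The variable names are illustrative.

```matlab
% Vector from the first example on this page.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Scaling constant from the definition above (approximately 1.4826).
c = -1/(sqrt(2)*erfcinv(3/2));

% Scaled MAD and the thresholds three scaled MAD around the median.
scaledMAD = c*median(abs(A - median(A)));
lowerThreshold = median(A) - 3*scaledMAD;
upperThreshold = median(A) + 3*scaledMAD;

% Flag elements outside the thresholds manually.
TFmanual = (A < lowerThreshold) | (A > upperThreshold);

% Compare with the built-in default method.
TF = isoutlier(A);
isequal(TF,TFmanual)
```

Here the median is 59 and the unscaled MAD is 2, so the thresholds fall at roughly 50.1 and 67.9, which is why both 100 and 300 are flagged.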
Usage notes and limitations:
The 'percentiles', 'grubbs', and 'gesd' methods are not supported.
The 'movmedian' and 'movmean' methods do not support tall timetables.
The 'SamplePoints' and 'MaxNumOutliers' name-value pairs are not supported.
The value of 'DataVariables' cannot be a function handle.
Computation of isoutlier(A), isoutlier(A,'median',...), or isoutlier(A,'quartiles',...) along the first dimension is only supported for tall column vectors A.
For more information, see Tall Arrays.
Usage notes and limitations:
The 'movmean' and 'movmedian' methods for detecting outliers do not support timetable input data, datetime 'SamplePoints' values, or duration 'SamplePoints' values.
String and character array inputs must be constant.
Usage notes and limitations:
The 'movmedian' moving method is not supported.
The 'SamplePoints' and 'DataVariables' name-value pairs are not supported.
For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
Clean Outlier Data | filloutliers | ischange | islocalmax | islocalmin | ismissing | rmoutliers