rmoutliers

Detect and remove outliers in data

Syntax

B = rmoutliers(A)

B = rmoutliers(A,method)

B = rmoutliers(A,'percentiles',threshold)

B = rmoutliers(A,movmethod,window)

B = rmoutliers(___,dim)

B = rmoutliers(___,Name,Value)

[B,TF] = rmoutliers(___)

Description

B = rmoutliers(A) detects and removes outliers from the data in a vector, matrix, table, or timetable.

If A is a row or column vector, rmoutliers detects outliers and removes them.
If A is a matrix, table, or timetable, rmoutliers detects outliers in each column or variable of A separately and removes the entire row.

By default, an outlier is a value that is more than three scaled median absolute deviations (MAD) away from the median.

B = rmoutliers(A,method) specifies a method for determining outliers. For example, rmoutliers(A,'mean') defines an outlier as an element of A more than three standard deviations from the mean.

B = rmoutliers(A,'percentiles',threshold) defines outliers as points outside of the percentiles specified in threshold. The threshold argument is a two-element row vector containing the lower and upper percentile thresholds, such as [10 90].

B = rmoutliers(A,movmethod,window) specifies a moving method for detecting local outliers according to a specified window. For example, rmoutliers(A,'movmean',5) defines outliers as elements more than three local standard deviations away from the local mean within a five-element window.

B = rmoutliers(___,dim) removes outliers along dimension dim of A for any of the previous syntaxes. For example, rmoutliers(A,2) removes columns instead of rows for a matrix A.

B = rmoutliers(___,Name,Value) specifies additional parameters for detecting and removing outliers using one or more name-value pair arguments. For example, rmoutliers(A,'SamplePoints',t) detects outliers in A relative to the corresponding elements of a time vector t.

[B,TF] = rmoutliers(___) also returns a logical vector corresponding to the rows or columns of A that were removed.

Examples

Remove Outliers in Vector

Create a vector containing two outliers, and remove them. TF allows you to identify which elements of the input vector were detected as outliers and removed.

A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
[B,TF] = rmoutliers(A)

B = 1×13

    57    59    60    59    58    57    58    61    62    60    62    58    57

TF = 1x15 logical array

   0   0   0   1   0   0   0   0   1   0   0   0   0   0   0

A(TF)

ans = 1×2

   100   300

Detect Outliers using Mean

Remove outliers of a vector where an outlier is defined as a point more than three standard deviations from the mean of the data.

A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
[B,TF] = rmoutliers(A,'mean')

B = 1×14

    57    59    60   100    59    58    57    58    61    62    60    62    58    57

TF = 1x15 logical array

   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0

A(TF)

ans = 300
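
To see why 100 is kept in this case, the 'mean' rule can be reproduced by hand. The following is a minimal sketch, assuming the rule is exactly "more than three standard deviations from the mean" as described above; the variable names mu, sigma, and TFmanual are illustrative only.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

mu    = mean(A);   % about 77.9, pulled upward by the value 300
sigma = std(A);    % about 62.4, inflated by the same value

% Flag elements more than three standard deviations from the mean.
TFmanual = abs(A - mu) > 3*sigma;

A(TFmanual)        % only 300 exceeds the threshold; 100 does not
```

Because the extreme value 300 inflates both the mean and the standard deviation, the band around the mean becomes wide enough to hide the smaller outlier. This is one way to read the "less robust" remark in the description of the 'mean' method.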

Detect Outliers with Sliding Window

Create a vector of data containing a local outlier.

x = -2*pi:0.1:2*pi;
A = sin(x);
A(47) = 0;

Create a time vector that corresponds to the data in A.

t = datetime(2017,1,1,0,0,0) + hours(0:length(x)-1);

Define outliers as points more than three local scaled MAD away from the local median within a sliding window. Find the locations of the outliers in A relative to the points in t with a window size of 5 hours, and remove them.

[B,TF] = rmoutliers(A,'movmedian',hours(5),'SamplePoints',t);

Plot the input data and the data with the outlier removed.

plot(t,A,'b.-',t(~TF),B,'r-')
legend('Input Data','Output Data')

Remove Columns Containing Outliers

Create a matrix containing two outliers, and remove the columns containing them.

A = magic(5);
A(4,4) = 500;
A(5,5) = 500;
A

A = 5×5

    17    24     1     8    15
    23     5     7    14    16
     4     6    13    20    22
    10    12    19   500     3
    11    18    25     2   500

B = rmoutliers(A,2)

B = 5×3

    17    24     1
    23     5     7
     4     6    13
    10    12    19
    11    18    25

Input Arguments

`A` — Input data
vector | matrix | table | timetable

Input data, specified as a vector, matrix, table, or timetable.

Data Types: double | single

`method` — Method for detecting outliers
`'median'` (default) | `'mean'` | `'quartiles'` | `'grubbs'` | `'gesd'`

Method for detecting outliers, specified as one of the following:

Method	Description
`'median'`	Outliers are defined as elements more than three scaled MAD from the median. The scaled MAD is defined as `c*median(abs(A-median(A)))`, where `c=-1/(sqrt(2)*erfcinv(3/2))`.
`'mean'`	Outliers are defined as elements more than three standard deviations from the mean. This method is faster but less robust than `'median'`.
`'quartiles'`	Outliers are defined as elements more than 1.5 interquartile ranges above the upper quartile (75 percent) or below the lower quartile (25 percent). This method is useful when the data in `A` is not normally distributed.
`'grubbs'`	Outliers are detected using Grubbs’s test for outliers, which removes one outlier per iteration based on hypothesis testing. This method assumes that the data in `A` is normally distributed.
`'gesd'`	Outliers are detected using the generalized extreme Studentized deviate test for outliers. This iterative method is similar to `'grubbs'`, but can perform better when there are multiple outliers masking each other.
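
The default 'median' rule in the table above can be checked by hand. This is a small sketch that applies the scaled MAD formula directly to the vector used in the examples; it illustrates the definition and is not a description of how rmoutliers is implemented internally. The names c, scaledMAD, and TFmanual are illustrative.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

c = -1/(sqrt(2)*erfcinv(3/2));           % scaling constant, about 1.4826
scaledMAD = c*median(abs(A - median(A)));

% Flag elements more than three scaled MAD from the median.
TFmanual = abs(A - median(A)) > 3*scaledMAD;

A(TFmanual)                              % 100 and 300
```

The median of A is 59 and the median absolute deviation is 2, so the threshold is roughly 8.9 on either side of 59. Both 100 and 300 fall outside it, which matches the first example on this page.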

`threshold` — Percentile thresholds
two-element row vector

Percentile thresholds, specified as a two-element row vector whose elements are in the interval [0,100]. The first element indicates the lower percentile threshold and the second element indicates the upper percentile threshold. For example, a threshold of [10 90] defines outliers as points below the 10th percentile or above the 90th percentile. The first element of threshold must be less than the second element.
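
As a brief illustration of the 'percentiles' syntax, the sketch below reuses the example vector from this page with the [10 90] threshold mentioned above. The variable name removed is illustrative, and no particular output is shown because the exact set of removed values depends on how the percentiles are computed.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Treat values outside the 10th-90th percentile range as outliers.
[B,TF] = rmoutliers(A,'percentiles',[10 90]);

removed = A(TF);   % the values that fell outside the percentile range
```

Because this rule is defined by rank rather than by distance from the center, it can trim values from both tails of the data.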

`movmethod` — Moving method
`'movmedian'` | `'movmean'`

Moving method for determining outliers, specified as one of the following:

Method	Description
`'movmedian'`	Outliers are defined as elements more than three local scaled MAD from the local median over a window length specified by `window`.
`'movmean'`	Outliers are defined as elements more than three local standard deviations from the local mean over a window length specified by `window`.

`window` — Window length
scalar | two-element vector

Window length, specified as a scalar or two-element vector.

When window is a positive integer scalar, the window is centered about the current element and contains window-1 neighboring elements. If window is even, then the window is centered about the current and previous elements.

When window is a two-element vector of positive integers [b f], the window contains the current element, b elements backward, and f elements forward.

When A is a timetable or 'SamplePoints' is specified as a datetime or duration vector, window must be of type duration, and the windows are computed relative to the sample points.
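
The two numeric forms of window can be compared directly. The sketch below, which reuses the sine data from the sliding-window example, assumes that a centered five-element window and the vector [2 2] describe the same neighborhood; the names TF5 and TFbf are illustrative.

```matlab
x = -2*pi:0.1:2*pi;
A = sin(x);
A(47) = 0;                      % inject a local outlier

% A centered five-element window ...
[~,TF5]  = rmoutliers(A,'movmedian',5);

% ... and the same window written as [b f]: two back, two forward.
[~,TFbf] = rmoutliers(A,'movmedian',[2 2]);

isequal(TF5,TFbf)               % expected to be true for an odd window length
TF5(47)                         % the injected point is flagged
```

The [b f] form is mainly useful for asymmetric windows, for example [4 0] to look only at the current element and the four elements before it.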

`dim` — Operating dimension
1 (default) | 2

Operating dimension, specified as 1 or 2. By default, rmoutliers operates along the first dimension whose size does not equal 1.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: rmoutliers(A,'ThresholdFactor',4)

`'ThresholdFactor'` — Detection threshold factor
nonnegative scalar

Detection threshold factor, specified as the comma-separated pair consisting of 'ThresholdFactor' and a nonnegative scalar.

For methods 'median' and 'movmedian', the detection threshold factor replaces the number of scaled MAD, which is 3 by default.

For methods 'mean' and 'movmean', the detection threshold factor replaces the number of standard deviations from the mean, which is 3 by default.

For methods 'grubbs' and 'gesd', the detection threshold factor is a scalar ranging from 0 to 1. Values close to 0 result in a smaller number of outliers and values close to 1 result in a larger number of outliers. The default detection threshold factor is 0.05.

For the 'quartiles' method, the detection threshold factor replaces the number of interquartile ranges, which is 1.5 by default.

This name-value pair is not supported when the specified method is 'percentiles'.
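
The effect of 'ThresholdFactor' on the default 'median' method can be seen with the example vector from this page. In this sketch the factor 15 is an arbitrary value chosen so that 100 falls back inside the band; the names TF3 and TF15 are illustrative.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Default factor of 3: both 100 and 300 are removed.
[~,TF3]  = rmoutliers(A);

% A factor of 15 widens the band to 15 scaled MAD around the median.
[~,TF15] = rmoutliers(A,'ThresholdFactor',15);

A(TF3)     % 100 and 300
A(TF15)    % 300 only
```

The scaled MAD of A is about 2.97, so a factor of 15 gives a threshold of about 44.5. The value 100 lies 41 from the median and is kept, while 300 is still removed.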

`'SamplePoints'` — Sample points
vector

Sample points, specified as the comma-separated pair consisting of 'SamplePoints' and a vector. The sample points represent the location of the data in A, and must be sorted and contain unique elements. Sample points do not need to be uniformly sampled. If A is a timetable, then the default sample points vector is the vector of row times. Otherwise, the default vector is [1 2 3 ...].

Moving windows are defined relative to the sample points. For example, if t is a vector of times corresponding to the input data, then rmoutliers(rand(1,10),'movmean',3,'SamplePoints',t) has a window that represents the time interval between t(i)-1.5 and t(i)+1.5.

When the sample points vector has data type datetime or duration, then the moving window length must have type duration.

Data Types: single | double | datetime | duration

`'DataVariables'` — Table variables
variable name | array of variable names | numeric vector | logical vector | function handle | table `vartype` subscript

Table variables, specified as the comma-separated pair consisting of 'DataVariables' and a variable name, an array of variable names, a numeric vector, a logical vector, a function handle, or a table vartype subscript. The 'DataVariables' value indicates which variables of the input table to detect outliers in, and can be one of the following:

A character vector or string specifying a single table variable name
A cell array of character vectors where each element is a table variable name
A string array where each element is a table variable name
A vector of table variable indices
A logical vector whose elements each correspond to a table variable, where true includes the corresponding variable and false excludes it
A function handle that takes a table variable as input and returns a logical scalar
A table vartype subscript

Example: 'Age'

Example: {'Height','Weight'}

Example: ["Age","Weight"]

Example: @isnumeric

Example: vartype('numeric')
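
The following sketch shows how 'DataVariables' narrows detection to selected table variables. The table, its variable names, and its values are hypothetical and were made up for illustration.

```matlab
% Hypothetical table: the variable names and values are made up for illustration.
Height = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57]';
Weight = [10 11 12 11 10 12 11 10 11 12 10 11 12 11 900]';
T = table(Height,Weight);

% Check every variable: rows 4, 9, and 15 contain an outlier.
[Ball,TFall] = rmoutliers(T);

% Check only Height: the outlier in Weight (row 15) is ignored.
[Bh,TFh] = rmoutliers(T,'DataVariables','Height');

find(TFall)'   % 4 9 15
find(TFh)'     % 4 9
```

In both calls entire rows are removed, so Bh still contains the Weight variable. Only the decision about which rows to drop is restricted to Height.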

`'MinNumOutliers'` — Minimum outlier count
1 (default) | positive integer scalar

Minimum outlier count, specified as the comma-separated pair consisting of 'MinNumOutliers' and a positive integer scalar. The 'MinNumOutliers' value specifies the minimum number of outliers required to remove a row or column. For example, rmoutliers(A,'MinNumOutliers',3) removes a row of a matrix A when there are 3 or more outliers detected in that row.
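
A short sketch of 'MinNumOutliers' on a matrix follows. The matrix is built from magic(5) with three values overwritten for illustration, so that one row contains a single outlier and another contains two; the names TF1 and TF2 are illustrative.

```matlab
A = magic(5);
A(2,1) = 500;                 % row 2 has one outlier
A(4,4) = 500;                 % row 4 has two outliers
A(4,5) = 500;

[~,TF1] = rmoutliers(A);                      % default: one outlier is enough
[~,TF2] = rmoutliers(A,'MinNumOutliers',2);   % require at least two per row

find(TF1)'   % rows 2 and 4
find(TF2)'   % row 4 only
```

Outliers are still detected column by column. 'MinNumOutliers' only changes how many of them a row must contain before that row is removed.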

`'MaxNumOutliers'` — Maximum outlier count
positive scalar

Maximum outlier count, for the 'gesd' method only, specified as the comma-separated pair consisting of 'MaxNumOutliers' and a positive scalar. The 'MaxNumOutliers' value specifies the maximum number of outliers returned by the 'gesd' method. For example, rmoutliers(A,'gesd','MaxNumOutliers',5) detects no more than five outliers.

The default value for 'MaxNumOutliers' is the integer nearest to 10 percent of the number of elements in A. Setting a larger value for the maximum number of outliers can ensure that all outliers are detected, but at the cost of reduced computational efficiency.
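
As a small illustration, the sketch below caps the 'gesd' method at a single outlier for the example vector used on this page. It makes no claim about which element the test selects and only checks that the cap is respected.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Cap the GESD test at a single outlier.
[B,TF] = rmoutliers(A,'gesd','MaxNumOutliers',1);

nnz(TF)    % at most 1
```

For this 15-element vector the default cap described above would be the integer nearest to 1.5, so setting the value explicitly makes the limit unambiguous.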

Output Arguments

`B` — Data with outliers removed
vector | matrix | table | timetable

Data with outliers removed, returned as a vector, matrix, table, or timetable. The size of B depends on the number of removed rows or columns.

`TF` — Removed data indicator
logical vector

Removed data indicator, returned as a logical vector. The value 1 (true) corresponds to rows or columns in A that were removed. The value 0 (false) corresponds to unchanged rows or columns. The orientation and size of TF depend on A and the dimension of operation.

Extended Capabilities

Tall Arrays
Calculate with arrays that have more rows than fit in memory.

Usage notes and limitations:

The 'percentiles', 'grubbs', and 'gesd' methods are not supported.
The 'movmedian' and 'movmean' methods do not support tall timetables.
The 'SamplePoints' and 'MaxNumOutliers' name-value pairs are not supported.
The value of 'DataVariables' cannot be a function handle.
Computation of rmoutliers(A), rmoutliers(A,'median',...), or rmoutliers(A,'quartiles',...) along the first dimension is only supported for tall column vectors A.
rmoutliers(A,2) is not supported for tall tables.

For more information, see Tall Arrays.

C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

Usage notes and limitations:

The 'movmean' and 'movmedian' methods for detecting outliers do not support timetable input data, datetime 'SamplePoints' values, or duration 'SamplePoints' values.
For table input, dim must equal 1.

GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

Usage notes and limitations:

The 'movmedian' moving method is not supported.
The 'SamplePoints' and 'DataVariables' name-value pairs are not supported.

For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
