mvksdensity

Kernel smoothing function estimate for multivariate data

Syntax

f = mvksdensity(x,pts,'Bandwidth',bw)

f = mvksdensity(x,pts,'Bandwidth',bw,Name,Value)

Description

f = mvksdensity(x,pts,'Bandwidth',bw) computes a probability density estimate of the sample data in the n-by-d matrix x, evaluated at the points in pts. The bandwidth bw is required and must be specified with the 'Bandwidth' name-value pair argument. The estimation is based on a product Gaussian kernel function.

For univariate or bivariate data, use ksdensity instead.

f = mvksdensity(x,pts,'Bandwidth',bw,Name,Value) returns the estimate f using additional options specified by one or more Name,Value pair arguments. For example, you can define the function type that mvksdensity evaluates, such as probability density, cumulative probability, or survivor function. You can also assign weights to the input values.

Examples

Estimate Multivariate Kernel Density

Load the Hald cement data.

load hald

The data measures the heat of hardening for 13 different cement compositions. The predictor matrix ingredients contains the percent composition for each of four cement ingredients. The response matrix heat contains the heat of hardening (in cal/g) after 180 days.

Estimate the kernel density for the first three observations in ingredients.

xi = ingredients(1:3,:);
f = mvksdensity(ingredients,xi,'Bandwidth',0.8);

Estimate Multivariate Kernel Density Using Grids

Load the Hald cement data.

load hald

Create an array of points at which to estimate the density. First, define the range and spacing for each variable, using a similar number of points in each dimension.

gridx1 = 0:2:22;
gridx2 = 20:5:80;
gridx3 = 0:2:24;
gridx4 = 5:5:65;

Next, use ndgrid to generate a full grid of points using the defined range and spacing.

[x1,x2,x3,x4] = ndgrid(gridx1,gridx2,gridx3,gridx4);

Finally, transform and concatenate to create an array that contains the points at which to estimate the density. This array has one column for each variable.

x1 = x1(:,:)';
x2 = x2(:,:)';
x3 = x3(:,:)';
x4 = x4(:,:)';
xi = [x1(:) x2(:) x3(:) x4(:)];

Estimate the density.

f = mvksdensity(ingredients,xi,...
	'Bandwidth',[4.0579 10.7345 4.4185 11.5466],...
	'Kernel','normpdf');

View the size of xi and f to confirm that mvksdensity calculates the density at each point in xi.

size_xi = size(xi)

size_xi = 1×2

       26364           4

size_f = size(f)

size_f = 1×2

       26364           1

Input Arguments

`x` — Sample data
numeric matrix

Sample data for which mvksdensity returns the probability density estimate, specified as an n-by-d matrix of numeric values. n is the number of data points (rows) in x, and d is the number of dimensions (columns).

Data Types: single | double

`pts` — Points at which to evaluate f
matrix

Points at which to evaluate the probability density estimate f, specified as a matrix with the same number of columns as x. The returned estimate f and pts have the same number of rows.

Data Types: single | double

`bw` — Value for the bandwidth of the kernel smoothing window
scalar value | d-element vector

Value for the bandwidth of the kernel-smoothing window, specified as a scalar value or d-element vector. d is the number of dimensions (columns) in the sample data x. If bw is a scalar value, it applies to all dimensions.

If you specify 'BoundaryCorrection' as 'log' (default) and 'Support' as either 'positive' or a two-row matrix, mvksdensity converts bounded data to be unbounded by using a log transformation. The value of bw is on the scale of the transformed values.

Silverman's rule of thumb for the bandwidth is

$b_{i} = \sigma_{i} \left[ \frac{4}{(d + 2) n} \right]^{\frac{1}{d + 4}}, \quad i = 1, 2, \dots, d,$

where d is the number of dimensions, n is the number of observations, and $\sigma_{i}$ is the standard deviation of the ith variate [4].

Example: 'Bandwidth',0.8

Data Types: single | double
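
As a rough illustration of Silverman's rule of thumb, the following sketch computes a bandwidth vector for the Hald ingredients data used in the examples on this page. The variable names are illustrative, and the sketch assumes that each σ_i is estimated with the sample standard deviation returned by std.

```matlab
load hald                                % provides the 13-by-4 matrix ingredients
[n,d] = size(ingredients);               % n observations, d dimensions
sigma = std(ingredients);                % 1-by-d sample standard deviations (assumed estimate of sigma_i)
bw = sigma*(4/((d+2)*n))^(1/(d+4));      % Silverman's rule, one bandwidth per dimension

% Pass the resulting d-element vector as the required 'Bandwidth' value.
xi = ingredients(1:3,:);
f = mvksdensity(ingredients,xi,'Bandwidth',bw);
```

For this data set, the first element of bw should come out close to 4.058, which is in line with the bandwidth vector used in the grid example.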

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'Kernel','triangle','Function','cdf' specifies that mvksdensity estimates the cdf of the sample data using the triangle kernel function.

`'BoundaryCorrection'` — Boundary correction method
'log' (default) | 'reflection'

Boundary correction method, specified as the comma-separated pair consisting of 'BoundaryCorrection' and either 'log' or 'reflection'.

Value	Description
`'log'`	`mvksdensity` converts bounded data to be unbounded by using one of the following transformations. Then, it transforms back to the original bounded scale after density estimation. If you specify `'Support','positive'`, then `mvksdensity` applies `log(x_j)` for each dimension, where x_j is the jth column of the input argument `x`. If you specify `'Support'` as a two-row matrix consisting of the lower and upper limits for each dimension, then `mvksdensity` applies `log((x_j-L_j)/(U_j-x_j))` for each dimension, where L_j and U_j are the lower and upper limits of the jth dimension, respectively. The value of `bw` is on the scale of the transformed values.
`'reflection'`	`mvksdensity` augments bounded data by adding reflected data near the boundaries, then it returns estimates corresponding to the original support. For details, see Reflection Method.

mvksdensity applies boundary correction only when you specify 'Support' as a value other than 'unbounded'.

Example: 'BoundaryCorrection','reflection'

`'Function'` — Function to estimate
`'pdf'` (default) | `'cdf'` | `'survivor'`

Function to estimate, specified as the comma-separated pair consisting of 'Function' and one of the following.

Value	Description
`'pdf'`	Probability density function
`'cdf'`	Cumulative distribution function
`'survivor'`	Survivor function

Example: 'Function','cdf'
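
A minimal sketch of selecting the function to estimate, reusing the Hald data and the scalar bandwidth 0.8 from the first example on this page. The output names F and S are illustrative.

```matlab
load hald
xi = ingredients(1:3,:);     % evaluation points: the first three observations

% Estimated cumulative distribution function at each row of xi
F = mvksdensity(ingredients,xi,'Bandwidth',0.8,'Function','cdf');

% Estimated survivor function at the same points
S = mvksdensity(ingredients,xi,'Bandwidth',0.8,'Function','survivor');
```

Both outputs have one row per row of xi, and the cdf values are expected to lie between 0 and 1.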

`'Kernel'` — Type of kernel smoother
`'normal'` (default) | `'box'` | `'triangle'` | `'epanechnikov'` | function handle | character vector | string scalar

Type of kernel smoother, specified as the comma-separated pair consisting of 'Kernel' and one of the following.

Value	Description
`'normal'`	Normal (Gaussian) kernel
`'box'`	Box kernel
`'triangle'`	Triangular kernel
`'epanechnikov'`	Epanechnikov kernel

You can also specify a kernel function that is a custom or built-in function. Specify the function as a function handle (for example, @myfunction or @normpdf) or as a character vector or string scalar (for example, 'myfunction' or 'normpdf'). The software calls the specified function with one argument that is an array of distances between data values and locations where the density is evaluated, normalized by the bandwidth in that dimension. The function must return an array of the same size containing the corresponding values of the kernel function.

mvksdensity applies the same kernel to each dimension.

Example: 'Kernel','box'
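
The following sketch shows one way to pass a custom kernel as a function handle. The handle gaussKernel is a hypothetical name; it implements the standard normal density elementwise, so the result is expected to be close to the one obtained with the built-in 'normal' kernel, although small numerical differences are possible.

```matlab
load hald
xi = ingredients(1:3,:);

% Hypothetical custom kernel: takes an array of bandwidth-normalized
% distances and returns an array of the same size.
gaussKernel = @(u) exp(-0.5*u.^2)/sqrt(2*pi);

fCustom = mvksdensity(ingredients,xi,'Bandwidth',0.8,'Kernel',gaussKernel);
fNormal = mvksdensity(ingredients,xi,'Bandwidth',0.8,'Kernel','normal');

% Largest relative difference between the two estimates
relDiff = max(abs(fCustom - fNormal)./fNormal);
```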

`'Support'` — Support for the density
`'unbounded'` (default) | `'positive'` | 2-by-d matrix

Support for the density, specified as the comma-separated pair consisting of 'Support' and one of the following.

Value	Description
`'unbounded'`	Allow the density to extend over the whole real line
`'positive'`	Restrict the density to positive values
2-by-d matrix	Specify the finite lower and upper bounds for the support of the density. The first row contains the lower limits and the second row contains the upper limits. Each column contains the limits for one dimension of `x`.

'Support' can also be a combination of positive, unbounded, and bounded variables specified as [0 -Inf L; Inf Inf U].

Example: 'Support','positive'

Data Types: single | double | char | string
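
A short sketch of the three ways to bound the support, using the Hald ingredients data (all values are positive percentages). The bandwidth values are arbitrary choices for illustration, and supp is a hypothetical variable name. Note that with the default 'log' correction the bandwidth is on the log-transformed scale, whereas with 'reflection' it is on the original scale.

```matlab
load hald
xi = ingredients(1:3,:);

% Positive support with the default 'log' boundary correction;
% 0.5 is an illustrative bandwidth on the log scale.
fLog = mvksdensity(ingredients,xi,'Bandwidth',0.5,'Support','positive');

% Positive support with the reflection method; bandwidth on the data scale.
fRef = mvksdensity(ingredients,xi,'Bandwidth',0.8, ...
    'Support','positive','BoundaryCorrection','reflection');

% Finite bounds per dimension: row 1 holds lower limits, row 2 upper limits.
supp = [0 0 0 0; 100 100 100 100];
fBox = mvksdensity(ingredients,xi,'Bandwidth',0.8, ...
    'Support',supp,'BoundaryCorrection','reflection');
```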

`'Weights'` — Weights for sample data
vector

Weights for sample data, specified as the comma-separated pair consisting of 'Weights' and a vector of length size(x,1), where x is the sample data.

Example: 'Weights',xw

Data Types: single | double
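
A minimal sketch of weighting the sample data. The weights here are hypothetical and simply give later observations more influence; they are normalized to sum to 1 as a precaution, which this sketch assumes is harmless whether or not the function normalizes them internally.

```matlab
load hald
n = size(ingredients,1);

xw = (1:n)';            % hypothetical weights, one per row of the sample data
xw = xw/sum(xw);        % normalize so the weights sum to 1

xi = ingredients(1:3,:);
f = mvksdensity(ingredients,xi,'Bandwidth',0.8,'Weights',xw);
```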

Output Arguments

`f` — Estimated function values
vector

Estimated function values, returned as a vector. f and pts have the same number of rows.

More About

Multivariate Kernel Distribution

A multivariate kernel distribution is a nonparametric representation of the probability density function (pdf) of a random vector. You can use a kernel distribution when a parametric distribution cannot properly describe the data, or when you want to avoid making assumptions about the distribution of the data. A multivariate kernel distribution is defined by a smoothing function and a bandwidth matrix, which control the smoothness of the resulting density curve.

The multivariate kernel density estimator is the estimated pdf of a random vector. Let x = (x₁, x₂, …, x_d)' be a d-dimensional random vector with a density function f and let y_i = (y_i1, y_i2, …, y_id)' be a random sample drawn from f for i = 1, 2, …, n, where n is the number of random samples. For any real vector x, the multivariate kernel density estimator is given by

${\hat{f}}_{H} (x) = \frac{1}{n} \sum_{i = 1}^{n} K_{H} (x - y_{i}),$

where $K_{H} (x) = {| H |}^{- 1 / 2} K (H^{- 1 / 2} x)$ , $K (\cdot)$ is the kernel smoothing function, and H is the d-by-d bandwidth matrix.

mvksdensity uses a diagonal bandwidth matrix and a product kernel. That is, $H^{1/2}$ is a square diagonal matrix with the elements of vector (h₁, h₂, …, h_d) on the main diagonal. K(x) takes the product form K(x) = k(x₁)k(x₂)⋯k(x_d), where $k (\cdot)$ is a one-dimensional kernel smoothing function. Then, the multivariate kernel density estimator becomes

${\hat{f}}_{H} (x) = \frac{1}{n} \sum_{i = 1}^{n} K_{H} (x - y_{i}) = \frac{1}{n h_{1} h_{2} \dots h_{d}} \sum_{i = 1}^{n} K (\frac{x_{1} - y_{i 1}}{h_{1}}, \frac{x_{2} - y_{i 2}}{h_{2}}, \dots, \frac{x_{d} - y_{i d}}{h_{d}}) = \frac{1}{n h_{1} h_{2} \dots h_{d}} \sum_{i = 1}^{n} \prod_{j = 1}^{d} k (\frac{x_{j} - y_{i j}}{h_{j}}) .$
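
The product-kernel formula above can be sketched directly in a few lines. This is an illustrative reimplementation with a Gaussian one-dimensional kernel, not the code that mvksdensity runs; the data values and variable names are made up for the example, and the row-by-matrix subtraction relies on implicit expansion (R2016b or later).

```matlab
y = [1 2; 2 1; 3 4];                   % n-by-d sample (illustrative values)
x = [2 2; 0 0];                        % m-by-d evaluation points
h = [1 1];                             % bandwidths h_1,...,h_d
k = @(u) exp(-0.5*u.^2)/sqrt(2*pi);    % one-dimensional Gaussian kernel

n = size(y,1);
fhat = zeros(size(x,1),1);
for r = 1:size(x,1)
    u = (x(r,:) - y)./h;               % n-by-d array of (x_j - y_ij)/h_j
    % Product over dimensions, sum over samples, scale by n*h_1*...*h_d
    fhat(r) = sum(prod(k(u),2))/(n*prod(h));
end
```

Each row of u holds the scaled distances from one sample to the evaluation point, so prod(k(u),2) is the inner product over j and the sum runs over i, matching the last expression in the formula.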

The kernel estimator for the cumulative distribution function (cdf), for any real vector x, is given by

${\hat{F}}_{H} (x) = \int_{- \infty}^{x_{1}} \int_{- \infty}^{x_{2}} \dots \int_{- \infty}^{x_{d}} {\hat{f}}_{H} (t) d t_{d} \dots d t_{2} d t_{1} = \frac{1}{n} \sum_{i = 1}^{n} \prod_{j = 1}^{d} G (\frac{x_{j} - y_{i j}}{h_{j}}),$

where $G (x_{j}) = \int_{- \infty}^{x_{j}} k (t_{j}) d t_{j}$ .
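
The cdf estimator can be sketched the same way. The example below assumes a Gaussian kernel, for which G is the standard normal cdf, written here with erfc so that no toolbox function is needed. The data and names are illustrative, and the subtraction again relies on implicit expansion (R2016b or later).

```matlab
y = [1 2; 2 1; 3 4];                   % n-by-d sample (illustrative values)
x = [2 2; 100 100];                    % m-by-d evaluation points
h = [1 1];                             % bandwidths h_1,...,h_d
G = @(u) 0.5*erfc(-u/sqrt(2));         % integral of the Gaussian kernel (standard normal cdf)

n = size(y,1);
Fhat = zeros(size(x,1),1);
for r = 1:size(x,1)
    u = (x(r,:) - y)./h;               % n-by-d array of (x_j - y_ij)/h_j
    Fhat(r) = sum(prod(G(u),2))/n;     % product over dimensions, average over samples
end
```

The second evaluation point lies far above every sample in both dimensions, so its estimate is essentially 1, as expected of a cdf.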

Reflection Method

The reflection method is a boundary correction method that accurately finds kernel density estimators when a random variable has bounded support. If you specify 'BoundaryCorrection','reflection', mvksdensity uses the reflection method.

If you additionally specify 'Support' as a two-row matrix consisting of the lower and upper limits for each dimension, then mvksdensity finds the kernel estimator as follows.

If 'Function' is 'pdf', then the kernel density estimator is
${\hat{f}}_{H} (x) = \frac{1}{n h_{1} h_{2} \dots h_{d}} \sum_{i = 1}^{n} \prod_{j = 1}^{d} [k (\frac{x_{j} - y_{i j}^{-}}{h_{j}}) + k (\frac{x_{j} - y_{i j}}{h_{j}}) + k (\frac{x_{j} - y_{i j}^{+}}{h_{j}})]$ for L_j ≤ x_j ≤ U_j,
where $y_{i j}^{-} = 2 L_{j} - y_{i j}$ , $y_{i j}^{+} = 2 U_{j} - y_{i j}$ , and y_ij is the jth element of the ith sample data corresponding to x(i,j) of the input argument x. L_j and U_j are the lower and upper limits of the jth dimension, respectively.
If 'Function' is 'cdf', then the kernel estimator for cdf is
${\hat{F}}_{H} (x) = \frac{1}{n} \sum_{i = 1}^{n} \prod_{j = 1}^{d} [G (\frac{x_{j} - y_{i j}^{-}}{h_{j}}) + G (\frac{x_{j} - y_{i j}}{h_{j}}) + G (\frac{x_{j} - y_{i j}^{+}}{h_{j}}) - G (\frac{L_{j} - y_{i j}^{-}}{h_{j}}) - G (\frac{L_{j} - y_{i j}}{h_{j}}) - G (\frac{L_{j} - y_{i j}^{+}}{h_{j}})]$ for L_j ≤ x_j ≤ U_j.
To obtain a kernel estimator for a survivor function (when 'Function' is 'survivor'), mvksdensity uses both ${\hat{f}}_{H} (x)$ and ${\hat{F}}_{H} (x)$ .

If you additionally specify 'Support' as 'positive' or a matrix including [0 inf], then mvksdensity finds the kernel density estimator by replacing [L_j U_j] with [0 inf] in the above equations.
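
As an illustration of the reflected pdf estimator above, the following sketch evaluates it at a single point near a lower boundary and compares it with the uncorrected product-kernel estimate. It assumes a Gaussian kernel and a unit-square support; the data, bounds, and names are made up for the example, and the code is not the implementation used by mvksdensity. The subtractions rely on implicit expansion (R2016b or later).

```matlab
y = [0.1 0.5; 0.4 0.9; 0.8 0.2];       % n-by-d sample inside the support (illustrative)
L = [0 0];                             % lower limits L_j
U = [1 1];                             % upper limits U_j
h = [0.2 0.2];                         % bandwidths h_j
x = [0.05 0.5];                        % one evaluation point near the lower bound of dimension 1
k = @(u) exp(-0.5*u.^2)/sqrt(2*pi);    % one-dimensional Gaussian kernel

n = size(y,1);
yMinus = 2*L - y;                      % samples reflected about the lower limits
yPlus  = 2*U - y;                      % samples reflected about the upper limits

% Sum of the three kernel terms in each dimension, then product over dimensions
kSum = k((x - yMinus)./h) + k((x - y)./h) + k((x - yPlus)./h);
fReflect = sum(prod(kSum,2))/(n*prod(h));

% Uncorrected estimate for comparison
fPlain = sum(prod(k((x - y)./h),2))/(n*prod(h));
```

Because the reflected terms are nonnegative, fReflect is never smaller than fPlain; the difference is largest near a boundary, where the uncorrected estimate loses mass outside the support.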

References

[1] Bowman, A. W., and A. Azzalini. Applied Smoothing Techniques for Data Analysis. New York: Oxford University Press Inc., 1997.

[2] Hill, P. D. “Kernel estimation of a distribution function.” Communications in Statistics – Theory and Methods. Vol. 14, Issue 3, 1985, pp. 605-620.

[3] Jones, M. C. “Simple boundary correction for kernel density estimation.” Statistics and Computing. Vol. 3, Issue 3, 1993, pp. 135-146.

[4] Silverman, B. W. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, 1986.

[5] Scott, D. W. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, 2015.

Extended Capabilities

C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

Usage notes and limitations:

Names in name-value pair arguments, including 'Bandwidth', must be compile-time constants.
Values in the following name-value pair arguments must also be compile-time constants: 'BoundaryCorrection', 'Function', and 'Kernel'. For example, to use the 'Function','cdf' name-value pair argument in the generated code, include {coder.Constant('Function'),coder.Constant('cdf')} in the -args value of codegen.
The value of the 'Kernel' name-value pair argument cannot be a custom function handle. To specify a custom kernel function, use a character vector or string scalar.
For the value of the 'Support' name-value pair argument, the compile-time data type must match the runtime data type.

For more information on code generation, see Introduction to Code Generation and General Code Generation Workflow.
