Wilkinson notation provides a way to describe regression and repeated measures models without specifying coefficient values. This specialized notation identifies the response variable and which predictor variables to include or exclude from the model. You can also include squared and higher-order terms, interaction terms, and grouping variables in the model formula.
Specifying a model using Wilkinson notation provides several advantages:
- You can include or exclude individual predictors and interaction terms from the model. For example, using the 'Interactions' name-value pair available in each model fitting function includes interaction terms for all pairs of variables. Using Wilkinson notation instead allows you to include only the interaction terms of interest.
- You can change the model formula without changing the design matrix, if your input data uses the `table` data type. For example, if you fit an initial model using all the available predictor variables, but decide to remove a variable that is not statistically significant, then you can rewrite the model formula to include only the variables of interest. You do not need to make any changes to the input data itself.
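As a rough illustration of the first advantage, the sketch below contrasts a model with all pairwise interactions against a formula that keeps a single interaction. It assumes the `carsmall` sample data and uses the 'interactions' model specification argument of `fitlm` as the all-pairs option; the choice of variables is illustrative only.

```matlab
load carsmall
tbl = table(Weight,Acceleration,Horsepower,MPG);

% All pairwise interactions (assumed all-pairs option: the 'interactions' modelspec)
mdlAll = fitlm(tbl,'interactions','ResponseVar','MPG');

% Wilkinson notation: keep only the Weight-by-Acceleration interaction
mdlOne = fitlm(tbl,'MPG ~ Weight*Acceleration + Horsepower');

% Compare which terms each model contains
disp(mdlAll.CoefficientNames')
disp(mdlOne.CoefficientNames')
```

The second model uses the same table as the first; only the formula changes.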
Statistics and Machine Learning Toolbox™ offers several model fitting functions that use Wilkinson notation, including:
- Linear models (using `fitlm` and `stepwiselm`)
- Generalized linear models (using `fitglm`)
- Linear mixed-effects models (using `fitlme` and `fitlmematrix`)
- Generalized linear mixed-effects models (using `fitglme`)
- Repeated measures models (using `fitrm`)
A formula for model specification is a character vector or string scalar of the form `y ~ terms`, where `y` is the name of the response variable, and `terms` defines the model using the predictor variable names and the following operators.
Predictor Terms in Model | Wilkinson Notation |
---|---|
intercept | 1 |
no intercept | -1 |
x1 | x1 |
x1, x2 | x1 + x2 |
x1, x2, x1x2 | x1*x2 or x1 + x2 + x1:x2 |
x1x2 | x1:x2 |
x1, x1² | x1^2 |
x1² | x1^2 - x1 |
Wilkinson notation includes an intercept term in the model by default, even if you do not add 1 to the model formula. To exclude the intercept from the model, use -1 in the formula.
The `*` operator (for interactions) and the `^` operator (for power and exponents) automatically include all lower-order terms. For example, if you specify `x^3`, the model automatically includes x³, x², and x. If you want to exclude certain terms from the model, use the `-` operator to remove the unwanted terms.
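The expansion of lower-order terms can be checked by inspecting the coefficient names of a fitted model. The following is a minimal sketch, assuming the `carsmall` sample data; the variable names `mdlFull` and `mdlSquareOnly` are illustrative.

```matlab
load carsmall
tbl = table(Weight,MPG);

% Weight^2 also brings in the linear Weight term (plus the default intercept)
mdlFull = fitlm(tbl,'MPG ~ Weight^2');

% Subtracting Weight leaves only the squared term and the intercept
mdlSquareOnly = fitlm(tbl,'MPG ~ Weight^2 - Weight');

disp(mdlFull.CoefficientNames')
disp(mdlSquareOnly.CoefficientNames')
```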
For random-effects and mixed-effects models, the formula specification includes the names of the predictor variables and the grouping variables. For example, if the predictor variable x1 is a random effect grouped by the variable g, then represent this in Wilkinson notation as follows:
(x1 | g)
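As a sketch of how such a term appears in a complete formula, the example below assumes the `carsmall` sample data, with `Acceleration` playing the role of x1 and `Model_Year` the role of the grouping variable g. These choices are illustrative, not prescribed by the notation.

```matlab
load carsmall
tbl = table(Acceleration,Horsepower,Model_Year,MPG);

% Fixed effects: intercept, Acceleration, and Horsepower
% Random effects: intercept and Acceleration slope, grouped by Model_Year
lme = fitlme(tbl,'MPG ~ Acceleration + Horsepower + (Acceleration | Model_Year)');

% Names of the fixed-effects coefficients
disp(lme.CoefficientNames)
```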
For repeated measures models, the formula specification includes all of the repeated measures as responses, and the factors as predictor variables. Specify the response variables for repeated measures models as described in the following table.
Response Terms in Model | Wilkinson Notation |
---|---|
y1 | y1 |
y1, y2, y3 | y1,y2,y3 |
y1, y2, y3, y4, y5 | y1-y5 |
For example, if you have three repeated measures as responses and the factors x1, x2, and x3 as the predictor variables, then you can define the repeated measures model using Wilkinson notation as follows:
y1,y2,y3 ~ x1 + x2 + x3
or
y1-y3 ~ x1 + x2 + x3
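The two ways of listing responses can be compared directly. The sketch below assumes the `fisheriris` sample data, treating the four measurements as repeated measures and `species` as the predictor; the table variable names `meas1` through `meas4` are made up for this example.

```matlab
load fisheriris
t = table(species,meas(:,1),meas(:,2),meas(:,3),meas(:,4), ...
    'VariableNames',{'species','meas1','meas2','meas3','meas4'});
Meas = table([1 2 3 4]','VariableNames',{'Measurements'});

% Comma-separated list of responses
rmList = fitrm(t,'meas1,meas2,meas3,meas4 ~ species','WithinDesign',Meas);

% Range of responses; covers the table variables from meas1 through meas4
rmRange = fitrm(t,'meas1-meas4 ~ species','WithinDesign',Meas);

disp(rmList.ResponseNames)
disp(rmRange.ResponseNames)
```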
If the input data (response and predictor variables) is stored in a table or dataset array, you can specify the formula using the variable names. For example, load the `carsmall` sample data. Create a table containing `Weight`, `Acceleration`, and `MPG`. Name each variable using the 'VariableNames' name-value pair argument of the `table` function. Then fit the following model to the data:
    load carsmall
    tbl = table(Weight,Acceleration,MPG, ...
        'VariableNames',{'Weight','Acceleration','MPG'});
    mdl = fitlm(tbl,'MPG ~ Weight + Acceleration')

    mdl = 
    Linear regression model:
        MPG ~ 1 + Weight + Acceleration

    Estimated Coefficients:
                         Estimate         SE         tStat       pValue  
                        __________    __________    _______    __________
        (Intercept)         45.155        3.4659     13.028    1.6266e-22
        Weight          -0.0082475    0.00059836    -13.783    5.3165e-24
        Acceleration       0.19694       0.14743     1.3359       0.18493

    Number of observations: 94, Error degrees of freedom: 91
    Root Mean Squared Error: 4.12
    R-squared: 0.743,  Adjusted R-Squared: 0.738
    F-statistic vs. constant model: 132, p-value = 1.38e-27
The model object display uses the variable names provided in the input table.
If the input data is stored as a matrix, you can specify the formula using default variable names such as `y`, `x1`, and `x2`. For example, load the `carsmall` sample data. Create a matrix containing the predictor variables `Weight` and `Acceleration`. Then fit the following model to the data:
    load carsmall
    X = [Weight,Acceleration];
    y = MPG;
    mdl = fitlm(X,y,'y ~ x1 + x2')

    mdl = 
    Linear regression model:
        y ~ 1 + x1 + x2

    Estimated Coefficients:
                        Estimate         SE         tStat       pValue  
                       __________    __________    _______    __________
        (Intercept)        45.155        3.4659     13.028    1.6266e-22
        x1             -0.0082475    0.00059836    -13.783    5.3165e-24
        x2                0.19694       0.14743     1.3359       0.18493

    Number of observations: 94, Error degrees of freedom: 91
    Root Mean Squared Error: 4.12
    R-squared: 0.743,  Adjusted R-Squared: 0.738
    F-statistic vs. constant model: 132, p-value = 1.38e-27
The term `x1` in the model specification formula corresponds to the first column of the predictor variable matrix `X`. The term `x2` corresponds to the second column of the input matrix. The term `y` corresponds to the response variable.
Use `fitlm` and `stepwiselm` to fit linear models.
For a linear regression model with an intercept and two fixed-effects predictors, such as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i,$$

specify the model formula using Wilkinson notation as follows:
'y ~ x1 + x2'
For a linear regression model with no intercept and two fixed-effects predictors, such as

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i,$$

specify the model formula using Wilkinson notation as follows:
'y ~ -1 + x1 + x2'
For a linear regression model with an intercept, two fixed-effects predictors, and an interaction term, such as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2} + \varepsilon_i,$$

specify the model formula using Wilkinson notation as follows:
'y ~ x1*x2'
or
'y ~ x1 + x2 + x1:x2'
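For a linear regression model with an intercept, three fixed-effects predictors, and interaction effects between all three predictors plus all lower-order terms, such as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i1} x_{i2} + \beta_5 x_{i1} x_{i3} + \beta_6 x_{i2} x_{i3} + \beta_7 x_{i1} x_{i2} x_{i3} + \varepsilon_i,$$

specify the model formula using Wilkinson notation as follows: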
'y ~ x1*x2*x3'
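For a linear regression model with an intercept, three fixed-effects predictors, and interaction effects between two of the predictors, such as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i1} x_{i2} + \varepsilon_i,$$

specify the model formula using Wilkinson notation as follows: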
'y ~ x1*x2 + x3'
or
'y ~ x1 + x2 + x3 + x1:x2'
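For a linear regression model with an intercept, three fixed-effects predictors, and pairwise interaction effects between all three predictors, but excluding an interaction effect between all three predictors simultaneously, such as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i1} x_{i2} + \beta_5 x_{i1} x_{i3} + \beta_6 x_{i2} x_{i3} + \varepsilon_i,$$

specify the model formula using Wilkinson notation as follows: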
'y ~ x1*x2*x3 - x1:x2:x3'
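One way to confirm which terms a formula produces is to fit it and list the coefficient names. The following sketch assumes the `carsmall` sample data in matrix form, so the default names `x1`, `x2`, and `x3` apply; the choice of predictors is illustrative.

```matlab
load carsmall
X = [Weight,Acceleration,Horsepower];
y = MPG;

% Main effects and all pairwise interactions, without the three-way term
mdl = fitlm(X,y,'y ~ x1*x2*x3 - x1:x2:x3');

% List the terms that ended up in the model
disp(mdl.CoefficientNames')
```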
Use `fitlme` and `fitlmematrix` to fit linear mixed-effects models.
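For a linear mixed-effects model that contains a random intercept but no predictor terms, such as

$$y_{ij} = \beta_0 + b_{0j} + \varepsilon_{ij},$$

where

$$b_{0j} \sim N(0,\sigma_b^2), \quad \varepsilon_{ij} \sim N(0,\sigma^2),$$

and g is the grouping variable with m levels (j = 1, ..., m), specify the model formula using Wilkinson notation as follows: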
'y ~ (1 | g)'
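For a linear mixed-effects model that contains a fixed intercept, random intercept, and fixed slope for the continuous predictor variable, such as

$$y_{ij} = \beta_0 + \beta_1 x_{ij} + b_{0j} + \varepsilon_{ij},$$

where

$$b_{0j} \sim N(0,\sigma_b^2), \quad \varepsilon_{ij} \sim N(0,\sigma^2),$$

and g is the grouping variable with m levels (j = 1, ..., m), specify the model formula using Wilkinson notation as follows: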
'y ~ x1 + (1 | g)'
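For a linear mixed-effects model that contains a fixed intercept and fixed slope, plus a random intercept and a random slope that have a possible correlation between them, such as

$$y_{ij} = \beta_0 + \beta_1 x_{ij} + b_{0j} + b_{1j} x_{ij} + \varepsilon_{ij},$$

where

$$b_j = \begin{pmatrix} b_{0j} \\ b_{1j} \end{pmatrix} \sim N\left(0,\sigma^2 D(\theta)\right), \quad \varepsilon_{ij} \sim N(0,\sigma^2),$$

and D is a 2-by-2 symmetric and positive semidefinite covariance matrix, parameterized by a variance component vector θ, specify the model formula using Wilkinson notation as follows: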
'y ~ x1 + (x1 | g)'
The pattern of the random effects covariance matrix is determined by the model fitting function. To specify the covariance matrix pattern, use the name-value pairs available through `fitlme` when fitting the model. For example, you can specify the assumption that the random intercept and random slope are independent of one another using the 'CovariancePattern' name-value pair argument in `fitlme`.
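The independence assumption can be sketched as follows. The example assumes the `carsmall` sample data with `Model_Year` as the grouping variable, and uses the 'Diagonal' value of 'CovariancePattern'; the variable names are illustrative.

```matlab
load carsmall
tbl = table(Acceleration,Model_Year,MPG);
formula = 'MPG ~ Acceleration + (Acceleration | Model_Year)';

% Same formula, but the random intercept and slope are assumed independent
lmeDiag = fitlme(tbl,formula,'CovariancePattern','Diagonal');

% Estimated random-effects covariance matrix; off-diagonal entries are zero
psi = covarianceParameters(lmeDiag);
disp(psi{1})
```

The formula itself does not change; only the name-value argument controls the covariance structure.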
Use `fitglm` and `stepwiseglm` to fit generalized linear models.
In a generalized linear model, the y response variable has a distribution other than normal, but you can represent the model as an equation that is linear in the regression coefficients. Specifying a generalized linear model requires three parts:
- Distribution of the response variable
- Link function
- Linear predictor
The distribution of the response variable and the link function are specified using name-value pair arguments in the fit function `fitglm` or `stepwiseglm`. The linear predictor portion of the equation, which appears on the right side of the `~` symbol in the model specification formula, uses Wilkinson notation in the same way as for the linear model examples.
In a generalized linear model, the linear predictor models the link function of the response rather than the response itself. The output display for the model object reflects this.
For a generalized linear regression model with an intercept and two predictors, specify the model formula using Wilkinson notation as follows:
'y ~ x1 + x2'
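The sketch below shows how the three parts fit together in a call to `fitglm`. It uses synthetic Poisson data generated for this example only; the coefficient values and the choice of a Poisson distribution with a log link are assumptions, not taken from the text above.

```matlab
rng default                      % reproducible synthetic data
X = randn(100,2);
mu = exp(1 + X*[0.5; -0.3]);     % log link: log(mu) is linear in x1 and x2
y = poissrnd(mu);

% Linear predictor from the formula; distribution and link from name-value pairs
mdl = fitglm(X,y,'y ~ x1 + x2','Distribution','poisson','Link','log');
disp(mdl.CoefficientNames')
```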
Use `fitglme` to fit generalized linear mixed-effects models.
In a generalized linear mixed-effects model, the y response variable has a distribution other than normal, but you can represent the model as an equation that is linear in the regression coefficients. Specifying a generalized linear mixed-effects model requires three parts:
- Distribution of the response variable
- Link function
- Linear predictor
The distribution of the response variable and the link function are specified using name-value pair arguments in the fit function `fitglme`. The linear predictor portion of the equation, which appears on the right side of the `~` symbol in the model specification formula, uses Wilkinson notation in the same way as for the linear mixed-effects model examples.
In a generalized linear mixed-effects model, the linear predictor models the link function of the response rather than the response itself. The output display for the model object reflects this.
The pattern of the random effects covariance matrix is determined by the model fitting function. To specify the covariance matrix pattern, use the name-value pairs available through `fitglme` when fitting the model. For example, you can specify the assumption that the random intercept and random slope are independent of one another using the 'CovariancePattern' name-value pair argument in `fitglme`.
For a generalized linear mixed-effects model that contains a fixed intercept, random intercept, and fixed slope for the continuous predictor variable, where the response can be modeled using a Poisson distribution and g is the grouping variable with m levels, specify the model formula using Wilkinson notation as follows:
'y ~ x1 + (1 | g)'
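A minimal sketch of fitting this formula with `fitglme` follows. The data are synthetic, and the group count, effect sizes, and log link are assumptions chosen for illustration.

```matlab
rng default                                   % reproducible synthetic data
nGroups = 10;
nPerGroup = 20;
g  = repelem((1:nGroups)',nPerGroup);         % grouping variable with 10 levels
x1 = randn(nGroups*nPerGroup,1);
b0 = 0.5*randn(nGroups,1);                    % one random intercept per group
y  = poissrnd(exp(0.5 + 0.3*x1 + b0(g)));
tbl = table(y,x1,categorical(g),'VariableNames',{'y','x1','g'});

% Fixed intercept and slope for x1, random intercept grouped by g
glme = fitglme(tbl,'y ~ x1 + (1 | g)','Distribution','Poisson','Link','log');
disp(glme.CoefficientNames)
```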
Use `fitrm` to fit repeated measures models.
For a repeated measures model with five response measurements and one predictor variable, specify the model formula using Wilkinson notation as follows:
'y1-y5 ~ x1'
For a repeated measures model with five response measurements and three predictor variables, plus an interaction between two of the predictor variables, specify the model formula using Wilkinson notation as follows:
'y1-y5 ~ x1*x2 + x3'
[1] Wilkinson, G. N., and C. E. Rogers. Symbolic description of factorial models for analysis of variance. J. Royal Statistical Society, Series C 22, pp. 392–399, 1973.