Exact GPR Method

An instance of response y from a Gaussian process regression (GPR) model can be modeled as

P(y_i \mid f(x_i), x_i) \sim N\!\left(y_i \,\middle|\, h(x_i)^T \beta + f(x_i),\; \sigma^2\right)

Hence, making predictions for new data from a GPR model requires:

  • Knowledge of the coefficient vector, β, of fixed basis functions

  • Ability to evaluate the covariance function k(x, x′|θ) for arbitrary x and x′, given the kernel parameters or hyperparameters, θ.

  • Knowledge of the noise variance σ² that appears in the density P(y_i | f(x_i), x_i)

That is, one needs first to estimate β, θ, and σ2 from the data (X,y).
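For intuition, the following sketch simulates a small data set (X, y) from this model in Python with NumPy. The squared exponential kernel, the basis h(x) = [1, x], and all parameter values here are illustrative assumptions, not part of the exact method itself.

import numpy as np

def sqexp_kernel(X1, X2, sigma_f=1.0, ell=1.0):
    # Squared exponential covariance k(x_r, x_s | theta) with theta = (sigma_f, ell).
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(0)
n = 50
X = rng.uniform(-3, 3, size=(n, 1))              # training inputs
H = np.column_stack([np.ones(n), X[:, 0]])       # explicit basis h(x) = [1, x]
beta = np.array([0.5, -1.0])                     # coefficients of the fixed basis functions
sigma = 0.1                                      # noise standard deviation

K = sqexp_kernel(X, X)                           # K(X, X | theta)
f = rng.multivariate_normal(np.zeros(n), K + 1e-10 * np.eye(n))   # latent GP values
y = H @ beta + f + sigma * rng.normal(size=n)    # y_i ~ N(h(x_i)^T beta + f(x_i), sigma^2)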

Parameter Estimation

One approach for estimating the parameters β, θ, and σ² of a GPR model is by maximizing the likelihood P(y|X) as a function of β, θ, and σ² [1]. That is, if β̂, θ̂, and σ̂² are the estimates of β, θ, and σ², respectively, then:

\hat{\beta}, \hat{\theta}, \hat{\sigma}^2 = \arg\max_{\beta,\,\theta,\,\sigma^2} \; \log P(y \mid X, \beta, \theta, \sigma^2).

Because

P(y \mid X) = P(y \mid X, \beta, \theta, \sigma^2) = N\!\left(y \,\middle|\, H\beta,\; K(X,X \mid \theta) + \sigma^2 I_n\right),

the marginal log likelihood function is as follows:

\log P(y \mid X, \beta, \theta, \sigma^2) = -\tfrac{1}{2}\,(y - H\beta)^T \left[K(X,X \mid \theta) + \sigma^2 I_n\right]^{-1} (y - H\beta) - \tfrac{n}{2}\log 2\pi - \tfrac{1}{2}\log\left|K(X,X \mid \theta) + \sigma^2 I_n\right|,

where H is the matrix of explicit basis functions evaluated at the training points, and K(X,X|θ) is the covariance function matrix (for more information, see Gaussian Process Regression Models).
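As a numerical sketch of this expression (assuming the NumPy setup above; the Cholesky-based evaluation is one common way to avoid forming an explicit inverse, not a description of any particular implementation):

def log_marginal_likelihood(y, H, K, beta, sigma2):
    # log P(y | X, beta, theta, sigma^2) with C = K(X, X | theta) + sigma^2 * I_n.
    n = y.shape[0]
    C = K + sigma2 * np.eye(n)
    r = y - H @ beta
    L = np.linalg.cholesky(C)                            # C = L L^T
    Cinv_r = np.linalg.solve(L.T, np.linalg.solve(L, r)) # C^{-1} (y - H*beta)
    logdetC = 2.0 * np.sum(np.log(np.diag(L)))           # log |C| from the Cholesky factor
    return -0.5 * r @ Cinv_r - 0.5 * n * np.log(2 * np.pi) - 0.5 * logdetC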

To estimate the parameters, the software first computes β̂(θ,σ²), which maximizes the log likelihood function with respect to β for given θ and σ². It then uses this estimate to compute the β-profiled likelihood:

\log\left\{P\!\left(y \mid X, \hat{\beta}(\theta, \sigma^2), \theta, \sigma^2\right)\right\}.

The estimate of β for given θ and σ² is

\hat{\beta}(\theta, \sigma^2) = \left[\,H^T \left[K(X,X \mid \theta) + \sigma^2 I_n\right]^{-1} H\,\right]^{-1} H^T \left[K(X,X \mid \theta) + \sigma^2 I_n\right]^{-1} y.

Then, the β-profiled log likelihood is given by

\log P\!\left(y \mid X, \hat{\beta}(\theta,\sigma^2), \theta, \sigma^2\right) = -\tfrac{1}{2}\,\left(y - H\hat{\beta}(\theta,\sigma^2)\right)^T \left[K(X,X \mid \theta) + \sigma^2 I_n\right]^{-1} \left(y - H\hat{\beta}(\theta,\sigma^2)\right) - \tfrac{n}{2}\log 2\pi - \tfrac{1}{2}\log\left|K(X,X \mid \theta) + \sigma^2 I_n\right|.

The software then maximizes the β-profiled log likelihood over θ and σ² to find their estimates.
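A sketch of this two-step scheme, assuming the kernel and data from the earlier snippets and using scipy.optimize.minimize merely as a stand-in for whatever optimizer an implementation actually uses:

from scipy.optimize import minimize

def beta_profile(y, H, C):
    # Generalized least squares estimate beta_hat(theta, sigma^2), with C = K(X,X|theta) + sigma^2*I_n.
    Cinv_H = np.linalg.solve(C, H)
    Cinv_y = np.linalg.solve(C, y)
    return np.linalg.solve(H.T @ Cinv_H, H.T @ Cinv_y)

def neg_profiled_loglik(params, y, X, H, kernel):
    # -log P(y | X, beta_hat(theta, sigma^2), theta, sigma^2); params are log-scale values.
    log_sigma_f, log_ell, log_sigma2 = params
    n = y.shape[0]
    C = kernel(X, X, np.exp(log_sigma_f), np.exp(log_ell)) + np.exp(log_sigma2) * np.eye(n)
    b = beta_profile(y, H, C)                       # beta_hat(theta, sigma^2)
    r = y - H @ b
    L = np.linalg.cholesky(C)
    logdetC = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * (r @ np.linalg.solve(C, r) + n * np.log(2 * np.pi) + logdetC)

res = minimize(neg_profiled_loglik, x0=np.zeros(3), args=(y, X, H, sqexp_kernel),
               method="L-BFGS-B")                   # estimates of log(theta) and log(sigma^2)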

Prediction

Making probabilistic predictions from a GPR model with known parameters requires the density P(y_new | y, X, x_new). Using the definition of conditional probabilities, one can write:

P(y_{new} \mid y, X, x_{new}) = \frac{P(y_{new}, y \mid X, x_{new})}{P(y \mid X, x_{new})}.

To find the joint density in the numerator, it is necessary to introduce the latent variables f_new and f corresponding to y_new and y, respectively. Then, it is possible to use the joint distribution for y_new, y, f_new, and f to compute P(y_new, y | X, x_new):

P(y_{new}, y \mid X, x_{new}) = \int\!\!\int P(y_{new}, y, f_{new}, f \mid X, x_{new}) \, df \, df_{new} = \int\!\!\int P(y_{new}, y \mid f_{new}, f, X, x_{new}) \, P(f_{new}, f \mid X, x_{new}) \, df \, df_{new}.

Gaussian process models assume that each response y_i depends only on the corresponding latent variable f_i and the feature vector x_i. Using this assumption to write P(y_new, y | f_new, f, X, x_new) as a product of conditional densities gives:

P(y_{new}, y \mid f_{new}, f, X, x_{new}) = P(y_{new} \mid f_{new}, x_{new}) \prod_{i=1}^{n} P(y_i \mid f(x_i), x_i).

After integrating with respect to y_new, the result depends only on f and X:

P(y \mid f, X) = \prod_{i=1}^{n} P(y_i \mid f_i, x_i) = \prod_{i=1}^{n} N\!\left(y_i \mid h(x_i)^T \beta + f_i,\; \sigma^2\right).

Hence,

P(y_{new}, y \mid f_{new}, f, X, x_{new}) = P(y_{new} \mid f_{new}, x_{new}) \, P(y \mid f, X).

Again using the definition of conditional probabilities,

P(f_{new}, f \mid X, x_{new}) = P(f_{new} \mid f, X, x_{new}) \, P(f \mid X, x_{new}),

it is possible to write P(y_new, y | X, x_new) as follows:

P(y_{new}, y \mid X, x_{new}) = \int\!\!\int P(y_{new} \mid f_{new}, x_{new}) \, P(y \mid f, X) \, P(f_{new} \mid f, X, x_{new}) \, P(f \mid X, x_{new}) \, df \, df_{new}.

Using the facts that

P(f \mid X, x_{new}) = P(f \mid X)

and

P(y \mid f, X) \, P(f \mid X) = P(y, f \mid X) = P(f \mid y, X) \, P(y \mid X),

one can rewrite P(y_new, y | X, x_new) as follows:

P(y_{new}, y \mid X, x_{new}) = P(y \mid X) \int\!\!\int P(y_{new} \mid f_{new}, x_{new}) \, P(f \mid y, X) \, P(f_{new} \mid f, X, x_{new}) \, df \, df_{new}.

It is also possible to show that

P(y \mid X, x_{new}) = P(y \mid X).

Hence, the required density P(y_new | y, X, x_new) is:

P(y_{new} \mid y, X, x_{new}) = \frac{P(y_{new}, y \mid X, x_{new})}{P(y \mid X, x_{new})} = \frac{P(y_{new}, y \mid X, x_{new})}{P(y \mid X)} = \int\!\!\int \underbrace{P(y_{new} \mid f_{new}, x_{new})}_{(1)} \, \underbrace{P(f \mid y, X)}_{(2)} \, \underbrace{P(f_{new} \mid f, X, x_{new})}_{(3)} \, df \, df_{new}.

It can be shown that

(1)\quad P(y_{new} \mid f_{new}, x_{new}) = N\!\left(y_{new} \mid h(x_{new})^T \beta + f_{new},\; \sigma^2_{new}\right)

(2)\quad P(f \mid y, X) = N\!\left(f \,\middle|\, \frac{1}{\sigma^2}\left(\frac{I_n}{\sigma^2} + K(X,X)^{-1}\right)^{-1}(y - H\beta),\;\left(\frac{I_n}{\sigma^2} + K(X,X)^{-1}\right)^{-1}\right)

(3)\quad P(f_{new} \mid f, X, x_{new}) = N\!\left(f_{new} \mid K(x_{new}^T, X)\, K(X,X)^{-1} f,\; \Delta\right), \quad \text{where } \Delta = k(x_{new}, x_{new}) - K(x_{new}^T, X)\, K(X,X)^{-1} K(X, x_{new}^T).
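Density (3), for instance, follows from the standard formula for conditioning one block of a jointly Gaussian vector on another: under the zero-mean GP prior, f and f_new are jointly Gaussian,

\begin{pmatrix} f \\ f_{new} \end{pmatrix} \sim N\!\left(0,\; \begin{pmatrix} K(X,X) & K(X, x_{new}^T) \\ K(x_{new}^T, X) & k(x_{new}, x_{new}) \end{pmatrix}\right),

so conditioning f_new on f gives a Gaussian with mean K(x_new^T, X) K(X,X)^{-1} f and covariance Δ. Density (2) is, likewise, the Gaussian posterior of f obtained by combining the prior N(f | 0, K(X,X)) with the likelihood P(y | f, X) given above.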

After the integration and required algebra, the density of the new response y_new at a new point x_new, given y and X, is found as

P(y_{new} \mid y, X, x_{new}) = N\!\left(y_{new} \mid h(x_{new})^T \beta + \mu,\; \sigma^2_{new} + \Sigma\right),

where

\mu = K(x_{new}^T, X)\, \underbrace{\left(K(X,X) + \sigma^2 I_n\right)^{-1} (y - H\beta)}_{\alpha}

and

\Sigma = k(x_{new}, x_{new}) - K(x_{new}^T, X)\left(K(X,X) + \sigma^2 I_n\right)^{-1} K(X, x_{new}^T).

The expected value of the prediction y_new at a new point x_new, given y, X, and the parameters β, θ, and σ², is

E(y_{new} \mid y, X, x_{new}, \beta, \theta, \sigma^2) = h(x_{new})^T \beta + K(x_{new}^T, X \mid \theta)\, \alpha = h(x_{new})^T \beta + \sum_{i=1}^{n} \alpha_i \, k(x_{new}, x_i \mid \theta),

where

\alpha = \left(K(X,X \mid \theta) + \sigma^2 I_n\right)^{-1} (y - H\beta).
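A numerical sketch of these prediction formulas, again assuming the NumPy setup from the earlier snippets, known parameters, and σ²_new = σ²; theta here is a tuple of kernel parameters for the illustrative squared exponential kernel:

def gpr_predict(Xnew, X, y, H, beta, kernel, theta, sigma2):
    # Predictive mean and variance of y_new at each row of Xnew for known parameters.
    n = y.shape[0]
    C = kernel(X, X, *theta) + sigma2 * np.eye(n)        # K(X,X|theta) + sigma^2 * I_n
    alpha = np.linalg.solve(C, y - H @ beta)             # alpha = C^{-1} (y - H*beta)

    Ks = kernel(Xnew, X, *theta)                         # K(x_new^T, X | theta), one row per new point
    Hnew = np.column_stack([np.ones(len(Xnew)), Xnew[:, 0]])   # h(x_new) = [1, x], as assumed above
    mean = Hnew @ beta + Ks @ alpha                      # E(y_new | y, X, x_new, beta, theta, sigma^2)

    kss = np.diag(kernel(Xnew, Xnew, *theta))            # k(x_new, x_new | theta)
    Sigma = kss - np.sum(Ks * np.linalg.solve(C, Ks.T).T, axis=1)   # per-point Sigma
    return mean, sigma2 + Sigma                          # predictive variance sigma^2_new + Sigma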

Computational Complexity of Exact Parameter Estimation and Prediction

Training a GPR model with the exact method (when FitMethod is 'Exact') requires the inversion of an n-by-n kernel matrix K(X,X). The memory requirement for this step scales as O(n²), because K(X,X) must be stored in memory. One evaluation of log P(y|X) scales as O(n³). Therefore, the overall computational complexity is O(k·n³), where k is the number of function evaluations needed for the maximization and n is the number of observations.

Making predictions on new data involves the computation of α̂. If prediction intervals are desired, this step could also involve the computation and storage of the Cholesky factor of (K(X,X) + σ²I_n) for later use. The computational complexity of this step using the direct computation of α̂ is O(n³), and the memory requirement is O(n²).
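One way to organize this work, sketched below with SciPy's Cholesky helpers (an assumption about implementation strategy, not a description of any particular software), is to factor K(X,X) + σ²I_n once at training time and reuse the factor for α̂ and for every later prediction-interval query:

from scipy.linalg import cho_factor, cho_solve

def fit_exact(K, y, H, beta, sigma2):
    # O(n^3) time and O(n^2) memory: factor C once and keep the factor for reuse.
    C = K + sigma2 * np.eye(K.shape[0])
    chol = cho_factor(C, lower=True)              # Cholesky factor of K(X,X) + sigma^2 * I_n
    alpha_hat = cho_solve(chol, y - H @ beta)     # alpha_hat via two triangular solves
    return chol, alpha_hat

def predictive_variance(chol, Ks, kss, sigma2):
    # Reuse the stored factor; no new n-by-n inversion is needed per query.
    v = cho_solve(chol, Ks.T)                     # (K(X,X) + sigma^2*I_n)^{-1} K(X, x_new^T)
    return sigma2 + kss - np.sum(Ks * v.T, axis=1)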

Hence, for large n, estimation of parameters or computing predictions can be very expensive. The approximation methods usually involve rearranging the computation so as to avoid the inversion of an n-by-n matrix. For the available approximation methods, please see the related links at the bottom of the page.

References

[1] Rasmussen, C. E. and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press. Cambridge, Massachusetts, 2006.

See Also


Related Topics