\documentclass[12pt]{article}
\usepackage{Sweave,amsmath,amsfonts,bm}
\usepackage[authoryear,round]{natbib}
\bibliographystyle{plainnat}
\DeclareMathOperator{\tr}{tr}
\DefineVerbatimEnvironment{Sinput}{Verbatim}
{formatcom={\vspace{-1ex}},fontshape=sl,
  fontfamily=courier,fontseries=b, fontsize=\footnotesize}
\DefineVerbatimEnvironment{Soutput}{Verbatim}
{formatcom={\vspace{-1ex}},fontfamily=courier,fontseries=b,%
  fontsize=\footnotesize}
%%\VignetteIndexEntry{PLS vs GLS for LMMs}
%%\VignetteDepends{lme4}
\title{Penalized least squares versus generalized least squares
  representations of linear mixed models}
\author{Douglas Bates\\Department of Statistics\\%
  University of Wisconsin -- Madison}
\SweaveOpts{engine=R,eps=FALSE,pdf=TRUE,strip.white=true,keep.source=TRUE}
\SweaveOpts{include=FALSE}
\setkeys{Gin}{width=\textwidth}
\newcommand{\code}[1]{\texttt{\small #1}}
\newcommand{\package}[1]{\textsf{\small #1}}
\newcommand{\trans}{\ensuremath{^\prime}}
\begin{document}
<<preliminaries,echo=FALSE,results=hide>>=
options(width=65,digits=5)
@
\maketitle
\begin{abstract}
33
The methods in the \code{lme4} package for \code{R} for fitting
34
linear mixed models are based on sparse matrix methods, especially
35
the Cholesky decomposition of sparse positive-semidefinite matrices,
36
in a penalized least squares representation of the conditional model
37
for the response given the random effects. The representation is
38
similar to that in Henderson's mixed-model equations. An
39
alternative representation of the calculations is as a generalized
40
least squares problem. We describe the two representations, show
41
the equivalence of the two representations and explain why we feel
42
that the penalized least squares approach is more versatile and more
43
computationally efficient.
\section{Definition of the model}
\label{sec:Definition}

We consider linear mixed models in which the random effects are
represented by a $q$-dimensional random vector, $\bm{\mathcal{B}}$, and
the response is represented by an $n$-dimensional random vector,
$\bm{\mathcal{Y}}$.  We observe a value, $\bm y$, of the response.  The
random effects are unobserved.
For our purposes, we will assume a ``spherical'' multivariate normal
conditional distribution of $\bm{\mathcal{Y}}$, given
$\bm{\mathcal{B}}$.  That is, we assume the variance-covariance matrix
of $\bm{\mathcal{Y}}|\bm{\mathcal{B}}$ is simply $\sigma^2\bm I_n$,
where $\bm I_n$ denotes the identity matrix of order $n$.  (The term
``spherical'' refers to the fact that contours of the conditional
density are concentric spheres.)
The conditional mean of $\bm{\mathcal{Y}}$, given $\bm{\mathcal{B}}=\bm b$,
$\mathrm{E}[\bm{\mathcal{Y}}|\bm{\mathcal{B}}=\bm b]$, is a linear
function of $\bm b$ and the $p$-dimensional fixed-effects parameter,
$\bm\beta$,
\begin{equation}
  \mathrm{E}[\bm{\mathcal{Y}}|\bm{\mathcal{B}}=\bm b]=
  \bm X\bm\beta+\bm Z\bm b ,
\end{equation}
where $\bm X$ and $\bm Z$ are known model matrices of sizes $n\times
p$ and $n\times q$, respectively.  Thus
\begin{equation}
  \label{eq:yconditional}
  \bm{\mathcal{Y}}|\bm{\mathcal{B}}\sim
  \mathcal{N}\left(\bm X\bm\beta+\bm Z\bm b,\sigma^2\bm I_n\right) .
\end{equation}
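As a small numerical illustration of (\ref{eq:yconditional}), the
following chunk simulates a response from the conditional model for
arbitrary, purely illustrative values of $\bm\beta$, $\bm b$ and
$\sigma$; it is a sketch only, not part of the model-fitting
calculations.
<<condsim,eval=FALSE>>=
set.seed(1)
n <- 30; q <- 3                     # 3 groups of 10 observations
X <- cbind(1, rnorm(n))             # fixed-effects model matrix (p = 2)
f <- gl(q, n/q)                     # a single grouping factor
Z <- model.matrix(~ 0 + f)          # indicator columns => sparse pattern
beta <- c(1, 3)                     # assumed fixed effects
b <- rnorm(q, sd = 2)               # a simulated value of B
y <- as.vector(X %*% beta + Z %*% b + rnorm(n))  # response given B = b
@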
The marginal distribution of the random effects
\begin{equation}
  \label{eq:remargin}
  \bm{\mathcal{B}}\sim\mathcal{N}\left(\bm 0,\sigma^2\bm\Sigma(\bm\theta)\right)
\end{equation}
is also multivariate normal, with mean $\bm 0$ and variance-covariance
matrix $\sigma^2\bm\Sigma(\bm\theta)$.  The scalar, $\sigma^2$, in
(\ref{eq:remargin}) is the same as the $\sigma^2$ in
(\ref{eq:yconditional}).  As described in the next section, the
relative variance-covariance matrix, $\bm\Sigma(\bm\theta)$, is a
$q\times q$ positive semidefinite matrix depending on a parameter
vector, $\bm\theta$.  Typically the dimension of $\bm\theta$ is much,
much smaller than $q$.
\subsection{Variance-covariance of the random effects}

The relative variance-covariance matrix, $\bm\Sigma(\bm\theta)$, must
be symmetric and positive semidefinite (i.e.\ $\bm x\trans\bm\Sigma\bm
x\ge0,\forall\bm x\in\mathbb{R}^q$).  Because the estimate of a
variance component can be zero, it is important to allow for a
semidefinite $\bm\Sigma$.  We do not assume that $\bm\Sigma$ is
positive definite (i.e.\ $\bm x\trans\bm\Sigma\bm x>0,\forall\bm
x\in\mathbb{R}^q, \bm x\ne\bm 0$) and, hence, we cannot assume that
$\bm\Sigma^{-1}$ exists.
A positive semidefinite matrix such as $\bm\Sigma$ has a Cholesky
decomposition of the so-called ``LDL$\trans$'' form.  We use a
slight modification of this form,
\begin{equation}
  \label{eq:TSST}
  \bm\Sigma(\bm\theta)=\bm T(\bm\theta)\bm S(\bm\theta)\bm
  S(\bm\theta)\bm T(\bm\theta)\trans ,
\end{equation}
where $\bm T(\bm\theta)$ is a unit lower-triangular $q\times q$ matrix
and $\bm S(\bm\theta)$ is a diagonal $q\times q$ matrix with
nonnegative diagonal elements that act as scale factors.  (They are
the relative standard deviations of certain linear combinations of the
random effects.)  Thus, $\bm T$ is a triangular matrix like the ``L''
of the LDL$\trans$ form, and $\bm S\bm S$ plays the role of the
diagonal ``D''.  Both $\bm T$ and $\bm S$ are highly patterned.
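To make the factorization (\ref{eq:TSST}) concrete, the following
chunk computes $\bm T$ and $\bm S$ for an arbitrary $2\times 2$
relative variance-covariance block, as could arise from correlated
random intercepts and slopes at one level of a grouping factor.  The
numerical values are invented for illustration only.
<<TSfactor,eval=FALSE>>=
Sigma <- matrix(c(4, 2,
                  2, 2), 2, 2)            # illustrative relative var-cov
s1  <- sqrt(Sigma[1, 1])                  # first scale factor
t21 <- Sigma[2, 1] / Sigma[1, 1]          # unit lower-triangular entry
s2  <- sqrt(Sigma[2, 2] - t21^2 * s1^2)   # second scale factor
T <- matrix(c(1, t21, 0, 1), 2, 2)        # unit lower-triangular T
S <- diag(c(s1, s2))                      # diagonal S, nonnegative
all.equal(T %*% S %*% S %*% t(T), Sigma)  # recovers Sigma
@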
\subsection{Orthogonal random effects}
\label{sec:orthogonal}

Let us define a $q$-dimensional random vector, $\bm{\mathcal{U}}$, of
orthogonal random effects with marginal distribution
\begin{equation}
  \bm{\mathcal{U}}\sim\mathcal{N}\left(\bm 0,\sigma^2\bm I_q\right)
\end{equation}
and, for a given value of $\bm\theta$, express $\bm{\mathcal{B}}$ as a
linear transformation of $\bm{\mathcal{U}}$,
\begin{equation}
  \label{eq:UtoB}
  \bm{\mathcal{B}}=\bm T(\bm\theta)\bm S(\bm\theta)\bm{\mathcal{U}} .
\end{equation}
Note that the transformation (\ref{eq:UtoB}) gives the desired
distribution of $\bm{\mathcal{B}}$ in that
$\mathrm{E}[\bm{\mathcal{B}}]=\bm T\bm
S\mathrm{E}[\bm{\mathcal{U}}]=\bm 0$ and
\begin{displaymath}
  \mathrm{Var}(\bm{\mathcal{B}})=\mathrm{E}[\bm{\mathcal{B}}\bm{\mathcal{B}}\trans]
  =\bm T\bm S\mathrm{E}[\bm{\mathcal{U}}\bm{\mathcal{U}}\trans]\bm
  S\bm T\trans=\sigma^2\bm T\bm S\bm S\bm T\trans=\sigma^2\bm\Sigma .
\end{displaymath}

The conditional distribution, $\bm{\mathcal{Y}}|\bm{\mathcal{U}}$, can
be derived from $\bm{\mathcal{Y}}|\bm{\mathcal{B}}$ as
\begin{equation}
  \bm{\mathcal{Y}}|\bm{\mathcal{U}}\sim\mathcal{N}\left(\bm X\bm\beta+\bm
    Z\bm T\bm S\bm u, \sigma^2\bm I_n\right) .
\end{equation}
We will write the transpose of $\bm Z\bm T\bm S$ as $\bm A$.  Because
the matrices $\bm T$ and $\bm S$ depend on the parameter $\bm\theta$,
$\bm A$ is also a function of $\bm\theta$,
\begin{equation}
  \bm A\trans(\bm\theta)=\bm Z\bm T(\bm\theta)\bm S(\bm\theta) .
\end{equation}
In applications, the matrix $\bm Z$ is derived from indicator columns
of the levels of one or more factors in the data and is a
\emph{sparse} matrix, in the sense that most of its elements are zero.
The matrix $\bm A$ is also sparse.  In fact, the structures of $\bm T$
and $\bm S$ are such that the pattern of nonzeros in $\bm A$ is the
same as that in $\bm Z\trans$.
\subsection{Sparse matrix methods}
\label{sec:sparseMatrix}

The reason for defining $\bm A$ as the transpose of a model matrix is
that $\bm A$ is stored and manipulated as a sparse matrix.  In the
compressed column-oriented storage form that we use for sparse
matrices, there are advantages to storing $\bm A$ as a matrix of $n$
columns and $q$ rows.  In particular, the CHOLMOD sparse matrix
library allows us to evaluate the sparse Cholesky factor, $\bm
L(\bm\theta)$, a sparse lower triangular matrix that satisfies
\begin{equation}
  \label{eq:SparseChol}
  \bm L(\bm\theta)\bm L(\bm\theta)\trans=
  \bm P\left(\bm A(\bm\theta)\bm A(\bm\theta)\trans+\bm I_q\right)\bm P\trans ,
\end{equation}
directly from $\bm A(\bm\theta)$.
In (\ref{eq:SparseChol}) the $q\times q$ matrix $\bm P$ is a
``fill-reducing'' permutation matrix determined from the pattern of
nonzeros in $\bm Z$.  $\bm P$ does not affect the statistical theory
(if $\bm{\mathcal{U}}\sim\mathcal{N}(\bm 0,\sigma^2\bm I)$ then $\bm
P\trans\bm{\mathcal{U}}$ also has a $\mathcal{N}(\bm 0,\sigma^2\bm I)$
distribution because $\bm P\bm P\trans=\bm P\trans\bm P=\bm I$) but,
because it affects the number of nonzeros in $\bm L$, it can have a
tremendous impact on the amount of storage required for $\bm L$ and the
time required to evaluate $\bm L$ from $\bm A$.  Indeed, it is
precisely because $\bm L(\bm\theta)$ can be evaluated quickly, even
for complex models applied to large data sets, that the \code{lmer}
function is effective in fitting such models.
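This factorization can be tried directly from \code{R}.  The chunk
below is a minimal sketch using the \package{Matrix} package (which
interfaces CHOLMOD); the matrix \code{A} is randomly generated and
purely illustrative.
<<cholmod,eval=FALSE>>=
library(Matrix)
set.seed(1)
A <- rsparsematrix(5, 50, density = 0.2)   # q = 5, n = 50, mostly zeros
## L L' = P (A A' + I) P' : Imult = 1 adds the identity and
## perm = TRUE requests a fill-reducing permutation P
L <- Cholesky(tcrossprod(A), perm = TRUE, LDL = FALSE, Imult = 1)
@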
\section{The penalized least squares approach to linear mixed models}
\label{sec:Penalized}

Given a value of $\bm\theta$ we form $\bm A(\bm\theta)$ from which we
evaluate $\bm L(\bm\theta)$.  We can then solve for the $q\times p$
matrix, $\bm R_{\bm{ZX}}$, in the system of equations
\begin{equation}
  \bm L(\bm\theta)\bm R_{\bm{ZX}}=\bm P\bm A(\bm\theta)\bm X
\end{equation}
and for the $p\times p$ upper triangular matrix, $\bm R_{\bm X}$, satisfying
\begin{equation}
  \bm R_{\bm X}\trans\bm R_{\bm X}=
  \bm X\trans\bm X-\bm R_{\bm{ZX}}\trans\bm R_{\bm{ZX}} .
\end{equation}
The conditional mode, $\tilde{\bm u}(\bm\theta)$, of the
orthogonal random effects and the conditional mle,
$\widehat{\bm\beta}(\bm\theta)$, of the fixed-effects parameters
can be determined simultaneously as the solutions to a penalized least
squares problem,
\begin{equation}
  \label{eq:PLS}
  \begin{bmatrix}
    \tilde{\bm u}(\bm\theta)\\
    \widehat{\bm\beta}(\bm\theta)
  \end{bmatrix}=
  \arg\min_{\bm u,\bm\beta}\left\|
    \begin{bmatrix}\bm y\\\bm 0\end{bmatrix} -
    \begin{bmatrix}
      \bm A\trans\bm P\trans & \bm X\\
      \bm I_q & \bm 0
    \end{bmatrix}
    \begin{bmatrix}\bm u\\\bm\beta\end{bmatrix}\right\|^2 ,
\end{equation}
for which the solution satisfies
\begin{equation}
  \label{eq:PLSsol}
  \begin{bmatrix}
    \bm P\left(\bm A\bm A\trans+\bm I\right)\bm P\trans &
    \bm P\bm A\bm X\\
    \bm X\trans\bm A\trans\bm P\trans & \bm X\trans\bm X
  \end{bmatrix}
  \begin{bmatrix}
    \tilde{\bm u}(\bm\theta)\\
    \widehat{\bm\beta}(\bm\theta)
  \end{bmatrix}=
  \begin{bmatrix}\bm P\bm A\bm y\\\bm X\trans\bm y\end{bmatrix} .
\end{equation}
The Cholesky factor of the system matrix for the PLS problem can be
expressed using $\bm L$, $\bm R_{\bm{ZX}}$ and $\bm R_{\bm X}$, because
\begin{equation}
  \begin{bmatrix}
    \bm P\left(\bm A\bm A\trans+\bm I\right)\bm P\trans & \bm P\bm A\bm X\\
    \bm X\trans\bm A\trans\bm P\trans & \bm X\trans\bm X
  \end{bmatrix}=
  \begin{bmatrix}
    \bm L & \bm 0\\
    \bm R_{\bm{ZX}}\trans & \bm R_{\bm X}\trans
  \end{bmatrix}
  \begin{bmatrix}
    \bm L\trans & \bm R_{\bm{ZX}}\\
    \bm 0 & \bm R_{\bm X}
  \end{bmatrix} .
\end{equation}
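Although \code{lmer} performs these calculations on sparse and dense
matrices in compiled code, the block triangular solves implied by this
factorization are easy to express in \code{R}.  The following sketch
uses ordinary dense matrices and takes \code{PAy} $=\bm P\bm A\bm y$
and \code{Xty} $=\bm X\trans\bm y$ as given; it is illustrative only.
<<plsSolve,eval=FALSE>>=
cu <- forwardsolve(L, PAy)                  # L cu = P A y
cb <- forwardsolve(t(RX), Xty - crossprod(RZX, cu)) # RX' cb = X'y - RZX' cu
beta <- backsolve(RX, cb)                   # RX beta-hat = cb
u <- backsolve(t(L), cu - RZX %*% beta)     # L' u-tilde = cu - RZX beta-hat
@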
In the \code{lme4} package the \code{"mer"} class is the
representation of a mixed-effects model.  Several slots in this class
are matrices corresponding directly to the matrices in the preceding
equations.  The \code{A} slot contains the sparse matrix $\bm
A(\bm\theta)$ and the \code{L} slot contains the sparse Cholesky
factor, $\bm L(\bm\theta)$.  The \code{RZX} and \code{RX} slots contain
$\bm R_{\bm{ZX}}(\bm\theta)$ and $\bm R_{\bm X}(\bm\theta)$,
respectively, stored as dense matrices.
It is not necessary to solve for $\tilde{\bm u}(\bm\theta)$ and
$\widehat{\bm\beta}(\bm\theta)$ to evaluate the \emph{profiled}
log-likelihood, which is the log-likelihood evaluated at $\bm\theta$ and
the conditional estimates of the other parameters,
$\widehat{\bm\beta}(\bm\theta)$ and $\widehat{\sigma^2}(\bm\theta)$.
All that is needed for evaluation of the profiled log-likelihood is
the (penalized) residual sum of squares, $r^2$, from the penalized
least squares problem (\ref{eq:PLS}) and the determinant $|\bm A\bm
A\trans+\bm I|=|\bm L|^2$.  Because $\bm L$ is triangular, its
determinant is easily evaluated as the product
of its diagonal elements.  Furthermore, $|\bm L|^2 > 0$ because it is
equal to $|\bm A\bm A\trans + \bm I|$, which is the determinant of a
positive definite matrix.  Thus $\log(|\bm L|^2)$ is both well-defined
and easily calculated from $\bm L$.
The profiled deviance (negative twice the profiled log-likelihood), as
a function of $\bm\theta$ only ($\bm\beta$ and $\sigma^2$ at their
conditional estimates), is
\begin{equation}
  \label{eq:profiledDev}
  d(\bm\theta|\bm y)=\log(|\bm L|^2)+
  n\left(1+\log\left(\frac{2\pi r^2}{n}\right)\right) .
\end{equation}
The maximum likelihood estimates, $\widehat{\bm\theta}$, satisfy
\begin{equation}
  \widehat{\bm\theta}=\arg\min_{\bm\theta}d(\bm\theta|\bm y) .
\end{equation}
Once the value of $\widehat{\bm\theta}$ has been determined, the mle
of $\bm\beta$ is evaluated from (\ref{eq:PLSsol}) and the mle of
$\sigma^2$ as $\widehat{\sigma^2}(\widehat{\bm\theta})=r^2/n$.
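Treating $\bm L$ as an ordinary dense lower-triangular matrix for
simplicity, the profiled deviance (\ref{eq:profiledDev}) could be
evaluated in \code{R} as in this sketch.
<<profDev,eval=FALSE>>=
profDev <- function(L, r2, n) {
    ldL2 <- 2 * sum(log(diag(L)))   # log(|L|^2), product of diagonals
    ldL2 + n * (1 + log(2 * pi * r2 / n))
}
@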
Note that nothing has been said about the form of the sparse model
matrix, $\bm Z$, other than the fact that it is sparse.  In contrast
to other methods for linear mixed models, these results apply to
models where $\bm Z$ is derived from crossed or partially crossed
grouping factors, in addition to models with multiple, nested grouping
factors.
The system (\ref{eq:PLSsol}) is similar to Henderson's ``mixed-model
319
equations'' (reference?). One important difference between
320
(\ref{eq:PLSsol}) and Henderson's formulation is that Henderson
321
represented his system of equations in terms of $\bm\Sigma^{-1}$ and,
322
in important practical examples, $\bm\Sigma^{-1}$ does not exist at
323
the parameter estimates. Also, Henderson assumed that equations like
324
(\ref{eq:PLSsol}) would need to be solved explicitly and, as we have
325
seen, only the decomposition of the system matrix is needed for
326
evaluation of the profiled log-likelihood. The same is true of the
327
profiled the logarithm of the REML criterion, which we define later.
\section{The generalized least squares approach to linear mixed models}

Another common approach to linear mixed models is to derive the
marginal variance-covariance matrix of $\bm{\mathcal{Y}}$ as a
function of $\bm\theta$ and use that to determine the conditional
estimates, $\widehat{\bm\beta}(\bm\theta)$, as the solution of a
generalized least squares (GLS) problem.  In the notation of
\S\ref{sec:Definition} the marginal mean of $\bm{\mathcal{Y}}$ is
$\mathrm{E}[\bm{\mathcal{Y}}]=\bm X\bm\beta$ and the marginal
variance-covariance matrix is
\begin{equation}
  \label{eq:marginalvarcovY}
  \mathrm{Var}(\bm{\mathcal{Y}})=\sigma^2\left(\bm I_n+\bm Z\bm T\bm
    S\bm S\bm T\trans\bm Z\trans\right)=\sigma^2\left(\bm I_n+\bm
    A\trans\bm A\right) =\sigma^2\bm V(\bm\theta) ,
\end{equation}
where $\bm V(\bm\theta)=\bm I_n+\bm A\trans\bm A$.
The conditional estimates of $\bm\beta$ are often written as
\begin{equation}
  \widehat{\bm\beta}(\bm\theta)=\left(\bm X\trans\bm V^{-1}\bm
    X\right)^{-1}\bm X\trans\bm V^{-1}\bm y
\end{equation}
but, of course, this formula is not suitable for computation.  The
matrix $\bm V(\bm\theta)$ is a symmetric $n\times n$ positive definite
matrix and hence has a Cholesky factor.  However, this factor is
$n\times n$, not $q\times q$, and $n$ is always larger than $q$ ---
sometimes orders of magnitude larger.  Blithely writing a formula in
terms of $\bm V^{-1}$ when $\bm V$ is $n\times n$, and $n$ can be in
the millions, does not a computational formula make.
\subsection{Relating the GLS approach to the Cholesky factor}

We can use the fact that
\begin{equation}
  \label{eq:Vinv}
  \bm V^{-1}(\bm\theta)=\left(\bm I_n+\bm A\trans\bm A\right)^{-1}=
  \bm I_n-\bm A\trans\left(\bm I_q+\bm A\bm A\trans\right)^{-1}\bm A
\end{equation}
to relate the GLS problem to the PLS problem.  One way to establish
(\ref{eq:Vinv}) is simply to show that the product
\begin{displaymath}
  \begin{aligned}
    \left(\bm I+\bm A\trans\bm A\right)
    &\left(\bm I-\bm A\trans\left(\bm I+\bm A\bm
        A\trans\right)^{-1}\bm A\right)\\
    &=\bm I+\bm A\trans\bm A-\bm A\trans\left(\bm I+\bm A\bm A\trans\right)
    \left(\bm I+\bm A\bm A\trans\right)^{-1}\bm A\\
    &=\bm I+\bm A\trans\bm A-\bm A\trans\bm A\\
    &=\bm I .
  \end{aligned}
\end{displaymath}
Incorporating the permutation matrix $\bm P$ we have
\begin{equation}
  \label{eq:PLA}
  \begin{aligned}
    \bm V^{-1}(\bm\theta)&=\bm I_n-\bm A\trans\bm P\trans\bm P\left(\bm
      I_q+\bm A\bm A\trans\right)^{-1}\bm P\trans\bm P\bm A\\
    &=\bm I_n-\bm A\trans\bm P\trans(\bm L\bm L\trans)^{-1}\bm P\bm A\\
    &=\bm I_n-\left(\bm L^{-1}\bm P\bm A\right)\trans\bm L^{-1}\bm P\bm A .
  \end{aligned}
\end{equation}
Even in this form we would not want to routinely evaluate $\bm
V^{-1}$.  However, (\ref{eq:PLA}) does allow us to simplify many of
the expressions that arise in the GLS formulation.
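The identity (\ref{eq:Vinv}) is easily checked numerically on a small,
randomly generated example (illustrative only).
<<VinvCheck,eval=FALSE>>=
set.seed(1)
A <- matrix(rnorm(3 * 10), 3, 10)           # q = 3, n = 10
V <- diag(10) + crossprod(A)                # I_n + A'A
Vinv <- diag(10) -
    crossprod(A, solve(diag(3) + tcrossprod(A), A))
all.equal(solve(V), Vinv)                   # TRUE
@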
For example, the variance-covariance of the estimator $\widehat{\bm
  \beta}$, conditional on $\bm\theta$ and $\sigma$, can be expressed as
\begin{equation}
  \label{eq:varcovbeta}
  \begin{aligned}
    \sigma^2\left(\bm X\trans\bm V^{-1}(\bm\theta)\bm X\right)^{-1}
    &=\sigma^2\left(\bm X\trans\bm X-\left(\bm L^{-1}\bm P\bm
        A\bm X\right)\trans\left(\bm L^{-1}\bm P\bm A\bm
        X\right)\right)^{-1}\\
    &=\sigma^2\left(\bm X\trans\bm X-\bm R_{\bm{ZX}}\trans\bm
      R_{\bm{ZX}}\right)^{-1}\\
    &=\sigma^2\left(\bm R_{\bm X}\trans\bm R_{\bm X}\right)^{-1} .
  \end{aligned}
\end{equation}
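Thus, once $\bm R_{\bm X}$ is available, the variance-covariance of
$\widehat{\bm\beta}$ requires only a $p\times p$ computation, as in
this one-line sketch, where \code{sigma2} denotes an estimate of
$\sigma^2$.
<<vcovBeta,eval=FALSE>>=
vcov_beta <- sigma2 * chol2inv(RX)   # sigma^2 (RX' RX)^{-1}
@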
\section{Trace of the ``hat'' matrix}

Another calculation that is of interest to some is the trace of the
``hat'' matrix, which can be written as
\begin{displaymath}
  \begin{aligned}
    &\tr\left(\begin{bmatrix}\bm A\trans&\bm X\end{bmatrix}
      \left(\begin{bmatrix}\bm A\trans&\bm X\\\bm I&\bm0\end{bmatrix}\trans
        \begin{bmatrix}\bm A\trans&\bm X\\\bm I&\bm0\end{bmatrix}\right)^{-1}
      \begin{bmatrix}\bm A\\\bm X\trans\end{bmatrix}\right)\\
    &= \tr\left(\begin{bmatrix}\bm A\trans&\bm X\end{bmatrix}
      \left(\begin{bmatrix}\bm L&\bm0\\
          \bm R_{\bm{ZX}}\trans&\bm R_{\bm X}\trans\end{bmatrix}
        \begin{bmatrix}\bm L\trans&\bm R_{\bm{ZX}}\\
          \bm0&\bm R_{\bm X}\end{bmatrix}\right)^{-1}
      \begin{bmatrix}\bm A\\\bm X\trans\end{bmatrix}\right) .
  \end{aligned}
\end{displaymath}
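Taking the permutation $\bm P$ to be the identity for simplicity,
this trace equals a squared Frobenius norm, because
$\tr\left(\bm B\trans(\bm F\bm F\trans)^{-1}\bm B\right)=
\|\bm F^{-1}\bm B\|^2_F$ for the blocked lower triangular factor
$\bm F$ and $\bm B=\begin{bmatrix}\bm A\\\bm X\trans\end{bmatrix}$.
A dense, illustrative sketch, assuming \code{L}, \code{RZX},
\code{RX}, \code{A} and \code{X} are available as ordinary matrices:
<<hatTrace,eval=FALSE>>=
Fm <- rbind(cbind(L, matrix(0, nrow(L), ncol(RX))),
            cbind(t(RZX), t(RX)))      # blocked lower-triangular factor
B <- rbind(A, t(X))                    # (q + p) x n right-hand side
sum(forwardsolve(Fm, B)^2)             # trace of the hat matrix
@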