The modes of posterior distributions for mixed linear models

Mixed linear models, also known as two-level hierarchical models, are commonly used in many applications. In this paper, we consider the marginal distribution that arises within a Bayesian framework, when the components of variance are integrated out of the joint posterior distribution. We provide analytical tools for describing the surface of the distribution of interest. The main theorem and its proof show how to determine the number of local maxima, and their approximate location and relative size. This information can be used by practitioners to assess the performance of Laplace-type integral approximations, to compute possibly disconnected highest posterior density regions, and to custom-design numerical algorithms.


INTRODUCTION
Mixed linear models (sometimes known as two-level hierarchical models or as components-of-variance models) are used in a wide variety of fields, including genetics, econometrics, and biomedical and environmental applications, among many others (e.g., Fuller and Harter, 1987; Harvey, 1989; Smith et al., 1990; Harville and Carriquiry, 1992; Wang et al., 1994; Hobert and Casella, 1996; Van Dyk, 1999). The two-level linear hierarchical model can be formulated as follows:

y | β, s, τ_e ~ N(Xβ + Zs, I τ_e^{-1}),
s | A, τ_s ~ N(0, A τ_s^{-1}),    (1.1)

where y ∈ R^n is the observed data vector, β ∈ R^p is a vector of unknown fixed (or covariate) effects, s ∈ R^q is a vector of random effects, and X ∈ R^{n×p} and Z ∈ R^{n×q} are the model matrices corresponding to the fixed and random effects, respectively. Here, the matrix A is assumed to be known; in animal genetics applications, for example, the elements of A reflect the familial relations of the animals in the dataset. Practitioners are often interested in estimating β, s, or a combination of both. The variance components (τ_s^{-1}, τ_e^{-1}) are considered nuisance parameters that must, however, be accounted for in estimation.
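As a concrete illustration of (1.1), the following Python sketch simulates data from the two-level model. The dimensions, design matrices, and precision values are illustrative choices, not taken from the paper (NumPy is assumed available).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical, not from the paper):
n, p, q = 20, 2, 3
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, q))
A = np.eye(q)            # known relationship matrix; identity for simplicity
beta = np.array([1.0, -0.5])
tau_s, tau_e = 2.0, 4.0  # precision components

# s | A, tau_s ~ N(0, A tau_s^{-1})
s = rng.multivariate_normal(np.zeros(q), A / tau_s)
# y | beta, s, tau_e ~ N(X beta + Z s, I tau_e^{-1})
y = rng.multivariate_normal(X @ beta + Z @ s, np.eye(n) / tau_e)
print(y.shape)  # (20,)
```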
The recent introduction of Markov chain Monte Carlo (MCMC) methods (e.g., Gelfand and Smith, 1990; Besag and Green, 1993) has facilitated estimation in mixed linear models from a Bayesian perspective. In the Bayesian framework, prior distributions are chosen for the vector β and the scalar-valued variance components, and the marginal posterior distributions of all parameters in the model are sought. These marginal posterior distributions can be approximated simultaneously, by generating Markov chains with stationary distributions equal to the marginal distributions of interest. Except in trivial cases, analytical derivation of all marginal distributions of interest is not possible.
In some applications, however, the dimensions p and q of the vectors β and s can be large enough to make the use of MCMC methods impractical. If practitioners are interested only in the first- and second-order moments of the marginal posterior distributions of the model parameters, then Laplace-type approximations can be used very effectively. Several approximations to the posterior mean of a parameter vector have been proposed. Lindley (1980) was the first to suggest the use of asymptotic expansions for ratios of integrals to approximate posterior moments. He derived Taylor expansions around the maximum likelihood (ML) estimate of the parameter of interest, up to terms of order n^{-1}. Tierney et al. (1989) and Kass and Steffey (1989) proposed the application of Laplace's method to expand the ratios of integrals, with modifications that enable the evaluation of nonpositive functions. Morris (1988) proposed a more general version of Laplace-type approximations, where the kernel is a distribution chosen from the Pearson family, which includes the normal family of distributions as a special case. While these authors suggested that expansions be carried out around the MLE of the parameter of interest, they mentioned that more satisfactory results may be obtained by expanding, instead, around the posterior mode. Laplace's method (and its variations) works only in the case of unimodal surfaces, and fails when the surface of interest has more than one mode. Robert and Casella (1999, Chapter 3) stress once again the importance of unimodality of the underlying distribution.
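To make the mechanics of Laplace's method concrete, here is a minimal one-dimensional sketch applied to the classical Gamma-integral example (Stirling's approximation). This generic illustration is not taken from the paper; it simply shows the expansion around the integrand's mode that the method relies on, and why a unique mode matters.

```python
import math

def laplace_integral(log_f, x_hat, h):
    """Laplace approximation to the integral of exp(log_f(x)) dx
    around the mode x_hat; h is the negated second derivative of
    log_f at x_hat (must be positive, i.e., a genuine maximum)."""
    return math.exp(log_f(x_hat)) * math.sqrt(2 * math.pi / h)

# Example: Gamma(n+1) = integral of x^n e^{-x}, integrand mode at x = n.
n = 30
log_f = lambda x: n * math.log(x) - x
x_hat = float(n)
h = n / x_hat ** 2          # -d^2/dx^2 [n log x - x] = n / x^2
approx = laplace_integral(log_f, x_hat, h)
exact = math.factorial(n)   # Gamma(n+1) = n!
print(approx / exact)       # close to 1 (Stirling's approximation)
```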
In this paper, we address the problem of analytically finding the mode(s) of the posterior distribution p(β, s|y) that arises from model (1.1) after integration of the joint posterior distribution with respect to the variance components (τ_s, τ_e). We consider a rather general class of prior distributions for the variance components, but concentrate on a family of distributions that guarantees integrability of the joint posterior distribution (see, e.g., Hobert and Casella, 1996). Specifically, we investigate the surface of the posterior distribution of interest, and attempt to answer the following questions:
• How many (local) maxima does the distribution have?
• If multimodal, what is the relative size of each mode?
• Where are the modes located?
Answering these questions is interesting from an academic viewpoint, given the widespread use of two-level hierarchical models. A more pragmatic motivation for this work was presented above, as we argued that the performance of Laplace approximations depends to a large extent on whether p(β, s|y) is unimodal or multimodal. But even beyond the behavior of integral approximations, in some applications posterior modes (or maximum a posteriori (MAP) estimates) are used as point estimates of location parameters, because maxima are typically easier to compute than expectations. In these cases, the existence of more than one mode requires careful consideration, in particular when no single mode is associated with a significantly larger mass than the others. Similarly, identification of highest posterior density regions for (β, s) requires knowledge of the number of modes of p(β, s|y), since in the multimodal case the region may well be disconnected (Berger, 1985).
This paper is organized as follows. Additional notation, as well as prior and posterior distributions of model parameters, are presented in Section 2. In the same section, we discuss the relation between our work and that of Bauwens, Drèze, Lubrano, Richard, and Tompa (Drèze, 1977; Richard and Tompa, 1980; Drèze and Richard, 1983; Lubrano et al., 1984; Bauwens and Richard, 1985), who have proposed numerical algorithms to draw values from a distribution very similar to the p(β, s|y) that arises in a specific econometric application. Section 3 presents the main theorem (its proof is given in the Appendix) and several corollaries and remarks to the theorem. Some explicit criteria for determining uni- or multimodality of the posterior distribution p(β, s|y) are also given in this section. Section 4 contains an illustrative example. Finally, some discussion and conclusions are presented in Section 5.

The Mixed Linear Model
Consider the hierarchical model (1.1), and write it as a linear mixed model (e.g., Goldberger, 1962; Henderson, 1963):

y = Xβ + Zs + e,  e ~ N(0, I τ_e^{-1}),  s ~ N(0, I τ_s^{-1}),    (2.1)

with τ_e, τ_s ∈ R, τ_e > 0, τ_s > 0. Further, let Cov(e, s) = 0, rank(X) = p < n, and rank(Z) ≤ q < n. The assumption that the random effects have covariance proportional to the identity seems unnecessarily restrictive; more general models, however, can be brought into this form by a linear transformation. Let the random effects satisfy s* ~ N(0, τ_s^{-1} A), where A ∈ R^{q×q} is a known, positive semi-definite matrix, and denote by A^{1/2} its (unique) positive semi-definite square root. Writing s* = A^{1/2} s, the vector s has a distribution that is N(0, τ_s^{-1} I) for the nonzero components, and the results below apply with Z replaced by ZA^{1/2}. Furthermore, if X̃ is a model matrix with rank(X̃) < p, then the model can be reparameterized to obtain a matrix X with full (column) rank and a new parameter vector β̃. The new vector β̃ contains linear combinations of the components of β, and only these linear combinations are estimable (e.g., Searle, 1981).
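The transformation above requires the positive semi-definite square root A^{1/2}. A standard way to compute it, sketched here in Python (NumPy assumed), is via the eigendecomposition of A; the 3×3 matrix below is a hypothetical example, not one of the paper's matrices.

```python
import numpy as np

def psd_sqrt(A):
    """Unique positive semi-definite square root of a PSD matrix A,
    via the eigendecomposition A = V diag(w) V'."""
    w, V = np.linalg.eigh(A)
    w = np.clip(w, 0.0, None)  # guard against tiny negative eigenvalues
    return V @ np.diag(np.sqrt(w)) @ V.T

# Hypothetical relationship-style matrix (illustrative only).
A = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.25],
              [0.0, 0.25, 1.0]])
R = psd_sqrt(A)
print(np.allclose(R @ R, A))  # True
```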
Using the notation δ = τ_s/τ_e, we obtain y ~ N(Xβ, τ_e^{-1}(I + δ^{-1} ZZ')), where ' denotes the transpose of a matrix. The parameter space is then {(β, s, δ, τ_e) : β ∈ R^p, s ∈ R^q, δ > 0, τ_e > 0}. Let ϕ(β, s) = Xβ + Zs. The likelihood function of y, conditional on β, s, and τ_e, is then given by

p(y | β, s, τ_e) ∝ τ_e^{n/2} exp{-(τ_e/2)(y − ϕ)'(y − ϕ)}.

Prior Distributions
Within the Bayesian framework, prior distributions must be chosen for the unknown parameters in model (2.1): in this case, prior distributions for β and for the variance components of s and e (or, equivalently, for δ and τ_e). Assume that, a priori, β has a multivariate normal distribution,

β ~ MVN(α, cI),    (2.4)

with α ∈ R^p and c > 0 known, and denote its density by π_1(β). For the vector (δ, τ_e), consider a family of prior distributions π_2(δ, τ_e) built from functions G_1(δ), G_2(δ), and G_3(δ) of δ, which satisfy suitable regularity conditions, including G_3(δ) ≥ 0. This family includes commonly used prior distributions for scale parameters, such as the uniform noninformative prior (Box and Tiao, 1973, Section 1.2.1), Jeffreys' (1961) priors, and the conjugate (Gamma) family (Raiffa and Schlaifer, 1971).

Posterior Distribution
The joint posterior density of (β, s, δ, τ_e) given y can now be written in closed form up to a normalizing constant k_1, with n_0(δ) = ½(n + q + 2G_2(δ)) + 1. Consider the case where no prior information on β is available; that is, consider the limiting case where c → ∞, which (with the appropriate scaling factor c^{p/2}) yields a limit lim_{c→∞} c^{p/2} p(β, s, δ, τ_e | y) that no longer depends on the prior for β, where ∝ means "proportional to." Upon integrating this limit (2.8) analytically with respect to τ_e, the marginal posterior density of β, s, and δ, given y, can be written in closed form (up to a normalizing constant) if the functions G_1(δ), G_2(δ), and G_3(δ) are specified more precisely. Inferences about the fixed and the random effects (β, s) should be based on p(β, s|y). Choose G_1(δ), G_2(δ), and G_3(δ) to correspond to the case where a Gamma(a_s, b_s) prior is assumed for τ_s, a Gamma(a_e, b_e) prior is assumed for τ_e, and τ_s and τ_e are independent a priori. We have, therefore, restricted π_2(δ, τ_e) to the prior distributions obtained by considering a conjugate prior family for the independent precision components τ_s and τ_e. Substituting these choices into (2.9), and defining q_1 = (y − ϕ)'(y − ϕ) + 2b_e and q_2 = s's + 2b_s, the marginal posterior can be rewritten as a function of q_1, q_2, and δ. By using an appropriate change of variable (Stroud, 1987), this expression can be written in the form of a Beta function, and integration with respect to δ yields

p(β, s | y) ∝ [(y − ϕ)'(y − ϕ) + 2b_e]^{-m} [s's + 2b_s]^{-l},    (2.15)

where ϕ = Xβ + Zs, m = ½(n + 2a_e), and l = ½(q + a_s). In the remainder of this paper, we study the behavior of (2.15) in R^p × R^q. We are interested in determining whether (2.15) is uni- or multimodal, and in establishing the location and relative size of the mode(s). Inferences about β and s can be based on the joint marginal posterior distribution of β and s, as given in (2.15).
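Assuming the product form of the marginal posterior, with q_1 and q_2 as defined above, the unnormalized log posterior can be evaluated directly. A Python sketch (NumPy assumed; the sanity-check data below are hypothetical, not the paper's example):

```python
import numpy as np

def log_post(beta, s, y, X, Z, a_e, b_e, a_s, b_s):
    """Unnormalized log of p(beta, s | y), proportional to
    q1^{-m} q2^{-l}, with q1 = (y - phi)'(y - phi) + 2 b_e,
    q2 = s's + 2 b_s, m = (n + 2 a_e)/2, l = (q + a_s)/2."""
    n, q = len(y), len(s)
    m = 0.5 * (n + 2 * a_e)
    l = 0.5 * (q + a_s)
    phi = X @ beta + Z @ s
    q1 = (y - phi) @ (y - phi) + 2 * b_e
    q2 = s @ s + 2 * b_s
    return -m * np.log(q1) - l * np.log(q2)

# Sanity check: with Z = 0 (so phi is unaffected by s), the density
# decreases as ||s|| grows.
y = np.zeros(4); X = np.zeros((4, 1)); Z = np.zeros((4, 2))
b0 = np.zeros(1)
v_near = log_post(b0, np.zeros(2), y, X, Z, 2.0, 1.0, 1.0, 0.5)
v_far = log_post(b0, np.array([5.0, 5.0]), y, X, Z, 2.0, 1.0, 1.0, 0.5)
print(v_near > v_far)  # True
```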
Note that p(β, s|y) is proportional to the product of two multivariate t-density kernels. This distribution is known as a product-form poly-t, and was studied by Drèze (1977), Richard and Tompa (1980), Drèze and Richard (1983), Lubrano et al. (1984), Richard (1984), and Bauwens and Richard (1985). The poly-t distribution arises as the posterior density function of regression coefficients of a single structural equation under various noninformative prior specifications, or in instrumental variable settings within a Bayesian framework. In a series of papers, the authors above characterize the poly-t distribution, both in product and ratio form, and propose algorithms and software that can be used to evaluate normalizing constants, first- and second-order moments of the distributions, and various fractiles. Results presented in this paper complement and expand the work of Drèze, Richard, and collaborators. The special structure of the product-form poly-t distribution will be useful when describing p(β, s|y) below.

Two special cases of model (2.1) may be of interest. Consider first the random effects model with a common mean μ, obtained by substituting 1μ for Xβ, where 1 ∈ R^n is a vector of 1's. Under the set of preceding assumptions, the marginal posterior density of μ and s takes the analogous form (2.16). The second special case is the completely random model

y = Zs + e,    (2.17)

which leads in a similar fashion to an analogous marginal density (2.18). In Section 3, we analyze the local maxima, i.e., the modes, of the posterior density (2.15). The density (2.18) is just a special case of (2.15), and this form is discussed at the end of the next section.

THE MAXIMA OF POSTERIOR DENSITIES
The main result of this paper is the following.

Theorem 1: The (local) maxima of the posterior density p(β, s|y) in (2.15) occur on a curve segment Φ(κ), κ ∈ [0, κ̄], in R^p × R^q.
The proof of Theorem 1 is given in the Appendix. Here we describe how to compute the curve Φ(κ) in general, and for an important special case. The (local) maxima of the posterior density p(β, s|y) can then be found by maximizing a real function defined on a bounded interval [0, κ̄], using, for example, Newton-Raphson type algorithms.
We first introduce some notation to simplify the expressions in p(β, s|y) from (2.14). Let c_1 = 2b_e, c_2 = 2b_s, m = ½(n + 2a_e), and l = ½(q + a_s). Then the function h: R^p × R^q → R, defined by

h(β, s) = [(y − ϕ)'(y − ϕ) + c_1]^{-m} [s's + c_2]^{-l},    (3.1)

attains its extreme values at the same points as p(β, s|y), and these values differ only by multiplication with the positive constant k_4.
Let M = (X, Z) be the matrix made up of the columns of X and Z, and let p + r, with 0 ≤ r ≤ q, be the rank of M. Let N be a submatrix of M with p + r linearly independent columns, and let V ⊂ R^p × R^q be the corresponding subspace. We first define a curve Φ̃(κ), κ ∈ [0, κ̄], in V; for example, the curve Φ̃ evaluated at 0 is a point in (β, s)-space whose coordinates are the components of the vector v_1. Here, s(κ) is the solution to a system of equations involving P = X(X'X)^{-1}X' and N = (X, Z_s), and β(κ) is computed from s(κ). The values of h in (3.1) on the maximum curve are then obtained with the help of a q × r matrix Ẑ whose columns are r linearly independent rows of Z.
Maximization of the function h in (3.10) is done over the interval [0, κ̄], and the points where the (local) maxima are attained can be computed according to (3.11)-(3.12).
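The one-dimensional search mentioned above can be sketched generically. The following Python routine performs a safeguarded Newton-Raphson maximization on a bounded interval using central-difference derivatives; the toy objective and interval are illustrative stand-ins, not the paper's h or its curve Φ.

```python
def maximize_on_interval(h, lo, hi, x0=None, tol=1e-10, max_iter=100):
    """Newton-Raphson search for a local maximum of h on [lo, hi],
    using central-difference derivatives (a generic 1-D sketch)."""
    eps = 1e-5
    x = 0.5 * (lo + hi) if x0 is None else x0
    for _ in range(max_iter):
        d1 = (h(x + eps) - h(x - eps)) / (2 * eps)
        d2 = (h(x + eps) - 2 * h(x) + h(x - eps)) / eps ** 2
        if d2 >= 0:  # not locally concave: fall back to a small uphill step
            x = min(max(x + (eps if d1 > 0 else -eps), lo), hi)
            continue
        x_new = min(max(x - d1 / d2, lo), hi)  # clip Newton step to [lo, hi]
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Toy curve: a smooth unimodal objective with maximum at kappa = 0.7.
kappa_hat = maximize_on_interval(lambda k: -(k - 0.7) ** 2, 0.0, 1.0)
print(round(kappa_hat, 6))  # 0.7
```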

Corollary 2
1. If r = 0, then the posterior distribution p(β, s|y) has exactly one maximum, attained at v_1 ∈ R^p, which is given by the normal equations N'N v_1 = N'y.
2. If v_1 = (N'N)^{-1} N'y ∈ R^p, then p(β, s|y) has exactly one maximum, attained at v_1.
Proof: Parts 1 and 2 follow directly from the computations in the proof of the theorem. The fact that p(β, s|y) is symmetric on each line through v_1 under condition 1 or 2 yields result 3.
We now consider an important special case, for which the (local) maxima can be computed explicitly. Recall from the discussion above that the β-component of Φ̃(κ) (and hence of Φ(κ)) is always a straight line. Furthermore, it follows from the proof of Theorem 1 in the Appendix that Φ̃(κ)_s = s(κ) is a line if and only if one of the (generalized) eigendirections of Z_s'(I − P)Z_s is s_1. This is, in particular, the case if the level surfaces of this quadratic form are spheres, or if r = 1. If Φ̃(κ) is a straight line in V, so is Φ(κ) in R^p × R^q. For this case, we obtain explicit expressions for the modes of the posterior density, as follows. In order to simplify notation, we reparameterize the line segment {Φ(κ), κ ∈ [0, κ̄]} in terms of a scalar α, using v_0 and w_1 as above. The mode(s) of p(β, s|y) can now be computed as the maxima of the function h(Φ(α)), with h defined as in (3.1). For x ∈ R^p × R^q, denote by x_s the components in the s-space. Then f(Φ(α)) = (s's + c_2)^l can be written as a quadratic polynomial in α raised to the power l, and similarly for g(Φ(α)). Differentiating the product f(Φ(α)) · g(Φ(α)) with respect to α and equating to zero yields (3.22), a third-degree polynomial in α. The explicit solution of (3.22) can be obtained via Cardano's rule. Let F denote the discriminant of the reduced equation (with γ = α + C/(3D)). We then obtain, with u and t the usual Cardano cube-root terms, the following criterion:
1. If F > 0, then (3.22) has one real root, namely γ_1 = u + t.
2. If F = 0, then (3.22) has three real roots, one of which, γ_2, is a double root.
3. If F < 0, then (3.22) has three distinct real roots, which can be expressed using the imaginary unit j.
The posterior density p(β, s|y) is bimodal if and only if F < 0, and in this case the corresponding points Φ(α_i) are the modes.
These computations show that in the setup of this special case, no iterative procedure is necessary for determining the modes. It is only necessary to invert the matrices X'X and N'N, to compute (Ẑ'Ẑ)^{-1}, and to solve (3.12) for w_1. These computations are straightforward.
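The discriminant criterion can be checked numerically. The sketch below performs the standard depressed-cubic reduction γ = α + C/(3D) and computes F = (q̃/2)² + (p̃/3)³ for a cubic Dα³ + Cα² + Bα + A; the roots themselves are obtained here with numpy.roots rather than Cardano's explicit formulas, and the coefficient names and test polynomial are illustrative, not the paper's.

```python
import numpy as np

def classify_cubic(D, C, B, A):
    """For D*a^3 + C*a^2 + B*a + A = 0 (D != 0), reduce via
    gamma = a + C/(3D) to gamma^3 + p*gamma + q = 0 and compute the
    discriminant F = (q/2)^2 + (p/3)^3:
    F > 0: one real root; F = 0: a repeated real root;
    F < 0: three distinct real roots.
    (Here p, q are the reduced-cubic coefficients, not dimensions.)"""
    p = (3 * D * B - C ** 2) / (3 * D ** 2)
    q = (2 * C ** 3 - 9 * D * C * B + 27 * D ** 2 * A) / (27 * D ** 3)
    F = (q / 2) ** 2 + (p / 3) ** 3
    roots = np.roots([D, C, B, A])
    real_roots = sorted(r.real for r in roots if abs(r.imag) < 1e-9)
    return F, real_roots

# (a - 1)(a - 2)(a + 3) = a^3 - 7a + 6: three distinct real roots, F < 0.
F, roots = classify_cubic(1.0, 0.0, -7.0, 6.0)
print(F < 0, [round(r) for r in roots])  # True [-3, 1, 2]
```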
Results from Theorem 1 and the special case above can be used to custom-tailor MCMC algorithms:
• If the mode ν is unique, use information about the location of ν on the curve {Φ(κ), κ ∈ [0, κ̄]} to "blast" into the area of p(β, s|y) of highest mass.
• If the stationary distribution p(β, s|y) is multimodal, i.e., has multiple local maxima (Φ(κ_i), ν_i), i ≥ 2, then design the algorithm so that it visits all the modes, and set the number of visits to be proportional to the relative size ν_i of each mode.
We conclude this section with some remarks on the application of our main result. Consider the completely random model (2.17), y = Zs + e. Define W := l.s.{Z_1, ..., Z_q} ⊂ R^n, and reorder the columns of Z such that {Z_1, ..., Z_r} is a basis of W. Let V = Z^{-1}[W], the inverse image of W under Z, with basis {ê_1, ..., ê_r}, and denote the kernel of Z by K. Then the point v_1 is given by the normal equations, v_1 = (Z|_V' Z|_V)^{-1} Z|_V' y ∈ R^q, and we have v_0 = s_0 = 0. With this notation, the proof of Theorem 1 goes through, yielding one or two (local) maxima for the posterior density p(s|y).
If p(β, s|y) has a unique maximum, the point where this maximum is attained need not be the posterior mean (see the example in Section 4). However, each component of the curve Φ̃(κ), κ ∈ [0, κ̄], is monotone in V, and hence the posterior mean in V is always contained in the multidimensional rectangle described by Φ̃(0) and Φ̃(κ̄). Of course, the same holds true in R^p × R^q, using the corners Φ(0) and Φ(κ̄), since Φ is obtained from Φ̃ by projection along a subspace. This gives a rough estimate for E(β, s|y).
The proof of Theorem 1 shows that if v_1 ∉ R^p, then neither the posterior mode(s) nor the posterior mean can be attained at (β_0, 0) or w_1, since p(β, s|y) is strictly increasing at w_1 = Φ(0) and strictly decreasing at (β_0, 0) = Φ(κ̄). For numerical computation of the mode(s) as maxima of p(β, s|y) via Newton-Raphson type methods in R^p × R^q, however, (β_0, 0) and w_1 are reasonable starting values.

AN INSTRUCTIVE EXAMPLE
The proof of Theorem 1 is constructive, and therefore explicit criteria for unimodality can be given once the function Φ, defined in (6.1) in the Appendix, is known. Actually, the equations in the Appendix show that it suffices to know Φ̃ and ξ_2. In general, one can proceed as follows. Solve the normal equations N'N v_1 = N'y and X'X β_0 = X'y given in Section 3 to obtain the points v_1 and β_0, and solve the system of equations (4.1) for κ ∈ [0, κ̄] to obtain s(κ) in V_s. (One has to set s(0) = s_1, since the Lagrange multiplier is singular at this point.) Compute β(κ) via (6.5)-(6.8) to obtain Φ̃(κ) = (β(κ), s(κ)) in V. For κ = 0, (6.17) is used to get Φ(0)_s, and now ξ_2 in (6.20) can be computed. This allows maximization of h = f·g on Φ̃(κ) to obtain the value(s) κ_i where the maxima are attained. Another application of (6.18) yields the s-components of Φ(κ_i), and finally we get the desired point(s) Φ(κ_i) from (6.18). The entire procedure requires the computation of three inverse matrices (of N'N, X'X, and Ẑ'Ẑ, of dimension p + r, p, and q − r, respectively), the solution of five systems of linear equations (in p + r, p, q − r, and q − r variables), and the solution of (4.1) (in r + 1 variables), which is quadratic in one equation, for all values κ ∈ (0, κ̄). In specific cases, explicit formulas for s(κ) in (4.1) can be obtained via symbolic manipulation software. In the previous section, we derived an explicit criterion for uni- and bimodality in the case where Φ̃ is a straight line in V. Here, we present a simple example that demonstrates how uni- and bimodality depend on the data.
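The first step of the procedure, solving the two sets of normal equations, looks as follows in Python (NumPy assumed). The design matrices and data here are randomly generated placeholders, not the paper's example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical full-column-rank designs (illustrative dimensions).
n, p, r = 12, 2, 3
X = rng.normal(size=(n, p))
Z_s = rng.normal(size=(n, r))
N = np.hstack([X, Z_s])  # N = (X, Z_s) with p + r independent columns
y = rng.normal(size=n)

# Step 1 of the procedure: the two sets of normal equations.
v1 = np.linalg.solve(N.T @ N, N.T @ y)     # N'N v1 = N'y
beta0 = np.linalg.solve(X.T @ X, X.T @ y)  # X'X beta0 = X'y

# Check: least-squares residuals are orthogonal to the design columns.
print(np.allclose(N.T @ (y - N @ v1), 0))  # True
```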

Example
A simple example illustrates the steps needed for the computation of the mode(s) of p(β, s|y), and shows how uni- or multimodality of the posterior distribution depends on the observations. Let n = 4, p = 1, q = 2, and consider model matrices X and Z for which rank(M) = 2, and hence r = 1. Let a_e = 2 and a_s = 1. Then m = ½(n + 2a_e) = 4 and l = ½(q + a_s) = 1.5. Set, furthermore, 2b_e = c_1 = 2 and 2b_s = c_2 = 1.0. Solutions to the normal equations N'N v_1 = N'y and X'X β_0 = X'y are the vectors v_1 and v_0, and equation (6.17) gives the component in the kernel K. Three data sets are considered. For data set 1, there is a unique mode at α ≈ 0.978. For data set 2, there is a unique mode at α_1 ≈ 0.95, together with a double root of (3.22); hence there is an inflection point of h(Φ(α)) at α_2 ≈ 0.161, which corresponds to the point [1.165, 0.617, −0.617]'.
For data set 3, there are two modes, at α_1 ≈ 0.918 and α_2 ≈ 0.08. In this data set, the first mode ν_1 carries approximately 10 times more mass than the second mode ν_2. Therefore, an algorithm designed to visit both modes should spend about 10 times longer around ν_1.

CONCLUSIONS
For a mixed linear model like (2.1), with location and dispersion assumptions as those in (2.4)-(2.5), the estimation of the fixed and random effects β and s may be of interest. If the variance components τ_s^{-1} and τ_e^{-1} are known, the optimal solution to the prediction problem is to obtain the best linear unbiased estimate (BLUE) of β and the best linear unbiased predictor (BLUP) of s. If the variance components are unknown, β and s can be estimated within a Bayesian framework, and MCMC methods provide a convenient means to do so. Alternatively, Laplace methods can be implemented to approximate moments of the posterior distribution of interest.
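For the known-variance case mentioned above, the BLUE of β and the BLUP of s can be obtained jointly from Henderson's mixed model equations. A Python sketch (NumPy assumed), for the special case A = I and with δ = τ_s/τ_e; all the numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 15, 2, 4
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, q))
beta_true = np.array([1.0, -1.0])
tau_s, tau_e = 2.0, 4.0
delta = tau_s / tau_e
s_true = rng.normal(scale=tau_s ** -0.5, size=q)
y = X @ beta_true + Z @ s_true + rng.normal(scale=tau_e ** -0.5, size=n)

# Henderson's mixed model equations (with A = I):
#   [X'X   X'Z        ] [beta]   [X'y]
#   [Z'X   Z'Z + d*I  ] [ s  ] = [Z'y],   d = tau_s / tau_e
lhs = np.vstack([np.hstack([X.T @ X, X.T @ Z]),
                 np.hstack([Z.T @ X, Z.T @ Z + delta * np.eye(q)])])
rhs = np.concatenate([X.T @ y, Z.T @ y])
sol = np.linalg.solve(lhs, rhs)
beta_hat, s_hat = sol[:p], sol[p:]

# First-order conditions of the penalized least-squares problem hold:
resid = y - X @ beta_hat - Z @ s_hat
print(np.allclose(X.T @ resid, 0))             # True
print(np.allclose(Z.T @ resid, delta * s_hat))  # True
```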
In either case, knowledge about the shape of the marginal posterior distribution of the parameters of interest is important. The number, approximate location, and relative size of the (local) maxima of the distribution provide critical information with which to assess the performance of Laplace methods, to custom-design MCMC algorithms, and to construct highest posterior density regions for the parameters under study. The type of problems we have in mind are high dimensional, where the vector of parameters might have hundreds of thousands of dimensions, as is the case in animal breeding applications. In those cases, some idea of the characteristics of the very high-dimensional surface can be an important aid in analysis.
We have shown that for the class of mixed linear models with any number of fixed effects and one vector of random effects (possibly very high-dimensional), and for prior assumptions as in (2.4), (2.5), (2.9)-(2.11), the maxima of the posterior distribution p(β, s|y) occur on a curve segment Φ(κ), κ ∈ [0, κ̄]. If the curve Φ is a straight line, uni- or multimodality of the stationary distribution depends critically on the data, and can be determined by inspection of the roots of a third-degree polynomial.
Information about the mode(s) of the distribution p(β, s|y), which has a poly-t form, can be used either to start a sampling algorithm (see, e.g., Lubrano et al., 1984; Bauwens and Richard, 1985), or to tailor the algorithm so that it will visit all modes. Consider the problem of drawing an initial value (β^(0), s^(0)) from a starting distribution p(β^(0), s^(0)) so that p(β^(0), s^(0) | y) > 0.
The value (β^(0), s^(0)) may result, for example, from importance sampling from an approximation to p(β^(0), s^(0)) built around the mode(s) of p(β, s|y). The procedure presented in this paper for exploration of a specific stationary distribution is easily implemented, since it does not involve sophisticated numerical methods. In fact, all that is required is the solution of sets of normal equations and of systems of linear equations, with one quadratic term. In this regard, it provides a convenient analytical tool for exploratory analysis, and may improve the performance of numerical methods to approximate poly-t distributions.
APPENDIX

Proof: Let us first introduce some notation to simplify the expressions in p(β, s|y) from (2.14). Let c_1 = 2b_e > 0, c_2 = 2b_s > 0, m = ½(n + 2a_e) ≥ 1, and l = ½(q + a_s) ≥ 1. The function h defined below attains its extreme values at the same points as p(β, s|y), and these values differ only by multiplication with the positive constant k_4.
Next, let M = (X, Z) be the matrix made up of the columns of X and Z. The matrix M has rank p + r, with 0 ≤ r ≤ q. Denote by W the linear span of the columns of M, that is, W := l.s.{X_1, ..., X_p, Z_1, ..., Z_q} ⊂ R^n. Reorder the columns of Z such that {X_1, ..., X_p, Z_1, ..., Z_r} forms a basis of W. Define V to be the linear span of the corresponding p + r standard basis vectors, i.e., V := l.s.{e_1, ..., e_p, ê_1, ..., ê_r}, with the corresponding reordering of the random effects s ∈ R^q. Then M restricted to V, written as N, is a linear isomorphism from V to W. Denote by K ⊂ R^p × R^q the kernel of M: R^p × R^q → R^n. Then each (β, s) ∈ R^p × R^q has a unique canonical decomposition (β, s) = Σ_{i=1}^p β_i e_i + Σ_{j=1}^r s_j ê_j + k, where {e_1, ..., e_p, ê_1, ..., ê_r} is the standard basis in V and k ∈ K. Denote l.s.{ê_1, ..., ê_r} by V_s, and the corresponding part of Z by Z_s. Next, we solve the problem in V (the "nonsingular problem") by considering the minima and the level surfaces of the functions

f(v) = (s's + c_2)^l and g(v) = [(y − Nv)'(y − Nv) + c_1]^m, v = (β, s) ∈ V.

The product h = f · g has its minima exactly at the points where p(β, s|y) restricted to V has its maxima.
Regarding f, the following results follow from its structure. In V_s, the quadratic form s's + c_2 has a unique minimum at s = 0. Hence, the level surfaces of f in V are of the form R^p × B(0, ρ), where B(0, ρ) = {s ∈ V_s : s's = ρ}, ρ ≥ 0, is the Euclidean sphere centered at zero with radius √ρ, and the levels of f do not depend on β.
The behavior of the function g can be investigated in a similar way. The quadratic form (y − Nv)'(y − Nv) has a unique minimum at v = v_1, where v_1 is given by the normal equations N'N v_1 = N'y. The level surfaces of this quadratic form in V are strictly convex surfaces centered at v_1, denoted L(v_1, η). Hence, the level surfaces of g are of the form L(v_1, η), with levels in [(ω² + c_1)^m, ∞), where ω is the minimal distance from y to W, i.e., ω² = (y − Nv_1)'(y − Nv_1).
In order to find the points where the minima of h = f·g occur, we proceed in the following way: f is strictly increasing on the level surfaces R^p × B(0, ρ) as ρ increases, and g is strictly increasing on L(v_1, η) for increasing η. Hence, for fixed η ≥ 0, the minimum of h occurs at the level ρ(η) = min{ρ ≥ 0 : R^p × B(0, ρ) ∩ L(v_1, η) ≠ ∅}. Note that L(v_1, η) determines a strictly convex body, and that R^p × B(0, ρ(η)) ∩ L(v_1, η) consists of exactly one point. The minima of h occur on the curve Φ̃ in V, which is given by these points for varying η.
We proceed to compute this curve using Lagrange multipliers (e.g., Protter and Morrey, 1977, Chapter 14.4). The problem is the following: minimize ρ subject to the constraint that R^p × B(0, ρ) meets L(v_1, η). The resulting system of equations has exactly two solutions, and a unique one when considering ρ(η).
Let us first compute the point where s's = 0. It follows from (6.5)-(6.8) that in this case the solution to (6.5)-(6.8) is unique. Plugging (6.5) into (6.6) yields (6.11), where I − P = I − X(X'X)^{-1}X' is the projection matrix that projects the solution in V onto that in V_s. Note that I − P is symmetric and idempotent (i.e., I − P is its own square root), and therefore (6.11) says that s is the solution component in V_s if s is perpendicular to the corresponding tangent hyperplane. Moreover, ρ is a convex function of κ, where κ̄ = s_1's_1. (Note that ρ is a linear function of κ iff one of the (generalized) eigendirections of Z_s'(I − P)'(I − P)Z_s is s_1.) Now we obtain from (6.7)

(s_1 − s)' Z_s' X(X'X)^{-1} X' Z_s (s_1 − s) = κ + 2(β_1 − β)' X'X (β_1 − β),    (6.12)

and therefore η is a convex function of κ.
Consider the solution Φ̃ of (6.5)-(6.8) in V, parameterized by κ ∈ [0, κ̄]. On this curve, ρ and η are convex functions of κ, and so are f and g, because l, m ≥ 1. Note that, by the arguments above, the β-component of Φ̃ is a straight line between β_1 and β_0. Hence Φ̃ is a straight line in V iff one of the (generalized) eigendirections of Z_s'(I − P)'(I − P)Z_s is s_1; this case has been considered in more detail in the second part of Section 3. We have completed the proof for the case where N = M, i.e., where all columns of X and Z are linearly independent.
We now extend the result to R^p × R^q. Define the functions

f: R^p × R^q → R, f(β, s) = (s's + c_2)^l,
g: R^p × R^q → R, g(β, s) = [(y − Xβ − Zs)'(y − Xβ − Zs) + c_1]^m.

For κ ∈ [0, κ̄], consider the affine subspace Φ̃(κ) + K ⊂ R^p × R^q, where Φ̃ is the solution curve of (6.5)-(6.8) and K is the kernel of M. We have, for all k ∈ K,

g((β, s_v) + k) = g(β, s_v).    (6.15)

Hence, we need to find the points Φ(κ) ∈ Φ̃(κ) + K where f attains its minimum on Φ̃(κ) + K. Since f(β, s) = f(0, s) for all s ∈ R^q, we can restrict our attention to R^q, i.e., to the s-component of Φ̃(κ) + K, denoted by (Φ̃(κ) + K)_s. Since s's is the Euclidean norm in R^q, the unique minimum is attained at Φ(κ)_s with the property that Φ(κ)_s ⊥ (Φ̃(κ) + K)_s. Hence, Φ(κ)_s lies in K^⊥, the orthogonal complement of K, and is the projection of Φ̃(κ)_s onto K^⊥ along K. But K^⊥ is spanned by the rows of M, i.e., in R^q, by the rows of Z. In order to arrive at the corresponding normal equations, denote by Ẑ a q × r matrix whose columns are r linearly independent rows of Z. Then Φ(κ)_s is given by

Ẑ'Ẑ Φ(κ)_s = Ẑ' Φ̃(κ)_s,    (6.16)

or

Φ(κ)_s = (Ẑ'Ẑ)^{-1} Ẑ' Φ̃(κ)_s.    (6.17)

Now, let {k_j, j = 1, ..., q − r} be a basis of K. Since X has full column rank, none of the k_j is contained in R^p. The point Φ(κ) satisfies a system of linear equations with coefficients α_j ∈ R; this system has a unique solution (α_1, ..., α_{q−r}), because the k_j are linearly independent. In fact, Φ(κ) is a curve in R^p × R^q, with Φ(κ̄) = (β_0, 0) and Φ(0) = w_1, the point corresponding to v_1 ∈ V.
Plots of h(Φ(α)) for α ∈ [−0.5, 1.5] are shown in Figures 1.a-1.b for data sets 1 and 3. The corresponding posterior densities are shown in Figures 2 and 3.