Notes: Graphical Gaussian Models for Genome Data

Synopsis:

For software that efficiently identifies GGM networks from data, visit
the GeneNet page.

A simple method for inferring the network of (linear) dependencies among a set of variables
is to compute all pairwise correlations and then draw the corresponding
graph (for some specified threshold). While popular and often used on many types of genomic
data (e.g. gene expression, metabolite concentrations), **the naive correlation approach does not
allow the dependency network to be inferred.
Instead, graphical Gaussian models (GGMs) should be used**. These correctly identify direct
influences, are closely connected to causal graphical models, are straightforward to interpret, and yet
are essentially as easy to
compute as naive correlation models. This page lists pointers to learning GGMs from data, including
procedures suitable for "small n, large p" data sets (category III).

Introduction:

Graphical Gaussian Models (GGMs), also known as "covariance
selection" or "concentration graph" models, have recently become a popular
tool to study **gene association networks**. The
key idea behind GGMs is to use *partial correlations* as
a measure of conditional independence between any two genes. This makes it
straightforward to distinguish direct from indirect
interactions. Note that partial correlations are related to the
*inverse* of the correlation matrix. Also note that in
GGMs missing edges indicate __conditional
independence__.
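
The link between partial correlations and the inverse correlation matrix can be made concrete with a small sketch. Assuming a nonsingular correlation matrix with inverse Ω, the partial correlation of variables i and j given all others is -ω_ij / sqrt(ω_ii ω_jj). The function name `partial_correlations` and the three-gene chain matrix below are illustrative assumptions; this is not the GeneNet implementation.

```python
import numpy as np


def partial_correlations(corr):
    """Partial correlation of each pair given all remaining variables.

    Assumes `corr` is a nonsingular correlation (or covariance) matrix.
    """
    omega = np.linalg.inv(corr)              # concentration (precision) matrix
    scale = np.sqrt(np.diag(omega))
    pcor = -omega / np.outer(scale, scale)   # standardize and flip the sign
    np.fill_diagonal(pcor, 1.0)              # convention: unit diagonal
    return pcor


# Hypothetical chain A -> B -> C: A and C are related only through B,
# so their marginal correlation is the product 0.8 * 0.8 = 0.64.
corr = np.array([
    [1.00, 0.80, 0.64],
    [0.80, 1.00, 0.80],
    [0.64, 0.80, 1.00],
])
pcor = partial_correlations(corr)
print(np.round(pcor, 3))
```

In this toy example the marginal correlation between A and C is 0.64, yet their partial correlation is numerically zero, while the A-B and B-C partial correlations stay at about 0.62. Thresholding the correlation matrix would therefore draw the indirect edge A-C, whereas the GGM would not.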

A related but completely different concept
is that of the so-called **gene relevance networks**,
which are based on the "covariance
graph" model. In the latter, interactions are defined
through standard correlation coefficients so that missing edges
denote __marginal independence__ only.

There is a simple reason why GGMs should be preferred over
relevance networks for identification of gene networks:
**the correlation coefficient is a weak criterion for
measuring dependence**, as marginally, i.e.
directly and indirectly, more or less all genes will be
correlated. This implies that zero correlation is in fact a

The best starting place to learn about GGMs is the classic paper that introduced this concept in the early 1970s (A.P. Dempster. 1972. Covariance Selection. Biometrics 28:157-175). Further details can be found in the GGM books by J. Whittaker (1990) and by D. Edwards (1995).

Application of GGMs to Genomic Data:

Applying GGMs to genomic data is quite challenging, as the number of genes (p) is usually much larger than the number of available samples (n), and classical GGM theory is not valid in a small sample setting. With this page I'd like to provide a commented list of some recent work dealing with GGM gene expression analysis (there are only very few papers so far). In my understanding, all of these papers fit in one of three categories:

- (I) analysis with classic GGM theory,
- (II) using limited order partial correlations, and
- (III) application of regularized GGMs.

For small n, large p data, the methods in category III seem best suited (see below for references and software).

I. Classic GGM Analysis:

The following papers simply apply classical GGM theory (i.e. with no further modification) to analyze gene expression data. Such an analysis is necessarily restricted to very small numbers of genes or gene clusters so as to satisfy n > p.

- P. J. Waddell and H. Kishino. 2000. Correspondence analysis of genes and tissue types and finding genetic links from microarray data. Genome Informatics 11:83-95
- P. J. Waddell and H. Kishino. 2000. Cluster inference methods and graphical models evaluated on NCI60 microarray gene expression data. Genome Informatics 11:129-140
- H. Toh and K. Horimoto. 2002. Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling. Bioinformatics 18:287-297
- H. Toh and K. Horimoto. 2002. System for automatically inferring a genetic network from expression profiles. J. Biol. Physics 28:449-464
- X. Wu, Y. Ye and K. R. Subramanian. 2003. Interactive analysis of gene interactions using graphical Gaussian model. ACM SIGKDD Workshop on Data Mining in Bioinformatics 3:63-69
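
The restriction to n > p in the classical setting can be seen numerically: with fewer samples than genes the sample correlation matrix is rank deficient, so the inverse needed for partial correlations does not exist. The following is a minimal sketch on simulated noise; the sizes `n = 5`, `p = 10` and the random data are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 10                               # fewer samples than genes
data = rng.normal(size=(n, p))             # rows = samples, columns = genes

corr = np.corrcoef(data, rowvar=False)     # p x p sample correlation matrix
rank = np.linalg.matrix_rank(corr, tol=1e-8)

# Centering each gene removes one degree of freedom, so the rank of the
# sample correlation matrix cannot exceed n - 1, which is below p here.
print(f"p = {p}, rank = {rank}")
```

Because the rank is at most n - 1, the matrix cannot be inverted directly whenever p is at least n, which is why the papers above work with small gene sets or gene clusters.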

II. Limited Order Partial Correlations:

One way to circumvent the problem of computing full partial correlation coefficients when the sample size is small compared to the number of genes is to use partial correlation coefficients of limited order. This results in something in between a full GGM (with correlations conditioned on all p-2 remaining genes) and a relevance network model (with unconditioned correlations). This is the strategy employed in the following papers:

- A. de la Fuente, N. Bing, I. Hoeschele, and P. Mendes. 2004. Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics 20:3565-3574.
- A. Wille, P. Zimmermann et al. 2004. Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology 5:R92
- P. M. Magwene and J. Kim. 2004. Estimating genomic coexpression networks using first-order conditional independence. Genome Biology 5:R100
- A. Wille and P. Bühlmann. 2006. Low-order conditional independence graphs for inferring genetic networks. Statist. Appl. Genet. Mol. Biol. 4:32
- R. Castelo and A. Roverato. A robust procedure for Gaussian graphical model search from microarray data with p larger than n. Preprint.
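
To illustrate what a limited order partial correlation looks like, here is a minimal sketch of the first-order case, which conditions on one other gene at a time and needs only pairwise correlations rather than a matrix inverse. The helper names and the decision rule (take the weakest association over order 0 and order 1) are illustrative assumptions in the spirit of this category, not the exact procedure of any of the papers listed above.

```python
import math


def first_order_pcor(r_ij, r_ik, r_jk):
    """Correlation of i and j after conditioning on a single variable k."""
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik**2) * (1 - r_jk**2))


def weakest_low_order_association(corr, i, j):
    """Smallest absolute correlation of (i, j) over order 0 and order 1.

    Illustrative rule only: an edge is kept if this value stays large,
    i.e. no single third variable explains the association away.
    """
    candidates = [abs(corr[i][j])]
    for k in range(len(corr)):
        if k not in (i, j):
            r = first_order_pcor(corr[i][j], corr[i][k], corr[j][k])
            candidates.append(abs(r))
    return min(candidates)


# Hypothetical common cause: gene 0 drives genes 1 and 2, so the
# correlation between 1 and 2 is the product 0.7 * 0.6 = 0.42.
corr = [
    [1.00, 0.70, 0.60],
    [0.70, 1.00, 0.42],
    [0.60, 0.42, 1.00],
]
print(weakest_low_order_association(corr, 1, 2))   # indirect pair
print(weakest_low_order_association(corr, 0, 1))   # direct pair
```

For the indirect pair (1, 2) the value collapses to essentially zero once gene 0 is conditioned on, while the direct pair (0, 1) retains a substantial association.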

III. Regularized GGMs:

Another possibility (and in my opinion the statistically soundest way) to marry GGMs with small sample modeling is to introduce regularization and moderation. This essentially boils down to finding suitable estimates for the covariance matrix and its inverse when n < p. This can be done either in a fully Bayesian manner, or in an empirical Bayes way via variance reduction, shrinkage estimates etc. Once regularized estimates of the partial correlations are available, heuristic searches can subsequently be employed to find an optimal graphical model (or set of models).
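
As a rough illustration of the shrinkage route, the sketch below pulls the sample correlation matrix toward the identity, which makes it invertible even when n < p, and then computes partial correlations from the inverse. The function names and the fixed intensity `lam = 0.3` are assumptions made for this example only; a data-driven choice of the intensity, as pursued in the shrinkage literature, is not attempted here.

```python
import numpy as np


def shrink_correlation(corr, lam):
    """Shrink off-diagonal correlations toward zero: (1 - lam) * R + lam * I."""
    return (1.0 - lam) * corr + lam * np.eye(corr.shape[0])


def partial_correlations(corr):
    """Partial correlations from the inverse of a nonsingular matrix."""
    omega = np.linalg.inv(corr)
    scale = np.sqrt(np.diag(omega))
    pcor = -omega / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor


rng = np.random.default_rng(1)
n, p = 8, 20                                   # small n, large p
data = rng.normal(size=(n, p))                 # simulated placeholder data
sample_corr = np.corrcoef(data, rowvar=False)  # singular, since n < p

lam = 0.3                                      # hypothetical fixed intensity
shrunk = shrink_correlation(sample_corr, lam)
pcor = partial_correlations(shrunk)

# The smallest eigenvalue is bounded below by lam, so the inverse exists.
print(np.linalg.eigvalsh(shrunk).min())
```

The design choice here is to regularize the correlation matrix first and invert afterwards; regularizing the concentration matrix directly is an alternative taken by some of the work listed below.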

Outside a genomic context, the use of regularized GGMs was first proposed by F. Wong, C.K. Carter, and R. Kohn (2003. Efficient estimation of covariance selection models. Biometrika 90:809-830). For gene expression data this strategy is pursued in the following papers:

1. A. Dobra, C. Hans, B. Jones, J.R. Nevins, and M. West. 2004. Sparse graphical models for exploring gene expression data. J. Multiv. Analysis 90:196-212.
   *See the web page of M. West for various other related articles.*
2. J. Schäfer and K. Strimmer. 2005. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 21:754-764.
3. J. Schäfer and K. Strimmer. 2005. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statist. Appl. Genet. Mol. Biol. 4:32.
4. J. Schäfer and K. Strimmer. 2005. Learning large-scale graphical Gaussian models from genomic data. In: J. F. Mendes. (Ed.). Proceedings of "Science of Complex Networks: from Biology to the Internet and WWW" (CNET 2004), Aveiro, PT, August 2004. (Publisher: The American Institute of Physics).
   *In these papers a regularized estimate of the correlation matrix is obtained, either by Stein-type shrinkage (3) or by bootstrap variance reduction (2). This estimate is subsequently used to compute partial correlations. Network selection is based on false discovery rate multiple testing. This method is implemented in GeneNet.*
5. N. Meinshausen and P. Bühlmann. 2006. High-dimensional graphs and variable selection with the lasso. Annals of Statistics 34(3).
   *This approach uses lasso regression to induce sparsity on a node level among the partial correlations.*
6. H. Li and J. Gui. 2006. Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics 7:302-317.
   *These authors regularize the concentration matrix rather than the covariance matrix.*

Please drop me a line (korbinian.strimmer@lmu.de) with suggestions and comments.