46 SP1: Biostatistics

Integrating biological knowledge related to co-expression when analysing

Agrocampus Ouest & IRMAR, UMR 6625 du CNRS, 65 rue de St-Brieuc, 35042 Rennes, France
marie.verbanck@orange.fr, sebastien.le@agrocampus-ouest.fr

Abstract. Interpreting results provided by multivariate exploratory methods (such as Principal Component Analysis) applied to genomic data is almost impossible at the gene level because of the number of genes. Integrative approaches that incorporate biological knowledge have become unavoidable. De Tayrac et al. (2009) proposed a strategy that uses a priori information, such as Gene Ontology (GO) or KEGG terms, to enhance the results. The idea is to build modules of genes according to the a priori information and to use those modules as supplementary information, so that results can be interpreted on the basis of the genes' functions.

However, the composition of those modules may be disconnected from the structure of the genomic data under study, and it does not take into account the different degrees of specificity of the terms, which reflect different levels of regulation.

Hence the natural idea of improving the way modules are constituted.

The aim of this talk is to propose a new approach combining Canonical Correspondence Analysis with Hierarchical Multiple Factor Analysis (Francoa et al., 2009) to get modules that have two main features: 1) they are constituted of genes that belong to the same biological processes; 2) they are constituted of genes that are co-expressed with respect to the data set of interest.

The interpretation of the biological processes is thus facilitated by the co-expression of the genes within a group, while the method highlights a few key genes whose functions can easily be taken into account to deepen the interpretation. Applying this method to a chicken microarray data set brought out the well-known mechanisms triggered in response to fasting and suggested new leads.

Keywords: transcriptomic data, integration of biological knowledge, Canonical Correspondence Analysis, Hierarchical Multiple Factor Analysis

References

DE TAYRAC, M., LE, S., AUBRY, M., MOSSER, J. and HUSSON, F. (2009): Simultaneous analysis of distinct Omics data sets with integration of biological knowledge: Multiple Factor Analysis approach. BMC Genomics 10:32.

FRANCOA J., CROSSAB J., DESPHANDEC S. (2009): Hierarchical Multiple-Factor Analysis for Classifying Genotypes Based on Phenotypic and Genetic Data. Crop Science 50(1):105-117.

Additional Hierarchy in the Modelling of

Elizabeth Stojanovski and Kerrie Mengersen

University of Newcastle, Australia, elizabeth.stojanovski@newcastle.edu.au
Queensland University of Technology, Australia

Abstract. The random-effects model in the frequentist framework assumes study effects to be randomly sampled from a common distribution. The associated parameters allow more variation between studies than fixed-effects models do. The quantities of most interest are typically the hyperparameters. When effect variability within a population is small, more borrowing of information occurs across studies.

The variability of an effect is often estimated using a method of moments approximation proposed by DerSimonian and Laird [1986]. This method is assessed in greater detail.
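For readers unfamiliar with the estimator, the following is a minimal sketch of the DerSimonian and Laird method-of-moments computation in its textbook form. The function name `dersimonian_laird` and the input layout (one effect estimate and one within-study variance per study) are assumptions made for illustration; the sketch is not the assessment carried out in this work.

```python
def dersimonian_laird(effects, variances):
    """Method-of-moments estimate of the between-study variance tau^2.

    effects   : per-study effect estimates
    variances : per-study within-study variances (all > 0)
    Returns (tau2, pooled_effect). Hypothetical helper for illustration.
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Fixed-effects (inverse-variance) weights and pooled estimate.
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w

    # Cochran's Q statistic: weighted squared deviations from the fixed estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))

    # Moment estimator, truncated at zero.
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects pooled estimate with weights 1 / (v_i + tau^2).
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    return tau2, pooled


tau2, pooled = dersimonian_laird([0.0, 2.0, 4.0], [1.0, 1.0, 1.0])
print(tau2, pooled)
```

The truncation at zero means that homogeneous studies give an estimate of exactly zero, in which case the random-effects and fixed-effects pooled estimates coincide.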

Keywords: meta-analysis, Bayesian, random-effects

References

DERSIMONIAN, R. and LAIRD, N. (1986): Meta-analysis in clinical trials. Controlled Clinical Trials 7, 177-188.

Bayesian Modelling of Cross-study Discrepancies in Gene Networks

Department of Statistics, the Chinese University of Hong Kong, Shatin, N.T., Hong Kong, xfan@sta.cuhk.edu.hk

Abstract. Because of its high complexity, the same biological system is often investigated by multiple studies from similar or related angles. Many meta-analyses of such studies have suggested that there is an excess of genes showing discordant expression across similar studies compared with what would be predicted by chance alone. Scharpf et al. (2009) introduced a hierarchical Bayesian model for detecting differential gene expression in multiple data sets while allowing for cross-study discrepancy. Fan and Liu (2009) and Fan et al. (2010) used a Bayesian approach to integrate cell-cycle microarray data sets and showed that discrepancies about which genes are cell-cycle regulated exist between individual laboratories and across synchronization techniques. In this paper, instead of dealing with the discrepancy at the gene level as in Scharpf et al. (2009), we introduce a Bayesian approach that models the discrepancy at the gene network level. The fundamental conjecture is that the gene expression discrepancy results from the dynamics of the gene regulatory networks. Starting from different parameter settings, the network dynamics may show multiple steady states. Therefore, a gene can be more highly expressed in one phenotype than in the other in some studies, while the opposite is observed in other studies. Similarly, in cell-cycle experiments, a gene's expression can be highly periodic in some studies and aperiodic in others. This phenomenon also exists in some stress response studies, where the lists of differentially expressed genes for the same stimulus vary significantly across studies. The new Bayesian approach is applied to time-series microarray data sets from fission yeast cell-cycle experiments. A gene network is inferred from the combined data. Its dynamics are simulated under the inferred parameter setting as well as under other settings, in an effort to explain the discrepancy observed in the cell-cycle studies.
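The claim that one network can settle into more than one steady state can be illustrated with a toy example. The sketch below integrates a two-gene mutual-repression circuit with forward Euler steps; the equations, the parameter values (`a`, `n`) and the function name `simulate_toggle` are hypothetical choices for illustration only and are not the network inferred in this work.

```python
def simulate_toggle(x0, y0, a=4.0, n=2, dt=0.01, steps=5000):
    """Forward-Euler integration of a toy two-gene mutual-repression model.

    dx/dt = a / (1 + y^n) - x
    dy/dt = a / (1 + x^n) - y
    All parameter values are illustrative assumptions.
    Returns the expression levels (x, y) after `steps` Euler steps.
    """
    x, y = x0, y0
    for _ in range(steps):
        dx = a / (1.0 + y ** n) - x
        dy = a / (1.0 + x ** n) - y
        x, y = x + dt * dx, y + dt * dy
    return x, y


# Same parameters, two different starting points.
state_a = simulate_toggle(2.0, 0.1)   # gene x starts ahead
state_b = simulate_toggle(0.1, 2.0)   # gene y starts ahead
print(state_a, state_b)
```

With these assumed values the two runs end in opposite states (one gene high and the other low), which is the kind of behaviour that could make a gene appear up-regulated in one study and down-regulated in another.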


Keywords: gene network, meta-analysis, Bayesian computing, cell cycle

References

FAN, X. and LIU, J.S. (2009): Comment on “A Bayesian Model for Cross-Study Differential Gene Expression” by Scharpf, Tjelmeland, Parmigiani, and Nobel. Journal of the American Statistical Association 104 (488), 1314-1318.

FAN, X., PYNE, S. and LIU, J.S. (2010): Bayesian Meta-Analysis for Identifying Periodically Expressed Genes in Fission Yeast Cell Cycle. To appear in Annals of Applied Statistics.


SCHARPF, R.B., TJELMELAND, H., PARMIGIANI, G. and NOBEL, A.B. (2009): A Bayesian model for cross-study differential gene expression. Journal of the American Statistical Association 104 (488), 1295-1310.

Mixture models of truncated data for estimating the number of species.

Sebastien Li-Thiao-Té, Jean-Jacques Daudin, and Stéphane Robin

UMR 518 AgroParisTech / INRA MIA, 16 rue Claude Bernard, F-75231 Paris Cedex 05, sebastien.li-thiao-te@agroparistech.fr

Abstract. Metagenomics goes beyond DNA sequencing by tackling communities of microorganisms in their natural environment. Previously, each microbial strain needed to be cultured before sequencing. Applying DNA sequencing directly to the sample has revealed the great diversity of the microbial populations in soil, sea water or the intestinal flora.

Even though many new species can be studied, many more remain unobserved in the collected data. Estimating the total number of microbial species in the biological sample and their abundance distribution is key to determining the number of sequencing runs needed.

In the standard model introduced by Fisher et al. (1943), each species contributes a Poisson-distributed number of individuals to the dataset, with a species-specific abundance parameter. Unobserved species are those that contribute zero individuals. Mixture models provide flexible models for the distribution of the abundance parameters.

Following Bunge and Barger (2008), we use a truncated mixture model of geometric distributions. We propose to perform parameter estimation in the Bayesian framework with a variational algorithm (Beal and Ghahramani, 2003). In this work, the number of components is not selected; instead, we use Bayesian model averaging to combine the estimates from all considered models. In particular, the variational framework provides an efficient way of computing the weights for each model.
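To make the role of truncation concrete, here is a deliberately simplified sketch: a single geometric component fitted by maximum likelihood to the zero-truncated counts, rather than the mixture with variational Bayesian estimation and model averaging described above. The function name `estimate_species_count` and the parameterisation P(X = k) = (1 - p) p^k for k = 0, 1, 2, ... are assumptions made for illustration.

```python
def estimate_species_count(counts):
    """Estimate the total number of species from zero-truncated counts.

    counts : number of individuals seen for each OBSERVED species (all >= 1)
    Assumes a single geometric abundance model, P(X = k) = (1 - p) * p**k,
    so that unobserved species (k = 0) occur with probability 1 - p.
    Returns (p_hat, estimated_total). Hypothetical helper for illustration.
    """
    if not counts or any(c < 1 for c in counts):
        raise ValueError("counts must be a non-empty list of integers >= 1")

    n_observed = len(counts)
    mean_count = sum(counts) / n_observed
    if mean_count <= 1.0:
        raise ValueError("only singletons observed: the estimate is unbounded")

    # Zero-truncated geometric: E[X | X >= 1] = 1 / (1 - p), so the MLE is
    # p_hat = 1 - 1 / mean_count.
    p_hat = 1.0 - 1.0 / mean_count

    # A species is observed with probability p, hence N_hat = n_observed / p_hat.
    estimated_total = n_observed / p_hat
    return p_hat, estimated_total


p_hat, total = estimate_species_count([1, 1, 2, 4])
print(p_hat, total)
```

In this simplified setting the estimated number of unobserved species is `total - len(counts)`; a mixture of several geometric components would replace the single `p_hat` by component-specific parameters and weights.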

Keywords: mixture models, Bayesian model averaging, variational methods, truncation, metagenomics

References

BEAL, M. J. and GHAHRAMANI, Z. (2003): The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayesian Statistics 7 (pp. 453–464).

BUNGE, J. and BARGER, K. (2008): Parametric models for estimating the number of classes. Biometrical Journal, 50(5).

FISHER, R. A., CORBET, A. S., and WILLIAMS, C. B. (1943): The Relation Between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population, Journal of Animal Ecology, 12, 42–58.

Sequential Monte Carlo techniques for

91893 Orsay cedex, France, samis.trevezas@ecp.fr
Ecole Centrale Paris, Laboratory of Applied Mathematics and Systems, 92290 Châtenay-Malabry, France, paul-henry.cournede@ecp.fr

Abstract. Parametric identification in plant growth models that can be formalized as discrete dynamical systems is a challenging problem because of the specific data acquisition (the system is assumed to be observed through destructive measurements), non-linear dynamics, model uncertainties and a high-dimensional parameter space. The general approach to parametric identification in dynamical models involves a stochastic framework for the model and the observation equations. When the dynamical system that describes the state evolution is non-linear with Gaussian noise, Kalman filtering techniques can be used as approximation schemes. Nevertheless, when applied properly, sequential Monte-Carlo (or particle filter) methods offer a better alternative for state and parameter estimation (Doucet and Johansen (2008)). In this talk, we present how sequential Monte-Carlo methods can be used for maximum likelihood estimation via an EM-type algorithm in plant growth modeling. In particular, we illustrate this method on a version of the functional-structural plant growth model called GreenLab (Cournède et al. (2006)). The observed vector consists of organ masses, measured by censoring the plant's evolution at a fixed observation time. The hidden states of the model represent the biomasses produced at every growth cycle. Under some assumptions, we show that the estimation problem can be tackled in the framework of hidden (latent variable) models (Cappé et al. (2005)), where an appropriate bivariate stochastic process describes the variables of the system. We use sequential Monte-Carlo to approximate the non-explicit E-step in the EM-type algorithm, and parametric bootstrap to obtain approximate confidence intervals for the MLE.
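As a rough illustration of the sequential Monte-Carlo idea, the sketch below implements a generic bootstrap particle filter for a scalar state observed with Gaussian noise. It is not the GreenLab model nor the EM-type algorithm of this talk: the toy transition `toy_transition`, the noise levels and all function names are hypothetical, and only the Python standard library is used.

```python
import math
import random


def toy_transition(x):
    """Hypothetical non-linear state transition used only for illustration."""
    return 0.9 * x + math.sin(x)


def bootstrap_particle_filter(observations, n_particles, transition,
                              proc_std, obs_std, rng):
    """Bootstrap particle filter for x_t = f(x_{t-1}) + N(0, proc_std^2),
    y_t = x_t + N(0, obs_std^2).

    Returns (filtered_means, log_likelihood_estimate).
    """
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    filtered_means = []
    log_lik = 0.0
    log_norm = math.log(obs_std * math.sqrt(2.0 * math.pi))
    for y in observations:
        # 1. Propagate each particle through the state equation.
        particles = [transition(x) + rng.gauss(0.0, proc_std) for x in particles]
        # 2. Weight by the Gaussian observation density (log scale for stability).
        log_w = [-0.5 * ((y - x) / obs_std) ** 2 - log_norm for x in particles]
        max_log_w = max(log_w)
        w = [math.exp(lw - max_log_w) for lw in log_w]
        total = sum(w)
        # Average unnormalised weight estimates p(y_t | y_1..t-1).
        log_lik += max_log_w + math.log(total / n_particles)
        w = [wi / total for wi in w]
        filtered_means.append(sum(wi * x for wi, x in zip(w, particles)))
        # 3. Multinomial resampling to fight weight degeneracy.
        particles = rng.choices(particles, weights=w, k=n_particles)
    return filtered_means, log_lik


# Simulate a short synthetic series and filter it.
rng = random.Random(0)
state, ys = 0.0, []
for _ in range(20):
    state = toy_transition(state) + rng.gauss(0.0, 0.3)
    ys.append(state + rng.gauss(0.0, 0.5))
means, log_lik = bootstrap_particle_filter(ys, 500, toy_transition, 0.3, 0.5, rng)
print(len(means), log_lik)
```

In an EM-type scheme, weighted particles of this kind would be used to approximate the expectations required in the E-step; the log-likelihood estimate returned here is a by-product that can help monitor convergence.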

Keywords: maximum likelihood estimation; parametric identification; plant growth model; sequential Monte-Carlo

References

DOUCET, A. and JOHANSEN, A.M. (2008): A tutorial on particle filtering and smoothing: fifteen years later. Technical report, Department of Statistics, University of British Columbia.

CAPPÉ, O., MOULINES, E. and RYDÉN, T. (2005): Inference in hidden Markov models. Springer.

COURNÈDE, P.H., KANG, M.Z., MATHIEU, A., BARCZI, J.F., YAN, H.P., HU, B.G. and DE REFFYE, P. (2006): Structural Factorization of Plants to Compute their Functional and Architectural Growth. Simulation 82(7).

