Fully Bayesian computation for comparative metagenomics
Statistical comparison of the abundances of gene categories, pooled across all taxa, is among the standard approaches to analysing metagenomic data sets. We recently demonstrated that the relative abundances of gene categories used in such analyses are biased by the average genome sizes of the communities being compared. Correcting for this bias and for gene length bias enables estimation of gene abundances on an ecologically meaningful scale, as community-averaged gene copy numbers. We experimented with fully Bayesian methods that address this issue and can propagate the error in the estimation of nuisance parameters (here, average genome sizes) in a statistically sound manner. Sampling from posterior distributions using Markov chain Monte Carlo (MCMC) is a versatile tool for parameter estimation in the Bayesian framework; however, data set sizes typical of a metagenomic analysis are far beyond those that can be handled by general-purpose sampling software. We developed an adaptive MCMC implementation that scales to real-life metagenomic data sets with tens of thousands of parameters and allows flexible use of posterior distributions to address a wider range of questions than standard statistical techniques allow. Our results suggest that, although computationally less demanding approaches such as marginal likelihood, empirical or approximate Bayes might be preferable when available, the full flexibility of Bayesian computation is not out of reach even for the rather parameter-rich problems of metagenomics.
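To make the term "adaptive MCMC" concrete, the following is a minimal sketch of one common variant: component-wise random-walk Metropolis in which each coordinate's proposal width is tuned during sampling with a diminishing (Robbins-Monro) step. It is an illustration only and should not be read as the implementation described here; the function name `adaptive_metropolis`, the toy two-parameter Gaussian target `log_post`, and all tuning constants are assumptions made for the example.

```python
import math
import random


def adaptive_metropolis(log_post, x0, n_iter=5000, target_accept=0.44, seed=0):
    """Component-wise random-walk Metropolis with adaptive proposal widths.

    Hypothetical helper for illustration; not the authors' sampler.
    target_accept=0.44 is the conventional target for one-dimensional updates.
    """
    rng = random.Random(seed)
    x = list(x0)
    log_step = [0.0] * len(x)  # log of the proposal std. dev., one per coordinate
    lp = log_post(x)
    samples = []
    for t in range(1, n_iter + 1):
        for i in range(len(x)):
            # Propose a change to coordinate i only.
            prop = list(x)
            prop[i] += rng.gauss(0.0, math.exp(log_step[i]))
            lp_prop = log_post(prop)
            accept_prob = math.exp(min(0.0, lp_prop - lp))
            if rng.random() < accept_prob:
                x, lp = prop, lp_prop
            # Diminishing adaptation: widen the proposal when acceptance is
            # above target, narrow it when below; the 1/sqrt(t) factor makes
            # the adjustments fade as sampling proceeds.
            log_step[i] += (accept_prob - target_accept) / math.sqrt(t)
        samples.append(list(x))
    return samples, [math.exp(s) for s in log_step]


def log_post(x):
    # Toy target (assumption): independent normals with std. dev. 1 and 10.
    return -0.5 * (x[0] ** 2 + (x[1] / 10.0) ** 2)


samples, steps = adaptive_metropolis(log_post, [0.0, 0.0])
print("adapted proposal widths:", steps)
```

In this toy run the adapted width for the second, more diffuse parameter ends up larger than for the first, which is the behaviour that removes the need for hand-tuning each parameter. Scaling such a scheme to tens of thousands of parameters would additionally depend on evaluating only the part of the posterior affected by each update, which this sketch does not attempt.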