We present a new sequential Monte Carlo sampler for coalescent-based
Bayesian hierarchical clustering. Our model is appropriate for modeling
non-i.i.d. data and offers a substantial reduction of computational cost when
compared to the original sampler without resorting to approximations. We also
propose a quadratic complexity approximation that in practice shows almost no
loss in performance compared to its exact counterpart.

Gaussian factor models have proven widely useful for parsimoniously
characterizing dependence in multivariate data. There is a rich literature on
their extension to mixed categorical and continuous variables, using latent
Gaussian variables or through generalized latent trait models accommodating
measurements in the exponential family. However, when generalizing to
non-Gaussian measured variables, the latent variables typically influence both
the dependence structure and the form of the marginal distributions,
complicating interpretation and introducing artifacts.
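
For reference, the Gaussian factor model referred to above is commonly written as follows. This is the standard textbook form rather than notation taken from the source: the symbols $y_i$ (a $p$-dimensional observation), $\Lambda$ (a $p \times k$ loadings matrix with $k \ll p$), $\eta_i$ (latent factors), and $\Sigma$ (diagonal residual covariance) are all assumptions introduced here for illustration.

```latex
\begin{align*}
  % Standard Gaussian factor model (assumed notation, not from the source)
  y_i &= \Lambda \eta_i + \epsilon_i, \qquad
  \eta_i \sim \mathrm{N}_k(0, I_k), \qquad
  \epsilon_i \sim \mathrm{N}_p(0, \Sigma), \qquad
  \Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_p^2), \\
  % Marginalizing over the latent factors gives the implied covariance
  \mathrm{cov}(y_i) &= \Lambda \Lambda^{\top} + \Sigma .
\end{align*}
```

Under this form the dependence among the $p$ measured variables is carried by the low-rank term $\Lambda \Lambda^{\top}$, which needs on the order of $pk$ parameters rather than the $p(p+1)/2$ of an unrestricted covariance matrix; this is the sense in which the characterization is parsimonious.
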

Unbiased, label-free proteomics is becoming a powerful technique for
measuring protein expression in almost any biological sample. The output of
these measurements after preprocessing is a collection of features (tens to
hundreds of thousands) and their associated intensities for each sample. Subsets
of features within the data are from the same peptide, subsets of peptides are
from the same protein, and subsets of proteins are in the same biological
pathways; therefore, there is the potential for very complex and informative
correlational structure inherent in these data.