We present the nested Chinese restaurant process (nCRP), a stochastic process
that assigns probability distributions to infinitely deep,
infinitely branching trees. We show how this stochastic process can be used as
a prior distribution in a Bayesian nonparametric model of document collections.
Specifically, we present an application to information retrieval in which
documents are modeled as paths down a random tree, and the preferential
attachment dynamics of the nCRP lead to clustering of documents according to
sharing of topics at multiple levels of abstraction. Given a corpus of
documents, a posterior inference algorithm finds an approximation to a
posterior distribution over trees, topics and allocations of words to levels of
the tree. We demonstrate this algorithm on collections of scientific abstracts
from several journals. This model exemplifies a recent trend in statistical
machine learning: the use of Bayesian nonparametric methods to infer
distributions on flexible data structures.
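
To make the path-drawing mechanism concrete, the following is a minimal sketch of how document paths might be sampled from an nCRP prior. It is an illustration under stated assumptions, not the inference algorithm described above: the tree is truncated to a fixed depth (the process itself is infinitely deep), and the function name `sample_ncrp_paths` and the concentration parameter `gamma` are hypothetical names introduced here. At each node, a document follows an existing branch with probability proportional to the number of earlier documents that took it, or opens a new branch with probability proportional to `gamma`.

```python
import random
from collections import defaultdict


def sample_ncrp_paths(num_docs, depth, gamma, seed=0):
    """Draw one root-to-leaf path per document from a depth-truncated nCRP.

    Assumptions (not from the source text): fixed truncation `depth`,
    a single concentration parameter `gamma` > 0 shared by every node.
    A node is identified by the tuple of child indices leading to it;
    the root is the empty tuple.
    """
    rng = random.Random(seed)
    # counts[node][k] = number of earlier documents that went from node to child k
    counts = defaultdict(dict)
    paths = []
    for _ in range(num_docs):
        node = ()
        for _ in range(depth):
            children = counts[node]
            total = sum(children.values())
            # Chinese restaurant process step at this node:
            #   existing child k with probability n_k / (total + gamma)
            #   new child        with probability gamma / (total + gamma)
            r = rng.uniform(0.0, total + gamma)
            choice = len(children)  # default: open a new branch
            acc = 0.0
            for k, n in children.items():
                acc += n
                if r < acc:
                    choice = k
                    break
            children[choice] = children.get(choice, 0) + 1
            node = node + (choice,)
        paths.append(node)
    return paths


if __name__ == "__main__":
    for path in sample_ncrp_paths(num_docs=10, depth=3, gamma=1.0):
        print(path)
```

Documents whose paths share a long prefix share nodes near the root and therefore, in the model, the more general topics; the first document always receives the path `(0, 0, 0)` because every node it visits is new. Smaller values of `gamma` concentrate documents on fewer branches.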