Consider observation data, comprised of n observation vectors with values on
a set of attributes. This gives us n points in attribute space. Having data
structured as a tree, implied by having our observations embedded in an
ultrametric topology, offers great advantage for proximity searching. If we
have preprocessed data through such an embedding, then an observation's nearest
neighbor is found in constant computational time, i.e. O(1) time. A further
powerful approach is discussed in this work: the inducing of a hierarchy, and
hence a tree, in linear computational time, i.e.
We describe many vantage points on the Baire metric and its use in clustering
data, or its use in preprocessing and structuring data in order to support
search and retrieval operations. In some cases, we proceed directly to clusters
and do not directly determine the distances. We show how a hierarchical
clustering can be read directly from one pass through the data. We offer
insights also on practical implications of precision of data measurement. As a
mechanism for treating multidimensional data, including very high dimensional
data, we use random projections.
The Baire metric induces an ultrametric on a dataset and is of linear
computational complexity, contrasted with the standard quadratic time
agglomerative hierarchical clustering algorithm. In this work we evaluate
empirically this new approach to hierarchical clustering. We compare
hierarchical clustering based on the Baire metric with (i) agglomerative
hierarchical clustering, in terms of algorithm properties; (ii) generalized
ultrametrics, in terms of definition; and (iii) fast clustering through k-means
partititioning, in terms of quality of results.
We survey agglomerative hierarchical clustering algorithms and discuss
efficient implementations that are available in R and other software
environments. We look at hierarchical self-organizing maps, and mixture models.
We review grid-based clustering, focusing on hierarchical density-based
approaches. Finally we describe a recently developed very efficient (linear
time) hierarchical clustering algorithm, which can also be viewed as a
hierarchical grid-based algorithm.
Data analysis and data mining are concerned with unsupervised pattern finding
and structure determination in data sets. "Structure" can be understood as
symmetry and a range of symmetries are expressed by hierarchy. Such symmetries
directly point to invariants, that pinpoint intrinsic properties of the data
and of the background empirical domain of interest. We review many aspects of
hierarchy here, including ultrametric topology, generalized ultrametric,
linkages with lattices and other discrete algebraic structures and with p-adic
number representations.
By a "covering" we mean a Gaussian mixture model fit to observed data.
Approximations of the Bayes factor can be availed of to judge model fit to the
data within a given Gaussian mixture model. Between families of Gaussian
mixture models, we propose the R\'enyi quadratic entropy as an excellent and
tractable model comparison framework.