Consider observation data comprising n observation vectors with values on
a set of attributes. This gives us n points in attribute space. Having data
structured as a tree, implied by having our observations embedded in an
ultrametric topology, offers great advantage for proximity searching. If we
have preprocessed data through such an embedding, then an observation's nearest
neighbor is found in constant computational time, i.e. O(1) time. A further
powerful approach is discussed in this work: the inducing of a hierarchy, and
hence a tree, in linear computational time, i.e. O(n) time.
There are very significant changes taking place in the university sector and
in related higher education institutes in many parts of the world. In this work
we look at financial data from 2010 and 2011 from the UK higher education
sector. Situating ourselves to begin with in the context of teaching versus
research in universities, we look at the data in order to explore the new
divergence between the broad agendas of teaching and research in universities.
The innovation agenda has become at least equal to the research and teaching
objectives of universities.
The Ward error sum of squares hierarchical clustering method has been very
widely used since its first description by Ward in a 1963 publication. It has
also been generalized in various ways. However, there are different
interpretations in the literature and there are different implementations of
the Ward agglomerative algorithm in commonly used software systems, including
differing expressions of the agglomerative criterion. Our survey work and case
studies will be useful for all those involved in developing software for data
analysis using Ward's hierarchical clustering method.
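
As a minimal sketch of one common expression of the Ward agglomerative criterion, the cost of merging two clusters can be written as the increase in the error sum of squares, which equals n_a n_b / (n_a + n_b) times the squared Euclidean distance between the two centroids. The function names below are hypothetical, and NumPy is an assumption of this sketch. Software implementations may scale this quantity or take its square root, which is one reason reported criterion values can differ between systems.

```python
import numpy as np


def error_sum_of_squares(points):
    """Sum of squared deviations of the points from their centroid."""
    p = np.asarray(points, dtype=float)
    return float(((p - p.mean(axis=0)) ** 2).sum())


def ward_merge_cost(cluster_a, cluster_b):
    """Increase in error sum of squares caused by merging two clusters.

    Illustrative form: n_a * n_b / (n_a + n_b) * ||c_a - c_b||^2,
    where c_a and c_b are the cluster centroids.
    """
    a = np.asarray(cluster_a, dtype=float)
    b = np.asarray(cluster_b, dtype=float)
    n_a, n_b = len(a), len(b)
    diff = a.mean(axis=0) - b.mean(axis=0)
    return n_a * n_b / (n_a + n_b) * float(diff @ diff)


# Toy data (hypothetical): a two-point cluster and a singleton.
a = [[0.0, 0.0], [0.0, 2.0]]
b = [[4.0, 1.0]]

# The closed form agrees with the direct difference of error sums of squares.
direct = error_sum_of_squares(a + b) - error_sum_of_squares(a) - error_sum_of_squares(b)
closed_form = ward_merge_cost(a, b)
```

At each agglomeration, the pair of clusters with the smallest such cost would be merged.
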
We describe many vantage points on the Baire metric and its use in clustering
data, or its use in preprocessing and structuring data in order to support
search and retrieval operations. In some cases, we proceed directly to clusters
and do not directly determine the distances. We show how a hierarchical
clustering can be read directly from one pass through the data. We offer
insights also on practical implications of precision of data measurement. As a
mechanism for treating multidimensional data, including very high dimensional
data, we use random projections.
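
The idea of reading clusters directly from the data can be illustrated with a small sketch. It assumes values in [0, 1) given as decimal strings, a chosen digit precision, and a distance of base^(-k), where k is the length of the longest common digit prefix. The function names, the default base of 2, and the choice to return 0 for values identical at the given precision are assumptions of this sketch, not details taken from the work summarized above.

```python
from collections import defaultdict


def fractional_digits(value: str, precision: int) -> str:
    """Digits after the decimal point, truncated or zero-padded to `precision`."""
    frac = value.split(".", 1)[1] if "." in value else ""
    return frac[:precision].ljust(precision, "0")


def baire_distance(x: str, y: str, precision: int = 4, base: float = 2.0) -> float:
    """base**(-k), where k is the length of the longest common digit prefix."""
    dx = fractional_digits(x, precision)
    dy = fractional_digits(y, precision)
    k = 0
    for a, b in zip(dx, dy):
        if a != b:
            break
        k += 1
    # Assumption: values identical to the chosen precision are at distance 0.
    return 0.0 if k == precision else base ** -k


def baire_hierarchy(values, precision: int = 4):
    """One pass over the data: bucket each value by its digit prefixes.

    levels[k - 1] maps each prefix of length k to the values sharing it,
    so deeper levels give finer clusters. No pairwise distances are computed.
    """
    levels = [defaultdict(list) for _ in range(precision)]
    for v in values:
        d = fractional_digits(v, precision)
        for k in range(1, precision + 1):
            levels[k - 1][d[:k]].append(v)
    return levels


values = ["0.4251", "0.4267", "0.4312", "0.9001"]
levels = baire_hierarchy(values)
# "0.4251" and "0.4267" share the prefix "42", so k = 2 and the distance is 2**-2.
d = baire_distance("0.4251", "0.4267")
```

The work per observation is bounded by the digit precision, so the cost of building the hierarchy grows linearly with the number of observations. Using strings rather than floats avoids rounding surprises when extracting digits.
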
The Baire metric induces an ultrametric on a dataset and is of linear
computational complexity, contrasted with the standard quadratic time
agglomerative hierarchical clustering algorithm. In this work we evaluate
empirically this new approach to hierarchical clustering. We compare
hierarchical clustering based on the Baire metric with (i) agglomerative
hierarchical clustering, in terms of algorithm properties; (ii) generalized
ultrametrics, in terms of definition; and (iii) fast clustering through k-means
partitioning, in terms of quality of results.
We survey agglomerative hierarchical clustering algorithms and discuss
efficient implementations that are available in R and other software
environments. We look at hierarchical self-organizing maps, and mixture models.
We review grid-based clustering, focusing on hierarchical density-based
approaches. Finally, we describe a recently developed, very efficient (linear
time) hierarchical clustering algorithm, which can also be viewed as a
hierarchical grid-based algorithm.
Our work has focused on support for film or television scriptwriting. Since
this involves potentially varied story-lines, we note the implicit or latent
support for interactivity. Furthermore the film, television, games, publishing
and other sectors are converging, so that cross-over and re-use of one form of
product in another of these sectors is ever more common. Technically our work
has been largely based on mathematical algorithms for data clustering and
display. Operationally, we also discuss how our algorithms can support
collective, distributed problem-solving.
Following a review of metric, ultrametric and generalized ultrametric, we
review their application in data analysis. We show how they allow us to explore
both geometry and topology of information, starting with measured data. Some
themes are then developed based on the use of metric, ultrametric and
generalized ultrametric in logic. In particular we study approximation chains
in an ultrametric or generalized ultrametric context.
The Lady Maisry ballads afford us a framework within which to segment a
storyline into its major components. Segments and, as a consequence, nodal points
are discussed for nine different variants of the Lady Maisry story of a (young)
woman being burnt to death by her family, on account of her becoming pregnant
by a foreign personage. We motivate the importance of nodal points in textual
and literary analysis. We show too how the openings of the nine variants can be
analyzed comparatively, and also the conclusions of the ballads.
Data analysis and data mining are concerned with unsupervised pattern finding
and structure determination in data sets. "Structure" can be understood as
symmetry and a range of symmetries are expressed by hierarchy. Such symmetries
directly point to invariants that pinpoint intrinsic properties of the data
and of the background empirical domain of interest. We review many aspects of
hierarchy here, including ultrametric topology, generalized ultrametric,
linkages with lattices and other discrete algebraic structures and with p-adic
number representations.
By a "covering" we mean a Gaussian mixture model fit to observed data.
Approximations of the Bayes factor can be used to judge model fit to the
data within a given Gaussian mixture model. Between families of Gaussian
mixture models, we propose the Rényi quadratic entropy as an excellent and
tractable model comparison framework.
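
Part of what makes the Rényi quadratic entropy tractable for Gaussian mixtures is that it has a closed form: H2 = -log of the integral of p(x)^2, and the integral of a product of two Gaussians is itself a Gaussian density evaluated at the difference of the means, with the variances added. The sketch below shows this for a univariate mixture only. The function names are hypothetical, and it illustrates the entropy computation, not the model comparison procedure proposed in the work.

```python
import math


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def renyi_quadratic_entropy(weights, means, variances) -> float:
    """Rényi quadratic entropy, -log(integral of p(x)^2), of a 1-D Gaussian mixture.

    Uses the identity
        integral N(x; m_i, v_i) N(x; m_j, v_j) dx = N(m_i; m_j, v_i + v_j),
    so the integral of p^2 is a double sum over component pairs.
    """
    components = list(zip(weights, means, variances))
    total = 0.0
    for w_i, m_i, v_i in components:
        for w_j, m_j, v_j in components:
            total += w_i * w_j * gaussian_pdf(m_i, m_j, v_i + v_j)
    return -math.log(total)


# Single standard Gaussian: the integral of p^2 is 1 / sqrt(4 * pi).
h_single = renyi_quadratic_entropy([1.0], [0.0], [1.0])

# A hypothetical two-component mixture, for comparison.
h_mixture = renyi_quadratic_entropy([0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
```

No numerical integration is needed, and the cost is quadratic in the number of mixture components.
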