Pedro Contreras

  1. The Future of Search and Discovery in Big Data Analytics: Ultrametric Information Spaces.

    Authors: Fionn Murtagh, Pedro Contreras
    Subjects: Information Retrieval
    Abstract

    Consider observation data, comprised of n observation vectors with values on
    a set of attributes. This gives us n points in attribute space. Having data
    structured as a tree, implied by having our observations embedded in an
    ultrametric topology, offers great advantage for proximity searching. If we
    have preprocessed data through such an embedding, then an observation's nearest
    neighbor is found in constant computational time, i.e. O(1) time. A further
    powerful approach is discussed in this work: the inducing of a hierarchy, and
    hence a tree, in linear computational time, i.e.

  2. Fast, Linear Time, m-Adic Hierarchical Clustering for Search and Retrieval using the Baire Metric, with linkages to Generalized Ultrametrics, Hashing, Formal Concept Analysis, and Precision of Data Measurement.

    Authors: Fionn Murtagh, Pedro Contreras
    Subjects: Machine Learning
    Abstract

    We describe many vantage points on the Baire metric and its use in clustering
    data, or its use in preprocessing and structuring data in order to support
    search and retrieval operations. In some cases, we proceed directly to clusters
    and do not directly determine the distances. We show how a hierarchical
    clustering can be read directly from one pass through the data. We offer
    insights also on practical implications of precision of data measurement. As a
    mechanism for treating multidimensional data, including very high dimensional
    data, we use random projections.
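
    The random projection step mentioned above can be sketched as follows. This
    is a minimal illustration, not the authors' implementation: the function
    names, the single Gaussian projection axis, and the min-max rescaling to
    [0, 1] are all assumptions made for the example.

```python
import random


def random_projection_1d(vectors, seed=0):
    """Project each vector onto one random unit-length axis (hypothetical helper)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Draw a random direction and normalise it to unit length.
    axis = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(a * a for a in axis) ** 0.5
    axis = [a / norm for a in axis]
    # The projection of a vector is its dot product with the axis.
    return [sum(v_i * a_i for v_i, a_i in zip(v, axis)) for v in vectors]


def rescale_unit_interval(values):
    """Min-max rescale projected values to [0, 1] so their digits can be compared."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


vectors = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [4.0, 0.0, -1.0]]
projected = random_projection_1d(vectors, seed=1)
scaled = rescale_unit_interval(projected)
```

    Identical input vectors always receive identical projected values, so they
    stay together in any clustering built on the digits of the scaled values.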

  3. Fast, Linear Time Hierarchical Clustering using the Baire Metric.

    Authors: Fionn Murtagh, Pedro Contreras
    Subjects: Machine Learning
    Abstract

    The Baire metric induces an ultrametric on a dataset and is of linear
    computational complexity, contrasted with the standard quadratic time
    agglomerative hierarchical clustering algorithm. In this work we empirically
    evaluate this new approach to hierarchical clustering. We compare
    hierarchical clustering based on the Baire metric with (i) agglomerative
    hierarchical clustering, in terms of algorithm properties; (ii) generalized
    ultrametrics, in terms of definition; and (iii) fast clustering through k-means
    partitioning, in terms of quality of results.
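
    To make the idea of a longest-common-prefix (Baire) distance concrete, here
    is a minimal sketch. It assumes values in [0, 1) supplied as decimal strings,
    a distance of base^(-k) where k is the number of shared leading digits, and
    a default base of 2; the function names and these choices are illustrative
    assumptions, not taken from the paper. The hierarchy is built in a single
    pass over the data, with work proportional to n times the digit precision.

```python
from collections import defaultdict


def fractional_digits(value, precision):
    """Return the first `precision` digits after the decimal point, zero-padded."""
    frac = value.split(".")[1] if "." in value else ""
    return frac[:precision].ljust(precision, "0")


def baire_distance(x, y, precision, base=2.0):
    """Distance base**(-k), where k is the length of the common digit prefix."""
    dx = fractional_digits(x, precision)
    dy = fractional_digits(y, precision)
    k = 0
    for a, b in zip(dx, dy):
        if a != b:
            break
        k += 1
    return base ** (-k)


def baire_hierarchy(values, precision):
    """One pass over the data: bucket each item by its digit prefix at every level."""
    levels = {level: defaultdict(list) for level in range(1, precision + 1)}
    for index, value in enumerate(values):
        digits = fractional_digits(value, precision)
        for level in range(1, precision + 1):
            levels[level][digits[:level]].append(index)
    return levels


values = ["0.4251", "0.4258", "0.4312", "0.9001"]
levels = baire_hierarchy(values, precision=4)
```

    Each bucket at level k is a cluster whose members share their first k
    digits, and buckets at level k + 1 are nested inside those at level k, so
    the levels together form a tree without any pairwise distance computation.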

  4. Methods of Hierarchical Clustering.

    Authors: Fionn Murtagh, Pedro Contreras
    Subjects: Information Retrieval
    Abstract

    We survey agglomerative hierarchical clustering algorithms and discuss
    efficient implementations that are available in R and other software
    environments. We look at hierarchical self-organizing maps, and mixture models.
    We review grid-based clustering, focusing on hierarchical density-based
    approaches. Finally we describe a recently developed very efficient (linear
    time) hierarchical clustering algorithm, which can also be viewed as a
    hierarchical grid-based algorithm.

  5. Hierarchical Clustering for Finding Symmetries and Other Patterns in Massive, High Dimensional Datasets.

    Authors: Fionn Murtagh, Pedro Contreras
    Subjects: Machine Learning
    Abstract

    Data analysis and data mining are concerned with unsupervised pattern finding
    and structure determination in data sets. "Structure" can be understood as
    symmetry and a range of symmetries are expressed by hierarchy. Such symmetries
    directly point to invariants that pinpoint intrinsic properties of the data
    and of the background empirical domain of interest. We review many aspects of
    hierarchy here, including ultrametric topology, generalized ultrametric,
    linkages with lattices and other discrete algebraic structures and with p-adic
    number representations.

  6. Scale-Based Gaussian Coverings: Combining Intra and Inter Mixture Models in Image Segmentation.

    Authors: Fionn Murtagh, Pedro Contreras, Jean-Luc Starck
    Subjects: Computer Vision and Pattern Recognition
    Abstract

    By a "covering" we mean a Gaussian mixture model fit to observed data.
    Approximations of the Bayes factor can be availed of to judge model fit to the
    data within a given Gaussian mixture model. Between families of Gaussian
    mixture models, we propose the Rényi quadratic entropy as an excellent and
    tractable model comparison framework.
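
    One reason the Rényi quadratic entropy is tractable for Gaussian mixtures is
    that the integral of the squared density has a closed form: the product of
    two Gaussians integrates to a Gaussian evaluated at the difference of the
    means, with the variances summed. The sketch below applies this standard
    identity to a one-dimensional mixture. The restriction to one dimension and
    the function names are assumptions for illustration; this is not presented
    as the authors' procedure.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def renyi_quadratic_entropy(weights, means, variances):
    """H_2 = -log of the integral of p(x)^2 for a 1-D Gaussian mixture p.

    Uses the identity: integral of N(x; m_i, v_i) * N(x; m_j, v_j) dx
    equals N(m_i; m_j, v_i + v_j).
    """
    components = list(zip(weights, means, variances))
    total = 0.0
    for w_i, m_i, v_i in components:
        for w_j, m_j, v_j in components:
            total += w_i * w_j * gaussian_pdf(m_i, m_j, v_i + v_j)
    return -math.log(total)


single = renyi_quadratic_entropy([1.0], [0.0], [1.0])
split = renyi_quadratic_entropy([0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
```

    For a single unit-variance Gaussian the result equals log(2 * sqrt(pi)), and
    splitting that Gaussian into two identical half-weight components leaves the
    entropy unchanged, as expected for the same underlying density.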
