Daniel Lemire

  1. Extracting, Transforming and Archiving Scientific Data.

    Authors: Daniel Lemire, Andre Vellino
    Subjects: Digital Libraries
    Abstract

    It is becoming common to archive research datasets that are not only large
    but also numerous. In addition, their corresponding metadata and the software
    required to analyse or display them need to be archived. Yet the manual
    curation of research data can be difficult and expensive, particularly in very
    large digital repositories, hence the importance of models and tools for
    automating digital curation tasks.

  2. The universality of iterated hashing over variable-length strings.

    Authors: Daniel Lemire
    Subjects: Databases
    Abstract

    Iterated hash functions process strings recursively, one character at a time.
    At each iteration, they compute a new hash value from the preceding hash value
    and the next character. We prove that iterated hashing can be pairwise
    independent, but never 3-wise independent. We show that it can be almost
    universal over strings much longer than the number of hash values; we bound the
    maximal string length given the collision probability.
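
    As a minimal sketch of the recursion described in this abstract, the
    following computes a hash one character at a time from the preceding hash
    value. The polynomial form, the function name `iterated_hash`, and the
    `multiplier` and `modulus` defaults are illustrative assumptions; they are
    not taken from the paper and carry none of its independence guarantees.

```python
def iterated_hash(s, multiplier=31, modulus=2**31 - 1):
    """Hash a string one character at a time.

    Each step derives the new hash value from the preceding hash value
    and the next character. The constants are arbitrary illustrative
    choices, not values from the paper.
    """
    h = 0  # hash value of the empty string
    for c in s:
        # new value = f(previous value, next character)
        h = (h * multiplier + ord(c)) % modulus
    return h


if __name__ == "__main__":
    print(iterated_hash("ab"))
```

    Because the state carried between iterations is a single hash value, two
    prefixes that collide stay collided for every common suffix, which is one
    reason string length matters when bounding the collision probability.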

  3. Reordering Columns for Smaller Indexes.

    Authors: Daniel Lemire, Owen Kaser
    Subjects: Databases
    Abstract

    Column-oriented indexes (such as projection or bitmap indexes) are compressed
    by run-length encoding to reduce storage and increase speed. Sorting the tables
    improves compression. On realistic data sets, permuting the columns in the
    right order before sorting can reduce the number of runs by a factor of two or
    more. For many cases, we prove that the number of runs in table columns is
    minimized if we sort columns by increasing cardinality. Yet, maybe
    surprisingly, we must sometimes maximize the number of runs to minimize the
    index size.
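
    The effect of column order on run counts can be illustrated with a small
    sketch. It sorts a toy table lexicographically under a given column order
    and counts the runs that run-length encoding would store per column. The
    helper names `count_runs` and `total_runs_after_sort` and the toy table are
    hypothetical; this is not the authors' code or data.

```python
from itertools import groupby


def count_runs(column):
    """Number of maximal runs of equal consecutive values in a column."""
    return sum(1 for _ in groupby(column))


def total_runs_after_sort(rows, column_order):
    """Permute columns, sort rows lexicographically, and sum runs over columns."""
    reordered = sorted(tuple(row[i] for i in column_order) for row in rows)
    return sum(count_runs(col) for col in zip(*reordered))


# Toy table: column 0 has cardinality 2, column 1 has cardinality 4.
rows = [(a, b) for a in range(2) for b in range(4)]

if __name__ == "__main__":
    print(total_runs_after_sort(rows, (0, 1)))  # lower-cardinality column first
    print(total_runs_after_sort(rows, (1, 0)))  # higher-cardinality column first
```

    On this toy table, putting the lower-cardinality column first yields fewer
    runs in total (10 versus 12), in line with the abstract's claim for many
    cases. The sketch counts runs only and does not model index size.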
