It is becoming common to archive research datasets that are not only large
but also numerous. In addition, their corresponding metadata and the software
required to analyse or display them need to be archived. Yet the manual
curation of research data can be di?cult and expensive, particularly in very
large digital repositories, hence the importance of models and tools for
automating digital curation tasks.
Iterated hash functions process strings recursively, one character at a time.
At each iteration, they compute a new hash value from the preceding hash value
and the next character. We prove that iterated hashing can be pairwise
independent, but never 3-wise independent. We show that it can be almost
universal over strings much longer than the number of hash values; we bound the
maximal string length given the collision probability.
Column-oriented indexes-such as projection or bitmap indexes-are compressed
by run-length encoding to reduce storage and increase speed. Sorting the tables
improves compression. On realistic data sets, permuting the columns in the
right order before sorting can reduce the number of runs by a factor of two or
more. For many cases, we prove that the number of runs in table columns is
minimized if we sort columns by increasing cardinality. Yet-maybe
surprisingly-we must sometimes maximize the number of runs to minimize the
index size.