We consider the problem of cardinality estimation in data stream
applications, focusing on two techniques that use pseudo-random variates to
form low-dimensional data sketches. We apply conventional statistical methods
to compare algorithms based on storing either selected order statistics or
random projections. We derive estimators of the cardinality in both cases and
show that the maximal-term estimator is recursively computable and has
exponentially decreasing error bounds.
We consider the problem of approximating the empirical Shannon entropy of a
high-frequency data stream when space limitations make exact computation
infeasible. It is known that \alpha-dependent quantities such as the Renyi and
Tsallis entropies can be estimated efficiently and unbiasedly from
low-dimensional \alpha-stable data sketches. An approximation to the Shannon
entropy can be obtained from either of these quantities by taking \alpha
sufficiently close to 1. However, practical guidelines for the choice of
$\alpha$ are lacking. We avoid this problem by going directly to the limit.