We develop an abstract model of information acquisition from redundant data.
We assume a random sampling process over data that provide information with
bias, and we are interested in the fraction of information we expect to learn
as a function of (i) the sampled fraction (recall) and (ii) the varying bias
of information (redundancy distributions). We develop two rules of thumb with
varying robustness.
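To make the two parameters concrete, the following is a minimal sketch and not the model developed in the paper. It assumes that each data item is sampled independently with probability equal to the recall (Bernoulli sampling) and that a piece of information counts as learned once at least one of its redundant copies is sampled. The function name `expected_fraction_learned` and the representation of a redundancy distribution as a list of copy counts are illustrative assumptions.

```python
def expected_fraction_learned(recall, redundancies):
    """Expected fraction of distinct pieces of information that are learned.

    recall       -- probability that any single data item is sampled
                    (assumption: items are sampled independently).
    redundancies -- list of copy counts, one entry per piece of information
                    (assumption: a hypothetical encoding of the
                    redundancy distribution).
    """
    if not 0.0 <= recall <= 1.0:
        raise ValueError("recall must lie in [0, 1]")
    if not redundancies:
        return 0.0
    # A piece with k copies is missed only if all k copies are missed,
    # which happens with probability (1 - recall) ** k.
    seen = [1.0 - (1.0 - recall) ** k for k in redundancies]
    return sum(seen) / len(seen)
```

Under these assumptions, at recall 0.5 a uniform redundancy of `[2, 2, 2]` yields an expected fraction of 0.75, whereas the skewed distribution `[4, 1, 1]` over the same number of data items yields roughly 0.646.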

An answer to a query has a well-defined lineage expression (alternatively
called how-provenance) that explains how the answer was derived. Recent work
has also shown how to compute the lineage of a non-answer to a query. However,
the cause of an answer or non-answer is a more subtle notion and consists, in
general, of only a fragment of the lineage. In this paper, we adapt Halpern,
Pearl, and Chockler's recent definitions of causality and responsibility to
define the causes of answers and non-answers to queries, and their degree of
responsibility.
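The distinction between lineage and cause can be illustrated with a brute-force sketch. It is not the paper's algorithm, and it rests on several assumptions: the lineage of an answer is given as a list of conjuncts (each a set of tuple identifiers, read as a DNF formula), every tuple may appear in a contingency set, a tuple `t` is a cause if there is a contingency set `gamma` whose removal keeps the answer while additionally removing `t` eliminates it, and responsibility is `1 / (1 + |gamma|)` for the smallest such set. The names `holds` and `responsibility` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def holds(lineage, present):
    """The answer holds if some conjunct of the lineage is fully present."""
    return any(conj <= present for conj in lineage)


def responsibility(lineage, t):
    """Responsibility of tuple t for an answer, by exhaustive search.

    lineage -- list of sets of tuple identifiers (assumed DNF encoding).
    Returns 1 / (1 + size of the smallest contingency set), or 0 if t is
    not a cause. Exponential in the number of tuples; illustration only.
    """
    all_tuples = frozenset().union(*lineage)
    others = sorted(all_tuples - {t})
    # Try contingency sets in order of increasing size, so the first hit
    # is a minimum-size one.
    for size in range(len(others) + 1):
        for gamma in combinations(others, size):
            remaining = all_tuples - set(gamma)
            if holds(lineage, remaining) and not holds(lineage, remaining - {t}):
                return Fraction(1, 1 + size)
    return Fraction(0)
```

For the lineage `[{"a", "b"}, {"a", "c"}]`, tuple `a` needs no contingency and has responsibility 1, while `b` becomes critical only after `c` is removed and has responsibility 1/2.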

Over the past 40 years, database management systems (DBMSs) have evolved to
provide a sophisticated variety of data management capabilities. At the same
time, tools for managing queries over the data have remained relatively
primitive. One reason for this is that queries are typically issued through
applications. They are thus debugged once and re-used repeatedly. This mode of
interaction, however, is changing. As scientists (and others) store and share
increasingly large volumes of data in data centers, they need the ability to
analyze the data by issuing exploratory queries.