While users today have access to many tools that assist in performing large
scale data analysis tasks, understanding the performance characteristics of
their parallel computations, such as MapReduce jobs, remains difficult. We
present PerfXplain, a system that enables users to ask questions about the
relative performances (i.e., runtimes) of pairs of MapReduce jobs. PerfXplain
provides a new query language for articulating performance queries and an
algorithm for generating explanations from a log of past MapReduce job
executions.
An answer to a query has a well-defined lineage expression (alternatively
called how-provenance) that explains how the answer was derived. Recent work
has also shown how to compute the lineage of a non-answer to a query. However,
the cause of an answer or non-answer is a more subtle notion and consists, in
general, of only a fragment of the lineage. In this paper, we adapt Halpern,
Pearl, and Chockler's recent definitions of causality and responsibility to
define the causes of answers and non-answers to queries, and their degree of
responsibility.
We show that it is possible to significantly improve the accuracy of a
general class of histogram queries while satisfying differential privacy. Our
approach carefully chooses a set of queries to evaluate, and then exploits
consistency constraints that should hold over the noisy output. In a
post-processing phase, we compute the consistent input most likely to have
produced the noisy output. The final output is differentially-private and
consistent, but in addition, it is often much more accurate.
Over the past 40 years, database management systems (DBMSs) have evolved to
provide a sophisticated variety of data management capabilities. At the same
time, tools for managing queries over the data have remained relatively
primitive. One reason for this is that queries are typically issued through
applications. They are thus debugged once and re-used repeatedly. This mode of
interaction, however, is changing. As scientists (and others) store and share
increasingly large volumes of data in data centers, they need the ability to
analyze the data by issuing exploratory queries.