While users today have access to many tools that assist in performing large
scale data analysis tasks, understanding the performance characteristics of
their parallel computations, such as MapReduce jobs, remains difficult. We
present PerfXplain, a system that enables users to ask questions about the
relative performances (i.e., runtimes) of pairs of MapReduce jobs. PerfXplain
provides a new query language for articulating performance queries and an
algorithm for generating explanations from a log of past MapReduce job
executions.
Over the past 40 years, database management systems (DBMSs) have evolved to
provide a sophisticated variety of data management capabilities. At the same
time, tools for managing queries over the data have remained relatively
primitive. One reason for this is that queries are typically issued through
applications. They are thus debugged once and re-used repeatedly. This mode of
interaction, however, is changing. As scientists (and others) store and share
increasingly large volumes of data in data centers, they need the ability to
analyze the data by issuing exploratory queries.