Big array analytics is becoming indispensable in answering important
scientific and business questions. Most analysis tasks consist of multiple
steps, each making one or more passes over the arrays being analyzed and
generating intermediate results. In the big data setting, I/O optimization is
key to efficient analytics. In this paper, we develop a framework and
techniques for capturing a broad range of analysis tasks expressible in
nested-loop forms, representing them in a declarative way, and optimizing their
I/O by identifying sharing opportunities.
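
As a rough illustration of what a sharing opportunity between two nested-loop steps can look like, the sketch below simulates a blocked, disk-resident array and counts block reads. The `BlockStore` class, the two functions, and the choice of statistics (sum and sum of squares) are hypothetical and are not taken from the framework described above; the sketch only shows that two steps scanning the same array can be merged so that each block is read once.

```python
class BlockStore:
    """Hypothetical stand-in for a disk-resident array split into blocks."""

    def __init__(self, blocks):
        self._blocks = blocks
        self.reads = 0  # number of simulated block reads

    def __len__(self):
        return len(self._blocks)

    def read(self, i):
        # Each call models one I/O operation that fetches block i.
        self.reads += 1
        return self._blocks[i]


def separate_passes(store):
    """Run two analysis steps as two independent nested loops."""
    total = 0.0
    for i in range(len(store)):          # step 1: sum of elements
        for x in store.read(i):
            total += x
    total_sq = 0.0
    for i in range(len(store)):          # step 2: sum of squares
        for x in store.read(i):
            total_sq += x * x
    return total, total_sq


def shared_pass(store):
    """Run both steps in one nested loop, so each block is read once."""
    total = 0.0
    total_sq = 0.0
    for i in range(len(store)):
        block = store.read(i)            # a single read serves both steps
        for x in block:
            total += x
            total_sq += x * x
    return total, total_sq


blocks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
a = BlockStore(blocks)
b = BlockStore(blocks)
result_separate = separate_passes(a)
result_shared = shared_pass(b)
```

Both versions produce the same results, but in this toy setting the shared version issues half as many block reads (`b.reads` versus `a.reads`). Finding such opportunities automatically across many steps is the harder problem that an optimizer has to solve.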

R is a numerical computing environment widely used for statistical
data analysis. Like many such environments, R performs poorly on large
datasets that exceed the size of physical memory. We present our vision of
RIOT (R with I/O Transparency), a system that makes R programs I/O-efficient in
a way transparent to the users. We describe our experience with RIOT-DB, an
initial prototype that uses a relational database system as a backend.
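
To make the idea of a relational backend concrete, here is a minimal sketch of how a vector operation could be pushed into a database instead of being evaluated in memory. It uses Python's standard `sqlite3` module and an assumed `(i, v)` index/value table layout; the schema, the `store_vector` helper, and the table names are illustrative assumptions and do not describe RIOT-DB's actual design.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; a real backend
# would keep its tables on disk.
conn = sqlite3.connect(":memory:")


def store_vector(name, values):
    """Store a vector as rows of (index, value). Hypothetical layout."""
    conn.execute(f"CREATE TABLE {name} (i INTEGER PRIMARY KEY, v REAL)")
    conn.executemany(
        f"INSERT INTO {name} VALUES (?, ?)",
        enumerate(values, start=1),  # 1-based indices, as in R
    )


store_vector("x", [1.0, 2.0, 3.0])
store_vector("y", [10.0, 20.0, 30.0])

# The element-wise sum x + y becomes a join on the index column.
z_values = [
    v
    for (v,) in conn.execute(
        "SELECT x.v + y.v FROM x JOIN y ON x.i = y.i ORDER BY x.i"
    )
]

# An aggregate such as sum(x + y) is evaluated entirely by the database.
z_sum = conn.execute(
    "SELECT SUM(x.v + y.v) FROM x JOIN y ON x.i = y.i"
).fetchone()[0]
```

In this sketch the database engine, rather than the host language, decides how to scan and combine the stored vectors, which is one reason a relational system is a plausible starting point for an I/O-efficient backend.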