While high-level data parallel frameworks, like MapReduce, simplify the
design and implementation of large-scale data processing systems, they do not
naturally or efficiently support many important data mining and machine
learning algorithms and can lead to inefficient learning systems. To help fill
this critical void, we introduced the GraphLab abstraction, which naturally
expresses asynchronous, dynamic, graph-parallel computation while ensuring data
consistency and achieving a high degree of parallel performance in the
shared-memory setting.
In information retrieval, a fundamental goal is to transform a document into
concepts that are representative of its content. The term "representative" is
in itself challenging to define, and various tasks require different
granularities of concepts. In this paper, we aim to model concepts that are
sparse over the vocabulary, and that flexibly adapt their content based on
other relevant semantic information such as textual structure or associated
image features.
In this work we present in-network techniques to improve the efficiency of
spatial aggregate queries. Such queries are very common in sensornet settings
and call for more targeted handling techniques. Our approach builds and
maintains multi-resolution cube hierarchies inside the network in a
distributed fashion. When failures occur, recovery can likewise be carried out
through in-network decisions.
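
To make the idea of a multi-resolution hierarchy for spatial aggregates concrete, the following is a minimal, centralized sketch and not the in-network protocol itself. It assumes sensor readings laid out on a square grid whose side is a power of two; the function names `build_pyramid` and `range_sum` are hypothetical and chosen for illustration only. Each coarser level stores the sum of a 2x2 block of the level below, so a range query can be answered from the coarsest cells that fit entirely inside the query region.

```python
def build_pyramid(grid):
    """Build levels[0] = grid and levels[k] = sums over 2^k x 2^k blocks.

    Assumption: grid is square and its side length is a power of two.
    """
    levels = [grid]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        half = len(prev) // 2
        levels.append([
            [prev[2 * i][2 * j] + prev[2 * i][2 * j + 1]
             + prev[2 * i + 1][2 * j] + prev[2 * i + 1][2 * j + 1]
             for j in range(half)]
            for i in range(half)
        ])
    return levels


def range_sum(levels, r0, c0, r1, c1, level=None, i=0, j=0):
    """Sum of readings in rows r0..r1-1 and columns c0..c1-1 (half-open)."""
    if level is None:
        level = len(levels) - 1          # start at the single root cell
    size = 1 << level                    # side length covered by this cell
    top, left = i * size, j * size
    bottom, right = top + size, left + size
    # Cell lies completely outside the query region.
    if r1 <= top or bottom <= r0 or c1 <= left or right <= c0:
        return 0
    # Cell lies completely inside: use its precomputed aggregate.
    if r0 <= top and bottom <= r1 and c0 <= left and right <= c1:
        return levels[level][i][j]
    # Partial overlap: descend to the four finer cells.
    return sum(
        range_sum(levels, r0, c0, r1, c1, level - 1, 2 * i + di, 2 * j + dj)
        for di in (0, 1) for dj in (0, 1)
    )


readings = [[1, 2, 3, 4],
            [5, 6, 7, 8],
            [9, 10, 11, 12],
            [13, 14, 15, 16]]
levels = build_pyramid(readings)
total = range_sum(levels, 0, 0, 4, 4)    # whole grid, answered by the root
center = range_sum(levels, 1, 1, 3, 3)   # 2x2 block straddling four quadrants
```

In this sketch, a query aligned with a coarse cell touches a single stored aggregate, while a misaligned query falls back to finer cells only along its boundary.
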
Designing and implementing efficient, provably correct parallel machine
learning (ML) algorithms is challenging. Existing high-level parallel
abstractions like MapReduce are insufficiently expressive, while low-level tools
like MPI and Pthreads leave ML experts repeatedly solving the same design
challenges. By targeting common patterns in ML, we developed GraphLab, which
improves upon abstractions like MapReduce by compactly expressing asynchronous
iterative algorithms with sparse computational dependencies while ensuring data
consistency and achieving a high degree of parallel performance.
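
As an illustration of what an asynchronous iterative algorithm with sparse computational dependencies looks like, here is a minimal sequential sketch. It is not the GraphLab API: the function name `dynamic_pagerank`, its parameters, and the choice of a PageRank-style update are assumptions made for this example. Each update reads only a vertex's in-neighbors, and a vertex reschedules its out-neighbors only when its own value changed by more than a tolerance, so work concentrates where the computation has not yet converged.

```python
from collections import deque


def dynamic_pagerank(out_links, damping=0.85, tol=1e-6):
    """Sequential sketch of dynamic scheduling (hypothetical, not GraphLab's API).

    Assumption: every vertex appears as a key of out_links, and every link
    target is also a key.
    """
    vertices = list(out_links)
    in_links = {v: [] for v in vertices}
    for u, targets in out_links.items():
        for v in targets:
            in_links[v].append(u)

    rank = {v: 1.0 for v in vertices}
    queue = deque(vertices)              # every vertex is scheduled once
    queued = set(vertices)
    updates = 0

    while queue:
        v = queue.popleft()
        queued.discard(v)
        # The update touches only the in-neighbors of v (sparse dependencies).
        new = (1 - damping) + damping * sum(
            rank[u] / len(out_links[u]) for u in in_links[v]
        )
        changed = abs(new - rank[v]) > tol
        rank[v] = new
        updates += 1
        # Dynamic scheduling: propagate work only if v changed noticeably.
        if changed:
            for w in out_links[v]:
                if w not in queued:
                    queue.append(w)
                    queued.add(w)
    return rank, updates


cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}
cycle_rank, cycle_updates = dynamic_pagerank(cycle)

graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
graph_rank, graph_updates = dynamic_pagerank(graph)
```

A parallel implementation would additionally need to guarantee that concurrent updates on neighboring vertices do not read or write the same data inconsistently, which this single-threaded sketch sidesteps by construction.
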
Representing distributions over permutations can be a daunting task because
the number of permutations of $n$ objects scales factorially in
$n$. One recent approach to reducing storage complexity exploits
probabilistic independence, but, as we argue, full independence
assumptions impose strong sparsity constraints on distributions and are
unsuitable for modeling rankings.
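
The storage argument can be made concrete with a small count of free parameters. This is a sketch under a stated assumption: full independence is taken to mean that the first $p$ items always occupy one fixed set of $p$ ranks and the remaining $q$ items occupy the rest, so the distribution factors into one over $p!$ arrangements and one over $q!$ arrangements. The function names are hypothetical.

```python
from math import factorial


def full_params(n):
    """Free parameters of an arbitrary distribution over n! permutations."""
    return factorial(n) - 1


def independent_params(p, q):
    """Free parameters when two item groups are fully independent.

    Assumption: each group is confined to its own fixed set of ranks, so only
    the within-group orderings carry probability mass.
    """
    return (factorial(p) - 1) + (factorial(q) - 1)


# Compare storage for a few sizes, splitting the items into two halves.
rows = [(n, full_params(n), independent_params(n // 2, n - n // 2))
        for n in (4, 6, 8, 10)]
for n, full, indep in rows:
    print(f"n={n:2d}  full={full:8d}  independent={indep:4d}")
```

The saving comes at a price: under this assumption every permutation that interleaves the two groups receives probability zero, which is the kind of sparsity constraint that makes full independence a poor fit for rankings.
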