We provide a simple method and relevant theoretical analysis for efficiently
estimating higher-order lp distances. While the analysis mainly focuses on l4,
our methodology extends naturally to p = 6,8,10..., (i.e., when p is even).
Distance-based methods are popular in machine learning. In large-scale
applications, storing, computing, and retrieving the distances can be both
space and time prohibitive. Efficient algorithms exist for estimating lp
distances if 0 < p <= 2. The task for p > 2 is known to be difficult. Our work
partially fills this gap.
Database theory and database practice are typically the domain of computer
scientists who adopt what may be termed an algorithmic perspective on their
data. This perspective is very different than the more statistical perspective
adopted by statisticians, scientific computers, machine learners, and other who
work on what may be broadly termed statistical data analysis. In this article,
I will address fundamental aspects of this algorithmic-statistical disconnect,
with an eye to bridging the gap between these two very different approaches.
Recently, Mahoney and Orecchia demonstrated that popular diffusion-based
procedures to compute a quick \emph{approximation} to the first nontrivial
eigenvector of a data graph Laplacian \emph{exactly} solve certain regularized
Semi-Definite Programs (SDPs). In this paper, we extend that result by
providing a statistical interpretation of their approximation procedure.
Eigenvector localization refers to the situation when most of the components
of an eigenvector are zero or near-zero. This phenomenon has been observed on
eigenvectors associated with extremal eigenvalues, and in many of those cases
it can be meaningfully interpreted in terms of "structural heterogeneities" in
the data.
The CUR decomposition provides an approximation of a matrix $X$ that has low
reconstruction error and that is sparse in the sense that the resulting
approximation lies in the span of only a few columns of $X$. In this regard, it
appears to be similar to many sparse PCA methods. However, CUR takes a
randomized algorithmic approach, whereas most sparse PCA methods are framed as
convex optimization problems. In this paper, we try to understand CUR from a
sparse optimization viewpoint.
In recent years, ideas from statistics and scientific computing have begun to
interact in increasingly sophisticated and fruitful ways with ideas from
computer science and the theory of algorithms to aid in the development of
improved worst-case algorithms that are useful for large-scale scientific and
Internet data analysis problems.
Regularization is a powerful technique for extracting useful information from
noisy data. Typically, it is implemented by adding some sort of norm constraint
to an objective function and then exactly optimizing the modified objective
function. This procedure typically leads to optimization problems that are
computationally more expensive than the original problem, a fact that is
clearly problematic if one is interested in large-scale applications.
Recent work in theoretical computer science and scientific computing has
focused on nearly-linear-time algorithms for solving systems of linear
equations. While introducing several novel theoretical perspectives, this work
has yet to lead to practical algorithms. In an effort to bridge this gap, we
describe in this paper two related results. Our first and main result is a
simple algorithm to approximate the solution to a set of linear equations
defined by a Laplacian (for a graph $G$ with $n$ nodes and $m \le n^2$ edges)
constraint matrix.
Detecting clusters or communities in large real-world graphs such as large
social or information networks is a problem of considerable interest. In
practice, one typically chooses an objective function that captures the
intuition of a network cluster as set of nodes with better internal
connectivity than external connectivity, and then one applies approximation
algorithms or heuristics to extract sets of nodes that are related to the
objective function and that "look like" good communities for the application of
interest.