This paper describes a novel theoretical characterization of the performance
of non-local means (NLM) for noise removal. NLM has proven effective in a
variety of empirical studies, but little is understood fundamentally about how
it performs relative to classical methods based on wavelets or how various
parameters (e.g., patch size) should be chosen.
In this paper we consider the problem of locating a nonzero entry in a
high-dimensional vector from possibly adaptive linear measurements. We consider
a recursive bisection method which we dub the compressive binary search and
show that it improves on what any nonadaptive method can achieve. We establish
a non-asymptotic bound that applies to all methods, regardless of their
computational complexity. Combined, these results show that the compressive
binary search is within a double logarithmic factor of the optimal performance.
We consider the hypothesis testing problem of deciding whether an observed
high-dimensional normal vector has independent components or, alternatively, if
it has a small subset of correlated components. The correlated components may
have a certain combinatorial structure known to the statistician. We establish
upper and lower bounds for the Bayes risk in terms of the size of the
correlated subset, the level of correlation, and the structure of the class of
possibly correlated sets.
We consider the task of detecting a salient cluster in a sensor network,
i.e., an undirected graph with a random variable attached to each node.
Motivated by recent research in environmental statistics and the drive to
compete with the reigning scan statistic, we explore alternatives based on the
percolative properties of the network. The first method is based on the size of
the largest connected component after removing the nodes in the network whose
value is lower than a given threshold. The second one is the upper level set
scan test introduced by Patil and Taillie (2003).
Testing for the significance of a subset of regression coefficients in a
linear model, a staple of statistical analysis, goes back at least to the work
of Fisher who introduced the analysis of variance (ANOVA). We study this
problem under the assumption that the coefficient vector is sparse, a common
situation in modern high-dimensional settings. Suppose the regression vector is
of dimension p with S non-zero coefficients with S = p^{1 -alpha}. Under
moderate sparsity levels, i.e. alpha <= 1/2, we show that ANOVA is essentially
optimal under some conditions on the design.
Let M be a bounded domain of a Euclidian space with smooth boundary. We
relate the Cheeger constant of M and the conductance of a neighborhood graph
defined on a random sample from M. By restricting the minimization defining the
latter over a particular class of subsets, we obtain consistency (after
normalization) as the sample size increases, and show that any minimizing
sequence of subsets has a subsequence converging to a Cheeger set of M.
We consider the problem of detecting whether or not in a given sensor
network, there is a cluster of sensors which exhibit an "unusual behavior".
Formally, suppose we are given a set of nodes and attach a random variable to
each node. We observe a realization of this process and want to decide between
the following two hypotheses: under the null, the variables are i.i.d. standard
normal; under the alternative, there is a cluster of variables that are i.i.d.
normal with positive mean and unit variance, while the rest are i.i.d. standard
normal.
In the context of clustering, we assume a generative model where each cluster
is the result of sampling points in the neighborhood of an embedded smooth
surface, possibly contaminated with outliers. We consider a prototype for a
higher-order spectral clustering method based on the residual from a local
linear approximation.
In the context of clustering, we consider a generative model in a Euclidean
ambient space with clusters of different shapes, dimensions, sizes and
densities. In an asymptotic setting where the number of points becomes large,
we obtain theoretical guaranties for a few emblematic methods based on pairwise
distances: a simple algorithm based on the extraction of connected components
in a neighborhood graph; the spectral clustering method of Ng, Jordan and
Weiss; and hierarchical clustering with single linkage.