Semisupervised methods are techniques for using labeled data $(X_1,Y_1),...,
(X_n,Y_n)$ together with unlabeled data $X_{n+1},..., X_N$ to make predictions.
These methods invoke some assumption that links the marginal distribution $P_X$
of $X$ to the regression function $f(x)$. For example, it is common to assume
that $f$ is very smooth over high density regions of $P_X$. Many of the methods
are ad-hoc and have been shown to work in specific examples but are lacking a
theoretical foundation. We provide a minimax framework for analyzing
semisupervised methods.
Semisupervised methods inevitably invoke some assumption that links the
marginal distribution of the features to the regression function of the label.
Most commonly, the cluster or manifold assumptions are used which imply that
the regression function is smooth over high-density clusters or manifolds
supporting the data.
Hierarchical clustering based on pairwise similarities is a common tool used
in a broad range of scientific applications. However, in many problems it may
be expensive to obtain or compute similarities between the items to be
clustered. This paper investigates the hierarchical clustering of N items based
on a small subset of pairwise similarities, significantly less than the
complete set of N(N-1)/2 similarities.
High density clusters can be characterized by the connected components of a
level set $L(\lambda) = \{x:\ p(x)>\lambda\}$ of the underlying probability
density function $p$ generating the data, at some appropriate level
$\lambda\geq 0$. The complete hierarchical clustering can be characterized by a
cluster tree ${\cal T}= \bigcup_{\lambda} L(\lambda)$. In this paper, we study
the behavior of a density level set estimate $\widehat L(\lambda)$ and cluster
tree estimate $\widehat{\cal{T}}$ based on a kernel density estimator with
kernel bandwidth $h$.
The ability to detect weak distributed activation patterns in networks is
critical to several applications, such as identifying the onset of anomalous
activity or incipient congestion in the Internet, or faint traces of a
biochemical spread by a sensor network. This is a challenging problem since
weak distributed patterns can be invisible in per node statistics as well as a
global network-wide aggregate. Most prior work considers situations in which
the activation/non-activation of each node is statistically independent, but
this is unrealistic in many problems.
Consider the problem of estimating the $\gamma$-level set
$G^*_{\gamma}=\{x:f(x)\geq\gamma\}$ of an unknown $d$-dimensional density
function $f$ based on $n$ independent observations $X_1,...,X_n$ from the
density. This problem has been addressed under global error criteria related to
the symmetric set difference. However, in certain applications a spatially
uniform mode of convergence is desirable to ensure that the estimated set is
close to the target set everywhere. The Hausdorff error criterion provides this
degree of uniformity and, hence, is more appropriate in such situations.