Unbalanced data arises in many learning tasks such as clustering of
multi-class data, hierarchical divisive clustering and semisupervised learning.
Graph-based approaches are popular tools for these problems. Graph construction
is an important aspect of graph-based learning. We show that graph-based
algorithms can fail for unbalanced data for many popular graphs such as k-NN,
\epsilon-neighborhood and full-RBF graphs. We propose a novel graph
construction technique that encodes global statistical information into node
degrees through a ranking scheme.
We propose a novel method of introducing structure into existing machine
learning techniques by developing structure-based similarity and distance
measures. To learn structural information, low-dimensional structure of the
data is captured by solving a non-linear, low-rank representation problem. We
show that this low-rank representation can be kernelized, has a closed-form
solution, allows for separation of independent manifolds, and is robust to
noise.
Non-adaptive group testing involves grouping arbitrary subsets of $n$ items
into different pools. Each pool is then tested and defective items are
identified. A fundamental question involves minimizing the number of pools
required to identify at most $d$ defective items. Motivated by applications in
network tomography, sensor networks and infection propagation we formulate
group testing problems on graphs. Unlike conventional group testing problems
each group here must conform to the constraints imposed by a graph.
The fundamental task of group testing is to recover a small distinguished
subset of items from a large group while efficiently reducing the total number
of tests (measurements). The key contribution of this paper is in adopting a
new information-theoretic perspective on group testing problems. Establishing
its connection to Shannon-coding theory, we formulate the group testing problem
as a channel coding/decoding problem and derive a unifying result that reduces
many of the interesting questions to computation of a mutual information
expression.
We propose a novel non-parametric adaptive anomaly detection algorithm for
high dimensional data based on score functions derived from nearest neighbor
graphs on $n$-point nominal data. Anomalies are declared whenever the score of
a test sample falls below $\alpha$, which is supposed to be the desired false
alarm level. The resulting anomaly detector is shown to be asymptotically
optimal in that it is uniformly most powerful for the specified false alarm
level, $\alpha$, for the case when the anomaly density is a mixture of the
nominal and a known density.