Semisupervised methods are techniques for using labeled data $(X_1,Y_1),...,
(X_n,Y_n)$ together with unlabeled data $X_{n+1},..., X_N$ to make predictions.
These methods invoke some assumption that links the marginal distribution $P_X$
of $X$ to the regression function $f(x)$. For example, it is common to assume
that $f$ is very smooth over high density regions of $P_X$. Many of the methods
are ad-hoc and have been shown to work in specific examples but are lacking a
theoretical foundation. We provide a minimax framework for analyzing
semisupervised methods.
In this paper, we propose a semiparametric approach, named nonparanormal
skeptic, for efficiently and robustly estimating high dimensional undirected
graphical models. To achieve modeling flexibility, we consider Gaussian Copula
graphical models (or the nonparanormal) as proposed by Liu et al. (2009). To
achieve estimation robustness, we exploit nonparametric rank-based correlation
coefficient estimators, including Spearman's rho and Kendall's tau.
Don Fraser has given an interesting account of the agreements and
disagreements between Bayesian posterior probabilities and confidence levels.
In this comment I discuss some cases where the lack of such agreement is
extreme. I then discuss a few cases where it is possible to have Bayes
procedures with frequentist validity. Such frequentist-Bayesian---or
Frasian---methods deserve more attention [arXiv:1112.5582].
We present some nonparametric methods for graphical modeling. In the discrete
case, where the data are binary or drawn from a finite alphabet, Markov random
fields are already essentially nonparametric, since the cliques can take only a
finite number of values. Continuous data are different. The Gaussian graphical
model is the standard parametric model for continuous data, but it makes
distributional assumptions that are often unrealistic. We discuss two
approaches to building more flexible graphical models.
We propose a relaxed privacy definition called {\em random differential
privacy} (RDP). Differential privacy requires that adding any new observation
to a database will have small effect on the output of the data-release
procedure. Random differential privacy requires that adding a {\em randomly
drawn new observation} to a database will have small effect on the output. We
show an analog of the composition property of differentially private procedures
which applies to our new definition.
Semisupervised methods inevitably invoke some assumption that links the
marginal distribution of the features to the regression function of the label.
Most commonly, the cluster or manifold assumptions are used which imply that
the regression function is smooth over high-density clusters or manifolds
supporting the data.
We investigate and extend the conformal prediction method due to
Vovk,Gammerman and Shafer (2005) to construct nonparametric prediction regions.
These regions have guaranteed distribution free, finite sample coverage,
without any assumptions on the distribution or the bandwidth. Explicit
convergence rates of the loss function are established for such regions under
standard regularity conditions. Approximations for simplifying implementation
and data driven bandwidth selection methods are also discussed. The theoretical
properties of our method are demonstrated through simulations.
We find lower and upper bounds for the risk of estimating a manifold in
Hausdorff distance under several models. We also show that there are close
connections between manifold estimation and the problem of deconvolving a
singular measure.
High density clusters can be characterized by the connected components of a
level set $L(\lambda) = \{x:\ p(x)>\lambda\}$ of the underlying probability
density function $p$ generating the data, at some appropriate level
$\lambda\geq 0$. The complete hierarchical clustering can be characterized by a
cluster tree ${\cal T}= \bigcup_{\lambda} L(\lambda)$. In this paper, we study
the behavior of a density level set estimate $\widehat L(\lambda)$ and cluster
tree estimate $\widehat{\cal{T}}$ based on a kernel density estimator with
kernel bandwidth $h$.
Genetic investigations often involve the testing of vast numbers of related
hypotheses simultaneously. To control the overall error rate, a substantial
penalty is required, making it difficult to detect signals of moderate
strength. To improve the power in this setting, a number of authors have
considered using weighted $p$-values, with the motivation often based upon the
scientific plausibility of the hypotheses. We review this literature, derive
optimal weights and show that the power is remarkably robust to
misspecification of these weights.
We sharply characterize the performance of different penalization schemes for
the problem of selecting the relevant variables in the multi-task setting.
Previous work focuses on the regression problem where conditions on the design
matrix complicate the analysis. A clearer and simpler picture emerges by
studying the Normal means model. This model, often used in the field of
statistics, is a simplified model that provides a laboratory for studying
complex procedures.
We find the minimax rate of convergence in Hausdorff distance for estimating
a manifold M of dimension d embedded in R^D given a noisy sample from the
manifold. We assume that the manifold satisfies a smoothness condition and that
the noise distribution has compact support. We show that the optimal rate of
convergence is n^{-2/(2+d)}. Thus, the minimax rate depends only on the
dimension of the manifold, not on the dimension of the space in which M is
embedded.
Undirected graphical models encode in a graph $G$ the dependency structure of
a random vector $Y$. In many applications, it is of interest to model $Y$ given
another random vector $X$ as input. We refer to the problem of estimating the
graph $G(x)$ of $Y$ conditioned on $X=x$ as ``graph-valued regression.'' In
this paper, we propose a semiparametric method for estimating $G(x)$ that
builds a tree on the $X$ space just as in CART (classification and regression
trees), but at each leaf of the tree estimates a graph. We call the method
``Graph-optimized CART,'' or Go-CART.
A challenging problem in estimating high-dimensional graphical models is to
choose the regularization parameter in a data-dependent way. The standard
techniques include $K$-fold cross-validation ($K$-CV), Akaike information
criterion (AIC), and Bayesian information criterion (BIC). Though these methods
work well for low-dimensional problems, they are not suitable in high
dimensional settings. In this paper, we present StARS: a new stability-based
method for choosing the regularization parameter in high dimensional inference
for undirected graphs.
We develop nonparametric methods for estimating filamentary structure from
planar point process data and find the minimax lower bound for this problem. We
show that, under weak conditions, the filaments have a simple geometric
representation as the medial axis of the data distribution's support. Our
methods convert an estimator of the support's boundary into an estimator of the
filaments. We find the rates of convergence of our estimators and show that
when using an optimal boundary estimator, they achieve the minimax rate.
We introduce a new version of forward stepwise regression. Our modification
finds solutions to regression problems where the selected predictors appear in
a structured pattern, with respect to a predefined distance measure over the
candidate predictors. Our method is motivated by the problem of predicting
HIV-1 drug resistance from protein sequences. We find that our methods improve
the interpretability of drug resistance while producing comparable predictive
accuracy to standard methods.
We study graph estimation and density estimation in high dimensions. To avoid
the curse of dimensionality, we consider a family of density estimators based
on tree structured undirected graphical models. We do not assume the true
distribution corresponds to a tree; rather, we try to find the best tree-based
approximation to the true distribution. We apply the Chow-Liu algorithm to
kernel density estimates to build a tree and then use a data-splitting scheme
to choose the number of edges. We also prove oracle properties on both function
estimation and structure learning.
We study generalized density-based clustering in which sharply defined
clusters such as clusters on lower dimensional manifolds are allowed. We show
that accurate clustering is possible even in high dimensions. We propose two
data-based methods for choosing the bandwidth and we study the stability
properties of density clusters. We show that a simple graph-based algorithm
successfully approximates the high density clusters.
The lasso has become an important practical tool for high dimensional
regression as well as the object of intense theoretical investigation. But
despite the availability of efficient algorithms, the lasso remains
computationally demanding in regression problems where the number of variables
vastly exceeds the number of data points. A much older method, marginal
regression, largely displaced by the lasso, offers a promising alternative in
this case. Computation for marginal regression is practical even when the
dimension is very high.
One goal of statistical privacy research is to construct a data release
mechanism that protects individual privacy while preserving information
content. An example is a {\em random mechanism} that takes an input database
$X$ and outputs a random database $Z$ according to a distribution
$Q_n(\cdot|X)$. {\em Differential privacy} is a particular privacy requirement
developed by computer scientists in which $Q_n(\cdot |X)$ is required to be
insensitive to changes in one data point in $X$. This makes it difficult to
infer from $Z$ whether a given individual is in the original database $X$.
We consider the problem of reliably finding filaments in point clouds.
Realistic data sets often have numerous filaments of various sizes and shapes.
Statistical techniques exist for finding one (or a few) filaments but these
methods do not handle noisy data sets with many filaments. Other methods can be
found in the astronomy literature but they do not have rigorous statistical
guarantees. We propose the following method. Starting at each data point we
construct the steepest ascent path along a kernel density estimator.
We consider the problem of reliably finding filaments in point clouds.
Realistic data sets often have numerous filaments of various sizes and shapes.
Statistical techniques exist for finding one (or a few) filaments but these
methods do not handle noisy data sets with many filaments. Other methods can be
found in the astronomy literature but they do not have rigorous statistical
guarantees. We propose the following method. Starting at each data point we
construct the steepest ascent path along a kernel density estimator.
This paper explores the following question: what kind of statistical
guarantees can be given when doing variable selection in high-dimensional
models? In particular, we look at the error rates and power of some multi-stage
regression methods. In the first stage we fit a set of candidate models. In the
second stage we select one model by cross-validation. In the third stage we use
hypothesis testing to eliminate some variables.