Larry Wasserman

  1. Density-Sensitive Semisupervised Inference.

    Authors: Larry Wasserman, Aarti Singh, Martin Azizyan
    Subjects: Statistics
    Abstract

    Semisupervised methods are techniques for using labeled data $(X_1,Y_1),...,
    (X_n,Y_n)$ together with unlabeled data $X_{n+1},..., X_N$ to make predictions.
    These methods invoke some assumption that links the marginal distribution $P_X$
    of $X$ to the regression function $f(x)$. For example, it is common to assume
    that $f$ is very smooth over high density regions of $P_X$. Many of the methods
    are ad-hoc and have been shown to work in specific examples but are lacking a
    theoretical foundation. We provide a minimax framework for analyzing
    semisupervised methods.

  2. High Dimensional Semiparametric Gaussian Copula Graphical Models.

    Authors: Larry Wasserman, Ming Yuan, Han Liu, John Lafferty, Fang Han
    Subjects: Machine Learning
    Abstract

    In this paper, we propose a semiparametric approach, named nonparanormal
    skeptic, for efficiently and robustly estimating high dimensional undirected
    graphical models. To achieve modeling flexibility, we consider Gaussian Copula
    graphical models (or the nonparanormal) as proposed by Liu et al. (2009). To
    achieve estimation robustness, we exploit nonparametric rank-based correlation
    coefficient estimators, including Spearman's rho and Kendall's tau.

  3. Frasian Inference.

    Authors: Larry Wasserman
    Subjects: Methodology
    Abstract

    Don Fraser has given an interesting account of the agreements and
    disagreements between Bayesian posterior probabilities and confidence levels.
    In this comment I discuss some cases where the lack of such agreement is
    extreme. I then discuss a few cases where it is possible to have Bayes
    procedures with frequentist validity. Such frequentist-Bayesian---or
    Frasian---methods deserve more attention [arXiv:1112.5582].

  4. Sparse Nonparametric Graphical Models.

    Authors: Larry Wasserman, Han Liu, John Lafferty
    Subjects: Machine Learning
    Abstract

    We present some nonparametric methods for graphical modeling. In the discrete
    case, where the data are binary or drawn from a finite alphabet, Markov random
    fields are already essentially nonparametric, since the cliques can take only a
    finite number of values. Continuous data are different. The Gaussian graphical
    model is the standard parametric model for continuous data, but it makes
    distributional assumptions that are often unrealistic. We discuss two
    approaches to building more flexible graphical models.

  5. Random Differential Privacy.

    Authors: Larry Wasserman, Alessandro Rinaldo, Rob Hall
    Subjects: Methodology
    Abstract

    We propose a relaxed privacy definition called {\em random differential
    privacy} (RDP). Differential privacy requires that adding any new observation
    to a database will have small effect on the output of the data-release
    procedure. Random differential privacy requires that adding a {\em randomly
    drawn new observation} to a database will have small effect on the output. We
    show an analog of the composition property of differentially private procedures
    which applies to our new definition.

  6. Adaptive Semisupervised Inference.

    Authors: Larry Wasserman, Aarti Singh, Martin Azizyan
    Subjects: Statistics
    Abstract

    Semisupervised methods inevitably invoke some assumption that links the
    marginal distribution of the features to the regression function of the label.
    Most commonly, the cluster or manifold assumptions are used which imply that
    the regression function is smooth over high-density clusters or manifolds
    supporting the data.

  7. Efficient Nonparametric Conformal Prediction Regions.

    Authors: Larry Wasserman, Jing Lei, James Robins
    Subjects: Statistics
    Abstract

    We investigate and extend the conformal prediction method due to
    Vovk,Gammerman and Shafer (2005) to construct nonparametric prediction regions.
    These regions have guaranteed distribution free, finite sample coverage,
    without any assumptions on the distribution or the bandwidth. Explicit
    convergence rates of the loss function are established for such regions under
    standard regularity conditions. Approximations for simplifying implementation
    and data driven bandwidth selection methods are also discussed. The theoretical
    properties of our method are demonstrated through simulations.

  8. Manifold Estimation and Singular Deconvolution Under Hausdorff Loss.

    Authors: Larry Wasserman, Christopher R. Genovese, Marco Perone-Pacifico, Isabella Verdinelli
    Subjects: Statistics
    Abstract

    We find lower and upper bounds for the risk of estimating a manifold in
    Hausdorff distance under several models. We also show that there are close
    connections between manifold estimation and the problem of deconvolving a
    singular measure.

  9. Stability of Density-Based Clustering.

    Authors: Larry Wasserman, Aarti Singh, Alessandro Rinaldo, Rebecca Nugent
    Subjects: Machine Learning
    Abstract

    High density clusters can be characterized by the connected components of a
    level set $L(\lambda) = \{x:\ p(x)>\lambda\}$ of the underlying probability
    density function $p$ generating the data, at some appropriate level
    $\lambda\geq 0$. The complete hierarchical clustering can be characterized by a
    cluster tree ${\cal T}= \bigcup_{\lambda} L(\lambda)$. In this paper, we study
    the behavior of a density level set estimate $\widehat L(\lambda)$ and cluster
    tree estimate $\widehat{\cal{T}}$ based on a kernel density estimator with
    kernel bandwidth $h$.

  10. Genome-Wide Significance Levels and Weighted Hypothesis Testing.

    Authors: Larry Wasserman, Kathryn Roeder
    Subjects: Methodology
    Abstract

    Genetic investigations often involve the testing of vast numbers of related
    hypotheses simultaneously. To control the overall error rate, a substantial
    penalty is required, making it difficult to detect signals of moderate
    strength. To improve the power in this setting, a number of authors have
    considered using weighted $p$-values, with the motivation often based upon the
    scientific plausibility of the hypotheses. We review this literature, derive
    optimal weights and show that the power is remarkably robust to
    misspecification of these weights.

  11. Union Support Recovery in Multi-task Learning.

    Authors: Larry Wasserman, John Lafferty, Mladen Kolar
    Subjects: Machine Learning
    Abstract

    We sharply characterize the performance of different penalization schemes for
    the problem of selecting the relevant variables in the multi-task setting.
    Previous work focuses on the regression problem where conditions on the design
    matrix complicate the analysis. A clearer and simpler picture emerges by
    studying the Normal means model. This model, often used in the field of
    statistics, is a simplified model that provides a laboratory for studying
    complex procedures.

  12. Minimax Manifold Estimation.

    Authors: Larry Wasserman, Marco Perone-Pacifico, Isabella Verdinelli, Christopher Genovese
    Subjects: Machine Learning
    Abstract

    We find the minimax rate of convergence in Hausdorff distance for estimating
    a manifold M of dimension d embedded in R^D given a noisy sample from the
    manifold. We assume that the manifold satisfies a smoothness condition and that
    the noise distribution has compact support. We show that the optimal rate of
    convergence is n^{-2/(2+d)}. Thus, the minimax rate depends only on the
    dimension of the manifold, not on the dimension of the space in which M is
    embedded.

  13. Graph-Valued Regression.

    Authors: Larry Wasserman, Han Liu, John Lafferty, Xi Chen
    Subjects: Machine Learning
    Abstract

    Undirected graphical models encode in a graph $G$ the dependency structure of
    a random vector $Y$. In many applications, it is of interest to model $Y$ given
    another random vector $X$ as input. We refer to the problem of estimating the
    graph $G(x)$ of $Y$ conditioned on $X=x$ as ``graph-valued regression.'' In
    this paper, we propose a semiparametric method for estimating $G(x)$ that
    builds a tree on the $X$ space just as in CART (classification and regression
    trees), but at each leaf of the tree estimates a graph. We call the method
    ``Graph-optimized CART,'' or Go-CART.

  14. Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models.

    Authors: Larry Wasserman, Kathryn Roeder, Han Liu
    Subjects: Machine Learning
    Abstract

    A challenging problem in estimating high-dimensional graphical models is to
    choose the regularization parameter in a data-dependent way. The standard
    techniques include $K$-fold cross-validation ($K$-CV), Akaike information
    criterion (AIC), and Bayesian information criterion (BIC). Though these methods
    work well for low-dimensional problems, they are not suitable in high
    dimensional settings. In this paper, we present StARS: a new stability-based
    method for choosing the regularization parameter in high dimensional inference
    for undirected graphs.

  15. Nonparametric Filament Estimation.

    Authors: Larry Wasserman, Christopher R. Genovese, Marco Perone-Pacifico, Isabella Verdinelli
    Subjects: Statistics
    Abstract

    We develop nonparametric methods for estimating filamentary structure from
    planar point process data and find the minimax lower bound for this problem. We
    show that, under weak conditions, the filaments have a simple geometric
    representation as the medial axis of the data distribution's support. Our
    methods convert an estimator of the support's boundary into an estimator of the
    filaments. We find the rates of convergence of our estimators and show that
    when using an optimal boundary estimator, they achieve the minimax rate.

  16. Structured, Sparse Regression With Application to HIV Drug Resistance.

    Authors: Larry Wasserman, Kathryn Roeder, Daniel Percival, Roni Rosenfeld
    Subjects: Methodology
    Abstract

    We introduce a new version of forward stepwise regression. Our modification
    finds solutions to regression problems where the selected predictors appear in
    a structured pattern, with respect to a predefined distance measure over the
    candidate predictors. Our method is motivated by the problem of predicting
    HIV-1 drug resistance from protein sequences. We find that our methods improve
    the interpretability of drug resistance while producing comparable predictive
    accuracy to standard methods.

  17. Tree Density Estimation.

    Authors: Larry Wasserman, Han Liu, John Lafferty
    Subjects: Machine Learning
    Abstract

    We study graph estimation and density estimation in high dimensions. To avoid
    the curse of dimensionality, we consider a family of density estimators based
    on tree structured undirected graphical models. We do not assume the true
    distribution corresponds to a tree; rather, we try to find the best tree-based
    approximation to the true distribution. We apply the Chow-Liu algorithm to
    kernel density estimates to build a tree and then use a data-splitting scheme
    to choose the number of edges. We also prove oracle properties on both function
    estimation and structure learning.

  18. Generalized Density Clustering.

    Authors: Larry Wasserman, Alessandro Rinaldo
    Subjects: Statistics
    Abstract

    We study generalized density-based clustering in which sharply defined
    clusters such as clusters on lower dimensional manifolds are allowed. We show
    that accurate clustering is possible even in high dimensions. We propose two
    data-based methods for choosing the bandwidth and we study the stability
    properties of density clusters. We show that a simple graph-based algorithm
    successfully approximates the high density clusters.

  19. Revisiting Marginal Regression.

    Authors: Larry Wasserman, Jiashun Jin, Christopher Genovese
    Subjects: Statistics
    Abstract

    The lasso has become an important practical tool for high dimensional
    regression as well as the object of intense theoretical investigation. But
    despite the availability of efficient algorithms, the lasso remains
    computationally demanding in regression problems where the number of variables
    vastly exceeds the number of data points. A much older method, marginal
    regression, largely displaced by the lasso, offers a promising alternative in
    this case. Computation for marginal regression is practical even when the
    dimension is very high.

  20. A statistical framework for differential privacy.

    Authors: Larry Wasserman, Shuheng Zhou
    Subjects: Statistics
    Abstract

    One goal of statistical privacy research is to construct a data release
    mechanism that protects individual privacy while preserving information
    content. An example is a {\em random mechanism} that takes an input database
    $X$ and outputs a random database $Z$ according to a distribution
    $Q_n(\cdot|X)$. {\em Differential privacy} is a particular privacy requirement
    developed by computer scientists in which $Q_n(\cdot |X)$ is required to be
    insensitive to changes in one data point in $X$. This makes it difficult to
    infer from $Z$ whether a given individual is in the original database $X$.

  21. On the path density of a gradient field.

    Authors: Larry Wasserman, Christopher R. Genovese, Marco Perone-Pacifico, Isabella Verdinelli
    Subjects: Statistics
    Abstract

    We consider the problem of reliably finding filaments in point clouds.
    Realistic data sets often have numerous filaments of various sizes and shapes.
    Statistical techniques exist for finding one (or a few) filaments but these
    methods do not handle noisy data sets with many filaments. Other methods can be
    found in the astronomy literature but they do not have rigorous statistical
    guarantees. We propose the following method. Starting at each data point we
    construct the steepest ascent path along a kernel density estimator.

  22. On the path density of a gradient field.

    Authors: Larry Wasserman, Christopher R. Genovese, Marco Perone-Pacifico, Isabella Verdinelli
    Subjects: Statistics
    Abstract

    We consider the problem of reliably finding filaments in point clouds.
    Realistic data sets often have numerous filaments of various sizes and shapes.
    Statistical techniques exist for finding one (or a few) filaments but these
    methods do not handle noisy data sets with many filaments. Other methods can be
    found in the astronomy literature but they do not have rigorous statistical
    guarantees. We propose the following method. Starting at each data point we
    construct the steepest ascent path along a kernel density estimator.

  23. High-dimensional variable selection.

    Authors: Larry Wasserman, Kathryn Roeder
    Subjects: gr. Statistics
    Abstract

    This paper explores the following question: what kind of statistical
    guarantees can be given when doing variable selection in high-dimensional
    models? In particular, we look at the error rates and power of some multi-stage
    regression methods. In the first stage we fit a set of candidate models. In the
    second stage we select one model by cross-validation. In the third stage we use
    hypothesis testing to eliminate some variables.

Syndicate content