Many modern data mining applications are concerned with the analysis of
datasets in which the observations are described by paired high-dimensional
vectorial representations or "views". Some typical examples can be found in web
mining and genomics applications.
We study sparse principal components analysis in the high-dimensional
setting, where $p$ (the number of variables) can be much larger than $n$ (the
number of observations). We prove optimal, non-asymptotic lower and upper
bounds on the minimax estimation error for the leading eigenvector when it
belongs to an $\ell_q$ ball for $q \in [0,1]$. Our bounds are sharp in $p$ and
$n$ for all $q \in [0, 1]$ over a wide class of distributions. The upper bound
is obtained by analyzing the performance of $\ell_q$-constrained PCA.
We assume i.i.d. data sampled from a mixture distribution with K components
along fixed d-dimensional linear subspaces and an additional outlier component.
For p>0, we study the simultaneous recovery of the K fixed subspaces by
minimizing the l_p-averaged distances of the sampled data points from any K
subspaces. Under some conditions, we show that if $0<p\leq1$, then all
underlying subspaces can be precisely recovered by l_p minimization with
overwhelming probability.
We consider the problem of sequential sampling from a finite number of
independent statistical populations to maximize the expected infinite horizon
average outcome per period, under a constraint that the expected average
sampling cost does not exceed an upper bound. The outcome distributions are not
known. We construct a class of consistent adaptive policies, under which the
average outcome converges with probability 1 to the true value under complete
information for all distributions with finite means.
We consider the problem of using a factor model we call {\em spike-and-slab
sparse coding} (S3C) to learn features for a classification task. The S3C model
resembles both the spike-and-slab RBM and sparse coding. Since exact inference
in this model is intractable, we derive a structured variational inference
procedure and employ a variational EM training algorithm. Prior work on
approximate inference for this model has not prioritized the ability to exploit
parallel architectures and scale to enormous problem sizes.
The hierarchical Dirichlet process (HDP) has become an important Bayesian
nonparametric model for grouped data, such as document collections. The HDP is
used to construct a flexible mixed-membership model where the number of
components is determined by the data. As for most Bayesian nonparametric
models, exact posterior inference is intractable---practitioners use Markov
chain Monte Carlo (MCMC) or variational inference.
An empirical investigation of the interaction of sample size and
discretization - in this case the entropy-based method CAIM (Class-Attribute
Interdependence Maximization) - was undertaken to evaluate the impact and
potential bias introduced into data mining performance metrics due to variation
in sample size as it impacts the discretization process. Of particular interest
was the effect of discretizing within cross-validation folds averse to outside
discretization folds.
We present some nonparametric methods for graphical modeling. In the discrete
case, where the data are binary or drawn from a finite alphabet, Markov random
fields are already essentially nonparametric, since the cliques can take only a
finite number of values. Continuous data are different. The Gaussian graphical
model is the standard parametric model for continuous data, but it makes
distributional assumptions that are often unrealistic. We discuss two
approaches to building more flexible graphical models.
Information theoretic active learning has been widely studied for
probabilistic models. For simple regression an optimal myopic policy is easily
tractable. However, for other tasks and with more complex models, such as
classification with nonparametric models, the optimal solution is harder to
compute. Current approaches make approximations to achieve tractability. We
propose an approach that expresses information gain in terms of predictive
entropies, and apply this method to the Gaussian Process Classifier (GPC).
Low-rank structure have been profoundly studied in data mining and machine
learning. In this paper, we show a dense matrix $X$'s low-rank approximation
can be rapidly built from its left and right random projections $Y_1=XA_1$ and
$Y_2=X^TA_2$, or bilateral random projection (BRP). We then show power scheme
can further improve the precision. The deterministic, average and deviation
bounds of the proposed method and its power scheme modification are proved
theoretically.
We formulate a convex minimization to robustly recover a subspace from a
contaminated data set, partially sampled around it, and propose a fast
iterative algorithm to achieve the corresponding minimum. We establish exact
recovery by this minimizer, quantify the effect of noise and regularization,
explain how to take advantage of a known intrinsic dimension and establish
linear convergence of the iterative algorithm.
Supervised linear feature extraction corresponds to fitting a reduced rank
multivariate model. This paper studies rank penalized and rank constrained
multivariate generalized linear models. From the perspective of thresholding
rules, we build a framework for fitting singular value penalized models and use
it for feature extraction. Through solving the rank constraint form of the
problem, we propose progressive feature space reduction for fast computation in
high dimensions with little performance loss.
A beta-negative binomial (BNB) process is proposed, leading to a
beta-gamma-Poisson process, which may be viewed as a "multi-scoop"
generalization of the beta-Bernoulli process. The BNB process is augmented into
a beta-gamma-gamma-Poisson hierarchical structure, and applied as a
nonparametric Bayesian prior for an infinite Poisson factor analysis model. A
finite approximation for the beta process Levy random measure is constructed
for convenient implementation. Efficient MCMC computations are performed with
data augmentation and marginalization techniques.
We consider the problem of function estimation in the case where the data
distribution may shift between training and test time, and additional
information about it may be available at test time. This relates to popular
scenarios such as covariate shift, concept drift, transfer learning and
semi-supervised learning. This working paper discusses how these tasks could be
tackled depending on the kind of changes of the distributions. It argues that
knowledge of an underlying causal direction can facilitate several of these
tasks.
We describe a simple and efficient procedure for approximating the L\'evy
measure of a $\text{Gamma}(\alpha,1)$ random variable. We use this
approximation to derive a finite sum-representation that converges almost
surely to Ferguson's representation of the Dirichlet process based on arrivals
of a homogeneous Poisson process. We compare the efficiency of our
approximation to several other well known approximations of the Dirichlet
process and demonstrate a substantial improvement.
This paper considers the sparse eigenvalue problem, which is to extract
dominant (largest) sparse eigenvectors with at most $k$ non-zero components. We
propose a simple yet effective solution called truncated power method that can
approximately solve the underlying nonconvex optimization problem. A strong
sparse recovery result is proved for the truncated power method, and this
theory is our key motivation for developing the new algorithm. The proposed
method is tested on applications such as sparse principal component analysis
and the densest $k$-subgraph problem.
Exact inference in the linear regression model with spike and slab priors is
often intractable. Expectation propagation (EP) can be used for approximate
inference. However, the regular sequential form of EP (R-EP) may fail to
converge in this model when the size of the training set is very small. As an
alternative, we propose a provably convergent EP algorithm (PC-EP).
It is now well known that decentralised optimisation can be formulated as a
potential game, and game-theoretical learning algorithms can be used to find an
optimum. One of the most common learning techniques in game theory is
fictitious play. However fictitious play is founded on an implicit assumption
that opponents' strategies are stationary. We present a novel variation of
fictitious play that allows the use of a more realistic model of opponent
strategy.
Unbalanced data arises in many learning tasks such as clustering of
multi-class data, hierarchical divisive clustering and semisupervised learning.
Graph-based approaches are popular tools for these problems. Graph construction
is an important aspect of graph-based learning. We show that graph-based
algorithms can fail for unbalanced data for many popular graphs such as k-NN,
\epsilon-neighborhood and full-RBF graphs. We propose a novel graph
construction technique that encodes global statistical information into node
degrees through a ranking scheme.
The asymptotic pseudo-trajectory approach to stochastic approximation of
Benaim, Hofbauer and Sorin is extended for asynchronous stochastic
approximations with a set-valued mean field. The asynchronicity of the process
is incorporated into the mean field to produce convergence results which remain
similar to those of an equivalent synchronous process. In addition, this allows
many of the restrictive assumptions previously associated with asynchronous
stochastic approximation to be removed.
Contemporary global optimization algorithms are based on local measures of
utility, rather than a probability measure over location and value of the
optimum. They thus attempt to collect low function values, not to learn about
the optimum. The reason for the absence of probabilistic global optimizers is
that the corresponding inference problem is intractable in several ways.
We apply the concept of structured sparsity to improve boosted decision trees
with general loss functions. The existing approach to this problem is
Friedman's gradient boosting procedure using a regression tree base learner.
Although this method has led to many successful industrial applications, it
suffers from several theoretical and practical drawbacks. By employing the idea
of structured greedy search, we are able to design a regularized greedy forest
procedure to address these issues.
This work concerns the way that statistical models are used to make
decisions. In particular, we aim to merge the way estimation algorithms are
designed with how they are used for a subsequent task. Our methodology
considers the operational cost of carrying out a policy, based on a predictive
model. The operational cost becomes a regularization term in the learning
algorithm's objective function, allowing either an \textit{optimistic} or
\textit{pessimistic} view of possible costs.
Information-maximization clustering learns a probabilistic classifier in an
unsupervised manner so that mutual information between feature vectors and
cluster assignments is maximized. A notable advantage of this approach is that
it only involves continuous optimization of model parameters, which is
substantially easier to solve than discrete optimization of cluster
assignments. However, existing methods still involve non-convex optimization
problems, and therefore finding a good local optimal solution is not
straightforward in practice.
We develop mask iterative hard thresholding algorithms (mask IHT and mask
DORE) for sparse image reconstruction of objects with known contour. The
measurements follow a noisy underdetermined linear model common in the
compressive sampling literature. Assuming that the contour of the object that
we wish to reconstruct is known and that the signal outside the contour is
zero, we formulate a constrained residual squared error minimization problem
that incorporates both the geometric information (i.e. the knowledge of the
object's contour) and the signal sparsity constraint.
A linear non-Gaussian structural equation model called LiNGAM is an
identifiable model for exploratory causal analysis. Previous methods estimate a
causal ordering of variables and their connection strengths based on a single
dataset. However, in many application domains, data are obtained under
different conditions, that is, multiple datasets are obtained rather than a
single dataset.
Probabilistic graphical models combine the graph theory and probability
theory to give a multivariate statistical modeling. They provide a unified
description of uncertainty using probability and complexity using the graphical
model. Especially, graphical models provide the following several useful
properties:
- Graphical models provide a simple and intuitive interpretation of the
structures of probabilistic models. On the other hand, they can be used to
design and motivate new models.
Recent breakthrough results in compressed sensing (CS) have established that
many high dimensional objects can be accurately recovered from a relatively
small number of non- adaptive linear projection observations, provided that the
objects possess a sparse representation in some basis.
While Gaussian probability densities are omnipresent in applied mathematics,
Gaussian cumulative probabilities are hard to calculate in any but the
univariate case. We offer here an empirical study of the utility of Expectation
Propagation (EP) as an approximate integration method for this problem. For
rectangular integration regions, the approximation is highly accurate. We also
extend the derivations to the more general case of polyhedral integration
regions. However, we find that in this polyhedral case, EP's answer, though
often accurate, can be almost arbitrarily wrong.
Driven by a large number of potential applications in areas like
bioinformatics, information retrieval and social network analysis, the problem
setting of inferring relations between pairs of data objects has recently been
investigated quite intensively in the machine learning community. To this end,
current approaches typically consider datasets containing crisp relations, so
that standard classification methods can be adopted. However, relations between
objects like similarities and preferences are often expressed in a graded
manner in real-world applications.
The Ward error sum of squares hierarchical clustering method has been very
widely used since its first description by Ward in a 1963 publication. It has
also been generalized in various ways. However there are different
interpretations in the literature and there are different implementations of
the Ward agglomerative algorithm in commonly used software systems, including
differing expressions of the agglomerative criterion. Our survey work and case
studies will be useful for all those involved in developing software for data
analysis using Ward's hierarchical clustering method.
We describe many vantage points on the Baire metric and its use in clustering
data, or its use in preprocessing and structuring data in order to support
search and retrieval operations. In some cases, we proceed directly to clusters
and do not directly determine the distances. We show how a hierarchical
clustering can be read directly from one pass through the data. We offer
insights also on practical implications of precision of data measurement. As a
mechanism for treating multidimensional data, including very high dimensional
data, we use random projections.
Gaussian process models -also called Kriging models- are often used as
mathematical approximations of expensive experiments. However, the number of
observation required for building an emulator becomes unrealistic when using
classical covariance kernels when the dimension of input increases. In oder to
get round the curse of dimensionality, a popular approach is to consider
simplified models such as additive models.
We consider a standard binary classification problem. The performance of any
binary classifier based on the training data is characterized by the excess
risk. We study Bahadur's type exponential bounds on the minimax accuracy
confidence function based on the excess risk. We study how this quantity
depends on the complexity of the class of distributions characterized by
exponents of entropies of the class of regression functions or of the class of
Bayes classifiers corresponding to the distributions from the class.
This paper addresses the problem of estimating the latent dimensionality in
nonnegative matrix factorization (NMF). The estimation is done via automatic
relevance determination (ARD). Uncovering the model order is important as it is
necessary to strike the right balance between data fidelity and overfitting. We
propose a Bayesian model for NMF and two families of algorithms known as l1-ARD
and l2-ARD, each assuming different priors on the basis elements and the
activation coefficients.
This paper addresses the problem of segmenting a time-series with respect to
changes in the mean value or in the variance. The first case is when the time
data is modeled as a sequence of independent and normal distributed random
variables with unknown, possibly changing, mean value but fixed variance. The
main assumption is that the mean value is piecewise constant in time, and the
task is to estimate the change times and the mean values within the segments.
The second case is when the mean value is constant, but the variance can
change.
Iterative information processing, either based on heuristics or analytical
frameworks, has been shown to be a very powerful tool for the design of
efficient, yet feasible, wireless receiver architectures. Within this context,
algorithms performing message-passing on a probabilistic graph, such as the
sum-product (SP) and variational message passing (VMP) algorithms, have become
increasingly popular.
We information-theoretically reformulate two measures of capacity from
statistical learning theory: empirical VC-entropy and empirical Rademacher
complexity. We show these capacity measures count the number of hypotheses
about a dataset that a learning algorithm falsifies when it finds the
classifier in its repertoire minimizing empirical risk. It then follows from
that the future performance of predictors on unseen data is controlled in part
by how many hypotheses the learner falsifies.
The graphical lasso [Banerjee et al., 2008, Friedman et al., 2007b] is a
popular approach for learning the structure in an undirected Gaussian graphical
model, using $\ell_1$ regularization to control the number of zeros in the
precision matrix ?${\B\Theta}={\B\Sigma}^{-1}$. The R package glasso is
popular, fast, and allows one to efficiently build a path of models for
different values of the tuning parameter. Convergence of GLASSO can be tricky;
the converged precision matrix might not be the inverse of the estimated
covariance, and occasionally it fails to converge with warm starts.
In this paper, we propose a second order optimization method to learn models
where both the dimensionality of the parameter space and the number of training
samples is high. In our method, we construct on each iteration a Krylov
subspace formed by the gradient and an approximation to the Hessian matrix, and
then use a subset of the training data samples to optimize over this subspace.
As with the Hessian Free (HF) method of [7], the Hessian matrix is never
explicitly constructed, and is computed using a subset of data.
In this paper, we give a new generalization error bound of Multiple Kernel
Learning (MKL) for a general class of regularizations, and discuss what kind of
regularization gives a favorable predictive accuracy. Our main target in this
paper is dense type regularizations including \ellp-MKL. According to the
recent numerical experiments, the sparse regularization does not necessarily
show a good performance compared with dense type regularizations.
We are working to develop automated intelligent agents, which can act and
react as learning machines with minimal human intervention. To accomplish this,
an intelligent agent is viewed as a question-asking machine, which is designed
by coupling the processes of inference and inquiry to form a model-based
learning unit. In order to select maximally-informative queries, the
intelligent agent needs to be able to compute the relevance of a question.
Vapnik-Chervonenkis (VC) dimension is a fundamental measure of the
generalization capacity of learning algorithms. However, apart from a few
special cases, it is hard or impossible to calculate analytically. Vapnik et
al. [10] proposed a technique for estimating the VC dimension empirically.
While their approach behaves well in simulations, it could not be used to bound
the generalization risk of classifiers, because there were no bounds for the
estimation error of the VC dimension itself.
The graphical lasso (glasso) is a widely-used fast algorithm for estimating
sparse inverse covariance matrices. The glasso solves an L_1 penalized maximum
likelihood problem and is implemented on CRAN. The output from the glasso, a
regularized covariance matrix estimate Sigma_glasso and a sparse inverse
covariance matrix estimate Omega_glasso, not only identify a graphical model
but can also serve as intermediate inputs into multivariate procedures such as
PCA, LDA, MANOVA, and others.
The popular cubic smoothing spline estimate of a regression function arises
as the minimizer of the penalized sum of squares $\sum_j(Y_j - {\mu}(t_j))^2 +
{\lambda}\int_a^b [{\mu}"(t)]^2 dt$, where the data are $t_j,Y_j$, $j=1,...,
n$. The minimization is taken over an infinite-dimensional function space, the
space of all functions with square integrable second derivatives. But the
calculations can be carried out in a finite-dimensional space.
A main goal of regression is to derive statistical conclusions on the
conditional distribution of the output variable Y given the input values x. Two
of the most important characteristics of a single distribution are location and
scale. Support vector machines (SVMs) are well established to estimate location
functions like the conditional median or the conditional mean. We investigate
the estimation of scale functions by SVMs when the conditional median is
unknown, too. Estimation of scale functions is important e.g. to estimate the
volatility in finance.
Principal component analysis (PCA) is widely used for dimensionality
reduction, with well-documented merits in various applications involving
high-dimensional data, including computer vision, preference measurement, and
bioinformatics. In this context, the fresh look advocated here permeates
benefits from variable selection and compressive sampling, to robustify PCA
against outliers.
In this paper we address the problem of pool based active learning, and
provide an algorithm, called UPAL, that works by minimizing the unbiased
estimator of the risk of a hypothesis in a given hypothesis space. For the
space of linear classifiers and the squared loss we show that UPAL is
equivalent to an exponentially weighted average forecaster. Exploiting some
recent results regarding the spectra of random matrices allows us to establish
consistency of UPAL when the true hypothesis is a linear hypothesis.
Linear and Quadratic Discriminant analysis (LDA/QDA) are common tools for
classification problems. For these methods we assume observations are normally
distributed within group. We estimate a mean and covariance matrix for each
group and classify using Bayes theorem. With LDA, we estimate a single, pooled
covariance matrix, while for QDA we estimate a separate covariance matrix for
each group. Rarely do we believe in a homogeneous covariance structure between
groups, but often there is insufficient data to separately estimate covariance
matrices.
Discovering causal relationships is a hard task, often hindered by the need
for intervention, and often requiring large amounts of data to resolve
statistical uncertainty. However, humans quickly arrive at useful causal
relationships. One possible reason is that humans use strong prior knowledge;
and rather than encoding hard causal relationships, they encode beliefs over
causal structures, allowing for sound generalization from the observations they
obtain from directly acting in the world.
We propose a novel method of introducing structure into existing machine
learning techniques by developing structure-based similarity and distance
measures. To learn structural information, low-dimensional structure of the
data is captured by solving a non-linear, low-rank representation problem. We
show that this low-rank representation can be kernelized, has a closed-form
solution, allows for separation of independent manifolds, and is robust to
noise.
The starting point of this article is the question "How to retrieve
fingerprints of rhythm in written texts?". We address this problem in the case
of Brazilian and European Portuguese. These two dialects of Modern Portuguese
share the same lexicon and most of the sentences they produce are superficially
identical. Yet they are conjectured, on linguistic grounds, to implement
different rhythms. We show that this linguistic question can be formulated as a
problem of model selection in the class of variable length Markov chains.
We propose a scalable, efficient and statistically motivated computational
framework for Graphical Lasso (Friedman et al., 2007b) - a covariance
regularization framework that has received significant attention in the
statistics community over the past few years. Existing algorithms have trouble
in scaling to dimensions larger than a thousand. Our proposal significantly
enhances the state-of-the-art for such moderate sized problems and gracefully
scales to larger problems where other algorithms become practically infeasible.
This requires a few key new ideas.
Latent feature models are widely used to decompose data into a small number
of components. Bayesian nonparametric variants of these models, which use the
Indian buffet process (IBP) as a prior over latent features, allow the number
of features to be determined from the data. We present a generalization of the
IBP, the distance dependent Indian buffet process (dd-IBP), for modeling
non-exchangeable data. It relies on a distance function defined between data
points, biasing nearby data to share more features.
We describe the first sub-quadratic sampling algorithm for the Multiplicative
Attribute Graph Model (MAGM) of Kim and Leskovec (2010). We exploit the close
connection between MAGM and the Kronecker Product Graph Model (KPGM) of
Leskovec et al. (2010), and show that to sample a graph from a MAGM it suffices
to sample small number of KPGM graphs and \emph{quilt} them together. Under
mild technical conditions our algorithm runs in $O((\log_2(n))^3 |E|)$ time,
where $n$ is the number of nodes and $|E|$ is the number of edges in the
sampled graph.
We show that the use of techniques from algebra and algebraic geometry can be
highly beneficial for tackling machine learning problems, where the set of
desired solutions can be described in terms of approximate polynomial
equations. Namely, they allow one to directly solve learning problems using
algebraic operations thus avoiding iterative optimization which may converge to
local minima. In this spirit, we suggest a novel representation for probability
distributions in terms of elements in the polynomial ring derived from
estimated cumulants.
We introduce a new regression framework, Gaussian process regression networks
(GPRN), which combines the structural properties of Bayesian neural networks
with the non-parametric flexibility of Gaussian processes. This model
accommodates input dependent signal and noise correlations between multiple
response variables, input dependent length-scales and amplitudes, and
heavy-tailed predictive distributions. We derive both efficient Markov chain
Monte Carlo and variational Bayes inference procedures for this model.
We introduce a factor analysis model that summarizes the dependencies between
observed variable groups, instead of dependencies between individual variables
as standard factor analysis does. A group may correspond to one view of the
same set of objects, one of many data sets tied by co-occurrence, or a set of
alternative variables collected from statistics tables to measure one property
of interest.
We consider the problem of covariance matrix estimation in the presence of
latent variables. Under suitable conditions, it is possible to learn the
marginal covariance matrix of the observed variables via a tractable convex
program, where the concentration matrix of the observed variables is decomposed
into a sparse matrix (representing the graphical structure of the observed
variables) and a low rank matrix (representing the marginalization effect of
latent variables). We present an efficient first-order method based on split
Bregman to solve the convex problem.
Detection of emerging topics are now receiving renewed interest motivated by
the rapid growth of social networks. Conventional term-frequency-based
approaches may not be appropriate in this context, because the information
exchanged are not only texts but also images, URLs, and videos. We focus on the
social aspects of theses networks. That is, the links between users that are
generated dynamically intentionally or unintentionally through replies,
mentions, and retweets.
We study the generalization performance of arbitrary online learning
algorithms trained on samples coming from a dependent source of data. We show
that the generalization error of any stable online algorithm concentrates
around its regret--an easily computable statistic of the online performance of
the algorithm--when the underlying ergodic process is $\beta$- or
$\phi$-mixing.
Transportation distances have been used for more than a decade now in machine
learning to compare histograms of features. They have one parameter: the ground
metric, which can be any metric between the features themselves. As is the case
for all parameterized distances, transportation distances can only prove useful
in practice when this parameter is carefully chosen. To date, the only option
available to practitioners to set the ground metric parameter was to rely on a
priori knowledge of the features, which limited considerably the scope of
application of transportation distances.
We consider the problem of learning the structure of Ising models (pairwise
binary Markov random fields) from i.i.d. samples. While several methods have
been proposed to accomplish this task, their relative merits and limitations
remain somewhat obscure. By analyzing a number of concrete examples, we show
that low-complexity algorithms often fail when the Markov random field develops
long-range correlations. More precisely, this phenomenon appears to be related
to the Ising model phase transition (although it does not coincide with it).
There is a growing interest in using a longitudinal observational databases
to detect drug safety signal. In this paper we present a novel method, which we
used online during the OMOP Cup. We consider homogeneous ensembling, which is
based on random re-sampling (known, also, as bagging) as a main innovation
compared to the previous publications in the related field. This study is based
on a very large simulated database of the 10 million patients records, which
was created by the Observational Medical Outcomes Partnership (OMOP).
We present the discrete infinite logistic normal distribution (DILN), a
Bayesian nonparametric prior for mixed membership models. DILN is a
generalization of the hierarchical Dirichlet process (HDP) that models
correlation structure between the weights of the atoms at the group level. We
derive a representation of DILN as a normalized collection of gamma-distributed
random variables, and study its statistical properties. We consider
applications to topic modeling and derive a variational inference algorithm for
approximate posterior inference.
Modelling the real world complexity of music is a challenge for machine
learning. We address the task of modeling melodic sequences from the same music
genre. We perform a comparative analysis of two probabilistic models; a
Dirichlet Variable Length Markov Model (Dirichlet-VMM) and a Time Convolutional
Restricted Boltzmann Machine (TC-RBM). We show that the TC-RBM learns
descriptive music features, such as underlying chords and typical melody
transitions and dynamics.
In this paper, we extend Meek's conjecture (Meek 1997) from directed and
acyclic graphs to chain graphs, and prove that the extended conjecture is true.
Specifically, we prove that if a chain graph H is an independence map of the
independence model induced by another chain graph G, then (i) G can be
transformed into H by a sequence of directed and undirected edge additions and
feasible splits and mergings, and (ii) after each operation in the sequence H
remains an independence map of the independence model induced by G.
Research in reinforcement learning has produced algorithms for optimal
decision making under uncertainty that fall within two main types. The first
employs a Bayesian framework, where optimality improves with increased
computational time. This is because the resulting planning task takes the form
of a dynamic programming problem on a belief tree with an infinite number of
states. The second type employs relatively simple algorithm which are shown to
suffer small regret within a distribution-free framework.
We present a probabilistic model for natural images which is based on
Gaussian scale mixtures and a simple multiscale representation. In contrast to
the dominant approach to modeling whole images focusing on Markov random
fields, we formulate our model in terms of a directed graphical model. We show
that it is able to generate images with interesting higher-order correlations
when trained on natural images or samples from an occlusion based model.
This paper presents algorithms for hierarchical, agglomerative clustering
which perform most efficiently in the general-purpose setup that is given in
modern standard software. Requirements are: (1) the input data is given by
pairwise dissimilarities between data points, but extensions to vector data are
also discussed (2) the output is a "stepwise dendrogram", a data structure
which is shared by all implementations in current standard software.
We propose a non-parametric link prediction algorithm for a sequence of graph
snapshots over time. The model predicts links based on the features of its
endpoints, as well as those of the local neighborhood around the endpoints.
This allows for different types of neighborhoods in a graph, each with its own
dynamics (e.g, growing or shrinking communities). We prove the consistency of
our estimator, and give a fast implementation based on locality-sensitive
hashing.
The performance of Orthogonal Matching Pursuit (OMP) for variable selection
is analyzed for random designs. When contrasted with the deterministic case,
since the performance is here measured after averaging over the distribution of
the design matrix, one can have far less stringent sparsity constraints on the
coefficient vector. We demonstrate that for exact sparse vectors, the
performance of the OMP is similar to known results on the Lasso algorithm
[\textit{IEEE Trans. Inform. Theory} \textbf{55} (2009) 2183--2202].
Ensemble methods for supervised machine learning have become popular due to
their ability to accurately predict class labels with groups of simple,
lightweight "base learners." While ensembles offer computationally efficient
models that have good predictive capability they tend to be large and offer
little insight into the patterns or structure in a dataset. We consider an
ensemble technique that returns a model of ranked rules. The model accurately
predicts class labels and has the advantage of indicating which parameter
constraints are most useful for predicting those labels.
This paper presents regression models obtained from a process of blind
prediction of peptide binding affinity from provided descriptors for several
distinct datasets as part of the 2006 Comparative Evaluation of Prediction
Algorithms (COEPRA) contest. This paper finds that kernel partial least
squares, a nonlinear partial least squares (PLS) algorithm, outperforms PLS,
and that the incorporation of transferable atom equivalent features improves
predictive capability.
This article addresses the problem of classification method based on both
labeled and unlabeled data, where we assume that a density function for labeled
data is different from that for unlabeled data. We propose a semi-supervised
logistic regression model for classification problem along with the technique
of covariate shift adaptation. Unknown parameters involved in proposed models
are estimated by regularization with EM algorithm. A crucial issue in modeling
process is the choices of tuning parameters in our semi-supervised logistic
models.
Concave regularization methods provide natural procedures for sparse
recovery. However, they are difficult to analyze in the high dimensional
setting. Only recently a few sparse recovery results have been established for
some specific local solutions obtained via specialized numerical procedures.
Still, the fundamental relationship between these solutions such as whether
they are identical or their relationship to the global minimizer of the
underlying nonconvex formulation is unknown.
Monte Carlo (MC) techniques are often used to estimate integrals of a
multivariate function using randomly generated samples of the function. In
light of the increasing interest in uncertainty quantification and robust
design applications in aerospace engineering, the calculation of expected
values of such functions (e.g. performance measures) becomes important.
However, MC techniques often suffer from high variance and slow convergence as
the number of samples increases.
Sparse modeling and estimation of complex signals is not uncommon in
practice. However, historically, much attention has been drawn to real-valued
system models, lacking the research of sparse signal modeling and estimation
for complex-valued models. This paper introduces a unifying sparse Bayesian
formalism that generalizes to complex- as well as real-valued systems. The
methodology relies on hierarchical Bayesian sparsity-inducing prior modeling of
the parameter of interest.
The optimal selection of experimental conditions is essential to maximizing
the value of data for inference and prediction, particularly in situations
where experiments are time-consuming and expensive to conduct. We propose a
general mathematical framework and an algorithmic approach for optimal
experimental design with nonlinear simulation-based models; in particular, we
focus on finding sets of experiments that provide the most information about
targeted sets of parameters.
We consider the sparse inverse covariance regularization problem or graphical
lasso with regularization parameter $\rho$. Suppose the co- variance graph
formed by thresholding the entries of the sample covariance matrix at $\rho$ is
decomposed into connected components. We show that the vertex-partition induced
by the thresholded covariance graph is exactly equal to that induced by the
estimated concentration graph. This simple rule, when used as a wrapper around
existing algorithms, leads to enormous performance gains.
In this work we introduce a mixture of GPs to address the data association
problem, i.e. to label a group of observations according to the sources that
generated them. Unlike several previously proposed GP mixtures, the novel
mixture has the distinct characteristic of using no gating function to
determine the association of samples and mixture components. Instead, all the
GPs in the mixture are global and samples are clustered following
"trajectories" across input space.
Multi-step ahead forecasting is still an open challenge in time series
forecasting. Several approaches that deal with this complex problem have been
proposed in the literature but an extensive comparison on a large number of
tasks is still missing. This paper aims to fill this gap by reviewing existing
strategies for multi-step ahead forecasting and comparing them in theoretical
and practical terms.
We present a method to estimate block membership of nodes in a random graph
generated by a stochastic blockmodel. We use an embedding procedure motivated
by the random dot product graph model, a particular example of the latent
position model. The embedded vectors are clustered through minimization of a
mean square error/criteria. We prove that this method is consistent for
assigning nodes to blocks, as only a negligible number of nodes will be
mis-assigned. We prove consistency of the method for directed and undirected
graphs.
We propose a novel algebraic framework for treating probability distributions
represented by their cumulants such as the mean and covariance matrix. As an
example, we consider the unsupervised learning problem of finding the subspace
on which several probability distributions agree. Instead of minimizing an
objective function involving the estimated cumulants, we show that by treating
the cumulants as elements of the polynomial ring we can directly solve the
problem, at a lower computational cost and with higher accuracy.
In many data mining applications collection of sufficiently large datasets is
the most time consuming and expensive. On the other hand, industrial methods of
data collection create huge databases, and make difficult direct applications
of the advanced machine learning algorithms. To address the above problems, we
consider active learning (AL), which may be very efficient either for the
experimental design or for the data filtering.
This paper studies the deviations of the regret in a stochastic multi-armed
bandit problem. When the total number of plays n is known beforehand by the
agent, Audibert et al. (2009) exhibit a policy such that with probability at
least 1-1/n, the regret of the policy is of order log(n). They have also shown
that such a property is not shared by the popular ucb1 policy of Auer et al.
(2002). This work first answers an open question: it extends this negative
result to any anytime policy.
We investigate multi-task learning from an output space regularization
perspective. Most multi-task approaches tie together related tasks by
constraining them to share input spaces and function classes. In contrast to
this, we propose a multi-task paradigm which we call output space
regularization, in which the only constraint is that the output spaces of the
multiple tasks are related. We focus on a specific instance of output space
regularization, multi-task averaging, that is both widely applicable and
amenable to analysis.
In many areas of machine learning, it becomes necessary to find the
eigenvector decompositions of large matrices. We discuss two methods for
reducing the computational burden of spectral decompositions: the more
venerable Nystom extension and a newly introduced algorithm based on random
projections. Previous work has centered on the ability to reconstruct the
original matrix. We argue that a more interesting and relevant comparison is
their relative performance in clustering and classification tasks using the
approximate eigenvectors as features.
Graphical models compactly capture stochastic dependencies amongst a
collection of random variables using a graph. Inference over graphical models
corresponds to finding marginal probability distributions given joint
probability distributions. Several inference algorithms rely on iterative
message passing between nodes. Although these algorithms can be generalized so
that the message passing occurs between clusters of nodes, there are limited
frameworks for finding such clusters. Moreover, current frameworks rely on
finding clusters that are overlapping.
We present a new, efficient method for automatically detecting severe
conflicts `edit wars' in Wikipedia and evaluate this method on six different
language WPs. We discuss how the number of edits, reverts, the length of
discussions, the burstiness of edits and reverts deviate in such pages from
those following the general workflow, and argue that earlier work has
significantly over-estimated the contentiousness of the Wikipedia editing
process.
In many scientific disciplines structures in high-dimensional data have to be
found, e.g., in stellar spectra, in genome data, or for face recognition tasks.
In this work we present a novel approach to non-linear dimensionality
reduction. It is based on fitting K-nearest neighbor regression to the
unsupervised regression framework for learning of low-dimensional manifolds.
Similar to related approaches that are mostly based on kernel methods,
unsupervised K-nearest neighbor (UKNN) regression optimizes latent variables
w.r.t.
We propose a method for nonparametric density estimation that exhibits
robustness to contamination of the training sample. This method achieves
robustness by combining a traditional kernel density estimator (KDE) with ideas
from classical M-estimation. We interpret the KDE based on a radial, positive
semi-definite kernel as a sample mean in the associated reproducing kernel
Hilbert space. Since the sample mean is sensitive to outliers, we estimate it
robustly via M-estimation, yielding a robust kernel density estimator (RKDE).
Purely data driven approaches for machine learning present difficulties when
data is scarce relative to the complexity of the model or when the model is
forced to extrapolate. On the other hand, purely mechanistic approaches need to
identify and specify all the interactions in the problem at hand (which may not
be feasible) and still leave the issue of how to parameterize the system. In
this paper, we present a hybrid approach using Gaussian processes and
differential equations to combine data driven modelling with a physical model
of the system.
Machine learning approaches to multi-label document classification have (to
date) largely relied on discriminative modeling techniques such as support
vector machines. A drawback of these approaches is that performance rapidly
drops off as the total number of labels and the number of labels per document
increase.
We consider the problem of training probabilistic conditional random fields
(CRFs) in the context of a task where performance is measured using a specific
loss function. While maximum likelihood is the most common approach to training
CRFs, it ignores the inherent structure of the task's loss function.
We consider the problem of high-dimensional Ising (graphical) model
selection. We propose a simple algorithm for structure estimation based on the
thresholding of the empirical conditional mutual information quantities. This
algorithm requires only low-order statistics of the data and has a sample
complexity of n =omega(J_{min}^{-4} log p), where p is the number of variables
and $J_{\min}$ is the minimum (absolute) edge potential in the model. We also
establish necessary conditions for structure estimation.
Kernel methods are among the most popular techniques in machine learning.
From a frequentist/discriminative perspective they play a central role in
regularization theory as they provide a natural choice for the hypotheses space
and the regularization functional through the notion of reproducing kernel
Hilbert spaces. From a Bayesian/generative perspective they are the key in the
context of Gaussian processes, where the kernel function is also known as the
covariance function.
To better understand the spatial structure of large panels of economic and
financial time series and provide a guideline for constructing semiparametric
models, this paper first considers estimating a large spatial covariance matrix
of the generalized $m$-dependent and $\beta$-mixing time series (with $J$
variables and $T$ observations) by hard thresholding regularization as long as
${{\log J \, \cx^*(\ct)}}/{T} = \Co(1)$ (the former scheme with some time
dependence measure $\cx^*(\ct)$) or $\log J /{T} = \Co(1)$ (the latter scheme
with some upper bounded mixing coefficient).
Divergence estimators based on direct approximation of density-ratios without
going through separate approximation of numerator and denominator densities
have been successfully applied to machine learning tasks that involve
distribution comparison such as outlier detection, transfer learning, and
two-sample homogeneity test. However, since density-ratio functions often
possess high fluctuation, divergence estimation is still a challenging task in
practice.
This paper presents Natural Evolution Strategies (NES), a recent family of
algorithms that constitute a more principled approach to black-box optimization
than established evolutionary algorithms. NES maintains a parameterized
distribution on the set of solution candidates, and the natural gradient is
used to update the distribution's parameters in the direction of higher
expected fitness. We introduce a collection of techniques that address issues
of convergence, robustness, sample complexity, computational complexity and
sensitivity to hyperparameters.
This paper considers the robust and efficient implementation of Gaussian
process regression with a Student-t observation model. The challenge with the
Student-t model is the analytically intractable inference which is why several
approximative methods have been proposed. The expectation propagation (EP) has
been found to be a very accurate method in many empirical studies but the
convergence of the EP is known to be problematic with models containing
non-log-concave site functions such as the Student-t distribution.
Standard compressive sensing results state that to exactly recover an s
sparse signal in R^p, one requires O(s\cdotlog p) measurements. While this
bound is extremely useful in practice, often real world signals are not only
sparse, but also exhibit structure in the sparsity pattern. We focus on
group-structured patterns in this paper. Under this model, groups of signal
coefficients are active (or inactive) together. The groups are predefined, but
the particular set of groups that are active (i.e., in the signal support) must
be learned from measurements.
Probabilistic principal component analysis (PPCA) seeks a low dimensional
representation of a data set in the presence of independent spherical Gaussian
noise, Sigma = (sigma^2)*I. The maximum likelihood solution for the model is an
eigenvalue problem on the sample covariance matrix. In this paper we consider
the situation where the data variance is already partially explained by other
factors, e.g. covariates of interest, or temporal correlations leaving some
residual variance.
Nonnegative matrix factorization (NMF) is now a common tool for audio source
separation. When learning NMF on large audio databases, one major drawback is
that the complexity in time is O(FKN) when updating the dictionary (where (F;N)
is the dimension of the input power spectrograms, and K the number of basis
spectra), thus forbidding its application on signals longer than an hour. We
provide an online algorithm with a complexity of O(FK) in time and memory for
updates in the dictionary.
A key problem in statistical modeling is model selection, how to choose a
model at an appropriate level of complexity. This problem appears in many
settings, most prominently in choosing the number ofclusters in mixture models
or the number of factors in factor analysis. In this tutorial we describe
Bayesian nonparametric methods, a class of methods that side-steps this issue
by allowing the data to determine the complexity of the model. This tutorial is
a high-level introduction to Bayesian nonparametric methods and contains
several examples of their application.
We introduce the Pitman Yor Diffusion Tree (PYDT) for hierarchical
clustering, a generalization of the Dirichlet Diffusion Tree (Neal, 2001) which
removes the restriction to binary branching structure. The generative process
is described and shown to result in an exchangeable distribution over data
points. We prove some theoretical properties of the model and then present two
inference methods: a collapsed MCMC sampler which allows us to model
uncertainty over tree structures, and a computationally efficient greedy
Bayesian EM search algorithm.
Due to space limitations, our submission "Source Separation and Clustering of
Phase-Locked Subspaces", accepted for publication on the IEEE Transactions on
Neural Networks in 2011, presented some results without proof. Those proofs are
provided in this paper.
The Baire metric induces an ultrametric on a dataset and is of linear
computational complexity, contrasted with the standard quadratic time
agglomerative hierarchical clustering algorithm. In this work we evaluate
empirically this new approach to hierarchical clustering. We compare
hierarchical clustering based on the Baire metric with (i) agglomerative
hierarchical clustering, in terms of algorithm properties; (ii) generalized
ultrametrics, in terms of definition; and (iii) fast clustering through k-means
partititioning, in terms of quality of results.
It is of increasing importance to develop learning methods for ranking. In
contrast to many learning objectives, however, the ranking problem presents
difficulties due to the fact that the space of permutations is not smooth. In
this paper, we examine the class of rank-linear objective functions, which
includes popular metrics such as precision and discounted cumulative gain. In
particular, we observe that expectations of these gains are completely
characterized by the marginals of the corresponding distribution over
permutation matrices.
Stochastic Kronecker graphs supply a parsimonious model for large sparse real
world graphs. They can specify the distribution of a large random graph using
only three or four parameters. Those parameters have however proved difficult
to choose in specific applications. This article looks at method of moments
estimators that are computationally much simpler than maximum likelihood. The
estimators are fast and in our examples, they typically yield Kronecker
parameters with expected feature counts closer to a given graph than we get
from KronFit.
The exploration-exploitation tradeoff is among the central challenges of
reinforcement learning. A hypothetical exact Bayesian learner would provide the
optimal solution, but is intractable in general. I show that, however, in the
specific case of Gaussian process inference, it is possible to make analytic
statements about optimal learning of both rewards and transition dynamics, for
nonlinear, time-varying systems in continuous time and space, subject to a
relatively weak restriction on the dynamics. The solution is described by an
infinite-dimensional differential equation.
In this paper, we first demonstrate that b-bit minwise hashing, whose
estimators are positive definite kernels, can be naturally integrated with
learning algorithms such as SVM and logistic regression. We adopt a simple
scheme to transform the nonlinear (resemblance) kernel into linear (inner
product) kernel; and hence large-scale problems can be solved extremely
efficiently. Our method provides a simple effective solution to large-scale
learning in massive and extremely high-dimensional datasets, especially when
data do not fit in memory.
This paper addresses the problem of inferring sparse causal networks modeled
by multivariate auto-regressive (MAR) processes. Conditions are derived under
which the Group Lasso (gLasso) procedure consistently estimates sparse network
structure. The key condition involves a "false connection score." In
particular, we show that consistent recovery is possible even when the number
of observations of the network is far less than the number of parameters
describing the network, provided that the false connection score is less than
one.
A number of recent work studied the effectiveness of feature selection using
Lasso. It is known that under the restricted isometry properties (RIP), Lasso
does not generally lead to the exact recovery of the set of nonzero
coefficients, due to the looseness of convex relaxation. This paper considers
the feature selection property of nonconvex regularization, where the solution
is given by a multi-stage convex relaxation scheme.
Multidimensional scaling (MDS) is a class of projective algorithms
traditionally used to produce two- or three-dimensional visualizations of
datasets consisting of multidimensional objects or interobject distances.
Recently, metric MDS has been applied to the problems of graph embedding for
the purpose of approximate encoding of edge or path costs using node
coordinates in metric space. Several authors have also pointed out that for
data with an inherent hierarchical structure, hyperbolic target space may be a
more suitable choice for accurate embedding than Euclidean space.
We consider the problem of online linear regression on individual sequences.
The goal in this paper is for the forecaster to output sequential predictions
which are, after T time rounds, almost as good as the ones output by the best
linear predictor in a given L1-ball in R^d. We consider both the cases where
the dimension d is small and large relative to the time horizon T. We first
present regret bounds with optimal dependencies on the sizes U, X and Y of the
L1-ball, the input data and the observations.
In data sets with many more features than observations, independent screening
based on all univariate regression models leads to a computationally convenient
variable selection method. Recent efforts have shown that in the case of
generalized linear models, independent screening may suffice to capture all
relevant features with high probability, even in ultra-high dimension. It is
unclear whether this formal sure screening property is attainable when the
response is a right-censored survival time.
We show how to compute lower bounds for the maximum possible Bayes error if
the class-conditional distributions must satisfy moment constraints. Our
approach makes use of Curto and Fialkow's solutions for the truncated moment
problem. We also give an upper bound that follows from related work by
Lanckreit et al., Bertsimas and Popescu, and Marshall and Olkin.
We define and discuss the first sparse coding algorithm based on closed-form
EM updates and continuous latent variables. The underlying generative model
consists of a flexibly parameterized `spike-and-slab' prior and a standard
Gaussian noise model. Closed-form solutions for E- and M-step equations are
derived by generalizing probabilistic PCA. The resulting EM algorithm infers
all model parameters including the level of sparsity, and it takes all modes of
a potentially multi-modal posterior into account.