Inverse reinforcement learning (IRL) addresses the problem of recovering a
task description given a demonstration of the optimal policy used to solve such
a task. The optimal policy is usually provided by an expert or teacher, making
IRL specially suitable for the problem of apprenticeship learning. The task
description is encoded in the form of a reward function of a Markov decision
process (MDP). Several algorithms have been proposed to find the reward
function corresponding to a set of demonstrations.
We describe Information Forests, an approach to classification that
generalizes Random Forests by replacing the splitting criterion of non-leaf
nodes from a discriminative one -- based on the entropy of the label
distribution -- to a generative one -- based on maximizing the information
divergence between the class-conditional distributions in the resulting
partitions. The basic idea consists of deferring classification until a measure
of "classification confidence" is sufficiently high, and instead breaking down
the data so as to maximize this measure.
In this paper I present an extended implementation of the Random ferns
algorithm contained in the R package rFerns.
It differs from the original by the ability of consuming categorical and
numerical attributes instead of only binary ones. Also, instead of using simple
attribute subspace ensemble it employs bagging and thus produce error
approximation and variable importance measure modelled after Random forest
algorithm.
In this paper, we derive Hybrid, Bayesian and Marginalized Cramer Rao Lower
Bounds (HCRB, BCRB and MCRB) for the single measurement vector and multiple
measurement vector Sparse Bayesian Learning (SBL) problem of estimating
compressible vectors and their prior distribution parameters. We assume the
unknown vector to be drawn from a compressible Student-t prior distribution. We
derive CRBs that encompass the deterministic or random nature of the unknown
parameters of the prior distribution and the regression noise variance.
A significant challenge to make learning techniques more suitable for general
purpose use is to move beyond i) complete supervision, ii) low dimensional
data, iii) a single task and single view per instance. Solving these challenges
allows working with "Big Data" problems that are typically high dimensional
with multiple (but possibly incomplete) labelings and views.
Most machine learning algorithms, such as classification or regression, treat
the individual data point as the object of interest. Here we consider extending
machine learning algorithms to operate on groups of data points. We suggest
treating a group of data points as a set of i.i.d. samples from an underlying
feature distribution for the group. Our approach is to generalize kernel
machines from vectorial inputs to i.i.d. sample sets of vectors.
Clustering is considered a non-supervised learning setting, in which the goal
is to partition a collection of data points into disjoint clusters. Often a
bound $k$ on the number of clusters is given or assumed by the practitioner.
Many versions of this problem have been defined, most notably $k$-means and
$k$-median.
For a large multi-hop wireless network, nodes are preferable to make
distributed and localized link-scheduling decisions with only interactions
among a small number of neighbors. However, for a slowly decaying channel and
densely populated interferers, a small size neighborhood often results in
nontrivial link outages and is thus insufficient for making optimal scheduling
decisions. A question arises how to deal with the information outside a
neighborhood in distributed link-scheduling. In this work, we develop joint
approximation of information and distributed link scheduling.
We present a novel approach to learn a kernel-based regression function. It
is based on the useof conical combinations of data-based parameterized kernels
and on a new stochastic convex optimization procedure of which we establish
convergence guarantees. The overall learning procedure has the nice properties
that a) the learned conical combination is automatically designed to perform
the regression task at hand and b) the updates implicated by the optimization
procedure are quite inexpensive.
We introduce a class of learning problems where the agent is presented with a
series of tasks. Intuitively, if there is relation among those tasks, then the
information gained during execution of one task has value for the execution of
another task. Consequently, the agent is intrinsically motivated to explore its
environment beyond the degree necessary to solve the current task it has at
hand. We develop a decision theoretic setting that generalises standard
reinforcement learning tasks and captures this intuition.
Diabetes is a major health problem in both developing and developed countries
and its incidence is rising dramatically. In this study, we investigate a novel
automatic approach to diagnose Diabetes disease based on Feature Weighted
Support Vector Machines (FW-SVMs) and Modified Cuckoo Search (MCS). The
proposed model consists of three stages: Firstly, PCA is applied to select an
optimal subset of features out of set of all the features. Secondly, Mutual
Information is employed to construct the FWSVM by weighting different features
based on their degree of importance.
Feature selection methods such as wrappers may produce high-quality feature
subsets, however they often require building multiple models and are
time-consuming. Motivated by a fresh perspective on the relationship between an
information-based forward feature selection criterion and decision trees, in
this paper we propose a tree regularization framework, in which only a single
model needs to be built. The framework enables many tree models to perform
feature selection. The key idea of the regularization framework is to penalize
using a new feature for splitting when its gain (e.g.
The issue of discrete probability estimation for samples of small size is
addressed in this study. The maximum likelihood method often suffers
over-fitting when insufficient data is available. Although the Bayesian
approach can avoid over-fitting by using prior distributions, it still has
problems with objective analysis. In response to these drawbacks, a new
theoretical framework based on thermodynamics, where energy and temperature are
introduced, was developed. Entropy and likelihood are placed at the center of
this method.
Latent Dirichlet allocation (LDA) is an important class of hierarchical
Bayesian models for probabilistic topic modeling, which attracts worldwide
interests and touches on many important applications in text mining, computer
vision and computational biology. This paper introduces a topic modeling
toolbox (TMBP) based on the belief propagation (BP) algorithms. This toolbox is
implemented by MEX C++/MATLAB platform for either Windows or Linux.
We examine various methods for combining the output of one-class models. In
particular, we show that simple meta-learning based ensemble achieves better
result than weighting methods. Furthermore we propose a new one-class ensemble
scheme, called TUPSO that uses meta-learning for combining multiple one-class
classifiers. We also present a new one-class classification performance
measures to weigh the base-classifiers, a process that proved helpful for
increasing the classification performance of the induced ensemble.
We provide the first algorithm for online bandit linear optimization whose
regret after T rounds is of order sqrt{Td ln N} on any finite class X of N
actions in d dimensions, and of order d*sqrt{T} (up to log factors) when X is
infinite. These bounds are not improvable in general. The basic idea utilizes
tools from convex geometry to construct what is essentially an optimal
exploration basis. We also present an application to a model of linear bandits
with expert advice.
The success of kernel-based learning methods depend on the choice of kernel.
Recently, kernel learning methods have been proposed that use data to select
the most appropriate kernel, usually by combining a set of base kernels. We
introduce a new algorithm for kernel learning that combines a {\em continuous
set of base kernels}, without the common step of discretizing the space of base
kernels. We demonstrate that our new method achieves state-of-the-art
performance across a variety of real-world datasets.
Unsupervised aggregation of independently built univariate predictors is
explored as an alternative regularization approach for noisy, sparse datasets.
Bipartite ranking algorithm Smooth Rank implementing this approach is
introduced. The advantages of this algorithm are demonstrated on two types of
problems. First, Smooth Rank is applied to two-class problems from bio-medical
field, where ranking is often preferable to classification. In comparison
against SVMs with radial and linear kernels, Smooth Rank had the best
performance on 8 out of 12 benchmark benchmarks.
We consider the multi-armed bandit problems in which a player aims to accrue
reward by sequentially playing a given set of arms with unknown reward
statistics. In the classic work, policies were proposed to achieve the optimal
logarithmic regret order for some special classes of light-tailed reward
distributions, e.g., Auer et al.'s UCB1 index policy for reward distributions
with finite support.
We consider the problem of learning a linear factor model with an unknown
number of factors. We propose a regularized form of principal component
analysis (PCA) and demonstrate through experiments with synthetic and real data
the superiority of resulting estimates to those produced by pre-existing factor
analysis approaches. We also establish theoretical results that elucidate the
manner in which our algorithm corrects biases induced by conventional PCA.
Sparse estimation methods are aimed at using or obtaining parsimonious
representations of data or models. They were first dedicated to linear variable
selection but numerous extensions have now emerged such as structured sparsity
or kernel selection. It turns out that many of the related estimation problems
can be cast as convex optimization problems by regularizing the empirical risk
with appropriate non-smooth norms. The goal of this paper is to present from a
general perspective optimization tools and techniques dedicated to such
sparsity-inducing penalties.
The multi-armed bandit problem is a popular model for studying
exploration/exploitation trade-off in sequential decision problems. Many
algorithms are now available for this well-studied problem. One of the earliest
algorithms, given by W. R. Thompson, dates back to 1933. This algorithm,
referred to as Thompson Sampling, is a natural Bayesian algorithm. The basic
idea is to choose an arm to play according to its probability of being the best
arm. Thompson Sampling algorithm has experimentally been shown to be close to
optimal.
Many real world problems exhibit patterns that have periodic behavior. For
example, in astrophysics, periodic variable stars play a pivotal role in
understanding our universe. An important step when analyzing data from such
processes is the problem of identifying the period: estimating the period of a
periodic function based on noisy observations made at irregularly spaced time
points. This problem is still a difficult challenge despite extensive study in
different disciplines. The paper makes several contributions toward solving
this problem.
We propose a new online learning model for learning with preference feedback.
The model is especially suited for applications like web search and recommender
systems, where preference data is readily available from implicit user feedback
(e.g. clicks). In particular, at each time step a potentially structured object
(e.g. a ranking) is presented to the user in response to a context (e.g.
query), providing him or her with some unobserved amount of utility. As
feedback the algorithm receives an improved object that would have provided
higher utility.
One of the many benefits of Bayesian nonparametric processes such as the
Dirichlet process is that they can be used for modeling infinite mixture
models, thus providing a flexible answer to the question of how many clusters
exist in a data set. For the most part, such flexibility is currently lacking
in techniques based on hard clustering, such as k-means, graph cuts, and
Bregman hard clustering. For finite mixture models, there is a precise
connection between k-means and mixtures of Gaussians, obtained by an
appropriate limiting argument.
Latent Dirichlet Allocation models discrete data as a mixture of discrete
distributions, using Dirichlet beliefs over the mixture weights. We study a
variation of this concept, in which the documents' mixture weight beliefs are
replaced with squashed Gaussian distributions. This allows documents to be
associated with elements of a Hilbert space, admitting kernel topic models
(KTM), modelling temporal, spatial, hierarchical, social and other structure
between documents. The main challenge is efficient approximate inference on the
latent Gaussian.
We propose a method to efficiently construct data-dependent kernels which can
make use of large quantities of (unlabeled) data. Our construction makes an
approximation in the standard construction of semi-supervised kernels in
Sindhwani et al. 2005. In typical cases these kernels can be computed in
nearly-linear time (in the amount of data), improving on the cubic time of the
standard construction, enabling large scale semi-supervised learning in a
variety of contexts.
We consider the problem of identifying patterns in a data set that exhibit
anomalous behavior, often referred to as anomaly detection. In most anomaly
detection algorithms, the dissimilarity between data samples is calculated by a
single criterion. When dissimilarities are calculated by multiple criteria, one
might perform anomaly detection using a linear combination of the multiple
dissimilarities.
Crowdsourcing systems, in which numerous tasks are electronically distributed
to numerous "information piece-workers", have emerged as an effective paradigm
for human-powered solving of large scale problems in domains such as image
classification, data entry, optical character recognition, recommendation, and
proofreading. Because these low-paid workers can be unreliable, nearly all
crowdsourcers must devise schemes to increase confidence in their answers,
typically by assigning each task multiple times and combining the answers in
some way such as majority voting.
We consider unsupervised crowdsourcing performance based on the model wherein
the responses of end-users are essentially rated according to how their
responses correlate with the majority of other responses to the same
subtasks/questions. In one setting, we consider an independent sequence of
identically distributed crowdsourcing assignments (meta-tasks), while in the
other we consider a single assignment with a large number of component
subtasks. Both problems yield intuitive results in which the overall
reliability of the crowd is a factor.
The use of image transformations is essential for efficient modeling and
learning of visual data. But the class of relevant transformations is large:
affine transformations, projective transformations, elastic deformations, ...
the list goes on. Therefore, learning these transformations, rather than hand
coding them, is of great conceptual interest. To the best of our knowledge, all
the related work so far has been concerned with either supervised or weakly
supervised learning (from correlated sequences, video streams, or
image-transform pairs).
The most basic assumption used in statistical learning theory is that
training data and test data are drawn from the same underlying distribution.
Unfortunately, in many applications, the "in-domain" test data is drawn from a
distribution that is related, but not identical, to the "out-of-domain"
distribution of the training data. We consider the common case in which labeled
out-of-domain data is plentiful, but labeled in-domain data is scarce.
User preferences for items can be inferred from either explicit feedback,
such as item ratings, or implicit feedback, such as rental histories. Research
in collaborative filtering has concentrated on explicit feedback, resulting in
the development of accurate and scalable models. However, since explicit
feedback is often difficult to collect it is important to develop effective
models that take advantage of the more widely available implicit feedback.
In this paper we explore the problem of noise tolerant learning of
classifiers. We formulate the problem as follows. We assume that there is an
${\bf unobservable}$ training set which is noise-free. The actual training set
given to the learning algorithm is obtained from this ideal data set by
corrupting the class label of each example where the probability that the class
label on an example is corrupted is a function of the feature vector of the
example. This would account for almost all kinds of noisy data one may
encounter in practice.
Bias - variance decomposition of the expected error defined for regression
and classification problems is an important tool to study and compare different
algorithms, to find the best areas for their application. Here the
decomposition is introduced for the survival analysis problem. In our
experiments, we study bias -variance parts of the expected error for two
algorithms: original Cox proportional hazard regression and CoxPath, path
algorithm for L1-regularized Cox regression, on the series of increased
training sets.
This paper examines the problem of ranking a collection of objects using
pairwise comparisons (rankings of two objects). In general, the ranking of $n$
objects can be identified by standard sorting methods using $n log_2 n$
pairwise comparisons. We are interested in natural situations in which
relationships among the objects may allow for ranking using far fewer pairwise
comparisons. Specifically, we assume that the objects can be embedded into a
$d$-dimensional Euclidean space and that the rankings reflect their relative
distances from a common reference point in $R^d$.
Metrics specifying distances between data points can be learned in a
discriminative manner or from generative models. In this paper, we show how to
unify generative and discriminative learning of metrics via a kernel learning
framework. Specifically, we learn local metrics optimized from parametric
generative models. These are then used as base kernels to construct a global
kernel that minimizes a discriminative training criterion. We consider both
linear and nonlinear combinations of local metric kernels.
In this paper, we consider the problem of preserving privacy in the online
learning setting. We study the problem in the online convex programming (OCP)
framework---a popular online learning setting with several interesting
theoretical and practical implications---while using differential privacy as
the formal privacy measure.
We introduce the problem of reconstructing a sequence of multidimensional
real vectors where some of the data are missing. This problem contains
regression and mapping inversion as particular cases where the pattern of
missing data is independent of the sequence index. The problem is hard because
it involves possibly multivalued mappings at each vector in the sequence, where
the missing variables can take more than one value given the present variables;
and the set of missing variables can vary from one vector to the next.
We consider the problem of optimizing the sum of a smooth convex function and
a non-smooth convex function using proximal-gradient methods, where an error is
present in the calculation of the gradient of the smooth term or in the
proximity operator with respect to the non-smooth term.
Sparse estimation methods are aimed at using or obtaining parsimonious
representations of data or models. While naturally cast as a combinatorial
optimization problem, variable or feature selection admits a convex relaxation
through the regularization by the $\ell_1$-norm. In this paper, we consider
situations where we are not only interested in sparsity, but where some
structural prior knowledge is available as well.
In this paper, we present a new multiple instance learning (MIL) method,
called MIS-Boost, which learns discriminative instance prototypes by explicit
instance selection in a boosting framework. Unlike previous instance selection
based MIL methods, we do not restrict the prototypes to a discrete set of
training instances but allow them to take arbitrary values in the instance
feature space. We also do not restrict the total number of prototypes and the
number of selected-instances per bag; these quantities are completely
data-driven.
We consider a bandit problem over a graph where the rewards are not directly
observed. Instead, the decision maker can compare two nodes and receive
(stochastic) information pertaining to the difference in their value. The graph
structure describes the set of possible comparisons. Consequently, comparing
between two nodes that are relatively far requires estimating the difference
between every pair of nodes on the path between them.
In this paper, we consider Markov Decision Processes (MDPs) with error
states. Error states are those states entering which is undesirable or
dangerous. We define the risk with respect to a policy as the probability of
entering such a state when the policy is pursued. We consider the problem of
finding good policies whose risk is smaller than some user-specified threshold,
and formalize it as a constrained MDP with two criteria. The first criterion
corresponds to the value function originally given.
The paper studies machine learning problems where each example is described
using a set of Boolean features and where hypotheses are represented by linear
threshold elements. One method of increasing the expressiveness of learned
hypotheses in this context is to expand the feature set to include conjunctions
of basic features. This can be done explicitly or where possible by using a
kernel function.
In this paper we investigate clustering in the weighted setting, in which
every data point is assigned a real valued weight. We conduct a theoretical
analysis on the influence of weighted data on standard clustering algorithms in
each of the partitional and hierarchical settings, characterising the precise
conditions under which such algorithms react to weights, and classifying
clustering methods into three broad categories: weight-responsive,
weight-considering, and weight-robust.
Machine learning over fully distributed data poses an important problem in
peer-to-peer (P2P) applications. In this model we have one data record at each
network node, but without the possibility to move raw data due to privacy
considerations. For example, user profiles, ratings, history, or sensor
readings can represent this case. This problem is difficult, because there is
no possibility to learn local models, yet the communication cost needs to be
kept low.
Volterra and polynomial regression models play a major role in nonlinear
system identification and inference tasks. Exciting applications ranging from
neuroscience to genome-wide association analysis build on these models with the
additional requirement of parsimony. This requirement has high interpretative
value, but unfortunately cannot be met by least-squares based or kernel
regression methods. To this end, compressed sampling (CS) approaches, already
successful in linear regression settings, can offer a viable alternative.
In this paper we study the problem of learning phylogenies and hidden Markov
models. We call a Markov model nonsingular if all transition matrices have
determinants bounded away from 0 (and 1). We highlight the role of the
nonsingularity condition for the learning problem. Learning hidden Markov
models without the nonsingularity condition is at least as hard as learning
parity with noise, a well-known learning problem conjectured to be
computationally hard. On the other hand, we give a polynomial-time algorithm
for learning nonsingular phylogenies and hidden Markov models.
Multiclass prediction is the problem of classifying an object into a relevant
target class. We consider the problem of learning a multiclass predictor that
uses only few features, and in particular, the number of used features should
increase sub-linearly with the number of possible classes. This implies that
features should be shared by several classes. We describe and analyze the
ShareBoost algorithm for learning a multiclass predictor that uses few shared
features.
We present an algorithm which attains O(\sqrt{T}) internal (and thus
external) regret for finite games with partial monitoring under the local
observability condition. Recently, this condition has been shown by (Bartok,
Pal, and Szepesvari, 2011) to imply the O(\sqrt{T}) rate for partial monitoring
games against an i.i.d. opponent, and the authors conjectured that the same
holds for non-stochastic adversaries.
We present a data dependent generalization bound for a large class of
regularized algorithms which implement structured sparsity constraints. The
bound can be applied to standard squared-norm regularization, the lasso, the
group lasso, some versions of the group lasso with overlapping groups, multiple
kernel learning and other regularization schemes. In all these cases
competitive results are obtained.
PAQ8 is an open source lossless data compression algorithm that currently
achieves the best compression rates on many benchmarks. This report presents a
detailed description of PAQ8 from a statistical machine learning perspective.
It shows that it is possible to understand some of the modules of PAQ8 and use
this understanding to improve the method. However, intuitive statistical
explanations of the behavior of other modules remain elusive.
Stability is a general notion that quantifies the sensitivity of a learning
algorithm's output to small change in the training dataset (e.g. deletion or
replacement of a single training sample). Such conditions have recently been
shown to be more powerful to characterize learnability in the general learning
setting under i.i.d. samples where uniform convergence is not necessary for
learnability, but where stability is both sufficient and necessary for
learnability. We here show that similar stability conditions are also
sufficient for online learnability, i.e.
Prognosis of disease progression is necessary for development of
individualized treatment, understanding of the disease. Risk modeling is a
challenging problem, and too often amount of available relevant observations is
not sufficient to build a quality model with traditional approaches. New method
Smooth Rank for survival analysis, risk modeling is introduced here. Smooth
Rank is robust against overfitting on relatively small samples. The method is
compared with established risk modeling methods on 10 real life datasets.
This work deals with the problem of classifying uncertain data. With this aim
the Uncertain Nearest Neighbor (UNN) rule is here introduced, which represents
the generalization of the deterministic nearest neighbor rule to the case in
which uncertain objects are available. The UNN rule relies on the concept of
nearest neighbor class, rather than on that of nearest neighbor object. The
nearest neighbor class of a test object is the class that maximizes the
probability of providing its nearest neighbor.
We provide a new framework for generating multiple good quality partitions
(clusterings) of a single data set. Our approach decomposes this problem into
two components, generating many high-quality partitions, and then grouping
these partitions to obtain k representatives. The decomposition makes the
approach extremely modular and allows us to optimize various criteria that
control the choice of representative partitions.
This paper gives specific divergence examples of value-iteration for several
major Reinforcement Learning and Adaptive Dynamic Programming algorithms, when
using a function approximator for the value function. These divergence examples
differ from previous divergence examples in the literature, in that they are
applicable for a greedy policy, i.e. in a "value iteration" scenario. Perhaps
surprisingly, with a greedy policy, it is also possible to get divergence for
the algorithms TD(1) and Sarsa(1).
We study the problem of navigating through a database of similar objects
using comparisons. This problem is known to be strongly related to the
small-world network design problem. However, contrary to prior work, which
focuses on cases where objects in the database are equally popular, we consider
here the case where the demand for objects may be heterogeneous.
In this paper we propose a new algorithm for learning polyhedral classifiers
which we call as Polyceptron. It is a Perception like algorithm which updates
the parameters only when the current classifier misclassifies any training
data. We give both batch and online version of Polyceptron algorithm. Finally
we give experimental results to show the effectiveness of our approach.
This work considers the problem of learning the structure of a broad class of
multivariate latent variable tree models, which include a variety of continuous
and discrete models (including the widely used linear-Gaussian models, hidden
Markov models, and Markov evolutionary trees). The setting is one where we only
have samples from certain observed variables in the tree and our goal is to
estimate the tree structure (i.e., the graph of how the underlying hidden
variables are connected to the observed variables).
We consider the problem of high-dimensional Gaussian graphical model
selection. We identify a set of graphs for which an efficient estimation
algorithm exists, and this algorithm is based on thresholding of empirical
conditional covariances. Under a set of transparent conditions, we establish
structural consistency (or sparsistency) for the proposed algorithm, when the
number of samples n=omega(J_{min}^{-2} log p), where p is the number of
variables and J_{min} is the minimum (absolute) edge potential of the graphical
model.
This work introduces SubMF, a parallel divide-and-conquer framework for noisy
matrix factorization. SubMF divides a large-scale matrix factorization task
into smaller subproblems, solves each subproblem in parallel using an arbitrary
base matrix factorization algorithm, and combines the subproblem solutions
using techniques from randomized matrix approximation. Our experiments with
collaborative filtering, video background modeling, and simulated data
demonstrate the near-linear to super-linear speed-ups attainable with this
approach.
Sparse linear regression -- finding an unknown vector from linear
measurements -- is now known to be possible with fewer samples than variables,
via methods like the LASSO. We consider the multiple sparse linear regression
problem, where several related vectors -- with partially shared support sets --
have to be recovered. A natural question in this setting is whether one can use
the sharing to further decrease the overall number of samples required.
Mini-batch algorithms have been proposed as a way to speed-up stochastic
convex optimization problems. We study how such algorithms can be improved
using accelerated gradient methods. We provide a novel analysis, which shows
how standard gradient methods may sometimes be insufficient to obtain a
significant speed-up and propose a novel accelerated gradient algorithm, which
deals with this deficiency, enjoys a uniformly superior guarantee and works
well in practice.
Motivated by the amount of code that goes unidentified on the web, we
introduce a practical method for algorithmically identifying the programming
language of source code. Our work is based on supervised learning and
intelligent statistical features. We also explored, but abandoned, a
grammatical approach. In testing, our implementation greatly outperforms that
of an existing tool that relies on a Bayesian classifier.
This paper addresses the pattern classification problem arising when
available target data include some uncertainty information. Target data
considered here is either qualitative (a class label) or quantitative (an
estimation of the posterior probability). Our main contribution is a SVM
inspired formulation of this problem allowing to take into account class label
through a hinge loss as well as probability estimates using epsilon-insensitive
cost function together with a minimum norm (maximum margin) objective.
Signal Sequence Labeling consists in predicting a sequence of labels given an
observed sequence of samples. A naive way is to filter the signal in order to
reduce the noise and to apply a classification algorithm on the filtered
samples. We propose in this paper to jointly learn the filter with the
classifier leading to a large margin filtering for classification. This method
allows to learn the optimal cutoff frequency and phase of the filter that may
be different from zero. Two methods are proposed and tested on a toy dataset
and on a real life BCI dataset from BCI Competition III.
One of the major challenges of ECoG-based Brain-Machine Interfaces is the
movement prediction of a human subject. Several methods exist to predict an arm
2-D trajectory. The fourth BCI Competition gives a dataset in which the aim is
to predict individual finger movements (5-D trajectory). The difficulty lies in
the fact that there is no simple relation between ECoG signals and finger
movement. We propose in this paper to decode finger flexions using switching
models.
Estimator algorithms in learning automata are useful tools for adaptive,
real-time optimization in computer science and engineering applications. This
paper investigates theoretical convergence properties for a special case of
estimator algorithms---the pursuit learning algorithm. We identify a gap in
existing proofs of probabilistic convergence for pursuit learning and present a
more refined analysis to fill this gap.
We consider an adversarial online learning setting where a decision maker can
choose an action in every stage of the game. In addition to observing the
reward of the chosen action, the decision maker gets side observations on the
reward he would have obtained had he chosen some of the other actions. The
observation structure is encoded as a graph, where node $i$ is linked to node
$j$ if sampling $i$ provides information on the reward of $j$.
Observational data usually comes with a multimodal nature, which means that
it can be naturally represented by a multi-layer graph whose layers share the
same set of vertices (users) with different edges (pairwise relationships). In
this paper, we address the problem of combining different layers of the
multi-layer graph for improved clustering of the vertices compared to using
layers independently.
Most online algorithms used in machine learning today are based on variants
of mirror descent or follow-the-leader. In this paper, we present an online
algorithm based on a completely different approach, which combines "random
playout" and randomized rounding of loss subgradients. As an application of our
approach, we provide the first computationally efficient online algorithm for
collaborative filtering with norm-constrained matrices. As a second
application, we solve an open question linking batch learning and transductive
online learning.
We address the problem of learning in an online setting where the learner
repeatedly observes features, selects among a set of actions, and receives
reward for the action taken. We provide the first efficient algorithm with an
optimal regret. Our algorithm uses a cost sensitive classification learner as
an oracle and has a running time $\mathrm{polylog}(N)$, where $N$ is the number
of classification rules among which the oracle might choose. This is
exponentially faster than all previous algorithms that achieve optimal regret
in this setting.
The main principle of stacked generalization (or Stacking) is using a
second-level generalizer to combine the outputs of base classifiers in an
ensemble. In this paper, we investigate different combination types under the
stacking framework; namely weighted sum (WS), class-dependent weighted sum
(CWS) and linear stacked generalization (LSG). For learning the weights, we
propose using regularized empirical risk minimization with the hinge loss. In
addition, we propose using group sparsity for regularization to facilitate
classifier selection.
We address the problem of minimizing a convex function over the space of
large matrices with low rank. While this optimization problem is hard in
general, we propose an efficient greedy algorithm and derive its formal
approximation guarantees. Each iteration of the algorithm involves
(approximately) finding the left and right singular vectors corresponding to
the largest singular value of a certain matrix, which can be calculated in
linear time.
The use of $L_1$ regularisation for sparse learning has generated immense
research interest, with successful application in such diverse areas as signal
acquisition, image coding, genomics and collaborative filtering.
We show that all non-negative submodular functions have high {\em
noise-stability}. As a consequence, we obtain a polynomial-time learning
algorithm for this class with respect to any product distribution on
$\{-1,1\}^n$ (for any constant accuracy parameter $\epsilon$). Our algorithm
also succeeds in the agnostic setting. Previous work on learning submodular
functions required either query access or strong assumptions about the types of
submodular functions to be learned (and did not hold in the agnostic setting).
In batch learning, stability together with existence and uniqueness of the
solution corresponds to well-posedness of Empirical Risk Minimization (ERM)
methods; recently, it was proved that CV_loo stability is necessary and
sufficient for generalization and consistency of ERM. In this note, we
introduce CV_on stability, which plays a similar note in online learning. We
show that stochastic gradient descent (SDG) with the usual hypotheses is CVon
stable and we then discuss the implications of CV_on stability for convergence
of SGD.
We begin this report by describing the Probably Approximately Correct (PAC)
model for learning a concept class, consisting of subsets of a domain, and a
function class, consisting of functions from the domain to the unit interval.
Two combinatorial parameters, the Vapnik-Chervonenkis (VC) dimension and its
generalization, the Fat Shattering dimension of scale e, are explained and a
few examples of their calculations are given with proofs.
We develop a coherent framework for integrative simultaneous analysis of the
exploration-exploitation and model order selection trade-offs. We improve over
our preceding results on the same subject (Seldin et al., 2011) by combining
PAC-Bayesian analysis with Bernstein-type inequality for martingales. Such a
combination is also of independent interest for studies of multiple
simultaneously evolving martingales.
In this paper, we propose to (seamlessly) integrate b-bit minwise hashing
with linear SVM to substantially improve the training (and testing) efficiency
using much smaller memory, with essentially no loss of accuracy. Theoretically,
we prove that the resemblance matrix, the minwise hashing matrix, and the b-bit
minwise hashing matrix are all positive definite matrices (kernels).
Interestingly, our proof for the positive definiteness of the b-bit minwise
hashing kernel naturally suggests a simple strategy to integrate b-bit hashing
with linear SVM.
In manifold learning, algorithms based on graph Laplacians constructed from
data have received considerable attention both in practical applications and
theoretical analysis. In particular, the convergence of graph Laplacians
obtained from sampled data to certain continuous operators has become an active
research topic recently.
Feature selection is an important pre-processing step for many pattern
classification tasks. Traditionally, feature selection methods are designed to
obtain a feature subset that can lead to high classification accuracy. However,
classification accuracy has recently been shown to be an inappropriate
performance metric of classification systems in many cases. Instead, the Area
Under the receiver operating characteristic Curve (AUC) and its multi-class
extension, MAUC, have been proved to be better alternatives.
We first present our work in machine translation, during which we used
aligned sentences to train a neural network to embed n-grams of different
languages into an $d$-dimensional space, such that n-grams that are the
translation of each other are close with respect to some metric. Good n-grams
to n-grams translation results were achieved, but full sentences translation is
still problematic.
We present two alternative ways to apply PAC-Bayesian analysis to sequences
of dependent random variables. The first is based on a new lemma that enables
to bound expectations of convex functions of certain dependent random variables
by expectations of the same functions of independent Bernoulli random
variables. This lemma provides an alternative tool to Hoeffding-Azuma
inequality to bound concentration of martingale values.
Boosting is a popular way to derive powerful learners from simpler hypothesis
classes. Following previous work (Mason et al., 1999; Friedman, 2000) on
general boosting frameworks, we analyze gradient-based descent algorithms for
boosting with respect to any convex objective and introduce a new measure of
weak learner performance into this setting which generalizes existing work. We
present the first weak to strong learning guarantees for the existing gradient
boosting work for smooth convex objectives, and also demonstrate that this work
fails for non-smooth objectives.
In many physical, statistical, biological and other investigations it is
desirable to approximate a system of points by objects of lower dimension
and/or complexity. For this purpose, Karl Pearson invented principal component
analysis in 1901 and found 'lines and planes of closest fit to system of
points'. The famous k-means algorithm solves the approximation problem too, but
by finite sets instead of lines and planes. This chapter gives a brief
practical introduction into the methods of construction of general principal
objects, i.e.
We introduce an algorithm that, given n objects, learns a similarity matrix
over all n^2 pairs, from crowdsourced data alone. The algorithm samples
responses to adaptively chosen triplet-based relative-similarity queries. Each
query has the form "is object 'a' more similar to 'b' or to 'c'?" and is chosen
to be maximally informative given the preceding responses. The output is an
embedding of the objects into Euclidean space (like MDS); we refer to this as
the "crowd kernel."
We investigate unsupervised pre-training of deep architectures as feature
generators for "shallow" classifiers. Stacked Denoising Autoencoders (SdA),
when used as feature pre-processing tools for SVM classification, can lead to
significant improvements in accuracy - however, at the price of a substantial
increase in computational cost. In this paper we create a simple algorithm
which mimics the layer by layer training of SdAs. However, in contrast to SdAs,
our algorithm requires no training through gradient descent as the parameters
can be computed in closed-form.
We present a method to stop the evaluation of a decision making process when
the result of the full evaluation is obvious. This trait is highly desirable
for online margin-based machine learning algorithms where a classifier
traditionally evaluates all the features for every example. We observe that
some examples are easier to classify than others, a phenomenon which is
characterized by the event when most of the features agree on the class of an
example.
This paper considers the problem of clustering a partially observed
unweighted graph -- i.e. one where for some node pairs we know there is an edge
between them, for some others we know there is no edge, and for the remaining
we do not know whether or not there is an edge. We want to organize the nodes
into disjoint clusters so that there is relatively dense (observed)
connectivity within clusters, and sparse across clusters.
A fundamental result of statistical learnig theory states that a concept
class is PAC learnable if and only if it is a uniform Glivenko-Cantelli class
if and only if the VC dimension of the class is finite. However, the theorem is
only valid under special assumptions of measurability of the class, in which
case the PAC learnability even becomes consistent.
In many practical applications of clustering, the objects to be clustered
evolve over time, and a clustering result is desired at each time step. In such
applications, evolutionary clustering typically outperforms ordinary static
clustering by producing clustering results that reflect long-term trends while
being robust to short-term variations. Several evolutionary clustering
algorithms have recently been proposed, often by adding a temporal smoothness
penalty to the cost function of a static clustering method.
A wide class of regularization problems in machine learning and statistics
employ a regularization term which is obtained by composing a simple convex
function \omega with a linear transformation. This setting includes Group Lasso
methods, the Fused Lasso and other total variation methods, multi-task learning
methods and many more. In this paper, we present a general approach for
computing the proximity operator of this class of regularizers, under the
assumption that the proximity operator of the function \omega is known in
advance.
We introduce new online and batch algorithms that are robust to data with
missing features, a situation that arises in many practical applications. In
the online setup, we allow for the comparison hypothesis to change as a
function of the subset of features that is observed on any given round,
extending the standard setting where the comparison hypothesis is fixed
throughout. In the batch setup, we present a convex relation of a non-convex
problem to jointly estimate an imputation function, used to fill in the values
of missing features, along with the classification hypothesis.
We consider the problem of classification when inputs correspond to sets of
vectors. This setting occurs in many problems such as the classification of
pieces of mail containing several pages, of web sites with several sections or
of images that have been pre-segmented into smaller regions. We propose
generalizations of the restricted Boltzmann machine (RBM) that are appropriate
in this context and explore how to incorporate different assumptions about the
relationship between the input sets and the target class within the RBM.
We consider a collection of prediction experiments, which are clustered in
the sense that groups of experiments ex- hibit similar relationship between the
predictor and response variables. The experiment clusters as well as the
regres- sion relationships are unknown. The regression relation- ships define
the experiment clusters, and in general, the predictor and response variables
may not exhibit any clus- tering. We call this prediction problem clustered
regres- sion with unknown clusters (CRUC) and in this paper we focus on linear
regression.
We study decision making in environments where the reward is only partially
observed, but can be modeled as a function of an action and an observed
context. This setting, known as contextual bandits, encompasses a wide variety
of applications including health-care policy and Internet advertising. A
central task is evaluation of a new policy given historic data consisting of
contexts, actions and received rewards. The key challenge is that the past data
typically does not faithfully represent proportions of actions taken by a new
policy.
In this work we study parallelization of online learning, a core primitive in
machine learning. In a parallel environment all known approaches for parallel
online learning lead to delayed updates, where the model is updated using
out-of-date information.
We show that the disagreement coefficient of certain smooth hypothesis
classes is $O(m)$, where $m$ is the dimension of the hypothesis space, thereby
answering a question posed in \cite{friedman09}.
The collection of massive volumes of data requires machine learning
algorithms that can be applied to distributed data. We describe COMET (Cloud of
Massive Ensemble Trees), a recipe for distributed supervised learning
consisting of three components: (1) Map-Reduce is used to parallelize the
learning and evaluation tasks and collect the results, (2) an IVoting Random
Forest is used for the learning task on each local data partition, and (3) the
results of all local ensembles are combined into one massive ensemble and lazy
evaluation is used to dynamically subsample it.
We consider the problem of learning an unknown product distribution $X$ over
$\{0,1\}^n$ using samples $f(X)$ where $f$ is a \emph{known} transformation
function. Each choice of a transformation function $f$ specifies a learning
problem in this framework.
In multi-label learning, each sample is associated with several labels.
Existing works indicate that exploring correlations between labels improve the
prediction performance. However, embedding the label correlations into the
training process significantly increases the problem size. Moreover, the
mapping of the label structure in the feature space is not clear. In this
paper, we propose a novel multi-label learning method "Structured Decomposition
+ Group Sparsity (SDGS)".
We consider the problem of approximately reconstructing a partially-observed,
approximately low-rank matrix. This problem has received much attention lately,
mostly using the trace-norm as a surrogate to the rank. Here we study low-rank
matrix reconstruction using both the trace-norm, as well as the less-studied
max-norm, and present reconstruction guarantees based on existing analysis on
the Rademacher complexity of the unit balls of these norms. We show how these
are superior in several ways to recently published guarantees based on
specialized analysis.
This encyclopedic article gives a mini-introduction into the theory of
universal learning, founded by Ray Solomonoff in the 1960s and significantly
developed and extended in the last decade. It explains the spirit of universal
learning, but necessarily glosses over technical subtleties.
This paper proposes a new distance metric between clusterings that
incorporates information about the spatial distribution of points and clusters.
Our approach builds on the idea of a Hilbert space-based representation of
clusters as a combination of the representations of their constituent points.
We use this representation and the underlying metric to design a
spatially-aware consensus clustering procedure. This consensus procedure is
implemented via a novel reduction to Euclidean clustering, and is both simple
and efficient. All of our results apply to both soft and hard clusterings.
Recent research in multi-robot exploration and mapping has focused on
sampling environmental fields, which are typically modeled using the Gaussian
process (GP). Existing information-theoretic exploration strategies for
learning GP-based environmental field maps adopt the non-Markovian problem
structure and consequently scale poorly with the length of history of
observations. Hence, it becomes computationally impractical to use these
strategies for in situ, real-time active sampling.
We present a dynamic learning algorithm for a class of single-product revenue
management problems. In these problems, a retailer sells a single product with
limited on-hand inventory over a finite selling season. Customer demand arrives
according to a Poisson process, the rate of which is influenced by a single
action taken by the retailer (such as price adjustment, sales commission,
advertisement intensity, etc.). The relation between the action and the demand
rate is not known in advance.
In this paper, we propose a two-timescale delay-optimal dynamic clustering
and power allocation design for downlink network MIMO systems. The dynamic
clustering control is adaptive to the global queue state information (GQSI)
only and computed at the base station controller (BSC) over a longer time
scale. On the other hand, the power allocations of all the BSs in one cluster
are adaptive to both intra-cluster channel state information (CCSI) and
intra-cluster queue state information (CQSI), and computed at the cluster
manager (CM) over a shorter time scale.
Many problems in machine learning and statistics can be formulated as
(generalized) eigenproblems. In terms of the associated optimization problem,
computing linear eigenvectors amounts to finding critical points of a quadratic
function subject to quadratic constraints. In this paper we show that a certain
class of constrained optimization problems with nonquadratic objective and
constraints can be understood as nonlinear eigenproblems. We derive a
generalization of the inverse power method which is guaranteed to converge to a
nonlinear eigenvector.
We obtain a tight distribution-specific characterization of the sample
complexity of large-margin classification with L_2 regularization: We introduce
the \gamma-adapted-dimension, which is a simple function of the spectrum of a
distribution's covariance matrix, and show distribution-specific upper and
lower bounds on the sample complexity, both governed by the
\gamma-adapted-dimension of the source distribution. We conclude that this new
quantity tightly characterizes the true sample complexity of large-margin
classification.
Recently, considerable research efforts have been devoted to the design of
methods to learn from data overcomplete dictionaries for sparse coding.
However, learned dictionaries require the solution of an optimization problem
for coding new data. In order to overcome this drawback, we propose an
algorithm aimed at learning both a dictionary and its dual: a linear mapping
directly performing the coding.
We consider the celebrated Blackwell Approachability Theorem for two-player
games with vector payoffs. We show that Blackwell's result is equivalent, via
efficient reductions, to the existence of "no-regret" algorithms for Online
Linear Optimization. Indeed, we show that any algorithm for one such problem
can be efficiently converted into an algorithm for the other. We provide a
useful application of this reduction: the first efficient algorithm for
calibrated forecasting.
Sequential prediction problems such as imitation learning, where future
observations depend on previous predictions (actions), violate the common
i.i.d. assumptions made in statistical learning. This leads to poor performance
in both theory and often in practice. Some recent approaches provide stronger
performance guarantees in this setting, but remain somewhat unsatisfactory as
they train either non-stationary or a stochastic policies and require a large
number of iterations.
Nesterov's accelerated gradient methods (AGM) have been successfully applied
in many machine learning areas. However, their empirical performance on
training max-margin models has been inferior to existing specialized solvers.
In this paper, we first extend AGM to strongly convex and composite objective
functions with Bregman style prox-functions. Our unifying framework covers both
the $\infty$-memory and 1-memory styles of AGM, tunes the Lipschiz constant
adaptively, and bounds the duality gap.
We propose a new approach to value function approximation which combines
linear temporal difference reinforcement learning with subspace identification.
In practical applications, reinforcement learning (RL) is complicated by the
fact that state is either high-dimensional or partially observable. Therefore,
RL methods are designed to work with features of state rather than state
itself, and the success or failure of learning is often determined by the
suitability of the selected features.
Gaussian graphical models are of great interest in statistical learning.
Because the conditional independencies between different nodes correspond to
zero entries in the inverse covariance matrix of the Gaussian distribution, one
can learn the structure of the graph by estimating a sparse inverse covariance
matrix from sample data, by solving a convex maximum likelihood problem with an
$\ell_1$-regularization term.
This paper proposes uni-orthogonal and bi-orthogonal nonnegative matrix
factorization algorithms with robust convergence proofs. We design the
algorithms based on the work of Lee and Seung for the standard nonnegative
matrix factorization [1], and derive the converged versions by utilizing ideas
from the work of Lin [2]. The experimental results confirm the theoretical
guarantees of the convergences.