This paper presents a kernel-based discriminative learning framework on
probability measures. Rather than relying on large collections of vectorial
training examples, our framework learns using a collection of probability
distributions that have been constructed to meaningfully represent training
data. By representing these probability distributions as mean embeddings in the
reproducing kernel Hilbert space (RKHS), we are able to apply many standard
kernel-based learning techniques in a straightforward fashion.
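To make the mean-embedding idea concrete, the following is a minimal sketch, not the implementation used in the paper. It assumes each training distribution is available only through a finite sample, that the embedding kernel is Gaussian, and that the kernel between two distributions is the RKHS inner product of their empirical mean embeddings, which reduces to the average of all pairwise kernel values. The function names and the `gamma` bandwidth parameter are hypothetical.

```python
import numpy as np

def gaussian_gram(X, Y, gamma=1.0):
    """Pairwise Gaussian kernel values k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mean_embedding_kernel(X, Y, gamma=1.0):
    """Inner product of the empirical mean embeddings of samples X and Y.

    <mu_X, mu_Y> = (1 / (n * m)) * sum_i sum_j k(x_i, y_j)
    """
    return float(gaussian_gram(X, Y, gamma).mean())

def distribution_gram(samples, gamma=1.0):
    """Gram matrix over a list of samples, one sample per distribution."""
    n = len(samples)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = mean_embedding_kernel(samples[i], samples[j], gamma)
    return K
```

The resulting matrix is symmetric and positive semidefinite, so under these assumptions it could be passed to any learner that accepts a precomputed Gram matrix.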
In this paper, we study two general classes of optimization algorithms for
kernel methods with convex loss functions and quadratic norm regularization, and
analyze their convergence. The first approach, based on fixed-point iterations,
is simple to implement and analyze, and can be easily parallelized. The second,
based on coordinate descent, exploits the structure of additively separable
loss functions to compute solutions of line searches in closed form.
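As an illustration of the two algorithm classes, here is a minimal sketch for the special case of squared loss, where the coefficient vector c of the regularized solution satisfies the linear system (K + lam * I) c = y. This is a simplification chosen for the example and is not necessarily the iteration analyzed in the paper. The fixed-point solver repeatedly applies a contraction map, and the coordinate solver minimizes the quadratic 0.5 * c^T A c - y^T c exactly along one coordinate at a time, which is the closed-form line search available for this loss. All function names, the step size, and the stopping tolerances are assumptions.

```python
import numpy as np

def fixed_point_solve(K, y, lam, n_iter=5000, tol=1e-12):
    """Iterate c <- c - eta * (A c - y) with A = K + lam * I.

    With eta = 1 / ||A||_2 the map is a contraction when A is positive definite.
    """
    A = K + lam * np.eye(len(y))
    eta = 1.0 / np.linalg.norm(A, 2)  # reciprocal of the largest eigenvalue
    c = np.zeros(len(y))
    for _ in range(n_iter):
        step = eta * (A @ c - y)
        c = c - step
        if np.max(np.abs(step)) < tol:
            break
    return c

def coordinate_descent_solve(K, y, lam, n_sweeps=2000, tol=1e-12):
    """Cyclic exact coordinate minimization of 0.5 * c^T A c - y^T c."""
    n = len(y)
    A = K + lam * np.eye(n)
    c = np.zeros(n)
    r = np.asarray(y, dtype=float).copy()  # residual y - A c, kept up to date
    for _ in range(n_sweeps):
        for i in range(n):
            delta = r[i] / A[i, i]   # closed-form line search along coordinate i
            c[i] += delta
            r -= delta * A[:, i]     # A is symmetric, so column i equals row i
        if np.max(np.abs(r)) < tol:
            break
    return c
```

In the fixed-point solver each update is a matrix-vector product, which is the part that parallelizes easily. The coordinate solver touches one column of A per step and keeps the residual current, so no line search needs to be solved numerically.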
In this paper, the framework of kernel machines with two layers is
introduced, generalizing classical kernel methods. The new learning methodology
provides a formal connection between computational architectures with multiple
layers and the theme of kernel learning in standard regularization methods.
First, a representer theorem for two-layer networks is presented, showing that
finite linear combinations of kernels on each layer are optimal architectures
whenever the corresponding functions solve suitable variational problems in
reproducing kernel Hilbert spaces (RKHS).
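One plausible reading of the architecture described by the representer theorem is sketched below, under stated assumptions: the first layer maps an input x to a finite combination of kernel sections centered at the training inputs, and the second layer is a finite combination of kernel sections centered at the first-layer images of the training inputs. The Gaussian kernels, the coefficient arrays `C1` and `c2`, and all function names are hypothetical. The sketch only evaluates a network of this form and does not solve the variational problem that determines the coefficients.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def two_layer_predict(X_train, C1, c2, X_new, gamma1=1.0, gamma2=1.0):
    """Evaluate f2(f1(x)) for each row x of X_new.

    f1(x) = sum_i K1(x_i, x) * C1[i]         (vector-valued, shape (m,))
    f2(z) = sum_i K2(f1(x_i), z) * c2[i]     (scalar output)
    """
    Z_train = rbf(X_train, X_train, gamma1) @ C1  # first-layer images of training inputs
    Z_new = rbf(X_new, X_train, gamma1) @ C1      # first-layer images of new inputs
    return rbf(Z_new, Z_train, gamma2) @ c2
```

Both layers are expressed through the same n training inputs, so the whole network is described by the n-by-m array `C1` and the length-n vector `c2`.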