Collaborative filtering is a rapidly advancing research area. Several new
techniques are proposed every year, yet it remains unclear which of them
work best and under what conditions. In this paper we conduct a
study comparing several collaborative filtering techniques -- both classic and
recent state-of-the-art -- in a variety of experimental contexts. Specifically,
we report conclusions controlling for number of items, number of users,
sparsity level, performance criteria, and computational complexity.
Incorporating domain knowledge into the modeling process is an effective way
to improve learning accuracy. However, as it is provided by humans, domain
knowledge can only be specified with some degree of uncertainty. We propose to
explicitly model such uncertainty through probabilistic constraints over the
parameter space. In contrast to hard parameter constraints, our approach
remains effective even when the domain knowledge is inaccurate, and it
generally results in superior modeling accuracy.
Sentiment analysis predicts the presence of positive or negative emotions in
a text document. In this paper we consider higher-dimensional extensions of the
sentiment concept, which represent a richer set of human emotions. Our approach
goes beyond previous work in that our model contains a continuous manifold
rather than a finite set of human emotions. We investigate the resulting model,
compare it to psychological observations, and explore its predictive
capabilities.
Maximum likelihood estimators are often of limited practical use due to the
intensive computation they require. We propose a family of alternative
estimators that maximize a stochastic variation of the composite likelihood
function. Each of the estimators resolves the computation-accuracy tradeoff
differently, and taken together they span a continuous spectrum of
computation-accuracy tradeoff resolutions. We prove the consistency of the
estimators and provide formulas for their asymptotic variance, statistical
robustness, and computational complexity.
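For orientation, one way such an objective can be written is sketched below.
The notation is an assumption introduced here for illustration and is not taken
from the text above: X^(1), ..., X^(n) are the samples, the index sets
(A_j, B_j) define k component likelihoods (marginal when B_j is empty,
conditional otherwise), beta_j >= 0 are component weights, and Z_ij are
independent Bernoulli(lambda_j) indicators that decide whether component j is
evaluated for sample i.

```latex
\[
  \hat{\theta}
  = \arg\max_{\theta}\;
    \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k}
    \beta_j \, Z_{ij} \,
    \log p_{\theta}\!\bigl( X^{(i)}_{A_j} \,\big|\, X^{(i)}_{B_j} \bigr),
  \qquad Z_{ij} \sim \mathrm{Bernoulli}(\lambda_j).
\]
```

Under this sketch, setting every lambda_j to 1 gives an ordinary weighted
composite likelihood, while smaller values of lambda_j evaluate fewer components
per sample and so trade statistical accuracy for reduced computation.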
Many popular linear classifiers, such as logistic regression, boosting, or
SVM, are trained by optimizing a margin-based risk function. Traditionally,
these risk functions are computed based on a labeled dataset. We develop a
novel technique for estimating such risks using only unlabeled data and p(y).
We prove that the technique is consistent for high-dimensional linear
classifiers and demonstrate it on synthetic and real-world data.
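The idea of estimating a margin-based risk from unlabeled data and p(y) can be
illustrated with a minimal sketch. It rests on an assumption that the text above
does not state: that the classifier scores f(x) within each class are
approximately Gaussian. Under that assumption, the unlabeled scores follow a
two-component mixture whose weights are the known p(y), so the class-conditional
score distributions can be fitted without labels and the logistic risk
integrated against them. All function names (fit_mixture, estimate_risk,
simulate_scores) are hypothetical, and this may differ from the actual
technique.

```python
import math
import random


def normal_pdf(z, mu, sigma):
    """Density of N(mu, sigma^2) at z."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def logistic_loss(margin):
    """Numerically stable log(1 + exp(-margin))."""
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def fit_mixture(scores, p_pos, iters=100):
    """EM for a 1-D two-Gaussian mixture whose weights are FIXED to p(y).

    Only the per-class mean and standard deviation of the scores are estimated.
    ASSUMPTION: class-conditional scores are roughly Gaussian.
    """
    weights = {+1: p_pos, -1: 1.0 - p_pos}
    lo, hi = min(scores), max(scores)
    spread = (hi - lo) / 4 or 1.0
    # Start the positive class at the larger scores to avoid label switching.
    params = {+1: (hi - spread, spread), -1: (lo + spread, spread)}
    for _ in range(iters):
        # E-step: posterior probability that each score belongs to class +1.
        resp = []
        for z in scores:
            a = weights[+1] * normal_pdf(z, *params[+1])
            b = weights[-1] * normal_pdf(z, *params[-1])
            resp.append(a / (a + b) if a + b > 0 else 0.5)
        # M-step: responsibility-weighted mean and standard deviation.
        for y in (+1, -1):
            r = resp if y == +1 else [1.0 - v for v in resp]
            total = max(sum(r), 1e-12)
            mu = sum(ri * z for ri, z in zip(r, scores)) / total
            var = sum(ri * (z - mu) ** 2 for ri, z in zip(r, scores)) / total
            params[y] = (mu, max(math.sqrt(var), 1e-6))
    return params


def estimate_risk(params, p_pos, grid=2001, width=8.0):
    """Integrate the logistic loss against the fitted class-conditional densities."""
    risk = 0.0
    for y, w in ((+1, p_pos), (-1, 1.0 - p_pos)):
        mu, sigma = params[y]
        step = 2 * width * sigma / (grid - 1)
        for k in range(grid):
            z = mu - width * sigma + k * step
            risk += w * logistic_loss(y * z) * normal_pdf(z, mu, sigma) * step
    return risk


def simulate_scores(n, p_pos, seed=0):
    """Synthetic scores for the demo; labels are kept only to check the estimate."""
    rng = random.Random(seed)
    scores, labels = [], []
    for _ in range(n):
        y = +1 if rng.random() < p_pos else -1
        scores.append(rng.gauss(2.0 if y == +1 else -1.5, 1.0))
        labels.append(y)
    return scores, labels


if __name__ == "__main__":
    scores, labels = simulate_scores(2000, p_pos=0.7)
    params = fit_mixture(scores, p_pos=0.7)
    labeled = sum(logistic_loss(y * z) for y, z in zip(labels, scores)) / len(scores)
    print("unlabeled estimate:", estimate_risk(params, p_pos=0.7))
    print("labeled risk      :", labeled)
```

In the demo the labels are used only to compute a reference value. The estimate
itself sees nothing but the scores and p(y).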
Text documents are complex, high-dimensional objects. To effectively visualize
such data it is important to reduce its dimensionality and visualize the
low-dimensional embedding as a 2-D or 3-D scatter plot. In this paper we explore
dimensionality reduction methods that draw upon domain knowledge in order to
achieve a better low-dimensional embedding and visualization of documents. We
consider the use of geometries specified manually by an expert, geometries
derived automatically from corpus statistics, and geometries computed from
linguistic resources.
Semi-supervised learning has emerged as a popular framework for improving
modeling accuracy while controlling labeling cost. Based on an extension of
stochastic composite likelihood we quantify the asymptotic accuracy of
generative semi-supervised learning. In doing so, we complement
distribution-free analysis by providing an alternative framework to measure the
value associated with different labeling policies and resolve the fundamental
question of how much data to label and in what manner.