Compressed Counting (CC) [22] was recently proposed for estimating the ath
frequency moments of data streams, where 0 < a <= 2. CC can be used for
estimating Shannon entropy, which can be approximated by certain functions of
the ath frequency moments as a -> 1. Monitoring Shannon entropy for anomaly
detection (e.g., DDoS attacks) in large networks is an important task. This
paper presents a new algorithm for improving CC. The improvement is most
substantial when a -> 1--. For example, when a = 0:99, the new algorithm
reduces the estimation variance roughly by 100-fold.
We provide a simple method and relevant theoretical analysis for efficiently
estimating higher-order lp distances. While the analysis mainly focuses on l4,
our methodology extends naturally to p = 6,8,10..., (i.e., when p is even).
Distance-based methods are popular in machine learning. In large-scale
applications, storing, computing, and retrieving the distances can be both
space and time prohibitive. Efficient algorithms exist for estimating lp
distances if 0 < p <= 2. The task for p > 2 is known to be difficult. Our work
partially fills this gap.
Logitboost is an influential boosting algorithm for classification. In this
paper, we develop robust logitboost to provide an explicit formulation of
tree-split criterion for building weak learners (regression trees) for
logitboost. This formulation leads to a numerically stable implementation of
logitboost. We then propose abc-logitboost for multi-class classification, by
combining robust logitboost with the prior work of abc-boost. Previously,
abc-boost was implemented as abc-mart using the mart algorithm.
We show that information about social relationships can be used to improve
user-level sentiment analysis. The main motivation behind our approach is that
users that are somehow "connected" may be more likely to hold similar opinions;
therefore, relationship information can complement what we can extract about a
user's viewpoints from their utterances.
In this paper, we first demonstrate that b-bit minwise hashing, whose
estimators are positive definite kernels, can be naturally integrated with
learning algorithms such as SVM and logistic regression. We adopt a simple
scheme to transform the nonlinear (resemblance) kernel into linear (inner
product) kernel; and hence large-scale problems can be solved extremely
efficiently. Our method provides a simple effective solution to large-scale
learning in massive and extremely high-dimensional datasets, especially when
data do not fit in memory.
In this paper, we propose to (seamlessly) integrate b-bit minwise hashing
with linear SVM to substantially improve the training (and testing) efficiency
using much smaller memory, with essentially no loss of accuracy. Theoretically,
we prove that the resemblance matrix, the minwise hashing matrix, and the b-bit
minwise hashing matrix are all positive definite matrices (kernels).
Interestingly, our proof for the positive definiteness of the b-bit minwise
hashing kernel naturally suggests a simple strategy to integrate b-bit hashing
with linear SVM.
Abc-boost is a new line of boosting algorithms for multi-class
classification, by utilizing the commonly used sum-to-zero constraint. To
implement abc-boost, a base class must be identified at each boosting step.
Prior studies used a very expensive procedure based on exhaustive search for
determining the base class at each boosting step. Good testing performances of
abc-boost (implemented as abc-mart and abc-logitboost) on a variety of datasets
were reported.
Estimating the p-th frequency moment of data stream is a very heavily studied
problem. The problem is actually trivial when p = 1, assuming the strict
Turnstile model. The sample complexity of our proposed algorithm is essentially
O(1) near p=1. This is a very large improvement over the previously believed
O(1/eps^2) bound. The proposed algorithm makes the long-standing problem of
entropy estimation an easy task, as verified by the experiments included in the
appendix.
Libgober and Wood proved that the Chern number $c_{1}c_{n-1}$ of a
$n$-dimensional compact complex manifold can be determined by its Hirzebruch
$\chi_{y}$-genus. Inspired by the idea of their proof, we show that, for
compact, spin, almost-complex manifolds, more Chern numbers can be determined
by the indices of some twisted Dirac and signature operators. As a byproduct,
we get a divisibility result of certain characteristic number for such
manifolds. Using our method, we give a direct proof of Libgober-Wood's result,
which was originally proved by induction.
This empirical study is mainly devoted to comparing four tree-based boosting
algorithms: mart, abc-mart, robust logitboost, and abc-logitboost, for
multi-class classification on a variety of publicly available datasets. Some of
those datasets have been thoroughly tested in prior studies using a broad range
of classification algorithms including SVM, neural nets, and deep learning.
In terms of the empirical classification errors, our experiment results
demonstrate:
This paper establishes the theoretical framework of b-bit minwise hashing.
The original minwise hashing method has become a standard technique for
estimating set similarity (e.g., resemblance) with applications in information
retrieval, data management, social networks and computational advertising.
Compressed Counting (CC), based on maximally skewed stable random
projections, was recently proposed for estimating the p-th frequency moments of
data streams. The case p->1 is extremely useful for estimating Shannon entropy
of data streams. In this study, we provide a very simple algorithm based on the
sample minimum estimator and prove a much improved sample complexity bound,
compared to prior results.