We introduce a method for aggregating many least squares estimators so that
the resulting estimate has two properties: sparsity and structure. That is,
only a few candidate covariates are used in the resulting model, and the
selected covariates follow some structure over the candidate covariates that is
assumed to be known a priori. While sparsity is well studied in many settings,
including aggregation, structured sparse methods are still emerging.

We present two sets of theoretical results on the grouped lasso with overlap
of Jacob, Obozinski and Vert (2009) in the linear regression setting. This
method performs joint selection of predictors in sparse regression and permits
complex structured sparsity over the predictors, encoded as a set of groups.
This flexible framework suggests that arbitrarily complex structures can be
encoded with an intricate set of groups. Our results show that this strategy
has unexpected theoretical consequences for the procedure.
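
For reference, the penalized problem studied here can be written out explicitly. The display below is a sketch of the latent-variable formulation of Jacob, Obozinski and Vert (2009); the symbols (the group set $\mathcal{G}$, latent vectors $v_g$, group weights $d_g$, and tuning parameter $\lambda$) are notation assumed for illustration and may differ from the notation used elsewhere in this work.

```latex
\[
\hat{\beta} \in \arg\min_{\beta \in \mathbb{R}^p}
  \frac{1}{2n} \lVert y - X\beta \rVert_2^2
  + \lambda\, \Omega_{\mathcal{G}}(\beta),
\qquad
\Omega_{\mathcal{G}}(\beta) =
  \min_{\substack{(v_g)_{g \in \mathcal{G}} \,:\, \operatorname{supp}(v_g) \subseteq g, \\
                  \sum_{g \in \mathcal{G}} v_g = \beta}}
  \; \sum_{g \in \mathcal{G}} d_g \lVert v_g \rVert_2 .
\]
```

Because each latent vector $v_g$ is supported on a single group, the support of the estimate is a union of groups, which is how the choice of group set encodes structure over the predictors.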

We present a new similarity measure tailored to posts in an online forum. Our
measure takes into account all the available information about user interest
and interaction: the content of the posts, the threads in the forum, and the
authors of the posts. We use this post similarity to build a similarity between
users, based on principal coordinate analysis. This also allows easy
visualization of user activity. Similarity between users has numerous
applications, such as clustering or classification.
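
To make the principal coordinate analysis step concrete, here is a minimal sketch of classical principal coordinate analysis applied to a matrix of pairwise dissimilarities. How the post similarity is converted into user dissimilarities is not shown; the function name `principal_coordinates`, the input matrix `toy_dist`, and its values are hypothetical and serve only as an illustration.

```python
import numpy as np


def principal_coordinates(dist, n_components=2):
    """Classical PCoA: embed items so Euclidean distances approximate `dist`."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared dissimilarities: B = -1/2 * J D^2 J.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # Eigendecomposition of the symmetric matrix B; keep the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues (possible for non-Euclidean input) are clipped to zero.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Made-up dissimilarities between four hypothetical users. They equal the
# distances between the corners of a 3-by-4 rectangle, so two coordinates
# are enough to reproduce them.
toy_dist = np.array([
    [0.0, 3.0, 4.0, 5.0],
    [3.0, 0.0, 5.0, 4.0],
    [4.0, 5.0, 0.0, 3.0],
    [5.0, 4.0, 3.0, 0.0],
])
coords = principal_coordinates(toy_dist, n_components=2)

# Pairwise distances between the embedded points, for comparison with toy_dist.
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

The two columns of `coords` can be plotted directly as a scatter plot of users, or passed to a clustering or classification routine.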

We introduce a new version of forward stepwise regression. Our modification
finds solutions to regression problems where the selected predictors appear in
a structured pattern with respect to a predefined distance measure over the
candidate predictors. Our method is motivated by the problem of predicting
HIV-1 drug resistance from protein sequences. We find that our method improves
the interpretability of drug resistance models while producing predictive
accuracy comparable to that of standard methods.
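
The general idea of steering forward stepwise selection with a distance over the predictors can be sketched as follows. This is not the modification proposed in this work: the specific rule used below (subtracting `gamma` times the distance to the nearest already-selected predictor from the reduction in residual sum of squares) is an assumption made purely for illustration, as are the function name, the parameter `gamma`, and the toy data.

```python
import numpy as np


def structured_forward_stepwise(X, y, dist, n_steps, gamma=0.0):
    """Greedy forward selection with a hypothetical distance penalty.

    At each step, add the predictor with the largest reduction in residual
    sum of squares minus gamma * (distance to the nearest selected predictor).
    With gamma = 0 this reduces to ordinary forward stepwise regression.
    Assumes y and the columns of X are centered (no intercept is fitted).
    """
    n, p = X.shape
    selected = []
    rss_current = float(y @ y)
    for _ in range(min(n_steps, p)):
        best_j, best_score, best_rss = None, -np.inf, None
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            # Refit least squares on the candidate set of predictors.
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            resid = y - X[:, cols] @ coef
            rss = float(resid @ resid)
            # Illustrative (assumed) penalty: distance to the nearest selected predictor.
            penalty = gamma * min(dist[j, k] for k in selected) if selected else 0.0
            score = (rss_current - rss) - penalty
            if score > best_score:
                best_j, best_score, best_rss = j, score, rss
        selected.append(best_j)
        rss_current = best_rss
    return selected


# Toy example: six predictors at positions 0..5 along a sequence, with
# distance |i - j| between positions. All values are made up.
X = np.eye(6)
y = np.array([3.0, 2.0, 0.0, 0.0, 0.0, 2.1])
positions = np.arange(6)
dist = np.abs(positions[:, None] - positions[None, :]).astype(float)

plain = structured_forward_stepwise(X, y, dist, n_steps=2, gamma=0.0)       # [0, 5]
structured = structured_forward_stepwise(X, y, dist, n_steps=2, gamma=0.5)  # [0, 1]
```

In this toy example the unpenalized search picks the two predictors with the largest individual effects, which lie far apart, while the penalized search prefers a slightly weaker predictor adjacent to the first one, giving a contiguous selection.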