Heavy-tailed high-dimensional data are commonly encountered in various
scientific fields and pose great challenges to modern statistical analysis. A
natural procedure to address this problem is to use the penalized least absolute
deviation (LAD) method with a weighted $L_1$-penalty, called weighted robust
Lasso (WR-Lasso), in which weights are introduced to ameliorate the bias
problem induced by the $L_1$-penalty.
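As a sketch of what such an estimator might look like, the weighted $L_1$-penalized LAD criterion can be written as below. The symbols $y_i$, $x_i$, $\lambda$ and the weights $d_j \ge 0$ are notation assumed here for illustration, and the $n\lambda$ scaling of the penalty is one common convention rather than something stated in the text above.

```latex
\hat{\beta} \in \arg\min_{\beta \in \mathbb{R}^p}
  \left\{ \sum_{i=1}^{n} \bigl| y_i - x_i^{\top}\beta \bigr|
  + n\lambda \sum_{j=1}^{p} d_j \lvert \beta_j \rvert \right\}
```

Setting every $d_j = 1$ recovers the ordinary $L_1$-penalized LAD criterion. Smaller weights on coefficients believed to be large shrink them less, which is how the weights can reduce the bias.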
Most papers on high-dimensional statistics are based on the assumption that
none of the regressors are correlated with the regression error, namely, they
are exogenous. Yet endogeneity arises easily in high-dimensional regression
due to the large pool of regressors, which causes the inconsistency of the
penalized least-squares methods and possible false scientific discoveries. A
necessary condition for model selection of a very general class of penalized
regression methods is given, which allows us to prove formally the
inconsistency claim.
The variance-covariance matrix plays a central role in the inferential
theories of high-dimensional factor models in finance and economics. Popular
regularization methods that directly exploit sparsity are not directly
applicable to many financial problems. Classical methods of estimating the
covariance matrices are based on the strict factor models, assuming independent
idiosyncratic components. This assumption, however, is restrictive in practical
applications.
The multiple testing procedure plays an important role in detecting the
presence of spatial signals for large-scale imaging data. Typically, the
spatial signals are sparse but clustered.
Variance estimation is a fundamental problem in statistical modeling. In
ultrahigh dimensional linear regressions where the dimensionality is much
larger than sample size, traditional variance estimation techniques are not
applicable. Recent advances on variable selection in ultrahigh dimensional
linear regressions make this problem accessible. One of the major problems in
ultrahigh dimensional regression is the high spurious correlation between the
unobserved realized noise and some of the predictors.
Multiple hypothesis testing is a fundamental problem in high dimensional
inference, with wide applications in many scientific fields. In genome-wide
association studies, tens of thousands of tests are performed simultaneously to
find if any genes are associated with some traits and those tests are
correlated. When test statistics are correlated, false discovery control
becomes very challenging under arbitrary dependence.
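For orientation, the classical Benjamini-Hochberg step-up procedure is sketched below. This is a well-known baseline, not the method of the abstract above: its false discovery rate guarantee is established under independence or certain forms of positive dependence, which is exactly the setting that arbitrary dependence breaks. The function name `benjamini_hochberg` and its parameters are illustrative choices.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Classical BH step-up procedure (baseline sketch, not the abstract's method).

    Returns a list of booleans, True where the corresponding hypothesis
    is rejected at nominal FDR level alpha.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the hypotheses, sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Step-up rule: reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected
```

Because the rule is step-up, a hypothesis can be rejected even when its own p-value exceeds its own threshold, provided a larger p-value further down the ranking meets its threshold.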
For high-dimensional classification, it is well known that naively performing
the Fisher discriminant rule leads to poor results due to diverging spectra and
noise accumulation. Therefore, researchers proposed independence rules to
circumvent the diverging spectra, and sparse independence rules to mitigate the
issue of noise accumulation. However, in biological applications, there is
often a group of correlated genes responsible for clinical outcomes, and the
use of the covariance information can significantly reduce misclassification
rates.
We propose several statistics to test the Markov hypothesis for
$\beta$-mixing stationary processes sampled at discrete time intervals. Our
tests are based on the Chapman--Kolmogorov equation. We establish the
asymptotic null distributions of the proposed test statistics, showing that
Wilks's phenomenon holds. We compute the power of the test and provide
simulations to investigate the finite sample performance of the test statistics
when the null model is a diffusion process, with alternatives consisting of
models with a stochastic mean reversion level, stochastic volatility and jumps.
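For reference, the Chapman--Kolmogorov equation for a time-homogeneous Markov process observed at sampling interval $\Delta$ can be written as follows. The notation $p_{\Delta}(y \mid x)$ for the transition density over one sampling interval is an assumption made here for illustration.

```latex
p_{2\Delta}(y \mid x) = \int p_{\Delta}(y \mid z)\, p_{\Delta}(z \mid x)\, dz
```

Under the Markov hypothesis the two-step transition density on the left must equal the composition of one-step densities on the right, so a test can compare estimates of the two sides.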
Estimation of genewise variance arises from two important applications in
microarray data analysis: selecting significantly differentially expressed
genes and validation tests for normalization of microarray data. We approach
the problem by introducing a two-way nonparametric model, which is an extension
of the famous Neyman--Scott model and is applicable beyond microarray data.
High-throughput genetic sequencing arrays with thousands of measurements per
sample, together with a large amount of related censored clinical data, have
increased the need for better measurement-specific model selection. In this
paper we establish strong oracle properties of non-concave penalized methods for {\it
non-polynomial} (NP) dimensional data with censoring in the framework of Cox's
proportional hazards model. A class of folded-concave penalties is employed,
and both LASSO and SCAD are discussed specifically.
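To make the two penalties concrete, here is a minimal sketch of the LASSO and SCAD penalty functions for a single coefficient. The SCAD form and the default `a = 3.7` follow the standard definition of Fan and Li (2001); the function names are illustrative, and the code only evaluates the penalties, it does not fit a Cox model.

```python
def lasso_penalty(t, lam):
    """L1 (LASSO) penalty lam * |t| for a single coefficient t."""
    return lam * abs(t)


def scad_penalty(t, lam, a=3.7):
    """SCAD penalty for a single coefficient t (standard three-piece form).

    Requires lam > 0 and a > 2. The penalty is linear near zero,
    quadratic in the middle range, and constant for large |t|.
    """
    if lam <= 0 or a <= 2:
        raise ValueError("SCAD requires lam > 0 and a > 2")
    x = abs(t)
    if x <= lam:
        # Same as the LASSO penalty for small coefficients.
        return lam * x
    if x <= a * lam:
        # Quadratic transition between the linear and constant pieces.
        return (2 * a * lam * x - x * x - lam * lam) / (2 * (a - 1))
    # Constant for large coefficients: no additional shrinkage.
    return lam * lam * (a + 1) / 2
```

The constant piece is what distinguishes a folded-concave penalty from the LASSO here: large coefficients incur no further penalty as they grow, whereas the $L_1$ penalty keeps increasing linearly.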
Motivated by normalizing DNA microarray data and by predicting the interest
rates, we explore nonparametric estimation of additive models with highly
correlated covariates. We introduce two novel approaches for estimating the
additive components, integration estimation and pooled backfitting estimation.
The former is designed for highly correlated covariates, and the latter is
useful for covariates that are not highly correlated. Asymptotic normality of
the proposed estimators is established.
Portfolio allocation with gross-exposure constraint is an effective method to
increase the efficiency and stability of selected portfolios among a vast pool
of assets, as demonstrated in Fan et al. (2008). The required high-dimensional
volatility matrix can be estimated by using high frequency financial data. This
enables us to better adapt to the local volatilities and local correlations
among a vast number of assets and to increase significantly the sample size for
estimating the volatility matrix.
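A gross-exposure constrained allocation problem of this kind can be sketched as the following risk minimization. The symbols $w$ (portfolio weights), $\Sigma$ (volatility matrix) and $c$ (gross-exposure bound) are notation assumed here for illustration.

```latex
\min_{w \in \mathbb{R}^p} \; w^{\top} \Sigma w
\quad \text{subject to} \quad
w^{\top} \mathbf{1} = 1, \qquad \lVert w \rVert_1 \le c
```

With $c = 1$ the two constraints together force every weight to be nonnegative, so short sales are ruled out, while letting $c \to \infty$ removes the exposure constraint altogether.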
Variable selection in high dimensional space has challenged many contemporary
statistical problems from many frontiers of scientific disciplines. Recent
technological advances have made it possible to collect a huge amount of covariate
information such as microarray, proteomic and SNP data via bioimaging
technology while observing survival information on patients in clinical
studies. Thus, the same challenge applies to the survival analysis in order to
understand the association between genomics information and clinical
information about the survival time.
The non-Gaussian quasi maximum likelihood estimator is frequently used in
GARCH models with the intention of improving the efficiency of the GARCH parameter estimates.
However, the method is usually inconsistent unless the quasi-likelihood happens
to be the true one. We identify an unknown scale parameter that is critical to
the consistent estimation of non-Gaussian QMLE. As a part of estimating this
unknown parameter, a two-step non-Gaussian QMLE (2SNG-QMLE) is proposed for
estimating the GARCH parameters.
In high-dimensional model selection problems, simple penalized least-squares
approaches have been extensively used. This paper addresses the question of
both robustness and efficiency of penalized model selection methods, and
proposes a data-driven weighted linear combination of convex loss functions,
together with a weighted $L_1$-penalty. It is completely data-adaptive and does
not require prior knowledge of the error distribution. The weighted
$L_1$-penalty is used both to ensure the convexity of the penalty term and to
ameliorate the bias caused by the $L_1$-penalty.
A variable screening procedure via correlation learning was proposed by Fan and
Lv (2008) to reduce dimensionality in sparse ultra-high dimensional models.
Even when the true model is linear, the marginal regression can be highly
nonlinear. To address this issue, we further extend the correlation learning to
marginal nonparametric learning. Our nonparametric independence screening is
called NIS, a specific member of the sure independence screening. Several
closely related variable screening procedures are proposed.
This paper studies the sparsistency and rates of convergence for estimating
sparse covariance and precision matrices based on penalized likelihood with
nonconvex penalty functions. Here, sparsistency refers to the property that all
parameters that are zero are actually estimated as zero with probability
tending to one. Depending on the application, sparsity may occur a priori
on the covariance matrix, its inverse or its Cholesky decomposition. We
study these three sparsity exploration problems under a unified framework with
a general penalty function.
Generalized linear models and the quasi-likelihood method extend the ordinary
regression models to accommodate more general conditional distributions of the
response. Nonparametric methods need no explicit parametric specification, and
the resulting model is completely determined by the data themselves. However,
nonparametric estimation schemes generally have a slower convergence rate, as is
the case for the local polynomial smoothing estimation of nonparametric generalized
linear models studied in Fan, Heckman and Wand [J. Amer. Statist. Assoc. 90
(1995) 141--150].
Ultrahigh dimensional variable selection plays an increasingly important role
in contemporary scientific discoveries and statistical research. Among others,
Fan and Lv (2008) proposed an independence screening framework by ranking the
marginal correlations. They showed that the correlation ranking procedure
possesses a sure independence screening property within the context of the
linear model with Gaussian covariates and responses.
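The idea of ranking marginal correlations can be sketched in a few lines. This is a minimal illustration, assuming the data are given as a list of rows and that the number of retained covariates `d` is chosen by the user; the function names `marginal_correlations` and `sis_select` are hypothetical and not taken from the cited work.

```python
import math


def marginal_correlations(X, y):
    """Absolute Pearson correlation between each column of X and y.

    X is a list of n rows, each a list of p numbers; y is a list of n numbers.
    A constant column (or constant y) is assigned correlation 0.0.
    """
    n = len(y)
    p = len(X[0])
    y_mean = sum(y) / n
    y_c = [v - y_mean for v in y]
    y_norm = math.sqrt(sum(v * v for v in y_c))
    corrs = []
    for j in range(p):
        col = [row[j] for row in X]
        col_mean = sum(col) / n
        col_c = [v - col_mean for v in col]
        col_norm = math.sqrt(sum(v * v for v in col_c))
        if col_norm == 0 or y_norm == 0:
            corrs.append(0.0)
        else:
            dot = sum(a * b for a, b in zip(col_c, y_c))
            corrs.append(abs(dot) / (col_norm * y_norm))
    return corrs


def sis_select(X, y, d):
    """Indices of the d covariates with the largest absolute marginal correlation."""
    corrs = marginal_correlations(X, y)
    ranked = sorted(range(len(corrs)), key=lambda j: corrs[j], reverse=True)
    return sorted(ranked[:d])
```

Each covariate is screened on its own, so the cost grows linearly in the number of covariates. The price is that a covariate that matters only jointly with others can have a small marginal correlation and be screened out.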
High dimensional statistical problems arise from diverse fields of scientific
research and technological development. Variable selection plays a pivotal role
in contemporary statistical learning and scientific discoveries. The
traditional idea of best subset selection methods, which can be regarded as a
specific form of penalized likelihood, is computationally too expensive for
many modern statistical applications. Other forms of penalized likelihood
methods have been successfully developed over the last decade to cope with high
dimensionality.
Penalized likelihood methods are fundamental to ultra-high dimensional
variable selection. How high a dimensionality such methods can handle remains
largely unknown. In this paper, we show that in the context of generalized
linear models, such methods possess model selection consistency with oracle
properties even for dimensionality of Non-Polynomial (NP) order of sample size,
for a class of penalized likelihood approaches using folded-concave penalty
functions, which were introduced to ameliorate the bias problems of convex
penalty functions.
In the analysis of cluster data, the regression coefficients are frequently
assumed to be the same across all clusters. This hampers the ability to study
the varying impacts of factors on each cluster. In this paper, a semiparametric
model is introduced to account for varying impacts of factors over clusters by
using cluster-level covariates. It achieves the parsimony of parametrization
and allows the explorations of nonlinear interactions. The random effect in the
semiparametric model also accounts for within-cluster correlation.