In this note we illustrate and develop further, with mathematics and examples,
the work on successive standardization (or normalization) studied earlier by
the same authors in Olshen and Rajaratnam (2010) and Olshen and Rajaratnam
(2011). We deal with successive iterations applied to rectangular arrays of
numbers, where, to avoid technical difficulties, an array has at least three
rows and at least three columns. Without loss of generality, an iteration
begins with operations on columns: first subtract the mean of each column; then
divide by its standard deviation.
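The column step described above can be sketched in a few lines. The sketch below is illustrative only: it assumes that the row step completing an iteration mirrors the column step, that standard deviations use the population form (ddof=0), and that a fixed number of iterations is run. The function names and the parameter `n_iter` are hypothetical and do not come from the cited papers.

```python
import numpy as np


def standardize_columns(x):
    """Subtract each column's mean, then divide by its standard deviation."""
    centered = x - x.mean(axis=0, keepdims=True)
    return centered / centered.std(axis=0, keepdims=True)  # ddof=0 is an assumption


def successive_standardization(x, n_iter=50):
    """Alternate column and row standardization for n_iter iterations."""
    x = np.asarray(x, dtype=float)
    for _ in range(n_iter):
        x = standardize_columns(x)      # columns first, as in the text
        x = standardize_columns(x.T).T  # assumed: the same operation on rows
    return x


rng = np.random.default_rng(0)
result = successive_standardization(rng.normal(size=(5, 4)))  # at least 3 x 3
row_means = result.mean(axis=1)
row_sds = result.std(axis=1)
```

Because the last operation acts on rows, `row_means` and `row_sds` equal 0 and 1 up to floating-point error; how close the column statistics are to 0 and 1 after the final row step is the kind of question the iteration raises.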
The question of when zeros (i.e., sparsity) in a positive definite matrix $A$
are preserved in its Cholesky decomposition, and vice versa, was addressed by
Paulsen et al. in the Journal of Functional Analysis (85, pp. 151-178). In
particular, they prove that for the pattern of zeros in $A$ to be retained in
the Cholesky decomposition of $A$, the pattern of zeros in $A$ must
correspond to a chordal (or decomposable) graph associated with a
specific type of vertex ordering.
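A small numerical check makes the statement concrete. The sketch below compares two hand-picked positive definite matrices (both are assumptions chosen for illustration): one whose zero pattern is a path graph, which is chordal and whose natural ordering produces no fill-in, and one whose zero pattern is a 4-cycle, which is not chordal. The helper `lower_zeros_retained` is hypothetical.

```python
import numpy as np


def lower_zeros_retained(a, tol=1e-12):
    """Check whether zeros below the diagonal of a stay zero in its Cholesky factor."""
    chol = np.linalg.cholesky(a)  # lower-triangular L with a = L @ L.T
    lower = np.tril_indices_from(a, k=-1)
    zeros_in_a = np.abs(a[lower]) < tol
    return bool(np.all(np.abs(chol[lower][zeros_in_a]) < tol))


# Zero pattern of the path graph 1-2-3-4 (chordal), in its natural ordering.
path = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 1.0, 0.0],
    [0.0, 1.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

# Zero pattern of the 4-cycle 1-2-3-4-1 (not chordal).
cycle = np.array([
    [3.0, 1.0, 0.0, 1.0],
    [1.0, 3.0, 1.0, 0.0],
    [0.0, 1.0, 3.0, 1.0],
    [1.0, 0.0, 1.0, 3.0],
])

path_retained = lower_zeros_retained(path)
cycle_retained = lower_zeros_retained(cycle)
```

For the path matrix the factor is bidiagonal and every zero survives; for the cycle matrix the entry in position (4, 2) of the factor is nonzero although the corresponding entry of the matrix is zero.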
The graphical lasso (glasso) is a widely used fast algorithm for estimating
sparse inverse covariance matrices. The glasso solves an $L_1$-penalized maximum
likelihood problem and is implemented on CRAN. The outputs of the glasso, a
regularized covariance matrix estimate Sigma_glasso and a sparse inverse
covariance matrix estimate Omega_glasso, not only identify a graphical model
but can also serve as intermediate inputs into multivariate procedures such as
PCA, LDA, MANOVA, and others.
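For reference, the penalized likelihood problem mentioned above is usually written as follows. The symbols $S$ (the sample covariance matrix) and $\rho$ (the penalty parameter) are notation assumed here, not taken from the text, and conventions differ on whether the diagonal entries are penalized.

```latex
\hat{\Omega}_{\mathrm{glasso}}
  = \operatorname*{arg\,max}_{\Omega \succ 0}
    \Big\{ \log\det\Omega - \operatorname{tr}(S\,\Omega) - \rho\,\|\Omega\|_{1} \Big\},
\qquad
\|\Omega\|_{1} = \sum_{i,j} |\Omega_{ij}|,
\qquad
\hat{\Sigma}_{\mathrm{glasso}} = \hat{\Omega}_{\mathrm{glasso}}^{-1}.
```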
This paper treats the problem of screening a p-variate sample for strongly
and multiply connected vertices in the partial correlation graph associated
with the partial correlation matrix of the sample. This problem, called hub
screening, is important in many applications ranging from network security to
computational biology to finance to social networks. In the area of network
security, a node that becomes a hub of high correlation with neighboring nodes
might signal anomalous activity such as a coordinated flooding attack.
In this paper we construct a family of DAG Wishart distributions that form a
rich conjugate family of priors with multiple shape parameters for Gaussian DAG
models, and proceed to undertake a theoretical analysis of this class with the
goal of posterior inference. We first prove that our family of DAG Wishart
distributions satisfies the strong directed hyper Markov property.
Positive definite (p.d.) matrices arise naturally in many areas within
mathematics and also feature extensively in scientific applications. In modern
high-dimensional applications, a common approach to finding sparse positive
definite matrices is to threshold their small off-diagonal elements. This
thresholding, sometimes referred to as hard-thresholding, sets small elements
to zero. Thresholding has the attractive property that the resulting matrices
are sparse, and are thus easier to interpret and work with.
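Hard-thresholding as described above can be sketched as follows. The matrix, the threshold `eps`, and the function name `hard_threshold` are assumptions chosen for illustration; the sketch also checks the smallest eigenvalue numerically before and after thresholding rather than assuming that positive definiteness is preserved.

```python
import numpy as np


def hard_threshold(a, eps):
    """Set off-diagonal entries with |a_ij| < eps to zero; keep the diagonal."""
    a = np.asarray(a, dtype=float)
    keep = np.abs(a) >= eps
    np.fill_diagonal(keep, True)  # diagonal entries are never thresholded
    return np.where(keep, a, 0.0)


a = np.array([
    [1.00, 0.75, 0.50],
    [0.75, 1.00, 0.75],
    [0.50, 0.75, 1.00],
])

thresholded = hard_threshold(a, eps=0.6)
min_eig_before = np.linalg.eigvalsh(a).min()
min_eig_after = np.linalg.eigvalsh(thresholded).min()
```

In this particular example the thresholded matrix is sparser, but its smallest eigenvalue is negative while that of the original matrix is positive, so sparsity alone does not guarantee that the result is still positive definite.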
Discussion of "A statistical analysis of multiple temperature proxies: Are
reconstructions of surface temperatures over the last 1000 years reliable?" by
B.B. McShane and A.J. Wyner [arXiv:1104.4002]
Gaussian covariance graph models encode marginal independence among the
components of a multivariate random vector by means of a graph $G$.
This paper treats the problem of screening for variables with high
correlations in high dimensional data in which there can be many fewer samples
than variables. We focus on threshold-based correlation screening methods for
three related applications: screening for variables with large correlations
within a single treatment (auto-correlation screening); screening for
variables with large cross-correlations over two treatments (cross-correlation
screening); and screening for variables that have persistently large
auto-correlations over two treatments (persistent-correlation screening).
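The single-treatment case can be illustrated with a minimal sketch. It assumes that a variable is retained when its largest absolute sample correlation with any other variable reaches a threshold `rho`; that rule, the function name `correlation_screen`, and the simulated data are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def correlation_screen(x, rho):
    """Indices of variables whose largest |sample correlation| with another variable is >= rho."""
    r = np.corrcoef(x, rowvar=False)  # p x p sample correlation matrix
    np.fill_diagonal(r, 0.0)          # ignore each variable's correlation with itself
    max_abs = np.abs(r).max(axis=1)
    return np.flatnonzero(max_abs >= rho)


rng = np.random.default_rng(1)
n, p = 10, 200                        # many fewer samples than variables
x = rng.normal(size=(n, p))
x[:, 1] = x[:, 0] + 0.01 * rng.normal(size=n)  # plant one strongly correlated pair

selected = correlation_screen(x, rho=0.99)
```

With so few samples, large spurious correlations become more likely as `rho` is lowered, which is why the choice of threshold matters in this regime.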
Standard statistical techniques often require transforming data to have mean
0 and standard deviation 1. Typically, this process of "standardization" or
"normalization" is applied across subjects when each subject produces a single
number. High throughput genomic and financial data often come as rectangular
arrays where each coordinate in one direction concerns subjects who might have
different status (case or control, say), and each coordinate in the other
designates "outcome" for a specific feature, for example, "gene," "polymorphic
site" or some aspect of financial profile.
A covariance graph is an undirected graph associated with the multivariate
probability distribution of a given random vector, in which each vertex
represents one component of the random vector and the absence of an edge
between a pair of vertices implies marginal independence between the
corresponding variables. Covariance graph models have recently received much
attention in the literature and constitute a sub-family of graphical models.
Though they are conceptually simple to understand, they are considerably more
difficult to analyze.