The model for homogeneity of proportions in a two-way
contingency table (cross-tabulation) is the same as the model of independence,
except that the probabilistic process generating the data is viewed as fixing
the column totals (but not the row totals).
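
As a minimal sketch of what the shared model predicts: under either view, the fitted count in a cell is the product of its row total and its column total, divided by the grand total. The function name `expected_counts` and the 2-by-2 table are illustrative assumptions, not taken from the source.

```python
def expected_counts(table):
    """Fitted cell counts under independence (equivalently, homogeneity)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Each fitted count is (row total) * (column total) / (grand total).
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical observed counts, chosen only for illustration.
observed = [[10, 20], [30, 40]]
expected = expected_counts(observed)
print(expected)
```

Viewing the column totals as fixed changes how simulated tables would be drawn, not the fitted counts themselves.
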
Goodness-of-fit tests based on the Euclidean distance often outperform
chi-square and other classical tests (including the standard exact tests) by at
least an order of magnitude when the model being tested for goodness-of-fit is
a discrete probability distribution that is not close to uniform. The present
article discusses numerous examples of this.
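
The following is a rough sketch of a goodness-of-fit test based on the Euclidean distance between the observed proportions and the model probabilities. The source does not say how significance is assessed; the Monte Carlo simulation used here, along with every function name and the example model, is an assumption made for illustration.

```python
import random


def euclidean_distance(counts, probs):
    """Euclidean distance between observed proportions and model probabilities."""
    n = sum(counts)
    return sum((c / n - p) ** 2 for c, p in zip(counts, probs)) ** 0.5


def draw_counts(probs, n, rng):
    """Bin counts for n i.i.d. draws from the model distribution."""
    counts = [0] * len(probs)
    for j in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[j] += 1
    return counts


def monte_carlo_p_value(counts, probs, trials=2000, seed=0):
    """Fraction of simulated data sets at least as far from the model as the data."""
    rng = random.Random(seed)
    n = sum(counts)
    observed = euclidean_distance(counts, probs)
    hits = sum(
        euclidean_distance(draw_counts(probs, n, rng), probs) >= observed
        for _ in range(trials)
    )
    return hits / trials


# Hypothetical non-uniform model and two hypothetical data sets of 32 draws.
model = [0.5, 0.25, 0.125, 0.0625, 0.0625]
p_fit = monte_carlo_p_value([16, 8, 4, 2, 2], model)  # proportions equal the model
p_bad = monte_carlo_p_value([2, 2, 4, 8, 16], model)  # counts in reverse order
print(p_fit, p_bad)
```

No bin is divided by its probability, so rare bins need no special handling.
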
If a discrete probability distribution in a model being tested for
goodness-of-fit is not close to uniform, then forming the Pearson chi-square
statistic can involve division by numbers that are nearly zero. This often
leads to serious trouble in practice, even in the absence of round-off errors,
as the present article illustrates via numerous examples.
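
A small numerical illustration of the division by nearly zero, with a hypothetical three-bin model whose last bin has probability 0.0001 (the model, the counts, and the function names are assumptions for illustration only). Moving a single draw out of 100 into the rare bin changes the Pearson statistic from about 0.01 to about 98, while the Euclidean alternative (scaled by the number of draws) stays small.

```python
def pearson_chi_square(counts, probs):
    """Pearson statistic: each squared deviation is divided by n * p_j."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))


def scaled_squared_euclidean(counts, probs):
    """n times the squared Euclidean distance; no division by p_j."""
    n = sum(counts)
    return n * sum((c / n - p) ** 2 for c, p in zip(counts, probs))


probs = [0.5, 0.4999, 0.0001]  # hypothetical model with one very rare bin
typical = [50, 50, 0]          # no draw lands in the rare bin
one_rare = [50, 49, 1]         # a single draw lands in the rare bin

for counts in (typical, one_rare):
    print(counts,
          pearson_chi_square(counts, probs),
          scaled_squared_euclidean(counts, probs))
```

With 100 draws, the chance of at least one draw landing in the rare bin is roughly 1%, yet a value near 98 lies far beyond what a chi-square reference distribution with two degrees of freedom treats as plausible.
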
Recently popularized randomized methods for principal component analysis
(PCA) efficiently and reliably produce nearly optimal accuracy, even on
parallel processors, unlike the classical (deterministic) alternatives. We
adapt one of these randomized methods for use with data sets that are too large
to be stored in random-access memory (RAM). (The traditional terminology is
that our procedure works efficiently "out-of-core.") We illustrate the
performance of the algorithm via several numerical examples.
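
The sketch below shows one way a randomized low-rank factorization can be organized so that only one block of rows is in memory at a time. It is a simplified two-pass variant written for illustration, not the procedure from the article: it assumes the data are already centered, omits any power iterations, and assumes a tall, thin intermediate matrix fits in RAM. The function name, the `blocks` callable, and the oversampling parameter are all hypothetical.

```python
import numpy as np


def randomized_pca_blocks(blocks, n_cols, k, oversample=10, seed=0):
    """Approximate rank-k SVD of a matrix supplied as a stream of row blocks.

    `blocks` is a callable returning a fresh iterator over the row blocks,
    standing in for reading successive chunks from disk.
    """
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((n_cols, k + oversample))

    # Pass 1: apply the matrix to a random test matrix, one block at a time.
    y = np.vstack([block @ omega for block in blocks()])
    q, _ = np.linalg.qr(y)  # orthonormal basis for the sampled range

    # Pass 2: project the matrix onto that basis, again block by block.
    b = np.zeros((q.shape[1], n_cols))
    start = 0
    for block in blocks():
        stop = start + block.shape[0]
        b += q[start:stop].T @ block
        start = stop

    u_small, s, vt = np.linalg.svd(b, full_matrices=False)
    return (q @ u_small)[:, :k], s[:k], vt[:k]


# Hypothetical test matrix of exact rank 3, streamed in blocks of 40 rows.
data_rng = np.random.default_rng(1)
a = data_rng.standard_normal((200, 3)) @ data_rng.standard_normal((3, 50))


def blocks():
    return (a[i:i + 40] for i in range(0, a.shape[0], 40))


u, s, vt = randomized_pca_blocks(blocks, n_cols=50, k=3)
print(np.abs((u * s) @ vt - a).max())
```
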
The classic chi-squared statistic for testing goodness-of-fit has long been a
cornerstone of modern statistical practice. The statistic consists of a sum in
which each summand involves multiplying by the inverse of (i.e., dividing by)
the probability associated with the corresponding bin in the distribution being
tested for goodness-of-fit. This inversion typically precipitates rebinning to
uniformize the probabilities associated with the bins, in order to make the
test reasonably powerful. With the now widespread availability of computers,
there is no longer any need for this.
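
In symbols, the statistic described above can be written as follows. The notation is assumed here rather than taken from the source: $n$ is the number of draws, $m$ the number of bins, $y_j$ the observed count in bin $j$, and $p_j$ the model probability of bin $j$.

```latex
\chi^2
  = \sum_{j=1}^{m} \frac{(y_j - n p_j)^2}{n p_j}
  = n \sum_{j=1}^{m} \frac{1}{p_j} \left( \frac{y_j}{n} - p_j \right)^2
```

The factor $1/p_j$ in the second form is the inversion in question: bins with small $p_j$ receive very large weights unless the bins are merged until the probabilities are roughly equal.
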
We discuss several tests for whether a given set of independent and
identically distributed (i.i.d.) draws does not come from a specified
probability density function. The most commonly used are Kolmogorov-Smirnov
tests, particularly Kuiper's variant, which focus on discrepancies between the
cumulative distribution function for the specified probability density and the
empirical cumulative distribution function for the given set of i.i.d. draws.
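
For concreteness, here is a minimal sketch of the two statistics, computed from the largest amounts by which the empirical cumulative distribution function rises above and falls below the specified one. The function name and the example (three draws tested against the uniform distribution on [0, 1]) are assumptions for illustration.

```python
def ks_and_kuiper(draws, cdf):
    """Return the Kolmogorov-Smirnov and Kuiper statistics for the draws."""
    xs = sorted(draws)
    n = len(xs)
    # Largest amount by which the empirical CDF exceeds the model CDF ...
    d_plus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))
    # ... and largest amount by which the model CDF exceeds the empirical CDF.
    d_minus = max(cdf(x) - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus), d_plus + d_minus


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1]."""
    return min(max(x, 0.0), 1.0)


ks, kuiper = ks_and_kuiper([0.1, 0.4, 0.7], uniform_cdf)
print(ks, kuiper)
```

The Kolmogorov-Smirnov statistic keeps only the larger of the two one-sided discrepancies, whereas Kuiper's variant adds them.
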
We describe an algorithm that, given any full-rank matrix A having fewer rows
than columns, can rapidly compute the orthogonal projection of any vector onto
the null space of A, as well as the orthogonal projection onto the row space of
A, provided that both A and its adjoint can be applied rapidly to arbitrary
vectors. As an intermediate step, the algorithm solves the overdetermined
linear least-squares regression involving the adjoint of A (and so can be used
for this, too).
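
The relationship between the two projections and the least-squares problem can be sketched as below. This is only a dense illustration under stated assumptions: it calls `numpy.linalg.lstsq` on an explicit matrix, whereas the algorithm described above needs only fast applications of A and its adjoint. The function name and the random test data are hypothetical.

```python
import numpy as np


def project_null_and_row(a, v):
    """Split v into its components in the null space and the row space of a."""
    adjoint = a.conj().T
    # Overdetermined least squares: find y minimizing ||adjoint @ y - v||.
    y, *_ = np.linalg.lstsq(adjoint, v, rcond=None)
    row_part = adjoint @ y     # orthogonal projection onto the row space
    null_part = v - row_part   # orthogonal projection onto the null space
    return null_part, row_part


# Hypothetical full-rank matrix with fewer rows than columns.
rng = np.random.default_rng(0)
a = rng.standard_normal((3, 6))
v = rng.standard_normal(6)

null_part, row_part = project_null_and_row(a, v)
print(np.abs(a @ null_part).max())
```

The residual of the least-squares fit is the null-space projection, which is why solving the regression yields both projections at once.
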
We accelerate the computation of spherical harmonic transforms, using what is
known as the butterfly scheme. This provides a convenient alternative to the
approach taken in the second paper from this series on "Fast algorithms for
spherical harmonic expansions." The requisite precomputations become manageable
when organized as a "depth-first traversal" of the program's control-flow
graph, rather than as the perhaps more natural "breadth-first traversal" that
processes one-by-one each level of the multilevel procedure. We illustrate the
results via several numerical examples.