We present a new similarity measure tailored to posts in an online forum. Our
measure takes into account all the available information about user interest
and interaction --- the content of posts, the threads in the forum, and the
author of the posts. We use this post similarity to build a similarity between
users, based on principal coordinate analysis. This allows easy visualization
of the user activity as well. Similarity between users has numerous
applications, such as clustering or classification.
The American Community Survey (ACS) provides one-year (1y), three-year (3y)
and five-year (5y) multi-year estimates (MYEs) of various demographic and
economic variables for each "community", although the 1y and 3y may not be
available for communities with a small population. These survey estimates are
not truly measuring the same quantities, since they each cover different time
spans. Using some simplistic models, we demonstrate that comparing different
period-length MYEs results in spurious conclusions about trend movements.
In the conviction of Lucia de Berk an important role was played by a simple
hypergeometric model, used by the expert consulted by the court, which produced
very small probabilities of occurrences of certain numbers of incidents. We
want to draw attention to the fact that, if we take into account the variation
among nurses in incidents they experience during their shifts, these
probabilities can become considerably larger. This points to the danger of
using an oversimplified discrete probability model in these circumstances.
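As a rough illustration of the kind of calculation involved, the sketch below computes an upper-tail probability under a plain hypergeometric model. The function name, its arguments and the numbers in the usage note are hypothetical; they are not the figures from the case, and the sketch does not include the nurse-to-nurse variation discussed above.

```python
from math import comb


def hypergeom_tail(total_shifts, incident_shifts, nurse_shifts, observed):
    """P(X >= observed) under a simple hypergeometric model.

    X counts how many of the nurse's shifts coincide with an incident when
    nurse_shifts are drawn without replacement from total_shifts, of which
    incident_shifts contain an incident. All inputs are illustrative.
    """
    upper = min(incident_shifts, nurse_shifts)
    lower = max(observed, 0)
    denom = comb(total_shifts, nurse_shifts)
    numer = sum(
        comb(incident_shifts, k)
        * comb(total_shifts - incident_shifts, nurse_shifts - k)
        for k in range(lower, upper + 1)
    )
    return numer / denom
```

For example, with the made-up values `hypergeom_tail(10, 3, 4, 3)` the only contributing term is k = 3, giving 7/210. Allowing the incident rate to differ between nurses would require a richer model than this one.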
We develop a simulation tool to support policy-decisions about healthcare for
chronic diseases in defined populations. Incident disease-cases are generated
in-silico from an age-sex characterised general population using standard
epidemiological approaches. A novel disease-treatment model then simulates
continuous life courses for each patient using discrete event simulation.
Ideally, the discrete event simulation model would be inferred from complete
longitudinal healthcare data via a likelihood or Bayesian approach.
Functional brain connectivity, as revealed through distant correlations in
the signals measured by functional Magnetic Resonance Imaging (fMRI), is a
promising source of biomarkers of brain pathologies. However, establishing and
using diagnostic markers requires probabilistic inter-subject comparisons.
Principled comparison of functional-connectivity structures is still a
challenging issue. We give a new matrix-variate probabilistic model suitable
for inter-subject comparison of functional connectivity matrices on the
manifold of Symmetric Positive Definite (SPD) matrices.
With reference to the questionnaire adopted within the Italian project
"Ulisse" to assess health condition of elderly people, we investigate two
important issues: discriminant power and actual number of dimensions measured
by the items composing the questionnaire. The adopted statistical approach is
based on the joint use of the latent class model and a multidimensional item
response theory model based on the 2PL parametrization. The latter allows us to
account for the different discriminant power of these items.
We consider problems of Bayesian inference for a spatial epidemic on a graph,
where the final state of the epidemic corresponds to bond percolation, and
where only the set or number of finally infected sites is observed. We develop
appropriate Markov chain Monte Carlo algorithms, demonstrating their
effectiveness, and we study problems of optimal experimental design. In
particular, we demonstrate that for lattice-based processes an experiment on a
sparsified lattice can yield more information on model parameters than one
conducted on a complete lattice.
We investigate the opinion dynamics by extending the majority rule model to a
preferential selection model, in which agents choose opinions with some
probability rather than absolutely follow the majority. In the model, agent $i$
agrees with one of binary opinions with the probability that is a power
function of the number of agents holding this opinion among agent $i$ and its
nearest neighbors, where an adjustable parameter $\alpha$ controls the degree
of preferential selection. We find that global consensus cannot be reached
if $\alpha<1$.
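A minimal sketch of the selection rule described above, assuming the power-function weights are normalized over the two opinions (the abstract does not state the normalization, so this form is an assumption, as are the function and argument names):

```python
def prob_adopt(n_plus, n_minus, alpha):
    """Probability that an agent adopts opinion '+'.

    n_plus and n_minus are the numbers of agents holding each opinion among
    the agent and its nearest neighbors. Assumed form:
    n_plus**alpha / (n_plus**alpha + n_minus**alpha).
    """
    if n_plus + n_minus == 0:
        raise ValueError("the neighborhood must contain at least one agent")
    # An opinion held by nobody gets zero weight, also when alpha == 0.
    w_plus = float(n_plus) ** alpha if n_plus > 0 else 0.0
    w_minus = float(n_minus) ** alpha if n_minus > 0 else 0.0
    return w_plus / (w_plus + w_minus)
```

Under this assumed form, alpha = 1 gives a probability proportional to the local counts, while large alpha approaches the deterministic majority rule.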
This paper introduces a new model and methodology for estimating the ability
of NBA players. The main idea is to directly measure how good a player is by
comparing how their team performs when they are on the court as opposed to when
they are off it. This is achieved in such a way as to control for the
changing abilities of the other players on court at different times during a
match.
Exponential random graph models are extremely difficult models to handle from
a statistical viewpoint, since their normalising constant, which depends on
model parameters, is available only in very trivial cases. We show how
inference can be carried out in a Bayesian framework using an MCMC algorithm,
which circumvents the need to calculate the normalising constants. We use a
population MCMC approach which accelerates convergence and improves mixing of
the Markov chain.
Minimum mean squared error (MMSE) estimators of signals from samples
corrupted by jitter (timing noise) and additive noise are nonlinear, even when
the signal prior and additive noise have normal distributions. This paper
develops stochastic algorithms based on Gibbs sampling and slice sampling to
approximate optimal MMSE estimators in this Bayesian formulation. Simulations
demonstrate that these nonlinear algorithms can improve significantly upon the
linear MMSE estimator.
This paper examines the problem of estimating the parameters of a bandlimited
signal from samples corrupted by random jitter (timing noise) and additive iid
Gaussian noise, where the signal lies in the span of a finite basis. For the
presented classical estimation problem, the Cramer-Rao lower bound (CRB) is
computed, and an Expectation-Maximization (EM) algorithm approximating the
maximum likelihood (ML) estimator is developed. Simulations are performed to
study the convergence properties of the EM algorithm and compare the
performance both against the CRB and a basic linear estimator.
This paper introduces Bayesian supervised and unsupervised segmentation
algorithms aimed at oceanic segmentation of SAR images. The data term,
\emph{i.e.}, the density of the observed backscattered signal given the region,
is modeled by a finite mixture of Gamma densities with a given predefined
number of components. To estimate the parameters of the class conditional
densities, a new expectation maximization algorithm was developed. The prior is
a multi-level logistic Markov random field enforcing local continuity in a
statistical sense.
This paper presents a general stochastic model developed for a class of
cooperative wireless relay networks, in which imperfect knowledge of the
channel state information at the destination node is assumed. The framework
incorporates multiple relay nodes operating under general known non-linear
processing functions. When a non-linear relay function is considered, the
likelihood function is generally intractable resulting in the maximum
likelihood and the maximum a posteriori detectors not admitting closed form
solutions.
This study examined the teaching practices of 227 college instructors of
introductory statistics (from the health and behavioral sciences). Using
primarily multidimensional scaling (MDS) techniques, a two-dimensional, 10-item
teaching practice scale, TISS (Teaching of Introductory Statistics Scale), was
developed and validated. The two dimensions (subscales) were characterized as
constructivist and behaviorist, and are orthogonal to each other.
Studying the topology of so-called real networks, that is networks obtained
from sociological or biological data for instance, has become a major field of
interest in the last decade. One way to deal with it is to consider that
networks are built from small functional units called motifs, which can be
found by looking for small subgraphs whose numbers of occurrences in the whole
network are surprisingly high. In this article, we propose to define motifs
through a local overrepresentation in the network and develop a statistic to
detect them without relying on simulations.
The Bourewa beach site on the Rove Peninsula of Viti Levu is the earliest
known human settlement in the Fiji Islands. How did the settlement at Bourewa
develop in space and time? We have radiocarbon dates on sixty specimens, found
in association with evidence for human presence, taken from pits across the
site. Owing to the lack of diagnostic stratigraphy, there is no direct
archaeological evidence for distinct phases of occupation through the period of
interest.
We apply multiple testing procedures to the validation of estimated default
probabilities in credit rating systems. The goal is to identify rating classes
for which the probability of default is estimated inaccurately, while still
maintaining a predefined level of committing type I errors as measured by the
familywise error rate (FWER) and the false discovery rate (FDR). For FWER, we
also consider procedures that take possible discreteness of the data or of the
test statistics into account.
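The abstract does not name the procedures used. As a sketch, two standard ones are shown here as stand-ins: Holm's step-down procedure for FWER control and the Benjamini-Hochberg step-up procedure for FDR control. The function names and the p-values in the usage note are hypothetical, and no adjustment for discreteness is attempted.

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm step-down procedure (controls the FWER at level alpha)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # Compare the (rank+1)-th smallest p-value with alpha / (m - rank).
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject


def bh_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the FDR at level alpha)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k_max = rank  # largest rank meeting its threshold
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject
```

With one made-up p-value per rating class, `[0.02, 0.5, 0.001, 0.045, 0.012]`, Holm flags two classes and Benjamini-Hochberg flags three, which reflects the usual trade-off between the stricter FWER and the more permissive FDR.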
Rapid research progress in genotyping techniques has allowed large
genome-wide association studies. Existing methods often focus on determining
associations between single loci and a specific phenotype. However, a
particular phenotype is usually the result of complex relationships between
multiple loci and the environment. In this paper, we describe a two-stage
method for detecting epistasis by combining the traditionally used single-locus
search with a search for multiway interactions. Our method is based on an
extended version of Fisher's exact test.
Imputation of missing data in large regions of satellite imagery is necessary
when the acquired image has been damaged by shadows due to clouds, or
information gaps produced by sensor failure.
In recent years, spatial and spatio-temporal modeling have become an
important area of research in many fields (epidemiology, environmental studies,
disease mapping). In this work we propose different spatial models to study
hospital recruitment, including some potentially explanatory variables.
Interest lies in the distribution per geographical unit of the ratio between the
number of patients living in this geographical unit and the population in the
same unit. Models considered are within the framework of Bayesian Latent
Gaussian models.
Spatial Independent Components Analysis (ICA) is increasingly used in the
context of functional Magnetic Resonance Imaging (fMRI) to study cognition and
brain pathologies. Salient features present in some of the extracted
Independent Components (ICs) can be interpreted as brain networks, but the
segmentation of the corresponding regions from ICs is still ill-controlled.
Here we propose a new ICA-based procedure for extraction of sparse features
from fMRI datasets. Specifically, we introduce a new thresholding procedure
that controls the deviation from isotropy in the ICA mixing model.
Spatial Independent Component Analysis (ICA) is an increasingly used
data-driven method to analyze functional Magnetic Resonance Imaging (fMRI)
data. To date, it has been used to extract sets of mutually correlated brain
regions without prior information on the time course of these regions. Some of
these sets of regions, interpreted as functional networks, have recently been
used to provide markers of brain diseases and open the road to paradigm-free
population comparisons.
Inferential summaries of tree estimates are useful in the setting of
evolutionary biology, where phylogenetic trees have been built from DNA data
since the 1960s. In bioinformatics, psychometrics and data mining,
hierarchical clustering techniques output the same mathematical objects, and
practitioners have similar questions about the stability and `generalizability'
of these summaries.
Research on occupational injuries rests theoretically on the idea that the
process is a Markov chain of random variables. However, this assumption has
never been rigorously proved, and experimental tests of the hypothesis always
come with given confidence limits, which leaves room for alternative
assumptions. In this research, several databases of occupational injuries were
studied using spectral analysis techniques and a representation of
occupational injuries as a temporal sequence of cases (a "telegraph wave" type
of process).
In practical nonlinear filtering, the assessment of achievable filtering
performance is important. In this paper, we focus on the problem of efficiently
approximating the posterior Cramer-Rao lower bound (CRLB) in a recursive manner.
Under Gaussian assumptions, two approximations for calculating the CRLB are
derived: an exact model using the state estimate, and a Taylor-series-expanded
model using both the state estimate and its error covariance. Moreover, the
difference between the two approximated CRLBs is also formulated analytically.
In order to maintain consistent quality of service, computer network
engineers face the task of monitoring the traffic fluctuations on the
individual links making up the network.
Profile likelihood intervals of large quantiles in Extreme Value
distributions provide a good way to estimate these parameters of interest since
they take into account the asymmetry of the likelihood surface in the case of
small and moderate sample sizes; however they are seldom used in practice. In
contrast, maximum likelihood asymptotic (mla) intervals are commonly used
without respect to sample size.
This thesis is dedicated to the statistical analysis of multi-subject fMRI
data, with the purpose of identifying brain structures involved in certain
cognitive or sensori-motor tasks, in a reproducible way across subjects. To
overcome certain limitations of standard voxel-based testing methods, as
implemented in the Statistical Parametric Mapping (SPM) software, we introduce
a Bayesian model selection approach to this problem, meaning that the most
probable model of cerebral activity given the data is selected from a
pre-defined collection of possible models.
The variance of the concentration in a sample can be estimated using
knowledge of the particle masses, concentrations and the parameter for the
dependent selection of particles. A number of variance estimators are
constructed including a class of hybrid estimators.
Missing data is a recurrent issue in epidemiology where the infection process
may be partially observed. Approximate Bayesian Computation (ABC), an alternative to
data imputation methods such as Markov Chain Monte Carlo integration, is
proposed for making inference in epidemiological models. It is a
likelihood-free method that relies exclusively on numerical simulations. ABC
consists in computing a distance between simulated and observed summary
statistics and weighting the simulations according to this distance.
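The weighting step can be sketched on a toy problem. Everything specific below is an assumption made for illustration: the binomial "epidemic" simulator, the uniform prior, the Epanechnikov-type kernel, the bandwidth and the observed count are not taken from the paper.

```python
import random


def simulate_infected(p, n_hosts, rng):
    """Toy simulator: each host is infected independently with probability p."""
    return sum(rng.random() < p for _ in range(n_hosts))


def abc_weighted_sample(observed, n_hosts, n_sims, bandwidth, rng):
    """Draw parameters from a Uniform(0, 1) prior and weight each draw by the
    distance between its simulated summary statistic and the observed one."""
    draws = []
    for _ in range(n_sims):
        p = rng.random()                        # prior draw
        simulated = simulate_infected(p, n_hosts, rng)
        u = abs(simulated - observed) / bandwidth
        weight = max(0.0, 1.0 - u * u)          # zero beyond the bandwidth
        draws.append((p, weight))
    return draws


def weighted_mean(draws):
    """Weighted posterior mean of the parameter."""
    total = sum(w for _, w in draws)
    if total == 0.0:
        raise ValueError("all weights are zero; increase the bandwidth")
    return sum(p * w for p, w in draws) / total


rng = random.Random(0)
draws = abc_weighted_sample(observed=15, n_hosts=50, n_sims=5000,
                            bandwidth=5.0, rng=rng)
posterior_mean = weighted_mean(draws)
```

No likelihood is evaluated anywhere in this sketch; only forward simulations are needed, which is what makes the approach attractive when the infection process is partially observed.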
We show how to construct the best linear unbiased predictor (BLUP) for the
continuation of a curve in a spline-function model. We assume that the entire
curve is drawn from some smooth random process and that the curve is given up
to some cut point. We demonstrate how to compute the BLUP efficiently.
Confidence bands for the BLUP are discussed. Finally, we apply the proposed
BLUP to real-world call center data. Specifically, we forecast the continuation
of both the call arrival counts and the workload process at the call center of
a commercial bank.
The most commonly used relative abundance index in stock assessments of
longline fisheries is catch per unit effort (CPUE), here defined as the number
of fish of the targeted species caught per hook and minute of soak time.
Longline CPUE can be affected by interspecific competition and the retrieval of
unbaited or empty hooks, and interannual variation in these can lead to biases
in the apparent abundance trends in the CPUE. Interspecific competition on
longlines has been previously studied but the return of empty hooks is ignored
in all current treatments of longline CPUE.
We propose a new approach for clustering DNA features using array CGH data
from multiple tumor samples. We distinguish data-collapsing: joining contiguous
DNA clones or probes with extremely similar data into regions, from clustering:
joining contiguous, correlated regions based on a maximum likelihood principle.
The model-based clustering algorithm accounts for the apparent spatial patterns
in the data. We evaluate the randomness of the clustering result by a cluster
stability score in combination with cross-validation.
"Evidence and Evolution: the Logic behind the Science" was published in 2008
by Elliott Sober. It examines the philosophical foundations of the statistical
arguments used to evaluate hypotheses in evolutionary biology, based on simple
examples and likelihood ratios. The difficulty with reading the book from a
statistician's perspective is the reluctance of the author to engage in model
building and even less in parameter estimation.
Portfolio allocation with gross-exposure constraint is an effective method to
increase the efficiency and stability of selected portfolios among a vast pool
of assets, as demonstrated in Fan et al (2008). The required high-dimensional
volatility matrix can be estimated by using high frequency financial data. This
enables us to better adapt to the local volatilities and local correlations
among a vast number of assets and to increase significantly the sample size for
estimating the volatility matrix.
We consider processes on social networks that can potentially involve three
phenomena: homophily, or the formation of social ties due to matching
individual traits; social contagion, also known as social influence; and the
causal effect of an individual's covariates on their behavior or other
measurable responses. We show that, generically, all of these are confounded
with each other. Distinguishing them from one another requires strong
assumptions on the parametrization of the social process or on the adequacy of
the covariates used (or both).
To account for the complex interplay between positive and negative dimensions
of experience in a well-defined framework, psychology needs theoretical models
associated with mathematical tools that integrate both dimensions. To this end,
we drew upon the Balanced States of Mind Model, an information-processing model
that relates quantitatively precise emotional balances of positive and negative
affects to psychopathology and optimal functioning.
Ozone and particulate matter PM2.5 are co-pollutants that have long been
associated with increased public health risks. Information on concentration
levels for both pollutants comes from two sources: monitoring sites and output
from complex numerical models that produce concentration surfaces over large
spatial regions.
This paper develops strategic foundations for an important statistical model
of random networks with heterogeneous expected degrees. Based on this, we show
how social networking services that subtly alter the costs and indirect
benefits of relationships can cause large changes in behavior and welfare. In
the model, agents who value friends and friends of friends choose how much to
socialize, which increases the probabilities of links but is costly.
This work focuses on decentralized decision making in a population of
individuals each implementing the sequential probability ratio test. The
individual decisions are combined into a decentralized decision via an
aggregation rule chosen from a family of aggregation rules known as q out of N
rules. We study how the population size affects the performance of the
decentralized decision making, i.e., the decision accuracy and time. In a group
applying the q out of N rule, a global decision is reached as soon as q out of
the N decision makers agree on an answer.
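The aggregation rule itself can be sketched in a few lines. Here the individual decisions and decision times are taken as given inputs rather than produced by a sequential probability ratio test, and the function name and data layout are illustrative assumptions.

```python
def q_out_of_n(decisions, times, q):
    """Apply a q out of N rule to individual binary decisions.

    decisions[i] is the answer (0 or 1) of decision maker i and times[i] is
    the time at which it was reached. Returns (answer, time) for the first
    moment at which q decision makers agree, or None if that never happens.
    """
    counts = {0: 0, 1: 0}
    # Process individual decisions in the order in which they are reached.
    for t, d in sorted(zip(times, decisions)):
        counts[d] += 1
        if counts[d] >= q:
            return d, t
    return None
```

For instance, `q_out_of_n([1, 0, 1, 1, 0], [2.0, 1.0, 5.0, 3.0, 4.0], q=2)` returns `(1, 3.0)`: the second vote for answer 1 arrives at time 3.0. Raising q to 3 delays the global decision to time 5.0, which illustrates the accuracy versus time trade-off that the choice of q controls.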
I discuss the statistical methods used in a paper in a respected management
journal, in order to present a critique of how statistics is typically used in
this type of research. Three themes emerge. The value of any statistical
approach is limited by various factors, especially the restricted nature of the
population sampled.
Flow cytometry is a technology that rapidly measures antigen-based markers
associated with cells in a cell population. Although analysis of flow cytometry
data has traditionally considered one or two markers at a time, there has been
increasing interest in multidimensional analysis. However, flow cytometers are
limited in the number of markers they can jointly observe, which is typically a
fraction of the number of markers of interest. For this reason, practitioners
often perform multiple assays based on different, overlapping combinations of
markers.
We propose a method to generate a warning system for the early detection of
time clusters applied to public health surveillance data. This new method
relies on the evaluation of a return period associated with any new count of a
particular infection reported to a surveillance system. The method is applied
to Salmonella surveillance in France and compared to the model developed by
Farrington et al.
Over the past decades, the competition for academic resources has gradually
intensified, and worsened with the current financial crisis. To optimize the
resource allocation, individualized assessment of research results is being
actively studied but the current indices, such as the number of papers, the
number of citations, the h-factor and its variants have limitations, especially
their inability to determine co-authors' credit shares fairly.
Assume that we observe a large number of curves, all of them with identical,
although unknown, shape, but with a different random shift. The objective is to
estimate the individual time shifts and their distribution. Such an objective
appears in several biological applications like neuroscience or ECG signal
processing, in which the estimation of the distribution of the elapsed time
between repetitive pulses with a possibly low signal-to-noise ratio, and without
knowledge of the pulse shape, is of interest.
The generation of multi-step density forecasts for non-Gaussian data mostly
relies on Monte-Carlo simulations which are computationally intensive. Using
aggregated wind power in Ireland, we study two approaches to multi-step density
forecasts which can be obtained from simple iterations so that intensive
computations are avoided. In the first approach, we apply a logistic
transformation to normalize the data approximately and describe the transformed
data using ARIMA-GARCH models so that multi-step forecasts can be iterated
easily.
Nonparametric estimation of the gap time distribution in a simple renewal
process may be considered a problem in survival analysis under particular
sampling frames corresponding to how the renewal process is observed. This note
describes several such situations where simple product limit estimators, though
inefficient, may still be useful.
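For reference, a minimal sketch of the standard product limit (Kaplan-Meier) estimator for right-censored gap times is given below. The sampling frames discussed in the note may call for modified versions; the function name and input layout here are assumptions made for illustration.

```python
def product_limit(times, observed):
    """Product limit estimate of the survival function.

    times[i] is a gap time and observed[i] is True if the renewal was seen,
    False if the gap was right-censored at times[i]. Returns a list of
    (t, S(t)) pairs at the distinct observed event times.
    """
    data = sorted(zip(times, observed))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        removed = 0
        # Group all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                events += 1
            removed += 1
            i += 1
        if events > 0:
            survival *= 1.0 - events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve
```

Censored gaps contribute only by staying in the risk set until their censoring time, which is why the estimator remains usable when some gaps are cut off by the observation window.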
This article presents a statistical analysis method and introduces the
corresponding software package "tailstat," which is believed to be widely
applicable to today's internet society. The proposed method facilitates
statistical analyses with small sample sets from given populations, which
render the central limit theorem inapplicable. A large-scale case study
demonstrates the effectiveness of the method and provides implications for
applying similar analyses to other cases.
I argue that we must distinguish between:
(0) the Three-Doors-Problem Problem [sic], which is to make sense of some
real world question of a real person.
(1) a large number of solutions to this meta-problem, i.e., many specific
Three-Doors-Problem problems, which are competing mathematizations of the
meta-problem (0).
We consider the problem of regression learning for deterministic design and
independent random errors. We start by proving a sharp PAC-Bayesian type bound
for the exponentially weighted aggregate (EWA) under the expected squared
empirical loss. For a broad class of noise distributions the presented bound is
valid whenever the temperature parameter $\beta$ of the EWA is larger than or
equal to $4\sigma^2$, where $\sigma^2$ is the noise variance.
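A small sketch of exponential weighting over a finite set of candidate predictors, assuming weights proportional to the prior times exp(-(sum of squared residuals) / beta). This specific form, the finite candidate set and the function names are assumptions for illustration; the abstract only fixes the role of the temperature $\beta$.

```python
import math


def ewa_weights(sq_losses, beta, prior=None):
    """Exponential weights for candidates with given summed squared losses.

    Assumed form: weight_j proportional to prior_j * exp(-sq_losses[j] / beta).
    """
    m = len(sq_losses)
    if prior is None:
        prior = [1.0 / m] * m
    shift = min(sq_losses)  # subtracting the minimum avoids underflow
    raw = [p * math.exp(-(loss - shift) / beta)
           for p, loss in zip(prior, sq_losses)]
    total = sum(raw)
    return [r / total for r in raw]


def ewa_aggregate(predictions, y, sigma2, prior=None):
    """Aggregate candidate prediction vectors using beta = 4 * sigma2."""
    beta = 4.0 * sigma2
    losses = [sum((yi - fi) ** 2 for yi, fi in zip(y, f)) for f in predictions]
    w = ewa_weights(losses, beta, prior)
    return [sum(w[j] * predictions[j][i] for j in range(len(predictions)))
            for i in range(len(y))]
```

Choosing beta = 4 * sigma2 in `ewa_aggregate` sits at the boundary of the range mentioned above; larger values flatten the weights towards the prior.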
As compared to load demand, frequent wind energy intermittencies produce
large short-term (sub 1-hr to 3-hr) deficits (and surpluses) in the energy
supply. These intermittent deficits pose systemic and structural risks that
will likely lead to energy deficits that have significant reliability
implications for energy system operators and consumers. This work provides a
toolset to help policy makers quantify these first-order risks. The thinking
methodology / framework shows that increasing wind energy penetration
significantly increases the risk of loss in California.
We examine the use of a structured thresholding algorithm for sparse
underwater channel estimation using compressed sensing. This method shows some
improvements over standard algorithms for sparse channel estimation such as
matching pursuit, iterative detection and least squares.
Researchers in many scientific fields make inferences from individuals to
larger groups. For many groups however, there is no list of members from which
to take a random sample. Respondent-driven sampling (RDS) is a relatively new
sampling methodology that circumvents this difficulty by using the social
networks of the groups under study. The RDS method has been shown to provide
unbiased estimates of population proportions given certain conditions. The
method is now widely used in the study of HIV-related high-risk populations
globally.
Successful implementation of California's Renewable Portfolio Standard (RPS)
mandating 33 percent renewable energy generation by 2020 requires inclusion of
a robust strategy to mitigate increased risk of energy deficits (blackouts) due
to short time-scale (sub 1 hour) intermittencies in renewable energy sources.
Of these RPS sources, wind energy has the fastest growth rate--over 25%
year-over-year. If these growth trends continue, wind energy could make up 15
percent of California's energy portfolio by 2016 (wRPS15).
Bayesian inferences in high energy physics often use uniform prior
distributions for parameters about which little or no information is available
before data are collected. The resulting posterior distributions are therefore
sensitive to the choice of parametrization for the problem and may even be
improper if this choice is not carefully considered. Here we describe an
extensively tested methodology, known as reference analysis, which allows one
to construct parametrization-invariant priors that embody the notion of minimal
informativeness in a mathematically well-defined sense.
We propose a simple and efficient Bayesian model of iterative learning on
social networks. This model is efficient in two senses: the process both
results in an optimal belief, and can be carried out with modest computational
resources for large networks. This result extends Condorcet's Jury Theorem to
general social networks, while preserving rationality and computational
feasibility.
This paper studies business cycle patterns in UK sectoral output. It analyzes
the distinction between white noise processes and their non-white noise
counterparts in the frequency domain and further examines the associated
features and patterns for the process where white noise conditions are
violated. The characteristics of these sectors, arising from their
institutional features that may influence business cycles behavior and
patterns, are discussed.
This study introduces a new method of visualizing complex tree structured
objects. The usefulness of this method is illustrated in the context of
detecting unexpected features in a data set of very large trees. The major
contribution is a novel two-dimensional graphical representation of each tree,
with a covariate coded by color. The motivating data set contains three
dimensional representations of brain artery systems of 105 subjects.
A lot of financial data present slight negative skewness and excess kurtosis.
Whereas the kurtosis is usually addressed via Student-t distributions, the
evident asymmetry in the data is often tacitly ignored. Here I introduce a new
class of generalized skewed distribution functions, which allows a very
flexible approach to model skewed data. Originating from a system-theory and an
input/output point of view, a non-linear transformation converts a random
variable X into a so-called Lambert W random variable Y. Its skewness depends
on the skewness of X and a skew parameter delta.
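To make the effect of the skew parameter concrete, the sketch below assumes the transformation takes the form Y = X exp(delta X). The abstract does not spell out the formula, so this form, like the function names and the choice of a standard normal X, is an assumption for illustration.

```python
import math
import random


def lambert_w_transform(x, delta):
    """Assumed skewing transformation: y = x * exp(delta * x)."""
    return x * math.exp(delta * x)


def sample_skewness(values):
    """Moment-based sample skewness m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(20000)]       # symmetric input X
y = [lambert_w_transform(v, -0.1) for v in x]         # negative delta
```

With a symmetric input and a negative delta, the transformed sample is negatively skewed, the direction typically reported for financial returns; delta = 0 leaves the input unchanged.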
We argue that the time from the onset of infectiousness to infectious
contact, which we call the contact interval, is a better basis for inference in
epidemic data than generation or serial intervals. Since an infectious person
might recover before making infectious contact or make infectious contact with
previously infected persons, infectious contact intervals can be right-censored
and survival analysis is the natural approach to estimation.
It is now widely accepted that knowledge can be acquired from networks by
clustering their vertices according to connection profiles. Many methods have
been proposed. In this paper, we concentrate on a mixture model for graphs, the
so-called MixNet model, which is closely related to the stochastic block model.
The clustering of vertices and the estimation of MixNet model parameters have
been the subject of previous work and numerous inference strategies such as
variational Expectation Maximization (EM) and classification EM have been
proposed.
Technological advances in genotyping have given rise to hypothesis-based
association studies of increasing scope. As a result, the scientific hypotheses
addressed by these studies have become more complex and more difficult to
address using existing analytic methodologies. Obstacles to analysis include
inference in the face of multiple comparisons, complications arising from
correlations among the SNPs (single nucleotide polymorphisms), choice of their
genetic parameterization and missing data.
Numerous statistics have been proposed for the measure of offensive ability
in major league baseball. While some of these measures may offer moderate
predictive power in certain situations, it is unclear which simple offensive
metrics are the most reliable or consistent. We address this issue with a
Bayesian hierarchical model for variable selection to capture which offensive
metrics are most predictive within players across time.
In this article we introduce and study a mathematical framework for
characterizing and simulating networks of noisy integrate-and-fire neurons
based on the spike times. We show that the firing times of the neurons in the
networks constitute a Markov chain, whose transition probability is related to
the probability distribution of the interspike interval of the neurons in the
network.
Missing data estimation is an important challenge with high-dimensional data
arranged in the form of a matrix. Typically this data matrix is transposable,
meaning that either the rows, columns or both can be treated as features. To
model transposable data, we present a modification of the matrix-variate
normal, the mean-restricted matrix-variate normal, in which the rows and
columns each have a separate mean vector and covariance matrix.
This article develops a general detection theory for speech analysis based on
time-varying autoregressive models, which themselves generalize the classical
linear predictive speech analysis framework. This theory leads to a
computationally efficient decision-theoretic procedure that may be applied to
detect the presence of vocal tract variation in speech waveform data.
A generalization of Gy's theory for the variance of the fundamental sampling
error is reviewed. Practical situations where the generalized model potentially
leads to more accurate variance estimates are identified as: clustering of
particles, differences in densities or sizes of the particles or repulsive
inter-particle forces. Two general approaches for estimating an input parameter
for the generalized model are discussed. The first approach consists of
modelling based on physical properties of particles such as size, density and
electrostatic forces between particles.
This paper addresses the issue of detecting point objects in a clutter
background and estimating their position by image processing. We are interested
in the specific context where the object signature significantly varies with
its random subpixel location because of aliasing. The conventional matched filter
neglects this phenomenon and causes a consistent loss of detection performance.
Thus, alternative detectors are proposed and numerical results show the
improvement brought by approximate and generalized likelihood ratio tests in
comparison with pixel matched filtering.
The next generation of telescopes will acquire terabytes of image data on a
nightly basis. Collectively, these large images will contain billions of
interesting objects, which astronomers call sources. The astronomers' task is
to construct a catalog detailing the coordinates and other properties of the
sources. The source catalog is the primary data product for most telescopes and
is an important input for testing new astrophysical theories, but to construct
the catalog one must first detect the sources.
This paper focuses on methodological approaches for characterising the
specific topics within a technological field based on scientific literature
data. We introduce a diachronic clustering analysis approach and some
bibliometric indicators. The results are visualised with the software tool
Stanalyst [1]. We apply our methods to the field "Molecular Biology".
This field has grown a great deal in the last decade.
Large datasets with interactions between objects are common to numerous
scientific fields (e.g. social science, the internet, biology). The interactions
naturally define a graph, and a common way to explore or summarize such a dataset
is graph clustering. Most techniques for clustering graph vertices use only the
topology of connections, ignoring information in the vertex features.
Microarrays have been developed that tile the entire nonrepetitive genomes of
many different organisms, allowing for the unbiased mapping of active
transcription regions or protein binding sites across the entire genome. These
tiling array experiments produce massive correlated data sets that have many
experimental artifacts, presenting many challenges to researchers that require
innovative analysis methods and efficient computational algorithms.
The statistical analysis of complex networks is a challenging task, given
that appropriate statistical models and efficient computational procedures are
required in order for structures to be learned. One line of research has aimed
at developing mixture models for random graphs, and this strategy has been
successful in revealing structures in social and biological networks. The
principle of these models is to assume that the distribution of the edge values
follows a parametric distribution, conditionally on a latent structure which is
used to detect connectivity patterns.
To address an important risk classification issue that arises in clinical
practice, we propose a new mixture model via latent cure rate markers for
survival data with a cure fraction. In the proposed model, the latent cure rate
markers are modeled via a multinomial logistic regression and patients who
share the same cure rate are classified into the same risk group. Compared to
available cure rate models, the proposed model fits better to data from a
prostate cancer clinical trial.
We present a weighted-Lasso method to infer the parameters of a first-order
vector auto-regressive model that describes time course expression data
generated by directed gene-to-gene regulation networks. These networks are
assumed to have a priori internal structures of connectivity which drive the
inference method. The solution to the optimization problem is efficiently computed
using an active-set algorithm. We illustrate the performance both on synthetic
data and on the yeast regulation network by analyzing Spellman et al.'s dataset.
Analyses of serially-sampled data often begin with the assumption that the
observations represent discrete samples from a latent continuous-time
stochastic process. The continuous-time Markov chain (CTMC) is one such
generative model whose popularity extends to a variety of disciplines ranging
from computational finance to human genetics and genomics. A common theme among
these diverse applications is the need to simulate sample paths of a CTMC
conditional on realized data that is discretely observed.
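One simple way to obtain such conditioned sample paths is rejection sampling: simulate the chain forward from the observed starting state and keep the path only if it ends in the observed final state. The sketch below illustrates this under stated assumptions; the generator matrix `Q`, the function names and the time horizon are hypothetical, and this is not necessarily the approach taken in the work summarized above.

```python
import numpy as np

def simulate_path(Q, start, t_end, rng):
    """Forward-simulate a CTMC with generator Q on [0, t_end).

    Returns the jump times (starting with 0.0) and the states entered.
    """
    times, states = [0.0], [start]
    t, state = 0.0, start
    while True:
        rate = -Q[state, state]            # total rate of leaving the current state
        if rate <= 0.0:
            break                          # absorbing state: no further jumps
        t += rng.exponential(1.0 / rate)   # exponential holding time
        if t >= t_end:
            break
        jump_probs = Q[state].copy()
        jump_probs[state] = 0.0
        jump_probs /= rate                 # embedded jump-chain probabilities
        state = int(rng.choice(len(jump_probs), p=jump_probs))
        times.append(t)
        states.append(state)
    return times, states

def sample_conditioned_path(Q, start, end, t_end, rng, max_tries=100_000):
    """Rejection sampling: keep the first forward path that is in `end` at t_end."""
    for _ in range(max_tries):
        times, states = simulate_path(Q, start, t_end, rng)
        if states[-1] == end:
            return times, states
    raise RuntimeError("no accepted path; the end state may be too unlikely")

# Hypothetical 3-state generator (each row sums to zero).
Q = np.array([[-1.0, 0.6, 0.4],
              [0.5, -1.5, 1.0],
              [0.3, 0.7, -1.0]])
rng = np.random.default_rng(0)
times, states = sample_conditioned_path(Q, start=0, end=2, t_end=1.0, rng=rng)
print(list(zip(times, states)))
```

Rejection sampling is exact but becomes slow when the observed end state is unlikely given the start state.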
We present a new joint longitudinal and survival model aimed at estimating
the association between the risk of an event and the change in and history of a
biomarker that is repeatedly measured over time. We use cubic B-spline models
for the longitudinal component that lend themselves to straightforward
formulations of the slope and integral of the trajectory of the biomarker. The
model is applied to data collected in a long term follow-up study of HIV
infected infants in Uganda. Estimation is carried out using MCMC methods.
The stationary distribution of allele frequencies under a variety of
Wright--Fisher $k$-allele models with selection and parent independent mutation
is well studied. However, the statistical properties of maximum likelihood
estimates of parameters under these models are not well understood. Under each
of these models there is a point in data space which carries the strongest
possible signal for selection, yet, at this point, the likelihood is unbounded.
This result remains valid even if all of the mutation parameters are assumed to
be known.
The statistical analysis of covariance matrix data is considered and, in
particular, methodology is discussed which takes into account the non-Euclidean
nature of the space of positive semi-definite symmetric matrices. The main
motivation for the work is the analysis of diffusion tensors in medical image
analysis. The primary focus is on estimation of a mean covariance matrix and,
in particular, on the use of Procrustes size-and-shape space. Comparisons are
made with other estimation techniques, including using the matrix logarithm,
matrix square root and Cholesky decomposition.
Spatially explicit data layers of tree species assemblages, referred to as
forest types or forest type groups, are a key component in large-scale
assessments of forest sustainability, biodiversity, timber biomass, carbon
sinks and forest health monitoring. This paper explores the utility of coupling
georeferenced national forest inventory (NFI) data with readily available and
spatially complete environmental predictor variables through spatially-varying
multinomial logistic regression models to predict forest type groups across
large forested landscapes.
Hierarchical models are a powerful tool for high-throughput data with a small
to moderate number of replicates, as they allow sharing information across
units of information, for example, genes. We propose two such models and show
their increased sensitivity in microarray differential expression applications.
We build on the gamma--gamma hierarchical model introduced by Kendziorski et
al. [Statist. Med. 22 (2003) 3899--3914] and Newton et al. [Biostatistics 5
(2004) 155--176], by addressing important limitations that may have hampered
its performance and its more widespread use.
Although anger is an important emotion that underlies much overt aggression
at great social cost, little is known about how to quantify anger or to specify
the relationship between anger and the overt behaviors that express it. This
paper proposes a novel statistical model which provides both a metric for the
intensity of anger and an approach to determining the quantitative relationship
between anger intensity and the specific behaviors that it controls.
Colon and rectum cancer share many risk factors, and are often tabulated
together as ``colorectal cancer'' in published summaries. However, recent work
indicating that exercise, diet, and family history may have differential
impacts on the two cancers encourages analyzing them separately, so that
corresponding public health interventions can be more efficiently targeted. We
analyze colon and rectum cancer data from the Minnesota Cancer Surveillance
System from 1998--2002 over the 16-county Twin Cities (Minneapolis--St. Paul)
metro and exurban area.
We consider nonparametric estimation of the state price density encapsulated
in option prices. Unlike usual density estimation problems, we only observe
option prices and their corresponding strike prices rather than samples from
the state price density. We propose to model the state price density directly
with a nonparametric mixture and estimate it using least squares. We show that
although the minimization is taken over an infinite-dimensional function
space, the minimizer always admits a finite-dimensional representation and can
be computed efficiently.
Having observed an $m\times n$ matrix $X$ whose rows are possibly correlated,
we wish to test the hypothesis that the columns are independent of each other.
Our motivation comes from microarray studies, where the rows of $X$ record
expression levels for $m$ different genes, often highly correlated, while the
columns represent $n$ individual microarrays, presumably obtained
independently. The presumption of independence underlies all the familiar
permutation, cross-validation and bootstrap methods for microarray analysis, so
it is important to know when independence fails.
We develop a new estimation technique for recovering depth-of-field from
multiple stereo images. Depth-of-field is estimated by determining the shift in
image location resulting from different camera viewpoints. When this shift is
not divisible by pixel width, the multiple stereo images can be combined to
form a super-resolution image. By modeling this super-resolution image as a
realization of a random field, one can view the recovery of depth as a
likelihood estimation problem.
A predictor variable or dose that is measured with substantial error may
possess an error-free milestone, such that it is known with negligible error
whether the value of the variable is to the left or right of the milestone.
Such a milestone provides a basis for estimating a linear relationship between
the true but unknown value of the error-free predictor and an outcome, because
the milestone creates a strong and valid instrumental variable. The inferences
are nonparametric and robust, and in the simplest cases, they are exact and
distribution free.
We visit the following problem: For a `generic' model of consumer choice
(namely, distributions over preference lists) and a limited amount of data on
how consumers actually make decisions (such as marginal preference
information), how may one predict revenues from offering a particular
assortment of choices? This is a central problem in operations research and
marketing. We present a framework to answer such questions and design a number
of algorithms for this purpose that are tractable from both a data and a
computational standpoint.
Material indentation studies, in which a probe is brought into controlled
physical contact with an experimental sample, have long been a primary means by
which scientists characterize the mechanical properties of materials. More
recently, the advent of atomic force microscopy, which operates on the same
fundamental principle, has in turn revolutionized the nanoscale analysis of
soft biomaterials such as cells and tissues.
We propose a novel approach for distributed statistical detection of
change-points in high-volume network traffic. We consider more specifically the
task of detecting and identifying the targets of Distributed Denial of Service
(DDoS) attacks. The proposed algorithm, called DTopRank, performs distributed
network anomaly detection by aggregating the partial information gathered in a
set of network monitors.
An extension of the latent Markov Rasch model is described for the analysis
of binary longitudinal data with covariates when subjects are collected in
clusters, e.g. students clustered in classes. For each subject, the latent
process is used to represent the characteristic of interest (e.g. ability)
conditional on the effect of the cluster to which he/she belongs. The latter
effect is modeled by a discrete latent variable associated with each cluster.
For the maximum likelihood estimation of the model parameters we outline an EM
algorithm.
We develop the relational topic model (RTM), a hierarchical model of both
network structure and node attributes. We focus on document networks, where the
attributes of each document are its words, i.e., discrete observations taken
from a fixed vocabulary. For each pair of documents, the RTM models their link
as a binary random variable that is conditioned on their contents. The model
can be used to summarize a network of documents, predict links between them,
and predict words within them.
We address the asymptotic and approximate distributions of a large class of
test statistics with quadratic forms used in association studies. The
statistics of interest do not necessarily follow a chi-square distribution and
take the general form $D=X^T A X$, where $X$ follows the multivariate normal
distribution, and $A$ is a general similarity matrix which may or may not be
positive semi-definite.
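As a minimal numerical sketch of why such statistics need not be chi-square, assume $X$ has mean zero and a positive definite covariance $\Sigma$ (both assumptions made here for illustration). Then $D=X^T A X$ has the same distribution as $\sum_i \lambda_i Z_i^2$, where the $Z_i$ are independent standard normals and the $\lambda_i$ are the eigenvalues of $L^T A L$ with $\Sigma = L L^T$. The matrices and the helper name `quadratic_form_weights` below are hypothetical.

```python
import numpy as np

def quadratic_form_weights(A, Sigma):
    """Weights lambda_i with X^T A X ~ sum_i lambda_i * Z_i^2 for X ~ N(0, Sigma)."""
    A_sym = 0.5 * (A + A.T)          # only the symmetric part of A contributes
    L = np.linalg.cholesky(Sigma)    # Sigma = L L^T, requires positive definite Sigma
    return np.linalg.eigvalsh(L.T @ A_sym @ L)

# Hypothetical covariance and an indefinite similarity matrix.
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])
A = np.array([[1.0, 0.5, 0.0],
              [0.5, -1.0, 0.0],
              [0.0, 0.0, 2.0]])

weights = quadratic_form_weights(A, Sigma)

# Compare direct draws of D with draws from the weighted chi-square mixture.
rng = np.random.default_rng(0)
n = 200_000
X = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
D_direct = np.einsum("ij,jk,ik->i", X, A, X)
Z = rng.standard_normal((n, 3))
D_mixture = (Z ** 2) @ weights

print("weights:", weights)
print("means:", D_direct.mean(), D_mixture.mean(), "theory:", weights.sum())
print("variances:", D_direct.var(), D_mixture.var(), "theory:", 2 * np.sum(weights ** 2))
```

Because `A` is indefinite in this example, some weights are negative and `D` can take negative values, which a chi-square variable cannot.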
After an elementary derivation of the "time transformation", mapping a
counting process onto a homogeneous Poisson process with rate one, a brief
review of Ogata's goodness of fit tests is presented and a new test, the
"Wiener process test", is proposed. This test is based on a straightforward
application of Donsker's Theorem to the intervals of time transformed counting
processes. The finite sample properties of the test are studied by Monte Carlo
simulations.
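The time transformation can be illustrated with a short simulation. If a counting process has a deterministic intensity $\lambda(t)$ and event times $t_i$, the transformed times $\Lambda(t_i)=\int_0^{t_i}\lambda(s)\,ds$ form a homogeneous Poisson process with rate one, so their successive intervals are independent unit-mean exponentials. The sketch below checks this numerically; the intensity, the thinning simulator and all names are assumptions chosen for illustration, and the code does not implement the tests discussed above.

```python
import numpy as np

def intensity(t):
    """Hypothetical deterministic intensity, chosen only for illustration."""
    return 5.0 + 4.0 * np.sin(t)

def integrated_intensity(t):
    """Closed-form integral of `intensity` from 0 to t."""
    return 5.0 * t + 4.0 * (1.0 - np.cos(t))

def simulate_by_thinning(t_max, rate_bound, rng):
    """Simulate event times on [0, t_max] by thinning a homogeneous Poisson process."""
    n_candidates = rng.poisson(rate_bound * t_max)
    candidates = np.sort(rng.uniform(0.0, t_max, size=n_candidates))
    # Keep each candidate with probability intensity(t) / rate_bound.
    keep = rng.uniform(0.0, rate_bound, size=n_candidates) < intensity(candidates)
    return candidates[keep]

rng = np.random.default_rng(1)
events = simulate_by_thinning(t_max=2000.0, rate_bound=9.0, rng=rng)

# Time transformation: map each event time through the integrated intensity.
transformed = integrated_intensity(events)
intervals = np.diff(np.concatenate(([0.0], transformed)))

print("number of events:", events.size)
print("mean interval:", intervals.mean())      # close to 1 for a unit-rate process
print("interval variance:", intervals.var())   # close to 1 for unit exponentials
```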
In this work we deal with parameter estimation in a latent variable model,
namely the multiple-hidden i.i.d. model, which is derived from multiple
alignment algorithms. We first provide a rigorous formalism for the homology
structure of k sequences related by a star-shaped phylogenetic tree in the
context of multiple alignment based on indel evolution models. We discuss
possible definitions of likelihoods and compare them to the criterion used in
multiple alignment algorithms.
One of the most important tasks in image processing and machine
vision is object recognition, and the success of many proposed methods relies
on a suitable choice of algorithm for the segmentation of an image. This paper
focuses on how to apply texture operators based on the concepts of fractal
dimension and co-occurrence matrix to the problem of object recognition, and a
new method based on fractal dimension is introduced.
In the year 2005 Jorge Hirsch introduced the h index for quantifying the
research output of scientists. Today, the h index is a widely accepted
indicator of research performance. The h index has been criticized for its
insufficient reliability - the ability to discriminate reliably between
meaningful amounts of research performance.
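For reference, the h index is the largest number h such that a scientist has h papers with at least h citations each. A minimal sketch of the computation follows; the function name `h_index` and the citation counts are made up for illustration.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h

example_counts = [10, 8, 5, 4, 3]   # hypothetical citation counts
print(h_index(example_counts))      # prints 4
```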
International migration is now a significant driver of population change
across Europe but the methods available to estimate its true impact upon
sub-national areas remain inconsistent, constrained by inadequate systems of
measurement and data capture. In the absence of a population register for
England, official statistics on immigration and emigration are derived from a
combination of survey and census sources.
We present a Dempster--Shafer (DS) approach to estimating limits from Poisson
counting data with nuisance parameters. Dempster--Shafer is a statistical
framework that generalizes Bayesian statistics. DS calculus augments
traditional probability by allowing mass to be distributed over power sets of
the event space. This eliminates the Bayesian dependence on prior distributions
while allowing the incorporation of prior information when it is available. We
use the Poisson Dempster--Shafer model (DSM) to derive a posterior DSM for the
``Banff upper limits challenge'' three-Poisson model.
Implementations of quantum key distribution as available nowadays suffer from
inefficiencies due to post processing of the raw key that severely cuts down
the final secure key rate. We present a simple model for the error scattering
across the raw key and derive "closed form" expressions for the probability of
a parity check failure, or experiencing more than some fixed number of errors.
Our results can help improve key establishment, as information
reconciliation via interactive error correction and privacy amplification rests
on mostly unproven assumptions.
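To give a sense of what such closed-form expressions look like, the sketch below assumes the simplest possible error model: each of the n bits in a block is flipped independently with probability p. This independence assumption is made here for illustration only and may differ from the model in the work summarized above. Under it, a parity check fails when the number of errors is odd, which has probability (1 - (1 - 2p)^n) / 2, and the number of errors is binomial. The function names are hypothetical.

```python
from math import comb

def prob_parity_failure(n, p):
    """P(odd number of errors) among n independent bits with error rate p."""
    return 0.5 * (1.0 - (1.0 - 2.0 * p) ** n)

def prob_more_than_k_errors(n, p, k):
    """P(more than k errors) among n independent bits with error rate p."""
    at_most_k = sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1))
    return 1.0 - at_most_k

n, p = 16, 0.03   # hypothetical block length and bit error rate
print(prob_parity_failure(n, p))
print(prob_more_than_k_errors(n, p, k=2))
```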