Sentiment analysis predicts the presence of positive or negative emotions in
a text document. In this paper we consider higher dimensional extensions of the
sentiment concept, which represent a richer set of human emotions. Our approach
goes beyond previous work in that our model contains a continuous manifold
rather than a finite set of human emotions. We investigate the resulting model,
compare it to psychological observations, and explore its predictive
capabilities.
We show that the frequency of word use is not only determined by the word
length [1] and the average information content [2], but also by its emotional
content. We have analysed three established lexica of affective word usage in
English, German, and Spanish, to verify that these lexica have a neutral,
unbiased, emotional content. Taking into account the frequency of word usage,
we find that words with a positive emotional content are more frequently used.
This lends support to the Pollyanna hypothesis [3] that there should be a positive
bias in human expression.
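The difference between a lexicon that is neutral over word types and usage that is positively biased can be illustrated with a small sketch. The toy lexicon, its valence scale, and the function names below are hypothetical and are not taken from the lexica analysed in the paper.

```python
def mean_valence(lexicon):
    """Unweighted mean valence over the word types in the lexicon."""
    return sum(v for v, _ in lexicon.values()) / len(lexicon)


def weighted_mean_valence(lexicon):
    """Mean valence where each word counts as often as it is used."""
    total = sum(f for _, f in lexicon.values())
    return sum(v * f for v, f in lexicon.values()) / total


# Hypothetical toy lexicon: word -> (valence in [-1, 1], corpus frequency).
toy_lexicon = {
    "good": (0.8, 500),
    "love": (0.9, 300),
    "table": (0.0, 200),
    "bad": (-0.8, 150),
    "hate": (-0.9, 50),
}

unweighted = mean_valence(toy_lexicon)        # close to zero: neutral lexicon
weighted = weighted_mean_valence(toy_lexicon)  # positive: usage favours positive words
```

In this toy data the lexicon is balanced over word types, yet the frequency-weighted mean is positive because the positive words are used more often.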
A tagger is an essential component of most text analysis systems, as it
assigns a syntactic class (e.g., noun, verb, adjective, or adverb) to every
word in a sentence. In this paper, we present a simple part-of-speech tagger
for homoeopathy clinical language. It exploits standard patterns for
evaluating sentences. An untagged clinical corpus of 20085 words is used, from
which we selected 125 sentences (2322 tokens).
This paper deals with the identification of Multiword Expressions (MWEs) in
Manipuri, a highly agglutinative Indian language. Manipuri is listed in the
Eighth Schedule of the Indian Constitution. MWEs play an important role in
applications of Natural Language Processing (NLP) such as Machine Translation, Part
of Speech tagging, Information Retrieval, and Question Answering. Feature
selection is an important factor in the recognition of Manipuri MWEs using
Conditional Random Field (CRF).
Background: Micro-blogging services such as Twitter offer the potential to
crowdsource epidemics in real-time. However, Twitter posts ('tweets') are often
ambiguous and reactive to media trends. In order to ground user messages in
epidemic response we focused on tracking reports of self-protective behaviour
such as avoiding public gatherings or increased sanitation as the basis for
further risk analysis. Results: We created guidelines for tagging
self-protective behaviour based on Jones and Salathé's (2009) behaviour response
survey.
Background: Online news reports are increasingly becoming a source for
event-based early warning systems that detect natural disasters. Harnessing the
massive volume of information available from multilingual newswire presents as
many challenges as opportunities due to the patterns of reporting complex
spatiotemporal events. Results: In this article we study the problem of
utilising correlated event reports across languages.
Background: Accurate and timely detection of public health events of
international concern is necessary to help support risk assessment and response
and save lives. Novel event-based methods that use the World Wide Web as a
signal source offer potential to extend health surveillance into areas where
traditional indicator networks are lacking. In this paper we address the issue
of systematically evaluating online health news to support automatic alerting
using daily disease-country counts text-mined from real-world data using
BioCaster.
Recent studies have shown strong correlation between social networking data
and national influenza rates. We expanded upon this success to develop an
automated text mining system that classifies Twitter messages in real time into
six syndromic categories based on key terms from a public health ontology.
Ten-fold cross-validation tests were used to compare Naive Bayes (NB) and
Support Vector Machine (SVM) models on a corpus of 7431 Twitter messages. SVM
performed better than NB on 4 out of 6 syndromes.
The goal of the present chapter is to explore the possibility of providing
the research (but also the industrial) community that commonly uses spoken
corpora with a stable portfolio of well-documented standardised formats that
allow a high re-use rate of annotated spoken resources and, as a consequence,
better interoperability across tools used to produce or exploit such resources.
Machine transliteration is a method for automatically converting words in one
language into phonetically equivalent ones in another language. Machine
transliteration plays an important role in natural language applications such
as information retrieval and machine translation, especially for handling
proper nouns and technical terms. Four machine transliteration models --
grapheme-based transliteration model, phoneme-based transliteration model,
hybrid transliteration model, and correspondence-based transliteration model --
have been proposed by several researchers.
We show that information about social relationships can be used to improve
user-level sentiment analysis. The main motivation behind our approach is that
users that are somehow "connected" may be more likely to hold similar opinions;
therefore, relationship information can complement what we can extract about a
user's viewpoints from their utterances.
A fundamental requirement of any task-oriented dialogue system is the ability
to generate object descriptions that refer to objects in the task domain. The
subproblem of content selection for object descriptions in task-oriented
dialogue has been the focus of much previous work and a large number of models
have been proposed.
In this paper we concentrate on the resolution of the lexical ambiguity that
arises when a given word has several different meanings. This specific task is
commonly referred to as word sense disambiguation (WSD). The task of WSD
consists of assigning the correct sense to words using an electronic dictionary
as the source of word definitions. We present two WSD methods based on two main
methodological approaches in this research area: a knowledge-based method and a
corpus-based method.
We introduce a stochastic graph-based method for computing relative
importance of textual units for Natural Language Processing. We test the
technique on the problem of Text Summarization (TS). Extractive TS relies on
the concept of sentence salience to identify the most important sentences in a
document or set of documents. Salience is typically defined in terms of the
presence of particular important words or in terms of similarity to a centroid
pseudo-sentence.
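A graph-based notion of salience, as opposed to the centroid-based one, can be sketched as a random walk over a sentence-similarity graph. The bag-of-words cosine similarity, the damping factor, and the function names below are assumptions made for illustration; the abstract does not specify these details.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences by the stationary distribution of a damped random
    walk over the similarity graph, computed by power iteration."""
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    sim = [[cosine(bags[i], bags[j]) for j in range(n)] for i in range(n)]
    # Row-normalise so that each row is a probability distribution.
    rows = []
    for row in sim:
        total = sum(row)
        rows.append([x / total for x in row] if total else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[i] * rows[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


# Toy input: the middle sentence shares a word with each of the others.
scores = rank_sentences(["markets fell", "fell sharply", "sharply rebounded"])
```

In this toy input the middle sentence receives the highest score, since it is the only one similar to both of its neighbours.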
We present a system capable of automatically solving combinatorial logic
puzzles given in (simplified) English. It involves translating the English
descriptions of the puzzles into answer set programming (ASP) and using ASP
solvers to provide solutions of the puzzles. To translate the descriptions, we
use a lambda-calculus based approach using Probabilistic Combinatorial
Categorial Grammars (PCCG) where the meanings of words are associated with
parameters to be able to distinguish between multiple meanings of the same
word. The meanings of many words and the parameters are learned.
For a system to understand natural language, it needs to be able to take
natural language text and answer questions given in natural language with
respect to that text; it also needs to be able to follow instructions given in
natural language. To achieve this, a system must be able to process natural
language and be able to capture the knowledge within that text. Thus it needs
to be able to translate natural language text into a formal language. We
discuss our approach to doing this, where the translation is achieved by composing
the meaning of words in a sentence.
We present a system to translate natural language sentences to formulas in a
formal or a knowledge representation language. Our system uses two inverse
lambda-calculus operators; using them, it can take as input the semantic
representation of some words, phrases and sentences and from that derive the
semantic representation of other words and phrases. Our inverse lambda operator
works on many formal languages including first order logic, database query
languages and answer set programming.
This paper focuses on a system, WOLFIE (WOrd Learning From Interpreted
Examples), that acquires a semantic lexicon from a corpus of sentences paired
with semantic representations. The lexicon learned consists of phrases paired
with meaning representations. WOLFIE is part of an integrated system that
learns to transform sentences into representations such as logical database
queries. Experimental results are presented demonstrating WOLFIE's ability to
learn useful lexicons for a database interface in four different natural
languages.
Modelling compositional meaning for sentences using empirical distributional
methods has been a challenge for computational linguists. We implement the
abstract categorical model of Coecke et al. (arXiv:1003.4394v1 [cs.CL]) using
data from the BNC and evaluate it. The implementation is based on unsupervised
learning of matrices for relational words and applying them to the vectors of
their arguments.
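One way to picture "matrices for relational words applied to the vectors of their arguments" is the sketch below. The abstract does not give the construction, so the sum of outer products used to build the verb matrix, the pointwise product used to apply it, and the toy two-dimensional noun vectors are all assumptions made for illustration.

```python
def outer(u, v):
    """Outer product of two vectors as a list-of-lists matrix."""
    return [[a * b for b in v] for a in u]


def add(m1, m2):
    """Element-wise sum of two matrices of equal shape."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def hadamard(m1, m2):
    """Element-wise product of two matrices of equal shape."""
    return [[a * b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def learn_verb_matrix(pairs):
    """Assumed construction: sum the outer products of the (subject, object)
    vectors with which the verb was observed in a corpus."""
    matrix = outer(*pairs[0])
    for subj, obj in pairs[1:]:
        matrix = add(matrix, outer(subj, obj))
    return matrix


def sentence_meaning(subj, verb_matrix, obj):
    """Assumed application: weight the subject-object interaction by the verb."""
    return hadamard(verb_matrix, outer(subj, obj))


# Hypothetical toy noun vectors over two context dimensions.
dog, cat, mouse = [2, 1], [1, 2], [0, 3]
chase = learn_verb_matrix([(dog, cat), (cat, mouse)])
meaning = sentence_meaning(dog, chase, mouse)
```

No labelled data is needed to build `chase`; it is accumulated only from the argument vectors observed with the verb, which is the sense in which such a construction is unsupervised.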
Conversational participants tend to immediately and unconsciously adapt to
each other's language styles: a speaker will even adjust the number of articles
and other function words in their next utterance in response to the number in
their partner's immediately preceding utterance. This striking level of
coordination is thought to have arisen as a way to achieve social goals, such
as gaining approval or emphasizing difference in status. But has the adaptation
mechanism become so deeply embedded in the language-generation process as to
become a reflex?
The psycholinguistic theory of communication accommodation accounts for the
general observation that participants in conversations tend to converge to one
another's communicative behavior: they coordinate in a variety of dimensions
including choice of words, syntax, utterance length, pitch and gestures. In its
almost forty years of existence, this theory has been empirically supported
exclusively through small-scale or controlled laboratory studies. Here we
address this phenomenon in the context of Twitter conversations.
To facilitate future research in unsupervised induction of syntactic
structure and to standardize best-practices, we propose a tagset that consists
of twelve universal part-of-speech categories. In addition to the tagset, we
develop a mapping from 25 different treebank tagsets to this universal set. As
a result, when combined with the original treebank data, this universal tagset
and mapping produce a dataset consisting of common parts-of-speech for 22
different languages.
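Applying such a mapping amounts to a table lookup per token. The excerpt below is a hypothetical fragment for Penn Treebank style tags; the coarse category names and the fallback label are assumptions for illustration, since the abstract does not list the twelve categories.

```python
# Hypothetical excerpt of a fine-to-coarse tag mapping (not the full table).
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "PRP": "PRON", "DT": "DET",
    "IN": "ADP", "CD": "NUM", "CC": "CONJ", "RP": "PRT",
    ".": ".",
}


def to_universal(tagged, mapping=PTB_TO_UNIVERSAL, unknown="X"):
    """Replace each treebank-specific tag with its coarse category.
    Tags missing from the mapping fall back to the catch-all label."""
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged]


sentence = [("The", "DT"), ("dogs", "NNS"), ("barked", "VBD"),
            ("loudly", "RB"), (".", ".")]
coarse = to_universal(sentence)
```

Running the same function with a different mapping table per treebank is what yields a common set of tags across languages.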
We address the problem of inferring a speaker's level of certainty based on
prosodic information in the speech signal, which has application in
speech-based dialogue systems. We show that using phrase-level prosodic
features centered around the phrases causing uncertainty, in addition to
utterance-level prosodic features, improves our model's level of certainty
classification. In addition, our models can be used to predict which phrase a
person is uncertain about.
Grishin proposed enriching the Lambek calculus with multiplicative
disjunction (par) and coresiduals. Applications to linguistics were discussed
by Moortgat, who spoke of the Lambek-Grishin calculus (LG). In this paper, we
adapt Girard's polarity-sensitive double negation embedding for classical logic
to extract a compositional Montagovian semantics from a display calculus for
focused proof search in LG.
"What other people think" has always been an important piece of information
during various decision-making processes. Today people frequently make their
opinions available via the Internet, and as a result, the Web has become an
excellent source for gathering consumer opinions. There are now numerous Web
resources containing such opinions, e.g., product review forums, discussion
groups, and blogs.
We have developed a full discourse parser in the Penn Discourse Treebank
(PDTB) style. Our trained parser first identifies all discourse and
non-discourse relations, locates and labels their arguments, and then
classifies their relation types. When appropriate, the attribution spans to
these relations are also determined. We present a comprehensive evaluation from
both component-wise and error-cascading perspectives.
It is usual to consider that standards generate mixed feelings among
scientists. They are often seen as not really reflecting the state of the art
in a given domain and a hindrance to scientific creativity. Still, scientists
should theoretically be best placed to bring their expertise into
standard developments, being even more neutral on issues that may typically be
related to competing industrial interests.
In this paper the problems of deriving a taxonomy from a text and
concept-oriented text segmentation are approached. The Formal Concept Analysis
(FCA) method is applied to solve both of these linguistic problems. The
proposed segmentation method offers a conceptual view for text segmentation,
using a context-driven clustering of sentences. The Concept-oriented Clustering
Segmentation algorithm (COCS) is based on k-means linear clustering of the
sentences. Experimental results obtained using the COCS algorithm are presented.
This paper describes a probabilistic top-down parser for minimalist grammars.
Top-down parsers have the great advantage of having a certain predictive power
during the parsing, which takes place in a left-to-right reading of the
sentence.
Patterns of word use both reflect and influence a myriad of human activities
and interactions. Like other entities that reproduce and evolve, words rise or
decline depending upon a complex interplay between fitness and environment.
Using Internet discussion communities as model systems, we show that the word
niche, defined as the extent of the word's association with specific people and
topics, is a strong determinant of changes in word frequency. Previous, a
posteriori, studies have indicated that word frequency is a correlate of word
success at historical time scales.
Categorial type logics, pioneered by Lambek, seek a proof-theoretic
understanding of natural language syntax by identifying categories with
formulas and derivations with proofs. We typically observe an intuitionistic
bias: a structural configuration of hypotheses (a constituent) derives a single
conclusion (the category assigned to it). Acting upon suggestions of Grishin to
dualize the logical vocabulary, Moortgat proposed the Lambek-Grishin calculus
(LG) with the aim of restoring symmetry between hypotheses and conclusions.
This paper presents our investigations on emotional state categorization from
speech signals with a psychologically inspired computational model against
human performance under the same experimental setup. Based on psychological
studies, we propose a multistage categorization strategy which allows
establishing an automatic categorization model flexibly for a given emotional
speech categorization task. We apply the strategy to the Serbian Emotional
Speech Corpus (GEES) and the Danish Emotional Speech Corpus (DES), where human
performance was reported in previous psychological studies.
Researchers in textual entailment have begun to consider inferences involving
'downward-entailing operators', an interesting and important class of lexical
items that change the way inferences are made. Recent work proposed a method
for learning English downward-entailing operators that requires access to a
high-quality collection of 'negative polarity items' (NPIs). However, English
is one of the very few languages for which such a list exists. We propose the
first approach that can be applied to the many languages for which there is no
pre-existing high-precision database of NPIs.
We report on work in progress on extracting lexical simplifications (e.g.,
"collaborate" -> "work together"), focusing on utilizing edit histories in
Simple English Wikipedia for this task. We consider two main approaches: (1)
deriving simplification probabilities via an edit model that accounts for a
mixture of different operations, and (2) using metadata to focus on edits that
are more likely to be simplification operations.
Space is a circuit oriented, spatial programming language designed to exploit
the massive parallelism available in a novel formal model of computation called
the Synchronic A-Ram, and physically related FPGA and reconfigurable
architectures. Space expresses variable grained MIMD parallelism, is modular,
strictly typed, and deterministic. Barring operations associated with memory
allocation and compilation, modules cannot access global variables, and are
referentially transparent.
We investigate the inflection structure of a synthetic language using Latin as an
example. We construct a bipartite graph in which one group of vertices
corresponds to dictionary headwords and the other group to inflected forms
encountered in a given text. Each inflected form is connected to its
corresponding headword, which in some cases is non-unique. The resulting sparse
graph decomposes into a large number of connected components, to be called word
groups. We then show how the concept of the word group can be used to construct
coverage curves of selected Latin texts.
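The word groups described above are the connected components of the headword/inflected-form graph, which can be found with a union-find structure. This is a minimal sketch; the function name and the tiny Latin sample are illustrative assumptions, not data from the paper.

```python
from collections import defaultdict


def word_groups(form_to_headwords):
    """Return the connected components of the bipartite graph linking
    inflected forms to their candidate headwords."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Nodes are tagged so that a form and a headword spelled alike stay distinct.
    for form, headwords in form_to_headwords.items():
        for head in headwords:
            union(("form", form), ("head", head))

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


# Illustrative sample: "amor" is ambiguous between the noun "amor" and the verb "amo".
sample = {
    "amor": ["amor", "amo"],
    "amat": ["amo"],
    "amoris": ["amor"],
    "rosa": ["rosa"],
}
groups = word_groups(sample)
```

Here the ambiguous form "amor" merges the headwords "amor" and "amo" into a single word group, while "rosa" forms a group of its own.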
We analyze the rank-frequency distributions of words in selected English and
Polish texts. We show that for the lemmatized (basic) word forms the
scale-invariant regime breaks after about two decades, while it might be
consistent for the whole range of ranks for the inflected word forms. We also
find that for a corpus consisting of texts written by different authors the
basic scale-invariant regime is broken more strongly than in the case of a
comparable corpus consisting of texts written by the same author.
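A rank-frequency distribution and its log-log slope can be computed as in the sketch below; a scale-invariant regime shows up as a roughly constant slope over a range of ranks. The function names and the synthetic token list are assumptions for illustration. The same functions could be applied to either an inflected or a lemmatized token stream.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent word."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(counts, start=1))


def loglog_slope(points):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in points]
    ys = [math.log(freq) for _, freq in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic tokens whose frequencies follow f = 120 / rank exactly.
tokens = ["w1"] * 120 + ["w2"] * 60 + ["w3"] * 40 + ["w4"] * 30
points = rank_frequency(tokens)
slope = loglog_slope(points)
```

Fitting the slope separately over different rank ranges is one simple way to see where such a regime breaks down.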
The Right Frontier Constraint (RFC), as a constraint on the attachment of new
constituents to an existing discourse structure, has important implications for
the interpretation of anaphoric elements in discourse and for Machine Learning
(ML) approaches to learning discourse structures. In this paper we provide
strong empirical support for SDRT's version of RFC. The analysis of about 100
doubly annotated documents by five different naive annotators shows that SDRT's
RFC is respected about 95% of the time.
This paper complements the main DEFT'10 article describing the MARF approach
to the DEFT'10 NLP competition. This paper aims to present the complete
result sets of all the conducted experiments and their settings in the
resulting tables highlighting the approach and the best results, but also
showing the worse and the worst and their analysis. This is the first iteration
of the initial release of the results.
There is much debate over the degree to which language learning is governed
by innate language-specific biases, or acquired through cognition-general
principles. Here we examine the probabilistic language acquisition hypothesis
on three levels: We outline a novel theoretical result showing that it is
possible to learn the exact generative model underlying a wide class of
languages, purely from observing samples of the language.
The Lady Maisry ballads afford us a framework within which to segment a
storyline into its major components. Segments and as a consequence nodal points
are discussed for nine different variants of the Lady Maisry story of a (young)
woman being burnt to death by her family, on account of her becoming pregnant
by a foreign personage. We motivate the importance of nodal points in textual
and literary analysis. We show too how the openings of the nine variants can be
analyzed comparatively, and also the conclusions of the ballads.
In the article, the methodology and the principles of the compilation of the
Frequency dictionary for Ivan Franko's novel Dlja domashnjoho ohnyshcha (For
the Hearth) are described. The following statistical parameters of the novel
vocabulary are obtained: variety, exclusiveness, concentration indexes,
correlation between word rank and text coverage, etc. The main quantitative
characteristics of Franko's novels Perekhresni stezhky (The Cross-Paths) and
Dlja domashnjoho ohnyshcha are compared on the basis of their frequency
dictionaries.
Lexicon-Grammar tables constitute a large-coverage syntactic lexicon but they
cannot be directly used in Natural Language Processing (NLP) applications
because they sometimes rely on implicit information. In this paper, we
introduce LGExtract, a generic tool for generating a syntactic lexicon for NLP
from the Lexicon-Grammar tables. It is based on a global table that contains
undefined information and on a unique extraction script including all
operations to be performed for all tables.
In the article, the project of quantitative parametrization of all texts by
Ivan Franko is presented. It can be carried out only by using modern computer
techniques after the frequency dictionaries for all of Franko's works are
compiled. The paper describes the application spheres, methodology, stages,
principles and peculiarities in the compilation of the frequency dictionary of
the second half of the 19th century - the beginning of the 20th century. The
relation between the Ivan Franko frequency dictionary, the explanatory dictionary
of the writer's language, and the text corpus is discussed.
Archaeological excavations in the sites of the Indus Valley civilization
(2500-1900 BCE) in Pakistan and northwestern India have unearthed a large
number of artifacts with inscriptions made up of hundreds of distinct signs. To
date there is no generally accepted decipherment of these sign sequences and
there have been suggestions that the signs could be non-linguistic. Here we
apply complex network analysis techniques to a database of available Indus
inscriptions, with the aim of detecting patterns indicative of syntactic
organization.
A statistical physics study of punctuation effects on sentence lengths is
presented for written texts: Alice in Wonderland and Through a
Looking Glass. The translation of the first text into Esperanto is also
considered as a test for the role of punctuation in defining a style, and for
contrasting natural and artificial, but written, languages. Several log-log
plots of the sentence length-rank relationship are presented for the major
punctuation marks. Different power laws are observed with characteristic
exponents.
Automatically detecting discourse segments is an important preliminary step
towards full discourse parsing. Previous research on discourse segmentation
has relied on the assumption that elementary discourse units (EDUs) in a
document always form a linear sequence (i.e., they can never be nested).
Unfortunately, this assumption turns out to be too strong, since some theories of
discourse, such as SDRT, allow for nested discourse units. In this paper, we
present a simple approach to discourse segmentation that is able to produce
nested EDUs.
In this chapter, we assume that systematically studying the semantics of
spatial markers in language provides a means to reveal fundamental properties and
concepts characterizing conceptual representations of space. We propose a
formal system accounting for the properties highlighted by the linguistic
analysis, and we use these tools for representing the semantic content of
several spatial relations of French. The first part presents a semantic
analysis of the expression of space in French aiming at describing the
constraints that formal representations have to take into account.
While previous linguistic and psycholinguistic research on space has mainly
analyzed spatial relations, the studies reported in this paper focus on how
language distinguishes among spatial entities. Descriptive and experimental
studies first propose a classification of entities, which accounts for both
static and dynamic space, has some cross-linguistic validity, and underlies
adults' cognitive processing.
We propose a mathematical framework for a unification of the distributional
theory of meaning in terms of vector space models, and a compositional theory
for grammatical types, for which we rely on the algebra of Pregroups,
introduced by Lambek. This mathematical framework enables us to compute the
meaning of a well-typed sentence from the meanings of its constituents.
Concretely, the type reductions of Pregroups are 'lifted' to morphisms in a
category, a procedure that transforms meanings of constituents into a meaning
of the (well-typed) whole.
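The grammatical side of this construction, checking that a string of pregroup types reduces to the sentence type, can be sketched as follows. The integer encoding of adjoints, the greedy stack strategy, and the toy lexicon are assumptions for illustration; a greedy pass is not a complete decision procedure for pregroup grammars in general, and the sketch does not show the lifting of reductions to vector space morphisms.

```python
def reduce_types(types):
    """Greedily apply pregroup contractions to a sequence of simple types.

    A simple type is (base, z): z = 0 is the plain type, z = -1 its left
    adjoint, and z = +1 its right adjoint. Adjacent types (a, z), (a, z + 1)
    contract to the unit, covering both x^l x -> 1 and x x^r -> 1.
    """
    stack = []
    for base, z in types:
        if stack and stack[-1] == (base, z - 1):
            stack.pop()  # contraction
        else:
            stack.append((base, z))
    return stack


# Hypothetical lexicon: nouns have type n, transitive verbs n^r s n^l.
NOUN = [("n", 0)]
TRANSITIVE_VERB = [("n", 1), ("s", 0), ("n", -1)]

full = reduce_types(NOUN + TRANSITIVE_VERB + NOUN)   # e.g. "John likes Mary"
partial = reduce_types(NOUN + TRANSITIVE_VERB)       # e.g. "John likes"
```

A word sequence is well typed as a sentence when the reduction leaves only the sentence type `s`, as happens for `full` but not for `partial`.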
The recognition and classification of Named Entities (NER) are regarded as an
important component for many Natural Language Processing (NLP) applications.
The classification is usually made by taking into account the immediate context
in which the NE appears. In some cases, this immediate context does not allow
getting the right classification. We show in this paper that the use of an
extended syntactic context and large-scale resources could be very useful in
the NER task.
Text documents are complex high dimensional objects. To effectively visualize
such data it is important to reduce its dimensionality and visualize the low
dimensional embedding as a 2-D or 3-D scatter plot. In this paper we explore
dimensionality reduction methods that draw upon domain knowledge in order to
achieve a better low dimensional embedding and visualization of documents. We
consider the use of geometries specified manually by an expert, geometries
derived automatically from corpus statistics, and geometries computed from
linguistic resources.
The article provides a lexical statistical analysis of two novels by K. Vonnegut
and their Russian translations. The analysis finds that the rate at which the
ratio of word types to word tokens changes differs between the source and
target texts. The author hypothesizes that these differences are typical of
English-Russian translations and, moreover, that they represent an example of
Baker's translation feature of levelling out.
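The quantity being compared, how the ratio of word types to word tokens evolves as a text is read, can be computed with a short sketch. The function name, the sampling step, and the toy token list are assumptions for illustration and are not the article's procedure.

```python
def type_token_curve(tokens, step=1):
    """Return (tokens read, type-token ratio) pairs sampled every `step` tokens."""
    seen = set()
    curve = []
    for count, token in enumerate(tokens, start=1):
        seen.add(token)
        if count % step == 0:
            curve.append((count, len(seen) / count))
    return curve


# Toy text: new word types appear ever more rarely, so the ratio falls.
curve = type_token_curve("a b a c a b".split(), step=2)
```

Computing such a curve for a source text and for its translation, and comparing how quickly each one falls, is one way to make the reported difference concrete.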
Hidden Markov models (HMMs) have been successfully applied to automatic
speech recognition for more than 35 years in spite of the fact that a key HMM
assumption -- the statistical independence of frames -- is obviously violated
by speech data. In fact, this data/model mismatch has inspired many attempts to
modify or replace HMMs with alternative models that are better able to take
into account the statistical dependence of frames.
The syntactic topic model (STM) is a Bayesian nonparametric model of language
that discovers latent distributions of words (topics) that are both
semantically and syntactically coherent. The STM models dependency parsed
corpora where sentences are grouped into documents. It assumes that each word
is drawn from a latent topic chosen by combining document-level features and
the local syntactic context. Each document has a distribution over latent
topics, as in topic models, which provides the semantic consistency.
This article presents SLAM, an Automatic Solver for Lexical Metaphors like
"déshabiller* une pomme" (to undress* an apple). SLAM calculates a
conventional solution for these productions. To do so, SLAM has to
intersect the paradigmatic axis of the metaphorical verb "déshabiller*",
where "peler" ("to peel") comes closer, with a syntagmatic axis that comes from
a corpus where "peler une pomme" (to peel an apple) is semantically and
syntactically regular.
Combined with space-time coding, the orthogonal frequency division
multiplexing (OFDM) system exploits space diversity. It is a potential scheme
to offer spectral efficiency and robust high data rate transmissions over
frequency-selective fading channels. However, space-time coding impairs the
system's ability to suppress interference, as the signals transmitted from two
transmit antennas are superposed and interfere at the receiver antennas.
A rhetorical structure tree (RS tree) is a representation of discourse
relations among elementary discourse units (EDUs). An RS tree is very useful to
many text processing tasks employing relationships among EDUs such as text
understanding, summarization, and question answering. The Thai language, with its
unique linguistic characteristics, requires a unique RS tree construction
technique. This paper proposes an approach for Thai RS tree construction which
consists of three major steps: EDU segmentation, Thai RS tree construction, and
discourse relation (DR) identification.
This document discusses an approach, and its rudimentary realization, towards
the automatic classification of PPs, a topic that has not received as much
attention in NLP as NPs and VPs. The approach is a rule-based heuristic
outlined in several levels of our research. There are 7 semantic categories of
PPs considered in this document that we are able to classify from an annotated
corpus.
The recent increase in the dimensionality of data poses a great challenge to
the existing dimensionality reduction methods in terms of their effectiveness.
Dimensionality reduction has emerged as one of the significant preprocessing
steps in machine learning applications and has been effective in removing
inappropriate data, increasing learning accuracy, and improving
comprehensibility. Feature redundancy exercises great influence on the
performance of the classification process.
Using Pustejovsky's "The Syntax of Event Structure" and Fong's "On Mending a
Torn Dress" we give a glimpse of a Pustejovsky-like analysis of some example
sentences in Fong. We attempt to give a framework for the semantics of the noun
phrases and adverbs as appropriate as well as the lexical entries for all words
in the examples and critique both papers in light of our findings and
difficulties.
Maximum mutual information (MMI) is a model selection criterion used for
hidden Markov model (HMM) parameter estimation that was developed more than
twenty years ago as a discriminative alternative to the maximum likelihood
criterion for HMM-based speech recognition.
In this article, we record the main linguistic differences or singularities
of 17th century English, analyse them morphologically and syntactically and
propose equivalent forms in contemporary English. We show how 17th century
texts may be transcribed into modern English, combining the use of electronic
dictionaries with rules of transcription implemented as transducers.
While the apostrophe in contemporary English often marks the Saxon
genitive, it may also indicate the omission of one or more letters. Some
writers (wrongly?) use it to mark the plural in symbols or abbreviations,
visualised thanks to the isolation of the morpheme "s". This punctuation mark
was imported from the Continent in the 16th century. During the 19th century
its use was standardised. However, the rules of its usage still seem problematic
to many, including literate speakers of English.
The recognition of Arabic Named Entities (NEs) is a problem in different
domains of Natural Language Processing (NLP) such as automatic translation.
Indeed, NE translation allows access to multilingual information. This
translation does not always lead to the expected result, especially when the NE
contains a person name. For this reason, and in order to improve translation, we
can transliterate some parts of the NE. In this context, we propose a method that
integrates translation and transliteration using the linguistic NooJ
platform, which is based on local grammars and transducers.
We are developing electronic dictionaries and transducers for the automatic
processing of the Albanian Language. We will analyze the words inside a linear
segment of text. We will also study the relationship between units of sense and
units of form. The composition of words takes different forms in Albanian. We
have found that morphemes are frequently concatenated or simply juxtaposed or
contracted. The inflectional grammars of NooJ allow us to construct dictionaries
of inflected forms (declensions or conjugations).
The complexity of sentences characteristic of biomedical articles poses a
challenge to natural language parsers, which are typically trained on
large-scale corpora of non-technical text. We propose a text simplification
process, bioSimplify, that seeks to reduce the complexity of sentences in
biomedical abstracts in order to improve the performance of syntactic parsers
on the processed sentences. Syntactic parsing is typically one of the first
steps in a text mining pipeline. Thus, any improvement in performance would
have a ripple effect over all processing steps.
Accurate systems for extracting Protein-Protein Interactions (PPIs)
automatically from biomedical articles can help accelerate biomedical research.
Biomedical Informatics researchers are collaborating to provide metaservices
and advance the state of the art in PPI extraction. One problem often neglected by
current Natural Language Processing systems is the characteristic complexity of
the sentences in biomedical literature.
This paper presents a brief survey on Automatic Speech Recognition and
discusses the major themes and advances made in the past 60 years of research,
so as to provide a technological perspective and an appreciation of the
fundamental progress that has been accomplished in this important area of
speech communication.
In recent decades, speech interactive systems have gained increasing importance.
To develop a dictation system like Dragon for Indian languages, it is most
important to adapt the system to a speaker with minimum training. In this paper
we focus on the importance of creating a speech database at the syllable level
and identifying the minimum text to be considered while training any speech
recognition system. There are systems developed for continuous speech
recognition in English and in a few Indian languages like Hindi and Tamil.
The recent resurgence of interest in spatio-temporal neural networks as a speech
recognition tool motivates the present investigation. In this paper an approach
was developed based on the temporal radial basis function (TRBF), which offers
several advantages: few parameters, fast convergence, and time invariance. This
application aims to identify vowels taken from natural speech samples from the
TIMIT corpus of American speech. We report a recognition accuracy of 98.06
percent in training and 90.13 percent in testing on a subset of 6 vowel
phonemes, with the possibility of expanding the vowel set in the future.
Paraphrasing methods recognize, generate, or extract phrases, sentences, or
longer natural language expressions that convey almost the same information.
Textual entailment methods, on the other hand, recognize, generate, or extract
pairs of natural language expressions, such that a human who reads (and trusts)
the first element of a pair would most likely infer that the other element is
also true. Paraphrasing can be seen as bidirectional textual entailment and
methods from the two areas are often very similar.
In this chapter we present the main issues in representing machine readable
dictionaries in XML, and in particular according to the Text Encoding
Initiative (TEI) guidelines.
Phylogenetic trees can be reconstructed from the matrix which contains the
distances between all pairs of languages in a family. Recently, we proposed a
new method which uses normalized Levenshtein distances among words with the same
meaning and averages over all the items of a given list. Decisions about the
number of items in the input lists for language comparison have been debated
since the beginning of glottochronology. The point is that words associated with
some of the meanings have a rapid lexical evolution.
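As an illustration of the distance described above, the following is a minimal
sketch, not the authors' implementation. It assumes that the Levenshtein
distance is normalized by the length of the longer word and that the two word
lists are aligned by meaning; the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Assumed normalization: divide by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def language_distance(words_a: list[str], words_b: list[str]) -> float:
    """Average normalized distance over two lists aligned by meaning."""
    if len(words_a) != len(words_b) or not words_a:
        raise ValueError("word lists must be non-empty and of equal length")
    pairs = zip(words_a, words_b)
    return sum(normalized_distance(x, y) for x, y in pairs) / len(words_a)
```

Computing `language_distance` for every pair of languages in a family would
fill the distance matrix from which a tree can be reconstructed.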
This paper is about automatic acquisition of lexical information from
corpora, especially subcategorization acquisition.
A dictionary defines words in terms of other words. Definitions can tell you
the meanings of words you don't know, but only if you know the meanings of the
defining words. How many words do you need to know (and which ones) in order to
be able to learn all the rest from definitions? We reduced dictionaries to
their "grounding kernels" (GKs), about 10% of the dictionary, from which all
the other words could be defined. The GK words turned out to have
psycholinguistic correlates: they were learned at an earlier age and were more
concrete than the rest of the dictionary.
This paper discusses two new procedures for extracting verb valences from raw
texts, with an application to the Polish language. The first novel technique,
the EM selection algorithm, performs unsupervised disambiguation of valence
frame forests, obtained by applying a non-probabilistic deep grammar parser and
some post-processing to the text. The second new idea concerns filtering of
incorrect frames detected in the parsed text and is motivated by an observation
that verbs which take similar arguments tend to have similar frames.
A survey of dictionary models and formats is presented, together with an
overview of corresponding recent standardisation activities.
Co-words have been considered as carriers of meaning across different domains
in studies of science, technology, and society. Words and co-words, however,
obtain meaning in sentences, and sentences obtain meaning in their contexts of
use. At the science/society interface, words can be expected to have different
meanings: the codes of communication that provide meaning to words differ on
the varying sides of the interface. Furthermore, meanings and interfaces may
change over time.
The idea of measuring distance between languages seems to have its roots in
the work of the French explorer Dumont D'Urville (D'Urville 1832). He collected
comparative word lists of various languages during his voyages aboard the
Astrolabe from 1826 to 1829 and, in his work about the geographical division of
the Pacific, he proposed a method to measure the degree of relation among
languages.
In order to verify hypotheses concerning the relationship between two languages,
it is necessary to define and evaluate their distance from lexical differences.
This concept seems to have its roots in the work of the French explorer Dumont
D'Urville. He collected comparative word lists of various languages during his
voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the
geographical division of the Pacific, he proposed a method to measure the
degree of relation among languages.
A simple method for finding the entropy and redundancy of a reasonably long
sample of English text is presented. In fact, this method can be extended to
other Latin languages. Some implications for practical applications such as
plagiarism-detection software, and the minimum number of words that should be
used in social Internet network messaging, are discussed. Results on the
entropy of the English language have been obtained by direct computer
processing and from first principles according to Shannon theory.
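The quantities named above can be sketched as follows. This is an illustrative
first-order (single-character) estimate under Shannon's definitions, not the
method of the paper. The function names and the choice of alphabet size are
assumptions.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """First-order entropy in bits per character: H = -sum(p * log2(p))."""
    counts = Counter(text)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text must not be empty")
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def redundancy(text: str, alphabet_size: int) -> float:
    """Redundancy R = 1 - H / H_max, with H_max = log2(alphabet_size)."""
    if alphabet_size < 2:
        raise ValueError("alphabet_size must be at least 2")
    return 1.0 - char_entropy(text) / math.log2(alphabet_size)
```

For English one might pass `alphabet_size=27` (26 letters plus the space) after
lower-casing the sample and stripping other characters. A single-character
estimate ignores dependencies between letters, so it gives only an upper bound
on the entropy per character.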
We propose and compare various sentence selection strategies for active
learning for the task of detecting mentions of entities. The best strategy
employs the sum of confidences of two statistical classifiers trained on
different views of the data. Our experimental results show that, compared to
the random selection strategy, this strategy reduces the amount of required
labeled training data by over 50% while achieving the same performance.
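A selection step of this kind might look like the sketch below. The abstract
does not say how the summed confidence is used, so the sketch assumes that the
sentences with the lowest combined confidence are sent for labeling. The
function name and inputs are hypothetical.

```python
def select_for_labeling(conf_a: list[float], conf_b: list[float], k: int) -> list[int]:
    """Return indices of the k sentences with the lowest summed confidence.

    conf_a and conf_b hold one confidence score per unlabeled sentence, as
    produced by two classifiers trained on different views of the data.
    Assumption: low combined confidence marks the most informative sentences.
    """
    if len(conf_a) != len(conf_b):
        raise ValueError("both classifiers must score the same sentences")
    ranked = sorted(range(len(conf_a)), key=lambda i: conf_a[i] + conf_b[i])
    return ranked[:k]
```

A random-selection baseline would replace the sort with a random sample of
indices.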
The goal of this paper is two-fold: to present an abstract data model for
linguistic annotations and its implementation using XML, RDF and related
standards; and to outline the work of a newly formed committee of the
International Standards Organization (ISO), ISO/TC 37/SC 4 Language Resource
Management, which will use this work as its starting point.
This paper presents a mechanism for resolving unidentified lexical units in
text-based machine translation (TBMT). A machine translation system is
unlikely to have a complete MT lexicon, and hence there is a need for a
mechanism to handle the problem of unidentified words. These unknown words
could be abbreviations, names, acronyms and newly introduced terms. We propose
an algorithm for the resolution of unidentified words. This algorithm takes the
discourse unit (primitive discourse) as its unit of analysis and provides
real-time updates to the lexicon.
This paper presents a theoretical, research-based approach to ellipsis
resolution in machine translation. The formula of discourse is
applied in order to resolve ellipses. The validity of the discourse formula is
analyzed by applying it to real-world text, i.e. newspaper fragments. The
source text is converted into mono-sentential discourses, where complex
discourses require further dissection either directly into primitive discourses
or first into compound discourses and later into primitive ones. The procedure
of dissection needs further improvement i.e.
Automated language processing is central to the drive to enable facilitated
referencing of increasingly available Sanskrit e-texts. The first step towards
processing Sanskrit text involves the handling of Sanskrit compound words that
are an integral part of Sanskrit texts. This firstly necessitates the
processing of euphonic conjunctions or sandhis, which are points in words or
between words, at which adjacent letters coalesce and transform. The ancient
Sanskrit grammarian Panini's codification of the Sanskrit grammar is the
accepted authority in the subject.
Artificial Neural Networks (ANNs) have been widely used for recognition of
optically scanned characters, partially emulating human thinking in the
domain of Artificial Intelligence. But prior to recognition, it is
necessary to segment the text into sentences, words, and characters.
Segmentation of words into individual letters has been one of the major
problems in handwriting recognition. Despite several successful works all over
the world, development of such tools in specific languages is still an ongoing
process, especially in the Indian context.
In this paper we describe a WSD experiment based on bilingual English-Spanish
comparable corpora in which individual noun phrases have been identified and
aligned with their respective counterparts in the other language. The
evaluation of the experiment has been carried out against SemCor.
We show that, with the alignment algorithm employed, potential precision is
high (74.3%); however, the coverage of the method is low (2.7%), due to
alignments being far less frequent than we expected.
This paper describes a hybrid system for WSD, presented to the English
all-words and lexical-sample tasks, that relies on two different unsupervised
approaches. The first one selects the senses according to mutual information
proximity between a context word and a variant of the sense. The second
heuristic analyzes the examples of use in the glosses of the senses so that
simple syntactic patterns are inferred. These patterns are matched against the
disambiguation contexts.
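The first approach can be illustrated with pointwise mutual information (PMI)
computed from corpus counts. This is a hedged sketch only: the abstract does
not give the exact measure, and the scoring rule (best PMI over a sense's
variants), the function names and the count tables are all assumptions.

```python
import math


def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Pointwise mutual information: log2(P(x, y) / (P(x) * P(y)))."""
    if min(count_x, count_y, total) <= 0:
        raise ValueError("marginal counts and total must be positive")
    if count_xy == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2(count_xy * total / (count_x * count_y))


def best_sense(context_word, sense_variants, pair_counts, word_counts, total):
    """Pick the sense whose variants are closest (by PMI) to the context word.

    sense_variants maps a sense label to a non-empty list of variant words;
    pair_counts maps (context_word, variant) to a co-occurrence count.
    """
    def score(variants):
        return max(
            pmi(pair_counts.get((context_word, v), 0),
                word_counts[context_word], word_counts[v], total)
            for v in variants
        )
    return max(sense_variants, key=lambda sense: score(sense_variants[sense]))
```

With counts taken from a large corpus, `best_sense` returns the label of the
sense with the highest-scoring variant for the given context word.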
We have participated in the SENSEVAL-2 English tasks (all words and lexical
sample) with an unsupervised system based on mutual information measured over a
large corpus (277 million words) and some additional heuristics. A supervised
extension of the system was also presented to the lexical sample task.
Both syntax-phonology and syntax-semantics interfaces in Higher Order Grammar
(HOG) are expressed as axiomatic theories in higher-order logic (HOL), i.e. a
language is defined entirely in terms of provability in a single logical
system. An important implication of this elegant architecture is that the
meaning of a valid expression turns out to be represented not by a single, nor
even by a few "discrete" terms (in case of ambiguity), but by a "continuous"
set of logically equivalent terms. This note is devoted to the precise
formulation and proof of this observation.
Multimodal interfaces, combining the use of speech, graphics, gestures, and
facial expressions in input and output, promise to provide new possibilities to
deal with information in more effective and efficient ways, supporting for
instance: - the understanding of possibly imprecise, partial or ambiguous
multimodal input; - the generation of coordinated, cohesive, and coherent
multimodal presentations; - the management of multimodal interaction (e.g.,
task completion, adapting the interface, error prevention) by representing and
exploiting models of the user, the domain, the task, the intera
There are many scientific problems generated by the multiple and conflicting
alternative definitions of linguistic recursion and human recursive processing
that exist in the literature. The purpose of this article is to make available
to the linguistic community the standard mathematical definition of recursion
and to apply it to discuss linguistic recursion. As a byproduct, we obtain an
insight into certain "soft universals" of human languages, which are related to
cognitive constructs necessary to implement mathematical reasoning, i.e.
mathematical model theory.
We present a method for grouping the synonyms of a lemma according to its
dictionary senses. The senses are defined by a large machine readable
dictionary for French, the TLFi (Trésor de la langue française
informatisé) and the synonyms are given by 5 synonym dictionaries (also for
French). To evaluate the proposed method, we manually constructed a gold
standard where for each (word, definition) pair and given the set of synonyms
defined for that word by the 5 synonym dictionaries, 4 lexicographers specified
the set of synonyms they judge adequate.
This article proposes a method to extract dependency structures from
phrase-structure level parsing with Interaction Grammars. Interaction Grammars
are a formalism which expresses interactions among words using a polarity
system. Syntactical composition is led by the saturation of polarities.
Interactions take place between constituents, but as grammars are lexicalized,
these interactions can be translated at the level of words. Dependency
relations are extracted from the parsing process: every dependency is the
consequence of a polarity saturation.
We describe an encoding scheme for discourse structure and reference, based
on the TEI Guidelines and the recommendations of the Corpus Encoding
Specification (CES). A central feature of the scheme is a CES-based data
architecture enabling the encoding of and access to multiple views of a
marked-up document. We describe a tool architecture that supports the encoding
scheme, and then show how we have used the encoding scheme and the tools to
perform a discourse analytic task in support of a model of global discourse
cohesion called Veins Theory (Cristea & Ide, 1998).
This paper presents an abstract data model for linguistic annotations and its
implementation using XML, RDF and related standards, and outlines the work of
a newly formed committee of the International Standards Organization (ISO),
ISO/TC 37/SC 4 Language Resource Management, which will use this work as its
starting point. The primary motive for presenting the latter is to solicit the
participation of members of the research community to contribute to the work of
the committee.
It is widely recognized that the proliferation of annotation schemes runs
counter to the need to re-use language resources, and that standards for
linguistic annotation are becoming increasingly mandatory. To answer this need,
we have developed a framework comprised of an abstract model for a variety of
different annotation types (e.g., morpho-syntactic tagging, syntactic
annotation, co-reference annotation, etc.), which can be instantiated in
different ways depending on the annotator's approach and goals.
Sandhi means joining two or more words to coin a new word. Sandhi literally
means `putting together' or combining (of sounds). It denotes all combinatory
sound-changes effected (spontaneously) for ease of pronunciation.
Sandhi-vicheda describes [5] the process by which one letter (whether single or
conjoined) is broken to form two words. Part of the broken letter remains as the
last letter of the first word and part of the letter forms the first letter of
the next word.
Following the principles of Cognitive Grammar, we concentrate on a model for
reference resolution that attempts to overcome the difficulties of previous
approaches, based on the fundamental assumption that all reference (independent
of the type of the referring expression) is accomplished via access to and
restructuring of domains of reference rather than by direct linkage to the
entities themselves.
The paper reviews the hurdles encountered while trying to implement the OLAC
extension for Dravidian / Indian languages. The paper further explores the possibilities
which could minimise or solve these problems. In this context, the Chinese
system of text processing and the anusaaraka system are scrutinised.
OLAC was founded in 2000 for creating online databases of language resources.
This paper intends to review the bottom-up distributed character of the project
and proposes an extension of the architecture for Dravidian languages. An
ontological structure is considered for effective natural language processing
(NLP) and its advantages over statistical methods are reviewed.
This paper presents the system called PATATRAS (PATent and Article Tracking,
Retrieval and AnalysiS) realized for the IP track of CLEF 2009. Our approach
presents three main characteristics: 1. The usage of multiple retrieval models
(KL, Okapi) and term index definitions (lemma, phrase, concept) for the three
languages considered in the present track (English, French, German) producing
ten different sets of ranked results. 2. The merging of the different results
based on multiple regression models using an additional validation set created
from the patent collection. 3.