Computation and Language (Computational Linguistics and Natural Language and Speech Processing)

  1. A Model-Driven Probabilistic Parser Generator.

    Authors: Luis Quesada, Fernando Berzal, Francisco J. Cortijo
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Existing probabilistic scanners and parsers impose hard constraints on the
    way lexical and syntactic ambiguities can be resolved. Furthermore, traditional
    grammar-based parsing tools are limited in the mechanisms they allow for taking
    context into account. In this paper, we propose a model-driven tool that allows
    for statistical language models with arbitrary probability estimators.

  2. Parsing of Myanmar sentences with function tagging.

    Authors: Win Win Thant, Tin Myat Htwe, Ni Lar Thein
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    This paper describes the use of Naive Bayes to address the task of assigning
    function tags and context free grammar (CFG) to parse Myanmar sentences. Part
    of the challenge of statistical function tagging for Myanmar sentences comes
    from the fact that Myanmar has free-phrase-order and a complex morphological
    system. Function tagging is a pre-processing step for parsing. In the task of
    function tagging, we use the functional annotated corpus and tag Myanmar
    sentences with correct segmentation, POS (part-of-speech) tagging and chunking
    information.

  3. Characterizing Ranked Chinese Syllable-to-Character Mapping Spectrum: A Bridge Between the Spoken and Written Chinese Language.

    Authors: Wentian Li
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    One important aspect of the relationship between spoken and written Chinese
    is the ranked syllable-to-character mapping spectrum, which is the ranked list
    of syllables by the number of characters that map to the syllable. Previously,
    this spectrum was analyzed for more than 400 syllables without distinguishing
    the four tones. In the current study, the spectrum with 1280 toned
    syllables is analyzed by logarithmic function, Beta rank function, and
    piecewise logarithmic function.
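
    As an illustration of the definition above, the ranked spectrum can be
    sketched as counting the characters per syllable and sorting the counts
    in descending order. The function name and the six-character toy lexicon
    are hypothetical and are not taken from the paper; the fitting step is
    not shown.

```python
from collections import Counter


def ranked_spectrum(char_to_syllable):
    """Return (syllable, character_count) pairs, largest count first."""
    counts = Counter(char_to_syllable.values())
    # Sort by descending count; break ties alphabetically for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy lexicon: each character maps to one toned syllable.
toy_lexicon = {
    "是": "shi4", "事": "shi4", "市": "shi4",
    "马": "ma3", "码": "ma3",
    "妈": "ma1",
}
spectrum = ranked_spectrum(toy_lexicon)
for rank, (syllable, count) in enumerate(spectrum, start=1):
    print(rank, syllable, count)
```

    A rank function (such as the ones named in the abstract) would then be
    fitted to the count column as a function of rank.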

  4. A Corpus-based Evaluation of a Domain-specific Text to Knowledge Mapping Prototype.

    Authors: Rushdi Shams, Adel Elsayed, Quazi Mah-Zereen Akter
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    The aim of this paper is to evaluate a Text to Knowledge Mapping (TKM)
    Prototype. The prototype is domain-specific; its purpose is to map
    instructional text onto a knowledge domain. The context of the knowledge domain
    is DC electrical circuits. During development, the prototype was tested
    with a limited data set from the domain. The prototype has reached a stage where
    it needs to be evaluated with a representative linguistic data set called a corpus.
    A corpus is a collection of text drawn from typical sources which can be used
    as a test data set to evaluate NLP systems.

  5. Context-sensitive Spelling Correction Using Google Web 1T 5-Gram Information.

    Authors: Youssef Bassil, Mohammad Alwani
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    In computing, spell checking is the process of detecting and sometimes
    providing spelling suggestions for incorrectly spelled words in a text.
    Basically, a spell checker is a computer program that uses a dictionary of
    words to perform spell checking. The bigger the dictionary is, the higher
    the error detection rate. Because spell checkers are based on regular
    dictionaries, they suffer from a data sparseness problem, as they cannot capture
    a large vocabulary of words including proper names, domain-specific terms,
    technical jargon, special acronyms, and terminologies.
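
    The dictionary-lookup step described above can be sketched as follows.
    This is a minimal illustration of plain dictionary-based detection only,
    not the context-sensitive method of the paper; the function name, the toy
    dictionary, and the sample sentence are hypothetical.

```python
import re


def find_unknown_words(text, dictionary):
    """Return the words in `text` that are absent from `dictionary`."""
    words = re.findall(r"[A-Za-z]+", text)
    # Comparison is case-insensitive; the dictionary holds lowercase entries.
    return [w for w in words if w.lower() not in dictionary]


# Hypothetical tiny dictionary; real checkers use far larger word lists.
toy_dictionary = {"the", "parser", "was", "written", "by", "a", "student"}
flagged = find_unknown_words("The parsr was written by Quesada", toy_dictionary)
print(flagged)
```

    Both the genuine misspelling and the proper name are flagged, which is
    the data sparseness problem the abstract refers to.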

  6. Segmentation Similarity and Agreement.

    Authors: Chris Fournier, Diana Inkpen
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    We propose a new segmentation evaluation metric, called segmentation
    similarity (S), that quantifies the similarity between two segmentations as the
    proportion of boundaries that are not transformed when comparing them using
    edit distance, essentially using edit distance as a penalty function and
    scaling penalties by segmentation size. We propose several adapted
    inter-annotator agreement coefficients which use S that are suitable for
    segmentation. We show that S is configurable enough to suit a wide variety of
    segmentation evaluations, and is an improvement upon the state of the art.
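
    A much simplified sketch of the idea behind such a metric is given below.
    It assumes a single boundary type and counts only boundary insertions and
    deletions as edits, so it is not the edit distance used in the paper,
    which may include further operations. Segmentations are given as lists of
    segment sizes; all names are hypothetical.

```python
from itertools import accumulate


def boundary_positions(masses):
    """Internal boundary positions of a segmentation given as segment sizes."""
    return set(list(accumulate(masses))[:-1])


def simple_similarity(masses_a, masses_b):
    """Proportion of potential boundaries left unchanged between two segmentations."""
    if sum(masses_a) != sum(masses_b):
        raise ValueError("segmentations must cover the same number of units")
    potential = sum(masses_a) - 1  # number of places a boundary could occur
    if potential <= 0:
        return 1.0
    # Simplifying assumption: each boundary present in only one segmentation
    # costs one edit (an insertion or a deletion).
    edits = len(boundary_positions(masses_a) ^ boundary_positions(masses_b))
    return 1.0 - edits / potential


score = simple_similarity([2, 3, 1], [2, 2, 2])
print(score)
```

    Scaling by the number of potential boundaries keeps the score between 0
    and 1 regardless of document length.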

  7. You had me at hello: How phrasing affects memorability.

    Authors: Jon Kleinberg, Cristian Danescu-Niculescu-Mizil, Lillian Lee, Justin Cheng
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Understanding the ways in which information achieves widespread public
    awareness is a research question of significant interest. We consider whether,
    and how, the way in which the information is phrased --- the choice of words
    and sentence structure --- can affect this process. To this end, we develop an
    analysis framework and build a corpus of movie quotes, annotated with
    memorability information, in which we are able to control for both the speaker
    and the setting of the quotes. We find significant differences between
    memorable and non-memorable quotes in several key dimensions.

  8. Arabic Keyphrase Extraction using Linguistic knowledge and Machine Learning Techniques.

    Authors: Tarek El-Shishtawy, Abdulwahab Al-sammak
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    In this paper, a supervised learning technique for extracting keyphrases of
    Arabic documents is presented. The extractor is supplied with linguistic
    knowledge to enhance its efficiency instead of relying only on statistical
    information such as term frequency and distance. During analysis, an annotated
    Arabic corpus is used to extract the required lexical features of the document
    words. The knowledge also includes syntactic rules based on part of speech tags
    and allowed word sequences to extract the candidate keyphrases.

  9. An Accurate Arabic Root-Based Lemmatizer for Information Retrieval Purposes.

    Authors: Tarek El-Shishtawy, Fatma El-Ghannam
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    In spite of its robust syntax, semantic cohesion, and lower ambiguity,
    lemma-level analysis and generation has not yet been a focus of the Arabic NLP
    literature. In the current research, we propose the first non-statistical
    accurate Arabic lemmatizer algorithm that is suitable for information retrieval
    (IR) systems. The proposed lemmatizer makes use of different Arabic language
    knowledge resources to generate an accurate lemma form and its relevant features
    that support IR purposes. Experimental results show that, as a POS tagger, the
    proposed algorithm achieves a maximum accuracy of 94.8%.

  10. Beyond Sentiment: The Manifold of Human Emotions.

    Authors: Guy Lebanon, Seungyeon Kim, Fuxin Li, Irfan Essa
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Sentiment analysis predicts the presence of positive or negative emotions in
    a text document. In this paper we consider higher dimensional extensions of the
    sentiment concept, which represent a richer set of human emotions. Our approach
    goes beyond previous work in that our model contains a continuous manifold
    rather than a finite set of human emotions. We investigate the resulting model,
    compare it to psychological observations, and explore its predictive
    capabilities.

  11. Positive words carry less information than negative words.

    Authors: Frank Schweitzer, David Garcia, Antonios Garas
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    We show that the frequency of word use is not only determined by the word
    length [1] and the average information content [2], but also by its emotional
    content. We have analysed three established lexica of affective word usage in
    English, German, and Spanish, to verify that these lexica have a neutral,
    unbiased, emotional content. Taking into account the frequency of word usage,
    we find that words with a positive emotional content are more frequently used.
    This lends support to the Pollyanna hypothesis [3] that there should be a positive
    bias in human expression.

  12. Rule based Part of speech Tagger for Homoeopathy Clinical realm.

    Authors: Sanjay K. Dwivedi, Pramod P. Sukhadeve
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    A tagger is a mandatory component of most text analysis systems, as it
    assigns a syntactic class (e.g., noun, verb, adjective, or adverb) to every
    word in a sentence. In this paper, we present a simple part of speech tagger
    for homoeopathy clinical language. It exploits standard patterns for
    evaluating sentences. An untagged clinical corpus of 20085 words is used,
    from which we selected 125 sentences (2322 tokens).

  13. Genetic Algorithm (GA) in Feature Selection for CRF Based Manipuri Multiword Expression (MWE) Identification.

    Authors: Kishorjit Nongmeikapam, Sivaji Bandyopadhyay
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    This paper deals with the identification of Multiword Expressions (MWEs) in
    Manipuri, a highly agglutinative Indian language. Manipuri is listed in the
    Eighth Schedule of the Indian Constitution. MWEs play an important role in
    applications of Natural Language Processing (NLP) such as Machine Translation,
    Part of Speech tagging, Information Retrieval, and Question Answering. Feature
    selection is an important factor in the recognition of Manipuri MWEs using
    Conditional Random Field (CRF).

  14. OMG U got flu? Analysis of shared health messages for bio-surveillance.

    Authors: Nigel Collier, Nguyen Truong Son, Ngoc Mai Nguyen
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Background: Micro-blogging services such as Twitter offer the potential to
    crowdsource epidemics in real-time. However, Twitter posts ('tweets') are often
    ambiguous and reactive to media trends. In order to ground user messages in
    epidemic response we focused on tracking reports of self-protective behaviour
    such as avoiding public gatherings or increased sanitation as the basis for
    further risk analysis. Results: We created guidelines for tagging
    self-protective behaviour based on Jones and Salathé (2009)'s behaviour response
    survey.

  15. Towards cross-lingual alerting for bursty epidemic events.

    Authors: Nigel Collier
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Background: Online news reports are increasingly becoming a source for event
    based early warning systems that detect natural disasters. Harnessing the
    massive volume of information available from multilingual newswire presents as
    many challenges as opportunities due to the patterns of reporting complex
    spatiotemporal events. Results: In this article we study the problem of
    utilising correlated event reports across languages.

  16. What's unusual in online disease outbreak news?

    Authors: Nigel Collier
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Background: Accurate and timely detection of public health events of
    international concern is necessary to help support risk assessment and response
    and save lives. Novel event-based methods that use the World Wide Web as a
    signal source offer potential to extend health surveillance into areas where
    traditional indicator networks are lacking. In this paper we address the issue
    of systematically evaluating online health news to support automatic alerting
    using daily disease-country counts text mined from real world data using
    BioCaster.

  17. Syndromic classification of Twitter messages.

    Authors: Nigel Collier, Son Doan
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Recent studies have shown strong correlation between social networking data
    and national influenza rates. We expanded upon this success to develop an
    automated text mining system that classifies Twitter messages in real time into
    six syndromic categories based on key terms from a public health ontology.
    10-fold cross validation tests were used to compare Naive Bayes (NB) and
    Support Vector Machine (SVM) models on a corpus of 7431 Twitter messages. SVM
    performed better than NB on 4 out of 6 syndromes.

  18. Data formats for phonological corpora.

    Authors: Laurent Romary, Andreas Witt
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    The goal of the present chapter is to explore the possibility of providing
    the research (but also the industrial) community that commonly uses spoken
    corpora with a stable portfolio of well-documented standardised formats that
    allow a high re-use rate of annotated spoken resources and, as a consequence,
    better interoperability across tools used to produce or exploit such resources.

  19. A Comparison of Different Machine Transliteration Models.

    Authors: K. Choi, H. Isahara, J. Oh
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Machine transliteration is a method for automatically converting words in one
    language into phonetically equivalent ones in another language. Machine
    transliteration plays an important role in natural language applications such
    as information retrieval and machine translation, especially for handling
    proper nouns and technical terms. Four machine transliteration models --
    grapheme-based transliteration model, phoneme-based transliteration model,
    hybrid transliteration model, and correspondence-based transliteration model --
    have been proposed by several researchers.

  20. User-level sentiment analysis incorporating social networks.

    Authors: Ping Li, Lillian Lee, Chenhao Tan, Jie Tang, Long Jiang, Ming Zhou
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    We show that information about social relationships can be used to improve
    user-level sentiment analysis. The main motivation behind our approach is that
    users that are somehow "connected" may be more likely to hold similar opinions;
    therefore, relationship information can complement what we can extract about a
    user's viewpoints from their utterances.

  21. Learning Content Selection Rules for Generating Object Descriptions in Dialogue.

    Authors: P. W. Jordan, M. A. Walker
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    A fundamental requirement of any task-oriented dialogue system is the ability
    to generate object descriptions that refer to objects in the task domain. The
    subproblem of content selection for object descriptions in task-oriented
    dialogue has been the focus of much previous work and a large number of models
    have been proposed.

  22. Combining Knowledge- and Corpus-based Word-Sense-Disambiguation Methods.

    Authors: A. Montoyo, M. Palomar, G. Rigau, A. Suarez
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    In this paper we concentrate on the resolution of the lexical ambiguity that
    arises when a given word has several different meanings. This specific task is
    commonly referred to as word sense disambiguation (WSD). The task of WSD
    consists of assigning the correct sense to words using an electronic dictionary
    as the source of word definitions. We present two WSD methods based on two main
    methodological approaches in this research area: a knowledge-based method and a
    corpus-based method.

  23. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization.

    Authors: G. Erkan, D. R. Radev
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    We introduce a stochastic graph-based method for computing relative
    importance of textual units for Natural Language Processing. We test the
    technique on the problem of Text Summarization (TS). Extractive TS relies on
    the concept of sentence salience to identify the most important sentences in a
    document or set of documents. Salience is typically defined in terms of the
    presence of particular important words or in terms of similarity to a centroid
    pseudo-sentence.
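
    The general idea of graph-based sentence centrality can be sketched as
    below: sentences are nodes, sufficiently similar sentences are linked,
    and scores are obtained by power iteration on the resulting stochastic
    matrix. This is an illustrative sketch, not the authors' implementation;
    the whitespace tokenization, the similarity threshold, the damping
    factor, and all names are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def centrality_scores(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Score sentences by centrality in a thresholded similarity graph."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    n = len(vectors)
    if n == 0:
        return []
    # Link sentence pairs whose similarity exceeds the (assumed) threshold.
    adj = [
        [1.0 if cosine(vectors[i], vectors[j]) > threshold else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Row-normalise into a stochastic matrix; empty rows become uniform.
    matrix = []
    for row in adj:
        total = sum(row)
        matrix.append([x / total for x in row] if total else [1.0 / n] * n)
    # Power iteration with damping, as in PageRank-style random walks.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1.0 - damping) / n
            + damping * sum(scores[i] * matrix[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


toy_sentences = ["cats and dogs", "cats purr", "dogs bark"]
toy_scores = centrality_scores(toy_sentences)
print(toy_scores)
```

    In the toy input the first sentence shares a word with each of the other
    two, so it receives the highest score and would be selected first by an
    extractive summarizer built on these scores.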

  24. Solving puzzles described in English by automated translation to answer set programming and learning how to do that translation.

    Authors: Chitta Baral, Juraj Dzifcak
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    We present a system capable of automatically solving combinatorial logic
    puzzles given in (simplified) English. It involves translating the English
    descriptions of the puzzles into answer set programming (ASP) and using ASP
    solvers to provide solutions of the puzzles. To translate the descriptions, we
    use a lambda-calculus based approach using Probabilistic Combinatorial
    Categorial Grammars (PCCG) where the meanings of words are associated with
    parameters to be able to distinguish between multiple meanings of the same
    word. The meanings of many words and the parameters are learned.

  25. Language understanding as a step towards human level intelligence - automatizing the construction of the initial dictionary from example sentences.

    Authors: Chitta Baral, Juraj Dzifcak
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    For a system to understand natural language, it needs to be able to take
    natural language text and answer questions given in natural language with
    respect to that text; it also needs to be able to follow instructions given in
    natural language. To achieve this, a system must be able to process natural
    language and be able to capture the knowledge within that text. Thus it needs
    to be able to translate natural language text into a formal language. We
    discuss our approach to do this, where the translation is achieved by composing
    the meaning of words in a sentence.

  26. Using Inverse lambda and Generalization to Translate English to Formal Languages.

    Authors: Chitta Baral, Juraj Dzifcak, Marcos Alvarez Gonzalez, Jiayu Zhou
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    We present a system to translate natural language sentences to formulas in a
    formal or a knowledge representation language. Our system uses two inverse
    lambda-calculus operators; using them, it can take as input the semantic
    representation of some words, phrases and sentences and from that derive the
    semantic representation of other words and phrases. Our inverse lambda operator
    works on many formal languages including first order logic, database query
    languages and answer set programming.

  27. Acquiring Word-Meaning Mappings for Natural Language Interfaces.

    Authors: C. Thompson
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    This paper focuses on a system, WOLFIE (WOrd Learning From Interpreted
    Examples), that acquires a semantic lexicon from a corpus of sentences paired
    with semantic representations. The lexicon learned consists of phrases paired
    with meaning representations. WOLFIE is part of an integrated system that
    learns to transform sentences into representations such as logical database
    queries. Experimental results are presented demonstrating WOLFIE's ability to
    learn useful lexicons for a database interface in four different natural
    languages.

  28. Experimental Support for a Categorical Compositional Distributional Model of Meaning.

    Authors: Mehrnoosh Sadrzadeh, Edward Grefenstette
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Modelling compositional meaning for sentences using empirical distributional
    methods has been a challenge for computational linguists. We implement the
    abstract categorical model of Coecke et al. (arXiv:1003.4394v1 [cs.CL]) using
    data from the BNC and evaluate it. The implementation is based on unsupervised
    learning of matrices for relational words and applying them to the vectors of
    their arguments.

  29. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs.

    Authors: Cristian Danescu-Niculescu-Mizil, Lillian Lee
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Conversational participants tend to immediately and unconsciously adapt to
    each other's language styles: a speaker will even adjust the number of articles
    and other function words in their next utterance in response to the number in
    their partner's immediately preceding utterance. This striking level of
    coordination is thought to have arisen as a way to achieve social goals, such
    as gaining approval or emphasizing difference in status. But has the adaptation
    mechanism become so deeply embedded in the language-generation process as to
    become a reflex?

  30. Mark My Words! Linguistic Style Accommodation in Social Media.

    Authors: Cristian Danescu-Niculescu-Mizil, Michael Gamon, Susan Dumais
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    The psycholinguistic theory of communication accommodation accounts for the
    general observation that participants in conversations tend to converge to one
    another's communicative behavior: they coordinate in a variety of dimensions
    including choice of words, syntax, utterance length, pitch and gestures. In its
    almost forty years of existence, this theory has been empirically supported
    exclusively through small-scale or controlled laboratory studies. Here we
    address this phenomenon in the context of Twitter conversations.

  31. A Universal Part-of-Speech Tagset.

    Authors: Slav Petrov, Dipanjan Das, Ryan McDonald
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    To facilitate future research in unsupervised induction of syntactic
    structure and to standardize best-practices, we propose a tagset that consists
    of twelve universal part-of-speech categories. In addition to the tagset, we
    develop a mapping from 25 different treebank tagsets to this universal set. As
    a result, when combined with the original treebank data, this universal tagset
    and mapping produce a dataset consisting of common parts-of-speech for 22
    different languages.
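
    Applying such a mapping amounts to a table lookup per token, as in the
    sketch below. The twelve category names and the handful of Penn Treebank
    entries are given from memory of the published tagset and should be
    checked against the released mapping files; the fallback to "X" for
    unmapped tags and the function name are assumptions of this sketch.

```python
# The twelve universal categories (stated from memory; verify before use).
UNIVERSAL_TAGS = {
    "NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
    "ADP", "NUM", "CONJ", "PRT", ".", "X",
}

# Partial, illustrative mapping from Penn Treebank tags to universal tags.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN",
    "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV",
    "PRP": "PRON", "DT": "DET",
    "IN": "ADP", "CD": "NUM",
    "CC": "CONJ", "RP": "PRT",
    ".": ".",
}


def to_universal(tagged_tokens, mapping=PTB_TO_UNIVERSAL):
    """Replace each fine-grained tag by its universal category."""
    # Assumption of this sketch: tags missing from the table fall back to "X".
    return [(word, mapping.get(tag, "X")) for word, tag in tagged_tokens]


sentence = [("The", "DT"), ("dog", "NN"), ("barked", "VBD"), (".", ".")]
converted = to_universal(sentence)
print(converted)
```

    Running every treebank through its own table of this kind yields corpora
    that share one tag inventory, which is what makes cross-lingual
    comparison possible.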

  32. Recognizing Uncertainty in Speech.

    Authors: Heather Pon-Barry, Stuart M. Shieber
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    We address the problem of inferring a speaker's level of certainty based on
    prosodic information in the speech signal, which has application in
    speech-based dialogue systems. We show that using phrase-level prosodic
    features centered around the phrases causing uncertainty, in addition to
    utterance-level prosodic features, improves our model's level of certainty
    classification. In addition, our models can be used to predict which phrase a
    person is uncertain about.

  33. Polarized Montagovian Semantics for the Lambek-Grishin calculus.

    Authors: Arno Bastenhof
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Grishin proposed enriching the Lambek calculus with multiplicative
    disjunction (par) and coresiduals. Applications to linguistics were discussed
    by Moortgat, who spoke of the Lambek-Grishin calculus (LG). In this paper, we
    adapt Girard's polarity-sensitive double negation embedding for classical logic
    to extract a compositional Montagovian semantics from a display calculus for
    focused proof search in LG.

  34. Opinion Polarity Identification through Adjectives.

    Authors: Samaneh Moghaddam, Fred Popowich
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    "What other people think" has always been an important piece of information
    during various decision-making processes. Today people frequently make their
    opinions available via the Internet, and as a result, the Web has become an
    excellent source for gathering consumer opinions. There are now numerous Web
    resources containing such opinions, e.g., product review forums, discussion
    groups, and blogs.

  35. A PDTB-Styled End-to-End Discourse Parser.

    Authors: Ziheng Lin, Hwee Tou Ng, Min-Yen Kan
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    We have developed a full discourse parser in the Penn Discourse Treebank
    (PDTB) style. Our trained parser first identifies all discourse and
    non-discourse relations, locates and labels their arguments, and then
    classifies their relation types. When appropriate, the attribution spans to
    these relations are also determined. We present a comprehensive evaluation from
    both component-wise and error-cascading perspectives.

  36. Stabilizing knowledge through standards - A perspective for the humanities.

    Authors: Laurent Romary
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    It is usual to consider that standards generate mixed feelings among
    scientists. They are often seen as not really reflecting the state of the art
    in a given domain and a hindrance to scientific creativity. Still, scientists
    should theoretically be in the best position to bring their expertise into
    standard developments, being even more neutral on issues that may typically be
    related to competing industrial interests.

  37. Learning Taxonomy for Text Segmentation by Formal Concept Analysis.

    Authors: Mihaiela Lupea, Doina Tatar, Zsuzsana Marian
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    In this paper the problems of deriving a taxonomy from a text and
    concept-oriented text segmentation are approached. The Formal Concept Analysis
    (FCA) method is applied to solve both of these linguistic problems. The
    proposed segmentation method offers a conceptual view for text segmentation,
    using a context-driven clustering of sentences. The Concept-oriented Clustering
    Segmentation algorithm (COCS) is based on k-means linear clustering of the
    sentences. Experimental results obtained using the COCS algorithm are presented.

  38. A probabilistic top-down parser for minimalist grammars.

    Authors: Thomas Mainguy
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    This paper describes a probabilistic top-down parser for minimalist grammars.
    Top-down parsers have the great advantage of having a certain predictive power
    during the parsing, which takes place in a left-to-right reading of the
    sentence.

  39. Niche as a determinant of word fate in online groups.

    Authors: Eduardo G. Altmann, Janet B. Pierrehumbert, Adilson E. Motter
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Patterns of word use both reflect and influence a myriad of human activities
    and interactions. Like other entities that reproduce and evolve, words rise or
    decline depending upon a complex interplay between fitness and environment.
    Using Internet discussion communities as model systems, we show that the word
    niche, defined as the extent of the word's association with specific people and
    topics, is a strong determinant of changes in word frequency. Previous, a
    posteriori, studies have indicated that word frequency is a correlate of word
    success at historical time scales.

  40. Tableaux for the Lambek-Grishin calculus.

    Authors: Arno Bastenhof
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Categorial type logics, pioneered by Lambek, seek a proof-theoretic
    understanding of natural language syntax by identifying categories with
    formulas and derivations with proofs. We typically observe an intuitionistic
    bias: a structural configuration of hypotheses (a constituent) derives a single
    conclusion (the category assigned to it). Acting upon suggestions of Grishin to
    dualize the logical vocabulary, Moortgat proposed the Lambek-Grishin calculus
    (LG) with the aim of restoring symmetry between hypotheses and conclusions.

  41. Emotional State Categorization from Speech: Machine vs. Human.

    Authors: Arslan Shaukat, Ke Chen
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    This paper presents our investigations on emotional state categorization from
    speech signals with a psychologically inspired computational model against
    human performance under the same experimental setup. Based on psychological
    studies, we propose a multistage categorization strategy which allows
    establishing an automatic categorization model flexibly for a given emotional
    speech categorization task. We apply the strategy to the Serbian Emotional
    Speech Corpus (GEES) and the Danish Emotional Speech Corpus (DES), where human
    performance was reported in previous psychological studies.

  42. Don't 'have a clue'? Unsupervised co-learning of downward-entailing operators.

    Authors: Cristian Danescu-Niculescu-Mizil, Lillian Lee
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Researchers in textual entailment have begun to consider inferences involving
    'downward-entailing operators', an interesting and important class of lexical
    items that change the way inferences are made. Recent work proposed a method
    for learning English downward-entailing operators that requires access to a
    high-quality collection of 'negative polarity items' (NPIs). However, English
    is one of the very few languages for which such a list exists. We propose the
    first approach that can be applied to the many languages for which there is no
    pre-existing high-precision database of NPIs.

  43. For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia.

    Authors: Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil, Lillian Lee
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    We report on work in progress on extracting lexical simplifications (e.g.,
    "collaborate" -> "work together"), focusing on utilizing edit histories in
    Simple English Wikipedia for this task. We consider two main approaches: (1)
    deriving simplification probabilities via an edit model that accounts for a
    mixture of different operations, and (2) using metadata to focus on edits that
    are more likely to be simplification operations.

  44. Space and the Synchronic A-Ram.

    Authors: Alex V Berka
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Space is a circuit oriented, spatial programming language designed to exploit
    the massive parallelism available in a novel formal model of computation called
    the Synchronic A-Ram, and physically related FPGA and reconfigurable
    architectures. Space expresses variable grained MIMD parallelism, is modular,
    strictly typed, and deterministic. Barring operations associated with memory
    allocation and compilation, modules cannot access global variables, and are
    referentially transparent.

  45. Inflection system of a language as a complex network.

    Authors: Henryk Fukś
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    We investigate inflection structure of a synthetic language using Latin as an
    example. We construct a bipartite graph in which one group of vertices
    corresponds to dictionary headwords and the other group to inflected forms
    encountered in a given text. Each inflected form is connected to its
    corresponding headword, which in some cases is non-unique. The resulting sparse
    graph decomposes into a large number of connected components, to be called word
    groups. We then show how the concept of the word group can be used to construct
    coverage curves of selected Latin texts.
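
    A minimal sketch of the word-group idea, assuming the text has already been
    lemmatized into a mapping from each inflected form to its candidate
    headwords. The function name and the toy Latin entries are hypothetical
    illustrations, not data or code from the paper.

```python
from collections import defaultdict


def word_groups(form_to_headwords):
    """Return the connected components of the bipartite form/headword graph."""
    # Tag each vertex with its side so that a form and a headword with the
    # same spelling remain distinct vertices.
    adjacency = defaultdict(set)
    for form, headwords in form_to_headwords.items():
        form_node = ("form", form)
        adjacency[form_node]  # make sure forms without headwords still appear
        for headword in headwords:
            head_node = ("head", headword)
            adjacency[form_node].add(head_node)
            adjacency[head_node].add(form_node)

    seen = set()
    groups = []
    for start in list(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        component = set()
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(component)
    return groups


# Toy input (hypothetical): "amor" is ambiguous between the noun "amor"
# and a passive form of the verb "amo", which merges the two headwords
# into a single word group.
toy_forms = {
    "amor": ["amor", "amo"],
    "amat": ["amo"],
    "amoris": ["amor"],
    "rosa": ["rosa"],
}
toy_groups = word_groups(toy_forms)
```

    In this toy input the non-unique headword of "amor" is what links the
    noun and verb paradigms into one component.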

  46. Linguistic complexity: English vs. Polish, text vs. corpus.

    Authors: Stanislaw Drozdz, Jaroslaw Kwapien, Adam Orczyk
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    We analyze the rank-frequency distributions of words in selected English and
    Polish texts. We show that for the lemmatized (basic) word forms the
    scale-invariant regime breaks after about two decades, while it might be
    consistent for the whole range of ranks for the inflected word forms. We also
    find that for a corpus consisting of texts written by different authors the
    basic scale-invariant regime is broken more strongly than in the case of a
    comparable corpus consisting of texts written by the same author.
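
    For orientation, a rank-frequency distribution can be computed as in the
    sketch below. It assumes the text is already tokenized (and lemmatized, if
    basic forms are wanted); the function name is a hypothetical illustration
    and not the authors' code.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent word."""
    counts = Counter(tokens)
    frequencies = sorted(counts.values(), reverse=True)
    return list(enumerate(frequencies, start=1))


# Toy input (hypothetical); a real analysis would use a full text or corpus.
toy_ranks = rank_frequency("the cat the dog the cat".split())
```

    Plotting these pairs on log-log axes is the usual way to inspect whether a
    scale-invariant (power-law) regime holds over a given range of ranks.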

  47. Testing SDRT's Right Frontier.

    Authors: Stergos Afantenos, Nicholas Asher
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    The Right Frontier Constraint (RFC), as a constraint on the attachment of new
    constituents to an existing discourse structure, has important implications for
    the interpretation of anaphoric elements in discourse and for Machine Learning
    (ML) approaches to learning discourse structures. In this paper we provide
    strong empirical support for SDRT's version of RFC. The analysis of about 100
    doubly annotated documents by five different naive annotators shows that SDRT's
    RFC is respected about 95% of the time.

  48. Complete Complimentary Results Report of the MARF's NLP Approach to the DEFT 2010 Competition.

    Authors: Serguei A. Mokhov
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    This paper complements the main DEFT'10 article describing the MARF approach
    to the DEFT'10 NLP competition. It aims to present the complete result sets
    of all the conducted experiments and their settings in the resulting tables,
    highlighting the approach and the best results, but also showing the worse
    and the worst results and their analysis. This is the first iteration
    of the initial release of the results.

  49. The probabilistic analysis of language acquisition: Theoretical, computational, and experimental analysis.

    Authors: Paul M.B. Vitanyi, Anne S. Hsu, Nick Chater
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    There is much debate over the degree to which language learning is governed
    by innate language-specific biases, or acquired through cognition-general
    principles. Here we examine the probabilistic language acquisition hypothesis
    on three levels: We outline a novel theoretical result showing that it is
    possible to learn the exact generative model underlying a wide class of
    languages, purely from observing samples of the language.

  50. Segmentation and Nodal Points in Narrative: Study of Multiple Variations of a Ballad.

    Authors: Fionn Murtagh, Adam Ganz
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    The Lady Maisry ballads afford us a framework within which to segment a
    storyline into its major components. Segments and, as a consequence, nodal points
    are discussed for nine different variants of the Lady Maisry story of a (young)
    woman being burnt to death by her family, on account of her becoming pregnant
    by a foreign personage. We motivate the importance of nodal points in textual
    and literary analysis. We show too how the openings of the nine variants can be
    analyzed comparatively, and also the conclusions of the ballads.

  51. Ivan Franko's novel Dlja domashnjoho ohnyshcha (For the Hearth) in the light of the frequency dictionary.

    Authors: Solomiya Buk
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    In the article, the methodology and the principles of the compilation of the
    Frequency dictionary for Ivan Franko's novel Dlja domashnjoho ohnyshcha (For
    the Hearth) are described. The following statistical parameters of the novel
    vocabulary are obtained: variety, exclusiveness, concentration indexes,
    correlation between word rank and text coverage, etc. The main quantitative
    characteristics of Franko's novels Perekhresni stezhky (The Cross-Paths) and
    Dlja domashnjoho ohnyshcha are compared on the basis of their frequency
    dictionaries.

  52. A generic tool to generate a lexicon for NLP from Lexicon-Grammar tables.

    Authors: Elsa Tolone, Matthieu Constant
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Lexicon-Grammar tables constitute a large-coverage syntactic lexicon but they
    cannot be directly used in Natural Language Processing (NLP) applications
    because they sometimes rely on implicit information. In this paper, we
    introduce LGExtract, a generic tool for generating a syntactic lexicon for NLP
    from the Lexicon-Grammar tables. It is based on a global table that contains
    undefined information and on a unique extraction script including all
    operations to be performed for all tables.

  53. Quantitative parametrization of texts written by Ivan Franko: An attempt of the project.

    Authors: Solomiya Buk
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    In the article, the project of quantitative parametrization of all texts by
    Ivan Franko is presented. It can be carried out only by using modern computer
    techniques after the frequency dictionaries for all of Franko's works are
    compiled. The paper describes the application spheres, methodology, stages,
    principles and peculiarities in the compilation of the frequency dictionary of
    the second half of the 19th century - the beginning of the 20th century. The
    relation between the Ivan Franko frequency dictionary, explanatory dictionary
    of writer's language and text corpus is discussed.

  54. Network analysis of a corpus of undeciphered Indus civilization inscriptions indicates syntactic organization.

    Authors: Sitabhra Sinha, Md Izhar Ashraf, Raj Kumar Pan, Bryan Kenneth Wells
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Archaeological excavations in the sites of the Indus Valley civilization
    (2500-1900 BCE) in Pakistan and northwestern India have unearthed a large
    number of artifacts with inscriptions made up of hundreds of distinct signs. To
    date there is no generally accepted decipherment of these sign sequences and
    there have been suggestions that the signs could be non-linguistic. Here we
    apply complex network analysis techniques to a database of available Indus
    inscriptions, with the aim of detecting patterns indicative of syntactic
    organization.

  55. Punctuation effects in English and Esperanto texts.

    Authors: M. Ausloos
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    A statistical physics study of punctuation effects on sentence lengths is
    presented for written texts: Alice in Wonderland and Through a
    Looking Glass. The translation of the first text into Esperanto is also
    considered as a test for the role of punctuation in defining a style, and for
    contrasting natural and artificial, but written, languages. Several log-log
    plots of the sentence length-rank relationship are presented for the major
    punctuation marks. Different power laws are observed with characteristic
    exponents.

  56. Learning Recursive Segments for Discourse Parsing.

    Authors: Stergos Afantenos, Pascal Denis, Philippe Muller, Laurence Danlos
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Automatically detecting discourse segments is an important preliminary step
    towards full discourse parsing. Previous research on discourse segmentation
    has relied on the assumption that elementary discourse units (EDUs) in a
    document always form a linear sequence (i.e., they can never be nested).
    Unfortunately, this assumption turns out to be too strong, for some theories of
    discourse like SDRT allow for nested discourse units. In this paper, we
    present a simple approach to discourse segmentation that is able to produce
    nested EDUs.

  57. La représentation formelle des concepts spatiaux dans la langue.

    Authors: Michel Aurnague, Laure Vieu, Andrée Borillo
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    In this chapter, we assume that systematically studying the semantics of
    spatial markers in language provides a means to reveal fundamental properties
    and concepts characterizing conceptual representations of space. We propose a
    formal system accounting for the properties highlighted by the linguistic
    analysis, and we use these tools for representing the semantic content of
    several spatial relations of French. The first part presents a semantic
    analysis of the expression of space in French aiming at describing the
    constraints that formal representations have to take into account.

  58. Les entités spatiales dans la langue : étude descriptive, formelle et expérimentale de la catégorisation.

    Authors: Michel Aurnague, Maya Hickmann, Laure Vieu
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    While previous linguistic and psycholinguistic research on space has mainly
    analyzed spatial relations, the studies reported in this paper focus on how
    language distinguishes among spatial entities. Descriptive and experimental
    studies first propose a classification of entities, which accounts for both
    static and dynamic space, has some cross-linguistic validity, and underlies
    adults' cognitive processing.

  59. Mathematical Foundations for a Compositional Distributional Model of Meaning.

    Authors: Stephen Clark, Bob Coecke, Mehrnoosh Sadrzadeh
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    We propose a mathematical framework for a unification of the distributional
    theory of meaning in terms of vector space models, and a compositional theory
    for grammatical types, for which we rely on the algebra of Pregroups,
    introduced by Lambek. This mathematical framework enables us to compute the
    meaning of a well-typed sentence from the meanings of its constituents.
    Concretely, the type reductions of Pregroups are 'lifted' to morphisms in a
    category, a procedure that transforms meanings of constituents into a meaning
    of the (well-typed) whole.

  60. Les Entités Nommées : usage et degrés de précision et de désambiguïsation.

    Authors: Claude Martineau, Elsa Tolone, Stavroula Voyatzi
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    The recognition and classification of Named Entities (NER) are regarded as an
    important component for many Natural Language Processing (NLP) applications.
    The classification is usually made by taking into account the immediate context
    in which the NE appears. In some cases, this immediate context does not allow
    getting the right classification. We show in this paper that the use of an
    extended syntactic context and large-scale resources could be very useful in
    the NER task.

  61. Linguistic Geometries for Unsupervised Dimensionality Reduction.

    Authors: Krishnakumar Balasubramanian, Guy Lebanon, Yi Mao
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Text documents are complex high dimensional objects. To effectively visualize
    such data it is important to reduce its dimensionality and visualize the low
    dimensional embedding as a 2-D or 3-D scatter plot. In this paper we explore
    dimensionality reduction methods that draw upon domain knowledge in order to
    achieve a better low dimensional embedding and visualization of documents. We
    consider the use of geometries specified manually by an expert, geometries
    derived automatically from corpus statistics, and geometries computed from
    linguistic resources.

  62. Change of word types to word tokens ratio in the course of translation (based on Russian translations of K. Vonnegut novels).

    Authors: Andrey Kutuzov
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    The article provides lexical statistical analysis of K. Vonnegut's two novels
    and their Russian translations. It finds that the rate at which the ratio of
    word types to word tokens changes differs between the source and
    target texts. The author hypothesizes that these changes are typical for
    English-Russian translations, and moreover, they represent an example of
    Baker's translation feature of levelling out.
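
    A minimal sketch of how the type/token ratio can be tracked as a text is
    read, which gives the kind of curve that can be compared between a source
    text and its translation. The function name, the `step` parameter, and the
    assumption of pre-tokenized input are illustrative choices, not the
    author's procedure.

```python
def type_token_curve(tokens, step=1):
    """Return (tokens_read, types/tokens) pairs sampled every `step` tokens."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    seen_types = set()
    curve = []
    for position, token in enumerate(tokens, start=1):
        seen_types.add(token)
        if position % step == 0:
            curve.append((position, len(seen_types) / position))
    return curve


# Toy input (hypothetical): the ratio drops when a word repeats.
toy_curve = type_token_curve("a b a c".split())
```

    Computing the same curve for a source text and its translation, with the
    same step, allows their slopes to be compared directly.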

  63. Why has (reasonably accurate) Automatic Speech Recognition been so hard to achieve?.

    Authors: Steven Wegmann, Larry Gillick
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Hidden Markov models (HMMs) have been successfully applied to automatic
    speech recognition for more than 35 years in spite of the fact that a key HMM
    assumption -- the statistical independence of frames -- is obviously violated
    by speech data. In fact, this data/model mismatch has inspired many attempts to
    modify or replace HMMs with alternative models that are better able to take
    into account the statistical dependence of frames.

  64. SLAM : Solutions lexicales automatique pour métaphores.

    Authors: Yann Desalle, Bruno Gaume, Karine Duvignau
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    This article presents SLAM, an Automatic Solver for Lexical Metaphors like
    "déshabiller* une pomme" (to undress* an apple). SLAM calculates a
    conventional solution for these productions. To do so, SLAM has to
    intersect the paradigmatic axis of the metaphorical verb "déshabiller*",
    where "peler" ("to peel") comes closer, with a syntagmatic axis that comes from
    a corpus where "peler une pomme" (to peel an apple) is semantically and
    syntactically regular.

  65. Syntactic Topic Models.

    Authors: David M. Blei, Jordan Boyd-Graber
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    The syntactic topic model (STM) is a Bayesian nonparametric model of language
    that discovers latent distributions of words (topics) that are both
    semantically and syntactically coherent. The STM models dependency parsed
    corpora where sentences are grouped into documents. It assumes that each word
    is drawn from a latent topic chosen by combining document-level features and
    the local syntactic context. Each document has a distribution over latent
    topics, as in topic models, which provides the semantic consistency.

  66. Co-channel Interference Cancellation for Space-Time Coded OFDM Systems Using Adaptive Beamforming and Null Deepening.

    Authors: Raungrong Suleesathira
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Combined with space-time coding, the orthogonal frequency division
    multiplexing (OFDM) system exploits space diversity. It is a potential scheme
    to offer spectral efficiency and robust high data rate transmissions over
    frequency-selective fading channels. However, space-time coding impairs the
    system's ability to suppress interference, as the signals transmitted from two
    transmit antennas are superposed and interfere at the receiver antennas.

  67. Thai Rhetorical Structure Analysis.

    Authors: Somnuk Sinthupoun, Ohm Sornil
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    A rhetorical structure tree (RS tree) is a representation of discourse
    relations among elementary discourse units (EDUs). An RS tree is very useful for
    many text processing tasks employing relationships among EDUs such as text
    understanding, summarization, and question answering. The Thai language, with
    its unique linguistic characteristics, requires a unique RS tree construction
    technique. This paper proposes an approach for Thai RS tree construction which
    consists of three major steps: EDU segmentation, Thai RS tree construction, and
    discourse relation (DR) identification.

  68. Towards a Heuristic Categorization of Prepositional Phrases in English with WordNet.

    Authors: Serguei A. Mokhov, Frank Rudzicz
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    This document discusses an approach, and its rudimentary realization, towards
    automatic classification of PPs, a topic that has not received as much
    attention in NLP as NPs and VPs. The approach is a rule-based heuristic
    outlined in several levels of our research. There are 7 semantic categories of
    PPs considered in this document that we are able to classify from an annotated
    corpus.

  69. Dimensionality Reduction: An Empirical Study on the Usability of IFE-CF (Independent Feature Elimination- by C-Correlation and F-Correlation) Measures.

    Authors: M. Babu Reddy, L. S. S. Reddy
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    The recent increase in dimensionality of data has posed a great challenge to
    the existing dimensionality reduction methods in terms of their effectiveness.
    Dimensionality reduction has emerged as one of the significant preprocessing
    steps in machine learning applications and has been effective in removing
    inappropriate data, increasing learning accuracy, and improving
    comprehensibility. Feature redundancy exercises great influence on the
    performance of the classification process.

  70. On Event Structure in the Torn Dress.

    Authors: Serguei A. Mokhov
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Using Pustejovsky's "The Syntax of Event Structure" and Fong's "On Mending a
    Torn Dress" we give a glimpse of a Pustejovsky-like analysis to some example
    sentences in Fong. We attempt to give a framework for semantics to the noun
    phrases and adverbs as appropriate as well as the lexical entries for all words
    in the examples and critique both papers in light of our findings and
    difficulties.

  71. Approximations to the MMI criterion and their effect on lattice-based MMI.

    Authors: Steven Wegmann
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Maximum mutual information (MMI) is a model selection criterion used for
    hidden Markov model (HMM) parameter estimation that was developed more than
    twenty years ago as a discriminative alternative to the maximum likelihood
    criterion for HMM-based speech recognition.

  72. "Mind your p's and q's": or the peregrinations of an apostrophe in 17th Century English.

    Authors: Odile Piton, Hélène Pignot
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    If the use of the apostrophe in contemporary English often marks the Saxon
    genitive, it may also indicate the omission of one or more letters. Some
    writers (wrongly?) use it to mark the plural in symbols or abbreviations,
    visualised thanks to the isolation of the morpheme "s". This punctuation mark
    was imported from the Continent in the 16th century. During the 19th century
    its use was standardised. However, the rules of its usage still seem problematic
    to many, including literate speakers of English.

  73. Étude et traitement automatique de l'anglais du XVIIe siècle : outils morphosyntaxiques et dictionnaires.

    Authors: Odile Piton, Hélène Pignot
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    In this article, we record the main linguistic differences or singularities
    of 17th century English, analyse them morphologically and syntactically and
    propose equivalent forms in contemporary English. We show how 17th century
    texts may be transcribed into modern English, combining the use of electronic
    dictionaries with rules of transcription implemented as transducers.

  74. Recognition and translation Arabic-French of Named Entities: case of the Sport places.

    Authors: Odile Piton, Abdelmajid Ben Hamadou, Héla Fehri
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    The recognition of Arabic Named Entities (NE) is a problem in different
    domains of Natural Language Processing (NLP) like automatic translation.
    Indeed, NE translation allows access to multilingual information. This
    translation doesn't always lead to the expected result, especially when the NE
    contains a person name. For this reason, and in order to improve translation,
    we can transliterate some part of the NE. In this context, we propose a method that
    integrates translation and transliteration together using the linguistic NooJ
    platform that is based on local grammars and transducers.

  75. Morphological study of Albanian words, and processing with NooJ.

    Authors: Odile Piton, Klara Lagji
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    We are developing electronic dictionaries and transducers for the automatic
    processing of the Albanian Language. We will analyze the words inside a linear
    segment of text. We will also study the relationship between units of sense and
    units of form. The composition of words takes different forms in Albanian. We
    have found that morphemes are frequently concatenated or simply juxtaposed or
    contracted. The inflectional grammar of NooJ allows constructing the dictionaries
    of inflected forms (declensions or conjugations).

  76. Towards Effective Sentence Simplification for Automatic Processing of Biomedical Text.

    Authors: Siddhartha Jonnalagadda, Graciela Gonzalez, Luis Tari, Jorg Hakenberg, Chitta Baral
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    The complexity of sentences characteristic of biomedical articles poses a
    challenge to natural language parsers, which are typically trained on
    large-scale corpora of non-technical text. We propose a text simplification
    process, bioSimplify, that seeks to reduce the complexity of sentences in
    biomedical abstracts in order to improve the performance of syntactic parsers
    on the processed sentences. Syntactic parsing is typically one of the first
    steps in a text mining pipeline. Thus, any improvement in performance would
    have a ripple effect over all processing steps.

  77. Sentence Simplification Aids Protein-Protein Interaction Extraction.

    Authors: Siddhartha Jonnalagadda, Graciela Gonzalez
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Accurate systems for extracting Protein-Protein Interactions (PPIs)
    automatically from biomedical articles can help accelerate biomedical research.
    Biomedical Informatics researchers are collaborating to provide metaservices
    and advance the state of the art in PPI extraction. One problem often neglected by
    current Natural Language Processing systems is the characteristic complexity of
    the sentences in biomedical literature.

  78. Syllable Analysis to Build a Dictation System in Telugu language.

    Authors: N. Kalyani, Dr K. V. N. Sunitha
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    In recent decades, speech interactive systems have gained increasing importance.
    To develop a dictation system like Dragon for Indian languages, it is most
    important to adapt the system to a speaker with minimum training. In this paper
    we focus on the importance of creating a speech database at the syllable level and
    identifying the minimum text to be considered while training any speech recognition
    system. There are systems developed for continuous speech recognition in
    English and in a few Indian languages like Hindi and Tamil.

  79. Speech Recognition by Machine, A Review.

    Authors: M. A. Anusuya, S. K. Katti
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    This paper presents a brief survey on Automatic Speech Recognition and
    discusses the major themes and advances made in the past 60 years of research,
    so as to provide a technological perspective and an appreciation of the
    fundamental progress that has been accomplished in this important area of
    speech communication.

  80. Speech Recognition Oriented Vowel Classification Using Temporal Radial Basis Functions.

    Authors: Mustapha Guezouri, Larbi Mesbahi, Abdelkader Benyettou
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    The recent resurgence of interest in spatio-temporal neural networks as a speech
    recognition tool motivates the present investigation. In this paper an approach
    was developed based on the temporal radial basis function (TRBF), which offers
    several advantages: few parameters, fast convergence and time invariance. This
    application aims to identify vowels taken from natural speech samples from the
    TIMIT corpus of American speech. We report a recognition accuracy of 98.06
    percent in training and 90.13 percent in testing on a subset of 6 vowel phonemes,
    with the possibility of expanding the vowel sets in the future.

  81. A Survey of Paraphrasing and Textual Entailment Methods.

    Authors: Ion Androutsopoulos, Prodromos Malakasiotis
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Paraphrasing methods recognize, generate, or extract phrases, sentences, or
    longer natural language expressions that convey almost the same information.
    Textual entailment methods, on the other hand, recognize, generate, or extract
    pairs of natural language expressions, such that a human who reads (and trusts)
    the first element of a pair would most likely infer that the other element is
    also true. Paraphrasing can be seen as bidirectional textual entailment and
    methods from the two areas are often very similar.

  82. Representing human and machine dictionaries in Markup languages.

    Authors: Laurent Romary, Lothar Lemnitzer, Andreas Witt
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    In this chapter we present the main issues in representing machine readable
    dictionaries in XML, and in particular according to the Text Encoding
    Initiative (TEI) guidelines.

  83. Lexical evolution rates by automated stability measure.

    Authors: Maurizio Serva, Filippo Petroni
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Phylogenetic trees can be reconstructed from the matrix which contains the
    distances between all pairs of languages in a family. Recently, we proposed a
    new method which uses normalized Levenshtein distances among words with the same
    meaning and averages over all the items of a given list. Decisions about the
    number of items in the input lists for language comparison have been debated
    since the beginning of glottochronology. The point is that words associated with
    some of the meanings have a rapid lexical evolution.
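
    The distance described above can be sketched as follows. Normalizing each
    edit distance by the length of the longer word is an assumption made here
    for illustration, since the abstract does not state the normalization;
    the function names and the word lists are likewise hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance divided by the longer word's length (assumed scheme)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def language_distance(words_a, words_b):
    """Average normalized distance over aligned lists of same-meaning words."""
    if len(words_a) != len(words_b) or not words_a:
        raise ValueError("word lists must be non-empty and of equal length")
    total = sum(normalized_distance(x, y) for x, y in zip(words_a, words_b))
    return total / len(words_a)


# Toy aligned lists (hypothetical strings, one entry per meaning).
toy_distance = language_distance(["abc", "x"], ["abd", "x"])
```

    Filling a matrix with `language_distance` for every pair of languages in a
    family gives the kind of input from which a tree can be reconstructed.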

  84. Acquisition d'informations lexicales à partir de corpus.

    Authors: Cédric Messiant, Thierry Poibeau
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    This paper is about automatic acquisition of lexical information from
    corpora, especially subcategorization acquisition.

  85. Hierarchies in Dictionary Definition Space.

    Authors: Olivier Picard, Alexandre Blondin-Masse, Stevan Harnad, Odile Marcotte, Guillaume Chicoisne, Yassine Gargouri
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    A dictionary defines words in terms of other words. Definitions can tell you
    the meanings of words you don't know, but only if you know the meanings of the
    defining words. How many words do you need to know (and which ones) in order to
    be able to learn all the rest from definitions? We reduced dictionaries to
    their "grounding kernels" (GKs), about 10% of the dictionary, from which all
    the other words could be defined. The GK words turned out to have
    psycholinguistic correlates: they were learned at an earlier age and were more
    concrete than the rest of the dictionary.

  86. Valence extraction using EM selection and co-occurrence matrices.

    Authors: Łukasz Dębowski
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    This paper discusses two new procedures for extracting verb valences from raw
    texts, with an application to the Polish language. The first novel technique,
    the EM selection algorithm, performs unsupervised disambiguation of valence
    frame forests, obtained by applying a non-probabilistic deep grammar parser and
    some post-processing to the text. The second new idea concerns filtering of
    incorrect frames detected in the parsed text and is motivated by an observation
    that verbs which take similar arguments tend to have similar frames.

  87. Standardization of the formal representation of lexical information for NLP.

    Authors: Laurent Romary
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    A survey of dictionary models and formats is presented, along with an
    overview of the corresponding recent standardisation activities.

  88. Measuring the Meaning of Words in Contexts: An automated analysis of controversies about Monarch butterflies, Frankenfoods, and stem cells.

    Authors: Loet Leydesdorff, Iina Hellsten
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Co-words have been considered as carriers of meaning across different domains
    in studies of science, technology, and society. Words and co-words, however,
    obtain meaning in sentences, and sentences obtain meaning in their contexts of
    use. At the science/society interface, words can be expected to have different
    meanings: the codes of communication that provide meaning to words differ on
    the varying sides of the interface. Furthermore, meanings and interfaces may
    change over time.

  89. Automated words stability and languages phylogeny.

    Authors: Maurizio Serva, Filippo Petroni
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    The idea of measuring distance between languages seems to have its roots in
    the work of the French explorer Dumont D'Urville (D'Urville 1832). He collected
    comparative word lists of various languages during his voyages aboard the
    Astrolabe from 1826 to 1829 and, in his work about the geographical division of
    the Pacific, he proposed a method to measure the degree of relation among
    languages.

  90. Automated languages phylogeny from Levenshtein distance.

    Authors: Maurizio Serva
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    In order to verify hypotheses concerning the relationship between two languages,
    it is necessary to define and evaluate their distance from lexical differences.
    This concept seems to have its roots in the work of the French explorer Dumont
    D'Urville. He collected comparative word lists of various languages during his
    voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the
    geographical division of the Pacific, he proposed a method to measure the
    degree of relation among languages.

  91. A New Look at the Classical Entropy of Written English.

    Authors: Fabio G. Guerrero
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    A simple method for finding the entropy and redundancy of a reasonably long
    sample of English text is presented. In fact, this method can be extended to
    other Latin languages. Some implications for practical applications such as
    plagiarism-detection software, and the minimum number of words that should be
    used in social Internet network messaging, are discussed. Results on the
    entropy of the English language have been obtained by direct computer
    processing and from first principles according to Shannon theory.
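
    As a rough illustration of the quantities named in this abstract, the
    sketch below estimates Shannon entropy from single-character frequencies
    and derives redundancy from it. This is an assumed, simplified estimator
    and not necessarily the method of the paper; the function names and the
    alphabet_size parameter are hypothetical.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy in bits per character, from symbol frequencies."""
    counts = Counter(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def redundancy(text: str, alphabet_size: int) -> float:
    """Redundancy as 1 - H / H_max, where H_max = log2(alphabet_size)."""
    # Assumption: the caller supplies the alphabet size (e.g. 27 for the
    # 26 English letters plus space).
    if alphabet_size < 2:
        raise ValueError("alphabet_size must be at least 2")
    return 1.0 - char_entropy(text) / math.log2(alphabet_size)
```

    A single-character estimate ignores dependencies between neighbouring
    symbols, so it gives an upper bound on the per-character entropy of real
    text.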

  92. Active Learning for Mention Detection: A Comparison of Sentence Selection Strategies.

    Authors: Nitin Madnani, Hongyan Jing, Nanda Kambhatla, Salim Roukos
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    We propose and compare various sentence selection strategies for active
    learning for the task of detecting mentions of entities. The best strategy
    employs the sum of confidences of two statistical classifiers trained on
    different views of the data. Our experimental results show that, compared to
    the random selection strategy, this strategy reduces the amount of required
    labeled training data by over 50% while achieving the same performance.
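
    One way to read the best-performing strategy is sketched below, under
    stated assumptions: each classifier returns one confidence score per
    sentence, and the sentences with the lowest summed confidence are sent for
    labeling first. The function name and the per-sentence score format are
    hypothetical and are not taken from the paper.

```python
def select_sentences(sentences, conf_a, conf_b, k):
    """Return the k sentences with the lowest summed classifier confidence.

    conf_a[i] and conf_b[i] are the confidences of two classifiers, trained
    on different views of the data, for sentences[i].
    """
    if not (len(sentences) == len(conf_a) == len(conf_b)):
        raise ValueError("inputs must have the same length")
    # Assumption: a low combined confidence marks a sentence as informative.
    ranked = sorted(zip(sentences, conf_a, conf_b), key=lambda t: t[1] + t[2])
    return [sentence for sentence, _, _ in ranked[:k]]
```

    The random-selection baseline mentioned in the abstract would instead draw
    k sentences uniformly from the unlabeled pool.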

  93. Standards for Language Resources.

    Authors: Laurent Romary, Nancy Ide
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    The goal of this paper is two-fold: to present an abstract data model for
    linguistic annotations and its implementation using XML, RDF and related
    standards; and to outline the work of a newly formed committee of the
    International Organization for Standardization (ISO), ISO/TC 37/SC 4 Language Resource
    Management, which will use this work as its starting point.

  94. Resolution of Unidentified Words in Machine Translation.

    Authors: Sana Ullah, Kyung Sup Kwak, M. Asdaque Hussain
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    This paper presents a mechanism for resolving unidentified lexical units in
    text-based machine translation (TBMT). A machine translation system is
    unlikely to have a complete MT lexicon, and hence a mechanism is needed
    to handle the problem of unidentified words. These unknown words could be
    abbreviations, names, acronyms and newly introduced terms. We have proposed an
    algorithm for the resolution of the unidentified words. This algorithm takes
    the discourse unit (primitive discourse) as its unit of analysis and provides
    real-time updates to the lexicon.

  95. A discourse based approach in text-based machine translation.

    Authors: Sana Ullah, Kyung Sup Kwak, M.A. Khan
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    This paper presents a theoretical, research-based approach to ellipsis
    resolution in machine translation. Moreover, the formula of discourse is
    applied in order to resolve ellipses. The validity of the discourse formula is
    analyzed by applying it to real-world text, i.e. newspaper fragments. The
    source text is converted into mono-sentential discourses where complex
    discourses require further dissection either directly into primitive discourses
    or first into compound discourses and later into primitive ones. The procedure
    of dissection needs further improvement i.e.

  96. A New Computational Schema for Euphonic Conjunctions in Sanskrit Processing.

    Authors: N. Rama, Meenakshi Lakshmanan
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Automated language processing is central to the drive to enable facilitated
    referencing of increasingly available Sanskrit e-texts. The first step towards
    processing Sanskrit text involves the handling of Sanskrit compound words that
    are an integral part of Sanskrit texts. This first necessitates the
    processing of euphonic conjunctions or sandhis, which are points in words or
    between words, at which adjacent letters coalesce and transform. The ancient
    Sanskrit grammarian Panini's codification of the Sanskrit grammar is the
    accepted authority in the subject.

  97. ANN-based Innovative Segmentation Method for Handwritten text in Assamese.

    Authors: Kaustubh Bhattacharyya, Kandarpa Kumar Sarma
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Artificial Neural Networks (ANNs) have been widely used for recognition of
    optically scanned characters, which partially emulates human thinking in the
    domain of Artificial Intelligence. But prior to recognition, it is
    necessary to segment the text into sentences, words, and characters.
    Segmentation of words into individual letters has been one of the major
    problems in handwriting recognition. Despite several successful works all over
    the world, development of such tools in specific languages is still an ongoing
    process, especially in the Indian context.

  98. Word Sense Disambiguation Using English-Spanish Aligned Phrases over Comparable Corpora.

    Authors: David Fernandez-Amoros
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    In this paper we describe a WSD experiment based on bilingual English-Spanish
    comparable corpora in which individual noun phrases have been identified and
    aligned with their respective counterparts in the other language. The
    evaluation of the experiment has been carried out against SemCor.

    We show that, with the alignment algorithm employed, potential precision is
    high (74.3%); however, the coverage of the method is low (2.7%), due to
    alignments being far less frequent than we expected.

  99. Word Sense Disambiguation Based on Mutual Information and Syntactic Patterns.

    Authors: David Fernandez-Amoros
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    This paper describes a hybrid system for WSD, presented to the English
    all-words and lexical-sample tasks, that relies on two different unsupervised
    approaches. The first one selects the senses according to mutual information
    proximity between a context word and a variant of the sense. The second heuristic
    analyzes the examples of use in the glosses of the senses so that simple
    syntactic patterns are inferred. These patterns are matched against the
    disambiguation contexts.

  100. The Uned systems at Senseval-2.

    Authors: David Fernandez-Amoros, Julio Gonzalo, Felisa Verdejo
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    We have participated in the SENSEVAL-2 English tasks (all words and lexical
    sample) with an unsupervised system based on mutual information measured over a
    large corpus (277 million words) and some additional heuristics. A supervised
    extension of the system was also presented to the lexical sample task.

  101. A Note On Higher Order Grammar.

    Authors: Victor Gluzberg
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Both syntax-phonology and syntax-semantics interfaces in Higher Order Grammar
    (HOG) are expressed as axiomatic theories in higher-order logic (HOL), i.e. a
    language is defined entirely in terms of provability in the single logical
    system. An important implication of this elegant architecture is that the
    meaning of a valid expression turns out to be represented not by a single, nor
    even by a few "discrete" terms (in case of ambiguity), but by a "continuous"
    set of logically equivalent terms. The note is devoted to precise formulation
    and proof of this observation.

  102. Towards Multimodal Content Representation.

    Authors: Laurent Romary, Harry Bunt
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Multimodal interfaces, combining the use of speech, graphics, gestures, and
    facial expressions in input and output, promise to provide new possibilities to
    deal with information in more effective and efficient ways, supporting for
    instance: - the understanding of possibly imprecise, partial or ambiguous
    multimodal input; - the generation of coordinated, cohesive, and coherent
    multimodal presentations; - the management of multimodal interaction (e.g.,
    task completion, adapting the interface, error prevention) by representing and
    exploiting models of the user, the domain, the task, the intera

  103. Mathematics, Recursion, and Universals in Human Languages.

    Authors: P. Gilkey, S. Lopez Ornat, A. Karousou
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    There are many scientific problems generated by the multiple and conflicting
    alternative definitions of linguistic recursion and human recursive processing
    that exist in the literature. The purpose of this article is to make available
    to the linguistic community the standard mathematical definition of recursion
    and to apply it to discuss linguistic recursion. As a byproduct, we obtain an
    insight into certain "soft universals" of human languages, which are related to
    cognitive constructs necessary to implement mathematical reasoning, i.e.
    mathematical model theory.

  104. Grouping Synonyms by Definitions.

    Authors: Ingrid Falk, Claire Gardent, Evelyne Jacquey, Fabienne Venant
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    We present a method for grouping the synonyms of a lemma according to its
    dictionary senses. The senses are defined by a large machine readable
    dictionary for French, the TLFi (Trésor de la langue française
    informatisé) and the synonyms are given by 5 synonym dictionaries (also for
    French). To evaluate the proposed method, we manually constructed a gold
    standard where for each (word, definition) pair and given the set of synonyms
    defined for that word by the 5 synonym dictionaries, 4 lexicographers specified
    the set of synonyms they judge adequate.

  105. Analyse en dépendances à l'aide des grammaires d'interaction.

    Authors: Jonathan Marchand, Bruno Guillaume, Guy Perrier
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    This article proposes a method to extract dependency structures from
    phrase-structure level parsing with Interaction Grammars. Interaction Grammars
    are a formalism which expresses interactions among words using a polarity
    system. Syntactical composition is led by the saturation of polarities.
    Interactions take place between constituents, but as grammars are lexicalized,
    these interactions can be translated at the level of words. Dependency
    relations are extracted from the parsing process: every dependency is the
    consequence of a polarity saturation.

  106. Marking-up multiple views of a Text: Discourse and Reference.

    Authors: Laurent Romary, Dan Cristea, Nancy Ide
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    We describe an encoding scheme for discourse structure and reference, based
    on the TEI Guidelines and the recommendations of the Corpus Encoding
    Specification (CES). A central feature of the scheme is a CES-based data
    architecture enabling the encoding of and access to multiple views of a
    marked-up document. We describe a tool architecture that supports the encoding
    scheme, and then show how we have used the encoding scheme and the tools to
    perform a discourse analytic task in support of a model of global discourse
    cohesion called Veins Theory (Cristea & Ide, 1998).

  107. A Common XML-based Framework for Syntactic Annotations.

    Authors: Laurent Romary, Nancy Ide, Tomaz Erjavec
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    It is widely recognized that the proliferation of annotation schemes runs
    counter to the need to re-use language resources, and that standards for
    linguistic annotation are becoming increasingly mandatory. To answer this need,
    we have developed a framework comprised of an abstract model for a variety of
    different annotation types (e.g., morpho-syntactic tagging, syntactic
    annotation, co-reference annotation, etc.), which can be instantiated in
    different ways depending on the annotator's approach and goals.

  108. Standards for Language Resources.

    Authors: Laurent Romary, Nancy Ide
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    This paper presents an abstract data model for linguistic annotations and its
    implementation using XML, RDF and related standards, and outlines the work of
    a newly formed committee of the International Organization for Standardization (ISO),
    ISO/TC 37/SC 4 Language Resource Management, which will use this work as its
    starting point. The primary motive for presenting the latter is to solicit the
    participation of members of the research community to contribute to the work of
    the committee.

  109. Implementation of Rule Based Algorithm for Sandhi-Vicheda Of Compound Hindi Words.

    Authors: Priyanka Gupta, Vishal Goyal
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Sandhi means joining two or more words to coin a new word. Sandhi literally
    means `putting together' or combining (of sounds). It denotes all combinatory
    sound-changes effected (spontaneously) for ease of pronunciation.
    Sandhi-vicheda describes [5] the process by which one letter (whether single or
    conjoined) is broken to form two words. Part of the broken letter remains as the
    last letter of the first word and part of the letter forms the first letter of
    the next word.

  110. Reference Resolution within the Framework of Cognitive Grammar.

    Authors: Laurent Romary, Susanne Salmon-Alt
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    Following the principles of Cognitive Grammar, we concentrate on a model for
    reference resolution that attempts to overcome the difficulties of previous
    approaches, based on the fundamental assumption that all reference (independent
    of the type of the referring expression) is accomplished via access to and
    restructuring of domains of reference rather than by direct linkage to the
    entities themselves.

  111. Empowering OLAC Extension using Anusaaraka and Effective text processing using Double Byte coding.

    Authors: B Prabhulla Chandran Pillai
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    The paper reviews the hurdles encountered in implementing the OLAC extension
    for Dravidian / Indian languages. The paper further explores the possibilities
    which could minimise or solve these problems. In this context, the Chinese
    system of text processing and the anusaaraka system are scrutinised.

  112. Multiple Retrieval Models and Regression Models for Prior Art Search.

    Authors: Patrice Lopez, Laurent Romary
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    This paper presents the system called PATATRAS (PATent and Article Tracking,
    Retrieval and AnalysiS) realized for the IP track of CLEF 2009. Our approach
    presents three main characteristics: 1. The usage of multiple retrieval models
    (KL, Okapi) and term index definitions (lemma, phrase, concept) for the three
    languages considered in the present track (English, French, German) producing
    ten different sets of ranked results. 2. The merging of the different results
    based on multiple regression models using an additional validation set created
    from the patent collection. 3.

  113. An OLAC Extension for Dravidian Languages.

    Authors: B Prabhulla Chandran Pillai
    Subjects: Computation and Language (Computational Linguistics and Natural Language and Speech Processing)
    Abstract

    OLAC was founded in 2000 for creating online databases of language resources.
    This paper intends to review the bottom-up distributed character of the project
    and proposes an extension of the architecture for Dravidian languages. An
    ontological structure is considered for effective natural language processing
    (NLP) and its advantages over statistical methods are reviewed.
