Information Retrieval

  1. Semantic Visualization and Navigation in Textual Corpus.

    Authors: Férihane Kboubi, Anja Habacha Chaibi, Mohamed BenAhmed
    Subjects: Information Retrieval
    Abstract

    This paper surveys related work in the information visualization domain and
    studies the actual integration of cartography paradigms in current
    information search systems. Based on this study, we propose a semantic
    visualization and navigation approach that offers users three search modes:
    precise search, connotative search and thematic search.

  2. A Proposed Architecture for Continuous Web Monitoring Through Online Crawling of Blogs.

    Authors: Mohsen Sharifi, Mehdi Naghavi
    Subjects: Information Retrieval
    Abstract

    Being informed in a timely manner of what is published on the Web can greatly
    help psychologists, marketers and political analysts to become familiar
    with, analyse, decide on and act correctly upon society's different needs.
    The great volume of information on the Web prevents us from continuously
    investigating the whole Web online. Focusing on the blogs under
    consideration limits our working domain and makes online crawling of the Web
    possible.

  3. Data Mining as a Torch Bearer in Education Sector.

    Authors: Umesh Kumar Pandey, Brijesh Kumar Bhardwaj, Saurabh Pal
    Subjects: Information Retrieval
    Abstract

    All data contain a great deal of hidden information. The method used to
    process the data determines what type of information they produce. In India,
    the education sector has a lot of data that can produce valuable information.
    This information can be used to increase the quality of education. However,
    educational institutions do not apply any knowledge discovery process to
    these data. Information and communication technology is entering the
    education sector to capture and compile information at low cost.

  4. Collaborative Personalized Web Recommender System using Entropy based Similarity Measure.

    Authors: Harita Mehta, Shveta Kundra Bhatia, Punam Bedi, V. S. Dixit
    Subjects: Information Retrieval
    Abstract

    On the internet, web surfers searching for information always look for
    recommendations. Generating recommendations becomes more difficult because
    of the exponential day-by-day growth of the information domain. In this
    paper, we calculate entropy-based similarity between users to address the
    scalability problem. Using this concept, we have implemented an online
    user-based collaborative web recommender system. In this model-based
    collaborative system, the user session is divided into two levels. Entropy is
    calculated at both levels.

  5. Bengali text summarization by sentence extraction.

    Authors: Kamal Sarkar
    Subjects: Information Retrieval
    Abstract

    Text summarization is a process that produces an abstract or a summary by
    selecting a significant portion of the information from one or more texts. In
    an automatic text summarization process, a text is given to the computer and
    the computer returns a shorter, less redundant extract or abstract of the
    original text(s). Many techniques have been developed for summarizing English
    text(s), but very few attempts have been made at Bengali text summarization.

  6. MultiDendrograms: Variable-Group Agglomerative Hierarchical Clustering.

    Authors: David Torres, Sergio Gomez, Alberto Fernandez, Justo Montiel
    Subjects: Information Retrieval
    Abstract

    MultiDendrograms is an application written in Java that computes
    agglomerative hierarchical clusterings of data. Starting from a distance (or
    weight) matrix, MultiDendrograms is able to calculate its dendrograms using
    the most
    common agglomerative hierarchical clustering methods. The application
    implements a variable-group algorithm that solves the non-uniqueness problem
    found in the standard pair-group algorithm.
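
    The non-uniqueness problem arises when several pairs of clusters are tied at
    the minimum distance, so a pair-group algorithm has to pick one pair
    arbitrarily. The following is a minimal sketch of the grouping idea only,
    assuming that tied clusters are merged together in one step; the function
    name `tied_groups`, the tolerance, and the sample matrix are illustrative
    assumptions and are not taken from MultiDendrograms itself.

```python
from itertools import combinations


def tied_groups(dist, tol=1e-12):
    """Find the clusters that would be merged together in one step.

    dist is a symmetric distance matrix (list of lists). All pairs tied at the
    minimum distance are collected and joined into connected groups, so a group
    may contain more than two clusters.
    """
    n = len(dist)
    pairs = list(combinations(range(n), 2))
    d_min = min(dist[i][j] for i, j in pairs)
    tied = [(i, j) for i, j in pairs if abs(dist[i][j] - d_min) <= tol]

    # Union-find over the tied pairs to build connected groups.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in tied:
        parent[find(i)] = find(j)

    groups = {}
    for i, j in tied:
        for k in (i, j):
            groups.setdefault(find(k), set()).add(k)
    return d_min, sorted(sorted(g) for g in groups.values())


# Hypothetical matrix: d(0,1) == d(1,2) == 2 is a tie at the minimum.
dist = [
    [0, 2, 4, 7],
    [2, 0, 2, 7],
    [4, 2, 0, 7],
    [7, 7, 7, 0],
]
d_min, groups = tied_groups(dist)
print(d_min, groups)
```

    A pair-group method would have to choose between merging (0, 1) and (1, 2)
    first; grouping all tied clusters at once avoids that arbitrary choice.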

  7. Spam filtering by quantitative profiles.

    Authors: M. Grendár, J. Škutová, V. Špitalský
    Subjects: Information Retrieval
    Abstract

    Instead of the 'bag-of-words' representation, in the quantitative profile
    approach to spam filtering and email categorization, an email is represented by
    an m-dimensional vector of numbers, with m fixed in advance. Inspired by Sroufe
    et al. [Sroufe, P., Phithakkitnukoon, S., Dantu, R., and Cangussu, J. (2010).
    Email shape analysis. In LNCS, 5935, pp. 18-29], two instances of
    quantitative profiles are considered: line profile and character profile.
    Performance of these profiles is studied on the TREC 2007, CEAS 2008 and a
    private corpus.

  8. Recommendation systems: a joint analysis of technical aspects with marketing implications.

    Authors: Vafopoulos Michalis, Oikonomou Michael
    Subjects: Information Retrieval
    Abstract

    In 2010, Web users ordered 73 items per second on Amazon alone and
    contributed reviews about their consumption experience on a massive scale. As
    the Web matures and becomes social and participatory, collaborative filters
    are the basic complement in searching online information about people, events
    and products. In Web 2.0, what connected consumers create is not simply
    content (e.g. reviews) but context. This new contextual framework of
    consumption emerges through the aggregation and collaborative filtering of
    personal preferences about goods on the Web at massive scale.

  9. Learning Context for Text Categorization.

    Authors: Y.V. Haribhakta, Dr. Parag Kulkarni
    Subjects: Information Retrieval
    Abstract

    This paper describes our work on discovering context for text document
    categorization. The document categorization approach is derived from a
    combination of a learning paradigm known as relation extraction and a
    technique known as context discovery. We demonstrate the effectiveness of our
    categorization approach using the Reuters-21578 dataset and synthetic
    real-world data from the sports domain. Our experimental results indicate
    that the learned context greatly improves the categorization performance as
    compared to traditional categorization approaches.

  10. Document Classification Using Expectation Maximization with Semi Supervised Learning.

    Authors: Bhawna Nigam, Poorvi Ahirwal, Sonal Salve, Swati Vamney
    Subjects: Information Retrieval
    Abstract

    As the number of online documents increases, the demand for document
    classification to aid the analysis and management of documents is increasing.
    Text is cheap, but information, in the form of knowing what classes a document
    belongs to, is expensive. The main purpose of this paper is to explain the
    expectation maximization technique of data mining for classifying documents
    and to learn how to improve accuracy when using a semi-supervised approach.
    The expectation maximization algorithm is applied with both supervised and
    semi-supervised approaches.

  11. A Framework for Picture Extraction on Search Engine Improved and Meaningful Result.

    Authors: Anamika Sharma
    Subjects: Information Retrieval
    Abstract

    Searching is an important tool for gathering information. If the information
    is in the form of a picture, it plays a major role in enabling quick action
    and is easy to memorize. Humans tend to retain more from pictures than from
    text. The complexity and variety of queries can cause variation in the
    results, leading users either to learn something new or to become confused.

  12. Towards "Intelligent Compression" in Streams: A Biased Reservoir Sampling based Bloom Filter Approach.

    Authors: Sourav Dutta, Souvik Bhattacherjee, Ankur Narang
    Subjects: Information Retrieval
    Abstract

    With the explosion of information stored world-wide, data intensive computing
    has become a central area of research. Efficient management and processing of
    this massively exponential amount of data from diverse sources, such as
    telecommunication call data records, online transaction records, etc., has
    become a necessity. Removing redundancy from such huge (multi-billion records)
    datasets resulting in resource and compute efficiency for downstream
    processing constitutes an important area of study.
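
    As a rough illustration of how a Bloom filter can drop duplicate records
    from a stream, here is a minimal sketch of a plain Bloom filter. It is not
    the biased reservoir sampling variant named in the title; the class name,
    the bit-array size, the number of hash functions, and the sample records are
    all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: no false negatives, a small false-positive rate."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter packs bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def deduplicate(stream, bloom):
    """Keep a record only if the filter has (probably) not seen it before."""
    unique = []
    for record in stream:
        if record not in bloom:
            bloom.add(record)
            unique.append(record)
    return unique


# Hypothetical call data record identifiers with two repeats.
records = ["call-001", "call-002", "call-001", "call-003", "call-002"]
bloom = BloomFilter()
unique_records = deduplicate(records, bloom)
print(unique_records)
```

    Because a Bloom filter can report false positives, a distinct record may
    occasionally be dropped as a duplicate; the filter size and hash count trade
    memory against that error rate.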

  13. Visualizing Domain Ontology using Enhanced Anaphora Resolution Algorithm.

    Authors: L.Jegatha Deborah, R.Baskaran, A.Kannan
    Subjects: Information Retrieval
    Abstract

    The number of World Wide Web pages grows enormously every day, and since the
    efficiency of most information processing systems is found to be low, the
    potential of Internet applications is often underutilized. The web can be
    used efficiently when similar web pages are rigorously and exhaustively
    organized and clustered based on some domain knowledge (semantic-based).
    Ontology, which is a formal representation of domain knowledge, aids in such
    efficient utilization.

  14. Finding missing edges and communities in incomplete networks.

    Authors: Bowen Yan, Steve Gregory
    Subjects: Information Retrieval
    Abstract

    Many algorithms have been proposed for predicting missing edges in networks,
    but they do not usually take account of which edges are missing. We focus on
    networks which have missing edges of the form that is likely to occur in real
    networks, and compare algorithms that find these missing edges. We also
    investigate the effect of this kind of missing data on community detection
    algorithms.

  15. Probability Ranking in Vector Spaces.

    Authors: Massimo Melucci
    Subjects: Information Retrieval
    Abstract

    The Probability Ranking Principle states that the document set with the
    highest values of probability of relevance optimizes information retrieval
    effectiveness given the probabilities are estimated as accurately as possible.
    The key point of the principle is the separation of the document set into two
    subsets with a given level of fallout and with the highest recall.
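
    The ranking step and the two measures named above can be sketched as
    follows. This is a minimal illustration under stated assumptions: the
    probabilities of relevance are taken as given rather than estimated, and
    the document identifiers, values, and function names are hypothetical.

```python
def rank_by_probability(prob_relevance):
    """Order document ids by decreasing estimated probability of relevance."""
    return sorted(prob_relevance, key=prob_relevance.get, reverse=True)


def recall_and_fallout(retrieved, relevant, collection):
    """Recall = retrieved relevant / all relevant.
    Fallout = retrieved non-relevant / all non-relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    non_relevant = set(collection) - relevant
    recall = len(retrieved & relevant) / len(relevant)
    fallout = len(retrieved & non_relevant) / len(non_relevant)
    return recall, fallout


# Hypothetical probabilities and relevance judgments.
probs = {"d1": 0.9, "d2": 0.2, "d3": 0.7, "d4": 0.4}
relevant = {"d1", "d3"}

ranking = rank_by_probability(probs)
# Split the collection into a retrieved subset (top 2) and the rest.
recall, fallout = recall_and_fallout(ranking[:2], relevant, probs)
print(ranking, recall, fallout)
```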

  16. Characterization and exploitation of community structure in cover song networks.

    Authors: Joan Serrà, Massimiliano Zanin, Perfecto Herrera, Xavier Serra
    Subjects: Information Retrieval
    Abstract

    The use of community detection algorithms is explored within the framework of
    cover song identification, i.e. the automatic detection of different audio
    renditions of the same underlying musical piece. Until now, this task has been
    posed as a typical query-by-example task, where one submits a query song and
    the system retrieves a list of possible matches ranked by their similarity to
    the query. In this work, we propose a new approach which uses song communities
    to provide more relevant answers to a given query.

  17. Structured Learning of Two-Level Dynamic Rankings.

    Authors: Karthik Raman, Thorsten Joachims, Pannaga Shivaswamy
    Subjects: Information Retrieval
    Abstract

    For ambiguous queries, conventional retrieval systems are bound by two
    conflicting goals. On the one hand, they should diversify and strive to present
    results for as many query intents as possible. On the other hand, they should
    provide depth for each intent by displaying more than a single result. Since
    both diversity and depth cannot be achieved simultaneously in the conventional
    static retrieval model, we propose a new dynamic ranking approach.

  18. A Personalized System for Conversational Recommendations.

    Authors: M. H. Goker, P. Langley, C. A. Thompson
    Subjects: Information Retrieval
    Abstract

    Searching for and making decisions about information is becoming increasingly
    difficult as the amount of information and number of choices increase.
    Recommendation systems help users find items of interest of a particular type,
    such as movies or restaurants, but are still somewhat awkward to use. Our
    solution is to take advantage of the complementary strengths of personalized
    recommendation systems and dialogue systems, creating personalized aides.

  19. PRESY: A Context Based Query Reformulation Tool for Information Retrieval on the Web.

    Authors: Abdelkrim Bouramoul, Mohamed-Khireddine Kholladi, Bich-Lien Doan
    Subjects: Information Retrieval
    Abstract

    Problem Statement: The huge amount of information on the web as well as the
    growing number of new inexperienced users create new challenges for
    information retrieval. It has become increasingly difficult for these users
    to find relevant documents that satisfy their individual needs. Current
    search engines (such as Google, Bing and Yahoo) certainly offer an efficient
    way to browse web content. However, the quality of the results depends
    heavily on user queries, which need to be more precise to find relevant
    documents.

  20. Exploiting Conceptual Knowledge for Querying Information Systems.

    Authors: Joachim Selke, Wolf-Tilo Balke
    Subjects: Information Retrieval
    Abstract

    Whereas today's information systems are well-equipped for efficient query
    handling, their strict mathematical foundations hamper their use for everyday
    tasks. In daily life, people expect information to be offered in a personalized
    and focused way. But currently, personalization in digital systems still only
    takes explicit knowledge into account and does not yet process conceptual
    information often naturally implied by users. We discuss how to bridge the gap
    between users and today's systems, building on results from cognitive
    psychology.

  21. Efficient Diversification of Web Search Results.

    Authors: Gabriele Capannini, Franco Maria Nardini, Raffaele Perego, Fabrizio Silvestri
    Subjects: Information Retrieval
    Abstract

    In this paper we analyze the efficiency of various search results
    diversification methods. While efficacy of diversification approaches has been
    deeply investigated in the past, response time and scalability issues have been
    rarely addressed. A unified framework for studying performance and feasibility
    of result diversification solutions is thus proposed. First we define a new
    methodology for detecting when, and how, query results need to be diversified.
    To this end, we rely on the concept of "query refinement" to estimate the
    probability that a query is ambiguous.

  22. Compressed k2-Triples for Full-In-Memory RDF Engines.

    Authors: Sandra Álvarez-García, Nieves R. Brisaboa, Javier D. Fernández, Miguel A. Martínez-Prieto
    Subjects: Information Retrieval
    Abstract

    The current "data deluge" has flooded the Web of Data with very large RDF
    datasets. They are hosted and queried through SPARQL endpoints which act as
    nodes of a semantic net built on the principles of the Linked Data project.
    Although this is a realistic philosophy for global data publishing, its query
    performance is diminished when the RDF engines (behind the endpoints) manage
    these huge datasets. Their indexes cannot be fully loaded in main memory, hence
    these systems need to perform slow disk accesses to solve SPARQL queries.

  23. Methods of Hierarchical Clustering.

    Authors: Fionn Murtagh, Pedro Contreras
    Subjects: Information Retrieval
    Abstract

    We survey agglomerative hierarchical clustering algorithms and discuss
    efficient implementations that are available in R and other software
    environments. We look at hierarchical self-organizing maps, and mixture models.
    We review grid-based clustering, focusing on hierarchical density-based
    approaches. Finally we describe a recently developed very efficient (linear
    time) hierarchical clustering algorithm, which can also be viewed as a
    hierarchical grid-based algorithm.

  24. Multi-representation d'une ontologie : OWL, bases de donnees, systèmes de types et d'objets.

    Authors: Mireille Arnoux, Thierry Despeyroux
    Subjects: Information Retrieval
    Abstract

    Due to the emergence of the semantic Web and the increasing need to formalize
    human knowledge, ontology engineering is now an important activity. But is
    this activity very different from others, such as software engineering? In
    this paper, we investigate analogies between ontologies on the one hand and
    types, objects and databases on the other, taking into account the notion of
    evolution of an ontology. We represent a single ontology using different
    paradigms, and observe that the distance between these different concepts is
    small.

  25. From Dirac Notation to Probability Bracket Notation: Term Vector Space, Concept Fock Space and Probabilistic IR Models.

    Authors: Xing M. Wang
    Subjects: Information Retrieval
    Abstract

    After a brief introduction to Probability Bracket Notation (PBN) for discrete
    random variables in time-independent probability spaces, we apply both PBN and
    Dirac notation to investigate probabilistic modeling for information retrieval
    (IR). We derive the ranking formulas for various probabilistic models, induced
    by Term Vector Space (TVS) and by Concept Fock Space (CFS). The ranking
    formulas are naturally expressed in term frequencies; and, because our formulas
    for inference network models (INM) are symmetric, they can also be used to rank
    closeness of documents.

  26. A New Semantic Web Approach for Constructing, Searching and Modifying Ontology Dynamically.

    Authors: Debajyoti Mukhopadhyay, Chandrima Chakrabarti, Sounak Chakravorty
    Subjects: Information Retrieval
    Abstract

    The semantic web is the next generation web, which concerns the meaning of
    web documents. It has the immense power to pull out the most relevant
    information from web pages, which is also meaningful to any user, using
    software agents. At present, agent communication is not possible if the
    ontology concerned is changed even slightly. We have identified this problem
    and developed an Ontology Purification System to help agent communication. In
    our system, users can send queries and view the search results. If a query
    cannot meet the criteria, the system finds the mismatched elements.

  27. Collaborative Filtering without Explicit Feedbacks for Digital Recorders.

    Authors: Giancarlo Ruffo, Alessandro Basso, Marco Milanesio, André Panisson
    Subjects: Information Retrieval
    Abstract

    Recommendation is usually reduced to a prediction problem over the function
    $r(u_a, e_i)$ that returns the expected rating of element $e_i$ for user $u_a$.
    In the IPTV domain, we deal with an environment where the definitions of all
    the parameters involved in this function (i.e., user profiles, feedback ratings
    and elements) are controversial.

  28. A Science Model Driven Retrieval Prototype.

    Authors: Philipp Schaer, Philipp Mayr, Peter Mutschke
    Subjects: Information Retrieval
    Abstract

    This paper is about a better understanding of the structure and dynamics of
    science and the use of these insights to compensate for the typical problems
    that arise in metadata-driven Digital Libraries. Three science model driven
    retrieval services are presented: co-word analysis based query expansion,
    re-ranking via Bradfordizing and author centrality.
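
    As a rough sketch of re-ranking via Bradfordizing, assuming it means moving
    documents from the journals that contribute the most hits to a result set
    (the "core" journals) to the top of the list: the function name, the
    tie-breaking rule (keep the original order), and the sample data are
    assumptions, not details from the paper.

```python
from collections import Counter


def bradfordize(results):
    """Re-rank (doc_id, journal) pairs by journal frequency in the result set.

    sorted() is stable, so documents from equally productive journals keep
    their original relative order.
    """
    freq = Counter(journal for _, journal in results)
    return sorted(results, key=lambda item: -freq[item[1]])


# Hypothetical result list in its original rank order.
results = [
    ("d1", "J-A"),
    ("d2", "J-B"),
    ("d3", "J-B"),
    ("d4", "J-C"),
    ("d5", "J-B"),
]
reranked = bradfordize(results)
print([doc for doc, _ in reranked])
```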

  29. Element Retrieval using Namespace Based on keyword search over XML Documents.

    Authors: Yang Wang, Zhikui Chen, Xiaodi Huang
    Subjects: Information Retrieval
    Abstract

    Querying over XML elements using keyword search is steadily gaining
    popularity. The traditional similarity measure is widely employed in order to
    effectively retrieve various XML documents. A number of authors have already
    proposed different similarity-measure methods that take advantage of the
    structure and content of XML documents. They do not, however, consider the
    similarity between latent semantic information of element texts and that of
    keywords in a query.

  30. A robust ranking algorithm to spamming.

    Authors: Tao Zhou, Yanbo Zhou, Ting Lei
    Subjects: Information Retrieval
    Abstract

    The ranking problem of web-based rating systems has attracted much
    attention. A good ranking algorithm should be robust against spammer attacks.
    Here we propose a correlation-based reputation algorithm to solve the ranking
    problem of such rating systems, in which users rate objects. In this
    algorithm, the reputation of a user is iteratively determined by the
    correlation coefficient between his/her rating vector and the corresponding
    objects' weighted average rating vector.
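
    The iteration described above can be sketched as follows. This is a minimal
    illustration under stated assumptions, not the authors' exact procedure:
    reputations start at 1, the Pearson coefficient is used, negative
    correlations are clipped to zero, and a fixed number of iterations replaces
    a convergence test. User names and ratings are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def iterate_reputation(ratings, iterations=20):
    """ratings maps user -> {object: rating}. Returns (reputation, quality)."""
    reputation = {u: 1.0 for u in ratings}
    objects = sorted({o for r in ratings.values() for o in r})
    quality = {}
    for _ in range(iterations):
        # Object quality: reputation-weighted average of its ratings.
        for o in objects:
            raters = [u for u in ratings if o in ratings[u]]
            total = sum(reputation[u] for u in raters)
            if total > 0:
                quality[o] = sum(reputation[u] * ratings[u][o] for u in raters) / total
            else:
                # Assumption: fall back to the plain mean if all weights are 0.
                quality[o] = sum(ratings[u][o] for u in raters) / len(raters)
        # User reputation: correlation between own ratings and object quality.
        for u, r in ratings.items():
            rated = sorted(r)
            corr = pearson([r[o] for o in rated], [quality[o] for o in rated])
            reputation[u] = max(corr, 0.0)  # assumption: clip negatives to 0
    return reputation, quality


ratings = {
    "alice": {"a": 5, "b": 3, "c": 1},
    "bob": {"a": 4, "b": 3, "c": 2},
    "spam": {"a": 1, "b": 2, "c": 5},  # rates against the consensus
}
reputation, quality = iterate_reputation(ratings)
print(reputation, quality)
```

    In this toy example the user who rates against the consensus ends up with
    zero reputation, so the object scores are driven by the other two users.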

  31. Rules of Thumb for Information Acquisition from Large and Redundant Data.

    Authors: Wolfgang Gatterbauer
    Subjects: Information Retrieval
    Abstract

    We develop an abstract model of information acquisition from redundant data.
    We assume a random sampling process from data which provide information with
    bias and are interested in the fraction of information we expect to learn as
    a function of (i) the sampled fraction (recall) and (ii) varying bias of
    information (redundancy distributions). We develop two rules of thumb with
    varying robustness.

  32. SPARQL Assist Language-Neutral Query Composer.

    Authors: Luke McCarthy, Ben Vandervalk, Mark Wilkinson
    Subjects: Information Retrieval
    Abstract

    SPARQL query composition is difficult for the lay-person or even the
    experienced bioinformatician in cases where the data model is unfamiliar.
    Established best-practices and internationalization concerns dictate that
    semantic web ontologies should use terms with opaque identifiers, further
    complicating the task. We present SPARQL Assist: a web application that
    addresses these issues by providing context-sensitive type-ahead completion to
    existing web forms.

  33. A Concept Annotation System for Clinical Records.

    Authors: Ning Kang, Rogier Barendse, Zubair Afzal, Bharat Singh, Martijn J. Schuemie, Erik M. van Mulligen, Jan A. Kors
    Subjects: Information Retrieval
    Abstract

    Unstructured information comprises a valuable source of data in clinical
    records. For text mining in clinical records, concept extraction is the first
    step in finding assertions and relationships. This study presents a system
    developed for the annotation of medical concepts, including medical problems,
    tests, and treatments, mentioned in clinical records. The system combines six
    publicly available named entity recognition systems into one framework, and
    uses a simple voting scheme that allows the precision and recall of the
    system to be tuned to specific needs.
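
    How a vote threshold trades precision against recall can be sketched as
    follows. This is a minimal illustration, assuming each system outputs
    annotations as (start, end, type) tuples and that an annotation is kept when
    at least `min_votes` systems agree on it exactly; the three toy systems and
    the gold standard are hypothetical.

```python
from collections import Counter


def vote(annotations_per_system, min_votes):
    """Keep annotations proposed by at least min_votes systems."""
    counts = Counter()
    for annotations in annotations_per_system:
        counts.update(set(annotations))  # one vote per system per annotation
    return {a for a, c in counts.items() if c >= min_votes}


def precision_recall(predicted, gold):
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


gold = {(0, 8, "problem"), (15, 22, "test")}
systems = [
    [(0, 8, "problem"), (15, 22, "test"), (30, 35, "treatment")],
    [(0, 8, "problem")],
    [(0, 8, "problem"), (15, 22, "test")],
]

# A low threshold favors recall; a high threshold favors precision.
loose = precision_recall(vote(systems, min_votes=1), gold)
strict = precision_recall(vote(systems, min_votes=3), gold)
print(loose, strict)
```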

  34. User Centered and Ontology Based Information Retrieval System for Life Sciences.

    Authors: Sylvie Ranwez, Vincent Ranwez, Mohameth-François Sy, Jacky Montmain, Michel Crampes
    Subjects: Information Retrieval
    Abstract

    Because of the increasing amount of electronic data, designing efficient
    tools to retrieve and exploit documents is a major challenge. Current search
    engines suffer from two main drawbacks: there is limited interaction with the
    list of retrieved documents and no explanation for their adequacy to the query.
    Users may thus be confused by the selection and have no idea how to adapt their
    query so that the results match their expectations. This paper describes a
    request method and an environment based on aggregating models to assess the
    relevance of documents annotated by concepts of an ontology.

  35. Building conceptual spaces for exploring and linking biomedical resources.

    Authors: R. Berlanga, E. Jimenez-Ruiz, V. Nebot
    Subjects: Information Retrieval
    Abstract

    The establishment of links between data (e.g., patient records) and Web
    resources (e.g., literature) and the proper visualization of such discovered
    knowledge is still a challenge in most Life Science domains (e.g.,
    biomedicine). In this paper we present our contribution to the community in the
    form of an infrastructure to annotate information resources, to discover
    relationships among them, and to represent and visualize the newly discovered
    knowledge. Furthermore, we have also implemented a Web-based prototype tool
    which integrates the proposed infrastructure.

  36. Semantic Content Filtering with Wikipedia and Ontologies.

    Authors: Pekka Malo, Pyry Siitari, Oskar Ahlgren, Jyrki Wallenius, Pekka Korhonen
    Subjects: Information Retrieval
    Abstract

    The use of domain knowledge is generally found to improve query efficiency in
    content filtering applications. In particular, tangible benefits have been
    achieved when using knowledge-based approaches within more specialized fields,
    such as medical free texts or legal documents. However, the problem is that
    sources of domain knowledge are time-consuming to build and equally costly to
    maintain.

  37. A New Email Retrieval Ranking Approach.

    Authors: Samir AbdelRahman, Basma Hassan, Reem Bahgat
    Subjects: Information Retrieval
    Abstract

    The email retrieval task has recently received much attention as a way to
    help the user retrieve the email(s) related to a submitted query. To our
    knowledge, existing email retrieval ranking approaches sort the retrieved
    emails based on some heuristic rules, which are either search clues or some
    predefined user criteria rooted in email fields. Unfortunately, the user
    usually does not know the effective rule that yields the best ranking for his
    or her query. This paper presents a new email retrieval ranking approach to
    tackle this problem.

  38. A Distributed Metadata Management, Data Discovery and Access System.

    Authors: Giriprakash Palanisamy, Ranjeet Devarakonda, Jim Green, Bruce Wilson
    Subjects: Information Retrieval
    Abstract

    Mercury is a federated metadata harvesting, search and retrieval tool based
    on both open source software and software developed at Oak Ridge National
    Laboratory. It
    was originally developed for NASA, and the Mercury development consortium now
    includes funding from NASA, USGS, and DOE. A major new version of Mercury was
    developed during 2007. This new version provides orders of magnitude
    improvements in search speed, support for additional metadata formats,
    integration with Google Maps for spatial queries, support for RSS delivery of
    search results, among other features.

  39. Implications of Inter-Rater Agreement on a Student Information Retrieval Evaluation.

    Authors: Philipp Schaer, Philipp Mayr, Peter Mutschke
    Subjects: Information Retrieval
    Abstract

    This paper is about an information retrieval evaluation of three different
    retrieval-supporting services. All three services were designed to compensate
    for typical problems that arise in metadata-driven Digital Libraries, which
    are not adequately handled by a simple tf-idf based retrieval. The services
    are: (1) a co-word analysis based query expansion mechanism and re-ranking
    via (2) Bradfordizing and (3) author centrality. The services are evaluated
    with relevance assessments conducted by 73 information science students.

  40. Demonstrating a Service-Enhanced Retrieval System.

    Authors: Philipp Schaer, Philipp Mayr, Peter Mutschke
    Subjects: Information Retrieval
    Abstract

    This paper is a short description of an information retrieval system enhanced
    by three model driven retrieval services: (1) co-word analysis based query
    expansion, re-ranking via (2) Bradfordizing and (3) author centrality. Each
    service favors quite different, but still relevant, documents compared with
    pure term-frequency based rankings. The services can be interactively
    combined with one another to allow an iterative retrieval refinement.

  41. Text Categorization using Association Rule and Naive Bayes Classifier.

    Authors: Chowdhury Mofizur Rahman, S M Kamruzzaman
    Subjects: Information Retrieval
    Abstract

    As the amount of online text increases, the demand for text categorization to
    aid the analysis and management of text is increasing. Text is cheap, but
    information, in the form of knowing what classes a text belongs to, is
    expensive. Automatic categorization of text can provide this information at low
    cost, but the classifiers themselves must be built with expensive human effort,
    or trained from texts which have themselves been manually classified. Text
    categorization using Association Rule and Naïve Bayes Classifier is proposed
    here.

  42. Text Classification using Data Mining.

    Authors: S. M. Kamruzzaman, Ahmed Ryadh Hasan, Farhana Haider
    Subjects: Information Retrieval
    Abstract

    Text classification is the process of classifying documents into predefined
    categories based on their content. It is the automated assignment of natural
    language texts to predefined categories. Text classification is the primary
    requirement of text retrieval systems, which retrieve texts in response to a
    user query, and text understanding systems, which transform text in some way
    such as producing summaries, answering questions or extracting data. Existing
    supervised learning algorithms to automatically classify text need sufficient
    documents to learn accurately.

  43. Text Classification using Association Rule with a Hybrid Concept of Naive Bayes Classifier and Genetic Algorithm.

    Authors: S. M. Kamruzzaman, Ahmed Ryadh Hasan, Farhana Haider
    Subjects: Information Retrieval
    Abstract

    Text classification is the automated assignment of natural language texts to
    predefined categories based on their content. Text classification is the
    primary requirement of text retrieval systems, which retrieve texts in response
    to a user query, and text understanding systems, which transform text in some
    way such as producing summaries, answering questions or extracting data.
    Nowadays the demand for text classification is increasing tremendously.
    Taking this demand into consideration, new and updated techniques are being
    developed for the purpose of automated text classification.

  44. Text Classification using Artificial Intelligence.

    Authors: S. M. Kamruzzaman
    Subjects: Information Retrieval
    Abstract

    Text classification is the process of classifying documents into predefined
    categories based on their content. It is the automated assignment of natural
    language texts to predefined categories. Text classification is the primary
    requirement of text retrieval systems, which retrieve texts in response to a
    user query, and text understanding systems, which transform text in some way
    such as producing summaries, answering questions or extracting data. Existing
    supervised learning algorithms for classifying text need sufficient documents
    to learn accurately.

  45. Probabilistic Models over Ordered Partitions with Application in Learning to Rank.

    Authors: Tran The Truyen, Dinh Q. Phung, Svetha Venkatesh
    Subjects: Information Retrieval
    Abstract

    This paper addresses the general problem of modelling and learning rank data
    with ties. We propose a probabilistic generative model that models the process
    as permutations over partitions. This results in a super-exponential
    combinatorial state space with unknown numbers of partitions and unknown
    ordering among them. We approach the problem from discrete choice theory,
    where subsets are chosen in a stagewise manner, reducing the state space at
    each stage significantly. Further, we show that with suitable parameterisation,
    we can still learn the models in linear time.
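
    To make the size of the state space concrete, the sketch below counts the
    rankings of n items when ties are allowed, i.e. the ordered set partitions.
    It is only a counting illustration and not the authors' model; the function
    name is hypothetical.

```python
from math import comb


def ordered_partitions(n: int) -> int:
    """Count rankings of n items with ties allowed (ordered set partitions)."""
    counts = [1]  # exactly one way to rank zero items
    for m in range(1, n + 1):
        # Pick the k items tied in first place, then rank the remaining m - k.
        counts.append(sum(comb(m, k) * counts[m - k] for k in range(1, m + 1)))
    return counts[n]


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, ordered_partitions(n))
```

    The counts grow faster than n!, which is why choosing subsets stage by stage
    matters for tractability.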

  46. Recommender Systems by means of Information Retrieval.

    Authors: Alberto Costa, Fabio Roda
    Subjects: Information Retrieval
    Abstract

    In this paper we present a method for reformulating the Recommender Systems
    problem as an Information Retrieval one. In our tests we have a dataset of
    users who give ratings for some movies; we hide some values from the dataset,
    and we try to predict them again using its remaining portion (the so-called
    "leave-n-out approach").

  47. An Architecture of Active Learning SVMs with Relevance Feedback for Classifying E-mail.

    Authors: Md. Saiful Islam, Md. Iftekharul Amin
    Subjects: Information Retrieval
    Abstract

    In this paper, we propose an architecture of active learning SVMs with
    relevance feedback (RF) for classifying e-mail. This architecture combines
    active learning strategies, where instead of using a randomly selected training
    set the learner has access to a pool of unlabeled instances and can request
    the labels of some number of them, with relevance feedback, where if any mail
    is misclassified the next set of support vectors will differ from the
    present set, and otherwise the next set will not change.

  48. A high speed unsupervised speaker retrieval using vector quantization and second-order statistics.

    Authors: Konstantin Biatov
    Subjects: Information Retrieval
    Abstract

    This paper describes an effective unsupervised method for query-by-example
    speaker retrieval. We assume that each audio file or audio segment contains
    only one speaker. The audio data are modeled using a common universal codebook.
    The codebook is based on bag-of-frames (BOF). The features corresponding to the
    audio frames are extracted from all audio files. These features are grouped
    into clusters using the K-means algorithm. The individual audio files are
    modeled by the normalized distribution of the numbers of cluster bins
    corresponding to this file.
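
    The bag-of-frames representation described above can be sketched as follows.
    This is a minimal illustration that assumes the codebook (cluster centres)
    has already been computed, for example by K-means; the function names and
    the toy two-dimensional features are hypothetical, and the second-order
    statistics mentioned in the title are not covered.

```python
def nearest_codeword(frame, codebook):
    """Return the index of the codebook entry closest to the frame (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((f - c) ** 2 for f, c in zip(frame, codebook[i])),
    )


def bof_histogram(frames, codebook):
    """Normalized histogram of codeword assignments for one audio file."""
    counts = [0] * len(codebook)
    for frame in frames:
        counts[nearest_codeword(frame, codebook)] += 1
    total = len(frames)
    if total == 0:
        return [0.0] * len(codebook)
    return [c / total for c in counts]


if __name__ == "__main__":
    codebook = [(0.0, 0.0), (10.0, 10.0)]  # toy cluster centres
    frames = [(1.0, 1.0), (0.0, 2.0), (9.0, 9.0), (11.0, 10.0)]
    print(bof_histogram(frames, codebook))
```

    Two files can then be compared by any distance between their histograms.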

  49. Machine Science in Biomedicine: Practicalities, Pitfalls and Potential.

    Authors: T W Kelsey, W H B Wallace
    Subjects: Information Retrieval
    Abstract

    Machine Science, or Data-driven Research, is a new and interesting scientific
    methodology that uses advanced computational techniques to identify, retrieve,
    classify and analyse data in order to generate hypotheses and develop models.
    In this paper we describe three recent biomedical Machine Science studies, and
    use these to assess the current state of the art with specific emphasis on data
    mining, data assessment, costs, limitations, skills and tool support.

  50. Improved Fast Similarity Search in Dictionaries.

    Authors: Peter Sanders, Daniel Karch, Dennis Luxen
    Subjects: Information Retrieval
    Abstract

    We engineer an algorithm to solve the approximate dictionary matching
    problem. Given a list of words $\mathcal{W}$, a maximum distance $d$ fixed at
    preprocessing time and a query word $q$, we would like to retrieve all words
    from $\mathcal{W}$ that can be transformed into $q$ with $d$ or fewer edit
    operations. We present data structures that support fault tolerant queries by
    generating an index. On top of that, we present a generalization of the method
    that eases memory consumption and preprocessing time significantly. At the same
    time, running times of queries are virtually unaffected.
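
    For reference, the problem statement above can be written as a brute-force
    baseline. This sketch scans the whole word list with the standard
    Levenshtein dynamic programme; it is not the indexed data structure
    engineered in the paper, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def approximate_matches(words, q, d):
    """All words within edit distance d of the query q (linear scan)."""
    return [w for w in words if edit_distance(w, q) <= d]


if __name__ == "__main__":
    print(approximate_matches(["tree", "three", "free", "street"], "tree", 1))
```

    An index trades preprocessing time and memory for avoiding this scan at
    query time.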

  51. Designing a Dynamic Components and Agent based Approach for Semantic Information Retrieval.

    Authors: Zeeshan Ahmed, Detlef Gerhard
    Subjects: Information Retrieval
    Abstract

    In this paper, based on agent and semantic web technologies, we propose an
    approach, i.e., Semantic Oriented Agent Based Search (SOAS), to cope with
    currently existing challenges of metadata extraction, modeling and information
    retrieval over the web. SOAS is designed around four major requirements,
    i.e., automatic user request handling; dynamic unstructured full text reading;
    analysing and modeling; and semantic query generation with an optimized result
    classifier.

  52. Cross-Lingual Adaptation using Structural Correspondence Learning.

    Authors: Peter Prettenhofer, Benno Stein
    Subjects: Information Retrieval
    Abstract

    Cross-lingual adaptation, a special case of domain adaptation, refers to the
    transfer of classification knowledge between two languages. In this article we
    describe an extension of Structural Correspondence Learning (SCL), a recently
    proposed algorithm for domain adaptation, for cross-lingual adaptation. The
    proposed method uses unlabeled documents from both languages, along with a word
    translation oracle, to induce cross-lingual feature correspondences.

  53. Comparison Of Modified Dual Ternary Indexing And Multi-Key Hashing Algorithms For Music Information Retrieval.

    Authors: Rajeswari Sridhar, A. Amudha, S. Karthiga, Geetha T V
    Subjects: Information Retrieval
    Abstract

    In this work we compare two indexing algorithms that have been used to
    index and retrieve Carnatic music songs. We compare a modified version
    of the dual ternary indexing algorithm for music indexing and retrieval with
    the multi-key hashing indexing algorithm proposed by us. The modification to
    the dual ternary algorithm was essential to handle variable-length query
    phrases and to accommodate features specific to Carnatic music. The dual
    ternary indexing algorithm is adapted for Carnatic music by segmenting using
    the segmentation technique for Carnatic music.

  54. Clustering Unstructured Data (Flat Files) - An Implementation in Text Mining Tool.

    Authors: Yasir Safeer, Atika Mustafa, Anis Noor Ali
    Subjects: Information Retrieval
    Abstract

    With the advancement of technology and reduced storage costs, individuals and
    organizations increasingly use electronic media for storing
    textual information and documents. It is time consuming for readers to retrieve
    relevant information from an unstructured document collection. It is easier and
    less time consuming to find documents in a large collection when the
    collection is ordered or classified by group or category. The problem of
    finding the best such grouping remains open.

  55. Intelligent data analysis based on the complex network theory methods: a case study.

    Authors: O. Mryglod, Yu. Holovatch
    Subjects: Information Retrieval
    Abstract

    The development of modern information technologies makes it possible to
    collect and analyze huge amounts of statistical data in different spheres of
    life. The main problem is not only to collect but also to process all relevant
    information. The purpose of our work is to show an example of intelligent data
    analysis in a field as complex and non-formalized as science. Using
    statistical data about a scientific periodical, it is possible to perform a
    comprehensive analysis of it and to solve different practical problems.

  56. A Survey Paper on Recommender Systems.

    Authors: William Nzoukou, Dhoha Almazro, Ghadeer Shahatah, Lamia Albdulkarim, Mona Kherees, Romy Martinez
    Subjects: Information Retrieval
    Abstract

    Recommender systems apply data mining techniques and prediction algorithms to
    predict users' interest in information, products and services among the
    tremendous amount of available items. The vast growth of information on the
    Internet as well as the number of visitors to websites adds some key challenges
    to recommender systems. These are: producing accurate recommendations, handling
    many recommendations efficiently and coping with the vast growth in the number
    of participants in the system.

  57. Capacity Planning for Vertical Search Engines.

    Authors: Claudine Badue, Jussara Almeida, Virgilio Almeida, Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Artur Ziviani, Nivio Ziviani
    Subjects: Information Retrieval
    Abstract

    Vertical search engines focus on specific slices of content, such as the Web
    of a single country or the document collection of a large corporation. Despite
    this, like general open web search engines, they are expensive to maintain,
    expensive to operate, and hard to design. Because of this, predicting the
    response time of a vertical search engine is usually done empirically through
    experimentation, requiring a costly setup. An alternative is to develop a model
    of the search engine for predicting performance. However, this alternative is
    of interest only if its predictions are accurate.

  58. The comparison of Wiktionary thesauri transformed into the machine-readable format.

    Authors: A. A. Krizhanovsky
    Subjects: Information Retrieval
    Abstract

    Wiktionary is a unique, peculiar, valuable and original resource for natural
    language processing (NLP). The paper describes an open-source Wiktionary
    parser: its architecture and requirements followed by a description of
    Wiktionary features to be taken into account, some open problems of Wiktionary
    and the parser. The current implementation of the parser extracts the
    definitions, semantic relations, and translations from English and Russian
    Wiktionaries.

  59. Large scale link based latent Dirichlet allocation for web document classification.

    Authors: Jácint Szabó, István Bíró
    Subjects: Information Retrieval
    Abstract

    In this paper we demonstrate the applicability of latent Dirichlet allocation
    (LDA) for classifying large Web document collections. One of our main results
    is a novel influence model that gives a fully generative model of the document
    content taking linkage into account. In our setup, topics propagate along links
    in such a way that linked documents directly influence the words in the linking
    document. As another main contribution we develop LDA specific boosting of
    Gibbs samplers resulting in a significant speedup in our experiments.

  60. Two-dimensional ranking of Wikipedia articles.

    Authors: A.O.Zhirov, O.V.Zhirov, D.L.Shepelyansky
    Subjects: Information Retrieval
    Abstract

    The Library of Babel, described by Jorge Luis Borges, stores an enormous
    amount of information. The Library exists ab aeterno. Wikipedia, a free
    online encyclopaedia, has become a modern analogue of such a Library.
    Information retrieval and ranking of Wikipedia articles have become a challenge
    for modern society. We analyze the properties of two-dimensional ranking of all
    English Wikipedia articles and show that it gives a reliable classification of
    them with rich and nontrivial features.

  61. TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities).

    Authors: Paolo Ferragina, Ugo Scaiella
    Subjects: Information Retrieval
    Abstract

    In this paper we address the problem of accurately and efficiently
    cross-referencing text fragments with Wikipedia pages, in a way that structured
    knowledge is provided about the (unstructured) input text by resolving synonymy
    and polysemy. We take inspiration from the invited talk of Chakrabarti at WSDM
    2010, and extend his proposed scenario from the annotation of entire documents
    to the annotation of short texts, such as snippets of search-engine results,
    tweets, news, etc.

  62. Power law in website ratings.

    Authors: D.V. Lande, A.A. Snarskii
    Subjects: Information Retrieval
    Abstract

    In the practical work of website popularization and the analysis of website
    efficiency and downloads, it is of key importance to take web-ratings data into
    account. The main indicators of website traffic include the number of
    unique hosts from which the analyzed website was addressed and the number of
    granted web pages (hits) per unit time (for example, day, month or year). Of
    particular interest is the ratio between the number of hits (S) and hosts (H).
    In practice, even a concept such as the "average number of viewed pages" (S/H)
    is used, which by default assumes a linear dependence of S on H.

  63. An Algorithm to Self-Extract Secondary Keywords and Their Combinations Based on Abstracts Collected using Primary Keywords from Online Digital Libraries.

    Authors: Hari Cohly, Natarajan Meghanathan, Nataliya Kostyuk, Raphael Isokpehi
    Subjects: Information Retrieval
    Abstract

    The high-level contribution of this paper is the development and
    implementation of an algorithm to self-extract secondary keywords and their
    combinations (combo words) based on abstracts collected using standard primary
    keywords for research areas from reputed online digital libraries like IEEE
    Explore, PubMed Central, etc. Given a collection of N abstracts, we
    arbitrarily select M abstracts (M << N; M/N as low as 0.15) and parse each of
    the M abstracts, word by word.

  64. Chi-square-based scoring function for categorization of MEDLINE citations.

    Authors: Andrej Kastrin, Borut Peterlin, Dimitar Hristovski
    Subjects: Information Retrieval
    Abstract

    Objectives: Text categorization has been used in biomedical informatics for
    identifying documents containing relevant topics of interest. We developed a
    simple method that uses a chi-square-based scoring function to determine the
    likelihood of MEDLINE citations containing a genetically relevant topic.
    Methods: Our procedure requires construction of a genetic and a nongenetic
    domain document corpus. We used MeSH descriptors assigned to MEDLINE citations
    for this categorization task. We compared frequencies of MeSH descriptors
    between the two corpora by applying the chi-square test.
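
    As an illustration of the frequency comparison, the sketch below computes
    the standard Pearson chi-square statistic for one descriptor on a 2x2 table
    (descriptor present or absent, genetic or nongenetic corpus). The abstract
    does not spell out the exact scoring function, so this is the textbook
    statistic only; the function name and the counts are made up for the
    example.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square for the table [[a, b], [c, d]] without correction.

    a: genetic citations with the descriptor
    b: nongenetic citations with the descriptor
    c: genetic citations without the descriptor
    d: nongenetic citations without the descriptor
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so the statistic is undefined
    return n * (a * d - b * c) ** 2 / denom


if __name__ == "__main__":
    # Hypothetical counts for a single descriptor.
    print(chi_square_2x2(30, 10, 10, 30))
```

    A descriptor whose frequency is the same in both corpora scores zero, and
    larger values indicate a stronger association with one corpus.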

  65. On the Fly Query Entity Decomposition Using Snippets.

    Authors: Daniel Gayo-Avello, David J. Brenes, Rodrigo Garcia
    Subjects: Information Retrieval
    Abstract

    One of the most important issues in Information Retrieval is inferring the
    intents underlying users' queries. Thus, any tool to enrich or to better
    contextualize queries can prove extremely valuable. Entity extraction,
    provided it is done fast, can be one such tool. Such techniques usually
    rely on a prior training phase involving large datasets. That training is
    costly, especially in environments which are increasingly moving towards
    real-time scenarios where the latency to retrieve fresh information should be
    minimal. In this paper an 'on-the-fly' query decomposition method is proposed.

  66. Constructing Positive Elastic Kernels with Application to Time Series Classification.

    Authors: Pierre-François Marteau, Sylvie Gibet
    Subjects: Information Retrieval
    Abstract

    This paper proposes some extensions to the work on kernels dedicated to
    string alignment (biological sequence alignment) based on the summing up of
    scores obtained by local alignments with gaps. The extensions we propose allow
    us to construct, from classical time warp distances, what we call summative
    time warp kernels, which are positive definite if some simple sufficient
    conditions are satisfied.

  67. A database approach to information retrieval: The remarkable relationship between language models and region models.

    Authors: Djoerd Hiemstra, Vojkan Mihajlovic
    Subjects: Information Retrieval
    Abstract

    In this report, we unify two quite distinct approaches to information
    retrieval: region models and language models. Region models were developed for
    structured document retrieval. They provide a well-defined behaviour as well as
    a simple query language that allows application developers to rapidly develop
    applications. Language models are particularly useful to reason about the
    ranking of search results, and for developing new ranking approaches. The
    unified model allows application developers to define complex language modeling
    approaches as logical queries on a textual database.

  68. Clustering Time Series Data Stream - A Literature Survey.

    Authors: M. Punithavalli, V.Kavitha
    Subjects: Information Retrieval
    Abstract

    Interest in mining time series data has grown tremendously in today's world.
    To provide an overview, various implementations are studied and summarized to
    identify the different problems in existing applications. Clustering time
    series is a problem that has applications in a wide range of fields
    and has recently attracted a large amount of research. Time series data are
    frequently large and may contain outliers. In addition, time series are a
    special type of data set where elements have a temporal ordering.

  69. Performance Oriented Query Processing In GEO Based Location Search Engines.

    Authors: M. Umamaheswari, S. Sivasubramanian
    Subjects: Information Retrieval
    Abstract

    Geographic location search engines allow users to constrain and order search
    results in an intuitive manner by focusing a query on a particular geographic
    region. Geographic search technology, also called location search, has recently
    received significant interest from major search engine companies. Academic
    research in this area has focused primarily on techniques for extracting
    geographic knowledge from the web. In this paper, we study the problem of
    efficient query processing in scalable geographic search engines.

  70. Node-Context Network Clustering using PARAFAC Tensor Decomposition.

    Authors: Andri Mirzal, Masashi Furukawa
    Subjects: Information Retrieval
    Abstract

    We describe a clustering method for labeled-link networks (semantic graphs)
    that can be used to group important nodes (highly connected nodes) with their
    relevant link labels by using PARAFAC tensor decomposition. In this kind of
    network, the adjacency matrix cannot fully describe all information
    about the network structure. We have to expand the matrix into a 3-way
    adjacency tensor, so that it includes not only which nodes a node connects to
    but also the labels of the links through which it connects.

  71. Self-Taught Hashing for Fast Similarity Search.

    Authors: Dell Zhang, Jun Wang, Deng Cai, Jinsong Lu
    Subjects: Information Retrieval
    Abstract

    The ability to perform fast similarity search at large scale is of great
    importance to many Information Retrieval (IR) applications. A promising way to
    accelerate similarity search is semantic hashing, which designs compact binary
    codes for a large number of documents so that semantically similar documents
    are mapped to similar codes (within a short Hamming distance). Although some
    recently proposed techniques are able to generate high-quality codes for
    documents known in advance, obtaining the codes for previously unseen documents
    remains a very challenging problem.
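
    Once documents have binary codes, the lookup step amounts to a Hamming-radius
    search. The sketch below assumes the codes are already given as Python
    integers; how to learn them, which is the subject of the paper, is not
    shown. The function names and example codes are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two binary codes differ."""
    return bin(a ^ b).count("1")


def search(codes, query, radius):
    """Documents whose code lies within the given Hamming radius of the query.

    codes maps a document identifier to its integer code. Results are sorted
    by distance, then by identifier.
    """
    hits = []
    for doc, code in codes.items():
        dist = hamming(code, query)
        if dist <= radius:
            hits.append((dist, doc))
    return [doc for _, doc in sorted(hits)]


if __name__ == "__main__":
    codes = {"a": 0b1011, "b": 0b1000, "c": 0b0100}
    print(search(codes, 0b1010, 1))
```

    A linear scan is used here for clarity; practical systems typically probe
    hash buckets near the query code instead.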

  72. MIREX: MapReduce Information Retrieval Experiments.

    Authors: Djoerd Hiemstra, Claudia Hauff
    Subjects: Information Retrieval
    Abstract

    We propose to use MapReduce to quickly test new retrieval approaches on a
    cluster of machines by sequentially scanning all documents. We present a small
    case study in which we use a cluster of 15 low-cost machines to search a web
    crawl of 0.5 billion pages showing that sequential scanning is a viable
    approach to running large-scale information retrieval experiments with little
    effort. The code is available to other researchers at:
    this http URL

  73. Audio enabled information extraction system for cricket and hockey domains.

    Authors: S. Saraswathi, Narasimha Sravan. V, Sai Vamsi Krishna. B.V, Suresh Reddy. S
    Subjects: Information Retrieval
    Abstract

    The proposed system aims at retrieving summarized information from
    documents collected from a web-based search engine according to a user query
    related to the cricket and hockey domains. The system is designed so that
    it takes voice commands as keywords for search. The parts of speech in the
    query are extracted using the natural language extractor for English. Based on
    the keywords, the search is categorized into two types: 1. Concept-wise:
    information for the query is retrieved based on the keywords and the
    concept words related to them.

  74. BiLingual Information Retrieval System for English and Tamil.

    Authors: S.Saraswathi, Asma Siddhiqaa.M, Kalaimagal.K, Kalaiyarasi.M
    Subjects: Information Retrieval
    Abstract

    This paper addresses the design and implementation of a BiLingual Information
    Retrieval system for the Festivals domain. A generic platform is built for
    BiLingual Information Retrieval which can be extended to any foreign or Indian
    language with the same efficiency. The language in which the query is answered
    is not taken from a specific predefined set of standard languages but is chosen
    dynamically on processing the user's query. This paper deals with the Indian
    language Tamil in addition to English.

  75. Handling Overload Conditions In High Performance Trustworthy Information Retrieval Systems.

    Authors: Sumalatha Ramachandran, Sharon Joseph, Sujaya Paulraj, Vetriselvi Ramaraj
    Subjects: Information Retrieval
    Abstract

    Web search engines retrieve a vast amount of information for a given search
    query. But the user needs only trustworthy and high-quality information from
    this vast retrieved data. The response time of the search engine must be a
    minimum value in order to satisfy the user. An optimum level of response time
    should be maintained even when the system is overloaded. This paper proposes an
    optimal Load Shedding algorithm which is used to handle overload conditions in
    real-time data stream applications and is adapted to the Information Retrieval
    System of a web search engine.

  76. Solving the Cold-Start Problem in Recommender Systems with Social Tags.

    Authors: Yi-Cheng Zhang, Tao Zhou, Zi-Ke Zhang, Chuang Liu
    Subjects: Information Retrieval
    Abstract

    In this paper, based on user-tag-object tripartite graphs, we propose a
    recommendation algorithm that treats social tags as playing an important role
    in information retrieval. Besides its low computational cost, experimental
    results on two real-world data sets, Del.icio.us and MovieLens, show that it
    can enhance algorithmic accuracy and diversity. In particular, it can obtain
    more personalized recommendation results when users have diverse topics of
    tags.

  77. Improving Update Summarization by Revisiting the MMR Criterion.

    Authors: Juan-Manuel Torres-Moreno, Florian Boudin, Marc El-Bèze
    Subjects: Information Retrieval
    Abstract

    This paper describes a method for multi-document update summarization that
    relies on a double maximization criterion. A Maximal Marginal Relevance-like
    criterion, modified and called Smmr, is used to select sentences that are
    close to the topic and at the same time distant from sentences used in already
    read documents. Summaries are then generated by assembling the high-ranked
    material and applying some rule-based linguistic post-processing in order to
    obtain length reduction and maintain coherency.
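
    For orientation, the classical Maximal Marginal Relevance selection that the
    criterion is modelled on can be sketched as a greedy loop. This is the
    textbook form with a simple Jaccard word-overlap similarity standing in for
    a real sentence similarity; it is not the modified Smmr criterion of the
    paper, and the function names and the trade-off parameter lam are
    assumptions.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two sentences (a stand-in measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def mmr_select(candidates, topic, k, lam=0.7):
    """Greedily pick k sentences that are relevant to the topic yet non-redundant."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(sentence):
            # Redundancy is the similarity to the closest already chosen sentence.
            redundancy = max((jaccard(sentence, s) for s in selected), default=0.0)
            return lam * jaccard(sentence, topic) - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected


if __name__ == "__main__":
    sentences = [
        "solar power plants expand",
        "solar power plants expand quickly",
        "wind power grows",
    ]
    print(mmr_select(sentences, "solar power plants", k=2, lam=0.5))
```

    With lam = 1.0 the redundancy penalty vanishes and the selection reduces to
    ranking by relevance alone.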

  78. Learning Better Context Characterizations: An Intelligent Information Retrieval Approach.

    Authors: Carlos M. Lorenzetti, Ana G. Maguitman
    Subjects: Information Retrieval
    Abstract

    This paper proposes an incremental method that can be used by an intelligent
    system to learn better descriptions of a thematic context. The method starts
    with a small number of terms selected from a simple description of the topic
    under analysis and uses this description as the initial search context. Using
    these terms, a set of queries is built and submitted to a search engine. New
    documents and terms are used to refine the learned vocabulary.

  79. Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure.

    Authors: Martin Klein, Michael L Nelson
    Subjects: Information Retrieval
    Abstract

    Missing web pages (pages that return the 404 "Page Not Found" error) are part
    of the browsing experience. The manual use of search engines to rediscover
    missing pages can be frustrating and unsuccessful. We compare four automated
    methods for rediscovering web pages. We extract the page's title, generate the
    page's lexical signature (LS), obtain the page's tags from the bookmarking
    website delicious.com and generate a LS from the page's link neighborhood. We
    use the output of all methods to query Internet search engines and analyze
    their retrieval performance.
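
    A lexical signature is commonly taken to be the few terms of a page with the
    highest TF-IDF weight. The sketch below follows that common reading; the
    abstract does not give the authors' exact weighting, so the smoothed IDF,
    the whitespace tokenization and the function name are all assumptions.

```python
import math
from collections import Counter


def lexical_signature(page: str, corpus, k: int = 5):
    """Top-k terms of the page ranked by a smoothed TF-IDF weight."""
    tf = Counter(page.lower().split())
    doc_terms = [set(doc.lower().split()) for doc in corpus]
    n = len(doc_terms)

    def weight(term):
        df = sum(term in terms for terms in doc_terms)
        return tf[term] * math.log((n + 1) / (df + 1))

    # Sort by descending weight, breaking ties alphabetically.
    return sorted(tf, key=lambda t: (-weight(t), t))[:k]


if __name__ == "__main__":
    corpus = ["the cat sat", "the dog ran", "the bird flew"]
    print(lexical_signature("the zebra zebra cat", corpus, k=2))
```

    The resulting terms would then be submitted together as a search engine
    query.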

  80. Is This a Good Title?.

    Authors: Michael L. Nelson, Martin Klein, Jeffery Shipman
    Subjects: Information Retrieval
    Abstract

    Missing web pages, URIs that return the 404 "Page Not Found" error or the
    HTTP response code 200 but dereference unexpected content, are ubiquitous in
    today's browsing experience. We use Internet search engines to relocate such
    missing pages and provide means that help automate the rediscovery process. We
    propose querying web pages' titles against search engines. We investigate the
    retrieval performance of titles and compare them to lexical signatures which
    are derived from the pages' content. Since titles naturally represent the
    content of a document, they intuitively change over time.

  81. Document Clustering using Sequential Information Bottleneck Method.

    Authors: P.J.Gayathri, S.C. Punitha, M. Punithavalli
    Subjects: Information Retrieval
    Abstract

    This paper illustrates the Principal Direction Divisive Partitioning (PDDP)
    algorithm, describes its drawbacks, and introduces a combinatorial framework
    for it. It then describes a simplified version of the EM algorithm called the
    spherical Gaussian EM (sGEM) algorithm, and the Information Bottleneck (IB)
    method, a technique assessed here for accuracy, complexity, time and space.

  82. A Survey on Preprocessing Methods for Web Usage Data.

    Authors: V.Chitraa, Dr. Antony Selvdoss Davamani
    Subjects: Information Retrieval
    Abstract

    The World Wide Web is a huge repository of web pages and links. It provides
    an abundance of information for Internet users. The growth of the web is
    tremendous, as approximately one million pages are added daily. Users' accesses
    are recorded in web logs. Because of the tremendous usage of the web, web log
    files are growing at a faster rate and their size is becoming huge. Web data
    mining is the application of data mining techniques to web data.

  83. Nepotistic Relationships in Twitter and their Impact on Rank Prestige Algorithms.

    Authors: Daniel Gayo-Avello
    Subjects: Information Retrieval
    Abstract

    Micro-blogging services such as Twitter allow anyone to publish anything,
    anytime. Needless to say, much of the available content can be dismissed
    as babble or spam. However, given the number and diversity of users, some
    valuable pieces of information should arise from the stream of tweets. Thus,
    such services can develop into valuable sources of up-to-date information (the
    so-called real-time web) provided a way to find the most
    relevant/trustworthy/authoritative users is available.

  84. Maximal Intersection Queries in Randomized Input Models.

    Authors: Mikhail Lifshits, Benjamin Hoffmann, Yury Lifshits, Dirk Nowotka
    Subjects: Information Retrieval
    Abstract

    Consider a family of sets and a single set, called the query set. How can one
    quickly find a member of the family which has a maximal intersection with the
    query set? Time constraints on the query and on a possible preprocessing of the
    set family make this problem challenging. Such maximal intersection queries
    arise in a wide range of applications, including web search, recommendation
    systems, and distributing on-line advertisements. In general, maximal
    intersection queries are computationally expensive.
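
    A straightforward baseline for such queries uses an inverted index from
    elements to the sets that contain them. The sketch below is that generic
    baseline, not the randomized-input algorithms studied in the paper; the
    function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(family):
    """Map each element to the indices of the sets in the family containing it."""
    index = defaultdict(list)
    for i, members in enumerate(family):
        for element in members:
            index[element].append(i)
    return index


def max_intersection(index, query):
    """Index of a family member with maximal intersection with the query set.

    Returns None when no member shares an element with the query. Ties are
    broken in favour of the smallest index.
    """
    overlap = Counter()
    for element in query:
        for i in index.get(element, ()):
            overlap[i] += 1
    if not overlap:
        return None
    best = max(overlap.values())
    return min(i for i, count in overlap.items() if count == best)


if __name__ == "__main__":
    family = [{1, 2, 3}, {2, 3, 4, 5}, {7}]
    index = build_index(family)
    print(max_intersection(index, {3, 4, 5, 9}))
```

    Query time is proportional to the total length of the posting lists touched,
    which can still be large when query elements are frequent.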

  85. Local Popularity based Page Link Analysis.

    Authors: C Ravindranath Chowdary
    Subjects: Information Retrieval
    Abstract

    In this paper we introduce the concept of dynamic link pages. A web site/page
    contains a number of links to other pages. Not all the links are equally
    important: a few links are visited frequently and a few rarely. In
    this scenario, identifying the frequently used links and placing them in the
    top left corner of the page will increase the user's satisfaction. This process
    will reduce the time spent by a visitor on the page, as most of the time the
    popular links are presented in the visible part of the screen itself.

  86. A Mathematical Approach to the Study of the United States Code.

    Authors: Michael J. Bommarito II, Daniel Martin Katz
    Subjects: Information Retrieval
    Abstract

    The United States Code (Code) is a document containing over 22 million words
    that represents a large and important source of Federal statutory law. Scholars
    and policy advocates often discuss the direction and magnitude of changes in
    various aspects of the Code. However, few have mathematically formalized the
    notions behind these discussions or directly measured the resulting
    representations. This paper addresses the current state of the literature in
    two ways.

  87. Computation of Reducts Using Topology and Measure of Significance of Attributes.

    Authors: R. Bhaskaran, P. G. JansiRani
    Subjects: Information Retrieval
    Abstract

    Data generated in the fields of science, technology, business and in many
    other fields of research are increasing at an exponential rate. Extracting
    knowledge from a huge set of data is a challenging task.

  88. Classified Ads Harvesting Agent and Notification System.

    Authors: Razvi Doomun, Lollmahamod N., Auleear Nadeem, Mozafar Aukin
    Subjects: Information Retrieval
    Abstract

    The shift from an information society to a knowledge society requires rapid
    information harvesting, reliable search and instantaneous on-demand delivery.
    Information extraction agents are used to explore and collect data available
    from the Web, in order to effectively exploit such data for business purposes,
    such as automatic news filtering, advertisement or product searching and price
    comparison. In this paper, we develop a real-time automatic harvesting agent
    for adverts posted on the Servihoo web portal and an SMS-based notification
    system.

  89. Revisiting the Examination Hypothesis with Query Specific Position Bias.

    Authors: Sreenivas Gollapudi, Rina Panigrahy
    Subjects: Information Retrieval
    Abstract

    Click-through rates (CTR) offer useful user feedback that can be used to
    infer the relevance of search results for queries. However, it is not very
    meaningful to look at the raw click-through rate of a search result because the
    likelihood of a result being clicked depends not only on its relevance but also
    on the position in which it is displayed. One model of browsing behavior, the
    Examination Hypothesis [RDR07, Craswell08, DP08], states that each
    position has a certain probability of being examined and is then clicked based
    on the relevance of the search snippets.
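
    In its basic form, the hypothesis factors the click probability into a
    position-dependent examination probability and a relevance term. The sketch
    below shows that factorization and the resulting position-corrected
    relevance estimate. It covers only the position-only form, not the
    query-specific bias the paper goes on to study; the function names and
    numbers are hypothetical.

```python
def expected_ctr(examination_prob: float, relevance: float) -> float:
    """P(click) = P(position is examined) * P(click | examined)."""
    return examination_prob * relevance


def estimate_relevance(clicks: int, impressions: int, examination_prob: float) -> float:
    """Divide the raw CTR by the examination probability of the position."""
    if impressions <= 0 or examination_prob <= 0:
        raise ValueError("impressions and examination_prob must be positive")
    return (clicks / impressions) / examination_prob


if __name__ == "__main__":
    # A raw CTR of 3% at a position examined 30% of the time.
    print(estimate_relevance(clicks=30, impressions=1000, examination_prob=0.3))
```

    Two results with the same raw CTR can therefore differ in estimated
    relevance if they were shown at positions with different examination
    probabilities.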

  90. Formal Concept Analysis for Information Retrieval.

    Authors: Abderrahim El Qadi, Driss Aboutajedine, Yassine Ennouary
    Subjects: Information Retrieval
    Abstract

    In this paper we describe a mechanism to improve Information Retrieval (IR)
    on the web. The method is based on Formal Concept Analysis (FCA), which
    establishes semantic relations during querying and allows the answers provided
    by a search engine to be reorganized in the shape of a lattice of concepts. We
    propose for IR an incremental algorithm based on the Galois lattice. This
    algorithm allows a formal clustering of the data sources, and the results it
    returns are ranked by order of relevance.

  91. Tag Clusters as Information Retrieval Interfaces.

    Authors: Kathrin Knautz, Simone Soubusta, Wolfgang G. Stock
    Subjects: Information Retrieval
    Abstract

    The paper presents our design of a next generation information retrieval
    system based on tag co-occurrences and subsequent clustering. We help users
    get access to digital data through information visualization in the form of
    tag clusters. Current problems like the absence of interactivity and semantics
    between tags or the difficulty of adding additional search arguments are
    solved.

  92. A Hough Transform based Technique for Text Segmentation.

    Authors: Subhadip Basu, Mita Nasipuri, Satadal Saha, Dipak Kr. Basu
    Subjects: Information Retrieval
    Abstract

    Text segmentation is an inherent part of an OCR system irrespective of its
    domain of application. The OCR system contains a segmentation module
    where the text lines, words and ultimately the characters must be segmented
    properly for successful recognition. The present work implements a Hough
    transform based technique for line and word segmentation from digitized images.
    The proposed technique is applied not only to a document image dataset but
    also to datasets for a business card reader system and a license plate
    recognition system.

  93. Improving Term Extraction Using Particle Swarm Optimization Techniques.

    Authors: Naomie Salim, Mohammad Syafrullah
    Subjects: Information Retrieval
    Abstract

    Term extraction is one of the layers in the ontology development process
    whose task is to extract all the terms contained in the input document
    automatically. The purpose of this process is to generate a list of terms that
    are relevant to the domain of the input document. In the literature there are
    many approaches, techniques and algorithms used for term extraction. In this
    paper we propose a new approach using particle swarm optimization techniques in
    order to improve the accuracy of term extraction results. We choose five
    features to represent the term score.

  94. Exploring a Multidimensional Representation of Documents and Queries (extended version).

    Authors: Benjamin Piwowarski, Ingo Frommholz, Mounia Lalmas, Keith van Rijsbergen
    Subjects: Information Retrieval
    Abstract

    In Information Retrieval (IR), whether implicitly or explicitly, queries and
    documents are often represented as vectors. However, it may be more beneficial
    to consider documents and/or queries as multidimensional objects. Our belief is that
    this would allow building "truly" interactive IR systems, i.e., where
    interaction is fully incorporated in the IR framework.

  95. Recherche de relations spatio-temporelles : une méthode basée sur l'analyse de corpus textuels.

    Authors: Christian Sallaberry, Mauro Gaio, Tien Nguyen Van
    Subjects: Information Retrieval
    Abstract

    This paper presents a work package realized for the GéOnto project. A new
    method is proposed for an enrichment of a first geographical ontology developed
    beforehand. This method relies on text analysis by lexico-syntactic patterns.

    From the retrieved n-ary relations, the method automatically detects those
    involved in a spatial and/or temporal relation in the context of journey
    descriptions.

  96. Extraction de termes, reconnaissance et labellisation de relations dans un thésaurus.

    Authors: Eric Kergosien, Marie-Noëlle Bessagnet, Mauro Gaio
    Subjects: Information Retrieval
    Abstract

    Within the documentary system domain, the integration of thesauri into the
    indexing and information retrieval steps is common. In libraries, documents
    carry rich descriptive information produced by librarians, in the form of
    descriptive records based on the Rameau thesaurus. We exploit two kinds of
    information in order to create a
    first semantic structure. A step of conceptualization allows us to define the
    various modules used to automatically build the semantic structure of the
    indexation work. Our current work focuses on an approach that aims to define an
    ontology based on a thesaurus.

  97. Construction et enrichissement automatique d'ontologie à partir de ressources externes.

    Authors: Eric Kergosien, Mouna Kamel, Christian Sallaberry, Marie-Noëlle Bessagnet, Nathalie Aussenac-Gilles, Mauro Gaio
    Subjects: Information Retrieval
    Abstract

    Automatic construction of ontologies from text is generally based on
    retrieving text content. To obtain a much richer ontology, we extend these
    approaches by taking into account the document structure and some external
    resources (such as a thesaurus of indexing terms from a related domain). In
    this paper we describe how these external resources are first analyzed and then
    exploited. This method has been applied to a geographical domain and the
    benefit has been evaluated.

  98. Learning to Blend by Relevance.

    Authors: Jiang Chen, Wei Chu, Zhenzhen Kou, Zhaohui Zheng
    Subjects: Information Retrieval
    Abstract

    The emergence of various vertical search engines highlights the fact that a
    single ranking technology cannot deal with the complexity and scale of search
    problems. For example, the technology behind video and image search is very
    different from general web search. Their ranking functions share few features.
    Question answering websites (e.g., Yahoo! Answers) can make use of text
    matching and click features developed for general web search, but they have
    unique page structures and rich user feedback, e.g., thumbs up and thumbs down
    ratings in Yahoo! Answers, which greatly benefit their own ranking.

  99. Implicit media frames: Automated analysis of public debate on artificial sweeteners.

    Authors: Loet Leydesdorff, Iina Hellsten, James Dawson
    Subjects: Information Retrieval
    Abstract

    The framing of issues in the mass media plays a crucial role in the public
    understanding of science and technology. This article contributes to research
    concerned with diachronic analysis of media frames by making an analytical
    distinction between implicit and explicit media frames, and by introducing an
    automated method for analysing diachronic changes of implicit frames. In
    particular, we apply a semantic maps method to a case study on the newspaper
    debate about artificial sweeteners, published in The New York Times (NYT)
    between 1980 and 2006.

  100. The effect of discrete vs. continuous-valued ratings on reputation and ranking systems.

    Authors: Matus Medo, Joseph Rushton Wakeling
    Subjects: Information Retrieval
    Abstract

    When users rate objects, a sophisticated algorithm that takes into account
    ability or reputation may produce a fairer or more accurate aggregation of
    ratings than the straightforward arithmetic average. Recently a number of
    authors have proposed different co-determination algorithms where estimates of
    user and object reputation are refined iteratively together, permitting
    accurate measures of both to be derived directly from the rating data.

  101. On Utilization and Importance of Perl Status Reporter (SRr) in Text Mining.

    Authors: Sugam Sharma, Tzusheng Pei, Hari Cohly
    Subjects: Information Retrieval
    Abstract

    In bioinformatics, text mining, sometimes used interchangeably with text data
    mining, is a process for deriving high-quality information from text. Perl
    Status Reporter (SRr) is a tool for fetching data from a flat text file, and in
    this research paper we illustrate the use of SRr in text or data mining. SRr
    needs a flat text input file on which the mining process is to be performed.
    SRr reads the input file and derives high-quality information from it. Typical
    text mining tasks are text categorization, text clustering, concept and entity
    extraction, and document summarization.

  102. Building reputation systems for better ranking.

    Authors: Matus Medo, Yi-Cheng Zhang, Tao Zhou, Luo-Luo Jiang, Joseph R. Wakeling
    Subjects: Information Retrieval
    Abstract

    How to rank web pages, scientists and online resources has recently attracted
    increasing attention from both physicists and computer scientists. In this
    paper, we study the ranking problem of rating systems where users vote on
    objects with discrete ratings. We propose an algorithm that can simultaneously
    evaluate user reputation and object quality through iterative refinement.
    According to both artificially generated data and real data from MovieLens and
    Amazon, our algorithm can considerably enhance the ranking accuracy.

  103. Distributed scientific communication in the European information society: Some cases of "Mode 2" fields of research.

    Authors: Loet Leydesdorff, Peter van den Besselaar, Gaston Heimeriks
    Subjects: Information Retrieval
    Abstract

    Can self-organization of scientific communication be specified by using
    literature-based indicators? In this study, we explore this question by
    applying entropy measures to typical "Mode-2" fields of knowledge production.
    We hypothesized these scientific systems to be developing from a
    self-organization of the interaction between cognitive and institutional
    levels: European subsidized research programs aim at creating an institutional
    network, while a cognitive reorganization is continuously ongoing at the
    scientific field level.

  104. Redundancy in Systems which Entertain a Model of Themselves: Interaction Information and the Self-organization of Anticipation.

    Authors: Loet Leydesdorff
    Subjects: Information Retrieval
    Abstract

    Mutual information among three or more dimensions (mu-star = - Q) has been
    considered as interaction information. However, Krippendorff (2009a, 2009b) has
    shown that this measure cannot be interpreted as a unique property of the
    interactions and has proposed an alternative measure of interaction information
    based on iterative approximation of maximum entropies. Q can then be considered
    as a measure of the difference between interaction information and redundancy
    generated in a model entertained by an observer.

  105. Document Clustering with K-tree.

    Authors: Christopher M. De Vries, Shlomo Geva
    Subjects: Information Retrieval
    Abstract

    This paper describes the approach taken to the XML Mining track at INEX 2008
    by a group at the Queensland University of Technology. We introduce the K-tree
    clustering algorithm in an Information Retrieval context by adapting it for
    document clustering. Many large scale problems exist in document clustering.
    K-tree scales well with large inputs due to its low complexity. It offers
    promising results both in terms of efficiency and quality. Document
    classification was completed using Support Vector Machines.

  106. K-tree: Large Scale Document Clustering.

    Authors: Christopher M. De Vries, Shlomo Geva
    Subjects: Information Retrieval
    Abstract

    We introduce K-tree in an information retrieval context. It is an efficient
    approximation of the k-means clustering algorithm. Unlike k-means it forms a
    hierarchy of clusters. It has been extended to address issues with sparse
    representations. We compare performance and quality to CLUTO using document
    collections. The K-tree has a low time complexity that is suitable for large
    document collections. This tree structure allows for efficient disk based
    implementations where space requirements exceed that of main memory.

  107. Random Indexing K-tree.

    Authors: Christopher M. De Vries, Lance De Vine, Shlomo Geva
    Subjects: Information Retrieval
    Abstract

    Random Indexing (RI) K-tree is the combination of two algorithms for
    clustering. Many large scale problems exist in document clustering. RI K-tree
    scales well with large inputs due to its low complexity. It also exhibits
    features that are useful for managing a changing collection. Furthermore, it
    solves previous issues with sparse document vectors when using K-tree. The
    algorithms and data structures are defined, explained and motivated. Specific
    modifications to K-tree are made for use with RI. Experiments have been
    executed to measure quality.

  108. Tutoring System for Dance Learning.

    Authors: Rajkumar Kannan, Frederic Andres, Balakrishnan Ramadoss
    Subjects: Information Retrieval
    Abstract

    Recent advances in hardware sophistication related to graphics display, audio
    and video devices have made available a large number of multimedia and
    hypermedia applications. These multimedia applications need to store and
    retrieve different forms of media such as text, hypertext, graphics, still
    images, animations, audio and video. Dance is one of the important cultural
    forms of a nation and dance video is one such multimedia type. Archiving and
    retrieving the required semantics from these dance media collections is a
    crucial and demanding multimedia application.

  109. Realization of Semantic Atom Blog.

    Authors: Dhiren R. Patel, Sidheshwar A. Khuba
    Subjects: Information Retrieval
    Abstract

    A web blog is used as a collaborative platform to publish and share
    information. The information accumulated in the blog intrinsically contains
    knowledge. The knowledge shared by the community of people has an intangible
    value proposition. The blog is viewed as a multimedia information resource available
    on the Internet. In a blog, information in the form of text, image, audio and
    video builds up exponentially.

  110. Rank Based Clustering For Document Retrieval From Biomedical Databases.

    Authors: Jayanthi Manicassamy, P. Dhavachelvan
    Subjects: Information Retrieval
    Abstract

    Nowadays, search engines are widely used for extracting information from
    various resources throughout the world. The majority of these searches lie in
    the biomedical field, retrieving related documents from various biomedical
    databases. Current search engines lack document clustering and do not represent
    the degree of relatedness of the documents extracted from the databases. In
    order to overcome these pitfalls, a text-based search engine has been developed
    for retrieving documents from the Medline and PubMed biomedical databases.

  111. VirusPKT: A Search Tool For Assimilating Assorted Acquaintance For Viruses.

    Authors: Jayanthi Manicassamy, P. Dhavachelvan
    Subjects: Information Retrieval
    Abstract

    Viruses utilize various means to circumvent immune detection in biological
    systems. Several mathematical models have been investigated for the
    description of viral dynamics in the biological systems of humans and various
    other species. One common strategy for the evasion and recognition of viruses
    is through acquaintance with these systems by means of search engines. From
    this perspective, a search tool has been developed to provide a wider
    comprehension of the structure and other details of viruses, as described in
    this paper.

  112. Context and Keyword Extraction in Plain Text Using a Graph Representation.

    Authors: Nathalie Chaignaud, Jean-Philippe Kotowicz, Carlo Abi Chahine, Jean-Pierre Pécuchet
    Subjects: Information Retrieval
    Abstract

    Document indexing is an essential task carried out by archivists or automatic
    indexing tools. To retrieve documents relevant to a query, the keywords
    describing each document have to be carefully chosen. Archivists have to find
    out the right topic of a document before starting to extract the keywords. For an
    archivist indexing specialized documents, experience plays an important role.
    But indexing documents on different topics is much harder. This article
    proposes an innovative method for an indexing support system.

  113. Spectral Ranking.

    Authors: Sebastiano Vigna
    Subjects: Information Retrieval
    Abstract

    This note attempts to sketch the history of spectral ranking, a
    general umbrella name for techniques that apply the theory of linear maps (in
    particular, eigenvalues and eigenvectors) to matrices that do not represent
    geometric transformations, but rather some kind of relationship between
    entities. Albeit recently made famous by the ample press coverage of Google's
    PageRank algorithm, spectral ranking was devised more than fifty years ago,
    almost exactly in the same terms, and has been studied in psychology and social
    sciences.

  114. De la recherche sociale d'information à la recherche collaborative d'information.

    Authors: Victor Odumuyiwa
    Subjects: Information Retrieval
    Abstract

    In this paper, we explain social information retrieval (SIR) and
    collaborative information retrieval (CIR). We see SIR as a way of knowing who
    to collaborate with in resolving an information problem while CIR entails the
    process of mutual understanding and solving of an information problem among
    collaborators. We are interested in the transition from SIR to CIR; hence we
    developed a communication model to facilitate knowledge sharing during CIR.

  115. Integrating the Probabilistic Models BM25/BM25F into Lucene.

    Authors: Joaquín Pérez-Iglesias, José R. Pérez-Agüera, Víctor Fresno, Yuval Z. Feinstein
    Subjects: Information Retrieval
    Abstract

    This document describes the implementation of BM25 and BM25F using the Lucene
    Java framework. Both models have stood out at TREC for their performance and
    are considered state-of-the-art in the IR community. BM25 is applied to
    'ad-hoc' retrieval, that is, to documents that do not contain fields; BM25F, on
    the other hand, is applied to documents with structure.

  116. Adaptive information filtering for dynamic recommender systems.

    Authors: Jian-Guo Liu, Yi-Cheng Zhang, Tao Zhou, Ci-Hang Jin
    Subjects: Information Retrieval
    Abstract

    The dynamic environment in the real world calls for adaptive techniques for
    information filtering, namely to provide real-time responses to changes in
    system data. While many incremental algorithms are designed for this purpose,
    they usually suffer from steadily worsening performance caused by errors
    accumulating over time. In this Letter, we propose two incremental
    diffusion-based algorithms for personalized recommendations, which integrate
    local and fast updates to achieve approximate results.

  117. Similarity Measures, Author Cocitation Analysis, and Information Theory.

    Authors: Loet Leydesdorff
    Subjects: Information Retrieval
    Abstract

    The use of Pearson's correlation coefficient in Author Cocitation Analysis
    was compared with Salton's cosine measure in a number of recent contributions.
    Unlike the Pearson correlation, the cosine is insensitive to the number of
    zeros. However, one has the option of applying a logarithmic transformation in
    correlation analysis. Information calculus is based on the logarithmic
    transformation and provides non-parametric statistics. Using this methodology
    one can cluster a document set in a precise way and express the differences in
    terms of bits of information.

  118. Making the road by searching - A search engine based on Swarm Information Foraging.

    Authors: Daniel Gayo-Avello, David J. Brenes
    Subjects: Information Retrieval
    Abstract

    Search engines are nowadays one of the most important entry points for
    Internet users and a central tool to solve most of their information needs.
    Still, there exists a substantial number of user searches which obtain
    unsatisfactory results. Needless to say, several lines of research aim to
    increase the relevancy of the results users retrieve. In this paper the authors
    frame this problem within the much broader (and older) one of information
    overload.

  119. Google matrix and Ulam networks of intermittency maps.

    Authors: Leonardo Ermann, Dima D.L. Shepelyansky
    Subjects: Information Retrieval
    Abstract

    We study the properties of the Google matrix of an Ulam network generated by
    intermittency maps. This network is created by the Ulam method which gives a
    matrix approximant for the Perron-Frobenius operator of a dynamical map. The
    spectral properties of eigenvalues and eigenvectors of this matrix are
    analyzed. We show that the PageRank of the system is characterized by a power
    law decay with the exponent $\beta$ dependent on map parameters and the Google
    damping factor $\alpha$.

  120. Co-occurrence Matrices and their Applications in Information Science: Extending ACA to the Web Environment.

    Authors: Loet Leydesdorff, Liwen Vaughan
    Subjects: Information Retrieval
    Abstract

    Co-occurrence matrices, such as co-citation, co-word, and co-link matrices,
    have been used widely in the information sciences. However, confusion and
    controversy have hindered the proper statistical analysis of this data. The
    underlying problem, in our opinion, involved understanding the nature of
    various types of matrices. This paper discusses the difference between a
    symmetrical co-citation matrix and an asymmetrical citation matrix as well as
    the appropriate statistical techniques that can be applied to each of these
    matrices, respectively.

  121. Multiple Presents: How Search Engines Re-write the Past.

    Authors: Loet Leydesdorff, Iina Hellsten, Paul Wouters
    Subjects: Information Retrieval
    Abstract

    Internet search engines function in a present which changes continuously. The
    search engines update their indices regularly, overwriting Web pages with newer
    ones, adding new pages to the index, and losing older ones. Some search engines
    can be used to search for information on the internet for specific periods of
    time. However, these 'date stamps' are not determined by the first occurrence
    of the pages on the Web, but by the last date at which a page was updated or a
    new page was added, and the search engine's crawler updated this change in the
    database.

  122. Re-Pair Compression of Inverted Lists.

    Authors: Francisco Claude, Antonio Farina, Gonzalo Navarro
    Subjects: Information Retrieval
    Abstract

    Compression of inverted lists with methods that support fast intersection
    operations is an active research topic. Most compression schemes rely on
    encoding differences between consecutive positions with techniques that favor
    small numbers. In this paper we explore a completely different alternative: We
    use Re-Pair compression of those differences. While Re-Pair by itself offers
    fast decompression at arbitrary positions in main and secondary memory, we
    introduce variants that in addition speed up the operations required for
    inverted list intersection.

  123. The relation between Pearson's correlation coefficient r and Salton's cosine measure.

    Authors: Leo Egghe, Loet Leydesdorff
    Subjects: Information Retrieval
    Abstract

    The relation between Pearson's correlation coefficient and Salton's cosine
    measure is revealed based on the different possible values of the division of
    the L1-norm and the L2-norm of a vector. These different values yield a sheaf
    of increasingly straight lines which form together a cloud of points, being the
    investigated relation. The theoretical results are tested against the author
    co-citation relations among 24 informetricians for whom two matrices can be
    constructed, based on co-citations: the asymmetric occurrence matrix and the
    symmetric co-citation matrix.

  124. Enhanced Trustworthy and High-Quality Information Retrieval System for Web Search Engines.

    Authors: S. Ramachandran, S. Paulraj, S. Joseph, V. Ramaraj
    Subjects: Information Retrieval
    Abstract

    The WWW is the most important source of information. However, there is no
    guarantee of information correctness: search engines retrieve a great deal of
    conflicting information, and the quality of the information provided varies
    from low to high. We provide enhanced trustworthiness for both specific
    (entity) and broad (content) queries in web searching. The filtering for
    trustworthiness is based on 5 factors: Provenance, Authority, Age, Popularity,
    and Related Links.

  125. Management Of Volatile Information In Incremental Web Crawler.

    Authors: Ravita Chahar, Komal Hooda, Annu Dhankhar
    Subjects: Information Retrieval
    Abstract

    Paper has been withdrawn.

  126. Collaborative filtering with diffusion-based similarity on tripartite graphs.

    Authors: Yi-Cheng Zhang, Tao Zhou, Ming-Sheng Shang, Zi-Ke Zhang
    Subjects: Information Retrieval
    Abstract

    Collaborative tags are playing an increasingly important role in the
    organization of information systems. In this paper, we study a personalized
    recommendation model making use of the ternary relations among users, objects
    and tags. We propose a measure of user similarity based on users' preferences
    and tagging information.

  127. Adaptive model for recommendation of news.

    Authors: Matus Medo, Yi-Cheng Zhang, Tao Zhou
    Subjects: Information Retrieval
    Abstract

    Most news recommender systems try to identify users' interests and news'
    attributes and use them to obtain recommendations. Here we propose an adaptive
    model which combines similarities in users' rating patterns with epidemic-like
    spreading of news on an evolving network. We study the model by computer
    agent-based simulations, measure its performance and discuss its robustness
    against bias and malicious behavior. Subject to the approval fraction of news
    recommended, the proposed model outperforms the widely adopted recommendation
    of news according to their absolute or relative popularity.

  128. Generating Concise and Readable Summaries of XML Documents.

    Authors: Maya Ramanath, Kondreddi Sarath Kumar, Georgiana Ifrim
    Subjects: Information Retrieval
    Abstract

    XML has become the de-facto standard for data representation and exchange,
    resulting in large scale repositories and warehouses of XML data. In order for
    users to understand and explore these large collections, a summarized, bird's
    eye view of the available data is a necessity. In this paper, we are interested
    in semantic XML document summaries which present the "important" information
    available in an XML document to the user. In the best case, such a summary is a
    concise replacement for the original document itself.

  129. An Algorithm for Mining Multidimensional Fuzzy Association Rules.

    Authors: Neelu Khare, Neeru Adlakha, K. R. Pardasani
    Subjects: Information Retrieval
    Abstract

    Multidimensional association rule mining searches for interesting
    relationships among the values from different dimensions or attributes in a
    relational database. In this method the correlation is among a set of
    dimensions, i.e., the items forming a rule come from different dimensions.
    Therefore each dimension should be partitioned at the fuzzy set level. This
    paper proposes a new algorithm for generating multidimensional association
    rules by utilizing fuzzy sets. For a database consisting of fuzzy transactions,
    the Apriori property is employed to prune useless candidate itemsets.

  130. A baseline for content-based blog classification.

    Authors: Olof Gornerup, Magnus Boman
    Subjects: Information Retrieval
    Abstract

    A content-based network representation of web logs (blogs) using a basic
    word-overlap similarity measure is presented. Due to a strong signal in blog
    data the approach is sufficient for accurately classifying blogs. Using Swedish
    blog data we demonstrate that blogs that treat similar subjects are organized
    in clusters that, in turn, are hierarchically organized in higher-order
    clusters. The simplicity of the representation renders it both computationally
    tractable and transparent.

  131. The Universal Recommender.

    Authors: Jérôme Kunegis, Alan Said, Winfried Umbrath
    Subjects: Information Retrieval
    Abstract

    We describe the Universal Recommender, a recommender system for semantic
    datasets that generalizes domain-specific recommenders such as content-based,
    collaborative, social, bibliographic, lexicographic, hybrid and other
    recommenders. In contrast to existing recommender systems, the Universal
    Recommender applies to any dataset that allows a semantic representation. We
    describe the scalable three-stage architecture of the Universal Recommender and
    its application to Internet Protocol Television (IPTV).

  132. Assessing scientific research performance and impact with single indices.

    Authors: John Panaretos, Chrisovaladis Malesios
    Subjects: Information Retrieval
    Abstract

    We provide a comprehensive and critical review of the h-index and its most
    important modifications proposed in the literature, as well as of other similar
    indicators measuring research output and impact. Extensions of some of these
    indices are presented and illustrated.

  133. Weblog Clustering in Multilinear Algebra Perspective.

    Authors: Andri Mirzal
    Subjects: Information Retrieval
    Abstract

    This paper describes a clustering method to group the most similar and
    important weblogs with their descriptive shared words by using a technique from
    multilinear algebra known as PARAFAC tensor decomposition. The proposed method
    first creates a labeled-link network representation of the weblog datasets, where
    the nodes are the blogs and the labels are the shared words.

  134. PrisCrawler: A Relevance Based Crawler for Automated Data Classification from Bulletin Board.

    Authors: Pu Yang, Jun Guo, Weiran Xu
    Subjects: Information Retrieval
    Abstract

    Nowadays people realize that it is difficult to find information simply and
    quickly on bulletin boards. In order to solve this problem, the concept of a
    bulletin board search engine has been proposed. This paper describes the
    PrisCrawler system, a subsystem of the bulletin board search engine, which can
    automatically crawl and add the relevance to the classified attachments of the
    bulletin board.

  135. Pavideoge: A New Video Processing Method in Video Search Engine.

    Authors: Pu Yang, Jun Guo, Guang Chen
    Subjects: Information Retrieval
    Abstract

    In this paper, we study the problems of video processing in video search
    engines. Video has now become a very important kind of data on the Internet,
    while searching for video is still a challenging task due to the inner
    properties of video: requiring enormous storage space, being independent, and
    expressing information implicitly. To handle the properties of video more
    effectively, we propose a new video processing method for video search engines.

  136. A Method for Accelerating the HITS Algorithm.

    Authors: Andri Mirzal, Masashi Furukawa
    Subjects: Information Retrieval
    Abstract

    We present a new method to accelerate the HITS algorithm by exploiting the
    hyperlink structure of the web graph. The proposed algorithm extends the idea
    of authority and hub scores from HITS by introducing two diagonal matrices
    which contain constants that act as weights to make authority pages more
    authoritative and hub pages more hubby. This method works because in the web
    graph good authorities are pointed to by good hubs and good hubs point to good
    authorities. Consequently, these pages will collect their scores faster under
    the proposed algorithm than under the standard HITS.

  137. Retrieval of Remote Sensing Images Using Colour and Texture Attribute.

    Authors: Priti Maheswary, Namita Srivastava
    Subjects: Information Retrieval
    Abstract

    Grouping images into semantically meaningful categories using low-level
    visual features is a challenging and important problem in content-based image
    retrieval. The groupings can be used to build effective indices for an image
    database. Digital image analysis techniques are widely used in remote sensing,
    on the assumption that each terrain surface category is characterized by a
    spectral signature observed by remote sensors. Even with the remote sensing
    images of IRS data, integration of spatial information is expected to assist
    and to improve the image analysis of remote sensing data.
