This paper surveys related work in the information visualization domain and
studies the actual integration of cartography paradigms in current information
search systems. Based on this study, we propose a semantic visualization and
navigation approach which offers users three search modes:
precise search, connotative search and thematic search.
Timely knowledge of what is published on the Web can greatly help
psychologists, marketers and political analysts to familiarize themselves
with, analyse, decide on and act correctly upon society's different needs. The
sheer volume of information on the Web prevents us from continuously
investigating the whole Web online. Focusing on selected blogs limits our
working domain and makes online crawling of the Web feasible.
Every dataset contains a great deal of hidden information. The method used to
process the data determines what type of information the data yield. In India,
the education sector has a lot of data that can produce valuable information.
This information can be used to increase the quality of education. However,
educational institutions do not apply any knowledge discovery process to these
data. Information and communication technology is entering the education
sector to capture and compile information at low cost.
On the internet, web surfers searching for information always look for
recommendations. Generating recommendations becomes more difficult because the
volume of information grows exponentially day by day. In this paper, we
calculate entropy-based similarity between users to address the scalability
problem. Using this concept, we have implemented an online user-based
collaborative web recommender system. In this model-based collaborative
system, the user session is divided into two levels. Entropy is calculated at
both levels.
Text summarization is the process of producing an abstract or a summary by
selecting the significant portions of the information from one or more texts.
In an automatic text summarization process, a text is given to the computer
and the computer returns a shorter, less redundant extract or abstract of the
original text(s). Many techniques have been developed for summarizing English
text(s). However, very few attempts have been made at Bengali text
summarization.
MultiDendrograms is a Java-written application that computes agglomerative
hierarchical clusterings of data. Starting from a distance (or weight)
matrix, MultiDendrograms is able to calculate its dendrograms using the most
common agglomerative hierarchical clustering methods. The application
implements a variable-group algorithm that solves the non-uniqueness problem
found in the standard pair-group algorithm.
Instead of the 'bag-of-words' representation, in the quantitative profile
approach to spam filtering and email categorization, an email is represented by
an m-dimensional vector of numbers, with m fixed in advance. Inspired by Sroufe
et al. [Sroufe, P., Phithakkitnukoon, S., Dantu, R., and Cangussu, J. (2010).
Email shape analysis. In \emph{LNCS}, 5935, pp. 18-29] two instances of
quantitative profiles are considered: line profile and character profile.
The performance of these profiles is studied on the TREC 2007 and CEAS 2008
corpora and on a private corpus.
In 2010, Web users ordered 73 items per second on Amazon alone and contributed
reviews of their consumption experience on a massive scale. As the Web matures
and becomes social and participatory, collaborative filters are the basic
complement in searching online for information about people, events and
products. In Web 2.0, what connected consumers create is not simply content
(e.g. reviews) but context. This new contextual framework of consumption
emerges through the aggregation and collaborative filtering of personal
preferences about goods on the Web on a massive scale.
This paper describes our work on discovering context for text document
categorization. The document categorization approach is derived from a
combination of a learning paradigm known as relation extraction and a
technique known as context discovery. We demonstrate the effectiveness of our
categorization approach using the Reuters-21578 dataset and synthetic real-world
data from the sports domain. Our experimental results indicate that the learned
context greatly improves the categorization performance as compared to
traditional categorization approaches.
As the number of online documents increases, the demand for document
classification to aid the analysis and management of documents is increasing.
Text is cheap, but information, in the form of knowing what classes a document
belongs to, is expensive. The main purpose of this paper is to explain the
expectation maximization technique of data mining for classifying documents
and to show how to improve accuracy when using a semi-supervised approach.
The expectation maximization algorithm is applied with both supervised and
semi-supervised approaches.
Searching is an important tool for information gathering. When information is
in the form of pictures, it plays a major role in enabling quick action and is
easy to memorize. Humans tend to retain pictures better than text. The
complexity and variety of queries can cause variation in the results, allowing
people to learn something new or leaving them confused.
With the explosion of information stored world-wide, data intensive computing
has become a central area of research. Efficient management and processing of
this massively exponential amount of data from diverse sources, such as
telecommunication call data records, online transaction records, etc., has
become a necessity. Removing redundancy from such huge (multi-billion records)
datasets, resulting in resource and compute efficiency for downstream
processing, constitutes an important area of study.
The number of World Wide Web pages grows enormously every day, and since most
information processing systems are found to be inefficient, the potential of
Internet applications is often underutilized. The web can be used efficiently
when similar web pages are rigorously and exhaustively organized and clustered
based on some domain knowledge (semantic-based). An ontology, which is a formal
representation of domain knowledge, aids in such efficient utilization.
Many algorithms have been proposed for predicting missing edges in networks,
but they do not usually take account of which edges are missing. We focus on
networks which have missing edges of the form that is likely to occur in real
networks, and compare algorithms that find these missing edges. We also
investigate the effect of this kind of missing data on community detection
algorithms.
The Probability Ranking Principle states that the document set with the
highest values of probability of relevance optimizes information retrieval
effectiveness given the probabilities are estimated as accurately as possible.
The key point of the principle is the separation of the document set into two
subsets with a given level of fallout and with the highest recall.
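As a minimal sketch of the principle stated above, documents can be sorted by their estimated probability of relevance, and the top of the ranking then forms the retrieved subset. The function names, the example probabilities and the relevance judgments below are illustrative assumptions, not taken from the paper.

```python
def rank_by_relevance(probabilities):
    """Return document ids sorted by descending estimated probability of relevance."""
    return sorted(probabilities, key=lambda doc: probabilities[doc], reverse=True)


def recall_and_fallout(ranking, relevant, cutoff):
    """Recall and fallout of the subset formed by the first `cutoff` ranked documents."""
    retrieved = set(ranking[:cutoff])
    relevant = set(relevant)
    non_relevant = set(ranking) - relevant
    # Recall: share of relevant documents that were retrieved.
    recall = len(retrieved & relevant) / len(relevant) if relevant else 0.0
    # Fallout: share of non-relevant documents that were retrieved.
    fallout = len(retrieved & non_relevant) / len(non_relevant) if non_relevant else 0.0
    return recall, fallout


# Hypothetical probability estimates and relevance judgments.
estimates = {"d1": 0.9, "d2": 0.2, "d3": 0.7, "d4": 0.4}
ranking = rank_by_relevance(estimates)
print(ranking, recall_and_fallout(ranking, {"d1", "d3"}, cutoff=2))
```

Moving the cutoff down the ranking trades higher recall for higher fallout, which is the separation into two subsets that the abstract refers to.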
The use of community detection algorithms is explored within the framework of
cover song identification, i.e. the automatic detection of different audio
renditions of the same underlying musical piece. Until now, this task has been
posed as a typical query-by-example task, where one submits a query song and
the system retrieves a list of possible matches ranked by their similarity to
the query. In this work, we propose a new approach which uses song communities
to provide more relevant answers to a given query.
For ambiguous queries, conventional retrieval systems are bound by two
conflicting goals. On the one hand, they should diversify and strive to present
results for as many query intents as possible. On the other hand, they should
provide depth for each intent by displaying more than a single result. Since
both diversity and depth cannot be achieved simultaneously in the conventional
static retrieval model, we propose a new dynamic ranking approach.
Searching for and making decisions about information is becoming increasingly
difficult as the amount of information and number of choices increase.
Recommendation systems help users find items of interest of a particular type,
such as movies or restaurants, but are still somewhat awkward to use. Our
solution is to take advantage of the complementary strengths of personalized
recommendation systems and dialogue systems, creating personalized aides.
Problem Statement: The huge amount of information on the web, as well as the
growing number of new inexperienced users, creates new challenges for
information retrieval. It has become increasingly difficult for these users to
find relevant documents that satisfy their individual needs. Certainly the
current search engines (such as Google, Bing and Yahoo) offer an efficient way
to browse the web content. However, the result quality depends heavily on
users' queries, which need to be more precise to find relevant documents.
Whereas today's information systems are well-equipped for efficient query
handling, their strict mathematical foundations hamper their use for everyday
tasks. In daily life, people expect information to be offered in a personalized
and focused way. But currently, personalization in digital systems still only
takes explicit knowledge into account and does not yet process conceptual
information often naturally implied by users. We discuss how to bridge the gap
between users and today's systems, building on results from cognitive
psychology.
In this paper we analyze the efficiency of various search results
diversification methods. While efficacy of diversification approaches has been
deeply investigated in the past, response time and scalability issues have been
rarely addressed. A unified framework for studying performance and feasibility
of result diversification solutions is thus proposed. First we define a new
methodology for detecting when, and how, query results need to be diversified.
For this purpose, we rely on the concept of "query refinement" to estimate the
probability that a query is ambiguous.
The current "data deluge" has flooded the Web of Data with very large RDF
datasets. They are hosted and queried through SPARQL endpoints which act as
nodes of a semantic net built on the principles of the Linked Data project.
Although this is a realistic philosophy for global data publishing, its query
performance is diminished when the RDF engines (behind the endpoints) manage
these huge datasets. Their indexes cannot be fully loaded in main memory, hence
these systems need to perform slow disk accesses to solve SPARQL queries.
We survey agglomerative hierarchical clustering algorithms and discuss
efficient implementations that are available in R and other software
environments. We look at hierarchical self-organizing maps, and mixture models.
We review grid-based clustering, focusing on hierarchical density-based
approaches. Finally we describe a recently developed very efficient (linear
time) hierarchical clustering algorithm, which can also be viewed as a
hierarchical grid-based algorithm.
Due to the emergence of the semantic Web and the increasing need to formalize
human knowledge, ontology engineering is now an important activity. But is
this activity very different from other ones like software engineering, for
example? In this paper, we investigate analogies between ontologies on the one
hand, and types, objects and databases on the other, taking into account the
notion of evolution of an ontology. We represent a unique ontology using
different paradigms, and observe that the distance between these different
concepts is small.
After a brief introduction to Probability Bracket Notation (PBN) for discrete
random variables in time-independent probability spaces, we apply both PBN and
Dirac notation to investigate probabilistic modeling for information retrieval
(IR). We derive the ranking formulas for various probabilistic models, induced
by Term Vector Space (TVS) and by Concept Fock Space (CFS). The ranking
formulas are naturally expressed in term frequencies; and, because our formulas
for inference network models (INM) are symmetric, they can also be used to rank
closeness of documents.
The semantic web is the next generation web, which concerns the meaning of web
documents. It has the immense power to pull out the most relevant information
from web pages, which is also meaningful to any user, using software
agents. Today, agent communication is not possible if the ontology concerned
is changed even slightly. We have identified this problem and
developed an Ontology Purification System to help agent communication. In our
system you can send queries and view the search results. If it cannot meet the
criteria, it finds the mismatched elements.
Recommendation is usually reduced to a prediction problem over the function
$r(u_a, e_i)$ that returns the expected rating of element $e_i$ for user $u_a$.
In the IPTV domain, we deal with an environment where the definitions of all
the parameters involved in this function (i.e., user profiles, feedback ratings
and elements) are controversial.
This paper is about a better understanding of the structure and dynamics of
science and the use of these insights to compensate for the typical problems
that arise in metadata-driven Digital Libraries. Three science-model-driven
retrieval services are presented: co-word analysis based query expansion,
re-ranking via Bradfordizing and author centrality.
Querying over XML elements using keyword search is steadily gaining
popularity. The traditional similarity measure is widely employed in order to
effectively retrieve various XML documents. A number of authors have already
proposed different similarity-measure methods that take advantage of the
structure and content of XML documents. They do not, however, consider the
similarity between latent semantic information of element texts and that of
keywords in a query.
The ranking problem in web-based rating systems has attracted much attention.
A good ranking algorithm should be robust against spammer attacks. Here we
propose a correlation-based reputation algorithm to solve the ranking problem
of such rating systems, where users vote on objects with ratings. In this
algorithm, the reputation of a user is iteratively determined by the
correlation coefficient between his/her rating vector and the corresponding
objects' weighted average rating vector.
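The iteration described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' exact procedure: reputations start at 1, negative or undefined correlations are clipped to 0, a fixed number of rounds is used instead of a convergence test, and the names `pearson` and `iterate_reputation` as well as the example ratings are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when it is undefined (assumption)."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def iterate_reputation(ratings, rounds=20):
    """ratings maps user -> {object: rating}; returns (reputation, quality)."""
    reputation = {user: 1.0 for user in ratings}  # assumed initial value
    quality = {}
    for _ in range(rounds):
        weighted, weights, plain, counts = {}, {}, {}, {}
        for user, votes in ratings.items():
            for obj, r in votes.items():
                weighted[obj] = weighted.get(obj, 0.0) + reputation[user] * r
                weights[obj] = weights.get(obj, 0.0) + reputation[user]
                plain[obj] = plain.get(obj, 0.0) + r
                counts[obj] = counts.get(obj, 0) + 1
        # Reputation-weighted average rating per object; fall back to the
        # plain average if every rater currently has zero reputation.
        quality = {
            obj: weighted[obj] / weights[obj] if weights[obj] > 0 else plain[obj] / counts[obj]
            for obj in weighted
        }
        # A user's reputation is the correlation between their ratings and the
        # current weighted averages of the objects they rated, clipped at 0.
        reputation = {
            user: max(0.0, pearson(list(votes.values()), [quality[o] for o in votes]))
            for user, votes in ratings.items()
        }
    return reputation, quality


# Hypothetical data: three consistent raters and one who inverts the scale.
example = {
    "alice": {"a": 5, "b": 3, "c": 1},
    "bob": {"a": 4, "b": 3, "c": 2},
    "carol": {"a": 5, "b": 4, "c": 1},
    "spam": {"a": 1, "b": 3, "c": 5},
}
reputation, quality = iterate_reputation(example)
print(reputation, quality)
```

In this toy data the inverted rater correlates negatively with the weighted averages, so the clipping removes that user's influence from the object scores after the first round.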
We develop an abstract model of information acquisition from redundant data.
We assume a random sampling process from data which provide information with
bias and are interested in the fraction of information we expect to learn as a
function of (i) the sampled fraction (recall) and (ii) varying bias of
information (redundancy distributions). We develop two rules of thumb with
varying robustness.
SPARQL query composition is difficult for the lay-person or even the
experienced bioinformatician in cases where the data model is unfamiliar.
Established best-practices and internationalization concerns dictate that
semantic web ontologies should use terms with opaque identifiers, further
complicating the task. We present SPARQL Assist: a web application that
addresses these issues by providing context-sensitive type-ahead completion to
existing web forms.
Unstructured information comprises a valuable source of data in clinical
records. For text mining in clinical records, concept extraction is the first
step in finding assertions and relationships. This study presents a system
developed for the annotation of medical concepts, including medical problems,
tests, and treatments, mentioned in clinical records. The system combines six
publicly available named entity recognition systems into one framework, and
uses a simple voting scheme that allows the precision and recall of the system
to be tuned to specific needs.
Because of the increasing amount of electronic data, designing efficient
tools to retrieve and exploit documents is a major challenge. Current search
engines suffer from two main drawbacks: there is limited interaction with the
list of retrieved documents and no explanation for their adequacy to the query.
Users may thus be confused by the selection and have no idea how to adapt their
query so that the results match their expectations. This paper describes a
request method and an environment based on aggregating models to assess the
relevance of documents annotated by concepts of an ontology.
The establishment of links between data (e.g., patient records) and Web
resources (e.g., literature) and the proper visualization of such discovered
knowledge is still a challenge in most Life Science domains (e.g.,
biomedicine). In this paper we present our contribution to the community in the
form of an infrastructure to annotate information resources, to discover
relationships among them, and to represent and visualize the newly discovered
knowledge. Furthermore, we have also implemented a Web-based prototype tool
which integrates the proposed infrastructure.
The use of domain knowledge is generally found to improve query efficiency in
content filtering applications. In particular, tangible benefits have been
achieved when using knowledge-based approaches within more specialized fields,
such as medical free texts or legal documents. However, the problem is that
sources of domain knowledge are time-consuming to build and equally costly to
maintain.
The email retrieval task has recently attracted much attention as a way to help
the user retrieve the email(s) related to the submitted query. To our knowledge,
existing email retrieval ranking approaches sort the retrieved emails based on
some heuristic rules, which are either search clues or some predefined user
criteria rooted in email fields. Unfortunately, the user usually does not know
the effective rule that yields the best ranking for his query. This paper
presents a new email retrieval ranking approach to tackle this problem.
Mercury is a federated metadata harvesting, search and retrieval tool based
on both open source software and software developed at Oak Ridge National
Laboratory. It
was originally developed for NASA, and the Mercury development consortium now
includes funding from NASA, USGS, and DOE. A major new version of Mercury was
developed during 2007. This new version provides orders of magnitude
improvements in search speed, support for additional metadata formats,
integration with Google Maps for spatial queries, support for RSS delivery of
search results, among other features.
This paper is about an information retrieval evaluation on three different
retrieval-supporting services. All three services were designed to compensate for
typical problems that arise in metadata-driven Digital Libraries, which are not
adequately handled by a simple tf-idf based retrieval. The services are: (1) a
co-word analysis based query expansion mechanism and re-ranking via (2)
Bradfordizing and (3) author centrality. The services are evaluated with
relevance assessments conducted by 73 information science students.
This paper is a short description of an information retrieval system enhanced
by three model driven retrieval services: (1) co-word analysis based query
expansion, re-ranking via (2) Bradfordizing and (3) author centrality. The
different services each favor quite different, but still relevant, documents
compared with pure term-frequency based rankings. The services can be
interactively combined with each other to allow an iterative retrieval
refinement.
As the amount of online text increases, the demand for text categorization to
aid the analysis and management of text is increasing. Text is cheap, but
information, in the form of knowing what classes a text belongs to, is
expensive. Automatic categorization of text can provide this information at low
cost, but the classifiers themselves must be built with expensive human effort,
or trained from texts which have themselves been manually classified. Text
categorization using Association Rule and Na\"ive Bayes Classifier is proposed
here.
Text classification is the process of classifying documents into predefined
categories based on their content. It is the automated assignment of natural
language texts to predefined categories. Text classification is the primary
requirement of text retrieval systems, which retrieve texts in response to a
user query, and text understanding systems, which transform text in some way
such as producing summaries, answering questions or extracting data. Existing
supervised learning algorithms to automatically classify text need sufficient
documents to learn accurately.
Text classification is the automated assignment of natural language texts to
predefined categories based on their content. Text classification is the
primary requirement of text retrieval systems, which retrieve texts in response
to a user query, and text understanding systems, which transform text in some
way such as producing summaries, answering questions or extracting data.
Nowadays the demand for text classification is increasing tremendously. With
this demand in mind, new and updated techniques are being developed for
the purpose of automated text classification.
Text classification is the process of classifying documents into predefined
categories based on their content. It is the automated assignment of natural
language texts to predefined categories. Text classification is the primary
requirement of text retrieval systems, which retrieve texts in response to a
user query, and text understanding systems, which transform text in some way
such as producing summaries, answering questions or extracting data. Existing
supervised learning algorithms for classifying text need sufficient documents
to learn accurately.
This paper addresses the general problem of modelling and learning rank data
with ties. We propose a probabilistic generative model that models the process
as permutations over partitions. This results in a super-exponential
combinatorial state space with unknown numbers of partitions and unknown
ordering among them. We approach the problem from the perspective of discrete
choice theory, where subsets are chosen in a stagewise manner, reducing the
state space at each stage significantly. Further, we show that with suitable
parameterisation,
we can still learn the models in linear time.
In this paper we present a method for reformulating the Recommender Systems
problem as an Information Retrieval one. In our tests we have a dataset of
users who give ratings for some movies; we hide some values from the dataset,
and we try to predict them again using its remaining portion (the so-called
"leave-n-out approach").
In this paper, we propose an architecture of active learning SVMs with
relevance feedback (RF) for classifying e-mail. This architecture combines
active learning strategies, where instead of using a randomly selected training
set, the learner has access to a pool of unlabeled instances and can request
the labels of some number of them, with relevance feedback, where if any mail
is misclassified then the next set of support vectors will be different from
the present set; otherwise the next set will not change.
This paper describes an effective unsupervised method for query-by-example
speaker retrieval. We suppose that only one speaker is in each audio file or
audio segment. The audio data are modeled using a common universal codebook.
The codebook is based on bag-of-frames (BOF). The features corresponding to the
audio frames are extracted from all audio files. These features are grouped
into clusters using the K-means algorithm. The individual audio files are
modeled by the normalized distribution of the numbers of cluster bins
corresponding to this file.
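The per-file representation described above can be sketched in a few lines. The sketch assumes the codebook (the cluster centres) has already been learned, for instance by K-means, and only shows the assignment and normalization step; the squared Euclidean distance, the function names and the toy two-dimensional features are assumptions made for illustration.

```python
def nearest_cluster(frame, codebook):
    """Index of the codebook entry closest to the frame (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(frame, codebook[k])),
    )


def bag_of_frames(frames, codebook):
    """Normalized histogram of cluster assignments for the frames of one audio file."""
    counts = [0] * len(codebook)
    for frame in frames:
        counts[nearest_cluster(frame, codebook)] += 1
    total = len(frames)
    if total == 0:
        return [0.0] * len(codebook)
    return [count / total for count in counts]


# Hypothetical 2-D features and a two-entry codebook.
codebook = [(0.0, 0.0), (10.0, 10.0)]
frames = [(1.0, 0.0), (0.0, 1.0), (9.0, 9.0), (0.0, 0.0)]
print(bag_of_frames(frames, codebook))
```

Two files can then be compared through any distance between their histograms; which distance the authors use is not stated in the abstract.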
Machine Science, or Data-driven Research, is a new and interesting scientific
methodology that uses advanced computational techniques to identify, retrieve,
classify and analyse data in order to generate hypotheses and develop models.
In this paper we describe three recent biomedical Machine Science studies, and
use these to assess the current state of the art with specific emphasis on data
mining, data assessment, costs, limitations, skills and tool support.
We engineer an algorithm to solve the approximate dictionary matching
problem. Given a list of words $\mathcal{W}$, maximum distance $d$ fixed at
preprocessing time and a query word $q$, we would like to retrieve all words
from $\mathcal{W}$ that can be transformed into $q$ with $d$ or fewer edit
operations. We present data structures that support fault tolerant queries by
generating an index. On top of that, we present a generalization of the method
that eases memory consumption and preprocessing time significantly. At the same
time, running times of queries are virtually unaffected.
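To make the problem statement concrete, the following brute-force baseline scans the whole word list and keeps the words within edit distance $d$ of the query. It is not the index-based method of the paper, only a reference sketch of what any such index has to return; it assumes the standard Levenshtein operations (insertion, deletion, substitution), and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def approximate_matches(words, query, d):
    """All words whose edit distance to the query is at most d."""
    return [
        w for w in words
        # A length difference above d already rules the word out.
        if abs(len(w) - len(query)) <= d and edit_distance(w, query) <= d
    ]


print(approximate_matches(["cat", "cart", "dog", "coat", "act"], "cat", 1))
```

The scan costs time proportional to the total size of the word list for every query, which is the cost that a precomputed index is meant to avoid.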
In this paper, based on agent and semantic web technologies, we propose an
approach, i.e., Semantic Oriented Agent Based Search (SOAS), to cope with
currently existing challenges of metadata extraction, modeling and information
retrieval over the web. SOAS is designed with four major requirements in mind,
i.e., Automatic user request handling, Dynamic unstructured full text reading,
Analysing and modeling, Semantic query generation and optimized result
classifier.
Cross-lingual adaptation, a special case of domain adaptation, refers to the
transfer of classification knowledge between two languages. In this article we
describe an extension of Structural Correspondence Learning (SCL), a recently
proposed algorithm for domain adaptation, for cross-lingual adaptation. The
proposed method uses unlabeled documents from both languages, along with a word
translation oracle, to induce cross-lingual feature correspondences.
In this work we have compared two indexing algorithms that have been used to
index and retrieve Carnatic music songs. We have compared a modified version
of the dual ternary indexing algorithm for music indexing and retrieval with
the multi-key hashing indexing algorithm proposed by us. The modification of
the dual ternary algorithm was essential to handle variable-length query phrases
and to accommodate features specific to Carnatic music. The dual ternary
indexing algorithm is adapted for Carnatic music by segmenting using the
segmentation technique for Carnatic music.
With the advancement of technology and reduced storage costs, individuals and
organizations are tending towards the usage of electronic media for storing
textual information and documents. It is time consuming for readers to retrieve
relevant information from an unstructured document collection. It is easier and
less time consuming to find documents in a large collection when the
collection is ordered or classified by group or category. The problem of
finding the best such grouping remains open.
The development of modern information technologies makes it possible to collect
and analyze huge amounts of statistical data in different spheres of life. The
main problem is not only to collect but to process all relevant information.
The purpose of our work is to show an example of intelligent data analysis in
such a complex and non-formalized field as science. Using the statistical data
about a scientific periodical, it is possible to perform its comprehensive
analysis and to solve different practical problems.
Recommender systems apply data mining techniques and prediction algorithms to
predict users' interest in information, products and services among the
tremendous amount of available items. The vast growth of information on the
Internet, as well as the number of visitors to websites, poses some key
challenges to recommender systems. These are: producing accurate
recommendations, handling many recommendations efficiently and coping with the
vast growth in the number of participants in the system.
Vertical search engines focus on specific slices of content, such as the Web
of a single country or the document collection of a large corporation. Despite
this, like general open web search engines, they are expensive to maintain,
expensive to operate, and hard to design. Because of this, predicting the
response time of a vertical search engine is usually done empirically through
experimentation, requiring a costly setup. An alternative is to develop a model
of the search engine for predicting performance. However, this alternative is
of interest only if its predictions are accurate.
Wiktionary is a unique, peculiar, valuable and original resource for natural
language processing (NLP). The paper describes an open-source Wiktionary
parser: its architecture and requirements followed by a description of
Wiktionary features to be taken into account, some open problems of Wiktionary
and the parser. The current implementation of the parser extracts the
definitions, semantic relations, and translations from English and Russian
Wiktionaries.
In this paper we demonstrate the applicability of latent Dirichlet allocation
(LDA) for classifying large Web document collections. One of our main results
is a novel influence model that gives a fully generative model of the document
content taking linkage into account. In our setup, topics propagate along links
in such a way that linked documents directly influence the words in the linking
document. As another main contribution we develop LDA specific boosting of
Gibbs samplers resulting in a significant speedup in our experiments.
The Library of Babel, described by Jorge Luis Borges, stores an enormous
amount of information. The Library exists {\it ab aeterno}. Wikipedia, a free
online encyclopaedia, becomes a modern analogue of such a Library. Information
retrieval and ranking of Wikipedia articles become the challenge of modern
society. We analyze the properties of two-dimensional ranking of all English
Wikipedia articles and show that it provides a reliable classification of them with rich
and nontrivial features.
In this paper we address the problem of accurately and efficiently
cross-referencing text fragments with Wikipedia pages, in a way that structured
knowledge is provided about the (unstructured) input text by resolving synonymy
and polysemy. We take inspiration from the invited talk of Chakrabarti at WSDM
2010, and extend his proposed scenario from the annotation of entire documents
to the annotation of short texts, such as snippets of search-engine results,
tweets, news, etc.
In the practical work of website popularization and the analysis of website
efficiency and downloading, it is of key importance to take web-ratings data
into account. The main indicators of website traffic include the number of
unique hosts from which the analyzed website was accessed and the number of
served web pages (hits) per unit time (for example, day, month or year). Of
certain interest is the ratio between the number of hits (S) and hosts (H). In
practice, the concept of an "average number of viewed pages" (S/H) is even
used, which by default assumes a linear dependence of S on H.
The high-level contribution of this paper is the development and
implementation of an algorithm to self-extract secondary keywords and their
combinations (combo words) based on abstracts collected using standard primary
keywords for research areas from reputed online digital libraries like IEEE
Xplore, PubMed Central, etc. Given a collection of N abstracts, we
arbitrarily select M abstracts (M<< N; M/N as low as 0.15) and parse each of
the M abstracts, word by word.
Objectives: Text categorization has been used in biomedical informatics for
identifying documents containing relevant topics of interest. We developed a
simple method that uses a chi-square-based scoring function to determine the
likelihood of MEDLINE citations containing genetically relevant topics. Methods: Our
procedure requires construction of a genetic and a nongenetic domain document
corpus. We used MeSH descriptors assigned to MEDLINE citations for this
categorization task. We compared frequencies of MeSH descriptors between two
corpora applying chi-square test.
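As a rough illustration of how a chi-square comparison of descriptor frequencies between two corpora could look, the sketch below computes the Pearson chi-square statistic of a 2x2 contingency table for each descriptor. The function names, the input layout (per-corpus document counts for each descriptor), and the use of the raw statistic as a score are assumptions made for this example, not the authors' scoring function.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no evidence.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def descriptor_scores(genetic_counts, nongenetic_counts, n_genetic, n_nongenetic):
    """Score every descriptor by how unevenly it is spread over the two corpora.

    genetic_counts / nongenetic_counts map a descriptor to the number of
    citations in that corpus carrying it (hypothetical input format).
    """
    scores = {}
    for term in set(genetic_counts) | set(nongenetic_counts):
        a = genetic_counts.get(term, 0)      # genetic citations with the term
        c = nongenetic_counts.get(term, 0)   # nongenetic citations with the term
        b = n_genetic - a                    # genetic citations without it
        d = n_nongenetic - c                 # nongenetic citations without it
        scores[term] = chi_square_2x2(a, b, c, d)
    return scores
```

A descriptor that is equally frequent in both corpora scores 0, while one that occurs in only one corpus receives the largest score.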
One of the most important issues in Information Retrieval is inferring the
intents underlying users' queries. Thus, any tool to enrich or to better
contextualize queries can prove extremely valuable. Entity extraction,
provided it is done fast, can be one such tool. Such techniques usually
rely on a prior training phase involving large datasets. That training is
costly, especially in environments which are increasingly moving towards
real-time scenarios where the latency to retrieve fresh information should be
minimal.
In this paper an `on-the-fly' query decomposition method is proposed.
This paper proposes some extensions to the work on kernels dedicated to
string alignment (biological sequence alignment) based on the summing up of
scores obtained by local alignments with gaps. The extensions we propose allow
us to construct, from classical time warp distances, what we call summative
time warp kernels, which are positive definite if some simple sufficient
conditions are satisfied.
In this report, we unify two quite distinct approaches to information
retrieval: region models and language models. Region models were developed for
structured document retrieval. They provide a well-defined behaviour as well as
a simple query language that allows application developers to rapidly develop
applications. Language models are particularly useful to reason about the
ranking of search results, and for developing new ranking approaches. The
unified model allows application developers to define complex language modeling
approaches as logical queries on a textual database.
Mining time series data has seen tremendous growth of interest in today's world.
To provide an indication, various implementations are studied and summarized to
identify the different problems in existing applications. Clustering time
series is a problem that has applications in a wide range of fields
and has recently attracted a large amount of research. Time series data are
frequently large and may contain outliers. In addition, time series are a
special type of data set where elements have a temporal ordering.
Geographic location search engines allow users to constrain and order search
results in an intuitive manner by focusing a query on a particular geographic
region. Geographic search technology, also called location search, has recently
received significant interest from major search engine companies. Academic
research in this area has focused primarily on techniques for extracting
geographic knowledge from the web. In this paper, we study the problem of
efficient query processing in scalable geographic search engines.
We describe a clustering method for labeled-link networks (semantic graphs)
that can be used to group important nodes (highly connected nodes) with their
relevant link labels by using PARAFAC tensor decomposition. In this kind of
network, the adjacency matrix cannot fully describe all information
about the network structure. We have to expand the matrix into a 3-way adjacency
tensor, so that it includes not only the information about which nodes a node
connects to but also the link labels by which it connects.
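To make the idea of a 3-way adjacency tensor concrete, here is a minimal sketch that builds one from (source, label, target) triples using nested lists, and shows what is lost when it is collapsed back into an ordinary adjacency matrix. The triple format and function names are assumptions for illustration; the PARAFAC decomposition itself is not implemented here.

```python
def build_adjacency_tensor(triples):
    """Build tensor[k][i][j] = 1 if node i links to node j with label k."""
    nodes = sorted({n for s, _, t in triples for n in (s, t)})
    labels = sorted({label for _, label, _ in triples})
    node_idx = {n: i for i, n in enumerate(nodes)}
    label_idx = {label: k for k, label in enumerate(labels)}
    size = len(nodes)
    tensor = [[[0] * size for _ in range(size)] for _ in labels]
    for s, label, t in triples:
        tensor[label_idx[label]][node_idx[s]][node_idx[t]] = 1
    return tensor, nodes, labels


def collapse_to_matrix(tensor):
    """Sum over the label mode; the result no longer says which label was used."""
    size = len(tensor[0]) if tensor else 0
    return [[sum(layer[i][j] for layer in tensor) for j in range(size)]
            for i in range(size)]
```

Each label contributes one slice of the tensor, so two links between different node pairs stay distinguishable by label, whereas the collapsed matrix only records that the links exist.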
The ability of fast similarity search at large scale is of great importance
to many Information Retrieval (IR) applications. A promising way to accelerate
similarity search is semantic hashing which designs compact binary codes for a
large number of documents so that semantically similar documents are mapped to
similar codes (within a short Hamming distance). Although some recently
proposed techniques are able to generate high-quality codes for documents known
in advance, obtaining the codes for previously unseen documents remains a
very challenging problem.
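The lookup side of semantic hashing can be sketched briefly: once documents have binary codes, similar documents are those within a short Hamming distance of the query code. The example below assumes codes are stored as Python integers and that the codes already exist; how to learn them (the hard part the abstract refers to) is outside this sketch.

```python
def hamming(a, b):
    """Number of bit positions in which two integer-encoded codes differ."""
    return bin(a ^ b).count("1")


def search(codes, query_code, radius):
    """Return ids of documents whose code lies within `radius` bits of the query.

    `codes` maps a document id to its binary code (hypothetical layout).
    """
    return sorted(doc_id for doc_id, code in codes.items()
                  if hamming(code, query_code) <= radius)
```

For example, with codes `{"d1": 0b1011, "d2": 0b0100}` and query `0b1001`, a radius of 1 retrieves only `d1`, since `d2` differs in three bits.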
We propose to use MapReduce to quickly test new retrieval approaches on a
cluster of machines by sequentially scanning all documents. We present a small
case study in which we use a cluster of 15 low-cost machines to search a web
crawl of 0.5 billion pages showing that sequential scanning is a viable
approach to running large-scale information retrieval experiments with little
effort. The code is available to other researchers at:
this http URL
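The sequential-scanning idea can be sketched on a single machine as a map step that scores every document against the query and a reduce step that keeps the top results. This is only an in-memory illustration with a toy term-count score; the function names and scoring are assumptions, and no real MapReduce framework or the authors' code is used.

```python
import heapq
from collections import Counter


def map_score(doc_id, text, query_terms):
    """Map step: emit (score, doc_id) for documents matching any query term."""
    counts = Counter(text.lower().split())
    score = sum(counts[t] for t in query_terms)
    if score > 0:
        yield (score, doc_id)


def reduce_top_k(scored, k):
    """Reduce step: keep the k highest-scoring (score, doc_id) pairs."""
    return heapq.nlargest(k, scored)


def sequential_scan(docs, query, k=10):
    """Scan every (doc_id, text) pair once; no index is built or consulted."""
    terms = query.lower().split()
    scored = (pair for doc_id, text in docs
              for pair in map_score(doc_id, text, terms))
    return reduce_top_k(scored, k)
```

Trying a new retrieval approach then amounts to swapping the scoring expression inside the map step, which is what makes the scan-everything design convenient for experiments.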
The proposed system aims at retrieving summarized information from
documents collected from a web-based search engine according to the user's
query related to the cricket and hockey domains. The system is designed so that
it takes voice commands as keywords for search. The parts of speech in the
query are extracted using a natural language extractor for English. Based on
the keywords, the search is categorized into 2 types: 1. Concept-wise:
information relevant to the query is retrieved based on the keywords and the
concept words related to them.
This paper addresses the design and implementation of a bilingual information
retrieval system for the Festivals domain. A generic platform is built for
bilingual information retrieval which can be extended to any foreign or Indian
language while working with the same efficiency. The search for the answer to
the query is not done in a specific predefined set of standard languages;
instead, the language is chosen dynamically on processing the user's query. This
paper deals with the Indian language Tamil in addition to English.
Web search engines retrieve a vast amount of information for a given search
query. But the user needs only trustworthy and high-quality information from
this vast retrieved data. The response time of the search engine must be a
minimum value in order to satisfy the user. An optimum level of response time
should be maintained even when the system is overloaded. This paper proposes an
optimal Load Shedding algorithm which is used to handle overload conditions in
real-time data stream applications and is adapted to the Information Retrieval
System of a web search engine.
In this paper, based on user-tag-object tripartite graphs, we propose a
recommendation algorithm that treats social tags as playing an important role in
information retrieval. Besides its low computational cost, experimental
results on two real-world data sets, \emph{Del.icio.us} and
\emph{MovieLens}, show that it can enhance the algorithmic accuracy and diversity.
Especially, it can obtain more personalized recommendation results when users
have diverse topics of tags.
This paper describes a method for multi-document update summarization that
relies on a double maximization criterion. A Maximal Marginal Relevance-like
criterion, modified and called Smmr, is used to select sentences that are
close to the topic and, at the same time, distant from sentences used in already
read documents. Summaries are then generated by assembling the highly ranked
material and applying some rule-based linguistic post-processing in order to
obtain length reduction and maintain coherence.
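A generic Maximal Marginal Relevance style selection can be sketched as follows. It is not the Smmr criterion itself, whose exact form is not given here; the Jaccard word-overlap similarity, the trade-off weight `lam`, and the function names are all assumptions chosen to keep the example self-contained.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def mmr_select(candidates, topic, already_read, k, lam=0.7):
    """Greedily pick k sentences close to the topic and far from known material."""
    selected = []
    seen = list(already_read)   # sentences the reader is assumed to know
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(sentence):
            redundancy = max((jaccard(sentence, r) for r in seen), default=0.0)
            return lam * jaccard(sentence, topic) - (1 - lam) * redundancy
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best)
        seen.append(best)       # later picks must also differ from this one
    return selected
```

A sentence that repeats an already read sentence verbatim is penalised with the maximum redundancy, so a less similar but novel sentence on the same topic is preferred.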
This paper proposes an incremental method that can be used by an intelligent
system to learn better descriptions of a thematic context. The method starts
with a small number of terms selected from a simple description of the topic
under analysis and uses this description as the initial search context. Using
these terms, a set of queries is built and submitted to a search engine. New
documents and terms are used to refine the learned vocabulary.
Missing web pages (pages that return the 404 "Page Not Found" error) are part
of the browsing experience. The manual use of search engines to rediscover
missing pages can be frustrating and unsuccessful. We compare four automated
methods for rediscovering web pages. We extract the page's title, generate the
page's lexical signature (LS), obtain the page's tags from the bookmarking
website delicious.com and generate an LS from the page's link neighborhood. We
use the output of all methods to query Internet search engines and analyze
their retrieval performance.
Missing web pages, URIs that return the 404 "Page Not Found" error or the
HTTP response code 200 but dereference unexpected content, are ubiquitous in
today's browsing experience. We use Internet search engines to relocate such
missing pages and provide means that help automate the rediscovery process. We
propose querying web pages' titles against search engines. We investigate the
retrieval performance of titles and compare them to lexical signatures which
are derived from the pages' content. Since titles naturally represent the
content of a document, they intuitively change over time.
This paper illustrates the Principal Direction Divisive Partitioning (PDDP)
algorithm, describes its drawbacks, and introduces a combinatorial framework
for the PDDP algorithm. It then describes the simplified version of the EM
algorithm called the spherical Gaussian EM (sGEM) algorithm and the
Information Bottleneck (IB) method, techniques considered here with respect to
accuracy, complexity, time and space.
The World Wide Web is a huge repository of web pages and links. It provides
an abundance of information for Internet users. The growth of the web is
tremendous, as approximately one million pages are added daily. Users' accesses
are recorded in web logs. Because of the tremendous usage of the web, web log
files are growing at a faster rate and their size is becoming huge. Web data
mining is the application of data mining techniques to web data.
Micro-blogging services such as Twitter allow anyone to publish anything,
anytime. Needless to say, much of the available content can be dismissed
as babble or spam. However, given the number and diversity of users, some
valuable pieces of information should arise from the stream of tweets. Thus,
such services can develop into valuable sources of up-to-date information (the
so-called real-time web) provided a way to find the most
relevant/trustworthy/authoritative users is available.
Consider a family of sets and a single set, called the query set. How can one
quickly find a member of the family which has a maximal intersection with the
query set? Time constraints on the query and on a possible preprocessing of the
set family make this problem challenging. Such maximal intersection queries
arise in a wide range of applications, including web search, recommendation
systems, and distributing on-line advertisements. In general, maximal
intersection queries are computationally expensive.
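One simple baseline for a maximal intersection query is to preprocess the family into an inverted index from elements to set ids and then count, for each set, how many query elements it contains. The sketch below illustrates that baseline only; the function names and tie-breaking rule are assumptions, and it is not the method the paper proposes.

```python
from collections import Counter, defaultdict


def build_index(family):
    """Preprocessing: map each element to the ids of the sets containing it."""
    index = defaultdict(list)
    for set_id, members in family.items():
        for element in members:
            index[element].append(set_id)
    return index


def max_intersection(index, query):
    """Return (set_id, overlap) for a set sharing the most elements with query.

    Ties are broken by the smaller set id; returns None if nothing overlaps.
    """
    counts = Counter()
    for element in query:
        for set_id in index.get(element, ()):
            counts[set_id] += 1
    if not counts:
        return None
    return min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

The query cost is proportional to the total length of the posting lists touched, which can still be large when query elements are common. That is one way to see why such queries are expensive in general.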
In this paper we introduce the concept of dynamic link pages. A web site/page
contains a number of links to other pages. Not all links are equally
important: some are visited frequently and others rarely. In
this scenario, identifying the frequently used links and placing them in the
top left corner of the page will increase the user's satisfaction. This process
will reduce the time spent by a visitor on the page, as most of the time the
popular links are presented in the visible part of the screen itself.
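The reordering step can be sketched in a few lines: given per-link visit counts, place the most frequently visited links first. The click-count dictionary and the function name are hypothetical; how the counts are collected and how the page is rendered are left out.

```python
def order_links(links, click_counts):
    """Return links sorted by descending visit count.

    Links without recorded visits count as 0; ties keep their original
    order because Python's sort is stable.
    """
    return sorted(links, key=lambda link: -click_counts.get(link, 0))
```

For instance, `order_links(["about", "news", "contact"], {"news": 50, "contact": 5})` puts `news` first and leaves the unvisited `about` link last.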
The United States Code (Code) is a document containing over 22 million words
that represents a large and important source of Federal statutory law. Scholars
and policy advocates often discuss the direction and magnitude of changes in
various aspects of the Code. However, few have mathematically formalized the
notions behind these discussions or directly measured the resulting
representations. This paper addresses the current state of the literature in
two ways.
Data generated in the fields of science, technology, business and in many
other fields of research are increasing at an exponential rate. Extracting
knowledge from such a huge set of data is a challenging task.
The shift from an information society to a knowledge society requires rapid
information harvesting, reliable search and instantaneous on-demand delivery.
Information extraction agents are used to explore and collect data available
on the Web, in order to effectively exploit such data for business purposes, such
as automatic news filtering, advertisement or product searching and price
comparison. In this paper, we develop a real-time automatic harvesting agent for
adverts posted on the Servihoo web portal and an SMS-based notification system.
Click through rates (CTR) offer useful user feedback that can be used to
infer the relevance of search results for queries. However, it is not very
meaningful to look at the raw click through rate of a search result because the
likelihood of a result being clicked depends not only on its relevance but also
the position in which it is displayed. One model of the browsing behavior, the
{\em Examination Hypothesis} \cite{RDR07,Craswell08,DP08}, states that each
position has a certain probability of being examined and is then clicked based
on the relevance of the search snippets.
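Under the Examination Hypothesis, the click probability factors as P(click) = P(examined at position) x relevance. A minimal sketch of how this corrects raw CTR for position bias is given below; it assumes the per-position examination probabilities are already known, and the data layout and function names are illustrative only.

```python
def raw_ctr(impressions):
    """Fraction of impressions that were clicked, ignoring position."""
    return sum(1 for _, clicked in impressions if clicked) / len(impressions)


def estimate_relevance(impressions, exam_prob):
    """Position-corrected relevance estimate for one query-result pair.

    impressions: non-empty list of (position, clicked) pairs.
    exam_prob:   assumed probability that each position is examined.
    The estimate is clicks divided by the expected number of examinations.
    """
    clicks = sum(1 for _, clicked in impressions if clicked)
    expected_exams = sum(exam_prob[pos] for pos, _ in impressions)
    return clicks / expected_exams
```

A result shown 100 times at a position examined 25% of the time and clicked 10 times has a raw CTR of 0.1 but an estimated relevance of 0.4, higher than a result with CTR 0.2 at a position that is always examined.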
In this paper we describe a mechanism to improve Information Retrieval (IR)
on the web. The method is based on Formal Concept Analysis (FCA), which
builds semantic relations during querying and allows the answers provided by a
search engine to be reorganized in the shape of a concept lattice. We
propose for IR an incremental algorithm based on the Galois lattice. This
algorithm allows a formal clustering of the data sources, and the results
it returns are ranked by order of relevance.
The paper presents our design of a next generation information retrieval
system based on tag co-occurrences and subsequent clustering. We help users
get access to digital data through information visualization in the form of
tag clusters. Current problems like the absence of interactivity and semantics
between tags or the difficulty of adding additional search arguments are
solved.
Text segmentation is an inherent part of an OCR system irrespective of its
domain of application. The OCR system contains a segmentation module
where the text lines, words and ultimately the characters must be segmented
properly for successful recognition. The present work implements a Hough
transform based technique for line and word segmentation from digitized images.
The proposed technique is applied not only to a document image dataset but
also to datasets for a business card reader system and a license plate
recognition system.
Term extraction is one of the layers in the ontology development process
which has the task to extract all the terms contained in the input document
automatically. The purpose of this process is to generate a list of terms that
are relevant to the domain of the input document. In the literature there are
many approaches, techniques and algorithms used for term extraction. In this
paper we propose a new approach using particle swarm optimization techniques in
order to improve the accuracy of term extraction results. We choose five
features to represent the term score.
In Information Retrieval (IR), whether implicitly or explicitly, queries and
documents are often represented as vectors. However, it may be more beneficial
to consider documents and/or queries as multidimensional objects. Our belief is
this would allow building "truly" interactive IR systems, i.e., where
interaction is fully incorporated in the IR framework.
This paper presents a work package realized for the G\'eOnto project. A new
method is proposed for enriching an initial geographical ontology developed
beforehand. This method relies on text analysis by lexico-syntactic patterns.
From the retrieved n-ary relations, the method automatically detects those
involved in a spatial and/or temporal relation in the context of descriptions
of journeys.
Within the documentary system domain, the integration of thesauri into the
indexing and information retrieval steps is common. In libraries, documents
carry rich descriptive information created by librarians, in the form of
descriptive records based on the Rameau thesaurus. We exploit two kinds of
information in order to create a
first semantic structure. A step of conceptualization allows us to define the
various modules used to automatically build the semantic structure of the
indexation work. Our current work focuses on an approach that aims to define an
ontology based on a thesaurus.
Automatic construction of ontologies from text is generally based on
retrieving text content. To obtain a much richer ontology, we extend these
approaches by taking into account the document structure and some external
resources (such as thesauri of indexing terms from a nearby domain). In this
paper we describe how these external resources are first analyzed and then
exploited. This method has been applied to a geographical domain and the
benefit has been
evaluated.
The emergence of various vertical search engines highlights the fact that a
single ranking technology cannot deal with the complexity and scale of search
problems. For example, technology behind video and image search is very
different from general web search. Their ranking functions share few features.
Question answering websites (e.g., Yahoo! Answers) can make use of text matching
and click features developed for general web search, but they have unique page
structures and rich user feedback, e.g., thumbs up and thumbs down ratings in
Yahoo! Answers, which greatly benefit their own ranking.
The framing of issues in the mass media plays a crucial role in the public
understanding of science and technology. This article contributes to research
concerned with diachronic analysis of media frames by making an analytical
distinction between implicit and explicit media frames, and by introducing an
automated method for analysing diachronic changes of implicit frames. In
particular, we apply a semantic maps method to a case study on the newspaper
debate about artificial sweeteners, published in The New York Times (NYT)
between 1980 and 2006.
When users rate objects, a sophisticated algorithm that takes into account
ability or reputation may produce a fairer or more accurate aggregation of
ratings than the straightforward arithmetic average. Recently a number of
authors have proposed different co-determination algorithms where estimates of
user and object reputation are refined iteratively together, permitting
accurate measures of both to be derived directly from the rating data.
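One possible shape of such a co-determination loop is sketched below: object quality is a reputation-weighted average of ratings, and user reputation is the inverse of the user's mean squared deviation from the current quality estimates. This particular update rule, the smoothing constant `eps`, and the function name are assumptions for illustration, not any specific published algorithm.

```python
from collections import defaultdict


def co_determine(ratings, iterations=20, eps=1e-6):
    """Iteratively refine object quality and user reputation together.

    ratings: list of (user, obj, value) triples.
    Returns (quality, reputation) dictionaries.
    """
    reputation = {user: 1.0 for user, _, _ in ratings}
    quality = {}
    for _ in range(iterations):
        # Quality: average of ratings weighted by current user reputation.
        num, den = defaultdict(float), defaultdict(float)
        for user, obj, value in ratings:
            num[obj] += reputation[user] * value
            den[obj] += reputation[user]
        quality = {obj: num[obj] / den[obj] for obj in num}
        # Reputation: inverse of the user's mean squared error.
        errors = defaultdict(list)
        for user, obj, value in ratings:
            errors[user].append((value - quality[obj]) ** 2)
        reputation = {user: 1.0 / (sum(e) / len(e) + eps)
                      for user, e in errors.items()}
    return quality, reputation
```

With equal initial reputations the first pass reproduces the arithmetic average; later passes shift the estimate toward the ratings of users who agree with the consensus.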
In Bioinformatics, text mining, a term sometimes used interchangeably with text
data mining, is a process to derive high-quality information from text. Perl
Status Reporter (SRr) is a tool for fetching data from a flat text file, and in
this research paper we illustrate the use of SRr in text or data mining. SRr
needs a flat text input file on which the mining process is to be performed. SRr
reads the input file and derives high-quality information from it. Typically,
text mining
tasks are text categorization, text clustering, concept and entity extraction,
and document summarization.
How to rank web pages, scientists and online resources has recently attracted
increasing attention from both physicists and computer scientists. In this
paper, we study the ranking problem of rating systems where users vote on
objects with discrete ratings. We propose an algorithm that can simultaneously
evaluate the user reputation and object quality by iterative refinement.
According to both the artificially generated data and the real data from
MovieLens and Amazon, our algorithm can considerably enhance the ranking
accuracy.
Can self-organization of scientific communication be specified by using
literature-based indicators? In this study, we explore this question by
applying entropy measures to typical "Mode-2" fields of knowledge production.
We hypothesized these scientific systems to be developing from a
self-organization of the interaction between cognitive and institutional
levels: European subsidized research programs aim at creating an institutional
network, while a cognitive reorganization is continuously ongoing at the
scientific field level.
Mutual information among three or more dimensions (mu-star = - Q) has been
considered as interaction information. However, Krippendorff (2009a, 2009b) has
shown that this measure cannot be interpreted as a unique property of the
interactions and has proposed an alternative measure of interaction information
based on iterative approximation of maximum entropies. Q can then be considered
as a measure of the difference between interaction information and redundancy
generated in a model entertained by an observer.
This paper describes the approach taken to the XML Mining track at INEX 2008
by a group at the Queensland University of Technology. We introduce the K-tree
clustering algorithm in an Information Retrieval context by adapting it for
document clustering. Many large scale problems exist in document clustering.
K-tree scales well with large inputs due to its low complexity. It offers
promising results both in terms of efficiency and quality. Document
classification was completed using Support Vector Machines.
We introduce K-tree in an information retrieval context. It is an efficient
approximation of the k-means clustering algorithm. Unlike k-means it forms a
hierarchy of clusters. It has been extended to address issues with sparse
representations. We compare performance and quality to CLUTO using document
collections. The K-tree has a low time complexity that is suitable for large
document collections. This tree structure allows for efficient disk based
implementations where space requirements exceed that of main memory.
Random Indexing (RI) K-tree is the combination of two algorithms for
clustering. Many large scale problems exist in document clustering. RI K-tree
scales well with large inputs due to its low complexity. It also exhibits
features that are useful for managing a changing collection. Furthermore, it
solves previous issues with sparse document vectors when using K-tree. The
algorithms and data structures are defined, explained and motivated. Specific
modifications to K-tree are made for use with RI. Experiments have been
executed to measure quality.
Recent advances in hardware sophistication related to graphics display, audio
and video devices have made available a large number of multimedia and hypermedia
applications. These multimedia applications need to store and retrieve the
different forms of media like text, hypertext, graphics, still images,
animations, audio and video. Dance is one of the important cultural forms of a
nation and dance video is one such multimedia type. Archiving and retrieving
the required semantics from these dance media collections is a crucial and
demanding multimedia application.
A web blog is used as a collaborative platform to publish and share
information. The information accumulated in the blog intrinsically contains
knowledge. The knowledge shared by the community of people has an intangible
value proposition. The blog is viewed as a multimedia information resource available
on the Internet. In a blog, information in the form of text, image, audio and
video builds up exponentially.
Nowadays, search engines are the most widely used means of extracting
information from various resources throughout the world. The majority of
searches lie in the biomedical field, for retrieving related documents from
various biomedical databases. Current search engines fall short in document
clustering and in representing the relatedness of documents extracted from the
databases. In order to overcome these pitfalls, a text-based search engine has
been developed for retrieving documents from the Medline and PubMed biomedical
databases.
Viruses utilize various means to circumvent the immune detection in the
biological systems. Several mathematical models have been investigated for the
description of viral dynamics in the biological system of human and various
other species. One common strategy for learning about the evasion and
recognition of viruses is to become acquainted with these systems by means of
search engines. From this perspective, a search tool has been developed to
provide a wider comprehension of the structure and other details of viruses,
and it is described in this paper.
Document indexation is an essential task achieved by archivists or automatic
indexing tools. To retrieve documents relevant to a query, the keywords
describing each document have to be carefully chosen. Archivists have to find out the
right topic of a document before starting to extract the keywords. For an
archivist indexing specialized documents, experience plays an important role.
But indexing documents on different topics is much harder. This article
proposes an innovative method for an indexing support system.
This note attempts a sketch of the history of spectral ranking, a
general umbrella name for techniques that apply the theory of linear maps (in
particular, eigenvalues and eigenvectors) to matrices that do not represent
geometric transformations, but rather some kind of relationship between
entities. Albeit recently made famous by the ample press coverage of Google's
PageRank algorithm, spectral ranking was devised more than fifty years ago,
almost exactly in the same terms, and has been studied in psychology and social
sciences.
In this paper, we explain social information retrieval (SIR) and
collaborative information retrieval (CIR). We see SIR as a way of knowing who
to collaborate with in resolving an information problem while CIR entails the
process of mutual understanding and solving of an information problem among
collaborators. We are interested in the transition from SIR to CIR; hence, we
developed a communication model to facilitate knowledge sharing during CIR.
This document describes the BM25 and BM25F implementation using the Lucene
Java Framework. Both models have stood out at TREC by their performance and are
considered state-of-the-art in the IR community. BM25 is applied to `ad-hoc'
retrieval, that is, to documents that do not contain fields; BM25F, on the
other hand, is applied to documents with structure.
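For readers unfamiliar with the ranking function, the following is a small, self-contained sketch of BM25 scoring over unstructured documents. It is written in plain Python rather than against the Lucene API, uses whitespace tokenization, and adopts the common defaults k1 = 1.2 and b = 0.75 together with a non-negative IDF variant; all of these are assumptions of this example, not details of the implementation described.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for a whitespace-tokenized query.

    Assumes `docs` is a non-empty list of strings.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

BM25F differs in that term frequencies from several fields are weighted and combined before the saturation step; that extension is not shown here.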
The dynamic environment in the real world calls for the adaptive techniques
for information filtering, namely to provide real-time responses to the changes
of system data. While many incremental algorithms are designed for this
purpose, they are usually challenged by steadily worsening performance
resulting from cumulative errors over time. In this Letter, we propose two
incremental diffusion-based algorithms for personalized recommendations,
which integrate some local and fast updates to achieve
approximate results.
The use of Pearson's correlation coefficient in Author Cocitation Analysis
was compared with Salton's cosine measure in a number of recent contributions.
Unlike the Pearson correlation, the cosine is insensitive to the number of
zeros. However, one has the option of applying a logarithmic transformation in
correlation analysis. Information calculus is based on the logarithmic
transformation and provides non-parametric statistics. Using this methodology,
one can cluster a document set in a precise way and express the differences in
terms of bits of information.
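The claim about zeros can be checked with a small worked example: appending positions where both vectors are zero leaves the cosine unchanged but changes the Pearson correlation. The vectors below are made-up illustrative counts, not data from the cited studies.

```python
import math


def cosine(x, y):
    """Salton's cosine between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))


def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


x, y = [1, 2, 3], [2, 1, 3]
x_padded, y_padded = x + [0, 0, 0], y + [0, 0, 0]

print(cosine(x, y), cosine(x_padded, y_padded))    # same value twice
print(pearson(x, y), pearson(x_padded, y_padded))  # 0.5 versus 0.875
```

Shared zeros add nothing to the dot product or the norms, so the cosine is unaffected, whereas they shift the means and therefore every deviation that enters the Pearson coefficient.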
Search engines are nowadays one of the most important entry points for
Internet users and a central tool to solve most of their information needs.
Still, there exists a substantial number of user searches that obtain
unsatisfactory results. Needless to say, several lines of research aim to
increase the relevancy of the results users retrieve. In this paper the authors
frame this problem within the much broader (and older) one of information
overload.
We study the properties of the Google matrix of an Ulam network generated by
intermittency maps. This network is created by the Ulam method which gives a
matrix approximant for the Perron-Frobenius operator of a dynamical map. The
spectral properties of eigenvalues and eigenvectors of this matrix are
analyzed. We show that the PageRank of the system is characterized by a power
law decay with the exponent $\beta$ dependent on map parameters and the Google
damping factor $\alpha$.
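For orientation, the PageRank vector of a Google matrix G = \alpha S + (1 - \alpha)/N can be computed by power iteration, as in the minimal sketch below. It works on a small hand-written link list rather than an Ulam network, treats dangling nodes as linking uniformly to all nodes, and uses \alpha = 0.85 by default; these choices and the function name are assumptions for illustration.

```python
def pagerank(links, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for the PageRank vector.

    links[i] is the list of nodes that node i points to.
    """
    n = len(links)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        # Teleportation term (1 - alpha)/N, shared by all nodes.
        new = [(1.0 - alpha) / n] * n
        for i, outs in enumerate(links):
            if outs:
                share = alpha * p[i] / len(outs)
                for j in outs:
                    new[j] += share
            else:
                # Dangling node: spread its weight uniformly.
                for j in range(n):
                    new[j] += alpha * p[i] / n
        if sum(abs(a - b) for a, b in zip(new, p)) < tol:
            return new
        p = new
    return p
```

On a directed 3-cycle every node receives probability 1/3, while in a graph where two nodes point to a third the third node receives the largest PageRank.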