We present a new space-efficient approach, (SparseDTW), to compute the
Dynamic Time Warping (DTW) distance between two time series that always yields
the optimal result. This is in contrast to other known approaches which
typically sacrifice optimality to attain space efficiency. The main idea behind
our approach is to dynamically exploit the existence of similarity and/or
correlation between the time series. The more the similarity between the time
series the less space required to compute the DTW between them.
We formulate query-subquery nets and use them to create the first framework
for developing algorithms for evaluating queries to Horn knowledge bases with
the properties that: the approach is goal-directed; each subquery is processed
only once and each supplement tuple, if desired, is transferred only once;
operations are done set-at-a-time; and any control strategy can be used. Our
intention is to increase efficiency of query processing by eliminating
redundant computation, increasing flexibility and reducing the number of
accesses to the secondary storage.
This paper deals with memory management issues of robotics. In our proposal
we break one of the major issues in creating humanoid. . Database issue is the
complicated thing in robotics schema design here in our proposal we suggest new
concept called NOSQL database for the effective data retrieval, so that the
humanoid robots will get the massive thinking ability in searching each items
using chained instructions.
With the increasing prevalence of location-aware devices, trajectory data has
been generated and collected in various application domains. Trajectory data
carries rich information that is useful for many data analysis tasks. Yet,
improper publishing and use of trajectory data could jeopardize individual
privacy. However, it has been shown that existing privacy-preserving trajectory
data publishing methods derived from partition-based privacy models, for
example k-anonymity, are unable to provide sufficient privacy protection.
Knowledge bases of entities and relations (either constructed manually or
automatically) are behind many real world search engines, including those at
Yahoo!, Microsoft, and Google. Those knowledge bases can be viewed as graphs
with nodes representing entities and edges representing (primary)
relationships, and various studies have been conducted on how to leverage them
to answer entity seeking queries. Meanwhile, in a complementary direction,
analyses over the query logs have enabled researchers to identify entity pairs
that are statistically correlated.
As an essential operation in data cleaning, the similarity join has attracted
considerable attention from the database community. In this paper, we study
string similarity joins with edit-distance constraints, which find similar
string pairs from two large sets of strings whose edit distance is within a
given threshold. Existing algorithms are efficient either for short strings or
for long strings, and there is no algorithm that can efficiently and adaptively
support both short strings and long strings. To address this problem, we
propose a partition-based method called Pass-Join.
A previously proposed keyword search paradigm produces, as a query result, a
ranked list of Object Summaries (OSs). An OS is a tree structure of related
tuples that summarizes all data held in a relational database about a
particular Data Subject (DS). However, some of these OSs are very large in size
and therefore unfriendly to users that initially prefer synoptic information
before proceeding to more comprehensive information about a particular DS. In
this paper, we investigate the effective and efficient retrieval of concise and
informative OSs.
Querying uncertain data sets (represented as probability distributions)
presents many challenges due to the large amount of data involved and the
difficulties comparing uncertainty between distributions. The Earth Mover's
Distance (EMD) has increasingly been employed to compare uncertain data due to
its ability to effectively capture the differences between two distributions.
Computing the EMD entails finding a solution to the transportation problem,
which is computationally intensive.
Many dynamic applications are built upon large network infrastructures, such
as social networks, communication networks, biological networks and the Web.
Such applications create data that can be naturally modeled as graph streams,
in which edges of the underlying graph are received and updated sequentially in
a form of a stream. It is often necessary and important to summarize the
behavior of graph streams in order to enable effective query processing.
However, the sheer size and dynamic nature of graph streams present an enormous
challenge to existing graph management techniques.
One of the main challenges that the Semantic Web faces is the integration of
a growing number of independently designed ontologies. In this work, we present
PARIS, an approach for the automatic alignment of ontologies. PARIS aligns not
only instances, but also relations and classes. Alignments at the instance
level cross-fertilize with alignments at the schema level. Thereby, our system
provides a truly holistic solution to the problem of ontology alignment. The
heart of the approach is probabilistic, i.e., we measure degrees of matchings
based on probability estimates.
In this paper, we formulate a top-k query that compares objects in a database
to a user-provided query object on a novel scoring function. The proposed
scoring function combines the idea of attractive and repulsive dimensions into
a general framework to overcome the weakness of traditional distance or
similarity measures. We study the properties of the proposed class of scoring
functions and develop efficient and scalable index structures that index the
isolines of the function. We demonstrate various scenarios where the query
finds application.
Newly-released web applications often succumb to a "Success Disaster," where
overloaded database machines and resulting high response times destroy a
previously good user experience. Unfortunately, the data independence provided
by a traditional relational database system, while useful for agile
development, only exacerbates the problem by hiding potentially expensive
queries under simple declarative expressions.
In this paper, we proposed a new technique for backing up and restoring
different Database Management Systems (DBMS). The technique is enabling to
backup and restore a part of or the whole database using a unified interface
using ASP.NET and XML technologies. It presents a Web Solution allowing the
administrators to do their jobs from everywhere, locally or remotely. To show
the importance of our solution, we have taken two case studies, oracle 11g and
SQL Server 2008.
With the rapid growth of internet technologies, Web has become a huge
repository of information and keeps growing exponentially under no editorial
control. However the human capability to read, access and understand Web
content remains constant. This motivated researchers to provide Web
personalized online services such as Web recommendations to alleviate the
information overload problem and provide tailored Web experiences to the Web
users. Recent studies show that Web usage mining has emerged as a popular
approach in providing Web personalization.
This paper outlines certain scenarios from the fields of astrophysics and
fluid dynamics simulations which require high performance data warehouses that
support array data type. A common feature of all these use cases is that
subsetting and preprocessing the data on the server side (as far as possible
inside the database server process) is necessary to avoid the client-server
overhead and to minimize IO utilization.
In many modern applications, data are received as infinite, rapid,
unpredictable and time- variant data elements that are known as data streams.
Systems which are able to process data streams with such properties are called
Data Stream Management Systems (DSMS). Due to the unpredictable and time-
variant properties of data streams as well as system, adaptivity of the DSMS is
a major requirement for each DSMS.
ID is derived from the word identity, derived from the first two characters
in the word. ID is used to distinguish between an entity to another entity.
Student ID (SID) is the key differentiator between a student with other
students. On the concept of database, the differentiator is unique. SID can be
numbers, letters, or a combination of both (alphanumeric). Viewed from the
daily context, it is not important to determine which a SID belongs to the type
of data. However, when reviewed on database design, determining the type of
data, including SID in this case, is important.
This paper presents a new solution, which adds a new database server on an
existing Oracle Multimaster Data replication system with Online Instantiation
method. During this time the system is down, because we cannot execute DML
statements on replication objects but we can only make queries. The time for
adding the new database server depends on the number of objects, on the
replication group and on the network conditions. We propose to add a new layer
between replication objects and the database sessions, which contain DML
statements.
Random sampling is an essential tool in the processing and transmission of
data. It is used to summarize data too large to store or manipulate and meet
resource constraints on bandwidth or battery power. Estimators that are applied
to the sample facilitate fast approximate processing of queries posed over the
original data and the value of the sample hinges on the quality of these
estimators.
The like regular expression predicate has been part of the SQL standard since
at least 1989. However, despite its popularity and wide usage, database vendors
provide only limited indexing support for regular expression queries which
almost always require a full table scan.
Often corporations need tools to improve their decision making in a
competitive market. In general, these tools are based on data warehouse
platforms to mange and analyze large amounts of data. However, several of these
corporations do not have enough resources to buy such platforms because of the
high cost. This work is dedicated to a feasibility study of a low cost platform
to data warehouse. We consider as a low cost platform the use of open source
software like the PostgreSQL database system and the GNU/Linux operational
system.
We study the problem of index deployment \textit{ordering}. Many database
applications deploy hundreds or thousands of indexes on large tables to speed
up query execution. An effective index deployment ordering can produce (1) a
prompt query runtime improvement and (2) a reduced total deployment time. Both
of these are essential qualities of design tools for quickly evolving
databases, but optimizing the problem is challenging because of complex index
interactions and a factorial number of possible solutions.
The interlinking of datasets published in the Linked Data Cloud is a
challenging problem and a key factor for the success of the Semantic Web.
Manual rule-based methods are the most effective solution for the problem, but
they require skilled human data publishers going through a laborious, error
prone and time-consuming process for manually describing rules mapping
instances between two datasets. Thus, an automatic approach for solving this
problem is more than welcome. In this paper, we propose a novel interlinking
method, SERIMI, for solving this problem automatically.
The relational model is the most commonly used data model for storing large
datasets, perhaps due to the simplicity of the tabular format which had
revolutionized database management systems. However, many real world objects
are recursive and associative in nature which makes storage in the relational
model difficult. The hypergraph model is a generalization of a graph model,
where each hypernode can be made up of other nodes or graphs and each hyperedge
can be made up of one or more edges. It may address the recursive and
associative limitations of relational model.
With the recent surge of social networks like Facebook, new forms of
recommendations have become possible - personalized recommendations of ads,
content, and even new friend and product connections based on one's social
interactions. Since recommendations may use sensitive social information, it is
speculated that these recommendations are associated with privacy risks. The
main contribution of this work is in formalizing these expected trade-offs
between the accuracy and privacy of personalized social recommendations.
New hardware platforms, e.g. cloud, multi-core, etc., have led to a
reconsideration of database system architecture. Our Deuteronomy project
separates transactional functionality from data management functionality,
enabling a flexible response to exploiting new platforms. This separation
requires, however, that recovery is described logically. In this paper, we
extend current recovery methods to work in this logical setting. While this is
straightforward in principle, performance is an issue.
Users of MapReduce often run into performance problems when they scale up
their workloads. Many of the problems they encounter can be overcome by
applying techniques learned from over three decades of research on parallel
DBMSs. However, translating these techniques to a MapReduce implementation such
as Hadoop presents unique challenges that can lead to new design choices. This
paper describes how column-oriented storage techniques can be incorporated in
Hadoop in a way that preserves its popular programming APIs.
A high-quality, comprehensive product catalog is essential to the success of
Product Search engines and shopping sites such as Yahoo! Shopping, Google
Product Search or Bing Shopping. But keeping catalogs up-to-date becomes a
challenging task, calling for the need of automated techniques. In this paper,
we introduce the problem of product synthesis, a key component of catalog
creation and maintenance. Given a set of offers advertised by merchants, the
goal is to identify new products and add them to the catalog together with
their (structured) attributes.
XML data projection (or pruning) is a natural optimization for main memory
query engines: given a query Q over a document D, the subtrees of D that are
not necessary to evaluate Q are pruned, thus producing a smaller document D';
the query Q is then executed on D', hence avoiding to allocate and process
nodes that will never be reached by Q. In this article, we propose a new
approach, based on types, that greatly improves current solutions.
Privacy Preserving Data Mining (PPDM) addresses the problem of developing
accurate models about aggregated data without access to precise information in
individual data record. A widely studied \emph{perturbation-based PPDM}
approach introduces random perturbation to individual values to preserve
privacy before data is published. Previous solutions of this approach are
limited in their tacit assumption of single-level trust on data miners.
In this work we are analyzing scalability of the heuristic algorithm we used
in the past to discover knowledge from multi-valued symbolic attributes in
fuzzy databases. The non-atomic descriptors, characterizing a single attribute
of a database record, are commonly used in fuzzy databases to reflect
uncertainty about the recorded observation.
A new compression method called difference-Huffman coding (DHC) is introduced
in this paper. It is verified empirically that DHC results in a smaller
multidimensional physical representation than those for other previously
published techniques (single count header compression, logical position
compression, base-offset compression and difference sequence compression). The
article examines how caching influences the expected retrieval time of the
multidimensional and table representations of relations. A model is proposed
for this, which is then verified with empirical data.
One utilisation of multidimensional databases is the field of On-line
Analytical Processing (OLAP). The applications in this area are designed to
make the analysis of shared multidimensional information fast [9]. On one hand,
speed can be achieved by specially devised data structures and algorithms. On
the other hand, the analytical process is cyclic. In other words, the user of
the OLAP application runs his or her queries one after the other.
There have been several recent advancements in Machine Learning community on
the Entity Matching (EM) problem. However, their lack of scalability has
prevented them from being applied in practical settings on large real-life
datasets. Towards this end, we propose a principled framework to scale any
generic EM algorithm. Our technique consists of running multiple instances of
the EM algorithm on small neighborhoods of the data and passing messages across
neighborhoods to construct a global solution.
Set intersection is a fundamental operation in information retrieval and
database systems. This paper introduces linear space data structures to
represent sets such that their intersection can be computed in a worst-case
efficient way. In general, given k (preprocessed) sets, with totally n
elements, we will show how to compute their intersection in expected time
O(n/sqrt(w)+kr), where r is the intersection size and w is the number of bits
in a machine-word.
Spinnaker is an experimental datastore that is designed to run on a large
cluster of commodity servers in a single datacenter. It features key-based
range partitioning, 3-way replication, and a transactional get-put API with the
option to choose either strong or timeline consistency on reads. This paper
describes Spinnaker's Paxos-based replication protocol. The use of Paxos
ensures that a data partition in Spinnaker will be available for reads and
writes as long a majority of its replicas are alive.
We present a generic framework to make wrapper induction algorithms tolerant
to noise in the training data. This enables us to learn wrappers in a
completely unsupervised manner from automatically and cheaply obtained noisy
training data, e.g., using dictionaries and regular expressions. By removing
the site-level supervision that wrapper-based techniques require, we are able
to perform information extraction at web-scale, with accuracy unattained with
existing unsupervised extraction techniques. Our system is used in production
at Yahoo! and powers live applications.
Scaling up the sparse matrix-vector multiplication kernel on modern Graphics
Processing Units (GPU) has been at the heart of numerous studies in both
academia and industry. In this article we present a novel non-parametric,
self-tunable, approach to data representation for computing this kernel,
particularly targeting sparse matrices representing power-law graphs.
Many works have focused, for over twenty five years, on the integration of
the time dimension in databases (DB). However, the standard SQL3 does not yet
allow easy definition, manipulation and querying of temporal DBs. In this
paper, we study how we can simplify querying and manipulating temporal facts in
SQL3, using a model that integrates time in a native manner. To do this, we
propose new keywords and syntax to define different temporal versions for many
relational operators and functions used in SQL.
In this paper a tool called RDBNorma is proposed, that uses a novel approach
to represent a relational database schema and its functional dependencies in
computer memory using only one linked list and used for semi-automating the
process of relational database schema normalization up to third normal form.
This paper addresses all the issues of representing a relational schema along
with its functional dependencies using one linked list along with the
algorithms to convert a relation into second and third normal form by using
above representation.
As the structural databases continue to expand, efficient methods are
required to search similar structures of the query structure from the database.
There are many previous works about comparing protein 3D structures and
scanning the database with a query structure. However, they generally have
limitations on practical use because of large computational and storage
requirements.
We develop a novel method, based on the statistical concept of VC-dimension,
for selecting a small representative sample from a large database. The
execution of a query on the sample provides an accurate estimate for the
selectivity (or cardinality of the output) of each operation in the execution
of the query on the original large database. The size of the sample does not
depend on the size (number of tuples) of the database, but is a function of the
complexity of the queries we plan to run, measured by their VC-dimension.
Log files contain information about User Name, IP Address, Time Stamp, Access
Request, number of Bytes Transferred, Result Status, URL that Referred and User
Agent. The log files are maintained by the web servers. By analysing these log
files gives a neat idea about the user. This paper gives a detailed discussion
about these log files, their formats, their creation, access procedures, their
uses, various algorithms used and the additional parameters that can be used in
the log files which in turn gives way to an effective mining.
There are many clustering methods, such as hierarchical clustering method.
Most of the approaches to the clustering of variables encountered in the
literature are of hierarchical type. The great majority of hierarchical
approaches to the clustering of variables are of agglomerative nature. The
agglomerative hierarchical approach to clustering starts with each observation
as its own cluster and then continually groups the observations into
increasingly larger groups. Higher Learning Institution (HLI) provides training
to introduce final-year students to the real working environment.
We study in this paper provenance information for queries with aggregation.
Provenance information was studied in the context of various query languages
that do not allow for aggregation, and recent work has suggested to capture
provenance by annotating the different database tuples with elements of a
commutative semiring and propagating the annotations through query evaluation.
We show that aggregate queries pose novel challenges rendering this approach
inapplicable.
The Zebrafish Model Organism Database (ZFIN) provides a Web resource of
zebrafish genomic, genetic, developmental, and phenotypic data. Four different
ontologies are currently used to annotate data to the most specific term
available facilitating a better comparison between inter-species data. In
addition, ontologies are used to help users find and cluster data more quickly
without the need of knowing the exact technical name for a term.
The primary mission of UniProt is to support biological research by
maintaining a stable, comprehensive, fully classified, richly and accurately
annotated protein sequence knowledgebase, with extensive cross-references to
external resources, that is freely available to the scientific community. To
enable users of the knowledgebase to accurately assess the reliability of the
information contained in this resource, the evidence for and provenance of the
information must be recorded.
Our Chemical e-Science Information Cloud (ChemCloud) - a Semantic Web based
eScience infrastructure - integrates and automates a multitude of databases,
tools and services in the domain of chemistry, pharmacy and bio-chemistry
available at the Fachinformationszentrum Chemie (FIZ Chemie), at the Freie
Universitaet Berlin (FUB), and on the public Web.
We have compared the performance of five non-commercial triple stores,
Virtuoso-open source, Jena SDB, Jena TDB, SWIFT-OWLIM and 4Store. We examined
three performance aspects: the query execution time, scalability and run-to-run
reproducibility. The queries we chose addressed different ontological or
biological topics, and we obtained evidence that individual store performance
was quite query specific.
A key goal of bioinformatics is to create database systems and software
platforms capable of storing and analysing large sets of biological data.
Hundreds of biological databases are now available and provide access to huge
amount of biological data. SGD, Yeastract, CYGD-MIPS, BioGrid and PhosphoGrid
are five of the most visited databases by the yeast community. These sources
provide complementary data on biological entities. Biologists are brought
systematically to query these data sources in order to analyse the results of
their experiments.
Web query log data contain information useful to research; however, release
of such data can re-identify the search engine users issuing the queries. These
privacy concerns go far beyond removing explicitly identifying information such
as name and address, since non-identifying personal data can be combined with
publicly available information to pinpoint to an individual. In this work we
model web query logs as unstructured transaction data and present a novel
transaction anonymization technique based on clustering and generalization
techniques to achieve the k-anonymity privacy.
A boolean expression is in read-once form if each of its variables appears
exactly once. When the variables denote independent events in a probability
space, the probability of the event denoted by the whole expression in
read-once form can be computed in polynomial time (whereas the general problem
for arbitrary expressions is #P-complete). Known approaches to checking
read-once property seem to require putting these expressions in disjunctive
normal form.
Over the last decade there have been great strides made in developing
techniques to compute functions privately. In particular, Differential Privacy
gives strong promises about conclusions that can be drawn about an individual.
In contrast, various syntactic methods for providing privacy (criteria such as
kanonymity and l-diversity) have been criticized for still allowing private
information of an individual to be inferred. In this report, we consider the
ability of an attacker to use data meeting privacy definitions to build an
accurate classifier.
Ontologies such as taxonomies, product catalogs or web directories are
heavily used and hence evolve frequently to meet new requirements or to better
reflect the current instance data of a domain. To effectively manage the
evolution of ontologies it is essential to identify the difference (Diff)
between two ontology versions. We propose a novel approach to determine an
expressive and invertible diff evolution mapping between given versions of an
ontology.
A Blink Tree latch method and protocol supports synchronous node deletion in
a high concurrency environment. Full source code is available.
An answer to a query has a well-defined lineage expression (alternatively
called how-provenance) that explains how the answer was derived. Recent work
has also shown how to compute the lineage of a non-answer to a query. However,
the cause of an answer or non-answer is a more subtle notion and consists, in
general, of only a fragment of the lineage. In this paper, we adapt Halpern,
Pearl, and Chockler's recent definitions of causality and responsibility to
define the causes of answers and non-answers to queries, and their degree of
responsibility.
In this paper we present a simple database definition language: that of
categories and functors. A database schema is a category and a state is a
set-valued functor. We show that morphisms of schemas induce three "data
migration functors" that translate states from one schema to the other in
canonical ways. Database states form a boolean topos of which the classical
"relational algebra" is a fragment. These ideas thus create a new denotational
semantics for database theory.
In this paper a novel fragile watermarking scheme is proposed to detect,
localize and recover malicious modifications in relational databases. In the
proposed scheme, all tuples in the database are first securely divided into
groups. Then watermarks are embedded and verified group-by-group independently.
By using the embedded watermark, we are able to detect and localize the
modification made to the database and even we recover the true data from the
database modified locations. Our experimental results show that this scheme is
so qualified; i.e.
To analyze complex phenomena which involve moving objects, Trajectory Data
Warehouse (TDW) seems to be an answer for many recent decision problems related
to various professions (physicians, commercial representatives, transporters,
ecologists ...) concerned with mobility. This work aims to make trajectories as
a first class concept in the trajectory data conceptual model and to design a
TDW, in which data resulting from mobile information collectors' trajectory are
gathered.
Problem statement: Clustering has a number of techniques that have been
developed in statistics, pattern recognition, data mining, and other fields.
Subspace clustering enumerates clusters of objects in all subspaces of a
dataset. It tends to produce many over lapping clusters. Approach: Subspace
clustering and projected clustering are research areas for clustering in high
dimensional spaces. In this research we experiment three clustering oriented
algorithms, PROCLUS, P3C and STATPC.
Most of the organizations put information on the web because they want it to
be seen by the world. Their goal is to have visitors come to the site, feel
comfortable and stay a while and try to know completely about the running
organization. As educational system increasingly requires data mining, the
opportunity arises to mine the resulting large amounts of student information
for hidden useful information (patterns like rule, clustering, and
classification, etc).
Matching dependencies (MDs) were introduced to specify the identification or
matching of certain attribute values in pairs of database tuples when some
similarity conditions are satisfied. Their enforcement can be seen as a natural
generalization of entity resolution. In what we call the "pure case" of MDs,
any value from the underlying data domain can be used for the value in common
that does the matching. We investigate the semantics and properties of data
cleaning through the enforcement of matching dependencies for the pure case.
Over the last couple of years, "Cloud Computing" or "Elastic Computing" has
emerged as a compelling and successful paradigm for internet scale computing.
One of the major contributing factors to this success is the elasticity of
resources. In spite of the elasticity provided by the infrastructure and the
scalable design of the applications, the elephant (or the underlying database),
which drives most of these web-based applications, is not very elastic and
scalable, and hence limits scalability.
Matching dependencies were recently introduced as declarative rules for data
cleaning and entity resolution. Enforcing a matching dependency on a database
instance identifies the values of some attributes for two tuples, provided that
the values of some other attributes are sufficiently similar. Assuming the
existence of matching functions for making two attributes values equal, we
formally introduce the process of cleaning an instance using matching
dependencies, as a chase-like procedure.
Iterated hash functions process strings recursively, one character at a time.
At each iteration, they compute a new hash value from the preceding hash value
and the next character. We prove that iterated hashing can be pairwise
independent, but never 3-wise independent. We show that it can be almost
universal over strings much longer than the number of hash values; we bound the
maximal string length given the collision probability.
This research is about an online forum designed and developed to improve the
communication process between alumni, new, old and upcoming students. In this
research paper we present targeted problems, designed architecture, used
technologies in development and final end product in detail.
This research paper briefly describes the industrial contributions of Product
Data Management in any organization's technical and managerial data management.
Then focusing on some current major PDM based problems i.e. Static and
Unintelligent Search, Platform Independent System and Successful PDM System
Implementation, briefly presents a semantic based solution i.e. I-SOAS. Majorly
this research paper is about to present and discuss the contributions of I-SOAS
in any organization's technical and system data management.
Top-$k$ queries allow end-users to focus on the most important (top-$k$)
answers amongst those which satisfy the query. In traditional databases, a user
defined score function assigns a score value to each tuple and a top-$k$ query
returns $k$ tuples with the highest score. In uncertain database, top-$k$
answer depends not only on the scores but also on the membership probabilities
of tuples. Several top-$k$ definitions covering different aspects of
score-probability interplay have been proposed in recent
past~\cite{R10,R4,R2,R8}.
In this work we present in-network techniques to improve the efficiency of
spatial aggregate queries. Such queries are very common in a sensornet setting,
demanding more targeted techniques for their handling. Our approach constructs
and maintains multi-resolution cube hierarchies inside the network, which can
be constructed in a distributed fashion. In case of failures, recovery can also
be performed with in-network decisions.
Date and Darwen have proposed a theory of types, the latter forms the basis
of a detailed presentation of a panoply of simple and complex types. However,
this proposal has not been structured in a formal system. Specifically, Date
and Darwen haven't indicated the formalism of the type system that corresponds
to the type theory established.
We show that it is possible to significantly improve the accuracy of a
general class of histogram queries while satisfying differential privacy. Our
approach carefully chooses a set of queries to evaluate, and then exploits
consistency constraints that should hold over the noisy output. In a
post-processing phase, we compute the consistent input most likely to have
produced the noisy output. The final output is differentially-private and
consistent, but in addition, it is often much more accurate.
This project addressed the conceptual fundamentals of data storage,
investigating techniques for provision of highly generic storage facilities
that can be tailored to produce various individually customised storage
infrastructures, compliant to the needs of particular applications. This
requires the separation of mechanism and policy wherever possible.
We present a generic API suitable for provision of highly generic storage
facilities that can be tailored to produce various individually customised
storage infrastructures. The paper identifies a candidate set of minimal
storage system building blocks, which are sufficiently simple to avoid
encapsulating policy where it cannot be customised by applications, and
composable to build highly flexible storage architectures.
Similarity search methods are widely used as kernels in various machine
learning applications. Nearest neighbor search (NNS) algorithms are often used
to retrieve similar entries, given a query. While there exist efficient
techniques for exact query lookup using hashing, similarity search using exact
nearest neighbors is known to be a hard problem and in high dimensions, best
known solutions offer little improvement over a linear scan.
This paper presents the design of an autonomic, resource-aware distributed
database which enables data to be backed up and shared without complex manual
administration. The database, H2O, is designed to make use of unused resources
on workstation machines. Creating and maintaining highly-available, replicated
database systems can be difficult for untrained users, and costly for IT
departments.
Searching learning or rules in relational database for data mining purposes
with characteristic or classification/discriminant rule in attribute oriented
induction technique can be quicker, easy, and simple with simple SQL statement.
With just only one simple SQL statement, characteristic and classification rule
can be created simultaneously.
Finding interesting rule in the sixth strategy step about threshold control
on generalized relations in attribute oriented induction, there is possibility
to select candidate attribute for further generalization and merging of
identical tuples until the number of tuples is no greater than the threshold
value, as implemented in basic attribute oriented induction algorithm. At this
strategy step there is possibility the number of tuples in final generalization
result still greater than threshold value.
Multidimensional in data warehouse is a compulsion and become the most
important for information delivery, without multidimensional Multidimensional
in data warehouse is a compulsion and become the most important for information
delivery, without multidimensional datawarehouse is incomplete.
Multidimensional give ability to analyze business measurement in many different
ways. Multidimensional is also synonymous with online analytical processing
(OLAP). By using some concepts in datawarehouse like slice-dice,drill down and
roll up will increase the ability of multidimensional datawarehouse.
This paper describes our experience with using Grid files as the main storage
organization for a relational database management system. We primarily focus on
the following two aspects. (i) Strategies for implementing grid files
efficiently. (ii) Methods for efficiency evaluating queries posed to a database
organized using grid files.
The amounts of data available to decision makers are increasingly important,
given the network availability, low cost storage and diversity of applications.
To maximize the potential of these data within the National Social Security
Fund (NSSF) in Tunisia, we have built a data warehouse as a multidimensional
database, cleaned, homogenized, historicized and consolidated. We used Oracle
Warehouse Builder to extract, transform and load the source data into the Data
Warehouse, by applying the KDD process. We have implemented the Data Warehouse
as an Oracle OLAP.
Sharing musical files via the Internet was the essential motivation of early
P2P systems. Despite of the great success of the P2P file sharing systems,
these systems support only "simple" queries. The focus in such systems is how
to carry out an efficient query routing in order to find the nodes storing a
desired file. Recently, several research works have been made to extend P2P
systems to be able to share data having a fine granularity (i.e. atomic
attribute) and to process queries written with a highly expressive language
(i.e. SQL).
Regular tree grammars and regular path expressions constitute core constructs
widely used in programming languages and type systems. Nevertheless, there has
been little research so far on reasoning frameworks for path expressions where
node cardinality constraints occur along a path in a tree. We present a logic
capable of expressing deep counting along paths which may include arbitrary
recursive forward and backward navigation.
Different types of data skew can result in load imbalance in the context of
parallel joins under the shared nothing architecture. We study one important
type of skew, join product skew (JPS). A static approach based on frequency
classes is proposed which takes for granted the data distribution of join
attribute values.
We study the problem of providing workflow data provenance without revealing
the functionality of any module. We develop a model that formalizes the notion
of privacy of modules embedded in a workflow structure as a natural extension
of privacy of standalone modules. Our model shows that by hiding a small amount
of carefully chosen data, one can ensure privacy of all modules over an
unbounded number of executions. The problem of identifying the smallest
possible amount of such data is NP-hard, and in the full generality of our
model it is in fact even hard to get a good approximation.
In this paper we deal with the notion of semantic loss in Peer Data
Management Systems (PDMS) queries. We define such a notion and we give a
mechanism that discovers semantic loss in a PDMS network. Next, we propose an
algorithm that addresses the problem of restoring such a loss. Further
evaluation of our proposed algorithm is an ongoing work
Functional dependencies - traditional, approximate and conditional are of
critical importance in relational databases, as they inform us about the
relationships between attributes. They are useful in schema normalization, data
rectification and source selection. However, probabilistic databases neither
have these dependencies defined for them, nor are fast algorithms available to
evaluate their confidences. In this paper, we define the logical extensions of
various forms of functional dependencies for probabilistic databases. We
explore the connections between these dependencies.
Performance tuning of Database Management Systems(DBMS) is both complex and
challenging as it involves identifying and altering several key performance
tuning parameters. The quality of tuning and the extent of performance
enhancement achieved greatly depends on the skill and experience of the
Database Administrator (DBA). As neural networks have the ability to adapt to
dynamically changing inputs and also their ability to learn makes them ideal
candidates for employing them for tuning purpose.
This paper deals with personalization of annotated OLAP systems. Data
constellation is extended to support annotations and user preferences.
Annotations reflect the decision-maker experience whereas user preferences
enable users to focus on the most interesting data. User preferences allow
annotated contextual recommendations helping the decision-maker during his/her
multidimensional navigations.
In this paper, we emphasize the need for data cleansing when clustering
large-scale transaction databases and propose a new data cleansing method that
improves clustering quality and performance. We evaluate our data cleansing
method through a series of experiments. As a result, the clustering quality and
performance were significantly improved by up to 165% and 330%, respectively.
Inferring an appropriate DTD or XML Schema Definition (XSD) for a given
collection of XML documents essentially reduces to learning deterministic
regular expressions from sets of positive example words. Unfortunately, there
is no algorithm capable of learning the complete class of deterministic regular
expressions from positive examples only, as we will show. The regular
expressions occurring in practical DTDs and XSDs, however, are such that every
alphabet symbol occurs only a small number of times.
Distributed heterogeneous data sources need to be queried uniformly using
global schema. Query on global schema is reformulated so that it can be
executed on local data sources. Constraints in global schema and mappings are
used for source selection, query optimization,and querying partitioned and
replicated data sources.
Association rule mining is an active data mining research area and most ARM
algorithms cater to a centralized environment. Centralized data mining to
discover useful patterns in distributed databases isn't always feasible because
merging data sets from different sites incurs huge network communication costs.
In this paper, an Improved algorithm based on good performance level for data
mining is being proposed.
The rapidly expanding technology of mobile communication will give mobile
users capability of accessing information from anywhere and any time. The
wireless technology has made it possible to achieve continuous connectivity in
mobile environment. When the query is specified as continuous, the requesting
mobile user can obtain continuously changing result. In order to provide
accurate and timely outcome to requesting mobile user, the locations of moving
object has to be closely monitored.
To obtain good system performance, a DBA must choose a set of indices that is
appropriate for the workload. The system can aid in this challenging task by
providing recommendations for the index configuration. We propose a new index
recommendation technique, termed semi-automatic tuning, that keeps the DBA "in
the loop" by generating recommendations that use feedback about the DBA's
preferences. The technique also works online, which avoids the limitations of
commercial tools that require the workload to be known in advance.
The increasing popularity of social networks has initiated a fertile research
area in information extraction and data mining. Anonymization of these social
graphs is important to facilitate publishing these data sets for analysis by
external entities. Prior work has concentrated mostly on node identity
anonymization and structural anonymization. But with the growing interest in
analyzing social networks as a weighted network, edge weight anonymization is
also gaining importance.
Alloy is a lightweight modeling formalism based on relational algebra. In
prior work with Fisler, Giannakopoulos, Krishnamurthi, and Yoo, we have
presented a tool, Alchemy, that compiles Alloy specifications into
implementations that execute against persistent databases. The foundation of
Alchemy is an algorithm for rewriting relational algebra formulas into code for
database transactions. In this paper we report on recent progress in improving
the robustness and efficiency of this transformation.
In various approaches, data cubes are pre-computed in order to answer
efficiently OLAP queries. The notion of data cube has been declined in various
ways: iceberg cubes, range cubes or differential cubes. In this paper, we
introduce the concept of convex cube which captures all the tuples of a
datacube satisfying a constraint combination. It can be represented in a very
compact way in order to optimize both computation time and required storage
space.
Numerous generalization techniques have been proposed for privacy preserving
data publishing. Most existing techniques, however, implicitly assume that the
adversary knows little about the anonymization algorithm adopted by the data
publisher. Consequently, they cannot guard against privacy attacks that exploit
various characteristics of the anonymization mechanism. This paper provides a
practical solution to the above problem. First, we propose an analytical model
for evaluating disclosure risks, when an adversary knows everything in the
anonymization process, except the sensitive values.
We present some formal properties of (symmetrical) commutativity, the major
criterion used in transactional systems, which allow us to fully understand its
advantages and disadvantages. The main result is that commutativity is subject
to the same limitation as compatibility for arbitrary objects. However,
commutativity has also a number of attracting properties, one of which is
related to recovery and, to our knowledge, has not been exploited in the
literature. Advantages and disadvantages are illustrated on abstract data types
of interest.
In this paper, we try to focus the reader's interest on the problems that
transactional systems have to resolve for taking advantage of commutativity in
a serializable and recoverable way. Our framework is, (as others), based on the
use of conditional commutativity on abstract date types. We present new
features that have not been found in the literature hitherto, that both
increase concurrency and simplify recovery.
Commutativity has the same inherent limitations as compatibility. Then, it is
worth conceiving simple concurrency control techniques. We propose a restricted
form of commutativity which increases parallelism without incurring a higher
overhead than compatibility. Advantages of our proposition are: (1)
commutativity of operations is determined at compile-time, (2) run-time
checking is as efficient as for compatibility, (3) neither commutativity
relations, (4) nor inverse operations, need to be specified, and (5) log space
utilization is reduced.
Several propositions were done to provide adapted concurrency control to
object-oriented databases. However, most of these proposals miss the fact that
considering solely read and write access modes on instances may lead to less
parallelism than in relational databases!
Dynamic web applications such as mashups need efficient access to web data
that is only accessible via entity search engines (e.g. product or publication
search engines). However, most current mashup systems and applications only
support simple keyword searches for retrieving data from search engines. We
propose the use of more powerful search strategies building on so-called query
generators. For a given set of entities query generators are able to
automatically determine a set of search queries to retrieve these entities from
an entity search engine.
Previous work reports about SXSI, a fast XPath engine which executes tree
automata over compressed XML indexes. Here, reasons are investigated why SXSI
is so fast. It is shown that tree automata can be used as a general framework
for fine grained XML query optimization. We define the "relevant nodes" of a
query as those nodes that a minimal automaton must touch in order to answer the
query. This notion allows to skip many subtrees during execution, and, with the
help of particular tree indexes, even allows to skip internal nodes of the
tree.
Data mining has been widely recognized as a powerful tool to explore added
value from large-scale databases. Finding frequent item sets in databases is a
crucial in data mining process of extracting association rules. Many algorithms
were developed to find the frequent item sets. This paper presents a summary
and a comparative study of the available FP-growth algorithm variations
produced for mining frequent item sets showing their capabilities and
efficiency in terms of time and memory consumption on association rule mining
by taking application of specific information into account.
Finding multilevel association rules in transaction databases is most
commonly seen in is widely used in data mining. In this paper, we present a
model of mining multilevel association rules which satisfies the different
minimum support at each level, we have employed fuzzy set concepts, multi-level
taxonomy and different minimum supports to find fuzzy multilevel association
rules in a given transaction data set. Apriori property is used in model to
prune the item sets. The proposed model adopts a topdown progressively
deepening approach to derive large itemsets.
The HL7 standard is widely used to exchange medical information
electronically. As a part of the standard, HL7 defines scalar communication
data types like physical quantity, point in time and concept descriptor but
also complex types such as interval types, collection types and probabilistic
types.
Since Chen's Entity-Relationship (ER) model, conceptual modeling has been
playing a fundamental role in relational data design. In this paper we consider
an extended ER (EER) model enriched with cardinality constraints, disjointness
assertions, and is-a relations among both entities and relationships. In this
setting, we consider the case of incomplete data, which is likely to occur, for
instance, when data from different sources are integrated. In such a context,
we address the problem of providing correct answers to conjunctive queries by
reasoning on the schema.
In \cite{Spi}, we developed a category of databases in which the schema of a
database is represented as a simplicial set. Each simplex corresponds to a
table in the database. There, our main concern was to find a categorical
formulation of databases; the simplicial nature of the schemas was to some
degree unexpected and unexploited.
This paper proposes a data tree-rewriting framework for modeling evolving
documents. The framework is close to Guarded Active XML, a platform used for
handling XML repositories evolving through web services. We focus on automatic
verification of properties of evolving documents that can contain data from an
infinite domain. We establish the boundaries of decidability, and show that
verification of a {\em positive} fragment that can handle recursive service
calls is decidable.
Given the vast reservoirs of data stored worldwide, efficient mining of data
from a large information store has emerged as a great challenge. Many databases
like that of intrusion detection systems, web-click records, player statistics,
texts, proteins etc., store strings or sequences. Searching for an unusual
pattern within such long strings of data has emerged as a requirement for
diverse applications. Given a string, the problem then is to identify the
substrings that differs the most from the expected or normal behavior, i.e.,
the substrings that are statistically significant.
Time is one of the most difficult aspects to handle in real world
applications such as database systems. Relational database management systems
proposed by Codd offer very little built-in query language support for temporal
data management. The model itself incorporates neither the concept of time nor
any theory of temporal semantics. Many temporal extensions of the relational
model have been proposed and some of them are also implemented. This paper
offers a brief introduction to temporal database research.
As advances in technology allow for the collection, storage, and analysis of
vast amounts of data, the task of screening and assessing the significance of
discovered patterns is becoming a major challenge in data mining applications.
In this work, we address significance in the context of frequent itemset
mining.
Data mining is the task of discovering interesting patterns from large
amounts of data. There are many data mining tasks, such as classification,
clustering, association rule mining, and sequential pattern mining. Sequential
pattern mining finds sets of data items that occur together frequently in some
sequences.
The importance of finding the characteristics leading to either a success or
a failure is one of the driving forces of data mining. The various application
areas of finding success/failure factors cover vast variety of areas such as
credit risk evaluation and granting loans, micro array analysis, health factors
and health risk factors, and parameter combination leading to a product
success. This paper presents a new approach for making inferences about
dichotomous data. The objective is to determine rules that lead to a certain
result.
There is a considerable body of work on sequence mining of Web Log Data. We
are using One Pass frequent Episode discovery (or FED) algorithm, takes a
different approach than the traditional apriori class of pattern detection
algorithms. In this approach significant intervals for each Website are
computed first (independently) and these interval used for detecting frequent
patterns/Episode and then the Analysis is performed on Significant Intervals
and frequent patterns That can be used to forecast the user's behavior using
previous trends and this can be also used for advertising purpose.
In this paper we present the state of advancement of the French ANR WebStand
project. The objective of this project is to construct a customizable XML based
warehouse platform to acquire, transform, analyze, store, query and export data
from the web, in particular mailing lists, with the final intension of using
this data to perform sociological studies focused on social groups of World
Wide Web, with a specific emphasis on the temporal aspects of this data.
As mobile devices with positioning capabilities continue to proliferate, data
management for so-called trajectory databases that capture the historical
movements of populations of moving objects becomes important. This paper
considers the querying of such databases for convoys, a convoy being a group of
objects that have traveled together for some time. More specifically, this
paper formalizes the concept of a convoy query using density-based notions, in
order to capture groups of arbitrary extents and shapes.
This paper studies the problem of identification and extraction of flat and
nested data records from a given web page. With the explosive growth of
information sources available on the World Wide Web, it has become increasingly
difficult to identify the relevant pieces of information, since web pages are
often cluttered with irrelevant content like advertisements, navigation-panels,
copyright notices etc., surrounding the main content of the web page.