Databases

  1. SparseDTW: A Novel Approach to Speed up Dynamic Time Warping.

    Authors: Sanjay Chawla, Javid Taheri, Ghazi Al-Naymat
    Subjects: Databases
    Abstract

    We present a new space-efficient approach (SparseDTW) to compute the
    Dynamic Time Warping (DTW) distance between two time series that always yields
    the optimal result. This is in contrast to other known approaches which
    typically sacrifice optimality to attain space efficiency. The main idea behind
    our approach is to dynamically exploit the existence of similarity and/or
    correlation between the time series. The greater the similarity between the
    time series, the less space is required to compute the DTW between them.
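
    Illustrative sketch (not from the paper): for reference, the classic
    full-matrix DTW recurrence that space-efficient variants aim to match in
    result. This is the textbook O(n*m)-space version, not SparseDTW itself; the
    function name and the absolute-difference local cost are assumptions made
    for the example.

```python
from math import inf


def dtw_distance(a, b):
    """Textbook DTW between two numeric sequences (full cost matrix)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # assumed local cost
            # Extend the cheapest of: step in a, step in b, step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


if __name__ == "__main__":
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

    Two series that differ only by a repeated sample warp onto each other at
    zero cost, which is the kind of similarity a sparse method can exploit.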

  2. Query-Subquery Nets.

    Authors: Linh Anh Nguyen, Son Thanh Cao
    Subjects: Databases
    Abstract

    We formulate query-subquery nets and use them to create the first framework
    for developing algorithms for evaluating queries to Horn knowledge bases with
    the properties that: the approach is goal-directed; each subquery is processed
    only once and each supplement tuple, if desired, is transferred only once;
    operations are done set-at-a-time; and any control strategy can be used. Our
    intention is to increase efficiency of query processing by eliminating
    redundant computation, increasing flexibility and reducing the number of
    accesses to the secondary storage.

  3. Future Robotics Database Management System along with Cloud TPS.

    Authors: Vijaykumar S, Saravanakumar S G
    Subjects: Databases
    Abstract

    This paper deals with memory management issues in robotics. Our proposal
    addresses one of the major issues in creating humanoid robots. The database
    is a complicated part of robotics schema design, and we suggest a new
    concept, the NoSQL database, for effective data retrieval, so that
    humanoid robots gain substantial "thinking" ability when searching for each
    item using chained instructions.

  4. Differentially Private Trajectory Data Publication.

    Authors: Rui Chen, Benjamin C. M. Fung, Bipin C. Desai
    Subjects: Databases
    Abstract

    With the increasing prevalence of location-aware devices, trajectory data has
    been generated and collected in various application domains. Trajectory data
    carries rich information that is useful for many data analysis tasks. Yet,
    improper publishing and use of trajectory data could jeopardize individual
    privacy. However, it has been shown that existing privacy-preserving trajectory
    data publishing methods derived from partition-based privacy models, for
    example k-anonymity, are unable to provide sufficient privacy protection.

  5. REX: Explaining Relationships between Entity Pairs.

    Authors: Anish Das Sarma, Cong Yu, Lujun Fang, Philip Bohannon
    Subjects: Databases
    Abstract

    Knowledge bases of entities and relations (either constructed manually or
    automatically) are behind many real world search engines, including those at
    Yahoo!, Microsoft, and Google. Those knowledge bases can be viewed as graphs
    with nodes representing entities and edges representing (primary)
    relationships, and various studies have been conducted on how to leverage them
    to answer entity seeking queries. Meanwhile, in a complementary direction,
    analyses over the query logs have enabled researchers to identify entity pairs
    that are statistically correlated.

  6. PASS-JOIN: A Partition-based Method for Similarity Joins.

    Authors: Guoliang Li, Dong Deng, Jiannan Wang, Jianhua Feng
    Subjects: Databases
    Abstract

    As an essential operation in data cleaning, the similarity join has attracted
    considerable attention from the database community. In this paper, we study
    string similarity joins with edit-distance constraints, which find similar
    string pairs from two large sets of strings whose edit distance is within a
    given threshold. Existing algorithms are efficient either for short strings or
    for long strings, and there is no algorithm that can efficiently and adaptively
    support both short strings and long strings. To address this problem, we
    propose a partition-based method called Pass-Join.
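
    Illustrative sketch (not from the paper): one standard partition-based
    filter for edit-distance joins rests on the pigeonhole principle. If a
    string r is split into tau + 1 disjoint segments and ed(r, s) <= tau, at
    least one segment must occur unchanged as a substring of s. The naive
    nested-loop join below applies that filter before verification. All
    function names are hypothetical, and the indexing and substring-selection
    techniques of Pass-Join itself are not reproduced here.

```python
def edit_distance(r, s):
    """Levenshtein distance using a rolling row."""
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, 1):
        cur = [i]
        for j, cs in enumerate(s, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cr != cs)))
        prev = cur
    return prev[-1]


def segments(r, tau):
    """Split r into tau + 1 contiguous segments of near-equal length."""
    k = tau + 1
    base, extra = divmod(len(r), k)
    out, pos = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        out.append(r[pos:pos + size])
        pos += size
    return out


def similarity_join(R, S, tau):
    """Return all pairs (r, s) with edit distance at most tau."""
    result = []
    for r in R:
        segs = segments(r, tau)
        for s in S:
            if abs(len(r) - len(s)) > tau:
                continue  # length filter
            if not any(seg in s for seg in segs):
                continue  # pigeonhole filter: no segment survives intact
            if edit_distance(r, s) <= tau:  # verification
                result.append((r, s))
    return result


if __name__ == "__main__":
    print(similarity_join(["kitten", "vldb"], ["sitten", "pvldb", "sigmod"], 1))
```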

  7. Size-l Object Summaries for Relational Keyword Search.

    Authors: Georgios J. Fakas, Zhi Cai, Nikos Mamoulis
    Subjects: Databases
    Abstract

    A previously proposed keyword search paradigm produces, as a query result, a
    ranked list of Object Summaries (OSs). An OS is a tree structure of related
    tuples that summarizes all data held in a relational database about a
    particular Data Subject (DS). However, some of these OSs are very large in size
    and therefore unfriendly to users that initially prefer synoptic information
    before proceeding to more comprehensive information about a particular DS. In
    this paper, we investigate the effective and efficient retrieval of concise and
    informative OSs.

  8. Indexing the Earth Mover's Distance Using Normal Distributions.

    Authors: Ambuj K. Singh, Brian E. Ruttenberg
    Subjects: Databases
    Abstract

    Querying uncertain data sets (represented as probability distributions)
    presents many challenges due to the large amount of data involved and the
    difficulties comparing uncertainty between distributions. The Earth Mover's
    Distance (EMD) has increasingly been employed to compare uncertain data due to
    its ability to effectively capture the differences between two distributions.
    Computing the EMD entails finding a solution to the transportation problem,
    which is computationally intensive.
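
    Illustrative sketch (not from the paper): the general EMD requires solving
    a transportation problem, but in the special case of two one-dimensional
    histograms over the same ordered bins, with equal total mass and ground
    distance |i - j|, it reduces to summing the absolute differences of the
    running totals. The function name is hypothetical and this special case
    is an assumption of the example, not the indexing method of the paper.

```python
from itertools import accumulate


def emd_1d(p, q):
    """EMD between two 1-D histograms on the same bins (unit bin spacing)."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must have equal total mass")
    # Running surplus of mass that still has to be moved to the right.
    surplus = accumulate(pi - qi for pi, qi in zip(p, q))
    return sum(abs(x) for x in surplus)


if __name__ == "__main__":
    print(emd_1d([1, 0, 0], [0, 0, 1]))
```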

  9. gSketch: On Query Estimation in Graph Streams.

    Authors: Peixiang Zhao, Charu C. Aggarwal, Min Wang
    Subjects: Databases
    Abstract

    Many dynamic applications are built upon large network infrastructures, such
    as social networks, communication networks, biological networks and the Web.
    Such applications create data that can be naturally modeled as graph streams,
    in which edges of the underlying graph are received and updated sequentially in
    the form of a stream. It is often necessary and important to summarize the
    behavior of graph streams in order to enable effective query processing.
    However, the sheer size and dynamic nature of graph streams present an enormous
    challenge to existing graph management techniques.

  10. PARIS: Probabilistic Alignment of Relations, Instances, and Schema.

    Authors: Pierre Senellart, Fabian M. Suchanek, Serge Abiteboul
    Subjects: Databases
    Abstract

    One of the main challenges that the Semantic Web faces is the integration of
    a growing number of independently designed ontologies. In this work, we present
    PARIS, an approach for the automatic alignment of ontologies. PARIS aligns not
    only instances, but also relations and classes. Alignments at the instance
    level cross-fertilize with alignments at the schema level. Thereby, our system
    provides a truly holistic solution to the problem of ontology alignment. The
    heart of the approach is probabilistic, i.e., we measure degrees of matchings
    based on probability estimates.

  11. Answering Top-k Queries Over a Mixture of Attractive and Repulsive Dimensions.

    Authors: Ambuj K. Singh, Sayan Ranu
    Subjects: Databases
    Abstract

    In this paper, we formulate a top-k query that compares objects in a database
    to a user-provided query object on a novel scoring function. The proposed
    scoring function combines the idea of attractive and repulsive dimensions into
    a general framework to overcome the weakness of traditional distance or
    similarity measures. We study the properties of the proposed class of scoring
    functions and develop efficient and scalable index structures that index the
    isolines of the function. We demonstrate various scenarios where the query
    finds application.

  12. PIQL: Success-Tolerant Query Processing in the Cloud.

    Authors: Michael Armbrust, Armando Fox, Kristal Curtis, Tim Kraska, Michael J. Franklin, David A. Patterson
    Subjects: Databases
    Abstract

    Newly-released web applications often succumb to a "Success Disaster," where
    overloaded database machines and resulting high response times destroy a
    previously good user experience. Unfortunately, the data independence provided
    by a traditional relational database system, while useful for agile
    development, only exacerbates the problem by hiding potentially expensive
    queries under simple declarative expressions.

  13. A New Technique to Backup and Restore DBMS using XML and .NET Technologies.

    Authors: Seifedine Kadry, Mohamad Smaili, Hussam Kassem, Hassan Hayek
    Subjects: Databases
    Abstract

    In this paper, we propose a new technique for backing up and restoring
    different Database Management Systems (DBMS). The technique makes it possible
    to back up and restore part or all of a database through a unified interface
    built with ASP.NET and XML technologies. It provides a Web solution that lets
    administrators do their jobs from anywhere, locally or remotely. To show
    the importance of our solution, we present two case studies, Oracle 11g and
    SQL Server 2008.

  14. A semantically enriched web usage based recommendation model.

    Authors: C. Ramesh, K. V. Chalapati Rao, A. Govardhan
    Subjects: Databases
    Abstract

    With the rapid growth of internet technologies, the Web has become a huge
    repository of information and keeps growing exponentially under no editorial
    control. However, the human capability to read, access and understand Web
    content remains constant. This motivated researchers to provide Web
    personalized online services such as Web recommendations to alleviate the
    information overload problem and provide tailored Web experiences to the Web
    users. Recent studies show that Web usage mining has emerged as a popular
    approach in providing Web personalization.

  15. Array Requirements for Scientific Applications and an Implementation for Microsoft SQL Server.

    Authors: István Csabai, László Dobos, Alexander Szalay, José Blakeley, Tamás Budavári, Dragan Tomic, Milos Milovanovic, Marko Tintor, Andrija Jovanovic
    Subjects: Databases
    Abstract

    This paper outlines certain scenarios from the fields of astrophysics and
    fluid dynamics simulations which require high performance data warehouses that
    support array data type. A common feature of all these use cases is that
    subsetting and preprocessing the data on the server side (as far as possible
    inside the database server process) is necessary to avoid the client-server
    overhead and to minimize IO utilization.

  16. Adaptive Data Stream Management System Using Learning Automata.

    Authors: Shirin Mohammadi, Ali A. Safaei, Fatemeh Abdi, Mostafa S. Haghjoo
    Subjects: Databases
    Abstract

    In many modern applications, data are received as infinite, rapid,
    unpredictable and time-variant data elements that are known as data streams.
    Systems which are able to process data streams with such properties are called
    Data Stream Management Systems (DSMS). Due to the unpredictable and
    time-variant properties of data streams as well as of the system, adaptivity
    is a major requirement for each DSMS.

  17. Representation for alphanumeric data type based on space and speed case study: Student ID of X university.

    Authors: Agus Pratondo
    Subjects: Databases
    Abstract

    "ID" is derived from the first two characters of the word "identity". An ID
    is used to distinguish one entity from another. The Student ID (SID) is the
    key differentiator between one student and the others. In database terms,
    this differentiator is unique. A SID can consist of numbers, letters, or a
    combination of both (alphanumeric). In everyday use, it is not important to
    determine which data type a SID belongs to. In database design, however,
    determining the data type, including that of the SID, is important.

  18. Adding a new site in an existing Oracle Multimaster replication without quiescing the replication.

    Authors: Hakik Paci, Elinda Kajo, Igli Tafa, Aleksander Xhuvani
    Subjects: Databases
    Abstract

    This paper presents a new solution, which adds a new database server on an
    existing Oracle Multimaster Data replication system with Online Instantiation
    method. During this time the system is down, because we cannot execute DML
    statements on replication objects but we can only make queries. The time for
    adding the new database server depends on the number of objects, on the
    replication group and on the network conditions. We propose to add a new layer
    between replication objects and the database sessions, which contain DML
    statements.

  19. Get the Most out of Your Sample: Optimal Unbiased Estimators using Partial Information.

    Authors: Edith Cohen, Haim Kaplan
    Subjects: Databases
    Abstract

    Random sampling is an essential tool in the processing and transmission of
    data. It is used to summarize data too large to store or manipulate and meet
    resource constraints on bandwidth or battery power. Estimators that are applied
    to the sample facilitate fast approximate processing of queries posed over the
    original data and the value of the sample hinges on the quality of these
    estimators.

  20. An index for regular expression querying: Design and implementation.

    Authors: Sanjay Chawla, Dominic Tsang
    Subjects: Databases
    Abstract

    The LIKE regular expression predicate has been part of the SQL standard since
    at least 1989. However, despite its popularity and wide usage, database vendors
    provide only limited indexing support for regular expression queries which
    almost always require a full table scan.
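
    Illustrative sketch (not from the paper): a small experiment with Python's
    built-in sqlite3 module showing a LIKE query with a leading wildcard on an
    indexed column. The table, column and index names are hypothetical. The
    query plan is printed rather than asserted, because its exact wording
    depends on the SQLite version; the point is only to inspect whether the
    engine scans or performs an index range search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE INDEX idx_title ON docs(title)")  # ordinary B-tree index
conn.executemany(
    "INSERT INTO docs (title) VALUES (?)",
    [("PASS-JOIN: similarity joins",), ("Fast Set Intersection",), ("Hash join basics",)],
)

# A leading wildcard gives the index no usable prefix to seek on.
pattern = "%join%"
rows = conn.execute(
    "SELECT title FROM docs WHERE title LIKE ? ORDER BY title", (pattern,)
).fetchall()
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT title FROM docs WHERE title LIKE ?", (pattern,)
).fetchall()

if __name__ == "__main__":
    print(rows)
    for step in plan:
        print(step)
```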

  21. Estudo de Viabilidade de uma Plataforma de Baixo Custo para Data Warehouse.

    Authors: Eduardo Cunha de Almeida
    Subjects: Databases
    Abstract

    Often corporations need tools to improve their decision making in a
    competitive market. In general, these tools are based on data warehouse
    platforms to manage and analyze large amounts of data. However, several of these
    corporations do not have enough resources to buy such platforms because of the
    high cost. This work is dedicated to a feasibility study of a low-cost platform
    for data warehousing. We consider as a low-cost platform the use of open source
    software such as the PostgreSQL database system and the GNU/Linux operating
    system.

  22. Optimizing Index Deployment Order for Evolving OLAP (Extended Version).

    Authors: Hideaki Kimura, Carleton Coffrin, Alexander Rasin, Stanley B. Zdonik
    Subjects: Databases
    Abstract

    We study the problem of index deployment ordering. Many database
    applications deploy hundreds or thousands of indexes on large tables to speed
    up query execution. An effective index deployment ordering can produce (1) a
    prompt query runtime improvement and (2) a reduced total deployment time. Both
    of these are essential qualities of design tools for quickly evolving
    databases, but optimizing the problem is challenging because of complex index
    interactions and a factorial number of possible solutions.

  23. SERIMI - Resource Description Similarity, RDF Instance Matching and Interlinking.

    Authors: Samur Araujo, Jan Hidders, Daniel Schwabe, Arjen P. de Vries
    Subjects: Databases
    Abstract

    The interlinking of datasets published in the Linked Data Cloud is a
    challenging problem and a key factor for the success of the Semantic Web.
    Manual rule-based methods are the most effective solution for the problem, but
    they require skilled human data publishers going through a laborious, error
    prone and time-consuming process for manually describing rules mapping
    instances between two datasets. Thus, an automatic approach for solving this
    problem is more than welcome. In this paper, we propose a novel interlinking
    method, SERIMI, for solving this problem automatically.

  24. Mapping Relational Operations onto Hypergraph Model.

    Authors: Amani Tahat, Maurice HT Ling
    Subjects: Databases
    Abstract

    The relational model is the most commonly used data model for storing large
    datasets, perhaps due to the simplicity of the tabular format which had
    revolutionized database management systems. However, many real world objects
    are recursive and associative in nature which makes storage in the relational
    model difficult. The hypergraph model is a generalization of a graph model,
    where each hypernode can be made up of other nodes or graphs and each hyperedge
    can be made up of one or more edges. It may address the recursive and
    associative limitations of the relational model.

  25. Personalized Social Recommendations - Accurate or Private?.

    Authors: Atish Das Sarma, Ashwin Machanavajjhala, Aleksandra Korolova
    Subjects: Databases
    Abstract

    With the recent surge of social networks like Facebook, new forms of
    recommendations have become possible - personalized recommendations of ads,
    content, and even new friend and product connections based on one's social
    interactions. Since recommendations may use sensitive social information, it is
    speculated that these recommendations are associated with privacy risks. The
    main contribution of this work is in formalizing these expected trade-offs
    between the accuracy and privacy of personalized social recommendations.

  26. Implementing Performance Competitive Logical Recovery.

    Authors: David Lomet, Kostas Tzoumas, Michael Zwilling
    Subjects: Databases
    Abstract

    New hardware platforms, e.g. cloud, multi-core, etc., have led to a
    reconsideration of database system architecture. Our Deuteronomy project
    separates transactional functionality from data management functionality,
    enabling a flexible response to exploiting new platforms. This separation
    requires, however, that recovery is described logically. In this paper, we
    extend current recovery methods to work in this logical setting. While this is
    straightforward in principle, performance is an issue.

  27. Column-Oriented Storage Techniques for MapReduce.

    Authors: Jignesh Patel, Sandeep Tata, Avrilia Floratou, Eugene Shekita
    Subjects: Databases
    Abstract

    Users of MapReduce often run into performance problems when they scale up
    their workloads. Many of the problems they encounter can be overcome by
    applying techniques learned from over three decades of research on parallel
    DBMSs. However, translating these techniques to a MapReduce implementation such
    as Hadoop presents unique challenges that can lead to new design choices. This
    paper describes how column-oriented storage techniques can be incorporated in
    Hadoop in a way that preserves its popular programming APIs.

  28. Synthesizing Products for Online Catalogs.

    Authors: Hoa Nguyen, Ariel Fuxman, Stelios Paparizos, Juliana Freire, Rakesh Agrawal
    Subjects: Databases
    Abstract

    A high-quality, comprehensive product catalog is essential to the success of
    Product Search engines and shopping sites such as Yahoo! Shopping, Google
    Product Search or Bing Shopping. But keeping catalogs up-to-date becomes a
    challenging task, calling for the need of automated techniques. In this paper,
    we introduce the problem of product synthesis, a key component of catalog
    creation and maintenance. Given a set of offers advertised by merchants, the
    goal is to identify new products and add them to the catalog together with
    their (structured) attributes.

  29. Optimizing XML querying using type-based document projection.

    Authors: Dario Colazzo, Kim Nguyen, Véronique Benzaken, Giuseppe Castagna
    Subjects: Databases
    Abstract

    XML data projection (or pruning) is a natural optimization for main memory
    query engines: given a query Q over a document D, the subtrees of D that are
    not necessary to evaluate Q are pruned, thus producing a smaller document D';
    the query Q is then executed on D', thus avoiding the allocation and processing
    of nodes that will never be reached by Q. In this article, we propose a new
    approach, based on types, that greatly improves current solutions.
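
    Illustrative sketch (not from the paper): a simple path-based projection
    using Python's xml.etree.ElementTree. A node is kept only if it lies on the
    way to, or inside, a subtree the query needs. The function name and the
    representation of needed paths as tuples of tag names are assumptions; the
    type-based analysis that derives those paths in the paper is not shown.

```python
import xml.etree.ElementTree as ET


def project(elem, needed, path=()):
    """Return a pruned copy of elem, or None if the query never reaches it."""
    path = path + (elem.tag,)
    # Keep ancestors of needed nodes (path is a prefix of a needed path) ...
    on_the_way = any(p[:len(path)] == path for p in needed)
    # ... and everything inside a needed subtree (a needed path is a prefix).
    inside = any(path[:len(p)] == p for p in needed)
    if not (on_the_way or inside):
        return None
    copy = ET.Element(elem.tag, dict(elem.attrib))
    copy.text = elem.text
    for child in elem:
        kept = project(child, needed, path)
        if kept is not None:
            copy.append(kept)
    return copy


if __name__ == "__main__":
    doc = ET.fromstring(
        "<library><book><title>DTW</title><body>long text</body></book>"
        "<log>x</log></library>"
    )
    pruned = project(doc, {("library", "book", "title")})
    print(ET.tostring(pruned, encoding="unicode"))
```

    A query that only reads book titles returns the same answer on the pruned
    document, while the body and log subtrees are never materialized.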

  30. Enabling Multi-level Trust in Privacy Preserving Data Mining.

    Authors: Minghua Chen, Wei Zhang, Yaping Li, Qiwei Li
    Subjects: Databases
    Abstract

    Privacy Preserving Data Mining (PPDM) addresses the problem of developing
    accurate models about aggregated data without access to precise information in
    individual data records. A widely studied perturbation-based PPDM
    approach introduces random perturbation to individual values to preserve
    privacy before data is published. Previous solutions of this approach are
    limited in their tacit assumption of single-level trust on data miners.
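
    Illustrative sketch (not from the paper): the basic idea of additive
    random perturbation. Each individual value is masked by zero-mean noise
    before publication, yet aggregates such as the mean remain close to the
    truth. The Gaussian noise model, the function name and the parameters are
    assumptions for the example; the multi-level trust scheme of the paper is
    not shown.

```python
import random


def perturb(values, sigma, seed=None):
    """Return a published copy of values with zero-mean Gaussian noise added."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]


if __name__ == "__main__":
    original = [50.0] * 10000
    published = perturb(original, sigma=5.0, seed=1)
    # Individual records are distorted, but the aggregate is nearly unchanged.
    print(published[:3])
    print(sum(published) / len(published))
```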

  31. Heuristic Algorithm for Interpretation of Non-Atomic Categorical Attributes in Similarity-based Fuzzy Databases - Scalability Evaluation.

    Authors: M. Shahriar Hossain, Rafal A. Angryk
    Subjects: Databases
    Abstract

    In this work we are analyzing scalability of the heuristic algorithm we used
    in the past to discover knowledge from multi-valued symbolic attributes in
    fuzzy databases. The non-atomic descriptors, characterizing a single attribute
    of a database record, are commonly used in fuzzy databases to reflect
    uncertainty about the recorded observation.

  32. Difference-Huffman Coding of Multidimensional Databases.

    Authors: István Szépkúti
    Subjects: Databases
    Abstract

    A new compression method called difference-Huffman coding (DHC) is introduced
    in this paper. It is verified empirically that DHC results in a smaller
    multidimensional physical representation than those for other previously
    published techniques (single count header compression, logical position
    compression, base-offset compression and difference sequence compression). The
    article examines how caching influences the expected retrieval time of the
    multidimensional and table representations of relations. A model is proposed
    for this, which is then verified with empirical data.

  33. Caching in Multidimensional Databases.

    Authors: István Szépkúti
    Subjects: Databases
    Abstract

    One utilisation of multidimensional databases is the field of On-line
    Analytical Processing (OLAP). The applications in this area are designed to
    make the analysis of shared multidimensional information fast [9]. On one hand,
    speed can be achieved by specially devised data structures and algorithms. On
    the other hand, the analytical process is cyclic. In other words, the user of
    the OLAP application runs his or her queries one after the other.

  34. Large-Scale Collective Entity Matching.

    Authors: Vibhor Rastogi, Nilesh Dalvi, Minos Garofalakis
    Subjects: Databases
    Abstract

    There have been several recent advancements in the Machine Learning community on
    the Entity Matching (EM) problem. However, their lack of scalability has
    prevented them from being applied in practical settings on large real-life
    datasets. Towards this end, we propose a principled framework to scale any
    generic EM algorithm. Our technique consists of running multiple instances of
    the EM algorithm on small neighborhoods of the data and passing messages across
    neighborhoods to construct a global solution.

  35. Fast Set Intersection in Memory.

    Authors: Bolin Ding, Arnd Christian König
    Subjects: Databases
    Abstract

    Set intersection is a fundamental operation in information retrieval and
    database systems. This paper introduces linear space data structures to
    represent sets such that their intersection can be computed in a worst-case
    efficient way. In general, given k (preprocessed) sets with a total of n
    elements, we will show how to compute their intersection in expected time
    O(n/sqrt(w)+kr), where r is the intersection size and w is the number of bits
    in a machine-word.
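
    Illustrative sketch (not from the paper): a plain hash-based baseline for
    intersecting k sets, included only as a point of comparison. It probes the
    candidates from the smallest set against the others and does not implement
    the word-level data structures that give the bound stated above. The
    function name is hypothetical.

```python
def intersect(sets):
    """Intersect a list of Python sets, starting from the smallest one."""
    if not sets:
        return set()
    ordered = sorted(sets, key=len)
    result = set(ordered[0])
    for s in ordered[1:]:
        # Keep only the candidates that also appear in the next set.
        result = {x for x in result if x in s}
        if not result:
            break  # nothing left to intersect
    return result


if __name__ == "__main__":
    print(intersect([{1, 2, 3, 4}, {2, 4, 6}, {4, 2, 9, 10, 11}]))
```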

  36. Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore.

    Authors: Jun Rao, Eugene J. Shekita, Sandeep Tata
    Subjects: Databases
    Abstract

    Spinnaker is an experimental datastore that is designed to run on a large
    cluster of commodity servers in a single datacenter. It features key-based
    range partitioning, 3-way replication, and a transactional get-put API with the
    option to choose either strong or timeline consistency on reads. This paper
    describes Spinnaker's Paxos-based replication protocol. The use of Paxos
    ensures that a data partition in Spinnaker will be available for reads and
    writes as long as a majority of its replicas are alive.
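
    Illustrative sketch (not from the paper): the majority condition stated
    above, written out for 3-way replication. The function names are
    hypothetical, and the replication protocol itself is not modelled.

```python
def majority(replicas):
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def partition_available(alive, replicas=3):
    """True if enough replicas are alive to serve reads and writes."""
    return alive >= majority(replicas)


if __name__ == "__main__":
    for alive in range(4):
        print(alive, partition_available(alive))
```

    With three replicas, a partition tolerates one failure and becomes
    unavailable once two replicas are down.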

  37. Automatic Wrappers for Large Scale Web Extraction.

    Authors: Nilesh Dalvi, Ravi Kumar, Mohamed Soliman
    Subjects: Databases
    Abstract

    We present a generic framework to make wrapper induction algorithms tolerant
    to noise in the training data. This enables us to learn wrappers in a
    completely unsupervised manner from automatically and cheaply obtained noisy
    training data, e.g., using dictionaries and regular expressions. By removing
    the site-level supervision that wrapper-based techniques require, we are able
    to perform information extraction at web-scale, with accuracy unattained with
    existing unsupervised extraction techniques. Our system is used in production
    at Yahoo! and powers live applications.

  38. Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining.

    Authors: Xintian Yang, Srinivasan Parthasarathy, Ponnuswamy Sadayappan
    Subjects: Databases
    Abstract

    Scaling up the sparse matrix-vector multiplication kernel on modern Graphics
    Processing Units (GPU) has been at the heart of numerous studies in both
    academia and industry. In this article we present a novel non-parametric,
    self-tunable, approach to data representation for computing this kernel,
    particularly targeting sparse matrices representing power-law graphs.
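
    Illustrative sketch (not from the paper): the sparse matrix-vector
    multiplication kernel in its common CSR (compressed sparse row) form,
    written sequentially in Python. This is the generic kernel only; the
    representation proposed in the paper and any GPU-specific tuning are not
    shown. The function and argument names are assumptions.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = []
    for row in range(len(row_ptr) - 1):
        total = 0.0
        # Non-zeros of this row occupy values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            total += values[k] * x[col_idx[k]]
        y.append(total)
    return y


if __name__ == "__main__":
    # A = [[1, 0, 2],
    #      [0, 0, 0],
    #      [0, 3, 4]]
    values = [1.0, 2.0, 3.0, 4.0]
    col_idx = [0, 2, 1, 2]
    row_ptr = [0, 2, 2, 4]
    print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))
```

    In a power-law graph, row lengths vary widely, so the inner loop does
    very uneven amounts of work per row.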

  39. Querying and Manipulating Temporal Databases.

    Authors: Rafik Bouaziz, Mohamed Mkaouar, Mohamed Moalla
    Subjects: Databases
    Abstract

    Many works have focused, for over twenty-five years, on the integration of
    the time dimension in databases (DB). However, the standard SQL3 does not yet
    allow easy definition, manipulation and querying of temporal DBs. In this
    paper, we study how we can simplify querying and manipulating temporal facts in
    SQL3, using a model that integrates time in a native manner. To do this, we
    propose new keywords and syntax to define different temporal versions for many
    relational operators and functions used in SQL.

  40. RDBNorma: - A semi-automated tool for relational database schema normalization up to third normal form.

    Authors: Y. V. Dongare, P. S. Dhabe, S. V. Deshmukh
    Subjects: Databases
    Abstract

    This paper proposes RDBNorma, a tool that uses a novel approach to represent
    a relational database schema and its functional dependencies in computer
    memory using only one linked list, and that semi-automates the process of
    relational database schema normalization up to third normal form. The paper
    addresses the issues of representing a relational schema along with its
    functional dependencies using one linked list, and presents algorithms that
    use this representation to convert a relation into second and third normal
    form.
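
    Illustrative sketch (not from the paper): attribute closure under a set of
    functional dependencies, the standard building block for finding keys
    before testing second and third normal form. Python sets are used here in
    place of the single linked list described above, and all names are
    hypothetical.

```python
def closure(attrs, fds):
    """Return the closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left side is already implied, the right side is too.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_superkey(attrs, all_attrs, fds):
    """True if attrs functionally determines every attribute of the relation."""
    return closure(attrs, fds) >= set(all_attrs)


if __name__ == "__main__":
    # Relation R(A, B, C, D) with A -> B and B -> C.
    fds = [("A", "B"), ("B", "C")]
    print(sorted(closure({"A"}, fds)))
    print(is_superkey({"A", "D"}, "ABCD", fds))
```

    Here B -> C is a transitive dependency on the key {A, D}, which is the
    kind of dependency that third normal form removes.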

  41. Efficient and scalable geometric hashing method for searching protein 3D structures.

    Authors: Gook-Pil Roh, Seung-won Hwang, Byoung-Kee Yi
    Subjects: Databases
    Abstract

    As structural databases continue to expand, efficient methods are
    required to search the database for structures similar to a query structure.
    There are many previous works about comparing protein 3D structures and
    scanning the database with a query structure. However, they generally have
    limitations on practical use because of large computational and storage
    requirements.

  42. The VC-Dimension of Queries and Selectivity Estimation Through Sampling.

    Authors: Eli Upfal, Matteo Riondato, Mert Akdere, Ugur Cetintemel, Stan Zdonik
    Subjects: Databases
    Abstract

    We develop a novel method, based on the statistical concept of VC-dimension,
    for selecting a small representative sample from a large database. The
    execution of a query on the sample provides an accurate estimate for the
    selectivity (or cardinality of the output) of each operation in the execution
    of the query on the original large database. The size of the sample does not
    depend on the size (number of tuples) of the database, but is a function of the
    complexity of the queries we plan to run, measured by their VC-dimension.

  43. Analysis of Web Logs and Web User in Web Mining.

    Authors: Dhinaharan Nagamalai, L. K. Joshila Grace, V. Maheswari
    Subjects: Databases
    Abstract

    Log files contain information about User Name, IP Address, Time Stamp, Access
    Request, number of Bytes Transferred, Result Status, URL that Referred and User
    Agent. The log files are maintained by the web servers. Analysing these log
    files gives a clear picture of the user. This paper gives a detailed discussion
    of these log files, their formats, their creation, access procedures, their
    uses, the various algorithms used, and the additional parameters that can be
    used in the log files, which in turn enable effective mining.
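
    Illustrative sketch (not from the paper): parsing one log line into the
    fields listed above. The abstract does not name a format, so the widely
    used NCSA/Apache "combined" layout is assumed here; the regular expression,
    the field names and the sample line are all hypothetical.

```python
import re

# Assumed layout: ip ident user [time] "request" status bytes "referrer" "agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A dash means that no body was sent.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields


if __name__ == "__main__":
    sample = (
        '192.0.2.1 - alice [10/Oct/2011:13:55:36 +0000] '
        '"GET /index.html HTTP/1.1" 200 2326 '
        '"http://example.com/start" "Mozilla/5.0"'
    )
    print(parse_line(sample))
```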

  44. A Comparative Agglomerative Hierarchical Clustering Method to Cluster Implemented Course.

    Authors: Rahmat Widia Sembiring, Jasni Mohamad Zain, Abdullah Embong
    Subjects: Databases
    Abstract

    There are many clustering methods, such as the hierarchical clustering method.
    Most of the approaches to the clustering of variables encountered in the
    literature are of hierarchical type. The great majority of hierarchical
    approaches to the clustering of variables are of agglomerative nature. The
    agglomerative hierarchical approach to clustering starts with each observation
    as its own cluster and then continually groups the observations into
    increasingly larger groups. Higher Learning Institution (HLI) provides training
    to introduce final-year students to the real working environment.
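
    Illustrative sketch (not from the paper): the agglomerative procedure
    described above, for one-dimensional observations. Every observation starts
    as its own cluster and the two closest clusters are merged until k remain.
    Single linkage, the stopping rule and the function name are assumptions
    made for the example.

```python
def agglomerative(points, k):
    """Merge the closest clusters (single linkage) until k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[p] for p in points]  # each observation is its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    print(agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], 3))
```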

  45. Provenance for Aggregate Queries.

    Authors: Val Tannen, Yael Amsterdamer, Daniel Deutch
    Subjects: Databases
    Abstract

    We study in this paper provenance information for queries with aggregation.
    Provenance information was studied in the context of various query languages
    that do not allow for aggregation, and recent work has suggested capturing
    provenance by annotating the different database tuples with elements of a
    commutative semiring and propagating the annotations through query evaluation.
    We show that aggregate queries pose novel challenges rendering this approach
    inapplicable.
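
    Illustrative sketch (not from the paper): the aggregate-free annotation
    scheme mentioned above. Each tuple carries an element of a commutative
    semiring; a join multiplies annotations and a projection adds the
    annotations of tuples that collapse together. The example instantiates the
    semiring with natural numbers, which yields bag multiplicities. The
    relation encoding and function names are assumptions, and the extension to
    aggregate queries developed in the paper is not shown.

```python
import operator


def join(r, s, r_col, s_col, times):
    """Join two annotated relations; annotations are combined with times."""
    out = {}
    for t1, a1 in r.items():
        for t2, a2 in s.items():
            if t1[r_col] == t2[s_col]:
                out[t1 + t2] = times(a1, a2)
    return out


def project(rel, cols, plus):
    """Project onto cols; annotations of merged tuples are combined with plus."""
    out = {}
    for tup, ann in rel.items():
        key = tuple(tup[c] for c in cols)
        out[key] = plus(out[key], ann) if key in out else ann
    return out


if __name__ == "__main__":
    # Relations map each tuple to its annotation (here: a multiplicity).
    R = {("a", "x"): 2, ("b", "x"): 3}
    S = {("x", "p"): 1, ("x", "q"): 4}
    joined = join(R, S, 1, 0, operator.mul)
    print(project(joined, [1], operator.add))
```

    Swapping in logical "and" and "or" for the two operations would give
    ordinary set semantics from the same code.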

  46. Ontology Usage at ZFIN.

    Authors: Doug Howe, Christian Pich
    Subjects: Databases
    Abstract

    The Zebrafish Model Organism Database (ZFIN) provides a Web resource of
    zebrafish genomic, genetic, developmental, and phenotypic data. Four different
    ontologies are currently used to annotate data to the most specific term
    available, facilitating a better comparison between inter-species data. In
    addition, ontologies are used to help users find and cluster data more quickly
    without needing to know the exact technical name for a term.

  47. Provenance and evidence in UniProtKB.

    Authors: Jerven Bolleman, Alain Gateau, Sebastien Gehant, Nicole Redaschi
    Subjects: Databases
    Abstract

    The primary mission of UniProt is to support biological research by
    maintaining a stable, comprehensive, fully classified, richly and accurately
    annotated protein sequence knowledgebase, with extensive cross-references to
    external resources, that is freely available to the scientific community. To
    enable users of the knowledgebase to accurately assess the reliability of the
    information contained in this resource, the evidence for and provenance of the
    information must be recorded.

  48. ChemCloud: Chemical e-Science Information Cloud.

    Authors: Adrian Paschke, Alexandru Todor, Stephan Heineke
    Subjects: Databases
    Abstract

    Our Chemical e-Science Information Cloud (ChemCloud) - a Semantic Web based
    eScience infrastructure - integrates and automates a multitude of databases,
    tools and services in the domain of chemistry, pharmacy and bio-chemistry
    available at the Fachinformationszentrum Chemie (FIZ Chemie), at the Freie
    Universitaet Berlin (FUB), and on the public Web.

  49. Benchmarking triple stores with biological data.

    Authors: Vladimir Mironov, Nirmala Seethappan, Ward Blonde, Erick Antezana, Bjorn Lindi, Martin Kuiper
    Subjects: Databases
    Abstract

    We have compared the performance of five non-commercial triple stores,
    Virtuoso-open source, Jena SDB, Jena TDB, SWIFT-OWLIM and 4Store. We examined
    three performance aspects: the query execution time, scalability and run-to-run
    reproducibility. The queries we chose addressed different ontological or
    biological topics, and we obtained evidence that individual store performance
    was quite query specific.

  50. YeastMed: an XML-Based System for Biological Data Integration of Yeast.

    Authors: Abdelaali Briache, Kamar Marrakchi, Amine Kerzazi, Ismael Navas-Delgado, Jose F Aldana Montes, Badr D. Rossi Hassani, Khalid Lairini
    Subjects: Databases
    Abstract

    A key goal of bioinformatics is to create database systems and software
    platforms capable of storing and analysing large sets of biological data.
    Hundreds of biological databases are now available and provide access to huge
    amounts of biological data. SGD, Yeastract, CYGD-MIPS, BioGrid and PhosphoGrid
    are five of the databases most visited by the yeast community. These sources
    provide complementary data on biological entities. Biologists routinely have
    to query these data sources in order to analyse the results of their
    experiments.

  51. An Effective Clustering Approach to Web Query Log Anonymization.

    Authors: Ke Wang, Amin Milani Fard
    Subjects: Databases
    Abstract

    Web query log data contain information useful to research; however, release
    of such data can re-identify the search engine users issuing the queries. These
    privacy concerns go far beyond removing explicitly identifying information such
    as name and address, since non-identifying personal data can be combined with
    publicly available information to pinpoint an individual. In this work we
    model web query logs as unstructured transaction data and present a novel
    transaction anonymization technique based on clustering and generalization
    techniques to achieve k-anonymity.

  52. Faster Query Answering in Probabilistic Databases using Read-Once Functions.

    Authors: Sudeepa Roy, Vittorio Perduca, Val Tannen
    Subjects: Databases
    Abstract

    A boolean expression is in read-once form if each of its variables appears
    exactly once. When the variables denote independent events in a probability
    space, the probability of the event denoted by the whole expression in
    read-once form can be computed in polynomial time (whereas the general problem
    for arbitrary expressions is #P-complete). Known approaches to checking
    read-once property seem to require putting these expressions in disjunctive
    normal form.
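
    To illustrate why the read-once case is easy: because no variable is
    repeated, the operands of every AND and OR are independent events, so
    probabilities combine bottom-up in one pass. The sketch below assumes a
    hypothetical nested-tuple encoding of expressions and covers only AND and
    OR; it is not the algorithm of the paper.

```python
from math import prod


def read_once_probability(expr, p):
    """Probability that a read-once AND/OR expression is true.

    expr is ("var", name), ("and", e1, e2, ...) or ("or", e1, e2, ...);
    this encoding is an assumption made for the example.
    p maps each variable name to its independent probability of being true.
    """
    kind = expr[0]
    if kind == "var":
        return p[expr[1]]
    # Sub-expressions share no variables, so they are independent events.
    child_probs = [read_once_probability(child, p) for child in expr[1:]]
    if kind == "and":
        return prod(child_probs)
    if kind == "or":
        return 1.0 - prod(1.0 - q for q in child_probs)
    raise ValueError(f"unknown node kind: {kind!r}")


# (x AND y) OR z, with every variable true with probability 0.5
expr = ("or", ("and", ("var", "x"), ("var", "y")), ("var", "z"))
value = read_once_probability(expr, {"x": 0.5, "y": 0.5, "z": 0.5})
```

    Each node is visited once, so the running time is linear in the size of the
    expression.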

  53. Individual Privacy vs Population Privacy: Learning to Attack Anonymization.

    Authors: Graham Cormode
    Subjects: Databases
    Abstract

    Over the last decade there have been great strides made in developing
    techniques to compute functions privately. In particular, Differential Privacy
    gives strong promises about conclusions that can be drawn about an individual.
    In contrast, various syntactic methods for providing privacy (criteria such as
    k-anonymity and l-diversity) have been criticized for still allowing private
    information of an individual to be inferred. In this report, we consider the
    ability of an attacker to use data meeting privacy definitions to build an
    accurate classifier.

  54. Rule-based Generation of Diff Evolution Mappings between Ontology Versions.

    Authors: Erhard Rahm, Michael Hartung, Anika Groß
    Subjects: Databases
    Abstract

    Ontologies such as taxonomies, product catalogs or web directories are
    heavily used and hence evolve frequently to meet new requirements or to better
    reflect the current instance data of a domain. To effectively manage the
    evolution of ontologies it is essential to identify the difference (Diff)
    between two ontology versions. We propose a novel approach to determine an
    expressive and invertible diff evolution mapping between given versions of an
    ontology.

  55. A Blink Tree latch method and protocol to support synchronous node deletion.

    Authors: Karl Malbrain
    Subjects: Databases
    Abstract

    A Blink Tree latch method and protocol supports synchronous node deletion in
    a high concurrency environment. Full source code is available.

  56. The Complexity of Causality and Responsibility for Query Answers and non-Answers.

    Authors: Wolfgang Gatterbauer, Dan Suciu, Alexandra Meliou, Katherine M. Moore
    Subjects: Databases
    Abstract

    An answer to a query has a well-defined lineage expression (alternatively
    called how-provenance) that explains how the answer was derived. Recent work
    has also shown how to compute the lineage of a non-answer to a query. However,
    the cause of an answer or non-answer is a more subtle notion and consists, in
    general, of only a fragment of the lineage. In this paper, we adapt Halpern,
    Pearl, and Chockler's recent definitions of causality and responsibility to
    define the causes of answers and non-answers to queries, and their degree of
    responsibility.

  57. Functorial Data Migration.

    Authors: David I. Spivak
    Subjects: Databases
    Abstract

    In this paper we present a simple database definition language: that of
    categories and functors. A database schema is a category and a state is a
    set-valued functor. We show that morphisms of schemas induce three "data
    migration functors" that translate states from one schema to the other in
    canonical ways. Database states form a boolean topos of which the classical
    "relational algebra" is a fragment. These ideas thus create a new denotational
    semantics for database theory.

  58. A Novel Watermarking Scheme for Detecting and Recovering Distortions in Database Tables.

    Authors: Hamed Khataeimaragheh, Hassan Rashidi
    Subjects: Databases
    Abstract

    In this paper a novel fragile watermarking scheme is proposed to detect,
    localize and recover malicious modifications in relational databases. In the
    proposed scheme, all tuples in the database are first securely divided into
    groups. Then watermarks are embedded and verified group-by-group independently.
    By using the embedded watermark, we are able to detect and localize the
    modifications made to the database and even recover the true data at the
    modified locations. Our experimental results show that this scheme is
    so qualified; i.e.

  59. Mobile Information Collectors' Trajectory Data Warehouse Design.

    Authors: Wided Oueslati, Jalel Akaichi
    Subjects: Databases
    Abstract

    To analyze complex phenomena which involve moving objects, Trajectory Data
    Warehouse (TDW) seems to be an answer for many recent decision problems related
    to various professions (physicians, commercial representatives, transporters,
    ecologists ...) concerned with mobility. This work aims to make trajectories
    a first-class concept in the trajectory data conceptual model and to design a
    TDW in which data resulting from mobile information collectors' trajectories
    are gathered.

  60. Clustering high dimensional data using subspace and projected clustering algorithms.

    Authors: Rahmat Widia Sembiring, Jasni Mohamad Zain, Abdullah Embong
    Subjects: Databases
    Abstract

    Problem statement: Clustering has a number of techniques that have been
    developed in statistics, pattern recognition, data mining, and other fields.
    Subspace clustering enumerates clusters of objects in all subspaces of a
    dataset. It tends to produce many overlapping clusters. Approach: Subspace
    clustering and projected clustering are research areas for clustering in high
    dimensional spaces. In this research we experiment with three
    clustering-oriented algorithms: PROCLUS, P3C and STATPC.

  61. Discovering potential user browsing behaviors using custom-built apriori algorithm.

    Authors: Sandeep Singh Rawat, Lakshmi Rajamani
    Subjects: Databases
    Abstract

    Most organizations put information on the web because they want it to be
    seen by the world. Their goal is to have visitors come to the site, feel
    comfortable, stay a while and try to learn all about the organization
    running it. As educational systems increasingly require data mining, the
    opportunity arises to mine the resulting large amounts of student information
    for hidden useful information (patterns such as rules, clustering, and
    classification).

  62. Matching Dependencies with Arbitrary Attribute Values: Semantics, Query Answering and Integrity Constraints.

    Authors: Leopoldo Bertossi, Jaffer Gardezi, Iluju Kiringa
    Subjects: Databases
    Abstract

    Matching dependencies (MDs) were introduced to specify the identification or
    matching of certain attribute values in pairs of database tuples when some
    similarity conditions are satisfied. Their enforcement can be seen as a natural
    generalization of entity resolution. In what we call the "pure case" of MDs,
    any value from the underlying data domain can be used for the value in common
    that does the matching. We investigate the semantics and properties of data
    cleaning through the enforcement of matching dependencies for the pure case.

  63. ElasTraS: An Elastic Transactional Data Store in the Cloud.

    Authors: Sudipto Das, Amr El Abbadi, Divyakant Agrawal
    Subjects: Databases
    Abstract

    Over the last couple of years, "Cloud Computing" or "Elastic Computing" has
    emerged as a compelling and successful paradigm for internet scale computing.
    One of the major contributing factors to this success is the elasticity of
    resources. In spite of the elasticity provided by the infrastructure and the
    scalable design of the applications, the elephant (or the underlying database),
    which drives most of these web-based applications, is not very elastic and
    scalable, and hence limits scalability.

  64. Data Cleaning and Query Answering with Matching Dependencies and Matching Functions.

    Authors: Laks V. S. Lakshmanan, Leopoldo Bertossi, Solmaz Kolahi
    Subjects: Databases
    Abstract

    Matching dependencies were recently introduced as declarative rules for data
    cleaning and entity resolution. Enforcing a matching dependency on a database
    instance identifies the values of some attributes for two tuples, provided that
    the values of some other attributes are sufficiently similar. Assuming the
    existence of matching functions for making two attribute values equal, we
    formally introduce the process of cleaning an instance using matching
    dependencies, as a chase-like procedure.

  65. The universality of iterated hashing over variable-length strings.

    Authors: Daniel Lemire
    Subjects: Databases
    Abstract

    Iterated hash functions process strings recursively, one character at a time.
    At each iteration, they compute a new hash value from the preceding hash value
    and the next character. We prove that iterated hashing can be pairwise
    independent, but never 3-wise independent. We show that it can be almost
    universal over strings much longer than the number of hash values; we bound the
    maximal string length given the collision probability.
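
    The iteration described in this abstract, where each step combines the
    preceding hash value with the next character, can be illustrated with a
    generic polynomial hash. The multiplier, modulus and seed below are
    arbitrary assumptions, and the sketch makes no claim about the independence
    or universality properties studied in the paper.

```python
def iterated_hash(s, multiplier=31, modulus=2**31 - 1, seed=0):
    """Hash a string one character at a time (illustrative parameters)."""
    h = seed
    for ch in s:
        # New hash value from the preceding hash value and the next character.
        h = (h * multiplier + ord(ch)) % modulus
    return h


h_a = iterated_hash("a")
h_ab = iterated_hash("ab")
```

    Because each step depends only on the previous value and one character, the
    hash of a string can be extended from the hash of its prefix without
    rereading the prefix.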

  66. Removal of Communication Gap.

    Authors: Zeeshan Ahmed, Sudhir Ganti
    Subjects: Databases
    Abstract

    This research is about an online forum designed and developed to improve the
    communication process between alumni, new, old and upcoming students. In this
    research paper we present targeted problems, designed architecture, used
    technologies in development and final end product in detail.

  67. PDM based I-SOAS Data Warehouse Design.

    Authors: Zeeshan Ahmed
    Subjects: Databases
    Abstract

    This research paper briefly describes the industrial contributions of Product
    Data Management (PDM) to any organization's technical and managerial data
    management. Then, focusing on some current major PDM-based problems, i.e.
    Static and Unintelligent Search, Platform Independent System and Successful
    PDM System Implementation, it briefly presents a semantic-based solution,
    I-SOAS. The main aim of this research paper is to present and discuss the
    contributions of I-SOAS to any organization's technical and system data
    management.

  68. Fully Dynamic Data Structure for Top-k Queries on Uncertain Data.

    Authors: Manish Patil, Rahul Shah, Sharma V. Thankachan
    Subjects: Databases
    Abstract

    Top-$k$ queries allow end-users to focus on the most important (top-$k$)
    answers amongst those which satisfy the query. In traditional databases, a user
    defined score function assigns a score value to each tuple and a top-$k$ query
    returns $k$ tuples with the highest score. In an uncertain database, the
    top-$k$ answer depends not only on the scores but also on the membership
    probabilities of tuples. Several top-$k$ definitions covering different
    aspects of score-probability interplay have been proposed in the recent
    past~\cite{R10,R4,R2,R8}.
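
    For the traditional (certain) case described in this abstract, a top-$k$
    query reduces to selecting the $k$ highest-scoring tuples. The sketch below
    covers only that case; the function name and the example rows are
    hypothetical, and the uncertain-data definitions from the paper are not
    modelled.

```python
import heapq


def top_k(tuples, score, k):
    """Return the k tuples with the highest score, best first."""
    # heapq.nlargest avoids sorting the whole input when k is small.
    return heapq.nlargest(k, tuples, key=score)


rows = [("a", 3), ("b", 7), ("c", 5)]
best = top_k(rows, lambda row: row[1], 2)
```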

  69. Multiresolution Cube Estimators for Sensor Network Aggregate Queries.

    Authors: Carlos Guestrin, Joseph M. Hellerstein, Alexandra Meliou
    Subjects: Databases
    Abstract

    In this work we present in-network techniques to improve the efficiency of
    spatial aggregate queries. Such queries are very common in a sensornet setting,
    demanding more targeted techniques for their handling. Our approach constructs
    and maintains multi-resolution cube hierarchies inside the network, which can
    be constructed in a distributed fashion. In case of failures, recovery can also
    be performed with in-network decisions.

  70. An Algorithmic Structuration of a Type System for an Orthogonal Object/Relational Model.

    Authors: Amel Benabbou, Safia Nait Bahloul, Youssef Amghar
    Subjects: Databases
    Abstract

    Date and Darwen have proposed a theory of types, which forms the basis
    of a detailed presentation of a panoply of simple and complex types. However,
    this proposal has not been structured in a formal system. Specifically, Date
    and Darwen haven't indicated the formalism of the type system that corresponds
    to the type theory established.

  71. Boosting the Accuracy of Differentially-Private Histograms Through Consistency.

    Authors: Dan Suciu, Michael Hay, Vibhor Rastogi, Gerome Miklau
    Subjects: Databases
    Abstract

    We show that it is possible to significantly improve the accuracy of a
    general class of histogram queries while satisfying differential privacy. Our
    approach carefully chooses a set of queries to evaluate, and then exploits
    consistency constraints that should hold over the noisy output. In a
    post-processing phase, we compute the consistent input most likely to have
    produced the noisy output. The final output is differentially-private and
    consistent, but in addition, it is often much more accurate.

  72. Report on the XBase Project.

    Authors: Graham Kirby, Alan Dearle, Ron Morrison, Evangelos Zirintsis
    Subjects: Databases
    Abstract

    This project addressed the conceptual fundamentals of data storage,
    investigating techniques for provision of highly generic storage facilities
    that can be tailored to produce various individually customised storage
    infrastructures, compliant to the needs of particular applications. This
    requires the separation of mechanism and policy wherever possible.

  73. A Generic Storage API.

    Authors: Graham Kirby, Alan Dearle, Ron Morrison, Evangelos Zirintsis
    Subjects: Databases
    Abstract

    We present a generic API suitable for provision of highly generic storage
    facilities that can be tailored to produce various individually customised
    storage infrastructures. The paper identifies a candidate set of minimal
    storage system building blocks, which are sufficiently simple to avoid
    encapsulating policy where it cannot be customised by applications, and
    composable to build highly flexible storage architectures.

  74. Similarity Search and Locality Sensitive Hashing using TCAMs.

    Authors: Ashish Goel, Rajendra Shinde, Pankaj Gupta, Debojyoti Dutta
    Subjects: Databases
    Abstract

    Similarity search methods are widely used as kernels in various machine
    learning applications. Nearest neighbor search (NNS) algorithms are often used
    to retrieve similar entries, given a query. While there exist efficient
    techniques for exact query lookup using hashing, similarity search using exact
    nearest neighbors is known to be a hard problem and in high dimensions, best
    known solutions offer little improvement over a linear scan.

  75. H2O: An Autonomic, Resource-Aware Distributed Database System.

    Authors: Graham Kirby, Alan Dearle, Angus Macdonald
    Subjects: Databases
    Abstract

    This paper presents the design of an autonomic, resource-aware distributed
    database which enables data to be backed up and shared without complex manual
    administration. The database, H2O, is designed to make use of unused resources
    on workstation machines. Creating and maintaining highly-available, replicated
    database systems can be difficult for untrained users, and costly for IT
    departments.

  76. Attribute Oriented Induction with simple select SQL statement.

    Authors: Spits Warnars
    Subjects: Databases
    Abstract

    Searching for learning or rules in a relational database for data mining
    purposes, using characteristic or classification/discriminant rules in the
    attribute oriented induction technique, can be quicker, easier, and simpler
    with a simple SQL statement. With just one simple SQL statement, characteristic
    and classification rules can be created simultaneously.

  77. Measuring interesting rules in Characteristic rule.

    Authors: Spits Warnars
    Subjects: Databases
    Abstract

    When finding interesting rules in the sixth strategy step, threshold control
    on generalized relations in attribute oriented induction, it is possible to
    select a candidate attribute for further generalization and to merge identical
    tuples until the number of tuples is no greater than the threshold value, as
    implemented in the basic attribute oriented induction algorithm. At this
    strategy step, the number of tuples in the final generalization result may
    still be greater than the threshold value.

  78. Multidimensional Datawarehouse with Combination Formula.

    Authors: Spits Warnars
    Subjects: Databases
    Abstract

    Multidimensionality in a data warehouse is a necessity and has become most
    important for information delivery; without it, a data warehouse is
    incomplete. Multidimensionality gives the ability to analyze business
    measurements in many different ways. Multidimensional is also synonymous with
    online analytical processing (OLAP). Using data warehouse concepts such as
    slice-dice, drill down and roll up will increase the capability of a
    multidimensional data warehouse.

  79. Using Grid Files for a Relational Database Management System.

    Authors: S. Sanyal, S.M. Joshi, S. Banerjee, S. Srikumar
    Subjects: Databases
    Abstract

    This paper describes our experience with using Grid files as the main storage
    organization for a relational database management system. We primarily focus on
    the following two aspects. (i) Strategies for implementing grid files
    efficiently. (ii) Methods for efficiently evaluating queries posed to a database
    organized using grid files.

  80. Building a Data Warehouse for National Social Security Fund of the Republic of Tunisia.

    Authors: Mohamed Salah Gouider, Amine Farhat
    Subjects: Databases
    Abstract

    The amounts of data available to decision makers are increasingly important,
    given the network availability, low cost storage and diversity of applications.
    To maximize the potential of these data within the National Social Security
    Fund (NSSF) in Tunisia, we have built a data warehouse as a multidimensional
    database, cleaned, homogenized, historicized and consolidated. We used Oracle
    Warehouse Builder to extract, transform and load the source data into the Data
    Warehouse, by applying the KDD process. We have implemented the Data Warehouse
    as an Oracle OLAP.

  81. Query Routing and Processing in Peer-To-Peer Data Sharing Systems.

    Authors: Raddad Al King, Abdelkader Hameurlain, Franck Morvan
    Subjects: Databases
    Abstract

    Sharing musical files via the Internet was the essential motivation of early
    P2P systems. Despite the great success of the P2P file sharing systems,
    these systems support only "simple" queries. The focus in such systems is how
    to carry out an efficient query routing in order to find the nodes storing a
    desired file. Recently, several research works have been made to extend P2P
    systems to be able to share data having a fine granularity (i.e. atomic
    attribute) and to process queries written with a highly expressive language
    (i.e. SQL).

  82. A Tree Logic with Graded Paths and Nominals.

    Authors: Everardo Barcenas, Pierre Geneves, Nabil Layaida, Alan Schmitt
    Subjects: Databases
    Abstract

    Regular tree grammars and regular path expressions constitute core constructs
    widely used in programming languages and type systems. Nevertheless, there has
    been little research so far on reasoning frameworks for path expressions where
    node cardinality constraints occur along a path in a tree. We present a logic
    capable of expressing deep counting along paths which may include arbitrary
    recursive forward and backward navigation.

  83. A New Framework for Join Product Skew.

    Authors: Paraskevas V. Lekeas, Foto Afrati, Victor Kyritsis, Dora Souliou
    Subjects: Databases
    Abstract

    Different types of data skew can result in load imbalance in the context of
    parallel joins under the shared nothing architecture. We study one important
    type of skew, join product skew (JPS). A static approach based on frequency
    classes is proposed which takes for granted the data distribution of join
    attribute values.

  84. Preserving Module Privacy in Workflow Provenance.

    Authors: Sanjeev Khanna, Debmalya Panigrahi, Susan B. Davidson, Sudeepa Roy
    Subjects: Databases
    Abstract

    We study the problem of providing workflow data provenance without revealing
    the functionality of any module. We develop a model that formalizes the notion
    of privacy of modules embedded in a workflow structure as a natural extension
    of privacy of standalone modules. Our model shows that by hiding a small amount
    of carefully chosen data, one can ensure privacy of all modules over an
    unbounded number of executions. The problem of identifying the smallest
    possible amount of such data is NP-hard, and in the full generality of our
    model it is in fact even hard to get a good approximation.

  85. Managing Semantic Loss during Query Reformulation in Peer Data Management Systems.

    Authors: Yannis Delveroudis, Paraskevas V. Lekeas
    Subjects: Databases
    Abstract

    In this paper we deal with the notion of semantic loss in Peer Data
    Management Systems (PDMS) queries. We define such a notion and we give a
    mechanism that discovers semantic loss in a PDMS network. Next, we propose an
    algorithm that addresses the problem of restoring such a loss. Further
    evaluation of our proposed algorithm is ongoing work.

  86. Defining and Mining Functional Dependencies in Probabilistic Databases.

    Authors: Sushovan De, Subbarao Kambhampati
    Subjects: Databases
    Abstract

    Functional dependencies - traditional, approximate and conditional - are of
    critical importance in relational databases, as they inform us about the
    relationships between attributes. They are useful in schema normalization, data
    rectification and source selection. However, probabilistic databases neither
    have these dependencies defined for them, nor are fast algorithms available to
    evaluate their confidences. In this paper, we define the logical extensions of
    various forms of functional dependencies for probabilistic databases. We
    explore the connections between these dependencies.

  87. Adaptive Tuning Algorithm for Performance tuning of Database Management System.

    Authors: S. F. Rodd, U. P. Kulkarni
    Subjects: Databases
    Abstract

    Performance tuning of Database Management Systems (DBMS) is both complex and
    challenging as it involves identifying and altering several key performance
    tuning parameters. The quality of tuning and the extent of performance
    enhancement achieved greatly depend on the skill and experience of the
    Database Administrator (DBA). Neural networks' ability to adapt to
    dynamically changing inputs and their ability to learn make them ideal
    candidates for tuning purposes.

  88. Personnalisation de Systèmes OLAP Annotés.

    Authors: Houssem Jerbi, Geneviève Pujolle, Franck Ravat, Olivier Teste
    Subjects: Databases
    Abstract

    This paper deals with personalization of annotated OLAP systems. Data
    constellation is extended to support annotations and user preferences.
    Annotations reflect the decision-maker's experience whereas user preferences
    enable users to focus on the most interesting data. User preferences allow
    annotated contextual recommendations helping the decision-maker during his/her
    multidimensional navigations.

  89. A Data Cleansing Method for Clustering Large-scale Transaction Databases.

    Authors: Woong-Kee Loh, Yang-Sae Moon, Jun-Gyu Kang
    Subjects: Databases
    Abstract

    In this paper, we emphasize the need for data cleansing when clustering
    large-scale transaction databases and propose a new data cleansing method that
    improves clustering quality and performance. We evaluate our data cleansing
    method through a series of experiments. As a result, the clustering quality and
    performance were significantly improved by up to 165% and 330%, respectively.

  90. Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data.

    Authors: Geert Jan Bex, Wouter Gelade, Frank Neven, Stijn Vansummeren
    Subjects: Databases
    Abstract

    Inferring an appropriate DTD or XML Schema Definition (XSD) for a given
    collection of XML documents essentially reduces to learning deterministic
    regular expressions from sets of positive example words. Unfortunately, there
    is no algorithm capable of learning the complete class of deterministic regular
    expressions from positive examples only, as we will show. The regular
    expressions occurring in practical DTDs and XSDs, however, are such that every
    alphabet symbol occurs only a small number of times.

  91. Constraint-based Query Distribution Framework for an Integrated Global Schema.

    Authors: Ahmad Kamran Malik, Muhammad Abdul Qadir, Nadeem Iftikhar, Muhammad Usman
    Subjects: Databases
    Abstract

    Distributed heterogeneous data sources need to be queried uniformly using a
    global schema. A query on the global schema is reformulated so that it can be
    executed on local data sources. Constraints in the global schema and mappings
    are used for source selection, query optimization, and querying partitioned
    and replicated data sources.

  92. Mining The Data From Distributed Database Using An Improved Mining Algorithm.

    Authors: K. L. Shunmuganathan, J. Arokia Renjit
    Subjects: Databases
    Abstract

    Association rule mining is an active data mining research area and most ARM
    algorithms cater to a centralized environment. Centralized data mining to
    discover useful patterns in distributed databases isn't always feasible because
    merging data sets from different sites incurs huge network communication costs.
    In this paper, an improved algorithm for data mining, based on a good
    performance level, is proposed.

  93. Mobile Database System: Role of Mobility on the Query Processing.

    Authors: Dr. R. S. Kasana, Samidha Dwivedi Sharma
    Subjects: Databases
    Abstract

    The rapidly expanding technology of mobile communication will give mobile
    users the capability of accessing information from anywhere at any time. The
    wireless technology has made it possible to achieve continuous connectivity in
    a mobile environment. When the query is specified as continuous, the requesting
    mobile user can obtain a continuously changing result. In order to provide an
    accurate and timely outcome to the requesting mobile user, the locations of
    moving objects have to be closely monitored.

  94. Semi-Automatic Index Tuning: Keeping DBAs in the Loop.

    Authors: Karl Schnaitter, Neoklis Polyzotis
    Subjects: Databases
    Abstract

    To obtain good system performance, a DBA must choose a set of indices that is
    appropriate for the workload. The system can aid in this challenging task by
    providing recommendations for the index configuration. We propose a new index
    recommendation technique, termed semi-automatic tuning, that keeps the DBA "in
    the loop" by generating recommendations that use feedback about the DBA's
    preferences. The technique also works online, which avoids the limitations of
    commercial tools that require the workload to be known in advance.

  95. Anonimos: An LP based Approach for Anonymizing Weighted Social Network Graphs.

    Authors: Sudipto Das, Omer Egecioglu, Amr El Abbadi
    Subjects: Databases
    Abstract

    The increasing popularity of social networks has initiated a fertile research
    area in information extraction and data mining. Anonymization of these social
    graphs is important to facilitate publishing these data sets for analysis by
    external entities. Prior work has concentrated mostly on node identity
    anonymization and structural anonymization. But with the growing interest in
    analyzing social networks as a weighted network, edge weight anonymization is
    also gaining importance.

  96. An Improved Algorithm for Generating Database Transactions from Relational Algebra Specifications.

    Authors: Daniel J. Dougherty
    Subjects: Databases
    Abstract

    Alloy is a lightweight modeling formalism based on relational algebra. In
    prior work with Fisler, Giannakopoulos, Krishnamurthi, and Yoo, we have
    presented a tool, Alchemy, that compiles Alloy specifications into
    implementations that execute against persistent databases. The foundation of
    Alchemy is an algorithm for rewriting relational algebra formulas into code for
    database transactions. In this paper we report on recent progress in improving
    the robustness and efficiency of this transformation.

  97. Cubes convexes.

    Authors: Sebastien Nedjar, Alain Casali, Rosine Cicchetti, Lotfi Lakhal
    Subjects: Databases
    Abstract

    In various approaches, data cubes are pre-computed in order to answer OLAP
    queries efficiently. The notion of data cube has been adapted in various
    ways: iceberg cubes, range cubes or differential cubes. In this paper, we
    introduce the concept of convex cube which captures all the tuples of a
    data cube satisfying a constraint combination. It can be represented in a very
    compact way in order to optimize both computation time and required storage
    space.

  98. Transparent Anonymization: Thwarting Adversaries Who Know the Algorithm.

    Authors: Xiaokui Xiao, Yufei Tao, Nick Koudas
    Subjects: Databases
    Abstract

    Numerous generalization techniques have been proposed for privacy preserving
    data publishing. Most existing techniques, however, implicitly assume that the
    adversary knows little about the anonymization algorithm adopted by the data
    publisher. Consequently, they cannot guard against privacy attacks that exploit
    various characteristics of the anonymization mechanism. This paper provides a
    practical solution to the above problem. First, we propose an analytical model
    for evaluating disclosure risks, when an adversary knows everything in the
    anonymization process, except the sensitive values.

  99. Limits of Commutativity on Abstract Data Types.

    Authors: Carmelo Malta, José Martinez
    Subjects: Databases
    Abstract

    We present some formal properties of (symmetrical) commutativity, the major
    criterion used in transactional systems, which allow us to fully understand its
    advantages and disadvantages. The main result is that commutativity is subject
    to the same limitation as compatibility for arbitrary objects. However,
    commutativity also has a number of attractive properties, one of which is
    related to recovery and, to our knowledge, has not been exploited in the
    literature. Advantages and disadvantages are illustrated on abstract data types
    of interest.

  100. A framework for designing concurrent and recoverable abstract data types based on commutativity.

    Authors: Carmelo Malta, José Martinez
    Subjects: Databases
    Abstract

    In this paper, we try to focus the reader's interest on the problems that
    transactional systems have to resolve for taking advantage of commutativity in
    a serializable and recoverable way. Our framework, like others, is based on
    the use of conditional commutativity on abstract data types. We present new
    features that have not been found in the literature hitherto, that both
    increase concurrency and simplify recovery.

  101. Tuple-based abstract data types: full parallelism.

    Authors: Carmelo Malta, José Martinez
    Subjects: Databases
    Abstract

    Commutativity has the same inherent limitations as compatibility. It is
    therefore worth conceiving simple concurrency control techniques. We propose
    a restricted form of commutativity which increases parallelism without
    incurring a higher overhead than compatibility. Advantages of our proposal
    are: (1) commutativity of operations is determined at compile-time, (2)
    run-time checking is as efficient as for compatibility, (3) neither
    commutativity relations (4) nor inverse operations need to be specified, and
    (5) log space utilization is reduced.

  102. Automating Fine Concurrency Control in Object-Oriented Databases.

    Authors: Carmelo Malta, José Martinez
    Subjects: Databases
    Abstract

    Several proposals have been made to provide concurrency control adapted to
    object-oriented databases. However, most of these proposals miss the fact that
    considering solely read and write access modes on instances may lead to less
    parallelism than in relational databases!

  103. Evaluation of Query Generators for Entity Search Engines.

    Authors: Stefan Endrullis, Andreas Thor, Erhard Rahm
    Subjects: Databases
    Abstract

    Dynamic web applications such as mashups need efficient access to web data
    that is only accessible via entity search engines (e.g. product or publication
    search engines). However, most current mashup systems and applications only
    support simple keyword searches for retrieving data from search engines. We
    propose the use of more powerful search strategies building on so-called query
    generators. For a given set of entities, query generators are able to
    automatically determine a set of search queries to retrieve these entities from
    an entity search engine.

  104. XPath Whole Query Optimization.

    Authors: Sebastian Maneth, Kim Nguyen
    Subjects: Databases
    Abstract

    Previous work reported on SXSI, a fast XPath engine which executes tree
    automata over compressed XML indexes. Here, the reasons why SXSI is so fast
    are investigated. It is shown that tree automata can be used as a general
    framework for fine-grained XML query optimization. We define the "relevant
    nodes" of a query as those nodes that a minimal automaton must touch in order
    to answer the query. This notion allows many subtrees to be skipped during
    execution and, with the help of particular tree indexes, even allows internal
    nodes of the tree to be skipped.

  105. Similarity Data Item Set Approach: An Encoded Temporal Data Base Technique.

    Authors: K. Duraiswamy, M. S. Danessh, C. Balasubramanian
    Subjects: Databases
    Abstract

    Data mining has been widely recognized as a powerful tool to explore added
    value from large-scale databases. Finding frequent item sets in databases is a
    crucial step in the data mining process of extracting association rules. Many
    algorithms have been developed to find the frequent item sets. This paper
    presents a summary and a comparative study of the available FP-growth
    algorithm variations produced for mining frequent item sets, showing their
    capabilities and efficiency in terms of time and memory consumption on
    association rule mining by taking application-specific information into
    account.

  106. A Novel Approach For Discovery Multi Level Fuzzy Association Rule Mining.

    Authors: K. R. Pardasani, Pratima Gautam
    Subjects: Databases
    Abstract

    Finding multilevel association rules in transaction databases is widely used
    in data mining. In this paper, we present a model of mining multilevel
    association rules which satisfies the different minimum support at each
    level. We have employed fuzzy set concepts, multi-level taxonomy and different
    minimum supports to find fuzzy multilevel association rules in a given
    transaction data set. The Apriori property is used in the model to prune the
    item sets. The proposed model adopts a top-down progressively deepening
    approach to derive large itemsets.

  107. Adding HL7 version 3 data types to PostgreSQL.

    Authors: Yeb Havinga, Willem Dijkstra, Ander de Keijzer
    Subjects: Databases
    Abstract

    The HL7 standard is widely used to exchange medical information
    electronically. As a part of the standard, HL7 defines scalar communication
    data types like physical quantity, point in time and concept descriptor but
    also complex types such as interval types, collection types and probabilistic
    types.

  108. Querying Incomplete Data over Extended ER Schemata.

    Authors: Andrea Cali, Davide Martinenghi
    Subjects: Databases
    Abstract

    Since Chen's Entity-Relationship (ER) model, conceptual modeling has been
    playing a fundamental role in relational data design. In this paper we consider
    an extended ER (EER) model enriched with cardinality constraints, disjointness
    assertions, and is-a relations among both entities and relationships. In this
    setting, we consider the case of incomplete data, which is likely to occur, for
    instance, when data from different sources are integrated. In such a context,
    we address the problem of providing correct answers to conjunctive queries by
    reasoning on the schema.

  109. Table manipulation in simplicial databases.

    Authors: David I. Spivak
    Subjects: Databases
    Abstract

    In earlier work [Spi], we developed a category of databases in which the
    schema of a database is represented as a simplicial set. Each simplex
    corresponds to a table in the database. There, our main concern was to find a
    categorical formulation of databases; the simplicial nature of the schemas
    was to some degree unexpected and unexploited.

  110. Verifying Recursive Active Documents with Positive Data Tree Rewriting.

    Authors: Blaise Genest, Anca Muscholl, Zhilin Wu
    Subjects: Databases
    Abstract

    This paper proposes a data tree-rewriting framework for modeling evolving
    documents. The framework is close to Guarded Active XML, a platform used for
    handling XML repositories evolving through web services. We focus on automatic
    verification of properties of evolving documents that can contain data from an
    infinite domain. We establish the boundaries of decidability, and show that
    verification of a positive fragment that can handle recursive service
    calls is decidable.

  111. Mining Statistically Significant Substrings Based on the Chi-Square Measure.

    Authors: Sourav Dutta, Arnab Bhattacharya
    Subjects: Databases
    Abstract

    Given the vast reservoirs of data stored worldwide, efficient mining of data
    from a large information store has emerged as a great challenge. Many
    databases, such as those of intrusion detection systems, web-click records,
    player statistics, texts, and proteins, store strings or sequences. Searching
    for an unusual pattern within such long strings of data has emerged as a
    requirement for diverse applications. Given a string, the problem then is to
    identify the substrings that differ the most from the expected or normal
    behavior, i.e.,
    the substrings that are statistically significant.
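
    The problem statement can be illustrated with a small sketch. This is not
    the authors' algorithm: it assumes a null model in which characters occur
    independently with known probabilities (the `probs` mapping and the sample
    string are hypothetical), scores each substring with the Pearson chi-square
    statistic, and scans all substrings by brute force. It also assumes every
    character of the text appears in `probs` with a non-zero probability.

```python
from collections import Counter


def chi_square(substring, probs):
    """Pearson chi-square of a substring against expected character counts.

    probs maps each character to its probability under the assumed null model.
    """
    n = len(substring)
    observed = Counter(substring)
    return sum(
        (observed.get(ch, 0) - n * p) ** 2 / (n * p)
        for ch, p in probs.items()
    )


def most_significant_substring(text, probs):
    """Naive scan over all substrings; returns (score, substring).

    Cubic time in len(text), so only suitable for short illustrative inputs.
    """
    best_score, best_sub = 0.0, ""
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            score = chi_square(text[i:j], probs)
            if score > best_score:
                best_score, best_sub = score, text[i:j]
    return best_score, best_sub


# Hypothetical example: a fair two-letter alphabet.
probs = {"a": 0.5, "b": 0.5}
result = most_significant_substring("abaaaab", probs)
```

    With these inputs the run of four consecutive "a" characters deviates most
    from the expected even split and is returned as the top-scoring substring.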

  112. A Logical Temporal Relational Data Model.

    Authors: Kamran Ahsan, Nadeem Mahmood, Aqil Burney
    Subjects: Databases
    Abstract

    Time is one of the most difficult aspects to handle in real world
    applications such as database systems. Relational database management systems
    proposed by Codd offer very little built-in query language support for temporal
    data management. The model itself incorporates neither the concept of time nor
    any theory of temporal semantics. Many temporal extensions of the relational
    model have been proposed and some of them are also implemented. This paper
    offers a brief introduction to temporal database research.

  113. An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets.

    Authors: Michael Mitzenmacher, Adam Kirsch, Andrea Pietracaprina, Geppino Pucci, Eli Upfal, Fabio Vandin
    Subjects: Databases
    Abstract

    As advances in technology allow for the collection, storage, and analysis of
    vast amounts of data, the task of screening and assessing the significance of
    discovered patterns is becoming a major challenge in data mining applications.
    In this work, we address significance in the context of frequent itemset
    mining.

  114. Finding Sequential Patterns from Large Sequence Data.

    Authors: Mahdi Esmaeili, Fazekas Gabor
    Subjects: Databases
    Abstract

    Data mining is the task of discovering interesting patterns from large
    amounts of data. There are many data mining tasks, such as classification,
    clustering, association rule mining, and sequential pattern mining. Sequential
    pattern mining finds sets of data items that occur together frequently in some
    sequences.

  115. Mining The Successful Binary Combinations: Methodology and A Simple Case Study.

    Authors: Yuval Cohen
    Subjects: Databases
    Abstract

    Finding the characteristics leading to either a success or a failure is one
    of the driving forces of data mining. Applications of finding success/failure
    factors cover a vast variety of areas, such as credit risk evaluation and
    granting loans, microarray analysis, health factors and health risk factors,
    and parameter combinations leading to a product's success. This paper
    presents a new approach for making inferences about
    dichotomous data. The objective is to determine rules that lead to a certain
    result.

  116. Significant Interval and Frequent Pattern Discovery in Web Log Data.

    Authors: Kanak Saxena, Rahul Shukla
    Subjects: Databases
    Abstract

    There is a considerable body of work on sequence mining of Web log data. We
    use the One Pass Frequent Episode Discovery (FED) algorithm, which takes a
    different approach from the traditional Apriori class of pattern detection
    algorithms. In this approach, significant intervals for each website are
    computed first (independently), and these intervals are used for detecting
    frequent patterns/episodes. The analysis is then performed on the significant
    intervals and frequent patterns, which can be used to forecast users'
    behavior from previous trends and also for advertising purposes.

  117. The WebStand Project.

    Authors: Benjamin Nguyen, Antoine Vion, François-Xavier Dudouet, Dario Colazzo, Ioana Manolescu, Pierre Senellart
    Subjects: Databases
    Abstract

    In this paper we present the state of advancement of the French ANR WebStand
    project. The objective of this project is to construct a customizable XML based
    warehouse platform to acquire, transform, analyze, store, query and export data
    from the web, in particular mailing lists, with the final intention of using
    this data to perform sociological studies focused on social groups of the
    World Wide Web, with a specific emphasis on the temporal aspects of this data.

  118. Discovery of Convoys in Trajectory Databases.

    Authors: Hoyoung Jeung, Man Lung Yiu, Xiaofang Zhou, Christian S. Jensen, Heng Tao Shen
    Subjects: Databases
    Abstract

    As mobile devices with positioning capabilities continue to proliferate, data
    management for so-called trajectory databases that capture the historical
    movements of populations of moving objects becomes important. This paper
    considers the querying of such databases for convoys, a convoy being a group of
    objects that have traveled together for some time. More specifically, this
    paper formalizes the concept of a convoy query using density-based notions, in
    order to capture groups of arbitrary extents and shapes.

  119. Extraction of Flat and Nested Data Records from Web Pages.

    Authors: P. S. Hiremath, Siddu P. Algur
    Subjects: Databases
    Abstract

    This paper studies the problem of identification and extraction of flat and
    nested data records from a given web page. With the explosive growth of
    information sources available on the World Wide Web, it has become increasingly
    difficult to identify the relevant pieces of information, since web pages are
    often cluttered with irrelevant content like advertisements, navigation-panels,
    copyright notices etc., surrounding the main content of the web page.

  120. Page-Differential Logging: An Efficient and DBMS-independent Approach for Storing Data into Flash Memory.

    Authors: Yi-Reun Kim, Kyu-Young Whang, Il-Yeol Song
    Subjects: Databases
    Abstract

    Flash memory is widely used as the secondary storage in lightweight computing
    devices due to its outstanding advantages over magnetic disks. Flash memory has
    many access characteristics different from those of magnetic disks, and how to
    take advantage of them is becoming an important research issue. There are two
    existing approaches to storing data into flash memory: page-based and
    log-based. The former has good performance for read operations, but poor
    performance for write operations.

  121. Interestingness Measure for Mining Spatial Gene Expression Data using Association Rule.

    Authors: M. Anandhavalli, M. K. Ghose, K. Gauthaman
    Subjects: Databases
    Abstract

    The search for interesting association rules is an important topic in
    knowledge discovery in spatial gene expression databases. The set of admissible
    rules for the selected support and confidence thresholds can easily be
    extracted by algorithms based on support and confidence, such as Apriori.
    However, they may produce a large number of rules, many of which are
    uninteresting. The challenge in association rule mining (ARM) essentially
    becomes one of determining which rules are the most interesting.
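
    As a minimal sketch of the two thresholds mentioned above, support and
    confidence of a rule can be computed as follows. The standard textbook
    definitions are assumed; the transactions and gene names are hypothetical
    and do not come from the paper.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # rule never applies; report zero confidence
    both = support(frozenset(antecedent) | frozenset(consequent), transactions)
    return both / base


# Hypothetical toy data: each transaction is the set of genes expressed
# in one spatial region.
transactions = [
    frozenset({"geneA", "geneB"}),
    frozenset({"geneA", "geneB", "geneC"}),
    frozenset({"geneA"}),
    frozenset({"geneC"}),
]

rule_support = support({"geneA", "geneB"}, transactions)
rule_confidence = confidence({"geneA"}, {"geneB"}, transactions)
```

    A rule passes the thresholds when both values meet the chosen minimums;
    deciding which of the passing rules are interesting is the separate problem
    the abstract refers to.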

  122. A Model for Mining Multilevel Fuzzy Association Rule in Database.

    Authors: Neelu Khare, K. R. Pardasani, Pratima Gautam
    Subjects: Databases
    Abstract

    The problem of developing models and algorithms for multilevel association
    mining poses new challenges for mathematics and computer science. These
    problems become more challenging when some form of uncertainty, like
    fuzziness, is present in data or relationships in data. This paper proposes a
    multilevel fuzzy association rule mining model for extracting knowledge
    implicit in transaction databases with different support at each level. The
    proposed algorithm adopts a top-down progressively deepening approach to
    derive large itemsets.

  123. Proposing a New Method for Query Processing Adaption in DataBase.

    Authors: Mohammad-Reza Feizi-Derakhshi, Hasan Asil, Amir Asil
    Subjects: Databases
    Abstract

    This paper proposes a multi-agent system that combines two technologies,
    query processing optimization and agents, and offers personalized queries and
    adaptation to changing requirements. This system uses a new algorithm based
    on modeling of users' long-term requirements and also a GA to gather users'
    query data. Experimental results show greater adaptation capability for the
    presented algorithm in comparison with classic algorithms.

  124. Finding top-k similar pairs of objects annotated with terms from an ontology.

    Authors: Arnab Bhattacharya, Abhishek Bhowmick, Ambuj K. Singh
    Subjects: Databases
    Abstract

    With the growing focus on semantic searches and interpretations, an
    increasing number of standardized vocabularies and ontologies are being
    designed and used to describe data. We investigate the querying of objects
    described by a tree-structured ontology. Specifically, we consider the case of
    finding the top-k best pairs of objects that have been annotated with terms
    from such an ontology when the object descriptions are available only at
    runtime. We consider three distance measures.

  125. An Improved Approach to High Level Privacy Preserving Itemset Mining.

    Authors: Rajesh Kumar Boora, Ruchi Shukla, A. K. Misra
    Subjects: Databases
    Abstract

    Privacy preserving association rule mining has triggered the development of
    many privacy preserving data mining techniques. A large fraction of them use
    randomized data distortion techniques to mask the data to preserve privacy. This
    paper proposes a new transaction randomization method which is a combination of
    the fake transaction randomization method and a new per transaction
    randomization method. This method distorts the items within each transaction
    and ensures a higher level of data privacy in comparison to the previous
    approaches.

  126. Efficient Candidacy Reduction For Frequent Pattern Mining.

    Authors: Mohammad Nadimi Shahraki, Norwati Mustapha, Md Nasir B Sulaiman, Ali B Mamat
    Subjects: Databases
    Abstract

    Nowadays, knowledge discovery, or extracting knowledge from large amounts of
    data, is a desirable task in competitive businesses. Data mining is a main
    step in the knowledge discovery process. Meanwhile, frequent patterns play a
    central role in data mining tasks such as clustering, classification, and
    association analysis. Identifying all frequent patterns is the most time
    consuming process due to a massive number of candidate patterns. Over the past
    decade there has been an increasing number of efficient algorithms to mine the
    frequent patterns.

  127. Mining Spatial Gene Expression Data Using Negative Association Rules.

    Authors: M. Anandhavalli, M. K. Ghose, K. Gauthaman
    Subjects: Databases
    Abstract

    Over the years, data mining has attracted most of the attention from the
    research community. The researchers attempt to develop faster, more scalable
    algorithms to navigate over the ever increasing volumes of spatial gene
    expression data in search of meaningful patterns. Association rules are a data
    mining technique that tries to identify intrinsic patterns in spatial gene
    expression data. They have been widely used in different applications, and
    many algorithms have been introduced to discover these rules. However,
    Apriori-like algorithms have been used to find positive association rules.

  128. A framework to model real-time databases.

    Authors: Nizar Idoudi, Nada Louati, Claude Duvallet, Bruno Sadeg, Rafik Bouaziz, Faiez Gargouri
    Subjects: Databases
    Abstract

    Real-time databases deal with time-constrained data and time-constrained
    transactions. The design of this kind of database requires the introduction of
    new concepts to support both data structures and the dynamic behaviour of the
    database. In this paper, we give an overview of different aspects of
    real-time databases and we clarify requirements of their modelling. Then, we
    present a framework for real-time database design and describe its fundamental
    operations.

  129. A Study on Feature Selection Techniques in Educational Data Mining.

    Authors: M. Ramaswami, R. Bhaskaran
    Subjects: Databases
    Abstract

    Educational data mining (EDM) is a new and growing research area in which
    data mining concepts are used in the educational field for the purpose of
    extracting useful information on the behaviors of students in the learning
    process. In EDM, feature selection is performed to generate a subset of
    candidate variables. As feature selection influences the predictive accuracy
    of any performance model, it is essential to study thoroughly the
    effectiveness of the student performance model in connection with feature
    selection techniques.

  130. Privacy in Search Logs.

    Authors: Johannes Gehrke, Xiaokui Xiao, Guozhang Wang, Michaela Goetz, Ashwin Machanavajjhala
    Subjects: Databases
    Abstract

    Search engine companies collect the "database of intentions", the histories
    of their users' search queries. These search logs are a gold mine for
    researchers. Search engine companies, however, are wary of publishing search
    logs in order not to disclose sensitive information.

  131. Data management in Systems biology II - Outlook towards the semantic web.

    Authors: Gerhard Mayer
    Subjects: Databases
    Abstract

    The benefit of using ontologies, defined by the respective data standards, is
    shown. It is described how ontologies can be used for the semantic enrichment
    of data and how this can help the vision of the semantic web come true.

  132. COAT: COnstraint-based Anonymization of Transactions.

    Authors: Grigorios Loukides, Aris Gkoulalas-Divanis, Bradley Malin
    Subjects: Databases
    Abstract

    Publishing person-specific transactions in an anonymous form is increasingly
    required by organizations. Recent approaches ensure that potentially
    identifying information (e.g., a set of diagnosis codes) cannot be used to link
    published transactions to persons' identities, but all are limited in
    application because they incorporate coarse privacy requirements (e.g.,
    protecting a certain set of m diagnosis codes requires protecting all m-sized
    sets), do not integrate utility requirements, and tend to explore a small
    portion of the solution space.

  133. Enterprise Multi-Branch Database Synchronization with MSMQ.

    Authors: Emil Vassev
    Subjects: Databases
    Abstract

    When we talk about databases there have always been problems concerning data
    synchronization. The latter is a technique for maintaining consistency among
    different copies of data (often called replicas). In general, there is no
    universal solution to this problem and often a particular situation requires a
    particular approach driven by specific conditions. This paper presents an
    approach tackling the issue of data synchronization in a distributed
    multi-branch enterprise database. The proposed solution is based on MSMQ
    (Microsoft Message Queue), a mechanism for asynchronous messaging.

  134. Design of Intelligent layer for flexible querying in databases.

    Authors: Mrs. Neelu Nihalani, Dr. Sanjay Silakari, Dr. Mahesh Motwani
    Subjects: Databases
    Abstract

    Computer-based information technologies have been extensively used to help
    many organizations, private companies, and academic and education institutions
    manage their processes, and information systems have thereby become their
    nerve centre. The explosion of massive data sets created by businesses,
    science and governments necessitates intelligent and more powerful computing
    paradigms so that users can benefit from this data. Therefore most
    new-generation database applications demand intelligent information
    management to enhance efficient interactions between the database and its
    users.

  135. Refactoring of a Database.

    Authors: Ayeesha Dsousa, Shalini Bhatia
    Subjects: Databases
    Abstract

    The technique of database refactoring is all about applying disciplined and
    controlled techniques to change an existing database schema. The problem is to
    successfully create a Database Refactoring Framework for databases. This paper
    concentrates on the feasibility of adapting this concept to work as a generic
    template. To retain the constraints regardless of the modifications to the
    metadata, the paper proposes a MetaData Manipulation Tool to facilitate change.
    The tool adopts a Template Design Pattern to make it database independent.

  136. XML Multidimensional Modelling and Querying.

    Authors: Serge Boucher, Boris Verhaegen, Esteban Zimányi
    Subjects: Databases
    Abstract

    As XML becomes ubiquitous and XML storage and processing becomes more
    efficient, the range of use cases for these technologies widens daily. One
    promising area is the integration of XML and data warehouses, where an
    XML-native database stores multidimensional data and processes OLAP queries
    written in the XQuery interrogation language. This paper explores issues
    arising in the implementation of such a data warehouse. We first compare
    approaches for multidimensional data modelling in XML, then describe how
    typical OLAP queries on these models can be expressed in XQuery.

  137. Applying an XML Warehouse to Social Network Analysis, Lessons from the WebStand Project.

    Authors: Benjamin Nguyen, Antoine Vion, Francois-Xavier Dudouet, Loic Saint-Ghislain
    Subjects: Databases
    Abstract

    In this paper we present the state of advancement of the French ANR WebStand
    project. The objective of this project is to construct a customizable XML based
    warehouse platform to acquire, transform, analyze, store, query and export data
    from the web, in particular mailing lists, with the final intention of using
    this data to perform sociological studies focused on social groups of the
    World Wide Web, with a specific emphasis on the temporal aspects of this data.

  138. "Almost automatic" and semantic integration of XML Schemas at various "severity" levels.

    Authors: P. De Meo, G. Quattrone, G. Terracina, D. Ursino
    Subjects: Databases
    Abstract

    This paper presents a novel approach for the integration of a set of XML
    Schemas. The proposed approach is specialized for XML, is almost automatic,
    semantic and "light". As a further, original, peculiarity, it is parametric
    w.r.t. a "severity" level against which the integration task is performed. The
    paper describes the approach in full detail, illustrates various theoretical
    results, presents the experiments we have performed for testing it and,
    finally, compares it with various related approaches already proposed in the
    literature.

  139. On the Privacy of Euclidean Distance Preserving Data Perturbation.

    Authors: Chris Giannella, Hillol Kargupta, Kun Liu
    Subjects: Databases
    Abstract

    We examine Euclidean distance preserving data perturbation as a tool for
    privacy-preserving data mining. Such perturbations allow many important data
    mining algorithms, with only minor modification, to be applied to the perturbed
    data and produce exactly the same results as if applied to the original data,
    e.g. hierarchical clustering and k-means clustering. However, the issue of how
    well the original data is hidden needs careful study. We take a step in this
    direction by assuming the role of an attacker armed with two types of prior
    information regarding the original data.
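
    To make the distance-preserving property concrete, the sketch below uses a
    plane rotation, one simple example of such a perturbation. The points and
    the angle are hypothetical, and this is only an illustration of why
    distance-based algorithms give the same results on the perturbed data, not
    the attack analysis of the paper.

```python
import math


def rotate(points, theta):
    """Rotate 2-D points about the origin by angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def pairwise_distances(points):
    """Euclidean distances between all unordered pairs, in a fixed order."""
    return [
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]


# Hypothetical original records and a secret rotation angle.
original = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
perturbed = rotate(original, 1.234)

# Coordinates change, but every pairwise distance is preserved up to
# floating-point rounding.
distances_match = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(pairwise_distances(original), pairwise_distances(perturbed))
)
```

    Because only the coordinates change while all distances stay the same, how
    well the original values are hidden depends on what the attacker already
    knows, which is the question the abstract raises.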

  140. Composition and Inversion of Schema Mappings.

    Authors: Marcelo Arenas, Jorge Perez, Juan Reutter, Cristian Riveros
    Subjects: Databases
    Abstract

    In recent years, a lot of attention has been paid to the development of
    solid foundations for the composition and inversion of schema mappings. In this
    paper, we review the proposals for the semantics of these crucial operators.
    For each of these proposals, we concentrate on the three following problems:
    the definition of the semantics of the operator, the language needed to express
    the operator, and the algorithmic issues associated with the problem of computing
    the operator.

  141. On Metric Skyline Processing by PM-tree.

    Authors: Tomas Skopal, Jakub Lokoc
    Subjects: Databases
    Abstract

    The task of similarity search in multimedia databases is usually accomplished
    by range or k nearest neighbor queries. However, the expressive power of these
    "single-example" queries fails when the user's delicate query intent is not
    available as a single example. Recently, the well-known skyline operator was
    reused in metric similarity search as a "multi-example" query type. When
    applied to a multi-dimensional database (i.e., to a multi-attribute table), the
    traditional skyline operator selects all database objects that are not
    dominated by other objects.
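
    The traditional skyline operator described above can be sketched with a
    naive quadratic scan. The sketch assumes that smaller values are better in
    every attribute; the hotel tuples are hypothetical, and the metric skyline
    and PM-tree processing studied in the paper are not shown.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every attribute and strictly
    better in at least one (smaller is assumed to be better)."""
    return (
        all(x <= y for x, y in zip(a, b))
        and any(x < y for x, y in zip(a, b))
    )


def skyline(objects):
    """Return the objects that no other object dominates (naive O(n^2) scan)."""
    return [
        o for o in objects
        if not any(dominates(other, o) for other in objects)
    ]


# Hypothetical table of hotels: (price, distance to the beach).
hotels = [(50, 8), (60, 2), (70, 3), (40, 9), (55, 8)]
result = skyline(hotels)
```

    Here (70, 3) is dominated by (60, 2) and (55, 8) by (50, 8), so the skyline
    keeps the remaining three hotels, none of which is beaten on both price and
    distance.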

  142. Differential Privacy via Wavelet Transforms.

    Authors: Johannes Gehrke, Xiaokui Xiao, Guozhang Wang
    Subjects: Databases
    Abstract

    Privacy preserving data publishing has attracted considerable research
    interest in recent years. Among the existing solutions,
    ε-differential privacy provides one of the strongest privacy
    guarantees. Existing data publishing methods that achieve
    ε-differential privacy, however, offer little data utility. In
    particular, if the output dataset is used to answer count queries, the noise in
    the query answers can be proportional to the number of tuples in the data,
    which renders the results useless.

  143. Personal Information Databases.

    Authors: Sabah S. Al-Fedaghi, Bernhard Thalheim
    Subjects: Databases
    Abstract

    One of the most important aspects of security organization is to establish a
    framework to identify security significant points where policies and procedures
    are declared. The (information) security infrastructure comprises entities,
    processes, and technology. All are participants in handling information, which
    is the item that needs to be protected. Privacy and security information
    technology is a critical and unmet need in the management of personal
    information. This paper proposes concepts and technologies for management of
    personal information.

  144. On Chase Termination Beyond Stratification.

    Authors: Michael Meier, Michael Schmidt, Georg Lausen
    Subjects: Databases
    Abstract

    We study the termination problem of the chase algorithm, a central tool in
    various database problems such as the constraint implication problem,
    Conjunctive Query optimization, rewriting queries using views, data exchange,
    and data integration. The basic idea of the chase is, given a database instance
    and a set of constraints as input, to fix constraint violations in the database
    instance. It is well-known that, for an arbitrary set of constraints, the chase
    does not necessarily terminate (in general, it is even undecidable if it does
    or not).

  145. Reducing Network Traffic in Unstructured P2P Systems Using Top-k Queries.

    Authors: Reza Akbarinia, Esther Pacitti, Patrick Valduriez
    Subjects: Databases
    Abstract

    A major problem of unstructured P2P systems is their heavy network traffic.
    This is caused mainly by high numbers of query answers, many of which are
    irrelevant for users. One solution to this problem is to use Top-k queries
    whereby the user can specify a limited number (k) of the most relevant answers.
    In this paper, we present FD, a (Fully Distributed) framework for executing
    Top-k queries in unstructured P2P systems, with the objective of reducing
    network traffic. FD consists of a family of algorithms that are simple but
    effective.

  146. Inter-Operator Feedback in Data Stream Management Systems via Punctuation.

    Authors: Rafael Fernández-Moctezuma, Kristin Tufte, Jin Li
    Subjects: Databases
    Abstract

    High-volume, high-speed data streams may overwhelm the capabilities of stream
    processing systems; techniques such as data prioritization, avoidance of
    unnecessary processing and on-demand result production may be necessary to
    reduce processing requirements. However, the dynamic nature of data streams, in
    terms of both rate and content, makes the application of such techniques
    challenging. Such techniques have been addressed in the context of static and
    centralized query optimization; however, they have not been fully addressed for
    data stream management systems.

  147. Size Bounds for Conjunctive Queries with General Functional Dependencies.

    Authors: Gregory Valiant, Paul Valiant
    Subjects: Databases
    Abstract

    This paper resolves the main open question left by Gottlob, Lee, and Valiant
    (PODS 2009) [GLV09], establishing tight worst-case bounds for the size of the
    result Q(D) of a conjunctive query Q to a database D given an arbitrary set of
    functional dependencies. We show that the lower bound presented in [GLV09] in
    which the variables of the query are "colored" so as to yield a coloring number
    C(Q) for each query Q is, in fact, also an upper bound.

  148. SocialScope: Enabling Information Discovery on Social Content Sites.

    Authors: Sihem Amer-Yahia, Laks Lakshmanan, Cong Yu
    Subjects: Databases
    Abstract

    Recently, many content sites have started encouraging their users to engage
    in social activities such as adding buddies on Yahoo! Travel and sharing
    articles with their friends on New York Times.

  149. Reordering Columns for Smaller Indexes.

    Authors: Daniel Lemire, Owen Kaser
    Subjects: Databases
    Abstract

    Column-oriented indexes (such as projection or bitmap indexes) are compressed
    by run-length encoding to reduce storage and increase speed. Sorting the tables
    improves compression. On realistic data sets, permuting the columns in the
    right order before sorting can reduce the number of runs by a factor of two or
    more. For many cases, we prove that the number of runs in table columns is
    minimized if we sort columns by increasing cardinality. Yet, maybe
    surprisingly, we must sometimes maximize the number of runs to minimize the
    index size.
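
    The effect of column order on run counts can be illustrated with a small
    sketch. It assumes a lexicographic sort and a toy two-column table; the
    helper count_runs and the data are hypothetical and do not come from the
    paper. Changing the order of the sort key stands in for permuting the
    columns before sorting.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def count_runs(rows: List[Tuple], order: Sequence[int]) -> int:
    """Sort rows lexicographically by the given column order, then return the
    number of runs of equal values, summed over all columns."""
    sorted_rows = sorted(rows, key=lambda r: tuple(r[c] for c in order))
    total = 0
    for c in range(len(rows[0])):
        # groupby yields one group per maximal run of consecutive equal values.
        total += sum(1 for _ in groupby(row[c] for row in sorted_rows))
    return total


# Toy table: column 0 has 2 distinct values, column 1 has 4.
rows = [(a, b) for a in ("x", "y") for b in range(4)]

low_cardinality_first = count_runs(rows, order=(0, 1))
high_cardinality_first = count_runs(rows, order=(1, 0))
print(low_cardinality_first, high_cardinality_first)
```

    On this toy table, sorting with the low-cardinality column first gives 10
    runs against 12 for the reverse order, in line with the
    increasing-cardinality heuristic stated above.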

  150. Teaching an Old Elephant New Tricks.

    Authors: Nicolas Bruno
    Subjects: Databases
    Abstract

    In recent years, column stores (or C-stores for short) have emerged as a
    novel approach to deal with read-mostly data warehousing applications.
    Experimental evidence suggests that, for certain types of queries, the new
    features of C-stores result in orders of magnitude improvement over traditional
    relational engines. At the same time, some C-store proponents argue that
    C-stores are fundamentally different from traditional engines, and therefore
    their benefits cannot be incorporated into a relational engine short of a
    complete rewrite.

  151. LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases.

    Authors: Xiaodan Wang, Randal Burns, Tanu Malik
    Subjects: Databases
    Abstract

    Workloads that comb through vast amounts of data are gaining importance in
    the sciences. These workloads consist of "needle in a haystack" queries that
    are long running and data intensive so that query throughput limits
    performance. To maximize throughput for data-intensive queries, we put forth
    LifeRaft: a query processing system that batches queries with overlapping data
    requirements. Rather than scheduling queries in arrival order, LifeRaft
    executes queries concurrently against an ordering of the data that maximizes
    data sharing among queries.

  152. Capturing Data Uncertainty in High-Volume Stream Processing.

    Authors: Yanlei Diao, Boduo Li, Anna Liu, Liping Peng, Charles Sutton, Thanh Tran, Michael Zink
    Subjects: Databases
    Abstract

    We present the design and development of a data stream system that captures
    data uncertainty from data collection to query processing to final result
    generation. Our system focuses on data that is naturally modeled as continuous
    random variables. For such data, our system employs an approach grounded in
    probability and statistical theory to capture data uncertainty and integrates
    this approach into high-volume stream processing. The first component of our
    system captures uncertainty of raw data streams from sensing devices.

  153. A Case for A Collaborative Query Management System.

    Authors: Nodira Khoussainova, Magda Balazinska, Wolfgang Gatterbauer, YongChul Kwon, Dan Suciu
    Subjects: Databases
    Abstract

    Over the past 40 years, database management systems (DBMSs) have evolved to
    provide a sophisticated variety of data management capabilities. At the same
    time, tools for managing queries over the data have remained relatively
    primitive. One reason for this is that queries are typically issued through
    applications. They are thus debugged once and re-used repeatedly. This mode of
    interaction, however, is changing. As scientists (and others) store and share
    increasingly large volumes of data in data centers, they need the ability to
    analyze the data by issuing exploratory queries.

  154. The Case for RodentStore, an Adaptive, Declarative Storage System.

    Authors: Philippe Cudre-Mauroux, Eugene Wu, Sam Madden
    Subjects: Databases
    Abstract

    Recent excitement in the database community surrounding new
    applications (analytic, scientific, graph, geospatial, etc.) has led to an
    explosion in research on database storage systems. New storage systems are
    vital to the database community, as they are at the heart of making database
    systems perform well in new application domains. Unfortunately, each such
    system also represents a substantial engineering effort including a great deal
    of duplication of mechanisms for features such as transactions and caching.

  155. Principles for Inconsistency.

    Authors: Shel Finkelstein, Dean Jacobs, Rainer Brendle
    Subjects: Databases
    Abstract

    Data consistency is very desirable because strong semantic properties make it
    easier to write correct programs that perform as users expect. However, there
    are good reasons why consistency may have to be weakened to achieve other
    business goals. In this CIDR 2009 Perspectives paper, we present real-world
    reasons inconsistency may be necessary, offer principles for managing
    inconsistency coherently, and describe implementation approaches we are
    investigating for sustainably scalable systems that offer comprehensible user
    experiences despite inconsistency.

  156. RIOT: I/O-Efficient Numerical Computing without SQL.

    Authors: Yi Zhang, Herodotos Herodotou, Jun Yang
    Subjects: Databases
    Abstract

    R is a numerical computing environment that is widely popular for statistical
    data analysis. Like many such environments, R performs poorly for large
    datasets whose sizes exceed that of physical memory. We present our vision of
    RIOT (R with I/O Transparency), a system that makes R programs I/O-efficient in
    a way transparent to the users. We describe our experience with RIOT-DB, an
    initial prototype that uses a relational database system as a backend.

  157. Towards Eco-friendly Database Management Systems.

    Authors: Willis Lang, Jignesh Patel
    Subjects: Databases
    Abstract

    Database management systems (DBMSs) have largely ignored the task of managing
    the energy consumed during query processing. Both economical and environmental
    factors now require that DBMSs pay close attention to energy consumption. In
    this paper we approach this issue by considering energy consumption as a
    first-class performance goal for query processing in a DBMS. We present two
    concrete techniques that can be used by a DBMS to directly manage the energy
    consumption. Both techniques trade energy consumption for performance.

  158. SCADS: Scale-Independent Storage for Social Computing Applications.

    Authors: Michael Armbrust, Armando Fox, David Patterson, Nick Lanham, Beth Trushkowsky, Jesse Trutna, Haruki Oh
    Subjects: Databases
    Abstract

    Collaborative web applications such as Facebook, Flickr and Yelp present new
    challenges for storing and querying large amounts of data. As users and
    developers are focused more on performance than single copy consistency or the
    ability to perform ad-hoc queries, there exists an opportunity for a
    highly-scalable system tailored specifically for relaxed consistency and
    pre-computed queries.

  159. Sailing the Information Ocean with Awareness of Currents: Discovery and Application of Source Dependence.

    Authors: Laure Berti-Equille, Anish Das Sarma, Dong, Amelie Marian, Divesh Srivastava
    Subjects: Databases
    Abstract

    The Web has enabled the availability of a huge amount of useful information,
    but has also eased the ability to spread false information and rumors across
    multiple sources, making it hard to distinguish between what is true and what
    is not. Recent examples include the premature Steve Jobs obituary, the second
    bankruptcy of United Airlines, the creation of Black Holes by the operation of
    the Large Hadron Collider, etc.

  160. Energy Efficiency: The New Holy Grail of Data Management Systems Research.

    Authors: Stavros Harizopoulos, Mehul Shah, Justin Meza, Parthasarathy Ranganathan
    Subjects: Databases
    Abstract

    Energy costs are quickly rising in large-scale data centers and are soon
    projected to overtake the cost of hardware. As a result, data center operators
    have recently started turning to more energy-friendly hardware. Despite
    the growing body of research in power management techniques, there has been
    little work to date on energy efficiency from a data management software
    perspective.

  161. Harnessing the Deep Web: Present and Future.

    Authors: Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, Alon Halevy
    Subjects: Databases
    Abstract

    Over the past few years, we have built a system that has exposed large
    volumes of Deep-Web content to Google.com users. The content that our system
    exposes contributes to more than 1000 search queries per second and spans over
    50 languages and hundreds of domains. The Deep Web has long been acknowledged
    to be a major source of structured data on the web, and hence accessing
    Deep-Web content has long been a problem of interest in the data management
    community. In this paper, we report on where we believe the Deep Web provides
    value and where it does not.

  162. DBMSs Should Talk Back Too.

    Authors: Alkis Simitsis, Yannis Ioannidis
    Subjects: Databases
    Abstract

    Natural language user interfaces to database systems have been studied for
    several decades now. They have mainly focused on parsing and interpreting
    natural language queries to generate them in a formal database language.

  163. The Case for a Structured Approach to Managing Unstructured Data.

    Authors: AnHai Doan, Jeff Naughton, Akanksha Baid, Xiaoyong Chai, Fei Chen, Ting Chen, Eric Chu, Pedro DeRose, Byron Gao, Chaitanya Gokhale, Jiansheng Huang, Warren Shen, Ba-Quy Vuong
    Subjects: Databases
    Abstract

    The challenge of managing unstructured data represents perhaps the largest
    data management opportunity for our community since managing relational data.
    And yet we are risking letting this opportunity go by, ceding the playing field
    to other players, ranging from communities such as AI, KDD, IR, Web, and
    Semantic Web, to industrial players such as Google, Yahoo, and Microsoft. In
    this essay we explore what we can do to improve upon this situation. Drawing on
    the lessons learned while managing relational data, we outline a structured
    approach to managing unstructured data.

  164. Unbundling Transaction Services in the Cloud.

    Authors: David Lomet, Alan Fekete, Gerhard Weikum, Mike Zwilling
    Subjects: Databases
    Abstract

    The traditional architecture for a DBMS engine has the recovery, concurrency
    control and access method code tightly bound together in a storage engine for
    records. We propose a different approach, where the storage engine is factored
    into two layers (each of which might have multiple heterogeneous instances). A
    Transactional Component (TC) works at a logical level only: it knows about
    transactions and their "logical" concurrency control and undo/redo recovery,
    but it does not know about page layout, B-trees etc. A Data Component (DC)
    knows about the physical storage structure.

  165. From Declarative Languages to Declarative Processing in Computer Games.

    Authors: Johannes Gehrke, Benjamin Sowell, Alan Demers, Nitin Gupta, Haoyuan Li, Walker White
    Subjects: Databases
    Abstract

    Recent work has shown that we can dramatically improve the performance of
    computer games and simulations through declarative processing: Character AI can
    be written in an imperative scripting language which is then compiled to
    relational algebra and executed by a special games engine with features similar
    to a main memory database system. In this paper we lay out a challenging
    research agenda built on these ideas.

  166. The Role of Schema Matching in Large Enterprises.

    Authors: Ken Smith, Michael Morse, Peter Mork, Maya Li, Arnon Rosenthal, David Allen, Len Seligman, Chris Wolf
    Subjects: Databases
    Abstract

    To date, the principal use case for schema matching research has been as a
    precursor for code generation, i.e., constructing mappings between schema
    elements with the end goal of data transfer. In this paper, we argue that
    schema matching plays valuable roles independent of mapping construction,
    especially as schemata grow to industrial scales.

  167. Visualizing the robustness of query execution.

    Authors: Goetz Graefe, Harumi Kuno, Janet Wiener
    Subjects: Databases
    Abstract

    In database query processing, actual run-time conditions (e.g., actual
    selectivities and actual available memory) very often differ from compile-time
    expectations of run-time conditions (e.g., estimated predicate selectivities
    and anticipated memory availability). Robustness of query processing can be
    defined as the ability to handle unexpected conditions. Robustness of query
    execution, specifically, can be defined as the ability to process a specific
    plan efficiently in an unexpected condition.

  168. Search Driven Analysis of Heterogenous XML Data.

    Authors: Andrey Balmin, Latha Colby, Emiran Curtmola, Quanzhong Li, Fatma Ozcan
    Subjects: Databases
    Abstract

    Analytical processing on XML repositories is usually enabled by designing
    complex data transformations that shred the documents into a common data
    warehousing schema. This can be very time-consuming and costly, especially if
    the underlying XML data has a lot of variety in structure, and only a subset of
    attributes constitutes meaningful dimensions and facts. Today, there is no tool
    to explore an XML data set, discover interesting attributes, dimensions and
    facts, and rapidly prototype an OLAP solution.

  169. Social Systems: Can we Do More Than Just Poke Friends?.

    Authors: Georgia Koutrika, Benjamin Bercovitz, Robert Ikeda, Filip Kaliszan, Henry Liou, Zahra Mohammadi Zadeh, Hector Garcia-Molina
    Subjects: Databases
    Abstract

    Social sites have become extremely popular among users but have they
    attracted equal attention from the research community? Are they good only for
    simple tasks, such as tagging and poking friends? Do they present any new or
    interesting research challenges? In this paper, we describe the insights we
    have obtained implementing CourseRank, a course evaluation and planning social
    system. We argue that more attention should be given to social sites like ours
    and that there are many challenges (though not the traditional DBMS ones) that
    should be addressed by our community.

  170. Data Management for High-Throughput Genomics.

    Authors: Uwe Roehm, Jose Blakeley
    Subjects: Databases
    Abstract

    Today's sequencing technology allows sequencing an individual genome within a
    few weeks for a fraction of the costs of the original Human Genome project.
    Genomics labs are faced with dozens of TB of data per week that have to be
    automatically processed and made available to scientists for further analysis.
    This paper explores the potential and the limitations of using relational
    database systems as the data processing platform for high-throughput genomics.
    In particular, we are interested in the storage management for high-throughput
    sequence data and in leveraging SQL and user-defined func

  171. Qunits: queried units in database search.

    Authors: Arnab Nandi, H V Jagadish
    Subjects: Databases
    Abstract

    Keyword search against structured databases has become a popular topic of
    investigation, since many users find structured queries too hard to express,
    and enjoy the freedom of a "Google-like" query box into which search terms
    can be entered. Attempts to address this problem face a fundamental dilemma.
    Database querying is based on the logic of predicate evaluation, with a
    precisely defined answer set for a given query. On the other hand, in an
    information retrieval approach, ranked query results have long been accepted as
    far superior to results based on boolean query evaluation.

  172. Remembrance: The Unbearable Sentience of Being Digital.

    Authors: Ragib Hasan, Radu Sion, Marianne Winslett
    Subjects: Databases
    Abstract

    We introduce a world vision in which data is endowed with memory. In this
    data-centric systems paradigm, data items can be enabled to retain all or some
    of their previous values. We call this ability "remembrance" and posit that it
    empowers significant leaps in the security, availability, and general
    operational dimensions of systems. With the explosion in cheap, fast memories
    and storage, large-scale remembrance will soon become practical. Here, we
    introduce and explore the advantages of such a paradigm and the challenges in
    making it a reality.

  173. Interactive Data Integration through Smart Copy & Paste.

    Authors: Zachary Ives, Craig Knoblock, Steve Minton, Marie Jacob, Partha Talukdar, Rattapoom Tuchinda, Jose Luis Ambite, Maria Muslea, Cenk Gazen
    Subjects: Databases
    Abstract

    In many scenarios, such as emergency response or ad hoc collaboration, it is
    critical to reduce the overhead in integrating data. Ideally, one could perform
    the entire process interactively under one unified interface: defining
    extractors and wrappers for sources, creating a mediated schema, and adding
    schema mappings, while seeing how these impact the integrated view of the
    data, and refining the design accordingly.

  174. Anonymization with Worst-Case Distribution-Based Background Knowledge.

    Authors: Raymond Chi-Wing Wong, Ada Wai-Chee Fu, Ke Wang, Yabo Xu, Jian Pei, Philip S. Yu
    Subjects: Databases
    Abstract

    Background knowledge is an important factor in privacy preserving data
    publishing. Distribution-based background knowledge is one well-studied kind of
    background knowledge. However, to the best of our knowledge, there is no
    existing work considering the distribution-based background knowledge in the
    worst case scenario, by which we mean that the adversary has accurate knowledge
    about the distribution of sensitive values according to some tuple attributes.
    Considering this worst case scenario is essential because we cannot overlook
    any breaching possibility.

  175. In-Network Outlier Detection in Wireless Sensor Networks.

    Authors: Joel W. Branch, Chris Giannella, Boleslaw Szymanski, Ran Wolff, Hillol Kargupta
    Subjects: Databases
    Abstract

    To address the problem of unsupervised outlier detection in wireless sensor
    networks, we develop an approach that (1) is flexible with respect to the
    outlier definition, (2) computes the result in-network to reduce both bandwidth
    and energy usage, (3) only uses single hop communication thus permitting very
    simple node failure detection and message reliability assurance mechanisms
    (e.g., carrier-sense), and (4) seamlessly accommodates dynamic updates to data.
    We examine performance using simulation with real sensor data streams.

  176. Data management in systems biology I - Overview and bibliography.

    Authors: Gerhard Mayer
    Subjects: Databases
    Abstract

    Large systems biology projects can encompass several workgroups often located
    in different countries.

  177. Plagiarism Detection in arXiv
