Extended XML Tree
Pattern Matching: Theories and Algorithms
ABSTRACT
As business and enterprises generate and exchange XML data
more often, there is an increasing need for efficient processing of queries on
XML data. Searching for the occurrences of a tree pattern query in an XML
database is a core operation in XML query processing. Prior works demonstrate
that holistic twig pattern matching algorithm is an efficient technique to
answer an XML tree pattern with parent-child (P-C) and ancestor-descendant
(A-D) relationships, as it can effectively control the size of intermediate
results during query processing. However, XML query languages (e.g. XPath,
XQuery) define more axes and functions such as negation function, order-based
axis and wildcards.Here we research a large set of XML tree pattern, called extended
XML tree pattern, which may include P-C, A-D relationships, negation
functions, wildcards and order restriction. We establish a theoretical
framework about “matching cross” which demonstrates the intrinsic reason
in the proof of optimality on holistic algorithms. Based on our theorems, we
propose a set of novel algorithms to efficiently process three categories of
extended XML tree patterns. A set of experimental results on both real-life and
synthetic data sets demonstrate the effectiveness and efficiency of our
proposed theories and algorithms.
Existing
System
Previous
algorithms focus on XML tree pattern queries with only P-C and A-D
relationships. Little work has been done on XML tree queries which may contain
wildcards, negation function and order restriction, all of which are frequently
used in XML query languages such as XPath and XQuery. In this article, we call
an XML tree pattern with negation function, wildcards and/or order restriction
as extended XML tree pattern. Previous XML tree pattern matching
algorithms do not fully exploit the “optimality” of holistic algorithms.
Proposed
System
We build a theoretical
framework on optimal processing of XML tree pattern queries. We show that “matching cross” is the key reason to result in the sub-optimality of holistic
algorithms. Intuitively, matching cross describes a dilemma where holistic algorithms have to decide whether to output useless intermediate result or to miss useful results. The fact that TwigStack is optimal for queries with only A-D relationships can be
explained that no matching cross can be found for any XML document with respect
to queries with A-D edges. We classify matching cross to bound and unbounded matching cross (BMC and UMC). We develop theorems to show that only part of UMC
(i.e. UMC with mediator) can force holistic algorithms to potentially output
useless intermediate results. Based on the theoretical analysis, we develop a series of holistic
algorithms TreeMatch to achieve a large optimal query class for
Q/,//,*. Our main technique is to use a concise encoding to present matching
results, which leads to the reduction of useless intermediate results. We conducted an extensive
set of experiment on synthetic and real data set for performance comparison. We
compared TreeMatch with previous four holistic XML tree pattern
matching algorithms. The experimental results show that our algorithm can
correctly process extended XML tree patterns, achieving performance speedup for
tested queries and data sets, even in their restricted focus. The improvement
mainly owes to the reduction of the size of intermediate results.
Modules:
1. Optimality of holistic algorithm:
Previous XML tree pattern
matching algorithms do not fully exploit the “optimality” of holistic
algorithms. TwigStack guarantees that there is no useless intermediate result
for queries with only Ancestor-Descendant (A-D) relationships.
Therefore, TwigStack is optimal for queries with only A-D edges. Another
algorithm TwigStackList enlarges the
optimal query class of TwigStack by including Parent-Child(P-C) relationships
in non-branching edges. A natural question is whether the optimal query class
of TwigStackList can be further improved. Hence, the current open problems
include (1) how to identify a larger query class which can be processed
optimally and (2) how to efficiently answer a query which cannot be guaranteed
to process optimally. This explores the
challenges and shows the promise of a novel
theoretical framework called “matching cross” to identify a large
optimal query class for posing extended XML tree queries.
No comments:
Post a Comment