02 . Discovering Conditional
Functional Dependencies
Abstract:
This paper investigates the discovery of conditional functional
dependencies (CFDs). CFDs are a recent extension of functional dependencies
(FDs) by supporting patterns of semantically related constants, and can be used
as rules for cleaning relational data. However, finding CFDs is an expensive
process that involves intensive manual effort. To effectively identify data
cleaning rules, we develop techniques for discovering CFDs from sample relations.
We provide three methods for CFD discovery. The first, referred to as CFDMiner,
is based on techniques for mining closed itemsets, and is used to discover
constant CFDs, namely, CFDs with constant patterns only. The other two
algorithms are developed for discovering general CFDs. The first algorithm,
referred to as CTANE, is a levelwise algorithm that extends TANE, a well-known
algorithm for mining FDs. The other, referred to as FastCFD, is based on the
depthfirst approach used in FastFD, a method for discovering FDs. It leverages
closed-itemset mining to reduce search space. Our experimental results
demonstrate the following. (a) CFDMiner can be multiple orders of magnitude
faster than CTANE and FastCFD for constant CFD discovery. (b) CTANE works well
when a given sample relation is large, but it does not scale well with the
arity of the relation. (c) FastCFD is far more efficient than CTANE when the
arity of the relation is large.
Algorithm
Used:
Levelwise Algorithm
Existing System:
As remarked earlier, constant CFDs are
particularly important for object identification, and thus deserve a separate
treatment. One wants efficient methods to discover constant CFDs alone, without
paying the price of discovering all CFDs. Indeed, as will be seen later,
constant CFD discovery is often several orders of magnitude faster than general
CFD discovery. Levelwise algorithms may not perform well on sample relations of
large arity, given their inherent exponential complexity. More effective
methods have to be in place to deal with datasets with a large arity. A host of
techniques have been developed for (non-redundant) association rule mining, and
it is only natural to capitalize on these for CFD discovery. As we shall see,
these techniques can not only be readily used in constant CFD discovery, but
also significantly speed up general CFD discovery. To our knowledge, no
previous work has considered these issues for CFD discovery.
Proposed
System:
In light of these considerations we provide
three algorithms for CFD discovery: one for discovering constant CFDs, and the
other two for general CFDs.
(Module: 1) We propose a notion of minimal CFDs based on both the minimality
of attributes and the minimality of patterns. Intuitively, minimal CFDs contain
neither redundant attributes nor redundant patterns. Furthermore, we consider frequent CFDs that hold on a
sample dataset r, namely, CFDs in
which the pattern tuples have a support in r above a certain threshold. Frequent CFDs allow us to accommodate
unreliable data with errors and noise. Our algorithms find minimal and frequent
CFDs to help users identify quality cleaning rules from a possibly large set of
CFDs that hold on the samples.
(Module: 2) Our first algorithm, referred to as CFDMiner, is for constant CFD discovery. We explore
the connection between minimal constant CFDs and closed and free patterns.
Based on this, CFDMiner finds
constant CFDs by leveraging a latest mining technique proposed in [24], which
mines closed itemsets and
free itemsets in parallel following a depth-first search scheme.
(Module: 3) Our second algorithm, referred to as CTANE, extends TANE to discover general CFDs. It is based on an attribute-set/pattern tuple lattice, and mines CFDs at level k + 1 of the
lattice (i.e., when each set at the level consists of k+1 attributes) with pruning based on those at level k. CTANE discovers minimal CFDs only.
(Module: 4) Our third algorithm, referred to as FastCFD, discovers general CFDs by employing a
depth-first search strategy instead of the levelwise approach. It is a
nontrivial extension of FastFD mentioned
above, by mining pattern tuples. A novel pruning technique is introduced by FastCFD, by leveraging constant CFDs found by CFDMiner. As opposed to CTANE, FastCFD does not take exponential time in the arity of sample data when a
canonical cover of CFDs is not exponentially large.
No comments:
Post a Comment