Decision Trees for Uncertain Data

ABSTRACT:
Traditional decision tree classifiers work with data whose values are known and precise. We extend such classifiers to handle data with uncertain information. Value uncertainty arises in many applications during data collection; example sources include measurement and quantization errors, data staleness, and multiple repeated measurements. Under uncertainty, the value of a data item is represented not by a single value but by multiple values forming a probability distribution. Rather than abstracting uncertain data by statistical derivatives such as the mean or median, we find that the accuracy of a decision tree classifier can be improved substantially if the "complete information" of a data item, i.e., its probability density function (pdf), is used. We extend classical decision tree building algorithms to handle data tuples with uncertain values. Extensive experiments show that the resulting classifiers are more accurate than those built from value averages. Since processing pdfs is computationally more costly than processing single values (e.g., averages), decision tree construction on uncertain data is more CPU-demanding than on certain data. To tackle this problem, we propose a series of pruning techniques that greatly improve construction efficiency.
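As a concrete illustration of this representation (a sketch, not code from the paper), an uncertain numerical attribute can be modeled as a discrete pdf, here built from repeated measurements, one of the uncertainty sources mentioned above. The Averaging approach would discard the pdf and keep only a summary statistic such as its mean:

```python
# Sketch (illustrative, not the paper's data model): an uncertain attribute
# value represented as a discrete pdf -- a sorted list of (value, probability)
# pairs -- derived from repeated measurements of the same quantity.
from collections import Counter

def pdf_from_measurements(measurements):
    """Turn repeated measurements of one attribute into a discrete pdf."""
    counts = Counter(measurements)
    n = len(measurements)
    return sorted((v, c / n) for v, c in counts.items())

def mean_of_pdf(pdf):
    """The summary statistic the Averaging approach would keep."""
    return sum(v * p for v, p in pdf)

readings = [4.9, 5.1, 5.0, 5.1, 4.9, 5.0, 5.0, 5.1]
pdf = pdf_from_measurements(readings)
print(pdf)               # discrete pdf over the observed values
print(mean_of_pdf(pdf))  # single value the Averaging approach retains
```

The Distribution-based approach would carry the full `pdf` list into tree construction instead of collapsing it to `mean_of_pdf(pdf)`.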
EXISTING SYSTEM:

In traditional decision-tree classification, a feature (an attribute) of a tuple is either categorical or numerical. For the latter, a precise and definite point value is usually assumed. In many applications, however, data uncertainty is common: the value of a feature/attribute is best captured not by a single point value but by a range of values giving rise to a probability distribution. Although previous techniques can improve the efficiency of mean-based processing, they neither consider the spatial relationship among cluster representatives nor exploit the proximity between groups of uncertain objects to prune in batch. A simple way to handle data uncertainty is to abstract probability distributions by summary statistics such as means and variances; we call this approach Averaging. Another approach is to use the complete information carried by the probability distributions to build the decision tree; we call this approach Distribution-based.
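The difference between the two approaches can be sketched as follows (illustrative Python, assuming each uncertain value is a discrete pdf given as (value, probability) pairs): when partitioning tuples at a split point, Averaging routes each tuple wholly to one side based on its mean, whereas the Distribution-based approach divides the tuple's probability mass between the two children.

```python
# Sketch contrasting the two approaches named above. Each uncertain value is
# assumed to be a discrete pdf: a list of (value, probability) pairs.

def averaging_split(tuples, split):
    """Each tuple goes entirely left or right based on its expected value."""
    left = right = 0.0
    for pdf in tuples:
        mean = sum(v * p for v, p in pdf)
        if mean <= split:
            left += 1
        else:
            right += 1
    return left, right

def distribution_split(tuples, split):
    """Each tuple contributes fractional weight to both children."""
    left = right = 0.0
    for pdf in tuples:
        for v, p in pdf:
            if v <= split:
                left += p
            else:
                right += p
    return left, right

# A tuple whose probability mass straddles the split point x = 0:
t = [(-1.0, 0.4), (1.0, 0.6)]
print(averaging_split([t], 0.0))     # mean is 0.2, so the whole tuple goes right
print(distribution_split([t], 0.0))  # 0.4 of the mass goes left, 0.6 right
```

The fractional counts produced by `distribution_split` are what a Distribution-based algorithm would feed into its impurity (e.g., entropy) computation at each candidate split.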
PROPOSED SYSTEM:
We study the problem
of constructing decision tree classifiers on data with uncertain numerical
attributes. Our goals are (1) to devise an algorithm for building decision
trees from uncertain data using the Distribution-based approach; (2) to
investigate whether the Distribution-based approach could lead to a higher
classification accuracy compared with the Averaging approach; and (3) to
establish a theoretical foundation on which pruning techniques are derived that
can significantly improve the computational efficiency of the
Distribution-based algorithms.
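The paper's specific pruning techniques are not reproduced here, but the general flavor of interval-based pruning can be sketched: if a lower bound on the impurity achievable by any split point inside an interval of candidates can be computed cheaply, the entire interval can be skipped once that bound is no better than the best split already found. The callables `entropy_at` and `lower_bound` below are hypothetical placeholders, not the paper's actual bounds.

```python
# Illustrative sketch only: generic interval pruning when searching for the
# minimum-entropy split point. `intervals` is a list of (lo, hi, candidates)
# triples; `lower_bound(lo, hi)` must never exceed the true entropy of any
# candidate inside [lo, hi] (an admissible bound).

def best_split(intervals, entropy_at, lower_bound):
    """Return (best_entropy, best_split_point), pruning whole intervals."""
    best_h, best_x = float("inf"), None
    for lo, hi, candidates in intervals:
        if lower_bound(lo, hi) >= best_h:
            continue  # no candidate in [lo, hi] can beat the best seen: prune
        for x in candidates:
            h = entropy_at(x)
            if h < best_h:
                best_h, best_x = h, x
    return best_h, best_x

# Toy usage with a made-up "entropy" curve (x - 3)^2 and its exact interval
# minimum as the admissible bound:
ent = lambda x: (x - 3) ** 2
lb = lambda lo, hi: 0.0 if lo <= 3 <= hi else min((lo - 3) ** 2, (hi - 3) ** 2)
print(best_split([(0, 2, [0, 1, 2]), (2, 4, [2.5, 3, 3.5]), (4, 6, [4, 5])],
                 ent, lb))  # the last interval is pruned without evaluation
```

The savings come from the pruned intervals: each skipped candidate avoids one full entropy evaluation over the fractional tuple counts, which is the dominant cost when tuples carry pdfs.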