Data Generalization In Data Mining - Summarization Based Characterization


From a data analysis point of view, data mining can be classified into two categories: descriptive mining and predictive mining.

Descriptive mining: It describes the data set in a concise and summative manner and presents interesting general properties of data.

Predictive mining: It analyzes the data to construct one or a set of models, and attempts to predict the behavior of new data sets.

Databases usually store a large amount of data in great detail. However, users often like to view sets of summarized data in concise, descriptive terms. 

Such data descriptions may provide an overall picture of a class of data or distinguish it from a set of comparative classes. 

Such descriptive data mining is called concept description and forms an important component of data mining.

What Is Concept Description

The simplest kind of descriptive data mining is called concept description. A concept usually refers to a collection of data such as frequent_buyers, graduate_students and so on.

As a data mining task, concept description is not a simple enumeration of the data. Instead, it generates descriptions for characterization and comparison of the data.

It is sometimes called class description when the concept to be described refers to a class of objects. Concept description covers two tasks:
  • Characterization: It provides a concise and succinct summarization of the given collection of data.
  • Comparison: It provides descriptions comparing two or more collections of data.

Data Generalization & Summarization 

Data and objects in databases contain detailed information at the primitive concept level.
For example, the item relation in a sales database may contain attributes describing low-level item information such as item_ID, name, brand, category, supplier, place_made and price.

It is useful to be able to summarize a large set of data and present it at a high conceptual level.

For example, summarizing a large set of items relating to Christmas season sales provides a general description of such data, which can be very helpful for sales and marketing managers.

This requires an important functionality called data generalization.

Data Generalization 

Data generalization is a process that abstracts a large set of task-relevant data in a database from a low conceptual level to higher ones.

It summarizes the general features of objects in a target class and produces what are called characteristic rules.

The data relevant to a user-specified class are normally retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction.

For example, one may want to characterize the "OurVideoStore" customers who regularly rent more than 30 movies a year. With concept hierarchies on the attributes describing the target class, the attribute-oriented induction method can be used, for example, to carry out data summarization. 
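The attribute-oriented induction idea above can be sketched in a few lines of Python. This is a minimal illustration, not the full algorithm: the customer tuples, the city-to-country hierarchy and the age ranges are all made up for the example. Each attribute is replaced by a higher-level concept, and identical generalized tuples are merged while a count is accumulated.

```python
from collections import Counter

# Hypothetical concept hierarchy (not from the article): maps a
# low-level city value to a higher-level country concept.
CITY_TO_COUNTRY = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Seattle": "USA",
    "Chicago": "USA",
}


def age_to_range(age):
    """Generalize a numeric age into a coarse, assumed age range."""
    if age < 20:
        return "under_20"
    if age < 40:
        return "20_39"
    return "40_plus"


def generalize(customers):
    """Climb each attribute one level up its hierarchy and merge
    identical generalized tuples, accumulating a count for each."""
    counts = Counter()
    for _name, city, age in customers:
        # The name attribute is dropped: it has many distinct values
        # and no hierarchy to generalize it.
        counts[(CITY_TO_COUNTRY[city], age_to_range(age))] += 1
    return counts


# Hypothetical task-relevant data: (name, city, age) of frequent renters.
customers = [
    ("Ann", "Vancouver", 23),
    ("Bob", "Toronto", 31),
    ("Cid", "Seattle", 45),
    ("Dee", "Chicago", 52),
    ("Eve", "Vancouver", 19),
]

generalized_relation = generalize(customers)
for (country, age_range), count in sorted(generalized_relation.items()):
    print(country, age_range, count)
```

Five detailed tuples collapse into three generalized ones, which is the kind of compact description a characterization task is after.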

Note that with a data cube containing a summarization of data, simple OLAP operations fit the purpose of data characterization.

Approaches:
  • Data cube approach (OLAP approach).
  • Attribute-oriented induction approach.

Presentation Of Generalized Results

Generalized Relation:
  • Relations where some or all attributes are generalized, with counts or other aggregation values accumulated.

Cross-Tabulation:
  • Mapping results into cross-tabulation form (similar to contingency tables). 

Visualization Techniques:
  • Pie charts, bar charts, curves, cubes, and other visual forms.

Quantitative Characteristic Rules:
  • Mapping generalized results into characteristic rules with associated quantitative information.
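As a rough sketch of two of these presentation forms, the Python below turns a small generalized relation into a cross-tabulation and into the parts of a quantitative characteristic rule. The relation and its counts are hypothetical. The article does not name the quantitative measure, so the code assumes the t-weight, a common choice: the share of the target class covered by each generalized tuple.

```python
from collections import defaultdict

# Hypothetical generalized relation: (country, age_range) -> count.
generalized_relation = {
    ("Canada", "20_39"): 2,
    ("Canada", "under_20"): 1,
    ("USA", "40_plus"): 2,
}


def cross_tab(relation):
    """Arrange counts as rows (country) by columns (age_range)."""
    table = defaultdict(dict)
    for (country, age_range), count in relation.items():
        table[country][age_range] = count
    return {row: dict(cols) for row, cols in table.items()}


def t_weights(relation):
    """Assumed measure: fraction of the target class covered by
    each generalized tuple."""
    total = sum(relation.values())
    return {key: count / total for key, count in relation.items()}


table = cross_tab(generalized_relation)
weights = t_weights(generalized_relation)

for (country, age_range), weight in sorted(weights.items()):
    # Each printed line is one disjunct of a quantitative characteristic rule.
    print(f"country={country} AND age={age_range} [t: {weight:.0%}]")
```

The weights sum to 1 across the generalized tuples, so each one can be read as the proportion of the target class that a given description covers.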
 

Data Cube Approach

This approach performs the aggregate computations and stores the results in data cubes.

Strengths:
  • An efficient implementation of data generalization.
  • Computation of various kinds of measures, e.g., count(), sum(), average(), max().
  • Generalization and specialization can be performed on a data cube by roll-up and drill-down.

Limitations:
  • It handles only dimensions of simple non-numeric data and measures of simple aggregated numeric values.
  • It lacks intelligent analysis: it cannot tell which dimensions should be used or what level the generalization should reach.
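The roll-up and drill-down operations mentioned above can be sketched with plain dictionaries standing in for cube cells. Everything here is an assumption for illustration: the sales figures, the (city, quarter) dimensions and the city-to-country hierarchy are invented, and a real OLAP engine would precompute and index these aggregates rather than recompute them on demand.

```python
from collections import defaultdict

# Hypothetical base cuboid: (city, quarter) -> total sales.
base_cells = {
    ("Vancouver", "Q1"): 100,
    ("Vancouver", "Q2"): 150,
    ("Toronto", "Q1"): 200,
    ("Seattle", "Q1"): 120,
    ("Seattle", "Q2"): 80,
}

# Hypothetical concept hierarchy on the location dimension.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Seattle": "USA"}


def roll_up_city(cells, hierarchy):
    """Roll up the location dimension from city to country by summing."""
    rolled = defaultdict(int)
    for (city, quarter), sales in cells.items():
        rolled[(hierarchy[city], quarter)] += sales
    return dict(rolled)


def drill_down(cells, hierarchy, country):
    """Return the city-level cells that lie beneath one country."""
    return {key: sales for key, sales in cells.items()
            if hierarchy[key[0]] == country}


by_country = roll_up_city(base_cells, CITY_TO_COUNTRY)
canada_detail = drill_down(base_cells, CITY_TO_COUNTRY, "Canada")

for key in sorted(by_country):
    print(key, by_country[key])
```

Note that the user still has to choose the dimension and the level to roll up to, which is the limitation listed above.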

Summary 

Data generalization is the process that abstracts a large set of task-relevant data in a database from a low conceptual level to higher ones.

It summarizes the general features of objects in a target class and produces what are called characteristic rules.
