Analytical Characterization In Data Mining - Attribute Relevance Analysis


Analytical Characterization

Consider the following situation:

"What if we are not sure which attributes to include for class characterization and class comparison? We may end up specifying too many attributes, which could slow down the system considerably."

To handle this situation, we perform analytical characterization.

It relies on attribute relevance analysis, whose measures help identify irrelevant or weakly relevant attributes that can be excluded from the concept description process.

The incorporation of this processing step into class characterization or comparison is referred to as analytical characterization or analytical comparison.

Why Analytical Characterization

It is needed because of two limitations of OLAP tools.

The first limitation is the handling of complex objects.

The second limitation is the lack of an automated generalization process: the user must explicitly tell the system which dimensions to include in the class characterization and how high a level each dimension should be generalized to.

In fact, each step of generalization or specialization on any dimension must be specified by the user.

Usually, it is not difficult for a user to tell a data mining system how high a level each dimension should be generalized to.

For example, users can set attribute generalization thresholds for this, or specify which level a given dimension should reach, such as with the command “generalize dimension location to the country level”.

Even without explicit user instructions, the data mining system can set a default threshold such as 2 to 8, which allows each dimension to be generalized to a level that contains only 2 to 8 distinct values.
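As a rough illustration of how such a threshold could drive generalization, the sketch below climbs a concept hierarchy until the number of distinct values fits the threshold. The function name, the dictionary-based hierarchy, and the city data are assumptions made for this example; they do not come from any particular data mining system.

```python
def generalize_to_threshold(values, hierarchy, threshold=8):
    """Climb a concept hierarchy until at most `threshold` distinct values remain.

    hierarchy is a list of dicts; each dict maps a value to its parent
    one level up (hypothetical representation chosen for this sketch).
    Returns the generalized values and the number of levels climbed.
    """
    current = list(values)
    levels_climbed = 0
    for parent_of in hierarchy:
        if len(set(current)) <= threshold:
            break  # already general enough
        current = [parent_of[v] for v in current]
        levels_climbed += 1
    return current, levels_climbed


# Made-up data: generalize dimension location from city to country.
cities = ["Vancouver", "Toronto", "Seattle", "Chicago", "Boston"]
city_to_country = {
    "Vancouver": "Canada", "Toronto": "Canada",
    "Seattle": "USA", "Chicago": "USA", "Boston": "USA",
}
generalized, climbed = generalize_to_threshold(cities, [city_to_country], threshold=2)
print(generalized, climbed)
```

With a threshold of 2, the five cities are replaced by two countries after one step up the hierarchy; with the default of 8, the cities would be left as they are.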


On the other hand, it is harder for a user to decide which attributes to include. A user may include too few attributes in the analysis, causing incomplete mining results, or may introduce too many attributes, for example by specifying “in relevance to *”, which includes all attributes.

Methods should be introduced to perform attribute relevance analysis to filter out statistically irrelevant or weakly relevant attributes.

The class characterization that includes the analysis of attribute/dimension relevance is called analytical characterization.

The class comparison that includes such analysis is called analytical comparison.

Attribute Relevance Analysis

1. Data Collection: 

  • Collect the data for both the target class and the contrasting class by query processing.

2. Preliminary relevance analysis using conservative Attribute Oriented Induction (AOI):

  • This step identifies a set of dimensions and attributes on which the selected relevance measure is to be applied.
  • The relation obtained by such an application of Attribute Oriented Induction is called the candidate relation of the mining task.

3. Remove irrelevant and weakly relevant attributes using the selected relevance analysis: 

  • We evaluate each attribute in the candidate relation using the selected relevance analysis measure.
  • This step results in an initial target class working relation and initial contrasting class working relation.

4. Generate the concept description using AOI:

  • Perform the Attribute Oriented Induction process using a less conservative set of attribute generalization thresholds.
  • If the descriptive mining task is class characterization, only the initial target class working relation (ITCWR) is included.
  • If the task is class comparison, both the ITCWR and the initial contrasting class working relation (ICCWR) are included.
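Steps 3 and 4 can be sketched as a filter followed by a choice of working relations. This is only a minimal illustration: the function names, the relevance scores, and the threshold value are placeholders invented for the example, not results from a real data set.

```python
def filter_attributes(relevance, threshold):
    """Keep attributes whose relevance score reaches the threshold,
    ordered from most to least relevant."""
    ranked = sorted(relevance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]


def working_relations(task):
    """Return which working relations feed the final AOI step."""
    if task == "characterization":
        return ["ITCWR"]
    if task == "comparison":
        return ["ITCWR", "ICCWR"]
    raise ValueError(f"unknown descriptive mining task: {task}")


# Placeholder scores, made up purely for illustration.
scores = {"attr_a": 0.02, "attr_b": 0.30, "attr_c": 0.50}
kept = filter_attributes(scores, threshold=0.1)
print(kept)                             # ['attr_c', 'attr_b']
print(working_relations("comparison"))  # ['ITCWR', 'ICCWR']
```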

Relevance Measures  

A quantitative relevance measure determines the classifying power of an attribute within a set of data.

Some quantitative relevance measures are:
  • Information Gain (ID3)
  • Gain Ratio (C4.5)
  • Gini Index
  • Chi-square (χ²) contingency table statistics
  • Uncertainty Coefficient
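To give a feel for how one of these measures behaves, here is a small sketch of the Gini index computed from class counts, assuming the usual definition of 1 minus the sum of squared class proportions. The function name and the sample counts are chosen for illustration only.

```python
def gini_index(class_counts):
    """Gini index of a set described by its per-class tuple counts."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


pure = gini_index([10, 0])   # one class only, no impurity
mixed = gini_index([5, 5])   # evenly split between two classes
print(pure, mixed)           # 0.0 0.5
```

A value of 0 means the set is pure, and for two classes the value peaks at 0.5 when the classes are evenly mixed. An attribute whose partitions have low impurity separates the classes well.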

Entropy & Information Gain

Let S be a set of s tuples, where S contains si tuples of class Ci for i = 1, …, m.

The expected information required to classify an arbitrary tuple is:

I(s1, s2, …, sm) = - Σ (i = 1..m) (si / s) log2(si / s)

An attribute A with values {a1, a2, …, av} can be used to partition S into the subsets {S1, S2, …, Sv}, where Sj contains sij samples of class Ci.

The entropy of A, that is, the expected information based on this partitioning, is:

E(A) = Σ (j = 1..v) ((s1j + … + smj) / s) × I(s1j, …, smj)

The information gained by branching on attribute A is:

Gain(A) = I(s1, s2, …, sm) - E(A)
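The gain computation can be sketched in a few lines of Python, assuming the standard definitions of expected information and entropy. The function names and the representation of a partition as a list of per-class counts are choices made for this sketch, and the sample counts are invented.

```python
from math import log2


def expected_info(class_counts):
    """I(s1, ..., sm): information needed to classify an arbitrary tuple."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)


def attribute_entropy(partitions):
    """E(A): weighted information of the subsets induced by attribute A.

    partitions[j][i] is sij, the number of class-Ci tuples with A = aj.
    """
    total = sum(sum(subset) for subset in partitions)
    return sum((sum(subset) / total) * expected_info(subset) for subset in partitions)


def information_gain(partitions):
    """Gain(A) = I(s1, ..., sm) - E(A)."""
    class_totals = [sum(column) for column in zip(*partitions)]
    return expected_info(class_totals) - attribute_entropy(partitions)


# Invented counts: two attribute values, two classes.
perfect_split = information_gain([[4, 0], [0, 4]])  # attribute separates the classes
useless_split = information_gain([[2, 2], [2, 2]])  # attribute tells us nothing
print(perfect_split, useless_split)
```

An attribute that separates the classes completely earns the full 1 bit of gain in this two-class case, while an attribute whose subsets mirror the overall class mix earns none.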
 

Example: Analytical Characterization

Task

  • To mine general characteristics describing graduate students using analytical characterization.

Given

  • Attributes name, gender, major, birth_place, birth_date, phone#, and GPA.
  • Gen(ai) = concept hierarchies on ai.
  • Ui = attribute analytical thresholds for ai.
  • Ti = attribute generalization thresholds for ai.
  • R = attribute relevance threshold.

1. Data collection

  • Target Class: graduate student
  • Contrasting Class: undergraduate student


2. Analytical generalization using Ui

  • Attribute removal: remove name and phone#.
  • Attribute generalization: generalize major, birth_place, birth_date, and GPA, and accumulate counts.
  • Candidate relation (using a large attribute generalization threshold): gender, major, birth_country, age_range, and GPA.

Candidate relation for the target class, graduate students (total count = 120).

Candidate relation for the contrasting class, undergraduate students (total count = 130).


3. Relevance analysis

  • Calculate the expected information required to classify an arbitrary tuple.
  • Similarly, calculate the entropy of each attribute, for example major.
  • Then calculate the information gain for each attribute.
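The first of these calculations needs only the two class totals given in this example, 120 graduate and 130 undergraduate students. A minimal sketch, with variable names chosen for illustration:

```python
from math import log2

target_count = 120       # graduate students (target class)
contrasting_count = 130  # undergraduate students (contrasting class)
total = target_count + contrasting_count

# I(s1, s2) = -(s1/s) log2(s1/s) - (s2/s) log2(s2/s)
expected_info = -sum(
    (count / total) * log2(count / total)
    for count in (target_count, contrasting_count)
)
print(round(expected_info, 4))  # 0.9988
```

The value is close to 1 bit because the two classes are almost evenly sized. The entropy and gain of an individual attribute such as major would additionally require the per-value counts from the candidate relations.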

Summary

The class characterization that includes the analysis of attribute/dimension relevance is called analytical characterization.

The class comparison that includes such analysis is called analytical comparison.

