Mining Class Comparisons In Data Mining

Class Comparison Methods & Implementations

Data Collection: 

  • Relevant data is collected from databases and data warehouses by query processing and partitioned into a target class and one or more contrasting classes.
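As a rough illustration of this step, the sketch below partitions a handful of rows into a target class and a contrasting class. The rows, the field names, and the helper `partition_classes` are all hypothetical; in practice the rows would come from a database query.

```python
# Hypothetical rows standing in for the task-relevant data returned by a query.
students = [
    {"name": "Ann", "status": "graduate", "gpa": 3.8},
    {"name": "Bob", "status": "undergraduate", "gpa": 3.1},
    {"name": "Cho", "status": "graduate", "gpa": 3.4},
    {"name": "Dev", "status": "undergraduate", "gpa": 2.9},
    {"name": "Eve", "status": "undergraduate", "gpa": 3.6},
]


def partition_classes(rows, class_attr, target_value, contrasting_value):
    """Split rows into (target class, contrasting class) by a class attribute."""
    target = [r for r in rows if r[class_attr] == target_value]
    contrasting = [r for r in rows if r[class_attr] == contrasting_value]
    return target, contrasting


target, contrasting = partition_classes(
    students, "status", "graduate", "undergraduate"
)
print(len(target), len(contrasting))
```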

Dimension Relevance Analysis: 

  • When there are many dimensions and an analytical comparison is required, dimension relevance analysis should be performed on these classes so that only the highly relevant dimensions are included in further analysis.
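The text does not say which relevance measure to use. The sketch below assumes information gain, one common choice, and keeps only the dimensions whose gain reaches a threshold. The sample rows, the threshold value, and the function names are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr, class_attr):
    """Reduction in class entropy obtained by splitting rows on attr."""
    base = entropy([r[class_attr] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[class_attr])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def relevant_dimensions(rows, attrs, class_attr, threshold):
    """Keep only the attributes whose information gain meets the threshold."""
    return [a for a in attrs if information_gain(rows, a, class_attr) >= threshold]


# Hypothetical data: gpa separates the classes perfectly, gender not at all.
rows = [
    {"status": "graduate", "gpa": "excellent", "gender": "F"},
    {"status": "graduate", "gpa": "excellent", "gender": "M"},
    {"status": "undergraduate", "gpa": "good", "gender": "F"},
    {"status": "undergraduate", "gpa": "good", "gender": "M"},
]
kept = relevant_dimensions(rows, ["gpa", "gender"], "status", threshold=0.1)
print(kept)
```

Other measures (for example the gain ratio or a chi-square test) could be swapped in without changing the overall shape of the step.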

Synchronous Generalization: 

  • The target class is generalized to the level controlled by a user- or expert-specified dimension threshold, which results in a prime target class relation/cuboid.

  • The concepts in the contrasting class or classes are generalized to the same level as those in the prime target class relation/cuboid, forming the prime contrasting class relation/cuboid.
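A minimal sketch of synchronous generalization, assuming two hypothetical concept hierarchies (city to country, numeric GPA to a label). The key point is that the same `generalize` function is applied to both classes, so both end up at the same abstraction level. The hierarchy contents and GPA cut-offs are made up for illustration.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> country.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Seattle": "USA"}


def gpa_level(gpa):
    """Hypothetical concept hierarchy: numeric GPA -> label."""
    if gpa >= 3.5:
        return "excellent"
    if gpa >= 3.0:
        return "good"
    return "fair"


def generalize(rows):
    """Map each row to a generalized tuple and count identical tuples."""
    return Counter(
        (CITY_TO_COUNTRY[r["birth_place"]], gpa_level(r["gpa"])) for r in rows
    )


target = [
    {"birth_place": "Vancouver", "gpa": 3.7},
    {"birth_place": "Toronto", "gpa": 3.9},
    {"birth_place": "Seattle", "gpa": 3.2},
]
contrasting = [
    {"birth_place": "Vancouver", "gpa": 3.1},
    {"birth_place": "Seattle", "gpa": 2.8},
]

# Same hierarchies and levels for both classes: that is the "synchronous" part.
prime_target = generalize(target)
prime_contrasting = generalize(contrasting)
print(prime_target)
print(prime_contrasting)
```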

Presentation of the derived comparison: 

  • The resulting class comparison description can be visualized in the form of tables, charts, and rules.

  • This presentation usually includes a “contrasting” measure (such as count%) that reflects the comparison between the target and contrasting classes.
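As a sketch of how a count% measure might be computed, the function below takes the tuple counts of one generalized class relation and expresses each count as a percentage of that class. The interpretation of count% as a within-class percentage and the sample counts are assumptions made for illustration.

```python
def count_percent(generalized_counts):
    """Express each generalized tuple's count as a percentage of its class."""
    total = sum(generalized_counts.values())
    return {tup: 100.0 * n / total for tup, n in generalized_counts.items()}


# Hypothetical generalized counts for one class.
target_counts = {("Canada", "good"): 30, ("Canada", "excellent"): 70}
target_pct = count_percent(target_counts)

for tup, pct in target_pct.items():
    print(tup, f"{pct:.1f}%")
```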


Task - Compare graduate and undergraduate students using a discriminant rule.

For this, the DMQL query would be:

use University_Database
mine comparison as “graduate_students vs_undergraduate_students”
in relevance to name, gender, program, birth_place, birth_date, residence, phone_no, GPA
for “graduate_students”
where status in “graduate”
versus “undergraduate_students”
where status in “undergraduate”
analyze count%
from student

Now from this, we can formulate that
  • attributes = name, gender, program, birth_place, birth_date, residence, phone_no, and GPA.
  • Gen(ai) = concept hierarchies on attributes ai.
  • Ui = attribute analytical thresholds for attributes ai.
  • Ti = attribute generalization thresholds for attributes ai.
  • R = attribute relevance threshold.

1. Data collection - The target class (graduate students) and the contrasting class (undergraduate students) are identified.

2. Attribute relevance analysis - Removes the weakly relevant attributes name, gender, program, and phone_no.

3. Synchronous generalization - Controlled by user-specified dimension thresholds; it produces the prime target and contrasting class(es) relations/cuboids.


4. Drill-down, roll-up, and other OLAP operations are applied to the target and contrasting classes to adjust the abstraction level of the resulting description.

[Figure: prime generalization]

5. Presentation - The result is presented as generalized relations, crosstabs, bar charts, pie charts, or rules, together with a contrasting measure (e.g., count%) that reflects the comparison between the target and contrasting classes.


[Figure: generalized relation]


[Figure: crosstab presentation]
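The roll-up mentioned in step 4 can be sketched as dropping one dimension from the generalized tuples and re-aggregating the counts. This is only an illustration: the tuple layout (country, GPA level), the counts, and the helper `roll_up` are hypothetical.

```python
from collections import Counter


def roll_up(generalized_counts, drop_index):
    """Remove the dimension at drop_index and merge tuples that become equal."""
    rolled = Counter()
    for tup, n in generalized_counts.items():
        key = tup[:drop_index] + tup[drop_index + 1:]
        rolled[key] += n
    return rolled


# Hypothetical prime relation with dimensions (country, gpa_level).
prime = Counter({
    ("Canada", "good"): 3,
    ("Canada", "excellent"): 2,
    ("USA", "good"): 1,
})

# Roll up by removing the gpa_level dimension (index 1).
by_country = roll_up(prime, drop_index=1)
print(by_country)
```

Applying the same roll-up to both the target and contrasting relations keeps the two classes comparable at the new level.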

Quantitative Discriminant Rules

The features that distinguish the target class from the contrasting classes can be described as a discriminant rule.

A quantitative discriminant rule associates an interestingness measure, the d-weight, with each generalized tuple.

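  • Cj - the target class
  • Qa - a generalized tuple that covers some tuples of the target class but may also cover some tuples of the contrasting classes
  • d-weight - ranges over [0, 1]

$$ d\text{-weight} = \frac{\mathrm{count}(Q_a \in C_j)}{\sum_{i=1}^{m} \mathrm{count}(Q_a \in C_i)} $$

where C1, ..., Cm are the target and contrasting classes, so the denominator is the total number of tuples covered by Qa across all m classes.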
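A small sketch of the d-weight computation for one generalized tuple, assuming its per-class counts are already available as a dictionary. The class names and counts below are hypothetical.

```python
def d_weight(counts_by_class, target_class):
    """Fraction of the tuples covered by a generalized tuple that fall in target_class.

    counts_by_class maps each class (target and contrasting) to the number of
    its tuples covered by the generalized tuple.
    """
    total = sum(counts_by_class.values())
    if total == 0:
        raise ValueError("the generalized tuple covers no tuples in any class")
    return counts_by_class[target_class] / total


# Hypothetical counts for one generalized tuple across three classes.
counts = {"target": 40, "contrast_1": 40, "contrast_2": 20}
weight = d_weight(counts, "target")
print(weight)
```

A d-weight close to 1 means the generalized tuple mostly describes the target class; a value close to 0 means it mostly describes the contrasting classes.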


In the above example, suppose that the generalized tuple major = “science” and age_range = “20..25” and GPA = “good” covers 90 graduate tuples and 210 undergraduate tuples.

The d-weight would be 90/(90+210) = 30% with respect to the target class and 210/(90+210) = 70% with respect to the contrasting class.

That is, if a student majors in science, falls in this age range, and has a good GPA, then based on the data there is a 30% probability that she is a graduate student versus a 70% probability that she is an undergraduate student. The d-weights for the other tuples can be derived similarly.
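Written in the usual logic-style notation for quantitative discriminant rules, the example tuple above could be expressed roughly as follows. The predicate names are assumptions derived from the attribute names in the query, and the rule is a sketch rather than output from any particular system.

```latex
\forall X,\ \text{graduate\_student}(X) \Leftarrow
  \text{major}(X) = \text{``science''}
  \land \text{age\_range}(X) = \text{``20..25''}
  \land \text{GPA}(X) = \text{``good''}
  \quad [d : 30\%]
```

The arrow points toward the class because the rule states a sufficient condition: a student who satisfies the right-hand side belongs to the target class with the stated d-weight.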

