Classification In Data Mining - Various Methods In Classification


Classification In Data Mining



We know that real-world application databases are rich with hidden information that can be used for making intelligent business decisions.

Classification is a data analysis method that can be used to extract models describing important data classes or to predict future data trends and patterns.


Classification is a data mining technique that predicts categorical (discrete) class labels, while prediction models continuous-valued functions (i.e., it estimates numeric values).


For example, a classification model may be built to categorize credit card transactions as either legitimate or fraudulent, while a prediction model may be built to estimate how much potential customers will spend on furniture given their income and occupation.
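
To make this distinction concrete, here is a minimal sketch in Python using scikit-learn; the transactions, incomes, and attribute names below are invented purely for illustration:

```python
# Minimal sketch contrasting classification (categorical output) with
# prediction/regression (continuous output). All data here is made up.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: label each transaction as legitimate (0) or fraudulent (1).
X_txn = [[120.0, 1], [5000.0, 0], [35.5, 1], [9800.0, 0]]  # [amount, card_present]
y_txn = [0, 1, 0, 1]                                        # categorical class labels
clf = DecisionTreeClassifier().fit(X_txn, y_txn)
print(clf.predict([[7500.0, 0]]))   # -> a class label, e.g. [1]

# Prediction: estimate expenditure (a continuous value) from income.
X_cust = [[30000], [45000], [60000], [90000]]               # [income]
y_cust = [800.0, 1200.0, 1500.0, 2600.0]                    # continuous target
reg = LinearRegression().fit(X_cust, y_cust)
print(reg.predict([[75000]]))       # -> a continuous value, not a class label
```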

Many classification and prediction methods have been proposed by researchers in machine learning, expert systems, statistics, and neurobiology. 


In this article, we will discuss classification only briefly.


Classification

In the first step, a model is built describing a predetermined set of data classes or concepts.

The model is constructed by analyzing database records described by attributes (columns).

Each tuple or record is assumed to belong to a predefined class as determined by one of the attributes, called the class label attribute. 


In the context of classification, data tuples or records are also referred to as samples, examples, or objects.


The data records or tuples analyzed to build the model collectively form the training data set.


The individual tuples or records making up the training set are referred to as training samples and are randomly selected from the sample population.


Since the class label (a categorical attribute) of each training sample is provided, this step is also known as supervised learning (i.e., the learning of the model is “supervised” in that it is told to which class each training sample belongs).

It contrasts with unsupervised learning (or clustering), in which the class label of each training sample is unknown, and the number or set of classes to be learned may not be known in advance.
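
A minimal sketch of this contrast, again with invented toy data: the classifier is handed labels, while the clustering algorithm must discover groups on its own.

```python
# Sketch of the supervised/unsupervised distinction on toy 2-D points.
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [8, 8], [9, 8]]

# Supervised: a class label accompanies every training sample.
y = ["low", "low", "high", "high"]
model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[2, 1]]))               # -> a predefined class, e.g. ['low']

# Unsupervised (clustering): no labels; the algorithm finds groups itself.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(clusters)                              # e.g. [0 0 1 1] -- group ids, not classes
```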


Typically, the learned model is represented in the form of classification rules, decision trees, or statistical or mathematical formulae.


For example, given a database of customer credit information, classification rules can be learned to identify customers having either excellent or fair credit ratings.


The rules can be used to categorize future data samples, as well as provide a better understanding of the database contents.
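
As a rough sketch of what such learned rules look like, the snippet below trains a shallow decision tree on invented customer-credit data and prints it as nested if-then rules (the attributes and values are assumptions for illustration only):

```python
# Learning human-readable classification rules from toy credit data.
from sklearn.tree import DecisionTreeClassifier, export_text

# [income, years_employed] -> credit rating
X = [[25000, 1], [48000, 3], [75000, 10], [92000, 8], [30000, 2], [68000, 6]]
y = ["fair", "fair", "excellent", "excellent", "fair", "excellent"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Render the learned model as if-then rules over the attributes.
print(export_text(tree, feature_names=["income", "years_employed"]))
```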


Model Construction: 



[Figure: Model construction]

In the second step, the model is used for classification. 

First, the predictive accuracy of the model (or classifier) is estimated. 


The "Holdout Method" is a simple technique that uses a test set of class-labeled samples.


These samples are randomly selected and are independent of the training samples.


The accuracy of the model on a given test set is the percentage of test set samples that are correctly classified by the model.


For each test sample, the known class label is compared with the learned model’s class prediction for that sample. 


If the accuracy of the model is considered acceptable, the model can be used to classify future data tuples or objects for which the class label is not known.
(Such data are also referred to in the machine learning literature as “unknown” or “previously unseen” data.)
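
A minimal sketch of the holdout method and the accuracy computation described above, using scikit-learn on synthetic data:

```python
# Holdout method: split class-labeled data into independent training and
# test sets, then measure accuracy on the held-out samples.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=5, random_state=42)

# Hold out 30% of the labeled samples as an independent test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Accuracy = percentage of test samples whose predicted label
# matches the known label.
y_pred = clf.predict(X_test)
print(f"holdout accuracy: {accuracy_score(y_test, y_pred):.2%}")
```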


Model For Prediction:

[Figure: Model usage]




A Two-Step Process

Model construction: describing a set of predetermined classes
  • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
  • The set of tuples used for model construction is the training set.
  • The model is represented as classification rules, decision trees, or statistical or mathematical formulae.

Model usage: for classifying future or unknown objects
  • Estimating the accuracy of the model.
  • The known label of the test sample is compared with the classified result from the model.
  • The accuracy rate is the percentage of test set samples that are correctly classified by the model.

The test set must be independent of the training set; otherwise, over-fitting would occur.

Classifier Accuracy Measures

Here are some of the methods of estimating the accuracy of a classifier (a sketch of k-fold cross-validation follows this list):
  • Holdout Method
  • Random Subsampling 
  • K-fold Cross-Validation 
  • Bootstrap Methods
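
As promised above, here is a minimal sketch of k-fold cross-validation (with k = 5) on synthetic data; each fold serves once as the test set, and the per-fold accuracies are averaged into a single estimate:

```python
# k-fold cross-validation (k = 5): the data is split into 5 folds,
# each fold is held out once as the test set, and the accuracies
# from the 5 runs are averaged.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

scores = cross_val_score(clf, X, y, cv=5)   # one accuracy per fold
print(scores, scores.mean())
```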

Types Of Classification Methods

  • Decision Tree Induction
  • Bayesian Classification
  • Classification by Back Propagation

(Note: We shall be discussing those separately.)

(Read also - > Data Reduction In Data Mining)


Comparing Classification Methods

Classification and prediction methods can be compared and evaluated according to the following criteria.

  • Predictive Accuracy: The ability of the model to correctly predict the class label of new or previously unseen data.
  • Speed: The computational costs involved in generating and using the model.
  • Robustness: The ability of the model to make correct predictions given noisy data, data with missing values, or otherwise inconsistent data.
  • Scalability: The ability to construct the model efficiently given huge amounts of data.
  • Interpretability: The level of understanding and insight that the model provides.

Issues Regarding Classification

Data cleaning
  • This refers to the preprocessing of data to remove or reduce noise (by applying smoothing techniques) and to treat missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics). A small sketch of mode imputation appears below.
  • Although most classification algorithms have some mechanism for handling noisy or missing data, this step can help reduce confusion during learning.
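
A minimal sketch of mode imputation, one of the missing-value treatments mentioned above:

```python
# Basic data cleaning: fill missing values with the most frequently
# occurring value in each column (mode imputation).
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 7.0],
              [np.nan, 7.0],
              [3.0, np.nan],
              [1.0, 9.0]])

imputer = SimpleImputer(strategy="most_frequent")
print(imputer.fit_transform(X))   # NaNs replaced by each column's mode
```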

Relevance Analysis
  • Many of the attributes in the data may be irrelevant to the classification or prediction task. For example, data recording the day of the week on which a bank loan application was filed is unlikely to be relevant to the success of the application. 
  • Furthermore, other attributes may be redundant. Hence, relevance analysis may be performed on the data to remove any irrelevant or redundant attributes from the learning process.
  • In machine learning, this step is known as feature selection (a small sketch appears below). Including irrelevant or redundant attributes may otherwise slow down, and possibly mislead, the learning step.
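
A minimal sketch of feature selection: score each attribute against the class label and keep only the top-k. The dataset is synthetic, and the choice of the ANOVA F-test (f_classif) as the scoring function is just one common option:

```python
# Relevance analysis (feature selection): rank attributes by their
# relevance to the class label and drop the rest.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 10 attributes, only 3 of which are actually informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=1)

selector = SelectKBest(score_func=f_classif, k=3)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)       # (200, 10) -> (200, 3)
print(selector.get_support(indices=True))   # indices of the kept attributes
```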

Data Transformation
  • The data can be generalized to higher-level concepts. Concept hierarchies may be used for this purpose. This is particularly useful for continuous-valued attributes. For example, numeric values for the attribute income may be generalized to discrete ranges such as low, medium and high. Similarly, nominal-valued attributes, like street, can be generalized to higher-level concepts, like city. 
  • Since generalization compresses the original training data, fewer input/output operations may be involved during learning.
  • The data may also be normalized, particularly when neural networks, or methods involving distance measurements, are used in the learning step.
  • Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
  • In methods that use distance measurements, for example, this prevents attributes with initially large ranges (such as income) from outweighing attributes with initially smaller ranges (such as binary attributes). A small sketch of normalization and discretization appears below.
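
A minimal sketch of both transformations: min-max normalization into the range 0.0 to 1.0, and generalization of a numeric income attribute into three discrete ranges (low, medium, high). The income values are invented:

```python
# Data transformation: min-max normalization, plus generalization of a
# continuous attribute (income) into discrete ordinal ranges.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, KBinsDiscretizer

income = np.array([[18000.0], [42000.0], [67000.0], [120000.0]])

# Normalization: scale values into a small, fixed range.
scaled = MinMaxScaler(feature_range=(0.0, 1.0)).fit_transform(income)
print(scaled.ravel())

# Generalization: replace numeric income with 3 equal-width bins
# (0 = low, 1 = medium, 2 = high).
bins = KBinsDiscretizer(n_bins=3, encode="ordinal",
                        strategy="uniform").fit_transform(income)
print(bins.ravel())
```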

Summary

Classification is a data analysis method that can be used to extract models describing important data classes or to predict future data trends and patterns.

(Read also -> Data Mining Primitive Tasks)

Classification is a data mining technique that predicts categorical (discrete) class labels, while prediction models continuous-valued functions.


Subscribe to us for more content on data.

 
