Clustering In Data Mining - Applications & Requirements

Clustering In Data Mining Process

In data mining and machine learning, clustering is the process of grouping a set of physical or abstract objects into classes of similar objects.

A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.

A cluster of data objects can be treated collectively as a single group in many applications.

Clustering has its applications in many areas, including data mining, statistics, biology, and machine learning.

Cluster analysis is an important human activity. Early in childhood, one learns how to distinguish between cats and dogs, or between animals and plants, by continuously improving subconscious clustering schemes.

Cluster analysis is extensively used in numerous applications, including pattern recognition, data analysis, image processing, and market research.

By clustering, one can identify dense and sparse regions, and therefore, discover overall distribution patterns and interesting correlations among data attributes. 
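
As a rough illustration of telling dense regions from sparse ones, the sketch below counts how many 2-D points fall into each cell of a square grid and reports the cells that reach a minimum count. The function names, the cell size, and the sample points are assumptions made up for this example; they do not come from any particular library.

```python
from collections import Counter

def cell_counts(points, cell_size):
    """Count how many 2-D points fall into each square grid cell."""
    counts = Counter()
    for x, y in points:
        # Integer cell coordinates identify the cell that contains the point.
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return counts

def dense_cells(points, cell_size, min_points):
    """Return the set of cells holding at least min_points points."""
    counts = cell_counts(points, cell_size)
    return {cell for cell, n in counts.items() if n >= min_points}

# Made-up sample: four points close together, two isolated points.
points = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.9), (0.8, 0.1), (5.5, 5.1), (9.7, 2.3)]
dense = dense_cells(points, cell_size=1.0, min_points=3)
print(dense)  # the only dense cell is (0, 0)
```

The result depends on the chosen cell size and threshold, so both would need tuning on real data.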

Statistical software packages that implement clustering include S-Plus, SPSS, and SAS.

In machine learning, clustering is an example of unsupervised learning: it does not rely on predefined classes or class-labeled training examples.

For this reason, clustering is a form of learning by observation, rather than learning by examples.

In conceptual clustering, a group of objects forms a class only if it can be described by a concept. This differs from conventional clustering, which measures similarity based on geometric distance.

Conceptual clustering consists of two components:
  • It discovers the appropriate classes. 
  • It forms descriptions for each class, as in classification.

Types of Clustering Methods

Partitional Clustering - K-Means & K-Medoids

Hierarchical Clustering - Agglomerative, Divisive & Dendrogram

Density-Based Clustering - DBSCAN, OPTICS & DENCLUE

Grid-Based Clustering - STING, WaveCluster & CLIQUE
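
To make the partitional family more concrete, here is a minimal sketch of K-Means in plain Python. It assumes numeric points given as tuples and initial centroids chosen by the caller; the function name `kmeans` and the sample data are illustrative only, and a production implementation would normally come from a library.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd-style k-means on tuples of numbers; returns (centroids, labels)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels

# Made-up data: two visually separate groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The number of clusters is fixed by the number of initial centroids, and a different starting choice can lead to a different result.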

  

Applications Of Clustering

In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns.

In biology, it can be used to derive plant and animal taxonomies and to categorize genes with similar functionality.

Clustering may also help in the identification of areas of land use in an earth observation database, and in the identification of groups of automobile insurance policyholders with a high average claim cost.

As a data mining function, cluster analysis can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis.

Alternatively, it may serve as a preprocessing step for other algorithms, such as characterization and classification. 

It is also used in the field of Pattern Recognition.

Spatial data analysis:
  • To create thematic maps in GIS by clustering feature spaces.
  • To detect spatial clusters and explain them in spatial data mining.

Clustering is also widely used in Image Processing methods.

Economic science: especially market research.

WWW:
  • Document classification.
  • Clustering Weblog data to discover groups of similar access patterns.

Examples Of Clustering Applications

Marketing: It helps marketers discover distinct groups in their customer bases and then use this knowledge to develop targeted marketing strategies.

Land use: It is used in the identification of areas of similar land use in an earth observation database.

Insurance: It is used in identifying groups of motor insurance policyholders with a high average claim cost.

City planning: It helps in identifying groups of houses according to their house type, value, and geographical location.

Earthquake studies: Observed earthquake epicenters are expected to cluster, typically along fault lines.


Requirements Of Clustering

Scalability: 
  • Many clustering algorithms work well on small data sets containing fewer than 200 data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.

Ability to deal with different types of attributes:
  • Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.
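
One common way to handle mixed attribute types is to compute a per-attribute dissimilarity and average the results. The sketch below follows that idea in the style of Gower's coefficient: numeric attributes contribute a range-normalized absolute difference, and categorical attributes contribute 0 on a match and 1 otherwise. The function name, the attribute descriptions, and the records are assumptions invented for this example.

```python
def mixed_dissimilarity(a, b, kinds, ranges):
    """Average per-attribute dissimilarity in [0, 1] for mixed-type records.

    kinds[i] is "numeric" or "categorical"; ranges[i] is the value range
    (max - min) of a numeric attribute and is ignored for categorical ones.
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "numeric":
            # Scale the difference by the attribute's range so it lies in [0, 1].
            total += abs(x - y) / rng if rng else 0.0
        else:
            # Categorical (nominal) values either match or they do not.
            total += 0.0 if x == y else 1.0
    return total / len(kinds)

kinds = ["numeric", "categorical", "categorical"]
ranges = [50.0, None, None]  # assumed: age spans 50 years in this made-up data
alice = (30, "urban", "yes")
bob = (40, "urban", "no")
d = mixed_dissimilarity(alice, bob, kinds, ranges)
print(round(d, 3))  # 0.4
```

Ordinal attributes could be handled in the same function by first mapping their levels to ranks and then treating them as numeric.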

Discovery of clusters with arbitrary shape: 
  • Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shapes. 
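
For reference, the two distance measures named above can be written in a few lines of Python. The helper names are chosen for this sketch only.

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

Both measures depend only on coordinate differences from a reference point, which is one reason algorithms built on them tend to favor compact, roughly spherical clusters.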

Minimal requirements for domain knowledge to determine input parameters: 
  • Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters.

Ability to deal with noise and outliers: 
  • Most real-world databases contain outliers or missing or unknown or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.
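
A small sketch can show why outliers matter. A cluster representative computed as the mean (as in K-Means) is pulled toward a single extreme value, while a medoid (as in K-Medoids), which must be an actual member of the cluster, moves far less. The one-dimensional data and the helper names below are made up for illustration.

```python
def mean(values):
    """Arithmetic mean, the representative used by k-means."""
    return sum(values) / len(values)

def medoid(values):
    """Return the member with the smallest total distance to all members."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))

cluster = [10.0, 11.0, 12.0, 13.0, 14.0]
with_outlier = cluster + [100.0]  # one erroneous reading added

print(mean(cluster), medoid(cluster))  # 12.0 12.0
print(mean(with_outlier))              # shifts well above every normal value
print(medoid(with_outlier))            # stays among the normal values
```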

Insensitive to the order of input records: 
  • Some clustering algorithms are sensitive to the order of input data; for example, the same set of data, when presented with different orderings to such an algorithm, may generate dramatically different clusters. It is important to develop algorithms that are insensitive to the order of input.
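
Order sensitivity is easy to reproduce with a single-pass scheme. The sketch below is a simple "leader" style procedure written for this illustration, not an algorithm taken from the article: each value joins the first existing cluster whose leader is within a threshold, otherwise it starts a new cluster. Feeding the same four values in two different orders gives two different groupings.

```python
def leader_clustering(values, threshold):
    """Single-pass clustering: a value joins the first leader within the
    threshold, otherwise it becomes the leader of a new cluster."""
    leaders, clusters = [], []
    for v in values:
        for i, leader in enumerate(leaders):
            if abs(v - leader) <= threshold:
                clusters[i].append(v)
                break
        else:
            # No existing leader was close enough, so start a new cluster.
            leaders.append(v)
            clusters.append([v])
    return clusters

forward = leader_clustering([1.0, 2.0, 3.0, 4.0], threshold=1.5)
reordered = leader_clustering([2.0, 1.0, 3.0, 4.0], threshold=1.5)
print(forward)    # [[1.0, 2.0], [3.0, 4.0]]
print(reordered)  # [[2.0, 1.0, 3.0], [4.0]]
```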

High dimensionality: 
  • A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two or three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. It is challenging to cluster data objects in high-dimensional space, especially considering that such data can be very sparse and highly skewed, which can make the results misleading.
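
One well-known effect behind this difficulty is that distances tend to become less informative as the number of dimensions grows. The sketch below is an added illustration under simple assumptions (uniform random points in a unit cube, a fixed random seed, Euclidean distance): it measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query point to
    n_points uniform random points in the unit cube of the given dimension."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(2)     # two dimensions
high = relative_contrast(500)  # five hundred dimensions
print(low > high)
```

In this setup the contrast in 500 dimensions is expected to be far smaller than in 2 dimensions, meaning the nearest and farthest points look almost equally distant.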

Constraint-based clustering: 
  • Real-world applications may need to perform clustering under various constraints. Suppose that your job is to choose the locations for a given number of new automatic cash dispensing machines (i.e., ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers, highway networks, and customer requirements per region. A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints.

Interpretability and usability: 
  • Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering methods.
      

Summary

The process of grouping a set of physical or abstract objects into classes of similar objects is called Clustering or Cluster Analysis.

A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.

