What Is Data Mining - An Introduction Guide To Beginners

data mining and kdd knowledge discovery

Now Before starting with the introduction of Data Mining, let's see the evolution of databases so that we could have an idea of how the data got stored through ages.
(Read also -> Data Mining Tasks)  

Database Technology Evolution

  • Data collection, database creation, IMS(Hierarchical) and network DBMS
  • Relational data model, relational DBMS implementation
  • RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  • Application-oriented DBMS (spatial, scientific, engineering, etc.)
  • Data mining, data warehousing, multimedia databases, and Web databases
  • Stream data management and mining
  • Data mining and its applications
  • Web technology (XML, data integration) and global information systems 

In the field of Information technology, we have a huge amount of data available that needs to be turned into useful knowledge.

This information or knowledge further can be used for various applications such as market analysis, fraud detection, customer retention, production control, science exploration, etc. 

Data mining

Data mining (knowledge discovery from data) is
Extracting or “mining” knowledge from a large amount of data.

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data

A few of its alternative names are Knowledge Discovery (mining) in Databases (KDD), Knowledge Extraction, Data/Pattern Analysis, Data Archeology, Data Dredging, Information Harvesting, Business Intelligence, etc.

Knowledge Discovery Process

Knowledge Discovery Process diagram, data mining

Data Cleaning: to remove noisy and inconsistent data

Data integration: where multiple data sources may be combined and integrated.

Data selection: where data relevant to the analysis tasks are retrieved from the databases.

Data transformation: where data is transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations.

Data mining: an essential process where intelligent methods are applied to extract data patterns

Pattern evaluation: to identify the truly interesting patterns representing knowledge based on interestingness measures.

Knowledge presentation: where visualization and knowledge representation techniques are used to present mined knowledge to users

Data Cleaning 

Data in the real-world is not available in proper(Many tuples might not have recorded attribute values). 

Missing values may be due to
  • Equipment failure.
  • Inconsistent with other recorded data and thus deleted or misplaced.
  • Data not entered correctly due to misunderstanding 
  • Certain data may not be considered important at the time of entry. 
  • Not register history or changes in the data.

Data Integration 

Data integration: Combines data from multiple sources into a coherent store
  • Schema integration: e.g., A.cust-id = B.cust-#
  • Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
  • Detecting and resolving data value conflicts
  • For the same real-world entity, attribute values from different sources are different
Possible reasons: different representations, different scales, e.g., metric vs. British units.
Price (dollars, euro), (kg, gram), total sales, etc.,

Handling Redundancy 

Redundant data occur often when the integration of multiple databases is done.

  • Object identification: The same attribute or object may have different names in different databases
  • Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue
Redundant attributes can be detected by correlation analysis

Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Data Transformation 

Smoothing: remove noise from data

Aggregation: summarization, data cube construction

Generalization: concept hierarchy climbing

Normalization: scaled to fall within a small, specified range
  • min-max normalization
  • z-score normalization
  • normalization by decimal scaling
Attribute/feature construction
  • New attributes constructed from the given ones

Data Reduction Strategies

Dimensionality reduction: e.g., remove unimportant attributes

Data compression

Numerosity reduction: e.g., fit data into models
Discretization and concept hierarchy generation

Data Mining

  • An essential process where intelligent methods are applied to extract data patterns

Pattern Evaluation

  • To identify the truly interesting patterns representing knowledge based on interestingness measures.

Knowledge Presentation

  • Where visualization and knowledge representation techniques are used to present mined knowledge to users


Applications of Data Mining

Customer Profiling: Data Mining helps to determine what kind of people buy what kind of products.

Identifying Customer Requirements: Data Mining helps in identifying the best products for different customers. It uses prediction to find the factors that may attract new customers.

Cross Market Analysis: Data Mining performs Association/correlations between product sales.

Target Marketing
: Data Mining helps to find clusters of model customers who share the same characteristics such as interest, spending habits, income, etc

Determining Customer purchasing pattern: Data mining helps in determining customer purchasing pattern

Providing Summary Information: Data Mining provides us various multidimensional summary reports

Finance Planning and Asset Evaluation: It involves cash flow analysis and prediction, contingent claim analysis to evaluate assets.

Resource Planning:  It involves summarizing and comparing the resources and spending.

Competition: It involves monitoring competitors and market directions.

Major Issues of Data Mining 

Mining Methodology
  • Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  • Handling noise and incomplete data: data cleaning and data analysis methods that can handle noise are required. outlier mining methods for discovery and analysis of exceptional cases.
  • Incorporation of background knowledge: domain knowledge is required to guide the discovery process and express patterns in concise terms and at different levels of abstraction.

User Interaction

  • Data mining query languages and ad-hoc mining
  • Data mining Query language needs to be developed to allow users to describe ad- hoc data mining tasks.
  • Expression and visualization of data mining results
  • Interactive mining of knowledge at multiple levels of abstraction
Applications and Social Impacts

  • Domain-specific data mining & invisible data mining Eg. Companies like Amazon keeps track of customer profiles
  • Protection of data security, integrity, and privacy
  • We need to observe data sensitivity and preserve people's privacy while performing successful data mining.


Hence the trend of data mining and its application has been an exponential rise along with Machine Learning and Artificial Intelligence. 

Because the basic thing for any machine learning or artificial intelligence Engineer to deploy a model is to understand and analyze the patterns of data.

Thank you for going through this blog post, I hope it's worth your time and don't forget to comment if any doubts or suggestions. 

(next -> Data Reduction In Data Mining)

We will be discussing more about data mining in our upcoming articles.

Post a Comment