Descriptive Statistics - Methods Of Mining In Large Databases


Descriptive Statistical Measures


A descriptive statistic is a summary statistic that quantitatively describes or summarizes features of a collection of information, while descriptive statistics is the process of using and analyzing those statistics. Descriptive statistics is distinguished from inferential statistics (or inductive statistics) by its aim to summarize a sample rather than to use the data to learn about the population that the sample is thought to represent.

Data mining uses several descriptive statistical measures for knowledge discovery in large databases.

These measures are listed below.

  • Measuring Central Tendency.
  • Measuring the Dispersion of Data.
  • Boxplot Analysis.
  • Visualization of Boxplot Dispersion.
  • Histogram Analysis.
  • Quantile Plot.
  • Quantile-Quantile Plot.
  • Scatter Plot.
  • Loess Curve. 

Measuring The Central Tendency

Mean:

  • It is the arithmetic average of the given data.
  • For the weighted mean, we use the formula x̄ = sum(wi*xi) / sum(wi), where wi is the weight attached to the value xi.
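
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function names and the sample data (`scores`, `credits`) are made up for the example.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


scores = [70, 80, 90]   # hypothetical data
credits = [1, 1, 2]     # hypothetical weights
print(mean(scores))                    # 80.0
print(weighted_mean(scores, credits))  # (70 + 80 + 180) / 4 = 82.5
```

With equal weights the weighted mean reduces to the ordinary mean.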

Median:

  • It is a holistic measure: it has to be computed on the data set as a whole and cannot be assembled from summaries of separate partitions.
  • With the data sorted in order, it is the middle value.
  • If there is an odd number of values, the middle value is the median.
  • If there is an even number of values, the median is the average of the two middle values.
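  • For grouped data, it can also be estimated by interpolation: median ≈ L1 + ((n/2 - F) / f) * w, where L1 is the lower boundary of the interval containing the median, n is the number of values, F is the sum of the frequencies of all intervals below it, f is the frequency of the median interval, and w is the interval width.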
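
The odd and even cases can be sketched as follows. This is a minimal illustration that sorts the whole data set in memory, which is exactly what makes the median costly on very large tables; the function name and sample values are made up for the example.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if the count is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one middle value.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 5]))     # 5
print(median([7, 1, 5, 3]))  # (3 + 5) / 2 = 4.0
```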

Mode:

  • It is the value that occurs most frequently in the data.
  • If the data has only one mode, it is unimodal.
  • If the data has two modes, it is bimodal.
  • If the data has three modes, it is trimodal.
  • The empirical formula for the mode is mean - mode = 3*(mean - median), which holds approximately for unimodal data that is moderately skewed.
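
Finding the mode, and telling unimodal from bimodal or trimodal data, comes down to counting. The sketch below uses `collections.Counter` from the standard library; the helper names `modes` and `modality` are made up for the example, and labelling anything with more than three modes as "multimodal" is an assumption.

```python
from collections import Counter


def modes(values):
    """Return every value that reaches the highest frequency, in sorted order."""
    if not values:
        raise ValueError("mode of empty data is undefined")
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


def modality(values):
    """Name the data by its number of modes."""
    names = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return names.get(len(modes(values)), "multimodal")


print(modes([1, 2, 2, 3]), modality([1, 2, 2, 3]))        # [2] unimodal
print(modes([1, 1, 2, 2, 3]), modality([1, 1, 2, 2, 3]))  # [1, 2] bimodal
```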
 

Measuring The Dispersion Of Data

Quartiles, Outliers, and Boxplots

  • Quartiles: The values that divide the sorted data into four equal parts: Q1 (25th percentile) and Q3 (75th percentile).
  • Inter-quartile range: It is the difference between the third and first quartiles (IQR = Q3 – Q1).
  • Five-number summary: It describes the data with five values: min, Q1, M (the median), Q3, max.
  • Boxplot: The ends of the box are the quartiles, the median is marked, whiskers (two lines outside the box) extend to the minimum and maximum, and outliers are plotted individually.
  • Outlier: It is usually a value more than 1.5 x IQR above Q3 or below Q1.
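
The five-number summary and the 1.5 x IQR rule can be sketched with `statistics.quantiles` from the standard library (Python 3.8+). Several quartile conventions exist and they give slightly different values; the `method="inclusive"` choice here is an assumption, as are the function names and the sample data.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile convention."""
    ordered = sorted(values)
    q1, med, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, med, q3, ordered[-1]


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


data = [2, 4, 4, 5, 6, 7, 8, 9, 40]   # hypothetical data
print(five_number_summary(data))      # (2, 4.0, 6.0, 8.0, 40)
print(iqr_outliers(data))             # [40], since Q3 + 1.5 * IQR = 14
```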

Variance and standard deviation:

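  • Variance s^2: It is an algebraic measure, so its computation is scalable. For a sample, s^2 = sum((xi - x̄)^2) / (n - 1), where x̄ is the mean and n is the number of values.
  • Standard deviation: It is the square root of the variance s^2.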
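
"Algebraic" here means the variance can be derived from a few running totals (the count, the sum, and the sum of squares), so each partition of a large table can be scanned once and the totals combined. The sketch below assumes the sample variance with an n - 1 divisor, which is an assumption about the intended definition; the function names and data are made up for the example.

```python
import math


def sample_variance(values):
    """Definition form: squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_from_sums(n, total, total_sq):
    """Same quantity from running totals: n, sum(x), and sum(x^2)."""
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    return (total_sq - total * total / n) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data
var = sample_variance(data)
same = sample_variance_from_sums(len(data), sum(data), sum(x * x for x in data))
print(var, same)         # both equal 32 / 7
print(math.sqrt(var))    # standard deviation
```

The running-totals form can lose precision when the mean is large relative to the spread, so the definition form is the safer choice when the data fits in memory.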

Boxplot Analysis

  • In this type of analysis, the data is visualized as a box.
  • The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR.
  • The median is marked by a line within the box.
  • Whiskers: two lines outside the box extend to the minimum and maximum.


Histogram Analysis

  • It is a graph that displays basic statistical class descriptions.

Frequency histograms:

  • It is a univariate graphical method.
  • It consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data.
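
The counts behind those rectangles can be sketched without a plotting library. The example assumes equal-width classes, which is one common choice rather than a requirement; the function name, the parameters `bin_width` and `start`, and the data are made up for the example, and empty classes are simply omitted.

```python
def histogram_counts(values, bin_width, start=0):
    """Count how many values fall in each equal-width class [left, left + bin_width)."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)   # which class the value falls in
        left = start + index * bin_width        # left edge of that class
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


prices = [1, 3, 5, 12, 14, 18, 25]   # hypothetical data
counts = histogram_counts(prices, bin_width=10)
print(counts)                        # {0: 3, 10: 3, 20: 1}

# A text rendering: one bar per class, bar length equal to the count.
for left, count in counts.items():
    print(f"{left:>3}-{left + 10:<3} {'#' * count}")
```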



Quantile Plot

  • It displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences).
  • It plots quantile information.
  • For data xi sorted in increasing order, fi indicates that approximately 100*fi% of the data are below or equal to the value xi.
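
The (fi, xi) pairs that a quantile plot draws can be computed as below. The text does not say how fi is obtained; the sketch assumes the common convention fi = (i - 0.5) / n for the i-th smallest value, and the function name and data are made up for the example.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (an assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


for f, x in quantile_plot_points([30, 10, 20, 40]):   # hypothetical data
    print(f"about {100 * f:.1f}% of the data is <= {x}")
```

Plotting x against f then gives the quantile plot; every observation appears as one point.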

Quantile-Quantile Plot

  • It graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
  • It allows the user to view whether there is a shift in going from one distribution to another.
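
The points of a quantile-quantile plot can be sketched by computing the same quantiles for both samples and pairing them up. The example uses `statistics.quantiles` (Python 3.8+) with the `"inclusive"` convention, which is an assumption; the function name and both samples are made up for the example.

```python
import statistics


def qq_points(sample_a, sample_b, n=4):
    """Pair the n-quantiles of sample_a with the matching quantiles of sample_b."""
    qa = statistics.quantiles(sample_a, n=n, method="inclusive")
    qb = statistics.quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))


branch_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical data
branch_2 = [x + 5 for x in branch_1]     # same shape, shifted up by 5

for a, b in qq_points(branch_1, branch_2):
    print(a, b, "shift:", b - a)         # every quartile is shifted by 5.0
```

If the two distributions were identical the points would lie on the line y = x; a constant offset from that line, as here, indicates a shift.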


Scatter Plot

  • It provides a first look at bivariate data to see clusters of points, outliers, etc.
  • Each pair of values is treated as a pair of coordinates and plotted as points in the plane.

Conclusion

This post gave a brief explanation of the descriptive statistical measures used for mining data from large databases.
 
