Apriori Algorithm In Data Mining With Examples




Here is a comprehensive guide to the Apriori algorithm, with a worked example.

Introduction To Apriori Algorithm

The Apriori Algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules.

Key Concepts 

  • Frequent Itemsets: The sets of items that have at least the minimum support (the set of frequent k-itemsets is denoted Lk).
  • Apriori Property: Any subset of a frequent itemset must also be frequent.
  • Join Operation: To find Lk, a set of candidate k-itemsets (Ck) is generated by joining Lk-1 with itself.


Apriori Algorithm

The algorithm works as follows:

  • Find the frequent itemsets: the sets of items that have at least the minimum support.
  • A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
  • Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
  • Use the frequent itemsets to generate association rules.


Apriori Algorithm Pseudo Code

Join Step: Ck is generated by joining Lk-1 with itself.

Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

Pseudo-code:
Ck: Candidate itemset of size k


Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
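As a rough illustration, here is a minimal Python sketch of this pseudo-code. Itemsets are represented as frozensets; the function name apriori and the variable names are illustrative, not from any standard library.

from itertools import combinations

def apriori(transactions, min_support_count):
    # Returns a dict mapping each frequent itemset (frozenset) to its support count.
    counts = {}
    for t in transactions:                      # first scan: count individual items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_support_count}   # L1
    frequent = dict(L)
    k = 2
    while L:
        # Join step: union pairs of (k-1)-itemsets that produce a k-itemset
        prev = list(L)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k:
                    # Prune step: every (k-1)-subset must itself be frequent
                    if all(frozenset(s) in L for s in combinations(union, k - 1)):
                        candidates.add(union)
        # Scan the database and count each surviving candidate
        counts = {c: 0 for c in candidates}
        for t in transactions:
            t = set(t)
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        L = {s: c for s, c in counts.items() if c >= min_support_count}   # Lk
        frequent.update(L)
        k += 1
    return frequent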



Apriori Algorithm Example

TID  | List of Items
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3

Consider a database, D, consisting of 9 transactions.

Suppose the minimum support count required is 2 (i.e., min_sup = 2/9 ≈ 22%).

Let the minimum confidence required be 70%.

We first have to find the frequent itemsets using the Apriori algorithm.

Then, association rules will be generated using the minimum support and minimum confidence thresholds.
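For reference, this example database and the two thresholds can be written directly in Python. The sketches in the following steps assume these (illustrative) names:

# The 9 transactions of database D from the table above
D = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"},
    "T500": {"I1", "I3"},
    "T600": {"I2", "I3"},
    "T700": {"I1", "I3"},
    "T800": {"I1", "I2", "I3", "I5"},
    "T900": {"I1", "I2", "I3"},
}

min_support_count = 2    # min_sup = 2/9, about 22%
min_confidence = 0.70    # 70%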

Step 1: Generating 1-Itemset Frequent Pattern 

Itemset | Support Count
{I1}    | 6
{I2}    | 7
{I3}    | 6
{I4}    | 2
{I5}    | 2

In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1; the counts in the table above come from a single scan of the transactions in D.

The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support. Here every item meets the minimum support count of 2, so L1 contains all five items and the table above is L1.
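This first pass is just a frequency count over the transactions. A small sketch, using the D and min_support_count defined above:

from collections import Counter

# C1: count every individual item across all transactions in D
C1 = Counter(item for items in D.values() for item in items)

# L1: keep only the items that meet the minimum support count
L1 = {frozenset([item]): count
      for item, count in C1.items()
      if count >= min_support_count}
# Result: {I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2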

Step 2: Generating 2-Itemset Frequent Pattern

To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2.

Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown in the middle table).

The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.

Note: We haven't used the Apriori property yet.

L1 = {I1, I2, I3, I4, I5}.

Since C2 = L1 join L1, we join {I1, I2, I3, I4, I5} with itself.

This gives C2 = [ {I1,I2}, {I1,I3}, {I1,I4}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I3,I4}, {I3,I5}, {I4,I5} ].

Now we check each candidate itemset against the minimum support count.

This gives L2 = [ {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5} ].

Similarly, we proceed to find L3.
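A sketch of this join-and-count step, building on the L1 and D defined above:

from itertools import combinations

# Join step: L1 join L1 gives every 2-item combination of the frequent items
items = sorted({i for s in L1 for i in s})
C2 = [frozenset(pair) for pair in combinations(items, 2)]

# Scan D and accumulate the support count of each candidate 2-itemset
support2 = {c: sum(1 for t in D.values() if c <= t) for c in C2}

# L2: candidates that meet the minimum support count
L2 = {c: n for c, n in support2.items() if n >= min_support_count}
# Result: {I1,I2}:4, {I1,I3}:4, {I1,I5}:2, {I2,I3}:4, {I2,I4}:2, {I2,I5}:2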

Step 3: Generating 3-Itemset Frequent Pattern

L2 = [ {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5} ].
    
C3 = L2 join L2, i.e.

C3 = [ {I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5} ] (before pruning).

Now, the Join step is complete and the Prune step will be used to reduce the size of C3. The Prune step helps avoid heavy computation caused by a large Ck.
 
To prune C3, we check each candidate: all of its 2-item subsets must be members of L2, and any candidate containing an infrequent subset is removed.

Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the latter four candidates cannot possibly be frequent. How?

For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.

Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}.

However, {I3, I5} is not a member of L2, so it is not frequent, which violates the Apriori property. Thus we have to remove {I2, I3, I5} from C3.

Therefore, after checking every member of the join result during pruning, C3 = {{I1, I2, I3}, {I1, I2, I5}}.

Now, the transactions in D are scanned to determine L3, consisting of those candidate 3-itemsets in C3 that have minimum support. Both candidates qualify, so L3 = { {I1,I2,I3}: 2, {I1,I2,I5}: 2 }.
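A sketch of the Step 3 join, prune, and final scan, using the L2 and D from above. The simplified join here unions any two overlapping L2 itemsets; the prune check removes the extra candidates this produces:

from itertools import combinations

# Join: union pairs of L2 itemsets that yield a 3-itemset
C3_joined = {a | b for a in L2 for b in L2 if len(a | b) == 3}

# Prune: keep a candidate only if every 2-item subset is a member of L2
C3 = [c for c in C3_joined
      if all(frozenset(s) in L2 for s in combinations(c, 2))]
# Result: [{I1,I2,I3}, {I1,I2,I5}]

# Scan D to obtain L3
counts3 = {c: sum(1 for t in D.values() if c <= t) for c in C3}
L3 = {c: n for c, n in counts3.items() if n >= min_support_count}
# Result: {I1,I2,I3}: 2, {I1,I2,I5}: 2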
 

Step 4: Generating 4-Itemset Frequent Pattern

The algorithm uses L3 join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {I1, I2, I3, I5}, this itemset is pruned because its subset {I2, I3, I5} is not frequent.

Thus, C4 = φ (the empty set), and the algorithm terminates, having found all of the frequent itemsets. This completes the frequent-itemset mining phase of the Apriori algorithm.

What’s Next? 

These frequent itemsets will be used to generate strong association rules (strong association rules satisfy both minimum support and minimum confidence).
 

Step 5: Generating Association Rules From Frequent Itemsets

Procedure:
  • For each frequent itemset “l”, generate all nonempty subsets of l.
  • For every non-empty subset s of l, output the rule "s -> (l - s)" if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.
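A sketch of this rule-generation procedure in Python, assuming frequent is a dict mapping each frequent itemset to its support count (as returned by the apriori sketch earlier); the function name generate_rules is illustrative:

from itertools import combinations

def generate_rules(frequent, min_confidence):
    # Returns a list of (antecedent, consequent, confidence) for every strong rule.
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        # Every non-empty proper subset s of the itemset is a candidate antecedent
        for r in range(1, len(itemset)):
            for s in combinations(itemset, r):
                s = frozenset(s)
                confidence = count / frequent[s]   # sc(l) / sc(s)
                if confidence >= min_confidence:
                    rules.append((s, itemset - s, confidence))
    return rules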

From the above example

Let the minimum confidence threshold be, say, 70%.

The resulting association rules are shown below, each listed with its confidence.

R1: I1 ^ I2 -> I5
  • Confidence = sc{I1,I2,I5}/sc{I1,I2} = 2/4 = 50%.
  •  R1 is Rejected.
R2: I1 ^ I5 -> I2
  •  Confidence = sc{I1,I2,I5}/sc{I1,I5} = 2/2 = 100%.
  •  R2 is Selected.
R3: I2 ^ I5 -> I1
  •  Confidence = sc{I1,I2,I5}/sc{I2,I5} = 2/2 = 100%.
  •  R3 is Selected.
R4: I1 -> I2 ^ I5
  • Confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33%.
  • R4 is Rejected.
R5: I2 -> I1 ^ I5
  • Confidence = sc{I1,I2,I5}/sc{I2} = 2/7 ≈ 29%.
  • R5 is Rejected.
R6: I5 -> I1 ^ I2
  • Confidence = sc{I1,I2,I5}/sc{I5} = 2/2 = 100%.
  • R6 is Selected.

In this way, we have found three strong association rules (R2, R3, and R6).

Code For Implementing The Apriori Algorithm
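A production implementation is beyond the scope of this post, but the sketches above can be tied together end to end. The snippet below assumes the apriori and generate_rules functions and the database D defined earlier:

# Mine the frequent itemsets from the 9-transaction database D
frequent = apriori(list(D.values()), min_support_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), "support count:", count)

# Generate the strong association rules at 70% minimum confidence
for antecedent, consequent, conf in generate_rules(frequent, min_confidence=0.70):
    print(sorted(antecedent), "->", sorted(consequent), "confidence:", round(conf, 2))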

Methods To Improve Apriori's Efficiency

Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.

Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans.

Partitioning: An itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.

Sampling: Mining on a subset of the given data with a lower support threshold, together with a method to determine completeness.

Dynamic itemset counting: Add new candidate itemsets only when all of their subsets are estimated to be frequent.
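As a small illustration of one of these ideas, transaction reduction can be sketched as a filter applied between database scans (reduce_transactions is a hypothetical helper, not part of the algorithm above):

def reduce_transactions(transactions, Lk):
    # Drop any transaction that contains no frequent k-itemset, since it cannot
    # contribute to any frequent (k+1)-itemset in later database scans.
    return [t for t in transactions
            if any(itemset <= set(t) for itemset in Lk)]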

   

Summary

The Apriori Algorithm is an influential algorithm for mining frequent itemsets for boolean association rules.

Subscribe for more content on data.

  

