# Association Algorithm Principles - Data Mining

The association algorithm is nothing more than a correlation counting engine. The Microsoft Association Algorithm belongs to the Apriori association family, a very popular and efficient family of algorithms for finding frequent itemsets (common attribute value sets). There are two steps in the association algorithm, as illustrated in the figure. The first step of the algorithm, a calculation-intensive phase, is to find frequent itemsets. The second step is to generate association rules based on the frequent itemsets; this step requires much less time than the first step does.

The two-step process of the association algorithm

Understanding Basic Association Algorithm Concepts
Before getting to the algorithm principles, this section introduces a few basic association algorithm concepts. The following sections define the terms you will need to understand before working through the algorithm principles:

Itemset
An itemset is a set of items. Each item is an attribute value. In the market basket example, an itemset contains a set of products, such as cake, Pepsi, and milk. In the customer demographic exploration example, an itemset contains a set of attribute values such as {Gender = 'Male', Education = 'Bachelor'}. Each itemset has a size, which is the number of items contained in the itemset. The size of itemset {Cake, Pepsi, Milk} is 3.

Frequent itemsets are those itemsets that are relatively popular in the dataset. The popularity threshold for an itemset is defined using support, which is covered in the next section.

Support
Support is used to measure the popularity of an itemset. The support of an itemset {A, B} is the total number of transactions that contain both A and B:

Support ({A, B}) = NumberOfTransactions (A, B)

Minimum_Support is a threshold parameter you need to specify before processing an association model. It means that you are interested only in those itemsets whose support is at least Minimum_Support. The Minimum_Support parameter restricts itemsets, but not rules.
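As a concrete illustration, support can be computed with a single scan over the transactions. This is a minimal sketch; the transaction data below is made up for illustration:

```python
# Support of an itemset: the number of transactions that
# contain every item in the itemset.
def support(transactions, itemset):
    return sum(1 for t in transactions if set(itemset) <= set(t))

# Hypothetical market-basket data for illustration.
transactions = [
    {"cake", "pepsi", "milk"},
    {"cake", "milk"},
    {"pepsi", "milk"},
    {"cake", "pepsi"},
]

print(support(transactions, {"cake", "milk"}))  # 2
print(support(transactions, {"pepsi"}))         # 3
```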

Probability (Confidence)
Probability is a property of an association rule. The probability of a rule A=>B is calculated using the support of itemset {A,B} divided by the support of {A}. This probability is also called confidence in the data mining research community. It is defined as follows:

Probability (A => B) = Probability (B|A) = Support (A, B)/ Support (A)

Minimum_Probability is a threshold parameter you need to specify before running the algorithm. It means that you are interested only in those rules whose probability is at least Minimum_Probability. Minimum_Probability has no impact on itemsets, but it does impact rules.
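The confidence formula above can be illustrated with a short sketch (the transactions are hypothetical, used only to show the arithmetic):

```python
# Support of an itemset: transactions containing every item in it.
def support(transactions, itemset):
    return sum(1 for t in transactions if set(itemset) <= set(t))

# Confidence (probability) of a rule A => B:
# Support({A, B}) / Support({A}).
def confidence(transactions, lhs, rhs):
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Hypothetical transactions for illustration.
transactions = [
    {"cake", "milk"}, {"cake", "milk"}, {"cake"}, {"milk"},
]

# P(milk | cake) = Support({cake, milk}) / Support({cake}) = 2/3
print(confidence(transactions, {"cake"}, {"milk"}))
```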

Importance

Importance is also called the interesting score or the lift in some literature. Importance can be used to measure itemsets and rules. The importance of an itemset is defined using the following formula:

Importance ({A,B}) = Probability (A, B)/(Probability (A)* Probability (B))

If importance = 1, A and B are independent items. It means that the purchase of product A and the purchase of product B are two independent events. If importance < 1, A and B are negatively correlated. This means if a customer buys A, it is unlikely he will also buy B. If importance > 1, A and B are positively correlated. This means if a customer buys A, it is very likely he also buys B.

For rules, the importance is calculated using the following formula:

Importance (A => B) = log (p(B|A)/p(B|not A))

An importance of 0 means that there is no association between A and B. A positive importance score means that the probability of B goes up when A is true. A negative importance score means that the probability of B goes down when A is true.

The table gives the correlation counts of donut and muffin derived from a purchase database. Each cell value represents the number of transactions. For example, 15 out of 100 transactions include a customer purchasing both donuts and muffins.

Correlation Count for Donut and Muffin

In the following, we will use the previous definitions to calculate the support, probability, and importance of related itemsets and rules for donut and muffin. From the importance of the itemset {Donut, Muffin}, we can see that donut and muffin are negatively correlated; it is rather unlikely for someone who buys a donut to also buy a muffin.
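The table itself is not reproduced in this excerpt. The counts below are hypothetical stand-ins, consistent only with the stated 15-out-of-100 figure, used to show how the importance formulas are applied (the log base is not specified in the text; natural log is used here):

```python
import math

# Hypothetical correlation counts: only the "15 of 100 contain both"
# figure comes from the text; the rest are made up for illustration.
total = 100
donut = 50    # transactions containing donut
muffin = 40   # transactions containing muffin
both = 15     # transactions containing both

p_d = donut / total
p_m = muffin / total
p_dm = both / total

# Importance of the itemset {Donut, Muffin}:
# Probability(D, M) / (Probability(D) * Probability(M)).
itemset_importance = p_dm / (p_d * p_m)
print(itemset_importance)  # about 0.75 < 1: negatively correlated

# Importance of the rule Donut => Muffin: log(P(M|D) / P(M|not D)).
p_m_given_d = both / donut                           # 15/50
p_m_given_not_d = (muffin - both) / (total - donut)  # 25/50
rule_importance = math.log(p_m_given_d / p_m_given_not_d)
print(rule_importance)  # negative: buying a donut lowers P(muffin)
```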

Finding Frequent Itemsets
Finding frequent itemsets is the core part of the association algorithm. First, you need to specify the frequency threshold using the Minimum_Support parameter, for example, Minimum_Support = 2%. This means you are interested in analyzing only those items that appear in at least 2% of all shopping baskets.

The algorithm finds all frequent itemsets with size = 1 in the first iteration (those popular products with support greater than Minimum_Support). The algorithm does this by scanning the dataset and counting the support of each individual item. The second iteration finds the frequent itemsets of size = 2. Before starting the second iteration, the algorithm generates a set of candidate itemsets of size 2 based on the result of the first iteration (frequent itemsets of size 1). Again, the algorithm scans the dataset and counts the support of each generated candidate itemset. At the end of the iteration, it removes those candidates with support less than Minimum_Support, yielding the list of frequent itemsets of size 2.

The algorithm repeats the same procedure to find frequent itemsets with size 3, 4, 5, and so on, until no more itemsets meet the Minimum_Support criterion. The figure illustrates the process of identifying frequent itemsets. The Minimum_Support is set to 250/1000. At the first iteration, cheese and cake are filtered out. At the second iteration, the candidate {diaper, milk} is disqualified. At the third iteration, the candidate {beer, diaper, bread} has enough support, whereas the candidate {beer, milk, bread} is filtered out.

The following pseudocode is the main procedure for generating frequent itemsets:
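The pseudocode itself is not reproduced in this excerpt. The following Python sketch captures the same iterative structure under the assumptions described above; the function names (`apriori`, `generate_candidates`, mirroring the generateCandidates function discussed below) and data shapes are my own:

```python
from itertools import combinations

def generate_candidates(frequent_k, k):
    """Join frequent k-itemsets into candidate (k+1)-itemsets,
    keeping only candidates whose every k-subset is frequent."""
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) == k + 1 and all(
                    frozenset(s) in frequent_k for s in combinations(union, k)):
                candidates.add(union)
    return candidates

def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets.
    min_support is an absolute transaction count here."""
    transactions = [frozenset(t) for t in transactions]
    # Iteration 1: candidate itemsets of size 1 are the individual items.
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while current:
        # One dataset scan: count the support of each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Generate size k+1 candidates from the frequent size-k itemsets.
        current = generate_candidates(set(survivors), k)
        k += 1
    return frequent

# Hypothetical transactions for illustration.
transactions = [
    {"beer", "diaper"}, {"beer", "diaper", "bread"},
    {"beer", "bread"}, {"milk"},
]
result = apriori(transactions, 2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```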

In the main procedure, generateCandidates is a function that returns all the candidate itemsets with size = k + 1. One important property of a frequent itemset is that every subset of a frequent itemset must itself be a frequent itemset. For example, if {beer, diaper, bread} is a frequent itemset, then {beer}, {diaper}, {bread}, {beer, diaper}, {beer, bread}, and {diaper, bread} must also be frequent itemsets.

Finding frequent itemsets

The following SQL join statement can be used to generate candidate itemsets:
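The statement itself is not reproduced in this excerpt. A typical Apriori candidate-generation self-join looks like the following sketch; the table and column names (FrequentItemsets_3, Candidates_4, Item1...Item4) are assumptions for illustration, shown for k = 3:

```sql
-- Join pairs of frequent 3-itemsets that share their first two items,
-- producing candidate 4-itemsets. The '<' condition on the last item
-- keeps items ordered and avoids duplicate candidates.
INSERT INTO Candidates_4 (Item1, Item2, Item3, Item4)
SELECT a.Item1, a.Item2, a.Item3, b.Item3
FROM FrequentItemsets_3 a
JOIN FrequentItemsets_3 b
  ON  a.Item1 = b.Item1
  AND a.Item2 = b.Item2
  AND a.Item3 < b.Item3;
```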

This SQL statement generates candidate itemsets of size k + 1 whose prefixes are frequent itemsets of size k. However, it doesn't guarantee that all the subsets of the candidate itemsets are frequent itemsets. So, we need to prune those candidates containing infrequent subsets by using the following procedure.
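The pruning procedure is not reproduced in this excerpt. A minimal Python sketch of the subset check (function and variable names are my own):

```python
from itertools import combinations

def prune_candidates(candidates, frequent_k, k):
    """Drop any candidate (k+1)-itemset that has a k-subset
    which is not itself a frequent itemset."""
    return [c for c in candidates
            if all(frozenset(s) in frequent_k for s in combinations(c, k))]

# Example: {diaper, bread} is not frequent, so the candidate
# {beer, diaper, bread} must be pruned.
frequent_2 = {frozenset({"beer", "diaper"}), frozenset({"beer", "bread"})}
candidates_3 = [frozenset({"beer", "diaper", "bread"})]
print(prune_candidates(candidates_3, frequent_2, 2))  # []
```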

Candidate itemset generation and counting their correlations are time-consuming. In some cases, the algorithm can generate a huge number of candidate sets. For example, suppose that there are 10,000 products (a medium-sized supermarket). If the minimum support is low enough, the algorithm will generate over 10^7 candidate 2-itemsets. Many optimization techniques are available in this phase; for example, Microsoft Association Rules stores the itemsets in a tree data structure to save space.

Some association algorithms generate frequent itemsets without candidate generation. For large datasets with lots of distinct items, we recommend that you avoid setting the Minimum_Support parameter too small.

The number of items is also critical to the performance of the processing. When there are too many unique items, consider grouping them into categories. For example, your store may carry a dozen different kinds of jelly beans; you can group these into a single JellyBeans category. This can greatly reduce the total number of items and thus reduce the model processing time.
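One hypothetical way to apply such a grouping before processing (the category map below is made up for illustration):

```python
# Map fine-grained items to a coarser category before mining.
category_of = {
    "JellyBeans-Cherry": "JellyBeans",
    "JellyBeans-Lemon": "JellyBeans",
    "JellyBeans-Grape": "JellyBeans",
    # Items without many variants can keep their own name.
}

def to_categories(transaction):
    """Replace each item with its category (or itself if unmapped)."""
    return {category_of.get(item, item) for item in transaction}

basket = {"JellyBeans-Cherry", "JellyBeans-Lemon", "Milk"}
print(sorted(to_categories(basket)))  # ['JellyBeans', 'Milk']
```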

Generating Association Rules
The next step in the association algorithm process is to generate association rules. We're looking for rules of the form cake => milk, and we're interested in rules that have a high correlation. To generate these rules, we need the count for the {cake, milk} itemset as well as the counts for cake and milk (the 1-itemsets). In general, you need the support of the itemset on the left side of the arrow (the left-hand side) along with the support of the itemset containing all items in the rule.

Because rules are generated from frequent itemsets, each rule automatically satisfies the minimum support condition. The following procedure generates all the qualified association rules:

For each frequent itemset f, generate all the subsets x and their complementary sets y = f - x. If Support(f)/Support(x) >= Minimum_Probability, then x => y is a qualified association rule with probability = Support(f)/Support(x).
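This procedure can be sketched in Python as follows (a minimal sketch, assuming the frequent-itemset support counts were already computed in the previous phase; function names are my own):

```python
from itertools import combinations

def generate_rules(frequent, min_probability):
    """frequent maps frozenset -> support count.
    Returns a list of (lhs, rhs, probability) for qualified rules."""
    rules = []
    for f, f_support in frequent.items():
        if len(f) < 2:
            continue
        # Enumerate every non-empty proper subset x of f.
        for size in range(1, len(f)):
            for x in combinations(f, size):
                x = frozenset(x)
                prob = f_support / frequent[x]
                if prob >= min_probability:
                    rules.append((x, f - x, prob))
    return rules

# Hypothetical support counts for illustration.
frequent = {
    frozenset({"cake"}): 4,
    frozenset({"milk"}): 5,
    frozenset({"cake", "milk"}): 3,
}
# cake => milk qualifies (3/4 = 0.75); milk => cake does not (3/5 = 0.6).
for lhs, rhs, p in generate_rules(frequent, 0.7):
    print(set(lhs), "=>", set(rhs), p)
```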

The following property can be used to accelerate the rule-generation process:

If a, b, c => d has a probability lower than the minimum probability, the rule a, b => c, d doesn't have enough probability either. This holds because Support(a, b) >= Support(a, b, c), so Support(a, b, c, d)/Support(a, b) <= Support(a, b, c, d)/Support(a, b, c).

Prediction
In an association model, if a column is used for input, its values can be used only in frequent itemsets and on the left side of association rules. If a column is used to make predictions, the column’s states can be used in frequent itemsets and on the left and right sides of the association rules.

If a column is predict_only, its states can appear in frequent itemsets and on the right side of rules. Many association algorithms in commercial data mining packages stop at finding itemsets and rules; the Microsoft Association Algorithm can perform predictions using these rules. The results of the predictions are usually a set of items to recommend.

You can build an association model not only based on shopping baskets but also based on customer demographics. For example, you can include gender, marital status, and home ownership as case-level attributes in the mining structure and include the shopping basket as a nested table in the same structure. In this case, you analyze the shopping patterns not only based on the relationship of itemsets but also based on the demographics. For example, you may find a rule that predicts that 65% of male customers who purchase beer also purchase diapers in the same transaction, and 20% of female customers who purchase diapers also purchase wine.

These rules can be applied for prediction. For a male customer, you may recommend a list of wines. If a male customer already has beer in his shopping cart, you may recommend both wine and diapers. However, not every itemset is associated with a rule. For example, there is no rule that has the itemset {beer, diaper, bread, milk} on the left side. What would the recommendation list be for a customer who bought beer, diapers, bread, and milk? Here is the method the Microsoft Association algorithm uses to execute associative prediction:

1. Given a list of items, find all rules with the left side matching the given items or any subsets of the given items. Apply those rules to get the list of recommendations.
2. If there is no appropriate rule or there are too few recommended items, apply marginal statistics to predict and return the n most popular items.
3. Sort the items from steps 1 and 2 based on probability.
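The three steps above can be sketched as follows. This is a minimal illustration, not the product's actual implementation; the rule and popularity data structures are assumed shapes:

```python
def recommend(basket, rules, popularity, n):
    """basket: set of items already chosen.
    rules: list of (lhs_set, rhs_item, probability).
    popularity: {item: marginal probability}.
    Returns the top-n recommended items."""
    scored = {}
    # Step 1: apply every rule whose left side matches the basket
    # (or a subset of it), collecting the recommended items.
    for lhs, rhs, prob in rules:
        if lhs <= basket and rhs not in basket:
            scored[rhs] = max(scored.get(rhs, 0.0), prob)
    # Step 2: too few recommendations -> fall back to marginal
    # statistics and add the most popular remaining items.
    if len(scored) < n:
        for item, prob in popularity.items():
            if item not in basket and item not in scored:
                scored[item] = prob
    # Step 3: sort by probability and return the top n.
    return sorted(scored, key=scored.get, reverse=True)[:n]

# Hypothetical rules and marginal probabilities for illustration.
rules = [({"beer"}, "diaper", 0.65)]
popularity = {"bread": 0.5, "milk": 0.4, "beer": 0.3}
print(recommend({"beer"}, rules, popularity, 2))  # ['diaper', 'bread']
```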
