Using Clustering Models - Data Mining

Clustering models are great models to throw your data at just to see what comes out the other side. However, as with all data mining techniques, you get the best answers when you ask your questions the right way. Do you want to group your salespeople by total sales, or by the ratio of sales in each category? Is it important that your clustering model understand income as a continuous value, or do you need to break it down into categories first? The clustering algorithm is very flexible in that it supports all data and content types specified in OLE DB for Data Mining, but you still need to take care to provide the data to the algorithm in the form that is most likely to solve your problem.

For example, if a large company issues an annual employee satisfaction survey to determine an Organizational Health Index (OHI). The results are a percentage for each category, for example 65% for the set of compensation questions. You could simply cluster over all of these results for all of your employees, but could you understand the results? Is 65% in compensation a good response, a bad response, or neutral? In this case, you could transform your data into buckets of High, Medium, and Low by considering High to be higher than half of a standard deviation above the average response and Low to be lower than the same amount below the same amount. Now, when you examine resultant clusters, the results will be easier to read; it’s easy to pick out the cluster of people who love their jobs but want to get paid more. You can increase the buckets to the granularity you wish to explore. This kind of transformation can be done in the Data Source view, or using Data Transformation Services.

The Clustering algorithm has some special behavior associated with the column usage flag on the mining model. When you set the column usage to Input or Predict, the algorithm acts as described; the difference, of course, being that the predictable columns are selectable from the model during predictions and the input columns are not. When you set the column usage to Predict Only, the column is treated specially. Predict Only columns in the clustering algorithm are not used during the clustering phase of the model training. When the model is completely trained, the algorithm takes another run over the training data and assigns the values of these attributes based on how the training cases fall into the clusters. When using the previous OHI example, to ensure that the company complies with federal guidelines regarding workforce diversity, it may be important to determine the distribution of gender and race among the clusters. When creating the clusters, you only want to consider the answers to the survey, but at the end of the day you need to know that there is the same ratio of men to women in that group of happy, underpaid employees as there is in the rest of the company.

Clustering as an Analytical Step
Clustering is also used as a step in a larger analytical project. By segmenting your data into groups of self-similarity, you can create better supplemental models to answer deeper questions. Surely, the decline in revenue among Family Buyers has occurred for a different reason than the decline among Single Moviegoers. By analyzing the data along the lines of these clusters, you are able to more tightly focus on the exact reasons for the revenue loss. Perhaps you want to create a tree model to predict whether a customer will give you repeat business within a month. Creating a model on all of your customers will provide valuable information, but it will be a model that is generalized to the entire data set. By training the tree models on only the areas that clustering has demonstrated are where the revenue has declined, you can focus on where the problem lies.

Clustering supports all content and data types and predictions, so all of the DMX language that is not specific to other algorithms can be used to query cluster models. The following sections discuss the functionality particular to the clustering algorithm.

Clustering as a preprocessing step for decision trees

Clustering as a preprocessing step for decision trees

The Cluster function returns the cluster that is most likely to contain a particular case. For example, to retrieve all cases from an input set that belong to the Family Buyers cluster, you would use the following syntax:

The Cluster function can also be used as a column reference for functions that take such references as parameters. $Cluster can be used as a synonym for Cluster().

ClusterProbability returns the probability that a case belongs to a particular cluster. Calling ClusterProbability without specifying a parameter returns the probability of the primary cluster. The following query returns the cluster ID for each case and the probability that the case is in that cluster:

PredictHistogram is not specific to the clustering algorithm, but when specified using the Cluster function as the column reference, it can be used to return a histogram of the likelihood of the input case existing in each of the model’s clusters. This query returns a table with a nested table containing the top three likely clusters for each case, along with associated statistics:

CaseLikelihood returns a measure from 0 to 1 that indicates how likely an input case is to exist considering the model learned by the algorithm. This measure is very good for use in anomaly detection because it quickly and easily tells you if new data is similar to any data seen before. This function operates in two modes: normalized and nonnormalized.

In the nonnormalized mode, the value of the measure is the raw probability of the case, that is, the product of the probabilities of each of the attributes in the case. For instance, if the probability of Home Ownership = Yes is 40% and the probability of Occupation = Craftsmen is 10%, then the probability of the case is 40% × 10% = 4%.

Nonnormalized likelihoods can be useful, but due to the nature of the probabilities, as you increase the number of attributes in a case, the probability of the case becomes correspondingly smaller. Additionally, as a user, you cannot understand whether a 4% probability for a certain combination of attributes is a good thing or a bad thing. The normalized likelihood divides the probability of the case as provided by the model by the probability computed without the model, using raw statistics. This provides a lift number that is normalized between 0 and 1 using the formula (lift)/(lift + 1). This is interpreted to mean that cases with likelihood values greater than 0.5 have positive lift and are more likely than random to occur, and that values less than 0.5 have negative lift and are less likely than random to occur.

For continuous attributes, the probability distribution is used for this computation. The following query returns the normalized case likelihood for each case in the input set:

Model Content
The content of a clustering model is easy to understand. The content contains:

  • One row for the model, indicated by node type DM_NODE_TYPE_ MODEL (1).
  • One row for each cluster, indicated by node type DM_NODE_TYPE_ CLUSTER (5).

Each row contains general information about the node plus the distributions of all attributes contained in the node that row represents. The model node contains the global distributions, and the cluster nodes contain the distributions particular to those individual clusters.

The first row of the model contains a score for the model in the MSOLAP_NODE_SCORE column of the model content. This score is the average case likelihood for each of the training cases. This score represents how well the model describes the training data.
This query returns the model score:


Each cluster row contains the cluster caption and an algorithm-generated textual description of the clusters. This query retrieves the caption and description for each cluster.

The cluster captions can also be set through DMX. The default caption for clusters simply assigns a numeric value. This query changes the caption for cluster 2 of the model:

Understanding Your Cluster Models
Clustering is very good at taking data with scores of attributes and distilling them down into a handful of groupings. Comprehending what the resultant groups mean can be quite a challenge. This is particularly difficult because each cluster cannot be considered in isolation; rather a cluster can only be understood in relation to all other clusters.

The naming convention used by the clustering algorithm is simply the word Cluster followed by a cluster index. When presenting your model to others, or even referring to it for your own personal use, you will need to choose appropriate labels for each cluster. With models built on dozens or even hundreds of attributes, a short label seems like a tall promise. The most effective labels come from your personal understanding of the business problem you are trying to solve combined with the patterns uncovered by the clustering engine. SQL Server Analysis Server provides a viewer that contains four tabbed cluster views that help you to build this understanding. Alone, each view does not provide enough insight to accomplish this task. When used together, you can apply the following strategy, which is effective in understanding and labeling your clusters. The viewer is accessed through either the BI Development Studio or SQL Management Studio by right-clicking on the model and selecting Browse.

  1. Get a high-level overview of your clusters.
  2. Pick a cluster, and determine how it is different than the general population.
  3. Determine how that cluster is different from nearby clusters.
  4. Verify that your assertions about the cluster are true.
  5. Label the cluster.
  6. Repeat for all remaining clusters.

The following sections will review these steps in detail.

Get a High-Level Overview
The first views provide a high-level overview of clusters: the Cluster Profiles view and the Cluster Diagram view. At first glance, Cluster Profiles provide too much information, and Cluster Diagram provides too little, but together they provide the topology of your cluster model.

The Cluster Profiles view, accessed through the second tab of the Cluster Viewer, contains a column for each cluster in your model and a row for each attribute. This set up makes it easy to see interesting differences across the cluster space. Using this view, you can choose an attribute of interest and visibly scan horizontally to see its distribution across all clusters. When an item catches your interest, you can look at neighboring cells or other cells of the same cluster to learn more about what that cluster means.

Figure shows a portion of the Cluster Profiles view on the Customer Clustering model. Cluster 21 clearly contains people who are older than the average customer, who like the Arts & Entertainment and American Movie Classics channels. In contrast, Cluster 25 contains younger customers, who like Cinemax. Cluster 10 looks similar to Cluster 21, but is a tad younger and is not quite as much into the classics.The Cluster Profiles view displays everything in your model in a manner that is easy to see. Binary and continuous attributes are particularly easy to discern, as are discrete attributes with a small number of states. Exploring your clusters through the cluster profiles is a good way to find a starting point for further exploration.

Continuous and binary attributes in the Cluster Profile viewer

Continuous and binary attributes in the Cluster Profile viewer

After exploring the minute details of the model, you can move to the other extreme. The Cluster Diagram view, the first and default tab of the Cluster viewer, represents each cluster as a single node. These nodes are scattered across a field and allowed to group based on similarities.

The resultant view is a diagram indicating which clusters are similar or dissimilar and the relative strength of these similarities. Using this view, it is easy to expand on the exploration you began in the Cluster Profile viewer. Using the shading feature of the view, it is easy to ask some more targeted questions about your model. You can ask, “In which clusters will I find customers who like the Arts & Entertainment channel?” or “Where are my customers between 20 and 25 years of age?” Furthermore, you can determine which clusters are very similar by moving the link slider down to remove weak links, leaving only the strongest links (see Figure 7.8). Continue to explore the Cluster Profiles and Diagram views until you feel comfortable with the overall layout of your model.

Pick a Cluster and Determine How It Is Different
Choose a cluster for further analysis. At this point, it really isn’t important which cluster you choose. One method for picking a cluster is to determine which clusters have the strongest link and choose one of them, or to pick a cluster that seems far removed from the rest. Or you may simply have found an interesting cluster during your initial exploration.

The first thing to do is to look at the third tab of the Cluster viewer, the Cluster Characteristics view. This view describes the characteristics of the cluster cases by displaying attributes in decreasing probability. Figure the top characteristics for cluster 21. The members of this cluster are married homeowners with more than one car. They own DVD and VHS players and watch a variety of cable movie channels. It may seem like this is enough information to choose an adequate label for the cluster. However, how do you know that this information is the most important? It may be true that customers in this group own DVD players, but all of your customers may own DVD players, which makes that fact uninteresting.

Cluster diagram showing strong (dark) links between clusters

Cluster diagram showing strong (dark) links between clusters

Cluster characteristics

Cluster characteristics

You determine what is important about the cluster by comparing it to everything outside the cluster. Figure compares cluster 21 and its complement. Here you see that DVD or VHS ownership aren’t quite so important for describing the cluster, but the preference for classic movie channels are. You also can infer that members of this cluster have children and are either married or have been in the past. Your picture of who the customers of cluster 21 are is becoming clear.

Determine How a Cluster Is Different from Nearby Clusters
You may now have enough information to accurately label this cluster. However, this cluster may be very similar to other clusters and any labeling you do at this point may also apply to those. Therefore, you must use due diligence and compare your chosen cluster to its nearest neighbors. To accomplish this, go to the Cluster Diagram view and determine which clusters are close to the cluster of interest. If no links to the cluster are very strong, it is probably safe to stop. For any clusters that are close, you need to switch back to the Cluster Characteristics view to compare those clusters one by one. Through this process you will refine your view of your chosen cluster. For example, comparing cluster 11 to cluster 17 indicates that the most important differences involve marital status and housing.

Cluster discrimination versus complement

Cluster discrimination versus complement

Verify That Your Assertions Are True
At this point, you probably have a pretty good idea about the members of your chosen cluster. Switching back to the Cluster Characteristics view allows you to ensure that none of the other viewers mislead you about the cluster. This can happen particularly when refining your cluster understanding by comparing it to neighboring clusters. The difference between the clusters that appeared important may be caused by an attribute that is uncommon in both, but simply less uncommon in one.

Label the Cluster
Labeling the cluster is a simple technique of switching to the cluster diagram, right-clicking the cluster node, and selecting Rename Cluster. The model label has an important impact on the understanding and future use of your model. What are the important attributes to express? This depends on your business knowledge of the data you are clustering and on which attributes are interesting to the intended audience of the model. Cluster 21 could be Classic Movie Watchers or Married, No Premium Channels, or any number of other monikers,depending on how you want to display your model and how you want others to perceive it.

All rights reserved © 2020 Wisdom IT Services India Pvt. Ltd Protection Status

Data Mining Topics