Introducing the Principles of Clustering - Data Mining

Clustering relies on guessing and lying. You guess about the organization of data and create a set of clusters, arbitrarily deciding the set of attributes and values that belong in each cluster. Then you lie to yourself and assert that your guess is correct. Now that you’ve created this model of the world, you can take each case from your training data and assign it to clusters as fits that model. You can adjust your model of the world by looking at how well it fit the data of the world; you move the clusters. Again, you lie and say that you believe this new model correctly describes the world, and again you take the world’s data and throw it at the model. This time, not all of the same cases fall into the same clusters. You repeat this process of updating your guess and assuming it’s true until it either becomes true — cases no longer switch clusters — or you don’t believe that you are going to get a better solution.

Figure demonstrates this procedure in one dimension. The top chart shows the point distribution along the x-axis along with an initial guess. In this case, we chose clusters evenly spaced across the range of data with similar distributions. We consider the cluster borders to be the midpoints between the ters, when the model has converged, or until you decide that the model will simply not improve with further iterations.

In practice, clusters are initialized randomly along all of the dimensions of the data. The clustering methodology is very sensitive to the starting points and can converge at local solutions that may not be an optimal global solution. For this reason, you initialize several candidate models and train them simultaneously. When the models have converged or when you have otherwise finished, you pick the best model from the candidates.

Hard versus Soft Clustering
One of the most important differentiators in a clustering algorithm is how the algorithm decides how to assign cases to clusters. The Microsoft Clustering algorithm allows two distinct methods of cluster assignment: K-means and expectation maximization (EM).

The K-means algorithm assigns cluster membership by means of distance. As shown in Figure 7.3, an object belongs to the cluster whose center is closest to it, measured using simple Euclidean distance. After all, the objects have been assigned to clusters, the center of the cluster is moved to the mean of all assigned objects, thus the name K-means K being the typical denomination for the number of clusters to look for. This technique is considered hard clustering because each object is assigned one and exactly one cluster. The clusters are disjoint and do not overlap.

Basic clustering procedure

Basic clustering procedure

The EM algorithm uses a probabilistic measure, rather than a strict distance measure, to determine which objects belong to which clusters. Instead of choosing a point for each dimension and computing a distance, the EM algorithm considers a bell curve for each dimension, with a mean and standard deviation. As a point falls within the bell curve, it is assigned to a cluster with a certain probability. Because the curves for various clusters can and do overlap, any point can belong to multiple clusters, with an assigned probability for each. This technique is considered soft clustering because it allows for clusters to overlap with indistinct edges. This method permits the clustering algorithm to find nondisjoint clusters, such as dense regions.

The dot size in Figure refers to the probability that each dot is in its respective cluster. Note that the dot sizes are uniform in the K-means diagram, whereas they are reduced in size near the cluster borders in the EM diagram.

Discrete Clustering
So far, we’ve described how clustering works in terms of numerical values. These values are easy to compare and relate, computing distances and whatnot, but what happens when the objects you’re trying to cluster do not have attributes that can be easily compared? A marble’s size could potentially be represented by its diameter, but what value would you assign to a marble’s material or color?

Luckily, the clustering techniques here can also handle discrete variables. Just as you assign random points along each dimension for continuous attributes, you assign random distributions for each discrete attribute. For instance, if you had an equal number of red, blue, green, and yellow marbles, your global distribution for each color would be 25%. As you initialize each cluster, you assume a random distribution that could look like the distributions in Table.

Clustering of dense region using (left) K-means and (right) EM

Clustering of dense region using (left) K-means and (right) EM

Discrete Cluster Initializations

Discrete Cluster Initializations

For EM clustering, you can pick a marble — say green — and can say that it is in cluster 2 with 25% likelihood. As you determine all the probabilities across all the discrete attributes of a case, you can compute the probability that it exists in each cluster and assign its values accordingly to each cluster of which it could be a member.

K-means clustering, being distance-based, does not fit as naturally in this model of probabilistic measures and traditionally isn’t used for clustering discrete attributes. K-means can still be applied if you can infer some sort of distance from the cluster. The Microsoft Clustering implementation measures the distance from a value to a cluster as one minus the probability of that value in the cluster. For example, a green marble in cluster 2 would be (1 – .25) = .75 “color” units away from cluster 2. This distance factors into the distance calculation as would any continuous value.

Discrete clustering is used not only for multivalued attributes such as the color of an object but also to cluster attributes that appear in nested tables. In our movie store, we could cluster our customers not only by their demographics and movie watching behavior but also by the actual movies they watched. For such attributes, the clustering algorithm treats each movie as having two possible values — existing and missing — and considers those values in a manner similar to other discrete attributes.

Scalable Clustering
One of the problems in clustering data is that to determine the appropriate segmentation requires multiple iterations over the training dataset. In small datasets this is not a problem, because iterating over data in memory is very fast. After the data grows to the point where it can no longer fit into memory, the performance of clustering degrades to the point where it is no longer feasible to continue computing. In this case, you have a scalable framework for clustering that allows you to efficiently cluster datasets regardless of the size of the data.

The principle of the scalable framework is that particular data points that are unlikely to change clusters can be compressed out of the data you are iterating over, providing room to load more data. This way the entire data stream is loaded once, one chunk at a time. Additionally, it is possible for the model to converge at each chunk of data, completing the clustering operation without even seeing all of the data.
The basic outline of the scalable framework implemented in Microsoft Clustering is:

  1. Initialize a set of candidate models to random initialization points.
  2. Collect a sample of the source data to fill the memory buffer.
  3. For each model perform the following scalable steps:
  4. a. Perform a clustering iteration as described in “Introducing the Principles of Clustering.”
    b. Add information gathered from previous scalable steps.
    c. Reinitialize any clusters that disappear or merge.
    Repeat until convergence occurs or you have completed sufficient iterations.
  5. If models have converged since the last scalable step or you have run out of data, you are finished — choose the best model from the candidates.
  6. Select and remove data from buffer, adding sufficient statistics to each model.
  7. Repeat from step 2.

Clustering Prediction
Clustering algorithms can also be used to predict values as well as provide natural groupings. While this seems like a natural and obvious application, traditionally, clustering hasn’t been used for such purposes. The Microsoft Clustering algorithm employs two tricks to accomplish this. First, it considers missing values to be uninformative. For example, if I have a new marble and I don’t know the color, I won’t use the fact that the color value is missing to determine which cluster the marble belongs to. Rather, the algorithm will only use the information for which it knows the values.

Once the cluster membership has been determined, the second trick is to simply read off the values from the cluster. For example, if this hypothetical marble had been found to be in cluster 2 from Table, I would say that it is 70% likely to be a red marble. Of course, with soft (EM) clustering, you would generally find that this marble didn’t belong to a single cluster but rather to a set of clusters with a particular probability for each. In this case, you create a composite result based on the contribution of each member cluster and present that as the result.

All rights reserved © 2020 Wisdom IT Services India Pvt. Ltd Protection Status

Data Mining Topics