# Microsoft Sequence Clustering Algorithm Principles - Data Mining

Before learning about the inner workings of the algorithm, you need to know a new concept: the Markov chain. As stated previously, sequence clustering works by merging two technologies, clustering and sequence analysis. The sequence analysis is something new — a Markov Chain model.

What Is a Markov Chain?
Andrei Markov was a famous Russia mathematician born in 1856. He was a professor at Saint Petersburg University and is remembered in particular for his study of Markov chains, which are sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors.

A weather sequence Figure illustrates an example of a Markov chain of the DNA sequence. A Markov chain contains a set of states. Most states emit events; other states like Begin and End are silent.

A Markov chain also contains a matrix of transition probabilities. The transitions emanating from a given state define a distribution over the possible next states. P (xi = G|xi1 = A) = 0.15 means given the current state A, the probability of next state being G is 0.15.

The Microsoft Sequence Clustering algorithm models sequence events based on the Markov Chain model.

Order of a Markov Chain
One of the important properties of a Markov chain is the order. In a Markov chain, the order n specifies the probability of a state depending on the previous n states. The most common Markov chain is 1st order, which means the probability of each state xi depends only on the state of xi-1. We can build high-order Markov chains using more “memory” to remember the previous n states.

Markov Chain model An nth-order Markov chain over k states is equivalent to a first order Markov chain over kn states. For example, a 2nd- order Markov model for DNA can be treated as a 1st-order Markov model over the following states: AA, AC, AG, AT, CA, CC, CG,CT, GA, GC, GG, GT, TA, TC, TG, and TT. The total number of states is 42. The higher the order of a Markov chain, the more memory and time required for the processing.

Based on the Markov chain, for any given length L sequence x {x1, x2, x3,...,xL}, we can calculate the probability of a sequence as follows:

P(x) = P(xL . xL-1,...,x1) = P(xL| xL-1,...,x1)P (xL-1|xL-2,...,x1)...P(x1)

In the case of a 1st-order Markov chain, the probability of each xi depends only on xi-1, the preceding formula is equivalent to the following:

P(x) = P(xL . xL-1,...,x1) = P(xL|xL-1)P(xL-1|xL-2)...P(x2|x1)P(x1)

State Transition Matrix
AMarkov chain remembers the transition probabilities among different states. Figure 8.3 graphically displays a state transition matrix of a 1st- order Markov chain. Each cell in the table corresponds to the probability of the transition. The probability in the grid is encoded by the gray scale; higher probabilities are brighter.

The transition matrix for 1-order Markov model is M*M, where M is the number of states in the sequence. When M is large, the size of the transition matrix can be significant. When there are too many states, many cells in the matrix are dark with a low transition probability. One of the ways to optimize the matrix storage is to store only those probabilities above a certain threshold.

State transition matrix Clustering with a Markov Chain
The Microsoft Sequence Clustering algorithm learns a mixture of Markov chains, where each mixture component corresponds to a particular cluster. To understand what a mixture model is, it is useful to understand how a mixture model generates data. A single case is generated from a mixture model as follows. First, a particular component (cluster) is randomly selected using a probability distribution over the clusters. Second, depending on which cluster is selected, a sequence is generated from the Markov chain corresponding to that cluster (each cluster or component corresponds to a different Markov chain).

Given data, the Microsoft Sequence Clustering algorithm learns the parameters of the mixture model — the mixture weights (the probability distribution over the clusters) and the parameters of each Markov chain. Note that the algorithm never sees the cluster identities of any case.

As we explained earlier, expectation and maximization (EM) is an iterative algorithm that finds parameters corresponding to a local optima for a model — parameters that locally maximize the likelihood of the data. The overall process of the clustering algorithm is:

1. Initialize the model parameters somehow (e.g., at random)
2. Given a current model parameters, each case is assigned to each of the K clusters with some probability. This is the E step.
3. Revaluate the model parameters based on the weighed assignment of each case. This is the M step.
4. Check whether the model has converged. If not, go to step 2 for a new iteration.

We explained the method to calculate the probability and likelihood of scalar attribute in each cluster. For the sequence attribute, the model parameters have the sequence state transition matrix for each cluster. For a given sequence x, we know its probability in a given cluster C is calculated using the following formula:

P(x|C) = P(xL|xL-1)P(xL-1|xL-2)...P(x2|x1)P(x1)

where P(xj|xi) is the transition probability of state i to j in cluster C.
We can then use Bayes rule to calculate the cluster membership probabilities of x in cluster C:

P = xi = G xi l = A = 0.15 ` - j

Where P(C) is the marginal probability of cluster C, for example the weight of cluster C over the whole population.

The Microsoft Sequence Clustering lgorithm supports both sequence and nonsequence attributes. Nonsequence attributes can be scalar attributes or nested attributes, as explained in the previous chapter. Sequence data is stored in a nested table. The sequence nested table must contain a column modeled as the key sequence. This column is the key of the sequence. The sequence key can be of any data type that can be sorted, such as date, integer, and string. In some cases, your data may have multiple sequence attributes. For example, you may want to group your customers by a sequence of Web pages that he or she visited, and a sequence of products that he or she bought. However, multiple sequences in a single model are not supported in SQL Server 2005.

It is possible to build a sequence model without any sequence data. In this case, the model becomes a simple clustering model. Nonsequence and sequence attributes are assumed to be mutually independent given cluster identity. What’s the cost of processing a clustering a Markov chain? Supposing that the number of clusters in the model is K, the number of cases is N, the average length of each sequence is L, and the number of states in the sequence is M, the cost of each iteration is O(KNL + LM2). The first part O(KNL) is the cost to assign each sequence to a cluster with a membership probability, and the second part O(LM2) is the cost to calculate the transition matrix after each iteration. In many applications such as a DNA sequence, Mis relatively small. This complexity can be reduced to O(KNL), which means the total runtime of the algorithm scales linearly in both N and K.

Cluster Decomposition
The number of natural groups in a sequence clustering model is different from that in a normal clustering model. In normal clustering, people tend to build the clustering model with k <10. When the number of clusters is too large, it is difficult to interpret the final results. If a really large number of distinct groups exist, people usually build clustering models in multiple steps, and in each step, they break the population into a handful groups.

In the sequence clustering model, when the number of states in the sequence is large, there could be many distinct clusters. For example, in a Web navigation scenario, there may be over 60 URL categories in a portal site. The first group of Web customers mainly navigates among news, the second group of customers focuses on music and movies, and the third group of customers is interested in front pages and weather. While clustering these customers, we usually get a larger number of clusters, compared to the nonsequence cluster model. It is relatively easy to interpret these models based on their sequences of states.

One step during the sequence clustering algorithm processing is cluster decomposition. If a user specifies a small number of clusters, and there are different types of sequences in a cluster, the algorithm will decompose the cluster into multiple clusters. For example, if a cluster contains two sets of sequences — Movie ➪ Music ➪ Download and New ➪ News ➪Weather — the algorithm breaks it into two clusters at the final stage of the model processing.

Data Mining Topics