Using the Sequence Clustering Algorithm - Data Mining

The Sequence Clustering algorithm can be applied in many areas such as click stream analysis, customer purchase analysis, bioinformatics, and so on. In this section, you learn about creating DMX queries for the Sequence Clustering algorithm and how to interpret the model using the Sequence Clustering viewer.

DMX Queries

The diagram below displays two tables: Customer and ClickPath. The Customer table contains customer profiles about Web usage on a portal site. ClickPath is a transaction table with three columns: CustomerGuid, URLCategory, and SequenceID. CustomerGuid is the foreign key to the Customer table. SequenceID is a numeric column that stores the Web click sequence number 1, 2, 3, ..., n. URLCategory is the state of the sequence. In this model, a sequence is the series of URLCategory values that a customer clicks through, such as News ➪ News ➪ Sports ➪ News ➪ Weather. Sequences vary in length because some customers stay longer and visit more URL categories.
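To make the table layout concrete, here is a minimal Python sketch of how ClickPath rows turn into one ordered sequence per customer. The rows and customer identifiers are made up for illustration, and the function name build_sequences is hypothetical; this is not part of DMX or Analysis Services.

```python
from collections import defaultdict

# Hypothetical ClickPath rows: (CustomerGuid, SequenceID, URLCategory).
# The values are illustrative only and do not come from the original data set.
click_path_rows = [
    ("c1", 2, "News"),
    ("c1", 1, "News"),
    ("c2", 1, "Sports"),
    ("c1", 3, "Sports"),
    ("c2", 2, "Weather"),
]


def build_sequences(rows):
    """Group rows by customer and order each customer's states by SequenceID."""
    grouped = defaultdict(list)
    for customer_guid, sequence_id, url_category in rows:
        grouped[customer_guid].append((sequence_id, url_category))
    # Sorting the (SequenceID, URLCategory) pairs restores the click order.
    return {
        customer: [state for _, state in sorted(clicks)]
        for customer, clicks in grouped.items()
    }


sequences = build_sequences(click_path_rows)
for customer, states in sequences.items():
    print(customer, " -> ".join(states))
```

Each customer ends up with a sequence of a different length, which is the shape of data the nested table represents.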

The following statement creates a mining model using the Microsoft Sequence Clustering algorithm. Sequence data must be stored in a nested table. The Microsoft Sequence Clustering algorithm does not support multiple sequence tables in a model, nor does it support more than one nonkey attribute in the sequence table.

Using Microsoft_SequenceClustering_Algorithm

The nested table — ClickPath — or the nonkey attribute in the nested table — URLCategory — may be specified as predictable. In this case, when the model is processed, you see customer segments based on their Web clicks and geolocations. You can also use the model to predict the next n sequence states for a given customer.

Customer and ClickPath tables


The following Insert into statement trains the sequence model:

Like the Microsoft Clustering algorithm, the Microsoft Sequence Clustering algorithm supports prediction. For cluster membership prediction, you can use the Cluster() function, which returns the cluster ID for each case.
The following query returns the cluster ID for each input case:


Because the nested table ClickPath is predictable, it is possible to use the Sequence Clustering algorithm to predict the subsequent states of a given sequence. There is a new prediction function called PredictSequence, which has the following syntax:

PredictSequence(ClickPath): returns the next predicted sequence state for a given sequence. The result is in table form.

PredictSequence(ClickPath, 3): returns the next three predicted sequence states for a given sequence. The result is in table form.

When the prediction returns several consecutive steps, the probability $P_n$ of step n is never greater than $P_{n-1}$, the probability of the previous step. $P_n$ is calculated as follows:

$$P_n = P_{n-1} \cdot P(S_n \mid S_{n-1})$$

where $P(S_n \mid S_{n-1})$ is the probability of the transition from state $S_{n-1}$ to state $S_n$ in the closest cluster for the case. The following query predicts the next two steps for each customer.

Select CustomerId, PredictSequence(ClickPath, 2) as Sequences From WebSequence Prediction Join ...

The query returns the results shown in the following table. The predicted sequence states are stored in a nested table with three columns. $Sequence is a generated column: an integer that numbers the future steps 1, 2, 3, and so on, where 1 means the next step. SequenceID has the same data type as the sequence key column. If the sequence key is a date type, this column is meant to return the subsequent dates, but the Microsoft Sequence Clustering algorithm does not fill it. The last column, URLCategory, is the predicted state of the sequence.

Prediction Query Result with Sequences
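The recurrence for the probability of multi-step predictions can be sketched in Python. This is an illustration under stated assumptions, not the implementation behind PredictSequence: the transition probabilities are invented, the function predict_sequence is hypothetical, and picking the single most likely next state at each step is an assumption made to keep the example small.

```python
def predict_sequence(transitions, last_state, steps):
    """Follow the most likely transition for `steps` steps.

    Returns (step, state, cumulative_probability) tuples, where the cumulative
    probability follows P_n = P_(n-1) * P(S_n | S_(n-1)) with P_0 = 1.
    """
    results = []
    probability = 1.0
    state = last_state
    for step in range(1, steps + 1):
        next_states = transitions.get(state)
        if not next_states:
            break  # no outgoing transition is known for this state
        # Assumption: take the single most likely next state (greedy choice).
        state, step_probability = max(next_states.items(), key=lambda kv: kv[1])
        probability *= step_probability
        results.append((step, state, probability))
    return results


# Hypothetical transition probabilities for one cluster (illustrative values).
transitions = {
    "News": {"Sports": 0.5, "News": 0.25, "Weather": 0.25},
    "Sports": {"Weather": 0.75, "News": 0.25},
    "Weather": {"News": 0.75, "Sports": 0.25},
}

predicted = predict_sequence(transitions, "News", 3)
for step, state, probability in predicted:
    print(step, state, probability)
```

Because every factor is a probability of at most 1, the cumulative value can only stay the same or shrink as the step number grows.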


You can also use a subselect statement on the nested table produced by PredictSequence. For example:


To get the probability of each predicted sequence state, you can use the PredictProbability function:


Sometimes, you want to have a histogram of the probability for each sequence state at each step. You can use the PredictHistogram function on the sequence state column. For example:


The result of this query contains two levels of nesting: one level is generated by PredictSequence, and the other is generated by PredictHistogram. The result format is shown in the following table:

Query Result with PredictHistogram Function
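To illustrate what a per-step histogram of sequence states looks like, the following Python sketch propagates a state distribution forward through a transition matrix. It is only an analogy for the nested result described above: the transition values are invented, step_histograms is a hypothetical helper, and the source does not say that PredictHistogram computes its probabilities this way.

```python
def step_histograms(transitions, last_state, steps):
    """Return one {state: probability} histogram per future step."""
    distribution = {last_state: 1.0}
    histograms = []
    for _ in range(steps):
        next_distribution = {}
        for state, state_probability in distribution.items():
            # Spread this state's probability mass over its outgoing transitions.
            for next_state, p in transitions.get(state, {}).items():
                next_distribution[next_state] = (
                    next_distribution.get(next_state, 0.0) + state_probability * p
                )
        histograms.append(next_distribution)
        distribution = next_distribution
    return histograms


# Hypothetical transition probabilities for one cluster (illustrative values).
transitions = {
    "News": {"Sports": 0.5, "News": 0.25, "Weather": 0.25},
    "Sports": {"Weather": 0.75, "News": 0.25},
    "Weather": {"News": 0.75, "Sports": 0.25},
}

histograms = step_histograms(transitions, "News", 2)
for step, histogram in enumerate(histograms, start=1):
    print(step, histogram)
```

The outer list plays the role of the step-level nesting, and each inner dictionary plays the role of the histogram nested under one step.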


In a Web click scenario, you know your Web visitor’s navigation sequence within a session, and you may want to predict his or her next few possible clicks in real time so that you can provide a personalized guide for the visitor. The click path is not yet recorded in the database. In this case, you can use a singleton query to make your prediction:

Model Content
The content of a sequence clustering model is laid out in four levels, as illustrated in the figure. The root node represents the model. The second level is the cluster level; each node except the last one represents a cluster discovered by the algorithm. The last node in the second level is a transition matrix, which represents the state transition probabilities of the overall population. The transition matrix has a set of children, each of which represents a row in the transition matrix. To limit content size, the matrix stores only those items with a probability greater than 0. Each cluster node also has a transition matrix as its child, which represents the transition probabilities of the given cluster. Counting the rows under each cluster's matrix, this gives the four levels of a sequence clustering model.
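The idea of a transition matrix that keeps only entries with a probability greater than 0 can be sketched in Python as a dictionary of rows. This is a simple frequency count over made-up sequences, shown only to illustrate the storage shape; it is not the training procedure of the Microsoft algorithm, and transition_matrix is a hypothetical name.

```python
from collections import Counter, defaultdict


def transition_matrix(sequences):
    """Estimate row-normalized transition probabilities from state sequences.

    Only transitions that were observed are stored, so every stored
    probability is greater than 0.
    """
    counts = defaultdict(Counter)
    for sequence in sequences:
        # Pair each state with its successor: (s1, s2), (s2, s3), ...
        for current_state, next_state in zip(sequence, sequence[1:]):
            counts[current_state][next_state] += 1
    matrix = {}
    for state, next_counts in counts.items():
        total = sum(next_counts.values())
        matrix[state] = {nxt: n / total for nxt, n in next_counts.items()}
    return matrix


# Illustrative sequences; the first one is the example path from the text.
sample_sequences = [
    ["News", "News", "Sports", "News", "Weather"],
    ["Sports", "News"],
]

overall = transition_matrix(sample_sequences)
for state, row in overall.items():
    print(state, row)
```

Each key of the outer dictionary corresponds to one row node of the matrix, and states with no observed outgoing transition simply have no row.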

Interpreting the Model
Once the sequence clustering model is defined and processed, you can browse the content of the model using the Sequence Clustering viewer. The Sequence Clustering viewer contains five tabs: Cluster Diagram, Cluster Profile, Cluster Characteristics, Cluster Discrimination, and Cluster Transition. The overall design of this viewer is very similar to that of the Clustering viewer, except for the Cluster Transition tab, which graphically displays the transition matrix for each cluster.

Content of Sequence Clustering model


The Cluster Diagram pane is the same as in the Clustering viewer. Clusters are laid out based on their relationships, so similar clusters sit closer to each other. By default, the node background shading represents the size of the cluster. For example, Cluster 5 is a large cluster and Cluster 9 is much smaller. You can also use the node color-coding to represent other attribute values, including a sequence state such as Weather. In that case, clusters whose members have a high probability of clicking on the Weather page are highlighted with a darker color.

In the Cluster Profile tab, each column represents a cluster and each row represents an attribute. The URLCategory row represents the sequence attribute. Each cell in this row contains a histogram of sequences.

Each line in the histogram represents a sample case in this cluster, and a line is composed of a series of sequence states. Each sequence cell displays about 20 cases. These are sample sequences from the training cases.

Cluster diagram


Cluster profile


In the Cluster Characteristics tab, each row represents the frequency (probability) of an attribute/value pair in the selected cluster. Each sequence state (including the Start and End events) is considered a distinct value for the sequence attribute. The list of attribute values is sorted by frequency. For example, the most likely attribute value in cluster 1 is Start ➪ Music, which means that most of the Web visitors in cluster 1 start with the Music page. Movie is another popular URL that cluster 1 individuals like to visit.

The Cluster Discrimination pane is designed to compare any two clusters, or to compare a cluster with the whole population or its complement. In the figure, you can see that the biggest difference between cluster 1 and cluster 8 is that cluster 1 customers end their navigation at a Music site while cluster 8 customers end their navigation at the Flight site. Cluster 1 customers like to go to Music and Movie sites, while cluster 8 customers like to visit Flight and Hotel URLs.

Cluster characteristics


Cluster Discrimination


The Cluster Transition tab is designed to display the sequence navigation patterns of each cluster. Each node is a sequence state, and each edge is a transition between two states. Each edge has a direction and a weight; the weight is the transition probability. In the figure, you can see that the main activities of customers in cluster 1 are Music, Shopping Music, and Movie, because those nodes are colored with the highest density. There is a strong link from Music toward Shopping Music. Among those customers who are in the Shopping Music URL category, 64% will click on a Movie site next. About 45% of the customers in the cluster start with a Music page in the portal site.

Cluster transitions
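The weighted, directed edges of the transition diagram can be sketched in Python as a filter over a cluster's transition matrix. Only the 64% (Shopping Music to Movie) and 45% (Start to Music) weights come from the text above; every other number is a made-up placeholder, and strongest_edges is a hypothetical helper rather than anything the viewer exposes.

```python
def strongest_edges(transitions, threshold):
    """Return (source, target, weight) edges with weight >= threshold,
    ordered from strongest to weakest."""
    edges = [
        (source, target, weight)
        for source, targets in transitions.items()
        for target, weight in targets.items()
        if weight >= threshold
    ]
    return sorted(edges, key=lambda edge: edge[2], reverse=True)


# Transition weights for cluster 1. Only 0.64 and 0.45 are taken from the
# text; the remaining values are hypothetical placeholders.
cluster_1 = {
    "Start": {"Music": 0.45, "Movie": 0.20},
    "Music": {"Shopping Music": 0.50, "Movie": 0.15},
    "Shopping Music": {"Movie": 0.64, "Music": 0.10},
}

for source, target, weight in strongest_edges(cluster_1, 0.4):
    print(f"{source} -> {target}: {weight:.0%}")
```

Raising or lowering the threshold is comparable to showing only the strongest links in the diagram or revealing weaker ones as well.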


All rights reserved © 2020 Wisdom IT Services India Pvt. Ltd
