DMX - Data Mining

The most important contribution of the OLE DB for Data Mining specification is the definition of a data mining query language. This language is called DMX, which stands for Data Mining eXtensions. Without a good query language, it is very challenging to integrate data mining functions such as prediction with user applications. The data mining query language defined in OLE DB for DM adopts mainly relational concepts, and its syntax is based on SQL. As a database developer, you will find that it is fairly easy to learn this query language.

Three Steps of Data Mining
Before showing you the language, we will review the three basic steps of data mining. The first step is to create a mining model. This is similar to creating a table in relational database. A mining model definition includes a number of input columns, predictable columns, and an associated algorithm. A mining model is a container similar to a relational table. It is used to store patterns discoveredby data mining algorithms.

The three steps of data mining

The second step of the data mining is model training (also called processing). In this step, we feed historical data to the data mining engine, for example, existing customers’ demographics and their credit risk information. In the training step, the data mining algorithm starts to analyze the input data. Depending on the efficiency of an algorithm, it may scan the dataset in one or more iterations to find correlations among attribute values. Training steps are usually time-consuming. But in most cases, mining models are trained in batches weekly or monthly. After training, patterns are stored in the mining model. You can browse the model content using content viewers such as the Decision Tree viewer and Cluster viewer.

The third step of data mining process is prediction. In order to predict, we need a trained mining model and a new dataset. During the prediction, the data mining engine applies rules it found in the training step to the new dataset and assigns the prediction result for each input case. The prediction process is usually quite simple. For most algorithms, prediction is very fast and can be executed in real time. There are two types of prediction queries: batch and singleton. A batch query has multiple input cases, which are stored in a table. A singleton query has only one input case, which is constructed on the fly (the case is not persisted in database). In this section, we will go over the core part of this language for model creation, training, and prediction. The examples are based on the following schema of a mini-data-mart of a supermarket (see Figure 2.6). There are two tables in the schema: a customer table and a purchase table. The customer table contains customer demographic information and customers’ store membership card. The purchase table is the fact table about customer shopping transactions.

Step 1: Model Creation
Because the mining model is considered to be a container similar to a relational table, the statement for model creation is similar to the table creation statement and uses the Create command. The following is example of mining model creation:

The model here uses Gender, Age, Profession, HasChildren and HouseOwner to predict the type of membership card that a customer will have. The algorithm applied for the model is Microsoft Decision Trees. The statement is very similar to the table creation statement in SQL. You may have noticed a few differences to SQL. For example, for each column, the statement specifies not only the data type, but also Continuous or Discrete. These are called content types. Content types tell the algorithm the right way to model the column.

Database schema of mini-data-mart

Often, algorithms work better if we can give the distribution information of the column values. For example, if we know that the age of the customers follows a normal distribution, we can specify the Age column as follows:

Agelong normal continuous

Usually data mining algorithms have parameters for tuning the model. For example, the Microsoft Decision Trees algorithm has a parameter called Minimum_Support that specifies the minimum cases any leaf node must contain. The user can specify algorithm parameters at the end of the creation statement.

. . . Using Microsoft_Decision_Trees(Minimum_Support=50)

Some algorithms, such as Microsoft Decision Trees, support multiple predictable columns. For example, a mining model can predict both MemberCard and HouseOwner. The following model builds two decision trees. One uses Gender, Age, Profession, HasChildren, and HouseOwner to predict MemberCard, the other uses Gender, Age, Profession, HasChildren, MemberCard to predict HouseOwner.

Using Microsoft_Decision_Trees

In the previous model, if you want MemberCard to be used only as predictable, not as input to predict HouseOwner, you can use the keyword Predict_Only.

MemberCard textdiscrete predict_only

Sometimes there are dependencies among attributes. For example, in the customer table, there are two attributes: City and State. The State attribute depends on the attribute City (if City is Seattle, State must be Washington).
This information is very useful. For example, when there are too many cities, some algorithms may decide to group some cities together based their states. In DMX, there is a Related To keyword that is used to specify the dependencies among attributes.

Citytextdiscrete, State text discrete related to City

For numeric columns such as Age, sometimes you might want to cut it into n buckets. In DMX, there is a function discretized (discretization method, buckets number) with two parameters. The first one is the method used for discretization. The second parameter is the number of buckets to be grouped by. For example, the following column definition grouped Age into five buckets using the Equal_Areas method:

Income long discretizedEqual_Areas, 5) You can also specify: Income long discretized

It uses the default discretization method to bin Income into five buckets. As you have learned, nested case is a very important concept in DMX. It adds much express power for building mining models. Predictable attributes can be the case-level attributes as well as nested-level attributes. For example, you can build models that use customer demographic information to predict the list of products in which the customer may be interested. You can also build models that predict a customer’s demographic information based on the list of products that he or she buys. Other models can be built to analyze product associations for cross-selling using nested cases. The following are some sample models built with nested cases:

Using Microsoft_Decision_Trees

The preceding model uses Gender, Income, and the list of the purchased products and their associated quantities to predict the MemberCard.

Using Microsoft_Decision_Trees

When a nested table is marked as predict or predict_only, the nested key and its associated nested attributes are all predictable. The previous model uses Gender, Income, and MemberCard to predict the list of products a customer may be interested in and the quantity for each product. Now, we want to build another model, which not only uses the demographic information to predict a list of potential products a customer may buy, but also uses the list of products a customer has already bought. For example, suppose that there is a strong correlation between cheese and milk for female customers. A female customer that has cheese in her shopping cart is very likely to buy milk. To build such a model, you just need to slightly change the previous mining model by replacing the Predict_Only keyword with Predict for the Purchase table column.

Step 2: Model Training
Model training is also called model processing. During the training stage, data mining algorithms consume input cases and analyze correlations among attribute values. After the training, mining models are populated with patterns. Thus the syntax for a mining model training statement is the same as the Insert statement for relational table:

Insert into <mining model name> [ <mapped model columns>]<source data query>

In most cases, <Source data query> is a Select query from a relational database. Because most DM providers are embedded within the relational database management system (RDBMS) containing the source data, the <source data query> needs to read data from other data sources. In OLE DB, the OpenRowset statement supports querying data from a data source through an OLE DB provider. The syntax for the OpenRowset statement is the following:

OpenRowset(‘Provider_Name’, ‘Provider_String’, ‘DB Query’)

Provider_Name is an OLE DB provider name, for example, MSSQL for Microsoft SQL Server, Jet for Access. OLE DB drivers are available for almost any data sources (text file, Oracle, DB2, Sybase, and so on)
Provider_String is the connection string for the provider. DB Query is the SQL query supported by the provider. The query returns a rowset. The following is a training statement for the MemberCard_Prediction
model. The data source is stored in an SQL Server database on myserver machine.

ERSTANDING NESTED KEYS

The previous model training statement has exactly the same syntax required to insert data into a relational table. The only difference is that the object name is a mining model.

If a mining model contains nested tables, the model training syntax is more complicated in <source data query> part. DMX uses the Data Shaping Service defined in OLE DB to create hierarchical rowsets. Input cases must be shaped to the hierarchical format in order to feed the data mining provider. The Shape command is similar to the join operator, which joins two related rowsets based on the related columns. Shape results in hierarchical rowsets. Each parent row contains some columns that represent child rows. For example, after shaping customer and purchases rowsets using the CustomerId key, the result rowsets has the structure displayed in Figure 2.4. The following is a training statement with a nested table using Shape provider.

Step 3: Model Prediction
Prediction discovers information about unknown cases using patterns found from historical data. Prediction is an important step in the data mining process. It is often the final goal of a data mining project. It is also considered to be the method for closing the analysis loop. The training step is time-consuming because it requires multiple passes through the historical dataset to find patterns. Prediction is a simple and efficient task for most data mining algorithms. For example, you can think of prediction for the decision tree algorithm as dropping a case from the tree root node. The case falls to a leaf node, following the splitting condition at each intermediate node. This is a very simple process that requires only a few predicate evaluations and can be done in real time. While mining models are trained once a month, predictions are executed frequently. For example, a Web site personalizes the advertisement banner according to the user’s profile. When each user hits the Web site, a prediction query is executed to get the list of targeted advertisements. Another example is an insurance quote, each quote is a prediction. A call center executes tens of thousands of prediction queries each day.

Understanding Prediction Joins
To predict, we need to have two inputs: a trained mining model and a set of new cases. In DMX, a mining model is considered the same as a relational table. Conceptually, a trained mining model can be considered a truth table. Atruth table is a term in the electronics field. It contains all the combinations of possible inputattribute values and the predictable column value associated with a probability. In the MemberCard_Prediction model, a truth table contains all the value combinations of Gender, Income, HouseOwner, Maritual Status, and Member Card, with a probability assigned for each row. For example, there is a row in the truth table: Gender (Male), Income (20000), HouseOwner(Yes), Marital Status(Single), MemberCard(Gold) with Probability(10%). If there is a new customer with the exact the same profile, we can predict that the customer has 10% probability of having a gold membership. When there is a continuous attribute, such as Income, the truth table has an infinite number of rows. Of course, a mining model doesn’t store the patterns in the format of truth table, though conceptually they are similar.

In Figure the table on the right is a truth table. The left table is a new customer table with demographic information. We want to know the most likely type of membership card for each of these new customers. The typical relational method to get the membership card of each new customer is to join these two tables together. The join condition is the input column mappings. We adopt the relational join operator for making data mining predictions. However, it is a special kind of join with a mining model and a table; we call it Prediction Join.

Query Syntax for Prediction
The query syntax of a prediction join is the same as the syntax of a join query in SQL. There are three parts: the Select expression, the On condition, and the Where clause. The Select expression is a set of comma-separated expressions, each of which can be a column from an input table, a predictable column from the mining model, or a prediction function. The select expression can have sub-select-clause, which we will explain this later in this section.

Prediction join between a mining model and a table

The on condition specifies the column mapping between the mining model and input table. For example:

Select ... From ... On MemberCard_Prediction.age= NewCustomer.age And MemberCard_Prediction.gender= NewCustomer.gender ..

Sometimes new input cases don’t have all the attributes that a mining model is trained from. For example, a new customer table contains most demographic information except customer age. In this case, the On clause doesn’t specify the mapping of Age column. This is still a valid prediction query. The data mining algorithm provider decides how to execute the prediction without complete input information. Some algorithms may treat missing information as informative, that is, a valid input state; other algorithms may treat missing as noninformative, that is, to be ignored. Of course, the prediction result is less accurate without all the input attributes. The Where clause is the filter that limits the number of cases returned by the prediction query. Filter conditions can be set on input columns as well as on prediction functions. For example:

Where NewCustomer.age > 30 And PredictProbability(MemberCard, ‘Gold’)>0.75

This Where clause limits the query results. It returns only those customers older than 30 and with prediction certainty of gold membership card greater than 0.75. PredictProbability is a prediction function. We will explain these functions later in the section.

A Prediction Query Example
The following is an example of prediction query used to predict the type of customer membership card.

If the column names of the input table are the same as the column names in the mining model, we can use a Natural Prediction Join statement without specifying the On clause.

Prediction Functions
DMX defines a set of prediction functions. Some of these functions return simple scalar values, while others return a table of details, including information such as statistics.

Prediction Functions on Scalar Columns
Scalar predict functions return scalar values. The most important prediction function is Predict (Predict can also return table values, which we will discuss later). When a scalar column is given to the Predict function, the result is a scalar value. For example the following function returns the data shown in Table.

Select T.CustomerID, Predict(MemberCard), PredictProbability(MemberCard) as Proba ...

Directly selecting a predictable column from a mining model is a shortcut for using Predict with the predictable column as parameter, that is, Select MemberCard is same as Select Predict(MemberCard).
Sometimes we may want to know if the prediction result is in a “Missing” state (Null state). For example, in the mining model there is a rule: for male customers with a low income and who are younger than 25, the membership card information is missing with a probability of 0.60. We may want the prediction result to include the Missing state. To do this, we need to add the INCLUDE_NULL option in the prediction function (The default behavior is EXCLUDE_NULL, which means the missing state is not represented in the result).

Select T.CustomerID, Predict(MemberCard, INCLUDE_NULL),

PredictProbability returns the probability of the predicted state. Sometimes we want to know the probability of a given state, such as “Gold” member card, instead of the most likely one. We can add the state value as a parameter of the PredictProbability function. For example the following function returns the data shown in below Table Select T.CustomerID, PredictProbability(MemberCard, ‘Gold’) as Proba_Gold

Query Result

CUSTOMER ID MEMBERCARD PROBA

Query Result

CUSTOMER ID PROBA_GOLD
Besides Predict, there are many other prediction functions such as PredictProbability, PredictSupport, Cluster, and so on. Table provides a short description of scalar prediction functions:

Prediction Functions on Scalar Columns

PredictHistogram
Sometimes we want the prediction result to contain not only the most likely state but also a histogram with all states and their associated probabilities. For example, in the MemberCard_Prediction model, we would like to know the probabilities of each type of membership cards for a given customer. In DMX, here is a prediction function PredictHistogram, which returns the histogram of each possible value of the predictable column with a probability. PredictHistogram returns a table column, that is, a column that embeds a table. The columns in the embedded table are $Support,$Variance, $Stdev (standard deviation),$Probability, $AdjustedProbability,$Probability Variance, and $ProbabilityStdev. The following is the example of query result using PredictHistogram. Its results are shown in below Table. Select CustomerId, PredictHistogram(MemberCard) as Histogram Query result of PredictHistogram Prediction Functions on Table Columns You just learned the syntax for prediction functions using scalar columns as parameters. A few other prediction functions, such as PredictAssociation, PredictSequence, and TopCount, accept table columns. Some functions support both scalar columns and table columns, such as PredictTimeSeries and Predict. RETURNING HIERARCHICAL ROWSETS In the previous MarketBasketModel example, we can get the list of products a customer may be interested by using the following query: PredictAssociation(Purchase) returns a table column. The columns of the table are the same as the columns in the nested part of the model. The number of rows in the table equals the number of nested keys, in this case, the number of different products. Table shows the resulting rowsets: The Predict function is a special prediction function. It supports polymorphism. Based on the underlying algorithm, the Predict function delegates the request to Predict, PredictTimeSeries, PredictSequence, Predict ssociation, and so on. For the previous query, we can use Predict (Purchases) instead of PredictAssociation(Purchases). Sometimes a customer may have already purchased some products. We want to have prediction results for one of the following three cases: 1. The prediction contains the complete list of products that the store offers, with associated predicted quantities. 2. The prediction shows what other products a customer is likely to buy based on the products the customer has already bought. The reported list should not include the product from the input case. 3. The prediction contains just the predicted Quantity value associated with the products from the input case, or perhaps just the likelihood of each product in the input case. No other products should appear in the output table. To express these three different cases, we can specify, respectively, one of the following options in the Predict function: • INCLUSIVE, which represents the behavior in case 1. • EXCLUSIVE (default option), which causes behavior number 2. • INPUT_ONLY, which ensures that the predicted table contains only the ows supplied by the input (behavior number 3). Query result of PredictAssociation CUSOMTERID GENDER PURCHASES Each row in the Purchase table may contain statistics, for example, the support and the probability. This information is stored in the derived columns$Support and $Probability, respectively. The support is the number of cases similar to the given customer in the training dataset (for example, those who bought a particular product and has the same demographic information). The probability indicates how likely it is for a given customer to buy a particular product and is not concerned with the quantity of the product the customer may buy. Be aware that probabilities for different products don’t sum to 1, because the customer may buy a number of products. To include these statistics in the result rowset, we add another parameter for the Predict function: The result of the previous query is shown below. Query result using Include_Statistics CUSOMTER_ID GENDER PURCHASES Sometimes we need more information than just the purchase probability for each product. We want to know, for example, how likely it is that the customer will buy 1, 2 or 3 units of the product. Conceptually, all this information is included in the Predict function. In order to get the probability for each quantity, we can use the PredictHistogram function on the Quantity column and use the nested Select statement to select from the results of Predict function. The PredictHistogram function returns the histogram of all quantities and their associated probabilities. These probabilities sum to 1. The preceding query returns the data shown in below. Result for Nested Query The result rowset from Predict table column could be very large, especially when the query uses options, such as INCLUSIVE, INCLUDE_STATITICS, or uses PredictHistogram on nested columns. To solve the problem, DMX introduces the TopX and BottomX family of functions, which operate on nested tables (including those resulting from PredictHistogram, a nested Select or any other expression returning a table). These functions order the records of the nested table by a specified column’s value and then truncate the sorted list to a specified length. For example, the following query uses the TopCount function to retrieve the top two most likely membership cards that may interest the customer: Select CustomerId, TopCount(PredictHistogram(MemberCard),$Probability, 2). . .

The following query retrieves the top 10 products the customers is predicted to buy in the largest quantities:

Select CustomerId, TopCount(Predict(Purchase, EXCLUSIVE), quantity, 10)...

The following query retrieves the top five products the customers is most likely to buy:

Select CustomerId, TopCount(Predict(Purchase, EXCLUSIVE), $Probability, 5)... DMX 2.0 proposes a shortcut for the previous query using TopCount on the Predict function with$Probability as the select criteria. The following is the equivalent query using the shortcut.

Select CustomerId, Predict(Purchase, 5)...

If you want to return the top n recommendations based on an adjusted probability instead of probability, you can use the following query:

As the result of TopCount is a table column, we can apply a subelect clause to the result of TopCount function (or any prediction function returns table column). The following is an example:

The result rowset of the previous query is shown in below

Query Result

CUSTOMERID GENDER PROBABILITYHISTOGRAM

We can also add a Where clause to pull out certain records from a nested table. For example, if instead of always getting the “best” prediction for a membership card, suppose that a query only wants to get the probability of having a silver membership card for each customer. We can express this query using the following Select clause:

Similar to the query in the MarketBasketAnlysis model, the following query retrieves only the probability for each customer of purchasing Pepsi.

Table contains a list of prediction functions that operates on table columns. A brief description is also given in the table.

Prediction Function on Table Columns

Singleton Queries
Sometimes, the input case for prediction doesn’t exist in a database table. For example, suppose that an insurance company wants to give a quote to a customer over the phone or that an online retail site wants to recommend products based on the list of products a customer has already chosen in the shopping cart. In these cases, the customer data may not yet be recorded in the database. To make a prediction in these cases, DMX provides the syntax for a singleton prediction query. The singleton query allows sets of constant values to be used in place of the <source data query> for the prediction join query. The following is an example of singleton prediction query using the Member Card_Prediction model:

It is also possible to construct a singleton query with a nested table. The following is an example of prediction against MarketBasketAnalysis model based on the items in a customer’s shopping cart:

Making Predictions Using Content Only
One of the initial steps in training a mining model is to gather marginal statistics, for example, counting the number of customers with gold, silver, bronze member cards. DMX allows us to make predictions based on the marginal statistics of model content. The following queries return the most popular member card: Bronze.

You can also query nonpredictable columns. The following query returns the distinct values of Gender in the training dataset.

Select Distinct Gender From MemberCard The following query returns the most popular three products based on customer purchases: Select Predict(Purchase, 3) From MarketBasketAnalysi

Drilling through the Model’s Content
The drill-through concept in DMX means getting the source data from the model’s content. For example, from a tree model, you can get all the source cases classified in a given node of the tree. This type of query is based on Model.Cases. The following query returns all the training dataset:

Select * from MyModel.Cases

The following query returns those cases classified in Node 001 (including all the descendants of Node 001):

Select * from MyModel.Cases Where IsInNode(‘0001’)

Model.Sample_Cases is similar to Model.Cases; the only difference is that Model.Sample_Cases returns the sample cases instead of full training cases. The sample cases are not necessarily real input cases. These cases can be generated based on the model’s pattern.

Content Query
AContent query is a type of query that retrieves the content of a trained mining model. The model content represents the patterns a data mining algorithm discovered from the training dataset. These patterns are stored in tabular format. You can use the following query to select the tabular data from a mining model:

Select * from MyModel.Content

We will discuss the details of returned columns and give more examples of content query in the “Understanding Schema Rowsets” section. Based on the information retrieved from the content query, you should be able to display the patterns graphically.