Creating a predictive model - Microstrategy

After you have created a dataset report, the next step is to create a predictive model, which will then be imported into MicroStrategy. You can create a predictive model in one of the following two ways:

• Using a third-party data mining application
• Using a built-in MicroStrategy function

Using third-party applications

There are many companies that offer data mining applications and workbenches. These tools are specialized for the data mining workflow and are usually rich in data mining features and options. MicroStrategy Data Mining Services integrates seamlessly with these applications in the following ways:

1. MicroStrategy is used as the source of data mining dataset.
2. After the third-party application develops the PMML-based predictive model based on the dataset, MicroStrategy imports the model to generate reports.

You can use one of the following two options to provide the MicroStrategy dataset to third-party applications:

• Save the dataset as a data mart using MicroStrategy’s data mart feature.
• Export to a particular file format using MicroStrategy’s export capabilities.

Using MicroStrategy

While there are many sophisticated data mining algorithms, some data mining techniques are fairly simple and do not require specialized tools. The MicroStrategy TrainRegression function can be used to create predictive models using the regression data mining technique. This technique should be familiar to you if you have ever tried to extrapolate or interpolate data, or tried to find the line that best fits a series of data points, or used Microsoft Excel’s LINEST or LOGEST functions.

Regression analyzes the relationship between several predictive inputs, or independent variables, and a dependent variable that is to be predicted. It does this by finding the line that best fits the data, with a minimum of error.

For example, you have a dataset with just two variables, X and Y, which are plotted as in the following chart: Using the Regression technique, it is relatively simple to find the straight line that best fits this data (see the following chart). The line is represented by a linear equation in the classic y = mx + b format, were m is the slope and b is the y-intercept. Alternatively, you can also fit an exponential line through this data (see the following chart). This line has an equation in the y = b mx format. So, how can you tell which line has the better fit? There are many statistics used in the Regression technique. One basic statistic is an indicator of the “goodness-of-fit,” meaning how well the line fits the relationship among the variables. It is also called the Coefficient of Determination, whose symbol is R2. The higher the R, the better the fit. We can see that the linear predictor has R2 = 0.7177 and the exponential predictor has R2 = 0.7459; therefore, the exponential predictor statistically is a better fit.

With just one independent variable, this example is considered a “univariable regression” model. In reality, the Regression technique can work with any number of independent variables, however, always only one dependent variable. While the “multivariable regression” models are not as easy to visualize as the univariable model, the technique will generate statistics so you can determine the goodness-of-fit.

The MicroStrategy TrainRegression function allows both types of Regression techniques. Both the linear and exponential Regression predictor can be created easily, with just one change to set the type of model.

Example: Predicting quarterly online sales

First, let us create a metric using the TrainRegression function. As you can see from the following, the Train Regression metric has a set of metrics as its inputs, including the On-line Sales metric we are trying to predict (the dependent variable) and several metrics that describe each quarter (the independent variables). The order of these inputs in the metric’s expression does not matter. Next, set the parameters for the TrainRegression metric for it to work properly. Highlight the TrainRegression portion of the expression, right-click it and view the TrainRegression parameters, shown as follows.

The TrainRegression Parameters have the following meanings:

• MinimumR2: This value is used to determine which independent variables will be included in the final regression equation. The R2 value is a number between 0 and 1. A value of 0 indicates no predictive value, whereas a value of 1 indicates high predictive value. The R2 value is calculated for each independent variable, in relation to the dependent variable. If the R2 value is less than the specified MinimumR2, or 0, it will not be included in the generated regression equation. If all the predictive inputs fail to calculate a high enough R2 to be included, no model will be generated and you’ll see a “-1” for each row and the training metric’s column in the training report.
• ModelFileName: This is the physical or relative file name to be created by TrainRegression. If set to, for example, c: empmodel.xml, the output PMML will be written to that file. It is best to always use the .xml suffix. Also, on systems where the MicroStrategy Intelligent Server is remote, a shared drive reference will ensure that the PMML file will be accessible to MicroStrategy Desktop, which is necessary for importing the model.
• ModelName: This name will be incorporated into the output PMML.
• ModelPropName: This parameter can be useful if you want to perform this process programmatically. If you were to write a VB program that set up and executed a training report, this parameter allows you to assign a name to the output PMML property, which is part of the report. After the report has been executed, the program can read the output PMML from this property.
• RegressionType: TrainRegression supports two types of regression:

Multiple Linear Regression (MLR): If this type of regression is specified, the function will attempt to calculate the coefficients of a straight line that best fits the input data. The calculated formula follows the following format: y = b0 + b1x1 + b2x2 + … + bnxn, where y is the target value, and x1 … xn are the independent variables. The MLR technique finds the bn values that best fit the data. To use MLR, set RegressionType = 0.

Multiple Exponential Regression (MER): If this type of regression is specified, the function will attempt to calculate the coefficients of an exponential curve that best fits the input data. This is accomplished by calculating the natural log (ln) of the input target variables and then performing the same calculations used for MLR. Once the straight-line coefficients are calculated, MER takes the natural exponential of the values, which results in the coefficients for the formula of an exponential curve. The calculated formula will follow the following format: y = b0 * (b1^x1) * (b2^x2) * … * (bn^xn), where y is the target value, and x1 … xn are the independent variables. The MER technique finds the bn values that best fit the data. To use MER, set RegressionType = 1.

• Target: This is the name of the dependent variable. This name must correspond to one of the arguments of the TrainRegression function. The case-sensitive name will be incorporated into the output model.
• Variables: This is a list composed of the names of all variables (independents and dependent) in the same order as they are passed into the TrainRegression function. The names should be separated by commas, with no leading or trailing blanks. These case-sensitive names will be incorporated into the output PMML.

Once you have created a metric with one regression type, it is easy to make another metric of the other type by simply copying the metric and changing the RegressionType parameter. It would be good to also change the ModelFileName and ModelName parameters so there is not any conflict with the original training metric.

To use the training metric, add it to your dataset report and run the report. When the report execution finishes, the training metric’s column will contain the results of the Regression model, which is saved to the location specified by the ModelFileName parameter.

To create a predictive model using MicroStrategy

1. Log on to a MicroStrategy project.
2. Open the Metric Editor
3. Using the object browser, locate the TrainRegression function, and add it to the metric formula.
4. Add each independent variable to the TrainRegression argument list.
5. Add the dependent variable to the TrainRegression argument list (order-independent).
6. Highlight and right-click the TrainRegression function name.
7. Select TrainRegession parameters from the pop-up menu.
8. Fill in the appropriate values for each of the function parameters.
9. Click Save and Close on the toolbar. Select a location for the new metric and enter a metric name.
10. Close the Metric Editor.
11. Add the training metric to a training report.
12. Execute the report.

It is recommended that reports containing training metrics have report caching disabled. The reason for this is to insure that the PMML model is always generated. If the report is in the report cache, there is no need for MicroStrategy to execute the training metric, since the results have already been cached. Therefore, the PMML model file is not generated. To disable report caching for a training report, open the report with the Report Editor and select Report Caching Options from the Data menu. This opens the Report Caching Options dialog box, which allows you to disable report caching for the report.

When the report has completed execution, a file containing the generated PMML model can be found at the location specified by the training metric. This model can now be imported into a MicroStrategy project, which is the subject of the next section.

Microstrategy Topics