Exploring the Basic Concepts in OLE DB for Data Mining - Data Mining

OLE DB for Data Mining defines the common concepts for a number of data mining objects. We will explain these concepts in this section.

Case
Data mining is about analyzing cases — a case is the basic entity of information. It contains a set of attributes, such as Gender and Age. Attributes are also called variables in statistical terms. An attribute can have a set of possible values, called its state. For example, the attribute Gender has two states: Male and Female.

A case can be simple. For example, when you use customer demographic information to analyze customer loan risk, each case is a customer. It also corresponds to a row in the customer table.

Or a case can be more complicated. For example, when you analyze customer purchasing behavior based on the customers’ demographic data as well as purchase history, each case is a customer together with the list of products the customer has purchased. This type of case is a nested case. A nested case contains at least one table column. Figure shows a few nested cases of customer with their purchase histories.

In OLE DB for Data Mining, a data mining algorithm provider consumes cases. If a case is a nested case, data mining algorithms require the case to be input with hierarchical rowset format.

DDING ALGORITHMS TO SQL SERVER ANALYSIS SERVICE

In SQL Server 2000 Service Pack 1, a new component called Provider Aggregator was added to Analysis Services. This component aggregates other providers to provide a unified set of schema rowsets for metadata. Consumer applications can connect to the MSOLAP provider and query the set of unified schema rowsets to discover the combined set of algorithms that Microsoft and other companies provided. MSOLAP dispatches DMX queries to different providers based on the model algorithm. In SQL Server 2005, Microsoft provides a lower-level interface than OLE DB for Data Mining Provider Aggregator. This lower-level interface is called the Algorithm Plug-In Framework. With this API, algorithm developers no longer need to implement the DMX parsers and schema rowsets. The Microsoft data mining provider parses the query, provides tokenized data to the third-party algorithms, and calls the third-party algorithms’ training or prediction functions. This largely simplifies the development effort for adding algorithms to SQL Server Analysis Services.

Three nested customer cases

Three nested customer cases

The Case Key
The case key is the attribute that uniquely identifies each case. A case key is often the primary key of a relational table. Sometimes a case may have composite case key. For example, FirstName and LastName can both be indicated as the case key.

The Nested Key
Although the case key can be considered as the primary key in the relational terminology, the nested key is very different from the foreign key. The case key is just an identifier and doesn’t contain any patterns (often ignored by data mining algorithm), whereas the nested key is the most important attribute of the nested part of the case. Other attributes in the nested part are used to describe the nested key. For example, if a model is designed to learn patterns about customer purchasing behavior, the nested key is the product. Quantity describes the product purchase. The nested key is not an identifier; it contains useful information about patterns. For example, we can use the nested key Product as input to predict the case-level attribute Gender.

Case Tables and Nested Tables
Acase table is the table containing the case information that’s related to the flat part of the case. Anested table is the table that contains information related to the nested part of the case. Anested table is often a transaction table, for example, purchase history, and Web navigation logs. A nested table can be joined with the case table using the case key. To join the case table and nested table to produce hierarchical rowset, OLE DB defines an operator Shape. We will describe the syntax of the Shape operator in the following sections.

Scalar Columns and Table Columns
Acolumn in a mining model is similar to a column in a relational table; it’s also called a variable or attribute in statistical terminology. Depending on the usage, a data mining model can have four types of columns: key, input, predictable, or a column that’s both input and predictable. The predictable column is the target of the mining model. Most data mining models use the set of input columns to predict an output column. Some algorithms, such as clustering, don’t require predictable columns. In this case, the mining model may containonly input columns. There are two kinds of column structures: scalar and table. Most columns arescalar columns. Each scalar column of an individual record has a single value. For example, Gender and Age are scalar columns. A table column is a special column. A table column embeds a table inside the column. For example, Purchases is a table column. It stores the purchase information for each customer. It contains a table of two columns: Product and Quantity. OLE DB has the hierarchical rowset concept. The flat parts of the rowsets are the scalar columns, while the hierarchical parts are table columns.

The Data Mining Model
Adata mining model, or mining model, can be thought of as a relational table. It contains key columns, input columns, and predictable columns. Each model is associated with a data mining algorithm on which the model is trained. Training a mining model means finding patterns in the training dataset by using specified data mining algorithms with proper algorithm parameters. After training, the data mining model stores patterns that the data mining algorithm discovered about the dataset.While a relational table is a container of records, a data mining model is a container of patterns.

Model Creation
The concept of model creation simply deals with creating an empty data mining model, similar to the way we create a new table.

Model Training
Model training is also called model processing. It is used to invoke the data mining algorithm to uncover knowledge about the training dataset. After training, patterns are stored in the mining model.

Model Prediction
The model prediction is used for applying trained mining model patterns to the new dataset and predicting the potential value of predictable columns for each new case.


All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

Data Mining Topics