Data Mining in the SSIS Environment

SSIS provides a workflow environment for data extraction, transformation, and loading (ETL), with a set of built-in tasks and transforms.

As you have already learned, the most resource-consuming work in a data mining project is data cleaning and transformation. Naturally, SSIS can be a good complement to a data mining project. You can use this powerful tool to load data from various sources, join them together, normalize column values, remove dirty records, replace missing values, split data into training and testing datasets, and so on.

A data flow example

More than just an ETL tool for data mining, SSIS actually provides a few built-in data mining components in the control flow and data flow environments. These data-mining-specific tasks and transforms are listed in the table below.

Tasks and Transforms for Data Mining

Tasks (control flow): the Data Mining Query task, the Analysis Services Processing task, and the Analysis Services Execute DDL task.
Transforms (data flow): the Data Mining Model Training transform, the Data Mining Query transform, the Term Extraction transform, and the Term Lookup transform.

Data Mining Tasks
In this section, we will take a closer look at a few data-mining-related tasks.

The Data Mining Query Task
The Data Mining Query task is used for executing data mining queries, mainly prediction queries, in the SSIS control flow.

Figure displays the Mining Model tab of the dialog box used to edit the properties of the Data Mining Query task. In this tab, you first specify the connection to a live Analysis Services database. Then you need to specify the mining structure and mining model on which your query is based.

Data Mining Query task — select mining model

Figure shows the Query tab of the Data Mining Query task. In this tab, you input the data mining query statement. You can write the query directly in the text box, or you can click the Build New Query button to invoke the Prediction Query Builder. The Prediction Query Builder provides a graphical environment that helps you build the DMX query.

Your DMX query may contain parameters, and these parameters can be mapped to SSIS variables. For example, you might want to write a query that returns those customers whose probability of having a gold membership card is greater than a given value, where the value is supplied by an SSIS variable named Probability. The ResultSet tab gives you the option to map the query result to an SSIS variable, for example one that holds the full result set as a row collection.
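
A minimal sketch of such a parameterized prediction query, assuming a hypothetical model named [MemberCard Model] and a hypothetical source table NewCustomers:

SELECT t.CustomerID,
       PredictProbability([Member Card], 'Golden') AS GoldProbability
FROM [MemberCard Model]
PREDICTION JOIN
    OPENQUERY([FoodMart], 'SELECT * FROM NewCustomers') AS t
    ON  [MemberCard Model].[Age] = t.[Age]
    AND [MemberCard Model].[Yearly Income] = t.[YearlyIncome]
WHERE PredictProbability([Member Card], 'Golden') > @Probability

At run time, the task's parameter mapping substitutes the value of the SSIS Probability variable for @Probability.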

Figure displays the Output tab of the Data Mining Query task. In this tab, you specify the connection to the destination database where the query result will be stored. It can be the same database as the source data or a different one. You also need to give the name of the result table.

Data Mining Query task — build query

Data Mining Query task — specify output

Analysis Services Processing Task
The Analysis Services Processing task processes major objects in an Analysis Services database. The objects can be dimensions, cubes, mining structures, and mining models.

Figure displays the editor for this task. You can select objects using the Add button. You can also specify the process options and settings in the Process Options column.
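
Conceptually, the task asks the server to process each selected object with an XMLA Process command; a minimal sketch, with hypothetical object IDs, looks like this:

<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>Adventure Works DW</DatabaseID>
    <MiningStructureID>DM Customers</MiningStructureID>
  </Object>
  <Type>ProcessFull</Type>
</Process>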

Analysis Services Processing task

Analysis Services Execute DDL Task
DDL stands for Data Definition Language. In Analysis Services 2005, the DDL is XML-based, and all objects (mining models, mining columns, dimensions, cubes, roles, and so on) are defined with it.

Besides object definitions, DDL also contains a set of commands that facilitate creating, updating, processing, and deleting objects. When you use graphical tools such as the mining model editors to create mining models, these tools produce DDL scripts that are sent to the server. At times, you may want to create or update these objects without using the data mining editors. You can copy or write the DDL scripts yourself and execute them with the Execute DDL task. Figure (Analysis Services Execute DDL task) displays the editor for the Execute DDL task.

First, you need to specify a connection to Analysis Services. Then you click the SourceDirect property, which invokes a text box for DDL commands. The DDL in the figure creates a mining structure named DM Customers in the Adventure Works DW database.
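
An abbreviated sketch of such a Create command (the column definitions, bindings, and mining models are omitted here for brevity):

<Create xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <ParentObject>
    <DatabaseID>Adventure Works DW</DatabaseID>
  </ParentObject>
  <ObjectDefinition>
    <MiningStructure>
      <ID>DM Customers</ID>
      <Name>DM Customers</Name>
      <!-- Columns, bindings, and mining models omitted -->
    </MiningStructure>
  </ObjectDefinition>
</Create>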

Analysis Services Execute DDL task

An Example of a Control Flow Using Data Mining
In the example shown in Figure, the control flow first downloads a dataset of new customers through the FTP task and then loads the data into a SQL Server database using the Bulk Insert task. If there is an error during the data loading, it sends an email to the DBA. If the data is loaded successfully, it executes the Data Mining Query task to find those customers who might be interested in a special promotion; the result of the prediction is saved in a table. After the prediction, it sends emails to the identified customers.

Data Mining Transforms
In this section, we will take a closer look at the data-mining-related transforms.

Data Mining Model Training Transform
The Data Mining Model Training transform (DM Training transform for short) is used for processing a mining structure and its associated mining models in the data flow environment.

A control flow using a data mining query

The Analysis Services Processing task can also process mining models, but you need to specify the input table in the task: during execution, the AS server pulls data from the source table where the training data is stored. In the data flow environment, by contrast, the DM Training transform pushes data from the pipeline to Analysis Services during execution; the AS server receives the training data in push mode. The pipeline data is wrapped by the DM Training transform in XML rowset format before it is sent to the AS server.

Figure displays the Connection tab of the Properties dialog box for the DM training transform. In this tab, you specify the connection to an AS Server database and select the mining structure you want to process using the pipeline data.

You can also create a new mining structure using the New button, which launches the Data Mining Wizard. The wizard walks you through the model creation process in the same way as it does in the Data Mining Editor environment, with a few differences. First, the columns come from a pipeline, not from a relational table. Second, certain features, such as content type autodetection and correlation suggestions, are not available, because these features require scanning and sampling the input data, which is not feasible in the pipeline environment.

Figure shows the Columns tab of the DM Training transform, where you specify the mappings between pipeline columns and mining structure columns. By default, the tool maps columns automatically based on their names.

DM Training transform — Connection tab

DM Training transform — Mapping tab

Data Mining Query Transform
The Data Mining Query transform (DM Query transform), as the name suggests, is a transform for executing data mining queries. You have learned about the various types of DMX queries, including prediction, content, model creation, and model training queries. Among them, the prediction query is the most common.

To execute a prediction query, you need a trained mining model and an input dataset. In the pipeline environment, the input dataset is pushed from the pipeline. The DM Query transform wraps the input data from the pipeline into XML rowsets and sends them to the AS server through the XML/A protocol. The AS server executes the query and returns XML results to the transform. The transform then unwraps the results and pushes them into the pipeline for the next transform to consume. When the input data is large, this process is done in chunks; each time, a set of rows is sent to the AS server for prediction.

Figure displays the Mining Model tab of the transform, where you specify the connection to a database of a live AS server. You also need to select a mining structure and one of its models.

Figure displays the Query tab of the transform, where you can input the DMX query. You can either write the query manually or click the Build New Query button to launch the Prediction Query Builder. The input table for the prediction is the pipeline input, which is denoted as @InputRowset. Your DMX query result may contain nested results, for example when you use the PredictHistogram function in the Select statement. In this case, the nested rowsets are automatically flattened.
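
A minimal sketch of a query this transform might execute, with hypothetical model and column names:

SELECT t.CustomerID, Predict([Member Card]) AS PredictedCard
FROM [MemberCard_DT]
PREDICTION JOIN @InputRowset AS t
    ON  [MemberCard_DT].[Age] = t.[Age]
    AND [MemberCard_DT].[Yearly Income] = t.[YearlyIncome]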

DM Prediction transform — Mining Model tab

DM Prediction transform — Query tab

Example Data Flows
Figure displays a data flow that trains mining models using the DM Training transform. It starts with an OLE DB Source transform, which selects the FoodMart Customer table. The next transform is the Derived Column transform, which creates a new column, Age, derived from the Birthdate column. Once the new column is added to the pipeline, the data is sent to the DM Training transform, which processes the mining structure and its associated mining models in parallel.

Figure displays a data flow using the DM Query transform. The pipeline starts with an OLE DB Source transform, which loads the NewCustomer table. The next transform is Derived Column, which adds the Age column derived from Birthdate. The third transform is DM Query, which uses a trained decision tree model to predict the most likely membership card for each new customer. The result of the prediction is sent to the Conditional Split transform, which splits the data into three pipeline branches based on the membership card type. The left branch contains only gold members, the middle branch contains only silver members, and the right branch contains bronze and normal members. Each branch leads to an OLE DB Destination transform, which stores the VIP customers, important customers, and normal customers, respectively.
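
The Conditional Split conditions can be written as SSIS expressions over the prediction output; a sketch, assuming the DM Query transform added a hypothetical PredictedCard column:

Case 1 (VIP customers):       PredictedCard == "Golden"
Case 2 (important customers): PredictedCard == "Silver"
Default output:               all remaining rows (bronze and normal members)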

A data flow example for model training

A data flow example with prediction

This example also demonstrates that data mining techniques can be applied as an advanced ETL process: data can be split based not only on existing attributes but also on predictable attributes.

Term Extraction Transform
The Term Extraction transform is used to build a glossary of key terms for a specific domain; this is usually the first step of a text mining project. One of the columns in the input pipeline buffer contains text data of a type such as ntext or nvarchar. The purpose of this transform is to analyze this column and build a dictionary of key terms based on its content. The output of the transform is a table of extracted terms. Extracting key terms is not as simple as you may think; sophisticated techniques such as word stemming and grammar parsing are involved. The transform extracts nouns and noun phrases such as "data mining".

The UI of the Term Extraction transform is quite simple. In the first tab, you specify the textual column. You can also name the output columns for the key terms and their associated scores, as shown in Figure. The score is based on TF-IDF, which stands for Term Frequency–Inverse Document Frequency, a statistical technique used to evaluate how important a word is to a document. The importance increases proportionally to the number of times a word appears in the document but is offset by how common the word is across all of the documents in the collection.
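
A common form of the score, which may differ in detail from the transform's exact implementation, is:

TFIDF(t, d) = TF(t, d) × log(N / DF(t))

where TF(t, d) is the number of occurrences of term t in document d, N is the total number of documents, and DF(t) is the number of documents that contain t.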

Term Extraction transform — Term Extraction tab

The second tab gives you the option to specify inclusion and exclusion terms. You may already have a list of predefined terms that must be included and a list of terms that you don't want to be extracted; you can specify these two term lists in this tab. The Term Extraction transform also provides options for the terms; for example, you can specify that terms must be single words or noun phrases. In the case of noun phrases, you can also specify the maximum phrase length.

Term Lookup Transform
The Term Lookup transform is used to search for key terms in the input textual column, based on a dictionary. The dictionary is usually generated by the Term Extraction transform. Since the dictionary is just a table, you can write SQL queries to modify the list by adding or removing terms when necessary.
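
For example, assuming the dictionary was saved to a table named TermDictionary with a Term column (both names are hypothetical), you could prune overly generic terms:

DELETE FROM TermDictionary WHERE Term IN (N'data', N'system');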

The editor for the Term Lookup transform is quite simple. The first tab is used for specifying the reference table (dictionary), as shown in Figure.

Term Extraction transform — Inclusion/Exclusion tab

Term Lookup transform — Reference Table tab

The second tab is used to specify the column mapping, for example, mapping the input textual column to the dictionary. You can also pass through some input columns, such as the document ID.

The Term Lookup transform produces two new columns as output: Term and Frequency. You can think of the output of this transform as a fact table with a large number of rows, containing the document ID, key terms, and the associated frequencies.
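
For illustration, a few hypothetical output rows might look like this:

Document ID    Term            Frequency
101            data mining     3
101            mining model    1
102            prediction      2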

Term Lookup transform — Term Lookup tab

Example of a Text Mining Project
There is much more unstructured data than structured data around us. Think about the Internet, which contains billions of Hypertext Markup Language (HTML) documents that are not structured. Analyzing text documents is becoming more important. Two popular data mining tasks for text mining are classification and segmentation. One of the popular text mining domains is customer support. In the classification example, you have lots of customer feedback from the Web and email, and you want to assign priorities to each feedback instance. In the segmentation example, you want to group similar feedback together and forward them to the right department.

With SQL Server 2005, you can implement a text mining project using SSIS and data mining. A typical flow starts with using Term Extraction to build a dictionary for your business domain. The second step is to use Term Lookup to search for key terms in the textual column, thus converting unstructured data to structured output. The third step is to apply data mining techniques such as Naïve Bayes, neural networks, clustering, and decision trees to build models on the output of Term Lookup. Usually the output of Term Lookup serves as a nested table in the mining model.
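
A sketch of what such a model definition could look like in DMX, with all names hypothetical:

CREATE MINING MODEL FeedbackClassifier
(
    [Doc ID] LONG KEY,
    [Priority] TEXT DISCRETE PREDICT,
    [Terms] TABLE
    (
        [Term] TEXT KEY
    )
)
USING Microsoft_Naive_Bayes

The nested Terms table holds the key terms that Term Lookup found for each document, and Priority is the target the model learns to predict.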

A Term Association model based on a software bug database

The model shown in Figure analyzes the bug description database of a software development project and identifies the relationships among the key terms.
