Transformation Mechanisms - Data Warehousing

You have the following choices for transforming data inside the database:

  • Transformation Using SQL
  • Transformation Using PL/SQL
  • Transformation Using Table Functions

Transformation Using SQL
Once data is loaded into the database, data transformations can be executed using SQL operations. There are four basic techniques for implementing SQL data transformations:

  • CREATE TABLE ... AS SELECT And INSERT /*+APPEND*/ AS SELECT
  • Transformation Using UPDATE
  • Transformation Using MERGE
  • Transformation Using Multitable INSERT

CREATE TABLE ... AS SELECT And INSERT /*+APPEND*/ AS SELECT
The CREATE TABLE ... AS SELECT statement (CTAS) is a powerful tool for manipulating large sets of data. As shown in the following example, many data transformations can be expressed in standard SQL, and CTAS provides a mechanism for efficiently executing a SQL query and storing the results of that query in a new database table. The INSERT /*+APPEND*/ ... AS SELECT statement offers the same capabilities with existing database tables.

In a data warehouse environment, CTAS is typically run in parallel using NOLOGGING mode for best performance.

A simple and common type of data transformation is data substitution. In a data substitution transformation, some or all of the values of a single column are modified. For example, our sales table has a channel_id column. This column indicates whether a given sales transaction was made by a company's own sales force (a direct sale) or by a distributor (an indirect sale).

You may receive data from multiple source systems for your data warehouse. Suppose that one of those source systems processes only direct sales, and thus the source system does not know indirect sales channels. When the data warehouse initially receives sales data from this system, all sales records have a NULL value for the sales.channel_id field. These NULL values must be set to the proper key value. For example, you can do this efficiently using a SQL function as part of the INSERT statement into the target sales table:
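A minimal sketch of such a statement, assuming that sales_activity_direct carries sales_date, product_id, customer_id, promotion_id, quantity, and amount columns, and that 3 is the key of the direct sales channel:

INSERT /*+ APPEND */ INTO sales
  (prod_id, cust_id, time_id, channel_id, promo_id, quantity_sold, amount_sold)
SELECT product_id, customer_id, TRUNC(sales_date), 3, promotion_id, quantity, amount
FROM sales_activity_direct;

Here the SQL function TRUNC normalizes the transaction timestamp to a day, and the constant 3 supplies the channel key that is missing in the source data.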

The structure of source table sales_activity_direct is as follows:

DESC sales_activity_direct


Transformation Using UPDATE
Another technique for implementing a data substitution is to use an UPDATE statement to modify the sales.channel_id column. An UPDATE provides the correct result. However, if the data substitution transformation requires that a very large percentage of the rows (or all of the rows) be modified, then it may be more efficient to use a CTAS statement than an UPDATE.
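For example, a sketch that assigns the assumed direct-sales key 3 to the missing values:

UPDATE sales
   SET channel_id = 3
 WHERE channel_id IS NULL;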

Transformation Using MERGE
Oracle Database's merge functionality extends SQL by introducing the MERGE keyword, which provides the ability to update or insert a row conditionally into a table or a single-table view. Conditions are specified in the ON clause. Besides pure bulk loading, this is one of the most common operations in data warehouse synchronization.

Merge Examples
The following discusses various implementations of a merge. The examples assume that new data for the dimension table products is propagated to the data warehouse and has to be either inserted or updated. The table products_delta has the same structure as products.

Example 14–1 Merge Operation Using SQL
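One possible formulation of such a merge, assuming prod_id is the matching key and that only the price columns need to be refreshed for existing rows (the column lists are illustrative):

MERGE INTO products t
USING products_delta s
ON (t.prod_id = s.prod_id)
WHEN MATCHED THEN UPDATE SET
  t.prod_list_price = s.prod_list_price,
  t.prod_min_price  = s.prod_min_price
WHEN NOT MATCHED THEN INSERT
  (prod_id, prod_name, prod_desc, prod_subcategory, prod_subcategory_desc,
   prod_category, prod_category_desc, prod_status, prod_list_price, prod_min_price)
  VALUES
  (s.prod_id, s.prod_name, s.prod_desc, s.prod_subcategory, s.prod_subcategory_desc,
   s.prod_category, s.prod_category_desc, s.prod_status, s.prod_list_price, s.prod_min_price);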

Transformation Using Multitable INSERT
Many times, external data sources have to be segregated based on logical attributes for insertion into different target objects. It is also common in data warehouse environments to fan out the same source data into several target objects. Multitable inserts provide a new SQL statement for these kinds of transformations, where data can end up in several targets or in exactly one target, depending on the business transformation rules. The insertion can be done conditionally, based on business rules, or unconditionally.

It offers the benefits of the INSERT ... SELECT statement when multiple tables are involved as targets. In doing so, it avoids the drawbacks of the two obvious alternatives: you either had to deal with n independent INSERT ... SELECT statements, thus processing the same source data n times and increasing the transformation workload n times, or you had to choose a procedural approach with a per-row determination of how to handle the insertion. The latter solution lacked direct access to the high-speed access paths available in SQL.

As with the existing INSERT ... SELECT statement, the new statement can be parallelized and used with the direct-load mechanism for faster performance.

Example 14–2 Unconditional Insert

The following statement aggregates the transactional sales information, stored in sales_activity_direct, on a daily basis and inserts it into both the sales and the costs fact tables for the current day.
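A sketch of such a statement; the column names of sales and costs follow the sh sample schema, while the columns assumed for sales_activity_direct and the derivation of unit cost and unit price from the products table are illustrative:

INSERT ALL
  INTO sales (prod_id, cust_id, time_id, channel_id, promo_id, quantity_sold, amount_sold)
    VALUES (product_id, customer_id, today, 3, promotion_id, quantity_per_day, amount_per_day)
  INTO costs (prod_id, time_id, promo_id, channel_id, unit_cost, unit_price)
    VALUES (product_id, today, promotion_id, 3, product_cost, product_price)
SELECT TRUNC(s.sales_date) AS today, s.product_id, s.customer_id, s.promotion_id,
       SUM(s.amount)   AS amount_per_day,
       SUM(s.quantity) AS quantity_per_day,
       p.prod_min_price * 0.8 AS product_cost,
       p.prod_list_price      AS product_price
FROM sales_activity_direct s, products p
WHERE s.product_id = p.prod_id
  AND TRUNC(s.sales_date) = TRUNC(SYSDATE)
GROUP BY TRUNC(s.sales_date), s.product_id, s.customer_id, s.promotion_id,
         p.prod_min_price, p.prod_list_price;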

Example 14–3 Conditional ALL Insert

The following statement inserts a row into the sales and costs tables for all sales transactions with a valid promotion and stores the information about multiple identical orders of a customer in a separate table cum_sales_activity. It is possible two rows will be inserted for some sales transactions, and none for others.
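A sketch under the same column assumptions; a NULL promotion_id stands in for "no valid promotion", and cum_sales_activity is assumed to hold the day, product, customer, promotion, quantity, amount, and number of orders:

INSERT ALL
  WHEN promotion_id IS NOT NULL THEN
    INTO sales (prod_id, cust_id, time_id, channel_id, promo_id, quantity_sold, amount_sold)
      VALUES (product_id, customer_id, today, 3, promotion_id, quantity_per_day, amount_per_day)
    INTO costs (prod_id, time_id, promo_id, channel_id, unit_cost, unit_price)
      VALUES (product_id, today, promotion_id, 3, product_cost, product_price)
  WHEN num_of_orders > 1 THEN
    INTO cum_sales_activity
      VALUES (today, product_id, customer_id, promotion_id,
              quantity_per_day, amount_per_day, num_of_orders)
SELECT TRUNC(s.sales_date) AS today, s.product_id, s.customer_id, s.promotion_id,
       SUM(s.amount)   AS amount_per_day,
       SUM(s.quantity) AS quantity_per_day,
       COUNT(*)        AS num_of_orders,
       p.prod_min_price * 0.8 AS product_cost,
       p.prod_list_price      AS product_price
FROM sales_activity_direct s, products p
WHERE s.product_id = p.prod_id
  AND TRUNC(s.sales_date) = TRUNC(SYSDATE)
GROUP BY TRUNC(s.sales_date), s.product_id, s.customer_id, s.promotion_id,
         p.prod_min_price, p.prod_list_price;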

Example 14–4 Conditional FIRST Insert

The following statement inserts into an appropriate shipping manifest according to the total quantity and the weight of a product order. An exception is made for high-value orders, which are also sent by express, unless their weight classification is too high. It assumes the existence of appropriate tables large_freight_shipping, express_shipping, and default_shipping.
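A sketch; the column lists of the three manifest tables and the threshold values are assumptions, and the FIRST clause ensures that each order lands in exactly one of them:

INSERT FIRST
  WHEN total_quantity > 10 OR prod_weight_class >= 5 THEN
    INTO large_freight_shipping
      VALUES (time_id, cust_id, prod_id, prod_weight_class, total_quantity)
  WHEN total_amount > 1000 THEN
    INTO express_shipping
      VALUES (time_id, cust_id, prod_id, prod_weight_class, total_amount, total_quantity)
  ELSE
    INTO default_shipping
      VALUES (time_id, cust_id, prod_id, total_quantity)
SELECT s.time_id, s.cust_id, s.prod_id, p.prod_weight_class,
       SUM(s.amount_sold)   AS total_amount,
       SUM(s.quantity_sold) AS total_quantity
FROM sales s, products p
WHERE s.prod_id = p.prod_id
  AND s.time_id = TRUNC(SYSDATE)
GROUP BY s.time_id, s.cust_id, s.prod_id, p.prod_weight_class;

High-value orders therefore reach express_shipping only when they are not already caught by the quantity and weight test of the first WHEN clause.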

Example 14–5 Mixed Conditional and Unconditional Insert

The following example inserts new customers into the customers table and stores all new customers with cust_credit_limit higher than 4500 in an additional, separate table for further promotions.
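A sketch, assuming customers_new has the same structure as customers and that customers_special records only the customer key and credit limit:

INSERT FIRST
  WHEN cust_credit_limit > 4500 THEN
    INTO customers
    INTO customers_special VALUES (cust_id, cust_credit_limit)
  ELSE
    INTO customers
SELECT * FROM customers_new;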

Transformation Using PL/SQL
In a data warehouse environment, you can use procedural languages such as PL/SQL to implement complex transformations in the Oracle Database. Whereas CTAS operates on entire tables and emphasizes parallelism, PL/SQL provides a row-based approach and can accommodate very sophisticated transformation rules. For example, a PL/SQL procedure could open multiple cursors and read data from multiple source tables, combine this data using complex business rules, and finally insert the transformed data into one or more target tables. It would be difficult or impossible to express the same sequence of operations using standard SQL statements.
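A minimal sketch of such a procedure; the staging tables orders_stage and order_channels_stage, the channel mapping, and the promotion key are all assumptions:

CREATE OR REPLACE PROCEDURE load_direct_sales IS
  CURSOR c_src IS
    SELECT o.prod_id, o.cust_id, o.order_date, o.quantity, o.amount, c.channel_code
    FROM   orders_stage o, order_channels_stage c
    WHERE  c.order_id = o.order_id;
  v_channel_id sales.channel_id%TYPE;
BEGIN
  FOR r IN c_src LOOP
    -- business rule: map the source channel code to the warehouse channel key
    IF r.channel_code = 'WEB' THEN
      v_channel_id := 4;   -- assumed key for internet sales
    ELSE
      v_channel_id := 3;   -- assumed key for direct sales
    END IF;
    INSERT INTO sales (prod_id, cust_id, time_id, channel_id, promo_id,
                       quantity_sold, amount_sold)
    VALUES (r.prod_id, r.cust_id, TRUNC(r.order_date), v_channel_id,
            999,                   -- assumed key for 'no promotion'
            r.quantity, r.amount);
  END LOOP;
  COMMIT;
END;
/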

Using a procedural language, a specific transformation (or a number of transformation steps) within a complex ETL processing flow can be encapsulated, reading data from an intermediate staging area and generating a new table object as output. A previously generated table serves as input to such a transformation, and a subsequent transformation in turn consumes the table it generates. Alternatively, these encapsulated transformation steps within the complete ETL process can be integrated seamlessly, streaming sets of rows between each other without the need for intermediate staging. You can use table functions to implement such behavior.

Transformation Using Table Functions
Table functions provide support for pipelined and parallel execution of transformations implemented in PL/SQL, C, or Java. Scenarios such as those mentioned earlier can be implemented without requiring intermediate staging tables, which interrupt the data flow through the various transformation steps.

What is a Table Function?
A table function is defined as a function that can produce a set of rows as output. Additionally, table functions can take a set of rows as input. Prior to Oracle9i, PL/SQL functions:

  • Could not take cursors as input.
  • Could not be parallelized or pipelined.

Now, functions are not limited in these ways. Table functions extend database functionality by allowing:

  • Multiple rows to be returned from a function.
  • Results of SQL subqueries (that select multiple rows) to be passed directly to functions.
  • Functions to take cursors as input.
  • Functions to be parallelized.
  • Result sets to be returned incrementally for further processing as soon as they are created. This is called incremental pipelining.

Table functions can be defined in PL/SQL using a native PL/SQL interface, or in Java or C using the Oracle Data Cartridge Interface (ODCI).

The following figure illustrates a typical aggregation where you input a set of rows and output a set of rows, in this case after performing a SUM operation.

Table Function Example

The pseudocode for this operation would be similar to:
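In pseudocode form, with In and Out denoting the input and output tables of the figure:

INSERT INTO Out SELECT * FROM ("Table Function"(SELECT * FROM In));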

The table function takes the result of the SELECT on In as input and delivers a set of records in a different format as output for a direct insertion into Out.

Additionally, a table function can fan out data within the scope of an atomic transaction. This can be used for many purposes, such as an efficient logging mechanism or a fanout to other independent transformations. In such a scenario, a single staging table is needed.

Pipelined Parallel Transformation with Fanout

The pseudocode for this would be similar to:
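In pseudocode form, with source, tf1, tf2, and target following the naming of the figure:

INSERT INTO target SELECT * FROM (tf2(SELECT * FROM (tf1(SELECT * FROM source))));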

This will insert into target and, as part of tf1, into Stage Table 1 within the scope of an atomic transaction.

INSERT INTO target SELECT * FROM tf3(SELECT * FROM stage_table1);

Examples of Table Function Fundamentals

The following examples demonstrate the fundamentals of table functions, without the usage of complex business rules implemented inside those functions. They are chosen for demonstration purposes only, and are all implemented in PL/SQL.

Table functions return sets of records and can take cursors as input. Besides the sh sample schema, you have to set up the following database objects before using the examples:
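The table type created below is built on an object type product_t, assumed here to mirror the relevant columns of the products dimension (the exact data types are illustrative):

CREATE TYPE product_t AS OBJECT (
  prod_id               NUMBER(6),
  prod_name             VARCHAR2(50),
  prod_desc             VARCHAR2(4000),
  prod_subcategory      VARCHAR2(50),
  prod_subcategory_desc VARCHAR2(2000),
  prod_category         VARCHAR2(50),
  prod_category_desc    VARCHAR2(2000),
  prod_weight_class     NUMBER(3),
  prod_unit_of_measure  VARCHAR2(20),
  prod_pack_size        VARCHAR2(30),
  supplier_id           NUMBER(6),
  prod_status           VARCHAR2(20),
  prod_list_price       NUMBER(8,2),
  prod_min_price        NUMBER(8,2));
/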

CREATE TYPE product_t_table AS TABLE OF product_t;
/
COMMIT;

REM artificial help table, used later
CREATE TABLE obsolete_products_errors (prod_id NUMBER, msg VARCHAR2(2000));

The following example demonstrates a simple filtering; it shows all obsolete products except the prod_category Electronics. The table function returns the result set as a set of records and uses a weakly typed ref cursor as input.
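A sketch of such a function; it uses the generic SYS_REFCURSOR type as the weakly typed input and the product_t and product_t_table types set up above (the function name and the filter literals are illustrative):

CREATE OR REPLACE FUNCTION obsolete_products(cur SYS_REFCURSOR)
RETURN product_t_table
IS
  prod_id               NUMBER(6);
  prod_name             VARCHAR2(50);
  prod_desc             VARCHAR2(4000);
  prod_subcategory      VARCHAR2(50);
  prod_subcategory_desc VARCHAR2(2000);
  prod_category         VARCHAR2(50);
  prod_category_desc    VARCHAR2(2000);
  prod_weight_class     NUMBER(3);
  prod_unit_of_measure  VARCHAR2(20);
  prod_pack_size        VARCHAR2(30);
  supplier_id           NUMBER(6);
  prod_status           VARCHAR2(20);
  prod_list_price       NUMBER(8,2);
  prod_min_price        NUMBER(8,2);
  objset product_t_table := product_t_table();
  i NUMBER := 0;
BEGIN
  LOOP
    -- fetch from the weakly typed cursor variable
    FETCH cur INTO prod_id, prod_name, prod_desc, prod_subcategory,
      prod_subcategory_desc, prod_category, prod_category_desc,
      prod_weight_class, prod_unit_of_measure, prod_pack_size,
      supplier_id, prod_status, prod_list_price, prod_min_price;
    EXIT WHEN cur%NOTFOUND;   -- exit when the last row is fetched
    -- keep only obsolete products outside the Electronics category
    IF prod_status = 'obsolete' AND prod_category != 'Electronics' THEN
      i := i + 1;
      objset.EXTEND;
      objset(i) := product_t(prod_id, prod_name, prod_desc, prod_subcategory,
        prod_subcategory_desc, prod_category, prod_category_desc,
        prod_weight_class, prod_unit_of_measure, prod_pack_size,
        supplier_id, prod_status, prod_list_price, prod_min_price);
    END IF;
  END LOOP;
  CLOSE cur;
  RETURN objset;
END;
/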

You can use the table function in a SQL statement to show the results. Here we use additional SQL functionality for the output:
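For example, with DISTINCT and UPPER as the additional SQL functionality and the cursor selecting exactly the columns the sketched function fetches:

SELECT DISTINCT UPPER(prod_category), prod_status
FROM TABLE(obsolete_products(
       CURSOR(SELECT prod_id, prod_name, prod_desc, prod_subcategory,
                     prod_subcategory_desc, prod_category, prod_category_desc,
                     prod_weight_class, prod_unit_of_measure, prod_pack_size,
                     supplier_id, prod_status, prod_list_price, prod_min_price
              FROM products)));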

The following example implements the same filtering as the first one. The main differences between the two are:

  • This example uses a strongly typed REF cursor as input and can be parallelized based on the objects of the strongly typed cursor, as shown in one of the following examples.
  • The table function returns the result set incrementally as soon as records are created.

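A sketch of such a pipelined variant; the package cursor_pkg, the strongly typed REF CURSOR over products, and the function name are assumptions:

CREATE OR REPLACE PACKAGE cursor_pkg AS
  TYPE strong_refcur_t IS REF CURSOR RETURN products%ROWTYPE;
END cursor_pkg;
/
CREATE OR REPLACE FUNCTION obsolete_products_pipe(cur cursor_pkg.strong_refcur_t)
RETURN product_t_table
PIPELINED
PARALLEL_ENABLE (PARTITION cur BY ANY)
IS
  rec products%ROWTYPE;
BEGIN
  LOOP
    FETCH cur INTO rec;
    EXIT WHEN cur%NOTFOUND;
    IF rec.prod_status = 'obsolete' AND rec.prod_category != 'Electronics' THEN
      -- return each qualifying row as soon as it is produced
      PIPE ROW (product_t(rec.prod_id, rec.prod_name, rec.prod_desc,
        rec.prod_subcategory, rec.prod_subcategory_desc, rec.prod_category,
        rec.prod_category_desc, rec.prod_weight_class, rec.prod_unit_of_measure,
        rec.prod_pack_size, rec.supplier_id, rec.prod_status,
        rec.prod_list_price, rec.prod_min_price));
    END IF;
  END LOOP;
  CLOSE cur;
  RETURN;
END;
/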

You can use the table function as follows:
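For example, reusing the names from the sketch above:

SELECT DISTINCT UPPER(prod_category), prod_status
FROM TABLE(obsolete_products_pipe(CURSOR(SELECT * FROM products)));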

We now change the degree of parallelism for the input table products and issue the
same statement again:

ALTER TABLE products PARALLEL 4;

The session statistics show that the statement has been parallelized:

SELECT * FROM V$PQ_SESSTAT WHERE statistic = 'Queries Parallelized';

1 row selected.

Table functions are also capable of fanning out results into persistent table structures. This is demonstrated in the next example. The function returns all obsolete products except those of a specific prod_category (default Electronics), which was set to status obsolete by mistake. The result set of the table function consists of all other obsolete product categories. The detected wrong prod_id values are stored in a separate table structure, obsolete_products_errors. Note that if a table function is part of an autonomous transaction, it must COMMIT or ROLLBACK before each PIPE ROW statement to avoid an error in the calling subprogram. The example furthermore demonstrates how normal variables can be used in conjunction with table functions:

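A sketch of such a fanout function, reusing the types and the package from the sketches above; the function name, the default category, and the error message text are assumptions:

CREATE OR REPLACE FUNCTION obsolete_products_dml(
  cur      cursor_pkg.strong_refcur_t,
  prod_cat VARCHAR2 DEFAULT 'Electronics')
RETURN product_t_table
PIPELINED
IS
  PRAGMA AUTONOMOUS_TRANSACTION;
  rec products%ROWTYPE;
BEGIN
  LOOP
    FETCH cur INTO rec;
    EXIT WHEN cur%NOTFOUND;
    IF rec.prod_status = 'obsolete' THEN
      IF rec.prod_category = prod_cat THEN
        -- fan out: log the products of the wrongly obsoleted category
        INSERT INTO obsolete_products_errors
        VALUES (rec.prod_id, 'correction: category ' || UPPER(prod_cat) || ' still available');
        COMMIT;   -- commit the autonomous transaction before the next PIPE ROW
      ELSE
        PIPE ROW (product_t(rec.prod_id, rec.prod_name, rec.prod_desc,
          rec.prod_subcategory, rec.prod_subcategory_desc, rec.prod_category,
          rec.prod_category_desc, rec.prod_weight_class, rec.prod_unit_of_measure,
          rec.prod_pack_size, rec.supplier_id, rec.prod_status,
          rec.prod_list_price, rec.prod_min_price));
      END IF;
    END IF;
  END LOOP;
  CLOSE cur;
  RETURN;
END;
/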

The following query shows all obsolete product groups except the prod_category Electronics, which was wrongly set to status obsolete:
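For example, using the sketched function above:

SELECT DISTINCT UPPER(prod_category), prod_status
FROM TABLE(obsolete_products_dml(CURSOR(SELECT * FROM products)));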

As you can see, there are some products of the prod_category Electronics that were obsoleted by accident:
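One way to see them, using the help table created earlier:

SELECT DISTINCT msg FROM obsolete_products_errors;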

Taking advantage of the second input variable, you can specify a different product group than Electronics to be considered:
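For example, treating a hypothetical Photo category as the one obsoleted by mistake:

SELECT DISTINCT UPPER(prod_category), prod_status
FROM TABLE(obsolete_products_dml(CURSOR(SELECT * FROM products), 'Photo'));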

Because table functions can be used like a normal table, they can be nested, as shown in the following:
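For instance, feeding the output of one sketched function into the other:

SELECT DISTINCT UPPER(prod_category), prod_status
FROM TABLE(obsolete_products(
       CURSOR(SELECT * FROM TABLE(obsolete_products_pipe(
         CURSOR(SELECT * FROM products))))));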

The biggest advantage of Oracle Database's ETL is its toolkit functionality, where you can combine any of the previously discussed functionality to improve and speed up your ETL processing. For example, you can take an external table as input, join it with an existing table, and use the result as input for a parallelized table function to process complex business logic. That table function can then be used as the input source for a MERGE operation, thus streaming the new information for the data warehouse, provided in a flat file, through the complete ETL process within one single statement.

