Introducing OLAP - Data Mining

OLAP is used for decision support systems to analyze aggregated information for sales, finance, budget, and many other types of applications; while Online Transaction Processing (OLTP) is mainly used to record transactions of daily operations such as updating an account balance for a bank transaction.

OLTP database schema are not organized in a manner that can easily provide the summarized information required by managers at different levels of an organization. Managers need aggregated data from which they can view reports and analyze the trends. They need to know the key indications that affect their business success in order to make critical decisions. They need to find how their enterprise’s workload is affected by seasonal and yearly trends so that they can plan and optimize resources.

The OLTP system is not designed for this kind of decision-support query for two reasons. The first is performance. Getting the summarized information requires querying a large amount of transaction data with joins among many different dimension tables. These queries may take hours for a relational database system to compute; meanwhile, the OLTP system has heavy operational duties. The second reason is the schema. The database schema in the OLTP system is not designed to answer decision-support questions. Extensive numbers of joins among OLTP tables are required to generate the reports the managers need.

An OLAP cube is built for decision-support queries. A cube is a multidimensional database. A typical cube contains a set of well-defined dimensions such as Customer, Product, Store, and Time. Each dimension contains many members. For example, each product and each product category is a member of the Product dimension. Dimension members are organized in hierarchies. For example, the hierarchy in the product dimension is All Products ➪ Category ➪ SubCategory ➪ Product Name. You can query aggregated values on different levels of a hierarchy.

A cube has a set of measures such as Store Sales and Unit Sales. Measurescome from a transaction table (fact table), where details of each transaction are stored. These measures are preaggregated (or partially preaggregated) based on the dimension hierarchies. For example, the store sales of beverage products in 2005 in all the stores in Washington state is calculated during the cubeprocessing stage. When a user queries this information, the result can be retrieved very fast thanks to the precalculation and the multidimensional index structure.

In summary, OLAP is about aggregating measures based on dimension hierarchies and storing these precalculated aggregations in a special data structure. With the help of preaggregations and special indexes, you can query aggregated data and get decision-support query results back in real time, which had to be done in batches and offline mode traditionally. Microsoft OLAP Services was initially introduced in SQL Server 7.0 and largely enhanced in SQL Server 2000 and SQL Server 2005. In this section, we go over the key OLAP concepts, using the FoodMart Sales cube as an example. The Sales cube contains sales data of an imaginary supermarket chain FoodMart that specializes in food products. The data is shipped as an Analysis Services sample database since SQL Server 7.0.

Understanding Star and Snowflake Schema
Astar schema is a database schema with a star shape. The heart of the schema is the fact table, which records the details of each transaction. The fact table contains a large amount of records, and it is the largest table in the database. Surrounding the fact table are a set of dimension tables, which describes the properties of each dimension. Figure shows the star schema of the Food- Mart retail store database. Sales_fact is the fact table that contains every sales transaction from all the FoodMart stores. Product, Customer, Time_by_day, Store, and Promotion are all dimension tables. The fact table is fully normalized. It links to the dimension tables through foreign keys. In addition to foreign keys, a fact table contains a set of numeric columns, which are the measures. A star schema is a typical schema in a database warehouse.

Sometimes a dimension table is partially normalized. For example, in the customer table there is an Education attribute, which contains the educational level of a customer, such as Bachelors, Masters, and so on. To normalize this attribute, you can create a separate Education table that contains two columns: Education_id and Description. The Education attribute in the customer table is changed to Education_id, which is a foreign key from Education table. Education becomes a lookup table for the Education_id attribute of Customer table. Similarly, you can create more lookup tables for other attributes in the dimension table. We call the snowflake shape formed by these relationships a schema. Star schemas can be considered special kinds of snowflake schemas, where there is no lookup table. Both star schema and snowflake schema are popular schemas in data warehouse design.

Understanding Dimension and Hierarchy
Each cube contains a number of dimensions. The Sales cube in FoodMart contains five dimensions: Product, Customer, Time, Store, and Promotion. A dimension has dimension members. Each dimension member is a uniquely identifiable unit within a dimension. For example, each customer is identified by Customer_id in the Customer dimension. Customer_id is the dimension key. A dimension may contain a large number of members. For example, the FoodMart supermarket may have millions of customers.

A dimension member has a set of attributes. In the Customer dimension, a member has attributes such as gender, education, state, city, country, and so on. Each column in the dimension table could be a member attribute.

Star Schema of the FoodMart database

Star Schema of the FoodMart database

Attributes may contain relationships among them. For example, a country has a number of states; a state has a number of cities. These relationships form hierarchies, as displayed in Figure below.

Each hierarchy has a name. The hierarchy in Figure(Dimension hierarchy) is named Geo in the Customer dimension. Country, State, City, and Name are the levels of the hierarchy. USA is a member in the Country level. CA, ID, and WAare members in the State level. The lowest level in the Geo hierarchy is Name. Each customer is a member in the Name level. Hierarchy is a very important concept in OLAP. The hierarchy level is the basic unit of aggregation. Users can query aggregated data at different levels of the hierarchy, for example, to obtain the total beverage sales in Washington state.

The Geo hierarchy is a natural hierarchy; the relationships aong levels exist naturally. You can also define an attribute hierarchy without having a natural relationship. For example, you can build a hierarchy Gender ➪ Education ➪ Name. Based on this hierarchy, the customers are separated first by their gender, then by their educational level. You can query aggregated data about store sales of male customer with bachelor’s degree based on this hierarchy. Analysis Services allows each dimension to have multiple hierarchies.

Understanding Measures and Measure Groups
Measures are the numeric values to be aggregated by the cube. They are based on the numeric columns of the fact table. There are three measure columns in Sales_Fact: Store_Sales, Unit_Sales, and Store_Cost. Measures are the numeric data of primary interest to end users browsing a cube. Each measure specifies an aggregate function that determines how values in the measure’s source column are aggregated. This function also determines how measured values for sibling members are aggregated to produce a value for their parent. The aggregation functions include Sum, Min, Max, Average, Distinct Count, and so on.

Dimension hierarchy

Dimension hierarchy

Acube contains a special type of dimension that contains a member for each measure. This dimension is called the Measure dimension. The Measure dimension doesn’t contain hierarchies. It is always flat with one level. While browsing the cube, you can slice by a member in the Measures dimension to display values for the selected measure; you can also place the Measures dimension on an axis to view all the measures of the cube.

The fact table stores the value of each measure of each individual transaction. Before SQL Server 2005, each cube could have only one fact table. In SQL Server 2005, a cube can actually have multiple fact tables. For example, a cube can store both the information about sales and about inventory, with one sales fact table and one inventory fact table. Due to this enhancement, the fact table is renamed to the measure group. A cube may have multiple measure groups.

Understanding Cube Processing and Storage
Acube contains a set of dimensions and measures. There are two steps to processing a cube: dimension processing (if a dimension hasn’t been processed previously) and cube processing. Dimension processing reads dimension data from underlying dimension tables, builds the dimension structure, creates hierarchies, and assigns members to proper levels of the hierarchy. After all the dimensions are processed, cube processing can be started.

The main task of cube processing is to precalculated aggregations based on the dimension hierarchies. When there are many dimensions and each dimension contains several levels and many members, the total number of aggregations could be exponential. One of the challenges of cube processing is to choose the optimal number of aggregations to precalculate. Other aggregated values can be derived from those precalculated measures efficiently. For example, if the monthly Store_Sales values are preaggregated,uarterly and yearly Store_Sales values can be derived easily.

Both dimension processing and cube processing have the options of full processing and incremental processing. During the processing stage, Analysis Services also builds bitmap indexes that allow efficient access to the aggregated cells.

Analysis Services allows cube to be separated into different partitions. For example, for the FoodMart Sales cube, each month’s fact data can be separated into a partition. Partitions can be processed individually. By the end of each month, only the newest partition needs to be processed. As a partition can contain huge amount of aggregations, Analysis Services provides the following three different storage modes for these partitions:

  1. Multidimensional OLAP (MOLAP): The data of a partition is stored in the special format of Analysis Services, which allows efficient retrieval of multidimensional data. MOLAP is by far the most common storage format.
  2. Relational OLAP (ROLAP): All data is stored in RDBMS. Additional tables are created to store precalculated aggregations. ROLAP has the advantage of scalability as compared to the MOLAP structure. However, the performance of ROLAP is less than optimal.
  3. Hybrid OLAP (HOLAP): The HOLAP partition stores the fact table in RDBMS while all the aggregations are stored in the special format of Analysis Services, just like MOLAP. When your query requires aggregated data, Analysis Services can return these aggregations efficiently because they are stored in its special format. When your query requires atomic-level facts, Analysis Services generates SQL to query the fact table in RDBMS.

Using Proactive Caching
The OLAP server is an efficient counting engine that calculates all the aggregations based on the dimension definitions. A cube can be considered simply as a cache that stores precalculated aggregations. MOLAP is the most popular storage mode as it has a special structure that is optimized for search. MOLAP has good performance, but it requires the cube to be processed. This is in conflict with the real-time concept. A ROLAP cube, on the other hand, directly queries the relational data source. It lets users immediately browse the most recent changes in a data source, but it can have significantly poorer performance than MOLAP storage has.

You may have applications in which your users need to see recent data, but you want the performance advantages of MOLAP storage. The proactive caching feature introduced in Analysis Services 2005 can provide a balance between the enhanced performance of MOLAP storage and the immediacy of ROLAP storage.

With proactive caching, queries against an OLAP object are made against either ROLAP storage or MOLAP storage, depending on whether recent changes have occurred to the data. The query engine directs queries against source data in MOLAP storage until changes occur in the data source. After changes occur in a data source, cached MOLAP objects are dropped and querying switches to ROLAP storage while the MOLAP objects are rebuilt in cache. After the MOLAP objects are rebuilt and processed, queries are switched again to the MOLAP storage. The cache refresh can occur extremely quickly for a small partition.

Caching may also be used without dropping the current MOLAP objects. Queries then continue against the MOLAP objects while data is read into and processed in a new cache. This method provides better performance but may result in queries returning old data while the new cache is being built. Proactive caching features simplify the process of managing data obsolescence. If a transaction occurs on the source database, such as the addition of a new dimension member or new fact transaction, the existing cache becomes obsolete. The proactive caching settings determine how frequently the multidimensional cache is rebuilt, whether the outdated MOLAP storage is queried while the cache is rebuilt, and whether the cache is rebuilt on a schedule or based on changes in the database.

Querying a Cube
Once a cube is processed, you can query the cube to retrieve aggregated information. OLE DB for OLAP has defined a query language for querying OLAP cubes. The language is called MDX, which stands for MultiDimensional Expressions. MDX is similar to SQL in the sense that it follows the Select . . . From . . . Where framework. But MDX is not an extension of SQL and its syntax is much more complicated than standard SQL’s syntax is. SQL only deals with two-dimensional data, while MDX allows for querying data with almost any number of dimensions. In SQL, the Select clause is used to define the column layout for a query, and the Where clause is used to define the row layout. However, in MDX the Select clause can be used to define several axis dimensions, while the Where clause is used to restrict multidimensional data to a specific dimension or member. MDX has also defined hundreds of functions, which helps users specify dimension navigations and calculations. The detailed syntax of MDX can be found in SQL Server Books online. The following is a simple example of MDX query to retrieve the total Unit Sales and Store Sales of beverage products:

The result of this MDX query is shown in Table.

Results of MDX Query to Retrieve Beverage Product Sales

Results of MDX Query to Retrieve Beverage Product Sales

All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd Protection Status

Data Mining Topics