Dimensional data modeling - Qlik View

We will first take a moment to review a little bit of theory, and even some history. If you are already familiar with dimensional modeling, feel free to skip to The associative data model section. Otherwise, read on to see how the data models used in transaction processing systems came to be, why these data models are hard to query, and how an alternative modeling technique solves these problems.

Back in the day

When computers first appeared on the scene, the methods for storing, retrieving, and modifying data were still in their infancy. For example, when storing a customer order, it was likely that all of the data from the paper order form was directly copied into a single record or file.

While it was convenient to have the data digitally available, people quickly realized that storing and manipulating the data was not. As each record was stored on its own, it was hard to keep the data consistent. Imagine customer addresses and product information being repeated on every order, and you will agree that updating and keeping the data consistent is a painful exercise.

To counter these issues, and save expensive storage space, developers started to apply their own optimizations, often splitting data out into separate tables. While this approach was a step in the right direction, it also came with a downside. The algorithms for linking and working with these tables needed to be embedded within the applications, adding significant complexity.

Relational databases and ER modeling

Fortunately, Relational Database Management System (RDBMS) came to the rescue by the early 1980's and solved part of the problem. These were systems dedicated entirely to storing, retrieving, and modifying data.

At the same time, the Entity-Relationship Modeling (ER modeling) technique became fashionable. ER modeling aims to remove all redundancy from the data model. This technique greatly improved and simplified transaction processing. For example, instead of needing to update the same customer address information in each separate record, only a single update to a customer address master table is made. This customer address is then referenced in other tables using a customer address key, a field which uniquely identifies each customer address.

While all of these advancements greatly improved the efficiency of inserting and updating information in a transactional database, it also made it increasingly harder to get information out of it. For example, consider the following table containing attributes related to Aircraft Types:

% Aircraft Type ID
Aircraft Name
Aircraft Manufacturer
Aircraft Manufacturer Country
Aircraft Name Full
Aircraft Name Abbreviated
Aircraft Begin Date
Aircraft End Date
Aircraft Age Classification
Engine Type
Number Of Engines

If we were to model the information in this table using ER modeling, the data could be normalized into the model, as you can see in the following screenshot:

Relational databases and ER modeling

This is only some of the information relating to aircrafts. Imagine if we'd model all of the information we're interested in (airports, carriers, countries, flights, personnel, and so on), we might end up with dozens, or even hundreds, of tables!

When dealing with database sources other than a data warehouse or data mart, this is the most likely scenario we will encounter, and it might get even more complex. For example, the SAP ERP system has thousands of tables for all of the different entities it handles.

Dimensional modeling

When ER models get too complicated to query, dimensional modeling can offer a practical solution. A dimensional data model is composed of a single fact table. This fact table contains a compound primary key, with separate keys linking the fact table to the dimension tables.

These dimension tables contain descriptions and attributes that provide context to the metrics stored in the fact table. Dimensions often contain data on multiple hierarchical levels that are "flattened" (or de normalized) into a single table. For example, in our Aircraft Types dimension table, we have both Aircraft and Aircraft Manufacturer in the same table.

In addition to keys to the dimension tables, fact tables also contain measures. For example, metrics such as transported passengers or available seats are often additive, allowing them to be summed over various dimensions. An example would be: transported freight per aircraft type per month, in which transported freight is a fact, while aircraft type and month are dimensions that add context to the measure.

The fact and dimension tables are usually combined into a star schema. This name is used because, with some imagination, the data model resembles a star. An example of a star schema is shown in the following screenshot:

Dimensional modeling

The Flights table is the fact table, containing all of the measures as well as links to the surrounding dimension tables. A big advantage of the star schema is that it can be easily understood, and business users can easily recognize the names of the tables and how they relate to each other.

The snowflake schema

In the earlier example, each dimension table is completely de normalized. There is a second type of dimensional model in which dimensions are not necessarily fully flattened. This is called a snowflake schema as (again with some imagination) the diagram resembles a snowflake. An example is depicted in the following screenshot. It is basically the same schema as the previous example, but the Aircraft dimension is not completely de normalized.

The snowflake schema

In an RDBMS, the snowflake schema is sometimes chosen when trying to save disk space, as it removes duplicate values from the dimension tables. Since QlikView automatically removes duplicates from the data model, using a snowflake schema is generally not a preferred approach.

Creating the dimensional model

So, how do we go from an ER diagram to a dimensional model? The first thing to understand is that an ER diagram does not directly translate into a single star schema. A transactional system often contains data used across many different business processes. For example, think of how many different business processes and functions a typical ERP system supports: accounting, human resources, manufacturing, supply chain management, and customer relationship management, just to name a few. The data for all of these processes is usually stored in a single ER schema.

The first step in converting from an ER schema to a dimensional schema is dividing the ER schema into separate business processes. Each of these business processes will be modeled into a separate star schema.

The next step is to declare the granularity of the business process (for example, a single flight or one salary payment). We then group the measures that are used in the business process into a single fact table.

After that, the remaining tables are flattened into dimension tables and directly linked to the fact table using a single key. It is possible for the same dimension table to be used in multiple star schemas. This is called conformed dimension. For example, the employee dimension can be used in the context of Airline Operations as well as in the payroll context.

In QlikView, we can use QVD files to store conformed dimensions. For example, we can store the Aircraft Type dimension into a QVD file and use that file in any application that requires information about the Aircraft. This way, data consistency is ensured across applications.

Dealing with multiple fact tables

As described previously, each business process is modeled into a separate star schema. When dealing with multiple fact tables in a single QlikView application, loops and synthetic keys can occur. To solve this problem, fact tables can be concatenated, or a link table can be added. The techniques for how to achieve this are described in Chapter ( Data Modeling Best Practices).

Dimensional models in QlikView

In QlikView, the main benefit of using a dimensional model is increased response time. QlikView just works faster when there are fewer links between tables. A second benefit is that a dimensional model is very easy to understand, and can be extended gracefully. It is very easy to add new facts (as long as they share the same granularity), new dimensions, or extend existing dimensions.

As QlikView works better with fewer links between tables, you may wonder if it might work even better when we just use a single flat table. If fewer links is better, no links must be best, right? As with many things, the answer to this question is "It depends." The next diagram shows a generalized list of the pros and cons of various modeling approaches in QlikView:

Dimensional models in QlikView

Overall, in our experience, the star schema is the preferred model as it offers a good balance between the various trade-offs.


All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

Qlik View Topics