All dimensions should be physically built to have the minimal set of components shown in the figure below. The primary key is a single field containing a meaningless, unique integer. We call a meaningless integer key a surrogate. The data warehouse ETL process should always create and insert the surrogate keys. In other words, the data warehouse owns these keys and never lets another entity assign them.
The primary key of a dimension is used to join to fact tables. Since all fact tables must preserve referential integrity, the primary dimension key is joined to a corresponding foreign key in the fact table. This is shown in our insurance example in the figure in the previous section. We get the best possible performance in most relational databases when all joins between dimension tables and fact tables are based on these single-field integer joins. And finally, our fact tables are much more compact when the foreign key fields are simple integers.
All dimension tables should possess one or more other fields that compose the natural key of the dimension. We show this in the figure below as an ID and designate the natural key field(s) with NK. The natural key is not a meaningless surrogate quantity but rather is based on one or more meaningful fields extracted from the source system. For instance, a simple static (nonchanging) employee dimension would probably have the familiar EMP_ID field, which is probably the employee number assigned by the human resources production system. EMP_ID would be the natural key of this employee dimension. We still insist on assigning a data warehouse surrogate key in this case, because we must insulate ourselves from weird administrative steps that an HR system might take. For instance, in the future we might have to merge in bizarrely formatted EMP_IDs from another HR system in the event of an acquisition.
The basic structure of a dimension
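The three components described above (surrogate primary key, natural key, descriptive attributes) can be sketched as a small schema. This is only an illustration under stated assumptions: SQLite stands in for the warehouse DBMS, and the table names `employee_dim` and `payroll_fact`, the column names, and the sample rows are all hypothetical, not taken from the text.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee_dim (
    employee_key INTEGER PRIMARY KEY,  -- meaningless surrogate key, supplied by the ETL process
    emp_id       TEXT NOT NULL,        -- natural key (NK) from the HR source system
    full_name    TEXT,                 -- descriptive attribute
    department   TEXT                  -- descriptive attribute
);
CREATE TABLE payroll_fact (
    employee_key INTEGER NOT NULL REFERENCES employee_dim(employee_key),
    pay_amount   REAL NOT NULL         -- measured fact
);
""")

# The ETL process, not the source system, supplies the surrogate key values.
conn.executemany(
    "INSERT INTO employee_dim VALUES (?, ?, ?, ?)",
    [(1, "E-1001", "Ada Reyes", "Finance"),
     (2, "E-1002", "Sam Okoro", "Sales")],
)
conn.executemany(
    "INSERT INTO payroll_fact VALUES (?, ?)",
    [(1, 4200.0), (2, 3900.0), (1, 4300.0)],
)

# Fact-to-dimension join on the single integer key.
rows = conn.execute("""
    SELECT d.emp_id, d.department, SUM(f.pay_amount)
    FROM payroll_fact AS f
    JOIN employee_dim AS d ON f.employee_key = d.employee_key
    GROUP BY d.emp_id, d.department
    ORDER BY d.emp_id
""").fetchall()
print(rows)
```

Note that the fact table carries only the integer `employee_key`; the natural key and the attributes live in the dimension and are reached through the join.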
When a dimension is static and is not being updated for historical changes to individual rows, there is a 1-to-1 relationship between the primary surrogate key and the natural key. But we will see a little later in this section that when we allow a dimension to change slowly, we generate many primary surrogate keys for each natural key as we track the history of changes to the dimension. In other words, in a slowly changing dimension, the relationship between the primary surrogate key and the natural key is many-to-1. In our employee dimension example, each of the changing employee profile snapshots would have a different, unique primary surrogate key, but the profiles for a given employee would all have the same natural key (EMP_ID). This logic is explained in detail in the discussion of slowly changing dimensions later in this section.
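The many-to-1 relationship can be seen in a minimal sketch. The helper `add_profile_version`, the in-memory list standing in for the dimension table, and the sample employees are hypothetical, introduced only for illustration.

```python
from itertools import count

next_key = count(1)   # simple source of fresh surrogate keys (assumption)
employee_dim = []     # in-memory stand-in for the dimension table

def add_profile_version(emp_id, department):
    """Insert a new profile snapshot; every snapshot gets its own surrogate key."""
    row = {
        "employee_key": next(next_key),  # surrogate key, unique per snapshot
        "emp_id": emp_id,                # natural key, shared by all snapshots of one employee
        "department": department,
    }
    employee_dim.append(row)
    return row["employee_key"]

add_profile_version("E-1001", "Finance")
add_profile_version("E-1002", "Sales")
add_profile_version("E-1001", "Marketing")  # E-1001 changes department: new snapshot, new key

# Many surrogate keys map back to the one natural key E-1001.
keys_for_e1001 = [r["employee_key"] for r in employee_dim if r["emp_id"] == "E-1001"]
print(keys_for_e1001)
```

In a static dimension the third call would never happen, and each natural key would keep exactly one surrogate key.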
The final component of all dimensions, besides the primary key and the natural key, is the set of descriptive attributes. Descriptive attributes are predominately textual, but numeric descriptive attributes are legitimate. The data warehouse architect probably will specify a very large number of descriptive attributes for dimensions like employee, customer, and product. Do not be alarmed if the design calls for 100 descriptive attributes in a dimension! Just hope that you have clean sources for all these attributes. More on this later.
The data warehouse architect should not call for numeric fields in a dimension that turn out to be periodically measured quantities. Such measured quantities are almost certainly facts, not descriptive attributes. All descriptive attributes should be truly static or should only change slowly and episodically. The distinction between a measured fact and a numeric descriptive attribute is not as difficult as it sounds. In 98 percent of the cases, the choice is immediately obvious. In the remaining two percent, pretty strong arguments can be made on both sides for modeling the quantity either as a fact or as a dimensional attribute. For instance, the standard (catalog) price of a product is a numeric quantity that takes on both roles. In the final analysis, it doesn’t matter which choice is made. The requesting applications will look different depending on where this numeric quantity is located, but the information content will be the same. The difference between these two choices will start to become important if it turns out that the standard price is actually slowly changing. As the pace of the change accelerates, modeling the numeric quantity as a measured fact becomes more attractive.
Generating Surrogate Keys for Dimensions
Creating surrogate keys via the DBMS is probably the most common technique used today. However, we see this trend changing. In the past, it was common practice to have surrogate keys created and inserted by database triggers. Subsequently, it has been determined that triggers cause severe bottlenecks in the ETL process and should be eliminated from any new processes being created. Even though it is still acceptable for the integers for a surrogate key to be maintained by the DBMS, the ETL process should request these integers directly by calling the database sequence. Having the ETL process call the database sequence will produce a significant improvement in ETL performance over the use of database triggers.
Also, using the database to generate surrogate keys almost guarantees that the keys will be out of sync across the different environments of the data warehouse: development, test, and production. As each environment gets loaded at different intervals, their respective databases could generate different surrogate key values for the same incoming dimension records. This lack of synchronization will cause confusion during testing for developers and users alike.
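The synchronization problem is easy to reproduce in miniature. The sketch below assumes SQLite's `AUTOINCREMENT` as a stand-in for database-generated keys; the `load` helper and the sample EMP_IDs are hypothetical.

```python
import sqlite3

def load(emp_ids):
    """Load natural keys into a fresh database and let the DBMS assign surrogate keys."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee_dim ("
        " employee_key INTEGER PRIMARY KEY AUTOINCREMENT,"
        " emp_id TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO employee_dim (emp_id) VALUES (?)",
        [(e,) for e in emp_ids],
    )
    # Map natural key -> surrogate key as assigned in this environment.
    return dict(conn.execute("SELECT emp_id, employee_key FROM employee_dim"))

# The same two records arrive in a different order in each environment.
test_env = load(["E-1002", "E-1001"])
prod_env = load(["E-1001", "E-1002"])

print(test_env["E-1001"], prod_env["E-1001"])  # same employee, different surrogate keys
```

Any query or test case that refers to a specific surrogate key value would then behave differently between the two environments.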
For ultimate efficiency, consider having an ETL tool or third-party application generate and maintain your surrogate keys. Make sure that efficient generation and maintenance of surrogate keys are in your ETL proof-of-concept success criteria.
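As a rough sketch of what ETL-side key generation might look like, the following assumes a static dimension (one surrogate key per natural key) and an in-memory lookup of keys already loaded. The class `SurrogateKeyGenerator` and the function `assign_keys` are hypothetical names; they are not the interface of any particular ETL tool.

```python
class SurrogateKeyGenerator:
    """Hands out consecutive integers, continuing from the highest key already in use."""

    def __init__(self, last_key_used=0):
        self._last = last_key_used

    def next_key(self):
        self._last += 1
        return self._last


def assign_keys(incoming_rows, key_lookup, generator):
    """Attach a surrogate key to each incoming row.

    key_lookup maps natural key -> surrogate key for rows already in the dimension;
    it is updated in place when a new natural key is seen.
    """
    for row in incoming_rows:
        natural_key = row["emp_id"]
        if natural_key not in key_lookup:
            key_lookup[natural_key] = generator.next_key()
        yield {"employee_key": key_lookup[natural_key], **row}


# Keys already present in the dimension (illustrative values).
existing = {"E-1001": 1, "E-1002": 2}
generator = SurrogateKeyGenerator(last_key_used=max(existing.values()))

incoming = [{"emp_id": "E-1002"}, {"emp_id": "E-1003"}]
keyed_rows = list(assign_keys(incoming, existing, generator))
print(keyed_rows)
```

Because the generator keeps its counter in memory, no trigger fires and no database round trip is needed per row; persisting the last key used between runs is left out of this sketch.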
A tempting solution seen repeatedly during design reviews is concatenating the natural key of the source system and a date stamp that reflects when the record was either created in the source system or inserted into the data warehouse.
Giving the surrogate key intelligence—the exact time of its creation—may be useful in some situations, but it is not an acceptable alternative to a true integer-based surrogate key. Intelligent or smart keys fail as an acceptable surrogate key for the following reasons:
The attraction of this forbidden smart-key strategy is its simplicity at ETL development time. When building the first data mart, it is quite simple to implement a smart key by appending SYSDATE to the natural key upon insertion. Avoid the temptation of this prohibited shortcut. This approach doesn’t scale to your second data mart.