In dimensional modeling, referential integrity means that every fact table is filled with legitimate foreign keys. Or, to put it another way, no fact table record contains corrupt or unknown foreign key references.
There are only two ways to violate referential integrity in a dimensional schema:
If you don’t pay attention to referential integrity, it is amazingly easy to violate it. The authors have studied fact tables where referential integrity was not explicitly enforced; in every case, serious violations were found. A fact record that violates referential integrity (because it has one or more bad foreign keys) is not just an annoyance; it is dangerous. Presumably, the record has some legitimacy, as it probably represents a true measurement event, but it is stored in the database incorrectly. Any query that references the bad dimension in the fact record will fail to include the fact record; by definition, the join between the dimension and this fact record cannot take place. But any query that omits mention of the bad dimension may well include the record in a dynamic aggregation!
In Figure below, we show the three main places in the ETL pipeline where referential integrity can be enforced. They are:
Choices for enforcing referential integrity
Practically speaking, the first option usually makes the most sense. One of the last steps just before the fact table load is looking up the natural keys in the fact table record and replacing them with the correct contemporary values of the dimension surrogate keys. This process is explained in detail in the next section on the surrogate key pipeline. The heart of this procedure is a special lookup table for each dimension that contains the correct value of the surrogate key to be used for every incoming natural key. If this table is correctly maintained, the fact table records will obey referential integrity. Similarly, when dimension table records are to be deleted, a query must be done attempting to join the dimension record to the fact table. Only if the query returns null should the dimension record be deleted.
The second option of having the database enforce referential integrity continuously is elegant but often too slow for major bulk loads of thousands or millions of records. But this is only a matter of software technology. The Red Brick database system (now sold by IBM) was purpose-built to maintain referential integrity at all times, and it is capable of loading 100 million records an hour into a fact table where it is checking referential integrity on all the dimensions simultaneously!
The third option of checking for referential integrity after database changes have been made is theoretically capable of finding all violations but may be prohibitively slow. The queries checking referential integrity must be of the form:select f.product_key from fact_table f where f.product_key not in (select p.product_key from product_dimension p)
In an environment with a million-row product dimension and a billion row fact table, this is a ridiculous query. But perhaps the query can be restricted only to the data that has been loaded today. That assumes the time dimension foreign key is correct! But this is a sensible approach that probably should be used as a sanity check even if the first approach is the main processing technique.
Data Warehouse ETL Toolkit Related Interview Questions
|Informatica Interview Questions||Data Warehousing Interview Questions|
|Networking Interview Questions||System Administration Interview Questions|
|Hadoop Interview Questions||MYSQL DBA Interview Questions|
|Data modeling Interview Questions||Hadoop Administration Interview Questions|
|Apache Flume Interview Questions||Informatica Admin Interview Questions|
Data Warehouse Etl Toolkit Tutorial
Surrounding The Requirements
Etl Data Structures
Cleaning And Conforming
Delivering Dimension Tables
Delivering Fact Tables
Real-time Etl Systems
All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd
Wisdomjobs.com is one of the best job search sites in India.