Join Index Teradata

Compatibility: Teradata Extension

A Join Index is an index table that pre-joins the joined rows of two or more tables and, optionally, aggregates selected columns. They are used to speed up queries that frequently join certain tables. Teradata join indexes can be defined as hash-ordered or value-ordered. Join indexes are defined in a way that allows join queries to be resolved without accessing or joining their underlying base tables.
A Join Index takes two or more tables and physically joins the tables together into anotherphysical index table. It also updates the Join Index table when the base rows of the joining base tables are updated. Join indexes are a great way to aggregate columns from tables with a large range of values. A Join Index can play the role of a summary table without denormalizing the logical design of the database and without causing update anomalies presented by denormalized tables. This actually gives you the ability to keep your detail data in 3rd normal form and your summary tables in a star schema format. Brilliant Teradata!

The syntax for a JOIN INDEX is:

Here is an example of creating a Join Index between the Employee table and the Department table:

In the example above a JOIN INDEX called emp_dept_idx has been created on the tables Employee and Department. The rows from the two tables have been physically joined together and will be maintained when rows in the Employee or Department tables change because of INSERTS, UDATES, or DELETES. When users run queries they don't specify or mention the JOIN INDEX table. Instead, when they run queries that can be satisfied by the JOIN INDEX table faster the Teradata database will choose to pull the data from the JOIN INDEX table instead of the base tables. When SQL is run and the columns asked for are any combination of the above columns defined in the JOIN INDEX then Teradata may choose to use the JOIN INDEX instead of the actual base tables. This is called a covering query.

You can also explicitly define a Primary Index for a JOIN INDEX. Teradata spreads the rows of the join index across the AMPs. The AMPs read and write the rows in parallel. So, utilizing a good Primary Index can be important.

You can also drop a JOIN INDEX. Here is an example:

DROP JOIN INDEX TOMC.JOINIDX_2;

Collecting Statistics

The Teradata PE or optimizer follows the saying, "If you fail to PLAN you PLAN to fail". The PE is responsible for taking users SQL and after optimizing the SQL comes up with a PLAN for the AMPs to follow. The PE is the Boss and the AMPs are the workers. Ask yourself two questions:

  1. Could you have a Teradata system without AMPs?<
    • Of course not! AMPs read and write the data.
  2. Could you have a Teradata system without PE's?
    • Of course not! Could you get along without your boss?
      (kidding, kidding)

The Teradata Parsing Engine (PE) is the best optimizer in the data warehouse world, but it needs you to COLLECT STATISTICS so it can optimize its work. The statistics allow the optimizer to use its vast experience to PLAN the best way to fulfill the query request. It is particularly important for the optimizer to have accurate table demographics when data is skewed.
The purpose of the COLLECT STATISTICS command is to gather and store demographic data for one or more columns or indices of a table or join index. This process computes a statistical profile of the collected data, and stores the synopsis in the Data Dictionary (DD) for use during the PE's optimizing phase of SQL statement parsing. The optimizer uses this synopsis data to generate efficient table access and join plans.
Lets review: The Parsing Engine Processor (PEP) which is also referred to as the optimizer takes SQL requests from a user and comes up with a Plan for the Access Module Processors (AMPs) to execute. The PEP uses statistics to come up with the most cost efficient plan. You must COLLECT STATISTICS on any columns or indices of a table you want the optimizer to use with high confidence.

If statistics are not collected, the PE randomly chooses an AMP in which it will ask a series of questions. The PEP will then estimate based on the total number of AMPs to estimate the number of rows in the entire table. This "guess-timate" value can be inaccurate, especially if the data is skewed. You should COLLECT STATISTICS on all tables. You also have the ability to COLLECT STATISTICS on a Global temporary tables, but not Volatile tables.

We recommend you refresh the statistics whenever the number of rows in a table is changed by 10%. For example, a MultiLoad job may INSERT a million records in a 9 million-row table. Since the table has an additional 10% of new rows it is definitely time to refresh the COLLECT STATISTICS. In reality, we refresh statistics by using the COLLECT STATISTICS command again any time the table changes by more than 10%.

The first time you collect statistics you collect them at the index or column level. After that you just collect statistics at the table level and all previous columns collected previously are collected again. It is a mistake to collect statistics only once and then never do it again. In reality, it is better to have no statistics than to have ridiculously incorrect statistics. This is because the optimizer is gullible and believes the statistics, no matter how inaccurate.

Collecting Statistics is rough on system resources so it is best to do it at night in a batch job or during other off peak times. You can see what statistics have been collected on a table and the date and time the STATISTICS were last collected with the following:

HELP STATISTICS <table-name> command ;

Here are some excellent guidelines on what you should collect statistics on:

  • All Non-Unique indices
  • Non-index join columns
  • The Primary Index of small tables
  • Primary Index of a Join Index
  • Secondary Indices defined on any join index
  • Join index columns that frequently appear on any additional join index columns that frequently appear in WHERE search conditions
  • Columns that frequently appear in WHERE search conditions or in the WHERE clause of joins.

The two key words to collect and drop statistics are:

  • COLLECT STATISTICS
  • DROP STATISTICS

Here is the syntax to collect statistics:

Here is an example of collecting statistics on the column dept and the multicolumn index of lname, fname columns in the employee table.

We have COLLECTED STATISTICS for the employee table in two separate statements. You will always COLLECT STATISTICS on a column or index one at a time initially. You must use the COLLECT STATISTICS command for each column or index you want to collect in a table. In the above examples, we collected statistics on the column dept and the index(lname, fname). You can collectstatistics at either the column or index level. It is best to COLLECT STATISTICS at the column level unless you are dealing with a multi-columnindex. COLLECT at the index level only for indices that are multicolumn indices. Otherwise collect columns and single column indices at the column level. Single column indices actually perform the same COLLECT STATISTICS functions as if they were collected at the column level. Plus, if you drop an index, you lose the statistics.

The table Employee now has COLLECTED STATISTICS defined within the table. Although you must collect statistics the first time at the column or index level you only collect statistics at the TABLE LEVEL for all refreshing of STATISTICS. Here is an example of COLLECTING STATISTICS at the table level.

COLLECT STATISTICS on TomC.Employee;

The system will refresh the COLLECT STATISTICS on the columns and indices it had previously collected on the table.


All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

Teradata Topics