Parallel Join Processing Teradata

There are four basic types of joins that Teradata can perform depending on the characteristics of the table definition. When the join domain is the primary index (PI) column, with a unique secondary index (USI) the join is referred to as a nested join and involves, at most, three AMPs. The second type of join is a merge join, with three different forms of a merge join, based on the request. The newest type of join in Teradata is the Row Hash join using the pre-sorted Row Hash value instead of a sorted data value match. This is beneficial since the data row is stored based on the row hash value and not the data value. The last type is the product join.

In Teradata, each AMP performs all join processing in parallel locally. This means that matching values in the join columns must be on the same AMP to be matched. When the rows are not distributed and stored on the same AMP, they must be temporarily moved to the same AMP, in spool. Remember, rows are distributed on the value in the PI column(s). If joins are performed on the PI of both tables, no row movement is necessary. This is because the rows with the same PI value are on the same AMP – easy, but not always practical. Most joins use a primary key, which might be the UPI and a foreign key, which is probably not the PI.

Regardless of the join type, in a parallel environment, the movement of at least one row is normally required. This movement puts all matching rows together on the same AMP. The movement is usually required due to the user's choice of a PI. Remember, it is the PI data value that is used for hashing and row distribution to an AMP. Therefore, since the joined columns are mostly columns other than the PI, rows need to be redistributed to another AMP. The redistributed rows will be temporarily stored in spool space and used from there for the join processing.

The optimizer will attempt to determine the most efficient path for data row movement. Its choice will be based on the amount of data involved. The three join strategies available are: 1-duplicate all rows of one table onto every AMP, 2- redistribute the rows of one table by hashing the non-PI join column and sending them to the AMP containing the matching PI row, and 3- redistribute both tables by hashed join column value.

The duplication of all rows is a popular approach when the non-PI column is on a small table. Therefore, copying all rows is faster than hashing and distributing all rows. This technique is also used when doing a product join and worse, a Cartesian product join.

When both tables are large, the redistribution of the non-PI column row to the AMP with the PI column will be used to save space on each AMP. All participating rows are redistributed so that they are on the same AMP with the same data value used by the PI for the other table.

The last choice is the redistribution of all participating row from both tables by hashing on the join column. This is required when the join is on a column that is not the PI in either table. Using this last type of join strategy will require the most spool space. Still, this technique allows Teradata to quickly join tables together in a parallel environment. By combining the speed of the BYNET, the experience of the PE optimizer, and the hashing capabilities of Teradata the data can be temporarily moved to meet the demands of the SQL query. Do not underestimate the importance or brilliance of this capability. As queries change and place new demands on the data, Teradata is flexible and powerful enough to move the data temporarily and quickly to the proper location.
Redistribution requires overhead processing. It has nothing to do with the join processing, but everything to do with preparing for the join. This is the primary reason that many tables will use a column that is not the primary key column as a NUPI. This way, the join columns used in the WHERE or the ON are used for distribution and the rows are stored on the same AMP. Therefore, the join is performed without need to redistribute data. However, normally some re-distribution is needed. So, make sure to COLLECT STATISTICS on the join columns. The strategy that the optimize chooses can be seen in output from an EXPLAIN.

All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd Protection Status

Teradata Topics