Hashing Functions Teradata

Teradata uses parallel processing with its architecture of AMPs and PEPs. The Primary Index must be chosen whenever you create a table because it is the sole determinant of which AMPs owns which rows. This concept pertains to data storage and data retrieval. Picking the proper column(s) for the Primary Index is extremely important for distribution and therefore, for performance. The hashing functions introduced in this section provide information pertaining to the selection of the AMP where each individual row is stored.

As mentioned previously, Teradata uses a hashing algorithm to determine which AMP is responsible for a data row's storage and retrieval. This mathematical meat grinder is configured to generate a 32-bit binary number called the Row Hash for any data value passed to it. This makes the placement of rows on AMPs a random process. Yet, it will generate the same 32-bit value whenever the same data value is passed into it. Then, the first 16-bits called the Destination Selection Word (DSW) is used to select the appropriate AMP. Using this technique over a large number of rows with unique values, the system normally generates a good distribution of data rows the vast majority of the time.

There are now hashing functions, which can be incorporated into SQL, to produce and use the same hash value result for testing current or future distribution levels. In other words, these functions can be used to evaluate the distribution of the rows within any or all tables or determine the acceptability of other columns as a potential primary index.

HASHROW

The HASHROW function is used to produce the 32-bit binary (BYTE(4) data type) Row Hash that is stored as part of the data row. It can return a maximum of 4,294,967,295 unique values. The values produced range from 0 to FFFFFFFF.

One might think that 16 different values can be passed to the HASHROW function since 16 columns can be used in an index. However, we have used up to 50 different values and it continues to produce unique output values.

The basic syntax for using the HASHROW function follows:

SELECT HASHROW( [ <data-column-value> [, <data-column-value2> ... ] ] ) ;

Examples using HASHROW:


1 Row ReturnedHASHROW

Now that the functionality has been demonstrated, a more realistic use might be the following to examine the data distribution and determine the average number of rows per value:

SELECT COUNT(*) / COUNT(DISTINCT(HASHROW(Student_id))) AS AVG_ROW_CT FROM Student_table;
1 Row Returnedavg-row

As good as this is, the HASHROW function does not provide a lot more help in the evaluation process. However, when combined with the other Hashing Functions, it yields some very helpful data demographics.

HASHBUCKET

The HASHBUCKET function is used to produce the 16-bit binary Hash Bucket (the DSW) that is used with the Hash Map to determine the AMP that should store and retrieve the data row. It can return a maximum of 65,536 unique values. The values range from 0 to 65535, not counting the NULL as a potential result. The input to the HASHBUCKET is the 32-bit Row Hash value.

The basic syntax for using the HASHBUCKET function follows:

SELECT HASHBUCKET( [ <row-hash-value> ] ) ;

Example using HASHBUCKET function:

SELECT HASHBUCKET(NULL) AS NULL_BUCKET ,HASHBUCKET() AS NO_BUCKET;
1 Row ReturnedHASHBUCKET

A more realistic use is to determine the number of rows in each Hash Bucket using the following SELECT:


10 Rows ReturnedHash Bucket

The Hash Bucket is also known as the Destination Selection Word (DSW). This is due to its use of the Hash Bucket to determine the destination AMP. Like Teradata, the SELECT can use the HASHBUCKET as input into the HASHAMP function.

HASHAMP

The HASHAMP function returns the identification number of the primary AMP for any Hash Bucket number.

The input to the HASHAMP function is an integer value in the range of 0 to 65535. When no value is passed to the HASHAMP function, it returns a number that is one less than the number of AMPs in the current system configuration. If any other data type is passed to it, a run-time error occurs.

The basic syntax for using the HASHAMP function follows:

SELECT HASHAMP( <hash-bucket> ) ;

Examples using HASHAMP function:

SELECT HASHAMP(NULL) AS NULL_BUCKET ,HASHAMP () AS NO_Bucket;
1 Row ReturnedExamples using HASHAMP function

The following SELECT displays the AMP where each row lives:

10 Rows Returned:following SELECT displays the AMP where each row lives

This example outputs one row for every row in the table. This is a small table so the previous example is feasible. However, most tables have up to millions of rows and this SELECT is not meaningful.

The following makes Teradata do the evaluation:


3 Rows Returned:following makes Teradata do the evaluation

The one thing that becomes obvious is that on this system, the Student Table does not have rows on all AMPs. In a production environment, this might be a problem. The Primary Index might need to be re-evaluated.

HASHBAKAMP

The HASHBAKAMP function returns the identification number of the Fallback AMP for any Hash Bucket number.

The input to the HASHBAKAMP function is an integer value in the range of 0 to 65535. When no value is passed to the HASHAMP function, it returns a number that is one less than the number of AMPs in the current system configuration. If any other data type is passed to it, a run-time error occurs.

The basic syntax for using the HASHBAKAMP function follows:

SELECT HASHBAKAMP ( <hash-bucket> ) ;

Examples using HASHBAKAMP function:

SELECT HASHBAKAMP (NULL) AS NULL_BUCKET ,HASHBAKAMP () AS N0_Bucket;
1 Row Returned:Examples using HASHAMP function

The following shows the Student_ID, the Bucket_No and the AMP that contains the fallback rows for each Student_ID.


10 Rows Returnedhashamphashamp

The output once again reflects one row for each row of the table. This is a small table and therefore it is reasonable to output a small number of rows. However, most tables have up to millions of rows and this SELECT would not be meaningful.

The following makes Teradata do the evaluation:


4 Rows Returned:following makes Teradata do the evaluation

The FALLBACK rows for the Student table are spread better than the Primary Index and do have rows on all AMPs. So, the real question might be, "How well do all the rows (Primary + FALLBACK) distribute across all of the AMPS?"

The following SELECT can help determine that situation by finding all the Primary rows with their AMP and all the FALLBACK rows with their AMPs and than adding them together for the total (notice it uses a derived table to consolidate the rows counts):


4 Rows Returned:FALLBACK rows with their AMPs

As seen here, the spread of both the primary data and the fallback data is very consistent. These are great functions to evaluate actual data distribution on the column(s) that are the primary index or any column can be used to test other distribution values.


All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

Teradata Topics