# Using the DISTINCT Function with Aggregates in Teradata

At times throughout this book, examples are shown using a function within a function and the power it provides. The COUNT aggregate provides another opportunity to demonstrate a capability that might prove itself useful. It combines the DISTINCT and aggregate functions.

The following may be used to determine how many courses are being taken instead of the total number of students (10) with a valid class code:
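A query of the following form produces this result; the table name `student_table` and column name `class_code` are placeholders for whatever names your site actually uses:

```sql
-- Count only the unique class code values (NULLs are ignored by COUNT)
SEL COUNT(DISTINCT class_code) AS Unique_Courses
FROM student_table;
```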

1 Row Returned

Unique_Courses

4

Versus using all of the values:
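Removing the DISTINCT gives a query like the following (again assuming the hypothetical `student_table`), which counts every non-NULL `class_code` value, duplicates included:

```sql
-- Without DISTINCT, COUNT tallies every non-NULL class_code
SEL COUNT(class_code) AS Unique_Courses
FROM student_table;
```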

1 Row Returned

Unique_Courses

9

It is allowable to use the DISTINCT in multiple aggregates within a SELECT. The only restriction is that all aggregates must use the same column for each DISTINCT function.
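A sketch of that restriction, using the same hypothetical `student_table`: both aggregates below apply DISTINCT, and both apply it to the same column, `class_code`.

```sql
-- Multiple aggregates may each use DISTINCT,
-- provided every DISTINCT references the same column
SEL COUNT(DISTINCT class_code) AS Unique_Courses
   ,SUM(DISTINCT class_code)   AS Sum_Of_Codes
FROM student_table;
```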

## Aggregates and Very Large Data Bases (VLDB)

As great as huge databases might be, there are considerations to take into account when processing large numbers of rows. This section enumerates a few of the situations that might be encountered. Read them and think about the requirement or benefit of incorporating them into your SQL.

### Potential of Execution Error

Aggregates use the data type of the column they are aggregating. On most databases, this works fine. However, when working on a VLDB, this may cause the SELECT to fail on a numeric overflow condition. An overflow occurs when the value being calculated exceeds the maximum size or value for the data type being used.

For example, one billion (1,000,000,000) is a valid value for an integer column because it is less than 2,147,483,647. However, if three rows each have one billion as their value and a SUM operation is performed, it fails on the third row.

Try the following series of commands to demonstrate an overflow and its fix:

```sql
-- Create a table called Overflow_tbl with 2 columns
CT Overflow_tbl (Ovr_byte BYTEINT, Ovr_int INT);
-- Insert 3 rows with very large values of 1 billion,
-- where the INTEGER maximum value is 2,147,483,647
INS overflow_tbl values (1, 10**9);
INS overflow_tbl values (2, 10**9);
INS overflow_tbl values (3, 10**9);
-- A SUM aggregate on these values attempts to total 3 billion
SEL SUM(ovr_int) AS sum_col FROM overflow_tbl;
***** 2616 numeric overflow
```

Attempting this SUM, as written, results in a 2616 numeric overflow error because 3 billion is too large to be stored in an INTEGER, the data type the aggregate inherits from the column being summed. To fix it, convert the column to a larger data type before performing the aggregation.
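Either conversion style below should work; the first uses the ANSI CAST syntax, and the second uses the Teradata extension that names the target type in parentheses. DECIMAL(18,0) is one reasonable choice of larger type, not the only one:

```sql
-- ANSI style: promote the column to DECIMAL(18,0) before summing
SEL SUM(CAST(ovr_int AS DECIMAL(18,0))) AS sum_col FROM overflow_tbl;

-- Teradata extension style: same promotion, shorthand syntax
SEL SUM(ovr_int (DECIMAL(18,0))) AS sum_col FROM overflow_tbl;
```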

1 Row Returned

sum_col

3,000,000,000

Whenever SQL fails with a numeric overflow, the cause is most likely the data type inherited from the column. When this happens, be sure to convert the type before doing the math.

## GROUP BY versus DISTINCT

As seen, DISTINCT is used to eliminate duplicate values. GROUP BY, in turn, consolidates multiple rows with the same value into a single group, and it performs that consolidation by eliminating duplicates. On the surface, the two provide the same functionality.

The next SELECT uses GROUP BY without aggregation to eliminate duplicates:
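The pair of queries below illustrates the comparison, again assuming the hypothetical `student_table` and `class_code` column. With no aggregate in the SELECT list, GROUP BY simply returns one row per group:

```sql
-- GROUP BY with no aggregate: one row per distinct class_code
SEL class_code
FROM student_table
GROUP BY class_code;

-- Equivalent result using DISTINCT
SEL DISTINCT class_code
FROM student_table;
```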

5 Rows Returned

The GROUP BY without aggregation returns the same rows as the DISTINCT. So the obvious question becomes, which is more efficient? The answer is not a simple one. Instead, something must be known about the characteristics of the data. Generally, with more duplicate data values – GROUP BY is more efficient. However, if only a few duplicates exist – DISTINCT is more efficient. To understand the reason, it is important to know how each of them eliminates the duplicate values.

Each uses a different technique to eliminate duplicates (visible in the EXPLAIN output):

• DISTINCT redistributes every row between the AMPs immediately, then sorts and discards the duplicates.
• GROUP BY eliminates duplicates locally on each AMP first, then redistributes only the unique values between the AMPs.

Back to the original question: which is more efficient?

Since DISTINCT redistributes the rows immediately, more data may move between the AMPs than with GROUP BY, which sends only unique values between the AMPs. That makes GROUP BY sound more efficient. However, if the data is nearly unique, GROUP BY spends time attempting to eliminate duplicates that do not exist, wasting the initial duplicate check, and then must redistribute nearly the same amount of data anyway.

Therefore, for efficiency, when there are:

• Many duplicates – use GROUP BY
• Few to no duplicates – use DISTINCT
• SPOOL space is exceeded – try GROUP BY