OLAP Functions, Definitions - DB2 Using SQL

Ranking Functions

The RANK and DENSE_RANK functions enable one to rank the rows returned by a query. The result is of type BIGINT.

Ranking Functions

Ranking Functions syntax

RANK vs. DENSE_RANK

The two functions differ in how they handle multiple rows with the same value:

  • The RANK function returns the number of proceeding rows, plus one. If multiple rows have equal values, they all get the same rank, while subsequent rows get a ranking that counts all of the prior rows. Thus, there may be gaps in the ranking sequence.
  • The DENSE_RANK function returns the number of proceeding distinct values, plus one. If multiple rows have equal values, they all get the same rank. Each change in data value causes the ranking number to be incremented by one. The following query illustrates the use of the two functions:

Ranking functions example

ORDER BY Usage

The ORDER BY phrase, which is mandatory, gives a sequence to the ranking, and also tells DB2 when to start a new rank value.

The following query illustrates both uses:

ORDER BY usage


ANSWER
================================================================
JOB YEARS ID NAME ASC1 ASC2 ASC3 DSC1 DSC2 DSC3 MIX1 MIX2
----- ----- --- ------- ---- ---- ---- ---- ---- ---- ---- ----
Mgr 6 140 Fraye 1 1 1 4 6 6 3 4
Mgr 7 10 Sanders 1 2 2 4 4 5 1 6
Mgr 7 100 Plotz 1 2 3 4 4 4 2 5
Sales 6 40 O'Brien 4 4 4 1 2 3 5 2
Sales 6 90 Koonitz 4 4 5 1 2 2 6 1
Sales 7 70 Rothman 4 6 6 1 1 1 4 3

Observe above that adding more fields to the ORDER BY phrase resulted in more ranking values being generated.

Ordering Nulls

When writing the ORDER BY, one can optionally specify whether or not null values should be counted as high or low. The default, for an ascending field is that they are counted as high (i.e. come last), and for a descending field, that they are counted as low:

Overriding the default null ordering sequence

In general, in a relational database one null value does not equal another null value. But, as is illustrated above, for purposes of assigning rank, all null values are considered equal. NOTE: The ORDER BY used in the ranking functions(above) has nothing to do with the ORDER BY at the end of the query. The latter defines the row output order, while the former tells each ranking function how to sequence the values. Likewise, one cannot define the null sort sequence when ordering the rows.

Counting Nulls

The DENSE RANK and RANK functions include null values when calculating rankings. By contrast the COUNT DISTINCT statement excludes null values when counting values. Thus, as is illustrated below, the two methods will differ(by one) when they are used get a count of distinct values - if there are nulls in the target data:

Counting distinct values – comparison

PARTITION Usage

The PARTITION phrase lets one rank the data by subsets of the rows returned. In the following example, the rows are ranked by salary within year:

Values ranked by subset of rows

Multiple Rankings

One can do multiple independent rankings in the same query:

Multiple rankings in same query

Dumb Rankings

If one wants to, one can do some really dumb rankings. All of the examples below are fairly stupid, but arguably the dumbest of the lot is the last. In this case, the "ORDER BY 1" phrase ranks the rows returned by the constant "one", so every row gets the same rank. By contrast the "ORDER BY 1" phrase at the bottom of the query sequences the rows, and so has valid business meaning:

Dumb ranking


Dumb rankings, SQL Answer ==================================================

ID YEARS NAME SALARY DUMB1 DUMB2 DUMB3 DUMB4
- ----- -------- -------- ----- ----- ----- -----
10 7 Sanders 18357.50 1 3 1 1
20 8 Pernal 18171.25 3 2 3 1
30 5 Marenghi 17506.75 2 1 2 1

Subsequent Processing

The ranking function gets the rank of the value as of when the function was applied. Subsequent processing may mean that the rank no longer makes sense. To illustrate this point, the following query ranks the same field twice. Between the two ranking calls, some rows were removed from the answer set, which has caused the ranking results to differ:

Subsequent processing of ranked data

Ordering Rows by Rank

One can order the rows based on the output of a ranking function. This can let one sequence the data in ways that might be quite difficult to do using ordinary SQL. For example, in the following query the matching rows are ordered so that all those staff with the highest salary in their respective department come first, followed by those with the second highest salary, and so on. Within each ranking value, the person with the highest overall salary is listed first:

Ordering rows by rank, using RANK function

Here is the same query, written without the ranking function:

Ordering rows by rank, using sub-query

The above query has all of the failings that were discussed at the beginning of this chapter:

  • The nested table expression has to repeat all of the predicates in the main query, and have predicates that define the ordering sequence. Thus it is hard to read.
  • The nested table expression will (inefficiently) join every matching row to all prior rows.

Selecting the Highest Value

The ranking functions can also be used to retrieve the row with the highest value in a set of rows. To do this, one must first generate the ranking in a nested table expression, and then query the derived field later in the query. The following statement illustrates this concept by getting the person, or persons, in each department with the highest salary:

Get highest salary in each department, use RANK function

Here is the same query, written using a correlated sub-query:

Get highest salary in each department, use correlated sub-query


Here is the same query, written using an uncorrelated sub-query:

Get highest salary in each department, use uncorrelated sub-query

Arguably, the first query above (i.e. the one using the RANK function) is the most elegant of the series because it is the only statement where the basic predicates that define what rows match are written once. With the two sub-query examples, these predicates have to be repeated, which can often lead to errors. If it seems at times that this was written with a poison pen, it is because just about now I had a "Microsoft moment" and my machine crashed. Needless to say, I had backups and, needless to say, they got trashed. It took me four days to get back to where I was. Thanks Bill - may you rot in hell. / Graeme

Row Numbering Function

The ROW_NUMBER function lets one number the rows being returned. The result is of type BIGINT. A syntax diagram follows. Observe that unlike with the ranking functions, the ORDER BY is not required:

Row Numbering Function

Numbering Function syntax

ORDER BY Usage

You don't have to provide an ORDER BY when using the ROW_NUMBER function, but not doing so can be considered to be either brave or foolish, depending on one's outlook on life. To illustrate this issue, consider the following query:

ORDER BY example, 1 of 3

In the above example, both ROW_NUMBER functions return the same set of values, which happen to correspond to the sequence in which the rows are returned. In the next query, the second ROW_NUMBER function purposely uses another sequence:

ORDER BY example, 2 of 3

Observe that changing the second function has had an impact on the first. Now lets see what bhappens when we add another ROW_NUMBER function:

ORDER BY example, 3 of 3

Observe that now the first function has reverted back to the original sequence. The lesson to be learnt here is that the ROW_NUMBER function, when not given an explicit ORDER BY, may create a value in any odd sequence. Usually, the sequence will reflect the order in which the rows are returned - but not always.

PARTITION Usage

The PARTITION phrase lets one number the matching rows by subsets of the rows returned. In the following example, the rows are both ranked and numbered within each JOB:

Use of PARTITION phrase

One problem with the above query is that the final ORDER BY that sequences the rows does not identify a unique field (e.g. ID). Consequently, the rows can be returned in any sequence within a given JOB and YEAR. Because the ORDER BY in the ROW_NUMBER function also fails to identify a unique row, this means that there is no guarantee that a particular row will always give the same row number. For consistent results, ensure that both the ORDER BY phrase in the function call, and at the end of the query, identify a unique row. And to always get the rows returned in the desired row-number sequence, these phrases must be equal.

Selecting "n" Rows

To query the output of the ROW_NUMBER function, one has to make a nested temporary table that contains the function expression. In the following example, this technique is used to limit the query to the first three matching rows:

Select first 3 rows, using ROW_NUMBER function

In the next query, the FETCH FIRST "n" ROWS notation is used to achieve the same result:

Select first 3 rows, using FETCH FIRST notation

So far, the ROW_NUMBER and the FETCH FIRST notations seem to be about the same. But the former technique is much more flexible. To illustrate, in the next query we retrieve the 3rd through 6th matching rows:

Select 3rd through 6th rows

In the next query we get every 5th matching row - starting with the first:

Select every 5th matching row

In the next query we get the last two matching rows:

Select last two rows

Selecting "n" or more Rows

Imagine that one wants to fetch the first "n" rows in a query. This is easy to do, and has been illustrated above. But imagine that one also wants to keep on fetching if the following rows have the same value as the "nth".
In the next example, we will get the first three matching rows in the STAFF table, ordered by years of service. However, if the 4th row, or any of the following rows, has the same YEAR as the 3rd row, then we also want to fetch them. The query logic goes as follows:

  • Select every matching row in the STAFF table, and give them all both a row-number and a ranking value. Both values are assigned according to the order of the final output. Do all of this work in a nested table expression.
  • Select from the nested table expression where the rank is three or less. The query relies on the fact that the RANK function (see page 94) assigns the lowest common row number to each row with the same ranking:

Select first "n" rows, or more if needed

The type of query illustrated above can be extremely useful in certain business situations. To illustrate, imagine that one wants to give a reward to the three employees that have worked for the company the longest. Stopping the query that lists the lucky winners after three rows are fetched can get one into a lot of trouble if it happens that there are more than three employees that have worked for the company for the same number of years.

Selecting "n" Rows - Efficiently

Sometimes, one only wants to fetch the first "n" rows, where "n" is small, but the number of matching rows is extremely large. In this section, we will discus how to obtain these "n" rows efficiently, which means that we will try to fetch just them without having to process any of the many other matching rows. Below is a sample invoice table. Observe that we have defined the INV# field as the primary key, which means that DB2 will build a unique index on this column:

Performance test table – definition

The next SQL statement will insert 500,000 rows into the above table. After the rows are inserted a REORG and RUNSTATS are run, so the optimizer can choose the best access path.

Performance test table - insert 500,000 rows

Imagine we want to retrieve the first five rows (only) from the above table. Below are several queries that get this result. For each query, the elapsed time, as measured by the DB2 Event Monitor is provided. Below we use the "FETCH FIRST n ROWS" notation to stop the query at the 5th row. This query first did a tablespace scan, then sorted all 500,000 matching rows, and then fetched the first five. It was not cheap:

Fetch first 5 rows - 0.313 elapsed seconds

The next query is essentially the same as the prior, but this time we told DB2 to optimize the query for fetching five rows. Nothing changed:

Fetch first 5 rows - 0.281 elapsed seconds

The next query is the same as the first, except that it uses the ROW_NUMBER function to sequence the output. This query is even more expensive than the first because of the cost of assigning the row numbers:

Fetch first 5 rows+ number rows - 0.672 elapsed seconds

All of the above queries have processed all 500,000 matching rows, sorted them, and then fetched the first five. We can do much better if we somehow only process the five rows that we want to fetch, which is what the next query does:

Process and number 5 rows only - 0.000 elapsed seconds

In the above query the "OVER()" phrase told DB2 to assign row numbers in the output order. In the next query we explicitly provide the row-number sequence, which happens to be the same at the ORDER BY sequence, but DB2 can't figure that out, so this query costs:

Process and number 5 rows only - 0.281 elapsed seconds

One can also use recursion to get the first "n" rows. One begins by getting the first matching row, and then uses that row to get the next, and then the next, and so on (in a recursive join), until the required number of rows have been obtained. In the following example, we start by getting the row with the MIN invoice-number. This row is then joined to the row with the next to lowest invoice-number, which is then joined to the next, and so on. After five such joins, the cycle is stopped and the result is selected:

Fetch first 5 rows - 0.000 elapsed seconds

The above technique is nice to know, but it has several major disadvantages:

  • It is not exactly easy to understand.
  • It requires all primary predicates (e.g. get only those rows where the sale-value is greater than $10,000, and the sale-date greater than last month) to be repeated four times. In the above example there are none, which is unusual in the real world.
  • It quickly becomes both very complicated and quite inefficient when the sequencing value is made up of multiple fields. In the above example, we sequenced by the INV# column, but imagine if we had used the sale-date, sale-value, and customer-number.
  • It is extremely vulnerable to inefficient access paths. For example, if instead of joining from one (indexed) invoice-number to the next, we joined from one (non-indexed) customer- number to the next, the query would run forever. In this section we have illustrated how minor changes to the SQL syntax can cause major changes in query performance. But to illustrate this phenomenon, we used a set of queries with 500,000 matching rows. In situations where there are far fewer matching rows, one can reasonably assume that this problem is not an issue.

Aggregation Function

The various aggregation functions let one do cute things like get cumulative totals or running averages. In some ways, they can be considered to be extensions of the existing DB2 column functions. The output type is dependent upon the input type.

Aggregation Function syntax

Aggregation Function syntax

Syntax Notes

Guess what - this is a complicated function. Be aware of the following:

  • Any DB2 column function (e.g. AVG, SUM, COUNT) can use the aggregation function.
  • The OVER() usage aggregates all of the matching rows. This is equivalent to getting the current row, and also applying a column function (e.g. MAX, SUM) against all of the matching rows (see page 107).
  • The PARTITION phrase limits any aggregation to a subset of the matching rows.
  • The ORDER BY phrase has two purposes; It defines a set of values to do aggregations on. Each distinct value gets a new result. It also defines a direction for the aggregation function processing - either ascending or descending (see page 108).
  • An ORDER BY phrase is required if the aggregation is confined to a set of rows or range of values. In addition, if a RANGE is used, then the ORDER BY expression must be a single value that allows subtraction.
  • If an ORDER BY phrase is provided, but neither a RANGE nor ROWS is specified, then the aggregation is done from the first row to the current row.
  • The ROWS phrase limits the aggregation result to a set of rows - defined relative to the current row being processed. The applicable rows can either be already processed (i.e. preceding) or not yet processed (i.e. following), or both (see page 109).
  • The RANGE phrase limits the aggregation result to a range of values - defined relative to the value of the current row being processed. The range is calculated by taking the value in the current row (defined by the ORDER BY phrase) and adding to and/or subtracting from it, then seeing what other rows are in the range. For this reason, when RANGE is used, only one expression can be specified in the aggregation function ORDER BY, and the expression must be numeric (see page 112).
  • Preceding rows have already been fetched. Thus, the phrase "ROWS 3 PRECEDING" refers to the 3 preceding rows - plus the current row. The phrase "UNBOUNDED PRECEDING" refers to all those rows (in the partition) that have already been fetched, plus the current one.
  • Following rows have yet to be fetched. The phrase "UNBOUNDED FOLLOWING" refers to all those rows (in the partition) that have yet to be fetched, plus the current one.
  • The phrase CURRENT ROW refers to the current row. It is equivalent to getting zero preceding and following rows.
  • If either a ROWS or a RANGE phrase is used, but no BETWEEN is provided, then one must provide a starting point for the aggregation (e.g. ROWS 1 PRECEDING). The starting point must either precede or equal the current row - it cannot follow it. The implied end point is the current row.
  • When using the BETWEEN phrase, put the "low" value in the first check and the "high" value in the second check. Thus one can go from the 1 PRECEDING to the CURRENT ROW, or from the CURRENT ROW to 1 FOLLOWING, but not the other way round.
  • The set of rows that match the BETWEEN phrase differ depending upon whether the aggregation function ORDER BY is ascending or descending.

Basic Usage

In its simplest form, with just an "OVER()" phrase, an aggregation function works on all of the matching rows, running the column function specified. Thus, one gets both the detailed data, plus the SUM, or AVG, or whatever, of all the matching rows. In the following example, five rows are selected from the STAFF table. Along with various detailed fields, the query also gets sum summary data about the matching rows:

Aggregation function, basic usage


ANSWER ============================================================
ID NAME SALARY SUM_SAL AVG_SAL MIN_SAL MAX_SAL #ROWS
-- -------- -------- -------- -------- -------- -------- -----
10 Sanders 18357.50 92701.30 18540.26 17506.75 20659.80 5
20 Pernal 18171.25 92701.30 18540.26 17506.75 20659.80 5
30 Marenghi 17506.75 92701.30 18540.26 17506.75 20659.80 5
40 O'Brien 18006.00 92701.30 18540.26 17506.75 20659.80 5
50 Hanes 20659.80 92701.30 18540.26 17506.75 20659.80 5

It is possible to do exactly the same thing using old-fashioned SQL, but it is not so pretty:

Select detailed data, plus summary data

An aggregation function with just an "OVER()" phrase is logically equivalent to one that has an ORDER BY on a field that has the same value for all matching rows. To illustrate, in the following query, the four aggregation functions are all logically equivalent:

Logically equivalent aggregation functions

ANSWER
=========================================================
ID NAME SALARY SUM1 SUM2 SUM3 SUM4
-- -------- -------- -------- -------- -------- --------
10 Sanders 18357.50 92701.30 92701.30 92701.30 92701.30
20 Pernal 18171.25 92701.30 92701.30 92701.30 92701.30
30 Marenghi 17506.75 92701.30 92701.30 92701.30 92701.30
40 O'Brien 18006.00 92701.30 92701.30 92701.30 92701.30
50 Hanes 20659.80 92701.30 92701.30 92701.30 92701.30

ORDER BY Usage
The ORDER BY phrase has two main purposes:

  • It provides a set of values to do aggregations on. Each distinct value gets a new result.
  • It gives a direction to the aggregation function processing (i.e. ASC or DESC).

In the next query, various aggregations are done on the DEPT field, which is not unique, and on the DEPT and NAME fields combined, which are unique (for these rows). Both ascending and descending aggregations are illustrated:

Aggregation function, order by usage, SQL

The answer is below. Observe that the ascending fields sum or count up, while the descending nfields sum down. Also observe that each aggregation field gets a separate result for each new set of rows, as defined in the ORDER BY phrase:

ANSWER ===================================================================== DEPT NAME SALARY SUM1 SUM2 SUM3 SUM4 ROW1 ROW2
---- -------- -------- -------- -------- -------- -------- ---- ----
15 Hanes 20659.80 20659.80 92701.30 20659.80 92701.30 1 1
20 Pernal 18171.25 57188.55 72041.50 38831.05 72041.50 3 2
20 Sanders 18357.50 57188.55 72041.50 57188.55 53870.25 3 3
38 Marenghi 17506.75 92701.30 35512.75 74695.30 35512.75 5 4
38 O'Brien 18006.00 92701.30 35512.75 92701.30 18006.00 5 5

ROWS Usage

The ROWS phrase can be used to limit the aggregation function to a subset of the matching rows or distinct values. If no ROWS or RANGE phrase is provided, the aggregation is done for all preceding rows, up to the current row. Likewise, if no BETWEEN phrase is provided, the aggregation is done from the start-location given, up to the current row. In the following query, all of the examples using the ROWS phrase are of this type:

Starting ROWS usage. Implied end is current row

Below is the answer. Observe that an aggregation starting at the current row, or including zero proceeding rows, doesn't aggregate anything other than the current row:

ANAWER ================================================= DEPT NAME YEARS D DN DNU DN3 DN1 DN0 DNC DNX
---- -------- ----- -- -- --- --- --- --- --- ---
15 Hanes 10 17 10 10 10 10 10 10 17
15 Rothman 7 17 17 17 17 17 7 7 15
20 Pernal 32 25 25 25 15 8 8 15
20 Sanders 7 32 32 32 32 15 7 7 2
38 Marenghi 5 43 37 37 27 12 5 5 11
38 O'Brien 6 43 43 43 26 11 6 6 12
42 Koonitz 6 49 49 49 24 12 6 6 6

Starting ROWS usage. Implied end is current row, Answer

BETWEEN Usage

In the next query, the BETWEEN phrase is used to explicitly define the start and end rows that are used in the aggregation:

ROWS usage, with BETWEEN phrase


ANSWER ========================================================= DEPT NAME YEARS UC1 UC2 UC3 CU1 PF1 PF2 PF3 CU1 UU1
---- -------- ----- --- --- --- --- --- --- --- --- ---
15 Hanes 10 10 10 10 10 17 25 32 49 49
15 Rothman 7 7 17 17 7 25 32 37 39 49
20 Pernal 8 25 25 25 8 22 37 43 32 49
20 Sanders 7 32 32 32 7 20 33 49 24 49
38 Marenghi 5 37 37 37 5 18 32 39 17 49
38 O'Brien 6 43 43 43 6 17 24 32 12 49
42 Koonitz 6 49 49 49 6 12 17 24 6 49

The BETWEEN predicate in an ordinary SQL statement is used to get those rows that have a value between the specified low-value (given first) and the high value (given last). Thus the predicate "BETWEEN 5 AND 10" may find rows, but the predicate "BETWEEN 10 AND 5" will never find any. The BETWEEN phrase in an aggregation function has a similar usage in that it defines the set of rows to be aggregated. But it differs in that the answer depends upon the function ORDER BY sequence, and a non-match returns a null value, not no-rows. Below is some sample SQL. Observe that the first two aggregations are ascending, while the last two are descending:

The following table illustrates the processing sequence in the above query. Each BETWEEN is applied from left to right, while the rows are read either from left to right (ORDER BY ID ASC) or right to left (ORDER BY ID DESC):

Explanation of query

ASC id (10,20,30,40)
READ ROWS, LEFT to RIGHT 1ST-ROW 2ND-ROW 3RD-ROW 4TH-ROW
========================== ======== ======== ======== ========
1 PRECEDING to CURRENT ROW 10=10 10+20=30 20+30=40 30+40=70
CURRENT ROW to 1 FOLLOWING 10+20=30 20+30=50 30+40=70 40 =40
DESC id (40,30,20,10)
READ ROWS, RIGHT to LEFT 1ST-ROW 2ND-ROW 3RD-ROW 4TH-ROW
========================== ======== ======== ======== ========
1 PRECEDING to CURRENT ROW 20+10=30 30+20=50 40+30=70 40 =40
CURRENT ROW to 1 FOLLOWING 10 =10 20+10=30 30+20=50 40+30=70
NOTE: Preceding row is always on LEFT of current row.
Following row is always on RIGHT of current row.

IMPORTANT: The BETWEEN predicate, when used in an ordinary SQL statement, is not affected by the sequence of the input rows. But the BETWEEN phrase, when used in an aggregation function, is affected by the input sequence.

RANGE Usage

The RANGE phrase limits the aggregation result to a range of numeric values - defined relative to the value of the current row being processed. The range is obtained by taking the value in the current row (defined by the ORDER BY expression) and adding to and/or subtracting from it, then seeing what other rows are in the range. Note that only one expression can be specified in the ORDER BY, and that expression must be numeric. In the following example, the RANGE function adds to and/or subtracts from the DEPT field. For example, in the function that is used to populate the RG10 field, the current DEPT value is checked against the preceding DEPT values. If their value is within 10 digits of the current value, the related YEARS field is added to the SUM:

RANGE usage


ANSWER ======================================================== DEPT NAME YEARS ROW1 ROW2 RG01 RG10 RG20 RG11 RG99
------ ------- ----- ---- ---- ---- ---- ---- ---- ----
15 Hanes 10 10 10 17 17 17 32 32
15 Rothman 7 17 17 17 17 17 32 32
20 Pernal 8 15 25 15 32 32 43 26
20 Sanders 7 15 22 15 32 32 43 26
38 Marengh 5 12 20 11 11 26 17 17
38 O'Brien 6 11 18 11 11 26 17 17
42 Koonitz 6 12 17 6 17 17 17 6

Note the difference between the ROWS as RANGE expressions:

  • The ROWS expression refers to the "n" rows before and/or after (within the partition), as defined by the ORDER BY.
  • The RANGE expression refers to those before and/or after rows (within the partition) that are within an arithmetic range of the current row.

PARTITION Usage

One can take all of the lovely stuff described above, and make it whole lot more complicated by using the PARTITION expression. This phrase limits the current processing of the aggregation to a subset of the matching rows. In the following query, some of the aggregation functions are broken up by partition range and some are not. When there is a partition, then the ROWS check only works within the range of the partition (i.e. for a given DEPT):

PARTITION usage

ANSWER
=============================================================
DEPT NAME YEARS X XO3 XO11 P PO PO1 PO3 PO11
----- ------- ----- ---- ---- ---- ---- ---- ---- ---- ----
15 Hanes 10 22 10 15 22 22 10 10 15
15 Ngan 5 22 15 22 22 22 15 15 22
15 Rothman 7 22 22 18 22 22 12 22 12
38 O'Brien 6 28 28 19 6 6 6 6 6
42 Koonitz 6 41 24 19 13 13 6 6 13
42 Plotz 7 41 26 13 3 13 13 13 13

PARTITION vs. GROUP BY

The PARTITION clause, when used by itself, returns a very similar result to a GROUP BY, except that it does not remove the duplicate rows. To illustrate, below is a simple query that does a GROUP BY:

Sample query using GROUP BY

All rights reserved © 2020 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

DB2 Using SQL Topics