Statistical Aggregates Teradata

In Teradata Release 4 (V2R4) there are several new aggregates that perform statistical operations. Many of them are used in other internal functions and now they are available for use within SQL.

Not only are these statistical functions the newest, but there are two types of statistical functions. They are unary (single input value) functions, and binary (dual input value) functions.

The unary functions look at individual column values for each row included and compare all of the values for trends, similarities and groupings. All the original aggregate functions are unary in that they accept a single value to perform their processing.

The statistical unary functions are:

  • Kurtosis
  • Skew
  • Standard Deviation of a sample
  • Standard Deviation of a population
  • Variance of a sample
  • Variance of a population

The binary functions examine the relationship between the two different values. Normally these two values represent two separate points on an X axis and Y-axis.

The binary functions are:

  • Correlation
  • Covariance
  • Regression Line Intercept
  • Regression Line Slope

The results from the statistical functions are not as obvious to demonstrate and figure out as the original functions, like SUM or AVG. The Stats table in Figure is used to demonstrate the statistical functions. Its column values have certain patterns in them. For instance COL1 increases sequentially from 1 to 30 while COL4 decreases sequentially from 30 to 1. The remaining columns tend to have the same value repeated and some values repeat more than others. These values are used in both the unary and binary functions to illustrate the types of answers generated using these statistical functions.

Stats table

The following table demonstrates the operation and output from the new statistical aggregate functions in V2R4.

Statistical Aggregates

The KURTOSIS function is used to return a number that represents the sharpness of a peak on a plotted curve of a probability function for a distribution compared with the normal distribution.

A high value result is referred to as leptokurtic. While a medium result is referred to as mesokurtic and a low result is referred to as platykurtic.

A positive value indicates a sharp or peaked distribution and a negative number represents a flat distribution. A peaked distribution means that one value exists more often than the other values. A flat distribution means there is the same quantity values exist for each number.

If you compare this to the row distribution associated within Teradata, most of the time a flat distribution is best, with the same number of rows stored on each AMP. Having skewed data represents more of a lumpy distribution.

Syntax for using KURTOSIS:

KURTOSIS(<column-name>)

The next SELECT uses KURTOSIS to compare the distribution of the Stats table:


1 Row Returneddistribution of the Stats table

The SKEW Function

The Skew indicates that a distribution does not have equal probabilities above and below the mean (average). In a skew distribution, the median and the mean are not coincident, or equal.

Where:

  • a median value < mean value = a positive skew
  • a median value > mean value = a negative skew
  • a median value = mean value = no skew

Syntax for using SKEW:

SKEW(<column-name>)

The following SELECT uses SKEW to compare the distribution of the Stats table:


1 Row ReturnedSKEW to compare the distribution of the Stats table

The STDDEV_POP Function

The standard deviation function is a statistical measure of spread or dispersion of values. It is the root's square of the difference of the mean (average). This measure is to compare the amount by which a set of values differs from the arithmetical mean.

The STDDEV_POP function is one of two that calculates the standard deviation. The population is of all the rows included based on the comparison in the WHERE clause.

Syntax for using STDDEV_POP:

STDDEV_POP(<column-name>)

The next SELECT uses STDDEV_POP to determine the standard deviation on all columns of all rows within the Stats table:


1 Row ReturnedThe STDDEV_POP Function

The STDDEV_SAMP Function

The standard deviation function is a statistical measure of spread or dispersion of values. It is the root's square of the difference of the mean (average). This measure is to compare the amount by which a set of values differs from the arithmetical mean.

The STDDEV_SAMP function is one of two that calculates the standard deviation. The sample is a random selection of all rows returned based on the comparisons in the WHERE clause. The population is for all of the rows based on the WHERE clause.

Syntax for using STDDEV_SAMP:

STDDEV_SAMP(<column-name>)

The following SELECT uses STDDEV_SAMP to determine the standard deviation on all columns of a sample of the rows within the Stats table:


1 Row Returnedstandard deviation on all columns of a sample of the rows within the Stats table

The VAR_POP Function

The Variance function is a measure of dispersion (spread of the distribution) as the square of the standard deviation. There are two forms of Variance in Teradata, VAR_POP is for the entire population of data rows allowed by the WHERE clause.

Although standard deviation and variance are regularly used in statistical calculations, the meaning of variance is not easy to elaborate. Most often variance is used in theoretical work where a variance of the sample is needed.

There are two methods for using variance. These are the Kruskal-Wallis one-way Analysis of Variance and Friedman two-way Analysis of Variance by rank.

Syntax for using VAR_POP:

VAR_POP(<column-name>)

The following SELECT uses VAR_POP to compare the variance of the distribution on all rows from the Stats table:


1 Row Returnedvariance of the distribution on all rows from the Stats table

The VAR_SAMP Function

The Variance function is a measure of dispersion (spread of the distribution) as the square of the standard deviation. There are two forms of Variance in Teradata, VAR_SAMP is used for a random sampling of the data rows allowed through by the WHERE clause.

Although standard deviation and variance are regularly used in statistical calculations, the meaning of variance is not easy to elaborate. Most often variance is used in theoretical work where a variance of the sample is needed to look for consistency.

There are two methods for using variance. These are the Kruskal-Wallis one-way Analysis of Variance and Friedman two-way Analysis of Variance by rank.

Syntax for using VAR_SAMP:

VAR_SAMP(<column-name>)

The next SELECT uses VAR_SAMP to compare the variance of the distribution on a row sample from the Stats table:


1 Row Returnedvariance of the distribution on a row sample from the Stats table

The CORR Function

The CORR function is a binary function, meaning that two variables are used as input to it. It measures the association between 2 random variables. If the variables are such that when one changes the other does so in a related manner, they are correlated. Independent variables are not correlated because the change in one does not necessarily cause the other to change.

The correlation coefficient is a number between -1 and 1. It is calculated from a number of pairs of observations or linear points (X,Y).

Where:

  • 1 = perfect positive correlation
  • 0 = no correlation
  • −1 = perfect negative correlation

Syntax for using CORR:

CORR(<column-name1>, <column-name2>)

The following SELECT uses CORR to compare the association of values stored in various columns from the Stats table:


1 Row Returnedassociation of values stored in various columns from the Stats table

Since there are two column values passed to this function and the first example has data values that sequentially ascend, the next example uses col4 as the first value because it sequentially descends. It demonstrates the impact of this sequence change on the result:


1 Row Returneddemonstrates the impact of this sequence change on the result

The COVAR Function

The covariance is a statistical measure of the tendency of two variables to change in conjunction with each other. It is equal to the product of their standard deviations and correlation coefficients.

The covariance is a statistic used for bivariate samples or bivariate distribution. It is used for working out the equations for regression lines and the product-moment correlation coefficient.

Syntax:

COVAR(<column-name1>, <column-name2>)

The next SELECT uses COVAR to compare the covariance association of values stored in various columns from the Stats table:


1 Row Returnedvalues stored in various columns from the Stats table

Since there are two column values passed to this function and the first example has data values that sequentially ascend, the next example uses col4 as the first value because it sequentially descends. It demonstrates the impact of this sequence change on the result:


1 Row Returnedimpact of this sequence change on the result

The REGR_INTERCEPT Function

A regression line is a line of best fit, drawn through a set of points on a graph for X and Y coordinates. It uses the Y coordinate as the Dependent Variable and the X value as the Independent Variable.

Two regression lines always meet or intercept at the mean of the data points(x,y), where x=AVG(x) and y=AVG(y) and is not usually one of the original data points.

Syntax for using REGR_INTERCEPT:

REGR_INTERCEPT(dependent-expression, independent-expression)

The following SELECT uses REGR_INTERCEPT to find the intercept point between the values stored in various columns from the Stats table:


1 Row ReturnedThe REGR_INTERCEPT Function

Since there are two column values passed to this function and the first example has data values that sequentially ascend, the next example uses col4 as the first value because it sequentially descends. It demonstrates the impact of this sequence change on the result:

1 Row Returned impact of this sequence change on the result

The REGR_SLOPE Function

A regression line is a line of best fit, drawn through a set of points on a graph of X and Y coordinates. It uses the Y coordinate as the Dependent Variable and the X value as the Independent Variable.

The slope of the line is the angle at which it moves on the X and Y coordinates. The vertical slope is Y on X and the horizontal slope is X on Y.

Syntax for using REGR_SLOPE:

REGR_SLOPE(dependent-expression, independent-expression)

The next SELECT uses REGR_SLOPE to find the slope for the values stored in various columns from the Stats table:


1 Row ReturnedThe REGR_SLOPE Function

Since there are two column values passed to this function and the first example has data values that sequentially ascend, the next example uses col4 as the first value because it sequentially descends. It demonstrates the impact of this sequence change on the result:


1 Row Returnedimpact of this sequence change on the result

Using GROUP BY

Like the original aggregates, the new statistical aggregates may also take advantage of using non-aggregates with the aggregates. The GROUP BY is used to identify and form groups for each unique value in the selected non-aggregate column.

Likewise, the new statistical aggregates are compatible with the original aggregates as seen in the following SELECT:


7 Rows ReturnedUsing GROUP BY

Use of HAVING

Also like the original aggregates, the HAVING may be used to eliminate specific output lines based on one or more of the final aggregate values.

The next SELECT uses the HAVING to perform a compound comparison on both the count and the covariance:

2 Rows ReturnedUse of HAVING

Face Book Twitter Google Plus Instagram Youtube Linkedin Myspace Pinterest Soundcloud Wikipedia

All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

Teradata Topics