Sampling Rows using the SAMPLE Function Teradata

Compatibility: Teradata Extension

The Sampling function (SAMPLE) permits a SELECT to randomly return rows from a Teradata database table. It allows the request to specify either an absolute number of rows or a percentage of rows to return. Additionally, it provides an ability to return rows from multiple samples.

The syntax for the SAMPLE function:

The next SELECT uses the SAMPLE to get a random sample of the sales table:

SELECT * FROM student_course_tableSAMPLE 5; 5 Rows Returned

SAMPLE to get a random sample of the sales table

This next SELECT uses the SAMPLE function to request multiple samples to create a derived table (cover later). Then, the unique rows will be counted to show the random quality of the SAMPLE function:

1 Row Returned

count(distinct(course_id)

4

In the above execution, all five rows contained a different data value in the course ID. A second run of the same SELECT might very well yield these results:

1 Row Returned

count(distinct(course_id)
5

Sometimes, a single sampling of the data is not sufficient. The SAMPLE function can be used to request more than one sample by listing either the number of rows or the percentage of the rows to be returned.

The next SELECT uses the SAMPLE function to request multiple samples:

8 Rows Returned

SAMPLE function to request multiple samples

Although multiple samples were taken, the rows came back as a single answer set consisting of 50% (.25 + .25) of the data. When it is necessary to determine which rows came from which sample, the SAMPLEID column name can be used to distinguish between each sample.

This SELECT uses the SAMPLE function with the SAMPLEID to request multiple samples and denote which sample each row came from:


14 Rows Returned

SAMPLE function with the SAMPLEID to request multiple samples

Since the previous request asks for more rows than are currently in the table, a warning message 7473 is received. Regardless, it is only a warning and the SELECT works and all rows are returned. If there is any doubt in the number of rows, instead of using a fixed number and receiving the warning message, the use of percentage is a better choice.

The next SELECT uses the SAMPLE function with the SAMPLEID to request multiple samples as a percentage and denotes which sample each row came from:

14 Rows Returned

SAMPLE function with the SAMPLEID to request multiple samples
SAMPLE function with the SAMPLEID to request multiple samples

Since SAMPLEID is a column, it can be used as the sort key.

The OLAP functions provide some very interesting and powerful functionality for examining and evaluating data. They provide an insight into the data that was not easily obtained prior to these functions.

Although they look like Aggregates, they are not normally compatible with them in the same SELECT list. As demonstrated here, aggregation can be performed, however, they must be calculated in a temporary or derived table.



Face Book Twitter Google Plus Instagram Youtube Linkedin Myspace Pinterest Soundcloud Wikipedia

All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

Teradata Topics