Creating Sample Data - DB2 Using SQL

If every application worked exactly as intended from the first, we would never have any need for test databases. Unfortunately, one often needs to builds test systems in order to both tune the application SQL, and to do capacity planning. In this section we shall illustrate how very large volumes of extremely complex test data can be created using relatively simple SQL statements.
Good Sample Data is

  • Reproducible.
  • Easy to make.
  • Similar to Production:
  • Same data volumes (if needed).
  • Same data distribution characteristics.

Data Generation

Create the set of integers between zero and one hundred. In this statement we shall use recursive coding to expand a single value into many more.
Use recursion to get list of 100 numbers


Instead of coding a recursion join every time, we use the table function to create the required rows. Assuming that the function exists, one would write the following:
Use user-defined-function to get list of 100 numbers

Make Reproducible Random Data

So far, all we have done is create sets of fixed values. These are usually not suitable for testing purposes because they are too consistent. To mess things up a bit we need to use the RAND function, which generates random numbers in the range of zero to one inclusive. In the next example we will get a (reproducible) list of five random numeric values:
Use RAND to create pseudo-random numbers

The initial invocation of the RAND function above is seeded with the value 1. Subsequent invocations of the same function (in the recursive part of the statement) use the initial value to generate a reproducible set of pseudo-random numbers.

Using the GENERATE_UNIQUE function

With a bit of data manipulation, the GENERATE_UNIQUE function can be used (instead of the RAND function) to make suitably random test data. The are advantages and disadvantages to using both functions:

  • The GENERATE_UNIQUE function makes data that is always unique. The RAND function only outputs one of 32,000 distinct values.
  • The RAND function can make reproducible random data, while the GENERATE_ UNIQUE function can not.

See the description of the GENERATE_UNIQUE function (see page 131) for an example of how to use it to make random data. Make Random Data - Different Ranges There are several ways to mess around with the output from the RAND function: We can use simple arithmetic to alter the range of numbers generated (e.g. convert from 0 to 10 to 0 to 10,000). We can alter the format (e.g. from FLOAT to DECIMAL). Lastly, we can make fewer, or more, distinct random values (e.g. from 32K distinct values down to just 10). All of this is done below:
Make differing ranges of random numbers

Make Random Data - Varying Distribution

In the real world, there is a tendency for certain data values to show up much more frequently than others. Likewise, separate fields in a table usually have independent semi-random data distribution patterns. In the next statement we create three independently random fields. The first has the usual 32K distinct values evenly distributed in the range of zero to one. ond and third have random numbers that are skewed towards the low end of the range, and have many more distinct values:
Create RAND data with different distributions

WITH -- ------ ------ ------
temp1 (s1) AS 0 1251 365370 114753
(VALUES (0) 1 350291 280730 88106
UNION ALL 2 710501 149549 550422
SELECT s1 + 1 3 147312 33311 2339
FROM temp1 4 8911 556 73091
WHERE s1 + 1 < 5
,INTEGER((RAND(1)) * 1E6) AS ran1
,INTEGER((RAND() * RAND()) * 1E6) AS ran2
,INTEGER((RAND() * RAND()* RAND()) * 1E6) AS ran3
FROM temp1;

Make Random Data - Different Flavours

The RAND function generates random numbers. To get random character data one has to convert the RAND output into a character. There are several ways to do this. The first method shown below uses the CHR function to convert a number in the range: 65 to 90 into the ASCII equivalent: "A" to "Z". The second method uses the CHAR function to translate a number into the character equivalent.
Converting RAND output from number to character

Make Test Table & Data

So far, all we have done in this chapter is use SQL to select sets of rows. Now we shall create a Production-like table for performance testing purposes. We will then insert 10,000 rows of suitably lifelike test data into the table. The DDL, with constraints and index definitions, follows.

The important things to note are:

  • The EMP# and the SOCSEC# must both be unique.
  • The JOB_FTN, FST_NAME, and LST_NAME fields must all be non-blank.
  • The SOCSEC# must have a special format.
  • The DATE_BN must be greater than 1900.

Several other fields must be within certain numeric ranges.
Production-like test table DDL

Now we shall populate the table. The SQL shall be described in detail latter. For the moment, note the four RAND fields. These contain, independently generated, random numbers which are used to populate the other data fields.
Production-like test table INSERT

ome sample data follows:

------ ----------- ---- ---- --------- ---------- --------- ---------
100000 484-10-9999 WKR 47 13.63 1979-01-01 Ammaef Mimytmbi
100001 449-38-9998 SEC 53 35758.87 1962-04-10 Ilojff Liiiemea
100002 979-90-9997 WKR 1 8155.23 1975-01-03 Xzacaa Zytaebma
100003 580-50-9993 WKR 31 16643.50 1971-02-05 Lpiedd Pimmeeat
100004 264-87-9994 WKR 21 962.87 1979-01-01 Wgfacc Geimteei
100005 661-84-9995 WKR 19 4648.38 1977-01-02 Wrebbc Rbiybeet
100006 554-53-9990 WKR 8 375.42 1979-01-01 Mobaaa Oiiaiaia
100007 482-23-9991 SEC 36 23170.09 1968-03-07 Emjgdd Mimtmamb
100008 536-41-9992 WKR 6 10514.11 1974-02-03 Jnbcaa Nieebayt

In order to illustrate some of the tricks that one can use when creating such data, each field above was calculated using a different schema:

  • The EMP# is a simple ascending number.
  • The SOCSEC# field presented three problems: It had to be unique, it had to be random with respect to the current employee number, and it is a character field with special layout constraints.
  • To make it random, the first five digits were defined using two of the temporary random number fields. To try and ensure that it was unique, the last four digits contain part of the employee number with some digit-flipping done to hide things. Also, the first random number used is the one with lots of unique values. The special formatting that this field required is addressed by making everything in pieces and then concatenating.
  • The JOB FUNCTION is determined using the fourth (highly skewed) random number. This ensures that we get many more workers than managers.
  • The DEPT is derived from another, somewhat skewed, random number with a range of values from one to ninety nine.
  • The SALARY is derived using the same, highly skewed, random number that was used for the job function calculation. This ensures that theses two fields have related values.
  • The BIRTH DATE is a random date value somewhere between 1930 and 1981.
  • The FIRST NAME is derived using seven independent invocation of the CHR function, each of which is going to give a somewhat different result.
  • The LAST NAME is (mostly) made by using the TRANSLATE function to convert a large random number into a corresponding character value. The output is skewed towards some of the vowels and the lower-range characters during the translation.

All rights reserved © 2020 Wisdom IT Services India Pvt. Ltd Protection Status

DB2 Using SQL Topics