Newer ANSI Join Syntax Teradata

The ANSI committee has created a new form of the join syntax. Like most ANSI compliant code, it is a bit longer to write. However, I personally believe that it is worth the time and the effort due to better functionality and safeguards. Plus, it is more difficult to get an accidental product join using this form of syntax. This chapter describes and demonstrates the use of the INNER JOIN, the OUTER JOIN, the CROSS JOIN and the Self-join.

INNER JOIN

Although the original syntax still works, there is a newer version of the join using the INNER JOIN syntax. It works exactly the same as the original join, but is written slightly different.

The following syntax is for a two-table INNER JOIN:

There are two primary differences between the new INNER JOIN and the original join syntax. The first difference is that a comma (,) no longer separates the table names. Instead of a comma, the words INNER JOIN are used. As shown in the above syntax format, the word INNER is optional. If only the JOIN appears, it defaults to an INNER JOIN.

The other difference is that the WHERE clause for the join condition is changed to an ON to declare an equal comparison on the common domain columns. If the ON is omitted, a syntax error is reported and the SELECT does not execute. So, the result is not a Cartesian product join as seen in the original syntax. Therefore, it is safer to use.

Although the INNER JOIN is a slightly longer SQL statement to code, it does have advantages. The first advantage, mentioned above, is fewer accidental Cartesian product joins because the ON is required. In the original syntax, when the WHERE is omitted the syntax is still correct. However, without a comparison, all rows of both tables are joined with each other and results in a Cartesian product.

The last and most compelling advantage of the newer syntax is that the INNER JOIN and OUTER JOIN statements can easily be combined into a single SQL statement. The OUTER JOIN syntax, explanation and significance are covered in this.

The following is the same join that was performed earlier using the original join syntax. Here, it has been converted to use an INNER JOIN:


5 Rows Returnedsame join that was performed earlier using the original join syntax

Like the original syntax, more than two tables can be joined in a single INNER JOIN. Each consecutive table name follows an INNER JOIN and associated ON clause to tell which columns to match. Therefore, a ten-table join has nine JOIN and nine ON clauses to identify each table and the columns being compared. There is always one less JOIN / ON combination than the number of tables referenced in the FROM.

The following syntax is for an INNER JOIN with more than two tables:

The <table-nameN> reference above is intended to represent a variable number of tables. It might be a 3-table, a 10-table or up to a 64-table join. The same approach is used regardless of the number of tables being joined together in a single SELECT.

The other difference between these two join formats is that regardless of the number of tables in the original syntax, there was only a single WHERE clause. Here, each additional INNER JOIN has its own ON condition.

If one ON is omitted from the INNER JOIN, an error code of 3706 will be returned. This error keeps the join from executing, unlike the original syntax, where a forgotten join condition in the WHERE is allowed, but creates an accidental Cartesian product join.

The next INNER JOIN is converted from the 3-table join seen earlier:


3 Rows ReturnedINNER JOIN is converted from the 3-table join seen earlier
INNER JOIN is converted from the 3-table join seen earlier

The INNER JOIN syntax can use a WHERE clause instead of a compound ON comparison. It can be used to add one or more residual conditions. A residual condition is a comparison that is in addition to the join condition. When it is used, the intent is to potentially eliminate rows from one or more of the tables.

In other words, as rows are read the WHERE clause compares each row with a condition to decide whether or not it should be included or eliminated from the join processing. The WHERE clause is applied as rows are read, before the ON clause. Eliminated rows do not participate in the join against rows from another table. For more details, read the section on WHERE clauses at the end of this chapter.

The following is the same SELECT using a WHERE to compare the Course name as a residual condition instead of a compound (AND) comparison in the ON:

As far as the INNER JOIN processing is concerned, the PE will normally optimize both of these last two joins exactly the same. The EXPLAIN is the best way to determine how the optimizer uses specific Teradata tables in a join operation.

OUTER JOIN

As seen previously, the join processing matches rows from multiple tables on a column containing values from a common domain. Most of the time, each row in a table has a matching row in the other table. However, we do not live in a perfect world and sometimes our data is not perfect. Imperfect data is never returned when a normal join is used and the imperfection may go unnoticed.

The sole purpose of an OUTER JOIN is to find and return rows that do not match at least one row from another table. It is for "exception" reporting, but at the same time, it does the INNER JOIN processing too. Therefore, the intersecting (matching) common domain rows are returned along with all rows without a matching value from another table. This nonmatching condition might be due to the existence of a NULL or invalid data value in the join column(s).

For instance, if the employee and department tables are joined using an INNER JOIN, it displays all the employees who work in a valid department. Mechanically, this means it returns all of the employee rows that contain a value in the department number column, as a foreign key, that matches a department number value in the department table, as a primary key.

What it does not display are employees without a department number (NULL) and employees with invalid department numbers (breaks referential integrity rules). These additional rows can be returned with the intersecting rows using one of the three formats for an OUTER JOIN listed below. The three formats of an OUTER JOIN are:

Left_table LEFT OUTER JOIN Right_table -left table is outer table

Left_table RIGHT OUTER JOIN Right_table -right table is outer table

Left_table FULL OUTER JOIN Right_table -both are outer tables

The OUTER JOIN has an outer table. The outer table is used to direct which exception rows are output. Simply put, it is the controlling table of the OUTER JOIN. As a result of this feature, all the rows from the outer table will be returned, those containing matching domain values and those with non-matching values. The INNER JOIN has only inner tables. To codean OUTER JOIN it is wise to start with an INNER JOIN. Once the join is working, the next step is to convert the word INNER to OUTER. The SELECT list for matching rows can display data from any of the tables in the FROM. This is because a row with a matching row exists in the tables. However, all non-matching rows with NULL or invalid data in the outer table do not have a matching row in the inner table. Therefore, the entire inner table row is missing and no column is available for the SELECT list. This is the equivalent of a NULL. Since the exception row is missing, there is no data available for display. All referenced columns from the missing inner table rows will be represented as a NULL in the display.

The basic syntax for a two-table OUTER JOIN follows:

Unlike the INNER JOIN, there is no original join syntax operation for an OUTER JOIN. The OUTER JOIN is a unique answer set. The closest functionality to an OUTER JOIN comes from the UNION set operator, which is covered later in this book. The other fantastic quality of the newer INNER and OUTER join syntax is that they both can be used in the same SELECT with three or more tables.

The next several sections explain and demonstrate all three formats of the OUTER JOIN. The primary issue when using an OUTER JOIN is that only one format can be used in a SELECT between any two tables. The FROM list determines the outer table for processing. It is important to understand the functionality in order to chose the correct outer join.

LEFT OUTER JOIN

The outer table is determined by its location in the FROM clause of the SELECT as shown here:

<Outer-table> LEFT OUTER JOIN <Inner-table>
Or
<Outer-table> LEFT JOIN <Inner-table>

In this format, the Customer table is the one on the left of the word JOIN. Since this is a LEFT OUTER JOIN, the Customer is the outer table. This syntax can return all customer rows that match a valid order number (INNER JOIN) and Customers with NULL or invalid order numbers (OUTER JOIN).

The next SELECT shows customers with matching orders and those that need to be called because they have not placed an order:


6 Rows ReturnedSELECT shows customers with matching orders

The above output consists of all the rows from the Customer table because it is the outer table and there are no residual conditions. Unlike the earlier INNER JOIN, Acme Products is now easily seen as the only customer without an order. Since Acme Products has no order at this time, the order number and the order total are both extended with the "?" to represent a NULL, or missing value from a non-matching row of the inner table. This is a very important concept.

The result of the SELECT provides the matching rows like the INNER JOIN and the nonmatching rows, or exceptions that are missed by the INNER JOIN. It is possible to add the order number to an ORDER BY to put all exceptions either at the front (ASC) or back (DESC) of the output report.

When using an OUTER JOIN, the results of this join are stored in the spool area and contain all of the rows from the outer table. This includes the rows that match and all the rows that do not match from the join step. The only difference is that the non-matching rows are carrying the NULL values for all columns for missing rows from the inner table.

The concept of a LEFT OUTER JOIN is pretty straight forward with two tables. However, additional thought is required when using more then two tables to preserve rows from the first outer table.

Remember that the result of the first join is saved in spool. This same spool is then used to perform all subsequent joins against any additional tables, or other spool areas. So if you join 3 tables using an outer join the first two tables are joined together with the spooled results representing the new outer table and then joined with the third table which becomes the RIGHT table.

Using the Student, Course and Student_Course tables, the following SELECT preserves the exception rows from the Student table as the outer table, throughout the entire join. Since both joins are written using the LEFT OUTER JOIN and the Student table is the table name that is the furthest to the left it remains as the outer table:

14 Rows ReturnedLEFT OUTER JOIN and the Student table is the table name
LEFT OUTER JOIN and the Student table is the table name

The above output contains all the rows from the Student table as the outer table in the threetable LEFT OUTER JOIN. The OUTER JOIN returns a row for a student named Michael Larkins even though he is not taking a course. Since, his course row is missing, no course name is available for display. As a result, the output is extended with a NULL in course name, but becomes part of the answer set.

Now, it is known that a student isn't taking a course. It might be important to know if there are any courses without students. The previous join can be converted to determine this fact by rearranging the table names in the FROM to make the Course table the outer table, or by using the RIGHT OUTER JOIN.

RIGHT OUTER JOIN

As indicated earlier, the outer table is determined by its position in the FROM clause of the SELECT. Consider the following:

<Inner-table> RIGHT OUTER JOIN <Outer-table>
Or
<Inner-table> RIGHT JOIN <Outer-table>

In the next example, the Customer table is still written before the Order table. Since it is now a RIGHT OUTER JOIN and the Order table is on the right of the word JOIN, it is now the outer table. Remember, all rows can be returned from the outer table!

To include the orders without customers, the previously seen LEFT OUTER JOIN has been converted to a RIGHT OUTER JOIN. It can be used to return all of the rows in the Order table, those that match customer rows and those that do not match customers.

The following is converted to a RIGHT OUTER JOIN to find all orders:


6 Rows ReturnedRIGHT OUTER JOIN to find all orders

The above output from the SELECT consists of all the rows from the Order table, which is the outer table. In a 2-table OUTER JOIN without a WHERE clause, the number of rows returned is usually equal to the number of rows in the outer table. In this case, the outer table is the Order table. It contains 6 rows and all 6 rows are returned.

This join returns all orders with a valid customer ID (like the INNER JOIN) and orders with a missing or an invalid customer ID (OUTER JOIN). Either of these last two conditions constitute a critical business problem that needs immediate attention. It is important to determine that orders were placed, but that the buyer of them is not known. Since the output was sorted by the customer name, the exception row is returned first. This technique makes the exception easy to find, especially in a large report. Not only is the customer missing for this order, it obviously has additional problems. The total is negative and the order number is all nines. We can now correct a situation we knew nothing about or correct the procedure or policy that allowed for the error to occur.

Using the same Student and Course tables from the previous 3-table join, it can be converted from the two LEFT OUTER JOIN operations to two RIGHT OUTER JOIN operations in order to find the students taking courses and also find any courses without students enrolled:


8 Rows Returnedtwo LEFT OUTER JOIN operations to two RIGHT OUTER JOIN operations

Now, using the output from the OUTER JOIN on the Course table, it is apparent that no one is enrolled in the Logical Database Design course. The enrollment needs to be increased or the room needs to be freed up for another course. Where inner joins are great at finding matches, outer joins are great at finding both matches and problems.

FULL OUTER JOIN

The last form of the OUTER JOIN is a FULL OUTER JOIN. If both Customer and Order exceptions are to be included in the output report, then the syntax should appear as:

<Outer-table> FULL OUTER JOIN <Outer-table>
Or
<Outer-table> FULL JOIN <Outer-table>

A FULL OUTER JOIN uses both of the tables as outer tables. The exceptions are returned from both tables and the missing column values from either table are extended with NULL. This puts the LEFT and RIGHT OUTER JOIN output into a single report.

To return the customers with orders, and include the orders without customers and customers without orders, the following FULL OUTER JOIN can be used:


7 Rows ReturnedFULL OUTER JOIN

The output from the SELECT consists of all the rows from the Order and Customer tables because they are now both outer tables in a FULL OUTER JOIN.

The total number of rows returned is more difficult to predict with a FULL OUTER JOIN. The answer set contains: one row for each of the matching rows from the tables, plus one row for each of the missing rows in the left table, plus one for each of the missing rows in the right table.

Since both tables are outer tables, not as much thought is required for choosing the outer table. However, as mentioned earlier the INNER and OUTER join processing can be combined in a single SELECT. The INNER JOIN still eliminates all non-matching rows. This is when the most consideration needs to be given to the appropriate outer tables.

Like all joins, more than two tables can be joined using a FULL OUTER JOIN, up to 64 tables. The next FULL OUTER JOIN syntax uses Student and Course tables for the outer tables through the entire join process:


15 Rows ReturnedFULL OUTER JOIN
FULL OUTER JOIN

The above SELECT uses the Student, Course and "Student Course" (associative) tables in a FULL OUTER JOIN. All three tables are outer tables. The above includes one nonmatching row from the Student table with a null in the course name and one non-matching row from the course table with nulls in all three columns from the Student table. Since the Student Course table is also an outer table, if there were any non-matching rows in it, they can also be returned containing a null in its columns. However, since it is an associative table used only for a many-to-many relationship between the Student and Course tables, missing rows in it would indicate a serious business problem.

As a reminder, the result of the first join step is stored in spool, which is temporary work space that the system uses to complete each step of the SELECT. Then, the spool area is used for each consecutive JOIN step. This continues until all of the tables have been joined together, two at a time. However, the spool areas are not held until the end of the SELECT. Instead, when the spool is no longer needed, it is released immediately. This makes more spool available for another step, or by another user. The release can be seen in the EXPLAIN output as (Last Use) for a spool area.

Also, when using Teradata, do not spend a lot of time worrying about which tables to join first. The optimizer makes this choice at execution time. The optimizer always looks for the fastest method to obtain the requested rows. It uses data distribution and index demographics to make its final decision on a methodology. So, the tables joined first in the syntax, might be the last tables joined in the execution plan.

All databases join tables two at a time, but most databases just pick which tables to join based on their position in the FROM. Sometimes when the SQL runs slow, the user just changes the order of the tables in the join. Otherwise, join schemas must be built to tell the RDBMS how to join specific tables.

Teradata is smart enough, using explicit or implicit STATISTICS, to evaluate which tables to join together first. Whenever possible, four tables might be joined at the same time, but it is still done as two, two-table joins in parallel. Joins involving millions of rows are considered difficult for most databases, but Teradata joins them with ease.

It is a good idea to use the Teradata EXPLAIN, to see what steps the optimizer plans to use to accomplish the request. Primarily in the beginning you are looking for an estimate of the number of rows that will be returned and the time cost to accomplish it. I recommend using the EXPLAIN before each join as you are learning to make sure that the result is reasonable.

If these numbers appear to be too high for the tables involved, it is probably a Cartesian product; which is not good. The EXPLAIN discovers the product join within seconds instead of hours. If it were actually running, it would be wasting resources by doing all the extra work to accomplish nothing. Use the EXPLAIN to learn this fact the easy way and fix it.

CROSS JOIN

A CROSS JOIN is the ANSI way to write a product join. This means that it joins one or more rows participating from one table with all the participating rows from the other table. As mentioned earlier in this chapter, there is not a large application for a product join and even fewer for a Cartesian join.

Although there are not many applications for a CROSS JOIN, consider this: an airline might use one to determine the location and number of routes needed to fly from one hub to all of the other cities they serve. A potential route "joins" every city to the hub. Therefore, the result needs a product join. Probably what should still be avoided is to fly from every city to every other city (Cartesian join).

A CROSS JOIN is controlled using a WHERE clause. Unlike the other join syntax, a CROSS JOIN results in a syntax error if an ON clause is used.

The following is the syntax for the CROSS JOIN:

The next SELECT performs a CROSS JOIN (product join) using the Student and Course tables:

SELECT Last_name, Course_name FROM Student_table CROSS JOIN Course_tableWHERE Course_ID = 100 ;
10 Rows ReturnedCROSS JOIN (product join) using the Student and Course tables

Since every student is not taking every course, this output has very little meaning from a student and course perspective. However, this same data can be valuable in determining a potential for a situation or the resources that are needed to determine maximum room capacities. For example, it helps if the Dean wants to know the maximum number of seats needed in a classroom if every student were to enroll for every SQL class. However, the rows are probably counted (COUNT(*)) and not displayed.

This SELECT uses a CROSS JOIN to populate a derived table (discussed later), which is then used to obtain the final count:


1 Row Returned

Total SQL Seats Needed

30

The previous SELECT can also be written to use the WHERE clause to the main SELECT to compare the rows of the derived table called DT instead of only building those rows. Compare the previous SELECT with the next one and determine which is more efficient.

Which do you find to be more efficient?

At first glance, it would appear that the first is more efficient because the CROSS JOIN inside the parentheses for a derived table is not a Cartesian product. Instead, the CROSS JOIN that populates the derived table is constrained in the WHERE to only SQL courses rather than all courses. However, the PE optimizes them the same. I told you that Teradata was smart!

Self Join

A Self Join is simply a join that uses the same table more than once in a single join operation. The first requirement for this type of join is that the table must contain two different columns of the same domain. This may involve de-normalized tables.

For instance, if the Employee table contained a column for the manager's employee number and the manager is an employee, these two columns have the same domain. By joining on these two columns in the Employee table, the managers can be joined to the employees.

The next SELECT joins the Employee table to itself as an employee table and also as a manager table to find managers. Then, the managers are joined to the Department table to return the first ten characters of the manager's name and their entire department name:

The self join can be the original syntax (table, table), an INNER, OUTER, or CROSS join. Another requirement is that at least one of the table references must be assigned an alias.Since the alias name becomes the table name, the table is now treated as two completely different tables.

Normally, a self join requires some degree of de-normalization to allow for two columns in the same table to be part of the same domain. Since our Employee table does not contain the manager's employee number, the output cannot be shown. However, the concept is shown here.

Alternative JOIN / ON Coding

There is another format that may be used for coding both the INNER and OUTER JOIN processing. Previously, all of the examples and syntax for joins of more than two tables used an ON immediately following the JOIN table list.

The following demonstrates the other coding syntax technique:

When using this technique, care should be taken to sequence the JOIN and ON portions correctly. There are two primary differences with this style compared to the early syntax. First, the JOIN statements and table names are all together. In one sense, this is more like the syntax of: tablename1, tablename2 as seen in the original join.

Second, the ON statement sequence is reversed. In the above syntax diagram, the ON reference for tablename2 and tablenameN is before the ON reference for tablename1 and tablename2. However, the JOIN for <table-name1> and <table-name2> are still before theJOIN of <table-name2> and <table-nameN>. In other words, the first ON goes with the last JOIN when they are nested using this technique.

The following three-table INNER JOIN seen earlier is converted here to use this reversed form of the ON comparisons:

Personally, we prefer the first technique in which every JOIN is followed immediately by its ON condition. Here are our reasons:

  • It is harder to accidentally forget to code an ON for a JOIN, they are together.
  • Less debugging time needed, and when it is needed, it is easier.
  • Because the join allows 64 tables in a single SELECT, the SQL involving several tables may be longer than a single page can display. Therefore, many of the JOIN clauses will be on a different page than its corresponding ON condition. It might require paging back and forth multiple times to locate all of the ON conditions for every JOIN clause. This involves too much effort. Using the JOIN / ON, they are physically next to each other.
  • Adding another table into the join requires careful thought and placement for both the JOIN and the ON. When using the JOIN / ON, they can be placed almost anywhere in the FROM clause.

All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

Teradata Topics