The UNION Operator - Firebird

The set operator UNION can be used to combine the output of two or more SELECT statements and produce a single, read -only set comprising rows derived from different tables or from different sets queried from the same table. Multiple queries are “stacked,” with each subset specification being linked to the next by the UNION keyword. Data must be chosen and the query constructed in such a way as to ensure the contributing input sets are all union compatible.

Union-Compatible Sets

For each SELECT operation contributing a stream as input to a UNION set, the specification must be a column list that exactly matches those of all the others by degree (number and order of columns) and respective data type. For example, suppose we have these two table specifications:

The tables are union compatible because we can query both to obtain sets with matching “geometry”:

A UNION having SELECT * FROM either table would not work because the table structures are different—the second table has no ISBN column.

Using Runtime Columns in Unions

The column identifiers of the output set are those specified in the first SELECT specification. If you would like to use alternative column names, column aliasing can be used in the output list of the first SELECT specification. Additionally, if needed, runtime fields derived from constants or variables can be included in the SELECT clause of each contributing stream. The next query provides a much more satisfactory listing of publications from these two tables:

Search and Ordering Conditions

Notice in the preceding example that search conditions are possible in each contributing SELECT specification. They are normal search expressions, which must be confined to the contributing set controlled by the current SELECT expression. There is no way to correlate search conditions across the boundaries of the subsets.

Only one ordering clause is allowed, and it must follow all subsets. The ORDER BY degree syntax (i.e., ordering by column position number) is required for ordering union sets.

Re-Entrant UNION Queries

It is possible to apply a re-entrant technique to produce a union query that draws multiple subsets from a single table. Table aliases are required in the FROM clauses, but column references need not be fully qualified. Returning to our CURRENT_TITLES table, suppose we want a list that splits out the book titles according to price range. The re -entrant query could be something like this:

UNION ALL

If duplicate rows are formed during the creation of the union set, the default behavior is to exclude the duplicate rows from the set. To include the duplicates, use UNION ALL instead of UNION on its own.

OPTIMIZATION TOPIC: Query Plans and the Optimizer

This section looks at the Firebird optimizer subsystem and the strategies it uses to devise the query plans that the engine will use for SELECTs and subqueries at execution time. We take a look at query plan syntax and some possibilities for presenting your own custom plan to the engine.

Plans and the Firebird Query Optimizer

To process a SELECT statement or a search condition, the Firebird engine subjects the statement to a set of internal algorithms known as the query optimizer. Each time the statement is prepared for execution, the optimizer computes a retrieval plan.

The Plan

The query plan provides a kind of road map that tells the engine the least costly route through the required process of searching, sorting, and matching to arrive at the requested output. The more efficient a plan the optimizer can construct, the faster the statement will begin returning results.

A plan is built according to the indexes available, the way indexes or streams are joined or merged, and a method for searching (access method).

When calculating a plan, the optimizer will consider every index available, choosing or rejecting indexes according to their cost. It takes other factors into account besides the existence of an index, including the size of the table and, to a degree, the distribution of distinct values throughout the index. If the optimizer can determine that an index would incur more overhead than stepping row by row through a stream, it may choose to ignore the index in favor of forming an intermediate sorted stream itself or processing the stream in natural order.

Plan Expressions

Firebird provides the same plan expression syntax elements for setting plans in SQL as it provides to the engine. Understanding the plans generated by the optimizer can be very useful, both for anticipating how the optimizer will resolve the task to be done and for using as a basis for custom plans.

Plan Expression Syntax

The syntax pattern for the PLAN clause is

<query-specification> PLAN <plan_expression>

This syntax allows specification of a single relation or a join of two or more relations in one pass. Nested parentheses can be used to specify any combination of joins. The operations pass their results from left to right through the expression.

Symbols

In the notation used here, the round brackets and commas are elements of the syntax. Curly braces, square brackets, and pipe symbols are not part of the syntax—as with earlier syntax layouts, they indicate, respectively, mandatory and optional phrases and mutually exclusive variations.

The Elements

join-type can be JOIN or MERGE:

  • The default join type is JOIN (i.e., to join two streams using an index on the right-side stream to search for matching keys in the left-side stream).
  • MERGE is chosen if there is no useable index. It causes the two streams to be sorted into matching orders and then merged. In custom plans, it will improve retrieval speed to specify this join type for joins where no indexes are available.

table-identifier identifies a stream. It must be the name of a table in the database or an alias. If the same table will be used more than once, it must be aliased for each usage and followed by the name of the table being aliased. To enable the base tables of a view to be specified, the syntax allows multiple table identifiers. Plans for views are discussed in Chapter Views.

access-type must be one of the following:

  • NATURAL order, which means rows are accessed sequentially in no particular order. It is the default access type and can be omitted, although, again, it is wise to include it in custom plans to aid documentation.
  • INDEX allows one or more indexes to be specified for testing predicates and satisfying join conditions in the query.
  • ORDER specifies that the result of the query is to be sorted (ordered) on the leftmost stream, using the index.

A plan_item comprises an access type and the table identifier or alias.

Understanding the Optimizer

If you are unfamiliar with query plans, you are probably quite confused about how all of this syntax translates to a plan. A little later, the syntax will make more sense when we take a look at some plans generated by the optimizer. However, at this point, it will be helpful to examine how the optimizer evaluates the “materials” for its operations: the join and search conditions requested in the statement, the streams underlying the query specification, and the indexes that are available.

Factors in Cost Evaluation

The objective of the optimizer is to establish a plan that reflects the strategy that, according to several factors, is likely to start returning the output stream fastest. The evaluation may be quite imprecise, using several variables that are no more than rough estimates. Factors considered include

  • Availability of an index and the selectivity of that index. The selectivity factor used in estimates is that read from the system tables when the database was opened. Even at startup, it may not be accurate and it may have been altered due to distribution changes in operations since the last selectivity was computed.
  • The cardinality (number of rows) in the table streams.
  • Whether there are selection criteria and, if so, whether an index is available or is suitable.
  • The need to perform sorts, both intermediate (for merges) and final (for ordering and grouping).

Streams

The term “stream” that crops up in discussions about the optimizer is merely a generic way to refer to a set of rows. The set might be all of the rows and columns in a table, or it might be a set from a table that is restricted in some way by column specifications and search conditions. During the evolution of a query, the engine may create new streams from contributing stream, such as an internal, sorted set or a set of subqueried values for an IN( ) comparison. See also the upcoming “Rivers” section.

Use of Indexes

In queries, the Firebird engine can use an index for three kinds of tasks:

  • Equality comparisons: These perform an equality, greater than, less than, or STARTING WITH comparison between the value in a column and a constant. If the column is indexed, the index key is used to eliminate ineligible rows, making it unnecessary to fetch and scan unwanted rows of table data just to eliminate them.
  • Matching streams for joins: If an index is available for the columns named as the join condition for the stream on one side of the join, the stream on the other side is retrieved first and its join columns are matched with the index.
  • Sorting and grouping: If an ORDER BY or GROUP BY condition specifies a column that has an index, the index is used to retrieve the rows in order and a sort operation is not required.

Bitmapping of Streams

When join criteria can be matched to an indexed column, the Firebird engine generates a bitmap of each stream selected. Subsequently, using AND and OR logic in accordance with the search criteria, the individual bitmapped row streams are combined into a single bitmap that describes all of the eligible rows. When the query needs to scan the same stream in multiple passes, scanning the bitmap is much faster than revisiting the indexes.

Binary Language Representation

Underlying Firebird’s SQL engine is its native binary language representation (BLR) language and a bare-bones interpreter consisting of a lexer, a parser, a symbol table, and code generator. The DSQL engine and embedded applications pass BLR to the optimizer, and it is BLR—not strings—that the optimizer analyzes. It can happen that two different SQL statements are interpreted into identical BLR.

Joins Without Indexes

If no index is available for a join term, the optimizer specifies a sort merge between the two streams. A sort merge involves sorting both streams and then scanning one time through both to merge the matching streams into a river. Sorting both streams avoids the need to scan the right stream repeatedly to find matches for the left stream’s keys.

Rivers

A river is a stream that is formed when two streams are merged by a join. When joins are nested, a river can become a stream in a subsequent operation. Various strategies for merging streams into rivers are compared for cost. The resulting plan captures the best strategy that the optimizer can determine for joining pairs of streams in order from left to right.

In calculating the cost of a joining strategy, the optimizer gives a weighting to the order in which streams are joined. Alternative ordering evaluations are more likely with inner joins than with outer. Full joins require special processing and extra weighting.

The length of a particular river comes into account after the longest path is determined in the course of join order evaluation. The optimizer initially favors the longest join path—the maximum number of stream pairs that can be joined directly. The favored access method is to scan through the longest stream (ideally, the leftmost stream) and loop through shorter streams.

However, more important than the length of the join path is how rows from the right stream are read. The access type is evaluated according to the indexes available and their attributes (selectivity) compared with natural order and any sorting specifications that are in the picture.

If a better index is available for a shorter stream, choosing the longest stream as the controlling stream may be less important than the cost saving from choosing a shorter stream with a highly selective (ideally, unique) index. In this case, the eligible rows in the searched stream are retrieved in a single visit and ineligible rows are ignored.

Examples of Plans

Many graphical database admin and SQL monitor tools provide the capability to inspect the optimizer’s plan when a statement is prepared. Firebird’s own isql utility provides two interactive commands for viewing plans.

Inspecting Plans with isql

SET LAN

In an interactive isql session, you can use SET PLAN, with the optional keywords ON or OFF, to toggle the option to display the plan at the head of the console output:

SET STATS

With another toggle command, you can have isql display some statistics about queries at the end of the data output, which can be very useful for testing alternative query structures and plans:

SET PLANONLY

An alternative toggle command, SET PLANONLY [ON | OFF], prepares the statement and displays the plan, without executing the query:

If you intend to use the optimizer’s plan as a starting point for constructing a custom PLAN clause, the expression syntax that is output is identical to that required in the statement’s plan expression. Inconveniently, there is no provision in isql to capture the optimizer’s plan into a text file.

The following examples use queries that you can test yourself using the employee.fdb test database that is installed in your Firebird samples directory.

Simplest Query

This query merely retrieves every row from a lookup table, in no specific order. The optimizer determines that, although an index is available (the primary key index), it has no use for it:

Ordered Query

This is still a simple query, without joins:

However, the optimizer chooses the primary key index because it provides the requested ordering without the need to form an intermediate stream for sorting. Now, let’s see what happens when we decide to order the same plain query on a non-indexed column:

The optimizer chooses to perform the sort by walking through COUNTRY in natural order.

Cross Join

The cross join, which produces a generally useless set, does not use join criteria at all:

It simply merges each stream on the left with each stream on the right. The optimizer has no use for an index. However, this example is useful for introducing the connection between table aliases in the join specification and those in the plan.

The aliases specified in the query are used in the plan printed by the optimizer. When specifying a custom plan, it is a good idea to use the same mode of table identification as is used in the query, for consistency. However, whereas the SQL parser disallows mixing table identifiers and aliases in queries, the PLAN clause accepts any mixture. For example, the following is accepted. Note that the optimizer keeps the table identifiers consistent with those used in the query specification.

Join with Indexed Equality Keys

This join denormalizes a one-to-many relationship—each employee has one or more salary history records:

The optimizer chooses to loop over the (potentially) longer detail stream in order to search for relevant rows by the unique primary key index of the EMPLOYEE table. In this instance, either the number of rows in each table is roughly equal, or the number of rows in SALARY_HISTORY does not exceed the number of rows in EMPLOYEE sufficiently to outweigh the benefit of using the unique index as the lookup key. This is an inner join and the optimizer reasonably guesses that the right stream will determine the length of the river.

Let’s look at the optimizer’s treatment of the same streams, when the join is left outer:

This time, one row will be returned for each row in the right stream, regardless of whether there is a matching key in the controlling stream. The length of the river has no relevance here, since outer joins are unequivocal about which table must be on the left side, to be looped over. It is the algorithm for the outer join that determines the access method, not the sizes of the streams. With EMPLOYEE on the left side, it would be impossible to form the outer join by looping over the SALARY_HISTORY table to look up in EMPLOYEE.

Because optimizer does not have any choice about the order in which the streams could be joined, it simply picks the most usable index on SALARY_HISTORY.

When Size Matters

In the next example, table size comes into account and is not overlooked by using the unique primary key index. DEPARTMENT has 21 rows, PROJECT has 6 rows, and the optimizer chooses the smaller table’s foreign key index to optimize the lookup on the larger table:

Join with an Indexed ORDER BY Clause

The involvement of an indexed ordering specification can influence how the optimizer chooses to navigate the streams. Take the following example:

The unique index on EMPLOYEE is chosen because of the filter condition implicit in the join criteria. The query will eliminate employees who are not team leaders, and the unique index permits that to happen without needing to search the EMPLOYEE table. The choice of the filter index may also be influenced by the need to use the navigational index on PROJ_NAME for the sort.

The optimizer chooses the index on the right side because the right stream will be as long as the left or (potentially) longer. Again, the optimizer cannot tell that the relationship is, in fact, one to one. The Proj_Name column specified to order the set has a unique index, created for a UNIQUE constraint, to use for the sort and the optimizer chooses it. The sort index appears first, instructing the engine to sort the left stream before it begins matching the join key from the right stream.

Equality Join with no Available Indexes

The tables in the following query are non-indexed copies of the Project and Employee tables:

The streams from both sides will be sorted and then merged by a single pass through both sorted streams and, finally, the river is sorted again because neither of the contributing streams has the requested sort order.

Three-Way Join with Indexed Equalities

Consider the triple equijoin in this example:

Because plenty of usable indexes are available, the optimizer chooses the JOIN access method. The index linking the PDB stream with Department is used to select

The Department stream. When forming between the resulting river and the Project stream, the equijoin between the primary key of Project and the (potentially) longer river is able to be formed by using Project’s primary key index to select from the river.

Three-Way Join with Only One Indexed Equality

For this example, we use the non-indexed copies of Project and Department to demonstrate how the optimizer will use an available index where it can and will do its best to short-circuit the non-indexed equijoin conditions:

In the innermost loop, the foreign key index for the equijoin between the PDB stream is chosen to select the eligible rows from the Department stream. Notice that its selection of this index has nothing to do with the foreign key that the index was constructed to support, since the referenced table is not even present in the statement.

Next, the resulting river and the Project stream are both sorted. Finally (in the outermost loop), the two sorted streams are merged by a single read through the two streams.

Queries with Multiple Plans

When subqueries and unions are specified in a query, multiple SELECT statements are involved. The optimizer constructs an independent plan for each SELECT statement. Take the following example:

The first plan selects the primary key index of EMPLOYEE to look up the TEAM_LEADER code in the primary table for the subquery. The second uses the index PRODTYPEX on the PROJECT table for the PRODUCT filter, because the first key of the index is the PRODUCT column.

Interestingly, if the same query is modified to include an ordering clause, the optimizer overrules its choice to use an index for the filter and chooses instead to use a unique index on PROJ_NAME to navigate the ordering column:

Specifying Your Own Plan

The expression syntax that the optimizer provides in the plan it passes to the Firebird engine is available in SQL, via the PLAN clause. It enables you to define your own plan and restrict the optimizer to your choices.

A PLAN clause can be specified for almost any SELECT statement, including those used in creating views, in stored procedures, and in subqueries. Firebird versions 1.5 and above also accept PLAN clauses in triggers. Multiple plans can be specified independently for the query and any subqueries. However, there is no “all or none” requirement —any plan clause is optional.

A PLAN clause is generated for a SELECT from a selectable stored procedure. Since the output of a selectable stored procedure is a virtual set, any conditions will be based on NATURAL access. However, any SELECT statements within the stored procedure itself will be optimized and can be subjected to a custom plan.

You must specify the names and usages of all indexes that are to be used.

The optimizer always creates a plan, even if a custom plan is specified. While the optimizer does not interfere with a user-supplied plan, it does check that the indexes are valid for the context. Alternative paths are discarded, but otherwise it is business as usual. A custom plan does not cause the optimizer to short-circuit the aspects of plan evaluation and generation that are not surfaced in the PLAN clause syntax.

An invalid index will cause the query request to fail. If any predicates or join conditions are left after all of the indexes named in the plan expression have been used, the optimizer simply evaluates the streams according to natural order and default collation.

By providing your own plan expression, you may gain a little performance speed through bypassing the optimization sequences. However, devising a plan of your own based on the structural rules governing your data may not have the satisfactory outcome you expect, especially if your plan is inherited from another DBMS that uses structural rules to optimize queries.

The Firebird optimizer is primarily cost-based and generally will produce the best plan if your database is well kept through good housekeeping practices. Since the geometry of both indexes and data can change during execution of a statement —particularly if large numbers of rows are updated or deleted by the statement—no optimizer-generated plan can be counted on to remain static from one prepare to another. If you provide a static PLAN expression, the degradation in efficiency may result in performance losses that exceed any advantage gained through avoiding the optimizer.

The message here is that setting your own plan can be a two-edged sword. At design time, you may perceive a gain from forcing a plan that you believe is better than what the optimizer can calculate. In so doing, you switch off the benefit of a dynamic optimizer that can respond to and compensate for successive changes in data or index distributions.

Improving the Query Plan

Badly performing queries in Firebird are most often caused by poor indexes and suboptimal query specifications. In the “Optimization Topic” section in Chapter Indexes, we explored the effect of indexes with poor selectivity. In this section, we take a further look at indexes and some of the misadventures that can befall the optimizer and the designer when indexing interferes with the efficiency of retrieval.

Careful Indexing

It is not a “given” that using an index on a join or a search will make the query faster. There are actually aspects of metadata structure and indexing that cause selection of some indexes to kill performance in comparison with a natural scan.

Duplicated or overlapping indexes may interfere with the optimizer’s ability to use other indexes. Drop any indexes that duplicate the index created automatically for a primary or foreign key or for a unique constraint. Composite primary and foreign key pairs, especially those that are long or carry semantic content, can make queries vulnerable to poor performance, wrong index choices and, not least, human error.When designing tables, consider using surrogate keys to separate the “meaningfulness” of columns for searching and ordering from the functional purpose of the formal key for joins.

Compound Indexes

Compound (composite, complex) indexes are those that comprise more than one key column. A well-considered compound index may help speed queries and searches on complex ANDed conditions, especially when the table is large or a searched column on its own has low selectivity.

With compound indexes, degree (the number of elements and their relative left-to-right positions) is important. The optimizer can make use of an individual column of a compound index or a subgroup of columns in search, join, or ordering operations, provided other columns not involved in the operation do not “obscure” it to the left or interrupt the left-to-right sequence. Figure illustrates the possibilities available to the optimizer when evaluating a compound index on (COL1, COL2, COL3).

Availability of partial index keys

If a compound index can be used to advantage in place of one or more single column indexes, it makes sense to create and test it. There is no benefit in creating compound indexes for groups of OR conditions—they will not be used. Do not be tempted to create a compound index with the expectation that “one index will fit all cases.” Create indexes for identified needs and be prepared to drop any that do not work well.

In practice, it is essential to consider the usefulness of any compound indexes in terms of the most likely output requirements, on a case-by-case basis. The more you allow redundant compound indexes to proliferate, the higher the likelihood that you will defeat the optimizer and get sub-optimal plans. The key elements of compound indexes often overlap with those of other indexes, forcing the optimizer to choose one from a range of two or more competing indexes. The optimizer cannot be guaranteed to choose the best one in every case, and it may even decide to use no index in cases where it cannot determine a winner.

Confirm any assumptions you make about the effect of adding a compound index by testing not just the query in question but all regular or large query tasks that involve one or more of the same columns.

TIP

Single-Column Indexes

Single-column indexes are much more flexible than compound indexes, because they can be used in combination. They are preferred for any conditions that have no special need for a compound index, and they are needed for each searched column that is part of an ORed condition.

Natural Ordering

Don’t be upset to see NATURAL in a query plan and don’t be afraid to use it when testing custom plans. Natural order specifies a top -to -bottom scan through the table, page by page. It is frequently faster for some data, especially matches and searches on columns for which an index would be very unselective, and on tables where the number of rows is small and fairly static.

Deciding on a Join Order

When two or more streams have to be joined, the order in which the streams are retrieved is often more important than all of the other factors combined. Streams are retrieved in left-to-right order, with the leftmost stream of a particular join pair being the “prime controller” determining the search logic for the remaining joins and merges that are connected to it.

Ideally, this controlling stream is searched once. Streams that it joins to are searched iteratively until all matching rows are found. It follows, therefore, that the stream that is costlier to retrieve should be placed to the left of the “cheaper” stream, in the controlling position. The last stream in the join list will be fetched the most times, so make sure it will be the cheapest to retrieve.

Fooling the Optimizer

In situations where the optimizer chooses an index that you want it to ignore, you can trick it into ignoring the index by adding a dummy OR condition. For a simple example, suppose you have a WHERE clause specifying a column that has a poorly selective index that you really want to keep, perhaps to enforce a referential integrity constraint:

If this database is deployed in Australia (country code 'AU'), the selectivity of PARENT_COUNTRY in this database may be so low as to kill performance, but you are stuck with a mandatory index. To have the optimizer ignore the index, change the search predicate to

Getting the Instinct

Getting the “shape” of a join specification right isn’t rocket science and, with experience, you will develop an instinct for it. However, it is a technique that some people struggle with. With all query specifications, but especially with multi-table queries, the secret is to design them conscientiously. If you are inexperienced with SQL, do not rely on a CASE tool or query builder utility to teach you. It is a classic catch-22 situation: Until you acquire the experience, you cannot judge how mindless these tools are. If you always rely on the dumb tool, you will never develop the instinct.

Get acquainted with the performance characteristics inherent in both the structures and content of your data. Understand the normalizations in your database and recognize the shortest routes between your tables and your output specifications. Use pencil and paper to map and mold join lists and search conditions to fit output specifications with precision and without waste.


All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

Firebird Topics