Understanding Subqueries and Joins - T-SQL

There's a right way and a wrong way to do everything, but unfortunately, there isn't always just one right or wrong method. The subject of correct join syntax is debatable, and there are recommended methods for different products. In the material that follows, I will show you a few different techniques, all of which are currently supported in SQL Server 2005 and 2008. Read on as I describe the methods recommended and guaranteed to provide future support and optimal performance.

As the SQL language has evolved and as it has been implemented in different products, a few different techniques have been devised for joining tables to match up related records. Although there are a few variations of these techniques, there are essentially three different options to join records from two different tables. The first two are different forms of join operations within a single SELECT statement.You can either join tables in the WHERE clause or in the FROM clause. The third technique involves more than one SELECT statement, where one query encompasses a second SELECT statement. This is often called a subquery — a query within a query. The tables in a subquery can be independent of each other or can be related through some kind of matching expression. One technique, used to match rows of the subquery to a row or rows of the main query, is often referred to as a correlated subquery. Subqueries are covered in Chapter (Advanced Queries and Scripting). For now, I will concentrate on the bread and butter of SQL Server queries: the join operation.

Joining Tables in the WHERE Clause

If you work with long-time database professionals who cut their teeth on other products such as Oracle, Informix, or DB2, you are likely to encounter an older style of join syntax that is discouraged in SQL Server. In fact (since 1992 when the most accepted ANSI SQL standard was penned), the traditional form has been mildly discouraged by most of the database product vendors, but old-timers get downright religious about continuing to use the old-school method. Because it's an older technique, it is often referred to as a legacy join. Although it is still partially supported by SQL Server, this is not the recommended approach for SQL Server 2008. You may encounter this syntax in existing code. In this example, both the Customer and SalesOrderHeader are referenced in the FROM clause, and the join operation is performed in the WHERE clause:

SELECT Customer.AccountNumber, SalesOrderHeader.OrderDate FROM Sales.Customer, Sales.SalesOrderHeader WHERE Customer.CustomerID = SalesOrderHeader.CustomerID

The query returns 31,465 rows, comprised of a combination of records from the Customer and SalesOrderDetail tables. Figure (Show IN Below) shows the first few of these rows.

This query implements an inner join, discussed shortly. The equal sign between each of the column references means that this query returns only rows where there are matching records in each of the tables. This is known as an equijoin, meaning that the values in two tables compared in the join operation must be equal to one another. Microsoft is making good on its threat to deprecate this feature of T-SQL. As of SQL Server 2008, the outer join form is no longer supported. I'll discuss this a little later in this chapter.

Joining Tables in the WHERE Clause

Joining Tables in the FROM Clause

The same operation can also be performed using the ANSI standard method. In the FROM clause, the two tables are referenced with a JOIN statement followed by the ON keyword and the same column references used in the preceding example:

SELECT Customer.AccountNumber, SalesOrderHeader.OrderDate FROM Sales.Customer INNER JOIN Sales.SalesOrderHeader ON Customer.CustomerID = SalesOrderHeader.CustomerID

The result is the same. If you view the execution plan for both of these queries, you'll note that they both cause SQL Server to perform exactly the same operations. There is no difference in time, cost, or efficiency. You can view the execution plan for a query by enabling this option in the Query menu. Run the query, and the Execution Plan tab appears below the results window, as shown in Figure (Show In Figure Below). Note that we've removed some of the irrelevant elements from the diagram to keep things simple.

Reading from right to left, each icon represents an operation. The records are retrieved from both tables and held in memory. Because each of the columns referenced in the JOIN statement are indexed, the query-processing engine chooses to scan and retrieve records using these indexes. Rows in the Customer table are retrieved using a separate, non-clustered index. Records in the SalesOrderDetail table are physically ordered by the CustomerID column based on a clustered index. The width of the arrows indicates the relative volume of data returned from the respective operation. The rows are combined using a hash join method to produce the final result. If you execute either of these two queries, you will see that the execution plans are the same. Float the mouse pointer over the left-most icon to see statistics for the finished product. The Subtree cost shows the total time in seconds for this and all operations that lead to it.

Understanding-Subqueries-and-Joins-Figure

So here's the bottom line on this topic: In most cases, legacy joins, performed in the WHERE clause, and ANSI standard joins, performed in the FROM clause, will result in the same execution plan and achieve the same relative performance. However, this doesn't guarantee that you will see the same effect when joins are combined in large, complex queries. Microsoft recommends that you use ANSI standard joins going forward. As of SQL Server 2008, some forms of non-ANSI standard joins are no longer supported. To keep things simple, just use the recommended syntax.

Types of Joins

Two major types of joins exist: inner joins, which return only corresponding records in two tables, and outer joins, which return all the rows in one table and then corresponding rows in a second table.

To demonstrate the behavior of different join types, keep the following facts in mind: The AdventureWorks2008 database contains 19,820 customer records. All but 701 customers have corresponding sales orders. These are corresponding rows in the SalesOrderHeader table, where the CustomerID column value is equal to an existing value in the Customer table's CustomerID. There are 31,465 rows in the SalesOrderHeader table.

The record counts in the SQL Server 2005 AdventureWorks database are different from these, but you should see the same general patterns and behavior when using the same queries.

Inner Joins

This is the most common type of join operation. The purpose of the inner join is to match up rows in one table with corresponding rows in another table where the associated columns contain the same value. If one of these tables' columns has a different value, or no value at all, these rows will not be returned by the query.Before showing you an example, I'll make this request using common language: I'd like to see all of the customers who have orders — and all of the orders that have corresponding customers. For each customer, show me the customer's account number, and for each order, the order date.

Again, the SQL statement that makes this same request is as follows:

SELECT Customer.AccountNumber, SalesOrderHeader.OrderDate FROM Sales.Customer INNER JOIN Sales.SalesOrderHeader ON Customer.CustomerID = SalesOrderHeader.CustomerID

You just saw the results from this query in Figure Show In Below, so I won't show them to you again. Just remember that it returns 31,465 rows that consist of customers with orders as well as orders that have customers.

The INNER JOIN statement can also be abbreviated by simply using JOIN. Although this is not as explicit, it works just as effectively as the earlier statement:

SELECT Customer.AccountNumber, SalesOrderHeader.OrderDate FROM Sales.Customer JOIN Sales.SalesOrderHeader ON Customer.CustomerID = SalesOrderHeader.CustomerID

Outer Joins

The job of an outer join is to return all of the rows from one table and then to match those rows in a corresponding table where the joining column has the same value. The difference between this and an inner join is that the unmatched rows in the first table are still returned by the query.

In common language, an outer join request might look like this: I'd like to see all of the customers, and for the customers who have orders, I'd also like to see related order information. Show me the account number for every customer, and if the customer has orders, show me a row for each combination of customers and orders.

Here's the SQL statement for this request:

SELECT Customer.AccountNumber, SalesOrderHeader.OrderDate FROM Sales.Customer LEFT OUTER JOIN Sales.SalesOrderHeader ON Customer.CustomerID = SalesOrderHeader.CustomerID

Outer joins always favor one table — the table from which you choose all rows. In this case, the table on the left side of the JOIN statement From Customer LEFT OUTER JOIN SalesOrderHeader is the Customer table. This means that all customer rows will be returned from the query and then from the corresponding SalesOrderHeader rows. If you think about it, this makes sense because the SalesOrderDetail table has a foreign key constraint that requires a matching CustomerID value. Given the relationship between these tables, it wouldn't make sense to join them the other way around.

When you execute this query, the results will initially look much the same as before. However, notice the row count: 32,166 rows — 635 more than before. What's going on here? Go back and look at the numbers I gave you just before I introduced inner joins. The Customer table contains 31,465 rows, including 701 without any orders. The outer join returned the customers who don't have orders. When an outer join doesn't have matching rows in the outer table (in a left outer join, the table on the right is the outer table), NULL values are returned. To find customers without orders, look for a NULL in the OrderDate column. But can't you use a query to do this? Rather than making you scroll through over 31,000 rows looking for those missing an order date, just alter the query, adding a WHERE clause:

SELECT Customer.AccountNumber, SalesOrderHeader.OrderDate FROM Sales.Customer LEFT OUTER JOIN Sales.SalesOrderHeader ON Customer.CustomerID = SalesOrderHeader.CustomerID WHERE SalesOrderHeader.OrderDate IS NULL

This query returns 701 rows, customers who have no orders, as shown in Figure Below.

query returns 701 rows, customers who have no orders,

So, what is the purpose of a right outer join? Using a right in place of left in this expression would have the same effect as reversing the order of the tables and columns. In most cases, you could choose to use one or the other. However, if you have an outer join on both sides of a table, you may be constrained to use either a left or a right outer join. Fortunately, visual query design tools can be used to create complex queries. This is easy to do using the Transact SQL Query Builder.

Try It Out

Applying the following steps, you will use the graphic query editor feature in SQL Server Management Studio to build a simple inner join query. Before you begin, make sure you have Management Studio open with a connection to your database server instance.

  1. Open the Query Builder in SQL Server Management Studio.
  2. In the Object Browser, drill down into Databases. Right-click the AdventureWorks2008 (or Adventure Works for SQL Server 2005) database and choose New Query.
  3. In the New Query window, right-click over the white space and choose Design Query in Editor. This opens the graphical query designer in a separate window with the Add Tables dialog on top.
  4. From the Add Table dialog window, select the "Customer (Sales)” table and click Add.
  5. Select the "SalesOrderHeader (Sales)” table (see Figure Show In Below) and click Add. Click "Close” to close the Add Table dialog window.

    Understanding-Subqueries-and-Joins-Figure

    The designer always assumes you will want an inner join. The diamond on the join line between these tables represents the join, and the logic is quite simple.
  6. Right-click the diamond icon to view a pop-up menu of join operations. From the menu, choose Select All Rows from Customer, as shown in Figure Below.

    Select All Rows from Customer

    The diagram adds a rectangular "cap” to the left side of the join to indicate that all rows will be returned from the table on the corresponding side of the join.
  7. Now scroll down or resize the Customer table window and check the box next to the AccountNumber column. Also, for the SalesOrderHeader table, check the OrderDate column (see Figure Show In Below).

    SalesOrderHeader table

Take a look at the SQL statement in the third pane of the designer. It should look like this:

SELECT Sales.Customer.AccountNumber, Sales.SalesOrderHeader.OrderDate FROM Sales.Customer LEFT OUTER JOIN Sales.SalesOrderHeader ON Sales.Customer.CustomerID = Sales.SalesOrderHeader.CustomerID

This is a great tool for learning join syntax. When in doubt, build your queries this way and then examine the SQL. Rather than using the designer as a crutch so you don't have to learn to do it the hard way, use it as a learning tool and then challenge yourself by rewriting the same queries in the Query Editor. As we mentioned earlier, there are those purists who refuse to use these design tools to create queries. In our opinion, it all comes down to time, money, and effort. If you can get the job done more effectively using a utility of some kind without sacrificing functionality, then by all means, do it.

There is also shorthand syntax for outer joins. You can abbreviate the join statement by using LEFT JOIN or RIGHT JOIN rather than LEFT OUTER JOIN and RIGHT OUTER JOIN, respectively.

Legacy Outer Joins

A little earlier in this chapter, I told you that support for legacy joins is going away. In the previous edition of this book, written for SQL Server 2000 and 2005, we showed you how to use a technique — common in other database products — that we refer to as legacy join syntax. We also recommended that this technique not be used because Microsoft had threatened to stop supporting it. Well, guess what? They stopped supporting legacy outer joins in SQL Server 2008. You can still get away with implementing an inner join in the WHERE clause, but you can no longer perform outer joins in the WHERE clause unless you set the compatibility of your database to emulate an older version of SQL Server.

In case you encounter this in existing, older queries, I'll show you an example so that you can recognize queries that must be rewritten to work. A legacy outer join is performed in the WHERE clause by placing an asterisk next to the equal sign between the column references. Either precede the equal sign (*=) or proceed the equal sign (=*) to denote a left or right outer join, respectively. The asterisk points to the table that will return all rows regardless of matching rows in the other table. The most significant issue with this is that ambiguous results can be returned by a legacy outer join when an expression is placed on the side of the join where all records are to be returned (the "*” side). Under certain conditions, the query parser just can't figure out how to build the right execution plan and doesn't apply the correct execution logic.

Here is the same query as the previous example using the legacy join syntax in the WHERE clause:

SELECT Customer.AccountNumber, SalesOrderHeader.OrderDate FROM Sales.Customer, Sales.SalesOrderHeader WHERE Customer.CustomerID *= SalesOrderHeader.CustomerID

Unless your database is in 8.0 or previous compatibility mode, this query will not run and will return an error. Because legacy inner joins are supported but outer joins are not, this leaves the door open for poor and inefficient query design. If any of these working old-school inner joins need to be converted to outer joins, I guess you're stuck. Now it's your job to convince the 25 year DB2 veteran, turned SQL Server DBA, to stop using them.

Multicolumn Joins

Some databases are designed with multi column keys and may require that you define multi column joins in your queries. There is no stated limit to the number of columns that can be used in a join. Typically, you would only need to use more than one joining column to support specific business rules (because joins are usually performed on primary and foreign key columns). The logic of a join expression is very similar to that of a WHERE clause. Multiple comparisons can be combined using the AND and OR operators.

Let's use this technique to find product sales records (in the SalesOrderDetail table) for products sold at dealer cost. This involves matching the ProductID between the two tables and matching the ListPrice from the Product table to the UnitPrice from the SalesOrderDetail table:

57,949 transactions were recorded where the product was sold for the list price. The first few rows of this result are shown in Figure ( Below)). Of course, if you were to change any of the Standard Cost or Unit Price values (as we do in Chapter (Transactions )), your results will be a little different.

Multicolumn Joins

Non-Equijoins

So far, the join operations you've seen (in their various forms) have all used comparisons of equality. In other words, the values compared between two tables must be equal for the query to return matching records. Although far less common, joins can also be performed using any other valid method of comparison. This can include any of those listed in the following table.

Non-Equijoins

I can modify the previous example to find sales orders for products that were sold at a discounted price:

This returns 63,368 rows of sales orders that weren't sold at full list price, some of which are shown in Figure Below.

returns 63,368 rows of sales orders

Of course, I can easily turn this query around to show inflated sales by changing the comparison expression to read Product.ListPrice<SalesOrderDetail.UnitPrice, then find that we haven't jacked up the price above list for even one transaction. Wow! We're running an honest business after all.

Special-Purpose Join Operations

I think it's safe to say that you have seen 99 percent of the join operations you will use day-to-day. Two more types of joins are quite rare: full joins and cross joins. Unless you need to do some very unusual things, you will likely not use them. I can think of just three or four times I've used a full join or cross join in the past few years to solve unique problems.

Full Joins

A fulljoin (or full outerjoin) is an outer join that doesn't favor one of the two tables. The result set will return unmatched values on both sides of the join.

Consider the following hypothetical example. A Parent table contains parent names and a Child table contains child names and parent names. One Parent record doesn't have a matching Child record, and one Child record doesn't have a matching Parent.

Full Joins

Full Joins

If you were to join these two tables on the ParentName columns from both tables using a full outer join, as follows, all records would be returned, including the mismatched parent and the child.

SELECT Child.ChildName, Parent.ParentName FROM Child FULL OUTER JOIN Parent ON Child.ParentName = Parent.ParentName

The results would look like the following example. Note that Mr. Slate and Dino are both returned with NULL values in the joining columns.

Understanding-Subqueries-and-Joins-Table

This hypothetical example demonstrates this point in its simplest form. The AdventureWorks database has a highly normalized design, and there aren't many simple examples of related tables with unmatched records. The following real example requires some extra joins to present the necessary data. Wholesale customers (or stores) exist as records in both the Customer and Store tables. Store records have a corresponding SalesPersonID, which relates to an Employee. A left outer join between the Store and Customer tables is necessary to include mismatched SalesPersonID values (such as mismatched ParentName values in the previous example).

Here's a quick example using the AdventureWorks database in SQL Server 2005. Note that because the SQL Server 2008 sample data was "cleaned up,” this query no longer works with the newer sample database. Customer records exist that do not have an assigned sales person employee, and employee records exist that are not assigned to a customer as a sales person. This query will return customers and employees — and the combination of the two when they are related:

In Figure Below, you will notice NULL values in the CustomerID column returned from the Customer table. If you scroll-down, you will also see NULL values in the EmployeeID column returned from the Employee table.

NULL values in the EmployeeID column returned from the Employee table

This query returned 19,462 results. This is essentially every possible combination of distinct values for the two columns participating in the join, including NULL values.

The last time I used this type of join was for a medical patient scheduling application. Business requirements called for the user interface to display the available appointment blocks for all doctors in the clinic. Doctors with appointments scheduled were to be displayed in a different color. For example, each of three doctors in the clinic could see patients scheduled for appointments that could begin every 15 minutes. An appointment slot table contained scheduling blocks beginning at 9:00 a.m. and ending at 4:45 p.m. Each row in this table represented a 15-minute block (9:00, 9:15, 9:30, and so on). In the scheduling application, my client wanted to see every possible appointment for each doctor, including those that had no appointments scheduled. A full join between the appointment slot table and the appointment table did the trick.

Cross Joins

This is the granddaddy of all joins. When using a cross join, you don't designate columns for the join to match values. The query will simply return every possible combination of rows for two tables without regard for matching column values. This produces what is known as a Cartesian product. Needless to say, this can produce a large volume of rows and could be an effective way to populate a test database table with sample data. Frankly, I have yet to find a practical use for this technique in a production database.

If you were to create a cross join between the Customer and SalesOrderHeader tables, this is what you should expect to see: The Customer table contains 19,185 records, and the SalesOrderHeader table contains 31,465 records. This means that for every Customer row, 31,465 rows will be added to the result set. If you do the math, the result will contain 603,656,025 possible combinations. Please do not run this query on a production server in the middle of the day. If you do, please don't tell your system administrator what book you were reading when you learned to do this. Joking aside, running this query would take several minutes and would consume a fair amount of server resources.

A somewhat more conservative example would be to cross join customers with employees. This query returns about five and a half million records:

SELECT CustomerID, LoginID FROM Sales.Customer CROSS JOIN HumanResources.Employee ORDER BY CustomerID, LoginID

You can also simply omit the CROSS JOIN statement and separate the table references with commas:

SELECT CustomerID, LoginID FROM Sales.Customer, HumanResources.Employee ORDER BY CustomerID, LoginID

In my experience cross joins are more often created by accident than by intent. This is especially true when using the Query Builder. If two tables are added to the diagram that do not have a relationship defined, the Query Builder will generate a cross join query automatically.

Filtering Records in the Join Clause Using a Predicate

This is a lesser-known technique that can be used to filter rows before a join operation is executed. When a traditional multi-table query is executed, the join is processed first and then filtering takes place afterward; at least this is what you're telling SQL Server to do. Fortunately, SQL Server is smart enough to perform filtering ahead of time when it makes sense to do this (as long as indexes and column statistics are up-to-date, which they should be under normal conditions). Regardless, this technique guarantees that records not matching the filtering criteria will not be considered in the join operation.

And here is the same query for the AdventureWorks database in SQL Server 2005:

However unconventional, this method would have the same effect as if the filter criteria (Customer .CustomerID< 10) were specified in the WHERE clause. For readability, I believe that using the WHERE clause is the preferred method.

Joining on an Expression

Chapter (Advanced Queries and Scripting) discusses the idea of using an AS statement as a derived table. For the purpose of the current discussion, it's good to know that joins can be used not only for tables, but also for any SELECT expression. An example follows so that you can see how this works.

The director of marketing wants to reduce the cost of selling small-ticket items but doesn't want to discontinue low-priced items that are selling well. The director would like to see the cumulative sales for the ten least-expensive products. This will require two separate SELECT expressions. The first will return the product rows for the ten least-expensive products, and the second will return the aggregate sales filtered by this product selection.

I'm working from the inside out, so the second expression will actually become the first part of the final query. I'll start with the first step and then add the second expression to the beginning of the query filtering criteria.
The first query simply returns the ten least-expensive products:

The ProductID column is essential, but the other two columns shown in Figure Below are included just for reference in the results.

Joining on an Expression

This expression must be given an alias so that I can refer to it in another SELECT expression. In the following example, I refer to the results of the first query by the alias Cheap Products. The second step is to create another query for the aggregated sales orders. This becomes the outer query that refers to the first inner query by its alias, as if it were a physical table.

The alias I created for the derived table has a descriptive name. The traditional approach is to use a single letter for table aliases. Personally, I find this to be a bit cryptic. You will likely refer to the alias several times, so it's a good idea to keep the name short — but also make it meaningful. To include columns from the inner products query in the outer result set, reference them by using the alias, as if this were just another table. Finally, because I'm aggregating the Line Total column for records sharing the same product, all of the columns except for the aggregate must be included in the GROUP BY column list. This works because each of these column values is unique to a specific product. The results are displayed in Figure (Show In Below).

Understanding-Subqueries-and-Joins-Figure

I'll take some more time to cover derived tables and other subquery techniques in the next chapter. This gives you an idea about using joins in creative ways to address unique business rules. What I've learned (and continue to learn) about using SQL to address unique challenges is that there is almost always a method to solve the problem — and there are often several options. The ideal solution is usually not all that complicated. However, the ideal and most elegant solution is often not the one we use on the first attempt.

Multi-Table Joins

Let's put it all together. I need a list of all stores and, for those store/customers that have purchased products, details of the order and the product information. In this database, a store is a type of a customer, which requires a join between the Store and Customer tables. Because I want all stores regardless of matching orders, this will require an outer join to the SalesOrderHeader table. All other tables are included through inner joins. Figure Show In Below provides an illustration.

Multi-Table Joins

This query might be particularly tricky if you were to write it without the help of the designer. Because the SalesOrderHeader table participates in two joins, an inner join with SalesOrderDetail and an outer join with the Customer table, this breaks up the SQL syntax. Note the mispairing of JOIN and related ON statements in the SQL statement for this query.

This query runs against the AdventureWorks2008 database in SQL Server 2008:

This is the same query designed to work with the AdventureWorks database in SQL Server 2005:

When you need to write a complex query, it may be a good idea to at least start with the graphical query designer. Figure (Show In Figure Below) shows the result set for this query from the results pane of the SQL Query Designer. Note the NULL values in the first row indicating that the store/customer has no related order records.

Understanding-Subqueries-and-Joins-Figure

Union Queries

Joins expand the result set horizontally. That is, columns are added to the results from multiple tables, essentially widening the result. A UNION query expands the results vertically as records are piled on top of one another. A simple example follows. The Employee table and the Individual table both contain records of people. To shorten the list, I'm just going to select the top five rows from each table:

SELECT TOP 5 FirstName FROM Person.Person SELECT TOP 5 Name FROM Purchasing.Vendor

These two queries, even if executed at the same time, return two different result sets, as shown in Figure Below.

two different result sets

Now, I'll put them together. A union combines multiple results with the same number of columns. Columns must have compatible data types. To keep things simple, I suggest that columns have the same names and the same data types. Actually, the column names don't have to match. If the names are different for a corresponding column, the name from the last set will be used in the results. Because these two queries contain the same column names and types, combining them is simple.

Try It Out

Enter and execute the following query: In SQL Server 2008:

SELECT TOP 5 FirstName FROM Person.Person UNION SELECT TOP 5 Name FROM Purchasing.Vendor

In SQL Server 2005:

SELECT TOP 5 FirstName FROM Person.Contact UNION SELECT TOP 5 Name FROM Purchasing.Vendor

Note that each SELECT statement could be executed as an independent query. Although not absolutely necessary, all of the columns have the same name. Figure (Figure Show In Below)shows the results.

Understanding-Subqueries-and-Joins-Figure

A potential problem with this result set is that you have no way of knowing what table each of the rows came from. Depending on what you intend to do with this data, this may or may not be important. In case it is, you need to tag each row with a value indicating its source. This is easily accomplished by adding an alias in the individual SELECT statements. I've also cleaned this query up a bit by naming the columns uniformly. You'll recall that the column name returned by a UNION query is determined by the last query in the union. As a matter of standard convention, I would recommend that all columns have the same names and data types just to avoid surprises. This is the modified query for SQL Server 2008:

By creating a Source alias, you provide a literal value. In the results shown in Figure (Show In Figure Below), this indicates whether the person is an individual customer or an employee, as defined by each of the two queries.

literal value. In the results

The UNION operator when used by itself is actually a UNION DISTINCT. This means that the two queries are merged and sorted and have any duplicates removed. Notice the FirstName column in Figure (Show In Fingure Above). The values are in alphabetical order. This is the result of the implied DISTINCT/SORT operation that accompanies the UNION. On the surface this may seem like a good idea, and it may be, but the additional overhead of sorting and removing duplicates can be quite costly when used against large tables. For best performance, use the UNION ALL operator instead. The UNION ALL operator simply concatenates the two results. Just keep in mind that if the same first name-last name combination existed in both the Employee and Individual table, it would appear twice.

In the early days of client/server computing, there were two different approaches to managing data. A database was either managed centrally, in a mainframe- or midrange-hosted environment, or data was stored in small-scale, file-based data stores. There really wasn't a middle ground between these two options. To compensate for the lack of desktop database scalability, programmers found creative ways to replicate and synchronize data, but none of these work-arounds provided true concurrency. In a large- scale, multi-regional business environment, it often makes sense to store the data that will be used by local users on a server relatively close to them. This approach may provide less network overhead and faster query performance. This provides for the needs of users in the region who need access to the region-specific data, but it also means that users who need access to data from another region must retrieve it from a remote server.

Here's a scenario that nearly all of us can relate to: When you make a purchase with a credit card, the store uses a merchant service provider to authorize and transact the purchase. This could be a service offered by their bank, credit card company, or a third-party clearinghouse. Your card number is sent to a data center, a query is executed against a database to retrieve your account information, and then records are added to complete the transaction. Most likely, the bank doesn't have one central data center. A data center is a facility housing a group of computer servers. PC-based servers are often clustered together to form a single, virtual server. Essentially, it's a bunch of PCs lumped together, pretending to be a more powerful computer — almost like a mainframe. The bank probably has a few regional data centers, and merchants connect to one closest to them. For simplicity, assume that my bank has merchant services data centers in Seattle (west coast accounts) and Atlanta (east coast accounts).

Partitioning and Federating Data

I live, and my bank accounts are based, in Washington State. If I were to make a purchase at a local store — or a store anywhere in the western United States — the transaction would be processed in the Seattle data center. If someone whose bank account is managed in another data center were to make a purchase in the Seattle area, a connection from the local data center to the customer's home data center would be used to retrieve their account information. Here's how this might work.

If these accounts were managed in SQL Server, a remote server connection may be used to provide on- demand connectivity to a database in another regional data center. In this example, I use the server names EastCoastServer and WestCoastServer to represent the remote servers and Local Host to represent the respective local server. At each data center, the local database contains a view (a stored SELECT query) called vw_AllAccounts. In the Seattle regional center, the SQL script for the view looks like this:

In Atlanta, a view with the same name looks like this:

Of course, these are hypothetical examples that won't run with our sample databases.

In each of the regional databases, the local accounts table (either WestCoastAccounts or EastCoastAccounts) is used to manage accounts in that region. The account lookup is performed using a stored procedure that accepts the account number as an input parameter and then looks up a record by matching this value to the Account Number column. Because there will only ever be one matching record, the stored procedure's SELECT statement uses a TOP 1 modifier that causes the database to stop searching for records after it locates one. The stored procedure script looks like this:

As you can see, this is just a simple SELECT statement with a parameterized value (@AccountNumber) that gets passed in when it is executed. Chapter ( Creating and Managing Database Objects) uses the views and stored procedure to discuss what happens when the credit card transaction is processed.

Performance and Scaling Considerations

Now, let's look at a smaller-scale implementation to improve performance and efficiently manage a large volume of data. In this scenario, sales records are stored in a single table. Perhaps several thousand records are inserted each month. To report on this data over a period of years, it would be necessary to keep perhaps millions of rows in the sales table. Managing such a large volume of data in an active database will inevitably slow performance and raise administrative costs. At first glance this may seem like a perplexing challenge.

Let's view the facts and consider the options to resolve this quandary. To effectively deal with performance issues, you must first identify the bottlenecks. What's typically the slowest part of a computer system? Nearly all system components are solid-state, route electronic signals moving at nearly the speed of light and rely on transistors switching pulses at millions of times per second. Computers process data at incredible speeds but continue to store it on spinning disks, using mechanical armatures to write and read values. The hard disk is almost always the slowest part of a system. When retrieving data, other components wait while the disk controller finds and moves the read/write head to the right track and then waits for the disk to rotate to the sectors containing the data bits, assembling interlaced values as the disk rotates multiple times. The head is relocated again until all data is retrieved. You can't speed up the disk, but you can spread data across multiple disks to speed up the overall process.

In a partitioned storage solution, data is split up onto multiple disks. This affords SQL Server the opportunity to retrieve data in parallel and make more efficient use of multiple disk drives. There are actually a few different approaches to multiple-disk storage. In this scenario, I'll present an unsophisticated but very effective technique using multiple tables and a Union query.

At the end of every year, a routine removes year-old sales records from the current sales table and inserts them into a specific archive table containing only that year's sales. Each of these tables is placed on a separate physical disk drive. Over a few years, there may be several tables. Suppose that this system has been in place for five years, and the disks and tables are set up as shown in Figure Below.

Performance and Scaling Considerations

For reporting, a view is created that contains a series of UNION statements:

When this query is executed, all of these disk drives may be accessed at the same time. Assuming there are no other performance barriers, this could theoretically improve performance by a factor of five. Not only is this query going to run faster, but it will also ease the burden on competing operations rather than blocking processes and other users' requests. You'll see some more specific examples of this type of query in the discussion on creating database objects in Chapter (Creating and Managing Database Objects).

All rights reserved © 2020 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

T-SQL Topics