CLR User-Defined Aggregates - SQL Server 2008

When working with T-SQL, it’s often desirable to answer various questions at different levels of granularity. Although it’s interesting to know the price of each line item in an order, that information might be more valuable in the form of a total for the entire order. And at the end of the quarter, the sales team might want to know the total for all orders placed during the previous 3 months; or the average total order price; or the total number of visitors to the web site who made a purchase, divided by the total number of visitors, to calculate the percentage of visitors who bought something.

Each of these questions can be easily answered using T-SQL aggregate functions such as SUM, AVG, and COUNT. But there are many questions that are much more difficult to answer with the builtin aggregates. For example, what is the median of the total order prices over the last 90 days? What is the average order price, disregarding the least and most expensive orders? These are examples of questions that are difficult to answer with T-SQL aggregates. For instance, the standard algorithm for finding a median involves sorting and counting the set of values, and then returning the value in the middle of the sorted set. Translated to SQL Server, this would most likely mean using a cursor, walking over the result set to find the count, and then backtracking to get the correct value. And while that is a workable solution for a single group, it is not easy to adapt to multiple groups in the same rowset. Imagine writing that cursor to find the median sales amount for every salesperson, split up by month, for the last year—not a pretty picture.

User-defined aggregates eliminate this problem by giving developers tools to create custom aggregation functions in the .NET language of their choice. These aggregate functions are built to be robust and extensible, with built-in consideration for parallelism and flags that control behavior such that the query optimizer can better integrate the aggregations into query plans. User-defined aggregates can provide powerful support for operations that were previously extremely difficult in SQL Server.

In this section, we’ll examine a CLR user-defined aggregate that calculates a “trimmed” mean— an average of a set of numbers minus the smallest and largest input values.

Adding a User-Defined Aggregate to a SQL Server Project

To add a user-defined aggregate to a preexisting SQL Server project in Visual Studio 2008, right-click the project name in Solution Explorer and select Add ➤Aggregate, as shown in Figure.

Adding a user-defined aggregate to a SQL Server project

Adding a user-defined aggregate to a SQL Server project

Once the aggregate has been added to the project, Visual Studio will add template code. The following code is the result of adding an aggregate called TrimmedMean:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(Format.Native)]
public struct TrimmedMean
{
public void Init()
{
// Put your code here
}
public void Accumulate(SqlString Value)
{
// Put your code here
}
public void Merge(TrimmedMean Group)
{
// Put your code here
}
public SqlString Terminate()
{
// Put your code here
return new SqlString("");
}
// This is a placeholder member field
private int var1;
}

Parts of a User-Defined Aggregate

Programming a user-defined aggregate is in many ways similar to programming user-defined types. Both aggregates and types are represented by classes or structures that are serializable. It is important to understand when dealing with aggregates that the intermediate result will be serialized and deserialized once per row of aggregated data. Therefore, it is imperative for performance that serialization and deserialization be as efficient as possible.

SqlUserDefinedAggregate Attribute

The SqlUserDefinedAggregate attribute, much like the SqlUserDefinedType attribute, functions primarily as a way for developers to control serialization behavior. However, the attribute also exposes parameters that can allow the query optimizer to choose better query plans depending on the data requirements of the aggregate. The parameters exposed by the attribute are as follows:

  • Format: The Format of a user-defined aggregate indicates the method of serialization that will be used. The Native option means that the CLR will control serialization automatically, whereas UserDefined indicates that the developer will control serialization by implementing the IBinarySerialize interface. Native serialization is faster than user-defined serialization, but much more limited: it can serialize aggregates only if all member variables are value types, such as integers and bytes. Reference types such as arrays and strings require user-defined serialization. Given the performance implications of serialization and deserialization on a per-row basis, developers should try to avoid using reference types in aggregates whenever possible.
  • IsInvariantToDuplicates: The IsInvariantToDuplicates parameter indicates that the aggregate is able to handle duplicate input values. Setting this parameter to true can help the optimizer formulate better query plans when the aggregate is used. An example of an aggregate that is invariant to duplicates is MIN; no matter how many duplicate values are passed in, only one is the minimum. The default value for this parameter is false.
  • IsInvariantToNulls: This parameter indicates to the query optimizer whether the aggregate ignores null inputs. Certain query plans might result in extra nulls being passed into the aggregate; if it ignores them, this will not modify the aggregation. An example of an aggregate with this behavior is SQL Server’s SUM aggregate, which ignores nulls if at least one non-null value is processed. The default for this parameter is false.
  • IsInvariantToOrder: This property is not currently used by the query processor, so its existence is for informational purposes only.
  • IsNullIfEmpty: This parameter indicates whether the aggregate will return null for cases in which no values have been accumulated. This can allow the query engine to take a shortcut in certain cases. The default value for this parameter is true.
  • MaxByteSize: This parameter, similar to the same parameter on the SqlUserDefinedType attribute, controls how large, in bytes, the aggregate’s intermediate data can grow. In SQL Server 2005, the maximum size and default value was 8,000. In SQL Server 2008, you can specify a value of –1, which means the maximum byte size is unlimited (actually the maximum size is 2GB).
  • Name: This parameter is optionally used by the Visual Studio deployment task to name the aggregate within the target database differently than the name of the class or structure that defines the aggregate.

Init

The life of an instance of an aggregate begins with a call to Init. Within this method, any private members should be initialized to the correct placeholder values for having processed no rows. There is no guarantee that any data will ever be passed into the aggregate just because Init was called. Care should be taken to ensure that this assumption is never coded into an aggregate. An instance of an aggregate can be reused multiple times for different groups within the result set, so Init should be coded to reset the entire state of the aggregate.

Accumulate

The Accumulate method takes a scalar value as input and appends that value, in the correct way, to the running aggregation. That scalar value is an instance of whatever type is being aggregated. Since these values are coming from the SQL Server engine, they are nullable, and since the method itself has no control over which values are passed in, it must always be coded to properly deal with nulls. Remember that even if the column for the input to the aggregation is defined as NOT NULL, a NULL can result from an OUTER JOIN or a change in project requirements.

Merge

In some cases, query plans can go parallel. This means that two or more operations can occur simultaneously—including aggregation. There is a chance that some aggregation for a given aggregate of a given column will take place in one thread, while the rest will take place in other threads. The Merge method takes an instance of the aggregate as input and must append any intermediate data it contains into its own instance’s member data.

Terminate

The final call in an aggregate’s life is Terminate. This method returns the end result of the aggregation.

Programming the TrimmedMean Aggregate

T-SQL has long included the AVG aggregate for calculating the mean value of a set of inputs. This is generally quite useful, but for statistical purposes, it’s often desirable to eliminate the greatest and least values from a mean calculation. Unfortunately, doing this in pure T-SQL is quite difficult, especially if the query also includes other aggregates that should not exclude the rows with the greatest and least amounts. This is a classic problem, and it’s the kind that CLR user-defined aggregates excel at solving.

To calculate the mean value excluding the maximum and minimum values, the aggregate must keep track of four values:

  • A count of the number of values processed so far
  • A running sum of all input values
  • The minimum value seen so far
  • The maximum value seen so far

The final output value can be calculated by subtracting the minimum and maximum values from the running sum, and then dividing that number by the count, minus 2 (to account for the subtracted values). The following private member variables will be used to keep track of these values:

private int numValues;
private SqlMoney totalValue;
private SqlMoney minValue;
private SqlMoney maxValue;

The Init method will prepopulate each of these variables with the appropriate values. numValues and totalValue will both be initialized to 0, starting the count. minValue will be initialized to SqlMoney.MaxValue, and maxValue to SqlMoney.MinValue. This will ease development of comparison logic for the initial values entered into the aggregate. Note that the SqlMoney type is used for this example to facilitate taking averages of order data in the AdventureWorks database. Other applications of such an aggregate may require different types. The following code is the implementation of Init for this aggregate:

public void Init()
{
this.numValues = 0;
this.totalValue = 0;
this.minValue = SqlMoney.MaxValue;
this.maxValue = SqlMoney.MinValue;
}

So that the aggregate behaves similarly to intrinsic SQL Server aggregates like AVG, it’s important that it ignore nulls. Therefore, the Accumulate method should increment the numValues variable only if the input is non-null. The following code implements Accumulate for this aggregate:

public void Accumulate(SqlMoney Value)
{
if (!Value.IsNull)
{
this.numValues++;
this.totalValue += Value;
if (Value < this.minValue)
this.minValue = Value;
if (Value > this.maxValue)
this.maxValue = Value;
}
}

Implementing Merge is very similar to implementing Accumulate, except that the value comes from another instance of the aggregate instead of being passed in from the query engine.

public void Merge(TrimmedMean Group)
{
if (Group.numValues > 0)
{
this.numValues += Group.numValues;
this.totalValue += Group.totalValue;
if (Group.minValue < this.minValue)
this.minValue = Group.minValue;
if (Group.maxValue > this.maxValue)
this.maxValue = Group.maxValue;
}
}

The final step in coding the aggregate is to define the Terminate method. Since the lowest and highest input values will be ignored, the output will be null if numValues is less than 3; it is impossible to ignore values that don’t exist! Aside from that, the algorithm employed is as described previously: divide the total value by the number of values after subtracting the minimum and maximum.

public SqlDecimal Terminate()
{
if (this.numValues < 3)
return (SqlMoney.Null);
else
{
this.numValues -= 2;
this.totalValue -= this.minValue;
this.totalValue -= this.maxValue;
return (this.totalValue / this.numValues);
}
}

Since the aggregate uses only value types as member variables, native serialization will suffice, and the default SqlUserDefinedAggregate attribute will not need to be modified. The complete code for the aggregate follows:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(Format.Native)]
public struct TrimmedMean
{
public void Init()
{
this.numValues = 0;
this.totalValue = 0;
this.minValue = SqlMoney.MaxValue;
this.maxValue = SqlMoney.MinValue;
}
public void Accumulate(SqlMoney Value)
{
if (!Value.IsNull)
{
this.numValues++;
this.totalValue += Value;
if (Value < this.minValue)
this.minValue = Value;
if (Value > this.maxValue)
this.maxValue = Value;
}
}
public void Merge(TrimmedMean Group)
{
if (Group.numValues > 0)
{
this.numValues += Group.numValues;
this.totalValue += Group.totalValue;
if (Group.minValue < this.minValue)
this.minValue = Group.minValue;
if (Group.maxValue > this.maxValue)
this.maxValue = Group.maxValue;
}
}
public SqlMoney Terminate()
{
if (this.numValues < 3)
return (SqlMoney.Null);
else
{
this.numValues -= 2;
this.totalValue -= this.minValue;
this.totalValue -= this.maxValue;
return (this.totalValue / this.numValues);
}
}
private int numValues;
private SqlMoney totalValue;
private SqlMoney minValue;
private SqlMoney maxValue;
}

Using the TrimmedMean Aggregate

Once deployed to the database, user-defined aggregate functions can be used just like built-in aggregates.For instance, to compare the results returned by the T-SQL AVG function to those returned by TrimmedMean for the total order amounts in the AdventureWorks database, you can use the following query:

SELECT
AVG(TotalDue) AS AverageTotal,
dbo.TrimmedMean(TotalDue) AS TrimmedAverageTotal
FROM Sales.SalesOrderHeader

The results of this query show a slightly lower average for the trimmed figure: $4,464.88 instead of $4,471.28 for the normal average.

Managing User-Defined Aggregates

If an assembly has been loaded into the database using CREATE ASSEMBLY, aggregates can be created or dropped without using the Visual Studio deployment task.

To create an aggregate that is exposed in an assembly, use CREATE AGGREGATE and specify the name of the assembly and the name of the structure or class that defines the aggregate. The following code creates the TrimmedMean aggregate from an assembly called Aggregates:

CREATE AGGREGATE TrimmedMean
EXTERNAL NAME Aggregates.TrimmedMean

To drop an aggregate, use DROP AGGREGATE. The following code drops the TrimmedMean aggregate:

DROP AGGREGATE TrimmedMean

There is no catalog view dedicated to aggregates, but some data is exposed in the sys.objects view. To get information about user-defined aggregates, filter the type column for the value AF.

SELECT *
FROM sys.objects
WHERE type = 'AF'

Multiple-Input User-Defined Aggregates

Up until SQL Server 2008, user-defined aggregates could allow only one input parameter, such as AVG(TotalDue) and dbo.TrimmedMean(TotalDue) in the previous examples. Now it is possible to passmore than one parameter through your user-defined aggregate.

To demonstrate, we will work through an example that uses a table called MyValue, defined formally as follows:

CREATE TABLE MyValue
(v1 money,
v2 money)
--Let's insert some test values as well
INSERT INTO MyValue VALUES (1,1)
INSERT INTO MyValue VALUES (2,2)
INSERT INTO MyValue VALUES (10,5)

We want to create an aggregate that will take the values of both columns and multiply them for each row. It will then perform a sum of the result of these products (products with regard to the mathematics definition—not something people buy). Our aggregate would ideally work as follows:

  • For the first row, the Accumulate function will return 1 (which is 1 × 1).
  • For the second row, the Accumulate function will return 4 (which is 2 × 2).
  • For the third row, the Accumulate function will return 50 (which is 10 × 5).
  • The Merge function then sums these values and will return a value of 55.

The code for this simple aggregate is as follows:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(Format.Native)]
public struct GetTotal
{
public void Init()
{
this.totalValue = 0;
}
public void Accumulate(SqlMoney firstValue,SqlMoney secondValue)
{
this.totalValue += firstValue * secondValue;
}
public void Merge(GetTotal Group)
{
this.totalValue += Group.totalValue;
}
public SqlMoney Terminate()
{
return totalValue;
}
private SqlMoney totalValue;
}

Although this example is simple and the same effect could be achieved using other programmingconcepts, the important point to note here is the ease of passing multiple inputs into your aggregate function. Now let’s create our assembly and aggregate in SQL, as follows:

CREATE ASSEMBLY MyAggregates
FROM 'C: est_codeMyFunction.dll'
CREATE AGGREGATE SUM_ALL(@v1 money,@v2 money)
RETURNS money
EXTERNAL NAME MyAggregates.GetSubTotal

Now we can issue SELECT dbo.SUM_ALL(v1,v2) FROM MyValue and obtain our expected value.

55.0000

All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

SQL Server 2008 Topics