# Advanced Querying and Information Retrieval - Database system concepts

Businesses have begun to exploit the burgeoning data online tomake better decisions about their activities, such as what items to stock and how best to target customers to increase sales. Many of their queries are rather complicated, however, and certain types of information cannot be extracted even by using SQL.

Several techniques and tools are available to help with decision support. Several tools for data analysis allow analysts to view data in different ways. Other analysis tools precompute summaries of very large amounts of data, in order to give fast responses to queries. The SQL:1999 standard now contains additional constructs to support data analysis. Another approach to getting knowledge from data is to use data mining, which aims at detecting various types of patterns in large volumes of data. Data mining supplements various types of statistical techniques with similar goals.

Textual data, too, has grown explosively. Textual data is unstructured, unlike the rigidly structured data in relational databases. Querying of unstructured textual data is referred to as information retrieval. Information retrieval systems have much in common with database systems in particular, the storage and retrieval of data on secondary storage. However, the emphasis in the field of information systems is different from that in database systems, concentrating on issues such as querying based on keywords; the relevance of documents to the query; and the analysis, classification, and indexing of documents.

This chapter covers decision support, including online analytical processing and data mining and information retrieval.

Decision-Support Systems

Database applications can be broadly classified into transaction processing and decision support. Transaction-processing systems are widely used today, and companies have accumulated a vast amount of information generated by these systems.

For example, company databases often contain enormous quantities of information about customers and transactions. The size of the information storage required may range up to hundreds of gigabytes, or even terabytes, for large retail chains. Transaction information for a retailer may include the name or identifier (such as credit-card number) of the customer, the items purchased, the price paid, and the dates on which the purchases were made. Information about the items purchased may include the name of the item, the manufacturer, the model number, the color, and the size. Customer information may include credit history, annual income, residence, age, and even educational background.

Such large databases can be treasure troves of information for making business decisions, such as what items to stock and what discounts to offer. For instance, a retail company may notice a sudden spurt in purchases of flannel shirts in the Pacific Northwest, may realize that there is a trend, and may start stocking a larger number of such shirts in shops in that area. As another example, a car company may find, on querying its database, that most of its small sports cars are bought by young women whose annual incomes are above $50,000. The company may then target its marketing to attract more such women to buy its small sports cars, and may avoid wasting money trying to attract other categories of people to buy those cars. In both cases, the company has identified patterns in customer behavior, and has used the patterns to make business decisions. The storage and retrieval of data for decision support raises several issues: • Although many decision support queries can be written in SQL, others either cannot be expressed in SQL or cannot be expressed easily in SQL. Several SQL extensions have therefore been proposed to make data analysis easier. The area of online analytical processing (OLAP) deals with tools and techniques for data analysis that can give nearly instantaneous answers to queries requesting summarized data, even though the database may be extremely large. • Database query languages are not suited to the performance of detailed statistical analyses of data. There are several packages, such as SAS and S++, that help in statistical analysis. Such packages have been interfaced with databases, to allow large volumes of data to be stored in the database and retrieved efficiently for analysis. The field of statistical analysis is a large discipline on its own; • Knowledge-discovery techniques attempt to discover automatically statistical rules and patterns from data. The field of data mining combines knowledge discovery techniques invented by artificial intelligence researchers and statistical analysts, with efficient implementation techniques that enable them to be used on extremely large databases. • Large companies have diverse sources of data that they need to use for making business decisions. The sources may store the data under different schemas. For performance reasons (as well as for reasons of organization control), the data sources usually will not permit other parts of the company to retrieve data on demand. To execute queries efficiently on such diverse data, companies have built data warehouses. Data warehouses gather data from multiple sources under a unified schema, at a single site. Thus, they provide the user a single uniform interface to data. The area of decision support can be broadly viewed as covering all the above areas, although some people use the term in a narrower sense that excludes statistical analysis and data mining. Data Analysis and OLAP Although complex statistical analysis is best left to statistics packages, databases should support simple, commonly used, forms of data analysis. Since the data stored in databases are usually large in volume, they need to be summarized in some fashion if we are to derive information that humans can use. OLAP tools support interactive analysis of summary information. Several SQL extensions have been developed to support OLAP tools. There are many commonly used tasks that cannot be done using the basic SQL aggregation and grouping facilities. Examples include finding percentiles, or cumulative distributions, or aggregates over sliding windows on sequentially ordered data. A number of extensions of SQL have been recently proposed to support such tasks, and implemented in products such as Oracle and IBM DB2. Online Analytical Processing Statistical analysis often requires grouping on multiple attributes. Consider an application where a shop wants to find out what kinds of clothes are popular. Let us suppose that clothes are characterized by their item-name, color, and size, and that we have a relation sales with the schema sales(item-name, color, size, number). Suppose that item-name can take on the values (skirt, dress, shirt, pant), color can take on the values (dark, pastel, white), and size can take on values (small, medium, large). Given a relation used for data analysis, we can identify some of its attributes as measure attributes, since they measure some value, and can be aggregated upon. For instance, the attribute number of the sales relation is a measure attribute, since it measures the number of units sold. Some (or all) of the other attributes of the relation are identified as dimension attributes, since they define the dimensions on which measure attributes, and summaries of measure attributes, are viewed. In the sales relation, item-name, color, and size are dimension attributes. (A more realistic version of the sales relation would have additional dimensions, such as time and sales location, and additional measures such as monetary value of the sale.) Data that can be modeled as dimension attributes and measure attributes are called multidimensional data. Cross tabulation of sales by item-name and color. To analyze the multidimensional data, a manager may want to see data laid out as shown in the table in Figure . The table shows total numbers for different combinations of item-name and color. The value of size is specified to be all, indicating that the displayed values are a summary across all values of size. The table in Figure is an example of a cross-tabulation (or cross-tab, for short), also referred to as a pivot-table. In general, a cross-tab is a table where values for one attribute (say A) form the row headers, values for another attribute (say B) form the column headers, and the values in an individual cell are derived as follows. Each cell can be identified by (ai, bj), where ai is a value for A and bj a value for B. If there is at most one tuple with any (ai, bj) value, the value in the cell is derived from that single tuple (if any); for instance, it could be the value of one or more other attributes of the tuple. If there can be multiple tuples with an (ai, bj) value, the value in the cell must be derived by aggregation on the tuples with that value. In our example, the aggregation used is the sum of the values for attribute number. In our example, the cross-tab also has an extra column and an extra row storing the totals of the cells in the row/column. Most cross-tabs have such summary rows and columns. A cross-tab is different from relational tables usually stored in databases, since the number of columns in the cross-tab depends on the actual data. A change in the data values may result in adding more columns, which is not desirable for data storage. However, a cross-tab view is desirable for display to users. It is straightforward to represent a cross-tab without summary values in a relational form with a fixed number of columns. A cross-tab with summary rows/columns can be represented by introducing a special value all to represent subtotals, as in Figure 22.2. The SQL:1999 standard actually uses the null value in place of all, but to avoid confusion with regular null values, we shall continue to use all. Consider the tuples (skirt, all, 53) and (dress, all, 35). We have obtained these tuples by eliminating individual tuples with different values for color, and by replacing the value of number by an aggregate namely, sum. The value all can be thought of as representing the set of all values for an attribute. Tuples with the value all only for the color dimension can be obtained by an SQL query performing a group by on the column item-name. Similarly, a group by on color can be used to get the tuples with the value all for item-name, and a group by with no attributes (which can simply be omitted in SQL) can be used to get the tuple with value all for item-name and color. Relational representation of the data . The generalization of a cross-tab, which is 2-dimensional, to n dimensions can be visualized as an n-dimensional cube, called the data cube. Figure shows a data cube on the sales relation. The data cube has three dimensions, namely item-name, color, and size, and the measure attribute is number. Each cell is identified by values for these three dimensions. Each cell in the data cube contains a value, just as in a cross-tab. In Figure , the value contained in a cell is shown on one of the faces of the cell; other faces of the cell are shown blank if they are visible. The value for a dimension may be all, in which case the cell contains a summary over all values of that dimension, as in the case of cross-tabs. The number of different ways in which the tuples can be grouped for aggregation can be large. In fact, for a table with n dimensions, aggregation can be performed with grouping on each of the 2n subsets of the n dimensions. An online analytical processing or OLAP system is an interactive system that permits an analyst to view different summaries of multidimensional data. The word online indicates that the an analyst must be able to request new summaries and get responses online, within a few seconds, and should not be forced to wait for a long time to see the result of a query. With an OLAP system, a data analyst can look at different cross-tabs on the same data by interactively selecting the attributes in the cross-tab. Each cross-tab is a two-dimensional view on a multidimensional data cube. For instance the analyst may select a cross-tab on item-name and size, or a cross-tab on color and size. The operation of changing the dimensions used in a cross-tab is called pivoting. Three-dimensional data cube. An OLAP system provides other functionality as well. For instance, the analyst may wish to see a cross-tab on item-name and color for a fixed value of size, for example, large, instead of the sum across all sizes. Such an operation is referred to as slicing, since it can be thought of as viewing a slice of the data cube. The operation is sometimes called dicing, particularly when values for multiple dimensions are fixed. When a cross-tab is used to view a multidimensional cube, the values of dimension attributes that are not part of the cross-tab are shown above the cross-tab. The value of such an attribute can be all, as shown in Figure , indicating that data in the crosstab are a summary over all values for the attribute. Slicing/dicing simply consists of selecting specific values for these attributes, which are then displayed on top of thecross-tab. OLAP systems permit users to view data at any desired level of granularity. The operation of moving from finer-granularity data to a coarser granularity (by means of aggregation) is called a rollup. In our example, starting from the data cube on the sales table, we got our example cross-tab by rolling up on the attribute size. The opposite operation that of moving from coarser-granularity data to finer-granularity data is called a drill down. Clearly, finer-granularity data cannot be generated from coarse-granularity data; they must be generated either from the original data, or from even finer-granularity summary data. Analysts may wish to view a dimension at different levels of detail. For instance, an attribute of type datetime contains a date and a time of day. Using time precise to a second (or less) may not be meaningful: An analyst who is interested in rough time of day may look at only the hour value. An analyst who is interested in sales by day of the week may map the date to a day-of-the-week and look only at that. Anotheranalyst may be interested in aggregates over a month, or a quarter, or for an entire year. Hierarchies on dimensions. The different levels of detail for an attribute can be organized into a hierarchy. Figure (a) shows a hierarchy on the datetime attribute. As another example, Figure (b) shows a hierarchy on location, with the city being at the bottom of the hierarchy, state above it, country at the next level, and region being the top level. In our earlier example, clothes can be grouped by category (for instance, menswear or womenswear); category would then lie above item-name in our hierarchy on clothes. At the level of actual values, skirts and dresses would fall under the womens wear category and pants and shirts under the menswear category. An analyst may be interested in viewing sales of clothes divided as menswear and womenswear, and not interested in individual values. After viewing the aggregates at the level of womenswear and menswear, an analyst may drill down the hierarchy to look at individual values. An analyst looking at the detailed level may drill up the hierarchy, and look at coarser-level aggregates. Both levels can be displayed on the same cross-tab, as in Figure . OLAP Implementation The earliest OLAP systems used multidimensional arrays in memory to store data cubes, and are referred to as multidimensional OLAP (MOLAP) systems. Later, OLAP facilities were integrated into relational systems, with data stored in a relational database. Such systems are referred to as relational OLAP (ROLAP) systems. Hybrid systems, which store some summaries in memory and store the base data and othersummaries in a relational database, are called hybrid OLAP (HOLAP) systems. Cross tabulation of sales with hierarchy on item-name. Many OLAP systems are implemented as client–server systems. The server contains the relational database as well as any MOLAP data cubes. Client systems obtain views of the data by communicating with the server. A na¨ıve way of computing the entire data cube (all groupings) on a relation is to use any standard algorithm for computing aggregate operations, one grouping at a time. The na¨ıve algorithm would require a large number of scans of the relation. A simple optimization is to compute an aggregation on, say, (item-name, color) from an aggregation (item-name, color, size), instead of from the original relation. For the standardSQL aggregate functions, we can compute an aggregate with grouping on a set of attributes A from an aggregate with grouping on a set of attributes B if A ⊆ B; you can do so as an exercise (see Exercise 22.1), but note that to compute avg, we additionally need the count value. (For some non-standard aggregate functions, such as median, aggregates cannot be computed as above; the optimization described here do not apply to such “non-decomposable” aggregate functions.) The amount of data read drops significantly by computing an aggregate from another aggregate, instead of from the original relation. Further improvements are possible; for instance, multiple groupings can be computed on a single scan of the data. See the bibliographical notes for references to algorithms for efficiently computing data cubes. Early OLAP implementations precomputed and stored entire data cubes, that is, groupings on all subsets of the dimension attributes. Precomputation allows OLAP queries to be answered within a few seconds, even on datasets that may contain millions of tuples adding up to gigabytes of data. However, there are 2n groupings with n dimension attributes; hierarchies on attributes increase the number further.As a result, the entire data cube is often larger than the original relation that formed the data cube and in many cases it is not feasible to store the entire data cube. Instead of precomputing and storing all possible groupings, it makes sense to precompute and store some of the groupings, and to compute others on demand. Instead of computing queries from the original relation, which may take a very long time, we can compute them from other precomputed queries. For instance, suppose a query requires summaries by (item-name, color), which has not been precomputed.The query result can be computed from summaries by (item-name, color, size), if that has been precomputed. See the bibliographical notes for references on how to select a good set of groupings for precomputation, given limits on the storage available for precomputed results. The data in a data cube cannot be generated by a single SQL query using the basic group by constructs, since aggregates are computed for several different groupings of the dimension attributes. Extended Aggregation The SQL-92 aggregation functionality is limited, so several extensions were implemented by different databases. The SQL:1999 standard, however, defines a rich set of aggregate functions, whichwe outline in this section and in the next two sections. The Oracle and IBM DB2 databases support most of these features, and other databases will no doubt support these features in the near future. The new aggregate functions on single attributes are standard deviation and variance (stddev and variance). Standard deviation is the square root of variance. Some database systems support other aggregate functions such as median and mode. Some database systems even allow users to add new aggregate functions. SQL:1999 also supports a new class of binary aggregate functions, which can compute statistical results on pairs of attributes; they include correlations, covariances, and regression curves, which give a line approximating the relation between the values of the pair of attributes. Definitions of these functions may be found in any standard textbook on statistics, such as those referenced in the bibliographical notes. SQL:1999 also supports generalizations of the group by construct, using the cube and rollup constructs. A representative use of the cube construct is: select item-name, color, size, sum(number) from sales group by cube(item-name, color, size) This query computes the union of eight different groupings of the sales relation: { (item-name, color, size), (item-name, color), (item-name, size), (color, size), (item-name), (color), (size), () } where () denotes an empty group by list. For each grouping, the result contains the null value for attributes not present in the grouping. For instance, the table in Figure , with occurrences of all replaced by null, can be computed by the query select item-name, color, sum(number) from sales group by cube(item-name, color) A representative rollup construct is select item-name, color, size, sum(number) from sales group by rollup(item-name, color, size) Here, only four groupings are generated: { (item-name, color, size), (item-name, color), (item-name), () } Rollup can be used to generate aggregates at multiple levels of a hierarchy on a column. For instance, suppose we have a table itemcategory(item-name, category) giving the category of each item. Then the query select category, item-name, sum(number) from sales, category where sales.item-name = itemcategory.item-name group by rollup(category, item-name) would give a hierarchical summary by item-name and by category. Multiple rollups and cubes can be used in a single group by clause. For instance, the following query select item-name, color, size, sum(number) from sales group by rollup(item-name), rollup(color, size) generates the groupings { (item-name, color, size), (item-name, color), (item-name), (color, size), (color), () } To understand why, note that rollup(item-name) generates two groupings, {(itemname), ()}, and rollup(color, size) generates three groupings, {(color, size), (color), () }. The cross product of the two gives us the six groupings shown.:1999 uses the value null to indicate the usual sense of null as well as all. This dual use of null can cause ambiguity if the attributes used in a rollup or cube clause contain null values. The function grouping can be applied on an attribute; it returns 1 if the value is a null value representing all, and returns 0 in all other cases. Consider the following query: select item-name, color, size, sum(number), grouping(item-name) as item-name-flag, grouping(color) as color-flag, grouping(size) as size-flag from sales group by cube(item-name, color, size) The output is the same as in the version of the query without grouping, but with three extra columns called item-name-flag, color-flag, and size-flag. In each tuple, the value of a flag field is 1 if the corresponding field is a null representing all. Instead of using tags to indicate nulls that represent all, we can replace the null value by a value of our choice: decode(grouping(item-name), 1, ’all’, item-name) This expression returns the value “all” if the value of item-name is a null corresponding to all, and returns the actual value of item-name otherwise. This expression can be used in place of item-name in the select clause to get “all” in the output of the query, in place of nulls representing all. Neither the rollup nor the cube clause gives complete control on the groupings that are generated. For instance, we cannot use them to specify that we want only groupings {(color, size), (size, item-name)}. Such restricted groupings can be generated by using the grouping construct in the having clause; Ranking Finding the position of a value in a larger set is a common operation. For instance, we may wish to assign students a rank in class based on their total marks, with the rank 1 going to the student with the highest marks, the rank 2 to the student with the next highest marks, and so on. While such queries can be expressed in SQL-92, they are difficult to express and inefficient to evaluate. Programmers often resort to writing the query partly in SQL and partly in a programming language. A related type of query is to find the percentile in which a value in a (multi)set belongs, for example, the bottom third, middle third, or top third.We study SQL:1999 support for these types of queries here. Ranking is done in conjunction with an order by specification. Suppose we are given a relation student-marks(student-id, marks) which stores the marks obtained by each student. The following query gives the rank of each student. select student-id, rank() over (order by (marks) desc) as s-rank from student-marks Note that the order of tuples in the output is not defined, so they may not be sorted by rank. An extra order by clause is needed to get them in sorted order, as shown below. select student-id, rank () over (order by (marks) desc) as s-rank from student-marks order by s-rank A basic issue with ranking is how to deal with the case of multiple tuples that are the same on the ordering attribute(s). In our example, this means deciding what to do if there are two students with the same marks. The rank function gives the same rank to all tuples that are equal on the order by attributes. For instance, if the highest mark is shared by two students, both would get rank 1. The next rank given would be 3, not 2, so if three students get the next highest mark, they would all get rank 3, and the next student(s) would get rank 5, and so on. There is also a dense rank function that does not create gaps in the ordering. In the above example, the tuples with the second highest value all get rank 2, and tuples with the third highest value get rank 3, and so on. Ranking can be done within partitions of the data. For instance, suppose we have an additional relation student-section(student-id, section) that stores for each student the section in which the student studies. The following query then gives the rank of students within each section. select student-id, section, rank () over (partition by section order by marks desc) as sec-rank from student-marks, student-section where student-marks.student-id = student-section.student-id order by section, sec-rank The outer order by clause orders the result tuples by section, and within each section by the rank. Multiple rank expressions can be used within a single select statement; thus we can obtain the overall rank and the rank within the section by using two rank expressions in the same select clause. An interesting question is what happens when ranking (possibly with partitioning) occurs along with a group by clause. In this case, thegroup by clause is applied first, and partitioning and ranking are done on the results of the group by. Thus aggregate values can then be used for ranking. For example, suppose we had marks for each student for each of several subjects. To rank students by the sum of their marks in different subjects, we can use a group by clause to compute the aggregate marks for each student, and then rank students by the aggregate sum.We leave details as an exercise for you. The ranking functions can be used to find the top n tuples by embedding a ranking query within an outer-level query; we leave details as an exercise. Note that bottom n is simply the same as top n with a reverse sorting order. Several database systems provide nonstandard SQL extensions to specify directly that only the top n results are required; such extensions do not require the rank function, and simplify the job of theoptimizer, but are (currently) not as general since they do not support partitioning. SQL:1999 also specifies several other functions that can be used in place of rank. For instance, percent rank of a tuple gives the rank of the tuple as a fraction. If there are n tuples in the partition3 and the rank of the tuple is r, then its percent rank is defined as (r − 1)/(n − 1) (and as null if there is only one tuple in the partition). The function cume dist, short for cumulative distribution, for a tuple is defined as p/n where p is the number of tuples in the partition with ordering values preceding or equal to the ordering value of the tuple, and n is the number of tuples in the parti tion. The function row number sorts the rows and gives each row a unique number corresponding to its position in the sort order; different rows with the same ordering value would get different row numbers, in a nondeterministic fashion. Finally, for a given constant n, the ranking function ntile(n) takes the tuples in each partition in the specified order, and divides them into n buckets with equal numbers of tuples. For each tuple, ntile(n) then gives the number of the bucket in which it is placed, with bucket numbers starting with 1. This function is particularly useful for constructing histograms based on percentiles. For instance, we can sort employees by salary, and use ntile(3) to find which range (bottom third, middle third, or top third) each employee is in, and compute the total salary earned by employees in each range: select threetile, sum(salary) from ( select salary, ntile(3) over (order by (salary)) as threetile from employee) as s group by threetile. The presence of null values can complicate the definition of rank, since it is not clear where they should occur first in the sort order. SQL:1999 permits the user to specify where they should occur by using nulls first or nulls last, for instance select student-id, rank () over (order by marks desc nulls last) as s-rank from student-marks Windowing An example of a window query is query that, given sales values for each date, calculates for each date the average of the sales on that day, the previous day, and the next day; such moving average queries are used to smooth out random variations. Another example of a window query is one that finds the cumulative balance in an account, given a relation specifying the deposits and withdrawals on an account. Such queries are either hard or impossible (depending on the exact query) to express in basic SQL. SQL:1999 provides a windowing feature to support such queries. In contrast to group by, the same tuple can exist in multiple windows. Suppose we are given a relation transaction(account-number, date-time, value), where value is positive for a deposit and negative for a withdrawal. We assume there is at most one transaction per date-time value. Consider the query select account-number, date-time, sum(value) over (partition by account-number order by date-time rows unbounded preceding) as balance from transaction order by account-number, date-time The query gives the cumulative balances on each account just before each transaction on the account; the cumulative balance of the account is the sum of values of all earlier transactions on the account. The partition by clause partitions tuples by account number, so for each row only the tuples in its partition are considered. A window is created for each tuple; the keywords rows unbounded preceding specify that the window for each tuple consists of all tuples in the partition that precede it in the specified order (here, increasing order of date-time). The aggregate function sum(value) is applied on all the tuples in the window. Observe that the query does not use a group by clause, since there is an output tuple for each tuple in the transaction relation. While the query could be written without these extended constructs, it would be rather difficult to formulate. Note also that different windows can overlap, that is, a tuple may be present in more than one window.Other types of windows can be specified. For instance, to get a window containing the previous 10 rows for each row, we can specify rows 10 preceding. To get a window containing the current, previous, and following row, we can use between rows 1 preceding and 1 following. To get the previous rows and the current row, we can say between rows unbounded preceding and current. Note that if the ordering is ona nonkey attribute, the result is not deterministic, since the order of tuples is not fully defined. We can even specify windows by ranges of values, instead of numbers of rows. For instance, suppose the ordering value of a tuple is v; thenrange between 10 preceding and current row would give tuples whose ordering value is between v − 10 and v (both values inclusive). When dealing with dates, we can use range interval 10 day preceding to get a window containing tuples within the previous 10 days, but notincluding the date of the tuple.Clearly, the windowing functionality of SQL:1999 is very rich and can be used to write rather complex queries with a small amount of effort. Data Mining The term data mining refers loosely to the process of semiautomatically analyzing large databases to find useful patterns. Like knowledge discovery in artificial intelligence (also called machine learning), or statistical analysis, data mining attempts to discover rules and patterns from data. However, data mining differs from machine learning and statistics in that it deals with large volumes of data, stored primarily on disk. That is, data mining deals with “knowledge discovery in databases.” Some types of knowledge discovered from a database can be represented by a set of rules. The following is an example of a rule, stated informally: “Young women with annual incomes greater than$50,000 are the most likely people to buy small sports cars.” Of course such rules are not universally true, and have degrees of “support” and “confidence,” as we shall see. Other types of knowledge are represented by equations relating different variables to each other, or by other mechanisms for predicting outcomes when the values of some variables are known.

There are a variety of possible types of patterns that may be useful, and different techniques are used to find different types of patterns.We shall study a few examples of patterns and see how they may be automatically derived from a database. Usually there is a manual component to data mining, consisting of preprocessing data to a form acceptable to the algorithms, and postprocessing of discovered patterns to find novel ones that could be useful. There may also be more than one type of pattern that can be discovered from a given database, and manual interaction may be needed to pick useful types of patterns. For this reason, data mining is really a semiautomatic process in real life. However, in our description we concentrate on the automatic aspect of mining.

Applications of Data Mining

The discovered knowledge has numerous applications. The most widely used applications are those that require some sort of prediction. For instance, when a person applies for a credit card, the credit-card company wants to predict if the person is a good credit risk. The prediction is to be based on known attributes of the person, such as age, income, debts, and past debt repayment history. Rules for making the predictionare derived from the same attributes of past and current credit card holders, along with their observed behavior, such as whether they defaulted on their creditcard dues. Other types of prediction include predicting which customers may switch over to a competitor (these customers may be offered special discounts to tempt them not to switch), predicting which people are likely to respond to promotional mail (“junk mail”), or predicting what types of phone calling card usage are likely to be fraudulent.

Another class of applications looks for associations, for instance, books that tend to be bought together. If a customer buys a book, an online bookstore may suggest other associated books. If a person buys a camera, the system may suggest accessories that tend to be bought along with cameras. A good salesperson is aware of such patterns and exploits them to make additional sales. The challenge is to automate theprocess. Other types of associations may lead to discovery of causation. For instance, discovery of unexpected associations between a newly introduced medicine and cardiac problems led to the finding that the medicine may cause cardiac problems in some people. The medicine was then withdrawn from the market.

Associations are an example of descriptive patterns. Clusters are another example of such patterns. For example, over a century ago a cluster of typhoid cases was found around awell, which led to the discovery that the water in the well was contaminated and was spreading typhoid. Detection of clusters of disease remains important even today.

Classification

We outline what is classification, study techniques for building one type of classifiers, called decision tree classifiers, and then study other prediction techniques. Abstractly, the classification problem is this: Given that items belong to one of several classes, and given past instances (called training instances) of items along with the classes to which they belong, the problem is to predict the class to which a new item belongs. The class of the new instance is not known, so other attributes of the instance must be used to predict the class.

Classification can be done by finding rules that partition the given data into disjoint groups. For instance, suppose that a credit-card company wants to decide whether or not to give a credit card to an applicant. The company has a variety of information about the person, such as her age, educational background, annual income, and current debts, that it can use for making a decision.

Some of this information could be relevant to the credit worthiness of the applicant, whereas some may not be. To make the decision, the company assigns a credit worthiness level of excellent, good, average, or bad to each of a sample set of current customers according to each customer’s payment history. Then, the company attempts to find rules that classify its current customers into excellent, good, average, or bad, on the basis of the information about the person, other than the actual payment history (which is unavailable for new customers). Let us consider just two attributes: education level (highest degree earned) and income. The rules may be of the following form:

∀person P, P.degree = masters and P.income > 75, 000
⇒ P.credit = excellent
∀ person P, P.degree = bachelors or
(P.income ≥ 25, 000 and P.income ≤ 75, 000) ⇒ P.credit = good

Similar rules would also be present for the other credit worthiness levels (average and bad). The process of building a classifier starts from a sample of data, called a training set. For each tuple in the training set, the class to which the tuple belongs is already known. For instance, the training set for a credit-card application may be the existing customers, with their credit worthiness determined from their payment history. Theactual data, or population, may consist of all people, including those who are not existing customers. There are several ways of building a classifier. As the name suggests, decision tree classifiers use a tree; each leaf node has an associated class, and each internal node has a predicate (or more generally, a function) associated with it. Figure shows an example of a decision tree.

To classify a new instance, we start at the root, and traverse the tree to reach a leaf; at an internal node we evaluate the predicate (or function) on the data instance, to find which child to go to. The process continues till we reach a leaf node. For example, if the degree level of a person is masters, and the persons income is 40K, starting from the root we follow the edge labeled “masters,” and from there the edge labeled “25K to 75K,” to reach a leaf. The class at the leaf is “good,” so we predict that the credit risk of that person is good

Classification tree.

Building Decision Tree Classifiers

The question then is how to build a decision tree classifier, given a set of training instances. The most common way of doing so is to use a greedy algorithm, which works recursively, starting at the root and building the tree downward. Initially there is only one node, the root, and all training instances are associated with that node. At each node, if all, or “almost all” training instances associated with the node belong to the same class, then the node becomes a leaf node associated with that class.

Otherwise, a partitioning attribute and partitioning conditions must be selected to create child nodes. The data associated with each child node is the set of training instances that satisfy the partitioning condition for that child node. In our example, the attribute degree is chosen, and four children, one for each value of degree, are created.

The conditions for the four children nodes are degree = none, degree = bachelors, degree = masters, and degree = doctorate, respectively. The data associated with each child consist of training instances satisfying the condition associated with that child. At the node corresponding to masters, the attribute income is chosen, with the range of values partitioned into intervals 0 to 25,000, 25,000 to 50,000, 50,000 to 75,000, andover 75,000. The data associated with each node consist of training instances with the degree attribute being masters, and the income attribute being in each of these ranges, respectively. As an optimization, since the class for the range 25,000 to 50,000 and the range 50,000 to 75,000 is the same under the node degree = masters, the two ranges have been merged into a single range 25,000 to 75,000.

Best Splits

Intuitively, by choosing a sequence of partitioning attributes, we start with the set of all training instances, which is “impure” in the sense that it contains instances from many classes, and end up with leaves which are “pure” in the sense that at each leaf all training instances belong to only one class. We shall see shortly how to measure purity quantitatively. To judge the benefit of picking a particular attribute and condition for partitioning of the data at a node, we measure the purity of the data at the children resulting from partitioning by that attribute. The attribute and condition that result in the maximum purity are chosen.

The purity of a set S of training instances can be measured quantitatively in several ways. Suppose there are k classes, and of the instances in S the fraction of instances in class i is pi. One measure of purity, the Gini measure is defined as

When all instances are in a single class, the Gini value is 0, while it reaches its maximum (of 1 − 1/k) if each class has the same number of instances. Another measure of purity is the entropy measure, which is defined as

The entropy value is 0 if all instances are in a single class, and reaches its maximum when each class has the same number of instances. The entropy measure derives from information theory.When a set S is split into multiple sets Si, i = 1, 2, . . . , r,we can measure the purity of the resultant set of sets as:

That is, the purity is the weighted average of the purity of the sets Si. The above formula can be used with both the Gini measure and the entropy measure of purity. The information gain due to a particular split of S into Si, i = 1, 2, . . . , r is then

Information-gain(S, {S1, S2, . . . , Sr}) = purity(S) − purity(S1, S2, . . . , Sr)

Splits into fewer sets are preferable to splits into many sets, since they lead to simpler and more meaningful decision trees. The number of elements in each of the sets Si may also be taken into account; otherwise, whether a set Si has 0 elements or 1 element would make a big difference in the number of sets, although the split is the same for almost all the elements. The information content of a particular split can be defined in terms of entropy as

All of this leads to a definition: The best split for an attribute is the one that gives the maximum information gain ratio, defined as

Information-gain(S, {S1, S2, . . . , Sr})/Information-content(S, {S1, S2, . . . , Sr})

Finding Best Splits

How do we find the best split for an attribute? How to split an attribute depends on the type of the attribute. Attributes can be either continuous valued, that is, the values can be ordered in a fashion meaningful to classification, such as age or income, or can be categorical, that is, they have no meaningful order, such as department names or country names. We do not expect the sort order of department names or country names to have any significance to classification.

Usually attributes that are numbers (integers/reals) are treated as continuous valued while character string attributes are treated as categorical, but this may be controlled by the user of the system. In our example, we have treated the attribute degree as categorical, and the attribute income as continuous valued.

We first consider how to find best splits for continuous-valued attributes. For simplicity, we shall only consider binary splits of continuous-valued attributes, that is, splits that result in two children. The case of multiway splits is more complicated; see the bibliographical notes for references on the subject. To find the best binary split of a continuous-valued attribute, we first sort the attribute values in the training instances. We then compute the information gain obtained by splitting at each value. For example, if the training instances have values 1, 10, 15, and 25 for an attribute, the split points considered are 1, 10, and 15; in each case values less than or equal to the split point form one partition and the rest of the values form the other partition. The best binary split for the attribute is the split that gives the maximum information gain.

For a categorical attribute, we can have a multiway split, with a child for each value of the attribute. This works fine for categorical attributes with only a few distinct values, such as degree or gender. However, if the attribute has many distinct values, such as department names in a large company, creating a child for each value is not a good idea. In such cases, we would try to combine multiple values into each child, to create a smaller number of children. See the bibliographical notes for references on how to do so.

Decision-Tree Construction Algorithm

The main idea of decision tree construction is to evaluate different attributes and different partitioning conditions, and pick the attribute and partitioning condition that results in the maximum information gain ratio. The same procedure works recursively on each of the sets resulting from the split, thereby recursively constructing a decision tree. If the data can be perfectly classified, the recursion stops when the purity of a set is 0.

However, often data are noisy, or a set may be so small that partitioning it further may not be justified statistically. In this case, the recursion stops when the purity of a set is “sufficiently high,” and the class of resulting leaf is defined as the class of the majority of the elements of the set. In general, different branches of the tree could grow to different levels.

procedure GrowTree(S)
Partition(S);
procedure Partition (S)
if (purity(S) > δp or |S| < δs ) then
return;
for each attribute A
evaluate splits on attribute A;
Use best split found (across all attributes) to partition
S into S1, S2, . . . , Sr;
for i = 1, 2, . . . , r
Partition(Si);

Figure: Recursive construction of a decision tree.

Figure shows pseudocode for a recursive tree construction procedure, which takes a set of training instances S as parameter. The recursion stops when the set is sufficiently pure or the set S is too small for further partitioning to be statistically significant. The parameters δp and δs define cutoffs for purity and size; the system may give them default values, that may be overridden by users.

There are a wide variety of decision tree construction algorithms, and we outline the distinguishing features of a few of them. See the bibliographical notes for details. With very large data sets, partitioning may be expensive, since it involves repeated copying. Several algorithms have therefore been developed to minimize the I/O and computation cost when the training data are larger than available memory.

Several of the algorithms also prune subtrees of the generated decision tree to reduce overfitting: Asubtree is overfitted if it has been so highly tuned to the specifics of the training data that it makes many classification errors on other data. A subtree is pruned by replacing it with a leaf node. There are different pruning heuristics; one heuristic uses part of the training data to build the tree and another part of the training data to test it. The heuristic prunes a subtree if it finds that misclassification on the test instances would be reduced if the subtree were replaced by a leaf node.

We can generate classification rules from a decision tree, if we so desire. For each leaf we generate a rule as follows: The left-hand side is the conjunction of all the split conditions on the path to the leaf, and the class is the class of the majority of the training instances at the leaf. An example of such a classification rule is

degree = masters and income > 75, 000 ⇒ excellent

Other Types of Classifiers

There are several types of classifiers other than decision tree classifiers. Two types that have been quite useful are neural net classifiers and Bayesian classifiers. Neural net classifiers use the training data to train artificial neural nets. There is a large body of literature on neural nets, and we do not consider them further here. Bayesian classifiers find the distribution of attribute values for each class in the training data; when given a new instance d, they use the distribution information to estimate, for each class cj , the probability that instance d belongs to class cj , denoted by p(cj |d), in a manner outlined here. The class with maximum probability becomes the predicted class for instance d.

To find the probability p(cj |d) of instance d being in class cj , Bayesian classifiers use Bayes’ theorem, which says

where p(d|cj) is the probability of generating instance d given class cj , p(cj) is the probability of occurrence of class cj, and p(d) is the probability of instance d occurring. Of these, p(d) can be ignored since it is the same for all classes. p(cj) is simply the fraction of training instances that belong to class cj . Finding p(d|cj) exactly is difficult, since it requires a complete distribution of instances of cj . To simplify the task, naive Bayesian classifiers assume attributes have independent distributions, and thereby estimate

p(d|cj) = p(d1|cj) * p(d2|cj) * . . . * p(dn|cj)

That is, the probability of the instance d occurring is the product of the probability of occurrence of each of the attribute values di of d, given the class is cj . The probabilities p(di|cj) derive from the distribution of values for each attribute i, for each class class cj . This distribution is computed from the training instances that belong to each class cj ; the distribution is usually approximated by a histogram. For instance, we may divide the range of values of attribute i into equal intervals, and store the fraction of instances of class cj that fall in each interval. Given a value di for attribute i, the value of p(di|cj) is simply the fraction of instances belonging to class cj that fall in the interval to which di belongs.

A significant benefit of Bayesian classifiers is that they can classify instances with unknown and null attribute values unknown or null attributes are just omitted from the probability computation. In contrast, decision tree classifiers cannot meaningfully handle situations where an instance to be classified has a null value for a partitioning attribute used to traverse further down the decision tree.

Regression

Regression deals with the prediction of a value, rather than a class. Given values for a set of variables, X1,X2, . . .,Xn, we wish to predict the value of a variable Y . For instance, we could treat the level of education as a number and income as another number, and, on the basis of these two variables, we wish to predict the likelihood of default, which could be a percentage chance of defaulting, or the amount involved in the default.

One way is to infer coefficients a0, a1, a1, . . . , an such that

Y = a0 + a1 * X1 + a2 *X2 + · · · + an * Xn

Finding such a linear polynomial is called linear regression. In general, we wish to find a curve (defined by a polynomial or other formula) that fits the data; the process is also called curve fitting. The fit may only be approximate, because of noise in the data or because the relationship is not exactly a polynomial, so regression aims to find coefficients that give the best possible fit. There are standard techniques in statistics for finding regression coefficients. We do not discuss these techniques here, but the bibliographical notes provide references.

Association Rules

Retail shops are often interested in associations between different items that people buy. Examples of such associations are:

• A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.

Association information can be used in several ways. When a customer buys a particular book, an online shop may suggest associated books. A grocery shop may decide to place bread close to milk, since they are often bought together, to help shoppers finish their task faster. Or the shop may place them at opposite ends of a row, and place other associated items in between to tempt people to buy those items as well, as theshoppers walk from one end of the row to the other. A shop that offers discounts on one associated item may not offer a discount on the other, since the customer will probably buy the other anyway.

Association Rules

An example of an association rule is

In the context of grocery-store purchases, the rule says that customers who buy bread also tend to buy milk with a high probability. An association rule must have an associated population: the population consists of a set of instances. In the grocery-store example, the population may consist of all grocery store purchases; each purchase is an instance. In the case of a bookstore, the population may consist of all people whomade purchases, regardless of when they made a purchase. Each customer is an instance.Here, the analyst has decided that when a purchase is made is not significant, whereas for the grocery-store example, the analyst may have decided to concentrate on single purchases, ignoring multiple visits by the same customer.

Rules have an associated support, as well as an associated confidence. These are defined in the context of the population:

• Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.

For instance, suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule

milk ⇒ screwdrivers

is low. The rule may not even be statistically significant perhaps there was only a single purchase that included both milk and screwdrivers. Businesses are usually not interested in rules that have low support, since they involve few customers, and are not worth bothering about.

On the other hand, if 50 percent of all purchases involve milk and bread, then support for rules involving bread and milk (and no other item) is relatively high, and such rules may be worth attention. Exactly what minimum degree of support is considered desirable depends on the application.

• Confidence is a measure of how often the consequent is true when the antecedent is true. For instance, the rule