Wednesday, December 1, 2010

Aggregates

PostgreSQL supports a number of different aggregate functions (a tutorial appears here). These are functions that perform some mathematical or statistical calculations on a set of values. Common examples are sum, min, max and count. The values are often numeric but they can be boolean, string, or XML too and count accepts anything.

Let's start with a simple example. Suppose we have the following table of sales per month for different sales people

aggs=# select * from salesfigures;
 name | month | sales
------+-------+-------
 joe  | jan   |    34
 joe  | feb   |    23
 bob  | jan   |    18
 bob  | feb   |    30
(4 rows)

We can get the total sales for the sales team for the year to date as

aggs=# select sum(sales) from salesfigures;
 sum
-----
 105
(1 row)

Notice that sum is a function that we can specify in the select list (the part before the FROM) and it takes one argument. We are supplying it the values in the sales column. It adds 34 + 23 + 18 + 30 to get 105. Easy.

Filtering by WHERE
As the aggregate gets computed after any WHERE clause is applied we can easily apply filters on the data that is supplied to the aggregate. For instance, to get the sales for January

aggs=# select sum(sales) from salesfigures where month='jan';
 sum
-----
  52
(1 row)

(34 + 18 = 52)

Filtering by GROUP
Similarly, aggregates are computed after GROUP BY clauses. This is a really powerful feature. So, if we want the total sales for each month we can run

aggs=# select month, sum(sales) from salesfigures group by month;
 month | sum
-------+-----
 feb   |  53
 jan   |  52
(2 rows)

Clearly, using strings for month is not the best table design here as they cannot sensibly be ordered or compared. I used them to keep the example simple.

We call the aggregates multiple times in the select list. Here we get the min and max sales values for each person

aggs=# select name, min(sales), max(sales) from salesfigures group by name;
 name | min | max
------+-----+-----
 bob  |  18 |  30
 joe  |  23 |  34
(2 rows)

GOTCHA
If we wanted to know who had the most sales in any month we might be tempted to query

select name from salesfigures where sales=max(sales);         WRONG!

We could imagine this as computing the maximum value in the sales column and the selecting the one or more names that matched this value. This will not work. If one runs this query, PostgreSQL complains

aggs=# select name from salesfigures where sales=max(sales);
ERROR:  aggregates not allowed in WHERE clause
LINE 1: select name from salesfigures where sales=max(sales);
                                                  ^
aggs=#

This is a typical beginner mistake. I've done it. You'll probably do it too. As the error message says, aggregates are not allowed in the WHERE clause. This is because the WHERE clause is used to generate the set of rows to be passed to the select list and any aggregates. What we have here is a circular relationship: max depends on which rows it is given and that is determined by WHERE, which depends on max... What you have to do is obtain the maximum value as one query and then pass that into another query.

aggs=# select name from salesfigures where sales=(select max(sales) from salesfigures);
 name
------
 joe
(1 row)

(select max(sales) from salesfigures) is called a subquery, something we will cover in more detail in a later post. For now, just know that it computes this subquery first to obtain a value, which will be 34. We then effectively have a simple query

select name from salesfigures where sales=34;

which is OK as the circularity has been eliminated.

Filtering by HAVING
OK, so we cannot have aggregates in WHERE clause. However, we can have them in a HAVING clause. If you recall our earlier post, WHERE can be accompanied by an optional HAVING clause that can be used to filter rows after the WHERE clause is performed.

Who is underperforming? Which salesperson(s) have had less than 20 sales in any one month?

aggs=# select name from salesfigures group by name having min(sales)<20;
 name
------
 bob
(1 row)

This query groups first to get a set of virtual tables, then applies the min aggregate to get the minumum per person. It then filters in those with a minumum less than 20. Finally, we asked for SELECT name so it returns the one name left, that for Bob.

COUNT
COUNT is commonly used aggregate. SELECT count(*) from tablename returns the number of rows. The * just means "everything", every column in this context. If we ran SELECT count(*) from salesfigures it would return 4, as there are four rows, as would SELECT count(sales) from salesfigures. Count counts the rows not the values. Don't confuse this with sum!

Unfortunately, COUNT is fairly slow in PostgreSQL. It can take seconds or sometime tens of seconds in really large tables. The reason is that when a row is deleted from a table, it is not really deleted, just flagged as not being needed any more. (This is the same in some operating system when one deletes a file.) Also, the rows for a table may be scattered through memory. Count has to scan through each row of the table sequentially, work out which rows are "dead" in order to return the total row count. Other databases can do this more efficiently and is a frequent complaint about PostgreSQL performance. However, it does excel in many other respects over other databases.

This lists the various aggregates that PostgreSQL provides. As well as count, min, max and sum mentioned above other commonly used aggregates are avg (average) and stddev (standard deviation, a statistical measure of variability). Unfortunately, median is not provided.

Let's end with a non numeric example. The EVERY aggregate checks whether each and every row is true

aggs=# select * from clients;
     name     | paidbill
--------------+----------
 acme corp.   | t
 lehman bros. | f
 GM           | t
(3 rows)

aggs=# select every(paidbill) from clients;
 every
-------
 f
(1 row)

No comments:

Post a Comment