Tuesday, December 7, 2010

CREATE FUNCTION: return types

In the last post we covered the basic anatomy of a SQL function or stored procedure with CREATE FUNCTION:

CREATE FUNCTION function_name (argument_name argument_type, argument_name argument_type, ...) RETURNS return_type AS $$
     function_body_query;
$$ language language_name;

In this post we'll add some additional options and features related to the function's output.

VOID FUNCTIONS
Functions can return nothing. To do that we declare them as "RETURNS void". This is appropriate for non-SELECT commands such as INSERT:

CREATE FUNCTION addsalesdata(name varchar, month varchar, value decimal) returns void as $$
    insert into salesfigures values($1,$2,$3);
$$ language 'sql';


For these functions one can call SELECT function_name(...) or SELECT * FROM function_name(...).

FUNCTIONS THAT RETURN A SINGLE VALUE
Functions can return one value (one int, one decimal, one text etc.). For these functions one can call SELECT function_name(...) or SELECT * FROM function_name(...).

CREATE FUNCTION totalsalesformonth(month varchar) returns decimal as $$
    select sum(sales) from salesfigures where month=$1;
$$ language 'sql';


Another option to define a function's output is using OUT parameters. Here we do not use RETURNS but we list the order, names and types of output alongside the input arguments preceding each field with "OUT".

CREATE FUNCTION salesforpersonmonth4(month varchar, OUT name varchar, OUT value decimal) as $$
    select name,sales from salesfigures where month=$1;
$$ language 'sql';

aggs=# select * from salesforpersonmonth4('jan');
 name | value
------+-------
 joe  |    34
(1 row)


Notice that this is returning just one record when in fact the query would produce multiple rows of results.

FUNCTIONS THAT RETURN A SETOF VALUES
Functions can also return a set of values of the same type. That is, imagine it returning a column of decimals or a column of text. To specify this we use the SETOF modifier. Here is a function that returns the sales figures for each salesperson for January.

CREATE FUNCTION salesformonth(month varchar) returns SETOF decimal as $$
    select sales from salesfigures where month=$1;
$$ language 'sql';


aggs=# select * from salesformonth('jan');
 salesformonth
---------------
            34
            18
(2 rows)

FUNCTIONS THAT RETURN A SETOF ROWS
There are 4 options to return a set of rows.
1) RETURNS table
PostgreSQL functions (version 8.4 and later) can also return a table. That is, we define the fields of the SETOF rows that the function will return. Here is that last function but with the person's name as well as their sales:

CREATE FUNCTION salesforpersonformonth(month varchar) returns table(name text, sales decimal) as $$
    select name, sales from salesfigures where month=$1;
$$ language 'sql';

aggs=# select * from salesforpersonformonth('jan');
 name | sales
------+-------
 joe  |    34
 bob  |    18
(2 rows)

The types that functions return can be standard PostgreSQL types but they can also be columns or tables that exist or they can be a type that the user defines.

2) Existing table
Here we have a table with a name and sales value:

aggs=# create table personsales(name varchar, sales decimal);
CREATE TABLE

A function can then return that type:

CREATE FUNCTION salesforpersonmonth(month varchar) returns SETOF personsales as $$
    select name,sales from salesfigures where month=$1;
$$ language 'sql';

aggs=# select * from salesforpersonmonth('jan');     
 name | sales
------+-------
 joe  |    34
 bob  |    18
(2 rows)

3) Column of existing table
Here is a function that returns a SETOF values of a type defined by the column of an existing table:

CREATE FUNCTION salesforpersonmonth2(month varchar) returns SETOF personsales.sales%TYPE as $$
    select sales from salesfigures where month=$1;
$$ language 'sql';

aggs=# select * from salesforpersonmonth2('jan');
salesforpersonmonth2
----------------------
                   34
                   18
(2 rows)

4) Defined type
We can create a new type by using CREATE TYPE. CREATE TYPE has a lot of options but the basic use is

CREATE TYPE typename AS (fieldname1 type1, fieldname2 type2,...);

For instance,

CREATE TYPE personsales3 AS (name varchar, sales decimal);

Then, one can define

CREATE FUNCTION salesforpersonmonth3(month varchar) returns SETOF personsales3 as $$
    select name,sales from salesfigures where month=$1;
$$ language 'sql';

CREATE FUNCTION - basic anatomy

Functions or stored procedures are a very powerful feature of PostgreSQL. Functions allow one to modularize code, create "units" that do useful tasks and hide the complexity of the underlying code, here a query. They also reduce the risk of making a mistake by typing the same query over and over again, such as running a daily or monthly report.

PostgreSQL supports a particularly large variety of programming languages. By using these procedural language extensions one can write functions in Java (PL/Java), Python (PL/Python), R (PL/R), Perl (PL/Perl) and many others to interact with a PostgreSQL database. In this and the next post, however, we are going to cover simple functions in SQL only. In later posts we will cover PL/pgSQL a more powerful and flexible procedural language that provides features such as control flow (such as IF...THEN...ELSE), local variables and multiple queries.

We'll start with a simple SQL function. In the next post, we will expand on this and detail some additional options and features that CREATE FUNCTION supports.

Here are the sales data from the last post:

aggs=# select * from salesfigures;
 name | month | sales
------+-------+-------
 joe  | jan   |    34
 joe  | feb   |    23
 bob  | jan   |    18
 bob  | feb   |    30
(4 rows)

If we want to get the total sales for a given month we can do

aggs=# select sum(sales) from salesfigures where month='jan';
 sum
-----
  52
(1 row)

but if we know that are going to have to do this each month (and you are likely to forget the query next month) it might make sense to create a function called totalsalesformonth() that one can call and pass in the month of interest. Here is how we do that. Don't worry if this looks complicated for now. We will step through each part.

CREATE FUNCTION totalsalesformonth(month varchar) returns decimal as $$
    select sum(sales) from salesfigures where month=$1;
$$ language 'sql';


and we can run it as

aggs=# select totalsalesformonth('jan');
 totalsalesformonth
--------------------
                 52
(1 row)

and get the same result as the plain query. Let us break this function definition down.

COMMAND
CREATE FUNCTION is the command that runs to the final semicolon. It is a query so we can paste this into the psql command line or into pgAdmin the same as any other query.

It is defining a function called totalsalesformonth. PostgreSQL ignores the capitilization and turns everything into lower case so this call is also recognized

aggs=# select totalsalesForMonth('jan');
 totalsalesformonth
--------------------
                 52
(1 row)

NAME
The function name must be a single string. It cannot be "total sales for month" but it can be total_sales_for_month. Choose a sensible descriptive name for the function. You and any other users will thank you later.

PARAMETER LIST
We are specifying that it takes one argument or parameter (technically this is an IN parameter) called "month" and is of type varchar.
We can have more than one argument. These are specified as a comma-delimited list.

RETURN TYPE
The function is declared to return a single value of type decimal. We will discuss return types further in the next post.

FUNCTION BODY
The "$$" says to PostgreSQL that we are starting the code that constitutes the body of the function.
The function body should be a single statement, i.e. only one semicolon. PostgreSQL will allow multiple statements but it is only going to return the last. That is,

CREATE FUNCTION totalsalesformonth(month varchar) returns decimal as $$
    select sum(sales) from salesfigures where month=$1;
    select 3.;
$$ language 'sql';

is valid and will return 3 when called.
The names of the parameters in the parameter list are completely ignored. Instead we use $1 in place of the first parameter, $2 for the second parameter etc. So, when we call

select totalsalesForMonth('jan');

PostgreSQL takes the value 'jan' and replaces $1 with this to produce

select sum(sales) from salesfigures where month='jan';

which is then executed.

Importantly, one cannot use a passed in parameter for the name of a field or the name of a table. For instance, this function will not compile:

CREATE FUNCTION totalsales(tablename varchar) returns decimal as $$
    select sum(sales) from $1;                                       //WRONG!
$$ language 'sql';

PostgreSQL complains of a syntax error. Why is this?
PostgreSQL is strongly typed. That means it checks that the output of the query matches the types that the function should return. In this case, the parser cannot know if the tablename passed in has a sales field and if it does whether it is numeric. There are too many unknowns for PostgreSQL to be happy that this function will work as intended.If you do need to create dynamic queries such as this, one can achieve it with PL/pgSQL.

Although the names are ignored it is important to use good descriptive names for documentation purposes. I don't know about you but I can forget what I coded last week and appreciate the help of good documentation and besides, you might not be the only user of these functions. I would strongly discourage use of generic names such as "arg0", "arg1", "x", "y" unless there is good reason. (For instance, x and y might be the best choice for a function that processes a point in the Cartesian plane.)

LANGUAGE
Finally, we tell PostgreSQL which language we are using.

In summary, our function is

CREATE FUNCTION function_name (argument_name argument_type, argument_name argument_type, ...) RETURNS return_type AS $$
     function_body_query;
$$ language language_name;

This covers the basic anatomy of a SQL function. CREATE FUNCTION is a sophisticated command and supports a richer set of options. We'll start on those in the next post.

Thursday, December 2, 2010

schema

In this post we'll examine PostgreSQL's schema support.

We encountered the default schema called "public" in an earlier post when exploring pgAdmin. Colloquially, when someone talks about a database's schema they typically mean the structure of the tables (that this table has these fields with these types). Here, a schema is a higher level of organization, a container or namespace within a database. A schema contains a set of tables, views, stored procedures and triggers and so on, all the usual components of a database. However, one can have multiple schema within a PostgreSQL database. To clarify this, let's start with a concrete example and consider the pros and cons of different ways of organizing the data.

Suppose we have a supermarket company, SuperFoods Inc., who have two stores, say Sausalito, CA and Petaluma, CA. (Clearly they have recovered from the bad economy and layoffs of this earlier post.) Each of the two stores has a set of information: employees and contact details, details about the physical store itself, replenishment orders and order history etc.

Imagine the data are stored in two separate databases. The CEO of SuperFoods asks one of his minions to produce a report on the sales history of the two stores. The problem here is that databases are independent silos. One cannot write a SQL query that will collate the data from the two stores. psql can only connect to one database at a time. We can use external tools, such as a JDBC driver in Java that will make a connection to each of the two databases and collect the data, but we cannot query the two databases simultaneously from the psql command line.

Imagine now that we take the two databases and put them inside a single SuperFoods database. We create a container, a schema, for each of the two original databases. Each schema has a unique name, "sausalito" and "petaluma", and this causes each of the tables in the schema to have a qualified name that includes the schema name.

That is, we have

sausalito.employees
sausalito.sales

petaluma.employees
petaluma.sales

(Technically, it would include the database name too and be "SuperFoods".sausalito.employees etc. under the hood but the user doesn't see this or need to see it.)

Importantly, when we connect to the SuperFoods database we can now access the sales report tables from both stores. We cannot do

select * from sales;

because PostgreSQL is going to complain. There is an abmiguity here. There are multiple sales tables. Which do you mean? Thus, one has to qualify it with the schema name:

select * from sausalito.sales;


to specify the Sausalito sales and

select * from petaluma.sales;


to access the Petaluma sales.

We can now combine those queries with operators such as UNION and UNION ALL (covered in this post) to get something closer to what the CEO wants:

select 'sausalito', sum(sales) from sausalito.sales where year=2010
UNION ALL
select 'petaluma', sum(sales) from petaluma.sales where year=2010;

Great. Goal achieved.

OK, now consider this structure from the perspective of a human resources person at the Sausalito store. She needs to update the employees table for her store and she isn't interested in the Petaluma tables. Writing 'sausalito.' before each table in the queries is a bit of a nuisance. PostgreSQL helps us out with the notion of the search path. The search path says where to look, similar to classpaths in most programming languages.

Let's connect to a database, set the search path and create a new table.

psql -d SuperFoods

testdb=#set search_path TO sausalito;
SET

testdb=# \d
              List of relations
  Schema   |   Name    | Type  |    Owner    
-----------+-----------+-------+--------------
 sausalito | employees | table | postgresuser
 sausalito | sales     | table | postgresuser
(2 rows)

Notice we see sales and employees not sausalito.sales. When we set the search path it means, work on the database from the perspective of the sausalito schema. Now when one does

select * from sales;

it does not complain. It knows that you mean sausalito.sales. There is no ambiguity now because you cleared that up earlier when you set the search path.

You can even select the data from the other schema using the qualified name

select * from petaluma.sales;

from within the sausalito schema.

Creating tables once one has set the search_path has the same implicit name qualification:

testdb=# create table promotions(name varchar, value int);
CREATE TABLE
testdb=# \d
               List of relations
  Schema   |    Name    | Type  |    Owner    
-----------+------------+-------+--------------
 sausalito | employees  | table | postgresuser
 sausalito | promotions | table | postgresuser
 sausalito | sales      | table | postgresuser
(3 rows)

How do we create a schema? That is a synch:

testdb=# create schema sausalito;
CREATE SCHEMA
testdb=# create schema petaluma;
CREATE SCHEMA

Further, you can dump data from a single schema from the operating system command line

localhost:~ postgresuser$ pg_dump --schema sausalito testdb > sausalito.sql

We can see that using schema gives us all the tools that exist for single databases but we can collect related tables together, in SuperFoods by store, and yet compare data mong those different collections. Now we understand more about schema we can refine our notion about them. They are a namespace or rather a namespace within a database. PostgreSQL only sees one database here: SuperFoods.

Schema can be very powerful and useful way of partitioning data. However, they are typically used for more complex sets of tables, more complex than those typically needed for beginner exercises. For simpler cases, an additional field that defines the subset of data might suffice. Imagine that we include an additional 'store' field in the sales table. A single table could then be queried for Sausalito sales, Petaluma sales or both.

Also, imagine that SuperFoods assigns a regional manager to the two stores. That manager is effectively an employee at both stores. We want to avoid replication of data in databases and so would not want to include his personal details in both schema. In this case a single, more unified database structure might be more appropriate. Given those, schema are a valuable addition to PostgreSQL and worth considering when designing one's database.

Wednesday, December 1, 2010

Aggregates

PostgreSQL supports a number of different aggregate functions (a tutorial appears here). These are functions that perform some mathematical or statistical calculations on a set of values. Common examples are sum, min, max and count. The values are often numeric but they can be boolean, string, or XML too and count accepts anything.

Let's start with a simple example. Suppose we have the following table of sales per month for different sales people

aggs=# select * from salesfigures;
 name | month | sales
------+-------+-------
 joe  | jan   |    34
 joe  | feb   |    23
 bob  | jan   |    18
 bob  | feb   |    30
(4 rows)

We can get the total sales for the sales team for the year to date as

aggs=# select sum(sales) from salesfigures;
 sum
-----
 105
(1 row)

Notice that sum is a function that we can specify in the select list (the part before the FROM) and it takes one argument. We are supplying it the values in the sales column. It adds 34 + 23 + 18 + 30 to get 105. Easy.

Filtering by WHERE
As the aggregate gets computed after any WHERE clause is applied we can easily apply filters on the data that is supplied to the aggregate. For instance, to get the sales for January

aggs=# select sum(sales) from salesfigures where month='jan';
 sum
-----
  52
(1 row)

(34 + 18 = 52)

Filtering by GROUP
Similarly, aggregates are computed after GROUP BY clauses. This is a really powerful feature. So, if we want the total sales for each month we can run

aggs=# select month, sum(sales) from salesfigures group by month;
 month | sum
-------+-----
 feb   |  53
 jan   |  52
(2 rows)

Clearly, using strings for month is not the best table design here as they cannot sensibly be ordered or compared. I used them to keep the example simple.

We call the aggregates multiple times in the select list. Here we get the min and max sales values for each person

aggs=# select name, min(sales), max(sales) from salesfigures group by name;
 name | min | max
------+-----+-----
 bob  |  18 |  30
 joe  |  23 |  34
(2 rows)

GOTCHA
If we wanted to know who had the most sales in any month we might be tempted to query

select name from salesfigures where sales=max(sales);         WRONG!

We could imagine this as computing the maximum value in the sales column and the selecting the one or more names that matched this value. This will not work. If one runs this query, PostgreSQL complains

aggs=# select name from salesfigures where sales=max(sales);
ERROR:  aggregates not allowed in WHERE clause
LINE 1: select name from salesfigures where sales=max(sales);
                                                  ^
aggs=#

This is a typical beginner mistake. I've done it. You'll probably do it too. As the error message says, aggregates are not allowed in the WHERE clause. This is because the WHERE clause is used to generate the set of rows to be passed to the select list and any aggregates. What we have here is a circular relationship: max depends on which rows it is given and that is determined by WHERE, which depends on max... What you have to do is obtain the maximum value as one query and then pass that into another query.

aggs=# select name from salesfigures where sales=(select max(sales) from salesfigures);
 name
------
 joe
(1 row)

(select max(sales) from salesfigures) is called a subquery, something we will cover in more detail in a later post. For now, just know that it computes this subquery first to obtain a value, which will be 34. We then effectively have a simple query

select name from salesfigures where sales=34;

which is OK as the circularity has been eliminated.

Filtering by HAVING
OK, so we cannot have aggregates in WHERE clause. However, we can have them in a HAVING clause. If you recall our earlier post, WHERE can be accompanied by an optional HAVING clause that can be used to filter rows after the WHERE clause is performed.

Who is underperforming? Which salesperson(s) have had less than 20 sales in any one month?

aggs=# select name from salesfigures group by name having min(sales)<20;
 name
------
 bob
(1 row)

This query groups first to get a set of virtual tables, then applies the min aggregate to get the minumum per person. It then filters in those with a minumum less than 20. Finally, we asked for SELECT name so it returns the one name left, that for Bob.

COUNT
COUNT is commonly used aggregate. SELECT count(*) from tablename returns the number of rows. The * just means "everything", every column in this context. If we ran SELECT count(*) from salesfigures it would return 4, as there are four rows, as would SELECT count(sales) from salesfigures. Count counts the rows not the values. Don't confuse this with sum!

Unfortunately, COUNT is fairly slow in PostgreSQL. It can take seconds or sometime tens of seconds in really large tables. The reason is that when a row is deleted from a table, it is not really deleted, just flagged as not being needed any more. (This is the same in some operating system when one deletes a file.) Also, the rows for a table may be scattered through memory. Count has to scan through each row of the table sequentially, work out which rows are "dead" in order to return the total row count. Other databases can do this more efficiently and is a frequent complaint about PostgreSQL performance. However, it does excel in many other respects over other databases.

This lists the various aggregates that PostgreSQL provides. As well as count, min, max and sum mentioned above other commonly used aggregates are avg (average) and stddev (standard deviation, a statistical measure of variability). Unfortunately, median is not provided.

Let's end with a non numeric example. The EVERY aggregate checks whether each and every row is true

aggs=# select * from clients;
     name     | paidbill
--------------+----------
 acme corp.   | t
 lehman bros. | f
 GM           | t
(3 rows)

aggs=# select every(paidbill) from clients;
 every
-------
 f
(1 row)