Sunday, November 21, 2010

SQL: UNION, EXCEPT and INTERSECT

In this post we shall examine the very useful PostgreSQL operators UNION, EXCEPT and INTERSECT, each of which can be followed by ALL.

A hot startup is holding a special event and wants to send out invites to some of their best clients and also to some VIPs. Some of the VIPs are actually very supportive of the site and are clients too. What query will provide the complete set of people to invite avoiding duplicate records? For this, we want UNION. UNION is an OR operator for multiple result sets. Here is our data

create database hotstartup;
\c hotstartup;
create table clients(name varchar);
insert into clients values('John Smith');
insert into clients values('Ashton Kutcher');
insert into clients values('Joe Doe');
insert into clients values('Stephen Fry');

create table vips(name varchar);
insert into vips values('Ashton Kutcher');
insert into vips values('Stephen Fry');
insert into vips values('Demi Moore');
insert into vips values('Hugh Laurie');

and the query will be

hotstartup=# select * from clients union select * from vips;
      name     
----------------
 Ashton Kutcher
 Hugh Laurie
 Joe Doe
 Demi Moore
 John Smith
 Stephen Fry
(6 rows)

The query is saying get the rows from clients and append the rows of vips but remove any duplicates. Here we can see that we can have a total of 6 unique names from the initial set of 8 rows.

There are times when we do not want to remove duplicates. For that, we can append ALL after UNION:

hotstartup=# select * from clients union all select * from vips;
      name     
----------------
 John Smith
 Ashton Kutcher
 Joe Doe
 Stephen Fry
 Ashton Kutcher
 Stephen Fry
 Demi Moore
 Hugh Laurie
(8 rows)

The head party planner sees the list of 6 invites and says "No, no, no. We need to send out two sets of invites, one set to the VIPs that will get them into the VIP tent and then ordinary invites for everyone else". Clearly, select * from vips will return the names for the VIP invite list. However, what will get us the list for the remaining invites? We want everyone on the clients list EXCEPT those on the VIP list. Unsurprisingly, we use EXCEPT

hotstartup=# select * from clients except select * from vips;
    name   
------------
 Joe Doe
 John Smith
(2 rows)

Finally, if we want to get the list of people who are both clients and VIP we can use INTERSECT.

hotstartup=# select * from clients intersect select * from vips;
      name     
----------------
 Ashton Kutcher
 Stephen Fry
(2 rows)

Both EXCEPT and INTERSECT can be followed by an optional ALL. To examine its effect we need to add some duplicate data. Let's insert a duplicate name into clients:

hotstartup=# select * from clients;
      name     
----------------
 John Smith
 Ashton Kutcher
 Joe Doe
 Stephen Fry
 Stephen Fry
(5 rows)

hotstartup=# select * from vips;
      name     
----------------
 Ashton Kutcher
 Demi Moore
 Hugh Laurie
 Stephen Fry
(4 rows)

Compare the results of these two

hotstartup=# select * from clients except select * from vips;
    name   
------------
 Joe Doe
 John Smith
(2 rows)

versus

hotstartup=# select * from clients except all select * from vips;
    name    
-------------
 Joe Doe
 John Smith
 Stephen Fry
(3 rows)

The latter query is essentially saying take each row of clients and find a matching row in vips. If there is match delete the pair of rows from the results set.

For INTERSECT, let's insert Stephen Fry into vips to have him duplicated in both lists

hotstartup=# select * from clients;
      name     
----------------
 John Smith
 Ashton Kutcher
 Joe Doe
 Stephen Fry
 Stephen Fry
(5 rows)

hotstartup=# select * from vips;
      name     
----------------
 Ashton Kutcher
 Stephen Fry
 Demi Moore
 Hugh Laurie
 Stephen Fry
(5 rows)

Then we can see that INTERSECT ALL no longer removes duplicates, i.e.

hotstartup=# select * from clients intersect select * from vips;
      name     
----------------
 Ashton Kutcher
 Stephen Fry
(2 rows)

versus

hotstartup=# select * from clients intersect all select * from vips;
      name     
----------------
 Ashton Kutcher
 Stephen Fry
 Stephen Fry
(3 rows)

The latter query is saying find matching pairs of rows. Stephen Fry appears in both tables twice so we find two matching pairs for him and hence two rows appears in the results.

In all these three operators the columns of the two select statements must match. We cannot have the first select statement returning 3 fields and the second returning two. We can demonstrate that with our ceos table:

apple=# select * from ceos;
       name       | year
------------------+------
 Steve Jobs       | 1976
 Mike Markkula    | 1977
 Mike Scott       | 1978
 John Sculley     | 1983
 Michael Spindler | 1993
 Gil Amelio       | 1996
 Steve Jobs       | 1997
(7 rows)

apple=# select * from ceos where year between 1970 and 1979 union select * from ceos where year=1977;
     name      | year
---------------+------
 Mike Scott    | 1978
 Steve Jobs    | 1976
 Mike Markkula | 1977
(3 rows)

This is OK because both statements are returning * which as they refer to the same table means the same set of fields.

This, however, is not acceptable

apple=# select * from ceos where year between 1970 and 1979 union select name from ceos where year=1977;
ERROR:  each UNION query must have the same number of columns
LINE 1: ...os where year  between 1970 and 1979 union select name from ...

because the first statement returns all fields wheareas the second returns one field only.

3 comments:

  1. awesome again!!! magnifico!

    ReplyDelete
  2. Great article. Very easy to understand.

    ReplyDelete