Friday, November 26, 2010

SQL: inner, cross and self joins

Understanding how table joins work is one of the key skills of a database beginner. In an earlier post, we normalized our ceos table into two tables, one table to list the set of unique people and a second table to associate a person with a term as CEO. What we didn't cover then was how to combine those tables now those data had been broken into separate tables. This is the work of join queries. When we join tables we take two or more tables and combine them using zero, one or more join criteria to create a set of resulting rows.

Suppose that we have two tables:

joindb=# select * from clients;
 id | name 
----+-------
 0  | joe
 1  | bob
 2  | alice
(3 rows)

joindb=# select * from orders;
 id | value
----+-------
  1 | val1
  2 | val2
  3 | val3
(3 rows)


CROSS JOIN
The simplest type of join is a cross join. This says create a results set of all possible combinations of the the rows from the two tables.

joindb=# select * from clients, orders;
 id | name  | id | value
----+-------+----+-------
 0  | joe   |  1 | val1
 1  | bob   |  1 | val1
 2  | alice |  1 | val1
 0  | joe   |  2 | val2
 1  | bob   |  2 | val2
 2  | alice |  2 | val2
 0  | joe   |  3 | val3
 1  | bob   |  3 | val3
 2  | alice |  3 | val3
(9 rows)

The first two columns are all the fields of the first of our tables listed in the query, clients (shown in red), followed by all the fields of the second table, orders. We can see that it takes the clients table and combines all its rows with the first row of orders, then it takes the clients table again and combines with the second row of orders and finally does the same for the third row of orders to produce 3 * 3 rows.

We could have produced the same set of results using the more explicit CROSS JOIN terminology:

joindb=# select * from clients cross join orders;
 id | name  | id | value
----+-------+----+-------
 0  | joe   |  1 | val1
 1  | bob   |  1 | val1
 2  | alice |  1 | val1
 0  | joe   |  2 | val2
 1  | bob   |  2 | val2
 2  | alice |  2 | val2
 0  | joe   |  3 | val3
 1  | bob   |  3 | val3
 2  | alice |  3 | val3
(9 rows)

INNER JOIN
Cross joins can be useful but more typically we restrict the set of combinations based on some join criteria. Suppose we want to match up the client and order Ids:

joindb=# select * from clients join orders on clients.id=orders.id;
 id | name  | id | value
----+-------+----+-------
  1 | bob   |  1 | val1
  2 | alice |  2 | val2
(2 rows)

This says output only the rows where there is a matching ID for both clients and orders so we only obtain 2 rows. (0,'joe') of clients doesn't have a match so that is not output. Similarly for (3,'val3') for orders. These criteria are boolean criteria. They must return true or false for each row combination. This is called an INNER JOIN and we can supply that keyword explicitly. This query produces the same results as the last query: select * from clients inner join orders on clients.id=orders.id;.

It is not unusual to have a large number of join criteria. It can be tedious to keep writing out the table names for all these criteria so we can shorten the criteria text by the use of an alias. This is a short, temporary "nickname" we can give to each table. Let's give client an alias of t1 and orders t2.
We assign the name by putting the alias after each of our tables in the from section of our query. We can then use them in the later part of the query:

joindb=# select * from clients t1 join orders t2 on t1.id=t2.id;
 id | name  | id | value
----+-------+----+-------
  1 | bob   |  1 | val1
  2 | alice |  2 | val2
(2 rows)

If you use aliases you must use them in the join criteria otherwise PostgreSQL will complain:

joindb=# select * from clients t1 join orders t2 on clients.id=orders.id;
ERROR:  invalid reference to FROM-clause entry for table "clients"
LINE 1: select * from clients t1 join orders t2 on clients.id=orders...
                                                   ^
HINT:  Perhaps you meant to reference the table alias "t1".

Rather than using ON we can specify a field list after USING

joindb=# select * from clients join orders using (id);
 id | name  | value
----+-------+-------
  1 | bob   | val1
  2 | alice | val2
(2 rows)

We can actually drop the JOIN keyword altogether here if we use WHERE instead of ON:

joindb=# select * from clients t1, orders t2 where t1.id=t2.id;
 id | name  | id | value
----+-------+----+-------
  1 | bob   |  1 | val1
  2 | alice |  2 | val2
(2 rows)

NATURAL JOIN
Often we give fields the same fieldname for the same entity types across different tables. That is, if we have a person field in different tables we might use the same "person_id" name in those different tables. Such conventions is an aspect of good database design and can avoid some confusion when designing queries. We can ask PostgreSQL to make use of this convention in a NATURAL join which is a special type of INNER join. This is a join where we ask PostgreSQL to match up on all columns that have the same name. In the following query

joindb=# select * from clients natural join orders;
 id | name  | value
----+-------+-------
  1 | bob   | val1
  2 | alice | val2
(2 rows)

PostgreSQL works out that we have an id column in each table and implicitly does an INNER JOIN on that column.

SELF JOIN
The tables we are joining don't have to be different tables. We can join a table with itself.  This is called a self join. In this case we have to use aliases for the table otherwise PostgreSQL will not know which id column of which table instance we mean

joindb=# select * from clients t1, clients t2 where t1.id=t2.id;
 id | name  | id | name 
----+-------+----+-------
  0 | joe   |  0 | joe
  1 | bob   |  1 | bob
  2 | alice |  2 | alice
(3 rows)

To summarize, we can join two tables t1 and t2 using a cross join:

T1 CROSS JOIN T2
T1, T2

a self join

T1 CROSS JOIN T1
T1, T1

or a qualified inner join:

T1 [INNER] JOIN T2 ON boolean_expression
T1 [INNER] JOIN T2 USING ( join column list )
T1 NATURAL [INNER] JOIN T2
T1, T2 WHERE ....

and we can use table aliases

tablename t1, tablename t2...

In the next post we will cover OUTER JOINs.

3 comments: