Saturday, November 27, 2010

SQL: outer joins

Last time we covered INNER JOINs including a couple of special cases CROSS JOIN and self joins. In INNER JOINS the result set returns only the set of rows that match the join criteria (if any). IN OUTER JOINS, the results might contain both matched and unmatched rows. It is for this reason that beginners might find such JOINS a little more confusing. However, the logic is really quite straightforward.

There are three types of outer joins: LEFT, RIGHT, and FULL. Let's start with a LEFT OUTER JOIN.

In a left join when joining table t1 with table t2 PostgreSQL first does a "normal" inner join. It then looks to see if there are any rows from t1 that are not in the result set. If so, it adds in those rows placing NULLs for all the fields of t2. Thus, this guarantees that each row of t1 will appear at least once in the results.

Here are our data:

joindb=# select * from clients;
 id | name 
----+-------
  0 | joe
  1 | bob
  2 | alice
(3 rows)

joindb=# select * from orders;
 id | value
----+-------
  1 | val1
  2 | val2
  3 | val3
(3 rows)

An inner join on ids matches on id=1 and id=2

joindb=# select * from clients join orders on clients.id=orders.id;
 id | name  | id | value
----+-------+----+-------
  1 | bob   |  1 | val1
  2 | alice |  2 | val2
(2 rows)

but a left outer join has that inner join result set plus the row for id=0 from clients (shown in red):

joindb=# select * from clients left join orders on clients.id=orders.id;
 id | name  | id | value
----+-------+----+-------
  0 | joe   |    |
  1 | bob   |  1 | val1
  2 | alice |  2 | val2
(3 rows)

Left joins are implicitly OUTER JOINs but we can add that keyword in explicitly

select * from clients left outer join orders on clients.id=orders.id;

We can obtain the same result set if we use the USING syntax: select * from clients left join orders using(id);

That's it. Simple isn't it?

Why might we want something like this? Suppose we have a list of clients and we have purchase data from the last month. We might want a report of the sales for all our clients, even those that didn't buy anything in that month. This sort of report might be useful for the sales staff to target those clients.

OK, if a LEFT JOIN does an inner join and adds in any additonal "missing" rows from t1, what do you think a RIGHT JOIN does? Exactly. Do an INNER JOIN and add in any missing rows from t2.

joindb=# select * from clients right join orders on clients.id=orders.id;
 id | name  | id | value
----+-------+----+-------
  1 | bob   |  1 | val1
  2 | alice |  2 | val2
    |       |  3 | val3
(3 rows)

Here we can see the row for id=3 from orders shown in blue.

The last type is a FULL OUTER JOIN. That is an left and a right outer join. Do an inner join and add in any missing rows from both t1 and t2.

joindb=# select * from clients full join orders on clients.id=orders.id;
 id | name  | id | value
----+-------+----+-------
  0 | joe   |    |
  1 | bob   |  1 | val1
  2 | alice |  2 | val2
    |       |  3 | val3
(4 rows)

Again, we can use the ON join_criteria or the USING (join_column_list) syntax and we can include or omit the OUTER keyword.

We can also do a NATURAL join for left, right and full outer joins. Recall that with natural joins we ask PostgreSQL to match up the columns that share the same name.

joindb=# select * from clients natural full join orders;
 id | name  | value
----+-------+-------
  0 | joe   |
  1 | bob   | val1
  2 | alice | val2
  3 |       | val3
(4 rows)

Finally, in this and the last post we only joined two tables. We can in fact join as many tables as we wish in a single query. For outer joins you will need to use parentheses and nest the joins. For inner joins, however, one can do

select * from tablename1 t1, tablename2 t2, tablename3 t3,... where t1.id=t2.id and t1.id=t3.id and...;

Friday, November 26, 2010

SQL: inner, cross and self joins

Understanding how table joins work is one of the key skills of a database beginner. In an earlier post, we normalized our ceos table into two tables, one table to list the set of unique people and a second table to associate a person with a term as CEO. What we didn't cover then was how to combine those tables now those data had been broken into separate tables. This is the work of join queries. When we join tables we take two or more tables and combine them using zero, one or more join criteria to create a set of resulting rows.

Suppose that we have two tables:

joindb=# select * from clients;
 id | name 
----+-------
 0  | joe
 1  | bob
 2  | alice
(3 rows)

joindb=# select * from orders;
 id | value
----+-------
  1 | val1
  2 | val2
  3 | val3
(3 rows)


CROSS JOIN
The simplest type of join is a cross join. This says create a results set of all possible combinations of the the rows from the two tables.

joindb=# select * from clients, orders;
 id | name  | id | value
----+-------+----+-------
 0  | joe   |  1 | val1
 1  | bob   |  1 | val1
 2  | alice |  1 | val1
 0  | joe   |  2 | val2
 1  | bob   |  2 | val2
 2  | alice |  2 | val2
 0  | joe   |  3 | val3
 1  | bob   |  3 | val3
 2  | alice |  3 | val3
(9 rows)

The first two columns are all the fields of the first of our tables listed in the query, clients (shown in red), followed by all the fields of the second table, orders. We can see that it takes the clients table and combines all its rows with the first row of orders, then it takes the clients table again and combines with the second row of orders and finally does the same for the third row of orders to produce 3 * 3 rows.

We could have produced the same set of results using the more explicit CROSS JOIN terminology:

joindb=# select * from clients cross join orders;
 id | name  | id | value
----+-------+----+-------
 0  | joe   |  1 | val1
 1  | bob   |  1 | val1
 2  | alice |  1 | val1
 0  | joe   |  2 | val2
 1  | bob   |  2 | val2
 2  | alice |  2 | val2
 0  | joe   |  3 | val3
 1  | bob   |  3 | val3
 2  | alice |  3 | val3
(9 rows)

INNER JOIN
Cross joins can be useful but more typically we restrict the set of combinations based on some join criteria. Suppose we want to match up the client and order Ids:

joindb=# select * from clients join orders on clients.id=orders.id;
 id | name  | id | value
----+-------+----+-------
  1 | bob   |  1 | val1
  2 | alice |  2 | val2
(2 rows)

This says output only the rows where there is a matching ID for both clients and orders so we only obtain 2 rows. (0,'joe') of clients doesn't have a match so that is not output. Similarly for (3,'val3') for orders. These criteria are boolean criteria. They must return true or false for each row combination. This is called an INNER JOIN and we can supply that keyword explicitly. This query produces the same results as the last query: select * from clients inner join orders on clients.id=orders.id;.

It is not unusual to have a large number of join criteria. It can be tedious to keep writing out the table names for all these criteria so we can shorten the criteria text by the use of an alias. This is a short, temporary "nickname" we can give to each table. Let's give client an alias of t1 and orders t2.
We assign the name by putting the alias after each of our tables in the from section of our query. We can then use them in the later part of the query:

joindb=# select * from clients t1 join orders t2 on t1.id=t2.id;
 id | name  | id | value
----+-------+----+-------
  1 | bob   |  1 | val1
  2 | alice |  2 | val2
(2 rows)

If you use aliases you must use them in the join criteria otherwise PostgreSQL will complain:

joindb=# select * from clients t1 join orders t2 on clients.id=orders.id;
ERROR:  invalid reference to FROM-clause entry for table "clients"
LINE 1: select * from clients t1 join orders t2 on clients.id=orders...
                                                   ^
HINT:  Perhaps you meant to reference the table alias "t1".

Rather than using ON we can specify a field list after USING

joindb=# select * from clients join orders using (id);
 id | name  | value
----+-------+-------
  1 | bob   | val1
  2 | alice | val2
(2 rows)

We can actually drop the JOIN keyword altogether here if we use WHERE instead of ON:

joindb=# select * from clients t1, orders t2 where t1.id=t2.id;
 id | name  | id | value
----+-------+----+-------
  1 | bob   |  1 | val1
  2 | alice |  2 | val2
(2 rows)

NATURAL JOIN
Often we give fields the same fieldname for the same entity types across different tables. That is, if we have a person field in different tables we might use the same "person_id" name in those different tables. Such conventions is an aspect of good database design and can avoid some confusion when designing queries. We can ask PostgreSQL to make use of this convention in a NATURAL join which is a special type of INNER join. This is a join where we ask PostgreSQL to match up on all columns that have the same name. In the following query

joindb=# select * from clients natural join orders;
 id | name  | value
----+-------+-------
  1 | bob   | val1
  2 | alice | val2
(2 rows)

PostgreSQL works out that we have an id column in each table and implicitly does an INNER JOIN on that column.

SELF JOIN
The tables we are joining don't have to be different tables. We can join a table with itself.  This is called a self join. In this case we have to use aliases for the table otherwise PostgreSQL will not know which id column of which table instance we mean

joindb=# select * from clients t1, clients t2 where t1.id=t2.id;
 id | name  | id | name 
----+-------+----+-------
  0 | joe   |  0 | joe
  1 | bob   |  1 | bob
  2 | alice |  2 | alice
(3 rows)

To summarize, we can join two tables t1 and t2 using a cross join:

T1 CROSS JOIN T2
T1, T2

a self join

T1 CROSS JOIN T1
T1, T1

or a qualified inner join:

T1 [INNER] JOIN T2 ON boolean_expression
T1 [INNER] JOIN T2 USING ( join column list )
T1 NATURAL [INNER] JOIN T2
T1, T2 WHERE ....

and we can use table aliases

tablename t1, tablename t2...

In the next post we will cover OUTER JOINs.

Sunday, November 21, 2010

SQL: UNION, EXCEPT and INTERSECT

In this post we shall examine the very useful PostgreSQL operators UNION, EXCEPT and INTERSECT, each of which can be followed by ALL.

A hot startup is holding a special event and wants to send out invites to some of their best clients and also to some VIPs. Some of the VIPs are actually very supportive of the site and are clients too. What query will provide the complete set of people to invite avoiding duplicate records? For this, we want UNION. UNION is an OR operator for multiple result sets. Here is our data

create database hotstartup;
\c hotstartup;
create table clients(name varchar);
insert into clients values('John Smith');
insert into clients values('Ashton Kutcher');
insert into clients values('Joe Doe');
insert into clients values('Stephen Fry');

create table vips(name varchar);
insert into vips values('Ashton Kutcher');
insert into vips values('Stephen Fry');
insert into vips values('Demi Moore');
insert into vips values('Hugh Laurie');

and the query will be

hotstartup=# select * from clients union select * from vips;
      name     
----------------
 Ashton Kutcher
 Hugh Laurie
 Joe Doe
 Demi Moore
 John Smith
 Stephen Fry
(6 rows)

The query is saying get the rows from clients and append the rows of vips but remove any duplicates. Here we can see that we can have a total of 6 unique names from the initial set of 8 rows.

There are times when we do not want to remove duplicates. For that, we can append ALL after UNION:

hotstartup=# select * from clients union all select * from vips;
      name     
----------------
 John Smith
 Ashton Kutcher
 Joe Doe
 Stephen Fry
 Ashton Kutcher
 Stephen Fry
 Demi Moore
 Hugh Laurie
(8 rows)

The head party planner sees the list of 6 invites and says "No, no, no. We need to send out two sets of invites, one set to the VIPs that will get them into the VIP tent and then ordinary invites for everyone else". Clearly, select * from vips will return the names for the VIP invite list. However, what will get us the list for the remaining invites? We want everyone on the clients list EXCEPT those on the VIP list. Unsurprisingly, we use EXCEPT

hotstartup=# select * from clients except select * from vips;
    name   
------------
 Joe Doe
 John Smith
(2 rows)

Finally, if we want to get the list of people who are both clients and VIP we can use INTERSECT.

hotstartup=# select * from clients intersect select * from vips;
      name     
----------------
 Ashton Kutcher
 Stephen Fry
(2 rows)

Both EXCEPT and INTERSECT can be followed by an optional ALL. To examine its effect we need to add some duplicate data. Let's insert a duplicate name into clients:

hotstartup=# select * from clients;
      name     
----------------
 John Smith
 Ashton Kutcher
 Joe Doe
 Stephen Fry
 Stephen Fry
(5 rows)

hotstartup=# select * from vips;
      name     
----------------
 Ashton Kutcher
 Demi Moore
 Hugh Laurie
 Stephen Fry
(4 rows)

Compare the results of these two

hotstartup=# select * from clients except select * from vips;
    name   
------------
 Joe Doe
 John Smith
(2 rows)

versus

hotstartup=# select * from clients except all select * from vips;
    name    
-------------
 Joe Doe
 John Smith
 Stephen Fry
(3 rows)

The latter query is essentially saying take each row of clients and find a matching row in vips. If there is match delete the pair of rows from the results set.

For INTERSECT, let's insert Stephen Fry into vips to have him duplicated in both lists

hotstartup=# select * from clients;
      name     
----------------
 John Smith
 Ashton Kutcher
 Joe Doe
 Stephen Fry
 Stephen Fry
(5 rows)

hotstartup=# select * from vips;
      name     
----------------
 Ashton Kutcher
 Stephen Fry
 Demi Moore
 Hugh Laurie
 Stephen Fry
(5 rows)

Then we can see that INTERSECT ALL no longer removes duplicates, i.e.

hotstartup=# select * from clients intersect select * from vips;
      name     
----------------
 Ashton Kutcher
 Stephen Fry
(2 rows)

versus

hotstartup=# select * from clients intersect all select * from vips;
      name     
----------------
 Ashton Kutcher
 Stephen Fry
 Stephen Fry
(3 rows)

The latter query is saying find matching pairs of rows. Stephen Fry appears in both tables twice so we find two matching pairs for him and hence two rows appears in the results.

In all these three operators the columns of the two select statements must match. We cannot have the first select statement returning 3 fields and the second returning two. We can demonstrate that with our ceos table:

apple=# select * from ceos;
       name       | year
------------------+------
 Steve Jobs       | 1976
 Mike Markkula    | 1977
 Mike Scott       | 1978
 John Sculley     | 1983
 Michael Spindler | 1993
 Gil Amelio       | 1996
 Steve Jobs       | 1997
(7 rows)

apple=# select * from ceos where year between 1970 and 1979 union select * from ceos where year=1977;
     name      | year
---------------+------
 Mike Scott    | 1978
 Steve Jobs    | 1976
 Mike Markkula | 1977
(3 rows)

This is OK because both statements are returning * which as they refer to the same table means the same set of fields.

This, however, is not acceptable

apple=# select * from ceos where year between 1970 and 1979 union select name from ceos where year=1977;
ERROR:  each UNION query must have the same number of columns
LINE 1: ...os where year  between 1970 and 1979 union select name from ...

because the first statement returns all fields wheareas the second returns one field only.

SQL: DISTINCT, DISTINCT ON and ALL

It is not uncommon to have duplicate data in the results of a query. We can use the DISTINCT clause of SELECT to remove those and return only the unique results.

Returning to our trusty ceos table, a simple select is showing 7 rows:

apple=# select * from ceos;
       name       | year
------------------+------
 Steve Jobs       | 1976
 Mike Markkula    | 1977
 Mike Scott       | 1978
 John Sculley     | 1983
 Michael Spindler | 1993
 Gil Amelio       | 1996
 Steve Jobs       | 1997
(7 rows)

If we only want the distinct set of people who are or have been Apple's CEOs we can use distinct to give the six names:

apple=# select distinct(name) from ceos;
       name      
------------------
 Michael Spindler
 John Sculley
 Gil Amelio
 Mike Markkula
 Mike Scott
 Steve Jobs
(6 rows)

Similarly select distinct(year) from ceos; will provide a list of years of a CEO changeover.

DISTINCT ON
(expression) will return the "first" row of each set of rows where the expression is equal. The following says get the name and year from ceos for the set of rows where name is distinct.

apple=# select distinct on (name) name,year from ceos;
       name       | year
------------------+------
 Gil Amelio       | 1996
 John Sculley     | 1983
 Michael Spindler | 1993
 Mike Markkula    | 1977
 Mike Scott       | 1978
 Steve Jobs       | 1976
(6 rows)

You can see that it returned Jobs' first term as CEO.

I quoted first because it is not always obvious what that means. The results can be unpredictable. To be safe it is good practice to combine DISTINCT ON with ORDER BY to make that explicit.

apple=# select distinct on (name) name, year from ceos order by name,year;
       name       | year
------------------+------
 Gil Amelio       | 1996
 John Sculley     | 1983
 Michael Spindler | 1993
 Mike Markkula    | 1977
 Mike Scott       | 1978
 Steve Jobs       | 1976
(6 rows)

DISTINCT ON can handle multiple fields. This says get all the rows where the combination of name and year is distinct:

apple=# select distinct on (name,year) name,year from ceos;
       name       | year
------------------+------
 Gil Amelio       | 1996
 John Sculley     | 1983
 Michael Spindler | 1993
 Mike Markkula    | 1977
 Mike Scott       | 1978
 Steve Jobs       | 1976
 Steve Jobs       | 1997
(7 rows)

In this trivial example it returns all rows but one can imagine how can this can be powerful with additional fields. DISTINCT ON is not part of the SQL standard.

Incidentally, this could have been achieved with GROUP BY and ordering results

apple=# select name,year from ceos group by name,year order by name;
       name       | year
------------------+------
 Gil Amelio       | 1996
 John Sculley     | 1983
 Michael Spindler | 1993
 Mike Markkula    | 1977
 Mike Scott       | 1978
 Steve Jobs       | 1976
 Steve Jobs       | 1997
(7 rows)

Finally, ALL is the opposite to DISTINCT. As its name suggests it explicitly specifies that all rows must be returned. ALL is the default mode so that
select name from ceos is equivalent to select all(name) from ceos. As such it doesn't sound useful. However, whereas DISTINCT is designed to override the default behavior and strip out duplicate rows, ALL can be used with constructs such as UNION, EXCEPT and INTERSECT whose default behavior is to remove duplicates and override them to return ALL rows, including duplicates. We will cover those last three keywords in the next post.

Thursday, November 18, 2010

Create table and constraints

Today we shall cover creating tables. So far we have only created the simplest tables, one or two fields, no constraints, no foreign keys, no primary keys etc. CREATE TABLE is one of the core SQL command and is one of the more complex commands with a host of options. This post is quick introduction to highlight the basic ideas, common patterns and where to find mode detailed explanation in the PostgreSQL documentation. Importantly, as all of this material is essentially standard SQL so the reader can easily find a whole host of additional material covering all levels of detail on the web.

When creating a table we want to define the fields (columns), their data type (int, text, date etc.) and any constraints such as whether that field can be empty (NULL), unique, or whether it must contain a value found in another table (foreign key) or that the value must be in a certain range (say, check that price > 0.01).

In its simplest form the CREATE TABLE command is a comma-delimited list of fieldnames and their types:

CREATE TABLE tablename (
  fieldname1 datatype1,
  fieldname2 datatype2,
  ...
  fieldname3 datatype3
);

such as the Apple CEOs table of an earlier post:

create table ceos (name varchar, year int);

PostgreSQL supports a large number of datatypes. Here are some of the more common ones:
varchar(n) string of at most n characters. If less than n are provided, it only stores those chars.
char(n)string of n characters. If less than n are provided it will be space padded.
text variable length string.
int integer, 4 bytes.
bigint 8 bytes. This is returned by count as in "select count(*) from..."
decimal or numeric a flexible real value of specifiable precision. For instance 12.3456 is a numeric(6,4).
date date with year, month and day.
timestamp, timestamptz date and time without and with a timezone.
booleantrue or false.

This is, however, just a sample. There are types for money, shapes, arrays and many more.

Decide upon the types for each of your fields but do so with a consideration to storage size. Don't use more than you need. For instance, if you know that a column is two character data, say for a state abbreviation('AZ', 'CA' etc), a varchar(2) is a better choice than text. On the other hand, make sure that you will have enough for what you might need in the future. For instance, numeric(4,2) might suffice for most daily temperature readings in San Francisco but very very rarely it can exceed 100 F. If you try to store the record 103 F in a numeric(4,2) PostgreSQL will throw an exception ("ERROR:  numeric field overflow. DETAIL:  A field with precision 4, scale 2 must round to an absolute value less than 10^2"). With numeric(5,2) or decimal you will be safe.

Having decided upon field types for a table we then need to consider the PRIMARY KEY. This is a special field or combination of fields that must be unique across all rows. This key becomes a unique identifier for each row. Typically these are integer IDs (product ID, order ID etc) which can be supplied explicitly or can be provided by PostgreSQL automatically by declaring a field SERIAL. Let's examine this further.

To remind you, here is our table of Apple's CEOs and the year that they started that role:

apple=# \d ceos
          Table "public.ceos"
 Column |       Type        | Modifiers
--------+-------------------+-----------
 name   | character varying |
 year   | integer           |

apple=# select * from ceos;
       name       | year
------------------+------
 Steve Jobs       | 1976
 Mike Markkula    | 1977
 Mike Scott       | 1978
 John Sculley     | 1983
 Michael Spindler | 1993
 Gil Amelio       | 1996
 Steve Jobs       | 1997
(7 rows)


A table doesn't have to have a primary key. We didn't declare a primary key in the ceos table but then I knew that I was going to do simple selects only and that I would not be adding any new data to the table. Often, however, we don't know what additional data might be added to a table and so we have to design the table to handle the edge cases of that future data. Suppose Steve Jobs steps down and his post is taken over by Mike Scott. This is a relatively common name so what if it is not the same Mike Scott that ran the company in the 1970s. Now we have a problem. If we use the current table schema we would not be able to distinguish among the two Mikes. We need to assign each of the two Mikes a unique code. We could normalize the table and create one table to define the different people (people table) and a different table (ceos table) to define the association between the person and the year that they became CEO.

create table people(
   id int PRIMARY KEY,
   name varchar NOT NULL
);
insert into people(0,'Steve Jobs');
insert into people(1,'Mike Markkula');
insert into people(2,'Mike Scott');
insert into people(3,'John Sculley');
insert into people(4,'Michael Spindler');
insert into people(5,'Gil Amelio');
insert into people(6,'Mike Scott');

(This is just a simple illustrative example. In a real database a person would likely have separate last name and first name fields as well as other attribute fields.)
With this we have solved two problems. First, we have an id for each of Mikes: 2 and 6. We also have removed some redundancy in the database by only defining Steve Jobs once. Recall that he appears twice in the ceos table. It is better design to have an entity defined once. Imagine that Apple had a female CEO. She got married and changed her name. We would have to update all the rows where her name appears. In a normalized schema as above we only have to update a single row. This doesn't make a great difference in our trivial ceos example but in a large database where there may be lots of tables associated with a person. Imagine a payroll database where a person is associated with address, job, payrates and bonuses and so on, all of which might be in different tables this can be a huge issue.

In the people schema, we defined a id column of type integer. We declared it to be the PRIMARY KEY. This means that it must be unique and it must not be NULL. Next we declared a name column and said that it cannot be NULL. If we are defining people based on a single name field it doesn't make sense to allow a NULL name, otherwise what is the point of the row?

In the table above we supplied the ids. If we don't care what id is used for a person so long as each person gets a unique name we can ask PostgreSQL to make up IDs for us.

create table people(
  id SERIAL PRIMARY KEY,
  name varchar NOT NULL
);


By adding the keyword SERIAL it says create an ID starting from 1 and incrementing by 1 when we insert data. We can then insert data as

insert into people (name) values('Steve Jobs');
insert into people (name) values('Mike Markkula');
insert into people (name) values('Mike Scott');
insert into people (name) values('John Sculley');
insert into people (name) values('Michael Spindler');
insert into people (name) values('Gil Amelio');
insert into people (name) values('Mike Scott');

apple=# select * from people;
 id |       name      
----+------------------
  1 | Steve Jobs
  2 | Mike Markkula
  3 | Mike Scott
  4 | John Sculley
  5 | Michael Spindler
  6 | Gil Amelio
  7 | Mike Scott
(7 rows)

With our revised people table complete let us rework the ceo table.

create table ceo(
   id int,
   year int NOT NULL,
   PRIMARY KEY(id,year)
);
insert into ceos(1,1976);
insert into ceos(2,1977);
insert into ceos(3,1978);
insert into ceos(4,1983);
insert into ceos(5,1993);
insert into ceos(6,1996);
insert into ceos(1,1997);
ALTER TABLE ceos ADD CONSTRAINT ceos_foreign_id FOREIGN KEY (id) REFERENCES people (id);

We have an id column of type int. We intend that to be the same id as in the people table. To achieve that we add a foreign key constraint with the ALTER TABLE statement. This says add a constraint to the ceo table such that an id value in the ceo's id column must be a value that is present in the id field of the people table. (We could have written "id int references people," which would add a foreign key constraint to the primary key of the people table. It is a matter of tast but the alter table command is more explicit.)

We want the year to be NOT NULL. We might also want to verify make sure that any year supplied is sensible. After all, -5 is a NOT NULL int but doesn't make sense for a year. Moreover, this is not just any year, it must be a year since Apple was founded so it must be 1976 or later. Thus, we could add a check to that field

year int NOT NULL check(year >= 1976),

If we try to enter a year less than 1976 PostgreSQL will complain with

ERROR:  new row for relation "ceo" violates check constraint "ceo_year_check"

Finally, we have a primary key of both the id and the year. This will allow multiple people in the same year but not the same person multiple times in a year. When we execute the create table query PostgreSQL outputs

NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "ceo_pkey" for table "ceo"

Database design is a tricky exercise and certainly a little confusing for beginners. There are many tradeoffs to be made in terms of storage size, speed and efficiency of loading and querying, maintainability, current versus known or unknown future data and so on. However, hopefully this post has conveyed the sense of some of things to consider and the rich levels of types and constraints that PostreSQL provides. PostgreSQL CREATE TABLE documentation details many more options and details but the basic detailes covered here will actually get you fairly far andconver a lot of needs.

Monday, November 15, 2010

Productivity tip: psql tab completion and history

Like many other shells, psql has tab completion. The ability to tab complete commands and relations such as tablenames can be a huge productivity boost.

Type 'S' and hit tab and the commands starting with 'S' appear:

SuperFoods=# S
SELECT     SET        SHOW       START      SAVEPOINT 

Type the next letter say 'e' and only the subset of commands appear:

SuperFoods=# SE
SET     SELECT 


(psql automatically changes the case to upper case on these primary commands and this can take an extra tab which is a little annoying and unnecessary.)

This continues until there is no ambiguity and the command completes.

Type 'set sea' and tab and it completes to

SuperFoods-# set search_path

Type space and then tab again and it completes to

SuperFoods-# set search_path TO

and so on.

The same behavior is true with tablenames. If you have a table called 'employees' and no other table starts with 'e' then tab will complete the whole tablename. If there are two tables starting with 'e' then psql will list those tables in the same manner as the commands above.

It is not always obvious what will and will not complete . If one types 'SEL' and tab it completes to 'SELECT'. Then typing ' * FR' and then tab one might expect it to complete the FROM but it does not. The psql documentation does however state that "the completion logic makes no claim to be an SQL parser".

If you have a blank line and type tab twice it brings up the primary commands:

SuperFoods-#
ABORT        ALTER        ANALYZE      BEGIN        CHECKPOINT   CLOSE        CLUSTER      COMMENT      COMMIT       COPY        
CREATE       DEALLOCATE   DECLARE      DELETE FROM  DISCARD      DO           DROP         END          EXECUTE      EXPLAIN     
FETCH        GRANT        INSERT       LISTEN       LOAD         LOCK         MOVE         NOTIFY       PREPARE      REASSIGN    
REINDEX      RELEASE      RESET        REVOKE       ROLLBACK     SAVEPOINT    SELECT       SET          SHOW         START       
TABLE        TRUNCATE     UPDATE       VACUUM       VALUES       WITH         UNLISTEN    

Finally, like other shells one can scroll through the command history using page up/down or cursor up/down keys. History retains up to the last 500 commands.

Modifying data: UPDATE and DELETE

In the last post we covered getting data into tables using INSERT and COPY. In this post we focus on modifying data in table using UPDATE and DELETE.

UPDATE as one might imagine updates or modifies values in a table, including NULLs. The basic idea is that one is specifying or SETting the new values of one or more fields FROM a certain table WHERE the rows meet some condition.

SuperFoods supermarket has a table for their employees:

create database "SuperFoods";
\c "SuperFoods";
create table employees (lastname varchar, firstname varchar, title varchar);
insert into employees values('Doe','Jane', 'cashier');
insert into employees values('Frost','Mary', 'manager');
insert into employees values('Smith','John', 'shift leader');

SuperFoods=# select * from employees;
 lastname | firstname |    title    
----------+-----------+--------------
 Doe      | Jane      | cashier
 Frost    | Mary      | manager
 Smith    | John      | shift leader
(3 rows)

John Smith has worked hard and has been promoted from shift leader to assistant manager. HR needs to update the employees table and set the title to be 'assistant manager' where the row corresponds to John Smith. We do that with:

SuperFoods=# update employees SET title='assistant manager' WHERE lastname='Smith' AND firstname='John';
UPDATE 1
SuperFoods=# select * from employees;
 lastname | firstname |       title      
----------+-----------+-------------------
 Doe      | Jane      | cashier
 Frost    | Mary      | manager
 Smith    | John      | assistant manager
(3 rows)


We don't have to update one row at a time. We can update many rows simultaneously. Suppose SuperFoods is having a promotion and are going to reduce the price of all their wine from a certain winery by 10%.

create table wines (winery varchar, wine varchar, price decimal);
insert into wines values('Vinny''s Vino','Smooth Syrah',9.99);
insert into wines values('Wacky Wines','Zany Zin',8.99);
insert into wines values('Wacky Wines','Magic Merlot',19.99);
insert into wines values('Oregon Oeno','Perky Pinot',14.50);

SuperFoods=# select * from wines;
    winery    |     wine     | price
--------------+--------------+-------
 Vinny's Vino | Smooth Syrah |  9.99
 Wacky Wines  | Zany Zin     |  8.99
 Wacky Wines  | Magic Merlot | 19.99
 Oregon Oeno  | Perky Pinot  | 14.50
(4 rows)

Notice for "Vinny's Vino" we needed to double up on the single quote inside the string. Without that PostgreSQL will get confused, think that string is 'Vinny' and then complain that the rest of statement starting with 's Vino' is not valid.

OK, let's reduce the price on all Wacky Wines:

update wines set price=0.9*price where winery='Wacky Wines';

SuperFoods=# update wines set price=0.9*price where winery='Wacky Wines';
UPDATE 2
SuperFoods=# select * from wines;
    winery    |     wine     | price 
--------------+--------------+--------
 Vinny's Vino | Smooth Syrah |   9.99
 Oregon Oeno  | Perky Pinot  |  14.50
 Wacky Wines  | Zany Zin     |  8.091
 Wacky Wines  | Magic Merlot | 17.991
(4 rows)

(Note: decimal is not the best type for a price but is fine for illustrative purposes here. Notice too that the updated rows are now at the end of the table.)

We are setting price field for any rows where winery='Wacky Wines' to be 0.9 times the original price and this leads to 2 rows being updated.

If we wanted to update the price of all wines, it would have been very simple indeed: update wines set price=0.9*price; as there would have been no need for a WHERE clause.

Usefully, we can update multiple columns in a single query by putting the SET components in a comma separated list:

update tablename set col1='newvalue1', col2='newvalue2' WHERE some_condition;

With UPDATE we are updating individual fields of some or all of the rows of a tables. With DELETE, however, we are deleting complete rows; there is no need to specify any columns. (Deleting complete columns is not achieved with delete. For that we use ALTER TABLE...DROP COLUMN covered later.) Thus, we just need to specify the table of interest and the focal set of rows:

delete from table_name where some_condition;

Despite SuperFoods' wine promotion, they are not bringing in enough customers. Times are tough and they have to let one of their cashiers go. To delete Jane Doe from the employees table we can do:

SuperFoods=# delete from employees where lastname='Doe' and firstname='Jane';
DELETE 1

If we don't specify a WHERE condition, say delete from wines, what will happen? This will mean that all rows are of interest and the table will be emptied. The table will remain, it will not be deleted from the database completely (for that we would need the DROP TABLE command), but it will not contain any rows until you do an INSERT or COPY. If you do want to empty a table it is actually better to use a special command, TRUNCATE TABLE which is more efficient.

You can also specify that you want to see the set of rows that were deleted. To do this, tack on a "returning *" to the end of the query:

SuperFoods=# delete from wines where price=9.99 returning *;
    winery    |     wine     | price
--------------+--------------+-------
 Vinny's Vino | Smooth Syrah |  9.99
(1 row)

DELETE 1

Few databases are completely static. Data change and they need to be cleaned, modified and purged. UPDATE and DELETE are both simple yet powerful. There isn't however a simple undo command if you make a mistake so use with caution. If in doubt, make a copy of the table (create table new_table as select * from old_table), test out your queries and then run them against the intended table.
 

Friday, November 12, 2010

Populating a database

A database isn't very useful without data. Let's examine how one can populate a database.

So far we have used simple INSERT statements such as

insert into ceos values('Steve Jobs',1976);

We don't have to fill all the fields in an insert. We can specify some subset in parentheses before VALUES. That is, suppose we have a table:

apple=# create table mytable (col1 int, col2 int, col3 int);
CREATE TABLE


We can insert into the last two columns only with:

apple=# insert into mytable (col2, col3) values(2,3);
INSERT 0 1
apple=# select * from mytable;
 col1 | col2 | col3
------+------+------
      |    2 |    3
(1 row)

We can also use the NULL keyword to fill in empty data:

apple=# insert into mytable values(1,2,NULL);
INSERT 0 1
apple=# select * from mytable;
 col1 | col2 | col3
------+------+------
      |    2 |    3
    1 |    2 |    
(2 rows)

and PostgreSQL also allows us to omit fields and it will fill all the remaining fields with NULLs:

apple=# insert into mytable values(11,12);
INSERT 0 1
apple=# select * from mytable;
 col1 | col2 | col3
------+------+------
      |    2 |    3
    1 |    2 |    
   11 |   12 |    
(3 rows)

This behavior actually depends on whether we specified any default values in the table schema. Here, we define that col1 has a default value of 0. PostgreSQL will fill in missing data with that value rather than NULL.

apple=# create table mytable2 (col1 int default 0, col2 int);
CREATE TABLE

apple=# insert into mytable2 (col2) values(1);
INSERT 0 1
apple=# select * from mytable2;
 col1 | col2
------+------
    0 |    1
(1 row)

The PostgreSQL documentation contains more information about how we can control that behavior.

Usefully, we can insert into a table using the result of a SELECT:

insert INTO table table_name SELECT some_query_expression;

e.g.

apple=# insert into mytable (col1) SELECT year from ceos where year > 1996;
INSERT 0 1
apple=# select * from mytable;
 col1 | col2 | col3
------+------+------
      |    2 |    3
    1 |    2 |    
   11 |   12 |    
 1997 |      |    
(4 rows)

Lastly, we can insert multiple rows of data using a single INSERT statement. This is more efficient and quicker than multiple single row insert statements.

apple=# insert into mytable2 values(1,2),(3,4),(5,6);
INSERT 0 3
apple=# select * from mytable2;
 col1 | col2
------+------
    0 |    1
    1 |    2
    3 |    4
    5 |    6
(4 rows)

If we had a file of multiple INSERT statements, which is how many databases including PostgreSQL can export a database to file, we could run those statements using psql's \i meta-command that we covered earlier.

If we have a CSV file that includes a header and we want to import it into PostgreSQL we can use the COPY command. The query for this is

COPY table_name FROM filepath WITH CSV HEADER;

(PostgreSQL has many more options for COPY detailed here.)

That is, suppose we have a file /Users/postgresuser/ceos.csv:

name,year
Steve Jobs, 1976
Mike Markkula,1977
Mike Scott,1978
John Sculley,1983
Michael Spindler,1993
Gil Amelio,1996
Steve Jobs,1997

then we can import into a table ceos as

apple=# create table ceos (name varchar, year int);
CREATE TABLE
apple=# copy ceos from '/Users/postgresuser/ceos.csv' with csv header;
COPY 7
apple=# select * from ceos;
       name       | year
------------------+------
 Steve Jobs       | 1976
 Mike Markkula    | 1977
 Mike Scott       | 1978
 John Sculley     | 1983
 Michael Spindler | 1993
 Gil Amelio       | 1996
 Steve Jobs       | 1997
(7 rows)

Note: the table must be created before the COPY otherwise PostgreSQL complains:

apple=# copy newceos from '/Users/postgresuser/ceos.csv' with csv header;
ERROR:  relation "newceos" does not exist

TIP: if we have an existing table with the same table structure that we want for our new table we can create that new empty table as

create table your_new_table_name as select * from old_table_name limit 0;

Importantly, COPY is much faster than INSERT. If you have a large amount of data to load, say millions of rows, and performance is important you should COPY in conjunction with some of the other strategies suggested by the PostreSQL documentation.

Thursday, November 11, 2010

Constraining rows of output: LIMIT and OFFSET

Database users often want to take a quick look at just a few rows of tables to get a sense of the data within them. We can do that in PostgreSQL with the LIMIT clause. Let's examine this table:

create table data(name varchar, value int);
insert into data values('joe',1);
insert into data values('bob',2);
insert into data values('jane',3);
insert into data values('anne',4);
insert into data values('kate',5);

apple=# select * from data;
 name | value
------+-------
 joe  |     1
 bob  |     2
 jane |     3
 anne |     4
 kate |     5
(5 rows)

Now, let's select just the first two rows:

apple=# select * from data limit 2;
 name | value
------+-------
 joe  |     1
 bob  |     2
(2 rows)


LIMIT is especially important when dealing with large tables. If you do select * from [very large table] in psql, it will generate the result set then show the first screen of rows. You can then press space to see the next screen of results or 'q' to quit and get back to the psql command line. Importantly, it may take a while to process the select because there are so many rows even if you only want to see the first screen. LIMITing results explicitly however generates results much more quickly.

What if wanted a set of rows that occurred later in the sequence? For this we can use PostgreSQL's OFFSET in combination with LIMIT. Both of these are PostgreSQL specific. (They have other names in other databases: MySQL uses LIMIT, ORACLE use ROWNUM <= number and other use TOP.) So, if we wanted the 4th and 5th row we offset the first 3 so that the 4th row becomes the start of our set and we specify a limit to say that we only want 2 rows from that point:

apple=# select * from data limit 2 offset 3;
 name | value
------+-------
 anne |     4
 kate |     5
(2 rows)

The order of LIMIT and OFFSET does not matter. This gives the same result:

apple=# select * from data offset 3 limit 2;
 name | value
------+-------
 anne |     4
 kate |     5
(2 rows)

We can use these two in combination with ORDER BY. Let's order by name:

apple=# select * from data order by name;
 name | value
------+-------
 anne |     4
 bob  |     2
 jane |     3
 joe  |     1
 kate |     5
(5 rows)

Now if we offset 3 and limit 2 we get the following instead:

apple=# select * from data order by name limit 2 offset 3;
 name | value
------+-------
 joe  |     1
 kate |     5
(2 rows)

Thus, clearly ORDER BY occurs first and then the LIMIT...OFFSET is applied.

Wednesday, November 10, 2010

Basic SQL

So far we have only covered the simplest SELECT SQL queries: select * from [tablename];. That is not going to get us too far. Let's broaden our repertoire. The material in this post is standard SQL. That is, it is not specific to PostgreSQL. Thus, I would encourage you to read around the wealth of other online resources. Ultimately, however, the best way to learn SQL is to use it and practice.

Copy and paste the following SQL into psql which is a list of Apple Computer CEOs and the year that they started that position:

create database apple;
\c apple
create table ceos (name varchar, year int);
insert into ceos values('Steve Jobs',1976);
insert into ceos values('Mike Markkula',1977);
insert into ceos values('Mike Scott',1978);
insert into ceos values('John Sculley',1983);
insert into ceos values('Michael Spindler',1993);
insert into ceos values('Gil Amelio',1996);
insert into ceos values('Steve Jobs',1997);
select * from ceos;

We can now start adding some WHERE clauses to our queries to filter down results. Who was an Apple CEO before 1984? That is a simple WHERE clause:

apple=# select name from ceos where year < 1984;
     name     
---------------
 Steve Jobs
 Mike Markkula
 Mike Scott
 John Sculley
(4 rows)


Not surprisingly, < is not the only operator available. We have all the usual >, >=, <, <=, =, != as well as many more, including those for string operations and regular expressions.

In SQL, the "WHERE x BETWEEN y AND z" clause means look for rows where some field x has a value >= y and value <= z. We can use that to query who was a CEO in the 1970s:

apple=# select name from ceos where year between 1970 and 1979;
     name     
---------------
 Steve Jobs
 Mike Markkula
 Mike Scott
(3 rows)

A SELECT statement can have any number of filtering clauses. The first is preceded with a "WHERE" and later ones are preceded with an AND or an OR — for example, select name from ceos where year <= 1970 OR year >= 1980; — and can involve parentheses, subqueries and other constructs which we will cover later.

A powerful additional clause type is a GROUP BY. Let's get the name and number of times each person has been a CEO:

apple=# select name, count(*) from ceos group by name;
       name       | count
------------------+-------
 Michael Spindler |     1
 John Sculley     |     1
 Gil Amelio       |     1
 Mike Markkula    |     1
 Mike Scott       |     1
 Steve Jobs       |     2
(6 rows)

When we group by, I like to imagine PostgreSQL creating a bunch of virtual minitables as an interim step, one per grouping, here by name.  Thus, one could think of PostgreSQL thinking "ah, Steve Jobs has two rows, let's create a minitable for him with those two rows, and everyone else gets a single row table:

       name       | year
------------------+------
 Steve Jobs       | 1976
 Steve Jobs       | 1997

       name       | year
------------------+------
 Mike Markkula    | 1977


       name       | year
------------------+------
 Mike Scott       | 1978


       name       | year
------------------+------
 John Sculley     | 1983

       name       | year
------------------+------
 Michael Spindler | 1993


       name       | year
------------------+------
 Gil Amelio       | 1996

Then it will look at the part of the query before the FROM, which is called the select list. What does the user want from these grouped data? They want the name. We grouped by name so the name will be the same for each row within a minitable. OK, output the name field from each minitable, the first row will do. What else do they want? A count(*) meaning count all the rows. That is 2 for Jobs, 1 for everyone else.

When we GROUP BY we can also add a posterior HAVING clause. (GROUP BYs don't have to have a HAVING but HAVING must have a GROUP BY.) Let's look at an example. Who was a CEO more than once?

apple=# select name from ceos group by name having count(*) > 1;
    name   
------------
 Steve Jobs
(1 row)

This means group first, then check the having clause which must return true of false for each group. Then after look that, inspect the required output in the select list.

PostgreSQL will group by as in the set of tables above, then it will count the number of rows for each and only those who have more than one row will qualify. Finally, we asked for the name only so it returns the name from that one minitable returning "Steve Jobs".

This has been a taste of some of the richness that SQL offers. We will cover additional query constructs later.

Tuesday, November 9, 2010

Gotcha: case sensitivity

In this short post we'll cover an aspect that catches out many new PostgreSQL users, that of case sensitivty.

Let's create a database called SuperFoods:

postgres=# create database SuperFoods;
CREATE DATABASE

From the last post we saw that we can use psql's \l meta-command to list the databases:

postgres=# \l
                                 List of databases
    Name    |    Owner     | Encoding  | Collation | Ctype |   Access privileges  
------------+--------------+-----------+-----------+-------+-----------------------
 postgres   | postgres     | SQL_ASCII | C         | C     |
 superfoods | postgresuser | SQL_ASCII | C         | C     |
 template0  | postgres     | SQL_ASCII | C         | C     | =c/postgres          +
            |              |           |           |       | postgres=CTc/postgres
 template1  | postgres     | SQL_ASCII | C         | C     | =c/postgres          +
            |              |           |           |       | postgres=CTc/postgres
(4 rows)

Oh, it is listed as superfoods. What about if we try to connect to it:

postgres=# \c SuperFoods;
FATAL:  database "SuperFoods" does not exist
Previous connection kept

This is not what we wanted. PostgreSQL always switches everything to lower case unless we are very explicit in what we mean. We can do that with double quotes:

postgres=# create database "SuperFoods";
CREATE DATABASE
postgres=# \l
                                 List of databases
    Name    |    Owner     | Encoding  | Collation | Ctype |   Access privileges  
------------+--------------+-----------+-----------+-------+-----------------------
 SuperFoods | postgresuser | SQL_ASCII | C         | C     |
 postgres   | postgres     | SQL_ASCII | C         | C     |
 template0  | postgres     | SQL_ASCII | C         | C     | =c/postgres          +
            |              |           |           |       | postgres=CTc/postgres
 template1  | postgres     | SQL_ASCII | C         | C     | =c/postgres          +
            |              |           |           |       | postgres=CTc/postgres
(4 rows)

postgres=# \c "SuperFoods";
You are now connected to database "SuperFoods".
SuperFoods=#

The same is true of tables and other relations:

SuperFoods=# create table MyCamelCaseTable(name varchar, value int);
CREATE TABLE
SuperFoods=# \d
                List of relations
 Schema |       Name       | Type  |    Owner    
--------+------------------+-------+--------------
 public | mycamelcasetable | table | postgresuser
(1 row)

Many PostgreSQL users use lower case only in database and table names to avoid any confusion. However, there are times where a mixed case is justifiable and the most sensible approach for ones's particular situation. Just be aware of this common gotcha.

Monday, November 8, 2010

Interacting with PostgreSQL: psql

psql is an interactive terminal that comes with PostgreSQL. With it you can enter and run SQL queries and get any results in the console. You can also create stored procedures, run complete SQL scripts, do database maintenance and much much more. For most PostgreSQL users, this is the only interface they need and use with PostgreSQL.

psql allows you to connect to one database only hosted either locally or remotely. Let's suppose that that PostgreSQL is running locally (perhaps you ssh'd to a remote server to get to PostgreSQL but no matter). To run psql we need to supply a database name using the -d option.

psql -d testdb

psql is going to complain here. Who are you? It needs to authenticate you. For this we use the -U [for "U"ser] option. This will then prompt for a password, at least the first time. You can also force it prompt for a password using the -W or --password option:

localhost:~ postgresuser$ psql -d postgres -U postgresuser --password
Password for user postgresuser:
psql (9.0.1)
Type "help" for help.

postgres=#

(The "postgres=#" here is the command prompt. Yours might look different depending on how PostgreSQL was set up.)

You have to supply a database name to connect to. However, what if we don't know which databases exist? In an earlier post we mentioned the special database "postgres". (There are others but these are more for administrators). We can always connect to that postgres database, list the databases, decide which we want and then switch to connecting to one of those.

Now we have started the terminal, let's list the databases. We do that using one of psql "meta-commands". These are very useful, terse commands for doing common tasks. To list databases we simply type \l [that is ell for "l"ist]

postgres=# \l
                              List of databases
   Name    |  Owner   | Encoding  | Collation | Ctype |   Access privileges  
-----------+----------+-----------+-----------+-------+-----------------------
 postgres  | postgres | SQL_ASCII | C         | C     |
 template0 | postgres | SQL_ASCII | C         | C     | =c/postgres          +
           |          |           |           |       | postgres=CTc/postgres
 template1 | postgres | SQL_ASCII | C         | C     | =c/postgres          +
           |          |           |           |       | postgres=CTc/postgres
 testdb    | postgres | SQL_ASCII | C         | C     |
(4 rows)

Here we can see four databases: postgres, template0, template1 and testdb.

Let's create a new database: testdb2. To do that, we run a SQL query. Type the following and hit enter:

postgres=# create database testdb2;
CREATE DATABASE

Important: make sure that you add the semicolon to the end of the statement. If you don't PostgreSQL doesn't know that you are finished and sits there waiting.

Let's double check by relisting the databases:

postgres=# \l
                                List of databases
   Name    |    Owner     | Encoding  | Collation | Ctype |   Access privileges  
-----------+--------------+-----------+-----------+-------+-----------------------
 postgres  | postgres     | SQL_ASCII | C         | C     |
 template0 | postgres     | SQL_ASCII | C         | C     | =c/postgres          +
           |              |           |           |       | postgres=CTc/postgres
 template1 | postgres     | SQL_ASCII | C         | C     | =c/postgres          +
           |              |           |           |       | postgres=CTc/postgres
 testdb    | postgres     | SQL_ASCII | C         | C     |
 testdb2   | postgresuser | SQL_ASCII | C         | C     |
(5 rows)

OK, let's connect to that database. We do that using the \c [for "c"onnect] meta-command.

postgres=# \c testdb2
You are now connected to database "testdb2".

For meta-commands you do not need to semicolon.

We should have a clean, empty database. We can check that by listing the tables with the \d [for, who knows? It means relations] meta-command.

testdb2=# \d
No relations found.

Let's create a table. Type the following:

testdb2=# create table mytable (name varchar, value int);
CREATE TABLE


and let's list the tables again.

testdb2=# \d
List of relations
 Schema | Name | Type | Owner
--------+---------+-------+--------------
 public | mytable | table | postgresuser
(1 row)

How about entering some data:

testdb2=# insert into mytable values('Joe',0);
INSERT 0 1
testdb2=# insert into mytable values('Jane',1);
INSERT 0 1

and let's select the data:

testdb2=# select * from mytable;
 name | value
------+-------
  Joe | 0
 Jane | 1
(2 rows)

Now we are cruising. We have created a database and a table, inserted some data and selected it.

Typing each command by hand is all very well but what if we have a set of commands that we want to run often? We can run a complete script in psql in two ways: \i  [for "i"nteractive?] or psql -f. First, let's create a simple script. Quit psql and get back to the operating system. Do that with the \q [for "q"uit] or by CTRL+D.
Use your favorite text editor and type in the following and save it as myscript.sql:

create table presidents (firstname varchar, lastname varchar);
insert into presidents values('George','Washington');
insert into presidents values('John','Adams');
insert into presidents values('Thomas','Jefferson');

Now hop back into psql. We can then run this script with \i:

testdb2=# \i ./myscript.sql
CREATE TABLE
INSERT 0 1
INSERT 0 1
INSERT 0 1
testdb2=# \d
             List of relations
 Schema |    Name    | Type  |    Owner    
--------+------------+-------+--------------
 public | mytable    | table | postgresuser
 public | presidents | table | postgresuser
(2 rows)

testdb2=# select * from presidents;
 firstname |  lastname 
-----------+------------
 George    | Washington
 John      | Adams
 Thomas    | Jefferson
(3 rows)

OK, let's drop the table then quit:

testdb2=# drop table presidents;
DROP TABLE
testdb2=# \q

Now let's run this from the operating system command line:

localhost:~ postgresuser$ psql -d testdb2 -f myscript.sql
CREATE TABLE
INSERT 0 1
INSERT 0 1
INSERT 0 1

(The operating system stored my username and password so I didn't need to use -U.)
This did the same thing as \i from within the terminal. Again, if you hop back into psql, do a select * from presidents; you will see the data.

Let's finish with a final psql trick. \d should show you that you have mytable and presidents. \d and most meta-commands support a wildcard operator. That is, if we do \d pre* it is saying give me all the tables starting with "pre". In this case it only finds one and shows the complete schema.

testdb2=# \d pres*
         Table "public.presidents"
  Column   |       Type        | Modifiers
-----------+-------------------+-----------
 firstname | character varying |
 lastname  | character varying |

Try it with \d p*. How many tables would you expect? It turns out the answer is lots. If you run this you will see a lot of tables that start with "pg_". These are special "system catalogs" that store information about the database, such as the names of tables and fields, lists of users, current activity on the databases and so on. They are normally hidden which is why you don't see them with a plain \d.

I mentioned that one can connect remotely to PostgreSQL. This means that you have psql installed on your local machine but are interacting with PostgreSQL on some other machine. To do this simply use the -h [hostname] option when starting psql.

To finish off let's summarize what we have learned. psql is an interactive terminal. When starting it, it can take various arguments: -d for the database (required), -f for a script to run, -h for the host, -U for the username and -w or --password for the password. Once in the terminal we can use various meta-commands: \l to list the databases, \d to list the tables or relations, \i to run a script, \c to connect to a different database, \q to quit. In later posts we will cover additional meta-commands.

Sunday, November 7, 2010

Interacting with PostgreSQL: pgAdmin

You have a PostgreSQL account but where do you start? There are several ways of interacting with PostgreSQL:
  • GUIs: you can use a GUI-based tool, such as pgAdmin, to connect to the database, browse the tables and run queries.
  • command line: PostgreSQL comes with a very powerful command line utility or interactive terminal called psql. This will be covered in a later post.
  • programmatically: one can use a database driver through code such as the Java JDBC driver. This is a more advanced topic that will not be covered.
Let's start with pgAdmin. When first using PostgreSQL this is probably the easiest way to connect to a database and browse the database structure, tables, stored procedures etc. Later posts, however, will focus on the psql terminal.

If you haven't done so already install pgAdmin on your local machine. The install is very quick and easy. Done? OK, let's connect.

When your account was set up hopefully you were told its location. This will be the name of a server or a complete URL and they probably told you the name of the port to connect to. Typically this is 5432 for PostgreSQL.

Fire up pgAdmin and click on the connection button on the top left. It looks like a 2-pronged plug with a short section of cable and if you mouseover the button it says "add a connection to the server". This brings up a form. Fill in the details. The name is whatever you want to call this connection. The host is the name of the server. The port is going to be 5432 unless you were told otherwise. The maintenance DB is usually postgres; we will explain this later. Fill in your username and password. If this is your computer and you are not sharing this with anyone else then you probably want to check the "store password" checkbox. This means that you won't need to enter all this information next time. If this is a shared computer it is probably best not to check this.

Once everything is entered, click OK. If everything went well you will be connected and you will see not only the name of your server in a list on the left but you can click on the little triangle next to it to expand the view and you will see
  • Databases (#)
  • Tablespaces (#)
  • Group Roles (#)
  • Login Roles (#)
where the hashes represent some numbers. At this stage we are only interested in the Databases section.

Expand this view. There will be at least one database. The reason is that there is a special database typically called "postgres". This is the maintenance DB. Something needs to store information about the other databases. Why not use a database to do that.

If you can, create a new database called "testdb". Right click on "Databases (#)" and click on "New Database..." Type "testdb" in the name field and click OK. Being a novice there is a chance that you might not have permission to create one. Perhaps you can ask the administrator to create a testdb for you to play around with.

Assuming that worked, expand the "schemas (#)" section. There will be a schema called "public". This is the default schema. We will cover schema in a later post. For now, just think of this as where the tables are listed. You are going to see a lot of confusing stuff: FTS configurations, FTS parsers and so on. The only thing we need now is the tables section.

Let's create a table. Right click on "tables (#)", click on "New Table...", input the name "mytable" for the table in the name field. Now click on the "columns" tab. Let's create a table with 2 columns. For each column click add, input a name, select a data type and click OK. Create a table with columns "name" of type "character varying" and "value" of type integer. Finish off by adding a primary key. Click the constraints tab. Select primary key, click "add", click the columns tab, select the name column and click OK twice. You now have a table with two columns and a primary key. When we use the psql command line this will actually be easier and more explicit as you do this in a single SQL query but you now know how to do it in the GUI.

You can view the data in a table by clicking on a table name and clicking on the icon that looks like a spreadsheet. Make sure that it is the basic one that says "view the data in the selected object" on mouseover and not the one with the purple funnel that says "apply a filter and view the data in the selected object.

The table is empty now so let's insert some. Click on that icon. You should now be able to click on a cell and enter data. Clicking enter on your keyboard will create a new row.

Let's finish by running a SQL query to select data from a table. Click the SQL icon which looks like a magnifying glass. In the large box in the top left input "select * from mytable". Click the green triangle (says "Execute query" on mouseover"). You should now see the data from the table.

OK, you have now explored a GUI that connected to PostgreSQL, you created a database and a table, inserted some data and ran a query. Good job! In subsequent posts we are going to use the psql command line instead allowing us to focus on the SQL commands to do what we need.

Installation (maybe)

Typically, one starts out with these things stepping through the installation process and account creation. Installing PostgreSQL can be very easy with one click installers or can be far more involved depending on the level of complexity, customization and control that one requires. (In the examples here I installed PostgreSQL 9.0.1 from source code on OSX in about 10 minutes, much of which was waiting for the code to compile.)

I'm going to assume that you are a novice and that the IT guy, the database administrator, or a friend has installed PostgreSQL successfully and created you an account. (If enough people complain I can write a post on installation but for now let proceed straight to using it.)

You have a PostgreSQL account with a username and a password. Let's get to it.

Welcome

Welcome to PostgreSQL for beginners.

As you may know, PostgreSQL is a world-class open source database, arguably the best. While it has really great documentation, it does assume that you know what you are doing, that you are already familiar with databases, and that you just need a technical reference. However, not everyone is a database administrator.

PostgreSQL lacks books for beginners or "dummies" and it can be hard to find examples that explain the core features concepts slowly and clearly with examples. (The books that it does have are fairly advanced.) For those of you starting out, perhaps because you need to learn it for school or for a job or are just curious, then this is for you.

In this series of posts we will cover using PostgreSQL from the perspective not from an administrator but that of a novice user.