Thursday, November 18, 2010

Create table and constraints

Today we shall cover creating tables. So far we have only created the simplest tables, one or two fields, no constraints, no foreign keys, no primary keys etc. CREATE TABLE is one of the core SQL command and is one of the more complex commands with a host of options. This post is quick introduction to highlight the basic ideas, common patterns and where to find mode detailed explanation in the PostgreSQL documentation. Importantly, as all of this material is essentially standard SQL so the reader can easily find a whole host of additional material covering all levels of detail on the web.

When creating a table we want to define the fields (columns), their data type (int, text, date etc.) and any constraints such as whether that field can be empty (NULL), unique, or whether it must contain a value found in another table (foreign key) or that the value must be in a certain range (say, check that price > 0.01).

In its simplest form the CREATE TABLE command is a comma-delimited list of fieldnames and their types:

CREATE TABLE tablename (
  fieldname1 datatype1,
  fieldname2 datatype2,
  ...
  fieldname3 datatype3
);

such as the Apple CEOs table of an earlier post:

create table ceos (name varchar, year int);

PostgreSQL supports a large number of datatypes. Here are some of the more common ones:
varchar(n) string of at most n characters. If less than n are provided, it only stores those chars.
char(n)string of n characters. If less than n are provided it will be space padded.
text variable length string.
int integer, 4 bytes.
bigint 8 bytes. This is returned by count as in "select count(*) from..."
decimal or numeric a flexible real value of specifiable precision. For instance 12.3456 is a numeric(6,4).
date date with year, month and day.
timestamp, timestamptz date and time without and with a timezone.
booleantrue or false.

This is, however, just a sample. There are types for money, shapes, arrays and many more.

Decide upon the types for each of your fields but do so with a consideration to storage size. Don't use more than you need. For instance, if you know that a column is two character data, say for a state abbreviation('AZ', 'CA' etc), a varchar(2) is a better choice than text. On the other hand, make sure that you will have enough for what you might need in the future. For instance, numeric(4,2) might suffice for most daily temperature readings in San Francisco but very very rarely it can exceed 100 F. If you try to store the record 103 F in a numeric(4,2) PostgreSQL will throw an exception ("ERROR:  numeric field overflow. DETAIL:  A field with precision 4, scale 2 must round to an absolute value less than 10^2"). With numeric(5,2) or decimal you will be safe.

Having decided upon field types for a table we then need to consider the PRIMARY KEY. This is a special field or combination of fields that must be unique across all rows. This key becomes a unique identifier for each row. Typically these are integer IDs (product ID, order ID etc) which can be supplied explicitly or can be provided by PostgreSQL automatically by declaring a field SERIAL. Let's examine this further.

To remind you, here is our table of Apple's CEOs and the year that they started that role:

apple=# \d ceos
          Table "public.ceos"
 Column |       Type        | Modifiers
--------+-------------------+-----------
 name   | character varying |
 year   | integer           |

apple=# select * from ceos;
       name       | year
------------------+------
 Steve Jobs       | 1976
 Mike Markkula    | 1977
 Mike Scott       | 1978
 John Sculley     | 1983
 Michael Spindler | 1993
 Gil Amelio       | 1996
 Steve Jobs       | 1997
(7 rows)


A table doesn't have to have a primary key. We didn't declare a primary key in the ceos table but then I knew that I was going to do simple selects only and that I would not be adding any new data to the table. Often, however, we don't know what additional data might be added to a table and so we have to design the table to handle the edge cases of that future data. Suppose Steve Jobs steps down and his post is taken over by Mike Scott. This is a relatively common name so what if it is not the same Mike Scott that ran the company in the 1970s. Now we have a problem. If we use the current table schema we would not be able to distinguish among the two Mikes. We need to assign each of the two Mikes a unique code. We could normalize the table and create one table to define the different people (people table) and a different table (ceos table) to define the association between the person and the year that they became CEO.

create table people(
   id int PRIMARY KEY,
   name varchar NOT NULL
);
insert into people(0,'Steve Jobs');
insert into people(1,'Mike Markkula');
insert into people(2,'Mike Scott');
insert into people(3,'John Sculley');
insert into people(4,'Michael Spindler');
insert into people(5,'Gil Amelio');
insert into people(6,'Mike Scott');

(This is just a simple illustrative example. In a real database a person would likely have separate last name and first name fields as well as other attribute fields.)
With this we have solved two problems. First, we have an id for each of Mikes: 2 and 6. We also have removed some redundancy in the database by only defining Steve Jobs once. Recall that he appears twice in the ceos table. It is better design to have an entity defined once. Imagine that Apple had a female CEO. She got married and changed her name. We would have to update all the rows where her name appears. In a normalized schema as above we only have to update a single row. This doesn't make a great difference in our trivial ceos example but in a large database where there may be lots of tables associated with a person. Imagine a payroll database where a person is associated with address, job, payrates and bonuses and so on, all of which might be in different tables this can be a huge issue.

In the people schema, we defined a id column of type integer. We declared it to be the PRIMARY KEY. This means that it must be unique and it must not be NULL. Next we declared a name column and said that it cannot be NULL. If we are defining people based on a single name field it doesn't make sense to allow a NULL name, otherwise what is the point of the row?

In the table above we supplied the ids. If we don't care what id is used for a person so long as each person gets a unique name we can ask PostgreSQL to make up IDs for us.

create table people(
  id SERIAL PRIMARY KEY,
  name varchar NOT NULL
);


By adding the keyword SERIAL it says create an ID starting from 1 and incrementing by 1 when we insert data. We can then insert data as

insert into people (name) values('Steve Jobs');
insert into people (name) values('Mike Markkula');
insert into people (name) values('Mike Scott');
insert into people (name) values('John Sculley');
insert into people (name) values('Michael Spindler');
insert into people (name) values('Gil Amelio');
insert into people (name) values('Mike Scott');

apple=# select * from people;
 id |       name      
----+------------------
  1 | Steve Jobs
  2 | Mike Markkula
  3 | Mike Scott
  4 | John Sculley
  5 | Michael Spindler
  6 | Gil Amelio
  7 | Mike Scott
(7 rows)

With our revised people table complete let us rework the ceo table.

create table ceo(
   id int,
   year int NOT NULL,
   PRIMARY KEY(id,year)
);
insert into ceos(1,1976);
insert into ceos(2,1977);
insert into ceos(3,1978);
insert into ceos(4,1983);
insert into ceos(5,1993);
insert into ceos(6,1996);
insert into ceos(1,1997);
ALTER TABLE ceos ADD CONSTRAINT ceos_foreign_id FOREIGN KEY (id) REFERENCES people (id);

We have an id column of type int. We intend that to be the same id as in the people table. To achieve that we add a foreign key constraint with the ALTER TABLE statement. This says add a constraint to the ceo table such that an id value in the ceo's id column must be a value that is present in the id field of the people table. (We could have written "id int references people," which would add a foreign key constraint to the primary key of the people table. It is a matter of tast but the alter table command is more explicit.)

We want the year to be NOT NULL. We might also want to verify make sure that any year supplied is sensible. After all, -5 is a NOT NULL int but doesn't make sense for a year. Moreover, this is not just any year, it must be a year since Apple was founded so it must be 1976 or later. Thus, we could add a check to that field

year int NOT NULL check(year >= 1976),

If we try to enter a year less than 1976 PostgreSQL will complain with

ERROR:  new row for relation "ceo" violates check constraint "ceo_year_check"

Finally, we have a primary key of both the id and the year. This will allow multiple people in the same year but not the same person multiple times in a year. When we execute the create table query PostgreSQL outputs

NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "ceo_pkey" for table "ceo"

Database design is a tricky exercise and certainly a little confusing for beginners. There are many tradeoffs to be made in terms of storage size, speed and efficiency of loading and querying, maintainability, current versus known or unknown future data and so on. However, hopefully this post has conveyed the sense of some of things to consider and the rich levels of types and constraints that PostreSQL provides. PostgreSQL CREATE TABLE documentation details many more options and details but the basic detailes covered here will actually get you fairly far andconver a lot of needs.

3 comments:

  1. Small typo ?

    insert into people(0,'Steve Jobs');

    =>

    insert into people values(0,'Steve Jobs');

    PS: Great set of tutorials.

    ReplyDelete
  2. Also, the CREATE TABLE creates the table "ceo" but the insert commands insert into "ceos".

    And yes, the tutorials are terrific. Thanks so much!

    ReplyDelete
  3. This is amazing!!!! thank you!!!!!

    for anyone who wanted to know, at the part where it says

    "year int NOT NULL check(year >= 1976)"

    the way to get that constraint into your table is to say:

    ALTER TABLE ceos ADD CONSTRAINT year check(year>= 1976);

    I tried:

    ALTER TABLE ceos ADD CONSTRAINT year int NOT NULL check(year>= 1976);

    and that returned an error, I think because at this point, year was already constrained to be an integer and to be NOT NULL.

    ReplyDelete