Friday, November 12, 2010

Populating a database

A database isn't very useful without data. Let's examine how one can populate a database.

So far we have used simple INSERT statements such as

insert into ceos values('Steve Jobs',1976);

We don't have to fill all the fields in an insert. We can specify some subset in parentheses before VALUES. That is, suppose we have a table:

apple=# create table mytable (col1 int, col2 int, col3 int);
CREATE TABLE


We can insert into the last two columns only with:

apple=# insert into mytable (col2, col3) values(2,3);
INSERT 0 1
apple=# select * from mytable;
 col1 | col2 | col3
------+------+------
      |    2 |    3
(1 row)

We can also use the NULL keyword to fill in empty data:

apple=# insert into mytable values(1,2,NULL);
INSERT 0 1
apple=# select * from mytable;
 col1 | col2 | col3
------+------+------
      |    2 |    3
    1 |    2 |    
(2 rows)

and PostgreSQL also allows us to omit fields and it will fill all the remaining fields with NULLs:

apple=# insert into mytable values(11,12);
INSERT 0 1
apple=# select * from mytable;
 col1 | col2 | col3
------+------+------
      |    2 |    3
    1 |    2 |    
   11 |   12 |    
(3 rows)

This behavior actually depends on whether we specified any default values in the table schema. Here, we define that col1 has a default value of 0. PostgreSQL will fill in missing data with that value rather than NULL.

apple=# create table mytable2 (col1 int default 0, col2 int);
CREATE TABLE

apple=# insert into mytable2 (col2) values(1);
INSERT 0 1
apple=# select * from mytable2;
 col1 | col2
------+------
    0 |    1
(1 row)

The PostgreSQL documentation contains more information about how we can control that behavior.

Usefully, we can insert into a table using the result of a SELECT:

insert INTO table table_name SELECT some_query_expression;

e.g.

apple=# insert into mytable (col1) SELECT year from ceos where year > 1996;
INSERT 0 1
apple=# select * from mytable;
 col1 | col2 | col3
------+------+------
      |    2 |    3
    1 |    2 |    
   11 |   12 |    
 1997 |      |    
(4 rows)

Lastly, we can insert multiple rows of data using a single INSERT statement. This is more efficient and quicker than multiple single row insert statements.

apple=# insert into mytable2 values(1,2),(3,4),(5,6);
INSERT 0 3
apple=# select * from mytable2;
 col1 | col2
------+------
    0 |    1
    1 |    2
    3 |    4
    5 |    6
(4 rows)

If we had a file of multiple INSERT statements, which is how many databases including PostgreSQL can export a database to file, we could run those statements using psql's \i meta-command that we covered earlier.

If we have a CSV file that includes a header and we want to import it into PostgreSQL we can use the COPY command. The query for this is

COPY table_name FROM filepath WITH CSV HEADER;

(PostgreSQL has many more options for COPY detailed here.)

That is, suppose we have a file /Users/postgresuser/ceos.csv:

name,year
Steve Jobs, 1976
Mike Markkula,1977
Mike Scott,1978
John Sculley,1983
Michael Spindler,1993
Gil Amelio,1996
Steve Jobs,1997

then we can import into a table ceos as

apple=# create table ceos (name varchar, year int);
CREATE TABLE
apple=# copy ceos from '/Users/postgresuser/ceos.csv' with csv header;
COPY 7
apple=# select * from ceos;
       name       | year
------------------+------
 Steve Jobs       | 1976
 Mike Markkula    | 1977
 Mike Scott       | 1978
 John Sculley     | 1983
 Michael Spindler | 1993
 Gil Amelio       | 1996
 Steve Jobs       | 1997
(7 rows)

Note: the table must be created before the COPY otherwise PostgreSQL complains:

apple=# copy newceos from '/Users/postgresuser/ceos.csv' with csv header;
ERROR:  relation "newceos" does not exist

TIP: if we have an existing table with the same table structure that we want for our new table we can create that new empty table as

create table your_new_table_name as select * from old_table_name limit 0;

Importantly, COPY is much faster than INSERT. If you have a large amount of data to load, say millions of rows, and performance is important you should COPY in conjunction with some of the other strategies suggested by the PostreSQL documentation.

6 comments: