Friday, November 12, 2010

Populating a database

A database isn't very useful without data. Let's examine how one can populate a database.

So far we have used simple INSERT statements such as

insert into ceos values('Steve Jobs',1976);

We don't have to fill all the fields in an insert. We can specify some subset in parentheses before VALUES. That is, suppose we have a table:

apple=# create table mytable (col1 int, col2 int, col3 int);
CREATE TABLE


We can insert into the last two columns only with:

apple=# insert into mytable (col2, col3) values(2,3);
INSERT 0 1
apple=# select * from mytable;
 col1 | col2 | col3
------+------+------
      |    2 |    3
(1 row)

We can also use the NULL keyword to fill in empty data:

apple=# insert into mytable values(1,2,NULL);
INSERT 0 1
apple=# select * from mytable;
 col1 | col2 | col3
------+------+------
      |    2 |    3
    1 |    2 |    
(2 rows)

and PostgreSQL also allows us to omit fields and it will fill all the remaining fields with NULLs:

apple=# insert into mytable values(11,12);
INSERT 0 1
apple=# select * from mytable;
 col1 | col2 | col3
------+------+------
      |    2 |    3
    1 |    2 |    
   11 |   12 |    
(3 rows)

This behavior actually depends on whether we specified any default values in the table schema. Here, we define that col1 has a default value of 0. PostgreSQL will fill in missing data with that value rather than NULL.

apple=# create table mytable2 (col1 int default 0, col2 int);
CREATE TABLE

apple=# insert into mytable2 (col2) values(1);
INSERT 0 1
apple=# select * from mytable2;
 col1 | col2
------+------
    0 |    1
(1 row)

The PostgreSQL documentation contains more information about how we can control that behavior.

Usefully, we can insert into a table using the result of a SELECT:

insert INTO table table_name SELECT some_query_expression;

e.g.

apple=# insert into mytable (col1) SELECT year from ceos where year > 1996;
INSERT 0 1
apple=# select * from mytable;
 col1 | col2 | col3
------+------+------
      |    2 |    3
    1 |    2 |    
   11 |   12 |    
 1997 |      |    
(4 rows)

Lastly, we can insert multiple rows of data using a single INSERT statement. This is more efficient and quicker than multiple single row insert statements.

apple=# insert into mytable2 values(1,2),(3,4),(5,6);
INSERT 0 3
apple=# select * from mytable2;
 col1 | col2
------+------
    0 |    1
    1 |    2
    3 |    4
    5 |    6
(4 rows)

If we had a file of multiple INSERT statements, which is how many databases including PostgreSQL can export a database to file, we could run those statements using psql's \i meta-command that we covered earlier.

If we have a CSV file that includes a header and we want to import it into PostgreSQL we can use the COPY command. The query for this is

COPY table_name FROM filepath WITH CSV HEADER;

(PostgreSQL has many more options for COPY detailed here.)

That is, suppose we have a file /Users/postgresuser/ceos.csv:

name,year
Steve Jobs, 1976
Mike Markkula,1977
Mike Scott,1978
John Sculley,1983
Michael Spindler,1993
Gil Amelio,1996
Steve Jobs,1997

then we can import into a table ceos as

apple=# create table ceos (name varchar, year int);
CREATE TABLE
apple=# copy ceos from '/Users/postgresuser/ceos.csv' with csv header;
COPY 7
apple=# select * from ceos;
       name       | year
------------------+------
 Steve Jobs       | 1976
 Mike Markkula    | 1977
 Mike Scott       | 1978
 John Sculley     | 1983
 Michael Spindler | 1993
 Gil Amelio       | 1996
 Steve Jobs       | 1997
(7 rows)

Note: the table must be created before the COPY otherwise PostgreSQL complains:

apple=# copy newceos from '/Users/postgresuser/ceos.csv' with csv header;
ERROR:  relation "newceos" does not exist

TIP: if we have an existing table with the same table structure that we want for our new table we can create that new empty table as

create table your_new_table_name as select * from old_table_name limit 0;

Importantly, COPY is much faster than INSERT. If you have a large amount of data to load, say millions of rows, and performance is important you should COPY in conjunction with some of the other strategies suggested by the PostreSQL documentation.

No comments:

Post a Comment