Thursday, December 2, 2010

schema

In this post we'll examine PostgreSQL's schema support.

We encountered the default schema called "public" in an earlier post when exploring pgAdmin. Colloquially, when someone talks about a database's schema they typically mean the structure of the tables (that this table has these fields with these types). Here, a schema is a higher level of organization, a container or namespace within a database. A schema contains a set of tables, views, stored procedures and triggers and so on, all the usual components of a database. However, one can have multiple schema within a PostgreSQL database. To clarify this, let's start with a concrete example and consider the pros and cons of different ways of organizing the data.

Suppose we have a supermarket company, SuperFoods Inc., who have two stores, say Sausalito, CA and Petaluma, CA. (Clearly they have recovered from the bad economy and layoffs of this earlier post.) Each of the two stores has a set of information: employees and contact details, details about the physical store itself, replenishment orders and order history etc.

Imagine the data are stored in two separate databases. The CEO of SuperFoods asks one of his minions to produce a report on the sales history of the two stores. The problem here is that databases are independent silos. One cannot write a SQL query that will collate the data from the two stores. psql can only connect to one database at a time. We can use external tools, such as a JDBC driver in Java that will make a connection to each of the two databases and collect the data, but we cannot query the two databases simultaneously from the psql command line.

Imagine now that we take the two databases and put them inside a single SuperFoods database. We create a container, a schema, for each of the two original databases. Each schema has a unique name, "sausalito" and "petaluma", and this causes each of the tables in the schema to have a qualified name that includes the schema name.

That is, we have

sausalito.employees
sausalito.sales

petaluma.employees
petaluma.sales

(Technically, it would include the database name too and be "SuperFoods".sausalito.employees etc. under the hood but the user doesn't see this or need to see it.)

Importantly, when we connect to the SuperFoods database we can now access the sales report tables from both stores. We cannot do

select * from sales;

because PostgreSQL is going to complain. There is an abmiguity here. There are multiple sales tables. Which do you mean? Thus, one has to qualify it with the schema name:

select * from sausalito.sales;


to specify the Sausalito sales and

select * from petaluma.sales;


to access the Petaluma sales.

We can now combine those queries with operators such as UNION and UNION ALL (covered in this post) to get something closer to what the CEO wants:

select 'sausalito', sum(sales) from sausalito.sales where year=2010
UNION ALL
select 'petaluma', sum(sales) from petaluma.sales where year=2010;

Great. Goal achieved.

OK, now consider this structure from the perspective of a human resources person at the Sausalito store. She needs to update the employees table for her store and she isn't interested in the Petaluma tables. Writing 'sausalito.' before each table in the queries is a bit of a nuisance. PostgreSQL helps us out with the notion of the search path. The search path says where to look, similar to classpaths in most programming languages.

Let's connect to a database, set the search path and create a new table.

psql -d SuperFoods

testdb=#set search_path TO sausalito;
SET

testdb=# \d
              List of relations
  Schema   |   Name    | Type  |    Owner    
-----------+-----------+-------+--------------
 sausalito | employees | table | postgresuser
 sausalito | sales     | table | postgresuser
(2 rows)

Notice we see sales and employees not sausalito.sales. When we set the search path it means, work on the database from the perspective of the sausalito schema. Now when one does

select * from sales;

it does not complain. It knows that you mean sausalito.sales. There is no ambiguity now because you cleared that up earlier when you set the search path.

You can even select the data from the other schema using the qualified name

select * from petaluma.sales;

from within the sausalito schema.

Creating tables once one has set the search_path has the same implicit name qualification:

testdb=# create table promotions(name varchar, value int);
CREATE TABLE
testdb=# \d
               List of relations
  Schema   |    Name    | Type  |    Owner    
-----------+------------+-------+--------------
 sausalito | employees  | table | postgresuser
 sausalito | promotions | table | postgresuser
 sausalito | sales      | table | postgresuser
(3 rows)

How do we create a schema? That is a synch:

testdb=# create schema sausalito;
CREATE SCHEMA
testdb=# create schema petaluma;
CREATE SCHEMA

Further, you can dump data from a single schema from the operating system command line

localhost:~ postgresuser$ pg_dump --schema sausalito testdb > sausalito.sql

We can see that using schema gives us all the tools that exist for single databases but we can collect related tables together, in SuperFoods by store, and yet compare data mong those different collections. Now we understand more about schema we can refine our notion about them. They are a namespace or rather a namespace within a database. PostgreSQL only sees one database here: SuperFoods.

Schema can be very powerful and useful way of partitioning data. However, they are typically used for more complex sets of tables, more complex than those typically needed for beginner exercises. For simpler cases, an additional field that defines the subset of data might suffice. Imagine that we include an additional 'store' field in the sales table. A single table could then be queried for Sausalito sales, Petaluma sales or both.

Also, imagine that SuperFoods assigns a regional manager to the two stores. That manager is effectively an employee at both stores. We want to avoid replication of data in databases and so would not want to include his personal details in both schema. In this case a single, more unified database structure might be more appropriate. Given those, schema are a valuable addition to PostgreSQL and worth considering when designing one's database.

16 comments:

  1. I found it useful. Thanks

    ReplyDelete
  2. Seriously 0% useful

    ReplyDelete
  3. Thanks, the one key bit of info for me was that I could combine schema data by using UNION. I've had trouble finding this anywhere else. Job done, I can go to bed happy - cheers!

    ReplyDelete
  4. Useful for beginners like me :)

    ReplyDelete
  5. Very cool -- that was a very clear and useful explanation! Thank you!

    ReplyDelete
  6. Useful for a beginner!

    ReplyDelete
  7. Useful for a beginner okay.

    I came across this page while trying to find the answer to a related problem - you mention using Java & JDBC to merge the data from two databases but if you use two schema instead it seems the avenue is no longer open to you because there is no way to specify the schema in the jdbc connection url!

    Which is a real surprise to me and I've been using java & postgresql for many years. I just never bothered using separate schema until recently.

    ReplyDelete
  8. With big tables the separation by schemas can be helpful for partitioning the data. Putting all data into one table can make it hard for the database to maintain the huge index structures. Insert operations might slow down due to index or primary key reorganization - same is true for updates. The more the database grows the more the DBMS might busy to maintain the structures. With a schema you will get many small tables that are easier to maintain.

    ReplyDelete
  9. for begineers this info is usefull

    ReplyDelete
  10. appreciate this alot, thanks

    ReplyDelete
  11. Very useful... thank you!!!

    ReplyDelete