Sunday, November 21, 2010

SQL: DISTINCT, DISTINCT ON and ALL

It is not uncommon to have duplicate data in the results of a query. We can use the DISTINCT clause of SELECT to remove those and return only the unique results.

Returning to our trusty ceos table, a simple select is showing 7 rows:

apple=# select * from ceos;
       name       | year
------------------+------
 Steve Jobs       | 1976
 Mike Markkula    | 1977
 Mike Scott       | 1978
 John Sculley     | 1983
 Michael Spindler | 1993
 Gil Amelio       | 1996
 Steve Jobs       | 1997
(7 rows)

If we only want the distinct set of people who are or have been Apple's CEOs we can use distinct to give the six names:

apple=# select distinct(name) from ceos;
       name      
------------------
 Michael Spindler
 John Sculley
 Gil Amelio
 Mike Markkula
 Mike Scott
 Steve Jobs
(6 rows)

Similarly select distinct(year) from ceos; will provide a list of years of a CEO changeover.

DISTINCT ON
(expression) will return the "first" row of each set of rows where the expression is equal. The following says get the name and year from ceos for the set of rows where name is distinct.

apple=# select distinct on (name) name,year from ceos;
       name       | year
------------------+------
 Gil Amelio       | 1996
 John Sculley     | 1983
 Michael Spindler | 1993
 Mike Markkula    | 1977
 Mike Scott       | 1978
 Steve Jobs       | 1976
(6 rows)

You can see that it returned Jobs' first term as CEO.

I quoted first because it is not always obvious what that means. The results can be unpredictable. To be safe it is good practice to combine DISTINCT ON with ORDER BY to make that explicit.

apple=# select distinct on (name) name, year from ceos order by name,year;
       name       | year
------------------+------
 Gil Amelio       | 1996
 John Sculley     | 1983
 Michael Spindler | 1993
 Mike Markkula    | 1977
 Mike Scott       | 1978
 Steve Jobs       | 1976
(6 rows)

DISTINCT ON can handle multiple fields. This says get all the rows where the combination of name and year is distinct:

apple=# select distinct on (name,year) name,year from ceos;
       name       | year
------------------+------
 Gil Amelio       | 1996
 John Sculley     | 1983
 Michael Spindler | 1993
 Mike Markkula    | 1977
 Mike Scott       | 1978
 Steve Jobs       | 1976
 Steve Jobs       | 1997
(7 rows)

In this trivial example it returns all rows but one can imagine how can this can be powerful with additional fields. DISTINCT ON is not part of the SQL standard.

Incidentally, this could have been achieved with GROUP BY and ordering results

apple=# select name,year from ceos group by name,year order by name;
       name       | year
------------------+------
 Gil Amelio       | 1996
 John Sculley     | 1983
 Michael Spindler | 1993
 Mike Markkula    | 1977
 Mike Scott       | 1978
 Steve Jobs       | 1976
 Steve Jobs       | 1997
(7 rows)

Finally, ALL is the opposite to DISTINCT. As its name suggests it explicitly specifies that all rows must be returned. ALL is the default mode so that
select name from ceos is equivalent to select all(name) from ceos. As such it doesn't sound useful. However, whereas DISTINCT is designed to override the default behavior and strip out duplicate rows, ALL can be used with constructs such as UNION, EXCEPT and INTERSECT whose default behavior is to remove duplicates and override them to return ALL rows, including duplicates. We will cover those last three keywords in the next post.

8 comments:

  1. awesome again! thank you so much!!!!

    ReplyDelete
  2. "This says get all the rows where the combination of name and year is distinct"

    Combination means order doesn't matter, but with DISTINCT ON it does. I would like the order not to matter. Is there a way to do that?

    ReplyDelete
  3. Nice Article !
    This is my pleasure to read your article.
    Really this will help to people of PostgreSQL Community.

    I have also prepared one article about, Get first record per each group in PostgreSQL - Using DISTINCT ON / LATERAL
    You can also visit my article, your comments and reviews are most welcome.

    http://www.dbrnd.com/2016/08/postgresql-optimized-way-to-get-first-record-per-each-group-using-distinct-on-lateral-sub-queries/

    ReplyDelete
  4. Great article with excellent idea! I appreciate your post for Jobs like Php Jobs In Hyderabad. I like ur blog nd thank u...

    ReplyDelete