statistics in postgresql (1)

January 27, 2013 — Leave a comment

as it is with oracle postgresql relies on various statistics to produce the explain plan for a query. the catalog table which stores the statistical information is pg_statistic.

for the purpose of this post let’s play with a simple table and see what gets stored in the pg_statistic table.

drop table if exists t1;
create table t1 ( a int );

I will always use the same query for producing the output from the pg_statistics table:

select pa.attname
     , ps.stawidth
     , ps.stadistinct
     , ps.stanullfrac
     , ps.stavalues1
     , ps.stavalues2
     , ps.stavalues3
     , ps.stavalues4
     , ps.stavalues5
  from pg_class pc
     , pg_statistic ps
     , pg_attribute pa
 where pc.relname = 't1'
   and pc.relowner = ( select nspowner 
                         from pg_namespace
                        where nspname = 'public' )
   and pc.oid = ps.starelid
   and pa.attnum = ps.staattnum
   and pa.attrelid = pc.oid;

If you run the query right now it will not return any rows as pg_statistics gets populated once you run analyze and at least one row is inserted into the table ( but not before ).

--exec statistics-query from above:
 attname | stawidth | stadistinct | stanullfrac | stavalues1 | stavalues2 | stavalues3 | stavalues4 | stavalues5 
---------+----------+-------------+-------------+------------+------------+------------+------------+------------
(0 rows)

So let’s create a row and see what’s happening:

insert into t1 (a) values (1);
analyze t1;
--exec statistics-query from above:
 attname | stawidth | stadistinct | stanullfrac | stavalues1 | stavalues2 | stavalues3 | stavalues4 | stavalues5 
---------+----------+-------------+-------------+------------+------------+------------+------------+------------
 a       |        4 |          -1 |           0 |      |      |      |      | 
(1 row)

what does this tell ?

  • the column “a” is reported to have a width of 4 which corresponds to the definition of the integer data-type.
  • distinct values is reported as “-1” which seems a little but surprising for the moment as it should be 1, shouldn’t it ? in fact this is equal to one as it tells that there is statistically one occurrence for the value
  • the number of null values is zero, of course

let’s create another row and have a look at the results:

insert into t1 (a) values (2);
analyze t1;
--exec statistics-query from above:
 attname | stawidth | stadistinct | stanullfrac | stavalues1 | stavalues2 | stavalues3 | stavalues4 | stavalues5 
---------+----------+-------------+-------------+------------+------------+------------+------------+------------
 a       |        4 |          -1 |           0 | {1,2}      |      |      |      | 

the only column that changed is the stavalues1 column. as soon as there is more than one row in the table this column gets populated. distinct values is still reported as “-1” as each row in the table has a different value. this will not change as long as you insert unique values and not too much of them ( more on his later ):

insert into t1 (a) values (generate_series(3,10));
analyze t1;
--exec statistics-query from above:
 attname | stawidth | stadistinct | stanullfrac |       stavalues1       | stavalues2 | stavalues3 | stavalues4 | stavalues5 
---------+----------+-------------+-------------+------------------------+------------+------------+------------+------------
 a       |        4 |          -1 |           0 | {1,2,3,4,5,6,7,8,9,10} |      |      |      | 
(1 row)

once you insert the same values again the stadistinct column gets updated to reflect the change in the data distribution:

insert into t1 (a) values (generate_series(1,10));
analyze t1;
--exec statistics-query from above:
 attname | stawidth | stadistinct | stanullfrac |       stavalues1       | stavalues2 | stavalues3 | stavalues4 | stavalues5 
---------+----------+-------------+-------------+------------------------+------------+------------+------------+------------
 a       |        4 |        -0.5 |           0 | {1,2,3,4,5,6,7,8,9,10} |      |      |      | 

now there is a value of “-0.5” which essentially tells that there are about 2 entries for each distinct value. we can do the same again and the stadistinct value decreases again to match data ( around three occurrences for each distinct value ):

insert into t1 (a) values (generate_series(1,10));
analyze t1;
--exec statistics-query from above:
 attname | stawidth | stadistinct | stanullfrac |       stavalues1       | stavalues2 | stavalues3 | stavalues4 | stavalues5 
---------+----------+-------------+-------------+------------------------+------------+------------+------------+------------
 a       |        4 |   -0.333333 |           0 | {1,2,3,4,5,6,7,8,9,10} |      |      |      | 
(1 row)

as it looks right now postgresql stores all the values of a column another time in the pg_statistic catalog table. as this would be bad practice let’s if this changes once we insert more data:

insert into t1 (a) values (generate_series(11,10000));
analyze t1;
--exec statistics-query from above:
 attname | stawidth | stadistinct | stanullfrac |       stavalues1       |        stavalues2    | stavalues3 | stavalues4 | stavalues5 
---------+----------+-------------+-------------+------------------------+----------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------+---
---------+------------+------------
 a       |        4 |   -0.998004 |           0 | {1,2,3,4,5,6,7,8,9,10} | {11,110,210,310,410,510,610,710,810,910,1009,1109,1209,1309,1409,1509
,1609,1709,1809,1908,2008,2108,2208,2308,2408,2508,2608,2708,2807,2907,3007,3107,3207,3307,3407,3507,3607,3706,3806,3906,4006,4106,4206,4306,440
6,4506,4605,4705,4805,4905,5005,5105,5205,5305,5405,5504,5604,5704,5804,5904,6004,6104,6204,6304,6403,6503,6603,6703,6803,6903,7003,7103,7203,73
02,7402,7502,7602,7702,7802,7902,8002,8102,8201,8301,8401,8501,8601,8701,8801,8901,9001,9100,9200,9300,9400,9500,9600,9700,9800,9900,10000} |      |      | 
(1 row)

now we get the next stavalues column populated in steps of a hundred. the first stavalues columns remains unchanged in this example.

in oracle words the stavaluesN columns are similar to a histogram. it’s a statistical representation of the data distribution which is then used to generate execution plans. more to come …

No Comments

Be the first to start the conversation!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.