as it is with oracle postgresql relies on various statistics to produce the explain plan for a query. the catalog table which stores the statistical information is pg_statistic.
for the purpose of this post let’s play with a simple table and see what gets stored in the pg_statistic table.
drop table if exists t1; create table t1 ( a int );
I will always use the same query for producing the output from the pg_statistics table:
select pa.attname , ps.stawidth , ps.stadistinct , ps.stanullfrac , ps.stavalues1 , ps.stavalues2 , ps.stavalues3 , ps.stavalues4 , ps.stavalues5 from pg_class pc , pg_statistic ps , pg_attribute pa where pc.relname = 't1' and pc.relowner = ( select nspowner from pg_namespace where nspname = 'public' ) and pc.oid = ps.starelid and pa.attnum = ps.staattnum and pa.attrelid = pc.oid;
If you run the query right now it will not return any rows as pg_statistics gets populated once you run analyze and at least one row is inserted into the table ( but not before ).
--exec statistics-query from above: attname | stawidth | stadistinct | stanullfrac | stavalues1 | stavalues2 | stavalues3 | stavalues4 | stavalues5 ---------+----------+-------------+-------------+------------+------------+------------+------------+------------ (0 rows)
So let’s create a row and see what’s happening:
insert into t1 (a) values (1); analyze t1; --exec statistics-query from above: attname | stawidth | stadistinct | stanullfrac | stavalues1 | stavalues2 | stavalues3 | stavalues4 | stavalues5 ---------+----------+-------------+-------------+------------+------------+------------+------------+------------ a | 4 | -1 | 0 | | | | | (1 row)
what does this tell ?
- the column “a” is reported to have a width of 4 which corresponds to the definition of the integer data-type.
- distinct values is reported as “-1” which seems a little but surprising for the moment as it should be 1, shouldn’t it ? in fact this is equal to one as it tells that there is statistically one occurrence for the value
- the number of null values is zero, of course
let’s create another row and have a look at the results:
insert into t1 (a) values (2); analyze t1; --exec statistics-query from above: attname | stawidth | stadistinct | stanullfrac | stavalues1 | stavalues2 | stavalues3 | stavalues4 | stavalues5 ---------+----------+-------------+-------------+------------+------------+------------+------------+------------ a | 4 | -1 | 0 | {1,2} | | | |
the only column that changed is the stavalues1 column. as soon as there is more than one row in the table this column gets populated. distinct values is still reported as “-1” as each row in the table has a different value. this will not change as long as you insert unique values and not too much of them ( more on his later ):
insert into t1 (a) values (generate_series(3,10)); analyze t1; --exec statistics-query from above: attname | stawidth | stadistinct | stanullfrac | stavalues1 | stavalues2 | stavalues3 | stavalues4 | stavalues5 ---------+----------+-------------+-------------+------------------------+------------+------------+------------+------------ a | 4 | -1 | 0 | {1,2,3,4,5,6,7,8,9,10} | | | | (1 row)
once you insert the same values again the stadistinct column gets updated to reflect the change in the data distribution:
insert into t1 (a) values (generate_series(1,10)); analyze t1; --exec statistics-query from above: attname | stawidth | stadistinct | stanullfrac | stavalues1 | stavalues2 | stavalues3 | stavalues4 | stavalues5 ---------+----------+-------------+-------------+------------------------+------------+------------+------------+------------ a | 4 | -0.5 | 0 | {1,2,3,4,5,6,7,8,9,10} | | | |
now there is a value of “-0.5” which essentially tells that there are about 2 entries for each distinct value. we can do the same again and the stadistinct value decreases again to match data ( around three occurrences for each distinct value ):
insert into t1 (a) values (generate_series(1,10)); analyze t1; --exec statistics-query from above: attname | stawidth | stadistinct | stanullfrac | stavalues1 | stavalues2 | stavalues3 | stavalues4 | stavalues5 ---------+----------+-------------+-------------+------------------------+------------+------------+------------+------------ a | 4 | -0.333333 | 0 | {1,2,3,4,5,6,7,8,9,10} | | | | (1 row)
as it looks right now postgresql stores all the values of a column another time in the pg_statistic catalog table. as this would be bad practice let’s if this changes once we insert more data:
insert into t1 (a) values (generate_series(11,10000)); analyze t1; --exec statistics-query from above: attname | stawidth | stadistinct | stanullfrac | stavalues1 | stavalues2 | stavalues3 | stavalues4 | stavalues5 ---------+----------+-------------+-------------+------------------------+---------------------------------------------------------------------- ------------------------------------------------------------------------------------------------------------------------------------------------ ------------------------------------------------------------------------------------------------------------------------------------------------ --------------------------------------------------------------------------------------------------------------------------------------------+--- ---------+------------+------------ a | 4 | -0.998004 | 0 | {1,2,3,4,5,6,7,8,9,10} | {11,110,210,310,410,510,610,710,810,910,1009,1109,1209,1309,1409,1509 ,1609,1709,1809,1908,2008,2108,2208,2308,2408,2508,2608,2708,2807,2907,3007,3107,3207,3307,3407,3507,3607,3706,3806,3906,4006,4106,4206,4306,440 6,4506,4605,4705,4805,4905,5005,5105,5205,5305,5405,5504,5604,5704,5804,5904,6004,6104,6204,6304,6403,6503,6603,6703,6803,6903,7003,7103,7203,73 02,7402,7502,7602,7702,7802,7902,8002,8102,8201,8301,8401,8501,8601,8701,8801,8901,9001,9100,9200,9300,9400,9500,9600,9700,9800,9900,10000} | | | (1 row)
now we get the next stavalues column populated in steps of a hundred. the first stavalues columns remains unchanged in this example.
in oracle words the stavaluesN columns are similar to a histogram. it’s a statistical representation of the data distribution which is then used to generate execution plans. more to come …