Archives For January 2013

as it is with oracle postgresql relies on various statistics to produce the explain plan for a query. the catalog table which stores the statistical information is pg_statistic.

for the purpose of this post let’s play with a simple table and see what gets stored in the pg_statistic table.

drop table if exists t1;
create table t1 ( a int );

I will always use the same query for producing the output from the pg_statistics table:

select pa.attname
     , ps.stawidth
     , ps.stadistinct
     , ps.stanullfrac
     , ps.stavalues1
     , ps.stavalues2
     , ps.stavalues3
     , ps.stavalues4
     , ps.stavalues5
  from pg_class pc
     , pg_statistic ps
     , pg_attribute pa
 where pc.relname = 't1'
   and pc.relowner = ( select nspowner 
                         from pg_namespace
                        where nspname = 'public' )
   and pc.oid = ps.starelid
   and pa.attnum = ps.staattnum
   and pa.attrelid = pc.oid;

If you run the query right now it will not return any rows as pg_statistics gets populated once you run analyze and at least one row is inserted into the table ( but not before ).

--exec statistics-query from above:
 attname | stawidth | stadistinct | stanullfrac | stavalues1 | stavalues2 | stavalues3 | stavalues4 | stavalues5 
---------+----------+-------------+-------------+------------+------------+------------+------------+------------
(0 rows)

So let’s create a row and see what’s happening:

insert into t1 (a) values (1);
analyze t1;
--exec statistics-query from above:
 attname | stawidth | stadistinct | stanullfrac | stavalues1 | stavalues2 | stavalues3 | stavalues4 | stavalues5 
---------+----------+-------------+-------------+------------+------------+------------+------------+------------
 a       |        4 |          -1 |           0 |      |      |      |      | 
(1 row)

what does this tell ?

  • the column “a” is reported to have a width of 4 which corresponds to the definition of the integer data-type.
  • distinct values is reported as “-1” which seems a little but surprising for the moment as it should be 1, shouldn’t it ? in fact this is equal to one as it tells that there is statistically one occurrence for the value
  • the number of null values is zero, of course

let’s create another row and have a look at the results:

insert into t1 (a) values (2);
analyze t1;
--exec statistics-query from above:
 attname | stawidth | stadistinct | stanullfrac | stavalues1 | stavalues2 | stavalues3 | stavalues4 | stavalues5 
---------+----------+-------------+-------------+------------+------------+------------+------------+------------
 a       |        4 |          -1 |           0 | {1,2}      |      |      |      | 

the only column that changed is the stavalues1 column. as soon as there is more than one row in the table this column gets populated. distinct values is still reported as “-1” as each row in the table has a different value. this will not change as long as you insert unique values and not too much of them ( more on his later ):

insert into t1 (a) values (generate_series(3,10));
analyze t1;
--exec statistics-query from above:
 attname | stawidth | stadistinct | stanullfrac |       stavalues1       | stavalues2 | stavalues3 | stavalues4 | stavalues5 
---------+----------+-------------+-------------+------------------------+------------+------------+------------+------------
 a       |        4 |          -1 |           0 | {1,2,3,4,5,6,7,8,9,10} |      |      |      | 
(1 row)

once you insert the same values again the stadistinct column gets updated to reflect the change in the data distribution:

insert into t1 (a) values (generate_series(1,10));
analyze t1;
--exec statistics-query from above:
 attname | stawidth | stadistinct | stanullfrac |       stavalues1       | stavalues2 | stavalues3 | stavalues4 | stavalues5 
---------+----------+-------------+-------------+------------------------+------------+------------+------------+------------
 a       |        4 |        -0.5 |           0 | {1,2,3,4,5,6,7,8,9,10} |      |      |      | 

now there is a value of “-0.5” which essentially tells that there are about 2 entries for each distinct value. we can do the same again and the stadistinct value decreases again to match data ( around three occurrences for each distinct value ):

insert into t1 (a) values (generate_series(1,10));
analyze t1;
--exec statistics-query from above:
 attname | stawidth | stadistinct | stanullfrac |       stavalues1       | stavalues2 | stavalues3 | stavalues4 | stavalues5 
---------+----------+-------------+-------------+------------------------+------------+------------+------------+------------
 a       |        4 |   -0.333333 |           0 | {1,2,3,4,5,6,7,8,9,10} |      |      |      | 
(1 row)

as it looks right now postgresql stores all the values of a column another time in the pg_statistic catalog table. as this would be bad practice let’s if this changes once we insert more data:

insert into t1 (a) values (generate_series(11,10000));
analyze t1;
--exec statistics-query from above:
 attname | stawidth | stadistinct | stanullfrac |       stavalues1       |        stavalues2    | stavalues3 | stavalues4 | stavalues5 
---------+----------+-------------+-------------+------------------------+----------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------+---
---------+------------+------------
 a       |        4 |   -0.998004 |           0 | {1,2,3,4,5,6,7,8,9,10} | {11,110,210,310,410,510,610,710,810,910,1009,1109,1209,1309,1409,1509
,1609,1709,1809,1908,2008,2108,2208,2308,2408,2508,2608,2708,2807,2907,3007,3107,3207,3307,3407,3507,3607,3706,3806,3906,4006,4106,4206,4306,440
6,4506,4605,4705,4805,4905,5005,5105,5205,5305,5405,5504,5604,5704,5804,5904,6004,6104,6204,6304,6403,6503,6603,6703,6803,6903,7003,7103,7203,73
02,7402,7502,7602,7702,7802,7902,8002,8102,8201,8301,8401,8501,8601,8701,8801,8901,9001,9100,9200,9300,9400,9500,9600,9700,9800,9900,10000} |      |      | 
(1 row)

now we get the next stavalues column populated in steps of a hundred. the first stavalues columns remains unchanged in this example.

in oracle words the stavaluesN columns are similar to a histogram. it’s a statistical representation of the data distribution which is then used to generate execution plans. more to come …

just noticed another difference between postgres and oracle:

oracle:

SQL> create table t1 ( a number(10,2 ) );
Table created.
SQL> create view v1 as select a from t1;
View created.
SQL> alter table t1 modify ( a number(11,2));
Table altered.

postgresql:

[postgres] > create table t1 ( a numeric(10,2) );
CREATE TABLE
[postgres] > create view v1 as select a from t1;
CREATE VIEW
[postgres] > alter table t1 alter column a type numeric(11,2);
ERROR:  cannot alter type of a column used by a view or rule
DETAIL:  rule _RETURN on view v1 depends on column "a"

the same is true if you want to drop a table which has a view defined on it:
oracle:

SQL> drop table t1;
Table dropped.

postgres:

[postgres] > drop table t1;
ERROR:  cannot drop table t1 because other objects depend on it
DETAIL:  view v1 depends on table t1
HINT:  Use DROP ... CASCADE to drop the dependent objects too.

just noticed that you can disable table locks and thus prevent ddl on a table in oracle.

create table t1 ( a number );
alter table t1 disable table lock;

according to the documentation this should prevent all ddls on this table. let’s see if it works:

alter table t1 modify ( a number(1,0 ));
alter table t1 modify ( a number(1,0 ))
*
ERROR at line 1:
ORA-00069: cannot acquire lock -- table locks disabled for T1

works if I try to modify a column definition. can you drop the table ?

drop table t1;
drop table t1
           *
ERROR at line 1:
ORA-00069: cannot acquire lock -- table locks disabled for T1

ok, for dropping, too. truncating ?

truncate table t1;
truncate table t1
               *
ERROR at line 1:
ORA-00069: cannot acquire lock -- table locks disabled for T1

what about adding a column ?

alter table t1 add ( b number );
Table altered.

hm…not sure if this is a bug, but think so ( this is 11.2.0.3 ). and an alter statement for sure is ddl.

on a 10.2.0.4 this does not work:

select version from v$instance;
VERSION
-----------------
10.2.0.4.0
alter table t1 add ( b number );
alter table t1 add ( b number )
*
ERROR at line 1:
ORA-00069: cannot acquire lock -- table locks disabled for T1

when working with oracle there is exactly one numeric data type one may use for storing numbers ( and that’s number ). in postgresql there are ten. what happened when we migrated an oracle database to postgresql was, that a source table in oracle took much less space than the same table in postgresql. thanks to some people on the pgsql-performance mailing list one of the reasons for this was, that the wrong numeric data type was chosen for storing the integers in postgresql.

a simple test-case:

drop table datatype1;
drop table datatype2;
create table datatype1
( a numeric ( 8,0 )
, b numeric ( 8,0 )
);
insert into datatype1 ( a, b )
       values ( generate_series ( 1, 10000000 )
              , generate_series ( 1, 10000000 )
              );
create index idatatype1_1 on datatype1 ( a );
create index idatatype1_2 on datatype1 ( b );
create table datatype2
( a int
, b int
);
insert into datatype2 ( a, b )
       values ( generate_series ( 1, 10000000 )
              , generate_series ( 1, 10000000 )
             );
create index idatatype2_1 on datatype2 ( a );
create index idatatype2_2 on datatype2 ( b );
analyze verbose datatype1;
analyze verbose datatype2;
select pg_size_pretty ( pg_relation_size ( 'datatype1' ) );
 pg_size_pretty 
----------------
 422 MB
(1 row)
select pg_size_pretty ( pg_relation_size ( 'datatype2' ) );
 pg_size_pretty 
----------------
 346 MB
(1 row)

for these two little tables the difference is about 76mb. depending on the statements this is 76mb more that needs to be scanned and this can have great impacts on performance. surprisingly, at least for me, this is not true for the indexes:

select pg_size_pretty ( pg_relation_size ( 'idatatype1_1' ) );
 pg_size_pretty 
----------------
 214 MB
(1 row)
select pg_size_pretty ( pg_relation_size ( 'idatatype2_1' ) );
 pg_size_pretty 
----------------
 214 MB
(1 row)
select pg_size_pretty ( pg_relation_size ( 'idatatype2_1' ) );
 pg_size_pretty 
----------------
 214 MB
(1 row)
select pg_size_pretty ( pg_relation_size ( 'idatatype2_2' ) );
 pg_size_pretty 
----------------
 214 MB
(1 row)

so, it’s worth to keep an eye on which data types to use for numbers …