why not take a look at postgresql? (3): maintenance tasks ( analyze and vacuum )

July 2, 2012

as with oracle there are some maintenance tasks one should perform on a postgres database. while the first and second posts introduced the essentials for setting up the server and creating an initial database, this post will look at the maintenance of the postgres database.

one of the most important maintenance tasks in postgres is vacuuming. as described in the second post the autovacuum launcher process is responsible for starting the vacuum worker processes for the databases. another important task is collecting statistics.

these two tasks are done by the vacuum and analyze commands, where:

  • analyze: is used to collect statistics about a database ( it collects statistics of the tables in the database and stores them in the pg_statistic catalog ).
  • vacuum: is used to reclaim storage occupied by dead tuples ( for example produced by deletes ).

let’s take a look at the two commands.

analyze

in oracle you would use the dbms_stats package to gather statistics. in postgres it is the analyze command. a description of all the options, as usual, can be found in the documentation.

without any arguments the analyze command will collect the statistics for every table in the database:

sysdba@[local]:5432/postgres*# analyze;
ANALYZE
sysdba@[local]:5432/postgres*#

a nice argument, especially if you are new to postgres, is the verbose argument. when executed with this argument analyze reports the tables which are analyzed. this gives you some hints on the internal tables ( catalog ):

sysdba@[local]:5432/postgres*# analyze verbose;
INFO: analyzing "pg_catalog.pg_type"
INFO: "pg_type": scanned 7 of 7 pages, containing 309 live rows and 2 dead rows; 309 rows in sample, 309 estimated total rows
INFO: analyzing "pg_catalog.pg_attribute"
INFO: "pg_attribute": scanned 40 of 40 pages, containing 2155 live rows and 30 dead rows; 2155 rows in sample, 2155 estimated total rows
INFO: analyzing "pg_catalog.pg_authid"
INFO: "pg_authid": scanned 1 of 1 pages, containing 1 live rows and 0 dead rows; 1 rows in sample, 1 estimated total rows
INFO: analyzing "pg_catalog.pg_proc"

additionally analyze reports the various statistics collected for each table. this is a feature I’d like to see in oracle, too. especially when analyze is called automatically this information might be useful if you quickly need to compare statistics. it is not a big deal to save this information to a logfile and keep it for every run of analyze.
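capturing this output is straightforward because analyze verbose sends its INFO lines to the client as notices, which psql prints on stderr. a simple sketch, assuming a local server and the postgres database ( filename is just an example ):

```shell
# append today's analyze verbose output to a dated logfile
# ( the INFO lines arrive on stderr, hence the 2>&1 redirection )
psql -d postgres -c "analyze verbose;" >> analyze_$(date +%Y%m%d).log 2>&1
```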

there are two more parameters analyze accepts. of course you can specify a single table to collect statistics for:

sysdba@[local]:5432/postgres# analyze verbose pg_catalog.pg_am;
INFO: analyzing "pg_catalog.pg_am"
INFO: "pg_am": scanned 1 of 1 pages, containing 4 live rows and 0 dead rows; 4 rows in sample, 4 estimated total rows
ANALYZE

… and you can tell analyze to collect statistics for a single column:

analyze verbose pg_catalog.pg_am(amname);
INFO: analyzing "pg_catalog.pg_am"
INFO: "pg_am": scanned 1 of 1 pages, containing 4 live rows and 0 dead rows; 4 rows in sample, 4 estimated total rows
ANALYZE

as collecting statistics on large tables requires a lot of resources and some amount of time, postgres uses samples of these tables, as oracle does. the size of the samples is controlled by the default_statistics_target parameter which defaults to 100. in contrast to oracle this is not a percentage, but: “The target value sets the maximum number of entries in the most-common-value list and the maximum number of bins in the histogram”.
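the current value can be inspected and changed at session level. a quick sketch against the defaults:

```sql
-- show the system wide default ( 100 )
show default_statistics_target;
-- raise it for the current session only, then re-analyze
set default_statistics_target = 500;
analyze verbose pg_catalog.pg_am;
```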

as with oracle you can override the system wide setting of default_statistics_target by setting statistics at the column level…

alter table pg_catalog.pg_am alter column amname set statistics 99;
ALTER TABLE
sysdba@[local]:5432/postgres*#

… and directly set statistical values of columns ( only n_distinct and n_distinct_inherited at the moment ):

alter table pg_catalog.pg_am alter column amname set ( n_distinct = 5 );
ERROR: permission denied: "pg_am" is a system catalog

it seems postgres does not allow direct updates of catalog tables ( at least in the default configuration ). might be a good idea :)
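on a user table the same command works as expected. a minimal sketch ( t1 is an assumed table name, not from the post above ):

```sql
-- create a demo table and set n_distinct directly on one of its columns
create table t1 ( a varchar(10) );
alter table t1 alter column a set ( n_distinct = 5 );
-- the setting is stored in the attoptions column of pg_attribute
select attname, attoptions
  from pg_attribute
 where attrelid = 't1'::regclass and attname = 'a';
```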

vacuum

as mentioned above vacuum reclaims wasted/unused space. in oracle one would rebuild or shrink the segments instead. without any parameter vacuum will process every table in the database the user who executes the command has access to; a single table may be specified as well:

sysdba@[local]:5432/postgres# vacuum pg_catalog.pg_am;
VACUUM
sysdba@[local]:5432/postgres#

as with the analyze command there is the verbose option:

vacuum (verbose) pg_catalog.pg_am;
INFO: vacuuming "pg_catalog.pg_am"
INFO: index "pg_am_name_index" now contains 4 row versions in 2 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "pg_am_oid_index" now contains 4 row versions in 2 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_am": found 0 removable, 4 nonremovable row versions in 1 out of 1 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
sysdba@[local]:5432/postgres#

… which gives you some interesting information about what is really done.

another argument to vacuum is analyze, so you may combine vacuuming and analyzing ( here on a single table rather than the whole database )…

vacuum (verbose,analyze) pg_catalog.pg_am;
INFO: vacuuming "pg_catalog.pg_am"
INFO: index "pg_am_name_index" now contains 4 row versions in 2 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "pg_am_oid_index" now contains 4 row versions in 2 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_am": found 0 removable, 4 nonremovable row versions in 1 out of 1 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: analyzing "pg_catalog.pg_am"
INFO: "pg_am": scanned 1 of 1 pages, containing 4 live rows and 0 dead rows; 4 rows in sample, 4 estimated total rows
VACUUM

… which is handy for automation.

without the “full” parameter vacuum does not lock the table and simply reclaims space for reuse. if you want to give space back to the operating system ( by actually using less space for the database files ) you will need to call vacuum with the full parameter:

vacuum (verbose,analyze,full);
INFO: vacuuming "pg_catalog.pg_statistic"
INFO: "pg_statistic": found 29 removable, 370 nonremovable row versions in 15 pages
DETAIL: 0 dead row versions cannot be removed yet.
...

… but be careful: this will put an exclusive lock on the table currently processed and will temporarily use more space as the table gets copied. you should not do this while the database is being accessed in production.

the last parameter, called “freeze”, relates to a specific design decision in postgres: the internal transaction ids are limited to 32 bits. because of this, transaction id wraparounds occur, which could lead to data loss ( transaction ids higher than the current one are considered to be in the future and should not be visible to users, so a wraparound could be disastrous ). for this reason there is a special transaction id called FrozenXID, which is always considered older than every other transaction id. the assignment of the FrozenXID is done by the vacuum command. check the documentation for a detailed description of this behaviour.

to control this behaviour there is a parameter called vacuum_freeze_min_age ( specified in number of transactions ), which controls how old a transaction id (xid) may become until it is replaced with the FrozenXID. thus, a larger value will preserve transactional information longer, while a smaller value will increase the number of transactions that can elapse before a table must be vacuumed again. the other parameter involved is called vacuum_freeze_table_age.

calling the vacuum command with the freeze parameter is equivalent to setting the vacuum_freeze_min_age to zero.
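to see how close each database is to the wraparound limit, and to freeze a table manually ( pg_am is reused here just as an example ):

```sql
-- age, in transactions, of the oldest unfrozen xid per database
select datname, age(datfrozenxid) from pg_database;
-- freeze all visible rows of a single table right away
vacuum (freeze, verbose) pg_catalog.pg_am;
```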

as the vacuum command can be very I/O intensive there might be side effects for currently active users of the database. to minimize this effect the administrator can specify the vacuum_cost_delay parameter, which will put the vacuum process to sleep for this amount of time whenever it reaches a defined cost limit ( vacuum_cost_limit ).
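a sketch of throttling a manual vacuum for the current session ( by default vacuum_cost_delay is 0, which disables the throttling for manual vacuums ):

```sql
-- sleep 20ms each time the accumulated cost reaches the limit
set vacuum_cost_delay = '20ms';
set vacuum_cost_limit = 200;
vacuum (verbose) pg_catalog.pg_am;
```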

because these tasks are so important the autovacuum feature should be enabled; it is controlled by the autovacuum and track_counts parameters.
as this is only an introduction I will not mention the other parameters which are used to control the vacuum behaviour.
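whether autovacuum is active, and when it last processed your tables, can be checked like this:

```sql
-- both parameters must be on for autovacuum to work
show autovacuum;
show track_counts;
-- last automatic runs per table
select relname, last_autovacuum, last_autoanalyze
  from pg_stat_user_tables;
```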

so, as it is with oracle: it is very important to carefully define your statistics collection strategy for the best performance. when it comes to postgres you need to think about the vacuum process, too. as postgres keeps old versions of data in the tables’ data files ( there is no undo like in oracle ) and the transaction id will wrap around, it is essential to be familiar with the parameters which control this process.

the next post will give an introduction to backup and restore…

 
