Archives For June 2012

if you followed the first postgresql post the initial postgresql instance should be up and running, so now it is time to create the first database. remember how the documentation defines a database in postgresql terms: “A database is a named collection of SQL objects (“database objects”). Generally, every database object (tables, functions, etc.) belongs to one and only one database. (However there are a few system catalogs, for example pg_database, that belong to a whole cluster and are accessible from each database within the cluster.) More accurately, a database is a collection of schemas and the schemas contain the tables, functions, etc. So the full hierarchy is: server, database, schema, table (or some other kind of object, such as a function).”

first of all let’s create a .psqlrc file which will control some settings for the psql sessions created:

echo "\set AUTOCOMMIT off
\set PROMPT1 '%n@%m:%>/%/%x%# '
\pset null '<NULL>'
\pset format aligned" > ~/.psqlrc

this will turn off autocommit, set a nice psql prompt for us, display NULL values as ‘<NULL>’ and set the output to aligned mode.

additionally I will create a new alias for my database connection as I am too lazy to type the whole command every time:

echo "alias pg='psql -d postgres -U sysdba'" >> ~/.bashrc
. ~/.bashrc

let's quickly check which databases are already present in the current postgresql instance:

Null display is "<>".
Output format is aligned.
psql (9.1.4)
Type "help" for help.
sysdba@[local]:5432/postgres# select datname from pg_database;
  datname
-----------
 template1
 template0
 postgres
(3 rows)

the template1 database will essentially be cloned for every database created. this means if you create objects or users in this database they will be present ( cloned ) in the newly created database, too. if I remember correctly microsoft sql server uses a similar concept.
the template0 database should not be used to create users or objects, so that you always have a blank template in the original state available if you need to create databases without custom objects.
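to see the cloning in action, here is a little sketch ( the names t_template_demo and db_demo are just made up for the demo; this needs a running instance ):

```sql
\set AUTOCOMMIT on           -- create database can not run inside a transaction block
\c template1
CREATE TABLE t_template_demo ( id integer );
\c postgres
CREATE DATABASE db_demo;     -- TEMPLATE template1 is the default
\c db_demo
\d t_template_demo           -- the table was cloned together with the template
\c template1
DROP TABLE t_template_demo;  -- keep template1 clean again
```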

to create a database on its own tablespace issue commands similar to this:

mkdir /opt/postgres/mydb/tbs
CREATE TABLESPACE db_tbs1 LOCATION '/opt/postgres/mydb/tbs/';
CREATE DATABASE db1
       OWNER default
       TEMPLATE template1
       TABLESPACE db_tbs1;

the character sets supported by postgres are listed in the documentation.
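for example, a sketch of creating a database with a non-default encoding ( db2 and the latin1 locale names are made up for illustration and must exist on your system; template0 is required because the new encoding differs from the one the cluster was initialized with ):

```sql
CREATE DATABASE db2
       TEMPLATE template0
       ENCODING 'LATIN1'
       LC_COLLATE 'de_DE.ISO-8859-1'
       LC_CTYPE 'de_DE.ISO-8859-1';
```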

to quickly check the encodings of your databases you may either query pg_database:

sysdba@[local]:5432/postgres*# select datname,datcollate,datctype from pg_database;
  datname  | datcollate  |  datctype
-----------+-------------+-------------
 template1 | en_US.UTF-8 | en_US.UTF-8
 template0 | en_US.UTF-8 | en_US.UTF-8
 postgres  | en_US.UTF-8 | en_US.UTF-8
 db1       | en_US.UTF-8 | en_US.UTF-8
(4 rows)

or use the psql shortcut:

psql -U sysdba -l
Null display is "<>".
Output format is aligned.
                               List of databases
   Name    | Owner  | Encoding |   Collate   |    Ctype    | Access privileges
-----------+--------+----------+-------------+-------------+--------------------
 db1       | sysdba | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 postgres  | sysdba | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0 | sysdba | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/sysdba         +
           |        |          |             |             | sysdba=CTc/sysdba
 template1 | sysdba | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/sysdba         +
           |        |          |             |             | sysdba=CTc/sysdba
(4 rows)

destroying a database is almost exactly the same as in oracle:

DROP DATABASE db1;
what about temporary tablespaces? let's see what tablespaces currently exist in the instance:

SELECT * FROM pg_tablespace;
  spcname   | spcowner |      spclocation       | spcacl | spcoptions
------------+----------+------------------------+--------+------------
 pg_default |       10 |                        |        |
 pg_global  |       10 |                        |        |
 db_tbs1    |       10 | /opt/postgres/mydb/tbs |        |
(3 rows)

if you read the documentation postgresql seems to use a concept similar to the tablespace groups in oracle. let's create two new tablespaces. for this prepare the directories ( it is not possible to create more than one tablespace in the same directory ):

mkdir /opt/postgres/mydb/tbs_temp1/
mkdir /opt/postgres/mydb/tbs_temp2/

… and create the tablespaces:

CREATE TABLESPACE temp1 LOCATION '/opt/postgres/mydb/tbs_temp1/'; 
CREATE TABLESPACE temp2 LOCATION '/opt/postgres/mydb/tbs_temp2/'; 
SELECT * FROM pg_tablespace;
  spcname   | spcowner |         spclocation          | spcacl | spcoptions
------------+----------+------------------------------+--------+------------
 pg_default |       10 |                              | <>     | <>
 pg_global  |       10 |                              | <>     | <>
 db_tbs1    |       10 | /opt/postgres/mydb/tbs       | <>     | <>
 temp1      |       10 | /opt/postgres/mydb/tbs_temp1 | <>     | <>
 temp2      |       10 | /opt/postgres/mydb/tbs_temp2 | <>     | <>
(5 rows)

adjusting the configuration file:

echo "temp_tablespaces='temp1,temp2'" >> /opt/postgres/mydb/postgresql.conf

… reload the configuration:

pg_ctl reload -D /opt/postgres/mydb/
… and check the actual value:

show temp_tablespaces;
 temp_tablespaces
------------------
 temp1, temp2
(1 row)

postgresql will now select one of these tablespaces ( round robin ) for operations which require temporary space.
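a sketch to see this in action ( the row count is arbitrary; it just has to be big enough to exceed work_mem so the sort spills to disk ):

```sql
-- temp_tablespaces can also be set per session instead of in postgresql.conf
SET temp_tablespaces TO temp1, temp2;
-- a sort too big for work_mem creates temporary files; while it runs,
-- pgsql_tmp directories appear below /opt/postgres/mydb/tbs_temp1
-- or /opt/postgres/mydb/tbs_temp2
SELECT * FROM generate_series(1,1000000) AS g(n) ORDER BY random();
```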

next time I will take a look at the maintenance tasks which can/should be performed with postgres.

from time to time it is interesting to see how other databases are implemented, and as postgresql seems to be one of the most widely used open source databases I decided to have a look at it. the result will be some posts which describe the very basics one needs to know to start working with this database system. please keep in mind that I am neither an experienced postgres admin nor a postgres developer, so some statements may not be totally correct or may even be misleading. just drop me an email or post a comment if you find any wrong statements or think something is missing to complete the picture.

if you are running an rpm based linux distribution the easiest way to install postgres is to include the postgres yum repository ( if you are on a debian based distribution there are apt repositories, too, just check the download section on the postgresql website ).

to avoid any conflicts with the current repositories it may be a good idea to exclude the postgresql packages provided by your distribution. for this you need to add “exclude=postgresql*” to the sections of your repository files. here is an example:

cat CentOS-Base.repo
# CentOS-Base.repo
#
# The mirror system uses the connecting IP address of the client and the
# update status of each mirror to pick mirrors that are updated to and
# geographically close to the client. You should use this for CentOS updates
# unless you are manually picking other mirrors.
#
# If the mirrorlist= does not work for you, as a fall back you can try the
# remarked out baseurl= line instead.

[base]
name=CentOS-$releasever - Base
exclude=postgresql*

#released updates
[updates]
name=CentOS-$releasever - Updates
exclude=postgresql*

#additional packages that may be useful
[extras]
name=CentOS-$releasever - Extras
exclude=postgresql*

#additional packages that extend functionality of existing packages
[centosplus]
name=CentOS-$releasever - Plus
exclude=postgresql*

#contrib - packages by Centos Users
[contrib]
name=CentOS-$releasever - Contrib
exclude=postgresql*

now you can safely add the postgresql repository to your yum:

rpm -ihv pgdg-centos91-9.1-4.noarch.rpm

this will download the repository rpm and, once installed, create the yum repo file:

ls -la /etc/yum.repos.d/
total 24
drwxr-xr-x. 2 root root 4096 Jun 12 07:28 .
drwxr-xr-x. 61 root root 4096 Jun 12 07:28 ..
-rw-r--r--. 1 root root 2026 Jun 12 07:22 CentOS-Base.repo
-rw-r--r--. 1 root root 657 Jun 12 07:22 CentOS-Debuginfo.repo
-rw-r--r--. 1 root root 646 Jun 12 07:22 CentOS-Media.repo
-rw-r--r--. 1 root root 436 Aug 22 2011 pgdg-91-centos.repo

before installing the software you need to decide which packages you want to install. check the Installing PostgreSQL on Red Hat Enterprise Linux / Fedora Core document for a list of the packages provided.

because I want to install the server, this is the procedure to use:

yum clean all
yum install postgresql91-libs postgresql91 postgresql91-server

as stated in the pdf linked above, these are the locations of the files installed:

what                  where
Executables           /usr/bin
Libraries             /usr/lib
Documentation         /usr/share/doc/postgresql-x.y.z, /usr/share/doc/postgresql-x.y.z/contrib
Contrib               /usr/share/pgsql/contrib
Data                  /var/lib/pgsql/data
Backup area           /var/lib/pgsql/backup
Templates             /usr/share/pgsql
Procedural Languages  /usr/lib/pgsql
Development Headers   /usr/include/pgsql
Other shared data     /usr/share/pgsql
Regression tests      /usr/lib/pgsql/test/regress

the very first thing one needs to do is to initialize the database cluster, which in postgresql terms is the database storage area on disk containing the actual data. according to the documentation this is also called the data directory or data area. the definition the wiki referenced below uses is: “A database cluster is a collection of databases that is stored at a common file system location (the “data area”). It is possible to have multiple database clusters, so long as they use different data areas and different communication ports.”
once initialized the cluster will contain three databases:

  • postgres: the default database for utilities and users
  • template0: same as template1 but should not be used for creating any objects or users
  • template1: which serves as a template for other databases created ( everything installed in the template DB will be copied to newly created databases )

as with every database there needs to be a location on disk where all the files will be stored:

su -
mkdir -p /opt/postgres/mydb
mkdir -p /opt/postgres/log
chown postgres:postgres /opt/postgres
chown postgres:postgres /opt/postgres/mydb
chown postgres:postgres /opt/postgres/log

the installed package provides a system V init script. because I created custom locations for the data and the logfiles I adjusted the init script:

vi /etc/init.d/postgresql-9.1

to initialize the cluster on the newly created location the initdb command is used:

/usr/pgsql-9.1/bin/initdb -D /opt/postgres/mydb -U sysdba -W
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale en_US.UTF-8.
The default database encoding has accordingly been set to UTF8.
The default text search configuration will be set to "english".
fixing permissions on existing directory /opt/postgres/mydb ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 32MB
creating configuration files ... ok
creating template1 database in /opt/postgres/mydb/base/1 ... ok
initializing pg_authid ... ok
Enter new superuser password:
Enter it again:
setting password ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the -A option the
next time you run initdb.
Success. You can now start the database server using:
    /usr/pgsql-9.1/bin/postgres -D /opt/postgres/mydb
or
    /usr/pgsql-9.1/bin/pg_ctl -D /opt/postgres/mydb -l logfile start

in my opinion the “-U” and “-W” switches are very important. with initdb's default behaviour every local os-user may become the database superuser, so specifying a username and password for the superuser is a good idea.

for the scope of this and the following posts I will start with a minimal set of parameters, so I overwrite the default configuration file ( this is the equivalent of oracle's pfile ):

echo "max_connections = 100
shared_buffers = 32MB
log_destination = 'stderr'
logging_collector = on
log_directory = '/opt/postgres/log/'
log_filename = 'postgresql-%a.log'
log_file_mode = 0600
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 0
datestyle = 'iso, mdy'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'" > /opt/postgres/mydb/postgresql.conf

startup the service:

# check the current status of the service
service postgresql-9.1 status
# start the service
service postgresql-9.1 start

quickly use chkconfig to see if postgres will come up again once the server is rebooted:

chkconfig --list postgresql-9.1
postgresql-9.1 0:off 1:off 2:off 3:off 4:off 5:off 6:off

no. so let’s enable this:

chkconfig postgresql-9.1 on
chkconfig --list postgresql-9.1
postgresql-9.1 0:off 1:off 2:on 3:on 4:on 5:on 6:off

installation done. do a reboot to confirm everything works as expected:

service postgresql-9.1 status
(pid 937) is running...

if you prefer to manually start up the service you can either do:

postgres -D /opt/postgres/mydb >logfile 2>&1 &

… or:

pg_ctl start -D /opt/postgres/mydb -l serverlog

and always remember to _not_ start the service as the root user.

for shutting down the server manually use:

/usr/pgsql-9.1/bin/pg_ctl stop -D /opt/postgres/mydb/ -m smart

as with the oracle database there are some shutdown options ( the “-m” switch ):

  • smart: quit after all clients have disconnected
  • fast: quit directly, with proper shutdown
  • immediate: quit without complete shutdown; will lead to recovery on restart

you can directly send the signals to the server process, too, if you prefer:

  • SIGTERM = smart
  • SIGINT = fast
  • SIGQUIT = immediate

for example:

kill -INT `head -1 /opt/postgres/mydb/postmaster.pid`

as admins tend to be lazy let’s simplify life and define the environment and some aliases:

echo "export PGDATA=/opt/postgres/mydb
export PGLOGDIR=/opt/postgres/log
export PGLOG=\$PGLOGDIR/serverlog
alias pgstart='pg_ctl start -D \$PGDATA -l \$PGLOG'
alias pgstop='pg_ctl stop -D \$PGDATA -m fast'
alias pgreload='pg_ctl reload -D \$PGDATA'
alias pgrestart='pg_ctl restart -D \$PGDATA -m fast'
alias pgstatus='pg_ctl status -D \$PGDATA'
alias viparam='vi \$PGDATA/postgresql.conf'
alias tlog='tail -1000f \$PGLOGDIR/postgresql-Tue.log'" >> ~/.bashrc
. ~/.bashrc

if you are wondering about the kernel parameters you may check the documentation for the recommendations ( the default parameters should be sufficient for a test installation ).

it might be a good idea to prevent the OOM symptom mentioned in the documentation by setting the kernel parameter in question :) :

sysctl -w vm.overcommit_memory=2
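note that a plain sysctl -w is lost on reboot. a small sketch to inspect the current value and persist the setting ( the sysctl.conf line needs root, so it is only shown as a comment ):

```shell
# 0 = heuristic overcommit ( the default ), 1 = always overcommit, 2 = strict accounting
cat /proc/sys/vm/overcommit_memory
# to make the strict setting survive a reboot, append it to /etc/sysctl.conf ( as root ):
#   echo "vm.overcommit_memory=2" >> /etc/sysctl.conf
```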

oracle uses various kinds of processes for different tasks. how does postgresql handle this? there is a wiki page on the postgres website especially for oracle dbas which describes the processes involved:

  • Master process – launches the other processes, background and session processes.
  • Writer process – background process that coordinates database writes, log writes and checkpoints.
  • Stats collector process – background process collecting information about server activity.
  • User session processes.

additionally you will see processes like this:

  • Wal writer process – background process for writing write ahead logs ( redo )
  • autovacuum launcher process – launcher for the processes that scan tables to reclaim space occupied by dead row versions and to prevent transaction id wraparound. vacuum cleans up this dead data directly in the table's files; there are no separate undo/rollback segments as in oracle

if you take a look at the data area ( /opt/postgres/mydb in my case ) you will see various files and directories. the wiki mentioned above provides a clean description of what all the files and directories are for:

  • postgresql.conf – Parameter or main server configuration file.
  • pg_hba.conf – Client authentication configuration file.
  • pg_ident.conf – Map from OS account to PostgreSQL account file.

The cluster subdirectories:

  • base – Subdirectory containing per-database subdirectories
  • global – Subdirectory containing cluster-wide tables
    • pg_auth – Authorization file containing user and role definitions.
    • pg_control – Control file.
    • pg_database – Information of databases within the cluster.
  • pg_clog – Subdirectory containing transaction commit status data
  • pg_multixact – Subdirectory containing multitransaction status data (used for shared row locks)
  • pg_subtrans – Subdirectory containing subtransaction status data
  • pg_tblspc – Subdirectory containing symbolic links to tablespaces
  • pg_twophase – Subdirectory containing state files for prepared transactions
  • pg_xlog – Subdirectory containing WAL (Write Ahead Log) files

to finish this post, here is a quick comparison between oracle and postgres for the things I noticed so far:

oracle                                                            postgres
default transaction isolation level: read committed               default transaction isolation level: read committed
pfile/spfile                                                      postgresql.conf
binary parameter file: yes                                        binary parameter file: no, but reload function of the server
multiplexed control files                                         one control file ( global/pg_control )
create database …                                                 initdb
location of the archived logs: dynamic                            pg_xlog
size of the redologs: dynamic                                     must be specified once when building postgres and can not be changed afterwards
system/sysaux, data dictionary                                    base and global directories
listener/sqlnet configuration                                     postgresql.conf, pg_hba.conf
datafiles                                                         symlinks in pg_tblspc
sqlplus                                                           psql
one to one relation of instance and database ( except for rac )   one cluster can serve multiple databases
select * from v$instance;                                         SELECT version();
show user;                                                        select current_user;
select 1 + 3 from dual;                                           select 1 + 3;
shutdown modes: normal, immediate, transactional, abort           shutdown modes: smart, fast, immediate
alter session …                                                   SET … TO …

yesterday we hit a bug in

if you grant various dbms_* packages to another user with the “with grant option” clause and try to pass on this grant with that user you may hit bug 13036331

simple test-case to confirm that you are affected:

create user blubb identified by "blubb";
create user blabb identified by "blabb";
grant connect,resource to blubb;
grant execute on dbms_lob to blubb with grant option;
connect blubb/blubb
grant execute on dbms_lob to blabb;
ERROR at line 1:
ORA-01031: insufficient privileges 

it seems that only is affected. we already had a psu applied ( ), so this does not help…

by now we know that we can tell the optimizer to write its decisions to a trace file. this file can tell us what access plans for the sql in question were considered and what statistics were used to come up with the final costs.

as introduced in the first post, one parameter which influences the correctness of the statistics is the estimate_percent parameter of the gather_*_stats procedures. wouldn't it be great if we could compare statistics gathered with different values for the parameters to help us decide which will be the best strategy for our database and application?

recall the table containing the house- and phonenumbers which was created in the second post. we already noticed that the num_rows statistic reported different values once the estimate_percent changed. so let’s see if we can report the changes for different sets of parameters for the gather_*_stats procedures.

the first thing we’ll need is a table where we can export some sets of statistics to:

begin
  dbms_stats.create_stat_table ( ownname  => USER
                               , stattab  => 'MY_STATS'
                               , tblspace => 'USERS' );
end;
/

as oracle recommends using the default value for the estimate_percent ( which is DBMS_STATS.AUTO_SAMPLE_SIZE ) this shall be our first set of statistics:

begin
  dbms_stats.gather_table_stats ( ownname => USER
                                , tabname => 'T1' );
end;
/

we already noticed in the first post that the values will not be exactly correct:

SELECT num_rows
FROM user_tab_statistics
WHERE table_name = 'T1';

the actual number of rows is still 10’000:

SELECT count(*)
FROM t1;

before playing around with the estimate_percent parameter save the current statistics to the statistics table created above:

begin
  dbms_stats.export_table_stats ( ownname => USER
                                , tabname => 'T1'
                                , stattab => 'MY_STATS' );
end;
/

now there is a stored set of statistics we may use to compare with other sets. as we know a value of 100 for the estimate_percent parameter should produce better statistics, so let's re-gather with the increased value:

begin
  dbms_stats.gather_table_stats ( ownname => USER
                                , tabname => 'T1'
                                , estimate_percent => 100 );
end;
/

… and quickly check the result for the num_rows statistic:

SELECT num_rows
FROM user_tab_statistics
WHERE table_name = 'T1';

looks correct. again, let’s save the statistics:

begin
  dbms_stats.export_table_stats ( ownname => USER
                                , tabname => 'T1'
                                , statid  => 'ESTIMATE_PERCENT_100'
                                , stattab => 'MY_STATS' );
end;
/

to see what values changed for the different sets we can now compare the sets of statistics:

SET LONG 20000
SELECT report, maxdiffpct
  FROM TABLE ( dbms_stats.diff_table_stats_in_stattab ( USER
                                                      , 'T1'
                                                      , 'MY_STATS' ) );

this will produce a report similar to this one:

-------------------------------------------------------------------------------- ----------
SOURCE A : User statistics table MY_STATS
: Owner : OW
SOURCE B : User statistics table MY_STATS
: Owner : OW
T1 T A 9966 20 8 4764
B 10000 20 8 10000
HOUSENUMBER A 9941 .000100593 NO 0 4 C104 C302 4752
B 10000 .0001 NO 0 4 C102 C302 10000
PHONENUMBER A 5503 .000181719 NO 0 4 C104 C2646 4852
B 6376 .000156838 NO 0 4 C102 C2646 10000
I_HOUSENUMBER I A 10000 21 10000 1 1 18 1 10000
B 10000 21 10000 1 1 18 1 10000
I_PHONENUMBER I A 10000 21 6376 1 1 9411 1 10000
B 10000 21 6376 1 1 9411 1 10000

take a look at the report and you can easily see how the statistics changed when increasing the estimate_percent parameter ( of course you can change any other parameter, too ).

this is a great feature if you want to play around with different settings for the gather_*_stats procedures and want to check which settings are best for your case. and always remember that you can set different parameters down to the table level. for a quick description check this post.
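on 11g these table level settings are managed with dbms_stats.set_table_prefs; for example, a sketch that pins estimate_percent to 100 for T1 only:

```sql
begin
  dbms_stats.set_table_prefs ( ownname => USER
                             , tabname => 'T1'
                             , pname   => 'ESTIMATE_PERCENT'
                             , pvalue  => '100' );
end;
/
-- verify the preference
SELECT dbms_stats.get_prefs ( 'ESTIMATE_PERCENT', USER, 'T1' ) FROM dual;
```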

this post continues optimizer basics (2) and looks further at how the optimizer decides if an index will be used or not. the clustering_factor was already introduced as one important statistic the optimizer uses for making its decisions.

recall the two statements from the last post:

-- first statement
SELECT phonenumber
  FROM t1
 WHERE housenumber < 100;
-- second statement
SELECT housenumber
  FROM t1
 WHERE phonenumber < 100;

while for the first statement the appropriate index ( I_HOUSENUMBER ) was used, the index on the phone number column ( I_PHONENUMBER ) was ignored for the second statement. so far the assistant of the phone company has realized that it makes a difference how well the table is ordered in relation to the index, and that this is expressed in a statistic called the clustering_factor. what is missing to complete the picture is how the optimizer chooses its plan.

as the oracle database is highly instrumented we are lucky and can tell the optimizer to write its decisions to a trace file. the event one needs to set for this is: 10053.

let’s create two trace files, one for each of the statements from above:

alter session set tracefile_identifier='I_HOUSENUMBERS';
alter session set events '10053 trace name context forever, level 12';
SELECT phonenumber
  FROM t1
 WHERE housenumber < 100;
alter session set events '10053 trace name context off';

alter session set tracefile_identifier='I_PHONENUMBERS';
alter session set events '10053 trace name context forever, level 12';
SELECT housenumber
  FROM t1
 WHERE phonenumber < 100;
alter session set events '10053 trace name context off';

remember that you can ask the database if you're not sure where to find the trace files:
show parameter user_dump_dest
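on 11g ( where the diagnostics live below diagnostic_dest ) you can even ask for the exact file name of your session's trace file:

```sql
SELECT value
  FROM v$diag_info
 WHERE name = 'Default Trace File';
```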

if you open the trace files and scroll down to the section called “QUERY BLOCK TEXT” you will see the statement. right after that the interesting stuff begins:
CPUSPEEDNW: 1998 millions instructions/sec (default is 100)
IOTFRSPEED: 4096 bytes per millisecond (default is 4096)
IOSEEKTIM: 10 milliseconds (default is 10)
MBRC: NO VALUE blocks (default is 8)

these are the statistical values which are valid for the system:

  • CPUSPEEDNW: the speed of the cpu
  • IOTFRSPEED: transfer speed for single I/O read requests
  • IOSEEKTIM: I/O seek time
  • MBRC: multiblock read count

for a detailed description of these statistics check the oracle documentation.
how these statistics are gathered and how they can be modified will be a topic for another post. for the scope of this post just take note of the ones present in the trace file.

the next section in the trace file reports the statistics for the table and its indexes:
Table Stats::
Table: T1 Alias: T1
#Rows: 10000 #Blks: 20 AvgRowLen: 8.00 ChainCnt: 0.00
Index Stats::
Index: I_HOUSENUMBER Col#: 1
LVLS: 1 #LB: 21 #DK: 10000 LB/K: 1.00 DB/K: 1.00 CLUF: 18.00
Index: I_PHONENUMBER Col#: 2
LVLS: 1 #LB: 21 #DK: 6376 LB/K: 1.00 DB/K: 1.00 CLUF: 9411.00

these are the same values that you can query directly from the database:

SELECT num_rows "#Rows"
     , blocks "#Blks"
     , avg_row_len "AvgRowLen"
     , chain_cnt "ChainCnt"
  FROM user_tab_statistics
 WHERE table_name = 'T1';

     #Rows      #Blks  AvgRowLen   ChainCnt
---------- ---------- ---------- ----------
     10000         20          8          0

SELECT blevel "LVLS"
     , leaf_blocks "#LB"
     , distinct_keys "#DK"
     , avg_leaf_blocks_per_key "LB/K"
     , avg_data_blocks_per_key "DB/K"
     , clustering_factor "CLUF"
  FROM user_ind_statistics
 WHERE table_name = 'T1';

      LVLS        #LB        #DK       LB/K       DB/K       CLUF
---------- ---------- ---------- ---------- ---------- ----------
         1         21      10000          1          1         18
         1         21       6376          1          1       9411

let’s look at the first statement. the trace file reports that the following access paths were considered:
Single Table Cardinality Estimation for T1[T1]
Column (#1): HOUSENUMBER(
AvgLen: 4 NDV: 10000 Nulls: 0 Density: 0.000100 Min: 1 Max: 10000
Table: T1 Alias: T1
Card: Original: 10000.000000 Rounded: 99 Computed: 99.01 Non Adjusted: 99.01
Access Path: TableScan
Cost: 7.09 Resp: 7.09 Degree: 0
Cost_io: 7.00 Cost_cpu: 2144409
Resp_io: 7.00 Resp_cpu: 2144409
Access Path: index (RangeScan)
resc_io: 3.00 resc_cpu: 58364
ix_sel: 0.009901 ix_sel_with_filters: 0.009901
Cost: 3.00 Resp: 3.00 Degree: 1
******** Begin index join costing ********
****** trying bitmap/domain indexes ******
Access Path: index (IndexOnly)
resc_io: 2.00 resc_cpu: 34243
ix_sel: 0.009901 ix_sel_with_filters: 0.009901
Cost: 2.00 Resp: 2.00 Degree: 0
Bitmap nodes:
Cost = 2.001985, sel = 0.009901
****** finished trying bitmap/domain indexes ******
******** End index join costing ********
Best:: AccessPath: IndexRange
Cost: 3.00 Degree: 1 Resp: 3.00 Card: 99.01 Bytes: 0

clearly the IndexRange scan reports the lowest cost. so this is the plan to choose for the first statement. for the second statement:
Single Table Cardinality Estimation for T1[T1]
Column (#2): PHONENUMBER(
AvgLen: 4 NDV: 6376 Nulls: 0 Density: 0.000157 Min: 1 Max: 9998
Table: T1 Alias: T1
Card: Original: 10000.000000 Rounded: 99 Computed: 99.03 Non Adjusted: 99.03
Access Path: TableScan
Cost: 7.10 Resp: 7.10 Degree: 0
Cost_io: 7.00 Cost_cpu: 2342429
Resp_io: 7.00 Resp_cpu: 2342429
Access Path: index (RangeScan)
resc_io: 96.00 resc_cpu: 720658
ix_sel: 0.009903 ix_sel_with_filters: 0.009903
Cost: 96.03 Resp: 96.03 Degree: 1
******** Begin index join costing ********
****** trying bitmap/domain indexes ******
Access Path: index (IndexOnly)
resc_io: 2.00 resc_cpu: 34243
ix_sel: 0.009903 ix_sel_with_filters: 0.009903
Cost: 2.00 Resp: 2.00 Degree: 0
Bitmap nodes:
Cost = 2.001986, sel = 0.009903
****** finished trying bitmap/domain indexes ******
******** End index join costing ********
Best:: AccessPath: TableScan
Cost: 7.10 Degree: 1 Resp: 7.10 Card: 99.03 Bytes: 0

… the cost for using the index ( 96.03 ) is far too high in comparison to the cost of the full table scan ( 7.10 ). that's why the full table scan will be the plan of choice.

if you are interested in how these numbers get calculated: there are lots of smart people out there who put a lot of work into describing this, for example:
Richard Foote
Randolf Geist
John Brady

… and of course Jonathan Lewis’ book

in short: every I/O has a cost, plus some costing for the CPU.
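as a rough sketch, the well known i/o cost formula for an index range scan ( blevel, plus leaf block accesses, plus table block accesses, each fraction rounded up ) reproduces the resc_io figures from the two trace files above:

```
resc_io ≈ blevel + ceil(ix_sel * leaf_blocks) + ceil(ix_sel_with_filters * clustering_factor)

I_HOUSENUMBER: 1 + ceil(0.009901 * 21) + ceil(0.009901 *   18) = 1 + 1 +  1 =  3
I_PHONENUMBER: 1 + ceil(0.009903 * 21) + ceil(0.009903 * 9411) = 1 + 1 + 94 = 96
```

the huge clustering_factor of I_PHONENUMBER ( 9411 ) is exactly what pushes the index cost beyond the full table scan.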

conclusion: if you know how your data is organized you should be able to predict what the optimizer will do for the statements in question. you should always think about indexes before you decide to create them. not every index will be used by oracle, and every additional index will increase the work oracle needs to do for inserts/updates/deletes to the table. sequential reads are very fast today so don't be surprised if a full table scan is the plan of choice …

linux ( as well as most of the unixes ) provides the ability to integrate many different file systems at the same time. to name a few of them:

  • ext2, ext3, ext4
  • ocfs, ocfs2
  • reiserfs
  • vxfs
  • btrfs
  • dos, ntfs

although each of them provides different features and was developed with different purposes in mind, the tools to work with them stay the same:

  • cp
  • mv
  • cd

the layer which makes this possible is called the virtual filesystem ( vfs ). this layer provides a common interface for the filesystems which are plugged into the operating system. I already introduced one special kind of filesystem, the proc filesystem. the proc filesystem does not handle any files on disk or on the network, but nevertheless it is a filesystem. in addition to the above mentioned filesystems, which are all disk based, filesystems may also handle files on the network, such as nfs or cifs.

no matter what kind of filesystem you are working with: when interacting with the filesystem by using the commands of choice you are routed through the virtual filesystem:

the virtual file system

to make this possible there needs to be a standard all file system implementations must comply with, and this standard is called the common file model. the key components this model consists of are:

  • the superblock which stores information about a mounted filesystem ( … kept in memory as a doubly linked list )
  • inodes which store information about a specific file ( … kept in memory as a doubly linked list )
  • the file object which stores information about the underlying files
  • dentries, which represent the links to build the directory structure ( … kept in memory as a doubly linked list )

to speed up operations on the file systems some of the information which is normally stored on disk is cached. if you recall the post about slabs, you can find an entry like the following in the /proc/slabinfo file if you have a mounted ext4 filesystem on your system:

cat /proc/slabinfo | grep ext4 | grep cache
ext4_inode_cache   34397  34408    920   17    4 : tunables    0    0    0 : slabdata   2024   2024      0

so what does the kernel need to do if, for example, a request for listing the contents of a directory comes in and the directory resides on an ext4 filesystem? because the filesystem is mounted the kernel knows that the filesystem for the specific request is of type ext4. the ls command will then be translated ( pointed ) to the specific implementation for the ext4 filesystem. this works the same for all commands interacting with filesystems: there is a pointer for each operation that links to the specific implementation of the operation in question:

directory listing
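you can watch this dispatch from userland: the very same stat call works on every mounted filesystem, and stat -f reports which concrete implementation answered ( a minimal sketch; the type reported for / depends on your root filesystem ):

```shell
# same system call, different filesystem implementations behind the vfs
stat -f -c 'filesystem type of /     : %T' /
stat -f -c 'filesystem type of /proc : %T' /proc
```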

as the superblock is stored in memory and therefore may become dirty, that is, not synchronized with the superblock on disk, there is the same issue that oracle must handle with its buffer pools: periodically check the dirty flag and write the changes down to disk. the same is true for inodes ( while in memory ), which contain all the information that makes up a file. closing a loop to oracle again: to speed up searching the inodes linux maintains a hash table for fast access ( remember how oracle uses hashes to identify sql statements in the shared_pool ).

where there are files, there are processes which want to work with them. once a file is opened a new file object will be created. as these are frequent operations file objects are allocated through a slab cache.

the file objects themselves are visible to the user through the /proc filesystem, per process:

ls -la /proc/*/fd/
total 0
dr-x------ 2 root root  0 2012-05-18 14:03 .
dr-xr-xr-x 8 root root  0 2012-05-18 06:40 ..
lrwx------ 1 root root 64 2012-05-18 14:03 0 -> /dev/null
lrwx------ 1 root root 64 2012-05-18 14:03 1 -> /dev/null
lr-x------ 1 root root 64 2012-05-18 14:03 10 -> anon_inode:inotify
lrwx------ 1 root root 64 2012-05-18 14:03 2 -> /dev/null
lrwx------ 1 root root 64 2012-05-18 14:03 3 -> anon_inode:[eventfd]
lrwx------ 1 root root 64 2012-05-18 14:03 4 -> /dev/null
lrwx------ 1 root root 64 2012-05-18 14:03 5 -> anon_inode:[signalfd]
lrwx------ 1 root root 64 2012-05-18 14:03 6 -> socket:[7507]
lrwx------ 1 root root 64 2012-05-18 14:03 7 -> anon_inode:[eventfd]
lrwx------ 1 root root 64 2012-05-18 14:03 8 -> anon_inode:[eventfd]
lrwx------ 1 root root 64 2012-05-18 14:03 9 -> socket:[11878]

usually numbers 0 – 2 refer to the standard input, standard output and standard error of the corresponding process.
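this is easy to verify from a shell; opening a file on a new descriptor makes a new file object appear immediately under /proc ( a sketch, the scratch file name is arbitrary ):

```shell
# descriptors 0-2 of this shell
ls -l /proc/$$/fd/0 /proc/$$/fd/1 /proc/$$/fd/2
# open descriptor 3 on a scratch file and the new file object shows up
exec 3>/tmp/fd_demo.$$
readlink /proc/$$/fd/3
# close it again and clean up
exec 3>&-
rm -f /tmp/fd_demo.$$
```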

last but not least there are the dentries. as with the file objects, dentries are allocated from a slab cache, the dentry cache in this case:

cat /proc/slabinfo | grep dentry
dentry             60121  61299    192   21    1 : tunables    0    0    0 : slabdata   2919   2919      0

directories are files, too, but special in the sense that directories may contain other files or directories. once a directory is read into memory it is transformed into a dentry object. as this operation is expensive there is the dentry cache mentioned above, so the work for building the dentry objects can be minimized.
another link to oracle wording: the unused dentry doubly linked list uses a least recently used ( lru ) algorithm to track the usage of the entries. when the kernel needs to shrink the cache the objects at the tail of the list will be removed. as with the inodes there is a hash table for the dentries and a lock protecting the lists ( dcache_spin_lock in this case ).

this should give you enough hints to go further if you are interested …