Daniel Westermann's Blog

an educational example to demonstrate the oracle database filesystem ( dbfs )

May 8, 2012 — Leave a comment

a probably less known feature of the oracle database is the database filesystem (dbfs). in short dbfs provides a filesystem interface on top of database tables accessible through standard operating system commands. storing files inside the database becomes as easy as creating files on the operating system level. there is no need to use a programming language to store blobs/clobs inside the database.

for demonstrating dbfs this post will show how to create a subversion repository which actually resides in the oracle database. this is not intended for any production use and shall only show that is basically works.

because I am on linux and want to mount the database filesystem on the operating system I will need to install the kernel-devel and fuse packages and give execute privileges to the fusermount command:

su-
yum install kernel-devel fuse fuse-libs
chmod +x /bin/fusermount

next I’ll create the configuration file for fuse:

touch /etc/fuse.conf
echo "user_allow_other" > /etc/fuse.conf

the mountpoint for holding my database filesystem and subversion repository will be /oradata/DB112/svn owned by the database owner:

su - oracle
mkdir /oradata/DB112/svn

as all the files will reside inside the oracle database I will create a separate user:

su - oracle
sqlplus / as sysdba
CREATE USER svn
IDENTIFIED by "svn"
DEFAULT TABLESPACE USERS
TEMPORARY TABLESPACE TEMP;
GRANT CREATE SESSION TO svn;
GRANT dbfs_role TO svn;
GRANT RESOURCE TO svn;

the script that creates the dbfs “MYSVNFS” which will be stored in the “USERS” tablespace is:

connect svn/svn
@$ORACLE_HOME/rdbms/admin/dbfs_create_filesystem.sql USERS MYSVNFS

the output should be similar to this:

No errors.
--------
CREATE STORE:
begin dbms_dbfs_sfs.createFilesystem(store_name => 'FS_MYSVNFS', tbl_name =>
'T_MYSVNFS', tbl_tbs => 'USERS', lob_tbs => 'USERS', do_partition => false,
partition_key => 1, do_compress => false, compression => '', do_dedup => false,
do_encrypt => false); end;
--------
REGISTER STORE:
begin dbms_dbfs_content.registerStore(store_name=> 'FS_MYSVNFS', provider_name
=> 'sample1', provider_package => 'dbms_dbfs_sfs'); end;
--------
MOUNT STORE:
begin dbms_dbfs_content.mountStore(store_name=>'FS_MYSVNFS',
store_mount=>'MYSVNFS'); end;
--------
CHMOD STORE:
declare m integer; begin m := dbms_fuse.fs_chmod('/MYSVNFS', 16895); end;
No errors.

the client program used to mount the filesystem is called dbfs_client and is located under $ORACLE_HOME/bin. the proper way in regards to security would be to use an oracle wallet for storing the password but as this is only a test I will do it the easy way ( if someone is interested on how to setup the wallet way, just let me know ):

echo "svn" > /home/oracle/passwordfile
nohup dbfs_client svn@DB112 /oradata/DB112/svn < /home/oracle/passwordfile &

if everything worked the new mount is now visible with the os commands:

df -h
Filesystem Size Used Avail Use% Mounted on
/dev/hdc1 28G 11G 15G 43% /
tmpfs 741M 456M 286M 62% /dev/shm
dbfs-svn@DB112:/ 1.1G 160K 1.1G 1% /oradata/DB112/svn
##
ls -la /oradata/DB112/svn/MYSVNFS/
total 0
drwxrwxrwx 3 root root 0 May 4 01:10 .
drwxr-xr-x 3 root root 0 May 4 01:15 ..
drwxr-xr-x 7 root root 0 May 4 01:10 .sfs

now I am ready to create the subversion repository:

su - oracle
cd /oradata/DB112/svn/MYSVNFS/
svnadmin create MYREPO

for being able to access the repository later on from another user I’ll add the user to the subversion password file:

cat MYREPO/conf/passwd
[users]
svn = svn

… and uncomment the following line in the svnserve.conf file:

password-db = passwd

… and start the subversion deamon:

svnserve -d -r /oradata/DB112/svn/MYSVNFS

to simulate real-life subversion usage I will create a new os user and do some work on the subversion repository ( I could have done this from another workstation, too ):

su -
useradd svn
passwd svn
su - svn

the first thing I need to do is to checkout the repository:

svn co svn://localhost/MYREPO/
cd MYREPO
ls -la

lets copy some files ( text and binary ) to populate the repository ( don’t care if not all files are copied because of permissions ):

cp /etc/*.conf .
cp /bin/cpio .
ls -la

as I now have some files I’ll add them to the repository and commit ( commit means pushing to the subversion server here ):

svn add *
svn commit -m "blabla" *.conf --username=svn
svn commit -m "blabla" cpio

done. all files committed to the subversion server and stored inside the oracle database. don’t believe it ?

su - oracle
sqlplus / as sysdba
set lines 164
col object_name for a20
col object_type for a15
SELECT object_name,object_type FROM all_objects WHERE owner = 'SVN';
##
OBJECT_NAME          OBJECT_TYPE
-------------------- ---------------
T_MYSVNFS            TABLE
IG_SFS$_FST_19       INDEX
IP_SFS$_FST_19       INDEX
IPG_SFS$_FST_19      INDEX
SFS$_FSTP_19         TABLE
SYS_C005052          INDEX
##
col item for a30
col pathname for a40
select item ,PATHNAME from svn.T_MYSVNFS;
##
ITEM PATHNAME
------------------------------ ----------------------------------------
MYREPO /MYREPO
locks /MYREPO/locks
db.lock /MYREPO/locks/db.lock
db-logs.lock /MYREPO/locks/db-logs.lock
hooks /MYREPO/hooks
start-commit.tmpl /MYREPO/hooks/start-commit.tmpl
pre-commit.tmpl /MYREPO/hooks/pre-commit.tmpl
post-revprop-change.tmpl /MYREPO/hooks/post-revprop-change.tmpl
conf /MYREPO/conf
authz /MYREPO/conf/authz
db /MYREPO/db
fs-type /MYREPO/db/fs-type
revs /MYREPO/db/revs
0 /MYREPO/db/revs/0
revprops /MYREPO/db/revprops
0 /MYREPO/db/revprops/0
transactions /MYREPO/db/transactions
txn-protorevs /MYREPO/db/txn-protorevs
write-lock /MYREPO/db/write-lock
uuid /MYREPO/db/uuid
0 /MYREPO/db/revs/0/0
0 /MYREPO/db/revprops/0/0
fsfs.conf /MYREPO/db/fsfs.conf
rep-cache.db /MYREPO/db/rep-cache.db
min-unpacked-rev /MYREPO/db/min-unpacked-rev
txn-current-lock /MYREPO/db/txn-current-lock
format /MYREPO/db/format
format /MYREPO/format
passwd /MYREPO/conf/passwd
svnserve.conf /MYREPO/conf/svnserve.conf
pre-revprop-change.tmpl /MYREPO/hooks/pre-revprop-change.tmpl
pre-lock.tmpl /MYREPO/hooks/pre-lock.tmpl
pre-unlock.tmpl /MYREPO/hooks/pre-unlock.tmpl
ROOT /
.sfs /.sfs
attributes /.sfs/attributes
tools /.sfs/tools
snapshots /.sfs/snapshots
RECYCLE /.sfs/RECYCLE
content /.sfs/content
post-commit.tmpl /MYREPO/hooks/post-commit.tmpl
post-lock.tmpl /MYREPO/hooks/post-lock.tmpl
post-unlock.tmpl /MYREPO/hooks/post-unlock.tmpl
README.txt /MYREPO/README.txt
1 /MYREPO/db/revs/0/1
txn-current /MYREPO/db/txn-current
2 /MYREPO/db/revs/0/2
1 /MYREPO/db/revprops/0/1
current /MYREPO/db/current
2 /MYREPO/db/revprops/0/2

that’s the subversion repository stored in the oracle database.

remember this is only for educational purposes, but imagine what you could do with it:

back up the repository with rman
use rman to duplicate the database for testing purposes
setup rac one node, integrate the subversion daemon with the oracle grid infrastructure and make the repository high available
create multiple dbfs and multiple svn repositories in one oracle database
enable deduplication for the repository files ( see * )
enable compression for the repository files ( see ** )

oracle uses secure files to store the files located in the dbfs. this means you may use all the features secure files provide. check the examples for an overview.

alter table svn.T_MYSVNFS modify lob(FILEDATA) (deduplicate);

alter table svn.T_MYSVNFS modify lob(FILEDATA) (compress high);

as said above, this is only educational but i’d be happy to start a discussion about the pros and cons ( beside the licensing costs, of course ) …. what would be a good use case to use dbfs for ?

In demonstration oracle, resources

the oracle local registry ( olr ) – what it is and how one can manage it

May 7, 2012 — 1 Comment

when it comes to the grid infrastructure ( single node or cluster configuration ) oracle needs to store and manage some configuration data about the state of the node(s) and its resources. each node in the configuration has its own container to store and manage data which is called the “oracle local registry (olr)”.

by using the ocrcheck utility one can check the current the status of the local registry:

ocrcheck -local
Status of Oracle Local Registry is as follows :
Version : 3
Total space (kbytes) : 262120
Used space (kbytes) : 2472
Available space (kbytes) : 259648
ID : 1322090758
Device/File Name : /opt/oracle/product/crs/11.2.0.3/cdata/localhost/oracleplayground.olr
Device/File integrity check succeeded
Local registry integrity check succeeded
Logical corruption check bypassed due to non-privileged user

beside some other information this tells that the local registry is located here:

/opt/oracle/product/crs/11.2.0.3/cdata/localhost/oracleplayground.olr

if you wonder about the “Logical corruption check bypassed due to non-privileged user” message, this is because the ocrcheck command was executed as the grid infrastructure software owner. re-executing the same command as root user will show “Logical corruption check succeeded”. why root? because some parts of the grid infrastructure run with root privileges ( remember the roothas.pl or rootcrs.pl setup script ).

as the local registry is essential for the cluster stack to start up there must be a way to do proper backups and restores of it in case it gets lost or corrupted. oracle creates an initial backup of the olr after the configuration of the grid infrastructure. from that point onwards it is up to you to create and manage the backups. you will find the initial backup of your olr under ORACLE_HOME/cdata/[HOSTNAME]/, in my case:

ls -al /opt/oracle/product/crs/11.2.0.3/cdata/oracleplayground/
total 5644
drwxr-x--- 2 grid oinstall 4096 Mar 23 09:18 .
drwxrwx--- 4 grid oinstall 4096 Mar 23 09:17 ..
-rw------- 1 grid oinstall 5746688 Mar 23 09:18 backup_20120323_091815.olr

first of all one should decide where to store the manual backups of the olr. the default location is fine if you do regular backups of your $ORACLE_HOME ( and it is strongly recommended to do so ). to list the current configuration use:

./ocrconfig -local -showbackup
oracleplayground 2012/03/23 09:18:15 /opt/oracle/product/crs/11.2.0.3/cdata/oracleplayground/backup_20120323_091815.olr

if you need to change the location for any reason, you can do so by using the ocrconfig command:

./ocrconfig -local -backuploc [BACKUP_DESTINATION]

doing backups of your olr is as easy as:

/ocrconfig -local -manualbackup
oracleplayground 2012/05/03 15:51:38 /opt/oracle/product/crs/11.2.0.3/cdata/oracleplayground/backup_20120503_155138.olr
oracleplayground 2012/03/23 09:18:15 /opt/oracle/product/crs/11.2.0.3/cdata/oracleplayground/backup_20120323_091815.olr

…which will also list the current backups of the olr. of course you need to ensure that these files are being backuped up to a safe location ( to a tape, for example ).

another way for doing backups and restores is to use the “import” and “export” parameters of the ocrconfig command:

./ocrconfig -local -export /tmp/olr.dmp
./ocrconfig -local -import /tmp/olr.dmp

as oracle recommends not to use the “export” and “import” procedures for backups and restores i wonder what these parameters are for.
perhaps you should do both kinds of backups to be sure that one of it really works :)

in case you need to restore a backup of the local registry make sure you stop the cluster stack before:

crsctl stop has
ocrconfig -local -restore [OLR_BACKUP_FILE]
ocrcheck -local
crsctl start has

as it should be clear now how to backup and restore the local registry lets take look at the contents. oracle provides another command for dumping the contents of the local ( or cluster ) registry:

./ocrdump -local -stdout

this will dump the contents of the local registry to the screen. take a look at the output and you will see why the registry is essential for the stack. if you registered a database and listener with the grid infrastructure you will see something like this:

[SYSTEM.OHASD.RESOURCES.ora!db112!db]
[SYSTEM.OHASD.RESOURCES.ora!db112!db.CONFIG]
ORATEXT : ACL=owner:grid:rwx,pgrp:asmdba:r-x,other::r--,group:oinstall:r-x,user:oracle:rwx~ACTION_FAILURE_TEMPLATE=~ACTION_SCRIPT=~ACTIVE_PLACEMENT=1~AGENT_FILENAME=%CRS_HOME%/bin/oraagent%CRS_EXE_SUFFIX%~AUTO_START=restore~BASE_TYPE=ora.cluster_resource.type~CARDINALITY=1~CHECK_INTERVAL=1~CHECK_TIMEOUT=30~CLUSTER_DATABASE=false~DATABASE_TYPE=SINGLE~DB_UNIQUE_NAME=DB112~DEFAULT_TEMPLATE=PROPERTY(RESOURCE_CLASS=database) PROPERTY(DB_UNIQUE_NAME= CONCAT(PARSE(%NAME
%, ., 2), %USR_ORA_DOMAIN%, .)) ELEMENT(INSTANCE_NAME= %GEN_USR_ORA_INST_NAME%) ELEMENT(DATABASE_TYPE= %DATABASE_TYPE%)~DEGREE=1~DESCRIPTION=Oracle Database resource~ENABLED=1~FAILOVER_DELAY=0~FAILURE_INTERVAL=60~FAILURE_THRESHOLD=1~GEN_AUDIT_FILE_DEST=/oradata/DB112/admin/adump~GEN_START_OPTIONS=open~GEN_USR_ORA_INST_NAME=DB112~HOSTING_MEMBERS=~INSTANCE_FAILOVER=1~LOAD=1~LOGGING_LEVEL=1~MANAGEMENT_POLICY=AUTOMATIC~NAME=ora.db112.db~NLS_LANG=~NOT_RESTARTING_TEMPLATE=~OFFLINE_CHECK_INTERVAL=0~ONLINE_RELOCATI
ON_TIMEOUT=0~ORACLE_HOME=/opt/oracle/product/base/11.2.0.3~ORACLE_HOME_OLD=~PLACEMENT=balanced~PROFILE_CHANGE_TEMPLATE=~RESTART_ATTEMPTS=2~ROLE=PRIMARY~SCRIPT_TIMEOUT=60~SERVER_POOLS=~SPFILE=/opt/oracle/product/base/11.2.0.3/dbs/spfileDB112.ora~START_DEPENDENCIES=weak(type:ora.listener.type,uniform:ora.ons) hard(ora.DB112_DATA_DG.dg,ora.DB112_ARCH_DG.dg) pullup(ora.DB112_DATA_DG.dg,ora.DB112_ARCH_DG.dg)~START_TIMEOUT=600~STATE_CHANGE_TEMPLATE=~STOP_DEPENDENCIES=hard(intermediate:ora.asm,shutdown:ora.DB112_D
ATA_DG.dg,shutdown:ora.DB112_ARCH_DG.dg)~STOP_TIMEOUT=600~TYPE=ora.database.type~TYPE_ACL=owner:oracle:rwx,pgrp:oinstall:rwx,other::r--~TYPE_NAME=ora.database.type~TYPE_VERSION=3.2~UPTIME_THRESHOLD=1h~USR_ORA_DB_NAME=~USR_ORA_DOMAIN=~USR_ORA_ENV=~USR_ORA_FLAGS=~USR_ORA_INST_NAME=DB112~USR_ORA_OPEN_MODE=open~USR_ORA_OPI=false~USR_ORA_STOP_MODE=immediate~VERSION=11.2.0.3.0~

everything the cluster stack needs to know about the registered resource is recorded here. this includes all the dependencies of the resource as well as the different kinds of parameters, e.g. :

SPFILE=/opt/oracle/product/base/11.2.0.3/dbs/spfileDB112.ora

… which tells where to find the spfile of the registered database, or:

hard(ora.DB112_DATA_DG.dg,ora.DB112_ARCH_DG.dg)

… which tells that the diskgroups DB112_DATA_DG and DB112_ARCH_DG must be available before the database can start up. this is a hard dependency, so the startup of the database will stop if the diskgroups are not available.

you can get a more readable format by dumping the contents to a xml file:

./ocrdump -local /tmp/olr.xml -xml

as the backups of the olr are the same format as the current olr, ocrdump works the same way if you want to dump the backups. this can help when searching for changes in the olr:

./ocrdump -local /tmp/olr_backup.xml -backupfile /opt/oracle/product/crs/11.2.0.3/cdata/oracleplayground/backup_20120503_155138.olr -xml

you may have observed, that all the commands ( ocrconfig, orcdump, ocrcheck ) are used to manage the local registry as well as the cluster registry in case you have a clustered setup ( RAC or RAC one node, for example ). so, no matter if you are on a single node or cluster configuration: make sure your backup and restore strategy includes the oracle local and cluster registry.

In configuration, main oracle

what one can learn from a simple one row update

May 1, 2012 — Leave a comment

what happens when you update a row of a table ? not much, you could think. but in fact there are lots and lots of things happening in the background from which you can learn how the database works. this post will follow a simple one row update and shows what’s happening behind the scenes.

i will start by creating a tiny little table and adding one row to it:

CREATE TABLE T1 ( A NUMBER )
       TABLESPACE USERS;
INSERT INTO T1 (A)
       VALUES (1);
COMMIT;

this is the starting point. because we will need this information later lets record four things:
1) the block number of the row we just created
2) the object id of the table in the data dictionary
3) the sid of the session
4) the session’s address

-- rowid of the row
SELECT rowid
  FROM t1
 WHERE A = 1;
ROWID
------------------
AAADb4AAEAAAAF9AAA
-- block number of the row
SELECT dbms_rowid.rowid_block_number('AAADb4AAEAAAAF9AAA') block_no
  FROM dual;
BLOCK_NO
----------
381
-- object id of the table
SELECT object_id
  FROM all_objects
 WHERE owner = USER
   AND object_type = 'TABLE'
   AND object_name = 'T1';
OBJECT_ID
----------
14072
-- the session id of the session which sends the update
SELECT SYS_CONTEXT('userenv','SID') sid
  FROM dual;
SID
---
32
-- the same sessions address
SELECT saddr
  FROM v$session
 WHERE sid = 32;
SADDR
----------------
000000006F680A28

so by now we know the object id ( which is 14072 ), the block number of the block containing the row ( which is 381 ), the session id ( which is 32 in my case ) and the session’s address ( which is 000000006F680A28 ).

ready to update ?

UPDATE T1
SET A = 2
WHERE A = 1;
1 row updated.

at this point it is important to leave the transaction open ( to not commit ). i will start a second sqlplus session and leave the session which issues the update as is.

the obvious things first. each insert/update/delete will start a transaction and as we did not commit our update we should be able to see the transaction:

SELECT addr
, ubafil
, ubablk
, ubarec
, status
, xid
, start_scn
FROM v$transaction
WHERE ses_addr = '000000006F680A28';
ADDR UBAFIL UBABLK UBAREC STATUS XID START_SCN
---------------- ---------- ---------- ---------- ------ ---------------- ----------
000000006DF332D8 3 483 11 ACTIVE 0100100022010000 958658

as expected there is an active transaction for our update statement. because each transaction will change some data ( if there is not rollback ) some undo must be generated. from the output above we can learn:
1. the datafile containing the undo block has file number 3
2. the block containing the undo is block number 483
3. the undo record is 11

before taking a look at the undo block lets see what information oracle has about our data block in v$bh ( which lists the buffer headers ):

col dirty for a5
col temp for a5
col ping for a5
col stale for a5
col stale for a5
col direct for a5
col status for a6
col objd for 99999
col block# for 999999
SELECT file#
, status
, dirty
, temp
, ping
, stale
, direct
, objd
, block#
FROM v$bh
WHERE objd = 14072
AND block# = 381;
FILE# STATUS DIRTY TEMP PING STALE DIREC OBJD BLOCK#
---------- ------ ----- ----- ----- ----- ----- ------ -------
4 cr N N N N N 14072 381
4 xcur N N N N N 14072 381

so v$bh reports two copies of the block header, one in “consitent read” the other in “exclusive” mode non of them dirty, temporary, pinged, stale or direct.

the exclusive one is the original block while the cr one is the copy we got when we issued the update. if you’d do a rollback of the update and resend the same update statement you’d receive another “cr” copy of the block. can you imagine why ? because readers do not block writers and writers do not block readers. for others sessions to be able to view the block as it really is ( the update is not yet commited ) oracle creates copies of blocks to construct a read consistant view of the data. you should keep this in mind when sizing your pools.

happily oracle provides a way to look at the contents of a block. lets do a dump of our undo block:

ALTER SESSION SET TRACEFILE_IDENTIFIER='my_dumps';
ALTER SYSTEM DUMP DATAFILE 3 BLOCK 483;

as the select from v$transaction tells us that our undo record is 11, we will take a look at this record only. if you open the trace file scroll down to your undo record you should see a similar output than this:

*-----------------------------
* Rec #0xb slt: 0x10 objn: 14072(0x000036f8) objd: 14072 tblspc: 4(0x00000004)
* Layer: 11 (Row) opc: 1 rci 0x00
Undo type: Regular undo Begin trans Last buffer split: No
Temp Object: No
Tablespace Undo: No
rdba: 0x00000000Ext idx: 0
flg2: 0
*-----------------------------
uba: 0x00c001e3.0036.09 ctl max scn: 0x0000.000e4630 prv tx scn: 0x0000.000e4631
txn start scn: scn: 0x0000.000ea07f logon user: 42
prev brb: 12583394 prev bcl: 0
KDO undo record:
KTB Redo
op: 0x03 ver: 0x01
compat bit: 4 (post-11) padding: 1
op: Z
Array Update of 1 rows:
tabn: 0 slot: 0(0x0) flag: 0x2c lock: 0 ckix: 0
ncol: 1 nnew: 1 size: 0
KDO Op code: 21 row dependencies Disabled
xtype: XAxtype KDO_KDOM2 flags: 0x00000080 bdba: 0x0100017d hdba: 0x0100017a
itli: 2 ispac: 0 maxfr: 4858
vect = 0
col 0: [ 2] c1 02

if you wonder why this is record 11, “#0xb” translates to 11:

select to_number('b','X') dec
from dual;
DEC
----------
11

what can we learn from here:
1) we can see the tablespace number and the object which is affected:

* Rec #0xb slt: 0x10 objn: 14072(0x000036f8) objd: 14072 tblspc: 4(0x00000004)

2) the start scn ( system change number ) is recorded as long with the user who started the transaction

txn start scn: scn: 0x0000.000ea07f logon user: 42

000e2f90 translates to 929670 and 42 is the user_id reported in all_users for the user who started the transaction.
3) one row will be updated:

Array Update of 1 rows:
tabn: 0 slot: 0(0x0) flag: 0x2c lock: 0 ckix: 0
ncol: 1 nnew: 1 size: 0
KDO Op code: 21 row dependencies Disabled
xtype: XAxtype KDO_KDOM2 flags: 0x00000080 bdba: 0x0100017d hdba: 0x0100017a
itli: 2 ispac: 0 maxfr: 4858
vect = 0
col 0: [ 2] c1 02

other people described the meanings of all the keywords and flags, so i will not just re-type them here ( you may, for example, want to check julian dyke’s website ).
for now, it is just important to notice that the first column of the table ( col 0 ) will be changed by 2 bytes ( [2] ) and the values is 1 ( 02 – 1 ). oracle prefixes numeric values ( c1 ), so we may ignore the first 2 bytes.

what else is happening in the background? as stated in earlier posts, things need protections. lets see if we can find some locks for our session:

SELECT sid
, id1
, id2
, lmode
, request
FROM v$lock
WHERE sid = 32;
SID ID1 ID2 LMODE REQUEST
---------- ---------- ---------- ---------- ----------
32 100 0 4 0
32 14072 0 3 0
32 65552 290 6 0

the lmodes translate to: 4 = Edition enqueue (AE), 3 = DML (TM), 6 = Transaction (TX).

i am not sure, but i think the edition enqueue lock is always present ( happy if someone can tell more about this ).
our update triggered two locks, one for the dml and one for the transaction.
column id1 of the dml references our table, which is object 14072. we locked one row in shared exclusive mode.
column id2 of the transaction lock references the sequence of the transaction ( v$transaction.xidsqn ). so, the other way around, you can check the locks for your session and from there get to the object and transaction.

in my case, the redo containing the update must be the current one ( as nothing else is happening on my database ). so i’ll find one copy of it ( as the logs are mirrored ):

select GROUP#
, THREAD#
, SEQUENCE#
, STATUS
from v$log;
GROUP# THREAD# SEQUENCE# STATUS
---------- ---------- ---------- ---------------
1 1 85 INACTIVE
2 1 86 INACTIVE
3 1 87 CURRENT
select member
from v$logfile
where group# = 3;
MEMBER
------------------------------------------------------------------------------------------------------------------------
+DB112_DATA_DG/db112/onlinelog/group_3.259.779116045
+DB112_ARCH_DG/db112/onlinelog/group_3.259.779116047

… dump the first one:

alter system dump logfile '+DB112_DATA_DG/db112/onlinelog/group_3.259.779116045';

… and you will find the redo record:

REDO RECORD - Thread:1 RBA: 0x000057.000000a2.0010 LEN: 0x01fc VLD: 0x0d
SCN: 0x0000.000ea149 SUBSCN: 1 04/24/2012 20:10:42
(LWN RBA: 0x000057.000000a2.0010 LEN: 0002 NST: 0001 SCN: 0x0000.000ea149)
CHANGE #1 TYP:2 CLS:1 AFN:4 DBA:0x0100017d OBJ:14072 SCN:0x0000.000ea0c2 SEQ:1 OP:11.19 ENC:0 RBL:0
KTB Redo
op: 0x11 ver: 0x01
compat bit: 4 (post-11) padding: 1
op: F xid: 0x0001.010.00000122 uba: 0x00c001e3.0036.0b
Block cleanout record, scn: 0x0000.000ea148 ver: 0x01 opt: 0x02, entries follow...
itli: 1 flg: 2 scn: 0x0000.000ea0c2
Array Update of 1 rows:
tabn: 0 slot: 0(0x0) flag: 0x2c lock: 2 ckix: 0
ncol: 1 nnew: 1 size: 0
KDO Op code: 21 row dependencies Disabled
xtype: XAxtype KDO_KDOM2 flags: 0x00000080 bdba: 0x0100017d hdba: 0x0100017a
itli: 2 ispac: 0 maxfr: 4858
vect = 0
col 0: [ 2] c1 03
CHANGE #2 TYP:0 CLS:17 AFN:3 DBA:0x00c00080 OBJ:4294967295 SCN:0x0000.000ea097 SEQ:1 OP:5.2 ENC:0 RBL:0
ktudh redo: slt: 0x0010 sqn: 0x00000122 flg: 0x0012 siz: 164 fbi: 0
uba: 0x00c001e3.0036.0b pxid: 0x0000.000.00000000
CHANGE #3 TYP:0 CLS:18 AFN:3 DBA:0x00c001e3 OBJ:4294967295 SCN:0x0000.000ea096 SEQ:2 OP:5.1 ENC:0 RBL:0
ktudb redo: siz: 164 spc: 6734 flg: 0x0012 seq: 0x0036 rec: 0x0b
xid: 0x0001.010.00000122
ktubl redo: slt: 16 rci: 0 opc: 11.1 [objn: 14072 objd: 14072 tsn: 4]
Undo type: Regular undo Begin trans Last buffer split: No
Temp Object: No
Tablespace Undo: No
0x00000000 prev ctl uba: 0x00c001e3.0036.09
prev ctl max cmt scn: 0x0000.000e4630 prev tx cmt scn: 0x0000.000e4631
txn start scn: 0x0000.000ea07f logon user: 42 prev brb: 12583394 prev bcl: 0 BuExt idx: 0 flg2: 0
KDO undo record:
KTB Redo
op: 0x03 ver: 0x01
compat bit: 4 (post-11) padding: 1
op: Z
Array Update of 1 rows:
tabn: 0 slot: 0(0x0) flag: 0x2c lock: 0 ckix: 0
ncol: 1 nnew: 1 size: 0
KDO Op code: 21 row dependencies Disabled
xtype: XAxtype KDO_KDOM2 flags: 0x00000080 bdba: 0x0100017d hdba: 0x0100017a
itli: 2 ispac: 0 maxfr: 4858
vect = 0
col 0: [ 2] c1 02

what can we see here:
1) there is our update “col 0: [ 2] c1 03”: change the first column of our table to 2 ( 03 – 1 ).
2) there is the same undo entry as in the undo block a little later in the trace: “col 0: [ 2] c1 02”.

why is there the same entry in the redo as we found it in the undo ? remember that we did not yet finish our transaction ( the commit is still pending ). so what will happen if the instance crashes now? when the instance comes up again after the crash oracle will scan the redo logs and apply all the changes that happened up to before the crash. and as we did not commit the transaction oracle must do a rollback, and that’s what the redo entry is about ( redo the undo ).

lets commit the update in the initial session and check what changed.

commit;

what happened to our transaction:

SELECT addr
, ubafil
, ubablk
, ubarec
, status
, xid
, start_scn
FROM v$transaction
WHERE ses_addr = '000000006F680A28';
no rows selected

as expected, the transaction finished.

our locks?

SELECT sid
, id1
, id2
, lmode
, request
FROM v$lock
WHERE sid = 32;
SID TYPE ID1 ID2 LMODE REQUEST
SID ID1 ID2 LMODE REQUEST
---------- ---------- ---------- ---------- ----------
32 100 0 4 0

gone, too.

the buffer headers ?

SELECT file#
, status
, dirty
, temp
, ping
, stale
, direct
, objd
, block#
FROM v$bh
WHERE objd = 14072
AND block# = 381;
FILE# STATUS DIRTY TEMP PING STALE DIREC OBJD BLOCK#
---------- --------------- ----- ----- ----- ----- ----- ------ -------
4 cr N N N N N 14072 381
4 xcur Y N N N N 14072 381

the dirty flag for our buffer changed to ‘Y’, so there was some modification. obvious, as we changed the row.

just for completion: if you do a rollback instead of the commit you’d get another redo entry describing the rollback.

again this is far from being complete but should give you some hints on what is happening behind the scenes. and, even more important: every change to the database is recorded in the log files ( and this includes undo generation ).

In main oracle, resources

Ouch! tns poison

April 29, 2012 — Leave a comment

there is a very old critical security hole regarding the oracle database. be aware of it, as there is no fix for it currently:

german

english

update: official workaround available now

In patching oracle, security

memory management

April 28, 2012 — Leave a comment

the performance of most applications depends on the efficiency that memory is managed by the operating system. because memory is not only used by applications but also by the kernel itself for storing its data and structures some portions of the memory are reserved by the kernel. the rest of the memory is called the dynamic memory.

memory gets addressed by the kernel in pages which are typically 4kb in size. you can check the pagesize in linux with the following command:

getconf PAGESIZE
4096

to track the status of all the pages there is a so called page descriptor. as pages may be used by several processes, the kernel itself or not used at all this descriptor contains the information the kernel needs when dealing with all the pages. one of the fields of the pages descriptor is called “lru” which stands for? same as in the oracle database: least recently used. the lru contains pointers to the least recently used double linked list of pages. remember: the same concepts, over and over again.

another fields describes the status of a page, which, for example, can be:

locked
dirty
active
…

note that oracle uses the word “dirty” in exactly the same manner: the page has been modified ( in oracle syntax it is a buffer )

when there is a request for allocating a page two things can happen:

there is enough free space and the request is successful immediately
before the allocation request may succeed some cleanup work must be done ( which usually blocks the request until finished ). you can compare this to the “buffer wait” events in oracle

because some critical requests can not be blocked there are exceptions to the second case ( for example when handling interrupts ). in these cases the request will be atomic and fail if no free pages are available. to minimize the chance for failing requests some pages are reserved by the kernel for these atomic requests. the amount of pages which will be reserved is calculated at system initialization and can be changed later by modifying a file in the proc filesystem:

cat /proc/sys/vm/min_free_kbytes
67584

oracle uses a similar concept for the shared_pool_reserved_size. this parameter defines some portion of the shared_pool to be reserved for large contiguous memory allocations. the goal is the same as with the linux kernel: try to block others as less as possible and make sure the allocation request succeeds.

as time goes by and requests for memory allocations come and go, memory will be allocated and released. this leads to a common issue with memory management: fragmentation. frequent allocations and releases may lead to situations that although there is enough free memory for a request the request will fail. this is because the remaining free memory is scattered through already allocated pages ( internal fragmentation ) or there is no free contiguous free memory that can satisfy the request ( external fragmentation ). and that is whats happening when you face the “ORA-04031: unable to allocate x bytes of shared memory” in the alertlog.

memory fragmentation

while internal fragmentation is waste of memory external fragmentation may lead to failing allocation requests. to avoid external fragmentation as much as possible the linux kernel groups the free pages into lists of 1,2,4,8,16,32,64,128,256,512 and 1024 contiguous chunks. if, for example, a request for 128 of contiguous memory page frames arrives the kernel will first check the 128 list for a free block. if a free block exists the memory gets allocated. if no free block exists in that list the next bigger list ( the 256 ) will be check for free blocks. if a free block exists there the kernel allocates 128 from the 256 page frames and inserts the remaining 128 page frames to the 128 list. this, if no free block is found in the next bigger list, will continue until the last group is reached ( the 1024 list ) and if this list is empty an error will be signaled.
the other way around the kernel tries to merge free blocks into bigger blocks when memory is released ( if the blocks have the same size and are located next to each other ).

there is much more to say about memory management ( e.g. slabs ), but this will be a topic for another post ….

In main linux, resources

nice ( but not essential ) to know (1)

April 26, 2012 — Leave a comment

if you ever wondered where oracle stores the statistic preferences one can set by using the dbms_stats.set_*_prefs procedures here you go:

global statistic preferences

table which holds the values: sys.optstat_hist_control$

example script to query the settings:

COLUMN SNAME FOR a30;
COLUMN SVAL1 FOR 9999;
COLUMN SVAL2 FOR a20;
COLUMN SPARE1 FOR 999999;
COLUMN SPARE2 FOR 999;
COLUMN SPARE3 FOR 999999;
COLUMN SPARE4 FOR a30;
COLUMN SPARE5 FOR a10;
COLUMN SPARE6 FOR a10;
SELECT sname
, sval1
, to_char(sval2,'DD.MM.YYYY HH24:MI:SS') sval2
, spare1
, spare2
, spare3
, spare4
, spare5
, spare6
FROM sys.optstat_hist_control$
ORDER BY 1
;

procedure which affects the table: dbms_stats.set_global_prefs, e.g.:

exec dbms_stats.set_global_prefs ( 'ESTIMATE_PERCENT', '11');

non-global ( user ) statistic preferences

table which holds the values: sys.optstat_user_prefs$

example script to query the settings:

COLUMN owner FOR a20;
COLUMN object_name FOR a20;
COLUMN object_type FOR a20;
COLUMN pname FOR a30;
COLUMN valnum FOR 99999;
COLUMN valchar FOR a30;
COLUMN chgtime FOR a20;
COLUMN spare1 FOR 999;
SELECT do.owner
, do.object_name
, do.object_type
, o.pname
, o.valnum
, o.valchar
, to_char(o.chgtime,'DD.MM.YYYY HH24:MI:SS') chgtime
, o.spare1
FROM sys.optstat_user_prefs$ o
, dba_objects do
WHERE o.obj# = do.object_id
;

procedures which affects the table: dbms_stats.set_global_prefs, dbms_stats.set_schema_prefs, dbms_stats.set_database_prefs,e.g.:

exec dbms_stats.set_table_prefs ( USER,'T1','GRANULARITY','ALL');
exec dbms_stats.set_schema_prefs ( USER,'ESTIMATE_PERCENT','11' );
exec dbms_stats.set_database_prefs ( 'ESTIMATE_PERCENT', '12', FALSE );

In quick&dirty helpers, oracle

dear kernel, please let me do my work …

April 22, 2012 — Leave a comment

until now we had an introduction to processes, how they are managed, what signals are and what they are used for, how the linux kernel ( and oracle ) uses double linked list to quickly look up memory structures and how critical regions like shared memory can be protected. this post gives an introduction to timing and process scheduling.

as the cpu can execute only one process at a time but because maybe hundreds or thousands of processes want to do their work the kernel must provide a mechanism to decide which process to run next ( process switching ). this is the task of the scheduler. for being able to do what it does, the scheduler must be able to make decisions, and the decisions are based on time and priorities.

lots and lots of work behind the scenes is driven by time measurements. consider cronjobs, for example. without being able to measure time they would not work. in short the kernel must be able to keep the current time and to provide a mechanism to notify programs when a specific interval has elapsed.

on the one hand there is the real time clock ( accessible through the /dev/rtc interface ) which is a special chip that continues to tick even if the computer is powered off ( there is a small battery for this chip ). the real time clock is used by linux to derive the date and time.

on the other hand there are several other mechanisms which can be used for timing:

one of the time related activities the kernel must perform is to determine how long a process has been running. each process is given a time slot in which it may run, which is called a quanta. if the quantum expires and the process did not terminate a process switch may occur ( another process is selected for execution ). these processes are called expired. active processes are those which did not yet consume their quantum.
additionally each process has a priority assigned, which is used by the scheduler to decide how appropriate it is to let the process do its work on the cpu.

in general processes can be divided in three classes:

interactive: typical interactive processes are those which respond to keyboard and mouse inputs of an end user. as an user wants to see quick responses, for example when editing text, these processes must be woken up quickly
batch: batch processes do not interact with the user and often run in the background.
real-time: real-time processes have very strong scheduling requirements and should not be blocked by processes with lower priorities.

in general the scheduler will give more attention to interactive processes than to batch processes, although this must not always be true.

one way we can change the base priority of processes from the command line is by using the “nice” command:

nice -19 vi

if you check the process without the nice call:

ps -aux | grep vi
oracle 4185 0.5 0.0 5400 1504 pts/0 S+ 10:51 0:00 vi

… and compare it to when you call vi with a nice value:

ps -aux | grep vi
oracle 4194 1.6 0.0 5400 1496 pts/0 SN+ 10:52 0:00 vi

.. you will see that “S+” changes to “SN+” ( the “N” stands for “low-priority (nice to other users)”

processes in linux are preemptable, which means that higher priority processes may suspend lower priority processes when they enter the running state. another reason a process can be preempted is when its time quantum expires.

consider this example: a user is writing an email while copying music from a cd to her computer. the email client is considered an interactive program while the copy job is considered a batch program. each time the user presses a key on her keyboard an interrupt occurs and the scheduler selects the email program for execution. but because users tend to think when writing emails there is plenty of time ( regarding the cpu ) between the key presses to wake up the copy job and let it do its work.

the time a process is allowed to be on a cpu, the quantum, is derived from a so called “static priority” which can be in the range of 100 to 139 ( with 100 being the highest priority and 139 being the lowest ). the higher the priority the more time the process is granted ( which ranges from 800ms for the highest priority to 5ms for the lowest priority ). in addition to the static priority there is a “dynamic priority” for each process ( again ranging from 100 to 139 ). without going too much into detail again: the dynamic priority is the one the scheduler uses for its decisions. as the name suggest, this priority may change over time ( depending on the average sleep time of a process ). processes with longer sleep times usually get a bonus ( the priority will be increased ) while processes with lower sleep times will get a penalty ( the priority will be decreased ). the average sleep time is also used by the scheduler to decide if processes are interactive or batch.

recall the post about double linked lists. the most important data structure used by the scheduler is the runqueue, which in fact is another linked list. this list links together all the process descriptors of the processes which want to run ( there is one runqueue per cpu ). one process can be in one runqueue only, but processes may migrate to others runqueues if the load between the cpus becomes unbalanced.

what to keep in mind: as only one process can run on one cpu at a time the scheduler decides which process to run next and which processes to suspend in case higher priority processes enter the running state. in general interactive processes are favored over batch processes and real-time processes should not be blocked by lower priority processes.

In main linux, processes, resources

kill is not really about killing

April 17, 2012 — Leave a comment

the previous post about SIGSEGV and the ORA-07445 introduced signals and how they are used by the kernel to notify processes about events.

many times I see people using the “kill -9” to terminate processes. no questions, this works most of the time, but a lot of people are not aware what they are actually doing when firing this command. in my opinion, kill is a really bad name for this command, because what kill is doing is not necessarily kill processes, but send signals to processes ( of which 9, or SIGKILL is probably the most well known ). a much better name, for example, would be “sig” or “signal”.

the list of signals one can use in regards to kill can be printed with:

kill -l
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL
5) SIGTRAP 6) SIGABRT 7) SIGBUS 8) SIGFPE
9) SIGKILL 10) SIGUSR1 11) SIGSEGV 12) SIGUSR2
13) SIGPIPE 14) SIGALRM 15) SIGTERM 16) SIGSTKFLT
17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP
21) SIGTTIN 22) SIGTTOU 23) SIGURG 24) SIGXCPU
25) SIGXFSZ 26) SIGVTALRM 27) SIGPROF 28) SIGWINCH
29) SIGIO 30) SIGPWR 31) SIGSYS 34) SIGRTMIN
35) SIGRTMIN+1 36) SIGRTMIN+2 37) SIGRTMIN+3 38) SIGRTMIN+4
39) SIGRTMIN+5 40) SIGRTMIN+6 41) SIGRTMIN+7 42) SIGRTMIN+8
43) SIGRTMIN+9 44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12
47) SIGRTMIN+13 48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14
51) SIGRTMAX-13 52) SIGRTMAX-12 53) SIGRTMAX-11 54) SIGRTMAX-10
55) SIGRTMAX-9 56) SIGRTMAX-8 57) SIGRTMAX-7 58) SIGRTMAX-6
59) SIGRTMAX-5 60) SIGRTMAX-4 61) SIGRTMAX-3 62) SIGRTMAX-2
63) SIGRTMAX-1 64) SIGRTMAX

here you can see, what 9 really means, it is SIGKILL. the ( perhaps ) dangerous about SIGKILL is, that the process will not be allowed to do any cleanup ( for example releasing resources ). as this signal can not be ignored or caught, the process will terminate immediately ( you can compare it to the “shutdown abort” command of the oracle database ).

the way one should terminate processes is to use the SIGTERM (15) signal, the default parameter for kill. this allows the process to do its cleanup work and to safely terminate ( although this is dependent on how the process or the applications handles the signal ).

once interesting thing you can do by sending a signal to a process with kill is to force, for example, the ssh daemon to re-read its configuration without closing the active ssh sessions. you can try this with:

kill -HUP [PID_OF_SSHD]

be aware that this must not be true for other daemons, as this depends on how the program was implemented. the default behavior for SIGHUP ( hang up ) is abnormal termination. originally the SIGHUP comes from serial connections ( e.g. modems ) which indeed did a hang up when the user closed the connection.

closing the loop to the previous post about SIGSEGV and the ORA-07445 you can force an ORA-07445 by sending the SIGSEGV signal, for example, to the dbwriter process ( I hope there is no need to say: you should not try this on a production system ):

ps -ef | grep dbw
oracle 4560 1 0 13:18 ? 00:00:00 ora_dbw0_DB112
kill -11 4560

…which will result in the following errors reported in the alertlog:

Exception [type: SIGSEGV, unknown code] [ADDR:0xC41] [PC:0x32E7CD46BA, semtimedop()+10] [exception issued by pid: 3137, uid: 0] [flags: 0x0, count: 1]
Errors in file /oradata/DB112/admin/diag/rdbms/db112/DB112/trace/DB112_dbw0_4560.trc (incident=26003):
ORA-07445: exception encountered: core dump [semtimedop()+10] [SIGSEGV] [ADDR:0xC41] [PC:0x32E7CD46BA] [unknown code] []
Incident details in: /oradata/DB112/admin/diag/rdbms/db112/DB112/incident/incdir_26003/DB112_dbw0_4560_i26003.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Apr 16 13:23:36 2012

for simulating a power failure try to send the SIGPWR to a sqlplus session. this will result in:

SQL>
SQL>
SQL> Power failure

conclusion: signals are one more important concept when it comes to understanding the operating system. by sending signals processes are notified about events and are given the chance to take the necessary actions ( if an appropriate handler is present ). when using the kill command you do not necessarily kill a process. what you are doing is sending a signal.

as usual, this is just an introduction and far from being complete ….

In main linux, processes

SIGSEGV and the ORA-07445

April 15, 2012 — Leave a comment

when working with the oracle database sooner or later you will face the ORA-07445 error reported in a trace file and the alertlog.

there is plenty of documentation about this error on oracle support and the web. in short: this is an unhandled exception in the oracle code ( in contrast the ORA-00600 errors are handled exceptions ). so why do I want to write about it ? because this is another example where you can map database behavior to the operating system. one common type of the ORA-07445 is this one: “type: SIGSEGV”.

what is this about and what does it stand for ?

software contains bugs. this is true for the linux kernel, this is true for the oracle database and this is true for probably all other software. to deal with unexpected behavior and to protect the system there must be some some sort of exception handler which processes/catches the exceptions once they occur and does the necessary steps to recover from the exceptions. the lowest level exceptions are raised by the CPU and must be handled by the operating system’s kernel. these exception are predefined and the kernel provides an exception handler for each of them.

some of them are:

division by zero
segment not present
stack segment fault
invalid opcode
…

you can, for example, check the intel documentation for a complete list of defined exceptions.

when an exception is raised the corresponding exception handler sends a signal to the process which caused the exception. and this is exactly what the SIGSEGV is: it is a signal. signals, for example ( the following list is not complete ), can be:

SIGSEGV: page faults, overflows
SIGFPE: divide error
SIGBUS: stack segments fault
SIGILL: invalid opcode

most of these exceptions can only occur when the kernel is in user mode, that is, when executing tasks from user programs ( oracle in this case ). there are two ways in which the processor can halt ( or interrupt ) process execution:

interrupts, which are ansynchron and typically triggered by I/O devices
exceptions, which are synchron and triggered by the processor when it detects predefined conditions while executing

when the processor halts process execution it switches to the handler routine ( each routine is defined in the interrupt description table, IDT ). once the handler routine has executed its tasks control is given back to the interrupted process.

unfortunately there is not much you can do about it. you can try to find a workaround with oracle support ( e.g. by setting some database parameters or applying a patch ) or check the generated dumps to get some hints on what exactly caused the exception.

a recent search on oracle support returned about 2500 results for the term SIGSEGV. you see, this is not an unusual signal …

In main linux, oracle, processes

shared memory – interprocess communication (2)

April 12, 2012 — 2 Comments

as mentioned in the previous post about semaphores there are more things to consider when it comes to interprocess communication. as semaphores are used to protect critical regions, there must be some critical regions to protect and this is the shared memory oracle uses for its communication.

to give an example on how the shared memory addressing works we will take a look at what happens when the database starts up.
for this you’ll need two sessions to a test infrastructure ( one as the database owner, the other as root ).

session one ( oracle ):
connect to sqlplus as sysdba make sure you shutdown the database ( do not exit sqlplus once the database is down ):

sqlplus / as sysdba
shutdown immediate

session two ( root ): discover the PID for then sqlplus session above …

ps -ef | grep sqlp
oracle    3062  3036  0 09:49 pts/1    00:00:00 sqlplus

… check the shared memory segments and trace the sqlplus PID from above:

ipcs -m
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x7401003e 1310720    root      600        4          0                       
0x74010014 1998849    root      600        4          0                       
0x00000000 2359298    root      644        80         2                       
0x74010013 1966083    root      600        4          0                       
0x00000000 2392068    root      644        16384      2                       
0x00000000 2424837    root      644        280        2                       
0x00000000 2490374    grid      640        4096       0                       
0x00000000 2523143    grid      640        4096       0                       
0x8e11371c 2555912    grid      640        4096       0
# start the trace
strace -o db_startup.log -fp 3062

it is important to specify the “-f” flag for the strace call. this will tell strace to follow the child processes spawned.

in session one startup the database…

startup

… and stop the tracing in the root session once the database is up and re-check the shared memory segments.

ipcs -m
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x7401003e 1310720    root      600        4          0                       
0x74010014 1998849    root      600        4          0                       
0x00000000 2359298    root      644        80         2                       
0x74010013 1966083    root      600        4          0                       
0x00000000 2392068    root      644        16384      2                       
0x00000000 2424837    root      644        280        2                       
0x00000000 2490374    grid      640        4096       0                       
0x00000000 2523143    grid      640        4096       0                       
0x8e11371c 2555912    grid      640        4096       0                       
0x00000000 3538953    oracle    640        4096       0                       
0x00000000 3571722    oracle    640        4096       0                       
0x3393b3a4 3604491    oracle    640        4096       0

as you can see, three more segments appeared after the database started up.

you’ll probably noticed some trace output on the screen similar to this:

Process 3468 detached
Process 3470 attached (waiting for parent)
Process 3470 resumed (parent 3409 ready)
Process 3471 attached (waiting for parent)
Process 3471 resumed (parent 3470 ready)
Process 3469 detached
Process 3470 detached

this is because of the “-f” flag given to strace.
the complete trace output is now available in the db_startup.log trace file and we are ready to take a look at it.

the first thing that catches the eye are the various references to the “/proc” filesystem. in may trace file there are 1213 calls to it. you can check this with:

grep "/proc/" db_startup.log | wc -l

take a look at the previous post which introduces the “/proc” filesystem for more information. for the scope of this post just notice how much depends on it.

the actual startup of the database is triggered by the following line:

execve("/opt/oracle/product/base/11.2.0.3/bin/oracle", ["oracleDB112", "(DESCRIPTION=(LOCAL=YES)(ADDRESS"], [/* 22 vars */]) = 0

this is the call to the oracle binary ( execve executes the binary ) with 22 arguments omitted. from now on the oracle instance starts up.

the calls important to the shared memory stuff are the following:

brk: changes a data segment’s size
mmap, munmap: maps/unmaps files or devices into memory
mprotect: sets protection on a region of memory
shmget: allocates a shared memory segment
shmat, shmdt: performs attach/detach operations on shared memory
get_mempolicy: return NUMA memory policies for a process
semget: get a semaphore identifier
semctl: perform control operations on a semaphore
semop, semtimedop: perform sempahore operations

for each of the above commands you can check the man-pages for more information.
as the trace file is rather large and a lot of things are happening i will focus on the minimum ( this is not about re-engineering oracle :) ):

let’s check the keys returned by the ipcs command above:

egrep "3538953|3571722|3604491" db_startup.log
...
5365  shmget(IPC_PRIVATE, 4096, IPC_CREAT|IPC_EXCL|0640) = 3538953
5365  shmget(IPC_PRIVATE, 4096, IPC_CREAT|IPC_EXCL|0640) = 3571722
5365  shmget(0x3393b3a4, 4096, IPC_CREAT|IPC_EXCL|0640) = 3604491
...

as you can see the identifiers returned by the shmget call ( 3604491,3571722,3538953 ) correspond to the ones reported by ipcs. you wonder about the size of 4096 bytes ? this is because memory_target/memory_max_target is in use by the instance. if the database is configured using sga_target/sga_max_target you would see the actual size. let’s check this:

su - oracle
sqlplus / as sysdba
alter system reset memory_max_target scope=spfile;
alter system reset memory_target scope=spfile;
alter system set sga_max_size=256m scope=spfile;
alter system set sga_target=256m scope=spfile;
alter system set pga_aggregate_target=24m scope=spfile;
startup force;
exit;
# re-check the shared memory segments
ipcs -m
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x00000000 2359298    root      644        80         2                       
0x00000000 2392068    root      644        16384      2                       
0x00000000 2424837    root      644        280        2                       
0x00000000 2490374    grid      640        4096       0                       
0x00000000 2523143    grid      640        4096       0                       
0x8e11371c 2555912    grid      640        4096       0                       
0x00000000 3801097    oracle    640        8388608    25                      
0x00000000 3833866    oracle    640        260046848  25                      
0x3393b3a4 3866635    oracle    640        2097152    25

the “260046848” corresponds to the sga size of 256m and the nattch column shows that 25 processes are attached to it. you can double check the 25
attached processes if you want:

ps -ef | grep DB112 | grep -v LISTENER | grep -v grep | wc -l

let’s return to the memory_target/memory_max_target configuration. as oracle puts together all the memory junks ( pga and sga ) the management of memory changes to the virtual shared memory filesystem ( tmpfs ). unfortunately this is not visible with the ipcs command.
but you can map your memory_* sizes to the shm filesystem:

ls -la /dev/shm/ | grep -v "+ASM"
total 466100
drwxrwxrwt  2 root   root        2640 Apr 10 13:09 .
drwxr-xr-x 10 root   root        3400 Apr 10 09:44 ..
-rw-r-----  1 oracle asmadmin 4194304 Apr 10 13:27 ora_DB112_3932169_0
-rw-r-----  1 oracle asmadmin 4194304 Apr 10 13:09 ora_DB112_3932169_1
-rw-r-----  1 oracle asmadmin 4194304 Apr 10 13:09 ora_DB112_3964938_0
-rw-r-----  1 oracle asmadmin 4194304 Apr 10 13:20 ora_DB112_3964938_1
-rw-r-----  1 oracle asmadmin 4194304 Apr 10 13:09 ora_DB112_3964938_10
-rw-r-----  1 oracle asmadmin 4194304 Apr 10 13:20 ora_DB112_3964938_11
-rw-r-----  1 oracle asmadmin 4194304 Apr 10 13:10 ora_DB112_3964938_12

note that i have excluded the ASM stuff here. in my case each segment ( or granule ) is 4mb of size ( this depends on the avaible memory of the system ) and the sum of all the segments should get you near to your memory_* configuration.

as ipcs can not tell you much here there are other commands to use. if you want to know which process has a memory granule open:

fuser -v /dev/shm/ora_DB112_4358154_49
                     USER        PID ACCESS COMMAND
/dev/shm/ora_DB112_4358154_49:
                     oracle     6626 ....m oracle
                     oracle     6628 ....m oracle
                     oracle     6630 ....m oracle
                     oracle     6634 ....m oracle
                     oracle     6636 ....m oracle
                     oracle     6638 ....m oracle
                     oracle     6640 ....m oracle
                     oracle     6642 ....m oracle
                     oracle     6644 ....m oracle
                     oracle     6646 ....m oracle
                     oracle     6648 ....m oracle
                     oracle     6650 ....m oracle
                     oracle     6652 ....m oracle
                     oracle     6654 ....m oracle
                     oracle     6656 ....m oracle
                     oracle     6658 ....m oracle
                     oracle     6662 ....m oracle
                     oracle     6669 ....m oracle
                     oracle     6744 ....m oracle
                     oracle     6767 ....m oracle
                     oracle     6769 ....m oracle
                     oracle     6791 ....m oracle
                     oracle     7034 ....m oracle

or the other way around, if you want to know which files are opened by a specific process:

ps -ef | grep pmon | grep -v "ASM"
oracle    6626     1  0 13:40 ?        00:00:05 ora_pmon_DB112
root      7075  5338  0 14:33 pts/0    00:00:00 grep pmon
# use the pmap command on the PID
pmap 6626
6626:   ora_pmon_DB112
0000000000400000 183436K r-x--  /opt/oracle/product/base/11.2.0.3/bin/oracle
000000000b922000   1884K rwx--  /opt/oracle/product/base/11.2.0.3/bin/oracle
000000000baf9000    304K rwx--    [ anon ]
0000000010c81000    660K rwx--    [ anon ]
0000000060000000      4K r-xs-  /dev/shm/ora_DB112_4325385_0
0000000060001000   4092K rwxs-  /dev/shm/ora_DB112_4325385_0
0000000060400000   4096K rwxs-  /dev/shm/ora_DB112_4325385_1
0000000060800000   4096K rwxs-  /dev/shm/ora_DB112_4358154_0
0000000060c00000   4096K rwxs-  /dev/shm/ora_DB112_4358154_1
0000000061000000   4096K rwxs-  /dev/shm/ora_DB112_4358154_2
0000000061400000   4096K rwxs-  /dev/shm/ora_DB112_4358154_3
0000000061800000   4096K rwxs-  /dev/shm/ora_DB112_4358154_4
0000000061c00000   4096K rwxs-  /dev/shm/ora_DB112_4358154_5
0000000062000000   4096K rwxs-  /dev/shm/ora_DB112_4358154_6
...

if you have troubles starting up your instance with this configuration ( ORA-00845 ) check the size of the virtual filesystem:

df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/hdc1              28G   14G   12G  54% /
tmpfs                 741M  456M  286M  62% /dev/shm

depending on your configuration ( memory_* or sga_* parameters ) the way that memory is managed changes ( from System V to POSIX, to be exact ).

lots and lots of information. not all of it is important to keep in mind. but what you should remember:
there are several processes and memory segments that make up the oracle instance. as several processes are attached to the same memory regions there must be a way to protect them from concurrent access ( think of semaphores ) … and oracle heavily depends on shared memory. if you scroll through the trace file you’ll notice that there are thousands of operations going on when an oracle instance starts up. imagine what is going on if the instance is under heavy workload and lots and lots of things need protection.

ps: for those interested:

there is plenty of more interesting stuff which you can find in the db_startup.log trace, for example:

writing the audit files:

grep -i adump db_startup.log  | grep -v ASM
3404  open("/oradata/DB112/admin/adump/DB112_ora_3404_2.aud", O_RDWR|O_CREAT|O_EXCL, 0660) = 10
3404  write(10, "/oradata/DB112/admin/adump/DB112"..., 47) = 47
3444  open("/oradata/DB112/admin/adump/DB112_ora_3444_1.aud", O_RDWR|O_CREAT|O_EXCL, 0660) = -1 EEXIST (File exists)
3444  open("/oradata/DB112/admin/adump/DB112_ora_3444_2.aud", O_RDWR|O_CREAT|O_EXCL, 0660) = 8
3444  write(8, "/oradata/DB112/admin/adump/DB112"..., 47) = 47
3481  open("/oradata/DB112/admin/adump/DB112_ora_3481_1.aud", O_RDWR|O_CREAT|O_EXCL, 0660 
3481  write(8, "/oradata/DB112/admin/adump/DB112"..., 47) = 47

writing the alert.log:

grep -i "alert_DB112.log" db_startup.log
3404  lstat("/oradata/DB112/admin/diag/rdbms/db112/DB112/trace/alert_DB112.log", {st_mode=S_IFREG|0640, st_size=110201, ...}) = 0
3404  open("/oradata/DB112/admin/diag/rdbms/db112/DB112/trace/alert_DB112.log", O_WRONLY|O_CREAT|O_APPEND, 0660) = 5
3404  lstat("/oradata/DB112/admin/diag/rdbms/db112/DB112/trace/alert_DB112.log", {st_mode=S_IFREG|0640, st_size=110260, ...}) = 0
3404  open("/oradata/DB112/admin/diag/rdbms/db112/DB112/trace/alert_DB112.log", O_WRONLY|O_CREAT|O_APPEND, 0660) = 11

reading the oracle message files:

grep msb db_startup.log
db_startup.log:5438  open("/opt/oracle/product/base/11.2.0.3/oracore/mesg/lrmus.msb", O_RDONLY) = 18
db_startup.log:5438  open("/opt/oracle/product/base/11.2.0.3/oracore/mesg/lrmus.msb", O_RDONLY) = 18
db_startup.log:5430  open("/opt/oracle/product/base/11.2.0.3/rdbms/mesg/oraus.msb", O_RDONLY 
db_startup.log:5494  open("/opt/oracle/product/base/11.2.0.3/rdbms/mesg/oraus.msb", O_RDONLY

getting sempahores:

grep semget db_startup.log 
5365  semget(IPC_PRIVATE, 1, IPC_CREAT|IPC_EXCL|0600) = 1081346
5365  semget(IPC_PRIVATE, 124, IPC_CREAT|IPC_EXCL|0666) = 1114114
5365  semget(IPC_PRIVATE, 124, IPC_CREAT|0660) = 1146882
5365  semget(0x710dfe10, 0, 0)          = -1 ENOENT (No such file or directory)
5365  semget(0x46db3f80, 0, 0)          = -1 ENOENT (No such file or directory)
5365  semget(0x9ae46084, 0, 0)          = -1 ENOENT (No such file or directory)
5365  semget(0xf6dcc368, 0, 0)          = -1 ENOENT (No such file or directory)
5365  semget(0x710dfe10, 124, IPC_CREAT|IPC_EXCL|0640) = 1179650

some exadata stuff:

3404  open("/etc/oracle/cell/network-config/cellinit.ora", O_RDONLY) = -1 ENOENT (No such file or directory)

and … and …

In main linux, processes, resources

Daniel Westermann's Blog

an educational example to demonstrate the oracle database filesystem ( dbfs )

the oracle local registry ( olr ) – what it is and how one can manage it

what one can learn from a simple one row update

Ouch! tns poison

memory management

nice ( but not essential ) to know (1)

dear kernel, please let me do my work …

kill is not really about killing

SIGSEGV and the ORA-07445

shared memory – interprocess communication (2)

Archives

Recent Posts

RSS Links

Top Posts & Pages

Search

Follow Blog via Email

tags