Archives For resources

consider reading this before jumping on this train:

Oracle’s In-Memory Database: The True Cost Of Licensing

if you are on a distribution based on redhat 6 there are some interesting tools which can help in fine tuning the system for different workloads. if you’d like to put a database on your server, for example, there are various settings you might want to adjust ( kernel, disks, network … ). if you use the system as a workstation other settings might make more sense ( power saving settings, for example ).

as this blog is mainly about databases i’ll focus there, obviously. first of all you’ll need the software:

yum install tuned

as tuned is a service you’ll need to enable and start it.

service tuned start
chkconfig tuned on
chkconfig --list | grep tuned

let’s see what happened. the main configuration file for tuned is located in /etc:

/etc/tuned.conf

if you take a look at the file you will find a main section and various plugin sections ( e.g. DiskTuning or CPUTuning ).
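a minimal sketch of what the file might look like ( the section names follow the text above, the values are only assumptions, check your own /etc/tuned.conf ):

# /etc/tuned.conf ( excerpt, example values only )
[main]
# how often the tuning plugins re-evaluate the system, in seconds
interval=10

[DiskTuning]
enabled=True

[CPUTuning]
enabled=True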

next several default tuning profiles have been created:

ls -l /etc/tune-profiles/
active-profile
default
desktop-powersave
enterprise-storage
functions
laptop-ac-powersave
laptop-battery-powersave
latency-performance
server-powersave
spindown-disk
throughput-performance
virtual-guest
virtual-host

each of these directories contains the same configuration files ( ktune.sh, ktune.sysconfig, sysctl.ktune, tuned.conf ) which specify the various settings which will be set once the profile becomes active.

you can list the available profiles with the tuned-adm command, too:

tuned-adm list
Available profiles:
- server-powersave
- laptop-ac-powersave
- latency-performance
- default
- desktop-powersave
- enterprise-storage
- virtual-guest
- virtual-host
- spindown-disk
- laptop-battery-powersave
- throughput-performance
Current active profile: default

… which additionally tells us that the default profile is the one which is active at the moment.

another way to check the active profile is:

tuned-adm active

if you want to create a new profile just copy an existing one and adjust the settings you want to:

cp -pr /etc/tune-profiles/enterprise-storage/ /etc/tune-profiles/my_profile
tuned-adm list | grep my_profile
- my_profile

for databases you’ll probably need maximum throughput, so let’s activate the throughput-performance profile:

tuned-adm profile throughput-performance
Stopping tuned:                                            [  OK  ]
Switching to profile 'throughput-performance'
Applying ktune sysctl settings:
/etc/ktune.d/tunedadm.conf:                                [  OK  ]
Calling '/etc/ktune.d/tunedadm.sh start':                  [  OK  ]
Applying sysctl settings from /etc/sysctl.conf
Applying deadline elevator: sda                            [  OK  ]
Starting tuned:                                            [  OK  ]

according to the documentation and the output from above this should change the io scheduler to deadline ( which is recommended for databases ). is this true ?

cat /sys/block/sda/queue/scheduler 
noop anticipatory [deadline] cfq

seems to work. does this survive a reboot?

reboot
tuned-adm active
cat /sys/block/sda/queue/scheduler
noop anticipatory [deadline] cfq

very good. no need to adjust this in the bootloader anymore.
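for reference: without tuned you would typically pin the scheduler by appending the elevator parameter to the kernel line in the bootloader configuration, roughly like this ( kernel version and root device are placeholders ):

# /boot/grub/grub.conf ( excerpt, example only )
kernel /vmlinuz-2.6.32-xxx ro root=/dev/mapper/vg_root-lv_root elevator=deadline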

if you want to check which kernel settings have been adjusted by activating this profile just have a look at the configuration files:

cat /etc/tune-profiles/throughput-performance/sysctl.ktune

include any kernel setting you need in there and you’re fine.
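for example, to add a setting to the custom profile created above and make it active, something like this should do ( the parameter and value are only an illustration ):

# append a kernel setting to the custom profile ( example value only )
echo "vm.swappiness = 10" >> /etc/tune-profiles/my_profile/sysctl.ktune
# switch to the custom profile so the setting gets applied
tuned-adm profile my_profile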

as profiles may be switched on the fly, activating different profiles at different times of the day might make sense, too ( see the sketch below ).
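a minimal cron based sketch of what this could look like ( profile names, times and the tuned-adm path are assumptions ):

# /etc/crontab ( excerpt, example only ): throughput during business hours, powersave at night
0 7  * * * root /usr/sbin/tuned-adm profile throughput-performance
0 19 * * * root /usr/sbin/tuned-adm profile server-powersave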

if you are on linux/solaris and sar is configured and running on your system there is a nice utility called kSar which can be used to create graphs of the various statistics sar gathered. this can be very handy if you are looking for peaks and want a quick overview of what happened on your system.

installing kSar is just a matter of unzipping the provided package and either executing the run.sh script or using java directly to execute the jar file:

java -jar kSar.jar

this will start ksar and you may load the sar files to have a look at the statistics:
( screenshot: ksar standard graph view )

another option is to generate a pdf:

java -jar kSar.jar -input '/var/log/sa/sarXX' -outputPDF today.pdf

( screenshot: the generated pdf )

and even faster: create a bash function and an alias in your .bashrc:

# generate a pdf report from the given sar file
ksarfunc() {
  java -jar PATH_TO/kSar.jar -input "$1" -outputPDF today.pdf
}
alias ksar='ksarfunc'

… and you will be able to quickly generate a pdf for a specific sar file:

ksar /path/to/sar/file
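and if you want reports for all the sar files collected so far, a small loop around the same call should do ( file locations and naming may differ on your system ):

# generate one pdf per collected text sar file ( sketch only )
for f in /var/log/sa/sar[0-9]*; do
  java -jar PATH_TO/kSar.jar -input "$f" -outputPDF "$(basename "$f").pdf"
done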

a much more comprehensive tutorial for sar and ksar can be found here.

just a little hint that there is an alternative to top, which is htop. pre-compiled packages are available for most distributions.

( screenshot: htop.sourceforge.net/htop-1.0.2-io.png )

check htop’s sourceforge page for a tiny comparison between htop and top.

another ouch with GI 11.2.0.3 on solaris 10 sparc 64bit: if one of your cluster nodes restarts and you cannot find any evident reason for it apart from some entries like these in the logs:

[cssd(1084)]CRS-1612:Network communication with node1 node (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.258 seconds
[cssd(1084)]CRS-1625:Node node1, number 1, was manually shut down
[cssd(1084)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node2 .
[ctssd(1117)]CRS-2407:The new Cluster Time Synchronization Service reference node is host node2.
[crsd(1522)]CRS-5504:Node down event reported for node 'node1'.
[cssd(1084)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node1 node2 .

… and:

[ CSSD][20](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 1 configured voting disks available, need 1
[ CSSD][20]###################################
[ CSSD][20]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread
[ CSSD][20]###################################
[ CSSD][20](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
[ CSSD][20]

you probably hit bug 13869978. this seems to happen only if you are on external redundancy for the cluster diskgroup and therefore only one voting disk was created.

two solutions are available:

  • migrate the voting disks to an asm mirrored diskgroup ( normal or high redundancy, a minimal sketch follows below )
  • or apply PSU4 on top of 11.2.0.3
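a minimal sketch of the voting disk migration, assuming a normal redundancy diskgroup called +VOTE already exists ( run from the grid home as root, the diskgroup name is just an example ):

# check where the voting files currently live
crsctl query css votedisk
# move them to the mirrored diskgroup
crsctl replace votedisk +VOTE
# verify the new location
crsctl query css votedisk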

there seems to be the same issue on linux.

if you want to manage vips in the grid infrastructure which are not on the default network and you get this: “CRS-2534: Resource type ‘ora.cluster_vip_net2.type’ is not registered” don’t panic, it is easy to fix. basically you need to create the “ora.cluster_vip_net2.type” type before adding the vip with appvipcfg:

./srvctl add network -k 2 -S x.x.x.0/255.255.255.0/igb0
./crsctl start resource ora.net2.network
./crsctl add type ora.cluster_vip_net2.type -basetype ora.cluster_vip.type
./appvipcfg create -network=2 -ip=x.x.x.x -vipname=myvip1 -user=root
./crsctl start resource myvip1 -n server1
./appvipcfg create -network=2 -ip=x.x.x.x -vipname=myvip2 -user=root
./crsctl start resource myvip2 -n server2
./crsctl stat res -t
./crsctl modify res 'myvip1' -attr "HOSTING_MEMBERS=server1 server2"
./crsctl modify res 'myvip2' -attr "HOSTING_MEMBERS=server1 server2"
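to double check the new resources afterwards, something like this should show the type and the hosting members ( resource names as used above ):

./crsctl stat res myvip1 -p | grep -E 'TYPE|HOSTING_MEMBERS'
./crsctl stat res myvip2 -p | grep -E 'TYPE|HOSTING_MEMBERS'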

not sure, but I think this is a bug as appvipcfg should manage this.

the above is valid for 11.2.0.3 on Solaris SPARC 64bit

linux ( as well as most of the unixes ) provides the ability to integrate many different file systems at the same time. to name a few of them:

  • ext2, ext3, ext4
  • ocfs, ocfs2
  • reiserfs
  • vxfs
  • btrfs
  • dos, ntfs

although each of them provides different features and was developed with different purposes in mind the tools to work with them stay the same:

  • cp
  • mv
  • cd

the layer which makes this possible is called the virtual filesystem ( vfs ). this layer provides a common interface for the filesystems which are plugged into the operating system. I already introduced one special kind of filesystem, the proc filesystem. the proc filesystem does not handle any files on disk or on the network, but nevertheless it is a filesystem. in addition to the above mentioned filesystems, which all are disk based, filesystems may also handle files on the network, such as nfs or cifs.

no matter what kind of filesystem you are working with: when interacting with the filesystem by using the commands of choice you are routed through the virtual filesystem:

( figure: the virtual file system )

to make this possible there needs to be a standard all file system implementations must comply with, and this standard is called the common file model. the key components this model consists of are:

  • the superblock which stores information about a mounted filesystem ( … that is stored in memory as a doubly linked list )
  • inodes which store information about a specific file ( … that are stored in memory as a doubly linked list )
  • the file object which stores information about the underlying open files
  • dentries, which represent the links to build the directory structure ( … that are stored in memory as a doubly linked list )

to speed up operations on the file systems some of the information which is normally stored on disk is cached. if you recall the post about slabs, you can find an entry like the following in the /proc/slabinfo file if you have a mounted ext4 filesystem on your system:

cat /proc/slabinfo | grep ext4 | grep cache
ext4_inode_cache   34397  34408    920   17    4 : tunables    0    0    0 : slabdata   2024   2024      0

so what does the kernel need to do if, for example, a request for listing the contents of a directory comes in and the directory resides on an ext4 filesystem? because the filesystem is mounted the kernel knows that the filesystem for the specific request is of type ext4. the ls command will then be translated ( pointed ) to the specific ls implementation of the ext4 filesystem. this operation is the same for all commands interacting with filesystems. there is a pointer for each operation that links to the specific implementation of the command in question:

( figure: directory listing through the vfs )

as the superblock is stored in memory and therefore may become dirty, that is not synchronized with the superblock on disk, there is the same issue that oracle must handle with its buffer pools: periodically check the dirty flag and write the changes back to disk. the same is true for inodes ( while in memory ), which contain all the information that makes up a file. closing a loop to oracle again: to speed up searching the inodes linux maintains a hash table for fast access ( remember how oracle uses hashes to identify sql statements in the shared_pool ).
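you can actually watch this dirty/writeback mechanism from the command line ( just an illustration, the numbers will differ on your system ):

# dirty pages currently waiting to be written back ( the linux counterpart of dirty buffers )
grep -E 'Dirty|Writeback' /proc/meminfo
# force a writeback of all dirty data and check again
sync
grep -E 'Dirty|Writeback' /proc/meminfo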

when there are files, there are processes which want to work with files. once a file is opened a new file object will be created. as these are frequent operations file objects are allocated through a slab cache.

the file objects themselves are visible to the user, per process, through the /proc filesystem:

ls -la /proc/*/fd/
/proc/832/fd/:
total 0
dr-x------ 2 root root  0 2012-05-18 14:03 .
dr-xr-xr-x 8 root root  0 2012-05-18 06:40 ..
lrwx------ 1 root root 64 2012-05-18 14:03 0 -> /dev/null
lrwx------ 1 root root 64 2012-05-18 14:03 1 -> /dev/null
lr-x------ 1 root root 64 2012-05-18 14:03 10 -> anon_inode:inotify
lrwx------ 1 root root 64 2012-05-18 14:03 2 -> /dev/null
lrwx------ 1 root root 64 2012-05-18 14:03 3 -> anon_inode:[eventfd]
lrwx------ 1 root root 64 2012-05-18 14:03 4 -> /dev/null
lrwx------ 1 root root 64 2012-05-18 14:03 5 -> anon_inode:[signalfd]
lrwx------ 1 root root 64 2012-05-18 14:03 6 -> socket:[7507]
lrwx------ 1 root root 64 2012-05-18 14:03 7 -> anon_inode:[eventfd]
lrwx------ 1 root root 64 2012-05-18 14:03 8 -> anon_inode:[eventfd]
lrwx------ 1 root root 64 2012-05-18 14:03 9 -> socket:[11878]
...

usually numbers 0 – 2 refer to the standard input, standard output and standard error of the corresponding process.

last but not least there are the dentries. as with the file objects, dentries are allocated from a slab cache, the dentry cache in this case:

cat /proc/slabinfo | grep dentry
dentry             60121  61299    192   21    1 : tunables    0    0    0 : slabdata   2919   2919      0

directories are files, too, but special in that directories may contain other files or directories. once a directory is read into memory it is transformed into a dentry object. as this operation is expensive there is the dentry cache mentioned above. thus the operations for building the dentry objects can be minimized.
another link to oracle wording: the unused dentry doubly linked list uses a least recently used ( lru ) algorithm to track the usage of the entries. when the kernel needs to shrink the cache the objects at the tail of the list will be removed. as with the inodes there is a hash table for the dentries and a lock protecting the lists ( dcache_spin_lock in this case ).
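if you want to see the kernel shrinking these caches you can ask it to drop the unused dentries and inodes ( harmless, but they will have to be rebuilt later ):

# ask the kernel to free unused dentries and inodes ( as root )
sync
echo 2 > /proc/sys/vm/drop_caches
cat /proc/slabinfo | grep dentry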

this should give you enough hints to go further if you are interested …

to be honest, when I first heard about slabs i thought of kitchens, or a bathroom, but not about memory management. but after reading up on it and starting to learn about it, the term seemed not to be a bad fit for what it stands for. to stay in the context of a kitchen there are different kinds of slabs:

  • some may be appropriate for the floor
  • others for the wall
  • there are all kinds of colors …
  • … all kinds of sizes
  • slabs may consist of different materials

and in a way that’s exactly what slabs, in regards to memory management, are about. they provide regions of memory which are fixed in size and for different kinds of tasks.

the basic idea behind slabs is to avoid fragmentation by reserving fixed size memory regions and minimizing the work for allocation and deallocation. the so called constructor is used to initialize areas of memory while the destructor de-initializes them.

while the system is running the kernel requests memory regions of the same size and type over and over again. the process descriptors discussed earlier are one example. if a new process descriptor gets requested the slab allocator is used to get a memory area from the cache which fits for holding the structure. once the process terminates and the descriptor can be discarded the memory region will not be deallocated but can be reused when another process needs to be created. in this way the overhead of frequent allocation and deallocation of page frames can be avoided.

( figure: caches & slabs )

to check the current caches and slabs you can either directly go to the /proc filesystem:

cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
UDPLITEv6 0 0 1024 16 4 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 32 32 1024 16 4 : tunables 0 0 0 : 

… or check the current behavior with slabtop:

 Active / Total Objects (% used) : 305636 / 315218 (97.0%)
Active / Total Slabs (% used) : 11077 / 11077 (100.0%)
Active / Total Caches (% used) : 68 / 99 (68.7%)
Active / Total Size (% used) : 74608.55K / 76732.62K (97.2%)
Minimum / Average / Maximum Object : 0.01K / 0.24K / 8.00K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
89193 89192 99% 0.10K 2287 39 9148K buffer_head
55020 54352 98% 0.19K 2620 21 10480K dentry
30260 30247 99% 0.90K 1780 17 28480K ext4_inode_cache
19992 19974 99% 0.08K 392 51 1568K sysfs_dir_cache
17158 16006 93% 0.17K 746 23 2984K vm_area_struct
15360 14165 92% 0.01K 30 512 120K kmalloc-8
10710 10077 94% 0.05K 126 85 504K shared_policy_node
9664 8544 88% 0.06K 151 64 604K kmalloc-64
9051 8118 89% 0.19K 431 21 1724K kmalloc-192
7917 7904 99% 0.59K 609 13 4872K inode_cache

a cache initially does not contain any slabs. slabs will be created once a request comes in and there is no free object to satisfy it. in this case the cache will grow by n slabs. as with all concurrent access to memory regions caches and slabs are protected by locks, cache spin locks in this case. and when there are locks there must be different states for the objects, which are:

  • empty
  • partial
  • full

so what to remember: slabs and caches provide a way to avoid fragmentation and save some work in regards to frequent allocation and deallocation of memory regions. structures that are created and destroyed frequently, such as the process or file descriptors, benefit from this technique as the requests can be served much faster.
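a quick way to see a slab cache grow is to create and list a bunch of files and watch the dentry line in /proc/slabinfo ( purely an illustration, the path is arbitrary ):

# note the current number of dentry objects
grep dentry /proc/slabinfo
# create some files so new dentry objects get allocated
mkdir /tmp/slabdemo
touch /tmp/slabdemo/file{1..10000}
ls -l /tmp/slabdemo > /dev/null
# the dentry cache should have grown
grep dentry /proc/slabinfo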

due to some network changes we were forced to adjust the interconnect configuration of one of our managed grid infrastructure installations today. surprisingly this was a really smooth task, and all that needs to be done is:

before doing any changes to the network configuration and while the rac databases and grid infrastructure are still running:

backup the current gpnp ( grid plug and play ) profile in case you need to restore the old configuration:

cd $GRID_HOME/gpnp/[hostname]/profiles/peer
cp profile.xml profile.xml_bak

shutdown the rac database and listener:

srvctl stop database -d db_name
srvctl stop listener -l listener_name

make the changes to the interconnect configuration:

oifcfg setif -global eth1/10.167.77.0:cluster_interconnect
oifcfg getif

… where eth1 is the interface name for the interconnect and 10.167.77.0 is the new network to use

as root shutdown and disable the cluster stack:

crsctl stop crs
crsctl disable crs

shutdown the servers and perform the physical network changes, if any, start them up again and adjust the hosts file and the interface configurations.

as root user enable and start crs:

crsctl enable crs
crsctl start crs

wait a few minutes and check the cluster status:

oifcfg getif
crsctl status resource -t

that’s it.

a probably less known feature of the oracle database is the database filesystem (dbfs). in short dbfs provides a filesystem interface on top of database tables accessible through standard operating system commands. storing files inside the database becomes as easy as creating files on the operating system level. there is no need to use a programming language to store blobs/clobs inside the database.

for demonstrating dbfs this post will show how to create a subversion repository which actually resides in the oracle database. this is not intended for any production use and shall only show that it basically works.

because I am on linux and want to mount the database filesystem on the operating system I will need to install the kernel-devel and fuse packages and give execute privileges to the fusermount command:

su -
yum install kernel-devel fuse fuse-libs
chmod +x /bin/fusermount

next I’ll create the configuration file for fuse:

touch /etc/fuse.conf
echo "user_allow_other" > /etc/fuse.conf

the mountpoint for holding my database filesystem and subversion repository will be /oradata/DB112/svn owned by the database owner:

su - oracle
mkdir /oradata/DB112/svn

as all the files will reside inside the oracle database I will create a separate user:

su - oracle
sqlplus / as sysdba
CREATE USER svn
IDENTIFIED by "svn"
DEFAULT TABLESPACE USERS
TEMPORARY TABLESPACE TEMP;
GRANT CREATE SESSION TO svn;
GRANT dbfs_role TO svn;
GRANT RESOURCE TO svn;

the script that creates the dbfs “MYSVNFS” which will be stored in the “USERS” tablespace is:

connect svn/svn
@$ORACLE_HOME/rdbms/admin/dbfs_create_filesystem.sql USERS MYSVNFS

the output should be similar to this:

No errors.
--------
CREATE STORE:
begin dbms_dbfs_sfs.createFilesystem(store_name => 'FS_MYSVNFS', tbl_name =>
'T_MYSVNFS', tbl_tbs => 'USERS', lob_tbs => 'USERS', do_partition => false,
partition_key => 1, do_compress => false, compression => '', do_dedup => false,
do_encrypt => false); end;
--------
REGISTER STORE:
begin dbms_dbfs_content.registerStore(store_name=> 'FS_MYSVNFS', provider_name
=> 'sample1', provider_package => 'dbms_dbfs_sfs'); end;
--------
MOUNT STORE:
begin dbms_dbfs_content.mountStore(store_name=>'FS_MYSVNFS',
store_mount=>'MYSVNFS'); end;
--------
CHMOD STORE:
declare m integer; begin m := dbms_fuse.fs_chmod('/MYSVNFS', 16895); end;
No errors.

the client program used to mount the filesystem is called dbfs_client and is located under $ORACLE_HOME/bin. the proper way in regards to security would be to use an oracle wallet for storing the password but as this is only a test I will do it the easy way ( if someone is interested in how to set up the wallet way, just let me know ):

echo "svn" > /home/oracle/passwordfile
nohup dbfs_client svn@DB112 /oradata/DB112/svn < /home/oracle/passwordfile &

if everything worked the new mount is now visible with the os commands:

df -h
Filesystem Size Used Avail Use% Mounted on
/dev/hdc1 28G 11G 15G 43% /
tmpfs 741M 456M 286M 62% /dev/shm
dbfs-svn@DB112:/ 1.1G 160K 1.1G 1% /oradata/DB112/svn
##
ls -la /oradata/DB112/svn/MYSVNFS/
total 0
drwxrwxrwx 3 root root 0 May 4 01:10 .
drwxr-xr-x 3 root root 0 May 4 01:15 ..
drwxr-xr-x 7 root root 0 May 4 01:10 .sfs
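just as a side note: should you want to unmount the database filesystem again later, a plain fusermount call ( as the user who mounted it ) should be enough:

# unmount the dbfs mount point again
fusermount -u /oradata/DB112/svn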

now I am ready to create the subversion repository:

su - oracle
cd /oradata/DB112/svn/MYSVNFS/
svnadmin create MYREPO

to be able to access the repository later on as another user I’ll add a user to the subversion password file:

cat MYREPO/conf/passwd
[users]
svn = svn

… and uncomment the following line in the svnserve.conf file:

password-db = passwd
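if you prefer not to open an editor, a small sed call against the repository configuration should do the same ( sketch only, double check the file afterwards ):

# uncomment the password-db line in the repository configuration
sed -i 's/^# *password-db = passwd/password-db = passwd/' MYREPO/conf/svnserve.conf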

… and start the subversion daemon:

svnserve -d -r /oradata/DB112/svn/MYSVNFS

to simulate real-life subversion usage I will create a new os user and do some work on the subversion repository ( I could have done this from another workstation, too ):

su -
useradd svn
passwd svn
su - svn

the first thing I need to do is to checkout the repository:

svn co svn://localhost/MYREPO/
cd MYREPO
ls -la

let’s copy some files ( text and binary ) to populate the repository ( don’t worry if not all files are copied because of permissions ):

cp /etc/*.conf .
cp /bin/cpio .
ls -la

as I now have some files I’ll add them to the repository and commit ( commit means pushing to the subversion server here ):

svn add *
svn commit -m "blabla" *.conf --username=svn
svn commit -m "blabla" cpio

done. all files committed to the subversion server and stored inside the oracle database. don’t believe it ?

su - oracle
sqlplus / as sysdba
set lines 164
col object_name for a20
col object_type for a15
SELECT object_name,object_type FROM all_objects WHERE owner = 'SVN';
##
OBJECT_NAME          OBJECT_TYPE
-------------------- ---------------
T_MYSVNFS            TABLE
IG_SFS$_FST_19       INDEX
IP_SFS$_FST_19       INDEX
IPG_SFS$_FST_19      INDEX
SFS$_FSTP_19         TABLE
SYS_C005052          INDEX
##
col item for a30
col pathname for a40
select item ,PATHNAME from svn.T_MYSVNFS;
##
ITEM PATHNAME
------------------------------ ----------------------------------------
MYREPO /MYREPO
locks /MYREPO/locks
db.lock /MYREPO/locks/db.lock
db-logs.lock /MYREPO/locks/db-logs.lock
hooks /MYREPO/hooks
start-commit.tmpl /MYREPO/hooks/start-commit.tmpl
pre-commit.tmpl /MYREPO/hooks/pre-commit.tmpl
post-revprop-change.tmpl /MYREPO/hooks/post-revprop-change.tmpl
conf /MYREPO/conf
authz /MYREPO/conf/authz
db /MYREPO/db
fs-type /MYREPO/db/fs-type
revs /MYREPO/db/revs
0 /MYREPO/db/revs/0
revprops /MYREPO/db/revprops
0 /MYREPO/db/revprops/0
transactions /MYREPO/db/transactions
txn-protorevs /MYREPO/db/txn-protorevs
write-lock /MYREPO/db/write-lock
uuid /MYREPO/db/uuid
0 /MYREPO/db/revs/0/0
0 /MYREPO/db/revprops/0/0
fsfs.conf /MYREPO/db/fsfs.conf
rep-cache.db /MYREPO/db/rep-cache.db
min-unpacked-rev /MYREPO/db/min-unpacked-rev
txn-current-lock /MYREPO/db/txn-current-lock
format /MYREPO/db/format
format /MYREPO/format
passwd /MYREPO/conf/passwd
svnserve.conf /MYREPO/conf/svnserve.conf
pre-revprop-change.tmpl /MYREPO/hooks/pre-revprop-change.tmpl
pre-lock.tmpl /MYREPO/hooks/pre-lock.tmpl
pre-unlock.tmpl /MYREPO/hooks/pre-unlock.tmpl
ROOT /
.sfs /.sfs
attributes /.sfs/attributes
tools /.sfs/tools
snapshots /.sfs/snapshots
RECYCLE /.sfs/RECYCLE
content /.sfs/content
post-commit.tmpl /MYREPO/hooks/post-commit.tmpl
post-lock.tmpl /MYREPO/hooks/post-lock.tmpl
post-unlock.tmpl /MYREPO/hooks/post-unlock.tmpl
README.txt /MYREPO/README.txt
1 /MYREPO/db/revs/0/1
txn-current /MYREPO/db/txn-current
2 /MYREPO/db/revs/0/2
1 /MYREPO/db/revprops/0/1
current /MYREPO/db/current
2 /MYREPO/db/revprops/0/2

that’s the subversion repository stored in the oracle database.

remember this is only for educational purposes, but imagine what you could do with it:

  • back up the repository with rman
  • use rman to duplicate the database for testing purposes
  • setup rac one node, integrate the subversion daemon with the oracle grid infrastructure and make the repository highly available
  • create multiple dbfs and multiple svn repositories in one oracle database
  • enable deduplication for the repository files ( see * )
  • enable compression for the repository files ( see ** )

oracle uses secure files to store the files located in the dbfs. this means you may use all the features secure files provide. check the examples for an overview.

*

alter table svn.T_MYSVNFS modify lob(FILEDATA) (deduplicate);

**

alter table svn.T_MYSVNFS modify lob(FILEDATA) (compress high);

as said above, this is only educational but i’d be happy to start a discussion about the pros and cons ( besides the licensing costs, of course ) … what would be a good use case to use dbfs for ?