(no subject)

Swee-Chuan Khoo (sckhoo@tm.net.my)
Thu, 09 Oct 1997 11:33:41 +0800

--Boundary_(ID_uFTkQ6oJcDz/Yl19mmir5w)
Content-type: text/plain; charset=us-ascii

original question

>
> I am just wondering whether there is any FAQ or paper somewhere
> that shows the syntax and options that I can add to /etc/system.
>

1) Mariel Feder <unix.support@central.meralco.com.ph>

attachment

2) Andre Saile <saile@transtec.de>

You can check the file autoconf.h; it has a short description
of the scsi_options.

3) Vitaly Beliaev <vit@mmk.ru>

Type:
man -s4 system

Look: Answerbook: Tuning kernel parameters

Finally, there are many FAQs. I suggest you read the Solaris Kernel FAQ.

4) Jim Harmon <jharmon@telecnnct.com>

Read the file "/etc/system" for syntax

Read the "man" page --> man -s 4 system
for options.

5) Mike Nguyen <miken@mwh.com>

attachment

6) reineman1@llnl.gov

I attached some stuff I found in SunSolve. It's sort of old. There are
several other things that can be set; I found several docs. I suggest
doing a SunSolve search of your own for /etc/system.

http://sunsolve1.sun.com/sunsolve/index.html

--Boundary_(ID_uFTkQ6oJcDz/Yl19mmir5w)
Content-type: text/plain; NAME=kernel.txt; charset=us-ascii
Content-disposition: attachment; filename=kernel.txt

>From mikej@mtvmail.Corp.Sun.COM Tue Jun 17 14:59 PDT 1997
Date: Tue, 17 Jun 1997 14:59:48 -0700
From: mikej@mtvmail.Corp.Sun.COM (Michael Jaffe)
To: miken@mwh.com
Subject: my paper

Optimizing and Measuring the Solaris Kernel For Large Oracle Servers.
by Mike Jaffee, Sun Microsystems

The first part of the paper will discuss the basics of Solaris Internals that
are relevant to the Oracle DBA along with tips to common technical questions
and relevant header files. The second part is quoted tuning information taken
from Sun Experts. The final part is a discussion of kernel memory allocation,
how to measure it, and some things that can be done to prevent starvation.

Solaris Internals
Sparc has two rings of execution. The inner ring is for kernel functions and
the outer ring is for user process functions. The process address space is
virtual, and normally only part of a process is in physical memory. The kernel
stores the contents of the process address space in physical memory, on-disk
files, and specially reserved swap areas. Over time the kernel shuffles pages
of the processes between physical memory and disk. Each process has registers
that are stored in the kernel and are placed in the hardware registers at run
time. A process must block if it is waiting for a resource and allow another
process to run. The kernel allows each process a brief period of time, usually
10 milliseconds, to run before performing a context switch. (Vahalia p.20-25)
On startup, once the kernel is loaded, user processes can request system
services from the kernel through the system call interface. If the process
misbehaves by dividing by zero or overflowing its stack, a hardware exception
occurs, and the kernel intervenes, usually aborting the process. Interrupts
come from peripheral devices usually indicating a status change or I/O
completion. Two important processes that manage memory are the swapper and
pagedaemon. (Vahalia p.22-25)

Each process has a virtual memory address space (VMA) that is translated to
physical memory addresses by page tables. This mapping is done by the chip's
MMU. (Tip - System panics can be either hardware or software related. The MMU
registers give helpful hints on what actually caused the panic.) In addition to
kernel and user mode, there are kernel and user space. These refer to regions
in the virtual memory address space of the process. There is only one kernel and
many processes and hence every process must map in a single kernel address
space. The kernel portion of the VMA maintains global data structures and some
per process objects. These can only be accessed by the kernel when the chip is
running in kernel mode (ring 0). Since the kernel is shared by all processes,
kernel space must be protected from user-mode access. This is done by requiring
the processes to use the system call interface. This requires the chip to go
into kernel mode, transfer program control to the kernel, have the kernel
execute system code instructions, then switch back to user mode and user
control of the process. (Vahalia p.22-23)

System Services
Oracle uses many Solaris system services such as file and record locking,
interprocess communication, virtual memory, and process scheduling. Common
system calls are open, read, write, fcntl, kill, priocntl, plock, memcntl, and
sync. Common signals are SIGSEGV - usually means user stack overflow; SIGBUS
- out of the process address space; SIGTERM - user has "hung up" without
exiting gracefully; SIGUSR1 - defined signal for asynchronous events; SIGKILL
- kill process immediately, no exceptions. Oracle uses file and record locking
by setting read and write locks on portions of a file. Any process can read a
file that is locked but only the owner of the lock can update the file. A
write lock is sometimes called an exclusive lock and a read lock is sometimes
called a shared lock. Process scheduling is usually managed very well by the
kernel; however, a slow job can be sped up with the priocntl system call.
(System Services Guide p.1-25)

Jim Skeen of Sunsoft - "Oracle gets locked-down memory as a consequence of
using intimate shared memory (ISM), not through plock. It controls sharing
inside shared memory through latches, not memcntl or plock." He also cautions
against changing the priority of the Oracle processes: "This is something we
in DBE actually strongly discourage. Only the most daring and knowledgeable
DBA's should attempt this. The problem is that system threads can get starved
if Oracle processes are not "well behaved" when running in real time class.
Oracle processes may easily hog a cpu for extended periods of time (time
being measured in Unix quantums). We in DBE have experimented with changing
the dispatch table in useful/clever ways, to minimize the number of
involuntary context switches. But Oracle processes still run in TS class."
(private letter Skeen)
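
To illustrate the file and record locking service described above, here is a
minimal sketch using Python's standard fcntl module (Unix only). The temporary
file and the byte ranges are hypothetical, and the sketch says nothing about
how Oracle itself locks data; it only shows a shared (read) lock and an
exclusive (write) lock placed on different portions of one file.

```python
import fcntl
import os
import tempfile

def lock_range(fd, exclusive, start, length):
    """Place a shared (read) or exclusive (write) lock on bytes [start, start+length)."""
    kind = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    # LOCK_NB: raise OSError instead of blocking if another process holds a conflicting lock.
    fcntl.lockf(fd, kind | fcntl.LOCK_NB, length, start, os.SEEK_SET)

def unlock_range(fd, start, length):
    """Release any locks this process holds on bytes [start, start+length)."""
    fcntl.lockf(fd, fcntl.LOCK_UN, length, start, os.SEEK_SET)

def demo():
    # Hypothetical scratch file, opened read/write so both lock types are allowed.
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * 1024)
        f.flush()
        fd = f.fileno()
        lock_range(fd, exclusive=False, start=0, length=512)    # shared lock on first half
        lock_range(fd, exclusive=True, start=512, length=512)   # exclusive lock on second half
        unlock_range(fd, 0, 1024)
        return True
```

Locks of this kind are advisory and are held per process, so a conflict only
shows up when a second process asks for an overlapping range.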

Oracle Internals and Solaris System Services
Mark Johnson of Oracle and Jim Skeen provide the following expert insight and
information. The system global area is defined as "One or more shared
segments visible to all Oracle processes that are used to store precompiled
SQL and PL/SQL (library cache), database buffers (buffer cache), and for
interprocess communication" (Johnson). As far as process control - "Oracle
does use semaphores, but latches are the usual synchronizing mechanism, as
mutexes implemented as spin locks" (Johnson). On the subject of locks "Oracle
maintains database transaction integrity through use of database locks of
various sorts--shared read, exclusive read, exclusive write, etc. These are
implemented through database locks, not using Unix file locks. Thus, the
scope of a database lock can be limited to a single row in the database. Or,
the database may choose to lock a database page (which may be quite a bit
smaller than a Unix page). Or, the database may choose to lock an entire
database table (which may be composed of multiple database files, which in
turn may or may not map into Unix files)." (private letter Skeen).

Oracle uses heavyweight processes that are in the shared memory portion of the
process address space. The DBWR (data buffer writer) process uses aio threads
known as lightweight processes (LWPs). An LWP is a kernel-supported user
thread that is based on kernel threads. They are independently scheduled and
share the address space of the process. Vahalia's book has a nice discussion
on LWPs. (Jaffee) Kernel Asynchronous I/O and Intimate Shared Memory are two
key technologies used by Oracle on the Solaris platform.

Asynchronous I/O is needed because a single blocking thread in a multi-
threaded application causes all threads to wait until the thread wakes up.
What needs to happen is for the thread to issue an asynchronous I/O request
and then pass control to another thread in the process. Also heavy I/O is not
efficient when done synchronously because of the large number of context
switches that must occur every time a thread is blocked. (Hyuck Yoo)

Asynchronous I/O under Solaris is implemented two ways - under Solaris 2.3 it
is implemented in a library, and under Solaris 2.4 and beyond it is in the file
system layer of the kernel. The library approach uses kernel-level threads
where each I/O request is handled by a newly created kernel-level thread that
acts synchronously (i.e. issuing read and write calls). The library lives
outside of the kernel and the kernel threads that perform the I/O are
separate from the calling process. The kernel approach is much more
sophisticated and efficient. The basic concept is to not maintain the queue
in user space but to put the request directly into the device driver queue.
The biowait function (the device driver equivalent of a blocking function) is
bypassed, and the thread transfers control rather than sleeping in the
kernel. The kernel has buffers with slots called AIO that maintain a listing
of all I/O requests. (Hyuck Yoo)
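
The library approach described above, where each request is handed to a thread
that performs an ordinary blocking read, can be imitated in user space. The
following is a rough sketch using Python's concurrent.futures and os.pread. It
is only an analogy, not the Solaris library's implementation, and the helper
names and the scratch file are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def submit_read(pool, fd, offset, nbytes):
    """Queue a read; a worker thread blocks in pread while the caller carries on."""
    return pool.submit(os.pread, fd, nbytes, offset)

def read_blocks(path, block_size, count):
    """Issue `count` reads without waiting for each one, then collect the results."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=4) as pool:
            futures = [submit_read(pool, fd, i * block_size, block_size)
                       for i in range(count)]
            # The caller is free to do other work here; result() waits for completion.
            return [f.result() for f in futures]
    finally:
        os.close(fd)

def demo():
    # Hypothetical scratch file holding three 8-byte blocks.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"A" * 8 + b"B" * 8 + b"C" * 8)
        path = f.name
    try:
        return read_blocks(path, 8, 3)
    finally:
        os.unlink(path)
```

The calling thread never sleeps in a read itself; it only waits when it asks
for a result, which is the behaviour the paragraph above attributes to
asynchronous I/O.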

Solaris has provided the ISM feature since 2.2. The main feature of ISM is
that, in addition to sharing the memory pages (like normal shared memory), it
also shares the page table entries for those pages (therefore, it's
"intimate"). Another side feature, which is more important for this
discussion, is that ISM also locks down the shared memory segment in real
physical RAM. Since the main purpose of ISM is for the DBMS products' buffer
cache usage, this makes sense. (Jaffee)

Sharing page table entries solves the problem of page table stealing which is
expensive because all the pages mapped in the stolen page table have to be
flushed before being given to another process. This avoids the condition
where the whole system may thrash as processes steal page tables from each
other. (H. Yoo)

The design team created a new segment in the process address space called
segshm so that they could create one set of page tables for a shared memory
segment and share the page tables among the processes that attach that same
shared memory. In addition to saving page table allocation, sharing page
tables has other advantages such as having a higher cache hit rate on memory
map lookups because the tables are in a buffer cache rather than in memory.
It also reduces the overhead in the hardware address translation layer,
since it no longer needs to go through page tables for every
process to monitor whether a page has been modified. These are both huge
savings and speed up the virtual memory paging algorithm within Solaris. (H.
Yoo)
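
A back-of-the-envelope sketch of the page table saving, borrowing the figures
used later in the IPC section (60 processes, a segment of about 25000 pages of
4 KB). The function name is hypothetical and the count ignores any
hardware-specific page table layout; it only contrasts one set of entries per
process with one shared set.

```python
def pte_count(segment_bytes, page_bytes, processes, shared):
    """Number of page table entries needed to map one shared memory segment."""
    pages = -(-segment_bytes // page_bytes)   # ceiling division
    # With shared page tables one set serves every attached process.
    return pages if shared else pages * processes

PAGE = 4 * 1024            # 4 KB pages
SEGMENT = 25000 * PAGE     # illustrative segment size
PROCS = 60                 # illustrative number of attached processes

private_ptes = pte_count(SEGMENT, PAGE, PROCS, shared=False)
ism_ptes = pte_count(SEGMENT, PAGE, PROCS, shared=True)
```

Under these assumptions the per-process scheme needs 60 times as many entries
as the shared scheme, which is where the saving in page table allocation
comes from.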

IPC
The Oracle RDBMS is a complex program that uses multiple cooperating processes
that must communicate with each other and share resources. The kernel provides
a mechanism in user space called interprocess communication, or IPC. The
processes operate in a shared memory segment such that if one process modifies
data it will be immediately visible to the other processes. Data transfer and
event notifications occur between the various Oracle processes in the Oracle
SGA. Semaphores are used for Oracle's own locking and synchronization
scheme. Asynchronous events such as errors are reported to the processes
using signals. The default action for most signals from the kernel is to
terminate the process; however, the process may specify an alternate response
by providing a signal handler function. (Tip - Before installing the kernel
jumbo patch, read the readme file to see if there are any known signal
problems with Oracle.) (Vahalia - p150)

The relevant IPC system calls Oracle makes are shmget, semget, shmat, shmdt,
shmctl, and semctl. The IPC information is stored in the kernel with the
ipc_perm structure. shmget(key, size, flag) creates a portion of shared memory
(which will be the size of the Oracle SGA) and shmat(shmid, shmaddr, shmflag)
attaches the region to a virtual memory address of the process. (shmsys is how
Oracle sets up the intimate shared memory segment.) The structure of a shared
memory segment includes access permission, segment size, the PID of the
process performing the last operation, and the memory map segment descriptor
pointer, as well as other fields. (Tip - sgabeg in the ksms.s file is a
virtual address, not a physical address (0-0xffffffff = 2 GB). Choose small
beginning addresses for large SGAs. Also watch out for 28 bit Sparc chips.
They have smaller virtual addresses. Hal Stern notes "They're really not 28
bit chips, but instead the system architecture only passes 28 bits of virtual
address space on to the memory bus." [private letter])

Once attached, the region may be accessed like any other memory location
without requiring system calls to read or write data to it. Hence shared
memory is the fastest mechanism for processes to share data. (Tip - Don't be
confused by the SZ field in ps -elf. It is in 4 KB pages and represents shared
memory in the case of Oracle. For example, Oracle may have 60 server processes
in a shared memory segment, all approximately 25000 4 KB pages. A common
misconception is to think that Oracle needs 60 X 4 KB X 25000 = 6 GB of
virtual memory. Those 60 processes are mainly using the shared memory region
in the process address space.) (Tip - Shared memory pages are backed by swap
space, not by a file. The absolute minimum swap must be at least the size of
the SGA.) A process detaches the shared memory with shmdt(addr) and destroys
the shared memory region completely with the IPC_RMID command of the shmctl
system call.

(Tip - The important commands are ipcs -b (look at field SEGSZ for the shared
memory size in use); sysdef -i and sysdef -i -n /dev/ksyms for IPC and resource
table definitions; kill -9 <process id> to terminate (no core file) a hung
process, or kill -6 <process id> to abort (core file) a hung Oracle process.
modload -p sys/shmsys at the command line, or forceload: sys/shmsys in the
system file, may be needed if ipcs -b doesn't work correctly. This is because
the kernel is dynamic, meaning that file systems, drivers, and modules are
loaded into memory when they are used, and the memory is returned if the
module is no longer needed.) (Vahalia - p155-158, p162-164)

Semaphores are counters that are used by Oracle to monitor and control the
availability of shared memory segments. Typically the process initializes the
semaphore with semget, assigns ownership of the semaphore with semctl, and
then updates the semaphore with semop. A process has to block until the
semaphore operation has reached zero. A semaphore structure contains the
following information - the semaphore value, the PID of the process that last
performed a successful operation, the number of processes waiting for the
semaphore to increase, and the number of processes waiting for the semaphore
to reach zero. (Tip - see ipc_perm and sem in ipc.h and sem.h.) (System
Services Guide - p68-77)

Shared Memory and Semaphore Tunables in Solaris 2 relevant to Oracle
(Tip - semmnu = semmns = semmsl X semmni). There is no harm in setting the
numbers too high since the Oracle instance will only allocate semaphores and
shared memory as needed. The values are definitions, not declarations.
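
The SZ misconception above is easy to check with arithmetic. The sketch below
uses the paper's own figures (60 processes, SZ of about 25000 pages, 4 KB
pages). The split between shared and private pages is a made-up assumption
used only to show the shape of the calculation.

```python
PAGE_KB = 4   # SZ in ps -elf is reported in 4 KB pages

def naive_total_kb(processes, sz_pages):
    """The mistaken sum: treats every process's SZ as private memory."""
    return processes * sz_pages * PAGE_KB

def shared_aware_total_kb(processes, sz_pages, shared_pages):
    """Count the shared segment once, plus each process's private remainder."""
    private_pages = sz_pages - shared_pages
    return (shared_pages + processes * private_pages) * PAGE_KB

naive = naive_total_kb(60, 25000)                     # the "6 GB" figure, in KB
# Assumption: 24000 of the 25000 pages are the shared segment (illustrative only).
realistic = shared_aware_total_kb(60, 25000, 24000)
```

With that assumed split the real demand is a few hundred megabytes rather than
6 GB, because the shared segment is counted once instead of 60 times.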

Name    Default  Min      Max            Description                          Suggested
----    -------  ---      ---            -----------                          ---------
shmmax  1048576  1048576  Available RAM  Maximum shm segment size in bytes    50% of RAM
shmmin  1        1        -              Minimum shm segment size in bytes    1
shmmni  100      100      -              Number of shm id to pre-allocate     100
shmseg  6        6        -              Maximum number shm seg per process   32
semmni  10       10       65535          Number of semaphore identifiers      64
semmns  60       -        -              Number of semaphores in system       1600
semmnu  30       -        -              Number of undo structures in sys     1250
semmsl  25       -        -              Maximum number of semaphores per ID  25 (fixed)
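
On Solaris 2 these tunables are normally set in /etc/system with lines of the
form "set shmsys:shminfo_NAME=VALUE" and "set semsys:seminfo_NAME=VALUE". The
sketch below generates such lines from the Suggested column of the table. The
1 GB RAM figure in the usage note is an assumption, shmmni is the kernel's
name for the number of shm ids, and the output should be checked against
system(4) for the release in use before it goes anywhere near a real system
file.

```python
# Suggested values copied from the table above; shmmax is filled in from RAM size.
SUGGESTED_SHM = {"shmmax": None, "shmmin": 1, "shmmni": 100, "shmseg": 32}
SUGGESTED_SEM = {"semmni": 64, "semmns": 1600, "semmnu": 1250, "semmsl": 25}

def etc_system_lines(ram_bytes):
    """Return /etc/system 'set' lines for the suggested shm and sem tunables."""
    shm = dict(SUGGESTED_SHM, shmmax=ram_bytes // 2)   # "50% of RAM"
    lines = [f"set shmsys:shminfo_{name}={value}" for name, value in shm.items()]
    lines += [f"set semsys:seminfo_{name}={value}" for name, value in SUGGESTED_SEM.items()]
    return lines

def semmns_matches_tip(sem):
    """Check the tip's relation semmns = semmsl X semmni."""
    return sem["semmns"] == sem["semmsl"] * sem["semmni"]
```

For a machine with 1 GB of RAM, etc_system_lines(1 << 30) starts with a shmmax
of 536870912 bytes. The suggested semmns of 1600 is exactly 25 X 64, as the tip
says, while the suggested semmnu of 1250 is lower than semmns.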

Solaris Tuning According to the Experts
Every month in SunWorld Online, the performance experts at Sun write articles
on tuning. In addition to the well known book, "Sun Performance and Tuning",
Adrian Cockcroft, with the help of Rich Pettit, has put together a series of
scripts called se2.5 (www.sun.com/960301/columns/adrian/se2.5.html). Hal
Stern, another well known Sun tuning guru, has written an O'Reilly press book
on "Managing NFS & NIS" and he too writes articles that can be downloaded off
of the web. Fellow SunService Engineers Chris Drake and Kimberley Woods wrote
"Panic - System Core dump Analysis" which contains detailed information on the
Solaris kernel and common techniques used to analyze core files. Brian
Wong, the hardware expert, has written a book called "Configuration and Capacity
Planning of Large Sun Servers". Most of the tuning information for large Sun
Servers running Oracle can be found in these sources. Since many customers
often call SunService for further explanations, it is appropriate to highlight
some common questions and answer them as the experts would.

Question 1 - Where is all my Memory?
Probably the most common performance question of all is "Why does vmstat report
only xxxx amount of free memory available?" To use an example, type
vmstat 5 and suppose the system shows freemem of 80708 and available swap of
330000. Now start the application and observe that the freemem goes down to
8824 and swap goes to 300000. Now stop the application and observe that all
of the available swap returns to 330000 but the freemem returns only to
21260. Where then is all of the RAM? Do we have a memory leak? The answer
is probably no because as Cockcroft notes "(the app) starts up more quickly
than it did the first time, and with less disk activity. The application code
and its data files are still in memory, even though they are not active. The
memory they occupy is not "free." If you restart the same application it
finds the pages that are already in memory. The pages are attached to the
inode cache entries for the files. If you start a different application, and
there is insufficient free memory, the kernel will scan for pages that have
not been touched for a long time, and "free" them. Once you quit the first
application, the memory it occupies is not being touched, so it will be freed
quickly for use by other applications." (Cockcroft 1) Leaving parts of the
app in memory even after termination is efficient because "Attaching to a
page in memory is around 1,000 times faster than reading it in from disk."
(Cockcroft 1) So how can one know if he has a memory leak in his application?
The answer is there will be a shortage of swap space after the program runs
a while and the SZ field in ps -elf for that app will grow over time.
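
The reasoning above can be written down as two small checks. This is a sketch
using the vmstat figures from the example; the function names and the SZ
samples in the usage note are hypothetical, and a real diagnosis would sample
vmstat and ps over a much longer period.

```python
def swap_not_returned(swap_before, swap_after_exit):
    """Swap still missing after the program has exited (0 if all came back)."""
    return max(0, swap_before - swap_after_exit)

def sz_growing(samples):
    """True if successive SZ readings for one process keep increasing."""
    return len(samples) > 1 and all(b > a for a, b in zip(samples, samples[1:]))

# Figures from the vmstat example above.
free_before, free_after = 80708, 21260
swap_before, swap_after = 330000, 330000

still_cached = free_before - free_after   # memory holding pages that remain attached
leak_suspected = swap_not_returned(swap_before, swap_after) > 0
```

Here all of the swap came back, so the missing free memory is explained by
pages left in memory rather than by a leak. A series of SZ readings such as
[988, 1200, 1500] for the same process would be the warning sign.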

Question 2 - My Oracle Server is slow. Can you help me tune the kernel?
The answer depends on the version of the operating system and the level of the
patches. Early versions of the os had performance bugs and incompatible
hardware that were the cause of slow performance. The latest version of the os
is self-tuning for high performance and will work quite successfully on systems
ranging from a huge SparcCenter 2000 to small desktops. As Cockcroft says "In
normal use there is no need to tune the Solaris 2 kernel, since it dynamically
adapts itself to the given hardware configuration and application workload. "
(Cockcroft 2) However for really large Oracle servers some tuning may be
needed if using early versions of Solaris 2.3, 2.4, and 2.5 without a kernel
patch that automatically adjusts the paging algorithm. Solaris 2.5.1 is
self tuning for large memory systems. Paul Faramelli of the kernel TSE group
has put together the following list of tunables for Solaris. Recommendations
for large Oracle servers (RAM > 1 GB) are listed. (Tip - Use crash to display
kernel tunables. As root type crash. At the greater than prompt, type "od -d
maxusers" or "od -d lotsfree". The od stands for octal dump, and the -d stands
for decimal. By the way every Solaris tunable [even undocumented ones] can be
displayed by typing nm /kernel/unix). Note these recommendations are only
necessary for early versions of Solaris. Some of the recommendations are
provided by Steve O'Neil of SunService. (Caution - there is no right answer)

Parameter        Description                                                  Recommended
---------        -----------                                                  -----------
dump_cnt         Size of the dump
autoup           Used in struct var for dynamic configuration of the age      300
                 that a delayed-write buffer must be, in seconds, before
                 bdflush will write it out (default = 60)
bufhwm           Used in struct var for v_bufhwm; it's the high water mark    8000
                 for buffer cache memory usage, in Kbytes (2% of memory).
maxusers         Maximum number of users (In 2.3 and 2.4 the default is
                 number of Megabytes in memory)
max_nprocs       Maximum number of processes (10 + 16 * maxusers)
maxuprc          The maximum number of user processes. (max_nprocs - 5)
rstchown         POSIX_CHOWN_RESTRICTED is enabled (default = 1)
ngroups_max      Maximum number of supplementary groups per user (def 32).
rlim_fd_cur      Maximum number of open file descriptors per process system
                 wide (default = 64, max = 1024)
ncallout         Number of callout buffers (default = 16 + max_nprocs).
                 (No longer exists in Solaris 2.2 and later releases)
nautopush        Number of entries in the autopush free list                  1024
sadcnt           Number allowed of concurrent opens of both /dev/sad/user     2048
                 and /dev/sad/admin (default 16).
npty             Number of 4.X pseudo-ttys configured (default 48)            1024
pt_cnt           Number of 5.X pseudo-ttys configured (default 48)            1024
physmem          Sets the number of pages usable in physical memory. Only
                 use this for testing, it reduces the size of memory.
minfree          Memory threshold which determines when to start swapping     100
                 processes, when free memory falls to this level swapping
                 begins (default: 2.4 - 4d = 50 pages, all others 25
                 pages, 2.3 - physmem / 64 ).
desfree          This is the "desperation" level, this determines when        200
                 paging is abandoned for swapping. When free memory stays
                 below this level for 30 seconds, swapping kicks in ( 2.4
                 4d = 100 pages, all others 50 pages, 2.3 physmem / 32 ).
lotsfree         Memory threshold which determines when to start paging.      512
                 When free memory falls below this level paging begins (2.4
                 4d = 256 pages all others 128 pages, 2.3 physmem /16)
fastscan         The number of pages scanned per second when free memory
                 is zero, the scan rate increases as free memory falls
                 from lotsfree to zero, reaching fastscan ( default: 2.4
                 physmem / 4 with 64Mb being max, 2.3 physmem / 2 ).
slowscan         The number of pages scanned per second when free memory
                 is equal to lotsfree, also see fastscan ( defaults: 2.4
                 is fixed at 100, 2.3 fastscan /10 ).
handspreadpages  Is the distance between the front hand and backhand in
                 the clock algorithm. The larger the number the longer an
                 idle page can stay in memory (default: 2.4 physmem / 4
                 2.3 physmem / 2 ).
maxpgio          The maximum number of page-out I/O operations per second.    120
                 This acts as a throttle for the page daemon to prevent
                 page thrashing ((DISKRPM * 2) /3 = 40). This parameter
                 must be set higher if using two swap partitions.
t_gpgslo         2.1 through 2.3, Used to set the threshold on when to
                 swap out processes (default 25 pages ).
ufs_ninode       Maximum number of inodes. (max_nprocs+16+maxusers+64)        34906
ndquot           Number of disk quota structures. (default = (maxusers *
                 NMOUNT / 4) + max_nprocs)
ncsize           Number of dnlc entries. (default = max_nprocs + 16 +         34906
                 maxusers + 64); dnlc is the directory-name lookup cache
Cockcroft on maxusers
"I never set maxusers. It sizes itself based on the amount of RAM in the
system. In some cases on configurations with gigabytes of RAM it needs to be
reduced to avoid problems with lack of kernel address space. The kernel uses up
a lot of space keeping track of all the RAM in a system. Several other kernel
table sizes and limits are derived from maxusers." (Cockcroft 2)

Cockcroft on ncsize
"The directory name lookup cache (DNLC) is sized to a default value based on
maxusers. A large cache size (ncsize) significantly helps NFS servers that
have a lot of clients. On other systems the default is adequate."(Cockcroft 2)

Question 3: How much swap is needed for a large Oracle database?
Many people are under the impression that very little swap is needed for Oracle
because the architecture uses temporary tablespaces for sorting and the SGA is
fixed in memory. Well the truth is large databases require a lot of swap. The
shared memory segment is backed by swap so the allocated swap MUST be at least
as large as the shared memory segments. In addition when the database uses
intimate shared memory this is also backed by swap. All of the Oracle
processes must be partially backed by swap. Steve Schuettinger, the Oracle
applications specialist at Sun, recommends at least 2 GB of swap for benchmark
testing on large servers. Obviously since RAM plus swap equals virtual memory,
once swap is gone, the program will halt and no new apps can be started until
other programs have stopped. As Adrian Cockcroft says "The important thing to
realize about swap space is that it is the combined total size of every program
running and dormant on the system that matters. When a system runs out of swap
space it can be very difficult to recover. Sometimes you find that there is
insufficient swap space left to login as root or run the commands needed to
kill the errant process that is consuming all the swap space." (Cockcroft 3) In
theory, Solaris 2 changes the rules by adding the RAM and the disk space so if
the system has enough RAM for the workload, "it can run with no swap disk. In
practice common database applications that are sized to run in a few gigabytes
of RAM will actually need many gigabytes of disk allocated as swap space."
(Cockcroft 3) In the same article Cockcroft says "The consequences of running
out of swap space affect a larger number of users on a big server, so it is wise
to allocate a lot more than you normally need to cope with any usage peaks. To
start with, add twice as much disk as you have RAM." (Cockcroft 3) (Tip - It is
not worth making a striped metadevice to swap on - that would just add overhead
and slow it down. There is also a limit of 2 gigabytes on the size of each swap
partition, so striping disks together tends to make them too big.)
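
A small sketch of the sizing rule quoted above: start with twice as much swap
as RAM and split it into partitions no larger than the 2 gigabyte limit. The
function name is hypothetical and the result is a starting point, not a
recommendation for any particular system.

```python
import math

SWAP_PARTITION_LIMIT_MB = 2048   # 2 gigabyte limit on each swap partition

def swap_plan(ram_mb, multiple=2):
    """Return (total swap in MB, number of partitions needed)."""
    total_mb = ram_mb * multiple          # "add twice as much disk as you have RAM"
    partitions = math.ceil(total_mb / SWAP_PARTITION_LIMIT_MB)
    return total_mb, partitions
```

For example, swap_plan(3072) suggests 6144 MB of swap spread over three
separate partitions, with no striping.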

/usr/ucb/ps alx, fields SZ or SIZE, /usr/proc/bin/pmap

% /usr/ucb/ps alx
 F  UID  PID PPID CP PRI NI  SZ RSS WCHAN    S TT    TIME COMMAND
 8 2595 1133 1130  0  48 20 988 360 modlinka S pts/4 0:00 -bin/csh

There is confusion about what ps reports. "The /bin/ps prints a field
labelled SZ, but this is the resident set size in RAM -- printed as RSS by the
/usr/ucb/ps. You need to use the SZ or SIZE field reported by /usr/ucb/ps alx
in units of kilobytes to determine the amount of swap space used by the
process." (Cockcroft 3)

Oracle's Mark Johnson adds the following "I had thought the standard Oracle
rule of thumb was 2 to 4 times physical memory (can be a bit less on very
large memory systems). Smaller memory systems may want to use higher ratios
of SGA size to physical memory size and higher swap space ratios. (I ended
up using ratios of 1:1 and 1:4 for a very small Solaris for Intel system with
surprisingly good results.)"

Hal Stern says "So why do you need swap space if your SGA << phys mem? The
short answer is that the "phys mem" in that calculation is the non-locked-
down physical memory, and when you allocate an oracle SGA, you allocate
intimate shared memory (ISM) that is taken out of the physical memory pool
(ie, it gets locked down). so on a 1 Gbyte machine, you may think you're ok
with a 256M SGA, leaving 700M+ for processes. BUT: the 256M SGA gets taken
out of the available memory pool, so your maximum VM is only 700M+, and you
could probably use the swap space....as the SGA/memory ratio goes up, this is
even more true." (private letter from Stern)
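
Stern's point can be checked with simple arithmetic. The sketch below uses his
1 Gbyte machine and 256M SGA; the function names and the swap figure in the
usage note are hypothetical.

```python
def pageable_mb(phys_mb, sga_mb):
    """Physical memory left for processes once the ISM segment is locked down."""
    return phys_mb - sga_mb

def max_vm_mb(phys_mb, sga_mb, swap_mb):
    """Virtual memory available to processes: unlocked RAM plus swap."""
    return pageable_mb(phys_mb, sga_mb) + swap_mb
```

pageable_mb(1024, 256) is 768, the "700M+" in the quote, and with no swap that
is also the ceiling on virtual memory. Adding, say, 2048 MB of swap raises the
ceiling to 2816 MB.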

Question 4 - Will a faster cpu help performance?
The question is not easy to answer. As Hal Stern noted, "Noticing that you're
using 20 percent of the CPU doesn't mean anything until you know the kind of
work that's using the cycles. If you're CPU-bound, then you have headroom to
increase the workload by a factor of four or five. An I/O-bound job, however,
that uses 20 percent of the CPU might be improved by adding disk spindles. As
you increase the disk count and I/O load, to ease the bottleneck, you'll use
more CPU to deal with the I/O setup, system calls, and interrupts from the
additional work. You run the risk of morphing a disk problem into a CPU
shortage. How do you know when relaxing one constraint pops another one into
the foreground? Define the right relationships -- CPU time used per disk I/O
tells you how much system time you eat up as you add disk load -- and measure
with your tailored yardstick." (Stern 1)
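
The "CPU time used per disk I/O" yardstick can be computed from any pair of
counters covering the same interval. A minimal sketch, with hypothetical
function names and made-up sample figures; the linear projection is an
assumption that only holds while nothing else becomes the bottleneck.

```python
def cpu_ms_per_io(cpu_seconds, io_count):
    """CPU time consumed per disk I/O over one measurement interval."""
    if io_count == 0:
        raise ValueError("no disk I/O in the interval")
    return 1000.0 * cpu_seconds / io_count

def projected_cpu_fraction(cpu_ms, ios_per_second, ncpus=1):
    """Fraction of total CPU needed to sustain a given I/O rate (linear assumption)."""
    return cpu_ms * ios_per_second / 1000.0 / ncpus
```

For example, 20 percent of one CPU over 5 seconds is 1.0 CPU-second; if 500
I/Os completed in that time, the cost is 2 ms per I/O. Doubling the disk load
from 100 to 200 I/Os per second would then need roughly 40 percent of the CPU.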

Preventing Kernel Memory Starvation
When Oracle is working very hard and the operating system is Solaris 2.3 or
early Solaris 2.4, it is possible to have kernel memory allocation faults
that can eventually lead to kernel memory starvation. A new memory allocator
algorithm has been developed and integrated into Solaris 2.5.1 (the old
allocator had paging thresholds that were too low, which caused kernel memory
allocation failures on very large systems). The allocator has been back
ported to rev 40 of the Solaris 2.4 jumbo patch and to a future rev of the
2.5 jumbo patch. No fix has yet been developed for Solaris 2.3. (Tip - large
database users should upgrade to Solaris 2.4 or better). In the past Oracle
customers could manually adjust paging thresholds. The actual value that
needed to be set was proportional and depended upon the amount of memory and
the number of cpus on the system. Also in some cases decreasing maxusers and
bufhwm would mitigate the problem. The total allowable size for the kernel on
the UltraSPARC servers running 2.5 is now so large that kernel memory
allocation problems on very large systems are virtually impossible. See
examples below. The crash output displaying kernel memory starvation is taken
from a SparcServer 1000 running Solaris 2.3 with 1 GB of ram and 8 cpus.

Kernel memory limits

Architecture   Solaris 2.4   Solaris 2.5
sun4c          33MB          33MB
sun4m          61MB          100MB
sun4d          139MB         251MB
sun4u          -             2525MB

$> kas crash 15
> map kernelmap
FREE: 2042   WANT: 1   SIZE: 2042
SIZE    ADDRESS
TOTAL NUMBER OF SEGMENTS 0   TOTAL SIZE 0
> kmastat
                     total bytes    total bytes
size       # pools      in pools      allocated    # failures
-----------------------------------------------------------------
small         6807      26138880       25677584       1989915
big           2652      75276288       73046528             0
outsize          -             -       18571264         45351

Crash is a very powerful tool that helps analyze kernel memory allocation
failures. In the output, "TOTAL SIZE 0" indicates that no more free
kernel memory exists. The FREE field (2042) indicates that there is still
plenty of memory in the user portion of the virtual address space. Carl of
Sunsoft provides an explanation of kernel map scarcity under Solaris 2.3 and
Solaris 2.4. "In the overwhelming majority of cases on large database
servers, we have found that 64MB is overly generous for bufhwm in that it can
be cut back by one-half (to 32MB) without too much of an impact on the cache
hit ratio. What is usually in short supply on these machines is not the
buffer cache but the amount of kernel heap (mapped by kernelmap) that remains
for non-buffer cache usage. Limiting buffer cache growth to 32MB frees up an
additional 32MB to the heap and has proven successful in avoiding kernelmap
scarcity at a number of sites running large database applications. Kernelmap
scarcity (or equivalently kernel heap scarcity as the size of the kernel heap
is limited by the size of the address space the kernelmap can map) results in
an extreme slowdown of processing in the systems. All of a sudden kernelmap
becomes a scarce resource that every thread contends for and to exacerbate
the situation the rate of release is slowed by the very same contention to
the point that kernelmap turnover grinds down almost to the point of
deadlock. Why 64MB's worth of kernelmap is inadequate for the largest
database servers is unknown. The sites on which this has been a problem have
been checked for kernelmap leakage and none has been found. There has also
been a problem in the past with some kernel data structures being pre
allocated from the heap and the size of this pre allocation being
inappropriately scaled to physical memory. As it is fairly common now for
machines to be equipped with 3GB of physical memory, this was not the right
thing to do and did account for some kernelmap depletion headaches. But this
particular bug has been fixed. With these two things discounted, the only
conclusion is that modern database workloads are driving up peak transient
demands for kernelmap to the 100MB level." (Tip - For large databases running
Solaris 2.4 or less set bufhwm to 8000 on 4c, 4m, and 4d or upgrade to
Solaris 2.5 which has a large kernel map address space.)

Acknowledgements
I want to thank Sun performance gurus Adrian Cockcroft and Hal Stern for
their contributions to this paper. UNIX architect Mark Johnson of Oracle and
database expert Jim Skeen of Sunsoft provided comments on Oracle internals.
Kernel architect Jeff Bonwick has added explanations and suggestions
regarding kernel memory allocation and kernel memory starvation. SunService
kernel engineer Paul Faramelli documented the Solaris tuning parameters and
SunService Technical Expert Steve O'Neil provided recommendations for tuning
large Oracle databases on versions of Solaris that are not self tuning.
Finally I want to thank Uresh Vahalia who gave me permission to quote at
length from his wonderful book "UNIX Internals - The New Frontiers".

Disclaimer
The author alone is responsible for the contents of this paper. No one at Sun
Microsystems, Sunsoft, SunService, or the Oracle corporation has reviewed or
approved the paper for completeness or accuracy in its published format and
nothing in the paper can be construed as the official policy of Sun
Microsystems or the Oracle Corporation.

References
"UNIX Internals - The New Frontiers" by Uresh Vahalia, Prentice Hall 1996
"How the Solaris Kernel is Optimized for Oracle" by Mike Jaffee 1996
"Shared Page Table: Virtual Memory Enhancement for Data Sharing in UNIX" H.Yoo
"Comparative analysis of Asynchronous I/O in Multithreaded UNIX" Hyuck Yoo
"Help! I've lost my memory!" by Adrian Cockcroft, SunWorldOnline 1995 (1)
"What are the tunable kernel parameters for Solaris 2?" by Adrian Cockcroft (2)
"How does swap space work?" by Adrian Cockcroft, SunWorldOnline 1995 (3)
"We suggest creative ways to better your system performance" by Hal Stern
System Services Guide - Solaris 2.4 Manual, SunSoft, 1994
"The Slab Allocator: An Object-Caching Kernel Memory Allocator" Jeff Bonwick

--Boundary_(ID_uFTkQ6oJcDz/Yl19mmir5w)--