Hunting red herrings

Sometimes you “know” the problem from the first moment. But sometimes your gut feeling points to something that is perceived as a large change, so you have to find the smoking gun, the undeniable proof for your hypothesis. This is the story of such a search. It started with a telephone call from a colleague, who had gotten my name from another colleague. An Oracle database running on a Solaris system, with the datafiles and logs located on a Veritas File System (vxfs). The customer saw massive delays (in the range of hundreds of seconds) when executing certain commands. One of those commands was “TRUNCATE TABLE”.

A hypothesis - but the proof?

And it all began with a red herring.

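The line below came out of a truss of one of the affected Oracle processes. Just for context: such output is typically gathered with something along the lines of the following command, where the output path and the pid are placeholders, not what was actually run at the customer.

# truss -o /tmp/ora.trc -p <pid>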
semtimedop(28, 0xFFFFFFFF7FFFD644, 1, 0xFFFFFFFF7FFFD630) Err#11 EAGAIN

In this case the thread is trying to execute an operation on a semaphore, but it wasn’t able to do so. However, the semtimedop call is time-bombed: when the timeout is reached without being able to operate on the semaphore, it terminates with error 11 (EAGAIN). All the timeouts were consistent with the waiting time seen from the perspective of the SQL commands. Obviously the customer and the other involved parties were tempted to see this as the problem, but they already suspected it might just be the harbinger of bad news. And after a short look into the truss files I was pretty sure their doubts were justified: it was just the harbinger of bad news. After a short amount of research I suspected that we were talking about a locking problem here.

There was just one problem: vxfs. I have seldom worked with it, so it’s not really my center of expertise. One point that diverted the customer’s attention from the locking was a small but important difference: the customer knew that Oracle likes Direct I/O. With UFS, “Direct I/O” does a little bit more than just making the I/O direct by disabling buffering. It also removes the inode r/w lock mandated by POSIX rules. The customer knew this about UFS Direct I/O and thus had activated Direct I/O on vxfs. And so I found lines like this:

/oracle/importantdatabase/oradata1 on /dev/vx/dsk/importantdatabase/oradata1 read/write/setuid/devices/mincache=direct/convosync=direct/delaylog/largefiles/ioerror=mwdisable/mntlock=VCS/dev=51836b0 on Thu Mar 17 20:14:11 2011

However, I still suspected a lock contention problem, and I had a reason for it: Direct I/O isn’t the same thing on vxfs as it is on UFS. On vxfs, Direct I/O really is just the “direct” part. It doesn’t enable concurrent I/O (I will explain that moniker later) to a file; the removal of the inode r/w lock isn’t part of the feature. For that you have to use either Quick I/O (QIO) or the ODM module for vxfs. As neither feature was activated, that was the moment where I told the customer: “Hey, choose the ODM module for vxfs or QIO, activate it, and the problem should go away.” Both remove that lock contention and are thus a big help in getting better Oracle performance when using vxfs. Just to clear up a common misunderstanding: ODM (Oracle Disk Manager) is an API of Oracle, not of Veritas; Oracle’s dNFS (direct NFS) is implemented via an ODM module as well. The problem: you used to pay extra for both of them with vxfs, neither of them is really cheap, and before making the change the customer wanted to know that I was right with my diagnosis (according to the release notes, ODM and QIO are now part of Storage Foundation, except in the Basic edition).

I wrote of two problems, but have described only one so far. Normally such inode rwlock contention problems are quite easy to find. But not in this case. vxfs differs from UFS in a multitude of ways. It doesn’t use the locking primitives of Solaris but brings its own instead, and thus all the values reported by my preferred diagnosis tools were pretty useless. Damned … how are you supposed to find problems when your instruments can’t show them? Without instrumentation, troubleshooting is just guesswork and experience. At this point a question on a mail alias (it’s great to have people on internal aliases who have forgotten more about Solaris than I know) and some research via Google yielded the same result within a few minutes: vxfs`vx_rwsleep_rec_lock is the function waiting on/implementing the POSIX inode rw lock.
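As a quick sanity check you can verify that fbt probes for this function actually exist on a given system; something along these lines should do it (this is just an illustration, not part of the original diagnosis):

# dtrace -l -m vxfs | grep vx_rwsleep_rec_lock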
Now I was back in the game and able to use all the nice tools of the operating system I prefer.

Digging in the dirt

I asked the customer to put a DTrace script into a wrapper script that would be executed at the moment of the wait:

bash-3.00# cat lock.d

/* stop the script after roughly 60 seconds */
tick-1s
/ i++ >= 60 /
{
        exit(0);
}

/* remember when the current thread entered the lock function */
fbt:vxfs:vx_rwsleep_rec_lock:entry
{
        self->in = timestamp;
}

/* on return, aggregate the time spent in it per kernel stack */
fbt:vxfs:vx_rwsleep_rec_lock:return
/self->in/
{
        @locked[stack()] = quantize(timestamp - self->in);
        self->in = 0;
}
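
For completeness: such a script is typically launched with dtrace -s, so the wrapper probably contained something along the lines of the following (the output path is just an example, I don’t have the exact invocation at hand):

# dtrace -s lock.d -o /var/tmp/lock.out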

The result was interesting, as it clearly showed a peak of 307 events in the range 34359738368 nanoseconds (34.36 seconds) to 68719476735 nanoseconds (68.72 seconds).

              vxfs`vx_write_common+0x2a8
              vxfs`vx_write+0x28
              genunix`fop_write+0x20
              genunix`pwrite+0x22c
              unix`syscall_trap+0xac

           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |                                         1
            2048 |@@@@@                                    160
            4096 |@@@@@@@@@@@@@@@@                         528
            8192 |@@@@                                     144
           16384 |@@                                       57
           32768 |                                         0
           65536 |                                         8
          131072 |                                         0
          262144 |                                         2
          524288 |                                         1
         1048576 |                                         4
         2097152 |                                         7
         4194304 |                                         8
         8388608 |                                         13
        16777216 |@                                        23
        33554432 |@                                        25
        67108864 |@                                        23
       134217728 |                                         1
       268435456 |                                         0
       536870912 |                                         0
      1073741824 |                                         0
      2147483648 |                                         0
      4294967296 |                                         0
      8589934592 |                                         0
     17179869184 |                                         0
     34359738368 |@@@@@@@@@                                307
     68719476736 |                                         0

This was especially interesting as the same DTrace script didn’t show such a peak during times when the system ran flawlessly:

              vxfs`vx_write_common+0x2a8
              vxfs`vx_write+0x28
              genunix`fop_write+0x20
              genunix`pwrite+0x22c
              unix`syscall_trap+0xac

           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |                                         3
            2048 |@@@@@@@@                                 77
            4096 |@@@@@@@@@@@@@                            122
            8192 |@@@@@@@@@                                89
           16384 |@@@@@@@@                                 77
           32768 |                                         0
           65536 |                                         0
          131072 |                                         0
          262144 |                                         0
          524288 |                                         0
         1048576 |@                                        9
         2097152 |                                         1
         4194304 |                                         1
         8388608 |                                         0

Okay … well … next step: which parts of the system were executing this vxfs`vx_rwsleep_rec_lock function? I could have used DTrace for this task as well, but I wanted some additional insight in one step. Thus I used a nice little command of the modular debugger in Solaris:

# echo "::threadlist -v" | mdb -k
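(Just for illustration: the DTrace route would probably have been a one-liner along these lines, aggregating callers of the lock function by process name and kernel stack. A sketch, not what actually ran at the customer:

# dtrace -n 'fbt:vxfs:vx_rwsleep_rec_lock:entry { @[execname, stack()] = count(); }'
)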
The ::threadlist output is quite long on a loaded Solaris system. It prints something like this for each thread:

ffffff02d153ce40 ffffff02cefdb540 ffffff02cefff1a0   1  59 ffffff02d153d01e
  PC: _resume_from_idle+0xf1    CMD: /lib/svc/bin/svc.startd
  stack pointer for thread ffffff02d153ce40: ffffff000f91ccc0
  [ ffffff000f91ccc0 _resume_from_idle+0xf1() ]
    swtch+0x145()
    cv_wait_sig_swap_core+0x174()
    cv_wait_sig_swap+0x18()
    cv_waituntil_sig+0x13c()
    lwp_park+0x157()
    syslwp_park+0x31()
    sys_syscall32+0xff()

I hate multi-line output when searching for patterns. There is nothing better than two monitors, a terminal stretched across both, and the two jelly-like grep implementations on the front side of your skull. But that works best if each event sits on just one line. So I did some grep/sed-fu on it.

echo "::threadlist -v" | mdb –k | sed 's:^$:§:' | tr -d '\n' | tr '§' '\n' > mdbout

Each thread is now on a single line. Yeah … perhaps there is a more elegant way to do this, but that was the first thing that came to my mind ;)
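(One slightly more compact variant would probably be awk’s paragraph mode, which treats blank-line-separated blocks as records. Again, just a sketch of the idea, not what I actually typed back then:

# echo "::threadlist -v" | mdb -k | awk 'BEGIN { RS="" } { gsub(/\n/," "); print }' > mdbout
)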

00000300646655e0      600a6796cb8      600d62e14d8   1  60      600e64f4520  PC: cv_wait+0x38    CMD: ora_dbwriter7_importantdatabase  stack pointer for thread 300646655e0: 2a1075f4da1  [ 000002a1075f4da1 cv_wait+0x38() ]    vx_rwsleep_rec_lock+0x70()    vx_write_common+0x2a8()    vx_write+0x28()    fop_write+0x20()    pwrite+0x22c()    syscall_trap+0xac()

Just a quick check.

# cat mdbout | grep "vx_rwsleep" | wc -l
1008
At the moment of the hang, 1008 threads were in vxfs`vx_rwsleep_rec_lock. That was interesting. Even more interesting was the list of commands that had threads in the mentioned function. It’s column 10 of the threadlist in its concatenated form.


# grep "vx_rwsleep_rec_lock" mdbout | tr -s ' ' | cut -d " " -f 10 | sort | uniq
ora_dbwriter1_importantdatabase
ora_dbwriter3_importantdatabase
ora_dbwriter5_importantdatabase
ora_dbwriter7_importantdatabase


When you dig further into the large heap of data:

# grep "ora_dbwriter3_importantdatabase" mdbout | wc -l
258
# grep "ora_dbwriter3_importantdatabase" mdbout | grep -v "vx_rwsleep_rec_lock" | wc -l
7

Of all the threads belonging to ora_dbwriter3_importantdatabase, just seven weren’t in the vx_rwsleep_rec_lock function. At that moment I thought: that isn’t a smoking gun, that’s a smoking howitzer.

An attempt to explain

Most of the threads executing this function belong to the database writers. When you think about it, that’s not so astonishing, especially when you consider the nature of a rwlock.

First of all: there is a rwlock for each inode in a filesystem. Its function: multiple readers can take the lock and thus read concurrently from the file, but only one writer can hold it and thus write to the file. Equally important: you can’t write to the file as long as one or more readers are in the code path protected by the rwlock for this file, and nobody can read from the file as long as there is a writer in the protected code path. In really basic rwlock implementations this can lead to writer starvation, as it’s hard for the writer to get the lock: all readers have to relinquish the rwlock, and no new readers should start before the writer can get it. For this reason the Solaris threads implementation tends to favour writers over readers. However, when you have many writers, it may take a long time before the backlog of writes has been worked through. Blindly preferring writers isn’t a solution either, because then readers would starve, which is even more problematic, because reads are always synchronous by nature. As I have written elsewhere: while a system can choose the time of a physical write to some extent, it can’t choose the time of a read. A function won’t execute as long as the data isn’t available. But that’s out of the scope of this article. For the capability to write and read to a file in parallel, the name concurrent I/O was coined.

I just wrote that it can take a moment before the backlog of writes has been worked through. In this case it was even worse: the inode r/w lock adds insult to injury, because it basically limits you to a single write I/O operation to a file at a time, no matter how many HBAs or how many disks you have in your system. And now you’ve turned a moment into a while. Even when the changes in the file are totally unrelated, e.g. changing a block belonging to the user table stored in it and another block of the article table, or reading a block with customer data into the SGA while writing the new salary of the promoted assistant: you can’t do this in parallel, due to the inode rwlock. And with many updates in your workload it’s not that astonishing that a growing number of database writer threads start twiddling their thumbs while waiting for their turn to write to the file.

You may ask yourself why the heck there is such a mechanism in the first place. The r/w lock is mandatory in order to be POSIX compliant. You need it to ensure write ordering and consistent reads when updates occur in parallel with reads. Obviously you really want such a protection when working with files. However, especially with databases, a file is just a container for a large heap of things. Independent things. And then the situation is different. For this reason there have been developments in the database realm to get rid of the inode rwlock and to put this mechanism elsewhere. Oracle allows you to use a raw disk, so it has to do the consistent-read and write-ordering work anyway, and as it’s aware of the inner structure of its heaps of data, it can do so with a much finer granularity than just per inode and thus per file. In that case the inode r/w lock is just a bottleneck without any use. For this reason, UFS Direct I/O for example offers a mode that removes the lock. It’s not that the write ordering or consistency protections are gone. They are just in a layer that knows more about the structure inside the file and thus can do a better job at it. vxfs has similar mechanisms: QIO and ODM don’t do such inode-wise locking. They work differently compared to UFS Direct I/O, but as an earlier chancellor of the Federal Republic of Germany said: what matters is what comes out at the end.

One question was still open: why was this problem reproducible with a “TRUNCATE TABLE” command? That’s pretty easy to explain, but you have to dig a bit into the internals of Oracle. When Oracle executes a TRUNCATE TABLE command, it checkpoints the database. In such a situation it writes all dirty blocks from the SGA into the database datafiles. This must be done for recovery purposes. Such a checkpoint may trigger a storm of writes via the database writers, especially when the SGA contains a lot of dirty blocks. The checkpoint has to complete before the TRUNCATE TABLE executes. And so we arrive at another red herring at the end: it’s not the TRUNCATE TABLE command that was slow … it’s the checkpoint occurring before it. You can check this pretty easily: when a “TRUNCATE TABLE” takes too long for your taste, trigger a checkpoint manually and do the TRUNCATE TABLE directly afterwards. The TRUNCATE TABLE still does a checkpoint, but as you’ve already cleaned the dirty buffers out of the SGA, it doesn’t have much writing left to do. It should run much faster now.
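A minimal way to try that, assuming a sysdba session and a placeholder table name, might look like this:

$ sqlplus / as sysdba
SQL> REM flush the dirty blocks from the SGA with a manual checkpoint first
SQL> ALTER SYSTEM CHECKPOINT;
SQL> REM the implicit checkpoint of the TRUNCATE now has little left to write
SQL> TRUNCATE TABLE some_big_table;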

Conclusion

At the end I had to tell the customer that, in essence, everything works as designed. It would be a bug if the system behaved even a little bit differently. However, that’s seldom the answer a customer wants to hear. So: the solution for the issue? It’s as old as it is simple: get rid of the inode rwlock. Get concurrent I/O, either by using raw disks, by using ASM, by using UFS with Direct I/O, or by using ODM or QIO for vxfs. I can only reiterate something I’ve already said: when you put your Oracle database files into a filesystem, you want to use direct I/O and concurrent I/O!