Friday 24 December 2010

When the logger itself halted the product

The latest release of my product (November 2010) would always be remembered for its PSR (Performance, Scalability & Reliability) issues. Though we had faced and resolved PSR issues in every release, what made this release stand out was that the issues were surfacing in the legacy (more than 10 years old) infrastructure code base.

We had seen issues like memory leaks, memory fragmentation, RPC failures, etc. The infrastructure had survived for the past 10 years with such issues mainly because it had not been stretched to its limits earlier. While most of the issues were resolved and fixed, one of them could not make it into the main release.

In one of the PSR runs, QA observed that jobs were not running and there were errors in the event logs indicating that something was wrong with the system. I logged into the QA environment and checked the status of the system. I launched the utility to enable product logging; even this seemed to be hung. I went to the registry to enable the logging manually. Strangely, logging was already enabled. I looked at the log file size: it was around 4 GB. As an experiment, I disabled logging and the product started working, i.e. it came out of the halted state. I enabled logging and again it hung. I verified this mostly by launching the external utility we use for diagnostic purposes; as mentioned earlier, even this would freeze as it tried to communicate with the product's Windows service. I published the observations. Since there were other priority issues to take care of, we kept this issue at a lower priority, as it occurred only with logging enabled (and with the assumption that the 4 GB log file size was the cause).

After a couple of days, the same issue was reported in another scenario. Again it was because of logging. But this time the log file size was 512 KB, so the large-log-file assumption went for a toss. Since I was working on other critical issues, I was not involved in this one. Later I went on Diwali leave and took a good break after a hectic release. It was decided that some of the issues which had workarounds and could not be fixed in time would be worked upon in parallel and a hotfix would be released. The logger issue was one of them.

I came back from the break only to find that this issue was waiting for me. The same issue had been reported by a customer. We had to enable logging for troubleshooting at the customer site, and now the logger itself was causing the product to freeze. I went through the emails describing the analysis done so far on the issue by others. The main points from the analysis were:
  • From the dumps, and also by attaching the debugger at runtime, it was confirmed that the issue was with a mutex in the logger.
  • A logger component was shared across various processes, all logging to a single file; a mutex was used to serialize access to it.
  • All the components which appeared to freeze were in fact waiting on this mutex and timing out.
The following steps had been tried to solve the issue:
  • A smart class was used to acquire/release the mutex. This did not work.
  • The other solution tried was to create a separate log file per process. With this the issue was not observed, but it was not an acceptable solution.
Also, nobody had concentrated on why the wait for the mutex was timing out. It meant someone was holding the mutex and not releasing it. Nobody tried to find, through a code walkthrough, whether there were any cases in which a given thread would not release the mutex. Instead, people concentrated on factors like the number of processes logging to a single file, mutex contention, etc.

Finding the root cause
Now it was time to find the root cause of the issue, not to list possible causes. I had two things in mind:
  1. In order to reproduce the issue, I had to create a sample application which used the logger. Surprisingly, this had not been done, and people were reproducing the issue using the product, which would take a cycle of no less than 10 hours.
  2. Secondly, I had to look at the code and find whether there was any situation in which an acquired mutex would not be released by the owning thread/process.
I created a sample application which would just call the logger in a loop.

The breakthrough
Well, I just started launching multiple instances of the sample console application. Every instance was logging to the file and also to the console (the logger could be configured to log to a file, to the console, or as a debug string). I kept launching instances and closing some. While doing this, suddenly all the instances froze except for one. Even a new instance would not log anything and would appear to be hung. Then I closed the instance which was logging. As soon as I did this, another instance started logging, but not all of them. Only a single instance would log at a time. This continued: the moment I closed the instance which was logging, exactly one of the other instances would take over.

In the logger I had made changes to log the return value of the ::WaitForSingleObject() API. What I noticed was that when another instance started logging, the ::WaitForSingleObject() call would return WAIT_ABANDONED instead of success. I had read about this value previously, just to confirm that a thread/process dying without releasing the mutex should not be an issue, and I could see handling for this return value in the code. I had another look at the code, and there the mystery unraveled itself: the WAIT_ABANDONED return value was not handled properly.
The following code snippet resembles the one in the product:

The following is the documentation for this return value:

WAIT_ABANDONED: 0x00000080L The specified object is a mutex object that was not released by the thread that owned the mutex object before the owning thread terminated. Ownership of the mutex object is granted to the calling thread and the mutex state is set to nonsignaled.

If the mutex was protecting persistent state information, you should check it for consistency.

In the code, if the return value was WAIT_FAILED or WAIT_ABANDONED, we were just returning from the logger function. But as per the documentation, if the return value is WAIT_ABANDONED then ownership is granted to the calling thread, and hence ReleaseMutex() should be called to release the mutex. That was not happening.

This explained why, when I was launching and closing instances of the sample application, suddenly all instances except one froze. One of the instances I closed must have acquired the mutex and terminated before releasing it. I simulated this behavior in the sample application by adding a condition to terminate the process without releasing the mutex, and the issue was reproduced easily. I also confirmed that if ReleaseMutex() is called in the WAIT_ABANDONED case, ownership is released properly.

How the product was impacted
In the product we create separate processes to perform the work. These are created, reused for a certain time, and then terminated. Hence there was a possibility that such a process would terminate while holding the mutex. But then the mutex should simply be in the abandoned state, and everything should still work. The catch: suppose that after the mutex is abandoned due to process termination, the very first thread to get the WAIT_ABANDONED return value is, say, a monitoring thread (i.e. a thread that does not terminate until its process exits). Because of the bug, this thread then holds ownership of the mutex indefinitely, while all other threads (in all other processes) time out after waiting for the specified time (15 seconds).

Verifying the fix
In the logger, the mutex was used only to make sure that a single process wrote to the file at a time. Hence we could safely continue even if the wait function returned WAIT_ABANDONED, i.e. the system would not be left in an unstable state. The fix was to make sure that ReleaseMutex() was called even when the wait function returned WAIT_ABANDONED.

With all the analysis and testing (with the sample application) it was clear that the issue was caused by an abandoned mutex. But we still wanted to be sure. So we added debug strings in the logger which would print the return value of ::WaitForSingleObject() whenever it did not return success. During QA testing we used DebugView from Sysinternals. The debug logs showed that ::WaitForSingleObject() did return WAIT_ABANDONED, and with the fix in place the product no longer froze.