Friday, 24 December 2010

When the logger itself halted the product

The latest release of my product (November 2010) would always be remembered for the PSR (Performance Scalability & Reliability) issues. Though we had faced and resolved PSR issues in every release, what made this release stood out was the fact that the issues were surfacing in the legacy (more than 10 years old) infrastructure code base.

We had seen issues like memory leaks, memory fragmentation, RPC failures etc. The infrastructure had survived for past 10 years with such issues mainly because it was not stretched to its limits earlier. While most of the issues were resolved and fixed, one of them could not make to the main release.

In one of the PSR run QA observed that the jobs were not running and there were errors in the event logs indicating that something was wrong with the system. I logged into the QA environment and checked the status of the system. I launched the utility to enable product logging. Even this seemed to be hung. I went to the registry to enable the logging manually. Strangely the logging was enabled. I looked at the log file size. It was around 4 GB. As a try I disabled the logging and the product started working i.e. came out of halted state. I enabled logging and again it went into hung state. I verified this mostly by launching the external utility which we use for diagnostic purpose. As mentioned earlier even this would freeze as it tried to communicate with product’s windows service. I published the observations. Since there were other priority issues to take care of we kept this issue at lower priority as it was coming only because of logging enabled (and with assumption that 4 GB log file size would be the cause).

After a couple of days, same issue was reported in another scenario. Again it was because of logging. But this time the log file size was 512 KB. So the large log file size assumption went for a toss. Since I was working on other critical issues I was not involved into this one. Later I went on Diwali leaves and took a good break after a hectic release. It was decided that some of the issues which had workaround and could not be fixed would be worked upon in parallel and a hotfix would b released. The logger issue was one of them.

I came back from the break only to find that this issue was waiting for me. Same issue was reported by a customer. We had to enable the logging for troubleshooting on customer site and now the logger itself was causing the product to freeze. I went through the emails describing the analysis done so far on that issue by other people. Following were the main points from the analysis:
  • From the dumps and also by attaching the debugger at runtime it was confirmed that the issue was with mutex in logger.
  • We were using a logger component across various processes and which would log to single file. For this mutex was being used.
  • All the component which would appear to freeze were in fact waiting on this mutex and timing out.
 Following steps were tried to solve the issue:
  • Used smart class to acquire\release the mutex. But this did not work.
  • Other solution tried was to create separate log file per process. With this the issue was not observed but this was not an acceptable solution.
Also, nobody concentrated on why wait for mutex was timing out. This meant someone was holding the mutex and not releasing it. Nobody tried to find, by going a code walk through, if there were any cases when a given thread would not release a mutex. Instead people concentrated on fact like number of processes logging to single file, mutex contention etc.

Finding the root cause
Now, it was the time to find the root cause of the issue and not list the possible causes. I had two things in mind:
  1. In order to reproduce the issue I had to create a sample application which used the logger. Surprisingly this was not done and people were reproducing the issue using the product which would take not less than 10 hours of cycle.
  2. Secondly, I had to look at the code and find if there was any situation when an acquired mutex would not be released by the thread\process.
 I created a sample application which would just call the logger in a loop.

The breakthrough
Well, I just started launching multiple instances of the sample console application. Every instance was logging to the file and also to the console (the logger could be configured to log to file, console or log as debug string). I was lunching instances and closing some. While doing this suddenly all the instances froze except for one. Even a new instance would not log anything and would appear to be hung. Then I closed the instance which was logging. As soon as I did this another instance started logging but not all. Only a single instance would log at a time. This continued i.e. the moment I closed the instance which was logging only one of the other instances would log.

In the logger I had done changes to log the return value of the WaitForSingleObject() API. What I noticed was that when another instance would start logging WaitForSingleObject() call would return with WAIT_ABANDONED instead of success. I had read about this previously just to confirm that if a thread\process dies without releasing the mutex then it should not be an issue and also in the code I could see handling for this return value. I had another look at the code and there the mystery unraveled itself. The issue was that in the code the WAIT_ABANDONED return value was not handled properly.
Following code snippet resembles that in the product:

Following is the documentation for this value:

WAIT_ABANDONED: 0x00000080L The specified object is a mutex object that was not released by the thread that owned the mutex object before the owning thread terminated. Ownership of the mutex object is granted to the calling thread and the mutex state is set to nonsignaled.

If the mutex was protecting persistent state information, you should check it for consistency.

In the code if the return value was WAIT_FAILED or WAIT_ABANDONED then we were just returning from the logger function. But as per the documentation if the return value is WAIT_ABANDONED then the ownership is granted to the thread and hence ReleaseMutex() should be called to release the mutex which was not happening.

This explained why, when I was launching and closing the sample application suddenly all instances except one could log. In that case when I was closing the instances one of them might have acquired the mutex and terminated before releasing it. I simulated this behavior in the sample application by putting a condition to terminate the process without release the mutex and the issue got reproduced easily. Also, then I confirmed that if we call RelaseMutex() in case of WAIT_ABANDONED condition then the ownership is released properly.

How the product was impacted
In the product we create separate processes to perform the work. These are created and terminated after reused for certain time. Hence, there was possibility that such processes would terminate while holding the mutex. But then the mutex should be in abandoned state and everything should work fine. Assuming that after the mutex is abandoned due to process termination, the very first thread that got the WAIT_ABANDONED return value is, say, a monitoring thread (i.e. the thread does not terminate till the process exists) then this thread will have the ownership of the mutex where as all other threads (in all other processes) will timeout after waiting for specified time (i.e. 15 seconds).

Verifying the fix
In the logger the mutex was used for making sure that only one process was logging to the file at a time. Hence, we could continue even if the wait function returned WAIT_ABANDONED i.e. the system won’t be in an unstable state. The fix was to make sure that ReleaseMutex() was called even if the wait function returned WAIT_ABANDONED.

With all the analysis and testing (with sample application) it was clear that the issue was because of abandoned mutex. But still we wanted to make sure that this was the issue. So we added debug strings in the logger which would print the return value of ::WaitForSingleObject() in case it failed. In the QA testing we used DebugView from Sysinternals. The debug logs showed that the ::WaitForSingleObject() did return WAIT_ABANDONED and because of the fix now the product was not failing.

Tuesday, 23 March 2010

Active Directory error - The server does not support the requested critical extension

This one was very interesting. The customer was running a query on Windows Active Directory Groups using our product. The issue was easy to reproduce in customer environment. Ask any developer working on customer issues how good it is to know that the issue is consistently reproduced !!! This is as if half the battle is won for the troubleshooter :)

Now, customer was running a basic query including basic fields like domain name, group name etc. Till then it was 'Aal is well'. The customer modified the query to include field displaying the group's members. That's when the query would fail with 'The server does not support the requested critical extension' error. The issue would come only when the group member field was included in the query. This error was returned by Active Directory. Also, in the past release we had faced this error but there was no solution to it other than putting a retry. But in customer scenario the error would always come for each run of the query so retry option was not feasible. Also, on net there were mostly indications about sorting on DN etc. which could cause this error. There were other reasons also mentioned (but not many). None of them were even close to indicate as to why the error would come only when an additional field was included. Also, the query would fail randomly on any group.

I had a look at the implementation. We were just firing an AD query and enumerating the results. We were enumerating the groups one by one and also fetching all the additional information about the group. The error was thrown when we were calling GetNext() on the result set to move to next Group. Following should give an idea about the implementation:

IDirectorySearch directorySearch = SearchADGroups();
foreach(ADGroup group in directorySearch)
{
    // Get additional group information by
    // making additional AD connections
}
 
So the search handle would remain open till the end. I thought that the issue could be because of AD connections being leaked. So I inspected the code but in vain. The customer has large number of groups with each group having more than 5 members. Assuming that it could be because of large number of groups I created almost double the number customer had. Also, added members to each group. The query in my test environment would run overnight but would not fail.

The customer was migrating from old release to new one and this issue was stopping him from migration. We needed a solution quickly. Analyzing the customer scenario was not useful as there was no clear-cut indication of when the error would be thrown by Active Directory. So instead of beating around the bush I decided to look for a quick solution.

I moved back to my initial observation i.e. we were firing an Active Directory query and while iterating the groups we would fetch additional information about the group. When the group's members were to be fetched were doing addition bindings to Active Directory. So while AD search handle was open were making additional AD bindings. I thought lets try enumerating all groups first, store them in a linked list, close the search handle and then iterate through the groups in the linked list. This way while querying the group details including the membership information we would not have the AD search handle open. I did the changes and gave the binary to customer for verification.

The private binary did not work. Did that mean my fix was not working? Fortunately I had put additional logs and those indicated that there were two code paths. So, I had to put the fix in other code path also. Then came the good news. The query executed successfully in customer environment. I knew there would be some performance issue as now we were first storing all groups in a linked list and then again iterate through the list. Besides this even more memory would be consumed. So there was another battle to fight, that of performance (memory was not an issue).

I was not expecting any issue with performance as how much time would an iteration contribute especially when the query time was in minutes? Worst case 0.5 seconds. But I was proved wrong. We could not reproduce the customer issue in our scenario and hence our QA could test the performance impact of the fix. The performance figure indicated degradation by almost 100%. How come a loop cause such a degradation? I kept only the loop and removed the code for fetching the additional group information. Well, the time taken for completely iterating the list was in milliseconds. So what was the issue?

The only thing I did was instead of keeping the AD connection open (i.e. the search handle) I first stored the groups in a linked list. So what was happening earlier was that an AD connection was open and we were making additional AD connections to fetch additional information. This was causing issue in customer scenario. But now that was fixed and another one surfaced. I was just thinking about the changes I did and why it should be causing the performance issue. Suddenly I thought I was about to nail down this on also. Out of nowhere AD connection caching came to my mind. Remember the issue was that the search handle was being kept open while we were fetching additional information by making additional AD binds. Open search handle means an open AD connection. That was the difference.

What is AD connection caching? In simple terms it means reusing the AD connection. When you make an AD connection and that connection is open then a request for another connection will be served using the same connection internally i.e. to you it appears as if a new connection is created but that is not the case and internally a previously opened connection is used. Following are the two scenarios:

Scenario1:
  • Create AD connection A
  • Perform some operation using A
  • Create AD connection B
  • Since A is already opened it will be used i.e. calls on B will be made using A (this is just to explain things here. Follow the link to understand AD caching.)
  • Performance improves since we are reusing the connection

Scenario2:

  • Create AD connection A
  • Perform some operation using A
  • Close connection A
  • Create AD connection B
  • Since A is closed as new connection will be created
  • Performance impacted since we are creating new connection
Previously, when we were enumerating through the AD results set and also making additional AD connections that meant that AD connection caching was happening. When I changed the implementation to first store all groups in a linked list that meant that the first AD connection was closed. Now, while iterating through the linked list we were making additional AD connections. But these were created and closed in the loop itself and hence no connection caching or reuse was happening. This was the reason for performance degradation. All I had to do was:
  • Make AD search and read all groups in a linked list
  • Close the search handle
  • Create another AD connection, say A and keep it open
  • Iterate through the linked list and fetch the required information for the groups by making additional AD connections
  • Once all groups are processed close AD connection A
Now, in the loop when we were creating additional AD connections we were leveraging the AD connection caching. How? We created an AD connection A and kept it open till the end. The performance came back to original one (only difference of few milliseconds). But this was not the end. Picture abhi baki thi. I had changed the code to have a AD conneciton open and it was similar to the original code. What if the issue in customer scenario was because of an open connection. Well I thought this would not be the case and the issue was because of the search handle being open i.e. we fired the AD query and not only taking too much time to iterate through the results but also making additional binds. So, if the performance fix would work in customer scenario then it would prove my point. And I was proved right. The fix worked in customer scenario also.

Epilogue
Well as I mentioned earlier, while thinking about the changes I remembered AD connection caching. I should be thanking my friend SJ for this :) In previous release he had faced such issue. In that particular case the connection handles were increasing. He came across AD connection caching and tried many things and I was curiously observing all this.

Let me conclude this post by pointing out that one of the best ways to learn things is to look at the issues others are facing. That's what I mostly do and one of the reasons I leave late from office :)

Saturday, 27 February 2010

Why 'Component Services' (dcomcnfg.exe) would hung?

This was really a hard one to crack. The issue was as follows:
  • After installing the product launch the 'Component Services' snap-in i.e. dcomcnfg.exe.
  • Expand the 'Computers' node.
  • The snap-in would freeze.
  • After few minutes the node would expand but the 'My Computer' node had red icon next to it indicating that MSDTC service did not start properly.
  • When the snap-in would freeze that time if the service installed by the product was stopped then the snap-in would unfreeze quickly. But still red icon against the 'My Computer' node.
To reproduce the issue, the machine had to be restarted every time. This was one issue where-in I must have set a record in restarting the machine. I had once accidentally selected windows shutdown instead of restart and hence I created a batch file with restart command.

The issue would come after the product installation. In order to make sure that the issue was because of the components installed by the product we disabled all the product services. With this the 'Component Services' snap-in would not freeze.

Since it appeared to be a hang I got common inputs from people like there is resource conflict, could be a deadlock etc. I thought of using hang analysis option of DebugDiag. But launching the DebugDiag would throw an error stating that the msdtc service was not started. So I could not use DebugDiag. Next I used windbg and took dump of the hung 'Component Services' snap-in as well as MSDTC service. Compared thread stacks with the one when no freeze was observed but in vain. 

There was another way to reproduce the issue. Disable all the service installed by the product as well as the MSDTC service. After restart start any one of the product service and then the MSDTC service. The issue would resurface. But the 'Component Services' snap-in would not freeze for too long. However, it was still not usable as before. Since the issue was coming when our services were started before MSDTC service we added dependency on MSDTC service but that did not help either.

Now, the next step was to find out which part of the code was causing the issue. So I disabled all the product services. Since the issue could be reproduced even with anyone of the product services, I choose the one which had least initialization steps. I started commenting the code and finally arrived at the culprit. We had used .Net remoting with custom channel i.e. internally we used RPC. In this when the RPC server was started the issue would resurface. I thought the issue was because of starting of an RPC server before MSDTC service doing some initialization. So in the service start I added code to start another executable. Then instead of starting the RPC server in the product I started external RPC sample server. But issue was not observed in this case. I tried using different port numbers assuming that there could be port conflict but that didn't help. So this meant that the issue was with the way we were creating RPC server in our product. We were creating the RPC server from managed .Net assembly i.e. calling unmanaged APIs using PInvoke. I created a separate exe which would start the RPC server similar to the product service. In this case the issue was observed. So instead of depending on service I could now use this exe. Then, I started commenting the RPC initialization code. Finally, the API that was causing the issue was caught and it was RpcEpRegister().

As I mentioned earlier, since we were calling unmanaged APIs from managed code I suspected something wrong with the managed to unmanaged calls. But that was not the case. I even tried to register the endpoint completely from unmanaged code but that did not help. (I had tried few additional things here also).

The RpcEpRegister() function is used to register the RPC server with its port to the endpoint mapper database. It is required while using dynamic ports i.e. every time the server starts it will use the available port number. But how will the client know the port which is allocated at runtime? That's where RpcEpRegister() is required. The client will just say I need to connect to this RPC server (identified using a GUID) on machine xyz. The endpoint mapper on machine xyz will look up the port number and then the connection proceeds. In our product we were not using dynamic ports. Every service had predefined port number assigned. Hence, in worst case commenting out the RpcEpRegister() call would have solved the problem for us.

I had even tried using procmon.exe from Sysinternals. Took the dumps with and without hang and compared them. But this also did not help. Then I remembered rpcdump.exe. This utility dumps all the RPC servers along with the protocol, port and RPC interface id (let's call it RPC server id). I used that tool and took the dump. While going through the dump I observed that there were two entries for our RPC server. One with tcp protocol and other with local RPC protocol (LRPC). This was because while registering the protocols we would register both tcp and LRPC. So I commented out LRPC protocol registration. The issue did not come. Then I registered only LRPC and commented tcp. The issue came. Looked like the issue was with LRPC protocol only. So, I tried registering only name pipe protocol instead of tcp or LRPC. The issue did not come. So it was confirmed that the issue was with LRPC protocol.

It was getting late (as usual) and I didn't know what I was doing. Just trying something or the other. While going through the rpc dumps I was searching for my entries using the port number I was using. God knows how I looked at the interface id and searched on it in the dump file. Surprisingly, I found an additional entry for my RPC interface with LRPC protocol. I commented the call to RpcEpRegister() so that my RPC interface should not come in the RPC dump. I searched using the interface id and as expected there was no entry for my RPC server using the assigned port. But still there was a single entry against my RPC server for LRPC protocol. This was very surprising. I sent a mail with this observation to all. Since it was getting late I turned off my laptop and headed for home.

I was standing in front of the lift still thinking about the issue. Suddenly I realized what I missed. I went back to my cubicle and started my laptop. Copied the interface id from the dump file and googled it. There there !!! I found that the id was used by one of the MSDTC component. The mystery seemed to be resolved. The GUID used by my RPC server was conflicting with the RPC server of one of the MSTDC component. So now was the time to confirm this. As mentioned earlier I had tried using sample RPC server but the issue did not come in that case. I used the same RPC server but changed the RPC interface id to that used in my product and as expected the issue came with this sample RPC server also.

Is the GUID really unique then?
The interface id was a GUID which is supposed to be unique. But in my case it was not. There is a possibility that the one who wrote the RPC server copied it from somewhere. But looking at the people involved in writing the RPC server especially if they are writing a managed to unmanaged calls then it becomes hard to believe that it was copied. Even if the id was copied then it should have been from some of the example and I should have found it in my google search. But that  was not the case. So the mystery with the conflicting id is still open :) 

Solution:
The solution for this was to change the id by generating a new one. But it would have broken the existing thick consoles deployed on customer machines. So, we went ahead with commenting the call to RpcEpRegister() instead.

Why the hang then?
Using the term hang is wrong. As I mentioned at the start the snap-in would freeze for about 5-10 minutes and after that it would unfreeze but was not usable. Also, the issue would come only when RPC server used LRPC protocol. In addition to this there was another observation. But before that something about RPC.

For an RPC server you can register a callback function to perform validation whenever a call is made. In this function you can check if the client is allowed to make the call e.g. if client identity is not local admin you can return access denied error code.

Now, when machine was restarted and the issue was reproduced, I observed that the product logs indicated that this RPC callback function was called. This was strange. I thought that someone is calling my RPC server. But there was no easy way of finding the other end of an RPC call. On net there was an article for it but that was too much cumbersome.

So based on this my understanding was as follows:
  • One of the MSTDC service was enumerating all the RPC endpoints with the interface id that was conflicting with my product's RPC server. It was enumerating only those which used LRPC protocol.
  • It would then make a RPC call to this interface.
  • When the call to my RPC server was made then it would either hang or fail with some error. In case of failure the client must have been retrying.
  • Hence, the snap-in would appear as frozen for sometime.

Additional observations and validations:
In the 'Component Service' snap-in I looked at the 'Running Processes' node. In normal case it would show processes like Catsrv.CatalogServer.1 etcI wrote a VB script to create object of this COM class. In normal case it would get created. But when the issue was reproduced then even the call to create object of Catsrv.CatalogServer.1 would freeze similar to the snap-in.

Also, I created a sample RPC server and client, say, RpcServer01 and RpcClient01. I registered a callback function on RpcServer01. RpcClient01 could successfully call RpcServer01. Then I created another RPC server, say RpcServer02, with same ID as that of RpcServer01 and again registered a callback function. Now, if I ran RpcServer02 and RpcClient01 then the call would fail with error code 1783 which is 'The stub received bad data.'. However, on RpcServer02 the callback function was invoked. This confirmed that even if a client calls wrong RPC server (i.e. server with same ID but not that one expected by client) then the registered callback is invoked but the actual call will fail.

Wow too much to read !!!!
I know I have written too much and probably should stop here. But few points before I conclude this post.
While I was looking at this issue I had to stop working one this as other critical ones came up. Then we took inputs from the team which had worked on the RPC channel. Even they could not deduce why this was happening. We were going to raise a ticket with Microsoft for this. I knew I had not given my 100% on this. I told that I would again have a look at this issue for a day or two and then we could raise a ticket with Microsoft. It was agreed and the rest is history :)

Monday, 8 February 2010

Finding issue with third party COM component

The day started with reading mails that came overnight. There was a mail from tech support escalating a customer issue. A webex with the customer was scheduled after an hour. We requested for logs from the tech support. On looking at the logs the first guess was that the error was thrown by a third party component which we used in our product. The strange thing was that the customer had two environments: one a test network and the other a production network. The product was installed on both. However, it was working on test network and not on production network.

From the logs it was almost clear that the third party COM component was returning E_UNEXPECTED error code. But before the third party COM component call, there were few lines from our product which could have returned same error. So to make sure it was third party component issue we prepared a debug binary with more logging. We postponed the webex to 4 in the evening so that we have the debug binary ready.

In the evening we did webex with the customer and got the required logs. These logs made sure that the error was returned by the third party COM component. Now, we had only one option, that of contacting the third party vendor. But since the component was working properly in one scenario we knew that the issue had to be related with that particular installation on that particular machine. The third party component also had the option of enabling the logging. We had tried that also but no logs were generated. The call that was failing was an Initialize() call on that component. So, it was clear that the component was failing even before basic initialization like enabling logging. Contacting the third party vendor was last option as that would have been a lengthy process.

What next? We could not look into the COM component. Had it been a .Net component then reflection would have come to the rescue. Wait a minute. We wanted to look into the component. It had to be something related to the machine configuration. Some permissions or something. We took help of Procmon (Process Monitor) tool from Sysinternals. This tool captures the registry and file accesses. So after looking at the procmon logs you can tell which registry or file operations the component is performing and whether it succeeded or failed.

Running procmon directly on the product would have generated lots of logs. So we prepared a simple native exe which used the third party COM component. This served two purposes.
  • One if this exe ran successfully then it meant there was problem with our product configuration and not with machine.
  • Second if it failed then the problem was with the COM component due to machine configuration and also less procmon logs to analyze.
We did another webex with the customer and took the procmon logs. To be on safer side we took the procmon logs by running the test exe we created as well as the product. Half the work was done. Now, we had to analyze the logs.

We looked at the logs from webex. But it did not show any error like could not access registry due to permissions etc. Now, other option was to compare it with the logs when the COM component ran successfully. We did that and realized that there was some issue with MSXML. The logs on customer machine showed that MSXML4 was being used. But there were some read logs which were missing i.e. the component was not able to parse the XML using MSXML4. Following was the procmon log:

"ReadFile","C:\WINDOWS\system32\msxml4.dll","SUCCESS","Offset: 99,328, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O"

We suspected issue with MSXML. Even the Initialize() method on the COM component took XML string as input and option for enabling logging was also part of this XML string only. So the component must have been failing to parse the input XML string and hence no logs generated.

MSXML6 being the latest we asked the customer to verify if MSXML6 was installed on the machine and if not then it should be installed. Bingo, we hit the bull’s-eye. Customer found that MSXML6 was not installed and after installing it the product worked without any issues.

Note: There are many tools available from sysinternals which are very helpful in troubleshooting issues.