Saturday 27 February 2010

Why would 'Component Services' (dcomcnfg.exe) hang?

This was really a hard one to crack. The issue was as follows:
  • After installing the product, launch the 'Component Services' snap-in, i.e. dcomcnfg.exe.
  • Expand the 'Computers' node.
  • The snap-in would freeze.
  • After a few minutes the node would expand, but the 'My Computer' node had a red icon next to it, indicating that the MSDTC service did not start properly.
  • While the snap-in was frozen, if the service installed by the product was stopped, the snap-in would unfreeze quickly. But the red icon was still shown against the 'My Computer' node.
To reproduce the issue, the machine had to be restarted every time. This was one issue wherein I must have set a record in restarting the machine. I had once accidentally selected Windows shutdown instead of restart, and hence I created a batch file with the restart command.

The issue would come after the product installation. To make sure that the issue was because of the components installed by the product, we disabled all the product services. With them disabled, the 'Component Services' snap-in would not freeze.

Since it appeared to be a hang, I got the usual suggestions from people: there is a resource conflict, it could be a deadlock, etc. I thought of using the hang analysis option of DebugDiag, but launching DebugDiag would throw an error stating that the MSDTC service was not started, so I could not use it. Next I used WinDbg and took dumps of the hung 'Component Services' snap-in as well as the MSDTC service. I compared the thread stacks with those from a run where no freeze was observed, but in vain.

There was another way to reproduce the issue: disable all the services installed by the product as well as the MSDTC service. After a restart, start any one of the product services and then the MSDTC service. The issue would resurface, though the 'Component Services' snap-in would not freeze for as long. However, it was still not usable as before. Since the issue occurred when our services were started before the MSDTC service, we added a dependency on the MSDTC service (a sketch of how that can be done follows), but that did not help either.
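
For reference, a dependency like that can be added with 'sc config <ServiceName> depend= MSDTC' or programmatically through the service control manager. A minimal sketch; the service name 'MyProductService' is just a placeholder, not the product's real name.

    #include <windows.h>   // link with advapi32.lib

    // Sketch: add MSDTC to an existing service's dependency list.
    BOOL AddMsdtcDependency()
    {
        SC_HANDLE scm = OpenSCManagerW(NULL, NULL, SC_MANAGER_CONNECT);
        if (!scm) return FALSE;

        SC_HANDLE svc = OpenServiceW(scm, L"MyProductService", SERVICE_CHANGE_CONFIG);
        BOOL ok = FALSE;
        if (svc)
        {
            // Only the dependency list is changed; everything else is left as-is.
            ok = ChangeServiceConfigW(svc,
                     SERVICE_NO_CHANGE, SERVICE_NO_CHANGE, SERVICE_NO_CHANGE,
                     NULL, NULL, NULL,
                     L"MSDTC\0",          // double-null-terminated list of dependencies
                     NULL, NULL, NULL);
            CloseServiceHandle(svc);
        }
        CloseServiceHandle(scm);
        return ok;
    }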

Now, the next step was to find out which part of the code was causing the issue. So I disabled all the product services. Since the issue could be reproduced with any one of the product services, I chose the one which had the fewest initialization steps. I started commenting out the code and finally arrived at the culprit. We had used .NET remoting with a custom channel, i.e. internally we used RPC. Here, when the RPC server was started, the issue would resurface. I thought the issue was because an RPC server was started before the MSDTC service had done some initialization. So in the service start I added code to launch another executable, and instead of starting the RPC server in the product I started an external sample RPC server. But the issue was not observed in this case. I tried using different port numbers, assuming that there could be a port conflict, but that didn't help. So this meant that the issue was with the way we were creating the RPC server in our product. We were creating the RPC server from a managed .NET assembly, i.e. calling unmanaged APIs using PInvoke. I created a separate exe which would start the RPC server similar to the product service. In this case the issue was observed, so instead of depending on the service I could now use this exe. Then I started commenting out the RPC initialization code. Finally, the API that was causing the issue was caught, and it was RpcEpRegister().
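
For context, the start-up path of such an RPC server looks roughly like the sketch below. This is not the product's actual code: the interface handle MyProduct_v1_0_s_ifspec (which would come from the MIDL-generated stub) and port 9100 are placeholders. RpcEpRegister() is the call that was isolated.

    #ifndef UNICODE
    #define UNICODE
    #endif
    #include <windows.h>
    #include <rpc.h>      // link with rpcrt4.lib

    extern RPC_IF_HANDLE MyProduct_v1_0_s_ifspec;   // from the MIDL-generated server stub

    RPC_STATUS StartRpcServer()
    {
        // The product used predefined ports, so the endpoint is fixed here.
        RPC_STATUS status = RpcServerUseProtseqEp((RPC_WSTR)L"ncacn_ip_tcp",
            RPC_C_PROTSEQ_MAX_REQS_DEFAULT, (RPC_WSTR)L"9100", NULL);
        if (status != RPC_S_OK) return status;

        status = RpcServerRegisterIf(MyProduct_v1_0_s_ifspec, NULL, NULL);
        if (status != RPC_S_OK) return status;

        // The culprit: publish the interface id and its bindings in the
        // endpoint mapper database.
        RPC_BINDING_VECTOR* bindings = NULL;
        status = RpcServerInqBindings(&bindings);
        if (status != RPC_S_OK) return status;

        status = RpcEpRegister(MyProduct_v1_0_s_ifspec, bindings, NULL,
                               (RPC_WSTR)L"MyProduct RPC server");
        RpcBindingVectorFree(&bindings);
        if (status != RPC_S_OK) return status;

        return RpcServerListen(1, RPC_C_LISTEN_MAX_CALLS_DEFAULT, TRUE /* don't block */);
    }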

As I mentioned earlier, since we were calling unmanaged APIs from managed code, I suspected something wrong with the managed-to-unmanaged calls. But that was not the case. I even tried registering the endpoint completely from unmanaged code, but that did not help. (I had tried a few additional things here as well.)

The RpcEpRegister() function registers the RPC server, along with its port, in the endpoint mapper database. It is required when using dynamic ports, i.e. every time the server starts it uses whichever port is available. But then how will the client know the port that was allocated at runtime? That's where RpcEpRegister() comes in. The client just says: I need to connect to this RPC server (identified by a GUID) on machine xyz. The endpoint mapper on machine xyz looks up the port number and the connection proceeds. In our product we were not using dynamic ports; every service had a predefined port number assigned. Hence, in the worst case, commenting out the RpcEpRegister() call would solve the problem for us.
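
To make that concrete, the client side with dynamic endpoints looks roughly like this. A sketch only: 'xyz' is the machine from the example above, and MyProduct_v1_0_c_ifspec is a placeholder for the MIDL-generated client stub handle.

    #ifndef UNICODE
    #define UNICODE
    #endif
    #include <windows.h>
    #include <rpc.h>      // link with rpcrt4.lib

    extern RPC_IF_HANDLE MyProduct_v1_0_c_ifspec;   // from the MIDL-generated client stub

    RPC_STATUS BindToServer(RPC_BINDING_HANDLE* binding)
    {
        // The string binding names only the protocol and the machine; no port.
        RPC_WSTR stringBinding = NULL;
        RPC_STATUS status = RpcStringBindingCompose(NULL /* no object UUID */,
            (RPC_WSTR)L"ncacn_ip_tcp", (RPC_WSTR)L"xyz", NULL /* no endpoint */,
            NULL, &stringBinding);
        if (status != RPC_S_OK) return status;

        status = RpcBindingFromStringBinding(stringBinding, binding);
        RpcStringFree(&stringBinding);
        if (status != RPC_S_OK) return status;

        // The endpoint mapper on 'xyz' fills in the port that the server
        // published with RpcEpRegister().
        return RpcEpResolveBinding(*binding, MyProduct_v1_0_c_ifspec);
    }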

I had even tried using procmon.exe from Sysinternals: took the dumps with and without the hang and compared them, but this also did not help. Then I remembered rpcdump.exe. This utility dumps all the RPC servers along with the protocol, port and RPC interface id (let's call it the RPC server id). I used that tool and took the dump. While going through the dump I observed that there were two entries for our RPC server: one with the TCP protocol and the other with the local RPC protocol (LRPC). This was because while registering the protocols we registered both TCP and LRPC. So I commented out the LRPC protocol registration; the issue did not come. Then I registered only LRPC and commented out TCP; the issue came. It looked like the issue was with the LRPC protocol only. So I tried registering only the named pipe protocol instead of TCP or LRPC. The issue did not come. So it was confirmed that the issue was with the LRPC protocol.
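
The registrations I was toggling looked roughly like the sketch below (the port and the endpoint names are placeholders, not the product's real values). The issue reproduced only when the ncalrpc line was active.

    #ifndef UNICODE
    #define UNICODE
    #endif
    #include <windows.h>
    #include <rpc.h>      // link with rpcrt4.lib

    RPC_STATUS RegisterProtocolSequences()
    {
        // TCP with a fixed port: fine on its own.
        RPC_STATUS status = RpcServerUseProtseqEp((RPC_WSTR)L"ncacn_ip_tcp",
            RPC_C_PROTSEQ_MAX_REQS_DEFAULT, (RPC_WSTR)L"9100", NULL);
        if (status != RPC_S_OK) return status;

        // Local RPC: the registration that clashed with MSDTC.
        status = RpcServerUseProtseqEp((RPC_WSTR)L"ncalrpc",
            RPC_C_PROTSEQ_MAX_REQS_DEFAULT, (RPC_WSTR)L"MyProductLrpcEndpoint", NULL);
        if (status != RPC_S_OK) return status;

        // Named pipes: substituting this for LRPC made the issue go away.
        return RpcServerUseProtseqEp((RPC_WSTR)L"ncacn_np",
            RPC_C_PROTSEQ_MAX_REQS_DEFAULT, (RPC_WSTR)L"\\pipe\\MyProductPipe", NULL);
    }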

It was getting late (as usual) and I didn't know what I was doing, just trying one thing or another. While going through the RPC dumps I was searching for my entries using the port number I was using. God knows how I ended up looking at the interface id and searching for it in the dump file. Surprisingly, I found an additional entry for my RPC interface with the LRPC protocol. I commented out the call to RpcEpRegister() so that my RPC interface should not appear in the RPC dump. I searched using the interface id and, as expected, there was no entry for my RPC server using the assigned port. But there was still a single entry against my RPC server for the LRPC protocol. This was very surprising. I sent a mail with this observation to everyone. Since it was getting late I turned off my laptop and headed home.

I was standing in front of the lift, still thinking about the issue. Suddenly I realized what I had missed. I went back to my cubicle and started my laptop. I copied the interface id from the dump file and googled it. There, there!!! I found that the id was used by one of the MSDTC components. The mystery seemed to be resolved: the GUID used by my RPC server was conflicting with the RPC server of one of the MSDTC components. Now was the time to confirm this. As mentioned earlier, I had tried using a sample RPC server but the issue did not come in that case. I used the same RPC server but changed the RPC interface id to the one used in my product, and as expected the issue came with this sample RPC server also.

Is the GUID really unique then?
The interface id was a GUID, which is supposed to be unique. But in my case it was not. There is a possibility that whoever wrote the RPC server copied it from somewhere. But looking at the people involved in writing the RPC server, especially since they were writing managed-to-unmanaged calls, it is hard to believe that it was copied. Even if the id was copied, it should have been from some example, and I should have found it in my Google search. But that was not the case. So the mystery of the conflicting id is still open :)

Solution:
The proper solution was to change the id by generating a new one. But that would have broken the existing thick consoles deployed on customer machines. So we went ahead with commenting out the call to RpcEpRegister() instead.

Why the hang then?
Using the term hang is wrong. As I mentioned at the start, the snap-in would freeze for about 5-10 minutes and after that it would unfreeze but was not usable. Also, the issue would come only when the RPC server used the LRPC protocol. In addition to this there was another observation. But before that, something about RPC.

For an RPC server you can register a callback function to perform validation whenever a call is made. In this function you can check whether the client is allowed to make the call, e.g. if the client identity is not a local admin you can return an access-denied error code.
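
In the RPC API this is done by registering the interface with RpcServerRegisterIfEx() and passing a security callback. A minimal sketch, with the product's actual check omitted; MyProduct_v1_0_s_ifspec is again a placeholder for the MIDL-generated handle.

    #include <windows.h>
    #include <rpc.h>      // link with rpcrt4.lib

    extern RPC_IF_HANDLE MyProduct_v1_0_s_ifspec;   // from the MIDL-generated server stub

    // The callback runs for every incoming call on the interface, before the
    // actual server routine executes.
    RPC_STATUS RPC_ENTRY SecurityCallback(RPC_IF_HANDLE /*ifSpec*/, void* /*context*/)
    {
        // 'context' is the binding for the incoming call. A real implementation
        // would inspect the caller here (e.g. impersonate and check whether the
        // identity is a local admin) and return RPC_S_ACCESS_DENIED to reject it.
        return RPC_S_OK;   // allow the call
    }

    RPC_STATUS RegisterInterfaceWithCallback()
    {
        return RpcServerRegisterIfEx(MyProduct_v1_0_s_ifspec,
                                     NULL,     // default manager type UUID
                                     NULL,     // default entry point vector
                                     0,        // no special interface flags
                                     RPC_C_LISTEN_MAX_CALLS_DEFAULT,
                                     SecurityCallback);
    }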

Now, when the machine was restarted and the issue was reproduced, I observed from the product logs that this RPC callback function was being called. This was strange; I thought that someone was calling my RPC server. But there was no easy way of finding the other end of an RPC call. There was an article on the net about it, but it was far too cumbersome.

So based on this my understanding was as follows:
  • One of the MSDTC services was enumerating all the RPC endpoints with the interface id that conflicted with my product's RPC server (see the sketch after this list). It was enumerating only those which used the LRPC protocol.
  • It would then make an RPC call to this interface.
  • When the call was made to my RPC server, it would either hang or fail with some error. In case of failure, the client must have been retrying.
  • Hence, the snap-in would appear frozen for some time.
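
Enumerating the endpoint mapper database, which is essentially what rpcdump.exe does and what I suspect the MSDTC component was doing, can be sketched as below. This only illustrates the mechanism; I obviously could not see MSDTC's actual code.

    #ifndef UNICODE
    #define UNICODE
    #endif
    #include <windows.h>
    #include <rpc.h>      // link with rpcrt4.lib
    #include <stdio.h>

    void DumpEndpointMapper()
    {
        RPC_EP_INQ_HANDLE inquiry = NULL;
        // A NULL binding handle means the endpoint mapper on the local machine.
        RPC_STATUS status = RpcMgmtEpEltInqBegin(NULL, RPC_C_EP_ALL_ELTS,
                                                 NULL, RPC_C_VERS_ALL, NULL, &inquiry);
        if (status != RPC_S_OK) return;

        RPC_IF_ID ifId;
        UUID objectUuid;
        RPC_BINDING_HANDLE binding = NULL;
        RPC_WSTR annotation = NULL;
        while (RpcMgmtEpEltInqNext(inquiry, &ifId, &binding, &objectUuid,
                                   &annotation) == RPC_S_OK)
        {
            RPC_WSTR ifUuid = NULL;
            UuidToString(&ifId.Uuid, &ifUuid);
            // Interface id plus annotation; this is where the duplicate entry
            // for my interface id showed up.
            wprintf(L"%ls  %ls\n", (wchar_t*)ifUuid,
                    annotation ? (wchar_t*)annotation : L"");
            RpcStringFree(&ifUuid);
            if (annotation) RpcStringFree(&annotation);
            if (binding) RpcBindingFree(&binding);
        }
        RpcMgmtEpEltInqDone(&inquiry);
    }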

Additional observations and validations:
In the 'Component Services' snap-in I looked at the 'Running Processes' node. In the normal case it would show processes like Catsrv.CatalogServer.1, etc. I wrote a VB script to create an object of this COM class. In the normal case it would get created. But when the issue was reproduced, even the call to create an object of Catsrv.CatalogServer.1 would freeze, similar to the snap-in.
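
The script did nothing more than create the object; an equivalent probe written in C++ would look roughly like this:

    #include <windows.h>
    #include <objbase.h>   // link with ole32.lib and uuid.lib
    #include <stdio.h>

    int main()
    {
        CoInitializeEx(NULL, COINIT_APARTMENTTHREADED);

        CLSID clsid;
        HRESULT hr = CLSIDFromProgID(L"Catsrv.CatalogServer.1", &clsid);
        if (SUCCEEDED(hr))
        {
            IUnknown* unknown = NULL;
            // On a healthy machine this returns quickly; when the issue was
            // reproduced it would freeze here, just like the snap-in.
            hr = CoCreateInstance(clsid, NULL, CLSCTX_ALL, IID_IUnknown,
                                  (void**)&unknown);
            printf("CoCreateInstance returned 0x%08lX\n", (unsigned long)hr);
            if (unknown) unknown->Release();
        }

        CoUninitialize();
        return 0;
    }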

Also, I created a sample RPC server and client, say RpcServer01 and RpcClient01, and registered a callback function on RpcServer01. RpcClient01 could successfully call RpcServer01. Then I created another RPC server, say RpcServer02, with the same ID as that of RpcServer01 and again registered a callback function. Now, if I ran RpcServer02 and RpcClient01, the call would fail with error code 1783, which is 'The stub received bad data.' However, on RpcServer02 the callback function was invoked. This confirmed that even if a client calls the wrong RPC server (i.e. a server with the same ID but not the one expected by the client), the registered callback is invoked but the actual call fails.

Wow too much to read !!!!
I know I have written too much and probably should stop here. But a few points before I conclude this post.
While I was looking at this issue I had to stop working on it as other critical issues came up. Then we took inputs from the team which had worked on the RPC channel; even they could not deduce why this was happening. We were going to raise a ticket with Microsoft for this. I knew I had not given my 100% on this, so I said I would have another look at the issue for a day or two and then we could raise a ticket with Microsoft. It was agreed, and the rest is history :)

2 comments:

  1. awesome dude... m sure cracking the issue was not as simple & straight forward as you make it sound in blog... i mean duplicate GUID... who would think of that !!!

  2. we don't call him dude for no reason..... ;)
    http://en.wikipedia.org/wiki/Dude
