Tuesday 23 March 2010

Active Directory error - The server does not support the requested critical extension

This one was very interesting. The customer was running a query on Windows Active Directory Groups using our product. The issue was easy to reproduce in customer environment. Ask any developer working on customer issues how good it is to know that the issue is consistently reproduced !!! This is as if half the battle is won for the troubleshooter :)

Now, customer was running a basic query including basic fields like domain name, group name etc. Till then it was 'Aal is well'. The customer modified the query to include field displaying the group's members. That's when the query would fail with 'The server does not support the requested critical extension' error. The issue would come only when the group member field was included in the query. This error was returned by Active Directory. Also, in the past release we had faced this error but there was no solution to it other than putting a retry. But in customer scenario the error would always come for each run of the query so retry option was not feasible. Also, on net there were mostly indications about sorting on DN etc. which could cause this error. There were other reasons also mentioned (but not many). None of them were even close to indicate as to why the error would come only when an additional field was included. Also, the query would fail randomly on any group.

I had a look at the implementation. We were just firing an AD query and enumerating the results. We were enumerating the groups one by one and also fetching all the additional information about the group. The error was thrown when we were calling GetNext() on the result set to move to next Group. Following should give an idea about the implementation:

IDirectorySearch directorySearch = SearchADGroups();
foreach(ADGroup group in directorySearch)
{
    // Get additional group information by
    // making additional AD connections
}
 
So the search handle would remain open till the end. I thought that the issue could be because of AD connections being leaked. So I inspected the code but in vain. The customer has large number of groups with each group having more than 5 members. Assuming that it could be because of large number of groups I created almost double the number customer had. Also, added members to each group. The query in my test environment would run overnight but would not fail.

The customer was migrating from old release to new one and this issue was stopping him from migration. We needed a solution quickly. Analyzing the customer scenario was not useful as there was no clear-cut indication of when the error would be thrown by Active Directory. So instead of beating around the bush I decided to look for a quick solution.

I moved back to my initial observation i.e. we were firing an Active Directory query and while iterating the groups we would fetch additional information about the group. When the group's members were to be fetched were doing addition bindings to Active Directory. So while AD search handle was open were making additional AD bindings. I thought lets try enumerating all groups first, store them in a linked list, close the search handle and then iterate through the groups in the linked list. This way while querying the group details including the membership information we would not have the AD search handle open. I did the changes and gave the binary to customer for verification.

The private binary did not work. Did that mean my fix was not working? Fortunately I had put additional logs and those indicated that there were two code paths. So, I had to put the fix in other code path also. Then came the good news. The query executed successfully in customer environment. I knew there would be some performance issue as now we were first storing all groups in a linked list and then again iterate through the list. Besides this even more memory would be consumed. So there was another battle to fight, that of performance (memory was not an issue).

I was not expecting any issue with performance as how much time would an iteration contribute especially when the query time was in minutes? Worst case 0.5 seconds. But I was proved wrong. We could not reproduce the customer issue in our scenario and hence our QA could test the performance impact of the fix. The performance figure indicated degradation by almost 100%. How come a loop cause such a degradation? I kept only the loop and removed the code for fetching the additional group information. Well, the time taken for completely iterating the list was in milliseconds. So what was the issue?

The only thing I did was instead of keeping the AD connection open (i.e. the search handle) I first stored the groups in a linked list. So what was happening earlier was that an AD connection was open and we were making additional AD connections to fetch additional information. This was causing issue in customer scenario. But now that was fixed and another one surfaced. I was just thinking about the changes I did and why it should be causing the performance issue. Suddenly I thought I was about to nail down this on also. Out of nowhere AD connection caching came to my mind. Remember the issue was that the search handle was being kept open while we were fetching additional information by making additional AD binds. Open search handle means an open AD connection. That was the difference.

What is AD connection caching? In simple terms it means reusing the AD connection. When you make an AD connection and that connection is open then a request for another connection will be served using the same connection internally i.e. to you it appears as if a new connection is created but that is not the case and internally a previously opened connection is used. Following are the two scenarios:

Scenario1:
  • Create AD connection A
  • Perform some operation using A
  • Create AD connection B
  • Since A is already opened it will be used i.e. calls on B will be made using A (this is just to explain things here. Follow the link to understand AD caching.)
  • Performance improves since we are reusing the connection

Scenario2:

  • Create AD connection A
  • Perform some operation using A
  • Close connection A
  • Create AD connection B
  • Since A is closed as new connection will be created
  • Performance impacted since we are creating new connection
Previously, when we were enumerating through the AD results set and also making additional AD connections that meant that AD connection caching was happening. When I changed the implementation to first store all groups in a linked list that meant that the first AD connection was closed. Now, while iterating through the linked list we were making additional AD connections. But these were created and closed in the loop itself and hence no connection caching or reuse was happening. This was the reason for performance degradation. All I had to do was:
  • Make AD search and read all groups in a linked list
  • Close the search handle
  • Create another AD connection, say A and keep it open
  • Iterate through the linked list and fetch the required information for the groups by making additional AD connections
  • Once all groups are processed close AD connection A
Now, in the loop when we were creating additional AD connections we were leveraging the AD connection caching. How? We created an AD connection A and kept it open till the end. The performance came back to original one (only difference of few milliseconds). But this was not the end. Picture abhi baki thi. I had changed the code to have a AD conneciton open and it was similar to the original code. What if the issue in customer scenario was because of an open connection. Well I thought this would not be the case and the issue was because of the search handle being open i.e. we fired the AD query and not only taking too much time to iterate through the results but also making additional binds. So, if the performance fix would work in customer scenario then it would prove my point. And I was proved right. The fix worked in customer scenario also.

Epilogue
Well as I mentioned earlier, while thinking about the changes I remembered AD connection caching. I should be thanking my friend SJ for this :) In previous release he had faced such issue. In that particular case the connection handles were increasing. He came across AD connection caching and tried many things and I was curiously observing all this.

Let me conclude this post by pointing out that one of the best ways to learn things is to look at the issues others are facing. That's what I mostly do and one of the reasons I leave late from office :)

No comments:

Post a Comment