Now, the customer was running a basic query with basic fields such as domain name, group name and so on. Until then it was 'Aal is well'. Then the customer modified the query to include a field displaying each group's members. That's when the query started failing with the error 'The server does not support the requested critical extension'. The error, which was returned by Active Directory, appeared only when the group member field was included in the query. We had faced this error in a past release too, but had found no solution other than adding a retry. In the customer's scenario the error came on every run of the query, so a retry was not feasible. Searching the net mostly turned up suggestions that sorting on DN and the like could cause this error. A few other causes were mentioned, but none of them came even close to explaining why the error would appear only when an additional field was included. On top of that, the query would fail at random, on any group.
I had a look at the implementation. We were simply firing an AD query and enumerating the results: we enumerated the groups one by one and, for each group, fetched all the additional information about it. The error was thrown when we called GetNext() on the result set to move to the next group. The following should give an idea of the implementation:
IDirectorySearch directorySearch = SearchADGroups();
foreach (ADGroup group in directorySearch)
{
    // Get additional group information by
    // making additional AD connections
}
So the search handle would remain open till the end. I thought the issue could be due to AD connections being leaked, so I inspected the code, but in vain. The customer had a large number of groups, each with more than 5 members. Assuming the number of groups could be the cause, I created almost double the number the customer had and added members to each group. The query ran overnight in my test environment but would not fail.
The customer was migrating from an old release to a new one, and this issue was blocking the migration. We needed a solution quickly. Analyzing the customer scenario was not helping, as there was no clear-cut indication of when Active Directory would throw the error. So instead of beating around the bush I decided to look for a quick solution.
I went back to my initial observation: we were firing an Active Directory query and, while iterating over the groups, fetching additional information about each group. When a group's members had to be fetched we were doing additional binds to Active Directory. So while the AD search handle was open we were making additional AD binds. I thought, let's try enumerating all the groups first, storing them in a linked list, closing the search handle and only then iterating through the groups in the linked list. This way the AD search handle would not be open while we queried the group details, including the membership information. I made the changes and gave the binary to the customer for verification.
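The "enumerate first, close the handle, then fetch details" idea can be sketched roughly as below. This is a self-contained Python simulation, not the product code: `FakeSearch`, `fetch_members` and `process_groups` are hypothetical names invented for illustration, and a plain Python list stands in for the linked list.

```python
from contextlib import closing


class FakeSearch:
    """Hypothetical stand-in for an AD search handle (not a real API)."""

    def __init__(self, names):
        self._names = list(names)
        self.is_open = True

    def __iter__(self):
        for name in self._names:
            if not self.is_open:
                raise RuntimeError("search handle already closed")
            yield name

    def close(self):
        self.is_open = False


def fetch_members(group, search):
    """Hypothetical per-group lookup; records whether the search was still open."""
    return {"group": group, "search_open_during_bind": search.is_open}


def process_groups(search):
    # Phase 1: drain the whole result set into memory, then release the handle.
    with closing(search):
        groups = list(search)
    # Phase 2: the additional binds happen only after the handle is closed.
    return [fetch_members(group, search) for group in groups]


search = FakeSearch(["Admins", "Sales", "QA"])
results = process_groups(search)
print(results)
```

The trade-off is the one described in the post: every group is held in memory and the list is walked a second time, in exchange for never issuing extra binds while the search handle is open.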
The private binary did not work. Did that mean my fix was wrong? Fortunately I had added extra logging, and the logs indicated that there were two code paths, so I had to put the fix in the other code path as well. Then came the good news: the query executed successfully in the customer environment. I knew there could be some performance cost, as we were now first storing all the groups in a linked list and then iterating through the list again. Besides, more memory would be consumed. So there was another battle to fight, that of performance (memory was not an issue).
I was not expecting any issue with performance. How much time could one extra iteration contribute, especially when the query time was in minutes? Worst case 0.5 seconds. But I was proved wrong. We could not reproduce the customer issue in our environment, so our QA could only test the performance impact of the fix. The performance figures indicated a degradation of almost 100%. How could a loop cause such a degradation? I kept only the loop and removed the code for fetching the additional group information. Well, the time taken to iterate the complete list was in milliseconds. So what was the issue?
The only thing I had changed was that, instead of keeping the AD connection (i.e. the search handle) open, I first stored the groups in a linked list. Earlier, an AD connection was open while we made additional AD connections to fetch additional information, and that was causing the issue in the customer scenario. Now that was fixed and another issue had surfaced. I kept thinking about the changes I had made and why they should cause a performance issue. Suddenly I felt I was about to nail this one down too. Out of nowhere, AD connection caching came to my mind. Remember, the issue was that the search handle was being kept open while we fetched additional information by making additional AD binds. An open search handle means an open AD connection. That was the difference.
What is AD connection caching? In simple terms, it means reusing the AD connection. When you make an AD connection and it is still open, a request for another connection is served internally using the same connection. To you it appears as if a new connection has been created, but internally the previously opened connection is used. The following are the two scenarios:
Scenario1:
- Create AD connection A
- Perform some operation using A
- Create AD connection B
- Since A is already opened it will be used i.e. calls on B will be made using A (this is just to explain things here. Follow the link to understand AD caching.)
- Performance improves since we are reusing the connection
Scenario2:
- Create AD connection A
- Perform some operation using A
- Close connection A
- Create AD connection B
- Since A is closed, a new connection will be created
- Performance impacted since we are creating new connection
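With my fix we had moved from Scenario 1 to Scenario 2: once the search handle was closed, no AD connection stayed open, so every bind for a group's details created a brand new connection. So the final solution was:
- Make the AD search and read all groups into a linked list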
- Close the search handle
- Create another AD connection, say A and keep it open
- Iterate through the linked list and fetch the required information for the groups by making additional AD connections
- Once all groups are processed close AD connection A
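To see why holding one connection open makes such a difference, here is a toy model of the caching behaviour described above. It is only a sketch under my own simplifying assumptions: `ConnectionCache` and `fetch_all` are hypothetical names, and real AD connection caching depends on more details (for example, binding to the same server with the same credentials) than this counter captures.

```python
class ConnectionCache:
    """Toy model: one physical connection is shared by all open logical
    connections and is torn down when the last one is closed."""

    def __init__(self):
        self.open_handles = 0
        self.physical_connects = 0  # how many expensive real connects happened

    def open(self):
        if self.open_handles == 0:
            self.physical_connects += 1  # nothing to reuse: pay the full cost
        self.open_handles += 1

    def close(self):
        self.open_handles -= 1


def fetch_all(cache, groups, keep_anchor_open):
    if keep_anchor_open:
        cache.open()  # connection "A", held open for the whole loop
    for _ in groups:
        cache.open()   # per-group bind to fetch details
        cache.close()
    if keep_anchor_open:
        cache.close()  # release "A" once all groups are processed
    return cache.physical_connects


groups = ["g%d" % i for i in range(1000)]
without_anchor = fetch_all(ConnectionCache(), groups, keep_anchor_open=False)
with_anchor = fetch_all(ConnectionCache(), groups, keep_anchor_open=True)
print(without_anchor, with_anchor)  # prints: 1000 1
```

In this model the loop without the anchor pays for a real connect on every group (Scenario 2), while the loop with connection A held open pays for it once (Scenario 1).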
Epilogue
Well, as I mentioned earlier, AD connection caching came to mind while I was thinking about the changes. I should thank my friend SJ for this :) In a previous release he had faced a similar issue, where the number of connection handles kept increasing. He came across AD connection caching and tried many things, and I was curiously observing all of it.
Let me conclude this post by pointing out that one of the best ways to learn things is to look at the issues others are facing. That's what I mostly do, and it is one of the reasons I leave the office late :)