Monday 2 April 2012

COM callback stopped working with new builds

This happened a few years back (probably around 2nd March 2007), when we worked mainly in C++. The issue became critical because we were nearing the release date and there was no breakthrough.

Issue:
We used a 3rd party COM component to scan machines for Windows patches. In that particular release, the component had undergone design changes, one of which was to use COM connection points for event handling. So we implemented the corresponding COM component to receive the events. The required changes were done and tested. However, with the latest build from the 3rd party vendor, the events were no longer received. Getting the callback was critical: on each callback we updated a heartbeat indicating that processing was still going on and that there was no hang. Since no callback arrived, we assumed after the timeout (30 minutes by default) that the 3rd party component had hung, and the processing was aborted.
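To make the heartbeat idea concrete, here is a minimal C++ sketch of such a watchdog. The names (`Heartbeat`, `beat`, `isHung`) are hypothetical and this is not our actual code; it only assumes what is described above: every callback refreshes a timestamp, and the job is treated as hung once the timeout passes without a refresh.

```cpp
#include <cassert>
#include <chrono>

namespace sketch {

using Clock = std::chrono::steady_clock;

// Hypothetical watchdog: each event callback calls beat(); a monitor
// polls isHung() and aborts the processing when it returns true.
class Heartbeat {
public:
    explicit Heartbeat(Clock::duration timeout,
                       Clock::time_point now = Clock::now())
        : timeout_(timeout), last_(now) {
        assert(timeout > Clock::duration::zero());
    }

    // Called from the event callback: record that work is still progressing.
    void beat(Clock::time_point now = Clock::now()) { last_ = now; }

    // True once no beat has arrived for at least the timeout.
    bool isHung(Clock::time_point now = Clock::now()) const {
        return now - last_ >= timeout_;
    }

private:
    Clock::duration timeout_;
    Clock::time_point last_;
};

}  // namespace sketch
```

With a 30 minute timeout, a sink that never receives a callback never calls `beat()`, so `isHung()` turns true after 30 minutes even though the scan itself may be perfectly healthy. That is exactly the false alarm we were hitting.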

Troubleshooting:
I was not involved in this issue from the start, as I was busy with other priority items. My friend and colleague had implemented the corresponding COM event handler and was driving this end to end. Everything worked well until the new build from the 3rd party vendor arrived. The same code worked with the old build but not with the new one. He left no stone unturned: he went through his own code, the sample code the vendor provided and so on, but all in vain. Even the 3rd party vendor had tested the functionality at their end. It was still not clear whether the issue was on our side or in the 3rd party component. In such situations my mentor always used to say that it is better if the issue is on our side: if it is on our side we can fix it quickly, but if it is in the 3rd party component, getting a fix from the vendor in time may not be possible.

It was Friday evening and the testing was to be completed by Monday. Since I was done with my tasks I decided to poke my nose into this issue :)

I first understood the code flow so that I could debug the issue. Since this was a COM component, I decided to start with the basics. The very foundation of COM is the QueryInterface method on the IUnknown interface. We had implemented QueryInterface ourselves, so I put a breakpoint there and started debugging.

As the processing started, the QueryInterface method was getting called. The first few calls were as expected, i.e. calls for IUnknown, the event handler interface etc. However, then came a call for IDispatch. This case was not handled, and we were returning the standard E_NOINTERFACE error. To confirm whether this was the issue, I simply returned the required interface pointer for IDispatch. With this change we started getting the callback.

So the issue was that we did not handle the request for IDispatch in the overridden QueryInterface method. It seems the earlier builds from the 3rd party vendor made the callback through the specific event interface (say, IEventCallback), whereas the latest builds made the callback through the IDispatch interface.

The required COM class already implemented the IDispatch interface, but in the overridden QueryInterface method we returned E_NOINTERFACE instead of the IDispatch pointer.
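The bug and the fix can be illustrated with a small, portable C++ sketch. Everything in it is a simplified stand-in so that it compiles without the Windows SDK: real COM identifies interfaces by GUID (IID_IUnknown, IID_IDispatch), the real IDispatch has four methods rather than the single Invoke shown here, and EventSink, fireViaDispatch and the answerDispatch flag are hypothetical names invented for the illustration. It is not the original code.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Stand-ins for the Windows definitions (assumption: real code uses the SDK).
using HRESULT = std::int32_t;
constexpr HRESULT S_OK = 0;
constexpr HRESULT E_NOINTERFACE = static_cast<HRESULT>(0x80004002);
constexpr HRESULT E_POINTER = static_cast<HRESULT>(0x80004003);

// Real COM uses GUIDs; an enum keeps the sketch short.
enum class Iid { Unknown, Dispatch, EventCallback, Other };

struct IUnknown {
    virtual HRESULT QueryInterface(Iid iid, void** out) = 0;
    virtual unsigned long AddRef() = 0;
    virtual unsigned long Release() = 0;
    virtual ~IUnknown() = default;
};

struct IDispatch : IUnknown {
    virtual HRESULT Invoke(int dispId) = 0;  // heavily simplified
};

struct IEventCallback : IDispatch {
    virtual HRESULT OnProgress() = 0;
};

class EventSink : public IEventCallback {
public:
    // answerDispatch == false reproduces the bug, true is the fix.
    explicit EventSink(bool answerDispatch) : answerDispatch_(answerDispatch) {}

    HRESULT QueryInterface(Iid iid, void** out) override {
        if (!out) return E_POINTER;
        *out = nullptr;
        switch (iid) {
        case Iid::Unknown:
            *out = static_cast<IUnknown*>(this);
            break;
        case Iid::EventCallback:
            *out = static_cast<IEventCallback*>(this);
            break;
        case Iid::Dispatch:
            if (!answerDispatch_) return E_NOINTERFACE;  // the original bug
            *out = static_cast<IDispatch*>(this);        // the fix
            break;
        default:
            return E_NOINTERFACE;
        }
        AddRef();  // every successful QueryInterface hands out a reference
        return S_OK;
    }

    unsigned long AddRef() override { return ++refs_; }
    unsigned long Release() override {
        assert(refs_ > 0);
        return --refs_;  // no delete: the sketch keeps sinks on the stack
    }

    HRESULT Invoke(int /*dispId*/) override { ++callbacks_; return S_OK; }
    HRESULT OnProgress() override { ++callbacks_; return S_OK; }

    int callbacks() const { return callbacks_; }

private:
    bool answerDispatch_;
    unsigned long refs_ = 1;
    int callbacks_ = 0;
};

// Hypothetical event source that asks the sink for IDispatch before firing,
// which is how the newer vendor build appeared to behave.
inline bool fireViaDispatch(IUnknown& sink) {
    void* p = nullptr;
    if (sink.QueryInterface(Iid::Dispatch, &p) != S_OK) return false;
    auto* disp = static_cast<IDispatch*>(p);
    disp->Invoke(1);
    disp->Release();
    return true;
}

}  // namespace sketch
```

Constructing `EventSink` with `false` and passing it to `fireViaDispatch` delivers no event at all, while constructing it with `true` delivers one. The sink never sees an error in the broken case; the source just silently gives up, which is why the problem was so hard to spot from our side.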

Fortunately the fix was simple and had no impact on any other functionality. The QAs had to work over the weekend (and so did the developers, including me :( ) to make sure that the testing would be completed by Monday.