Monday, 2 April 2012

COM callback stopped working with new builds

This happened a few years back (probably 2nd March 2007), when we worked mainly in C++. The issue became critical because we were nearing the release date and there was no breakthrough.

Issue:
We used a 3rd party COM component to scan machines for Windows patches. In that particular release, the 3rd party component had undergone design changes, one of which was to use COM connection points for event handling. So we implemented the corresponding COM component to receive the events. The required changes were done and tested. However, with the latest build from the 3rd party vendor, the events were no longer received. Getting the callback was critical: based on it we updated the heartbeat, which indicated that processing was going on and that there was no hang. Since we were not getting the callback, it was assumed (after the default 30 min.) that the 3rd party component had hung, and the processing was aborted.
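
The heartbeat idea can be sketched roughly as below. This is only an illustration, not the product code: the names HeartbeatMonitor, beat and is_hung are made up for this post, and treating "exactly 30 minutes without a callback" as a hang is an assumption.

```cpp
#include <cassert>
#include <chrono>

namespace heartbeat_sketch {

using Clock = std::chrono::steady_clock;

// Hypothetical watchdog: every event callback records a heartbeat, and a
// supervisor treats a long silence as a hang in the scanning component.
class HeartbeatMonitor {
public:
    explicit HeartbeatMonitor(Clock::time_point start,
                              Clock::duration timeout = std::chrono::minutes(30))
        : last_beat_(start), timeout_(timeout) {
        assert(timeout > Clock::duration::zero());
    }

    // Called from the COM event handler whenever a callback arrives.
    void beat(Clock::time_point now) { last_beat_ = now; }

    // Assumption: no heartbeat for the full timeout means "hung".
    bool is_hung(Clock::time_point now) const {
        return now - last_beat_ >= timeout_;
    }

private:
    Clock::time_point last_beat_;
    Clock::duration timeout_;
};

}  // namespace heartbeat_sketch
```

With no callbacks arriving, beat() is never called, so is_hung() turns true after the timeout even though the scan itself may be running fine. That is exactly the false alarm we were hitting.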

Troubleshooting:
I was not involved in this issue from the start, as I was busy with other priority items. My friend and colleague had implemented the corresponding COM event handler and was driving this end to end. Everything worked well until the new build from the 3rd party vendor arrived. The same code would work with the old build but not with the new one. He left no stone unturned: he looked at his own code, the sample code the vendor provided, and so on, but all in vain. Even the 3rd party vendor had tested the functionality at their end. It was still not clear whether the issue was on our side or in the 3rd party component. In such a situation my mentor always used to say that it would be good if the issue was on our side, because then we can fix it quickly, whereas if it is in the 3rd party component, getting a fix from the vendor in time may not be possible.

It was Friday evening and the testing was to be completed by Monday. Since I was done with my tasks, I decided to poke my nose into this issue :)

I first understood the code flow so that I could debug the issue. Since it was a COM component, I decided to start with the basics. The very foundation of COM is the QueryInterface method on the IUnknown interface. We had implemented the QueryInterface method, so I put a breakpoint there and started debugging.

As the processing started, the QueryInterface method was getting called. The first few calls were as expected, i.e. calls for IUnknown, the event handler interface etc. However, then came a call for IDispatch. This case was not handled and we were returning the standard E_NOINTERFACE error. To confirm that this was the issue, I simply returned the required interface pointer for IDispatch. With this change we started getting the callback.

So the issue was that we did not handle the request for IDispatch in the overridden QueryInterface method. It seems the earlier builds from the 3rd party vendor made the callback through the specific event interface (say, IEventCallback), whereas the latest builds made the callback through the IDispatch interface.

The required COM class already implemented the IDispatch interface, but in the overridden QueryInterface method we returned E_NOINTERFACE instead of the IDispatch pointer.
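
The shape of the fix can be sketched as below. This is not the actual product code: a real sink would use the Windows SDK's IUnknown, IDispatch, REFIID and HRESULT. The stand-in types here (IUnknownLike, IDispatchLike, InterfaceId, EventSink) and the OnProgress method are hypothetical, chosen so that the sketch compiles without Windows headers. IEventCallback is just a placeholder name for the vendor's event interface.

```cpp
#include <cassert>
#include <cstdint>

namespace com_sketch {

using HResult = std::int32_t;                                       // stands in for HRESULT
constexpr HResult kOk = 0;                                          // S_OK
constexpr HResult kNoInterface = static_cast<HResult>(0x80004002);  // E_NOINTERFACE
constexpr HResult kPointer = static_cast<HResult>(0x80004003);      // E_POINTER

// Stand-in for interface IDs (real COM compares GUIDs).
enum class InterfaceId { Unknown, Dispatch, EventCallback, Other };

struct IUnknownLike {
    virtual HResult QueryInterface(InterfaceId iid, void** out) = 0;
    virtual unsigned long AddRef() = 0;
    virtual unsigned long Release() = 0;
    virtual ~IUnknownLike() = default;
};
struct IDispatchLike : IUnknownLike {};
struct IEventCallback : IDispatchLike {
    virtual void OnProgress() = 0;  // hypothetical event method
};

class EventSink : public IEventCallback {
public:
    HResult QueryInterface(InterfaceId iid, void** out) override {
        if (out == nullptr) return kPointer;
        switch (iid) {
        case InterfaceId::Unknown:
            *out = static_cast<IUnknownLike*>(this);
            break;
        case InterfaceId::Dispatch:
            // The missing case: without it the caller got E_NOINTERFACE
            // and never delivered any events.
            *out = static_cast<IDispatchLike*>(this);
            break;
        case InterfaceId::EventCallback:
            *out = static_cast<IEventCallback*>(this);
            break;
        default:
            *out = nullptr;
            return kNoInterface;
        }
        AddRef();  // a successful QueryInterface hands out a new reference
        return kOk;
    }
    unsigned long AddRef() override { return ++refs_; }
    unsigned long Release() override {
        assert(refs_ > 0);
        return --refs_;  // sketch only: a heap-allocated sink would delete itself at 0
    }
    void OnProgress() override { ++callbacks_; }

    unsigned long refs() const { return refs_; }
    int callbacks() const { return callbacks_; }

private:
    unsigned long refs_ = 0;
    int callbacks_ = 0;
};

}  // namespace com_sketch
```

The point is only the extra case for IDispatch. Everything else in QueryInterface stays as it was.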

Fortunately the fix was simple and had no impact on any other functionality. The QAs had to work over the weekend (and so did the developers, including me :( ) to make sure that the testing would be completed by Monday.

Tuesday, 13 March 2012

The Leap Day Bug !!!

No, no... the title is not wrong. It is indeed 'Leap Day Bug' and not 'Leap Year Bug'.

There was nothing challenging about this issue, but the date on which it surfaced gave it the honor of getting listed on the blog :)

29th February 2012

That the day was the 29th did not go unnoticed, as one of my cousins' birthdays falls on this day. Other than that there was nothing special about the day, but thanks to this issue I now have something to add to leap year conversations.

I had to go to the Baner office to attend a technical presentation. A few minutes into the presentation, there was an email from a developer on another team about a product installation failure. The developer had trouble installing even the old product builds which he had installed previously. After some time another developer reported the same issue. By lunch time there was a flurry of emails from other teams as well as QAs about issues with product installation.

On returning to the Godrej office, the first thing I did was look at one of the QA scenarios where the product installation had failed. Of course, one question was continuously on my mind: why was the product suddenly not in a mood to get installed? :)

The following exception was in the product log:


System.ArgumentOutOfRangeException: Year, Month, and Day parameters describe an un-representable DateTime.
   at System.DateTime.DateToTicks(Int32 year, Int32 month, Int32 day)



Did the above exception ring a bell? It did indeed.

I had seen this error two days earlier and had even created a sample application to reproduce the issue, but to no avail. However, running the same sample application now greeted me with a pleasant surprise, and the surprise is not hard to guess. Yes, the sample application also threw the same exception. So half the battle was won by reproducing the issue with the sample application. The next step was to find the root cause.

The failing code was written to compute the expiration date and time by adding the provided N years to the current date and time. The following snippet should give some idea of the code:









After going through the code again with the ArgumentOutOfRangeException in mind, it was child's play to put all the pieces of the puzzle together.

The above code does the following:
  1. Take today's date and time
  2. Create a new DateTime object using the constructor which takes
    • Year
    • Month
    • Day
  3. Take the difference and return the number of days
The culprit was step #2, specifically the first parameter to the constructor. As you may have noticed, the input N years was added to the current date's year, so on 29th Feb. 2012 the year became 2012 + N while the month and day were passed through unchanged.

Now, in the product installation the default value for N was 25, which means 2012 + 25 = 2037. So on 29th Feb. 2012 we were trying to create a DateTime object with year 2037, month 02 and day 29, i.e. the new date being 29th Feb. 2037. This will always throw ArgumentOutOfRangeException, because 2037 is not a leap year and so there is no 29th Feb. 2037 :)
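
To make the failure concrete, here is a minimal C++ model of steps #1 and #2. It is not the product code (which was .NET). Date, make_date and buggy_expiry are hypothetical names, and make_date only imitates the way the DateTime constructor rejects a year/month/day combination that does not exist.

```cpp
#include <cassert>
#include <stdexcept>

namespace leap_bug {

struct Date {
    int year;
    int month;
    int day;
};

bool is_leap_year(int year) {
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

int days_in_month(int year, int month) {
    assert(month >= 1 && month <= 12);
    static const int kDays[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
    return (month == 2 && is_leap_year(year)) ? 29 : kDays[month - 1];
}

// Validating factory: throws for dates that do not exist, in the same
// spirit as the DateTime(year, month, day) constructor.
Date make_date(int year, int month, int day) {
    if (month < 1 || month > 12 || day < 1 || day > days_in_month(year, month)) {
        throw std::out_of_range("year, month and day describe a non-existent date");
    }
    return Date{year, month, day};
}

// The buggy approach: bump only the year and reuse month and day as they are.
Date buggy_expiry(const Date& today, int years) {
    return make_date(today.year + years, today.month, today.day);
}

}  // namespace leap_bug
```

In this model, buggy_expiry({2012, 2, 27}, 25) works fine, whereas buggy_expiry({2012, 2, 29}, 25) asks for 29th Feb. 2037 and throws.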

Fix
The fix was simply to use the AddYears() method that DateTime already provides. Yes, a single-line fix :)
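
AddYears() takes care of the leap day for us: adding years to 29th Feb. gives 29th Feb. if the target year is a leap year and 28th Feb. otherwise. A minimal C++ sketch of that clamping rule follows. The names Date and add_years are hypothetical, and this is an imitation of the behaviour, not the .NET implementation.

```cpp
#include <cassert>

namespace leap_fix {

struct Date {
    int year;
    int month;
    int day;
};

bool is_leap_year(int year) {
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

int days_in_month(int year, int month) {
    assert(month >= 1 && month <= 12);
    static const int kDays[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
    return (month == 2 && is_leap_year(year)) ? 29 : kDays[month - 1];
}

// Add whole years, clamping the day to the last valid day of the target
// month, so 29th Feb. lands on 28th Feb. in a non-leap year.
Date add_years(const Date& d, int years) {
    const int year = d.year + years;
    const int last_day = days_in_month(year, d.month);
    return Date{year, d.month, d.day > last_day ? last_day : d.day};
}

}  // namespace leap_fix
```

Here add_years({2012, 2, 29}, 25) gives 28th Feb. 2037 instead of throwing, and add_years({2012, 2, 29}, 24) gives 29th Feb. 2036.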

Workaround
While you are working on a fix, the first thing to do is to see whether others can be unblocked with a workaround. In this case the workaround was simple: change the default value of 25 years to 24, or to any value which, when added to 2012, results in a leap year. With this workaround, the product installation was successful.

Epilogue
If you remember, I mentioned that the sample application for this issue had been created two days earlier. This was because a QA had reported the same issue with the same error on 27th Feb. At that time I created the sample application, but it did not throw the exception on my machine as the date was 27th Feb. When I checked the system time on the QA scenario, however, it showed 29th Feb. Since I had other issues to look at, I did not bother to check whether the sample application would fail if I changed the date to the 29th. Instead I asked the QA to set the system time correctly to the 27th, which solved the issue on the QA scenario. Little did I know that the issue would resurface on a much larger scale on the auspicious day of 29th Feb. 2012 :)