Wednesday, February 2, 2011

The deadly Deadlock - Fixing the issue with java.util.Calendar and java.util.TimeZone

All computer engineers learn about a condition called Deadlock in college.

What is deadlock?

This is what Wikipedia has to say about Deadlock:

"A deadlock is a situation where in two or more competing actions are each waiting for the other to finish, and thus neither ever does.  In computer science, Coffman deadlock refers to a specific condition when two or more processes are each waiting for each other to release a resource, or more than two processes are waiting for resources in a circular chain"

The article gives quite a lot of information about deadlock.  No, don't worry, I am not going to explain all that.  There are lots of resources that explain what deadlock is and why it occurs far better than I could.

We have all learned about deadlock and know why it occurs.  But have we ever faced a deadlock situation in our own code?

The technologies, enterprise stacks and databases we use have matured so much that we hardly ever face a deadlock situation in production code.  But I was fortunate enough to face one in one of my recent projects.

Of course, the deadlock I faced happened when the app went live, i.e. in the production environment.

I don't know about others, but it's not a good situation to be in.  Since the deadlock was only detected in the production environment, the situation became very critical.  Numerous mails were exchanged (some of them ugly), everyone wanted to know what was causing the deadlock, and I (poor soul) had no idea why it occurred :)

Moving on, how did we know that there was a deadlock?

Symptoms of deadlock
  • Performance of the app degrades quite a bit
  • CPU usage shoots up to 100% and stays there!
  • Users complain that the site is not responding and that they are not able to perform a certain action.
  • To get things back to normal, the servers have to be restarted!

When we noticed these symptoms in production for the first time, we thought it was a one-off case.  We restarted the server and decided we would try to find the reason for this performance problem.

I tried to find every possible cause of this performance issue, but without any luck.  Every cause I could think of was ruled out, because we were following the usual best practices: a standard J2EE stack, Spring, Hibernate, lazy loading, optimized queries etc.

I was not ready to believe it myself, but I suspected that the app might be a victim of deadlock!

Our app was deployed in a WebLogic cluster.  We had configured WebLogic to throw an exception when it detects that a thread has been running for more than 10 min (we had no long-running background processes that would take more than 600 sec to finish).  And bang!  After a few days, WebLogic did throw an exception stating that a certain thread had been running for 666 sec, which is more than the threshold of 600 sec!
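The application server is not the only way to spot stuck threads.  As a rough sketch, the JVM itself can be asked whether any threads are deadlocked on object monitors.  The class and method names below (DeadlockProbe, describeDeadlocks) are made up for illustration and were not part of our app, and the sketch assumes the stuck threads are blocked on Java monitors, which may not match every case.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not code from the app described in this post.
public class DeadlockProbe {

    /**
     * Returns a text report of the threads that are deadlocked on object
     * monitors, or an empty string when the JVM reports no such threads.
     */
    public static String describeDeadlocks() {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();

        // Available since Java 5; returns null when no monitor deadlock exists.
        long[] deadlockedIds = threadBean.findMonitorDeadlockedThreads();
        if (deadlockedIds == null) {
            return "";
        }

        StringBuilder report = new StringBuilder();
        // Integer.MAX_VALUE asks for the full stack trace of each thread.
        for (ThreadInfo info : threadBean.getThreadInfo(deadlockedIds, Integer.MAX_VALUE)) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            report.append(info.getThreadName())
                  .append(" is waiting for ").append(info.getLockName())
                  .append(" held by ").append(info.getLockOwnerName())
                  .append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                report.append("    at ").append(frame).append('\n');
            }
        }
        return report.toString();
    }
}
```

A method like this could be called from a periodic task and its output written to the log.  A plain thread dump of the JVM (for example with kill -3 on Unix) gives similar information without any code changes.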

The following is part of the stack trace

Looking at the stack trace, I was a bit surprised.  I was expecting the deadlock to occur where the app interacts with the database.  That was my impression of deadlock: it occurs mostly when we interact with the database.

But, I was proved wrong by my own code!  In our app the deadlock occurred in memory!

We nailed it down to a function called DateRange.countNumberOfFullDays.  As the name suggests, this function calculates the number of full days between two dates!

The code was not complex at all.  And this was the least expected place where a deadlock could occur.

As you can see the actual code of DateRange.countNumberOfFullDays does not do anything special!  Then what was causing the deadlock?
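To make the shape of such a method concrete, here is a hypothetical Calendar-based sketch.  It is not the code from our app: everything except the method name is an assumption, including the explicit TimeZone parameter, which is there only to keep the example deterministic.  What it does share with the real method is that it leans on java.util.Calendar and java.util.TimeZone for simple date arithmetic.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical illustration only; NOT the original DateRange class.
public class DateRangeSketch {

    /**
     * Counts how many whole calendar days fit between start and end,
     * using the given time zone. Returns 0 when end is not at least
     * one full day after start.
     */
    public static int countNumberOfFullDays(Date start, Date end, TimeZone zone) {
        Calendar cursor = Calendar.getInstance(zone);
        cursor.setTime(start);

        int days = 0;
        // Step forward one day at a time until we move past the end date.
        cursor.add(Calendar.DAY_OF_MONTH, 1);
        while (!cursor.getTime().after(end)) {
            days++;
            cursor.add(Calendar.DAY_OF_MONTH, 1);
        }
        return days;
    }
}
```

Nothing in a method like this takes a lock explicitly, which is exactly why it was the last place I expected trouble.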

Googling a little, I found that there are various issues with the Java Calendar and TimeZone classes in Java 1.5.  Visit http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6199320 and http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6232022 for more information.

These bugs were fixed in Java 1.6, but unfortunately our production servers use Java 1.5.  Damn!  Why, why, why do production servers always lag behind the pace at which technology evolves?

The solution:

There is an excellent open source library which provides a parallel and better implementation of the Calendar and TimeZone APIs.  It's called Joda-Time.

I used this library, changed the code of countNumberOfFullDays to look like

As we can see in this simple example, Joda-Time provides a very simple and intuitive date and calendar API.

Since I implemented this change (on 04/01/2011) we have not faced a deadlock situation yet! Fingers crossed!

OK, enough about the deadlock situation that I faced.  I want to know from you guys: have you ever faced a deadlock situation in production code?  If yes, what was the situation and how did you fix it?
Have some Fun!