Portal Meltdowns

We had another portal meltdown this morning. I figure I’ll keep a “meltdown log” of sorts to record notes, etc. Hopefully it’ll help me get to the bottom of this.

Background: Every so often, the portal becomes unresponsive. There seems to be no correlation with system load (high demand, peak periods, etc.). For example, spring final exams ended yesterday, and today is one of the quietest days of the year around here. Yet we had a meltdown this morning.

Observations:

  1. In portal.log, there are always lots of DBCP-related errors.
  2. The portal fills up its DB connection pool very quickly and I often see log messages that the pool is “exhausted”.
  3. Lots of ALM-related infinite loops, in getFirstSiblingNode and others. I’ve put loop-breaking code in various spots (a sketch of the idea follows this list); otherwise, all the busy-looping threads would kill the JVM pretty quickly. I have it logging the portal userid when this happens, and it doesn’t seem to be limited to specific users. In fact, I picked one from an error log, checked his portal account later, and it worked fine.
  4. When the problem is occurring, restarting the portal (web server, JVM, the whole 9 yards) does not fix the problem.
  5. Both portal server boxes (uportal1 and uportal2) are always affected at the same time, ostensibly ruling out an issue involving the OS, Apache, or Tomcat.
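
Since I keep referring to the loop-breaking code, here’s a minimal sketch of the idea. This is not the actual ALM code; the node class, the hop limit, and the logging are simplified stand-ins.

    import java.util.logging.Logger;

    // Simplified sketch of the loop-breaking guard -- not the actual ALM code.
    // "LayoutNode" and its sibling pointer are stand-ins for the real layout classes.
    class LoopGuardSketch {
        private static final Logger LOG = Logger.getLogger(LoopGuardSketch.class.getName());
        private static final int MAX_SIBLING_HOPS = 10000; // far more siblings than any real layout has

        static class LayoutNode {
            LayoutNode previousSibling;
        }

        static LayoutNode getFirstSiblingNode(LayoutNode start, String userId) {
            LayoutNode current = start;
            int hops = 0;
            // Walk the sibling chain; if the chain has been corrupted into a cycle,
            // bail out instead of busy-looping forever on a request thread.
            while (current.previousSibling != null) {
                current = current.previousSibling;
                if (++hops > MAX_SIBLING_HOPS) {
                    LOG.severe("Sibling-chain loop detected for portal user " + userId
                            + "; breaking out after " + hops + " hops");
                    break;
                }
            }
            return current;
        }
    }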

I did a JVM thread dump. One thread was hung doing a database commit (?!?) inside portal.RDBMUserLayoutStore.getNextStructId, a synchronized routine. All other threads (98 of ’em) were hung waiting on this one thread.
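
To make the failure mode concrete: the shape of that method is roughly the following. This is a minimal sketch, not the real RDBMUserLayoutStore code, and the table/column names and sequence logic are invented; the part that matters is the commit sitting inside a synchronized method.

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Rough sketch of the shape of the problem -- NOT the real RDBMUserLayoutStore code.
    // The table/column names and the sequence logic are invented for illustration.
    class StructIdSketch {
        // synchronized: only one thread at a time may be in here. If the commit
        // below hangs on the database, that thread keeps the monitor and every
        // other caller queues up waiting to enter the method.
        public synchronized int getNextStructId(Connection con) throws SQLException {
            Statement stmt = con.createStatement();
            try {
                ResultSet rs = stmt.executeQuery("SELECT NEXT_STRUCT_ID FROM UP_SEQUENCE");
                rs.next();
                int next = rs.getInt(1) + 1;
                rs.close();
                stmt.executeUpdate("UPDATE UP_SEQUENCE SET NEXT_STRUCT_ID = " + next);
                con.commit(); // <-- the thread dump showed a thread hung right about here
                return next;
            } finally {
                stmt.close();
            }
        }
    }

If that commit never returns, neither does anything else that needs a new struct id, which matches the 98 waiting threads.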

Sometimes, this problem magically resolves itself. Other times, it’s resolved after the DBA restarts the Oracle instance. We’ve hypothesized that it’s being caused by an errant app running at the same time and dragging down our database instance. I’m not sure I buy that. But right now it looks like, somehow, a thread is hanging inside a critical section and locking up all the other threads; and each of those blocked threads is holding a DB connection from the DBCP pool and never releasing it, which hoses up the pool. Next time it happens, I’ll do another thread dump and see if the lockup is in the same spot.
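
Assuming the request path checks a connection out of the pool before it ever reaches the synchronized method, the coupling would look roughly like this (hypothetical method name, reusing the StructIdSketch from above; the ordering is the point, not the details):

    import java.sql.Connection;
    import javax.sql.DataSource;

    // Sketch of how the two failures would feed each other -- not real portal code.
    // Each request thread checks a connection OUT of the DBCP pool before it calls
    // the synchronized method, and only returns it afterwards.
    class RequestPathSketch {
        void handleLayoutRequest(DataSource pool, StructIdSketch store) throws Exception {
            Connection con = pool.getConnection();  // connection leaves the pool here
            try {
                store.getNextStructId(con);         // blocks here while another thread is stuck in the commit
            } finally {
                con.close();                        // never reached while blocked, so the pool runs dry
            }
        }
    }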

I also did a dump of active portal sessions to see who was logged on around the time of the meltdown. When it happens again, I’ll do another dump and cross-check the two, to see if a particular user or users might be causing the problem.
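
Once I have two dumps, the cross-check itself is trivial; something like this would do it (assuming each dump is saved as one portal username per line, which is an assumption about the format, not a given):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    // Cross-check two session dumps: print the users who appear in both.
    // Assumes each dump file has one portal username per line.
    public class SessionCrossCheck {
        public static void main(String[] args) throws IOException {
            Set<String> firstDump = new LinkedHashSet<String>(Files.readAllLines(Paths.get(args[0])));
            List<String> secondDump = Files.readAllLines(Paths.get(args[1]));
            firstDump.retainAll(secondDump);  // keep only users seen in both meltdowns
            for (String user : firstDump) {
                System.out.println(user);
            }
        }
    }

Run it with the two dump files as arguments; anyone who shows up in both gets a closer look.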

6/1: Happened again. Lockup occurred in exactly the same spot, with all threads locked waiting on another thread trying to do a database commit. WTF is going on here?