Portal Meltdowns

We had another portal meltdown this morning. I figure I’ll keep a “meltdown log” of sorts to record notes, etc. Hopefully it’ll help me get to the bottom of this.

Background: Every so often, the portal becomes unresponsive. There seems to be no correlation to system load (high demand etc). For example, spring final exams ended yesterday, and today is one of the quietest days of the year around here. Yet we had a meltdown this morning.

Observations:

  1. In portal.log, there are always lots of these DBCP-related errors.
  2. The portal fills up its DB connection pool very quickly and I often see log messages that the pool is “exhausted”.
  3. Lots of ALM-related infinite loops, in getFirstSiblingNode and others. I’ve put loop-breaking code in various spots; otherwise, all the busy-looping threads would kill the JVM pretty quickly. I have it logging the portal userid when this happens, and it doesn’t seem to be limited to specific users. In fact, I picked one from an error log, checked his portal account later, and it worked fine.
  4. When the problem is occurring, restarting the portal (web server, JVM, the whole 9 yards) does not fix the problem.
  5. Both portal server boxes (uportal1 and uportal2) are always affected at the same time, ostensibly ruling out an issue involving the OS, Apache, or Tomcat.

I did a JVM thread dump. One thread was hung doing a database commit (?!?) inside portal.RDBMUserLayoutStore.getNextStructId, a synchronized routine. All other threads (98 of ’em) were hung waiting on this one thread.

[More:]

Sometimes, this problem magically resolves itself. Other times, it’s resolved after the DBA restarts the Oracle instance. We’ve hypothesized that this is being caused by an errant app running at the same time, which is dragging down our database instance. I’m not sure if I buy that or not. But right now it looks like, somehow, a thread is hanging inside a critical section, thereby locking up all the other threads; and the locked threads are all grabbing DB connections from the DBCP pool and never releasing them, thereby hosing up the pool. Next time it happens, I’ll do another thread dump and see if the lockup is in the same spot.

I also did a dump of active portal sessions to see who was logged on around the time of the meltdown. When it happens again, I’ll do another dump and cross-check the two, to see if a particular user or users might be causing the problem.

6/1: Happened again. Lockup occurred in exactly the same spot, with all threads locked waiting on another thread trying to do a database commit. WTF is going on here..

Today’s PAGS tweak

Today I fixed the latest crop of users with portal-access issues: Previous students who took a semester off, or never completed a degree, who want to use myUMBC to register or retrieve a transcript. These students do not have a student affiliation in LDAP, because they’re not “current” students. They also lack an alumni attribute as they never completed degrees. As such, they don’t see the “Academics” tab because we limit display of this tab to users with student or alumni affiliations (well, and a few others as well, but that’s beside the point right now).

Our LDAP directory does not have an affiliation for “past-student-who-never-graduated”. However, we have an attribute for the last term a user was registered for classes. In theory, if a user possesses this attribute, that indicates that they were a student at some time in the past. We can give them the “Academics” tab by expanding the UMBC Student PAGS group to include people with this attribute.

[More:]

Laundry list for doing this:

  1. Make sure the LDAP attribute in question is available in uPortal as a Person Attribute. If not, add the attribute to PersonDirs.xml.
  2. Edit PAGSGroupStoreConfig.xml and add any new groups (or edit existing ones) to incorporate the new attribute.
  3. Redeploy, restart and hope it doesn’t bomb.

In our case, I added an additional clause to check for presence of the umbclasttermreg attribute using the PAGS “Value Exists Tester”, org.jasig.portal.groups.pags.testers.ValueExistsTester.

The PAGS entry on the JA-SIG Wiki has improved quite a bit since I last visited it. Caveat emptor: It references ValueExistsTester and ValueMissingTester. Initially I blindly followed the docs and tried to use ValueExistsTester, and things didn’t work. It turns out these two testers are not present in uPortal 2.4.3.. I had to add ValueExistsTester manually. It’s only around 10 lines of Java, but the Wiki ought to mention that it’s not in 2.4.3.