Stalking the infinite loop

Today began with an effort to track down the infinite loop that is causing the Java process to munch CPU cycles on the portal server. The problem seems to be spreading: when I checked this morning around 6:30am, there were 4 looping threads on uportal1 and 5 on uportal2. I’ve tracked down the offending loop, and I started out by putting a counter into it and logging the number of times the loop ran after each successful completion. This gave me a pretty good idea of how many times we should go through the loop under normal circumstances (looks like only 2 or 3). Then I picked an arbitrary large number, 1000, and changed the code so it throws an exception if the counter exceeds this value. I’m hoping this will have two effects: one, stop the looping; and two, provide some logging so we can further investigate what is causing the problem. The new code is up and running, so we’ll see how it goes.
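For anyone who wants to try the same trick, here is a minimal sketch of the counter-and-ceiling idea. It is not the actual uPortal code: the class, the method, and the `finished` callback are all made up for illustration, and the only number taken from the description above is the ceiling of 1000.

```java
import java.util.function.IntPredicate;

// Hypothetical illustration of a loop guard; none of these names come from uPortal.
class LoopGuardDemo {
    // Arbitrary ceiling, far above the 2 or 3 passes seen under normal conditions.
    static final int MAX_ITERATIONS = 1000;

    // Repeats the loop body until `finished` reports completion and returns the
    // number of passes, so the caller can log it. If the counter exceeds the
    // ceiling, the loop is assumed to be stuck and an exception is thrown.
    static int runGuarded(IntPredicate finished) {
        int passes = 0;
        while (true) {
            passes++;
            if (passes > MAX_ITERATIONS) {
                throw new IllegalStateException(
                    "Loop exceeded " + MAX_ITERATIONS + " passes; possible infinite loop");
            }
            if (finished.test(passes)) {
                return passes;
            }
        }
    }

    public static void main(String[] args) {
        // Normal case: the loop completes on the third pass.
        System.out.println("passes: " + runGuarded(n -> n == 3));

        // Stuck case: the loop never completes, so the guard trips and we get
        // a stack trace instead of a thread spinning forever.
        try {
            runGuarded(n -> false);
        } catch (IllegalStateException e) {
            e.printStackTrace();
        }
    }
}
```

The exception does double duty: it ends the runaway thread, and its stack trace records where the loop was entered from.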

Welp, it worked. Interesting… very similar to yesterday, everything was quiet all morning and then both instances hit the infinite loop almost exactly at 1pm. Now, instead of an endlessly looping thread, I get a nice error log and stack dump. Next thing to do is try to log some additional info, to see if I can narrow this down to a particular user, activity, or whatever. If this is only affecting certain user(s), maybe someone will call the help desk and help me solve the mystery.

Other than that, the biggest issues so far have been related to permissions and affinities. Lots of people complaining that they don’t see content they used to get on the old portal. I expected this, because the uPortal groups/permissions model is quite different from what we were using with the old myUMBC. For now, I’m noting the users who are having problems, and in a week or so I’ll call a meeting to discuss how to reconcile them. In the meantime, these users can continue to use the old portal.

I also discovered today that the new portal does not work for anyone who has a “mandatory PIN change” flag set in SIS. Now first off… PINs are going away. There’s nowhere in myUMBC where a user is required to enter a PIN any more (well, there’s orientation, but don’t go there). Given that, I made the executive decision that mandatory PIN changes are going to go away in uPortal, and tweaked the legacy code accordingly. However, it looks like the HP is checking the forced_pin_change field and disallowing registration if it is set. So it looks like I need to take this one step further, and actually submit a PIN change for these users behind the scenes. Looking into that now. Regardless, we’re definitely going to need to test how the portal behaves with “virgin” users before the fall semester starts.

My God, I’m going through my own code that does the HP PIN stuff, and it is so bloody convoluted I want to shoot myself. From following the code, I can’t see any way for it to ever get to the HP PIN verify when accessed normally. I think what we need is a rewrite that takes all of the PIN stuff out of the normal login process, and doesn’t do anything with PINs until the user does something that accesses the HP. Then, if the user doesn’t have a PIN it can have the HP create one, then send a PIN change request to the HP so it clears out the forced_pin_change flag. Will look at that tomorrow..

Relaunch day

First “real” test (i.e. University open for business, people hitting portal) for our re-launch of uPortal today. It went up Sunday.

Issue #1: I see that we still have the problem of the JVM going up to 100% CPU utilization occasionally. It was like that this morning when I signed on around 7:30am. Portal was still responsive. The problem went away when I bounced the Tomcat instance. I guess somewhere, a thread is going haywire or something. I learned how to get a thread dump under Tomcat: Send SIGQUIT to the JVM process, and Tomcat puts the thread dump in catalina.out. Next time it happens, I’ll see if this produces anything useful.

Well, the issue cropped up again, and I did a thread dump. It appears we’re getting hit with UP-1175. Somewhere there’s a corrupted layout with a circular reference, which is causing an infinite loop. They’ve fixed it for uPortal 2.5.x, but there appears to be no fix for 2.4.x. Need to look into this a little further. Other than that, things have gone pretty well so far today. Fingers crossed.
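To make the failure mode concrete, here is a rough sketch of how a circular reference turns a simple walk into an infinite loop, and how remembering visited nodes catches it. This is not the UP-1175 fix and does not reflect how uPortal actually stores layouts; the map of node id to next node id is a stand-in I am assuming purely for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model: each layout node id maps to the id of the node that follows it.
class LayoutCycleCheck {
    // Follows the "next" references starting at startId. Returns the first id
    // that is reached a second time (meaning the chain loops back on itself),
    // or null if the chain ends normally.
    static String findCycle(Map<String, String> nextById, String startId) {
        Set<String> seen = new HashSet<>();
        String current = startId;
        while (current != null) {
            if (!seen.add(current)) {
                return current; // already visited: circular reference
            }
            current = nextById.get(current);
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> healthy = new HashMap<>();
        healthy.put("a", "b");
        healthy.put("b", "c");
        System.out.println("healthy layout cycle: " + findCycle(healthy, "a"));

        // Corrupted layout: "c" points back to "b", so a naive walk never ends.
        Map<String, String> corrupted = new HashMap<>(healthy);
        corrupted.put("c", "b");
        System.out.println("corrupted layout cycle: " + findCycle(corrupted, "a"));
    }
}
```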

CGenericXSLT channels, parameters, and Local Connection Contexts

I’m a bit strapped for time today, but I did take a quick look at this, to see if it looks doable. In a nutshell, I’d like to use a local connection context to do legacy authentication and obtain an encr string to pass to various legacy-backed services. This would allow me to create RSS-type channels that link to authenticated services, so I don’t need to use web proxy channels for everything. Initially I’d use it to connect to external services like MAP/DN, but eventually I could actually have the legacy Perl code handle the rendering for stuff like registration, and just re-skin it to look like uPortal.

I started out by seeing if I could pass an “encr” into the RSS and have it display conditionally somehow (we don’t necessarily want it appended to every link in the RSS feed). I came up with the somewhat hackish idea of using the RSS <category> element. If I give the item a category of “myumbcauth”, I can tweak the XSLT to look for that and append extra data to the link. Then, I can pass the actual encrypted string into the XSLT using a stylesheet parameter. This all works fine. The next challenge is getting the portal to set the appropriate parameter in the stylesheet. It looks like all of the channel runtime parameters are also passed in as stylesheet parameters (and in fact I was able to read one of them, baseActionURL). The question is, can I somehow add my own arbitrary param in there? Obviously this would have to be done somehow in the local connection context code. Anyhow, I got as far as that and now I have to run off and fight other fires, so I’ll have to come back to this later.
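Here is a standalone sketch of the category trick, using the plain JAXP API that ships with Java rather than anything from uPortal. The stylesheet below is a stripped-down stand-in, not the real CGenericXSLT stylesheet. The parameter name `encr`, the `?encr=` query format, the sample feed, and the class name are all assumptions for illustration. How uPortal itself would set the parameter is exactly the open question above, so here it is just set by hand.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo: prints one link per RSS item, appending the encr value
// only to items whose <category> is "myumbcauth".
class EncrParamDemo {
    // Minimal stylesheet. Assumes the link has no query string yet, so "?" is safe.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:param name=\"encr\"/>"
      + "<xsl:template match=\"/\">"
      + "<xsl:for-each select=\"rss/channel/item\">"
      + "<xsl:value-of select=\"link\"/>"
      + "<xsl:if test=\"category = 'myumbcauth'\">"
      + "<xsl:text>?encr=</xsl:text>"
      + "<xsl:value-of select=\"$encr\"/>"
      + "</xsl:if>"
      + "<xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Made-up feed with one public item and one item flagged for authentication.
    static final String SAMPLE_RSS =
        "<rss version=\"2.0\"><channel>"
      + "<item><title>Open</title><link>http://example.edu/open</link></item>"
      + "<item><title>Protected</title><link>http://example.edu/protected</link>"
      + "<category>myumbcauth</category></item>"
      + "</channel></rss>";

    // Applies the stylesheet to the feed, passing encr in as a stylesheet parameter.
    static String render(String rssXml, String encr) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            transformer.setParameter("encr", encr);
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(rssXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transform failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(render(SAMPLE_RSS, "TOKEN"));
    }
}
```

Only the item tagged “myumbcauth” gets the extra data; the other link passes through untouched, which is the conditional behavior described above.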

Today’s database tweak..

Well, one thing our ongoing uPortal launch has illustrated, is that contrary to popular belief, our Oracle database server does not have unlimited resources. To that end, a lot of my recent efforts have been geared towards making our installation more “database friendly”. The centerpiece of this is the connection pooling we set up on Monday. Of course, once you’ve got a nice, manageable connection pooling setup, you want to use it whenever possible. And until today, there was one big piece of the portal that still wasn’t using the pool: the “glue” that interfaces the uPortal web proxy channels to the legacy portal’s authentication scheme. uPortal calls this a local connection context, and ours goes by org.jasig.portal.security.UmbcLegacyLocalConnectionContext. The legacy portal’s session information is all database driven, so this code needs to connect to the database and create a valid legacy portal session for the user, so the web proxy channels will work and the kiddies can see their schedules and drop all their classes. This code was doing an explicit connect to the ‘myumbc’ user in the UMBC instance. Each channel needs to do it, and some of our portal tabs contain several of this type of channel. I’m not sure exactly how many times this code was getting invoked, or how many connections it was generating, etc. because I didn’t do any profiling. But it definitely had an impact.

Anyhow, I’ve modified the code so that it pulls a connection from the pool (using RDBMServices.getConnection) and uses that instead. I needed to modify the LegacyPortalSession code a bit to support this. Also, since our connection pool uses the ‘uportal’ user (not ‘myumbc’), I needed to get our DBA to do a couple of grants so that ‘uportal’ would have access to the tables it needs.

For better or for worse, it’s in production now, so we’ll see how it goes.

The plan for tomorrow: Fix all of the missing or broken links that people have reported. Create a new channel exclusively for DN/MAP. And, look into local connection context usage with CGenericXSLT type channels. I recently discovered that this type of channel can use a local connection context. Depending on how it works, I may be able to use it to eliminate a couple more web proxy channels and replace them with RSS type channels. We’ll see.

Legacy myUMBC ACLs as PAGS Groups

I think I’ve found a way (two ways, actually) to import program ACLs (from the BRCTL.PROG_USER_XREF SIS table) into uPortal as PAGS groups, so that we can publish uPortal channels with the exact same access lists as the respective areas in the legacy myUMBC. This would be a big win, particularly for an app like Degree Navigation/MAP. In the old portal, we control access to DN/MAP using a big, looong list of individual usernames. If the user isn’t on the list, they don’t even see a link to DN/MAP. However, with uPortal, we currently don’t have access to this list, so we have to present the DN/MAP link to a much larger set of users (basically anyone who is faculty or staff), or we’re faced with totally replicating the access list in uPortal, and maintaining two lists. Not what we want.

Fortunately, we designed the old portal with a bit of forward thinking, and made its ACL mechanism totally database driven. That is, all ACL info is stored in the Oracle database, so some future portal could theoretically extract that data and use it down the road. The challenge, then, is to figure out how to get uPortal to do that.

uPortal provides a very nice groups manager called PAGS, which allows us to create arbitrary groups based on what uPortal calls Person Attributes. It can extract Person Attributes directly from LDAP, as well as extracting them from the results of an arbitrary RDBM query. It then presents this group of attributes as a seamless collection, regardless of the actual backend datasource for each individual attribute. It’s really very nice.

My first thought, then, was to just have uPortal query the legacy myUMBC ACL table to get a list of each app a particular user can access, and map the results to “Person Attributes”. I tested this and it works just fine, but there’s one problem: The legacy ACL table is indexed by UMBC username, but the way we have uPortal configured, it’s currently using the LDAP GUID to do its queries. So, to do this the right way (that is, without hacking the uPortal code), we’d need a table that maps the GUID to the username, so that we could do a join against it to get our results. Currently, we don’t have LDAP GUID data anywhere in our Oracle database. Now, I don’t think getting it there would be a huge issue (we’re already doing nightly loads of usernames from LDAP to Oracle), but it still needs to happen before we could use this method.

The second method would be to import the user’s legacy ACL data into the LDAP database as an additional attribute. Then I could just pull the data directly out of LDAP, without having to worry about an RDBM query at all. This seems like a simpler solution, if it’s possible. More later..

Note: Configuration of Person Attributes is done in the file /properties/PersonDirs.xml. When specifying an RDBM attributes query, the SQL statement must include a bind variable reference, or the code will crap out. I learned this when I tried to remove the bind variable and hardcode my own username.. no dice. To test this stuff out, subscribe to the “Person Attributes” channel, which is under the “Development” group. Then look for the attributes you defined in the config file. If they’re there, it worked. If not, not.

Connection pooling crash course

Just spent the whole day tweaking our new uPortal installation and trying to get it to stay up reliably under load. It’s coming along, but not quite there yet. First lesson: Under any kind of load, you must, absolutely must, enable database connection pooling. That’s because if you don’t, it will open enough database connections to, let’s just say, really screw things up. Now, setting up connection pooling is not supposed to be that hard. But in our case, it was a huge pain. The default uPortal 2.4.3 configuration includes a file, uPortal.xml, which is used to specify the connection pooling info to Tomcat. Great, I set it up with our connection parameters, and tried it out. Hmm, doesn’t seem to work. Look a little further. Apparently in portal.properties, I need to set the flag org.jasig.portal.RDBMServices.getDatasourceFromJndi to “true”, or it bypasses the whole connection pooling thing and just opens direct connections. I set it, and tried again. Major bombage. More poking around and I found this page describing the mechanics of Tomcat connection pooling. Apparently, the config file format (as well as the factory class name) changed from Tomcat 5.0.x to Tomcat 5.5.x. We’re running 5.5.x, and the uPortal distro’s config file is in the 5.0.x format. So, I updated the config file. Plus, for good measure, I dropped a copy of the Oracle JDBC jar file into tomcat-root/common/lib. Not sure if it really needs to be there or not. But, once I jumped through all those hoops, the connection pooling finally seems to work.

Now, we’re dealing with memory issues causing slowness, as well as a couple lingering database issues with logins to the ‘myumbc’ user…

I hope I don’t have too many more days like this…

Update 1/12/2006: Well, it appears that the connection pooling breaks any ant targets that use the database: This includes pubchan as well as pubfragments, etc. This is kinda bogus, but rather than tweaking portal.properties every time I want to publish a channel or fragment, it looks like I can just run these from the test tree (which uses the same set of database tables).