Whew… what a day

Thursday was quite the day of triumphs and tribulations.

It all started out with a successful swap-out of my home Linux server. It had been running on an old, trusty-but-tired Dell P2-300MHz, and I upgraded it to a slightly-less-old Dell P3-450MHz (Linux is far from perfect, but it truly shines at getting new life out of old, scavenged hardware). The upgrade was as easy as: build a new kernel, shut the old box down, pull out all the boards and peripherals, put the stuff in the new box, and boot the new box up. The result was a home server with a 50% faster processor and an extra 256MB RAM (640MB total vs 384MB). Not earth shattering, but a noticeable improvement, and it was easy to do. The trick to doing this is to transplant as much of the “guts” from the old box to the new box as possible, so the hardware configuration stays mostly the same.

Next up was the launch of our new myUMBC portal, which so far has been very smooth, other than the usual little niggling problems that always pop up with these things. We had already been running uPortal in production for six months, and that experience definitely helped ease the transition. The centerpiece of this launch was a complete redesign of the UI, but behind the scenes we also upgraded the framework and layout manager. This gave us an opportunity to start out “clean” with a fresh set of database tables, and to redo a few things which were done sloppily with the initial launch (such as our PAGS group and channel hierarchies). It positions us very well for future releases and gives us a clean platform on which to build future improvements.

Of course, by Murphy’s Law, the air conditioning in ECS picked yesterday to go on the fritz. So the launch was a little sweaty, but it happened anyhow. When I left the office, the A/C was finally getting back to normal, and then I get home and our power goes out. It ended up being out for around 4 hours, from 7:30pm till 11:30pm. Not all that long, but it always seems like forever when it’s actually out, and it made for a pretty sweaty (and surly) day. And of course, BGE’s automated system never gave us an estimate of when the power would be restored, so Cathy packed up the kids to go sleep at her sister’s, and pulled out 2 minutes before the power came back on. Fortunately I was eventually able to get a decent night’s sleep. I must say I’m more than ready for this summer to end. This is the first truly miserable summer we’ve had since 2002, and I had forgotten just how bad 2002 was. Oh well… t-minus 7 weeks till the Outer Banks.

CWebProxy channels and self-signed certs

OK. Just so I don’t forget this when I inevitably have to do it again.

We are starting to add some CWebProxy channels that access the portal web server via its external URL rather than one of the loopback interfaces (long story why, but there are a few issues with proxying to a localhost URL, particularly WRT inline images). These channels go through SSL, as opposed to the loopback ones which use standard HTTP. Our test portal server uses a self-signed SSL cert. That causes some problems, because the portal’s JVM won’t trust a cert that wasn’t issued by a CA it already knows about, so it can’t properly negotiate the SSL connection.

Solution: Create a local keystore containing the cert info, and point the JVM at this file via a command-line argument.

How to do it in 5 easy steps:

  1. Find the SSL cert for the web server. On the portal servers, this is located under server-root/conf/server-name.crt. Make a temporary copy of this file. Edit the copy and remove everything except the actual cert data, keeping the -----BEGIN CERTIFICATE----- and -----END CERTIFICATE----- lines.
  2. Use the cert file to create a Java keystore file. Assuming the keystore will live at /etc/umbc/uportal-test.umbc.edu.keystore and the cert file copy is cert.txt:

    keytool -import -trustcacerts -keystore /etc/umbc/uportal-test.umbc.edu.keystore -file cert.txt -alias uportal-test

    (Note: keytool is in JAVA_HOME/bin on recent versions of the Sun JVM.)

  3. Set permissions on the keystore file so that the portal web server can read it.
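  4. Point the portal web server’s JVM at the custom keystore file. With Tomcat, this is done by setting the JAVA_OPTS environment variable prior to starting Tomcat (the standard JSSE property for this is javax.net.ssl.trustStore, e.g. -Djavax.net.ssl.trustStore=/etc/umbc/uportal-test.umbc.edu.keystore). For UMBC web servers, the place to set this is server-root/bin/config-perl.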
  5. Restart Tomcat.
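
Once Tomcat is back up, it's worth confirming that the JVM really picked up the keystore. Here is a rough sketch of such a check in Java. It assumes the keystore is handed to the JVM through the standard JSSE system property javax.net.ssl.trustStore, with the keystore password (if you want every entry to be readable) in javax.net.ssl.trustStorePassword. The class and method names are made up for illustration.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.KeyStore;
import java.util.Collections;
import java.util.List;

class TrustStoreCheck {

    // Builds the JVM option that points JSSE at a custom trust store.
    static String trustStoreOption(String keystorePath) {
        return "-Djavax.net.ssl.trustStore=" + keystorePath;
    }

    // Lists the aliases stored in a keystore file. Pass the password chosen
    // at import time; with a null password the integrity check is skipped,
    // and some keystore formats may not expose every entry.
    static List<String> listAliases(Path keystoreFile, char[] password) {
        try (InputStream in = Files.newInputStream(keystoreFile)) {
            KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
            ks.load(in, password);
            return Collections.list(ks.aliases());
        } catch (Exception e) {
            throw new IllegalStateException("Could not read " + keystoreFile, e);
        }
    }

    public static void main(String[] args) {
        String configured = System.getProperty("javax.net.ssl.trustStore");
        if (configured == null) {
            System.out.println("No custom trust store set. Expected something like "
                    + trustStoreOption("/etc/umbc/uportal-test.umbc.edu.keystore"));
            return;
        }
        String pw = System.getProperty("javax.net.ssl.trustStorePassword");
        char[] password = (pw == null) ? null : pw.toCharArray();
        System.out.println("Trust store: " + configured);
        System.out.println("Aliases: " + listAliases(Paths.get(configured), password));
    }
}
```

If everything is wired up, the alias given to keytool (uportal-test in the command above) should show up in the list.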

myUMBC redesign coming along

What’s this… a rare work-related entry on a weekend??

Well, it is the weekend, but I’m hacking on the portal anyhow, to get ready for our relaunch this August. It’s coming along marvelously, thanks mainly to Collier, but I’ve been busy with it too. The relaunch will include:

  • A complete redesign of the front end (users will think we did something 😉)
  • Framework upgrade from uPortal 2.4.3 to uPortal 2.5.2
  • Switch layout managers from Aggregated Layouts (ALM) to Distributed Layout Manager (DLM)
  • A flexible development environment (built around CVS) that accommodates multiple developers and makes it easy to deploy from scratch

It’s an ambitious undertaking given the time frame, but it’s really coming together amazingly well. The new portal instance is up and running on our test server. Today, I got our “Local Connection Context” (which provides connectivity to the legacy myUMBC Perl codebase via web proxy channels) working. There were a couple of “gotchas” with doing this. First off, I made the executive decision to point the test-instance web proxy channels at the production SIS data. Yeah, we do have a development SIS instance, but it’s a little too rough-around-the-edges to use for this purpose. For example, there’s no (reliable) FastCGI version running, and using the standard CGI stuff would absolutely kill the performance. And we’re not really concerned with testing the SIS stuff on the test portal instance; we’re more concerned with ensuring that the SIS stuff renders properly. To that end, it makes more sense to point to the stable, production SIS code, to get the closest approximation to what would be running in prod anyhow.

Doing this raises a problem, though — the portal is connected to the DEVL Oracle instance, but to render production content, I also need a connection to the PROD instance. In production, I only need one database connection; in devl/test, I’m going to need two. I handled this by adding an additional JNDI data source for the prod database, and adding code to the connection context to read the data source info out of a properties file. I called the data source ProdPortalDb (it’s defined in /properties/uPortal55.xml), and the properties file is /properties/myumbc.properties.
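
For the record, the "read the data source name out of a properties file" part might look roughly like the sketch below. This is not the actual connection context code. The property key myumbc.sis.datasource, the fallback name passed in by the caller, and the java:comp/env/jdbc/ prefix (the usual Tomcat convention) are all assumptions for illustration; only the ProdPortalDb name comes from the setup described above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.sql.DataSource;

class SisDataSourceLocator {

    // Hypothetical property key; the real key is not given in the post.
    static final String KEY = "myumbc.sis.datasource";

    // Parses properties-file text (handy for trying this out without a file).
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Resolves the JNDI name, falling back to a default data source when the
    // properties file does not name one (as would be the case in production).
    static String jndiName(Properties props, String defaultSource) {
        return "java:comp/env/jdbc/" + props.getProperty(KEY, defaultSource).trim();
    }

    // Only works inside a container that has the data source bound in JNDI.
    static DataSource lookup(Properties props, String defaultSource) throws NamingException {
        return (DataSource) new InitialContext().lookup(jndiName(props, defaultSource));
    }
}
```

The nice part of this arrangement is that production needs no extra configuration: with the key absent, everything falls back to the single data source the portal already uses.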

Stay tuned for more.

Out with GUIDs.. in with Campus IDs

OK, policy decision for our upcoming launch of uPortal 2.5: I’m going to stop using the LDAP GUID as the uPortal ‘unique username’, and use the new UMBC Campus ID instead.

Background: Every uPortal user gets an entry in uPortal’s user table, UP_USER. This table contains (among other things) a numeric ID for primary key, and a text ‘username’. Username is what uPortal uses as its security principal. For our current production installation of uPortal, we’re using the user’s LDAP GUID as the security principal. This works fine, but it causes problems with things like the Groups and Permissions manager, which allows you to search for specific usernames. The GUID is a long, unwieldy string that isn’t really human-friendly, so it’s not really something we want people using as a search key.

Now, why don’t we just use the user’s UMBC username as the security principal? Certainly seems logical. But, down the road, we want to provide access to the portal for alumni, prospective students, parents, etc. These people will not necessarily have a UMBC username. If we use that as a security principal, we effectively lock these people out of the portal. We need a unique identifier that everyone in the directory possesses. Until recently, the GUID was the only thing that fit the bill. Now, we have the nifty UMBC Campus ID, which is a simple string of two alphanumeric characters followed by five digits. Unlike GUID, the Campus ID is intended for human consumption (we actually display it on our ID cards). So, while it may not be quite as search-friendly as a username, it’s much more so than a GUID, and everyone in the directory has one.
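
Since the Campus ID has such a simple shape, a format check is nearly a one-liner. Here's a minimal sketch in Java, assuming "alphanumeric" means ASCII letters or digits in either case (the description above doesn't say whether case matters). The class name and the sample IDs in the usage note are invented.

```java
import java.util.regex.Pattern;

class CampusId {

    // Two alphanumeric characters followed by exactly five digits.
    // Accepting both upper and lower case is an assumption.
    private static final Pattern FORMAT = Pattern.compile("[A-Za-z0-9]{2}[0-9]{5}");

    // True when the whole string has the Campus ID shape. This says nothing
    // about whether the ID actually exists in the directory.
    static boolean looksValid(String candidate) {
        return candidate != null && FORMAT.matcher(candidate).matches();
    }
}
```

With this, CampusId.looksValid("AB12345") passes, while a long GUID-style string or a plain username does not.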

Yet another happy byproduct of SSN remediation. I’m sure I’ll be cursing it next week after the SIS cutover, but I’ll say it again: In the long run, it’ll be well worth the hassle.

Followup… looks like if I want to do this, I’ll need to do an extra LDAP call to pull the Campus ID out of the PersonDB record. Not the end of the world, but not a 10-minute change either as I had hoped. I’ll have to come back to this later.

6/22: Looks like Rob is now including Campus ID in the Webauth ticket hash map. So, no extra LDAP query necessary. Cool…

Portal Meltdowns

We had another portal meltdown this morning. I figure I’ll keep a “meltdown log” of sorts to record notes, etc. Hopefully it’ll help me get to the bottom of this.

Background: Every so often, the portal becomes unresponsive. There seems to be no correlation to system load (high demand etc). For example, spring final exams ended yesterday, and today is one of the quietest days of the year around here. Yet we had a meltdown this morning.

Observations:

  1. In portal.log, there are always lots of DBCP-related errors.
  2. The portal fills up its DB connection pool very quickly and I often see log messages that the pool is “exhausted”.
  3. Lots of ALM-related infinite loops, in getFirstSiblingNode and others. I’ve put loop-breaking code in various spots; otherwise, all the busy-looping threads would kill the JVM pretty quickly. I have it logging the portal userid when this happens, and it doesn’t seem to be limited to specific users. In fact, I picked one from an error log, checked his portal account later, and it worked fine.
  4. When the problem is occurring, restarting the portal (web server, JVM, the whole 9 yards) does not fix the problem.
  5. Both portal server boxes (uportal1 and uportal2) are always affected at the same time, ostensibly ruling out an issue involving the OS, Apache, or Tomcat.
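
To give an idea of what the loop-breaking code from observation 3 amounts to, here is a stripped-down sketch in Java. It is not the ALM code. It models the layout as a plain map from each node ID to its previous sibling (an assumption made purely for illustration) and bails out as soon as the walk revisits a node.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SiblingWalker {

    // Follows "previous sibling" links from start back to the first sibling.
    // Returns null instead of spinning forever if the links form a cycle.
    static String firstSibling(String start, Map<String, String> previousSibling) {
        Set<String> seen = new HashSet<>();
        String current = start;
        while (previousSibling.get(current) != null) {
            if (!seen.add(current)) {
                return null; // already visited this node: corrupt layout, break the loop
            }
            current = previousSibling.get(current);
        }
        return current;
    }
}
```

In the real thing, the spot that returns null is also where the portal userid gets logged.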

I did a JVM thread dump. One thread was hung doing a database commit (?!?) inside portal.RDBMUserLayoutStore.getNextStructId, a synchronized routine. All other threads (98 of ’em) were hung waiting on this one thread.

Sometimes, this problem magically resolves itself. Other times, it’s resolved after the DBA restarts the Oracle instance. We’ve hypothesized that this is being caused by an errant app running at the same time, which is dragging down our database instance. I’m not sure if I buy that or not. But right now it looks like, somehow, a thread is hanging inside a critical section, thereby locking up all the other threads; and the locked threads are all grabbing DB connections from the DBCP pool and never releasing them, thereby hosing up the pool. Next time it happens, I’ll do another thread dump and see if the lockup is in the same spot.
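
To make the pile-up concrete, here is a small, self-contained Java sketch of the pattern: one thread stalls inside a synchronized method, and every other caller ends up BLOCKED on the same monitor. The stalled database commit is simulated with a latch, and all the names are invented. This is not the uPortal code, just the shape of the problem.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

class MonitorPileup {
    private final CountDownLatch entered = new CountDownLatch(1);
    private final CountDownLatch release = new CountDownLatch(1);

    // Stand-in for a synchronized routine whose first caller hangs inside it.
    private synchronized void criticalSection(boolean stall) {
        if (stall) {
            entered.countDown();
            awaitQuietly(release); // simulates the commit that never returns
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void joinQuietly(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Starts one stalled holder plus the given number of waiters, and returns
    // how many waiters were observed BLOCKED on the monitor.
    static int countBlocked(int waiters) {
        MonitorPileup shared = new MonitorPileup();
        Thread holder = new Thread(() -> shared.criticalSection(true), "holder");
        holder.start();
        awaitQuietly(shared.entered); // holder now owns the monitor

        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < waiters; i++) {
            Thread t = new Thread(() -> shared.criticalSection(false), "waiter-" + i);
            threads.add(t);
            t.start();
        }

        // Poll (for up to about five seconds) until every waiter is BLOCKED.
        int blocked = 0;
        for (int attempt = 0; attempt < 500 && blocked < waiters; attempt++) {
            blocked = 0;
            for (Thread t : threads) {
                if (t.getState() == Thread.State.BLOCKED) {
                    blocked++;
                }
            }
            if (blocked < waiters) {
                pause(10);
            }
        }

        shared.release.countDown(); // un-stick the holder so everything can finish
        joinQuietly(holder);
        for (Thread t : threads) {
            joinQuietly(t);
        }
        return blocked;
    }

    public static void main(String[] args) {
        System.out.println("Blocked waiters: " + countBlocked(5));
    }
}
```

In a thread dump, the waiters show up as blocked on the monitor while the holder sits somewhere inside the method, which matches the one-hung-thread-plus-98-waiters picture.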

I also did a dump of active portal sessions to see who was logged on around the time of the meltdown. When it happens again, I’ll do another dump and cross-check the two, to see if a particular user or users might be causing the problem.

6/1: Happened again. Lockup occurred in exactly the same spot, with all threads locked waiting on another thread trying to do a database commit. WTF is going on here..

Today’s PAGS tweak

Today I fixed the latest crop of users with portal-access issues: Previous students who took a semester off, or never completed a degree, who want to use myUMBC to register or retrieve a transcript. These students do not have a student affiliation in LDAP, because they’re not “current” students. They also lack an alumni attribute as they never completed degrees. As such, they don’t see the “Academics” tab because we limit display of this tab to users with student or alumni affiliations (well, and a few others as well, but that’s beside the point right now).

Our LDAP directory does not have an affiliation for “past-student-who-never-graduated”. However, we have an attribute for the last term a user was registered for classes. In theory, if a user possesses this attribute, that indicates that they were a student at some time in the past. We can give them the “Academics” tab by expanding the UMBC Student PAGS group to include people with this attribute.

Laundry list for doing this:

  1. Make sure the LDAP attribute in question is available in uPortal as a Person Attribute. If not, add the attribute to PersonDirs.xml.
  2. Edit PAGSGroupStoreConfig.xml and add any new groups (or edit existing ones) to incorporate the new attribute.
  3. Redeploy, restart and hope it doesn’t bomb.

In our case, I added an additional clause to check for presence of the umbclasttermreg attribute using the PAGS “Value Exists Tester”, org.jasig.portal.groups.pags.testers.ValueExistsTester.

The PAGS entry on the JA-SIG Wiki has improved quite a bit since I last visited it. Caveat emptor: It references ValueExistsTester and ValueMissingTester. Initially I blindly followed the docs and tried to use ValueExistsTester, and things didn’t work. It turns out these two testers are not present in uPortal 2.4.3.. I had to add ValueExistsTester manually. It’s only around 10 lines of Java, but the Wiki ought to mention that it’s not in 2.4.3.
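
For anyone stuck on 2.4.3, the idea behind a "value exists" tester is simple enough to sketch. The Java below is a simplified stand-in, not the real org.jasig.portal.groups.pags.testers.ValueExistsTester. The PersonTester interface and the map-of-attributes representation are inventions to keep the example self-contained; the real tester plugs into uPortal's own person and tester types.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a PAGS tester: decides group membership
// from a person's attributes.
interface PersonTester {
    boolean test(Map<String, List<String>> attributes);
}

class ValueExistsSketch implements PersonTester {
    private final String attributeName;

    ValueExistsSketch(String attributeName) {
        this.attributeName = attributeName;
    }

    // True when the attribute is present with at least one non-blank value.
    @Override
    public boolean test(Map<String, List<String>> attributes) {
        List<String> values = attributes.get(attributeName);
        if (values == null) {
            return false;
        }
        for (String value : values) {
            if (value != null && !value.trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }
}
```

A tester built with new ValueExistsSketch("umbclasttermreg") accepts anyone who has a last-registered term on record, whatever the value is.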

Web Proxy channels and cw_person parameter

CWebProxy provides a channel parameter called cw_person, which is a comma-separated list of person attributes. If this parameter is set, CWebProxy is supposed to fetch the listed attributes and pass them to the back-end web application as CGI parameters. This is a potentially handy feature, because then we can develop “smart” unauthenticated apps which present content tailored to individual users. For example, in our case we’d like to develop a “campus links” channel, which presents different sets of links to different users based on their LDAP affiliations. In theory, with cw_person, that should be easy to do.

Well, I tried this out, and as usual, what sounds great in theory isn’t always great in practice. There are two issues with this feature:

  1. If a user has multiple values for a given attribute, CWebProxy only passes one of them (presumably the first one returned by the LDAP query). Example: I set up a test channel with cw_person set to pass the LDAP affiliation attribute. My LDAP affiliations are “staff”, “employee”, and “alumni”, but CWebProxy only passes “staff”.
  2. When the channel is refreshed (e.g. by switching to a different tab and then going back), it seems to stop passing the attribute. Not sure if this is user error, or if that’s just how it works. But if this is going to be any use to us, it needs to pass the attributes every time the channel is rendered.

These two problems will probably prevent us from using this feature to do what we want. Yeah, we can probably hack CWebProxy to make this work, but I think a better solution would be to write up a custom local connection context. That will give me complete control over what gets passed to the back-end app (and when), with the added benefit that we can use it with a CGenericXSLT type channel and aren’t limited to using a web proxy.

Incidentally, for anyone trying to set up a channel with cw_person, there’s one big “gotcha”. There is an additional channel parameter called cw_personAllow. This parameter is a list of attributes that the channel is allowed to pass to the back-end app. The default is to disallow every attribute. So if you’re like most people, you’ll set up cw_person, ignore cw_personAllow, and then wonder why it doesn’t work. To get it to pass your attributes, you can either set cw_personAllow to ‘*’ (meaning “pass any attribute”), or specify an explicit list. This can be done in the channel definition, but there’s also a global default in portal.properties called org.jasig.portal.channels.webproxy.CWebProxy.person_allow. Yes, this is all in the CWebProxy documentation, but you have to dig for it. A tutorial would probably help.
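
Here's a rough Java sketch of the behavior I'd want from a custom connection context: honor an allow list that denies by default, treat ‘*’ as "pass anything", and pass every value of a multi-valued attribute rather than just the first. This is not CWebProxy's code. The class, the method, and the choice to repeat the parameter name once per value are assumptions.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;

class PersonParams {

    // Builds a query string from the requested person attributes.
    // Attributes not on the allow list are silently dropped; "*" allows all.
    static String build(List<String> requested, Set<String> allowed,
                        Map<String, List<String>> attributes) {
        boolean allowAll = allowed.contains("*");
        StringJoiner query = new StringJoiner("&");
        for (String name : requested) {
            if (!allowAll && !allowed.contains(name)) {
                continue; // default is to disallow
            }
            // Emit one name=value pair per value, so multi-valued attributes survive.
            for (String value : attributes.getOrDefault(name, List.of())) {
                query.add(encode(name) + "=" + encode(value));
            }
        }
        return query.toString();
    }

    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

Because the query string is rebuilt on every call, invoking this on each render would also sidestep the "stops passing the attribute after a refresh" problem.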

Student IDs, and making the portal work for users who don’t have them

First: The facts of life.

  1. UMBC has the concept of a “student ID number.” This number is used in SIS as the primary key for almost every table. All students have these IDs, as do faculty members who teach courses, etc.
  2. uPortal itself does not use or require this number to do its work.
  3. The legacy myUMBC portal uses this number to do its internal session management. Also, we need to know this number to look up any SIS data on behalf of the user.
  4. uPortal proxies to the legacy portal to do most of its “real work” involving SIS functionality.
  5. When a user logs into either portal, they provide only a username and password. We need to be able to take that username and map it to a student ID. To do this, we query our campus LDAP directory.
  6. Not all users have student ID attributes in LDAP, for various reasons. In particular, if the help desk manually enters someone into the directory, they sometimes leave out the ID.
  7. These users have problems accessing the portal.

Now, in the old portal, we handled these cases by prompting the user to manually enter the student ID and a 4-digit PIN (don’t go there). In uPortal, we don’t have this logic. So, it just breaks. In fact, it breaks in such a way that “bad” HTML is generated, so the channels in question don’t render at all.

Well, the rendering issue is fixed now, but it was trickier than I thought it would be. It wasn’t just a matter of tidying the HTML in the legacy Perl code. The problem was actually happening in my local connection context code. Basically in these cases, the connection context can’t generate a valid legacy portal session at all. And when that happened, it was crapping out before it added a couple of necessary URI parameters to the web proxy URL. Long story short, it was trying to bring up a legacy portal login screen, complete with navigation, decoration, and lotsa sloppy HTML to boot, and uPortal of course was refusing to render it.

Fortunately it appears that the solution to this is going to be human-engineered, meaning I will just display some appropriate wording that tells the user how to remedy the problem. So, I shouldn’t need to do anything fancier than what I’ve already done.
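
The shape of the fix, very roughly, in Java: always add the parameters the channel needs for rendering, and only then worry about whether there's a legacy session to attach. Everything named here is hypothetical, including the view and session parameter names, the example URL in the usage note, and the notice wording. The real parameters live in the connection context code.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

class LegacyProxyUrl {

    // Builds the web proxy URL. The rendering parameter (hypothetical name
    // "view") is appended unconditionally, so a missing session can no longer
    // cause it to be skipped.
    static String build(String baseUrl, Optional<String> legacySession) {
        StringBuilder url = new StringBuilder(baseUrl).append("?view=channel");
        legacySession.ifPresent(session ->
                url.append("&session=").append(URLEncoder.encode(session, StandardCharsets.UTF_8)));
        return url.toString();
    }

    // Placeholder wording for users with no student ID on file. It is a
    // small, well-formed fragment, so the channel can still render it.
    static String missingIdNotice() {
        return "<p>We could not find a student ID for your account. "
                + "Please contact the help desk.</p>";
    }
}
```

With an empty Optional, build("https://example.invalid/legacy", Optional.empty()) still produces a URL carrying the rendering parameter, which is the part that used to get skipped.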

As an aside, I’m still working on the best way to fix these HTML-tidy issues with the web proxy channels. The first trick I try is masquerading as the user with the problem, finding the bad channel, and examining it outside the portal by passing the same arguments to the legacy myUMBC web app. Then I can view the document source, paste it into emacs, and look for problems. Once I think I have it fixed, I save it as static HTML and point another web proxy channel at it, to see if it’ll render. That usually works. Sometimes I need a Java stack trace, which can be problematic because “ordinary” users don’t have the ability to display stack traces within channels (there might be something in portal.log, but it’s often hard to sift through this and correlate log entries with individual problems). In these cases, I hack the code to force the error condition to occur with my own account. Then I can see the stack trace, find out the nature of the evil, and fix it. Yeah, it’s clunky, but it gets the job done.

The hassle^H^H^H^H^H^Hlegacy of PINs at UMBC

So, we have this new campus portal at UMBC. And for the most part, the launch has gone pretty well. As part of this whole thing, we’re rethinking some of our old, outdated business processes and changing the way they work under the new portal. One of these is the PIN (Personal ID Number). Waaaaay back in 1996 when UMBC first launched web-based course registration, we required all students to enter a 4-digit PIN to log in. There was also a service where students could register by phone, that used the same PIN.

Flash forward to 2006. We’re now using a campus-wide single sign-on system, the telephone registration system is gone, and we’ve done away with PINs as part of the login process. So, since students aren’t using them any more, we can just get rid of PINs altogether, right? Wrong. Problem is, we’re still using our old, crusty HP3000 mainframe as system of record for registration, so when we do online course registration, we have to play by the HP3000’s rules. The HP3000 is still running circa-1996 registration code (written in COBOL), and PINs are so deeply embedded into that code that there’s no way we’re ever getting rid of them as long as the HP is around. We can rework things so that all the PIN stuff is handled behind-the-scenes, and users never see them or even know they exist, but on the back end, they’re still going to be there.

Now.. the HP3000 stores everybody’s PIN in a database table. But initially, that table is not populated until a user accesses the system for the first time. Then, the HP figures out an initial PIN for the user, and uses that to populate the PIN table. It then sets a flag that the user’s PIN needs to be changed. The HP will then refuse to do anything on behalf of that user, until they change their PIN. The mandatory PIN change happens when the user logs into myUMBC. With the old version of myUMBC, they would see the mandatory PIN change screen immediately after logging in, at which point they’d need to change the PIN before doing anything else in myUMBC. But again, this behavior is a relic of the days when most activities in myUMBC centered around the HP. It also prevents certain users (people who lack the appropriate data that the HP needs to generate the initial PIN) from using myUMBC at all. If we’re going to move forward, we need to get rid of this behavior.

The first step towards this goal was to eliminate the mandatory PIN change check on initial login. Now keep in mind that we have to submit an initial PIN change request for every student, before they can do anything that involves the HP. So, I’m now doing the mandatory PIN change only when the user requests a function that uses the HP. So rather than seeing it when they first log in, they see it when they try to register for the first time. A small but significant step.
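
In code terms, the check moves from "at login" to "at the door of anything HP-backed". A bare-bones Java sketch of that gate follows. The function names and the set of HP-backed functions are invented for illustration; the real legacy code is Perl and considerably messier.

```java
import java.util.EnumSet;
import java.util.Set;

class PinGate {

    // Hypothetical portal functions, for illustration only.
    enum Function { VIEW_CAMPUS_LINKS, REGISTER_FOR_CLASSES }

    // Functions assumed to go through the HP3000.
    static final Set<Function> USES_HP = EnumSet.of(Function.REGISTER_FOR_CLASSES);

    // The PIN change is forced only when the flag is set AND the requested
    // function actually needs the HP. Login itself never triggers it.
    static boolean mustChangePinFirst(Function requested, boolean pinChangeFlagSet) {
        return pinChangeFlagSet && USES_HP.contains(requested);
    }
}
```

The catch is that this only protects functions that pass through the gate. Anything that talks to the HP by some other route never hits the check.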

This works great, except it broke the online student parking registration app. Student Parking Registration has the distinction of being the only external (not part of the monolithic legacy myUMBC code) app that talks directly to the HP. If a student goes to this app before changing their PIN, the HP will refuse the parking registration request and the app will fail silently. Yep, this was a fun one to debug. It’s still broken, until I figure out the best way to fix it.

Affinities affinities affinities…

As expected, the whole uPortal affinity thing is really drawing folks out of the woodwork. For those of you who came in late, uPortal uses a totally different scheme from the old portal to determine who sees what content. Let’s take the “Faculty Options” channel as an example:

Portal        You see faculty options if…
old myUMBC    You are in the AUTH_CLASS or TEACHER SIS tables, or have the faculty LDAP affiliation
uPortal       You have the faculty, staff, or instructor LDAP affiliation

As I said… totally different. Now.. most folks who need faculty options will satisfy both portals’ conditions, and will see the content in both portals. However, there are always special cases, and as expected, we’ve run into a few. First, not everyone in AUTH_CLASS has one of the three magic LDAP attributes. Case in point, the education department has graduate assistants that do course authorizations for them. These people need to see the faculty options, but don’t get them in uPortal.
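
Writing the two rules side by side as code makes the gap easy to see. This Java sketch just restates the comparison above. The class and method names are made up, and passing SIS table membership in as booleans is a simplification.

```java
import java.util.Set;

class FacultyOptionsRules {

    // Old myUMBC: SIS table membership OR the faculty affiliation.
    static boolean oldPortalShows(boolean inAuthClass, boolean inTeacher,
                                  Set<String> ldapAffiliations) {
        return inAuthClass || inTeacher || ldapAffiliations.contains("faculty");
    }

    // uPortal: LDAP affiliations only.
    static boolean uPortalShows(Set<String> ldapAffiliations) {
        return ldapAffiliations.contains("faculty")
                || ldapAffiliations.contains("staff")
                || ldapAffiliations.contains("instructor");
    }
}
```

A graduate assistant who is in AUTH_CLASS but carries only a student affiliation passes the first rule and fails the second, which is exactly the case described above.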

What we really need to do here is rethink how we’re doing some of these affinities and who’s seeing what content. Some of this will just involve creating new PAGS groups to correspond with various affiliations. But, we may also need to create additional affiliations in LDAP to cover certain cases. Membership in AUTH_CLASS would be one of those cases. uPortal allows me to specify Person Attributes based on DB queries, so for some of these I might be able to go that route. But I’d prefer to keep this totally LDAP based.

In the meantime, I’m making things work for these people by granting them temporary staff affiliations in LDAP. Can’t have them unable to do their jobs while we’re figuring this out.