December 2005 Archives

December 5, 2005

Service capacity upgrades

Today Santa came early for Syscore and we received 9 new Sun V20z 2x1.8GHz Opteron (single core) servers. Over the course of winter break, they will be phased into service in the following roles:

* One will serve as a cold spare. Now that we have ourselves a critical mass of V20zs in our farm, it's prudent to keep a virgin spare around. This server will have an OS load on it, ready to go.

* Six will replace the current cluster of 16 email servers that handle inbound and outbound umbc.edu email traffic. Three will be located in the ECS data center and the other three in the PP data center. An additional 4 servers will also be freed up as we will stop running dedicated Milter (spam and virus filtering) servers and run these services directly on the mail servers themselves. So 20 servers will be turned into 6. Yay for the data center power and cooling budget.

* Two will be used to vastly increase the capacity and speed of the webmail system. This will cure the recent user complaints of slowness with that service. It'll also have redundancy as there is currently no backup for the single server now acting as the webmail server. As with the new email servers, one will be located in ECS and the other in PP.

Some of the hardware these new V20zs replace will be recycled as upgrades for other systems, and some will be retained as emergency spares or for other backline uses.

December 6, 2005

imapd update

The imap server software has been upgraded (slightly!) to log logins that were made via TLS or SSL as such. UW-IMAP would log SSL-initiated connections, but wouldn't log connections that were TLS-negotiated.

Messages look like this for an encrypted login attempt (either TLS or SSL):


Dec 6 09:39:37 mr4 imapd[16487]: Login (SSL) user=travel host=accounting-3.finsrv.umbc.edu [130.85.165.232]

or


Dec 6 09:48:00 mr4 imapd[16983]: Login (CTX) user=banz host=kyle.ucs.umbc.edu [130.85.70.249]

for a clear-text authentication.
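
For anyone who wants to tally encrypted versus clear-text logins from these messages, here is a minimal sketch in Python. The regular expression is derived only from the two sample lines above, and the function names (parse_login, is_cleartext) are made up for illustration; real log lines may differ in ways this doesn't handle.

```python
import re

# Pattern inferred from the two sample log lines in this post (an assumption,
# not taken from the imapd source).
LOGIN_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<server>\S+) imapd\[(?P<pid>\d+)\]: "
    r"Login \((?P<mode>[A-Z]+)\) "
    r"user=(?P<user>\S+) host=(?P<host>\S+) \[(?P<ip>[\d.]+)\]$"
)


def parse_login(line):
    """Return a dict of fields for an imapd login line, or None if no match."""
    match = LOGIN_RE.match(line.strip())
    return match.groupdict() if match else None


def is_cleartext(entry):
    """CTX is the tag shown above for a clear-text authentication."""
    return entry["mode"] == "CTX"


if __name__ == "__main__":
    samples = [
        "Dec 6 09:39:37 mr4 imapd[16487]: Login (SSL) user=travel "
        "host=accounting-3.finsrv.umbc.edu [130.85.165.232]",
        "Dec 6 09:48:00 mr4 imapd[16983]: Login (CTX) user=banz "
        "host=kyle.ucs.umbc.edu [130.85.70.249]",
    ]
    for sample in samples:
        entry = parse_login(sample)
        kind = "clear-text" if is_cleartext(entry) else "encrypted"
        print(entry["user"], entry["host"], kind)
```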

syslog server update

curly.umbc.edu has been retired, and replaced with grinder.umbc.edu, a Sun X2100.

December 14, 2005

Nightly myUMBC outages (hopefully) halted

Last weekend, I learned that myUMBC (among other things) was going down every night from roughly 9:05pm till 9:20pm. I had no idea this had been happening, so the night before last I signed on around 9:00pm to see what was going on. Sure enough, at 9:05 sharp, it hung. I got stack traces from all of the myUMBC jobs, and found they were all hanging on calls to the Oracle DB. The next morning I talked to one of our DBAs. Turns out there was a backup job running during that time. The DBAs shuffled the backup jobs around, and last night there were no problems. Dunno how they fixed it, but if it works now, that's all I really care about :-) I plan on keeping an eye on things for a while, to see if the problem turns up again, but for now I'm calling it fixed.

imap/pop mail reader upgrades

On Tuesday and Wednesday (12/13 and 12/14) we made some memory upgrades and configuration changes on our mail readers to increase their capacity and snappiness.

December 18, 2005

Kerberos.umbc.edu outage -- Sunday, Dec 18

Kerberos.umbc.edu was "hung" for a time (still responding to pings, just not doing much) from 4pm on Dec 18 until approximately 9:30pm. It is unknown why it was hung, just that it was: there were no diagnostics on the console or in the logs. Services that required the backwards-compatible rxKa authentication for older AFS client installs, and password changes, did not work during this time. In addition, it was noticed that the principal database on the secondary KDC hadn't been updated in some time. Logs showed that it hadn't been updated since November 8; it seems that 'cron' on kerberos2 had packed it in as well. Well, it has been up for 450+ days.

December 21, 2005

Core fabric extended to PP building

On Monday the Core Storage Fabric was extended from the ECS building to our second data center in the Public Policy building.

Two Qlogic SANbox 5200 switches are there, and they were connected to our two switch clusters in ECS with two single-mode fiber optic runs. One was run from the sw1/sw4 cluster in ECS B8 straight to sw5 in PP11, and the other was run from sw2/sw3 in ECS H5 to sw6 in PP11 via the Physics building... so at least part of the way the fiber paths diverge.

This provides our fibre channel network an aggregate of 8Gb/s between ECS and PP (that's 4Gb/s each way) and an additional failover route.

Also, an additional switch was installed and activated in ECS E2 for putting the Blackboard database and app servers onto the SAN.

With all these switch additions (those 3 switch additions constituted a 75% growth), zoning is starting to get tricky. Documentation alone isn't going to cut it when it comes to management anymore. I'm looking at ESM apps to help with the management, as Qlogic's management app is not very whole-fabric centric.

December 23, 2005

ff1-raid1 to be moved to PP

Next week on 12/28 we will move ff1-raid1 from ECS to PP. This move will mean that all AFS home storage will be redundant between buildings, a first for the campus! This means not all data is lost if the ECS building were to go up in flames.

No user-viewable disruption should be seen, as the missing array will be automatically handled by the mirroring on our Solaris AFS servers. ff1-raid2 will remain in the ECS data center. Once ff1-raid1 is back on the fibre channel fabric in PP, the mirrors will automatically resync and the AFS home server network will return to a nominal state.

Inter-switch link redundancy testing success

This morning, Rob and I tested the redundant links between our fibre channel switches in ECS and PP. Everything went off without a hitch, and that's good considering that we were dealing with live traffic!

We have two dual-switch clusters in ECS. There are two fibre channel links between the clusters, one from each switch. There is an additional single switch in the Blackboard server rack which has one link to each cluster... so basically it straddles the two clusters.

We also have one dual-switch cluster in PP, and each switch in that cluster has a connection to one of the two clusters in ECS.

So we pulled the two direct links between the ECS clusters, and traffic failed over to the Blackboard switch and went through that.

We then pulled the connection to the Blackboard switch, and traffic predictably failed over to going to, and then back out from, the switch cluster in PP.

This means it'll take a lot of things going wrong to split the fabric... we can survive two switch failures without splitting the fabric. This testing also confirmed that we have a correct fibre channel zone configuration (fibre channel zones are analogous to VLANs on an Ethernet network).
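
The same pull-a-link test can be sketched on paper as a small graph exercise. The Python below is only an illustration under stated assumptions: the name "bb" for the Blackboard switch, the links inside each cluster, and exactly which switch in a cluster each link lands on are guesses, not our documented cabling. It finds the shortest hop path between the two ECS clusters after each pull.

```python
from collections import deque

# Switch names sw1..sw6 come from the posts above; "bb" is a made-up name
# for the Blackboard rack switch.
SWITCHES = {"sw1", "sw2", "sw3", "sw4", "sw5", "sw6", "bb"}

# Assumed cabling: which switch in a cluster terminates each link is a guess.
LINKS = [
    ("sw1", "sw4"), ("sw2", "sw3"), ("sw5", "sw6"),  # inside each cluster
    ("sw1", "sw2"), ("sw4", "sw3"),                  # direct ECS cluster links
    ("bb", "sw1"), ("bb", "sw2"),                    # Blackboard switch straddles
    ("sw1", "sw5"), ("sw2", "sw6"),                  # ECS to PP runs
]


def pull(links, *pulled):
    """Return the link list with the given links unplugged."""
    gone = {frozenset(p) for p in pulled}
    return [link for link in links if frozenset(link) not in gone]


def shortest_path(links, src, dst):
    """Breadth-first search; returns a list of switches or None."""
    adj = {s: set() for s in SWITCHES}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in sorted(adj[node]):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None


def fabric_is_whole(links):
    """True if every switch can still reach sw1."""
    return all(shortest_path(links, "sw1", s) is not None for s in SWITCHES)


if __name__ == "__main__":
    step1 = pull(LINKS, ("sw1", "sw2"), ("sw4", "sw3"))  # pull direct links
    step2 = pull(step1, ("bb", "sw1"))                   # pull one bb link
    for name, links in [("all up", LINKS), ("step 1", step1), ("step 2", step2)]:
        print(name, fabric_is_whole(links), shortest_path(links, "sw1", "sw2"))
```

In this model the first pull sends sw1-to-sw2 traffic through the Blackboard switch, and the second sends it out through the PP cluster and back, which matches what we saw on the live fabric. It is a hop-count toy, though, and says nothing about how the switches actually pick routes.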

About December 2005

This page contains all entries posted to OIT SysCore in December 2005. They are listed from oldest to newest.
