On Tuesday & Wednesday (12/13 & 12/14), we made some memory upgrades and configuration changes on our mail readers to increase their capacity and snappiness.
Basically, while analysing one of the servers during a particularly high level of utilization, we noticed a couple of things that were really dragging the system down:
* High disk IO latencies (due to really busy disk access)
* Some swapping -- memory was getting tight.
On these servers there are only two things that use the disk. The most obvious is the AFS disk cache, which was quite busy. The slightly less obvious is syslog, which calls fsync() after every write to the log file. The first thing I did was use an option in Linux's syslogd that disables this behavior; this improved I/O utilization on the system disk slightly. However, the major source of I/O load on the disk was the AFS cache -- now, how do we get rid of that?
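To see why an fsync() on every log write matters, here is a minimal sketch (not the actual syslogd code) that appends lines to a file with and without a sync after each write. The function names and line counts are made up for illustration. With sysklogd, the sync is typically turned off by prefixing the log file path with `-` in syslog.conf, though that is an assumption about how it was done here.

```python
import os
import tempfile
import time


def write_lines(path, n_lines, sync_each_write):
    """Append n_lines to path, optionally forcing each one to disk."""
    start = time.perf_counter()
    with open(path, "a") as log:
        for i in range(n_lines):
            log.write(f"message {i}\n")
            if sync_each_write:
                log.flush()             # hand Python's buffer to the kernel
                os.fsync(log.fileno())  # make the kernel write it to disk now
    return time.perf_counter() - start


def compare(n_lines=200):
    """Return (seconds with fsync per write, seconds without)."""
    with tempfile.TemporaryDirectory() as tmp:
        synced = write_lines(os.path.join(tmp, "synced.log"), n_lines, True)
        buffered = write_lines(os.path.join(tmp, "buffered.log"), n_lines, False)
    return synced, buffered


if __name__ == "__main__":
    synced, buffered = compare()
    print(f"fsync per write: {synced:.4f}s, buffered: {buffered:.4f}s")
```

On a busy spinning disk the synced run is usually much slower, since every message waits for the disk. The exact gap depends on the hardware and filesystem.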
I had done some research a while ago on how to configure the AFS cache in memory for these machines, but due to some limitations in the AFS code, "large" AFS memory caches are inefficient and -- well -- in some cases cause the machine to crash. I had also tried using a Linux tmpfs (RAM) filesystem for the cache, but this didn't work, as AFS seems to do some mojo with its cache access that's incompatible with tmpfs. However, it came to my attention that I could use tmpfs, create a file on it holding a "real" ext2 filesystem, attach that file through a loopback device, and mount it as the AFS disk cache. All of our problems (memory and I/O) could be solved with more memory. A ton of it.
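The tmpfs-plus-loopback trick can be sketched roughly as follows. This is not the exact procedure used on the mail readers: the mount point /mnt/afs-ram, the image name, the sizes, and the cache directory /usr/vice/cache (a common AFS client default) are all assumptions. The script only prints the commands unless dry_run is turned off, since running them needs root and changes mounts.

```python
import shlex
import subprocess


def cache_setup_commands(tmpfs_dir="/mnt/afs-ram", cache_dir="/usr/vice/cache",
                         image_mib=2048, headroom_mib=64):
    """Build the commands for an ext2-on-loopback-on-tmpfs cache (all paths hypothetical)."""
    image = f"{tmpfs_dir}/cache.img"
    return [
        # 1. RAM-backed filesystem, slightly larger than the image it will hold.
        ["mount", "-t", "tmpfs", "-o", f"size={image_mib + headroom_mib}m",
         "tmpfs", tmpfs_dir],
        # 2. A plain file on tmpfs that will act as the "disk".
        ["dd", "if=/dev/zero", f"of={image}", "bs=1M", f"count={image_mib}"],
        # 3. A real ext2 filesystem inside that file (-F: target is not a block device).
        ["mke2fs", "-q", "-F", image],
        # 4. Mount it through a loopback device where the AFS client expects its cache.
        ["mount", "-t", "ext2", "-o", "loop", image, cache_dir],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(cache_setup_commands())
```

AFS sees an ordinary ext2 disk cache, so whatever it does that tmpfs can't handle keeps working, while the blocks themselves live in RAM. The AFS client would then need its cache size set somewhat below the size of the ext2 filesystem; that step is not shown.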
So, I went ahead and purchased a ton of it (OK, only 4G more for each machine). I had calculated that even after reserving 2G of that for the AFS cache, we'd still have more than enough left over to handle the memory requirements of the IMAP readers. On Tuesday, I added the extra memory to three of the mr machines (5, 6, 7), and after recompiling the Linux kernel to support over 4G of RAM -- stupid Linux -- I implemented the AFS disk-cache-in-memory setup described above. The result was impressive: no disk I/O, no latency. These machines were screaming.
As of now, two of the machines (mr4 and mr8) don't have their memory upgrades. For yucks, I wanted to see just how much load the upgraded machines could take, so I turned off IMAP & POP service on the unupgraded machines on Wednesday morning.
Currently, the remaining three machines are handling our full mail reading load with *OVERHEAD* to spare. That's almost 800 mail reading processes each.