Donor Question: Why do certain servers go down more easily?

I'll start posting replies that I make in the forum when I think they would be of broad interest.  Here's one regarding server downtime.

Is there a reason that this particular server seems to have an exceptional amount of downtime?

Physically, the server is fine. However, when it gets overloaded, it simply stops responding and the binary must be killed.  This is a software issue, likely in the low-level HTTP code/library.

And what can be done on a long-term basis …

We have been looking into solutions for the past two years. Two years ago, we refactored the server code to clean it up and try to isolate the problem (creating the v4 branch of the server code from v3). That did help, but not enough. One feature of the v4 code is that it can detect when it has this problem and restart itself, but we don't want it falsely restarting itself (that can lead to other problems), so the timeout is somewhat long.
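The self-restart behavior described above can be sketched as a simple watchdog loop. This is only an illustration, not the actual v4 server code: the health-check URL, timeout values, and function names here are all hypothetical, but the key idea from the post is real — restart only after a long unresponsive period, so a brief load spike doesn't cause a false restart.

```python
import subprocess
import time
import urllib.request

# Hypothetical values for illustration; the real v4 server's
# timeout and health-check mechanism are not specified in the post.
HEALTH_URL = "http://localhost:8080/health"
RESTART_TIMEOUT = 300   # deliberately long, to avoid false restarts
CHECK_INTERVAL = 10     # seconds between health checks


def is_responsive(url, timeout=5):
    """Return True if the server answers an HTTP request in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except Exception:
        return False


def should_restart(last_ok, now, limit=RESTART_TIMEOUT):
    """Restart only after the server has been unresponsive longer
    than the (long) limit, not on the first failed check."""
    return now - last_ok > limit


def watchdog(server_cmd):
    """Run the server binary, kill and relaunch it if it stays
    unresponsive past the restart timeout."""
    proc = subprocess.Popen(server_cmd)
    last_ok = time.monotonic()
    while True:
        time.sleep(CHECK_INTERVAL)
        if is_responsive(HEALTH_URL):
            last_ok = time.monotonic()
        elif should_restart(last_ok, time.monotonic()):
            proc.kill()
            proc.wait()
            proc = subprocess.Popen(server_cmd)
            last_ok = time.monotonic()
```

The trade-off the post mentions shows up directly in `RESTART_TIMEOUT`: a short value recovers faster from genuine hangs but risks killing a server that is merely overloaded, which is why the real timeout is kept somewhat long.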

About a year ago, we took a radically different approach and worked with an outside programming company (Joseph Coffland from Cauldron) to rewrite the server code from scratch. That new server code (v5) is now under beta testing. Project 3798 is run on the new server platform. We are rolling it out gradually to avoid any sort of disaster, but so far so good; beta testing has found bugs in the v5 code which Joe has fixed, and it looks like the code is ready for a bigger rollout (which is underway). I expect that it will see its first duty in production projects (i.e. not server test projects) in early 2009 (likely January 2009), with a more complete rollout throughout the year.

The SMP and GPU servers will get it last, since they need additional code to bring the v5 code path up to spec with what the GPU and SMP clients need. However, we expect this won't be too onerous to get done.