We've had a rough night with the GPU servers. One has been down since yesterday: it crashed hard and now can't find its / partition, and the admins are attempting a rescue-disk fsck this morning. Two more went down last night (PST) under heavy load, but those were easy to bring back and are up now.
We are stretched a bit thin because we are implementing the new server infrastructure in parallel with the old one. The upshot is that once the new infrastructure has been deployed, we will have much more functional collection servers (CSs), and we will also get work servers (WSs) that should not need to be restarted nearly as frequently under heavy load.
We are beginning the rollout of the new WS (v5) code onto the GPU servers this week, although these issues have slowed us down a bit.