We’ve been looking over the server code to see what’s going on and I
think we have some news. The server is a bit too generous about when it
times out connections, so slow connections end up filling all of the
available server threads. For now, we’ve bumped up the number of threads
from 200 to 500 (not unlike adding more cashiers at the grocery store, so
that a few slow people don’t slow down everyone). That should help a lot
in the short term.
On Monday, we’ll make some code modifications for a more complete fix. It may take a little while to get this tested and implemented, but I expect it to be in by Monday afternoon. That should greatly ease what we’re seeing. (PS: Monday here means Monday, Pacific time.)
In some ways, this has crept up on us, as nothing has changed (all our servers are up and running). It’s just that the extended downtime built up a large number of WUs to return, and the server is still trying to catch up. We’ve turned off new assigns from this machine to help it catch up (no new WUs will mean it spends all its time receiving WUs).
With this change, the threads tweak, and the code update coming later today, we should have some more news (hopefully good news) later on Monday.
