Reply from Betfair re their latest outage below .................
We are working hard to ensure Betfair offers as reliable a site as possible. In a normal week we make at least 15 changes to the Betfair website, but we have resolved not to release any new products or features for the next seven days. This should give maximum stability throughout a busy week that includes the Cheltenham Festival, the Cricket World Cup and Champions League football.
Below is an explanation of what went wrong and what we have done to fix the issue.
When the website failed, our first step was to disable Betfair for all our customers on the web, API and mobile services. Once we identified the actual problem, we determined that we needed our website "available" but with betting disallowed. We recovered the site internally around 18:00 GMT and re-enabled betting as of 20:00 GMT once we were certain it was stable.
Here is what actually happened:
After performing certain types of website changes, an issue developed that caused our servers to temporarily slow down, processing just one thing at a time (single threading) instead of thousands of user requests in parallel. This "single threading" behaviour was introduced some time ago to protect against occasional broken pages caused by serving content while it is changing. In tech speak, our servers weren't thread-safe on certain types of content changes.
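To illustrate the behaviour described above, here is a minimal sketch of content that is unsafe to read while it is being rewritten, protected by one global lock. The names (ContentStore, publish, serve) are hypothetical, not Betfair's actual code; the point is that when every read takes the same lock as the writer, requests are processed one at a time.

```python
import threading

class ContentStore:
    def __init__(self):
        self._lock = threading.Lock()    # one lock guards the whole store
        self._pages = {"home": "v1"}

    def publish(self, page, body):
        # Writers hold the global lock so readers never see a half-written page.
        with self._lock:
            self._pages[page] = body

    def serve(self, page):
        # Because the store is not thread-safe during updates, every read also
        # takes the same lock - requests are serialised ("single threading")
        # whenever content is changing.
        with self._lock:
            return self._pages[page]
```

A thread-safe design (for example, swapping in a fully built copy of the content atomically) would let reads proceed in parallel, which is effectively what the fix at the end of this note describes.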
This has been an operational concern for several weeks as our traffic has reached record volumes week after week. While we had several operational protections in place to limit these types of changes during peak load, we missed an important one. Every 15 minutes, an automated process was publishing exactly the type of content that triggers the issue described above. Yesterday we hit a tipping point: the web servers were taking longer than 15 minutes to complete each update, so a new update began before the previous one had finished - essentially rendering the servers unusable.
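The arithmetic of that tipping point can be sketched as a simple queueing model (my own illustration, not Betfair's numbers beyond the 15-minute interval): while each update finishes within the publishing interval the backlog stays at zero, but once an update takes longer than the interval, unfinished work accumulates without bound.

```python
def backlog_after(n_cycles, publish_interval_min, update_duration_min):
    """Minutes of unfinished update work after n publishing cycles."""
    backlog = 0.0
    for _ in range(n_cycles):
        backlog += update_duration_min                       # new update queued
        backlog = max(0.0, backlog - publish_interval_min)   # work completed
    return backlog

# With a 15-minute interval: 10-minute updates never back up,
# but 20-minute updates fall a further 5 minutes behind every cycle.
```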
Then, in an attempt to quickly shed load, we triggered a process to disable some of the computationally intensive features on the site. Unfortunately, the way this was done triggered a complete recompile of every page on our site, for every user, in every locale. Even under our normal weekend load, recovery took several hours.
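One plausible mechanism for that recompile storm, sketched here purely as an assumption (the function and key names are hypothetical): if feature flags are baked into every page's cache key, flipping a single flag invalidates the entire cache at once, and every page must be rebuilt for every user and locale on its next request.

```python
cache = {}

def cache_key(page, locale, flags):
    # The flag settings are part of the key, so any flag change
    # makes every previously cached entry unreachable.
    return (page, locale, tuple(sorted(flags.items())))

def render(page, locale, flags):
    key = cache_key(page, locale, flags)
    if key not in cache:
        # Expensive: stands in for a full page compile.
        cache[key] = f"compiled {page}/{locale} flags={flags}"
    return cache[key]
```

A design that keys the cache per feature, rather than on the whole flag set, would invalidate only the pages a disabled feature actually touches.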
After spotting the pattern, we recognised that this had been going on with varying impact since February 8, 2011. During periods of increased user traffic, our customers would experience this issue in the form of slow navigation or a "sticky" user experience. Yesterday was simply a tipping point, made worse by our recovery attempt.
We've fixed this problem now. We've disabled the original automated job and rebuilt it to update content safely. We've tripled the capacity of our web server farm to spread our load even more thinly. We've fixed our process for disabling features so that we won't make things worse. We've updated our operational processes and introduced a whole new raft of monitoring to spot this type of issue. We've also isolated the underlying web server issue so that we can change our content at will without triggering the switch to single-threading.
We believe these changes will bring the stability we all desire and thank you for your continued custom.