Topic: Betfair's outage on Saturday and their response this morning.

After Saturday's debacle, with several hours of lost trading and fingernails bitten to the bone whilst waiting for the outcome of my 86 lay bets (exposure of a little over £6500) and my triggers unable to hedge a single one of them, I decided to write a polite but firm letter to Betfair, suggesting that their service was some way short of most people's expectations of a company floated recently and valued at 1.5 billion. This was their response:

We are working as hard as possible to ensure Betfair offers as reliable a site as possible. In a normal week we make at least 15 changes to the Betfair website but we have resolved not to release any new products or features for the next seven days. This should give maximum stability throughout a busy week that includes the Cheltenham Festival, cricket World Cup and Champions League football.

Below is an explanation of what went wrong and what we have done to fix the issue.

When the website failed, our first step was to disable Betfair for all our customers on the web, API and mobile services. Once we identified the actual problem, we determined that we needed our website "available" but with betting disallowed. We recovered the site internally around 18:00GMT and re-enabled betting as of 20:00GMT once we were certain it was stable.

Here is what actually happened:

After performing certain types of website changes, an issue developed that caused our servers to temporarily slow down, processing just one thing at a time (single threading) instead of thousands of user requests in parallel. This "single threading" behaviour was introduced some time ago to protect against occasional broken pages caused by serving content while it is changing. In tech speak, our servers weren't thread-safe on certain types of content changes.

This has been an operational concern for several weeks as our traffic has reached record volumes week after week. While we had several operational protections in place to limit these types of changes during peak load, we missed an important one. Every 15 minutes, an automated process was publishing exactly the type of content that triggers the issue described above. Yesterday we hit a tipping point as the web servers reached a point where it was taking longer than 15 minutes to complete their update - essentially rendering the servers unusable.

Then in an attempt to quickly shed load, we triggered a process to disable some of the computationally intensive features on the site. Unfortunately, the way this was done triggered a complete recompile of every page on our site, for every user, in every locale. Under our normal weekend usage, recovery took several hours.

After spotting the pattern, we've recognised this has been going on with varying impact since February 8, 2011. During periods of increased user traffic, our customers would experience this issue in the form of slow navigation or a "sticky" user experience. Yesterday was simply a tipping point, made worse by our recovery attempt.

We've fixed this problem now. We've disabled the original automated job and rebuilt it to update content safely. We've tripled the capacity of our web server farm to spread our load even more thinly. We've fixed our process for disabling features so that we won't make things worse. We've updated our operational processes and introduced a whole new raft of monitoring to spot this type of issue. We've also isolated the underlying web server issue so that we can change our content at will without triggering the switch to single-threading.

We believe these changes will bring the stability we all desire and thank you for your continued custom.

This time next year, we will all be paying Betfair premium charge commission rates!

Re: Betfair's outage on Saturday and their response this morning.
« Reply #1 on: Mon, 14 March, 2011, 13:57 »
Since February, haha! How many years of instability have we witnessed?

Re: Betfair's outage on Saturday and their response this morning.
« Reply #2 on: Tue, 15 March, 2011, 07:46 »
At least they gave an answer.
Fortune favors the brave!