Amazon Web Services (AWS) outage: an update from our SRE team

As you may have seen in the news last week, AWS experienced problems with part of the infrastructure behind its IaaS (Infrastructure as a Service) platform. Although IDBS was only minimally affected, we want to share what we have learnt from the outage and the steps we are taking to improve our communication with our SaaS customers.

At around 18:50 GMT on Tuesday 28th Feb 2017, IDBS’ Site Reliability Engineering (SRE) team received email and text message alerts about increased error rates with E-WorkBook Connect. The alerts came from New Relic (our application performance monitoring system) via VictorOps (our on-call management, incident notification and live infrastructure timeline system) and indicated that more than 5% of requests were failing. SRE acknowledged the alert and started to investigate. The investigation found that AWS was experiencing an S3 outage, which was causing some requests to fail.
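
For readers curious about how an error-rate alert of this kind works, the short Python sketch below shows the basic idea: count failed requests over a time window and raise an alert when they exceed 5% of the total. This is an illustration only and not our New Relic configuration; the function names, the zero-traffic behaviour and the sample figures are assumptions made for the example.

```python
# Illustrative sketch only: not the actual New Relic alert condition.
# All names and sample numbers below are hypothetical.

ERROR_RATE_THRESHOLD = 0.05  # the "more than 5% of requests failing" level


def error_rate(failed: int, total: int) -> float:
    """Return the fraction of requests that failed in a time window."""
    if total == 0:
        # Assumption: a window with no traffic counts as healthy.
        return 0.0
    return failed / total


def should_alert(failed: int, total: int,
                 threshold: float = ERROR_RATE_THRESHOLD) -> bool:
    """Return True when the failure fraction is strictly above the threshold."""
    return error_rate(failed, total) > threshold


if __name__ == "__main__":
    # Hypothetical window: 1,200 requests, 90 of which failed (7.5%).
    print(should_alert(failed=90, total=1200))
```

In this sketch, a window with exactly 5% failures does not trigger an alert, because the comparison is "more than" rather than "at least".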

E-WorkBook Connect was the only product in our SaaS offering that was affected. The rest of the E-WorkBook Cloud suite reported no errors, and our monitoring showed all other systems performing as expected.

The SRE team escalated internally to our helpdesk, management and Global Professional Services (GPS) team. The GPS team provided an additional check of all systems in US-EAST-1 (the affected AWS region) and confirmed performance was as expected.

For the rest of the evening, SRE monitored our systems as the issue escalated within AWS and additional AWS services started to fail. There were no further application alerts during this period and E-WorkBook performance remained stable. E-WorkBook Connect subsequently reported recovery and the alert was marked as resolved.

Our SRE team closed the incident at 06:30 GMT on Wednesday 1st March, once AWS had reported all systems as healthy and recovered. Since then, IDBS have run an internal session to discuss what we have learnt from the AWS outage in US-EAST-1. As a result, we have refined a number of our internal processes and communication protocols, which we will be rolling out over the coming months. These changes will improve our service and the communication our customers receive should anything like this occur in the future.

The IDBS E-WorkBook Cloud maintained its SLA throughout, despite the challenges experienced by one of our infrastructure partners.

If you’d like to read more about the AWS outage, please visit the link below.

https://aws.amazon.com/message/41926/