Outage Timeline
2:02 pm: Outage start according to AWS metrics
2:05 pm: First report from customer/sales
2:06 pm: Engineering team begins investigation
2:10 pm: Root cause determined and resolution steps implemented
2:21 pm: System recovered
2:23 pm: Official All Clear announced
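The intervals implied by the timeline above can be worked out with a short script. This is only an illustrative sketch: the timestamps are copied from the timeline, while the dictionary keys and the function name `minutes_between` are hypothetical labels introduced here, not terms from the incident record.

```python
from datetime import datetime

# Timestamps copied from the timeline; the calendar date does not matter
# because every event falls within the same afternoon.
TIMELINE = {
    "outage_start": "2:02 pm",
    "first_report": "2:05 pm",
    "investigation_start": "2:06 pm",
    "root_cause_found": "2:10 pm",
    "recovered": "2:21 pm",
    "all_clear": "2:23 pm",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day clock times like '2:02 pm'."""
    fmt = "%I:%M %p"  # 12-hour clock with an AM/PM marker
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between(TIMELINE["outage_start"], TIMELINE["first_report"])
time_to_recover = minutes_between(TIMELINE["outage_start"], TIMELINE["recovered"])
time_to_all_clear = minutes_between(TIMELINE["outage_start"], TIMELINE["all_clear"])

print(f"Outage start to first report: {time_to_detect} min")
print(f"Outage start to recovery:     {time_to_recover} min")
print(f"Outage start to all clear:    {time_to_all_clear} min")
```

With the times listed above, this gives 3 minutes to the first report, 19 minutes to recovery, and 21 minutes to the all clear.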
Customer Impact
- Customers could no longer log in to the System Surveyor web application
- Customers who were already logged in were unable to save their surveys
- Mobile applications were not affected
- There was no data loss or degraded performance in any of the other systems
- A few customers (fewer than 10) called or chatted in regarding the outage
Root Cause
- Web tasks running in ECS were terminated due to spot instances being reaped by AWS
- New web tasks were stuck in ‘provisioning’ status waiting for new AWS spot requests to be filled
Immediate Actions
- The number of desired web tasks was increased from 4 to 8 to increase the number of open spot requests and the probability of new instances being allocated
- In parallel, the team started modifying the spot request policy and bringing up an on-demand EC2 instance to satisfy the requests
- The on-demand instance was not required since all the spot requests were fulfilled
Resolution Next Steps
- Expand the spot request criteria to include more instance types (c5.xlarge, c5.2xlarge, c6a.xlarge) - DEVOPS-44
- Modify the infrastructure configuration to include 2 on-demand EC2 instances, which are long-lived and never automatically reclaimed by AWS. This will incur an additional cost that can be mitigated by using reserved instances - DEVOPS-45
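One possible shape for the two next steps above is sketched below, under the assumption that the ECS container instances are managed by an EC2 Auto Scaling group with a mixed instances policy. The incident notes do not say which mechanism issues the spot requests, so treat this as an illustration only. The launch template name and the helper `build_mixed_instances_policy` are hypothetical, and the `capacity-optimized` allocation strategy is a choice made for this sketch rather than something taken from the tickets. The script only builds and prints the configuration; it makes no AWS calls.

```python
import json

# Hypothetical name: the incident notes do not name the launch template.
LAUNCH_TEMPLATE_NAME = "web-ecs-launch-template"

# Instance types named in DEVOPS-44.
SPOT_INSTANCE_TYPES = ["c5.xlarge", "c5.2xlarge", "c6a.xlarge"]


def build_mixed_instances_policy(instance_types, on_demand_base):
    """Build a MixedInstancesPolicy structure for an EC2 Auto Scaling group.

    on_demand_base is the number of instances that are always on-demand
    (DEVOPS-45 proposes 2); capacity above that base is requested as spot.
    """
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": LAUNCH_TEMPLATE_NAME,
                "Version": "$Latest",
            },
            # One override per instance type widens the pool of spot
            # capacity that can satisfy a request.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,
            # Everything above the on-demand base is spot.
            "OnDemandPercentageAboveBaseCapacity": 0,
            # Assumption for this sketch: favor pools with spare capacity.
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


policy = build_mixed_instances_policy(SPOT_INSTANCE_TYPES, on_demand_base=2)
print(json.dumps(policy, indent=2))
```

If the group is managed with boto3, a structure like this could be passed as the `MixedInstancesPolicy` argument of the Auto Scaling client's `update_auto_scaling_group` call. An equivalent block in whatever infrastructure-as-code tool the team uses would serve the same purpose.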
Investigation Notes
- We request EC2 "spot" instances, which are instances offered at a specific price target, to fulfill our needs. These requests are made using criteria that include instance types and sizes in addition to the price target.
- These instances have no guaranteed life span: AWS can terminate (reap) them at any time, for example when it needs the capacity back
- In this case, we had 4 open requests that were not fulfilled for over 15 minutes. Then we added 4 more requests, out of which 3 were fulfilled in about 4 minutes.
- The rest of the requests were fulfilled in another 10 minutes