Partial outage due to infrastructure loss

Incident Report for System Surveyor

Postmortem

2:02 pm: Outage start according to AWS metrics

2:05 pm: First report from customer/sales

2:06 pm: Engineering team begins investigation

2:10 pm: Root cause determined and resolution steps implemented

2:21 pm: System recovered

2:23 pm: Official All Clear announced
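The intervals implied by the timeline above can be checked with a little arithmetic. This is a minimal sketch that only reuses the timestamps listed in this report; the dictionary keys and the helper name `minutes_between` are illustrative labels, not part of any tooling used during the incident.

```python
from datetime import datetime

# Timestamps copied from the timeline above (same day, local time).
TIMELINE = {
    "outage_start": "2:02 pm",
    "first_report": "2:05 pm",
    "investigation": "2:06 pm",
    "root_cause": "2:10 pm",
    "recovered": "2:21 pm",
    "all_clear": "2:23 pm",
}


def minutes_between(start_key, end_key):
    """Return the whole minutes elapsed between two timeline entries."""
    fmt = "%I:%M %p"  # 12-hour clock with am/pm marker
    start = datetime.strptime(TIMELINE[start_key], fmt)
    end = datetime.strptime(TIMELINE[end_key], fmt)
    return int((end - start).total_seconds() // 60)


if __name__ == "__main__":
    for label, end_key in [
        ("time to first report", "first_report"),
        ("time to root cause", "root_cause"),
        ("time to recovery", "recovered"),
        ("time to all clear", "all_clear"),
    ]:
        print(f"{label}: {minutes_between('outage_start', end_key)} min")
```

By this reckoning the website was unavailable for 19 minutes, and the root cause was identified 8 minutes after the outage began.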

Web tasks running in ECS were terminated due to spot instances being reaped by AWS
New web tasks were stuck in ‘provisioning’ status waiting for new AWS spot requests to be filled

The number of desired web tasks was increased from 4 to 8 to increase the number of open spot requests and the probability of new instances being allocated
In parallel, the team began modifying the spot request policy and bringing up an on-demand EC2 instance to satisfy the requests
The on-demand instance was not required since all the spot requests were fulfilled
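The first step above, raising the desired task count from 4 to 8, could be scripted roughly as follows. This is a sketch under assumptions: the report does not say how the change was made, and the cluster and service names in the usage comment are hypothetical. The helper takes any client object exposing the `update_service` call of the boto3 ECS client, so it can be tried out without AWS credentials.

```python
def scale_web_service(ecs_client, cluster, service, desired_count):
    """Ask ECS to run `desired_count` copies of the service's task.

    `ecs_client` is expected to behave like boto3.client("ecs").
    Returns the desired count that ECS reports back for the service.
    """
    response = ecs_client.update_service(
        cluster=cluster,
        service=service,
        desiredCount=desired_count,
    )
    return response["service"]["desiredCount"]


# Hypothetical usage with real AWS credentials (names are placeholders):
#   import boto3
#   scale_web_service(boto3.client("ecs"), "example-cluster", "web", 8)
```

Raising the desired count does not create capacity by itself; in this incident it helped because each pending task led to an additional open spot request.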

Expand the spot request criteria to include more instance types (c5.xlarge, c5.2xlarge, c6a.xlarge) - DEVOPS-44
Modify infrastructure configuration to include 2 on-demand EC2 instances, which are long-lived and never automatically reclaimed by AWS. This will incur an additional cost that can be mitigated by using reserved instances - DEVOPS-45
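One way the two action items above might be expressed together is a mixed instances policy on an EC2 Auto Scaling group. This is an assumption for illustration only: the report does not state how the spot requests are managed, and the launch template name `web-ecs-hosts` and the helper name are hypothetical. The sketch only builds the policy dictionary, using the field names of the Auto Scaling `MixedInstancesPolicy` structure.

```python
def build_mixed_instances_policy(launch_template_name, instance_types, on_demand_base):
    """Build a MixedInstancesPolicy dict for an EC2 Auto Scaling group.

    The first `on_demand_base` instances are on-demand; everything above
    that base is requested as spot across all listed instance types.
    """
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # One override per instance type widens the pool of spot capacity.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": 0,  # remainder is all spot
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


# Instance types from DEVOPS-44, on-demand base from DEVOPS-45.
policy = build_mixed_instances_policy(
    "web-ecs-hosts",  # hypothetical launch template name
    ["c5.xlarge", "c5.2xlarge", "c6a.xlarge"],
    2,
)
```

A dictionary of this shape could be passed as the `MixedInstancesPolicy` argument when creating or updating an Auto Scaling group with boto3; whether that matches the actual infrastructure here is not known from the report.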

We request EC2 "spot" instances, which are instances offered at a specific price target, to fulfill our capacity needs. These requests are made using criteria that include instance types and sizes in addition to the price target.
These instances have a set life span, at the end of which they are terminated (reaped) by AWS.
In this case, we had 4 open requests that were not fulfilled for over 15 minutes. We then added 4 more requests, 3 of which were fulfilled in about 4 minutes.
The remaining requests were fulfilled in another 10 minutes.

Posted Nov 09, 2022 - 15:57 CST

Resolved

This was an outage of the website at app.systemsurveyor.com caused by AWS infrastructure being unavailable. All API requests were still processed.

Posted Nov 09, 2022 - 15:16 CST

This incident affected: Application Website.