Partial outage due to infrastructure loss
Incident Report for System Surveyor
Postmortem

Outage Timeline

2:02 pm: Outage start according to AWS metrics

2:05 pm: First report from customer/sales

2:06 pm: Engineering team begins investigation

2:10 pm: Root cause determined and resolution steps implemented

2:21 pm: System recovered

2:23 pm: Official All Clear announced
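
The intervals implied by the timeline above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary keys and helper names are hypothetical labels for the timeline entries, and it assumes all timestamps fall on the same day in the same (unstated) time zone.

```python
# Timestamps copied verbatim from the Outage Timeline.
# Assumption: all entries are on the same day and in the same time zone.
TIMELINE = {
    "outage_start": "2:02 pm",
    "first_report": "2:05 pm",
    "investigation_start": "2:06 pm",
    "root_cause": "2:10 pm",
    "recovered": "2:21 pm",
    "all_clear": "2:23 pm",
}


def to_minutes(clock):
    """Convert a 12-hour clock string such as '2:02 pm' to minutes since midnight."""
    time_part, meridiem = clock.split()
    hours, minutes = (int(part) for part in time_part.split(":"))
    # 12 am maps to hour 0 and 12 pm maps to hour 12.
    hours = hours % 12 + (12 if meridiem.lower() == "pm" else 0)
    return hours * 60 + minutes


def minutes_between(start, end):
    """Elapsed minutes between two named timeline entries."""
    return to_minutes(TIMELINE[end]) - to_minutes(TIMELINE[start])


if __name__ == "__main__":
    print("Outage duration:", minutes_between("outage_start", "recovered"), "min")
    print("Time to first report:", minutes_between("outage_start", "first_report"), "min")
    print("Root cause to recovery:", minutes_between("root_cause", "recovered"), "min")
```

By this reading, the outage lasted 19 minutes from the start recorded in AWS metrics to recovery, with 11 of those minutes spent waiting for capacity after the root cause was identified.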

Customer Impact

  • Customers could no longer log in to the System Surveyor web application
  • Customers who were already logged in were unable to save their surveys
  • Mobile applications were not affected
  • There was no data loss or degraded performance in any of the other systems
  • A few customers (fewer than 10) called or chatted in regarding the outage

Root Cause

  • Web tasks running in ECS were terminated due to spot instances being reaped by AWS
  • New web tasks were stuck in ‘provisioning’ status waiting for new AWS spot requests to be filled

Immediate Actions

  • The number of desired web tasks was increased from 4 to 8 to increase the number of open spot requests and the probability of new instances being allocated
  • In parallel, the team began modifying the spot request policy and bringing up an on-demand EC2 instance to satisfy the requests
  • The on-demand instance was not required since all the spot requests were fulfilled

Resolution Next Steps

  • Expand the spot request criteria to include more instance types (c5.xlarge, c5.2xlarge, c6a.xlarge) - DEVOPS-44
  • Modify infrastructure configuration to include 2 on-demand EC2 instances, which are long-lived and never automatically reclaimed by AWS. This will incur an additional cost that can be mitigated by using reserved instances - DEVOPS-45
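
As a rough illustration of the two next steps above, the sketch below builds a configuration that combines a fixed on-demand base with a wider list of instance types. It is a sketch under stated assumptions, not the actual change tracked in DEVOPS-44 or DEVOPS-45: it assumes capacity is managed through an EC2 Auto Scaling group with a mixed instances policy (the report does not say how spot requests are made), the launch template name is a placeholder, and the list contains only the instance types named in this report, not whatever types are already in use.

```python
# Instance types named in DEVOPS-44. The types already in use are not stated
# in the report, so they are not included here.
ADDITIONAL_INSTANCE_TYPES = ["c5.xlarge", "c5.2xlarge", "c6a.xlarge"]

# Number of long-lived on-demand instances proposed in DEVOPS-45.
ON_DEMAND_BASE = 2


def build_mixed_instances_policy(launch_template_name, instance_types, on_demand_base):
    """Build a dict shaped like the EC2 Auto Scaling MixedInstancesPolicy parameter.

    Assumption: the cluster's capacity comes from an Auto Scaling group.
    Nothing is sent to AWS here; the function only assembles the structure.
    """
    if on_demand_base < 0:
        raise ValueError("on_demand_base must not be negative")
    if not instance_types:
        raise ValueError("at least one instance type is required")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # More candidate types give AWS more capacity pools to draw from.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # The first N instances are always on-demand, so they are not reaped.
            "OnDemandBaseCapacity": on_demand_base,
            # Everything above the base stays on spot to limit cost.
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


if __name__ == "__main__":
    # "web-ecs-hosts" is a hypothetical launch template name.
    policy = build_mixed_instances_policy(
        "web-ecs-hosts", ADDITIONAL_INSTANCE_TYPES, ON_DEMAND_BASE
    )
    print(policy["InstancesDistribution"])
```

The on-demand base keeps a minimum number of web tasks schedulable even if every spot instance is reclaimed at once, which is the failure mode described in the Root Cause section.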

Investigation Notes

  • We request EC2 "spot" instances, which are instances offered at a specific price target, to fulfill our capacity needs. These requests use criteria that include instance types and sizes in addition to the price target.
  • These instances have a set life span, at the end of which they are terminated (reaped) by AWS.
  • In this case, we had 4 open requests that were not fulfilled for over 15 minutes. Then we added 4 more requests, out of which 3 were fulfilled in about 4 minutes.
  • The rest of the requests were fulfilled in another 10 minutes.
Posted Nov 09, 2022 - 15:57 CST

Resolved
This was an outage of the website at app.systemsurveyor.com caused by AWS infrastructure being unavailable. All API requests were still processed.
Posted Nov 09, 2022 - 15:16 CST
This incident affected: Application Website.