Outage postmortem

Summary

Yesterday, from around 1430 UTC the JEMHC application started to experience performance issues causing general non availability of the UI and delays in processing inbound mail and generating outbound mail. Normal service was resumed at around 23:30 UTC.

What happened

Yesterday we had an extended outage resulting the delay of mail being processed in and sent out, the root cause of this was our application not limiting the concurrent handling of incoming webhooks from Jira instances for retry (see Retry policy). The consequence was exhaustion of database connections resulting in general lockup feedback back from the db into other areas of the app, triggering health check fail and node reboot cycles. The UI was impacted because webhooks are currently processed in the same nodes.

Mitigation

Contributing factors were also the the recent onboarding of a few very large volume customers, meaning there was less free processing headroom than before. To improve resilience under load we have doubled the processing capacity of the JEMHC database to handle the doubling of DB load over the last year.

Remediation

In the next few days we will update our inbound webhook handler to prevent similar overload scenarios, in the longer term we plan to decouple webhook processing from the UI entirely.

 

Sorry for any inconvenience caused.