Reports of System Slowness
Incident Report for Flight Schedule Pro
Postmortem

From approximately 4:15 PM to 5:30 PM CST on Sunday 5/12/2019, we received reports of system slowness. We apologize for any inconvenience this may have caused your operation. We're committed to providing exceptional service and take these issues seriously.

Root Cause

Our engineers traced the issue to a communication library that the API uses to talk to a Redis caching server. One of the API servers began receiving timeouts when reading data from and writing data to the Redis server. Connections routed to that API server became slow and then failed, which caused the system slowness. On investigation, the engineers found that a known issue in the communication library was responsible for the connection and timeout problems. The library's owner had already released a fix, but it was included in a newer version than the one installed on the API servers.

Resolution

Our engineers updated the communication library to the current version, which includes the fix, then tested the new version and deployed it to the API servers.

Going Forward

To prevent this from happening again, and to reduce the impact if a similar failure does occur, we've made the following changes:

1. Implemented defensive coding techniques so that requests can continue to execute if this specific failure type happens again.

2. Implemented monitoring that allows us to see the health of the API to Redis connections in real time from each server.

3. Began looking into better ways to monitor when updates are released for certain libraries that are part of the Flight Schedule Pro system.
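
As an illustration only, items 1 and 2 can be sketched in Python roughly as follows. This is not the Flight Schedule Pro code. The names Cache, CacheHealth and read_through are hypothetical, and the sketch assumes the cache client signals failures with Python's built-in TimeoutError or ConnectionError; a real Redis client library may raise its own exception types instead. The idea is that a cache timeout falls back to the primary data source rather than failing the request, and that each server keeps a small health record that a monitor could read.

```python
import time
from typing import Callable, Optional, Protocol


class Cache(Protocol):
    """Hypothetical minimal cache interface, not a real client's API."""

    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


class CacheHealth:
    """Per-server record of cache failures that a monitor could poll."""

    def __init__(self, failure_threshold: int = 3,
                 cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._skip_until = 0.0
        self.consecutive_failures = 0

    def record_success(self) -> None:
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self._threshold:
            # Stop calling the cache for a while so requests do not
            # each wait on another timeout.
            self._skip_until = self._clock() + self._cooldown

    def should_skip_cache(self) -> bool:
        return self._clock() < self._skip_until

    def snapshot(self) -> dict:
        """Values a health endpoint or dashboard could report."""
        return {
            "consecutive_failures": self.consecutive_failures,
            "cache_bypassed": self.should_skip_cache(),
        }


def read_through(cache: Cache, health: CacheHealth, key: str,
                 load_from_source: Callable[[str], str]) -> str:
    """Return the value for key, treating the cache as optional."""
    if not health.should_skip_cache():
        try:
            cached = cache.get(key)
            health.record_success()
            if cached is not None:
                return cached
        except (TimeoutError, ConnectionError):
            health.record_failure()

    # Cache miss or cache failure: use the primary data source.
    value = load_from_source(key)

    if not health.should_skip_cache():
        try:
            cache.set(key, value)
        except (TimeoutError, ConnectionError):
            health.record_failure()
    return value
```

With this shape, a cache outage makes requests slower because they go to the primary source, but it does not make them fail, and the snapshot values show per server when the cache is being bypassed.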

Again, we apologize for any inconvenience this may have caused your operation. We're committed to providing exceptional service and take these issues seriously. We believe the above action items will prevent this issue from occurring in the future. Please let us know if you have any questions.

Posted May 17, 2019 - 13:06 CDT

Resolved
From approximately 4:15 PM to 5:30 PM CST on Sunday 5/12/2019, we received reports of system slowness. Our staff monitored the situation until full performance was restored. A full update with more detail will be available within 72 hours.
Posted May 12, 2019 - 18:59 CDT
This incident affected: Scheduling Hub, Training Hub and Billing Hub (Merchant Services (Credit Card Processing), Billing & POS).