On Sunday 2024-01-14, from 02:30 EET (GMT+2) until 13:02, a disruption in our systems left many of our terminals unable to connect to our payment gateway. This prevented the affected terminals from processing payments for the duration of the incident.
Upon investigation, the trigger for the incident was traced to the weekly restart of the gateways (which occurs at 02:30, the start time of the incident), which produces a surge of new connection attempts. For a reason not yet identified, these connections failed to be established within the allotted time, causing the terminals to retry. Each retry wave produced another flood of connection attempts, which eventually exhausted the servers' memory, causing them to crash and be replaced, at which point the cycle repeated.
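A common client-side mitigation for this kind of reconnection storm is exponential backoff with jitter, so that retries spread out over time instead of arriving in synchronized bursts. A minimal sketch of the idea (the function name and parameters are illustrative, not our actual terminal firmware):

```python
import random

def backoff_delay(attempt, base=2.0, cap=300.0):
    """Return seconds to wait before reconnection attempt `attempt` (0-based).

    "Full jitter" backoff: pick uniformly in [0, min(cap, base * 2**attempt)],
    so a fleet of terminals that disconnected together does not retry in
    lockstep. The ceiling doubles each attempt: 2 s, 4 s, 8 s, ... up to 5 min.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

# Example: delays for the first ten retries of one terminal.
delays = [backoff_delay(n) for n in range(10)]
```

With all terminals retrying at randomized, growing intervals, a gateway restart produces a tapering trickle of reconnections rather than a repeating flood.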
We sincerely apologize for any inconvenience this disruption has caused, and we assure you that steps have been taken, and continue to be taken, to prevent a recurrence of such issues.
At 03:45 we received an alarm indicating that unusually high log traffic was slowing down the monitoring of our systems. After noting that no other alarms were active, and that there were no apparent problems in the low payment volumes typical of a Sunday night, the alarm was dismissed as a false positive.
Later that morning, at 11:45, we became aware that effectively all of our Castles terminals were unable to process payment authorizations, and we started our incident response.
We quickly found that, starting from its weekly restart at 02:30, the payment gateway through which our terminals communicate with our backend had been under heavy load from repeated terminal connection attempts and was crashing repeatedly due to the resulting memory pressure. Each crash caused all previously connected terminals to attempt to reconnect, which produced another burst of load, another crash, and so on.
All the Castles terminals in our fleet were unable to connect to our payment gateway between 02:30 and 13:02, which left them unable to process transactions.
The incident response started at 11:45, after we became aware of the problem.
After isolating the problem to excessive load on our payment gateway, we implemented an emergency change that increased the servers' capacity to handle the connection load, and at 13:02 we finished restarting the servers with the new configuration. The servers then started accepting connections normally, and transaction volumes rapidly recovered towards expected levels.
A necessary security update was deployed to all Castles terminals.
In the analysis of the incident, a definitive root cause for the change in behaviour around the weekly restart has yet to be identified, and we are following up on the open questions in the coming weeks:
This incident highlighted a blind spot in our monitoring: we do not monitor transaction volumes closely enough to be alerted when volumes are unusually low at times when they should not be. This is understandable at night (when we were originally alerted to unusual activity), since transaction volumes are naturally low then and even a large disruption is indistinguishable from normal nocturnal activity. In the morning, however, when transaction volumes are expected to pick up as people become active, the problem should have become obvious.
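The volume check described above can be as simple as comparing each hour's observed transaction count against a time-of-day baseline, while skipping hours too quiet to judge. A simplified sketch, assuming hourly baseline counts are available (the names and thresholds here are illustrative, not our production configuration):

```python
def volume_alert(observed, expected, ratio_threshold=0.5, min_expected=100):
    """Flag an hour whose volume fell below `ratio_threshold` of its baseline.

    `min_expected` skips quiet nighttime hours, where even a total outage is
    statistically indistinguishable from normal low traffic.
    """
    if expected < min_expected:
        return False  # too quiet to judge reliably
    return observed < ratio_threshold * expected

# Night: 3 tx/h observed vs 10 expected -> no alert (hour is below min_expected).
# Morning: 40 tx/h observed vs 800 expected -> alert.
night_alert = volume_alert(3, 10)
morning_alert = volume_alert(40, 800)
```

This matches the timeline above: a nighttime check stays silent by design, but a morning check would have fired hours before a human noticed.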
Furthermore, we have learned that we need to monitor more closely for secondary signs of incidents in our critical systems, such as servers suddenly restarting too often. Had we spotted the repeated server restarts during the night, we could have taken corrective action then, limiting the impact of this incident to a few hours of nighttime downtime.
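Restart-frequency monitoring of the kind described can be as simple as counting process starts in a sliding time window. A sketch, assuming restart timestamps are available from logs or the orchestrator (class and parameter names are illustrative):

```python
from collections import deque

class RestartMonitor:
    """Alert when more than `max_restarts` occur within `window_s` seconds."""

    def __init__(self, max_restarts=3, window_s=3600):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.events = deque()

    def record(self, ts):
        """Record a restart at Unix timestamp `ts`; return True if alerting."""
        self.events.append(ts)
        # Drop restarts that have aged out of the sliding window.
        while self.events and self.events[0] <= ts - self.window_s:
            self.events.popleft()
        return len(self.events) > self.max_restarts

m = RestartMonitor()
# Four restarts within one hour -> the fourth one trips the alert.
results = [m.record(t) for t in (0, 600, 1200, 1800)]
```

A crash loop like the one in this incident, with servers crashing and being replaced every few minutes, would cross such a threshold within the first hour.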
To improve our processes based on our analysis of how this incident progressed, we have already implemented changes to prevent this from happening in the future, and we will continue to implement the remaining items:
We are confident that these actions will improve our availability and significantly reduce the likelihood of similar large-scale disruptions.