Payment terminals are having connectivity issues to the payment backend
Incident Report for Npay
Postmortem

Summary

On Sunday 2024-01-14, starting at 02:30 EET (GMT+2) and continuing until 13:02 EET, a disruption occurred in our systems which left many of our terminals unable to connect to our payment gateway. This prevented the affected terminals from processing payments for the duration of the incident.

Upon investigation, the trigger for the incident was found to be the weekly restart of the gateways (which occurs at 02:30, the start time of the incident), which produced a surge of new connection attempts. For a reason not yet known, the new connections failed to be established within the allotted time, causing the terminals to retry connection establishment. This produced another flood of connection attempts, which eventually exhausted the memory of the servers, causing them to crash and be replaced, after which the cycle repeated.

We sincerely apologize for any inconvenience this disruption has caused and assure you that steps have been taken and are being taken to prevent the recurrence of such issues in the future.

Leadup

At 03:45 we received an alarm indicating that unusually high log traffic was slowing down the monitoring of our systems. After noting that no other alarms were active and that there were no apparent problems in the low volume of payment processing typical of a Sunday night, the alarm was dismissed as a false positive.

Later that morning, at 11:41, we became aware that effectively all of our Castles terminals were unable to process payment authorizations, and at 11:45 we started our incident response.

Fault

We quickly found out that, starting from its weekly restart at 02:30, the payment gateway which our terminals use to communicate with our backend had been under heavy load from repeated terminal connection attempts and was constantly crashing due to the increased memory load. Each crash caused all previously connected terminals to try to connect yet again, producing another burst of strain, another crash, and so on.
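The cycle was driven by all terminals retrying at the same moment. As an illustration only (this is not Npay's actual terminal software; the host name and function names below are hypothetical), here is a minimal Python sketch of client-side retry logic that spreads reconnection attempts out using exponential backoff with random jitter, one common way to avoid this kind of reconnection storm:

    # Illustrative only: client-side reconnection with exponential backoff and
    # random jitter, so that a fleet of terminals does not retry in lockstep.
    # GATEWAY_HOST, GATEWAY_PORT and connect_with_backoff are hypothetical names.
    import random
    import socket
    import time

    GATEWAY_HOST = "gateway.example.com"  # placeholder address
    GATEWAY_PORT = 443

    def connect_with_backoff(max_attempts=8, base_delay=1.0, max_delay=120.0):
        """Open a TCP connection, waiting longer (with jitter) after each failure."""
        for attempt in range(max_attempts):
            try:
                return socket.create_connection((GATEWAY_HOST, GATEWAY_PORT), timeout=10)
            except OSError:
                # Exponential backoff capped at max_delay, randomized so that
                # terminals that failed together do not all retry together.
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, delay))
        raise ConnectionError(f"gateway unreachable after {max_attempts} attempts")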

Impact

All the Castles terminals in our fleet were unable to connect to our payment gateway between 02:30 and 13:02, which left them unable to process transactions.

Response

The incident response started at 11:45, after we became aware of the problem.

Recovery

After isolating the problem to excessive strain on our payment gateway, we began implementing an emergency change that increased the capacity of the servers to handle the load, and at 13:02 we finished restarting the servers with the new configuration. After this the servers started accepting connections normally, and transaction volumes were rapidly restored towards expected levels.

Timeline

Earlier in the week:

A necessary security update was deployed to all Castles terminals.

2024-01-14:

  • 02:30: Weekly payment gateway restart. Connection failures begin.
  • 03:45: Oncall administrator woken up by alarms about unusual log volumes. Problem dismissed as a false positive due to no other problems being apparent.
  • 11:41: Oncall administrator informed of a widespread problem in payment authorization processing.
  • 11:45: Incident response started by oncall administrator.
  • 12:35: Problems with Castles terminals identified.
  • 12:41: Payment gateway crashing problem identified.
  • 12:49: Incident opened on Statuspage.
  • 12:50: Payment gateway memory increase emergency change started.
  • 13:02: Payment gateway restarts fully complete, authorization volumes start to recover.
  • 13:29: Situation normal again. Immediate incident response ended.

Root cause

In our analysis of the incident, a definitive root cause for the change in behaviour around the weekly restart has not yet been identified. We are following up on the leads below in the coming weeks:

  • The payment gateway came under unexpectedly heavy strain (inconsistent with the normal increase in terminals connecting to our systems) from all the terminals reconnecting after the weekly restart. Instead of a single reconnection attempt, the terminals attempted to re-establish their connections several times simultaneously, which caused a massive spike in the servers' resource usage, leading to a crash and restart. The same problem repeated after each restart, so the crash cycle carried on until we noticed and fixed the problem.
  • We implemented a necessary security update on several of the terminals connecting to the system in the preceding week, and it is possible that the repeated connection attempts (and thus the server crashes and restarts) are a side effect of the update.

Lessons learned

This incident highlighted a blind spot in our monitoring: we are not monitoring transaction volumes closely enough to be alerted to unusually low volumes at times when they should not be low. This is understandable at night (when we were originally alerted to unusual activity), when transaction volumes are naturally low and even a large disruption is indistinguishable from normal nocturnal activity. However, the problem should have become clear in the morning at the latest, when transaction volumes are expected to pick up as people become more active.

Furthermore, we have learned that we need to monitor more closely for secondary signs of incidents in our critical systems, such as services restarting unusually often. Had we spotted the repeated server restarts already during the night, we could have taken corrective action then, and the impact of this incident would have been limited to a few hours of overnight downtime.
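As an illustration of the kind of check described above, here is a minimal Python sketch that flags both unusually low authorization volumes and unusually frequent gateway restarts. It is hypothetical: the data sources (get_auth_count, get_restart_count, expected_baseline) and the thresholds are placeholders, not our production monitoring.

    # Illustrative only: a periodic health check that flags authorization volumes
    # far below the expected baseline for the hour, and gateway restarts happening
    # more often than expected. The data-source callables are hypothetical.
    import datetime

    LOW_VOLUME_RATIO = 0.3      # alert if volume falls below 30% of baseline
    MAX_RESTARTS_PER_HOUR = 2   # alert if the gateway restarts more often than this

    def check_health(get_auth_count, get_restart_count, expected_baseline):
        now = datetime.datetime.now()
        alerts = []

        observed = get_auth_count(window_minutes=60)
        expected = expected_baseline(hour=now.hour, weekday=now.weekday())
        if expected > 0 and observed < expected * LOW_VOLUME_RATIO:
            alerts.append(f"authorization volume {observed} far below expected {expected}")

        restarts = get_restart_count(window_minutes=60)
        if restarts > MAX_RESTARTS_PER_HOUR:
            alerts.append(f"payment gateway restarted {restarts} times in the last hour")

        return alerts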

Corrective actions

As a result of our analysis of how this downtime progressed, we have already implemented changes to prevent it from happening in the future, and will continue to implement the remaining items:

  • We are rewriting the connection gateways for more predictable behaviour under load. The new gateways are already in use for a portion of the terminals and will handle all terminal traffic in the coming weeks. We have added extensive load tests for these new gateways covering this specific situation (a simplified sketch of such a test follows this list).
  • We are improving our internal alerts to make us more aware of potential disruptions in transaction authorization volumes, which typically dip during incidents such as this one.
  • We are improving the monitoring of secondary incident characteristics (such as frequent restarts) of the payment gateway and other systems which may fail similarly.
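The following is a simplified sketch of the kind of reconnection-storm load test referred to in the first item above. It is hypothetical: the target address, fleet size, and retry pattern are illustrative placeholders, not our actual test suite.

    # Illustrative only: open many client connections at the same moment to mimic
    # a fleet-wide reconnect after a gateway restart.
    import concurrent.futures
    import socket

    TARGET = ("gateway-staging.example.com", 443)  # placeholder staging endpoint
    CLIENTS = 500                                  # size of the simulated fleet

    def reconnect_once(client_id):
        # One simulated terminal: connect, retrying immediately on failure,
        # mimicking the aggressive retry pattern seen during the incident.
        for _ in range(3):
            try:
                with socket.create_connection(TARGET, timeout=5):
                    return True
            except OSError:
                continue
        return False

    def run_storm():
        with concurrent.futures.ThreadPoolExecutor(max_workers=CLIENTS) as pool:
            results = list(pool.map(reconnect_once, range(CLIENTS)))
        print(f"{sum(results)}/{CLIENTS} simulated terminals connected")

    if __name__ == "__main__":
        run_storm()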

We are confident that these actions will improve our availability and significantly reduce the likelihood of similar large-scale disruptions.

Posted Mar 15, 2024 - 14:15 EET

Resolved
We are no longer seeing problems in the field. A connection issue affecting V3M2 and V3P3 terminals started at 02:30 EET. However, it initially appeared as a logging issue rather than a connectivity issue, so a full investigation was not started. At 11:30 EET it became clear that there was a wider connectivity issue, and an investigation was launched. A fix was deployed at 13:00 EET, after which the terminal connections returned to normal. We sincerely apologize for this event and will publish a more detailed post-mortem after the full investigation has been completed.
Posted Jan 14, 2024 - 15:12 EET
Monitoring
The fix seems to have worked, but we will keep checking for any remaining problems. More info to follow shortly.
Posted Jan 14, 2024 - 13:29 EET
Identified
A fix has been attempted and initial results look positive, but we are continuing to investigate.
Posted Jan 14, 2024 - 13:16 EET
Investigating
We are currently investigating an issue where payment terminals cannot connect to the payment backend.
Posted Jan 14, 2024 - 12:50 EET
This incident affected: Payment Backend and CloudPOS.