Castles V3M2 and V3P3 terminals unavailable for purchases
Incident Report for Npay
Postmortem

Summary

On Saturday, 12 August 2023, between 13:05 and 14:15 EEST (GMT+3), a disruption affected the connectivity of all Castles V3M2 and V3P3 terminals to the payment gateway, preventing users from making purchases on these terminals during this window.

Payment terminals normally maintain a continuous connection to the payment gateway and automatically establish a new connection after a network reset. Upon investigation, the root cause of the issue was traced to the terminals being unable to establish new connections to the payment gateway. Notably, some terminals had already been encountering connection problems for up to 7 hours before the time window above, whenever they attempted to reconnect.
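For illustration only, the reconnect behaviour described above can be pictured as a simple retry loop. The sketch below is a hypothetical Python approximation; the gateway address, port, and back-off parameters are placeholders, not the terminals' actual firmware logic.

```python
import socket
import ssl
import time

# Placeholder endpoint; the real gateway address is not part of this report.
GATEWAY_HOST = "gateway.example.com"
GATEWAY_PORT = 443


def connect_with_retry(max_delay: float = 300.0) -> ssl.SSLSocket:
    """Keep trying to open a TLS connection to the gateway, backing off between attempts."""
    context = ssl.create_default_context()
    delay = 1.0
    while True:
        try:
            raw = socket.create_connection((GATEWAY_HOST, GATEWAY_PORT), timeout=10)
            try:
                return context.wrap_socket(raw, server_hostname=GATEWAY_HOST)
            except ssl.SSLError:
                # An invalid (e.g. expired) gateway certificate fails the handshake,
                # so the terminal stays offline and keeps retrying indefinitely.
                raw.close()
                raise
        except OSError as exc:  # covers both socket errors and ssl.SSLError
            print(f"Connection failed: {exc}; retrying in {delay:.0f}s")
            time.sleep(delay)
            delay = min(delay * 2, max_delay)
```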

We sincerely apologize for any inconvenience this disruption has caused and assure you that steps are being taken to prevent the recurrence of such issues in the future.

Leadup

We received sporadic reports of individual terminals being unable to connect at around 09:00 EEST on Saturday. Nothing out of the ordinary was detected, so these were attributed to possible problems with the 3G network. By around 11:20 EEST, enough terminals had been reported with connectivity problems that an incident investigation was initiated.

Fault

A number of terminals across the estate were unable to connect to the payment gateway. These terminals appeared to have suddenly lost connectivity and gone completely silent. Going through the problem reports clarified that both wired and wireless terminals were affected and that the issue was not related to an ongoing firmware update for the terminals. At the same time, a previously unseen problem made one of the monitoring systems inaccessible, and a manual recovery was required to restore access to it. Since no explanation could be found for why some of the terminals could not connect, and the manual recovery had worked on the monitoring system, it was decided to try the same manual recovery on the terminal connections.

Running this manual recovery exacerbated the incident, taking it from affecting a small number of terminals to affecting all Vega3000 (V3M2 and V3P3) terminals. With so many terminals affected at once, it became clear that the issue was systemic, and the root cause was found shortly afterwards. Recovery was started immediately and completed as rapidly as possible. Once the fix was in place, all terminals with connectivity problems came back online and normal operation was restored.

Impact

The incident impacted only Castles V3M2 and V3P3 terminals, because these terminals used a new payment gateway connection address while all other terminals used the old one. Some terminals had been unable to connect since Saturday morning, and from 13:07 to 14:12 EEST the connectivity issue affected all Castles V3M2 and V3P3 terminals. During this period, no payments could be completed on the affected terminals, which displayed an error message about being unable to connect to the payment gateway.

Response

The incident response started on Saturday at 11:20 EEST and continued until the issue was fully resolved. A second on-call person was brought in at 13:35 to assist in the recovery.

Recovery

After the root cause was discovered, recovery actions were started immediately. Recovery took around 40 minutes, because the security-critical change required approval from a second on-call administrator and then had to be pushed through deployment.

Timeline

  • ~09:00: Initial reports of single terminals having network connectivity problems.
  • 11:20: Multiple customers had reported connectivity issues. On-call response was started and a full investigation got underway.
  • 13:03: Manual recovery was initiated on the payment gateway in an attempt to fix connectivity issues for the small number of affected terminals.
  • 13:07: Manual recovery of the payment gateway concluded, which required all payment terminals to reconnect. As a result, the terminals that could no longer connect to the gateway were unable to re-establish a connection, taking all V3M2 and V3P3 terminals offline.
  • 13:33: Root cause was found and recovery was immediately started.
  • 13:35: Second on-call person was alerted and joined the investigation to get approval for a security-critical change.
  • 14:02: Deployment of the new configuration was started.
  • 14:13: Deployment of the new configuration was finished. All terminals went back online.

Root cause

The certificate necessary for the payment terminals to communicate with the payment gateway had expired. This happened because the specific certificate in question was not in scope for the regular expiry tracking processes. Certificates are currently being migrated from an old management system to a new system with automatic renewal. Expiry tracking was in place for both the old and the new management system, but this specific certificate was handled by an intermediate solution, taken into use because our hosting provider could not support the security parameters the new system requires for this certificate. Our hosting provider has since added support for these security parameters, and migrating this certificate was on the backlog, but unfortunately it expired before the migration was done.
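For illustration, an expiry like this can be spotted ahead of time by inspecting the certificate an endpoint actually serves. The following is a minimal Python sketch using only the standard library; the host name is a placeholder, and this is not the tooling we use in production.

```python
import datetime
import socket
import ssl


def days_until_expiry(host: str, port: int = 443) -> int:
    """Return the number of days until the TLS certificate served by host:port expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # 'notAfter' looks like 'Aug 12 10:00:00 2023 GMT'
    expires = datetime.datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=datetime.timezone.utc
    )
    return (expires - datetime.datetime.now(datetime.timezone.utc)).days


if __name__ == "__main__":
    # Note: once the certificate has already expired, the TLS handshake itself fails,
    # which is exactly the failure mode the terminals ran into.
    print(days_until_expiry("gateway.example.com"))
```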

Lessons learned

Certificate expiry is unfortunately one of the most common causes of downtime, even for large providers. We recognize this and have placed special emphasis on tracking expiry diligently in our system design. In reality, however, there are situations where intermediate solutions have to be used. We now know that we must also ensure expiry tracking for these intermediate solutions, even when the plan is to migrate away from them before the certificates expire. It is not enough to handle this well in the primary solution; every solution needs the same emphasis on tracking. In addition, automatic renewal should always be a high priority wherever it can be implemented.

Corrective actions

Several changes are being made in response to this incident:

  • Guidelines for software development were updated to require expiry tracking for any certificates, even for intermediate solutions.
  • Migration of the certificates to the new system was given a high priority.
  • Automatic renewal is prioritized for all certificates.
  • Monitoring and reporting solutions are being updated to allow easier discovery of certificate errors (see the sketch below).
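As a sketch of what such expiry tracking can look like for certificates managed outside the main systems, the script below scans a directory of PEM files and warns when any certificate is close to expiring. The third-party cryptography package and the /etc/npay/certs directory are assumptions for illustration; this is not our actual monitoring implementation.

```python
import sys
from datetime import datetime, timezone
from pathlib import Path

from cryptography import x509  # third-party dependency, assumed to be available

CERT_DIR = Path("/etc/npay/certs")  # hypothetical location of the tracked PEM files
WARN_DAYS = 30  # alert when fewer than this many days of validity remain


def main() -> int:
    expiring = []
    for pem in sorted(CERT_DIR.glob("*.pem")):
        cert = x509.load_pem_x509_certificate(pem.read_bytes())
        # not_valid_after_utc requires cryptography >= 42; older versions expose not_valid_after.
        remaining = cert.not_valid_after_utc - datetime.now(timezone.utc)
        if remaining.days < WARN_DAYS:
            expiring.append(f"{pem.name}: {remaining.days} days left")
    for line in expiring:
        print(line)
    # Non-zero exit code so a scheduler or monitoring agent can raise an alert.
    return 1 if expiring else 0


if __name__ == "__main__":
    sys.exit(main())
```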

We believe these changes will ensure that we do not run into similar problems in the future.

Posted Aug 16, 2023 - 12:20 EEST

Resolved
There was a problem with V3P3 and V3M2 terminals that prevented some of them from connecting to the gateway. The problem escalated to the point where no V3P3 or V3M2 terminal could connect to the gateway, at which time the root cause was also discovered. Once the root cause was resolved, all terminals returned to full connectivity. A full post-mortem of the issue will be posted early next week.
Posted Aug 12, 2023 - 14:30 EEST
Monitoring
A fix has been deployed for the problem; we are monitoring to see if it resolves the issue completely.
Posted Aug 12, 2023 - 14:15 EEST
Update
The issue should be resolved in a few minutes; we are still working on it.
Posted Aug 12, 2023 - 14:06 EEST
Update
There is an issue affecting all Vega3000 terminals, i.e. both V3P3 and V3M2. We are attempting to resolve it as soon as possible.
Posted Aug 12, 2023 - 13:35 EEST
Update
We are continuing to investigate the issue. Certain individual terminals no longer seem to have access to the backend. The behaviour looks like a network problem, but we are continuing the investigation on our side to determine whether something specific is affecting only those terminals.
Posted Aug 12, 2023 - 12:55 EEST
Update
No changes have been seen in transaction volumes; investigation is ongoing to determine whether there is still an active problem.
Posted Aug 12, 2023 - 12:25 EEST
Investigating
There are multiple connectivity issues reported with 3G terminals. Investigation is underway.
Posted Aug 12, 2023 - 11:37 EEST
This incident affected: Payment Backend and Cellular payment terminal connections (Telenor).