On Saturday, between 13:05 and 14:15 EEST (GMT+3), a disruption occurred that affected the connectivity of all Castles V3M2 and V3P3 terminals to the payment gateway. This disruption prevented users from making purchases within this time frame.
Typically, payment terminals maintain a continuous connection with the payment gateway and automatically establish a new one after a network reset. Upon investigation, the issue was traced to the terminals being unable to establish new connections with the payment gateway. Notably, some terminals had already been encountering connection problems up to 7 hours before the aforementioned time window, whenever they attempted to reconnect.
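To illustrate this connection model, the sketch below shows the general shape of a long-lived gateway connection with automatic reconnection. It is not the terminals' actual firmware: the gateway address, retry interval, and logging are placeholders, and the payment traffic itself is elided. It does show why a connection-setup failure surfaces only when a terminal happens to reconnect, rather than on all terminals at once.

```python
import logging
import socket
import ssl
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("terminal")

# Placeholder endpoint, not the real gateway address.
GATEWAY_HOST, GATEWAY_PORT = "payment-gateway.example.com", 443

def connect_to_gateway() -> ssl.SSLSocket:
    """Open a fresh TLS connection to the payment gateway."""
    context = ssl.create_default_context()
    sock = socket.create_connection((GATEWAY_HOST, GATEWAY_PORT), timeout=10)
    return context.wrap_socket(sock, server_hostname=GATEWAY_HOST)

def run_terminal() -> None:
    """Hold one long-lived connection; reconnect whenever it drops."""
    while True:
        try:
            with connect_to_gateway() as tls:
                log.info("connected to gateway")
                # Process payment traffic until the connection drops.
                while tls.recv(4096):
                    pass
        except ssl.SSLError as exc:
            # A problem with connection setup (for example a failed TLS
            # handshake) is only seen here, when a new connection is
            # attempted; terminals holding an existing connection keep
            # working until they next reconnect.
            log.warning("TLS connection failed: %s", exc)
        except OSError as exc:
            log.warning("network error: %s", exc)
        time.sleep(30)  # back off before retrying
```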
We sincerely apologize for any inconvenience this disruption has caused and assure you that steps are being taken to prevent the recurrence of such issues in the future.
We received sporadic reports of individual terminals being unable to connect around 9:00 EEST on Saturday. Nothing out of the ordinary was detected, so these were attributed to possible problems with the 3G network. Around 11:20 EEST there were enough reports of terminals with connectivity problems that an incident investigation was initiated.
A number of terminals across the estate were unable to connect to the payment gateway. These terminals appeared to have suddenly lost connectivity and gone completely silent. Reviewing the problem reports confirmed that the issue affected both wired and wireless terminals and was not related to an ongoing firmware update. At the same time, there was a previously unseen problem in accessing one of the monitoring systems, which required a manual recovery to become accessible again. Since no explanation could be found for why some of the terminals could not connect, and the manual recovery had worked on the monitoring system, it was decided to try the same manual recovery on the terminal connections as well.

Running this manual recovery exacerbated the incident, taking it from affecting a small number of terminals to affecting all Vega3000 (V3M2 and V3P3) terminals. Once a large number of terminals were affected at the same time, it became clear that this was a systemic issue. Shortly afterwards, the root cause of the incident was found. Recovery was started immediately and completed as rapidly as possible. Once the fix was in place, all the terminals with connectivity problems came back online and normal operation was restored.
The incident impacted only Castles V3M2 and V3P3 terminals, because these terminals use a new payment gateway connection address, whereas all other terminal models still use the old one. Some terminals were unable to connect from Saturday morning, but the connectivity issue affected all Castles V3M2 and V3P3 terminals from 13:07 to 14:12 EEST. During this period, no payments could be completed on the affected terminals, which displayed an error message about being unable to connect to the payment gateway.
The incident response started on Saturday at 11:20 EEST and continued until the issue was fully resolved. A second on-call person was brought in at 13:35 to assist with recovery.
After the root cause was discovered, recovery actions were started immediately. Recovery took around 30 minutes because the security-critical change required approval from a second on-call administrator and then had to be pushed through deployment.
A certificate necessary for the payment terminals to communicate with the payment gateway had expired. This happened because the specific certificate in question was not in scope for the regular expiry tracking processes. Certificates are currently being migrated from an old management system to a new system with automatic renewal. Expiry tracking was in place for both the old and the new management system, but this specific certificate was managed by an intermediate solution, taken into use because our hosting provider could not support the security parameters the new system requires for this certificate. Our hosting provider has since added support for these security parameters, and migrating this certificate was on the backlog, but unfortunately the certificate expired before the migration was done.
Certificate expiry is unfortunately one of the most common causes of downtime, even for large providers. We recognize this and have placed special emphasis on ensuring that expiry is tracked dutifully in our system design. In reality, however, there are situations where intermediate solutions have to be used. We now know that we also need expiry tracking for these intermediate solutions, even if the plan is to migrate away from them before the certificate expires. It is not enough to handle expiry well in the primary solution; every solution needs the same emphasis on tracking. Automatic renewal should also be implemented with high priority wherever it is possible.
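As a concrete example of the kind of tracking we mean, the sketch below probes a TLS endpoint from the outside and reports how many days remain before its certificate expires. Because it inspects the certificate actually being served, it works regardless of which management system, old, new, or intermediate, issued that certificate. The host name and the 30-day threshold are illustrative placeholders, not our actual configuration.

```python
import socket
import ssl
import time

def days_until_expiry(host: str, port: int = 443) -> float:
    """Return the number of days until the server's TLS certificate expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        # If the certificate has already expired, this handshake raises
        # ssl.SSLCertVerificationError, which is itself a usable alert signal.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # 'notAfter' is a string such as 'Jun  1 12:00:00 2026 GMT'.
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires_at - time.time()) / 86400

if __name__ == "__main__":
    # Placeholder host name; substitute the real endpoint being monitored.
    remaining = days_until_expiry("payment-gateway.example.com")
    if remaining < 30:  # example alerting threshold
        print(f"WARNING: certificate expires in {remaining:.0f} days")
    else:
        print(f"OK: {remaining:.0f} days until certificate expiry")
```

Run on a schedule against every externally reachable endpoint, a check like this catches certificates that fall outside any individual management system's own expiry tracking.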
Several changes are being made in response to this incident:
We believe that these changes will prevent us from running into similar problems in the future.