In late October 2024, we started seeing sporadic connection problems that intermittently affected a small number of transactions. The problems were caused by IPsec tunnel failures. We run two tunnels from two different data centres. The failures became more frequent and finally escalated into a total outage of both tunnels on Saturday, 14 December. We deeply apologise for the problems this has caused and do not consider this acceptable.
Despite tuning the tunnel parameters over the preceding weeks, we were unable to find a configuration that resolved all issues. We discovered that the tunnels sometimes failed to detect data flow interruptions promptly. We adjusted the dead peer detection (DPD) parameters to improve availability, which unexpectedly led to a complete outage of both tunnels two days after the change.
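To illustrate the trade-off behind DPD tuning, here is a minimal sketch of a simplified liveness model. It is not our actual configuration: the parameter names (`probe_interval_s`, `probe_timeout_s`, `max_missed_probes`) and the example values are hypothetical, and real IPsec implementations name and interpret their DPD settings differently.

```python
from dataclasses import dataclass


@dataclass
class DpdConfig:
    # Hypothetical parameter names, for illustration only.
    probe_interval_s: float  # idle time before a liveness probe is sent
    probe_timeout_s: float   # how long to wait for each probe's reply
    max_missed_probes: int   # consecutive missed replies before the peer is declared dead


def worst_case_detection_s(cfg: DpdConfig) -> float:
    """Upper bound on the time between a peer failing and being declared dead.

    Simplified model: the peer may fail right after a successful probe, so a
    full interval can pass before the next probe is sent. Each probe then
    waits for its timeout, and the next probe is sent immediately afterwards.
    """
    return cfg.probe_interval_s + cfg.max_missed_probes * cfg.probe_timeout_s


def tolerated_blackout_s(cfg: DpdConfig) -> float:
    """Longest packet-loss window that cannot trigger a teardown in this model."""
    return cfg.max_missed_probes * cfg.probe_timeout_s


relaxed = DpdConfig(probe_interval_s=30, probe_timeout_s=10, max_missed_probes=5)
aggressive = DpdConfig(probe_interval_s=5, probe_timeout_s=2, max_missed_probes=3)

for name, cfg in (("relaxed", relaxed), ("aggressive", aggressive)):
    print(name, worst_case_detection_s(cfg), tolerated_blackout_s(cfg))
```

In this model, shorter intervals and timeouts detect a dead peer sooner, but they also shrink the window of transient packet loss the tunnel can ride out before it is torn down and renegotiated. Whether that effect played a role in the outage is not something we were able to confirm.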
Saturday, December 14th, 13:45: First on-call alarms reporting both IPsec tunnels down; we expected a quick resolution, as in earlier incidents
After the weekend downtime: Continued investigation suggested that UDP Encapsulation (NAT-T) might resolve all issues
Monday, December 16th, around noon: Deployed NAT-T configuration for the first tunnel
Wednesday, December 18th, around noon: Deployed NAT-T configuration for the second tunnel after observing zero downtime with the first tunnel
Since then: No errors observed
The root cause of the issues remains unclear due to the complexity of IPsec debugging. There were no error messages pinpointing the cause of the tunnel failures. We run an identical configuration with other acquirers without issues, and we saw no errors in the staging environment.
Through trial and error, we found that using UDP Encapsulation (NAT-T) resolved the issues for this particular production connection, though the exact reasons remain unknown.
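For readers unfamiliar with NAT-T: plain ESP traffic is carried directly over IP (protocol 50) and has no port numbers, whereas UDP encapsulation as described in RFC 3948 wraps each ESP packet in a UDP datagram on port 4500. The sketch below shows that framing and how a receiver tells ESP, IKE, and keepalive traffic apart on the shared port. It is an illustration only, not a description of our gateway software, and the function names are hypothetical.

```python
import struct

NAT_T_PORT = 4500                        # UDP port used for NAT traversal
NON_ESP_MARKER = b"\x00\x00\x00\x00"     # four zero bytes prefix IKE messages on this port
NAT_KEEPALIVE = b"\xff"                  # single-byte keepalive payload


def encapsulate_esp(esp_packet: bytes,
                    src_port: int = NAT_T_PORT,
                    dst_port: int = NAT_T_PORT) -> bytes:
    """Prefix an ESP packet with a UDP header (hypothetical helper)."""
    length = 8 + len(esp_packet)         # UDP length covers header plus payload
    checksum = 0                         # RFC 3948: IPv4 checksum SHOULD be sent as zero
    header = struct.pack("!HHHH", src_port, dst_port, length, checksum)
    return header + esp_packet


def classify_nat_t_payload(payload: bytes) -> str:
    """Classify the UDP payload of a datagram received on port 4500."""
    if payload == NAT_KEEPALIVE:
        return "nat-keepalive"
    if len(payload) < 4:
        return "invalid"
    if payload[:4] == NON_ESP_MARKER:
        return "ike"                     # IKE message follows the marker
    return "esp"                         # first four bytes are a non-zero SPI


# Dummy ESP packet: SPI 0x12345678, sequence number 1, four bytes of filler.
esp = struct.pack("!II", 0x12345678, 1) + b"\x00" * 4
datagram = encapsulate_esp(esp)
print(classify_nat_t_payload(datagram[8:]))
```

Because the encapsulated traffic looks like ordinary UDP, it tends to pass through NAT devices and firewalls that handle raw ESP poorly. We have not confirmed that this is why the change helped on this connection.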
We recommend using mTLS instead of IPsec where possible, but it is not always available.
Purchase transactions occasionally failed, but retries were mostly successful. Unfortunately, on December 14th, the problems escalated, resulting in a complete outage of purchase transactions for all merchants using the affected acquirer from 14:27 until 16:00.
A total of 894 purchase transaction attempts failed due to this problem, across all affected merchants.
We are pleased that the tunnels are now functioning as expected. We apologise for the problems and want to wish you a pleasant holiday season and a thriving business!