Payments in Sweden and Norway

Incident Report for Npay

Postmortem

Summary

In late October 2024, we started seeing sporadic connection problems that intermittently affected a small number of transactions. The problems were caused by failures of the IPsec tunnels to the acquirer; we run two tunnels from two different data centres. The failures became more frequent and finally escalated into a total outage of both tunnels on Saturday, December 14th. We deeply apologise for the problems this has caused and do not consider this acceptable.

Despite tuning the tunnel parameters over the preceding weeks, we were unable to find a configuration that resolved all issues. We discovered that the tunnels sometimes failed to detect data flow interruptions promptly. To improve availability, we adjusted the dead peer detection (DPD) parameters, which unexpectedly led to a complete outage of both tunnels two days after the change.
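To illustrate why DPD tuning is a trade-off, here is a rough sketch of how probe settings bound the time it takes to notice a dead peer. It assumes a simplified model: a probe is sent after an idle period, then retransmitted with an optional backoff factor until the retries are exhausted. The function name, parameters and example values are hypothetical; they are not the settings used in this incident, and real IPsec implementations differ in detail.

```python
def worst_case_detection_seconds(idle_delay, timeout, retries, backoff=1.0):
    """Upper bound on dead-peer detection time in a simplified probe model.

    idle_delay: seconds of silence before the first probe is sent
    timeout:    seconds to wait for a reply to the first probe
    retries:    number of retransmissions after the first probe
    backoff:    multiplier applied to the timeout on each retransmission
    """
    # One wait per attempt: the initial probe plus each retransmission.
    waits = [timeout * backoff ** attempt for attempt in range(retries + 1)]
    return idle_delay + sum(waits)


# Hypothetical "relaxed" and "aggressive" profiles, for comparison only.
relaxed = worst_case_detection_seconds(idle_delay=30, timeout=4, retries=5, backoff=1.8)
aggressive = worst_case_detection_seconds(idle_delay=5, timeout=2, retries=2)

print(round(relaxed, 1))  # about 195.1 seconds
print(aggressive)         # 11.0 seconds
```

Shorter timers detect a broken tunnel sooner, but they also leave less room for brief packet loss before a healthy tunnel is torn down and renegotiated.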

Timeline (all times UTC+2, EET)

  • End of October 2024: First occurrences of a few failing transactions
  • October - December: Increasing frequency of transaction failures, though still relatively low
  • Thursday, December 12th: DPD parameters were tuned for faster detection of broken tunnels
  • Saturday, December 14th:
    • 13:45: First on-call alarms about both IPsec tunnels being down; we expected them to recover quickly, as before
    • 13:55: Alarms closed
    • 14:21: Alarms re-opened and closed again at 14:22
    • 14:27: Alarms re-opened but this time the tunnels did not self-recover
    • 14:38: We requested acquirer to restart IPsec tunnels on their side
    • 15:14: Statuspage incident created
    • 15:25: Acquirer informed us that restarts did not help
    • 15:48: Reverted DPD parameter changes and started deployment
    • 15:56: First IPsec tunnel up
    • 16:03: Second IPsec tunnel up, alarms closed shortly after
    • 16:07: Statuspage updated with Monitoring status as transactions were succeeding
    • 19:51: Statuspage updated with Resolved status as no further problems were observed
  • After the weekend downtime: Continued investigation revealed that UDP Encapsulation (NAT-T) might resolve all issues
  • Monday, December 16th, around noon: Deployed NAT-T configuration for the first tunnel
  • Wednesday, December 18th, around noon: Deployed NAT-T configuration for the second tunnel after observing zero downtime with the first tunnel
  • Since then: No errors observed

Root cause

The root cause of the issues remains unclear due to the complexity of IPsec debugging. There were no error messages pinpointing the cause of the tunnel failures. We run an identical configuration with other acquirers without issues, and we saw no errors in the staging environment.

Through trial and error, we found that using UDP Encapsulation (NAT-T) resolved the issues for this particular production connection, though the exact reasons remain unknown.

We recommend using mTLS instead of IPsec, but that option is not always available.

Impact

Before the outage, purchase transactions occasionally failed, but retries were mostly successful. Unfortunately, on December 14th the problems escalated, resulting in a complete outage of purchase transactions for all merchants using the affected acquirer from 14:27 until 16:00.

A total of 894 purchase transaction attempts failed due to this problem, across all affected merchants.
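As a quick cross-check of the outage window, the durations can be derived from the timestamps in the timeline (all UTC+2). This is only an illustrative calculation; the helper name `minutes_between` is made up for this example.

```python
from datetime import datetime

DAY = "2024-12-14"  # date of the outage, from the timeline


def minutes_between(start, end):
    """Whole minutes between two HH:MM timestamps on the outage day."""
    fmt = "%Y-%m-%d %H:%M"
    begin = datetime.strptime(f"{DAY} {start}", fmt)
    finish = datetime.strptime(f"{DAY} {end}", fmt)
    return int((finish - begin).total_seconds() // 60)


# Timestamps copied from the timeline.
first_alarm = "13:45"
tunnels_down = "14:27"
first_tunnel_up = "15:56"
second_tunnel_up = "16:03"

print(minutes_between(tunnels_down, first_tunnel_up))   # 89
print(minutes_between(tunnels_down, second_tunnel_up))  # 96
print(minutes_between(first_alarm, second_tunnel_up))   # 138
```

The complete outage therefore lasted roughly an hour and a half, and about two and a quarter hours passed between the first alarm and full recovery.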

Lessons learned

  • IPsec Complexity: When problems arise, one must keep trying configurations until the problems are solved. A previously stable connection can fail without any configuration changes, which highlights the importance of detailed monitoring.
  • Alarm Fatigue: We must avoid alarm fatigue, where recurring issues that previously resolved without intervention are dismissed as routine. We could have reacted faster had the issue not been considered business as usual. We must never consider acquirer connection issues as business as usual.
  • Alternative Solutions: We advocate for alternative connection solutions to avoid IPsec, but that requires changes beyond our control. We support our partners with knowledge and experience to encourage migration to more modern solutions.
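One possible guard against alarm fatigue is to escalate automatically when the same alarm re-opens several times within a short window, instead of treating each self-recovery as routine. The sketch below is a minimal illustration of that idea; the function name, the one-hour window and the threshold of three openings are assumptions for the example, not a description of our monitoring setup.

```python
from datetime import datetime, timedelta


def escalation_time(open_times, window=timedelta(hours=1), threshold=3):
    """Return the opening time at which a flapping alarm should escalate.

    Escalation is due as soon as `threshold` openings fall within `window`.
    Returns None if that never happens.
    """
    times = sorted(open_times)
    for i in range(threshold - 1, len(times)):
        # Compare the current opening with the one `threshold - 1` earlier.
        if times[i] - times[i - threshold + 1] <= window:
            return times[i]
    return None


# Alarm openings from the timeline on December 14th (UTC+2).
openings = [
    datetime(2024, 12, 14, 13, 45),
    datetime(2024, 12, 14, 14, 21),
    datetime(2024, 12, 14, 14, 27),
]

print(escalation_time(openings))  # 2024-12-14 14:27:00
```

With these example settings, the third opening at 14:27 would have triggered an escalation immediately, regardless of whether the earlier alarms had closed on their own.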

We are pleased that the tunnels are now functioning as expected. We apologise for the problems and want to wish you a pleasant holiday season and a thriving business!

Posted Dec 23, 2024 - 15:11 EET

Resolved

This incident has been resolved.
Posted Dec 14, 2024 - 19:51 EET

Monitoring

A fix was implemented and the authorizations are succeeding again from 15:00 CET onwards.
Posted Dec 14, 2024 - 16:07 EET

Investigating

Since around 13:25 CET we have been experiencing issues with the connection to the acquirer, which affects payments in Sweden and Norway.
Posted Dec 14, 2024 - 15:14 EET
This incident affected: Acquirer Connections (Nets Denmark).