Past incident from December 16th: ECR connectivity issues

Incident Report for Npay

Postmortem

Summary

On December 16th there was an incident related to our DNS (Domain Name System) records which caused some merchants’ ECRs to fail to connect to our systems. This problem surfaced at an unfortunate time, during the height of the end-of-year shopping season, and we sincerely apologise for the problems it caused. As a result of the now-concluded analysis of the incident, we are improving our monitoring and deployment processes.

The root cause was DNS resolution breaking after an attempted move of our services to a new location. The change was rolled back, but the rollback had an unexpected side effect: the changed records were propagated to downstream DNS resolvers with a long TTL (time to live), even though the TTL for the changed records was set to 1 minute.

Timeline

At 06:37 EET (GMT+2) we deployed a DNS change to production for the ECR API endpoint. It was rolled back automatically approximately one minute later due to a deployment failure of a resource unrelated to this incident.

At 12:02 the first reports of ECR issues from some merchants came in, and at 12:07 our on-call personnel started an investigation into the incident.

The cause of the problem was difficult to pin down because it was not visible on any existing monitoring graphs, and there were some red herrings (such as suspecting problems in the terminals themselves). At 12:46 the likely cause was identified as the production change deployed earlier in the morning.

At 13:03 we deployed a fix for the affected ECRs. This fixed the problem for the affected devices within at most 18 minutes, and for most devices within a few minutes, due to the nature of DNS negative-cache TTL expiration.

Root cause

A configuration error in the DNS NS (name server) record TTL caused the rollback to not fix the connectivity issues in a timely manner. We did set a TTL of 60 seconds on the parent domain’s NS record used for DNS delegation, but there is another TTL setting for the very same NS records in the subdomain itself, and that one was unfortunately set to 48 hours. It is the subdomain’s TTL that takes precedence, so without intervention the record would only have corrected itself after up to 2 days. Some resolvers cap high TTL values: Google’s 8.8.8.8 public resolver service, for example, caps the TTL of NS records at 6 hours.
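As an illustration of the mechanism above, the caching behaviour can be sketched as a toy model (this is not our production resolver; the name server name and values are hypothetical). Because the subdomain’s authoritative NS records outrank the parent’s delegation, a resolver caches them with the subdomain’s 48-hour TTL, so the rollback stays invisible until that cache entry expires:

```python
class ResolverCache:
    """Toy model of a caching DNS resolver. Illustrative only:
    'bad-ns.example' and the timings are hypothetical."""

    def __init__(self):
        self._cache = {}  # name -> (answer, expires_at)

    def put(self, name, answer, ttl, now):
        # The subdomain's own NS records are authoritative, so their
        # 48 h TTL is what gets cached, not the parent's 60 s.
        self._cache[name] = (answer, now + ttl)

    def get(self, name, now):
        entry = self._cache.get(name)
        if entry and now < entry[1]:
            return entry[0]
        return None  # expired or absent; the resolver would re-query


cache = ResolverCache()
cache.put("api.poplatek.com", "bad-ns.example", ttl=48 * 3600, now=0)

# Two minutes later the rollback is live upstream, but the resolver
# still serves the cached bad delegation: the 60 s TTL we intended
# never applied to this entry.
assert cache.get("api.poplatek.com", now=120) == "bad-ns.example"

# Only after the full 48 hours would the entry expire on its own.
assert cache.get("api.poplatek.com", now=48 * 3600) is None
```

This is why the automatic rollback, although correct, could not reach clients whose resolvers had already cached the delegation.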

Impact

Any ECR device that managed to query the NS record for the subdomain api.poplatek.com cached the new delegated name servers for 48 hours. This caching happens at the DNS resolver level, which causes all other devices using the same resolver to fail in a similar manner. The NS record was live between 06:37 and 06:38; due to the nature of propagating modified DNS records to our name servers, it is not possible to give more exact timestamps. Since clients had cached the original A record for 60 seconds, only a very small number of clients managed to cache the new NS record that replaced the original A record for api.poplatek.com. Unfortunately, for the clients that did resolve the new NS record the impact was catastrophic: the new name servers returned a wrong A record value for api.poplatek.com.
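The shared-resolver effect described above can be sketched as a toy model (illustrative only; the answer values are hypothetical). Once any single device’s query lands in the one-minute window and caches the bad delegation, every device behind the same resolver gets the bad answer for as long as the entry lives:

```python
class SharedResolver:
    """Toy resolver shared by many ECR devices. Illustrative only:
    'bad-ns' / 'good-ns' are placeholder answers."""

    def __init__(self):
        self.cache = None  # one shared cache entry for the NS record

    def resolve(self, authoritative_answer):
        if self.cache is None:
            # First query populates the shared cache.
            self.cache = authoritative_answer
        return self.cache


resolver = SharedResolver()

# One device happens to query during the one-minute window when the
# bad delegation was live; the answer is cached at the resolver.
assert resolver.resolve("bad-ns") == "bad-ns"

# Every other device using the same resolver now also receives the bad
# answer, even though the record has since been rolled back upstream.
assert resolver.resolve("good-ns") == "bad-ns"
```

This is why a record that was live for only about a minute could affect devices for hours: the blast radius is the set of clients behind each poisoned resolver, not just the clients that queried during the window.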

Lessons learned

Making changes to NS delegations simultaneously with other changes is high-risk. NS TTLs are hard to understand and to get right, and our testing environment does not always reflect the production environment with regard to DNS configuration.

NS TTLs are specified in two different places for the same subdomain: there is a TTL on the NS records in the parent domain, and a TTL on the very same NS records in the subdomain itself. We learned the hard way that it is the subdomain’s TTL that takes precedence. We thought we had set the TTL to 60 seconds, but we set it in the wrong place, at the parent domain level where the DNS delegation is originally configured.

We learned to always communicate changes that affect production DNS by scheduling a maintenance event on Statuspage, well before the deployment. This will help discover any issues in a more timely manner.

Our monitoring was unable to detect ECR problems that affect only a small fraction of our customer base. We are going to set up new alarms that further subdivide ECR connection monitoring and alert on anomalous drops per merchant channel.

Finally, we are going to improve customer communication about the timing of the release of our public post-mortems. We understand that these incidents cause our customers uncertainty, and we want to do our best to alleviate that.

We believe that these actions will help us catch these types of problems faster in the future, and improve our availability. Once again we apologize for any problems this has caused you, and wish you a successful and happy holiday season!

Posted Dec 20, 2024 - 18:15 EET

Resolved

On December 16th, a small fraction of our customers' ECRs were unable to access our service due to DNS problems.
Posted Dec 20, 2024 - 18:12 EET
This incident affected: CloudPOS.