7.6rc1 cached DNS CNAME responses break with glibc/Linux
Posted: Thu Oct 06, 2022 2:54 pm
Hi all,
I don't have permission to reply to the announcement thread, so I'm posting here. The DNS server in 7.6rc1 serves cached CNAME responses in a way that makes getaddrinfo(3) fail, at least in recent versions of glibc. This breaks name resolution for most Linux and many IoT applications whenever a CNAME is involved, which hits CDN-hosted services hardest. It looks like at least one other person has seen the issue.
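For anyone who wants to check whether a client is affected, here is a minimal sketch that performs the same kind of lookup from Python. It assumes that Python's socket.getaddrinfo ends up in the system resolver (glibc on most Linux systems); the function name check_resolution and the injectable resolver parameter are made up for illustration and are not part of any tool mentioned in this post.

```python
import socket


def check_resolution(host, resolver=socket.getaddrinfo):
    """Return (ok, detail) for one lookup.

    `resolver` is a hypothetical injection point so the function can be
    exercised without a network; by default the system resolver is used.
    """
    try:
        results = resolver(host, None, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # A failed lookup surfaces here, e.g. EAI_NONAME with the message
        # "Name or service not known" on glibc.
        return False, f"{exc.errno}: {exc.strerror}"
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    addresses = sorted({info[4][0] for info in results})
    return True, ", ".join(addresses)


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        ok, detail = check_resolution(name)
        print(f"{'OK  ' if ok else 'FAIL'} {name}: {detail}")
```

Run it twice against a CNAME-backed name so that the second lookup is answered from the router's cache; a failure only on the second run would match the behaviour described here.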
In addition to the normal response for A records, once the query has been cached, AAAA queries return additional records containing the IPv4 addresses. An example response:
Code: Select all
Domain Name System (response)
Transaction ID: 0x12fc
Flags: 0x8180 Standard query response, No error
Questions: 1
Answer RRs: 1
Authority RRs: 0
Additional RRs: 4
Queries
api.twitter.com: type AAAA, class IN
Name: api.twitter.com
[Name Length: 15]
[Label Count: 3]
Type: AAAA (IPv6 Address) (28)
Class: IN (0x0001)
Answers
api.twitter.com: type CNAME, class IN, cname tpop-api.twitter.com
Name: api.twitter.com
Type: CNAME (Canonical NAME for an alias) (5)
Class: IN (0x0001)
Time to live: 1366 (22 minutes, 46 seconds)
Data length: 19
CNAME: tpop-api.twitter.com
Additional records
tpop-api.twitter.com: type A, class IN, addr 104.244.42.66
Name: tpop-api.twitter.com
Type: A (Host Address) (1)
Class: IN (0x0001)
Time to live: 261 (4 minutes, 21 seconds)
Data length: 4
Address: 104.244.42.66
tpop-api.twitter.com: type A, class IN, addr 104.244.42.2
Name: tpop-api.twitter.com
Type: A (Host Address) (1)
Class: IN (0x0001)
Time to live: 261 (4 minutes, 21 seconds)
Data length: 4
Address: 104.244.42.2
tpop-api.twitter.com: type A, class IN, addr 104.244.42.130
Name: tpop-api.twitter.com
Type: A (Host Address) (1)
Class: IN (0x0001)
Time to live: 261 (4 minutes, 21 seconds)
Data length: 4
Address: 104.244.42.130
tpop-api.twitter.com: type A, class IN, addr 104.244.42.194
Name: tpop-api.twitter.com
Type: A (Host Address) (1)
Class: IN (0x0001)
Time to live: 261 (4 minutes, 21 seconds)
Data length: 4
Address: 104.244.42.194
[Request In: 2]
[Time: 0.015108684 seconds]
And here's how socat sees it:
Code: Select all
D getaddrinfo("api.twitter.com", NULL, {1,0,1,6,0,0x0,0x0,0x0}, 0x7ffc82af8d60)
D getaddrinfo(,,,{0x0}) -> -2
E getaddrinfo("api.twitter.com", "NULL", {1,0,1,6}, {}): Name or service not known
I have confirmed with strace that the UDP messages are making it to userspace, and RouterOS 7.6rc1 is the only DNS server I have tested that causes this behaviour, so the structure of the response looks to be the cause. I don't think that having A responses in the Additional records section of an AAAA request necessarily violates the spec, but it is certainly unexpected, and it does not appear very helpful, since most software (including glibc) sends both A and AAAA requests simultaneously. The glibc DNS stack could definitely use some robustness improvements, but this response behaviour will break quite a few clients, including many that won't be meaningfully updated (printers, TVs, etc.).
I have sent full PCAPs and debug logs to support@. I tested this on my CCR-2116, and rolling back to 7.6beta8 resolves the issue. Deployments with RouterOS DNS caching and Linux clients should probably avoid 7.6rc1.
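To check a captured response for the shape described above (AAAA question, CNAME-only answer, A records in the additional section), a rough sketch like the following can walk a raw DNS message and list the record types per section. It assumes you already have the UDP payload as bytes; the function names section_types and looks_like_reported_shape are hypothetical, and this is only a header-level walk of the standard DNS wire format, not a full parser.

```python
import struct

TYPE_A, TYPE_CNAME, TYPE_AAAA = 1, 5, 28


def skip_name(msg, pos):
    """Return the offset just past the (possibly compressed) name at pos."""
    while True:
        length = msg[pos]
        if length & 0xC0 == 0xC0:  # compression pointer: 2 bytes, ends the name
            return pos + 2
        if length == 0:            # root label ends the name
            return pos + 1
        pos += 1 + length          # ordinary label: length byte plus data


def section_types(msg):
    """Return (question type, {section name: [record types]})."""
    _id, _flags, qd, an, ns, ar = struct.unpack_from("!6H", msg, 0)
    pos = 12                       # fixed-size DNS header
    qtype = None
    for _ in range(qd):
        pos = skip_name(msg, pos)
        qtype, _qclass = struct.unpack_from("!2H", msg, pos)
        pos += 4
    sections = {}
    for label, count in (("answer", an), ("authority", ns), ("additional", ar)):
        types = []
        for _ in range(count):
            pos = skip_name(msg, pos)
            # TYPE, CLASS, TTL, RDLENGTH take 10 bytes, then RDATA follows.
            rtype, _rclass, _ttl, rdlen = struct.unpack_from("!2HIH", msg, pos)
            pos += 10 + rdlen
            types.append(rtype)
        sections[label] = types
    return qtype, sections


def looks_like_reported_shape(msg):
    """True for an AAAA reply with a CNAME-only answer and A records as additionals."""
    qtype, sections = section_types(msg)
    return (
        qtype == TYPE_AAAA
        and TYPE_CNAME in sections["answer"]
        and TYPE_AAAA not in sections["answer"]
        and TYPE_A in sections["additional"]
    )
```

A True result only says the response has the same layout as the example in this post; it does not by itself prove that a given client will fail on it.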