问题与排查

目前业务使用K8S中部署java的微服务,但在使用过程中,发现某个kubernetes的pod无法访问api.weixin.qq.com,尝试通过nslookup, nc来测试pod是否可以连通服务

kubectl exec -it dnsutils -- bash

在pod中运行:

nc -z api.weixin.qq.com 443 && echo ok

返回结果

nc: bad address 'api.weixin.qq.com'

但是尝试其他域名,发现可以解析

nc -z www.baidu.com 443 && echo ok

结果却是ok

开启CoreDNS的日志记录功能

dns选择性的出现解析异常,怀疑是k8s的CoreDNS出现问题,导致无法解析域名,然后尝试打开CoreDNS的log日志功能

kubectl -n kube-system edit configmap coredns

增加log

apiVersion: v1
data:
  Corefile: |
    .:53 {
        log
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }    
kind: ConfigMap
metadata:
  creationTimestamp: "2023-12-30T06:52:54Z"
  name: coredns
  namespace: kube-system
  resourceVersion: "5340888"
  uid: xxxxx

保存退出,等待1~2分钟后,等待CoreDNS apply新的配置,查看CoreDNS的日志

kubectl logs --tail 10 -f -l k8s-app=kube-dns -n kube-system

在pod中执行

nc -z api.weixin.qq.com 443 && echo ok

日志如下:

[INFO] 10.244.6.133:33772 - 41051 "A IN api.weixin.qq.com.ishare.svc.cluster.local. udp 60 false 512" NXDOMAIN qr,aa,rd 153 0.00048101s
[INFO] 10.244.6.133:33772 - 42091 "AAAA IN api.weixin.qq.com.ishare.svc.cluster.local. udp 60 false 512" NXDOMAIN qr,aa,rd 153 0.000642553s
[INFO] 10.244.6.133:60047 - 41449 "A IN api.weixin.qq.com.svc.cluster.local. udp 53 false 512" NXDOMAIN qr,aa,rd 146 0.00027336s
[INFO] 10.244.6.133:60047 - 42206 "AAAA IN api.weixin.qq.com.svc.cluster.local. udp 53 false 512" NXDOMAIN qr,aa,rd 146 0.000377254s
[INFO] 10.244.6.133:36078 - 29981 "AAAA IN api.weixin.qq.com.cluster.local. udp 49 false 512" NXDOMAIN qr,aa,rd 142 0.000526157s
[INFO] 10.244.6.133:36078 - 29064 "A IN api.weixin.qq.com.cluster.local. udp 49 false 512" NXDOMAIN qr,aa,rd 142 0.000724731s
[INFO] 10.244.6.133:35114 - 13003 "A IN api.weixin.qq.com. udp 35 false 512" NOERROR qr,rd 167 0.002365088s
[INFO] 10.244.6.133:35114 - 14063 "AAAA IN api.weixin.qq.com. udp 35 false 512" SERVFAIL qr,rd 35 2.013423971s
[INFO] 10.244.6.133:35114 - 14063 "AAAA IN api.weixin.qq.com. udp 35 false 512" SERVFAIL qr,aa,rd 35 0.000137506s
[INFO] 10.244.6.133:35114 - 14063 "AAAA IN api.weixin.qq.com. udp 35 false 512" SERVFAIL qr,aa,rd 35 0.000169814s
[INFO] 10.244.6.133:35114 - 14063 "AAAA IN api.weixin.qq.com. udp 35 false 512" SERVFAIL qr,aa,rd 35 0.000196277s
[INFO] 10.244.6.133:35114 - 14063 "AAAA IN api.weixin.qq.com. udp 35 false 512" SERVFAIL qr,aa,rd 35 0.000164776s
[INFO] 10.244.6.133:35114 - 14063 "AAAA IN api.weixin.qq.com. udp 35 false 512" SERVFAIL qr,aa,rd 35 0.000214178s
[INFO] 10.244.6.133:35114 - 14063 "AAAA IN api.weixin.qq.com. udp 35 false 512" SERVFAIL qr,aa,rd 35 0.000153283s
[INFO] 10.244.6.133:35114 - 14063 "AAAA IN api.weixin.qq.com. udp 35 false 512" SERVFAIL qr,aa,rd 35 0.000212498s
[INFO] 10.244.6.133:35114 - 14063 "AAAA IN api.weixin.qq.com. udp 35 false 512" SERVFAIL qr,aa,rd 35 0.000164354s
[INFO] 10.244.6.133:35114 - 14063 "AAAA IN api.weixin.qq.com. udp 35 false 512" SERVFAIL qr,aa,rd 35 0.000188767s

发现api.weixin.qq.com的解析结果都是SERVFAIL,进一步怀疑是否是worker node的host无法访问api.weixin.qq.com呢。

远程进入pod所在的host,执行

nc -z api.weixin.qq.com 443 && echo ok

结果是ok,使用nslookup查询,却发现了SERVFAIL

root@cs-red:~# nslookup api.weixin.qq.com
Server:		127.0.0.53
Address:	127.0.0.53#53

Non-authoritative answer:
Name:	api.weixin.qq.com
Address: 121.14.23.85
Name:	api.weixin.qq.com
Address: 183.2.143.222
Name:	api.weixin.qq.com
Address: 119.147.6.203
Name:	api.weixin.qq.com
Address: 119.147.6.237
** server can't find api.weixin.qq.com: SERVFAIL

使用google dns

综合以上指令的运行结果,怀疑试试dns解析的问题呢,使用8.8.8.8再尝试解析api.weixin.qq.com

root@cs-red:~# nslookup api.weixin.qq.com 8.8.8.8
Server:		8.8.8.8
Address:	8.8.8.8#53

Non-authoritative answer:
Name:	api.weixin.qq.com
Address: 119.147.6.203
Name:	api.weixin.qq.com
Address: 121.14.23.85
Name:	api.weixin.qq.com
Address: 119.147.6.237
Name:	api.weixin.qq.com
Address: 183.2.143.222

结果没有出现SERVFAIL

CoreDNS配置forward指令

问题定位在dns的解析上,尝试修改CoreDNS的解析功能,增加额外的dns服务器8.8.8.8

apiVersion: v1
data:
  Corefile: |
    .:53 {
        log
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf 8.8.8.8 {
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }    
kind: ConfigMap
metadata:
  creationTimestamp: "2023-12-30T06:52:54Z"
  name: coredns
  namespace: kube-system
  resourceVersion: "5340888"
  uid: xxxxx

保存退出,并等待1~2分钟后,重新进入pod,执行

ishare$ nc -z api.weixin.qq.com 443 && echo ok
ok

问题终于得到解决。

总结

K8S中CoreDNS负责Pod中service的域名解析服务,如果发现pod解析域名出错,不管是外部域名还是k8s内部服务的域名,都需要通过CoreDNS来排查问题,首先最直接的就是打开CoreDNS的日志功能log,在排查过程中,仔细查看日志内容,并发现到底是哪个环节出现异常,进而想办法解决问题。

References