Debugging K10 report generation failures caused by DNS issues

This article helps troubleshoot K10 reports that are not generated because the DNS server cannot resolve the Prometheus hostname.

Description:
Reports are not generated after enabling K10 reports. The following error message appears when running the on-demand report policy and is also present in the executor logs:
"message":"Post \"http://prometheus-server-exp:80/k10/prometheus/api/v1/query\": dial tcp: lookup prometheus-server-exp on 192.0.0.10:53: no such host"

"cause":{"message":"Failure in subordinate phase","function":"kasten.io/k10/kio/exec/phases/phase.(*queueAndWaitChildrenPhase).processGroup","linenumber":196,"file":"kasten.io/k10/kio/exec/phases/phase/queue_and_wait_children.go:196","fields":[{"name":"FailedSubPhases","value":[{"Phase":"","Err":{"cause":{"cause":{"cause":{"cause":{"cause":{"message":"Post \"http://prometheus-server-exp:80/k10/prometheus/api/v1/query\": dial tcp: lookup prometheus-server-exp on 192.0.0.10:53: no such host"},"fields":[{"name":"query","value":"sum(round(increase(action_backup_skipped_overall{cluster=\"\"}[24h])))"}]

Explanation: 

The issue occurs when the Prometheus server hostname cannot be resolved, which prevents K10 pods (specifically the executor pods) from retrieving the Prometheus metrics. As a result, K10 reports cannot be generated, because they depend on metrics from the K10 Prometheus instance.

This can happen when the cluster DNS server is not running. The relevant part of the error is:

lookup prometheus-server-exp on 192.0.0.10:53: no such host 
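
Because a "no such host" answer can also be returned when the Service object itself is missing, it may be worth confirming that the Prometheus Service exists before investigating DNS. The following is a minimal sketch, not an official K10 procedure. It assumes K10 is installed in the kasten-io namespace (the namespace used in the nslookup example in this article), and check_prometheus_service is a hypothetical helper name.

```shell
# Hypothetical helper: confirm that the Prometheus Service object exists.
# Assumption: K10 runs in the kasten-io namespace; pass another namespace as $1.
check_prometheus_service() {
  namespace="${1:-kasten-io}"
  if kubectl -n "$namespace" get service prometheus-server-exp >/dev/null 2>&1; then
    echo "service prometheus-server-exp found in $namespace"
  else
    echo "service prometheus-server-exp not found in $namespace"
    return 1
  fi
}

# Usage:
#   check_prometheus_service            # checks the kasten-io namespace
#   check_prometheus_service my-k10-ns  # checks a custom namespace
```

If the Service is found but the lookup still fails, the cluster DNS is the more likely cause.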


Troubleshooting/Solution:

Validate the cluster DNS by following the DNS troubleshooting guide in the Kubernetes documentation.

The following commands help validate DNS in the cluster:

# Deploy the DNS debugging example pod (YAML from the Kubernetes documentation)
kubectl apply -f https://raw.githubusercontent.com/kubernetes/website/main/content/en/examples/admin/dns/dnsutils.yaml

# Exec into the pod and run nslookup against the service name in the form <svcName>.<namespaceName>
kubectl -n default exec -i -t dnsutils -- nslookup prometheus-server-exp.kasten-io

There might be an issue with the cluster DNS if the command fails with the error below.

kubectl -n default exec -i -t dnsutils -- nslookup prometheus-server-exp.kasten-io
Server:        10.245.0.10
Address:    10.245.0.10#53

** server can't find prometheus-server-exp.kasten-io: NXDOMAIN

command terminated with exit code 1
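
When the check is scripted, for example across several clusters, the lookup result can be classified from the nslookup output. The sketch below is an illustration only. classify_nslookup is a hypothetical helper, and the strings it matches are the ones nslookup commonly prints (NXDOMAIN as in the output above, and "no servers could be reached" when the DNS server does not answer). Other nslookup builds may word these messages differently.

```shell
# Hypothetical helper: classify nslookup output read from stdin.
# Prints one of: nxdomain, unreachable, resolved, unknown.
classify_nslookup() {
  output="$(cat)"
  case "$output" in
    *NXDOMAIN*) echo "nxdomain" ;;                          # server answered, name not found
    *"no servers could be reached"*) echo "unreachable" ;;  # DNS server did not answer
    *Name:*) echo "resolved" ;;                             # an answer was returned
    *) echo "unknown" ;;
  esac
}

# Usage (requires the dnsutils pod deployed earlier in this article):
#   kubectl -n default wait --for=condition=Ready pod/dnsutils --timeout=60s
#   kubectl -n default exec -i dnsutils -- nslookup prometheus-server-exp.kasten-io | classify_nslookup
```

An "nxdomain" result means the DNS server responded but does not know the name, while "unreachable" points to the DNS pods or Service being down.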

The next steps are to check that the DNS service is up, verify that its endpoints are exposed, and restart the DNS deployment.

Several DNS servers are available for Kubernetes clusters. The commands below apply to CoreDNS, the built-in DNS service that comes with Kubernetes clusters. The troubleshooting procedure is similar for other DNS services.

Check whether the DNS pods are running, filtering by the label k8s-app=kube-dns:

kubectl get pods --namespace=kube-system -l k8s-app=kube-dns 

Check whether the DNS endpoints are discovered:

kubectl get endpoints kube-dns --namespace=kube-system
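
The endpoint check can also be scripted by extracting only the ready endpoint IPs with a jsonpath query. This is a minimal sketch, assuming the DNS Service is named kube-dns in the kube-system namespace as in the commands above. dns_endpoint_ips and has_dns_endpoints are hypothetical helper names.

```shell
# Hypothetical helper: print the ready pod IPs behind the kube-dns Service.
dns_endpoint_ips() {
  kubectl -n kube-system get endpoints kube-dns \
    -o jsonpath='{.subsets[*].addresses[*].ip}'
}

# Succeeds only if at least one ready DNS endpoint IP is listed.
has_dns_endpoints() {
  ips="$(dns_endpoint_ips)" || return 1
  [ -n "$ips" ]
}

# Usage:
#   has_dns_endpoints && echo "DNS endpoints present" || echo "no ready DNS endpoints"
```

An empty result suggests that no DNS pod is ready to serve queries, even if the Service itself exists.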

Try restarting the DNS service with a rollout restart:

kubectl -n kube-system rollout restart deployment coredns
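
After the restart, it can help to wait for the rollout to complete and then repeat the lookup. The sketch below assumes the DNS deployment is named coredns and that the dnsutils pod from the earlier step is still running in the default namespace. verify_dns_after_restart is a hypothetical helper name, and the 120-second timeout is an arbitrary choice.

```shell
# Hypothetical helper: wait for the CoreDNS rollout, then repeat the lookup.
verify_dns_after_restart() {
  # Block until the restarted deployment is fully rolled out (or time out).
  kubectl -n kube-system rollout status deployment coredns --timeout=120s || return 1
  # Re-run the same lookup used earlier to validate DNS.
  kubectl -n default exec -i dnsutils -- nslookup prometheus-server-exp.kasten-io
}

# Usage:
#   verify_dns_after_restart && echo "DNS lookup succeeded"
```

Once the lookup succeeds, the on-demand report policy can be run again to confirm that reports are generated.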