Troubleshooting K10 installation issues with NFS provisioned PVCs

This article provides assistance in resolving problems that can arise when installing K10 with PVCs provisioned over NFS, which may leave pods in unexpected states such as CrashLoopBackOff or Initializing.


Error description: 

Below are a few examples of errors that can appear while the pods are initializing and that may be related to the NFS share setup or permissions: 

Permission denied while trying to read/write: 

{"File":"","Function":"","Line":16,"cluster_name":"395c233f-0e8a-47ff-b2ff-b133ef0fa85c","error":{"message":"Cannot open model store datastore","function":"*ModelStore).OpenStore","linenumber":239,"file":"","cause":{"message":"failed to create directory for store","function":"*ModelStore).openDataStore","linenumber":718,"file":"","cause":{"message":"mkdir /mnt/k10state/kasten-io/jobs: permission denied"}}}, 


grafana-svc fails to chown its data directory while the pod is initializing: 


  Type     Reason     Age               From               Message 

  ----     ------     ----              ----               ------- 

  Normal   Scheduled  24s               default-scheduler  Successfully assigned kasten-io/k10-grafana-595f465647-4d8bb to pool-bd0bfjg35-ya3qs 

  Normal   Pulled     7s (x3 over 24s)  kubelet            Container image "" already present on machine 

  Normal   Created    7s (x3 over 24s)  kubelet            Created container init-chown-data 

  Normal   Started    7s (x3 over 24s)  kubelet            Started container init-chown-data 

  Warning  BackOff    7s (x3 over 22s)  kubelet            Back-off restarting failed container init-chown-data in pod k10-grafana-595f465647-4d8bb_kasten-io(3f53cd9d-e74a-4c99-b8f2-18669840925d) 


The catalog-svc pod is stuck in Init, and other K10 pods that have PVCs also fail to start properly, e.g.: 

catalog-svc-845bcf6bbd-xms2f             0/2     Init:1/2   1          35m  

jobs-svc-56975c9f56-wcwx2                0/1     Running    5          35m  

k10-grafana-57fbc9fc65-25fs8             0/1     Running    8          35m 



Permissions (owner / rw) 

These issues appear when the NFS server restricts "chown" on the directory structure, or when the share is missing the proper permissions (owner or rw). 
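A common source of this chown restriction is root squashing on the export, which maps root on the client to an unprivileged user and thereby blocks init containers from changing ownership. If your security policy allows it, the export can be relaxed; the sketch below reuses the export path and client name shown later in this article and is an illustration, not a recommendation for every environment:

```shell
# /etc/exports on the NFS server.
# no_root_squash lets root on the NFS client (e.g. K10 init containers)
# chown files on the share; weigh the security trade-off first.
/nfs-k10    server1(rw,sync,no_subtree_check,no_root_squash)
```

After editing the file, re-export with `exportfs -ra` so the change takes effect without restarting the NFS server.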

K10 creates the PVC directories for the pods with the owner specified on the NFS root share: 

root@nfs-test:/nfs-k10# ls -ltr 
total 32 
drwxrwxrwx 3 nobody nogroup 4096 Sep 21 18:47 archived-kasten-io-jobs-pv-claim-pvc-45c7edd7-b3b4-4aa6-b604-f805d8637d7a 
drwxrwxrwx 3 nobody nogroup 4096 Sep 21 18:47 kasten-io-catalog-pv-claim-pvc-54bf06d1-1f06-4a3b-8e3b-d105c45436f9 
drwxrwxrwx 4 nobody nogroup 4096 Sep 21 18:47 kasten-io-prometheus-server-pvc-496c5911-7975-4a7d-81b7-3ee25a65d73f 
drwxrwxrwx 4 nobody nogroup 4096 Sep 21 18:48 kasten-io-metering-pv-claim-pvc-1a6251d2-2b0f-401b-ab58-4cfa5f4e4304 
drwxrwxrwx 3 nobody nogroup 4096 Sep 21 18:48 kasten-io-logging-pv-claim-pvc-01df8bd0-7857-4df6-8714-d3493dd0bb8b 
drwxrwxrwx 7 nobody nogroup 4096 Sep 21 19:23 archived-kasten-io-k10-grafana-pvc-8c4765e8-25fc-4301-bf63-fe65b2cdc874 
drwxrwxrwx 3 nobody nogroup 4096 Sep 21 19:45 kasten-io-jobs-pv-claim-pvc-8bee181e-5e7e-480f-9688-9dbfcf843b29 
drwxrwxrwx 7 nobody nogroup 4096 Sep 21 19:45 kasten-io-k10-grafana-pvc-925a371f-537d-4f21-ab6f-16e0c5b1dd21 


In this case the share was created with nobody:nogroup as the owner, since the IDs that create the sub-directories are not present on the host. 


The problem is that some K10 services need specific user IDs to have the proper permissions to create and write (rw), e.g.: 

grafana-svc: 472:472 

jobs-svc: 1000:1000 


If the NFS share restricts chown or lacks rw permissions, the K10 service pods will not be able to create directories/files in their sub-folders and will fail to initialize. 
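A quick way to confirm that a PVC sub-directory carries the ownership a service expects is to compare its owning UID with the service's UID. This sketch uses a temporary directory owned by the current user as a stand-in for the NFS sub-folder; on a real share you would point `dir` at the PVC folder and set `expected_uid` to the service's UID (e.g. 472 for grafana-svc):

```shell
# Stand-in for an NFS PVC sub-folder; on a real share this would be
# e.g. /nfs-k10/kasten-io-k10-grafana-pvc-<id> with expected_uid=472.
dir=$(mktemp -d)
expected_uid=$(id -u)

# %u prints the numeric owner UID (GNU coreutils stat)
actual_uid=$(stat -c '%u' "$dir")

if [ "$actual_uid" = "$expected_uid" ]; then
    echo "ownership OK"
else
    echo "ownership mismatch: have $actual_uid, want $expected_uid"
fi
```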


Sync – Performance issues 

The "sync" option forces NFS to write changes to disk before replying. This results in a more consistent environment since the reply reflects the actual state of the remote volume. However, it also reduces the speed of file operations (rw). 

root@nfs-test:~# cat /etc/exports   

/nfs-k10    server1(rw,sync,no_subtree_check) 

Insufficient resources on the NFS server, or running it on an undersized standalone machine, can degrade overall performance, particularly I/O. Because of how sync operates on NFS, this can cause problems while K10 is writing to the PVCs: the pods may become stuck in the Initializing state without providing a clear error message about the underlying problem.

Upon reviewing the events, numerous readiness/liveness probe failures may be found because the pods did not complete their initialization, while other pods may remain stuck in their initial operations without a clear error message; both symptoms can indicate sluggishness in the NFS system.


When using an NFS share for PVC provisioning, it is advisable to take the resources and performance of the NFS server into account, particularly because the "sync" option can lead to performance issues.

If the NFS server performs poorly, it may be necessary to remove "sync" from the NFS share to speed up file operations (rw). 
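Based on the export shown earlier, the change amounts to replacing "sync" with "async" and re-exporting. Note the trade-off: with async the server acknowledges writes before they are committed to disk, so a server crash can lose recently written data.

```shell
# /etc/exports — "async" replies to clients before data reaches disk
/nfs-k10    server1(rw,async,no_subtree_check)
```

Apply the change with `exportfs -ra`, which re-exports all directories without restarting the NFS server.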

Additionally, check that the proper permissions were applied to the root folder of the share, and that the owner is correct and not set to root. 

In some instances, such as grafana-svc and jobs-svc, specific user IDs are used, and it may be necessary to recursively change the owner of those services' PVC folders for them to function properly with Kasten K10. Whether this change is needed depends on how the NFS share was set up.
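Using the share layout from the listing earlier in this article, the ownership change would look roughly like this, run on the NFS server (the PVC directory names are from that example; substitute your own):

```shell
# Recursively set the UID:GID each service expects on its PVC folder.
sudo chown -R 472:472   /nfs-k10/kasten-io-k10-grafana-pvc-925a371f-537d-4f21-ab6f-16e0c5b1dd21
sudo chown -R 1000:1000 /nfs-k10/kasten-io-jobs-pv-claim-pvc-8bee181e-5e7e-480f-9688-9dbfcf843b29
```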

If a "chown" restriction on the NFS share is preventing grafana-svc from initializing, K10 provides a Helm parameter to disable this check during the initialization of the Grafana pod. The required ownership then has to be applied manually to the NFS sub-folders:

--set grafana.initChownData.enabled=false
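Applied to an existing installation, the flag would be passed to a Helm upgrade along these lines (the release name `k10`, namespace `kasten-io`, and the `kasten/k10` chart reference are common defaults and are assumptions here; adjust them to your install):

```shell
helm upgrade k10 kasten/k10 \
  --namespace kasten-io \
  --reuse-values \
  --set grafana.initChownData.enabled=false
```

`--reuse-values` keeps the values from the previous release so only this setting changes.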