Debugging k10-grafana pod in CrashLoopBackOff status

This article helps troubleshoot a K10 Grafana pod stuck in CrashLoopBackOff status due to a disk I/O error.

Description:

Several conditions can cause the Grafana database to end up in a "locked" state, but storage issues are the most common cause. Problems at the storage layer can make Grafana's PVC intermittently unavailable, which in turn can corrupt the data.

In this example, a disk I/O error is seen after the Grafana pod restarts.

level=error msg="alert migration failure: could not get migration log" error="failed to check table existence: disk I/O error: input/output error" 
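To confirm the pod status and capture these logs, standard kubectl commands can be used. The kasten-io namespace is the default K10 install location; substitute your own namespace and the actual Grafana pod name if they differ:

# List K10 pods and locate the Grafana pod stuck in CrashLoopBackOff
kubectl get pods --namespace kasten-io | grep grafana

# Inspect the logs of the previously crashed container
kubectl logs <grafana-pod-name> --namespace kasten-io --previous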

Explanation:

Data inside Grafana's PVC becomes inaccessible, preventing Grafana from reading or writing any data on the PVC. This causes the database to enter a locked state, which eventually results in data corruption. Checking the logs, the "database is locked" error message can be found:

logger=grafanaStorageLogger t=2022-10-18T11:54:54.837495327Z level=info msg="storage starting" 

logger=http.server t=2022-10-18T11:54:54.843617807Z level=info msg="HTTP Server Listen" address=[::]:3000 protocol=http subUrl=/k10/grafana socket=

logger=sqlstore t=2022-10-18T12:09:54.661747229Z level=info msg="Database locked, sleeping then retrying" error="database is locked" retry=0

logger=infra.usagestats.collector t=2022-11-01T13:58:17.95942258Z level=error msg="Failed to get system stats" error="database is locked"

After the Grafana pod is restarted, it gets into CrashLoopBackOff status, showing a "disk I/O error":

logger=sqlstore t=2022-11-03T09:51:39.975849774Z level=info msg="Connecting to DB" dbtype=sqlite3 

logger=sqlstore t=2022-11-03T09:51:39.97587918Z level=warn msg="SQLite database file has broader permissions than it should" path=/var/lib/grafana/grafana.db mode=-rw-rw---- expected=-rw-r-----

logger=migrator t=2022-11-03T09:51:39.992578712Z level=error msg="alert migration failure: could not get migration log" error="failed to check table existence: disk I/O error: input/output error"

Since the issue is with the data PVC, a rollout restart of the K10 Grafana deployment will not fix the problem, because the PVC will not be recreated.
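Because storage is the suspected root cause, it can also help to inspect the Grafana PVC and its recent events. The PVC name below is a placeholder; list the PVCs first to find the one backing Grafana:

# List PVCs in the K10 namespace and find the Grafana data volume
kubectl get pvc --namespace kasten-io

# Check the PVC details and recent events for storage-layer errors
kubectl describe pvc <grafana-pvc-name> --namespace kasten-io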

 

Solution

The recommended way to recreate the Grafana resources (PVC and deployment) is to run a helm upgrade of K10, first disabling the Grafana instance and then enabling it again. This way the PVC and its data will be recreated.

Note: Any Grafana alerts, dashboards, or other configuration created in K10 Grafana will be lost and will need to be recreated.
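Before running the commands, the currently installed chart version (used for <CURRENT-VERSION> below) can be looked up, assuming K10 was installed as release k10 in the kasten-io namespace:

# Show the installed K10 release and its chart version
helm list --namespace kasten-io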


Execute the helm commands below to disable and re-enable the K10 Grafana instance, removing and recreating the PVC:

# Run helm upgrade to disable Grafana; this removes the Grafana deployment and PVC.
helm upgrade k10 kasten/k10 --namespace=kasten-io --reuse-values --set grafana.enabled=false --version=<CURRENT-VERSION> 
  
# Wait until all K10 pods are in Running state, then run helm upgrade again to re-enable Grafana.
helm upgrade k10 kasten/k10 --namespace=kasten-io --reuse-values --set grafana.enabled=true --version=<CURRENT-VERSION>
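Once the second upgrade completes, the new Grafana pod and a freshly provisioned PVC should appear. As a quick check (the grep pattern assumes the default resource names contain "grafana"):

# Verify the Grafana pod is Running and a new PVC is Bound
kubectl get pods --namespace kasten-io | grep grafana
kubectl get pvc --namespace kasten-io | grep grafana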