Export fails with a timeout for large PVCs

Export action fails with a timeout of 45 minutes for larger PVCs

K10 provides a reliable way to export a snapshot to an external repository during policy execution. K10 has a few default timeout settings to ensure that a job does not run forever and affect subsequent policy executions. K10 users need to tune these timeouts based on their environment, workloads, and the bandwidth available for data transfers during the export.


When a large amount of data needs to be exported, the default timeouts for export actions in K10 might not be sufficient.

This document describes how to increase the default timeout for export actions.

Description

By default, K10 applies a 45-minute timeout to the export action. The export fails if the job takes longer than this to complete. This can happen because of a large amount of data, a slow network connection to the external repository, or any other factor that slows the transfer of the files.
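As a rough aid for judging whether 45 minutes is enough, the transfer time can be estimated from the amount of data and the sustained throughput to the repository. The sketch below is only illustrative: the function name estimate_export_minutes and the sample figures (500 GiB at 50 MiB/s) are assumptions, and the estimate ignores deduplication, compression, and any per-job overhead.

```shell
#!/bin/sh
# Hypothetical helper (not part of K10): estimate transfer time in minutes.
# $1 = data size in GiB, $2 = sustained throughput in MiB/s (both assumed inputs).
estimate_export_minutes() {
  size_gib=$1
  rate_mibs=$2
  # Seconds needed, rounded up: (size in MiB) / (MiB per second)
  seconds=$(( (size_gib * 1024 + rate_mibs - 1) / rate_mibs ))
  # Convert to minutes, rounded up
  echo $(( (seconds + 59) / 60 ))
}

# Example: 500 GiB at 50 MiB/s
estimate_export_minutes 500 50   # prints 171
```

In this example the estimate (171 minutes) is well above the 45-minute default, so the timeout would need to be raised with some headroom on top of the estimate.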


In the export action details, the message below indicates that K10 hit the timeout for the operation (waitWithBackoffWithRetries), with the duration field showing approximately 45 minutes.

errors:
    - cause: '{"cause":{"cause":{"cause":{"cause":{"cause":{"message":"context
Deadline exceeded"},"function":"kasten.io/k10/kio/poll.waitWithBackoffWithRetries",
"linenumber":78,"message":"Context done while polling"},
"fields":[{"name":"duration","value":"44m59.989920898s"}
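If the action details are not at hand in the dashboard, one possible way to retrieve them is with kubectl, assuming the export actions are exposed as exportactions resources in the actions.kio.kasten.io API group. The ACTION_NAME and ACTION_NAMESPACE values below are placeholders to be replaced with the ones shown in the listing.

```shell
# List export actions in all namespaces (assumes the K10 actions API is available)
kubectl get exportactions.actions.kio.kasten.io --all-namespaces

# Placeholders: replace with the name and namespace of the failed action
ACTION_NAME="my-export-action"
ACTION_NAMESPACE="my-app-namespace"

# Show the full action, including its status and error fields
kubectl get exportactions.actions.kio.kasten.io "$ACTION_NAME" \
  --namespace "$ACTION_NAMESPACE" --output yaml
```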

Resolution

The issue can be resolved by increasing the value of the KanisterBackupTimeout parameter, which is set through the kanister.backupTimeout Helm value.

The command below saves the Helm values currently used by the installation to a file, then runs a helm upgrade on the K10 installation that increases the timeout to a specific number of minutes, for example 150. We recommend starting with a higher value and fine-tuning it based on the environment.

helm get values k10 --output yaml --namespace=kasten-io > k10_val.yaml && \
helm upgrade k10 kasten/k10 --namespace=kasten-io -f k10_val.yaml \
--set kanister.backupTimeout=150 --version=<current-k10-version>

After the Helm upgrade, validate the timeout value in the `k10-config` ConfigMap.
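A minimal way to do this check, assuming K10 is installed in the kasten-io namespace as in the upgrade command and that the ConfigMap key carries the parameter name KanisterBackupTimeout:

```shell
# Print the timeout entry from the k10-config ConfigMap.
# The key name is assumed to match the parameter name; the match is case-insensitive.
kubectl get configmap k10-config --namespace=kasten-io --output yaml \
  | grep -i 'KanisterBackupTimeout'
```

The value shown should match the one passed to --set kanister.backupTimeout.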