When a Harness delegate running as a pod in Kubernetes is rescheduled by the control plane (for example, during routine node cycling or relocation for better resource utilization, all normal and expected activities), end users are met with a "harness delegate disconnect" error and the pipeline simply fails.
We configured failure strategies on the stages and believed this would be sufficient, but it has proven otherwise. In our conversations with our Harness sales and engineering reps, we were told that this failure mode simply results in "lost tasks".
This needs to be addressed. We expect the failure strategies and the general Harness machinery to handle this case and reschedule the task, failing only when the configured failure strategy is exhausted.
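To make the requested behavior concrete, here is a minimal Python sketch of the semantics we have in mind. It is not Harness code and does not use any Harness API: `DelegateDisconnected`, `RetryStrategy`, `run_with_rescheduling`, and `make_flaky_task` are all hypothetical names invented for illustration, and the retry count is only a stand-in for whatever failure strategy is configured on the stage.

```python
from dataclasses import dataclass
from typing import Callable


class DelegateDisconnected(Exception):
    """Hypothetical error: the delegate pod running the task went away."""


@dataclass
class RetryStrategy:
    """Hypothetical stand-in for a stage's configured failure strategy."""
    max_retries: int


def run_with_rescheduling(task: Callable[[], str], strategy: RetryStrategy) -> str:
    """Run the task, rescheduling it whenever its delegate disconnects.

    The disconnect is surfaced to the caller only once the retry
    budget from the failure strategy has been used up.
    """
    retries = 0
    while True:
        try:
            return task()
        except DelegateDisconnected:
            if retries >= strategy.max_retries:
                raise  # strategy exhausted: only now should the pipeline fail
            retries += 1  # otherwise hand the task to another delegate


def make_flaky_task(disconnects: int) -> Callable[[], str]:
    """Build a task that loses its delegate `disconnects` times, then succeeds."""
    remaining = {"count": disconnects}

    def task() -> str:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise DelegateDisconnected("delegate pod was rescheduled")
        return "succeeded"

    return task
```

With a budget of three retries, a task whose delegate is rescheduled twice would still complete via `run_with_rescheduling(make_flaky_task(2), RetryStrategy(max_retries=3))`, whereas today the first disconnect fails the pipeline. If the disconnects outnumber the retry budget, the error propagates as it does now.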
Created by Ankit Kumar
August 7, 2024