FR - Support to appropriately reschedule the task in case a delegate gets disconnected while running a task.
in progress
Teal Alpaca
When a Harness delegate running as a pod in k8s is rescheduled by the k8s control plane in the usual course of cycling nodes, relocating pods for better resource utilization, and so on (all normal and expected activities), end users are met with a "harness delegate disconnect" error and the pipeline simply fails.
We configured failure strategies on the stages and believed this would be sufficient, but it has proven otherwise. According to our Harness sales and engineering reps, this failure mode simply results in "lost tasks".
This needs to be addressed: we expect the failure strategies and the general Harness machinery to handle this and reschedule the task appropriately, failing only when the configured failure strategy is exhausted.
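To make the expected behavior concrete, here is a minimal sketch of rescheduling on disconnect. It is not Harness code: `DelegateDisconnected`, `run_with_failure_strategy`, and `max_retries` are hypothetical names, and delegates are modeled as plain callables.

```python
class DelegateDisconnected(Exception):
    """Hypothetical error raised when the delegate running a task goes away."""


def run_with_failure_strategy(task, delegates, max_retries):
    """Run `task` on a delegate; on disconnect, reschedule it on the next one.

    The task fails only after the initial attempt plus `max_retries`
    reschedules are used up (standing in for an exhausted failure strategy)
    or when no delegates are left to try.
    """
    attempts = 0
    last_error = None
    for delegate in delegates:
        if attempts > max_retries:
            break
        attempts += 1
        try:
            return delegate(task)
        except DelegateDisconnected as err:
            # A lost delegate is not a task failure yet: try the next delegate.
            last_error = err
    raise RuntimeError(f"task failed after {attempts} attempt(s)") from last_error
```

The point of the sketch is that a disconnect consumes one retry from the configured strategy instead of failing the pipeline outright.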
Canny AI
Merged in a post:
Delegate should retain some stateful information about ongoing tasks or have a way for tasks to replay
Controlled Tapir
When running software in the cloud, disruptive events such as network partitions, loss of connectivity, and container restarts are to be expected.
Today, if the delegate container is restarted (because of a software error or some other problem) while tasks are running, those tasks are orphaned once the delegate restarts.
Possible solution 1:
Task state is persisted. This would require the state to be stored somewhere or, if there are multiple delegates, distributed among them. When a delegate restarts, the last state is known and the task can be resumed.
Possible solution 2:
The task itself can be replayed: if there is no new activity after a period of time, the task can attempt to reconnect and replay the last step. Idempotency would need to be considered here.
Possible solution 3:
The task times out and the pipeline run fails instead of being stuck in a running state. This forces the user to take action quickly when the delegate is not recoverable.
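The three options above can be sketched together. This is only an illustration under assumptions, not how the delegate works today: `TaskStore`, `resume_task`, and `is_timed_out` are hypothetical names, and an in-memory dict stands in for whatever durable or distributed store would actually hold task state.

```python
import time


class TaskStore:
    """Hypothetical store that outlives a delegate restart (a dict here)."""

    def __init__(self):
        self.completed_steps = {}  # task_id -> set of finished step names
        self.last_activity = {}    # task_id -> timestamp of last progress

    def mark_done(self, task_id, step, now):
        self.completed_steps.setdefault(task_id, set()).add(step)
        self.last_activity[task_id] = now

    def is_done(self, task_id, step):
        return step in self.completed_steps.get(task_id, set())


def resume_task(store, task_id, steps, now=None):
    """Run only the steps not yet recorded, so a replay does not redo work."""
    now = time.time() if now is None else now
    executed = []
    for name, action in steps:
        if store.is_done(task_id, name):
            continue  # solution 2: replay skips steps that already completed
        action()
        store.mark_done(task_id, name, now)  # solution 1: persist progress
        executed.append(name)
    return executed


def is_timed_out(store, task_id, timeout_seconds, now=None):
    """Solution 3: a task with no recent activity is failed, not left running."""
    now = time.time() if now is None else now
    last = store.last_activity.get(task_id)
    return last is not None and now - last > timeout_seconds
```

Recording each step after it completes keeps the replay safe only if the step itself is idempotent, since a crash between `action()` and `mark_done` would cause that one step to run again.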
Canny AI
Merged in a post:
Enhance Delegate Task Assignment to Handle Disconnected Delegates Efficiently
Printed Armadillo
We have encountered issues where our pipelines become stuck or fail when delegates disconnect unexpectedly during task assignment or task rebroadcasting. The system continues to attempt to assign tasks to these disconnected delegates, causing delays and hindering our deployment processes.
Problem Statement:
Pipelines get stuck when tasks are repeatedly broadcast to delegates that are no longer connected.
There is no immediate detection or handling mechanism for delegates that disconnect during task assignments.
This leads to increased pipeline execution times and requires manual intervention to abort and restart pipelines.
Proposed Solution:
Implement a mechanism to promptly detect when delegates have disconnected during task assignment.
Introduce logic to reassign tasks to available and connected delegates without causing the pipeline to hang.
Optimize the task rebroadcasting process to avoid repeatedly targeting disconnected delegates.
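As a rough illustration of the proposal, the sketch below filters delegates by heartbeat age before assigning a task. All names (`connected_delegates`, `pick_delegate`, `heartbeat_timeout`) are hypothetical and the heartbeat map is an assumption; this does not describe how Harness tracks delegates.

```python
def connected_delegates(last_heartbeat, now, heartbeat_timeout):
    """Return ids of delegates whose last heartbeat is recent enough, sorted."""
    return sorted(
        delegate_id
        for delegate_id, seen_at in last_heartbeat.items()
        if now - seen_at <= heartbeat_timeout
    )


def pick_delegate(last_heartbeat, now, heartbeat_timeout, excluded=()):
    """Pick a connected delegate, skipping any that already dropped the task."""
    candidates = [
        delegate_id
        for delegate_id in connected_delegates(last_heartbeat, now, heartbeat_timeout)
        if delegate_id not in excluded
    ]
    if not candidates:
        # Nothing connected: report it instead of rebroadcasting to dead delegates.
        return None
    return candidates[0]
```

Returning `None` when no delegate qualifies lets the caller fail or queue the task explicitly rather than leaving the pipeline hanging.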
Benefits:
Improved reliability and stability of pipeline executions.
Reduced manual intervention to manage stuck or failed pipelines.
Enhanced efficiency in environments with fluctuating delegate availability.
Use Case:
As a user, when I run a pipeline, I want the system to handle delegate disconnections gracefully so that my pipeline does not get stuck or fail, ensuring smooth and efficient deployments.
This post was marked as in progress
Rohan Gupta
long-term
We can improve this. Lost tasks shouldn't be happening, and we will review this.