FR - Support to appropriately reschedule the task in case a delegate gets disconnected while running a task.

in progress

Teal Alpaca

When a Harness delegate running as a pod in Kubernetes (k8s) is rescheduled by the k8s control plane in the usual course of cycling nodes, relocating pods for better resource utilization, and so on (all normal and expected activities), end users are met with a "harness delegate disconnect" error and the pipeline simply fails.
We configured failure strategies on the stages and believed this would be sufficient, but it has proven otherwise. According to our Harness sales and engineering reps, this failure mode simply results in "lost tasks".
This needs to be addressed: we expect the failure strategies and the general Harness machinery to handle this and reschedule the task, failing only when the configured failure strategy is exhausted.
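
To make the expected behavior concrete, here is a minimal Python sketch of "reschedule on disconnect, fail only when the strategy is exhausted". It is not Harness code: `DelegateDisconnected`, `RetryStrategy`, and `run_with_reschedule` are hypothetical names used only for illustration, and delegates are modeled as plain callables.

```python
from dataclasses import dataclass


class DelegateDisconnected(Exception):
    """Hypothetical error: the delegate went away while running a task."""


@dataclass
class RetryStrategy:
    """Hypothetical stand-in for a stage-level failure strategy."""
    max_retries: int


def run_with_reschedule(task, delegates, strategy):
    """Run the task, moving to the next delegate after each disconnect.

    Returns (result, attempts). Raises RuntimeError only once the
    initial attempt plus max_retries retries have all been used up.
    """
    last_error = None
    for attempt in range(strategy.max_retries + 1):
        # Rotate through the delegate pool so a retry lands elsewhere.
        delegate = delegates[attempt % len(delegates)]
        try:
            return delegate(task), attempt + 1
        except DelegateDisconnected as err:
            last_error = err  # retriable: the task is rescheduled, not lost
    raise RuntimeError("failure strategy exhausted") from last_error


# Illustrative delegates: one whose pod is evicted mid-task, one healthy.
def evicted_delegate(task):
    raise DelegateDisconnected("pod rescheduled by the control plane")


def healthy_delegate(task):
    return f"done:{task}"


result, attempts = run_with_reschedule(
    "deploy", [evicted_delegate, healthy_delegate], RetryStrategy(max_retries=2)
)
```

In this sketch a disconnect consumes one retry rather than failing the pipeline outright, which is the behavior requested above.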

Created by Ankit Kumar

August 7, 2024

Autopilot

Merged in a post:

Delegate should retain some stateful information about ongoing tasks or have a way for tasks to replay

Controlled Tapir

When running software in the cloud, disruptive events such as network partitions, lost connectivity, and container restarts are to be expected.
Today, if the delegate container is restarted (because of a software error or some other problem) while tasks are running, those tasks are orphaned once the delegate comes back up.
Possible solution 1:
Make task information stateful. This requires storing the state somewhere or, if there are multiple delegates, distributing it among them. When a delegate restarts, the last state is known and the task can be resumed.
Possible solution 2:
Replay the task itself. If there is no new activity after a period of time, the task can attempt to reconnect and replay the last step. Idempotency would need to be considered here.
Possible solution 3:
The task times out and the pipeline run fails instead of being stuck in a running state. This forces the user to take action quickly when the delegate is not recoverable.
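
The three options above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions, not how the delegate works today: `TaskRecord`, `run_steps`, and `expire_if_orphaned` are hypothetical names, the "persisted" state is an in-memory object, and each step is assumed to be safe to skip once it has been recorded as completed.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TaskRecord:
    """Hypothetical task state that would live outside the delegate."""
    task_id: str
    steps: list                                   # ordered step names
    completed: set = field(default_factory=set)   # checkpoint (solution 1)
    last_heartbeat: float = 0.0
    status: str = "RUNNING"


def run_steps(record, execute, now=time.monotonic):
    """Run the remaining steps, skipping checkpointed ones (solution 2)."""
    for step in record.steps:
        if step in record.completed:
            continue  # already done before the restart; do not repeat it
        execute(step)
        record.completed.add(step)       # checkpoint after each step
        record.last_heartbeat = now()
    record.status = "SUCCEEDED"


def expire_if_orphaned(record, timeout_s, now=time.monotonic):
    """Fail a running task whose delegate went silent (solution 3)."""
    if record.status == "RUNNING" and now() - record.last_heartbeat > timeout_s:
        record.status = "FAILED"
    return record.status


# Illustrative run: the delegate restarts once while executing "apply".
executed = []
crash_once = {"armed": True}


def execute(step):
    if step == "apply" and crash_once["armed"]:
        crash_once["armed"] = False
        raise ConnectionError("delegate restarted")
    executed.append(step)


record = TaskRecord("t1", ["fetch", "apply", "verify"])
try:
    run_steps(record, execute)
except ConnectionError:
    pass                      # task is orphaned here today
run_steps(record, execute)    # resumed from the checkpoint on a new delegate
```

The replay only repeats the step that was interrupted, which is where the idempotency concern from solution 2 applies: "apply" may have partially run before the restart.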

December 25, 2024

Autopilot

Merged in a post:

Enhance Delegate Task Assignment to Handle Disconnected Delegates Efficiently

Printed Armadillo

We have encountered issues where our pipelines become stuck or fail when delegates disconnect unexpectedly during task assignment or task rebroadcasting. The system continues to attempt to assign tasks to these disconnected delegates, causing delays and hindering our deployment processes.
Problem Statement:
Pipelines get stuck when tasks are repeatedly broadcast to delegates that are no longer connected.
There is no immediate detection or handling mechanism for delegates that disconnect during task assignments.
This leads to increased pipeline execution times and requires manual intervention to abort and restart pipelines.
Proposed Solution:
Implement a mechanism to promptly detect when delegates have disconnected during task assignment.
Introduce logic to reassign tasks to available and connected delegates without causing the pipeline to hang.
Optimize the task rebroadcasting process to avoid repeatedly targeting disconnected delegates.
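
As a rough illustration of the proposal, the following Python sketch filters the delegate pool by heartbeat freshness before assigning a task, so that stale delegates are never targeted. The `Delegate` type, the `connected` and `assign` functions, and the 30-second threshold are all assumptions made for this example and do not reflect the actual Harness implementation.

```python
from dataclasses import dataclass


@dataclass
class Delegate:
    """Hypothetical view of a delegate as seen by the task assigner."""
    name: str
    last_heartbeat: float  # seconds, on the same clock as `now`


def connected(delegates, now, max_silence_s=30.0):
    """Keep only delegates that have sent a heartbeat recently enough."""
    return [d for d in delegates if now - d.last_heartbeat <= max_silence_s]


def assign(task_id, delegates, now, max_silence_s=30.0):
    """Return (task_id, delegate name), or None if no delegate is connected.

    Returning None lets the caller fail fast or queue the task instead of
    rebroadcasting to delegates that are no longer there.
    """
    live = connected(delegates, now, max_silence_s)
    if not live:
        return None
    # Prefer the delegate heard from most recently.
    chosen = max(live, key=lambda d: d.last_heartbeat)
    return task_id, chosen.name


pool = [Delegate("delegate-a", 10.0), Delegate("delegate-b", 95.0)]
assignment = assign("task-1", pool, now=100.0)
```

Here `delegate-a` has been silent for 90 seconds and is skipped, so the task goes to `delegate-b`.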
Benefits:
Improved reliability and stability of pipeline executions.
Reduced manual intervention to manage stuck or failed pipelines.
Enhanced efficiency in environments with fluctuating delegate availability.
Use Case:
As a user, when I run a pipeline, I want the system to handle delegate disconnections gracefully so that my pipeline does not get stuck or fail, ensuring smooth and efficient deployments.

November 25, 2024

This post was marked as in progress

Rohan Gupta marked this post as long-term

We can improve this. Lost tasks shouldn't be happening, and we will review this.