
The Scheduler's Redemption: From Patching After the Fact to Planning Ahead


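What Happened

A seemingly minor fix landed in the repository today: the scheduler can now detect whether the workspace still exists before an agent sets off. If the workspace has been physically deleted (for example, the worktree was pruned during branch cleanup), the system no longer lets the agent run a full lap before reporting an error. It fails fast and marks the error as non-retryable.

Two changes:

  1. Added an early existence check in executeIssueWithEmployee
  2. Marked employee_executor_workspace_missing as non-retryable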
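
The first change can be pictured roughly as follows. This is a minimal sketch, not the actual patch: the error code employee_executor_workspace_missing comes from the post, while ExecutorError, assertWorkspaceExists, dispatchWithWorkspaceCheck and their signatures are hypothetical names made up for illustration. It assumes a Node.js runtime where the workspace is a directory on local disk.

```typescript
import { statSync } from "node:fs";

// Hypothetical error type; only the error code string is taken from the post.
class ExecutorError extends Error {
  readonly code: string;
  readonly retryable: boolean;

  constructor(code: string, retryable: boolean, message: string) {
    super(message);
    this.name = "ExecutorError";
    this.code = code;
    this.retryable = retryable;
  }
}

// Throws before any agent work starts if the workspace directory is gone.
function assertWorkspaceExists(workspacePath: string): void {
  // throwIfNoEntry: false makes statSync return undefined for a missing path.
  const stats = statSync(workspacePath, { throwIfNoEntry: false });
  if (!stats?.isDirectory()) {
    throw new ExecutorError(
      "employee_executor_workspace_missing",
      false, // non-retryable: retrying cannot bring a pruned worktree back
      `workspace not found: ${workspacePath}`,
    );
  }
}

// Hypothetical stand-in for the dispatch entry point: check first, then run.
function dispatchWithWorkspaceCheck(
  workspacePath: string,
  runAgent: (dir: string) => string,
): string {
  assertWorkspaceExists(workspacePath);
  return runAgent(workspacePath);
}
```

The point of the ordering is that runAgent is never invoked when the directory is missing, so the failure costs one stat call instead of a full agent run.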

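How Serious Was It

Not huge, but not trivial either.

Until now, issue-141 had been running with this bug the whole time: on every attempt the agent was dispatched, ran a full lap only to discover the workspace was gone, retried, ran another lap, and discovered it was gone again. It burned time and wore down confidence.

It is a lot like your mom sending you to get something from the fridge, and you only find out the fridge has been moved away once you reach the kitchen. Then your mom says, "Go try again."

Now it is fixed: check before setting off, and say so plainly if it won't work.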

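Reflection: The Inertia of Reactive Defense

This fix got me thinking: why did this problem exist for so long?

The scheduler's earlier design logic was to trust that the workspace is always there and let the runtime report any problem. That optimistic assumption is fine during development, but once you are in production, branches get cleaned up, worktrees get pruned, and accidents happen.

We keep adding patches only after a problem occurs, instead of accounting for the "it might not exist" case at design time.

That is technical debt, and it is also lazy thinking.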


English Version

The Fix

A small but meaningful change landed today: the scheduler now checks if a workspace actually exists before dispatching an agent. If the worktree has been physically removed (e.g., during branch cleanup or pruning), the system fails fast instead of letting the agent waste a full execution cycle.

Two changes:

  1. Added an early workspace existence check in executeIssueWithEmployee
  2. Marked employee_executor_workspace_missing as non-retryable

Why It Matters

Issue-141 was spinning with this bug for a while: the agent gets dispatched, runs, fails, retries, and fails again. It wasted time and eroded confidence.
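
To make the retry loop concrete, here is a minimal sketch of a retry policy that stops as soon as it sees a non-retryable error code. Only the code employee_executor_workspace_missing is named in this post; NON_RETRYABLE_CODES, AttemptResult and runWithRetry are hypothetical names, and the real scheduler's retry logic may look quite different.

```typescript
// Hypothetical registry; only this one code is named in the post.
const NON_RETRYABLE_CODES = new Set<string>([
  "employee_executor_workspace_missing",
]);

// Hypothetical result shape for a single execution attempt.
interface AttemptResult {
  ok: boolean;
  errorCode?: string;
}

// Runs `attempt` up to `maxAttempts` times, stopping early on success
// or on an error code that retrying cannot fix.
function runWithRetry(
  attempt: () => AttemptResult,
  maxAttempts: number,
): { result: AttemptResult; attempts: number } {
  let result: AttemptResult = { ok: false };
  let attempts = 0;
  while (attempts < maxAttempts) {
    attempts += 1;
    result = attempt();
    if (result.ok) break;
    const code = result.errorCode;
    if (code !== undefined && NON_RETRYABLE_CODES.has(code)) break;
  }
  return { result, attempts };
}
```

With a policy like this, a missing workspace costs one attempt, while other failures still use the full retry budget.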

Now it's caught early. Like your mom asking you to grab something from the fridge, but she checks if the fridge still exists before you walk to the kitchen.

Reflection: The Habit of Reactive Defense

Why did this bug persist so long?

The scheduler was built on an optimistic assumption: workspaces are always there, and the runtime can handle any problems. That works in dev, but production is messier: branches get cleaned, worktrees get pruned, things disappear.

We keep patching after things break, rather than designing for failure from the start.

Technical debt, but also mental debt.
