[AMORO-4235] Recognize stale optimizer responses on the AMS side#4261

Open
j1wonpark wants to merge 4 commits into apache:master from j1wonpark:optimizer-stale-response-on-reset

@j1wonpark

Contributor

Why are the changes needed?

Tables intermittently get stuck during self-optimizing, and the AMS log is flooded with:

TaskRuntimeException: Task has been reset or not yet scheduled

The root cause is that the AMS can reset a task an optimizer is still working on, and then
reject that optimizer's valid response for it.

A task is reset (token cleared, status -> PLANNED) in two places:

  • OptimizerKeeper, when a SCHEDULED task's ack does not arrive within
    optimizer.task-ack-timeout even though the optimizer is still alive (the
    SCHEDULED + ackTimeout branch of buildSuspendingPredication does not check whether the
    owning token is still active).
  • resetStaleTasksForThread, when the same optimizer thread polls again while one of its tasks
    is still ACKED.
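
The first reset path can be illustrated with a simplified model. This is a sketch under stated assumptions, not the Amoro source: `TaskSnapshot`, `SuspendCheck` and `shouldReset` are hypothetical names, and only the status names and the "ack timeout does not consult the active tokens" behaviour come from the description above.

```java
import java.util.Set;

// Simplified model, not the Amoro source. Only the status names come from the PR text.
enum SketchStatus { PLANNED, SCHEDULED, ACKED }

final class TaskSnapshot {
    final SketchStatus status;
    final String token;       // owning optimizer token, null when the task has no owner
    final long scheduledAt;   // epoch millis at which the task was handed out

    TaskSnapshot(SketchStatus status, String token, long scheduledAt) {
        this.status = status;
        this.token = token;
        this.scheduledAt = scheduledAt;
    }
}

final class SuspendCheck {
    // Hypothetical shape of the pre-fix predicate: the ack-timeout branch fires
    // on its own, so a task owned by a live optimizer can still be reset.
    static boolean shouldReset(
            TaskSnapshot task, Set<String> activeTokens, long now, long ackTimeoutMillis) {
        if (task.token == null) {
            return false; // nothing to reset, the task has no owner
        }
        boolean ownerExpired = !activeTokens.contains(task.token);
        boolean ackTimedOut =
                task.status == SketchStatus.SCHEDULED
                        && now - task.scheduledAt > ackTimeoutMillis;
        return ownerExpired || ackTimedOut;
    }
}
```

In this model, a SCHEDULED task whose token is still in `activeTokens` is reset as soon as the timeout elapses, which is the situation that later produces a stale ack.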

After the reset, the optimizer's in-flight ack/complete arrives and TaskRuntime.validThread
sees token == null, so it throws. This:

  • surfaces as an ERROR, logged both by PersistentBase ("failed to commit transaction") and by the thrift layer, and
  • for a completion that lands on a meanwhile-rescheduled task, breaks the SCHEDULED -> SUCCESS
    transition with IllegalTaskStateException.

In other words, a perfectly valid optimizer response is dropped with a noisy error.
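
The sequence described above (schedule, reset, late ack) can be reproduced in a toy model. The class below is a hypothetical stand-in for TaskRuntime, with `IllegalStateException` used in place of TaskRuntimeException; only the message text and the token == null check are taken from the description.

```java
// Simplified pre-fix model; class and method shapes are assumptions, not Amoro code.
final class PreFixTask {
    enum Status { PLANNED, SCHEDULED, ACKED }

    private String token;          // cleared by reset()
    private int threadId = -1;
    private Status status = Status.PLANNED;

    void schedule(String ownerToken, int ownerThread) {
        this.token = ownerToken;
        this.threadId = ownerThread;
        this.status = Status.SCHEDULED;
    }

    // Models the reset done by the keeper or on re-poll: token cleared, status -> PLANNED.
    void reset() {
        this.token = null;
        this.threadId = -1;
        this.status = Status.PLANNED;
    }

    void ack(String callerToken, int callerThread) {
        validThread(callerToken, callerThread);
        this.status = Status.ACKED;
    }

    // Pre-fix behaviour: a cleared token always throws, even for a caller
    // that legitimately owned the task before the reset.
    private void validThread(String callerToken, int callerThread) {
        if (token == null) {
            throw new IllegalStateException("Task has been reset or not yet scheduled");
        }
        if (!token.equals(callerToken) || threadId != callerThread) {
            throw new IllegalStateException("Task is owned by another optimizer thread");
        }
    }

    Status status() {
        return status;
    }
}
```

Calling `schedule("tok-1", 0)`, then `reset()`, then `ack("tok-1", 0)` on this model throws with the message quoted at the top of this description, and the task stays PLANNED.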

The issue also reports the table becoming permanently stuck and uncancelable. That symptom could
not be reproduced from the excerpt logs and may need the full log to confirm, so this PR uses
"Relates to" rather than "Close".

Note: #4239 lowers the client-side log level for the same exception. This PR addresses the
server-side root cause instead, and additionally covers the completion path that #4239 does not.

Brief change log

TaskRuntime now recognizes a stale response by (token, threadId, status) instead of throwing
unconditionally in validThread:

  • ack: if the task was reset/rescheduled, reject it outside the transaction so the
    optimizer skips the obsolete round, without the misleading "failed to commit transaction" error.
    The exception message is preserved so existing clients still recognize it.
  • complete: if stale, ignore it gracefully — the reported run belongs to a torn-down round and
    the task will be re-executed in its current round. This also removes the
    IllegalTaskStateException variant and the equivalent race on canceled tasks.

A WARN line is logged in both cases (with status and owner) so the situation stays observable.
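
A minimal sketch of the (token, threadId, status) check, assuming a much simpler task object than the real TaskRuntime. `TaskModel`, `StaleResponseException`, `isStale` and the `staleWarnings` counter (standing in for the WARN log line) are hypothetical; the transaction handling is not modelled.

```java
import java.util.Objects;

// Illustrative model only; names and structure are assumptions, not the patch itself.
enum TaskStatus { PLANNED, SCHEDULED, ACKED, SUCCESS }

final class StaleResponseException extends RuntimeException {
    StaleResponseException(String message) {
        super(message);
    }
}

final class TaskModel {
    private String token;
    private int threadId = -1;
    private TaskStatus status = TaskStatus.PLANNED;
    int staleWarnings; // stands in for the WARN log line

    void schedule(String ownerToken, int ownerThread) {
        this.token = ownerToken;
        this.threadId = ownerThread;
        this.status = TaskStatus.SCHEDULED;
    }

    void reset() {
        this.token = null;
        this.threadId = -1;
        this.status = TaskStatus.PLANNED;
    }

    // A response is stale when the caller no longer owns the task
    // or the task is not in the status this response expects.
    private boolean isStale(String callerToken, int callerThread, TaskStatus expected) {
        return !Objects.equals(token, callerToken)
                || threadId != callerThread
                || status != expected;
    }

    // Stale ack: rejected with the original message so existing clients recognize it.
    void ack(String callerToken, int callerThread) {
        if (isStale(callerToken, callerThread, TaskStatus.SCHEDULED)) {
            staleWarnings++;
            throw new StaleResponseException("Task has been reset or not yet scheduled");
        }
        status = TaskStatus.ACKED;
    }

    // Stale completion: ignored; returns false and leaves the current round untouched.
    boolean complete(String callerToken, int callerThread) {
        if (isStale(callerToken, callerThread, TaskStatus.ACKED)) {
            staleWarnings++;
            return false;
        }
        status = TaskStatus.SUCCESS;
        return true;
    }

    TaskStatus status() {
        return status;
    }
}
```

In this model a completion that arrives for a task rescheduled to another owner returns false and the task stays SCHEDULED, while the new owner's ack and completion proceed normally.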

How was this patch tested?

  • Unit reproductions in TestOptimizingQueue via the real pollTask path: multi-task
    (stale ack is rejected) and single-task (stale completion is ignored).
  • End-to-end reproduction in TestDefaultOptimizingService driving the real OptimizerKeeper:
    a live optimizer (kept alive by heartbeats) leaves a SCHEDULED task unacked past the ack
    timeout, the keeper resets it, and the late ack is rejected as expected. With the fix reverted,
    this test reproduces the issue's exact log sequence and stack trace -- same messages and line
    numbers, including the misleading "optimizer is expired" log for an optimizer that is in fact
    still alive.
  • Full TestOptimizingQueue regression passes.

Documentation

  • Does this pull request introduce a new feature? No.

When the OptimizerKeeper (or resetStaleTasksForThread on re-poll) resets a
task that is still executing, the optimizer's in-flight ack/complete arrives
after the task's token has been cleared. TaskRuntime.validThread then threw
"Task has been reset or not yet scheduled", surfacing as an ERROR; for
completion it also broke the SCHEDULED -> SUCCESS transition
(IllegalTaskStateException).

The AMS now recognizes stale responses by (token, threadId, status):
- ack: rejected outside the transaction so the optimizer skips the obsolete
  round, without the misleading "failed to commit transaction" error.
- complete: ignored gracefully, since the reported run belongs to a
  torn-down round.

This fixes the rejection of valid optimizer responses. The permanent-stuck /
uncancelable symptom in the issue could not be reproduced from the excerpt
logs and may need the full log to confirm -- hence "Relates to" rather than
"Close".

Tests:
- TestOptimizingQueue: unit reproductions via the pollTask path (multi-task
  stale ack, single-task stale completion).
- TestDefaultOptimizingService: end-to-end reproduction driving the real
  OptimizerKeeper ack-timeout reset of a live optimizer's task.

Relates to apache#4235

Signed-off-by: Jiwon Park <jpark92@outlook.kr>
@github-actions github-actions Bot added the module:ams-server Ams server module label Jun 25, 2026
Completing a task before ack now produces no exception (the AMS absorbs it
as a stale response, indistinguishable from a completion for a task reset
and re-scheduled to the same thread), so assert the task stays SCHEDULED
instead of expecting IllegalTaskStateException.

Signed-off-by: Jiwon Park <jpark92@outlook.kr>
Signed-off-by: Jiwon Park <jpark92@outlook.kr>