[AMORO-4235] Recognize stale optimizer responses on the AMS side#4261
Open
j1wonpark wants to merge 4 commits into
Open
[AMORO-4235] Recognize stale optimizer responses on the AMS side#4261j1wonpark wants to merge 4 commits into
j1wonpark wants to merge 4 commits into
Conversation
When the OptimizerKeeper (or resetStaleTasksForThread on re-poll) resets a task that is still executing, the optimizer's in-flight ack/complete arrives after the task's token has been cleared. TaskRuntime.validThread then threw "Task has been reset or not yet scheduled", surfacing as an ERROR; for completion it also broke the SCHEDULED -> SUCCESS transition (IllegalTaskStateException). The AMS now recognizes stale responses by (token, threadId, status): - ack: rejected outside the transaction so the optimizer skips the obsolete round, without the misleading "failed to commit transaction" error. - complete: ignored gracefully, since the reported run belongs to a torn-down round. This fixes the rejection of valid optimizer responses. The permanent-stuck / uncancelable symptom in the issue could not be reproduced from the excerpt logs and may need the full log to confirm -- hence "Relates to" rather than "Close". Tests: - TestOptimizingQueue: unit reproductions via the pollTask path (multi-task stale ack, single-task stale completion). - TestDefaultOptimizingService: end-to-end reproduction driving the real OptimizerKeeper ack-timeout reset of a live optimizer's task. Relates to apache#4235 Signed-off-by: Jiwon Park <jpark92@outlook.kr>
Completing a task before ack now produces no exception (the AMS absorbs it as a stale response, indistinguishable from a completion for a task reset and re-scheduled to the same thread), so assert the task stays SCHEDULED instead of expecting IllegalTaskStateException. Signed-off-by: Jiwon Park <jpark92@outlook.kr>
Signed-off-by: Jiwon Park <jpark92@outlook.kr>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why are the changes needed?
Tables intermittently get stuck during self-optimizing, and the AMS log is flooded with:
The root cause is that the AMS can reset a task an optimizer is still working on, and then
reject that optimizer's valid response for it.
A task is reset (token cleared, status -> PLANNED) in two places:
OptimizerKeeper, when aSCHEDULEDtask's ack does not arrive withinoptimizer.task-ack-timeout— even though the optimizer is still alive (theSCHEDULED + ackTimeoutbranch ofbuildSuspendingPredicationdoes not check whether theowning token is still active).
resetStaleTasksForThread, when the same optimizer thread polls again while one of its tasksis still
ACKED.After the reset, the optimizer's in-flight ack/complete arrives and
TaskRuntime.validThreadsees
token == null, so it throws. This:PersistentBase"failed to commit transaction" plus the thrift layer), andSCHEDULED -> SUCCESStransition with
IllegalTaskStateException.In other words, a perfectly valid optimizer response is dropped with a noisy error.
The issue also reports the table becoming permanently stuck and uncancelable. That symptom could
not be reproduced from the excerpt logs and may need the full log to confirm, so this PR uses
"Relates to" rather than "Close".
Note: #4239 lowers the client-side log level for the same exception. This PR addresses the
server-side root cause instead, and additionally covers the completion path that #4239 does not.
Brief change log
TaskRuntimenow recognizes a stale response by(token, threadId, status)instead of throwingunconditionally in
validThread:optimizer skips the obsolete round, without the misleading "failed to commit transaction" error.
The exception message is preserved so existing clients still recognize it.
the task will be re-executed in its current round. This also removes the
IllegalTaskStateExceptionvariant and the equivalent race on canceled tasks.A WARN line is logged in both cases (with status and owner) so the situation stays observable.
How was this patch tested?
TestOptimizingQueuevia the realpollTaskpath: multi-task(stale ack is rejected) and single-task (stale completion is ignored).
TestDefaultOptimizingServicedriving the realOptimizerKeeper:a live optimizer (kept alive by heartbeats) leaves a
SCHEDULEDtask unacked past the acktimeout, the keeper resets it, and the late ack is rejected as expected. With the fix reverted,
this test reproduces the issue's exact log sequence and stack trace -- same messages and line
numbers, including the misleading "optimizer is expired" log for an optimizer that is in fact
still alive.
TestOptimizingQueueregression passes.Documentation