Fix MIGRATE REGION falsely reported complete when ConfigNode leader switches during AddRegionPeer#17908
Fix MIGRATE REGION falsely reported complete when ConfigNode leader switches during AddRegionPeer#17908CRZbulabula wants to merge 5 commits into
Conversation
When the ConfigNode leader is gracefully stopped (or loses leadership) while AddRegionPeerProcedure is waiting for the coordinator DataNode's AddRegionPeer task to finish, RegionMaintainHandler.waitTaskFinish() is interrupted and returns PROCESSING. The DO_ADD_REGION_PEER state previously treated PROCESSING as a no-op terminal state (return Flow.NO_MORE_STATE), silently ending the AddRegionPeerProcedure without success or rollback. The parent RegionMigrateProcedure had already persisted at CHECK_ADD_REGION_PEER, so the new leader resumed there directly. Its isDataNodeContainsRegion() check only inspects the partition table's location list, which is written at CREATE_NEW_REGION_PEER (long before the snapshot finishes transferring). It therefore passed, the source replica was removed, and the migration was declared a success while the destination replica was still in Adding state and had not received the snapshot. Queries against the region returned incorrect results during the gap (observed: ~17 min until the destination became active). Fix: in the PROCESSING branch, stay at DO_ADD_REGION_PEER and persist it (HAS_MORE_STATE) instead of ending. After recovery the new leader re-enters DO_ADD_REGION_PEER and re-polls the coordinator task until it truly reaches SUCCESS or FAIL. The re-poll is idempotent: the isStateDeserialized() guard skips re-submitting after a restart, and the coordinator DataNode dedups by taskId (putIfAbsent) even on a same-process re-execute, so the AddRegionPeer task is never submitted twice. If the coordinator crashed and lost its task table, the poll returns TASK_NOT_EXIST and falls through to the existing FAIL/rollback path. Add cnLeaderSwitchDuringDoAddPeerTest for each consensus protocol (IoTConsensus, IoTConsensusV2 batch/stream, Ratis). Existing daily ConfigNode-crash ITs all use stopForcibly() (SIGKILL), which kills the process before it can run the PROCESSING branch; the new test uses a graceful stop() (SIGTERM) of the leader among 3 ConfigNodes so the interrupted PROCESSING path is actually exercised across a real leader switch.
9047875 to
b5fda9d
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #17908 +/- ##
============================================
+ Coverage 40.82% 41.08% +0.25%
Complexity 318 318
============================================
Files 5245 5249 +4
Lines 363203 364234 +1031
Branches 46780 47044 +264
============================================
+ Hits 148287 149637 +1350
+ Misses 214916 214597 -319 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
The temporary @category({DailyIT.class, ClusterIT.class}) tags were only to validate the new tests on the per-PR Cluster IT pipeline. That run is green, so narrow them back to the class-level DailyIT category (remove the method-level @category, the temporary comment, and the now-unused ClusterIT import).
|
我看了这次修复,主要有两个点建议再处理一下:
|
…cise PROCESSING Two issues raised in review of the leader-switch fix: 1. Returning HAS_MORE_STATE for the same DO_ADD_REGION_PEER state makes ProcedureExecutor.executeProcedure re-execute the procedure in place (subprocs[0] == proc -> reExecute = true). On shutdown / leader switch the worker is interrupted, waitTaskFinish() returns PROCESSING, and the procedure parks at DO_ADD_REGION_PEER again. Because stopExecutor() runs executor.stop() and executor.join() before store.stop(), store.isRunning() is still true while join() waits for this worker, so the inner reExecute loop would spin (re-submitting / re-polling) and join() would hang. Fix (framework layer): the inner loop now also checks the executor's own running flag - `if (!isRunning() || !store.isRunning()) return;` - so the worker exits cleanly when the executor is stopping. The persisted state lets the next leader resume. Hardening (procedure layer): only submit the AddRegionPeerTask when getCycles() == 0 (first entry) in addition to the existing isStateDeserialized() guard, so an in-place re-entry never re-submits. 2. cnLeaderSwitchDuringDoAddPeerTest used the DO_ADD_REGION_PEER kill point, which fires after the task is submitted but before waitTaskFinish() starts polling. If AddPeer finished quickly the procedure took the SUCCESS branch and the new PROCESSING branch was never exercised (confirmed: the prior CI run's logs contain no "returns PROCESSING"). Fix: add a RegionMaintainKillPoints.WAIT_TASK_FINISH_POLLING kill point fired from inside waitTaskFinish() once the first poll confirms the task is still PROCESSING - i.e. when the worker is provably blocked in the wait. The graceful stop then deterministically interrupts waitTaskFinish(). The test registers this kill point and additionally asserts a ConfigNode log contains "returns PROCESSING", so it fails loudly if the branch is skipped.
|
Thanks @Caideyipi, both points are valid — fixed in 6dd498e. 1. Re-execution loop on shutdown. You're right that Fix at the framework layer: the inner loop now also checks the executor's own running flag — 2. The test could miss the PROCESSING branch. Correct, and worse than theoretical — I checked the logs of the previous (green) CI run and there is no Fix: I added a I'll re-run the tests on the per-PR Cluster IT pipeline to confirm the PROCESSING branch is now actually hit before finalizing. |
Tag the four cnLeaderSwitchDuringDoAddPeerTest methods with ClusterIT (in addition to DailyIT) so the per-PR Cluster IT pipeline runs them and we can confirm the new WAIT_TASK_FINISH_POLLING kill point deterministically drives the PROCESSING branch (asserted via the new ConfigNode-log check). This will be reverted to DailyIT-only before merge.
The per-PR Cluster IT run showed cnLeaderSwitchDuringDoAddPeerTest failing on
assertSomeConfigNodeLogContains("returns PROCESSING") for all four consensus
protocols, even though the migration itself completed correctly. The
WAIT_TASK_FINISH_POLLING kill point did fire (confirmed on disk), i.e. the
worker reached waitTaskFinish() and was interrupted by the graceful stop - but
the "returns PROCESSING" line is logged during the ConfigNode shutdown
sequence, so logback's async appender is often already torn down and the line
never reaches disk. Asserting on a log line emitted during shutdown is
inherently racy.
The assertion is also redundant: generalTestWithAllOptions already calls
checkKillPointsAllTriggered(), which fails the test unless the
WAIT_TASK_FINISH_POLLING kill point fired - and that kill point is emitted only
from inside waitTaskFinish() after the first poll confirms the task is still
PROCESSING. So the framework already guarantees the worker was blocked in the
wait when the leader was stopped, i.e. the interrupted-PROCESSING branch was
exercised. Combined with the migration-success check, that is a reliable and
stronger guarantee than scanning for a shutdown-time log line.
Remove assertSomeConfigNodeLogContains and its helper/imports.
|
Caideyipi
left a comment
There was a problem hiding this comment.
I re-checked the current head (9cc0f63). The core fix looks sound to me: the PROCESSING branch now stays in DO_ADD_REGION_PEER, re-submission is guarded by isStateDeserialized()/getCycles(), and ProcedureExecutor exits the in-place reExecute loop when the executor is stopping.
One thing still needs cleanup before approval: the four new cnLeaderSwitchDuringDoAddPeerTest methods are still temporarily categorized as ClusterIT, with comments saying they will be narrowed back to DailyIT-only before merge:
- integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv1/IoTDBRegionMigrateConfigNodeCrashIoTV1IT.java
- integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv2/batch/IoTDBRegionMigrateConfigNodeCrashIoTV2BatchIT.java
- integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv2/stream/IoTDBRegionMigrateConfigNodeCrashIoTV2StreamIT.java
- integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/ratis/IoTDBRegionMigrateConfigNodeCrashForRatisIT.java
The PR description also says these tags have already been reverted, but the code still imports ClusterIT and uses @category({DailyIT.class, ClusterIT.class}). Please remove the temporary ClusterIT category/import/comment, then I think this is ready to approve.



Problem
When the ConfigNode leader is gracefully stopped — or otherwise loses leadership — while
AddRegionPeerProcedureis inDO_ADD_REGION_PEERwaiting for the coordinator DataNode's AddRegionPeer task to finish,MIGRATE REGIONcould be reported as successful while the destination replica had not actually received the snapshot, and the source replica was already removed. Queries against the region returned incorrect results during the gap.Reproduction (field case)
MIGRATE REGION 1 FROM 3 TO 4, then stop the current ConfigNode leader right after the source replica started transmitting the snapshot (an 8.85 GB region over IoTConsensus). Observed:[MigrateRegion] successandshow migrationswas already empty;show regionsshowed the destination replica as Running/Leader;During that window the source replica was already deleted while the destination was still
Adding, so query results were wrong.Root cause
RegionMaintainHandler.waitTaskFinish()returnsPROCESSINGonly when its polling loop is interrupted by anInterruptedException(ConfigNode shutdown / leadership loss). A userCANCELor a coordinator disconnection both go through theFAILbranch instead, soPROCESSINGunambiguously means "interrupted by shutdown/leader switch".The old
DO_ADD_REGION_PEERcode treatedPROCESSINGas a terminal no-op (return Flow.NO_MORE_STATE), silently endingAddRegionPeerProcedurewith neither success nor rollback. The parentRegionMigrateProcedurehad already persisted atCHECK_ADD_REGION_PEER, so the new leader resumed there directly. That check usesisDataNodeContainsRegion(), which only inspects the partition table's location list — and the location is written back atCREATE_NEW_REGION_PEER, long before the snapshot finishes transferring. So the check passed, the source replica was removed, and the migration was declared complete prematurely.Fix
In the
DO_ADD_REGION_PEERPROCESSINGbranch, stay atDO_ADD_REGION_PEERand persist it (HAS_MORE_STATE) instead of ending. After recovery, the new leader re-entersDO_ADD_REGION_PEERand re-polls the coordinator task until it truly reachesSUCCESSorFAIL, so the parent only advances toREMOVE_REGION_PEERonce the destination replica is genuinely added.The re-poll is idempotent, so the AddRegionPeer task is never submitted twice:
isStateDeserialized()guard skips re-submitting the task and only re-polls;taskId(taskResultMap.putIfAbsent), so a duplicate submit is a no-op;TASK_NOT_EXIST, which falls through to the existingFAIL/ rollback path (safe degradation).The
waitTaskFinish() returns PROCESSING ...log message (en + zh) is updated to reflect the new behavior.Tests
Adds
cnLeaderSwitchDuringDoAddPeerTestfor each consensus protocol — IoTConsensus, IoTConsensusV2 (batch & stream), and Ratis.The existing daily ConfigNode-crash ITs all use
stopForcibly()(SIGKILL), which kills the process before it can run thePROCESSINGbranch, so this path was never covered. The new test introduces a gracefulstop()(SIGTERM) action and stops the leader among 3 ConfigNodes, so the interruptedPROCESSINGpath is actually exercised across a real leader switch, and asserts the migration still completes correctly (destination Running, source removed).These tests are tagged
@Category({DailyIT.class})(they run in the daily IT pipeline). They were already validated on CI: during this PR they were temporarily also taggedClusterITso the per-PRCluster ITpipeline would run them, and all four passed (eachTests run: 1, Failures: 0, Errors: 0, with the graceful-stop / leader-switch path confirmed firing in the logs). That temporary tag has since been reverted, so they are nowDailyIT-only.