Skip to content

Make previous identity endorsement fetching robust to the ledger gaps#7913

Closed
maxtropets wants to merge 5 commits into
microsoft:mainfrom
maxtropets:f/partial-chain-receipts
Closed

Make previous identity endorsement fetching robust to the ledger gaps#7913
maxtropets wants to merge 5 commits into
microsoft:mainfrom
maxtropets:f/partial-chain-receipts

Conversation

@maxtropets

@maxtropets maxtropets commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

Missing previous-identity endorsement ledger chunks no longer block node startup. The subsystem serves what it has validated, and callers decide when to retry.

                ┌──────────[ctor: fetch_first runs]──────────┐
                │                                             │
                ▼                                             │
            ┌─────────┐    trigger_extension()                │
            │ Partial │◄────────────────────────  caller-driven retry
            │         │       (one cycle in flight, CAS-gated)
            └────┬─┬──┘
                 │ │
   chain walks   │ └── chain integrity violation ──► ┌────────┐
   to self ─────►│                                   │ Failed │  (terminal)
                 ▼                                   └────────┘
             ┌──────┐
             │ Done │  (terminal)
             └──────┘

  Readers in Partial: serve validated prefix; nullopt below earliest
                      validated seqno. Caller can trigger_extension().
  Readers in Failed:  same shape (no throw). Status signals "do not retry".
  Readers in Done:    full chain.
  • New public API: FetchStatus::Partial and NetworkIdentitySubsystemInterface::trigger_extension().
  • State machine: subsystem starts in Partial; cycles end in Partial (recoverable) or Done/Failed (terminal). Retry enum value removed.
  • Bounded retries: 30 × 100ms per chunk fetch, 30 × 1s for the initial node-state read; on exhaustion → Partial.
  • Caller-driven recovery: historical-queries adapter calls trigger_extension() whenever a 202 is due to a partial chain.
  • Incremental validation: each link verified at insertion; the whole-chain check + batched build_trusted_keys are gone.
  • Reader contract: no throws; nullopt/nullptr/empty when the chain doesn't (yet) cover the request. IdentityHistoryNotFetched removed.
  • Dependency injection: subsystem now takes INodeStateAccessor + IHistoricalStateAccessor + TaskScheduler (production wrappers in network_identity_accessors_impl.h); enables a 35-case unit test suite that drives the state machine with mocks.
  • e2e: recovery_chain_heals exercises the full healing flow over 3 recoveries with a moved ledger chunk and asserts the trusted-keys count grows after trigger_extension().

@maxtropets maxtropets self-assigned this Jun 2, 2026
@maxtropets maxtropets force-pushed the f/partial-chain-receipts branch from 822a052 to 19a8272 Compare June 2, 2026 13:23
@maxtropets maxtropets added the run-long-test Run Long Test job label Jun 2, 2026
@maxtropets maxtropets force-pushed the f/partial-chain-receipts branch from 19a8272 to f96a87b Compare June 2, 2026 13:47
@achamayou achamayou added this to the 7.0.4 milestone Jun 3, 2026
@maxtropets maxtropets force-pushed the f/partial-chain-receipts branch 2 times, most recently from 43d3551 to 02234aa Compare June 3, 2026 16:54
@maxtropets maxtropets force-pushed the f/partial-chain-receipts branch from 02234aa to 30fba92 Compare June 3, 2026 17:05
@maxtropets maxtropets marked this pull request as ready for review June 3, 2026 17:06
@maxtropets maxtropets requested a review from a team as a code owner June 3, 2026 17:06
Copilot AI review requested due to automatic review settings June 3, 2026 17:06
@maxtropets maxtropets changed the title [WIP] Make previous identity endorsement fetching robust to the ledger gaps Make previous identity endorsement fetching robust to the ledger gaps Jun 3, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates CCF’s network identity subsystem to tolerate gaps in historical ledger availability when fetching previous-identity endorsements, so node startup can proceed with a validated prefix and callers can explicitly request further fetch attempts as missing chunks reappear.

Changes:

  • Introduces a FetchStatus::Partial state and a new NetworkIdentitySubsystemInterface::trigger_extension() API for caller-driven extension attempts.
  • Refactors fetching/validation into an incremental state machine with bounded retries and injected accessors/scheduler to enable focused unit testing.
  • Adds both unit-test coverage for the state machine and an end-to-end recovery test that simulates missing/restored ledger chunks.

Custom instructions used

  • .github/copilot-instructions.md
  • .github/instructions/reviewing.instructions.md
  • .github/instructions/changelog.instructions.md

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/recovery.py Adds an e2e test that moves/restores a ledger chunk and asserts partial→healed endorsement behavior.
src/node/test/network_identity_subsystem.cpp New unit test suite driving the subsystem state machine with mocks/fake scheduler.
src/node/rpc/network_identity_subsystem.h Implements Partial/Done/Failed state machine, incremental validation, and trigger-driven extension.
src/node/rpc/network_identity_chain_helpers.h Adds pure helper predicates for chain connectivity/front-connection validation.
src/node/rpc/network_identity_accessors.h Defines narrow injection interfaces for live/historical reads and task scheduling.
src/node/rpc/network_identity_accessors_impl.h Provides production adapters over AbstractNodeState, StateCacheImpl, and ccf::tasks.
src/node/historical_queries_utils.cpp Triggers trigger_extension() on nullopt chain reads and returns 202 behavior to callers.
include/ccf/network_identity_interface.h Updates public API: adds Partial, removes Retry/exception-based contract, adds trigger_extension().
CMakeLists.txt Adds the new unit test target.
CHANGELOG.md Documents the user-visible behavior change and new API.

Comment thread src/node/rpc/network_identity_subsystem.h
Comment thread tests/recovery.py
Comment thread src/node/rpc/network_identity_subsystem.h
Comment thread CMakeLists.txt

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comment thread src/node/rpc/network_identity_subsystem.h
Comment thread src/node/rpc/network_identity_subsystem.h
Comment thread src/node/rpc/network_identity_subsystem.h Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

Retry, ///< Fetching should be retried
Done, ///< Fetching completed successfully
Failed ///< Fetching failed
Done, ///< Fetching trusted identities completed successfully

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is problematic in two ways:

  1. it clearly is a public API break
  2. the values are re-ordered, which is risky at best

Does the FetchStatus need to be in a public header? If it must, can we at least make the order Partial, Done, Failed?

@maxtropets

Copy link
Copy Markdown
Collaborator Author

Closed in favour of #7922

@maxtropets maxtropets closed this Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-long-test Run Long Test job

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants