Skip to content

[SPARK-57442][SQL] Add opt-in nullability check to logical plan integrity validation#56502

Closed
cloud-fan wants to merge 3 commits into
apache:masterfrom
cloud-fan:SPARK-oss-nullability-plan-validation
Closed

[SPARK-57442][SQL] Add opt-in nullability check to logical plan integrity validation#56502
cloud-fan wants to merge 3 commits into
apache:masterfrom
cloud-fan:SPARK-oss-nullability-plan-validation

Conversation

@cloud-fan

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This PR adds an opt-in nullability check to logical plan integrity validation.

  • A new internal config spark.sql.planChangeValidation.nullability (default false) gates the check. It is a sub-flag of the existing spark.sql.planChangeValidation.
  • A new LogicalPlanIntegrity.validateNullability walks the plan and flags any operator that references one of its child's output attributes with a narrower nullability than the child produces — i.e. declaring the attribute nullable = false while the child produces it as nullable = true. It is wired into validateOptimizedPlan's existing check chain.
  • References declared wider (nullable over a non-nullable child) are safe and not flagged; references whose ExprId is not produced by any child (e.g. outer references) are ignored.

Why are the changes needed?

When a rule synthesizes operators, every reference to a child output must carry the nullability the child actually produces. A classic miss is an outer-join rewrite that references the null-widened side with its original non-nullable attributes — for example SPARK-56395 in RewriteNearestByJoin, where the synthesized Aggregate referenced the right side (widened to nullable by the synthetic LEFT OUTER join) with its original non-nullable metadata.

Such a mismatch is a latent plan-integrity defect: the rows are unchanged, so it is invisible to result-equivalence reasoning, to validateSchemaOutput (which compares types asNullable), and to green CI. Yet a later optimization that trusts the non-nullable declaration can drop a null check and produce wrong results. There is currently no integrity check that catches this class of defect; this adds one.

The check is disabled by default — even though spark.sql.planChangeValidation defaults to on in test runs — because not all existing rules maintain precise nullability today, and enabling it globally would surface pre-existing benign mismatches. Suites that have been audited (or rules that have been fixed) can opt in to guard against regressions.

Does this PR introduce any user-facing change?

No. The config is internal and disabled by default; the check only runs under plan-change validation when explicitly enabled.

How was this patch tested?

New unit test in LogicalPlanIntegritySuite that, with the config enabled, asserts:

  • a reference declared non-nullable over a nullable child is flagged (and the message names the offending attribute);
  • a faithful reference (same nullability as the child) passes;
  • a wider reference (nullable over a non-nullable child) passes;

and that the check is a no-op when the config is disabled.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8)

This pull request and its description were written by Isaac.

@cloud-fan cloud-fan force-pushed the SPARK-oss-nullability-plan-validation branch from af32b9c to f0617a0 Compare June 14, 2026 23:10

@MaxGekk MaxGekk left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this — the approach is sound: opt-in via an internal sub-flag, narrowing-only, wider-is-safe, and a strictly local parent-vs-child comparison. The implementation matches the design, and I confirmed outer references are correctly ignored (since OuterReference is a LeafExpression, collect never descends into the wrapped attribute). The .version("4.3.0") is correct per the branch-versioning policy.

A few non-blocking items before merge:

  • Doc comments are slightly behind. validateOptimizedPlan's "what we check" bullet list doesn't mention this check (or the existing validateAggregateExpressions).
  • Scope is worth stating explicitly in the scaladoc: only top-level attribute nullability is compared (nested struct/array/map fields are not), subquery plans aren't traversed, and the check is optimizer-only. These are consistent with the existing checks, just undocumented.
  • Test coverage currently exercises only the simplest shape. I'd add a multi-child join case (the actual SPARK-56395 scenario), a reference nested in a compound expression, an offending node below the root, the ignored-ExprId path with the config ON, and the "any-nullable-wins" duplicate-ExprId merge.

Details inline. None of these block the core change; they mostly tighten the docs and harden the test against regressions.

}

/**
* This method validates that no operator references one of its child's output attributes with a

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make the scope limitations explicit in this scaladoc? The check only compares top-level AttributeReference.nullable — nested nullability (struct fields, array elements, map values) is not validated, so a child producing struct<a: long /* nullable */> referenced as struct<a: long /* non-null */> keeps the same top-level nullable and passes. This mirrors validateSchemaOutput (which compares asNullable), so it's consistent, but since that's exactly the "latent nullability defect" class this check targets, it's worth calling out. Same for subquery plans, which aren't reached by plan.collect (they live inside expressions, not as children).

}
plan.collect {
case p if p.childrenResolved =>
val childNullable = p.children.flatMap(_.output).groupBy(_.exprId).map {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: worth a one-line note that this is a strictly local parent-vs-immediate-child check over p.expressions. A narrowing introduced at a pure pass-through boundary (an operator that narrows an attribute it only forwards via output, without re-referencing it in expressions) won't be flagged at that node, and the parent then treats the already-narrowed value as the child's truth. The targeted SPARK-56395 shape is caught because the synthesized references appear in expressions, but making the scoping explicit helps future readers.

.orElse(LogicalPlanIntegrity.validateSchemaOutput(previousPlan, currentPlan))
.orElse(LogicalPlanIntegrity.validateNoDanglingReferences(currentPlan))
.orElse(LogicalPlanIntegrity.validateAggregateExpressions(currentPlan))
.orElse(LogicalPlanIntegrity.validateNullability(currentPlan))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things while we're touching this chain:

  1. The validateOptimizedPlan scaladoc bullet list (just above this method) is now out of date — it lists "is still resolved / special expressions / unique IDs / same schema / no dangling references" but omits both validateAggregateExpressions (pre-existing) and the new validateNullability. Could we add a bullet, e.g. - references each child output attribute with consistent (or wider) nullability, and one for the aggregate check too?
  2. Confirming intended scope: this runs only from validateOptimizedPlan, so nullability defects introduced during analysis (not optimization) aren't caught. Assuming that's deliberate — a one-liner in the config doc would make it clear.

assert(checkIfSameExprIdNotReused(t.select(Alias(a + b, "ab")(exprId = b.exprId))).isDefined)
}

test("Checks if an attribute is referenced as non-nullable over a nullable child") {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start, but this only exercises the simplest shape (one-level Project over a LocalRelation, offending node at the root). Could we extend coverage to the cases that exercise the actual logic and the motivating bug?

  • Multi-child join shape — an Aggregate/operator over a (synthetic LEFT OUTER) Join referencing the null-widened side as non-nullable. This is the SPARK-56395 scenario and the only one that exercises groupBy(_.exprId) across multiple children.
  • Offending reference nested in a compound expression (e.g. narrowedRef + 1L, or wrapped in an Alias/function) to confirm the inner _.collect finds it, not just bare top-level references.
  • Offending node below the root, to confirm the plan.collect traversal rather than root-only inspection.
  • ExprId not produced by any child, config ON (dangling/outer-reference exprId declared non-nullable) → asserts it's ignored. Currently the "ignored" path is only covered implicitly via the config-disabled assertion.
  • Duplicate ExprId across children, "any nullable wins" — the behavior the inline comment specifically claims, currently untested.

…overage

Address review feedback: spell out the scope limitations of
validateNullability (top-level only, strictly local, subqueries not
reached), update the validateOptimizedPlan and config docs to cover the
aggregate and nullability checks and the optimization-only scope, and
add tests for the multi-child outer-join shape, compound expressions,
below-root traversal, ignored non-child references, and the
duplicate-ExprId "any nullable wins" rule.

Co-authored-by: Isaac
@cloud-fan

Copy link
Copy Markdown
Contributor Author

the docker test failure is unrelated, thanks, merging to master/4.x!

@cloud-fan cloud-fan closed this in 5f1f8da Jun 16, 2026
cloud-fan added a commit that referenced this pull request Jun 16, 2026
…rity validation

### What changes were proposed in this pull request?

This PR adds an opt-in nullability check to logical plan integrity validation.

- A new internal config `spark.sql.planChangeValidation.nullability` (default `false`) gates the check. It is a sub-flag of the existing `spark.sql.planChangeValidation`.
- A new `LogicalPlanIntegrity.validateNullability` walks the plan and flags any operator that references one of its child's output attributes with a **narrower** nullability than the child produces — i.e. declaring the attribute `nullable = false` while the child produces it as `nullable = true`. It is wired into `validateOptimizedPlan`'s existing check chain.
- References declared **wider** (nullable over a non-nullable child) are safe and not flagged; references whose `ExprId` is not produced by any child (e.g. outer references) are ignored.

### Why are the changes needed?

When a rule synthesizes operators, every reference to a child output must carry the nullability the child actually produces. A classic miss is an outer-join rewrite that references the null-widened side with its original non-nullable attributes — for example SPARK-56395 in `RewriteNearestByJoin`, where the synthesized `Aggregate` referenced the right side (widened to nullable by the synthetic `LEFT OUTER` join) with its original non-nullable metadata.

Such a mismatch is a latent plan-integrity defect: the rows are unchanged, so it is invisible to result-equivalence reasoning, to `validateSchemaOutput` (which compares types `asNullable`), and to green CI. Yet a later optimization that trusts the non-nullable declaration can drop a null check and produce wrong results. There is currently no integrity check that catches this class of defect; this adds one.

The check is disabled by default — even though `spark.sql.planChangeValidation` defaults to on in test runs — because not all existing rules maintain precise nullability today, and enabling it globally would surface pre-existing benign mismatches. Suites that have been audited (or rules that have been fixed) can opt in to guard against regressions.

### Does this PR introduce _any_ user-facing change?

No. The config is internal and disabled by default; the check only runs under plan-change validation when explicitly enabled.

### How was this patch tested?

New unit test in `LogicalPlanIntegritySuite` that, with the config enabled, asserts:
- a reference declared non-nullable over a nullable child is flagged (and the message names the offending attribute);
- a faithful reference (same nullability as the child) passes;
- a wider reference (nullable over a non-nullable child) passes;

and that the check is a no-op when the config is disabled.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8)

This pull request and its description were written by Isaac.

Closes #56502 from cloud-fan/SPARK-oss-nullability-plan-validation.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 5f1f8da)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants