[SPARK-57442][SQL] Add opt-in nullability check to logical plan integrity validation#56502
[SPARK-57442][SQL] Add opt-in nullability check to logical plan integrity validation#56502cloud-fan wants to merge 3 commits into
Conversation
…rity validation Co-authored-by: Isaac
af32b9c to
f0617a0
Compare
MaxGekk
left a comment
There was a problem hiding this comment.
Thanks for adding this — the approach is sound: opt-in via an internal sub-flag, narrowing-only, wider-is-safe, and a strictly local parent-vs-child comparison. The implementation matches the design, and I confirmed outer references are correctly ignored (since OuterReference is a LeafExpression, collect never descends into the wrapped attribute). The .version("4.3.0") is correct per the branch-versioning policy.
A few non-blocking items before merge:
- Doc comments are slightly behind.
validateOptimizedPlan's "what we check" bullet list doesn't mention this check (or the existingvalidateAggregateExpressions). - Scope is worth stating explicitly in the scaladoc: only top-level attribute nullability is compared (nested struct/array/map fields are not), subquery plans aren't traversed, and the check is optimizer-only. These are consistent with the existing checks, just undocumented.
- Test coverage currently exercises only the simplest shape. I'd add a multi-child join case (the actual SPARK-56395 scenario), a reference nested in a compound expression, an offending node below the root, the ignored-
ExprIdpath with the config ON, and the "any-nullable-wins" duplicate-ExprIdmerge.
Details inline. None of these block the core change; they mostly tighten the docs and harden the test against regressions.
| } | ||
|
|
||
| /** | ||
| * This method validates that no operator references one of its child's output attributes with a |
There was a problem hiding this comment.
Could we make the scope limitations explicit in this scaladoc? The check only compares top-level AttributeReference.nullable — nested nullability (struct fields, array elements, map values) is not validated, so a child producing struct<a: long /* nullable */> referenced as struct<a: long /* non-null */> keeps the same top-level nullable and passes. This mirrors validateSchemaOutput (which compares asNullable), so it's consistent, but since that's exactly the "latent nullability defect" class this check targets, it's worth calling out. Same for subquery plans, which aren't reached by plan.collect (they live inside expressions, not as children).
| } | ||
| plan.collect { | ||
| case p if p.childrenResolved => | ||
| val childNullable = p.children.flatMap(_.output).groupBy(_.exprId).map { |
There was a problem hiding this comment.
Minor: worth a one-line note that this is a strictly local parent-vs-immediate-child check over p.expressions. A narrowing introduced at a pure pass-through boundary (an operator that narrows an attribute it only forwards via output, without re-referencing it in expressions) won't be flagged at that node, and the parent then treats the already-narrowed value as the child's truth. The targeted SPARK-56395 shape is caught because the synthesized references appear in expressions, but making the scoping explicit helps future readers.
| .orElse(LogicalPlanIntegrity.validateSchemaOutput(previousPlan, currentPlan)) | ||
| .orElse(LogicalPlanIntegrity.validateNoDanglingReferences(currentPlan)) | ||
| .orElse(LogicalPlanIntegrity.validateAggregateExpressions(currentPlan)) | ||
| .orElse(LogicalPlanIntegrity.validateNullability(currentPlan)) |
There was a problem hiding this comment.
Two things while we're touching this chain:
- The
validateOptimizedPlanscaladoc bullet list (just above this method) is now out of date — it lists "is still resolved / special expressions / unique IDs / same schema / no dangling references" but omits bothvalidateAggregateExpressions(pre-existing) and the newvalidateNullability. Could we add a bullet, e.g.- references each child output attribute with consistent (or wider) nullability, and one for the aggregate check too? - Confirming intended scope: this runs only from
validateOptimizedPlan, so nullability defects introduced during analysis (not optimization) aren't caught. Assuming that's deliberate — a one-liner in the config doc would make it clear.
| assert(checkIfSameExprIdNotReused(t.select(Alias(a + b, "ab")(exprId = b.exprId))).isDefined) | ||
| } | ||
|
|
||
| test("Checks if an attribute is referenced as non-nullable over a nullable child") { |
There was a problem hiding this comment.
Good start, but this only exercises the simplest shape (one-level Project over a LocalRelation, offending node at the root). Could we extend coverage to the cases that exercise the actual logic and the motivating bug?
- Multi-child join shape — an
Aggregate/operator over a (syntheticLEFT OUTER)Joinreferencing the null-widened side as non-nullable. This is the SPARK-56395 scenario and the only one that exercisesgroupBy(_.exprId)across multiple children. - Offending reference nested in a compound expression (e.g.
narrowedRef + 1L, or wrapped in anAlias/function) to confirm the inner_.collectfinds it, not just bare top-level references. - Offending node below the root, to confirm the
plan.collecttraversal rather than root-only inspection. ExprIdnot produced by any child, config ON (dangling/outer-reference exprId declared non-nullable) → asserts it's ignored. Currently the "ignored" path is only covered implicitly via the config-disabled assertion.- Duplicate
ExprIdacross children, "any nullable wins" — the behavior the inline comment specifically claims, currently untested.
…overage Address review feedback: spell out the scope limitations of validateNullability (top-level only, strictly local, subqueries not reached), update the validateOptimizedPlan and config docs to cover the aggregate and nullability checks and the optimization-only scope, and add tests for the multi-child outer-join shape, compound expressions, below-root traversal, ignored non-child references, and the duplicate-ExprId "any nullable wins" rule. Co-authored-by: Isaac
Co-authored-by: Isaac
|
the docker test failure is unrelated, thanks, merging to master/4.x! |
…rity validation ### What changes were proposed in this pull request? This PR adds an opt-in nullability check to logical plan integrity validation. - A new internal config `spark.sql.planChangeValidation.nullability` (default `false`) gates the check. It is a sub-flag of the existing `spark.sql.planChangeValidation`. - A new `LogicalPlanIntegrity.validateNullability` walks the plan and flags any operator that references one of its child's output attributes with a **narrower** nullability than the child produces — i.e. declaring the attribute `nullable = false` while the child produces it as `nullable = true`. It is wired into `validateOptimizedPlan`'s existing check chain. - References declared **wider** (nullable over a non-nullable child) are safe and not flagged; references whose `ExprId` is not produced by any child (e.g. outer references) are ignored. ### Why are the changes needed? When a rule synthesizes operators, every reference to a child output must carry the nullability the child actually produces. A classic miss is an outer-join rewrite that references the null-widened side with its original non-nullable attributes — for example SPARK-56395 in `RewriteNearestByJoin`, where the synthesized `Aggregate` referenced the right side (widened to nullable by the synthetic `LEFT OUTER` join) with its original non-nullable metadata. Such a mismatch is a latent plan-integrity defect: the rows are unchanged, so it is invisible to result-equivalence reasoning, to `validateSchemaOutput` (which compares types `asNullable`), and to green CI. Yet a later optimization that trusts the non-nullable declaration can drop a null check and produce wrong results. There is currently no integrity check that catches this class of defect; this adds one. The check is disabled by default — even though `spark.sql.planChangeValidation` defaults to on in test runs — because not all existing rules maintain precise nullability today, and enabling it globally would surface pre-existing benign mismatches. Suites that have been audited (or rules that have been fixed) can opt in to guard against regressions. ### Does this PR introduce _any_ user-facing change? No. The config is internal and disabled by default; the check only runs under plan-change validation when explicitly enabled. ### How was this patch tested? New unit test in `LogicalPlanIntegritySuite` that, with the config enabled, asserts: - a reference declared non-nullable over a nullable child is flagged (and the message names the offending attribute); - a faithful reference (same nullability as the child) passes; - a wider reference (nullable over a non-nullable child) passes; and that the check is a no-op when the config is disabled. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Opus 4.8) This pull request and its description were written by Isaac. Closes #56502 from cloud-fan/SPARK-oss-nullability-plan-validation. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5f1f8da) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
This PR adds an opt-in nullability check to logical plan integrity validation.
spark.sql.planChangeValidation.nullability(defaultfalse) gates the check. It is a sub-flag of the existingspark.sql.planChangeValidation.LogicalPlanIntegrity.validateNullabilitywalks the plan and flags any operator that references one of its child's output attributes with a narrower nullability than the child produces — i.e. declaring the attributenullable = falsewhile the child produces it asnullable = true. It is wired intovalidateOptimizedPlan's existing check chain.ExprIdis not produced by any child (e.g. outer references) are ignored.Why are the changes needed?
When a rule synthesizes operators, every reference to a child output must carry the nullability the child actually produces. A classic miss is an outer-join rewrite that references the null-widened side with its original non-nullable attributes — for example SPARK-56395 in
RewriteNearestByJoin, where the synthesizedAggregatereferenced the right side (widened to nullable by the syntheticLEFT OUTERjoin) with its original non-nullable metadata.Such a mismatch is a latent plan-integrity defect: the rows are unchanged, so it is invisible to result-equivalence reasoning, to
validateSchemaOutput(which compares typesasNullable), and to green CI. Yet a later optimization that trusts the non-nullable declaration can drop a null check and produce wrong results. There is currently no integrity check that catches this class of defect; this adds one.The check is disabled by default — even though
spark.sql.planChangeValidationdefaults to on in test runs — because not all existing rules maintain precise nullability today, and enabling it globally would surface pre-existing benign mismatches. Suites that have been audited (or rules that have been fixed) can opt in to guard against regressions.Does this PR introduce any user-facing change?
No. The config is internal and disabled by default; the check only runs under plan-change validation when explicitly enabled.
How was this patch tested?
New unit test in
LogicalPlanIntegritySuitethat, with the config enabled, asserts:and that the check is a no-op when the config is disabled.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.8)
This pull request and its description were written by Isaac.