OCPBUGS-89351: Fix flake in best-effort QoS test due to debug pods #31312

Open
mdbooth wants to merge 1 commit into openshift:main from mdbooth:investigate-qos-debug-race

mdbooth commented Jun 17, 2026

test/extended/node defines two helper functions that create debug pods with "oc debug" in the "openshift-machine-config-operator" namespace. These helper functions have numerous callers. By design, the debug pods have best-effort QoS.

The "[sig-arch] Managed cluster should ensure control plane pods do not run in best-effort QoS" test looks for pods in openshift namespaces, including openshift-machine-config-operator, and fails if any of them have best-effort QoS. The test therefore flakes whenever it runs at the same time as another test that is using the node helper functions.
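
To illustrate the race, here is a minimal sketch of the kind of check described above. It is not the code in qos.go: the `pod` struct is a hypothetical stand-in for the Kubernetes pod type, `bestEffortViolations` is an illustrative name, and matching namespaces by the "openshift-" prefix is an assumption about how "openshift namespaces" are selected.

```go
package main

import "strings"

// pod is a simplified, hypothetical stand-in for corev1.Pod that models
// only the fields this sketch needs.
type pod struct {
	Namespace string
	Name      string
	QOSClass  string // mirrors pod.Status.QOSClass
}

// bestEffortViolations returns "namespace/name" for every best-effort pod
// in a namespace whose name starts with "openshift-". The prefix match is
// an assumption made for this sketch.
func bestEffortViolations(pods []pod) []string {
	var violations []string
	for _, p := range pods {
		if !strings.HasPrefix(p.Namespace, "openshift-") {
			continue
		}
		if p.QOSClass == "BestEffort" {
			violations = append(violations, p.Namespace+"/"+p.Name)
		}
	}
	return violations
}
```

With a check of this shape, a best-effort debug pod that happens to exist in openshift-machine-config-operator while the pod list is taken is reported as a violation, even though it was created by an unrelated test.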

Summary by CodeRabbit

  • Tests
    • Improved Quality of Service test checks by properly excluding ephemeral debug pods.

@openshift-merge-bot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) Jun 17, 2026
@openshift-ci-robot

@mdbooth: This pull request references Jira Issue OCPBUGS-89351, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot added the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) Jun 17, 2026
coderabbitai Bot commented Jun 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: acdb0564-5697-4a04-bba6-2a617e2f65a3

📥 Commits

Reviewing files that changed from the base of the PR and between ae4a0d4 and 50ce7fa.

📒 Files selected for processing (2)
  • test/extended/node/node_utils.go
  • test/extended/operators/qos.go

Walkthrough

debugNamespace in node_utils.go is exported as DebugNamespace, and both ExecOnNodeWithChroot and ExecOnNodeWithNsenter are updated to use it. The QoS test in qos.go imports the node package and adds an isEphemeralDebugPod helper that excludes ephemeral oc debug pods from best-effort QoS violation detection.

Changes

Debug Namespace Export and QoS Pod Exclusion

Layer / File(s) | Summary
Export DebugNamespace constant (test/extended/node/node_utils.go) | Renames debugNamespace to DebugNamespace (exported) and updates both ExecOnNodeWithChroot and ExecOnNodeWithNsenter to reference the new identifier.
QoS test: skip ephemeral debug pods (test/extended/operators/qos.go) | Adds isEphemeralDebugPod helper detecting pods by debug.openshift.io/managed-by label, debug.openshift.io/source-resource annotation, or -debug- name substring; the QoS loop skips pods matching any criterion.
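
The exclusion logic summarized above could look roughly like the following sketch. It is not the code from the PR: the `pod` struct is a hypothetical stand-in for the Kubernetes pod type, the helper checks only for the presence of the label and annotation keys (the values the real helper expects, if any, are not shown in this conversation), and `bestEffortViolations` is an illustrative name for the QoS loop.

```go
package main

import "strings"

// pod is a simplified, hypothetical stand-in for corev1.Pod.
type pod struct {
	Name        string
	Labels      map[string]string
	Annotations map[string]string
	QOSClass    string // mirrors pod.Status.QOSClass
}

const (
	// Keys taken from the walkthrough; only key presence is checked here,
	// which is an assumption of this sketch.
	managedByLabel           = "debug.openshift.io/managed-by"
	sourceResourceAnnotation = "debug.openshift.io/source-resource"
)

// isEphemeralDebugPod reports whether a pod looks like one created by
// "oc debug": it carries the label, carries the annotation, or has
// "-debug-" in its name. Matching any one criterion is enough.
func isEphemeralDebugPod(p pod) bool {
	if _, ok := p.Labels[managedByLabel]; ok {
		return true
	}
	if _, ok := p.Annotations[sourceResourceAnnotation]; ok {
		return true
	}
	return strings.Contains(p.Name, "-debug-")
}

// bestEffortViolations lists best-effort pods, skipping debug pods.
func bestEffortViolations(pods []pod) []string {
	var violations []string
	for _, p := range pods {
		if isEphemeralDebugPod(p) {
			continue // short-lived debug pods are best-effort by design
		}
		if p.QOSClass == "BestEffort" {
			violations = append(violations, p.Name)
		}
	}
	return violations
}
```

One trade-off of the name-substring criterion is that any pod whose name contains "-debug-" is skipped, whether or not "oc debug" created it; the label and annotation checks are narrower.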

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

ready-for-human-review

Suggested reviewers

  • asahay19
  • sairameshv
🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (14 passed)

Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title clearly summarizes the main change: fixing a flake in the best-effort QoS test caused by debug pods, which aligns with both file changes (exporting DebugNamespace and filtering debug pods).
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names | ✅ Passed | Ginkgo test names are stable and deterministic. The test names ("ensure control plane pods do not run in best-effort QoS" and "[sig-arch] Managed cluster should") contain no dynamic values, generat...
Test Structure And Quality | ✅ Passed | Test code meets all quality requirements: maintains single responsibility, includes meaningful assertion messages, follows established codebase patterns (including context.Background usage in valid...
Microshift Test Compatibility | ✅ Passed | The new test in qos.go checks pod QoS classes using standard Kubernetes APIs (Pods().List(), pod.Status.QOSClass). It does not use any MicroShift-unavailable APIs or resources, and contains no Micr...
Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added in this PR. Changes are limited to exporting a constant in a utility module and modifying logic in an existing test, making the SNO compatibility check inapplicable.
Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies test-only code (test/extended/node/node_utils.go, test/extended/operators/qos.go). No deployment manifests, operator code, or scheduling constraints are introduced. Not applicable to to...
Ote Binary Stdout Contract | ✅ Passed | PR contains no stdout writes in process-level code. Changes are: exporting DebugNamespace constant in node_utils.go and adding ephemeral debug pod filtering in qos.go test. No fmt.Print, klog, or i...
Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added. Changes only involve exporting a namespace constant and filtering debug pods from existing test logic.
No-Weak-Crypto | ✅ Passed | PR contains no cryptographic code. Changes involve exporting a constant and adding test helper functions to detect debug pods using label/annotation/name pattern matching.
Container-Privileges | ✅ Passed | PR modifies only Go test files with no K8s manifests or container specs. The check for privileged container configurations is not applicable here.
No-Sensitive-Data-In-Logs | ✅ Passed | The PR only logs Kubernetes metadata (pod names/namespaces) and uses a standard namespace constant. No passwords, tokens, API keys, PII, or other sensitive data are exposed in logs.

mdbooth commented Jun 17, 2026

/jira refresh

openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) and removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) Jun 17, 2026
@openshift-ci-robot

@mdbooth: This pull request references Jira Issue OCPBUGS-89351, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

openshift-ci Bot requested review from BhargaviGudi and sdodson June 17, 2026 15:55
openshift-ci Bot commented Jun 17, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mdbooth
Once this PR has been reviewed and has the lgtm label, please assign bparees, rphillips for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mdbooth commented Jun 17, 2026

/cc @yuqi-zhang as you originally reported https://redhat.atlassian.net/browse/OCPBUGS-42691

@openshift-ci-robot

@mdbooth: This pull request references Jira Issue OCPBUGS-89351, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

openshift-ci Bot added the ready-for-human-review label (Indicates a PR has been reviewed by automated tools and is ready for human review) Jun 17, 2026
@openshift-merge-bot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

openshift-ci Bot commented Jun 17, 2026

@mdbooth: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-vsphere-ovn-upi | 50ce7fa | link | true | /test e2e-vsphere-ovn-upi

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
