99-mellanox: fix "ifaces[id]: unbound variable" — align arrays, fail cleanly when a device has no interface #271

Open
elordahl wants to merge 1 commit into NVIDIA:main from elordahl:fix/mellanox-hook-array-skew

elordahl commented Jun 8, 2026

Problem

conf/hooks/99-mellanox.sh builds drivers/devices, ifaces, and umads/issms from three independent sysfs globs assuming they are equal-length and index-aligned. When a Mellanox PCI function exposes an infiniband_verbs entry but no infiniband/ class device — a BlueField DPU, an SF/SR-IOV representor, a down port, or an SR-IOV VF whose RDMA device is in another network namespace (e.g. a Kubernetes pod, via rdma-cni) — ifaces[] ends up shorter than devices[]. The mount loop only range-checks id against ${#devices[@]}, so it dereferences an unset ifaces[id] and, under set -euo pipefail, aborts with an unhandled error:

/etc/enroot/hooks.d/99-mellanox.sh: line 88: ifaces[id]: unbound variable
[ERROR] /etc/enroot/hooks.d/99-mellanox.sh exited with return code 1

Before the abort, the skew also silently mis-pairs devices[] with the wrong ifaces[] entry for indices past the first gap.
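
The two failure modes described above can be reproduced in isolation. The following is a minimal sketch, not the hook itself: the array contents are hypothetical, chosen so that the second of three verbs devices has no interface entry.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical data: three verbs devices, but the second PCI function exposes
# no infiniband/ class entry, so an independent glob finds only two names.
devices=(/dev/infiniband/uverbs0 /dev/infiniband/uverbs1 /dev/infiniband/uverbs2)
ifaces=(mlx5_0 mlx5_2)

# Index 1 silently pairs uverbs1 with the interface that belongs to uverbs2.
mispaired="${devices[1]} -> ${ifaces[1]}"
echo "${mispaired}"

# Index 2 passes a range check against devices[] but is unset in ifaces[].
# The dereference runs in a subshell so the nounset failure can be observed
# without terminating this script.
id=2
status=0
( set -u; : "${ifaces[id]}" ) 2>/dev/null || status=$?
echo "dereference exit status: ${status}"
```

Index 1 shows the silent mis-pairing; index 2 shows the dereference that aborts the hook under `set -u`.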

Fix

  • Enumerate per PCI function, anchored on infiniband_verbs, resolving the interface and management nodes from the same <bdf> directory, so the arrays stay index-aligned regardless of which sysfs sub-entries exist.
  • When a requested device has no InfiniBand interface, fail with a clear common::err (… refusing to start container …) instead of the unhandled unbound-variable crash. This preserves the prior behavior (the container does not start when a requested RDMA device is unavailable) — just as a handled, actionable error.
  • umad/issm entries are guarded with [ -n ] since their absence is non-critical.
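
The per-function enumeration can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: it runs against a throwaway directory tree that stands in for the sysfs driver directory, it takes device and interface names from directory names rather than from `uevent` files, and `check_device` is a hypothetical stand-in for the mount step that calls `common::err` in the real hook.

```bash
#!/usr/bin/env bash
set -euo pipefail
shopt -s nullglob

# Throwaway stand-in for the sysfs driver directory (hypothetical layout).
root="$(mktemp -d)"
trap 'rm -rf "${root}"' EXIT
mkdir -p "${root}/0000:01:00.0/infiniband_verbs/uverbs0" "${root}/0000:01:00.0/infiniband/mlx5_0"
mkdir -p "${root}/0000:01:00.1/infiniband_verbs/uverbs1"   # verbs only, no interface
mkdir -p "${root}/0000:02:00.0/infiniband_verbs/uverbs2" "${root}/0000:02:00.0/infiniband/mlx5_2"

devices=()
ifaces=()

# One pass, anchored on infiniband_verbs: each iteration handles one function.
for verbs in "${root}"/*/infiniband_verbs/*; do
    bdf_dir="${verbs%/infiniband_verbs/*}"
    devices+=("/dev/infiniband/${verbs##*/}")
    # Resolve the interface from the same <bdf> directory. An empty placeholder
    # is recorded when none exists, so both arrays keep the same length.
    iface_dirs=("${bdf_dir}"/infiniband/*)
    if [ "${#iface_dirs[@]}" -gt 0 ]; then
        ifaces+=("${iface_dirs[0]##*/}")
    else
        ifaces+=("")
    fi
done

# Hypothetical stand-in for the mount step: fail with a handled error when the
# requested device has no interface, instead of tripping nounset.
check_device() {
    local id="$1"
    if [ -z "${ifaces[id]}" ]; then
        echo "device ${devices[id]} has no InfiniBand interface" >&2
        return 1
    fi
    echo "${devices[id]} -> ${ifaces[id]}"
}
```

Because every `devices[]` entry gets a matching `ifaces[]` slot, an index that is valid for one array is valid for the other, and a missing interface appears as an empty string that the caller can test explicitly.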

elordahl (Author) commented

@flx42 to review

elordahl force-pushed the fix/mellanox-hook-array-skew branch from 916ae2d to 86f78d2 on June 12, 2026 15:20

elordahl changed the title on Jun 12, 2026
from: 99-mellanox: fix array skew and abort on degraded NIC
to: 99-mellanox: fix verbs/iface array skew, skip interfaceless devices

elordahl changed the title on Jun 13, 2026
from: 99-mellanox: fix verbs/iface array skew, skip interfaceless devices
to: 99-mellanox: fix "ifaces[id]: unbound variable" — align arrays, skip interface-less devices

…aceless device

Three independent sysfs globs (infiniband_verbs, infiniband, infiniband_mad)
built the parallel arrays assuming equal counts and aligned ordering. When a PCI
function exposed a verbs device but no infiniband/ class entry (BlueField DPU,
SF/SR-IOV representor, down port, or an SR-IOV VF whose RDMA device is in another
network namespace), ifaces[] ended up shorter than devices[]. The mount loop only
range-checked against ${#devices[@]}, so it dereferenced an unset ifaces[id] and,
under set -euo pipefail, aborted with an unhandled error:

  /etc/enroot/hooks.d/99-mellanox.sh: line 88: ifaces[id]: unbound variable

Fix: enumerate per PCI function anchored on infiniband_verbs and resolve the
interface and management nodes from the same <bdf> directory, so the arrays stay
index-aligned regardless of which sysfs sub-entries are present. A requested
device with no interface now fails with a clear common::err ("refusing to start
container ...") instead of the unhandled unbound-variable crash -- preserving the
prior behavior (the container does not start) but as a handled, actionable error.
umad/issm entries are guarded with [ -n ] since their absence is non-critical.

Signed-off-by: Eric Lordahl <elordahl@nvidia.com>

elordahl force-pushed the fix/mellanox-hook-array-skew branch from 86f78d2 to ee30f16 on June 26, 2026 02:28

elordahl changed the title on Jun 26, 2026
from: 99-mellanox: fix "ifaces[id]: unbound variable" — align arrays, skip interface-less devices
to: 99-mellanox: fix "ifaces[id]: unbound variable" — align arrays, fail cleanly when a device has no interface