99-mellanox: fix "ifaces[id]: unbound variable" — align arrays, fail cleanly when a device has no interface #271
Open
elordahl wants to merge 1 commit into
@flx42 to review
…aceless device
Three independent sysfs globs (infiniband_verbs, infiniband, infiniband_mad)
built the parallel arrays assuming equal counts and aligned ordering. When a PCI
function exposed a verbs device but no infiniband/ class entry (BlueField DPU,
SF/SR-IOV representor, down port, or an SR-IOV VF whose RDMA device is in another
network namespace), ifaces[] ended up shorter than devices[]. The mount loop only
range-checked against ${#devices[@]}, so it dereferenced an unset ifaces[id] and,
under set -euo pipefail, aborted with an unhandled error:
/etc/enroot/hooks.d/99-mellanox.sh: line 88: ifaces[id]: unbound variable
Fix: enumerate per PCI function anchored on infiniband_verbs and resolve the
interface and management nodes from the same <bdf> directory, so the arrays stay
index-aligned regardless of which sysfs sub-entries are present. A requested
device with no interface now fails with a clear common::err ("refusing to start
container ...") instead of the unhandled unbound-variable crash -- preserving the
prior behavior (the container does not start) but as a handled, actionable error.
umad/issm entries are guarded with [ -n ] since their absence is non-critical.
Signed-off-by: Eric Lordahl <elordahl@nvidia.com>
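
For anyone who wants to see the failure mode in isolation, here is a minimal reproduction sketch. It is not taken from the hook: the array contents and the `lookup` helper are hypothetical, chosen so that the middle function (uverbs1) is the one without an infiniband/ class entry. It shows both symptoms from the commit message: the silent mis-pairing once the arrays are skewed, and the nounset abort at the first index past the end of ifaces[].

```shell
#!/usr/bin/env bash
# Hypothetical reproduction; names and values are illustrative only.
set -euo pipefail

# Three verbs devices, but only two interfaces were globbed because
# the middle PCI function (uverbs1) has no infiniband/ class entry.
devices=(uverbs0 uverbs1 uverbs2)
ifaces=(mlx5_0 mlx5_2)

# Range-check against devices[] only, as the pre-fix loop is described.
# The subshell keeps the nounset failure from killing this script.
lookup() {
    local id=$1
    (
        if [ "${id}" -lt "${#devices[@]}" ]; then
            echo "${devices[id]} -> ${ifaces[id]}"
        fi
    ) 2>&1
}

# Index 1 passes the range check but is silently mis-paired.
lookup 1    # prints: uverbs1 -> mlx5_2

# Index 2 passes the range check too, but ifaces[2] is unset.
out=$(lookup 2 || true)
echo "${out}"   # message ends with: ifaces[id]: unbound variable
```

The range check succeeds for every index below `${#devices[@]}`, which is why nothing catches the shorter array before the expansion fails.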
Problem

conf/hooks/99-mellanox.sh builds drivers/devices, ifaces, and umads/issms from three independent sysfs globs, assuming they are equal-length and index-aligned. When a Mellanox PCI function exposes an infiniband_verbs entry but no infiniband/ class device (a BlueField DPU, an SF/SR-IOV representor, a down port, or an SR-IOV VF whose RDMA device is in another network namespace, e.g. a Kubernetes pod, via rdma-cni), ifaces[] ends up shorter than devices[]. The mount loop only range-checks id against ${#devices[@]}, so it dereferences an unset ifaces[id] and, under set -euo pipefail, aborts with an unhandled error:

    /etc/enroot/hooks.d/99-mellanox.sh: line 88: ifaces[id]: unbound variable

Before the abort, the skew also silently mis-pairs devices[] with the wrong ifaces[] entry for indices past the first gap.

Fix

- Enumerate per PCI function, anchored on infiniband_verbs, resolving the interface and management nodes from the same <bdf> directory, so the arrays stay index-aligned regardless of which sysfs sub-entries exist.
- A requested device with no interface now fails with a clear common::err (… refusing to start container …) instead of the unhandled unbound-variable crash. This preserves the prior behavior (the container does not start when a requested RDMA device is unavailable), just as a handled, actionable error.
- umad/issm entries are guarded with [ -n ] since their absence is non-critical.
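
The per-function enumeration can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: it runs against a fake tree built in a temporary directory rather than real sysfs, `first_entry` and `mount_device` are hypothetical helpers, a plain `echo` to stderr stands in for common::err, and the error wording is invented. The point it tries to show is that each loop iteration appends exactly one element to every array, so a missing sub-entry becomes an empty string at the same index instead of shifting later entries.

```shell
#!/usr/bin/env bash
# Sketch only: tree layout, helper names and messages are assumptions.
set -euo pipefail
shopt -s nullglob

root=$(mktemp -d)
trap 'rm -rf "${root}"' EXIT

# Fake two PCI functions: the first is complete, the second has verbs only.
mkdir -p "${root}/0000:3b:00.0/infiniband_verbs/uverbs0" \
         "${root}/0000:3b:00.0/infiniband/mlx5_0" \
         "${root}/0000:3b:00.0/infiniband_mad/umad0" \
         "${root}/0000:3b:00.1/infiniband_verbs/uverbs1"

# Print the basename of the first argument, or an empty line if none.
first_entry() {
    if [ "$#" -gt 0 ]; then basename "$1"; else echo; fi
}

devices=() ifaces=() umads=()

# Anchor on infiniband_verbs; resolve siblings from the same <bdf> dir.
for verbs in "${root}"/*/infiniband_verbs/*; do
    bdf=${verbs%/infiniband_verbs/*}
    devices+=("$(basename "${verbs}")")
    ifaces+=("$(first_entry "${bdf}"/infiniband/*)")
    umads+=("$(first_entry "${bdf}"/infiniband_mad/umad*)")
done

# Fail with a handled error when the requested device has no interface.
mount_device() {
    local id=$1
    if [ -z "${ifaces[id]}" ]; then
        echo "${devices[id]}: no interface, refusing to start container" >&2
        return 1
    fi
    echo "mount ${devices[id]} ${ifaces[id]}"
    # umad is optional, so only mount it when present.
    if [ -n "${umads[id]}" ]; then
        echo "mount ${umads[id]}"
    fi
}

mount_device 0
mount_device 1 || echo "handled failure for ${devices[1]}"
```

Because ifaces[1] is set (to the empty string) rather than unset, the check in `mount_device` runs under `set -u` without tripping nounset, and the caller decides how to report the failure.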