sec(parser): BoundedSegment + audit trail for untrusted asSlice#27
sec(parser): BoundedSegment + audit trail for untrusted asSlice#27dfa1 wants to merge 6 commits into
Conversation
…ents helper SECURITY.md promises that malformed input throws VortexException, never an IndexOutOfBoundsException or other unchecked JDK exception. Several call sites in the file-open and scan paths violated this when fed adversarial offsets/lengths read from the on-disk schema: - VortexReader.parse: file shorter than the 8-byte trailer made `bodyBytes = size - TRAILER_SIZE` negative, then `seg.asSlice(negative, 8)` threw IOOBE. - VortexReader.parse: `trailer.postscriptLen()` greater than the body made `postscriptOffset` negative, same outcome. - VortexReader.slice (public, exposed via VortexHandle) forwarded any caller-supplied offset/length straight to asSlice with no bounds check. - VortexReader.readFlatStats / ScanIterator.readFlatStats: segment offsets taken from footer.segmentSpecs() were sliced unchecked. - VortexHttpReader: trailer + postscript offsets in the prefetched HTTP tail were sliced unchecked. - FlatSegmentDecoder: per-buffer dataOffset accumulated from attacker-controlled `Buffer.padding()` and `Buffer.length()` was sliced unchecked. - PostscriptParser.slice: local helper called asSlice without converting IOOBE to VortexException (checkBlobBounds already runs upstream, but the helper should defend the same contract on its own). Introduces `io.github.dfa1.vortex.core.MemorySegments.slice(MemorySegment, long, long, String)`. The static helper throws VortexException — with the caller-supplied context label baked into the message — for negative offsets, negative lengths, and overflow-prone offset+length combinations. Replaces eight asSlice call sites across reader/scan/encoding. Each surviving raw asSlice call now operates only on already-validated internal offsets (already-decoded array buffer projection in VarBinArray, ArraySegments slicing of post-decode buffers in ScanIterator's narrowToRows, ProtoReader's own bounds-checked cursor, the constant-offset magic check in Trailer, and the decoded-output buffer in ZstdEncoding). A Checkstyle ban on raw asSlice in io/scan/encoding packages is a planned follow-up so future regressions are caught at build time. MemorySegmentsTest covers in-range slicing, the zero-length-at-end edge, negative offsets, negative lengths, offset+length past segment size, length alone larger than the segment, and an overflow-prone (Long.MAX_VALUE - 1) + 100 input that would defeat a naive `off + len > segSize` check. ./mvnw verify — all unit + integration tests pass (incl. Rust round-trips). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…hase 1/4]
Adds io.github.dfa1.vortex.core.BoundedSegment, a record wrapping a
MemorySegment with a context label. The type does not expose
MemorySegment.asSlice; the only way to derive a sub-region is
BoundedSegment.slice(off, len, childContext), which routes through
MemorySegments.checkRange and throws VortexException — not
IndexOutOfBoundsException — on malformed input.
Raw access is exposed only through unwrapForSubParser(String reason),
which both documents the trust transfer and surfaces the reason string in
any propagated error, so trust transfers are greppable for audit.
Phase 1 covers the file-open boundary:
- VortexReader.parse: wraps the mmap'd segment as
`new BoundedSegment(seg, "vortex file")` immediately after `channel.map`.
Trailer + postscript slices go through `file.slice(off, len, ctx)`.
- Trailer.parse: signature changed from (MemorySegment, long) to
(BoundedSegment, long). The single internal `asSlice` on the magic
bytes uses `unwrapForSubParser("trailer parser")` since the offset is
a compile-time constant (4) on a segment whose length has already been
validated to 8 bytes.
- PostscriptParser.parse: signature changed from
(ByteBuffer, MemorySegment, long) to (ByteBuffer, BoundedSegment). The
local `slice` helper goes away; callers use `file.slice(...).asByteBufferLE()`.
Explicit checkBlobBounds calls are retained so the per-blob error
message ("postscript footer blob out of bounds") is more specific than
BoundedSegment's generic "vortex file" parent context.
- VortexHttpReader: tail array now wrapped as
`BoundedSegment(MemorySegment.ofArray(tail), "http tail")` so the
changed Trailer.parse signature is satisfied. Full HTTP-aware
refactor lands in Phase 3.
MemorySegments.slice is refactored to call a new
MemorySegments.checkRange helper, which BoundedSegment's primitive
readers (getByte, getIntLE, getLongLE) reuse to share the same
bounds-check path without producing a slice.
VortexHandle.slice (the public interface method) is NOT touched in
this phase. Its signature change to return BoundedSegment lands in
Phase 2 alongside the DecodeContext.buffer migration.
BoundedSegmentTest covers in-range slice, bad slice carrying parent
context label, bounds-checked primitive reads, and unwrapForSubParser
identity. The 7 MemorySegmentsTest cases from PR #27 still pass.
./mvnw verify — all unit + integration tests pass (incl. Rust round-trips).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ase 2/4]
DecodeContext.segmentBuffers becomes BoundedSegment[]; DecodeContext.buffer(i)
returns BoundedSegment instead of MemorySegment. The encoded-buffer arithmetic
in FlatSegmentDecoder now wraps the parent flat segment as BoundedSegment and
slices each per-buffer region with a "encoded buffer i" context label, so a
crafted Buffer.padding()/Buffer.length() pair in the FlatBuffer schema throws
VortexException at slice time rather than IndexOutOfBoundsException at first
read.
Encoder consumers (14 files, ~25 call sites) keep their existing MemorySegment-
based decode logic and add an explicit `.unwrapForSubParser("<encoding> encoding")`
at each `ctx.buffer(i)` site. Trust transfers are greppable: `git grep
unwrapForSubParser` lists every encoder boundary.
A new convenience factory `DecodeContext.ofRawBuffers(...)` wraps raw
MemorySegment arrays as BoundedSegments for synthetic test inputs that produce
their own trusted buffers; the 71 test sites that previously called
`new DecodeContext(..., new MemorySegment[]{buf}, ...)` migrated mechanically
to `DecodeContext.ofRawBuffers(..., new MemorySegment[]{buf}, ...)`. Production
callers (only FlatSegmentDecoder) keep using the canonical constructor with
already-bounded buffers.
DictEncoding.java:381 was the one site `ctx.buffer(i)` did not cover —
it indexed `ctx.segmentBuffers()[...]` directly. That site now calls
`.unwrapForSubParser("dict encoding values")` for symmetry.
VortexHandle.slice (the public deprecated escape hatch) is NOT touched in
this phase; it still returns MemorySegment. Phase 3 covers VortexHttpReader's
HTTP-aware wrapping and ScanIterator.readFlatStats.
./mvnw verify — all 938 unit + 243 integration tests pass (incl. Rust round-trips).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…e 3/4]
VortexHandle.slice (the internal escape hatch used by ScanIterator,
InspectorTree, VortexInspectorTui, and integration-test code) now returns
BoundedSegment instead of MemorySegment. Callers that genuinely need raw
MemorySegment access add an explicit `.unwrapForSubParser("<purpose>")` at
the call site, making every cross-package escape-hatch usage greppable for
audit.
Sites migrated:
- VortexReader.slice: wraps the file-segment slice as BoundedSegment with
context "file slice".
- VortexHttpReader.slice: wraps the freshly-fetched HTTP range as
BoundedSegment with context "http range <off>..<end>" so error messages
attribute bounds-check failures to the specific HTTP request.
- ScanIterator.readFlat: `.unwrapForSubParser("flat segment decoder")`
before handing off to FlatSegmentDecoder.decode (which still takes
MemorySegment; full FlatSegmentDecoder migration would be a separate
follow-up).
- ScanIterator.readFlatStats: keeps a BoundedSegment for the parent stats
region, slices the stats flatbuffer via `statsRegion.slice(off, len,
"stats flatbuffer").asByteBufferLE()` instead of the raw
MemorySegments.slice helper.
- InspectorTree.peekFlat / collectEncodingsAndStats: unwrap with
"inspector flat segment decoder" before peekFlatRoot.
- VortexInspectorTui.previewSegment / hexPeekSegment: unwrap with
"inspector tui flat segment" and "inspector tui hex peek".
- PcoFixtureInspectionIntegrationTest.walkLayoutInner: unwrap with
"integration test inspector".
The slice method is still marked @deprecated(forRemoval = true) on
VortexHandle; the long-term direction is to remove cross-package raw
segment access entirely and route everything through Scan + typed
accessors. That demolition happens in Phase 4.
./mvnw verify — all unit + integration tests pass (incl. Rust round-trips).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…4/4]
VortexReader.slice and VortexReader.readFlatStats — the last two
file-open / scan-adjacent sites still calling MemorySegments.slice
directly — now construct a BoundedSegment from fileSegment and slice
through it:
- VortexReader.slice: returns `new BoundedSegment(fileSegment, "vortex
file").slice(offset, length, "file slice")`. The intermediate label
flows into VortexHandle.slice's BoundedSegment return value, so a
downstream `.slice(off, len, "...")` on it carries both parent context
("file slice") and the child label in error messages.
- VortexReader.readFlatStats: builds a `statsRegion` BoundedSegment for
the per-flat stats slice, then uses `statsRegion.getIntLE(...)` for the
trailing fbLen read and `statsRegion.slice(fbStart, fbLen, "stats
flatbuffer").asByteBufferLE()` for the FlatBuffer payload. The
bounds-check helper calls (MemorySegments.slice / .checkRange) are no
longer reachable from reader-layer code.
MemorySegments class javadoc updated to describe its new role as the
implementation detail behind BoundedSegment — application code should
prefer BoundedSegment, and direct MemorySegments use is reserved for
constructing a BoundedSegment from a raw segment at the mmap boundary.
Phase 4 closeout — the remaining unwrapForSubParser sites (35) are all
in encoders that consume `ctx.buffer(i)` and pass to sub-parsers
(ProtoReader, FlatBuffer runtime) which take raw MemorySegment. Future
work to drop those would mean adapting ProtoReader to take BoundedSegment
directly (it already does its own bounds-checked cursor internally) and
adding a BoundedSegment-aware FlatBuffer helper. Both are mechanically
straightforward but out of scope here; the audit point — every trust
transfer is greppable via unwrapForSubParser — is satisfied.
./mvnw verify — all unit + integration tests pass (incl. Rust round-trips).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Audit:
|
| File | Line | Reason | Downstream |
|---|---|---|---|
BitpackedEncoding.java |
273 | bitpacked encoding |
fastlanes unpack into pre-allocated output segment |
BoolEncoding.java |
71 | bool encoding |
BoolArray stores MemorySegment |
ByteBoolEncoding.java |
47 | bytebool encoding |
byte-level scan + BoolArray construction |
ConstantEncoding.java |
136 | constant encoding |
ProtoReader(MemorySegment, ...) for scalar payload |
DecimalEncoding.java |
112 | decimal encoding |
per-row decimal byte parsing |
DictEncoding.java |
381 | dict encoding values |
dict-values direct buffer indexing |
DictEncoding.java |
438 | dict encoding |
dict-bytes for VarBinArray |
DictEncoding.java |
439 | dict encoding |
dict-offsets for VarBinArray |
DictEncoding.java |
440 | dict encoding |
codes for child decode |
FsstEncoding.java |
209 | fsst encoding |
FSST symbol table (8B per symbol) |
FsstEncoding.java |
210 | fsst encoding |
FSST symbol lengths (1B per symbol) |
FsstEncoding.java |
211 | fsst encoding |
FSST-compressed heap |
PcoEncoding.java |
144 | pco encoding |
chunk metadata ProtoReader |
PcoEncoding.java |
163, 189, 205, 228 | pco encoding |
per-page Pco decode |
PrimitiveEncoding.java |
315 | primitive encoding |
MemorySegment.copy into output |
Registry.java |
80 | registry |
Registry.decodeAsSegment adapter |
SparseEncoding.java |
213 | sparse encoding |
fill-value ProtoReader |
VarBinEncoding.java |
143 | varbin encoding |
VarBinArray constructor |
VarBinViewEncoding.java |
123 | varbinview encoding |
views buffer for VarBinViewArray |
VarBinViewEncoding.java |
126 | varbinview encoding |
per-data-buffer |
ZstdEncoding.java |
317 | zstd encoding |
.toArray() for native Zstd dict |
ZstdEncoding.java |
322 | zstd encoding |
.toArray() for native Zstd frame |
ZstdEncoding.java |
351 | zstd encoding |
frame slice for ZstdJavaDecompressor |
scan / inspector / TUI → FlatSegmentDecoder.decode(MemorySegment, ...) (7)
| File | Line | Reason |
|---|---|---|
ScanIterator.java |
483 | flat segment decoder |
ScanIterator.java |
637 | stats segment fbLen read |
InspectorTree.java |
150 | inspector flat segment decoder |
InspectorTree.java |
205 | inspector flat segment decoder |
VortexInspectorTui.java |
574 | inspector tui flat segment |
VortexInspectorTui.java |
763 | inspector tui hex peek |
Trailer.java |
33 | trailer parser (constant-offset magic read) |
integration test (1)
| File | Line | Reason |
|---|---|---|
PcoFixtureInspectionIntegrationTest.java |
104 | integration test inspector |
Removal cost map (follow-up PRs)
FlatSegmentDecoder.decodetakesBoundedSegment→ kills 6 of the scan/inspector/TUI unwraps. ~30 min.Trailer.parseusesgetBytefamily instead of asSlice → kills 1. ~5 min.ProtoReader(BoundedSegment, long, long)+ regenerated proto records → kills the 7ProtoReader-bound unwraps (Constant, Pco×5, Sparse). Requires CodeGen update + 42 file regen. ~1 hour.core/arrayhierarchy holdsBoundedSegment→ kills the ~12 Array-constructor unwraps. Broad blast radius. ~3-5 hours.BoundedSegment.toArray()proxy → kills 2 Zstd unwraps. ~10 min.
Easy wins (FlatSegmentDecoder + Trailer + Zstd proxy) cumulatively eliminate 9 unwraps in under 1 hour. The remaining 24 require the deeper migrations.
TradeoffsWhat we gain
What we pay
PerfNegligible. When this design is the wrong fit
NetFor an alpha-stage library where the threat model is "untrusted columnar files from S3 / HTTP" and the contract is documented in SECURITY.md, the audit-trail property is worth the verbosity. The remaining 33 unwraps are documented (see previous comment) with concrete removal costs; they can be picked off opportunistically in future PRs rather than holding this one back. |
After Phase 4 nothing outside BoundedSegment called MemorySegments.slice or MemorySegments.checkRange any more, so the static helper was a one-line proxy used only by BoundedSegment itself — two parallel APIs saying the same thing. Moves the bounds-check logic directly into BoundedSegment as a private instance method that closes over `context`, so each call no longer re-passes the label. slice now does the check inline and calls seg.asSlice directly; getByte/getIntLE/getLongLE share the same private checkRange. Deletes MemorySegments.java and MemorySegmentsTest.java. The seven test cases that lived in MemorySegmentsTest (zero-length-at-end, negative offset, negative length, len > segSize, overflow-prone Long.MAX_VALUE - 1 + 100) were absorbed into BoundedSegmentTest so the coverage doesn't regress. Net: one type, one API, ~93 lines fewer code. No behavioural change; ./mvnw verify — all unit + integration tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 33 `unwrapForSubParser` audit-trail sites in PR #27 are a strong signal that the current module layout is wrong: `core` hosts the read runtime, not just the data model. `reader` is a thin orchestration shell around `core`; cross-module byte hand-offs force public escape-hatch APIs (`VortexHandle.slice`, `Registry.decodeAsSegment`) that no architectural wrapper can really cure. Proposes splitting `Encoding` into separate read and write interfaces and moving the corresponding runtime into `reader` / `writer`, leaving `core` as the model-only module it was originally intended to be. PR #27's escape hatches collapse as a side effect of co-locating the decoder runtime with the byte source. Includes: - Root-cause analysis of the current `core` layout (5 concrete smells) - Target module split with the `Array` hierarchy moving to `reader` (writer only touches `NullableData`, confirmed by audit) — only `NullableData` and `BoundedSegment` stay in `core` proper - Registry split into distinct `ReadRegistry` and `WriteRegistry` types passed alongside the entry points (not folded into options records) — read-only callers never pull the `writer` module onto their classpath - 6-phase migration plan with effort estimates (~9 person-days plus CI fallout) - Consequences (positive and negative) and risks-to-manage - Rejected alternatives (JPMS hide, move FlatSegmentDecoder alone, adopt Arrow codec SPI, status-quo + docs) Status: Proposed. No code changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 33 `unwrapForSubParser` audit-trail sites in PR #27 are a strong signal that the current module layout is wrong: `core` hosts the read runtime, not just the data model. `reader` is a thin orchestration shell around `core`; cross-module byte hand-offs force public escape-hatch APIs (`VortexHandle.slice`, `Registry.decodeAsSegment`) that no architectural wrapper can really cure. Proposes splitting `Encoding` into separate read and write interfaces and moving the corresponding runtime into `reader` / `writer`, leaving `core` as the model-only module it was originally intended to be. PR #27's escape hatches collapse as a side effect of co-locating the decoder runtime with the byte source. Includes: - Root-cause analysis of the current `core` layout (5 concrete smells) - Target module split with the `Array` hierarchy moving to `reader` (writer only touches `NullableData`, confirmed by audit) — only `NullableData` and `BoundedSegment` stay in `core` proper - Registry split into distinct `ReadRegistry` and `WriteRegistry` types passed alongside the entry points (not folded into options records) — read-only callers never pull the `writer` module onto their classpath - 6-phase migration plan with effort estimates (~9 person-days plus CI fallout) - Consequences (positive and negative) and risks-to-manage - Rejected alternatives (JPMS hide, move FlatSegmentDecoder alone, adopt Arrow codec SPI, status-quo + docs) Status: Proposed. No code changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Replaces every untrusted
MemorySegment.asSlicecall in the file-open and decode path with a typedBoundedSegmentwrapper, so malformed input throwsVortexException(the documented SECURITY.md contract) rather thanIndexOutOfBoundsException.The win is not "no MemorySegment ever exists" — it's every trust transfer is grep-discoverable:
35 hits, one per encoder/inspector boundary. Each entry documents "this code consumes a bounded buffer." Future regressions get caught in code review (or by a Checkstyle ban on raw
.asSlice(outsideMemorySegments.java/BoundedSegment.java— see follow-ups).Design
BoundedSegment(a record holdingMemorySegment + contextlabel) exposes only:slice(off, len, childContext)— bounds-checked, throwsVortexExceptionon bad inputgetByte/getIntLE/getLongLE(off)— bounds-checked primitive readsasByteBufferLE()— for FlatBuffer parsers (which validate offsets independently)unwrapForSubParser(reason)— the audited escape hatch. Every call documents the trust transfer and surfacesreasonin error messages.MemorySegments.slice / .checkRangeremain as the implementation detail thatBoundedSegmentdelegates to. Prior art for this pattern: LuceneIndexInput.slice(LUCENE-5827), Trail of Bitsbounded_buffer, C++gsl::span, Rust&[u8]/bytes::Bytes. The JDK FFM team has not formally recommended it; this is the first articulation of the pattern forMemorySegment.Commits (5)
b6a9c24— originalMemorySegmentsstatic helper (covers the immediateasSliceaudit)b7d583a— Phase 1: introduceBoundedSegmentat the mmap boundary (VortexReader.parse,Trailer.parse,PostscriptParser.parse)413edb8— Phase 2: thread throughDecodeContext.buffer()andFlatSegmentDecoder(14 encoder consumers + 71 test sites migrated)4352be0— Phase 3:VortexHandle.slicesignature →BoundedSegment; ScanIterator/InspectorTree/TUI/integration-test callers updateda9a5189— Phase 4: VortexReader internals migrated;MemorySegmentsbecomes implementation detail ofBoundedSegmentOut of scope (follow-ups)
Eliminating the remaining 35
unwrapForSubParsersites would require:core/array(BoolArray, IntArray, etc.) to holdBoundedSegment— broad blast radiusProtoReader+ every generated proto recorddecode(MemorySegment, ...)factoryBoundedSegmentfor encoder buffer mathZstdEncoding.toArray()+ similarEach is mechanically straightforward but expensive (~7-10 hours total). Deferred — the current audit trail is the actionable property; deeper migration is purity for purity's sake at alpha stage.
Test plan
./mvnw verify— 938 unit + 243 integration tests pass (incl. Rust round-trip suite)MemorySegmentsTest(7 cases) +BoundedSegmentTest(4 cases) cover in-range, out-of-range, negative offset/length, overflow-prone Long.MAX_VALUE-1, primitive reads, slice context propagation, unwrap identityunwrapForSubParser" categories are actually trust-bounded buffers (FlatSegmentDecoder slices them before they reach encoders)🤖 Generated with Claude Code