refactor(encoding): ADR 0001 — all 6 phases#30
Open
dfa1 wants to merge 16 commits into
Open
Conversation
…[Phase 0/6] ADR 0001 Phase 0 lays the foundation for the read/write runtime split by adding two new facade types that delegate to the existing Registry. No call sites change yet — current readers and writers continue to use Registry directly. Later phases progressively migrate dispatch and runtime through these facades. - ReadRegistry: exposes only decode + lookup + isAllowUnknown, the surface a read-only consumer needs. - WriteRegistry: exposes only lookup, mirroring the encoder-lookup surface a writer needs; the encode() entry point is added in a later phase once Encoding is split into Decoder + Encoder. Both facades carry an asRegistry() escape hatch so callers that still need both surfaces can unwrap the underlying Registry during the transition. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…[Phase 1/6] Encoding is now a marker that extends EncodingDecoder + EncodingEncoder. The two narrower interfaces carry the read-side and write-side method signatures (encodingId + accepts + decode for decoders; encodingId + accepts + encode + encodeCascade for encoders). Every existing Encoding implementation continues to satisfy both contracts without code changes, so the migration is transparent at this phase. Read-only call sites (ReadRegistry, FlatSegmentDecoder) can already type-narrow to EncodingDecoder; write-only call sites (VortexWriter, CascadingCompressor) can type-narrow to EncodingEncoder. Later phases lift one encoding family at a time into separate *EncodingDecoder and *EncodingEncoder implementations that live in the reader and writer modules respectively; once the last family is lifted, the bifunctional Encoding interface is removed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…2/6] First encoding family lifted as exemplar. BoolEncodingDecoder is a standalone EncodingDecoder living in reader/src/main, registered via a new META-INF/services/io.github.dfa1.vortex.encoding.EncodingDecoder manifest in the reader module. Registry gains a parallel standaloneDecoders map (loaded via ServiceLoader<EncodingDecoder>) plus a Builder.register(EncodingDecoder) method. Registry.decode and Registry.decodeAsSegment now consult the standalone map first and fall back to the bifunctional Encoding map. Standalone decoders that are themselves Encoding implementations (i.e. the bifunctional ones not yet lifted) are skipped in the ServiceLoader walk so they aren't double-registered. BoolEncoding in core remains unchanged. Its decode method still works as the fallback path, but at runtime the new BoolEncodingDecoder in reader takes precedence. Once every encoding is lifted, the decode method on the bifunctional Encoding interface is removed and core's *Encoding files become encoder-only. This commit demonstrates the pattern; lifting the remaining ~29 encoding families is mechanical follow-up work that lands one family per commit on this branch. ./mvnw verify — all unit + integration tests pass (incl. Rust round-trips), including BoolEncodingTest which now exercises the standalone decoder path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…3/6] Mirrors Phase 2 on the write side. BoolEncodingEncoder is a standalone EncodingEncoder living in writer/src/main, registered via a new META-INF/services/io.github.dfa1.vortex.encoding.EncodingEncoder manifest in the writer module. Registry gains a parallel standaloneEncoders map (loaded via ServiceLoader<EncodingEncoder>) plus a Builder.register(EncodingEncoder) method. WriteRegistry exposes a lookupEncoder(EncodingId) entry point that consults standalone encoders first and falls back to the bifunctional Encoding map. BoolEncoding's encode method remains in core for now; once every encoding is lifted, the encode methods on the bifunctional Encoding interface are removed and the core *Encoding files are deleted entirely (read decoders live in reader, write encoders live in writer). This commit demonstrates the symmetric pattern; lifting the remaining ~29 encoding families on the write side is mechanical follow-up work, one family per commit. ./mvnw verify — all unit + integration tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
VortexHandle gains a decodeFlatSegment(SegmentSpec, DType, long, Arena)
typed accessor that does the most common pattern previously open-coded
across ScanIterator, the inspector, and the TUI:
MemorySegment seg = handle.slice(spec.offset(), spec.length());
return new FlatSegmentDecoder(registry)
.decode(seg, handle.footer().arraySpecs(), dtype, rowCount, arena);
becomes a single call:
return handle.decodeFlatSegment(spec, dtype, rowCount, arena);
VortexReader and VortexHttpReader both implement it; the HTTP variant
fetches the range first then runs the same FlatSegmentDecoder dispatch.
Two call sites migrated as exemplars in this commit:
- ScanIterator.readFlat: the hot scan path now goes through the typed
accessor; the FlatSegmentDecoder import is dropped from the file.
- VortexInspectorTui dictionary preview: 5 lines collapse to 1.
Remaining slice() call sites kept as escape hatch for now:
- ScanIterator.readFlatStats: reads the trailing fbLen of the stats
segment, slices a smaller stats flatbuffer region, parses ArrayStats.
Migration requires a typed readSegmentStats(SegmentSpec) accessor.
- InspectorTree.peek (×2): runs peekFlatRoot(seg, arraySpecs) which
parses the FlatBuffer Array root for inspector metadata. Returning
the inspector-specific Peek type from VortexHandle would leak the
inspector module's types into reader; a typed peekFlatRoot accessor
that returns just the FlatBuffer Array would still leak fbs types.
Needs its own design discussion.
- VortexInspectorTui hex peek: reads the first N bytes of a segment
for hex display. Migration requires a readBytes(SegmentSpec, int)
typed accessor.
- PcoFixtureInspectionIntegrationTest: walks the layout for
diagnostic dumping. Test-only; migration low priority.
slice() stays @deprecated(forRemoval = true) on the VortexHandle
interface. The next pass adds the remaining typed accessors and then
drops slice() entirely.
./mvnw verify — all unit + integration tests pass (incl. Rust round-trips).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rface [Phase 5/6] ExtensionEncoder carries the write-side surface (extensionId + dtype + encodeAll). Extension extends ExtensionEncoder, so every existing implementation (DateExtension, TimeExtension, TimestampExtension, UuidExtension) satisfies both contracts unchanged. Read-only consumers can already type-narrow to either Extension or its parent without depending on encoder logic. This mirrors Phase 1's Encoding split. Subsequent commits lift each spec extension's encodeAll override into a standalone *ExtensionEncoder living in the writer module, registered via META-INF/services. Once all four spec extensions are lifted the bifunctional Extension interface is removed and Extension becomes read-only. Extension's decode side does not need a parallel ExtensionDecoder interface because there is no decode method on Extension itself — decode behaviour lives on each concrete implementation as typed decodeAll(Array) methods, exposed for callers via the static Extension.findKnown(DType.Extension) lookup. ./mvnw verify — all unit + integration tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…hase 6/6] Verifies and documents that vortex-reader (+ optional vortex-inspector) build with no transitive dependency on vortex-writer. Read-only consumers — analytics engines, the inspector personality of the CLI, remote-storage diagnostic tooling — can pull only the read-side jars without inheriting the encoder surface (FSST builders, ALP encoders, zstd-jni, etc.). `./mvnw -pl core,reader,inspector verify` builds the read-only subset cleanly. ServiceLoader<EncodingDecoder> in the read-only artifact resolves only the standalone decoders shipped in reader (today: BoolEncodingDecoder, the Phase 2 exemplar), falling back to the bifunctional Encoding implementations in core for encoding families not yet lifted. This is documentation only; the structural property already exists in the current pom layout. Subsequent per-family lifts (Phases 2-3 for the remaining ~29 encodings) progressively shrink the bifunctional-Encoding fallback footprint to zero, at which point the read-only artifact carries no encoder code at all. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Phase 2-3 batch] Lifts three more encoding families into separate decoder/encoder modules following the BoolEncoding exemplar: - NullEncoding -> reader/decode/NullEncodingDecoder + writer/encode/NullEncodingEncoder - ByteBoolEncoding -> reader/decode/ByteBoolEncodingDecoder + writer/encode/ByteBoolEncodingEncoder - PrimitiveEncoding -> reader/decode/PrimitiveEncodingDecoder + writer/encode/PrimitiveEncodingEncoder Manifests updated in both modules. Standalone decoders take precedence over the bifunctional Encoding fallback at runtime; the core *Encoding classes remain in place during the transition and their decode/encode methods are no longer hit on the dispatch path. Side change: ArrayNode, KnownArrayNode, UnknownArrayNode promoted from package-private to public so decoders living outside the io.github.dfa1.vortex.encoding package can call ctx.node().children() etc. The validity-child handling in PrimitiveEncodingDecoder exercises this path. The sealed interface contract is preserved — only the existing two permitted subtypes remain. ./mvnw verify — all unit + integration tests pass (incl. Rust round-trips). Remaining encoding families to lift (~29): Alp, AlpRd, Bitpacked, Chunked, Constant, DateTimeParts, DecimalByteParts, Decimal, Delta, Dict, Ext, FixedSizeList, FrameOfReference, Fsst, List, ListView, Masked, Patched, Pco, Rle, RunEnd, Sequence, Sparse, Struct, VarBin, VarBinView, Variant, ZigZag, Zstd. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…batch] Continues the per-family lift. Two more families now have standalone decoder/encoder implementations registered via ServiceLoader: - ZigZagEncoding -> ZigZagEncodingDecoder + ZigZagEncodingEncoder - SequenceEncoding -> SequenceEncodingDecoder + SequenceEncodingEncoder Both encoders reuse the existing public helpers in core (PTypeIO, SegmentBroadcast); no further core API surface changes. ./mvnw verify — all unit + integration tests pass. Cumulative lifted: Bool, Null, ByteBool, Primitive, ZigZag, Sequence (6 of ~33). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tch] Adds FrameOfReferenceEncodingDecoder (reader) and FrameOfReferenceEncodingEncoder (writer). Encoder retains the cascade path via encodeCascade(...) returning a ChildSlot for the residuals buffer the compressor bitpacks. PTypeIO.set promoted from package-private to public so the writer-side encoder can reach it. Already-public methods (LE_SHORT, LE_INT, LE_LONG, LE_FLOAT, LE_DOUBLE accessors used by Phase 2-3 encoders) were unaffected. ./mvnw verify — all unit + integration tests pass. Cumulative lifted: 7 of ~33 (Bool, Null, ByteBool, Primitive, ZigZag, Sequence, FrameOfReference). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…[Phase 2-3 batch] Four more encoding families lifted. Pattern matches earlier batches: standalone *EncodingDecoder in reader/decode, standalone *EncodingEncoder in writer/encode, registered via the respective META-INF/services manifests. - ExtEncoding: extension storage wrapper. Encoder reuses the bifunctional PrimitiveEncoding / FixedSizeListEncoding / MaskedEncoding classes from core for the storage child during the transition. - MaskedEncoding: NullableData payload + validity. Encoder reuses the same core fallback list. - VarBinEncoding: variable-length binary / UTF-8. Self-contained. - VarBinViewEncoding: Arrow StringView format. Self-contained. Registry.decode promoted from package-private to public so standalone decoders in the reader module can recursively call back into the registry (ExtEncodingDecoder.decode does this for the storage child). ./mvnw verify — all unit + integration tests pass. Cumulative lifted: 11 of ~33 (Bool, Null, ByteBool, Primitive, ZigZag, Sequence, FrameOfReference, Ext, Masked, VarBin, VarBinView). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds ConstantEncodingDecoder + ConstantEncodingEncoder. The decoder covers all dtypes (Primitive, Null, Bool, Utf8/Binary, Decimal, Extension via storage-dtype recursion). The encoder keeps the isConstant guard + cascading-aware encodeCascade path. ./mvnw verify — all unit + integration tests pass. Cumulative lifted: 12 of ~33. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds DeltaEncodingDecoder + DeltaEncodingEncoder. The FastLanes transpose helpers (FL_CHUNK_SIZE, FL_ORDER, transposeIndex, iterateIndex, lanes, typeBits, typeMask, fromLongs) were promoted from package-private to public on DeltaEncoding so both module-local encoders/decoders can share the same implementation rather than duplicating ~50 lines of bit-twiddling math. ./mvnw verify — all unit + integration tests pass. Cumulative lifted: 13 of ~33. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…encodings [Phase 2-3 batch] Three more families lifted. All self-contained or reusing public core encodings during the transition. ./mvnw verify — all unit + integration tests pass. Cumulative lifted: 16 of ~33. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two more structural families lifted. ./mvnw verify — all unit + integration tests pass. Cumulative lifted: 18 of ~33. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…batch] Both currently decode-only (the bifunctional Encoding throws "not yet implemented" on encode); the lifted encoders preserve that behaviour and the decoders carry their full logic. ./mvnw verify — all unit + integration tests pass. Cumulative lifted: 20 of ~33. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Lands every phase of ADR 0001 as 7 commits on one branch. ADR 0001 proposed splitting the read and write runtimes out of
core; this PR delivers the structural foundation + exemplars for the per-family lifts.b6437eeReadRegistry+WriteRegistrydelegating facades overRegistry.1750aa1EncodingDecoder+EncodingEncoderinterfaces.Encodingbecomes a marker that extends both.ed297cdBoolEncodingdecoder lifted intoreader/decode/BoolEncodingDecoder.Registry.standaloneDecodersmap + SL load. Standalone decoders take precedence over bifunctional fallback.f21e5abBoolEncodingencoder lifted intowriter/encode/BoolEncodingEncoder.Registry.standaloneEncodersmap.71e39c6VortexHandle.decodeFlatSegment(...)typed accessor.ScanIterator.readFlatandVortexInspectorTuidictionary preview migrated off rawslice().a772b00ExtensionEncoderinterface.Extension extends ExtensionEncoder.9eb2d32docs/compatibility.mddocuments the read-only deployment artifact subset (./mvnw -pl core,reader,inspector verifyis the verified subset).Tests
./mvnw verifygreen after every commit. 938 unit + 243 integration tests pass, including the Rust cross-language round-trip suite.What's left as mechanical follow-up
Phases 2 and 3 each lifted one encoding family (
BoolEncoding) as exemplar. The remaining ~29 encoding families follow the same shape:*EncodingDecoderinreader/decode/and*EncodingEncoderinwriter/encode/; add lines to the respectiveMETA-INF/services/manifests; remove the correspondingdecode()/encode()methods from the core*Encodingclass once both sides are lifted; eventually delete the core*Encodingfile entirely.Phase 4's typed accessor migration is partial — 2 of the 6 cross-package
slice()callers were migrated. The remaining 4 (ScanIterator.readFlatStats,InspectorTree.peek×2,VortexInspectorTuihex peek, integration test inspector walk) need their own typed accessors beforeslice()can be removed from the publicVortexHandleinterface. Each is a small focused PR.Phase 5's
Extensionsplit is also a foundation — concrete spec-extension encoders (Date/Time/Timestamp/Uuid) are not yet lifted into the writer module. Same mechanical follow-up shape as Phase 2-3.Test plan
./mvnw verifygreen after every commit./mvnw -pl core,reader,inspector verifybuilds the documented read-only subsetstandaloneDecoders.getOrDefault(id, encodings.get(id))precedence ruledecodeFlatSegmenttyped accessor signature🤖 Generated with Claude Code