feat: support to read chain data split by Weixin-Xu · Pull Request #387 · alibaba/paimon-cpp

Weixin-Xu · 2026-06-29T07:54:40Z

Purpose

Linked issue: close #385

Support deserializing and reading ChainDataSplit.

This change adds ChainDataSplit handling on top of DataSplit deserialization, including:

- reading ChainDataSplit tail metadata after DataSplit bytes
- supporting per-file bucket path mapping when building data file paths
- preserving normal DataSplit / FallbackDataSplit behavior
- supporting the observed version 7 DataSplit metadata layout needed by chain split bytes
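
To illustrate the idea of reading tail metadata that follows an already-consumed base payload, here is a minimal sketch. The wire layout (big-endian 32-bit counts and length-prefixed strings) and the names `ByteReader` and `ReadStringMap` are assumptions made for illustration only; they are not taken from paimon-cpp or from the Java serializer.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical cursor over a byte buffer; stands in for a DataInputStream.
class ByteReader {
 public:
    // `start` is the offset just past the base split bytes.
    ByteReader(std::vector<uint8_t> bytes, size_t start)
        : bytes_(std::move(bytes)), pos_(start < bytes_.size() ? start : bytes_.size()) {}

    // Reads a big-endian 32-bit unsigned value, or nullopt if truncated.
    std::optional<uint32_t> ReadU32() {
        if (bytes_.size() - pos_ < 4) return std::nullopt;
        uint32_t v = 0;
        for (int i = 0; i < 4; ++i) v = (v << 8) | bytes_[pos_++];
        return v;
    }

    // Reads a length-prefixed string, or nullopt if truncated.
    std::optional<std::string> ReadString() {
        auto len = ReadU32();
        if (!len || bytes_.size() - pos_ < *len) return std::nullopt;
        std::string s(reinterpret_cast<const char*>(bytes_.data() + pos_), *len);
        pos_ += *len;
        return s;
    }

 private:
    std::vector<uint8_t> bytes_;
    size_t pos_;
};

using StringMap = std::unordered_map<std::string, std::string>;

// Reads an assumed "count, then key/value pairs" map, e.g. file name -> bucket path.
std::optional<StringMap> ReadStringMap(ByteReader& in) {
    auto count = in.ReadU32();
    if (!count) return std::nullopt;
    StringMap result;
    for (uint32_t i = 0; i < *count; ++i) {
        auto key = in.ReadString();
        if (!key) return std::nullopt;
        auto value = in.ReadString();
        if (!value) return std::nullopt;
        result.emplace(std::move(*key), std::move(*value));
    }
    return result;
}
```

A truncated tail yields `std::nullopt` here; the PR itself reports such cases through a `Status` with context.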

Tests

Added ChainDataSplitTest
- deserialize ChainDataSplit tail
- deserialize version 7 split with original bucket path and chain tail
- verify per-file bucket path resolution
- verify external path is preserved
- verify malformed ChainDataSplit tail returns contextual error
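
The path-resolution behavior these tests describe can be sketched roughly as follows. `ResolveDataFilePath` is a hypothetical free function written for illustration; the PR implements this inside its path factory classes, and the fallback to a default bucket path for unmapped files is an assumption.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical helper, not the PR's API.
std::string ResolveDataFilePath(
    const std::string& file_name, const std::optional<std::string>& external_path,
    const std::string& default_bucket_path,
    const std::unordered_map<std::string, std::string>& file_bucket_path_mapping) {
    // An external path, when present, is returned unchanged.
    if (external_path) return *external_path;
    // Otherwise prefer the bucket path recorded for this file, if any.
    auto it = file_bucket_path_mapping.find(file_name);
    const std::string& bucket_path =
        it != file_bucket_path_mapping.end() ? it->second : default_bucket_path;
    return bucket_path + "/" + file_name;
}
```

The file names and paths a caller passes in are up to the table layout; nothing in this sketch validates them.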

API and Format

Documentation

No user-facing documentation update.

Generative AI tooling

Generated-by: OpenAI Codex

CLAassistant · 2026-06-29T07:54:48Z

All committers have signed the CLA.

zjw1111

Thanks for adding ChainDataSplit support! The overall design looks good. A few minor suggestions below.

zjw1111 · 2026-06-30T10:51:47Z

    Result<std::unique_ptr<BatchReader>> ApplyPredicateFilterIfNeeded(
        std::unique_ptr<BatchReader>&& reader, const std::shared_ptr<Predicate>& predicate) const;

+    Result<std::shared_ptr<DataFilePathFactory>> CreateDataFilePathFactory(


Minor: CreateDataFilePathFactory is currently in the public section, but it seems to only be called by subclasses internally. Would it be possible to move it into the protected section just below?

Weixin-Xu · 2026-07-01

Fixed in the latest revision. CreateDataFilePathFactory is now protected since it is only used by AbstractSplitRead subclasses.
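
For readers unfamiliar with the pattern, the access change discussed here looks roughly like the sketch below. The class names and the `CreatePathPrefix` helper are stand-ins invented for illustration, not the PR's declarations.

```cpp
#include <string>

class AbstractSplitReadSketch {
 public:
    virtual ~AbstractSplitReadSketch() = default;
    virtual std::string Describe() const = 0;

 protected:
    // Only subclasses can call this helper; external callers cannot.
    std::string CreatePathPrefix(const std::string& partition, int bucket) const {
        return partition + "/bucket-" + std::to_string(bucket);
    }
};

class RawSplitReadSketch : public AbstractSplitReadSketch {
 public:
    // The subclass uses the protected helper internally.
    std::string Describe() const override { return CreatePathPrefix("dt=2026", 1); }
};
```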

zjw1111 · 2026-06-30T10:51:47Z

+
+    const std::unordered_map<std::string, std::string>& FileBucketPathMapping() const {
+        return file_bucket_path_mapping_;
+    }


I noticed that file_branch_mapping_ is deserialized and stored, but not consumed by any read path in this PR (e.g. ChainDataFilePathFactory only uses file_bucket_path_mapping_). I assume this is reserved for future use (e.g. selecting the correct schema per branch, as the Java side does). Could you add a brief note in the PR description mentioning this is intentionally deferred?

lxy-9602 · 2026-06-30T10:52:24Z

+    if (!data_file_path_factory) {
+        PAIMON_ASSIGN_OR_RAISE(data_file_path_factory,
+                               path_factory_->CreateDataFilePathFactory(partition, bucket));
+    }


This design feels a bit too implicit. Why can’t each caller just pass in the correct data_file_path_factory directly?

Also, let’s avoid using default arguments in production code.
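
One way to read this suggestion: make the factory a required parameter and let each caller construct the right one. The sketch below uses simplified stand-in types and hypothetical function names; it is not the PR's code.

```cpp
#include <string>

// Stand-in for DataFilePathFactory; only what the sketch needs.
struct PathFactorySketch {
    std::string bucket_path;
    std::string ToPath(const std::string& file_name) const {
        return bucket_path + "/" + file_name;
    }
};

// No default argument and no null fallback: the factory is a required input.
std::string BuildReaderPath(const PathFactorySketch& factory, const std::string& file_name) {
    return factory.ToPath(file_name);
}

// Each caller decides which factory applies before calling.
std::string ReadNormalSplit(const std::string& file_name) {
    PathFactorySketch factory{"/table/bucket-0"};
    return BuildReaderPath(factory, file_name);
}

std::string ReadChainSplit(const std::string& file_name, const std::string& mapped_bucket_path) {
    PathFactorySketch factory{mapped_bucket_path};
    return BuildReaderPath(factory, file_name);
}
```

With this shape, the choice of factory is visible at every call site instead of being hidden behind a null check.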

lxy-9602 · 2026-06-30T10:58:16Z

+                return Status::Invalid(fmt::format("invalid ChainDataSplit byte stream: {}",
+                                                   chain_split.status().ToString()));
+            }
+            return std::static_pointer_cast<Split>(chain_split.value());


Could you please use PAIMON_ASSIGN_OR_RAISE rather than if (!chain_split.ok())?
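
The general shape of an assign-or-raise macro can be sketched as below. `StatusSketch`, `ResultSketch`, and `SKETCH_ASSIGN_OR_RAISE` are simplified stand-ins written for this example; the actual definition of PAIMON_ASSIGN_OR_RAISE is not shown in this thread and may differ.

```cpp
#include <string>
#include <utility>

struct StatusSketch {
    bool ok = true;
    std::string message;
};

template <typename T>
class ResultSketch {
 public:
    ResultSketch(T value) : value_(std::move(value)) {}
    ResultSketch(StatusSketch status) : status_(std::move(status)) {}
    bool ok() const { return status_.ok; }
    const StatusSketch& status() const { return status_; }
    T& value() { return value_; }

 private:
    StatusSketch status_;
    T value_{};
};

#define SKETCH_CONCAT_INNER(a, b) a##b
#define SKETCH_CONCAT(a, b) SKETCH_CONCAT_INNER(a, b)
// Evaluates expr once; on failure returns its status, otherwise assigns to lhs.
#define SKETCH_ASSIGN_OR_RAISE_IMPL(tmp, lhs, expr) \
    auto tmp = (expr);                              \
    if (!tmp.ok()) return tmp.status();             \
    lhs = std::move(tmp.value())
#define SKETCH_ASSIGN_OR_RAISE(lhs, expr) \
    SKETCH_ASSIGN_OR_RAISE_IMPL(SKETCH_CONCAT(sketch_tmp_, __LINE__), lhs, expr)

ResultSketch<int> ParseTailLength(const std::string& bytes) {
    if (bytes.empty()) return StatusSketch{false, "empty tail"};
    return static_cast<int>(bytes.size());
}

ResultSketch<std::string> DescribeTail(const std::string& bytes) {
    // Replaces a manual `if (!result.ok()) return ...;` check.
    SKETCH_ASSIGN_OR_RAISE(int length, ParseTailLength(bytes));
    return "tail of " + std::to_string(length) + " bytes";
}
```

Note that a macro of this shape propagates the inner status unchanged, so a contextual prefix such as "invalid ChainDataSplit byte stream" would have to be attached some other way if it is still wanted.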

lxy-9602 · 2026-06-30T11:01:23Z

+
+Result<std::shared_ptr<ChainDataSplitImpl>> ReadChainDataSplitTail(
+    const std::shared_ptr<DataSplitImpl>& base_split, DataInputStream* in,
+    const std::shared_ptr<MemoryPool>& pool) {


Could you move the output parameter (in) to the end of the parameter list?
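
A sketch of the requested ordering, with read-only inputs first and the mutated stream pointer last. The types and the function name are stand-ins invented for illustration, not the PR's signature.

```cpp
#include <cstddef>
#include <string>

// Stand-ins for the PR's split and stream types.
struct BaseSplitSketch {
    std::string name;
};
struct InputStreamSketch {
    std::string data;
    size_t pos = 0;
};

// Inputs first, then the pointer parameter the function mutates.
// Assumes in->pos does not exceed in->data.size().
std::string ReadTailSketch(const BaseSplitSketch& base_split, size_t tail_length,
                           InputStreamSketch* in) {
    std::string tail = in->data.substr(in->pos, tail_length);
    in->pos += tail.size();  // advance the stream past what was consumed
    return base_split.name + ":" + tail;
}
```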

lxy-9602 · 2026-06-30T12:44:39Z

 struct DataFileMeta;
 namespace {
+Result<std::vector<std::shared_ptr<DataFileMeta>>> ReadVersion7DataFileMetaList(
+    DataInputStream* in, const std::shared_ptr<MemoryPool>& pool) {


I found that the chain table split is currently not compatible with the Java implementation. Java uses ChainSplitHeader + logicalPartition + files + bucketMap + branchMap, while C++ uses DataSplit + ChainTail. The split format must be kept fully consistent with Java.

Support chain data split

9db76b3

Weixin-Xu changed the title ~~Support chain data split~~ Support to read chain data split Jun 29, 2026

Merge branch 'main' into support_chain_table_oss

338b67a

zjw1111 reviewed Jun 30, 2026

zjw1111 changed the title ~~Support to read chain data split~~ feat: support to read chain data split Jun 30, 2026

lxy-9602 reviewed Jun 30, 2026

Weixin-Xu wants to merge 2 commits into alibaba:main from Weixin-Xu:support_chain_table_oss
