feat: support to read chain data split #387

Open
Weixin-Xu wants to merge 2 commits into alibaba:main from Weixin-Xu:support_chain_table_oss

@Weixin-Xu


Purpose

Linked issue: close #385

Support deserializing and reading ChainDataSplit.

This change adds ChainDataSplit handling on top of DataSplit deserialization, including:

  • reading ChainDataSplit tail metadata after DataSplit bytes
  • supporting per-file bucket path mapping when building data file paths
  • preserving normal DataSplit / FallbackDataSplit behavior
  • supporting the observed version 7 DataSplit metadata layout needed by chain split bytes
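As a rough illustration of the per-file bucket path mapping listed above, the sketch below resolves a data file's location by checking a per-file map first and falling back to the split's default bucket path. The function name `ResolveDataFilePath`, the example paths, and the simple `bucket_path + "/" + file_name` layout are assumptions made for illustration; this is not the PR's actual path factory code.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper (not from the PR): choose the bucket directory for one
// data file. Files present in the mapping use their own bucket path; every
// other file falls back to the split's default bucket path.
std::string ResolveDataFilePath(
    const std::unordered_map<std::string, std::string>& file_bucket_path_mapping,
    const std::string& default_bucket_path, const std::string& file_name) {
    auto it = file_bucket_path_mapping.find(file_name);
    const std::string& bucket_path =
        (it != file_bucket_path_mapping.end()) ? it->second : default_bucket_path;
    return bucket_path + "/" + file_name;
}
```

With such a helper, a normal DataSplit (empty mapping) would resolve every file against the default bucket path, which is one way the existing behavior could stay unchanged.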

Tests

  • Added ChainDataSplitTest
    • deserialize ChainDataSplit tail
    • deserialize version 7 split with original bucket path and chain tail
    • verify per-file bucket path resolution
    • verify external path is preserved
    • verify malformed ChainDataSplit tail returns contextual error

API and Format

Documentation

No user-facing documentation update.

Generative AI tooling

Generated-by: OpenAI Codex

@CLAassistant

CLAassistant commented Jun 29, 2026


CLA assistant check
All committers have signed the CLA.

@Weixin-Xu Weixin-Xu changed the title from "Support chain data split" to "Support to read chain data split" Jun 29, 2026

@zjw1111 zjw1111 left a comment

Collaborator

Thanks for adding ChainDataSplit support! The overall design looks good. A few minor suggestions below.

Result<std::unique_ptr<BatchReader>> ApplyPredicateFilterIfNeeded(
    std::unique_ptr<BatchReader>&& reader, const std::shared_ptr<Predicate>& predicate) const;

Result<std::shared_ptr<DataFilePathFactory>> CreateDataFilePathFactory(

Collaborator

Minor: CreateDataFilePathFactory is currently in the public section, but it seems to only be called by subclasses internally. Would it be possible to move it into the protected section just below?

Author

Fixed in the latest revision. CreateDataFilePathFactory is now protected since it is only used by AbstractSplitRead subclasses.
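For readers unfamiliar with the access-level change discussed here, a minimal sketch follows. All class and member names (`PathFactoryStub`, `SplitReadBase`, `ChainSplitRead`, `CreatePathFactory`) are hypothetical stand-ins, not the PR's real types; the point is only that a `protected` helper stays callable from subclasses while disappearing from the public interface.

```cpp
#include <memory>
#include <string>

// Stand-in type; the real DataFilePathFactory is not reproduced here.
struct PathFactoryStub {
    std::string bucket_dir;
};

class SplitReadBase {
 public:
    virtual ~SplitReadBase() = default;
    virtual std::string Describe(int bucket) const = 0;

 protected:
    // Callable from subclasses only; external callers cannot reach it.
    std::shared_ptr<PathFactoryStub> CreatePathFactory(int bucket) const {
        return std::make_shared<PathFactoryStub>(
            PathFactoryStub{"bucket-" + std::to_string(bucket)});
    }
};

class ChainSplitRead : public SplitReadBase {
 public:
    std::string Describe(int bucket) const override {
        // Allowed: a derived class may use the protected helper.
        return CreatePathFactory(bucket)->bucket_dir;
    }
};
```

Calling `CreatePathFactory` on a `ChainSplitRead` object from outside the class hierarchy would fail to compile, which is the encapsulation the reviewer asked for.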


const std::unordered_map<std::string, std::string>& FileBucketPathMapping() const {
    return file_bucket_path_mapping_;
}

Collaborator

I noticed that file_branch_mapping_ is deserialized and stored, but not consumed by any read path in this PR (e.g. ChainDataFilePathFactory only uses file_bucket_path_mapping_). I assume this is reserved for future use (e.g. selecting the correct schema per branch, as the Java side does). Could you add a brief note in the PR description mentioning this is intentionally deferred?

@zjw1111 zjw1111 changed the title from "Support to read chain data split" to "feat: support to read chain data split" Jun 30, 2026

if (!data_file_path_factory) {
    PAIMON_ASSIGN_OR_RAISE(data_file_path_factory,
                           path_factory_->CreateDataFilePathFactory(partition, bucket));
}

Collaborator

This design feels a bit too implicit. Why can’t each caller just pass in the correct data_file_path_factory directly?

Collaborator

Also, let’s avoid using default arguments in production code.
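The two comments above can be illustrated with a small contrast, sketched under stated assumptions: `PathFactoryStub`, `OpenImplicit`, and `OpenExplicit` are hypothetical names, not functions from the PR. The first function mirrors the pattern under review (a null, defaulted factory triggers a hidden fallback); the second requires every caller to supply the factory it wants.

```cpp
#include <memory>
#include <string>

// Stand-in type; the real DataFilePathFactory is not reproduced here.
struct PathFactoryStub {
    std::string bucket_dir;
};

// Implicit style: a default argument plus a null check hides which factory
// is actually used, so the call site does not show the decision.
std::string OpenImplicit(const std::string& file,
                         std::shared_ptr<PathFactoryStub> factory = nullptr) {
    if (!factory) {
        factory = std::make_shared<PathFactoryStub>(PathFactoryStub{"default-bucket"});
    }
    return factory->bucket_dir + "/" + file;
}

// Explicit style: no default argument and no fallback; the caller decides.
std::string OpenExplicit(const std::string& file, const PathFactoryStub& factory) {
    return factory.bucket_dir + "/" + file;
}
```

In the explicit variant, forgetting to pass a chain-aware factory becomes a compile error instead of a silent fallback to the default bucket path.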

return Status::Invalid(fmt::format("invalid ChainDataSplit byte stream: {}",
chain_split.status().ToString()));
}
return std::static_pointer_cast<Split>(chain_split.value());

Collaborator

Could you please use PAIMON_ASSIGN_OR_RAISE rather than if (!chain_split.ok())?
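To show the general shape of the assign-or-raise pattern being requested, here is a self-contained sketch. Everything in it is a toy stand-in assumed for illustration: `Status`, `Result`, `TOY_ASSIGN_OR_RAISE`, `ParseLength`, and the two `Doubled*` functions are not Paimon's real types or macros, whose exact definitions are not shown in this thread.

```cpp
#include <string>
#include <utility>

// Toy status: an empty message means success.
struct Status {
    std::string message;
    bool ok() const { return message.empty(); }
};

// Toy result: holds either a value or a non-OK status.
template <typename T>
class Result {
 public:
    Result(T value) : value_(std::move(value)) {}
    Result(Status status) : status_(std::move(status)) {}
    bool ok() const { return status_.ok(); }
    const Status& status() const { return status_; }
    T& value() { return value_; }

 private:
    Status status_;
    T value_{};
};

#define TOY_CONCAT_INNER(a, b) a##b
#define TOY_CONCAT(a, b) TOY_CONCAT_INNER(a, b)
// Evaluate expr once; on error return its status, otherwise move the value.
#define TOY_ASSIGN_OR_RAISE_IMPL(tmp, lhs, expr) \
    auto tmp = (expr);                           \
    if (!tmp.ok()) return tmp.status();          \
    lhs = std::move(tmp.value())
#define TOY_ASSIGN_OR_RAISE(lhs, expr) \
    TOY_ASSIGN_OR_RAISE_IMPL(TOY_CONCAT(toy_result_, __LINE__), lhs, expr)

Result<int> ParseLength(const std::string& text) {
    if (text.empty()) return Status{"empty input"};
    return static_cast<int>(text.size());
}

// Manual style: explicit ok() check at the call site.
Result<int> DoubledManual(const std::string& text) {
    Result<int> length = ParseLength(text);
    if (!length.ok()) return length.status();
    return length.value() * 2;
}

// Macro style: the check and early return are folded into one line.
Result<int> DoubledWithMacro(const std::string& text) {
    int length = 0;
    TOY_ASSIGN_OR_RAISE(length, ParseLength(text));
    return length * 2;
}
```

Both functions behave the same; the macro form simply removes the repeated check. Note that a plain assign-or-raise propagates the inner status as-is, so any extra context message would have to be added separately.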


Result<std::shared_ptr<ChainDataSplitImpl>> ReadChainDataSplitTail(
    const std::shared_ptr<DataSplitImpl>& base_split, DataInputStream* in,
    const std::shared_ptr<MemoryPool>& pool) {

Collaborator

Could you move the output parameter (in) to the end of the parameter list?
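The ordering convention being asked for (read-only inputs first, the mutated pointer parameter last) might look like the sketch below. `ByteCursor` and `ReadTagged` are hypothetical names used only for illustration; they are not the PR's `DataInputStream` or `ReadChainDataSplitTail`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a stream that is advanced as it is read.
struct ByteCursor {
    std::string bytes;
    std::size_t pos = 0;
};

// Read-only inputs come first; the cursor that the function mutates is passed
// by pointer and placed last, so the side effect is visible at the call site.
std::string ReadTagged(const std::string& prefix, std::size_t length, ByteCursor* in) {
    std::string chunk = in->bytes.substr(in->pos, length);
    in->pos += chunk.size();
    return prefix + chunk;
}
```

A call then reads as `ReadTagged("x:", 3, &cursor)`, where the trailing `&cursor` marks the argument that will be modified.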

struct DataFileMeta;
namespace {
Result<std::vector<std::shared_ptr<DataFileMeta>>> ReadVersion7DataFileMetaList(
    DataInputStream* in, const std::shared_ptr<MemoryPool>& pool) {

Collaborator

I found that the chain table split is currently not compatible with the Java implementation. Java uses ChainSplitHeader + logicalPartition + files + bucketMap + branchMap, while C++ uses DataSplit + ChainTail. The split format must be kept fully consistent with Java.

Development

Successfully merging this pull request may close these issues.

[Bug] Chain Table read failure

4 participants