feat: support to read chain data split#387
Conversation
zjw1111
left a comment
There was a problem hiding this comment.
Thanks for adding ChainDataSplit support! The overall design looks good. A few minor suggestions below.
| Result<std::unique_ptr<BatchReader>> ApplyPredicateFilterIfNeeded( | ||
| std::unique_ptr<BatchReader>&& reader, const std::shared_ptr<Predicate>& predicate) const; | ||
|
|
||
| Result<std::shared_ptr<DataFilePathFactory>> CreateDataFilePathFactory( |
There was a problem hiding this comment.
Minor: CreateDataFilePathFactory is currently in the public section, but it seems to only be called by subclasses internally. Would it be possible to move it into the protected section just below?
There was a problem hiding this comment.
Fixed in the latest revision. CreateDataFilePathFactory is now protected since it is only used by AbstractSplitRead subclasses.
|
|
||
| const std::unordered_map<std::string, std::string>& FileBucketPathMapping() const { | ||
| return file_bucket_path_mapping_; | ||
| } |
There was a problem hiding this comment.
I noticed that file_branch_mapping_ is deserialized and stored, but not consumed by any read path in this PR (e.g. ChainDataFilePathFactory only uses file_bucket_path_mapping_). I assume this is reserved for future use (e.g. selecting the correct schema per branch, as the Java side does). Could you add a brief note in the PR description mentioning this is intentionally deferred?
| if (!data_file_path_factory) { | ||
| PAIMON_ASSIGN_OR_RAISE(data_file_path_factory, | ||
| path_factory_->CreateDataFilePathFactory(partition, bucket)); | ||
| } |
There was a problem hiding this comment.
This design feels a bit too implicit. Why can’t each caller just pass in the correct data_file_path_factory directly?
There was a problem hiding this comment.
Also, let’s avoid using default arguments in production code.
| return Status::Invalid(fmt::format("invalid ChainDataSplit byte stream: {}", | ||
| chain_split.status().ToString())); | ||
| } | ||
| return std::static_pointer_cast<Split>(chain_split.value()); |
There was a problem hiding this comment.
Could you please use PAIMON_ASSIGN_OR_RAISE rather than if (!chain_split.ok()).
|
|
||
| Result<std::shared_ptr<ChainDataSplitImpl>> ReadChainDataSplitTail( | ||
| const std::shared_ptr<DataSplitImpl>& base_split, DataInputStream* in, | ||
| const std::shared_ptr<MemoryPool>& pool) { |
There was a problem hiding this comment.
Could you move the output parameter (in) in to the end of the parameter list?
| struct DataFileMeta; | ||
| namespace { | ||
| Result<std::vector<std::shared_ptr<DataFileMeta>>> ReadVersion7DataFileMetaList( | ||
| DataInputStream* in, const std::shared_ptr<MemoryPool>& pool) { |
There was a problem hiding this comment.
I found that the chain table split is currently not compatible with the Java implementation. Java uses ChainSplitHeader + logicalPartition + files + bucketMap + branchMap, while C++ uses DataSplit + ChainTail. The split format must be kept fully consistent with Java.
Purpose
Linked issue: close #385
Support deserializing and reading ChainDataSplit.
This change adds ChainDataSplit handling on top of DataSplit deserialization, including:
Tests
ChainDataSplitTestAPI and Format
Documentation
No user-facing documentation update.
Generative AI tooling
Generated-by: OpenAI Codex