[fix](cloud) Use bthread-aware shared mutex for tablet header lock#64574
[fix](cloud) Use bthread-aware shared mutex for tablet header lock#64574liaoxin01 wants to merge 1 commit into
Conversation
The tablet header lock (`_meta_lock`) is a `std::shared_mutex`, which under libstdc++ wraps `pthread_rwlock_t` and is thread-affine: unlocking from an OS thread other than the one that acquired it is undefined behavior. In cloud mode the header write lock can be held across a suspending call. For example, when a query-driven rowset sync adds an overlapping (compacted) rowset, `CloudTablet::add_rowsets` warms up the new remote file while holding the write lock, and on a cold-restarted BE resolving the storage vault / building the S3 client issues a meta-service RPC. The holding bthread suspends and may migrate to another worker pthread, so the matching unlock runs on a different OS thread. This corrupts the glibc rwlock (the write-locked bit is left set with no owner), permanently wedging the lock; all readers/writers on that tablet then pile up and queries time out. Replace `_meta_lock` with `BthreadSharedMutex`, a port of libc++'s `std::shared_mutex` (the two-gate condition-variable algorithm) onto `bthread::Mutex` / `bthread::ConditionVariable`. Ownership is an integer state guarded by a briefly held internal mutex and carries no OS-thread identity, so locking on one worker and unlocking on another after a bthread migration is well defined. Waiting suspends the bthread instead of blocking the worker.
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
There was a problem hiding this comment.
Pull request overview
This PR addresses a cloud-mode deadlock/wedge risk caused by using std::shared_mutex (pthread rwlock) as the tablet header lock across bthread-suspending operations, where unlock may occur on a different OS thread than lock acquisition (undefined behavior). It introduces a bthread-friendly shared mutex and switches the tablet header lock to it, updating call sites accordingly.
Changes:
- Add
doris::BthreadSharedMutex, a shared/exclusive lock implemented withbthread::Mutex+bthread::ConditionVariable. - Replace
BaseTablet/Tabletheader lock (_meta_lock/get_header_lock()) and update storage/cloud call sites to lock it withstd::unique_lock/std::shared_lock/std::lock_guardwithout hard-codingstd::shared_mutex. - Update cloud unit tests and cloud meta manager APIs to use the new lock type.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| be/test/cloud/cloud_tablet_test.cpp | Update test locking to use CTAD std::unique_lock with the new header lock type. |
| be/test/cloud/cloud_meta_mgr_test.cpp | Update tests to acquire the tablet header lock via CTAD std::unique_lock. |
| be/test/cloud/cloud_empty_rowset_compaction_test.cpp | Update test to lock tablet header with CTAD std::unique_lock. |
| be/src/util/bthread_shared_mutex.h | Introduce BthreadSharedMutex implemented on bthread primitives. |
| be/src/storage/task/index_builder.cpp | Update header-lock guard to work with the new lock type (CTAD). |
| be/src/storage/task/engine_clone_task.cpp | Update header-lock guard to work with the new lock type (CTAD). |
| be/src/storage/tablet/tablet.h | Change Tablet::get_header_lock() to return BthreadSharedMutex&. |
| be/src/storage/tablet/tablet.cpp | Update _meta_lock locking sites to use CTAD (std::lock_guard / std::shared_lock). |
| be/src/storage/tablet/tablet_manager.cpp | Update tablet-drop path to lock header with CTAD std::lock_guard. |
| be/src/storage/tablet/base_tablet.h | Replace _meta_lock type with BthreadSharedMutex and update get_header_lock(). |
| be/src/storage/schema_change/schema_change.cpp | Update new-tablet header lock guards to CTAD (compatible with new lock type). |
| be/src/storage/rowset_builder.cpp | Update header lock acquisition to CTAD std::lock_guard (now targets new lock type). |
| be/src/storage/compaction/full_compaction.cpp | Update header lock guard to CTAD std::lock_guard. |
| be/src/storage/compaction/compaction.cpp | Update header lock guards to CTAD std::lock_guard. |
| be/src/storage/compaction/binlog_compaction.cpp | Update header lock guard to CTAD std::lock_guard. |
| be/src/cloud/cloud_tablet.h | Update APIs taking the header lock to accept std::unique_lock<BthreadSharedMutex>&. |
| be/src/cloud/cloud_tablet.cpp | Update header lock usage to new type (e.g., CTAD std::unique_lock, std::shared_lock). |
| be/src/cloud/cloud_meta_mgr.h | Update fill_version_holes signature to take std::unique_lock<BthreadSharedMutex>&. |
| be/src/cloud/cloud_meta_mgr.cpp | Update header lock usage and fill_version_holes definition for the new lock type. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| std::lock_guard rlock(tablet->get_header_lock()); | ||
| RETURN_IF_ERROR(tablet->get_all_rs_id_unlocked(old_max_version, &old_rowset_ids)); | ||
| old_rowsets = tablet->get_rowset_by_ids(&old_rowset_ids); | ||
| } |
| std::lock_guard lck(tablet()->get_header_lock()); | ||
| _max_version_in_flush_phase = tablet()->max_version_unlocked(); |
| // A reader-writer lock for bthread contexts. It is a port of libc++'s | ||
| // std::shared_mutex (the two-gate condition-variable algorithm) onto | ||
| // bthread::Mutex/bthread::ConditionVariable. Unlike std::shared_mutex | ||
| // (pthread_rwlock_t), ownership carries no OS-thread identity, so it is safe to | ||
| // lock on one bthread worker and unlock on another after a bthread migrates. | ||
| // Satisfies the C++ SharedMutex requirements (usable with std::unique_lock / | ||
| // std::shared_lock). Writer-preferring. | ||
| class BthreadSharedMutex { |
TPC-H: Total hot run time: 29000 ms |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
ClickBench: Total hot run time: 25.16 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
Thanks for the fix. Since Suggested coverage:
The original bug is subtle and platform-dependent, so a focused |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
Problem
The tablet header lock (
_meta_lock) is astd::shared_mutex, which under libstdc++ wrapspthread_rwlock_tand is thread-affine: unlocking it from an OS thread other than the one that acquired it is undefined behavior.In cloud mode the header write lock can be held across a suspending call. Concretely, when a query-driven rowset sync pulls an overlapping (compacted) rowset,
CloudTablet::add_rowsetswarms up the new remote file while holding the write lock; on a cold-restarted BE, resolving the storage vault / building the S3 client issues a meta-service RPC. The holding bthread suspends and may migrate to another worker pthread, so the matching unlock runs on a different OS thread. This corrupts the glibc rwlock (the write-locked bit is left set with no owner), permanently wedging the lock — all readers/writers on that tablet pile up and queries time out (~90s).Observed in graceful-restart tests: a tablet's header lock stuck with
active_writer=[none]yet everytry_lock/try_lock_sharedfailing for >70 min, and the last writer's acquire OS-tid differing from its release OS-tid (proof of the cross-thread unlock).Fix
Replace
_meta_lockwithBthreadSharedMutex, a port of libc++'sstd::shared_mutex(the two-gate condition-variable algorithm) ontobthread::Mutex/bthread::ConditionVariable:std::unique_lock/std::shared_lock.Only the tablet header lock is switched; unrelated
std::shared_mutexmembers (e.g.TabletMeta::_meta_lock) are left untouched. Call sites that named the type explicitly are converted to class template argument deduction or toBthreadSharedMutex.