[fix](cloud) Use bthread-aware shared mutex for tablet header lock #64574

Open
liaoxin01 wants to merge 1 commit into apache:master from liaoxin01:fix-meta-lock-bthread-rwlock-master

@liaoxin01

Contributor

Problem

The tablet header lock (_meta_lock) is a std::shared_mutex, which under libstdc++ wraps pthread_rwlock_t and is thread-affine: unlocking it from an OS thread other than the one that acquired it is undefined behavior.

In cloud mode the header write lock can be held across a suspending call. Concretely, when a query-driven rowset sync pulls an overlapping (compacted) rowset, CloudTablet::add_rowsets warms up the new remote file while holding the write lock; on a cold-restarted BE, resolving the storage vault / building the S3 client issues a meta-service RPC. The holding bthread suspends and may migrate to another worker pthread, so the matching unlock runs on a different OS thread. This corrupts the glibc rwlock (the write-locked bit is left set with no owner), permanently wedging the lock — all readers/writers on that tablet pile up and queries time out (~90s).

Observed in graceful-restart tests: a tablet's header lock stuck with active_writer=[none] yet every try_lock/try_lock_shared failing for >70 min, and the last writer's acquire OS-tid differing from its release OS-tid (proof of the cross-thread unlock).

Fix

Replace _meta_lock with BthreadSharedMutex, a port of libc++'s std::shared_mutex (the two-gate condition-variable algorithm) onto bthread::Mutex / bthread::ConditionVariable:

  • Ownership is an integer state guarded by a briefly held internal mutex and carries no OS-thread identity, so locking on one worker and unlocking on another after a bthread migration is well defined — the permanent wedge can no longer happen.
  • Waiting blocks on a bthread condition variable, suspending the bthread instead of blocking the worker.
  • Writer-preferring; satisfies the C++ SharedMutex requirements, so it is a drop-in with std::unique_lock / std::shared_lock.
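A minimal sketch of the two-gate algorithm described above, not the code in this PR. As an assumption made so the example builds without brpc, `std::mutex` and `std::condition_variable` stand in for `bthread::Mutex` and `bthread::ConditionVariable`; the class name `TwoGateSharedMutex` and the helper `cross_thread_unlock_demo` are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for illustration only. The state word holds a
// "write entered" bit plus a reader count and carries no thread identity.
class TwoGateSharedMutex {
public:
    void lock() {
        std::unique_lock<std::mutex> lk(_mu);
        // Gate 1: wait until no other writer has entered.
        _gate1.wait(lk, [this] { return (_state & kWriteEntered) == 0; });
        _state |= kWriteEntered;  // from here on, new readers block at gate 1
        // Gate 2: wait for the readers already inside to drain.
        _gate2.wait(lk, [this] { return (_state & kReaderMask) == 0; });
    }
    bool try_lock() {
        std::lock_guard<std::mutex> lk(_mu);
        if (_state != 0) return false;
        _state = kWriteEntered;
        return true;
    }
    void unlock() {
        std::lock_guard<std::mutex> lk(_mu);
        _state = 0;
        _gate1.notify_all();  // wake both waiting readers and writers
    }
    void lock_shared() {
        std::unique_lock<std::mutex> lk(_mu);
        _gate1.wait(lk, [this] {
            return (_state & kWriteEntered) == 0 &&
                   (_state & kReaderMask) != kReaderMask;
        });
        ++_state;  // reader count lives in the low bits
    }
    bool try_lock_shared() {
        std::lock_guard<std::mutex> lk(_mu);
        if ((_state & kWriteEntered) != 0 ||
            (_state & kReaderMask) == kReaderMask) {
            return false;
        }
        ++_state;
        return true;
    }
    void unlock_shared() {
        std::lock_guard<std::mutex> lk(_mu);
        unsigned readers = (_state & kReaderMask) - 1;
        _state = (_state & kWriteEntered) | readers;
        if ((_state & kWriteEntered) != 0) {
            if (readers == 0) _gate2.notify_one();  // last reader admits the writer
        } else if (readers == kReaderMask - 1) {
            _gate1.notify_one();  // reader count was saturated; admit one waiter
        }
    }

private:
    static constexpr unsigned kWriteEntered = 1U << 31;
    static constexpr unsigned kReaderMask = ~kWriteEntered;
    std::mutex _mu;  // held only briefly around state updates
    std::condition_variable _gate1;
    std::condition_variable _gate2;
    unsigned _state = 0;
};

// Acquires on the calling thread and releases on another OS thread, the
// pattern that is undefined behavior for std::shared_mutex.
bool cross_thread_unlock_demo() {
    TwoGateSharedMutex mu;
    mu.lock();
    assert(!mu.try_lock_shared());  // readers are excluded while write-held
    std::thread other([&mu] { mu.unlock(); });
    other.join();
    bool reacquired = mu.try_lock();  // succeeds if the lock was not wedged
    if (reacquired) mu.unlock();
    return reacquired;
}
```

Because the only per-lock state is the integer word, nothing records which thread acquired the lock, which is why the release in `cross_thread_unlock_demo` is well defined in this sketch.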

Only the tablet header lock is switched; unrelated std::shared_mutex members (e.g. TabletMeta::_meta_lock) are left untouched. Call sites that named the type explicitly are converted to class template argument deduction or to BthreadSharedMutex.
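To illustrate why call sites converted to class template argument deduction need no further change when the lock type changes, here is a small sketch. `DemoSharedMutex`, `DemoTablet`, and `ctad_demo` are hypothetical names, and the demo mutex simply forwards to `std::shared_mutex`; it is not the PR's `BthreadSharedMutex`.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <type_traits>

// Hypothetical lock type that meets the SharedMutex requirements by
// forwarding to std::shared_mutex (illustration only).
class DemoSharedMutex {
public:
    void lock() { _impl.lock(); }
    bool try_lock() { return _impl.try_lock(); }
    void unlock() { _impl.unlock(); }
    void lock_shared() { _impl.lock_shared(); }
    bool try_lock_shared() { return _impl.try_lock_shared(); }
    void unlock_shared() { _impl.unlock_shared(); }

private:
    std::shared_mutex _impl;
};

// Hypothetical tablet exposing its header lock through an accessor.
struct DemoTablet {
    DemoSharedMutex& get_header_lock() { return _meta_lock; }
    int version = 0;
    DemoSharedMutex _meta_lock;
};

int ctad_demo() {
    DemoTablet tablet;
    {
        // No template argument is spelled out, so the guard type follows
        // whatever get_header_lock() returns.
        std::unique_lock wlock(tablet.get_header_lock());
        static_assert(std::is_same_v<decltype(wlock),
                                     std::unique_lock<DemoSharedMutex>>);
        assert(wlock.owns_lock());
        tablet.version = 7;
    }
    std::shared_lock rlock(tablet.get_header_lock());
    static_assert(std::is_same_v<decltype(rlock),
                                 std::shared_lock<DemoSharedMutex>>);
    return tablet.version;
}
```

A call site written as `std::unique_lock<std::shared_mutex> wlock(...)` would stop compiling after the type change, whereas the deduced form keeps working.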

The tablet header lock (`_meta_lock`) is a `std::shared_mutex`, which under
libstdc++ wraps `pthread_rwlock_t` and is thread-affine: unlocking from an OS
thread other than the one that acquired it is undefined behavior.

In cloud mode the header write lock can be held across a suspending call. For
example, when a query-driven rowset sync adds an overlapping (compacted) rowset,
`CloudTablet::add_rowsets` warms up the new remote file while holding the write
lock, and on a cold-restarted BE resolving the storage vault / building the S3
client issues a meta-service RPC. The holding bthread suspends and may migrate
to another worker pthread, so the matching unlock runs on a different OS thread.
This corrupts the glibc rwlock (the write-locked bit is left set with no owner),
permanently wedging the lock; all readers/writers on that tablet then pile up
and queries time out.

Replace `_meta_lock` with `BthreadSharedMutex`, a port of libc++'s
`std::shared_mutex` (the two-gate condition-variable algorithm) onto
`bthread::Mutex` / `bthread::ConditionVariable`. Ownership is an integer state
guarded by a briefly held internal mutex and carries no OS-thread identity, so
locking on one worker and unlocking on another after a bthread migration is well
defined. Waiting suspends the bthread instead of blocking the worker.
Copilot AI review requested due to automatic review settings June 16, 2026 15:52
@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@liaoxin01

Contributor Author

run buildall

Copilot AI left a comment

Contributor

Pull request overview

This PR addresses a cloud-mode deadlock/wedge risk caused by using std::shared_mutex (pthread rwlock) as the tablet header lock across bthread-suspending operations, where unlock may occur on a different OS thread than lock acquisition (undefined behavior). It introduces a bthread-friendly shared mutex and switches the tablet header lock to it, updating call sites accordingly.

Changes:

  • Add doris::BthreadSharedMutex, a shared/exclusive lock implemented with bthread::Mutex + bthread::ConditionVariable.
  • Replace BaseTablet/Tablet header lock (_meta_lock / get_header_lock()) and update storage/cloud call sites to lock it with std::unique_lock / std::shared_lock / std::lock_guard without hard-coding std::shared_mutex.
  • Update cloud unit tests and cloud meta manager APIs to use the new lock type.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
| --- | --- |
| be/test/cloud/cloud_tablet_test.cpp | Update test locking to use CTAD std::unique_lock with the new header lock type. |
| be/test/cloud/cloud_meta_mgr_test.cpp | Update tests to acquire the tablet header lock via CTAD std::unique_lock. |
| be/test/cloud/cloud_empty_rowset_compaction_test.cpp | Update test to lock tablet header with CTAD std::unique_lock. |
| be/src/util/bthread_shared_mutex.h | Introduce BthreadSharedMutex implemented on bthread primitives. |
| be/src/storage/task/index_builder.cpp | Update header-lock guard to work with the new lock type (CTAD). |
| be/src/storage/task/engine_clone_task.cpp | Update header-lock guard to work with the new lock type (CTAD). |
| be/src/storage/tablet/tablet.h | Change Tablet::get_header_lock() to return BthreadSharedMutex&. |
| be/src/storage/tablet/tablet.cpp | Update _meta_lock locking sites to use CTAD (std::lock_guard / std::shared_lock). |
| be/src/storage/tablet/tablet_manager.cpp | Update tablet-drop path to lock header with CTAD std::lock_guard. |
| be/src/storage/tablet/base_tablet.h | Replace _meta_lock type with BthreadSharedMutex and update get_header_lock(). |
| be/src/storage/schema_change/schema_change.cpp | Update new-tablet header lock guards to CTAD (compatible with new lock type). |
| be/src/storage/rowset_builder.cpp | Update header lock acquisition to CTAD std::lock_guard (now targets new lock type). |
| be/src/storage/compaction/full_compaction.cpp | Update header lock guard to CTAD std::lock_guard. |
| be/src/storage/compaction/compaction.cpp | Update header lock guards to CTAD std::lock_guard. |
| be/src/storage/compaction/binlog_compaction.cpp | Update header lock guard to CTAD std::lock_guard. |
| be/src/cloud/cloud_tablet.h | Update APIs taking the header lock to accept std::unique_lock<BthreadSharedMutex>&. |
| be/src/cloud/cloud_tablet.cpp | Update header lock usage to new type (e.g., CTAD std::unique_lock, std::shared_lock). |
| be/src/cloud/cloud_meta_mgr.h | Update fill_version_holes signature to take std::unique_lock<BthreadSharedMutex>&. |
| be/src/cloud/cloud_meta_mgr.cpp | Update header lock usage and fill_version_holes definition for the new lock type. |

Comment on lines +621 to 624
std::lock_guard rlock(tablet->get_header_lock());
RETURN_IF_ERROR(tablet->get_all_rs_id_unlocked(old_max_version, &old_rowset_ids));
old_rowsets = tablet->get_rowset_by_ids(&old_rowset_ids);
}
Comment on lines +150 to 151
std::lock_guard lck(tablet()->get_header_lock());
_max_version_in_flush_phase = tablet()->max_version_unlocked();
Comment on lines +28 to +35
// A reader-writer lock for bthread contexts. It is a port of libc++'s
// std::shared_mutex (the two-gate condition-variable algorithm) onto
// bthread::Mutex/bthread::ConditionVariable. Unlike std::shared_mutex
// (pthread_rwlock_t), ownership carries no OS-thread identity, so it is safe to
// lock on one bthread worker and unlock on another after a bthread migrates.
// Satisfies the C++ SharedMutex requirements (usable with std::unique_lock /
// std::shared_lock). Writer-preferring.
class BthreadSharedMutex {
@hello-stephen

Contributor
TPC-H: Total hot run time: 29000 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 47740e028e4d360ed1df6b7e6ddee515126a6fe3, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17690	3976	3963	3963
q2	1996	305	184	184
q3	10329	1438	830	830
q4	4684	469	337	337
q5	7538	856	577	577
q6	205	167	136	136
q7	760	827	621	621
q8	10128	1596	1747	1596
q9	6286	4531	4536	4531
q10	6799	1808	1531	1531
q11	435	272	240	240
q12	651	430	287	287
q13	18241	3382	2777	2777
q14	278	255	239	239
q15	q16	782	785	705	705
q17	1318	905	876	876
q18	7102	5735	5476	5476
q19	1797	1381	1119	1119
q20	492	391	263	263
q21	5936	2654	2409	2409
q22	435	357	303	303
Total cold run time: 103882 ms
Total hot run time: 29000 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4316	4237	4237	4237
q2	322	376	228	228
q3	4660	5050	4400	4400
q4	2077	2163	1379	1379
q5	4479	4292	4328	4292
q6	226	173	122	122
q7	1995	1918	1605	1605
q8	2535	2149	2093	2093
q9	8030	7872	7961	7872
q10	4806	4737	4332	4332
q11	587	451	456	451
q12	760	757	542	542
q13	3326	3588	2977	2977
q14	321	314	287	287
q15	q16	693	719	649	649
q17	1359	1329	1343	1329
q18	7865	7271	6716	6716
q19	1112	1072	1097	1072
q20	2221	2216	1933	1933
q21	5263	4559	4435	4435
q22	510	469	411	411
Total cold run time: 57463 ms
Total hot run time: 51362 ms

@github-actions (bot) added the "approved" label (Indicates a PR has been approved by one committer.) on Jun 16, 2026
@github-actions

Contributor

PR approved by at least one committer and no changes requested.

@github-actions

Contributor

PR approved by anyone and no changes requested.

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.16 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 47740e028e4d360ed1df6b7e6ddee515126a6fe3, data reload: false

query1	0.01	0.01	0.01
query2	0.10	0.05	0.04
query3	0.26	0.14	0.14
query4	1.60	0.14	0.13
query5	0.23	0.22	0.23
query6	1.20	1.13	1.06
query7	0.04	0.00	0.01
query8	0.05	0.04	0.03
query9	0.37	0.31	0.30
query10	0.54	0.56	0.52
query11	0.18	0.14	0.14
query12	0.19	0.15	0.15
query13	0.46	0.47	0.47
query14	1.01	1.01	1.00
query15	0.60	0.58	0.59
query16	0.32	0.31	0.31
query17	1.08	1.12	1.16
query18	0.22	0.20	0.21
query19	2.02	1.91	2.01
query20	0.02	0.01	0.02
query21	16.31	0.22	0.12
query22	4.84	0.05	0.05
query23	16.13	0.31	0.12
query24	2.89	0.42	0.33
query25	0.10	0.05	0.04
query26	0.73	0.21	0.16
query27	0.03	0.04	0.04
query28	3.55	0.95	0.54
query29	12.46	4.27	3.44
query30	0.27	0.14	0.16
query31	2.78	0.59	0.32
query32	3.22	0.60	0.49
query33	3.18	3.19	3.20
query34	15.66	4.22	3.50
query35	3.50	3.52	3.53
query36	0.54	0.45	0.45
query37	0.09	0.06	0.06
query38	0.05	0.04	0.04
query39	0.04	0.03	0.03
query40	0.18	0.16	0.15
query41	0.09	0.03	0.03
query42	0.03	0.03	0.03
query43	0.04	0.03	0.03
Total cold run time: 97.21 s
Total hot run time: 25.16 s

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 52.44% (43/82) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 54.38% (21316/39199) |
| Line Coverage | 37.99% (203558/535808) |
| Region Coverage | 33.99% (159683/469857) |
| Branch Coverage | 35.00% (69880/199643) |

@gavinchou

Contributor

Thanks for the fix. Since BthreadSharedMutex becomes a new correctness-critical synchronization primitive, I think it is worth adding a small dedicated UT instead of only relying on the existing tablet tests.

Suggested coverage:

  • Basic SharedMutex contract: multiple concurrent std::shared_locks can coexist; std::unique_lock is exclusive; try_lock / try_lock_shared fail while the opposite mode is held.
  • Writer-pending behavior of the two-gate algorithm: once a writer has set the write-entered bit and is waiting for existing readers to drain, new readers should block until the writer has acquired and released the lock.
  • Cross-worker / cross-thread unlock regression case: acquire the lock in a bthread, suspend/yield enough to allow worker migration, and unlock from the resumed context. If possible, also include a deterministic variant using two pthreads/bthreads to demonstrate that ownership is not tied to the original OS thread.
  • Mixed reader/writer stress test with invariant counters, e.g. active_writers <= 1 and active_writers == 0 whenever active_readers > 0, running many bthreads with both shared and exclusive sections.
  • RAII compile/runtime coverage with std::shared_lock<BthreadSharedMutex> and std::unique_lock<BthreadSharedMutex>, because the PR relies on being a drop-in replacement at call sites.

The original bug is subtle and platform-dependent, so a focused be/test/util/bthread_shared_mutex_test.cpp would make this much easier to keep safe during future changes.
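One way the mixed reader/writer stress test with invariant counters might look is sketched below. It is templated on the lock type so that `BthreadSharedMutex` could be substituted. As assumptions for a standalone example, it uses `std::thread` workers rather than bthreads and is exercised with `std::shared_mutex`; the names `stress_invariants` and `run_stress_demo` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Runs num_threads workers that mix exclusive and shared sections and checks
// active_writers <= 1 and active_writers == 0 whenever a reader is inside.
// Returns true if no violation was observed.
template <typename SharedMutex>
bool stress_invariants(int num_threads, int iters) {
    SharedMutex mu;
    std::atomic<int> active_readers{0};
    std::atomic<int> active_writers{0};
    std::atomic<bool> ok{true};

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (int i = 0; i < iters; ++i) {
                if ((i + t) % 4 == 0) {
                    // Exclusive section: no other writer and no reader.
                    std::unique_lock lk(mu);
                    if (active_writers.fetch_add(1) != 0 ||
                        active_readers.load() != 0) {
                        ok = false;
                    }
                    active_writers.fetch_sub(1);
                } else {
                    // Shared section: other readers are fine, writers are not.
                    std::shared_lock lk(mu);
                    active_readers.fetch_add(1);
                    if (active_writers.load() != 0) ok = false;
                    active_readers.fetch_sub(1);
                }
            }
        });
    }
    for (auto& w : workers) w.join();
    return ok.load();
}

// Example driver; a real unit test would instantiate the template with the
// lock under test instead of std::shared_mutex.
void run_stress_demo() {
    bool passed = stress_invariants<std::shared_mutex>(8, 2000);
    assert(passed);
    (void)passed;
}
```

This covers only the invariant-counter bullet; the writer-pending and cross-worker unlock cases would need bthread-specific scheduling that this sketch does not attempt.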

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 66.67% (56/84) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 64.06% (24518/38274) |
| Line Coverage | 47.83% (254777/532657) |
| Region Coverage | 44.57% (210384/472001) |
| Branch Coverage | 45.69% (91333/199898) |

Labels: approved (Indicates a PR has been approved by one committer.), reviewed

4 participants