Optimizes the Margo DTL's memory usage by matching the behavior of the UCX DTL #178
ilumsden wants to merge 6 commits into
…tions on Tuo based on excessive memory pinning
JaeseungYeom left a comment
I have not finished reviewing margo_dtl.c yet but I will come back later.
| margo_handle->recv_buffer = NULL; | ||
| margo_handle->recv_len = 0; | ||
| margo_handle->recv_ready = 0; | ||
| atomic_store_explicit (&margo_handle->recv_ready, false, memory_order_relaxed); |
- I am curious about this. It is good to know that this variable needs to be protected. However, was there an incident?
- Reusing the buffer would improve performance. However, that also means that there will be no concurrent transfers.
- stdatomic seems to be optional with C11. If the protection is necessary, we either bump up the minimum required C standard, or we test for it at CMake configure time, as in cmake/tests, and propagate the choice via dyad_config.h.
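A minimal sketch of the two points above, under stated assumptions: C11 lets an implementation define `__STDC_NO_ATOMICS__` to signal that `<stdatomic.h>` is missing, so a compile-time check is possible, and a ready flag can publish a reused buffer with release/acquire ordering. The `recv_state_t` type and the `recv_state_*` functions are hypothetical stand-ins for the handle fields in the diff, not DYAD code, and whether the RPC handler and the receiver actually run on different threads is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* C11 allows an implementation to omit <stdatomic.h>; it then defines this macro. */
#if defined(__STDC_NO_ATOMICS__)
#error "This sketch requires <stdatomic.h>"
#endif
#include <stdatomic.h>

/* Hypothetical stand-in for the receive-related fields of the Margo DTL handle. */
typedef struct {
    void *recv_buffer;
    size_t recv_len;
    atomic_bool recv_ready;
} recv_state_t;

/* Must be called once before the state is shared between threads. */
void recv_state_init (recv_state_t *s)
{
    s->recv_buffer = NULL;
    s->recv_len = 0;
    atomic_init (&s->recv_ready, false);
}

/* Clears the state between transfers, mirroring the relaxed store in the diff. */
void recv_state_reset (recv_state_t *s)
{
    s->recv_buffer = NULL;
    s->recv_len = 0;
    atomic_store_explicit (&s->recv_ready, false, memory_order_relaxed);
}

/* Producer side: fill in the plain fields first, then publish with a release store. */
void recv_state_publish (recv_state_t *s, void *buf, size_t len)
{
    s->recv_buffer = buf;
    s->recv_len = len;
    atomic_store_explicit (&s->recv_ready, true, memory_order_release);
}

/* Consumer side: the acquire load makes the producer's plain writes visible. */
bool recv_state_try_consume (recv_state_t *s, void **buf, size_t *len)
{
    if (!atomic_load_explicit (&s->recv_ready, memory_order_acquire))
        return false;
    *buf = s->recv_buffer;
    *len = s->recv_len;
    return true;
}
```

The release/acquire pair is a design choice of this sketch: a relaxed flag alone does not order the writes to the buffer fields relative to the flag.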
|
|
||
| # UCX implementation for DTL | ||
| set(UCX_DTL_SRC ${CMAKE_CURRENT_SOURCE_DIR}/ucx_dtl.c ${CMAKE_CURRENT_SOURCE_DIR}/ucx_ep_cache.cpp) | ||
| set(UCX_DTL_SRC ${CMAKE_CURRENT_BINARY_DIR}/ucx_dtl.c ${CMAKE_CURRENT_SOURCE_DIR}/ucx_ep_cache.cpp) |
Instead of doing this, make MAX_TRANSFER_SIZE an environment variable and parse a size suffix character at the end of its value.
Then we wouldn't need to change CMakeLists.txt and make it needlessly static.
| #include <mercury_macros.h> | ||
|
|
||
| // clang-format off | ||
| #define MARGO_MAX_TRANSFER_SIZE (@DYAD_DTL_MAX_TRANSFER_SIZE@) |
This should be converted into an environment variable: DYAD_DTL_MAX_TX_SIZE_ENV "DYAD_DTL_MAX_TX_SIZE"
| // clang-format on | ||
|
|
||
| margo_instance_id mid = margo_hg_handle_get_instance (h); | ||
| margo_set_log_level (mid, MARGO_LOG_INFO); |
#if defined(DYAD_LOGGER_LEVEL_INFO)
margo_set_log_level (mid, MARGO_LOG_INFO);
#endif
| { | ||
| DYAD_C_FUNCTION_START (); | ||
| dyad_rc_t rc = DYAD_RC_OK; | ||
| if (data_buf == NULL || *data_buf == NULL) { |
I see that the static buffer is already checked, but it is still safer to check here, especially if we are going to reallocate.
In the UCX case, you had a different problem, an unsolved mystery. That is something to revisit as well.
|
|
||
| // dyad_rc_t rc = DYAD_RC_OK; | ||
| dyad_dtl_margo_t *margo_handle = NULL; | ||
| dyad_rc_t rc = 0; |
Initialize it with a return code macro such as DYAD_RC_OK.
|
|
||
| #include <dyad/dtl/dyad_dtl_api.h> | ||
| #include <margo.h> | ||
| #include <stdatomic.h> |
This is guaranteed to be available in C11.
| { | ||
| dyad_mod_ctx_t *mod_ctx = (dyad_mod_ctx_t *)arg; | ||
| flux_msg_handler_delvec (mod_ctx->handlers); | ||
| if (mod_ctx->handlers != NULL) { |
| endif() | ||
|
|
||
|
|
||
| set(DYAD_DTL_MAX_TRANSFER_SIZE "4294967296" CACHE STRING "Maximum transfer size supported by DYAD in bytes") |
We would rather have this be runtime configurable.
CMake can set the default value, DYAD_DTL_MAX_TX_SIZE_DEFAULT, which should be written to dyad_config.h.
This should be shared by both UCX and Margo, client and server.
This one is from the shuffle test.
#include <cstddef>
#include <cstdlib>

// Parses a positive decimal size with an optional binary suffix (K, M, or G).
bool parse_size (const char* str, size_t& out)
{
    char* end = nullptr;
    long long val = strtoll (str, &end, 10);
    if (val <= 0 || end == str)
        return false;  // no digits, or a non-positive value
    switch (*end) {
        case 'K':
        case 'k':
            out = static_cast<size_t> (val) * 1024ULL;
            break;
        case 'M':
        case 'm':
            out = static_cast<size_t> (val) * 1024ULL * 1024ULL;
            break;
        case 'G':
        case 'g':
            out = static_cast<size_t> (val) * 1024ULL * 1024ULL * 1024ULL;
            break;
        case '\0':
            out = static_cast<size_t> (val);  // plain byte count
            break;
        default:
            return false;  // unrecognized suffix
    }
    return true;
}
The minimum to merge is to make the transfer size parameter controllable through an environment variable.
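Since the DTL sources are C rather than C++, the request above could be sketched roughly as follows. This is only an illustration: `parse_size_str` and `max_tx_size_from_env` are hypothetical names, the default is hard-coded here instead of coming from dyad_config.h, and a 64-bit `size_t` is assumed for the 4 GiB default. Unlike the shuffle-test helper, this version also rejects trailing characters after the suffix and values that would overflow.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Names suggested in the review; the default value here is an assumption
 * (4 GiB, which requires a 64-bit size_t). */
#define DYAD_DTL_MAX_TX_SIZE_ENV "DYAD_DTL_MAX_TX_SIZE"
#define DYAD_DTL_MAX_TX_SIZE_DEFAULT ((size_t)4294967296ULL)

/* Hypothetical helper: parses "123", "64K", "16m", "4G" into a byte count. */
bool parse_size_str (const char *str, size_t *out)
{
    char *end = NULL;
    long long val = 0;
    unsigned long long mult = 1ULL;

    if (str == NULL || out == NULL)
        return false;
    errno = 0;
    val = strtoll (str, &end, 10);
    if (end == str || errno == ERANGE || val <= 0)
        return false; /* no digits, out of range, or non-positive */
    switch (*end) {
        case 'K': case 'k': mult = 1024ULL; end++; break;
        case 'M': case 'm': mult = 1024ULL * 1024ULL; end++; break;
        case 'G': case 'g': mult = 1024ULL * 1024ULL * 1024ULL; end++; break;
        case '\0': break;
        default: return false; /* unrecognized suffix */
    }
    if (*end != '\0')
        return false; /* trailing characters after the suffix */
    if ((unsigned long long)val > SIZE_MAX / mult)
        return false; /* result would not fit in size_t */
    *out = (size_t)((unsigned long long)val * mult);
    return true;
}

/* Hypothetical helper: environment value if set and valid, else the default. */
size_t max_tx_size_from_env (void)
{
    size_t size = 0;
    const char *val = getenv (DYAD_DTL_MAX_TX_SIZE_ENV);
    if (val != NULL && parse_size_str (val, &size))
        return size;
    return DYAD_DTL_MAX_TX_SIZE_DEFAULT;
}
```

With this shape, both the UCX and Margo backends could call the same lookup once during initialization, and an invalid setting silently falls back to the default; logging a warning in that case may be preferable.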
| } | ||
| addr_str = malloc (addr_str_size); | ||
| if (addr_str == NULL) { | ||
| rc = DYAD_RC_SYSFAIL; |
TODO: Need to define an RC macro for this as well as for other cases where assert is called.
| margo_addr_free (margo_handle->mid, margo_handle->remote_addr); | ||
| margo_finalize (margo_handle->mid); | ||
| } | ||
|
|
Move this up before if (margo_handle->mid != MARGO_INSTANCE_NULL) {
| if (margo_handle->mid != MARGO_INSTANCE_NULL) { | ||
| margo_addr_free (margo_handle->mid, margo_handle->local_addr); | ||
| if (margo_handle->remote_addr != NULL) | ||
| margo_addr_free (margo_handle->mid, margo_handle->remote_addr); |
Set margo_handle->remote_addr = HG_ADDR_NULL after margo_addr_free().
| } | ||
|
|
||
| if (margo_handle->mid != MARGO_INSTANCE_NULL) { | ||
| margo_addr_free (margo_handle->mid, margo_handle->local_addr); |
margo_handle->local_addr = HG_ADDR_NULL
|
|
||
| if (margo_handle->mid != MARGO_INSTANCE_NULL) { | ||
| margo_addr_free (margo_handle->mid, margo_handle->local_addr); | ||
| if (margo_handle->remote_addr != NULL) |
margo_handle->remote_addr != HG_ADDR_NULL
|
|
||
| // both margo client and server | ||
| margo_addr_self (margo_handle->mid, &margo_handle->local_addr); | ||
| margo_handle->remote_addr = NULL; |
margo_handle->remote_addr = HG_ADDR_NULL
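Taken together, the address-handling suggestions above amount to a free-then-reset pattern that makes cleanup safe to run more than once. The sketch below illustrates it with hypothetical stand-ins (`addr_t`, `ADDR_NULL`, `addr_free`, `handle_t`) so that it compiles without Margo; in the real code these roles would be played by the Mercury address type, HG_ADDR_NULL, and margo_addr_free.

```c
#include <stddef.h>

/* Hypothetical stand-ins so the sketch compiles without Margo. */
typedef void *addr_t;
#define ADDR_NULL ((addr_t)0)

/* Counts releases so the behavior can be checked. */
int addr_free_count = 0;

void addr_free (addr_t addr)
{
    (void)addr;
    addr_free_count++;
}

typedef struct {
    addr_t local_addr;
    addr_t remote_addr;
} handle_t;

/* Releases each address at most once: compare against the null sentinel,
 * free, then reset the field so a second call is a no-op. */
void handle_release_addrs (handle_t *h)
{
    if (h->local_addr != ADDR_NULL) {
        addr_free (h->local_addr);
        h->local_addr = ADDR_NULL;
    }
    if (h->remote_addr != ADDR_NULL) {
        addr_free (h->remote_addr);
        h->remote_addr = ADDR_NULL;
    }
}
```

Comparing against the library's own sentinel instead of NULL keeps the check correct even if the address type is not a plain pointer.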
I was using DYAD on Tuolumne recently, and I found two issues with DYAD.

First, I found that DYAD would crash whenever trying to send/recv data with a Mercury NA protocol string of ofi+cxi (i.e., what should be used for Slingshot). This was due to an oversight in the dyad_dtl_margo_rpc_pack function. That function assumes that the network address will fit into a statically sized 128-byte buffer. That assumption is invalid on Tuo because the Slingshot network normally uses network addresses larger than 128 bytes.

This PR fixes this first issue by updating dyad_dtl_margo_rpc_pack to call margo_addr_to_string twice. The first call is used to get the actual length of the network address. Then, after a malloc, the second call is used to actually obtain the network address in full.

Second, I found that DYAD was performing much worse than expected. When reviewing the code, I found that the main issue was in how the Margo DTL was managing memory. By allocating a buffer and calling margo_bulk_create for every send/recv, the DTL was essentially incurring the cost of memory allocation + memory pinning on the NIC repeatedly throughout a run. To add onto that, when I started using DYAD on Tuo, the Margo bulk objects were never being freed, resulting in massive memory leaks. Those memory leaks were recently fixed by other PRs, but the underlying alloc + pin cost still exists.

To fix this issue, this PR mimics the memory reuse scheme used by the UCX DTL. It now creates a single allocation and calls margo_bulk_create once during initialization, and it frees and unpins the memory during finalization. On the sender side, the pinned memory buffer is returned to the broker module via dyad_dtl_margo_get_buffer to ensure the module reads directly from local storage into the pinned memory. Similarly, on the receiver side, the pinned memory is passed directly into the margo_bulk_transfer call to avoid excessive allocations. The dyad_dtl_margo_recv function then copies from that pinned memory into a new allocation to return to the DYAD client. This PR also adds C11's stdatomic.h to help dyad_dtl_margo_recv and data_ready_rpc synchronize on data availability.

This PR also makes the max transfer size for the UCX and Margo DTL backends configurable via the new CMake cache variable DYAD_DTL_MAX_TRANSFER_SIZE. The value of this variable defaults to 4294967296 (i.e., 4 GiB), which matches the existing behavior of the UCX DTL.

I tested these changes on Tuo. They result in much better performance. In fact, with these changes, DYAD can achieve results comparable to a data transfer approach using MPI_Isend and MPI_Irecv in certain workflow situations.
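The two-call approach described for dyad_dtl_margo_rpc_pack can be illustrated with a small self-contained sketch. It does not call Margo: `fake_addr_to_string` is a hypothetical stand-in that assumes the size-query convention the description implies (a NULL buffer yields only the required size, including the terminating NUL), and the address string is made up.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for margo_addr_to_string. With buf == NULL it only
 * reports the required size (including the NUL terminator) in *buf_size. */
int fake_addr_to_string (char *buf, size_t *buf_size)
{
    static const char addr[] = "ofi+cxi://example-address"; /* made-up value */
    if (buf == NULL) {
        *buf_size = sizeof (addr);
        return 0;
    }
    if (*buf_size < sizeof (addr)) {
        *buf_size = sizeof (addr);
        return -1; /* caller's buffer is too small */
    }
    memcpy (buf, addr, sizeof (addr));
    *buf_size = sizeof (addr);
    return 0;
}

/* Queries the size, allocates exactly that much, then fetches the string.
 * Returns NULL on failure; the caller owns and frees the result. */
char *addr_to_heap_string (void)
{
    size_t size = 0;
    char *str = NULL;

    if (fake_addr_to_string (NULL, &size) != 0)
        return NULL;
    str = malloc (size);
    if (str == NULL)
        return NULL;
    if (fake_addr_to_string (str, &size) != 0) {
        free (str);
        return NULL;
    }
    return str;
}
```

Sizing the buffer from the first call removes the fixed 128-byte assumption, at the cost of one extra call and a heap allocation per packed address.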