RDMAEndpoint Completion-Ownership Model¶
This document pins down the terminology for the RDMAEndpoint completion refactor. Before the refactor the codebase mixed three concerns in one struct — the "slot" was simultaneously the execution storage, the completion owner, and the wire work-request. That made it impossible to describe who owned what, so subtle ABA hazards kept slipping in. The refactor separates those concerns into the layers below, with explicit ownership rules.
The Four Objects¶
┌──────────────────────────────────────────────────────────────────────┐
│ Layer 1: Future (SendFuture / RecvFuture / ReadWriteFuture / │
│ ImmRecvFuture) │
│ — user handle; wait() and read result │
└──────────────────────────────────────────────────────────────────────┘
│ shared_ptr<EndpointOpState>
▼
┌──────────────────────────────────────────────────────────────────────┐
│ Layer 2: EndpointOpState │
│ — per-op identity + completion signal + result fields │
│ — NOT pooled; created per op, released per op │
└──────────────────────────────────────────────────────────────────────┘
│ referenced by slot.op_state
│ and by transport callbacks (strong, on CQ
│ thread stack while invoking)
▼
┌──────────────────────────────────────────────────────────────────────┐
│ Layer 3: Slot (ReadWriteContext / SendContext / RecvContext) │
│ — POOLED dispatch coordinator │
│ — holds wr_id-backing RDMAAssign storage │
│ — lease/release via free-slot ring; never modulo-reused │
└──────────────────────────────────────────────────────────────────────┘
│ wr_id = &slot.assigns_[qpi]
▼
┌──────────────────────────────────────────────────────────────────────┐
│ Layer 4: RDMAAssign + RDMAChannel / RDMAContext │
│ — wire WR; CQ polling + callback dispatch │
└──────────────────────────────────────────────────────────────────────┘
ImmRecvContext is a separate Layer-3 object with different semantics
(see below) but follows the same ownership rules.
Who Owns What¶
| Owner | Owns | Reference kind |
|---|---|---|
| User code | Future | shared_ptr<RDMAFuture> |
| Future | EndpointOpState | shared_ptr<EndpointOpState> |
| Transport callback | EndpointOpState (during fire) | shared_ptr captured in lambda, moved to CQ stack, released on return |
| Slot | Current op's EndpointOpState | shared_ptr in slot.op_state, cleared on release |
| Slot | Its own RDMAAssign storage | by value |
| Endpoint | Slot pool | owning array |
| Endpoint | Free-slot ring | owning ring |
| Endpoint | In-flight op registry | weak_ptr<EndpointOpState> |
The registry is weak so it does not pin retired ops; it exists only so
cancelAll() can fail out every outstanding op without walking slots.
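The weak-registry property can be sketched with standard library types alone. This is an illustrative model, not the real endpoint code; `registerInFlight` follows the name used in this document, everything else (`Registry`, `failOutAll`) is hypothetical:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Minimal stand-in for EndpointOpState: just a completion flag.
struct EndpointOpState {
    bool failed = false;
};

// The endpoint keeps only weak references, so retired ops are not pinned.
struct Registry {
    std::vector<std::weak_ptr<EndpointOpState>> in_flight;

    void registerInFlight(const std::shared_ptr<EndpointOpState>& op) {
        in_flight.push_back(op);
    }

    // Walk live entries; expired ones are simply skipped (and could be
    // pruned here) — this is the shape of the cancelAll() walk.
    int failOutAll() {
        int touched = 0;
        for (auto& w : in_flight) {
            if (auto op = w.lock()) { op->failed = true; ++touched; }
        }
        return touched;
    }
};
```

The key point the sketch demonstrates: an op whose future and slot lease are both gone expires out of the registry on its own, with no unregister step.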
Boundary Invariants¶
These are the rules the refactor enforces. Violating any of them reintroduces the ABA hazard.
I1 — Signal exclusivity¶
An EndpointOpState::signal is created per op and never shared. Two
futures can never observe the same signal, and a signal is never
reset / rebound across ops.
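A minimal sketch of what I1 requires — a one-shot signal owned by value inside a per-op state, so two ops can never share one. The `Signal` type and `makeOpState` body here are assumptions; only the names `EndpointOpState`, `signal`, and `makeOpState` come from this document:

```cpp
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>

// One-shot completion signal: set exactly once, never reset or rebound.
struct Signal {
    std::mutex m;
    std::condition_variable cv;
    bool fired = false;

    void set() {
        { std::lock_guard<std::mutex> g(m); fired = true; }
        cv.notify_all();
    }
    void wait() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return fired; });
    }
};

struct EndpointOpState {
    Signal signal;   // owned by value: one signal per op, never shared
    int status = 0;
};

// makeOpState always returns a fresh op_state, hence a fresh signal (I1).
std::shared_ptr<EndpointOpState> makeOpState() {
    return std::make_shared<EndpointOpState>();
}
```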
I2 — Slot-reuse safety¶
A slot enters its free ring at init. It leaves the free ring on
acquireXxxSlot() and returns only after slot_qp_mask reaches
slot_qp_expected — i.e. every per-qp callback for its current
tenant has fired. While a slot is out of the free ring, no other op
can claim it; while a WR referencing &slot.assigns_[qpi] is
outstanding, the slot is out of the free ring. Therefore no CQE
whose wr_id points into slot.assigns_[qpi] can ever dispatch
into a callback installed by a different op.
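The I2 lease/release mechanics can be modeled in a few lines. This is a single-threaded toy (the real ring and mask are concurrent); `slot_qp_mask` and `slot_qp_expected` are the document's names, the rest is illustrative:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <deque>

// Toy slot: out of the ring while leased; it returns only when every
// per-qp callback for the current tenant has reported in via the mask.
struct Slot {
    std::atomic<uint32_t> slot_qp_mask{0};
    uint32_t slot_qp_expected = 0;
};

struct SlotRing {
    std::deque<Slot*> free_ring;

    Slot* acquire(uint32_t qp_count) {
        if (free_ring.empty()) return nullptr;   // back-pressure, never reuse
        Slot* s = free_ring.front();
        free_ring.pop_front();
        s->slot_qp_mask.store(0);
        s->slot_qp_expected = (1u << qp_count) - 1;
        return s;
    }

    // Called from each per-qp completion; the last one returns the slot.
    void onQpComplete(Slot* s, uint32_t qpi) {
        uint32_t mask = s->slot_qp_mask.fetch_or(1u << qpi) | (1u << qpi);
        if (mask == s->slot_qp_expected) free_ring.push_back(s);
    }
};
```

Because the slot stays out of the ring until the mask is full, no wr_id pointing into it can ever be dispatched against a later tenant's callback.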
I3 — Callback lifetime¶
Transport callbacks capture shared_ptr<EndpointOpState>. This closes a
cycle (op_state → assigns → callback_ → op_state), but the cycle is
bounded: it breaks the next time the slot is re-leased for a new op
(because RDMAAssign::reset() overwrites callback_ and releases the
old lambda's captures) or when the endpoint is destroyed. In the
meantime the completed op is pinned in memory by its own slot's
closure, which is acceptable because:
- the total retention is bounded by the pool size (at most DEPTH completed op_states in flight);
- for any real workload the slot is reclaimed and re-leased within microseconds of completion, so retention is negligible;
- futures observe completion via the signal, which fires before the callback returns, so user code is never blocked by retention.
We do NOT try to move the callback out on fire (an earlier attempt
broke the codebase): many RDMAAssigns hold multiple pending WRs
simultaneously (connect-time pre-post plus a repost in WAIT_META
that shares the same wr_id), so the callback must remain callable
across every CQE for that wr_id until the next reset() installs
a replacement.
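The bounded cycle and its break point can be demonstrated with a toy `RDMAAssign` that just holds a `std::function`. The capture semantics are the point; the types are stand-ins, not the real transport:

```cpp
#include <cassert>
#include <functional>
#include <memory>

struct EndpointOpState { int status = 0; };

// Toy RDMAAssign: holds the completion callback for its wr_id.
struct RDMAAssign {
    std::function<void()> callback_;
    void reset(std::function<void()> cb) { callback_ = std::move(cb); }
};

// Install a callback that keeps the op alive by strong capture (I3).
// The capture pins the op until the NEXT reset() overwrites callback_;
// it is deliberately NOT moved out on fire, so it stays callable across
// every CQE that shares this wr_id.
void install(RDMAAssign& a, std::shared_ptr<EndpointOpState> op) {
    a.reset([op] { op->status = 1; });
}
```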
I4 — Op lifetime¶
An EndpointOpState lives exactly as long as at least one of these
holds a strong reference:
- its future (owned by user code),
- a slot currently leasing it (slot.op_state),
- a callback in flight (moved onto the CQ thread stack).
Once all three are gone, shared_ptr reclaims the op_state. The
endpoint's weak registry has no impact on lifetime.
I5 — RNR avoidance¶
The inbound RECV windows on the meta channel, the message data
channel, and the io-data channel are pre-posted at connect() and
kept primed across op boundaries. The refactor never removes a
posted RECV without first reposting a replacement. Slot release
happens strictly AFTER the callback has run — and the callback, for
message paths, is the place that reposts — so the RQ is never empty
just because an op retired.
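A trivial model of the I5 ordering: the completion path reposts its replacement RECV before the slot is released, so the posted count never drops across an op boundary. Names here are purely illustrative:

```cpp
#include <cassert>

// Toy receive queue: RNR happens if a WRITE_WITH_IMM lands while posted == 0.
struct RecvQueue {
    int posted = 0;
    void post() { ++posted; }     // posting a RECV WR
    void consume() { --posted; }  // a completion consumes one posted RECV
};

// Completion path per I5: the callback reposts BEFORE the slot is
// released, so the queue is never observed empty between ops.
void onMessageComplete(RecvQueue& rq, bool& slot_released) {
    rq.consume();
    rq.post();            // repost the replacement first
    slot_released = true; // slot release strictly after the callback ran
}
```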
What the Slot Is NOT¶
This is what changed. Historically the slot did everything; after the refactor each of these is explicitly elsewhere:
- Not the completion owner. The slot does not store the user-visible signal, completion status, or imm_data; those live on EndpointOpState. SendFuture / RecvFuture / ReadWriteFuture / ImmRecvFuture do not hold slot pointers.
- Not the op identity. The slot is a dispatch vehicle reused across ops. op_id-style identity is unnecessary once the free-ring invariant (I2) is in place; no generation counter is kept on the slot.
- Not the WR ownership root. RDMAAssign lives inside the slot by value, but the slot cannot be re-leased while any WR is outstanding (I2), so the WRs are effectively owned by "the current lease" for their useful lifetime.
Message-Path Specifics (Send / Recv)¶
The two-sided path uses the same slot lease model but has extra slot-local transport state:
- SendContext::meta_arrived_flag_ — set by the meta_recv_assign_ callback when the peer writes a meta reply; consumed (reset to 0) by sendProcess when it advances past WAIT_META. NOT reset on acquireSendSlot() — the peer may have already posted meta for this slot before the user called send(); clearing the flag on acquire would drop a valid pre-arrival and deadlock.
- SendContext::remote_meta_info_ — overwritten by the peer via WRITE_WITH_IMM. Slot-local buffer.
- RecvContext::local_meta_info_ — filled by the user's recv() call, then written to the peer.
The meta handshake callback installed at connect() captures the
slot (not the op_state) and only flips the slot-local flag. The
user-visible op_state.signal is set only by the data_send_assigns_
/ data_recv_assigns_ callbacks installed when the op actually runs
(sendProcess / recvProcess), which capture the current
op_state by shared_ptr — consistent with I3.
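The pre-arrival rule for the flag is easy to get wrong, so here is a minimal model of it. Only `SendContext` and `meta_arrived_flag_` are the document's names; the two functions are hypothetical stand-ins for the connect-time callback and the WAIT_META check in sendProcess:

```cpp
#include <atomic>
#include <cassert>

// Slot-local pre-arrival flag: set by the connect-time meta callback,
// consumed by the send state machine, deliberately NOT cleared on acquire.
struct SendContext {
    std::atomic<int> meta_arrived_flag_{0};
};

// Connect-time callback: only flips the slot-local flag (no op_state).
void metaRecvCallback(SendContext& s) { s.meta_arrived_flag_.store(1); }

// The WAIT_META check: advances only once the flag is up, consuming it
// (reset to 0) exactly once as it goes.
bool tryAdvancePastWaitMeta(SendContext& s) {
    return s.meta_arrived_flag_.exchange(0) != 0;
}
```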
API contract — pair ordering¶
Two-sided send() / recv() is correct under concurrency provided
both endpoints pair their calls in the same relative order. This is
the standard point-to-point contract (the same invariant held by
MPI, UCX, NCCL two-sided, and every other two-sided RDMA stack).
Why it is sufficient:
1. IBRC send-side FIFO. WRs posted on an RC QP complete in post order.
2. IBRC receive-side FIFO. The peer's k-th WRITE_WITH_IMM always consumes our RECV WR #k.
3. FIFO slot allocation + release-in-completion-order (from the free-slot ring + guarantee 1).
Together these give: sender slot-k is matched with receiver slot-k
for the lifetime of the connection. Meta destined for slot-k lands
in slot-k's remote_meta_info_, and by (2) it is slot-k's RECV that
the HW consumes to signal that arrival — so the flag is set on the
correct slot and sendProcess makes progress on the right pending
send.
Violating the pair-ordering contract (sender issues send(A), send(B)
while receiver issues recv(B), recv(A)) will cross-wire data. This
is a user-level contract, not a DLSlime-specific limitation; the
library does not try to detect the violation.
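The cross-wiring failure mode is purely a consequence of FIFO matching, which a two-line queue model is enough to show. This is an abstraction of the contract, not DLSlime code; all names are illustrative:

```cpp
#include <cassert>
#include <deque>
#include <string>

// FIFO matching: the k-th send on one side is consumed by the k-th
// recv on the other. There is no tag-based reordering.
struct FifoLink {
    std::deque<std::string> wire;
    void send(std::string payload) { wire.push_back(std::move(payload)); }
    std::string recv() {
        std::string p = wire.front();
        wire.pop_front();
        return p;
    }
};
```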
Imm-Recv Path Specifics¶
immRecv() is a pure matching path — it does not acquire a slot.
The transport-owned ImmRecvContext array stays permanently posted
on the hardware RQ (this is what keeps WRITE_WITH_IMM senders from
hitting RNR). Each completion is copied into an ImmRecvEvent value
object and matched to either a pending user op (if immRecv() was
called first) or queued for the next caller (if the WRITE_WITH_IMM
arrived first). The CQ-side slot refill uses a lock-free MPSC stack;
the endpoint progress loop drains the stack and reposts the slots.
This is entirely orthogonal to the send/recv/read/write free-slot
rings and intentionally so — the semantics are different.
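The refill structure described above — multiple CQ-side producers pushing retired slots, one progress-loop consumer draining them — is the classic Treiber-stack MPSC shape. A minimal sketch under that assumption (types and names are illustrative, not the real `ImmRecvContext` machinery):

```cpp
#include <atomic>
#include <cassert>

// Intrusively linked imm-recv slot awaiting repost.
struct ImmSlot {
    ImmSlot* next = nullptr;
};

// Lock-free MPSC stack: Treiber push from any CQ callback,
// single-consumer drain from the endpoint progress loop.
struct RefillStack {
    std::atomic<ImmSlot*> head{nullptr};

    void push(ImmSlot* s) {                        // any producer thread
        s->next = head.load(std::memory_order_relaxed);
        while (!head.compare_exchange_weak(s->next, s,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {}
    }

    ImmSlot* drainAll() {                          // single consumer
        return head.exchange(nullptr, std::memory_order_acquire);
    }
};
```

The consumer walks the returned LIFO chain and reposts each slot; one atomic exchange empties the whole stack regardless of how many producers pushed.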
What cancelAll() Does¶
cancelAll() walks the weak in-flight registry and force-completes
every live EndpointOpState (sets status FAILED, wakes signal). It
also drains the imm-recv matching queues and the refill stack. It
deliberately does NOT touch slot pools — the pool invariant is that
slots re-enter the free ring only when their WRs have fired, and
cancelAll does not violate that. Anything still sitting in
pending_{rw,send,recv}_queue_ un-posted will either post normally
if the endpoint resumes or be flushed as FLUSH_ERR when the channel
is destroyed, at which point the normal callback path releases the
slot.
How To Extend This Safely¶
Adding a new endpoint operation?
- Allocate a fresh EndpointOpState (via makeOpState) with a unique signal. Register with registerInFlight().
- If the op needs a dispatch slot, lease one via the appropriate acquireXxxSlot(). Set slot.op_state to your new op_state.
- When building callbacks: capture shared_ptr<EndpointOpState> for per-op state updates. Capture the slot pointer (if needed) only to drive slot_qp_mask and release the slot.
- In the final callback for the slot, fetch_or the qp bit into slot.slot_qp_mask. If it equals slot_qp_expected, call the matching releaseXxxSlot().
- Return an RDMAFuture subclass that holds the shared_ptr<EndpointOpState>. The future must never dereference the slot.
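The checklist composes into a small pattern. A toy end-to-end sketch of the callback-and-release shape (every type and function name here is illustrative, standing in for the real makeOpState / acquireXxxSlot / releaseXxxSlot):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>

struct OpState { bool done = false; };

struct Slot {
    std::shared_ptr<OpState> op_state;   // cleared on release
    std::atomic<uint32_t> qp_mask{0};
    uint32_t qp_expected = 0;
    bool in_free_ring = true;
};

// One per-qp completion callback, built exactly as the rules prescribe:
// strong capture of the op for state updates, slot pointer only to
// drive the mask and release.
std::function<void()> makeQpCallback(Slot* slot,
                                     std::shared_ptr<OpState> op,
                                     uint32_t qpi) {
    return [slot, op, qpi] {
        op->done = true;                               // per-op update
        uint32_t m = slot->qp_mask.fetch_or(1u << qpi) | (1u << qpi);
        if (m == slot->qp_expected) {                  // last QP fired
            slot->op_state.reset();                    // clear on release
            slot->in_free_ring = true;                 // releaseXxxSlot()
        }
    };
}
```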