RDMAEndpoint Completion-Ownership Model¶

This document fixes the language for the RDMAEndpoint completion refactor. Before the refactor the codebase mixed three concerns in one struct — the "slot" was simultaneously the execution-storage, the completion owner, and the wire work-request. That made it impossible to describe who owned what, so subtle ABA hazards kept slipping in. The refactor separates those concerns into three layers with explicit ownership rules.

The Four Objects¶

┌──────────────────────────────────────────────────────────────────────┐
│  Layer 1: Future (SendFuture / RecvFuture / ReadWriteFuture /        │
│           ImmRecvFuture)                                             │
│           — user handle; wait() and read result                      │
└──────────────────────────────────────────────────────────────────────┘
                          │ shared_ptr<EndpointOpState>
                          ▼
┌──────────────────────────────────────────────────────────────────────┐
│  Layer 2: EndpointOpState                                            │
│           — per-op identity + completion signal + result fields      │
│           — NOT pooled; created per op, released per op              │
└──────────────────────────────────────────────────────────────────────┘
                          │ referenced by slot.op_state
                          │ and by transport callbacks (strong, on CQ
                          │ thread stack while invoking)
                          ▼
┌──────────────────────────────────────────────────────────────────────┐
│  Layer 3: Slot (ReadWriteContext / SendContext / RecvContext)        │
│           — POOLED dispatch coordinator                              │
│           — holds wr_id-backing RDMAAssign storage                   │
│           — lease/release via free-slot ring; never modulo-reused    │
└──────────────────────────────────────────────────────────────────────┘
                          │ wr_id = &slot.assigns_[qpi]
                          ▼
┌──────────────────────────────────────────────────────────────────────┐
│  Layer 4: RDMAAssign + RDMAChannel / RDMAContext                     │
│           — wire WR; CQ polling + callback dispatch                  │
└──────────────────────────────────────────────────────────────────────┘

ImmRecvContext is a separate Layer-3 object with different semantics (see below) but follows the same ownership rules.

Who Owns What¶

Owner	Owns	Reference kind
User code	Future	`shared_ptr<RDMAFuture>`
Future	EndpointOpState	`shared_ptr<EndpointOpState>`
Transport callback	EndpointOpState (during fire)	`shared_ptr` captured in lambda, moved to CQ stack, released on return
Slot	Current op's EndpointOpState	`shared_ptr` in `slot.op_state`, cleared on release
Slot	Its own `RDMAAssign` storage	by value
Endpoint	Slot pool	owning array
Endpoint	Free-slot ring	owning ring
Endpoint	In-flight op registry	`weak_ptr<EndpointOpState>`

The registry is weak so it does not pin retired ops; it exists only so cancelAll() can fail-out every outstanding op without walking slots.

Boundary Invariants¶

These are the rules the refactor enforces. Violating any of them reintroduces the ABA hazard.

I1 — Signal exclusivity¶

An EndpointOpState::signal is created per op and never shared. Two futures can never observe the same signal, and a signal is never reset / rebound across ops.

I2 — Slot-reuse safety¶

A slot enters its free ring at init. It leaves the free ring on acquireXxxSlot() and returns only after slot_qp_mask reaches slot_qp_expected — i.e. every per-qp callback for its current tenant has fired. While a slot is out of the free ring, no other op can claim it; while a WR referencing &slot.assigns_[qpi] is outstanding, the slot is out of the free ring. Therefore no CQE whose wr_id points into slot.assigns_[qpi] can ever dispatch into a callback installed by a different op.

I3 — Callback lifetime¶

Transport callbacks capture shared_ptr<EndpointOpState>. This closes a cycle (op_state → assigns → callback_ → op_state), but the cycle is bounded: it breaks the next time the slot is re-leased for a new op (because RDMAAssign::reset() overwrites callback_ and releases the old lambda's captures) or when the endpoint is destroyed. In the meantime the completed op is pinned in memory by its own slot's closure, which is acceptable because:

the total retention is bounded by the pool size (at most DEPTH completed op_states in flight);
for any real workload the slot is reclaimed and re-leased within microseconds of completion, so retention is negligible;
futures observe completion via the signal, which fires before the callback returns — user code is never blocked by retention.

We do NOT try to move the callback out on fire (an earlier attempt broke the codebase): many RDMAAssigns hold multiple pending WRs simultaneously (connect-time pre-post plus a repost in WAIT_META that shares the same wr_id), so the callback must remain callable across every CQE for that wr_id until the next reset() installs a replacement.

I4 — Op lifetime¶

An EndpointOpState lives exactly as long as at least one of these holds a strong reference:

its future (owned by user code),
a slot currently leasing it (slot.op_state),
a callback in flight (moved onto the CQ thread stack).

Once all three are gone, shared_ptr reclaims the op_state. The endpoint's weak registry has no impact on lifetime.

I5 — RNR avoidance¶

The inbound RECV windows on the meta channel, the message data channel, and the io-data channel are pre-posted at connect() and kept primed across op boundaries. The refactor never removes a posted RECV without first reposting a replacement. Slot release happens strictly AFTER the callback has run — and the callback, for message paths, is the place that reposts — so the RQ is never empty just because an op retired.

What the Slot Is NOT¶

This is what changed. Historically the slot did everything; after the refactor each of these is explicitly elsewhere:

Not the completion owner. The slot does not store the user-visible signal, completion status, or imm_data. Those live on EndpointOpState. SendFuture/RecvFuture/ ReadWriteFuture/ImmRecvFuture do not hold slot pointers.
Not the op identity. The slot is a dispatch vehicle reused across ops. op_id-style identity is unnecessary once the free ring invariant (I2) is in place; no generation counter is kept on the slot.
Not the WR ownership root. RDMAAssign lives inside the slot by value, but the slot cannot be re-leased while any WR is outstanding (I2). So the WRs are effectively owned by "the current lease" for their useful lifetime.

Message-Path Specifics (Send / Recv)¶

The two-sided path uses the same slot lease model but has extra slot-local transport state:

SendContext::meta_arrived_flag_ — set by the meta_recv_assign_ callback when the peer writes a meta reply; consumed (reset to 0) by sendProcess when it advances past WAIT_META. NOT reset on acquireSendSlot() — the peer may have already posted meta for this slot before the user called send(); clearing the flag on acquire would drop a valid pre-arrival and deadlock.
SendContext::remote_meta_info_ — overwritten by the peer via WRITE_WITH_IMM. Slot-local buffer.
RecvContext::local_meta_info_ — filled by user's recv() call, then written to the peer.

The meta handshake callback installed at connect() captures the slot (not the op_state) and only flips the slot-local flag. The user-visible op_state.signal is set only by the data_send_assigns_ / data_recv_assigns_ callbacks installed when the op actually runs (sendProcess / recvProcess), which capture the current op_state by shared_ptr — consistent with I3.

API contract — pair ordering¶

Two-sided send() / recv() is correct under concurrency provided both endpoints pair their calls in the same relative order. This is the standard point-to-point contract (the same invariant held by MPI, UCX, NCCL two-sided, and every other two-sided RDMA stack).

Why it is sufficient:

IBRC send-side FIFO. WRs posted on an RC QP complete in post order.
IBRC receive-side FIFO. The peer's k-th WRITE_WITH_IMM always consumes our RECV WR #k.
FIFO slot allocation + release-in-completion-order (from the free-slot ring + guarantee 1).

Together these give: sender slot-k is matched with receiver slot-k for the lifetime of the connection. Meta destined for slot-k lands in slot-k's remote_meta_info_, and by (2) it is slot-k's RECV that the HW consumes to signal that arrival — so the flag is set on the correct slot and sendProcess makes progress on the right pending send.

Violating the pair-ordering contract (sender issues send(A), send(B) while receiver issues recv(B), recv(A)) will cross-wire data. This is a user-level contract, not a DLSlime-specific limitation; the library does not try to detect the violation.

Imm-Recv Path Specifics¶

immRecv() is a pure matching path — it does not acquire a slot. The transport-owned ImmRecvContext array stays permanently posted on the hardware RQ (this is what keeps WRITE_WITH_IMM senders from hitting RNR). Each completion is copied into an ImmRecvEvent value object and matched to either a pending user op (if immRecv() was called first) or queued for the next caller (if the WRITE_WITH_IMM arrived first). The CQ-side slot refill uses a lock-free MPSC stack; the endpoint progress loop drains the stack and reposts the slots. This is entirely orthogonal to the send/recv/read/write free-slot rings and intentionally so — the semantics are different.

What cancelAll() Does¶

cancelAll() walks the weak in-flight registry and force-completes every live EndpointOpState (sets status FAILED, wakes signal). It also drains the imm-recv matching queues and the refill stack. It deliberately does NOT touch slot pools — the pool invariant is that slots re-enter the free ring only when their WRs have fired, and cancelAll does not violate that. Anything still sitting in pending_{rw,send,recv}_queue_ un-posted will either post normally if the endpoint resumes or be flushed as FLUSH_ERR when the channel is destroyed, at which point the normal callback path releases the slot.

How To Extend This Safely¶

Adding a new endpoint operation?

Allocate a fresh EndpointOpState (via makeOpState) with a unique signal. Register with registerInFlight().
If the op needs a dispatch slot, lease one via the appropriate acquireXxxSlot(). Set slot.op_state to your new op_state.
When building callbacks: capture shared_ptr<EndpointOpState> for per-op state updates. Capture the slot pointer (if needed) only to drive slot_qp_mask and release the slot.
In the final callback for the slot, fetch_or the qp bit into slot.slot_qp_mask. If it equals slot_qp_expected, call the matching releaseXxxSlot().
Return an RDMAFuture subclass that holds the shared_ptr<EndpointOpState>. The future must never dereference the slot.