DLSlimeCache: RDMA cache service assignment directory

Status: V0 landed in this branch.

One-line summary: DLSlimeCache is a small service that owns a preallocated memory region, exposes it through a composed PeerAgent, and records (peer_agent_id, version) -> AssignmentBatch manifests so clients can read cached bytes back through the existing DLSlime RDMA endpoint path.

This document describes the current cache-service design. It intentionally does not model peer-to-peer references as a cache mode: a worker that wants to read directly from another worker can already do that through PeerAgent and RDMAEndpoint. DLSlimeCache comes into play only once bytes have been written into the cache service's own registered memory.

Goals

  • Reuse DLSlime's existing data plane. The cache server is a PeerAgent peer with a registered memory region; clients use normal write and read operations.
  • Keep cache metadata tiny. The C++ cache core stores only assignment manifests keyed by the original Engine/PeerAgent id plus a generated version.
  • Keep the first version service-shaped. Users can run dlslime-cache start/status/stop, and Python examples can perform a real end-to-end RDMA roundtrip.
  • Avoid inventing a new storage abstraction for peer-to-peer transfer. P2P remains P2P; cache means data has entered the cache service MR.

Non-goals

  • No shallow mode. The old store(key, extents, mode="shallow") idea was removed because it duplicates PeerAgent's direct P2P path.
  • No tiered allocator yet. V0 manages one fixed slab size per service instance; adaptive per-store slab classes are deferred.
  • No persistence, replication, master election, SSD tier, or distributed cache protocol. Placement and replication policy belong in NanoDeploy.
  • No server-side RDMA read on query. Clients issue their own RDMA reads using the returned manifest.

Data Model

The core object is an assignment manifest:

struct AssignmentManifest {
    std::string              peer_agent_id;
    dlslime::AssignmentBatch assignments;
    std::vector<uint64_t>    slab_ids;
    uint64_t                 version;
};

peer_agent_id is the original Engine/PeerAgent owner id. version is generated by the cache server. slab_ids records the fixed cache slabs owned by the manifest. assignments is a ready-to-run batch for the consumer's RDMA read path.
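
For orientation, the manifest produced by the worked /store example later in this document would look roughly like the following on the Python side (values taken from that example; the assignments list is elided):

manifest = {
    "peer_agent_id": "engine-a",   # original Engine/PeerAgent owner
    "version": 1,                  # generated by the cache server
    "slab_ids": [0, 1, 2],         # fixed cache slabs owned by this manifest
    "assignments": [...],          # ready-to-run batch for the consumer's RDMA read
}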

The C++ cache core exposes:

AssignmentManifest store_assignments(peer_agent_id, assignments);
Optional<AssignmentManifest> query_assignments(peer_agent_id, version);
bool delete_assignments(peer_agent_id, version);
CacheStats stats();
void clear();  // test helper

There is no Extent, Manifest, CacheMode, store(key, ...), load(key), or delete(key) API in the landed V0 surface.

Service Flow

The cache service composes a real PeerAgent:

  1. dlslime-cache start --memory-size ... preallocates host memory.
  2. The service registers that buffer as a PeerAgent memory region, default name cache.
  3. GET /peer-agent tells clients the cache PeerAgent id, NanoCtrl address, cache MR name, slab size, memory size, and resource info.
  4. A client connects to the cache PeerAgent through NanoCtrl.
  5. The client stores a read manifest with POST /store; the service allocates cache slabs and rewrites cache-side offsets in the returned manifest.
  6. The client writes bytes into the allocated cache MR offsets with normal RDMA write.
  7. A consumer queries the manifest with POST /query.
  8. The consumer feeds the returned assignments to agent.read(...).
  9. The client removes the manifest with POST /delete when done.

The example at examples/python/cache_client_example.py performs this full roundtrip and checks correctness.
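
A condensed sketch of that roundtrip is below; agent and assignments stand in for an already-connected PeerAgent and the producer's read manifest, and the shipped example remains the authoritative version of the write and read calls:

from dlslime.cache import CacheClient

client = CacheClient(url="http://127.0.0.1:8765", peer_agent=agent)
client.connect_to_server()                 # steps 3-4: discover the cache PeerAgent, connect via NanoCtrl

stored = client.store(assignments)         # step 5: slabs allocated, cache-side offsets rewritten
# step 6: write the payload into the returned cache MR offsets with a normal RDMA write

manifest = client.query(stored["peer_agent_id"], stored["version"])  # step 7
read_future = agent.read(manifest["assignments"])                    # step 8: consumer-side RDMA read
read_future.wait()
client.delete(stored["peer_agent_id"], stored["version"])            # step 9: after the read completes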

HTTP API

GET /healthz

Returns:

{"ok": true}

GET /stats

Returns assignment and slab counters:

{
  "slab_size": 262144,
  "memory_size": 1073741824,
  "num_slabs": 4096,
  "used_slabs": 3,
  "free_slabs": 4093,
  "num_assignment_peers": 1,
  "num_assignment_entries": 1,
  "num_assignments": 3,
  "assignment_bytes": 655360
}
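
The counters above are mutually consistent; a quick arithmetic check (plain Python, not part of the service):

slab_size = 262144               # 256K startup unit
memory_size = 1073741824         # 1G preallocated cache MR
assignment_bytes = 655360        # bytes held by the one stored manifest

num_slabs = memory_size // slab_size              # 4096 logical slabs
used_slabs = -(-assignment_bytes // slab_size)    # ceiling division -> 3 slabs
free_slabs = num_slabs - used_slabs               # 4093

assert (num_slabs, used_slabs, free_slabs) == (4096, 3, 4093)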

GET /peer-agent

Returns the cache service's PeerAgent and cache MR metadata:

{
  "peer_agent_id": "cache-agent:0",
  "cache_mr_name": "cache",
  "cache_mr_handle": 123,
  "nanoctrl_url": "http://127.0.0.1:3000",
  "scope": null,
  "slab_size": 262144,
  "memory_size": 1073741824,
  "resource": {}
}

The endpoint returns 503 if the service was started without a PeerAgent or without preallocated cache memory.
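
A minimal discovery sketch using requests (an assumed client-side dependency); CacheClient.connect_to_server() performs the same lookup internally:

import requests

resp = requests.get("http://127.0.0.1:8765/peer-agent", timeout=5)
if resp.status_code == 503:
    raise RuntimeError("cache service has no PeerAgent or no preallocated cache memory")
info = resp.json()

# Everything a client needs before connecting through NanoCtrl:
print(info["peer_agent_id"], info["cache_mr_name"], info["nanoctrl_url"])
print(info["slab_size"], info["memory_size"])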

POST /store

Request:

{
  "peer_agent_id": "engine-a",
  "assignments": [
    {
      "mr_key": 11,
      "remote_mr_key": 22,
      "target_offset": 0,
      "source_offset": 0,
      "length": 655360
    }
  ]
}

Response:

{
  "peer_agent_id": "engine-a",
  "version": 1,
  "total_bytes": 655360,
  "slab_ids": [0, 1, 2],
  "assignments": [
    {
      "mr_key": 11,
      "remote_mr_key": 22,
      "target_offset": 0,
      "source_offset": 0,
      "length": 262144
    }
  ]
}

Large assignments are split into chunks no larger than slab_size. When preallocated memory is enabled, each returned assignment owns one slab id, and source_offset points at slab_id * slab_size inside the cache MR.
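
A sketch of that splitting rule (hypothetical helper; the real logic lives in the C++ store path, and the per-chunk target_offset handling is an assumption):

def split_assignment(length, slab_ids, slab_size):
    """Split one logical assignment into slab-sized cache-side chunks."""
    chunks, consumed = [], 0
    for slab_id in slab_ids:
        chunk_len = min(slab_size, length - consumed)
        chunks.append({
            "source_offset": slab_id * slab_size,  # offset inside the cache MR
            "target_offset": consumed,             # advancing offset in the original buffer (assumed)
            "length": chunk_len,
        })
        consumed += chunk_len
        if consumed == length:
            break
    return chunks

# 655360 bytes with a 256K slab -> three chunks of 262144, 262144, 131072
print(split_assignment(655360, [0, 1, 2], 262144))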

POST /query

Request:

{"peer_agent_id": "engine-a", "version": 1}

Response is the stored assignment manifest, or 404 if not found.

POST /delete

Request:

{"peer_agent_id": "engine-a", "version": 1}

Response:

{"deleted": true}

CLI

The public lifecycle commands are:

dlslime-cache start
dlslime-cache status
dlslime-cache stop

Data mode requires preallocated memory:

nanoctrl start
dlslime-cache start --ctrl http://127.0.0.1:3000 \
    --host 127.0.0.1 --port 8765 --memory-size 1G

--metadata-only exists only for parser/control-plane tests. It starts the HTTP metadata wrapper without a usable cache MR, so real clients should not use it.

Useful service knobs:

  • --slab-size: maximum assignment slab bytes, default 256K. Supported startup range is 128K to 1G.
  • --memory-size: preallocated cache MR size. Accepts suffixes such as 512M and 1G.
  • --cache-mr-name: PeerAgent memory-region name, default cache.
  • --ctrl: NanoCtrl address, default http://127.0.0.1:3000.
  • --peer-agent-alias: optional fixed alias for the service PeerAgent.

Python Client

from dlslime.cache import CacheClient

client = CacheClient(url="http://127.0.0.1:8765", peer_agent=agent)
server = client.connect_to_server()

stored = client.store(assignments)
queried = client.query(stored["peer_agent_id"], stored["version"])
deleted = client.delete(stored["peer_agent_id"], stored["version"])

If no peer_agent is passed, connect_to_server() creates one using the NanoCtrl information advertised by /peer-agent.

Correctness Contract

  • Store/query/delete metadata is protected by a std::shared_mutex.
  • query_assignments() and stats() take a shared lock.
  • store_assignments(), delete_assignments(), and clear() take the write lock.
  • Delete removes the manifest and returns its slab ids to the free list. It does not fence or cancel RDMA reads that a client has already issued.
  • Callers should delete after read_future.wait() if they want the same correctness property as the example.

Because slabs are reusable, delete/evict needs a lease or pin-count mechanism before production use so memory cannot be recycled while a client has an in-flight read.
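
In client code the contract reduces to ordering the delete after the read future has completed; a three-line sketch with the names used elsewhere in this document (agent, manifest, and client are assumed from earlier context):

read_future = agent.read(manifest["assignments"])               # consumer-side RDMA read
read_future.wait()                                              # bytes have landed locally
client.delete(manifest["peer_agent_id"], manifest["version"])   # only now may slabs be recycled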

Slab Semantics

slab_size is currently a startup-time normalization and capacity unit:

  • Store splits every assignment into chunks of at most slab_size.
  • Supported range is 128K <= slab_size <= 1G.
  • memory_size / slab_size gives the number of logical slabs.
  • If memory_size > 0, store rejects manifests that would exceed the configured logical slab count.
  • Store allocates slab ids from a free list and rewrites returned assignment source_offset values to cache MR slab offsets.
  • Delete returns slab ids to the free list.
  • used_slabs and free_slabs come from allocator state.

Per-store adaptive slab sizing is deferred until tiered capacity accounting and leases exist; changing the slab unit per manifest would make lifecycle semantics ambiguous in the current V0 directory.
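
A minimal free-list sketch of the capacity rules above, using the documented defaults; the real allocator lives in the C++ cache core:

class SlabFreeList:
    """Fixed-size slab bookkeeping: allocate on store, return on delete."""

    def __init__(self, memory_size, slab_size):
        assert 128 * 1024 <= slab_size <= (1 << 30)        # supported startup range
        self.slab_size = slab_size
        self.free = list(range(memory_size // slab_size))  # logical slab ids

    def store(self, total_bytes):
        needed = -(-total_bytes // self.slab_size)          # ceiling division
        if needed > len(self.free):
            raise MemoryError("manifest exceeds configured logical slab count")
        return [self.free.pop(0) for _ in range(needed)]

    def delete(self, slab_ids):
        self.free.extend(slab_ids)                          # slabs become reusable


pool = SlabFreeList(memory_size=1 << 30, slab_size=256 * 1024)
slab_ids = pool.store(655360)    # -> three slab ids
pool.delete(slab_ids)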

Implementation Status

Component                           Status
C++ assignment directory            Landed
Pybind cache bindings               Landed
Python HTTP service                 Landed
CacheClient wrapper                 Landed
dlslime-cache start/status/stop     Landed
Real RDMA client example            Landed
HTTP delete path                    Landed
Fixed slab allocator / reuse        Landed
Leases / pin counts                 Not started
NanoDeploy placement integration    Not started

Current C++ surface:

from dlslime._slime_c import Assignment, cache

srv = cache.CacheServer(slab_size=256 * 1024, memory_size=1024 * 1024)
m = srv.store_assignments("engine-a", [Assignment(1, 2, 0, 0, 655360)])
got = srv.query_assignments("engine-a", m.version)
srv.delete_assignments("engine-a", m.version)

Example

Start services:

nanoctrl start
dlslime-cache start --ctrl http://127.0.0.1:3000 \
    --host 127.0.0.1 --port 8765 --memory-size 1G

Run the client:

python examples/python/cache_client_example.py --url http://127.0.0.1:8765

Expected success signal:

correctness: ok
deleted: peer_agent_id=<client-agent> version=<n> deleted=True

Stop the service:

dlslime-cache stop

Next Steps

  1. Add tiered slab sizing, e.g. 128K..1G, so each store can choose the smallest fitting slab class once capacity accounting is backed by real slab ownership instead of a single fixed startup unit.
  2. Add slab leases or pin counts so delete/evict cannot race with in-flight reads.
  3. Add metrics for assignment entries, bytes, logical slab pressure, and failed stores.
  4. Integrate NanoDeploy placement policy on top of the cache client.
  5. Add multi-client stress tests around store/query/delete/read ordering.