LLM Inference (runnable)

This guide shows how to run a router + worker LLM service with Pulsing, and expose an OpenAI-compatible HTTP API.

Architecture

  • Router: accepts HTTP requests, selects a worker, forwards GenerateRequest / GenerateStreamRequest
  • Workers: host model replicas
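To make "selects a worker" concrete, here is a tiny sketch of one possible selection policy. Pulsing's actual routing strategy is not documented here; the round-robin policy below is purely an illustrative assumption, not the library's implementation:

```python
import itertools


class RoundRobinSelector:
    """Illustrative worker-selection policy (an assumption: Pulsing's
    real router may use a different strategy). Cycles through the
    registered workers in order."""

    def __init__(self, workers):
        self._cycle = itertools.cycle(list(workers))

    def pick(self):
        # Return the next worker in round-robin order.
        return next(self._cycle)
```

With two workers registered, successive `pick()` calls alternate between them, spreading requests evenly.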

0) Prerequisites

  • pip install pulsing
  • Choose one backend:
      • Transformers: install torch + transformers
      • vLLM: install vllm
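To sanity-check which optional backend is importable in your environment, a quick standard-library probe (transformers and vllm are the import names of the packages above):

```python
import importlib.util


def available_backends():
    """Return which of the optional inference backends can be imported."""
    return [name for name in ("transformers", "vllm")
            if importlib.util.find_spec(name) is not None]


print(available_backends())  # e.g. ['transformers'] if only Transformers is installed
```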

1) Start the Router (Terminal A)

The router needs an actor-system address so workers can join the same cluster:

pulsing actor pulsing.serving.Router \
  --addr 0.0.0.0:8000 \
  --name my-llm \
  -- \
  --http_port 8080 \
  --model_name gpt2 \
  --worker_name worker
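Before starting workers, it can help to wait until the router's actor-system port is accepting connections. This small helper is generic stdlib code, not part of Pulsing:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP port accepts connections, or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)
    return False


if __name__ == "__main__":
    # Wait for the router's actor-system address from step 1.
    print(wait_for_port("127.0.0.1", 8000))
```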

2) Start workers

You can run one or more workers. Each worker should join the router node via --seeds.

Option A: Transformers worker (Terminal B)

pulsing actor pulsing.serving.TransformersWorker \
  --addr 0.0.0.0:8001 \
  --seeds 127.0.0.1:8000 \
  --name worker \
  -- \
  --model_name gpt2

Option B: vLLM worker (Terminal C)

pulsing actor pulsing.serving.vllm.VllmWorker \
  --addr 0.0.0.0:8002 \
  --seeds 127.0.0.1:8000 \
  --name worker \
  -- \
  --model Qwen/Qwen2.5-0.5B

3) Verify cluster + workers

List actors (observer mode)

pulsing inspect actors --endpoint 127.0.0.1:8000

Inspect cluster

pulsing inspect cluster --seeds 127.0.0.1:8000

4) Call the OpenAI-compatible API

Non-streaming

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt2", "messages": [{"role": "user", "content": "Hello"}], "stream": false}'
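The same non-streaming call from Python, using only the standard library. The base URL and model name mirror the curl example; the request body follows the OpenAI chat-completions convention:

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, stream: bool = False) -> dict:
    """Build an OpenAI-style chat-completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }


def chat(base_url: str, model: str, prompt: str) -> dict:
    """POST a non-streaming chat completion to the router and return the JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(chat("http://localhost:8080", "gpt2", "Hello"))
```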

Streaming (SSE)

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt2", "messages": [{"role": "user", "content": "Tell me a joke"}], "stream": true}'
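Streaming responses arrive as server-sent events. Assuming the router follows the usual OpenAI SSE convention (`data: {...}` lines, terminated by a `data: [DONE]` sentinel), each line of the stream can be decoded like this:

```python
import json


def parse_sse_line(line: str):
    """Decode one SSE line from an OpenAI-style stream.

    Returns the parsed JSON chunk, or None for blank lines,
    non-data lines, and the final [DONE] sentinel.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    return json.loads(payload)
```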

Troubleshooting

  • If you see No available workers, check that:
      • the router was started with --addr, and each worker joined it via --seeds <router_addr>
      • the worker actor name matches: either start workers with --name worker (before --), or start the router with --worker_name <name> (after --) to match your worker's name
      • pulsing inspect actors --seeds 127.0.0.1:8000 lists an actor with the name the router expects (default: worker)

See also: