Run 16+ AI coding agents locally — in parallel, on your hardware.
Run MLX and llama.cpp simultaneously on Apple Silicon, or vLLM on Linux/CUDA — across three coordinator modes (Flat, Pipeline, Router). No cloud, no API keys, no data leaves your machine.
npm i @keepdevops/matrix
Matrix Swarm is a local-first orchestration layer for open-weight LLMs. Spin up a fleet of role-specialized agents — Architect, Programmer, Security, DevOps, and more — then broadcast one prompt to all of them, pipe it through a sequence, or let a Router model dispatch each task to the best fit.
Models run entirely on your hardware: MLX + llama.cpp on Apple Silicon, vLLM on Linux/CUDA. Nothing leaves the box. Built for engineers who want Cursor-class productivity without sending their codebase to the cloud.
npm i @keepdevops/matrix
Node.js 18+ on macOS (Apple Silicon) or Linux/CUDA.
matrix init --preset 16gb
Generates a swarm config with sensible agent + model defaults for your hardware.
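Roughly what the generated file might contain (the field names and model paths below are illustrative assumptions, not the actual schema; the generated swarm-config.json is the source of truth):

```jsonc
// Illustrative sketch only: field names and model paths are assumptions,
// not the generated schema. Inspect your generated swarm-config.json.
{
  "coordinator": { "mode": "flat" },
  "agents": [
    {
      "role": "Architect",
      "backend": "mlx",
      "model": "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit",
      "system_prompt": "You design the high-level structure before any code is written."
    },
    {
      "role": "Programmer",
      "backend": "llama.cpp",
      "model": "./models/deepseek-coder-6.7b-instruct.Q4_K_M.gguf",
      "system_prompt": "You implement the plan as small, reviewable changes."
    }
  ]
}
```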
matrix run "build a REST API"
Broadcasts to all agents in parallel (Flat) by default; pass --mode pipeline or --mode router to use the other coordinator modes.
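The same prompt under each mode:

```sh
# Flat (default): broadcast the prompt to every agent in parallel
matrix run "build a REST API"

# Pipeline: chain agents in a fixed sequence (e.g. Architect -> Programmer -> Reviewer)
matrix run "build a REST API" --mode pipeline

# Router: a small dispatcher model picks the best-fit agent for the request
matrix run "build a REST API" --mode router
```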
Cursor-class productivity without sending your codebase to a cloud LLM. Your laptop, your models, your repo.
Mix MLX, llama.cpp, and vLLM in a single run. Compare model behavior across agents on identical prompts.
Financial, healthcare, defense — anywhere proprietary code can't leave the box. Air-gapped friendly.
| Feature | Matrix Swarm | Cursor | Aider | Cline |
|---|---|---|---|---|
| Runs fully local | Yes | No | Optional | Optional |
| Multi-agent orchestration | Yes (16+) | No | No | No |
| Mix backends per agent | MLX + llama.cpp + vLLM | No | No | No |
| Coordinator modes | Flat · Pipeline · Router | — | — | — |
| Open source | Yes | No | Yes | Yes |
No. On Apple Silicon, MLX and llama.cpp run on the integrated GPU (Metal) out of unified memory. On Linux, vLLM needs an NVIDIA GPU (CUDA 12+). CPU-only llama.cpp also works, just slower.
You bring your own GGUF (llama.cpp), MLX, or HuggingFace weights. Matrix Swarm doesn't ship models. Recommended starters: Llama 3, Qwen 2.5, DeepSeek-Coder.
Flat broadcasts your prompt to all agents in parallel. Pipeline chains them in a fixed sequence (e.g., Architect → Programmer → Reviewer). Router uses a small dispatcher model to pick the best agent per request.
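A minimal sketch of how a Pipeline order might be expressed in swarm-config.json, assuming a coordinator block with a pipeline key (the key names here are assumptions; check the repo's example configs for the real schema):

```jsonc
// Hypothetical sketch: the "pipeline" key and its shape are assumptions.
{
  "coordinator": {
    "mode": "pipeline",
    "pipeline": ["Architect", "Programmer", "Reviewer"]
  }
}
```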
No. Inference, code extraction, and config all happen locally. There are no telemetry calls. (You can optionally point an agent at a remote OpenAI-compatible endpoint if you want — but it's off by default.)
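If you do opt an agent into a remote endpoint, it would be a per-agent setting in swarm-config.json. A hypothetical sketch, with field names that are assumptions rather than the documented schema:

```jsonc
// Hypothetical sketch: "endpoint" and "api_key_env" are assumed field names,
// shown only to illustrate the opt-in remote option described above.
{
  "role": "Reviewer",
  "backend": "openai-compatible",
  "endpoint": "https://api.example.com/v1",
  "api_key_env": "REVIEWER_API_KEY",
  "model": "your-remote-model"
}
```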
Drop a JSON entry into your swarm-config.json with a system prompt, model binding, and role. See swarm-config-16gb.json in the repo for examples.
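A hedged sketch of what such an entry could look like (field names are assumptions; swarm-config-16gb.json is the authoritative reference):

```jsonc
// Sketch of a custom agent entry: role, model binding, and system prompt.
// Exact field names are assumptions; follow the repo's example config.
{
  "role": "Reviewer",
  "backend": "llama.cpp",
  "model": "./models/qwen2.5-coder-7b-instruct-q4_k_m.gguf",
  "system_prompt": "You review diffs for correctness, security, and style before merge."
}
```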
The CLI installs via npm. Docker is only needed for the Linux/CUDA backend, where it runs the vLLM model servers (via Docker Model Runner on ports 8080–8083). See docker/Dockerfile.vllm-metal.
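For reference, a generic way to serve one vLLM model as an OpenAI-compatible endpoint on port 8080. This is a sketch using the upstream vllm/vllm-openai image, not necessarily this repo's exact packaging or Docker Model Runner invocation:

```sh
# Generic sketch, not this repo's exact setup: expose vLLM's OpenAI-compatible
# server (container port 8000) on host port 8080, matching the 8080-8083 range above.
docker run --gpus all -p 8080:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-Coder-7B-Instruct
```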