tool · open source

llama-launcher

Profile-driven CLI for llama.cpp running as a systemd --user service. Switch inference profiles, autotune context from available VRAM, and hand the GPU back when you launch a game. No sudo needed anywhere.

GitHub mirror ↗

llama-launcher turns a hand-tuned llama.cpp server into something you drive with a single command. Rather than editing a systemd unit and restarting by hand every time you want a different model or a longer context, you keep your setups as named profiles in one config.json and switch between them with lcpp use <profile>.

A profile is just a set of llama-server flags — model, context size, GPU layers, KV-cache type, batch sizes, chat template. Applying one writes those flags to the service's environment file, restarts the server, and waits for /health to come back green before reporting success. If the new profile can't load — a bad flag, not enough VRAM — it rolls back to the last working one on its own.
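The apply, health-gate, and rollback sequence described above can be sketched roughly as follows. This is an illustration, not llama-launcher's own code: the names switch_profile and wait_healthy are hypothetical, and the two callables stand in for the real work (apply would write the environment file and restart the unit; probe would return True once GET /health answers successfully).

```python
import time
from typing import Any, Callable


def wait_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() until it returns True or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


def switch_profile(
    new: str,
    last_good: str,
    apply: Callable[[str], None],
    probe: Callable[[], bool],
    **wait_opts: Any,
) -> str:
    """Apply `new`; if the server never reports healthy, re-apply `last_good`.

    Returns the name of the profile that ends up active.
    """
    apply(new)  # hypothetical hook: write the env file, restart the unit
    if wait_healthy(probe, **wait_opts):
        return new
    apply(last_good)  # roll back to the last profile known to work
    wait_healthy(probe, **wait_opts)
    return last_good
```

Passing apply and probe in as callables keeps the control flow testable without a running server. A real probe would also need to treat connection errors during the restart as "not healthy yet" rather than as failures.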

Everything runs as a per-user systemd service, so the daily loop never touches root: install with pipx, seed the unit once, then switch, restart, and tail logs as yourself. It stays a thin, transparent wrapper around your build — lcpp show <profile> always prints the exact command it would run — not a model registry or a proxy.
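To make "prints the exact command it would run" concrete, here is a minimal sketch of turning a profile into a llama-server command line. The profile schema is an assumption for illustration (keys are llama-server long option names, True means a bare switch); the real config.json layout may differ, and build_command is a hypothetical helper, not part of lcpp.

```python
import shlex
from typing import Mapping, Union

FlagValue = Union[str, int, float, bool]


def build_command(binary: str, flags: Mapping[str, FlagValue]) -> list[str]:
    """Expand a {option-name: value} mapping into an argv list."""
    argv = [binary]
    for name, value in flags.items():
        option = "--" + name
        if value is True:
            argv.append(option)  # bare switch, e.g. --no-mmap
        elif value is False:
            continue  # disabled switch: emit nothing
        else:
            argv.extend([option, str(value)])
    return argv


# Hypothetical profile; the model path is a placeholder.
profile = {
    "model": "/models/example.gguf",
    "ctx-size": 8192,
    "n-gpu-layers": 99,
    "cache-type-k": "q8_0",
    "no-mmap": True,
}

command = shlex.join(build_command("llama-server", profile))
print(command)
```

Building an argv list first and quoting it only for display keeps paths with spaces intact, whether the result is printed or written out for the service.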

service model
A systemd --user unit wrapping llama-server: no root in the loop, and it comes up at boot via user lingering (loginctl enable-linger).
profiles
Named flag sets in config.json. lcpp use <profile> writes them to the unit's EnvironmentFile and restarts, gating on /health with automatic rollback.
autotune
Reads GGUF metadata and free VRAM to suggest a safe context size and GPU-layer split; --fit stays on as the runtime safety net, so an over-ambitious context degrades instead of running out of memory.
watch daemon
Polls for running games — a Steam launch, a Prism/Minecraft instance, any non-llama GPU process — and frees the GPU while they run (shrink to a lighter profile or stop the server outright), then restores the previous profile on exit.
backends
Drives the llama-server already on your PATH (CUDA in practice; ROCm-aware, with a rocm-smi telemetry fallback). It runs the binary; it does not build it. Profile flags are verified against the build you have, and lcpp doctor warns when an option has drifted out of your binary's --help.
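The autotune entry above does not spell out its heuristic. As a back-of-envelope illustration only, the standard KV-cache estimate shows why context size is bounded by free VRAM: each token stores a key and a value vector per layer. The function names, the 1 GiB reserve, and the rounding step below are assumptions for this sketch, not the launcher's actual values.

```python
def kv_bytes_per_token(
    n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_elem: float = 2.0
) -> float:
    """KV-cache bytes per token: keys plus values, for every layer.

    bytes_per_elem is 2.0 for an f16 cache; quantized cache types are smaller.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def suggest_ctx(
    free_vram_bytes: int,
    weights_bytes: int,
    per_token_bytes: float,
    reserve_bytes: int = 1 << 30,  # assumed headroom for compute buffers
    max_ctx: int = 131072,  # assumed cap, e.g. the model's trained context
    step: int = 1024,
) -> int:
    """Largest context (rounded down to `step`) whose KV cache fits the budget."""
    budget = free_vram_bytes - weights_bytes - reserve_bytes
    if budget <= 0:
        return 0  # weights alone do not fit; offload fewer layers instead
    tokens = min(int(budget // per_token_bytes), max_ctx)
    return (tokens // step) * step


GIB = 1 << 30
# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
ctx = suggest_ctx(free_vram_bytes=12 * GIB, weights_bytes=5 * GIB,
                  per_token_bytes=per_token)
print(ctx)
```

In practice the layer count, KV-head count, and head dimension would come from the GGUF metadata, and free VRAM from the GPU's telemetry tool. The estimate ignores compute buffers beyond the fixed reserve, which is one reason a runtime safety net is still useful.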
[Screenshot: lcpp --help output, listing 25+ commands across service control, profiles, and diagnostics]
Python · Typer · systemd · llama.cpp · pipx

one runtime dependency · 63 tests · ruff + mypy clean