Every voice interface I found needed a GPU, needed a cloud API, or was locked to one OS. So I built one that needs none of that, and benchmarked it so the numbers are real.
The stack — all ONNX, all CPU:
- Silero VAD — neural voice activity detection, ~0.09 ms/frame. Knows when you stop talking so there's no push-to-talk.
- Parakeet TDT 0.6B v3 — INT8 transcription, 25 languages, OpenAI-compatible on :5093. A 2.4 s clip → 307 ms on an i7 (~8× realtime).
- Supertonic TTS 3 — FP16 synthesis. Short replies in ~1.4 s. On Apple Silicon M5 Neural Engine: 33× realtime for STT, 16× for TTS.
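For anyone checking the "× realtime" figures: they are audio duration divided by processing time. Here is a minimal sketch of that arithmetic using the Parakeet numbers quoted above. The function name `realtime_factor` is mine for illustration, not something taken from the repo's benchmark script.

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Return how many times faster than realtime the processing ran."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Figures quoted in the post for Parakeet on an i7: a 2.4 s clip in 307 ms.
rtf = realtime_factor(2.4, 0.307)
print(f"{rtf:.1f}x realtime")  # prints "7.8x realtime", i.e. roughly 8x
```

Anything above 1× means transcription finishes before the clip would have finished playing.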
Data flow:
you → Silero VAD → Parakeet STT → your LLM (Ollama / LM Studio / vLLM / any OpenAI-compatible) → Supertonic TTS → speakers
Zero cloud. Zero API keys. Nothing routes outside the machine.
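The flow above can be sketched as four pluggable stages chained in order. This is only an illustration of the shape of one voice turn, not the project's actual code: the `VoiceTurn` class, its field names, and the stub lambdas are all hypothetical stand-ins for the real VAD, STT, LLM, and TTS services.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceTurn:
    """One spoken turn: VAD -> STT -> LLM -> TTS (hypothetical sketch)."""

    detect_speech: Callable[[bytes], bytes]  # VAD: returns only the spoken part
    transcribe: Callable[[bytes], str]       # STT: audio to text
    generate: Callable[[str], str]           # LLM: prompt to reply
    synthesize: Callable[[str], bytes]       # TTS: reply text to audio

    def run(self, mic_audio: bytes) -> bytes:
        speech = self.detect_speech(mic_audio)
        if not speech:
            # Nothing was said, so skip the later stages entirely.
            return b""
        text = self.transcribe(speech)
        reply = self.generate(text)
        return self.synthesize(reply)


# Stub stages so the sketch runs on its own; real stages would call the
# local services instead. Zero bytes stand in for silence here.
turn = VoiceTurn(
    detect_speech=lambda audio: audio.strip(b"\x00"),
    transcribe=lambda speech: speech.decode("utf-8"),
    generate=lambda prompt: f"You said: {prompt}",
    synthesize=lambda reply: reply.encode("utf-8"),
)

out = turn.run(b"\x00\x00hello\x00")
print(out)  # b'You said: hello'
```

Keeping each stage behind a plain callable is what makes the LLM slot swappable: any backend that takes text and returns text fits.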
Works with Claude Code, OpenCode CLI, OpenClaw, Hermes Agent, and Codex. One install wires voice into your agent and starts the services (systemd/launchd/Task Scheduler).
Install (macOS / Linux):
git clone https://github.com/groxaxo/Local-VoiceMode-LLM
cd Local-VoiceMode-LLM && ./setup.sh
Windows: .\setup.ps1
Ollama one-liner (standalone, no clone):
bash <(curl -fsSL https://raw.githubusercontent.com/groxaxo/Local-VoiceMode-LLM/main/integrations/ollama/install-ollama-voice.sh)
Benchmarks are reproducible via python benchmarks/run_benchmark.py in the repo. MIT-licensed, free.