Overview
LuaVoice is the class-based primitive you define in code to declare a voice-enabled agent — its speech-to-text engine, text-to-speech engine, LLM, turn detection, and any voice-specific tools.
For testing voice agents live or running automated voice tests, see the Voice Command. For the direct plugin route (when string descriptors aren’t enough), see Plugin and Realtime Engines below.
Persona is configured on the parent LuaAgent, not on LuaVoice. Use the channel-aware persona shape { base, voice, text } on the agent to give a voice its own prompt; see Channel-Aware Personas.
String Descriptors (recommended)
llm, stt, and tts all accept a provider-prefixed string descriptor. This is the canonical form — it routes through Lua’s inference layer so you don’t manage provider credentials yourself.
The model and voice catalogs below are a living list — your descriptor is forwarded straight to Lua’s inference layer, so newer provider models may work before they’re listed here and retired ones may drop off. Treat these tables as a starting point, not an exhaustive allowlist.
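Because llm and stt descriptors share the provider-prefixed shape, it can be useful to sanity-check one before handing it to LuaVoice. The following is a minimal sketch, not part of lua-cli: parseDescriptor and the Descriptor type are hypothetical names invented for illustration, and the check only covers the shape of the string, not whether the inference layer carries the model.

```typescript
// Hypothetical helper: not a lua-cli export. It only illustrates the
// '<provider>/<model>' shape described above.
interface Descriptor {
  provider: string;
  model: string;
}

function parseDescriptor(descriptor: string): Descriptor {
  const slash = descriptor.indexOf('/');
  // Require a non-empty provider before the first slash and a non-empty model after it.
  if (slash <= 0 || slash === descriptor.length - 1) {
    throw new Error(`Expected '<provider>/<model>', got '${descriptor}'`);
  }
  return {
    provider: descriptor.slice(0, slash),
    model: descriptor.slice(slash + 1),
  };
}

const llm = parseDescriptor('openai/gpt-5.1-chat-latest');
const stt = parseDescriptor('deepgram/nova-3');
```

A bare model id such as 'nova-3' is rejected by this sketch, which mirrors the requirement that the provider prefix is always present.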
LLM options
Provider-prefixed model id. Grouped by tier; pick a tier based on the latency/cost/quality trade-off you need.
Fast tier (lowest latency, lowest cost):
| Descriptor | Notes |
|---|---|
| openai/gpt-5-mini | Fast & cheap OpenAI. |
| openai/gpt-5-nano | Cheapest OpenAI tier. |
| openai/gpt-4.1-mini | Stable, fast. |
| google/gemini-2.5-flash-lite | Fastest Gemini. |
| google/gemini-2.5-flash | Fast multimodal. |
| xai/grok-4-1-fast-non-reasoning | Fast xAI tier. |
Balanced tier:
| Descriptor | Notes |
|---|---|
| openai/gpt-5 | Balanced quality and speed. |
| openai/gpt-5.1-chat-latest | Balanced, chat-tuned. Common default. |
| openai/gpt-4.1 | Stable, balanced. |
| google/gemini-3-flash | Newest Flash multimodal. |
| xai/grok-4-1-fast-reasoning | Reasoning at fast tier. |
| deepseek-ai/deepseek-v3.2 | Cost-efficient reasoning. |
| moonshotai/kimi-k2-instruct | Long-context instruct. |
Top tier:
| Descriptor | Notes |
|---|---|
| openai/gpt-5.4 | Top-tier OpenAI. |
| openai/gpt-5.3-chat-latest | Top-tier chat-tuned. |
| google/gemini-3-pro | Long context, top tier. |
| google/gemini-2.5-pro | Stable Pro tier. |
| xai/grok-4.20-0309-reasoning | Top-tier xAI reasoning. |
Anthropic / Claude is intentionally absent — Lua’s inference layer does not carry Anthropic models for voice as of this writing. Use OpenAI, Google, xAI, DeepSeek, or Kimi for voice LLMs.
STT options
Deepgram (recommended)
Deepgram is the recommended STT provider, and deepgram/nova-3 is the standard choice. stt is required for cascaded LLMs; omit it only when the llm is a realtime speech-to-speech model (which handles audio directly).
| Descriptor | Notes |
|---|---|
| deepgram/nova-3 | Latest Nova series. Best accuracy + low latency. Recommended default. |
| deepgram/nova-2 | Previous generation. Still solid. |
| deepgram/nova-2-phonecall | Tuned for narrowband (8 kHz) phone audio. Use when call quality is poor or when you want extra robustness on PSTN. |
Set sttLanguage to pin the spoken language:
- BCP-47 code ('en', 'es', 'pt-BR', etc.): pins recognition to that language.
- 'multi': multilingual transcription.
Applies to both the Inference route and the direct Deepgram plugin.
ElevenLabs Scribe
ElevenLabs has an STT model called Scribe, available via the Inference route.
TTS options
ElevenLabs (recommended)
ElevenLabs is the canonical TTS provider. The descriptor format is elevenlabs/<model>:<voiceId>.
| Model | Latency | Languages | Best for |
|---|---|---|---|
| eleven_v3 | ~250ms | 70+ | Most expressive. Use when quality matters more than latency. |
| eleven_turbo_v2_5 | Low | Multilingual | Common default: balanced latency + quality. |
| eleven_flash_v2_5 | ~75ms | Multilingual | Ultra-low latency. Use for fast, interactive turns. |
| eleven_multilingual_v2 | ~200ms | 29 | Lifelike emotion across many languages. |
| eleven_flash_v2 | ~75ms | English only | Ultra-low latency, English-only. |
Voices:
| Voice ID | Name | Accent | Style |
|---|---|---|---|
| pwMBn0SsmN1220Aorv15 | Matt | American | Male, Hyper-Conversational |
| ZTho75k1M56OV0k9XtSC | Spence | American | Male, Soft-Spoken |
| kdmDKE6EkgrWrrykO9Qt | Alexandra | American | Female, Conversational |
| h2sm0NbeIZXHBzJOMYcQ | Natasha | American | Female, Calm Narrative |
| lUTamkMw7gOzZbFIwmq4 | James | British | Male, Professional |
| 4BWwbsA70lmV7RMG0Acs | Blondie | British | Female, Relaxed Casual |
| lcMyyd2HUfFzxdCaC4Ta | Lucy | British | Female, Fresh Casual |
| 4CrZuIW9am7gYAxgo2Af | Shelley | British | Female, Clear Confident |
| 56bWURjYFHyYyVf490Dp | Emma | Australian | Female, Warm Conversational |
| aCChyB4P5WEomwRsOKRh | Salma | Arabic | Female, Conversational Expressive |
| 2zRM7PkgwBPiau2jvVXc | Monika | Indian | Female, Deep and Natural |
| ecp3DWciuUyW7BYM7II1 | Anika | Indian | Female, Sweet and Lively |
| pzxut4zZz4GImZNlqQ3H | Raju | Indian | Male, Natural Conversationalist |
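The colon-separated string and a { model, voice } object carry the same two pieces of information. The sketch below shows one way to split the string form; splitTtsDescriptor and TtsConfig are hypothetical names, not lua-cli exports, and it assumes the object form keeps the provider prefix on model.

```typescript
// Hypothetical helper, not a lua-cli export. It splits the
// 'elevenlabs/<model>:<voiceId>' string form into a { model, voice } object.
interface TtsConfig {
  model: string; // assumed to stay provider-prefixed, e.g. 'elevenlabs/eleven_turbo_v2_5'
  voice: string; // a voice id from the catalog above
}

function splitTtsDescriptor(descriptor: string): TtsConfig {
  const colon = descriptor.lastIndexOf(':');
  // Both sides of the colon must be non-empty.
  if (colon <= 0 || colon === descriptor.length - 1) {
    throw new Error(`Expected '<provider>/<model>:<voiceId>', got '${descriptor}'`);
  }
  return {
    model: descriptor.slice(0, colon),
    voice: descriptor.slice(colon + 1),
  };
}

// Matt on the common-default turbo model.
const matt = splitTtsDescriptor('elevenlabs/eleven_turbo_v2_5:pwMBn0SsmN1220Aorv15');
```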
Deepgram Aura
Deepgram offers TTS via the Aura family. The voice id is encoded inside the model id as aura-2-<name>-<lang>:
| ID | Name | Gender | Style |
|---|---|---|---|
| aura-2-thalia-en | Thalia | Female (American) | Conversational |
| aura-2-asteria-en | Asteria | Female (American) | Friendly |
| aura-2-luna-en | Luna | Female (American) | Warm |
| aura-2-stella-en | Stella | Female (American) | Professional |
| aura-2-athena-en | Athena | Female (British) | Authoritative |
| aura-2-hera-en | Hera | Female (American) | Calm Narrative |
| aura-2-orion-en | Orion | Male (American) | Confident |
| aura-2-arcas-en | Arcas | Male (American) | Conversational |
| aura-2-perseus-en | Perseus | Male (American) | Engaging |
| aura-2-angus-en | Angus | Male (Irish) | Storyteller |
| aura-2-helios-en | Helios | Male (British) | Professional |
| aura-2-zeus-en | Zeus | Male (American) | Deep Authoritative |
Spanish voices: aura-2-celeste-es, aura-2-estrella-es.
Other TTS providers (via Inference)
Lua’s inference layer also exposes Cartesia, Inworld, Rime, and xAI TTS. The descriptors follow the same provider/model shape:
| Descriptor | Provider | Notes |
|---|---|---|
| cartesia/sonic-3 | Cartesia | Newest, expressive. |
| cartesia/sonic-turbo | Cartesia | Ultra-low latency. |
| inworld/inworld-tts-1.5-max | Inworld | High-quality multilingual. |
| rime/arcana | Rime | Multilingual, expressive. |
| xai/tts-1 | xAI | 21 languages. |
Plugin and Realtime Engines
For most voice agents the string-descriptor form above is all you need. Reach for the plugin/class forms here in two cases: (1) you need provider-specific options the descriptor route doesn’t expose, or (2) you’re using a realtime (speech-to-speech) model in the llm slot.
lua-cli/voice re-exports the LiveKit plugin namespaces that LuaVoice accepts as class instances — importing through it means you don’t add the underlying plugin packages as direct dependencies:
What’s allowed where
The compiler enforces two separate allowlists:
| Form | Allowed in llm / stt / tts |
|---|---|
| '<provider>/<model>' string descriptor | Any provider supported by Lua’s inference layer. The descriptor route handles credentials. |
| new deepgram.<Class>({...}) | Plugin route. Only deepgram and elevenlabs are allowlisted. |
| new elevenlabs.<Class>({...}) | Plugin route. Only deepgram and elevenlabs are allowlisted. |
| new inference.<Class>({ model, ... }) | Typed shortcut for the descriptor route: same semantics as a string descriptor, just with autocomplete on the options. |
| new <provider>.realtime.RealtimeModel({...}) | Realtime route. openai, google (via google.beta.realtime.*), xai are allowlisted for realtime only (goes in the llm slot, replaces STT+TTS). |
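The rules in this table can be read as a small lookup keyed by route. The sketch below is only an illustrative model of those rules; the real checks live in the Lua compiler, and Route, isAllowed, and the two constant arrays are hypothetical names.

```typescript
// Illustrative model of the table above, not the compiler's actual code.
type Route = 'descriptor' | 'plugin' | 'realtime';

const PLUGIN_PROVIDERS = ['deepgram', 'elevenlabs'];
const REALTIME_PROVIDERS = ['openai', 'google', 'xai'];

function isAllowed(route: Route, provider: string): boolean {
  if (route === 'plugin') {
    return PLUGIN_PROVIDERS.indexOf(provider) !== -1;
  }
  if (route === 'realtime') {
    return REALTIME_PROVIDERS.indexOf(provider) !== -1;
  }
  // String descriptors and inference.* classes are forwarded to the inference
  // layer, so this sketch does not restrict them by provider.
  return true;
}
```

Under this model, isAllowed('plugin', 'openai') is false while isAllowed('realtime', 'openai') is true, which is why OpenAI in class syntax has to go through either inference.* or the realtime route.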
Plugin route: Deepgram + ElevenLabs
The two allowlisted plugin providers. Use these class forms when you need provider-specific options not exposed by the string-descriptor route.
Deepgram STT (plugin form)
- new deepgram.STT({...}): Deepgram’s v1 WebSocket endpoint. Use this for nova-3, nova-2, etc.
- new deepgram.STTv2({...}): Deepgram’s v2 endpoint. Required for Flux models that use semantic endpointing (eotThreshold, eagerEotThreshold, eotTimeoutMs).
ElevenLabs TTS (plugin form)
Inference route (typed shortcut)
inference.LLM, inference.STT, inference.TTS are typed wrappers for the string-descriptor route. The compiler normalizes both forms to the same wire shape; the class form just gives you better TypeScript autocomplete on the options.
The model option is required; it’s the same provider-prefixed string you’d pass directly. For TTS, pass voice separately.
This is the only way to use class syntax for providers that aren’t on the plugin allowlist (OpenAI, Google, xAI, Cartesia, etc.).
Realtime route (speech-to-speech)
The realtime route puts a speech-to-speech model in the llm slot, replacing the cascaded STT → LLM → TTS pipeline. The class-construction path differs by provider:
- OpenAI: new openai.realtime.RealtimeModel({...})
- Google (Gemini): new google.beta.realtime.RealtimeModel({...}). Note the .beta. prefix (matches Google’s Node SDK shape).
Available realtime models
| Class form | Model id | Notes |
|---|---|---|
| new openai.realtime.RealtimeModel({ model: 'gpt-realtime-1.5' }) | gpt-realtime-1.5 | OpenAI flagship realtime. GA. |
| new openai.realtime.RealtimeModel({ model: 'gpt-realtime-mini' }) | gpt-realtime-mini | Cost-efficient OpenAI realtime. GA. |
| new google.beta.realtime.RealtimeModel({ model: 'gemini-3.1-flash-live-preview' }) | gemini-3.1-flash-live-preview | Newest Gemini realtime. Preview. |
| new google.beta.realtime.RealtimeModel({ model: 'gemini-2.5-flash-live-preview' }) | gemini-2.5-flash-live-preview | Cheaper Gemini alternative. Preview. |
xai is reserved in the realtime allowlist but no xAI realtime models are currently published.
Half-cascade mode
You can keep a separate tts with a realtime LLM. The worker injects modalities: ['text'] so the realtime model emits text and tts handles synthesis. Useful when you want realtime’s low-latency reasoning but ElevenLabs’ voice quality.
Do not combine a realtime llm with a custom stt; the compiler rejects it. Realtime models handle audio input directly.
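Taken together, the slot rules describe three valid pipelines and two rejected combinations. The sketch below restates them as a pure function. It is an assumption-laden model rather than the compiler's code: classifyPipeline, EngineSlots, and the error messages are hypothetical.

```typescript
// Illustrative restatement of the slot rules; names and messages are hypothetical.
type Pipeline = 'cascaded' | 'half-cascade' | 'full-realtime';

interface EngineSlots {
  llmIsRealtime: boolean; // true when llm holds a realtime speech-to-speech model
  hasStt: boolean;
  hasTts: boolean;
}

function classifyPipeline(slots: EngineSlots): Pipeline {
  if (!slots.llmIsRealtime) {
    // Cascaded LLMs need both a speech-to-text and a text-to-speech engine.
    if (!slots.hasStt || !slots.hasTts) {
      throw new Error('Cascaded LLMs need both stt and tts');
    }
    return 'cascaded';
  }
  // Realtime models take audio input directly, so a custom stt is rejected.
  if (slots.hasStt) {
    throw new Error('A realtime llm cannot be combined with a custom stt');
  }
  // A separate tts next to a realtime llm selects half-cascade mode.
  return slots.hasTts ? 'half-cascade' : 'full-realtime';
}
```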
Credentials
Plugin class instances rely on credentials provisioned by the Lua platform. You do not need to set DEEPGRAM_API_KEY, ELEVENLABS_API_KEY, etc. in your project’s .env. Lua manages the provider credentials for you; your code just references the class form and the platform constructs the actual engine at runtime.
When to use which form
| Goal | Recommended form |
|---|---|
| Quick start, sensible defaults | String descriptor — stt: 'deepgram/nova-3' |
| TypeScript autocomplete on options | inference.X — stt: new inference.STT({ model: 'deepgram/nova-3' }) |
| Deepgram or ElevenLabs with provider-specific options | Plugin class — stt: new deepgram.STT({ model: 'nova-3', smartFormat: true }) |
| Speech-to-speech (OpenAI/Google/xAI realtime) | Realtime class — llm: new openai.realtime.RealtimeModel({ model: 'gpt-realtime-1.5' }) |
Configuration Reference
Required fields
Unique name for this voice. Used to address the voice in lua voice --voice <name> and as the server-side identifier. Allowed characters: a-zA-Z0-9_-, 1–64 chars.
The LLM that drives the conversation. String descriptor (e.g. 'openai/gpt-5.1-chat-latest') is the canonical form. See LLM options above for the catalog.
Speech-to-text engine. String descriptor (e.g. 'deepgram/nova-3') is canonical. Required for cascaded LLMs; omit only when using a realtime speech-to-speech model in the llm slot.
Text-to-speech engine. String descriptor with colon-separated voice id (e.g. 'elevenlabs/eleven_turbo_v2_5:<voiceId>'), or object form { model, voice }. Required for cascaded LLMs.
Optional fields
Human-readable description. Surfaced in the compiled manifest and admin listings.
Opening line spoken at session start. Empty string means no greeting. Generated through the LLM at session connect, so it can be dynamic if onEnter sets up context first.
BCP-47 language code (e.g. 'en', 'es', 'pt-BR') or 'multi' for multilingual transcription. Applies to both Inference STT and the Deepgram plugin.
How the agent decides when the user has finished speaking. 'vad' is the safest choice for most setups. 'multilingual' and 'english' use LiveKit’s turn-detector model; 'manual' defers to your own logic.
Voice activity detection engine. 'silero' is the only currently-supported value.
Silero VAD tuning. Useful when the default endpointing clips quiet callers or fires too eagerly mid-thought.
- minSpeechDuration (ms, 0–5000): speech required before a turn starts. Default: 50.
- minSilenceDuration (ms, 0–5000): silence required to end a turn. Default: 550.
- prefixPaddingDuration (ms, 0–2000): audio captured before detected speech start, forwarded into STT. Default: 500.
- activationThreshold (0–1): lower = more sensitive to speech onset.
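The documented ranges are easy to check locally before pushing a voice. The sketch below is a hypothetical pre-flight check, not something lua-cli provides: the VadOptions interface simply mirrors the four option names listed above, and vadRangeErrors is an invented helper.

```typescript
// Hypothetical pre-flight check; the option names and ranges come from the list above.
interface VadOptions {
  minSpeechDuration?: number;     // ms, 0 to 5000
  minSilenceDuration?: number;    // ms, 0 to 5000
  prefixPaddingDuration?: number; // ms, 0 to 2000
  activationThreshold?: number;   // 0 to 1
}

const VAD_RANGES: Record<keyof VadOptions, [number, number]> = {
  minSpeechDuration: [0, 5000],
  minSilenceDuration: [0, 5000],
  prefixPaddingDuration: [0, 2000],
  activationThreshold: [0, 1],
};

function vadRangeErrors(options: VadOptions): string[] {
  const errors: string[] = [];
  const keys = Object.keys(VAD_RANGES) as (keyof VadOptions)[];
  for (const key of keys) {
    const value = options[key];
    if (value === undefined) continue; // omitted fields keep their defaults
    const [min, max] = VAD_RANGES[key];
    if (value < min || value > max) {
      errors.push(`${key} must be between ${min} and ${max}, got ${value}`);
    }
  }
  return errors;
}

// Example: a longer silence window for callers who pause mid-thought.
const patient: VadOptions = { minSilenceDuration: 900, activationThreshold: 0.4 };
```

An empty array from vadRangeErrors(patient) means every supplied value sits inside its documented range.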
Krisp BVC background noise cancellation. Recommended for inbound phone calls — it removes background chatter, traffic, and other ambient noise. Billed separately, so opt-in.
Maximum sequential tool calls per turn (1–20). Higher values let the agent chain more tools before responding.
Seconds of silence before the agent considers the user “away” and ends the session. Useful for cleanly handling abandoned calls.
Generate the assistant’s response speculatively as the user is still speaking. Reduces perceived latency for predictable turns but can be wasted on highly interruptive callers.
How the agent handles being interrupted mid-response.
- enabled: whether interruption is allowed.
- mode: 'adaptive' (recommended) or 'vad'.
- falseInterruptionTimeout (seconds): how long to wait before treating a brief noise as a false interruption.
- resumeFalseInterruption (boolean): resume the cut-off response after a false interruption.
- minDelay / maxDelay (seconds): bounds on the interruption response window.
Word-boundary text replacements applied before TTS synthesis. Keys are matched case-insensitively as whole words. Use for acronyms and proper nouns the TTS mispronounces.
Cascaded path only. Setting pronunciations on a full-realtime voice (realtime llm with no tts) is rejected at compile time; pair with a half-cascade tts, or drop the field.
Background audio layered onto the agent’s output. Pass a built-in clip name, a { source, volume, probability } config, or an array (probabilistic mix).
Built-in clips: 'office-ambience', 'keyboard-typing', 'keyboard-typing-2'.
Output speech volume, 0–100. Applied as a per-frame multiplier. Omit to pass the TTS provider’s native level through unchanged.
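To make the whole-word, case-insensitive matching of pronunciation replacements concrete, here is a minimal sketch of the idea. It is not the platform's implementation: applyPronunciations and escapeRegExp are hypothetical helpers, and the sketch assumes keys begin and end with ASCII word characters, since it relies on the regular-expression \b anchor.

```typescript
// Illustrative only: approximates whole-word, case-insensitive replacement.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function applyPronunciations(text: string, pronunciations: { [word: string]: string }): string {
  let result = text;
  for (const key of Object.keys(pronunciations)) {
    // \b keeps the match to whole words; the 'i' flag ignores case.
    const pattern = new RegExp('\\b' + escapeRegExp(key) + '\\b', 'gi');
    const replacement = pronunciations[key];
    // A replacer function avoids special handling of '$' in the replacement text.
    result = result.replace(pattern, () => replacement);
  }
  return result;
}

const spoken = applyPronunciations(
  'Your SQL query hit the sql cache, not MySQL.',
  { SQL: 'sequel' },
);
// spoken === 'Your sequel query hit the sequel cache, not MySQL.'
```

Both standalone spellings are replaced, while the SQL inside MySQL is left alone because it is not a whole word.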
When true, the worker writes session.history to Data.set('call:<sessionId>') after the call ends. Read it back from a job or webhook with Data.get('call:<sessionId>') for post-call analytics, follow-ups, or QA.
Short line spoken to the caller when a tool call fails (throws, times out, or returns an unsupported result); it fills the 2–3s gap before the LLM’s own recovery response. Spoken once per failed call, then the error is surfaced to the LLM. Keep it short and on-brand (e.g. 'Sorry, let me try that another way.'); omit for no spoken fallback.
Voice-specific tools in addition to skills attached to the owning agent. See Defining Voice Tools.
Lifecycle Hooks
Three hooks let you wire up per-session state, RAG injection, and post-call work.
Fires after the session connects to the room and before the greeting. Use it to hydrate session.userdata from User, Data, etc., or to set up any per-call state.
Fires after the user finishes a turn, before the LLM is invoked. This is the canonical RAG-injection point: turnCtx.addMessage(...) adds context messages the LLM sees on this turn.
Fires when the session is closing. Use for transcript persistence, outcome reporting, CRM updates, etc.
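The RAG-injection step can be sketched independently of the SDK. Everything below is an assumption made so the example stands alone: TurnCtxLike and ContextMessage are local stand-ins (the real turn-context type comes from lua-cli and the argument shape of addMessage may differ), retrieveSnippets is a hypothetical retriever, and the knowledge-base entries are made-up sample data.

```typescript
// Local stand-ins so the sketch type-checks on its own; the real types may differ.
interface ContextMessage {
  role: 'system' | 'assistant' | 'user';
  content: string;
}

interface TurnCtxLike {
  addMessage(message: ContextMessage): void;
}

// Hypothetical retriever with sample data: swap in your own search.
function retrieveSnippets(userText: string): string[] {
  const kb: { [topic: string]: string } = {
    refund: 'Refunds are issued within 5 business days.',
    hours: 'Support is open 9am to 6pm, Monday to Friday.',
  };
  const lower = userText.toLowerCase();
  return Object.keys(kb)
    .filter((topic) => lower.indexOf(topic) !== -1)
    .map((topic) => kb[topic]);
}

// Body of a turn-completed hook: runs after the user's turn, before the LLM call.
function injectContext(turnCtx: TurnCtxLike, userText: string): void {
  for (const snippet of retrieveSnippets(userText)) {
    turnCtx.addMessage({ role: 'system', content: `Relevant context: ${snippet}` });
  }
}
```

Keeping the retrieval logic in a plain function like this makes it testable with a fake turn context, without opening a live voice session.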
Defining Voice Tools
Voice tools run during a voice conversation. LuaVoiceTool is a concrete class; instantiate it with a config object.
Config fields
Tool name. Used by the LLM to identify and call the tool.
What the tool does. Action-oriented description the LLM reads when deciding to invoke.
Zod schema for the tool’s input. Validated before execute is called.
Tool body. Receives the validated input and an optional voice-specific context.
Optional gate. When provided, the tool is only exposed to the LLM if condition() returns true. Use for feature flags or runtime availability checks.
Voice-specific tool flags (e.g. controlling barge-in behavior).
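The condition gate amounts to a filter applied before the tool list reaches the LLM. The sketch below models that with a stand-in type; GatedTool and exposedTools are hypothetical names, the real LuaVoiceTool has more fields, and the sketch assumes a synchronous condition.

```typescript
// Minimal stand-in for a tool config; only the fields needed for gating.
interface GatedTool {
  name: string;
  condition?: () => boolean; // assumed synchronous in this sketch
}

function exposedTools<T extends GatedTool>(tools: T[]): T[] {
  // Tools without a condition are always exposed.
  return tools.filter((tool) => (tool.condition ? tool.condition() === true : true));
}

// Hypothetical feature flag controlling one tool.
const flags = { callbacksEnabled: false };

const tools: GatedTool[] = [
  { name: 'lookup_order' },
  { name: 'schedule_callback', condition: () => flags.callbacksEnabled },
];

const visibleNames = exposedTools(tools).map((tool) => tool.name);
```

With the flag off, only lookup_order is visible; flipping flags.callbacksEnabled to true exposes both tools on the next evaluation.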
ctx — LuaVoiceToolCtx
Identifier for this specific tool invocation.
Speak text to the caller via the active LiveKit session. Useful for status updates during long-running tool work (“Looking that up, one moment.”).
Transfer the live caller to a human at msisdn. Two mechanisms:
- mode: 'refer' (default). SIP REFER on the inbound leg. Cheap (one billed leg) but depends on the inbound carrier accepting REFER end-to-end.
- mode: 'bridge'. Dial the human as a second SIP participant into the same room. Two billed legs but works regardless of carrier REFER support. Use for high-stakes transfers.
announce is spoken before the transfer fires.
You can share LuaTool instances between chat skills and voice tools; just pass them in the same tools array. The tools field accepts both LuaTool and LuaVoiceTool instances.
Function-style: defineVoice
Equivalent to new LuaVoice(config) if you prefer a function call:
Wiring Up to an Agent
The persona.voice branch is what gives supportLine its voice-specific prompt.
Connect a phone number
Attaching the voice in code makes it available; to make the agent answer phone calls, bind a number to it. Push your voice first, then run the channels flow and choose “☎️ Manage phone numbers”.
Binding requires a code-defined LuaVoice that has been pushed. Without one, inbound calls fall through to platform-default STT/LLM/TTS, with no greeting, lifecycle hooks, or voice-only tools. Author the voice, lua push, then bind.
Related
- Voice Command — live testing and voice test suites
- Plugin and Realtime Engines — Deepgram/ElevenLabs class forms, realtime speech-to-speech
- Persona Command — voice-specific personas on the parent agent
- LuaAgent API

