Overview
LuaVoice is the class-based primitive you define in code to declare a voice-enabled agent — its speech-to-text engine, text-to-speech engine, LLM, turn detection, and any voice-specific tools.
For testing voice agents live or running automated voice tests, see the Voice Command. For the direct plugin route (when string descriptors aren’t enough), see Voice Plugins.
Persona is configured on the parent LuaAgent, not on LuaVoice. Use the channel-aware persona shape { base, voice, text } on the agent to give a voice its own prompt; see Channel-Aware Personas.
String Descriptors (recommended)
llm, stt, and tts all accept a provider-prefixed string descriptor. This is the canonical form — it routes through Lua’s inference layer so you don’t manage provider credentials yourself.
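As a rough illustration of the descriptor shape, the sketch below writes the three descriptors as plain strings and splits one into its provider and model parts. The `VoiceDescriptors` type and the `splitDescriptor` helper are hypothetical names made up for this example, not part of the Lua SDK, and `<voiceId>` is left as a placeholder.

```typescript
// Hypothetical local type: it only mirrors the three descriptor fields named above.
interface VoiceDescriptors {
  llm: string;
  stt: string;
  tts: string;
}

// Example values taken from the catalogs on this page; <voiceId> is a placeholder.
const supportLine: VoiceDescriptors = {
  llm: "openai/gpt-5.1-chat-latest",
  stt: "deepgram/nova-3",
  tts: "elevenlabs/eleven_turbo_v2_5:<voiceId>",
};

// Split "provider/model" at the first slash (illustrative helper, not an SDK call).
function splitDescriptor(descriptor: string): { provider: string; model: string } {
  const slash = descriptor.indexOf("/");
  if (slash <= 0 || slash === descriptor.length - 1) {
    throw new Error(`Expected "provider/model", got "${descriptor}"`);
  }
  return {
    provider: descriptor.slice(0, slash),
    model: descriptor.slice(slash + 1),
  };
}

console.log(splitDescriptor(supportLine.llm));
```

Splitting at the first slash keeps hyphenated provider prefixes such as deepseek-ai intact.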
LLM options
Provider-prefixed model id. Grouped by tier: pick a tier based on the latency/cost/quality trade-off you need.
Fast tier (lowest latency, lowest cost):
| Descriptor | Notes |
|---|---|
| openai/gpt-5-mini | Fast & cheap OpenAI. |
| openai/gpt-5-nano | Cheapest OpenAI tier. |
| openai/gpt-4.1-mini | Stable, fast. |
| google/gemini-2.5-flash-lite | Fastest Gemini. |
| google/gemini-2.5-flash | Fast multimodal. |
| xai/grok-4-1-fast-non-reasoning | Fast xAI tier. |

Balanced tier:
| Descriptor | Notes |
|---|---|
| openai/gpt-5 | Balanced quality and speed. |
| openai/gpt-5.1-chat-latest | Balanced, chat-tuned. Common default. |
| openai/gpt-4.1 | Stable, balanced. |
| google/gemini-3-flash | Newest Flash multimodal. |
| xai/grok-4-1-fast-reasoning | Reasoning at fast tier. |
| deepseek-ai/deepseek-v3.2 | Cost-efficient reasoning. |
| moonshotai/kimi-k2-instruct | Long-context instruct. |

Top tier:
| Descriptor | Notes |
|---|---|
| openai/gpt-5.4 | Top-tier OpenAI. |
| openai/gpt-5.3-chat-latest | Top-tier chat-tuned. |
| google/gemini-3-pro | Long context, top tier. |
| google/gemini-2.5-pro | Stable Pro tier. |
| xai/grok-4.20-0309-reasoning | Top-tier xAI reasoning. |

Anthropic / Claude is intentionally absent — Lua’s inference layer does not carry Anthropic models for voice as of this writing. Use OpenAI, Google, xAI, DeepSeek, or Kimi for voice LLMs.
STT options
Deepgram (recommended)
Deepgram is the default STT provider: the worker’s STT routing falls back to deepgram/nova-3 when nothing else is configured.
| Descriptor | Notes |
|---|---|
| deepgram/nova-3 | Latest Nova series. Best accuracy + low latency. Recommended default. |
| deepgram/nova-2 | Previous generation. Still solid. |
| deepgram/nova-2-phonecall | Tuned for narrowband (8 kHz) phone audio. Use when call quality is poor or when you want extra robustness on PSTN. |

Use sttLanguage to pin the spoken language:
- BCP-47 code ('en', 'es', 'pt-BR', etc.): pins recognition to that language.
- 'multi': multilingual transcription.

Applies to both the Inference route and the direct Deepgram plugin.
ElevenLabs Scribe
ElevenLabs has an STT model called Scribe, available via the Inference route.
TTS options
ElevenLabs (recommended)
ElevenLabs is the canonical TTS provider. The descriptor format is elevenlabs/<model>:<voiceId>.
| Model | Latency | Languages | Best for |
|---|---|---|---|
| eleven_v3 | ~250ms | 70+ | Most expressive. Use when quality matters more than latency. |
| eleven_turbo_v2_5 | Low | Multilingual | Common default: balanced latency + quality. |
| eleven_flash_v2_5 | ~75ms | Multilingual | Ultra-low latency. Use for fast, interactive turns. |
| eleven_multilingual_v2 | ~200ms | 29 | Lifelike emotion across many languages. |
| eleven_flash_v2 | ~75ms | English only | Ultra-low latency, English-only. |

| Voice ID | Name | Accent | Style |
|---|---|---|---|
| pwMBn0SsmN1220Aorv15 | Matt | American | Male, Hyper-Conversational |
| ZTho75k1M56OV0k9XtSC | Spence | American | Male, Soft-Spoken |
| kdmDKE6EkgrWrrykO9Qt | Alexandra | American | Female, Conversational |
| h2sm0NbeIZXHBzJOMYcQ | Natasha | American | Female, Calm Narrative |
| lUTamkMw7gOzZbFIwmq4 | James | British | Male, Professional |
| 4BWwbsA70lmV7RMG0Acs | Blondie | British | Female, Relaxed Casual |
| lcMyyd2HUfFzxdCaC4Ta | Lucy | British | Female, Fresh Casual |
| 4CrZuIW9am7gYAxgo2Af | Shelley | British | Female, Clear Confident |
| 56bWURjYFHyYyVf490Dp | Emma | Australian | Female, Warm Conversational |
| aCChyB4P5WEomwRsOKRh | Salma | Arabic | Female, Conversational Expressive |
| 2zRM7PkgwBPiau2jvVXc | Monika | Indian | Female, Deep and Natural |
| ecp3DWciuUyW7BYM7II1 | Anika | Indian | Female, Sweet and Lively |
| pzxut4zZz4GImZNlqQ3H | Raju | Indian | Male, Natural Conversationalist |
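
To make the elevenlabs/<model>:<voiceId> format concrete, here is a minimal sketch that checks a TTS descriptor and pulls out its three parts. `parseTtsDescriptor` and `TtsDescriptorParts` are hypothetical names for this example only; the SDK may validate descriptors differently.

```typescript
// Hypothetical result shape for this sketch.
interface TtsDescriptorParts {
  provider: string;
  model: string;
  voiceId: string;
}

// Parse "provider/model:voiceId". Assumes none of the parts contains "/" or ":".
function parseTtsDescriptor(descriptor: string): TtsDescriptorParts {
  const match = /^([^/:]+)\/([^/:]+):([^/:]+)$/.exec(descriptor);
  if (!match) {
    throw new Error(`Expected "provider/model:voiceId", got "${descriptor}"`);
  }
  const [, provider, model, voiceId] = match;
  return { provider, model, voiceId };
}

// Flash model paired with the "Matt" voice id from the table above.
const parsedTts = parseTtsDescriptor("elevenlabs/eleven_flash_v2_5:pwMBn0SsmN1220Aorv15");
console.log(parsedTts.model, parsedTts.voiceId);
```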
Deepgram Aura
Deepgram offers TTS via the Aura family. The voice id is encoded inside the model id as aura-2-<name>-<lang>:
| ID | Name | Gender | Style |
|---|---|---|---|
| aura-2-thalia-en | Thalia | Female (American) | Conversational |
| aura-2-asteria-en | Asteria | Female (American) | Friendly |
| aura-2-luna-en | Luna | Female (American) | Warm |
| aura-2-stella-en | Stella | Female (American) | Professional |
| aura-2-athena-en | Athena | Female (British) | Authoritative |
| aura-2-hera-en | Hera | Female (American) | Calm Narrative |
| aura-2-orion-en | Orion | Male (American) | Confident |
| aura-2-arcas-en | Arcas | Male (American) | Conversational |
| aura-2-perseus-en | Perseus | Male (American) | Engaging |
| aura-2-angus-en | Angus | Male (Irish) | Storyteller |
| aura-2-helios-en | Helios | Male (British) | Professional |
| aura-2-zeus-en | Zeus | Male (American) | Deep Authoritative |

Spanish voices follow the same pattern: aura-2-celeste-es, aura-2-estrella-es.
Other TTS providers (via Inference)
Lua’s inference layer also exposes Cartesia, Inworld, Rime, and xAI TTS. The descriptors follow the same provider/model shape:
| Descriptor | Provider | Notes |
|---|---|---|
| cartesia/sonic-3 | Cartesia | Newest, expressive. |
| cartesia/sonic-turbo | Cartesia | Ultra-low latency. |
| inworld/inworld-tts-1.5-max | Inworld | High-quality multilingual. |
| rime/arcana | Rime | Multilingual, expressive. |
| xai/tts-1 | xAI | 21 languages. |

Configuration Reference
Required fields
Unique name for this voice. Used to address the voice in lua voice --voice <name> and as the server-side identifier. Allowed characters: a-zA-Z0-9_-, 1–64 chars.
The LLM that drives the conversation. String descriptor (e.g. 'openai/gpt-5.1-chat-latest') is the canonical form. See LLM options above for the catalog.
Speech-to-text engine. String descriptor (e.g. 'deepgram/nova-3') is canonical. Required for cascaded LLMs; omit only when using a realtime speech-to-speech model in the llm slot.
Text-to-speech engine. String descriptor with colon-separated voice id (e.g. 'elevenlabs/eleven_turbo_v2_5:<voiceId>'), or object form { model, voice }. Required for cascaded LLMs.
Optional fields
Human-readable description. Surfaced in the compiled manifest and admin listings.
Opening line spoken at session start. Empty string means no greeting. Generated through the LLM at session connect, so it can be dynamic if onEnter sets up context first.
BCP-47 language code (e.g. 'en', 'es', 'pt-BR') or 'multi' for multilingual transcription. Applies to both Inference STT and the Deepgram plugin.
How the agent decides when the user has finished speaking. 'vad' is the safest choice for most setups. 'multilingual' and 'english' use LiveKit’s turn-detector model; 'manual' defers to your own logic.
Voice activity detection engine. 'silero' is the only currently-supported value.
Silero VAD tuning. Useful when the default endpointing clips quiet callers or fires too eagerly mid-thought.
- minSpeechDuration (ms, 0–5000): speech required before a turn starts. Default: 50.
- minSilenceDuration (ms, 0–5000): silence required to end a turn. Default: 550.
- prefixPaddingDuration (ms, 0–2000): audio captured before detected speech start, forwarded into STT. Default: 500.
- activationThreshold (0–1): lower = more sensitive to speech onset.
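
The ranges listed for the VAD tuning values can be checked before deploying. The sketch below is one way to do that, assuming the four field names above; `VadTuningSketch` and `checkVadTuning` are hypothetical names, and the SDK may perform its own validation.

```typescript
// Hypothetical local type mirroring the four tuning fields described above.
interface VadTuningSketch {
  minSpeechDuration?: number; // ms, 0 to 5000
  minSilenceDuration?: number; // ms, 0 to 5000
  prefixPaddingDuration?: number; // ms, 0 to 2000
  activationThreshold?: number; // 0 to 1
}

// Allowed [min, max] per field, copied from the ranges listed above.
const VAD_RANGES: Record<keyof VadTuningSketch, [number, number]> = {
  minSpeechDuration: [0, 5000],
  minSilenceDuration: [0, 5000],
  prefixPaddingDuration: [0, 2000],
  activationThreshold: [0, 1],
};

// Return one message per out-of-range field; an empty array means all set fields are in range.
function checkVadTuning(tuning: VadTuningSketch): string[] {
  const problems: string[] = [];
  for (const key of Object.keys(VAD_RANGES) as (keyof VadTuningSketch)[]) {
    const value = tuning[key];
    if (value === undefined) continue; // unset fields keep their defaults
    const [min, max] = VAD_RANGES[key];
    if (!Number.isFinite(value) || value < min || value > max) {
      problems.push(`${key} must be between ${min} and ${max}, got ${value}`);
    }
  }
  return problems;
}

// Callers who pause mid-thought: require more silence than the 550 ms default before ending a turn.
const patientCaller: VadTuningSketch = { minSilenceDuration: 900, activationThreshold: 0.4 };
console.log(checkVadTuning(patientCaller));
```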
Krisp BVC background noise cancellation. Recommended for inbound phone calls — it removes background chatter, traffic, and other ambient noise. Billed separately, so opt-in.
Maximum sequential tool calls per turn (1–20). Higher values let the agent chain more tools before responding.
Seconds of silence before the agent considers the user “away” and ends the session. Useful for cleanly handling abandoned calls.
Generate the assistant’s response speculatively as the user is still speaking. Reduces perceived latency for predictable turns but can be wasted on highly interruptive callers.
How the agent handles being interrupted mid-response.
- enabled: whether interruption is allowed.
- mode: 'adaptive' (recommended) or 'vad'.
- falseInterruptionTimeout (seconds): how long to wait before treating a brief noise as a false interruption.
- resumeFalseInterruption (boolean): resume the cut-off response after a false interruption.
- minDelay / maxDelay (seconds): bounds on the interruption response window.
Word-boundary text replacements applied before TTS synthesis. Keys are matched case-insensitively as whole words. Use for acronyms and proper nouns the TTS mispronounces.
Only effective on the cascaded path: realtime models bypass the TTS step entirely.
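
To show what whole-word, case-insensitive replacement means in practice, here is a small standalone sketch. It is not the worker's actual implementation; `applyPronunciations` and `escapeRegExp` are hypothetical helpers, and the `\b` boundary used here only behaves as expected for keys that start and end with a letter, digit, or underscore.

```typescript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace each key with its spoken form, matching whole words case-insensitively.
// Replacements run one key at a time, so a later key could match text produced by an earlier one.
function applyPronunciations(text: string, replacements: Record<string, string>): string {
  let result = text;
  for (const [word, spoken] of Object.entries(replacements)) {
    const pattern = new RegExp(`\\b${escapeRegExp(word)}\\b`, "gi");
    // A replacer function keeps "$" in the spoken form from being treated as a pattern.
    result = result.replace(pattern, () => spoken);
  }
  return result;
}

const spokenText = applyPronunciations("Ask about our SQL API or apis.", {
  sql: "sequel",
  API: "A P I",
});
console.log(spokenText);
```

Note that "apis" is left alone because the key only matches as a whole word.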
Background audio layered onto the agent’s output. Pass a built-in clip name, a { source, volume, probability } config, or an array (probabilistic mix).
Built-in clips: 'office-ambience', 'keyboard-typing', 'keyboard-typing-2'.
Output speech volume, 0–100. Applied as a per-frame multiplier. Omit to pass the TTS provider’s native level through unchanged.
When true, the worker writes session.history to Data.set('call:<sessionId>') after the call ends. Read it back from a job or webhook with Data.get('call:<sessionId>') for post-call analytics, follow-ups, or QA.
Voice-specific tools in addition to skills attached to the owning agent. See Defining Voice Tools.
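
For the persisted transcript, a post-call job mostly needs the call:<sessionId> key and some shape for the stored history. The sketch below uses an in-memory Map as a stand-in for the Data store so it runs on its own; the `TranscriptTurn` shape, `transcriptKey`, and `countUserTurns` are assumptions for illustration, since the real session.history format is not described here.

```typescript
// Hypothetical turn shape; the real session.history format may differ.
interface TranscriptTurn {
  role: "user" | "assistant";
  text: string;
}

// In-memory stand-in for the platform Data store, so the sketch runs without the SDK.
const fakeData = new Map<string, TranscriptTurn[]>();

// Build the key format quoted above: call:<sessionId>.
function transcriptKey(sessionId: string): string {
  return `call:${sessionId}`;
}

// Post-call step: read the stored history back and count the caller's turns.
function countUserTurns(sessionId: string): number {
  const history = fakeData.get(transcriptKey(sessionId)) ?? [];
  return history.filter((turn) => turn.role === "user").length;
}

fakeData.set(transcriptKey("abc123"), [
  { role: "assistant", text: "Hi, how can I help?" },
  { role: "user", text: "I need to reschedule." },
  { role: "user", text: "Tomorrow works." },
]);
console.log(countUserTurns("abc123"));
```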
Lifecycle Hooks
Three hooks let you wire up per-session state, RAG injection, and post-call work.
Fires after the session connects to the room and before the greeting. Use it to hydrate session.userdata from User, Data, etc., or to set up any per-call state.
Fires after the user finishes a turn, before the LLM is invoked. This is the canonical RAG-injection point: turnCtx.addMessage(...) adds context messages the LLM sees on this turn.
Fires when the session is closing. Use for transcript persistence, outcome reporting, CRM updates, etc.
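
The RAG-injection step can be pictured as a function that looks at the user's latest utterance and calls turnCtx.addMessage(...) with whatever context it finds. The sketch below runs against a fake context object; the `{ role, content }` message shape, the `TurnCtxSketch` type, and the toy knowledge base are all assumptions for illustration, so check the SDK types for the real signature.

```typescript
// Hypothetical message and context shapes; the real turnCtx type comes from the SDK.
interface ContextMessage {
  role: "system" | "assistant" | "user";
  content: string;
}
interface TurnCtxSketch {
  addMessage(message: ContextMessage): void;
}

// Toy knowledge base standing in for a real retrieval step (made-up facts).
const KNOWLEDGE: Record<string, string> = {
  refund: "Refunds are processed within 5 business days.",
  hours: "Support is open 9am to 5pm on weekdays.",
};

// Add one context message per keyword found in the user's text; returns how many were added.
function injectContext(turnCtx: TurnCtxSketch, userText: string): number {
  const lower = userText.toLowerCase();
  let added = 0;
  for (const [keyword, fact] of Object.entries(KNOWLEDGE)) {
    if (lower.includes(keyword)) {
      turnCtx.addMessage({ role: "system", content: `Relevant context: ${fact}` });
      added += 1;
    }
  }
  return added;
}

// Fake context that records messages, so the sketch runs without a live session.
const recorded: ContextMessage[] = [];
const fakeCtx: TurnCtxSketch = {
  addMessage: (message) => {
    recorded.push(message);
  },
};
injectContext(fakeCtx, "What are your hours?");
console.log(recorded);
```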
Defining Voice Tools
Voice tools run during a voice conversation. LuaVoiceTool is a concrete class: instantiate it with a config object.
Config fields
Tool name. Used by the LLM to identify and call the tool.
What the tool does. Action-oriented description the LLM reads when deciding to invoke.
Zod schema for the tool’s input. Validated before execute is called.
Tool body. Receives the validated input and an optional voice-specific context.
Optional gate. When provided, the tool is only exposed to the LLM if condition() returns true. Use for feature flags or runtime availability checks.
Voice-specific tool flags (e.g. controlling barge-in behavior).
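
The effect of the optional condition gate can be sketched as a filter over the tool list. This is an illustration of the behavior described above, not the SDK's code: `GatedToolSketch` and `exposedTools` are hypothetical, and the sketch assumes condition() is synchronous.

```typescript
// Hypothetical minimal tool shape: only the fields needed to show the gate.
interface GatedToolSketch {
  name: string;
  description: string;
  condition?: () => boolean; // assumed synchronous for this sketch
}

// Keep tools that have no gate, or whose gate currently returns true.
function exposedTools(tools: GatedToolSketch[]): GatedToolSketch[] {
  return tools.filter((tool) => tool.condition === undefined || tool.condition() === true);
}

// A feature flag that can change at runtime.
const featureFlags = { refundsEnabled: false };

const candidateTools: GatedToolSketch[] = [
  { name: "lookup_order", description: "Look up an order by id." },
  {
    name: "issue_refund",
    description: "Issue a refund for an order.",
    condition: () => featureFlags.refundsEnabled,
  },
];

console.log(exposedTools(candidateTools).map((tool) => tool.name));
```

Because the gate is a function, flipping the flag changes what the LLM sees the next time the list is evaluated.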
ctx — LuaVoiceToolCtx
Identifier for this specific tool invocation.
Speak text to the caller via the active LiveKit session. Useful for status updates during long-running tool work (“Looking that up, one moment.”).
Transfer the live caller to a human at msisdn. Two mechanisms:
- mode: 'refer' (default). SIP REFER on the inbound leg. Cheap (one billed leg) but depends on the inbound carrier accepting REFER end-to-end.
- mode: 'bridge'. Dial the human as a second SIP participant into the same room. Two billed legs but works regardless of carrier REFER support. Use for high-stakes transfers.
announce is spoken before the transfer fires.
LuaTool instances can be shared between chat skills and voice tools: just pass them in the same tools array. The tools field accepts both LuaTool and LuaVoiceTool instances.
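
The trade-off between the two transfer modes described above can be written down as a small decision helper. This is only a sketch of the reasoning: `chooseTransferMode`, `TransferSituation`, and `BILLED_LEGS` are hypothetical names, and whether a carrier supports REFER is something you would have to know from your own telephony setup.

```typescript
type TransferMode = "refer" | "bridge";

// Hypothetical inputs for the decision.
interface TransferSituation {
  carrierSupportsRefer: boolean; // does the inbound carrier accept SIP REFER end-to-end?
  highStakes: boolean; // is a failed transfer unacceptable for this call?
}

// Billed legs per mode, as described above.
const BILLED_LEGS: Record<TransferMode, number> = { refer: 1, bridge: 2 };

// Prefer the cheaper REFER path unless it may fail or the call is too important to risk.
function chooseTransferMode(situation: TransferSituation): TransferMode {
  if (situation.highStakes || !situation.carrierSupportsRefer) {
    return "bridge";
  }
  return "refer";
}

const chosenMode = chooseTransferMode({ carrierSupportsRefer: true, highStakes: false });
console.log(chosenMode, BILLED_LEGS[chosenMode]);
```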
Function-style: defineVoice
defineVoice(config) is equivalent to new LuaVoice(config) if you prefer a function call.
Wiring Up to an Agent
The persona.voice branch is what gives supportLine its voice-specific prompt.
Related
- Voice Command — live testing and voice test suites
- Voice Plugins — direct plugin route (Deepgram, ElevenLabs class forms)
- Persona Command — voice-specific personas on the parent agent
- LuaAgent API

