Overview

LuaVoice is the class-based primitive you define in code to declare a voice-enabled agent — its speech-to-text engine, text-to-speech engine, LLM, turn detection, and any voice-specific tools.
import { LuaVoice } from 'lua-cli';

export default new LuaVoice({
  name: 'support-line',
  llm: 'openai/gpt-5.1-chat-latest',
  stt: 'deepgram/nova-3',
  tts: 'elevenlabs/eleven_turbo_v2_5:pwMBn0SsmN1220Aorv15',
  greeting: 'Hi, this is your support line. How can I help?',
});
For testing voice agents live or running automated voice tests, see the Voice Command. For the direct plugin route (when string descriptors aren’t enough), see Voice Plugins.
Persona is configured on the parent LuaAgent, not on LuaVoice. Use the channel-aware persona shape { base, voice, text } on the agent to give a voice its own prompt — see Channel-Aware Personas.

llm, stt, and tts all accept a provider-prefixed string descriptor. This is the canonical form — it routes through Lua’s inference layer so you don’t manage provider credentials yourself.
new LuaVoice({
  name: 'support-line',
  llm: 'openai/gpt-5.1-chat-latest',                  // LLM
  stt: 'deepgram/nova-3',                              // STT
  tts: 'elevenlabs/eleven_turbo_v2_5:<voiceId>',      // TTS — colon-separated voiceId
});
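To make the descriptor shape concrete, here is a minimal sketch of how a provider/model[:voiceId] string could be split into its parts. parseDescriptor and ParsedDescriptor are hypothetical names used for illustration only; they are not part of lua-cli, and the parsing inside Lua’s inference layer may differ.

```typescript
// Hypothetical helper (not part of lua-cli): splits a descriptor such as
// 'elevenlabs/eleven_turbo_v2_5:<voiceId>' into provider, model, and voice.
interface ParsedDescriptor {
  provider: string;
  model: string;
  voice?: string; // only set when a ':<voiceId>' suffix is present
}

function parseDescriptor(descriptor: string): ParsedDescriptor {
  const slash = descriptor.indexOf('/');
  if (slash <= 0 || slash === descriptor.length - 1) {
    throw new Error(`Expected "provider/model", got "${descriptor}"`);
  }
  const provider = descriptor.slice(0, slash);
  const rest = descriptor.slice(slash + 1);

  // Everything after the first colon is treated as the voice id.
  const colon = rest.indexOf(':');
  if (colon === -1) {
    return { provider, model: rest };
  }
  const model = rest.slice(0, colon);
  const voice = rest.slice(colon + 1);
  if (!model || !voice) {
    throw new Error(`Expected "provider/model:voiceId", got "${descriptor}"`);
  }
  return { provider, model, voice };
}
```

Under these assumptions, an LLM or STT descriptor yields only provider and model, while an ElevenLabs TTS descriptor also yields voice.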

LLM options

Provider-prefixed model id. Descriptors are grouped by tier; pick a tier based on the latency/cost/quality trade-off you need.

Fast tier (lowest latency, lowest cost):

| Descriptor | Notes |
| --- | --- |
| openai/gpt-5-mini | Fast & cheap OpenAI. |
| openai/gpt-5-nano | Cheapest OpenAI tier. |
| openai/gpt-4.1-mini | Stable, fast. |
| google/gemini-2.5-flash-lite | Fastest Gemini. |
| google/gemini-2.5-flash | Fast multimodal. |
| xai/grok-4-1-fast-non-reasoning | Fast xAI tier. |

Balanced tier (good default for most voice agents):

| Descriptor | Notes |
| --- | --- |
| openai/gpt-5 | Balanced quality and speed. |
| openai/gpt-5.1-chat-latest | Balanced, chat-tuned. Common default. |
| openai/gpt-4.1 | Stable, balanced. |
| google/gemini-3-flash | Newest Flash multimodal. |
| xai/grok-4-1-fast-reasoning | Reasoning at fast tier. |
| deepseek-ai/deepseek-v3.2 | Cost-efficient reasoning. |
| moonshotai/kimi-k2-instruct | Long-context instruct. |

Quality tier (best capability, higher latency/cost):

| Descriptor | Notes |
| --- | --- |
| openai/gpt-5.4 | Top-tier OpenAI. |
| openai/gpt-5.3-chat-latest | Top-tier chat-tuned. |
| google/gemini-3-pro | Long context, top tier. |
| google/gemini-2.5-pro | Stable Pro tier. |
| xai/grok-4.20-0309-reasoning | Top-tier xAI reasoning. |

Anthropic / Claude is intentionally absent — Lua’s inference layer does not carry Anthropic models for voice as of this writing. Use OpenAI, Google, xAI, DeepSeek, or Kimi for voice LLMs.

STT options

Deepgram is the default STT provider — the worker’s STT routing falls back to deepgram/nova-3 when nothing else is configured.
new LuaVoice({
  // ...
  stt: 'deepgram/nova-3',
  sttLanguage: 'en',        // BCP-47 code, or 'multi' for multilingual
});

| Descriptor | Notes |
| --- | --- |
| deepgram/nova-3 | Latest Nova series. Best accuracy + low latency. Recommended default. |
| deepgram/nova-2 | Previous generation. Still solid. |
| deepgram/nova-2-phonecall | Tuned for narrowband (8 kHz) phone audio. Use when call quality is poor or when you want extra robustness on PSTN. |

Combine with sttLanguage to pin the spoken language:
  • BCP-47 code ('en', 'es', 'pt-BR', etc.) — pins recognition to that language.
  • 'multi' — multilingual transcription. Applies to both the Inference route and the direct Deepgram plugin.
Want non-default Deepgram options (smart formatting, filler-word filtering, custom keywords)? Use the plugin class form: stt: new deepgram.STT({ model: 'nova-3', smartFormat: true }). See Voice Plugins for the full plugin route.

ElevenLabs Scribe

ElevenLabs has an STT model called Scribe, available via the Inference route:
stt: 'elevenlabs/scribe_v2_realtime'
Useful when you want STT and TTS from the same provider, or when Scribe’s behavior on a specific language outperforms Deepgram in your testing.

TTS options

ElevenLabs is the canonical TTS provider. The descriptor format is elevenlabs/<model>:<voiceId>.
new LuaVoice({
  // ...
  tts: 'elevenlabs/eleven_turbo_v2_5:pwMBn0SsmN1220Aorv15',
});
Models:

| Model | Latency | Languages | Best for |
| --- | --- | --- | --- |
| eleven_v3 | ~250ms | 70+ | Most expressive. Use when quality matters more than latency. |
| eleven_turbo_v2_5 | Low | Multilingual | Common default: balanced latency + quality. |
| eleven_flash_v2_5 | ~75ms | Multilingual | Ultra-low latency. Use for fast, interactive turns. |
| eleven_multilingual_v2 | ~200ms | 29 | Lifelike emotion across many languages. |
| eleven_flash_v2 | ~75ms | English only | Ultra-low latency, English-only. |

Curated voice IDs: Lua maintains a curated list with metadata (gender, accent, style) the raw ElevenLabs API doesn’t expose:

| Voice ID | Name | Accent | Style |
| --- | --- | --- | --- |
| pwMBn0SsmN1220Aorv15 | Matt | American | Male, Hyper-Conversational |
| ZTho75k1M56OV0k9XtSC | Spence | American | Male, Soft-Spoken |
| kdmDKE6EkgrWrrykO9Qt | Alexandra | American | Female, Conversational |
| h2sm0NbeIZXHBzJOMYcQ | Natasha | American | Female, Calm Narrative |
| lUTamkMw7gOzZbFIwmq4 | James | British | Male, Professional |
| 4BWwbsA70lmV7RMG0Acs | Blondie | British | Female, Relaxed Casual |
| lcMyyd2HUfFzxdCaC4Ta | Lucy | British | Female, Fresh Casual |
| 4CrZuIW9am7gYAxgo2Af | Shelley | British | Female, Clear Confident |
| 56bWURjYFHyYyVf490Dp | Emma | Australian | Female, Warm Conversational |
| aCChyB4P5WEomwRsOKRh | Salma | Arabic | Female, Conversational Expressive |
| 2zRM7PkgwBPiau2jvVXc | Monika | Indian | Female, Deep and Natural |
| ecp3DWciuUyW7BYM7II1 | Anika | Indian | Female, Sweet and Lively |
| pzxut4zZz4GImZNlqQ3H | Raju | Indian | Male, Natural Conversationalist |

You can also use any ElevenLabs voice ID from your own ElevenLabs account; these are just the curated defaults.

Alternative: object form

If you’d rather not concatenate model and voice with a colon, the object form works too:
tts: { model: 'elevenlabs/eleven_turbo_v2_5', voice: 'pwMBn0SsmN1220Aorv15' }

Deepgram Aura

Deepgram offers TTS via the Aura family. The voice id is encoded inside the model id as aura-2-<name>-<lang>:
tts: 'deepgram/aura-2-thalia-en'
Common Aura 2 voices (English):

| ID | Name | Gender | Style |
| --- | --- | --- | --- |
| aura-2-thalia-en | Thalia | Female (American) | Conversational |
| aura-2-asteria-en | Asteria | Female (American) | Friendly |
| aura-2-luna-en | Luna | Female (American) | Warm |
| aura-2-stella-en | Stella | Female (American) | Professional |
| aura-2-athena-en | Athena | Female (British) | Authoritative |
| aura-2-hera-en | Hera | Female (American) | Calm Narrative |
| aura-2-orion-en | Orion | Male (American) | Confident |
| aura-2-arcas-en | Arcas | Male (American) | Conversational |
| aura-2-perseus-en | Perseus | Male (American) | Engaging |
| aura-2-angus-en | Angus | Male (Irish) | Storyteller |
| aura-2-helios-en | Helios | Male (British) | Professional |
| aura-2-zeus-en | Zeus | Male (American) | Deep Authoritative |

Spanish voices are also available: aura-2-celeste-es, aura-2-estrella-es.
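Because the Aura voice and language live inside the model id, it can help to decode them when logging or building a voice picker. The sketch below is illustrative only: parseAuraVoice is a hypothetical helper, and it assumes voice names are lowercase letters and the language suffix is a two-letter code, which holds for every id listed above but is not confirmed as a general rule.

```typescript
// Hypothetical helper (not part of lua-cli): decodes 'aura-2-<name>-<lang>'.
// Assumption: <name> is lowercase letters and <lang> is a two-letter code.
interface AuraVoice {
  name: string;
  lang: string;
}

function parseAuraVoice(id: string): AuraVoice | null {
  // Accept either the bare model id or the full 'deepgram/...' descriptor.
  const match = /^(?:deepgram\/)?aura-2-([a-z]+)-([a-z]{2})$/.exec(id);
  if (!match) return null;
  return { name: match[1], lang: match[2] };
}
```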

Other TTS providers (via Inference)

Lua’s inference layer also exposes Cartesia, Inworld, Rime, and xAI TTS. The descriptors follow the same provider/model shape:

| Descriptor | Provider | Notes |
| --- | --- | --- |
| cartesia/sonic-3 | Cartesia | Newest, expressive. |
| cartesia/sonic-turbo | Cartesia | Ultra-low latency. |
| inworld/inworld-tts-1.5-max | Inworld | High-quality multilingual. |
| rime/arcana | Rime | Multilingual, expressive. |
| xai/tts-1 | xAI | 21 languages. |

Configuration Reference

new LuaVoice({
  // Required
  name: 'support-line',
  llm: 'openai/gpt-5.1-chat-latest',
  stt: 'deepgram/nova-3',
  tts: 'elevenlabs/eleven_turbo_v2_5:pwMBn0SsmN1220Aorv15',

  // Recommended
  description: 'Inbound phone voice for the support assistant',
  greeting: "Hi, this is your support line. How can I help?",
  sttLanguage: 'en',
  turnDetection: 'vad',
  krispEnabled: true,

  // Optional tuning
  maxToolSteps: 6,
  userAwayTimeout: 20,
  preemptiveGeneration: true,
  interruption: { mode: 'adaptive', falseInterruptionTimeout: 2.0 },

  // Optional polish
  pronunciations: { 'HVAC': 'H V A C', 'CFM': 'C F M' },
  persistTranscript: true,
  backgroundAudio: { ambient: 'office-ambience', thinking: 'keyboard-typing' },

  // Tools + lifecycle hooks
  tools: [/* ... */],
  onEnter: async (ctx) => {/* ... */},
  onUserTurnCompleted: async (turnCtx, message) => {/* ... */},
  onExit: async (ctx) => {/* ... */},
});

Required fields

name
string
required
Unique name for this voice. Used to address the voice in lua voice --voice <name> and as the server-side identifier. Allowed characters: a-zA-Z0-9_-, 1–64 chars.
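The naming rule above can be checked locally before you compile or push. This is a minimal sketch derived only from the stated constraint (a-zA-Z0-9_-, 1 to 64 characters); isValidVoiceName is a hypothetical helper, and the server may apply additional checks such as uniqueness.

```typescript
// Hypothetical local check for the documented name constraint.
const VOICE_NAME_PATTERN = /^[a-zA-Z0-9_-]{1,64}$/;

function isValidVoiceName(name: string): boolean {
  return VOICE_NAME_PATTERN.test(name);
}
```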
llm
string | LLMConfig
required
The LLM that drives the conversation. String descriptor (e.g. 'openai/gpt-5.1-chat-latest') is the canonical form. See LLM options above for the catalog.
stt
string | STTConfig
required
Speech-to-text engine. String descriptor (e.g. 'deepgram/nova-3') is canonical. Required for cascaded LLMs; omit only when using a realtime speech-to-speech model in the llm slot.
tts
string | TTSConfig
required
Text-to-speech engine. String descriptor with colon-separated voice id (e.g. 'elevenlabs/eleven_turbo_v2_5:<voiceId>'), or object form { model, voice }. Required for cascaded LLMs.

Optional fields

description
string
Human-readable description. Surfaced in the compiled manifest and admin listings.
greeting
string
Opening line spoken at session start. Empty string means no greeting. Generated through the LLM at session connect, so it can be dynamic if onEnter sets up context first.
sttLanguage
string
BCP-47 language code (e.g. 'en', 'es', 'pt-BR') or 'multi' for multilingual transcription. Applies to both Inference STT and the Deepgram plugin.
turnDetection
'multilingual' | 'english' | 'vad' | 'stt' | 'manual'
How the agent decides when the user has finished speaking. 'vad' is the safest choice for most setups. 'multilingual' and 'english' use LiveKit’s turn-detector model; 'manual' defers to your own logic.
vad
string
default:"silero"
Voice activity detection engine. 'silero' is the only currently-supported value.
vadOptions
object
Silero VAD tuning. Useful when the default endpointing clips quiet callers or fires too eagerly mid-thought.
  • minSpeechDuration (ms, 0–5000) — speech required before a turn starts. Default: 50.
  • minSilenceDuration (ms, 0–5000) — silence required to end a turn. Default: 550.
  • prefixPaddingDuration (ms, 0–2000) — audio captured before detected speech start, forwarded into STT. Default: 500.
  • activationThreshold (0–1) — lower = more sensitive to speech onset.
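As a rough illustration of the ranges and defaults listed above, the sketch below merges partial overrides onto the documented defaults and rejects out-of-range values. resolveVadOptions and the VadOptions interface are hypothetical and written only from the bullet list; no default is assumed for activationThreshold because the text does not state one.

```typescript
// Hypothetical shape, reconstructed from the documented bullets.
interface VadOptions {
  minSpeechDuration?: number;     // ms, 0 to 5000
  minSilenceDuration?: number;    // ms, 0 to 5000
  prefixPaddingDuration?: number; // ms, 0 to 2000
  activationThreshold?: number;   // 0 to 1
}

const VAD_RANGES: Record<keyof VadOptions, [number, number]> = {
  minSpeechDuration: [0, 5000],
  minSilenceDuration: [0, 5000],
  prefixPaddingDuration: [0, 2000],
  activationThreshold: [0, 1],
};

// Documented defaults; activationThreshold is left unset on purpose.
const VAD_DEFAULTS: VadOptions = {
  minSpeechDuration: 50,
  minSilenceDuration: 550,
  prefixPaddingDuration: 500,
};

function resolveVadOptions(options: VadOptions = {}): VadOptions {
  const merged: VadOptions = { ...VAD_DEFAULTS, ...options };
  for (const key of Object.keys(VAD_RANGES) as Array<keyof VadOptions>) {
    const value = merged[key];
    if (value === undefined) continue;
    const [min, max] = VAD_RANGES[key];
    if (!Number.isFinite(value) || value < min || value > max) {
      throw new RangeError(`${key} must be between ${min} and ${max}, got ${value}`);
    }
  }
  return merged;
}
```

For callers who pause mid-thought, a call such as resolveVadOptions({ minSilenceDuration: 800 }) keeps the other defaults and lengthens only the end-of-turn silence window.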
krispEnabled
boolean
default:"false"
Krisp BVC background noise cancellation. Recommended for inbound phone calls — it removes background chatter, traffic, and other ambient noise. Billed separately, so opt-in.
maxToolSteps
number
Maximum sequential tool calls per turn (1–20). Higher values let the agent chain more tools before responding.
userAwayTimeout
number
Seconds of silence before the agent considers the user “away” and ends the session. Useful for cleanly handling abandoned calls.
preemptiveGeneration
boolean
Generate the assistant’s response speculatively while the user is still speaking. This reduces perceived latency for predictable turns, but the speculative work can be wasted on highly interruptive callers.
interruption
InterruptionOptions
How the agent handles being interrupted mid-response.
  • enabled — whether interruption is allowed.
  • mode: 'adaptive' (recommended) or 'vad'.
  • falseInterruptionTimeout (seconds) — how long to wait before treating a brief noise as a false interruption.
  • resumeFalseInterruption (boolean) — resume the cut-off response after a false interruption.
  • minDelay / maxDelay (seconds) — bounds on the interruption response window.
pronunciations
Record<string, string>
Word-boundary text replacements applied before TTS synthesis. Keys are matched case-insensitively as whole words. Use for acronyms and proper nouns the TTS mispronounces.
pronunciations: { 'HVAC': 'H V A C', 'kubectl': 'kube control' }
Only effective on the cascaded path — realtime models bypass the TTS step entirely.
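To preview what a pronunciation map will do to a reply, the matching rule described above (whole words, case-insensitive) can be approximated with a regular expression. This is a sketch of the documented behavior, not the worker’s actual implementation; applyPronunciations and escapeRegExp are hypothetical helpers.

```typescript
// Escape characters that have special meaning inside a RegExp.
function escapeRegExp(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical preview of whole-word, case-insensitive replacement.
// Note: \b only recognizes ASCII word characters, so keys that start or
// end with punctuation or non-ASCII letters would need different handling.
function applyPronunciations(text: string, map: Record<string, string>): string {
  let out = text;
  for (const [word, spoken] of Object.entries(map)) {
    const pattern = new RegExp(`\\b${escapeRegExp(word)}\\b`, 'gi');
    // A function replacement keeps '$' in the spoken form literal.
    out = out.replace(pattern, () => spoken);
  }
  return out;
}
```

With the map { HVAC: 'H V A C', kubectl: 'kube control' }, the word "hvac" is rewritten regardless of case, while "HVACs" is left alone because it is not a whole-word match.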
backgroundAudio
{ ambient?, thinking? }
Background audio layered onto the agent’s output. Pass a built-in clip name, a { source, volume, probability } config, or an array (probabilistic mix). Built-in clips: 'office-ambience', 'keyboard-typing', 'keyboard-typing-2'.
backgroundAudio: {
  ambient: 'office-ambience',
  thinking: 'keyboard-typing',
}
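The three accepted shapes (clip name, config object, array) can be modeled as a union type. The sketch below is an assumption-heavy illustration: the type names and normalizeAudioSource are hypothetical, treating volume and probability as optional is a guess, and whether arrays may mix clip names with config objects is not stated in the text.

```typescript
// Hypothetical types reconstructed from the field description above.
type BuiltInClip = 'office-ambience' | 'keyboard-typing' | 'keyboard-typing-2';

interface ClipConfig {
  source: string;
  volume?: number;      // assumption: optional; scale not documented here
  probability?: number; // assumption: optional; used for the probabilistic mix
}

type AudioSource = BuiltInClip | ClipConfig | Array<BuiltInClip | ClipConfig>;

// Normalizes any accepted shape into a uniform list of config objects.
function normalizeAudioSource(source: AudioSource): ClipConfig[] {
  const items = Array.isArray(source) ? source : [source];
  return items.map((item) => (typeof item === 'string' ? { source: item } : item));
}
```

Normalizing first means downstream code only has to handle one shape, whichever form the author chose in the voice config.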
volume
number
Output speech volume, 0–100. Applied as a per-frame multiplier. Omit to pass the TTS provider’s native level through unchanged.
persistTranscript
boolean
default:"false"
When true, the worker writes session.history to Data.set('call:<sessionId>') after the call ends. Read it back from a job or webhook with Data.get('call:<sessionId>') for post-call analytics, follow-ups, or QA.
tools
Array<LuaTool | LuaVoiceTool>
Voice-specific tools in addition to skills attached to the owning agent. See Defining Voice Tools.

Lifecycle Hooks

Three hooks let you wire up per-session state, RAG injection, and post-call work.
onEnter
(ctx: LuaVoiceHookContext) => Promise<void>
Fires after the session connects to the room and before the greeting. Use it to hydrate session.userdata from User, Data, etc., or to set up any per-call state.
onEnter: async (ctx) => {
  if (ctx.caller?.phoneNumber) {
    const user = await User.get({ phone: ctx.caller.phoneNumber });
    ctx.session.userdata = { user, returning: !!user };
  }
},
onUserTurnCompleted
(turnCtx, message) => Promise<void>
Fires after the user finishes a turn, before the LLM is invoked. This is the canonical RAG-injection point — turnCtx.addMessage(...) adds context messages the LLM sees on this turn.
onUserTurnCompleted: async (turnCtx, message) => {
  const docs = await Data.search('kb', message.content, 3);
  for (const doc of docs) {
    turnCtx.addMessage({ role: 'system', content: doc.text });
  }
},
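Because this hook sits on the latency-critical path between the user’s last word and the LLM call, it is worth keeping the injected context small and failure-tolerant. The sketch below wraps the same addMessage pattern in a reusable factory. Everything named here (makeRagHook, SearchFn, TurnCtxLike, the maxChars budget, and the choice to skip injection when the search fails) is a hypothetical illustration, not lua-cli API; the interfaces only mirror the parts of turnCtx and message used in the example above.

```typescript
// Minimal stand-ins for the parts of the hook arguments used here.
interface ContextMessage { role: 'system'; content: string }
interface TurnCtxLike { addMessage(message: ContextMessage): void }
interface UserMessage { content: string }
interface Doc { text: string }

// Hypothetical search signature; the caller supplies the actual lookup.
type SearchFn = (query: string, limit: number) => Promise<Doc[]>;

function makeRagHook(search: SearchFn, limit = 3, maxChars = 1200) {
  return async (turnCtx: TurnCtxLike, message: UserMessage): Promise<void> => {
    const query = message.content.trim();
    if (!query) return; // nothing to look up

    let docs: Doc[] = [];
    try {
      docs = await search(query, limit);
    } catch {
      return; // design choice: answer without context rather than fail the turn
    }

    // Stop adding documents once the character budget would be exceeded.
    let used = 0;
    for (const doc of docs) {
      if (used + doc.text.length > maxChars) break;
      turnCtx.addMessage({ role: 'system', content: doc.text });
      used += doc.text.length;
    }
  };
}
```

Assuming Data.search behaves as in the example above, the hook could be wired as onUserTurnCompleted: makeRagHook((query, limit) => Data.search('kb', query, limit)).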
onExit
(ctx: LuaVoiceHookContext) => Promise<void>
Fires when the session is closing. Use for transcript persistence, outcome reporting, CRM updates, etc.

Defining Voice Tools

Voice tools run during a voice conversation. LuaVoiceTool is a concrete class — instantiate it with a config object:
import { LuaVoiceTool, Data } from 'lua-cli'; // Data is used inside execute below
import { z } from 'zod';

export const getOrderStatusTool = new LuaVoiceTool({
  name: 'getOrderStatus',
  description: 'Look up the status of an order by ID',
  inputSchema: z.object({ orderId: z.string() }),
  execute: async (input, ctx) => {
    const order = await Data.get('orders', input.orderId);
    return { status: order.status, eta: order.eta };
  },
});

Config fields

name
string
required
Tool name. Used by the LLM to identify and call the tool.
description
string
required
What the tool does. An action-oriented description that the LLM reads when deciding whether to invoke the tool.
inputSchema
ZodType
required
Zod schema for the tool’s input. Validated before execute is called.
execute
(input, ctx?: LuaVoiceToolCtx) => Promise<any>
required
Tool body. Receives the validated input and an optional voice-specific context.
condition
() => Promise<boolean>
Optional gate. When provided, the tool is only exposed to the LLM if condition() returns true. Use for feature flags or runtime availability checks.
flags
ToolFlag[]
Voice-specific tool flags (e.g. controlling barge-in behavior).
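The condition gate described above can be pictured as a filter that runs before the tool list reaches the LLM. The following is a conceptual sketch, not the runtime’s implementation: GatedTool and exposedTools are hypothetical names, and treating a condition that throws as "not available" is an assumption.

```typescript
// Minimal stand-in for the fields of a tool that matter to gating.
interface GatedTool {
  name: string;
  condition?: () => Promise<boolean>;
}

// Returns only the tools whose condition is absent or resolves to true.
async function exposedTools<T extends GatedTool>(tools: T[]): Promise<T[]> {
  const checks = await Promise.all(
    tools.map(async (tool) => {
      if (!tool.condition) return true; // ungated tools are always exposed
      try {
        return await tool.condition();
      } catch {
        return false; // assumption: a failing check hides the tool
      }
    }),
  );
  return tools.filter((_, index) => checks[index]);
}
```

Evaluating the conditions in parallel keeps the gate from adding one round trip per tool, which matters when each check calls a feature-flag service.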

ctx — LuaVoiceToolCtx

ctx.toolCallId
string
Identifier for this specific tool invocation.
ctx.voice.say
(text: string) => Promise<void>
Speak text to the caller via the active LiveKit session. Useful for status updates during long-running tool work (“Looking that up — one moment.”).
ctx.voice.transferToHuman
(msisdn, opts?) => Promise<void>
Transfer the live caller to a human at msisdn. Two mechanisms:
  • mode: 'refer' (default) — SIP REFER on the inbound leg. Cheap (one billed leg) but depends on the inbound carrier accepting REFER end-to-end.
  • mode: 'bridge' — dial the human as a second SIP participant into the same room. Two billed legs but works regardless of carrier REFER support. Use for high-stakes transfers.
announce is spoken before the transfer fires.
await ctx.voice?.transferToHuman('+32477123456', {
  mode: 'bridge',
  announce: 'Transferring you to our sales team — one moment.',
});
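One way to combine the two mechanisms is to try the cheaper REFER first and fall back to a bridge. The sketch below rests on an unverified assumption: that transferToHuman rejects its promise when the carrier refuses REFER. Confirm this against your own carrier before relying on it. transferWithFallback, VoiceLike, and TransferOptions are hypothetical names that mirror only the call shown above.

```typescript
type TransferMode = 'refer' | 'bridge';

interface TransferOptions {
  mode?: TransferMode;
  announce?: string;
}

// Minimal stand-in for the part of ctx.voice used here.
interface VoiceLike {
  transferToHuman(msisdn: string, opts?: TransferOptions): Promise<void>;
}

// Tries SIP REFER first; on rejection, retries as a bridged second leg.
// Assumption: a refused REFER surfaces as a rejected promise.
async function transferWithFallback(
  voice: VoiceLike,
  msisdn: string,
  announce: string,
): Promise<TransferMode> {
  try {
    await voice.transferToHuman(msisdn, { mode: 'refer', announce });
    return 'refer';
  } catch {
    // The announcement is omitted on retry so the caller does not hear it twice.
    await voice.transferToHuman(msisdn, { mode: 'bridge' });
    return 'bridge';
  }
}
```

The returned mode can be logged to track how often the fallback fires, which shows whether REFER is worth attempting on a given carrier.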
You can also share regular LuaTool instances between chat skills and voice tools — just pass them in the same tools array. The tools field accepts both LuaTool and LuaVoiceTool instances.

Function-style: defineVoice

Equivalent to new LuaVoice(config) if you prefer a function call:
import { defineVoice } from 'lua-cli';

export default defineVoice({
  name: 'support-line',
  llm: 'openai/gpt-5.1-chat-latest',
  stt: 'deepgram/nova-3',
  tts: 'elevenlabs/eleven_turbo_v2_5:pwMBn0SsmN1220Aorv15',
});

Wiring Up to an Agent

import { LuaAgent } from 'lua-cli';
import supportLine from './voices/support-line.voice';
import supportSkill from './skills/support.skill';

export const agent = new LuaAgent({
  name: 'support-agent',
  persona: {
    base: 'You are a helpful support agent for Acme Corp.',
    voice: `Speak conversationally in two sentences or fewer. No markdown. Never output digits — spell numbers and prices in full English words ("one hundred twenty-nine dollars", "nine o'clock", "fifty miles").`,
    text: 'Use markdown headers and bullet lists where helpful.',
  },
  voices: [supportLine],
  skills: [supportSkill],
});
The agent’s persona.voice branch is what gives supportLine its voice-specific prompt.
Voice persona tips:
  • Keep replies short (1–2 sentences). Voice users can’t skim.
  • No markdown — TTS reads it literally.
  • Spell out numbers and prices (“nine o’clock”, “twenty dollars”) — TTS reads digits robotically otherwise.
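These tips can be turned into a quick check for use in automated voice tests. The sketch below is a rough heuristic, not a Lua feature: lintVoiceReply is a hypothetical helper, its markdown detection covers only a few common patterns, and its sentence splitting is naive.

```typescript
// Hypothetical lint for LLM replies that will be spoken by TTS.
function lintVoiceReply(text: string): string[] {
  const issues: string[] = [];

  // Digits tend to be read robotically; numbers should be spelled out.
  if (/\d/.test(text)) issues.push('contains digits');

  // Headings, list markers, bold, inline code, and links.
  const markdown = /(^|\n)\s*(#{1,6}\s|[-*]\s)|\*\*|`|\[[^\]]*\]\([^)]*\)/;
  if (markdown.test(text)) issues.push('contains markdown');

  // Naive split on sentence-ending punctuation.
  const sentences = text
    .split(/[.!?]+(?:\s+|$)/)
    .filter((sentence) => sentence.trim().length > 0);
  if (sentences.length > 2) issues.push('more than two sentences');

  return issues;
}
```

An empty result means the reply passed all three checks; anything else is a hint that the persona.voice prompt needs tightening.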