Voice API

Overview

LuaVoice is the class-based primitive you define in code to declare a voice-enabled agent — its speech-to-text engine, text-to-speech engine, LLM, turn detection, and any voice-specific tools.

import { LuaVoice } from 'lua-cli';

export default new LuaVoice({
  name: 'support-line',
  llm: 'openai/gpt-5.1-chat-latest',
  stt: 'deepgram/nova-3',
  tts: 'elevenlabs/eleven_turbo_v2_5:pwMBn0SsmN1220Aorv15',
  greeting: 'Hi, this is your support line. How can I help?',
});

For testing voice agents live or running automated voice tests, see the Voice Command. For the direct plugin route (when string descriptors aren’t enough), see Voice Plugins.

Persona is configured on the parent LuaAgent, not on LuaVoice. Use the channel-aware persona shape { base, voice, text } on the agent to give a voice its own prompt — see Channel-Aware Personas.

String Descriptors (recommended)

llm, stt, and tts all accept a provider-prefixed string descriptor. This is the canonical form — it routes through Lua’s inference layer so you don’t manage provider credentials yourself.

new LuaVoice({
  name: 'support-line',
  llm: 'openai/gpt-5.1-chat-latest',                  // LLM
  stt: 'deepgram/nova-3',                              // STT
  tts: 'elevenlabs/eleven_turbo_v2_5:<voiceId>',      // TTS — colon-separated voiceId
});
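
To make the descriptor shape concrete, here is a minimal sketch of how a `provider/model[:voiceId]` string could be split into its parts. `parseDescriptor` and `ParsedDescriptor` are hypothetical names for illustration and are not part of lua-cli; the actual parsing inside Lua's inference layer may differ.

```typescript
// Illustrative helper, not a lua-cli API: splits 'provider/model[:voiceId]'.
interface ParsedDescriptor {
  provider: string;
  model: string;
  voiceId?: string;
}

function parseDescriptor(descriptor: string): ParsedDescriptor {
  const slash = descriptor.indexOf('/');
  if (slash <= 0 || slash === descriptor.length - 1) {
    throw new Error(`Expected "provider/model", got "${descriptor}"`);
  }
  const provider = descriptor.slice(0, slash);
  const rest = descriptor.slice(slash + 1);

  // Only TTS descriptors carry a voice id; it follows the first colon.
  const colon = rest.indexOf(':');
  if (colon === -1) {
    return { provider, model: rest };
  }
  const model = rest.slice(0, colon);
  const voiceId = rest.slice(colon + 1);
  if (!model || !voiceId) {
    throw new Error(`Expected "provider/model:voiceId", got "${descriptor}"`);
  }
  return { provider, model, voiceId };
}
```

A check like this can catch a missing provider prefix or an empty voice id before the config is deployed.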

LLM options

Provider-prefixed model id, grouped by tier. Pick a tier based on the latency/cost/quality trade-off you need.

Fast tier — lowest latency, lowest cost:

Descriptor	Notes
`openai/gpt-5-mini`	Fast & cheap OpenAI.
`openai/gpt-5-nano`	Cheapest OpenAI tier.
`openai/gpt-4.1-mini`	Stable, fast.
`google/gemini-2.5-flash-lite`	Fastest Gemini.
`google/gemini-2.5-flash`	Fast multimodal.
`xai/grok-4-1-fast-non-reasoning`	Fast xAI tier.

Balanced tier — good default for most voice agents:

Descriptor	Notes
`openai/gpt-5`	Balanced quality and speed.
`openai/gpt-5.1-chat-latest`	Balanced, chat-tuned. Common default.
`openai/gpt-4.1`	Stable, balanced.
`google/gemini-3-flash`	Newest Flash multimodal.
`xai/grok-4-1-fast-reasoning`	Reasoning at fast tier.
`deepseek-ai/deepseek-v3.2`	Cost-efficient reasoning.
`moonshotai/kimi-k2-instruct`	Long-context instruct.

Quality tier — best capability, higher latency/cost:

Descriptor	Notes
`openai/gpt-5.4`	Top-tier OpenAI.
`openai/gpt-5.3-chat-latest`	Top-tier chat-tuned.
`google/gemini-3-pro`	Long context, top tier.
`google/gemini-2.5-pro`	Stable Pro tier.
`xai/grok-4.20-0309-reasoning`	Top-tier xAI reasoning.

Anthropic / Claude is intentionally absent — Lua’s inference layer does not carry Anthropic models for voice as of this writing. Use OpenAI, Google, xAI, DeepSeek, or Kimi for voice LLMs.
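
If you want to lint a config against the catalog above, a lookup along these lines may help. This is a sketch only: `LLM_TIERS` and `tierOf` are hypothetical names, and the lists simply mirror the tables on this page, so they go stale when the catalog changes.

```typescript
// Illustrative lookup, not a lua-cli API. Lists copied from the tier tables above.
type LlmTier = 'fast' | 'balanced' | 'quality';

const LLM_TIERS: Record<LlmTier, readonly string[]> = {
  fast: [
    'openai/gpt-5-mini',
    'openai/gpt-5-nano',
    'openai/gpt-4.1-mini',
    'google/gemini-2.5-flash-lite',
    'google/gemini-2.5-flash',
    'xai/grok-4-1-fast-non-reasoning',
  ],
  balanced: [
    'openai/gpt-5',
    'openai/gpt-5.1-chat-latest',
    'openai/gpt-4.1',
    'google/gemini-3-flash',
    'xai/grok-4-1-fast-reasoning',
    'deepseek-ai/deepseek-v3.2',
    'moonshotai/kimi-k2-instruct',
  ],
  quality: [
    'openai/gpt-5.4',
    'openai/gpt-5.3-chat-latest',
    'google/gemini-3-pro',
    'google/gemini-2.5-pro',
    'xai/grok-4.20-0309-reasoning',
  ],
};

// Returns the tier a descriptor is listed under, or undefined if it is not in the catalog.
function tierOf(descriptor: string): LlmTier | undefined {
  for (const tier of Object.keys(LLM_TIERS) as LlmTier[]) {
    if (LLM_TIERS[tier].includes(descriptor)) return tier;
  }
  return undefined;
}
```

An `undefined` result is a useful early warning, for example when someone pastes an Anthropic model id into the llm slot.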

STT options

Deepgram (recommended)

Deepgram is the default STT provider — the worker’s STT routing falls back to deepgram/nova-3 when nothing else is configured.

new LuaVoice({
  // ...
  stt: 'deepgram/nova-3',
  sttLanguage: 'en',        // BCP-47 code, or 'multi' for multilingual
});

Descriptor	Notes
`deepgram/nova-3`	Latest Nova series. Best accuracy + low latency. Recommended default.
`deepgram/nova-2`	Previous generation. Still solid.
`deepgram/nova-2-phonecall`	Tuned for narrowband (8 kHz) phone audio. Use when call quality is poor or when you want extra robustness on PSTN.

Combine with sttLanguage to pin the spoken language:

BCP-47 code ('en', 'es', 'pt-BR', etc.) — pins recognition to that language.
'multi' — multilingual transcription. Applies to both the Inference route and the direct Deepgram plugin.
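
The two accepted shapes can be checked before deployment with something like the sketch below. `isValidSttLanguage` is a hypothetical helper, and the regular expression is a deliberately simplified approximation of BCP-47 rather than a full validator.

```typescript
// Illustrative check, not a lua-cli API.
// Simplified BCP-47 shape: a 2-3 letter language subtag followed by optional
// hyphen-separated subtags (region, script, ...). Not a complete BCP-47 parser.
const BCP47_LIKE = /^[A-Za-z]{2,3}(-[A-Za-z0-9]{2,8})*$/;

function isValidSttLanguage(value: string): boolean {
  // 'multi' is the special multilingual value described above.
  return value === 'multi' || BCP47_LIKE.test(value);
}
```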

Want non-default Deepgram options (smart formatting, filler-word filtering, custom keywords)? Use the plugin class form: stt: new deepgram.STT({ model: 'nova-3', smartFormat: true }). See Voice Plugins for the full plugin route.

ElevenLabs Scribe

ElevenLabs has an STT model called Scribe, available via the Inference route:

stt: 'elevenlabs/scribe_v2_realtime'

Useful when you want STT and TTS from the same provider, or when Scribe’s behavior on a specific language outperforms Deepgram in your testing.

TTS options

ElevenLabs (recommended)

ElevenLabs is the canonical TTS provider. The descriptor format is elevenlabs/<model>:<voiceId>.

new LuaVoice({
  // ...
  tts: 'elevenlabs/eleven_turbo_v2_5:pwMBn0SsmN1220Aorv15',
});

Models:

Model	Latency	Languages	Best for
`eleven_v3`	~250ms	70+	Most expressive. Use when quality matters more than latency.
`eleven_turbo_v2_5`	Low	Multilingual	Common default — balanced latency + quality.
`eleven_flash_v2_5`	~75ms	Multilingual	Ultra-low latency. Use for fast, interactive turns.
`eleven_multilingual_v2`	~200ms	29	Lifelike emotion across many languages.
`eleven_flash_v2`	~75ms	English only	Ultra-low latency, English-only.

Curated voice IDs: Lua maintains a curated list with metadata (gender, accent, style) the raw ElevenLabs API doesn’t expose:

Voice ID	Name	Accent	Style
`pwMBn0SsmN1220Aorv15`	Matt	American	Male, Hyper-Conversational
`ZTho75k1M56OV0k9XtSC`	Spence	American	Male, Soft-Spoken
`kdmDKE6EkgrWrrykO9Qt`	Alexandra	American	Female, Conversational
`h2sm0NbeIZXHBzJOMYcQ`	Natasha	American	Female, Calm Narrative
`lUTamkMw7gOzZbFIwmq4`	James	British	Male, Professional
`4BWwbsA70lmV7RMG0Acs`	Blondie	British	Female, Relaxed Casual
`lcMyyd2HUfFzxdCaC4Ta`	Lucy	British	Female, Fresh Casual
`4CrZuIW9am7gYAxgo2Af`	Shelley	British	Female, Clear Confident
`56bWURjYFHyYyVf490Dp`	Emma	Australian	Female, Warm Conversational
`aCChyB4P5WEomwRsOKRh`	Salma	Arabic	Female, Conversational Expressive
`2zRM7PkgwBPiau2jvVXc`	Monika	Indian	Female, Deep and Natural
`ecp3DWciuUyW7BYM7II1`	Anika	Indian	Female, Sweet and Lively
`pzxut4zZz4GImZNlqQ3H`	Raju	Indian	Male, Natural Conversationalist

You can also use any ElevenLabs voice ID from your own ElevenLabs account; these are just the curated defaults.

Alternative: object form

If you’d rather not concatenate model and voice with a colon, the object form works too:

tts: { model: 'elevenlabs/eleven_turbo_v2_5', voice: 'pwMBn0SsmN1220Aorv15' }
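
The two forms carry the same information, so converting between them is mechanical. The sketch below assumes the object form has exactly the `model` and `voice` keys shown above; `TtsObjectForm`, `toTtsObject`, and `toTtsString` are hypothetical names, not lua-cli exports.

```typescript
// Illustrative conversion between the colon string form and the object form.
interface TtsObjectForm {
  model: string; // provider-prefixed, e.g. 'elevenlabs/eleven_turbo_v2_5'
  voice: string; // voice id
}

function toTtsObject(tts: string | TtsObjectForm): TtsObjectForm {
  if (typeof tts !== 'string') return tts;
  const colon = tts.indexOf(':');
  // Descriptors without a voice id (such as Deepgram Aura) are out of scope here.
  if (colon <= 0 || colon === tts.length - 1) {
    throw new Error(`Expected "provider/model:voiceId", got "${tts}"`);
  }
  return { model: tts.slice(0, colon), voice: tts.slice(colon + 1) };
}

function toTtsString({ model, voice }: TtsObjectForm): string {
  return `${model}:${voice}`;
}
```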

Deepgram Aura

Deepgram offers TTS via the Aura family. The voice id is encoded inside the model id as aura-2-<name>-<lang>:

tts: 'deepgram/aura-2-thalia-en'

Common Aura 2 voices (English):

ID	Name	Gender	Style
`aura-2-thalia-en`	Thalia	Female (American)	Conversational
`aura-2-asteria-en`	Asteria	Female (American)	Friendly
`aura-2-luna-en`	Luna	Female (American)	Warm
`aura-2-stella-en`	Stella	Female (American)	Professional
`aura-2-athena-en`	Athena	Female (British)	Authoritative
`aura-2-hera-en`	Hera	Female (American)	Calm Narrative
`aura-2-orion-en`	Orion	Male (American)	Confident
`aura-2-arcas-en`	Arcas	Male (American)	Conversational
`aura-2-perseus-en`	Perseus	Male (American)	Engaging
`aura-2-angus-en`	Angus	Male (Irish)	Storyteller
`aura-2-helios-en`	Helios	Male (British)	Professional
`aura-2-zeus-en`	Zeus	Male (American)	Deep Authoritative

Spanish voices are also available: aura-2-celeste-es, aura-2-estrella-es.
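
Because the voice name and language live inside the model id, they can be recovered with a small pattern match. This is a sketch that assumes every Aura 2 id follows the `aura-2-<name>-<lang>` shape with a two-letter language code, as all ids listed here do; `parseAuraVoice` is a hypothetical helper.

```typescript
// Illustrative parser, not a lua-cli or Deepgram API.
// Expects the bare model id, so strip the 'deepgram/' prefix first.
const AURA_2_ID = /^aura-2-([a-z]+)-([a-z]{2})$/;

function parseAuraVoice(modelId: string): { name: string; lang: string } | null {
  const match = AURA_2_ID.exec(modelId);
  return match ? { name: match[1], lang: match[2] } : null;
}
```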

Other TTS providers (via Inference)

Lua’s inference layer also exposes Cartesia, Inworld, Rime, and xAI TTS. The descriptors follow the same provider/model shape:

Descriptor	Provider	Notes
`cartesia/sonic-3`	Cartesia	Newest, expressive.
`cartesia/sonic-turbo`	Cartesia	Ultra-low latency.
`inworld/inworld-tts-1.5-max`	Inworld	High-quality multilingual.
`rime/arcana`	Rime	Multilingual, expressive.
`xai/tts-1`	xAI	21 languages.

Configuration Reference

new LuaVoice({
  // Required
  name: 'support-line',
  llm: 'openai/gpt-5.1-chat-latest',
  stt: 'deepgram/nova-3',
  tts: 'elevenlabs/eleven_turbo_v2_5:pwMBn0SsmN1220Aorv15',

  // Recommended
  description: 'Inbound phone voice for the support assistant',
  greeting: "Hi, this is your support line. How can I help?",
  sttLanguage: 'en',
  turnDetection: 'vad',
  krispEnabled: true,

  // Optional tuning
  maxToolSteps: 6,
  userAwayTimeout: 20,
  preemptiveGeneration: true,
  interruption: { mode: 'adaptive', falseInterruptionTimeout: 2.0 },

  // Optional polish
  pronunciations: { 'HVAC': 'H V A C', 'CFM': 'C F M' },
  persistTranscript: true,
  backgroundAudio: { ambient: 'office-ambience', thinking: 'keyboard-typing' },

  // Tools + lifecycle hooks
  tools: [/* ... */],
  onEnter: async (ctx) => {/* ... */},
  onUserTurnCompleted: async (turnCtx, message) => {/* ... */},
  onExit: async (ctx) => {/* ... */},
});

Required fields

name

string

required

Unique name for this voice. Used to address the voice in lua voice --voice <name> and as the server-side identifier. Allowed characters: a-zA-Z0-9_-, 1–64 chars.
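
The naming rule translates directly into a regular expression. `isValidVoiceName` below is a hypothetical helper for pre-flight checks; whether lua-cli validates with this exact pattern is an assumption.

```typescript
// Illustrative check mirroring the rule above: a-zA-Z0-9_- and 1 to 64 characters.
const VOICE_NAME = /^[a-zA-Z0-9_-]{1,64}$/;

function isValidVoiceName(name: string): boolean {
  return VOICE_NAME.test(name);
}
```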

llm

string | LLMConfig

required

The LLM that drives the conversation. String descriptor (e.g. 'openai/gpt-5.1-chat-latest') is the canonical form. See LLM options above for the catalog.

stt

string | STTConfig

required

Speech-to-text engine. String descriptor (e.g. 'deepgram/nova-3') is canonical. Required for cascaded LLMs; omit only when using a realtime speech-to-speech model in the llm slot.

tts

string | TTSConfig

required

Text-to-speech engine. String descriptor with colon-separated voice id (e.g. 'elevenlabs/eleven_turbo_v2_5:<voiceId>'), or object form { model, voice }. Required for cascaded LLMs.

Optional fields

description

string

Human-readable description. Surfaced in the compiled manifest and admin listings.

greeting

string

Opening line spoken at session start. Empty string means no greeting. Generated through the LLM at session connect, so it can be dynamic if onEnter sets up context first.

sttLanguage

string

BCP-47 language code (e.g. 'en', 'es', 'pt-BR') or 'multi' for multilingual transcription. Applies to both Inference STT and the Deepgram plugin.

turnDetection

'multilingual' | 'english' | 'vad' | 'stt' | 'manual'

How the agent decides when the user has finished speaking. 'vad' is the safest choice for most setups. 'multilingual' and 'english' use LiveKit’s turn-detector model; 'manual' defers to your own logic.

vad

string

default:"silero"

Voice activity detection engine. 'silero' is the only currently-supported value.

vadOptions

object

Silero VAD tuning. Useful when the default endpointing clips quiet callers or fires too eagerly mid-thought.

minSpeechDuration (ms, 0–5000) — speech required before a turn starts. Default: 50.
minSilenceDuration (ms, 0–5000) — silence required to end a turn. Default: 550.
prefixPaddingDuration (ms, 0–2000) — audio captured before detected speech start, forwarded into STT. Default: 500.
activationThreshold (0–1) — lower = more sensitive to speech onset.
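
The documented ranges can be enforced locally before a config is shipped. The sketch below is illustrative: the `VadOptions` interface is inferred from the field list above, and `validateVadOptions` is a hypothetical helper, not something lua-cli exports.

```typescript
// Illustrative validation of the ranges listed above.
interface VadOptions {
  minSpeechDuration?: number;     // ms
  minSilenceDuration?: number;    // ms
  prefixPaddingDuration?: number; // ms
  activationThreshold?: number;   // 0-1
}

const VAD_RANGES: Record<keyof VadOptions, [number, number]> = {
  minSpeechDuration: [0, 5000],
  minSilenceDuration: [0, 5000],
  prefixPaddingDuration: [0, 2000],
  activationThreshold: [0, 1],
};

// Returns one message per out-of-range field; an empty array means the options pass.
function validateVadOptions(options: VadOptions): string[] {
  const errors: string[] = [];
  for (const key of Object.keys(VAD_RANGES) as (keyof VadOptions)[]) {
    const value = options[key];
    if (value === undefined) continue; // omitted fields fall back to defaults
    const [min, max] = VAD_RANGES[key];
    if (!Number.isFinite(value) || value < min || value > max) {
      errors.push(`${key} must be between ${min} and ${max}, got ${value}`);
    }
  }
  return errors;
}
```

For example, raising `minSilenceDuration` to 800 for callers who pause mid-thought passes, while a `prefixPaddingDuration` of 2500 is reported as out of range.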

krispEnabled

boolean

default:"false"

Krisp BVC background noise cancellation. Recommended for inbound phone calls — it removes background chatter, traffic, and other ambient noise. Billed separately, so opt-in.

maxToolSteps

number

Maximum sequential tool calls per turn (1–20). Higher values let the agent chain more tools before responding.

userAwayTimeout

number

Seconds of silence before the agent considers the user “away” and ends the session. Useful for cleanly handling abandoned calls.

preemptiveGeneration

boolean

Generate the assistant’s response speculatively while the user is still speaking. Reduces perceived latency for predictable turns, but the speculative work can be wasted on highly interruptive callers.

interruption

InterruptionOptions

How the agent handles being interrupted mid-response.

enabled — whether interruption is allowed.
mode — 'adaptive' (recommended) or 'vad'.
falseInterruptionTimeout (seconds) — how long to wait before treating a brief noise as a false interruption.
resumeFalseInterruption (boolean) — resume the cut-off response after a false interruption.
minDelay / maxDelay (seconds) — bounds on the interruption response window.

pronunciations

Record<string, string>

Word-boundary text replacements applied before TTS synthesis. Keys are matched case-insensitively as whole words. Use for acronyms and proper nouns the TTS mispronounces.

pronunciations: { 'HVAC': 'H V A C', 'kubectl': 'kube control' }

Only effective on the cascaded path — realtime models bypass the TTS step entirely.
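
To preview what a pronunciation map does to a sentence, the matching rules above (whole words, case-insensitive) can be approximated as follows. This is a sketch of the described behavior, not the worker's actual implementation; `applyPronunciations` is a hypothetical name.

```typescript
// Illustrative approximation of whole-word, case-insensitive replacement.
function escapeRegExp(source: string): string {
  return source.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function applyPronunciations(text: string, map: Record<string, string>): string {
  // Longest keys first so a longer key wins over a shorter one with the same prefix.
  const keys = Object.keys(map).sort((a, b) => b.length - a.length);
  if (keys.length === 0) return text;

  const lookup = new Map(
    keys.map((key): [string, string] => [key.toLowerCase(), map[key]]),
  );
  // A single pass avoids re-replacing text produced by an earlier replacement.
  // Note: \b only behaves as a word boundary for keys that start and end with
  // a letter, digit, or underscore.
  const pattern = new RegExp(`\\b(?:${keys.map(escapeRegExp).join('|')})\\b`, 'gi');
  return text.replace(pattern, (match) => lookup.get(match.toLowerCase()) ?? match);
}
```

Running your persona's typical replies through a helper like this is a quick way to spot acronyms that still need an entry.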

backgroundAudio

{ ambient?, thinking? }

Background audio layered onto the agent’s output. Pass a built-in clip name, a { source, volume, probability } config, or an array (probabilistic mix).

Built-in clips: 'office-ambience', 'keyboard-typing', 'keyboard-typing-2'.

backgroundAudio: {
  ambient: 'office-ambience',
  thinking: 'keyboard-typing',
}

volume

number

Output speech volume, 0–100. Applied as a per-frame multiplier. Omit to pass the TTS provider’s native level through unchanged.
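
As a rough picture of what a per-frame multiplier means, the sketch below scales one frame of 16-bit PCM samples. Two assumptions here are not confirmed by this page: that audio frames are 16-bit PCM, and that 100 maps to unity gain. `applyVolume` is a hypothetical name.

```typescript
// Illustrative only. Assumes 16-bit PCM frames and that volume 100 means gain 1.0.
function applyVolume(frame: Int16Array, volume: number): Int16Array {
  if (volume < 0 || volume > 100) {
    throw new RangeError(`volume must be between 0 and 100, got ${volume}`);
  }
  const gain = volume / 100;
  const scaled = new Int16Array(frame.length);
  for (let i = 0; i < frame.length; i++) {
    // Round and clamp to the int16 range to avoid wrap-around.
    scaled[i] = Math.max(-32768, Math.min(32767, Math.round(frame[i] * gain)));
  }
  return scaled;
}
```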

persistTranscript

boolean

default:"false"

When true, the worker writes session.history to Data.set('call:<sessionId>') after the call ends. Read it back from a job or webhook with Data.get('call:<sessionId>') for post-call analytics, follow-ups, or QA.
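
The key convention is simple enough to centralize in one helper so writers and readers cannot drift apart. In the sketch below, `TranscriptStore` is a hypothetical stand-in for whatever exposes `get` (for example `Data`), and `transcriptKey` and `loadTranscript` are illustrative names.

```typescript
// Illustrative helpers around the 'call:<sessionId>' key convention described above.
interface TranscriptStore {
  get(key: string): Promise<unknown>; // stand-in for a Data.get-like call
}

function transcriptKey(sessionId: string): string {
  return `call:${sessionId}`;
}

async function loadTranscript(store: TranscriptStore, sessionId: string): Promise<unknown> {
  return store.get(transcriptKey(sessionId));
}
```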

tools

Array<LuaTool | LuaVoiceTool>

Voice-specific tools in addition to skills attached to the owning agent. See Defining Voice Tools.

Lifecycle Hooks

Three hooks let you wire up per-session state, RAG injection, and post-call work.

onEnter

(ctx: LuaVoiceHookContext) => Promise<void>

Fires after the session connects to the room and before the greeting. Use it to hydrate session.userdata from User, Data, etc., or to set up any per-call state.

onEnter: async (ctx) => {
  if (ctx.caller?.phoneNumber) {
    const user = await User.get({ phone: ctx.caller.phoneNumber });
    ctx.session.userdata = { user, returning: !!user };
  }
},

onUserTurnCompleted

(turnCtx, message) => Promise<void>

Fires after the user finishes a turn, before the LLM is invoked. This is the canonical RAG-injection point — turnCtx.addMessage(...) adds context messages the LLM sees on this turn.

onUserTurnCompleted: async (turnCtx, message) => {
  const docs = await Data.search('kb', message.content, 3);
  for (const doc of docs) {
    turnCtx.addMessage({ role: 'system', content: doc.text });
  }
},

onExit

(ctx: LuaVoiceHookContext) => Promise<void>

Fires when the session is closing. Use for transcript persistence, outcome reporting, CRM updates, etc.
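
As one possible shape for outcome reporting, the sketch below condenses a call history into a few counters that an onExit hook could forward to a CRM. The `HistoryItem` shape is an assumption (this page does not define what session.history contains), and `summarizeCall` is a hypothetical helper.

```typescript
// Illustrative only. HistoryItem is an assumed shape, not a documented lua-cli type.
interface HistoryItem {
  role: 'user' | 'assistant' | 'system';
  content: string;
}

interface CallSummary {
  userTurns: number;
  assistantTurns: number;
  lastUserMessage?: string;
}

function summarizeCall(history: HistoryItem[]): CallSummary {
  const userMessages = history.filter((item) => item.role === 'user');
  return {
    userTurns: userMessages.length,
    assistantTurns: history.filter((item) => item.role === 'assistant').length,
    // Undefined when the caller never spoke (for example an abandoned call).
    lastUserMessage: userMessages[userMessages.length - 1]?.content,
  };
}
```

Inside onExit you would pass whatever history the hook context exposes, after mapping it to this shape if it differs.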

Defining Voice Tools

Voice tools run during a voice conversation. LuaVoiceTool is a concrete class — instantiate it with a config object:

import { Data, LuaVoiceTool } from 'lua-cli';
import { z } from 'zod';

export const getOrderStatusTool = new LuaVoiceTool({
  name: 'getOrderStatus',
  description: 'Look up the status of an order by ID',
  inputSchema: z.object({ orderId: z.string() }),
  execute: async (input, ctx) => {
    const order = await Data.get('orders', input.orderId);
    return { status: order.status, eta: order.eta };
  },
});

Config fields

name

string

required

Tool name. Used by the LLM to identify and call the tool.

description

string

required

What the tool does. An action-oriented description the LLM reads when deciding whether to invoke the tool.

inputSchema

ZodType

required

Zod schema for the tool’s input. Validated before execute is called.

execute

(input, ctx?: LuaVoiceToolCtx) => Promise<any>

required

Tool body. Receives the validated input and an optional voice-specific context.

condition

() => Promise<boolean>

Optional gate. When provided, the tool is only exposed to the LLM if condition() returns true. Use for feature flags or runtime availability checks.
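
A feature-flag gate can be built as a small factory that returns a function matching the `() => Promise<boolean>` signature. The flag store here is a plain in-memory map for illustration; `flagCondition` and the flag name are hypothetical.

```typescript
// Illustrative factory producing a condition() gate from a flag lookup.
type Condition = () => Promise<boolean>;

function flagCondition(flags: ReadonlyMap<string, boolean>, flag: string): Condition {
  // Missing flags are treated as disabled, so the tool stays hidden by default.
  return async () => flags.get(flag) === true;
}

// Hypothetical flag name, for illustration only.
const flags = new Map<string, boolean>([['order-lookup', true]]);
const orderLookupEnabled = flagCondition(flags, 'order-lookup');
```

The resulting function can be passed as the `condition` field of a tool config.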

flags

ToolFlag[]

Voice-specific tool flags (e.g. controlling barge-in behavior).

ctx — `LuaVoiceToolCtx`

ctx.toolCallId

string

Identifier for this specific tool invocation.

ctx.voice.say

(text: string) => Promise<void>

Speak text to the caller via the active LiveKit session. Useful for status updates during long-running tool work (“Looking that up — one moment.”).

ctx.voice.transferToHuman

(msisdn, opts?) => Promise<void>

Transfer the live caller to a human at msisdn. Two mechanisms:

mode: 'refer' (default) — SIP REFER on the inbound leg. Cheap (one billed leg) but depends on the inbound carrier accepting REFER end-to-end.
mode: 'bridge' — dial the human as a second SIP participant into the same room. Two billed legs but works regardless of carrier REFER support. Use for high-stakes transfers.

announce is spoken before the transfer fires.

await ctx.voice?.transferToHuman('+32477123456', {
  mode: 'bridge',
  announce: 'Transferring you to our sales team — one moment.',
});
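
The number in the example is in E.164 form (a plus sign followed by up to 15 digits). Assuming that is what the transfer expects, which this page does not state outright, a guard like the hypothetical `isE164` below can reject malformed numbers before a live caller is put through.

```typescript
// Illustrative E.164 shape check: '+', a non-zero leading digit, 15 digits at most.
const E164 = /^\+[1-9]\d{1,14}$/;

function isE164(msisdn: string): boolean {
  return E164.test(msisdn);
}
```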

You can also share regular LuaTool instances between chat skills and voice tools — just pass them in the same tools array. The tools field accepts both LuaTool and LuaVoiceTool instances.

Function-style: `defineVoice`

Equivalent to new LuaVoice(config) if you prefer a function call:

import { defineVoice } from 'lua-cli';

export default defineVoice({
  name: 'support-line',
  llm: 'openai/gpt-5.1-chat-latest',
  stt: 'deepgram/nova-3',
  tts: 'elevenlabs/eleven_turbo_v2_5:pwMBn0SsmN1220Aorv15',
});

Wiring Up to an Agent

import { LuaAgent } from 'lua-cli';
import supportLine from './voices/support-line.voice';
import supportSkill from './skills/support.skill';

export const agent = new LuaAgent({
  name: 'support-agent',
  persona: {
    base: 'You are a helpful support agent for Acme Corp.',
    voice: `Speak conversationally in two sentences or fewer. No markdown. Never output digits — spell numbers and prices in full English words ("one hundred twenty-nine dollars", "nine o'clock", "fifty miles").`,
    text: 'Use markdown headers and bullet lists where helpful.',
  },
  voices: [supportLine],
  skills: [supportSkill],
});

The agent’s persona.voice branch is what gives supportLine its voice-specific prompt.

Voice persona tips:

Keep replies short (1–2 sentences). Voice users can’t skim.
No markdown — TTS reads it literally.
Spell out numbers and prices (“nine o’clock”, “twenty dollars”) — TTS reads digits robotically otherwise.

Related

Voice Command — live testing and voice test suites
Voice Plugins — direct plugin route (Deepgram, ElevenLabs class forms)
Persona Command — voice-specific personas on the parent agent
LuaAgent API
