# Voice API

> Define voice agents in code with LuaVoice — STT, TTS, LLM, lifecycle hooks, and voice tools

## Overview

`LuaVoice` is the class-based primitive you define in code to declare a voice-enabled agent — its speech-to-text engine, text-to-speech engine, LLM, turn detection, and any voice-specific tools.

```typescript theme={null}
import { LuaVoice } from 'lua-cli';

export default new LuaVoice({
  name: 'support-line',
  llm: 'openai/gpt-5.1-chat-latest',
  stt: 'deepgram/nova-3',
  tts: 'elevenlabs/eleven_turbo_v2_5:pwMBn0SsmN1220Aorv15',
  greeting: 'Hi, this is your support line. How can I help?',
});
```

<Note>
  For testing voice agents live or running automated voice tests, see the [Voice Command](/cli/voice-command). For the direct plugin route (when string descriptors aren't enough), see [Voice Plugins](/api/voice-plugins).
</Note>

<Note>
  **Persona is configured on the parent `LuaAgent`, not on `LuaVoice`.** Use the channel-aware persona shape `{ base, voice, text }` on the agent to give a voice its own prompt — see [Channel-Aware Personas](/cli/persona-command#channel-aware-personas).
</Note>

***

## String Descriptors (recommended)

`llm`, `stt`, and `tts` all accept a provider-prefixed string descriptor. This is the canonical form — it routes through Lua's inference layer so you don't manage provider credentials yourself.

```typescript theme={null}
new LuaVoice({
  name: 'support-line',
  llm: 'openai/gpt-5.1-chat-latest',                  // LLM
  stt: 'deepgram/nova-3',                              // STT
  tts: 'elevenlabs/eleven_turbo_v2_5:<voiceId>',      // TTS — colon-separated voiceId
});
```
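To make the descriptor shape concrete, here is a minimal sketch that splits a descriptor into its parts. `parseDescriptor` and `ParsedDescriptor` are hypothetical names used only for illustration; they are not part of `lua-cli`, and the parsing done inside Lua's inference layer may differ.

```typescript
// Hypothetical helper: illustrates the `provider/model[:voiceId]` shape.
// Not part of lua-cli; the real parser may behave differently.
interface ParsedDescriptor {
  provider: string;
  model: string;
  voiceId?: string;
}

function parseDescriptor(descriptor: string): ParsedDescriptor {
  const slash = descriptor.indexOf('/');
  if (slash <= 0 || slash === descriptor.length - 1) {
    throw new Error(`Expected "provider/model", got "${descriptor}"`);
  }
  const provider = descriptor.slice(0, slash);
  const rest = descriptor.slice(slash + 1);

  // An optional voice id follows the first colon after the model.
  const colon = rest.indexOf(':');
  if (colon === -1) {
    return { provider, model: rest };
  }
  const model = rest.slice(0, colon);
  const voiceId = rest.slice(colon + 1);
  if (model === '' || voiceId === '') {
    throw new Error(`Expected "provider/model:voiceId", got "${descriptor}"`);
  }
  return { provider, model, voiceId };
}

console.log(parseDescriptor('deepgram/nova-3'));
console.log(parseDescriptor('elevenlabs/eleven_turbo_v2_5:pwMBn0SsmN1220Aorv15'));
```

A check like this can catch a missing provider prefix or a stray colon in your own config tooling before you deploy.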

### LLM options

Each LLM descriptor is a provider-prefixed model id. The catalog below is grouped by tier; pick a tier based on the latency, cost, and quality trade-off you need.

**Fast tier** — lowest latency, lowest cost:

| Descriptor                        | Notes                 |
| --------------------------------- | --------------------- |
| `openai/gpt-5-mini`               | Fast & cheap OpenAI.  |
| `openai/gpt-5-nano`               | Cheapest OpenAI tier. |
| `openai/gpt-4.1-mini`             | Stable, fast.         |
| `google/gemini-2.5-flash-lite`    | Fastest Gemini.       |
| `google/gemini-2.5-flash`         | Fast multimodal.      |
| `xai/grok-4-1-fast-non-reasoning` | Fast xAI tier.        |

**Balanced tier** — good default for most voice agents:

| Descriptor                    | Notes                                     |
| ----------------------------- | ----------------------------------------- |
| `openai/gpt-5`                | Balanced quality and speed.               |
| `openai/gpt-5.1-chat-latest`  | Balanced, chat-tuned. **Common default.** |
| `openai/gpt-4.1`              | Stable, balanced.                         |
| `google/gemini-3-flash`       | Newest Flash multimodal.                  |
| `xai/grok-4-1-fast-reasoning` | Reasoning at fast tier.                   |
| `deepseek-ai/deepseek-v3.2`   | Cost-efficient reasoning.                 |
| `moonshotai/kimi-k2-instruct` | Long-context instruct.                    |

**Quality tier** — best capability, higher latency/cost:

| Descriptor                     | Notes                   |
| ------------------------------ | ----------------------- |
| `openai/gpt-5.4`               | Top-tier OpenAI.        |
| `openai/gpt-5.3-chat-latest`   | Top-tier chat-tuned.    |
| `google/gemini-3-pro`          | Long context, top tier. |
| `google/gemini-2.5-pro`        | Stable Pro tier.        |
| `xai/grok-4.20-0309-reasoning` | Top-tier xAI reasoning. |

<Note>
  **Anthropic / Claude is intentionally absent** — Lua's inference layer does not carry Anthropic models for voice as of this writing. Use OpenAI, Google, xAI, DeepSeek, or Kimi for voice LLMs.
</Note>

### STT options

#### Deepgram (recommended)

Deepgram is the default STT provider — the worker's STT routing falls back to `deepgram/nova-3` when nothing else is configured.

```typescript theme={null}
new LuaVoice({
  // ...
  stt: 'deepgram/nova-3',
  sttLanguage: 'en',        // BCP-47 code, or 'multi' for multilingual
});
```

| Descriptor                  | Notes                                                                                                              |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| `deepgram/nova-3`           | Latest Nova series. Best accuracy + low latency. **Recommended default.**                                          |
| `deepgram/nova-2`           | Previous generation. Still solid.                                                                                  |
| `deepgram/nova-2-phonecall` | Tuned for narrowband (8 kHz) phone audio. Use when call quality is poor or when you want extra robustness on PSTN. |

Combine with `sttLanguage` to pin the spoken language:

* BCP-47 code (`'en'`, `'es'`, `'pt-BR'`, etc.) — pins recognition to that language.
* `'multi'` — multilingual transcription. Applies to both the Inference route and the direct Deepgram plugin.

<Tip>
  Want non-default Deepgram options (smart formatting, filler-word filtering, custom keywords)? Use the plugin class form: `stt: new deepgram.STT({ model: 'nova-3', smartFormat: true })`. See [Voice Plugins](/api/voice-plugins) for the full plugin route.
</Tip>

#### ElevenLabs Scribe

ElevenLabs has an STT model called Scribe, available via the Inference route:

```typescript theme={null}
stt: 'elevenlabs/scribe_v2_realtime'
```

Useful when you want STT and TTS from the same provider, or when Scribe outperforms Deepgram on a specific language in your own testing.

### TTS options

#### ElevenLabs (recommended)

ElevenLabs is the canonical TTS provider. The descriptor format is `elevenlabs/<model>:<voiceId>`.

```typescript theme={null}
new LuaVoice({
  // ...
  tts: 'elevenlabs/eleven_turbo_v2_5:pwMBn0SsmN1220Aorv15',
});
```

**Models:**

| Model                    | Latency | Languages    | Best for                                                     |
| ------------------------ | ------- | ------------ | ------------------------------------------------------------ |
| `eleven_v3`              | \~250ms | 70+          | Most expressive. Use when quality matters more than latency. |
| `eleven_turbo_v2_5`      | Low     | Multilingual | **Common default** — balanced latency + quality.             |
| `eleven_flash_v2_5`      | \~75ms  | Multilingual | Ultra-low latency. Use for fast, interactive turns.          |
| `eleven_multilingual_v2` | \~200ms | 29           | Lifelike emotion across many languages.                      |
| `eleven_flash_v2`        | \~75ms  | English only | Ultra-low latency, English-only.                             |

**Curated voice IDs:**

Lua maintains a curated list with metadata (gender, accent, style) the raw ElevenLabs API doesn't expose:

| Voice ID               | Name      | Accent     | Gender | Style                     |
| ---------------------- | --------- | ---------- | ------ | ------------------------- |
| `pwMBn0SsmN1220Aorv15` | Matt      | American   | Male   | Hyper-Conversational      |
| `ZTho75k1M56OV0k9XtSC` | Spence    | American   | Male   | Soft-Spoken               |
| `kdmDKE6EkgrWrrykO9Qt` | Alexandra | American   | Female | Conversational            |
| `h2sm0NbeIZXHBzJOMYcQ` | Natasha   | American   | Female | Calm Narrative            |
| `lUTamkMw7gOzZbFIwmq4` | James     | British    | Male   | Professional              |
| `4BWwbsA70lmV7RMG0Acs` | Blondie   | British    | Female | Relaxed Casual            |
| `lcMyyd2HUfFzxdCaC4Ta` | Lucy      | British    | Female | Fresh Casual              |
| `4CrZuIW9am7gYAxgo2Af` | Shelley   | British    | Female | Clear Confident           |
| `56bWURjYFHyYyVf490Dp` | Emma      | Australian | Female | Warm Conversational       |
| `aCChyB4P5WEomwRsOKRh` | Salma     | Arabic     | Female | Conversational Expressive |
| `2zRM7PkgwBPiau2jvVXc` | Monika    | Indian     | Female | Deep and Natural          |
| `ecp3DWciuUyW7BYM7II1` | Anika     | Indian     | Female | Sweet and Lively          |
| `pzxut4zZz4GImZNlqQ3H` | Raju      | Indian     | Male   | Natural Conversationalist |

You can also use any ElevenLabs voice ID from your own ElevenLabs account — these are just the curated defaults.

**Alternative: object form**

If you'd rather not concatenate model and voice with a colon, the object form works too:

```typescript theme={null}
tts: { model: 'elevenlabs/eleven_turbo_v2_5', voice: 'pwMBn0SsmN1220Aorv15' }
```

#### Deepgram Aura

Deepgram offers TTS via the Aura family. The voice id is encoded inside the model id as `aura-2-<name>-<lang>`:

```typescript theme={null}
tts: 'deepgram/aura-2-thalia-en'
```

**Common Aura 2 voices (English):**

| ID                  | Name    | Gender (accent)   | Style              |
| ------------------- | ------- | ----------------- | ------------------ |
| `aura-2-thalia-en`  | Thalia  | Female (American) | Conversational     |
| `aura-2-asteria-en` | Asteria | Female (American) | Friendly           |
| `aura-2-luna-en`    | Luna    | Female (American) | Warm               |
| `aura-2-stella-en`  | Stella  | Female (American) | Professional       |
| `aura-2-athena-en`  | Athena  | Female (British)  | Authoritative      |
| `aura-2-hera-en`    | Hera    | Female (American) | Calm Narrative     |
| `aura-2-orion-en`   | Orion   | Male (American)   | Confident          |
| `aura-2-arcas-en`   | Arcas   | Male (American)   | Conversational     |
| `aura-2-perseus-en` | Perseus | Male (American)   | Engaging           |
| `aura-2-angus-en`   | Angus   | Male (Irish)      | Storyteller        |
| `aura-2-helios-en`  | Helios  | Male (British)    | Professional       |
| `aura-2-zeus-en`    | Zeus    | Male (American)   | Deep Authoritative |

Spanish voices are also available: `aura-2-celeste-es`, `aura-2-estrella-es`.
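Because the voice name and language are encoded in the model id, you can recover them with a small pattern match. The sketch below assumes every id follows the `aura-2-<name>-<lang>` shape with a two-letter language code, as in the examples above; `parseAuraVoice` is a hypothetical helper, not a `lua-cli` export.

```typescript
// Hypothetical helper: splits an Aura 2 model id of the form
// `aura-2-<name>-<lang>` into its voice name and language code.
// Returns null when the id does not match that shape.
function parseAuraVoice(modelId: string): { name: string; lang: string } | null {
  const match = /^aura-2-([a-z]+)-([a-z]{2})$/.exec(modelId);
  if (!match) return null;
  const [, name, lang] = match;
  if (name === undefined || lang === undefined) return null;
  return { name, lang };
}

console.log(parseAuraVoice('aura-2-thalia-en'));
console.log(parseAuraVoice('aura-2-celeste-es'));
```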

#### Other TTS providers (via Inference)

Lua's inference layer also exposes Cartesia, Inworld, Rime, and xAI TTS. The descriptors follow the same `provider/model` shape:

| Descriptor                    | Provider | Notes                      |
| ----------------------------- | -------- | -------------------------- |
| `cartesia/sonic-3`            | Cartesia | Newest, expressive.        |
| `cartesia/sonic-turbo`        | Cartesia | Ultra-low latency.         |
| `inworld/inworld-tts-1.5-max` | Inworld  | High-quality multilingual. |
| `rime/arcana`                 | Rime     | Multilingual, expressive.  |
| `xai/tts-1`                   | xAI      | 21 languages.              |

***

## Configuration Reference

```typescript theme={null}
new LuaVoice({
  // Required
  name: 'support-line',
  llm: 'openai/gpt-5.1-chat-latest',
  stt: 'deepgram/nova-3',
  tts: 'elevenlabs/eleven_turbo_v2_5:pwMBn0SsmN1220Aorv15',

  // Recommended
  description: 'Inbound phone voice for the support assistant',
  greeting: 'Hi, this is your support line. How can I help?',
  sttLanguage: 'en',
  turnDetection: 'vad',
  krispEnabled: true,

  // Optional tuning
  maxToolSteps: 6,
  userAwayTimeout: 20,
  preemptiveGeneration: true,
  interruption: { mode: 'adaptive', falseInterruptionTimeout: 2.0 },

  // Optional polish
  pronunciations: { 'HVAC': 'H V A C', 'CFM': 'C F M' },
  persistTranscript: true,
  backgroundAudio: { ambient: 'office-ambience', thinking: 'keyboard-typing' },

  // Tools + lifecycle hooks
  tools: [/* ... */],
  onEnter: async (ctx) => {/* ... */},
  onUserTurnCompleted: async (turnCtx, message) => {/* ... */},
  onExit: async (ctx) => {/* ... */},
});
```

### Required fields

<ParamField path="name" type="string" required>
  Unique name for this voice. Used to address the voice in `lua voice --voice <name>` and as the server-side identifier. Allowed characters: `a-zA-Z0-9_-`, 1–64 chars.
</ParamField>
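The naming rule for `name` maps directly onto a regular expression. The following is a minimal sketch for checking names in your own tooling; `isValidVoiceName` is a hypothetical helper, and the server may apply additional checks that are not documented here.

```typescript
// Mirrors the documented constraint: a-z, A-Z, 0-9, "_" and "-", 1 to 64 chars.
const VOICE_NAME_PATTERN = /^[a-zA-Z0-9_-]{1,64}$/;

// Hypothetical helper: returns true when `name` satisfies the documented rule.
function isValidVoiceName(name: string): boolean {
  return VOICE_NAME_PATTERN.test(name);
}

console.log(isValidVoiceName('support-line'));
console.log(isValidVoiceName('support line'));
```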

<ParamField path="llm" type="string | LLMConfig" required>
  The LLM that drives the conversation. String descriptor (e.g. `'openai/gpt-5.1-chat-latest'`) is the canonical form. See [LLM options](#llm-options) above for the catalog.
</ParamField>

<ParamField path="stt" type="string | STTConfig" required>
  Speech-to-text engine. String descriptor (e.g. `'deepgram/nova-3'`) is canonical. Required for the cascaded pipeline (STT, then LLM, then TTS); omit only when using a realtime speech-to-speech model in the `llm` slot.
</ParamField>

<ParamField path="tts" type="string | TTSConfig" required>
  Text-to-speech engine. String descriptor with colon-separated voice id (e.g. `'elevenlabs/eleven_turbo_v2_5:<voiceId>'`), or object form `{ model, voice }`. Required for the cascaded pipeline (STT, then LLM, then TTS).
</ParamField>

### Optional fields

<ParamField path="description" type="string">
  Human-readable description. Surfaced in the compiled manifest and admin listings.
</ParamField>

<ParamField path="greeting" type="string">
  Opening line spoken at session start. Empty string means no greeting. Generated through the LLM at session connect, so it can be dynamic if `onEnter` sets up context first.
</ParamField>

<ParamField path="sttLanguage" type="string">
  BCP-47 language code (e.g. `'en'`, `'es'`, `'pt-BR'`) or `'multi'` for multilingual transcription. Applies to both Inference STT and the Deepgram plugin.
</ParamField>

<ParamField path="turnDetection" type="'multilingual' | 'english' | 'vad' | 'stt' | 'manual'">
  How the agent decides when the user has finished speaking. `'vad'` is the safest choice for most setups. `'multilingual'` and `'english'` use LiveKit's turn-detector model; `'stt'` relies on the STT provider's end-of-speech signal; `'manual'` defers to your own logic.
</ParamField>

<ParamField path="vad" type="string" default="silero">
  Voice activity detection engine. `'silero'` is the only currently-supported value.
</ParamField>

<ParamField path="vadOptions" type="object">
  Silero VAD tuning. Useful when the default endpointing clips quiet callers or fires too eagerly mid-thought.

  * `minSpeechDuration` (ms, 0–5000) — speech required before a turn starts. Default: 50.
  * `minSilenceDuration` (ms, 0–5000) — silence required to end a turn. Default: 550.
  * `prefixPaddingDuration` (ms, 0–2000) — audio captured before detected speech start, forwarded into STT. Default: 500.
  * `activationThreshold` (0–1) — lower = more sensitive to speech onset.
</ParamField>
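The documented ranges for `vadOptions` can be checked before you deploy. This is a sketch under stated assumptions: `VadOptions`, `VAD_RANGES`, and `checkVadOptions` are hypothetical names built only from the fields listed above, and the example values are illustrative rather than recommended settings.

```typescript
// Hypothetical shape, based only on the fields documented above.
interface VadOptions {
  minSpeechDuration?: number;     // ms, 0 to 5000
  minSilenceDuration?: number;    // ms, 0 to 5000
  prefixPaddingDuration?: number; // ms, 0 to 2000
  activationThreshold?: number;   // 0 to 1
}

const VAD_RANGES: Record<keyof VadOptions, [number, number]> = {
  minSpeechDuration: [0, 5000],
  minSilenceDuration: [0, 5000],
  prefixPaddingDuration: [0, 2000],
  activationThreshold: [0, 1],
};

// Returns a list of problems; an empty list means every provided value is in range.
function checkVadOptions(options: VadOptions): string[] {
  const problems: string[] = [];
  for (const key of Object.keys(VAD_RANGES) as Array<keyof VadOptions>) {
    const value = options[key];
    if (value === undefined) continue;
    const [min, max] = VAD_RANGES[key];
    if (!isFinite(value) || value < min || value > max) {
      problems.push(`${key} must be between ${min} and ${max}, got ${value}`);
    }
  }
  return problems;
}

// Example: a longer silence window for callers who pause mid-thought.
const patientVad: VadOptions = { minSilenceDuration: 900, prefixPaddingDuration: 500 };
console.log(checkVadOptions(patientVad));
```

Raising `minSilenceDuration` trades a slightly slower response for fewer premature turn endings, which is the usual first adjustment when the agent cuts callers off.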

<ParamField path="krispEnabled" type="boolean" default="false">
  Krisp BVC background noise cancellation. Recommended for inbound phone calls, where it removes background chatter, traffic, and other ambient noise. Billed separately, so it is opt-in.
</ParamField>

<ParamField path="maxToolSteps" type="number">
  Maximum sequential tool calls per turn (1–20). Higher values let the agent chain more tools before responding.
</ParamField>

<ParamField path="userAwayTimeout" type="number">
  Seconds of silence before the agent considers the user "away" and ends the session. Useful for cleanly handling abandoned calls.
</ParamField>

<ParamField path="preemptiveGeneration" type="boolean">
  Generate the assistant's response speculatively while the user is still speaking. Reduces perceived latency for predictable turns, but the speculative work is wasted when callers frequently interrupt or change course mid-sentence.
</ParamField>

<ParamField path="interruption" type="InterruptionOptions">
  How the agent handles being interrupted mid-response.

  * `enabled` — whether interruption is allowed.
  * `mode` — `'adaptive'` (recommended) or `'vad'`.
  * `falseInterruptionTimeout` (seconds) — how long to wait before treating a brief noise as a false interruption.
  * `resumeFalseInterruption` (boolean) — resume the cut-off response after a false interruption.
  * `minDelay` / `maxDelay` (seconds) — bounds on the interruption response window.
</ParamField>
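A fuller `interruption` value might look like the sketch below. The `InterruptionOptions` interface here is a hypothetical reconstruction from the fields listed above, not the type exported by `lua-cli`, and the values are illustrative starting points rather than documented defaults.

```typescript
// Hypothetical type, reconstructed from the fields listed above.
interface InterruptionOptions {
  enabled?: boolean;
  mode?: 'adaptive' | 'vad';
  falseInterruptionTimeout?: number; // seconds
  resumeFalseInterruption?: boolean;
  minDelay?: number; // seconds
  maxDelay?: number; // seconds
}

// Illustrative values only; tune them against real call recordings.
const interruption: InterruptionOptions = {
  enabled: true,
  mode: 'adaptive',
  falseInterruptionTimeout: 2.0,
  // Pick the response back up if the "interruption" turned out to be noise.
  resumeFalseInterruption: true,
};

console.log(interruption);
```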

<ParamField path="pronunciations" type="Record<string, string>">
  Word-boundary text replacements applied before TTS synthesis. Keys are matched case-insensitively as whole words. Use for acronyms and proper nouns the TTS mispronounces.

  ```typescript theme={null}
  pronunciations: { 'HVAC': 'H V A C', 'kubectl': 'kube control' }
  ```

  Only effective on the cascaded path — realtime models bypass the TTS step entirely.
</ParamField>
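To see what whole-word, case-insensitive replacement means in practice, here is a minimal sketch of the matching behavior described for `pronunciations`. It is an assumption about how such a step could work, not the worker's actual implementation; `applyPronunciations` and `escapeRegExp` are hypothetical helpers.

```typescript
// Escapes characters that have special meaning in a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Sketch of whole-word, case-insensitive replacement applied before TTS.
function applyPronunciations(text: string, pronunciations: Record<string, string>): string {
  let result = text;
  for (const word of Object.keys(pronunciations)) {
    const spoken = pronunciations[word];
    if (spoken === undefined) continue;
    // \b anchors the match to word boundaries; the "i" flag ignores case.
    const pattern = new RegExp(`\\b${escapeRegExp(word)}\\b`, 'gi');
    // A replacer function keeps "$" in `spoken` from being treated as a pattern.
    result = result.replace(pattern, () => spoken);
  }
  return result;
}

const spokenText = applyPronunciations('Your hvac unit moves 400 CFM.', {
  HVAC: 'H V A C',
  CFM: 'C F M',
});
console.log(spokenText); // "Your H V A C unit moves 400 C F M."
```

Because matching is on whole words, a key such as `CFM` leaves `CFMs` untouched; add a separate entry if the plural also needs fixing.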

<ParamField path="backgroundAudio" type="{ ambient?, thinking? }">
  Background audio layered onto the agent's output. Pass a built-in clip name, a `{ source, volume, probability }` config, or an array (probabilistic mix).

  Built-in clips: `'office-ambience'`, `'keyboard-typing'`, `'keyboard-typing-2'`.

  ```typescript theme={null}
  backgroundAudio: {
    ambient: 'office-ambience',
    thinking: 'keyboard-typing',
  }
  ```
</ParamField>
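The config-object and array forms are not shown above, so the following sketch illustrates one plausible shape. Everything beyond the field names `source`, `volume`, and `probability` is an assumption: the local types are hypothetical, and the 0 to 1 scale used for `volume` and `probability` is a guess that you should confirm before relying on it.

```typescript
// Hypothetical types, inferred from the three forms described above.
type BuiltInClip = 'office-ambience' | 'keyboard-typing' | 'keyboard-typing-2';

interface ClipConfig {
  source: BuiltInClip;
  volume?: number;      // ASSUMPTION: 0 to 1 gain
  probability?: number; // ASSUMPTION: 0 to 1 weight within an array
}

type AudioSetting = BuiltInClip | ClipConfig | ClipConfig[];

const backgroundAudio: { ambient?: AudioSetting; thinking?: AudioSetting } = {
  // Config form: a single clip played more quietly.
  ambient: { source: 'office-ambience', volume: 0.3 },
  // Array form: a probabilistic mix of two typing clips.
  thinking: [
    { source: 'keyboard-typing', probability: 0.7 },
    { source: 'keyboard-typing-2', probability: 0.3 },
  ],
};

console.log(backgroundAudio);
```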

<ParamField path="volume" type="number">
  Output speech volume, 0–100. Applied as a per-frame multiplier. Omit to pass the TTS provider's native level through unchanged.
</ParamField>

<ParamField path="persistTranscript" type="boolean" default="false">
  When `true`, the worker writes `session.history` to `Data.set('call:<sessionId>')` after the call ends. Read it back from a job or webhook with `Data.get('call:<sessionId>')` for post-call analytics, follow-ups, or QA.
</ParamField>
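Reading a persisted transcript back could look like the sketch below. It assumes the stored value is an array of `{ role, content }` messages, which this page does not confirm, and it takes the store as a parameter so the example stays self-contained. `KeyValueStore`, `transcriptKey`, and `countUserTurns` are hypothetical names.

```typescript
// Hypothetical minimal shapes for this sketch; the real `Data` API and the
// stored transcript format may differ.
interface TranscriptMessage {
  role: string;
  content: string;
}

interface KeyValueStore {
  get(key: string): Promise<unknown>;
}

// Builds the documented key for a call's transcript.
function transcriptKey(sessionId: string): string {
  return `call:${sessionId}`;
}

// Counts user turns in a persisted transcript; returns 0 if nothing usable was stored.
async function countUserTurns(store: KeyValueStore, sessionId: string): Promise<number> {
  const stored = await store.get(transcriptKey(sessionId));
  if (!Array.isArray(stored)) return 0;
  return (stored as TranscriptMessage[]).filter((m) => m.role === 'user').length;
}
```

In a job or webhook you would pass a thin adapter whose `get` delegates to `Data.get`, which keeps the analytics logic testable without a live call.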

<ParamField path="tools" type="Array<LuaTool | LuaVoiceTool>">
  Voice-specific tools in addition to skills attached to the owning agent. See [Defining Voice Tools](#defining-voice-tools).
</ParamField>

***

## Lifecycle Hooks

Three hooks let you wire up per-session state, RAG injection, and post-call work.

<ParamField path="onEnter" type="(ctx: LuaVoiceHookContext) => Promise<void>">
  Fires after the session connects to the room and **before** the greeting. Use it to hydrate `session.userdata` from `User`, `Data`, etc., or to set up any per-call state.

  ```typescript theme={null}
  onEnter: async (ctx) => {
    if (ctx.caller?.phoneNumber) {
      const user = await User.get({ phone: ctx.caller.phoneNumber });
      ctx.session.userdata = { user, returning: !!user };
    }
  },
  ```
</ParamField>

<ParamField path="onUserTurnCompleted" type="(turnCtx, message) => Promise<void>">
  Fires after the user finishes a turn, **before** the LLM is invoked. This is the canonical RAG-injection point — `turnCtx.addMessage(...)` adds context messages the LLM sees on this turn.

  ```typescript theme={null}
  onUserTurnCompleted: async (turnCtx, message) => {
    const docs = await Data.search('kb', message.content, 3);
    for (const doc of docs) {
      turnCtx.addMessage({ role: 'system', content: doc.text });
    }
  },
  ```
</ParamField>

<ParamField path="onExit" type="(ctx: LuaVoiceHookContext) => Promise<void>">
  Fires when the session is closing. Use for transcript persistence, outcome reporting, CRM updates, etc.
</ParamField>
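An `onExit` handler often boils down to summarizing the session and handing the result to some reporting function. The sketch below uses a hypothetical, reduced context type (`HookContextSketch`) that contains only the fields mentioned on this page (`caller.phoneNumber`, `session.userdata`, `session.history`); the real `LuaVoiceHookContext` may differ, and `report` stands in for whatever you call at hang-up.

```typescript
// Hypothetical minimal context shape; the real LuaVoiceHookContext has more
// fields and may name them differently.
interface HookContextSketch {
  caller?: { phoneNumber?: string };
  session: {
    userdata?: Record<string, unknown>;
    history: Array<{ role: string; content: string }>;
  };
}

interface CallSummary {
  phoneNumber: string | null;
  turns: number;
  returning: boolean;
}

// Pure helper: derives a small outcome record from the session state.
function summarizeCall(ctx: HookContextSketch): CallSummary {
  return {
    phoneNumber: ctx.caller?.phoneNumber ?? null,
    turns: ctx.session.history.length,
    returning: ctx.session.userdata?.['returning'] === true,
  };
}

// `report` is a placeholder for a CRM update, webhook call, or similar.
function makeOnExit(report: (summary: CallSummary) => Promise<void>) {
  return async (ctx: HookContextSketch): Promise<void> => {
    await report(summarizeCall(ctx));
  };
}
```

Keeping the summary logic in a pure function means you can unit test it without a voice session.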

***

## Defining Voice Tools

Voice tools run during a voice conversation. `LuaVoiceTool` is a concrete class — **instantiate** it with a config object:

```typescript theme={null}
import { Data, LuaVoiceTool } from 'lua-cli';
import { z } from 'zod';

export const getOrderStatusTool = new LuaVoiceTool({
  name: 'getOrderStatus',
  description: 'Look up the status of an order by ID',
  inputSchema: z.object({ orderId: z.string() }),
  execute: async (input, ctx) => {
    const order = await Data.get('orders', input.orderId);
    return { status: order.status, eta: order.eta };
  },
});
```

### Config fields

<ParamField path="name" type="string" required>
  Tool name. Used by the LLM to identify and call the tool.
</ParamField>

<ParamField path="description" type="string" required>
  What the tool does. An action-oriented description that the LLM reads when deciding whether to invoke the tool.
</ParamField>

<ParamField path="inputSchema" type="ZodType" required>
  Zod schema for the tool's input. Validated before `execute` is called.
</ParamField>

<ParamField path="execute" type="(input, ctx?: LuaVoiceToolCtx) => Promise<any>" required>
  Tool body. Receives the validated input and an optional voice-specific context.
</ParamField>

<ParamField path="condition" type="() => Promise<boolean>">
  Optional gate. When provided, the tool is only exposed to the LLM if `condition()` returns `true`. Use for feature flags or runtime availability checks.
</ParamField>
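A `condition` gate backed by a feature flag can be as small as the sketch below. The in-memory `featureFlags` object and the `whenFlagEnabled` helper are hypothetical stand-ins for your own flag source; only the `() => Promise<boolean>` signature comes from this page.

```typescript
// Hypothetical in-memory flag store; swap in your own feature-flag source.
const featureFlags: Record<string, boolean> = { refunds: false };

// Builds a `condition` callback that resolves to true only while the flag is on.
function whenFlagEnabled(flag: string): () => Promise<boolean> {
  return async () => featureFlags[flag] === true;
}

// Passed as `condition` in a tool config, this hides the tool while the flag is off.
const refundCondition = whenFlagEnabled('refunds');
```

Unknown flags resolve to `false`, so a typo in the flag name hides the tool instead of exposing it.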

<ParamField path="flags" type="ToolFlag[]">
  Voice-specific tool flags (e.g. controlling barge-in behavior).
</ParamField>

### ctx — `LuaVoiceToolCtx`

<ParamField path="ctx.toolCallId" type="string">
  Identifier for this specific tool invocation.
</ParamField>

<ParamField path="ctx.voice.say" type="(text: string) => Promise<void>">
  Speak `text` to the caller via the active LiveKit session. Useful for status updates during long-running tool work ("Looking that up — one moment.").
</ParamField>
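A typical use of `ctx.voice.say` is to fill the silence before a slow lookup. The sketch below uses a hypothetical, reduced context type (`VoiceToolCtxSketch`) with only the fields documented in this section, and a `lookup` parameter that stands in for your backend call. It guards both `ctx` and `ctx.voice` because `ctx` is optional in the `execute` signature.

```typescript
// Hypothetical minimal context shape, based on the fields documented here.
interface VoiceToolCtxSketch {
  toolCallId: string;
  voice?: { say(text: string): Promise<void> };
}

// `lookup` stands in for a slow backend call.
async function executeWithStatusUpdate(
  input: { orderId: string },
  ctx: VoiceToolCtxSketch | undefined,
  lookup: (orderId: string) => Promise<{ status: string }>,
): Promise<{ status: string }> {
  // Tell the caller we are working before starting the slow call.
  // Optional chaining makes this a no-op when no voice context is present.
  await ctx?.voice?.say('Looking that up, one moment.');
  return lookup(input.orderId);
}
```

The same function body still works when the tool runs outside a voice session, because the status update is simply skipped.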

<ParamField path="ctx.voice.transferToHuman" type="(msisdn, opts?) => Promise<void>">
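  Transfer the live caller to a human at `msisdn`, the destination phone number in international format (e.g. `'+32477123456'`). Two mechanisms are available, selected with `mode`: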

  * **`mode: 'refer'`** (default) — SIP REFER on the inbound leg. Cheap (one billed leg) but depends on the inbound carrier accepting REFER end-to-end.
  * **`mode: 'bridge'`** — dial the human as a second SIP participant into the same room. Two billed legs but works regardless of carrier REFER support. Use for high-stakes transfers.

  `announce` is spoken before the transfer fires.

  ```typescript theme={null}
  await ctx.voice?.transferToHuman('+32477123456', {
    mode: 'bridge',
    announce: 'Transferring you to our sales team — one moment.',
  });
  ```
</ParamField>

You can also share regular `LuaTool` instances between chat skills and voice tools: the `tools` field accepts both `LuaTool` and `LuaVoiceTool` instances, so pass them in the same array.

***

## Function-style: `defineVoice`

Equivalent to `new LuaVoice(config)` if you prefer a function call:

```typescript theme={null}
import { defineVoice } from 'lua-cli';

export default defineVoice({
  name: 'support-line',
  llm: 'openai/gpt-5.1-chat-latest',
  stt: 'deepgram/nova-3',
  tts: 'elevenlabs/eleven_turbo_v2_5:pwMBn0SsmN1220Aorv15',
});
```

***

## Wiring Up to an Agent

```typescript theme={null}
import { LuaAgent } from 'lua-cli';
import supportLine from './voices/support-line.voice';
import supportSkill from './skills/support.skill';

export const agent = new LuaAgent({
  name: 'support-agent',
  persona: {
    base: 'You are a helpful support agent for Acme Corp.',
    voice: `Speak conversationally in two sentences or fewer. No markdown. Never output digits — spell numbers and prices in full English words ("one hundred twenty-nine dollars", "nine o'clock", "fifty miles").`,
    text: 'Use markdown headers and bullet lists where helpful.',
  },
  voices: [supportLine],
  skills: [supportSkill],
});
```

The agent's `persona.voice` branch is what gives `supportLine` its voice-specific prompt.

<Tip>
  **Voice persona tips:**

  * Keep replies short (1–2 sentences). Voice users can't skim.
  * No markdown — TTS reads it literally.
  * Spell out numbers and prices ("nine o'clock", "twenty dollars") — TTS reads digits robotically otherwise.
</Tip>

***

## Related

* [Voice Command](/cli/voice-command) — live testing and voice test suites
* [Voice Plugins](/api/voice-plugins) — direct plugin route (Deepgram, ElevenLabs class forms)
* [Persona Command](/cli/persona-command#channel-aware-personas) — voice-specific personas on the parent agent
* [LuaAgent API](/api/luaagent)
