Speech Segmentation Control

2026-05-12

Because LLMs (Large Language Models) do not accept streaming input, the system must decide from real-time ASR (Automatic Speech Recognition) results whether the user has finished speaking, and only then request the LLM to start a new round of Q&A.

Two parameters control how the end of the user's speech is detected:

  • VAD.TurnDetectConfig.SilenceSegmentation
  • VAD.TurnDetectConfig.PauseInterval

Parameter Description

The parameters that affect how the end of the user's speech is detected belong to the VAD parameters passed when creating or updating an agent instance. For details, see Create Agent Instance > Body.

| Parameter Name | Type | Required | Description |
| --- | --- | --- | --- |
| VAD.TurnDetectConfig.SilenceSegmentation | Int | No | Sets the duration (in milliseconds) of silence after which two utterances are no longer considered as one. Range: [200, 2000]. Default: 500. |
| VAD.TurnDetectConfig.PauseInterval | Int | No | Sets the duration (in milliseconds) within which two utterances are considered as one, enabling ASR multi-sentence concatenation. Range: [200, 2000]. ASR multi-sentence concatenation is only enabled when this value is greater than SilenceSegmentation. |
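As a minimal sketch of how these two parameters might be assembled and checked before sending a request, the Python helper below builds a VAD fragment and validates the documented ranges. The nested `VAD` > `TurnDetectConfig` layout is an assumption inferred from the dotted parameter names, and `build_vad_config` and `concatenation_enabled` are hypothetical helper names, not part of any SDK. Refer to Create Agent Instance > Body for the actual request structure.

```python
from typing import Optional

# Allowed range for both parameters, in milliseconds (from the parameter table).
MIN_MS, MAX_MS = 200, 2000
DEFAULT_SILENCE_SEGMENTATION_MS = 500


def build_vad_config(silence_segmentation: int = DEFAULT_SILENCE_SEGMENTATION_MS,
                     pause_interval: Optional[int] = None) -> dict:
    """Return a VAD fragment for a create/update agent instance request body.

    The nested layout is an assumption inferred from the dotted parameter names.
    """
    for name, value in (("SilenceSegmentation", silence_segmentation),
                        ("PauseInterval", pause_interval)):
        if value is not None and not MIN_MS <= value <= MAX_MS:
            raise ValueError(f"{name} must be within [{MIN_MS}, {MAX_MS}] ms, got {value}")

    turn_detect = {"SilenceSegmentation": silence_segmentation}
    if pause_interval is not None:
        # Omit PauseInterval entirely when it is not set.
        turn_detect["PauseInterval"] = pause_interval
    return {"VAD": {"TurnDetectConfig": turn_detect}}


def concatenation_enabled(config: dict) -> bool:
    """Concatenation is on only if PauseInterval is set and exceeds SilenceSegmentation."""
    turn_detect = config["VAD"]["TurnDetectConfig"]
    pause = turn_detect.get("PauseInterval")
    return pause is not None and pause > turn_detect["SilenceSegmentation"]
```

For example, `build_vad_config(500, 1000)` yields a fragment for which `concatenation_enabled` returns `True`, while `build_vad_config(1000, 500)` is within range but leaves concatenation disabled because PauseInterval does not exceed SilenceSegmentation.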

Scenario Examples

Both examples use the same speech. The outcomes below imply that the user pauses for more than 500 ms but less than 1000 ms before saying "What about you?".

Configuration: SilenceSegmentation = 500 ms, PauseInterval not set
Q&A result: The user is determined to have spoken twice, resulting in 2 rounds of Q&A.
Round 1:
- user: The weather is nice today. I want to go out
- assistant: Response 1 (interrupted by round 2)
Context: Empty
Round 2:
- user: What about you?
- assistant: Response 2
Context: First Q&A round

Configuration: SilenceSegmentation = 500 ms, PauseInterval = 1000 ms
Q&A result: The user is determined to have spoken once, resulting in 1 round of Q&A.
Round 1:
- user: The weather is nice today. I want to go out. What about you?
- assistant: Response 1
Context: Empty
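The two outcomes can be reproduced with a simplified model of turn detection, sketched below in Python. This is an illustration and not the service's actual algorithm. It assumes that a turn ends once the silence after an utterance reaches an effective threshold, which is PauseInterval when that value is set and greater than SilenceSegmentation, and SilenceSegmentation otherwise. The function names and the 700 ms pause are hypothetical.

```python
from typing import List, Optional, Tuple


def effective_threshold_ms(silence_segmentation: int,
                           pause_interval: Optional[int]) -> int:
    """Silence (ms) needed to end a turn under this simplified model."""
    if pause_interval is not None and pause_interval > silence_segmentation:
        return pause_interval  # multi-sentence concatenation is enabled
    return silence_segmentation


def split_into_turns(utterances: List[Tuple[str, int]],
                     silence_segmentation: int = 500,
                     pause_interval: Optional[int] = None) -> List[str]:
    """Group (text, silence_after_ms) pairs into user turns."""
    threshold = effective_threshold_ms(silence_segmentation, pause_interval)
    turns: List[str] = []
    current: List[str] = []
    for text, silence_after_ms in utterances:
        current.append(text)
        if silence_after_ms >= threshold:
            # Silence is long enough: close the current turn.
            turns.append(" ".join(current))
            current = []
    if current:
        # Flush any trailing speech that was not closed by a long silence.
        turns.append(" ".join(current))
    return turns


speech = [
    ("The weather is nice today. I want to go out.", 700),  # hypothetical 700 ms pause
    ("What about you?", 3000),                              # user stops speaking
]
print(split_into_turns(speech, 500))        # two turns
print(split_into_turns(speech, 500, 1000))  # one merged turn
```

With SilenceSegmentation alone, the 700 ms pause ends the first turn, so the LLM is queried twice. With PauseInterval = 1000 ms, the same pause falls below the threshold and both sentences reach the LLM as a single query.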

Best Practice Configurations

Note
If you are unsure which configuration works better, we recommend using the Scenario 2 configuration.
| Scenario | VAD.TurnDetectConfig.SilenceSegmentation | VAD.TurnDetectConfig.PauseInterval |
| --- | --- | --- |
| Scenario 1: Users speak in short, frequent bursts. E.g., companionship scenarios | 500 ms | Not set |
| Scenario 2: Users have mixed-length content and are sensitive to latency. E.g., customer service scenarios | 500 ms | 1000 to 1500 ms |
| Scenario 3: Users typically speak for longer durations and are less sensitive to latency | 1000 ms | Not set |
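One way to keep these recommendations in application code is a small preset table, sketched here in Python. The preset keys and the `turn_detect_config` helper are hypothetical names. For Scenario 2, the value 1000 is simply one pick from the recommended 1000 to 1500 ms range and should be tuned for your users.

```python
from typing import Dict, Optional

# Recommended values in milliseconds; None means "leave the parameter unset".
PRESETS: Dict[str, Dict[str, Optional[int]]] = {
    "scenario_1_short_bursts": {"SilenceSegmentation": 500, "PauseInterval": None},
    "scenario_2_mixed_length": {"SilenceSegmentation": 500, "PauseInterval": 1000},
    "scenario_3_long_speech": {"SilenceSegmentation": 1000, "PauseInterval": None},
}


def turn_detect_config(scenario: str = "scenario_2_mixed_length") -> Dict[str, int]:
    """Return the TurnDetectConfig values for a scenario, omitting unset parameters."""
    preset = PRESETS[scenario]
    return {name: value for name, value in preset.items() if value is not None}
```

The default follows the note above: when no scenario is specified, the Scenario 2 values are used.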
