Speech Segmentation Control

2026-03-30

Since the LLM (Large Language Model) does not support streaming input, the service must use real-time ASR (Automatic Speech Recognition) results to determine whether the user has finished speaking, and only then request a new round of Q&A from the LLM.

To determine whether the user has finished speaking, check these parameters:

  • VADSilenceSegmentation
  • PauseInterval

Parameter Description

The two parameters that determine whether the user has finished speaking are part of the ASR parameters used when registering/updating agents and creating/updating agent instances. See Register Agent > Body > ASR Parameters for details.

| Parameter Name | Type | Required | Description |
| --- | --- | --- | --- |
| VADSilenceSegmentation | Number | No | Sets the silence duration (in milliseconds) after which two utterances are no longer considered one. Range: [200, 2000]. Default: 500. |
| PauseInterval | Number | No | Sets the duration (in milliseconds) within which two utterances are considered one, enabling ASR multi-sentence concatenation. Range: [200, 2000]. Multi-sentence concatenation is enabled only when this value is greater than VADSilenceSegmentation. |
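As a minimal sketch, the two parameters can be set together in the agent's ASR configuration. The wrapper names below (`asr_config`, `agent_body`, `"ASRConfig"`) are hypothetical, used only for illustration; only the two parameter names, units, and the validity rule come from the table above:

```python
# Hypothetical Register Agent body fragment. "ASRConfig" and "agent_body"
# are illustrative names, not confirmed field names from the API.
asr_config = {
    "VADSilenceSegmentation": 500,  # ms of silence after which utterances are split
    "PauseInterval": 1000,          # ms; takes effect only if > VADSilenceSegmentation
}

# Multi-sentence concatenation only applies when PauseInterval exceeds
# VADSilenceSegmentation, so validate the pair before sending the request.
assert asr_config["PauseInterval"] > asr_config["VADSilenceSegmentation"]

agent_body = {"ASRConfig": asr_config}
```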

Scenario Examples

Configuration 1: VADSilenceSegmentation = 500 ms, PauseInterval not set

The user is determined to have spoken twice, resulting in 2 rounds of Q&A.

Round 1 (context: empty):
- user: The weather is nice today. I want to go out
- assistant: Response 1 (interrupted by round 2)

Round 2 (context: the first Q&A round):
- user: What about you?
- assistant: Response 2

Configuration 2: VADSilenceSegmentation = 500 ms, PauseInterval = 1000 ms

The user is determined to have spoken once, resulting in 1 round of Q&A.

Round 1 (context: empty):
- user: The weather is nice today. I want to go out. What about you?
- assistant: Response 1
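The two scenarios above can be modeled with a toy segmentation function. This is an illustrative sketch, not the service's actual algorithm: it counts the resulting Q&A rounds from the silence gaps (in ms) between consecutive utterances, assuming a gap merges two utterances when it is shorter than the effective threshold (PauseInterval when it is set and greater than VADSilenceSegmentation, otherwise VADSilenceSegmentation):

```python
def count_rounds(gaps_ms, vad_silence=500, pause_interval=None):
    """Toy model: number of Q&A rounds produced by a sequence of
    inter-utterance silence gaps (milliseconds). Illustrative only."""
    rounds = 1
    for gap in gaps_ms:
        if pause_interval is not None and pause_interval > vad_silence:
            # Concatenation enabled: merge while the pause is short enough
            merged = gap < pause_interval
        else:
            # Concatenation disabled: VAD silence alone splits utterances
            merged = gap < vad_silence
        if not merged:
            rounds += 1
    return rounds

# An 800 ms pause between the two sentences in the examples above:
print(count_rounds([800]))                       # 2 rounds, PauseInterval unset
print(count_rounds([800], pause_interval=1000))  # 1 round, sentences concatenated
```

With PauseInterval unset, the 800 ms gap exceeds the 500 ms VAD threshold and produces two rounds; with PauseInterval = 1000 ms, the same gap falls inside the concatenation window and produces one.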

Best Practice Configurations

Note
If you're unsure which configuration works better, we recommend the Scenario 2 configuration.
| Scenario | VADSilenceSegmentation | PauseInterval |
| --- | --- | --- |
| Scenario 1: Users speak in short, frequent bursts (e.g., companionship scenarios) | 500 ms | Not set |
| Scenario 2: Users have mixed-length content and are sensitive to latency (e.g., customer service scenarios) | 500 ms | 1000-1500 ms |
| Scenario 3: Users typically speak for longer durations and are less sensitive to latency | 1000 ms | Not set |

