
Speech Segmentation Control

Because the LLM (Large Language Model) does not support streaming input, whether the user has finished speaking must be determined from real-time ASR (Automatic Speech Recognition) results before the LLM is asked to start a new round of Q&A.

To determine whether the user has finished speaking, check these parameters:

  • VADSilenceSegmentation
  • PauseInterval

Parameter Description

Two parameters affect how the end of the user's speech is determined. Both belong to the ASR parameters used when registering/updating agents and when creating/updating agent instances. The detailed descriptions are as follows:

  • VADSilenceSegmentation (Number, optional): Sets the silence duration (in milliseconds) after which two utterances are no longer considered as one. Range: [200, 2000]. Default: 500.
  • PauseInterval (Number, optional): Sets the silence duration (in milliseconds) within which two utterances are considered as one, enabling ASR multi-sentence concatenation. Range: [200, 2000]. Multi-sentence concatenation takes effect only when this value is greater than VADSilenceSegmentation.
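The constraints above can be sketched as a small helper that builds the ASR parameter fragment. Only the two parameter names and their documented ranges come from this page; the function name and the surrounding payload structure are illustrative assumptions.

```python
# Hypothetical helper; only VADSilenceSegmentation / PauseInterval and their
# documented ranges come from this page.
def build_asr_config(vad_silence_ms=500, pause_interval_ms=None):
    """Build an ASR parameter dict, validating the documented ranges."""
    if not 200 <= vad_silence_ms <= 2000:
        raise ValueError("VADSilenceSegmentation must be in [200, 2000] ms")
    config = {"VADSilenceSegmentation": vad_silence_ms}
    if pause_interval_ms is not None:
        if not 200 <= pause_interval_ms <= 2000:
            raise ValueError("PauseInterval must be in [200, 2000] ms")
        if pause_interval_ms <= vad_silence_ms:
            # Multi-sentence concatenation only takes effect when
            # PauseInterval > VADSilenceSegmentation.
            raise ValueError("PauseInterval must exceed VADSilenceSegmentation")
        config["PauseInterval"] = pause_interval_ms
    return config

print(build_asr_config(500, 1000))
# {'VADSilenceSegmentation': 500, 'PauseInterval': 1000}
```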

Scenario Examples

[Figure: asr_vad_example — timeline of two utterances separated by a pause]

Configuration 1: VADSilenceSegmentation = 500 ms, PauseInterval not set
The user is determined to have spoken twice, resulting in 2 rounds of Q&A.
Round 1 (context: empty):
  - user: The weather is nice today. I want to go out
  - assistant: Response 1 (interrupted by Round 2)
Round 2 (context: first Q&A round):
  - user: What about you?
  - assistant: Response 2

Configuration 2: VADSilenceSegmentation = 500 ms, PauseInterval = 1000 ms
The user is determined to have spoken once, resulting in 1 round of Q&A.
Round 1 (context: empty):
  - user: The weather is nice today. I want to go out. What about you?
  - assistant: Response 1
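The segmentation behavior shown above can be modeled as a simple decision function. This is an illustrative model of the documented behavior, not the service's actual implementation: given the silence gap between two utterances, it decides whether they are concatenated into one user turn.

```python
def merged_into_one_turn(gap_ms, vad_silence_ms=500, pause_interval_ms=None):
    """Return True if two consecutive utterances separated by gap_ms of
    silence are treated as a single user turn.

    Illustrative model of the documented behavior:
    - Without PauseInterval, a gap longer than VADSilenceSegmentation splits
      the speech into two turns.
    - With PauseInterval > VADSilenceSegmentation, gaps up to PauseInterval
      are still concatenated into one turn.
    """
    threshold = vad_silence_ms
    if pause_interval_ms is not None and pause_interval_ms > vad_silence_ms:
        threshold = pause_interval_ms
    return gap_ms <= threshold

# An 800 ms pause splits the speech under Configuration 1...
print(merged_into_one_turn(800, 500))        # False -> 2 rounds of Q&A
# ...but is concatenated under Configuration 2 (PauseInterval = 1000 ms).
print(merged_into_one_turn(800, 500, 1000))  # True  -> 1 round of Q&A
```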

Best Practice Configurations

Note
If you're unsure which configuration works better, we recommend the Scenario 2 configuration.

  • Scenario 1: Users speak in short, frequent bursts (e.g., companionship scenarios). VADSilenceSegmentation = 500 ms; PauseInterval not set.
  • Scenario 2: Users produce mixed-length speech and are sensitive to latency (e.g., customer service scenarios). VADSilenceSegmentation = 500 ms; PauseInterval = 1000–1500 ms.
  • Scenario 3: Users typically speak for longer durations and are less sensitive to latency. VADSilenceSegmentation = 1000 ms; PauseInterval not set.
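The recommendations above can be kept as ready-made presets. The preset names below are hypothetical; the parameter values are the documented best-practice recommendations (with 1200 ms chosen from the recommended 1000–1500 ms range for Scenario 2).

```python
# Hypothetical preset names; values follow the documented best practices.
PRESETS = {
    "short_bursts":    {"VADSilenceSegmentation": 500},                         # Scenario 1
    "mixed_latency":   {"VADSilenceSegmentation": 500, "PauseInterval": 1200},  # Scenario 2
    "long_utterances": {"VADSilenceSegmentation": 1000},                        # Scenario 3
}
```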
