
Speech Segmentation Control

Because the LLM (Large Language Model) does not support streaming input, whether the user has finished speaking must be determined from real-time ASR (Automatic Speech Recognition) results before the LLM is asked to start a new round of Q&A.

To determine whether the user has finished speaking, check these parameters:

  • VADSilenceSegmentation
  • PauseInterval

Parameter Description

Two parameters affect how the end of the user's speech is determined. Both belong to the ASR parameters used when registering/updating agents and when creating/updating agent instances. The detailed descriptions are as follows:

  • VADSilenceSegmentation (Number, optional): Sets the silence duration (in milliseconds) after which two utterances are no longer considered as one. Range: [200, 2000]. Default: 500.
  • PauseInterval (Number, optional): Sets the silence duration (in milliseconds) within which two utterances are considered as one, enabling ASR multi-sentence concatenation. Range: [200, 2000]. Multi-sentence concatenation takes effect only when this value is greater than VADSilenceSegmentation.
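The constraints above can be sketched as a small helper that builds the ASR parameter fragment. Only the two parameter names and their documented ranges come from this page; the function name and the surrounding payload structure are illustrative assumptions.

```python
# Hypothetical helper; only VADSilenceSegmentation / PauseInterval and their
# documented ranges come from this page.
def build_asr_config(vad_silence_ms=500, pause_interval_ms=None):
    """Build an ASR parameter dict, validating the documented ranges."""
    if not 200 <= vad_silence_ms <= 2000:
        raise ValueError("VADSilenceSegmentation must be in [200, 2000] ms")
    config = {"VADSilenceSegmentation": vad_silence_ms}
    if pause_interval_ms is not None:
        if not 200 <= pause_interval_ms <= 2000:
            raise ValueError("PauseInterval must be in [200, 2000] ms")
        if pause_interval_ms <= vad_silence_ms:
            # Multi-sentence concatenation only takes effect when
            # PauseInterval > VADSilenceSegmentation.
            raise ValueError("PauseInterval must exceed VADSilenceSegmentation")
        config["PauseInterval"] = pause_interval_ms
    return config

print(build_asr_config(500, 1000))
# {'VADSilenceSegmentation': 500, 'PauseInterval': 1000}
```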

Scenario Examples

[Figure: asr_vad_example — timeline of two utterances separated by a pause]

Configuration 1: VADSilenceSegmentation = 500 ms, PauseInterval not set
The user is determined to have spoken twice, resulting in 2 rounds of Q&A.
Round 1 (context: empty):
  - user: The weather is nice today. I want to go out
  - assistant: Response 1 (interrupted by Round 2)
Round 2 (context: first Q&A round):
  - user: What about you?
  - assistant: Response 2

Configuration 2: VADSilenceSegmentation = 500 ms, PauseInterval = 1000 ms
The user is determined to have spoken once, resulting in 1 round of Q&A.
Round 1 (context: empty):
  - user: The weather is nice today. I want to go out. What about you?
  - assistant: Response 1
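The segmentation behavior shown above can be modeled as a simple decision function. This is an illustrative model of the documented behavior, not the service's actual implementation: given the silence gap between two utterances, it decides whether they are concatenated into one user turn.

```python
def merged_into_one_turn(gap_ms, vad_silence_ms=500, pause_interval_ms=None):
    """Return True if two consecutive utterances separated by gap_ms of
    silence are treated as a single user turn.

    Illustrative model of the documented behavior:
    - Without PauseInterval, a gap longer than VADSilenceSegmentation splits
      the speech into two turns.
    - With PauseInterval > VADSilenceSegmentation, gaps up to PauseInterval
      are still concatenated into one turn.
    """
    threshold = vad_silence_ms
    if pause_interval_ms is not None and pause_interval_ms > vad_silence_ms:
        threshold = pause_interval_ms
    return gap_ms <= threshold

# An 800 ms pause splits the speech under Configuration 1...
print(merged_into_one_turn(800, 500))        # False -> 2 rounds of Q&A
# ...but is concatenated under Configuration 2 (PauseInterval = 1000 ms).
print(merged_into_one_turn(800, 500, 1000))  # True  -> 1 round of Q&A
```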

Best Practice Configurations

Note
If you're unsure which configuration works better, we recommend the Scenario 2 configuration.

  • Scenario 1: Users speak in short, frequent bursts (e.g., companionship scenarios). VADSilenceSegmentation = 500 ms; PauseInterval not set.
  • Scenario 2: Users produce mixed-length speech and are sensitive to latency (e.g., customer service scenarios). VADSilenceSegmentation = 500 ms; PauseInterval = 1000–1500 ms.
  • Scenario 3: Users typically speak for longer durations and are less sensitive to latency. VADSilenceSegmentation = 1000 ms; PauseInterval not set.
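The recommendations above can be kept as ready-made presets. The preset names below are hypothetical; the parameter values are the documented best-practice recommendations (with 1200 ms chosen from the recommended 1000–1500 ms range for Scenario 2).

```python
# Hypothetical preset names; values follow the documented best practices.
PRESETS = {
    "short_bursts":    {"VADSilenceSegmentation": 500},                         # Scenario 1
    "mixed_latency":   {"VADSilenceSegmentation": 500, "PauseInterval": 1200},  # Scenario 2
    "long_utterances": {"VADSilenceSegmentation": 1000},                        # Scenario 3
}
```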
