Speech Segmentation Control
Because the LLM (Large Language Model) does not accept streaming input, the service must decide from real-time ASR (Automatic Speech Recognition) results whether the user has finished speaking, and only then ask the LLM to start a new round of Q&A.
Two parameters control this decision: VADSilenceSegmentation and PauseInterval.
Parameter Description
Both parameters belong to the ASR parameters used when registering or updating an agent and when creating or updating an agent instance. For details, see Register Agent > Body > ASR Parameters.
| Parameter Name | Type | Required | Description |
|---|---|---|---|
| VADSilenceSegmentation | Number | No | Silence duration, in milliseconds, beyond which two utterances are no longer treated as one. Range: [200, 2000]. Default: 500. |
| PauseInterval | Number | No | Time window, in milliseconds, within which two utterances are treated as one, which enables ASR multi-sentence concatenation. Range: [200, 2000]. Concatenation takes effect only when this value is greater than VADSilenceSegmentation. |
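
The constraints in the table can be checked before a request is sent. The following is a minimal sketch, not part of the service SDK: the helper names `build_asr_params` and `concatenation_enabled` are hypothetical, and the way the resulting dictionary is embedded in the Register Agent request body is not shown here (see Register Agent > Body > ASR Parameters).

```python
from typing import Dict, Optional

# Limits and default taken from the parameter table above.
MIN_MS, MAX_MS = 200, 2000
VAD_DEFAULT_MS = 500


def build_asr_params(
    vad_silence_ms: Optional[int] = None,
    pause_interval_ms: Optional[int] = None,
) -> Dict[str, int]:
    """Hypothetical helper: validate both values and return only the ones that were set."""
    params: Dict[str, int] = {}
    candidates = (
        ("VADSilenceSegmentation", vad_silence_ms),
        ("PauseInterval", pause_interval_ms),
    )
    for name, value in candidates:
        if value is None:
            continue  # both parameters are optional
        if not MIN_MS <= value <= MAX_MS:
            raise ValueError(f"{name} must be within [{MIN_MS}, {MAX_MS}] ms, got {value}")
        params[name] = value
    return params


def concatenation_enabled(params: Dict[str, int]) -> bool:
    """Multi-sentence concatenation applies only if PauseInterval > VADSilenceSegmentation."""
    pause = params.get("PauseInterval")
    if pause is None:
        return False
    return pause > params.get("VADSilenceSegmentation", VAD_DEFAULT_MS)
```

For example, `build_asr_params(500, 1000)` yields a configuration with concatenation enabled, while `build_asr_params(1000, 800)` is valid but leaves concatenation off because PauseInterval is not greater than VADSilenceSegmentation.
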
Scenario Examples
| Configuration | Q&A Results |
|---|---|
| VADSilenceSegmentation = 500ms, PauseInterval not set | The user is determined to have spoken twice, resulting in 2 rounds of Q&A. Round 1: user: "The weather is nice today. I want to go out"; assistant: Response 1 (interrupted by round 2); context: empty. Round 2: user: "What about you?"; assistant: Response 2; context: the first Q&A round. |
| VADSilenceSegmentation = 500ms, PauseInterval = 1000ms | The user is determined to have spoken once, resulting in 1 round of Q&A. User: "The weather is nice today. I want to go out. What about you?"; assistant: Response 1; context: empty. |
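
The two outcomes in the table can be reproduced with a simplified model of the segmentation rule. This is only an illustrative sketch, not the service's actual implementation: the function `split_into_turns` is hypothetical, the 800 ms pause before "What about you?" is an assumed value (the table does not state the pause length), and treating a silence exactly equal to the threshold as a split is also an assumption.

```python
from typing import List, Optional, Tuple


def split_into_turns(
    chunks: List[Tuple[int, str]],
    vad_silence_ms: int = 500,
    pause_interval_ms: Optional[int] = None,
) -> List[str]:
    """Group recognized chunks into user turns.

    Each chunk is (silence_before_ms, text). A chunk starts a new turn when the
    silence before it reaches the effective threshold; otherwise it is appended
    to the current turn.
    """
    threshold = vad_silence_ms
    # Concatenation only applies when PauseInterval is greater than VADSilenceSegmentation.
    if pause_interval_ms is not None and pause_interval_ms > vad_silence_ms:
        threshold = pause_interval_ms

    turns: List[str] = []
    for silence_before_ms, text in chunks:
        if turns and silence_before_ms < threshold:
            turns[-1] = turns[-1] + " " + text
        else:
            turns.append(text)
    return turns


# Assumed timing: the user pauses 800 ms before the last sentence.
speech = [
    (0, "The weather is nice today. I want to go out."),
    (800, "What about you?"),
]

two_rounds = split_into_turns(speech, vad_silence_ms=500)
one_round = split_into_turns(speech, vad_silence_ms=500, pause_interval_ms=1000)
```

Under these assumptions, `two_rounds` contains two separate user turns and `one_round` contains a single concatenated turn, matching the first and second rows of the table.
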
Best Practice Configurations
Note
If you are unsure which configuration fits your use case, we recommend the Scenario 2 configuration.
| Scenario | VADSilenceSegmentation | PauseInterval |
|---|---|---|
| Scenario 1: Users speak in short, frequent bursts, for example in companionship scenarios | 500ms | Not set |
| Scenario 2: Users speak in utterances of mixed length and are sensitive to latency, for example in customer service scenarios | 500ms | 1000 to 1500ms |
| Scenario 3: Users typically speak for longer durations and are less sensitive to latency | 1000ms | Not set |
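
The recommendations above can be kept as named presets in application code. The sketch below is one possible arrangement: the preset keys and the function `recommended_asr_params` are hypothetical names, and 1000 is chosen for Scenario 2 only as one value from the recommended 1000 to 1500 ms range.

```python
from typing import Dict, Optional

# Presets mirroring the best-practice table; "Not set" means the key is omitted.
RECOMMENDED_ASR_PARAMS: Dict[str, Dict[str, int]] = {
    "short_bursts": {"VADSilenceSegmentation": 500},  # Scenario 1
    "mixed_latency_sensitive": {  # Scenario 2
        "VADSilenceSegmentation": 500,
        "PauseInterval": 1000,  # assumed pick from the 1000 to 1500 ms range
    },
    "long_speech": {"VADSilenceSegmentation": 1000},  # Scenario 3
}


def recommended_asr_params(scenario: Optional[str] = None) -> Dict[str, int]:
    """Return a copy of the preset; fall back to Scenario 2 when no scenario is given."""
    key = scenario or "mixed_latency_sensitive"
    return dict(RECOMMENDED_ASR_PARAMS[key])
```

Returning a copy keeps the shared presets unchanged if a caller adjusts a value, for example raising PauseInterval toward 1500 for slower speakers.
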

