Speech Segmentation Control
2026-05-12
Because the LLM (Large Language Model) does not accept streaming input, the system must decide from real-time ASR (Automatic Speech Recognition) results whether the user has finished speaking, and only then request the LLM to start a new round of Q&A.
Two parameters determine whether the user has finished speaking:
VAD.TurnDetectConfig.SilenceSegmentation
VAD.TurnDetectConfig.PauseInterval
Parameter Description
The parameters that determine when the user has finished speaking are part of the VAD parameters passed when creating or updating an agent instance. For details, see Create Agent Instance > Body.
| Parameter Name | Type | Required | Description |
|---|---|---|---|
| VAD.TurnDetectConfig.SilenceSegmentation | Int | No | Silence duration, in milliseconds, after which two utterances are no longer treated as one. Range: [200, 2000]. Default: 500. |
| VAD.TurnDetectConfig.PauseInterval | Int | No | Duration, in milliseconds, within which two utterances are still treated as one, which enables ASR multi-sentence concatenation. Range: [200, 2000]. Concatenation takes effect only when this value is greater than SilenceSegmentation. |
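As a rough illustration of the table above, the following Python sketch builds a VAD fragment and checks the documented constraints. The helper names `build_vad_config` and `concatenation_enabled` are hypothetical, and the nested dictionary layout is an assumption inferred from the dotted parameter names; check Create Agent Instance > Body for the actual request shape.

```python
from typing import Optional

MIN_MS, MAX_MS = 200, 2000  # documented range for both parameters


def build_vad_config(silence_segmentation: int = 500,
                     pause_interval: Optional[int] = None) -> dict:
    """Return a VAD fragment for a create/update agent instance body.

    The nesting is an assumption derived from the dotted parameter names.
    """
    for name, value in (("SilenceSegmentation", silence_segmentation),
                        ("PauseInterval", pause_interval)):
        if value is not None and not MIN_MS <= value <= MAX_MS:
            raise ValueError(
                f"{name} must be within [{MIN_MS}, {MAX_MS}] ms, got {value}")

    turn_detect = {"SilenceSegmentation": silence_segmentation}
    if pause_interval is not None:  # optional parameter, omitted when not set
        turn_detect["PauseInterval"] = pause_interval
    return {"VAD": {"TurnDetectConfig": turn_detect}}


def concatenation_enabled(config: dict) -> bool:
    """Concatenation requires PauseInterval > SilenceSegmentation."""
    turn_detect = config["VAD"]["TurnDetectConfig"]
    pause = turn_detect.get("PauseInterval")
    return pause is not None and pause > turn_detect["SilenceSegmentation"]
```

Validating on the client side is a design choice in this sketch: it surfaces an out-of-range value or an ineffective PauseInterval before the request is sent.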
Scenario Examples
| Configuration | Q&A Results |
|---|---|
| SilenceSegmentation = 500 ms, PauseInterval not set | The user is determined to have spoken twice, resulting in 2 rounds of Q&A. Round 1: user: "The weather is nice today. I want to go out"; assistant: Response 1 (interrupted by round 2); context: empty. Round 2: user: "What about you?"; assistant: Response 2; context: first Q&A round. |
| SilenceSegmentation = 500 ms, PauseInterval = 1000 ms | The user is determined to have spoken once, resulting in 1 round of Q&A. User: "The weather is nice today. I want to go out. What about you?"; assistant: Response 1; context: empty. |
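The two outcomes above can be reproduced with a simplified model of the segmentation rule. This is a sketch under stated assumptions, not the service's implementation: the function `split_into_turns` is hypothetical, the 300 ms and 700 ms pauses are invented for illustration, and the strict `<` comparison at the threshold is a guess.

```python
from typing import List, Optional, Tuple


def split_into_turns(utterances: List[Tuple[int, str]],
                     silence_segmentation: int = 500,
                     pause_interval: Optional[int] = None) -> List[str]:
    """Group utterances into Q&A turns.

    Each item is (silence_before_ms, text). An utterance joins the previous
    turn when the silence before it is shorter than the effective threshold.
    """
    # PauseInterval only takes effect when it exceeds SilenceSegmentation.
    threshold = silence_segmentation
    if pause_interval is not None and pause_interval > silence_segmentation:
        threshold = pause_interval

    turns: List[List[str]] = []
    for silence_before_ms, text in utterances:
        if turns and silence_before_ms < threshold:
            turns[-1].append(text)   # same turn: concatenate
        else:
            turns.append([text])     # silence long enough: new turn
    return [" ".join(parts) for parts in turns]


# Hypothetical timings: a 300 ms pause, then a 700 ms pause.
speech = [
    (0, "The weather is nice today."),
    (300, "I want to go out."),
    (700, "What about you?"),
]

print(split_into_turns(speech, 500))        # two turns
print(split_into_turns(speech, 500, 1000))  # one turn
```

With these assumed timings, the 700 ms pause exceeds SilenceSegmentation but stays below PauseInterval, which is why the second configuration yields a single round.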
Best Practice Configurations
Note
If you are unsure which configuration fits your use case, we recommend the Scenario 2 configuration.
| Scenario | VAD.TurnDetectConfig.SilenceSegmentation | VAD.TurnDetectConfig.PauseInterval |
|---|---|---|
| Scenario 1: Users speak in short, frequent bursts, e.g., companionship scenarios | 500 ms | Not set |
| Scenario 2: Users speak content of mixed length and are sensitive to latency, e.g., customer service scenarios | 500 ms | 1000 to 1500 ms |
| Scenario 3: Users typically speak for longer durations and are less sensitive to latency | 1000 ms | Not set |
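The recommendations above could be kept as named presets in application code. The sketch below is illustrative only: the preset names and the `turn_detect_config` helper are hypothetical, the nested layout is an assumption inferred from the dotted parameter names, and 1000 ms is just one value picked from the recommended 1000 to 1500 ms range for Scenario 2.

```python
# Hypothetical preset names; values follow the best-practice table.
PRESETS = {
    "short_bursts": {"SilenceSegmentation": 500},                # Scenario 1
    "mixed_latency_sensitive": {"SilenceSegmentation": 500,
                                "PauseInterval": 1000},          # Scenario 2
    "long_form": {"SilenceSegmentation": 1000},                  # Scenario 3
}


def turn_detect_config(scenario: str = "mixed_latency_sensitive") -> dict:
    """Return a VAD fragment for a scenario; defaults to Scenario 2."""
    if scenario not in PRESETS:
        raise KeyError(f"unknown scenario: {scenario!r}")
    # Copy the preset so callers cannot mutate the shared table.
    return {"VAD": {"TurnDetectConfig": dict(PRESETS[scenario])}}
```

Defaulting to the Scenario 2 preset mirrors the note that it is the recommended choice when the best configuration is unclear.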
