Configuring ASR
2026-06-02
Function Introduction
To improve the accuracy of speech recognition (also called speech-to-text) in different scenarios, you can use the following methods:
- Select the appropriate vendor and recognition model.
- Select the appropriate language: By default, the Tencent and Aliyun Paraformer models recognize Chinese, and the Microsoft model recognizes English.
- Set recognition hot words: Many scenarios involve specialized words, such as character names, user IDs, and function names. You can set these as temporary hot words when creating an agent instance to improve recognition accuracy.
ZEGO currently supports the following ASR vendors and models:
- Tencent ASR: Standard version and Large Model version (including the Chinese-English-Cantonese + 9 dialects large model engine [Large Model Version], the Putonghua-English large model engine [Large Model Version], etc.). For details, see Tencent Cloud-Real-Time Speech Recognition.
- Aliyun Bailian:
  - Gummy series models (mainly Chinese, English, Japanese, etc.), Fun-ASR series models (mainly Chinese and dialects), and Paraformer series models (Mandarin, dialects, English, and some minority languages; not recommended). For details, see Real-Time Speech Recognition-Fun-ASR/Gummy/Paraformer.
  - Qwen series models: Mainly applicable to Chinese, English, and other languages. For details, see Real-Time Speech Recognition-Qwen.
- Volcengine Large Model streaming speech recognition model: Applicable to Chinese, English, and other languages. For details, see Volcengine Speech Recognition Large Model.
- Microsoft ASR: For details, see Microsoft Real-Time Speech Recognition.

For more ASR vendors and models, please contact the ZEGOCLOUD business team.
Prerequisites
Tencent is currently the only vendor enabled by default. To use Aliyun, Microsoft, Volcengine, or another vendor, please contact the ZEGOCLOUD business team to have it enabled.
Usage Method
Currently, ASR-related parameters can be set through the following four interfaces:
| Interface | Description |
|---|---|
| Register Agent | Set the vendor, hot words, language, and other parameters. |
| Create Agent Instance / Create Digital Human Agent Instance | Set the vendor, hot words, language, and other parameters. Note: If not set, the ASR parameters carried by the registered Agent (RegisterAgent) are used by default. |
| Update Agent Instance | Modify hot words and language. Note: For other parameters, please contact technical support for confirmation. |
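The relationship between these interfaces can be sketched roughly as follows. The field names `Vendor` and `Params` come from the ASR parameter table in this document. Everything else is an assumption made for illustration: the helper names are hypothetical, the exact vendor identifier strings are not confirmed here, and the fallback is modeled as replacing the whole ASR object rather than merging field by field, which the document does not specify.

```python
from typing import Any, Dict, Optional


def build_asr_config(vendor: str = "Tencent",
                     params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build an ASR object for a request body (hypothetical helper).

    "Tencent" mirrors the documented default vendor; the exact identifier
    string accepted by the API is an assumption.
    """
    asr: Dict[str, Any] = {"Vendor": vendor}
    if params:
        # Vendor-specific settings (hot words, language, ...) go in Params.
        asr["Params"] = dict(params)
    return asr


def effective_asr(registered: Dict[str, Any],
                  instance: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the ASR settings an agent instance would use.

    Assumption: if the instance request carries no ASR object, the one
    registered with the Agent applies as a whole.
    """
    return dict(instance) if instance else dict(registered)


registered = build_asr_config()                    # set at Register Agent
print(effective_asr(registered, None))             # instance sets nothing
print(effective_asr(registered, build_asr_config("Microsoft")))
```

Treat this as a model of the documented fallback only; the actual request shape for each interface is defined in its API reference.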
ASR Parameters
| Parameters | Type | Required | Description |
|---|---|---|---|
| Vendor | String | No | ASR vendor. Defaults to Tencent. |
| (deprecated hot word parameter) | String | No | This parameter has been deprecated. Set hot words through the Params extended parameters instead; see the hot word setting instructions for each vendor below. |
| Params | Object | No | Vendor-specific parameters. See the parameter setting instructions for each vendor below. |
| (deprecated silence segmentation parameter) | number | No | ⚠️ This parameter has been deprecated. Since v2.12.0, it has been migrated to VAD.TurnDetectConfig.SilenceSegmentation in the VAD structure. It sets how many milliseconds of silence after the user stops speaking end the current sentence, so that later speech is no longer treated as part of the same sentence. Range: [200, 2000]; default: 500. For details, see Speech Segmentation. |
| (deprecated pause interval parameter) | number | No | ⚠️ This parameter has been deprecated. Since v2.12.0, it has been migrated to VAD.TurnDetectConfig.PauseInterval in the VAD structure. It sets the window, in milliseconds, after the user stops speaking within which resumed speech is still treated as part of the same sentence (ASR multi-sentence concatenation). Range: [200, 2000]. Multi-sentence concatenation is enabled only when this value is greater than SilenceSegmentation. For details, see Speech Segmentation. |
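The range and ordering rules for the two migrated values can be checked before sending a request. The sketch below is a minimal illustration. The path `VAD.TurnDetectConfig` and the field names `SilenceSegmentation` and `PauseInterval`, the range [200, 2000], and the default of 500 are taken from the table above. The helper names and the client-side validation itself are assumptions, not part of the ZEGO API.

```python
from typing import Any, Dict, Optional

MIN_MS, MAX_MS = 200, 2000   # documented range for both values
DEFAULT_SILENCE_MS = 500     # documented default for SilenceSegmentation


def _check_range(name: str, value: int) -> int:
    """Reject values outside the documented [200, 2000] ms range."""
    if not MIN_MS <= value <= MAX_MS:
        raise ValueError(f"{name} must be in [{MIN_MS}, {MAX_MS}], got {value}")
    return value


def build_vad(silence_segmentation: int = DEFAULT_SILENCE_MS,
              pause_interval: Optional[int] = None) -> Dict[str, Any]:
    """Build the VAD fragment of a request body (hypothetical helper)."""
    turn: Dict[str, int] = {
        "SilenceSegmentation": _check_range("SilenceSegmentation",
                                            silence_segmentation),
    }
    if pause_interval is not None:
        turn["PauseInterval"] = _check_range("PauseInterval", pause_interval)
    return {"VAD": {"TurnDetectConfig": turn}}


def concatenation_enabled(config: Dict[str, Any]) -> bool:
    """True only when PauseInterval is set and exceeds SilenceSegmentation."""
    turn = config["VAD"]["TurnDetectConfig"]
    pause = turn.get("PauseInterval")
    return pause is not None and pause > turn["SilenceSegmentation"]


config = build_vad(silence_segmentation=500, pause_interval=800)
print(config, concatenation_enabled(config))
```

With `SilenceSegmentation` at 500 and `PauseInterval` at 800, concatenation is enabled; lowering `PauseInterval` to 500 or below leaves it disabled, which matches the rule stated in the table.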
The Params parameters for each vendor are as follows:
