
Configuring ASR

2026-06-02

Function Introduction

To improve speech recognition (speech-to-text) accuracy in different scenarios, you can:

  • Select the appropriate vendor and recognition model.
  • Select the appropriate language: By default, the Tencent and Aliyun Paraformer models recognize Chinese, and Microsoft recognizes English.
  • Set recognition hot words: Some scenarios involve specialized words such as character names, user IDs, or function names. These can be set as temporary hot words when creating an agent instance to improve recognition accuracy.

Currently ZEGO-supported ASR vendors and models:

Prerequisites

Tencent is currently the default vendor and is enabled without further action. To use Aliyun, Microsoft, Volcengine, or another vendor, contact ZEGOCLOUD business staff to have it enabled.

Usage Method

Currently, ASR-related parameters can be set through four interfaces:

| Interface | Description |
| --- | --- |
| Register Agent | Sets the vendor, hot words, language, and other parameters. |
| Create Agent Instance | Sets the vendor, hot words, language, and other parameters. Note: If not set, the ASR parameters carried by the registered agent (RegisterAgent) are used by default. |
| Create Digital Human Agent Instance | Same as Create Agent Instance, including the note. |
| Update Agent Instance | Note: Supports modifying hot words and language. For other parameters, contact technical support for confirmation. |
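The fallback described in the note above can be sketched as follows. This is a minimal illustration, not the service's implementation: the helper name `effective_asr` is hypothetical, and it assumes that "not set" means the ASR object is omitted or empty and that the registered agent's settings then apply as a whole rather than field by field.

```python
from typing import Any, Dict, Optional


def effective_asr(registered_asr: Dict[str, Any],
                  instance_asr: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the ASR settings an agent instance would run with.

    Hypothetical helper: mirrors the documented rule that an instance
    without ASR settings uses those carried by the registered agent.
    """
    # Assumption: an omitted or empty ASR object counts as "not set".
    if not instance_asr:
        return dict(registered_asr)
    # Assumption: instance-level settings replace the registered ones as a whole.
    return dict(instance_asr)


registered = {"Vendor": "Tencent"}
print(effective_asr(registered, None))
print(effective_asr(registered, {"Vendor": "Microsoft"}))
```

Resolving the settings on the client side like this is only useful for logging or debugging. The server performs the actual fallback.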

ASR Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| Vendor | String | No | ASR vendor. Default: Tencent. Values: Tencent (Tencent); AliyunParaformer (Aliyun Paraformer); AliyunGummy (Aliyun Gummy); Microsoft (Microsoft ASR); AliyunFunASR (Aliyun FunASR); AliyunQwenASR (Aliyun Qwen ASR); VolcSeedASR (Volcengine Seed ASR). |
| HotWord | String | No | Deprecated. Set hot words through the Params extended parameters instead. See the hot word setting instructions for each vendor below. |
| Params | Object | No | Vendor-specific parameters. See the parameter setting instructions for each vendor below. |
| VADSilenceSegmentation | Number | No | ⚠️ Deprecated. Since v2.12.0, migrated to VAD.TurnDetectConfig.SilenceSegmentation in the VAD structure. Sets how many milliseconds of silence after the user stops speaking must pass before the next sentence is treated as a separate one. Range: [200, 2000]. Default: 500. For details, see Speech Segmentation. |
| PauseInterval | Number | No | ⚠️ Deprecated. Since v2.12.0, migrated to VAD.TurnDetectConfig.PauseInterval in the VAD structure. Sets the window, in milliseconds, within which speech that resumes after a pause is still treated as part of the same sentence (ASR multi-sentence concatenation). Range: [200, 2000]. Concatenation is enabled only when this value is greater than SilenceSegmentation. For details, see Speech Segmentation. |
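The constraints in the parameter table can be checked on the client before sending a request. The sketch below is an illustration under stated assumptions: the helper names (`build_asr`, `build_vad`, `concatenation_enabled`) are hypothetical, the nested dictionary layout is inferred from the path VAD.TurnDetectConfig given above, and vendor-specific `Params` keys are left to the caller because they are not defined in this section.

```python
from typing import Any, Dict, Optional

# Vendor identifiers listed in the parameter table.
SUPPORTED_VENDORS = {
    "Tencent", "AliyunParaformer", "AliyunGummy", "Microsoft",
    "AliyunFunASR", "AliyunQwenASR", "VolcSeedASR",
}
MIN_MS, MAX_MS = 200, 2000  # documented range for both timing values


def build_asr(vendor: str = "Tencent",
              params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build an ASR object; the deprecated HotWord field is never emitted."""
    if vendor not in SUPPORTED_VENDORS:
        raise ValueError(f"unsupported ASR vendor: {vendor!r}")
    asr: Dict[str, Any] = {"Vendor": vendor}
    if params:
        asr["Params"] = params  # vendor-specific keys, supplied by the caller
    return asr


def _check_range(name: str, value: int) -> int:
    if not MIN_MS <= value <= MAX_MS:
        raise ValueError(f"{name} must be in [{MIN_MS}, {MAX_MS}], got {value}")
    return value


def build_vad(silence_segmentation: int = 500,
              pause_interval: Optional[int] = None) -> Dict[str, Any]:
    """Build the VAD structure that replaces the deprecated ASR timing fields."""
    turn: Dict[str, Any] = {
        "SilenceSegmentation": _check_range("SilenceSegmentation",
                                            silence_segmentation),
    }
    if pause_interval is not None:
        turn["PauseInterval"] = _check_range("PauseInterval", pause_interval)
    return {"TurnDetectConfig": turn}


def concatenation_enabled(vad: Dict[str, Any]) -> bool:
    """True only when PauseInterval is set and exceeds SilenceSegmentation."""
    turn = vad["TurnDetectConfig"]
    pause = turn.get("PauseInterval")
    return pause is not None and pause > turn["SilenceSegmentation"]
```

For example, `build_vad(500, 800)` yields a configuration in which multi-sentence concatenation is enabled, while `build_vad(500, 300)` is valid but leaves concatenation off because PauseInterval does not exceed SilenceSegmentation.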

The Params parameters for each vendor are as follows:

ASR Vendor Parameters
