FunASR: an end-to-end speech recognition toolkit connecting academic research with industrial applications

Introduction: A new bridge for speech recognition technology

FunASR is an open-source speech recognition toolkit developed by Alibaba DAMO Academy, aiming to provide an efficient bridge between academia and industry. By releasing the training and fine-tuning code for industrial-grade models, the toolkit lowers the barrier to applying speech recognition technology and supports the full process from basic research to product deployment. Its core design philosophy is to "make speech recognition more fun": through a modular architecture and a library of pre-trained models, developers can quickly build speech applications covering multiple languages and scenarios.

Core Function Analysis

Full-stack speech processing capabilities

FunASR provides seven core functional modules:

  1. Speech Recognition (ASR): Supports real-time/offline Chinese and English recognition, outputting text with timestamps
  2. Voice Activity Detection (VAD): Precisely identifies effective speech segments, supporting real-time processing at millisecond granularity (see the usage sketch after this list)
  3. Punctuation Restoration: Automatically adds Chinese/English punctuation marks
  4. Speaker Separation: Differentiates between different speakers in a conversation
  5. Emotion Recognition: Detects the emotional state of speech (angry, happy, etc.)
  6. Voice Wake-up (KWS): Customized Wake Word Recognition
  7. Multi-modal Understanding: Integrates the Qwen-Audio series of audio-text large models
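To give a feel for the unified interface, below is a minimal VAD sketch, assuming a placeholder 16 kHz mono file named example.wav; the exact shape of the returned segment list may vary slightly across versions.

from funasr import AutoModel
# Load the voice activity detection model
vad_model = AutoModel(model="fsmn-vad")
# "example.wav" is a placeholder; replace it with a local 16 kHz mono recording
segments = vad_model.generate(input="example.wav")
# Typically a list of [start_ms, end_ms] pairs for the detected speech segments
print(segments[0]["value"])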

Breakthrough Model Architecture

The Paraformer model in the toolkit adopts a non-autoregressive architecture that maintains recognition accuracy while delivering more than 3x faster inference than conventional autoregressive models. Its key features include:

  • Completes the entire recognition process with a single forward computation
  • Supports dynamic batching, significantly improving the efficiency of long-audio processing (see the sketch after this list)
  • Compatible with ONNX format, facilitating cross-platform deployment
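As a rough sketch of the dynamic batching above: several recordings are passed in one call, and batch_size_s caps how many seconds of audio are packed into each batch. The file names are placeholders, and passing a plain Python list as input is an assumption that may need to be swapped for a wav.scp file in some versions.

from funasr import AutoModel
# Non-autoregressive Paraformer: each batch is recognized in a single forward pass
model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", device="cuda:0")
# batch_size_s limits the total seconds of audio packed into one dynamic batch
results = model.generate(
    input=["call_01.wav", "call_02.wav", "call_03.wav"],  # placeholder file names
    batch_size_s=300,
)
for r in results:
    print(r["key"], r["text"])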

Technical evolution and latest trends (key updates in 2024)

Update time | Important update content
2024/10/29 | Chinese real-time transcription service adds a 2pass-offline mode for the SenseVoiceSmall model
2024/10/10 | Added the Whisper-large-v3-turbo model, supporting multilingual recognition and translation
2024/09/26 | Optimized ONNX memory management and fixed a memory leak in the GPU version
2024/07/04 | Released the SenseVoice foundational speech-understanding model, integrating multi-task capabilities such as ASR, LID, SER, and AED
2024/05/15 | Added the emotion2vec+ series of emotion recognition models, with accuracy improved by 12%

Environment Configuration and Installation Guide

Basic Environment Requirements

  • Python ≥ 3.8
  • PyTorch ≥ 1.13
  • CUDA 11.6+ (GPU version)

Two installation methods

Method 1: Install quickly using pip

pip3 install -U funasr
# Optional: ModelScope / Hugging Face support for the industrial models
pip3 install -U modelscope huggingface_hub

 

Method 2: Compile and install from source

git clone https://github.com/alibaba/FunASR.git
cd FunASR
pip3 install -e ./

 

Model Repository and Selection Recommendations

FunASR provides pre-trained models covering different scenarios; key models are compared below:

Model name | Applicable scenarios | Language support | Latency level | Memory usage
SenseVoiceSmall | Multi-task speech understanding | Chinese | High real-time | 1.2 GB
Paraformer-zh | Long-audio file transcription | Chinese | Offline | 2.3 GB
Paraformer-zh-streaming | Real-time speech recognition | Chinese | Low latency | 2.1 GB
Whisper-large-v3-turbo | Multilingual recognition/translation | 100+ languages | Offline | 3.8 GB
emotion2vec+large | Emotional state analysis | General speech | Real-time | 1.1 GB

Selection Suggestions:

  • Chinese customer-service quality inspection: Paraformer-zh + ct-punc + emotion2vec+ (see the pipeline sketch after this list)
  • International conference transcription: Whisper-large-v3-turbo + cam++
  • Intelligent Hardware Wake-up: fsmn-kws + fsmn-vad
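A minimal sketch of the first combination, assuming a placeholder recording service_call.wav; the structure of the emotion result may differ slightly between model versions, so it is printed as-is.

from funasr import AutoModel
# ASR pipeline: Paraformer-zh with VAD segmentation and punctuation restoration
asr_model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc")
# Utterance-level emotion analysis with emotion2vec+
emo_model = AutoModel(model="emotion2vec_plus_large")
audio = "service_call.wav"  # placeholder customer-service recording
text = asr_model.generate(input=audio)[0]["text"]
emotion = emo_model.generate(input=audio, granularity="utterance", extract_embedding=False)[0]
print("Transcript:", text)
print("Emotion result:", emotion)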

Practical Application Examples

Scenario 1: Transcription of long audio files

from funasr import AutoModel
# Load a multi-stage model chain: ASR + VAD + punctuation
model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    device="cuda:0"
)
# Transcribe a 2-hour meeting recording
result = model.generate(
    input="meeting_recording.wav",
    batch_size_s=600,          # dynamic batching window (seconds of audio per batch)
    hotword="达摩院 机器学习"   # custom hotwords, space-separated
)
print(f"Transcription: {result[0]['text']}")
print(f"Timestamps: {result[0]['timestamp']}")

 

Scenario 2: Real-time speech-to-text transcription

from funasr import AutoModel
import soundfile

# Initialize the streaming model
chunk_size = [0, 10, 5]  # 600 ms latency configuration
model = AutoModel(model="paraformer-zh-streaming")

# Simulate a real-time stream by slicing a local file into 600 ms chunks
speech, sample_rate = soundfile.read("input.wav")  # 16 kHz mono expected
chunk_stride = chunk_size[1] * 960                 # 600 ms of samples at 16 kHz
cache = {}
total_chunks = int((len(speech) - 1) / chunk_stride) + 1
for idx in range(total_chunks):
    chunk = speech[idx * chunk_stride:(idx + 1) * chunk_stride]
    is_final = (idx == total_chunks - 1)
    res = model.generate(
        input=chunk,
        cache=cache,
        is_final=is_final,
        chunk_size=chunk_size
    )
    print(f"Real-time output: {res[0]['text']}")

 

Scenario 3: Emotion Recognition Integration

from funasr import AutoModel
# Load the emotion recognition model
emotion_model = AutoModel(model="emotion2vec_plus_large")
# Analyze a customer-service recording
result = emotion_model.generate(
    input="customer_service.wav",
    granularity="utterance",   # utterance-level analysis
    extract_embedding=False
)
print(f"Emotion analysis result: {result[0]}")  # e.g. labels such as neutral, happy with their scores

 

Advanced Features: Model Optimization and Deployment

Export in ONNX format

# Command-line export
funasr-export ++model=paraformer ++quantize=true
# Export via the Python API
from funasr import AutoModel
model = AutoModel(model="paraformer")
model.export(quantize=True)
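Once exported, the model can be run with the separately installed funasr-onnx package; the sketch below follows its documented usage, with the model directory and wav path as placeholders.

# pip3 install -U funasr-onnx
from funasr_onnx import Paraformer
# Directory produced by the export step above (placeholder path)
model_dir = "./exported_paraformer"
model = Paraformer(model_dir, batch_size=1, quantize=True)
# Placeholder 16 kHz mono wav file
result = model(["example.wav"])
print(result)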

 

Service-oriented deployment solution

Currently supported service types:

  1. Chinese offline transcription service (CPU/GPU)

    • Supports dynamic batching
    • Single-thread RTF as low as 0.0076 (GPU)
    • Supports audio files up to 8 hours long
  2. Real-time Chinese and English transcription service

    • End-to-end latency < 800ms
    • Supports n-gram language models
    • Adaptive speech segmentation

Deployment Example (Docker Solution):

# Start the Chinese offline transcription service
docker run -p 10095:10095 \
  registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-cpu-0.4.6
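To verify the service, the FunASR runtime ships sample clients; the call below is a sketch based on the Python websocket client from the runtime samples (script path and flags are assumptions and may differ between releases).

# Offline transcription request against the service started above
python3 funasr_wss_client.py --host "127.0.0.1" --port 10095 \
    --mode offline --audio_in "meeting_recording.wav"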

 

Community Ecosystem and Technical Support

A complete technical ecosystem has formed around FunASR:

  • Academic Support: Tsinghua University, China Telecom, and other institutions are deeply involved
  • Industrial Applications: RapidAI, AIHealthX, and other companies provide practical cases
  • Developer Community: Over 3000 active developers in DingTalk groups
  • Continuous Updates: more than 20 new models released each year

Open Source License and Academic Citation

The project is released under the MIT license; commercial use must also comply with the accompanying model license terms. Key technical paper citation:

@inproceedings{gao2023funasr,
  title={FunASR: A Fundamental End-to-End Speech Recognition Toolkit},
  author={Gao, Zhifu and Li, Zerui and Wang, Jiaming and Luo, Haoneng and Shi, Xian and Chen, Mengzhe and Li, Yabin and Zuo, Lingyun and Du, Zhihao and Xiao, Zhangyu and Zhang, Shiliang},
  booktitle={INTERSPEECH},
  year={2023}
}

 

Evolutionary Direction and Future Prospects

According to the 2024 roadmap, development will focus on:

  1. Multimodal Fusion: deepening joint speech-text-vision understanding
  2. Edge Computing Optimization: releasing on-device inference models under 100 MB
  3. Semi-supervised Learning: pre-training general speech representations on ten million hours of data
  4. Medical Field Adaptation: a dedicated version developed in compliance with HIPAA standards

Through continuous technical iteration, FunASR is steadily advancing toward its goal of becoming a general speech-intelligence infrastructure, laying a solid foundation for democratizing speech technology.

– Efficient Code Farmer –