FunASR: an end-to-end speech recognition toolkit connecting academic research with industrial applications
Introduction: A new bridge for speech recognition technology
FunASR is an open-source speech recognition toolkit developed by Alibaba DAMO Academy, aiming to bridge academia and industry. By releasing training and fine-tuning code for industrial-grade models, the toolkit lowers the barrier to applying speech recognition technology and supports the full pipeline from basic research to product deployment. Its core design philosophy is "to make speech recognition more fun": through a modular architecture and a library of pre-trained models, developers can quickly build speech applications that support multiple languages and scenarios.
Core Function Analysis
Full-stack speech processing capabilities
The toolkit provides seven core functional modules (a short loading sketch follows the list):

- Speech Recognition (ASR): real-time and offline recognition for Chinese and English, with timestamped text output
- Voice Activity Detection (VAD): precisely identifies valid speech segments, with millisecond-level real-time processing
- Punctuation Restoration: automatically adds Chinese/English punctuation marks
- Speaker Separation: distinguishes different speakers in a conversation
- Emotion Recognition: detects the emotional state of speech (angry, happy, etc.)
- Voice Wake-up (KWS): customizable wake-word recognition
- Multi-modal Understanding: integrates the Qwen-Audio series of audio-text large models
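Each module above maps to a standalone model in the model zoo and can be loaded independently through the `AutoModel` interface. The sketch below is a minimal illustration, assuming the model identifiers used later in this article (`fsmn-vad`, `ct-punc`); the audio file name is a placeholder.

```python
from funasr import AutoModel

# Load two of the task-specific models independently.
vad_model = AutoModel(model="fsmn-vad")    # voice activity detection
punc_model = AutoModel(model="ct-punc")    # punctuation restoration

# VAD returns the start/end times (in ms) of detected speech segments.
print(vad_model.generate(input="example.wav")[0])

# Punctuation restoration adds punctuation to raw ASR output text.
print(punc_model.generate(input="那今天的会就到这里吧 happy new year 明年见")[0]["text"])
```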
Breakthrough Model Architecture
The Paraformer model in the toolkit adopts a non-autoregressive architecture that preserves recognition accuracy while delivering more than a 3x inference speedup over traditional autoregressive models. Its key features (a timing sketch follows this list):

- Completes the entire recognition process in a single forward pass
- Supports dynamic batching, significantly improving the efficiency of long-audio processing
- Exports to ONNX format for easy cross-platform deployment
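A minimal way to observe the single-pass behaviour is to time one offline call, as in the hedged sketch below (`example.wav` is a placeholder; `batch_size_s` is the dynamic-batching parameter also used in the scenarios later in this article).

```python
import time

from funasr import AutoModel

model = AutoModel(model="paraformer-zh")   # offline, non-autoregressive Chinese model

start = time.time()
# A single generate() call covers the whole recognition pass;
# batch_size_s sets how many seconds of audio are batched together.
result = model.generate(input="example.wav", batch_size_s=300)
print(result[0]["text"])
print(f"Decoding took {time.time() - start:.2f}s")
```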
Technical evolution and latest trends (key updates in 2024)
| Update date | Key update |
|---|---|
| 2024/10/29 | Chinese real-time transcription service adds a 2pass-offline mode for the SenseVoiceSmall model |
| 2024/10/10 | Added the Whisper-large-v3-turbo model, supporting multilingual recognition and translation |
| 2024/09/26 | Optimized ONNX memory management and fixed a memory leak in the GPU build |
| 2024/07/04 | Released the SenseVoice foundational speech understanding model, integrating multi-task capabilities such as ASR/LID/SER/AED |
| 2024/05/15 | Added the emotion2vec+ series of emotion recognition models, with accuracy improved by 12% |
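The 2024/07/04 entry introduced SenseVoice, which folds recognition (ASR), language identification (LID), speech emotion recognition (SER), and audio event detection (AED) into one model. The sketch below follows the usage published in the project README; argument names may differ slightly between releases, and the input file is a placeholder.

```python
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",       # segment long audio before recognition
    device="cuda:0",
)
res = model.generate(
    input="example.wav",
    language="auto",            # let the model identify the language (LID)
    use_itn=True,               # inverse text normalization for numbers, dates, etc.
    batch_size_s=60,
)
# The raw output embeds emotion/event tags; this helper renders plain readable text.
print(rich_transcription_postprocess(res[0]["text"]))
```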
Environment Configuration and Installation Guide
Basic Environment Requirements
- Python ≥ 3.8
- PyTorch ≥ 1.13
- CUDA 11.6+ (for the GPU version)
Two installation methods
Method 1: Install quickly using pip
```bash
pip3 install -U funasr
# Optional: model hubs for the industrial pre-trained models
pip3 install -U modelscope huggingface_hub
```
Method 2: Compile and install from source
```bash
git clone https://github.com/alibaba/FunASR.git
cd FunASR
pip3 install -e ./
```
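Whichever method you use, a quick import check (a sketch, nothing more) confirms that the environment matches the requirements listed above:

```python
import sys

import torch
import funasr

print("Python:", sys.version.split()[0])            # expected >= 3.8
print("PyTorch:", torch.__version__)                # expected >= 1.13
print("CUDA available:", torch.cuda.is_available())
print("FunASR installed at:", funasr.__file__)
```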
Model Repository and Selection Recommendations
FunASR provides pre-trained models covering different scenarios; the key models compare as follows:
| Model | Typical scenario | Language support | Latency profile | Memory usage |
|---|---|---|---|---|
| SenseVoiceSmall | Multi-task speech understanding | Chinese | Real-time | 1.2 GB |
| Paraformer-zh | Long-audio file transcription | Chinese | Offline | 2.3 GB |
| Paraformer-zh-streaming | Real-time speech recognition | Chinese | Low latency (streaming) | 2.1 GB |
| Whisper-large-v3-turbo | Multilingual recognition/translation | 100+ languages | Offline | 3.8 GB |
| emotion2vec+large | Emotional state analysis | General speech | Real-time | 1.1 GB |
Selection suggestions (a composition sketch follows this list):

- Chinese customer-service quality inspection: Paraformer-zh + ct-punc + emotion2vec+
- International conference transcription: Whisper-large-v3-turbo + cam++
- Smart hardware wake-up: fsmn-kws + fsmn-vad
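As an illustration of the first recommendation, the hedged sketch below chains Paraformer-zh (with ct-punc) and an emotion2vec+ model over the same recording; `customer_call.wav` is a placeholder, and the output keys of the emotion model (`labels`/`scores`) may vary by version.

```python
from funasr import AutoModel

# Transcription chain: ASR + VAD + punctuation restoration
asr = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc")
# Utterance-level emotion model
emotion = AutoModel(model="emotion2vec_plus_large")

call = "customer_call.wav"
transcript = asr.generate(input=call, batch_size_s=300)[0]["text"]
emotions = emotion.generate(input=call, granularity="utterance", extract_embedding=False)[0]

print("Transcript:", transcript)
print("Emotions:", emotions["labels"], emotions["scores"])
```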
Practical Application Examples
Scenario 1: Transcription of long audio files
```python
from funasr import AutoModel

# Load a multi-stage model chain: ASR + VAD + punctuation restoration
model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    device="cuda:0",
)

# Process a 2-hour meeting recording
result = model.generate(
    input="meeting_recording.wav",
    batch_size_s=600,            # dynamic batching window (seconds of audio per batch)
    hotword="达摩院 机器学习",     # custom hotwords, space-separated
)
print(f"Transcript: {result[0]['text']}")
print(f"Timestamps: {result[0]['timestamp']}")
```
Scenario 2: Real-time speech-to-text transcription
```python
from funasr import AutoModel
import soundfile

# Initialize the streaming model
model = AutoModel(model="paraformer-zh-streaming")

chunk_size = [0, 10, 5]                             # 600 ms latency configuration
speech, sample_rate = soundfile.read("input.wav")   # 16 kHz mono audio expected
chunk_stride = chunk_size[1] * 960                  # samples per 600 ms chunk at 16 kHz

cache = {}
total_chunks = int((len(speech) - 1) / chunk_stride) + 1
for idx in range(total_chunks):
    chunk = speech[idx * chunk_stride:(idx + 1) * chunk_stride]
    is_final = idx == total_chunks - 1              # flush the cache on the last chunk
    res = model.generate(
        input=chunk,
        cache=cache,
        is_final=is_final,
        chunk_size=chunk_size,
    )
    print(f"Streaming output: {res[0]['text']}")
```
Scenario 3: Emotion Recognition Integration
```python
from funasr import AutoModel

# Load the emotion recognition model
emotion_model = AutoModel(model="emotion2vec_plus_large")

# Analyze a customer-service recording
result = emotion_model.generate(
    input="customer_service.wav",
    granularity="utterance",      # utterance-level analysis
    extract_embedding=False,      # return labels/scores only, no embeddings
)
print(f"Emotion labels: {result[0]['labels']}")   # e.g. neutral, happy, angry, sad
print(f"Scores: {result[0]['scores']}")
```
Advanced Features: Model Optimization and Deployment
Export in ONNX format
```bash
# Command-line export
funasr-export ++model=paraformer ++quantize=true
```

```python
# Python API export
from funasr import AutoModel

model = AutoModel(model="paraformer")
model.export(quantize=True)
```
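The exported model can then be loaded with the companion `funasr-onnx` runtime (`pip install funasr-onnx`). The sketch below follows the runtime's published usage; the class name, arguments, and model ID should be checked against the current documentation.

```python
from funasr_onnx import Paraformer

# ModelScope ID of the offline Paraformer model; a local export directory also works.
model_dir = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
model = Paraformer(model_dir, batch_size=1, quantize=True)

print(model(["example.wav"]))   # list of recognition results
```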
Service-oriented deployment solution
Current supported service types:
- Chinese offline file transcription service (CPU/GPU)
  - Supports dynamic batching
  - Single-stream RTF as low as 0.0076 (GPU)
  - Handles audio files up to 8 hours long
- Real-time Chinese/English transcription service
  - End-to-end latency under 800 ms
  - Supports n-gram language models
  - Adaptive speech segmentation
Deployment Example (Docker Solution):
```bash
# Image name and tag follow the FunASR runtime docs; check for the latest release
docker run -p 10095:10095 \
  registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-cpu-0.4.6
```
Community Ecosystem and Technical Support
A complete technical ecosystem has been formed:
- Academic support: Tsinghua University, China Telecom, and other institutions are deeply involved
- Industrial applications: RapidAI, AIHealthX, and other companies contribute real-world use cases
- Developer community: over 3,000 active developers in the DingTalk groups
- Continuous updates: more than 20 new models released each year
Open Source License and Academic Citation
The project is released under the MIT license, and commercial applications must comply with additional terms. Key paper citation:
```bibtex
@inproceedings{gao2023,
  title={FunASR: A Fundamental End-to-End Speech Recognition Toolkit},
  author={Gao, Zhifu and Li, Zerui and Wang, Jiaming and Luo, Haoneng and Shi, Xian and Chen, Mengzhe and Li, Yabin and Zuo, Lingyun and Du, Zhihao and Xiao, Zhangyu and Zhang, Shiliang},
  booktitle={INTERSPEECH},
  year={2023}
}
```
Evolutionary Direction and Future Prospects
According to the 2024 roadmap, development will focus on:

- Multimodal fusion: deepening joint understanding of speech, text, and vision
- Edge computing optimization: releasing on-device inference models under 100 MB
- Semi-supervised learning: pre-training general speech representations on ten million hours of data
- Medical-domain adaptation: a dedicated version developed in compliance with HIPAA standards
Through continuous technical iteration, FunASR is steadily advancing toward the goal of building general speech-intelligence infrastructure, laying a solid foundation for the democratization of speech technology.
– Efficient Code Farmer –