
Alibaba Cloud, a subsidiary of Alibaba Group, unveiled Qwen2.5-Omni-7B, its end-to-end multimodal artificial intelligence (AI) model, on Monday.
The model can process diverse inputs, including text, images, audio, and video, while delivering real-time text and speech responses.
With 7 billion parameters, Qwen2.5-Omni-7B is a lightweight model that enables cost-effective AI agents and is well suited to building intelligent voice applications. Its versatility supports use cases across various domains, from providing real-time audio descriptions for the visually impaired to offering cooking guidance and enhancing customer service systems.
An Alibaba Cloud spokesperson said the model's innovative architecture allows it to deliver high performance at reduced cost. Key technologies include the “Thinker-Talker” architecture, which minimizes interference by separating text generation from speech synthesis, and “TMRoPE” (Time-aligned Multimodal RoPE), a position-embedding technique that strengthens synchronization between video and audio.
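The full TMRoPE design is specified in the model's open-source release; the toy Python sketch below only illustrates the underlying idea of time-aligned positions, where tokens from different modalities that occur at the same moment receive matching temporal indices. The 40 ms grid is a hypothetical resolution chosen for the example, not the model's actual setting.

```python
# Toy illustration of time-aligned multimodal positions. This is NOT the
# actual TMRoPE implementation; it only shows the core idea of quantizing
# timestamps onto a shared grid so that co-occurring audio and video
# tokens share a temporal position index.

def time_aligned_positions(timestamps_s, grid_s=0.04):
    """Map per-token timestamps (in seconds) to integer temporal positions
    on a shared grid (hypothetical 40 ms resolution)."""
    return [round(t / grid_s) for t in timestamps_s]

# Video frames every 0.5 s and audio tokens every 0.25 s: tokens that
# occur at the same moment (0.0 s, 0.5 s, 1.0 s) get identical indices.
video_ts = [0.0, 0.5, 1.0]
audio_ts = [0.0, 0.25, 0.5, 0.75, 1.0]
print(time_aligned_positions(video_ts))  # [0, 12, 25]
print(time_aligned_positions(audio_ts))  # [0, 6, 12, 19, 25]
```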
Leveraging extensive pre-training data, Qwen2.5-Omni-7B performs strongly on tasks such as image-to-text, video-to-text, video-to-speech, and speech-to-text conversion. Notably, the model has achieved top-tier performance on OmniBench, a benchmark that assesses the integrated processing of visual, auditory, and textual information.
Alibaba Cloud has open-sourced the model on popular platforms such as Hugging Face and GitHub. It is also available on ModelScope, Alibaba Cloud's own open-source model community.
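For readers who want to try it, a minimal sketch of fetching the weights from Hugging Face might look like the following; it assumes the repository id Qwen/Qwen2.5-Omni-7B (verify on the model card) and uses the standard huggingface_hub client.

```python
# Minimal sketch: download the released weights from the Hugging Face Hub.
# The repository id below matches the announced model name, but confirm it
# on the model card before relying on it.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("Qwen/Qwen2.5-Omni-7B")
print(f"Model files downloaded to: {local_dir}")
```

Actually running inference requires the model-specific classes documented on the model card; the snapshot above only mirrors the repository files locally.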
Alibaba Cloud has open-sourced more than 200 generative AI models in recent years.
Following the initial launch of Qwen2.5 in September 2024, Alibaba Cloud expanded its AI portfolio with Qwen2.5-Max in January 2025. The company has also introduced specialized models such as Qwen2.5-VL for enhanced visual understanding and Qwen2.5-1M for processing extended inputs.