A unified approach to multilingual voice translation
Microsoft has introduced Phi-Omni-ST, a multimodal language model capable of performing direct speech-to-speech translation (S2ST). Built on the open-source Phi-4-MM, the model integrates speech understanding and generation into a single end-to-end system, removing the need for a separate intermediate text-conversion stage.
Phi-Omni-ST receives spoken input, generates both translated text and audio output, and synthesizes voice in real time using a streaming vocoder. This approach reduces latency and avoids the error propagation common in cascaded systems, which chain separate automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) components.
Key architecture: from audio tokens to streaming speech
The model extends Phi-4-MM by adding an audio transformer head. This component predicts audio tokens with a slight delay relative to text tokens, so that each audio prediction can draw on text context that has already been generated. The audio tokens are converted into mel-spectrograms and then into waveform speech by a pretrained HiFi-GAN vocoder.
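The delay between the text and audio streams can be pictured with a small sketch. It is only an illustration of the general idea of a delayed token schedule, not Microsoft's implementation: the function name `delayed_schedule`, the `delay` parameter, and the toy token values are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

PAD = None  # placeholder meaning "this stream emits nothing at this step"


def delayed_schedule(
    text_tokens: List[str], audio_tokens: List[int], delay: int
) -> List[Tuple[Optional[str], Optional[int]]]:
    """Pair text and audio tokens per decoding step, with the audio
    stream starting `delay` steps after the text stream (hypothetical helper)."""
    total_steps = max(len(text_tokens), len(audio_tokens) + delay)
    schedule = []
    for step in range(total_steps):
        # Text is emitted from step 0 onward.
        text = text_tokens[step] if step < len(text_tokens) else PAD
        # Audio lags behind by `delay` steps.
        audio_index = step - delay
        in_range = 0 <= audio_index < len(audio_tokens)
        audio = audio_tokens[audio_index] if in_range else PAD
        schedule.append((text, audio))
    return schedule


# Toy example: three text tokens, three audio tokens, audio delayed by two steps.
for step, pair in enumerate(delayed_schedule(["a", "b", "c"], [1, 2, 3], delay=2)):
    print(step, pair)
```

In this toy schedule the first audio token appears only at step 2, by which point the first text tokens are already available as context.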
Importantly, the system uses a joint decoding strategy, generating both text and audio outputs in a single pass. The audio decoder leverages the same hidden representations as the text decoder, maintaining alignment between written and spoken translations.
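As a rough sketch of what sharing hidden representations means, the snippet below runs one decoding step in which two output heads read the same hidden vector. It is heavily simplified: the article describes an audio transformer head, whereas this example uses plain linear projections, and the names `joint_step`, `text_head`, and `audio_head` plus all numeric values are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def argmax(values: Vector) -> int:
    """Index of the largest value (greedy token choice)."""
    return max(range(len(values)), key=values.__getitem__)


def joint_step(hidden: Vector, text_head: Matrix, audio_head: Matrix) -> Tuple[int, int]:
    """One decoding step: both heads consume the SAME hidden state,
    so the text and audio predictions stay tied to one representation."""
    text_token = argmax(matvec(text_head, hidden))
    audio_token = argmax(matvec(audio_head, hidden))
    return text_token, audio_token


# Toy values, assumed for illustration only.
hidden = [1.0, 0.0]
text_head = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.3]]  # 3-token text vocabulary
audio_head = [[0.5, 0.0], [0.2, 1.0]]             # 2-token audio vocabulary
print(joint_step(hidden, text_head, audio_head))  # prints (1, 0)
```

Because both predictions come from one forward pass over one hidden state, no second model has to re-read the text output before speech is produced.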
Benchmark results on open datasets
When trained on the CVSS-C dataset (940 hours), Phi-Omni-ST outperformed all baseline S2ST models trained on the same data. On French-to-English translation, for instance, the model raised the ASR-BLEU score from 28.45 (StreamSpeech baseline) to 35.93. Comparable gains were recorded for Spanish and German.
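To put the French-to-English figures in perspective, the short calculation below works out the absolute and relative gain from the two scores quoted above. The helper name `score_gain` is an assumption for this example; only the two input numbers come from the reported results.

```python
from typing import Tuple


def score_gain(baseline: float, improved: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = improved - baseline
    relative_percent = 100.0 * absolute / baseline
    return absolute, relative_percent


# ASR-BLEU scores quoted for French-to-English on CVSS-C.
absolute, relative = score_gain(baseline=28.45, improved=35.93)
print(f"absolute gain: {absolute:.2f} ASR-BLEU points")  # absolute gain: 7.48 ASR-BLEU points
print(f"relative gain: {relative:.1f}%")                 # relative gain: 26.3%
```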
Further scaling with in-house data and a 7B model variant brought Phi-Omni-ST’s performance in line with Meta’s SeamlessM4T v2, a leading model for the task. Microsoft reports that this result was achieved using roughly half the training data.
Designed for reproducibility and accessibility
Unlike many enterprise-grade systems, Phi-Omni-ST prioritizes open development. It builds on publicly released components, including Phi-4-MM and the CosyVoice 2 toolkit, and applies LoRA-based fine-tuning to reduce training overhead. The speech tokenizer and streaming vocoder modules are also open source.
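The reason LoRA reduces training overhead can be shown with the standard low-rank update, y = Wx + (alpha / r) · B(Ax), where the original weight W stays frozen and only the small matrices A and B are trained. The sketch below is a generic, dependency-free illustration of that formula, not the configuration used for Phi-Omni-ST; the function names and the 4096-dimensional, rank-16 layer are assumptions chosen for the example.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x: Vector, W: Matrix, A: Matrix, B: Matrix, alpha: float) -> Vector:
    """Compute y = W x + (alpha / r) * B (A x).
    W (d_out x d_in) is frozen; A (r x d_in) and B (d_out x r) are trainable."""
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_in: int, d_out: int, rank: int) -> Tuple[int, int]:
    """Trainable parameters for full fine-tuning vs. a LoRA adapter on one layer."""
    return d_in * d_out, rank * (d_in + d_out)


# Tiny forward pass: identity W, rank-1 adapter.
print(lora_forward([1.0, 2.0], [[1, 0], [0, 1]], [[1, 1]], [[1], [2]], alpha=2.0))
# prints [7.0, 14.0]

# Hypothetical 4096 x 4096 layer with rank 16.
full, lora = lora_param_counts(4096, 4096, 16)
print(full, lora, f"{100 * lora / full:.2f}%")  # 16777216 131072 0.78%
```

For a layer of this assumed size, the adapter trains under one percent of the parameters that full fine-tuning would update, which is the kind of saving that makes the approach attractive for teams with limited compute.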
These choices make the system reproducible and accessible, even for institutions with limited compute resources. The researchers emphasize that the architecture and methodology can be adapted to other speech-to-speech tasks beyond translation.
Future directions and applications
While Phi-Omni-ST currently focuses on multilingual S2ST, the framework supports broader extensions. Microsoft’s team identifies potential in low-resource language support, pretraining of the audio decoder, and other conversational AI use cases.
The research highlights a path toward scalable, end-to-end speech models grounded in open tools and public data, and it encourages wider experimentation and collaboration across the language technology community.

