Phi-Omni-ST: Microsoft Advances Speech-to-Speech Translation with Open-Source Model

June 12, 2025

A unified approach to multilingual voice translation

Microsoft has introduced Phi-Omni-ST, a multimodal language model that performs direct speech-to-speech translation (S2ST). Built on the open-source Phi-4-MM, the model integrates speech understanding and generation into a single end-to-end system, so no separate speech recognition, text translation, and speech synthesis stages are needed.

Phi-Omni-ST receives spoken input, generates both translated text and audio output, and synthesizes voice in real time using a streaming vocoder. This approach reduces latency and avoids the error propagation common in cascaded systems, which chain separate automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) components.

Key architecture: from audio tokens to streaming speech

The model extends Phi-4-MM by adding an audio transformer head. This component predicts audio tokens with a slight delay relative to text tokens, enabling better context modeling. The audio tokens are processed into mel-spectrograms and then converted into waveform speech using a pretrained HiFi-GAN vocoder.
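The delayed audio stream can be illustrated with a short sketch. This is not Microsoft's implementation: the function name `delay_align`, the two-step delay, and the token values are hypothetical, chosen only to show how an audio stream that starts a few steps after the text stream lines up per decoding step.

```python
from typing import List, Optional, Tuple

# Placeholder meaning "this stream emits nothing at this step" (an assumption
# for illustration; the article does not describe the model's padding scheme).
PAD = None


def delay_align(
    text_tokens: List[str],
    audio_tokens: List[int],
    delay: int,
) -> List[Tuple[Optional[str], Optional[int]]]:
    """Pair text and audio tokens per decoding step.

    The audio stream starts `delay` steps after the text stream, so each audio
    token is predicted after the text tokens that precede it are available.
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")

    total_steps = max(len(text_tokens), delay + len(audio_tokens))
    steps = []
    for t in range(total_steps):
        text = text_tokens[t] if t < len(text_tokens) else PAD
        audio_index = t - delay
        audio = (
            audio_tokens[audio_index]
            if 0 <= audio_index < len(audio_tokens)
            else PAD
        )
        steps.append((text, audio))
    return steps


if __name__ == "__main__":
    # Hypothetical tokens and delay, for illustration only.
    for step in delay_align(["Hello", "world"], [101, 102, 103], delay=2):
        print(step)
```

In this toy layout the first two steps carry text only, and the audio tokens follow once the text context is in place. The real delay and token rates are not stated in the article.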

Importantly, the system uses a joint decoding strategy, generating both text and audio outputs in a single pass. The audio decoder leverages the same hidden representations as the text decoder, maintaining alignment between written and spoken translations.

Benchmark results on open datasets

When trained on the CVSS-C dataset (940 hours), Phi-Omni-ST outperformed all baseline S2ST models using the same data. On French-to-English translation, for instance, the model improved ASR-BLEU scores from 28.45 (StreamSpeech baseline) to 35.93. Comparable gains were recorded for Spanish and German.
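To put the French-to-English figures in perspective, the gain can be worked out directly from the two reported scores. The helper name `bleu_gain` is ours; only the two ASR-BLEU values come from the article.

```python
def bleu_gain(baseline: float, improved: float) -> tuple:
    """Return the absolute gain (BLEU points) and relative gain (percent)."""
    absolute = improved - baseline
    relative_percent = 100.0 * absolute / baseline
    return absolute, relative_percent


if __name__ == "__main__":
    # Scores reported in the article for French-to-English on CVSS-C.
    absolute, relative = bleu_gain(baseline=28.45, improved=35.93)
    print(f"absolute gain: {absolute:.2f} ASR-BLEU points")
    print(f"relative gain: {relative:.1f}%")
```

The difference is 7.48 ASR-BLEU points, roughly a 26% relative improvement over the StreamSpeech baseline.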

Further scaling with in-house data and a 7B model variant brought Phi-Omni-ST’s performance in line with Meta's SeamlessM4T v2, a leading S2ST model. Microsoft reports that this result was achieved using roughly half the training data.

Designed for reproducibility and accessibility

Unlike many enterprise-grade systems, Phi-Omni-ST prioritizes open development. It builds on publicly released components, including the Phi-4-MM model and the CosyVoice 2 toolkit, and applies LoRA-based fine-tuning to reduce training overhead. The speech tokenizer and streaming vocoder modules are also open source.
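Why LoRA reduces training overhead can be seen from a simple parameter count. LoRA freezes a weight matrix of shape d_out x d_in and trains two small matrices of shapes d_out x r and r x d_in instead. The dimensions and rank below are hypothetical; the article does not state the values used for Phi-Omni-ST.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in); the original
    weight matrix stays frozen.
    """
    return rank * (d_out + d_in)


if __name__ == "__main__":
    # Hypothetical layer size and rank, for illustration only.
    d_out, d_in, rank = 3072, 3072, 16
    full = full_params(d_out, d_in)
    lora = lora_params(d_out, d_in, rank)
    print(f"full fine-tuning: {full:,} trainable parameters")
    print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
    print(f"ratio: {100.0 * lora / full:.2f}%")
```

Under these assumed sizes, the LoRA update trains about 1% of the parameters of the full matrix, which is the kind of saving that makes fine-tuning feasible on limited compute.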

These choices make the system reproducible and accessible, even for institutions with limited compute resources. The researchers emphasize that the architecture and methodology can be adapted to other speech-to-speech tasks beyond translation.

Future directions and applications

While Phi-Omni-ST currently focuses on multilingual S2ST, the framework supports broader extensions. Microsoft’s team identifies potential in low-resource language support, pretraining of the audio decoder, and other conversational AI use cases.

The research highlights a path toward scalable, end-to-end speech models grounded in open tools and public data, one that encourages wider experimentation and collaboration across the language technology community.
