Google at Interspeech 2022

This week, the 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH 2022) is being held in Incheon, South Korea. It is one of the world's largest conferences on research and technology in spoken language understanding and processing. More than 2,000 experts in speech-related research fields are gathering to take part in oral presentations and poster sessions and to collaborate through streamed events across the globe.

We are excited to be a Diamond Sponsor of INTERSPEECH 2022, where we will be showcasing nearly 50 research publications and supporting a number of workshops, special sessions and tutorials. We welcome in-person attendees to drop by the Google booth to meet our researchers and take part in Q&As and demonstrations of some of our latest speech technologies, which help improve accessibility and make communication more convenient for billions of users. Online attendees are encouraged to visit our virtual booth in GatherTown to get up-to-date information on research and opportunities at Google. You can also learn more about the Google research being presented at INTERSPEECH 2022 below (Google affiliations in bold).

Organizing Committee

Industry Liaisons include: Bhuvana Ramabhadran

Area Chairs include: John Hershey, Heiga Zen, Shrikanth Narayanan, Bastiaan Kleijn

ISCA Fellows

Include: Tara Sainath, Heiga Zen

Publications

Production Federated Keyword Spotting via Distillation, Filtering, and Joint Federated-Centralized Training
Andrew Hard, Kurt Partridge, Neng Chen, Sean Augenstein, Aishanee Shah, Hyun Jin Park, Alex Park, Sara Ng, Jessica Nguyen, Ignacio Lopez Moreno, Rajiv Mathews, Françoise Beaufays

Leveraging Unsupervised and Weakly-Supervised Data to Improve Direct Speech-to-Speech Translation
Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis Conneau, Nobu Morioka

Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition
W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor Strohman, Shankar Kumar

UserLibri: A Dataset for ASR Personalization Using Only Text
Theresa Breiner, Swaroop Ramaswamy, Ehsan Variani, Shefali Garg, Rajiv Mathews, Khe Chai Sim, Kilol Gupta, Mingqing Chen, Lara McConnaughey

SNRi Target Training for Joint Speech Enhancement and Recognition
Yuma Koizumi, Shigeki Karita, Arun Narayanan, Sankaran Panchapagesan, Michiel Bacchiani

Turn-Taking Prediction for Natural Conversational Speech
Shuo-Yiin Chang, Bo Li, Tara Sainath, Chao Zhang, Trevor Strohman, Qiao Liang, Yanzhang He

Streaming Intended Query Detection Using E2E Modeling for Continued Conversation
Shuo-Yiin Chang, Guru Prakash, Zelin Wu, Tara Sainath, Bo Li, Qiao Liang, Adam Stambler, Shyam Upadhyay, Manaal Faruqui, Trevor Strohman

Improving Distortion Robustness of Self-Supervised Speech Processing Tasks with Domain Adaptation
Kuan Po Huang, Yu-Kuan Fu, Yu Zhang, Hung-yi Lee

XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale
Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli

Extracting Targeted Training Data from ASR Models, and How to Mitigate It
Ehsan Amid, Om Thakkar, Arun Narayanan, Rajiv Mathews, Françoise Beaufays

Detecting Unintended Memorization in Language-Model-Fused ASR
W. Ronny Huang, Steve Chien, Om Thakkar, Rajiv Mathews

AVATAR: Unconstrained Audiovisual Speech Recognition
Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

End-to-End Multi-talker Audio-Visual ASR Using an Active Speaker Attention Module
Richard Rose, Olivier Siohan

Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-person Video
Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

Unsupervised Data Selection via Discrete Speech Representation for ASR
Zhiyun Lu, Yongqiang Wang, Yu Zhang, Wei Han, Zhehuai Chen, Parisa Haghani

Non-parallel Voice Conversion for ASR Augmentation
Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Jesse Emond, Yinghui Huang, Pedro J. Moreno

Ultra-Low-Bitrate Speech Coding with Pre-trained Transformers
Ali Siahkoohi, Michael Chinen, Tom Denton, W. Bastiaan Kleijn, Jan Skoglund

Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification
Chao Zhang, Bo Li, Tara Sainath, Trevor Strohman, Sepand Mavandadi, Shuo-Yiin Chang, Parisa Haghani

Improving Deliberation by Text-Only and Semi-supervised Training
Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, Weiran Wang

E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
W. Ronny Huang, Shuo-Yiin Chang, David Rybach, Rohit Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu

CycleGAN-Based Unpaired Speech Dereverberation
Hannah Muckenhirn, Aleksandr Safin, Hakan Erdogan, Felix de Chaumont Quitry, Marco Tagliasacchi, Scott Wisdom, John R. Hershey

TRILLsson: Distilled Universal Paralinguistic Speech Representations (see blog post)
Joel Shor, Subhashini Venugopalan

Learning Neural Audio Features Without Supervision
Sarthak Yadav, Neil Zeghidour

SpeechPainter: Text-Conditioned Speech Inpainting
Zalan Borsos, Matthew Sharifi, Marco Tagliasacchi

SpecGrad: Diffusion Probabilistic Model-Based Neural Vocoder with Adaptive Noise Spectral Shaping
Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani

Distance-Based Sound Separation
Katharine Patterson, Kevin Wilson, Scott Wisdom, John R. Hershey

Analysis of Self-Attention Head Diversity for Conformer-Based Automatic Speech Recognition
Kartik Audhkhasi, Yinghui Huang, Bhuvana Ramabhadran, Pedro J. Moreno

Improving Rare Word Recognition with LM-Aware MWER Training
Weiran Wang, Tongzhou Chen, Tara Sainath, Ehsan Variani, Rohit Prabhavalkar, W. Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach

MAESTRO: Matched Speech Text Representations Through Modality Matching
Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno, Ankur Bapna, Heiga Zen

Pseudo Label is Better Than Human Label
Dongseong Hwang, Khe Chai Sim, Zhouyuan Huo, Trevor Strohman

On the Optimal Interpolation Weights for Hybrid Autoregressive Transducer Model
Ehsan Variani, Michael Riley, David Rybach, Cyril Allauzen, Tongzhou Chen, Bhuvana Ramabhadran

Streaming Align-Refine for Non-autoregressive Deliberation
Weiran Wang, Ke Hu, Tara Sainath

Federated Pruning: Improving Neural Network Efficiency with Federated Learning
Rongmei Lin*, Yonghui Xiao, Tien-Ju Yang, Ding Zhao, Li Xiong, Giovanni Motta, Françoise Beaufays

A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes
Shaojin Ding, Weiran Wang, Ding Zhao, Tara N. Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman

4-Bit Conformer with Native Quantization Aware Training for Speech Recognition
Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, Oleg Rybakov

Visually-Aware Acoustic Event Detection Using Heterogeneous Graphs
Amir Shirian, Krishna Somandepalli, Victor Sanchez, Tanaya Guha

A Conformer-Based Waveform-Domain Neural Acoustic Echo Canceller Optimized for ASR Accuracy
Sankaran Panchapagesan, Arun Narayanan, Turaj Zakizadeh Shabestary, Shuai Shao, Nathan Howard, Alex Park, James Walker, Alexander Gruenstein

Reducing Domain Mismatch in Self-Supervised Speech Pre-training
Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang, Nicolás Serrano

On-the-Fly ASR Corrections with Audio Exemplars
Golan Pundak, Tsendsuren Munkhdalai, Khe Chai Sim

A Language Agnostic Multilingual Streaming On-Device ASR System
Bo Li, Tara Sainath, Ruoming Pang*, Shuo-Yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani

XTREME-S: Evaluating Cross-Lingual Speech Representations
Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson

Towards Disentangled Speech Representations
Cal Peyser, Ronny Huang, Andrew Rosenberg, Tara Sainath, Michael Picheny, Kyunghyun Cho

Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition
Shaojin Ding, Rajeev Rikhye, Qiao Liang, Yanzhang He, Quan Wang, Arun Narayanan, Tom O’Malley, Ian McGraw

A Universally-Deployable ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement, and Voice Separation
Tom O’Malley, Arun Narayanan, Quan Wang

Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks
Lev Finkelstein, Heiga Zen, Norman Casagrande, Chun-an Chan, Ye Jia, Tom Kenter, Alex Petelin, Jonathan Shen*, Vincent Wan, Yu Zhang, Yonghui Wu, Robert Clark

A Scalable Model Specialization Framework for Training and Inference Using Submodels and Its Application to Speech Model Personalization
Fadi Biadsy, Youzheng Chen, Xia Zhang, Oleg Rybakov, Andrew Rosenberg, Pedro Moreno

Text-Driven Separation of Arbitrary Sounds
Kevin Kilgour, Beat Gfeller, Qingqing Huang, Aren Jansen, Scott Wisdom, Marco Tagliasacchi

Workshops, Tutorials & Special Sessions

The VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22)
Organizers include: Arsha Nagrani

Self-Supervised Representation Learning for Speech Processing
Organizers include: Tara Sainath

Learning from Weak Labels
Organizers include: Ankit Shah

RNN Transducers for Named Entity Recognition with Constraints on Alignment for Understanding Medical Conversations
Authors: Hagen Soltau, Izhak Shafran, Mingqiu Wang, Laurent El Shafey

Listening with Googlears: Low-Latency Neural Multiframe Beamforming and Equalization for Hearing Aids
Authors: Samuel Yang, Scott Wisdom, Chet Gnegy, Richard F. Lyon, Sagar Savla

Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset
Authors: Michael Chinen, Jan Skoglund, Chandan K. A. Reddy, Alessandro Ragano, Andrew Hines

Incremental Layer-Wise Self-Supervised Learning for Efficient Unsupervised Speech Domain Adaptation On Device
Authors: Zhouyuan Huo, Dongseong Hwang, Khe Chai Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, Françoise Beaufays

Trustworthy Speech Processing
Organizers include: Shrikanth Narayanan



*Work done while at Google.  

MultiLingual Staff
MultiLingual creates go-to news and resources for language industry professionals.