| 2025 | A Momentum-Based Framework with Contrastive Data Generation for Robust Sound Source Localization. Hyun-Soo Kim, Da-Hee Yang, Joon-Hyuk Chang |
| 2025 | A Neural Model for Contextual Biasing Score Learning and Filtering. Wanting Huang, Weiran Wang |
| 2025 | A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data. Cheng-Kang Chou, Chan-Jan Hsu, Ho-Lam Chung, Liang-Hsuan Tseng, Hsi-Chun Cheng, Yu-Kuan Fu, Kuan-Po Huang, Hung-yi Lee |
| 2025 | A Study of the Scale Invariant Signal to Distortion Ratio in Speech Separation with Noisy References. Simon Dahl Jepsen, Mads Græsbøll Christensen, Jesper Rindom Jensen |
| 2025 | A correlation-permutation approach for speech-music encoders model merging. Fabian Ritter Gutierrez, Yi-Cheng Lin, Jeremy H. M. Wong, Hung-yi Lee, Eng Siong Chng, Nancy F. Chen |
| 2025 | ASR for Affective Speech: Investigating Impact of Emotion and Speech Generative Strategy. Ya-Tse Wu, Chi-Chun Lee |
| 2025 | ASTAR-NTU solution to AudioMOS Challenge 2025 Track1. Fabian Ritter Gutierrez, Yi-Cheng Lin, Jui-Chiang Wei, Jeremy H. M. Wong, Nancy F. Chen, Hung-yi Lee |
| 2025 | AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks. Leander Melroy Maben, Gayathri Ganesh Lakshmy, Srijith Radhakrishnan, Siddhant Arora, Shinji Watanabe |
| 2025 | Acoustic Phonetic Temporal Speech Representation. Yunbin Deng |
| 2025 | Acoustic to Articulatory Speech Inversion for Children with Velopharyngeal Insufficiency. Saba Tabatabaee, Suzanne Boyce, Liran Oren, Mark Tiede, Carol Y. Espy-Wilson |
| 2025 | AdaBit-TasNet: Speech Separation with Inference Adaptable Precision. Mohamed Elminshawi, Srikanth Raj Chetupalli, Emanuël A. P. Habets |
| 2025 | Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs. Umberto Cappellazzo, Minsu Kim, Stavros Petridis |
| 2025 | Advancing Controllable Music Generation with Latent Rectified Flow Guided by Rhythm and Harmony. Haibin Yu, Jiayi Zhou, Wei Wang, Zhiming Wang, Huijia Zhu, Yanmin Qian |
| 2025 | All-in-One ASR: Unifying Encoder-Decoder Models of CTC, Attention, and Transducer in Dual-Mode ASR. Takafumi Moriya, Masato Mimura, Tomohiro Tanaka, Hiroshi Sato, Ryo Masumura, Atsunori Ogawa |
| 2025 | An Effective Strategy for Modeling Score Ordinality and Non-uniform Intervals in Automated Speaking Assessment. Tien-Hong Lo, Szu-Yu Chen, Yao-Ting Sung, Berlin Chen |
| 2025 | Analysing the Language of Neural Audio Codecs. Joonyong Park, Shinnosuke Takamichi, David M. Chan, Shunsuke Kando, Yuki Saito, Hiroshi Saruwatari |
| 2025 | Analysis of Domain Shift across ASR Architectures via TTS-Enabled Separation of Target Domain and Acoustic Conditions. Tina Raissi, Nick Rossenbach, Ralf Schlüter |
| 2025 | AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR. Tuan Nguyen, Huy-Dat Tran |
| 2025 | AsyncVoice Agent: Real-Time Explanation for LLM Planning and Reasoning. Yueqian Lin, Zhengmian Hu, Jayakumar Subramanian, Qinsi Wang, Nikos Vlassis, Hai Li, Yiran Chen |
| 2025 | Audio Aesthetics Prediction System QAM16k Based on Pre-trained Audio Encoder. Linping Xu, Ziqian Wu, Dejun Zhang |
| 2025 | Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model. Ziyang Ma, Zhuo Chen, Yuping Wang, Eng Siong Chng, Xie Chen |
| 2025 | AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models. Chih-Kai Yang, Neo Ho, Yi-Jyun Lee, Hung-yi Lee |
| 2025 | Benchmarking Prosody Encoding in Discrete Speech Tokens. Kentaro Onda, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu |
| 2025 | Benchmarking Rotary Position Embeddings for Automatic Speech Recognition. Shucong Zhang, Titouan Parcollet, Rogier C. van Dalen, Sourav Bhattacharya |
| 2025 | Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning. Yu-Hsuan Fang, Tien-Hong Lo, Yao-Ting Sung, Berlin Chen |
| 2025 | Bottleneck Transformer-Based Approach for Improved Automatic STOI Score Prediction. Amartyaveer, Murali Kadambi, Chandra Mohan Sharma, Anupam Mandal, Prasanta Kumar Ghosh |
| 2025 | Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition. Mu Yang, Szu-Jui Chen, Jiamin Xie, John H. L. Hansen |
| 2025 | CAMÕES: A Comprehensive Automatic Speech Recognition Benchmark for European Portuguese. Carlos Carvalho, Francisco Teixeira, Catarina Botelho, Anna Pompili, Rubén Solera-Ureña, Sérgio Paulo, Mariana Julião, Thomas Rolland, John Mendonça, Diogo A. P. Nunes, Isabel Trancoso, Alberto Abad |
| 2025 | CASPER: A Large Scale Spontaneous Speech Dataset. Cihan Xiao, Ruixing Liang, Xiangyu Zhang, Mehmet Emre Tiryaki, Veronica Bae, Lavanya Shankar, Rong Yang, Ethan Poon, Emmanuel Dupoux, Sanjeev Khudanpur, Leibny Paola García-Perera |
| 2025 | CAVIARES: Corpus for Audio-Visual Expressive Voice Agent. Jinsheng Chen, Yuki Saito, Dong Yang, Naoko Tanji, Hironori Doi, Byeongseon Park, Yuma Shirahata, Kentaro Tachibana, Hiroshi Saruwatari |
| 2025 | CLAIRA: Leveraging Large Language Models to Judge Audio Captions. Tsung-Han Wu, Joseph E. Gonzalez, Trevor Darrell, David M. Chan |
| 2025 | CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition. Yun-Shao Tsai, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee |
| 2025 | Can We Really Repurpose Multi-Speaker ASR Corpus for Speaker Diarization? Shota Horiguchi, Naohiro Tawara, Takanori Ashihara, Atsushi Ando, Marc Delcroix |
| 2025 | Can self-supervised speech models predict the perceived acceptability of prosodic variation? Sarenne Wallbridge, Adaeze Adigwe, Peter Bell |
| 2025 | ChipChat: Low-Latency Cascaded Conversational Agent in MLX. Tatiana Likhomanenko, Luke Carlson, Richard He Bai, Zijin Gu, Han Tran, Zakaria Aldeneh, Yizhe Zhang, Ruixiang Zhang, Huangjie Zheng, Navdeep Jaitly |
| 2025 | CoLMbo: Speaker Language Model for Descriptive Profiling. Massa Baali, Shuo Han, Syed Abdul Hannan, Purusottam Samal, Karanveer Singh, Soham Deshmukh, Rita Singh, Bhiksha Raj |
| 2025 | Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs. Wei-Cheng Tseng, David Harwath |
| 2025 | Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data. Gokul Karthik Kumar, Rishabh Saraf, Ludovick Lepauloux, Abdul Muneer, Billel Mokeddem, Hakim Hacid |
| 2025 | Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion. Yu Zhang, Baotong Tian, Zhiyao Duan |
| 2025 | Confidence-Based Self-Training for EMG-to-Speech: Leveraging Synthetic EMG for Robust Modeling. Xiaodan Chen, Xiaoxue Gao, Mathias Quoy, Alexandre Pitti, Nancy F. Chen |
| 2025 | Continual Pre-training for Codec-Based Speech LLMs: Balancing Understanding and Generation. Jiatong Shi, Chunlei Zhang, Jinchuan Tian, Junrui Ni, Hao Zhang, Shinji Watanabe, Dong Yu |
| 2025 | Controllable Singing Voice Synthesis using Phoneme-Level Energy Sequence. Yerin Ryu, Inseop Shin, Chanwoo Kim |
| 2025 | Customizing Speech Recognition Model with Large Language Model Feedback. Shaoshi Ling, Guoli Ye |
| 2025 | DarkStream: real-time speech anonymization with low latency. Waris Quamer, Ricardo Gutierrez-Osuna |
| 2025 | DeCRED: Decoder-Centric Regularization for Encoder-Decoder Based Speech Recognition. Alexander Polok, Santosh Kesiraju, Karel Benes, Bolaji Yusuf, Lukás Burget, Jan Cernocký |
| 2025 | Deep Audio Zooming: Creating a Sound Barrier With Microphone Array Processing. Meng Yu, Dong Yu |
| 2025 | DiffRhythm+: Controllable and Flexible Full-Length Song Generation with Preference Optimization. Huakang Chen, Yuepeng Jiang, Guobin Ma, Chunbo Hao, Shuai Wang, Jixun Yao, Ziqian Ning, Meng Meng, Jian Luan, Lei Xie |
| 2025 | Diversity and complementarity of speech encoders across diverse tasks in a multi-modal large language model. Jeremy H. M. Wong, Muhammad Huzaifah, Hardik B. Sailor, Shuo Sun, Kye Min Tan, Bin Wang, Qiongqiong Wang, Wenyu Zhang, Xunlong Zou, Nancy F. Chen, Ai Ti Aw |
| 2025 | Do Self-Supervised Speech Models Exhibit the Critical Period Effects in Language Acquisition? Yurie Koga, Shunsuke Kando, Yusuke Miyao |
| 2025 | DyMEvalNet: Dynamic Text-Audio-Personalization Fusion for Multimodal Music Quality Assessment. Xiaoxun Wu, Kailai Shen, Yuheng Huang, Naiyuan Li, Diqun Yan |
| 2025 | EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition. Yi-Cheng Lin, Huang-Cheng Chou, Yu-Hsuan Li Liang, Hung-yi Lee |
| 2025 | EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems. Jingwen Liu, Kan Jen Cheng, Jiachen Lian, Akshay Anand, Rishi Jain, Faith Qiao, Robin Netzorg, Huang-Cheng Chou, Tingle Li, Guan-Ting Lin, Gopala Anumanchipalli |
| 2025 | Efficient ASR Domain Adaptation with Long Noun Phrases: Harnessing the Linguistic Characteristics of Japanese. Shusuke Komatsu, Kazuyo Onishi, Koki Tanaka, Dohyun Kim, Koichiro Yoshino |
| 2025 | Efficient Deployment of Large Speech Recognition Models on GPU. Yuekai Zhang, Shuang Yu, Junjie Lai |
| 2025 | Efficient Scaling for LLM-based ASR. Bingshen Mu, Yiwen Shao, Kun Wei, Dong Yu, Lei Xie |
| 2025 | Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation. Yang Cui, Peter Pan, Lei He, Sheng Zhao |
| 2025 | EmoBiMamba-TTS: Bidirectional State Space Model for Emotion-Intensity Controllable Text-to-Speech. Insung Ham, Bonwha Ku, Hanseok Ko |
| 2025 | EmoTale: An Enacted Speech-emotion Dataset in Danish. Maja J. Hjuler, Harald V. Skat-Rørdam, Line H. Clemmensen, Sneha Das |
| 2025 | Emotional Styles Hide in Deep Speaker Embeddings: Disentangle Deep Speaker Embeddings for Speaker Clustering. Chaohao Lin, Xu Zheng, Kaida Wu, Peihao Xiang, Ou Bai |
| 2025 | Emphasis Sensitivity in Speech Representations. Shaun Cassini, Thomas Hain, Anton Ragni |
| 2025 | Enhancing Code-switched Text-to-Speech Synthesis Capability in Large Language Models with only Monolingual Corpora. Jing Xu, Daxin Tan, Jiaqi Wang, Xiao Chen |
| 2025 | Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM. Thomas Thebaud, Yen-Ju Lu, Matthew Wiesner, Peter Viechnicki, Najim Dehak |
| 2025 | Enhancing Fully Formatted End-to-End Speech Recognition with Knowledge Distillation via Multi-Codebook Vector Quantization. Jian You, Xiangfeng Li, Erwan Zerhouni |
| 2025 | Enhancing In-the-Wild Speech Emotion Conversion with Resynthesis-based Duration Modeling. Navin Raj Prabhu, Danilo de Oliveira, Nale Lehmann-Willenbrock, Timo Gerkmann |
| 2025 | Evaluating Japanese Dialect Robustness Across Speech and Text-based Large Language Models. Tomoya Mizumoto, Yusuke Fujita, Hao Shi, Lianbo Liu, Atsushi Kojima, Yui Sudo |
| 2025 | Evaluating Self-Supervised Speech Models Via Text-Based LLMs. Takashi Maekaku, Keita Goto, Jinchuan Tian, Yusuke Shinohara, Shinji Watanabe |
| 2025 | Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition. Yuan Tseng, Titouan Parcollet, Rogier C. van Dalen, Shucong Zhang, Sourav Bhattacharya |
| 2025 | Expressive Speech Retrieval using Natural Language Descriptions of Speaking Style. Wonjune Kang, Deb Roy |
| 2025 | Fake-Mamba: Real-Time Speech Deepfake Detection Using Bidirectional Mamba as Self-Attention's Alternative. Xi Xuan, Zimo Zhu, Wenxin Zhang, Yi-Cheng Lin, Tomi Kinnunen |
| 2025 | Few-shot Personalization via In-Context Learning for Speech Emotion Recognition based on Speech-Language Model. Mana Ihori, Taiga Yamane, Naotaka Kawata, Naoki Makishima, Tomohiro Tanaka, Satoshi Suzuki, Shota Orihashi, Ryo Masumura |
| 2025 | Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction. Yangui Fang, Baixu Cheng, Jing Peng, Xu Li, Yu Xi, Chengwei Zhang, Guohui Zhong |
| 2025 | FlexCTC: GPU-powered CTC Beam Decoding With Advanced Contextual Abilities. Lilit Grigoryan, Vladimir Bataev, Nikolay Karpov, Andrei Andrusenko, Vitaly Lavrukhin, Boris Ginsburg |
| 2025 | Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling. Ju-Chieh Chou, Jiawei Zhou, Karen Livescu |
| 2025 | From Simulation to Strategy: Automating Personalized Interaction Planning for Conversational Agents. Wen-Yu Chang, Tzu-Hung Huang, Chih-Ho Chen, Yun-Nung Chen |
| 2025 | Full-Duplex-Bench: A Benchmark to Evaluate Full-Duplex Spoken Dialogue Models on Turn-taking Capabilities. Guan-Ting Lin, Jiachen Lian, Tingle Li, Qirui Wang, Gopala Anumanchipalli, Alexander H. Liu, Hung-yi Lee |
| 2025 | GenVC: Self-Supervised Zero-Shot Voice Conversion. Zexin Cai, Henry Li Xinyuan, Ashi Garg, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews |
| 2025 | Geolocation-Aware Robust Spoken Language Identification. Qingzheng Wang, Hye-jin Shim, Jiancheng Sun, Shinji Watanabe |
| 2025 | Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities. George Saon, Avihu Dekel, Alexander Brooks, Tohru Nagano, Abraham Daniels, Aharon Satt, Ashish R. Mittal, Brian Kingsbury, David Haws, Edmilson Da Silva Morais, Gakuto Kurata, Hagai Aronowitz, Ibrahim Ibrahim, Hong-Kwang Kuo, Kate Soule, Luis A. Lastras, Masayuki Suzuki, Ron Hoory, Samuel Thomas, Sashi Novitasari, Takashi Fukuda, Vishal Sunder, Xiaodong Cui, Zvi Kons |
| 2025 | Graph Connectionist Temporal Classification for Phoneme Recognition. Henry Grafé, Hugo Van hamme |
| 2025 | Group Relative Policy Optimization for Speech Recognition. Prashanth Gurunath Shivakumar, Yile Gu, Ankur Gandhe, Ivan Bulyko |
| 2025 | HighRateMOS: Sampling-Rate Aware Modeling for Speech Quality Assessment. Wenze Ren, Yi-Cheng Lin, Wen-Chin Huang, Ryandhimas E. Zezario, Szu-Wei Fu, Sung-Feng Huang, Erica Cooper, Haibin Wu, Hung-Yu Wei, Hsin-Min Wang, Hung-yi Lee, Yu Tsao |
| 2025 | Hybrid Decoding: Rapid Pass and Selective Detailed Correction for Sequence Models. Yunkyu Lim, Jihwan Park, Hyung Yong Kim, Hanbin Lee, Byeong-Yeol Kim |
| 2025 | IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2025, Honolulu, HI, USA, December 6-10, 2025 |
| 2025 | Identifying and Calibrating Overconfidence in Noisy Speech Recognition. Mingyue Huo, Yuheng Zhang, Yan Tang |
| 2025 | Improving Multimodal Speech-To-Slide Alignment for Academic Lectures with Vision LLMs. Thomas Ranzenberger, Dominik Wagner, Steffen Freisinger, Tobias Bocklet, Korbinian Riedhammer |
| 2025 | Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion. DongHoon Lim, Youngchae Kim, Dong-Hyun Kim, Da-Hee Yang, Joon-Hyuk Chang |
| 2025 | Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings. Dyah A. M. G. Wisnu, Ryandhimas E. Zezario, Stefano Rini, Hsin-Min Wang, Yu Tsao |
| 2025 | Improving Resource-Efficient Speech Enhancement via Neural Differentiable DSP Vocoder Refinement. Heitor R. Guimarães, Ke Tan, Juan Azcarreta, Jesus Alvarez, Prabhav Agrawal, Ashutosh Pandey, Buye Xu |
| 2025 | Improving Speech Enhancement with Multi-Metric Supervision from Learned Quality Assessment. Wei Wang, Wangyou Zhang, Chenda Li, Jiatong Shi, Shinji Watanabe, Yanmin Qian |
| 2025 | Improving Streaming ASR via Differentially Private Fusion of Data from Multiple Sources. Virat Shejwalkar, Om Thakkar, Steve Chien, Nicole Rafidi, Arun Narayanan |
| 2025 | Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models. Qiongqiong Wang, Hardik Bhupendra Sailor, Jeremy H. M. Wong, Tianchi Liu, Shuo Sun, Wenyu Zhang, Muhammad Huzaifah, Nancy F. Chen, Ai Ti Aw |
| 2025 | Intermediate-Selective Feature Enhancement for Speech Emotion Recognition. Yangbiao Li, Xiaofen Xing, Jialong Mai, Jingyuan Xing, Xiangmin Xu |
| 2025 | Interpreting the Role of Visemes in Audio-Visual Speech Recognition. Aristeidis Papadopoulos, Naomi Harte |
| 2025 | Is Smaller Always Faster? Tradeoffs in Compressing Self-Supervised Speech Transformers. Tzu-Quan Lin, Tsung-Huan Yang, Chun-Yao Chang, Kuang-Ming Chen, Tzu-hsun Feng, Hung-yi Lee, Hao Tang |
| 2025 | Iterative Feedback in the Online Active Learning Paradigm. Mark Lindsey, Francis Kubala, Richard M. Stern |
| 2025 | JOOCI: a Novel Method for Learning Comprehensive Speech Representations. Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah |
| 2025 | Joint ASR and Speech Attribute Prediction for Conversational Dysarthric Speech Analysis with Multimodal Language Models. Dominik Wagner, Ilja Baumann, Natalie Engert, Elmar Nöth, Korbinian Riedhammer, Tobias Bocklet |
| 2025 | Joint Multimodal Contrastive Learning for Robust Spoken Term Detection and Keyword Spotting. Ramesh Gundluru, Shubham Gupta, K. Sri Rama Murty |
| 2025 | KAN-AST: Kolmogorov-Arnold Network based Audio Spectrogram Transformer for Audio Classification. Phuong Tuan Dat, Tran Huy Dat |
| 2025 | KyotoMOS2: MOS Prediction for Speech Across Multiple Sampling Rates. Wangjin Zhou, Yizhou Zhang, Keisuke Imoto, Tatsuya Kawahara |
| 2025 | L2 Vowel Acquisition Analysis at the Inventory Level. Shuju Shi |
| 2025 | LCS-CTC: Leveraging Soft Alignments to Enhance Phonetic Transcription Robustness. Zongli Ye, Jiachen Lian, Akshaj Gupta, Xuanru Zhou, Haodong Li, Krish Patel, Hwi Joo Park, Dingkun Zhou, Chenxu Guo, Shuhe Li, Sam Wang, Iris Zhou, Cheol Jun Cho, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary A. Miller, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli |
| 2025 | LLM-Based Dictation Detection from Doctor-Patient Conversations. Siyuan Chen, Mojtaba Kadkhodaie Elyaderani, Jing Su, Susanne Burger, Thomas Schaaf |
| 2025 | LauraTSE: Target Speaker Extraction using Auto-Regressive Decoder-Only Language Models. Beilong Tang, Bang Zeng, Ming Li |
| 2025 | Layer-wise Analysis for Quality of Multilingual Synthesized Speech. Erica Cooper, Takuma Okamoto, Yamato Ohtani, Tomoki Toda, Hisashi Kawai |
| 2025 | Learning Marmoset Vocal Patterns with a Masked Autoencoder for Robust Call Segmentation, Classification, and Caller Identification. Bin Wu, Shinnosuke Takamichi, Sakriani Sakti, Satoshi Nakamura |
| 2025 | Less is More: Data Curation Matters in Scaling Speech Enhancement. Chenda Li, Wangyou Zhang, Wei Wang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Yihui Fu, Marvin Sach, Zhaoheng Ni, Anurag Kumar, Tim Fingscheidt, Shinji Watanabe, Yanmin Qian |
| 2025 | Lightweight Prompt Biasing for Contextualized End-to-End ASR Systems. Bo Ren, Yu Shi, Jinyu Li |
| 2025 | Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation. Jisoo Park, Seonghak Lee, Guisik Kim, Taewoo Kim, Junseok Kwon |
| 2025 | Llasa+: Free Lunch for Accelerated and Streaming Llama-Based Speech Synthesis. Wenjie Tian, Xinfa Zhu, Hanke Xie, Zhen Ye, Wei Xue, Lei Xie |
| 2025 | Long-Form Fuzzy Speech-to-Text Alignment for 1000+ Languages. Ruizhe Huang, Xiaohui Zhang, Zhaoheng Ni, Moto Hira, Jeff Hwang, Vineel Pratap, Ju Lin, Ming Sun, Florian Metze |
| 2025 | Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning. Yangui Fang, Jing Peng, Xu Li, Yu Xi, Chengwei Zhang, Guohui Zhong, Kai Yu |
| 2025 | MADASR 2.0: Multi-Lingual Multi-Dialect ASR Challenge in 8 Indian Languages. Saurabh Kumar, Sumit Sharma, Deekshitha G, Abhayjeet Singh, Amartyaveer, Sathvik Udupa, Sandhya Badiger, Sanjeev Khudanpur, Sunayana Sitaram, Srinivasan Umesh, Bhuvana Ramabhadran, Brian Kingsbury, Hema A. Murthy, Srikanth S. Narayanan, Howard Lakougna, Prasanta Kumar Ghosh |
| 2025 | MBENet: Bone-conduction and Air-conduction Fusion Network for Target Speaker Extraction. Chen Zhang, Linfeng Feng, Zhi Liu, Xiao-Lei Zhang, Xuelong Li |
| 2025 | MEAN-RIR: Multi-Modal Environment-Aware Network for Robust Room Impulse Response Estimation. Jiajian Chen, Jiakang Chen, Hang Chen, Qing Wang, Yu Gao, Jun Du |
| 2025 | MMMOS: Multi-domain Multi-axis Audio Quality Assessment. Yi-Cheng Lin, Jia-Hung Chen, Hung-yi Lee |
| 2025 | MMW: Side Talk Rejection Multi-Microphone Whisper On Smart Glasses. Yang Liu, Li Wan, Yiteng Huang, Yong Xu, Yangyang Shi, Saurabh Adya, Ming Sun, Florian Metze |
| 2025 | MNSC: Advancing Singlish Speech Understanding with Carefully Curated Corpora. Bin Wang, Xunlong Zou, Shuo Sun, Wenyu Zhang, Yingxu He, Zhuohan Liu, Chengwei Wei, Nancy F. Chen, Ai Ti Aw |
| 2025 | Maestro-EVC: Controllable Emotional Voice Conversion Guided by References and Explicit Prosody. Jinsung Yoon, Wooyeol Jeong, Jio Gim, Young-Joo Suh |
| 2025 | Masked Self-distilled Transducer-based Keyword Spotting with Semi-autoregressive Decoding. Yu Xi, Xiaoyu Gu, Haoyu Li, Jun Song, Bo Zheng, Kai Yu |
| 2025 | Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation. Hongming Guo, Ruibo Fu, Yizhong Geng, Shuchen Shi, Tao Wang, Chunyu Qiang, Ya Li, Zhengqi Wen, Yukun Liu, Xuefei Liu, Chenxing Li |
| 2025 | Meta Audiobox Aesthetics: Unified Automatic Assessment for Speech, Music and Sound. Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, Carleigh Wood, Ann Lee, Wei-Ning Hsu |
| 2025 | MoLEx: Mixture of LoRA Experts in Speech Self-Supervised Models for Audio Deepfake Detection. Zihan Pan, Hardik B. Sailor, Jinyang Wu |
| 2025 | More Similar than Dissimilar: Modeling Annotators for Cross-Corpus Speech Emotion Recognition. James Tavernor, Emily Mower Provost |
| 2025 | Multi-Distillation from Speech and Music Representation Models. Jui-Chiang Wei, Yi-Cheng Lin, Fabian Ritter Gutierrez, Hung-yi Lee |
| 2025 | Multi-Sampling-Frequency Naturalness MOS Prediction Using Self-Supervised Learning Model with Sampling-Frequency-Independent Layer. Go Nishikawa, Wataru Nakata, Yuki Saito, Kanami Imamura, Hiroshi Saruwatari, Tomohiko Nakamura |
| 2025 | Multi-Target Backdoor Attacks Against Speaker Recognition. Alexandrine Fortier, Sonal Joshi, Thomas Thebaud, Jesús Antonio Villalba López, Najim Dehak, Patrick Cardinal |
| 2025 | Multilingual Dataset Integration Strategies for Robust Audio Deepfake Detection: A SAFE Challenge System. Hashim Ali, Surya Subramani, Nithin Sai Adupa, Lekha Bollinani, Sali El-Loh, Hafiz Malik |
| 2025 | Non-Autoregressive Multi-Speaker ASR with Decoupled Speaker Change Detection. Yingke Zhu, Lahiru Samarakoon |
| 2025 | OOQ: Outlier-Oriented Quantization for Efficient Large Language Models. Haoyu Wang, Bei Liu, Hang Shao, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian |
| 2025 | Obtaining objective labels and analysing annotator subjectivity by using a Rasch model for ordinal speech processing. Jeremy H. M. Wong, Nancy F. Chen |
| 2025 | Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM? Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogério Feris, James R. Glass |
| 2025 | Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition. Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly |
| 2025 | On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts. Kashaf Gulzar, Dominik Wagner, Sebastian P. Bayerl, Florian Hönig, Tobias Bocklet, Korbinian Riedhammer |
| 2025 | On the Use of Self-Supervised Representation Learning for Speaker Diarization and Separation. Séverin Baroudi, Hervé Bredin, Joseph Razik, Ricard Marxer |
| 2025 | Open Full-duplex Voice Agent with Speech-to-Speech Language Model. Edresson Casanova, Chen Chen, Kevin Hu, Ankita Pasad, Elena Rastorgueva, Seelan Lakshmi Narasimhan, Slyne Deng, Ehsan Hosseini-Asl, Piotr Zelasko, Valentin Mendelev, Subhankar Ghosh, Yifan Peng, Zhehuai Chen, Jason Li, Jagadeesh Balam, Vitaly Lavrukhin, Boris Ginsburg |
| 2025 | PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation. Jiajun He, Naoki Sawada, Koichi Miyazaki, Tomoki Toda |
| 2025 | PRIME: Novel Prompting Strategies for Effective Biasing Word Recognition in Contextualized ASR. Yu-Chun Liu, Li-Ting Pai, Yi-Cheng Wang, Bi-Cheng Yan, Hsin-Wei Wang, Chi-Han Lin, Juan-Wei Xu, Berlin Chen |
| 2025 | PURE Codec: Progressive Unfolding of Residual Entropy for Speech Codec Learning. Jiatong Shi, Haoran Wang, William Chen, Chenda Li, Wangyou Zhang, Jinchuan Tian, Shinji Watanabe |
| 2025 | Personalized Federated Learning with Fuzzy Clustering for Dysarthric Speech Recognition. Jie-Shiang Yang, Jing-Tong Tzeng, Chi-Chun Lee |
| 2025 | Phoneme Overlapping-Aware Pre-Training with External Text Resources for Multi-Talker ASR. Ryo Masumura, Tomohiro Tanaka, Naoki Makishima, Mana Ihori, Shota Orihashi, Naotaka Kawata, Taiga Yamane, Satoshi Suzuki, Takafumi Moriya |
| 2025 | PhysMVNet: Physics-Informed End-to-End MVDR Beamformer with Residual Spectral Mapping for Multichannel Speech Enhancement. Xingyu Shen, Wei-Ping Zhu, Benoît Champagne |
| 2025 | Pitch-Assistant Harmonic Recovery for Efficient Speech Enhancement. Biao Liu, Zengqiang Shang, Haoyuan Xie, Mou Wang, Xin Liu, Pengyuan Zhang |
| 2025 | Post-training for Deepfake Speech Detection. Wanying Ge, Xin Wang, Xuechen Liu, Junichi Yamagishi |
| 2025 | Predictive ASR and Turn-taking Prediction at Once: Towards More Responsive Spoken Dialog System. Ryo Fukuda, Takatomo Kano, Naohiro Tawara, Marc Delcroix, Atsunori Ogawa, Yuya Chiba, Atsushi Ando |
| 2025 | ProtoCLAP - Prototypical Contrastive Language-Audio Pretraining. Adria Mallol-Ragolta, Björn W. Schuller |
| 2025 | QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems. Chien-Chun Wang, Kuan-Tang Huang, Cheng-Yeh Yang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen |
| 2025 | Qieemo: Multimodal Emotion Recognition Based on the ASR Backbone. Jinming Chen, Jingyi Fang, Yuanzhong Zheng, Yaoxuan Wang, Haojun Fei |
| 2025 | RE-LLM: Refining Empathetic Speech-LLM Responses by Integrating Emotion Nuance. Jing-Han Chen, Bo-Hao Su, Ya-Tse Wu, Chi-Chun Lee |
| 2025 | REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers. Yuepeng Jiang, Ziqian Ning, Shuai Wang, Chengjia Wang, Mengxiao Bi, Pengcheng Zhu, Zhonghua Fu, Lei Xie |
| 2025 | Rapidly Adapting to New Voice Spoofing: Few-Shot Detection of Synthesized Speech Under Distribution Shifts. Ashi Garg, Zexin Cai, Henry Li Xinyuan, Leibny Paola García-Perera, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews |
| 2025 | Recognizing Dementia from Neuropsychological Tests with State Space Models. Liming Wang, Saurabhchand Bhati, Cody Karjadi, Rhoda Au, James R. Glass |
| 2025 | Reducing Object Hallucination in Large Audio-Language Models via Audio-Aware Decoding. Tzu-Wen Hsu, Ke-Han Lu, Cheng-Han Chiang, Hung-yi Lee |
| 2025 | Reliability of Lexical Richness Measures for ASR-Based Children's Speech Assessment. Imen Talbi, Christopher Gebauer, Lars Rumberg, Edith Beaulac, Hanna Ehlert, Jörn Ostermann |
| 2025 | Revealing the Role of Audio Channels in ASR Performance Degradation. Kuan-Tang Huang, Li-Wei Chen, Hung-Shin Lee, Berlin Chen, Hsin-Min Wang |
| 2025 | Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM. Chiori Hori, Yoshiki Masuyama, Siddarth Jain, Radu Corcodel, Devesh K. Jha, Diego Romeres, Jonathan Le Roux |
| 2025 | Robust Speech Emotion Recognition via Classifier Retraining on Mixup-Augmented Representations. Shi-wook Lee |
| 2025 | Robust Training of Singing Voice Synthesis Using Prior and Posterior Uncertainty. Yiwen Zhao, Jiatong Shi, Yuxun Tang, William Chen, Shinji Watanabe |
| 2025 | SEF-MK: Speaker-Embedding-Free Voice Anonymization through Multi-k-means Quantization. Beilong Tang, Xiaoxiao Miao, Xin Wang, Ming Li |
| 2025 | SENSE models: an open source solution for multilingual and multimodal semantic-based tasks. Salima Mdhaffar, Haroun Elleuch, Chaimae Chellaf, Ha Nguyen, Yannick Estève |
| 2025 | SLM-S2ST: A multimodal language model for direct speech-to-speech translation. Yuxuan Hu, Haibin Wu, Ruchao Fan, Xiaofei Wang, Heng Lu, Yao Qian, Jinyu Li |
| 2025 | SMILE: Speech Meta In-Context Learning for Low-Resource Language Automatic Speech Recognition. Ming-Hao Hsu, Hung-yi Lee |
| 2025 | SSVD: Structured SVD for Parameter-Efficient Fine-Tuning and Benchmarking under Domain Shift in ASR. Pu Wang, Shinji Watanabe, Hugo Van hamme |
| 2025 | SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR. Wei-Ping Huang, Guan-Ting Lin, Hung-yi Lee |
| 2025 | SV-Mixer: Replacing the Transformer Encoder with Lightweight MLPs for Self-Supervised Model Compression in Speaker Verification. Jungwoo Heo, Hyun-seo Shin, Chan-yeong Lim, Kyo-Won Koo, Seung-Bin Kim, Jisoo Son, Ha-Jin Yu |
| 2025 | Scalable Controllable Accented TTS. Henry Li Xinyuan, Zexin Cai, Ashi Garg, Kevin Duh, Leibny Paola García-Perera, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner |
| 2025 | Selection of Layers from Self-supervised Learning Models for Predicting Mean-Opinion-Score of Speech. Xinyu Liang, Fredrik Cumlin, Victor Ungureanu, Chandan K. A. Reddy, Christian Schüldt, Saikat Chatterjee |
| 2025 | Serialized Output Prompting for Large Language Model-based Multi-Talker Speech Recognition. Hao Shi, Yusuke Fujita, Tomoya Mizumoto, Lianbo Liu, Atsushi Kojima, Yui Sudo |
| 2025 | Sinba: Singing-To-Accompaniment Generation With Pitch Guidance Via Mamba-Based Language Model. Jianwei Cui, Shihao Chen, Yu Gu, Jie Zhang, Liping Chen, Na Li, Chengxing Li, Shan Yang, Li-Rong Dai |
| 2025 | SincQDR-VAD: A Noise-Robust Voice Activity Detection Framework Leveraging Learnable Filters and Ranking-Aware Optimization. Chien-Chun Wang, En-Lun Yu, Jeih-weih Hung, Shih-Chieh Huang, Berlin Chen |
| 2025 | Speaker Style-Aware Phoneme Anchoring For Improved Cross-Lingual Speech Emotion Recognition. Shreya G. Upadhyay, Carlos Busso, Chi-Chun Lee |
| 2025 | Speech Masking System Based on Spatially Separated Multiple TTS Maskers With A Compact Circular Loudspeaker Array. Takuma Okamoto |
| 2025 | Speech Synthesis From Continuous Features Using Per-Token Latent Diffusion. Arnon Turetzky, Avihu Dekel, Nimrod Shabtay, Slava Shechtman, David Haws, Hagai Aronowitz, Ron Hoory, Yossi Adi |
| 2025 | Speech in-context learning of paralinguistic tasks. Jeremy H. M. Wong, Muhammad Huzaifah, Nancy F. Chen, Ai Ti Aw |
| 2025 | Spiralformer: Low Latency Encoder for Streaming Speech Recognition with Circular Layer Skipping and Early Exiting. Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe |
| 2025 | State-of-the-art Embeddings with Video-free Segmentation of the Source VoxCeleb Data. Sara Barahona, Ladislav Mosner, Themos Stafylakis, Oldrich Plchot, Junyi Peng, Lukás Burget, Jan Cernocký |
| 2025 | Streaming Endpointer for Spoken Dialogue using Neural Audio Codecs and Label-Delayed Training. Sathvik Udupa, Shinji Watanabe, Petr Schwarz, Jan Cernocký |
| 2025 | Text-Guided Speech Representations for Language Acquisition Assessment. Ilja Baumann, Dominik Wagner, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet |
| 2025 | The AudioMOS Challenge 2025. Wen-Chin Huang, Hui Wang, Cheng Liu, Yi-Chiao Wu, Andros Tjandra, Wei-Ning Hsu, Erica Cooper, Yong Qin, Tomoki Toda |
| 2025 | The JHU-MIT System for NIST SRE24: Post-Evaluation Analysis. Jesús Villalba, Jonas Borgstrom, Prabhav Singh, Leibny Paola García, Pedro A. Torres-Carrasquillo, Najim Dehak |
| 2025 | The T12 System for AudioMOS Challenge 2025: Audio Aesthetics Score Prediction System Using KAN- and VERSA-based Models. Katsuhiko Yamamoto, Koichi Miyazaki, Shogo Seki |
| 2025 | Time-Frequency-Based Attention Cache Memory Model for Real-Time Speech Separation. Guo Chen, Kai Li, Runxuan Yang, Xiaolin Hu |
| 2025 | Token-based Attractors and Cross-attention in Spoof Diarization. Kyo-Won Koo, Chan-yeong Lim, Jee-weon Jung, Hye-jin Shim, Ha-Jin Yu |
| 2025 | TokenVerse++: Towards Flexible Multitask Learning with Dynamic Task Activation. Shashi Kumar, Srikanth R. Madikeri, Esaú Villatoro-Tello, Sergio Burdisso, Pradeep Rangappa, Roberto Carofilis, Petr Motlícek, Karthik Pandia, Shankar Venkatesan, Kadri Hacioglu, Andreas Stolcke |
| 2025 | Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model. Haibin Wu, Yuxuan Hu, Ruchao Fan, Xiaofei Wang, Ken'ichi Kumatani, Bo Ren, Jianwei Yu, Heng Lu, Lijuan Wang, Yao Qian, Jinyu Li |
| 2025 | Towards General Discrete Speech Codec for Complex Acoustic Environments: A Study of Reconstruction and Downstream Task Consistency. Haoran Wang, Guanyu Chen, Bohan Li, Hankun Wang, Yiwei Guo, Zhihan Li, Xie Chen, Kai Yu |
| 2025 | Towards Generalized Source Tracing for Codec-Based Deepfake Speech. I-Ming Lin, Xuanjun Chen, Lin Zhang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang |
| 2025 | Towards Scalable and Robust Multilingual ASR for Indian Languages with MixLoRA-Whisper. Yeseul Park, Bowon Lee |
| 2025 | Training and Inference Efficiency of Encoder-Decoder Speech Models. Piotr Żelasko, Kunal Dhawan, Daniel Galvez, Krishna C. Puvvada, Ankita Pasad, Travis M. Bartley, Nithin Rao Koluguri, Ke Hu, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg |
| 2025 | Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models. Tolúlopé Ògúnrèmí, Christopher D. Manning, Dan Jurafsky, Karen Livescu |
| 2025 | TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree. Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan, Vitaly Lavrukhin, Boris Ginsburg |
| 2025 | ULTRAS - Unified Learning of Transformer Representations for Audio and Speech Signals. P. E. Ameenudeen, Charumathi Narayanan, Sriram Ganapathy |
| 2025 | URGENT-PK: Perceptually-Aligned Ranking Model Designed for Speech Enhancement Competition. Jiahe Wang, Chenda Li, Wei Wang, Wangyou Zhang, Samuele Cornell, Marvin Sach, Robin Scheibler, Kohei Saijo, Yihui Fu, Zhaoheng Ni, Anurag Kumar, Tim Fingscheidt, Shinji Watanabe, Yanmin Qian |
| 2025 | USAD: Universal Speech and Audio Representation via Distillation. Heng-Jui Chang, Saurabhchand Bhati, James R. Glass, Alexander H. Liu |
| 2025 | Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder. Muhammad Shakeel, Yui Sudo, Yifan Peng, Chyi-Jiunn Lin, Shinji Watanabe |
| 2025 | Unifying Model and Layer Fusion for Speech Foundation Models. Yi-Jen Shih, David Harwath |
| 2025 | Utilizing Kolmogorov-Arnold Network in Self-Supervised Learning for Speaker Diarization. Minh Vu, Phuong Tuan Dat, Kah Kuan Teh, Van Tuan Nguyen, Tran Huy Dat |
| 2025 | VERSA-v2: A Modular and Scalable Toolkit for Speech and Audio Evaluation with Expanded Metrics, Visualization, and LLM Integration. Jiatong Shi, Bo-Hao Su, Shikhar Bharadwaj, Yiwen Zhao, Shih-Heng Wang, Jionghao Hang, Haoran Wang, Wei Wang, Wenhao Feng, Yuxun Tang, Nezih Topaloglu, Siddhant Arora, Jinchuan Tian, William Chen, Hye-jin Shim, Wangyou Zhang, Wen-Chin Huang, Shinji Watanabe |
| 2025 | Voice Factor Control Using FIR-Based Fast Neural Vocoder for Speech Generation Applications. Yamato Ohtani, Takuma Okamoto, Tomoki Toda, Hisashi Kawai |
| 2025 | WST: Weakly Supervised Transducer for Automatic Speech Recognition. Dongji Gao, Chenda Liao, Changliang Liu, Matthew Wiesner, Leibny Paola García-Perera, Daniel Povey, Sanjeev Khudanpur, Jian Wu |
| 2025 | WhisQ: Cross-Modal Representation Learning for Text-to-Music MOS Prediction. Jakaria Islam Emon, Md Abu Salek, Kazi Tamanna Alam |
| 2025 | Whisper Has an Internal Word Aligner. Sung-Lin Yeh, Yen Meng, Hao Tang |
| 2025 | WhisperNER: Unified Open Named Entity and Speech Recognition. Gil Ayache, Menachem Pirchi, Aviv Navon, Aviv Shamsian, Gill Hetz, Joseph Keshet |
| 2025 | Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts. Duygu Altinok |
| 2025 | XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation. Tianlun Zuo, Jingbin Hu, Yuke Li, Xinfa Zhu, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie |
| 2025 | ZO-ASR: Zeroth-Order Fine-Tuning of Speech Foundation Models without Back-Propagation. Yuezhang Peng, Yuxin Liu, Yao Li, Sheng Wang, Fei Wen, Xie Chen |
| 2025 | ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching. Zhu Han, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhaoqing Li, Weiji Zhuang, Long Lin, Daniel Povey |
| 2025 | mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks. Luel Hagos Beyene, Vivek Verma, Min Ma, Jesujoba O. Alabi, Fabian David Schmidt, Joyce Nakatumba-Nabende, David Ifeoluwa Adelani |