TY - JOUR
T1 - Multimodal Fusion for Vocal Biomarkers Using Vector Cross-Attention
AU - Despotovic, Vladimir
AU - Elbéji, Abir
AU - Nazarov, Petr V.
AU - Fagherazzi, Guy
N1 - Publisher Copyright: © 2024 International Speech Communication Association. All rights reserved.
PY - 2024/9
Y1 - 2024/9
N2 - Vocal biomarkers are measurable characteristics of a person's voice that provide valuable insights into various aspects of their physiological and psychological state, or health status. The use of standardized voice tasks, such as reading, counting, or sustained vowel phonation, is common in vocal biomarker research, but semi-spontaneous tasks, where the person is instructed to talk about a particular topic, and spontaneous speech are also increasingly used. However, limited efforts have been made to combine multiple voice modalities. In this paper, we propose a simple yet efficient approach for fusing multiple standardized voice tasks based on vector cross-attention, showing improved predictive capacity for derived vocal biomarkers in comparison to single modalities. The multimodal approach is tested on the assessment of respiratory quality of life from reading and sustained vowel phonation recordings, outperforming single modalities by up to 4.2% in terms of accuracy (relative increase of 7%).
AB - Vocal biomarkers are measurable characteristics of a person's voice that provide valuable insights into various aspects of their physiological and psychological state, or health status. The use of standardized voice tasks, such as reading, counting, or sustained vowel phonation, is common in vocal biomarker research, but semi-spontaneous tasks, where the person is instructed to talk about a particular topic, and spontaneous speech are also increasingly used. However, limited efforts have been made to combine multiple voice modalities. In this paper, we propose a simple yet efficient approach for fusing multiple standardized voice tasks based on vector cross-attention, showing improved predictive capacity for derived vocal biomarkers in comparison to single modalities. The multimodal approach is tested on the assessment of respiratory quality of life from reading and sustained vowel phonation recordings, outperforming single modalities by up to 4.2% in terms of accuracy (relative increase of 7%).
KW - attention mechanism
KW - multimodal fusion
KW - vocal biomarker
UR - http://www.scopus.com/inward/record.url?scp=85214795133&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2024-156
DO - 10.21437/Interspeech.2024-156
M3 - Conference article
AN - SCOPUS:85214795133
SN - 2308-457X
SP - 1435
EP - 1439
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 25th Interspeech Conference 2024
Y2 - 1 September 2024 through 5 September 2024
ER -