Multimodal data analysis in emotion recognition: a review
https://doi.org/10.18255/1818-1015-2025-3-252-281
Abstract
The use of multimodal data in emotion recognition systems has great potential for applications in various fields: healthcare, human-machine interfaces, operator monitoring, and marketing. Until recently, the development of emotion recognition systems based on multimodal data was constrained by insufficient computing power. However, with the advent of high-performance GPU-based systems and the development of efficient deep neural network architectures, there has been a surge of research aimed at using multiple modalities, such as audio, video, and physiological signals, to accurately detect human emotions. In addition, physiological data from wearable devices have become important because they are relatively easy to collect and improve recognition accuracy. This paper discusses architectures and methods for applying deep neural networks to the analysis of multimodal data in order to improve the accuracy and reliability of emotion recognition systems, and presents current approaches to implementing such algorithms as well as existing open multimodal datasets.
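To make the idea of combining modalities concrete, the minimal sketch below shows one common feature-level fusion baseline: each modality is encoded separately and the resulting embeddings are concatenated before a shared classifier. It is an illustration only; the feature dimensions, number of emotion classes, and names such as LateFusionEmotionNet are hypothetical and are not taken from any of the reviewed systems.

```python
# Illustrative sketch only: feature-level fusion of pre-extracted audio, video,
# and physiological feature vectors. All dimensions and the class count are
# hypothetical placeholders, not values from the reviewed papers.
import torch
import torch.nn as nn

class LateFusionEmotionNet(nn.Module):
    def __init__(self, audio_dim=128, video_dim=512, physio_dim=32,
                 hidden_dim=64, num_emotions=6):
        super().__init__()
        # One small encoder per modality maps its features to a shared size.
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, hidden_dim), nn.ReLU())
        self.physio_enc = nn.Sequential(nn.Linear(physio_dim, hidden_dim), nn.ReLU())
        # The concatenated modality embeddings feed a shared classification head.
        self.classifier = nn.Linear(3 * hidden_dim, num_emotions)

    def forward(self, audio, video, physio):
        fused = torch.cat(
            [self.audio_enc(audio), self.video_enc(video), self.physio_enc(physio)],
            dim=-1,
        )
        return self.classifier(fused)  # unnormalised emotion logits

# Usage example: a batch of 4 samples with random tensors standing in for real features.
model = LateFusionEmotionNet()
logits = model(torch.randn(4, 128), torch.randn(4, 512), torch.randn(4, 32))
print(logits.shape)  # torch.Size([4, 6])
```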
About the Authors
Daniil A. Berdyshev
Russian Federation
Aleksei G. Shishkin
Russian Federation
References
1. P. Tarnowski, M. Kołodziej, A. Majkowski, and R. J. Rak, “Emotion recognition using facial expressions,” Procedia Computer Science, vol. 108, pp. 1175–1184, 2017.
2. O.-W. Kwon, K. Chan, J. Hao, and T.-W. Lee, “Emotion recognition by speech signals,” in Proceedings of the 8th European Conference on Speech Communication and Technology, 2003, pp. 125–128.
3. S. M. S. A. Abdullah, S. Y. A. Ameen, M. A. M. Sadeeq, and S. Zeebaree, “Multimodal emotion recognition using deep learning,” Journal of Applied Science and Technology Trends, vol. 2, no. 1, pp. 73–79, 2021.
4. Y. Huang, C. Du, Z. Xue, X. Chen, H. Zhao, and L. Huang, “What makes multi-modal learning better than single (provably),” Advances in Neural Information Processing Systems, vol. 34, pp. 10944–10956, 2021.
5. L. Kessous, G. Castellano, and G. Caridakis, “Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis,” Journal on Multimodal User Interfaces, vol. 3, pp. 33–48, 2010.
6. H. Ranganathan, S. Chakraborty, and S. Panchanathan, “Multimodal emotion recognition using deep learning architectures,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2016, pp. 1–9.
7. H. Huang, Z. Hu, W. Wang, and M. Wu, “Multimodal emotion recognition based on ensemble convolutional neural network,” IEEE Access, vol. 8, pp. 3265–3271, 2019.
8. M. G. Huddar, S. S. Sannakki, and V. S. Rajpurohit, “Attention-based multi-modal sentiment analysis and emotion detection in conversation using RNN,” International Journal of Interactive Multimedia and Artificial Intelligence, vol. 6, no. 6, pp. 112–121, 2021.
9. H.-D. Le, G.-S. Lee, S.-H. Kim, S. Kim, and H.-J. Yang, “Multi-label multimodal emotion recognition with transformer-based fusion and emotion-level representation learning,” IEEE Access, vol. 11, pp. 14742–14751, 2023.
10. H. Al Osman and T. H. Falk, “Multimodal affect recognition: Current approaches and challenges,” in Emotion and Attention Recognition Based on Biological Signals and Images, 2017, pp. 59–86.
11. D. Priyasad, T. Fernando, S. Denman, S. Sridharan, and C. Fookes, “Attention driven fusion for multi-modal emotion recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 3227–3231.
12. S. Poria, N. Majumder, R. Mihalcea, and E. Hovy, “Emotion recognition in conversation: Research challenges, datasets, and recent advances,” IEEE Access, vol. 7, pp. 100943–100953, 2019.
13. R. Subramanian, J. Wache, M. K. Abadi, R. L. Vieriu, S. Winkler, and N. Sebe, “ASCERTAIN: Emotion and personality recognition using commercial sensors,” IEEE Transactions on Affective Computing, vol. 9, no. 2, pp. 147–160, 2016.
14. Z. Zhang, F. Ringeval, B. Dong, E. Coutinho, E. Marchi, and B. Schuller, “Enhanced semi-supervised learning for multimodal emotion recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 5185–5189.
15. A. Althnian et al., “Impact of dataset size on classification performance: an empirical evaluation in the medical domain,” Applied Sciences, vol. 11, no. 2, p. 796, 2021.
16. Z. Zhang et al., “Multimodal spontaneous emotion corpus for human behavior analysis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3438–3446.
17. V. Ovsyannikova, “On the classification of emotions: categorical and dimensional approaches,” Finansovaya Analitika: Problemy i Resheniya, no. 37, pp. 43–48, 2013 (in Russian).
18. A. Ortony and T. J. Turner, “What's basic about basic emotions?,” Psychological Review, vol. 97, no. 3, pp. 315–331, 1990.
19. P. Ekman, “An argument for basic emotions,” Cognition & Emotion, vol. 6, no. 3, pp. 169–200, 1992.
20. C. E. Izard, The psychology of emotions. Springer Science & Business Media, 1991.
21. W. Wundt, “Outlines of psychology,” in Wilhelm Wundt and the Making of a Scientific Psychology, 1980, pp. 179–195.
22. J. A. Russell, “Core affect and the psychological construction of emotion,” Psychological Review, vol. 110, no. 1, p. 145, 2003.
23. J. A. Miranda-Correa, M. K. Abadi, N. Sebe, and I. Patras, “AMIGOS: A dataset for affect, personality and mood research on individuals and groups,” IEEE Transactions on Affective Computing, vol. 12, no. 2, pp. 479–493, 2018.
24. W.-L. Zheng, W. Liu, Y. Lu, B.-L. Lu, and A. Cichocki, “EmotionMeter: A multimodal framework for recognizing human emotions,” IEEE Transactions on Cybernetics, vol. 49, no. 3, pp. 1110–1122, 2018.
25. V. Markova, T. Ganchev, and K. Kalinkov, “CLAS: A database for cognitive load, affect and stress recognition,” in Proceedings of the International Conference on Biomedical Innovations and Applications, 2019, pp. 1–4.
26. K. Sharma, C. Castellini, E. L. Van Den Broek, A. Albu-Schaeffer, and F. Schwenker, “A dataset of continuous affect annotations and physiological signals for emotion analysis,” Scientific Data, vol. 6, no. 1, p. 196, 2019.
27. S. Koelstra et al., “DEAP: A database for emotion analysis using physiological signals,” IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 18–31, 2011.
28. M. K. Abadi, R. Subramanian, S. M. Kia, P. Avesani, I. Patras, and N. Sebe, “DECAF: MEG-based multimodal database for decoding affective physiological responses,” IEEE Transactions on Affective Computing, vol. 6, no. 3, pp. 209–222, 2015.
29. C. Y. Park et al., “K-EmoCon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations,” Scientific Data, vol. 7, no. 1, p. 293, 2020.
30. C. Busso et al., “IEMOCAP: Interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, pp. 335–359, 2008.
31. S. Katsigiannis and N. Ramzan, “DREAMER: A database for emotion recognition through EEG and ECG signals from wireless low-cost off-the-shelf devices,” IEEE Journal of Biomedical and Health Informatics, vol. 22, no. 1, pp. 98–107, 2017.
32. A. A. B. Zadeh, P. P. Liang, S. Poria, E. Cambria, and L.-P. Morency, “Multimodal language analysis in the wild: CMU-MOSEI Dataset and interpretable dynamic fusion graph,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2236–2246.
33. D. Kollias et al., “The 6th affective behavior analysis in-the-wild (ABAW) competition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4587–4598.
34. T. Bänziger, D. Grandjean, and K. R. Scherer, “Emotion recognition from expressions in face, voice, and body: the Multimodal Emotion Recognition Test (MERT),” Emotion, vol. 9, no. 5, p. 691, 2009.
35. R. M. Sabour, Y. Benezeth, P. De Oliveira, J. Chappe, and F. Yang, “UBFC-Phys: A multimodal database for psychophysiological studies of social stress,” IEEE Transactions on Affective Computing, vol. 14, no. 1, pp. 622–636, 2021.
36. J. F. Cohn, Z. Ambadar, and P. Ekman, “Observer-based measurement of facial expression with the Facial Action Coding System,” in The Handbook of Emotion Elicitation and Assessment, 2007, pp. 203–221.
37. S. Gupta, “Facial emotion recognition in real-time and static images,” in Proceedings of the 2nd International Conference on Inventive Systems and Control, 2018, pp. 553–560.
38. A. R. Khan, “Facial emotion recognition using conventional machine learning and deep learning methods: current achievements, analysis and remaining challenges,” Information, vol. 13, no. 6, p. 268, 2022.
39. A. Savran et al., “Bosphorus database for 3D face analysis,” in Biometrics and Identity Management: First European Workshop, BIOID 2008, Revised Selected Papers, 2008, pp. 47–56.
40. M. S. Likitha, S. R. R. Gupta, K. Hasitha, and A. U. Raju, “Speech based human emotion recognition using MFCC,” in Proceedings of the International Conference on Wireless Communications, Signal Processing and Networking, 2017, pp. 2257–2260.
41. L. Kerkeni, Y. Serrestou, M. Mbarki, K. Raoof, M. A. Mahjoub, and C. Cleder, “Automatic speech emotion recognition using machine learning,” in Social Media and Machine Learning, 2019, HAL: hal-02432557.
42. A. Tripathy, A. Agrawal, and S. K. Rath, “Classification of sentiment reviews using n-gram machine learning approach,” Expert Systems with Applications, vol. 57, pp. 117–126, 2016.
43. F. A. Acheampong, H. Nunoo-Mensah, and W. Chen, “Transformer models for text-based emotion detection: a review of BERT-based approaches,” Artificial Intelligence Review, vol. 54, no. 8, pp. 5789–5829, 2021.
44. S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, “MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations.” 2019.
45. M. Egger, M. Ley, and S. Hanke, “Emotion recognition from physiological signal analysis: A review,” Electronic Notes in Theoretical Computer Science, vol. 343, pp. 35–55, 2019.
46. H. Ma, J. Wang, H. Lin, B. Zhang, Y. Zhang, and B. Xu, “A transformer-based model with self-distillation for multimodal emotion recognition in conversations,” IEEE Transactions on Multimedia, vol. 26, pp. 776–788, 2023.
47. S. Anand, N. K. Devulapally, S. D. Bhattacharjee, and J. Yuan, “Multi-label emotion analysis in conversation via multimodal knowledge distillation,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 6090–6100.
48. Y. Li et al., “A novel bi-hemispheric discrepancy model for EEG emotion recognition,” IEEE Transactions on Cognitive and Developmental Systems, vol. 13, no. 2, pp. 354–367, 2020.
49. B. Fu, C. Gu, M. Fu, Y. Xia, and Y. Liu, “A novel feature fusion network for multimodal emotion recognition from EEG and eye movement signals,” Frontiers in Neuroscience, vol. 17, p. 1234162, 2023.
50. C. Tan, M. Šarlija, and N. Kasabov, “NeuroSense: Short-term emotion recognition and understanding based on spiking neural network modelling of spatio-temporal EEG patterns,” Neurocomputing, vol. 434, pp. 137–148, 2021.
51. P. Bhattacharya, R. K. Gupta, and Y. Yang, “Exploring the contextual factors affecting multimodal emotion recognition in videos,” IEEE Transactions on Affective Computing, vol. 14, no. 2, pp. 1547–1557, 2021.
52. H. Zhang, “Expression-EEG based collaborative multimodal emotion recognition using deep autoencoder,” IEEE Access, vol. 8, pp. 164130–164143, 2020.
53. F. Al-Naima, S. Y. Ameen, and A. F. Al-Saad, “Destroying steganography content in image files,” in Proceedings of IEEE Fifth International Symposium on Communication Systems, Networks and Digital Signal Processing, 2006, pp. 1–4.
54. S. Bouktif, A. Fiaz, A. Ouni, and M. A. Serhani, “Multi-sequence LSTM-RNN deep learning and metaheuristics for electric load forecasting,” Energies, vol. 13, no. 2, p. 391, 2020.
55. T. D. Nguyen, “Multimodal emotion recognition using deep learning techniques,” Ph.D. thesis, Queensland University of Technology, 2020.
56. A. Joshi, A. Bhat, A. Jain, A. V. Singh, and A. Modi, “COGMEN: COntextualized GNN based multimodal emotion recognitioN.” 2022.
57. J.-H. Lee, H.-J. Kim, and Y.-G. Cheong, “A multi-modal approach for emotion recognition of TV drama characters using image and text,” in Proceedings of the IEEE International Conference on Big Data and Smart Computing, 2020, pp. 420–424.
58. W. Ai, F. Zhang, Y. Shou, T. Meng, H. Chen, and K. Li, “Revisiting Multimodal Emotion Recognition in Conversation from the Perspective of Graph Spectrum,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2025, vol. 39, pp. 11418–11426.
59. S. Siriwardhana, A. Reis, R. Weerasekera, and S. Nanayakkara, “Jointly Fine-Tuning 'BERT-like' Self Supervised Models to Improve Multimodal Speech Emotion Recognition.” 2020.
60. D. N. Krishna and A. Patil, “Multimodal Emotion Recognition Using Cross-Modal Attention and 1D Convolutional Neural Networks,” in Proceedings of the Interspeech, 2020, pp. 4243–4247.
61. B. Fu, C. Gu, M. Fu, Y. Xia, and Y. Liu, “A novel feature fusion network for multimodal emotion recognition from EEG and eye movement signals,” Frontiers in Neuroscience, vol. 17, p. 1234162, 2023.
62. Z. Pan, Z. Luo, J. Yang, and H. Li, “Multi-modal Attention for Speech Emotion Recognition.” 2020.
63. Y. Huang, J. Lin, C. Zhou, H. Yang, and L. Huang, “Modality competition: What makes joint training of multi-modal network fail in deep learning? (provably),” in International Conference on Machine Learning, 2022, pp. 9226–9259.
64. W. Wang, D. Tran, and M. Feiszli, “What makes training multi-modal classification networks hard?,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12695–12705.
65. H. Li, X. Li, P. Hu, Y. Lei, C. Li, and Y. Zhou, “Boosting multi-modal model performance with adaptive gradient modulation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22214–22224.
66. K. Kontras, C. Chatzichristos, M. Blaschko, and M. De Vos, “Improving Multimodal Learning with Multi-Loss Gradient Modulation.” 2024.
67. A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
68. W. Liu, W.-L. Zheng, and B.-L. Lu, “Emotion recognition using multimodal deep learning,” in Proceedings of the 23rd International Conference on Neural Information Processing, Part II, 2016, pp. 521–529.
69. J. Ma, H. Tang, W.-L. Zheng, and B.-L. Lu, “Emotion recognition using multimodal residual LSTM network,” in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 176–183.
70. H. Tang, W. Liu, W.-L. Zheng, and B.-L. Lu, “Multimodal emotion recognition using deep neural networks,” in Proceedings of the 24th International Conference on Neural Information Processing, Part IV, 2017, pp. 811–819.
71. P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou, “End-to-end multimodal emotion recognition using deep neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1301–1309, 2017.
72. W. Zhang, B. Ma, F. Qiu, and Y. Ding, “Multi-modal facial affective analysis based on masked autoencoder,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5793–5802.
73. R. G. Praveen and J. Alam, “Recursive joint cross-modal attention for multimodal fusion in dimensional emotion recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4803–4813.
Review
For citations:
Berdyshev D.A., Shishkin A.G. Multimodal data analysis in emotion recognition: a review. Modeling and Analysis of Information Systems. 2025;32(3):252-281. (In Russ.) https://doi.org/10.18255/1818-1015-2025-3-252-281