Emotions are central to human life: they shape how we think, how we act, and how we relate to other people. Emotion recognition plays a critical role in areas such as human-computer interaction, mental disorder detection, and social robotics. Nevertheless, current emotion recognition systems suffer from noise interference, inadequate feature extraction, and difficulty integrating data in multimodal settings that span audio, video, and text. To address these issues, this research proposes "Enhanced Trimodal Emotion Recognition Using Multibranch Fusion Attention with Epistemic Neural Networks and Fire Hawk Optimization." The proposed method begins with modality-specific preprocessing: Natural Language Processing (NLP) for text, to handle linguistic variation; Relaxed Instance Frequency-wise Normalization (RFN) for audio, to reduce noise-induced distortion; and the iterative self-Guided Image Filter (isGIF) for video, to enhance image quality and reduce artifacts. The preprocessed data then pass to feature extraction: an Inception Transformer captures textual context; the Differentiable Adaptive Short-Time Fourier Transform (DA-STFT) extracts spectral and temporal features from the audio; and class attention mechanisms emphasize important features in the video. These features are then combined by a Multi-Branch Fusion Attention Network, which merges the three modalities into a single representation. The final stage is an Epistemic Neural Network (ENN), which addresses the uncertainty involved in the final classification, and the Fire Hawk algorithm is used to enhance the emotion recognition capabilities of the framework. The proposed approach attains 99.5% accuracy with low computational time.
Thus, the proposed method addresses important shortcomings of previously developed systems and contributes to the field of multimodal emotion recognition.
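The fusion step can be illustrated with a minimal sketch. It is not the Multi-Branch Fusion Attention Network itself, whose architecture is not specified here; it substitutes a generic softmax attention that scores each modality's feature vector against a scoring vector and returns the weighted sum. The function names, the `query` vector, and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_modalities(features, query):
    """Attention-weighted sum of per-modality feature vectors.

    features: dict mapping modality name -> feature vector (equal lengths)
    query:    hypothetical scoring vector of the same length
    Returns (fused_vector, weights_by_modality).
    """
    names = list(features)
    dim = len(query)
    # Scaled dot-product score for each modality branch.
    scores = [
        sum(f * q for f, q in zip(features[n], query)) / math.sqrt(dim)
        for n in names
    ]
    weights = softmax(scores)
    # Convex combination of the branch features, one component at a time.
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))

# Toy, made-up vectors standing in for the three branch outputs.
features = {
    "text":  [0.9, 0.1, 0.0, 0.2],
    "audio": [0.2, 0.7, 0.1, 0.0],
    "video": [0.1, 0.2, 0.8, 0.3],
}
query = [1.0, 0.5, 0.5, 0.0]  # illustrative values, not learned

fused, weights = fuse_modalities(features, query)
print(weights)
print(fused)
```

In a trained system the scoring vector would be learned jointly with the feature extractors, and the attention would typically operate on sequences of features rather than on a single vector per modality.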
CRediT Author Statement
The authors confirm contribution to the paper as follows:
Conceptualization: Bangar Raju Cherukuri;
Methodology: Bangar Raju Cherukuri;
Software: Bangar Raju Cherukuri;
Data Curation: Bangar Raju Cherukuri;
Writing - Original Draft Preparation: Bangar Raju Cherukuri;
Visualization: Bangar Raju Cherukuri;
Investigation: Bangar Raju Cherukuri;
Supervision: Bangar Raju Cherukuri;
Validation: Bangar Raju Cherukuri;
Writing - Reviewing and Editing: Bangar Raju Cherukuri;
All authors reviewed the results and approved the final version of the manuscript.
Acknowledgements
The authors would like to thank the reviewers for their helpful comments on the manuscript.
Funding
No funding was received to assist with the preparation of this manuscript.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Availability of data and materials
Data sharing is not applicable to this article as no new data were created or analysed in this study.
Author information
Contributions
All authors contributed equally to the paper, and all authors have read and agreed to the published version of the manuscript.
Open Access: This article is licensed under a Creative Commons Attribution NoDerivs license, which is a more restrictive license. It allows the material to be redistributed commercially or non-commercially, but the user cannot make any changes whatsoever to the original, i.e., no derivatives of the original work. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-nd/4.0/
Cite this article
Bangar Raju Cherukuri, “Enhanced Trimodal Emotion Recognition Using Multibranch Fusion Attention with Epistemic Neural Networks and Fire Hawk Optimization,” Journal of Machine and Computing, vol. 5, no. 1, pp. 058-075, January 2025, doi: 10.53759/7669/jmc202505005.