Authors: Shaik, A.; Reddy, G. P.; Vidya, R.; Varsha, J.; Jayasree, G.; Sriveni, L.

Date: 2025-09-25

Citation: Shaik, A., Reddy, G. P., Vidya, R., Varsha, J., Jayasree, G., & Sriveni, L. (2025). Hybrid CNN-LSTM framework for robust speech emotion recognition. In Proceedings of the International Research Conference on Smart Computing and Systems Engineering (SCSE 2025). Department of Industrial Management, Faculty of Science, University of Kelaniya.

URI: http://repository.kln.ac.lk/handle/123456789/30040

Abstract: Speech Emotion Recognition (SER), a key component of affective computing, is concerned with recognizing and categorizing human emotions from voice data. Emotions heavily influence human communication, and enabling machines to recognize emotional states improves their capacity for intelligent and empathetic interaction. This paper presents a reliable SER system that classifies emotions using a hybrid Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) model. The proposed approach integrates feature extraction methods, including Mel-Frequency Cepstral Coefficients (MFCC), pitch, and chroma features, that efficiently capture both spatial and temporal emotional patterns in speech. The system was evaluated on two standard datasets, RAVDESS and EMO-DB, using accuracy, precision, recall, and F1-score. Experimental results show that the hybrid CNN-LSTM model outperformed traditional machine learning approaches, achieving an overall accuracy of 89.4%. The system also demonstrated robustness to background noise and emotional overlap, making it suitable for real-world applications.

Keywords: Deep Learning; Hybrid Model; Speech Analysis; Emotion; Data Augmentation; Classification

Title: Hybrid CNN-LSTM Framework for Robust Speech Emotion Recognition

Type: Article
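The abstract describes a pipeline in which a CNN first extracts local (spatial) patterns from frame-level acoustic features and an LSTM then models their temporal dynamics before a final emotion classifier. The paper does not publish its implementation, so the following is only an illustrative NumPy sketch of that CNN-to-LSTM data flow on randomly generated stand-in features; all dimensions (100 frames, 52 features per frame, 8 emotion classes) and weight initializations are hypothetical, and a real system would learn the weights from data such as RAVDESS or EMO-DB.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(x, w, b):
    """Valid 1-D convolution over the time axis, followed by ReLU.
    x: (T, F) frame-level features; w: (K, F, C) kernels; b: (C,) bias."""
    T, F = x.shape
    K, _, C = w.shape
    out = np.empty((T - K + 1, C))
    for t in range(T - K + 1):
        out[t] = np.einsum("kf,kfc->c", x[t:t + K], w) + b
    return np.maximum(out, 0.0)

def lstm_last_hidden(x, Wx, Wh, b):
    """Single-layer LSTM; returns the final hidden state.
    x: (T, C) sequence; Wx: (C, 4H); Wh: (H, 4H); b: (4H,)."""
    H = Wh.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    for t in range(x.shape[0]):
        z = x[t] @ Wx + h @ Wh + b
        i, f, g, o = np.split(z, 4)
        sig = lambda v: 1.0 / (1.0 + np.exp(-v))
        i, f, o, g = sig(i), sig(f), sig(o), np.tanh(g)
        c = f * c + i * g        # cell state carries long-range context
        h = o * np.tanh(c)
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical shapes: 100 frames x 52 features per frame
# (e.g. MFCC + chroma + pitch stacked per frame; illustrative only).
T, F, C, H, n_emotions = 100, 52, 16, 32, 8
features = rng.standard_normal((T, F))           # stand-in for real features

w_conv = rng.standard_normal((5, F, C)) * 0.1    # kernel width 5
b_conv = np.zeros(C)
Wx = rng.standard_normal((C, 4 * H)) * 0.1
Wh = rng.standard_normal((H, 4 * H)) * 0.1
b_lstm = np.zeros(4 * H)
W_out = rng.standard_normal((H, n_emotions)) * 0.1
b_out = np.zeros(n_emotions)

conv_out = conv1d_relu(features, w_conv, b_conv)   # local spectral patterns
h_final = lstm_last_hidden(conv_out, Wx, Wh, b_lstm)  # temporal dynamics
probs = softmax(h_final @ W_out + b_out)           # per-emotion probabilities
print(probs.shape)  # → (8,)
```

The split of labor mirrors the abstract's claim: the convolution captures short-span structure within the feature frames, while the recurrent layer integrates it across the utterance before classification.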