5.1 KiB
Speech Emotion Recognition
The aim of this section is to explore speech emotion recognition techniques from an audio recording.
Data
The data set used for training is the Ryerson Audio-Visual Database of Emotional Speech and Song: https://zenodo.org/record/1188976#.XA48aC17Q1J
RAVDESS contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.
| Data | Processed Data for training | Processed Data for training | Pre-trained TimeDistributed CNNs model |
|---|---|---|---|
| RAVDESS | X-train y-train | X-test y-test | Weights Model |
Requirements
Python : 3.6.5
Scipy : 1.1.0
Scikit-learn : 0.20.1
Tensorflow : 1.12.0
Keras : 2.2.4
Numpy : 1.15.4
Librosa : 0.6.3
Pyaudio : 0.2.11
Ffmpeg : 4.0.2
Files
The different files that can be found in this repo :
Model: Saved models (SVM and TimeDistributed CNNs)Notebook: All notebooks (preprocessing and model training)Python: Personal audio libraryImages: Set of pictures saved from the notebooks and final reportResources: Some resources on Speech Emotion Recognition
Notebooks provided on this repo:
01 - Preprocessing[SVM].ipynb: Signal preprocessing and feature extraction from time and frequency domain (global statistics) to train SVM classifier.02 - Train [SVM].ipynb: Implementation and training of SVM classifier for Speech Emotion Recognition01 - Preprocessing[CNN-LSTM].ipynb: Signal preprocessing and log-mel-spectrogram extraction to train TimeDistributed CNNs02 - Train [CNN-LSTM].ipynb: Implementation and training of TimeDistributed CNNs classifier for Speech Emotion Recognition
Models
SVM
Classical approach for Speech Emotion Recognition consists in applying a series of filters on the audio signal and partitioning it into several windows (fixed size and time-step). Then, features from time domain (Zero Crossing Rate, Energy and Entropy of Energy) and frequency domain (Spectral entropy, centroid, spread, flux, rolloff and MFCCs) are extracted for each frame. We compute then the first derivatives of each of those features to capture frame to frame changes in the signal. Finally, we calculate the following global statistics on these features: mean, median, standard deviation, kurtosis, skewness, 1% percentile, 99% percentile, min, max and range and train a simple SVM classifier with rbf kernel to predict the emotion detected in the voice.
SVM classification pipeline:
- Voice recording
- Audio signal discretization
- Apply pre-emphasis filter
- Framing using a rolling window
- Apply Hamming filter
- Feature extraction
- Compute global statistics
- Make a prediction using our pre-trained model
TimeDistributed CNNs
The main idea of a Time Distributed Convolutional Neural Network is to apply a rolling window (fixed size and time-step) all along the log-mel-spectrogram. Each of these windows will be the entry of a convolutional neural network, composed by four Local Feature Learning Blocks (LFLBs) and the output of each of these convolutional networks will be fed into a recurrent neural network composed by 2 cells LSTM (Long Short Term Memory) to learn the long-term contextual dependencies. Finally, a fully connected layer with softmax activation is used to predict the emotion detected in the voice.
TimeDistributed CNNs pipeline:
- Voice recording
- Audio signal discretization
- Log-mel-spectrogram extraction
- Split spectrogram with a rolling window
- Make a prediction using our pre-trained model
Performance
To limit overfitting during training phase, we split our data set into train (80%) and test set (20%). Following show results obtained on test set:
| Model | Accuracy |
|---|---|
| SVM on global statistic features | 68,3% |
| Time distributed CNNs | 76,6% |



