Arquivos
Multimodal-Emotion-Recognition/01-Audio/README.md
T
Anatoli-deBRADKE a652055fe7 modif on readme
2019-06-17 09:00:45 +02:00

5.1 KiB

Speech Emotion Recognition

image

The aim of this section is to explore speech emotion recognition techniques from an audio recording.

Data

The data set used for training is the Ryerson Audio-Visual Database of Emotional Speech and Song: https://zenodo.org/record/1188976#.XA48aC17Q1J

RAVDESS contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.

image

Data Processed Data for training Processed Data for training Pre-trained TimeDistributed CNNs model
RAVDESS X-train y-train X-test y-test Weights Model

Requirements

Python : 3.6.5
Scipy : 1.1.0
Scikit-learn : 0.20.1
Tensorflow : 1.12.0
Keras : 2.2.4
Numpy : 1.15.4
Librosa : 0.6.3
Pyaudio : 0.2.11
Ffmpeg : 4.0.2

Files

The different files that can be found in this repo :

  • Model : Saved models (SVM and TimeDistributed CNNs)
  • Notebook : All notebooks (preprocessing and model training)
  • Python : Personal audio library
  • Images: Set of pictures saved from the notebooks and final report
  • Resources : Some resources on Speech Emotion Recognition

Notebooks provided on this repo:

  • 01 - Preprocessing[SVM].ipynb : Signal preprocessing and feature extraction from time and frequency domain (global statistics) to train SVM classifier.
  • 02 - Train [SVM].ipynb : Implementation and training of SVM classifier for Speech Emotion Recognition
  • 01 - Preprocessing[CNN-LSTM].ipynb : Signal preprocessing and log-mel-spectrogram extraction to train TimeDistributed CNNs
  • 02 - Train [CNN-LSTM].ipynb : Implementation and training of TimeDistributed CNNs classifier for Speech Emotion Recognition

Models

SVM

Classical approach for Speech Emotion Recognition consists in applying a series of filters on the audio signal and partitioning it into several windows (fixed size and time-step). Then, features from time domain (Zero Crossing Rate, Energy and Entropy of Energy) and frequency domain (Spectral entropy, centroid, spread, flux, rolloff and MFCCs) are extracted for each frame. We compute then the first derivatives of each of those features to capture frame to frame changes in the signal. Finally, we calculate the following global statistics on these features: mean, median, standard deviation, kurtosis, skewness, 1% percentile, 99% percentile, min, max and range and train a simple SVM classifier with rbf kernel to predict the emotion detected in the voice.

image

SVM classification pipeline:

  • Voice recording
  • Audio signal discretization
  • Apply pre-emphasis filter
  • Framing using a rolling window
  • Apply Hamming filter
  • Feature extraction
  • Compute global statistics
  • Make a prediction using our pre-trained model

TimeDistributed CNNs

The main idea of a Time Distributed Convolutional Neural Network is to apply a rolling window (fixed size and time-step) all along the log-mel-spectrogram. Each of these windows will be the entry of a convolutional neural network, composed by four Local Feature Learning Blocks (LFLBs) and the output of each of these convolutional networks will be fed into a recurrent neural network composed by 2 cells LSTM (Long Short Term Memory) to learn the long-term contextual dependencies. Finally, a fully connected layer with softmax activation is used to predict the emotion detected in the voice.

image

TimeDistributed CNNs pipeline:

  • Voice recording
  • Audio signal discretization
  • Log-mel-spectrogram extraction
  • Split spectrogram with a rolling window
  • Make a prediction using our pre-trained model

Performance

To limit overfitting during training phase, we split our data set into train (80%) and test set (20%). Following show results obtained on test set:

Model Accuracy
SVM on global statistic features 68,3%
Time distributed CNNs 76,6%