Adapt Name
@@ -0,0 +1,320 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Speech Emotion Recognition - Signal Preprocessing\n",
    "\n",
    "A project for the French Employment Agency\n",
    "\n",
    "Telecom ParisTech 2018-2019"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## I. Context"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The aim of this notebook is to set up the preprocessing and audio feature extraction for speech emotion recognition.\n",
    "\n",
    "### Audio features:\n",
    "The complete list of the implemented short-term features is presented below:\n",
    "- **Zero Crossing Rate**: The rate of sign changes of the signal over the duration of a frame.\n",
    "- **Energy**: The sum of squares of the signal values, normalized by the frame length.\n",
    "- **Entropy of Energy**: The entropy of the sub-frames' normalized energies. It can be interpreted as a measure of abrupt changes.\n",
    "- **Spectral Centroid**: The center of gravity of the spectrum.\n",
    "- **Spectral Spread**: The second central moment of the spectrum.\n",
    "- **Spectral Entropy**: The entropy of the normalized spectral energies for a set of sub-frames.\n",
    "- **Spectral Flux**: The squared difference between the normalized magnitudes of the spectra of two successive frames.\n",
    "- **Spectral Rolloff**: The frequency below which 90% of the magnitude distribution of the spectrum is concentrated.\n",
    "- **MFCCs**: Mel Frequency Cepstral Coefficients form a cepstral representation in which the frequency bands are not linear but distributed according to the mel scale.\n",
    "\n",
    "Global statistics are then computed on the features above:\n",
    "- **mean, std, med, kurt, skew, q1, q99, min, max and range**\n",
    "\n",
    "### Data:\n",
    "**RAVDESS**: The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes *calm*, *happy*, *sad*, *angry*, *fearful*, *surprise*, and *disgust* expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. (https://zenodo.org/record/1188976#.XA48aC17Q1J)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## II. General import"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-04-15T13:13:31.470677Z",
     "start_time": "2019-04-15T13:13:30.911103Z"
    }
   },
   "outputs": [],
   "source": [
    "### General imports ###\n",
    "from glob import glob\n",
    "import os\n",
    "import pickle\n",
    "import itertools\n",
    "import numpy as np\n",
    "\n",
    "### Audio preprocessing imports ###\n",
    "from AudioLibrary.AudioSignal import *\n",
    "from AudioLibrary.AudioFeatures import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-04T16:38:44.580314Z",
     "start_time": "2018-12-04T16:38:44.560062Z"
    }
   },
   "source": [
    "## III. Set labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-04-15T13:13:31.477659Z",
     "start_time": "2019-04-15T13:13:31.473279Z"
    }
   },
   "outputs": [],
   "source": [
    "# RAVDESS Database\n",
    "label_dict_ravdess = {'02': 'NEU', '03':'HAP', '04':'SAD', '05':'ANG', '06':'FEA', '07':'DIS', '08':'SUR'}\n",
    "\n",
    "# Set audio files labels\n",
    "def set_label_ravdess(audio_file, gender_differentiation):\n",
    "    label = label_dict_ravdess.get(audio_file[6:-16])\n",
    "    if gender_differentiation == True:\n",
    "        if int(audio_file[18:-4])%2 == 0: # Female\n",
    "            label = 'f_' + label\n",
    "        if int(audio_file[18:-4])%2 == 1: # Male\n",
    "            label = 'm_' + label\n",
    "    return label"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## IV. Import audio files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-04-15T13:13:36.852703Z",
     "start_time": "2019-04-15T13:13:31.479656Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Import Data: START\n",
      "Import Data: RUNNING ... 0 files\n",
      "Import Data: RUNNING ... 200 files\n",
      "Import Data: RUNNING ... 300 files\n",
      "Import Data: RUNNING ... 400 files\n",
      "Import Data: RUNNING ... 500 files\n",
      "Import Data: RUNNING ... 600 files\n",
      "Import Data: RUNNING ... 700 files\n",
      "Import Data: RUNNING ... 800 files\n",
      "Import Data: RUNNING ... 900 files\n",
      "Import Data: RUNNING ... 1000 files\n",
      "Import Data: RUNNING ... 1100 files\n",
      "Import Data: RUNNING ... 1200 files\n",
      "Import Data: RUNNING ... 1300 files\n",
      "Import Data: RUNNING ... 1400 files\n",
      "Import Data: END \n",
      "\n",
      "Number of audio files imported: 1344\n"
     ]
    }
   ],
   "source": [
    "# Start data import\n",
    "print(\"Import Data: START\")\n",
    "\n",
    "# Audio file path and names\n",
    "file_path = '../Datas/RAVDESS/'\n",
    "file_names = os.listdir(file_path)\n",
    "\n",
    "# Initialize signal and labels list\n",
    "signal = []\n",
    "labels = []\n",
    "\n",
    "# Sample rate (44.1 kHz)\n",
    "sample_rate = 44100\n",
    "\n",
    "# Read every audio file whose emotion code is in the label dictionary\n",
    "for audio_index, audio_file in enumerate(file_names):\n",
    "\n",
    "    # Select audio file\n",
    "    if audio_file[6:-16] in label_dict_ravdess.keys():\n",
    "\n",
    "        # Read audio file\n",
    "        signal.append(AudioSignal(sample_rate, filename=file_path + audio_file))\n",
    "\n",
    "        # Set label\n",
    "        labels.append(set_label_ravdess(audio_file, True))\n",
    "\n",
    "        # Print running...\n",
    "        if (audio_index % 100 == 0):\n",
    "            print(\"Import Data: RUNNING ... {} files\".format(audio_index))\n",
    "\n",
    "# Cast labels to array\n",
    "labels = np.asarray(labels).ravel()\n",
    "\n",
    "# Stop data import\n",
    "print(\"Import Data: END \\n\")\n",
    "print(\"Number of audio files imported: {}\".format(labels.shape[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## V. Audio features extraction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-04-15T13:13:36.863481Z",
     "start_time": "2019-04-15T13:13:36.855871Z"
    }
   },
   "outputs": [],
   "source": [
    "# Audio features extraction function\n",
    "def global_feature_statistics(y, win_size=0.025, win_step=0.01, nb_mfcc=12, mel_filter=40,\n",
    "                              stats = ['mean', 'std', 'med', 'kurt', 'skew', 'q1', 'q99', 'min', 'max', 'range'],\n",
    "                              features_list = ['zcr', 'energy', 'energy_entropy', 'spectral_centroid', 'spectral_spread', 'spectral_entropy', 'spectral_flux', 'sprectral_rolloff', 'mfcc']):\n",
    "\n",
    "    # Extract features\n",
    "    audio_features = AudioFeatures(y, win_size, win_step)\n",
    "    features, features_names = audio_features.global_feature_extraction(stats=stats, features_list=features_list)\n",
    "    return features\n",
    "\n",
    "# Features extraction parameters\n",
    "sample_rate = 16000 # Sample rate (16.0 kHz)\n",
    "win_size = 0.025 # Short term window size (25 msec)\n",
    "win_step = 0.01 # Short term window step (10 msec)\n",
    "nb_mfcc = 12 # Number of MFCCs coefficients (12)\n",
    "nb_filter = 40 # Number of filter banks (40)\n",
    "stats = ['mean', 'std', 'med', 'kurt', 'skew', 'q1', 'q99', 'min', 'max', 'range'] # Global statistics\n",
    "features_list = ['zcr', 'energy', 'energy_entropy', 'spectral_centroid', 'spectral_spread', # Audio features\n",
    "                 'spectral_entropy', 'spectral_flux', 'sprectral_rolloff', 'mfcc']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-04-15T13:19:38.974213Z",
     "start_time": "2019-04-15T13:13:36.866069Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Feature extraction: START\n",
      "Feature extraction: END!\n"
     ]
    }
   ],
   "source": [
    "# Start feature extraction\n",
    "print(\"Feature extraction: START\")\n",
    "\n",
    "# Compute global feature statistics for all audio files\n",
    "features = np.asarray(list(map(global_feature_statistics, signal)))\n",
    "\n",
    "# Stop feature extraction\n",
    "print(\"Feature extraction: END!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## VI. Save as"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-04-15T13:19:38.983530Z",
     "start_time": "2019-04-15T13:19:38.975722Z"
    }
   },
   "outputs": [],
   "source": [
    "# Save features and labels to pickle\n",
    "pickle.dump([features, labels], open(\"../Datas/Pickle/[RAVDESS][HAP-SAD-NEU-ANG-FEA-DIS-SUR][GLOBAL_STATS].p\", 'wb'))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
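The notebook's `set_label_ravdess` reads the emotion code and actor id from fixed character positions (`audio_file[6:-16]` and `audio_file[18:-4]`). A minimal sketch of the same labelling, assuming the seven dash-separated fields of the RAVDESS file naming convention (emotion is the third field, actor the seventh), is shown below. The name `parse_ravdess_label` is hypothetical and not part of the notebook; the mapping is copied from `label_dict_ravdess`.

```python
import os

# Same mapping as label_dict_ravdess in the notebook
LABELS = {'02': 'NEU', '03': 'HAP', '04': 'SAD', '05': 'ANG',
          '06': 'FEA', '07': 'DIS', '08': 'SUR'}


def parse_ravdess_label(audio_file, gender_differentiation=True):
    """Hypothetical helper: return e.g. 'f_FEA', or None if the emotion code is not mapped."""
    stem = os.path.splitext(os.path.basename(audio_file))[0]
    fields = stem.split('-')
    if len(fields) != 7:
        raise ValueError("Unexpected RAVDESS file name: {}".format(audio_file))

    # Third field: emotion code
    label = LABELS.get(fields[2])
    if label is None:
        return None

    # Seventh field: actor id (even ids are female, odd ids are male)
    if gender_differentiation:
        actor = int(fields[6])
        label = ('f_' if actor % 2 == 0 else 'm_') + label
    return label
```

Unlike the slicing version, this variant does not depend on the file name length and returns `None` for unmapped emotion codes instead of failing on string concatenation.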
@@ -0,0 +1,104 @@
import os
import pickle

from AudioLibrary.AudioSignal import *
from AudioLibrary.AudioFeatures import *


class AudioEmotionRecognition:

    def __init__(self, model_path):

        # Load classifier
        self._clf = pickle.load(open(os.path.join(model_path, 'MODEL_CLF.p'), 'rb'))

        # Load features parameters
        self._features_param = pickle.load(open(os.path.join(model_path, 'MODEL_PARAM.p'), 'rb'))

        # Load feature scaler parameters (mean and std)
        self._features_mean, self._features_std = pickle.load(open(os.path.join(model_path, 'MODEL_SCALER.p'), 'rb'))

        # Load PCA
        self._pca = pickle.load(open(os.path.join(model_path, 'MODEL_PCA.p'), 'rb'))

        # Load label encoder
        self._encoder = pickle.load(open(os.path.join(model_path, 'MODEL_ENCODER.p'), 'rb'))

    '''
    Function to scale audio features
    '''
    def scale_features(self, features):

        # Scaled features
        scaled_features = (features - self._features_mean) / self._features_std

        # Return scaled features
        return scaled_features

    '''
    Function to predict speech emotion from an audio signal
    '''
    def predict_emotion(self, audio_signal, predict_proba=False, decode=True):

        # Extract audio features
        audio_features = AudioFeatures(audio_signal, float(self._features_param.get("win_size")),
                                       float(self._features_param.get("win_step")))
        features, features_names = audio_features.global_feature_extraction(
            stats=self._features_param.get("stats"),
            features_list=self._features_param.get("features_list"),
            nb_mfcc=self._features_param.get("nb_mfcc"),
            diff=self._features_param.get("diff"))

        # Scale features and reshape them to a single-sample 2D array,
        # so that both the PCA and the classifier receive the same shape
        features = self.scale_features(features).reshape(1, -1)

        # Apply feature dimension reduction
        if self._features_param.get("PCA") is True:
            features = self._pca.transform(features)

        # Make prediction
        if predict_proba is True:
            prediction = self._clf.predict_proba(features)
        else:
            prediction = self._clf.predict(features)

        # Decode label emotion
        if decode is True:
            prediction = self._encoder.inverse_transform(prediction.astype(int).flatten())

            # Remove the gender prefix ('f_' or 'm_') from the decoded label
            prediction = prediction[0][2:]

        return prediction

    '''
    Function to predict speech emotion over time from an audio or video file
    '''
    def predict_emotion_from_file(self, filename, sample_rate, chunk_size=0, chunk_step=0, predict_proba=False,
                                  decode=True):

        # Initialize AudioSignal object
        audio_signal = AudioSignal(sample_rate, filename=filename)

        # Split audio signal into chunks
        if chunk_size > 0:
            chunks = audio_signal.framing(chunk_size, chunk_step)

            # Initialize time stamp
            timestamp = []

            # Emotion prediction for each chunk
            prediction = []
            for signal in chunks:
                if len(timestamp) == 0:
                    timestamp.append(chunk_size)
                else:
                    timestamp.append(timestamp[-1] + chunk_step)
                prediction.append(self.predict_emotion(signal, predict_proba=predict_proba, decode=decode))

            # Return emotion prediction and related timestamp
            return prediction, timestamp
        else:

            # Emotion prediction
            prediction = self.predict_emotion(audio_signal, predict_proba=predict_proba, decode=decode)

            # Return emotion prediction
            return prediction
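In `predict_emotion_from_file`, the first timestamp is `chunk_size` and each following one adds `chunk_step`, so every timestamp marks the end of its chunk in seconds. A small sketch of that rule in closed form is given below; the helper name `chunk_timestamps` is an assumption made for illustration and does not exist in the module.

```python
def chunk_timestamps(nb_chunks, chunk_size, chunk_step):
    """Hypothetical helper: end time (in seconds) of each chunk.

    Equivalent to the loop in predict_emotion_from_file: the first chunk
    ends at chunk_size, each later chunk ends chunk_step seconds later.
    """
    return [chunk_size + k * chunk_step for k in range(nb_chunks)]


# Three chunks of 2 s taken every 1 s end at 2 s, 3 s and 4 s
timestamps = chunk_timestamps(3, 2.0, 1.0)
```

With overlapping chunks (`chunk_step < chunk_size`) consecutive predictions therefore describe windows that share part of the signal.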
@@ -0,0 +1,347 @@
import numpy
from scipy.fftpack import dct
from scipy.stats import kurtosis, skew
from AudioLibrary.AudioSignal import *


class AudioFeatures:

    def __init__(self, audio_signal, win_size, win_step):

        # Audio Signal
        self._audio_signal = audio_signal

        # Short time features window size
        self._win_size = win_size

        # Short time features window step
        self._win_step = win_step

    '''
    Global statistics features extraction from an audio signal
    '''
    def global_feature_extraction(self, stats=['mean', 'std'], features_list=[], nb_mfcc=12, nb_filter=40, diff=0, hamming=True):

        # Extract short term audio features
        st_features, f_names = self.short_time_feature_extraction(features_list, nb_mfcc, nb_filter, hamming)

        # Number of short term features
        nb_feats = st_features.shape[1]

        # Number of statistics
        nb_stats = len(stats)

        # Global statistics feature names
        feature_names = ["" for x in range(nb_feats * nb_stats)]
        for i in range(nb_feats):
            for j in range(nb_stats):
                feature_names[i + j * nb_feats] = f_names[i] + "_d" + str(diff) + "_" + stats[j]

        # Calculate global statistics features
        features = numpy.zeros((nb_feats * nb_stats))
        for i in range(nb_feats):

            # Get features series
            feat = st_features[:, i]

            # Compute the difference between values `diff` frames apart
            if diff > 0:
                feat = feat[diff:] - feat[:-diff]

            # Global statistics
            for j in range(nb_stats):
                features[i + j * nb_feats] = self.compute_statistic(feat, stats[j])

        return features, feature_names

    '''
    Short-time features extraction from an audio signal
    '''
    def short_time_feature_extraction(self, features=[], nb_mfcc=12, nb_filter=40, hamming=True):

        # Copy features list to compute
        features_list = list(features)

        # MFCCs features names
        mfcc_feature_names = []
        if 'mfcc' in features_list:
            mfcc_feature_names = ["mfcc_{0:d}".format(i) for i in range(1, nb_mfcc + 1)]
            features_list.remove('mfcc')

        # Filter banks features names
        fbank_features_names = []
        if 'filter_banks' in features_list:
            fbank_features_names = ["fbank_{0:d}".format(i) for i in range(1, nb_filter + 1)]
            features_list.remove('filter_banks')

        # All features names
        feature_names = features_list + mfcc_feature_names + fbank_features_names

        # Number of features
        nb_features = len(feature_names)

        # Framing signal
        frames = self._audio_signal.framing(self._win_size, self._win_step, hamming=hamming)

        # Number of frames
        nb_frames = len(frames)

        # Compute features on each frame
        features = numpy.zeros((nb_frames, nb_features))
        cur_pos = 0
        for el in frames:

            # Get signal of the frame
            signal = el._signal

            # Compute the normalized magnitude of the spectrum (Discrete Fourier Transform)
            dft = el.dft(norm=True)

            # Keep the first half of the spectrum
            dft = dft[:int((self._win_size * self._audio_signal._sample_rate) / 2)]
            if cur_pos == 0:
                dft_prev = dft

            # Compute features on frame
            for idx, f in enumerate(features_list):
                features[cur_pos, idx] = self.compute_st_features(f, signal, dft, dft_prev,
                                                                  self._audio_signal._sample_rate)

            # Compute MFCCs and Filter Banks
            if len(mfcc_feature_names) > 0:
                features[cur_pos, len(features_list):len(features_list) + len(mfcc_feature_names) + len(fbank_features_names)] = self.mfcc(
                    signal, self._audio_signal._sample_rate,
                    nb_coeff=nb_mfcc, nb_filt=nb_filter, return_fbank=len(fbank_features_names) > 0)
            # Compute Filter Banks
            elif len(fbank_features_names) > 0:
                features[cur_pos, len(features_list) + len(mfcc_feature_names):] = self.filter_banks_coeff(signal, self._audio_signal._sample_rate, nb_filt=nb_filter)

            # Keep previous Discrete Fourier Transform coefficients
            dft_prev = dft
            cur_pos = cur_pos + 1

        return features, feature_names

    '''
    Computes zero crossing rate of a signal
    '''
    @staticmethod
    def zcr(signal):
        zcr = numpy.sum(numpy.abs(numpy.diff(numpy.sign(signal))))
        zcr = zcr / (2 * numpy.float64(len(signal) - 1.0))
        return zcr

    '''
    Computes signal energy of frame
    '''
    @staticmethod
    def energy(signal):
        energy = numpy.sum(signal ** 2) / numpy.float64(len(signal))
        return energy

    '''
    Computes entropy of energy
    '''
    @staticmethod
    def energy_entropy(signal, n_short_blocks=10, eps=10e-8):

        # Total frame energy
        energy = numpy.sum(signal ** 2)

        # Length of sub-frame
        sub_win_len = int(numpy.floor(len(signal) / n_short_blocks))

        # Truncate the signal to a whole number of sub-frames
        if len(signal) != sub_win_len * n_short_blocks:
            signal = signal[0:sub_win_len * n_short_blocks]

        # Get sub windows
        sub_wins = signal.reshape(sub_win_len, n_short_blocks, order='F').copy()

        # Compute normalized sub-frame energies:
        sub_energies = numpy.sum(sub_wins ** 2, axis=0) / (energy + eps)

        # Compute entropy of the normalized sub-frame energies:
        entropy = -numpy.sum(sub_energies * numpy.log2(sub_energies + eps))

        return entropy

    '''
    Computes spectral centroid and spread of frame
    '''
    @staticmethod
    def spectral_centroid_spread(fft, fs, eps=10e-8):

        # Frequency (Hz) associated with each spectral bin
        sr = (numpy.arange(1, len(fft) + 1)) * (fs / (2.0 * len(fft)))

        # Normalize fft coefficients by the max value
        norm_fft = fft / (fft.max() + eps)

        # Centroid:
        C = numpy.sum(sr * norm_fft) / (numpy.sum(norm_fft) + eps)

        # Spread:
        S = numpy.sqrt(numpy.sum(((sr - C) ** 2) * norm_fft) / (numpy.sum(norm_fft) + eps))

        # Normalize:
        C = C / (fs / 2.0)
        S = S / (fs / 2.0)

        return C, S

    '''
    Computes the spectral flux feature
    '''
    @staticmethod
    def spectral_flux(fft, fft_prev, eps=10e-8):

        # Sum of fft coefficients
        sum_fft = numpy.sum(fft + eps)

        # Sum of previous fft coefficients
        sum_fft_prev = numpy.sum(fft_prev + eps)

        # Compute the spectral flux as the sum of square distances
        flux = numpy.sum((fft / sum_fft - fft_prev / sum_fft_prev) ** 2)

        return flux

    '''
    Computes the spectral roll off
    '''
    @staticmethod
    def spectral_rolloff(fft, c=0.90, eps=10e-8):

        # Total energy
        energy = numpy.sum(fft ** 2)

        # Roll off threshold
        threshold = c * energy

        # Compute cumulative energy
        cum_energy = numpy.cumsum(fft ** 2) + eps

        # Find the spectral roll off as the frequency position
        [roll_off, ] = numpy.nonzero(cum_energy > threshold)

        # Normalize
        if len(roll_off) > 0:
            roll_off = numpy.float64(roll_off[0]) / (float(len(fft)))
        else:
            roll_off = 0.0

        return roll_off

    '''
    Computes the Filter Bank coefficients
    '''
    @staticmethod
    def filter_banks_coeff(signal, sample_rate, nb_filt=40, nb_fft=512):

        # Magnitude of the FFT
        mag_frames = numpy.absolute(numpy.fft.rfft(signal, nb_fft))

        # Power Spectrum
        pow_frames = ((1.0 / nb_fft) * (mag_frames ** 2))
        low_freq_mel = 0

        # Convert Hz to Mel
        high_freq_mel = (2595 * numpy.log10(1 + (sample_rate / 2) / 700))

        # Equally spaced in Mel scale
        mel_points = numpy.linspace(low_freq_mel, high_freq_mel, nb_filt + 2)

        # Convert Mel to Hz
        hz_points = (700 * (10 ** (mel_points / 2595) - 1))

        # FFT bin index of each filter edge (named `bins` to avoid shadowing the built-in `bin`)
        bins = numpy.floor((nb_fft + 1) * hz_points / sample_rate)

        # Calculate filter banks
        fbank = numpy.zeros((nb_filt, int(numpy.floor(nb_fft / 2 + 1))))
        for m in range(1, nb_filt + 1):

            # left
            f_m_minus = int(bins[m - 1])

            # center
            f_m = int(bins[m])

            # right
            f_m_plus = int(bins[m + 1])

            for k in range(f_m_minus, f_m):
                fbank[m - 1, k] = (k - bins[m - 1]) / (bins[m] - bins[m - 1])
            for k in range(f_m, f_m_plus):
                fbank[m - 1, k] = (bins[m + 1] - k) / (bins[m + 1] - bins[m])
        filter_banks = numpy.dot(pow_frames, fbank.T)

        # Numerical Stability
        filter_banks = numpy.where(filter_banks == 0, numpy.finfo(float).eps, filter_banks)

        # dB
        filter_banks = 20 * numpy.log10(filter_banks)

        return filter_banks

    '''
    Computes the MFCCs
    '''
    def mfcc(self, signal, sample_rate, nb_coeff=12, nb_filt=40, nb_fft=512, return_fbank=False):

        # Apply filter bank on spectrogram
        filter_banks = self.filter_banks_coeff(signal, sample_rate, nb_filt=nb_filt, nb_fft=nb_fft)

        # Compute MFCC coefficients
        mfcc = dct(filter_banks, type=2, axis=-1, norm='ortho')[1: (nb_coeff + 1)]

        # Return MFCCs and Filter banks coefficients
        if return_fbank is True:
            return numpy.concatenate((mfcc, filter_banks))
        else:
            return mfcc

    '''
    Compute statistics on short time features
    '''
    @staticmethod
    def compute_statistic(seq, statistic):
        if statistic == 'mean':
            S = numpy.mean(seq)
        elif statistic == 'med':
            S = numpy.median(seq)
        elif statistic == 'std':
            S = numpy.std(seq)
        elif statistic == 'kurt':
            S = kurtosis(seq)
        elif statistic == 'skew':
            S = skew(seq)
        elif statistic == 'min':
            S = numpy.min(seq)
        elif statistic == 'max':
            S = numpy.max(seq)
        elif statistic == 'q1':
            S = numpy.percentile(seq, 1)
        elif statistic == 'q99':
            S = numpy.percentile(seq, 99)
        elif statistic == 'range':
            S = numpy.abs(numpy.percentile(seq, 99) - numpy.percentile(seq, 1))
        else:
            # Fail with a clear message instead of an unbound local variable
            raise ValueError("Unknown statistic: {}".format(statistic))
        return S

    '''
    Compute short time features on signal
    '''
    def compute_st_features(self, feature, signal, dft, dft_prev, sample_rate):
        if feature == 'zcr':
            F = self.zcr(signal)
        elif feature == 'energy':
            F = self.energy(signal)
        elif feature == 'energy_entropy':
            F = self.energy_entropy(signal)
        elif feature == 'spectral_centroid':
            [F, FF] = self.spectral_centroid_spread(dft, sample_rate)
        elif feature == 'spectral_spread':
            [FF, F] = self.spectral_centroid_spread(dft, sample_rate)
        elif feature == 'spectral_entropy':
            F = self.energy_entropy(dft)
        elif feature == 'spectral_flux':
            F = self.spectral_flux(dft, dft_prev)
        elif feature == 'sprectral_rolloff':
            F = self.spectral_rolloff(dft)
        else:
            # Fail with a clear message instead of an unbound local variable
            raise ValueError("Unknown feature: {}".format(feature))
        return F
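`filter_banks_coeff` places its triangular filters at `nb_filt + 2` edge frequencies that are equally spaced on the mel scale, using the conversions `2595 * log10(1 + f / 700)` and its inverse. The sketch below isolates that step so the spacing can be inspected on its own; the function names `hz_to_mel`, `mel_to_hz` and `mel_filter_edges` are hypothetical and are not defined in the module.

```python
import numpy


def hz_to_mel(hz):
    # Same Hz -> mel formula as in filter_banks_coeff
    return 2595.0 * numpy.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_edges(sample_rate, nb_filt=40):
    """Hypothetical helper: nb_filt + 2 edge frequencies (Hz), equally spaced in mel."""
    mel_points = numpy.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), nb_filt + 2)
    return mel_to_hz(mel_points)


# Edges for the notebook's settings (16 kHz, 40 filters): from 0 Hz up to the Nyquist frequency
edges = mel_filter_edges(16000, nb_filt=40)
```

Because the spacing is uniform in mel, the gaps between consecutive edges grow with frequency in Hz, which gives the filter bank finer resolution at low frequencies.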
@@ -0,0 +1,161 @@
import os
import numpy
from pydub import AudioSegment
from scipy.fftpack import fft


class AudioSignal(object):

    def __init__(self, sample_rate, signal=None, filename=None):

        # Set sample rate
        self._sample_rate = sample_rate

        if signal is None:

            # Get file name and file extension
            file, file_extension = os.path.splitext(filename)

            # Check if file extension is an audio format
            if file_extension in ['.mp3', '.wav']:

                # Read audio file
                self._signal = self.read_audio_file(filename)

            # Check if file extension is a video format
            elif file_extension in ['.mp4', '.mkv', '.avi']:

                # Extract audio from video
                new_filename = self.extract_audio_from_video(filename)

                # Read audio file from extracted audio file
                self._signal = self.read_audio_file(new_filename)

            # Case file extension is not supported
            else:
                print("Error: file not found or file extension not supported.")

        elif filename is None:

            # Store signal
            self._signal = signal

        else:

            print("Error : argument missing in AudioSignal() constructor.")

    '''
    Function to extract audio from a video
    '''
    def extract_audio_from_video(self, filename):

        # Get video file name and extension
        file, file_extension = os.path.splitext(filename)

        # Extract audio (.wav) from video
        os.system('ffmpeg -i ' + file + file_extension + ' ' + '-ar ' + str(self._sample_rate) + ' ' + file + '.wav')
        print("Successfully converted {} into audio!".format(filename))

        # Return audio file name created
        return file + '.wav'

    '''
    Function to read an audio file and return its audio samples
    '''
    def read_audio_file(self, filename):

        # Get audio signal
        audio_file = AudioSegment.from_file(filename)

        # Resample audio signal
        audio_file = audio_file.set_frame_rate(self._sample_rate)

        # Cast to integer (numpy.frombuffer replaces the deprecated binary mode of numpy.fromstring)
        if audio_file.sample_width == 2:
            data = numpy.frombuffer(audio_file._data, numpy.int16)
        elif audio_file.sample_width == 4:
            data = numpy.frombuffer(audio_file._data, numpy.int32)
        else:
            raise ValueError("Unsupported sample width: {} bytes".format(audio_file.sample_width))

        # Split interleaved samples into one column per channel
        audio_signal = []
        for chn in list(range(audio_file.channels)):
            audio_signal.append(data[chn::audio_file.channels])
        audio_signal = numpy.array(audio_signal).T

        # Flatten single-channel signals
        if audio_signal.ndim == 2:
            if audio_signal.shape[1] == 1:
                audio_signal = audio_signal.flatten()

        # Convert stereo to mono
        audio_signal = self.stereo_to_mono(audio_signal)

        # Return audio signal
        return audio_signal

    '''
    Function to convert an input signal from stereo to mono
    '''
    @staticmethod
    def stereo_to_mono(audio_signal):

        # Check if signal is stereo and convert to mono
        if isinstance(audio_signal, int):
            return -1
        if audio_signal.ndim == 1:
            return audio_signal
        elif audio_signal.ndim == 2:
            if audio_signal.shape[1] == 1:
                return audio_signal.flatten()
            else:
                if audio_signal.shape[1] == 2:
                    return (audio_signal[:, 1] / 2) + (audio_signal[:, 0] / 2)
                else:
                    return -1

    '''
    Function to split the input signal into windows of same size
    '''
    def framing(self, size, step, hamming=False):

        # Convert window size and step from seconds to samples
        win_size = int(size * self._sample_rate)
        win_step = int(step * self._sample_rate)

        # Number of frames
        nb_frames = 1 + int((len(self._signal) - win_size) / win_step)

        # Build Hamming function
        if hamming is True:
            ham = numpy.hamming(win_size)
        else:
            ham = numpy.ones(win_size)

        # Split signal (and multiply each window by the Hamming function)
        frames = []
        for t in range(nb_frames):
            sub_signal = AudioSignal(self._sample_rate, signal=self._signal[(t * win_step): (t * win_step + win_size)] * ham)
            frames.append(sub_signal)
        return frames

|
||||
    '''
    Function to compute the magnitude of the Discrete Fourier Transform coefficients
    '''
    def dft(self, norm=False):

        # Compute the magnitude of the spectrum (and normalize by the number of samples)
        if norm is True:
            dft = abs(fft(self._signal)) / len(self._signal)
        else:
            dft = abs(fft(self._signal))
        return dft

    '''
    Function to apply pre-emphasis filter on signal
    '''
    def pre_emphasis(self, alpha=0.97):

        # Emphasized signal
        emphasized_signal = numpy.append(self._signal[0], self._signal[1:] - alpha * self._signal[:-1])

        return emphasized_signal
@@ -0,0 +1,233 @@
## Basics ##
import time
import os
import numpy as np

## Audio Preprocessing ##
import pyaudio
import wave
import librosa
from scipy.stats import zscore

## Time Distributed CNN ##
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout, Activation, TimeDistributed
from tensorflow.keras.layers import Conv2D, MaxPooling2D, BatchNormalization, Flatten
from tensorflow.keras.layers import LSTM


'''
Speech Emotion Recognition
'''
class speechEmotionRecognition:

    '''
    Constructor: load the pre-trained model weights (if provided) and define the emotion labels
    '''
    def __init__(self, subdir_model=None):

        # Load prediction model
        if subdir_model is not None:
            self._model = self.build_model()
            self._model.load_weights(subdir_model)

        # Emotion encoding
        self._emotion = {0:'Angry', 1:'Disgust', 2:'Fear', 3:'Happy', 4:'Neutral', 5:'Sad', 6:'Surprise'}


    '''
    Voice recording function
    '''
    def voice_recording(self, filename, duration=5, sample_rate=16000, chunk=1024, channels=1):

        # Start the audio recording stream
        p = pyaudio.PyAudio()
        stream = p.open(format=pyaudio.paInt16,
                        channels=channels,
                        rate=sample_rate,
                        input=True,
                        frames_per_buffer=chunk)

        # Create an empty list to store audio recording
        frames = []

        # Determine the timestamp of the start of the recording
        print('* Start Recording *')
        stream.start_stream()
        start_time = time.time()
        current_time = time.time()

        # Record audio until timeout
        while (current_time - start_time) < duration:

            # Read a chunk of audio data
            data = stream.read(chunk)

            # Add the data to a buffer (a list of chunks)
            frames.append(data)

            # Get new timestamp
            current_time = time.time()

        # Close the audio recording stream
        stream.stop_stream()
        stream.close()
        p.terminate()
        print('* End Recording * ')

        # Export audio recording to wav format
        wf = wave.open(filename, 'w')
        wf.setnchannels(channels)
        wf.setsampwidth(p.get_sample_size(pyaudio.paInt16))
        wf.setframerate(sample_rate)
        wf.writeframes(b''.join(frames))
        wf.close()


    '''
    Mel-spectrogram computation
    '''
    def mel_spectrogram(self, y, sr=16000, n_fft=512, win_length=256, hop_length=128, window='hamming', n_mels=128, fmax=4000):

        # Compute power spectrogram
        mel_spect = np.abs(librosa.stft(y, n_fft=n_fft, window=window, win_length=win_length, hop_length=hop_length)) ** 2

        # Compute mel spectrogram
        mel_spect = librosa.feature.melspectrogram(S=mel_spect, sr=sr, n_mels=n_mels, fmax=fmax)

        # Compute log-mel spectrogram
        mel_spect = librosa.power_to_db(mel_spect, ref=np.max)

        return np.asarray(mel_spect)


    '''
    Audio framing
    '''
    def frame(self, y, win_step=64, win_size=128):

        # Number of frames
        nb_frames = 1 + int((y.shape[2] - win_size) / win_step)

        # Framing
        frames = np.zeros((y.shape[0], nb_frames, y.shape[1], win_size)).astype(np.float16)
        for t in range(nb_frames):
            frames[:,t,:,:] = np.copy(y[:,:,(t * win_step):(t * win_step + win_size)]).astype(np.float16)

        return frames


    '''
    Time distributed Convolutional Neural Network model
    '''
    def build_model(self):

        # Clear Keras session
        K.clear_session()

        # Define input
        input_y = Input(shape=(5, 128, 128, 1), name='Input_MELSPECT')

        # First LFLB (local feature learning block)
        y = TimeDistributed(Conv2D(64, kernel_size=(3, 3), strides=(1, 1), padding='same'), name='Conv_1_MELSPECT')(input_y)
        y = TimeDistributed(BatchNormalization(), name='BatchNorm_1_MELSPECT')(y)
        y = TimeDistributed(Activation('elu'), name='Activ_1_MELSPECT')(y)
        y = TimeDistributed(MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='same'), name='MaxPool_1_MELSPECT')(y)
        y = TimeDistributed(Dropout(0.2), name='Drop_1_MELSPECT')(y)

        # Second LFLB (local feature learning block)
        y = TimeDistributed(Conv2D(64, kernel_size=(3, 3), strides=(1, 1), padding='same'), name='Conv_2_MELSPECT')(y)
        y = TimeDistributed(BatchNormalization(), name='BatchNorm_2_MELSPECT')(y)
        y = TimeDistributed(Activation('elu'), name='Activ_2_MELSPECT')(y)
        y = TimeDistributed(MaxPooling2D(pool_size=(4, 4), strides=(4, 4), padding='same'), name='MaxPool_2_MELSPECT')(y)
        y = TimeDistributed(Dropout(0.2), name='Drop_2_MELSPECT')(y)

        # Third LFLB (local feature learning block)
        y = TimeDistributed(Conv2D(128, kernel_size=(3, 3), strides=(1, 1), padding='same'), name='Conv_3_MELSPECT')(y)
        y = TimeDistributed(BatchNormalization(), name='BatchNorm_3_MELSPECT')(y)
        y = TimeDistributed(Activation('elu'), name='Activ_3_MELSPECT')(y)
        y = TimeDistributed(MaxPooling2D(pool_size=(4, 4), strides=(4, 4), padding='same'), name='MaxPool_3_MELSPECT')(y)
        y = TimeDistributed(Dropout(0.2), name='Drop_3_MELSPECT')(y)

        # Fourth LFLB (local feature learning block)
        y = TimeDistributed(Conv2D(128, kernel_size=(3, 3), strides=(1, 1), padding='same'), name='Conv_4_MELSPECT')(y)
        y = TimeDistributed(BatchNormalization(), name='BatchNorm_4_MELSPECT')(y)
        y = TimeDistributed(Activation('elu'), name='Activ_4_MELSPECT')(y)
        y = TimeDistributed(MaxPooling2D(pool_size=(4, 4), strides=(4, 4), padding='same'), name='MaxPool_4_MELSPECT')(y)
        y = TimeDistributed(Dropout(0.2), name='Drop_4_MELSPECT')(y)

        # Flatten
        y = TimeDistributed(Flatten(), name='Flat_MELSPECT')(y)

        # LSTM layer
        y = LSTM(256, return_sequences=False, dropout=0.2, name='LSTM_1')(y)

        # Fully connected
        y = Dense(7, activation='softmax', name='FC')(y)

        # Build final model
        model = Model(inputs=input_y, outputs=y)

        return model


    '''
    Predict speech emotion over time from an audio file
    '''
    def predict_emotion_from_file(self, filename, chunk_step=16000, chunk_size=49100, predict_proba=False, sample_rate=16000):

        # Read audio file
        y, sr = librosa.core.load(filename, sr=sample_rate, offset=0.5)

        # Split audio signals into chunks
        chunks = self.frame(y.reshape(1, 1, -1), chunk_step, chunk_size)

        # Reshape chunks
        chunks = chunks.reshape(chunks.shape[1], chunks.shape[-1])

        # Z-normalization
        y = np.asarray(list(map(zscore, chunks)))

        # Compute mel spectrogram
        mel_spect = np.asarray(list(map(self.mel_spectrogram, y)))

        # Time distributed Framing
        mel_spect_ts = self.frame(mel_spect)

        # Build X for time distributed CNN
        X = mel_spect_ts.reshape(mel_spect_ts.shape[0],
                                 mel_spect_ts.shape[1],
                                 mel_spect_ts.shape[2],
                                 mel_spect_ts.shape[3],
                                 1)

        # Predict emotion
        if predict_proba is True:
            predict = self._model.predict(X)
        else:
            predict = np.argmax(self._model.predict(X), axis=1)
            predict = [self._emotion.get(emotion) for emotion in predict]

        # Clear Keras session
        K.clear_session()

        # Predict timestamp
        timestamp = np.concatenate([[chunk_size], np.ones((len(predict) - 1)) * chunk_step]).cumsum()
        timestamp = np.round(timestamp / sample_rate)

        return [predict, timestamp]

    '''
    Export predicted emotions to csv format
    '''
    def prediction_to_csv(self, predictions, filename, mode='w'):

        # Write emotions to filename (the with block closes the file)
        with open(filename, mode) as f:
            if mode == 'w':
                f.write("EMOTIONS"+'\n')
            for emotion in predictions:
                f.write(str(emotion)+'\n')
@@ -0,0 +1,104 @@
import os
import pickle
from AudioLibrary.AudioSignal import *
from AudioLibrary.AudioFeatures import *


class AudioEmotionRecognition:

    def __init__(self, model_path):

        # Load classifier
        self._clf = pickle.load(open(os.path.join(model_path, 'MODEL_CLF.p'), 'rb'))

        # Load features parameters
        self._features_param = pickle.load(open(os.path.join(model_path, 'MODEL_PARAM.p'), 'rb'))

        # Load feature scaler parameters (mean and std)
        self._features_mean, self._features_std = pickle.load(open(os.path.join(model_path, 'MODEL_SCALER.p'), 'rb'))

        # Load PCA
        self._pca = pickle.load(open(os.path.join(model_path, 'MODEL_PCA.p'), 'rb'))

        # Load label encoder
        self._encoder = pickle.load(open(os.path.join(model_path, 'MODEL_ENCODER.p'), 'rb'))

    '''
    Function to scale audio features
    '''
    def scale_features(self, features):

        # Scaled features
        scaled_features = (features - self._features_mean) / self._features_std

        # Return scaled features
        return scaled_features

    '''
    Function to predict speech emotion from an audio signal
    '''
    def predict_emotion(self, audio_signal, predict_proba=False, decode=True):

        # Extract audio features
        audio_features = AudioFeatures(audio_signal, float(self._features_param.get("win_size")),
                                       float(self._features_param.get("win_step")))
        features, features_names = audio_features.global_feature_extraction(stats=self._features_param.get("stats"),
                                                                            features_list=self._features_param.get(
                                                                                "features_list"),
                                                                            nb_mfcc=self._features_param.get("nb_mfcc"),
                                                                            diff=self._features_param.get("diff"))
        # Scale features
        features = self.scale_features(features)

        # Apply feature dimension reduction
        if self._features_param.get("PCA") is True:
            features = self._pca.transform(features)

        # Make prediction
        if predict_proba is True:
            prediction = self._clf.predict_proba(features.reshape(1, -1))
        else:
            prediction = self._clf.predict(features.reshape(1, -1))

            # Decode label emotion
            if decode is True:
                prediction = (self._encoder.inverse_transform((prediction.astype(int).flatten())))

                # Remove gender recognition
                prediction = prediction[0][2:]

        return prediction

    '''
    Function to predict speech emotion over time from an audio or video file
    '''
    def predict_emotion_from_file(self, filename, sample_rate, chunk_size=0, chunk_step=0, predict_proba=False,
                                  decode=True):

        # Initialize Audio Basic object
        audio_signal = AudioSignal(sample_rate, filename=filename)

        # Split audio signals into chunks
        if chunk_size > 0:
            chunks = audio_signal.framing(chunk_size, chunk_step)

            # Initialize time stamp
            timestamp = []

            # Emotion prediction for each chunk
            prediction = []
            for signal in chunks:
                if len(timestamp) == 0:
                    timestamp.append(chunk_size)
                else:
                    timestamp.append(timestamp[-1] + chunk_step)
                prediction.append(self.predict_emotion(signal, predict_proba=predict_proba, decode=decode))

            # Return emotion prediction and related timestamp
            return prediction, timestamp
        else:

            # Emotion prediction
            prediction = self.predict_emotion(audio_signal, predict_proba=predict_proba, decode=decode)

            # Return emotion prediction
            return prediction
@@ -0,0 +1,347 @@
import numpy
from scipy.fftpack import dct
from scipy.stats import kurtosis, skew
from AudioLibrary.AudioSignal import *


class AudioFeatures:

    def __init__(self, audio_signal, win_size, win_step):

        # Audio Signal
        self._audio_signal = audio_signal

        # Short time features window size
        self._win_size = win_size

        # Short time features window step
        self._win_step = win_step

    '''
    Global statistics features extraction from an audio signal
    '''
    def global_feature_extraction(self, stats=['mean', 'std'], features_list=[], nb_mfcc=12, nb_filter=40, diff=0, hamming=True):

        # Extract short term audio features
        st_features, f_names = self.short_time_feature_extraction(features_list, nb_mfcc, nb_filter, hamming)

        # Number of short term features
        nb_feats = st_features.shape[1]

        # Number of statistics
        nb_stats = len(stats)

        # Global statistics feature names
        feature_names = ["" for x in range(nb_feats * nb_stats)]
        for i in range(nb_feats):
            for j in range(nb_stats):
                feature_names[i + j * nb_feats] = f_names[i] + "_d" + str(diff) + "_" + stats[j]

        # Calculate global statistics features
        features = numpy.zeros((nb_feats * nb_stats))
        for i in range(nb_feats):

            # Get features series
            feat = st_features[:, i]

            # Compute the difference at lag `diff` (diff=1 is the first-order difference)
            if diff > 0:
                feat = feat[diff:] - feat[:-diff]

            # Global statistics
            for j in range(nb_stats):
                features[i + j * nb_feats] = self.compute_statistic(feat, stats[j])

        return features, feature_names

    '''
    Short-time features extraction from an audio signal
    '''
    def short_time_feature_extraction(self, features=[], nb_mfcc=12, nb_filter=40, hamming=True):

        # Copy features list to compute
        features_list = list(features)

        # MFCCs features names
        mfcc_feature_names = []
        if 'mfcc' in features_list:
            mfcc_feature_names = ["mfcc_{0:d}".format(i) for i in range(1, nb_mfcc + 1)]
            features_list.remove('mfcc')

        # Filter banks features names
        fbank_features_names = []
        if 'filter_banks' in features_list:
            fbank_features_names = ["fbank_{0:d}".format(i) for i in range(1, nb_filter + 1)]
            features_list.remove('filter_banks')

        # All Features names
        feature_names = features_list + mfcc_feature_names + fbank_features_names

        # Number of features
        nb_features = len(feature_names)

        # Framing signal
        frames = self._audio_signal.framing(self._win_size, self._win_step, hamming=hamming)

        # Number of frames
        nb_frames = len(frames)

        # Compute features on each frame
        features = numpy.zeros((nb_frames, nb_features))
        cur_pos = 0
        for el in frames:

            # Get signal of the frame
            signal = el._signal

            # Compute the normalized magnitude of the spectrum (Discrete Fourier Transform)
            dft = el.dft(norm=True)

            # Keep the first half of the spectrum
            dft = dft[:int((self._win_size * self._audio_signal._sample_rate) / 2)]
            if cur_pos == 0:
                dft_prev = dft

            # Compute features on frame
            for idx, f in enumerate(features_list):
                features[cur_pos, idx] = self.compute_st_features(f, signal, dft, dft_prev,
                                                                  self._audio_signal._sample_rate)

            # Compute MFCCs and Filter Banks
            if len(mfcc_feature_names) > 0:
                features[cur_pos, len(features_list):len(features_list) + len(mfcc_feature_names) + len(fbank_features_names)] = self.mfcc(signal, self._audio_signal._sample_rate,
                                                                                                                                            nb_coeff=nb_mfcc, nb_filt=nb_filter, return_fbank=len(fbank_features_names) > 0)
            # Compute Filter Banks
            elif len(fbank_features_names) > 0:
                features[cur_pos, len(features_list) + len(mfcc_feature_names):] = self.filter_banks_coeff(signal, self._audio_signal._sample_rate, nb_filt=nb_filter)

            # Keep previous Discrete Fourier Transform coefficients
            dft_prev = dft
            cur_pos = cur_pos + 1

        return features, feature_names

    '''
    Computes zero crossing rate of a signal
    '''
    @staticmethod
    def zcr(signal):
        zcr = numpy.sum(numpy.abs(numpy.diff(numpy.sign(signal))))
        zcr = zcr / (2 * numpy.float64(len(signal) - 1.0))
        return zcr

    '''
    Computes signal energy of frame
    '''
    @staticmethod
    def energy(signal):
        energy = numpy.sum(signal ** 2) / numpy.float64(len(signal))
        return energy

    '''
    Computes entropy of energy
    '''
    @staticmethod
    def energy_entropy(signal, n_short_blocks=10, eps=10e-8):

        # Total frame energy
        energy = numpy.sum(signal ** 2)
        sub_win_len = int(numpy.floor(len(signal) / n_short_blocks))

        # Length of sub-frame
        if len(signal) != sub_win_len * n_short_blocks:
            signal = signal[0:sub_win_len * n_short_blocks]

        # Get sub windows
        sub_wins = signal.reshape(sub_win_len, n_short_blocks, order='F').copy()

        # Compute normalized sub-frame energies:
        sub_energies = numpy.sum(sub_wins ** 2, axis=0) / (energy + eps)

        # Compute entropy of the normalized sub-frame energies:
        entropy = -numpy.sum(sub_energies * numpy.log2(sub_energies + eps))

        return entropy

    '''
    Computes spectral centroid and spread of frame
    '''
    @staticmethod
    def spectral_centroid_spread(fft, fs, eps=10e-8):

        # Sample range
        sr = (numpy.arange(1, len(fft) + 1)) * (fs / (2.0 * len(fft)))

        # Normalize fft coefficients by the max value
        norm_fft = fft / (fft.max() + eps)

        # Centroid:
        C = numpy.sum(sr * norm_fft) / (numpy.sum(norm_fft) + eps)

        # Spread:
        S = numpy.sqrt(numpy.sum(((sr - C) ** 2) * norm_fft) / (numpy.sum(norm_fft) + eps))

        # Normalize:
        C = C / (fs / 2.0)
        S = S / (fs / 2.0)

        return C, S

    '''
    Computes the spectral flux feature
    '''
    @staticmethod
    def spectral_flux(fft, fft_prev, eps=10e-8):

        # Sum of fft coefficients
        sum_fft = numpy.sum(fft + eps)

        # Sum of previous fft coefficients
        sum_fft_prev = numpy.sum(fft_prev + eps)

        # Compute the spectral flux as the sum of square distances
        flux = numpy.sum((fft / sum_fft - fft_prev / sum_fft_prev) ** 2)

        return flux

    '''
    Computes the spectral roll off
    '''
    @staticmethod
    def spectral_rolloff(fft, c=0.90, eps=10e-8):

        # Total energy
        energy = numpy.sum(fft ** 2)

        # Roll off threshold
        threshold = c * energy

        # Compute cumulative energy
        cum_energy = numpy.cumsum(fft ** 2) + eps

        # Find the spectral roll off as the frequency position
        [roll_off, ] = numpy.nonzero(cum_energy > threshold)

        # Normalize
        if len(roll_off) > 0:
            roll_off = numpy.float64(roll_off[0]) / (float(len(fft)))
        else:
            roll_off = 0.0

        return roll_off

    '''
    Computes the Filter Bank coefficients
    '''
    @staticmethod
    def filter_banks_coeff(signal, sample_rate, nb_filt=40, nb_fft=512):

        # Magnitude of the FFT
        mag_frames = numpy.absolute(numpy.fft.rfft(signal, nb_fft))

        # Power Spectrum
        pow_frames = ((1.0 / nb_fft) * (mag_frames ** 2))
        low_freq_mel = 0

        # Convert Hz to Mel
        high_freq_mel = (2595 * numpy.log10(1 + (sample_rate / 2) / 700))

        # Equally spaced in Mel scale
        mel_points = numpy.linspace(low_freq_mel, high_freq_mel, nb_filt + 2)

        # Convert Mel to Hz
        hz_points = (700 * (10 ** (mel_points / 2595) - 1))
        bin = numpy.floor((nb_fft + 1) * hz_points / sample_rate)

        # Calculate filter banks
        fbank = numpy.zeros((nb_filt, int(numpy.floor(nb_fft / 2 + 1))))
        for m in range(1, nb_filt + 1):

            # left
            f_m_minus = int(bin[m - 1])

            # center
            f_m = int(bin[m])

            # right
            f_m_plus = int(bin[m + 1])

            for k in range(f_m_minus, f_m):
                fbank[m - 1, k] = (k - bin[m - 1]) / (bin[m] - bin[m - 1])
            for k in range(f_m, f_m_plus):
                fbank[m - 1, k] = (bin[m + 1] - k) / (bin[m + 1] - bin[m])
        filter_banks = numpy.dot(pow_frames, fbank.T)

        # Numerical Stability
        filter_banks = numpy.where(filter_banks == 0, numpy.finfo(float).eps, filter_banks)

        # dB
        filter_banks = 20 * numpy.log10(filter_banks)

        return filter_banks

    '''
    Computes the MFCCs
    '''
    def mfcc(self, signal, sample_rate, nb_coeff=12, nb_filt=40, nb_fft=512, return_fbank=False):

        # Apply filter bank on spectrogram
        filter_banks = self.filter_banks_coeff(signal, sample_rate, nb_filt=nb_filt, nb_fft=nb_fft)

        # Compute MFCC coefficients
        mfcc = dct(filter_banks, type=2, axis=-1, norm='ortho')[1: (nb_coeff + 1)]

        # Return MFCCs and Filter banks coefficients
        if return_fbank is True:
            return numpy.concatenate((mfcc, filter_banks))
        else:
            return mfcc

    '''
    Compute statistics on short time features
    '''
    @staticmethod
    def compute_statistic(seq, statistic):
        if statistic == 'mean':
            S = numpy.mean(seq)
        elif statistic == 'med':
            S = numpy.median(seq)
        elif statistic == 'std':
            S = numpy.std(seq)
        elif statistic == 'kurt':
            S = kurtosis(seq)
        elif statistic == 'skew':
            S = skew(seq)
        elif statistic == 'min':
            S = numpy.min(seq)
        elif statistic == 'max':
            S = numpy.max(seq)
        elif statistic == 'q1':
            S = numpy.percentile(seq, 1)
        elif statistic == 'q99':
            S = numpy.percentile(seq, 99)
        elif statistic == 'range':
            S = numpy.abs(numpy.percentile(seq, 99) - numpy.percentile(seq, 1))
        return S

    '''
    Compute short time features on signal
    '''
    def compute_st_features(self, feature, signal, dft, dft_prev, sample_rate):
        if feature == 'zcr':
            F = self.zcr(signal)
        elif feature == 'energy':
            F = self.energy(signal)
        elif feature == 'energy_entropy':
            F = self.energy_entropy(signal)
        elif feature == 'spectral_centroid':
            [F, FF] = self.spectral_centroid_spread(dft, sample_rate)
        elif feature == 'spectral_spread':
            [FF, F] = self.spectral_centroid_spread(dft, sample_rate)
        elif feature == 'spectral_entropy':
            F = self.energy_entropy(dft)
        elif feature == 'spectral_flux':
            F = self.spectral_flux(dft, dft_prev)
        # The original key was misspelled 'sprectral_rolloff'; both spellings are accepted
        elif feature in ('spectral_rolloff', 'sprectral_rolloff'):
            F = self.spectral_rolloff(dft)
        return F
@@ -0,0 +1,161 @@
import os
import numpy
from pydub import AudioSegment
from scipy.fftpack import fft


class AudioSignal(object):

    def __init__(self, sample_rate, signal=None, filename=None):

        # Set sample rate
        self._sample_rate = sample_rate

        if signal is None:

            # Get file name and file extension
            file, file_extension = os.path.splitext(filename)

            # Check if file extension is an audio format
            if file_extension in ['.mp3', '.wav']:

                # Read audio file
                self._signal = self.read_audio_file(filename)

            # Check if file extension is a video format
            elif file_extension in ['.mp4', '.mkv', '.avi']:

                # Extract audio from video
                new_filename = self.extract_audio_from_video(filename)

                # Read audio file from extracted audio file
                self._signal = self.read_audio_file(new_filename)

            # Case file extension is not supported
            else:
                print("Error: file not found or file extension not supported.")

        elif filename is None:

            # Use the signal provided
            self._signal = signal

        else:

            print("Error : argument missing in AudioSignal() constructor.")

    '''
    Function to extract audio from a video
    '''
    def extract_audio_from_video(self, filename):

        # Get video file name and extension
        file, file_extension = os.path.splitext(filename)

        # Extract audio (.wav) from video
        os.system('ffmpeg -i ' + file + file_extension + ' ' + '-ar ' + str(self._sample_rate) + ' ' + file + '.wav')
        print("Successfully converted {} into audio!".format(filename))

        # Return audio file name created
        return file + '.wav'

    '''
    Function to read an audio file and return its audio samples
    '''
    def read_audio_file(self, filename):

        # Get audio signal
        audio_file = AudioSegment.from_file(filename)

        # Resample audio signal
        audio_file = audio_file.set_frame_rate(self._sample_rate)

        # Cast to integer (numpy.fromstring is deprecated for binary data)
        if audio_file.sample_width == 2:
            data = numpy.frombuffer(audio_file._data, numpy.int16)
        elif audio_file.sample_width == 4:
            data = numpy.frombuffer(audio_file._data, numpy.int32)

        # Merge audio channels
        audio_signal = []
        for chn in list(range(audio_file.channels)):
            audio_signal.append(data[chn::audio_file.channels])
        audio_signal = numpy.array(audio_signal).T

        # Flatten single-channel signals
        if audio_signal.ndim == 2:
            if audio_signal.shape[1] == 1:
                audio_signal = audio_signal.flatten()

        # Convert stereo to mono
        audio_signal = self.stereo_to_mono(audio_signal)

        # Return audio signal
        return audio_signal

    '''
    Function to convert an input signal from stereo to mono
    '''
    @staticmethod
    def stereo_to_mono(audio_signal):

        # Check if signal is stereo and convert to mono
        if isinstance(audio_signal, int):
            return -1
        if audio_signal.ndim == 1:
            return audio_signal
        elif audio_signal.ndim == 2:
            if audio_signal.shape[1] == 1:
                return audio_signal.flatten()
            else:
                if audio_signal.shape[1] == 2:
                    return (audio_signal[:, 1] / 2) + (audio_signal[:, 0] / 2)
                else:
                    return -1

    '''
    Function to split the input signal into windows of the same size
    '''
    def framing(self, size, step, hamming=False):

        # Rescale window size and step (multiplied by the sample rate)
        win_size = int(size * self._sample_rate)
        win_step = int(step * self._sample_rate)

        # Number of frames
        nb_frames = 1 + int((len(self._signal) - win_size) / win_step)

        # Build Hamming function
        if hamming is True:
            ham = numpy.hamming(win_size)
        else:
            ham = numpy.ones(win_size)

        # Split signal (and multiply each window by the Hamming function)
        frames = []
        for t in range(nb_frames):
            sub_signal = AudioSignal(self._sample_rate, signal=self._signal[(t * win_step): (t * win_step + win_size)] * ham)
            frames.append(sub_signal)
        return frames

    '''
    Function to compute the magnitude of the Discrete Fourier Transform coefficients
    '''
    def dft(self, norm=False):

        # Compute the magnitude of the spectrum (and normalize by the number of samples)
        if norm is True:
            dft = abs(fft(self._signal)) / len(self._signal)
        else:
            dft = abs(fft(self._signal))
        return dft

    '''
    Function to apply pre-emphasis filter on signal
    '''
    def pre_emphasis(self, alpha=0.97):

        # Emphasized signal
        emphasized_signal = numpy.append(self._signal[0], self._signal[1:] - alpha * self._signal[:-1])

        return emphasized_signal
@@ -0,0 +1,92 @@
# Speech Emotion Recognition

![image](/00-Presentation/Images/Audio.png)

The aim of this section is to explore speech emotion recognition techniques from an audio recording.

## Data

The data set used for training is the **Ryerson Audio-Visual Database of Emotional Speech and Song**: https://zenodo.org/record/1188976#.XA48aC17Q1J

**RAVDESS** contains recordings of 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.

![image](/00-Presentation/Images/RAVDESS.png)

| Data | Processed Data for training | Processed Data for testing | Pre-trained TimeDistributed CNNs model | Notebook |
|:----:|:---------------------------:|:--------------------------:|:--------------------------------------:|:--------:|
| [RAVDESS](https://drive.google.com/file/d/1OL2Kx9dPdeZWoue6ofHcUNs5jwpfh4Fc/view?usp=sharing) | [X-train](https://drive.google.com/file/d/1qv-y0FhaRy5Np8DF3a8Xty8xLvvv4QH4/view?usp=sharing) [y-train](https://drive.google.com/file/d/1y5j43I09Xe6RHK8BsHP8_ZNkUuTehhgY/view?usp=sharing) | [X-test](https://drive.google.com/file/d/1MN1Fxc_sDR1ZDQmPdFMwlnhP4qn9d8bT/view?usp=sharing) [y-test](https://drive.google.com/file/d/1ovvCXumkEP1oLxErgMgyIg1Z1Eih430W/view?usp=sharing) | [Weights](https://drive.google.com/file/d/1pQ5QahXJ3dPDXhyPkQ7rS1fOHWKHcIdX/view?usp=sharing) [Model](https://drive.google.com/file/d/1TuKN2PbFvoClaobL3aOW1KmA0e2eEc-O/view?usp=sharing) | [Colab Notebook](https://colab.research.google.com/drive/1EY8m7uj3BzU-OsjAPGBqoapw1OSUHhum) |

## Requirements

```
Python : 3.6.5
Scipy : 1.1.0
Scikit-learn : 0.20.1
Tensorflow : 1.12.0
Keras : 2.2.4
Numpy : 1.15.4
Librosa : 0.6.3
Pyaudio : 0.2.11
Ffmpeg : 4.0.2
```

## Files
|
||||
|
||||
This repo contains the following folders:
|
||||
- `Model` : Saved models (SVM and TimeDistributed CNNs)
|
||||
- `Notebook` : All notebooks (preprocessing and model training)
|
||||
- `Python` : Personal audio library
|
||||
- `Images`: Set of pictures saved from the notebooks and final report
|
||||
- `Resources` : Some resources on Speech Emotion Recognition
|
||||
|
||||
Notebooks provided in this repo:
|
||||
- `01 - Preprocessing[SVM].ipynb` : Signal preprocessing and feature extraction from the time and frequency domains (global statistics) to train the SVM classifier
|
||||
- `02 - Train [SVM].ipynb` : Implementation and training of the SVM classifier for Speech Emotion Recognition
|
||||
- `01 - Preprocessing[CNN-LSTM].ipynb` : Signal preprocessing and log-mel-spectrogram extraction to train TimeDistributed CNNs
|
||||
- `02 - Train [CNN-LSTM].ipynb` : Implementation and training of the TimeDistributed CNNs classifier for Speech Emotion Recognition
|
||||
|
||||
|
||||
## Models
|
||||
|
||||
### SVM
|
||||
|
||||
The classical approach to Speech Emotion Recognition applies a series of filters to the audio signal and partitions it into several windows (fixed size and time-step). Features from the time domain (**Zero Crossing Rate, Energy** and **Entropy of Energy**) and the frequency domain (**Spectral entropy, centroid, spread, flux, rolloff** and **MFCCs**) are then extracted for each frame. We then compute the first derivatives of each of these features to capture frame-to-frame changes in the signal. Finally, we calculate the following global statistics on these features: *mean, median, standard deviation, kurtosis, skewness, 1st percentile, 99th percentile, min, max* and *range*, and train a simple SVM classifier with an RBF kernel to predict the emotion detected in the voice.
|
||||
|
||||

|
||||
|
||||
SVM classification pipeline:
|
||||
- Voice recording
|
||||
- Audio signal discretization
|
||||
- Apply pre-emphasis filter
|
||||
- Framing using a rolling window
|
||||
- Apply Hamming window
|
||||
- Feature extraction
|
||||
- Compute global statistics
|
||||
- Make a prediction using our pre-trained model
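
The preprocessing steps of this pipeline can be sketched with NumPy alone. This is a minimal illustration, not the code from the notebooks: the pre-emphasis coefficient (0.97), the frame length and hop (400 and 160 samples), the synthetic sine signal, and all function names are assumptions. Only two of the listed features (zero crossing rate and energy) are computed, to keep the example short.

```python
import numpy as np


def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])


def frame_signal(signal, frame_len, hop_len):
    """Slice the signal into overlapping frames and apply a Hamming window."""
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)


def frame_features(frames):
    """Two time-domain features per frame: zero crossing rate and energy."""
    signs = np.signbit(frames)
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
    energy = np.mean(frames ** 2, axis=1)
    return np.column_stack([zcr, energy])


def global_statistics(features):
    """Summarise each feature track and its first derivative over all frames."""
    deltas = np.diff(features, axis=0)  # frame-to-frame changes
    stats = []
    for track in (features, deltas):
        centered = track - track.mean(axis=0)
        std = track.std(axis=0) + 1e-12  # avoid division by zero
        stats += [
            track.mean(axis=0), np.median(track, axis=0), track.std(axis=0),
            (centered ** 4).mean(axis=0) / std ** 4,  # kurtosis (non-excess)
            (centered ** 3).mean(axis=0) / std ** 3,  # skewness
            np.percentile(track, 1, axis=0), np.percentile(track, 99, axis=0),
            track.min(axis=0), track.max(axis=0),
            track.max(axis=0) - track.min(axis=0),    # range
        ]
    return np.concatenate(stats)


# Synthetic 1-second signal at 16 kHz standing in for a voice recording.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 220 * t)

frames = frame_signal(pre_emphasis(signal), frame_len=400, hop_len=160)
features = frame_features(frames)
vector = global_statistics(features)
```

The resulting `vector` has a fixed length regardless of the recording duration, which is what lets a classifier such as an SVM consume recordings of different lengths.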
|
||||
|
||||
|
||||
### TimeDistributed CNNs
|
||||
|
||||
The main idea of a **Time Distributed Convolutional Neural Network** is to apply a rolling window (fixed size and time-step) along the log-mel-spectrogram. Each of these windows is the input of a convolutional neural network composed of four Local Feature Learning Blocks (LFLBs). The output of each of these convolutional networks is fed into a recurrent neural network composed of 2 LSTM (Long Short-Term Memory) cells to learn the long-term contextual dependencies. Finally, a fully connected layer with *softmax* activation is used to predict the emotion detected in the voice.
|
||||
|
||||

|
||||
|
||||
TimeDistributed CNNs pipeline:
|
||||
- Voice recording
|
||||
- Audio signal discretization
|
||||
- Log-mel-spectrogram extraction
|
||||
- Split spectrogram with a rolling window
|
||||
- Make a prediction using our pre-trained model
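
The "split spectrogram with a rolling window" step can be sketched as follows. This is an illustration under assumed values: the spectrogram shape (128 mel bands by 256 frames), the window size (128) and the step (64) are placeholders, the random array stands in for a real log-mel-spectrogram, and `split_spectrogram` is a hypothetical helper name.

```python
import numpy as np


def split_spectrogram(spec, win_size, win_step):
    """Cut a (n_mels, n_frames) spectrogram into overlapping windows.

    Returns an array of shape (n_windows, n_mels, win_size, 1): one
    single-channel image per time step, the kind of layout a
    time-distributed convolutional stack would consume.
    """
    n_mels, n_frames = spec.shape
    n_windows = 1 + (n_frames - win_size) // win_step
    windows = np.stack([
        spec[:, i * win_step: i * win_step + win_size]
        for i in range(n_windows)
    ])
    return windows[..., np.newaxis]


# Random values standing in for a real log-mel-spectrogram.
rng = np.random.default_rng(0)
log_mel = rng.standard_normal((128, 256))
chunks = split_spectrogram(log_mel, win_size=128, win_step=64)
```

Each entry along the first axis of `chunks` is one time step: the same convolutional network is applied to every window, and the sequence of its outputs is what the LSTM layers read.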
|
||||
|
||||
|
||||
## Performance
|
||||
|
||||
To check that the models generalize beyond the training data, we split our data set into a train set (80%) and a test set (20%). The following table shows the results obtained on the test set:
|
||||
|
||||
| Model | Accuracy |
|
||||
|-----------------------------------------|---------------|
|
||||
| SVM on global statistic features        | 68.3%         |
|
||||
| Time distributed CNNs                   | 76.6%         |
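
An 80/20 split and the accuracy metric can be sketched as below. This is a generic illustration, not the split used in the notebooks: the random seed, the sample count and the function names are assumptions, and the toy labels are made up for the example.

```python
import numpy as np


def train_test_split_indices(n_samples, test_ratio=0.2, seed=0):
    """Shuffle sample indices and hold out a fraction for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_ratio))
    return order[n_test:], order[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


train_idx, test_idx = train_test_split_indices(1000)
toy_score = accuracy([0, 1, 2, 2], [0, 1, 1, 2])  # 3 of 4 labels match
```

With several recordings per actor, splitting by actor rather than by recording would be a stricter test of generalization to new speakers.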
|
||||
|