This commit is contained in:
Maël
2019-06-11 22:10:57 +02:00
commit 09d04f5bae
296 files changed, with 557024 additions and 0 deletions
@@ -0,0 +1,320 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Speech Emotion Recognition - Signal Preprocessing\n",
"\n",
"A project for the French Employment Agency\n",
"\n",
"Telecom ParisTech 2018-2019"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## I. Context"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The aim of this notebook is to set up the preprocessing and audio feature extraction for speech emotion recognition.\n",
"\n",
"### Audio features:\n",
"The complete list of the implemented short-term features is presented below:\n",
"- **Zero Crossing Rate**: The rate of sign-changes of the signal during the duration of a particular frame.\n",
"- **Energy**: The sum of squares of the signal values, normalized by the respective frame length.\n",
"- **Entropy of Energy**: The entropy of sub-frames' normalized energies. It can be interpreted as a measure of abrupt changes.\n",
"- **Spectral Centroid**: The center of gravity of the spectrum.\n",
"- **Spectral Spread**: The second central moment of the spectrum.\n",
"- **Spectral Entropy**: Entropy of the normalized spectral energies for a set of sub-frames.\n",
"- **Spectral Flux**: The squared difference between the normalized magnitudes of the spectra of the two successive frames.\n",
"- **Spectral Rolloff**: The frequency below which 90% of the magnitude distribution of the spectrum is concentrated.\n",
"- **MFCCs**: Mel Frequency Cepstral Coefficients form a cepstral representation where the frequency bands are not linear but distributed according to the mel scale.\n",
"\n",
"Global statistics are then computed on the features above:\n",
"- **mean, std, med, kurt, skew, q1, q99, min, max and range**\n",
"\n",
"### Data:\n",
"**RAVDESS**: The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes *calm*, *happy*, *sad*, *angry*, *fearful*, *surprise*, and *disgust* expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. (https://zenodo.org/record/1188976#.XA48aC17Q1J)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## II. General import"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-15T13:13:31.470677Z",
"start_time": "2019-04-15T13:13:30.911103Z"
}
},
"outputs": [],
"source": [
"### General imports ###\n",
"from glob import glob\n",
"import os\n",
"import pickle\n",
"import itertools\n",
"import numpy as np\n",
"\n",
"### Audio preprocessing imports ###\n",
"from AudioLibrary.AudioSignal import *\n",
"from AudioLibrary.AudioFeatures import *"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2018-12-04T16:38:44.580314Z",
"start_time": "2018-12-04T16:38:44.560062Z"
}
},
"source": [
"## III. Set labels"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-15T13:13:31.477659Z",
"start_time": "2019-04-15T13:13:31.473279Z"
}
},
"outputs": [],
"source": [
"# RAVDESS Database (emotion code '01', neutral, is left out; code '02', calm, is labelled NEU)\n",
"label_dict_ravdess = {'02': 'NEU', '03':'HAP', '04':'SAD', '05':'ANG', '06':'FEA', '07':'DIS', '08':'SUR'}\n",
"\n",
"# Set audio files labels\n",
"def set_label_ravdess(audio_file, gender_differentiation):\n",
"    label = label_dict_ravdess.get(audio_file[6:-16])\n",
"    if gender_differentiation == True:\n",
"        if int(audio_file[18:-4])%2 == 0: # Female\n",
"            label = 'f_' + label\n",
"        if int(audio_file[18:-4])%2 == 1: # Male\n",
"            label = 'm_' + label\n",
"    return label"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## IV. Import audio files"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-15T13:13:36.852703Z",
"start_time": "2019-04-15T13:13:31.479656Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Import Data: START\n",
"Import Data: RUNNING ... 0 files\n",
"Import Data: RUNNING ... 200 files\n",
"Import Data: RUNNING ... 300 files\n",
"Import Data: RUNNING ... 400 files\n",
"Import Data: RUNNING ... 500 files\n",
"Import Data: RUNNING ... 600 files\n",
"Import Data: RUNNING ... 700 files\n",
"Import Data: RUNNING ... 800 files\n",
"Import Data: RUNNING ... 900 files\n",
"Import Data: RUNNING ... 1000 files\n",
"Import Data: RUNNING ... 1100 files\n",
"Import Data: RUNNING ... 1200 files\n",
"Import Data: RUNNING ... 1300 files\n",
"Import Data: RUNNING ... 1400 files\n",
"Import Data: END \n",
"\n",
"Number of audio files imported: 1344\n"
]
}
],
"source": [
"# Start data import\n",
"print(\"Import Data: START\")\n",
"\n",
"# Audio file path and names\n",
"file_path = '../Datas/RAVDESS/'\n",
"file_names = os.listdir(file_path)\n",
"\n",
"# Initialize signal and labels list\n",
"signal = []\n",
"labels = []\n",
"\n",
"# Sample rate (44.1 kHz)\n",
"sample_rate = 44100\n",
"\n",
"# Read every audio file and set its label\n",
"for audio_index, audio_file in enumerate(file_names):\n",
"\n",
"    # Select audio file\n",
"    if audio_file[6:-16] in label_dict_ravdess.keys():\n",
"\n",
"        # Read audio file\n",
"        signal.append(AudioSignal(sample_rate, filename=file_path + audio_file))\n",
"\n",
"        # Set label\n",
"        labels.append(set_label_ravdess(audio_file, True))\n",
"\n",
"        # Print running...\n",
"        if (audio_index % 100 == 0):\n",
"            print(\"Import Data: RUNNING ... {} files\".format(audio_index))\n",
"\n",
"# Cast labels to array\n",
"labels = np.asarray(labels).ravel()\n",
"\n",
"# Stop data import\n",
"print(\"Import Data: END \\n\")\n",
"print(\"Number of audio files imported: {}\".format(labels.shape[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## V. Audio features extraction"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-15T13:13:36.863481Z",
"start_time": "2019-04-15T13:13:36.855871Z"
}
},
"outputs": [],
"source": [
"# Audio features extraction function\n",
"def global_feature_statistics(y, win_size=0.025, win_step=0.01, nb_mfcc=12, mel_filter=40,\n",
"                              stats = ['mean', 'std', 'med', 'kurt', 'skew', 'q1', 'q99', 'min', 'max', 'range'],\n",
"                              features_list = ['zcr', 'energy', 'energy_entropy', 'spectral_centroid', 'spectral_spread', 'spectral_entropy', 'spectral_flux', 'sprectral_rolloff', 'mfcc']):\n",
"\n",
"    # Extract features (nb_mfcc and mel_filter are forwarded so that they are not silently ignored)\n",
"    audio_features = AudioFeatures(y, win_size, win_step)\n",
"    features, features_names = audio_features.global_feature_extraction(stats=stats, features_list=features_list,\n",
"                                                                        nb_mfcc=nb_mfcc, nb_filter=mel_filter)\n",
"    return features\n",
"\n",
"# Features extraction parameters\n",
"sample_rate = 16000 # Sample rate (16.0 kHz)\n",
"win_size = 0.025 # Short term window size (25 msec)\n",
"win_step = 0.01 # Short term window step (10 msec)\n",
"nb_mfcc = 12 # Number of MFCCs coefficients (12)\n",
"nb_filter = 40 # Number of filter banks (40)\n",
"stats = ['mean', 'std', 'med', 'kurt', 'skew', 'q1', 'q99', 'min', 'max', 'range'] # Global statistics\n",
"features_list = ['zcr', 'energy', 'energy_entropy', 'spectral_centroid', 'spectral_spread', # Audio features\n",
"                 'spectral_entropy', 'spectral_flux', 'sprectral_rolloff', 'mfcc']"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-15T13:19:38.974213Z",
"start_time": "2019-04-15T13:13:36.866069Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Feature extraction: START\n",
"Feature extraction: END!\n"
]
}
],
"source": [
"# Start feature extraction\n",
"print(\"Feature extraction: START\")\n",
"\n",
"# Compute global feature statistics for all audio file\n",
"features = np.asarray(list(map(global_feature_statistics, signal)))\n",
"\n",
"# Stop feature extraction\n",
"print(\"Feature extraction: END!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## VI. Save as"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-15T13:19:38.983530Z",
"start_time": "2019-04-15T13:19:38.975722Z"
}
},
"outputs": [],
"source": [
"# Save features and labels to pickle\n",
"with open(\"../Datas/Pickle/[RAVDESS][HAP-SAD-NEU-ANG-FEA-DIS-SUR][GLOBAL_STATS].p\", 'wb') as f:\n",
"    pickle.dump([features, labels], f)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}
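The label helper in the notebook reads the emotion code and the actor number from fixed character offsets (`audio_file[6:-16]` and `audio_file[18:-4]`). RAVDESS file names consist of seven two-digit fields separated by hyphens (modality, vocal channel, emotion, intensity, statement, repetition, actor), so the same labels can also be obtained by splitting on `-`. The sketch below is a minimal illustration of that idea; `parse_ravdess_label` and `LABELS` are hypothetical names that do not exist in the repository.

```python
# Hypothetical helper (not part of the notebook): same labels as
# set_label_ravdess, but parsed by field instead of by character offset.
LABELS = {'02': 'NEU', '03': 'HAP', '04': 'SAD', '05': 'ANG',
          '06': 'FEA', '07': 'DIS', '08': 'SUR'}


def parse_ravdess_label(file_name, gender_differentiation=True):
    # Drop the extension, then split the seven hyphen-separated fields
    stem = file_name.rsplit('.', 1)[0]
    fields = stem.split('-')
    if len(fields) != 7:
        return None
    # Third field is the emotion code; unknown codes are skipped
    label = LABELS.get(fields[2])
    if label is None:
        return None
    if gender_differentiation:
        # Last field is the actor number: even is female, odd is male
        prefix = 'f_' if int(fields[6]) % 2 == 0 else 'm_'
        label = prefix + label
    return label
```

Returning `None` for files that do not match lets the caller skip them with a single check, which plays the role of the `in label_dict_ravdess.keys()` test in the import loop.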
@@ -0,0 +1,104 @@
import os
import pickle

from AudioLibrary.AudioSignal import *
from AudioLibrary.AudioFeatures import *


class AudioEmotionRecognition:

    def __init__(self, model_path):

        # Load one pickled artifact from the model directory and close the file afterwards
        def load(name):
            with open(os.path.join(model_path, name), 'rb') as f:
                return pickle.load(f)

        # Load classifier
        self._clf = load('MODEL_CLF.p')
        # Load features parameters
        self._features_param = load('MODEL_PARAM.p')
        # Load feature scaler parameters (mean and std)
        self._features_mean, self._features_std = load('MODEL_SCALER.p')
        # Load PCA
        self._pca = load('MODEL_PCA.p')
        # Load label encoder
        self._encoder = load('MODEL_ENCODER.p')

    '''
    Function to scale audio features
    '''
    def scale_features(self, features):
        # Scaled features
        scaled_features = (features - self._features_mean) / self._features_std
        # Return scaled features
        return scaled_features
    '''
    Function to predict speech emotion from an audio signal
    '''
    def predict_emotion(self, audio_signal, predict_proba=False, decode=True):
        # Extract audio features
        audio_features = AudioFeatures(audio_signal, float(self._features_param.get("win_size")),
                                       float(self._features_param.get("win_step")))
        features, features_names = audio_features.global_feature_extraction(
            stats=self._features_param.get("stats"),
            features_list=self._features_param.get("features_list"),
            nb_mfcc=self._features_param.get("nb_mfcc"),
            diff=self._features_param.get("diff"))
        # Scale features and reshape to a single-sample 2-D array (1, n_features)
        features = self.scale_features(features).reshape(1, -1)
        # Apply feature dimension reduction
        if self._features_param.get("PCA") is True:
            features = self._pca.transform(features)
        # Make prediction
        if predict_proba is True:
            prediction = self._clf.predict_proba(features)
        else:
            prediction = self._clf.predict(features)
            # Decode label emotion (only meaningful for class predictions, not probabilities)
            if decode is True:
                prediction = self._encoder.inverse_transform(prediction.astype(int).flatten())
                # Remove gender prefix ('f_' or 'm_')
                prediction = prediction[0][2:]
        return prediction
    '''
    Function to predict speech emotion over time from an audio or video file
    '''
    def predict_emotion_from_file(self, filename, sample_rate, chunk_size=0, chunk_step=0, predict_proba=False,
                                  decode=True):
        # Initialize AudioSignal object
        audio_signal = AudioSignal(sample_rate, filename=filename)
        # Split audio signal into chunks
        if chunk_size > 0:
            chunks = audio_signal.framing(chunk_size, chunk_step)
            # Initialize time stamp
            timestamp = []
            # Emotion prediction for each chunk
            prediction = []
            for signal in chunks:
                if len(timestamp) == 0:
                    timestamp.append(chunk_size)
                else:
                    timestamp.append(timestamp[-1] + chunk_step)
                prediction.append(self.predict_emotion(signal, predict_proba=predict_proba, decode=decode))
            # Return emotion prediction and related timestamp
            return prediction, timestamp
        else:
            # Emotion prediction
            prediction = self.predict_emotion(audio_signal, predict_proba=predict_proba, decode=decode)
            # Return emotion prediction
            return prediction
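When `chunk_size > 0`, `predict_emotion_from_file` labels each chunk with the time at which it ends: the first timestamp is `chunk_size`, and each following one adds `chunk_step`. The sketch below reproduces that bookkeeping on its own, together with the frame count used by `AudioSignal.framing`. It is a simplified illustration that works in seconds rather than samples, and `chunk_timestamps` is a hypothetical name, not a function of the library.

```python
def chunk_timestamps(duration, chunk_size, chunk_step):
    """End time (in seconds) of each chunk of a signal lasting `duration` seconds."""
    # A signal shorter than one chunk yields no chunk in this sketch
    if duration < chunk_size:
        return []
    # Same counting rule as the framing step: 1 + floor((length - size) / step)
    nb_chunks = 1 + int((duration - chunk_size) / chunk_step)
    # First chunk ends at chunk_size, each later chunk ends chunk_step later
    return [chunk_size + i * chunk_step for i in range(nb_chunks)]
```

With a 10-second recording, 4-second chunks and a 2-second step, consecutive chunks overlap by 2 seconds and the last chunk ends exactly at the end of the recording.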
@@ -0,0 +1,347 @@
import numpy
from scipy.fftpack import dct
from scipy.stats import kurtosis, skew

from AudioLibrary.AudioSignal import *


class AudioFeatures:

    def __init__(self, audio_signal, win_size, win_step):
        # Audio Signal
        self._audio_signal = audio_signal
        # Short time features window size
        self._win_size = win_size
        # Short time features window step
        self._win_step = win_step
    '''
    Global statistics features extraction from an audio signal
    '''
    def global_feature_extraction(self, stats=('mean', 'std'), features_list=(), nb_mfcc=12, nb_filter=40, diff=0, hamming=True):
        # Extract short term audio features
        st_features, f_names = self.short_time_feature_extraction(features_list, nb_mfcc, nb_filter, hamming)
        # Number of short term features
        nb_feats = st_features.shape[1]
        # Number of statistics
        nb_stats = len(stats)
        # Global statistics feature names
        feature_names = ["" for x in range(nb_feats * nb_stats)]
        for i in range(nb_feats):
            for j in range(nb_stats):
                feature_names[i + j * nb_feats] = f_names[i] + "_d" + str(diff) + "_" + stats[j]
        # Calculate global statistics features
        features = numpy.zeros((nb_feats * nb_stats))
        for i in range(nb_feats):
            # Get features series
            feat = st_features[:, i]
            # Compute first or second order difference
            if diff > 0:
                feat = feat[diff:] - feat[:-diff]
            # Global statistics
            for j in range(nb_stats):
                features[i + j * nb_feats] = self.compute_statistic(feat, stats[j])
        return features, feature_names
    '''
    Short-time features extraction from an audio signal
    '''
    def short_time_feature_extraction(self, features=(), nb_mfcc=12, nb_filter=40, hamming=True):
        # Copy features list to compute
        features_list = list(features)
        # MFCCs features names
        mfcc_feature_names = []
        if 'mfcc' in features_list:
            mfcc_feature_names = ["mfcc_{0:d}".format(i) for i in range(1, nb_mfcc + 1)]
            features_list.remove('mfcc')
        # Filter banks features names
        fbank_features_names = []
        if 'filter_banks' in features_list:
            fbank_features_names = ["fbank_{0:d}".format(i) for i in range(1, nb_filter + 1)]
            features_list.remove('filter_banks')
        # All features names
        feature_names = features_list + mfcc_feature_names + fbank_features_names
        # Number of features
        nb_features = len(feature_names)
        # Framing signal
        frames = self._audio_signal.framing(self._win_size, self._win_step, hamming=hamming)
        # Number of frames
        nb_frames = len(frames)
        # Compute features on each frame
        features = numpy.zeros((nb_frames, nb_features))
        cur_pos = 0
        for el in frames:
            # Get signal of the frame
            signal = el._signal
            # Compute the normalized magnitude of the spectrum (Discrete Fourier Transform)
            dft = el.dft(norm=True)
            # Keep the first half of the spectrum
            dft = dft[:int((self._win_size * self._audio_signal._sample_rate) / 2)]
            if cur_pos == 0:
                dft_prev = dft
            # Compute features on frame
            for idx, f in enumerate(features_list):
                features[cur_pos, idx] = self.compute_st_features(f, signal, dft, dft_prev,
                                                                  self._audio_signal._sample_rate)
            # Compute MFCCs and Filter Banks
            if len(mfcc_feature_names) > 0:
                features[cur_pos, len(features_list):len(features_list) + len(mfcc_feature_names) + len(fbank_features_names)] = self.mfcc(
                    signal, self._audio_signal._sample_rate,
                    nb_coeff=nb_mfcc, nb_filt=nb_filter, return_fbank=len(fbank_features_names) > 0)
            # Compute Filter Banks
            elif len(fbank_features_names) > 0:
                features[cur_pos, len(features_list) + len(mfcc_feature_names):] = self.filter_banks_coeff(
                    signal, self._audio_signal._sample_rate, nb_filt=nb_filter)
            # Keep previous Discrete Fourier Transform coefficients
            dft_prev = dft
            cur_pos = cur_pos + 1
        return features, feature_names
    '''
    Computes zero crossing rate of a signal
    '''
    @staticmethod
    def zcr(signal):
        zcr = numpy.sum(numpy.abs(numpy.diff(numpy.sign(signal))))
        zcr = zcr / (2 * numpy.float64(len(signal) - 1.0))
        return zcr

    '''
    Computes signal energy of frame
    '''
    @staticmethod
    def energy(signal):
        energy = numpy.sum(signal ** 2) / numpy.float64(len(signal))
        return energy

    '''
    Computes entropy of energy
    '''
    @staticmethod
    def energy_entropy(signal, n_short_blocks=10, eps=10e-8):
        # Total frame energy
        energy = numpy.sum(signal ** 2)
        # Length of sub-frame
        sub_win_len = int(numpy.floor(len(signal) / n_short_blocks))
        if len(signal) != sub_win_len * n_short_blocks:
            signal = signal[0:sub_win_len * n_short_blocks]
        # Get sub windows
        sub_wins = signal.reshape(sub_win_len, n_short_blocks, order='F').copy()
        # Compute normalized sub-frame energies:
        sub_energies = numpy.sum(sub_wins ** 2, axis=0) / (energy + eps)
        # Compute entropy of the normalized sub-frame energies:
        entropy = -numpy.sum(sub_energies * numpy.log2(sub_energies + eps))
        return entropy
    '''
    Computes spectral centroid and spectral spread of frame
    '''
    @staticmethod
    def spectral_centroid_spread(fft, fs, eps=10e-8):
        # Frequency (in Hz) associated with each spectrum coefficient
        sr = (numpy.arange(1, len(fft) + 1)) * (fs / (2.0 * len(fft)))
        # Normalize fft coefficients by the max value
        norm_fft = fft / (fft.max() + eps)
        # Centroid:
        C = numpy.sum(sr * norm_fft) / (numpy.sum(norm_fft) + eps)
        # Spread:
        S = numpy.sqrt(numpy.sum(((sr - C) ** 2) * norm_fft) / (numpy.sum(norm_fft) + eps))
        # Normalize:
        C = C / (fs / 2.0)
        S = S / (fs / 2.0)
        return C, S

    '''
    Computes the spectral flux feature
    '''
    @staticmethod
    def spectral_flux(fft, fft_prev, eps=10e-8):
        # Sum of fft coefficients
        sum_fft = numpy.sum(fft + eps)
        # Sum of previous fft coefficients
        sum_fft_prev = numpy.sum(fft_prev + eps)
        # Compute the spectral flux as the sum of square distances
        flux = numpy.sum((fft / sum_fft - fft_prev / sum_fft_prev) ** 2)
        return flux

    '''
    Computes the spectral roll off
    '''
    @staticmethod
    def spectral_rolloff(fft, c=0.90, eps=10e-8):
        # Total energy
        energy = numpy.sum(fft ** 2)
        # Roll off threshold
        threshold = c * energy
        # Compute cumulative energy
        cum_energy = numpy.cumsum(fft ** 2) + eps
        # Find the spectral roll off as the first position where the cumulative energy exceeds the threshold
        [roll_off, ] = numpy.nonzero(cum_energy > threshold)
        # Normalize
        if len(roll_off) > 0:
            roll_off = numpy.float64(roll_off[0]) / (float(len(fft)))
        else:
            roll_off = 0.0
        return roll_off
    '''
    Computes the Filter Bank coefficients
    '''
    @staticmethod
    def filter_banks_coeff(signal, sample_rate, nb_filt=40, nb_fft=512):
        # Magnitude of the FFT
        mag_frames = numpy.absolute(numpy.fft.rfft(signal, nb_fft))
        # Power Spectrum
        pow_frames = ((1.0 / nb_fft) * (mag_frames ** 2))
        low_freq_mel = 0
        # Convert Hz to Mel
        high_freq_mel = (2595 * numpy.log10(1 + (sample_rate / 2) / 700))
        # Equally spaced in Mel scale
        mel_points = numpy.linspace(low_freq_mel, high_freq_mel, nb_filt + 2)
        # Convert Mel to Hz
        hz_points = (700 * (10 ** (mel_points / 2595) - 1))
        # FFT bin index of each filter edge
        bin = numpy.floor((nb_fft + 1) * hz_points / sample_rate)
        # Calculate filter banks
        fbank = numpy.zeros((nb_filt, int(numpy.floor(nb_fft / 2 + 1))))
        for m in range(1, nb_filt + 1):
            # left
            f_m_minus = int(bin[m - 1])
            # center
            f_m = int(bin[m])
            # right
            f_m_plus = int(bin[m + 1])
            for k in range(f_m_minus, f_m):
                fbank[m - 1, k] = (k - bin[m - 1]) / (bin[m] - bin[m - 1])
            for k in range(f_m, f_m_plus):
                fbank[m - 1, k] = (bin[m + 1] - k) / (bin[m + 1] - bin[m])
        filter_banks = numpy.dot(pow_frames, fbank.T)
        # Numerical Stability
        filter_banks = numpy.where(filter_banks == 0, numpy.finfo(float).eps, filter_banks)
        # dB
        filter_banks = 20 * numpy.log10(filter_banks)
        return filter_banks

    '''
    Computes the MFCCs
    '''
    def mfcc(self, signal, sample_rate, nb_coeff=12, nb_filt=40, nb_fft=512, return_fbank=False):
        # Apply filter bank on spectrogram
        filter_banks = self.filter_banks_coeff(signal, sample_rate, nb_filt=nb_filt, nb_fft=nb_fft)
        # Compute MFCC coefficients
        mfcc = dct(filter_banks, type=2, axis=-1, norm='ortho')[1: (nb_coeff + 1)]
        # Return MFCCs and Filter banks coefficients
        if return_fbank is True:
            return numpy.concatenate((mfcc, filter_banks))
        else:
            return mfcc
    '''
    Compute statistics on short time features
    '''
    @staticmethod
    def compute_statistic(seq, statistic):
        if statistic == 'mean':
            S = numpy.mean(seq)
        elif statistic == 'med':
            S = numpy.median(seq)
        elif statistic == 'std':
            S = numpy.std(seq)
        elif statistic == 'kurt':
            S = kurtosis(seq)
        elif statistic == 'skew':
            S = skew(seq)
        elif statistic == 'min':
            S = numpy.min(seq)
        elif statistic == 'max':
            S = numpy.max(seq)
        elif statistic == 'q1':
            S = numpy.percentile(seq, 1)
        elif statistic == 'q99':
            S = numpy.percentile(seq, 99)
        elif statistic == 'range':
            # Range between the 1st and the 99th percentiles
            S = numpy.abs(numpy.percentile(seq, 99) - numpy.percentile(seq, 1))
        else:
            raise ValueError("Unknown statistic: {}".format(statistic))
        return S

    '''
    Compute short time features on signal
    '''
    def compute_st_features(self, feature, signal, dft, dft_prev, sample_rate):
        if feature == 'zcr':
            F = self.zcr(signal)
        elif feature == 'energy':
            F = self.energy(signal)
        elif feature == 'energy_entropy':
            F = self.energy_entropy(signal)
        elif feature == 'spectral_centroid':
            [F, FF] = self.spectral_centroid_spread(dft, sample_rate)
        elif feature == 'spectral_spread':
            [FF, F] = self.spectral_centroid_spread(dft, sample_rate)
        elif feature == 'spectral_entropy':
            F = self.energy_entropy(dft)
        elif feature == 'spectral_flux':
            F = self.spectral_flux(dft, dft_prev)
        elif feature == 'sprectral_rolloff':
            F = self.spectral_rolloff(dft)
        else:
            raise ValueError("Unknown feature: {}".format(feature))
        return F
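The filter bank computation in `filter_banks_coeff` places `nb_filt + 2` points at equal distances on the mel scale, converts them back to Hz, and maps each one to an FFT bin. Those bins are the left edge, center and right edge of the triangular filters. The sketch below isolates that step using only the standard library; the function names are hypothetical and the formulas are the ones used in the class (`2595 * log10(1 + f / 700)` and its inverse).

```python
import math


def hz_to_mel(freq_hz):
    # Hz to mel, same formula as in filter_banks_coeff
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def filter_bank_bins(sample_rate, nb_filt=40, nb_fft=512):
    """FFT bin indices of the nb_filt + 2 filter edges, from 0 Hz to sample_rate / 2."""
    high_mel = hz_to_mel(sample_rate / 2.0)
    # nb_filt + 2 points equally spaced on the mel scale
    step = high_mel / (nb_filt + 1)
    mel_points = [i * step for i in range(nb_filt + 2)]
    # Convert each point back to Hz, then to an FFT bin index
    return [int(math.floor((nb_fft + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
```

Filter `m` (counting from 1) rises from `bins[m - 1]` to `bins[m]` and falls back to zero at `bins[m + 1]`. Because the points are equally spaced in mel, the filters are narrow at low frequencies and wider at high frequencies.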
@@ -0,0 +1,161 @@
import os
import subprocess

import numpy
from pydub import AudioSegment
from scipy.fftpack import fft


class AudioSignal(object):

    def __init__(self, sample_rate, signal=None, filename=None):
        # Set sample rate
        self._sample_rate = sample_rate
        if signal is None:
            # Get file name and file extension
            file, file_extension = os.path.splitext(filename)
            # Check if file extension is an audio format
            if file_extension in ['.mp3', '.wav']:
                # Read audio file
                self._signal = self.read_audio_file(filename)
            # Check if file extension is a video format
            elif file_extension in ['.mp4', '.mkv', '.avi']:
                # Extract audio from video
                new_filename = self.extract_audio_from_video(filename)
                # Read audio file from extracted audio file
                self._signal = self.read_audio_file(new_filename)
            # Case file extension is not supported
            else:
                print("Error: file not found or file extension not supported.")
        elif filename is None:
            # Use the provided signal as is
            self._signal = signal
        else:
            print("Error : argument missing in AudioSignal() constructor.")

    '''
    Function to extract audio from a video
    '''
    def extract_audio_from_video(self, filename):
        # Get video file name and extension
        file, file_extension = os.path.splitext(filename)
        # Extract audio (.wav) from video; an argument list avoids shell quoting issues with file names
        subprocess.run(['ffmpeg', '-i', filename, '-ar', str(self._sample_rate), file + '.wav'], check=True)
        print("Successfully converted {} into audio!".format(filename))
        # Return audio file name created
        return file + '.wav'
    '''
    Function to read an audio file and return its audio samples as a mono signal
    '''
    def read_audio_file(self, filename):
        # Get audio signal
        audio_file = AudioSegment.from_file(filename)
        # Resample audio signal
        audio_file = audio_file.set_frame_rate(self._sample_rate)
        # Cast to integer
        if audio_file.sample_width == 2:
            data = numpy.frombuffer(audio_file._data, numpy.int16)
        elif audio_file.sample_width == 4:
            data = numpy.frombuffer(audio_file._data, numpy.int32)
        else:
            raise ValueError("Unsupported sample width: {} bytes".format(audio_file.sample_width))
        # Split interleaved samples into one column per channel
        audio_signal = []
        for chn in list(range(audio_file.channels)):
            audio_signal.append(data[chn::audio_file.channels])
        audio_signal = numpy.array(audio_signal).T
        # Flat signals
        if audio_signal.ndim == 2:
            if audio_signal.shape[1] == 1:
                audio_signal = audio_signal.flatten()
        # Convert stereo to mono
        audio_signal = self.stereo_to_mono(audio_signal)
        # Return audio signal
        return audio_signal
    '''
    Function to convert an input signal from stereo to mono
    '''
    @staticmethod
    def stereo_to_mono(audio_signal):
        # Check if signal is stereo and convert to mono
        if isinstance(audio_signal, int):
            return -1
        if audio_signal.ndim == 1:
            return audio_signal
        elif audio_signal.ndim == 2:
            if audio_signal.shape[1] == 1:
                return audio_signal.flatten()
            else:
                if audio_signal.shape[1] == 2:
                    return (audio_signal[:, 1] / 2) + (audio_signal[:, 0] / 2)
                else:
                    return -1

    '''
    Function to split the input signal into windows of same size
    '''
    def framing(self, size, step, hamming=False):
        # Rescale windows step and size
        win_size = int(size * self._sample_rate)
        win_step = int(step * self._sample_rate)
        # Number of frames
        nb_frames = 1 + int((len(self._signal) - win_size) / win_step)
        # Build Hamming function
        if hamming is True:
            ham = numpy.hamming(win_size)
        else:
            ham = numpy.ones(win_size)
        # Split signals (and multiply each windows signals by Hamming functions)
        frames = []
        for t in range(nb_frames):
            sub_signal = AudioSignal(self._sample_rate, signal=self._signal[(t * win_step): (t * win_step + win_size)] * ham)
            frames.append(sub_signal)
        return frames
    '''
    Function to compute the magnitude of the Discrete Fourier Transform coefficients
    '''
    def dft(self, norm=False):
        # Compute the magnitude of the spectrum (and normalize by the number of samples)
        if norm is True:
            dft = abs(fft(self._signal)) / len(self._signal)
        else:
            dft = abs(fft(self._signal))
        return dft

    '''
    Function to apply pre-emphasis filter on signal
    '''
    def pre_emphasis(self, alpha=0.97):
        # Emphasized signal
        emphasized_signal = numpy.append(self._signal[0], self._signal[1:] - alpha * self._signal[:-1])
        return emphasized_signal
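`AudioSignal.framing` converts the window size and step from seconds to samples, counts the frames as `1 + (length - win_size) / win_step` (rounded down), and optionally multiplies each frame by a Hamming window. The sketch below mirrors those steps on plain Python lists so that they can be checked without audio files. The names `hamming` and `frame_signal` are hypothetical, the window uses the usual definition `0.54 - 0.46 * cos(2 * pi * n / (M - 1))`, and, unlike the original, a signal shorter than one window returns an empty list.

```python
import math


def hamming(size):
    # Hamming window of `size` points: 0.54 - 0.46 * cos(2 * pi * n / (size - 1))
    if size == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (size - 1))
            for n in range(size)]


def frame_signal(samples, sample_rate, size, step, use_hamming=False):
    """Split `samples` into overlapping frames; `size` and `step` are in seconds."""
    # Rescale window size and step from seconds to samples
    win_size = int(size * sample_rate)
    win_step = int(step * sample_rate)
    # Assumption of this sketch: too-short signals produce no frame
    if len(samples) < win_size:
        return []
    nb_frames = 1 + (len(samples) - win_size) // win_step
    window = hamming(win_size) if use_hamming else [1.0] * win_size
    frames = []
    for t in range(nb_frames):
        chunk = samples[t * win_step: t * win_step + win_size]
        # Weight each sample by the window
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames
```

With the notebook's settings (25 ms windows, 10 ms step), consecutive frames share 15 ms of signal, and the Hamming weighting tapers both ends of every frame before the spectrum is computed.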
@@ -0,0 +1,233 @@
## Basics ##
import time
import os
import numpy as np

## Audio Preprocessing ##
import pyaudio
import wave
import librosa
from scipy.stats import zscore

## Time Distributed CNN ##
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout, Activation, TimeDistributed
from tensorflow.keras.layers import Conv2D, MaxPooling2D, BatchNormalization, Flatten
from tensorflow.keras.layers import LSTM


'''
Speech Emotion Recognition
'''
class speechEmotionRecognition:

    '''
    Constructor: builds the model and loads its weights when a weights path is given
    '''
    def __init__(self, subdir_model=None):
        # Load prediction model
        if subdir_model is not None:
            self._model = self.build_model()
            self._model.load_weights(subdir_model)
        # Emotion encoding
        self._emotion = {0:'Angry', 1:'Disgust', 2:'Fear', 3:'Happy', 4:'Neutral', 5:'Sad', 6:'Surprise'}
    '''
    Voice recording function
    '''
    def voice_recording(self, filename, duration=5, sample_rate=16000, chunk=1024, channels=1):
        # Start the audio recording stream
        p = pyaudio.PyAudio()
        stream = p.open(format=pyaudio.paInt16,
                        channels=channels,
                        rate=sample_rate,
                        input=True,
                        frames_per_buffer=chunk)
        # Create an empty list to store audio recording
        frames = []
        # Determine the timestamp of the start of the response interval
        print('* Start Recording *')
        stream.start_stream()
        start_time = time.time()
        current_time = time.time()
        # Record audio until timeout
        while (current_time - start_time) < duration:
            # Record audio data
            data = stream.read(chunk)
            # Add the data to a buffer (a list of chunks)
            frames.append(data)
            # Get new timestamp
            current_time = time.time()
        # Close the audio recording stream
        stream.stop_stream()
        stream.close()
        p.terminate()
        print('* End Recording * ')
        # Export audio recording to wav format
        wf = wave.open(filename, 'w')
        wf.setnchannels(channels)
        wf.setsampwidth(p.get_sample_size(pyaudio.paInt16))
        wf.setframerate(sample_rate)
        wf.writeframes(b''.join(frames))
        wf.close()
    '''
    Mel-spectrogram computation
    '''
    def mel_spectrogram(self, y, sr=16000, n_fft=512, win_length=256, hop_length=128, window='hamming', n_mels=128, fmax=4000):
        # Compute spectrogram
        mel_spect = np.abs(librosa.stft(y, n_fft=n_fft, window=window, win_length=win_length, hop_length=hop_length)) ** 2
        # Compute mel spectrogram
        mel_spect = librosa.feature.melspectrogram(S=mel_spect, sr=sr, n_mels=n_mels, fmax=fmax)
        # Compute log-mel spectrogram
        mel_spect = librosa.power_to_db(mel_spect, ref=np.max)
        return np.asarray(mel_spect)

    '''
    Audio framing
    '''
    def frame(self, y, win_step=64, win_size=128):
        # Number of frames
        nb_frames = 1 + int((y.shape[2] - win_size) / win_step)
        # Framing
        frames = np.zeros((y.shape[0], nb_frames, y.shape[1], win_size)).astype(np.float16)
        for t in range(nb_frames):
            frames[:,t,:,:] = np.copy(y[:,:,(t * win_step):(t * win_step + win_size)]).astype(np.float16)
        return frames
    '''
    Time distributed Convolutional Neural Network model
    '''
    def build_model(self):
        # Clear Keras session
        K.clear_session()
        # Define input
        input_y = Input(shape=(5, 128, 128, 1), name='Input_MELSPECT')
        # First LFLB (local feature learning block)
        y = TimeDistributed(Conv2D(64, kernel_size=(3, 3), strides=(1, 1), padding='same'), name='Conv_1_MELSPECT')(input_y)
        y = TimeDistributed(BatchNormalization(), name='BatchNorm_1_MELSPECT')(y)
        y = TimeDistributed(Activation('elu'), name='Activ_1_MELSPECT')(y)
        y = TimeDistributed(MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='same'), name='MaxPool_1_MELSPECT')(y)
        y = TimeDistributed(Dropout(0.2), name='Drop_1_MELSPECT')(y)
        # Second LFLB (local feature learning block)
        y = TimeDistributed(Conv2D(64, kernel_size=(3, 3), strides=(1, 1), padding='same'), name='Conv_2_MELSPECT')(y)
        y = TimeDistributed(BatchNormalization(), name='BatchNorm_2_MELSPECT')(y)
        y = TimeDistributed(Activation('elu'), name='Activ_2_MELSPECT')(y)
        y = TimeDistributed(MaxPooling2D(pool_size=(4, 4), strides=(4, 4), padding='same'), name='MaxPool_2_MELSPECT')(y)
        y = TimeDistributed(Dropout(0.2), name='Drop_2_MELSPECT')(y)
        # Third LFLB (local feature learning block)
        y = TimeDistributed(Conv2D(128, kernel_size=(3, 3), strides=(1, 1), padding='same'), name='Conv_3_MELSPECT')(y)
        y = TimeDistributed(BatchNormalization(), name='BatchNorm_3_MELSPECT')(y)
        y = TimeDistributed(Activation('elu'), name='Activ_3_MELSPECT')(y)
        y = TimeDistributed(MaxPooling2D(pool_size=(4, 4), strides=(4, 4), padding='same'), name='MaxPool_3_MELSPECT')(y)
        y = TimeDistributed(Dropout(0.2), name='Drop_3_MELSPECT')(y)
        # Fourth LFLB (local feature learning block)
        y = TimeDistributed(Conv2D(128, kernel_size=(3, 3), strides=(1, 1), padding='same'), name='Conv_4_MELSPECT')(y)
        y = TimeDistributed(BatchNormalization(), name='BatchNorm_4_MELSPECT')(y)
        y = TimeDistributed(Activation('elu'), name='Activ_4_MELSPECT')(y)
        y = TimeDistributed(MaxPooling2D(pool_size=(4, 4), strides=(4, 4), padding='same'), name='MaxPool_4_MELSPECT')(y)
        y = TimeDistributed(Dropout(0.2), name='Drop_4_MELSPECT')(y)
        # Flat
        y = TimeDistributed(Flatten(), name='Flat_MELSPECT')(y)
        # LSTM layer
        y = LSTM(256, return_sequences=False, dropout=0.2, name='LSTM_1')(y)
        # Fully connected
        y = Dense(7, activation='softmax', name='FC')(y)
        # Build final model
        model = Model(inputs=input_y, outputs=y)
        return model
'''
Predict speech emotion over time from an audio file
'''
def predict_emotion_from_file(self, filename, chunk_step=16000, chunk_size=49100, predict_proba=False, sample_rate=16000):
# Read audio file
y, sr = librosa.core.load(filename, sr=sample_rate, offset=0.5)
# Split audio signals into chunks
chunks = self.frame(y.reshape(1, 1, -1), chunk_step, chunk_size)
# Reshape chunks
chunks = chunks.reshape(chunks.shape[1],chunks.shape[-1])
# Z-normalization
y = np.asarray(list(map(zscore, chunks)))
# Compute mel spectrogram
mel_spect = np.asarray(list(map(self.mel_spectrogram, y)))
# Time distributed Framing
mel_spect_ts = self.frame(mel_spect)
# Build X for time distributed CNN
X = mel_spect_ts.reshape(mel_spect_ts.shape[0],
mel_spect_ts.shape[1],
mel_spect_ts.shape[2],
mel_spect_ts.shape[3],
1)
# Predict emotion
if predict_proba is True:
predict = self._model.predict(X)
else:
predict = np.argmax(self._model.predict(X), axis=1)
predict = [self._emotion.get(emotion) for emotion in predict]
# Clear Keras session
K.clear_session()
# Predict timestamp
timestamp = np.concatenate([[chunk_size], np.ones((len(predict) - 1)) * chunk_step]).cumsum()
timestamp = np.round(timestamp / sample_rate)
return [predict, timestamp]
'''
Export emotions predicted to csv format
'''
def prediction_to_csv(self, predictions, filename, mode='w'):
# Write emotion in filename
with open(filename, mode) as f:
if mode == 'w':
f.write("EMOTIONS"+'\n')
for emotion in predictions:
f.write(str(emotion)+'\n')
@@ -0,0 +1,104 @@
import pickle
from AudioLibrary.AudioSignal import *
from AudioLibrary.AudioFeatures import *
class AudioEmotionRecognition:
def __init__(self, model_path):
# Load classifier
self._clf = pickle.load(open(os.path.join(model_path, 'MODEL_CLF.p'), 'rb'))
# Load features parameters
self._features_param = pickle.load(open(os.path.join(model_path, 'MODEL_PARAM.p'), 'rb'))
# Load feature scaler parameters (mean and std)
self._features_mean, self._features_std = pickle.load(open(os.path.join(model_path, 'MODEL_SCALER.p'), 'rb'))
# Load PCA
self._pca = pickle.load(open(os.path.join(model_path, 'MODEL_PCA.p'), 'rb'))
# Load label encoder
self._encoder = pickle.load(open(os.path.join(model_path, 'MODEL_ENCODER.p'), 'rb'))
'''
Function to scale audio features
'''
def scale_features(self, features):
# Scaled features
scaled_features = (features - self._features_mean) / self._features_std
# Return scaled features
return scaled_features
'''
Function to predict speech emotion from an audio signal
'''
def predict_emotion(self, audio_signal, predict_proba=False, decode=True):
# Extract audio features
audio_features = AudioFeatures(audio_signal, float(self._features_param.get("win_size")),
float(self._features_param.get("win_step")))
features, features_names = audio_features.global_feature_extraction(stats=self._features_param.get("stats"),
features_list=self._features_param.get(
"features_list"),
nb_mfcc=self._features_param.get("nb_mfcc"),
diff=self._features_param.get("diff"))
# Scale features
features = self.scale_features(features)
# Apply feature dimension reduction
if self._features_param.get("PCA") is True:
features = self._pca.transform(features)
# Make prediction
if predict_proba is True:
prediction = self._clf.predict_proba(features.reshape(1, -1))
else:
prediction = self._clf.predict(features.reshape(1, -1))
# Decode label emotion
if decode is True:
prediction = (self._encoder.inverse_transform((prediction.astype(int).flatten())))
# Remove gender recognition
prediction = prediction[0][2:]
return prediction
'''
Function to predict speech emotion over time from an audio or video file
'''
def predict_emotion_from_file(self, filename, sample_rate, chunk_size=0, chunk_step=0, predict_proba=False,
decode=True):
# Initialize Audio Basic object
audio_signal = AudioSignal(sample_rate, filename=filename)
# Split audio signals into chunks
if chunk_size > 0:
chunks = audio_signal.framing(chunk_size, chunk_step)
# Initialize time stamp
timestamp = []
# Emotion prediction for each chunk
prediction = []
for signal in chunks:
if len(timestamp) == 0:
timestamp.append(chunk_size)
else:
timestamp.append(timestamp[-1] + chunk_step)
prediction.append(self.predict_emotion(signal, predict_proba=predict_proba, decode=decode))
# Return emotion prediction and related timestamp
return prediction, timestamp
else:
# Emotion prediction
prediction = self.predict_emotion(audio_signal, predict_proba=predict_proba, decode=decode)
# Return emotion prediction
return prediction
@@ -0,0 +1,347 @@
import numpy
from scipy.fftpack.realtransforms import dct
from scipy.stats import kurtosis, skew
from AudioLibrary.AudioSignal import *
class AudioFeatures:
def __init__(self, audio_signal, win_size, win_step):
# Audio Signal
self._audio_signal = audio_signal
# Short time features window size
self._win_size = win_size
# Short time features window step
self._win_step = win_step
'''
Global statistics feature extraction from an audio signal
'''
def global_feature_extraction(self, stats=['mean', 'std'], features_list=[], nb_mfcc=12, nb_filter=40, diff=0, hamming=True):
# Extract short term audio features
st_features, f_names = self.short_time_feature_extraction(features_list, nb_mfcc, nb_filter, hamming)
# Number of short term features
nb_feats = st_features.shape[1]
# Number of statistics
nb_stats = len(stats)
# Global statistics feature names
feature_names = ["" for x in range(nb_feats * nb_stats)]
for i in range(nb_feats):
for j in range(nb_stats):
feature_names[i + j * nb_feats] = f_names[i] + "_d" + str(diff) + "_" + stats[j]
# Calculate global statistics features
features = numpy.zeros((nb_feats * nb_stats))
for i in range(nb_feats):
# Get features series
feat = st_features[:, i]
# Compute first or second order difference
if diff > 0:
feat = feat[diff:] - feat[:-diff]
# Global statistics
for j in range(nb_stats):
features[i + j * nb_feats] = self.compute_statistic(feat, stats[j])
return features, feature_names
'''
Short-time feature extraction from an audio signal
'''
def short_time_feature_extraction(self, features=[], nb_mfcc=12, nb_filter=40, hamming=True):
# Copy features list to compute
features_list = list(features)
# MFCC feature names
mfcc_feature_names = []
if 'mfcc' in features_list:
mfcc_feature_names = ["mfcc_{0:d}".format(i) for i in range(1, nb_mfcc + 1)]
features_list.remove('mfcc')
# Filter banks features names
fbank_features_names = []
if 'filter_banks' in features_list:
fbank_features_names = ["fbank_{0:d}".format(i) for i in range(1, nb_filter + 1)]
features_list.remove('filter_banks')
# All Features names
feature_names = features_list + mfcc_feature_names + fbank_features_names
# Number of features
nb_features = len(feature_names)
# Framing signal
frames = self._audio_signal.framing(self._win_size, self._win_step, hamming=hamming)
# Number of frame
nb_frames = len(frames)
# Compute features on each frame
features = numpy.zeros((nb_frames, nb_features))
cur_pos = 0
for el in frames:
# Get signal of the frame
signal = el._signal
# Compute the normalized magnitude of the spectrum (Discrete Fourier Transform)
dft = el.dft(norm=True)
# Return the first half of the spectrum
dft = dft[:int((self._win_size * self._audio_signal._sample_rate) / 2)]
if cur_pos == 0:
dft_prev = dft
# Compute features on frame
for idx, f in enumerate(features_list):
features[cur_pos, idx] = self.compute_st_features(f, signal, dft, dft_prev,
self._audio_signal._sample_rate)
# Compute MFCCs and Filter Banks
if len(mfcc_feature_names) > 0:
features[cur_pos, len(features_list):len(features_list) + len(mfcc_feature_names) + len(fbank_features_names)] = self.mfcc(signal, self._audio_signal._sample_rate,
nb_coeff=nb_mfcc, nb_filt=nb_filter, return_fbank=len(fbank_features_names) > 0)
# Compute Filter Banks
elif len(fbank_features_names) > 0:
features[cur_pos, len(features_list) + len(mfcc_feature_names):] = self.filter_banks_coeff(signal, self._audio_signal._sample_rate, nb_filt=nb_filter)
# Keep previous Discrete Fourier Transform coefficients
dft_prev = dft
cur_pos = cur_pos + 1
return features, feature_names
'''
Computes zero crossing rate of a signal
'''
@staticmethod
def zcr(signal):
zcr = numpy.sum(numpy.abs(numpy.diff(numpy.sign(signal))))
zcr = zcr / (2 * numpy.float64(len(signal) - 1.0))
return zcr
'''
Computes signal energy of frame
'''
@staticmethod
def energy(signal):
energy = numpy.sum(signal ** 2) / numpy.float64(len(signal))
return energy
'''
Computes entropy of energy
'''
@staticmethod
def energy_entropy(signal, n_short_blocks=10, eps=10e-8):
# Total frame energy
energy = numpy.sum(signal ** 2)
sub_win_len = int(numpy.floor(len(signal) / n_short_blocks))
# Length of sub-frame
if len(signal) != sub_win_len * n_short_blocks:
signal = signal[0:sub_win_len * n_short_blocks]
# Get sub windows
sub_wins = signal.reshape(sub_win_len, n_short_blocks, order='F').copy()
# Compute normalized sub-frame energies:
sub_energies = numpy.sum(sub_wins ** 2, axis=0) / (energy + eps)
# Compute entropy of the normalized sub-frame energies:
entropy = -numpy.sum(sub_energies * numpy.log2(sub_energies + eps))
return entropy
'''
Computes spectral centroid and spread of frame
'''
@staticmethod
def spectral_centroid_spread(fft, fs, eps=10e-8):
# Frequency (in Hz) associated with each DFT bin
sr = (numpy.arange(1, len(fft) + 1)) * (fs / (2.0 * len(fft)))
# Normalize fft coefficients by the max value
norm_fft = fft / (fft.max() + eps)
# Centroid:
C = numpy.sum(sr * norm_fft) / (numpy.sum(norm_fft) + eps)
# Spread:
S = numpy.sqrt(numpy.sum(((sr - C) ** 2) * norm_fft) / (numpy.sum(norm_fft) + eps))
# Normalize:
C = C / (fs / 2.0)
S = S / (fs / 2.0)
return C, S
'''
Computes the spectral flux feature
'''
@staticmethod
def spectral_flux(fft, fft_prev, eps=10e-8):
# Sum of fft coefficients
sum_fft = numpy.sum(fft + eps)
# Sum of previous fft coefficients
sum_fft_prev = numpy.sum(fft_prev + eps)
# Compute the spectral flux as the sum of square distances
flux = numpy.sum((fft / sum_fft - fft_prev / sum_fft_prev) ** 2)
return flux
'''
Computes the spectral roll off
'''
@staticmethod
def spectral_rolloff(fft, c=0.90, eps=10e-8):
# Total energy
energy = numpy.sum(fft ** 2)
# Roll off threshold
threshold = c * energy
# Compute cumulative energy
cum_energy = numpy.cumsum(fft ** 2) + eps
# Find the spectral roll off as the frequency position
[roll_off, ] = numpy.nonzero(cum_energy > threshold)
# Normalize
if len(roll_off) > 0:
roll_off = numpy.float64(roll_off[0]) / (float(len(fft)))
else:
roll_off = 0.0
return roll_off
'''
Computes the Filter Bank coefficients
'''
@staticmethod
def filter_banks_coeff(signal, sample_rate, nb_filt=40, nb_fft=512):
# Magnitude of the FFT
mag_frames = numpy.absolute(numpy.fft.rfft(signal, nb_fft))
# Power Spectrum
pow_frames = ((1.0 / nb_fft) * (mag_frames ** 2))
low_freq_mel = 0
# Convert Hz to Mel
high_freq_mel = (2595 * numpy.log10(1 + (sample_rate / 2) / 700))
# Equally spaced in Mel scale
mel_points = numpy.linspace(low_freq_mel, high_freq_mel, nb_filt + 2)
# Convert Mel to Hz
hz_points = (700 * (10 ** (mel_points / 2595) - 1))
bin = numpy.floor((nb_fft + 1) * hz_points / sample_rate)
# Calculate filter banks
fbank = numpy.zeros((nb_filt, int(numpy.floor(nb_fft / 2 + 1))))
for m in range(1, nb_filt + 1):
# left
f_m_minus = int(bin[m - 1])
# center
f_m = int(bin[m])
# right
f_m_plus = int(bin[m + 1])
for k in range(f_m_minus, f_m):
fbank[m - 1, k] = (k - bin[m - 1]) / (bin[m] - bin[m - 1])
for k in range(f_m, f_m_plus):
fbank[m - 1, k] = (bin[m + 1] - k) / (bin[m + 1] - bin[m])
filter_banks = numpy.dot(pow_frames, fbank.T)
# Numerical Stability
filter_banks = numpy.where(filter_banks == 0, numpy.finfo(float).eps, filter_banks)
# dB
filter_banks = 20 * numpy.log10(filter_banks)
return filter_banks
'''
Computes the MFCCs
'''
def mfcc(self, signal, sample_rate, nb_coeff=12, nb_filt=40, nb_fft=512, return_fbank=False):
# Apply filter bank on spectrogram
filter_banks = self.filter_banks_coeff(signal, sample_rate, nb_filt=nb_filt, nb_fft=nb_fft)
# Compute MFCC coefficients
mfcc = dct(filter_banks, type=2, axis=-1, norm='ortho')[1: (nb_coeff + 1)]
# Return MFCCs and filter bank coefficients
if return_fbank is True:
return numpy.concatenate((mfcc, filter_banks))
else:
return mfcc
'''
Compute statistics on short time features
'''
@staticmethod
def compute_statistic(seq, statistic):
if statistic == 'mean':
S = numpy.mean(seq)
elif statistic == 'med':
S = numpy.median(seq)
elif statistic == 'std':
S = numpy.std(seq)
elif statistic == 'kurt':
S = kurtosis(seq)
elif statistic == 'skew':
S = skew(seq)
elif statistic == 'min':
S = numpy.min(seq)
elif statistic == 'max':
S = numpy.max(seq)
elif statistic == 'q1':
S = numpy.percentile(seq, 1)
elif statistic == 'q99':
S = numpy.percentile(seq, 99)
elif statistic == 'range':
S = numpy.abs(numpy.percentile(seq, 99) - numpy.percentile(seq, 1))
return S
'''
Compute short time features on signal
'''
def compute_st_features(self, feature, signal, dft, dft_prev, sample_rate):
if feature == 'zcr':
F = self.zcr(signal)
elif feature == 'energy':
F = self.energy(signal)
elif feature == 'energy_entropy':
F = self.energy_entropy(signal)
elif feature == 'spectral_centroid':
[F, FF] = self.spectral_centroid_spread(dft, sample_rate)
elif feature == 'spectral_spread':
[FF, F] = self.spectral_centroid_spread(dft, sample_rate)
elif feature == 'spectral_entropy':
F = self.energy_entropy(dft)
elif feature == 'spectral_flux':
F = self.spectral_flux(dft, dft_prev)
# Accept the correct spelling as well as the original misspelled key
elif feature in ('spectral_rolloff', 'sprectral_rolloff'):
F = self.spectral_rolloff(dft)
return F
@@ -0,0 +1,161 @@
import os
import numpy
from pydub import AudioSegment
from scipy.fftpack import fft
class AudioSignal(object):
def __init__(self, sample_rate, signal=None, filename=None):
# Set sample rate
self._sample_rate = sample_rate
if signal is None:
# Get file name and file extension
file, file_extension = os.path.splitext(filename)
# Check if the file extension is an audio format
if file_extension in ['.mp3', '.wav']:
# Read audio file
self._signal = self.read_audio_file(filename)
# Check if the file extension is a video format
elif file_extension in ['.mp4', '.mkv', '.avi']:
# Extract audio from video
new_filename = self.extract_audio_from_video(filename)
# read audio file from extracted audio file
self._signal = self.read_audio_file(new_filename)
# Case file extension is not supported
else:
print("Error: file not found or file extension not supported.")
elif filename is None:
# Cast signal to array
self._signal = signal
else:
print("Error : argument missing in AudioSignal() constructor.")
'''
Function to extract audio from a video
'''
def extract_audio_from_video(self, filename):
# Get video file name and extension
file, file_extension = os.path.splitext(filename)
# Extract audio (.wav) from video
os.system('ffmpeg -i ' + file + file_extension + ' ' + '-ar ' + str(self._sample_rate) + ' ' + file + '.wav')
print("Successfully converted {} into audio!".format(filename))
# Return audio file name created
return file + '.wav'
'''
Function to read audio file and to return audio samples of a specified WAV file
'''
def read_audio_file(self, filename):
# Get audio signal
audio_file = AudioSegment.from_file(filename)
# Resample audio signal
audio_file = audio_file.set_frame_rate(self._sample_rate)
# Cast to integer
if audio_file.sample_width == 2:
data = numpy.frombuffer(audio_file._data, numpy.int16)
elif audio_file.sample_width == 4:
data = numpy.frombuffer(audio_file._data, numpy.int32)
# Merge audio channels
audio_signal = []
for chn in list(range(audio_file.channels)):
audio_signal.append(data[chn::audio_file.channels])
audio_signal = numpy.array(audio_signal).T
# Flat signals
if audio_signal.ndim == 2:
if audio_signal.shape[1] == 1:
audio_signal = audio_signal.flatten()
# Convert stereo to mono
audio_signal = self.stereo_to_mono(audio_signal)
# Return audio signal
return audio_signal
'''
Function to convert an input signal from stereo to mono
'''
@staticmethod
def stereo_to_mono(audio_signal):
# Check if signal is stereo and convert to mono
if isinstance(audio_signal, int):
return -1
if audio_signal.ndim == 1:
return audio_signal
elif audio_signal.ndim == 2:
if audio_signal.shape[1] == 1:
return audio_signal.flatten()
else:
if audio_signal.shape[1] == 2:
return (audio_signal[:, 1] / 2) + (audio_signal[:, 0] / 2)
else:
return -1
'''
Function to split the input signal into windows of same size
'''
def framing(self, size, step, hamming=False):
# Rescale windows step and size
win_size = int(size * self._sample_rate)
win_step = int(step * self._sample_rate)
# Number of frames
nb_frames = 1 + int((len(self._signal) - win_size) / win_step)
# Build Hamming function
if hamming is True:
ham = numpy.hamming(win_size)
else:
ham = numpy.ones(win_size)
# Split signals (and multiply each windows signals by Hamming functions)
frames = []
for t in range(nb_frames):
sub_signal = AudioSignal(self._sample_rate, signal=self._signal[(t * win_step): (t * win_step + win_size)] * ham)
frames.append(sub_signal)
return frames
'''
Function to compute the magnitude of the Discrete Fourier Transform coefficient
'''
def dft(self, norm=False):
# Compute the magnitude of the spectrum (and normalize by the number of samples)
if norm is True:
dft = abs(fft(self._signal)) / len(self._signal)
else:
dft = abs(fft(self._signal))
return dft
'''
Function to apply pre-emphasis filter on signal
'''
def pre_emphasis(self, alpha =0.97):
# Emphasized signal
emphasized_signal = numpy.append(self._signal[0], self._signal[1:] - alpha * self._signal[:-1])
return emphasized_signal
@@ -0,0 +1,92 @@
# Speech Emotion Recognition
![image](audio_app.png)
The aim of this section is to explore speech emotion recognition techniques from an audio recording.
## Data
The data set used for training is the **Ryerson Audio-Visual Database of Emotional Speech and Song**: https://zenodo.org/record/1188976#.XA48aC17Q1J
**RAVDESS** contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.
![image](Images/RAVDESS.png)
| Data | Processed Data for training | Processed Data for testing | Pre-trained TimeDistributed CNNs model | Colab Notebook |
|:----:|:---------------------------:|:--------------------------:|:--------------------------------------:|:--------------:|
| [RAVDESS](https://drive.google.com/file/d/1OL2Kx9dPdeZWoue6ofHcUNs5jwpfh4Fc/view?usp=sharing) | [X-train](https://drive.google.com/file/d/1qv-y0FhaRy5Np8DF3a8Xty8xLvvv4QH4/view?usp=sharing) [y-train](https://drive.google.com/file/d/1y5j43I09Xe6RHK8BsHP8_ZNkUuTehhgY/view?usp=sharing) | [X-test](https://drive.google.com/file/d/1MN1Fxc_sDR1ZDQmPdFMwlnhP4qn9d8bT/view?usp=sharing) [y-test](https://drive.google.com/file/d/1ovvCXumkEP1oLxErgMgyIg1Z1Eih430W/view?usp=sharing)| [Weights](https://drive.google.com/file/d/1pQ5QahXJ3dPDXhyPkQ7rS1fOHWKHcIdX/view?usp=sharing) [Model](https://drive.google.com/file/d/1TuKN2PbFvoClaobL3aOW1KmA0e2eEc-O/view?usp=sharing) | [Colab Notebook](https://colab.research.google.com/drive/1EY8m7uj3BzU-OsjAPGBqoapw1OSUHhum)|
## Requirements
```
Python : 3.6.5
Scipy : 1.1.0
Scikit-learn : 0.20.1
Tensorflow : 1.12.0
Keras : 2.2.4
Numpy : 1.15.4
Librosa : 0.6.3
Pyaudio : 0.2.11
Ffmpeg : 4.0.2
```
## Files
This repo contains the following folders:
- `Model` : Saved models (SVM and TimeDistributed CNNs)
- `Notebook` : All notebooks (preprocessing and model training)
- `Python` : Personal audio library
- `Images`: Set of pictures saved from the notebooks and final report
- `Resources` : Some resources on Speech Emotion Recognition
Notebooks provided on this repo:
- `01 - Preprocessing[SVM].ipynb` : Signal preprocessing and feature extraction from time and frequency domain (global statistics) to train SVM classifier.
- `02 - Train [SVM].ipynb` : Implementation and training of SVM classifier for Speech Emotion Recognition
- `01 - Preprocessing[CNN-LSTM].ipynb` : Signal preprocessing and log-mel-spectrogram extraction to train TimeDistributed CNNs
- `02 - Train [CNN-LSTM].ipynb` : Implementation and training of TimeDistributed CNNs classifier for Speech Emotion Recognition
## Models
### SVM
The classical approach to Speech Emotion Recognition consists of applying a series of filters to the audio signal and partitioning it into several windows (fixed size and time step). Features from the time domain (**Zero Crossing Rate, Energy** and **Entropy of Energy**) and the frequency domain (**Spectral entropy, centroid, spread, flux, rolloff** and **MFCCs**) are then extracted for each frame. We then compute the first derivative of each of those features to capture frame-to-frame changes in the signal. Finally, we calculate the following global statistics on these features: *mean, median, standard deviation, kurtosis, skewness, 1st percentile, 99th percentile, min, max* and *range*, and train a simple SVM classifier with an RBF kernel to predict the emotion detected in the voice.
![image](Images/features_stats.png)
SVM classification pipeline:
- Voice recording
- Audio signal discretization
- Apply pre-emphasis filter
- Framing using a rolling window
- Apply Hamming filter
- Feature extraction
- Compute global statistics
- Make a prediction using our pre-trained model
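
The steps above can be sketched with NumPy alone. This is a minimal illustration, not the code shipped in the `Python` folder: the input is a synthetic tone, the window size and step (400 and 160 samples) are assumed values, only two short-time features are computed, and only the mean and standard deviation are kept as global statistics.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]; the first sample is kept unchanged
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def framing(signal, win_size, win_step):
    # Split the signal into overlapping windows and apply a Hamming window
    nb_frames = 1 + (len(signal) - win_size) // win_step
    ham = np.hamming(win_size)
    return np.stack([signal[t * win_step: t * win_step + win_size] * ham
                     for t in range(nb_frames)])

def zcr(frame):
    # Fraction of consecutive sample pairs whose sign changes
    return np.sum(np.abs(np.diff(np.sign(frame)))) / (2.0 * (len(frame) - 1))

def energy(frame):
    # Mean squared amplitude of the frame
    return np.sum(frame ** 2) / len(frame)

def global_statistics(st_features):
    # One mean and one standard deviation per short-time feature
    return np.concatenate([st_features.mean(axis=0), st_features.std(axis=0)])

# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440 * t)

# Pre-emphasis, framing and Hamming window
frames = framing(pre_emphasis(signal), win_size=400, win_step=160)
# Short-time features: one row per frame, one column per feature
st_features = np.array([[zcr(f), energy(f)] for f in frames])
# First-order differences capture frame-to-frame changes
delta = np.diff(st_features, axis=0)
# Final vector fed to the classifier
feature_vector = np.concatenate([global_statistics(st_features),
                                 global_statistics(delta)])
print(frames.shape, feature_vector.shape)
```

With two features, two statistics and two difference orders, the resulting vector has eight entries; the full pipeline follows the same pattern with more features and statistics.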
### TimeDistributed CNNs
The main idea of a **Time Distributed Convolutional Neural Network** is to apply a rolling window (fixed size and time step) along the log-mel-spectrogram. Each of these windows is the input of a convolutional neural network composed of four Local Feature Learning Blocks (LFLBs), and the output of each of these convolutional networks is fed into a recurrent neural network composed of 2 LSTM (Long Short-Term Memory) cells to learn the long-term contextual dependencies. Finally, a fully connected layer with *softmax* activation is used to predict the emotion detected in the voice.
![image](Images/sound_pipeline.png)
TimeDistributed CNNs pipeline:
- Voice recording
- Audio signal discretization
- Log-mel-spectrogram extraction
- Split spectrogram with a rolling window
- Make a prediction using our pre-trained model
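
The rolling-window step can be sketched as follows. The window size (128) and step (64) match the defaults of the `frame` method in the audio library, and the final shape matches the model input `(5, 128, 128, 1)`. The spectrogram here is a placeholder array rather than a real log-mel-spectrogram, and `rolling_windows` is a hypothetical helper name.

```python
import numpy as np

def rolling_windows(mel_spect, win_size=128, win_step=64):
    # mel_spect has shape (n_mels, n_time_frames)
    nb_windows = 1 + (mel_spect.shape[1] - win_size) // win_step
    # Each window keeps all mel bands and win_size consecutive time frames
    return np.stack([mel_spect[:, t * win_step: t * win_step + win_size]
                     for t in range(nb_windows)])

# Placeholder for a log-mel-spectrogram: 128 mel bands x 384 time frames
mel_spect = np.arange(128 * 384, dtype=np.float32).reshape(128, 384)

# Five overlapping windows of 128 x 128
windows = rolling_windows(mel_spect)
# Add the batch axis and the channel axis expected by the network
X = windows[np.newaxis, ..., np.newaxis]
print(windows.shape, X.shape)
```

Consecutive windows overlap by half their length, so each time frame (except at the edges) is seen by two of the convolutional branches.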
## Performance
To detect overfitting, we split our data set into a training set (80%) and a test set (20%). The table below shows the results obtained on the test set:
| Model | Accuracy |
|-----------------------------------------|---------------|
| SVM on global statistic features        | 68.3%         |
| Time distributed CNNs                   | 76.6%         |
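
For reference, an 80/20 split and the accuracy metric can be sketched as below. The README does not say how the split was drawn (random, stratified or per actor), so the random shuffle, the seed and the helper names are assumptions made for illustration only. The labels are toy values, not results from the models.

```python
import numpy as np

def train_test_split_indices(n_samples, test_ratio=0.2, seed=0):
    # Shuffle sample indices, then hold out the first test_ratio share
    rng = np.random.RandomState(seed)
    indices = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_ratio))
    return indices[n_test:], indices[:n_test]

def accuracy(y_true, y_pred):
    # Share of predictions that match the true labels
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Hypothetical data set of 1000 samples
train_idx, test_idx = train_test_split_indices(1000)

# Toy labels: 4 of 5 predictions match
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 2, 0, 1])
print(len(train_idx), len(test_idx), accuracy(y_true, y_pred))
```

Because each actor records many utterances, a purely random split can place the same speaker in both sets; splitting by actor would be a stricter way to measure generalization.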