{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Speech Emotion Recognition - Signal Preprocessing\n",
    "\n",
    "A project for the French Employment Agency\n",
    "\n",
    "Telecom ParisTech 2018-2019"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## I. Context"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The aim of this notebook is to set up the preprocessing and audio feature extraction for speech emotion recognition.\n",
    "\n",
    "### Audio features:\n",
    "The implemented short-term features are listed below:\n",
    "- **Zero Crossing Rate**: The rate of sign changes of the signal within a frame.\n",
    "- **Energy**: The sum of squares of the signal values, normalized by the respective frame length.\n",
    "- **Entropy of Energy**: The entropy of sub-frames' normalized energies. It can be interpreted as a measure of abrupt changes.\n",
    "- **Spectral Centroid**: The center of gravity of the spectrum.\n",
    "- **Spectral Spread**: The second central moment of the spectrum.\n",
    "- **Spectral Entropy**: Entropy of the normalized spectral energies for a set of sub-frames.\n",
    "- **Spectral Flux**: The squared difference between the normalized magnitudes of the spectra of two successive frames.\n",
    "- **Spectral Rolloff**: The frequency below which 90% of the magnitude distribution of the spectrum is concentrated.\n",
    "- **MFCCs**: Mel Frequency Cepstral Coefficients form a cepstral representation where the frequency bands are not linear but distributed according to the mel scale.\n",
    "\n",
    "Global statistics are then computed on the features above:\n",
    "- **mean, std, med, kurt, skew, q1, q99, min, max and range**\n",
    "\n",
    "### Data:\n",
    "**RAVDESS**: The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes *calm*, *happy*, *sad*, *angry*, *fearful*, *surprise*, and *disgust* expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. (https://zenodo.org/record/1188976#.XA48aC17Q1J)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## II. General imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-04-15T13:13:31.470677Z",
     "start_time": "2019-04-15T13:13:30.911103Z"
    }
   },
   "outputs": [],
   "source": [
    "### General imports ###\n",
    "from glob import glob\n",
    "import os\n",
    "import pickle\n",
    "import itertools\n",
    "import numpy as np\n",
    "\n",
    "### Audio preprocessing imports ###\n",
    "from AudioLibrary.AudioSignal import *\n",
    "from AudioLibrary.AudioFeatures import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-04T16:38:44.580314Z",
     "start_time": "2018-12-04T16:38:44.560062Z"
    }
   },
   "source": [
    "## III. Set labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-04-15T13:13:31.477659Z",
     "start_time": "2019-04-15T13:13:31.473279Z"
    }
   },
   "outputs": [],
   "source": [
    "# RAVDESS database: emotion code in the file name -> label\n",
    "# Note: RAVDESS code '02' is 'calm' and is used here as the neutral class;\n",
    "# code '01' (neutral) is not mapped, so those files are skipped at import.\n",
    "label_dict_ravdess = {'02': 'NEU', '03': 'HAP', '04': 'SAD', '05': 'ANG', '06': 'FEA', '07': 'DIS', '08': 'SUR'}\n",
    "\n",
    "# Set audio file labels\n",
    "def set_label_ravdess(audio_file, gender_differentiation):\n",
    "    # The emotion code is the third field of the file name\n",
    "    label = label_dict_ravdess.get(audio_file[6:-16])\n",
    "    if gender_differentiation:\n",
    "        # The actor ID is the last field: even = female, odd = male\n",
    "        if int(audio_file[18:-4]) % 2 == 0:  # Female\n",
    "            label = 'f_' + label\n",
    "        else:  # Male\n",
    "            label = 'm_' + label\n",
    "    return label"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## IV. Import audio files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-04-15T13:13:36.852703Z",
     "start_time": "2019-04-15T13:13:31.479656Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Import Data: START\n",
      "Import Data: RUNNING ... 0 files\n",
      "Import Data: RUNNING ... 200 files\n",
      "Import Data: RUNNING ... 300 files\n",
      "Import Data: RUNNING ... 400 files\n",
      "Import Data: RUNNING ... 500 files\n",
      "Import Data: RUNNING ... 600 files\n",
      "Import Data: RUNNING ... 700 files\n",
      "Import Data: RUNNING ... 800 files\n",
      "Import Data: RUNNING ... 900 files\n",
      "Import Data: RUNNING ... 1000 files\n",
      "Import Data: RUNNING ... 1100 files\n",
      "Import Data: RUNNING ... 1200 files\n",
      "Import Data: RUNNING ... 1300 files\n",
      "Import Data: RUNNING ... 1400 files\n",
      "Import Data: END \n",
      "\n",
      "Number of audio files imported: 1344\n"
     ]
    }
   ],
   "source": [
    "# Start data import\n",
    "print(\"Import Data: START\")\n",
    "\n",
    "# Audio file path and names\n",
    "file_path = '../Datas/RAVDESS/'\n",
    "file_names = os.listdir(file_path)\n",
    "\n",
    "# Initialize signal and labels lists\n",
    "signal = []\n",
    "labels = []\n",
    "\n",
    "# Sample rate (44.1 kHz)\n",
    "sample_rate = 44100\n",
    "\n",
    "# Load every audio file whose emotion code is in the label dictionary\n",
    "for audio_index, audio_file in enumerate(file_names):\n",
    "\n",
    "    # Select audio file\n",
    "    if audio_file[6:-16] in label_dict_ravdess:\n",
    "\n",
    "        # Read audio file\n",
    "        signal.append(AudioSignal(sample_rate, filename=file_path + audio_file))\n",
    "\n",
    "        # Set label\n",
    "        labels.append(set_label_ravdess(audio_file, True))\n",
    "\n",
    "        # Print progress (only for selected files, so some multiples of 100 may not appear)\n",
    "        if audio_index % 100 == 0:\n",
    "            print(\"Import Data: RUNNING ... {} files\".format(audio_index))\n",
    "\n",
    "# Cast labels to array\n",
    "labels = np.asarray(labels).ravel()\n",
    "\n",
    "# End of data import\n",
    "print(\"Import Data: END \\n\")\n",
    "print(\"Number of audio files imported: {}\".format(labels.shape[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## V. Audio feature extraction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-04-15T13:13:36.863481Z",
     "start_time": "2019-04-15T13:13:36.855871Z"
    }
   },
   "outputs": [],
   "source": [
    "# Audio feature extraction function\n",
    "def global_feature_statistics(y, win_size=0.025, win_step=0.01, nb_mfcc=12, mel_filter=40,\n",
    "                              stats=['mean', 'std', 'med', 'kurt', 'skew', 'q1', 'q99', 'min', 'max', 'range'],\n",
    "                              features_list=['zcr', 'energy', 'energy_entropy', 'spectral_centroid', 'spectral_spread', 'spectral_entropy', 'spectral_flux', 'sprectral_rolloff', 'mfcc']):\n",
    "\n",
    "    # Extract short-term features and compute their global statistics\n",
    "    audio_features = AudioFeatures(y, win_size, win_step)\n",
    "    features, features_names = audio_features.global_feature_extraction(stats=stats, features_list=features_list)\n",
    "    return features\n",
    "\n",
    "# Feature extraction parameters (the window, MFCC, filter, stats and feature values mirror the function defaults above)\n",
    "sample_rate = 16000  # Sample rate (16.0 kHz)\n",
    "win_size = 0.025     # Short-term window size (25 ms)\n",
    "win_step = 0.01      # Short-term window step (10 ms)\n",
    "nb_mfcc = 12         # Number of MFCC coefficients (12)\n",
    "nb_filter = 40       # Number of filter banks (40)\n",
    "stats = ['mean', 'std', 'med', 'kurt', 'skew', 'q1', 'q99', 'min', 'max', 'range']           # Global statistics\n",
    "features_list = ['zcr', 'energy', 'energy_entropy', 'spectral_centroid', 'spectral_spread',  # Audio features\n",
    "                 'spectral_entropy', 'spectral_flux', 'sprectral_rolloff', 'mfcc']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-04-15T13:19:38.974213Z",
     "start_time": "2019-04-15T13:13:36.866069Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Feature extraction: START\n",
      "Feature extraction: END!\n"
     ]
    }
   ],
   "source": [
    "# Start feature extraction\n",
    "print(\"Feature extraction: START\")\n",
    "\n",
    "# Compute global feature statistics for every audio file\n",
    "features = np.asarray(list(map(global_feature_statistics, signal)))\n",
    "\n",
    "# Stop feature extraction\n",
    "print(\"Feature extraction: END!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## VI. Save as"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-04-15T13:19:38.983530Z",
     "start_time": "2019-04-15T13:19:38.975722Z"
    }
   },
   "outputs": [],
   "source": [
    "# Save features and labels to a pickle file\n",
    "with open(\"../Datas/Pickle/[RAVDESS][HAP-SAD-NEU-ANG-FEA-DIS-SUR][GLOBAL_STATS].p\", 'wb') as f:\n",
    "    pickle.dump([features, labels], f)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}