We humans show a myriad of explicit and implicit emotional signals in our behavior. Our facial expressions, posture, and even the music we listen to are all expressions that tell the overarching story of how we feel. While most of these signals are communicated implicitly during human-to-human interaction, we do not have a method for quantifying feeling and mood from the individual behavioral signals people express on digital platforms. For this reason, signals that could help us understand ourselves and our emotions go unnoticed.
So I joined Maslo, whose mission is to increase self-awareness and empathy in the world using technology to capture and analyze emotional signals. My goal was to gain a better understanding of how behavioral signals relate to mood. Predicting mood from behavioral signals can enable artificially intelligent beings to make better empathy-led decisions and help Maslo develop state-of-the-art empathetic technology.
Verbal communication is an extraordinarily rich source of emotional data.
When people talk, their voice, pitch, tone, rate of speech, pauses, gaps, highs, lows, and jumps are evocative emotional expressions. When you add these audio signals to the textual signals of what is being said — including a sentence’s topic, content, context, syntax, semantics, and complexity — you get a daunting enigma. You have to wrangle all these implicit emotion-conveying signals to predict one final output: how is this person feeling right now?
Given the breadth of information required, estimating mood from audio recordings presents an interesting challenge. In the context of Maslo’s products, inferring mood from audio recordings can not only provide interesting mood-related insights but also supplement and validate Maslo’s current mood prediction pipelines.
Engineering The Data
To predict emotion from speech recordings, I used four open datasets containing audio data annotated with mood labels: RAVDESS, CREMA-D, TESS, and SAVEE. Combining them produced a dataset of 12,162 audio recordings of actors (57 females and 64 males) speaking words and sentences with specific emotions (angry, happy, sad, surprise, calm, neutral, fearful, and disgust). Because of the limited number of samples in the dataset, I restricted prediction to 4 emotion labels (‘angry’, ‘happy’, ‘sad’, and ‘neutral’). This resulted in a training dataset of 7472 audio recordings.
For each audio recording, I used the feature extraction library librosa to estimate audio features (n = 44), including Mel-frequency cepstral coefficients (MFCCs), root mean square (loudness), and polynomial coefficients (from a polynomial fitted on spectrogram columns). Speech recordings are time-series data, so these features were computed across several overlapping time windows to capture the temporal changes in their values. This resulted in a 3-D feature matrix of 7472 recordings x 44 features x w time windows, where w varies with audio length. Summary statistics (i.e. minimum, maximum, variance, and median) of each feature were then computed per audio recording to capture variation across time. This created a summary feature matrix of 7472 recordings x 176 summary features (44 features x 4 statistics), which was used for training emotion label prediction models.
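As a rough illustration of the summary step, here is a minimal sketch that collapses one recording's per-window feature values into a flat summary vector. It assumes the per-window features have already been extracted (for example with librosa); the function name `summarize_recording`, the toy numbers, and the choice of population variance are assumptions for illustration, not details taken from the project code.

```python
from statistics import median, pvariance

def summarize_recording(windows_by_feature):
    """Collapse a (features x time windows) matrix into one flat vector.

    windows_by_feature: list of per-feature lists; each inner list holds
    that feature's value in every time window of a single recording.
    Returns [min, max, variance, median] for each feature, concatenated.
    """
    summary = []
    for values in windows_by_feature:
        # Four statistics per feature, always in the same order.
        # Population variance is an assumption; the post does not say which.
        summary.extend([min(values), max(values), pvariance(values), median(values)])
    return summary

# Toy recording with 2 features and 5 time windows (made-up numbers).
toy = [
    [0.1, 0.4, 0.3, 0.2, 0.5],       # e.g. root mean square per window
    [-3.0, -1.0, -2.0, -2.5, -1.5],  # e.g. one MFCC per window
]
vec = summarize_recording(toy)
print(len(vec))  # 2 features x 4 statistics
```

Because every recording is reduced to the same number of values regardless of how many windows it has, recordings of different lengths can share one fixed-width training matrix.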
An exploratory data analysis suggested that prediction performance depended on both gender and emotion. To investigate this gender-specific effect further, I trained three emotion label prediction models using Random Forest Classification: (i) a gender-nonspecific prediction model using all recordings (female/male), (ii) a female-specific prediction model using female recordings only, and (iii) a male-specific prediction model using male recordings only. To prevent data leakage, actors in the training dataset never appeared in the test datasets. The performance of each prediction model was estimated with cross-validation, using the Area Under the Receiver Operating Characteristic curve (AUC) as the comparison criterion.
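The actor-level split described above can be sketched in a few lines. This is only an illustration of the idea, not the project's actual splitting code: the function name `actor_grouped_folds`, the round-robin assignment of actors to folds, and the number of folds are all assumptions. scikit-learn's `GroupKFold` offers the same no-shared-group guarantee.

```python
from collections import defaultdict

def actor_grouped_folds(actor_ids, n_folds=5):
    """Yield (train_idx, test_idx) pairs in which no actor is on both sides.

    actor_ids: one actor identifier per recording, in dataset order.
    """
    by_actor = defaultdict(list)
    for idx, actor in enumerate(actor_ids):
        by_actor[actor].append(idx)
    actors = sorted(by_actor)
    for fold in range(n_folds):
        # Every n_folds-th actor goes to this fold's test set.
        test_actors = set(actors[fold::n_folds])
        test_idx = [i for a in actors if a in test_actors for i in by_actor[a]]
        train_idx = [i for a in actors if a not in test_actors for i in by_actor[a]]
        yield train_idx, test_idx

# Six recordings from four hypothetical actors.
actor_ids = ["a1", "a1", "a2", "a3", "a3", "a4"]
folds = list(actor_grouped_folds(actor_ids, n_folds=2))
```

Splitting by actor rather than by recording matters because a model can otherwise score well simply by recognizing a voice it has already heard.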
Compared with the gender-nonspecific prediction model, the female-specific prediction model improved median AUC by 7%, 6%, 8%, and 2% for the ‘angry’, ‘happy’, ‘neutral’, and ‘sad’ emotion labels respectively. The male-specific prediction model improved median AUC by 4% and 5% for the ‘angry’ and ‘sad’ emotion labels, with no improvement for the other emotion labels. These improvements in AUC were accompanied by increased precision in the gender-specific models over the gender-nonspecific prediction model, by up to 20% for the prediction of ‘happy’ in females.
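For readers unfamiliar with per-label AUC, the sketch below shows one way to compute it. It assumes a one-vs-rest setup, where each emotion label is scored against all the others; the post does not state this explicitly. The function names and the toy probabilities are hypothetical.

```python
def binary_auc(scores, is_positive):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(probabilities, true_labels, label):
    """AUC for one emotion label against all other labels.

    probabilities: list of dicts mapping label -> predicted probability.
    """
    scores = [p[label] for p in probabilities]
    return binary_auc(scores, [t == label for t in true_labels])

# Made-up predicted probabilities for four recordings.
probs = [
    {"angry": 0.9, "sad": 0.1},
    {"angry": 0.2, "sad": 0.8},
    {"angry": 0.8, "sad": 0.2},
    {"angry": 0.1, "sad": 0.9},
]
truth = ["angry", "angry", "sad", "sad"]
auc_angry = one_vs_rest_auc(probs, truth, "angry")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why it is a convenient yardstick for comparing models trained on different subsets.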
Across prediction models, summary statistics for features (i) root mean square, (ii) polynomial coefficients, and (iii) MFCCs had the highest feature importance. Here, root mean square is a measure of loudness of the recording, polynomial coefficients are the coefficients of a polynomial fitted on the columns of a spectrogram, and MFCCs are the coefficients derived from an alternate representation of the audio recording (i.e. the Mel-frequency cepstrum). When comparing feature importance across prediction models, summary statistics for lower MFCCs (e.g. 1st and 2nd MFCCs) had higher feature importance in male-specific prediction models, whereas summary statistics for higher MFCCs (e.g. 17th to 20th MFCCs) were more important in female-specific prediction models.
These results provide preliminary support for the hypothesis that the expression of certain emotions (happiness, neutral, and anger) may be more gender-specific than that of others (sadness). Female-specific prediction models showed performance improvements across several emotion labels (‘angry’, ‘happy’, ‘neutral’), whereas in male-specific prediction models improvements were more limited (‘angry’ and ‘sad’) or were not observed (‘happy’ and ‘neutral’). It is important to note that gender-nonspecific models have the advantage of training on a larger sample of audio recordings than the gender-specific prediction models. However, the improved performance of gender-specific models for emotion labels like ‘angry’, even with a smaller training sample, suggests that the gender-specific prediction models benefit from the removal of samples belonging to the other gender. Females and males use different ranges of frequencies, so a gender-specific prediction model can learn feature weights specific to its range of frequencies. Moreover, gender-specific prediction models may also avoid internal gender-based stratification that might be happening within random forest classification, leading the model to prioritize learning emotion label-specific intricacies rather than gender-based differences.
Assumptions and Limitations
Though these findings provide encouragement for further investigation, it is important to be mindful of the assumptions made with these models.
- Can I only feel one emotion at a time? The prediction models assume that each audio recording can only be classified as a single emotion label, where the simultaneous classification of multiple emotions to a particular recording is not considered. Though this assumption holds for the training dataset where actors acted out specific emotions, this might not be the case for ‘real’ audio recordings where a person may have mixed feelings: sad/happy, sad/angry. A potential solution is to leverage the probabilities associated with multiple emotion labels returned by the classification model. Those can be used in conjunction with any content-based insights to create more intelligent emotion/mood insights.
- Is this how people actually talk? The prediction models assume that the training data audio recordings are good approximations of the Maslo user recordings. However, recordings from Maslo’s products are typically up to 60 seconds long, whereas the training data recordings were as short as a single word or sentence. Differences in audio length may reduce prediction model performance. For example, summary statistics like median or variance may capture completely different things when computed for a short audio recording versus a long one. A possible solution might be to sub-sample the audio recording at random, create a prediction for each sub-sample, and take the mode of the predicted emotion labels as each model’s final prediction.
- A more valid validation? Another limitation of the current experiment is that the prediction models were validated on datasets that contained some of the same sentences and words as the training dataset. Because validation datasets were already constructed so that no actor appeared in both training and test data, additionally filtering them by statement would have drastically limited the number of training samples. This issue can be addressed by adding other large-scale, emotion-label annotated audio recording datasets so that filtering on both statements and actors would still leave sufficient training samples for the prediction models.
- Gender is not binary: The gender-specific prediction models also assume that gender is binary (i.e. either female or male), whereas the reality of a gender spectrum may impede model performance. To partially address this, we have also created a gender-prediction model using Random Forest Classification that is trained on the reported gender and summary features extracted from the audio recording datasets. Because this model predicts ‘gender’ (cross-validation median AUC: > 0.9) from features that include those quantifying which part of the frequency spectrum the speaker’s voice uses, it may in fact be better described as distinguishing people who speak at higher frequencies from those who speak at lower frequencies, rather than predicting actual gender. This gender-prediction model can enable stratification of future datasets based on frequencies, which may be a better stratification than one using reported gender. Further study is needed to identify the impact on model performance of stratifying datasets using additional categories of gender.
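The sub-sampling idea raised under “Is this how people actually talk?” can be sketched as follows. This is a hypothetical illustration rather than something implemented in the project: `predict_label` stands in for any trained model that maps a clip to an emotion label, and the clip length, number of clips, and seed are arbitrary assumptions.

```python
import random
from collections import Counter

def predict_by_subsampling(samples, predict_label, clip_len, n_clips=10, seed=0):
    """Predict on random fixed-length clips and return the most common label.

    samples: the recording as a sequence of audio samples.
    predict_label: hypothetical callable mapping a clip to an emotion label.
    """
    if len(samples) <= clip_len:
        # Recording is already short enough; predict on it directly.
        return predict_label(samples)
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_clips):
        start = rng.randrange(len(samples) - clip_len + 1)
        votes[predict_label(samples[start:start + clip_len])] += 1
    # On a tie, the label that was seen first wins.
    return votes.most_common(1)[0][0]

# Stand-in model: calls a clip 'happy' when its samples sum to a positive value.
def toy_model(clip):
    return "happy" if sum(clip) > 0 else "sad"

label = predict_by_subsampling([1] * 100, toy_model, clip_len=10)
```

Choosing a clip length close to the duration of the training recordings would keep the summary statistics of each clip comparable to those the models were trained on.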
Emotion prediction is a difficult problem, and intricacies in defining the ‘true’ emotion label (i.e. self-reported mood vs. third-party mood perception) add further complexity. The presented framework is an initial attempt to predict emotion labels, which additional improvements may evolve into a more robust emotion prediction pipeline. Given results from the current experiment, the following emotion label prediction pipeline is proposed to make future predictions for Maslo products:
- Use reported gender or predict ‘gender’ of audio-recording using the gender-prediction model
- Based on reported gender/predicted ‘gender’, predict emotion labels using the respective gender-specific prediction model
- Predict emotion labels using the gender-nonspecific prediction model
- Aggregate predictions from both models: Prioritize female-specific mood predictions (‘angry’, ‘happy’, ‘neutral’) and male-specific mood predictions (‘angry’, ‘sad’) for females and males respectively. Otherwise, rely on gender-nonspecific predictions
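The steps above can be sketched as a small function. Everything named here is hypothetical: the `models` dictionary holds stand-in callables for the four trained models, and the prioritized labels are passed in as configuration rather than hard-coded. The aggregation rule shown (use the gender-specific prediction only when it is one of the prioritized labels, otherwise fall back) is one reading of the last step, not a confirmed implementation.

```python
def predict_emotion(features, reported_gender, models, preferred_labels):
    """Combine gender-specific and gender-nonspecific predictions.

    models: dict of hypothetical callables keyed by 'gender', 'female',
        'male', and 'general'; each maps a feature vector to a label.
    preferred_labels: dict mapping gender -> set of emotion labels for
        which the gender-specific prediction is prioritized.
    """
    # Step 1: use the reported gender, or predict it from the audio features.
    gender = reported_gender or models["gender"](features)
    # Steps 2 and 3: predict with both the specific and the general model.
    specific = models[gender](features)
    general = models["general"](features)
    # Step 4: prioritize the gender-specific label only where configured.
    if specific in preferred_labels.get(gender, set()):
        return specific
    return general

# Stand-in models that always return a fixed label.
models = {
    "gender": lambda f: "female",
    "female": lambda f: "happy",
    "male": lambda f: "sad",
    "general": lambda f: "neutral",
}
preferred = {"female": {"angry", "happy", "neutral"}}

result = predict_emotion([0.0], None, models, preferred)
```

Keeping the prioritized labels in configuration makes it easy to revise them as further experiments change which emotions show a gender-specific effect.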
This pipeline provides a preliminary strategy to produce mood/emotion label predictions and may also be used to supplement mood insights from the existing pipelines at Maslo. Predictions made through this pipeline may also allow analyses into edge cases, where the content-based insights and audio-based insights were incongruent. Besides identifying the robust gender-specificities of happiness and anger, it may also be interesting to isolate other emotions that exhibit a gender-specific effect. The problem of emotion prediction from audio recordings presents a complex, multi-faceted challenge. This project aims to understand the problem and open up the possibilities for further enhancements, including but not limited to those discussed in “Assumptions and Limitations”. Future work may explore simultaneous content and audio-based feature extraction to produce more robust emotion insights and improve predictions using relevant methodologies like Recurrent Neural Networks.
Explore the code and ideas more here: https://github.com/HeyMaslo/AllTheFeels