FEEL: Framework for Emotion Evaluation

Quantifying Heterogeneity in Physiological Signals for Generalizable Emotion Recognition

Pragya Singh¹, Ankush Gupta¹, Somay Jalan¹, Mohan Kumar², Pushpendra Singh¹
¹IIIT-D, New Delhi, India   ²RIT, Rochester, New York, USA
📄 Paper 💻 Code

About FEEL

FEEL (Framework for Emotion Evaluation) is the first large-scale benchmarking study for emotion recognition using EDA and PPG signals across 19 publicly available datasets, enabling systematic analysis of model generalizability.

  • Datasets: 19 (publicly available)
  • Architectures: 16 (models benchmarked)
  • Paradigms: 4 (modeling approaches)
  • Signals: 2 (EDA & PPG)

📝 Research Timeline

CSCW 2024

Translating Emotions to Annotations: A Participant's Perspective of Physiological Emotion Data Collection

🧠 Paper →

NeurIPS 2024

EEVR: A Dataset of Paired Physiological Signals and Textual Descriptions for Joint Emotion Representation Learning

⌚ Paper →

IMWUT 2025

AnnoSense: A Framework for Physiological Emotion Data Collection in Everyday Settings for AI

📶 Paper →

NeurIPS 2025 🎯

FEEL: Quantifying Heterogeneity in Physiological Signals for Generalizable Emotion Recognition

📐 Paper →

🎯 Key Contributions

  • 19 Datasets: Diverse experimental conditions (lab, real-world, constraint), devices, and labeling strategies
  • Unified Pipeline: Standardized preprocessing, harmonization, and evaluation protocols
  • 16 Architectures: Evaluated across 4 paradigms - traditional ML, deep learning on handcrafted features, deep learning on raw signals, and pretrained CLSP models
  • Cross-Dataset Analysis: Systematic transferability evaluation across settings, devices, labeling methods, age, and gender
  • CLSP Fine-Tuning: Novel conditional context optimization for dataset adaptation
  • Benchmarked on 3 Classification Tasks: Arousal, Valence, and Four-Class emotion recognition (HAPV, HANV, LAPV, LANV: high or low arousal crossed with positive or negative valence)
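The four-class task can be read as the cross of the two binary tasks. The snippet below is a minimal sketch of that mapping, assuming HA/LA stand for high/low arousal and PV/NV for positive/negative valence; the function name `quadrant_label` is hypothetical and not part of the FEEL codebase.

```python
def quadrant_label(high_arousal: bool, positive_valence: bool) -> str:
    """Combine binary arousal and valence labels into one four-class label.

    Assumption: HA/LA = high/low arousal, PV/NV = positive/negative valence.
    """
    arousal = "HA" if high_arousal else "LA"
    valence = "PV" if positive_valence else "NV"
    return arousal + valence


# Enumerate all four quadrants in the order used on this page.
labels = [quadrant_label(a, v) for a in (True, False) for v in (True, False)]
print(labels)  # ['HAPV', 'HANV', 'LAPV', 'LANV']
```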

Dataset Collection

Overview of 19 Emotion Datasets

Dataset | Participants | Device | Setting | Labeling | Task
WESAD | 15 | E4 | Lab | Stimulus | Reading, videos, TSST, meditation
NURSE | 15 | E4 | Real | Self-report | Hospital work stress
EMOGNITION | 43 | E4 | Lab | Stimulus | Film clips (9 emotions)
UBFCPHYS | 56 | E4 | Lab | Stimulus | Speech, arithmetic
VERBIO | 49 | E4 | Lab | Self-report | Public speaking
PhyMER | 30 | E4 | Lab | Self-report | Video stimuli
EmoWear | 48 | E4 | Lab | Self-report | Video stimuli
MAUS | 22 | Procomp | Lab | Stimulus | N-Back task
CLAS | 62 | Shimmer3 | Lab | Stimulus | Videos, math, Stroop
CASE | 30 | ThoughtTech | Lab | Self-report | Video clips
Unobtrusive | 24 | E4 | Lab/Real | Stimulus | Cognitive tasks, WFH
CEAP-360VR | 32 | E4 | Lab | Self-report | VR 360 videos
ScientISST MOVE | 15 | E4 | Constraint | Stimulus | Physical activities
LAUREATE | 44 | E4 | Real | Self-report | 13-week university
ForDigitStress | 38 | IOMbio | Constraint | Expert | Job interview
Dapper | 88 | Custom | Real | Self-report | Daily life (5 days)
ADARP | 11 | E4 | Real | Self-report | Alcohol disorder
MOCAS | 21 | E4 | Lab | Self-report | CCTV monitoring
Exercise | 97 | E4 | Constraint | Stimulus | Stroop, debate, exercise
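The transfer analyses group datasets by setting, device, and labeling strategy. As a rough illustration, the sketch below stores a handful of rows from the table above and groups them by one attribute. The `DatasetInfo` type and `group_by` helper are hypothetical names used only for this example, and only five of the 19 datasets are included to keep it short.

```python
from collections import defaultdict
from typing import NamedTuple


class DatasetInfo(NamedTuple):
    name: str
    participants: int
    device: str
    setting: str
    labeling: str


# A subset of rows copied from the dataset overview table.
DATASETS = [
    DatasetInfo("WESAD", 15, "E4", "Lab", "Stimulus"),
    DatasetInfo("NURSE", 15, "E4", "Real", "Self-report"),
    DatasetInfo("MAUS", 22, "Procomp", "Lab", "Stimulus"),
    DatasetInfo("ForDigitStress", 38, "IOMbio", "Constraint", "Expert"),
    DatasetInfo("Dapper", 88, "Custom", "Real", "Self-report"),
]


def group_by(datasets, attribute):
    """Map each value of `attribute` to the names of the datasets that have it."""
    groups = defaultdict(list)
    for d in datasets:
        groups[getattr(d, attribute)].append(d.name)
    return dict(groups)


by_setting = group_by(DATASETS, "setting")
by_labeling = group_by(DATASETS, "labeling")
print(by_setting)
```

The same helper works for `"device"`, which is how a device-based split could be derived from the same metadata.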

🏆 Leaderboards

Arousal Classification - All 19 Datasets

Rank | Dataset | EDA F1 | EDA Best Model | PPG F1 | PPG Best Model | Combined F1 | Combined Best Model
1 | ForDigitStress | 0.94 | CLSP MLP 25% | 0.99 | RF | 0.99 | RF
2 | WESAD | 0.83 | Signal ResNet | 0.80 | HC-MLP | 0.91 | HC-MLP
3 | Unobtrusive | 0.88 | RF | 0.87 | RF | 0.86 | CLSP MLP 50%
4 | ScientISST MOVE | 0.77 | HC Attention MLP | 0.81 | RF | 0.88 | HC-MLP
5 | ADARP | 0.83 | CLSP MLP 25% | 0.80 | CLSP CNN 50% | 0.62 | CLSP MLP 25%
6 | MAUS | 0.83 | Signal ResNet | 0.82 | RF | 0.82 | RF
7 | VERBIO | 0.83 | CLSP CNN 50% | 0.77 | CLSP CNN 50% | 0.72 | CLSP CNN 50%
8 | LAUREATE | 0.69 | RF | 0.77 | CLSP MLP 50% | 0.82 | CLSP Zero-Shot
9 | Dapper | 0.77 | CLSP CNN 50% | 0.70 | CLSP CNN 5% | 0.81 | CLSP MLP 5%
10 | CLAS | 0.69 | RF | 0.66 | RF | 0.70 | RF
11 | EMOGNITION | 0.68 | CLSP MLP 5% | 0.62 | CLSP MLP 50% | 0.57 | CLSP CNN 5%
12 | MOCAS | 0.65 | CLSP CNN 5% | 0.62 | CLSP MLP 5% | 0.63 | CLSP MLP 25%
13 | EmoWear | 0.64 | RF | 0.64 | RF | 0.67 | CLSP MLP 50%
14 | Exercise | 0.63 | CLSP Zero-Shot | 0.57 | CLSP CNN 25% | 0.54 | CLSP MLP 5%
15 | NURSE | 0.62 | CLSP CNN 5% | 0.52 | CLSP MLP 50% | 0.62 | CLSP CNN 5%
16 | CEAP-360VR | 0.56 | CLSP CNN 5% | 0.43 | CLSP CNN 5% | 0.45 | CLSP MLP 5%
17 | PhyMER | 0.51 | CLSP Zero-Shot | 0.42 | LDA | 0.42 | LDA
18 | CASE | 0.47 | Signal CNN+Transformer | 0.30 | HC-MLP | 0.40 | CLSP MLP 5%
19 | UBFCPHYS | 0.45 | CLSP MLP 5% | 0.34 | CLSP CNN 25% | 0.41 | CLSP MLP 5%

Valence Classification - All 19 Datasets

Rank | Dataset | EDA F1 | EDA Best Model | PPG F1 | PPG Best Model | Combined F1 | Combined Best Model
1 | WESAD | 0.83 | CLSP CNN 50% | 0.83 | CLSP CNN 50% | 0.98 | HC-MLP
2 | Dapper | 0.87 | CLSP CNN 50% | 0.85 | CLSP CNN 50% | 0.94 | CLSP CNN 50%
3 | ForDigitStress | 0.87 | CLSP CNN 5% | 0.92 | RF | 0.92 | RF
4 | MOCAS | 0.89 | CLSP Zero-Shot | 0.87 | CLSP CNN 50% | 0.82 | CLSP CNN 25%
5 | ScientISST MOVE | 0.82 | CLSP MLP 50% | 0.80 | CLSP CNN 50% | 0.82 | CLSP CNN 50%
6 | EmoWear | 0.78 | CLSP CNN 50% | 0.77 | CLSP CNN 50% | 0.77 | RF
7 | UBFCPHYS | 0.76 | RF | 0.68 | LDA | 0.72 | RF
8 | Exercise | 0.75 | CLSP Zero-Shot | 0.72 | CLSP CNN 50% | 0.71 | CLSP MLP 50%
9 | PhyMER | 0.72 | CLSP Zero-Shot | 0.69 | CLSP CNN 50% | 0.70 | CLSP MLP 50%
10 | Unobtrusive | 0.71 | CLSP Zero-Shot | 0.71 | RF | 0.70 | CLSP CNN 25%
11 | CLAS | 0.64 | CLSP Zero-Shot | 0.61 | CLSP CNN 25% | 0.63 | HC Attention MLP
12 | NURSE | 0.62 | CLSP Zero-Shot | 0.39 | CLSP CNN 5% | 0.38 | CLSP Zero-Shot
13 | CEAP-360VR | 0.62 | CLSP CNN 5% | 0.61 | CLSP CNN 5% | 0.50 | LDA
14 | MAUS | 0.58 | HC-MLP | 0.56 | LDA | 0.59 | LDA
15 | CASE | 0.54 | CLSP MLP 5% | 0.48 | LDA | 0.49 | LDA
16 | EMOGNITION | 0.53 | CLSP Zero-Shot | 0.50 | CLSP MLP 5% | 0.39 | CLSP CNN 5%
17 | LAUREATE | 0.36 | HC-MLP | 0.41 | HC-MLP | 0.40 | CLSP MLP 50%
18 | VERBIO | 0.40 | HC-MLP | 0.38 | HC-MLP | 0.34 | CLSP MLP 5%
19 | ADARP | 0.30 | CLSP Zero-Shot | 0.40 | CLSP Zero-Shot | 0.47 | HC-MLP

Four-Class Classification (HAPV, HANV, LAPV, LANV) - All 19 Datasets

Rank | Dataset | EDA F1 | EDA Best Model | PPG F1 | PPG Best Model | Combined F1 | Combined Best Model
1 | WESAD | 0.987 | RF | 0.794 | RF | 0.987 | LDA
2 | ForDigitStress | 0.682 | LDA | 0.821 | RF | 0.826 | RF
3 | ScientISST MOVE | 0.701 | CLSP MLP 25% | 0.740 | CLSP CNN 50% | 0.800 | CLSP CNN 50%
4 | MAUS | 0.700 | HC-MLP | 0.705 | RF | 0.728 | RF
5 | PhyMER | 0.723 | CLSP CNN 50% | 0.300 | RF | 0.342 | RF
6 | UBFCPHYS | 0.705 | CLSP Zero-Shot | 0.551 | LDA | 0.622 | LDA
7 | MOCAS | 0.701 | CLSP MLP 25% | 0.357 | RF | 0.366 | RF
8 | EMOGNITION | 0.572 | RF | 0.601 | CLSP CNN 50% | 0.513 | RF
9 | Dapper | 0.434 | RF | 0.426 | RF | 0.555 | RF
10 | Exercise | 0.552 | CLSP CNN 25% | 0.438 | HC-MLP | 0.480 | RF
11 | LAUREATE | 0.527 | CLSP MLP 5% | 0.460 | RF | 0.461 | RF
12 | CASE | 0.476 | RF | 0.397 | RF | 0.498 | RF
13 | VERBIO | 0.480 | CLSP Zero-Shot | 0.582 | CLSP Zero-Shot | 0.436 | CLSP Zero-Shot
14 | NURSE | 0.433 | CLSP Zero-Shot | 0.667 | CLSP Zero-Shot | 0.520 | CLSP Zero-Shot
15 | CLAS | 0.430 | RF | 0.408 | HC-MLP | 0.459 | RF
16 | Unobtrusive | 0.402 | RF | 0.409 | CLSP Zero-Shot | 0.393 | HC-MLP
17 | ADARP | 0.269 | CLSP Zero-Shot | 0.433 | CLSP Zero-Shot | 0.354 | CLSP Zero-Shot
18 | CEAP-360VR | 0.285 | CLSP MLP 25% | 0.307 | RF | 0.314 | RF
19 | EmoWear | 0.293 | CLSP CNN 50% | 0.270 | HC-MLP | 0.282 | HC-MLP
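The leaderboards above report F1 scores. This page does not state which averaging is used, so the following is only a sketch that assumes macro-averaged F1 (the unweighted mean of per-class F1), written in plain Python so each step is visible. The function name `macro_f1` is chosen for this example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["HAPV", "HAPV", "LANV", "LANV"]
y_pred = ["HAPV", "LANV", "LANV", "LANV"]
score = macro_f1(y_true, y_pred)  # per-class F1: HAPV = 2/3, LANV = 4/5
```

In practice a library routine such as scikit-learn's `f1_score` with `average="macro"` would compute the same quantity.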

Setting-Based Transfer - Arousal Classification

Testing Cohort | Training Cohort | EDA F1 | EDA Best Model | PPG F1 | PPG Best Model | Combined F1 | Combined Best Model
Lab | Real | 0.72 | CLSP CNN 5% | 0.57 | CLSP MLP 5% | 0.71 | CLSP MLP 5%
Lab | Constraint | 0.56 | CLSP MLP 50% | 0.61 | RF | 0.60 | RF
Lab | Lab | 0.50 | RF | 0.50 | RF | 0.52 | RF
Constraint | Real | 0.68 | RF | 0.51 | RF | 0.64 | CLSP MLP 5%
Constraint | Lab | 0.44 | HC-MLP | 0.67 | LDA | 0.64 | LDA
Constraint | Constraint | 0.48 | HC-MLP | 0.48 | RF | 0.48 | RF
Real | Constraint | 0.65 | CLSP MLP 5% | 0.59 | RF | 0.73 | CLSP MLP 5%
Real | Lab | 0.59 | HC-MLP | 0.69 | LDA | 0.72 | CLSP MLP 25%
Real | Real | 0.49 | HC-MLP | 0.48 | RF | 0.46 | RF

Setting-Based Transfer - Valence Classification

Testing Cohort | Training Cohort | EDA F1 | EDA Best Model | PPG F1 | PPG Best Model | Combined F1 | Combined Best Model
Lab | Real | 0.79 | CLSP MLP 5% | 0.69 | RF | 0.79 | CLSP MLP 25%
Lab | Constraint | 0.66 | CLSP MLP 25% | 0.67 | CLSP CNN 5% | 0.68 | CLSP MLP 25%
Lab | Lab | 0.54 | RF | 0.50 | HC-MLP | 0.51 | HC-MLP
Constraint | Real | 0.76 | RF | 0.78 | RF | 0.77 | RF
Constraint | Lab | 0.76 | RF | 0.72 | RF | 0.74 | RF
Constraint | Constraint | 0.63 | RF | 0.64 | RF | 0.65 | RF
Real | Constraint | 0.76 | RF | 0.70 | RF | 0.88 | RF
Real | Lab | 0.72 | RF | 0.64 | CLSP MLP 25% | 0.76 | RF
Real | Real | 0.41 | HC-MLP | 0.41 | HC-MLP | 0.42 | HC-MLP
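The two tables above cover every ordered combination of testing cohort and training cohort across the three settings (Lab, Constraint, Real). A small sketch of how such an evaluation grid can be enumerated is shown below. The helper name `transfer_pairs` is hypothetical, and how data is split when the testing and training cohorts are the same is not stated on this page.

```python
from itertools import product

SETTINGS = ("Lab", "Constraint", "Real")


def transfer_pairs(groups):
    """Return every ordered (testing group, training group) combination."""
    return list(product(groups, repeat=2))


pairs = transfer_pairs(SETTINGS)
same_cohort = [p for p in pairs if p[0] == p[1]]   # e.g. ("Lab", "Lab")
cross_cohort = [p for p in pairs if p[0] != p[1]]  # e.g. ("Lab", "Real")

print(len(pairs), len(same_cohort), len(cross_cohort))  # 9 3 6
```

The same enumeration applies to any grouping with three values, such as device type or labeling strategy, and yields four pairs for two-valued groupings such as gender.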

Device-Based Transfer - Arousal Classification

Testing Device | Training Device | EDA F1 | EDA Best Model | PPG F1 | PPG Best Model | Combined F1 | Combined Best Model
Custom Wearable | E4 Wearable | 0.65 | CLSP MLP 50% | 0.82 | RF | 0.73 | CLSP MLP 50%
Custom Wearable | Lab-Based | 0.62 | RF | 0.77 | RF | 0.81 | CLSP CNN 50%
Custom Wearable | Custom Wearable | 0.34 | RF | 0.26 | RF | 0.30 | RF
E4 Wearable | Lab-Based | 0.67 | CLSP CNN 50% | 0.73 | CLSP CNN 50% | 0.73 | RF
E4 Wearable | Custom Wearable | 0.64 | CLSP CNN 50% | 0.57 | RF | 0.66 | CLSP CNN 50%
E4 Wearable | E4 Wearable | 0.62 | RF | 0.60 | RF | 0.61 | RF
Lab-Based | Lab-Based | 0.60 | RF | 0.62 | RF | 0.62 | RF
Lab-Based | E4 Wearable | 0.45 | RF | 0.52 | HC-MLP | 0.57 | HC-MLP
Lab-Based | Custom Wearable | 0.51 | RF | 0.53 | RF | 0.54 | RF

Device-Based Transfer - Valence Classification

Testing Device | Training Device | EDA F1 | EDA Best Model | PPG F1 | PPG Best Model | Combined F1 | Combined Best Model
Custom Wearable | E4 Wearable | 0.70 | CLSP MLP 50% | 0.82 | RF | 0.82 | CLSP MLP 50%
Custom Wearable | Lab-Based | 0.71 | LDA | 0.81 | LDA | 0.81 | LDA
Custom Wearable | Custom Wearable | 0.34 | RF | 0.26 | RF | 0.28 | RF
E4 Wearable | Lab-Based | 0.67 | CLSP MLP 25% | 0.64 | RF | 0.73 | CLSP CNN 50%
E4 Wearable | Custom Wearable | 0.60 | CLSP CNN 50% | 0.55 | CLSP MLP 50% | 0.62 | CLSP MLP 50%
E4 Wearable | E4 Wearable | 0.59 | RF | 0.56 | HC-MLP | 0.61 | HC-MLP
Lab-Based | Custom Wearable | 0.62 | CLSP CNN 5% | 0.61 | RF | 0.62 | RF
Lab-Based | E4 Wearable | 0.52 | HC-MLP | 0.57 | HC-MLP | 0.54 | HC-MLP
Lab-Based | Lab-Based | 0.52 | RF | 0.45 | HC-MLP | 0.47 | HC-MLP

Labeling-Based Transfer - Arousal Classification

Testing Label | Training Label | EDA F1 | EDA Best Model | PPG F1 | PPG Best Model | Combined F1 | Combined Best Model
Stimulus-Label | Expert-Annotated | 0.64 | CLSP MLP 5% | 0.72 | RF | 0.65 | CLSP MLP 50%
Stimulus-Label | Self-report | 0.62 | RF | 0.44 | CLSP CNN 5% | 0.57 | CLSP CNN 5%
Stimulus-Label | Stimulus-Label | 0.54 | RF | 0.51 | RF | 0.55 | HC-MLP
Self-report | Expert-Annotated | 0.65 | CLSP MLP 5% | 0.64 | CLSP CNN 50% | 0.69 | CLSP MLP 5%
Self-report | Stimulus-Label | 0.57 | HC-MLP | 0.51 | CLSP CNN 50% | 0.63 | RF
Self-report | Self-report | 0.53 | HC-MLP | 0.52 | HC-MLP | 0.52 | HC-MLP
Expert-Annotated | Self-report | 0.87 | RF | 0.69 | LDA | 0.84 | RF
Expert-Annotated | Stimulus-Label | 0.79 | CLSP CNN 50% | 0.70 | CLSP MLP 50% | 0.82 | RF
Expert-Annotated | Expert-Annotated | 0.52 | RF | 0.28 | RF | 0.48 | HC-MLP

Labeling-Based Transfer - Valence Classification

Testing Label | Training Label | EDA F1 | EDA Best Model | PPG F1 | PPG Best Model | Combined F1 | Combined Best Model
Stimulus-Label | Expert-Annotated | 0.65 | CLSP MLP 5% | 0.65 | CLSP CNN 50% | 0.65 | CLSP CNN 25%
Stimulus-Label | Self-report | 0.63 | CLSP CNN 25% | 0.61 | CLSP CNN 5% | 0.61 | CLSP CNN 5%
Stimulus-Label | Stimulus-Label | 0.61 | RF | 0.53 | RF | 0.52 | RF
Self-report | Expert-Annotated | 0.69 | CLSP MLP 50% | 0.72 | RF | 0.76 | CLSP CNN 50%
Self-report | Stimulus-Label | 0.57 | LDA | 0.59 | CLSP MLP 5% | 0.56 | LDA
Self-report | Self-report | 0.53 | RF | 0.48 | HC-MLP | 0.52 | HC-MLP
Expert-Annotated | Stimulus-Label | 0.87 | LDA | 0.85 | RF | 0.87 | CLSP CNN 5%
Expert-Annotated | Self-report | 0.83 | CLSP CNN 25% | 0.85 | CLSP CNN 50% | 0.74 | CLSP CNN 50%
Expert-Annotated | Expert-Annotated | 0.56 | HC-MLP | 0.42 | RF | 0.49 | HC-MLP

Gender-Based Transfer - Arousal Classification

Testing Group | Training Group | EDA F1 | EDA Best Model | PPG F1 | PPG Best Model | Combined F1 | Combined Best Model
Male | Female | 0.56 | HC-MLP | 0.51 | LDA | 0.54 | LDA
Male | Male | 0.56 | RF | 0.51 | HC-MLP | 0.56 | RF
Female | Female | 0.52 | RF | 0.55 | HC-MLP | 0.56 | HC-MLP
Female | Male | 0.50 | LDA | 0.51 | LDA | 0.53 | LDA

Gender-Based Transfer - Valence Classification

Testing Group | Training Group | EDA F1 | EDA Best Model | PPG F1 | PPG Best Model | Combined F1 | Combined Best Model
Male | Female | 0.69 | CLSP MLP 25% | 0.71 | RF | 0.70 | CLSP CNN 50%
Male | Male | 0.53 | HC-MLP | 0.52 | HC-MLP | 0.47 | RF
Female | Male | 0.71 | CLSP MLP 50% | 0.70 | CLSP CNN 50% | 0.70 | CLSP MLP 25%
Female | Female | 0.55 | HC-MLP | 0.49 | RF | 0.54 | HC-MLP

Age-Based Transfer - Arousal Classification

Testing Group | Training Group | EDA F1 | EDA Best Model | PPG F1 | PPG Best Model | Combined F1 | Combined Best Model
Old (>25 years) | Young (18-25 years) | 0.51 | LDA | 0.56 | HC-MLP | 0.56 | HC-MLP
Old (>25 years) | Old (>25 years) | 0.55 | RF | 0.55 | RF | 0.53 | RF
Young (18-25 years) | Young (18-25 years) | 0.55 | HC-MLP | 0.53 | RF | 0.58 | HC-MLP
Young (18-25 years) | Old (>25 years) | 0.50 | LDA | 0.43 | LDA | 0.47 | CLSP MLP 50%

Age-Based Transfer - Valence Classification

Testing Group | Training Group | EDA F1 | EDA Best Model | PPG F1 | PPG Best Model | Combined F1 | Combined Best Model
Old (>25 years) | Young (18-25 years) | 0.73 | CLSP MLP 50% | 0.72 | CLSP CNN 50% | 0.73 | RF
Old (>25 years) | Old (>25 years) | 0.53 | RF | 0.57 | RF | 0.53 | RF
Young (18-25 years) | Old (>25 years) | 0.72 | CLSP MLP 5% | 0.67 | RF | 0.69 | RF
Young (18-25 years) | Young (18-25 years) | 0.54 | RF | 0.51 | RF | 0.48 | RF

Model Architectures

CLSP Fine-Tuning Architecture

CLSP fine-tuning with conditional context optimization (CoCoOp)
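
For readers unfamiliar with conditional context optimization, the general CoCoOp formulation from the prompt-learning literature can be summarized as below. This is a sketch of the generic technique only: how FEEL instantiates it for CLSP is not described on this page, and reading x as a physiological-signal embedding (rather than an image feature, as in the original CoCoOp setting) is an assumption.

```latex
% x      : input embedding from the pretrained encoder (assumed: signal embedding)
% h_theta: lightweight meta-network producing an input-conditional token
% v_m    : M learnable context vectors;  c_i : embedding of class name i
% g      : text encoder;  sim : cosine similarity;  tau : temperature;  K : classes
\begin{align}
  \pi &= h_{\theta}(x) \\
  v_m(x) &= v_m + \pi, \qquad m = 1, \dots, M \\
  t_i(x) &= \{\, v_1(x), \dots, v_M(x), c_i \,\} \\
  p(y \mid x) &= \frac{\exp\!\big(\mathrm{sim}(x,\, g(t_y(x))) / \tau\big)}
                      {\sum_{i=1}^{K} \exp\!\big(\mathrm{sim}(x,\, g(t_i(x))) / \tau\big)}
\end{align}
```

Only the context vectors and the meta-network are trained in this formulation; the pretrained encoders stay frozen.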

Four Modeling Paradigms (16 Architectures)

1. Traditional ML

Models: RF, LDA

Input: Handcrafted features

  • Random Forest (RF)
  • Linear Discriminant Analysis (LDA)

Top model in 59 of 171 instances

2. DL + Handcrafted

Models: 4 variants

Input: Handcrafted features

  • MLP
  • ResNet
  • LSTM+MLP
  • Attention+MLP

Top model in 21 of 171 instances

3. DL on Raw Signals

Models: 3 variants

Input: Raw time-series

  • Signal ResNet
  • Signal LSTM+MLP
  • CNN+Transformer

Top model in 3 of 171 instances

4. Pretrained CLSP

Models: 7 variants

Input: Pretrained embeddings

  • Zero-Shot
  • MLP (5/25/50%)
  • CNN (5/25/50%)

Top model in 88 of 171 instances
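
The counts listed for the four paradigms above can be checked with a few lines of arithmetic. The sketch below sums them and converts each to a percentage share. The observation that 171 equals 19 datasets × 3 tasks × 3 signal configurations (EDA, PPG, Combined) is an inference from the leaderboards, not something this page states.

```python
# Number of times each paradigm produced the top model, as listed above.
TOP_COUNTS = {
    "Traditional ML": 59,
    "DL + Handcrafted": 21,
    "DL on Raw Signals": 3,
    "Pretrained CLSP": 88,
}

total = sum(TOP_COUNTS.values())
shares = {name: round(100 * n / total, 1) for name, n in TOP_COUNTS.items()}

# Inferred size of the leaderboard grid: datasets x tasks x signal configurations.
grid = 19 * 3 * 3

print(total, grid)  # 171 171
print(shares["Pretrained CLSP"])  # 51.5
```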

Key Model Insights

  • CLSP Leads Overall: Top model in 88/171 instances (51.5%), dominating the binary classification tasks
  • Few-Shot Power: 23 top instances use only 5% of the training data
  • Classical ML Competitive: RF and LDA remain strong on small datasets
  • Handcrafted Features Win: 166/171 (97%) of top models use domain knowledge
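
The CLSP MLP and CNN variants are fine-tuned with 5%, 25%, or 50% of the training data. One way to draw such a subset is sketched below, assuming class-stratified random sampling; the page does not say how the subsets are actually selected, and `stratified_fraction` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_fraction(labels, fraction, seed=0):
    """Pick roughly `fraction` of the sample indices from each class.

    Assumption: subsets are class-stratified; at least one sample per class is kept.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for indices in by_class.values():
        k = max(1, round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy label vector: 60 low-arousal (0) and 40 high-arousal (1) samples.
labels = [0] * 60 + [1] * 40
subsets = {f: stratified_fraction(labels, f) for f in (0.05, 0.25, 0.50)}
sizes = {f: len(idx) for f, idx in subsets.items()}
```

Fixing the seed keeps the few-shot subsets reproducible across models, so differences in F1 reflect the model rather than the draw.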

🎯 Contribute to FEEL

Help expand the FEEL benchmark by submitting your model results or proposing new datasets. We welcome contributions that evaluate novel architectures, introduce new preprocessing techniques, or extend analysis to additional heterogeneity dimensions.

📤 Submit Results

Citation

Accepted at NeurIPS 2025.
Citation link coming soon.

© 2025 FEEL Benchmark | NeurIPS 2025 | IIIT-Delhi & RIT