AI detects stress from voice alone in new SNUBH study

2025-04-08     Kim Ji-hye

An AI model developed by Seoul National University Bundang Hospital (SNUBH) can now detect stress with up to 77.5 percent accuracy—based entirely on a person’s voice.

Trained on samples from more than 100 Korean full-time workers, the deep learning system flags stress by analyzing subtle non-verbal cues such as tone, pitch, and breathing rhythm. The findings, published in Psychiatry Investigation, describe one of the first voice-based stress models validated against biosignals and built specifically for a Korean population.

The research team, led by Professor Kim Jeong-hyun of SNUBH’s Department of Public Health Medical Services and supported by SNU’s Institute of New Media and Communications, used ECAPA-TDNN—an AI architecture originally designed for speaker recognition. Participants recorded their voices before and after undergoing a standardized stress-inducing protocol: the Socially Evaluated Cold Pressor Test, which involves hand immersion in ice water while being observed.
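In practical terms, repurposing a speaker-recognition model this way means extracting a fixed-length voice embedding from each recording and feeding it to a classifier. The sketch below shows one way to do that with SpeechBrain's publicly available ECAPA-TDNN speaker model; the toolkit choice, file name, and the simple stress-classification head are illustrative assumptions, not details from the SNUBH study.

```python
# Minimal sketch (not the authors' code): pull an ECAPA-TDNN voice embedding
# from a recording and pass it to a toy stress classifier.
# Uses SpeechBrain's public speaker-recognition model; the stress head,
# its weights, and the input file are hypothetical placeholders.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference in newer releases

# Pretrained ECAPA-TDNN originally trained for speaker recognition
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_ecapa",
)

signal, sr = torchaudio.load("participant_clip.wav")      # hypothetical file
signal = signal.mean(dim=0, keepdim=True)                 # mix down to mono: (1, time)
if sr != 16000:                                           # the pretrained model expects 16 kHz
    signal = torchaudio.functional.resample(signal, sr, 16000)

embedding = encoder.encode_batch(signal)                  # shape: (1, 1, 192)

# Hypothetical binary head: stressed vs. not stressed
stress_head = torch.nn.Linear(192, 2)
logits = stress_head(embedding.squeeze(1))
print(torch.softmax(logits, dim=-1))
```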


To confirm whether stress had been successfully induced, the study combined AI prediction scores with biological and self-reported markers—salivary cortisol and distress thermometer readings. Only data from participants who showed measurable stress responses were used to train and validate the model.

Compared with baseline models such as convolutional neural networks and Conformers, ECAPA-TDNN consistently delivered higher performance, especially on free-form speech. The model was trained on data from 95 subjects and tested on a separate group of 20, correctly identifying stress in 70 percent of the held-out participants.
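A subject-level split of this kind keeps every participant's recordings entirely in either the training or the test set. The snippet below illustrates the idea with scikit-learn's GroupShuffleSplit; the clip counts and arrays are random placeholders, not the study's data.

```python
# Illustrative subject-level split: hold out 20 of 115 speakers so that no
# participant's audio appears in both training and test sets.
# All arrays below are random placeholders, not the study's data.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_subjects, clips_per_subject = 115, 20
n_clips = n_subjects * clips_per_subject

X = np.random.rand(n_clips, 80, 400)                   # placeholder Mel-spectrogram chunks
y = np.random.randint(0, 2, n_clips)                   # placeholder stress labels
subject_ids = np.repeat(np.arange(n_subjects), clips_per_subject)

splitter = GroupShuffleSplit(n_splits=1, test_size=20, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subject_ids))

# Sanity check: the 20 held-out speakers never appear in the training set
assert set(subject_ids[train_idx]).isdisjoint(subject_ids[test_idx])
print(len(set(subject_ids[train_idx])), "training speakers,",
      len(set(subject_ids[test_idx])), "test speakers")
```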

Instead of focusing on what people said, the model zeroed in on how they said it, capturing stress-related shifts in vocal tension, rhythm, and tempo. Because it relies only on non-linguistic features, the researchers noted, the system avoids common sources of bias tied to language fluency, education level, or cultural background. They added that all processing took place locally on the device, keeping privacy risks low.

The study was supported by SK Telecom and conducted at both SNUBH and Boramae Medical Center. In all, 115 participants read a neutral essay aloud and responded to casual prompts about their daily lives. Audio recordings were segmented into overlapping four-second chunks and converted into Mel spectrograms, a common feature representation in voice-based AI.
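For a concrete picture of that preprocessing step, here is a minimal sketch that slices an audio file into four-second windows and turns each into a Mel spectrogram with torchaudio. The 16 kHz sample rate, 50 percent overlap, and Mel settings are assumptions for illustration; the article does not specify the study's exact configuration.

```python
# Minimal sketch of the preprocessing described above: slicing audio into
# overlapping four-second windows and converting each to a Mel spectrogram.
# Sample rate, hop, and Mel settings are illustrative assumptions.
import torch
import torchaudio

SAMPLE_RATE = 16000
WINDOW_SEC = 4.0        # four-second chunks, per the article
HOP_SEC = 2.0           # assumed 50% overlap between chunks

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=400, hop_length=160, n_mels=80
)

def to_mel_chunks(path: str) -> list[torch.Tensor]:
    waveform, sr = torchaudio.load(path)
    if sr != SAMPLE_RATE:
        waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
    waveform = waveform.mean(dim=0)                     # mix down to mono

    win = int(WINDOW_SEC * SAMPLE_RATE)
    hop = int(HOP_SEC * SAMPLE_RATE)
    chunks = []
    for start in range(0, max(len(waveform) - win + 1, 1), hop):
        segment = waveform[start:start + win]
        if len(segment) == win:                         # drop short tail segments
            chunks.append(mel(segment))                 # (n_mels, frames)
    return chunks

# spectrograms = to_mel_chunks("participant_clip.wav")  # hypothetical file
```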

The technology has not yet been commercialized, but the team believes it could eventually power real-time stress monitoring in consumer devices. Future iterations may integrate additional biometric inputs, such as heart rate variability or skin conductance, to further boost accuracy, Professor Kim said in a statement.
