Abstract | Estimating emotional states from speech can play an important role in many fields. This doctoral dissertation presents a system for estimating emotional states, based on acoustic features of the speech signal, which can find application in psychotherapy and in the selection and training of candidates for stressful and high-responsibility operations. Because of this potential, special emphasis is placed on the estimation of speech under stress and on stimulating subjects with startle stimuli. The neurobiological basis of emotions is investigated, as well as the influence of emotions on the biological mechanisms of speech production and, consequently, on individual acoustic parameters and features of the voice. Voice perturbation measures are proposed, that is, features describing the influence of limbic structures on disturbances in the coordination of the antagonistic process of vocal fold vibration, and these yielded significant discrimination of the level of stress in the voice. They were also found to be robust to voluntary components of speech, specifically the dynamics of the fundamental frequency during an utterance, where conventional perturbation measures (jitter) proved less successful. The influence of intense, impulse-shaped sound stimuli, that is, startle stimuli, on changes in the fundamental frequency of the voice is analyzed. So-called fear-potentiated startle reactions are widely used in the diagnosis of posttraumatic stress disorder, namely in fear conditioning and extinction paradigms. Electromyography of the orbicularis oculi muscle, that is, eyeblink analysis, is currently used as the conventional measure for predicting startle reactions. In this dissertation, a comparative analysis of the fundamental frequency response and the orbicularis oculi muscle response was carried out, and consistencies and similar response properties were established. Furthermore, an improvement of the conventional architecture of the system for estimating the emotional dimensions, valence and arousal, is proposed, using a priori knowledge of the relationship between these dimensions. 
Analyses confirmed that using this architecture improves estimation accuracy. |
Abstract (english) | This doctoral thesis is the result of research on the project “Adaptive Control of Scenarios in VR Therapy of PTSD”, which aims to develop a collaborative and intelligent agent that, as decision-making support, could be applied in a number of areas such as the prediction, selection, diagnosis and treatment of mental disorders, especially those caused by stress. The thesis explores the problem of estimating emotional states, stress and acoustic startle responses from acoustic speech features. Emphasis is placed on evaluating the features with statistical analysis methods in the context of these problems. New voice perturbation features are proposed and evaluated that describe the impact of limbic structures on the neural regions responsible for coordinating the antagonistic process of vocal fold vibration. A comparative analysis of changes in speech fundamental frequency (F0) and the electromyographic (EMG) response of the orbicularis oculi muscle was performed. The thesis also proposes an improvement of the conventional system architecture for estimating the emotional dimensions, valence and arousal, using a priori knowledge about the relation between these two dimensions. The introductory chapter defines the domains, motivation and objectives of the research, noting the inherent interdisciplinarity of the research field. The scientific contributions and the structure of the dissertation are also defined in this chapter. The second chapter describes the neurobiological processes through which emotions affect speech production mechanisms. The influence of emotions on the respiration, phonation and articulation mechanisms of speech is explored. Special attention is given to the internal muscles of the larynx, i.e. the phonation mechanisms, which, due to their sensitive structures, are most vulnerable to the impact of emotions. 
The acoustic speech features commonly used for estimating emotional states and stress are described in the third chapter. Furthermore, a decomposition of the speech fundamental frequency is proposed in which the components selectively capture specific neurobiological processes of emotion. Speech perturbation features are proposed that describe the time and amplitude aspects of the disturbance in vocal fold oscillation, which is a consequence of the influence of the limbic system on the cerebellum and brainstem. The proposed features are validated on artificially generated speech perturbations and on speech under stress. In most cases, the proposed features differed statistically significantly across levels of speech perturbation and levels of stress. They also proved satisfactorily robust to the voluntary component of pronunciation, in particular the dynamics of the fundamental frequency, which is their main advantage over conventional speech perturbation measures (e.g. jitter measures). In the fourth chapter, F0 features are validated in the context of the acoustic startle response. Features such as peak value, peak time and duration are validated against changes in the parameters of the startle stimulus, i.e. its intensity, duration, rise time and spectral characteristics, as well as against the existence and intensity of the startle response. A comparative analysis is performed between F0 response features and EMG features of the orbicularis oculi muscle response (eyeblink), which is considered the reference measure for startle reaction analysis. The analyses showed similar behavior of the F0 and EMG responses when the intensity of the startle stimulus was changed. In both cases, the most statistically significant difference is obtained for the response peak value. 
At higher stimulus intensities, the peak values of the F0 and EMG responses showed a significant increasing trend with stimulus intensity. The fifth chapter describes the methodology of emotional state estimation based on acoustic speech features, which is conventionally performed in four sequential steps: speech signal processing with the extraction of acoustic measures; feature calculation from the acoustic measures; reduction of the feature space; and estimation of emotional states using machine learning methods. The thesis proposes an upgrade of the conventional architecture for estimating the emotional dimensions, valence and arousal, based on a priori relationships between the two dimensions. An a priori model is applied to the conventional estimation process in order to shift the estimation results in the valence-arousal space toward more probable values, according to the level of estimation uncertainty. Two approaches to modeling a priori knowledge are examined: (a) a single integral model over the valence-arousal space, and (b) an integration of multiple models that represent different discrete emotions in the valence-arousal space, specifically happiness, sadness, fear, anger and the neutral state. The emotional state estimation system is built and validated using utterances from the Croatian emotional speech corpus, which was collected and annotated in collaboration with the University of Zagreb, Faculty of Humanities and Social Sciences. In the sixth chapter, machine learning methods, specifically support vector machines and random forests, are validated for the estimation of emotional states, stress and startle responses. In this context, the improvements proposed in the thesis are compared with conventional approaches from the literature. 
The results justify introducing the new speech perturbation features for classifying speech under stress, applying F0 features to startle response analysis, and using the proposed enhanced method for estimating emotional states. The last chapter concludes the doctoral thesis, discusses specific applications of the proposed methods, and provides suggestions for future related research. |
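
The a priori correction described for the fifth chapter, which shifts a valence-arousal estimate toward more probable values according to the estimation uncertainty, can be illustrated with a minimal sketch. It assumes, purely for illustration, that the prior over the valence-arousal space is a single Gaussian and that the estimator's uncertainty is an isotropic Gaussian variance. The abstract does not specify either choice, so this is one possible reading rather than the method used in the thesis. The function name `shift_toward_prior` and all numeric values are hypothetical.

```python
import numpy as np


def shift_toward_prior(estimate, estimate_var, prior_mean, prior_cov):
    """Fuse a (valence, arousal) estimate with a Gaussian prior.

    Hypothetical helper. Assumptions (not taken from the thesis):
    - the prior over valence-arousal space is one Gaussian
      with mean `prior_mean` and covariance `prior_cov`;
    - the estimator's uncertainty is isotropic with variance `estimate_var`.
    Returns the posterior mean of the two Gaussians combined.
    """
    estimate = np.asarray(estimate, dtype=float)
    prior_mean = np.asarray(prior_mean, dtype=float)
    prior_cov = np.asarray(prior_cov, dtype=float)

    # Covariance of the raw estimate (isotropic by assumption).
    est_cov = estimate_var * np.eye(2)

    # Weight given to the raw estimate: near identity when the estimate
    # is certain, near zero when it is very uncertain.
    gain = prior_cov @ np.linalg.inv(prior_cov + est_cov)

    # Move from the prior mean toward the estimate by that weight.
    return prior_mean + gain @ (estimate - prior_mean)


if __name__ == "__main__":
    prior_mean = [0.0, 0.0]   # hypothetical prior centre
    prior_cov = np.eye(2)     # hypothetical prior spread
    raw = [0.8, 0.2]          # hypothetical raw (valence, arousal) estimate

    # An estimate with variance 1.0 is pulled halfway toward the prior mean.
    print(shift_toward_prior(raw, 1.0, prior_mean, prior_cov))
```

In this sketch, a confident estimate (small `estimate_var`) is left almost unchanged, while an uncertain one is pulled strongly toward the prior mean. Approach (b) in the abstract, with one model per discrete emotion, would need a mixture of such priors, which this sketch does not cover.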