
✔️ Automatic Depression Detection: An Emotional Audio-Textual Corpus and a GRU/BiLSTM-Based Model


Contents

  • ABSTRACT
  • ✔️ 1. INTRODUCTION
  • ✔️ 2. RELATED WORK AND OUR CONTRIBUTIONS
    • ✔️ Automatic depression detection.
    • ✔️ Our motivations and contributions.
  • ✔️ 3. EATD-CORPUS
    • ✔️ Data collection.
    • ✔️ Data preprocessing.
  • ✔️ 4. A MULTI-MODAL DEPRESSION DETECTION METHOD
    • ✔️ 4.1. Features
    • ✔️ 4.2. BiLSTM with Attention Layer
    • 4.3. Gated Recurrent Unit Neural Network
    • ✔️ 4.4. Multi-modal Fusion
  • ✔️ 5. EXPERIMENTS AND RESULTS
    • ✔️ 5.1. DAIC-WoZ Dataset
    • ✔️ 5.2. Data Imbalance
    • ✔️ 5.3. Performance Evaluation on DAIC-WoZ Dataset
    • ✔️ 5.4. Performance Evaluation on EATD-Corpus Dataset
  • 📒ADD
      • How ELMo embeddings work
      • Characteristics of ELMo embeddings
      • The VLAD algorithm

ABSTRACT

Index Terms— Depression detection, Multi-modal fusion, EATD-Corpus
⭐️ Tip: when a translation looks wrong or a passage is hard to follow, try a different translation tool.

✔️ 1. INTRODUCTION

Depression is a common mental disorder, the three main symptoms of which are persistent low mood, loss of interest and lack of energy [1, 2]. In the worst case, depression can lead to suicide. According to World Health Organization reports, about 264 million people are suffering from depression worldwide [3]. However, the treatment rate of depressed people remains very low worldwide [4]. Two main factors account for the low treatment rate. Firstly, traditional treatments for depression are time-consuming, costly and sometimes ineffective [5]. The cost of diagnosis and treatment can be a heavy burden for individuals with financial difficulties, and thus makes them reluctant to seek help from physicians. Secondly, during the clinical interviews of depression diagnosis, patients may hide their real mental states for fear of prejudice or discriminatory behavior towards depressed people [6, 7]. In such cases, the clinician is unable to make a correct diagnosis. These factors call for an automatic depression detection system, which can help individuals assess their depressive states privately and increase their willingness to consult psychologists. Furthermore, such a system would be of great help to psychologists in depression diagnosis when patients hide their real mental states.

✔️ 2. RELATED WORK AND OUR CONTRIBUTIONS

✔️ Automatic depression detection.

Early studies of automatic depression detection were dedicated to extracting effective features from questions that were highly correlated with depression. Sun et al. [8] conducted content analysis of the text transcripts of clinical interviews and manually selected questions related to certain topics (e.g. sleep quality or recent feelings). Based on the text features extracted from the selected questions, they used Random Forest to detect depression tendency. Similarly, Yang et al. [9] also manually selected depression-related questions after analyzing interview transcripts. They constructed a decision tree with the selected questions to predict the participants’ depression states. Gong and Poellabauer [10] performed topic modeling to split the interviews into topic-related segments, from which audio, video, and semantic features were extracted. They employed a feature selection algorithm to retain the most discriminating features. Williamson et al. [11] constructed semantic context indicators related to factors such as depression diagnosis, medical/psychological therapy or negative feelings. Using a Gaussian Staircase Model, they achieved good performance in depression detection.

Summary: early work on automatic depression detection relied on hand-selected questions and engineered features, combined with Random Forests, decision trees, topic modeling, and semantic context indicators.

Inspired by the emerging deep learning techniques, integrating multi-modal features through deep learning models is particularly promising for depression detection. Yang et al. [12] presented a depression detection model based on a deep Convolutional Neural Network (CNN). They additionally designed a set of audio and video descriptors to train their model. Tuka et al. [13] proposed a Long Short-Term Memory (LSTM) network to assess depression tendency. They calculated Pearson coefficients to select audio features and text features that were strongly related to depression severity. With the combination of CNN and LSTM, Ma et al. [14] encoded the depressive audio characteristics to predict the presence of depression. Haque et al. [7] proposed a causal CNN model which summarized acoustic, visual and linguistic features into embeddings which were then used to predict depressive states.

Summary: deep-learning approaches (CNN, LSTM, causal CNN) integrate multi-modal features to improve depression detection.

✔️ Our motivations and contributions.

In the field of automatic depression detection, several limitations exist in current research. First of all, some methods rely heavily on manually selected questions, which requires psychologists’ expertise. Besides, all these preset questions have to be answered during the interview, otherwise the analysis may fail. How to improve detection performance without preset questions remains a challenging task. In addition, publicly available depression datasets are scarce due to ethical issues. In this work, we make efforts to overcome the aforementioned drawbacks:

Summary: the main open problems are the dependence on preset, expert-selected questions and the scarcity of public datasets; the paper targets both.

(1) To facilitate study of depression detection, we first establish EATD-Corpus, a publicly available Chinese depression dataset, which comprises audios and text transcripts extracted from the interviews of 162 volunteers.

(2) We then propose a novel method for automatic depression detection. In this method, a Gated Recurrent Unit (GRU) model and a Bidirectional Long Short-Term Memory (BiLSTM) model with an attention layer are utilized to summarize representations from audio and text features. In addition, a multi-modal fusion network integrates the summarized features to detect depression.

✔️ 3. EATD-CORPUS

Depression datasets are quite scarce [15–19]. To the best of our knowledge, there are only two publicly available datasets for depression detection. The first one is DAIC-WoZ, which contains recordings and transcripts of 142 American participants who were clinically interviewed by a computer agent [16]. The second one is AViD-Corpus [20], which contains audios and videos of German participants answering a set of queries or reciting fables. However, the transcripts are not provided by the authors.

In this work, we release a new Chinese depression dataset, namely EATD-Corpus, to facilitate the research in depression detection. EATD-Corpus consists of audios and text transcripts extracted from the interviews of 162 student volunteers recruited from Tongji University. All the volunteers have signed informed consents and guarantee the authenticity of all the information provided. Each volunteer is required to answer three randomly selected questions and complete an SDS questionnaire. The SDS questionnaire consists of 20 items which rate the four common characteristics of depression: the pervasive affect, the physiological equivalents, other disturbances, and psychomotor activities [21]. SDS is a questionnaire commonly used by psychologists to screen depressed individuals in practice. A raw SDS score can be summarized from the questionnaire. For Chinese people, an index SDS score (i.e. raw SDS score × 1.25) greater than or equal to 53 implies that he/she is in depression [22]. According to this criterion, there are 30 depressed volunteers and 132 non-depressed volunteers in EATD-Corpus. The overall duration of response audios in the dataset is about 2.26 hours.
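The labeling rule above (index score = raw score × 1.25, depressed if the index score is at least 53) can be sketched in a few lines. The function names are hypothetical, and the 20 to 80 range check assumes the usual SDS convention of 20 items rated 1 to 4 each, which the text does not state.

```python
def sds_index_score(raw_score: int) -> float:
    """Convert a raw SDS score to the index score (raw x 1.25)."""
    # Assumption: 20 items rated 1-4 each, so the raw score lies in [20, 80].
    if not 20 <= raw_score <= 80:
        raise ValueError("raw SDS score must lie in [20, 80]")
    return raw_score * 1.25


def is_depressed(raw_score: int, threshold: float = 53.0) -> bool:
    """Apply the cut-off quoted above: index score >= 53 counts as depressed."""
    return sds_index_score(raw_score) >= threshold


print(sds_index_score(42), is_depressed(42))  # 52.5 False
print(sds_index_score(43), is_depressed(43))  # 53.75 True
```

With this cut-off, a raw score of 43 is the lowest one that yields a depressed label.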

The process of constructing EATD-Corpus consists of two steps: data collection and data preprocessing.

✔️ Data collection.

An app was developed to conduct the interviews and collect audio responses. In the app, a virtual interviewer asks the interviewee three questions. The interviewees can record their responses and upload the response audios online. Besides, each volunteer is required to complete an SDS questionnaire, the score of which indicates the depression severity. Currently, 162 volunteers have successfully finished online interviews. Based on their SDS scores, 30 volunteers are regarded as depressed and the other 132 volunteers as non-depressed.

✔️ Data preprocessing.

Several preprocessing operations are performed on the collected audios. First, silent audios, audios shorter than 1 second, and the silent segments at the beginning and the end of each recording are removed. Then the background noise is eliminated using RNNoise [23] with default parameters. After that, Kaldi [24] is used to extract transcripts from the audios. In the end, all the transcripts are manually checked and corrected.
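The first preprocessing step (dropping silent clips, trimming leading and trailing silence, discarding clips under 1 second) might look roughly like the sketch below. It is not the authors' code: the amplitude threshold, the function names, and the choice to check the duration after trimming are all assumptions, and a plain list of samples stands in for real audio.

```python
from typing import List, Optional


def trim_silence(samples: List[float], threshold: float = 0.01) -> List[float]:
    """Drop leading and trailing samples whose magnitude is below `threshold`."""
    voiced = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not voiced:
        return []  # the whole clip is silent
    return samples[voiced[0]: voiced[-1] + 1]


def preprocess(samples: List[float], sample_rate: int,
               min_seconds: float = 1.0,
               threshold: float = 0.01) -> Optional[List[float]]:
    """Return the trimmed clip, or None if it is silent or too short."""
    trimmed = trim_silence(samples, threshold)
    # Assumption: the 1-second rule is applied to the trimmed clip.
    if len(trimmed) < min_seconds * sample_rate:
        return None
    return trimmed


clip = [0.0] * 5 + [0.5] * 10 + [0.0] * 5
print(preprocess(clip, sample_rate=8) is not None)   # True: 10 samples >= 1 s at 8 Hz
print(preprocess(clip, sample_rate=16) is not None)  # False: 10 samples < 1 s at 16 Hz
```

Denoising (RNNoise) and transcription (Kaldi) are separate tools and are not reproduced here.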

✔️ 4. A MULTI-MODAL DEPRESSION DETECTION METHOD

In addition, we propose an efficient method for automatic depression detection. As shown in Fig. 1, the proposed approach consists of a GRU model and a BiLSTM model with an attention layer. The two models summarize audio and text representations, which are then concatenated and passed to a one-layer fully connected (FC) network. Modal attention is a trained weight vector predicting the importance of the two modalities. The FC network outputs a binary label indicating the presence of depression.

✔️ 4.1. Features

In our method, text and audio features are used to predict the depression state. Text features are extracted by projecting transcript sentences into high-dimensional sentence embeddings using ELMo [25]. For audio features, Mel spectrograms are extracted from the audios. However, the sizes of the extracted Mel spectrograms vary greatly because the lengths of the audios range from 2 seconds to 1 minute. Therefore, NetVLAD [26] is further adopted to generate audio embeddings of the same length from Mel spectrograms.

✔️ 4.2. BiLSTM with Attention Layer

To extract text features, a BiLSTM with an attention layer is adopted to emphasize which sentences contribute most to depression detection.

Attention is defined in Eq. 1, where $X$ is the input text features. $\boldsymbol{O}$ consists of $\mathbb{O}_f$ and $\mathbb{O}_b$, representing the forward and backward outputs of the BiLSTM respectively. $\omega$ is the weight vector learned from $\boldsymbol{O}$, $\boldsymbol{c}$ is the weighted context, and $\boldsymbol{y}$ is the final output with attention.

The detailed configuration of the proposed BiLSTM is listed in Table 1. The model consists of two BiLSTM layers, the output of which is fed into the attention layer for weight calculation. The following two-layer FC network predicts whether the participant is in depression.
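Eq. 1 itself is not reproduced in these notes, so the sketch below shows only one common form of attention pooling over BiLSTM outputs, not necessarily the authors' formulation. The assumptions are: the forward and backward outputs are summed per time step, each step is scored by a learned vector `w` applied to `tanh` of the output, and the scores are normalized with a softmax. All names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(forward: List[Vector], backward: List[Vector],
                   w: Vector) -> Tuple[Vector, List[float]]:
    """Return the attention-weighted context vector and the per-step weights."""
    # Assumption: the two directions are combined by element-wise sum.
    outputs = [[f + b for f, b in zip(of, ob)]
               for of, ob in zip(forward, backward)]
    # One scalar score per time step: w . tanh(o_t).
    scores = [sum(wi * math.tanh(oi) for wi, oi in zip(w, o)) for o in outputs]
    alphas = softmax(scores)
    # Context vector: weighted sum of the outputs over time steps.
    context = [sum(a * o[d] for a, o in zip(alphas, outputs))
               for d in range(len(w))]
    return context, alphas


fwd = [[0.1, 0.2], [0.9, 0.8], [0.0, 0.1]]
bwd = [[0.0, 0.1], [0.7, 0.9], [0.1, 0.0]]
context, alphas = attention_pool(fwd, bwd, w=[1.0, 1.0])
print(max(range(3), key=lambda t: alphas[t]))  # 1: the second sentence gets the most weight
```

The weights `alphas` are what make the model interpretable in the sense described above: a larger weight marks a sentence that contributes more to the prediction.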

4.3. Gated Recurrent Unit Neural Network

✔️ 4.4. Multi-modal Fusion

To integrate audio and text information, the representations generated by the last layers of the GRU model and the BiLSTM model are concatenated horizontally. Modal attention is a weight vector trained to represent the importance of the different modalities. The dot product of the attention vector and the concatenated representations produces the weighted representation, which is then passed to a one-layer FC network. A loss function is then derived as defined in Eq. 2, where $m$ is the adopted modality, $\ell$ is the cross-entropy loss function defined in Eq. 3, $x_m$ is the representation vector of $m$, $\omega_m$ is the weight of the FC network with respect to $m$, and $y$ is the ground truth.
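A minimal sketch of the fusion step follows, under stated assumptions. Eqs. 2 and 3 are not reproduced in these notes, so this is not the authors' exact loss. The "dot product" is read here as an element-wise product (a true dot product would collapse the representation to a scalar), the FC layer has a single sigmoid output, and the loss is standard binary cross-entropy. All names and toy dimensions are hypothetical.

```python
import math
from typing import Sequence


def fuse(audio_repr: Sequence[float], text_repr: Sequence[float],
         modal_attention: Sequence[float], fc_weights: Sequence[float],
         fc_bias: float) -> float:
    """Return the predicted probability of depression for one sample."""
    concat = list(audio_repr) + list(text_repr)  # horizontal concatenation
    if not (len(concat) == len(modal_attention) == len(fc_weights)):
        raise ValueError("attention and FC weights must match the concatenated size")
    # Assumption: modal attention re-weights each coordinate element-wise.
    weighted = [a * x for a, x in zip(modal_attention, concat)]
    logit = sum(w * x for w, x in zip(fc_weights, weighted)) + fc_bias
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(p: float, y: int, eps: float = 1e-12) -> float:
    """Standard cross-entropy between a predicted probability and a 0/1 label."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


# Toy sizes: 2-dimensional audio and text representations.
prob = fuse([0.2, -0.1], [0.5, 0.3],
            modal_attention=[0.4, 0.4, 0.6, 0.6],
            fc_weights=[1.0, -1.0, 2.0, 0.5], fc_bias=0.0)
print(prob > 0.5, binary_cross_entropy(prob, 1) < binary_cross_entropy(prob, 0))
```

In the paper's setting the concatenated vector would combine the 256-dimensional audio and 128-dimensional text representations mentioned in Section 5.3.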

✔️ 5. EXPERIMENTS AND RESULTS

✔️ 5.1. DAIC-WoZ Dataset

The DAIC-WoZ dataset is a public English depression dataset that contains recordings and transcripts of 142 participants, each of which is labeled with a PHQ-8 score [16]. The PHQ-8 questionnaire is another popular questionnaire for depression screening, with fewer questions than SDS. A participant whose PHQ-8 score is greater than or equal to 10 is regarded as depressed. The DAIC-WoZ dataset consists of a training set (30 depressed and 77 non-depressed), a development set (12 depressed and 23 non-depressed) and a test set which is not publicly available [16]. Because AViD-Corpus does not provide text information, the experiments are performed on DAIC-WoZ and EATD-Corpus.

Note: for DAIC-WoZ, the two classes are balanced by group resampling (described in Section 5.2), so that the model sees enough examples of both classes during training.

✔️ 5.2. Data Imbalance

Data imbalance is severe in depression datasets. An unbalanced dataset biases the trained classification models toward the non-depressed class. Therefore, the sizes of the depressed and non-depressed classes need to be balanced before training. In this work, resampling is utilized to address the data imbalance issue.

For the DAIC-WoZ dataset, samples in the two classes are equalized by group resampling. Every 10 responses of one participant are grouped, along with the corresponding audios and text transcripts. Samples are randomly selected from different groups of depressed participants without redundancy until the numbers of samples in the two classes are equal. For example, a balanced training set consisting of 77 depressed samples and 77 non-depressed samples can be constructed from the DAIC-WoZ dataset. It should be noted that resampling is only performed on the training sets. In the testing phase, only one segment of audios and transcripts is randomly selected from each individual’s responses in the development/test set and used for evaluation.

Note: restricting resampling to the training set keeps the evaluation data independent, so the reported scores are not distorted by resampled duplicates.

Note: balancing the training set deliberately departs from the real-world class distribution. The aim is to keep the classifier from neglecting the minority (depressed) class.
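The group resampling described above can be sketched as drawing distinct groups from the depressed participants until the class sizes match. This is an illustrative sketch, not the authors' code: the data layout (a dict from participant id to that participant's groups), the function name, and the assumption that each non-depressed participant contributes one sample are all hypothetical.

```python
import random
from typing import Dict, List


def balance_by_group_resampling(depressed_groups: Dict[str, List[list]],
                                n_non_depressed: int,
                                seed: int = 0) -> List[list]:
    """Pick `n_non_depressed` distinct groups from the depressed participants."""
    rng = random.Random(seed)
    # Every (participant, group index) pair is one candidate sample.
    pool = [(pid, idx)
            for pid, groups in depressed_groups.items()
            for idx in range(len(groups))]
    if len(pool) < n_non_depressed:
        raise ValueError("not enough distinct depressed groups to balance the classes")
    # random.sample draws without replacement, i.e. "without redundancy".
    chosen = rng.sample(pool, n_non_depressed)
    return [depressed_groups[pid][idx] for pid, idx in chosen]


# Toy data: two depressed participants with 3 and 2 groups of 10 responses each.
groups = {
    "p1": [["p1-g0"] * 10, ["p1-g1"] * 10, ["p1-g2"] * 10],
    "p2": [["p2-g0"] * 10, ["p2-g1"] * 10],
}
balanced = balance_by_group_resampling(groups, n_non_depressed=4)
print(len(balanced))  # 4
```

For the DAIC-WoZ training set, `n_non_depressed` would be 77 and the pool would hold the groups of the 30 depressed participants.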

For EATD-Corpus, the volunteers’ responses are rearranged to increase the size of the depressed class. The order of the three responses is permuted, and the rearranged responses are resampled to create new training samples. Because there are 6 ways to order the three responses of each individual, the size of the depressed class can be enlarged 6 times.

Note: this is a form of data augmentation. It creates new training samples by reordering existing data, at no extra collection cost. Because the depressed class in EATD-Corpus is small (30 of 162 volunteers), it eases the class imbalance.

Note: reordered responses may not match the natural order of a conversation, so the augmentation may introduce noise or bias. Its effect on model performance should be evaluated, possibly alongside other augmentation or preprocessing techniques.

Note: as with DAIC-WoZ, rearrangement and resampling are applied only to the training set. Evaluation uses the original, unaugmented data.
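The six-fold enlargement follows from 3! = 6 orderings of three responses. A small sketch with a hypothetical function name and placeholder strings in place of real audio/text features:

```python
from itertools import permutations
from typing import List, Sequence


def augment_by_reordering(responses: Sequence[str]) -> List[List[str]]:
    """Return every ordering of a volunteer's three responses."""
    if len(responses) != 3:
        raise ValueError("expected exactly three responses per volunteer")
    return [list(order) for order in permutations(responses)]


samples = augment_by_reordering(["answer_1", "answer_2", "answer_3"])
print(len(samples))  # 6
print(samples[0])    # ['answer_1', 'answer_2', 'answer_3']
```

The original order is one of the six, so each depressed volunteer contributes one original and five new training samples.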

✔️ 5.3. Performance Evaluation on DAIC-WoZ Dataset

In the text transcripts of DAIC-WoZ, responses to the same question are concatenated and encoded as the average of all three layer embeddings from ELMo [25]. A matrix of N × 1024 is obtained for each participant, where N is the number of questions. To address the data imbalance issue, the matrix is divided into m smaller matrices of size 10 × 1024, where m is the integer part of N divided by 10. Resampling is performed on the divided matrices of depressed participants.
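Splitting an N × 1024 matrix into m = ⌊N / 10⌋ blocks of 10 rows can be sketched as below. The function name is hypothetical, and dropping the leftover rows (N mod 10) is an assumption read from "the integer part of N divided by 10". The example uses 4-dimensional rows instead of 1024 to stay small.

```python
from typing import List

Row = List[float]


def split_into_groups(rows: List[Row], group_size: int = 10) -> List[List[Row]]:
    """Split a per-participant matrix into consecutive blocks of `group_size` rows."""
    m = len(rows) // group_size  # integer part of N / group_size
    # Assumption: the trailing N % group_size rows are discarded.
    return [rows[i * group_size:(i + 1) * group_size] for i in range(m)]


matrix = [[float(i)] * 4 for i in range(25)]  # N = 25 responses
blocks = split_into_groups(matrix)
print(len(blocks), len(blocks[0]))  # 2 10
```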

The corresponding audio is segmented based on the timestamps in the text transcript. NetVLAD is applied to generate 256-dimensional audio embeddings from the extracted Mel spectrograms. As with the text features, the matrix obtained for each participant is divided and resampling is performed.

After extracting audio and text embeddings, a GRU model and a BiLSTM model with an attention layer are trained. Then, the 128-dimensional text and 256-dimensional audio representations are concatenated horizontally to train modal attention. The dot products of the concatenated representations and modal attention are fed into the multi-modal network, which produces binary labels.

For performance comparison, F1 score, Recall and Precision values are reported. The performances of our approach together with some existing methods for depression detection are summarized in Table 3. From Table 3, it can be seen that, among the methods adopting only audio features, the proposed GRU model yields the highest performance with an F1 score of 0.77. Among the methods adopting only text features, the proposed BiLSTM model achieves the second-best performance with an F1 score of 0.83, which is merely 0.01 lower than the best method. The proposed multi-modal fusion method produces the best result with an F1 score of 0.85. Compared with the other method accepting both audio and text features, our method achieves a much better performance. In addition, the Recall values of our proposed single-modality models and fusion model are close to 1. This indicates that our method can identify most of the depressed participants in practice.
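For reference, the three reported metrics can be computed from predictions as in the sketch below, using the standard definitions with the depressed class as the positive label. The function name and the toy labels are illustrative and are not data from the paper.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int],
                        positive: int = 1) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy case: every depressed sample is found, at the cost of two false alarms.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"{precision:.2f} {recall:.2f} {f1:.2f}")  # 0.60 1.00 0.75
```

The toy case mirrors the pattern noted above: Recall near 1 means few depressed participants are missed, while Precision shows how many flagged participants are actually depressed.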

✔️ 5.4. Performance Evaluation on EATD-Corpus Dataset

The performance of the proposed method is further evaluated on EATD-Corpus with 3-fold cross validation. The volunteers in the dataset are divided into three groups, two of which are used for training and the other one for testing. As described in Section 5.2, audios and transcripts of each depressed volunteer in the training set are rearranged and resampled. Then, audio and text embeddings of size 3 × 256 and 3 × 1024 are extracted from the training set and test set. The proposed GRU model and BiLSTM model are trained separately to generate representations, which are concatenated and passed to the multi-modal fusion network to output binary labels.
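A 3-fold split over volunteers can be sketched as follows. The text does not say whether the folds were stratified by label or how volunteers were shuffled, so the plain shuffled split, the seed, and the function name here are assumptions.

```python
import random
from typing import Iterable, Iterator, List, Tuple


def k_fold_splits(volunteer_ids: Iterable[int], k: int = 3,
                  seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_ids, test_ids) pairs; each fold serves as the test set once."""
    ids = list(volunteer_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i in range(k):
        test = folds[i]
        train = [v for j, fold in enumerate(folds) if j != i for v in fold]
        yield train, test


for train_ids, test_ids in k_fold_splits(range(162)):
    print(len(train_ids), len(test_ids))  # 108 54 on every fold
```

Augmentation by reordering would then be applied to the depressed volunteers in `train_ids` only, leaving `test_ids` untouched.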

We implement the method introduced in [13] and evaluate its performance on EATD-Corpus for comparison. The performances of three traditional classifiers, i.e. SVM, Random Forest, and Decision Tree, are also evaluated. All these methods are evaluated using 3-fold cross validation. The experimental results are shown in Table 4. It can be seen that, when only using single modalities, the proposed GRU/BiLSTM model achieves the best performance compared with its counterparts. When only audio features are considered, the F1 score of our method is 0.66, compared with the second-best F1 score of 0.50. For text features, the F1 score of our method is 0.65, compared with the second-best F1 score of 0.64. The results demonstrate the advantage of our method in dealing with the depression detection problem. Compared with the models using single modalities, our fusion model exhibits a much higher performance, with the F1 score increased to 0.71. It is much higher than that of the method proposed in [13]. Similarly, the Recall of the fusion model also rises to 0.84, which indicates that our method can detect most depressive cases. The fusion performance is compared only between the two deep-learning-based methods. These results demonstrate the effectiveness of the proposed fusion method.

The results from DAIC-WoZ and EATD-Corpus imply that our method has a powerful generalization ability and can be applied to different depression datasets.

✔️ 6. CONCLUSION

In this paper, we release the first public Chinese depression dataset, EATD-Corpus. It contains audio responses of 162 volunteers to three emotion-related questions. Text transcripts of the audios are also extracted, manually corrected, and supplied in EATD-Corpus. Considering the rareness of public multimedia depression datasets, EATD-Corpus provides valuable data for researchers in psychology and computer science who are engaged in depression study.

Besides, we propose a novel depression detection method which can detect the depression state by analyzing the audio signals and linguistic contents of the participants. Our method simply encodes audio/text features into embeddings and doesn’t rely on the contents of questions asked during the interview. We evaluate the performance of the proposed method on two depression datasets, namely DAIC-WoZ and EATD-Corpus. Experimental results demonstrate that the proposed method is highly effective. In the future, we intend to build an app that allows users to self-detect their depressive states based on the proposed method.

📒ADD

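ELMo (Embeddings from Language Models) is a language representation model developed by the AllenNLP team. It produces contextual word embeddings from a pretrained bidirectional language model (biLM). Unlike static word embeddings such as Word2Vec and GloVe, ELMo embeddings are context-sensitive: the same word receives different embeddings in different contexts. ELMo works as follows:

How ELMo embeddings work

  1. Bidirectional language model (biLM)

    • ELMo uses a bidirectional language model built from two independent LSTM networks: one reads left to right (forward LSTM), the other right to left (backward LSTM).
    • The forward LSTM predicts the next word from the words before it; the backward LSTM predicts the previous word from the words after it.
  2. Multi-layer LSTM

    • The language model stacks several LSTM layers (usually two). Each layer produces a contextual representation of every word.
  3. Generating contextual representations

    • For an input sentence, ELMo first turns each word into a vector with a character-level convolutional network, then feeds these vectors into the bidirectional LSTM.
    • Each LSTM layer produces a set of representations that carry information about the word in its context.
  4. Computing the ELMo embedding

    • For each word, the ELMo embedding is a weighted sum of its representations across all layers:
      $$\text{ELMo}_k = \gamma \sum_{j=0}^{L} s_j h_{k,j}$$
      where $h_{k,j}$ is the representation of the $k$-th word at layer $j$ (layer $j = 0$ is the character-based word vector), $s_j$ is the softmax-normalized weight of layer $j$, and $\gamma$ is a learnable scalar.
  5. Training and fine-tuning

    • ELMo is pretrained on a large corpus. In downstream tasks (question answering, sentiment analysis, and so on), the layer weights $s_j$ and the scalar $\gamma$ are learned together with the task model, and the biLM itself can optionally be fine-tuned, so that the embedding adapts to the task.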
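The weighted-sum formula ELMo_k = γ Σ_j s_j h_{k,j} can be illustrated in plain Python. The function name and toy vectors are hypothetical; real ELMo layers are 1024-dimensional and come from the pretrained biLM, which is not loaded here.

```python
import math
from typing import List, Sequence


def elmo_embedding(layer_reprs: Sequence[Sequence[float]],
                   raw_weights: Sequence[float],
                   gamma: float = 1.0) -> List[float]:
    """Combine the per-layer vectors h_{k,j} of one token into its ELMo vector."""
    # s_j: softmax-normalized layer weights.
    m = max(raw_weights)
    exps = [math.exp(w - m) for w in raw_weights]
    total = sum(exps)
    s = [e / total for e in exps]
    dim = len(layer_reprs[0])
    return [gamma * sum(s_j * h[d] for s_j, h in zip(s, layer_reprs))
            for d in range(dim)]


# Three layers (token layer plus two LSTM layers) for a single token.
layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
vec = elmo_embedding(layers, raw_weights=[0.0, 0.0, 0.0])
print([round(v, 6) for v in vec])  # [3.0, 4.0]
```

With equal raw weights and γ = 1 the result is the plain average of the three layers, which is the encoding the paper above describes for its text features.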

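Characteristics of ELMo embeddings

  • Context-sensitive: ELMo embeddings capture shifts in a word's meaning because they take the word's sentence context into account.
  • Multi-layer representation: by combining the representations of several LSTM layers, ELMo captures semantic information at different levels.
  • Transfer learning: a pretrained ELMo model can be applied to many NLP tasks to improve model performance.

These context-sensitive embeddings perform well on many NLP tasks, especially those that require understanding sentence structure and semantics.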

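NetVLAD (a neural-network version of the Vector of Locally Aggregated Descriptors) is a trainable aggregation layer originally used for describing images, mainly in image retrieval and image classification tasks. It brings the VLAD aggregation scheme, traditionally applied to local descriptors such as SIFT or SURF, into a deep network and learns how to aggregate features, which improves retrieval accuracy and efficiency. In the paper above, the same layer is applied to Mel spectrograms to obtain audio embeddings of a fixed length.

The VLAD algorithm

  1. Local feature descriptors

    • In traditional image retrieval, an image is decomposed into a set of local keypoints, each represented by a local descriptor (such as SIFT or SURF) that captures the local information around the keypoint.
  2. Aggregating the descriptors

    • The core idea is to aggregate these local descriptors into one global feature vector that represents the image as a whole.
  3. VLAD (Vector of Locally Aggregated Descriptors)

    • VLAD first clusters all local descriptors into K clusters with K-means (each cluster has a center). Each descriptor is assigned to its nearest center, and the residuals (descriptor minus center) are summed per cluster. Concatenating the K summed residuals gives a fixed-length vector for the whole image.
    • NetVLAD builds on VLAD but learns the aggregation with a neural network: the hard nearest-center assignment is replaced by a differentiable soft assignment, so the layer can be trained end to end.
  4. NetVLAD workflow

    • Local feature extraction: a convolutional neural network (CNN) extracts local feature descriptors from the image.
    • Feature aggregation: the NetVLAD layer softly assigns each descriptor to the cluster centers and accumulates the weighted residuals; the assignment parameters and the centers are learned.
    • Global feature generation: the output is a fixed-length global feature vector that can be used for image retrieval or classification.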
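The aggregation idea can be illustrated with classic hard-assignment VLAD in plain Python. This is a sketch of VLAD, not of NetVLAD: the cluster centers are given rather than learned, each descriptor goes to exactly one center, and the final L2 normalization is a common convention assumed here. The function name and toy data are hypothetical.

```python
import math
from typing import List, Sequence


def vlad(descriptors: Sequence[Sequence[float]],
         centers: Sequence[Sequence[float]]) -> List[float]:
    """Aggregate a variable number of descriptors into one K*D vector."""
    k, dim = len(centers), len(centers[0])
    residual_sums = [[0.0] * dim for _ in range(k)]
    for x in descriptors:
        # Hard assignment: index of the nearest center (squared Euclidean distance).
        nearest = min(range(k),
                      key=lambda c: sum((xi - ci) ** 2
                                        for xi, ci in zip(x, centers[c])))
        # Accumulate the residual (descriptor minus center) for that cluster.
        for d in range(dim):
            residual_sums[nearest][d] += x[d] - centers[nearest][d]
    flat = [v for row in residual_sums for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


centers = [[0.0, 0.0], [10.0, 10.0]]
short_clip = [[1.0, 0.0], [0.0, 1.0], [10.0, 12.0]]
long_clip = short_clip * 5
print(len(vlad(short_clip, centers)), len(vlad(long_clip, centers)))  # 4 4
```

The output length depends only on K and D, not on how many descriptors go in, which is why this family of methods suits Mel spectrograms of varying duration. NetVLAD replaces the `min(...)` step with softmax weights over all centers.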
