Day 7. Towards Preemptive Detection of Depression and Anxiety in Twitter


Title:
Towards Preemptive Detection of Depression and Anxiety in Twitter

Abstract:
Depression and anxiety are psychiatric disorders that are observed in many areas of everyday life. For example, these disorders manifest themselves somewhat frequently in texts written by nondiagnosed users in social media. However, detecting users with these conditions is not a straightforward task as they may not explicitly talk about their mental state, and if they do, contextual cues such as immediacy must be taken into account. When available, linguistic flags pointing to probable anxiety or depression could be used by medical experts to write better guidelines and treatments. In this paper, we develop a dataset designed to foster research in depression and anxiety detection in Twitter, framing the detection task as a binary tweet classification problem. We then apply state-of-the-art classification models to this dataset, providing a competitive set of baselines alongside qualitative error analysis. Our results show that language models perform reasonably well, and better than more traditional baselines. Nonetheless, there is clear room for improvement, particularly with unbalanced training sets and in cases where seemingly obvious linguistic cues (keywords) are used counter-intuitively.

Highlight:
In this paper, we build a classification dataset to assist in the detection of depression and anxiety in Twitter, and compare several text classification baselines. The results show that state-of-the-art language models (LMs henceforth) like BERT (Devlin et al., 2019) unsurprisingly outperform competing baselines. However, when the dataset shows an unbalanced distribution, linear models perform on par. Finally, alongside quantitative results, we also provide a qualitative analysis through which we aim to better understand the strengths and limitations of the models under study. Further, we identify the linguistic patterns alluding to the presence of depression and anxiety that elude all of the classifiers, and consider how we might improve performance against such patterns in the future.

3 Dataset Construction
3.1 Tweet collection
First, we used Twitter's Stream API to compile a large corpus of tweets. All tweets were in English and published between May 2018 and August 2019.[3] We only considered tweets containing at least three tokens and without URLs, so as to avoid bot tweets and spam advertising. All personal information, including usernames (denoting the author or other users) and location, was removed from the corpus; only textual information was retained. We did, however, retain emojis and emoticons, surmising that they may, at least in part, be indicative of depression or anxiety.

The corpus was then filtered. We aimed to identify tweets whose authors may be suffering from depression or anxiety but may not yet have been diagnosed by a clinician. To achieve this, we sought tweets containing occurrences of depress, anxie, or anxio, but not diagnos,[4] an approach similar to that used by Bathina et al. (2020). This produced an initial set of 89,192 tweets. From these tweets we proceeded to annotate a random subset of 1,050 tweets to arrive at our dataset.
[3] Twitter's automatic language labelling was used to identify English tweets.
[4] e.g. "My anxiety is terrible today"
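
A minimal sketch of how this filtering step might be implemented is shown below; the regular expressions and the helper name are our assumptions, since the paper only specifies the keyword stems and the three-token / no-URL constraints:

```python
import re

# Keyword stems described in the paper: keep tweets mentioning depression/anxiety
# terms, discard tweets that mention a diagnosis.
INCLUDE = re.compile(r"\b(depress|anxie|anxio)", re.IGNORECASE)
EXCLUDE = re.compile(r"\bdiagnos", re.IGNORECASE)
URL = re.compile(r"https?://\S+")

def keep_tweet(text: str) -> bool:
    """Apply the collection filters: at least three tokens, no URLs, keyword constraints."""
    if URL.search(text):
        return False
    if len(text.split()) < 3:
        return False
    return bool(INCLUDE.search(text)) and not EXCLUDE.search(text)

print(keep_tweet("My anxiety is terrible today"))             # True
print(keep_tweet("I was diagnosed with depression in May"))   # False (mentions a diagnosis)
```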

3.2 Annotation
Three human annotators were appointed. The prerequisites for these annotators were fluency in English and familiarity with Twitter. The 1,050-tweet dataset was divided into three distinct subsets of 300 tweets and one shared subset of 150 tweets. Each annotator received one of the distinct subsets in addition to the shared subset, and was tasked with labelling the 450 tweets they had received.
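A rough sketch of this assignment scheme is given below; the shuffling, seed, and variable names are illustrative assumptions, and the placeholder strings stand in for the sampled tweets:

```python
import random

random.seed(0)
tweets = [f"tweet_{i}" for i in range(1050)]  # placeholders for the 1,050 sampled tweets
random.shuffle(tweets)

shared = tweets[:150]  # the subset labelled by all three annotators
individual = [tweets[150:450], tweets[450:750], tweets[750:1050]]  # 300 tweets each

# Each annotator labels their own 300 tweets plus the shared 150 (450 in total).
assignments = {f"annotator_{i + 1}": individual[i] + shared for i in range(3)}
```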

Table 1: Pairwise percentage agreement and inter-annotator reliability

For each tweet, the annotator selected one of two labels:
1: The tweeter appears to be suffering from depression or anxiety.
0: The tweeter does not appear to be suffering from depression or anxiety.

Guidelines were compiled to aid the annotation exercise. Their purpose was to ensure a consistent approach amongst annotators and to resolve ambiguous cases. These guidelines are defined below along with examples and their suggested labels:

  1. The tweeter states that they have depression or anxiety
    Example: “I feel sick to my stomach, I hate having such bad anxiety” - 1

  2. The tweeter states that they have had depression or anxiety in the past
    Example: “Counselling fixed my depression” - 0

  3. The tweeter is referring to a fellow tweeter who may have depression or anxiety
    Example: “@user I wish you all the best in beating your anxiety” - 0

  4. The tweeter is temporarily depressed or anxious due to a short-lived or superficial event
    Example: “Nothing gives me anxiety more than the tills at Aldi” - 0

  5. The tweet is ambiguous or does not provide definitive information
    Example: “Depression is not taken serious enough” - 1

Guideline 5 recommends positive labelling in ambiguous cases. This is to help achieve high recall in terms of tweeters who appear to be suffering from depression or anxiety. Whilst this approach will inevitably retrieve negative instances, an eventual real-world application would require all retrieved instances to be verified manually by medical experts, and therefore high recall at the expense of lower precision is an acceptable tradeoff. In fact, it is recommended that results from automatic classifiers used in healthcare settings should be verified via an “expert-in-the-loop approach” (Holzinger, 2016).

The guidelines evolved following the annotators' first attempts at the exercise. A conflict resolution meeting revealed that, while agreement was acceptable, there were instances where annotators felt unable to assign either label. This gave rise to the addition of guideline 5, which allowed the annotators to complete the exercise with confidence.

3.3 Inter-Annotator Agreement
Once the annotation was completed, we calculated the Average Pairwise Percentage Agreement of the three annotators with respect to the 150 common tweets that they had received (Table 1).

An average pairwise agreement of 80% was recorded. To validate the quality of the exercise, two further measures of inter-annotator reliability were selected: Fleiss' Kappa and Krippendorff's Alpha. They are apt for inter-annotator exercises involving more than two annotators (Zapf et al., 2016). Both measures returned scores indicating "substantial agreement" amongst the annotators (Xie et al., 2017).
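These statistics can be computed with NLTK's agreement module (average observed agreement and Krippendorff's alpha) together with statsmodels for Fleiss' kappa; the labels below are invented purely for illustration, since the per-annotator judgements are not reproduced here:

```python
from nltk.metrics.agreement import AnnotationTask
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy 0/1 labels for six tweets from three annotators (illustrative only).
labels = {
    "ann_1": [1, 0, 1, 1, 0, 0],
    "ann_2": [1, 0, 1, 0, 0, 0],
    "ann_3": [1, 0, 1, 1, 0, 1],
}

# NLTK expects (coder, item, label) triples.
triples = [(coder, f"tweet_{i}", lab)
           for coder, labs in labels.items()
           for i, lab in enumerate(labs)]
task = AnnotationTask(data=triples)
print("Average pairwise agreement:", task.avg_Ao())
print("Krippendorff's alpha:", task.alpha())

# Fleiss' kappa via statsmodels: rows are items, columns are raters.
ratings = list(zip(*labels.values()))
table, _ = aggregate_raters(ratings)
print("Fleiss' kappa:", fleiss_kappa(table))
```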

Confidence in the annotation guidelines was therefore established with respect to the 150 tweets common to each annotator. Disagreements in the labels were decided by majority voting among the three annotators. For example, the tweet "My seasonal depression automatically begun tonight at 12am" was labelled 1 by two of the annotators and 0 by the third annotator; majority voting meant that it was finally labelled 1. The annotators then proceeded to label their distinct 300-tweet subsets independently.
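With three annotators, majority voting reduces to taking the most common label per tweet; a minimal sketch:

```python
from collections import Counter

def majority_label(votes):
    """Return the most common label among the annotators' votes."""
    return Counter(votes).most_common(1)[0][0]

# "My seasonal depression automatically begun tonight at 12am" received votes 1, 1, 0.
print(majority_label([1, 1, 0]))  # -> 1
```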

4.1 Experimental setting
4.1.1 Data

We prepared our annotated dataset described in Section 3 for input to a series of supervised classifiers. The three distinct subsets of 300 annotated tweets were combined to form a training set of 900 tweets. The 150 tweets labelled by all annotators formed the test set. We named this dataset DATD (Depression and Anxiety in Twitter Dataset). The test set's ratio of positive instances to negative instances was exactly 1:1 following the annotation exercise. This contrasts with related published datasets upon which no annotation had been performed and all instances were deemed mental illness-related.

Following a similar approach to Bathina et al. (2020), we also compiled a non-annotated set of 3,600 random tweets which did not contain any occurrence of depress, anxie, anxio, or diagnos. These were merged with the 900-tweet training set to form a larger training set of 4,500 tweets. The purpose of this large training set (DATD+Rand henceforth) was to recreate a more realistic (and noisy) setting where most training instances are negative. This meant that only 10.5% of the instances in this training set contained any of the keywords used to compile the positive examples. The 150 tweets labelled by all annotators formed the test set once again. The main characteristics of the two datasets are summarised in Table 2.

Table 2: Characteristics of the datasets used in the evaluation
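
A sketch of how the two training configurations could be assembled, assuming the annotated tweets sit in pandas DataFrames with text and label columns; treating the 3,600 keyword-free random tweets as negative instances is our reading of the paper's description of a mostly-negative training set, and all rows below are placeholders:

```python
import pandas as pd

# Illustrative placeholders for the three annotated 300-tweet subsets, the shared
# 150-tweet test set, and the 3,600 unannotated random tweets.
subsets = [pd.DataFrame({"text": [f"annotated tweet {i}_{j}" for j in range(300)],
                         "label": [j % 2 for j in range(300)]}) for i in range(3)]
shared_test = pd.DataFrame({"text": [f"shared tweet {j}" for j in range(150)],
                            "label": [j % 2 for j in range(150)]})
random_tweets = pd.DataFrame({"text": [f"random tweet {j}" for j in range(3600)]})
random_tweets["label"] = 0  # assumption: keyword-free random tweets are treated as negative

datd_train = pd.concat(subsets, ignore_index=True)                            # 900 tweets (DATD)
datd_rand_train = pd.concat([datd_train, random_tweets], ignore_index=True)   # 4,500 tweets (DATD+Rand)
test_set = shared_test                                                         # 150 tweets

print(len(datd_train), len(datd_rand_train), len(test_set))  # 900 4500 150
```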

4.1.2 Comparison systems
We evaluated several binary classifiers on both the DATD and DATD+Rand datasets, guided by existing research concerning problems similar to the one at hand. To this end, we deemed a Support Vector Machine (SVM) and an LM to be suitable classifiers. SVMs have demonstrated effectiveness when used with Twitter datasets in healthcare contexts (Prieto et al., 2014; Han et al., 2020). For our experiments we used both a standard SVM classifier with TF-IDF features and a classifier based on the average of word embeddings within the tweet.

With regards to pre-trained LMs, we used BERT (Bidirectional Encoder Representations from Transformers) and ALBERT (A Lite BERT) (Lan et al., 2019). These LMs have been deployed effectively in NLP tasks, leading to state-of-the-art results in most standard benchmarks (Wang et al., 2019) including Twitter (Basile et al., 2019; Roitero et al., 2020). In particular, ALBERT has been shown to provide competitive results despite being relatively light-weight compared to other LMs. Finally, for completeness we added a naive baseline that predicts positive instances in all cases.
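The always-positive naive baseline can be expressed with scikit-learn's DummyClassifier; a minimal sketch (the zero-filled feature matrices are placeholders, since this strategy ignores the input features entirely):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Predicts the positive class (1) for every tweet, regardless of the features.
baseline = DummyClassifier(strategy="constant", constant=1)
baseline.fit(np.zeros((2, 1)), [1, 0])      # features are ignored; y must contain the constant
print(baseline.predict(np.zeros((3, 1))))   # -> [1 1 1]
```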

4.1.3 Training details
We used the scikit-learn SVM model (Pedregosa et al., 2011) as well as its TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer implementation. The word embeddings generated for each tweet were drawn from GloVe vectors trained on Twitter data (Pennington et al., 2014). These vectors had a dimensionality of 200, and so did the averaged embedding generated for each tweet.
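The two SVM variants might look as follows, with the GloVe Twitter vectors loaded through gensim's downloader; this is one convenient way of obtaining the 200-dimensional embeddings described here, not necessarily the authors' exact pipeline, and the two example tweets are placeholders:

```python
import numpy as np
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

train_texts = ["i hate having such bad anxiety", "lovely sunny day out today"]  # placeholders
train_labels = [1, 0]

# Variant 1: linear-kernel SVM over TF-IDF features.
tfidf_svm = make_pipeline(TfidfVectorizer(lowercase=True), SVC(kernel="linear"))
tfidf_svm.fit(train_texts, train_labels)

# Variant 2: linear-kernel SVM over averaged 200-d GloVe Twitter embeddings.
glove = api.load("glove-twitter-200")  # downloads the pre-trained Twitter vectors

def tweet_embedding(text):
    """Average the GloVe vectors of the tweet's tokens (zeros if none are in-vocabulary)."""
    vecs = [glove[tok] for tok in text.lower().split() if tok in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(glove.vector_size)

emb_svm = SVC(kernel="linear")
emb_svm.fit(np.vstack([tweet_embedding(t) for t in train_texts]), train_labels)
```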

We performed tweet text preprocessing prior to input to the SVM. In one series of SVM experiments all tweets underwent tokenization and lowercasing only, but in a second series all tweets also underwent tweet-specific preprocessing (SVM+preproc henceforth). The preprocessing entailed the removal of hashtags, user mentions, reserved words (such as "RT" and "FAV"), emojis, and smileys. This enabled us to see how the presence of these common tweet features affected classification performance. In both cases, the SVM used a linear kernel and default hyperparameters.
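The operations listed here match what the tweet-preprocessor package provides; the snippet below is a sketch of how such preprocessing could be configured with that library (the choice of library is our assumption, since the paper only names the operations):

```python
import preprocessor as p  # pip install tweet-preprocessor

# Strip hashtags, user mentions, reserved words (RT, FAV), emojis, and smileys.
p.set_options(p.OPT.HASHTAG, p.OPT.MENTION, p.OPT.RESERVED, p.OPT.EMOJI, p.OPT.SMILEY)

tweet = "RT @user my anxiety is through the roof today 😩 #mentalhealth"
print(p.clean(tweet).lower())  # lowercasing applied separately, as in the paper
```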

To deploy the LM classifiers we used the Simple Transformers software library. It provides a convenient Application Programming Interface (API) to the Transformers library, which itself provides access to BERT and ALBERT models, amongst others (Wolf et al., 2019). The BERT and ALBERT classifiers used were "bert-base-uncased" and "albert-base-v1", respectively.
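A minimal fine-tuning sketch with Simple Transformers' ClassificationModel, relying on the library's default hyperparameters (as noted in the training details below); the two training rows are placeholders:

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Training data is a DataFrame with "text" and "labels" columns.
train_df = pd.DataFrame({
    "text": ["i hate having such bad anxiety", "lovely sunny day out today"],
    "labels": [1, 0],
})

# "bert-base-uncased" as in the paper; swap in ("albert", "albert-base-v1") for ALBERT.
model = ClassificationModel("bert", "bert-base-uncased", num_labels=2, use_cuda=False)
model.train_model(train_df)

predictions, raw_outputs = model.predict(["my seasonal depression automatically begun tonight"])
print(predictions)  # array of 0/1 labels
```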

Unlike in the SVM experiments, it was not necessary to tokenize tweet texts prior to their input to the BERT or ALBERT classifiers; these models perform their own tokenization. Tweet texts did undergo prior lowercasing, however. The classifiers were instantiated with Simple Transformers' default hyperparameters.
