Source: https://www.zhihu/tardis/sogou/art/522017847

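1. Introduction

Natural language processing (NLP) mainly comprises natural language understanding (NLU) and natural language generation (NLG). To get the most out of NLU tasks, researchers from New York University, the University of Washington, and other institutions created a multi-task natural language understanding benchmark and analysis platform: GLUE (General Language Understanding Evaluation).

GLUE comprises nine NLU tasks, all in English, covering natural language inference, textual entailment, sentiment analysis, semantic similarity, and more. Well-known models such as BERT, XLNet, RoBERTa, ERNIE, and T5 have all been evaluated on this benchmark.
GLUE official website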

2. Quick overview

GLUE has nine tasks: CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, and WNLI. As shown in Table 1 below, they fall into three categories: single-sentence tasks, similarity and paraphrase tasks, and inference tasks.
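
The three-way grouping can be written down as a small lookup table. This is only an illustrative sketch: the dictionary keys and the helper name `category_of` are made up for this example and are not part of any GLUE tooling. The assignment of tasks to categories follows the GLUE paper.

```python
# Grouping of the nine GLUE tasks by category (category names are illustrative keys).
GLUE_TASKS = {
    "single_sentence": ["CoLA", "SST-2"],
    "similarity_and_paraphrase": ["MRPC", "STS-B", "QQP"],
    "inference": ["MNLI", "QNLI", "RTE", "WNLI"],
}


def category_of(task: str) -> str:
    """Return the category key for a GLUE task name (hypothetical helper)."""
    for category, tasks in GLUE_TASKS.items():
        if task in tasks:
            return category
    raise KeyError(f"unknown GLUE task: {task}")


# Flat list of all task names, in category order.
all_tasks = [task for tasks in GLUE_TASKS.values() for task in tasks]

print(len(all_tasks), category_of("MRPC"))
```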

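2.1 Task 1: CoLA

CoLA (The Corpus of Linguistic Acceptability) is a single-sentence classification task. The corpus is drawn mainly from books and journal articles on linguistic theory, and each example is a sequence of words annotated for whether it is a grammatical sentence. It is a binary classification task with labels 0 and 1, where 0 means ungrammatical and 1 means grammatical.

  • Sample counts: 8,551 training, 1,043 development, 1,063 test.

  • Task: acceptability judgment, a binary classification between grammatical and ungrammatical.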
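
No evaluation metric is listed for CoLA here; the GLUE benchmark reports the Matthews correlation coefficient (MCC) for this task. The following is a minimal from-scratch sketch of MCC for 0/1 labels. The function name `mcc` and the toy labels are assumptions for illustration, not real CoLA predictions.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: return 0 when any row or column of the confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels (made up): 1 = grammatical, 0 = ungrammatical.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
print(mcc(y_true, y_pred))
```

Unlike accuracy, MCC is 0 for a model that always predicts the same class, which makes it more informative when the two labels are not equally frequent.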

Examples with label 1 (grammatical):

She is proud.
she is the mother.
John thinks Mary left.
Yes, she did.
Will John not go to school?
Mary noticed John’s excessive appreciation of himself.

Examples with label 0 (ungrammatical):

Mary sent.
Yes, she used.
Mary wonders for Bill to come.
They are intense of Bill.
Mary thinks whether Bill will come.
Mary noticed John’s excessive appreciation of herself.

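Note that these sentences are fairly short. The errors are varied: some are gender disagreement, some are missing words, some involve a missing or extra -s, and so on. I also noticed that some of the errors do not look that serious, and a few of the sentences might even be acceptable in certain contexts.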

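2.2 Task 2: SST-2

SST-2 (The Stanford Sentiment Treebank) is a single-sentence classification task consisting of sentences from movie reviews together with human annotations of their sentiment. The task is to predict the sentiment of a given sentence, with two classes: positive (label 1) and negative (label 0). Only sentence-level labels are used. In other words, this is also a binary classification task, at the sentence level.

  • Sample counts: 67,349 training, 872 development, 1,821 test.
  • Task: sentiment classification, a binary classification between positive and negative.
  • Evaluation metric: accuracy.

Examples with label 1 (positive):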

two central performances
against shimmering cinematography that lends the setting the ethereal beauty of an asian landscape painting
the situation in a well-balanced fashion
a better movie
at achieving the modest , crowd-pleasing goals it sets for itself
a patient viewer

Examples with label 0 (negative):

a transparently hypocritical work that feels as though it 's trying to set the women 's liberation movement back 20 years
so pat it makes your teeth hurt
blood work is laughable in the solemnity with which it tries to pump life into overworked elements from eastwood 's dirty harry period .
faced with the possibility that her life is meaningless , vapid and devoid of substance , in a movie that is definitely meaningless , vapid and devoid of substance
monotone
this new jangle of noise , mayhem and stupidity must be a serious contender for the title .

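Note that because the sentences come from movie reviews with human sentiment annotations, their lengths vary widely: some are very long and some are very short, unlike CoLA, where sentences are short overall.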

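2.3 MRPC

MRPC (The Microsoft Research Paraphrase Corpus) is a similarity and paraphrase task. It is a corpus of sentence pairs automatically extracted from online news sources, with human annotations of whether the two sentences in each pair are semantically equivalent. The classes are imbalanced (68% of the examples are positive), so following common practice both accuracy and F1 are reported.

  • Sample counts: 3,668 training, 408 development, 1,725 test.

  • Task: paraphrase detection, a binary classification between paraphrase and not paraphrase.

  • Evaluation metric: accuracy and F1.

Examples with label 1 (positive, paraphrases of each other). Each example consists of two sentences separated by a tab: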

The largest gains were seen in prices , new orders , inventories and exports . Sub-indexes measuring prices , new orders , inventories and exports increased .
Trading in Loral was halted yesterday ; the shares closed on Monday at $ 3.01 . The New York Stock Exchange suspended trading yesterday in Loral , which closed at $ 3.01 Friday .
He plans to have dinner with troops at Kosovo 's U.S. military headquarters , Camp Bondsteel . After that , he plans to have dinner at Camp Bondsteel with U.S. troops stationed there .
Retailers J.C. Penney Co . Inc . ( JCP ) and Walgreen Co . ( WAG ) kick things off on Monday . Retailers J.C. Penney Co . Inc . JCP.N and Walgreen Co . WAG.N kick things off on Monday .
Prosecutors filed a motion informing Lee they intend to seek the death penalty . He added that prosecutors will seek the death penalty .
Last year the court upheld Cleveland 's school voucher program , ruling 5-4 that vouchers are constitutional if they provide parents a choice of religious and secular schools . Last year , the court ruled 5-4 in an Ohio case that government vouchers are constitutional if they provide parents with choices among a range of religious and secular schools .

Examples with label 0 (negative, not paraphrases of each other):

Earnings per share from recurring operations will be 13 cents to 14 cents . That beat the company 's April earnings forecast of 8 to 9 cents a share .
He beat testicular cancer that had spread to his lungs and brain . Armstrong , 31 , battled testicular cancer that spread to his brain .
Graves reported from Albuquerque , Villafranca from Austin and Ratcliffe from Laredo . Pete Slover reported from Laredo and Gromer Jeffers from Albuquerque .
The commission must work out the plan 's details , but the average residential customer paying $ 840 a year would get a savings of about $ 30 annually . An average residential customer paying $ 840 a year for electricity could see a savings of $ 30 annually .
A former teammate , Carlton Dotson , has been charged with the murder . His body was found July 25 , and former teammate Carlton Dotson has been charged in his shooting death .
The battles marked day four of a U.S. sweep to hunt down supporters of Saddam Hussein 's fallen regime . Twenty-seven Iraqis were killed , pushing the number of opposition deaths to about 100 in a U.S. operation to hunt down supporters of Saddam Hussein 's fallen regime .

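Each example in this dataset contains two sentences, and the sentences are quite long. The data is imbalanced: positive examples make up 68% and negative examples only 32%.
Note: accuracy is the number of correct predictions divided by the total number of predictions. The number of correctly predicted positives divided by the total number of predicted positives is precision, which together with recall makes up F1. With 68% positives, always predicting the positive class already reaches 68% accuracy, so accuracy alone is hard to interpret here.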
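
To make the two MRPC metrics concrete, the sketch below computes accuracy, precision, recall, and F1 from scratch for a trivial classifier that always predicts the positive class. The function name `binary_metrics` is made up for this example, and the labels are synthetic: they only mirror the 68%/32% class split mentioned above and are not real MRPC data.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Precision: correct positive predictions / all positive predictions.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: correct positive predictions / all actual positives.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Synthetic labels with a 68/32 split, and a baseline that always says "paraphrase".
y_true = [1] * 68 + [0] * 32
y_pred = [1] * 100
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On this toy split the always-positive baseline gets 0.68 accuracy and 0.68 precision with perfect recall, which is a useful floor to compare a real model against.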

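2.4 STSB

STSB (The Semantic Textual Similarity Benchmark) is a similarity and paraphrase task. It is a collection of sentence pairs drawn from news headlines, video captions, image captions, and natural language inference data. Each pair is annotated by humans with a similarity score from 0 to 5 (a floating-point number, inclusive at both ends). The task is to predict these scores, so it is essentially a regression problem, although it can still be approached as classification, as a five-class sentence-pair classification task.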
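
How the continuous scores would be turned into five classes is not specified here. One possible sketch, under the assumption of five equal-width bins over [0, 5] with the top score 5.0 folded into the last bin, is shown below. The function name `score_to_class` and the binning scheme are illustrative choices, not something defined by GLUE.

```python
def score_to_class(score: float, num_classes: int = 5, max_score: float = 5.0) -> int:
    """Map a similarity score in [0, max_score] to a class index 0..num_classes-1.

    Assumption: equal-width bins, e.g. [0,1) -> 0, [1,2) -> 1, ..., [4,5] -> 4.
    """
    if not 0.0 <= score <= max_score:
        raise ValueError(f"score out of range: {score}")
    width = max_score / num_classes
    # The maximum score would fall just past the last bin, so clamp it.
    return min(int(score // width), num_classes - 1)


for s in (0.0, 1.556, 3.8, 5.0):
    print(s, score_to_class(s))
```

Binning discards the ordering information inside each bin, which is one reason the task is normally treated as regression.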

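  • Sample counts: 5,749 training, 1,379 development, 1,377 test.

  • Task: regression, predicting a floating-point similarity score between 0 and 5. It can also be treated as five-class classification.

  • Evaluation metric: Pearson and Spearman correlation coefficients.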
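
The two correlation coefficients can be sketched in plain Python as follows. Spearman correlation is computed here as the Pearson correlation of the ranks, with tied values receiving their average rank. The helper names are made up for this example; the gold scores are taken from the sample pairs in this section, while the predicted scores are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


gold = [5.0, 3.8, 5.0, 1.0, 1.556]   # human scores from the samples in this section
pred = [4.6, 3.9, 4.9, 1.5, 1.2]     # hypothetical model outputs
print(pearson(gold, pred), spearman(gold, pred))
```

Pearson rewards predictions that are linearly related to the gold scores, while Spearman only looks at whether the pairs are ordered the same way.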

Some sentence pairs from the training set with their scores:

A plane is taking off. An air plane is taking off. 5.000
A man is playing a large flute. A man is playing a flute. 3.800
A dog rides a skateboard. A dog is riding a skateboard. 5.000
A woman is playing the flute. A man is playing the guitar. 1.000
A man is playing the guitar. A man is playing the drums. 1.556
A cat is playing a piano. A man is playing a guitar. 0.600
A group of people dance on a hill. A group of people are dancing. 3.200
A woman is sitting at a desk. A woman is riding a donkey. 0.400
Someone is slicing tortila’s. Someone is riding a horse. 0.000
A man is playing the guitar. A man plays an acoustic guitar. 3.750

Sentences are short to medium in length overall, and the lengths are fairly uniform.

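2.5 MNLI

MNLI (The Multi-Genre Natural Language Inference Corpus) is a natural language inference task. It is a crowdsourced collection of sentence pairs annotated for textual entailment. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts it (contradiction), or neither (neutral). The premise sentences are gathered from ten different sources, including transcribed speech, fiction, and government reports.

Task: each example is a sentence pair, one premise and one hypothesis. The relation between them is one of three: entailment, contradiction, or neutral. This is a three-class sentence-pair classification problem.

Examples of sentence pairs labeled entailment: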

Premise: you know during the season and i guess at at your level uh you lose them to the next level if if they decide to recall the the parent team the Braves decide to call to recall a guy from triple A then a double A guy goes up to replace him and a single A guy goes up to replace him
Hypothesis: You lose the things to the following level if the people recall.

Premise: How do you know? All this is their information again.
Hypothesis: This information belongs to them.

Premise: well you see that on television also
Hypothesis: You can see that on television, as well.

Premise: According to the Office of the Actuary at the Health Care Financing Administration, the estimated net present value of future additional resources needed to fund HI benefits alone over the 75 years is $4.
Hypothesis: The net present value of future additional resources for funding HI benefits was $4.

Examples of sentence pairs labeled contradiction:

Premise: They’re made from a secret recipe handed down to the present-day villagers by their Mallorcan ancestors, who came here in the early 17th century as part of an official repopulation scheme.
Hypothesis: The recipe passed down from Mallorcan ancestors is known to everyone.

Premise: Felicia’s Journey takes place behind the eyes of its central a young Irish girl, Felicia, who crosses the sea to England in a hopeful quest to find the father of her unborn child; and the fat, middle-aged catering manager, Hiditch, who takes a paternal interest in the lass when it becomes clear that her young man has caddishly given her the slip.
Hypothesis: The woman did not care where the man was as long as it was far.

Premise: Poirot, I exclaimed, with relief, and seizing him by both hands, I dragged him into the room.
Hypothesis: Poirot was now back and I was sorry that he would take over what I now considered my own investigation.

Premise: but that takes too much planning
Hypothesis: It doesn’t take much planning.

Examples of sentence pairs labeled neutral:

Premise: Conceptually cream skimming has two basic dimensions - product and geography.
Hypothesis: Product and geography are what make cream skimming work.

Premise: Thebes held onto power until the 12th Dynasty, when its first king, Amenemhet Iwho reigned between 1980 1951 b.c. established a capital near Memphis.
Hypothesis: The capital near Memphis lasted only half a century before its inhabitants abandoned it for the next capital.

Premise: When the trust fund begins running cash deficits in 2016, the government as a whole must come up with the cash to finance Social Security’s cash deficit by reducing any projected non-Social Security surpluses, borrowing from the public, raising other taxes, or reducing other government spending.
Hypothesis: The public would generally prefer to see the government reduce its spending in other areas to finance Social Security.

Premise: She smiled back.
Hypothesis: She was so happy she couldn’t stop smiling.

The training set is large overall. The authors of the GLUE paper also use and recommend the SNLI dataset [2] as auxiliary training data.

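2.6 QNLI

QNLI (Question-answering NLI) is a natural language inference task. It is converted from another dataset, the Stanford Question Answering Dataset (SQuAD 1.0) [3]. SQuAD 1.0 is a question answering dataset made up of question-paragraph pairs, where the paragraph comes from Wikipedia and one sentence in the paragraph contains the answer to the question. There are three ingredients here: a paragraph from Wikipedia, a question, and a sentence in the paragraph that contains the answer. QNLI's sentence pairs are obtained by pairing the question with every sentence in the context (the Wikipedia paragraph) and filtering out pairs with low lexical overlap. Compared with the original SQuAD task, this removes the requirement that the model select the exact answer, but it also removes the simplifying assumptions that the answer is always present in the input and that lexical overlap is a reliable cue.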
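
The conversion described above can be sketched roughly as follows. This is not the actual GLUE preprocessing code: the sentence splitter, the tokenizer, the overlap measure, the `min_overlap` threshold, and the example question and paragraph are all assumptions made up for illustration.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric tokens (a deliberately crude tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_sentences(paragraph):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def make_pairs(question, paragraph, answer_sentence_index, min_overlap=2):
    """Pair the question with each context sentence and drop low-overlap pairs.

    answer_sentence_index: index of the sentence that contains the answer.
    min_overlap: minimum number of shared tokens to keep a pair (assumed value).
    """
    question_tokens = tokens(question)
    pairs = []
    for i, sentence in enumerate(split_sentences(paragraph)):
        overlap = len(question_tokens & tokens(sentence))
        if overlap < min_overlap:
            continue  # filtered out: too little lexical overlap with the question
        label = "entailment" if i == answer_sentence_index else "not_entailment"
        pairs.append((question, sentence, label))
    return pairs


# Hypothetical question and context, not taken from SQuAD.
question = "When was the bridge completed?"
paragraph = ("The bridge was completed in 1932. It spans the harbour. "
             "Tourists often photograph the bridge at night.")
pairs = make_pairs(question, paragraph, answer_sentence_index=0)
for pair in pairs:
    print(pair)
```

In this toy example the middle sentence shares only one token with the question and is dropped, leaving one entailment pair and one not_entailment pair.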

  • Sample counts: 104,743 training, 5,463 development, 5,461 test.
  • Task: decide whether the question and the sentence (one sentence from the Wikipedia paragraph) stand in an entailment relation, that is, whether the sentence contains the answer to the question. Entailment versus not entailment, binary classification.
  • Evaluation metric: accuracy.

An example labeled entailment (positive). Each example consists of two sentences separated by a tab; the first is the question and the second is a sentence from the context:

Why do people say Kinsey’s work is not correct?
Kinsey’s methods have been criticized as flawed, particularly with regard to the randomness of his sample population, which included prison inmates, male prostitutes and those who willingly participated in discussion of previously taboo sexual topics.

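2.7 RTE

RTE (The Recognizing Textual Entailment datasets) is a natural language inference task. It merges the datasets from a series of annual textual entailment challenges, including RTE1 [4], RTE2, RTE3 [5], and RTE5, whose examples are constructed from news and Wikipedia text. All of the data is converted to binary classification: for the three-class datasets, neutral and contradiction are collapsed into not entailment for consistency.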
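
The label conversion described above amounts to a small mapping. A minimal sketch follows; the function name `to_binary_label` and the exact label strings are assumptions for illustration and may differ from the strings used in the released data files.

```python
def to_binary_label(label: str) -> str:
    """Collapse three-class NLI labels into the two RTE classes."""
    if label == "entailment":
        return "entailment"
    if label in {"neutral", "contradiction"}:
        # Both non-entailment classes are merged into a single negative class.
        return "not_entailment"
    raise ValueError(f"unexpected label: {label}")


three_class = ["entailment", "neutral", "contradiction", "entailment"]
binary = [to_binary_label(label) for label in three_class]
print(binary)
```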

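  • Sample counts: 2,490 training, 277 development, 3,000 test.

  • Task: decide whether sentence 1 entails sentence 2; binary classification.

  • Evaluation metric: accuracy.

An example labeled entailment (positive). Each example consists of two sentences separated by a tab, and the task is to decide whether the second sentence can be inferred from the first.

Entailment example 1: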

A place of sorrow, after Pope John Paul II died, became a place of celebration, as Roman Catholic faithful gathered in downtown Chicago to mark the installation of new Pope Benedict XVI. Pope Benedict XVI is the new leader of the Roman Catholic Church.
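Judgment: entailment. The first sentence mentions the installation of the new Pope Benedict XVI, and the second states directly that he is the new leader of the Roman Catholic Church.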
