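Evaluation Criteria

1. Basic Ability Evaluation

To examine the effectiveness and superiority of large language models (LLMs), existing research has used a large number of tasks and benchmark datasets for empirical evaluation and analysis. By task definition, existing language generation tasks fall into three groups: language modeling, conditional text generation, and code synthesis. Code synthesis is not a typical natural language processing task, but an LLM trained on code data can solve it in much the same way as natural language text generation, so it is included in the discussion.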

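Language modeling: the basic ability of an LLM. It aims to predict the next token from the preceding tokens [15] and mainly concerns basic language understanding and generation.

Conditional text generation: an important topic in language generation. It aims to generate text that meets specific task requirements under given conditions, and typically includes machine translation, text summarization, and question answering.

Code synthesis: besides generating high-quality natural language, existing LLMs also show a strong ability to generate formal language, especially computer programs (code) that satisfy specific conditions.

Closed-book question answering: tests the factual knowledge an LLM has acquired from its pre-training corpus. The LLM may answer only from the given context and may not use external resources.

Open-book question answering: unlike closed-book QA, the LLM can extract useful evidence from an external knowledge base or document collection and then answer based on that evidence.

Knowledge completion: the LLM is treated (to some extent) as a knowledge base that completes or predicts the missing parts of knowledge units, such as knowledge triples.

Knowledge reasoning: relies on logical relations and evidence from factual knowledge to answer a given question. Existing work mainly uses specific datasets to evaluate the corresponding type of knowledge reasoning.

Symbolic reasoning: focuses on manipulating symbols under formal rules to reach a specific goal, where the operations and rules may never have been seen during pre-training.

Mathematical reasoning: requires combining mathematical knowledge, logic, and computation to solve problems or generate proofs.

The table below lists the datasets used for basic ability evaluation.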
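Language modeling benchmarks such as Penn Treebank and WikiText-103 are commonly scored with perplexity, which follows directly from next-token prediction. The snippet below is a minimal sketch of that computation, not taken from the original evaluation; the log-probabilities in `example` are made-up values chosen only for illustration.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical per-token probabilities for a 4-token sequence (not real model output).
example = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
print(perplexity(example))
```

A lower value means the model assigns higher probability to the reference text; a model that always spreads probability evenly over V choices would score a perplexity of V.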

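Table: Datasets for basic ability evaluation

| Basic task | Subtask | Datasets |
|---|---|---|
| Language generation | Language modeling | Penn Treebank, WikiText-103, The Pile, LAMBADA |
| Language generation | Conditional text generation | WMT'14, WMT'16, WMT'19, WMT'20, WMT'21, WMT'22, Flores-101, DiaBLa, CNN/DailyMail, XSum, WikiLingua, OpenDialKG, SuperGLUE, MMLU, BIG-bench Hard, CLUE |
| Language generation | Code synthesis | APPS, HumanEval, MBPP, CodeContests, MTPB, DS-1000, ODEX |
| Knowledge utilization | Closed-book QA | Natural Questions, ARC, TruthfulQA, WebQuestions, TriviaQA, PIQA, LC-QuAD 2.0, GrailQA, KQA Pro, CWQ, MKQA, ScienceQA |
| Knowledge utilization | Open-book QA | Natural Questions, OpenBookQA, ARC, WebQuestions, TriviaQA, MS MARCO, QASC, SQuAD, WikiMovies |
| Knowledge utilization | Knowledge completion | WikiFact, FB15k-237, Freebase, WN18RR, WordNet, LAMA, YAGO3-10, YAGO |
| Complex reasoning | Knowledge reasoning | CSQA, StrategyQA, ARC, BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, OpenBookQA, COPA, ScienceQA, proScript, ProPara, ExplaGraph, ProofWriter, EntailmentBank, PrOntoQA |
| Complex reasoning | Symbolic reasoning | Coin Flip, Reverse List, Last Letter, Boolean Assignment, Parity, Colored Object, Penguins in a Table, Repeat Copy, Object Counting |
| Complex reasoning | Mathematical reasoning | MATH, GSM8k, SVAMP, MultiArith, ASDiv, MathQA, AQUA-RAT, MAWPS, DROP, NaturalProofs, PISA, miniF2F, ProofNet |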
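Code-synthesis benchmarks such as HumanEval are usually scored by functional correctness rather than text overlap, most often with the pass@k metric. The sketch below shows the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the unit tests. The original article does not say which metric it used, so this is an illustration rather than a description of its method.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes,
    given that c out of n generated samples passed the unit tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 10 samples per problem, 3 of which pass.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

The per-problem estimates are then averaged over all problems in the benchmark.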

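2. Advanced Ability Evaluation

Advanced ability evaluation covers the following three areas:

  1. Human alignment (human evaluation).
  2. Interaction with the external environment (for example, generating action plans for tasks such as cleaning and cooking, and evaluating ability by the plan and by its execution success rate in the real world).
  3. Tool manipulation (for example, mathematical problem solving, evaluated in combination with logical reasoning and knowledge question answering).

Human alignment means making LLMs conform well to human values and needs, a key ability for deploying LLMs widely in real-world applications. To evaluate it, existing studies consider several alignment criteria, such as helpfulness, honesty, and safety. For helpfulness and honesty, adversarial question-answering tasks (for example, TruthfulQA) can be used to examine an LLM's ability to detect possible falsehoods in text. Harmfulness can also be evaluated with several existing benchmarks, such as CrowS-Pairs and Winogender.

Beyond standard evaluation tasks, LLMs can also receive feedback from an external environment and act on behavioral instructions, for example by generating natural-language action plans to control an agent. LLMs with this ability can generate detailed and highly realistic action plans, whereas smaller models (for example, GPT-2) tend to produce shorter or meaningless plans. To test this ability, researchers have proposed several embodied AI benchmarks. VirtualHome, for instance, builds a 3D simulator for household tasks (such as cleaning and cooking) in which an agent can execute the natural-language actions generated by the LLM.

To examine tool manipulation, existing work mostly uses complex reasoning tasks, such as mathematical problem solving (for example, GSM8k and SVAMP) or knowledge question answering (for example, TruthfulQA), where successful tool use is important for supplying skills the LLM lacks (for example, numerical calculation).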

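3. Evaluation Benchmarks

The comprehensive benchmarks used to evaluate LLMs are as follows:

  1. MMLU: a general benchmark used for large-scale evaluation of LLMs' multi-task knowledge understanding.
  2. BIG-bench: a general benchmark that probes the abilities of existing LLMs from many angles, including linguistics, child development, mathematics, and commonsense reasoning.
  3. HELM: a comprehensive benchmark that currently includes 16 core scenarios and 7 categories of metrics.
  4. TyDiQA: a benchmark for multilingual knowledge utilization in large models.
  5. MGSM: a benchmark for multilingual mathematical reasoning.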

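4. Tables Related to the Evaluation Criteria

The table above lists the prompt templates that correspond to different tasks.

The table above shows the characteristics of the different tuning strategies. "Additional" indicates whether there are parameters beyond those of the LM, and "Tuned" indicates whether the parameters are updated.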

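Evaluation Report

1. Evaluation Information on Model Fine-Tuning

Statistics of large language models from recent years (models larger than 10B), including capability evaluation, pre-training data scale (given as a number of tokens or as storage size), and hardware. The table lists only LLMs that have a public paper describing their technical details. "Release Time" is the date on which the corresponding paper was officially released. "Publicly Available" means the model checkpoints can be obtained publicly, and "Closed Source" means the opposite. "Adaptation" indicates whether the model went through subsequent fine-tuning: IT denotes instruction tuning, and RLHF denotes reinforcement learning from human feedback. "Evaluation" indicates whether the original paper evaluated the model on the corresponding ability: ICL denotes in-context learning, and CoT denotes chain-of-thought. "*" marks the largest publicly available version.

Table: Literature scenario understanding ability

Below is a timeline of existing large language models (larger than 10B) from recent years. The timeline is based mainly on the release date of each model's technical paper (for example, the date it was submitted to arXiv). Because of space limits in the figure, only LLMs with publicly reported evaluation results are shown.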

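2. Evaluation Information on Model Abilities

Task types: CR: commonsense reasoning. QA: question answering. SUM: summarization. MT: machine translation. LCP: linguistic capacity probing. GCG: general conditional generation. CKM: commonsense knowledge mining. FP: fact probing. TC: text classification. MR: mathematical reasoning. SR: symbolic reasoning. AR: analogical reasoning. Theory: theoretical analysis. IE: information extraction. D2T: data-to-text. TAG: sequence tagging. SEMP: semantic parsing. EVALG: evaluation of text generation. VQA: visual question answering. VFP: visual fact probing. MG: multimodal grounding. CodeGen: code generation.

The PLM column lists all pre-trained LMs used for downstream tasks in the corresponding papers. GPT-like denotes an autoregressive language model that makes small modifications to the original GPT-2 architecture.

The Setting column lists the prompt-based learning setting: zero-shot learning (Zero), few-shot learning (Few), or fully supervised learning (Full).

Under "Prompt Engineering", Shape is the shape of the template (Clo for cloze, Pre for prefix), Man indicates whether human effort is needed, and Auto indicates a data-driven search method (Disc for discrete search, Cont for continuous search).

Under "Answer Engineering", Shape is the shape of the answer (Tok for token level, Sp for span level, Sen for sentence or document level); Man and Auto are as above.

The Tuning column lists the tuning strategy (Section 7). TFP: tuning-free prompting. LMT: fixed-prompt LM tuning. PT: fixed-LM prompt tuning. LMPT: LM + prompt tuning.

The Mul-Pr column lists multi-prompt learning methods. PA: prompt augmentation. PE: prompt ensembling. PC: prompt composition. PD: prompt decomposition.

Table: Evaluation of large models' understanding ability on each task

Table: Literature scenario understanding ability

Model / scenario | Summary generation | Sensitive-topic refusal | Abstract structuring | Answering in a uniform style | Multi-turn Q&A | Patent generation | Complex output tasks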
Chat-GLM-6B×××
Chat-GLM2-6B-32k×√√√√
BloomZ-7B√√
T0PP-11B××××××
CLLaMA2-7B√√√√×
MT0-13B××××
PanguAlpha-6B×××××
CLLaMA2-13B×
Baichuan2-13B-Chat√√√√√√
ChatGPT4√√√√√√√√
ChatGPT3.5
文心一言√√√√
通义千问
星火
Vicuna-13B
商量

× means the model lacks the ability; √ means it basically has the ability; √√ means it gives the best results.

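Table: GPU memory usage and time for training and inference of open-source tunable models

| Model | Training GPU memory | Training time | Inference GPU memory | Inference time |
|---|---|---|---|---|
| GLM-6B | 4 × P100, 15 GB | 3 days | 1 × P100, 15 GB | 30 s |
| GLM2-6B | - | - | 1 × P100, 16 GB | 20 s |
| BloomZ-7B-8bit | 4 × P100, 14 GB | 5 days | 1 × P100, 11 GB | 13 s |
| T0PP-11B | × | × | - | - |
| CLLaMA2-7B | 5 × P100, 16 GB | 4 days | 2 × P100, 15 GB | 60 s |
| MT0-13B | × | × | 4 × P100, 16 GB | 34 s |
| PanguAlpha-6B | - | - | 1 × P100, 11 GB | 40 s |
| CLLaMA2-13B | - | × | online platform | 49 s |
| Baichuan2-13B-Chat | - | - | 4 × P100, 16 GB | 36 s |

-: not recorded; ×: could not be measured because of insufficient compute.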
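As a rough sanity check on the memory figures above, the space needed just to hold a model's weights is the parameter count times the bytes per parameter. The sketch below assumes nominal parameter counts (6B, 7B, 13B) and two precisions (fp16 and int8); the article does not state which precision each model was loaded in, and real usage is higher because of activations, the KV cache, and framework overhead.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024 ** 3


# Nominal sizes, used only for illustration.
for name, params in [("6B", 6e9), ("7B", 7e9), ("13B", 13e9)]:
    fp16 = weight_memory_gib(params, 2)  # 2 bytes per parameter
    int8 = weight_memory_gib(params, 1)  # 1 byte per parameter
    print(f"{name}: fp16 ~{fp16:.1f} GiB, int8 ~{int8:.1f} GiB")
```

Under these assumptions a 6B model in fp16 needs roughly 11 GiB for weights, which is consistent with it fitting on a single 16 GB P100 for inference, while a 13B model in fp16 has to be split across several cards.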

3. Detailed Evaluation Information

Constraining the output to custom requirements
Chinese and English instruction templates:

请用一段话总结三井号里面的句子。###目的:IL-6在肿瘤生物学中的临床应用价值。方法:查阅整理近十年来国内外关于IL-6的研究文献,对IL-6的生物特性,功能和应用等方面进行文献综述。结果:IL-6在生物学上的结构,功能和应用价值有了初步的成果。结论:为进一步阐明IL-6在肿瘤方面的作用及临床应用价值提供了依据。###
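The templates in this section all wrap the input text in triple hash marks (`###`) so the model can tell the instruction from the material it operates on. A minimal sketch of building such a prompt is shown below; the function name `build_prompt` and the check for a clashing delimiter are illustrative assumptions, not part of the original test setup.

```python
def build_prompt(instruction, text, delimiter="###"):
    """Join an instruction and the text it applies to, wrapping the text in delimiters."""
    if delimiter in text:
        # A delimiter inside the text would make the boundaries ambiguous.
        raise ValueError("text already contains the delimiter")
    return f"{instruction}{delimiter}{text}{delimiter}"


# Hypothetical usage with a shortened English instruction.
prompt = build_prompt(
    "Summarize the sentences inside the triple hash marks in one paragraph.",
    "Purpose: the clinical value of IL-6 in tumor biology.",
)
print(prompt)
```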

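GLM-6B test results


GLM2-6B test results


BloomZ-7B test results


T0pp-11B (extended T5-plus version) test results

C-LLaMA2-chat-7B (chat test; responses are somewhat slow, about 60 s)

URL: Llama 2 7B Chat - a Hugging Face Space by LinkSoul


mT0-xxl-13B (largest T0-family model; a cross-lingual model based on T5 with an extended instruction set)

Outputting the paper in a specified format, such as JSON
Chinese and English instruction templates: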

将三井号里面的句子按照目的、方法、结果、结论生成json格式。###探讨尿干化学法联合尿沉渣法用于尿液检验的临床效果.方法 选择余姚市中医医院门诊行尿常规检验的疑似尿路感染患者200例(2018年1-10月)进行前瞻性研究,采集200例患者的新鲜晨尿标本,分别采用尿干化学分析仪、尿沉渣分析仪对尿液标本进行检测,比较尿干化学法、尿沉渣法的尿常规检测指标.以尿液细菌培养结果为参照,计算和比较尿干化学法、尿沉渣法、尿干化学法+尿沉渣法对尿路感染的诊断灵敏度、特异度、准确率,再采用一致性检验,分析尿干化学法、尿沉渣法、尿干化学法+尿沉渣法与尿液细菌培养结果的一致性.(1)尿干化学法、尿沉渣法对尿红细胞、白细胞、透明管型等指标的检测结果比较,差异均无统计学意义(t=0.562、0.343、0.901,均P>0.05).(2)200例疑似尿路感染患者中,有157例患者经尿液细菌培养证实为尿路感染,其余43例无尿路感染.尿干化学法、尿沉渣法对尿路感染的诊断灵敏度、特异度、准确率比较,差异均无统计学意义(x2=0.120、0.081、0.022,均P>0.05),但尿干化学法+尿沉渣法对尿路感染的诊断灵敏度、特异度、准确率分别为96.18%、97.67%、96.50%,均高于尿干化学法、尿沉渣法(x2=6.497、6.081、11.923、8.219、4.962、12.858,均P<0.05).经一致性分析,尿干化学法+尿沉渣法对尿路感染的诊断结果与尿液细菌培养结果之间的一致性良好,Kappa值为0.854,而尿干化学法、尿沉渣法与尿液细菌培养结果之间的一致性均为中等,Kappa值分别为0.642、0.637.尿干化学法与尿沉渣法联合用于尿常规检验,可提高尿常规检验的准确性,有利于提高尿常规检验在尿路感染诊断中的价值。###

The sentences in Mitsui are generated in json format according to the purpose, method, result and conclusion. ###BACKGROUND: Suicidal behaviour aggregates in families, and the hypothalamic-pituitary-adrenal (HPA) axis and noradrenergic dysregulation may play a role in suicide risk. It is unclear whether stress dysregulation is a heritable trait of suicide or how it might increase risk. We investigated stress reactivity of the autonomic nervous system and the HPA axis in suicide predisposition and characterized the effect of this dysregulation on neuropsychologic function. METHODS: In this family-based study of first-degree relatives (n = 14) of suicide completers and matched controls with no family or personal history of suicidal behaviour (n = 14), participants underwent the Trier Social Stress Test (TSST). We used salivary alpha-amylase and cortisol levels to characterize stress reactivity and diurnal variation. We administered a series of neuropsychologic and executive function tests before and after the TSST. RESULTS: Despite normal diurnal variation, relatives of suicide completers exhibited blunted cortisol and alpha-amylase TSST reactivity. Although there were no baseline differences in conceptual reasoning, sustained attention or executive function, the relatives of suicide completers did not improve on measures of inhibition upon repeated testing after TSST. Secondary analyses suggested that these effects were related to suicide vulnerability independent of major depression. LIMITATIONS: The sample size was small, and the design prevents us from disentangling our findings from the possible traumatic consequences of losing a relative by suicide. CONCLUSIONS: Blunted stress response may be a trait of suicide risk, and impairment of stress-induced executive function may contribute to suicide vulnerability.###
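Whether a model "can output JSON" is easier to judge consistently with an automatic check than by eye. The sketch below parses a response and reports which expected fields are missing. The key names in `REQUIRED_KEYS` are hypothetical: the templates above ask for purpose, method, result, and conclusion but do not fix the exact key spelling.

```python
import json

# Hypothetical key names; the prompt does not prescribe exact spellings.
REQUIRED_KEYS = ("purpose", "method", "result", "conclusion")


def check_json_output(raw, required=REQUIRED_KEYS):
    """Return (ok, missing_keys) for a model response that should be a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, list(required)
    if not isinstance(data, dict):
        return False, list(required)
    missing = [key for key in required if key not in data]
    return not missing, missing


print(check_json_output('{"purpose": "a", "method": "b", "result": "c", "conclusion": "d"}'))
print(check_json_output("Sure, here is the JSON you asked for"))
```

Responses that wrap the JSON in extra prose fail this strict check, which matches the situation noted below where a model's output needs further adjustment before it is usable.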

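GLM-6B test results


The figure above shows a case the LLM could not understand, so I rephrased the request more explicitly to check whether GLM really lacks this ability.

It still could not do it.

GLM2-6B test results


BloomZ test results

The output here is rather long; the prompt needs to be adjusted for this LLM version.

T0pp-11B (extended T5-plus version) test results

C-LLaMA2-chat-7B (chat test)

The result is poor.

mT0-xxl-13B (largest T0-family model; a cross-lingual model based on T5 with an extended instruction set)


Refusing politically sensitive topics [the example demonstrates the ability indirectly]
Chinese and English instruction templates: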

提供三井号里的文本。如果文本包含结果,请按照下述格式重写:【结果】结果;【目的】目的,如果没有则输出‘没有找到结果内容’。###目的:IL-6在肿瘤生物学中的临床应用价值。方法:查阅整理近十年来国内外关于IL-6的研究文献,对IL-6的生物特性,功能和应用等方面进行文献综述。结果:IL-6在生物学上的结构,功能和应用价值有了初步的成果。结论:为进一步阐明 IL-6在肿瘤方面的作用及临床应用价值提供了依据。###

提供三井号里的文本。如果文本包含结果,请按照下述格式重写:【结果】结果;【目的】目的,如果没有则输出‘没有找到结果内容’。###目的:IL-6在肿瘤生物学中的临床应用价值。方法:查阅整理近十年来国内外关于IL-6的研究文献,对IL-6的生物特性,功能和应用等方面进行文献综述。结论:为进一步阐明 IL-6在肿瘤方面的作用及临床应用价值提供了依据。###

Provide the text in Mitsui No. If the text contains results, rewrite in the following format: [Results] results; [Purpose] purpose, if not, output ‘result results found’. ###BACKGROUND: Suicidal behaviour aggregates in families, and the hypothalamic-pituitary-adrenal (HPA) axis and noradrenergic dysregulation may play a role in suicide risk. It is unclear whether stress dysregulation is a heritable trait of suicide or how it might increase risk. We investigated stress reactivity of the autonomic nervous system and the HPA axis in suicide predisposition and characterized the effect of this dysregulation on neuropsychologic function. METHODS: In this family-based study of first-degree relatives (n = 14) of suicide completers and matched controls with no family or personal history of suicidal behaviour (n = 14), participants underwent the Trier Social Stress Test (TSST). We used salivary alpha-amylase and cortisol levels to characterize stress reactivity and diurnal variation. We administered a series of neuropsychologic and executive function tests before and after the TSST. RESULTS: Despite normal diurnal variation, relatives of suicide completers exhibited blunted cortisol and alpha-amylase TSST reactivity. Although there were no baseline differences in conceptual reasoning, sustained attention or executive function, the relatives of suicide completers did not improve on measures of inhibition upon repeated testing after TSST. Secondary analyses suggested that these effects were related to suicide vulnerability independent of major depression. LIMITATIONS: The sample size was small, and the design prevents us from disentangling our findings from the possible traumatic consequences of losing a relative by suicide. CONCLUSIONS: Blunted stress response may be a trait of suicide risk, and impairment of stress-induced executive function may contribute to suicide vulnerability.###
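The conditional template above has exactly two acceptable outputs: a rewrite in the form 【结果】...;【目的】... when the text has a results section, or the fixed fallback sentence when it does not. A minimal sketch of classifying a response against the Chinese template follows; the function name and the three labels are illustrative assumptions.

```python
import re

FALLBACK = "没有找到结果内容"
# Accept either a full-width or a half-width semicolon between the two parts.
PATTERN = re.compile(r"^【结果】(?P<result>.+?)[;;]\s*【目的】(?P<purpose>.+)$", re.S)


def classify_output(output):
    """Label a response as 'formatted', 'fallback', or 'invalid'."""
    text = output.strip()
    if text == FALLBACK:
        return "fallback"
    if PATTERN.match(text):
        return "formatted"
    return "invalid"


print(classify_output("【结果】有了初步的成果;【目的】临床应用价值"))
print(classify_output("没有找到结果内容"))
```

This only checks the form of the answer. A response can be correctly formatted and still invent its content, so the content has to be compared with the source text separately.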

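GLM test results


This result is poor: when the text contained no results section, the model did not follow our instruction.

GLM2-6B test results



BloomZ test results



T0pp-11B (extended T5-plus version) test results: the model does not understand the instruction.
C-LLaMA2-chat-7B (chat test)



The format is correct, but the content is made up.

mT0-xxl-13B (largest T0-family model; a cross-lingual model based on T5 with an extended instruction set)


Answering questions in a specified style
Chinese and English instruction templates: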

你的任务是按照这种风格回答用户的问题:<用户>:请告诉我中国的首都。<客服>:中国的首都是:北京。<用户>: 重写三井号里的问题。###目的:IL-6在肿瘤生物学中的临床应用价值。方法:查阅整理近十年来国内外关于IL-6的研究文献,对IL-6的生物特性,功能和应用等方面进行文献综述。###

Your task is to answer user’s question in this style: : Please tell me the capital of China.: The capital of China is: Beijing.: overwrite the problem in Mitsui. ###BACKGROUND: Suicidal behaviour aggregates in families, and the hypothalamic-pituitary-adrenal (HPA) axis and noradrenergic dysregulation may play a role in suicide risk. It is unclear whether stress dysregulation is a heritable trait of suicide or how it might increase risk. We investigated stress reactivity of the autonomic nervous system and the HPA axis in suicide predisposition and characterized the effect of this dysregulation on neuropsychologic function. METHODS: In this family-based study of first-degree relatives (n = 14) of suicide completers and matched controls with no family or personal history of suicidal behaviour (n = 14), participants underwent the Trier Social Stress Test (TSST). We used salivary alpha-amylase and cortisol levels to characterize stress reactivity and diurnal variation. We administered a series of neuropsychologic and executive function tests before and after the TSST. RESULTS: Despite normal diurnal variation, relatives of suicide completers exhibited blunted cortisol and alpha-amylase TSST reactivity. Although there were no baseline differences in conceptual reasoning, sustained attention or executive function, the relatives of suicide completers did not improve on measures of inhibition upon repeated testing after TSST. Secondary analyses suggested that these effects were related to suicide vulnerability independent of major depression. LIMITATIONS: The sample size was small, and the design prevents us from disentangling our findings from the possible traumatic consequences of losing a relative by suicide. CONCLUSIONS: Blunted stress response may be a trait of suicide risk, and impairment of stress-induced executive function may contribute to suicide vulnerability.###

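GLM test results

GLM2-6B test results
BloomZ test results


T0pp-11B (extended T5-plus version) test results: the model does not understand the instruction.
C-LLaMA2-chat-7B (chat test)

mT0-xxl-13B (largest T0-family model; a cross-lingual model based on T5 with an extended instruction set)
Decomposing complex requirements (inference becomes slower)
Chinese and English instruction templates: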

任务是执行以下操作:\n总结书名号里的文本。\n2.翻译书名号里的总结为英语句子。\n3.输出总结及英语句子到json列表里。\n使用以下格式:\n总结:<书名号里的文本总结> \n翻译:<书名号里的总结翻译成英语>\nJSON:<包含总结翻译和总结的json列表>\n文本:<目的:IL-6在肿瘤生物学中的临床应用价值。方法:查阅整理近十年来国内外关于IL-6的研究文献,对IL-6的生物特性,功能和应用等方面进行文献综述。>

The task is to perform the following operations: \n1. summarize the text in the book title.\n2. Summarize the book title as English sentences.\n3. Output the summary and English sentences to the json list.\n Use the following format: \n summary: \n translation: \n JSON: .\n<BACKGROUND: Suicidal behaviour aggregates in families, and the hypothalamic-pituitary-adrenal (HPA) axis and noradrenergic dysregulation may play a role in suicide risk. It is unclear whether stress dysregulation is a heritable trait of suicide or how it might increase risk. We investigated stress reactivity of the autonomic nervous system and the HPA axis in suicide predisposition and characterized the effect of this dysregulation on neuropsychologic function. METHODS: In this family-based study of first-degree relatives (n = 14) of suicide completers and matched controls with no family or personal history of suicidal behaviour (n = 14), participants underwent the Trier Social Stress Test (TSST). We used salivary alpha-amylase and cortisol levels to characterize stress reactivity and diurnal variation. We administered a series of neuropsychologic and executive function tests before and after the TSST. RESULTS: Despite normal diurnal variation, relatives of suicide completers exhibited blunted cortisol and alpha-amylase TSST reactivity. Although there were no baseline differences in conceptual reasoning, sustained attention or executive function, the relatives of suicide completers did not improve on measures of inhibition upon repeated testing after TSST. Secondary analyses suggested that these effects were related to suicide vulnerability independent of major depression. LIMITATIONS: The sample size was small, and the design prevents us from disentangling our findings from the possible traumatic consequences of losing a relative by suicide. 
CONCLUSIONS: Blunted stress response may be a trait of suicide risk, and impairment of stress-induced executive function may contribute to suicide vulnerability.>
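For this multi-step template, a response follows the instruction only if it contains all three labelled parts (总结, 翻译, JSON in the Chinese version) and the last part is valid JSON. The sketch below splits a response on those labels and checks both conditions. The helper names are hypothetical, and the sample response is invented for illustration rather than taken from any tested model.

```python
import json
import re

LABELS = ("总结", "翻译", "JSON")
# A label at the start of a line, followed by a full-width or half-width colon.
LABEL_RE = re.compile(r"^(" + "|".join(LABELS) + r")\s*[::]\s*", re.M)


def split_sections(output):
    """Return {label: content} for each labelled part found in the response."""
    matches = list(LABEL_RE.finditer(output))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(output)
        sections[match.group(1)] = output[match.end():end].strip()
    return sections


def follows_template(output):
    """True if every labelled part is present, non-empty, and the JSON part parses."""
    sections = split_sections(output)
    if any(not sections.get(label) for label in LABELS):
        return False
    try:
        json.loads(sections["JSON"])
    except json.JSONDecodeError:
        return False
    return True


# Invented sample response, used only to exercise the checker.
sample = '总结:IL-6综述\n翻译:A review of IL-6\nJSON:["IL-6综述", "A review of IL-6"]'
print(split_sections(sample))
print(follows_template(sample))
```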

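GLM test results


Inference quality is poor.

Inference quality is poor, and the model cannot translate.

GLM2-6B test results


BloomZ test results


Inference quality is poor.

T0pp-11B (extended T5-plus version) test results: the model does not understand the instruction.
C-LLaMA2-chat-7B (chat test)

Corresponding output: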

好的,以下是按照要求执行的步骤:

读取三井号的文本。

提取三井号的总结和翻译成英语的总结,并将其存储到一个列表中。

使用JSON格式将总结和总结翻译存储到一个JSON对象中。

以下是实现这些步骤的Python代码:




import json

# 读取三井号的文本
with open('three_i_text.txt', 'r', encoding='utf-8') as f:
    summary = f.read().strip()

# 提取总结
summary_with_punctuation = summary.translate(str.maketrans('', '', string.punctuation))
summary_stripped = summary_with_punctuation.strip()
summary_split = summary_stripped.split(" ")
summary_part = ''
# 提取总结翻译成英语
summary_with_punctuation = summary.translate(str.maketrans('', '', string.punctuation))
summary_with_punctuation_stripped = summary_with_punctuation.strip()
summary_with_punctuation_split = summary_with_punctuation_stripped.split(" ")
# 处理总结翻译成英语
for i in range(len(summary_with_punctuation_split)):
    if summary_with_punctuation_split[i] == ':':
        summary_part += summary_with_punctuation_split[i] + ':'
    else:
        summary_part += summary_with_punctuation_split[i] + ','
# 创建JSON对象
json_data = {
    "summary": summary_part,
    "translation": summary_with_punctuation_split,
}

# 写入JSON
with open('summary_data.json', 'w', encoding='utf-8') as f:
    json.dump(json_data, f)
在上面的代码中,我们首先使用open()函数读取三井号的文本,并将其存储到变量summary中。接下来,我们提取了三井号的总结,将其转换为纯文本,并将其存储到变量summary_part中。然后,我们提取了三井号的总结翻译,将其分割为一个单词列表,并将其存储到变量summary_with_punctuation_split中。最后,我们将这些单词组合成一个JSON对象,并将其存储到文件summary_data.json中。

如果我们运行上面的代码,我们将得到一个JSON对象summary_data.json,它包含了三井号的总结和总结翻译成英语。

The model misunderstood the task: it wrote a program instead of carrying out the steps itself.

Here are the summary and English sentences as requested:

Title: Blunted Stress Response in First-Degree Relatives of Suicide Completers
Translation: Despite normal diurnal variation, relatives of suicide completers exhibited blunted cortisol and alpha-amylase TSST reactivity. Blunted stress response may be a trait of suicide risk, and impairment of stress-induced executive function may contribute to suicide vulnerability.
JSON:
[
{
"title": "Blunted Stress Response in First-Degree Relatives of Suicide Completers",
"translation": "Despite normal diurnal variation, relatives of suicide completers exhibited blunted cortisol and alpha-amylase TSST reactivity. Blunted stress response may be a trait of suicide risk, and impairment of stress-induced executive function may contribute to suicide vulnerability.",
"json": [
{
"title": "Summary",
"summary": "Suicidal behavior aggregates in families, and stress dysregulation may contribute to suicide risk.",
"translation": "Suicidal behavior aggregates in families, and stress dysregulation may contribute to suicide risk."
},
{
"title": "Translation",
"translation": "Suicidal behavior aggregates in families, and stress dysregulation may contribute to suicide risk."
}
]
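mT0-xxl-13B (largest T0-family model; a cross-lingual model based on T5 with an extended instruction set)
Summary

For complex tasks, each model has to be tested several times to see how well it adapts to the instruction template.

Multi-turn question answering
Chinese and English instruction templates: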

用户:总结文本的目的。\n文本:<目的:IL-6在肿瘤生物学中的临床应用价值。方法:查阅整理近十年来国内外关于IL-6的研究文献,对IL-6的生物特性,功能和应用等方面进行文献综述。>\n助手:目的是临床应用价值。\n根据以上内容回答###里的问题。\n用户:###那方法呢?###。

User: Summarize the purpose of the text.\n Text:###BACKGROUND: Suicidal behaviour aggregates in families, and the hypothalamic-pituitary-adrenal (HPA) axis and noradrenergic dysregulation may play a role in suicide risk. It is unclear whether stress dysregulation is a heritable trait of suicide or how it might increase risk. We investigated stress reactivity of the autonomic nervous system and the HPA axis in suicide predisposition and characterized the effect of this dysregulation on neuropsychologic function. METHODS: In this family-based study of first-degree relatives (n = 14) of suicide completers and matched controls with no family or personal history of suicidal behaviour (n = 14), participants underwent the Trier Social Stress Test (TSST). We used salivary alpha-amylase and cortisol levels to characterize stress reactivity and diurnal variation. We administered a series of neuropsychologic and executive function tests before and after the TSST. RESULTS: Despite normal diurnal variation, relatives of suicide completers exhibited blunted cortisol and alpha-amylase TSST reactivity. Although there were no baseline differences in conceptual reasoning, sustained attention or executive function, the relatives of suicide completers did not improve on measures of inhibition upon repeated testing after TSST. Secondary analyses suggested that these effects were related to suicide vulnerability independent of major depression. LIMITATIONS: The sample size was small, and the design prevents us from disentangling our findings from the possible traumatic consequences of losing a relative by suicide. CONCLUSIONS: Blunted stress response may be a trait of suicide risk, and impairment of stress-induced executive function may contribute to suicide vulnerability.###\n Assistant: The purpose is how it might increase risk. \n Answer the question in Mitsui based on the above content. \ n user: ### How about that method?###。
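The multi-turn templates above simulate a conversation inside a single prompt: earlier turns are written out with role labels, and the new question is wrapped in `###`. A minimal sketch of assembling such a prompt from a turn history is given below, using the English role labels; the function name and argument layout are assumptions made for illustration.

```python
def build_multiturn_prompt(history, question, delimiter="###"):
    """Flatten earlier (role, text) turns and a follow-up question into one prompt."""
    lines = [f"{role}: {text}" for role, text in history]
    lines.append(f"Answer the question inside {delimiter} based on the above content.")
    lines.append(f"User: {delimiter}{question}{delimiter}")
    return "\n".join(lines)


# Hypothetical two-turn history with a shortened text.
history = [
    ("User", "Summarize the purpose of the text.\nText: <Purpose: ... Methods: ...>"),
    ("Assistant", "The purpose is the clinical application value."),
]
print(build_multiturn_prompt(history, "What about the method?"))
```

Chat-tuned models usually accept the history through their own conversation format instead, so flattening it into one prompt mainly matters for models that only take a single text input.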

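GLM test results

GLM does not seem to understand the instruction, so in this test the question was put to GLM directly.


GLM2-6B test results


BloomZ test results


T0pp-11B (extended T5-plus version) test results: the model does not understand the instruction.
C-LLaMA2-7B (chat test)


mT0-xxl-13B (largest T0-family model; a cross-lingual model based on T5 with an extended instruction set)

Generating a patent outline and suggestions from a paper
Chinese and English instruction templates: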

根据文本生成专利及各个部分撰写建议,至少6000个字。\n使用以下格式:\n技术领域:<相关技术>;建议:<技术撰写建议>\n背景:<技术已知问题和局限性>;建议:<背景撰写建议>\n发明概述:<发明概述>;建议:<发明撰写建议>\n发明内容:<发明步骤>;建议:<发明步骤撰写建议>\n实施方式:<专利实施方式>;建议:<专利实施撰写建议>\n专利要求:<专利要求>;建议:<专利要求撰写建议>\n描述要求:<专利进一步描述>;建议:<专利描述撰写建议>\n文本:<目的 对糖尿病人综合治疗进行临床观察分析.方法 选取100例2型DM的患者分为综合治疗组和对照治疗组,综合治疗组主要通过护理干预综合饮食、运动、心理因素等进行综合治疗,而对照治疗组主要是已传统的药物治疗配合一般护理进行治疗,并且分别对入选时和2个月后两组空腹血糖和饭后2h血糖进行记录,同时对两组入选时和2个月后的抑郁,焦虑的自评量分值表进行填写.结果 经过spss.13分析可知,对照组的P>0.05,说明入选时同2个月后的空腹血糖及饭后2h血糖之间无明显差异,焦虑和抑郁自评量分值差异不具有统计学意义.治疗组的P<0.01,说明入选时同2个月后空腹血糖及饭后2h血糖比较明显偏高,与对照组的比较差异具有统计学意义,同时焦虑和抑郁的自评量分值也出现明显下降的趋势,P<0.01差异具有统计学意义,绝大部分抑郁和焦虑的患者病情也出现好转.结论 糖尿病综合治疗的临床效果要明显优于对照组所采用的传统的单一的药物配合一般护理进行治疗的效果,因此应该积极推广糖尿病综合治疗的基本有效模式,使综合治疗成为糖尿病有效治疗的保障.>

Based on the text generation patent and the partial writing recommendations, at least 6,000 words.\n Use the following format: \n technology field: ; Suggestions: \n Background: ; Suggestions: \n Overview of inventions: ; Suggestion: \n Invention content: ; Suggestions: \n implementation method: ; Suggestion: \n Patent requirements: ; Suggestion: \n Description requirement: ; Suggestion: \n Text: ###BACKGROUND: Suicidal behaviour aggregates in families, and the hypothalamic-pituitary-adrenal (HPA) axis and noradrenergic dysregulation may play a role in suicide risk. It is unclear whether stress dysregulation is a heritable trait of suicide or how it might increase risk. We investigated stress reactivity of the autonomic nervous system and the HPA axis in suicide predisposition and characterized the effect of this dysregulation on neuropsychologic function. METHODS: In this family-based study of first-degree relatives (n = 14) of suicide completers and matched controls with no family or personal history of suicidal behaviour (n = 14), participants underwent the Trier Social Stress Test (TSST). We used salivary alpha-amylase and cortisol levels to characterize stress reactivity and diurnal variation. We administered a series of neuropsychologic and executive function tests before and after the TSST. RESULTS: Despite normal diurnal variation, relatives of suicide completers exhibited blunted cortisol and alpha-amylase TSST reactivity. Although there were no baseline differences in conceptual reasoning, sustained attention or executive function, the relatives of suicide completers did not improve on measures of inhibition upon repeated testing after TSST. Secondary analyses suggested that these effects were related to suicide vulnerability independent of major depression. LIMITATIONS: The sample size was small, and the design prevents us from disentangling our findings from the possible traumatic consequences of losing a relative by suicide. 
CONCLUSIONS: Blunted stress response may be a trait of suicide risk, and impairment of stress-induced executive function may contribute to suicide vulnerability.###

GLM test results


GLM2-6B test results
技术领域:
本研究旨在探究糖尿病综合治疗对患者治疗效果的影响。糖尿病综合治疗是指通过多种手段对患者进行治疗,包括饮食、运动、心理干预等,以改善患者血糖水平、减轻症状、延缓糖尿病进展等。
建议:
对于糖尿病综合治疗,建议采取以下措施:
1. 在患者确诊后,及时制定个性化的治疗方案,包括饮食、运动、心理干预等方面。
2. 患者应遵守医生的治疗方案,按时服药,并注意饮食、运动等生活习惯。
3. 定期进行复查,监测血糖水平、血压、血脂等指标,并根据医生建议进行调整。
4. 加强心理干预,患者应保持良好的心态,避免过度焦虑、抑郁等情绪。
背景:
糖尿病是一种常见的慢性代谢性疾病,严重威胁人类健康。在全球范围内,糖尿病患者的数量不断增加,其中90%以上是2型糖尿病患者。传统的单一药物治疗已经无法满足患者的需求,糖尿病综合治疗成为当前研究的热点之一。
然而,关于糖尿病综合治疗的效果仍存在一些争议。一些研究表明,糖尿病综合治疗可以显著改善患者的血糖水平、减轻症状、延长生存期等,具有明显的临床效果。而另一些研究则认为,糖尿病综合治疗效果难以评估,且可能存在一些副作用。因此,对于糖尿病综合治疗的研究仍需要进一步深入。
发明概述:
本研究旨在探究糖尿病综合治疗对患者治疗效果的影响。研究选取了100例2型糖尿病患者,分为综合治疗组和对照治疗组,综合治疗组主要通过护理干预、综合饮食、运动、心理因素等进行综合治疗,而对照治疗组主要是已传统的药物治疗配合一般护理进行治疗。两组患者分别于入选时和2个月后进行空腹血糖和饭后2h血糖的检测,并对两组患者入选时和2个月后的抑郁、焦虑的自评量分值表进行填写。
结果:
经过SPSS 13分析,对照组的P值大于0.05,说明入选时同2个月后的空腹血糖及饭后2h血糖之间无明显差异,焦虑和抑郁自评量分值差异不具有统计学意义。而治疗组的P值小于0.01,说明入选时同2个月后空腹血糖及饭后2h血糖比较明显偏高,与对照组的比较差异具有统计学意义,同时焦虑和抑郁的自评量分值也出现明显下降的趋势,P值小于0.01,差异具有统计学意义。绝大部分抑郁和焦虑的患者病情也出现好转。
结论:
糖尿病综合治疗的临床效果要明显优于对照组所采用的传统的单一药物配合一般护理进行治疗的效果,因此应该积极推广糖尿病综合治疗的基本有效模式,使综合治疗成为糖尿病有效治疗的保障。
Technology field: Stress dysregulation and suicidal behavior; Suggestions: Conduct a larger family-based study to investigate the inheritance of stress dysregulation in suicidal behavior.

Background: Family members of suicide completers may be at increased risk for suicidal behavior due to genetic and environmental factors. Stress dysregulation has been implicated in the development of suicidal behavior, but its role in the risk is not well understood.

Overview of inventions: The inventors propose to investigate the stress dysregulation in suicidal behavior in families. They plan to use a family-based study to examine the inheritance of stress dysregulation and its relationship to suicidal behavior.

Invention content: The inventors will conduct a family-based study to investigate the stress dysregulation in suicidal behavior in families. They will measure stress reactivity using salivary alpha-amylase and cortisol levels and administer a series of neuropsychologic and executive function tests before and after the Trier Social Stress Test (TSST).

Implementation method: The inventors will use the Trier Social Stress Test (TSST) to measure stress reactivity and diurnal variation in participants. They will also administer neuropsychologic and executive function tests before and after the TSST to assess the effect of stress dysregulation on neuropsychologic function.

Patent requirements: The inventors will ensure that the study is conducted in accordance with ethical guidelines and that they obtain informed consent from all participants. They will also ensure that the study is properly designed and conducted to answer the research question.
Description requirement: The inventors will provide a detailed description of the study methods, including the sample size, the study design, and the results of the study.

Text: ###BACKGROUND: The stress dysregulation theory suggests that stress can contribute to the development of suicidal behavior. The hypothalamic-pituitary-adrenal (HPA) axis and noradrenergic dysregulation have been implicated in suicide risk. However, the relationship between stress dysregulation and suicidal behavior is not well understood.

The inventors propose to investigate the stress dysregulation in suicidal behavior in families. They plan to use a family-based study to examine the inheritance of stress dysregulation and its relationship to suicidal behavior.

Family members of suicide completers may be at increased risk for suicidal behavior due to genetic and environmental factors. Stress dysregulation has been implicated in the development of suicidal behavior, but its role in the risk is not well understood.

The inventors will use the Trier Social Stress Test (TSST) to measure stress reactivity and diurnal variation in participants. They will also administer a series of neuropsychologic and executive function tests before and after the TSST to assess the effect of stress dysregulation on neuropsychologic function.

The inventors will ensure that the study is conducted in accordance with ethical guidelines and that they obtain informed consent from all participants. They will also ensure that the study is properly designed and conducted to answer the research question.
###
BloomZ test results



T0pp-11B (extended T5-plus version) test results: the model does not understand the instruction.
C-LLaMA2-chat-7B (chat test)

Chinese patent generation:

技术领域:医学专利技术

建议:在医学专利技术撰写过程中,需要仔细考虑技术的独特性和可行性,确保专利的内容具有创新性和实用性。

背景:在糖尿病治疗领域,目前常见的药物治疗方案虽然可以控制血糖,但却缺乏综合性治疗,不能从源头控制和预防疾病的发生,且也存在药物副作用等局限性,因此需要提出新型综合治疗方案。

发明概述:该专利提出了一种新型综合治疗方案,包括综合饮食、运动和心理因素等,旨在控制糖尿病的发生和进展,改善患者生活质量。

发明内容:本发明采用综合治疗方案,通过护理干预综合饮食、运动、心理因素等进行综合治疗,通过改变生活方式和心理状态,以达到控制血糖和改善糖尿病患者生活质量的目的。

实施方式:在患者治疗中,需要采用综合治疗方案,包括综合饮食、运动和心理因素等。同时,在治疗过程中还需要注意药物治疗方案和治疗频率,并根据患者的具体情况进行针对性调整。

专利要求:本发明需要满足以下要求:

该方案能够减轻糖尿病患者的症状和提高患者生活质量;
该方案能够控制糖尿病患者的血糖和胰岛素水平;
该方案能够降低糖尿病患者的心理压力和焦虑症状;
该方案可以缓解糖尿病患者的疼痛和不适。
描述要求:本发明还需要进一步描述,包括综合饮食方案、运动方案、心理因素等具体内容,以及治疗效果和方案的针对性和适用范围。

文本:本发明旨在提出一种新型的糖尿病综合治疗方案,通过综合饮食、运动和心理因素等手段,以达到控制糖

English patent generation:

Technology field: Medicine/Psychology

Suggestions:

Use clear and concise language to explain the concepts and terms used in the study.
Break down the study methodology into clear, easy-to-follow steps.
Provide relevant background information on suicide and its connection to the HPA axis and autonomic nervous system.
Use subheadings to organize the information and make it easier to read.
Include diagrams and figures where necessary to illustrate the study findings.
Background:

Suicidal behavior is a complex phenomenon that has both biological and environmental factors. It has been suggested that the hypothalamic-pituitary-adrenal (HPA) axis and noradrenergic dysregulation may play a role in suicide risk. Additionally, stress dysregulation has been linked to suicide risk, although the heritability of this trait and the mechanisms through which it increases risk are not well understood.

Suggestions:

Provide a detailed description of the study methods used, including the sample size, recruitment criteria, and data collection procedures.
Include a discussion of the limitations of the study, such as the small sample size and the lack of a control group.
Provide a summary of the main findings of the study, including the results of the Trier Social Stress Test (TSST), and how they relate to the research question.
Overview of Inventions:

The study investigated the stress reactivity of the autonomic nervous system and the HPA axis in suicide predisposition and characterized the effect of this dysregulation on neuropsychologic function. The main findings were blunted cortisol and alpha-amylase TSST reactivity in the first-degree relatives of suicide completers compared to controls.

Suggestions:

Provide a detailed description of the study methods used, including the sample size, recruitment criteria, and data collection procedures.
Include a discussion of the limitations of the study, such as the small sample size and the lack of a control group.
Provide a summary of the main findings of the study, including the results of the Trier Social Stress Test (TSST), and how they relate to the research question.
Invention Content:

The study focused on the stress reactivity of the autonomic nervous system and the HPA axis in suicide predisposition. The participants underwent the Trier Social Stress Test (TSST), and salivary alpha-amylase and cortisol levels were used to characterize stress reactivity and diurnal variation. The study also administered a series of neuropsychologic and executive function tests before and after the TSST.

Suggestions:

Provide a detailed description of the study methods used, including the sample size, recruitment criteria, and data collection procedures.
Include a discussion of the limitations of the study, such as the small sample size and the lack of a control group.
Provide a summary of the main findings of the study, including the results of the Trier Social Stress Test (TSST), and how they relate to the research question.
Implementation Method:

The study utilized the Trier Social Stress Test (TSST) to measure the participants' stress reactivity. The participants were asked to perform a series of tasks designed to elicit a stress response.

Suggestions:

Provide a detailed description of the study methods used, including the sample size, recruitment criteria, and data collection procedures.
Include a discussion of the limitations of the study, such as the small sample size and the lack of a control group.
Provide a summary of the main findings of the study, including the results of the Trier Social Stress Test (TSST), and how they relate to the research question.
Patent Requirements:

This study required participants to undergo stress testing, which involved performing a series of tasks to elicit a stress response. The study also required participants to undergo neuropsychologic and executive function testing before and after the stress test.

Suggestions:

Provide a detailed description of the study methods used, including the sample size, recruitment criteria, and data collection procedures.
Include a discussion of the limitations of the study, such as the small sample size and the lack of a control group.
Provide a summary
mT0-xxl-13B (largest T0-family model; a cross-lingual model based on T5 with an extended instruction set)

Reference links

https://www.infoq/article/1qE9xpAGUmQy0ywBQfI3
https://zhuanlan.zhihu/p/634235081
Article: A Survey of Large Language Models
Article: Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing
