[NLP] [Large Models] DeepSpeed + Transformers: A Quick and Easy Start to Fine-Tuning Models with Tens of Billions of Parameters


Related blog posts
[Deep Learning] Mixed-Precision Training and Memory Analysis
[NLP] [Large Models] ChatGLM-6B Model Architecture Code Walkthrough (Single-Machine Version)
[NLP] [Large Models] LoRA: Fine-Tuning Large Models with Very Few Resources, with BLOOM-LoRA Implementation Code
[NLP] [ChatGPT Series] FLAN: Fine-Tuned Language Models Are Zero-Shot Learners
[Deep Learning] [Distributed Training] Collective Communication Operations with PyTorch Examples
[NLP] [Large Models] Benchmarking Inference Tools for the BLOOM Large Language Model
[NLP] [Large Models] GLM-130B: An Open Bilingual Pre-Trained Language Model
[NLP] [Large Models] Introduction to 8-bit Matrix Multiplication for Large Transformers
[NLP] [Large Models] BLOOM: A 176B-Parameter Open-Access Multilingual Model

DeepSpeed + Transformers: A Quick and Easy Start to Fine-Tuning Models with Tens of Billions of Parameters

This article uses the roughly ten-billion-parameter FLAN-T5 XXL model as an example to show how to get started quickly with fine-tuning models of this scale using DeepSpeed and Transformers.

FLAN-T5 is a T5 model that has been instruction-tuned; for background on instruction tuning, see [NLP] [ChatGPT Series] FLAN: Fine-Tuned Language Models Are Zero-Shot Learners. This article uses FLAN-T5 XXL as the example model, which has about 11B parameters.

In addition, this article touches on mixed-precision training and the ZeRO configuration; for the relevant background, see:

  • [Deep Learning] Mixed-Precision Training and Memory Analysis
  • [Deep Learning] [Distributed Training] DeepSpeed: AllReduce and ZeRO-DP
  • [NLP] [Large Models] Introduction to 8-bit Matrix Multiplication for Large Transformers

I. Model Training Code

For brevity and clarity, all of the training code below is assumed to live in a single file, train.py.

1. Import dependencies

import nltk
import torch
import evaluate
import datasets
import numpy as np
from nltk.tokenize import sent_tokenize
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

nltk.download("punkt")

2. Set hyperparameters

dataset_name = "samsum" # 数据集名称
model_name="google/flan-t5-xxl" # 模型名称
max_input_length = 512
max_gen_length = 128
output_dir = "checkpoints"
num_train_epochs = 5
learning_rate = 5e-5
deepspeed_config = "./ds_config.json" # deepspeed配置文件
per_device_train_batch_size=1 # batch size设置为1,因为太大导致OOM
per_device_eval_batch_size=1
gradient_accumulation_steps=2 # 由于单卡的batch size为1,为了扩展batch size,使用梯度累加tokenizer = AutoTokenizer.from_pretrained(model_name)
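As a quick, purely illustrative arithmetic check (the 4-GPU count is an assumption taken from the launch command in Section III), the effective global batch size per optimizer step works out as follows:

# Illustrative only: num_gpus = 4 assumes the --include=localhost:4,5,6,7 launch shown later.
num_gpus = 4
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 1 * 2 * 4 = 8 samples per optimizer step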

3. Load the data

dataset = datasets.load_dataset(dataset_name)
print(dataset["train"][0])

The dataset is samsum, a dialogue-summarization dataset. Each sample looks like this:

{'id': '13818513',
 'dialogue': "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow 😃",
 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'}

4. tokenize

def preprocess(examples):
    dialogues = ["summarize:" + dia for dia in examples["dialogue"]]
    # summaries = [summ for summ in examples["summary"]]
    model_inputs = tokenizer(dialogues, max_length=max_input_length, truncation=True)
    labels = tokenizer(text_target=examples["summary"], max_length=max_gen_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess, batched=True, remove_columns=["dialogue", "summary", "id"])
# print(tokenized_dataset["train"]["input_ids"][0])  # inspect the result

This step only tokenizes the input samples; padding is deferred to the batch-level collate function below.
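To see why deferring padding helps, one can inspect a few tokenized lengths; the snippet below is only an illustration and assumes the tokenized_dataset built above:

# Sequence lengths vary widely per sample, so padding each batch to its own
# longest sequence wastes far fewer pad tokens than always padding to max_input_length.
sample_lengths = [len(ids) for ids in tokenized_dataset["train"]["input_ids"][:8]]
print(sample_lengths)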

5. collate_fn

def collate_fn(features):
    batch_input_ids = [torch.LongTensor(feature["input_ids"]) for feature in features]
    batch_attention_mask = [torch.LongTensor(feature["attention_mask"]) for feature in features]
    batch_labels = [torch.LongTensor(feature["labels"]) for feature in features]
    batch_input_ids = pad_sequence(batch_input_ids, batch_first=True, padding_value=tokenizer.pad_token_id)
    batch_attention_mask = pad_sequence(batch_attention_mask, batch_first=True, padding_value=0)
    batch_labels = pad_sequence(batch_labels, batch_first=True, padding_value=-100)
    return {
        "input_ids": batch_input_ids,
        "attention_mask": batch_attention_mask,
        "labels": batch_labels,
    }

# code for a quick test
# dataloader = DataLoader(tokenized_dataset["test"], shuffle=False, batch_size=4, collate_fn=collate_fn)
# batch = next(iter(dataloader))
# print(batch)

This step pads each batch on the fly, so every batch is padded only to the length of its own longest sequence (dynamic padding).
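The label padding value of -100 is deliberate: Hugging Face seq2seq models such as T5 compute their loss with CrossEntropyLoss(ignore_index=-100), so padded label positions do not contribute to the loss. A minimal hand-built example (toy token IDs, not from samsum) shows the resulting padding behavior:

# Two toy features of different lengths, just to illustrate the collate function.
toy_features = [
    {"input_ids": [5, 6, 7], "attention_mask": [1, 1, 1], "labels": [8, 9]},
    {"input_ids": [5, 6, 7, 8, 9], "attention_mask": [1, 1, 1, 1, 1], "labels": [8, 9, 10, 11]},
]
toy_batch = collate_fn(toy_features)
print(toy_batch["input_ids"].shape)  # torch.Size([2, 5]) -- padded to the longest sequence in the batch
print(toy_batch["labels"])           # shorter label row is right-padded with -100, which the loss ignores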

6. Load the model

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# code for a quick test
# dataloader = DataLoader(tokenized_dataset["test"], shuffle=False, batch_size=4, collate_fn=collate_fn)
# batch = next(iter(dataloader))
# output = model(**batch)
# print(output)

7. Define the evaluation function

metric = evaluate.load("rouge")def compute_metrics(eval_preds):preds, labels = eval_predsif isinstance(preds, tuple):preds = preds[0]decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)labels = np.where(labels != -100, labels, tokenizer.pad_token_id)decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]result = metricpute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)result = {k: round(v * 100, 4) for k, v in result.items()}prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]result["gen_len"] = np.mean(prediction_lens)return result

8. Set training arguments

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    eval_accumulation_steps=1,  # prevent OOM during evaluation
    predict_with_generate=True,
    fp16=False,
    learning_rate=learning_rate,
    num_train_epochs=num_train_epochs,
    # logging & evaluation strategies
    logging_dir="logs",
    logging_strategy="steps",
    logging_steps=50,  # log every 50 steps
    evaluation_strategy="steps",
    eval_steps=500,  # evaluate every 500 steps
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    deepspeed=deepspeed_config,  # path to the DeepSpeed configuration file
    report_to="all",
)

9. Train the model

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
)

trainer.train()
# print results on the validation set
print(trainer.evaluate(tokenized_dataset["validation"]))
# print results on the test set
print(trainer.evaluate(tokenized_dataset["test"]))
# save the best model
trainer.save_model("best")

II. DeepSpeed Configuration File

ds_config.json

{"bf16": {"enabled": "auto"},"optimizer": {"type": "AdamW","params": {"lr": "auto","betas": "auto","eps": "auto","weight_decay": "auto"}},"scheduler": {"type": "WarmupLR","params": {"warmup_min_lr": "auto","warmup_max_lr": "auto","warmup_num_steps": "auto"}},"zero_optimization": {"stage": 3,"offload_optimizer": {"device": "cpu","pin_memory": true},"offload_param": {"device": "cpu","pin_memory": true},"overlap_comm": true,"contiguous_gradients": true,"sub_group_size": 1e9,"reduce_bucket_size": "auto","stage3_prefetch_bucket_size": "auto","stage3_param_persistence_threshold": "auto","stage3_max_live_parameters": 1e9,"stage3_max_reuse_distance": 1e9,"stage3_gather_16bit_weights_on_model_save": false},"gradient_accumulation_steps": "auto","gradient_clipping": "auto","steps_per_print": 2000,"train_batch_size": "auto","train_micro_batch_size_per_gpu": "auto","wall_clock_breakdown": false
}

III. Launch

Here we use the last 4 of the machine's 8 GPUs. In addition, ZeRO-3 (with the CPU offloading configured above) requires sufficient host memory.

deepspeed --include=localhost:4,5,6,7 train.py
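To gauge in advance whether the host has enough memory, DeepSpeed provides a ZeRO-3 memory estimator; the snippet below is an optional, rough check (it first loads the full model on the CPU, which by itself needs on the order of 45 GB of RAM for an 11B-parameter model in fp32):

# Rough ZeRO-3 memory estimate for 1 node with 4 GPUs; adjust to your setup.
from transformers import AutoModelForSeq2SeqLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)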

IV. GPU Memory

