
1. Introduction

Huggingface started out as a chatbot startup headquartered in New York. They originally set out to build chatbots and, along the way, open-sourced the Transformers library on GitHub. The chatbot business never took off, but the library quickly became enormously popular in the machine learning community. The platform now hosts more than 600,000 pretrained models and 130,000 datasets, effectively becoming the GitHub of machine learning. The huggingface website provides the following resources:

  • Datasets: datasets and where to download them
  • Models: the various pretrained models
  • Course: a free NLP course (English only)
  • Docs: documentation

With the rise of large models, many well-known open-source models (e.g. GPT, ChatGLM, LLaMA, Mistral) publish their pretrained weights on huggingface, and just a few lines of code are enough to load them and run training or inference. To make this convenient for users, it is essential to be able to quickly adapt distributed training of huggingface models to different AI chips. This article therefore describes how to quickly adapt huggingface models and run distributed training tasks on the Moore Threads S4000 AI training GPU.
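To give a sense of how little code is needed, the following is a minimal, hedged sketch that loads a pretrained model from the Hub and runs one generation step; the model name "gpt2" and the prompt are placeholder choices.

# Minimal sketch: load a pretrained model from the Hugging Face Hub and run inference.
# "gpt2" and the prompt below are placeholder choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Distributed training on many GPUs is", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))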

2. The MUSA Software Stack

This section briefly introduces the software stack involved in distributed training on Moore Threads GPUs. MUSA (Moore Threads Unified System Architecture) is Moore Threads' unified system architecture, and the MUSA software stack is the family of software built on top of Moore Threads GPUs that unlocks their compute and graphics capabilities.

Figure 1: The MUSA software stack

The Moore Threads Collective Communication Library (MCCL) implements multi-GPU and multi-node communication primitives optimized for Moore Threads GPUs and networking. MCCL provides primitives such as all-gather, all-reduce, broadcast, reduce, reduce-scatter, and point-to-point send and receive, all optimized for high bandwidth and low latency over the PCIe and MTLink high-speed interconnects within a node and over InfiniBand across nodes. MCCL supports both intra-node and inter-node communication; it automatically detects the topology, computes optimal paths, and thereby achieves efficient transfers between GPUs.

Figure 2: MCCL architecture
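As a concrete illustration of these collectives, below is a minimal sketch that drives an MCCL all-reduce through the standard torch.distributed API. It assumes torch_musa is installed and registers the "mccl" backend and the "musa" device string (consistent with the musify step in section 3.2), and that the script is launched with torchrun.

# Hedged sketch: an MCCL all-reduce via torch.distributed, launched with torchrun.
import os
import torch
import torch_musa  # assumed to register the musa device and mccl backend with PyTorch
import torch.distributed as dist

dist.init_process_group("mccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.musa.set_device(local_rank)

# Every rank contributes its own tensor; all_reduce sums them in place on all GPUs.
x = torch.ones(4, device=f"musa:{local_rank}") * dist.get_rank()
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()}: {x}")

dist.destroy_process_group()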

torch_musa is a Python extension package built on top of PyTorch. Developing torch_musa as a plugin decouples it from PyTorch itself and keeps the code easy to maintain. Used together with PyTorch, torch_musa lets users tap the full power of Moore Threads GPUs. In addition, torch_musa has two notable advantages:

  • torch_musa is CUDA-compatible, which greatly reduces the effort of adapting new operators
  • the torch_musa API follows the same format as PyTorch, so users who are used to PyTorch can migrate to torch_musa smoothly
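To illustrate this API compatibility, here is a hedged sketch of the basic single-GPU usage pattern: import torch_musa and move tensors and modules to the "musa" device, exactly as one would with "cuda" (the device string matches the musify step in section 3.2).

# Hedged sketch of the torch_musa usage pattern (assumes torch_musa is installed).
import torch
import torch_musa  # assumed to register the musa device with PyTorch

x = torch.randn(2, 3).to("musa")           # move a tensor to the Moore Threads GPU
layer = torch.nn.Linear(3, 4).to("musa")   # modules move the same way
y = layer(x)                               # the computation runs on the musa device
print(y.device)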

3. Porting to MUSA

3.1 Distributed training demo

Based on huggingface transformers, we wrote a CUDA version of a distributed training demo that uses NVIDIA's NCCL communication library, shown below:

main_hf.py

import os
import math
import time
import json
import argparse
import warnings
import torch
import torch.distributed as dist
from itertools import chain
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from datasets import load_dataset
from transformers import (
    CONFIG_MAPPING,
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    default_data_collator,
    get_scheduler
)
 
def print_rank_0(message):
    if dist.get_rank() == 0:
        print(message)
 
str_to_dtype = {
    "float16": torch.float16,
    "bfloat16": torch.bfloat16,
    "float32": torch.float32,
    "float64": torch.float64,
    "int8": torch.int8,
    "int16": torch.int16,
    "int32": torch.int32,
    "int64": torch.int64,
    "uint8": torch.uint8,
    "bool": torch.bool
}
 
def build_model(args):
    """
    Load pretrained model and tokenizer. In distributed training,
    the .from_pretrained methods guarantee that only one local
    process can concurrently download the model & vocab.
    """
    config = AutoConfig.from_pretrained(args.model_name_or_path, trust_remote_code=True)
 
    tokenizer = AutoTokenizer.from_pretrained(
        args.model_name_or_path,
        use_fast=True,
        trust_remote_code=True
    )
 
    def _modify_config(config):
        with open("override_config.json", "r", encoding='utf8') as fp:
            new_config = json.load(fp)
 
        for key, value in new_config.items():
            if not hasattr(config, key):
                print_rank_0(f"WARNING: Invalid config key: {key} and skip override it")
                continue
            old_value = getattr(config, key)
            if old_value is None:
                print_rank_0(f"WARNING: config {key} is set None and skip override it")
                continue
            if key == "torch_dtype":
                value = str_to_dtype[value]
            if type(old_value) is not type(value):
                raise TypeError(f"Type mismatch of {key}: {old_value} vs {value}")
            print_rank_0(f"modify {key}: {old_value} -> {value}")
            setattr(config, key, value)
 
    if os.path.exists("override_config.json"):
        _modify_config(config)
 
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name_or_path,
        from_tf=bool(".ckpt" in args.model_name_or_path),
        config=config,
        low_cpu_mem_usage=False,
        trust_remote_code=True
    )
 
    embedding_size = model.get_input_embeddings().weight.shape[0]
    if len(tokenizer) > embedding_size:
        model.resize_token_embeddings(len(tokenizer))
    model.tie_weights()
 
    return tokenizer, model
 
 
def get_seq_len(config):
    if hasattr(config, "n_positions"):
        seq_len = config.n_positions
    elif hasattr(config, "max_position_embeddings"):
        seq_len = config.max_position_embeddings
    elif hasattr(config, "seq_length"):
        seq_len = config.seq_length
    else:
        raise RuntimeError(
            "Set the correct attribute of config to get seq_len."
        )
    print_rank_0(f"seq_len = {seq_len}")
    return seq_len
 
 
def build_datasets(args, tokenizer, phase):
    assert phase in ["train", "validation", "test"]
    # Downloading and loading a dataset from the hub.
    raw_datasets = load_dataset(
        args.dataset_name_or_path, args.dataset_config_name
    )
    if "validation" not in raw_datasets.keys():
        raw_datasets["validation"] = load_dataset(
            args.dataset_name_or_path,
            args.dataset_config_name,
            split=f"train[:{args.validation_split_percentage}%]",
        )
        raw_datasets["train"] = load_dataset(
            args.dataset_name_or_path,
            args.dataset_config_name,
            split=f"train[{args.validation_split_percentage}%:]",
        )
    # Preprocessing the datasets.
    # First we tokenize all the texts.
    column_names = raw_datasets["train"].column_names
    text_column_name = "text" if "text" in column_names else column_names[0]
 
    def tokenize_function(examples):
        return tokenizer(examples[text_column_name])
 
    preprocessing_num_workers = 16
    tokenized_datasets = raw_datasets.map(
        tokenize_function,
        batched=True,
        num_proc=preprocessing_num_workers,
        remove_columns=column_names,
        load_from_cache_file=True,
        desc="Running tokenizer on dataset",
    )
    # Main data processing function that will concatenate all texts from our dataset and generate
    # chunks of block_size.
    block_size = tokenizer.model_max_length
    if block_size > 1024:
        print_rank_0(
            "WARNING: The chosen tokenizer supports a `model_max_length` that is longer than the default"
            " `block_size` value of 1024. If you would like to use a longer `block_size` up to"
            " `tokenizer.model_max_length` you can override this with `--block_size xxx`."
        )
        block_size = 1024
 
    def group_texts(examples):
        # Concatenate all texts.
        concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        # We drop the small remainder, we could add padding if the model supported it
        # instead of this drop, you can
        # customize this part to your needs.
        if total_length >= block_size:
            total_length = (total_length // block_size) * block_size
        # Split by chunks of max_len.
        result = {
            k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated_examples.items()
        }
        result["labels"] = result["input_ids"].copy()
        return result
 
    # Note that with `batched=True`, this map processes 1,000 texts together,
    # so group_texts throws away a remainder for each of those groups of 1,000 texts.
    # You can adjust that batch_size here but a higher value might be slower to preprocess.
    #
    # To speed up this part, we use multiprocessing.
    # See the documentation of the map method for more information:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
    lm_datasets = tokenized_datasets.map(
        group_texts,
        batched=True,
        num_proc=preprocessing_num_workers,
        load_from_cache_file=True,
        desc=f"Grouping texts in chunks of {block_size}",
    )
    return lm_datasets[phase]
 
 
def record_timestamp():
    torch.cuda.synchronize()
    return time.time()
 
 
if __name__ == "__main__":
    warnings.filterwarnings("ignore", message="promote has been superseded by promote_options='default'.", category=FutureWarning)
    parser = argparse.ArgumentParser(
        description="Pytorch Example of huggingface",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        default="gpt2",
        help="Path to pretrained model or model identifier from huggingface.co/models.",
        required=False,
    )
    parser.add_argument(
        "--dataset_name_or_path",
        type=str,
        default="wikitext",
        help="The name of the dataset to use (via the datasets library).",
    )
    parser.add_argument(
        "--dataset_config_name",
        type=str,
        default="wikitext-103-raw-v1",
        help="The configuration name of the dataset to use (via the datasets library).",
    )
    parser.add_argument(
        "--batch_size",
        type=int,
        default=2,
        help="Batch size (per device) for the training dataloader.",
    )
    parser.add_argument(
        "--epochs",
        type=int,
        default=100,
        help="the number of training epoch.",
    )
    args = parser.parse_args()
 
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = dist.get_world_size()
    device = torch.device("cuda", local_rank)
    torch.cuda.set_device(local_rank)
 
    torch.manual_seed(123)
    torch.cuda.manual_seed(123)
 
    tokenizer, model = build_model(args)
    print_rank_0(model)
    model = model.to(device)
    model = DDP(model, device_ids=[device], output_device=device)
 
    train_dataset = build_datasets(args, tokenizer, "train")
    eval_dataset = build_datasets(args, tokenizer, "test")
    train_dataloader = DataLoader(
        train_dataset,
        collate_fn=default_data_collator,
        shuffle=False,
        batch_size=args.batch_size,
        sampler=DistributedSampler(train_dataset),
    )
    eval_dataloader = DataLoader(
        eval_dataset,
        collate_fn=default_data_collator,
        shuffle=False,
        batch_size=args.batch_size,
        sampler=DistributedSampler(eval_dataset),
    )
 
    lr = 1e-4 * world_size
    optimizer = torch.optim.AdamW(params=model.parameters(), lr=lr, weight_decay=0.1)
    scheduler = get_scheduler(
        name="cosine",
        optimizer=optimizer,
        num_warmup_steps=100,
        num_training_steps=len(train_dataloader)
    )
    seq_len = get_seq_len(model.module.config)
    for epoch in range(args.epochs):
        train_loss = 0.
        model.train()
        start_time = record_timestamp()
        for batch_idx, batch_data in enumerate(train_dataloader):
            batch_data["input_ids"] = batch_data["input_ids"].to(device)
            batch_data["attention_mask"] = batch_data["attention_mask"].to(device)
            batch_data["labels"] = batch_data["labels"].to(device)
 
            outputs = model(**batch_data)
            loss = outputs.loss
            optimizer.zero_grad()
            loss.backward()
            # clip gradients after backward and before the optimizer step
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            end_time = record_timestamp()

            train_loss += loss.detach()
            duration = end_time - start_time
            train_tokens_throughput = args.batch_size * seq_len * world_size / duration
 
            if batch_idx % 10 == 0:
                print_rank_0(
                    f"epoch[{epoch}/{args.epochs-1}], "
                    f"step[{batch_idx}/{len(train_dataloader)-1}]: "
                    f"train_loss: {train_loss.item()/(batch_idx+1)}, "
                    f"train_tokens_throughput: {train_tokens_throughput} tokens/s"
                )
                start_time = record_timestamp()
 
        eval_loss = 0.
        model.eval()
        with torch.no_grad():
            for batch_data in eval_dataloader:
                batch_data["input_ids"] = batch_data["input_ids"].to(device)
                batch_data["attention_mask"] = batch_data["attention_mask"].to(device)
                batch_data["labels"] = batch_data["labels"].to(device)
                outputs = model(**batch_data)
                eval_loss += outputs.loss

        print_rank_0(
            f"validation in train epoch {epoch}: "
            f"eval_loss: {eval_loss.item()/len(eval_dataloader)}, "
            f"val_perplexity: {math.exp(eval_loss.item()/len(eval_dataloader))}"
        )

The huggingface model and its corresponding tokenizer are loaded mainly by the following three calls:

config = AutoConfig.from_pretrained(args.model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    args.model_name_or_path,
    use_fast=True,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    from_tf=bool(".ckpt" in args.model_name_or_path),
    config=config,
    low_cpu_mem_usage=False,
    trust_remote_code=True
)

Running the following command starts training on an A100 machine:

torchrun --nproc_per_node=2 main_hf.py \
        --model_name_or_path gpt2 \
        --batch_size 1

Here --nproc_per_node is the number of GPUs to train on, --model_name_or_path selects the model to train, and --batch_size is the per-GPU batch size.
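Since NCCL/MCCL also supports inter-node communication (section 2), the same script can be launched across several machines. The following is a hedged two-node example; the node count, master address, and port are placeholders.

# Hypothetical two-node launch: run on each node, setting --node_rank to 0 or 1.
# The master address and port below are placeholders.
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=2 \
        --master_addr=10.0.0.1 --master_port=29500 \
        main_hf.py \
        --model_name_or_path gpt2 \
        --batch_size 1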

If the model to be trained is very large, for example mixtral-8x7B, you can download the huggingface checkpoint offline first and point --model_name_or_path at the local path (a download sketch follows the code excerpt below). Besides that, if the GPU memory is still insufficient even with batch_size set to 1, you can shrink num_layers in the config to quickly verify that the model has been ported to MUSA successfully. This is configured in override_config.json, whose entries override the corresponding values of the original config.json.

override_config.json

{
    "num_layers": 2
}

def _modify_config(config):
    with open("override_config.json", "r", encoding='utf8') as fp:
        new_config = json.load(fp)
    for key, value in new_config.items():
        if not hasattr(config, key):
            print_rank_0(f"WARNING: Invalid config key: {key} and skip override it")
            continue
        old_value = getattr(config, key)
        if old_value is None:
            print_rank_0(f"WARNING: config {key} is set None and skip override it")
            continue
        if key == "torch_dtype":
            value = str_to_dtype[value]
        if type(old_value) is not type(value):
            raise TypeError(f"Type mismatch of {key}: {old_value} vs {value}")
        print_rank_0(f"modify {key}: {old_value} -> {value}")
        setattr(config, key, value)
 
if os.path.exists("override_config.json"):
    _modify_config(config)
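As mentioned above, for a large checkpoint such as mixtral-8x7B the weights can be downloaded ahead of time and the local directory passed to --model_name_or_path. Below is a minimal sketch using huggingface_hub; the repo id and target directory are only examples.

# Hedged sketch: download a checkpoint offline with huggingface_hub,
# then pass the local directory to --model_name_or_path.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="mistralai/Mixtral-8x7B-v0.1",            # example repo id
    local_dir="/data/checkpoints/mixtral-8x7B-v0.1",  # example target directory
)
print(local_dir)  # pass this path to --model_name_or_path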

3.2 musify

Musify refers to converting a CUDA application into a MUSA application so that it runs on Moore Threads GPUs. For the huggingface training demo above, the musify process is very simple:

musify

# 1. Add `import torch_musa` to main_hf.py
# 2. Switch the device: replace every "cuda" in main_hf.py with "musa"
sed -i "s/cuda/musa/g" `grep -rl "cuda" main_hf.py`
# 3. Switch the communication library: replace "nccl" in main_hf.py with "mccl"
sed -i "s/nccl/mccl/g" `grep -rl "nccl" main_hf.py`

The diff after these changes is shown below:

Figure 3: Diff of the huggingface training demo after musify
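The key changed lines of main_hf.py after musify look roughly like the sketch below (an excerpt consistent with the sed steps above, not the full diff):

import os
import torch
import torch_musa                          # new import added in step 1
import torch.distributed as dist

dist.init_process_group("mccl")            # was "nccl"
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device("musa", local_rank)  # was torch.device("cuda", local_rank)
torch.musa.set_device(local_rank)          # was torch.cuda.set_device
torch.musa.manual_seed(123)                # was torch.cuda.manual_seed
# record_timestamp() now calls torch.musa.synchronize() instead of torch.cuda.synchronize()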

4. Training Results

We selected three representative models to demonstrate the training results:

  • Encoder-only model: bert-base-uncased
  • Decoder-only model: gpt2
  • MoE model: mixtral-8x7B-v0.1

Note that random number generation on MUSA differs from CUDA, so the outputs of the dropout layers diverge noticeably between the two. To verify that the loss matches CUDA, dropout therefore has to be disabled in each model. We then compared the first fifty training steps; the configuration and GPU memory usage of each run are summarized below:

| Model | num_gpus | override_config.json | batch_size | Device | Memory used / total (GB) |
| --- | --- | --- | --- | --- | --- |
| bert-base-uncased | 2 | {"attention_probs_dropout_prob": 0.0, "hidden_dropout_prob": 0.0} | 56 | A100 | 46/81 |
|  |  |  |  | S4000 | 46/49 |
| gpt2 | 2 | {"attn_pdrop": 0.0, "embd_pdrop": 0.0, "resid_pdrop": 0.0, "summary_first_dropout": 0.0} | 16 | A100 | 48/81 |
|  |  |  |  | S4000 | 46/49 |
| mixtral-8x7B-v0.1 | 2 | {"num_hidden_layers": 1} | 1 | A100 | 41/49 |
|  |  |  |  | S4000 | 41/49 |

From these results we can see:

  • Memory usage: MUSA and CUDA consume essentially the same amount of GPU memory; for gpt2, MUSA even uses 2 GB less than CUDA
  • Accuracy: for every model tested, the loss difference between MUSA and CUDA stays at the fourth decimal place (on the order of 1e-4)
  • Throughput: on bert-base-uncased, gpt2, and mixtral-8x7B-v0.1, the S4000 reaches roughly 0.2x, 0.24x, and 0.22x of the A100's throughput respectively; further performance optimization on the S4000 is still needed
