





4 Implementing a GPT model from Scratch To Generate Text


This chapter covers

  • Coding a GPT-like large language model (LLM) that can be trained to generate human-like text
  • Normalizing layer activations to stabilize neural network training
  • Adding shortcut connections in deep neural networks to train models more effectively
  • Implementing transformer blocks to create GPT models of various sizes
  • Computing the number of parameters and storage requirements of GPT models

In the previous chapter, you learned and coded the multi-head attention mechanism, one of the core components of LLMs. In this chapter, we will now code the other building blocks of an LLM and assemble them into a GPT-like model that we will train in the next chapter to generate human-like text, as illustrated in Figure 4.1.


Figure 4.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset, and finetuning it on a labeled dataset. This chapter focuses on implementing the LLM architecture, which we will train in the next chapter.


The LLM architecture, referenced in Figure 4.1, consists of several building blocks that we will implement throughout this chapter. We will begin with a top-down view of the model architecture in the next section before covering the individual components in more detail.


4.1 Coding an LLM architecture

4.1 编写LLM架构

LLMs, such as GPT (which stands for Generative Pretrained Transformer), are large deep neural network architectures designed to generate new text one word (or token) at a time. However, despite their size, the model architecture is less complicated than you might think, since many of its components are repeated, as we will see later. Figure 4.2 provides a top-down view of a GPT-like LLM, with its main components highlighted.


Figure 4.2 A mental model of a GPT model. Next to the embedding layers, it consists of one or more transformer blocks containing the masked multi-head attention module we implemented in the previous chapter.


As you can see in Figure 4.2, we have already covered several aspects, such as input tokenization and embedding, as well as the masked multi-head attention module. The focus of this chapter will be on implementing the core structure of the GPT model, including its transformer blocks, which we will then train in the next chapter to generate human-like text.


In the previous chapters, we used smaller embedding dimensions for simplicity, ensuring that the concepts and examples could comfortably fit on a single page. Now, in this chapter, we are scaling up to the size of a small GPT-2 model, specifically the smallest version with 124 million parameters, as described in Radford et al.’s paper, “Language Models are Unsupervised Multitask Learners.” Note that while the original report mentions 117 million parameters, this was later corrected.


Chapter 6 will focus on loading pretrained weights into our implementation and adapting it for larger GPT-2 models with 345, 762, and 1,542 million parameters. In the context of deep learning and LLMs like GPT, the term “parameters” refers to the trainable weights of the model. These weights are essentially the internal variables of the model that are adjusted and optimized during the training process to minimize a specific loss function. This optimization allows the model to learn from the training data.


For example, in a neural network layer that is represented by a 2,048x2,048-dimensional matrix (or tensor) of weights, each element of this matrix is a parameter. Since there are 2,048 rows and 2,048 columns, the total number of parameters in this layer is 2,048 multiplied by 2,048, which equals 4,194,304 parameters.
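The count above can be checked directly in PyTorch. The following is a minimal sketch; the variable names are illustrative, and the layer is created without a bias vector (an assumption made here) so that only the weight matrix is counted.

```python
import torch.nn as nn

# A linear layer whose weight matrix has shape 2,048 x 2,048.
# bias=False so that only the weight matrix contributes parameters.
layer = nn.Linear(2048, 2048, bias=False)

# numel() returns the number of elements in a tensor;
# summing over all parameter tensors gives the layer's parameter count.
num_params = sum(p.numel() for p in layer.parameters())
print(f"{num_params:,}")  # 4,194,304
```

With the default `bias=True`, the layer would carry 2,048 additional parameters for the bias vector.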



Note that we are focusing on GPT-2 because OpenAI has made the weights of the pretrained model publicly available, which we will load into our implementation in chapter 6. GPT-3 is fundamentally the same in terms of model architecture, except that it is scaled up from 1.5 billion parameters in GPT-2 to 175 billion parameters in GPT-3, and it is trained on more data. As of this writing, the weights for GPT-3 are not publicly available. GPT-2 is also a better choice for learning how to implement LLMs, as it can be run on a single laptop computer, whereas GPT-3 requires a GPU cluster for training and inference. According to Lambda Labs, it would take 355 years to train GPT-3 on a single V100 datacenter GPU, and 665 years on a consumer RTX 8000 GPU.


We specify the configuration of the small GPT-2 model via the following Python dictionary, which we will use in the code examples later:


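GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-key-value bias
}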

In the GPT_CONFIG_124M dictionary, we use concise variable names for clarity and to prevent long lines of code:


  • “vocab_size” refers to a vocabulary of 50,257 words, as used by the BPE tokenizer from chapter 2.
  • “context_length” denotes the maximum number of input tokens the model can handle, via the positional embeddings discussed in chapter 2.
  • “emb_dim” represents the embedding size, transforming each token into a 768-dimensional vector.
  • “n_heads” indicates the count of attention heads in the multi-head attention mechanism, as implemented in chapter 3.
  • “n_layers” specifies the number of transformer blocks in the model, which will be elaborated on in upcoming sections.
  • “drop_rate” indicates the intensity of the dropout mechanism (0.1 implies a 10% drop of hidden units) to prevent overfitting, as covered in chapter 3.
  • “qkv_bias” determines whether to include a bias vector in the linear layers of the multi-head attention for query, key, and value computations. We will initially disable this, following the norms of modern LLMs, but will revisit it in chapter 6 when we load pretrained GPT-2 weights from OpenAI into our model.
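To make these settings more concrete, the following sketch derives a few quantities from the configuration. It assumes that the embedding dimension is split evenly across the attention heads, as is common in multi-head attention; the variable names `head_dim`, `tok_emb_params`, and `pos_emb_params` are illustrative and not part of the chapter's code.

```python
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}
cfg = GPT_CONFIG_124M

# Assumption: each attention head works on an equal slice of the embedding
head_dim = cfg["emb_dim"] // cfg["n_heads"]

# One embedding vector per vocabulary entry and per position
tok_emb_params = cfg["vocab_size"] * cfg["emb_dim"]
pos_emb_params = cfg["context_length"] * cfg["emb_dim"]

print(head_dim)                # 64
print(f"{tok_emb_params:,}")   # 38,597,376
print(f"{pos_emb_params:,}")   # 786,432
```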

Using the configuration above, we will start this chapter by implementing a GPT placeholder architecture (DummyGPTModel) in this section, as shown in Figure 4.3. This will provide us with a big-picture view of how everything fits together and what other components we need to code in the upcoming sections to assemble the full GPT model architecture.


Figure 4.3 A mental model outlining the order in which we code the GPT architecture. In this chapter, we will start with the GPT backbone, a placeholder architecture, before we get to the individual core pieces and eventually assemble them in a transformer block for the final GPT architecture.


The numbered boxes shown in Figure 4.3 illustrate the order in which we tackle the individual concepts required to code the final GPT architecture. We will start with step 1, a placeholder GPT backbone we call DummyGPTModel:


Listing 4.1 A placeholder GPT model architecture class

import torch
import torch.nn as nn

class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])  # Token embedding layer
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])  # Positional embedding layer
        self.drop_emb = nn.Dropout(cfg["drop_rate"])  # Dropout applied to the embeddings
        self.trf_blocks = nn.Sequential(  # Sequential container holding the transformer blocks
            *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])])  # Placeholder transformer blocks
        self.final_norm = DummyLayerNorm(cfg["emb_dim"])  # Placeholder layer normalization
        self.out_head = nn.Linear(  # Linear output layer mapping to the vocabulary size
            cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape  # Batch size and sequence length
        tok_embeds = self.tok_emb(in_idx)  # Look up token embeddings
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))  # Look up positional embeddings
        x = tok_embeds + pos_embeds  # Combine both embeddings
        x = self.drop_emb(x)  # Apply dropout
        x = self.trf_blocks(x)  # Pass through the transformer blocks
        x = self.final_norm(x)  # Apply the final normalization
        logits = self.out_head(x)  # Compute the output logits
        return logits

class DummyTransformerBlock(nn.Module):  # Placeholder to be replaced by a real transformer block later
    def __init__(self, cfg):
        super().__init__()

    def forward(self, x):
        return x  # Does nothing; returns its input unchanged

class DummyLayerNorm(nn.Module):  # Placeholder to be replaced by a real layer normalization later
    def __init__(self, normalized_shape, eps=1e-5):  # Parameters mimic the LayerNorm interface
        super().__init__()

    def forward(self, x):
        return x  # Does nothing; returns its input unchanged

The DummyGPTModel class in this code defines a simplified version of a GPT-like model using PyTorch’s neural network module (nn.Module). The model architecture in the DummyGPTModel class consists of token and positional embeddings, dropout, a series of transformer blocks (DummyTransformerBlock), a final layer normalization (DummyLayerNorm), and a linear output layer (out_head). The configuration is passed in via a Python dictionary, for instance, the GPT_CONFIG_124M dictionary we created earlier.


The forward method describes the data flow through the model: it computes token and positional embeddings for the input indices, applies dropout, processes the data through the transformer blocks, applies normalization, and finally produces logits with the linear output layer.
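To trace the tensor shapes along this data flow, here is a minimal sketch that mirrors the embedding and output steps of the forward method. The toy sizes are hypothetical and much smaller than those in GPT_CONFIG_124M, and the dropout, transformer block, and normalization steps are left out because the placeholders do not change the shape.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

# Hypothetical toy sizes, chosen only to keep the example small
vocab_size, context_length, emb_dim = 100, 8, 16

tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)
out_head = nn.Linear(emb_dim, vocab_size, bias=False)

# A batch of 2 sequences with 4 random token IDs each
in_idx = torch.randint(0, vocab_size, (2, 4))
batch_size, seq_len = in_idx.shape

tok_embeds = tok_emb(in_idx)                 # shape (2, 4, 16)
pos_embeds = pos_emb(torch.arange(seq_len))  # shape (4, 16)
x = tok_embeds + pos_embeds                  # broadcast over the batch: (2, 4, 16)
logits = out_head(x)                         # shape (2, 4, 100)

print(tok_embeds.shape, pos_embeds.shape, x.shape, logits.shape)
```

The positional embeddings have no batch dimension; the addition broadcasts them across all sequences in the batch.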


The code above is already functional, as we will see later in this section after we prepare the input data. However, for now, note in the code above that we have used placeholders (DummyLayerNorm and DummyTransformerBlock) for the transformer block and layer normalization, which we will develop in later sections.


Next, we will prepare the input data and initialize a new GPT model to illustrate its usage. Building on the figures we have seen in chapter 2, where we coded the tokenizer, Figure 4.4 provides a high-level overview of how data flows in and out of a GPT model.


Figure 4.4 A big-picture overview showing how the input data is tokenized, embedded, and fed to the GPT model. Note that in our DummyGPTModel class coded earlier, the token embedding is handled inside the GPT model. In LLMs, the embedded input token dimension typically matches the output dimension. The output embeddings here represent the context vectors we discussed in chapter 3.

To implement the steps shown in Figure 4.4, we tokenize a batch consisting of two text inputs for the GPT model using the tokenizer introduced in chapter 2:


import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")  # Load the GPT-2 BPE tokenizer
batch = []  # List that collects the tokenized texts
txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))  # Encode the first text and add it to the batch
batch.append(torch.tensor(tokenizer.encode(txt2)))  # Encode the second text and add it to the batch
batch = torch.stack(batch, dim=0)  # Stack the list into a single tensor
print(batch)

The resulting token IDs for the two texts are as follows:


tensor([[ 6109,  3626,  6100,   345],  #A
        [ 6109,  1110,  6622,   257]])  #A
#A The first row corresponds to the first text, and the second row corresponds to the second text

Next, we initialize a new 124 million parameter DummyGPTModel instance and feed it the tokenized batch:


torch.manual_seed(123)  # Set the random seed for reproducibility
model = DummyGPTModel(GPT_CONFIG_124M)  # Create a DummyGPTModel instance
logits = model(batch)  # Compute the model outputs (logits)
print("Output shape:", logits.shape)
print(logits)

The model outputs, which are commonly referred to as logits, are as follows:


Output shape: torch.Size([2, 4, 50257])
tensor([[[-1.2034,  0.3201, -0.7130, ..., -1.5548, -0.2390, -0.4667],
         [-0.1192,  0.4539, -0.4432, ...,  0.2392,  1.3469,  1.2430],
         [ 0.5307,  1.6720, -0.4695, ...,  1.1966,  0.0111,  0.5835],
         [ 0.0139,  1.6755, -0.3388, ...,  1.1586, -0.0435, -1.0400]],

        [[-1.9088,  0.1798, -0.9484, ..., -1.6047,  0.2439, -0.4530],
         [-0.7860,  0.5581, -0.0610, ...,  0.4835, -0.0077,  1.6621],
         [ 0.3567,  1.2698, -0.6398, ..., -0.0162, -0.1296,  0.3771],
         [-0.2407, -0.7349, -0.5102, ...,  2.0057, -0.3694,  0.1814]]],

The output tensor has two rows corresponding to the two text samples. Each text sample consists of 4 tokens; each token is a 50,257-dimensional vector, which matches the size of the tokenizer’s vocabulary.


The embedding has 50,257 dimensions because each of these dimensions refers to a unique token in the vocabulary. At the end of this chapter, when we implement the postprocessing code, we will convert these 50,257-dimensional vectors back into token IDs, which we can then decode into words.


Now that we have taken a top-down look at the GPT architecture and its inputs and outputs, we will code the individual placeholders in the upcoming sections, starting with the real layer normalization class that will replace the DummyLayerNorm in the previous code.


4.2 Normalizing activations with layer normalization

Training deep neural networks with many layers can sometimes prove challenging due to issues like vanishing or exploding gradients. These issues lead to unstable training dynamics and make it difficult for the network to effectively adjust its weights, which means the learning process struggles to find a set of parameters (weights) for the neural network that minimizes the loss function. In other words, the network has difficulty learning the underlying patterns in the data to a degree that would allow it to make accurate predictions or decisions. (If you are new to neural network training and the concepts of gradients, a brief introduction to these concepts can be found in Section A.4, Automatic Differentiation Made Easy in Appendix A: Introduction to PyTorch. However, a deep mathematical understanding of gradients is not required to follow the contents of this book.)


In this section, we will implement layer normalization to improve the stability and efficiency of neural network training.


The main idea behind layer normalization is to adjust the activations (outputs) of a neural network layer to have a mean of 0 and a variance of 1, also known as unit variance. This adjustment speeds up the convergence to effective weights and ensures consistent, reliable training. As we have seen in the previous section, based on the DummyLayerNorm placeholder, in GPT-2 and modern transformer architectures, layer normalization is typically applied before and after the multi-head attention module and before the final output layer.


Before we implement layer normalization in code, Figure 4.5 provides a visual overview of how layer normalization functions.


Figure 4.5 An illustration of layer normalization where the 5 layer outputs, also called activations, are normalized such that they have a zero mean and variance of 1.


We can recreate the example shown in Figure 4.5 via the following code, where we implement a neural network layer with 5 inputs and 6 outputs that we apply to two input examples:


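torch.manual_seed(123)  # Set the random seed for reproducibility
batch_example = torch.randn(2, 5)  # Two input examples with 5 features each
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())  # Linear layer followed by a ReLU activation
out = layer(batch_example)  # Compute the layer outputs
print(out)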

This prints the following tensor, where the first row lists the layer outputs for the first input and the second row lists the layer outputs for the second input:


tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],
        [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],

The neural network layer we have coded consists of a Linear layer followed by a non-linear activation function, ReLU (short for Rectified Linear Unit), which is a standard activation function in neural networks. If you are unfamiliar with ReLU, it simply thresholds negative inputs to 0, ensuring that a layer outputs only non-negative values, which explains why the resulting layer output does not contain any negative values. (Note that we will use another, more sophisticated activation function in GPT, which we will introduce in the next section.)


Before we apply layer normalization to these outputs, let’s examine the mean and variance:


mean = out.mean(dim=-1, keepdim=True)  # Mean of each row
var = out.var(dim=-1, keepdim=True)  # Variance of each row
print("Mean:\n", mean)
print("Variance:\n", var)

The output is as follows:


Mean:
 tensor([[0.1324],
        [0.2170]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[0.0231],
        [0.0398]], grad_fn=<VarBackward0>)

The first row in the mean tensor above contains the mean value for the first input row, and the second output row contains the mean for the second input row.


Using keepdim=True in operations like mean or variance calculation ensures that the output tensor retains the same number of dimensions as the input tensor, even though the operation reduces the tensor along the dimension specified via dim. For instance, without keepdim=True, the returned mean tensor would be a 2-dimensional vector [0.1324, 0.2170] instead of a 2x1-dimensional matrix [[0.1324], [0.2170]].


The dim parameter specifies the dimension along which the calculation of the statistic (here, mean or variance) should be performed in a tensor, as shown in Figure 4.6.


Figure 4.6 An illustration of the dim parameter when calculating the mean of a tensor. For instance, if we have a 2D tensor (matrix) with dimensions [rows, columns], using dim=0 will perform the operation across rows (vertically, as shown at the bottom), resulting in an output that aggregates the data for each column. Using dim=1 or dim=-1 will perform the operation across columns (horizontally, as shown at the top), resulting in an output aggregating the data for each row.


As Figure 4.6 explains, for a 2D tensor (like a matrix), using dim=-1 for operations such as mean or variance calculation is the same as using dim=1. This is because -1 refers to the tensor’s last dimension, which corresponds to the columns in a 2D tensor. Later, when adding layer normalization to the GPT model, which produces 3D tensors with shape [batch_size, num_tokens, embedding_size], we can still use dim=-1 for normalization across the last dimension, avoiding a change from dim=1 to dim=2.
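The behavior of dim and keepdim can be checked on a small tensor. The following sketch uses made-up values; the tensor names are illustrative only.

```python
import torch

t = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])  # shape (2, 3): 2 rows, 3 columns

col_means = t.mean(dim=0)    # aggregates over rows -> one value per column
row_means = t.mean(dim=-1)   # aggregates over columns -> one value per row
row_means_kept = t.mean(dim=-1, keepdim=True)  # same values, reduced dim kept

print(col_means)             # tensor([2.5000, 3.5000, 4.5000])
print(row_means)             # tensor([2., 5.])
print(row_means_kept.shape)  # torch.Size([2, 1])

# For a 3D tensor of shape [batch_size, num_tokens, embedding_size],
# dim=-1 still refers to the last (embedding) dimension.
x = torch.ones(2, 4, 8)
print(x.mean(dim=-1, keepdim=True).shape)  # torch.Size([2, 4, 1])
```

Keeping the reduced dimension is what allows an expression such as `t - row_means_kept` to broadcast row by row.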


Next, let us apply layer normalization to the layer outputs we obtained earlier. The operation consists of subtracting the mean and dividing by the square root of the variance (also known as standard deviation):


out_norm = (out - mean) / torch.sqrt(var)  # Subtract the mean and divide by the standard deviation
mean = out_norm.mean(dim=-1, keepdim=True)  # Mean of the normalized outputs
var = out_norm.var(dim=-1, keepdim=True)  # Variance of the normalized outputs
print("Normalized layer outputs:\n", out_norm)
print("Mean:\n", mean)
print("Variance:\n", var)

As we can see based on the results, the normalized layer outputs, which now also contain negative values, have zero mean and a variance of 1:


Normalized layer outputs:
 tensor([[ 0.6159,  1.4126, -0.8719,  0.5872, -0.8719, -0.8719],
        [-0.0189,  0.1121, -1.0876,  1.5173,  0.5647, -1.0876]],
       grad_fn=<DivBackward0>)
Mean:
 tensor([[2.9802e-08],
        [3.9736e-08]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.],
        [1.]], grad_fn=<VarBackward0>)

Note that the value 2.9802e-08 in the output tensor is the scientific notation for 2.9802 × 10^-8, which is 0.0000000298 in decimal form. This value is very close to 0, but it is not exactly 0 due to small numerical errors that can accumulate because of the finite precision with which computers represent numbers.
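The same kind of rounding error can be seen in plain Python, independent of PyTorch. This small sketch uses Python's built-in 64-bit floats; PyTorch's default 32-bit floats have less precision, so such deviations tend to be larger there.

```python
import math

total = 0.1 + 0.2
print(total)                     # 0.30000000000000004
print(total == 0.3)              # False
print(math.isclose(total, 0.3))  # True
```

For this reason, floating-point results are usually compared with a tolerance rather than with exact equality.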


To improve readability, we can also turn off the scientific notation when printing tensor values by setting sci_mode to False:


torch.set_printoptions(sci_mode=False)  # Turn off scientific notation when printing
print("Mean:\n", mean)
print("Variance:\n", var)

The output is as follows:


Mean:
 tensor([[0.0000],
        [0.0000]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.],
        [1.]], grad_fn=<VarBackward0>)

So far, in this section, we have coded and applied layer normalization in a step-by-step process. Let’s now encapsulate this process in a PyTorch module that we can use in the GPT model later:


Listing 4.2 A layer normalization class

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5  # Small constant that prevents division by zero
        self.scale = nn.Parameter(torch.ones(emb_dim))  # Trainable scale parameter
        self.shift = nn.Parameter(torch.zeros(emb_dim))  # Trainable shift parameter

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)  # Mean over the last dimension
        var = x.var(dim=-1, keepdim=True, unbiased=False)  # Biased variance over the last dimension
        norm_x = (x - mean) / torch.sqrt(var + self.eps)  # Normalize
        return self.scale * norm_x + self.shift  # Apply the trainable scale and shift

This specific implementation of layer normalization operates on the last dimension of the input tensor x, which represents the embedding dimension (emb_dim). The variable eps is a small constant (epsilon) added to the variance to prevent division by zero during normalization. The scale and shift are two trainable parameters (of the same dimension as the input) that the LLM automatically adjusts during training if it is determined that doing so would improve the model’s performance on its training task. This allows the model to learn appropriate scaling and shifting that best suit the data it is processing.
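As a sanity check, the same computation can be compared against PyTorch's built-in nn.LayerNorm, which also normalizes over the last dimension and initializes its scale to ones and its shift to zeros. The sketch below writes out the normalization by hand on a made-up input; the tensor shape is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
x = torch.randn(2, 4, 8)  # [batch_size, num_tokens, embedding_size]

# Hand-written normalization over the last dimension.
# With scale=1 and shift=0 (the initial values), this is the full output.
eps = 1e-5
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + eps)

# PyTorch's built-in layer normalization with the same epsilon
builtin = nn.LayerNorm(8, eps=eps)(x)

print(torch.allclose(manual, builtin, atol=1e-5))  # True
```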




In our variance calculation method, we have opted for an implementation detail by setting unbiased=False. For those curious about what this means, in the variance calculation, we divide by the number of inputs n in the variance formula. This approach does not apply Bessel’s correction, which typically uses n-1 instead of n in the denominator to adjust for bias in sample variance estimation. This decision results in a so-called biased estimate of the variance. For large-scale language models (LLMs), where the embedding dimension n is significantly large, the difference between using n and n-1 is practically negligible. We chose this approach to ensure compatibility with the GPT-2 model’s normalization layers and because it reflects TensorFlow’s default behavior, which was used to implement the original GPT-2 model. Using a similar setting ensures our method is compatible with the pretrained weights we will load in chapter 6.
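The difference between the two variance estimates can be illustrated with Python's standard library, where `statistics.pvariance` divides by n and `statistics.variance` divides by n-1. The sketch below reuses the first row of layer outputs printed earlier in this section (rounded to four decimals), so the results are approximate.

```python
import statistics

# First row of the layer outputs shown earlier (rounded values)
row = [0.2260, 0.3470, 0.0, 0.2216, 0.0, 0.0]
n = len(row)

biased = statistics.pvariance(row)   # divides by n (no Bessel's correction)
unbiased = statistics.variance(row)  # divides by n-1 (Bessel's correction)

print(round(biased, 4))    # 0.0192
print(round(unbiased, 4))  # 0.0231

# The two estimates differ by the factor n / (n - 1).
# For n = 6 that factor is 1.2; for an embedding size of 768 it is close to 1.
print(f"{768 / 767:.4f}")  # 1.0013
```

With only 6 values, the two estimates differ noticeably, whereas for 768 values the factor shrinks to roughly 0.13 percent.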


Let’s now try the LayerNorm module in practice and apply it to the batch input:


ln = LayerNorm(emb_dim=5)  # Create a LayerNorm instance
out_ln = ln(batch_example)  # Apply LayerNorm to the batch example
mean = out_ln.mean(dim=-1, keepdim=True)  # Mean of the normalized outputs
var = out_ln.var(dim=-1, unbiased=False, keepdim=True)  # Variance of the normalized outputs
print("Mean:\n", mean)
print("Variance:\n", var)

As we can see based on the results, the layer normalization code works as expected and normalizes the values of each of the two inputs such that they have a mean of 0 and a variance of 1:


Mean:
 tensor([[-0.0000],
        [ 0.0000]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)

In this section, we covered one of the building blocks we will need to implement the GPT architecture, as shown in the mental model in Figure 4.7.


Figure 4.7 A mental model listing the different building blocks we implement in this chapter to assemble the GPT architecture.


In the next section, we will look at the GELU activation function, which is one of the activation functions used in LLMs, instead of the traditional ReLU function we used in this section.


**Layer normalization versus batch normalization**

If you are familiar with batch normalization, a common and traditional normalization method for neural networks, you may wonder how it compares to layer normalization. Unlike batch normalization, which normalizes across the batch dimension, layer normalization normalizes across the feature dimension. LLMs often require significant computational resources, and the available hardware or the specific use case can dictate the batch size during training or inference. Since layer normalization normalizes each input independently of the batch size, it offers more flexibility and stability in these scenarios. This is particularly beneficial for distributed training or when deploying models in environments where resources are constrained.
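The independence from the batch can be demonstrated with a small experiment. The helper functions below are simplified sketches written for this illustration (no trainable parameters, and the batch normalization variant ignores running statistics); they are not the implementations used elsewhere in this chapter.

```python
import torch

def layer_norm(x, eps=1e-5):
    # Normalize each input across its features (last dimension)
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # Simplified training-mode batch norm: normalize each feature across the batch
    mean = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

torch.manual_seed(123)
sample = torch.randn(1, 5)                         # the input we track
batch_a = torch.cat([sample, torch.randn(3, 5)])   # batch of 4
batch_b = torch.cat([sample, torch.randn(7, 5) * 10])  # batch of 8, different scale

# Layer norm: the tracked sample's output does not depend on its batch
print(torch.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0]))  # True

# Batch norm: the tracked sample's output changes with the rest of the batch
print(torch.allclose(batch_norm(batch_a)[0], batch_norm(batch_b)[0]))  # False
```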


4.3 Implementing a feed forward network with GELU activations

In this section, we implement a small neural network submodule that is used as part of the transformer block in LLMs. We begin with implementing the GELU activation function, which plays a crucial role in this neural network submodule. (For additional information on implementing neural networks in PyTorch, please see section A.5 Implementing multilayer neural networks in Appendix A.)


Historically, the ReLU activation function has been commonly used in deep learning due to its simplicity and effectiveness across various neural network architectures. However, in LLMs, several other activation functions are employed beyond the traditional ReLU. Two notable examples are GELU (Gaussian Error Linear Unit) and SwiGLU (Swish-Gated Linear Unit).


GELU and SwiGLU are more complex and smooth activation functions incorporating Gaussian and sigmoid-gated linear units, respectively. They offer improved performance for deep learning models, unlike the simpler ReLU.


The GELU activation function can be implemented in several ways; the exact version is defined as GELU(x)=x·Φ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian distribution. In practice, however, it’s common to implement a computationally cheaper approximation (the original GPT-2 model was also trained with this approximation):


$$\text{GELU}(x) \approx 0.5 \cdot x \cdot \left(1 + \tanh\left(\sqrt{2 / \pi} \cdot \left(x + 0.044715 \cdot x^3\right)\right)\right)$$
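To get a feeling for how close the approximation is to the exact definition, the two can be compared numerically. The following sketch uses only Python's standard library and expresses Φ(x) through the error function, Φ(x) = 0.5 · (1 + erf(x / √2)); the function names are illustrative.

```python
import math

def gelu_exact(x):
    # x * Phi(x), with the standard Gaussian CDF written via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The tanh-based approximation from the formula above
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Evaluate both versions on a grid from -3 to 3 in steps of 0.01
xs = [i / 100 for i in range(-300, 301)]
max_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"Max absolute difference on [-3, 3]: {max_err:.1e}")
```

On this range, the two versions agree to within about one thousandth, which suggests why the cheaper approximation is a practical substitute.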

In code, we can implement this function as a PyTorch module as follows:


Listing 4.3 An implementation of the GELU activation function

class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # Tanh-based approximation of GELU
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))

Next, to get an idea of what this GELU function looks like and how it compares to the ReLU function, let’s plot these functions side by side:


import matplotlib.pyplot as plt

gelu, relu = GELU(), nn.ReLU()  # Create GELU and ReLU instances

x = torch.linspace(-3, 3, 100)  # 100 sample data points in the range -3 to 3
y_gelu, y_relu = gelu(x), relu(x)  # Compute the GELU and ReLU outputs
plt.figure(figsize=(8, 3))  # Set the figure size
for i, (y, label) in enumerate(zip([y_gelu, y_relu], ["GELU", "ReLU"]), 1):  # One subplot per function
    plt.subplot(1, 2, i)
    plt.plot(x, y)
    plt.title(f"{label} activation function")
    plt.xlabel("x")
    plt.ylabel(f"{label}(x)")
    plt.grid(True)  # Show grid lines
plt.tight_layout()  # Use a compact layout
plt.show()  # Display the figure

As we can see in the resulting plot in Figure 4.8, ReLU is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero. GELU is a smooth, non-linear function that approximates ReLU but with a non-zero gradient for negative values.


Figure 4.8 The output of the GELU and ReLU plots using matplotlib. The x-axis shows the function inputs and the y-axis shows the function outputs.


The smoothness of GELU, as shown in Figure 4.8, can lead to better optimization properties during training, as it allows for more nuanced adjustments to the model’s parameters. In contrast, ReLU has a sharp corner at zero, which can sometimes make optimization harder, especially in networks that are very deep or have complex architectures. Moreover, unlike ReLU, which outputs zero for any negative input, GELU allows for a small, non-zero output for negative values. This characteristic means that during the training process, neurons that receive negative input can still contribute to the learning process, albeit to a lesser extent than positive inputs.


Next, let’s use the GELU function to implement the small neural network module, FeedForward, that we will be using in the LLM’s transformer block later:


Listing 4.4 A feed forward neural network module

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),  # Linear layer 1: expand by a factor of 4
            GELU(),  # GELU activation function
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),  # Linear layer 2: project back to emb_dim
        )

    def forward(self, x):
        return self.layers(x)

As we can see in the preceding code, the FeedForward module is a small neural network consisting of two Linear layers and a GELU activation function. In the 124 million parameter GPT model, it receives the input batches with tokens that have an embedding size of 768 each via the GPT_CONFIG_124M dictionary where GPT_CONFIG_124M["emb_dim"] = 768.

Figure 4.9 shows how the embedding size is manipulated inside this small feed forward neural network when we pass it some inputs.


Figure 4.9 provides a visual overview of the connections between the layers of the feed forward neural network. It is important to note that this neural network can accommodate variable batch sizes and numbers of tokens in the input. However, the embedding size for each token is determined and fixed when initializing the weights.


Following the example in Figure 4.9, let’s initialize a new FeedForward module with a token embedding size of 768 and feed it a batch input with 2 samples and 3 tokens each:


ffn = FeedForward(GPT_CONFIG_124M)  # Initialize the FeedForward module
x = torch.rand(2, 3, 768)  # Sample input: 2 samples with 3 tokens each
out = ffn(x)  # Run the input through the module
print(out.shape)  # Print the shape of the output tensor

As we can see, the shape of the output tensor is the same as that of the input tensor:


torch.Size([2, 3, 768])

The FeedForward module we implemented in this section plays a crucial role in enhancing the model’s ability to learn from and generalize the data. Although the input and output dimensions of this module are the same, it internally expands the embedding dimension into a higher-dimensional space through the first linear layer as illustrated in Figure 4.10. This expansion is followed by a non-linear GELU activation, and then a contraction back to the original dimension with the second linear transformation. Such a design allows for the exploration of a richer representation space.


Figure 4.10 An illustration of the expansion and contraction of the layer outputs in the feed forward neural network. First, the inputs expand by a factor of 4 from 768 to 3072 values. Then, the second layer compresses the 3072 values back into a 768-dimensional representation.


Moreover, the uniformity in input and output dimensions simplifies the architecture by enabling the stacking of multiple layers, as we will do later, without the need to adjust dimensions between them, thus making the model more scalable.
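
The expansion and contraction, and the fact that the same weights work for any batch size and number of tokens, can be sketched in plain Python. This is a toy illustration under stated assumptions: it uses an embedding size of 8 instead of 768, nested lists instead of tensors, and hypothetical helper names (linear, init_linear, feed_forward, random_batch, shape) that do not appear in the chapter's code.

```python
import math
import random

def gelu(x):
    """Tanh-based GELU approximation (assumed variant)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

def linear(vec, weight, bias):
    """Apply y = W @ vec + b to one token vector (weight is out_dim x in_dim)."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def init_linear(in_dim, out_dim, rng):
    """Create random weights and a zero bias; the dimensions are fixed at this point."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def feed_forward(batch, w1, b1, w2, b2):
    """Expand each token vector, apply GELU, then contract it again."""
    out = []
    for sequence in batch:
        out_seq = []
        for token_vec in sequence:
            hidden = linear(token_vec, w1, b1)       # emb_dim -> 4 * emb_dim
            hidden = [gelu(h) for h in hidden]       # non-linear activation
            out_seq.append(linear(hidden, w2, b2))   # 4 * emb_dim -> emb_dim
        out.append(out_seq)
    return out

rng = random.Random(123)
emb_dim = 8  # toy value; the 124M model uses 768
w1, b1 = init_linear(emb_dim, 4 * emb_dim, rng)
w2, b2 = init_linear(4 * emb_dim, emb_dim, rng)

def random_batch(batch_size, num_tokens):
    """Build a nested list of shape (batch_size, num_tokens, emb_dim)."""
    return [[[rng.random() for _ in range(emb_dim)] for _ in range(num_tokens)]
            for _ in range(batch_size)]

def shape(batch):
    return (len(batch), len(batch[0]), len(batch[0][0]))

small = feed_forward(random_batch(2, 3), w1, b1, w2, b2)
large = feed_forward(random_batch(5, 7), w1, b1, w2, b2)
print(shape(small), shape(large))
```

Because the two linear transformations are applied to each token vector separately, only the embedding size is tied to the weights; the batch size and the number of tokens can vary freely.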


As illustrated in Figure 4.11, we have now implemented most of the LLM’s building blocks.


Figure 4.11 A mental model showing the topics we cover in this chapter, with the black checkmarks indicating those that we have already covered.


In the next section, we will go over the concept of shortcut connections that we insert between different layers of a neural network, which are important for improving the training performance in deep neural network architectures.


4.4 Adding shortcut connections

Next, let’s discuss the concept behind shortcut connections, also known as skip or residual connections. Originally, shortcut connections were proposed for deep networks in computer vision (specifically, in residual networks) to mitigate the challenge of vanishing gradients. The vanishing gradient problem refers to the issue where gradients (which guide weight updates during training) become progressively smaller as they propagate backward through the layers, making it difficult to effectively train earlier layers, as illustrated in Figure 4.12.


Figure 4.12 A comparison between a deep neural network consisting of 5 layers without (on the left) and with shortcut connections (on the right). Shortcut connections involve adding the inputs of a layer to its outputs, effectively creating an alternate path that bypasses certain layers. The gradient shown in the figure denotes the mean absolute gradient at each layer, which we will compute in the code example that follows.

As illustrated in Figure 4.12, a shortcut connection creates an alternative, shorter path for the gradient to flow through the network by skipping one or more layers, which is achieved by adding the output of one layer to the output of a later layer. This is why these connections are also known as skip connections. They play a crucial role in preserving the flow of gradients during the backward pass in training.


In the code example below, we implement the neural network shown in Figure 4.12 to see how we can add shortcut connections in the forward method:


Listing 4.5 A neural network to illustrate shortcut connections

class ExampleDeepNeuralNetwork(nn.Module):
    def __init__(self, layer_sizes, use_shortcut):
        super().__init__()
        self.use_shortcut = use_shortcut  # Whether to add shortcut connections
        self.layers = nn.ModuleList([
            # Implement 5 layers, each a Linear layer followed by GELU
            nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]), GELU())
        ])

    def forward(self, x):
        for layer in self.layers:
            layer_output = layer(x)  # Compute the output of the current layer
            if self.use_shortcut and x.shape == layer_output.shape:  # Check if a shortcut can be applied
                x = x + layer_output  # Apply the shortcut
            else:
                x = layer_output  # Otherwise, pass on the layer output as is
        return x

The code implements a deep neural network with 5 layers, each consisting of a Linear layer and a GELU activation function. In the forward pass, we iteratively pass the input through the layers and optionally add the shortcut connections depicted in Figure 4.12 if the self.use_shortcut attribute is set to True.


Let’s use this code to first initialize a neural network without shortcut connections. Here, each layer will be initialized such that it accepts an example with 3 input values and returns 3 output values. The last layer returns a single output value:


layer_sizes = [3, 3, 3, 3, 3, 1]  # Input and output sizes of the 5 layers
sample_input = torch.tensor([[1., 0., -1.]])  # Sample input
torch.manual_seed(123)  # Seed the initial weights for reproducibility
model_without_shortcut = ExampleDeepNeuralNetwork(  # Network without shortcut connections
    layer_sizes, use_shortcut=False
)

Next, we implement a function that computes the gradients in the model’s backward pass:


def print_gradients(model, x):
    # Forward pass
    output = model(x)
    target = torch.tensor([[0.]])  # Target value

    # Calculate loss based on how close the target and output are
    loss = nn.MSELoss()
    loss = loss(output, target)

    # Backward pass to calculate the gradients
    loss.backward()

    for name, param in model.named_parameters():
        if 'weight' in name:
            # Print the mean absolute gradient of the weights
            print(f"{name} has gradient mean of {param.grad.abs().mean().item()}")

In the preceding code, we specify a loss function that computes how close the model output and a user-specified target (here, for simplicity, the value 0) are. Then, when calling loss.backward(), PyTorch computes the loss gradient for each layer in the model. We can iterate through the weight parameters via model.named_parameters(). Suppose we have a 3×3 weight parameter matrix for a given layer. In that case, this layer will have 3×3 gradient values, and we print the mean absolute gradient of these 3×3 gradient values to obtain a single gradient value per layer to compare the gradients between layers more easily.
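
As a small illustration of this per-layer summary, the sketch below reduces a 3×3 gradient matrix to its mean absolute value in plain Python. The gradient values are made up for the example and do not come from the model above.

```python
# Hypothetical 3x3 gradient values for one layer's weight matrix
grad = [
    [0.02, -0.01, 0.00],
    [-0.03, 0.04, -0.02],
    [0.01, -0.05, 0.03],
]

flat = [g for row in grad for g in row]

# Mean absolute gradient: one summary value per layer
mean_abs_grad = sum(abs(g) for g in flat) / len(flat)

# Plain mean for comparison: positive and negative entries partly cancel
plain_mean = sum(flat) / len(flat)

print(f"mean absolute gradient: {mean_abs_grad:.6f}")
print(f"plain mean:             {plain_mean:.6f}")
```

Taking absolute values first prevents positive and negative entries from canceling, so the summary reflects the magnitude of the gradient rather than its net direction.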


In short, the .backward() method is a convenient method in PyTorch that computes loss gradients, which are required during model training, without implementing the math for the gradient calculation ourselves, thereby making working with deep neural networks much more accessible. If you are unfamiliar with the concept of gradients and neural network training, I recommend reading sections A.4, Automatic differentiation made easy, and A.7, A typical training loop, in appendix A.


Let’s now use the print_gradients function and apply it to the model without skip connections:


print_gradients(model_without_shortcut, sample_input)  # Print the gradients of the model without shortcuts

The output is as follows:


layers.0.0.weight has gradient mean of 0.0002017358786325169
layers.1.0.weight has gradient mean of 0.0001201116101583466
layers.2.0.weight has gradient mean of 0.0007512046153711182
layers.3.0.weight has gradient mean of 0.001398783664673078
layers.4.0.weight has gradient mean of 0.00504946366387606

As we can see based on the output of the print_gradients function, the gradients become smaller overall as we move from the last layer (layers.4) toward the first layer (layers.0), which is a phenomenon called the vanishing gradient problem.


Let’s now instantiate a model with skip connections and see how it compares:


torch.manual_seed(123)  # Use the same random seed as before
model_with_shortcut = ExampleDeepNeuralNetwork(  # Network with shortcut connections
    layer_sizes, use_shortcut=True
)
print_gradients(model_with_shortcut, sample_input)  # Print the gradients of the model with shortcuts

The output is as follows:


layers.0.0.weight has gradient mean of 0.22169792652130127
layers.1.0.weight has gradient mean of 0.20694105327129364
layers.2.0.weight has gradient mean of 0.32896995544433594
layers.3.0.weight has gradient mean of 0.2665732502937317
layers.4.0.weight has gradient mean of 1.325841822433472

As we can see, based on the output, the last layer (layers.4) still has a larger gradient than the other layers. However, the gradient value stabilizes as we progress towards the first layer (layers.0) and doesn’t shrink to a vanishingly small value.
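
One way to see why the shortcut helps is the chain rule: a layer without a shortcut contributes a factor f'(x) to the gradient, whereas a layer with a shortcut computes x + f(x) and therefore contributes 1 + f'(x). The toy sketch below assumes five scalar layers that each have a local derivative of 0.1; this value and the function name input_gradient are chosen purely for illustration and are unrelated to the printed gradients above.

```python
def input_gradient(local_derivs, use_shortcut):
    """Multiply the per-layer derivatives along the backward path (chain rule)."""
    grad = 1.0
    for d in local_derivs:
        # With a shortcut the layer computes x + f(x), so its derivative is 1 + f'(x)
        grad *= (1.0 + d) if use_shortcut else d
    return grad

local_derivs = [0.1] * 5  # hypothetical small derivative f'(x) for each of 5 layers

grad_plain = input_gradient(local_derivs, use_shortcut=False)
grad_shortcut = input_gradient(local_derivs, use_shortcut=True)

print(f"without shortcut: {grad_plain:.6f}")
print(f"with shortcut:    {grad_shortcut:.6f}")
```

Without shortcuts, the product of small factors shrinks quickly with depth. With shortcuts, each factor stays close to 1, so the gradient that reaches the early layers does not vanish.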


In conclusion, shortcut connections are important for overcoming the limitations posed by the vanishing gradient problem in deep neural networks. Shortcut connections are a core building block of very large models such as LLMs, and they will help facilitate more effective training by ensuring consistent gradient flow across layers when we train the GPT model in the next chapter.


After introducing shortcut connections, we will now connect all of the previously covered concepts (layer normalization, GELU activations, feed forward module, and shortcut connections) in a transformer block in the next section, which is the final building block we need to code the GPT architecture.


4.5 Connecting attention and linear layers in a transformer block

In this section, we are implementing the transformer block, a fundamental building block of GPT and other LLM architectures. This block, which is repeated a dozen times in the 124 million parameter GPT-2 architecture, combines several concepts we have previously covered: multi-head attention, layer normalization, dropout, feed forward layers, and GELU activations, as illustrated in Figure 4.13. In the next section, we will then connect this transformer block to the remaining parts of the GPT architecture.


Figure 4.13 An illustration of a transformer block. The bottom of the diagram shows input tokens that have been embedded into 768-dimensional vectors. Each row corresponds to one token’s vector representation. The outputs of the transformer block are vectors of the same dimension as the input, which can then be fed into subsequent layers in an LLM.


As shown in Figure 4.13, the transformer block combines several components, including the masked multi-head attention module from chapter 3 and the FeedForward module we implemented in Section 4.3.


When a transformer block processes an input sequence, each element in the sequence (for example, a word or subword token) is represented by a fixed-size vector (in the case of Figure 4.13, 768 dimensions). The operations within the transformer block, including multi-head attention and feed forward layers, are designed to transform these vectors in a way that preserves their dimensionality.


The idea is that the self-attention mechanism in the multi-head attention block identifies and analyzes relationships between elements in the input sequence. In contrast, the feed forward network modifies the data individually at each position. This combination not only enables a more nuanced understanding and processing of the input but also enhances the model’s overall capacity for handling complex data patterns.


In code, we can create the TransformerBlock as follows:


Listing 4.6 The transformer block component of GPT

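from previous_chapters import MultiHeadAttention  # Multi-head attention from chapter 3

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # Keyword arguments below are assumed to match the chapter 3
        # MultiHeadAttention signature and the keys of the cfg dictionary
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"],
        )
        self.ff = FeedForward(cfg)  # Feed forward module
        self.norm1 = LayerNorm(cfg["emb_dim"])  # Layer normalization before attention
        self.norm2 = LayerNorm(cfg["emb_dim"])  # Layer normalization before the feed forward module
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])  # Dropout applied before adding the shortcut

    def forward(self, x):
        shortcut = x  # Save the input for the shortcut connection
        x = self.norm1(x)  # Layer normalization 1
        x = self.att(x)  # Multi-head attention
        x = self.drop_shortcut(x)  # Dropout
        x = x + shortcut  # Add the original input back

        shortcut = x  # Save the input for the second shortcut connection
        x = self.norm2(x)  # Layer normalization 2
        x = self.ff(x)  # Feed forward module
        x = self.drop_shortcut(x)  # Dropout
        x = x + shortcut  # Add the original input back

        return x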

The given code defines a TransformerBlock class in PyTorch that includes a multi-head attention mechanism (MultiHeadAttention) and a feed forward network (FeedForward), both configured based on a provided configuration dictionary (cfg), such as GPT_CONFIG_124M.


Layer normalization (LayerNorm) is applied before each of these two components, and dropout is applied after them to regularize the model and prevent overfitting. This is also known as Pre-LayerNorm. Older architectures, such as the original transformer model, applied layer normalization after the self-attention and feed-forward networks instead, known as Post-LayerNorm, which often leads to worse training dynamics.
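
The difference between the two orderings can be sketched in a few lines of plain Python. This is a simplified illustration, not the chapter's implementation: layer_norm has no learned scale and shift, sublayer is a hypothetical stand-in for attention or the feed forward network, and dropout is omitted.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def sublayer(vec):
    """Hypothetical stand-in for attention or the feed forward network."""
    return [2.0 * v + 1.0 for v in vec]

def pre_ln_step(x):
    """Pre-LayerNorm: normalize, apply the sublayer, then add the shortcut."""
    return [xi + yi for xi, yi in zip(x, sublayer(layer_norm(x)))]

def post_ln_step(x):
    """Post-LayerNorm: apply the sublayer, add the shortcut, then normalize."""
    return layer_norm([xi + yi for xi, yi in zip(x, sublayer(x))])

x = [1.0, 2.0, 4.0, 9.0]
print("Pre-LN: ", [round(v, 4) for v in pre_ln_step(x)])
print("Post-LN:", [round(v, 4) for v in post_ln_step(x)])
```

In the Pre-LayerNorm variant, the unnormalized input travels along the shortcut path untouched, whereas in the Post-LayerNorm variant the normalization also rescales the shortcut path.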


The class also implements the forward pass, where each component is followed by a shortcut connection that adds the input of the block to its output. This critical feature helps gradients flow through the network during training and improves the learning of deep models as explained in section 4.4.


Using the GPT_CONFIG_124M dictionary we defined earlier, let’s instantiate a transformer block and feed it some sample data:


torch.manual_seed(123)  # Seed for reproducibility
x = torch.rand(2, 4, 768)  # Random input of shape [2, 4, 768]: 2 samples, 4 tokens each, 768 dimensions per token
block = TransformerBlock(GPT_CONFIG_124M)  # Instantiate the transformer block from the configuration dictionary
output = block(x)  # Pass the input through the transformer block

print("Input shape:", x.shape)  # Print the shape of the input tensor
print("Output shape:", output.shape)  # Print the shape of the output tensor

The output is as follows:


Input shape: torch.Size([2, 4, 768])
Output shape: torch.Size([2, 4, 768])

As we can see from the code output, the transformer block maintains the input dimensions in its output, indicating that the transformer architecture processes sequences of data without altering their shape throughout the network.


The preservation of shape throughout the transformer block architecture is not incidental but a crucial aspect of its design. This design enables its effective application across a wide range of sequence-to-sequence tasks, where each output vector directly corresponds to an input vector, maintaining a one-to-one relationship. However, the output is a context vector that encapsulates information from the entire input sequence, as we learned in chapter 3.

This means that while the physical dimensions of the sequence (length and feature size) remain unchanged as it passes through the transformer block, the content of each output vector is re-encoded to integrate contextual information from across the entire input sequence.


With the transformer block implemented in this section, we now have all the building blocks, as shown in Figure 4.14, needed to implement the GPT architecture in the next section.


Figure 4.14 A mental model of the different concepts we have implemented in this chapter so far.


4.6 Coding the GPT model

We started this chapter with a big-picture overview of a GPT architecture that we called DummyGPTModel. In this DummyGPTModel code implementation, we showed the input and outputs to the GPT model, but its building blocks remained a black box using a DummyTransformerBlock and DummyLayerNorm class as placeholders.


In this section, we are now replacing the DummyTransformerBlock and DummyLayerNorm placeholders with the real TransformerBlock and LayerNorm classes we coded earlier in this chapter to assemble a fully working version of the original 124 million parameter version of GPT-2. In chapter 5, we will pretrain a GPT-2 model, and in chapter 6, we will load in the pretrained weights from OpenAI.


Before we assemble the GPT-2 model in code, let’s look at its overall structure in Figure 4.15, which combines all the concepts we covered so far in this chapter.


Figure 4.15 An overview of the GPT model architecture. This figure illustrates the flow of data through the GPT model. Starting from the bottom, tokenized text is first converted into token embeddings, which are then augmented with positional embeddings. This combined information forms a tensor that is passed through a series of transformer blocks shown in the center (each containing multi-head attention and feed forward neural network layers with dropout and layer normalization), which are stacked on top of each other and repeated 12 times.


As shown in Figure 4.15, the transformer block we coded in Section 4.5 is repeated many times throughout a GPT model architecture. In the case of the 124 million parameter GPT-2 model, it’s repeated 12 times, which we specify via the "n_layers" entry in the GPT_CONFIG_124M dictionary. In the case of the largest GPT-2 model with 1,542 million parameters, this transformer block is repeated 48 times.


As shown in Figure 4.15, the output from the final transformer block then goes through a final layer normalization step before reaching the linear output layer. This layer maps the transformer’s output to a high-dimensional space (in this case, 50,257 dimensions, corresponding to the model’s vocabulary size) to predict the next token in the sequence.


Let’s now implement the architecture we see in Figure 4.15 in code:


class GPTModel(nn.Module):
    def __init__(self, cfg):  # cfg is the configuration dictionary
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])  # Token embedding layer
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])  # Positional embedding layer
        self.drop_emb = nn.Dropout(cfg["drop_rate"])  # Dropout on the embeddings
        self.trf_blocks = nn.Sequential(  # Stack of transformer blocks
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )
        self.final_norm = LayerNorm(cfg["emb_dim"])  # Final layer normalization
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)  # Output layer

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape  # Batch size and sequence length of the input
        tok_embeds = self.tok_emb(in_idx)  # Token embeddings
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))  # Positional embeddings
        x = tok_embeds + pos_embeds  # Add token and positional embeddings
        x = self.drop_emb(x)  # Apply dropout
        x = self.trf_blocks(x)  # Pass through the transformer blocks
        x = self.final_norm(x)  # Apply the final layer normalization
        logits = self.out_head(x)  # Project to vocabulary size to obtain the logits
        return logits

Thanks to the TransformerBlock class we implemented in Section 4.5, the GPTModel class is relatively small and compact.


The __init__ constructor of this GPTModel class initializes the token and positional embedding layers using the configurations passed in via a Python dictionary, cfg. These embedding layers are responsible for converting input token indices into dense vectors and adding positional information, as discussed in chapter 2.


Next, the __init__ method creates a sequential stack of TransformerBlock modules equal to the number of layers specified in cfg. Following the transformer blocks, a LayerNorm layer is applied, standardizing the outputs from the transformer blocks to stabilize the learning process. Finally, a linear output head without bias is defined, which projects the transformer’s output into the vocabulary space of the tokenizer to generate logits for each token in the vocabulary.


The forward method takes a batch of input token indices, computes their embeddings, applies the positional embeddings, passes the sequence through the transformer blocks, normalizes the final output, and then computes the logits, representing the next token’s unnormalized probabilities. We will convert these logits into tokens and text outputs in the next section.
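
The first step of the forward method, looking up token embeddings and adding positional embeddings, can be sketched with nested lists in plain Python. The sizes and table values below are toy assumptions chosen so the result is easy to read, and embed is a hypothetical helper name.

```python
vocab_size, context_length, emb_dim = 6, 4, 3  # toy sizes, not the GPT-2 values

# Hypothetical embedding tables: one row per token ID and one row per position
tok_emb = [[float(token_id)] * emb_dim for token_id in range(vocab_size)]
pos_emb = [[0.1 * pos] * emb_dim for pos in range(context_length)]

def embed(in_idx):
    """Look up each token embedding and add the embedding of its position."""
    batch = []
    for sequence in in_idx:
        rows = []
        for pos, token_id in enumerate(sequence):
            rows.append([t + p for t, p in zip(tok_emb[token_id], pos_emb[pos])])
        batch.append(rows)
    return batch

in_idx = [[5, 2, 2], [0, 1, 3]]  # 2 texts with 3 token IDs each
x = embed(in_idx)

print(len(x), len(x[0]), len(x[0][0]))  # batch_size, num_tokens, emb_dim
print(x[0][1], x[0][2])  # the same token ID (2) at positions 1 and 2
```

The same positional rows are reused for every sequence in the batch, so identical token IDs at different positions end up with different input vectors.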


Let’s now initialize the 124 million parameter GPT model using the GPT_CONFIG_124M dictionary we pass into the cfg parameter and feed it with the batch text input we created at the beginning of this chapter:


torch.manual_seed(123)  # Seed for reproducibility
model = GPTModel(GPT_CONFIG_124M)  # Initialize the GPT model
out = model(batch)  # Compute the model output

print("Input batch:\n", batch)  # Print the input batch
print("\nOutput shape:", out.shape)  # Print the output shape
print(out)  # Print the output tensor

The preceding code prints the contents of the input batch followed by the output tensor:


Input batch:
tensor([[ 6109,  3626,  6100,   345],  # Token IDs of text 1
        [ 6109,  1110,  6622,   257]])  # Token IDs of text 2

Output shape: torch.Size([2, 4, 50257])

tensor([[[ 0.3613,  0.4222, -0.0711,  ...,  0.3483,  0.4661, -0.2838],  # Output for text 1
         [-0.1792, -0.5666, -0.9485,  ...,  0.0477,  0.5181, -0.3168],
         [ 0.7120,  0.0332,  0.1085,  ..., -0.1018, -0.4327, -0.2553],
         [-1.0076,  0.3418, -0.1190,  ...,  0.7195,  0.4023, -0.0532]],

        [[-0.2564,  0.0900,  0.0335,  ...,  0.2659,  0.4454, -0.6806],  # Output for text 2
         [ 0.1230,  0.3653, -0.2074,  ...,  0.7705,  0.2710,  0.2246],
         [ 1.0558,  1.0318, -0.2800,  ...,  0.6936,  0.3295, -0.3178],
         [-0.1565,  0.3926,  0.3288,  ...,  1.2630, -0.1858,  0.0388]]],

As we can see, the output tensor has the shape [2, 4, 50257], since we passed in 2 input texts with 4 tokens each. The last dimension, 50,257, corresponds to the vocabulary size of the tokenizer. In the next section, we will see how to convert each of these 50,257-dimensional output vectors back into tokens.


Before we move on to the next section and code the function that converts the model outputs into text, let’s spend a bit more time with the model architecture itself and analyze its size.


Using the numel() method, short for “number of elements,” we can collect the total number of parameters in the model’s parameter tensors:


total_params = sum(p.numel() for p in model.parameters())  # Sum the element counts of all parameter tensors
print(f"Total number of parameters: {total_params:,}")  # Print the total parameter count

The result is as follows:


Total number of parameters: 163,009,536

Now, a curious reader might notice a discrepancy. Earlier, we spoke of initializing a 124 million parameter GPT model, so why is the actual number of parameters 163 million, as shown in the preceding code output?


The reason is a concept called weight tying, which is used in the original GPT-2 architecture: it reuses the weights of the token embedding layer in its output layer. To understand what this means, let’s take a look at the shapes of the token embedding layer and the linear output layer that we initialized on the model via the GPTModel class earlier:


print("Token embedding layer shape:", model.tok_emb.weight.shape)  # Shape of the token embedding weights
print("Output layer shape:", model.out_head.weight.shape)  # Shape of the output layer weights

As we can see based on the print outputs, the weight tensors for both these layers have the same shape:


Token embedding layer shape: torch.Size([50257, 768])
Output layer shape: torch.Size([50257, 768])

The token embedding and output layers are very large because each has one row for each of the 50,257 tokens in the tokenizer’s vocabulary. Let’s remove the output layer parameter count from the total GPT-2 model count according to the weight tying:


total_params_gpt2 = total_params - sum(p.numel() for p in model.out_head.parameters())  # Subtract the output layer parameters
print(f"Number of trainable parameters considering weight tying: {total_params_gpt2:,}")  # Print the count with weight tying

The output is as follows:


Number of trainable parameters considering weight tying: 124,412,160

As we can see, the model is now only 124 million parameters large, matching the original size of the GPT-2 model.
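
The same subtraction can be checked by hand, since the bias-free output layer holds exactly one weight per vocabulary entry and embedding dimension. The sketch below uses only numbers quoted in this section; the variable names are illustrative.

```python
vocab_size = 50_257          # tokenizer vocabulary size
emb_dim = 768                # embedding size in the 124M configuration
total_params = 163_009_536   # parameter count printed for the untied GPTModel

# Bias-free linear output layer: vocab_size rows with emb_dim weights each
out_head_params = vocab_size * emb_dim
tied_params = total_params - out_head_params

print(f"Output layer parameters:      {out_head_params:,}")
print(f"Parameters with weight tying: {tied_params:,}")
```

The output layer alone accounts for roughly 38.6 million parameters, which is the entire gap between the 163 million and 124 million figures.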


Weight tying reduces the overall memory footprint and computational complexity of the model. However, in my experience, using separate token embedding and output layers results in better training and model performance; hence, we are using separate layers in our GPTModel implementation. The same is true for modern LLMs. However, we will revisit and implement the weight tying concept later in chapter 6 when we load the pretrained weights from OpenAI.



Exercise 4.1 Number of parameters in feed forward and attention modules

Calculate and compare the number of parameters that are contained in the feed forward module and those that are contained in the multi-head attention module.


Lastly, let us compute the memory requirements of the 163 million parameters in our GPTModel object:


total_size_bytes = total_params * 4  # Total size in bytes (assuming float32, 4 bytes per parameter)
total_size_mb = total_size_bytes / (1024 * 1024)  # Convert to megabytes
print(f"Total size of the model: {total_size_mb:.2f} MB")  # Print the total model size

The result is as follows:


Total size of the model: 621.83 MB

In conclusion, by calculating the memory requirements for the 163 million parameters in our GPTModel object and assuming each parameter is a 32-bit float taking up 4 bytes, we find that the total size of the model amounts to 621.83 MB, illustrating the relatively large storage capacity required to accommodate even relatively small LLMs.
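
As a quick sanity check of this arithmetic, the sketch below recomputes the storage requirement in plain Python and, as an assumption beyond the text, also shows the figure for 16-bit parameters. The helper name size_in_mb is illustrative, and the result covers parameter storage only.

```python
total_params = 163_009_536  # parameter count of the untied GPTModel

def size_in_mb(num_params, bytes_per_param):
    """Storage for the parameters alone, in MB (1 MB = 1024 * 1024 bytes)."""
    return num_params * bytes_per_param / (1024 * 1024)

# 32-bit floats match the text; the 16-bit row is an added assumption for comparison
for label, nbytes in (("32-bit float", 4), ("16-bit float", 2)):
    print(f"{label}: {size_in_mb(total_params, nbytes):.2f} MB")
```

Halving the number of bytes per parameter halves the storage requirement, which is one reason lower-precision formats are attractive for larger models.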


In this section, we implemented the GPTModel architecture and saw that it outputs numeric tensors of shape [batch_size, num_tokens, vocab_size]. In the next section, we will write the code to convert these output tensors into text.


Exercise 4.2 Initializing larger GPT models

In this chapter, we initialized a 124 million parameter GPT model, which is known as “GPT-2 small.” Without making any code modifications besides updating the configuration file, use the GPTModel class to implement GPT-2 medium (using 1024-dimensional embeddings, 24 transformer blocks, 16 multi-head attention heads), GPT-2 large (1280-dimensional embeddings, 36 transformer blocks, 20 multi-head attention heads), and GPT-2 XL (1600-dimensional embeddings, 48 transformer blocks, 25 multi-head attention heads). As a bonus, calculate the total number of parameters in each GPT model.


4.7 Generating text


In this final section of this chapter, we will implement the code that converts the tensor outputs of the GPT model back into text. Before we get started, let’s briefly review how a generative model like an LLM generates text one word (or token) at a time, as shown in Figure 4.16.


Figure 4.16 This diagram illustrates the step-by-step process by which an LLM generates text, one token at a time. Starting with an initial input context (“Hello, I am”), the model predicts a subsequent token during each iteration, appending it to the input context for the next round of prediction. As shown, the first iteration adds “a”, the second “model”, and the third “ready”, progressively building the sentence.


Figure 4.16 illustrates the step-by-step process by which a GPT model generates text given an input context, such as “Hello, I am,” on a big-picture level. With each iteration, the input context grows, allowing the model to generate coherent and contextually appropriate text. By the 6th iteration, the model has constructed a complete sentence: “Hello, I am a model ready to help.”


In the previous section, we saw that our current GPTModel implementation outputs tensors with shape [batch_size, num_tokens, vocab_size]. Now, the question is, how does a GPT model go from these output tensors to the generated text shown in Figure 4.16?

The process by which a GPT model goes from output tensors to generated text involves several steps, as illustrated in Figure 4.17. These steps include decoding the output tensors, selecting tokens based on a probability distribution, and converting these tokens into human-readable text.


Figure 4.17 details the mechanics of text generation in a GPT model by showing a single iteration in the token generation process. The process begins by encoding the input text into token IDs, which are then fed into the GPT model. The outputs of the model are then converted back into text and appended to the original input text.


The next-token generation process detailed in Figure 4.17 illustrates a single step where the GPT model generates the next token given its input.


In each step, the model outputs a matrix with vectors representing potential next tokens. The vector corresponding to the next token is extracted and converted into a probability distribution via the softmax function. Within the vector containing the resulting probability scores, the index of the highest value is located, which translates to the token ID. This token ID is then decoded back into text, producing the next token in the sequence. Finally, this token is appended to the previous inputs, forming a new input sequence for the subsequent iteration. This step-by-step process enables the model to generate text sequentially, building coherent phrases and sentences from the initial input context.
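
A single iteration of this process can be traced in plain Python. The five-entry vocabulary and the logits below are made up for illustration; a real model would produce one logit per entry in its 50,257-token vocabulary.

```python
import math

def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and made-up logits for the last position
vocab = {0: "Hello", 1: ",", 2: " I", 3: " am", 4: " a"}
context_ids = [0, 1, 2, 3]  # token IDs for "Hello, I am"
last_logits = [0.2, -1.3, 0.1, 0.5, 2.4]

probas = softmax(last_logits)
next_id = max(range(len(probas)), key=probas.__getitem__)  # index of the highest probability
context_ids = context_ids + [next_id]  # append the new token ID to the context

print("".join(vocab[i] for i in context_ids))
```

The extended list of token IDs then serves as the input for the next iteration.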

In practice, we repeat this process over many iterations, as shown earlier in Figure 4.16, until we reach a user-specified number of generated tokens.

In code, we can implement the token-generation process as follows:


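import torch

def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is a [batch_size, num_tokens] tensor of token IDs for the current context
    for _ in range(max_new_tokens):  # generate one new token per iteration
        idx_cond = idx[:, -context_size:]  # crop the context to the supported context size
        with torch.no_grad():  # no gradients are needed for inference
            logits = model(idx_cond)  # shape: [batch_size, num_tokens, vocab_size]
        logits = logits[:, -1, :]  # keep only the last time step: [batch_size, vocab_size]
        probas = torch.softmax(logits, dim=-1)  # convert logits into a probability distribution
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # ID of the most likely token: [batch_size, 1]
        idx = torch.cat((idx, idx_next), dim=1)  # append the new token ID to the running sequence
    return idx  # shape: [batch_size, num_tokens + max_new_tokens]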

The generate_text_simple function above uses softmax to convert the logits into a probability distribution and then uses torch.argmax to find the position with the highest value. Softmax is monotonic: it preserves the order of its inputs, so the position with the highest score in the softmax output is the same as the position with the highest logit. The softmax step is therefore redundant here; applying torch.argmax to the logits directly gives identical results. We coded the conversion anyway to illustrate the full process of transforming logits into probabilities, which makes it clear that the model picks the most likely next token at each step. This strategy is known as greedy decoding.
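The equivalence between taking the argmax of the logits and of the softmax output can be checked on a small example. The logit values below are made up for illustration.

```python
import torch

# Made-up logits for a batch of two sequences and a 4-token vocabulary
logits = torch.tensor([[1.0, 3.5, -0.2, 2.9],
                       [0.1, -1.0, 4.0, 0.3]])
probas = torch.softmax(logits, dim=-1)

from_logits = torch.argmax(logits, dim=-1)
from_probas = torch.argmax(probas, dim=-1)

print(from_logits)                            # tensor([1, 2])
print(torch.equal(from_logits, from_probas))  # True
```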


In the next chapter, when we implement the GPT training code, we will also introduce additional sampling techniques that modify the softmax outputs so that the model doesn’t always select the most likely token. This introduces variability and creativity into the generated text.
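As a rough preview of what such a technique can look like, the sketch below contrasts greedy selection with sampling from a rescaled distribution. It is an assumption about one common approach (temperature scaling followed by torch.multinomial), not the next chapter's actual code, and the logits and temperature value are made up.

```python
import torch

torch.manual_seed(123)
logits = torch.tensor([[4.0, 2.0, 1.0, 0.5]])  # made-up scores for a 4-token vocabulary

# Greedy decoding: always picks the highest-scoring index (0 here)
greedy = torch.argmax(logits, dim=-1, keepdim=True)

# Sampling: rescale the logits, then draw an index in proportion to its probability
temperature = 1.5  # hypothetical setting; values above 1 flatten the distribution
probas = torch.softmax(logits / temperature, dim=-1)
sampled = torch.multinomial(probas, num_samples=1)

print(greedy, sampled)
```

Repeated calls to torch.multinomial can return different indices, which is where the variability comes from.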


This process of generating one token ID at a time and appending it to the context using the generate_text_simple function is further illustrated in Figure 4.18. (The token ID generation process for each iteration is detailed in Figure 4.17.)


Figure 4.18 An illustration showing six iterations of a token prediction cycle, where the model takes a sequence of initial token IDs as input, predicts the next token, and appends this token to the input sequence for the next iteration. (The token IDs are also translated into their corresponding text for better understanding.)


As shown in Figure 4.18, we generate the token IDs in an iterative fashion. For instance, in iteration 1, the model is provided with the tokens corresponding to “Hello, I am”, predicts the next token (with ID 257, which is “a”), and appends it to the input. This process is repeated until the model produces the complete sentence “Hello, I am a model ready to help.” after six iterations.
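The growth of the sequence over six iterations can be reproduced with a toy setup. In this sketch, `TinyLM` is a hypothetical stand-in for GPTModel with random weights, the input IDs are made up, and the function body is a condensed version of generate_text_simple that skips the redundant softmax; the predicted tokens are therefore meaningless, and only the shapes matter.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Hypothetical stand-in model: maps token IDs to logits of shape
    [batch_size, num_tokens, vocab_size], like GPTModel does."""
    def __init__(self, vocab_size=50, emb_dim=8):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.out_head = nn.Linear(emb_dim, vocab_size, bias=False)

    def forward(self, idx):
        return self.out_head(self.emb(idx))

def generate_text_simple(model, idx, max_new_tokens, context_size):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]           # crop to the context size
        with torch.no_grad():
            logits = model(idx_cond)
        idx_next = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
        idx = torch.cat((idx, idx_next), dim=1)     # append the predicted token ID
    return idx

torch.manual_seed(123)
model = TinyLM().eval()
idx = torch.tensor([[5, 11, 3, 7]])  # four made-up input token IDs

# Call the function one token at a time to make each iteration visible
for step in range(1, 7):
    idx = generate_text_simple(model, idx, max_new_tokens=1, context_size=16)
    print(f"iteration {step}: {idx.tolist()[0]}")
```

Each printed line is one token longer than the previous one, ending with ten token IDs after six iterations.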

Let’s now try out the generate_text_simple function in practice, using the “Hello, I am” context from Figure 4.18 as model input.

First, we encode the input context into token IDs:


start_context = "Hello, I am"  # initial context
encoded = tokenizer.encode(start_context)  # encode the context into token IDs
print("encoded:", encoded)
encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # convert to a tensor and add a batch dimension
print("encoded_tensor.shape:", encoded_tensor.shape)

The encoded IDs are as follows:


encoded: [15496, 11, 314, 716]  # encoded token IDs
encoded_tensor.shape: torch.Size([1, 4])  # tensor shape
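The batch dimension is needed because the model expects input of shape [batch_size, num_tokens]. The following sketch shows the round trip between a plain list of token IDs and a batched tensor; it reuses the token IDs printed above so that it runs without a tokenizer.

```python
import torch

encoded = [15496, 11, 314, 716]  # token IDs for "Hello, I am" from the output above

encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # [4] -> [1, 4]
print(encoded_tensor.shape)                          # torch.Size([1, 4])

restored = encoded_tensor.squeeze(0).tolist()        # [1, 4] -> plain Python list
print(restored == encoded)                           # True
```

The same squeeze(0).tolist() step is what turns a generated output tensor back into a list that a tokenizer can decode.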

Next, we put the model into .eval() mode, which disables random components such as dropout that are used only during training. We then call the generate_text_simple function on the encoded input tensor:
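The effect of evaluation mode on dropout can be seen in isolation with a standalone nn.Dropout layer. This is a minimal sketch; the dropout probability of 0.5 and the input of ones are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
train_out = drop(x)  # some entries are zeroed, the rest are scaled by 1 / (1 - p) = 2
print(train_out)

drop.eval()
eval_out = drop(x)   # dropout is a no-op in evaluation mode
print(eval_out)
```

Calling model.eval() on a larger model switches every dropout layer inside it to this deterministic behavior.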


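model.eval()  # put the model in evaluation mode (disables dropout)
out = generate_text_simple(
    model=model, idx=encoded_tensor,
    max_new_tokens=6,  # 4 input tokens + 6 new tokens = 10 tokens in total
    context_size=GPT_CONFIG_124M["context_length"],  # key name assumed; use the context-size key from your config
)
print("Output:", out)
print("Output length:", len(out[0]))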

The resulting output token IDs are as follows:


Output: tensor([[15496, 11, 314, 716, 27018, 24806, 47843, 30961, 42348, 7267]])  # output token IDs
Output length: 10  # output length

Using the .decode method of the tokenizer, we can convert the IDs back into text:


decoded_text = tokenizer.decode(out.squeeze(0).tolist())  # remove the batch dimension and decode the IDs into text
print(decoded_text)

The model output in text format is as follows:


Hello, I am Featureiman Byeswickatattribute argue  # decoded text

As we can see, based on the preceding output, the model generated gibberish, which is not at all like the coherent text shown in Figure 4.18. What happened? The reason why the model is unable to produce coherent text is that we haven’t trained it yet. So far, we just implemented the GPT architecture and initialized a GPT model instance with initial random weights.


Model training is a large topic in itself, and we will tackle it in the next chapter.


Exercise 4.3 Using separate dropout parameters

At the beginning of this chapter, we defined a global “drop_rate” setting in the GPT_CONFIG_124M dictionary to set the dropout rate in various places throughout the GPTModel architecture. Change the code to specify a separate dropout value for the various dropout layers throughout the model architecture. (Hint: there are three distinct places where we used dropout layers: the embedding layer, shortcut layer, and multi-head attention module.)


4.8 Summary

Layer normalization stabilizes training by ensuring that each layer’s outputs have a consistent mean and variance.


Shortcut connections are connections that skip one or more layers by feeding the output of one layer directly to a deeper layer, which helps mitigate the vanishing gradient problem when training deep neural networks.


Transformer blocks are a core structural component of GPT models, combining masked multi-head attention modules with fully connected feed-forward networks that use the GELU activation function.


GPT models are LLMs with many repeated transformer blocks that have millions to billions of parameters.


GPT models come in various sizes, for example, 124, 345, 762, and 1542 million parameters, which we can implement with the same GPTModel Python class.

The text generation capability of a GPT-like LLM involves decoding output tensors into human-readable text by sequentially predicting one token at a time based on a given input context.


Without training, a GPT model generates incoherent text, which underscores the importance of training for coherent text generation. Model training is the topic of subsequent chapters.
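As a closing sanity check on the smallest model size mentioned above, the parameter count can be derived by hand. The sketch assumes the published GPT-2 "small" hyperparameters (vocabulary of 50,257 tokens, context length 1,024, embedding size 768, 12 transformer blocks), query, key, and value projections without bias, a feed-forward expansion factor of 4, and an output head without bias; none of these values are stated in this section.

```python
# Assumed hyperparameters of the 124M configuration (not taken from this section)
vocab_size, context_length, emb_dim, n_layers = 50257, 1024, 768, 12

# Token and positional embedding tables
embeddings = vocab_size * emb_dim + context_length * emb_dim

# Per transformer block
attention = 3 * emb_dim * emb_dim + (emb_dim * emb_dim + emb_dim)  # Q, K, V (no bias) + output projection
feed_forward = (emb_dim * 4 * emb_dim + 4 * emb_dim) + (4 * emb_dim * emb_dim + emb_dim)
layer_norms = 2 * 2 * emb_dim  # two LayerNorms, each with scale and shift
block = attention + feed_forward + layer_norms

final_norm = 2 * emb_dim
out_head = emb_dim * vocab_size  # no bias

total = embeddings + n_layers * block + final_norm + out_head
tied = total - out_head  # if the output head reuses the token-embedding weights

print(f"{total:,}")                      # 163,009,536
print(f"{tied:,}")                       # 124,412,160
print(f"{total * 4 / 1024**2:.2f} MB")   # 621.83 MB at 4 bytes per float32 parameter
```

Under these assumptions, the 124 million figure corresponds to the count with a shared (tied) output head, while a separate output head adds roughly 38.6 million parameters.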

