Generative Adversarial Networks: Paper Translation

Updated: 2024-10-19 22:17:27

Abstract

We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model $G$ that captures the data distribution, and a discriminative model $D$ that estimates the probability that a sample came from the training data rather than $G$. The training procedure for $G$ is to maximize the probability of $D$ making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions $G$ and $D$, a unique solution exists, with $G$ recovering the training data distribution and $D$ equal to $\frac{1}{2}$ everywhere. In the case where $G$ and $D$ are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

1. Introduction

The promise of deep learning is to discover rich, hierarchical models [2] that represent probability distributions over the kinds of data encountered in artificial intelligence applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora. So far, the most striking successes in deep learning have involved discriminative models, usually those that map a high-dimensional, rich sensory input to a class label [14, 22]. These striking successes have primarily been based on the backpropagation and dropout algorithms, using piecewise linear units [19, 9, 10] which have a particularly well-behaved gradient. Deep generative models have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to the difficulty of leveraging the benefits of piecewise linear units in the generative context. We propose a new generative model estimation procedure that sidesteps these difficulties.

In the proposed adversarial nets framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.

This framework can yield specific training algorithms for many kinds of model and optimization algorithm. In this article, we explore the special case when the generative model generates samples by passing random noise through a multilayer perceptron, and the discriminative model is also a multilayer perceptron. We refer to this special case as adversarial nets. In this case, we can train both models using only the highly successful backpropagation and dropout algorithms [17] and sample from the generative model using only forward propagation. No approximate inference or Markov chains are necessary.
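
As a rough illustration of "sampling using only forward propagation", the sketch below passes random noise through a small multilayer perceptron. The layer sizes, the weight initialization, the uniform noise prior, and the helper names `init_mlp` and `generator` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Return a list of (W, b) pairs for a fully connected network."""
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def generator(params, z):
    """Map noise z of shape (batch, noise_dim) to samples with one forward pass."""
    h = z
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)          # piecewise linear (ReLU) hidden units
    W, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid output in (0, 1)

rng = np.random.default_rng(0)
g_params = init_mlp(rng, [8, 32, 16])           # hypothetical layer sizes
z = rng.uniform(-1.0, 1.0, size=(5, 8))         # noise prior chosen for illustration
samples = generator(g_params, z)                # shape (5, 16)
```

No Markov chain or inference step appears anywhere in the sampling path: drawing a new sample only requires drawing a new `z`.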

2. Related work

An alternative to directed graphical models with latent variables are undirected graphical models with latent variables, such as restricted Boltzmann machines (RBMs) [27, 16], deep Boltzmann machines (DBMs) [26] and their numerous variants. The interactions within such models are represented as the product of unnormalized potential functions, normalized by a global summation/integration over all states of the random variables. This quantity (the partition function) and its gradient are intractable for all but the most trivial instances, although they can be estimated by Markov chain Monte Carlo (MCMC) methods. Mixing poses a significant problem for learning algorithms that rely on MCMC [3, 5].

Deep belief networks (DBNs) [16] are hybrid models containing a single undirected layer and several directed layers. While a fast approximate layer-wise training criterion exists, DBNs incur the computational difficulties associated with both undirected and directed models.

Alternative criteria that do not approximate or bound the log-likelihood have also been proposed, such as score matching [18] and noise-contrastive estimation (NCE) [13]. Both of these require the learned probability density to be analytically specified up to a normalization constant. Note that in many interesting generative models with several layers of latent variables (such as DBNs and DBMs), it is not even possible to derive a tractable unnormalized probability density. Some models such as denoising auto-encoders [30] and contractive autoencoders have learning rules very similar to score matching applied to RBMs. In NCE, as in this work, a discriminative training criterion is employed to fit a generative model. However, rather than fitting a separate discriminative model, the generative model itself is used to discriminate generated data from samples of a fixed noise distribution. Because NCE uses a fixed noise distribution, learning slows dramatically after the model has learned even an approximately correct distribution over a small subset of the observed variables.

Finally, some techniques do not involve defining a probability distribution explicitly, but rather train a generative machine to draw samples from the desired distribution. This approach has the advantage that such machines can be designed to be trained by back-propagation. Prominent recent work in this area includes the generative stochastic network (GSN) framework [5], which extends generalized denoising auto-encoders [4]: both can be seen as defining a parameterized Markov chain, i.e., one learns the parameters of a machine that performs one step of a generative Markov chain. Compared to GSNs, the adversarial nets framework does not require a Markov chain for sampling. Because adversarial nets do not require feedback loops during generation, they are better able to leverage piecewise linear units [19, 9, 10], which improve the performance of backpropagation but have problems with unbounded activation when used in a feedback loop. More recent examples of training a generative machine by back-propagating into it include recent work on auto-encoding variational Bayes [20] and stochastic backpropagation [24].

3. Adversarial nets

The adversarial modeling framework is most straightforward to apply when the models are both multilayer perceptrons. To learn the generator's distribution $p_g$ over data $x$, we define a prior on input noise variables $p_z(z)$, then represent a mapping to data space as $G(z;\theta_g)$, where $G$ is a differentiable function represented by a multilayer perceptron with parameters $\theta_g$. We also define a second multilayer perceptron $D(x;\theta_d)$ that outputs a single scalar. $D(x)$ represents the probability that $x$ came from the data rather than $p_g$. We train $D$ to maximize the probability of assigning the correct label to both training examples and samples from $G$. We simultaneously train $G$ to minimize $\log(1 - D(G(z)))$. In other words, $D$ and $G$ play the following two-player minimax game with value function $V(G, D)$:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \qquad (1)$$
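
To make the value function concrete, here is a minimal sketch that estimates $V(D, G)$ by Monte Carlo from samples. The toy one-dimensional data distribution, the shift generator `G`, the constant discriminator `D_blind`, and the helper `value_function` are all assumptions chosen for illustration.

```python
import numpy as np

def value_function(D, G, x_real, z):
    """Monte Carlo estimate of V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]."""
    eps = 1e-12  # guards against log(0)
    real_term = np.mean(np.log(D(x_real) + eps))
    fake_term = np.mean(np.log(1.0 - D(G(z)) + eps))
    return real_term + fake_term

rng = np.random.default_rng(0)
x_real = rng.normal(2.0, 1.0, size=10_000)   # toy "data" distribution
z = rng.normal(0.0, 1.0, size=10_000)        # noise prior p_z

G = lambda z: z + 2.0                        # generator that already matches the data
D_blind = lambda x: np.full_like(x, 0.5)     # discriminator that outputs 1/2 everywhere

v = value_function(D_blind, G, x_real, z)    # close to -log(4)
```

With $D = \frac{1}{2}$ everywhere, both expectations equal $\log\frac{1}{2}$, so the estimate lands at $-\log 4$, the value the game takes at the solution described in the abstract.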

In the next section, we present a theoretical analysis of adversarial nets, essentially showing that the training criterion allows one to recover the data generating distribution as $G$ and $D$ are given enough capacity, i.e., in the non-parametric limit. See Figure 1 for a less formal, more pedagogical explanation of the approach. In practice, we must implement the game using an iterative, numerical approach. Optimizing $D$ to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between $k$ steps of optimizing $D$ and one step of optimizing $G$. This results in $D$ being maintained near its optimal solution, so long as $G$ changes slowly enough. This strategy is analogous to the way that SML/PCD [31, 29] training maintains samples from a Markov chain from one learning step to the next in order to avoid burning in a Markov chain as part of the inner loop of learning. The procedure is formally presented in Algorithm 1.
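
The alternating schedule described above ($k$ discriminator steps, then one generator step) can be sketched on a toy one-dimensional problem. This is not the paper's Algorithm 1 verbatim: the logistic discriminator $D(x) = \sigma(wx + b)$, the shift generator $G(z) = z + \theta$, plain gradient steps without momentum, and every hyperparameter below are assumptions chosen so the gradients can be written by hand.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train(steps=2000, k=1, m=256, lr=0.05, data_mean=3.0, seed=0):
    """Alternate k ascent steps on D with one descent step on G (toy 1-D setting)."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0      # discriminator D(x) = sigmoid(w * x + b)
    theta = 0.0          # generator G(z) = z + theta
    for _ in range(steps):
        for _ in range(k):
            x = rng.normal(data_mean, 1.0, size=m)       # minibatch from the data
            g = rng.normal(0.0, 1.0, size=m) + theta     # minibatch from G
            d_real, d_fake = sigmoid(w * x + b), sigmoid(w * g + b)
            # Gradient ascent on log D(x) + log(1 - D(G(z))).
            grad_w = np.mean((1.0 - d_real) * x) - np.mean(d_fake * g)
            grad_b = np.mean(1.0 - d_real) - np.mean(d_fake)
            w, b = w + lr * grad_w, b + lr * grad_b
        g = rng.normal(0.0, 1.0, size=m) + theta
        # Gradient descent on log(1 - D(G(z))) with respect to theta.
        grad_theta = np.mean(-sigmoid(w * g + b) * w)
        theta -= lr * grad_theta
    return w, b, theta

w, b, theta = train()
```

In this toy setting the generator matches the data when `theta` equals `data_mean`. The sketch makes no claim about how quickly, or how smoothly, that point is approached.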

In practice, equation 1 may not provide sufficient gradient for $G$ to learn well. Early in learning, when $G$ is poor, $D$ can reject samples with high confidence because they are clearly different from the training data. In this case, $\log(1 - D(G(z)))$ saturates. Rather than training $G$ to minimize $\log(1 - D(G(z)))$ we can train $G$ to maximize $\log D(G(z))$. This objective function results in the same fixed point of the dynamics of $G$ and $D$ but provides much stronger gradients early in learning.
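
The saturation argument can be checked numerically. The sketch below assumes the discriminator ends in a sigmoid, so that $D(G(z)) = \sigma(a)$ for some logit $a$, and compares the derivative of each generator objective with respect to $a$. The logit value and the function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the term G minimizes in Eq. 1."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of log(sigmoid(a)), the alternative term G maximizes."""
    return 1.0 - sigmoid(a)

a = -6.0  # logit of a sample that D confidently rejects, D(G(z)) is about 0.0025
sat = abs(grad_saturating(a))          # tiny: almost no learning signal
non_sat = abs(grad_non_saturating(a))  # close to 1: strong learning signal
```

When $D$ rejects a sample confidently, the gradient of the original objective is proportional to $D(G(z))$ itself and therefore vanishes, while the alternative objective keeps a gradient close to 1.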

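Figure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution ($D$, blue, dashed line) so that it discriminates between samples from the data generating distribution $p_x$ (black, dotted line) and samples from the generative distribution $p_g$ ($G$, green, solid line). The lower horizontal line is the domain from which $z$ is sampled, in this case uniformly. The horizontal line above is part of the domain of $x$. The upward arrows show how the mapping $x = G(z)$ imposes the non-uniform distribution $p_g$ on transformed samples. $G$ contracts in regions of high density and expands in regions of low density of $p_g$.
(a) Consider an adversarial pair near convergence: $p_g$ is similar to $p_{data}$ and $D$ is a partially accurate classifier.
(b) In the inner loop of the algorithm, $D$ is trained to discriminate samples from data, converging to $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$.
(c) After an update to $G$, the gradient of $D$ has guided $G(z)$ to flow to regions that are more likely to be classified as data.
(d) After several steps of training, if $G$ and $D$ have enough capacity, they will reach a point at which both cannot improve because $p_g = p_{data}$. The discriminator is unable to differentiate between the two distributions, i.e. $D(x) = \frac{1}{2}$.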
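
Panels (b) and (d) can be reproduced numerically when both densities are known in closed form. The sketch below assumes unit-variance Gaussians for $p_{data}$ and $p_g$, which is a choice made for illustration only.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def optimal_discriminator(x, p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    return p_data(x) / (p_data(x) + p_g(x))

x = np.linspace(-3.0, 5.0, 9)
p_data = lambda x: gaussian_pdf(x, 1.0, 1.0)

# Panel (b): generator density centered at 0, data density centered at 1.
d_mismatch = optimal_discriminator(x, p_data, lambda x: gaussian_pdf(x, 0.0, 1.0))
# Panel (d): generator density equal to the data density.
d_matched = optimal_discriminator(x, p_data, p_data)
```

In the mismatched case $D^*$ exceeds $\frac{1}{2}$ wherever the data density is the larger of the two. In the matched case it is $\frac{1}{2}$ at every point.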

4. Theoretical Results

The generator $G$ implicitly defines a probability distribution $p_g$ as the distribution of the samples $G(z)$ obtained when $z \sim p_z$. Therefore, we would like Algorithm 1 to converge to a good estimator of $p_{data}$, if given enough capacity and training time. The results of this section are done in a nonparametric setting, e.g. we represent a model with infinite capacity by studying convergence in the space of probability density functions.

We will show in section 4.1 that this minimax game has a global optimum for $p_g = p_{data}$. We will then show in section 4.2 that Algorithm 1 optimizes Eq. 1, thus obtaining the desired result.

4.1 Global Optimality of $p_g = p_{data}$

We first consider the optimal discriminator $D$ for any given generator $G$.
Proposition 1: For $G$ fixed, the optimal discriminator $D$ is:
$$D_G^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \qquad (2)$$
Proof: The training criterion for the discriminator $D$, given any generator $G$, is to maximize the quantity $V(G, D)$:
$$\begin{aligned} V(G, D) &= \int_x p_{data}(x) \log(D(x))\,dx + \int_z p_z(z) \log(1 - D(G(z)))\,dz \\ &= \int_x \left[ p_{data}(x) \log(D(x)) + p_g(x) \log(1 - D(x)) \right] dx \end{aligned} \qquad (3)$$
For any $(a, b) \in \mathbb{R}^2 \setminus \{0, 0\}$, the function $y \to a \log(y) + b \log(1 - y)$ achieves its maximum in $[0, 1]$ at $\frac{a}{a+b}$. The discriminator does not need to be defined outside of $Supp(p_{data}) \cup Supp(p_g)$, concluding the proof.
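
The maximizer used in the proof follows from setting the derivative $\frac{a}{y} - \frac{b}{1-y}$ to zero, which gives $y = \frac{a}{a+b}$. A quick numerical check on a grid is sketched below. The values of `a` and `b` are arbitrary positive numbers picked for this example.

```python
import numpy as np

def objective(y, a, b):
    """The pointwise integrand of Eq. 3 with a = p_data(x), b = p_g(x), y = D(x)."""
    return a * np.log(y) + b * np.log(1.0 - y)

a, b = 0.3, 0.9
y = np.linspace(1e-4, 1.0 - 1e-4, 100_001)
y_best = y[np.argmax(objective(y, a, b))]   # close to a / (a + b) = 0.25
```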

Note that the training objective for $D$ can be interpreted as maximizing the log-likelihood for estimating the conditional probability $P(Y = y \mid x)$, where $Y$ indicates whether $x$ comes from $p_{data}$ (with $y = 1$) or from $p_g$ (with $y = 0$). The minimax game in Eq. 1 can now be reformulated as:
$$\begin{aligned} C(G) &= \max_D V(G, D) \\ &= \mathbb{E}_{x \sim p_{data}}[\log D_G^*(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D_G^*(G(z)))] \\ &= \mathbb{E}_{x \sim p_{data}}[\log D_G^*(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D_G^*(x))] \\ &= \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{data}(x) + p_g(x)}\right] \end{aligned} \qquad (4)$$

Theorem 1: The global minimum of $C(G)$ is achieved if and only if $p_g = p_{data}$. At that point, $C(G)$ achieves the value $-\log 4$.

Proof: For $p_g = p_{data}$, $D_G^*(x) = \frac{1}{2}$ (consider Eq. 2). Hence, by inspecting Eq. 4 at $D_G^*(x) = \frac{1}{2}$, we find $C(G) = \log\frac{1}{2} + \log\frac{1}{2} = -\log 4$. To see that this is the best possible value of $C(G)$, reached only for $p_g = p_{data}$, observe that
$$\mathbb{E}_{x \sim p_{data}}[-\log 2] + \mathbb{E}_{x \sim p_g}[-\log 2] = -\log 4$$
and that by subtracting this expression from $C(G) = V(D_G^*, G)$, we obtain:
$$C(G) = -\log(4) + KL\left(p_{data} \,\middle\|\, \frac{p_{data} + p_g}{2}\right) + KL\left(p_g \,\middle\|\, \frac{p_{data} + p_g}{2}\right) \qquad (5)$$
where KL is the relative entropy (the Kullback-Leibler divergence).
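A brief aside on relative entropy: in probability theory and information theory, the KL divergence (Kullback-Leibler divergence), also called relative entropy, measures the difference between two probability distributions $P$ and $Q$. It is asymmetric, meaning that in general $D(P \| Q) \neq D(Q \| P)$. In information theory, $D(P \| Q)$ represents the information lost when the distribution $Q$ is used to approximate the true distribution $P$.

Some authors call the KL divergence the KL distance, but it does not satisfy the definition of a distance, because 1) it is not symmetric and 2) it does not satisfy the triangle inequality.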

We recognize in the previous expression the Jensen-Shannon divergence (JSD) between the model's distribution and the data generating process:
$$C(G) = -\log(4) + 2 \cdot JSD(p_{data} \| p_g) \qquad (6)$$

Since the Jensen-Shannon divergence between two distributions is always non-negative, and zero only when they are equal, we have shown that $C^* = -\log(4)$ is the global minimum of $C(G)$ and that the only solution is $p_g = p_{data}$, i.e., the generative model perfectly replicating the data distribution.
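
The identity $C(G) = -\log(4) + 2 \cdot JSD(p_{data} \| p_g)$ can be verified numerically for discrete distributions, where the expectations become finite sums. The two four-point distributions below are arbitrary examples, and the helper names are assumptions for this sketch.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def c_of_g(p_data, p_g):
    """C(G) from Eq. 4, with the optimal discriminator of Eq. 2 plugged in."""
    d_star = p_data / (p_data + p_g)
    return float(np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star)))

p_data = np.array([0.1, 0.2, 0.3, 0.4])
p_g = np.array([0.4, 0.3, 0.2, 0.1])

lhs = c_of_g(p_data, p_g)                      # direct evaluation of Eq. 4
rhs = -np.log(4.0) + 2.0 * jsd(p_data, p_g)    # right-hand side of Eq. 6
c_at_optimum = c_of_g(p_data, p_data)          # equals -log(4) when p_g = p_data
```

The two sides agree up to floating-point error, and any mismatch between `p_g` and `p_data` pushes $C(G)$ above $-\log 4$.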

4.2 Convergence of Algorithm 1

Proposition 2. If $G$ and $D$ have enough capacity, and at each step of Algorithm 1 the discriminator is allowed to reach its optimum given $G$, and $p_g$ is updated so as to improve the criterion $\mathbb{E}_{x \sim p_{data}}[\log D_G^*(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D_G^*(x))]$, then $p_g$ converges to $p_{data}$.

Proof. Consider $V(G, D) = U(p_g, D)$ as a function of $p_g$, as done in the above criterion. Note that $U(p_g, D)$ is convex in $p_g$. The subderivatives of a supremum of convex functions include the derivative of the function at the point where the maximum is attained. In other words, if $f(x) = \sup_{\alpha \in A} f_\alpha(x)$ and $f_\alpha(x)$ is convex in $x$ for every $\alpha$, then $\partial f_\beta(x) \in \partial f$ if $\beta = \arg\sup_{\alpha \in A} f_\alpha(x)$. This is equivalent to computing a gradient descent update for $p_g$ at the optimal $D$ given the corresponding $G$. $\sup_D U(p_g, D)$ is convex in $p_g$ with a unique global optimum, as proven in Theorem 1; therefore, with sufficiently small updates of $p_g$, $p_g$ converges to $p_{data}$, concluding the proof.

In practice, adversarial nets represent a limited family of $p_g$ distributions via the function $G(z;\theta_g)$, and we optimize $\theta_g$ rather than $p_g$ itself, so the proofs do not apply. However, the excellent performance of multilayer perceptrons in practice suggests that they are a reasonable model to use despite their lack of theoretical guarantees.

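Table 1: Parzen window-based log-likelihood estimates. The numbers reported on MNIST are the mean log-likelihood of samples on the test set, with the standard error of the mean computed across examples. On TFD, we computed the standard error across folds of the dataset, with a different $\sigma$ chosen using the validation set of each fold. On TFD, $\sigma$ was cross-validated on each fold and the mean log-likelihood on each fold was computed. For MNIST, we compare against other models of the real-valued (rather than binary) version of the dataset.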

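5. Experiments

We trained adversarial nets on a range of datasets including MNIST, the Toronto Face Database (TFD), and CIFAR-10. The generator nets used a mixture of rectifier linear activations and sigmoid activations, while the discriminator net used maxout activations. Dropout was applied in training the discriminator net. While our theoretical framework permits the use of dropout and other noise at intermediate layers of the generator, we used noise as the input to only the bottommost layer of the generator network.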

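We estimate the probability of the test set data under $p_g$ by fitting a Gaussian Parzen window to the samples generated with $G$ and reporting the log-likelihood under this distribution. The $\sigma$ parameter of the Gaussians was obtained by cross-validation on the validation set. This procedure was introduced in Breuleux et al. and has been used for various generative models for which the exact likelihood is not tractable. Results are reported in Table 1. This method of estimating the likelihood has somewhat high variance and does not perform well in high-dimensional spaces, but it is the best method available to our knowledge. Advances in generative models that can sample but not estimate likelihood directly motivate further research into how to evaluate such models.

In Figures 2 and 3 we show samples drawn from the generator net after training. While we make no claim that these samples are better than samples generated by existing methods, we believe that these samples are at least competitive with the better generative models in the literature and highlight the potential of the adversarial framework.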
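
The Parzen-window evaluation described above can be sketched as follows. This is an illustrative implementation under assumptions, not the authors' evaluation code: it uses an isotropic Gaussian kernel with a single bandwidth `sigma`, and the synthetic arrays stand in for generator samples and test data. In the procedure described in the text, `sigma` would be picked by cross-validation on a validation set, a step this sketch omits.

```python
import numpy as np

def parzen_log_likelihood(test_x, gen_samples, sigma):
    """Mean log-density of test_x under an isotropic Gaussian Parzen window.

    test_x: (n_test, d) array; gen_samples: (n_gen, d) array drawn from G.
    """
    n_gen, d = gen_samples.shape
    diffs = test_x[:, None, :] - gen_samples[None, :, :]
    exponents = -0.5 * np.sum(diffs ** 2, axis=-1) / sigma ** 2   # (n_test, n_gen)
    # Log-sum-exp over generated samples, shifted by the row maximum for stability.
    row_max = exponents.max(axis=1, keepdims=True)
    log_sum = row_max[:, 0] + np.log(np.exp(exponents - row_max).sum(axis=1))
    log_norm = np.log(n_gen) + 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    return float(np.mean(log_sum - log_norm))

rng = np.random.default_rng(0)
gen_samples = rng.normal(0.0, 1.0, size=(2000, 2))   # stand-in for samples from G
near = rng.normal(0.0, 1.0, size=(200, 2))           # test points from the same distribution
far = rng.normal(4.0, 1.0, size=(200, 2))            # test points from a shifted distribution

ll_near = parzen_log_likelihood(near, gen_samples, sigma=0.3)
ll_far = parzen_log_likelihood(far, gen_samples, sigma=0.3)
```

Test points drawn from the same distribution as the generated samples receive a higher estimated log-likelihood than points from the shifted distribution.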

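Figure 2: Visualization of samples from the model. The rightmost column shows the nearest training example of the neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these images show actual samples from the model distributions, not conditional means given samples of hidden units. Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator and "deconvolutional" generator)

Figure 3: Digits obtained by linearly interpolating between coordinates in $z$ space of the full model.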
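
The interpolation behind Figure 3 amounts to walking along a straight line between two noise vectors and decoding each point with the generator. The sketch below shows only the interpolation step. The 100-dimensional noise vectors and the helper `interpolate_z` are assumptions for illustration, and no trained generator is included.

```python
import numpy as np

def interpolate_z(z_start, z_end, num_steps):
    """Return num_steps points on the straight line from z_start to z_end."""
    t = np.linspace(0.0, 1.0, num_steps)[:, None]
    return (1.0 - t) * z_start + t * z_end

rng = np.random.default_rng(0)
z_a, z_b = rng.normal(size=100), rng.normal(size=100)   # hypothetical noise vectors
path = interpolate_z(z_a, z_b, num_steps=8)             # shape (8, 100)
# Each row of path would then be passed through the trained generator G.
```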

6. Advantages and disadvantages

This new framework comes with advantages and disadvantages relative to previous modeling frameworks. The disadvantages are primarily that there is no explicit representation of $p_g(x)$, and that $D$ must be synchronized well with $G$ during training (in particular, $G$ must not be trained too much without updating $D$, in order to avoid “the Helvetica scenario” in which $G$ collapses too many values of $z$ to the same value of $x$ to have enough diversity to model $p_{data}$), much as the negative chains of a Boltzmann machine must be kept up to date between learning steps. The advantages are that Markov chains are never needed, only backprop is used to obtain gradients, no inference is needed during learning, and a wide variety of functions can be incorporated into the model. Table 2 summarizes the comparison of generative adversarial nets with other generative modeling approaches. The aforementioned advantages are primarily computational. Adversarial models may also gain some statistical advantage from the generator network not being updated directly with data examples, but only with gradients flowing through the discriminator. This means that components of the input are not copied directly into the generator's parameters. Another advantage of adversarial networks is that they can represent very sharp, even degenerate distributions, while methods based on Markov chains require that the distribution be somewhat blurry in order for the chains to be able to mix between modes.

Table 2: Challenges in generative modeling: a summary of the difficulties encountered by different approaches to deep generative modeling for each of the major operations involving a model.

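7. Conclusions and future work

This framework admits many straightforward extensions:

1: A conditional generative model $p(x \mid c)$ can be obtained by adding $c$ as input to both $G$ and $D$.
2: Learned approximate inference can be performed by training an auxiliary network to predict $z$ given $x$.
3: One can approximately model all conditionals $p(x_S \mid x_{\not S})$, where $S$ is a subset of the indices of $x$, by training a family of conditional models that share parameters. Essentially, one can use adversarial nets to implement a stochastic extension of the deterministic MP-DBM.
4: Semi-supervised learning: features from the discriminator or inference net could improve the performance of classifiers when limited labeled data is available.
5: Efficiency improvements: training could be accelerated greatly by devising better methods for coordinating $G$ and $D$ or by determining better distributions to sample $z$ from during training.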
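
As one possible reading of the first extension in the list above, the condition $c$ can be appended to the inputs of both networks. The text only states that $c$ is added as input. The one-hot encoding, the concatenation, the array shapes, and the helper names below are assumptions made for this sketch.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class labels as one-hot rows."""
    return np.eye(num_classes)[labels]

def conditional_inputs(z, x, labels, num_classes):
    """Append the condition c to the inputs of both G and D."""
    c = one_hot(labels, num_classes)
    g_input = np.concatenate([z, c], axis=1)   # G sees (z, c)
    d_input = np.concatenate([x, c], axis=1)   # D sees (x, c)
    return g_input, d_input

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))                    # hypothetical noise batch
x = rng.normal(size=(4, 16))                   # hypothetical data batch
labels = np.array([0, 2, 1, 2])
g_input, d_input = conditional_inputs(z, x, labels, num_classes=3)
```

Both networks then keep their usual roles, with the discriminator judging whether $x$ is plausible for the given $c$.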

Published: 2024-03-08 19:00:27

Article link: https://www.elefans.com/category/jswz/34/1721937.html