[Study Notes (0)] Variational Autoencoder

These are my study notes on the Variational Autoencoder (VAE), taken while reading the introductions on several sites.

  1. Variational Autoencoders Explained -.html
    Very detailed, but I did not come away with a clear overall picture.
  2. Generative Modeling: What is a Variational Autoencoder (VAE)? - /
    Helps organize the overall framework.
  3. ★ ★ ★ Understanding Variational Autoencoders (VAEs) -
    Strongly recommended!!! Extremely well written and detailed; the mathematical derivations are thorough, which makes the construction of the overall framework very clear, and it also explains how VAEs evolved from dimensionality reduction, Bayesian inference, and autoencoders.

Contents

  • Ⅰ Variational Autoencoders Explained
      • Generative models
      • The challenge of modeling images
      • Latent space
      • The wonders of latent space
      • Deep learning to the rescue
      • So how do we train this beast?
      • Let’s take a shortcut!
      • Variational Inference
      • The reparameterization trick
      • Connecting the dots
  • Ⅱ Generative Modeling: What is a Variational Autoencoder (VAE)?
  • Ⅲ Understanding Variational Autoencoders (VAEs)

Ⅰ Variational Autoencoders Explained

Link to Variational Autoencoders Explained: .html

Generative models

  1. A VAE is a generative model: it estimates the Probability Density Function (PDF) of the training data.
  2. ★ A VAE can sample from the learned PDF and thereby generate new samples that resemble the original dataset.
    The MNIST dataset is used for the explanations below.

The challenge of modeling images

  1. Pixels are correlated rather than independent, so we cannot learn a PDF for each pixel separately and sample it independently; this makes the learning problem harder.

Latent space

  1. The VAE tries to model the process that generates the images:
  • Given an image $x$, try to find at least one latent vector that can describe it
  • i.e. find a vector that contains the instructions for generating $x$
  2. From point 1 and the law of total probability, we obtain the formula $P(x)=\int P(x|z)P(z)\,dz$. There is some intuition hidden in this formula:
    • The integral sign means the search for latent vectors ranges over the entire latent space
    • For every possible $z$ we ask: can $x$ be generated from $z$? Is $P(x|z)$ large enough?
    • And is $z$ itself plausible? Is $P(z)$ large enough?
  3. The training objective of the VAE is to maximize $P(x)$
  4. We model $P(x|z)$ with a multivariate Gaussian $N(f(z), \sigma^2 \cdot I)$ (see the sketch after this list):
    • $f(z)$ is modeled by a neural network; after training, $f$ is used to generate new images
    • $\sigma$ is a hyperparameter that scales the identity matrix $I$
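
To make this concrete, here is a minimal PyTorch sketch (my own illustration, not code from the article) of modeling $P(x|z)$ as $N(f(z), \sigma^2 \cdot I)$; the layer sizes, the 2-dimensional latent space, and $\sigma = 0.1$ are arbitrary assumptions:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Neural network f(z): maps a latent vector z to the mean of P(x|z)."""
    def __init__(self, latent_dim=2, hidden_dim=256, data_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, data_dim),
        )

    def forward(self, z):
        return self.net(z)  # f(z): the per-pixel means of the Gaussian over image space

def log_p_x_given_z(x, f_z, sigma=0.1):
    """log N(x; f(z), sigma^2 * I), up to an additive constant that does not depend on z."""
    return -0.5 * ((x - f_z) ** 2).sum(dim=-1) / sigma**2
```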

The wonders of latent space

This latent-space approach has two main problems:

  1. What information does each dimension encode? The dimensions learned on one dataset do not carry over to a new dataset.
  2. The latent space might be entangled, i.e. the dimensions might be correlated.

Deep learning to the rescue

  1. Any distribution can be generated by applying a sufficiently complicated function to a standard multivariate Gaussian.
  2. Here we choose $P(z)$ to be a standard multivariate Gaussian; $f$ is modeled by a neural network and can be thought of as two parts:
    • The first layers map the Gaussian onto the true distribution over the latent space
    • The remaining layers map from the latent space to $P(x|z)$

So how do we train this beast?

The formula for $P(x)$ is intractable, so we approximate it with a Monte Carlo method:

  1. Sample $\{z_i\}_{i=1}^n$ from the prior $P(z)$
  2. Approximate $P(x) \approx \frac{1}{n} \sum_{i=1}^n P(x|z_i)$

However, since $x$ is high-dimensional, a huge number of samples is needed to get a reasonable approximation. We also need $P(x|z)$ to assign strictly positive values so that gradients can propagate; a sketch of this naive estimator follows.
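
For illustration only (again my own sketch, reusing the hypothetical `Decoder` and `log_p_x_given_z` from above), a naive Monte Carlo estimator of $\log P(x)$ might look like this; the log-sum-exp is just for numerical stability:

```python
import math
import torch

def naive_log_px(x, decoder, n_samples=10_000, latent_dim=2, sigma=0.1):
    """Estimate log P(x) ~ log( (1/n) * sum_i P(x|z_i) ) with z_i drawn from the prior N(0, I)."""
    z = torch.randn(n_samples, latent_dim)                         # 1. sample z_i ~ P(z)
    log_pxz = log_p_x_given_z(x.unsqueeze(0), decoder(z), sigma)   # 2. log P(x | z_i) for each sample
    # 3. average in probability space, computed stably in log space
    return torch.logsumexp(log_pxz, dim=0) - math.log(n_samples)
```

Because $x$ has 784 dimensions, almost every randomly drawn $z_i$ gives a vanishingly small $P(x|z_i)$, which is exactly the problem the next section addresses.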

Let’s take a shortcut!

If we sample $z$ at random, most of the sampled values will not contribute anything to $P(x)$.

So we introduce $Q(z|x)$. $Q$ assigns high probability to the $z$ values that are likely to have generated $x$, so the Monte Carlo approximation needs far fewer samples when they are drawn from $Q$.

But this creates a new problem: instead of maximizing $P(x)=\int P(x|z)P(z)\,dz=E_{z \sim P(z)}P(x|z)$, we are now maximizing $E_{z \sim Q(z|x)}P(x|z)$. How are the two related?

Variational Inference

The two objectives are connected by the following equation:
$$\log P(x) - KL[Q(z|x)\,||\,P(z|x)] = E_{z \sim Q(z|x)}[\log P(x|z)] - KL[Q(z|x)\,||\,P(z)]$$
where KL is the Kullback-Leibler divergence, which measures how similar two distributions are.

If we maximize the right-hand side of the equation, the left-hand side is maximized as well:
- $P(x)$ is maximized
- the distance between $Q(z|x)$ and $P(z|x)$ (the unknown true posterior) is minimized

The intuition behind the right-hand side is:
- 1. On one hand, we want to maximize how accurately $x$ is decoded from $z \sim Q$
- 2. On the other hand, we want $Q(z|x)$ (the encoder) to be as similar as possible to the prior $P(z)$. This term can be seen as a regularizer

$Q$ is modeled by a neural network whose outputs are the parameters of a multivariate Gaussian:

  • the mean $\mu_Q$
  • the diagonal covariance matrix $\Sigma_Q$ (with these parameters, the KL term above has a closed form; see the aside after this list)
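
As an aside (not from the article): for a diagonal Gaussian $Q(z|x)=N(\mu_Q, \Sigma_Q)$ and a standard normal prior $P(z)$, the regularization term $KL[Q(z|x)\,||\,P(z)]$ has a well-known closed form. A small sketch, assuming the network outputs $\mu$ and $\log \sigma^2$ rather than $\Sigma_Q$ directly:

```python
import torch

def kl_q_to_prior(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over the latent dimensions.

    Closed form: 0.5 * sum( exp(logvar) + mu^2 - 1 - logvar ).
    """
    return 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=-1)
```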

The decoder side is trickier, though: sampling $z$ from $Q$ does not let gradients propagate back into $Q$ (sampling is not a differentiable operation), so the layers that output $\mu_Q$ and $\Sigma_Q$ would never have their parameters updated.

The reparameterization trick

We can replace sampling from $Q$ with a deterministic parameterized transformation of a parameterless random variable:

  1. Sample from a standard (parameterless) Gaussian
  2. Multiply the sample by the square root of $\Sigma_Q$
  3. Add $\mu_Q$ to the result of step 2

The result will have a distribution equal to $Q$.
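
In code, the trick is just these three steps (a minimal sketch, assuming the encoder outputs $\mu_Q$ and the log of the diagonal of $\Sigma_Q$):

```python
import torch

def reparameterize(mu, logvar):
    """Return z ~ N(mu, diag(exp(logvar))) as a deterministic transform of standard noise."""
    eps = torch.randn_like(mu)       # 1. sample from a standard (parameterless) Gaussian
    std = torch.exp(0.5 * logvar)    # 2. multiply by the square root of the diagonal covariance
    return mu + std * eps            # 3. add the mean
```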

This way gradients can propagate through $Q$, because there is now a deterministic path. The model can therefore tune the parameters it outputs so that $Q$ concentrates on the good $z$ values that are able to produce $x$.

Connecting the dots

To summarize, here is how the pieces of a VAE fit together:

Left side of the figure (the model; a sketch in code follows the list below):

  1. The input image is fed into the encoder network
  2. The encoder outputs the parameters $\mu_Q$ and $\Sigma_Q$ of $Q(z|x)$
  3. A latent vector $z$ is sampled from $Q(z|x)$ (if the encoder has learned well, the sampled $z$ will most likely contain the information that describes $x$)
  4. The decoder decodes $z$ into an image
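
A minimal PyTorch sketch of this left-hand path (my own illustration under assumed choices, not the article's code; 784-pixel MNIST-style inputs and a 2-dimensional latent space are arbitrary):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, data_dim=784, hidden_dim=256, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(data_dim, hidden_dim), nn.ReLU())
        self.to_mu = nn.Linear(hidden_dim, latent_dim)       # outputs mu_Q
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)   # outputs log of diag(Sigma_Q)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, data_dim), nn.Sigmoid(),   # pixel intensities in [0, 1]
        )

    def forward(self, x):
        h = self.encoder(x)                                       # 1. image goes into the encoder
        mu, logvar = self.to_mu(h), self.to_logvar(h)             # 2. parameters of Q(z|x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # 3. reparameterized sample of z
        x_hat = self.decoder(z)                                   # 4. decode z into an image
        return x_hat, mu, logvar
```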

Right side of the figure (the loss; a sketch follows this list):

  1. Reconstruction error: the output should be similar to the input
  2. $Q(z|x)$ should be similar to the prior $P(z)$, a multivariate standard Gaussian
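
Combining the two terms (again only an illustrative sketch: binary cross-entropy is one common reconstruction loss when pixels are treated as values in $[0,1]$, even though the article itself describes a Gaussian likelihood; the KL term is the closed form shown earlier):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar):
    # 1. Reconstruction error: the output should be similar to the input
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # 2. KL( Q(z|x) || P(z) ): keep the encoder close to the standard normal prior
    kl = 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar)
    return recon + kl
```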

To generate new images that are not in the dataset, we simply sample a latent vector from the prior and decode it.
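
For example, using the hypothetical `VAE` class sketched above:

```python
import torch

model = VAE()                          # assumed to have been trained already
with torch.no_grad():
    z = torch.randn(16, 2)             # sample 16 latent vectors from P(z) = N(0, I)
    new_images = model.decoder(z)      # decode them into 16 new MNIST-like images
```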

The follow-up post Variational Autoencoders Explained in Detail contains code for a VAE.
I also recommend reading the paper Tutorial on Variational Autoencoders (.05908), which is more detailed and highly cited (1606 citations).

Ⅱ Generative Modeling: What is a Variational Autoencoder (VAE)?

Link to Generative Modeling: What is a Variational Autoencoder (VAE)?: /

  1. Unsupervised learning means we’re not trying to map inputs to targets, but rather we’re trying to learn the structure of our inputs.

  2. We can think of $p(z)$ as the prior probability that any $x$ belongs to a certain category. $p(z)$ is a categorical or discrete distribution, and it tells us which category an $x$ is likely to belong to, without looking at any $x$.

  3. The definition of a VAE given in this article: A variational autoencoder (VAE) is a type of neural network that learns to reproduce its input, and also maps data to latent space.

  4. Now what does “variational” mean?
    Variational refers to variational inference or variational Bayes. These techniques fall into the category of Bayesian machine learning.
    To summarize, variational autoencoders combine autoencoders with variational inference.

  5. Unlike a traditional autoencoder, at the end of a VAE's encoder we do not get a single value but a distribution, or more precisely the parameters of a distribution.

  6. Recall that Bayesian machine learning is all about learning distributions instead of learning point estimates. So instead of finding $z$, we are finding $Q(z)$, which tells us the PDF of $z$.

  7. To summarize, the output of the decoder represents a probability distribution. And from this distribution we can generate samples.

  8. In prior predictive sampling, if we sample a $z$ from the standard normal distribution $N(0,1)$ and pass it through the decoder, we obtain a new data point that closely resembles the training data; this is called a prior predictive sample.
    Why does this work?
    The key for this method is that we build our optimization algorithm in a way that encourages the encoder to map the training data around the standard normal distribution.

  9. The objective function we want to optimize is called the “ELBO”, or the evidence lower bound:
    In statistics, the evidence lower bound (ELBO, also variational lower bound) is the difference between the distribution of a latent variable and the distribution of the respective observed variable.

  10. The cost function of a VAE is the combination of two terms: the expected log likelihood and the KL-divergence. Expected log-likelihood is responsible for the reconstruction penalty, and KL divergence is responsible for the regularization penalty.

Ⅲ Understanding Variational Autoencoders (VAEs)

  1. Intuitively, the overall autoencoder architecture (encoder+decoder) creates a bottleneck for data that ensures only the main structured part of the information can go through and be reconstructed.
    This is similar to methods such as SVD and PCA, which keep only the leading components.

  2. If the latent space is regular enough (well “organized” by the encoder during the training process), we could take a point randomly from that latent space and decode it to get new content (i.e. generate a new sample). The decoder would then act more or less like the generator of a Generative Adversarial Network.

  3. Variational autoencoder can be defined as being an autoencoder whose training is regularised to avoid overfitting and ensure that the latent space has good properties that enable generative process.

  4. The reason why an input is encoded as a distribution with some variance instead of a single point is that it makes it possible to express the latent space regularisation very naturally: the distributions returned by the encoder are enforced to be close to a standard normal distribution.

  5. Why is it a normal distribution here rather than only an approximation? The explanation appears later in the math-detail part:
    $P(z)$ is $N(0,1)$; it is $P(z|x)$ that plays the role of the $P(z)$ in the first article.

  6. VAEs are autoencoders that encode inputs as distributions instead of points and whose latent space “organisation” is regularised by constraining distributions returned by the encoder to be close to a standard Gaussian.

  7. Because the original formula involves an integral, which is intractable in practice, we use variational inference for the approximation; this also makes the model more general and robust.
    A Gaussian distribution $Q$ is used to approximate $P(z|x)$, and we want the difference between the two (their KL divergence) to be as small as possible.

  8. What is variational inference? (A toy sketch of this idea follows the list below.)
    In statistics, variational inference (VI) is a technique to approximate complex distributions. The idea is to set a parametrised family of distribution (for example the family of Gaussians, whose parameters are the mean and the covariance) and to look for the best approximation of our target distribution among this family. The best element in the family is the one that minimises a given approximation error measurement (most of the time the Kullback-Leibler divergence between approximation and target) and is found by gradient descent over the parameters that describe the family.

  9. OVERALL ARCHITECTURE OF VAE
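
As a toy illustration of the variational-inference recipe from point 8 (entirely my own sketch, not from any of the articles): approximate a fixed target Gaussian with a parametrised Gaussian by minimising a Monte Carlo estimate of the KL divergence with gradient descent.

```python
import torch

# Target distribution we pretend is "complex": N(3.0, 0.5^2)
target = torch.distributions.Normal(3.0, 0.5)

# Parametrised family: Gaussians N(mu, exp(log_std)^2); we search over mu and log_std
mu = torch.zeros(1, requires_grad=True)
log_std = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_std], lr=0.05)

for _ in range(2000):
    q = torch.distributions.Normal(mu, log_std.exp())
    z = q.rsample((256,))                                # reparameterised samples, so gradients flow
    kl = (q.log_prob(z) - target.log_prob(z)).mean()     # Monte Carlo estimate of KL(q || target)
    opt.zero_grad()
    kl.backward()
    opt.step()

print(mu.item(), log_std.exp().item())                   # should approach 3.0 and 0.5
```

This is exactly the quoted recipe: choose a family (here Gaussians), measure the approximation error with the KL divergence, and find the best member of the family by gradient descent over its parameters.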
