MIT 6.041 Probabilistic Systems Analysis and Applied Probability: Summary

Updated: 2024-10-07 08:24:45

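Introduction

Course homepage: Probabilistic Systems Analysis and Applied Probability

The University of Texas at Dallas has compiled a summary of several important probability distributions, with an example, the expectation, and the variance for each; it is a useful reference.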

Probability Models and Axioms

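The sample space, written $\Omega$, is the set of all possible outcomes. The outcomes in this set must satisfy two requirements:

Mutually exclusive: when the experiment ends, exactly one outcome has occurred.
Collectively exhaustive: whatever happens, the outcome is an element of the set.

The figure below shows an example of a discrete, finite sample space: two consecutive rolls of a four-sided die:

Next, the definition of an event: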

Event is a subset of the sample space, probability is assigned to events.

Below are the three simple but most important axioms; many other theorems can be derived from them.

Nonnegativity: $P(A) \ge 0$
Normalization: $P(\Omega) = 1$
Additivity: if $A \cap B = \emptyset$, then $P(A \cup B) = P(A) + P(B)$

From these three axioms we can derive some important consequences:

$P(A) \le 1$
$P(\emptyset) = 0$
$P(A)+P(A^c) = 1$
$A \cup A^c = \Omega$
$A \cap A^c = \emptyset$
If $A\subset B$, then $P(A) \le P(B)$
$P(A \cup B)=P(A)+P(B)-P(A\cap B)$
$P(A \cup B)\le P(A)+P(B)$
$P(A \cup B \cup C)=P(A)+P(A^c\cap B)+P(A^c\cap B^c \cap C)$
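
The consequences listed above can be checked by brute force on a small sample space. The sketch below uses the two rolls of a four-sided die and assumes, purely for illustration, that all 16 outcomes are equally likely; the events `A`, `B`, and `C` are hypothetical choices, not taken from the lecture.

```python
from fractions import Fraction
from itertools import product

# Sample space: two rolls of a four-sided die.
# Assumption for this sketch: all 16 outcomes are equally likely.
omega = set(product(range(1, 5), repeat=2))

def prob(event):
    """Probability of an event (a subset of omega) under the uniform law."""
    return Fraction(len(event & omega), len(omega))

# Hypothetical events chosen for illustration.
A = {w for w in omega if w[0] == 1}          # first roll is 1
B = {w for w in omega if w[0] + w[1] >= 6}   # sum is at least 6
C = {w for w in omega if w[1] == 4}          # second roll is 4

# Complement rule: P(A) + P(A^c) = 1.
assert prob(A) + prob(omega - A) == 1
# Inclusion-exclusion for two events.
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)
# Union bound.
assert prob(A | B) <= prob(A) + prob(B)
# Disjoint decomposition of a three-way union.
lhs = prob(A | B | C)
rhs = prob(A) + prob((omega - A) & B) + prob((omega - A) & (omega - B) & C)
assert lhs == rhs
```

Exact fractions are used so that the identities hold with `==` rather than up to floating-point error.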

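The derivations are straightforward; interested readers can refer to pages 8, 9, 10, and 11 of the lecture slides. In class, the professor summarized the main steps for computing probabilities:

Specify the sample space
Specify a probability law
Identify an event of interest
Calculate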

A probability law in principle specifies the probability of every event, and there’s nothing else to do. But quite often the probability law will be given in some implicit manner, for example, by specifying the probabilities of only some of the events. In that case, you may have to do some additional work to find the probability of the particular event that you care about.

Conditioning and Bayes’ Rule

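This lecture introduces three important tools. Before we can understand them, we need to know what conditional probability is. It is defined as $P(A|B) = \frac{P(A \cap B)}{P(B)}$, assuming $P(B) > 0$; the left side of the figure below gives an intuitive sense of why it is defined this way.

Conditional probabilities also obey the axioms introduced earlier. For example, the third axiom becomes: if $A \cap C = \emptyset$, then $P(A \cup C | B) = P(A|B) + P(C|B)$. The figure below gives an intuitive sense of why this holds.

With conditional probability defined, I will now summarize the three important probability tools the professor introduced in class: the multiplication rule, the total probability theorem, and Bayes' rule.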

Multiplication rule

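Page 6 of the multiplication rule slides has an example that helps in understanding the multiplication rule. In fact, the multiplication rule is just a rearrangement of the definition of conditional probability. The right half of the figure below is a sequential description, which makes the rule more intuitive.

The figure above shows only three events, but the rule generalizes to more events, giving the general formula:

$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) \prod_{i=2}^n P(A_i | A_1 \cap \cdots \cap A_{i-1})$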

Total probability

The total probability theorem uses a divide-and-conquer strategy. In the example in the figure below, the sample space is partitioned into three subspaces $A_1, A_2, A_3$. We know the probability of each subspace, $P(A_1), P(A_2), P(A_3)$, and we also know the probability of event B within each subspace, i.e. the conditional probabilities $P(B|A_1), P(B|A_2), P(B|A_3)$. The total probability theorem then gives the probability of B: $P(B)=P(A_1)P(B|A_1)+P(A_2) P(B|A_2)+P(A_3)P(B|A_3)$

Note that the sum of the probabilities of the different scenarios is of course equal to 1; in the example above, $P(A_1)+P(A_2)+ P(A_3)=1$. As a matter of fact, the probability above is a weighted average of the conditional probabilities of event B. In words, the probability that an event occurs is a weighted average of the probability that it has under each possible scenario, where the weights are the probabilities of the different scenarios.

The example above splits the sample space into only three subspaces, but we can partition it into more (with $\sum_i P(A_i)=1$), which gives the general formula:

$P(B)=\sum_i P(A_i)P(B|A_i)$

Bayes’ rule

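In the total probability theorem, we know the probability of each possible scenario and the probability of event B under each scenario, and from these we obtain the probability of B. Bayes' rule works in the "opposite" direction: we know the probability of B and the probability that a scenario occurs together with B, and we want to find how likely that scenario is given that B occurred. Written as a formula:

$P(A_i|B)=\frac{P(A_i\cap B)}{P(B)}$

Replacing the numerator using the multiplication rule and the denominator using the total probability theorem gives Bayes' rule:

$P(A_i | B) = \frac{P(A_i) P(B | A_i)}{\sum_j P(A_j) P(B | A_j)}$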
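
As a small worked instance of the total probability theorem and Bayes' rule, the sketch below uses three scenarios. The numbers in `prior` and `likelihood` are made up for illustration and do not come from the lecture.

```python
from fractions import Fraction

# Hypothetical numbers, chosen only for illustration.
prior = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]        # P(A_i), sums to 1
likelihood = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]  # P(B | A_i)

def total_probability(prior, likelihood):
    """P(B) = sum_i P(A_i) P(B | A_i)."""
    return sum(p * l for p, l in zip(prior, likelihood))

def bayes(prior, likelihood):
    """Posterior P(A_i | B) for every scenario i."""
    p_b = total_probability(prior, likelihood)
    return [p * l / p_b for p, l in zip(prior, likelihood)]

p_b = total_probability(prior, likelihood)
posterior = bayes(prior, likelihood)
print(p_b, posterior)
```

With these numbers the third scenario starts as the least likely one but becomes the most likely after B is observed, because B is much more probable under it.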

Independence

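Two events are independent if the occurrence of one event does not change our beliefs about the other. It does not affect the probability that the other event also occurs. A fairly intuitive definition is therefore $P(B|A)=P(B)$. This definition has two drawbacks: (1) it is not symmetric, i.e. it does not state that $P(A|B)=P(A)$; (2) the conditional probability requires that the probability of the conditioning event be nonzero. The formal definition of independence is:

$P(A\cap B)=P(A) \cdot P(B)$

If A and B are independent, then A and $B^c$ are also independent. By the same argument, $A^c$ and $B$, and $A^c$ and $B^c$, are independent as well. Next, conditional independence, defined as follows:

Conditional independence, given C, is defined as independence under the probability law $P(\cdot | C)$. That is: $P(A\cap B|C)=P(A|C)\cdot P(B|C)$

Adding a condition can affect independence. For example, events A and B in the figure below are independent, but conditioning on event C destroys their independence: given that C occurred, if someone tells you that A occurred, you know for certain that B did not occur, so A and B are no longer independent. This example also shows that being independent is something completely different from being disjoint.

Next, the definition of independence for a collection of events: the events $A_1, \ldots, A_n$ are independent if, for every choice of distinct indices $i, j, \ldots, m$,

$P(A_i \cap A_j \cap \cdots \cap A_m) = P(A_i)P(A_j) \cdots P(A_m)$

Because this definition covers every subcollection, it includes pairwise independence. For example, if events A, B, and C are independent, then $P(A\cap B\cap C)=P(A)P(B)P(C)$ and every pair is independent as well, as shown in the figure below. The converse does not hold, i.e. pairwise independence does not imply independence.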
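
A standard way to see that pairwise independence does not imply independence is two fair coin tosses. The events below are my own choice of example and may differ from the one in the lecture figure; the sketch simply enumerates the four equally likely outcomes.

```python
from fractions import Fraction
from itertools import product

# Two fair coin tosses: four equally likely outcomes.
omega = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def both(e1, e2):
    """Intersection of two events."""
    return lambda w: e1(w) and e2(w)

A = lambda w: w[0] == "H"    # first toss is heads
B = lambda w: w[1] == "H"    # second toss is heads
C = lambda w: w[0] == w[1]   # the two tosses agree

# Every pair satisfies the product rule ...
pairwise = all(
    prob(both(e1, e2)) == prob(e1) * prob(e2)
    for e1, e2 in [(A, B), (A, C), (B, C)]
)
# ... but the triple does not: P(A and B and C) = 1/4, not 1/8.
triple = prob(both(A, both(B, C))) == prob(A) * prob(B) * prob(C)
print(pairwise, triple)
```

Knowing any two of the events determines the third, which is why the product rule fails for the triple.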

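Problems

1. A chess tournament problem:

There are three players (call them A, B, and C), and C is the reigning champion. In the first round, A and B play two games; one of them must win both games to advance to a second round against the champion, otherwise there is no second round and C remains champion. If someone advances, that player plays two games against C and must win both to take the title; otherwise C remains champion. For the specific probabilities asked, see the second tab of A Chess Tournament Problem, which includes a video explanation by a teaching assistant; you will see that drawing a tree diagram greatly simplifies the problem.

2. The Monty Hall problem

This problem comes from a well-known American game show. You are told that a prize is equally likely to be found behind any one of three closed doors in front of you. You point to one of the doors. A friend opens for you one of the remaining two doors, after making sure that the prize is not behind it. At this point, you can stick to your initial choice, or switch to the other unopened door. You win the prize if it lies behind your final choice of a door.

Which of the following three strategies gives the highest probability of winning?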

Stick to your initial choice
Switch to the other unopened door
You first point to door 1. If door 2 is opened, you do not switch. If door 3 is opened, you switch

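The first two strategies are easy to understand. Under strategy 1 you win exactly when the prize is behind the door you chose, which has probability 1/3. Under strategy 2 you win exactly when the prize is not behind the door you chose, which has probability 2/3. So strategy 2 is better than strategy 1. The probability for strategy 3 depends on how the friend chooses when the prize is behind door 1 and either remaining door could be opened: let p be the probability that the friend opens door 2 in that case. You then win with probability p/3 when the prize is behind door 1 and with probability 1/3 when it is behind door 2, for a total of $\frac{1}{3}p+\frac{1}{3}$. Whatever p is, this lies between 1/3 and 2/3, so strategy 2 is the best. In fact, any other strategy similar to strategy 3 gives a comparable result and cannot do better.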
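
The three win probabilities can be computed exactly by enumerating the scenarios. This is a sketch under the assumptions stated in the problem (you first point to door 1, the prize is uniformly placed); the function name `win_probabilities` and the scenario encoding are mine.

```python
from fractions import Fraction

def win_probabilities(p):
    """Exact win probabilities (stick, switch, mixed) when you first point to door 1.

    p is the probability that the friend opens door 2 when the prize is
    behind door 1, the only case in which the friend has a choice.
    """
    third = Fraction(1, 3)
    # Each scenario is (probability, prize door, opened door).
    scenarios = [
        (third * p, 1, 2),
        (third * (1 - p), 1, 3),
        (third, 2, 3),   # prize behind door 2: the friend must open door 3
        (third, 3, 2),   # prize behind door 3: the friend must open door 2
    ]

    def other(opened):
        # The unopened door that is not door 1 (2 -> 3, 3 -> 2).
        return 5 - opened

    stick = sum(pr for pr, prize, _ in scenarios if prize == 1)
    switch = sum(pr for pr, prize, opened in scenarios if prize == other(opened))
    # Strategy 3: stay if door 2 is opened, switch if door 3 is opened.
    mixed = sum(
        pr for pr, prize, opened in scenarios
        if prize == (1 if opened == 2 else other(opened))
    )
    return stick, switch, mixed

print(win_probabilities(Fraction(1, 2)))
```

Evaluating the function at p = 0 and p = 1 gives the two extremes of strategy 3, which is one way to confirm that it never beats always switching.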

Discrete random variables

The definition of a random variable:

A random variable, usually written $X$, is a variable whose possible values are numerical outcomes of a random phenomenon. Mathematical definition: A function from the sample space $\Omega$ to the real numbers.

A more intuitive explanation: every probabilistic experiment eventually produces an outcome, and a random variable is a numerical value associated with that outcome. For example, if the experiment is to pick a student at random from a classroom, then weight (X), height (Y), and age (Z) can all be random variables on this sample space. The example shows that a single sample space can carry several random variables. Furthermore, a function of one or several random variables is a new random variable; here, for instance, adding them up as X+Y+Z gives a new random variable.

Below is the definition of the probability mass function (PMF):

A probability mass function (pmf) is a function that gives the probability that a discrete random variable is exactly equal to some value.

The equation below is the notation for the PMF; it expresses the probability that the random variable $X$ takes the value $x$.

$p_X(x) = P(X = x) = P(\{\omega \in \Omega \text{ s.t. } X(\omega) = x\})$

For example, in the figure below: $p_X(5)=P(X=5)=1/2$

A PMF has the following two self-evident properties:

$p_X(x) \ge 0$
$\sum_x p_X(x) = 1$

Next, a summary of several common random variables: Bernoulli and indicator random variables, uniform random variables, binomial random variables, Poisson random variables, and geometric random variables.

Bernoulli and indicator random variables

Below is the definition of the Bernoulli distribution, which typically models a trial that results in success/failure, Heads/Tails, etc.

It is the probability distribution of a random variable which takes the value 1 with probability $p$ and the value 0 with probability $q=1-p$, i.e., the probability distribution of any single experiment that asks a yes-no question; the question results in a boolean-valued outcome, a single bit of information whose value is success/yes/true/one with probability $p$ and failure/no/false/zero with probability $q$.

Its PMF is: $p_X(1)=p$, $p_X(0)=1-p$.

The indicator random variable of an event A, written $I_A$, equals 1 if A occurs and 0 otherwise, so $p_{I_A}(1)=P(A)$.

Uniform random variables

Below is the definition of the discrete uniform distribution, which typically models a case where we have a range of possible values, and we have complete ignorance, no reason to believe that one value is more likely than the other.

The discrete uniform distribution is a symmetric probability distribution whereby a finite number of values are equally likely to be observed; every one of $n$ values has equal probability $1/n$.

Its PMF is: $p_X(x)=1/n$ for each of the $n$ possible values $x$.

Binomial random variables

This distribution is an "extended version" of the Bernoulli distribution introduced above: a single success/failure experiment is called a Bernoulli trial or Bernoulli experiment, and the number of successes in a sequence of independent Bernoulli trials follows the binomial distribution. It typically models the number of successes in a given number of independent trials.

Its PMF, for $n$ trials with success probability $p$, is: $p_X(k)=\binom{n}{k}p^k(1-p)^{n-k}$ for $k=0,1,\ldots,n$.

For the detailed definition, see Binomial distribution.

Poisson distribution

The following definition comes from Wikipedia:

The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.

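Here is an example to illustrate what constant rate and independently mean in the definition above. Suppose a book has, on average, 2 errors per 100 pages. In this example:

constant rate means that any other 100 pages of the book should also contain 2 errors on average, not some other value
independently means that the errors within the 100 pages do not influence one another

The example also shows that an interval need not be an interval of time; other interpretations are possible, and here every 100 pages is one interval.

The PMF of the Poisson distribution is:

$P(k \text{ events in an interval})=\frac{\lambda^k e^{-\lambda}}{k!}$

In the formula above, $\lambda$ is the average number of events per interval. Knowing the Poisson PMF, let us work through two examples.

Example 1: a book has, on average, 2 errors per 100 pages. If we pick 100 pages of the book at random, what is the probability that they contain no errors?

Answer: $P(X=0)=\frac{2^0 e^{-2}}{0!}=e^{-2}\approx 0.135$

If we pick 400 pages of the book at random, what is the probability that they contain no errors?

Answer: 400 pages make up 4 intervals, and every 100 pages contain 2 errors on average. Because the Poisson model has a constant rate, i.e. the same average number of errors in every interval, 400 pages contain $4\times 2=8$ errors on average, so $\lambda=8$. The probability is therefore: $P(X=0)=\frac{8^0 e^{-8}}{0!}=e^{-8}\approx 0.000335$

Example 2: a river floods on average once every 100 years. Assuming a Poisson model, what is the probability of k floods in a 100-year interval?

Answer: clearly $\lambda=1$, so $P(X=k)=\frac{1^k e^{-1}}{k!}=\frac{e^{-1}}{k!}$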
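
The numbers in the two examples can be reproduced with a few lines. This is a minimal sketch; `poisson_pmf` is a helper name I introduce here, not something from the course.

```python
import math

def poisson_pmf(k, lam):
    """P(k events in an interval) for a Poisson model with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Example 1: 2 errors per 100 pages on average.
p_100_pages = poisson_pmf(0, 2)   # no errors in 100 pages, lambda = 2
p_400_pages = poisson_pmf(0, 8)   # no errors in 400 pages, lambda = 4 * 2

# Example 2: one flood per 100 years on average, k = 0, 1, 2, 3 floods.
flood_pmf = [poisson_pmf(k, 1) for k in range(4)]

print(p_100_pages, p_400_pages, flood_pmf)
```

Note that the 400-page probability equals the 100-page probability raised to the fourth power, which is what independence across the four intervals predicts.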

Next, approximating the binomial distribution with the Poisson distribution, i.e. the Poisson limit theorem.

Consider a time interval and divide it into $n$ equally-sized subintervals. Suppose $n$ is very large so that either one or zero event can occur in a subinterval. Suppose further that the probability for an event to occur in a subinterval is $\frac{\lambda}{n}$, independent of what occurs in other subintervals. One way to understand $\frac{\lambda}{n}$: over the whole interval the average number of events is $\lambda$; if the interval is split into n subintervals, the average number of events per subinterval becomes $\frac{\lambda}{n}$. If n is large enough, the subintervals become so small that either one or zero event can occur in each, and $\frac{\lambda}{n}$ then becomes the probability that an event occurs in a subinterval.

Under these assumptions, the number of events in the interval, X, follows a binomial distribution, $X \sim B(n,\frac{\lambda}{n})$, so:

$P(X=k)=\binom{n}{k} (\frac{\lambda}{n})^k (1-\frac{\lambda}{n})^{n-k}$

As n tends to infinity, this binomial distribution becomes the Poisson distribution. The Poisson limit theorem:

$\lim_{n\rightarrow \infty}\binom{n}{k} (\frac{\lambda}{n})^k (1-\frac{\lambda}{n})^{n-k}=\frac{\lambda^k e^{-\lambda}}{k!}$
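
The limit can be observed numerically by comparing the binomial PMF with parameters $n$ and $\lambda/n$ against the Poisson PMF as $n$ grows. The values $\lambda = 2$ and $k = 3$ below are arbitrary choices for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam, k = 2.0, 3                 # arbitrary illustrative values
sizes = (10, 100, 1000, 10000)  # number of subintervals n

# Absolute gap between B(n, lam/n) and Poisson(lam) at the point k.
errors = [abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam)) for n in sizes]

for n, err in zip(sizes, errors):
    print(n, err)
```

The gap shrinks roughly by a factor of ten each time n grows by a factor of ten.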

Geometric random variables

Below is the definition of the geometric distribution, which typically models situations where we’re waiting for something to happen. For example: the number of coin tosses needed until the first head appears.

The geometric distribution is a probability distribution of the number $X$ of Bernoulli trials needed to get one success

The geometric distribution gives the probability that the first occurrence of success requires $k$ independent trials, each with success probability $p$. If the probability of success on each trial is $p$, then the probability that the kth trial (out of k trials) is the first success is $p_X(k)=P(X=k)=(1-p)^{k-1}p$ for $k=1,2,\ldots$; this formula is the PMF of the geometric distribution.

Expectation/mean of a random variable

The expectation of a discrete random variable $X$ is a weighted average of the possible values that the random variable can take. Unlike the sample mean of a group of observations, which gives each observation equal weight, the mean of a random variable weights each outcome $x_i$ according to its probability, $p_i$. The common symbol for the expectation (also known as the mean value of $X$) is $E[X]$, formally defined by

$E[X] = \sum\limits_{x} x\,p_X(x)$

For example, suppose the random variable $X$ has only three possible values, 1, 2, and 4, with probabilities 2/10, 5/10, and 3/10. Its mean is then $1\cdot 2/10+2\cdot 5/10+4\cdot 3/10=2.4$. Pages 12, 13, and 14 of the slides give several more examples.

Three basic properties of expectation:

If $X\ge0$, then $E[X]\ge 0$
If $a\le X\le b$, then $a\le E[X]\le b$
If $c$ is a constant, $E[c]=c$

Suppose we know the PMF of the random variable $X$; how do we find the expectation of the random variable $Y=g(X)$? The formula, known as the expected value rule, is given below; the proof is simple, see page 16 of the slides.

$E[Y] = E[g(X)] = \sum\limits_x g(x)p_X(x)$

When the function g is linear, i.e. $Y=aX+b$, the expectation is $E[Y]=E[aX+b]=aE[X]+b$. So for a linear function g we have $E[g(X)]=g(E[X])$, which can be proved with the formula above; this does not hold in general for nonlinear g.
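
The example PMF from this section (values 1, 2, 4 with probabilities 2/10, 5/10, 3/10) makes a convenient test case for the expected value rule. The sketch below checks linearity with the arbitrary choice $g(x)=3x+1$ and shows that $E[g(X)]=g(E[X])$ fails for $g(x)=x^2$; the helper name `expectation` is mine.

```python
from fractions import Fraction

# PMF from the example in the text.
pmf = {1: Fraction(2, 10), 2: Fraction(5, 10), 4: Fraction(3, 10)}

def expectation(pmf, g=lambda x: x):
    """Expected value rule: E[g(X)] = sum over x of g(x) * p_X(x)."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(pmf)                          # E[X]
linear = expectation(pmf, lambda x: 3 * x + 1)   # E[3X + 1]
square = expectation(pmf, lambda x: x * x)       # E[X^2]

print(mean, linear, square, mean ** 2)
```

Here `linear` equals `3 * mean + 1`, while `square` differs from `mean ** 2`.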

Variance of a random variable

The standard deviation $\sigma$ is the square root of the variance. The variance of a discrete random variable $X$ measures the spread, or variability, of the distribution, and is defined by

$var(X)=E[(X-\mu)^2]$, where $\mu=E[X]$

The previous subsection introduced the expected value rule, which lets us compute the variance from this definition. Let $g(x)=(x-\mu)^2$; then $var(X)=E[g(X)]= \sum\limits_x g(x)p_X(x)$

Variance has the following two important properties:

$var(aX+b)=a^2var(X)$
$var(X)=E[X^2]-(E[X])^2$

An intuitive explanation of the first property: adding a constant to a random variable shifts all of its values by the same amount, which does not change the variance. For proofs of these two properties, see page 3 of the slides; for the variances of the Bernoulli and uniform distributions, see pages 4 and 5.
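
Both variance properties can be verified on a small PMF. The sketch below reuses the PMF with values 1, 2, 4 from the expectation example; the constants $a=-3$ and $b=7$ are arbitrary, and the helper names are mine.

```python
from fractions import Fraction

pmf = {1: Fraction(2, 10), 2: Fraction(5, 10), 4: Fraction(3, 10)}

def expectation(pmf, g=lambda x: x):
    """Expected value rule: E[g(X)] = sum over x of g(x) * p_X(x)."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf):
    """var(X) = E[(X - mu)^2], computed from the definition."""
    mu = expectation(pmf)
    return expectation(pmf, lambda x: (x - mu) ** 2)

def affine(pmf, a, b):
    """PMF of aX + b; a must be nonzero so that distinct values stay distinct."""
    return {a * x + b: p for x, p in pmf.items()}

var_x = variance(pmf)
shortcut = expectation(pmf, lambda x: x * x) - expectation(pmf) ** 2  # E[X^2] - (E[X])^2
var_affine = variance(affine(pmf, -3, 7))                             # var(-3X + 7)

print(var_x, shortcut, var_affine)
```

The shift by 7 has no effect, and the factor -3 enters squared, so `var_affine` is nine times `var_x`.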

Conditioning a random variable

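The condition can be either an event or a random variable. Below I introduce the conditional PMF for each of these two cases, along with the corresponding theorems.

Conditional PMF and expectation, given an event

The PMF conditioned on an event A is no different from an ordinary PMF, except that probabilities are replaced by conditional probabilities: $p_{X|A}(x)=P(X=x|A)$. The comparison in the figure below makes this clear.

With conditioned random variables, we can write a version of the total probability theorem analogous to the one above: $p_X(x)=\sum_i P(A_i)p_{X|A_i}(x)$. Multiplying both sides by $x$ and summing over $x$ gives the total expectation theorem:

$E[X]=\sum_i P(A_i)E[X|A_i]$

For the conditioned geometric random variable and its expectation, see pages 13 and 14 of the slides.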

Conditional PMF and expectation, given a random variable

This kind of conditional PMF is no different from the one in the previous subsection, so I give the definition directly:

$p_{X|Y}(x|y)=P(X=x | Y=y)=\frac{p_{X,Y}(x,y)}{p_Y(y)}$

Its expectation and expected value rule are essentially the same as in the previous subsection, so here are the formulas:

$E[X|Y=y]=\sum_x x\,p_{X|Y}(x|y)$
$E[g(X)|Y=y]=\sum_x g(x)p_{X|Y}(x|y)$

The total probability and total expectation theorems are also the same as in the previous subsection:

$p_X(x)=\sum_y p_Y(y)p_{X|Y}(x|y)$
$E[X]=\sum_y p_Y(y)E[X|Y=y]$

Conditional PMFs involving more than two random variables are essentially the same as above; see page 3 of the slides.

Joint PMF of two random variables

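The joint PMF is defined as follows:

$p_{X,Y}(x,y)=P(X=x \text{ and } Y=y)$

The joint PMF also satisfies the following three formulas; the last two give the marginal PMFs: $\sum_x\sum_y p_{X,Y}(x,y)=1$, $p_X(x)=\sum_y p_{X,Y}(x,y)$, $p_Y(y)=\sum_x p_{X,Y}(x,y)$.

With the joint PMF, the expected value rule introduced above can be adapted as follows, where g(X,Y) is a function of the random variables X and Y and $p_{X,Y}(x,y)$ is their joint PMF.

$E[g(X,Y)]= \sum\limits_x\sum\limits_y g(x,y)p_{X,Y}(x,y)$

For the expectation of the sum of two random variables we have $E[X+Y]=E[X]+E[Y]$; for the proof, see page 18 of the slides. The formula is not limited to two random variables: $E[X_1+\cdots + X_n]=E[X_1]+\cdots + E[X_n]$. With it, the expectation of a binomial random variable is easy to compute; see page 20 of the slides. The joint PMF of three random variables is analogous and is not discussed further.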

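Independence of random variables

Independence of random variables is analogous to independence of events and has the same properties, so I give the definition directly:

$p_{X,Y}(x,y)=p_X(x)p_Y(y)$ for all $x$ and $y$

Under the special condition of independence, expectation and variance gain properties they did not have before. For expectation: $E[XY]=E[X]E[Y]$; for variance: $var[X+Y]=var[X]+var[Y]$. For the proofs, see pages 8 and 9 of the slides; for the variance of the binomial, see page 10.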
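
Both properties can be checked on a joint PMF built as a product of marginals. The sketch below assumes two independent rolls of a fair four-sided die, an illustrative choice of mine; `expect` and `var` are helper names introduced here.

```python
from fractions import Fraction
from itertools import product

# Marginal PMFs of two fair four-sided dice (illustrative assumption).
p_x = {v: Fraction(1, 4) for v in range(1, 5)}
p_y = dict(p_x)

# Independence: the joint PMF is the product of the marginals.
joint = {(x, y): p_x[x] * p_y[y] for x, y in product(p_x, p_y)}

def expect(joint, g):
    """E[g(X, Y)] = sum over (x, y) of g(x, y) * p_{X,Y}(x, y)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

def var(joint, g):
    """Variance of the random variable g(X, Y)."""
    mu = expect(joint, g)
    return expect(joint, lambda x, y: (g(x, y) - mu) ** 2)

e_x = expect(joint, lambda x, y: x)
e_y = expect(joint, lambda x, y: y)
e_xy = expect(joint, lambda x, y: x * y)
var_x = var(joint, lambda x, y: x)
var_y = var(joint, lambda x, y: y)
var_sum = var(joint, lambda x, y: x + y)

print(e_xy, e_x * e_y, var_sum, var_x + var_y)
```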

Continuous Random Variable

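A random variable $X$ is called continuous if there is a non-negative function $f_X$, called the probability density function of $X$, such that:

$P(X\in B)=\int_B f_X(x)dx$

for every subset $B$ of the real line. As the figure below shows, the PDF is the curve in the coordinate system, and the subset $B$ is the interval $[a,b]$. The definition also implies that any single point has probability 0. Although the PDF is used to compute probabilities, $f_X(x)$ is not the probability of any particular event, so it need not be less than or equal to 1.

The part underlined in red in the figure below explains the PDF well: for small $\delta$, $P(a \le X \le a+\delta) \approx f_X(a)\cdot\delta$. Readers who have studied calculus will recognize the flavor of a Riemann sum.

The figure below summarizes several important properties of the PDF:

Expectation and variance of a continuous random variable

The expectation of a continuous random variable $X$ is defined by

$E[X]=\int_{-\infty}^{\infty}xf_X(x)dx$

It is essentially the same as the definition for discrete random variables, with the PMF replaced by the PDF and the sum replaced by an integral. The definition of variance and its properties in the figure below likewise mirror the discrete case and need no further explanation.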

Cumulative distribution function

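The CDF can describe both discrete and continuous random variables. Its definition is:

The CDF of a random variable $X$ is denoted by $F_X$ and provides the probability $P(X \le x)$. In particular, for every x we have $F_X(x)=P(X\le x)=\sum_{k\le x}p_X(k)$ if $X$ is discrete, and $F_X(x)=\int_{-\infty}^x f_X(t)dt$ if $X$ is continuous.

From the definition, it is easy to see that the CDF has several properties, including: $F_X$ is monotonically nondecreasing, tends to 0 as $x\rightarrow -\infty$, and tends to 1 as $x\rightarrow \infty$.

If $X$ is continuous, the PDF and the CDF can be obtained from each other by integration or differentiation:

$F_X(x)=\int_{-\infty}^x f_X(t)dt$
$\frac{dF_X(x)}{dx}=f_X(x)$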

Normal random variable

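An important operation for this distribution is standardization, i.e. converting an arbitrary normal distribution into the standard normal distribution. The following formula does this:

$Y=\frac{X-\mu}{\sigma}$

How should we understand this formula? Subtracting the mean from every value of X gives a random variable with mean 0, since $E[X-\mu]=E[X]-\mu=0$. Dividing by the standard deviation then rescales the variance: $var(Y)=E\left[\left(\frac{X-\mu}{\sigma}\right)^2\right]=\frac{1}{\sigma^2}E[(X-\mu)^2]=\frac{\sigma^2}{\sigma^2}=1$. So the new random variable Y has mean 0 and variance 1. For more on the normal distribution, see page 23 of the link given in the introduction.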
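
Standardization is what allows any normal probability to be read off the standard normal CDF: $P(X\le x)=\Phi\left(\frac{x-\mu}{\sigma}\right)$. The sketch below checks this with `statistics.NormalDist` from the Python standard library; the parameters $\mu=10$, $\sigma=2$ and the point $x=13$ are hypothetical.

```python
from statistics import NormalDist

# Hypothetical parameters for illustration.
mu, sigma = 10.0, 2.0
x = 13.0

# P(X <= x) computed directly from the N(mu, sigma^2) distribution.
direct = NormalDist(mu, sigma).cdf(x)

# The same probability via standardization: z = (x - mu) / sigma,
# then look up the standard normal CDF at z.
z = (x - mu) / sigma
standardized = NormalDist(0.0, 1.0).cdf(z)

print(z, direct, standardized)
```

The two results agree up to floating-point rounding, which is why a single table of the standard normal CDF is enough for every normal distribution.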

Joint PDFs of multiple random variables

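For a single random variable with a known PDF, the probability of an interval is found by integrating along the one-dimensional line, with the interval endpoints as the limits of integration. The joint PDF of several random variables works the same way, except that the region of integration is no longer part of a line but part of the two-dimensional plane, so the integral becomes a double integral. The joint PDF of two random variables is defined as follows, where $f_{X,Y}$ is non-negative.

$P((X,Y)\in B)=\iint_{(x,y)\in B}f_{X,Y}(x,y)dxdy$

This subset B differs from the earlier one: before, B was a subset of the one-dimensional line, whereas now it is a subset of the two-dimensional plane. If B is the entire plane, the probability must equal 1, so $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty}f_{X,Y}(x,y)dxdy=1$. As a concrete example, the subset $B=\{(x,y)|a\le x\le b, c\le y \le d\}$ is a rectangle, and its probability is

$P(a\le X\le b, c\le Y \le d)=\int_c^d \int_a^b f_{X,Y}(x,y)dxdy$

As explained above, the PDF $f_X(x)$ can be interpreted as probability mass per unit length around x. How should the joint PDF $f_{X,Y}(a,c)$ be interpreted? In much the same way: as probability per unit area in the vicinity of (a,c). Integrating a single-variable PDF gives an area, whereas integrating a joint PDF gives a volume.

Given a joint PDF, how do we obtain the marginals? Just as a joint PMF is converted to marginals by summing, a joint PDF is converted by integrating: $f_X(x)=\int_{-\infty}^{\infty} f_{X,Y}(x,y)dy$ and $f_Y(y)=\int_{-\infty}^{\infty} f_{X,Y}(x,y)dx$. The figure below compares the two cases, with the joint-PDF-to-marginal formulas in red.

The joint CDF is defined as follows. Because x and y here denote specific values, I use two dummy variables, s and t.

$F_{X,Y}(x,y)=P(X\le x, Y\le y)=\int_{-\infty}^x \int_{-\infty}^y f_{X,Y}(s,t)dtds$

Likewise, the expected value rule still applies here: $E[g(X,Y)]=\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x,y)f_{X,Y}(x,y)dxdy$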

Conditioning

Similar to the case of discrete random variables, we can condition a random variable on an event or on another random variable, and define the concepts of conditional PDF and conditional expectation.

Conditioning a Random Variable on an Event

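The conditional PDF of a continuous random variable X, given an event A, is written $f_{X|A}$; conditional probabilities satisfy the formula below. The subset B is a subset of the real line; taking B to be the entire real line shows that the conditional PDF must satisfy $\int_{-\infty}^{\infty} f_{X|A}(x)dx=1$

$P(X\in B|A)=\int_B f_{X|A}(x)dx$

As the formula above shows, to compute a conditional probability for a continuous random variable we only need the conditional PDF. But if the conditioning event A is arbitrary, we cannot write down the conditional PDF in any specific form. So consider the special case where the conditioning event has the form $X\in A$ for a subset A of the real line; now the conditional PDF has a specific form, shown in the figure below. The conditional PDF is zero outside the conditioning set. Within the conditioning set, the conditional PDF has exactly the same shape as the unconditional one, except that it is scaled by the constant factor $\frac{1}{P(X\in A)}$. For why this is the factor, see page 3 of the slides.

So in this special case the conditional PDF is $f_{X|X\in A}(x)=\frac{f_X(x)}{P(X\in A)}$ if $x\in A$, and 0 otherwise. Knowing the conditional PDF, the integral formula above gives the corresponding conditional probability.

Joint PDFs have an analogous conditional PDF, as shown in the figure below: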

Conditioning one Random Variable on Another

The previous subsection showed how to find the conditional PDF when the condition is an event. This subsection defines the conditional PDF when the condition is another random variable. The definition is:

$f_{X|Y}(x|y)=\frac{f_{X,Y}(x,y)}{f_Y(y)}$

When the two random variables are independent, $f_{X,Y}(x,y)=f_X(x)f_Y(y)$, so the formula above becomes $f_{X|Y}(x|y)=f_X(x)$. Once we have the conditional PDF, the conditional probability follows from:

$P(X\in A|Y=y)=\int_A f_{X|Y}(x|y)dx$

In short, to find a probability for continuous random variables we must first find the relevant PDF and then integrate it over the given region.

Conditional Expectations

If you have understood the material above, there is little to add about conditional expectations; the content is essentially analogous. The professor's textbook gives a summary, which I have captured in the two figures below for reference.

With independence, expectation and variance gain the following additional properties:

If X and Y are independent, then $E[XY]=E[X]E[Y]$
Furthermore, for any functions g and h, the random variables g(X) and h(Y) are independent, and we have $E[g(X)h(Y)]=E[g(X)]E[h(Y)]$
If X and Y are independent, then $var[X+Y]=var[X]+var[Y]$

A picture is worth a thousand words

The figure below contains three plots. Let us number them first: the top plot is Figure 1, the bottom-left plot is Figure 2, and the bottom-right plot is Figure 3. Understanding this figure is very important; if you understand it, you understand the whole chapter. It shows the relationship joint PDF -> marginal PDF -> conditional PDF.

Figure 1 is the joint PDF, $f_{X,Y}(x,y)$. Slicing it along the variable $X$ gives the marginal PDF, $f_X(x)=\int_{-\infty}^{\infty} f_{X,Y}(x,y)dy$: each integral is the area of one slice, as shown in Figure 2. To obtain the conditional PDFs, we renormalize each slice so that its area is 1, which satisfies the normalization property; as shown in Figure 3, this gives the conditional PDF $f_{Y|X}(y|x)$

Derived Distributions

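This lecture develops a method for finding the distribution (PMF or PDF) of a function of one or more random variables with known distribution. That is, given the distribution of a random variable X, how do we find the distribution of Y = g(X)?

Discrete random variables

If the given random variable X is discrete, it is easy to find the distribution of Y = g(X). When g is linear, $Y=aX+b$ with $a \neq 0$, the formula is:

$p_Y(y)=p_X(\frac{y-b}{a})$

For an intuitive explanation, see pages 2 and 3 of the slides.

Continuous random variables

There is a general method for finding the distribution of Y=g(X) when the distribution of the continuous random variable X is known; the function g can be arbitrary and need not be linear. The method has two steps:

Find the CDF of Y: $F_Y(y)=P(Y\le y)=P(g(X)\le y)$
Differentiate the CDF: $f_Y(y)=\frac{d}{dy}F_Y(y)$

When g is linear, i.e. $Y=aX+b$ with $a \neq 0$, this method yields the formula below; for the detailed derivation, see page 5 of the slides.

$f_Y(y)=\frac{1}{|a|}f_X(\frac{y-b}{a})$

For nonlinear g, the professor gives two examples; see pages 8 and 9 of the slides.

When g is strictly monotonic, the method yields the formula below, where the function h is the inverse of g; for the detailed derivation and an example, see pages 10 and 11 of the slides.

$f_Y(y)=f_X(h(y))\left| \frac{d}{dy}h(y) \right|$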
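
The two-step CDF method and the monotonic formula can be compared numerically. The sketch below assumes, purely for illustration, that X is exponential with rate 1 and $Y=g(X)=X^2$, which is strictly increasing on $x\ge 0$ with inverse $h(y)=\sqrt{y}$; step 2 is approximated by a central finite difference.

```python
import math

# Assumption for this sketch: X ~ Exponential(1), Y = X^2.
def pdf_x(x):
    return math.exp(-x) if x >= 0 else 0.0

def cdf_x(x):
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def h(y):
    """Inverse of g(x) = x^2 on x >= 0."""
    return math.sqrt(y)

def cdf_y(y):
    """Step 1: F_Y(y) = P(X^2 <= y) = P(X <= sqrt(y)) = F_X(h(y)) for y > 0."""
    return cdf_x(h(y))

def pdf_y_numeric(y, eps=1e-6):
    """Step 2: differentiate the CDF (central finite difference)."""
    return (cdf_y(y + eps) - cdf_y(y - eps)) / (2 * eps)

def pdf_y_formula(y):
    """Monotonic formula: f_Y(y) = f_X(h(y)) * |dh/dy|, with dh/dy = 1 / (2 sqrt(y))."""
    return pdf_x(h(y)) * abs(1.0 / (2.0 * math.sqrt(y)))

for y in (0.5, 1.0, 2.0, 5.0):
    print(y, pdf_y_numeric(y), pdf_y_formula(y))
```

The two columns agree to several decimal places, as expected since the formula is just the chain rule applied to step 2.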

Covariance And Correlation

The covariance of two random variables is defined as follows:

$cov(X, Y)=E\left [ (X-E[X]) \cdot (Y-E[Y]) \right ]$

If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, i.e., the variables tend to show similar behavior, the covariance is positive; in the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, i.e., the variables tend to show opposite behavior, the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables.

From the definition it follows that when two random variables are independent, their covariance is 0. The converse does not hold: a covariance of 0 does not imply that the two random variables are independent. For a counterexample, see page 6 of the slides. The figure below expands the definition; when the two random variables are independent, E[XY]=E[X]E[Y], so the covariance is 0.
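
A compact counterexample for the converse, which is my own choice and not necessarily the one on the slides: let X be uniform on {-1, 0, 1} and Y = X^2. Y is completely determined by X, yet the covariance is zero.

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}, Y = X^2 (illustrative counterexample).
joint = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}

def expect(g):
    """E[g(X, Y)] under the joint PMF above."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

# cov(X, Y) = E[XY] - E[X]E[Y]
cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

# Independence would require p_{X,Y}(0, 0) = p_X(0) * p_Y(0).
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)
independent_at_00 = joint[(0, 0)] == p_x0 * p_y0

print(cov, independent_at_00)
```

The covariance vanishes because the dependence between X and Y is symmetric around zero and therefore has no linear component.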

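Covariance has the following properties:

Expanding the definition, as in the figure above, gives: $cov(X,Y)=E[XY]-E[X]E[Y]$
By the first property: $cov(X,X)=E[X^2]-E[X]E[X]=var[X]$
$cov(aX+b,Y)=a \; cov(X,Y)$
$cov(X,Y+Z)=cov(X,Y) + cov(X,Z)$

Page 7 of the slides proves the third and fourth properties under the assumption that the means are 0; the conclusions are the same for other means. Alternatively, you can skip the zero-mean assumption, substitute directly into the formula in the first property, and expand the expectations; the result is the same.

As mentioned earlier, when two random variables are independent, var[X+Y]=var[X]+var[Y]. Now, with covariance, we have the formula below without assuming independence; when the two random variables are independent, cov[X,Y]=0, consistent with what we saw before. For the derivation of this formula and its extension to n random variables, see pages 8 and 9 of the slides.

$var[X+Y]=var[X]+var[Y]+2cov[X,Y]$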
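
The variance-of-a-sum formula and the scaling property of covariance can be checked on a small dependent pair. The joint PMF below is hypothetical, chosen so that X and Y tend to agree; the helper names are mine.

```python
from fractions import Fraction

# Hypothetical joint PMF of two dependent 0/1 random variables.
joint = {
    (0, 0): Fraction(2, 5), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10), (1, 1): Fraction(2, 5),
}

def expect(g):
    """E[g(X, Y)] under the joint PMF above."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

def cov(g, h):
    """cov(g(X,Y), h(X,Y)) = E[gh] - E[g]E[h]."""
    return expect(lambda x, y: g(x, y) * h(x, y)) - expect(g) * expect(h)

X = lambda x, y: x
Y = lambda x, y: y
S = lambda x, y: x + y

var_x, var_y = cov(X, X), cov(Y, Y)   # cov(X, X) = var(X)
cov_xy = cov(X, Y)
var_sum = cov(S, S)

# cov(aX + b, Y) = a * cov(X, Y), here with a = 4 and b = -1.
cov_scaled = cov(lambda x, y: 4 * x - 1, Y)

print(var_x, var_y, cov_xy, var_sum, cov_scaled)
```

Because the covariance is positive here, the variance of the sum is larger than the sum of the variances.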

The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables. The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation. The correlation coefficient is defined as follows:

$\rho(X, Y)=E\left [ \frac{(X-E[X])}{\sigma_X} \cdot \frac{(Y-E[Y])}{\sigma_Y} \right ]=\frac{cov(X,Y)}{\sigma_X\sigma_Y}$

$-1\le \rho \le1$
When the two random variables are independent, $\rho=0$, i.e. they are "uncorrelated"
When $|\rho|=1$, the variables are linearly related
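
The properties above can be illustrated with two small joint PMFs, both hypothetical: one where Y is an exact linear function of X with negative slope, and one where X and Y are positively but not perfectly related. The function `correlation` is a helper I introduce for this sketch.

```python
import math
from fractions import Fraction

def correlation(joint):
    """Correlation coefficient rho = cov(X, Y) / (sigma_X * sigma_Y) of a joint PMF."""
    def expect(g):
        return sum(g(x, y) * p for (x, y), p in joint.items())
    e_x, e_y = expect(lambda x, y: x), expect(lambda x, y: y)
    cov = expect(lambda x, y: x * y) - e_x * e_y
    var_x = expect(lambda x, y: x * x) - e_x ** 2
    var_y = expect(lambda x, y: y * y) - e_y ** 2
    return float(cov) / math.sqrt(float(var_x * var_y))

# Case 1: X uniform on {1, 2, 3, 4} and Y = -2X + 5, an exact linear relation.
linear = {(x, -2 * x + 5): Fraction(1, 4) for x in range(1, 5)}

# Case 2: hypothetical dependent 0/1 variables that usually agree.
partial = {
    (0, 0): Fraction(2, 5), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10), (1, 1): Fraction(2, 5),
}

rho_linear = correlation(linear)
rho_partial = correlation(partial)
print(rho_linear, rho_partial)
```

The sign of the coefficient follows the sign of the slope in the linear case, and the magnitude drops below 1 as soon as the relation is not exact.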

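Correlation between random variables does not imply causation. For example, people who are good at mathematics often have better musical ability; the two are correlated, but there is no causal link (good at mathematics -> good at the piano). The association may well come from a hidden factor that influences both variables. In this example, perhaps some substance in the brain is more active in people who are good at mathematics and also in people with musical ability, which creates the association between the two random variables.

The example on page 13 of the slides was very instructive for me. When investing, diversifying across fields is certainly important, but you must first understand whether the fields are correlated with one another before deciding where to invest.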


Published: 2024-02-14 13:40:36

Article link: https://www.elefans.com/category/jswz/34/1763515.html