I have a data frame containing ~900 rows, and I'm trying to draw KDE plots for some of the columns. In some columns, a majority of the values are equal to the column minimum. When I include too many of these minimum values, the KDE plot abruptly stops showing the minimums. For example, the following includes 600 values, of which 450 are the minimum, and the plot looks fine:
    y = df.sort_values(by='col1', ascending=False)['col1'].values[:600]
    sb.kdeplot(y)

But including 451 of the minimum values gives a very different output:

    y = df.sort_values(by='col1', ascending=False)['col1'].values[:601]
    sb.kdeplot(y)

Eventually I would like to plot bivariate KDE plots of different columns against each other, but I'd like to understand this first.
Solution

The problem is the default algorithm that is chosen for the "bandwidth" of the KDE. The default method is 'scott', which isn't very helpful when there are many equal values.
The bandwidth is the width of the Gaussians that are positioned at every sample point and summed up. Lower bandwidths follow the data more closely; higher bandwidths smooth everything out. The sweet spot is somewhere in the middle. In this case bw=0.3 could be a good option. To compare different KDEs, it is recommended to use exactly the same bandwidth for each of them.
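To make the "sum of Gaussians" idea concrete, here is a minimal NumPy sketch of a KDE computed by hand. It is only an illustration, not what seaborn does internally: gaussian_kde_manual is a hypothetical helper, the data are synthetic, and the two bandwidths (0.05 and 0.3) are arbitrary choices for comparison.

```python
import numpy as np

def gaussian_kde_manual(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] is the standardized distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    # Averaging (rather than plain summing) keeps the total area at 1
    return bumps.mean(axis=1)

# Synthetic data (an assumption): 150 normal values plus 450 copies of -3
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 150), np.repeat(-3.0, 450)])

grid = np.linspace(-6, 5, 1101)
narrow = gaussian_kde_manual(y, grid, bandwidth=0.05)  # spiky, hugs the data
wide = gaussian_kde_manual(y, grid, bandwidth=0.3)     # smoother curve
```

With the narrow bandwidth the repeated value produces a very tall, thin spike; with 0.3 the same mass is spread into a lower, wider peak.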
Here is some sample code to show the difference between bw='scott' and bw=0.3. The example data are 150 values from a standard normal distribution together with either 400, 450 or 500 fixed values.
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns; sns.set()

    # Rows: bandwidth method; columns: number of repeated (fixed) values
    fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(10, 5), gridspec_kw={'hspace': 0.3})
    for i, bw in enumerate(['scott', 0.3]):
        for j, num_same in enumerate([400, 450, 500]):
            # 150 standard-normal values plus num_same copies of -3
            y = np.concatenate([np.random.normal(0, 1, 150), np.repeat(-3, num_same)])
            # `bw` is the parameter name in seaborn versions before 0.11
            sns.kdeplot(y, bw=bw, ax=axs[i, j])
            axs[i, j].set_title(f'bw:{bw}; fixed values:{num_same}')
    plt.show()

The third plot gives a warning that the KDE cannot be drawn using Scott's suggested bandwidth.
PS: As mentioned by @mwascom in the comments, in this case statsmodels.nonparametric.kde is used (not scipy.stats.gaussian_kde). There the default is "scott", computed as 1.059 * A * nobs ** (-1/5.), where A is min(std(X), IQR/1.34). IQR is the "interquartile range", the difference between the 75th and 25th percentiles. The min() explains the abrupt change in behavior: once the repeated minimum value makes up more than 75% of the sample, the 25th and 75th percentiles coincide, so IQR is 0, A is 0, and the bandwidth collapses to 0.
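To see where the threshold comes from, here is a minimal sketch that evaluates the quoted formula with NumPy. It is not statsmodels' own code: scott_bandwidth is a hypothetical helper, the data are synthetic (uniform values plus a repeated minimum of -3), and using np.percentile with its default linear interpolation together with the sample standard deviation (ddof=1) are assumptions.

```python
import numpy as np

def scott_bandwidth(x):
    """Evaluate 1.059 * A * nobs ** (-1/5) with A = min(std(X), IQR / 1.34)."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    iqr = q75 - q25
    a = min(np.std(x, ddof=1), iqr / 1.34)
    return 1.059 * a * len(x) ** (-1 / 5)

# Synthetic stand-in for the question's column (an assumption):
# 150 values strictly above the minimum, plus the minimum repeated
rng = np.random.default_rng(0)
others = rng.uniform(-2, 3, 150)
y600 = np.concatenate([others, np.repeat(-3.0, 450)])  # 450 of 600 at the minimum
y601 = np.concatenate([others, np.repeat(-3.0, 451)])  # 451 of 601 at the minimum

bw600 = scott_bandwidth(y600)
bw601 = scott_bandwidth(y601)
print(bw600, bw601)
```

Under these assumptions the first bandwidth is small but positive, because the 75th percentile is interpolated slightly above the minimum, while the second is zero, because the 75th percentile lands exactly on a repeated minimum value.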
Edit: Since Seaborn 0.11, the statsmodels backend has been dropped, so KDEs are only calculated via scipy.stats.gaussian_kde.
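For reference, here is a small sketch of calling scipy.stats.gaussian_kde directly, once with its default 'scott' rule and once with a fixed scalar. The data are synthetic and the value 0.3 is only carried over from the example above. Note that in SciPy a scalar bw_method is a factor that gets multiplied by the sample standard deviation, so it does not mean the same thing as an absolute kernel width.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic data (an assumption): 150 normal values plus 500 copies of -3
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 150), np.repeat(-3.0, 500)])

kde_scott = gaussian_kde(y)                 # default bw_method is 'scott'
kde_fixed = gaussian_kde(y, bw_method=0.3)  # scalar factor, scaled by the data's std

grid = np.linspace(-6, 5, 500)
density_scott = kde_scott(grid)
density_fixed = kde_fixed(grid)

# Scott's factor in SciPy depends only on the sample size (n ** (-1/5) in 1-D),
# so repeated values do not drive it to zero
print(kde_scott.factor, kde_fixed.factor)
```

Because this rule does not involve the IQR, many identical values no longer produce a zero bandwidth, as long as the data are not all equal.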