How to normalize a seaborn distplot?

Question

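For reproducibility, I am sharing the dataset [here][1].

Here is what I am doing: starting from the second column, I read the current row and compare it with the value in the previous row. If it is greater, I keep comparing. If the current value is smaller than the previous row's value, I divide the current (smaller) value by the previous (larger) value. Accordingly, the following code: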
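A minimal sketch of the computation just described, run on a small made-up array rather than the shared dataset (the array `values` and the variable names are assumptions chosen for illustration):

```python
import numpy as np

# Hypothetical values standing in for the second column of the dataset.
values = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 4.0, 1.0])

previous = values[:-1]  # value in the previous row
current = values[1:]    # value in the current row

# Positions where the current value is smaller than the previous one.
decreasing = np.where(current < previous)[0]

# Smaller (current) value divided by the larger (previous) value,
# so every quotient is strictly below 1.
quotient = current[decreasing] / previous[decreasing]
print(quotient)
```

Because a quotient is only computed where the value drops, all quotients fall in the open interval (0, 1) as long as the data are positive.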

This gives the following plots.

sns.distplot(quotient, hist=False, label=protname)

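From the plots we can see that

Data-V has a quotient of 0.8 when quotient_times is less than 3, and the quotient stays at 0.5 when quotient_times is greater than 3.

I want to normalize the values so that the y-axis of the second plot lies between 0 and 1. How can we do that in Python?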

Answer 1

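Preface

As far as I understand, seaborn's distplot performs a kde estimation by default. If you want a normalized distplot, it may be because you assume the plot's y-values should be bounded in [0;1]. If so, another Stack Overflow question has already raised the issue of kde estimators showing values above 1.

Quoting one answer:

a continuous pdf (pdf = probability density function) is never required to stay below 1; for a continuous random variable, the function p(x) is not a probability. You can refer to continuous random variables and their distributions

Quoting the first comment by ImportanceOfBeingErnest:

The integral over a pdf is 1. There is no contradiction here.

As far as I know, it is the CDF (cumulative distribution function) whose values are supposed to be in [0; 1].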
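To make the point concrete, here is a small sketch using a normal density with a narrow spread. The mean and standard deviation are made-up illustration values, and the integral is approximated with a hand-written trapezoidal rule:

```python
import numpy as np

# Normal density with a small standard deviation (illustrative values).
mean, sigma = 0.5, 0.05
x = np.linspace(mean - 10 * sigma, mean + 10 * sigma, 4001)
pdf = np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Trapezoidal rule written out by hand: area of each segment under the curve.
segment_areas = (pdf[1:] + pdf[:-1]) / 2 * np.diff(x)
area = segment_areas.sum()

# Running integral of the pdf, i.e. a numerical CDF on the same grid.
cdf = np.concatenate(([0.0], np.cumsum(segment_areas)))

print(pdf.max())        # peak density, well above 1
print(area)             # close to 1
print(cdf[0], cdf[-1])  # the CDF runs from 0 up to about 1
```

The narrower the distribution, the taller the density has to be for the area to remain 1, which is why data concentrated in a short interval produce large y-values.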

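Caution: all the continuous distributions that can be fitted are listed on the SciPy site and are available in the scipy.stats package.

Maybe also have a look at probability mass functions?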
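As a rough illustration of the difference, a binned estimate can be expressed either as a density (bar areas sum to 1) or as per-bin probabilities (bar heights sum to 1). The sketch below uses made-up uniform data as a stand-in for `quotient`:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(0.5, 0.8, size=200)  # made-up stand-in for `quotient`

counts, edges = np.histogram(data, bins=10)
density, _ = np.histogram(data, bins=edges, density=True)

# Share of observations per bin: each value is in [0, 1] and they sum to 1.
probability = counts / counts.sum()

print(density.max())      # can exceed 1 because the bins are narrow
print(probability.max())  # never exceeds 1
```

Only the per-bin probabilities are guaranteed to stay within [0, 1]; the density values depend on the bin width.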

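If you really want the same graph normalized, then you should gather the actual data points of the plotted function (Option 1), or the function definition (Option 2), normalize them yourself, and plot them again.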

Option 1

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sys

print('System versions          : {}'.format(sys.version))
print('System versions          : {}'.format(sys.version_info))
print('Numpy versqion           : {}'.format(np.__version__))
print('matplotlib.pyplot version: {}'.format(matplotlib.__version__))
print('seaborn version          : {}'.format(sns.__version__))

protocols = {}

types = {"data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }

    fig, (ax1, ax2) = plt.subplots(1,2, sharey=False, sharex=False)
    g = sns.distplot(quotient, hist=True, label=protname, ax=ax1, rug=True)
    ax1.set_title('basic distplot (kde=True)')
    # get distplot line points
    line = g.get_lines()[0]
    xd = line.get_xdata()
    yd = line.get_ydata()
    # https://stackoverflow.com/questions/29661574/normalize-numpy-array-columns-in-python
    def normalize(x):
        return (x - x.min(0)) / np.ptp(x, 0)  # np.ptp also works on NumPy 2.x, where ndarray.ptp was removed
    #normalize points
    yd2 = normalize(yd)
    # plot them in another graph
    ax2.plot(xd, yd2)
    ax2.set_title('basic distplot (kde=True)\nwith normalized y plot values')

    plt.show()
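One consequence of the min-max step in Option 1 is worth spelling out: the rescaled curve peaks at 1, but it is no longer a density, because its area is no longer 1. A minimal sketch on a made-up curve (the arrays `xd` and `yd` below are synthetic stand-ins for the points read from the plotted line, and `area_under` is a helper introduced only for this illustration):

```python
import numpy as np

def min_max(y):
    """Rescale y linearly so that its minimum is 0 and its maximum is 1."""
    y = np.asarray(y, dtype=float)
    return (y - y.min()) / np.ptp(y)

def area_under(x, y):
    """Trapezoidal approximation of the area under the curve (x, y)."""
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))

# Synthetic bell-shaped curve with a peak near 9, similar in scale to a KDE
# of tightly clustered data.
xd = np.linspace(0.0, 1.0, 201)
yd = 9.0 * np.exp(-0.5 * ((xd - 0.6) / 0.05) ** 2) + 0.002

yd2 = min_max(yd)

print(yd2.min(), yd2.max())     # 0 and 1 after rescaling
print(area_under(xd, yd))       # area of the original curve
print(area_under(xd, yd2))      # much smaller area after rescaling
```

The shape and the location of the peak are preserved, so the rescaled curve is fine for comparing shapes, but its y-values should not be read as densities or probabilities.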

Option 2

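Below, I tried to perform a kde and normalize the resulting estimate. I am not a statistics expert, so my use of kde may be wrong in some way. The result differs from seaborn's, as the screenshot shows, because seaborn does the job much better than I do; my code only tries to mimic the kde fit with scipy. The result is not too bad, I guess.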

Screenshot: (image not included)

Code:

import numpy as np
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sys

print('System versions          : {}'.format(sys.version))
print('System versions          : {}'.format(sys.version_info))
print('Numpy versqion           : {}'.format(np.__version__))
print('matplotlib.pyplot version: {}'.format(matplotlib.__version__))
print('seaborn version          : {}'.format(sns.__version__))

protocols = {}

types = {"data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }

    fig, (ax1, ax2, ax3, ax4) = plt.subplots(1,4, sharey=False, sharex=False)
    diff=quotient_times
    ax1.plot(diff, quotient, ".", label=protname, color="blue")
    ax1.set_ylim(0, 1.0001)
    ax1.set_title(protname)
    ax1.set_xlabel("quotient_times")
    ax1.set_ylabel("quotient")
    ax1.legend()

    sns.distplot(quotient, hist=True, label=protname, ax=ax2, rug=True)
    ax2.set_title('basic distplot (kde=True)')

    # taken from seaborn's source code (utils.py and distributions.py)
    def seaborn_kde_support(data, bw, gridsize, cut, clip):
        if clip is None:
            clip = (-np.inf, np.inf)
        support_min = max(data.min() - bw * cut, clip[0])
        support_max = min(data.max() + bw * cut, clip[1])
        return np.linspace(support_min, support_max, gridsize)

    kde_estim = stats.gaussian_kde(quotient, bw_method='scott')

    # manual linearization of data
    #linearized = np.linspace(quotient.min(), quotient.max(), num=500)

    # or better: mimic seaborn's internal stuff
    bw = kde_estim.scotts_factor() * np.std(quotient)
    linearized = seaborn_kde_support(quotient, bw, 100, 3, None)

    # computes values of the estimated function on the estimated linearized inputs
    Z = kde_estim.evaluate(linearized)

    # https://stackoverflow.com/questions/29661574/normalize-numpy-array-columns-in-python
    def normalize(x):
        return (x - x.min(0)) / np.ptp(x, 0)  # np.ptp also works on NumPy 2.x, where ndarray.ptp was removed

    # normalize so it is between 0;1
    Z2 = normalize(Z)
    for name, func in {'min': np.min, 'max': np.max}.items():
        print('{}: source={}, normalized={}'.format(name, func(Z), func(Z2)))

    # plot is different from seaborns because not exact same method applied
    ax3.plot(linearized, Z, ".", label=protname, color="orange")
    ax3.set_title('Non-normalized gaussian kde values')

    # manual kde result with Y axis values normalized (between 0;1)
    ax4.plot(linearized, Z2, ".", label=protname, color="green")
    ax4.set_title('Normalized gaussian kde values')

    plt.show()

Output:

System versions          : 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
System versions          : sys.version_info(major=3, minor=7, micro=2, releaselevel='final', serial=0)
Numpy versqion           : 1.16.2
matplotlib.pyplot version: 3.0.2
seaborn version          : 0.9.0
min: source=0.0021601491646143518, normalized=0.0
max: source=9.67319154426489, normalized=1.0

Contrary to a comment, plotting:

[(x-min(quotient))/(max(quotient)-min(quotient)) for x in quotient]

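does not change the behavior! It only changes the source data for the kernel density estimation. The shape of the curve stays the same.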
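This can be checked numerically. The sketch below uses scipy.stats.gaussian_kde with its default bandwidth on made-up data (a stand-in for `quotient`); it is not seaborn's exact code path, only an illustration of why min-max scaling the source data leaves the curve shape unchanged:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
quotient = rng.uniform(0.5, 0.8, size=50)  # made-up stand-in for the real data

lo, span = quotient.min(), np.ptp(quotient)
quotient2 = (quotient - lo) / span         # min-max scaled source data

grid = np.linspace(lo - 0.1, quotient.max() + 0.1, 300)
grid2 = (grid - lo) / span                 # the same points on the rescaled axis

z = stats.gaussian_kde(quotient)(grid)
z2 = stats.gaussian_kde(quotient2)(grid2)

# The density of the rescaled data is the original density multiplied by
# `span`: the x-axis is stretched and the y-axis shrinks to keep the area at 1.
print(np.allclose(z2, span * z))
print(z.max(), z2.max())
```

Since the two curves differ only by a constant factor, dividing each by its own maximum gives identical curves; the scaling changes the axis ranges, not the shape, and it does not by itself force the y-values into [0, 1].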

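Quoting seaborn's distplot documentation:

This function combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions. It can also fit scipy.stats distributions and plot the estimated PDF over the data.

By default:

kde : bool, optional, set to True. Whether to plot a gaussian kernel density estimate.

It uses kde by default. Quoting seaborn's kde documentation:

Fit and plot a univariate or bivariate kernel density estimate.

Quoting the SciPy gaussian kde method documentation:

Representation of a kernel-density estimate using Gaussian kernels.

Kernel density estimation is a way to estimate the probability density function (PDF) of a random variable in a non-parametric way. gaussian_kde works for both uni-variate and multi-variate data. It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed.

Note that I do believe your data are bimodal, as you mentioned yourself. They also look discrete. As far as I know, discrete distributions cannot be analyzed in the same way as continuous ones, and fitting them may get tricky.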
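For reference, fitting a scipy.stats distribution by hand looks roughly like the sketch below: estimate the parameters with the distribution's `fit` method, then evaluate its `pdf` on a grid. The data are made up (a stand-in for `quotient`), and this is only an approximation of what a `fit=` argument does, not seaborn's own code:

```python
import numpy as np
from scipy.stats import logistic, uniform

rng = np.random.default_rng(0)
data = rng.uniform(0.5, 0.8, size=200)  # made-up stand-in for `quotient`

grid = np.linspace(data.min() - 0.2, data.max() + 0.2, 400)

fitted = {}
for name, dist in {"uniform": uniform, "logistic": logistic}.items():
    params = dist.fit(data)                 # estimated (loc, scale) for these two laws
    fitted[name] = dist.pdf(grid, *params)  # fitted density evaluated on the grid
    print(name, params, fitted[name].max())
```

The fitted curves are densities too, so their peaks can exceed 1 for the same reason as the kde curve.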

Here is an example with various distributions:

import numpy as np
from scipy.stats import uniform, powerlaw, logistic
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sys

print('System versions          : {}'.format(sys.version))
print('System versions          : {}'.format(sys.version_info))
print('Numpy versqion           : {}'.format(np.__version__))
print('matplotlib.pyplot version: {}'.format(matplotlib.__version__))
print('seaborn version          : {}'.format(sns.__version__))

protocols = {}

types = {"data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }
    fig, [(ax1, ax2, ax3), (ax4, ax5, ax6)] = plt.subplots(2,3, sharey=False, sharex=False)
    diff=quotient_times
    ax1.plot(diff, quotient, ".", label=protname, color="blue")
    ax1.set_ylim(0, 1.0001)
    ax1.set_title(protname)
    ax1.set_xlabel("quotient_times")
    ax1.set_ylabel("quotient")
    ax1.legend()
    quotient2 = [(x-min(quotient))/(max(quotient)-min(quotient)) for x in quotient]
    print(quotient2)
    sns.distplot(quotient, hist=True, label=protname, ax=ax2, rug=True)
    ax2.set_title('basic distplot (kde=True)')
    sns.distplot(quotient2, hist=True, label=protname, ax=ax3, rug=True)
    ax3.set_title('distplot of min-max normalized data')

    sns.distplot(quotient, hist=True, label=protname, ax=ax4, rug=True, kde=False, fit=uniform)
    ax4.set_title('uniform distplot')
    sns.distplot(quotient, hist=True, label=protname, ax=ax5, rug=True, kde=False, fit=powerlaw)
    ax5.set_title('powerlaw distplot')
    sns.distplot(quotient, hist=True, label=protname, ax=ax6, rug=True, kde=False, fit=logistic)
    ax6.set_title('logistic distplot')
    plt.show()

Output:

System versions          : 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
System versions          : sys.version_info(major=3, minor=7, micro=2, releaselevel='final', serial=0)
Numpy versqion           : 1.16.2
matplotlib.pyplot version: 3.0.2
seaborn version          : 0.9.0
[1.0, 0.05230125523012544, 0.0433775382360589, 0.024590765616971128, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.02836946874603772, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.03393500048652319, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.0037013196009011043, 0.0, 0.05230125523012544]

Screenshot: (image not included)

Answer 2

As of the latest update, sns.distplot is deprecated and sns.displot must be used instead. So, to get a normalized histogram/density, use the following syntax:

sns.displot(x, kind='hist', stat='density');

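or

sns.histplot(x, stat='density');

instead of

sns.distplot(x, kde=False, norm_hist=True);

PS: To get a density curve instead of a histogram, change kind to 'kde' and drop the stat argument, which kdeplot does not accept.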

Answer 1: score 10, accepted, 2019-03-12 21:58:34
Answer 2: score 2, 2021-05-20 12:52:54