
How to normalize seaborn distplot?

For reproducibility reasons, I am sharing the dataset [here][1].

Here is what I am doing - from column 2, I read the current row and compare it with the value of the previous row. If it is greater, I keep comparing. If the current value is smaller than the previous row's value, I divide the current value (smaller) by the previous value (larger). Accordingly, the following code:
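A minimal sketch of that computation (the same vectorized logic reappears in the answers below, which assume the time stamps are in the first column of data_v.csv and the window values in the second):

import numpy as np

# Load the two columns; the quotient is current/previous wherever the value decreases.
col_time, col_window = np.loadtxt("data_v.csv", delimiter=",").T
trailing_window = col_window[:-1]   # previous ("past") values
leading_window = col_window[1:]     # current values
decreasing_inds = np.where(leading_window < trailing_window)[0]
quotient = leading_window[decreasing_inds] / trailing_window[decreasing_inds]
quotient_times = col_time[decreasing_inds]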

This gives the following plots.

sns.distplot(quotient, hist=False, label=protname)

As we can see from the plots:

  • Data-V has a quotient of 0.8 when the quotient_times is less than 3, and the quotient remains 0.5 if the quotient_times is greater than 3.

I want to normalize the values so that the y-axis of the second plot has values between 0 and 1. How do we do that in Python?

Foreword

From what I understand, the seaborn distplot by default does a kde estimation. If you want a normalized distplot graph, it could be because you assume that the graph's Ys should be bounded in [0;1]. If so, a Stack Overflow question has raised the issue of kde estimators showing values above 1.

Quoting one answer:

a continuous pdf (pdf = probability density function) never requires the value to be less than 1; with the pdf for a continuous random variable, the function p(x) is not the probability. You can refer to continuous random variables and their distributions.

Quoting the first comment of ImportanceOfBeingErnest:

The integral over a pdf is 1. There is no contradiction to be seen here.

From my knowledge, it is the CDF (Cumulative Distribution Function) whose values are supposed to be in [0;1].
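For instance, a quick check with scipy.stats (a small sketch, not tied to the question's data) shows a pdf value above 1 while the cdf stays in [0;1]:

from scipy import stats

# A uniform distribution on [0, 0.5]: its density is 2 everywhere on that interval,
# yet it still integrates to 1 and its CDF never leaves [0, 1].
dist = stats.uniform(loc=0, scale=0.5)
print(dist.pdf(0.25))  # 2.0 -> pdf values may exceed 1
print(dist.cdf(0.25))  # 0.5 -> cdf values stay within [0, 1]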

Notice: all the continuous distributions that can be fitted are listed on the SciPy site and available in the package scipy.stats.

Maybe also have a look at probability mass functions?


If you really want to have the same graph normalized, then you should gather the actual data points of the plotted function (Option 1), or the function definition (Option 2), normalize them yourself, and plot them again.

Option 1

(screenshot of the Option 1 result: the basic distplot next to the same curve with normalized y values)

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sys

print('System versions          : {}'.format(sys.version))
print('System versions          : {}'.format(sys.version_info))
print('Numpy version            : {}'.format(np.__version__))
print('matplotlib.pyplot version: {}'.format(matplotlib.__version__))
print('seaborn version          : {}'.format(sns.__version__))

protocols = {}

types = {"data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }

    fig, (ax1, ax2) = plt.subplots(1,2, sharey=False, sharex=False)
    g = sns.distplot(quotient, hist=True, label=protname, ax=ax1, rug=True)
    ax1.set_title('basic distplot (kde=True)')
    # get distplot line points
    line = g.get_lines()[0]
    xd = line.get_xdata()
    yd = line.get_ydata()
    # https://stackoverflow.com/questions/29661574/normalize-numpy-array-columns-in-python
    def normalize(x):
        return (x - x.min(0)) / x.ptp(0)
    #normalize points
    yd2 = normalize(yd)
    # plot them in another graph
    ax2.plot(xd, yd2)
    ax2.set_title('basic distplot (kde=True)\nwith normalized y plot values')

    plt.show()

Option 2

Below, I tried to perform a kde and normalize the obtained estimation. I'm not a stats expert, so the kde usage might be wrong in some way (it is different from seaborn's, as one can see on the screenshot; this is because seaborn does the job much better than I do. I only tried to mimic the kde fitting with scipy. The result is not so bad, I guess).

Screenshot:

(screenshot: the quotient scatter, the basic distplot, the non-normalized gaussian kde values, and the normalized gaussian kde values)

Code:

import numpy as np
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sys

print('System versions          : {}'.format(sys.version))
print('System versions          : {}'.format(sys.version_info))
print('Numpy version            : {}'.format(np.__version__))
print('matplotlib.pyplot version: {}'.format(matplotlib.__version__))
print('seaborn version          : {}'.format(sns.__version__))

protocols = {}

types = {"data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }

    fig, (ax1, ax2, ax3, ax4) = plt.subplots(1,4, sharey=False, sharex=False)
    diff=quotient_times
    ax1.plot(diff, quotient, ".", label=protname, color="blue")
    ax1.set_ylim(0, 1.0001)
    ax1.set_title(protname)
    ax1.set_xlabel("quotient_times")
    ax1.set_ylabel("quotient")
    ax1.legend()

    sns.distplot(quotient, hist=True, label=protname, ax=ax2, rug=True)
    ax2.set_title('basic distplot (kde=True)')

    # taken from seaborn's source code (utils.py and distributions.py)
    def seaborn_kde_support(data, bw, gridsize, cut, clip):
        if clip is None:
            clip = (-np.inf, np.inf)
        support_min = max(data.min() - bw * cut, clip[0])
        support_max = min(data.max() + bw * cut, clip[1])
        return np.linspace(support_min, support_max, gridsize)

    kde_estim = stats.gaussian_kde(quotient, bw_method='scott')

    # manual linearization of data
    #linearized = np.linspace(quotient.min(), quotient.max(), num=500)

    # or better: mimic seaborn's internal stuff
    bw = kde_estim.scotts_factor() * np.std(quotient)
    linearized = seaborn_kde_support(quotient, bw, 100, 3, None)

    # computes values of the estimated function on the estimated linearized inputs
    Z = kde_estim.evaluate(linearized)

    # https://stackoverflow.com/questions/29661574/normalize-numpy-array-columns-in-python
    def normalize(x):
        return (x - x.min(0)) / x.ptp(0)

    # normalize so it is between 0;1
    Z2 = normalize(Z)
    for name, func in {'min': np.min, 'max': np.max}.items():
        print('{}: source={}, normalized={}'.format(name, func(Z), func(Z2)))

    # plot differs from seaborn's because the exact same method is not applied
    ax3.plot(linearized, Z, ".", label=protname, color="orange")
    ax3.set_title('Non linearized gaussian kde values')

    # manual kde result with Y axis values normalized (between 0;1)
    ax4.plot(linearized, Z2, ".", label=protname, color="green")
    ax4.set_title('Normalized gaussian kde values')

    plt.show()

Output:

System versions          : 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
System versions          : sys.version_info(major=3, minor=7, micro=2, releaselevel='final', serial=0)
Numpy version            : 1.16.2
matplotlib.pyplot version: 3.0.2
seaborn version          : 0.9.0
min: source=0.0021601491646143518, normalized=0.0
max: source=9.67319154426489, normalized=1.0

Contrary to a comment, plotting:

[(x-min(quotient))/(max(quotient)-min(quotient)) for x in quotient]

does not change the behavior! It only changes the source data for the kernel density estimation; the curve shape would remain the same.
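To illustrate (a sketch with made-up stand-in data, not the actual data_v.csv values): min-max scaling the input only stretches the kde along x and rescales its height by the data range, leaving the shape identical.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(0.5, 0.02, 50), rng.normal(0.9, 0.02, 50)])  # stand-in for quotient
span = sample.max() - sample.min()
scaled = (sample - sample.min()) / span

kde_raw = stats.gaussian_kde(sample)
kde_scaled = stats.gaussian_kde(scaled)

xs = np.linspace(sample.min(), sample.max(), 200)
# The kde of the scaled data, evaluated at the scaled positions, is the raw kde times the range:
print(np.allclose(kde_raw(xs), kde_scaled((xs - sample.min()) / span) / span))  # True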

Quoting seaborn's distplot doc:

This function combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions. It can also fit scipy.stats distributions and plot the estimated PDF over the data.

By default:

kde : bool, optional (set to True). Whether to plot a gaussian kernel density estimate.

It uses kde by default. Quoting seaborn's kde doc:

Fit and plot a univariate or bivariate kernel density estimate.

Quoting SciPy's gaussian kde method doc:

Representation of a kernel-density estimate using Gaussian kernels.

Kernel density estimation is a way to estimate the probability density function (PDF) of a random variable in a non-parametric way. gaussian_kde works for both uni-variate and multi-variate data. It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed.

Note that I do believe your data are bimodal, as you mentioned yourself. They also look discrete. As far as I know, a discrete distribution function may not be analyzed in the same way continuous ones are, and fitting may prove tricky.

Here is an example with various distribution laws:

import numpy as np
from scipy.stats import uniform, powerlaw, logistic
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sys

print('System versions          : {}'.format(sys.version))
print('System versions          : {}'.format(sys.version_info))
print('Numpy version            : {}'.format(np.__version__))
print('matplotlib.pyplot version: {}'.format(matplotlib.__version__))
print('seaborn version          : {}'.format(sns.__version__))

protocols = {}

types = {"data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }
    fig, [(ax1, ax2, ax3), (ax4, ax5, ax6)] = plt.subplots(2,3, sharey=False, sharex=False)
    diff=quotient_times
    ax1.plot(diff, quotient, ".", label=protname, color="blue")
    ax1.set_ylim(0, 1.0001)
    ax1.set_title(protname)
    ax1.set_xlabel("quotient_times")
    ax1.set_ylabel("quotient")
    ax1.legend()
    quotient2 = [(x-min(quotient))/(max(quotient)-min(quotient)) for x in quotient]
    print(quotient2)
    sns.distplot(quotient, hist=True, label=protname, ax=ax2, rug=True)
    ax2.set_title('basic distplot (kde=True)')
    sns.distplot(quotient2, hist=True, label=protname, ax=ax3, rug=True)
    ax3.set_title('min-max scaled distplot (kde=True)')

    sns.distplot(quotient, hist=True, label=protname, ax=ax4, rug=True, kde=False, fit=uniform)
    ax4.set_title('uniform distplot')
    sns.distplot(quotient, hist=True, label=protname, ax=ax5, rug=True, kde=False, fit=powerlaw)
    ax5.set_title('powerlaw distplot')
    sns.distplot(quotient, hist=True, label=protname, ax=ax6, rug=True, kde=False, fit=logistic)
    ax6.set_title('logistic distplot')
    plt.show()

Output:

System versions          : 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
System versions          : sys.version_info(major=3, minor=7, micro=2, releaselevel='final', serial=0)
Numpy version            : 1.16.2
matplotlib.pyplot version: 3.0.2
seaborn version          : 0.9.0
[1.0, 0.05230125523012544, 0.0433775382360589, 0.024590765616971128, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.02836946874603772, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.03393500048652319, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.0037013196009011043, 0.0, 0.05230125523012544]

Screenshot:

(screenshot: the quotient scatter, the basic distplot, the min-max scaled distplot, and the uniform, powerlaw and logistic fits)

In the latest update, sns.distplot has been deprecated and sns.displot has to be used instead. Hence, to obtain a normalized histogram/density one has to use the following syntax:

sns.displot(x, kind='hist', stat='density');

or

sns.histplot(x, stat='density');

instead of

sns.distplot(x, kde=False, norm_hist=True);

PS: to get a density curve instead of a histogram, the kind value has to be changed to 'kde'.
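A minimal sketch of the two variants (assuming seaborn >= 0.11; x here is just placeholder data, not the question's quotient):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

x = np.random.default_rng(1).normal(size=200)  # placeholder data

sns.displot(x, kind='hist', stat='density')  # density-normalized histogram
sns.displot(x, kind='kde')                   # kde curve instead of the histogram
plt.show()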

Reference:

  1. https://seaborn.pydata.org/generated/seaborn.histplot.html
  2. https://seaborn.pydata.org/generated/seaborn.displot.html
