
How to prevent seaborn from smoothing the histogram when plotting a density plot using seaborn.distplot?

Here is a problem I have been facing for a while. I'm trying to plot a density plot (i.e. a smoothed approximation of a histogram) using seaborn.distplot() and I obtain the following figure:

[Figure: density plot]

The problem with the above plot is that the contour on the leftmost side extends well beyond -1.0, which I do not want, since the similarity score cannot be less than -1.0 (i.e. it should lie only in the closed interval [-1.0, 1.0]). I have checked my input array (given below) and it contains no value less than -1.0. So it seems that seaborn.distplot() smooths the distribution, which makes it extend well beyond -1.0. How can I stop this from happening? I have tried setting xlim on the x-axis, but that leaves no room on the left-hand side of the plot (the way there is a single column of space on the rightmost side after +1.0).

For illustration, here is a sample input array and the code I'm using for plotting:

arr = np.array([-0.35416853, -0.28675528, -0.54088942,  0.18797232,  0.01707244,
       -0.48090636, -0.44454523, -0.03228283, -0.70861904,  0.02323842,
       -0.54905541, -0.5421915 ,  0.27547336, -0.92913273, -0.55379011,
       -0.23521681, -0.1079175 , -0.24065031, -0.33773661, -0.06147251,
       -0.74171701, -0.74315048,  0.06634989, -0.49222919,  0.48899574,
        0.13499221,  0.53120786, -0.1688146 ,  0.47125832,  0.36517109,
        0.33110315,  0.34495851,  0.18393   ,  0.67211736,  0.11608325,
       -0.92913273, -0.71209124,  0.01828323,  0.30894561, -0.06463642,
        0.45423401, -0.7993457 ,  0.50007295,  0.17983021, -0.66105515,
       -0.92783269, -0.49277017, -0.19487059,  0.07502782,  0.00700057,
        0.29958942, -0.04223299,  0.04105657, -0.12604522,  0.30506049,
       -0.15600141, -0.17434894,  0.01152945, -0.11583157,  0.07010729,
       -0.92913273, -0.02566766,  0.48114331, -0.13252103, -0.42600686,
        0.54836633,  0.37945642, -0.34006735, -0.29560479,  0.4930249 ,
        0.02693856,  0.57255816,  0.31185216,  0.19780182,  0.11909931,
       -0.02853919, -0.25082142, -0.08635957, -0.28266912, -0.80937364,
       -0.92913273, -0.0172393 , -0.18993503, -0.69080226, -0.66901143,
        0.0470842 , -0.45307088,  0.05043218, -0.20894534, -0.22218531,
        0.5189177 , -0.92913273,  0.31509469, -0.15935917, -0.92913273,
       -0.41652189,  0.20265061,  0.016976  ,  0.0680205 ,  0.33159134,
       -0.3138477 ,  0.10086817,  0.37074665, -0.06916329, -0.19177307,
        0.22842641, -0.15087903,  0.34376167,  0.24173604, -0.38040409,
       -0.20031291,  0.17990511,  0.40231535, -0.27195479, -0.15867829,
        0.2389052 ,  0.08337308, -0.07327617, -0.77566734, -0.12074809,
        0.19539527,  0.03727124, -0.13330546,  0.13602168,  0.36673224,
       -0.3434154 ,  0.19251896,  0.27692974,  0.4757158 ,  0.24333386,
        0.29905657,  0.57319178,  0.46753947, -0.04079389,  0.5571865 ,
        0.3453707 ,  0.55110949,  0.19614831,  0.61707333,  0.3680048 ,
        0.48193126,  0.67330892,  0.53603774,  0.54464057,  0.35016492,
        0.36970268,  0.150395  ,  0.4697073 ,  0.3383952 ,  0.4037419 ,
       -0.01055328,  0.26734498,  0.2647191 ,  0.30056532,  0.46706568,
        0.41460328,  0.42295413,  0.44188908,  0.29304088, -0.18437651,
       -0.33404869,  0.31744862,  0.16578238, -0.2903621 , -0.36128032,
       -0.65571561,  0.39868119, -0.31359498,  0.45377302,  0.23929229,
        0.19958669,  0.51978988, -0.01249307, -0.16404641,  0.27193916,
       -0.11159726, -0.10719093,  0.05472177, -0.64784851,  0.25594644,
       -0.26109644, -0.28908332,  0.06264426,  0.05689891,  0.26437733,
       -0.29424862,  0.26441642,  0.34868516,  0.00497344, -0.46811445,
       -0.35795662, -0.04599685,  0.08701907, -0.32572399,  0.17639076,
        0.35640737, -0.08174591, -0.13910904,  0.35387245,  0.00857055,
       -0.24789401,  0.24033791, -0.08525459,  0.19189512,  0.27148848,
       -0.38631975, -0.08820518,  0.12658585,  0.23404602,  0.06062359,
        0.13340842, -0.11942433, -0.15974527, -0.0236961 ,  0.01533685,
       -0.92641117,  0.01533685, -0.00582898,  0.08251113, -0.18537655,
       -0.92641117, -0.63036561, -0.02408175, -0.10033362, -0.08820518,
        0.01533685, -0.1475904 , -0.06573955, -0.10033362, -0.08820518,
       -0.08820518,  0.04798457,  0.29057868,  0.08310757,  0.25168328,
        0.03989156,  0.1895359 , -0.44324531, -0.16724842,  0.06172038,
        0.05685105,  0.3381661 , -0.46472578, -0.13137012,  0.10249921,
        0.26703853,  0.14798872,  0.09729466, -0.09559039,  0.38893042,
        0.6081168 , -0.32574556, -0.11493626,  0.30370567, -0.13203101,
        0.12251789,  0.29993512, -0.80796771, -0.14717629,  0.37894796,
        0.30086822,  0.26228619, -0.01403568, -0.46596314, -0.11860131,
       -0.52649509,  0.41834337,  0.25892792,  0.40497516, -0.0287142 ,
       -0.14994142,  0.41714702,  0.40928704,  0.0595943 ,  0.5190621 ,
        0.53760238,  0.25452441, -0.08397463,  0.22131469, -0.46173602,
        0.48456617,  0.44220971,  0.16059022,  0.43723123,  0.04680989,
       -0.00131657, -0.09681387, -0.48600167, -0.44205123,  0.13787778,
       -0.02900436,  0.07049823,  0.02565475, -0.20544388,  0.0297263 ,
        0.09162641, -0.17354248, -0.41518963,  0.12393266, -0.41754063,
       -0.19018751,  0.02251257, -0.27799953,  0.21135703,  0.09597453,
        0.56175636,  0.34126265,  0.17056669,  0.13149045, -0.30472518,
       -0.07366951,  0.42843431, -0.22890901,  0.05518269, -0.01007775,
       -0.48123104, -0.44906545,  0.09229373, -0.85684002,  0.23411821,
        0.02637603,  0.02477345,  0.21678001, -0.14454807,  0.32430986,
       -0.12988135,  0.07014938,  0.17991853, -0.02405694, -0.83110188,
       -0.11192697,  0.02312546, -0.10770876,  0.13470276,  0.10568144,
       -0.20336714, -0.15739212,  0.21271663,  0.05357167,  0.3281988 ,
        0.17442453,  0.11561338, -0.68398479, -0.03704769,  0.28698584,
        0.17608064,  0.30424182,  0.51034264, -0.09452418,  0.38242868,
       -0.60014916,  0.21856565, -0.04819684,  0.2653766 ,  0.02992649,
        0.18941891, -0.04752845,  0.02295903, -0.29201727,  0.07913569,
       -0.12563984,  0.21124929, -0.18801383, -0.24118712, -0.29686842,
        0.27609838, -0.23855832,  0.31970457,  0.41328374,  0.19630546,
        0.34077982, -0.3704136 ,  0.17032295,  0.20643397,  0.34154881,
        0.1504677 ,  0.37392242,  0.25842101, -0.50553798,  0.35387764,
        0.41873554,  0.27067669,  0.31011181, -0.51092977, -0.10282291,
       -0.4126883 , -0.52383119, -0.82821877, -0.4585979 ,  0.2531493 ,
        0.34361492,  0.38418371, -0.22988404,  0.285816  , -0.40203361,
        0.38114577,  0.15781548,  0.27335741,  0.36371593,  0.36515941])

# assumes: import numpy as np, import seaborn as sns, import matplotlib.pyplot as plt
ax = sns.distplot(arr, hist=False, kde_kws={"shade": True}, norm_hist=True, label="density plot")

# dashed vertical reference lines
plt.plot(np.array([-0.208, -0.208]), np.array([0, 2]), color='grey', linestyle='--')
plt.plot(np.array([0.317, 0.317]), np.array([0, 2]), color='grey', linestyle='--')
ax.set_xlabel(r"similarity")
ax.set_ylabel(r"density")
plt.show()

So, I'd like to avoid this smoothing on the left side of the plot and leave one column of spacing, as on the rightmost side of the plot. How can I achieve this? Thanks!

One way to get the desired result is to use a custom window and kernel. Both the kernel and the window should depend on the position of the window's center relative to the interval's end points a and b.

Seaborn uses stats.gaussian_kde from scipy, or the KDE estimator from statsmodels if the latter is installed. As far as I know, gaussian_kde doesn't allow such tweaking. So we need to implement a custom KDE estimator.
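To see why a plain Gaussian KDE spills past the boundary, here is a minimal sketch. It assumes scipy is available and uses a made-up sample (`sample`, clipped to [-1, 1]) rather than the array from the question; it only measures how much probability mass gaussian_kde places below -1.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical stand-in data (not the array from the question):
# scores bounded in [-1, 1] with a pile-up at the lower bound.
rng = np.random.default_rng(0)
sample = np.clip(rng.normal(-0.7, 0.3, size=400), -1.0, 1.0)

kde_est = gaussian_kde(sample)

# Probability mass the Gaussian KDE places inside and below the valid range.
mass_inside = kde_est.integrate_box_1d(-1.0, 1.0)
mass_below = kde_est.integrate_box_1d(-np.inf, -1.0)

print(f"smallest data value: {sample.min():.3f}")
print(f"KDE mass in [-1, 1]: {mass_inside:.3f}")
print(f"KDE mass below -1:   {mass_below:.3f}")
```

Every data point is at least -1, yet the estimate still assigns some mass below -1, because each Gaussian kernel is centered on a data point and has unbounded support.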

Look at the following code snippet; it works and can be considered a starting point for further improvement.

import numpy as np
from scipy.integrate import quad

class kde:

    def __init__(self, a, b, kernel=None):
        self.a = a
        self.b = b

    def h(self, x):
        r""" h(x) window size depends on position of the center of the window relative to (a, b).

        _ r    ___________   <-- rthumb = r = height of the graph
        |     /           \ 
        |    /             \
        |---|--|----------|-|----- 
            a  a+r      b-r b

        """

        if x > (self.a + self.rthumb) and x < (self.b - self.rthumb):
            return self.rthumb
        elif x >= self.a and (x < self.a + self.rthumb):
            return x - self.a + np.finfo(float).eps
        elif (x <= self.b) and (x > self.b - self.rthumb):
            return self.b - x + np.finfo(float).eps
        else:
            return np.finfo(float).eps

    def kernel(self, x):
        return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

    def window(self, x):
        """ x - scalar value """
        def w(y):
            res = self.kernel((x - y) / self.h(x))  # gaussian kernel
            res[(y > self.b) | (y < self.a)] = 0.0  # window is zero outside [a, b]
            return res
        return w

    def fit(self, data):
        # Rule-of-thumb bandwidth: 1.06 * sigma * n^(-1/5)
        # (-0.2 instead of -1/5 avoids integer division under Python 2)
        self.rthumb = 1.06 * np.std(data) * np.power(len(data), -0.2)
        def _pdf(x):
            ww = self.window(x)
            return ww(data).sum() / len(data)
        val = quad(_pdf, self.a, self.b)[0]
        def pdf_norm(f):
            def pn(x):
                return f(x) / val
            return pn
        self.pdf = np.vectorize(pdf_norm(_pdf))
        return self

If we apply it to your data:

import matplotlib.pyplot as plt

k = kde(-1, 1)
x = np.linspace(-1, 1, 100)
plt.plot(x, k.fit(arr).pdf(x))

we get:

[Figure: the estimated pdf plotted over [-1, 1]]

If the sliding window is close to the edges of the interval, it is truncated:

plt.plot(x, k.window(0.9)(x), 'r.', x, k.window(0)(x), x, k.window(-.9)(x),'r.')
plt.show()

[Figure: kernel windows centered at -0.9, 0 and 0.9]

Note that this custom class produces normalized pdf estimates, i.e. AUC(kde.pdf) = 1.

EDITED:

I added a small value (the machine epsilon for floats) to h(x), and now everything works without warnings.
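The window-size rule h(x), including the epsilon floor, can also be written in vectorized form. This is a rough sketch rather than a drop-in replacement for the method above: `window_size` is a hypothetical helper name, and `r` stands for the rule-of-thumb bandwidth (`rthumb` in the class). It differs from the scalar version by at most one epsilon and also handles the exact points x = a + r and x = b - r consistently.

```python
import numpy as np

def window_size(x, a, b, r):
    """Bandwidth that shrinks linearly to ~0 near the edges of [a, b].

    Equals r in the interior, the distance to the nearest edge within
    r of an edge, and machine epsilon at or outside the edges.
    """
    x = np.asarray(x, dtype=float)
    eps = np.finfo(float).eps
    # Distance to the nearest edge; negative when x lies outside [a, b].
    dist = np.minimum(x - a, b - x)
    # Floor at eps so the kernel argument (x - y) / h never divides by zero.
    return np.clip(dist, eps, r)

# Bandwidth at a few positions for the interval [-1, 1] with r = 0.2
grid = np.array([-2.0, -1.0, -0.9, 0.0, 0.9, 1.0])
h_values = window_size(grid, -1.0, 1.0, 0.2)
print(h_values)
```

Taking the minimum of the two edge distances replaces the chain of if/elif branches, so the function accepts whole arrays without np.vectorize.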

In general, kernel density estimation assumes that the theoretical pdf is a smooth function. In your case, you can truncate the pdf obtained using gaussian_kde from scipy and then add a constant to the truncated estimate so that AUC = 1. Note that some distributions have discontinuities of the first kind, e.g. the pdf of the uniform distribution.
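The truncate-and-add-a-constant idea might look like the sketch below. It is an illustration under assumptions, not a definitive implementation: `truncated_kde_pdf` is a hypothetical helper name, the sample is made up, and the constant is chosen as the lost mass spread uniformly over [a, b].

```python
import numpy as np
from scipy.stats import gaussian_kde

def truncated_kde_pdf(data, a, b):
    """Gaussian KDE cut off outside [a, b], shifted up so it integrates to 1."""
    est = gaussian_kde(data)
    # Mass that the untruncated estimate keeps inside [a, b].
    mass_inside = est.integrate_box_1d(a, b)
    # Spread the mass lost outside [a, b] uniformly over the interval.
    offset = (1.0 - mass_inside) / (b - a)

    def pdf(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        inside = (x >= a) & (x <= b)
        return np.where(inside, est(x) + offset, 0.0)

    return pdf

# Hypothetical sample bounded in [-1, 1] (not the array from the question)
rng = np.random.default_rng(1)
sample = np.clip(rng.normal(-0.5, 0.4, size=300), -1.0, 1.0)

pdf = truncated_kde_pdf(sample, -1.0, 1.0)
x = np.linspace(-1.0, 1.0, 2001)
y = pdf(x)
# Trapezoidal rule written out by hand to check the area under the curve.
area = np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x))
print(f"area under truncated pdf: {area:.4f}")
```

The resulting curve drops to zero abruptly at -1 and 1, which is the discontinuity of the first kind mentioned above. Rescaling by 1 / mass_inside instead of adding a constant would also restore AUC = 1, but that is a different choice from the one described here.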
