
How to plot the difference between two density distributions

I've trained a model to predict a certain variable. When I use this model to predict that variable and compare the predictions to the actual values, I get the following two distributions.

[Image: density plots of the actual and predicted values]

The corresponding R data frame looks as follows:

x_var | kind
3.532 | actual
4.676 | actual
...
3.12 | predicted
6.78 | predicted

These two distributions obviously have slightly different means, quantiles, etc. What I would now like to do is combine them into one (especially as they are fairly similar), but not in the way shown in the following thread.

Instead, I would like to plot one density function that shows the difference between the actual and predicted values and lets me say, e.g., that 50% of the predictions are within -X% and +Y% of the actual values.

I've tried plotting just the difference predicted - actual, and also the difference relative to the mean of the respective group. However, neither approach has produced the result I want. With the plotted distribution, it is especially important to be able to make the statement above, i.e. that 50% of the predictions are within -X% and +Y% of the actual values. How can this be achieved?

Let's call the two distributions df_actual and df_predicted, then calculate

# dataframe with difference between two distributions
df_diff <- data.frame(x = df_predicted$x - df_actual$x, y = df_predicted$y - df_actual$y)

Then find the relative % difference with:

# mean relative difference in percent; df_diff already holds predicted - actual
x_diff <- mean(df_diff$x / df_actual$x) * 100
y_diff <- mean(df_diff$y / df_actual$y) * 100

This gives you the mean percentage deviation of the predictions, positive or negative, in x as well as in y. That is my take on it; also see this thread on displaying and measuring the area between two distribution curves.
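A single mean does not by itself give the "50% of the predictions are within -X% and +Y%" statement the question asks for. One way to get it, sketched here under the assumption that each prediction can be lined up with its actual value, is to take the 25th and 75th percentiles of the per-observation percentage error. The vectors actual and predicted and the simulated numbers are hypothetical stand-ins, not data from the question.

```r
# Hypothetical paired data: predicted[i] is the prediction for actual[i].
set.seed(1)
actual    <- runif(200, min = 2, max = 8)
predicted <- actual * (1 + rnorm(200, mean = 0.02, sd = 0.10))

# Signed percentage error of each prediction relative to its actual value
pct_error <- (predicted - actual) / actual * 100

# The central 50% of the errors lie between the 25th and 75th percentiles,
# which play the roles of -X% and +Y% in the statement above.
bounds <- quantile(pct_error, probs = c(0.25, 0.75))
print(bounds)

# Share of predictions that fall inside those bounds (about one half)
inside <- mean(pct_error >= bounds[1] & pct_error <= bounds[2])
print(inside)
```

Other coverage levels work the same way, for example probs = c(0.05, 0.95) for a 90% statement.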

I hope this helps. 我希望这有帮助。

ParthChaudhary is right: rather than subtracting the distributions, you want to analyze the distribution of the differences. But take care to subtract the values within corresponding pairs; otherwise the actual - predicted differences will be overshadowed by the variance of actual (and predicted) alone. That is, if you have something like:

x y type
0 10.9 actual
1 15.7 actual
2 25.3 actual
...
0 10 predicted
1 17 predicted
2 23 predicted
...

you would run merge(df[df$type=="actual",], df[df$type=="predicted",], by="x"), then calculate and plot y.x - y.y (merge appends the suffixes .x and .y to the two y columns).
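The merge step can be sketched as follows. This is a minimal example on simulated data: the data frame df mimics the x / y / type layout of the table above, and the values in it as well as the name paired are made up for illustration.

```r
# Hypothetical long-format data: x identifies the observation,
# y holds the value, type says whether it is actual or predicted.
set.seed(42)
n <- 100
actual_y <- rnorm(n, mean = 20, sd = 5)
df <- data.frame(
  x    = rep(seq_len(n) - 1, times = 2),
  y    = c(actual_y, actual_y + rnorm(n, mean = 0.5, sd = 1)),
  type = rep(c("actual", "predicted"), each = n)
)

# Pair each actual value with its prediction. merge() renames the clashing
# columns: y.x comes from the first argument (actual), y.y from the second.
paired <- merge(df[df$type == "actual", ], df[df$type == "predicted", ], by = "x")

# Paired difference, actual minus predicted
paired$diff <- paired$y.x - paired$y.y

# Density of the paired differences, with zero marked for reference
plot(density(paired$diff), main = "Actual minus predicted", xlab = "y.x - y.y")
abline(v = 0, lty = 2)
```

Because the subtraction happens within each pair, the spread of this density reflects the prediction error rather than the spread of the actual values themselves.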

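To better quantify whether the difference between the predicted and actual distributions is significant, consider the Kolmogorov-Smirnov test, which is available in R through the function ks.test.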
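A minimal sketch of that test, using simulated vectors named actual and predicted as hypothetical stand-ins for the two groups in the question. Note that the two-sample Kolmogorov-Smirnov test compares the two marginal distributions and treats the samples as independent, so it does not use the pairing between predictions and actual values.

```r
# Hypothetical samples standing in for the actual and predicted values
set.seed(7)
actual    <- rnorm(300, mean = 5, sd = 1)
predicted <- rnorm(300, mean = 5.2, sd = 1)

# Two-sample Kolmogorov-Smirnov test from the stats package
ks <- ks.test(actual, predicted)

# D is the largest vertical gap between the two empirical CDFs
print(ks$statistic)
# A small p-value suggests the two samples come from different distributions
print(ks$p.value)
```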
