How to plot the difference between two density distributions

Question

I've trained a model to predict a certain variable. When I now use this model to predict that value and compare the predictions to the actual values, I get the following two distributions.

The corresponding R data frame looks as follows:

x_var | kind
3.532 | actual
4.676 | actual
...
3.12 | predicted
6.78 | predicted

These two distributions obviously have slightly different means, quantiles, etc. What I would now like to do is combine these two distributions into one (especially as they are fairly similar), but not like in the following thread.

Instead, I would like to plot one density function that shows the difference between the actual and predicted values and enables me to say, e.g., that 50% of the predictions are within -X% and +Y% of the actual values.

I've tried just plotting the difference predicted - actual, and also the difference compared to the mean of the respective group. However, neither approach has produced my desired result. With the plotted distribution, it is especially important to be able to make the statement above, i.e. that 50% of the predictions are within -X% and +Y% of the actual values. How can this be achieved?

Answer 1

Let's call the two distributions df_actual and df_predicted, then calculate

# dataframe with difference between two distributions
df_diff <- data.frame(x = df_predicted$x - df_actual$x, y = df_predicted$y - df_actual$y)

Then find the relative % difference by:

# mean relative difference in percent (df_diff already holds predicted - actual)
x_diff <- mean(df_diff$x / df_actual$x) * 100
y_diff <- mean(df_diff$y / df_actual$y) * 100

This gives you the mean percentage by which the predictions lie above (+) or below (-) the actual values, in x as well as y. This is just my opinion; also see this thread on displaying and measuring the area between two distribution curves.

I hope this helps.
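To get from a single mean to the statement the question asks for, one option is to look at the quantiles of the per-observation relative differences. The following is only a sketch on made-up data: the vectors `actual` and `predicted` are hypothetical and are assumed to be aligned row by row, so that element i of each belongs to the same observation.

```r
# Toy data (assumption): element i of both vectors is the same observation.
set.seed(1)
actual    <- rnorm(200, mean = 5, sd = 1)
predicted <- actual * (1 + rnorm(200, mean = 0.02, sd = 0.10))

# Per-observation relative difference, in percent of the actual value.
pct_diff <- (predicted - actual) / actual * 100

# The central 50% of the predictions lies between the 25% and 75% quantiles.
bounds <- quantile(pct_diff, probs = c(0.25, 0.75))
cat(sprintf(
  "50%% of predictions are within %.1f%% and %+.1f%% of the actual values\n",
  bounds[[1]], bounds[[2]]
))

# Density of the relative differences with both bounds marked
# (drawn only when running interactively).
if (interactive()) {
  plot(density(pct_diff), main = "Relative prediction error (%)")
  abline(v = bounds, lty = 2)
}
```

Other coverage levels work the same way, e.g. `probs = c(0.05, 0.95)` for the central 90%.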

Answer 2

ParthChaudhary is right: rather than subtracting the distributions, you want to analyze the distribution of differences. But take care to subtract the values within corresponding pairs; otherwise the actual - predicted differences will be overshadowed by the variance of actual (and predicted) alone. I.e., if you have something like:

x y type
0 10.9 actual
1 15.7 actual
2 25.3 actual
...
0 10 predicted
1 17 predicted
2 23 predicted
...

you would merge(df[df$type=="actual",], df[df$type=="predicted",], by="x"), then calculate and plot y.x - y.y (merge appends the suffixes .x and .y to the two y columns, so y.x holds the actual values and y.y the predicted ones).
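As a rough illustration, the merge step can be run on the three example rows shown above. The data frame name `df` and the column names come from the answer; the names `paired`, `diff` and `dens` are made up for this sketch.

```r
# The three example pairs from the table above, in long format.
df <- data.frame(
  x    = c(0, 1, 2, 0, 1, 2),
  y    = c(10.9, 15.7, 25.3, 10, 17, 23),
  type = rep(c("actual", "predicted"), each = 3)
)

# Pair each actual value with its prediction via the shared key x.
# merge() names the clashing columns y.x (actual) and y.y (predicted).
paired <- merge(df[df$type == "actual", ],
                df[df$type == "predicted", ],
                by = "x")

# Difference within each pair: actual - predicted.
paired$diff <- paired$y.x - paired$y.y

# Kernel density estimate of the paired differences
# (drawn only when running interactively).
dens <- density(paired$diff)
if (interactive()) plot(dens, main = "actual - predicted")
```

With only three pairs the density estimate is of course not meaningful; the point is the pairing, which works the same on the full data.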

Answer 3

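To better quantify whether the difference between the predicted and the actual distribution is significant, consider the Kolmogorov-Smirnov test in R, which is available through the function ks.test.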
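A minimal sketch of such a test, using made-up samples (the vectors `actual` and `predicted` here are hypothetical, not the asker's data):

```r
# Toy samples (assumption): two numeric vectors with slightly different means.
set.seed(42)
actual    <- rnorm(300, mean = 5.0, sd = 1)
predicted <- rnorm(300, mean = 5.2, sd = 1)

# Two-sample Kolmogorov-Smirnov test from the stats package.
ks <- ks.test(actual, predicted)

ks$statistic  # D: largest vertical gap between the two empirical CDFs
ks$p.value    # small values suggest the two samples differ in distribution
```

Note that the two-sample test compares the samples as a whole and ignores which prediction belongs to which actual value, so it complements the paired differences rather than replacing them.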

3 answers

Answer 1: 0, 2017-05-19 10:55:44
Answer 2: 0, 2017-05-20 14:27:34
Answer 3: 0, 2018-08-22 17:34:39
