
How to plot the difference between two density distributions

I've trained a model to predict a certain variable. When I use this model to predict that variable and compare the predictions to the actual values, I get the following two distributions.

[Image: density plots of the actual and predicted values]

The corresponding R Data Frame looks as follows:

x_var | kind
3.532 | actual
4.676 | actual
...
3.12 | predicted
6.78 | predicted

These two distributions obviously have slightly different means, quantiles, etc. What I would now like to do is combine them into one (especially as they are fairly similar), but not as in the following thread.

Instead, I would like to plot one density function that shows the difference between the actual and predicted values and enables me to say, e.g., that 50% of the predictions are within -X% and +Y% of the actual values.

I've tried plotting the difference predicted - actual, and also the difference from the mean of the respective group. However, neither approach has produced the desired result. It is especially important that the plotted distribution lets me make the above statement, i.e. that 50% of the predictions are within -X% and +Y% of the actual values. How can this be achieved?

Let's treat the two distributions as data frames df_actual and df_predicted, then calculate:

# dataframe with difference between two distributions
df_diff <- data.frame(x = df_predicted$x - df_actual$x, y = df_predicted$y - df_actual$y)

Then find the relative % difference:

# df_diff already holds predicted - actual, so divide it by the actual values directly
x_diff <- mean(df_diff$x / df_actual$x) * 100
y_diff <- mean(df_diff$y / df_actual$y) * 100

This gives the average percentage by which the predictions lie above (+) or below (-) the actual values, in x as well as in y. This is my opinion; also see this thread for displaying and measuring the area between two distribution curves.
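To get a statement of the form "50% of the predictions are within -X% and +Y% of the actual values", one option is to take quantiles of the per-observation relative error rather than its mean. The following is only a minimal sketch: it assumes that row i of df_actual and row i of df_predicted refer to the same observation, and the data are made up for illustration.

```r
# Hypothetical paired data: row i of each data frame is the same observation.
set.seed(1)
df_actual    <- data.frame(x = runif(200, min = 2, max = 8))
df_predicted <- data.frame(x = df_actual$x * (1 + rnorm(200, sd = 0.1)))

# Relative error of each prediction, in percent of the actual value.
rel_err <- (df_predicted$x - df_actual$x) / df_actual$x * 100

# The 25% and 75% quantiles bound the central 50% of the predictions.
bounds <- quantile(rel_err, probs = c(0.25, 0.75))
print(bounds)

# Share of predictions that fall inside those bounds (about one half).
inside <- mean(rel_err >= bounds[1] & rel_err <= bounds[2])
print(inside)
```

The lower bound plays the role of -X% and the upper bound that of +Y%. Using the central 50% is one choice among several; any other pair of quantiles that spans half of the data would support the same kind of statement.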

I hope this helps.

ParthChaudhary is right - rather than subtracting the distributions, you want to analyze the distribution of differences. But take care to subtract the values within corresponding pairs, or otherwise the actual - predicted differences will be overshadowed by the variance of actual (and predicted) alone. I.e., if you have something like:

x y type
0 10.9 actual
1 15.7 actual
2 25.3 actual
...
0 10 predicted
1 17 predicted
2 23 predicted
...

you would merge(df[df$type=="actual",], df[df$type=="predicted",], by="x"), then calculate and plot y.x - y.y.
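The merge-and-subtract step can be sketched as follows. The data are invented to match the layout above, and base R graphics are used only to keep the example free of extra packages. After merge(), the clashing column y is suffixed, so y.x holds the actual values and y.y the predicted ones.

```r
# Hypothetical long-format data in the layout shown above.
set.seed(42)
n <- 100
actual <- runif(n, min = 10, max = 30)
df <- data.frame(
  x    = rep(seq_len(n) - 1, times = 2),
  y    = c(actual, actual + rnorm(n, sd = 1.5)),
  type = rep(c("actual", "predicted"), each = n)
)

# Pair each actual value with its prediction via the shared key x.
paired <- merge(df[df$type == "actual", ], df[df$type == "predicted", ], by = "x")

# Distribution of the paired differences (actual - predicted).
paired$diff <- paired$y.x - paired$y.y
dens <- density(paired$diff)
plot(dens, main = "Actual - predicted", xlab = "Difference")

# Dashed lines mark the quartiles, which enclose the central 50%.
abline(v = quantile(paired$diff, c(0.25, 0.75)), lty = 2)
```

Dividing paired$diff by paired$y.x before plotting would give the same picture in relative terms, which is closer to the "within -X% and +Y%" wording of the question.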

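To better quantify whether the difference between the predicted and actual distributions is significant, consider using the Kolmogorov-Smirnov test in R, which is available through the function ks.test.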
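A minimal sketch of such a test, using invented samples in place of the real actual and predicted values (the variable names are illustrative only):

```r
# Hypothetical samples standing in for the actual and predicted values.
set.seed(7)
actual    <- rnorm(300, mean = 5, sd = 1)
predicted <- rnorm(300, mean = 5.2, sd = 1.1)

# Two-sample Kolmogorov-Smirnov test: the null hypothesis is that both
# samples are drawn from the same continuous distribution.
res <- ks.test(actual, predicted)
print(res$statistic)  # D: largest gap between the two empirical CDFs
print(res$p.value)    # small values are evidence against the null
```

Note that the two-sample test compares the two marginal distributions and treats the samples as independent, so it does not use the pairing between each prediction and its actual value.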


 