
Compute model efficiency with leave-one-subject-out cross-validation in R

I have a dataframe df:

structure(list(x = c(49, 50, 51, 52, 53, 54, 55, 56, 1, 2, 3, 
    4, 5, 14, 15, 16, 17, 163, 164, 165, 153, 154, 72, 38, 39, 40, 
    23, 13, 14, 15, 5, 6, 74, 75, 76, 77, 78, 79, 80, 81, 82, 127, 
    128, 129, 130, 131, 132, 71, 72, 73, 74, 75, 76, 2, 3, 4, 5, 
    6, 99, 100, 101, 10, 11, 3, 30, 50, 51, 52, 53, 54, 56, 64, 66, 
    67, 68, 69, 34, 35, 37, 39, 2, 46, 47, 17, 18, 99, 100, 102, 
    103, 84, 85, 86, 87, 88, 67, 70, 72), y = c(2268.14043972082, 
    2147.62290922552, 2269.1387550775, 2247.31983098201, 1903.39138268307, 
    2174.78291538358, 2359.51909126411, 2488.39004804939, 212.851575751527, 
    461.398994384333, 567.150629704352, 781.775113821961, 918.303706148872, 
    1107.37695799186, 1160.80594193377, 1412.61328924168, 1689.48879626486, 
    685.154353165934, 574.088067465695, 650.30821636616, 494.185166497016, 
    436.312162090908, 641.231373456365, 494.374217984441, 201.745910386788, 
    486.030122926459, 483.045965372879, 265.693628564011, 285.116345260642, 
    291.023782284086, 229.606221692753, 230.952761338012, 1089.06303295676, 
    1255.88808925333, 1087.75402177912, 1068.248897182, 1212.17254891642, 
    884.222588171535, 938.887718005513, 863.582247020735, 1065.91969416523, 
    790.570635510725, 834.500908313203, 710.755041345197, 814.002362551197, 
    726.814950022846, 828.559687148314, 611.564698476112, 603.238720579422, 
    524.322001078981, 565.296378873638, 532.431853589369, 597.174114277044, 
    260.737164468854, 306.72700499362, 283.410379620422, 366.813913489692, 
    387.570173754128, 606.075737104722, 686.408686154056, 705.914347674276, 
    388.602676983443, 477.858510450125, 128.198042456082, 535.519377609133, 
    1893.38468471169, 1819.83262739703, 1827.31409981102, 1640.5816780664, 
    1689.0365549922, 2112.67917439342, 1028.8780498564, 1098.54431357711, 
    1265.26965941035, 1129.58344809909, 820.922447928053, 749.343583476846, 
    779.678206156474, 646.575242339517, 733.953282899613, 461.156280127354, 
    1184.81825619942, 1281.2920902365, 906.813018662913, 798.186995701282, 
    831.365377249207, 764.519073183124, 672.076289062505, 669.879217186302, 
    1265.48484068702, 1193.29000986667, 1156.81486114406, 1199.7373066445, 
    1116.24029749935, 1341.47673353751, 1401.44881976186, 1640.27575962036
    ), ID = 1:97), .Names = c("x", "y", "ID"), row.names = c(NA, 
    -97L), class = "data.frame")

I now want to compute the model efficiency of a non-linear model using leave-one-ID-out cross-validation. I implemented the following code:

library(stats)
library(hydroGOF)

id <- unique(df$ID)
for (i in id){
  fit1 <- try(nls(y~A*x^3+B*x^2+C*x+D, data = df[df$ID != i,], start = list(A=0.02, B=-0.6, C= 50, D=200)), silent=TRUE)
  Out <- if (inherits(fit1, "nls")) NSE(sim = predict(fit1, newdatadata=df[df$ID==i,]), obs = df$y, na.rm=T)
  }

However, I get this error message:

Error in valindex.default(sim, obs) : 
  Invalid argument: 'length(sim) != length(obs)' !! (96!=97) !!

Can someone help me out with that?

You have a few small mistakes and one big logic mistake in the above code, which I address below.

First of all, the corrected code should look like this:

library(stats)
library(hydroGOF)

id <- unique(df$ID)
for (i in id){
  # fit on all rows except the held-out ID
  fit1 <- try(nls(y~A*x^3+B*x^2+C*x+D, data = df[df$ID != i,], start = list(A=0.02, B=-0.6, C= 50, D=200)), silent=TRUE)
  # compare the prediction for the held-out row against its single observed value
  Out <- if (inherits(fit1, "nls")) NSE(sim = predict(fit1, newdata=df[df$ID==i,]), obs = df$y[df$ID==i], na.rm=T)
}

In your question, inside the NSE call you passed newdatadata=df[df$ID==i,] to predict instead of newdata=df[df$ID==i,], i.e. you misspelled the argument name with an extra "data". Because of the misspelling, predict silently ignores the argument and returns the fitted values for the 96 training rows. Also, the obs argument does not have the correct length: you pass the whole y column (97 values), while you only need df$y[df$ID==i], which has length one and matches the prediction for the held-out ID.
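You can verify the misspelling behaviour yourself. Here is a minimal sketch using ID 1 as the held-out fold, with the same formula and starting values as above (assuming the fit converges, as it does for your data):

fit <- nls(y ~ A*x^3 + B*x^2 + C*x + D, data = df[df$ID != 1, ],
           start = list(A = 0.02, B = -0.6, C = 50, D = 200))

# 'newdatadata' is not a real argument of predict, so it is swallowed
# by '...'; predict then returns the 96 fitted values of the training set
length(predict(fit, newdatadata = df[df$ID == 1, ]))  # 96
length(predict(fit, newdata = df[df$ID == 1, ]))      # 1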

With these corrections, the code above will run.

However, once you run it, you will see a warning saying that NSE cannot be computed when 'sum((obs - mean(obs))^2) = 0'. In your case each fold holds a single observation, so sum((obs - mean(obs))^2) is always zero.
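For reference, NSE is defined as 1 - sum((sim - obs)^2) / sum((obs - mean(obs))^2), so with a single observation the denominator is necessarily zero. A quick check:

obs <- df$y[df$ID == 1]   # each fold contains exactly one observation
sum((obs - mean(obs))^2)  # 0 -> the NSE denominator, so NSE is undefined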

So you cannot use this technique with NSE: it fails by definition, since you are trying to calculate NSE on a single observation. Instead, collect all the leave-one-out predictions in a vector and then compute NSE once, comparing that vector against df$y. That will work.

What I mean is the following (leave-one-out CV):

Out <- c()
id <- unique(df$ID)
for (i in id){
  fit1 <- try(nls(y~A*x^3+B*x^2+C*x+D, data = df[df$ID != i,], start = list(A=0.02, B=-0.6, C= 50, D=200)), silent=TRUE)
  # store the prediction for the single held-out row (NA if the fit failed)
  Out[i] <- if (inherits(fit1, "nls")) predict(fit1, newdata=df[df$ID==i,]) else NA
}

Which will now work:

> NSE(Out, df$y)
[1] 0.3440862
