Issues with predict function in R

I'm having trouble with the predict() function in R and hope I can get some help. Consider a dataset with two columns: 1) Y and 2) X.

My goal is to fit a natural spline, get a 95% interval around the fit (a prediction interval in the code below), and mark the points that fall outside that interval as outliers. Here is what I do:

1) Initially, no point in the dataset is marked as an outlier.
2) I fit a natural spline and, using its 95% interval, mark the points outside the interval as outliers.
3) I then exclude the outliers marked in step 2, fit another natural spline, and use its 95% interval to mark outliers again.

Issue: Suppose my initial dataset has 1000 observations. In the first round I mark 23 of them as outliers. I then fit another natural spline (call it fit.ns) using the remaining 977 non-outliers. When I use ALL X's (all 1000) to get predicted values from this new fit, I get a warning AND an error saying that newdata in my predict() call has 1000 observations but the fit has 977. The returned predictions also contain 977 values, NOT 1000.

My predict() code:

# Fitting a natural spline (the knots are supplied explicitly)
fit.ns <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knots]))

# Getting Fitted Values and 95% CI:
fit.ns.values <- predict(fit.ns, newdata = data.frame(Time = data.temp$Time), 
interval="prediction", level = 1 - 0.05) # ??? PROBLEM

I really appreciate your help.

It seems that I cannot upload the dataset, but here is my full code:

library(splines)
ns.knot <- 10
for (i in 1:2){
  # Exclude outliers so that the spline fit is not affected by them
  data.ns <- data.temp[data.temp$OutlierInd == 0,] 
  data.ns$BeatNum <- 1:nrow(data.ns) # BeatNum acts as a row number and is an auxiliary variable

  # Place Holder for Natural Spline results:
  data.temp$IBI.NSfit <- rep(NA, nrow(data.temp))
  data.temp$IBI.NSfit.L95 <- rep(NA, nrow(data.temp))
  data.temp$IBI.NSfit.U95 <- rep(NA, nrow(data.temp))

  # defining the knots in n.s.:
  knots <- (data.ns$BeatNum)[seq(ns.knot, (length(data.ns$BeatNum) - ns.knot), by = ns.knot)]

  # Fitting a natural spline (the knots are supplied explicitly)
  fit.ns <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knots]))

  # Getting Fitted Values and 95% CI:
  fit.ns.values <- predict(fit.ns, newdata = data.frame(Time = data.temp$Time), interval="prediction", level = 1 - 0.05) # ??? PROBLEM
  data.temp$IBI.NSfit <- fit.ns.values[,1]
  data.temp$IBI.NSfit.L95 <- fit.ns.values[,2]
  data.temp$IBI.NSfit.U95 <- fit.ns.values[,3]

  # Updating OutlierInd based on Natural Spline 95% CI:
  data.temp$OutlierInd <- ifelse(data.temp$IBI < data.temp$IBI.NSfit.U95 & data.temp$IBI > data.temp$IBI.NSfit.L95, 0, 1)
}

Finally, I found the solution:

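When I fit the model, I should pass the data frame through the "data =" argument and refer to the columns by their bare names in the formula. With data.ns$Time written into the formula, predict() cannot match the Time column of newdata and falls back to the 977 rows of data.ns. In other words, instead of the command below,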

# Fitting a natural spline (the knots are supplied explicitly)
fit.ns <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knots]))

I should use the command below instead:

# Fitting a natural spline (the knots are supplied explicitly)
fit.ns <- lm(IBI ~ ns(Time, knots = Time[knots]), data = data.ns)

Then predict() picks up the Time column of newdata and returns one row per new observation.
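
For anyone who wants to see the difference without the original dataset, here is a minimal sketch on synthetic data. The values, the 23 randomly flagged rows, and the names fit.bad, fit.ok and knot.idx are made up for illustration; only the two lm() styles come from the question.

```r
library(splines)

set.seed(1)
# Synthetic stand-in for the question's data (hypothetical values)
data.temp <- data.frame(Time = sort(runif(1000, 0, 100)))
data.temp$IBI <- sin(data.temp$Time / 10) + rnorm(1000, sd = 0.2)
data.temp$OutlierInd <- 0
data.temp$OutlierInd[sample(1000, 23)] <- 1   # pretend 23 points were flagged

data.ns <- data.temp[data.temp$OutlierInd == 0, ]    # 977 rows
knot.idx <- seq(100, nrow(data.ns) - 100, by = 100)  # rows used as knots

# Style 1: columns referenced as data.ns$..., so newdata cannot be matched
fit.bad <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knot.idx]))
pred.bad <- suppressWarnings(   # silences the "'newdata' had 1000 rows" warning
  predict(fit.bad, newdata = data.frame(Time = data.temp$Time),
          interval = "prediction", level = 0.95)
)

# Style 2: bare column names plus data =, so newdata$Time is used
fit.ok <- lm(IBI ~ ns(Time, knots = Time[knot.idx]), data = data.ns)
pred.ok <- predict(fit.ok, newdata = data.frame(Time = data.temp$Time),
                   interval = "prediction", level = 0.95)

nrow(pred.bad)  # 977
nrow(pred.ok)   # 1000
```

Only the second fit gives a result that can be assigned back into the 1000-row data.temp.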

I wanted to add a comment, but my rep level doesn't allow that.

Anyway, I think it is a well-documented point that predict() looks in newdata for exactly the variable names that appear in the model formula. So, in my experience, the best way to avoid this error is to use plain column names in the formula and pass the data frame through data =.
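
One way to check which names a fitted model carries is to look at the term labels stored in the fit. This is only a small sketch; the data frame d and the names fit.dollar and fit.named are hypothetical.

```r
library(splines)

set.seed(42)
d <- data.frame(x = rnorm(30), y = rnorm(30))

fit.dollar <- lm(d$y ~ ns(d$x, 3))        # columns referenced through d$
fit.named  <- lm(y ~ ns(x, 3), data = d)  # bare names plus data =

# The term label is the expression predict() re-evaluates for new data
attr(terms(fit.dollar), "term.labels")  # "ns(d$x, 3)": always reads d itself
attr(terms(fit.named), "term.labels")   # "ns(x, 3)": reads x from newdata
```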

So, in the case above, please define a data frame just for fitting and a second one for prediction, like this:

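library(splines)

# Fit: refer to columns by name and pass the data frame via data =
fit.data <- data.frame(y = rnorm(30), x = rnorm(30))
fit.ns <- lm(y ~ ns(x, 3), data = fit.data)

# Predict: newdata must contain a column named x; the interval width is set by level
pred.data <- data.frame(y = rnorm(10), x = rnorm(10))
pred.fit <- predict(fit.ns, newdata = data.frame(x = pred.data$x),
                    interval = "confidence", level = 0.95)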

IMHO, this should get rid of your error.
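
Applied to the two-pass loop from the question, the same idea might look like the sketch below. The data are synthetic (a sine trend plus ten injected spikes), and the knot spacing of 50 rows and the names spikes, knot.rows and knot.times are assumptions made for illustration, not values from the original post.

```r
library(splines)

set.seed(123)
n <- 500
# Synthetic stand-in for data.temp (hypothetical values)
data.temp <- data.frame(Time = seq_len(n))
data.temp$IBI <- 800 + 50 * sin(data.temp$Time / 40) + rnorm(n, sd = 10)
spikes <- sample(n, 10)
data.temp$IBI[spikes] <- data.temp$IBI[spikes] + 200  # injected artificial outliers
data.temp$OutlierInd <- 0

ns.knot <- 50  # one knot every 50 retained rows (assumed spacing)

for (i in 1:2) {
  # Fit only on points not currently flagged as outliers
  data.ns <- data.temp[data.temp$OutlierInd == 0, ]
  knot.rows <- seq(ns.knot, nrow(data.ns) - ns.knot, by = ns.knot)
  knot.times <- data.ns$Time[knot.rows]

  # Bare column names plus data =, so predict() can use newdata
  fit.ns <- lm(IBI ~ ns(Time, knots = knot.times), data = data.ns)

  # Predict for ALL rows, including the ones excluded from the fit
  pred <- predict(fit.ns, newdata = data.temp,
                  interval = "prediction", level = 0.95)
  data.temp$IBI.NSfit     <- pred[, "fit"]
  data.temp$IBI.NSfit.L95 <- pred[, "lwr"]
  data.temp$IBI.NSfit.U95 <- pred[, "upr"]

  # Flag points outside the 95% prediction interval
  data.temp$OutlierInd <- ifelse(data.temp$IBI > data.temp$IBI.NSfit.L95 &
                                 data.temp$IBI < data.temp$IBI.NSfit.U95, 0, 1)
}

table(data.temp$OutlierInd)
```

As in the original loop, the flags are recomputed from scratch on each pass, so a point flagged in the first pass can be unflagged in the second.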
