What's the basic algorithm for partitioning a data set into two groups such that the sum of the two groups' residual sums of squares is minimized? For example, consider the code below. Basically, how do you compute the value stored in best.cutpoint$RSS without iteratively testing each possible cutpoint?
set.seed(1)
ind.var <- runif(1000, 1, 50000)
dep.var <- ind.var * runif(1000, 2, 3) + rnorm(1000, 100, 500)
dat <- data.frame(ind.var, dep.var)
best.cutpoint <- list(RSS = Inf, cutpoint = NA)
for (cutpoint in sort(unique(ind.var))) {
  # partition the data at the candidate cutpoint
  dat1 <- dat[dat$ind.var > cutpoint, ]
  dat2 <- dat[!(dat$ind.var > cutpoint), ]
  # skip splits that leave fewer than two points on either side
  if (nrow(dat1) < 2 || nrow(dat2) < 2) {
    next
  }
  # fit a separate linear model to each partition
  mod1 <- lm(dep.var ~ ind.var, data = dat1)
  mod2 <- lm(dep.var ~ ind.var, data = dat2)
  # residual sum of squares of each fit
  part1.RSS <- sum(residuals(mod1) ^ 2)
  part2.RSS <- sum(residuals(mod2) ^ 2)
  total <- part1.RSS + part2.RSS
  # keep the split with the smallest combined RSS
  if (total < best.cutpoint$RSS) {
    best.cutpoint <- list(RSS = total, cutpoint = cutpoint)
  }
}
This produces the following result; the range of the independent variable is shown for reference.
> print(best.cutpoint)
$RSS
[1] 75241532557
$cutpoint
[1] 34351.46
> range(dat$ind.var)
[1] 66.73151 49996.52975
It sounds to me like you're asking how to determine a breakpoint for a segmented (piecewise linear) regression. Let me know if that's not the case.
The segmented package is useful for this purpose.
First, let's generate some data:
x <- 1:20
y <- c(1:10, seq(10, 100, by = 10))
plot(x,y)
In the resulting plot it's pretty obvious where the "breakpoint" is: the slope changes sharply around x = 10.
Next, let's fit a model with the segmented package:
library(segmented)
lin.mod <- lm(y~x)
segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi=14)
Did it find the breakpoint?
plot(segmented.mod)
points(x,y)
It looks like it did.
> segmented.mod
Call: segmented.lm(obj = lin.mod, seg.Z = ~x, psi = 14)
Meaningful coefficients of the linear terms:
(Intercept)            x         U1.x
     0.1818       0.9545       9.0455
Estimated Break-Point(s) psi1.x : 11.08
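If you'd rather pull these numbers out of the fitted object than read them off the printout, something like the following should work. This is a minimal sketch that refits the toy model above; it assumes the psi component of the fitted object is a matrix with an "Est." column and that the coefficient names match the printout (x and U1.x). The names bp, slope_before and slope_after are just illustrative.

```r
library(segmented)

# Same toy data and model as above
x <- 1:20
y <- c(1:10, seq(10, 100, by = 10))
lin.mod <- lm(y ~ x)
segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi = 14)

# Estimated breakpoint (assumes the psi matrix has an "Est." column)
bp <- unname(segmented.mod$psi[, "Est."])

# U1.x is the change in slope at the breakpoint, so the slope after
# the break is the sum of the two coefficients
slope_before <- coef(segmented.mod)[["x"]]
slope_after  <- slope_before + coef(segmented.mod)[["U1.x"]]

print(c(breakpoint = bp, before = slope_before, after = slope_after))
```

With the coefficients printed above, the slope after the break works out to 0.9545 + 9.0455 = 10, which matches how y was constructed.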
The seg.Z and psi arguments are defined in the package documentation as:
seg.Z: a formula with no response variable, such as seg.Z=~x1+x2, indicating the (continuous) explanatory variables having segmented relationships with the response. Currently, formulas involving functions, such as seg.Z=~log(x1) or seg.Z=~sqrt(x1), or selection operators, such as seg.Z=~d[,"x1"] or seg.Z=~d$x1, are not allowed.
psi: named list of vectors. The names have to match the variables of the seg.Z argument. Each vector includes starting values for the break-point(s) for the corresponding variable in seg.Z. If seg.Z includes only a variable, psi may be a numeric vector. A NA value means that ‘K’ quantiles (or equally spaced values) are used as starting values; K is fixed via the seg.control auxiliary function.
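One caveat: segmented fits a continuous broken-line model, so its objective is not quite the same as the one in the question, where the two lines are fitted independently. If the exact quantity from the question is what you want, every cutpoint still has to be evaluated, but you don't need to refit lm() each time. The RSS of a simple linear regression depends only on a handful of sums (RSS = Syy - Sxy^2 / Sxx), and those can be updated with running totals after sorting. Below is a rough sketch of that idea; best_split is a hypothetical helper name, and it assumes the x values are all distinct, as they are for the runif data in the question.

```r
# Sketch: exhaustive split search using running sums instead of repeated lm() fits.
# best_split is a hypothetical helper; it assumes distinct x values.
best_split <- function(x, y) {
  o  <- order(x)
  xs <- x[o]
  # Centre both variables to limit cancellation error in the sums of squares;
  # the RSS of a model with an intercept is unchanged by shifting x or y.
  xc <- xs - mean(xs)
  yc <- y[o] - mean(y)
  n  <- length(xs)
  stopifnot(n >= 4, !anyDuplicated(xs))

  # RSS of a simple linear regression on m points, from raw sums
  rss <- function(m, sx, sy, sxx, sxy, syy) {
    Sxx <- sxx - sx^2 / m
    Sxy <- sxy - sx * sy / m
    Syy <- syy - sy^2 / m
    Syy - Sxy^2 / Sxx
  }

  # Left group = first k sorted points (x <= cutpoint); both groups keep >= 2 points
  k   <- 2:(n - 2)
  cx  <- cumsum(xc)
  cy  <- cumsum(yc)
  cxx <- cumsum(xc^2)
  cxy <- cumsum(xc * yc)
  cyy <- cumsum(yc^2)

  left  <- rss(k, cx[k], cy[k], cxx[k], cxy[k], cyy[k])
  right <- rss(n - k, cx[n] - cx[k], cy[n] - cy[k],
               cxx[n] - cxx[k], cxy[n] - cxy[k], cyy[n] - cyy[k])
  total <- left + right

  i <- which.min(total)
  list(RSS = total[i], cutpoint = xs[k[i]])
}
```

Calling best_split(dat$ind.var, dat$dep.var) on the question's data should agree with the loop's result up to floating-point error. After the initial sort, the search is linear in the number of points rather than requiring two model fits per candidate cutpoint.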