Partition data into two separate groups such that the residual sum of squares with one continuous predictor is minimized

What is the basic algorithm for partitioning a data set into two groups such that the sum of the two groups' residual sums of squares is minimized? For example, consider the code below. Specifically, how can the value stored in best.cutpoint$RSS be computed without iteratively testing each possible cutpoint?

set.seed(1)
ind.var <- runif(1000, 1, 50000)
dep.var <- ind.var * runif(1000, 2, 3) + rnorm(1000, 100, 500)

dat <- data.frame(ind.var, dep.var)

best.cutpoint <- list(RSS = Inf, cutpoint = NA)
for(cutpoint in sort(unique(ind.var))){
    # partition data
    dat1 <- dat[dat$ind.var > cutpoint,]
    dat2 <- dat[!(dat$ind.var > cutpoint),]

    # skip cutpoints that leave fewer than two rows in either group
    if(nrow(dat1) < 2 || nrow(dat2) < 2){
        next
    }
    # fit a separate linear model to each group
    mod1 <- lm(dep.var ~ ind.var, data = dat1)
    mod2 <- lm(dep.var ~ ind.var, data = dat2)

    # residual sum of squares of each fit
    part1.RSS <- sum(residuals(mod1) ^ 2)
    part2.RSS <- sum(residuals(mod2) ^ 2)

    total <- part1.RSS + part2.RSS

    if(total < best.cutpoint$RSS){
        best.cutpoint <- list(RSS = total, cutpoint = cutpoint)
    }
}

This produces the following result; the range of the predictor is shown for reference.

> print(best.cutpoint)
$RSS
[1] 75241532557

$cutpoint
[1] 34351.46

> range(dat$ind.var)
[1]    66.73151 49996.52975

It sounds to me like you're asking how to determine a breakpoint for a segmented or piecewise linear regression. Let me know if that's not the case.

The segmented package is useful for this purpose.

First, let's generate some data:

x <- 1:20
y <- c(1:10, seq(10, 100, by = 10))
plot(x,y)

The data look like this:

[Figure: scatter plot of y against x]

It's pretty obvious where the "breakpoint" is.

Next, let's fit a model with the segmented package:

library(segmented)
lin.mod <- lm(y~x)
segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi=14)

Did it find the breakpoint?

plot(segmented.mod)
points(x,y)

[Figure: fitted segmented regression line with the data points overlaid]

It looks like it did.

> segmented.mod
Call: segmented.lm(obj = lin.mod, seg.Z = ~x, psi = 14)

Meaningful coefficients of the linear terms:
(Intercept)            x         U1.x  
     0.1818       0.9545       9.0455  

Estimated Break-Point(s) psi1.x : 11.08 

The seg.Z and psi arguments are documented as follows:

seg.Z: a formula with no response variable, such as seg.Z=~x1+x2, indicating the (continuous) explanatory variables having segmented relationships with the response. Currently, formulas involving functions, such as seg.Z=~log(x1) or seg.Z=~sqrt(x1), or selection operators, such as seg.Z=~d[,"x1"] or seg.Z=~d$x1, are not allowed.

psi: named list of vectors. The names have to match the variables of the seg.Z argument. Each vector includes starting values for the break-point(s) for the corresponding variable in seg.Z. If seg.Z includes only a variable, psi may be a numeric vector. A NA value means that ‘K’ quantiles (or equally spaced values) are used as starting values; K is fixed via the seg.control auxiliary function.
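
One caveat: segmented fits a single continuous piecewise line, whereas the loop in the question fits two independent lines, so the two criteria are not identical. If the question's exact criterion is what you need, the search over cutpoints can still avoid calling lm() repeatedly. The sketch below assumes a simple one-predictor regression, where the RSS of a least-squares line equals Syy - Sxy^2 / Sxx, and gets those quantities for every candidate split from cumulative sums after one sort. The function name best_split_rss and its min.n argument are hypothetical names chosen for illustration, not part of any package.

```r
# Sketch: best two-group split for simple linear regression using
# cumulative sums instead of refitting lm() at every cutpoint.
# best_split_rss and min.n are illustrative names (assumptions).
best_split_rss <- function(x, y, min.n = 2) {
  n <- length(x)
  stopifnot(length(y) == n, n >= 2 * min.n)

  o  <- order(x)
  xs <- x[o]                 # sorted predictor, original scale
  # centre both variables to reduce cancellation error in the sums
  xc <- xs - mean(xs)
  yc <- y[o] - mean(y)

  # RSS of a least-squares line from sums over a group of size m:
  # RSS = Syy - Sxy^2 / Sxx
  rss <- function(m, sx, sy, sxx, syy, sxy) {
    Sxx <- sxx - sx^2 / m
    Syy <- syy - sy^2 / m
    Sxy <- sxy - sx * sy / m
    Syy - Sxy^2 / Sxx
  }

  # running sums over the k smallest x values, k = 1..n
  csx  <- cumsum(xc);   csy  <- cumsum(yc)
  csxx <- cumsum(xc^2); csyy <- cumsum(yc^2)
  csxy <- cumsum(xc * yc)

  k <- min.n:(n - min.n)     # candidate sizes of the lower group
  lower <- rss(k, csx[k], csy[k], csxx[k], csyy[k], csxy[k])
  upper <- rss(n - k,
               csx[n]  - csx[k],  csy[n]  - csy[k],
               csxx[n] - csxx[k], csyy[n] - csyy[k],
               csxy[n] - csxy[k])
  total <- lower + upper

  # a split between two tied x values is not a valid cutpoint
  total[xs[k] >= xs[k + 1]] <- NA

  i <- which.min(total)      # ignores NA and NaN entries
  # cutpoint is the largest x in the lower group, as in the question's loop
  list(RSS = total[i], cutpoint = xs[k[i]])
}

# Usage on data generated the same way as in the question
set.seed(1)
ind.var <- runif(1000, 1, 50000)
dep.var <- ind.var * runif(1000, 2, 3) + rnorm(1000, 100, 500)
print(best_split_rss(ind.var, dep.var))
```

This still evaluates every candidate cutpoint, but each evaluation is constant time, so the whole search costs one sort plus a linear pass. Because it minimizes the same quantity as the loop in the question, the result should agree with that loop up to floating-point rounding.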
