Partition data into two separate groups s.t. residual sum of squares with one continuous predictor is minimized

Question

What's the basic algorithm for partitioning a set of data into two groups such that the sum of the two groups' residual sums of squares is minimized? For example, consider the code below. Specifically, how do you compute the value stored in best.cutpoint$RSS without iteratively testing each possible cutpoint?

set.seed(1)
ind.var <- runif(1000, 1, 50000)
dep.var <- ind.var * runif(1000, 2, 3) + rnorm(1000, 100, 500)

dat <- data.frame(ind.var, dep.var)

best.cutpoint <- list(RSS = Inf, cutpoint = NA)
for(cutpoint in sort(unique(ind.var))){
    # partition data
    dat1 <- dat[dat$ind.var > cutpoint,]
    dat2 <- dat[!(dat$ind.var > cutpoint),]

    # skip cutpoints that leave either partition with fewer than 2 rows
    if(nrow(dat1) < 2 || nrow(dat2) < 2){
        next
    }
    # fit a separate linear model to each partition
    mod1 <- lm(dep.var ~ ind.var, data = dat1)
    mod2 <- lm(dep.var ~ ind.var, data = dat2)

    # calculate the RSS of each fit
    part1.RSS <- sum(residuals(mod1) ^ 2)
    part2.RSS <- sum(residuals(mod2) ^ 2)

    total <- part1.RSS + part2.RSS

    if(total < best.cutpoint$RSS){
        best.cutpoint <- list(RSS = total, cutpoint = cutpoint)
    }
}

This produces the following result, shown together with the range of the predictor:

> print(best.cutpoint)
$RSS
[1] 75241532557

$cutpoint
[1] 34351.46

> range(dat$ind.var)
[1]    66.73151 49996.52975

Answer 1

It sounds to me like you're asking how to determine a breakpoint for a segmented (piecewise linear) regression. Let me know if that's not the case.

The segmented package is useful for this purpose.

First, let's generate some data:

x <- 1:20
y <- c(1:10, seq(10, 100, by = 10))
plot(x,y)

The data look like this:

[Figure: scatter plot of y against x]

It's pretty obvious where the "breakpoint" is.
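
A quick numeric check of that impression, as a minimal base R sketch: because x advances in steps of 1, the first differences of y are the slopes between neighbouring points. The variable name slopes is only illustrative.

```r
x <- 1:20
y <- c(1:10, seq(10, 100, by = 10))

# With unit spacing in x, diff(y) is the slope between neighbouring points.
slopes <- diff(y)
print(slopes)

# Positions where the slope differs from the initial slope of 1.
print(which(slopes != 1))
```

The slope is 1 up to x = 10, flat between x = 10 and x = 11, and 10 from there on, so the change in regime sits between x = 10 and x = 11.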

Next, let's fit a model with the segmented package:

library(segmented)
lin.mod <- lm(y~x)
segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi=14)

Did it find the breakpoint?

plot(segmented.mod)
points(x,y)

[Figure: the fitted segmented regression line with the data points overlaid]

It looks like it did.
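
As a cross-check that does not need the package, here is a rough base R sketch of the same kind of model: a continuous broken line whose breakpoint is found by a plain grid search. This is not how segmented estimates the breakpoint internally; hinge.rss, candidates and the 0.01 grid step are assumptions made up for this illustration.

```r
x <- 1:20
y <- c(1:10, seq(10, 100, by = 10))

# RSS of a continuous broken-line fit with the breakpoint fixed at psi.
# pmax(x - psi, 0) is zero left of psi and grows linearly to the right,
# so its coefficient is the change in slope at the breakpoint.
hinge.rss <- function(psi) {
    fit <- lm(y ~ x + pmax(x - psi, 0))
    sum(residuals(fit)^2)
}

# Candidate breakpoints strictly inside the range of x (assumed grid).
candidates <- seq(2, 19, by = 0.01)
rss <- sapply(candidates, hinge.rss)

# Keep the candidate with the smallest RSS.
best.psi <- candidates[which.min(rss)]
print(best.psi)
print(min(rss))
```

The grid minimum should land close to the break-point that segmented reports for the same data.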

> segmented.mod
Call: segmented.lm(obj = lin.mod, seg.Z = ~x, psi = 14)

Meaningful coefficients of the linear terms:
(Intercept)            x         U1.x  
     0.1818       0.9545       9.0455  

Estimated Break-Point(s) psi1.x : 11.08

The package documentation defines the seg.Z and psi arguments as:

seg.Z: a formula with no response variable, such as seg.Z=~x1+x2, indicating the
(continuous) explanatory variables having segmented relationships with the response.
Currently, formulas involving functions, such as seg.Z=~log(x1) or
seg.Z=~sqrt(x1), or selection operators, such as seg.Z=~d[,"x1"] or seg.Z=~d$x1,
are not allowed.

psi: named list of vectors. The names have to match the variables of the seg.Z
argument. Each vector includes starting values for the break-point(s) for the
corresponding variable in seg.Z. If seg.Z includes only a variable, psi may be
a numeric vector. A NA value means that ‘K’ quantiles (or equally spaced values)
are used as starting values; K is fixed via the seg.control auxiliary function.
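
Note that segmented fits a continuous broken line, whereas the loop in the question fits two unconnected lines. For that second setup, one possible way to avoid refitting lm at every cutpoint is to sort by the predictor once and get each group's RSS from running sums, using RSS = Syy - Sxy^2 / Sxx for a simple linear regression. The sketch below is only an illustration under some assumptions: the predictor has no tied values, there are at least 4 observations, and group.rss and fast.cutpoint are hypothetical helper names, not part of any package.

```r
# RSS of a simple linear regression from raw sums over one group:
# RSS = Syy - Sxy^2 / Sxx, with centered sums of squares and products.
group.rss <- function(n, sx, sy, sxx, sxy, syy) {
    Sxx <- sxx - sx^2 / n
    Sxy <- sxy - sx * sy / n
    Syy <- syy - sy^2 / n
    Syy - Sxy^2 / Sxx
}

fast.cutpoint <- function(x, y) {
    ord <- order(x)
    x <- x[ord]
    y <- y[ord]
    n <- length(x)

    # Center both variables to limit floating-point cancellation;
    # shifting x and y does not change the RSS of a fit with an intercept.
    xc <- x - mean(x)
    yc <- y - mean(y)

    # Running sums over the k smallest x values, for every k.
    cx <- cumsum(xc)
    cy <- cumsum(yc)
    cxx <- cumsum(xc^2)
    cxy <- cumsum(xc * yc)
    cyy <- cumsum(yc^2)

    # k = size of the lower group; both groups need at least 2 points,
    # mirroring the check in the question's loop.
    k <- 2:(n - 2)
    lower <- group.rss(k, cx[k], cy[k], cxx[k], cxy[k], cyy[k])
    upper <- group.rss(n - k, cx[n] - cx[k], cy[n] - cy[k],
                       cxx[n] - cxx[k], cxy[n] - cxy[k], cyy[n] - cyy[k])
    total <- lower + upper

    best <- which.min(total)
    # The cutpoint is the largest x in the lower group (x <= cutpoint).
    list(RSS = total[best], cutpoint = x[k[best]])
}

# Same simulated data as in the question.
set.seed(1)
ind.var <- runif(1000, 1, 50000)
dep.var <- ind.var * runif(1000, 2, 3) + rnorm(1000, 100, 500)
print(fast.cutpoint(ind.var, dep.var))
```

After the O(n log n) sort, every candidate cutpoint costs constant time, so the whole search is a single pass over the data. Up to floating-point rounding it should agree with the loop in the question, since both evaluate the same set of partitions.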

solution1
2 ACCEPTED 2015-02-09 04:19:29
