
Extract Group Regression Coefficients in R w/ PLYR

I'm trying to run a regression for every zipcode in my dataset and save the coefficients to a data frame but I'm having trouble.

Whenever I run the code below, I get a data frame called "coefficients" containing every zip code but with the intercept and coefficient for every zipcode being equal to the results of the simple regression lm(Sealed$hhincome ~ Sealed$square_footage) .

When I run the code as indicated in Ranmath's example at the link below, everything works as expected. I'm new to R after many years with STATA, so any help would be greatly appreciated :)

R extract regression coefficients from multiply regression via lapply command

library(plyr)
Sealed <- read.csv("~/Desktop/SEALED.csv")

x <- function(df) {
      lm(Sealed$hhincome ~ Sealed$square_footage)
}

regressions <- dlply(Sealed, .(Sealed$zipcode), x)
coefficients <- ldply(regressions, coef)

Because dlply takes a ... argument that allows additional arguments to be passed to the function, you can make things even simpler:

dlply(Sealed,.(zipcode),lm,formula=hhincome~square_footage)

The first two arguments to lm are formula and data . Since formula is specified here, lm will pick up the next argument it is given (the relevant zipcode-specific chunk of Sealed ) as the data argument ...
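As a rough illustration of that one-liner, here is a minimal sketch on made-up data. The data frame toy, its zipcodes, and the slopes are all hypothetical stand-ins, since the real SEALED.csv is not shown; only the dlply and ldply calls mirror the approach above.

```r
library(plyr)

# Hypothetical toy data standing in for Sealed (the real CSV is not available).
set.seed(1)
toy <- data.frame(
  zipcode          = rep(c("10001", "10002", "10003"), each = 20),
  square_footage   = runif(60, 500, 3000),
  stringsAsFactors = FALSE
)

# Give each zipcode its own slope so the per-group fits visibly differ.
true_slope <- c("10001" = 10, "10002" = 25, "10003" = 40)
toy$hhincome <- 20000 +
  unname(true_slope[toy$zipcode]) * toy$square_footage +
  rnorm(60, sd = 1000)

# formula is matched by name, so each zipcode chunk is matched to lm's
# next unmatched argument, data.
fits  <- dlply(toy, .(zipcode), lm, formula = hhincome ~ square_footage)

# One row per zipcode: the intercept and the slope of that group's fit.
coefs <- ldply(fits, coef)
print(coefs)
```

If this works as intended, the slope column should come out different for each zipcode, which is the behaviour the question was after.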

You are applying the function:

x <- function(df) {
      lm(Sealed$hhincome ~ Sealed$square_footage)
}

to each subset of your data, so we shouldn't be surprised that the output each time is exactly

lm(Sealed$hhincome ~ Sealed$square_footage)

right? Try replacing Sealed with df inside your function. That way you're referring to the variables in each individual piece passed to the function, not the whole variable in the data frame Sealed .
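A sketch of that fix, assuming a small made-up data frame in place of Sealed (the names toy, fit_piece, and group_coefs are illustrative, not from the question):

```r
library(plyr)

# Hypothetical stand-in for Sealed, with a different true slope per zipcode.
set.seed(42)
toy <- data.frame(
  zipcode          = rep(c("A", "B"), each = 30),
  square_footage   = runif(60, 500, 3000),
  stringsAsFactors = FALSE
)
toy$hhincome <- ifelse(toy$zipcode == "A", 15, 45) * toy$square_footage +
  rnorm(60, sd = 500)

# Same shape as the function in the question, but it now refers to df,
# the piece that dlply passes in, instead of the whole data frame.
fit_piece <- function(df) {
  lm(df$hhincome ~ df$square_footage)
}

regressions <- dlply(toy, .(zipcode), fit_piece)
group_coefs <- ldply(regressions, coef)
print(group_coefs)
```

Writing the body as lm(hhincome ~ square_footage, data = df) should give the same fits and is arguably more idiomatic, since the coefficient is then labelled square_footage rather than df$square_footage.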

The issue is not with plyr but with the definition of the function: it takes an argument, df, but never uses that argument in its body.

As an analogy,

myFun <- function(x) {
  3 * 7
}

> myFun(2)
[1] 21
> myFun(578)
[1] 21

If you run this function on different values of x, it will still give you 21, no matter what x is, because there is no reference to x within the function. In my silly example, the correction is obvious; in your function above, the confusion is understandable, since $hhincome and $square_footage look as though they should vary with the data being passed in.

But what needs to vary is the data frame that comes before the $ . As @Joran correctly pointed out, swap Sealed$hhincome for df$hhincome (and the same for $squ.. ) and that will help.
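To see the difference side by side without plyr, here is a small base R sketch. Everything in it (toy, ignores_arg, uses_arg) is hypothetical and only meant to mirror the analogy above.

```r
# Hypothetical toy data: two zipcodes with clearly different true slopes.
set.seed(7)
toy <- data.frame(
  zipcode          = rep(c("A", "B"), each = 25),
  square_footage   = runif(50, 500, 3000),
  stringsAsFactors = FALSE
)
toy$hhincome <- ifelse(toy$zipcode == "A", 10, 30) * toy$square_footage +
  rnorm(50, sd = 500)

# Like myFun: takes df but never uses it, so it always fits the full data.
ignores_arg <- function(df) coef(lm(toy$hhincome ~ toy$square_footage))

# Uses the piece it is given, so the fit changes from group to group.
uses_arg <- function(df) coef(lm(hhincome ~ square_footage, data = df))

pieces <- split(toy, toy$zipcode)

same_everywhere <- sapply(pieces, ignores_arg)  # identical columns
per_group       <- sapply(pieces, uses_arg)     # one distinct fit per zipcode

print(same_everywhere)
print(per_group)
```

The first matrix should show the same intercept and slope in every column, which is the symptom described in the question; the second should show a separate fit for each zipcode.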
