
Fast linear regression by group

I have 500K users and I need to compute a linear regression (with intercept) for each of them.

Each user has around 30 records.

I tried with dplyr and lm and this is way too slow. Around 2 sec per user.

df %>%
  group_by(user_id, add = FALSE) %>%
  do(lm = lm(Y ~ x, data = .)) %>%
  mutate(lm_b0 = summary(lm)$coeff[1],
         lm_b1 = summary(lm)$coeff[2]) %>%
  select(user_id, lm_b0, lm_b1) %>%
  ungroup()

I tried to use lm.fit which is known to be faster but it doesn't seem to be compatible with dplyr.

Is there a fast way to do a linear regression by group?

You can just use the basic formulas for calculating the slope and intercept of a simple linear regression. lm does a lot of unnecessary things if all you care about are those two numbers. Here I use data.table for the aggregation, but you could do it in base R as well (or dplyr):

system.time(
  res <- DT[, 
    {
      ux <- mean(x)
      uy <- mean(y)
      slope <- sum((x - ux) * (y - uy)) / sum((x - ux) ^ 2)
      list(slope=slope, intercept=uy - slope * ux)
    }, by=user.id
  ]
)

For 500K users with ~30 obs each, this produces (in seconds):

 user  system elapsed 
 7.35    0.00    7.36 

Or about 15 microseconds per user.
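
For reference, the same closed-form formulas can be written with dplyr, as mentioned above. A minimal sketch (not benchmarked), run against the DT defined in the Data section below:

library(dplyr)

res_dplyr <- DT %>%
  group_by(user.id) %>%
  summarise(
    slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x)) ^ 2),
    intercept = mean(y) - slope * mean(x)
  )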

Update: I ended up writing a bunch of blog posts that touch on this as well.

And to confirm this is working as expected:

> summary(DT[user.id==89663, lm(y ~ x)])$coefficients
             Estimate Std. Error   t value  Pr(>|t|)
(Intercept) 0.1965844  0.2927617 0.6714826 0.5065868
x           0.2021210  0.5429594 0.3722580 0.7120808
> res[user.id == 89663]
   user.id    slope intercept
1:   89663 0.202121 0.1965844

Data:

library(data.table)

set.seed(1)
users <- 5e5
records <- 30
x <- runif(users * records)
DT <- data.table(
  x=x, y=x + runif(users * records) * 4 - 2, 
  user.id=sample(users, users * records, replace=T)
)

If all you want is coefficients, I'd just use user_id as a factor in the regression. Using @miles2know's simulated data code (though renaming the object, since something other than exp() sharing that name looks weird to me):

dat <- data.frame(id = rep(c("a","b","c"), each = 20),
                  x = rnorm(60,5,1.5),
                  y = rnorm(60,2,.2))

mod = lm(y ~ x:id + id + 0, data = dat)

We fit no global intercept (+ 0) so that the intercept for each id is the id coefficient, and no x by itself, so that the x:id interactions are the slopes for each id:

coef(mod)
#      ida      idb      idc    x:ida    x:idb    x:idc 
# 1.779686 1.893582 1.946069 0.039625 0.033318 0.000353 

So, for the a level of id, the ida coefficient, 1.78, is the intercept and the x:ida coefficient, 0.0396, is the slope.

I'll leave the gathering of these coefficients into appropriate columns of a data frame to you...
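
For example, one way to do that gathering is to match the coefficients by name, following the ida / x:ida pattern shown above (a minimal sketch):

ids <- sort(unique(dat$id))
cf  <- coef(mod)
coefs <- data.frame(
  id        = ids,
  intercept = unname(cf[paste0("id", ids)]),
  slope     = unname(cf[paste0("x:id", ids)])
)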

This solution should be very fast because you don't have to deal with subsets of data frames. It could probably be sped up even more with fastLm or the like.

Note on scalability:

I did just try this on @nrussell's simulated full-size data and ran into memory allocation issues. Depending on how much memory you have it may not work in one go, but you could probably do it in batches of user ids. Some combination of his answer and my answer might be the fastest overall, or nrussell's might just be faster: expanding the user id factor into thousands of dummy variables might not be computationally efficient, as I've been waiting more than a couple of minutes now for a run on just 5000 user ids.
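
A rough sketch of that batching idea (hypothetical names: big stands for the full data frame with columns user_id, x, and Y, and the batch size is arbitrary):

fit_batch <- function(d) {
  d$user_id <- factor(d$user_id)   # keep only this batch's ids as levels in the design matrix
  cf  <- coef(lm(Y ~ x:user_id + user_id + 0, data = d))
  ids <- levels(d$user_id)
  data.frame(user_id   = ids,
             intercept = unname(cf[paste0("user_id", ids)]),
             slope     = unname(cf[paste0("x:user_id", ids)]))
}

all_ids <- unique(big$user_id)
batches <- split(all_ids, ceiling(seq_along(all_ids) / 5000))  # e.g. 5000 ids per batch
res     <- do.call(rbind, lapply(batches, function(ids) fit_batch(big[big$user_id %in% ids, ])))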

Update: As pointed out by Dirk, my original approach can be greatly improved upon by specifying x and Y directly rather than using the formula-based interface of fastLm, which incurs (a fairly significant) processing overhead. For comparison, using the original full-size data set,

R> system.time({
  dt[,c("lm_b0", "lm_b1") := as.list(
    unname(fastLm(x, Y)$coefficients))
    ,by = "user_id"]
})
#  user  system elapsed 
#55.364   0.014  55.401 
##
R> system.time({
  dt[,c("lm_b0","lm_b1") := as.list(
    unname(fastLm(Y ~ x, data=.SD)$coefficients))
    ,by = "user_id"]
})
#   user  system elapsed 
#356.604   0.047 356.820

this simple change yields roughly a 6.5x speedup.
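
For reference, fastLm's non-formula (default) method expects a pre-built model matrix, so the intercept column has to be supplied explicitly if you want both coefficients. A minimal sketch of that style of call, which is my reading of the benchmark above rather than necessarily the exact code that produced those timings:

dt[, c("lm_b0", "lm_b1") := as.list(
     unname(fastLm(cbind(1, x), Y)$coefficients)),
   by = user_id]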


[Original approach]

There is probably some room for improvement, but the following took about 25 minutes on a Linux VM (2.6 GHz processor) running 64-bit R:

library(data.table)
library(RcppArmadillo)
##
dt[
  ,c("lm_b0","lm_b1") := as.list(
    unname(fastLm(Y ~ x, data=.SD)$coefficients)),
  by=user_id]
##
R> dt[c(1:2, 31:32, 61:62),]
   user_id   x         Y     lm_b0    lm_b1
1:       1 1.0 1674.8316 -202.0066 744.6252
2:       1 1.5  369.8608 -202.0066 744.6252
3:       2 1.0  463.7460 -144.2961 374.1995
4:       2 1.5  412.7422 -144.2961 374.1995
5:       3 1.0  513.0996  217.6442 261.0022
6:       3 1.5 1140.2766  217.6442 261.0022

Data:

dt <- data.table(
  user_id = rep(1:500000,each=30))
##
dt[, x := seq(1, by=.5, length.out=30), by = user_id]
dt[, Y := 1000*runif(1)*x, by = user_id]
dt[, Y := Y + rnorm(
  30, 
  mean = sample(c(-.05,0,0.5)*mean(Y),1), 
  sd = mean(Y)*.25), 
  by = user_id]

You might give this a try using data.table, like this. I've just created some toy data, but I'd imagine data.table would give some improvement. It's quite speedy. But that is quite a large data set, so perhaps benchmark this approach on a smaller sample to see if the speed is a lot better. Good luck.


    library(data.table)

    exp <- data.table(id = rep(c("a","b","c"), each = 20), x = rnorm(60,5,1.5), y = rnorm(60,2,.2))
    # edit: it might also help to set a key on id with such a large data-set
    # with the toy example it would make no diff of course
    exp <- setkey(exp,id)
    # the nuts and bolts of the data.table part of the answer
    result <- exp[, as.list(coef(lm(y ~ x))), by=id]
    result
       id (Intercept)            x
    1:  a    2.013548 -0.008175644
    2:  b    2.084167 -0.010023549
    3:  c    1.907410  0.015823088
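
As a quick way to do that smaller-sample benchmark, something along these lines could work (a sketch; full_dt is a hypothetical data.table holding the real data, with columns user_id, x, and Y as in the question):

sample_ids <- sample(unique(full_dt$user_id), 10000)
system.time(
  full_dt[user_id %in% sample_ids, as.list(coef(lm(Y ~ x))), by = user_id]
)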

An example using Rfast.

Assuming a single response and 500K predictor variables.

library(Rfast)

y <- rnorm(30)
x <- matrnorm(500*1000, 30)
system.time( Rfast::univglms(y, x, "normal") )  ## 0.70 seconds

Assuming 500K response variables and a single predictor variable.

system.time( Rfast::mvbetas(x,y) )  ## 0.60 seconds

Note: The above times will decrease in the near future.
