R: How to bin efficiently using time-varying breakpoints?

I'm working with a large data frame of 14 million rows, containing the columns month, firmID, and firmSize. In a separate data frame I have monthly breakpoints (essentially quintiles) for firm size.

My goal is to add a fourth column, quintile, to the first data frame. This column would hold a number from 1 to 5 corresponding to the size quintile that firmSize belongs to in that specific month.

I have the following loop, which does the job but has a runtime of several hundred hours.

for (i in seq_len(nrow(df))) {
  for (j in 1:4) {
    # Column j + 1 of breakpoints holds Qj for the month of row i
    if (df$firmSize[i] <= breakpoints[which(df$month[i] == breakpoints$month), (j + 1)]) {
      df$quintile[i] <- j
      break
    }
    else {
      df$quintile[i] <- 5
    }
  }
}

I have quite limited knowledge of, e.g., dplyr, and I was wondering whether anyone has an idea how to approach this problem so that I don't have to keep my laptop running for weeks.

Thank you in advance!

Edit: Example data for the data frames (thank you, Ricardo, for your suggestion!):

df

month  firmID   firmSize
201001 46603210 9738635
201001 72913210 1166077
201001 00621210 3884422
201512 75991610 2932127
201512 45383610 1241272
201512 05766520 1931038

breakpoints

month  Q1     Q2      Q3      Q4      Q5
201001 322770 1038300 2112300 4597580 28919700
201512 379340 1239800 2840630 7785700 46209140

I wonder if using findInterval and data.table might be worth pursuing and faster. This was adapted from this answer, which I thought was very helpful.

findInterval finds, for each element of one vector, the index of the interval it falls into in another vector (which must be non-decreasing). In this case, the breakpoints columns Q1 to Q5 form the second vector, and the function returns the index based on the firmSize value in the first data frame.
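As a small illustration of that indexing, here is a base R sketch using the 201001 breakpoints and firm sizes from the example data. The names breaks, sizes and idx are only for this illustration.

```r
# Breakpoints for month 201001, copied from the example data
breaks <- c(Q1 = 322770, Q2 = 1038300, Q3 = 2112300, Q4 = 4597580, Q5 = 28919700)

# firmSize values of the three 201001 firms
sizes <- c(9738635, 1166077, 3884422)

# For each size, findInterval() returns how many breakpoints are <= that size
idx <- findInterval(sizes, breaks)
print(idx)      # 4 2 3

# Adding 1 turns the interval index into the quintile label
print(idx + 1)  # 5 3 4
```

Note that with the default arguments a size exactly equal to a breakpoint counts as having passed it, which differs from the `<=` comparison in the loop in the question.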

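library(data.table)

# Convert both data frames to data.tables by reference and key them on month
setDT(df)
setkey(df, month)

setDT(breakpoints)
setkey(breakpoints, month)

# For each month, look up that month's breakpoints (.BY holds the current
# group's month) and bin all firmSize values of the group in one call
df[, quintile := findInterval(firmSize, breakpoints[.BY][, Q1:Q5]) + 1, by = month][]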

Output

    month   firmID firmSize quintile
1: 201001 46603210  9738635        5
2: 201001 72913210  1166077        3
3: 201001   621210  3884422        4
4: 201512 75991610  2932127        4
5: 201512 45383610  1241272        3
6: 201512  5766520  1931038        3
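If you would rather stay in base R, a rough alternative sketch is to look up each row's breakpoints with match() and count how many of Q1 to Q4 the firm size exceeds. This is not part of the data.table approach above. It assumes the breakpoints are non-decreasing within a month and that every month in df appears in breakpoints. The helper names row and cuts are made up for this example, and firmID is stored as character here only to keep the leading zeros.

```r
# Example data from the question
df <- data.frame(
  month    = c(201001, 201001, 201001, 201512, 201512, 201512),
  firmID   = c("46603210", "72913210", "00621210",
               "75991610", "45383610", "05766520"),
  firmSize = c(9738635, 1166077, 3884422, 2932127, 1241272, 1931038)
)
breakpoints <- data.frame(
  month = c(201001, 201512),
  Q1 = c(322770, 379340),
  Q2 = c(1038300, 1239800),
  Q3 = c(2112300, 2840630),
  Q4 = c(4597580, 7785700),
  Q5 = c(28919700, 46209140)
)

# Row of `breakpoints` that belongs to each row of `df`
row <- match(df$month, breakpoints$month)

# One row of Q1..Q4 per observation, aligned with df
cuts <- as.matrix(breakpoints[row, c("Q1", "Q2", "Q3", "Q4")])

# Quintile = 1 + number of breakpoints among Q1..Q4 that firmSize exceeds
df$quintile <- 1L + as.integer(rowSums(df$firmSize > cuts))

print(df)
```

On the example data this gives the same quintiles as the output above (5, 3, 4, 4, 3, 3). Because it uses a strict `>`, a firm sitting exactly on a breakpoint lands in the lower quintile, as in the loop from the question, and Q5 is never needed.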
