简体   繁体   English

使用R中的data.table进行顺序过滤

[英]Sequential filtering using data.table in R

I have data as follows: 我的数据如下:

PERMNO date DLSTCD
    10 1983     NA 
    10 1985    250 
    10 1986     NA
    10 1986     NA 
    10 1987    240 
    10 1987     NA  
    11 1984     NA  
    11 1984     NA  
    11 1985     NA  
    11 1987     NA 
    12 1984    240 

I need to filter rows based on following criteria: 我需要根据以下标准过滤行:

  1. For each PERMNO , sort data by date 对于每个PERMNO ,按date对数据进行排序
  2. Parse through the sorted data and delete rows after a company gets delisted (ie. DLSTCD != NA) 在公司被除名后解析排序数据并删除行(即DLSTCD!= NA)
  3. If the first row corresponds to company getting delisted, do not include any rows for that company 如果第一行对应于公司已退市,则不包括该公司的任何行

Based on these criteria, following is my expected output: 根据这些标准,以下是我的预期输出:

PERMNO date DLSTCD
    10 1983     NA 
    10 1985    250 
    11 1984     NA  
    11 1984     NA  
    11 1985     NA  
    11 1987     NA 

I am using data.table in R to work with this data. 我在R中使用data.table来处理这些数据。 The example above is an oversimplified version of my actual data, which contains around 3M rows corresponding to 30k PERMNOs. 上面的示例是我的实际数据的过度简化版本,其中包含对应于30k PERMNO的大约3M行。

I implemented three different methods for doing this, as can be seen here: 我实现了三种不同的方法,如下所示:
r-fiddle: http://www.r-fiddle.org/#/fiddle?id=4GapqSbX&version=3 r-fiddle: http ://www.r-fiddle.org/#/fiddle? id = 4GapqSbX&version = 3

Below I compare my implementations using a small dataset of 50k rows. 下面我使用50k行的小数据集来比较我的实现。 Here are my results: 这是我的结果:

Time Comparison 时间比较

system.time(dt <- filterbydelistingcode(dt))   # 39.962 seconds
system.time(dt <- filterbydelistcoderowindices(dt))   # 39.014 seconds
system.time(dt <- filterbydelistcodeinline(dt))   # 114.3 seconds

As you can see all my implementations are extremely inefficient. 正如您所看到的,我的所有实现都非常低效。 Can someone please help me implement a much faster version for this? 有人可以帮我实现更快的版本吗? Thank you. 谢谢。

Edit : Here is a link to a sample dataset of 50k rows that I used for time comparison: https://ufile.io/q9d8u 编辑 :以下是我用于时间比较的50k行样本数据集的链接: https//ufile.io/q9d8u

Also, here is a customized read function for this data: 此外,这是这个数据的自定义读取功能:

readdata = function(filename){
    data = read.csv(filename,header=TRUE, colClasses = c(date = "Date"))
    PRCABS = abs(data$PRC)
    mcap = PRCABS * data$SHROUT
    hpr = data$RET
    HPR = as.numeric(levels(hpr))[hpr]
    HPR[HPR==""] = NA
    data = cbind(data,PRCABS,mcap, HPR)
    return(data)
}

data <- readdata('fewdata.csv')
dt <- as.data.table(data)

Here's an attempt in data.table : 这是data.table的尝试:

dat[
  dat[order(date),
  {
    pos <- match(TRUE, !is.na(DLSTCD));
    (.I <= .I[pos] & pos != 1) | (is.na(pos)) 
  },
  by=PERMNO]
$V1]

#   PERMNO date DLSTCD
#1:     10 1983     NA
#2:     10 1985    250
#3:     11 1984     NA
#4:     11 1984     NA
#5:     11 1985     NA
#6:     11 1987     NA

Testing it on 2.5million rows, 400000 with a delisting date: 测试它在250万行,400000与退市日期:

set.seed(1)
dat <- data.frame(PERMNO=sample(1:22000,2.5e6,replace=TRUE), date=1:2.5e6)
dat$DLSTCD <- NA
dat$DLSTCD[sample(1:2.5e6, 400000)] <- 1
setDT(dat)

system.time({
dat[
  dat[order(date),
  {
    pos <- match(TRUE, !is.na(DLSTCD));
    (.I <= .I[pos] & pos != 1) | (is.na(pos)) 
  },
  by=PERMNO]
$V1]
})
#   user  system elapsed 
#   0.74    0.00    0.76 

Less than a second - not bad. 不到一秒钟 - 不错。

Building on @thelatemail's answer, here are two more variations on the same theme. 在@ thelatemail的答案的基础上,以下是同一主题的两个变体。

In both cases, setkey() first makes things easier to reason with : 在这两种情况下, setkey()首先使事情更容易理解:

setkey(dat,PERMNO,date)  # sort by PERMNO, then by date within PERMNO

Option 1 : stack the data you want (if any) from each group 选项1:从每个组中堆叠所需的数据(如果有)

system.time(
  ans1 <- dat[, {
    w = first(which(!is.na(DLSTCD)))
    if (!length(w)) .SD
    else if (w>1) .SD[seq_len(w)]
  }, keyby=PERMNO]
)
   user  system elapsed 
  2.604   0.000   2.605 

That's quite slow because allocating and populating all the little bits of memory for the result for each group, only then to be stacked into one single result in the end again, takes time and memory. 这是非常慢的,因为为每个组分配和填充所有小内存的结果,然后再次将其堆叠成一个结果,需要时间和内存。

Option 2 : (closer to the way you phrased the question) find the row numbers to delete, then delete them. 选项2 :(更接近你所说的问题的方式)找到要删除的行号,然后删除它们。

system.time({
  todelete <- dat[, {
    w = first(which(!is.na(DLSTCD)))
    if (length(w)) .I[seq.int(from=if (w==1) 1 else w+1, to=.N)]
  }, keyby=PERMNO]

  ans2 <- dat[ -todelete$V1 ]
})
   user  system elapsed 
  0.160   0.000   0.159

That's faster because it's only stacking row numbers to delete followed by a single operation to delete the required rows in one bulk operation. 这样更快,因为它只是堆叠要删除的行号,然后是一个操作来删除一个批量操作中所需的行。 Since it's grouping by the first column of the key, it uses the key to make the grouping faster (groups are contiguous in RAM). 由于它是按密钥的第一列进行分组,因此它使用密钥使分组更快(组在RAM中是连续的)。

More info can be found about ?.SD and ?.I on this manual page . 有关?.SD?.I更多信息,请参见本手册页

You can inspect and debug what is happening inside each group just by adding a call to browser() and having a look as follows. 您可以通过添加对browser()的调用并查看如下内容来检查和调试每个组内发生的事情。

> ans1 <- dat[, {
     browser()
     w = first(which(!is.na(DLSTCD)))
     if (!length(w)) .SD
     else if (w>1) .SD[seq_len(w)]
   }, keyby=PERMNO]
Browse[1]> .SD      # type .SD to look at it
        date DLSTCD
  1:   21679     NA
  2:   46408      1
  3:   68378     NA
  4:   75362     NA
  5:   77690     NA
 ---               
111: 2396559      1
112: 2451629     NA
113: 2461958     NA
114: 2484403     NA
115: 2485217     NA
Browse[1]> w   # doesn't exist yet because browser() before that line
Error: object 'w' not found
Browse[1]> w = first(which(!is.na(DLSTCD)))  # copy and paste line
Browse[1]> w
[1] 2
Browse[1]> if (!length(w)) .SD else if (w>1) .SD[seq_len(w)]
    date DLSTCD
1: 21679     NA
2: 46408      1
Browse[1]> # that is what is returned for this group
Browse[1]> n   # or type n to step to next line
debug at #3: w = first(which(!is.na(DLSTCD)))
Browse[2]> help  # for browser commands

Let's say you find a problem or bug with one particular PERMNO. 假设您发现某个特定PERMNO存在问题或错误。 You can make the call to browser conditional as follows. 您可以按如下方式调用浏览器条件。

> ans1 <- dat[, {
     if (PERMNO==42) browser()
     w = first(which(!is.na(DLSTCD)))
     if (!length(w)) .SD
     else if (w>1) .SD[seq_len(w)]
  }, keyby=PERMNO]
Browse[1]> .SD
        date DLSTCD
  1:   31018     NA
  2:   35803      1
  3:   37494     NA
  4:   50012     NA
  5:   52459     NA
 ---               
128: 2405818     NA
129: 2429995     NA
130: 2455519     NA
131: 2478605      1
132: 2497925     NA
Browse[1]> 

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM