
How to count rows with conditional after grouping in data.table

I have the following data frame:

dat <- read_csv(
  "s1,s2,v1,v2
   a,b,10,20
   a,b,22,NA
   a,b,13,33
   c,d,3,NA
   c,d,4.5,NA
   c,d,10,20"
)

dat
#> # A tibble: 6 x 4
#>      s1    s2    v1    v2
#>   <chr> <chr> <dbl> <int>
#> 1     a     b  10.0    20
#> 2     a     b  22.0    NA
#> 3     a     b  13.0    33
#> 4     c     d   3.0    NA
#> 5     c     d   4.5    NA
#> 6     c     d  10.0    20

What I want to do is:

  1. Filter rows based on v1 values
  2. Group by s1 and s2
  3. Count the total number of lines in every group
  4. Count the lines in every group where v2 is not NA

For example, with v1_filter >= 0 we get this:

s1 s2 total_line non_na_line
a  b     3          2
c  d     3          1

And with v1_filter >= 10 we get this:

s1 s2 total_line non_na_line
a  b     2          1
c  d     1          1

How can I achieve that with data.table or dplyr? In reality we have around ~31M rows in dat, so we need a fast method.

I'm stuck with this:

 library(data.table)
 dat <- data.table(dat)

 v1_filter = 0
 dat[, v1 >= v1_filter, 
     by=list(s1,s2)]

Using sum should help. Operating on a logical vector, it treats each TRUE as 1 and each FALSE as 0, so you can easily do this:

library(dplyr)

dat %>%
    group_by(s1, s2) %>%
    summarise(total_lines = n(),
              non_na_line = sum(!is.na(v2)))

# A tibble: 2 x 4
# Groups:   s1 [?]
     s1    s2 total_lines non_na_line
  <chr> <chr>       <int>       <int>
1     a     b           3           2
2     c     d           3           1

You'll easily be able to add a filter between group_by and summarise to get what you want. Keep in mind that summarise will only retain the columns that you group by.
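For instance, a minimal sketch of the pipeline with that filter added (rebuilding the question's small dat as a plain data frame, and using a hypothetical v1_filter threshold of 0 so the result matches the question's first expected table):

```r
library(dplyr)

# The question's example data, rebuilt as a plain data frame
dat <- data.frame(
  s1 = c("a", "a", "a", "c", "c", "c"),
  s2 = c("b", "b", "b", "d", "d", "d"),
  v1 = c(10, 22, 13, 3, 4.5, 10),
  v2 = c(20, NA, 33, NA, NA, 20)
)

v1_filter <- 0  # hypothetical threshold; change as needed

res <- dat %>%
  group_by(s1, s2) %>%
  filter(v1 >= v1_filter) %>%                # drop rows below the threshold
  summarise(total_lines = n(),               # rows per group after filtering
            non_na_line = sum(!is.na(v2)))   # non-missing v2 per group
res
```

Because the condition is row-wise, filtering before group_by would give the same result here.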

Benchmark

For what it's worth, I ran a quick benchmark with some test data of similar size to yours.

s1charMix <- rep(letters[seq(from = 1, to = 10)], length.out = 30000000)
s2charMix <- rep(letters[seq(from = 11, to = 20)], length.out = 30000000)
s1chars <- sample(s1charMix, 30000000)
s2chars <- sample(s2charMix, 30000000)
v1Nums <- runif(30000000, min = 0, max = 20)
nomissing <- sample(1:200000,1)
int.mix <- rbinom(30000000 - nomissing, 30, 0.3)
nalist <- rep(NA, nomissing)
v2NumsNA <- sample(x = c(int.mix, nalist), 30000000)
df <- data_frame(s1 = s1chars, s2 = s2chars, v1 = v1Nums, v2 = v2NumsNA)

This should roughly replicate the size and type of the data you suggest:

df

# A tibble: 30,000,000 x 4
      s1    s2         v1    v2
   <chr> <chr>      <dbl> <int>
 1     d     s  9.2123603     7
 2     b     q 16.6638639    11
 3     g     o 18.3682028    11
 4     g     s  0.8779067     9
 5     a     s  0.0719127    10
 6     b     q 16.8809193    12
 7     h     q 15.4382455     6
 8     e     k  2.3565489    11
 9     h     p 16.4508811     9
10     d     n  2.7283823    11
# ... with 29,999,990 more rows

df %>%
    filter(is.na(v2))

# A tibble: 116,924 x 4
      s1    s2         v1    v2
   <chr> <chr>      <dbl> <int>
 1     d     r 13.1448988    NA
 2     b     o  0.2703848    NA
 3     b     t 18.8319385    NA
 4     a     s 11.6448437    NA
 5     j     m  0.5388760    NA
 6     i     k  8.7098427    NA
 7     d     s  6.1149735    NA
 8     h     p  2.5552694    NA
 9     g     r  0.9057442    NA
10     b     s 19.8886830    NA
# ... with 116,914 more rows

Now, let's benchmark the dplyr operations against data.table:

### dplyr
df %>%
    filter(v1 > 10) %>%
    group_by(s1, s2) %>%
    summarise(total_lines = n(),
              non_na_line = sum(!is.na(v2)))

# A tibble: 100 x 4
# Groups:   s1 [?]
      s1    s2 total_lines non_na_line
   <chr> <chr>       <int>       <int>
 1     a     k      150327      149734
 2     a     l      149655      149062
 3     a     m      149794      149200
 4     a     n      149771      149197
 5     a     o      149495      148942
...
> system.time(df %>% filter(v1 > 10) %>% group_by(s1, s2) %>% summarise(total_lines = n(), non_na_line = sum(!is.na(v2))))
   user  system elapsed 
  1.848   0.420   2.290
> system.time(for (i in 1:100) df %>% filter(v1 > 10) %>% group_by(s1, s2) %>% summarise(total_lines = n(), non_na_line = sum(!is.na(v2))))
   user  system elapsed 
187.657  55.878 245.528 

### Data.table
library(data.table)
dat <- data.table(df)
> dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)] , on = c("s1", "s2") , nomatch = 0]
 s1 s2      N    i.N
  1:  b  q 149968 149348
  2:  g  o 150411 149831
  3:  h  q 150132 149563
  4:  h  p 150786 150224
  5:  e  o 149951 149353
 ...
> system.time(dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)] , on = c("s1", "s2") , nomatch = 0])
   user  system elapsed 
  2.027   0.228   2.271
> system.time(for (i in 1:100) dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)] , on = c("s1", "s2") , nomatch = 0])
   user  system elapsed 
213.281  43.949 261.664

TL;DR: dplyr and data.table are similarly fast; if anything, dplyr is slightly faster.

> library(readr)
> dat <- read_csv(
+   "s1,s2,v1,v2
+    a,b,10,20
+    a,b,22,NA
+    a,b,13,33
+    c,d,3,NA
+    c,d,4.5,NA
+    c,d,10,20"
+ )
> 
> dat
# A tibble: 6 x 4
     s1    s2    v1    v2
  <chr> <chr> <dbl> <int>
1     a     b  10.0    20
2     a     b  22.0    NA
3     a     b  13.0    33
4     c     d   3.0    NA
5     c     d   4.5    NA
6     c     d  10.0    20

Using data.table, since you have big data:

> library(data.table)
data.table 1.10.4
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> dat=data.table(dat)

Without removing NA, and keeping the v1 filter as 0.1:

> dat1=dat[v1>0.1,.N,.(s1,s2)]
> dat1
   s1 s2 N
1:  a  b 3
2:  c  d 3

Removing v2 NA, and keeping the v1 filter as 0.1:

> dat2=dat[v1>0.1&is.na(v2)==F,.N,.(s1,s2)]
> dat2
   s1 s2 N
1:  a  b 2
2:  c  d 1

Merging the two, and keeping the v1 filter as 0:

 > dat[v1 > 0, .N, by = .(s1, s2)][ dat[v1 > 0 & !is.na(v2), .N, by = .(s1, s2)] , on = c("s1", "s2") , nomatch = 0 ]
       s1 s2 N i.N
    1:  a  b 3   2
    2:  c  d 3   1
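As an alternative to joining two grouped queries, both counts can be computed in a single grouped pass, which avoids building and merging the intermediate tables. A minimal sketch (rebuilding the question's small dat inline; the column names total_line and non_na_line are the asker's, and v1_filter is set to 0 as in the question's first example):

```r
library(data.table)

# The question's example data as a data.table
dat <- data.table(
  s1 = c("a", "a", "a", "c", "c", "c"),
  s2 = c("b", "b", "b", "d", "d", "d"),
  v1 = c(10, 22, 13, 3, 4.5, 10),
  v2 = c(20, NA, 33, NA, NA, 20)
)

v1_filter <- 0  # threshold from the question

res <- dat[v1 >= v1_filter,
           .(total_line  = .N,                  # rows per group after the filter
             non_na_line = sum(!is.na(v2))),    # non-missing v2 per group
           by = .(s1, s2)]
res
```

On the 31M-row data this does one filtered scan and one grouping instead of two, so it should be at least as fast as the join-based version above.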
