What is the most efficient way to get distinct rows in a data table?

Question

I have a data table with a large number of rows.

I am considering different options for getting a unique set of rows, including:

dt <- dt %>% unique(.)
dt <- dt %>% distinct()

What is the most efficient way to do this? I am concerned about efficiency because it is a 20GB file.

Answer 1

unique is probably the most efficient, because data.table provides its own implementation of it.
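As a minimal sketch of what that looks like in practice, the snippet below calls unique on a tiny data.table. The table contents and the variable names (dt_small, dt_unique, dt_by_obj) are made up for illustration; the `by` argument is part of data.table's unique method and limits the comparison to the named columns.

```r
library(data.table)

# Small illustrative table (made-up values) containing duplicated rows.
dt_small <- data.table(
  obj = c("A", "A", "B", "B", "C"),
  val = c(1L, 1L, 2L, 3L, 3L)
)

# For a data.table, unique() dispatches to data.table's own method
# and compares all columns by default.
dt_unique <- unique(dt_small)

# `by` restricts the comparison to the given columns; the first row
# of each group is kept.
dt_by_obj <- unique(dt_small, by = "obj")

print(dt_unique)
print(dt_by_obj)
```

Deduplicating on a subset of columns with `by` is only appropriate when the remaining columns do not matter for what counts as a duplicate.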

Example data (250m rows, 2 columns).

library("data.table")
library("dplyr")  # provides distinct(), used in the benchmark

# Setting the number of threads to something reasonable for the benchmark.
# You don't normally need to set this.
setDTthreads(6)

DT <- data.table(
  obj=sample(LETTERS[1:10], 2.5e8, replace=TRUE),
  val=sample(seq_len(10), 2.5e8, replace=TRUE)
)

> print(object.size(DT), units="Gb")
2.8 Gb

Benchmark.

bench::mark(distinct=distinct(DT), unique=unique(DT), iterations=5)

# A tibble: 2 x 13
  expression   min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 distinct   5.28s   5.4s     0.185    2.93GB    0.123     3     2     16.24s
2 unique     1.91s  1.97s     0.504  953.69MB    0         5     0      9.93s
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
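Before relying on either function for a large table, it may be worth confirming on a small sample that both return the same set of rows. The following is a rough sketch under assumptions: the sample size, the seed, and the helper sort_rows are hypothetical and not part of the benchmark above.

```r
library(data.table)
library(dplyr)

# Scaled-down version of the example data (size and seed chosen arbitrarily).
set.seed(42)
n <- 1e5
DT_small <- data.table(
  obj = sample(LETTERS[1:10], n, replace = TRUE),
  val = sample(seq_len(10), n, replace = TRUE)
)

res_unique   <- unique(DT_small)    # data.table method
res_distinct <- distinct(DT_small)  # dplyr

# Hypothetical helper: convert to a plain data.frame and sort the rows,
# so the comparison does not depend on row order or on class.
sort_rows <- function(x) {
  df <- as.data.frame(x)
  df <- df[order(df$obj, df$val), , drop = FALSE]
  rownames(df) <- NULL
  df
}

same_rows <- isTRUE(all.equal(sort_rows(res_unique), sort_rows(res_distinct)))
print(same_rows)
```

Timings on a sample this small say little about a 20GB input, so the check is only about correctness, not speed.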

Solution 1
2 Accepted 2020-09-11 00:43:05
