
How to sample (with replacement and weights) more than .Machine$integer.max rows from a data.table?

I want to draw a sample from a data.table that has more rows than R's integer limit (available via .Machine$integer.max). Here is what I have tried:

library(bit64)
library(data.table)
library(dplyr)  # provides slice_sample()
irisdt <- as.data.table(iris)
test <- slice_sample(irisdt, n = .Machine$integer.max + 100, weight_by = Sepal.Length, replace = TRUE)
Error in sample.int(n, size, prob = wt, replace = TRUE) : 
  invalid 'size' argument
In addition: Warning message:
In sample.int(n, size, prob = wt, replace = TRUE) :
  NAs introduced by coercion to integer range
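
The warning points at the cause: the requested size is a double that no longer fits into R's 32-bit integer type, so coercing it yields NA. A minimal sketch of that coercion in isolation (the variable names are only illustrative):

```r
limit <- .Machine$integer.max             # largest value an R integer can hold
n     <- limit + 100                      # 100 is a double, so the sum is a double
n_int <- suppressWarnings(as.integer(n))  # coercion overflows and returns NA

is.double(n)   # TRUE
is.na(n_int)   # TRUE
```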

If I convert the n argument of slice_sample to integer64, I get an empty sample.

> test <- slice_sample(irisdt, n = as.integer64(.Machine$integer.max + 100),
                       weight_by = Sepal.Length, replace = T)
> nrow(test)
[1] 0

Taking several smaller samples would be the obvious solution, but that is not an option for me.

Do you have any other ideas? Thank you!

I think there are two problems here:

  • The first is the data.table row-number limitation that @Waldi commented on.
  • The second comes from the sample function, whose size argument must not exceed .Machine$integer.max; see the documentation:

Non-integer positive numerical values of n or x will be truncated to the next smallest integer, which has to be no larger than .Machine$integer.max.

You can try any size less than or equal to .Machine$integer.max:

irisdt[sample(.N, .Machine$integer.max - 2e9, replace = TRUE), ]

That works for me (I subtracted 2e9 to stay within my memory limits).
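
The one-liner above samples uniformly. The question also asks for weights, so here is a minimal sketch of a weighted draw that stays below the integer limit. It assumes Sepal.Length is the weight column, as in the question; the size of 1e6 and the names idx and smp are only placeholders for illustration.

```r
library(data.table)

irisdt <- as.data.table(iris)

size <- 1e6      # hypothetical size; must stay <= .Machine$integer.max
set.seed(1)      # only so the sketch is reproducible

# Draw row indices with replacement, weighted by Sepal.Length
idx <- sample.int(nrow(irisdt), size = size, replace = TRUE,
                  prob = irisdt$Sepal.Length)
smp <- irisdt[idx]

nrow(smp)
```

Sampling indices first and subsetting afterwards keeps the weighting logic in base R, so it does not depend on how slice_sample handles its n argument.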

As data.table doesn't allow more than .Machine$integer.max rows, you could, as a workaround, use arrow together with dplyr and furrr:

library(bit64)
library(data.table)

library(arrow)
library(dplyr)
library(furrr)

irisdt <- as.data.table(iris)

# Split job
target = .Machine$integer.max+1000
split = 100

# Distribute calculations
numcalc <- rep(round(target/split),split)
numcalc[split] <- numcalc[split] + target - sum(numcalc)

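# nbrOfWorkers() reports the workers of the *current* plan (1 by default),
# so base the count on the available cores instead and keep at least one worker
plan(multisession, workers = max(1, availableCores() - 1))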

# Generate files in parallel
numcalc %>% furrr::future_iwalk(~{
  test <- irisdt %>% slice_sample( n = .x , weight_by = Sepal.Length, replace = T) 
  write_dataset(test,paste0('D:/test/test',.y,'.parquet'),format = 'parquet')
},.options = furrr_options(seed = TRUE))

# Open dataset
ds <- open_dataset('D:/test',format='parquet')
ds
#FileSystemDataset with 100 Parquet files
#Sepal.Length: double
#Sepal.Width: double
#Petal.Length: double
#Petal.Width: double
#Species: dictionary<values=string, indices=int32>

result <- ds %>% group_by(Species) %>% summarize(n=n()) %>% collect() 

result
# A tibble: 3 x 2
#  Species            n
#  <fct>          <int>
#1 virginica  807049123
#2 versicolor 727198323
#3 setosa     613237201


sum(result$n)-.Machine$integer.max
#[1] 1000
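
The chunk sizes in the answer above are what keep every individual slice_sample call below the integer limit. As a rough sketch, the split can be wrapped in a small helper and checked on its own; split_target is a hypothetical name, and floor is used here instead of the answer's round so that the last chunk always absorbs a non-negative remainder.

```r
# Hypothetical helper: split `target` draws into `n_chunks` chunk sizes
split_target <- function(target, n_chunks) {
  sizes <- rep(floor(target / n_chunks), n_chunks)
  # Put whatever is left over into the last chunk
  sizes[n_chunks] <- sizes[n_chunks] + (target - sum(sizes))
  sizes
}

target <- .Machine$integer.max + 1000
sizes  <- split_target(target, 100)

sum(sizes) == target                  # the chunks add up to the full target
all(sizes <= .Machine$integer.max)    # each chunk is a valid sample size
```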
