
Is sample_n really a random sample when used with sparklyr?

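I have 500 million rows in a Spark dataframe. I'm interested in using sample_n from dplyr because it lets me specify the sample size I want explicitly. If I were to use sparklyr::sdf_sample(), I would first have to calculate sdf_nrow(), then compute the fraction sample_size / nrow, and then pass that fraction to sdf_sample(). This isn't a big deal, but sdf_nrow() can take a while to complete.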
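The sdf_nrow() / sdf_sample() route described above could be sketched roughly as follows. sample_fraction() and sample_approx_n() are hypothetical helper names, not sparklyr functions, and the sketch assumes sparklyr is installed and that tbl is a Spark table. The result has approximately n rows rather than exactly n, because Spark does not guarantee the exact fraction.

```r
# Hypothetical helper: turn a target row count into a sampling fraction.
sample_fraction <- function(n, total) {
  stopifnot(n >= 0, total > 0)
  min(1, n / total)
}

# Hypothetical helper: draw roughly n rows from a Spark table.
# Assumes sparklyr is installed and tbl is a tbl_spark.
sample_approx_n <- function(tbl, n, seed = NULL) {
  total <- sparklyr::sdf_nrow(tbl)  # full count; this is the slow step
  sparklyr::sdf_sample(
    tbl,
    fraction = sample_fraction(n, total),
    replacement = FALSE,
    seed = seed
  )
}

# Usage (requires a live Spark connection):
# spark_data %>% sample_approx_n(300) %>% summarise(sample_mean = mean(lifeExp))
```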

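So it would be ideal to use dplyr::sample_n() directly. However, after some testing, sample_n() doesn't look random. In fact, the results are identical to head()! It would be a major problem if the function just returned the first n rows instead of sampling rows at random.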

Can anyone confirm this? Is sdf_sample() my best option?

# install.packages("gapminder")

library(gapminder)
library(sparklyr)
library(dplyr)  # provides sample_n(), sample_frac(), summarise(), pull()
library(purrr)

sc <- spark_connect(master = "yarn-client")

spark_data <- sdf_import(gapminder, sc, "gapminder")


> # Appears to be random
> spark_data %>% sdf_sample(fraction = 0.20, replace = FALSE) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    58.83397


> spark_data %>% sdf_sample(fraction = 0.20, replace = FALSE) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    60.31693


> spark_data %>% sdf_sample(fraction = 0.20, replace = FALSE) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    59.38692
> 
> 
> # Appears to be random
> spark_data %>% sample_frac(0.20) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    60.48903


> spark_data %>% sample_frac(0.20) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    59.44187


> spark_data %>% sample_frac(0.20) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    59.27986
> 
> 
> # Does not appear to be random
> spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    57.78434


> spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    57.78434


> spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    57.78434
> 
> 
> 
> # === Test sample_n() ===
> sample_mean <- list()
> 
> for(i in 1:20){
+   
+   sample_mean[i] <- spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp)) %>% collect() %>% pull()
+   
+ }
> 
> 
> sample_mean %>% flatten_dbl() %>% mean()
[1] 57.78434
> sample_mean %>% flatten_dbl() %>% sd()
[1] 0
> 
> 
> # === Test head() ===
> spark_data %>% 
+   head(300) %>% 
+   pull(lifeExp) %>% 
+   mean()
[1] 57.78434

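It is not. If you check the execution plan (using an optimizedPlan helper function, which is user-defined and not part of sparklyr), you'll see that it is just a limit: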

spark_data %>% sample_n(300) %>% optimizedPlan()
<jobj[168]>
  org.apache.spark.sql.catalyst.plans.logical.GlobalLimit
  GlobalLimit 300
+- LocalLimit 300
   +- InMemoryRelation [country#151, continent#152, year#153, lifeExp#154, pop#155, gdpPercap#156], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `gapminder`
         +- Scan ExistingRDD[country#151,continent#152,year#153,lifeExp#154,pop#155,gdpPercap#156] 
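
optimizedPlan() is not exported by sparklyr, and the definition the answer linked to is not reproduced here. A minimal sketch of such a helper, assuming sparklyr's spark_dataframe() and invoke() together with the queryExecution / optimizedPlan members of the underlying Spark Dataset, might look like this:

```r
# Sketch of an optimizedPlan() helper (an assumption, not the linked original).
# Requires sparklyr and a tbl_spark backed by a live Spark connection.
optimizedPlan <- function(df) {
  sdf <- sparklyr::spark_dataframe(df)             # underlying Spark Dataset (jobj)
  qe  <- sparklyr::invoke(sdf, "queryExecution")   # Dataset.queryExecution
  sparklyr::invoke(qe, "optimizedPlan")            # optimized logical plan (jobj)
}

# Usage (requires a live Spark connection):
# spark_data %>% sample_n(300) %>% optimizedPlan()
```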

This is further confirmed by show_query:

spark_data %>% sample_n(300) %>% show_query()
<SQL>
SELECT *
FROM (SELECT *
FROM `gapminder` TABLESAMPLE (300 rows) ) `hntcybtgns`

and by the visualized execution plan:

[Figure: execution plan for TABLESAMPLE (n ROWS)]

Finally, if you look at the Spark source, you'll see that this case is implemented with a simple LIMIT:

case ctx: SampleByRowsContext =>
  Limit(expression(ctx.expression), query)

I believe this semantics has been inherited from Hive, where the equivalent query takes the first n rows from each input split.

In practice, getting a sample of an exact size is expensive, and you should avoid it unless strictly necessary (the same goes for large LIMITs).
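
To illustrate where the cost comes from: one way to get exactly n random rows is to attach a random key to every row, sort on it, and keep the first n. The sketch below is only an illustration, not something the answer recommends. sample_exact_n is a hypothetical name, and the code assumes that dplyr passes the unrecognized rand() call through to Spark SQL unchanged.

```r
# Hypothetical helper: exactly n random rows, at the price of a full sort.
# Assumes tbl is a tbl_spark; rand() is evaluated by Spark SQL, not by R.
sample_exact_n <- function(tbl, n) {
  keyed <- dplyr::mutate(tbl, rand_key = rand())              # random key per row
  top_n <- utils::head(dplyr::arrange(keyed, rand_key), n)    # ORDER BY ... LIMIT n
  dplyr::select(top_n, -rand_key)                             # drop the helper column
}

# Usage (requires a live Spark connection):
# spark_data %>% sample_exact_n(300) %>% summarise(sample_mean = mean(lifeExp))
```

Because every row has to receive a key and take part in the ordering, the work grows with the size of the whole table rather than with n, which is why an approximate sdf_sample() is usually the cheaper choice.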
