Is sample_n really a random sample when used with sparklyr?
I have 500 million rows in a Spark DataFrame. I'm interested in using sample_n from dplyr because it lets me explicitly specify the sample size I want. If I were to use sparklyr::sdf_sample(), I would first have to compute sdf_nrow(), then calculate the fraction sample_size / nrow, and then pass that fraction to sdf_sample. This isn't a big deal, but sdf_nrow() can take a while to complete.
So it would be ideal to use dplyr::sample_n() directly. However, after some testing, sample_n() doesn't appear to be random. In fact, the results are identical to head()! It would be a major issue if, instead of sampling rows at random, the function just returned the first n rows.
Can anyone else confirm this? Is sdf_sample() my best option?
# install.packages("gapminder")
library(gapminder)
library(sparklyr)
library(dplyr)  # provides sample_n(), sample_frac(), summarise(), collect(), pull()
library(purrr)
sc <- spark_connect(master = "yarn-client")
spark_data <- sdf_import(gapminder, sc, "gapminder")
> # Appears to be random
> spark_data %>% sdf_sample(fraction = 0.20, replace = FALSE) %>% summarise(sample_mean = mean(lifeExp))
# Source: lazy query [?? x 1]
# Database: spark_connection
sample_mean
<dbl>
1 58.83397
> spark_data %>% sdf_sample(fraction = 0.20, replace = FALSE) %>% summarise(sample_mean = mean(lifeExp))
# Source: lazy query [?? x 1]
# Database: spark_connection
sample_mean
<dbl>
1 60.31693
> spark_data %>% sdf_sample(fraction = 0.20, replace = FALSE) %>% summarise(sample_mean = mean(lifeExp))
# Source: lazy query [?? x 1]
# Database: spark_connection
sample_mean
<dbl>
1 59.38692
>
>
> # Appears to be random
> spark_data %>% sample_frac(0.20) %>% summarise(sample_mean = mean(lifeExp))
# Source: lazy query [?? x 1]
# Database: spark_connection
sample_mean
<dbl>
1 60.48903
> spark_data %>% sample_frac(0.20) %>% summarise(sample_mean = mean(lifeExp))
# Source: lazy query [?? x 1]
# Database: spark_connection
sample_mean
<dbl>
1 59.44187
> spark_data %>% sample_frac(0.20) %>% summarise(sample_mean = mean(lifeExp))
# Source: lazy query [?? x 1]
# Database: spark_connection
sample_mean
<dbl>
1 59.27986
>
>
> # Does not appear to be random
> spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp))
# Source: lazy query [?? x 1]
# Database: spark_connection
sample_mean
<dbl>
1 57.78434
> spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp))
# Source: lazy query [?? x 1]
# Database: spark_connection
sample_mean
<dbl>
1 57.78434
> spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp))
# Source: lazy query [?? x 1]
# Database: spark_connection
sample_mean
<dbl>
1 57.78434
>
>
>
> # === Test sample_n() ===
> sample_mean <- list()
>
> for(i in 1:20){
+
+ sample_mean[i] <- spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp)) %>% collect() %>% pull()
+
+ }
>
>
> sample_mean %>% flatten_dbl() %>% mean()
[1] 57.78434
> sample_mean %>% flatten_dbl() %>% sd()
[1] 0
>
>
> # === Test head() ===
> spark_data %>%
+ head(300) %>%
+ pull(lifeExp) %>%
+ mean()
[1] 57.78434
It is not. If you check the execution plan (using the optimizedPlan function as defined here), you'll see that it is just a limit:
spark_data %>% sample_n(300) %>% optimizedPlan()
<jobj[168]>
org.apache.spark.sql.catalyst.plans.logical.GlobalLimit
GlobalLimit 300
+- LocalLimit 300
+- InMemoryRelation [country#151, continent#152, year#153, lifeExp#154, pop#155, gdpPercap#156], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `gapminder`
+- Scan ExistingRDD[country#151,continent#152,year#153,lifeExp#154,pop#155,gdpPercap#156]
This is further confirmed by show_query:
spark_data %>% sample_n(300) %>% show_query()
<SQL>
SELECT *
FROM (SELECT *
FROM `gapminder` TABLESAMPLE (300 rows) ) `hntcybtgns`
and by the visualized execution plan.
Finally, if you check the Spark source, you'll see that this case is implemented with a simple LIMIT:
case ctx: SampleByRowsContext =>
Limit(expression(ctx.expression), query)
I believe these semantics were inherited from Hive, where the equivalent query takes the first n rows from each input split.
In practice, getting a sample of an exact size is very expensive, and you should avoid it unless strictly necessary (the same goes for large LIMITs).
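If an approximate sample size is acceptable, the fraction-based route from the question can be sketched roughly as follows. This is only a sketch: sample_fraction() and its oversample argument are hypothetical names introduced here, not part of sparklyr or dplyr, and the Spark calls are left as comments because they need a live connection.

```r
# Hypothetical helper (not part of sparklyr or dplyr): convert a target
# sample size into a fraction suitable for sdf_sample(). sdf_sample() returns
# only approximately fraction * nrow rows, so an optional oversampling factor
# leaves some headroom.
sample_fraction <- function(n, total_rows, oversample = 1.2) {
  stopifnot(n > 0, total_rows > 0, oversample >= 1)
  # A fraction above 1 makes no sense when sampling without replacement.
  min(1, oversample * n / total_rows)
}

# Usage against the Spark table from the question (not run here):
# total <- sdf_nrow(spark_data)  # a cached or approximate count also works
# spark_data %>%
#   sdf_sample(fraction = sample_fraction(300, total), replace = FALSE)
```

The result contains roughly oversample * n rows rather than exactly n. Trimming it afterwards with head(n) is again just a limit, so it would tend to favour rows that come earlier in the data.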