Select random rows from a PySpark dataframe

Question

I want to select n random rows (without replacement) from a PySpark dataframe, preferably as a new PySpark dataframe. What is the best way to do this?

The following is an example of a dataframe with ten rows.

+-----+-------------------+-----+
| name|          timestamp|value|
+-----+-------------------+-----+
|name1|2019-01-17 00:00:00|11.23|
|name2|2019-01-17 00:00:00|14.57|
|name3|2019-01-10 00:00:00| 2.21|
|name4|2019-01-10 00:00:00| 8.76|
|name5|2019-01-17 00:00:00|18.71|
|name5|2019-01-10 00:00:00|17.78|
|name4|2019-01-10 00:00:00| 5.52|
|name3|2019-01-10 00:00:00| 9.91|
|name1|2019-01-17 00:00:00| 1.16|
|name2|2019-01-17 00:00:00| 12.0|
+-----+-------------------+-----+

The dataframe above was generated using the following code:

from pyspark.sql import Row

df_Stats = Row("name", "timestamp", "value")

df_stat1 = df_Stats('name1', "2019-01-17 00:00:00", 11.23)
df_stat2 = df_Stats('name2', "2019-01-17 00:00:00", 14.57)
df_stat3 = df_Stats('name3', "2019-01-10 00:00:00", 2.21)
df_stat4 = df_Stats('name4', "2019-01-10 00:00:00", 8.76)
df_stat5 = df_Stats('name5', "2019-01-17 00:00:00", 18.71)
df_stat6 = df_Stats('name5', "2019-01-10 00:00:00", 17.78)
df_stat7 = df_Stats('name4', "2019-01-10 00:00:00", 5.52)
df_stat8 = df_Stats('name3', "2019-01-10 00:00:00", 9.91)
df_stat9 = df_Stats('name1', "2019-01-17 00:00:00", 1.16)
df_stat10 = df_Stats('name2', "2019-01-17 00:00:00", 12.0)

df_stat_lst = [df_stat1 , df_stat2, df_stat3, df_stat4, df_stat5,
               df_stat6, df_stat7, df_stat8, df_stat9, df_stat10]
df = spark.createDataFrame(df_stat_lst)

Answer 1

There is a sample method on pyspark.sql.DataFrame. The documentation for that method should be helpful.

Usage:

df.sample(withReplacement=False, fraction=desired_fraction)

Answer 1: accepted, 2019-10-23 01:31:46
