How to take a random row from a PySpark DataFrame?
How can I get a random row from a PySpark DataFrame? I only see the method sample(), which takes a fraction as a parameter. Setting this fraction to 1/numberOfRows leads to random results, where sometimes I won't get any row.
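A minimal sketch of that behavior (illustrative only, assuming an active SparkSession named spark) — the returned count varies from run to run and can be zero:

df = spark.createDataFrame([(i,) for i in range(100)], ["id"])
fraction = 1.0 / df.count()
for seed in range(3):
    # each call may return 0, 1, or several rows
    print(df.sample(False, fraction, seed=seed).count())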
On RDD there is a method takeSample() that takes as a parameter the number of elements you want the sample to contain. I understand that this might be slow, as you have to count each partition, but is there a way to get something like this on a DataFrame?
You can simply call takeSample on an RDD:
df = sqlContext.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")], ("k", "v"))

df.rdd.takeSample(False, 1, seed=0)
## [Row(k=3, v='c')]
If you don't want to collect, you can simply take a higher fraction and limit:
df.sample(False, 0.1, seed=0).limit(1)
Different Types of Sample

Randomly sample a percentage of the data, with and without replacement:
import pyspark.sql.functions as F

# Randomly sample 50% of the data without replacement
sample1 = df.sample(False, 0.5, seed=0)

# Randomly sample 50% of the data with replacement
sample1 = df.sample(True, 0.5, seed=0)

# Take another sample excluding records from the previous sample using an anti join
sample2 = df.join(sample1, on='ID', how='left_anti').sample(False, 0.5, seed=0)

# Take another sample excluding records from the previous sample using where
sample1_ids = [row['ID'] for row in sample1.select('ID').collect()]
sample2 = df.where(~F.col('ID').isin(sample1_ids)).sample(False, 0.5, seed=0)

# Generate a stratified sample of the data across column(s)
# Sampling is probabilistic and thus cannot guarantee an exact number of rows
fractions = {
    'NJ': 0.5,   # Take about 50% of records where state = NJ
    'NY': 0.25,  # Take about 25% of records where state = NY
    'VA': 0.1,   # Take about 10% of records where state = VA
}
stratified_sample = df.sampleBy(F.col('state'), fractions, seed=0)
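For reference, a minimal sketch of a DataFrame the snippet above assumes (hypothetical ID and state columns, not part of the original answer):

df = spark.createDataFrame(
    [(1, 'NJ'), (2, 'NJ'), (3, 'NY'), (4, 'NY'), (5, 'VA'), (6, 'VA')],
    ('ID', 'state')
)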
Here's an alternative using the Pandas DataFrame.sample method. This uses the Spark applyInPandas method to distribute the groups, available from Spark 3.0.0. This allows you to select an exact number of rows per group.

I've added args and kwargs to the function so you can access the other arguments of DataFrame.sample.
def sample_n_per_group(n, *args, **kwargs):
    def sample_per_group(pdf):
        return pdf.sample(n, *args, **kwargs)
    return sample_per_group

df = spark.createDataFrame(
    [
        (1, 1.0),
        (1, 2.0),
        (2, 3.0),
        (2, 5.0),
        (2, 10.0)
    ],
    ("id", "v")
)

(df.groupBy("id")
   .applyInPandas(
       sample_n_per_group(1, random_state=2),
       schema=df.schema
   )
)
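Like any other transformation, the call above is lazy; a small usage sketch (using the definitions above) to materialize and inspect the result:

sampled = (df.groupBy("id")
             .applyInPandas(sample_n_per_group(1, random_state=2), schema=df.schema))
sampled.show()  # one sampled row per "id" group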
Be aware of the limitations for very large groups, from the documentation:

This function requires a full shuffle. All the data of a group will be loaded into memory, so the user should be aware of the potential OOM risk if data is skewed and certain groups are too large to fit in memory.