
How to take a random row from a PySpark DataFrame?

How can I get a random row from a PySpark DataFrame? I only see the method sample() which takes a fraction as a parameter. Setting this fraction to 1/numberOfRows leads to random results, where sometimes I won't get any row.

On RDDs there is a method takeSample() that takes as a parameter the number of elements you want the sample to contain. I understand that this might be slow, as you have to count each partition, but is there a way to get something like this on a DataFrame?
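To illustrate the issue (a hypothetical sketch, assuming an active SparkSession named spark): with 4 rows and fraction 1/4, each row is kept independently with probability 0.25, so the sample is frequently empty or larger than one row.

# Hypothetical illustration: fraction = 1/numberOfRows does not guarantee one row
df4 = spark.createDataFrame([(i,) for i in range(4)], ["n"])
print(df4.sample(False, 0.25).count())  # may print 0, 1, 2, ... -- not reliably exactly 1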

You can simply call takeSample on an RDD:

# assumes an existing SQLContext (older Spark); on Spark 2+ you can use spark.createDataFrame
df = sqlContext.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")], ("k", "v"))
df.rdd.takeSample(False, 1, seed=0)
## [Row(k=3, v='c')]
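Note that takeSample returns a plain Python list of Row objects on the driver, so this does collect the sampled rows.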

If you don't want to collect, you can simply take a higher fraction and limit:

df.sample(False, 0.1, seed=0).limit(1)
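If you do eventually need the row itself on the driver, follow this with an action; a minimal sketch:

row = df.sample(False, 0.1, seed=0).limit(1).first()  # a single Row, or None if the sample happened to be empty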

Different Types of Samples

Randomly sample a percentage of the data, with and without replacement

import pyspark.sql.functions as F
#Randomly sample 50% of the data without replacement
sample1 = df.sample(False, 0.5, seed=0)

#Randomly sample 50% of the data with replacement
sample1 = df.sample(True, 0.5, seed=0)

#Take another sample excluding records from previous sample using Anti Join
sample2 = df.join(sample1, on='ID', how='left_anti').sample(False, 0.5, seed=0)

#Take another sample excluding records from previous sample using Where
sample1_ids = [row['ID'] for row in sample1.select('ID').collect()]
sample2 = df.where(~F.col('ID').isin(sample1_ids)).sample(False, 0.5, seed=0)

#Generate a stratified sample of the data across column(s)
#Sampling is probabilistic and thus cannot guarantee an exact number of rows
fractions = {
    'NJ': 0.5,  #Take about 50% of records where state = NJ
    'NY': 0.25, #Take about 25% of records where state = NY
    'VA': 0.1,  #Take about 10% of records where state = VA
}
stratified_sample = df.sampleBy('state', fractions, seed=0)
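As a quick sanity check on the stratified sample (a sketch, assuming the hypothetical df above with its 'state' column):

#Compare per-state counts before and after sampling
df.groupBy('state').count().show()
stratified_sample.groupBy('state').count().show()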

Here's an alternative using the pandas DataFrame.sample method. This uses Spark's applyInPandas method to distribute the groups, available from Spark 3.0.0. This allows you to select an exact number of rows per group.

I've added *args and **kwargs to the function so you can access the other arguments of DataFrame.sample.

def sample_n_per_group(n, *args, **kwargs):
    # Build a function usable with applyInPandas that samples n rows from each group
    def sample_per_group(pdf):
        return pdf.sample(n, *args, **kwargs)
    return sample_per_group

df = spark.createDataFrame(
    [
        (1, 1.0), 
        (1, 2.0), 
        (2, 3.0), 
        (2, 5.0), 
        (2, 10.0)
    ],
    ("id", "v")
)

(df.groupBy("id")
   .applyInPandas(
        sample_n_per_group(1, random_state=2), 
        schema=df.schema
   )
)
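Calling an action such as .show() on this result yields exactly one row per distinct id; which row is chosen within each group depends on random_state.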

Be aware of the limitations for very large groups; from the documentation:

This function requires a full shuffle. All the data of a group will be loaded into memory, so the user should be aware of the potential OOM risk if data is skewed and certain groups are too large to fit in memory.
