

Sampling a large distributed data set using pyspark / spark

I have a file in hdfs which is distributed across the nodes in the cluster.

I'm trying to get a random sample of 10 lines from this file.

In the pyspark shell, I read the file into an RDD using:

>>> textFile = sc.textFile("/user/data/myfiles/*")

And then I want to simply take a sample... The cool thing about Spark is that there are commands like takeSample; unfortunately, I think I'm doing something wrong, because the following takes a really long time:

>>> textFile.takeSample(False, 10, 12345)

So I tried creating a partition on each node and then instructing each node to sample that partition using the following command:

>>> textFile.partitionBy(4).mapPartitions(lambda blockOfLines: blockOfLines.takeSample(False, 10, 1234)).first()

but this gives the error ValueError: too many values to unpack:

org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/spark/python/pyspark/serializers.py", line 117, in dump_stream
    for obj in iterator:
  File "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/spark/python/pyspark/rdd.py", line 821, in add_shuffle_key
    for (k, v) in iterator:
ValueError: too many values to unpack
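
As an aside, partitionBy only works on RDDs of (key, value) pairs, which is where the unpack error comes from, and mapPartitions hands its function a plain Python iterator that has no takeSample method. A rough sketch of a per-partition approach, using Python's built-in random module (not something from the original attempt), could look like this:

import random

# Sketch only: grab up to 10 random lines from each partition with plain Python,
# then take 10 lines from the combined candidates on the driver.
def sample_partition(lines):
    lines = list(lines)  # materializes this partition's lines in memory
    return random.sample(lines, min(10, len(lines)))

candidates = textFile.repartition(4).mapPartitions(sample_partition)
print(candidates.take(10))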

How can I sample 10 lines from a large distributed data set using spark or pyspark?

Try using textFile.sample(False, fraction, seed) instead. takeSample will generally be very slow because it calls count() on the RDD. It needs to do this because otherwise it wouldn't take evenly from each partition; basically it uses the count along with the sample size you asked for to compute the fraction, and calls sample internally. sample is fast because it just uses a random boolean generator that returns True with probability fraction, and thus doesn't need to call count.

In addition, I don't think this is happening to you, but if the sample returned is not big enough, takeSample calls sample again, which can obviously slow it down. Since you should have some idea of the size of your data, I would recommend calling sample and then cutting the sample down to size yourself, since you know more about your data than Spark does.
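
A minimal sketch of that suggestion, reusing the fraction and seed that appear elsewhere in this thread (both are just placeholder values): over-sample a little with sample, then trim the result down to exactly 10 lines yourself.

# sample() returns roughly fraction * count lines; take(10) then cuts
# the result down to the exact number of lines wanted.
sampled = textFile.sample(False, 0.0001, 12345)
ten_lines = sampled.take(10)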

Using sample instead of takeSample appears to make things reasonably fast:

textFile.sample(False, .0001, 12345)

The problem with this is that it's hard to know the right fraction to choose unless you have a rough idea of the number of rows in your data set.
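
One hedged workaround when the row count is unknown: ask for a rough count with countApprox, which returns after a timeout instead of scanning everything, then pad the derived fraction with a generous safety margin and trim afterwards. The 20x margin below is an arbitrary choice, not something from the answers above.

# Rough row count within about a second, then a fraction padded by a 20x margin
# so the sampled RDD almost certainly contains at least 10 lines.
approx_rows = textFile.countApprox(timeout=1000, confidence=0.90)
fraction = min(1.0, 20 * 10.0 / max(approx_rows, 1))
ten_lines = textFile.sample(False, fraction, 12345).take(10)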

Different Types of Sampling in PySpark

Randomly sample a percentage of the data, with and without replacement:

import pyspark.sql.functions as F

#Randomly sample 50% of the data without replacement
sample1 = df.sample(False, 0.5, seed=0)

#Randomly sample 50% of the data with replacement
sample1 = df.sample(True, 0.5, seed=0)

#Take another sample excluding records from the previous sample using an anti join
sample2 = df.join(sample1, on='ID', how='left_anti').sample(False, 0.5, seed=0)

#Take another sample excluding records from the previous sample using where
sample1_ids = [row['ID'] for row in sample1.select('ID').collect()]
sample2 = df.where(~F.col('ID').isin(sample1_ids)).sample(False, 0.5, seed=0)

#Generate a stratified sample of the data across column(s)
#Sampling is probabilistic and thus cannot guarantee an exact number of rows
fractions = {
    'NJ': 0.5,  #Take about 50% of records where state = NJ
    'NY': 0.25, #Take about 25% of records where state = NY
    'VA': 0.1,  #Take about 10% of records where state = VA
}
stratified_sample = df.sampleBy(F.col('state'), fractions, seed=0)
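
A quick sanity check on the stratified sample, assuming the same df and 'state' column as above: compare per-state row counts before and after sampling to see that the fractions were roughly applied.

#Compare per-state row counts before and after stratified sampling
df.groupBy('state').count().show()
stratified_sample.groupBy('state').count().show()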
