Pyspark: Create a set from a random item function

New to pyspark and would like any pointers on generating a set of items based on a random selection from a given list. These random choices need to be appended to a list but must be unique, so in the Python implementation I used a set, populated inside a while loop:

import string
import random

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

my_set = set()
while len(my_set) < n:  # n being the number of unique items desired
    my_set.add(id_generator())

(Credit to https://stackoverflow.com/a/2257449/8840174 for the id_generator syntax.)

What I'd like to do is take advantage of Spark's distributed compute to complete the above much more quickly.

Process-wise, I'm thinking something like this needs to happen: hold the set on the driver node, and distribute the function out to the available workers to run id_generator() until there are n unique items in my set. There doesn't seem to be an equivalent of random.choices in pyspark, so maybe I need to use the UDF decorator to register the function in pyspark?

pyspark.sql.functions.rand gives a uniform distribution between 0 and 1, not a random choice from some list of items: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.rand.html
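
For reference, a sketch of how that 0-to-1 distribution could still drive a pick from a list (the items list below is invented purely for illustration) is to bucket rand() into 1-based indices of an array column:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical list of values to choose from (invented for illustration).
items = ["alpha", "beta", "gamma", "delta"]

df = spark.range(10)  # one row per value to generate

# Map the 0-1 rand() value onto a 1-based array index and pull that element out.
items_col = F.array(*[F.lit(x) for x in items])
index_col = (F.floor(F.rand() * len(items)) + 1).cast("int")  # element_at is 1-based
df = df.withColumn("choice", F.element_at(items_col, index_col))

df.show()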

from pyspark.sql.functions import udf

@udf
def id_generator():
    import string
    import random
    size = 6
    chars = string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for _ in range(size))

Something like the above? Although I'm still not clear how, or whether, sets work on Spark.
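
(From what I can tell, Spark has no distributed set type as such; the closest analogue seems to be deduplicating a DataFrame column, roughly like the sketch below, which assumes a DataFrame `df` with a generated "sample" column as in the answer further down.)

# Sketch: distinct() gives the Spark equivalent of collecting values into a set.
unique_ids = df.select("sample").distinct()
print(unique_ids.count())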

https://stackoverflow.com/a/61777594/8840174

The above is sort of the right idea, though I don't know that collecting the value from a single-item Spark dataframe is a good idea for millions of iterations.

The code works fine in plain Python, but I'd like to bring it down from several hours if possible. (I need to generate several random columns based on various rules/lists of values to create a dataset from scratch.)

*I know that id_generator() has a size of 6, with some 2,176,782,336 combinations (http://mathcentral.uregina.ca/QQ/database/QQ.09.00/churilla1.html), so the chance of duplicates is not huge, but even without the set() requirement I'm still struggling with the best way to append random choices from a list to another list in pyspark.
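
To make the broader goal concrete, a rough sketch of what I'm after (the column names and value lists here are just invented examples) would be building the whole dataset from spark.range and adding one randomly chosen column per rule:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

n_rows = 1_000_000

# Hypothetical rules: each output column draws from its own list of allowed values.
rules = {
    "colour": ["red", "green", "blue"],
    "size": ["S", "M", "L", "XL"],
}

df = spark.range(n_rows)
for name, values in rules.items():
    arr = F.array(*[F.lit(v) for v in values])
    # shuffle() permutes the array independently per row; element 0 is a random pick
    df = df.withColumn(name, F.shuffle(arr).getItem(0))

# Purely numeric columns can use the built-in generators directly.
df = df.withColumn("score", F.rand())

df.show(5)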

Thanks for any input!

Edit: This looks promising: Random numbers generation in PySpark

It really depends on your use case whether Spark is the best way to go; however, you could do so by applying a udf of your function to a generated dataframe and dropping duplicates. The drawback of this approach is that, because of the deduplication, it is harder to land on the exact number of datapoints you might desire.

Note 1: I've slightly adjusted your function to use random.choices.

Note 2: If running on multiple nodes, you might need to make sure each node uses a different seed for random.
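
One way to handle Note 2, separate from the udf approach below, is to seed Python's random module once per partition, e.g. with mapPartitionsWithIndex on the underlying RDD. A rough sketch (the base seed and row count are arbitrary choices):

import random
import string

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def generate_ids(partition_index, rows, size=6,
                 chars=string.ascii_uppercase + string.digits):
    # Seed once per partition so different workers don't repeat each other's
    # sequences; the base seed (42) is an arbitrary choice.
    rng = random.Random(42 + partition_index)
    for _ in rows:
        yield (''.join(rng.choices(chars, k=size)),)

rdd = spark.range(10 ** 6).rdd.mapPartitionsWithIndex(generate_ids)
df = rdd.toDF(["sample"])
df.show(5)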

import string
import random
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

SIZE = 10 ** 6

spark = SparkSession.builder.getOrCreate()

@udf(StringType())
def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choices(chars, k=size))

df = spark.range(SIZE)

df = df.withColumn('sample', id_generator()).drop('id')

print(f'Count: {df.count()}')
print(f'Unique count: {df.dropDuplicates().count()}')

df.show(5)

Which gives:

Count: 1000000                                                                  
Unique count: 999783                                                            
+------+
|sample|
+------+
|QTOVIM|
|NEH0SY|
|DJW5Q3|
|WMEKRF|
|OQ09N9|
+------+
only showing top 5 rows
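
If you do need to land on an exact number of rows despite the deduplication, a rough workaround (not part of the approach above; the oversampling factor is a guess to be tuned) is to generate slightly more rows than needed, deduplicate, and then limit back down, reusing the same spark session and id_generator udf:

n_desired = 10 ** 6
oversample = 1.01  # assumed safety margin; tune it to the duplicate rate you observe

candidates = (
    spark.range(int(n_desired * oversample))
         .withColumn('sample', id_generator())
         .drop('id')
         .dropDuplicates()
)

# limit() trims back down; with enough oversampling this yields exactly n_desired rows
exact_df = candidates.limit(n_desired)
print(f'Exact count: {exact_df.count()}')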
