
Subtract is slower with big data in PySpark. Is there a faster way?

The data looks like this:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([StructField("ID", StringType(), True),
                     StructField("Priority", IntegerType(), True)])

data = spark.createDataFrame([("A", 1), ("B", 1), ("B", 2), ("C", 2), ("C", 3), ("D", 3)], schema)

+---+--------+
| ID|Priority|
+---+--------+
|  A|       1|
|  B|       1|
|  B|       2|
|  C|       2|
|  C|       3|
|  D|       3|
+---+--------+

Problem statement: For each priority, the ID has to be compared with the IDs of all previous priorities.

For example, priority 2 is compared with the IDs of priority 1, priority 3 with the IDs of priorities 1 and 2, and so on.

Step 1: Create new_data by filtering for priority 1:

from pyspark.sql.functions import col

new_data = data.filter(col('Priority') == 1)

Step 2: For each subsequent priority, subtract the IDs already present in new_data, then append the remaining rows to new_data:

from pyspark.sql import functions as F

for i in range(2, 4):
    # Keep only the IDs of this priority that have not appeared at a lower priority
    x = data.filter(col('Priority') == i).select('ID')
    x = x.subtract(new_data.select('ID'))
    x = x.withColumn('Priority', F.lit(i))
    new_data = new_data.union(x)

+---+--------+
| ID|Priority|
+---+--------+
|  A|       1|
|  B|       1|
|  C|       2|
|  D|       3|
+---+--------+

The final new_data is the desired outcome. But the problem is that with big data this approach becomes much slower, because new_data grows with every iteration.
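Part of the slowdown is that each union also extends the logical plan Spark has to re-analyze, not just the data. One possible mitigation within this iterative scheme, shown here only as a sketch (it assumes a checkpoint directory has been configured via spark.sparkContext.setCheckpointDir, which is not part of the code above), is to checkpoint new_data each iteration so the accumulated plan is truncated:

# Sketch only: checkpoint() materializes new_data and cuts the growing lineage.
# Assumes spark.sparkContext.setCheckpointDir('/tmp/checkpoints') was called beforehand.
for i in range(2, 4):
    x = data.filter(col('Priority') == i).select('ID')
    x = x.subtract(new_data.select('ID'))
    x = x.withColumn('Priority', F.lit(i))
    new_data = new_data.union(x).checkpoint()

This only bounds the plan size; the data itself still grows each iteration, so the approach remains inherently slow at scale.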

Is there a faster approach? Kindly help.

IIUC you want the highest priority (which is the lowest value) per ID. This can be done simply by grouping by ID and selecting min(Priority).

data.groupby('ID').min('Priority').show()


+---+-------------+
| ID|min(Priority)|
+---+-------------+
|  A|            1|
|  B|            1|
|  C|            2|
|  D|            3|
+---+-------------+
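If you would rather keep the column named Priority instead of min(Priority), the same aggregation can be written with an explicit alias (a minor variation, not part of the original answer):

from pyspark.sql import functions as F

data.groupBy('ID').agg(F.min('Priority').alias('Priority')).show()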

The faster approach is to make a window partitioned by 'ID' (so each ID is checked individually) and ordered by 'Priority'. Then, for each row, check the minimum priority seen for that 'ID'. If the minimum priority equals the priority of that row, there is no lower priority for that 'ID', so the row belongs in your final output table.

from pyspark.sql import functions as F
from pyspark.sql import Window as W

window = W.partitionBy('ID').orderBy('Priority')
(
    data
    .withColumn('minPriority', F.min('Priority').over(window))  # lowest priority seen per ID
    .filter(F.col('Priority') == F.col('minPriority'))          # keep only the lowest-priority row(s)
    .drop('minPriority')
).show()
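One detail to be aware of: if an ID ever appears twice with the same minimum priority, the filter above keeps both rows. If you want exactly one row per ID (an assumption about the desired behaviour, not stated in the original answer), a row_number-based variant would be:

from pyspark.sql import functions as F
from pyspark.sql import Window as W

window = W.partitionBy('ID').orderBy('Priority')
(
    data
    .withColumn('rn', F.row_number().over(window))  # rn == 1 marks the lowest-priority row per ID
    .filter(F.col('rn') == 1)
    .drop('rn')
).show()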
