PySpark: how to get all combinations of columns
I have a DF with combinations of batches, inputs, and outputs, and I would like to be able to add their "unique combinations" back to the DataFrame. A simple representation of the data looks like this:
Batch | Output | Input |
---|---|---|
1 | A | X |
1 | A | Y |
1 | A | Z |
2 | A | X |
2 | A | Y |
2 | A | Z |
3 | A | V |
3 | A | Y |
3 | A | Z |
4 | A | W |
4 | A | Y |
4 | A | Z |
So as you can see, there are 4 batches and 3 different combinations of input that produce the same output type. What I would like to end up with is:
Batch | Output | Input | Combination |
---|---|---|---|
1 | A | X | 1 |
1 | A | Y | 1 |
1 | A | Z | 1 |
2 | A | X | 1 |
2 | A | Y | 1 |
2 | A | Z | 1 |
3 | A | V | 2 |
3 | A | Y | 2 |
3 | A | Z | 2 |
4 | A | W | 3 |
4 | A | Y | 3 |
4 | A | Z | 3 |
I am looking to implement this in PySpark for further data manipulation; any guidance would be appreciated :)
EDIT: still inelegant, but it works in PySpark! I am sure there must be an easier way to do this using either sets or dictionaries; my brain just refuses to let me see it...
from pyspark.sql import functions as f
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [
        (1, 'A', 'X'),
        (1, 'A', 'Y'),
        (1, 'A', 'Z'),
        (2, 'A', 'X'),
        (2, 'A', 'Y'),
        (2, 'A', 'Z'),
        (3, 'A', 'V'),
        (3, 'A', 'Y'),
        (3, 'A', 'Z'),
        (4, 'A', 'W'),
        (4, 'A', 'Y'),
        (4, 'A', 'Z'),
        (5, 'B', 'X'),
        (5, 'B', 'Y'),
        (5, 'B', 'Z')
    ],
    ["Batch", "Output", "Input"]
)

# Build a sorted, underscore-joined string of each batch's inputs
# (sort_array already sorts, so no prior orderBy is needed).
grouped = df.groupBy("Batch", "Output").agg(
    f.concat_ws('_', f.sort_array(f.collect_list("Input"))).alias("Comb")
)
# Prefix with the output type so identical input sets for different outputs count separately.
grouped = grouped.withColumn("TotalComb", f.concat_ws('_', grouped.Output, grouped.Comb))
# Number the distinct combination strings (single-partition window).
w = Window.partitionBy().orderBy(f.col('TotalComb').asc())
groupunique = grouped[["TotalComb"]].distinct().withColumn("UniqueComb", f.row_number().over(w))
# Join the combination ids back onto the original rows.
connected = df.join(grouped, on=["Batch", "Output"], how="left") \
    .join(groupunique, on=["TotalComb"], how="left")
Create a list of inputs per batch, classify rows by that list, find consecutive differences, and use them to create values to cumulatively sum over the entire df:
from pyspark.sql.functions import collect_set, lag, sum, when
from pyspark.sql.window import Window

# Ties on Batch make the frame span the whole partition, so each row sees the batch's full input set.
w = Window.partitionBy("Batch", "Output").orderBy("Batch")
df1 = (df.withColumn('Combination', collect_set("Input").over(w))
       # 1 on the first row of each distinct (Combination, Output) pair, 0 otherwise, then a running sum.
       .withColumn('Combination', sum(when(lag('Output').over(Window.partitionBy("Combination", 'Output').orderBy("Batch")).isNull(), 1)
                                      .otherwise(0)).over(Window.partitionBy().orderBy('Batch'))))
df1.show()
+-----+------+-----+-----------+
|Batch|Output|Input|Combination|
+-----+------+-----+-----------+
| 1| A| X| 1|
| 1| A| Y| 1|
| 1| A| Z| 1|
| 2| A| X| 1|
| 2| A| Y| 1|
| 2| A| Z| 1|
| 3| A| V| 2|
| 3| A| Y| 2|
| 3| A| Z| 2|
| 4| A| W| 3|
| 4| A| Y| 3|
| 4| A| Z| 3|
| 5| B| X| 4|
| 5| B| Y| 4|
| 5| B| Z| 4|
+-----+------+-----+-----------+
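Outside Spark, this flag-and-running-sum trick reduces to a few lines of plain Python. A minimal sketch, assuming rows arrive ordered by Batch and each batch is represented by its full input set (the variable names are illustrative, not from the answer):

```python
# Illustration of the answer's trick: emit 1 the first time a
# (Output, input-set) pair is seen, 0 afterwards, then running-sum the flags.
batches = [  # (Batch, Output, full input set of the batch), ordered by Batch
    (1, 'A', frozenset('XYZ')),
    (2, 'A', frozenset('XYZ')),
    (3, 'A', frozenset('VYZ')),
    (4, 'A', frozenset('WYZ')),
    (5, 'B', frozenset('XYZ')),
]

seen, running, combination = set(), 0, {}
for batch, output, inputs in batches:
    flag = 1 if (output, inputs) not in seen else 0  # plays the role of lag(...).isNull()
    seen.add((output, inputs))
    running += flag                                  # the cumulative sum over orderBy('Batch')
    combination[batch] = running
```

Because the flag fires exactly once per new combination and batches are processed in order, the running sum yields consecutive ids, matching the `Combination` column in the output above.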