Using a PySpark dataframe to look up values in one array by index and copy them into another array

Question

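In this dataframe I have two arrays: discount_applications and line_items. Each element of line_items has an inner array named discount_allocations, whose elements have a field named discount_application_index. The requirement is to use the discount_application_index value as an index into the discount_applications array, find the corresponding "type" value, and copy it into the matching application_type field.

Here is the dataframe: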

records = '[{"_c":{"discount_applications":[{"type":"manual0"},{"type":"manual1"},{"type":"manual2"},{"type":"manual3"}],"line_items":[{"discount_allocations":[{"application_type":"","discount_application_index":0}]},{"discount_allocations":[{"application_type":"","discount_application_index":1}]},{"discount_allocations":[{"application_type":"","discount_application_index":2}]},{"discount_allocations":[{"application_type":"","discount_application_index":3}]}]}},{"_c":{"discount_applications":[{"type":"manual0"},{"type":"manual1"},{"type":"manual2"}],"line_items":[{"discount_allocations":[{"application_type":"","discount_application_index":0}]},{"discount_allocations":[{"application_type":"","discount_application_index":1}]},{"discount_allocations":[{"application_type":"","discount_application_index":2}]}]}},{"_c":{"discount_applications":[{"type":"manual0"},{"type":"manual1"},{"type":"manual2"}],"line_items":[{"discount_allocations":[{"application_type":"","discount_application_index":0}]},{"discount_allocations":[{"application_type":"","discount_application_index":1}]},{"discount_allocations":[{"application_type":"","discount_application_index":2}]}]}}]'
df = spark.read.json(sc.parallelize([records]))
df.show(truncate=False)
df.printSchema()

root
 |-- _c: struct (nullable = true)
 |    |-- discount_applications: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- type: string (nullable = true)
 |    |-- line_items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- discount_allocations: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- application_type: string (nullable = true)
 |    |    |    |    |    |-- discount_application_index: long (nullable = true)

+--------------------------------------------------------------------------------------------+
|_c                                                                                          |
+--------------------------------------------------------------------------------------------+
|[[[manual0], [manual1], [manual2], [manual3]], [[[[, 0]]], [[[, 1]]], [[[, 2]]], [[[, 3]]]]]|
|[[[manual0], [manual1], [manual2]], [[[[, 0]]], [[[, 1]]], [[[, 2]]]]]                      |
|[[[manual0], [manual1], [manual2]], [[[[, 0]]], [[[, 1]]], [[[, 2]]]]]                      |
+--------------------------------------------------------------------------------------------+

After the transformation, the dataframe should look like this:

+------------------------------------------------------------------------------------------------------------------------+
|_c                                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------+
|[[[manual0], [manual1], [manual2], [manual3]], [[[[manual0, 0]]], [[[manual1, 1]]], [[[manual2, 2]]], [[[manual3, 3]]]]]|
|[[[manual0], [manual1], [manual2]], [[[[manual0, 0]]], [[[manual1, 1]]], [[[manual2, 2]]]]]                             |
|[[[manual0], [manual1], [manual2]], [[[[manual0, 0]]], [[[manual1, 1]]], [[[manual2, 2]]]]]                             |
+------------------------------------------------------------------------------------------------------------------------+

Answer 1

Just keep a clear head and use transform :)

import pyspark.sql.functions as F

df2 = df.withColumn(
    '_c', 
    F.expr("""
        struct(
            _c.discount_applications,
            transform(
                _c.line_items,
                x -> struct(
                    transform(
                        x.discount_allocations,
                        y -> struct(
                            _c.discount_applications[int(y.discount_application_index)].type as application_type,
                            y.discount_application_index as discount_application_index
                        )
                    ) as discount_allocations
                )
            ) as line_items
        )
    """)
)

df2.show(truncate=False)
+------------------------------------------------------------------------------------------------------------------------+
|_c                                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------+
|[[[manual0], [manual1], [manual2], [manual3]], [[[[manual0, 0]]], [[[manual1, 1]]], [[[manual2, 2]]], [[[manual3, 3]]]]]|
|[[[manual0], [manual1], [manual2]], [[[[manual0, 0]]], [[[manual1, 1]]], [[[manual2, 2]]]]]                             |
|[[[manual0], [manual1], [manual2]], [[[[manual0, 0]]], [[[manual1, 1]]], [[[manual2, 2]]]]]                             |
+------------------------------------------------------------------------------------------------------------------------+
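
The nested transform above amounts to a per-row index lookup. As a rough cross-check, here is a minimal plain-Python sketch of the same logic on a single record, with no Spark involved. The function name fill_application_types and the shortened sample record are hypothetical, made up for illustration. It also assumes that an out-of-range index should produce None, which, as far as I know, matches what Spark's array indexing returns (null) when ANSI mode is off.

```python
import json


def fill_application_types(record):
    """Return a new record with each allocation's application_type filled in.

    Hypothetical helper: mirrors the nested transform() on one parsed JSON row.
    """
    c = record["_c"]
    applications = c["discount_applications"]

    def lookup(index):
        # Assumption: an out-of-range (or missing) index yields None,
        # like a null result from array indexing in non-ANSI Spark.
        if index is not None and 0 <= index < len(applications):
            return applications[index]["type"]
        return None

    line_items = [
        {
            "discount_allocations": [
                {
                    "application_type": lookup(alloc["discount_application_index"]),
                    "discount_application_index": alloc["discount_application_index"],
                }
                for alloc in item["discount_allocations"]
            ]
        }
        for item in c["line_items"]
    ]
    return {"_c": {"discount_applications": applications, "line_items": line_items}}


# Shortened sample in the same shape as the question's data (indexes deliberately swapped).
sample = json.loads(
    '{"_c":{"discount_applications":[{"type":"manual0"},{"type":"manual1"}],'
    '"line_items":['
    '{"discount_allocations":[{"application_type":"","discount_application_index":1}]},'
    '{"discount_allocations":[{"application_type":"","discount_application_index":0}]}]}}'
)

result = fill_application_types(sample)
types = [
    alloc["application_type"]
    for item in result["_c"]["line_items"]
    for alloc in item["discount_allocations"]
]
print(types)  # ['manual1', 'manual0']
```

Running something like this on a few collected rows can help confirm that the Spark expression picks the type by index rather than by position in line_items.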

Answer 1
1 accepted 2021-02-06 08:01:40