Modify a UDF in Spark to create an additional key column

Question

I have a DataFrame with rows of data and one column of XML that needs to be parsed. I can parse the XML with the following code from this Stack Overflow solution:

import xml.etree.ElementTree as ET
import pyspark.sql.functions as F

@F.udf('array<struct<id:string, age:string, sex:string>>')
def parse_xml(s):
    root = ET.fromstring(s)
    return list(map(lambda x: x.attrib, root.findall('visitor')))
    
df2 = df.select(
    F.explode(parse_xml('visitors')).alias('visitors')
).select('visitors.*')

df2.show()

This function creates a new DataFrame from the parsed XML data.
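To make the UDF body concrete, here is a minimal sketch of what it returns for a single XML string, run in plain Python without Spark. The sample XML and the function name `parse_visitors` are assumptions for illustration; the question does not show the real data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real XML layout is not shown in the question.
sample = (
    '<visitors>'
    '<visitor id="9615" age="68" sex="F"/>'
    '<visitor id="1882" age="34" sex="M"/>'
    '</visitors>'
)

def parse_visitors(s):
    """Return one attribute dict per <visitor> child of the root element."""
    root = ET.fromstring(s)
    return [dict(v.attrib) for v in root.findall('visitor')]

rows = parse_visitors(sample)
print(rows)
```

Each dict becomes one struct in the returned array, and `F.explode` then turns each struct into its own row.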

Instead, how can I modify this function to include a column from the original DataFrame, so that the result can be joined back later?

For example, if the original DataFrame looks like:

+----+---+----------------------+
|id  |a  |xml                   |
+----+---+----------------------+
|1234|.  |<row1, row2>          |
|2345|.  |<row3, row4>, <row5>  |
|3456|.  |<row6>                |
+----+---+----------------------+

How can I include the ID in every row of the newly created DataFrame?

Answer 1

When building df2, you also need to select the id column. I think you can do the following:

df2 = df.select('id',
    F.explode(parse_xml('visitors')).alias('visitors')
).select('id','visitors.*')

Here is a small, self-contained example that demonstrates the idea:

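import pyspark.sql.functions as F

# Assumes an active SparkSession named `spark` (as in the pyspark shell).
df = spark.createDataFrame(
    [(1, ["xml1", "xml2", "xml3"]), (2, ["xml4", "xml5", "xml6"]), (3, ["xml7", "xml8", "xml9"])],
    ["id", "xml"],
)
df.show()
# explode() emits one row per array element; "id" is repeated on each of them.
df_exploded_with_id = df.select("id", F.explode(F.col("xml")))
df_exploded_with_id.show()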

Output:

+---+------------------+
| id|               xml|
+---+------------------+
|  1|[xml1, xml2, xml3]|
|  2|[xml4, xml5, xml6]|
|  3|[xml7, xml8, xml9]|
+---+------------------+

+---+----+
| id| col|
+---+----+
|  1|xml1|
|  1|xml2|
|  1|xml3|
|  2|xml4|
|  2|xml5|
|  2|xml6|
|  3|xml7|
|  3|xml8|
|  3|xml9|
+---+----+
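An alternative that follows the question's title more literally is to pass the key into the UDF, so that every parsed struct already carries it. The sketch below shows only the plain-Python part and can be checked without Spark. The field name `row_id`, the function name `parse_xml_with_key`, and the sample XML are assumptions. A separate name is used for the key because each visitor already has an `id` attribute, and selecting both would produce two columns called `id`.

```python
import xml.etree.ElementTree as ET

def parse_xml_with_key(row_id, s):
    """Return one dict per <visitor>, each tagged with the parent row's key."""
    root = ET.fromstring(s)
    # 'row_id' is a hypothetical field name, chosen so it does not clash
    # with the 'id' attribute already present on each visitor.
    return [{'row_id': row_id, **v.attrib} for v in root.findall('visitor')]

# Hypothetical sample input; the question does not show the real XML.
sample = '<visitors><visitor id="9615" age="68" sex="F"/></visitors>'
tagged = parse_xml_with_key('1234', sample)
print(tagged)

# With Spark available, the same function could be wrapped as a UDF, e.g.:
#   parse_udf = F.udf(
#       parse_xml_with_key,
#       'array<struct<row_id:string, id:string, age:string, sex:string>>')
#   df2 = (df.select(F.explode(parse_udf('id', 'visitors')).alias('v'))
#            .select('v.*'))
# The declared type of row_id should match the type of the key column.
```

Selecting the key column next to `explode`, as in the answer above, is usually simpler. Passing the key into the UDF is mainly useful when the parsing logic itself needs it.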
