Spark: Using a UDF to create an Array column in a DataFrame

Question

I have a simple function that takes some XML in a field, parses the values, and returns a list:

<data>
   <datas a="1" b="2" c="3"/>
   <datas a="2" b="3" c="2"/>
</data>

becomes a nested list [[1,2,3],[2,3,2]]
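For concreteness, here is a minimal sketch of what such a parser might look like. The real function is not shown in the question, so the body below is an assumption: it uses the standard-library `xml.etree.ElementTree`, treats the `datas` elements as self-closing so the XML is well-formed, and reads the attributes `a`, `b`, `c` as integers.

```python
import xml.etree.ElementTree as ET


def myparser(xml_text):
    """Return one [a, b, c] list of ints per <datas> element (assumed layout)."""
    root = ET.fromstring(xml_text)
    return [
        [int(node.get(name)) for name in ("a", "b", "c")]
        for node in root.iter("datas")
    ]


# Sample input mirroring the XML above, with self-closing <datas/> tags.
sample = '<data><datas a="1" b="2" c="3"/><datas a="2" b="3" c="2"/></data>'
print(myparser(sample))  # [[1, 2, 3], [2, 3, 2]]
```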

I've made this a udf, and I'm making this call on my dataframe:

myudf = udf(myparser)
df2 = df1.withColumn("newDataColumn", myudf(df1["xmldatafield"]))

This works, except that newDataColumn is of type STRING instead of Array, so I can't use any of the SQL array functions on it to access or work with individual elements.

I've confirmed in Python that the function is returning a list.

Any idea what I'm doing wrong, or how I could get this to be an array column type?

Answer 1

A friend of mine just told me: the solution is to pass the return data type to the udf function. Duh. When no return type is given, udf defaults to StringType, which is why the column came out as a string.

1 2022-05-09 17:57:57
