How to convert an array to string efficiently in PySpark / Python
I have a df with the following schema:
root
|-- col1: string (nullable = true)
|-- col2: array (nullable = true)
| |-- element: string (containsNull = true)
in which one of the columns, col2, is an array [1#b, 2#b, 3#c]. I want to convert this to the string format 1#b,2#b,3#c.
I am currently doing this through the following snippet:
from pyspark.sql.functions import explode, concat_ws, collect_list

df2 = df1.select("*", explode("col2")).drop("col2")  # one row per array element
df2 = df2.groupBy("col1").agg(concat_ws(",", collect_list("col")).alias("col2"))
While this gets the job done, it takes time and seems inefficient. Is there a better alternative?
You can call concat_ws directly on the array column, like this:

df1.withColumn('col2', concat_ws(',', 'col2'))
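
As a quick check, here is a minimal, self-contained sketch of this approach. It assumes Spark 2.4+, where concat_ws accepts array columns directly; the sample data is made up to match the schema in the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample row matching the question's schema (col1: string, col2: array<string>)
df1 = spark.createDataFrame([("a", ["1#b", "2#b", "3#c"])], ["col1", "col2"])

df1.withColumn("col2", concat_ws(",", "col2")).show(truncate=False)
# +----+-----------+
# |col1|col2       |
# +----+-----------+
# |a   |1#b,2#b,3#c|
# +----+-----------+

This avoids the explode/groupBy round trip entirely: concat_ws joins the array elements row by row, so no shuffle or aggregation is needed.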