
How to convert an array to string efficiently in PySpark / Python

I have a df with the following schema:

root
 |-- col1: string (nullable = true)
 |-- col2: array (nullable = true)
 |    |-- element: string (containsNull = true)

in which one of the columns, col2, is an array such as [1#b, 2#b, 3#c]. I want to convert this to the string format 1#b,2#b,3#c.

I am currently doing this through the following snippet:

from pyspark.sql.functions import explode, concat_ws, collect_list

# explode emits one row per array element, in a new column named 'col'
df2 = df1.select("*", explode('col2')).drop('col2')
df2 = df2.groupBy("col1").agg(concat_ws(",", collect_list('col')).alias("col2"))

While this gets the job done, it takes time and also seems inefficient.

Is there a better alternative?

You can call concat_ws directly on the array column, like this:

df1.withColumn('col2', concat_ws(',', 'col2'))
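
As a quick illustration, here is a minimal, self-contained sketch of the same call (the SparkSession setup and the sample data are assumptions for demonstration, not from the original post):

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data matching the schema in the question
df1 = spark.createDataFrame(
    [("a", ["1#b", "2#b", "3#c"])],
    ["col1", "col2"],
)

# concat_ws applied to an array column joins its elements row-wise
df2 = df1.withColumn('col2', concat_ws(',', 'col2'))
df2.show(truncate=False)
# +----+-----------+
# |col1|col2       |
# +----+-----------+
# |a   |1#b,2#b,3#c|
# +----+-----------+

This is faster because concat_ws operates row by row on each array, whereas the explode/groupBy approach forces a shuffle to regroup the exploded rows by col1.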
