How to concatenate 2 ArrayType columns on axis = 1 in a PySpark dataframe?

Question

I have the following dataframe:

I would like to combine lat and lon into a single list. mmsi acts as an ID (it is unique).

+---------+--------------------+--------------------+
|     mmsi|                 lat|                 lon|
+---------+--------------------+--------------------+
|255801480|[47.1018366666666...|[-5.3017783333333...|
|304182000|[44.6343033333333...|[-63.564803333333...|
|304682000|[41.1936, 41.1715...|[-8.7716, -8.7514...|
|305930000|[49.5221333333333...|[-3.6310166666666...|
|306216000|[42.8185133333333...|[-29.853155, -29....|
|477514400|[47.17205, 47.165...|[-58.6317, -58.60...|
+---------+--------------------+--------------------+

That is, I would like to concatenate the lat and lon arrays on axis = 1, so that I end up with a list of lists in a separate column, like:

[[47.1018366666666, -5.3017783333333], ... ]

How can this be done in a PySpark dataframe? I have tried concat, but that returns:

[47.1018366666666, 44.6343033333333, ..., -5.3017783333333, -63.564803333333, ...]

Any help is much appreciated!

Answer 1

Starting with Spark 2.4, you can use the built-in function arrays_zip. It pairs the elements of the two arrays by position and returns an array of structs, one struct per (lat, lon) pair.

from pyspark.sql.functions import arrays_zip

# Pair lat[i] with lon[i] in a new column and print the result
df.withColumn('zipped_lat_lon', arrays_zip(df.lat, df.lon)).show()

Answer 1: accepted, 2019-12-09 19:01:50