
Pyspark Dataframe - How to concatenate columns based on array of columns as input

I have a dataframe with 10 columns and want to perform a function - concatenation - based on an array of columns that comes as input:

arr = ["col1", "col2", "col3"]

This is what I have so far:

from pyspark.sql.functions import concat, col

newDF = rawDF.select(concat(col("col1"), col("col2"), col("col3"))).exceptAll(updateDF.select(concat(col("col1"), col("col2"), col("col3"))))

Also:

df3 = df2.join(df1, concat(df2.col1, df2.col2, df2.col3) == df1.col5)

But I want to write a loop or a function that does this based on the input array (instead of hard-coding the columns as it is now).

What is the best way to do this?

You can unpack the cols using (*). In the pyspark.sql docs, if a function has a (*cols) signature, it means you can unpack the cols. For concat:

pyspark.sql.functions.concat(*cols)

from pyspark.sql import functions as F
arr = ["col1", "col2", "col3"]
newDF = rawDF.select(F.concat(*(F.col(col) for col in arr))).exceptAll(updateDF.select(F.concat(*(F.col(col) for col in arr))))
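As a side note (not part of the original answer), pyspark.sql.functions.concat accepts column names given as plain strings as well as Column objects, so the generator expression can arguably be simplified; a minimal sketch under that assumption, reusing the same rawDF, updateDF and arr as above:

from pyspark.sql import functions as F

# Assumes concat(*cols) takes Column or str, so the list of column names
# can be unpacked directly without wrapping each name in F.col.
newDF = rawDF.select(F.concat(*arr)).exceptAll(updateDF.select(F.concat(*arr)))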

For joins:

arr = ['col1', 'col2', 'col3']
df3 = df2.join(df1, F.concat(*(F.col(col) for col in arr)) == df1.col5)
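To package this as the reusable function the question asks for, one possible sketch (the helper name concat_cols is illustrative, not from the original answer; rawDF, updateDF, df1 and df2 are the dataframes from the question):

from pyspark.sql import functions as F

def concat_cols(col_names):
    # Build a single concatenated Column from a list of column-name strings.
    return F.concat(*(F.col(c) for c in col_names))

arr = ["col1", "col2", "col3"]
newDF = rawDF.select(concat_cols(arr)).exceptAll(updateDF.select(concat_cols(arr)))
df3 = df2.join(df1, concat_cols(arr) == df1.col5)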

