PySpark DataFrame: operating on multiple columns dynamically

Question

In PySpark, suppose I have a DataFrame with columns named 'a1', 'a2', 'a3', ..., 'a99'. How do I apply an operation to each of them to create new columns with new names dynamically?

For example, to get new columns such as sum('a1') as 'total_a1', ..., sum('a99') as 'total_a99'.

Answer 1

You can use a list comprehension with alias.

To return only the new columns:

import pyspark.sql.functions as f
df1 = df.select(*[f.sum(c).alias("total_"+c) for c in df.columns])

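And if you wanted to keep the existing columns as well, note that an aggregate such as sum cannot be selected next to plain columns without a grouping, so apply it over a window spanning the whole DataFrame:

from pyspark.sql import Window
df2 = df.select("*", *[f.sum(c).over(Window.partitionBy()).alias("total_"+c) for c in df.columns])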
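If the DataFrame also contains columns that should not be summed, the same idea can be expressed with SQL expression strings passed to selectExpr. The following is only a minimal sketch: the helper name total_exprs, its prefix parameters, and the extra "id" column are hypothetical and not part of the original answer. The list of column names stands in for df.columns so the snippet runs without a Spark session.

```python
def total_exprs(columns, prefix="a", new_prefix="total_"):
    """Build one 'sum(col) AS total_col' SQL expression per matching column.

    `prefix` and `new_prefix` are illustrative assumptions, not Spark options.
    """
    return [
        f"sum(`{c}`) AS `{new_prefix}{c}`"
        for c in columns
        if c.startswith(prefix)  # skip columns that should not be summed
    ]


# Stand-in for df.columns; "id" is a hypothetical column we do not want to sum.
columns = ["id", "a1", "a2", "a3"]
exprs = total_exprs(columns)
print(exprs)

# With a real DataFrame `df`, the expressions would be applied as:
# df1 = df.selectExpr(*total_exprs(df.columns))
```

Building the expressions as plain strings keeps the naming logic easy to inspect before it is handed to Spark.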

Answer 1 was posted on 2019-02-28 15:37:41.
