pyspark dataframe operate on multiple columns dynamically
In pyspark, suppose I have a dataframe with columns named 'a1', 'a2', 'a3', ..., 'a99'. How do I apply an operation to each of them to create new columns with new names dynamically?
For example, to get new columns such as sum('a1') as 'total_a1', ..., sum('a99') as 'total_a99'.
You can use a list comprehension with alias.
To return only the new columns:
import pyspark.sql.functions as f
df1 = df.select(*[f.sum(c).alias("total_"+c) for c in df.columns])
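And if you wanted to keep the existing columns as well, apply each sum over an empty window, because a bare aggregate cannot be selected alongside "*" without a group by:
from pyspark.sql import Window
df2 = df.select("*", *[f.sum(c).over(Window.partitionBy()).alias("total_"+c) for c in df.columns])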
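As a variation on the list comprehension approach, the same aggregate expressions can be generated as Spark SQL strings and passed to `selectExpr`. This is only a sketch: the helper names `total_exprs` and `add_totals` and the `prefix` parameter are hypothetical, introduced here for illustration, and `df` is assumed to be an existing pyspark DataFrame whose selected columns are numeric.

```python
def total_exprs(columns, prefix="total_"):
    """Build one Spark SQL aggregate expression string per column name.

    Hypothetical helper: not part of pyspark.
    """
    # Backticks quote the identifiers so unusual column names still parse.
    return [f"sum(`{c}`) AS `{prefix}{c}`" for c in columns]


def add_totals(df, columns=None):
    """Return a one-row DataFrame holding the sum of each chosen column.

    `df` is assumed to be a pyspark DataFrame; `columns` defaults to all
    of its columns.
    """
    cols = df.columns if columns is None else columns
    return df.selectExpr(*total_exprs(cols))


# The expression builder is plain Python, so it can be inspected
# without a Spark session.
print(total_exprs(["a1", "a2"]))
```

To restrict the sums to the 'a' columns from the question, something like `add_totals(df, [c for c in df.columns if c.startswith("a")])` could be used. Building strings keeps the name-generation logic testable on its own, at the cost of losing the `Column` objects that `f.sum(c).alias(...)` provides.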