Pyspark: Concat function generated columns into new dataframe
I have a PySpark dataframe (df) with n columns. I would like to generate another dataframe with n columns, where each column records the percentage difference between consecutive rows in the corresponding column of the original df. The column headers in the new df should be the corresponding column header in the old dataframe + "_diff".
With the following code I can generate the new columns of percentage changes for each column in the original df, but I am not able to put them in a new df with suitable column headers:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as func
spark = (SparkSession
         .builder
         .appName('pct_change')
         .enableHiveSupport()
         .getOrCreate())
df = spark.createDataFrame([(1, 10, 11, 12), (2, 20, 22, 24), (3, 30, 33, 36)],
                           ["index", "col1", "col2", "col3"])
w = Window.orderBy("index")
for i in range(1, len(df.columns)):
    col_pctChange = func.log(df[df.columns[i]]) - func.log(func.lag(df[df.columns[i]]).over(w))
Thanks
In this case, you can do a list comprehension inside of a call to select.
To make the code a little more compact, we can first get the columns we want to diff in a list:
diff_columns = [c for c in df.columns if c != 'index']
Next, select the index and iterate over diff_columns to compute the new column. Use .alias() to rename the resulting column:
df_diff = df.select(
    'index',
    *[(func.log(func.col(c)) - func.log(func.lag(func.col(c)).over(w))).alias(c + "_diff")
      for c in diff_columns]
)
df_diff.show()
#+-----+------------------+-------------------+-------------------+
#|index|         col1_diff|          col2_diff|          col3_diff|
#+-----+------------------+-------------------+-------------------+
#|    1|              null|               null|               null|
#|    2| 0.693147180559945| 0.6931471805599454| 0.6931471805599454|
#| 3|0.4054651081081646|0.40546510810816416|0.40546510810816416|
#+-----+------------------+-------------------+-------------------+
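Note that the expression carried over from the question computes a log difference, log(x_t) - log(x_{t-1}), which only approximates the relative change when the change is small. In the sample data every value doubles from row 1 to row 2, so the log difference is log(2), about 0.693, while the plain percentage change would be 1.0 (100%). The following is a minimal plain-Python sketch, without Spark, that reproduces both quantities on the sample data. The helper names log_diff and pct_change and the "_pct" suffix are hypothetical and used only for illustration.

```python
import math

# Sample data copied from the question: one list per column, ordered by index.
columns = {
    "col1": [10, 20, 30],
    "col2": [11, 22, 33],
    "col3": [12, 24, 36],
}

def log_diff(values):
    """Log difference with the previous row (hypothetical helper).

    The first row has no predecessor, so its entry is None,
    mirroring the null that lag() yields in Spark.
    """
    return [None] + [math.log(cur) - math.log(prev)
                     for prev, cur in zip(values, values[1:])]

def pct_change(values):
    """Plain relative change (cur - prev) / prev (hypothetical helper)."""
    return [None] + [(cur - prev) / prev
                     for prev, cur in zip(values, values[1:])]

# Build the "new columns", keyed by the original name plus a suffix.
log_diffs = {name + "_diff": log_diff(vals) for name, vals in columns.items()}
pct_changes = {name + "_pct": pct_change(vals) for name, vals in columns.items()}

for name, vals in log_diffs.items():
    print(name, vals)
for name, vals in pct_changes.items():
    print(name, vals)
```

If the plain percentage change is what is wanted, the same select pattern should carry over by swapping the log expression for (current value - lagged value) / lagged value over the same window.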