How to create columns in a dataframe out of columns of another dataframe in PySpark
Assuming that I have the following Spark DataFrame df:
+-----+-------+-------+-------+
| id | col1 | col2 | col3 |
+-----+-------+-------+-------+
| "a" | 10 | 5 | 75 |
| "b" | 20 | 3 | 3 |
| "c" | 30 | 2 | 65 |
+-----+-------+-------+-------+
I want to create a new dataframe new_df that contains:
1) the id of each row
2) the value of the division col1 / col2
3) the value of the division col3 / col1
The desired output for new_df is:
+-----+-------+-------+
| id | col1_2| col3_1|
+-----+-------+-------+
| "a" | 2 | 7.5 |
| "b" | 6.67 | 0.15 |
| "c" | 15 | 2.17 |
+-----+-------+-------+
I have already tried
new_df = df.select("id").withColumn("col1_2", df["col1"] / df["col2"])
without any luck.
Either use select:
df.select('id',
(df.col1 / df.col2).alias('col1_2'),
(df.col3 / df.col1).alias('col3_1')
).show()
+---+-----------------+------------------+
| id| col1_2| col3_1|
+---+-----------------+------------------+
| a| 2.0| 7.5|
| b|6.666666666666667| 0.15|
| c| 15.0|2.1666666666666665|
+---+-----------------+------------------+
Or selectExpr:
df.selectExpr('id', 'col1 / col2 as col1_2', 'col3 / col1 as col3_1').show()
+---+-----------------+------------------+
| id| col1_2| col3_1|
+---+-----------------+------------------+
| a| 2.0| 7.5|
| b|6.666666666666667| 0.15|
| c| 15.0|2.1666666666666665|
+---+-----------------+------------------+
from pyspark.sql.functions import udf, col

# Note: despite the name, this computes the quotient, not the remainder
def get_remainder(col_1, col_2):
    return col_1 / col_2

# Default udf return type is StringType; pass DoubleType() for a numeric column
get_remainder_udf = udf(get_remainder)

df = df.withColumn('col1_2', get_remainder_udf(col('col1'), col('col2')))
df = df.withColumn('col3_1', get_remainder_udf(col('col3'), col('col1')))
df = df.drop('col1').drop('col2').drop('col3')