
How to create columns in a dataframe out of columns of another dataframe in PySpark

Assuming that I have the following Spark DataFrame df:

+-----+-------+-------+-------+
| id  | col1  |  col2 |  col3 |
+-----+-------+-------+-------+
| "a" |   10  |  5    |   75  |
| "b" |   20  |  3    |   3   | 
| "c" |   30  |  2    |   65  |
+-----+-------+-------+-------+
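(For reference, a minimal sketch to reproduce this sample DataFrame, assuming an active SparkSession named spark:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above
df = spark.createDataFrame(
    [("a", 10, 5, 75), ("b", 20, 3, 3), ("c", 30, 2, 65)],
    ["id", "col1", "col2", "col3"],
)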

I want to create a new dataframe new_df that contains:

1) the id of each row

2) the result of dividing col1 by col2, and

3) the result of dividing col3 by col1

The desired output for new_df is:

+-----+-------+-------+
| id  | col1_2| col3_1|
+-----+-------+-------+
| "a" |  2    |  7.5  |
| "b" |  6.67 |  0.15 | 
| "c" |   15  |  2.17 |
+-----+-------+-------+

I have already tried

new_df = df.select("id").withColumn("col1_2", df["col1"] / df["col2"])

without any luck.
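This fails because .select("id") produces a new DataFrame that no longer contains col1 or col2, so the division cannot be resolved against it. Computing the ratios before projecting avoids the problem; a minimal sketch of that withColumn route:

new_df = (
    df.withColumn('col1_2', df['col1'] / df['col2'])   # ratio of col1 to col2
      .withColumn('col3_1', df['col3'] / df['col1'])   # ratio of col3 to col1
      .select('id', 'col1_2', 'col3_1')                # keep only the wanted columns
)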

Either use select:

df.select('id', 
  (df.col1 / df.col2).alias('col1_2'), 
  (df.col3 / df.col1).alias('col3_1')
).show()
+---+-----------------+------------------+
| id|           col1_2|            col3_1|
+---+-----------------+------------------+
|  a|              2.0|               7.5|
|  b|6.666666666666667|              0.15|
|  c|             15.0|2.1666666666666665|
+---+-----------------+------------------+

Or selectExpr:

df.selectExpr('id', 'col1 / col2 as col1_2', 'col3 / col1 as col3_1').show()
+---+-----------------+------------------+
| id|           col1_2|            col3_1|
+---+-----------------+------------------+
|  a|              2.0|               7.5|
|  b|6.666666666666667|              0.15|
|  c|             15.0|2.1666666666666665|
+---+-----------------+------------------+
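If the two-decimal values shown in the desired output are wanted literally, the built-in round function can be combined with either approach; a sketch:

from pyspark.sql.functions import round as spark_round

# Round each ratio to two decimals to match the desired output
new_df = df.select(
    'id',
    spark_round(df.col1 / df.col2, 2).alias('col1_2'),
    spark_round(df.col3 / df.col1, 2).alias('col3_1'),
)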
Alternatively, the same ratios can be computed with a UDF:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

def get_remainder(col_1, col_2):
    # Return the quotient of the two column values
    return col_1 / col_2

# Declare DoubleType explicitly; without it the UDF would return strings
get_remainder_udf = udf(get_remainder, DoubleType())

df = df.withColumn('col1_2', get_remainder_udf(col('col1'), col('col2')))
df = df.withColumn('col3_1', get_remainder_udf(col('col3'), col('col1')))
df = df.drop('col1', 'col2', 'col3')
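Note that a plain Python UDF evaluates the division row by row outside the JVM, so for simple arithmetic like this the native column expressions above are usually faster; a UDF is mainly worthwhile when the logic cannot be expressed with built-in functions.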
