How to create a new column in PySpark based on calculations on other columns

Question

I have the following DataFrame:

+-----------+----------+----------+
|   some_id | one_col  | other_col|
+-----------+----------+----------+
|       xx1 |        11|       177|         
|       xx2 |      1613|      2000|    
|       xx4 |         0|     12473|      
+-----------+----------+----------+

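I need to add a new column based on a calculation over the first and second columns. For example, for col1_value = 1 and col2_value = 10 it should give the percentage of col2 that col1 represents, so col3_value = (1/10) * 100 = 10%: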

+-----------+----------+----------+--------------+
|   some_id | one_col  | other_col|  percentage  |
+-----------+----------+----------+--------------+
|       xx1 |        11|       177|     6.2      |  
|       xx3 |         1|       10 |      10      |     
|       xx2 |      1613|      2000|     80.6     |
|       xx4 |         0|     12473|      0       |
+-----------+----------+----------+--------------+

I know I need to use a udf for this, but how do I add the new column values directly from its result?

Some pseudocode:

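from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType  # needed for the udf return type

df = load_my_df  # placeholder for however the DataFrame is loaded

def my_udf(val1, val2):
    # share of val2 that val1 represents, as a percentage
    return (val1 / val2) * 100

udf_percentage = udf(my_udf, FloatType())

df = df.withColumn('percentage', udf_percentage(...))  # how do I pass the columns here?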

Thanks!

Answer 1

df.withColumn('percentage', udf_percentage("one_col", "other_col"))

or

df.withColumn('percentage', udf_percentage(df["one_col"], df["other_col"]))

or

df.withColumn('percentage', udf_percentage(df.one_col, df.other_col))

or

from pyspark.sql.functions import col

df.withColumn('percentage', udf_percentage(col("one_col"), col("other_col")))

But why not just:

df.withColumn('percentage', col("one_col") / col("other_col") * 100)
