Multiply a DataFrame column by a column from another DataFrame based on a condition

Question

I have two DataFrames: an info table and a reference table. I need to multiply two columns based on a condition. Details below:

Dataframe (info)

+-----+-----+
|  key|value|
+-----+-----+
|    a|   10|
|    b|   20|
|    c|   50|
|    d|   40|
+-----+-----+

Dataframe (reference)

+-----+----------+
|  key|percentage|
+-----+----------+
|    a|       0.1|
|    b|       0.5|
+-----+----------+

Dataframe (this is the output I want)

+-----+------+
|  key|result|
+-----+------+
|    a|     1|   (10 * 0.1 = 1)
|    b|    10|   (20 * 0.5 = 10)
|    c|    50|   (no matching key in the reference table, so the value stays the same)
|    d|    40|   (no matching key in the reference table, so the value stays the same)
+-----+------+

I tried the following code, but it failed.

df_cal = (
    info
    .withColumn('result', f.when(f.col('key')==reference.withColumn(f.col('key')), \
                          f.col('value)*reference.withColumn(f.col('percentage')) ))
    .select('key', 'result')
)

df_cal.show()

Answer 1

Join and multiply. Code and logic below.

from pyspark.sql.functions import broadcast, col

new_info = (
    info.join(broadcast(reference), on='key', how='left')  # left join; broadcast the small reference table
    .na.fill(1.0)  # fill nulls with 1 so unmatched keys are multiplied by 1
    .withColumn('result', col('value') * col('percentage'))  # multiply the columns and store in result
    .drop('value', 'percentage')  # drop unwanted columns
)

new_info.show()
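
For readers who want to check the arithmetic without a Spark session, the join-then-multiply logic above can be modeled in plain Python. This is only a sketch of the semantics, not PySpark code: the dictionaries `info` and `reference` stand in for the two DataFrames, and `multiply_with_default` is a hypothetical helper name introduced for illustration.

```python
# Plain-Python model of "left join on key, then default the missing factor to 1.0".
# The dicts stand in for the DataFrames; this is not PySpark code.
info = {"a": 10, "b": 20, "c": 50, "d": 40}
reference = {"a": 0.1, "b": 0.5}


def multiply_with_default(values, factors, default=1.0):
    """Multiply each value by its factor; keys absent from `factors` use `default`."""
    return {key: value * factors.get(key, default) for key, value in values.items()}


result = multiply_with_default(info, reference)
print(result)
```

Keys `c` and `d` have no entry in `reference`, so they are multiplied by the default of 1.0 and keep their original magnitude, which mirrors what the fill step does after the left join.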

Answer 2

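Slightly different from wwnde's solution, with the overall logic unchanged: this uses coalesce instead of fillna. fillna, if used without a subset, can also fill columns you did not intend to fill. In any case, it generates a new projection in the Spark plan.

Example using coalesce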

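from pyspark.sql import functions as func

# data1_sdf and data2_sdf correspond to the info and reference DataFrames
data1_sdf. \
    join(data2_sdf, ['key'], 'left'). \
    withColumn('result',
               # a null product (no matching key) falls back to the original value
               func.coalesce(func.col('value') * func.col('percentage'), func.col('value'))
               ). \
    show()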

# +---+-----+----------+------+
# |key|value|percentage|result|
# +---+-----+----------+------+
# |  d|   40|      null|  40.0|
# |  c|   50|      null|  50.0|
# |  b|   20|       0.5|  10.0|
# |  a|   10|       0.1|   1.0|
# +---+-----+----------+------+
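
The point about filling without a subset can be illustrated with a small plain-Python model. This is a sketch of the semantics only, not Spark itself: `coalesce` and `multiply` are hand-written stand-ins for the SQL behavior, and the row with key `e` (a null `value`) is a hypothetical addition that does not appear in the question's data.

```python
def coalesce(*candidates):
    """Return the first candidate that is not None (like SQL COALESCE)."""
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return None


def multiply(value, percentage):
    """Null-propagating multiply: the result is None if either operand is None."""
    if value is None or percentage is None:
        return None
    return value * percentage


# Rows as they would look after the left join.
rows = [
    {"key": "a", "value": 10, "percentage": 0.1},
    {"key": "c", "value": 50, "percentage": None},
    {"key": "e", "value": None, "percentage": None},  # hypothetical row with a null value
]

# Blanket fill: every null becomes 1.0, including the null in `value`.
filled = [{k: (1.0 if v is None else v) for k, v in row.items()} for row in rows]
fill_result = {r["key"]: r["value"] * r["percentage"] for r in filled}

# Coalesce: only the product falls back, and a null value stays null.
coalesce_result = {
    r["key"]: coalesce(multiply(r["value"], r["percentage"]), r["value"]) for r in rows
}

print(fill_result)
print(coalesce_result)
```

With the blanket fill, the hypothetical row `e` ends up with an invented result of 1.0, whereas the coalesce version leaves it as None. With the question's data, where `value` is never null, both approaches agree.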

Answer 3

If you are willing to use Spark SQL instead of the DataFrame API, you can do this:

Create the DataFrames. (Optional, since you already have the data.)

from pyspark.sql.types import StructType,StructField, IntegerType, FloatType, StringType

# create info dataframe
info_data = [
  ("a",10),
  ("b",20),
  ("c",50),
  ("d",40),
]
info_schema = StructType([
  StructField("key",StringType()),
  StructField("value",IntegerType()),
])
info_df = spark.createDataFrame(data=info_data,schema=info_schema)

# create reference dataframe
reference_data = [
  ("a",.1),
  ("b",.5)
]
reference_schema = StructType([
  StructField("key",StringType()),
  StructField("percentage",FloatType()),
])
reference_df = spark.createDataFrame(data=reference_data,schema=reference_schema)
reference_df.show()

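Next we need to create views of the two DataFrames so that we can run SQL queries against them. Below we create a view named info from info_df and a view named reference from reference_df.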

# create views: info and reference
info_df.createOrReplaceTempView("info")
reference_df.createOrReplaceTempView("reference")

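Finally, we write a query to perform the multiplication. The query does a left join between info and reference, then multiplies value by percentage. The key part is that we coalesce percentage with 1, so if percentage is null, value is multiplied by 1.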

my_query = """
select
  i.key,
  -- coalesce the percentage with 1. If percentage is null then it gets replaced by 1
  i.value * coalesce(r.percentage,1) as result
from info i
left join reference r
  on i.key = r.key
"""

final_df = spark.sql(my_query)
final_df.show()

Output:

+---+------+
|key|result|
+---+------+
|  a|   1.0|
|  b|  10.0|
|  c|  50.0|
|  d|  40.0|
+---+------+
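
The query itself is plain SQL, so its logic can be sanity-checked without a Spark session. The sketch below runs the same left join and coalesce against an in-memory SQLite database using Python's standard `sqlite3` module. This is an assumption-laden stand-in rather than Spark SQL: the tables are created by hand instead of through temp views, and an `order by` is added only to make the row order deterministic.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark temp views.
conn = sqlite3.connect(":memory:")
conn.execute("create table info (key text, value integer)")
conn.execute("create table reference (key text, percentage real)")
conn.executemany(
    "insert into info values (?, ?)",
    [("a", 10), ("b", 20), ("c", 50), ("d", 40)],
)
conn.executemany(
    "insert into reference values (?, ?)",
    [("a", 0.1), ("b", 0.5)],
)

# Same query shape as above; `order by` is added for a stable row order.
rows = conn.execute(
    """
    select
      i.key,
      i.value * coalesce(r.percentage, 1) as result
    from info i
    left join reference r
      on i.key = r.key
    order by i.key
    """
).fetchall()
conn.close()

print(rows)
```

One difference to keep in mind: SQLite returns the unmatched rows as integers (50 and 40) because both operands are integers there, whereas the Spark output above shows 50.0 and 40.0. The values are numerically the same.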

3 answers

Answer 1
0 2022-12-28 03:28:42

Answer 2
0 2022-12-28 04:18:58

Answer 3
0 2022-12-28 04:52:52
