
One dataframe column multiplied by another dataframe column based on a condition

There are two dataframes: one is an info table, and the other is a reference table. I need to multiply two columns based on conditions. Here are the details:

Dataframe (Info)

+-----+-----+
|  key|value|
+-----+-----+
|    a|   10|
|    b|   20|
|    c|   50|
|    d|   40|
+-----+-----+

Dataframe (Reference)

+-----+----------+
|  key|percentage|
+-----+----------+
|    a|       0.1|
|    b|       0.5|
+-----+----------+

Dataframe (this is the output I want)

+-----+------+
|  key|result|
+-----+------+
|    a|     1|   (10 * 0.1 = 1)
|    b|    10|   (20 * 0.5 = 10)
|    c|   50|   (no matching key in the reference table, so the value remains unchanged)
|    d|   40|   (no matching key in the reference table, so the value remains unchanged)
+-----+------+

I have tried the code below, but it failed.

df_cal = (
    info
    .withColumn('result', f.when(f.col('key')==reference.withColumn(f.col('key')), \
                          f.col('value)*reference.withColumn(f.col('percentage')) ))
    .select('key', 'result')
)

df_cal.show()

Join and multiply. Code and logic below:

from pyspark.sql.functions import broadcast, col

new_info = (info.join(broadcast(reference), on='key', how='left') # join the two dataframes
 .na.fill(1.0) # fill null with 1
 .withColumn('result', col('value')*col('percentage')) # multiply the columns and store in result
 .drop('value','percentage') # drop unwanted columns
)

new_info.show()

A slight difference from wwnde's solution, with the overall logic remaining the same, would be to use coalesce instead of fillna. fillna, if used without a subset, can fill unwanted columns as well, and in any case it generates a new projection in the Spark plan.

Example using coalesce:

from pyspark.sql import functions as func

# data1_sdf and data2_sdf are the info and reference dataframes, respectively
data1_sdf. \
    join(data2_sdf, ['key'], 'left'). \
    withColumn('result', 
               func.coalesce(func.col('value') * func.col('percentage'), func.col('value'))
               ). \
    show()

# +---+-----+----------+------+
# |key|value|percentage|result|
# +---+-----+----------+------+
# |  d|   40|      null|  40.0|
# |  c|   50|      null|  50.0|
# |  b|   20|       0.5|  10.0|
# |  a|   10|       0.1|   1.0|
# +---+-----+----------+------+

If you are willing to use Spark SQL instead of the DataFrame API, you can take this approach:

Create the dataframes. (Optional, since you already have the data.)

from pyspark.sql.types import StructType,StructField, IntegerType, FloatType, StringType

# create info dataframe
info_data = [
  ("a",10),
  ("b",20),
  ("c",50),
  ("d",40),
]
info_schema = StructType([
  StructField("key",StringType()),
  StructField("value",IntegerType()),
])
info_df = spark.createDataFrame(data=info_data,schema=info_schema)

# create reference dataframe
reference_data = [
  ("a",.1),
  ("b",.5)
]
reference_schema = StructType([
  StructField("key",StringType()),
  StructField("percentage",FloatType()),
])
reference_df = spark.createDataFrame(data=reference_data,schema=reference_schema)
reference_df.show()

Next we need to create views of the two dataframes to run SQL queries. Below we create a view called info from info_df and a view called reference from reference_df.

# create views: info and reference
info_df.createOrReplaceTempView("info")
reference_df.createOrReplaceTempView("reference")

Finally we write a query to perform the multiplication. The query performs a left join between info and reference and then multiplies value by percentage. The key part is that we coalesce percentage with 1, so if percentage is null, value is multiplied by 1.


my_query = """
select
  i.key,
  -- coalese the percentage with 1. If percentage is null then it gets replaced by 1
  i.value * coalesce(r.percentage,1) as result
from info i
left join reference r
  on i.key = r.key
"""

final_df = spark.sql(my_query)
final_df.show()

Output:

+---+------+
|key|result|
+---+------+
|  a|   1.0|
|  b|  10.0|
|  c|  50.0|
|  d|  40.0|
+---+------+
