One dataframe column multiplied by another dataframe column based on condition
There are two dataframes: one is an info table, and the other is a reference table. I need to multiply two columns based on conditions; here are the details:
Dataframe (Info)
+-----+-----+
| key|value|
+-----+-----+
| a| 10|
| b| 20|
| c| 50|
| d| 40|
+-----+-----+
Dataframe (Reference)
+-----+----------+
| key|percentage|
+-----+----------+
| a| 0.1|
| b| 0.5|
+-----+----------+
Dataframe (this is the output I want)
+-----+------+
| key|result|
+-----+------+
| a| 1| (10 * 0.1 = 1)
| b| 10| (20 * 0.5 = 10)
| c| 50| (no matching key in the reference table, so the value remains the same)
| d| 40| (no matching key in the reference table, so the value remains the same)
+-----+------+
I have tried the below code but it failed.
df_cal = (
    info
    .withColumn('result', f.when(f.col('key') == reference.withColumn(f.col('key')),
                                 f.col('value') * reference.withColumn(f.col('percentage'))))
    .select('key', 'result')
)
df_cal.show()
Join and multiply. Code and logic below:
from pyspark.sql.functions import broadcast, col

new_info = (
    info.join(broadcast(reference), on='key', how='left')  # join the two dataframes
        .na.fill(1.0)  # fill null with 1
        .withColumn('result', col('value') * col('percentage'))  # multiply the columns and store in result
        .drop('value', 'percentage')  # drop unwanted columns
)
new_info.show()
A slight difference from wwnde's solution, with the overall logic remaining the same, would be to use coalesce instead of fillna. fillna, if used without a subset, can fill unwanted columns as well - and in any case, it generates a new projection in the Spark plan. Example using coalesce:
import pyspark.sql.functions as func

# data1_sdf is the info dataframe, data2_sdf is the reference dataframe
data1_sdf. \
    join(data2_sdf, ['key'], 'left'). \
    withColumn('result',
               func.coalesce(func.col('value') * func.col('percentage'), func.col('value'))
               ). \
    show()
# +---+-----+----------+------+
# |key|value|percentage|result|
# +---+-----+----------+------+
# | d| 40| null| 40.0|
# | c| 50| null| 50.0|
# | b| 20| 0.5| 10.0|
# | a| 10| 0.1| 1.0|
# +---+-----+----------+------+
If you are willing to use Spark SQL instead of the DataFrame API, you can take this approach:
Create dataframes. (Optional since you already have the data)
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, StringType

# create info dataframe
info_data = [
    ("a", 10),
    ("b", 20),
    ("c", 50),
    ("d", 40),
]
info_schema = StructType([
    StructField("key", StringType()),
    StructField("value", IntegerType()),
])
info_df = spark.createDataFrame(data=info_data, schema=info_schema)

# create reference dataframe
reference_data = [
    ("a", .1),
    ("b", .5),
]
reference_schema = StructType([
    StructField("key", StringType()),
    StructField("percentage", FloatType()),
])
reference_df = spark.createDataFrame(data=reference_data, schema=reference_schema)
reference_df.show()
Next we need to create views of the 2 dataframes to run SQL queries. Below we create a view called info from info_df and a view called reference from reference_df.
# create views: info and reference
info_df.createOrReplaceTempView("info")
reference_df.createOrReplaceTempView("reference")
Finally we write a query to perform the multiplication. The query performs a left join between info and reference and then multiplies value by percentage. The key part is that we coalesce percentage with 1. Thus if percentage is null, then value is multiplied by 1.
from pyspark.sql.functions import coalesce
my_query = """
select
    i.key,
    -- coalesce the percentage with 1. If percentage is null then it gets replaced by 1
    i.value * coalesce(r.percentage, 1) as result
from info i
left join reference r
    on i.key = r.key
"""
final_df = spark.sql(my_query)
final_df.show()
Output:
+---+------+
|key|result|
+---+------+
| a| 1.0|
| b| 10.0|
| c| 50.0|
| d| 40.0|
+---+------+