How to replace null values in the output of a left join operation with 0 in a PySpark dataframe?

I have a simple PySpark dataframe, df1:

df1 = spark.createDataFrame([
    ("u1", 1),
    ("u1", 2),
    ("u2", 3),
    ("u3", 4),

    ],
    ['user_id', 'var1'])

# printSchema() prints the schema itself and returns None, so no print() is needed
df1.printSchema()
df1.show(truncate=False)

Output:

root
 |-- user_id: string (nullable = true)
 |-- var1: long (nullable = true)

+-------+----+
|user_id|var1|
+-------+----+
|u1     |1   |
|u1     |2   |
|u2     |3   |
|u3     |4   |
+-------+----+

I have another PySpark dataframe, df2:

df2 = spark.createDataFrame([
    (1, 'f1'),
    (2, 'f2'),

    ],
    ['var1', 'var2'])

# printSchema() prints the schema itself and returns None, so no print() is needed
df2.printSchema()
df2.show(truncate=False)

Output:

root
 |-- var1: long (nullable = true)
 |-- var2: string (nullable = true)

+----+----+
|var1|var2|
+----+----+
|1   |f1  |
|2   |f2  |
+----+----+

I need to join the two dataframes above using a left join:

df1.join(df2, df1.var1==df2.var1, 'left').show()

Output:

+-------+----+----+----+
|user_id|var1|var1|var2|
+-------+----+----+----+
|     u1|   1|   1|  f1|
|     u1|   2|   2|  f2|
|     u2|   3|null|null|
|     u3|   4|null|null|
+-------+----+----+----+

As you can see, I get null values in the rows for which the two tables have no match. How can I replace all the null values with 0?

You can use fillna. Two fillna calls are needed because each call only fills columns whose type matches the replacement value: fillna(0) covers the integer columns and fillna("0") covers the string columns.

df1.join(df2, df1.var1==df2.var1, 'left').fillna(0).fillna("0")

You can rename columns after the join (otherwise you get columns with the same name) and use a dictionary to specify how you want to fill missing values:

df1.join(df2, df1.var1 == df2.var1, 'left').select(
    *[df1['user_id'], df1['var1'], df2['var1'].alias('df2_var1'), df2['var2'].alias('df2_var2')]
).fillna({'df2_var1': 0, 'df2_var2': '0'}).show()

Output:

+-------+----+--------+--------+
|user_id|var1|df2_var1|df2_var2|
+-------+----+--------+--------+
|     u1|   1|       1|      f1|
|     u2|   3|       0|       0|
|     u1|   2|       2|      f2|
|     u3|   4|       0|       0|
+-------+----+--------+--------+
