Scala: outer join on 2 dataframe columns doesn't show rows where there are null values
I'm joining 2 dataframes like so:
val joinCols = Array("first_name", "last_name")
val df_subset_joined = df1_subset.as("a").join(df2_subset.as("b"), joinCols, "full_outer")
df_subset_joined.show()
This is the result of the above code:
Dataframe of differences between 2 dataframes
+----------+---------+-------------+-------------+
|first_name|last_name|loyalty_score|loyalty_score|
+----------+---------+-------------+-------------+
| will | smith| 67| 67|
| george | clooney| 67| 67|
| george | clooney| 67| 88|
| blake | lively| 66| null|
| celena| gomez| null| 2|
| eva| green| 44| 56|
| null| null| | null|
| jason| momoa| 34| 34|
| ed| sheeran| 88| null|
| lionel| messi| 88| 88|
| kyle| jenner| null| 56|
| tom | cruise| 66| 34|
| tom | cruise| 66| 99|
| brad| pitt| 99| 78|
| ryan| reynolds| 45| null|
+----------+---------+-------------+-------------+
As you can see, there are columns with null values.
I run the following code next:
val filter_str = s"a.$col"+" != "+s"b.$col"
val df_subset_filtered = df_subset_joined.filter(filter_str)
df_subset_filtered.show()
I get the following dataframe:
Below is the dataframe of differences between DF1 and DF1 based on the comparison between:
a.loyalty_score != b.loyalty_score
+----------+---------+-------------+-------------+
|first_name|last_name|loyalty_score|loyalty_score|
+----------+---------+-------------+-------------+
| tom | cruise| 66| 99|
| tom | cruise| 66| 34|
| eva| green| 44| 56|
| brad| pitt| 99| 78|
| george | clooney| 67| 88|
+----------+---------+-------------+-------------+
Why don't I see the rows where there is a null value in one column and an actual value in the other? Shouldn't this satisfy value != null?
How can I make my filter statement keep the rows with null values in the final dataframe?
The reason you don't get any rows where there is null in one column and non-null in the other is that the comparison returns NULL rather than TRUE, and filter keeps only the rows where the condition evaluates to true.
To avoid this, use the null-safe comparison operator <=> in conjunction with not.
val filter_str = s"not(a.$col <=> b.$col)"
val df_subset_filtered = df_subset_joined.filter(filter_str)
df_subset_filtered.show()
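The same filter can also be written with the Column API instead of a SQL string. The following is only a sketch under assumptions: the object name NullSafeDiff, the helper differingRows, and the sample rows are made up for illustration, and it assumes the joined dataframe still carries the aliases "a" and "b" as in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object NullSafeDiff {
  // Hypothetical helper: keep rows whose `column` differs between the
  // aliases "a" and "b". NULL vs. a value counts as a difference,
  // NULL vs. NULL counts as equal.
  def differingRows(joined: DataFrame, column: String): DataFrame =
    joined.filter(!(col(s"a.$column") <=> col(s"b.$column")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("null-safe-diff")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data loosely based on the question.
    val df1 = Seq(("will", "smith", 67), ("tom", "cruise", 66), ("blake", "lively", 66))
      .toDF("first_name", "last_name", "loyalty_score")
    val df2 = Seq(("will", "smith", 67), ("tom", "cruise", 99), ("kyle", "jenner", 56))
      .toDF("first_name", "last_name", "loyalty_score")

    // The full outer join produces the nulls for rows present on one side only.
    val joined = df1.as("a").join(df2.as("b"), Seq("first_name", "last_name"), "full_outer")

    differingRows(joined, "loyalty_score").show()
    spark.stop()
  }
}
```

With this sample data the filter should keep tom cruise (66 vs. 99) as well as blake lively and kyle jenner (a value on one side, null on the other), and drop will smith.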
From the documentation:
expr1 <=> expr2 - Returns same result as the EQUAL(=) operator for non-null operands, but returns true if both are null, false if one of them is null.
Arguments:
expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a type that can be used in equality comparison. Map type is not supported. For complex types such array/struct, the data types of fields must be orderable.
Examples:
SELECT 2 <=> 2;
true
SELECT 1 <=> '1';
true
SELECT true <=> NULL;
false
SELECT NULL <=> NULL;
true
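To see why != drops the rows while not(<=>) keeps them, the two behaviours can be modelled in plain Scala without Spark. This is only an illustrative sketch, not how Spark implements it: None stands in for SQL NULL, and the names NullSemantics, notEqual, nullSafeEqual and keptByFilter are hypothetical.

```scala
object NullSemantics {
  // Ordinary inequality (!=): the result is NULL (None) if either operand is NULL.
  def notEqual(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x != y

  // Null-safe equality (<=>): always true or false, never NULL.
  // Two NULLs are equal; NULL and a value are not.
  def nullSafeEqual(a: Option[Int], b: Option[Int]): Boolean = a == b

  // A filter keeps a row only when the predicate is exactly true,
  // so a NULL predicate drops the row just like false does.
  def keptByFilter(predicate: Option[Boolean]): Boolean = predicate.contains(true)
}
```

For a row such as blake lively (66 on one side, null on the other), keptByFilter(notEqual(Some(66), None)) is false, so the row disappears, whereas !nullSafeEqual(Some(66), None) is true, so the row is kept.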