简体   繁体   中英

Filtering on NULL values in a spark dataframe does not work on all columns?

While writing this question, I managed to find an explaination. But as it seems a tricky point, I will post it and answer it anyway. Feel free to complement.

I have what appears to me an inconsistent behaviour of pyspark, but as I am quite new to it I may miss something... All my steps are run in an Azure Databricks notebook, and the data is from a parquet file hosted in Azure Datalake Gen. 2.

I want to simply filter the NULL records from a spark dataframe, created by reading a parquet file, with the following steps:

读取镶木地板文件

Filtering on the phone column just works fine:

在电话栏上过滤

We can see that at least some contact_tech_id values are also missing. But when filtering on this specific column, an empty dataframe is retrieved...

过滤contact_tech_id

Is there any explaination on why this could happen, or what I should look for?

The reason why filtering on contact_tech_id Null values was unsuccessful is because what appears as null in this column in the notebook output is in fact a NaN value ("Not a Number", see here for more information).

It appeared when converting into a pandas dataframe, for which the output is more readable:

熊猫数据框视图

In order to compare the NULL values for equality, Spark provides a null-safe equal operator (<=>) , which returns False when one of the operand is NULL and returns True when both the operands are NULL. Instead of using is null always recommend (<=>) operator.

Apache Spark supports the standard comparison operators such as >, >=, =, < and <=. The result of these operators is unknown or NULL when one of the operands or both the operands are unknown or NULL. In order to compare the NULL values for equality, Spark provides a null-safe equal operator (<=>), which returns False when one of the operand is NULL and returns True when both the operands are NULL. You can ref link

在此处输入图像描述

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM