PySpark: removing null values from a column in a DataFrame

Question

My DataFrame looks like this:

ID,FirstName,LastName
1,Navee,Srikanth
2,,Srikanth
3,Naveen,

I need to remove the row with ID 2 because its FirstName is null.

I am using the following PySpark script:

join_Df1= Name.filter(Name.col(FirstName).isnotnull()).show()

I get this error:

  File "D:\0\NameValidation.py", line 13, in <module>
join_Df1= filter(Name.FirstName.isnotnull()).show()

TypeError: 'Column' object is not callable

Can anyone please help me resolve this?

Answer 1

It looks like the FirstName column in your DataFrame contains empty strings rather than nulls. Here are some options to try:

df = sqlContext.createDataFrame([[1,'Navee','Srikanth'], [2,'','Srikanth'] , [3,'Naveen','']], ['ID','FirstName','LastName'])
df.show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  2|         |Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where(df.FirstName.isNotNull()).show()  # This doesn't remove row 2 because FirstName holds an empty string, not null
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  2|         |Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.filter(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where("FirstName != ''").show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

Answer 2

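The method name is case-sensitive: it is isNotNull(), not isnotnull(). With the wrong case, PySpark treats isnotnull as a field lookup and returns a Column, which is why calling it raises "'Column' object is not callable". Filter the original DataFrame like this:

Name.filter(Name.FirstName.isNotNull()).show()

Note that show() only prints and returns None, so assign the filtered DataFrame to a variable before calling show() if you need to keep it.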

Hope this helps!
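The TypeError in the question can be reproduced without Spark. As far as I know, a PySpark Column resolves an unknown attribute name (such as the wrongly cased `isnotnull`) as a nested-field lookup and hands back another Column instead of a method. The sketch below imitates that behaviour in plain Python; `ToyColumn` is a hypothetical stand-in written for illustration, not PySpark's actual class.

```python
class ToyColumn:
    """Hypothetical stand-in that mimics how a Column resolves unknown attributes."""

    def __init__(self, name):
        self.name = name

    def isNotNull(self):
        # The correctly cased method exists and can be called.
        return f"({self.name} IS NOT NULL)"

    def __getattr__(self, item):
        # Runs only when normal lookup fails, e.g. for the misspelled "isnotnull".
        # The unknown name is treated as a nested field, so another column is returned.
        return ToyColumn(f"{self.name}.{item}")


first_name = ToyColumn("FirstName")
print(first_name.isNotNull())   # correct case: an ordinary method call

try:
    first_name.isnotnull()      # wrong case: yields a ToyColumn, which is then called
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

The second call fails for the same reason as in the traceback: the attribute lookup succeeds but returns an object that cannot be called.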

Answer 3

I think what you need is notnull(). Note that this example uses pandas rather than PySpark.

Suppose this is your input in the CSV file my_test.csv:

ID,FirstName,LastName
1,Navee,Srikanth
2,,Srikanth
3,Naveen

The code:

import pandas as pd
df = pd.read_csv("my_test.csv")

print(df[df['FirstName'].notnull()])

output:

  ID FirstName  LastName
0   1     Navee  Srikanth
2   3    Naveen       NaN

This is the expression you want: df[df['FirstName'].notnull()]

Output of df['FirstName'].notnull():

0     True
1    False
2     True

The expression df[df['FirstName'].notnull()] selects only the rows of df for which df['FirstName'].notnull() is True.

How is this checked? df['FirstName'].notnull() returns True for each row whose FirstName value is present and False for each row where it is NaN.
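In pandas, the same row selection can also be written with dropna and its subset argument. The following is a small sketch, assuming pandas is installed; the CSV text is inlined with io.StringIO so the example does not depend on my_test.csv, and the names `csv_text`, `mask_result`, `dropna_result` and `same` are illustrative.

```python
import io

import pandas as pd

# Same data as in the question, inlined so the example is self-contained.
csv_text = "ID,FirstName,LastName\n1,Navee,Srikanth\n2,,Srikanth\n3,Naveen,\n"
df = pd.read_csv(io.StringIO(csv_text))

# read_csv parses the empty FirstName field as NaN, so notnull() can detect it.
mask_result = df[df["FirstName"].notnull()]

# dropna with subset= drops rows that are NaN in the listed columns only,
# so the missing LastName in the last row does not cause that row to be dropped.
dropna_result = df.dropna(subset=["FirstName"])

print(dropna_result)
same = mask_result.equals(dropna_result)
```

Both forms keep the original index labels, so the surviving rows are still labelled 0 and 2.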

3 answers

solution1
6 2017-06-23 07:25:03

solution2
4 2017-06-23 07:03:36

solution3
-1 2017-06-23 07:00:16
