
PySpark: removing null values from a column in a DataFrame

My DataFrame looks like this:

ID,FirstName,LastName

1,Navee,Srikanth

2,,Srikanth 

3,Naveen,

I need to remove row 2, because its FirstName is null.

I am using the following PySpark script:

join_Df1= Name.filter(Name.col(FirstName).isnotnull()).show()

I get this error:

  File "D:\0\NameValidation.py", line 13, in <module>
join_Df1= filter(Name.FirstName.isnotnull()).show()

TypeError: 'Column' object is not callable

Can anyone please help me resolve this?

It looks like the FirstName column in your DataFrame holds empty strings rather than null values. Here are some options to try:

df = sqlContext.createDataFrame([[1,'Navee','Srikanth'], [2,'','Srikanth'] , [3,'Naveen','']], ['ID','FirstName','LastName'])
df.show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  2|         |Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where(df.FirstName.isNotNull()).show() # This doesn't remove row 2, because the value is an empty string, not null
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  2|         |Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.filter(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where("FirstName != ''").show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

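The method name is case-sensitive: it is isNotNull(), not isnotnull(). With the wrong case, PySpark treats isnotnull as a nested field lookup and returns a Column, and calling that Column raises the TypeError you see. Assuming Name is your DataFrame, filter it like this:

Name.filter(Name.FirstName.isNotNull()).show()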

Hope this helps!

If you can use pandas instead of PySpark, I think what you need is notnull().

Suppose this is your input in the CSV file my_test.csv:

ID,FirstName,LastName
1,Navee,Srikanth

2,,Srikanth

3,Naveen

The code:

import pandas as pd
df = pd.read_csv("my_test.csv")

print(df[df['FirstName'].notnull()])

output:

  ID FirstName  LastName
0   1     Navee  Srikanth
2   3    Naveen       NaN

This is the expression you want: df[df['FirstName'].notnull()]

Output of df['FirstName'].notnull():

0     True
1    False
2     True

Indexing df with this boolean mask returns a DataFrame containing only the rows where df['FirstName'].notnull() is True.

How is this checked? df['FirstName'].notnull() returns True for each row where FirstName has a value, and False where it is NaN.
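The same filtering can be tried without a file on disk. In this sketch the CSV content is inlined through io.StringIO (an assumption made only to keep the example self-contained), and dropna(subset=...) is shown as an equivalent spelling of the mask-based filter.

```python
import io

import pandas as pd

# Inlined stand-in for my_test.csv, so the example runs without a file.
csv_text = "ID,FirstName,LastName\n1,Navee,Srikanth\n2,,Srikanth\n3,Naveen,\n"
df = pd.read_csv(io.StringIO(csv_text))

# Boolean mask: True where FirstName has a value, False where it is NaN.
mask = df["FirstName"].notnull()
filtered = df[mask]

# Equivalent: drop rows that are NaN in FirstName only.
same_result = df.dropna(subset=["FirstName"])

print(filtered)
```

By default pandas reads empty CSV fields as NaN, which is why notnull() works here without any extra handling of empty strings.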
