My DataFrame looks like this:
ID,FirstName,LastName
1,Navee,Srikanth
2,,Srikanth
3,Naveen,
My problem is that I have to remove row number 2, since its FirstName is null.
I am using the PySpark script below:
join_Df1= Name.filter(Name.col(FirstName).isnotnull()).show()
I am getting this error:
File "D:\0\NameValidation.py", line 13, in <module>
join_Df1= filter(Name.FirstName.isnotnull()).show()
TypeError: 'Column' object is not callable
Can anyone please help me resolve this?
It looks like the FirstName column in your DataFrame holds an empty string rather than null. Below are some options to try out:
df = sqlContext.createDataFrame([[1,'Navee','Srikanth'], [2,'','Srikanth'] , [3,'Naveen','']], ['ID','FirstName','LastName'])
df.show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
| 1| Navee|Srikanth|
| 2| |Srikanth|
| 3| Naveen| |
+---+---------+--------+
df.where(df.FirstName.isNotNull()).show()  # This doesn't remove row 2, because the value is an empty string, not null
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
| 1| Navee|Srikanth|
| 2| |Srikanth|
| 3| Naveen| |
+---+---------+--------+
df.where(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
| 1| Navee|Srikanth|
| 3| Naveen| |
+---+---------+--------+
df.filter(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
| 1| Navee|Srikanth|
| 3| Naveen| |
+---+---------+--------+
df.where("FirstName != ''").show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
| 1| Navee|Srikanth|
| 3| Naveen| |
+---+---------+--------+
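You should do it as below. Method names in PySpark are case-sensitive: it is isNotNull(), not isnotnull(). An unknown attribute on a Column is treated as a field lookup and returns another Column, which is why calling it fails with 'Column' object is not callable. Also note that show() only prints and returns None, so don't assign its result to a variable.
Name.filter(Name.FirstName.isNotNull()).show()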
Hope this helps!
I think what you might need is pandas' notnull().
So this is your input, in the CSV file my_test.csv:
ID,FirstName,LastName
1,Navee,Srikanth
2,,Srikanth
3,Naveen
The code:
import pandas as pd
df = pd.read_csv("my_test.csv")
print(df[df['FirstName'].notnull()])
output:
ID FirstName LastName
0 1 Navee Srikanth
2 3 Naveen NaN
This is what you want: df[df['FirstName'].notnull()]
Output of df['FirstName'].notnull():
0 True
1 False
2 True
This creates a DataFrame containing only the rows of df where df['FirstName'].notnull() returns True.
How is this checked? df['FirstName'].notnull() returns True if the value in the FirstName column is not null, and False if NaN is present.
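Note that notnull() only catches NaN. If the file can also contain whitespace-only names, those survive the filter. Below is a minimal pandas sketch that treats both cases as missing; the inline CSV text stands in for my_test.csv, and the fourth row (a FirstName of three spaces) is a made-up example, not part of the original data.

```python
import io

import pandas as pd

# Stand-in for my_test.csv; row 4 is a hypothetical whitespace-only name.
csv_text = (
    "ID,FirstName,LastName\n"
    "1,Navee,Srikanth\n"
    "2,,Srikanth\n"
    "3,Naveen,\n"
    "4,   ,Kumar\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Replace NaN with "" first, then strip, so NaN and blank strings
# are both rejected by the same comparison.
has_name = df["FirstName"].fillna("").str.strip() != ""

cleaned = df[has_name]
kept_ids = cleaned["ID"].tolist()
print(kept_ids)
```

Here rows 2 and 4 are dropped and IDs 1 and 3 are kept, whereas notnull() alone would have kept row 4.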