
Pyspark Removing null values from a column in dataframe

My dataframe looks like this:

ID,FirstName,LastName

1,Navee,Srikanth

2,,Srikanth 

3,Naveen,

My problem is that I have to remove row 2, because its FirstName is null.

I am using the following pyspark script:

join_Df1= Name.filter(Name.col(FirstName).isnotnull()).show()

I am getting this error message:

  File "D:\0\NameValidation.py", line 13, in <module>
join_Df1= filter(Name.FirstName.isnotnull()).show()

TypeError: 'Column' object is not callable

Can anyone help me solve this?

It looks like the FirstName column of your DataFrame holds empty strings rather than null values. Here are some options you can try:

df = sqlContext.createDataFrame([[1,'Navee','Srikanth'], [2,'','Srikanth'] , [3,'Naveen','']], ['ID','FirstName','LastName'])
df.show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  2|         |Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where(df.FirstName.isNotNull()).show() # This doesn't remove any rows, because FirstName holds empty strings, not nulls
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  2|         |Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.filter(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where("FirstName != ''").show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

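You should do it as follows. The method name is case-sensitive: it is isNotNull(), not isnotnull(), which is why the call fails with "'Column' object is not callable". Also filter first and call show() separately, because show() returns None:

join_Df1 = Name.filter(Name.FirstName.isNotNull())
join_Df1.show()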

Hope this helps!

I think what you may need is pandas' notnull().

This is your input in the csv file my_test.csv:

ID,FirstName,LastName
1,Navee,Srikanth

2,,Srikanth

3,Naveen

Code:

import pandas as pd
df = pd.read_csv("my_test.csv")

print(df[df['FirstName'].notnull()])

Output:

  ID FirstName  LastName
0   1     Navee  Srikanth
2   3    Naveen       NaN

This is what you want: df[df['FirstName'].notnull()]

df['FirstName'].notnull()

0     True
1    False
2     True

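Indexing df with this boolean mask creates a dataframe that contains only the rows of df where df['FirstName'].notnull() returns True.

How is it checked? df['FirstName'].notnull() returns True if the value in the FirstName column is not null, and False if the value is NaN.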
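The pandas approach works here because read_csv turns empty fields into NaN by default. A dataframe built in memory with '' values behaves differently, since an empty string is not null. The sketch below illustrates both cases; it reads the CSV from an in-memory string instead of my_test.csv so that it is self-contained, and the in_memory frame is a hypothetical example.

```python
import io

import pandas as pd

# Same content as the CSV in this answer, held in a string for self-containment.
csv_text = """ID,FirstName,LastName
1,Navee,Srikanth
2,,Srikanth
3,Naveen,
"""
from_csv = pd.read_csv(io.StringIO(csv_text))

# Empty CSV fields were parsed as NaN, so both forms drop row 2.
kept_notnull = from_csv[from_csv["FirstName"].notnull()]
kept_dropna = from_csv.dropna(subset=["FirstName"])

# Hypothetical in-memory frame where the missing name is an empty string.
in_memory = pd.DataFrame({"ID": [1, 2, 3], "FirstName": ["Navee", "", "Naveen"]})

# notnull() keeps all three rows here, because '' is not NaN.
still_kept = in_memory[in_memory["FirstName"].notnull()]

# Adding an explicit blank check removes the empty-string row as well.
mask = in_memory["FirstName"].notnull() & (in_memory["FirstName"].str.strip() != "")
fixed = in_memory[mask]

print(kept_notnull["ID"].tolist(), len(still_kept), fixed["ID"].tolist())
```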
