PySpark DataFrames: Filter records with four or more non-null columns

Question

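I have many PySpark DataFrames in which the data in two columns is required and the other columns are optional. The required columns contain a date and a record ID. The most valuable data is in the optional columns. I am trying to capture connections between elements in the optional columns.

DataFrame, pre-filter: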

id     col1    col2    col3    date
123            xyz             20160401
234    abc     pqr             20160401
345    def     hij     klm     20160401
456                            20160401

Post-filter, the DataFrame looks like this:

id     col1    col2    col3    date
234    abc     pqr             20160401
345    def     hij     klm     20160401

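Records with multiple non-null column values are interesting because they describe relationships.

I noticed that PySpark has a .filter method. The examples in the documentation typically show filtering on a column, e.g. df_filtered = df.filter(df.some_col > some_value). I am trying to write a filter that captures all records with four or more non-null columns for an arbitrary DataFrame, i.e. column names must not be stated explicitly.

Is there a simple way to do this in PySpark?

Update

Although .dropna(thresh=4) appears to be exactly what I am looking for, for some reason it does not work. For example: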

df.collect()

[Row(id=123, col1=None,         col2=None, col3=3754907743, date='20160403'),
 Row(id=124, col1=7911019393,   col2=None, col3=1456473867, date='20160403'),
 Row(id=125, col1=None,         col2=None, col3=2049622472, date='20160403'),
 Row(id=126, col1=4345043212,   col2=None, col3=3168577324, date='20160403'),
 Row(id=127, col1=None,         col2=None, col3=3185277065, date='20160403'),
 Row(id=128, col1=1336048242,   col2=None, col3=1322345860, date='20160403')]

Regardless of the thresh value, it always returns all the records in the original DataFrame:

df_filtered = df.dropna(thresh=[any number])

df_filtered.collect()

[Row(id=123, col1=None,         col2=None, col3=3754907743, date='20160403'),
 Row(id=124, col1=7911019393,   col2=None, col3=1456473867, date='20160403'),
 Row(id=125, col1=None,         col2=None, col3=2049622472, date='20160403'),
 Row(id=126, col1=4345043212,   col2=None, col3=3168577324, date='20160403'),
 Row(id=127, col1=None,         col2=None, col3=3185277065, date='20160403'),
 Row(id=128, col1=1336048242,   col2=None, col3=1322345860, date='20160403')]

I am running Spark version 1.5.0-cdh5.5.2.

Answer 1

From the documentation, you are looking for dropna:

dropna(how='any', thresh=None, subset=None)

Returns a new DataFrame omitting rows with null values. DataFrame.dropna() and DataFrameNaFunctions.drop() are aliases of each other.

Parameters:
 how – 'any' or 'all'. If 'any', drop a row if it contains any nulls. If 'all', drop a row only if all its values are null.
 thresh – int, default None. If specified, drop rows that have less than thresh non-null values. This overwrites the how parameter.
 subset – optional list of column names to consider.
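
To make the thresh rule concrete, here is a small plain-Python model of it, applied to the rows from the question's update. It only illustrates the documented semantics and is not Spark's implementation; the helper name keep_rows_with_thresh is made up for this sketch, and it treats only None as null.

```python
def keep_rows_with_thresh(rows, thresh):
    """Keep rows (dicts) that have at least `thresh` non-None values.

    Plain-Python model of the documented `thresh` rule, for illustration only.
    """
    return [
        row for row in rows
        if sum(value is not None for value in row.values()) >= thresh
    ]


# Rows copied from the question's update.
rows = [
    {"id": 123, "col1": None, "col2": None, "col3": 3754907743, "date": "20160403"},
    {"id": 124, "col1": 7911019393, "col2": None, "col3": 1456473867, "date": "20160403"},
    {"id": 125, "col1": None, "col2": None, "col3": 2049622472, "date": "20160403"},
    {"id": 126, "col1": 4345043212, "col2": None, "col3": 3168577324, "date": "20160403"},
    {"id": 127, "col1": None, "col2": None, "col3": 3185277065, "date": "20160403"},
    {"id": 128, "col1": 1336048242, "col2": None, "col3": 1322345860, "date": "20160403"},
]

kept_ids = [row["id"] for row in keep_rows_with_thresh(rows, 4)]
print(kept_ids)  # [124, 126, 128]
```

Under the documented rule, thresh=4 should keep only ids 124, 126 and 128 for that data, since the other rows have just three non-null values (id, col3, date).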

So, to answer your question, you can try df.dropna(thresh=4).
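
If dropna(thresh=...) does not behave as documented on your Spark version, one possible workaround is to count the non-null columns yourself. df.filter also accepts a SQL expression string, so the predicate can be generated from df.columns without naming any column by hand. A minimal sketch, assuming column names contain no backticks; the helper name non_null_filter_expr is hypothetical, and this is untested on 1.5.0-cdh5.5.2:

```python
def non_null_filter_expr(columns, min_non_null):
    """Build a SQL predicate string that is true when at least
    `min_non_null` of the given columns are non-null.

    Hypothetical helper for illustration; assumes no backticks in names.
    """
    columns = list(columns)
    if not columns:
        raise ValueError("at least one column name is required")
    # Each column contributes 1 when it is non-null and 0 otherwise.
    terms = [
        "CASE WHEN `{0}` IS NOT NULL THEN 1 ELSE 0 END".format(name)
        for name in columns
    ]
    return "({0}) >= {1}".format(" + ".join(terms), int(min_non_null))


# Usage with an existing DataFrame `df` (needs a running Spark session):
#     df_filtered = df.filter(non_null_filter_expr(df.columns, 4))
```

Because the expression is built from whatever df.columns returns, the same call works for any of the DataFrames regardless of how many optional columns they have.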

Solution 1
1 2016-04-03 23:18:10
