Lambda function for filtering an RDD in Spark (Python): checking that an element is not an empty string

Question

I have the following RDD:

2019-09-24,Debt collection,transworld systems inc. is trying to collect a debt that is not mine not owed and is inaccurate.

2019-09-19,Credit reporting credit repair services or other personal consumer reports,

Each record consists of 3 comma-separated elements (a date, a label, and a comment).

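I need to apply a filter transformation so that only records that start with "201" (the date) and contain a comment are kept (that is, the third element has a value and is not an empty string).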

I use the following code to count how many records each filter transformation removes:

countA = rdd.count()

countB = rdd.filter(lambda x: x.startswith('201'))

countC = rdd.filter(lambda x: x.startswith('201') & (x.split(",")[2] != None) & (len(x.split(",")[2]) > 0))

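My code crashes when computing countC. The filtering seems to go through when I use countC in my further calculations, but I get more errors there as well...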

Answer 1

You are getting the error:

IndexError: list index out of range

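because you are trying to access index 2 of the list returned by the split, and that index may not exist if some rows in the dataset contain only a date, or only a date and a label, or are empty or otherwise badly formatted.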
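To see where the error comes from, here is a small sketch in plain Python (no Spark needed). The first two rows are adapted from the question; the last two are hypothetical malformed rows I made up for illustration, and `third_field` is just a helper name for this example.

```python
# First two rows are adapted from the question; the last two are
# hypothetical malformed rows added for illustration.
rows = [
    "2019-09-24,Debt collection,transworld systems inc. is trying to collect a debt",
    "2019-09-19,Credit reporting credit repair services or other personal consumer reports,",
    "2019-09-19",  # hypothetical: date only
    "",            # hypothetical: empty line
]

# Number of fields each row yields after splitting on commas.
field_counts = [len(row.split(",")) for row in rows]


def third_field(row):
    """Return the third field, or the error text if index 2 is missing."""
    try:
        return row.split(",")[2]
    except IndexError as exc:
        return "IndexError: " + str(exc)


results = [third_field(row) for row in rows]
```

Note that a row ending in a trailing comma still splits into three fields (the third is an empty string), while the short rows produce a single-element list, so indexing `[2]` on them fails.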

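The lambda function can take advantage of short-circuit evaluation in Python: first check that there are at least 3 elements (i.e., that index 2 exists) using len(x.split(",")) >= 3 instead of (x.split(",")[2] != None), before trying to access this index. This only works with the and operator; & does not short-circuit, so it evaluates every operand even when an earlier one is already False.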
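A minimal sketch of the difference between `and` and `&`, again in plain Python. The function names `keep_with_and` and `keep_with_bitwise` are mine, and the short row is a hypothetical example:

```python
def keep_with_and(x):
    # `and` stops at the first False operand, so [2] is only reached
    # when the length check has already passed.
    return (x.startswith('201')
            and len(x.split(",")) >= 3
            and len(x.split(",")[2]) > 0)


def keep_with_bitwise(x):
    # `&` evaluates every operand, so [2] is accessed even when the
    # row has fewer than 3 fields.
    return (x.startswith('201')
            & (len(x.split(",")) >= 3)
            & (len(x.split(",")[2]) > 0))


short_row = "2019-09-19"  # hypothetical row with a date only

and_result = keep_with_and(short_row)

try:
    keep_with_bitwise(short_row)
    bitwise_error = None
except IndexError as exc:
    bitwise_error = str(exc)
```

With the short row, the `and` version simply returns False, while the `&` version raises the same IndexError as in the question, even though the length check is present.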

This can be written as:

countC = rdd.filter(lambda x: x.startswith('201') and (len(x.split(",")) >=3) and (len(x.split(",")[2]) > 0))
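If the lambda gets hard to read, one option is to move the check into a named function that splits each line only once. This is only a sketch: `has_date_and_comment` is a name I picked, using `maxsplit=2` (so commas inside the comment stay in the third field) and `strip()` (so a whitespace-only comment counts as empty) are my assumptions about what you want, and the sample lines other than the question's format are made up.

```python
def has_date_and_comment(line):
    """Keep lines that start with '201' and have a non-blank third field."""
    if not line.startswith("201"):
        return False
    # maxsplit=2: at most 3 fields, commas in the comment are preserved.
    fields = line.split(",", 2)
    # Assumption: a comment made only of whitespace is treated as empty.
    return len(fields) == 3 and fields[2].strip() != ""


# Plain-Python check on hypothetical sample lines.
sample = [
    "2019-09-24,Debt collection,not mine, not owed",
    "2019-09-19,Credit reporting,",
    "2019-09-19",
    "Date received,Product,Comment",
    "",
]
kept = [line for line in sample if has_date_and_comment(line)]

# With the RDD from the question this would be used as:
#   countC = rdd.filter(has_date_and_comment).count()
```

Also note that `rdd.filter(...)` returns a new RDD, not a number, so to compare record counts you would call `.count()` on each filtered RDD, as in the last comment above.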

Let me know if this works for you.