Applying a logical operator to a list in pyspark

Question

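I have to apply the logical operator or across a list of conditions inside the where function in pyspark. Because the or operator in pyspark is |, Python's any() function cannot be used. Does anyone have a suggestion for how to solve this?

Below is a simple example: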

# List of conditions
spark_conditions = [cond1, cond2, ..., cond100]

# Apply somehow the '|' operator on `spark_conditions`
# spark_conditions would look like -> [cond1 | cond2 | .... | cond100]

df.select(columns).where(spark_conditions)

Thanks for your help!

Answer 1

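I think this is really a pandas question, since spark.sql.DataFrame seems to behave at least somewhat like a pandas DataFrame. I don't know Spark, though. In any case, your "spark conditions" are (I think) actually boolean Series. I'm sure there is a proper way to sum boolean Series in pandas, but you can also reduce them like this:

import pandas as pd
from functools import reduce

df = pd.DataFrame([0, 1, 2, 2, 1, 4], columns=["num"])
filter1 = df["num"] > 3
filter2 = df["num"] == 2
filter3 = df["num"] == 1
filters = (filter1, filter2, filter3)
combined = reduce(lambda x, y: x | y, filters)  # OR all the masks together
df[combined]  # pandas boolean indexing; in pyspark use df.filter(combined) (.where is an alias for .filter)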

It works like this: reduce() takes the first two items in filters and runs lambda x, y: x | y on them. It then passes that output as x to the same lambda, with the third entry in filters as y. It keeps going until there is nothing left to take.

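So the net effect is to apply the function cumulatively along the iterable. In this case the function just returns the | of its inputs, so it does exactly what you would do by hand, like this: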

(filter1 | filter2) | filter3

I suspect there is a neater or more fun way to do this, but reduce is sometimes worth having. Guido doesn't like it, though.
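
The accumulation described above can be traced without a Spark session. The following is a minimal sketch, not PySpark itself: Cond is a hypothetical stand-in class invented for illustration, which overloads | to build an expression string and refuses truth-testing, roughly the way a PySpark Column does. It shows both the left-to-right grouping that reduce produces and why any() fails on such objects.

```python
from functools import reduce
import operator


class Cond:
    """Hypothetical stand-in for a PySpark Column condition (not a real pyspark class)."""

    def __init__(self, expr):
        self.expr = expr

    def __or__(self, other):
        # Like a Column, `|` builds a new expression instead of evaluating anything.
        return Cond(f"({self.expr} | {other.expr})")

    def __bool__(self):
        # Expression objects refuse truth-testing, which is what breaks any().
        raise ValueError("cannot convert condition to bool; use '|' instead")


conditions = [Cond("cond1"), Cond("cond2"), Cond("cond3")]

# reduce applies `|` pairwise from the left; operator.or_ is the same as
# lambda x, y: x | y
combined = reduce(operator.or_, conditions)
print(combined.expr)

# any() truth-tests each element, so it raises on expression objects.
try:
    any(conditions)
    any_worked = True
except ValueError:
    any_worked = False
print(any_worked)
```

With real Column conditions, the same reduce(operator.or_, spark_conditions) call should produce a single Column that can be passed to where(), assuming spark_conditions is a non-empty list.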

Answer 2

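2e0byo's answer is quite correct. I am adding another way to do this in pyspark.

If our conditions are strings containing SQL conditional expressions (such as col_1 == 'ABC101'), we can combine all of those strings and pass the combined string as the condition to where() (or filter()).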

df = spark.createDataFrame([(1, "a"),
                            (2, "b"),
                            (3, "c"),
                            (4, "d"),
                            (5, "e"),
                            (6, "f"),
                            (7, "g")], schema="id int, name string")
condition1 = "id == 1"
condition2 = "id == 4"
condition3 = "id == 6"
conditions = [condition1, condition2, condition3]
combined_or_condition = " or ".join(conditions)     # Combine the conditions: condition1 or condition2 or condition3
df.where(combined_or_condition).show()

" or ".join(conditions) builds a single string by joining all the strings in conditions, using " or " as the delimiter. Here, combined_or_condition becomes id == 1 or id == 4 or id == 6.

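One thing to watch with the string approach is operator precedence once the individual conditions are compound. For the or-only example above, the plain join is fine. But if the joiner is changed to and while a condition itself contains or, the result no longer groups as intended, because and binds tighter than or in SQL. A small sketch of a safer join follows; combine_conditions is a hypothetical helper name, not part of pyspark.

```python
def combine_conditions(conditions, joiner="or"):
    """Hypothetical helper: join SQL condition strings, parenthesizing each one."""
    conditions = list(conditions)
    if not conditions:
        raise ValueError("need at least one condition")
    # Wrapping each condition keeps its internal and/or grouping intact.
    return f" {joiner} ".join(f"({c})" for c in conditions)


or_expr = combine_conditions(["id == 1", "id == 4", "id == 6"])
and_expr = combine_conditions(["id == 1 or id == 4", "name == 'a'"], joiner="and")

print(or_expr)
print(and_expr)
```

Either resulting string can then be passed to df.where() in the same way as combined_or_condition.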