Filter an array column in a DataFrame based on a given input array (PySpark)

Question

I have a DataFrame like this:

Studentname  Speciality
Alex         ["Physics","Math","biology"]
Sam          ["Economics","History","Math","Physics"]
Claire       ["Political science","Physics"]

I want to find all students whose Speciality includes both Physics and Math, so the output should have 2 rows: Alex and Sam.

This is what I tried:

from pyspark.sql.functions import array_contains
from pyspark.sql import functions as F

def student_info():
     student_df = spark.read.parquet("s3a://studentdata")
     a1=["Physics","Math"]
     df=student_df
     for a in a1:
       df= student_df.filter(array_contains(student_df.Speciality, a))
       print(df.count())

student_info()

output:
3
2

How can I filter the array column by a given subset of values?

Answer 1

Using the higher-order function filter should be the most scalable and efficient approach (Spark 2.4+):

from pyspark.sql import functions as F
df.withColumn("new", F.size(F.expr("""filter(Speciality, x-> x=='Math' or x== 'Physics')""")))\
  .filter("new=2").drop("new").show(truncate=False)
+-----------+-----------------------------------+
|Studentname|Speciality                         |
+-----------+-----------------------------------+
|Alex       |[Physics, Math, biology]           |
|Sam        |[Economics, History, Math, Physics]|
+-----------+-----------------------------------+

If you want to do this dynamically with a list such as a1, you can use F.array_except with F.array and then filter on size (Spark 2.4+):

a1=['Math','Physics']
df.withColumn("array", F.array_except("Speciality",F.array(*(F.lit(x) for x in a1))))\
  .filter(F.size("array") == F.size("Speciality") - len(a1)).drop("array").show(truncate=False)

+-----------+-----------------------------------+
|Studentname|Speciality                         |
+-----------+-----------------------------------+
|Alex       |[Physics, Math, biology]           |
|Sam        |[Economics, History, Math, Physics]|
+-----------+-----------------------------------+

To get the count, use .count() instead of .show().

Answer 2

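Here is another approach that relies on array_sort and Spark's equality operator, which compares arrays like any other type, provided they are sorted so that element order matches: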

from pyspark.sql.functions import lit, array, array_sort, array_intersect

target_ar = ["Physics", "Math"]
search_ar = array_sort(array(*[lit(e) for e in target_ar]))

df.where(array_sort(array_intersect(df["Speciality"], search_ar)) == search_ar) \
  .show(10, False)

# +-----------+-----------------------------------+
# |Studentname|Speciality                         |
# +-----------+-----------------------------------+
# |Alex       |[Physics, Math, biology]           |
# |Sam        |[Economics, History, Math, Physics]|
# +-----------+-----------------------------------+

First we find the common elements with array_intersect(df["Speciality"], search_ar), then we use == to compare the sorted arrays.

Answer 3

Assuming no student's Speciality contains duplicates (that is, no row like this one):

StudentName   Speciality
SomeStudent   ['Physics', 'Math', 'Biology', 'Physics']

you can use explode together with groupby in pandas.

So, for your problem:

# df is above dataframe
# Lookup subjects
a1 = ['Physics', 'Math']

gdata = df.explode('Speciality').groupby(['Speciality']).size().to_frame('Count')

gdata.loc[a1, 'Count']

#             Count
# Speciality
# Physics         3
# Math            2

3 answers

Answer 1
2 2020-03-24 23:34:36

Answer 2
2 accepted 2020-03-25 12:17:19

Answer 3
0 2020-03-24 23:16:45
