How to apply filter on a column (with datatype array (of strings)) on a PySpark dataframe?
Filter array column in a dataframe based on a given input array (PySpark)
I have a DataFrame like this:
Studentname Speciality
Alex ["Physics","Math","biology"]
Sam ["Economics","History","Math","Physics"]
Claire ["Political science,Physics"]
I want to find all students whose specialities include both Physics and Math, so the output should contain 2 rows: Alex and Sam.
This is what I tried:
from pyspark.sql.functions import array_contains
from pyspark.sql import functions as F

def student_info():
    student_df = spark.read.parquet("s3a://studentdata")
    a1 = ["Physics", "Math"]
    df = student_df
    for a in a1:
        df = student_df.filter(array_contains(student_df.Speciality, a))
        print(df.count())

student_info()
output:
3
2
I would like to know how to filter an array column based on a given subset array.
Using the higher-order function filter should be the most scalable and efficient approach (available from Spark 2.4):
from pyspark.sql import functions as F
df.withColumn("new", F.size(F.expr("""filter(Speciality, x-> x=='Math' or x== 'Physics')""")))\
.filter("new=2").drop("new").show(truncate=False)
+-----------+-----------------------------------+
|Studentname|Speciality |
+-----------+-----------------------------------+
|Alex |[Physics, Math, biology] |
|Sam |[Economics, History, Math, Physics]|
+-----------+-----------------------------------+
If you want to do this dynamically using an array such as a1, you can use F.array_except and F.array, then filter on size (Spark 2.4):
a1 = ['Math', 'Physics']
# Use len(a1) rather than a hard-coded 2 so the check follows the input list
df.withColumn("array", F.array_except("Speciality", F.array(*(F.lit(x) for x in a1))))\
.filter(f"size(array) = size(Speciality) - {len(a1)}").drop("array").show(truncate=False)
+-----------+-----------------------------------+
|Studentname|Speciality |
+-----------+-----------------------------------+
|Alex |[Physics, Math, biology] |
|Sam |[Economics, History, Math, Physics]|
+-----------+-----------------------------------+
To get the count, use .count() instead of .show().
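For comparison, the loop from the question can be turned into a single condition by requiring one array_contains test per subject. The sketch below only builds the Spark SQL predicate string; the helper name all_subjects_predicate is hypothetical, and it assumes the list is non-empty and the subject names contain no single quotes.

```python
def all_subjects_predicate(column, wanted):
    """Build a Spark SQL predicate that is true when `column` contains every value in `wanted`.

    Assumptions: `wanted` is non-empty and its values contain no single quotes,
    so no escaping is attempted.
    """
    return " AND ".join(f"array_contains({column}, '{value}')" for value in wanted)


a1 = ["Physics", "Math"]
predicate = all_subjects_predicate("Speciality", a1)
print(predicate)

# With the DataFrame `df` used in the answer above, the predicate could be applied as:
#   df.filter(predicate).show(truncate=False)
#   df.filter(predicate).count()
```

Because every subject is tested independently, this variant does not depend on the array sizes, so repeated entries in Speciality should not affect the result.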
Here is another approach that uses array_sort and Spark's equality operator, which handles arrays like any other type, provided they are sorted:
from pyspark.sql.functions import lit, array, array_sort, array_intersect
target_ar = ["Physics", "Math"]
search_ar = array_sort(array(*[lit(e) for e in target_ar]))
df.where(array_sort(array_intersect(df["Speciality"], search_ar)) == search_ar) \
.show(10, False)
# +-----------+-----------------------------------+
# |Studentname|Speciality |
# +-----------+-----------------------------------+
# |Alex |[Physics, Math, biology] |
# |Sam |[Economics, History, Math, Physics]|
# +-----------+-----------------------------------+
First we use array_intersect(df["Speciality"], search_ar) to find the common elements, then we use == to compare the sorted arrays.
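To make the condition concrete, here is a plain-Python model of what it evaluates for each row. This is only an illustration, not Spark code: array_intersect and has_all below are hypothetical stand-ins written for this sketch, and the rows are the sample data from the question.

```python
def array_intersect(left, right):
    """Plain-Python model: common elements, in `left` order, without duplicates."""
    result = []
    for item in left:
        if item in right and item not in result:
            result.append(item)
    return result


def has_all(speciality, target):
    """Mirror of the where() condition: sorted intersection equals sorted target."""
    return sorted(array_intersect(speciality, target)) == sorted(target)


# Sample rows copied from the question
rows = {
    "Alex": ["Physics", "Math", "biology"],
    "Sam": ["Economics", "History", "Math", "Physics"],
    "Claire": ["Political science,Physics"],
}
target_ar = ["Physics", "Math"]

matches = [name for name, speciality in rows.items() if has_all(speciality, target_ar)]
print(matches)  # ['Alex', 'Sam']
```

The sort is what makes the equality test independent of the order in which the subjects appear in each row.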
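Assuming no student has duplicate entries in Speciality (that is, no rows like this one):
StudentName Speciality
SomeStudent ['Physics', 'Math', 'Biology', 'Physics']
you can use explode together with groupby in pandas. So, for your problem: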
# df is above dataframe
# Lookup subjects
a1 = ['Physics', 'Math']
gdata = df.explode('Speciality').groupby(['Speciality']).size().to_frame('Count')
gdata.loc[a1, 'Count']
# Count
# Speciality
# Physics 3
# Math 2