Filter PySpark dataframe into a list of dataframes
I have a PySpark dataframe that I want to split into separate dataframes based on the unique values in certain columns.
import pandas as pd
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.enableHiveSupport().getOrCreate()
columns = ["language", "users_count", "apple"]
data = [("Java", 1, 0.0), ("Scala", 4, -4.0), ("Java", 1, 0.0)]
pyspark_df = spark_session.createDataFrame(data).toDF(*columns)
pandas_df = pd.DataFrame(data, columns=columns)
# Operation I want to replicate in PySpark:
column_list = ['language', 'users_count']  # these names and the number of columns can change at runtime
unique_dfs = [df for _, df in pandas_df.groupby(column_list, as_index=False)]
Another way this could be done is to create a column in the PySpark df holding a unique value (the string language + users_count) and then filter on those unique values to get the dfs.
If you know exactly which data you need, you should just use filter, because it is efficient in Spark:
from pyspark.sql import functions as F
df = pyspark_df.filter(
    (F.col('language') == 'Java') &
    (F.col('users_count') == 1)
)
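Since the question says the column names and their number can change at runtime, the per-column equality conditions can be chained with `functools.reduce` instead of being written out by hand. A minimal plain-Python sketch of that reduce pattern follows (the row dict and filter values are hypothetical stand-ins; in PySpark you would combine `F.col(c) == value` expressions with `&` in exactly the same way):

```python
from functools import reduce
import operator

# Hypothetical row and runtime filter values, standing in for a
# PySpark row and a dict mapping column name -> wanted value
row = {"language": "Java", "users_count": 1, "apple": 0.0}
filters = {"language": "Java", "users_count": 1}

# AND together one equality check per filter column
matches = reduce(operator.and_, (row[c] == v for c, v in filters.items()))
print(matches)  # True
```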
If you really need all possible combinations of those columns as separate dataframes, you will have to run distinct (i.e. a shuffle, which you normally want to avoid) and an inefficient collect:
from pyspark.sql import functions as F
column_list = ['language', 'users_count']
df_dist = pyspark_df.select(column_list).distinct()
unique_dfs = []
for row in df_dist.collect():
    cond = F.lit(True)
    for c in column_list:
        cond &= (F.col(c) == row[c])
    unique_dfs.append(pyspark_df.filter(cond))
Result:
unique_dfs[0].show()
# +--------+-----------+-----+
# |language|users_count|apple|
# +--------+-----------+-----+
# | Java| 1| 0.0|
# | Java| 1| 0.0|
# +--------+-----------+-----+
unique_dfs[1].show()
# +--------+-----------+-----+
# |language|users_count|apple|
# +--------+-----------+-----+
# | Scala| 4| -4.0|
# +--------+-----------+-----+
unique_dfs[0].explain()
# == Physical Plan ==
# *(1) Project [_1#158 AS language#164, _2#159L AS users_count#165L, _3#160 AS apple#166]
# +- *(1) Filter ((isnotnull(_1#158) AND isnotnull(_2#159L)) AND ((_1#158 = Java) AND (_2#159L = 1)))
# +- *(1) Scan ExistingRDD[_1#158,_2#159L,_3#160]
Note: here you see Java indexed at 0 and Scala at 1, but in reality it could be the other way around. You have no determinism there, because you don't know which executor will send its data to the driver first when the driver asks for the data with collect. So what you asked for may not be what you actually need.
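If the list position of each group matters, one way around that non-determinism is to sort the collected distinct keys before building the per-group dataframes. A sketch with plain tuples standing in for the collected `Row` objects (sorting the actual `collect()` result works the same way, e.g. by key columns):

```python
# The order returned by collect() is not guaranteed
collected = [("Scala", 4), ("Java", 1)]

# Fix an explicit order so unique_dfs[0] is always the same group
collected = sorted(collected)
print(collected)  # [('Java', 1), ('Scala', 4)]
```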
Create a rank with a window function ordered by the columns you want to group on. Then iterate from 1 up to the maximum rank, filter the dataframe on each rank value, and store the resulting dataframes in a list. I hope this helps!
from pyspark.sql import functions as F, Window as W
column_list = ['language', 'users_count']
unique_dfs = []
w = W.orderBy(*column_list)
df = pyspark_df.withColumn('_rank', F.dense_rank().over(w))
for i in range(1, df.agg(F.max('_rank')).head()[0] + 1):
    unique_dfs.append(df.filter(F.col('_rank') == i))
Result:
unique_dfs[0].show()
# +--------+-----------+-----+-----+
# |language|users_count|apple|_rank|
# +--------+-----------+-----+-----+
# | Java| 1| 0.0| 1|
# | Java| 1| 0.0| 1|
# +--------+-----------+-----+-----+
unique_dfs[1].show()
# +--------+-----------+-----+-----+
# |language|users_count|apple|_rank|
# +--------+-----------+-----+-----+
# | Scala| 4| -4.0| 2|
# +--------+-----------+-----+-----+
unique_dfs[0].explain()
# == Physical Plan ==
# AdaptiveSparkPlan isFinalPlan=false
# +- Filter (_rank#579 = 1)
# +- Window [dense_rank(language#571, users_count#572L) windowspecdefinition(language#571 ASC NULLS FIRST, users_count#572L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS _rank#579], [language#571 ASC NULLS FIRST, users_count#572L ASC NULLS FIRST]
# +- Sort [language#571 ASC NULLS FIRST, users_count#572L ASC NULLS FIRST], false, 0
# +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#975]
# +- Project [_1#565 AS language#571, _2#566L AS users_count#572L, _3#567 AS apple#573]
# +- Scan ExistingRDD[_1#565,_2#566L,_3#567]
I have solved it like this:
from pyspark.sql import functions

groups = pyspark_df.select(['language', 'users_count']).distinct().collect()
unique_campaigns_dfs = [
    pyspark_df.where((functions.col('language') == x[0]) & (functions.col('users_count') == x[1]))
    for x in groups
]
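For reference, the operation being replicated here (pandas `groupby` yielding one frame per group) boils down to grouping rows under a tuple key, which can be sketched with a plain dict; the PySpark answers above achieve the same effect with `distinct` plus one `filter`/`where` per key. The `key_idx` positions below are just an illustration matching the sample data:

```python
from collections import defaultdict

data = [("Java", 1, 0.0), ("Scala", 4, -4.0), ("Java", 1, 0.0)]
key_idx = [0, 1]  # positions of 'language' and 'users_count'

# Bucket each row under the tuple of its key-column values
groups = defaultdict(list)
for row in data:
    groups[tuple(row[i] for i in key_idx)].append(row)

for key, rows in groups.items():
    print(key, rows)
# ('Java', 1) [('Java', 1, 0.0), ('Java', 1, 0.0)]
# ('Scala', 4) [('Scala', 4, -4.0)]
```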