
Filter PySpark dataframe into a list of dataframes

I have a PySpark dataframe that I want to filter based on the unique values in certain columns.

import pandas as pd

from pyspark.sql import SparkSession
spark_session = SparkSession.builder.enableHiveSupport().getOrCreate()

columns = ["language","users_count","apple"]
data = [("Java", 1, 0.0), ("Scala", 4, -4.0), ("Java", 1, 0.0)]

pyspark_df = spark_session.createDataFrame(data).toDF(*columns)

pandas_df = pd.DataFrame(data, columns=columns)

# Operation I want to replicate in PySpark:
column_list = ['language', 'users_count']  # these names and the number of columns can change at runtime
unique_dfs = [df for _, df in pandas_df.groupby(column_list, as_index=False)]

Another way this could be done is to create a column in the PySpark df holding a unique value (the string language + users_count) and then filter on those unique values to get the dfs.
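A minimal sketch of that idea, assuming the key is built by concatenating the grouping columns into a string (the `_key` column name and the `|` separator are illustrative choices, not from the original):

from pyspark.sql import functions as F

column_list = ['language', 'users_count']

# Build a single string key per row, e.g. "Java|1" (hypothetical helper column `_key`)
keyed_df = pyspark_df.withColumn(
    '_key', F.concat_ws('|', *[F.col(c).cast('string') for c in column_list])
)

# Collect the distinct keys, then filter once per key
keys = [r['_key'] for r in keyed_df.select('_key').distinct().collect()]
unique_dfs = [keyed_df.filter(F.col('_key') == k).drop('_key') for k in keys]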

If you know exactly which data you need, you should just filter, since that is efficient in Spark:

from pyspark.sql import functions as F

df = pyspark_df.filter(
    (F.col('language') == 'Java') &
    (F.col('users_count') == 1)
)

If you really need every existing combination of those columns as a separate dataframe, you will have to run distinct (i.e. a shuffle, which is best avoided) and an inefficient collect:

from pyspark.sql import functions as F

column_list = ['language', 'users_count']
df_dist = pyspark_df.select(column_list).distinct()
unique_dfs = []
for row in df_dist.collect():
    # Build one equality condition per grouping column for this distinct row
    cond = F.lit(True)
    for c in column_list:
        cond &= (F.col(c) == row[c])
    unique_dfs.append(pyspark_df.filter(cond))

Result:

unique_dfs[0].show()
# +--------+-----------+-----+
# |language|users_count|apple|
# +--------+-----------+-----+
# |    Java|          1|  0.0|
# |    Java|          1|  0.0|
# +--------+-----------+-----+

unique_dfs[1].show()
# +--------+-----------+-----+
# |language|users_count|apple|
# +--------+-----------+-----+
# |   Scala|          4| -4.0|
# +--------+-----------+-----+

unique_dfs[0].explain()
# == Physical Plan ==
# *(1) Project [_1#158 AS language#164, _2#159L AS users_count#165L, _3#160 AS apple#166]
# +- *(1) Filter ((isnotnull(_1#158) AND isnotnull(_2#159L)) AND ((_1#158 = Java) AND (_2#159L = 1)))
#    +- *(1) Scan ExistingRDD[_1#158,_2#159L,_3#160]

Note: here you see Java at index 0 and Scala at index 1, but in practice it could just as well be the other way around. There is no determinism there, because you don't know which executor will send its data to the driver first after the driver requests it during collect. So what you are asking for may not be what you really need.

Create a rank using a window function over the columns whose values you want to group by. Then iterate from 1 up to the maximum rank, filter the dataframe on each rank value, and store the resulting dataframes in a list. I hope this helps!

from pyspark.sql import functions as F, Window as W

column_list = ['language', 'users_count']
unique_dfs = []
w = W.orderBy(*column_list)
df = pyspark_df.withColumn('_rank', F.dense_rank().over(w))
for i in range(1, df.agg(F.max('_rank')).head()[0] + 1):
    unique_dfs.append(df.filter(F.col('_rank') == i))

Result:

unique_dfs[0].show()
# +--------+-----------+-----+-----+
# |language|users_count|apple|_rank|
# +--------+-----------+-----+-----+
# |    Java|          1|  0.0|    1|
# |    Java|          1|  0.0|    1|
# +--------+-----------+-----+-----+

unique_dfs[1].show()
# +--------+-----------+-----+-----+
# |language|users_count|apple|_rank|
# +--------+-----------+-----+-----+
# |   Scala|          4| -4.0|    2|
# +--------+-----------+-----+-----+

unique_dfs[0].explain()
# == Physical Plan ==
# AdaptiveSparkPlan isFinalPlan=false
# +- Filter (_rank#579 = 1)
#    +- Window [dense_rank(language#571, users_count#572L) windowspecdefinition(language#571 ASC NULLS FIRST, users_count#572L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS _rank#579], [language#571 ASC NULLS FIRST, users_count#572L ASC NULLS FIRST]
#       +- Sort [language#571 ASC NULLS FIRST, users_count#572L ASC NULLS FIRST], false, 0
#          +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#975]
#             +- Project [_1#565 AS language#571, _2#566L AS users_count#572L, _3#567 AS apple#573]
#                +- Scan ExistingRDD[_1#565,_2#566L,_3#567]

I have solved this problem like so:

from pyspark.sql import functions

groups = pyspark_df.select(['language', 'users_count']).distinct().collect()

unique_campaigns_dfs = [
    pyspark_df.where(
        (functions.col('language') == x[0]) & (functions.col('users_count') == x[1])
    )
    for x in groups
]
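Since the column names and their number can change at runtime (as stated in the question), the same idea can be written generically. This is only a sketch, assuming functools.reduce is acceptable for folding the per-column conditions (the variable names are illustrative):

from functools import reduce
from pyspark.sql import functions

column_list = ['language', 'users_count']  # could be any list of grouping columns
groups = pyspark_df.select(column_list).distinct().collect()

unique_campaigns_dfs = [
    pyspark_df.where(
        # AND together one equality test per grouping column for this distinct row
        reduce(lambda acc, c: acc & (functions.col(c) == row[c]),
               column_list,
               functions.lit(True))
    )
    for row in groups
]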
