
How to group and filter a dataframe in PySpark

Attached is my dataframe. I'm trying to find a way in PySpark to filter for the Link Names for which none of the Supports has Status 'In'. For example, the expected output should be only Link3, as none of the supports associated with it has Status 'In'.

Link Name  Support   Status
Link1      Support1  In
Link1      Support2  In
Link1      Support3  Out
Link2      Support4  In
Link2      Support5  In
Link3      Support6  Out
Link3      Support7  Out

Can someone please help me here?


You can try something like this with a window function:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window

# Reuse the active session, or create one when running outside a shell/notebook
spark = SparkSession.builder.getOrCreate()

inputData = [
    ("Link1", "Support1", "In"),
    ("Link1", "Support2", "In"),
    ("Link1", "Support3", "Out"),
    ("Link2", "Support4", "In"),
    ("Link2", "Support5", "In"),
    ("Link3", "Support6", "Out"),
    ("Link3", "Support7", "Out"),
]
inputDf = spark.createDataFrame(inputData, schema=["Link Name", "Support", "Status"])

# Rank rows within each link by Status; "In" sorts before "Out"
window = Window.partitionBy("Link Name").orderBy(F.col("Status").asc())

dfWithRank = inputDf.withColumn("dense_rank", F.dense_rank().over(window))
# Keep links whose first-ranked status is "Out", i.e. links with no "In" row
dfWithRank.filter(
    (F.col("dense_rank") == F.lit(1)) & (F.col("Status") == F.lit("Out"))
).select("Link Name").distinct().show()

I am partitioning by Link Name and sorting by Status within each partition. Because "In" sorts before "Out", if the first status in a partition sorted ascending is "Out", then no "In" status exists for that partition, and that is what the filter checks.

At the end I select only Link Name and call distinct to get a single record per Link Name.

Output is

+---------+
|Link Name|
+---------+
|    Link3|
+---------+
