简体   繁体   中英

How to group and filter a dataframe in PySpark

Attached is my dataframe and I'm trying to find a way in pySpark to filter Link Name for which none of the Supports are with Status 'In'. For eg. The expected output should be only Link3 as none of the supports associated to it are with 'In'

Link Name Support Status
Link1 Support1 In
Link1 Support2 In
Link1 Support3 Out
Link2 Support4 In
Link2 Support5 In
Link3 Support6 Out
Link3 Support7 Out

Can someone please help me here?

The expected output should be only Link3 as none of the supports associated to it are without 'In'

You can try something like this with window function

import pyspark.sql.functions as F
from pyspark.sql import Window

inputData = [
    ("Link1", "Support1", "In"),
    ("Link1", "Support2", "In"),
    ("Link1", "Support3", "Out"),
    ("Link2", "Support4", "In"),
    ("Link2", "Support5", "In"),
    ("Link3", "Support6", "Out"),
    ("Link3", "Support7", "Out"),
]
inputDf = spark.createDataFrame(inputData, schema=["Link Name", "Support", "Status"])

window = Window.partitionBy("Link Name").orderBy(F.col("Status").asc())

dfWithRank = inputDf.withColumn("dense_rank", F.dense_rank().over(window))
dfWithRank.filter(
    (F.col("dense_rank") == F.lit(1)) & (F.col("Status") == F.lit("Out"))
).select("Link Name").distinct().show()

I am grouping by link name and sorting by status within group. If first status within group sorted ascending is "Out" it means that "In" status does not exists for such partition and thats what filter is checking

At the end i am selecting only Link Name and calling distinct to get just single record with Link Name

Output is

+---------+
|Link Name|
+---------+
|    Link3|
+---------+

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM