How to group and filter a dataframe in PySpark
Below is my dataframe. I'm trying to find a way in PySpark to keep only the Link Names for which none of the Supports has Status 'In'. For example, the expected output here should be only Link3, since none of the supports associated with it has Status 'In'.
Link Name | Support | Status |
---|---|---|
Link1 | Support1 | In |
Link1 | Support2 | In |
Link1 | Support3 | Out |
Link2 | Support4 | In |
Link2 | Support5 | In |
Link3 | Support6 | Out |
Link3 | Support7 | Out |
Can someone please help me here?
The expected output should be only Link3, as none of the supports associated with it has Status 'In'.
You can try something like this with a window function:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window

# Reuse the active session, or create one when running outside a Spark shell.
spark = SparkSession.builder.getOrCreate()

inputData = [
    ("Link1", "Support1", "In"),
    ("Link1", "Support2", "In"),
    ("Link1", "Support3", "Out"),
    ("Link2", "Support4", "In"),
    ("Link2", "Support5", "In"),
    ("Link3", "Support6", "Out"),
    ("Link3", "Support7", "Out"),
]
inputDf = spark.createDataFrame(inputData, schema=["Link Name", "Support", "Status"])

# Within each link, order statuses ascending: "In" sorts before "Out".
window = Window.partitionBy("Link Name").orderBy(F.col("Status").asc())
dfWithRank = inputDf.withColumn("dense_rank", F.dense_rank().over(window))

# Keep links whose first-ranked status is "Out", i.e. links with no "In" rows.
dfWithRank.filter(
    (F.col("dense_rank") == F.lit(1)) & (F.col("Status") == F.lit("Out"))
).select("Link Name").distinct().show()
I partition by Link Name and sort by Status within each partition. If the first status in a partition sorted ascending is "Out", then no "In" status exists in that partition (because "In" sorts before "Out"), and that is what the filter checks.
At the end I select only Link Name and call distinct to get a single record per Link Name.
Output is:
+---------+
|Link Name|
+---------+
|    Link3|
+---------+
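The reasoning behind the ascending sort can be checked without Spark. The following is only a plain-Python sketch of the same per-group logic, not part of the answer's code; the helper name `links_without_in` is hypothetical, and it assumes the only statuses are "In" and "Out", so that "In" always sorts first.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    ("Link1", "Support1", "In"),
    ("Link1", "Support2", "In"),
    ("Link1", "Support3", "Out"),
    ("Link2", "Support4", "In"),
    ("Link2", "Support5", "In"),
    ("Link3", "Support6", "Out"),
    ("Link3", "Support7", "Out"),
]


def links_without_in(rows):
    """Return link names whose first status, sorted ascending, is "Out"."""
    result = []
    # Sort by link name, then status: mirrors partitionBy + orderBy.
    ordered = sorted(rows, key=itemgetter(0, 2))
    for link, group in groupby(ordered, key=itemgetter(0)):
        # The first row of each group is the one that would get dense_rank 1.
        first_status = next(group)[2]
        if first_status == "Out":
            result.append(link)
    return result


print(links_without_in(rows))  # ['Link3']
```

If other status values were possible, the alphabetical ordering would no longer be a safe test, and an explicit check for the presence of "In" in each group would be more robust.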