
Using Spark, filter a data frame with conditions

I have a data frame that looks like this:

scala> val df = sc.parallelize(Seq(("User 1","X"), ("User 2", "Y"), ("User 3", "X"), ("User 2", "E"), ("User 3", "E"))).toDF("user", "event")

scala> df.show
+------+-----+
|  user|event|
+------+-----+
|User 1|    X|
|User 2|    Y|
|User 3|    X|
|User 2|    E|
|User 3|    E|
+------+-----+

I want to find all the users who have event "X" but don't have event "E".

In this case only 'User 1' qualifies, as it does not have an event "E" entry. How can I do this using the Spark API?

A left join can be used:

import org.apache.spark.sql.functions.col

val xDF = df.filter(col("event") === "X")
val eDF = df.filter(col("event") === "E")
val result = xDF.as("x")
  .join(eDF.as("e"), List("user"), "left_outer")
  .where(col("e.event").isNull)
  .select(col("x.user"))

The result is:

+------+
|user  |
+------+
|User 1|
+------+
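
The same condition can also be written as a single anti join (a minimal sketch, assuming Spark 2.0 or later, where the "left_anti" join type is available):

import org.apache.spark.sql.functions.col

// the anti join keeps only the "X" rows whose user never appears among the "E" rows
val result = df.filter(col("event") === "X")
  .join(df.filter(col("event") === "E"), List("user"), "left_anti")
  .select("user")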

You can group each user's events into a collection and then filter users based on the contents of that collection.

import org.apache.spark.sql.functions.collect_list

val result = df.groupBy("user")
    .agg(collect_list("event").as("events"))
    // keep users whose collected events contain "X" but not "E"
    .filter(p => p.getList[String](1).contains("X") && !p.getList[String](1).contains("E"))
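
The same filter can stay in the untyped Column API, which avoids deserializing each Row (a sketch using collect_set and array_contains from org.apache.spark.sql.functions; collect_set drops duplicate events, which is fine here since only membership is tested):

import org.apache.spark.sql.functions.{array_contains, col, collect_set}

// collect each user's distinct events, then test membership on the array column
val result = df.groupBy("user")
  .agg(collect_set("event").as("events"))
  .filter(array_contains(col("events"), "X") && !array_contains(col("events"), "E"))
  .select("user")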
Another option is to pivot on the event column, so each user becomes one row with a count per event type:

val tmp = df.groupBy("user").pivot("event").count
tmp.show
+------+----+----+----+
|  user|   E|   X|   Y|
+------+----+----+----+
|User 2|   1|null|   1|
|User 3|   1|   1|null|
|User 1|null|   1|null|
+------+----+----+----+
tmp.filter(($"X".isNotNull) and ($"E".isNull)).show
+------+----+---+----+
|  user|   E|  X|   Y|
+------+----+---+----+
|User 1|null|  1|null|
+------+----+---+----+
tmp.filter(($"X".isNotNull) and ($"E".isNull)).select("user", "X").show
+------+---+
|  user|  X|
+------+---+
|User 1|  1|
+------+---+
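
Pivot creates one column per distinct event, which can get wide if there are many event types. A conditional aggregation produces just the two counts that matter (a minimal sketch of an equivalent query, using only sum and when from org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.{col, sum, when}

val counts = df.groupBy("user").agg(
  sum(when(col("event") === "X", 1).otherwise(0)).as("xCount"),  // number of "X" events per user
  sum(when(col("event") === "E", 1).otherwise(0)).as("eCount"))  // number of "E" events per user
counts.filter(col("xCount") > 0 && col("eCount") === 0).select("user").show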

Hope this helps.

You can count the rows per user and the rows per (user, event) pair, then keep the rows where the two counts are equal and the event column has the value "X".

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
df.withColumn("count", count($"user").over(Window.partitionBy("user")))  // rows per user
    .withColumn("distinctCount", count($"user").over(Window.partitionBy("user", "event")))  // rows per (user, event)
    .filter($"count" === $"distinctCount" && $"event" === "X")
    .drop("count", "distinctCount")

You should get the result you want. I hope the answer is helpful.
