Spark Scala filter on group of result
I am trying to filter a DataFrame based on the contents of each group.
Sample DataFrame code:
scala> val df = sc.parallelize(Seq(
    (1, 1, "m10", "t22"),
    (1, 2, "m10", "t22"),
    (1, 3, "m11", "t22"),
    (1, 4, "m11", "t22"),
    (1, 5, "m10", "t22"),
    (1, 6, "m10", "t22"),
    (1, 7, "m10", "t22"),
    (1, 8, "m11", "t22"),
    (1, 9, "m10", "t22"),
    (1, 10, "m10", "t22"),
    (2, 1, "m10", "t22"),
    (2, 2, "m11", "t22"),
    (2, 3, "m10", "t22"),
    (2, 4, "m10", "t22"),
    (2, 5, "m10", "t22"),
    (2, 9, "m10", "t22"),
    (2, 10, "m11", "t22"),
    (3, 4, "m10", "t22"),
    (3, 5, "m11", "t22"),
    (3, 6, "m10", "t22"),
    (3, 7, "m10", "t22"),
    (3, 8, "m10", "t22"),
    (3, 9, "m11", "t22"),
    (3, 10, "m10", "t22")
  )
).toDF("org_id", "rule_id", "period_id", "base_id")
The data looks like this:
scala> df.show(50, false)
+------+-------+---------+-------+
|org_id|rule_id|period_id|base_id|
+------+-------+---------+-------+
|1     |1      |m10      |t21    |
|1     |2      |m10      |t22    |
|1     |3      |m11      |t22    |
|1     |4      |m11      |t22    |
|1     |5      |m10      |t23    |
|1     |6      |m10      |t22    |
|1     |7      |m10      |t22    |
|1     |8      |m11      |t22    |
|1     |9      |m10      |t22    |
|1     |10     |m10      |t22    |
|2     |1      |m10      |t22    |
|2     |2      |m11      |t22    |
|2     |3      |m10      |t23    |
|2     |4      |m10      |t22    |
|2     |5      |m10      |t22    |
|2     |9      |m10      |t22    |
|2     |10     |m11      |t22    |
|3     |4      |m10      |t22    |
|3     |5      |m11      |t22    |
|3     |6      |m10      |t22    |
|3     |7      |m10      |t22    |
|3     |8      |m10      |t22    |
|3     |9      |m11      |t22    |
|3     |10     |m10      |t23    |
+------+-------+---------+-------+
Based on a properties file, I need to filter the rows within each org_id group. The properties file looks like this:
4=1,2,3
7=1,4,5
9=8,10
.....................
.....................
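All keys and values in the properties file are rule_id values.
Rows with rule_id 4 should be kept only if the same org_id group also contains rule_ids 1, 2, and 3; otherwise the rows with rule_id 4 must be removed. The same applies to every other rule_id key in the properties file.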
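To make the rule concrete, here is a minimal sketch on plain Scala collections, with no Spark involved. The names parseRules and keepRow are hypothetical helpers introduced only for illustration, and the properties lines are assumed to have the form key=value1,value2,... as shown above.

```scala
// Hypothetical helper: parse lines such as "4=1,2,3" into
// rule_id -> set of rule_ids that must also be present in the org.
def parseRules(lines: Seq[String]): Map[Int, Set[Int]] =
  lines.map { line =>
    line.split("=", 2) match {
      case Array(key, values) =>
        key.trim.toInt -> values.split(",").map(_.trim.toInt).toSet
    }
  }.toMap

// Hypothetical helper: a row survives if its rule_id has no entry in the
// properties file, or if the org contains every rule_id that entry requires.
def keepRow(ruleId: Int, rulesInOrg: Set[Int], rules: Map[Int, Set[Int]]): Boolean =
  rules.get(ruleId).forall(_.subsetOf(rulesInOrg))

val rules = parseRules(Seq("4=1,2,3", "7=1,4,5", "9=8,10"))

// rule_ids present under org_id 3 in the sample data
val org3 = Set(4, 5, 6, 7, 8, 9, 10)
val keptInOrg3 = org3.filter(keepRow(_, org3, rules)).toSeq.sorted
```

For org_id 3 this drops rule_ids 4 and 7 (rule_id 1 is missing from the group) and keeps 9, because both 8 and 10 are present.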
Expected result:
+------+-------+---------+-------+
|org_id|rule_id|period_id|base_id|
+------+-------+---------+-------+
|1     |1      |m10      |t21    |
|1     |2      |m10      |t22    |
|1     |3      |m11      |t22    |
|1     |4      |m11      |t22    |
|1     |5      |m10      |t23    |
|1     |6      |m10      |t22    |
|1     |7      |m10      |t22    |
|1     |8      |m11      |t22    |
|1     |9      |m10      |t22    |
|1     |10     |m10      |t22    |
|2     |1      |m10      |t22    |
|2     |2      |m11      |t22    |
|2     |3      |m10      |t23    |
|2     |4      |m10      |t22    |
|2     |5      |m10      |t22    |
|2     |10     |m11      |t22    |
|3     |5      |m11      |t22    |
|3     |6      |m10      |t22    |
|3     |8      |m10      |t22    |
|3     |9      |m11      |t22    |
|3     |10     |m10      |t23    |
+------+-------+---------+-------+
I'm stuck on this and don't know how to proceed. Any suggestions would be appreciated.
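This approach involves several joins and aggregations, so hopefully the data is not too large.
The idea is to build one record per org_id holding the set of rules that appear in it. Joins then tie each original record whose rule has sub-rules to the sub-rules that must exist for that org/rule combination and to the rules actually present in that org, producing orgsContainingRulesDF. With this DataFrame you can filter out the rules whose sub-rules are not all present.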
import org.apache.spark.sql.functions.{col, collect_set, udf}
// In spark-shell, spark.implicits._ (needed for toDF and $"...") is already imported
// Assume rule/sub-rule info can be read as either a Map or List of Tuple
val rules = Map(4->Set(1,2,3), 7->Set(1,4,5), 9->Set(8,10))
val rulesDF = rules.toList.toDF("rule", "sub_rules")
// For each org_id, get a set of rules which appear under it
val ruleSetsDF = df.groupBy(col("org_id")).agg(collect_set(col("rule_id")) as "rules")
// For each rule with sub-rules, match with orgs containing that rule
// (inner join: a rule that appears in no org has nothing to filter and would
// otherwise pass null arrays to the UDF below)
// Also get the full list of rules pertaining to that org
val orgsContainingRulesDF = rulesDF.join(df, $"rule" === $"rule_id", "inner").join(ruleSetsDF, Seq("org_id"), "left")
// Create a UDF for determining if all items in first seq are in second seq
// (rule ids are integers, so both arrays are Seq[Int])
val subsetOf = udf((array1: Seq[Int], array2: Seq[Int]) => {
  array1.toSet.subsetOf(array2.toSet)
})
// Create DF with items to delete
// i.e. org-and-rule-id-pairs where not all sub-rules appear in exhibited rules
val toDeleteDF = orgsContainingRulesDF.filter(!subsetOf($"sub_rules", $"rules"))
// Use a left anti-join (inverse of left join) to only preserve records
// with no corresponding toDeleteDF record
val resultDF = df.join(toDeleteDF, Seq("org_id", "rule_id"), "left_anti").orderBy($"org_id", $"rule_id")
The result is as expected:
resultDF.show(25,false)
+------+-------+---------+-------+
|org_id|rule_id|period_id|base_id|
+------+-------+---------+-------+
|1     |1      |m10      |t22    |
|1     |2      |m10      |t22    |
|1     |3      |m11      |t22    |
|1     |4      |m11      |t22    |
|1     |5      |m10      |t22    |
|1     |6      |m10      |t22    |
|1     |7      |m10      |t22    |
|1     |8      |m11      |t22    |
|1     |9      |m10      |t22    |
|1     |10     |m10      |t22    |
|2     |1      |m10      |t22    |
|2     |2      |m11      |t22    |
|2     |3      |m10      |t22    |
|2     |4      |m10      |t22    |
|2     |5      |m10      |t22    |
|2     |10     |m11      |t22    |
|3     |5      |m11      |t22    |
|3     |6      |m10      |t22    |
|3     |8      |m10      |t22    |
|3     |9      |m11      |t22    |
|3     |10     |m10      |t22    |
+------+-------+---------+-------+
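The three steps of the approach above (collect the rule set per org, find the org/rule pairs to delete, anti-join them away) can be mirrored on plain Scala collections. This is only a sketch of the logic, not the Spark code itself; rows is a small hypothetical subset of the sample data reduced to (org_id, rule_id) pairs.

```scala
// Hypothetical subset of the sample data as (org_id, rule_id) pairs
val rows = Seq(
  (2, 1), (2, 2), (2, 3), (2, 4), (2, 9), (2, 10),
  (3, 4), (3, 8), (3, 9), (3, 10)
)
val rules = Map(4 -> Set(1, 2, 3), 7 -> Set(1, 4, 5), 9 -> Set(8, 10))

// Analogue of groupBy("org_id") + collect_set("rule_id")
val ruleSets: Map[Int, Set[Int]] =
  rows.groupBy(_._1).map { case (org, rs) => org -> rs.map(_._2).toSet }

// Analogue of toDeleteDF: pairs whose rule has sub-rules
// that are not all present in the same org
val toDelete: Set[(Int, Int)] = rows.filter { case (org, rule) =>
  rules.get(rule).exists(sub => !sub.subsetOf(ruleSets(org)))
}.toSet

// Analogue of the left anti-join: keep only rows with no match in toDelete
val result = rows.filterNot(toDelete.contains).sorted
```

Here rule_id 9 is removed from org 2 (8 is missing) and rule_id 4 is removed from org 3 (1, 2, and 3 are missing), while rule_id 9 survives in org 3.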
This problem can be solved with a SQL window function.
Let's register your original data and the properties file as the temporary views data and rule_filters, respectively:
Seq(
  (1, 1, "m10", "t22"),
  (1, 2, "m10", "t22"),
  (1, 3, "m11", "t22"),
  (1, 4, "m11", "t22"),
  (1, 5, "m10", "t22"),
  (1, 6, "m10", "t22"),
  (1, 7, "m10", "t22"),
  (1, 8, "m11", "t22"),
  (1, 9, "m10", "t22"),
  (1, 10, "m10", "t22"),
  (2, 1, "m10", "t22"),
  (2, 2, "m11", "t22"),
  (2, 3, "m10", "t22"),
  (2, 4, "m10", "t22"),
  (2, 5, "m10", "t22"),
  (2, 9, "m10", "t22"),
  (2, 10, "m11", "t22"),
  (3, 4, "m10", "t22"),
  (3, 5, "m11", "t22"),
  (3, 6, "m10", "t22"),
  (3, 7, "m10", "t22"),
  (3, 8, "m10", "t22"),
  (3, 9, "m11", "t22"),
  (3, 10, "m10", "t22")
).toDF(
  "org_id",
  "rule_id",
  "period_id",
  "base_id"
).createOrReplaceTempView("data")
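Seq(
  "4=1,2,3",
  "7=1,4,5",
  "9=8,10"
).map { line =>
  val Array(key, values) = line.split("=")
  // Parse the key as Int so it has the same type as rule_id in the join;
  // sort the values so they can be compared with the sorted rules of each org
  (key.toInt, values.split(",").map(_.toInt).sorted)
}.toDF(
  "key",
  "rules"
).createOrReplaceTempView("rule_filters")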
The following SQL query then solves the problem:
SELECT
  org_id,
  rule_id,
  period_id,
  base_id
FROM
  (
    SELECT
      *,
      array_sort(
        collect_set(rule_id) OVER (
          PARTITION BY org_id ROWS BETWEEN UNBOUNDED PRECEDING
          AND UNBOUNDED FOLLOWING
        )
      ) AS rules_in_org
    FROM
      data
      LEFT JOIN rule_filters ON rule_id = key
  )
WHERE
  rules IS NULL
  OR array_intersect(rules_in_org, rules) = rules
ORDER BY
  org_id,
  rule_id
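The condition array_intersect(rules_in_org, rules) = rules acts as a subset test because both arrays are sorted before they are compared. The sketch below illustrates the idea on plain Scala sequences. It uses Scala's Seq.intersect as a stand-in for Spark's array_intersect, on the assumption that the intersection comes back in the order of the first array; containsAll is a hypothetical name.

```scala
// Stand-in for the SQL condition: sort both sides, intersect, and compare
// with the required ids. Seq.intersect keeps the order of the left operand.
def containsAll(rulesInOrg: Seq[Int], required: Seq[Int]): Boolean =
  rulesInOrg.sorted.intersect(required.sorted) == required.sorted

// rule_ids of org_id 2 in the sample data, deliberately unsorted
val org2 = Seq(10, 1, 2, 3, 4, 5, 9)
val rule4Ok = containsAll(org2, Seq(1, 2, 3)) // every required id is present
val rule9Ok = containsAll(org2, Seq(8, 10))   // 8 is missing from org 2
```

If either side were left unsorted, the equality could fail even when every required rule_id is present, which is why the query applies array_sort and the view stores the rules sorted.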
If you prefer, you can also implement it with the DataFrame API:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{array_intersect, array_sort, collect_set}
// spark is the SparkSession provided by spark-shell
spark.table("data")
  .join(spark.table("rule_filters"), $"data.rule_id" === $"rule_filters.key", "left")
  .select(
    $"*",
    array_sort(
      collect_set($"rule_id").over(
        Window
          .partitionBy($"org_id")
          .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
      )
    ) as "rules_within_org"
  )
  .filter($"rules".isNull || array_intersect($"rules_within_org", $"rules") === $"rules")
  .drop("key", "rules", "rules_within_org")
  .orderBy($"org_id", $"rule_id")
  .show(Int.MaxValue)
+------+-------+---------+-------+
|org_id|rule_id|period_id|base_id|
+------+-------+---------+-------+
|     1|      1|      m10|    t22|
|     1|      2|      m10|    t22|
|     1|      3|      m11|    t22|
|     1|      4|      m11|    t22|
|     1|      5|      m10|    t22|
|     1|      6|      m10|    t22|
|     1|      7|      m10|    t22|
|     1|      8|      m11|    t22|
|     1|      9|      m10|    t22|
|     1|     10|      m10|    t22|
|     2|      1|      m10|    t22|
|     2|      2|      m11|    t22|
|     2|      3|      m10|    t22|
|     2|      4|      m10|    t22|
|     2|      5|      m10|    t22|
|     2|     10|      m11|    t22|
|     3|      5|      m11|    t22|
|     3|      6|      m10|    t22|
|     3|      8|      m10|    t22|
|     3|      9|      m11|    t22|
|     3|     10|      m10|    t22|
+------+-------+---------+-------+