Spark Scala filter on group of result

I am trying to filter a DataFrame based on a group of results.

Sample DataFrame code -

scala> val df = sc.parallelize(Seq(
      (1, 1, "m10", "t22"),
      (1, 2, "m10", "t22"),
      (1, 3, "m11", "t22"),
      (1, 4, "m11", "t22"),
      (1, 5, "m10", "t22"),
      (1, 6, "m10", "t22"),
      (1, 7, "m10", "t22"),
      (1, 8, "m11", "t22"),
      (1, 9, "m10", "t22"),
      (1, 10, "m10", "t22"),
      (2, 1, "m10", "t22"),
      (2, 2, "m11", "t22"),
      (2, 3, "m10", "t22"),
      (2, 4, "m10", "t22"),
      (2, 5, "m10", "t22"),
      (2, 9, "m10", "t22"),
      (2, 10, "m11", "t22"),
      (3, 4, "m10", "t22"),
      (3, 5, "m11", "t22"),
      (3, 6, "m10", "t22"),
      (3, 7, "m10", "t22"),
      (3, 8, "m10", "t22"),
      (3, 9, "m11", "t22"),
      (3, 10, "m10", "t22")
       )
       ).toDF("org_id", "rule_id", "period_id", "base_id")

The data looks like this -

scala> df.show(50, false)
+------+-------+---------+-------+
|org_id|rule_id|period_id|base_id|
+------+-------+---------+-------+
|1     |1      |m10      |t22    |
|1     |2      |m10      |t22    |
|1     |3      |m11      |t22    |
|1     |4      |m11      |t22    |
|1     |5      |m10      |t22    |
|1     |6      |m10      |t22    |
|1     |7      |m10      |t22    |
|1     |8      |m11      |t22    |
|1     |9      |m10      |t22    |
|1     |10     |m10      |t22    |
|2     |1      |m10      |t22    |
|2     |2      |m11      |t22    |
|2     |3      |m10      |t22    |
|2     |4      |m10      |t22    |
|2     |5      |m10      |t22    |
|2     |9      |m10      |t22    |
|2     |10     |m11      |t22    |
|3     |4      |m10      |t22    |
|3     |5      |m11      |t22    |
|3     |6      |m10      |t22    |
|3     |7      |m10      |t22    |
|3     |8      |m10      |t22    |
|3     |9      |m11      |t22    |
|3     |10     |m10      |t22    |
+------+-------+---------+-------+

Based on a properties file, I need to filter the results within each org_id group. The properties file looks like -

    4=1,2,3
    7=1,4,5
    9=8,10
.....................
.....................

In the properties file, all keys and values are rule_id values.
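For illustration, here is a minimal sketch of how such a file could be read into a Scala Map with java.util.Properties. The helper name loadRules is hypothetical, and the inline text only stands in for the real file, whose path is not given here.

```scala
import java.io.StringReader
import java.util.Properties

// Hypothetical helper: turn "4=1,2,3"-style entries into Map(4 -> Set(1, 2, 3)).
def loadRules(props: Properties): Map[Int, Set[Int]] =
  props.stringPropertyNames().toArray(new Array[String](0)).map { key =>
    key.trim.toInt -> props.getProperty(key).split(",").map(_.trim.toInt).toSet
  }.toMap

// Assumption: inline text replaces the real file; in practice pass a FileReader to load().
val props = new Properties()
props.load(new StringReader("4=1,2,3\n7=1,4,5\n9=8,10\n"))
val rules: Map[Int, Set[Int]] = loadRules(props)
```

Properties does not preserve entry order, which should not matter here because each key is looked up independently.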

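Rows with rule_id 4 should be kept only if their org_id group also contains rule_id 1, 2 and 3. Otherwise I need to drop the rows with rule_id 4. The same applies to every other rule_id listed as a key in the properties file.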
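To make the rule concrete, here is a sketch of the same filter on plain Scala collections rather than Spark. Record and filterByRules are hypothetical names used only for this illustration, and the prerequisite check is done against the rule_ids originally present in each org_id group.

```scala
// Hypothetical record type mirroring the columns of the sample DataFrame.
final case class Record(orgId: Int, ruleId: Int, periodId: String, baseId: String)

// Keep a row unless its rule_id has prerequisites that are not all present
// among the rule_ids of the same org_id group.
def filterByRules(rows: Seq[Record], rules: Map[Int, Set[Int]]): Seq[Record] = {
  val rulesPerOrg: Map[Int, Set[Int]] =
    rows.groupBy(_.orgId).map { case (org, rs) => org -> rs.map(_.ruleId).toSet }
  rows.filter { r =>
    // rule_ids without an entry in `rules` are always kept (forall on None is true)
    rules.get(r.ruleId).forall(_.subsetOf(rulesPerOrg(r.orgId)))
  }
}

val sampleRules = Map(4 -> Set(1, 2, 3), 7 -> Set(1, 4, 5), 9 -> Set(8, 10))
// The rule_ids of org_id 3 from the sample data; period_id and base_id are irrelevant here.
val org3 = Seq(4, 5, 6, 7, 8, 9, 10).map(id => Record(3, id, "m10", "t22"))
val keptRuleIds = filterByRules(org3, sampleRules).map(_.ruleId)
```

For org_id 3 this drops rule_id 4 and 7 and keeps 9, because 8 and 10 are both present in that group.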

Expected result -

    +------+-------+---------+-------+
    |org_id|rule_id|period_id|base_id|
    +------+-------+---------+-------+
    |1     |1      |m10      |t22    |
    |1     |2      |m10      |t22    |
    |1     |3      |m11      |t22    |
    |1     |4      |m11      |t22    |
    |1     |5      |m10      |t22    |
    |1     |6      |m10      |t22    |
    |1     |7      |m10      |t22    |
    |1     |8      |m11      |t22    |
    |1     |9      |m10      |t22    |
    |1     |10     |m10      |t22    |
    |2     |1      |m10      |t22    |
    |2     |2      |m11      |t22    |
    |2     |3      |m10      |t22    |
    |2     |4      |m10      |t22    |
    |2     |5      |m10      |t22    |
    |2     |10     |m11      |t22    |
    |3     |5      |m11      |t22    |
    |3     |6      |m10      |t22    |
    |3     |8      |m10      |t22    |
    |3     |9      |m11      |t22    |
    |3     |10     |m10      |t22    |
    +------+-------+---------+-------+

I am stuck on this and don't know how to proceed. Any suggestions would be appreciated.

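This approach uses multiple joins and aggregations, so hopefully the data isn't too large.

Basically, create records holding the rule sets. Then use joins to associate each original record with the sub-rules that must be present for that org/rule combination, together with the rules actually exhibited in that org. This produces orgsContainingRulesDF. With this DF, you can filter out the rules for which not all "sub-rules" are exhibited.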

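import org.apache.spark.sql.functions.{col, collect_set, udf}
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

// Assume rule/sub-rule info can be read as either a Map or List of Tuple
val rules = Map(4 -> Set(1, 2, 3), 7 -> Set(1, 4, 5), 9 -> Set(8, 10))
val rulesDF = rules.toList.toDF("rule", "sub_rules")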

// For each org_id, get a set of rules which appear under it
val ruleSetsDF = df.groupBy(col("org_id")).agg(collect_set(col("rule_id")) as "rules")

// For each rule with sub-rules, match with orgs containing that rule
// Also get the full list of rules pertaining to that org
// (inner joins, so a rule that never occurs in df cannot yield a row with null arrays)
val orgsContainingRulesDF = rulesDF
  .join(df, $"rule" === $"rule_id", "inner")
  .join(ruleSetsDF, Seq("org_id"), "inner")

// Create a UDF for determining if all items in first seq are in second seq
// (rule ids are integers, so both arrays are typed as Seq[Int])
val subsetOf = udf((array1: Seq[Int], array2: Seq[Int]) => {
  array1.toSet.subsetOf(array2.toSet)
})

// Create DF with items to delete
// i.e. org-and-rule-id-pairs where not all sub-rules appear in exhibited rules
val toDeleteDF = orgsContainingRulesDF.filter(!subsetOf($"sub_rules", $"rules"))

// Use a left anti-join (inverse of left join) to only preserve records
// with no corresponding toDeleteDF record
val resultDF = df.join(toDeleteDF, Seq("org_id", "rule_id"), "left_anti").orderBy($"org_id", $"rule_id")

The result is as expected:

resultDF.show(25,false)
+------+-------+---------+-------+
|org_id|rule_id|period_id|base_id|
+------+-------+---------+-------+
|1     |1      |m10      |t22    |
|1     |2      |m10      |t22    |
|1     |3      |m11      |t22    |
|1     |4      |m11      |t22    |
|1     |5      |m10      |t22    |
|1     |6      |m10      |t22    |
|1     |7      |m10      |t22    |
|1     |8      |m11      |t22    |
|1     |9      |m10      |t22    |
|1     |10     |m10      |t22    |
|2     |1      |m10      |t22    |
|2     |2      |m11      |t22    |
|2     |3      |m10      |t22    |
|2     |4      |m10      |t22    |
|2     |5      |m10      |t22    |
|2     |10     |m11      |t22    |
|3     |5      |m11      |t22    |
|3     |6      |m10      |t22    |
|3     |8      |m10      |t22    |
|3     |9      |m11      |t22    |
|3     |10     |m10      |t22    |
+------+-------+---------+-------+

This problem can be solved using a SQL window function.

Let's register your original data and the properties file as the temporary views data and rule_filters, respectively:

Seq(
  (1, 1, "m10", "t22"),
  (1, 2, "m10", "t22"),
  (1, 3, "m11", "t22"),
  (1, 4, "m11", "t22"),
  (1, 5, "m10", "t22"),
  (1, 6, "m10", "t22"),
  (1, 7, "m10", "t22"),
  (1, 8, "m11", "t22"),
  (1, 9, "m10", "t22"),
  (1, 10, "m10", "t22"),
  (2, 1, "m10", "t22"),
  (2, 2, "m11", "t22"),
  (2, 3, "m10", "t22"),
  (2, 4, "m10", "t22"),
  (2, 5, "m10", "t22"),
  (2, 9, "m10", "t22"),
  (2, 10, "m11", "t22"),
  (3, 4, "m10", "t22"),
  (3, 5, "m11", "t22"),
  (3, 6, "m10", "t22"),
  (3, 7, "m10", "t22"),
  (3, 8, "m10", "t22"),
  (3, 9, "m11", "t22"),
  (3, 10, "m10", "t22")
).toDF(
  "org_id",
  "rule_id",
  "period_id",
  "base_id"
).createOrReplaceTempView("data")

Seq(
  "4=1,2,3",
  "7=1,4,5",
  "9=8,10"
).map { line =>
  // Split "4=1,2,3" into the rule id and its sorted array of required rule ids
  val Array(key, values) = line.split("=")
  (key.toInt, values.split(",").map(_.toInt).sorted)
}.toDF(
  "key",
  "rules"
).createOrReplaceTempView("rule_filters")

Then the following SQL query solves the problem:

SELECT
  org_id,
  rule_id,
  period_id,
  base_id
FROM
  (
    SELECT
      *,
      array_sort(
        collect_set(rule_id) OVER (
          PARTITION BY org_id ROWS BETWEEN UNBOUNDED PRECEDING
          AND UNBOUNDED FOLLOWING
        )
      ) AS rules_in_org
    FROM
      data
      LEFT JOIN rule_filters ON rule_id = key
  )
WHERE
  rules IS NULL
  OR array_intersect(rules_in_org, rules) = rules
ORDER BY
  org_id,
  rule_id
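The equality test in the WHERE clause only works as a subset check because both arrays are sorted. The plain-Scala sketch below mimics it for org_id 2. Here arrayIntersect is a hypothetical stand-in, written under the assumption that Spark's array_intersect returns distinct elements in the order of its first argument.

```scala
// Stand-in for Spark's array_intersect.
// ASSUMPTION: distinct common elements, in the order of the first array.
def arrayIntersect(a: Seq[Int], b: Seq[Int]): Seq[Int] =
  a.distinct.filter(x => b.contains(x))

// Sorted rule_ids present in org_id 2 of the sample data.
val rulesInOrg2 = Seq(1, 2, 3, 4, 5, 9, 10)

// rule_id 4 requires 1,2,3: all present, so the intersection equals the requirement.
val keep4 = arrayIntersect(rulesInOrg2, Seq(1, 2, 3)) == Seq(1, 2, 3)

// rule_id 9 requires 8,10: 8 is missing, so the intersection is only Seq(10).
val keep9 = arrayIntersect(rulesInOrg2, Seq(8, 10)) == Seq(8, 10)

// An unsorted requirement fails the equality even though every element is present.
val keepUnsorted = arrayIntersect(rulesInOrg2, Seq(3, 1, 2)) == Seq(3, 1, 2)
```

This is why the rules column is sorted when rule_filters is built and the collected set is wrapped in array_sort.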

If you prefer, you can also implement it with the DataFrame API:

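import org.apache.spark.sql.expressions.Window

// assumes a SparkSession named `spark`, as in spark-shell
spark.table("data")
  .join(spark.table("rule_filters"), $"data.rule_id" === $"rule_filters.key", "left")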
  .select(
    $"*",
    array_sort(
      collect_set($"rule_id").over(
        Window
          .partitionBy($"org_id")
          .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
      )
    ) as "rules_within_org"
  )
  .filter($"rules".isNull || array_intersect($"rules_within_org", $"rules") === $"rules")
  .drop("key", "rules", "rules_within_org")
  .orderBy($"org_id", $"rule_id")
  .show(Int.MaxValue)
+------+-------+---------+-------+
|org_id|rule_id|period_id|base_id|
+------+-------+---------+-------+
|     1|      1|      m10|    t22|
|     1|      2|      m10|    t22|
|     1|      3|      m11|    t22|
|     1|      4|      m11|    t22|
|     1|      5|      m10|    t22|
|     1|      6|      m10|    t22|
|     1|      7|      m10|    t22|
|     1|      8|      m11|    t22|
|     1|      9|      m10|    t22|
|     1|     10|      m10|    t22|
|     2|      1|      m10|    t22|
|     2|      2|      m11|    t22|
|     2|      3|      m10|    t22|
|     2|      4|      m10|    t22|
|     2|      5|      m10|    t22|
|     2|     10|      m11|    t22|
|     3|      5|      m11|    t22|
|     3|      6|      m10|    t22|
|     3|      8|      m10|    t22|
|     3|      9|      m11|    t22|
|     3|     10|      m10|    t22|
+------+-------+---------+-------+
