Select one record out of several with a preference hierarchy, from a Spark DataFrame or SQL

Question

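I have a product DataFrame in which the same product appears under different categories. I want to select only one record per product, based on a hierarchy.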

Product ID.  Category.  Status
1.           Cat1.      status1
1.           Cat2.      status1
1.           Cat3.      status1
2.           Cat1.      status1
2.           Cat2.      status1
3.           Cat2.      status1

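If a record with Cat1 exists, select it; otherwise select Cat2. If Cat2 does not exist, select Cat3. Only one record may be selected out of the several for each product.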

Answer 1

Use row_number():

select t.*
from (select t.*, row_number () over (partition by productid order by category) as seq
      from table t
     ) t
where seq = 1;

If the category names do not sort in the preferred order by themselves, use a case expression in the order by:

order by (case when category = 'category_x' then 1 
               when category = 'category_gg' then 2 
               else 3 
         end)
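To show how the two fragments above fit together, here is a minimal sketch that runs the full query through Spark SQL. The view name `products`, the column names `productid`/`category`/`status`, and the local SparkSession are assumptions for illustration; the rows are the ones from the question.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("pick-one-per-product").getOrCreate()
import spark.implicits._

// Sample rows from the question; the column names are assumed.
val products = Seq(
  (1, "Cat1", "status1"), (1, "Cat2", "status1"), (1, "Cat3", "status1"),
  (2, "Cat1", "status1"), (2, "Cat2", "status1"),
  (3, "Cat2", "status1")
).toDF("productid", "category", "status")
products.createOrReplaceTempView("products")

// row_number() numbers each product's rows by the case-based preference;
// seq = 1 keeps the most preferred row per product.
val picked = spark.sql("""
  select productid, category, status
  from (select p.*,
               row_number() over (
                 partition by productid
                 order by case when category = 'Cat1' then 1
                               when category = 'Cat2' then 2
                               else 3
                          end
               ) as seq
        from products p) t
  where seq = 1
  order by productid
""")
picked.show()
```

Changing the preference order then only means changing the numbers in the case expression.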

Answer 2

This is the same answer as @Yogesh Sharma's, written with DataFrame functions.

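import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Number the rows of each product by category; row 1 is the preferred one.
// The $"..." syntax assumes spark.implicits._ is in scope (as in spark-shell).
val w = Window.partitionBy("Product ID").orderBy("Category")
df.withColumn("row", row_number().over(w))
  .filter($"row" === 1)
  .orderBy("Product ID")
  .drop("row")
  .show()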

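Or use groupBy and a self-join, for example:

// min (rather than first) keeps the pick deterministic: first has no guaranteed row order after groupBy.
df.join(df.groupBy("Product ID").agg(min("Category").as("Category")), Seq("Product ID", "Category")).show()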

Either approach gives the result:

+----------+--------+-------+
|Product ID|Category| Status|
+----------+--------+-------+
|         1|    Cat1|status1|
|         2|    Cat1|status1|
|         3|    Cat2|status1|
+----------+--------+-------+

Answer 3

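Assuming your categories are cat1., cat2., ... cat10., ... cat100., ...

String ordering would put cat10 before cat2, so you have to extract the number from the category and sort on that instead.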

SELECT * FROM
    (
        SELECT
            T.*,
            ROW_NUMBER() OVER(
                PARTITION BY PRODUCTID
                ORDER BY TO_NUMBER(REGEXP_SUBSTR(CATEGORY, '[0-9]+'))
            ) AS RN
        FROM YOUR_TABLE T
    )
WHERE RN = 1;
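For the DataFrame API, the same idea might look like the sketch below. It is not the query above translated literally: it swaps in `regexp_extract` plus a cast to int for the number extraction. The sample rows and column names are hypothetical, chosen so that string ordering and numeric ordering disagree.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, regexp_extract, row_number}

// Assumption: a local session; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("numeric-category-order").getOrCreate()
import spark.implicits._

// Hypothetical data: as strings, "Cat10" sorts before "Cat2" and "Cat100" before "Cat20".
val df = Seq(
  (1, "Cat10", "status1"), (1, "Cat2", "status1"),
  (2, "Cat100", "status1"), (2, "Cat20", "status1")
).toDF("productid", "category", "status")

// Extract the first run of digits and compare it as an integer.
val catNum = regexp_extract(col("category"), "([0-9]+)", 1).cast("int")
val w = Window.partitionBy("productid").orderBy(catNum)

// Keep the row with the smallest category number per product.
val picked = df.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
  .orderBy("productid")
picked.show()
```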

Cheers!!

Answer 4

I developed the following solution with the help of the answers from Yogesh and Lamansa:

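import org.apache.spark.sql.functions.{min, when}

// Map each category to a sortable preference key ("A" is the most preferred).
val df1 = df.withColumn("row_num",
  when($"category" === "Cat1", "A")
    .when($"category" === "Cat2", "B")
    .when($"category" === "Cat3", "C"))

// Keep, per product, the row with the smallest preference key.
df1.join(df1.groupBy("product_id").agg(min("row_num").as("row_num")),
    Seq("product_id", "row_num"))
  .drop("row_num").show()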

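when is used because ordering by the category name does not guarantee that your preferences come out in the required order. For example, Cat2 could be the first preference.

Output: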
+----------+--------+-------+
|Product ID|Category| Status|
+----------+--------+-------+
|         1|    Cat1|status1|
|         2|    Cat1|status1|
|         3|    Cat2|status1|
+----------+--------+-------+
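The remark that Cat2 could be the first preference can be sketched as follows. This is one possible variant, not the exact code of this answer: it combines the `when` mapping with the window approach from the earlier answers. The preference order (Cat2, then Cat1, then Cat3) and the extra product 4 are assumptions made purely for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, when}

// Assumption: a local session; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("preference-rank").getOrCreate()
import spark.implicits._

// Rows from the question plus a hypothetical product 4 that has no Cat2 row.
val df = Seq(
  (1, "Cat1", "status1"), (1, "Cat2", "status1"), (1, "Cat3", "status1"),
  (2, "Cat1", "status1"), (2, "Cat2", "status1"),
  (3, "Cat2", "status1"),
  (4, "Cat3", "status1"), (4, "Cat1", "status1")
).toDF("product_id", "category", "status")

// Assumed preference: Cat2 first, then Cat1, then Cat3; anything else sorts last.
val preference = when(col("category") === "Cat2", 1)
  .when(col("category") === "Cat1", 2)
  .when(col("category") === "Cat3", 3)
  .otherwise(Int.MaxValue)

val w = Window.partitionBy("product_id").orderBy(preference)

// Row 1 in each window is the most preferred category available for that product.
val picked = df.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
  .orderBy("product_id")
picked.show()
```

With this mapping, products 1 to 3 resolve to Cat2 and product 4 falls back to Cat1, which plain ordering by category name would not produce.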
