Select one out of multiple records with a preference hierarchy from a Spark DataFrame or SQL

Question

I have a DataFrame of products in which the same product can appear under several categories. I want to select only one record per product, based on a category hierarchy, for example:

Product ID  Category  Status
1           Cat1      status1
1           Cat2      status1
1           Cat3      status1
2           Cat1      status1
2           Cat2      status1
3           Cat2      status1

If a record with Cat1 exists, pick it; otherwise pick Cat2. If Cat2 doesn't exist, pick Cat3. In every case, pick only one record per product.

Answer 1

Use row_number():

select t.*
from (select t.*, row_number () over (partition by productid order by category) as seq
      from table t
     ) t
where seq = 1;

If the category names do not sort in the preferred order, use a case expression in the order by:

order by (case when category = 'category_x' then 1 
               when category = 'category_gg' then 2 
               else 3 
         end)
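
To show where the case expression fits, here is a minimal sketch of the complete query run through Spark SQL from Scala. The local SparkSession, the view name products, and the underscored column names are assumptions made for the example; the category names and sample rows are the ones from the question.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session, used only for this example.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("pick-one-per-product")
  .getOrCreate()

// Sample rows from the question (column names are illustrative).
val df = spark.createDataFrame(Seq(
  (1, "Cat1", "status1"), (1, "Cat2", "status1"), (1, "Cat3", "status1"),
  (2, "Cat1", "status1"), (2, "Cat2", "status1"),
  (3, "Cat2", "status1")
)).toDF("product_id", "category", "status")
df.createOrReplaceTempView("products")

// Rank the rows of each product by the explicit preference, then keep rank 1.
val picked = spark.sql("""
  select product_id, category, status
  from (select p.*,
               row_number() over (
                 partition by product_id
                 order by case when category = 'Cat1' then 1
                               when category = 'Cat2' then 2
                               else 3
                          end) as seq
        from products p) ranked
  where seq = 1
""")

picked.orderBy("product_id").show()
```

Because the ranking comes from the case expression rather than from the category string, changing the preference (for example, making Cat2 the first choice) only means changing the numbers in the when branches.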

Answer 2

Here is the same approach as @Yogesh Sharma's answer, using the DataFrame functions.

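import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Number the rows of each product in category order; row 1 is the preferred one.
val w = Window.partitionBy("Product ID").orderBy("Category")
df.withColumn("row", row_number().over(w))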
  .filter($"row" === 1)
  .orderBy("Product ID")
  .drop("row")
  .show

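Alternatively, use groupBy with a self-join. min is used in the aggregate here because first depends on row order, which Spark does not guarantee after grouping:

import org.apache.spark.sql.functions.min

df.join(df.groupBy("Product ID").agg(min("Category").as("Category")), Seq("Product ID", "Category")).show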

Both give the following result:

+----------+--------+-------+
|Product ID|Category| Status|
+----------+--------+-------+
|         1|    Cat1|status1|
|         2|    Cat1|status1|
|         3|    Cat2|status1|
+----------+--------+-------+

Answer 3

Assuming your categories are named cat1, cat2, ... cat10, ... cat100, ..., a plain string sort would put cat10 before cat2.

You must extract the number from each category and order by that number instead.

SELECT * FROM
    (
        SELECT
            T.*,
            ROW_NUMBER() OVER(
                PARTITION BY PRODUCTID
                ORDER BY TO_NUMBER(REGEXP_SUBSTR(CATEGORY, '[0-9]+'))
            ) AS RN
        FROM YOUR_TABLE T
    )
WHERE RN = 1;
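
The query above relies on TO_NUMBER and REGEXP_SUBSTR as found in Oracle-style SQL, which may not be available in the same form in Spark, depending on the version. As a rough DataFrame equivalent, the sketch below extracts the digits with regexp_extract and casts them to an integer before ordering. The local SparkSession, the sample rows (including the invented Cat10), and the column names are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, regexp_extract, row_number}

// Assumption: a throwaway local session, used only for this example.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("numeric-category-order")
  .getOrCreate()

// Hypothetical rows: product 1 has Cat10 and Cat2, which sort wrongly as strings.
val df = spark.createDataFrame(Seq(
  (1, "Cat10", "status1"), (1, "Cat2", "status1"),
  (2, "Cat3", "status1")
)).toDF("product_id", "category", "status")

// Take the first run of digits in the category and compare it as a number.
val categoryNumber = regexp_extract(col("category"), "[0-9]+", 0).cast("int")

// Number the rows of each product by that numeric value and keep the first.
val w = Window.partitionBy("product_id").orderBy(categoryNumber)
val picked = df.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")

picked.orderBy("product_id").show()
```

With these rows, product 1 keeps Cat2 rather than Cat10, which is the case a string sort gets wrong.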

Cheers!!

Answer 4

I developed the solution below with the help of the answers by Yogesh and Lamansa:

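import org.apache.spark.sql.functions.{min, when}

// Map each category to a sortable preference key ("A" is the most preferred).
val df1 = df.withColumn("row_num",
  when($"category" === "Cat1", "A")
    .when($"category" === "Cat2", "B")
    .when($"category" === "Cat3", "C"))

// Keep, per product, the row whose preference key is the smallest.
df1.join(df1.groupBy("product_id").agg(min("row_num").as("row_num")),
    Seq("product_id", "row_num"))
  .drop("row_num").show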

when is used because ordering by category alone cannot guarantee that your preferences come out in the desired order. For example, Cat2 could be the first preference.

Output:
+----------+--------+-------+
|Product ID|Category| Status|
+----------+--------+-------+
|         1|    Cat1|status1|
|         2|    Cat1|status1|
|         3|    Cat2|status1|
+----------+--------+-------+
