
Select one record from multiple, with a preference hierarchy, from a Spark dataframe or SQL

I have a dataframe of products where the same product appears under different categories. I want to select only one record per product, based on a hierarchy, like:

Product ID  Category  Status
1           Cat1      status1
1           Cat2      status1
1           Cat3      status1
2           Cat1      status1
2           Cat2      status1
3           Cat2      status1

If a record with Cat1 exists, pick it; otherwise pick Cat2. If Cat2 doesn't exist, pick Cat3. But pick only one record out of the multiple.

Use row_number() :

select t.*
from (select t.*, row_number () over (partition by productid order by category) as seq
      from table t
     ) t
where seq = 1;

If the categories are named differently, use a case expression in the order by:

order by (case when category = 'category_x' then 1 
               when category = 'category_gg' then 2 
               else 3 
         end)
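Putting the two pieces together, here is a self-contained sketch of the row_number() + CASE pattern, run against an in-memory SQLite database for illustration (the table name, sample data, and category names are assumptions, not a real schema):

```python
# Illustrative only: the ROW_NUMBER()/CASE preference pattern in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INT, category TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "Cat1", "status1"), (1, "Cat2", "status1"), (1, "Cat3", "status1"),
     (2, "Cat1", "status1"), (2, "Cat2", "status1"), (3, "Cat2", "status1")],
)

rows = conn.execute("""
    SELECT product_id, category, status FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (
                   PARTITION BY product_id
                   ORDER BY CASE WHEN category = 'Cat1' THEN 1
                                 WHEN category = 'Cat2' THEN 2
                                 ELSE 3
                            END
               ) AS seq
        FROM products t
    )
    WHERE seq = 1
    ORDER BY product_id
""").fetchall()
print(rows)  # [(1, 'Cat1', 'status1'), (2, 'Cat1', 'status1'), (3, 'Cat2', 'status1')]
```

The CASE expression maps each category to its preference rank, and row_number() keeps exactly one row per product.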

Here is the same approach as @Yogesh Sharma's answer, using the DataFrame functions.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._ // for the $"..." column syntax

val w = Window.partitionBy("Product ID").orderBy("Category")
df.withColumn("row", row_number.over(w))
  .filter($"row" === 1)
  .orderBy("Product ID")
  .drop("row")
  .show

or use groupBy and a self-join, such as

df.join(df.groupBy("Product ID").agg(min("Category").as("Category")), Seq("Product ID", "Category")).show

(Note: first after a groupBy is not guaranteed to be deterministic in Spark, so min("Category") is used here; it reliably picks the alphabetically smallest category.)

Either of those will give you the result:

+----------+--------+-------+
|Product ID|Category| Status|
+----------+--------+-------+
|         1|    Cat1|status1|
|         2|    Cat1|status1|
|         3|    Cat2|status1|
+----------+--------+-------+

Considering that your categories are Cat1, Cat2, ... Cat10, ... Cat100, ...

A plain string sort would put Cat10 before Cat2, so you must extract the number from the category and order by it:

SELECT * FROM
    (
        SELECT
            T.*,
            ROW_NUMBER() OVER(
                PARTITION BY PRODUCTID
                ORDER BY TO_NUMBER(REGEXP_SUBSTR(CATEGORY, '[0-9]+'))
            ) AS RN
        FROM YOUR_TABLE T
    )
WHERE RN = 1;
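Note that TO_NUMBER and REGEXP_SUBSTR are Oracle functions; in Spark SQL the equivalent would be something like CAST(regexp_extract(category, '[0-9]+', 0) AS INT). A small Python sketch (the sample category names are made up) shows why the numeric extraction matters:

```python
# Illustrative: lexicographic vs numeric ordering of Cat1..Cat100.
import re

categories = ["Cat10", "Cat2", "Cat1", "Cat100"]

print(sorted(categories))
# ['Cat1', 'Cat10', 'Cat100', 'Cat2']  -- string sort puts Cat2 last

def cat_number(category: str) -> int:
    # Equivalent of TO_NUMBER(REGEXP_SUBSTR(category, '[0-9]+'))
    return int(re.search(r"[0-9]+", category).group())

print(sorted(categories, key=cat_number))
# ['Cat1', 'Cat2', 'Cat10', 'Cat100']  -- numeric sort gives the intended order
```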

Cheers!!

I developed the solution below with the help of the answers by Yogesh and Lamansa.

// when and min come from org.apache.spark.sql.functions._
val df1 = df.withColumn("pref", when($"category" === "Cat1", "A")
  .when($"category" === "Cat2", "B")
  .when($"category" === "Cat3", "C"))

df1.join(df1.groupBy("product_id").agg(min("pref").as("pref")),
    Seq("product_id", "pref"))
  .drop("pref")
  .show

when is used here because ordering by category alone cannot guarantee that your preferences come out in the desired order; e.g. Cat2 could be the first preference.
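The same idea can be sketched in plain Python for clarity (the data and the preference map are illustrative): map each category to an explicit rank, then keep, per product, the row with the smallest rank.

```python
# Illustrative: explicit preference ranks decouple selection from
# the alphabetical order of the category names.
preference = {"Cat1": "A", "Cat2": "B", "Cat3": "C"}

rows = [(1, "Cat2", "status1"), (1, "Cat1", "status1"), (1, "Cat3", "status1"),
        (2, "Cat2", "status1"), (2, "Cat1", "status1"), (3, "Cat2", "status1")]

best = {}
for pid, cat, status in rows:
    # Keep the row whose category has the smallest preference rank.
    if pid not in best or preference[cat] < preference[best[pid][1]]:
        best[pid] = (pid, cat, status)

print(sorted(best.values()))
# [(1, 'Cat1', 'status1'), (2, 'Cat1', 'status1'), (3, 'Cat2', 'status1')]
```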

Output :
+----------+--------+-------+
|Product ID|Category| Status|
+----------+--------+-------+
|         1|    Cat1|status1|
|         2|    Cat1|status1|
|         3|    Cat2|status1|
+----------+--------+-------+
