简体   繁体   中英

spark scala: add a new column that has a list as a value; the list should verify a specific condition

I have the query bellow:

finvResultOutput1
      .withColumn("offersAtSite",when(col("activeIOsAtSite")==="Y",
        finvResultOutput1.select(col("OfferSpecification_displayLabel"),col("Site_siteId").alias("siteid"))
          .where(col("Site_siteId")===col("siteid"))
          .drop("siteid")
          .collect()
      ))

I want to add a new column to my data frame called offersAtSite , this column should be added when another column called activeIOsAtSite equals "Y" .

The value of this new column should be a list without duplicates of the elements of column OfferSpecification_displayLabel where the Site_siteId equals the current Site_siteId of the row we are iterating over.

Could anyone give me a better approach to do that, because the query that I made is taking too much time and I don't know if it is working

Bellow is an example of what I want to achieve:

在此处输入图像描述

First, you can groupBy then collect_set (set does not contain duplicates) on your main table:

val grouped = df.groupBy("Site_siteId").agg(collect_set("OfferSpecifications_displayLabel").as("offerAtSite"))

We get:

+-----------+------------------------+
|Site_siteId|offerAtSite             |
+-----------+------------------------+
|site_id_3  |[site3_DL_1, site3_DL_2]|
|site_id_4  |[site4_DL_1]            |
|site_id_2  |[site2_DL_1]            |
|site_id_1  |[site1_DL_1, site1_DL_2]|
+-----------+------------------------+

Then, we join our df table with grouped and overwrite offerAtSite to only have values for Y value:

df.join(grouped, Seq("Site_siteId"), "left")
  .withColumn("offerAtSite", when(col("activeIOsAtSite").equalTo("Y"), col("offerAtSite")))

Final result:

+-----------+---------------+--------------------------------+------------------------+
|Site_siteId|activeIOsAtSite|OfferSpecifications_displayLabel|offerAtSite             |
+-----------+---------------+--------------------------------+------------------------+
|site_id_3  |Y              |site3_DL_1                      |[site3_DL_1, site3_DL_2]|
|site_id_3  |Y              |site3_DL_2                      |[site3_DL_1, site3_DL_2]|
|site_id_4  |N              |site4_DL_1                      |null                    |
|site_id_2  |N              |site2_DL_1                      |null                    |
|site_id_1  |Y              |site1_DL_1                      |[site1_DL_1, site1_DL_2]|
|site_id_1  |Y              |site1_DL_2                      |[site1_DL_1, site1_DL_2]|
+-----------+---------------+--------------------------------+------------------------+

This should work better, good luck!

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM