
Spark, return multiple rows on group?

So, I have a Kafka topic containing the following data, and I'm working on a proof of concept to see whether we can achieve what we're trying to do. I was previously trying to solve it within Kafka, but it seems Kafka wasn't the right tool, so I'm looking at Spark now :)

The data in its basic form looks like this:

+--+------------+-------+---------+
|id|serialNumber|source |company  |
+--+------------+-------+---------+
|1 |123ABC      |system1|Acme     |
|2 |3285624     |system1|Ajax     |
|3 |CDE567      |system1|Emca     |
|4 |XX          |system2|Ajax     |
|5 |3285624     |system2|Ajax&Sons|
|6 |0147852     |system2|Ajax     |
|7 |123ABC      |system2|Acme     |
|8 |CDE567      |system2|Xaja     |
+--+------------+-------+---------+

The main grouping column is serialNumber, and the result should be that ids 1 and 7 match because the company is an exact match. Ids 2 and 5 should match because the company in id 2 is a partial match of the company in id 5. Ids 3 and 8 should not match because the companies don't match.

I expect the end result to be something like this. Note that the sources are not fixed to just one or two; in the future there will be more sources.

+------+-----+------------+-----------------+----------------+
|uuid  |id   |serialNumber|source           |company         |
+------+-----+------------+-----------------+----------------+
|<uuid>|[1,7]|123ABC      |[system1,system2]|[Acme]          |
|<uuid>|[2,5]|3285624     |[system1,system2]|[Ajax,Ajax&Sons]|
|<uuid>|[3]  |CDE567      |[system1]        |[Emca]          |
|<uuid>|[4]  |XX          |[system2]        |[Ajax]          |
|<uuid>|[6]  |0147852     |[system2]        |[Ajax]          |
|<uuid>|[8]  |CDE567      |[system2]        |[Xaja]          |
+------+-----+------------+-----------------+----------------+

I was looking at groupByKey().mapGroups() but am having trouble finding examples. Can mapGroups() return more than one row?
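(For reference: mapGroups() produces exactly one output row per group, while flatMapGroups() on the same KeyValueGroupedDataset can produce any number of rows per group. Below is a minimal sketch of flatMapGroups under the simplified schema above; the Record case class, app name, and session setup are illustrative assumptions, not the asker's actual code.)

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("flatMapGroups-demo").master("local[*]").getOrCreate()
    import spark.implicits._ // encoders for case classes and tuples

    // Illustrative case class mirroring the input columns.
    case class Record(id: Int, serialNumber: String, source: String, company: String)

    val records = Seq(
      Record(1, "123ABC", "system1", "Acme"),
      Record(7, "123ABC", "system2", "Acme")).toDS()

    records
      .groupByKey(_.serialNumber)                              // group rows by serialNumber
      .flatMapGroups { (serial: String, rows: Iterator[Record]) =>
        rows.map(r => (serial, r.id, r.company))               // may emit 0..n rows per group
      }
      .toDF("serialNumber", "id", "company")
      .show(false)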

You can simply groupBy on the serialNumber column and collect_list all other columns.

code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("group-demo").master("local[*]").getOrCreate()
import spark.implicits._ // needed for .toDF on a local Seq

val ds = Seq(
    (1, "123ABC", "system1", "Acme"),
    (7, "123ABC", "system2", "Acme"))
  .toDF("id", "serialNumber", "source", "company")

ds.groupBy("serialNumber")
  .agg(
    collect_list("id").alias("id"),
    collect_list("source").alias("source"),
    collect_list("company").alias("company")
  )
  .show(false)

Output:

+------------+------+------------------+------------+
|serialNumber|id    |source            |company     |
+------------+------+------------------+------------+
|123ABC      |[1, 7]|[system1, system2]|[Acme, Acme]|
+------------+------+------------------+------------+

If you don't want duplicate values, use collect_set:

  ds.groupBy("serialNumber")
    .agg(
      collect_list("id").alias("id"),
      collect_list("source").alias("source"),
      collect_set("company").alias("company")
    )
    .show(false)

Output with collect_set on the company column:

+------------+------+------------------+-------+
|serialNumber|id    |source            |company|
+------------+------+------------------+-------+
|123ABC      |[1, 7]|[system1, system2]|[Acme] |
+------------+------+------------------+-------+
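The expected result also has a uuid column, which the aggregation above does not produce. One possible way to add it, assuming a Spark version that ships the built-in uuid() SQL expression (otherwise a small UDF around java.util.UUID.randomUUID would do the same):

    ds.groupBy("serialNumber")
      .agg(
        collect_list("id").alias("id"),
        collect_list("source").alias("source"),
        collect_set("company").alias("company")
      )
      .withColumn("uuid", expr("uuid()")) // one random UUID per aggregated row
      .show(false)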
