So, I have a Kafka topic containing the following data, and I'm working on a proof-of-concept whether we can achieve what we're trying to do. I was previous trying to solve it within Kafka, but it seems that Kafka wasn't the right tool, so looking at Spark now:)
The data in its basic form looks like this:
+--+------------+-------+---------+
|id|serialNumber|source |company |
+--+------------+-------+---------+
|1 |123ABC |system1|Acme |
|2 |3285624 |system1|Ajax |
|3 |CDE567 |system1|Emca |
|4 |XX |system2|Ajax |
|5 |3285624 |system2|Ajax&Sons|
|6 |0147852 |system2|Ajax |
|7 |123ABC |system2|Acme |
|8 |CDE567 |system2|Xaja |
+--+------------+-------+---------+
The main grouping column is serialNumber and the result should be that id 1 and 7 should match as it's a full match on the company. Id 2 and 5 should match because the company in id 2 is a full partial match of the company in id 5. Id 3 and 8 should not match as the companies doesn't match.
I expect the end result to be something like this. Note that sources are not fixed to just one or two and in the future it will contain more sources.
+------+-----+------------+-----------------+---------------+
|uuid |id |serialNumber|source |company |
+------+-----+------------+-----------------+---------------+
|<uuid>|[1,7]|123ABC |[system1,system2]|[Acme] |
|<uuid>|[2,5]|3285624 |[system1,system2]|[Ajax,Ajax&Sons|
|<uuid>|[3] |CDE567 |[system1] |[Emca] |
|<uuid>|[4] |XX |[system2] |[Ajax] |
|<uuid>|[6] |0147852 |[system2] |[Ajax] |
|<uuid>|[8] |CDE567 |[system2] |[Xaja] |
+------+-----+------------+-----------------+---------------+
I was looking at groupByKey().mapGroups() but having problems finding examples. Can mapGroups() return more than one row?
You can simply groupBy based on serialNumber column and collect_list of all other columns.
code:
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions._
val ds = Seq((1,"123ABC", "system1", "Acme"),
(7,"123ABC", "system2", "Acme"))
.toDF("id", "serialNumber", "source", "company")
ds.groupBy("serialNumber")
.agg(
collect_list("id").alias("id"),
collect_list("source").alias("source"),
collect_list("company").alias("company")
)
.show(false)
Output:
+------------+------+------------------+------------+
|serialNumber|id |source |company |
+------------+------+------------------+------------+
|123ABC |[1, 7]|[system1, system2]|[Acme, Acme]|
+------------+------+------------------+------------+
If you dont want duplicate values, use collect_set
ds.groupBy("serialNumber")
.agg(
collect_list("id").alias("id"),
collect_list("source").alias("source"),
collect_set("company").alias("company")
)
.show(false)
Output with collect_set on company column:
+------------+------+------------------+-------+
|serialNumber|id |source |company|
+------------+------+------------------+-------+
|123ABC |[1, 7]|[system1, system2]|[Acme] |
+------------+------+------------------+-------+
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.