Spark: How to get a single object from an ArrayType of objects in Scala

I have a DataFrame with multiple columns, one of which is an ArrayType column called "requests" whose elements have two fields, "id" and "responses":

(requests,ArrayType(StructType(StructField(id,IntegerType,true),StructField(responses,ArrayType(IntegerType,false),true))))

From the array of "requests", I want to get the single request whose "id" matches a specific value and add it to a new column.

So far, I have added a boolean column indicating whether the id is present in the array:

dF
  .withColumn("idPresent", array_contains(col("requests.id"), 55))
  .show()

But I can't figure out how to get the one request object whose "id" I pass as a parameter. I expect only one such object in the array, but if there is more than one, the first will suffice. I'd like to add the matching object to a new column.

Consider using the higher-order function transform to null out the requests whose "id" doesn't match, then removing those null elements with array_except:

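import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

case class Request(id: Int, responses: Seq[Int])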

val df = Seq(
  Seq(Request(1, Seq(11)), Request(2, Seq(21, 22)), Request(3, Seq(31))),
  Seq(Request(4, Seq(41, 42, 43)), Request(5, Seq(51, 52)))
).toDF("requests")

df.
  withColumn("request_id2", array_except(
      expr("transform(requests, r -> case when r.id = 2 then r end)"),
      array(lit(null))
    )
  ).show(false)
// +-------------------------------------+---------------+
// |requests                             |request_id2    |
// +-------------------------------------+---------------+
// |[[1, [11]], [2, [21, 22]], [3, [31]]]|[[2, [21, 22]]]|
// |[[4, [41, 42, 43]], [5, [51, 52]]]   |[]             |
// +-------------------------------------+---------------+

Extending @Leo C's answer, you can also use filter to keep only the array elements with a matching id:

df.withColumn("request_id2", expr("filter(requests, r -> r.id = 2)")).show(false)
+-------------------------------------+---------------+
|requests                             |request_id2    |
+-------------------------------------+---------------+
|[[1, [11]], [2, [21, 22]], [3, [31]]]|[[2, [21, 22]]]|
|[[4, [41, 42, 43]], [5, [51, 52]]]   |[]             |
+-------------------------------------+---------------+

If you just want the first matching struct, append [0] to the expression:

df.withColumn("request_id2", expr("filter(requests, r -> r.id = 2)[0]")).show(false)
+-------------------------------------+-------------+
|requests                             |request_id2  |
+-------------------------------------+-------------+
|[[1, [11]], [2, [21, 22]], [3, [31]]]|[2, [21, 22]]|
|[[4, [41, 42, 43]], [5, [51, 52]]]   |null         |
+-------------------------------------+-------------+
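
Since the question asks for the "id" to be passed as a parameter, the same filter-then-index idea can be wrapped in a small function using the Scala Column API instead of a SQL string. This is only a sketch under a few assumptions: Spark 3.0 or later (where functions.filter accepts a Scala lambda), ANSI mode switched off so that indexing an empty array yields null rather than an error, and a helper name, firstRequestWithId, that is made up for illustration and is not part of Spark.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, filter}

case class Request(id: Int, responses: Seq[Int])

// Local session for the sketch; ANSI mode is disabled explicitly (an assumption)
// so that taking element 0 of an empty array returns null instead of failing.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("first-matching-request")
  .config("spark.sql.ansi.enabled", "false")
  .getOrCreate()
import spark.implicits._

// Same sample data as in the answers above.
val df = Seq(
  Seq(Request(1, Seq(11)), Request(2, Seq(21, 22)), Request(3, Seq(31))),
  Seq(Request(4, Seq(41, 42, 43)), Request(5, Seq(51, 52)))
).toDF("requests")

// Hypothetical helper (not a Spark function): keeps the elements of `requests`
// whose id equals `id`, then takes the first one, or null when nothing matches.
def firstRequestWithId(id: Int): Column =
  filter(col("requests"), r => r("id") === id).getItem(0)

val result = df.withColumn("request_id2", firstRequestWithId(2))
result.show(false)
```

With the sample data, the first row should get the struct with id 2 and the second row should get null, matching the expr version above. If ANSI mode is enabled in your environment, the out-of-range index may raise an error instead, so a null-tolerant accessor would be needed there.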

