Spark: How to get a single object from an ArrayType of objects in Scala

I have a DataFrame with multiple columns, one of which is an ArrayType column called "requests" whose elements are structs with two fields, "id" and "responses":

(requests,ArrayType(StructType(StructField(id,IntegerType,true),StructField(responses,ArrayType(IntegerType,false),true))))

From the "requests" array, I want to get the single request whose "id" matches a specific value and add it to a new column.

So far, I have added a boolean column indicating whether the value is present in the array:

dF
  .withColumn("idPresent", array_contains(col("requests.id"), 55))
  .show()

But I can't figure out how to get a single request object when I pass an "id" as a parameter. I expect only one such object to be present in the array, but if there is more than one, the first will suffice. I'd like to add the matching object to a new column.

Consider using the higher-order function transform to null out requests that don't match the "id", then removing those null elements with array_except:

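import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

case class Request(id: Int, responses: Seq[Int])

val df = Seq(
  Seq(Request(1, Seq(11)), Request(2, Seq(21, 22)), Request(3, Seq(31))),
  Seq(Request(4, Seq(41, 42, 43)), Request(5, Seq(51, 52)))
).toDF("requests")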

df.
  withColumn("request_id2", array_except(
      expr("transform(requests, r -> case when r.id = 2 then r end)"),
      array(lit(null))
    )
  ).show(false)
// +-------------------------------------+---------------+
// |requests                             |request_id2    |
// +-------------------------------------+---------------+
// |[[1, [11]], [2, [21, 22]], [3, [31]]]|[[2, [21, 22]]]|
// |[[4, [41, 42, 43]], [5, [51, 52]]]   |[]             |
// +-------------------------------------+---------------+

Building on @Leo C's answer, you can also use filter to keep only the array elements with a given id:

df.withColumn("request_id2", expr("filter(requests, r -> r.id = 2)")).show(false)
+-------------------------------------+---------------+
|requests                             |request_id2    |
+-------------------------------------+---------------+
|[[1, [11]], [2, [21, 22]], [3, [31]]]|[[2, [21, 22]]]|
|[[4, [41, 42, 43]], [5, [51, 52]]]   |[]             |
+-------------------------------------+---------------+
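The same filter can probably be written without a SQL string, which makes it easier to pass the id in as a parameter. The sketch below assumes Spark 3.0 or later, where `org.apache.spark.sql.functions.filter` accepts a Scala lambda over `Column`. The helper name `requestsWithId` and the object name are made up for illustration and are not part of Spark.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, filter}

case class Request(id: Int, responses: Seq[Int])

object FilterRequestsById {
  // Hypothetical helper: keeps only the array elements whose "id" field equals targetId.
  def requestsWithId(requests: Column, targetId: Int): Column =
    filter(requests, r => r.getField("id") === targetId)

  def main(args: Array[String]): Unit = {
    // Local session for the sketch; in spark-shell a session named `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("filter-requests-by-id")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      Seq(Request(1, Seq(11)), Request(2, Seq(21, 22)), Request(3, Seq(31))),
      Seq(Request(4, Seq(41, 42, 43)), Request(5, Seq(51, 52)))
    ).toDF("requests")

    // Rows without a matching request get an empty array, as in the expr version.
    df.withColumn("request_id2", requestsWithId(col("requests"), 2)).show(false)

    spark.stop()
  }
}
```

Keeping the predicate in a small function means the target id is an ordinary Scala value rather than something interpolated into a SQL string.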

If you just want the first struct object, you can append [0] to the expression passed to expr:

df.withColumn("request_id2", expr("filter(requests, r -> r.id = 2)[0]")).show(false)
+-------------------------------------+-------------+
|requests                             |request_id2  |
+-------------------------------------+-------------+
|[[1, [11]], [2, [21, 22]], [3, [31]]]|[2, [21, 22]]|
|[[4, [41, 42, 43]], [5, [51, 52]]]   |null         |
+-------------------------------------+-------------+
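For completeness, here is a sketch of taking the first match with the Column API and then reading a field from the resulting struct. It assumes Spark 3.0 or later. It also turns ANSI mode off explicitly, because indexing into an empty array returns null only in non-ANSI mode, and newer Spark versions may enable ANSI by default. `firstRequestWithId` and the column name `matched_responses` are illustrative names, not anything from the question.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, filter}

case class Request(id: Int, responses: Seq[Int])

object FirstMatchingRequest {
  // Hypothetical helper: first array element whose "id" equals targetId,
  // or null when nothing matches (with ANSI mode disabled).
  def firstRequestWithId(requests: Column, targetId: Int): Column =
    filter(requests, r => r.getField("id") === targetId).getItem(0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("first-matching-request")
      // Assumption: non-ANSI semantics, so an out-of-range index yields null.
      .config("spark.sql.ansi.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      Seq(Request(1, Seq(11)), Request(2, Seq(21, 22)), Request(3, Seq(31))),
      Seq(Request(4, Seq(41, 42, 43)), Request(5, Seq(51, 52)))
    ).toDF("requests")

    val result = df
      .withColumn("request_id2", firstRequestWithId(col("requests"), 2))
      // Fields of the matched struct can be read with dot notation.
      .withColumn("matched_responses", col("request_id2.responses"))

    result.show(false)
    spark.stop()
  }
}
```

Because request_id2 is a struct rather than a one-element array, its fields can be selected directly, and rows with no match carry null in both new columns.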
