Create a new DataFrame (with a different schema) from selected information in another DataFrame

Question

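I have a DataFrame whose tags column holds various key -> value pairs. I want to extract the values whose key is name and put the extracted information into a new DataFrame.

The initial df has the following schema: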

root
 |-- id: long (nullable = true)
 |-- type: string (nullable = true)
 |-- tags: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- nds: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ref: long (nullable = true)
 |-- members: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- ref: long (nullable = true)
 |    |    |-- role: string (nullable = true)
 |-- visible: boolean (nullable = true)

I would like newdf to have this schema:

root
 |-- place: string (nullable = true)
 |-- num_evacuees: string (nullable = true)

How should I write the filter? I have tried many approaches, including a plain filter, but every time the result is an empty DataFrame. For example:

val newdf = df.filter($"tags"("key") contains "name")
val newdf = df.where(places("tags")("key") === "name")

None of the approaches I tried worked. How should I filter this properly?

Answer 1

You can get the result you want like this:

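import org.apache.spark.sql.functions.{col, explode, map_filter}
import spark.implicits._  // needed for toDF; assumes a SparkSession named spark, as in spark-shell

val df = Seq(
  (1L, Map("sf" -> "100")),
  (2L, Map("ny" -> "200"))
).toDF("id", "tags")

// Keep only the "ny" entry of each map, then explode it into key/value columns
val resultDf = df
  .select(explode(map_filter(col("tags"), (k, _) => k === "ny")))
  .withColumnRenamed("key", "place")
  .withColumnRenamed("value", "num_evacuees")

resultDf.printSchema
resultDf.show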

This prints:

root
 |-- place: string (nullable = false)
 |-- num_evacuees: string (nullable = true)

+-----+------------+
|place|num_evacuees|
+-----+------------+
|   ny|         200|
+-----+------------+

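The key idea is to use map_filter to select the entries you want from the map. explode then turns the map into two columns (key and value), which you can rename so that the DataFrame matches your specification.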

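The example above selects a single value just to demonstrate the idea. The lambda passed to map_filter can be as complex as needed: its signature, map_filter(expr: Column, f: (Column, Column) => Column): Column, shows that any function returning a Column is accepted.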
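As an illustration of a more involved lambda, here is a minimal sketch that tests both the key and the value. The object name, the helper selectPlaces, its minEvacuees parameter, and the sample data are all hypothetical, and the Scala form of map_filter assumes Spark 3.0 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, map_filter}

object MapFilterPredicate {
  // Hypothetical helper: keep map entries whose key is "sf" or "ny" and whose
  // value, cast to int, is at least minEvacuees; then explode into two columns.
  def selectPlaces(df: DataFrame, minEvacuees: Int): DataFrame =
    df.select(explode(map_filter(col("tags"),
        (k, v) => k.isin("sf", "ny") && v.cast("int") >= minEvacuees)))
      .toDF("place", "num_evacuees") // rename the key/value columns produced by explode

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("map-filter-predicate")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data in the same shape as the answer's example
    val df = Seq(
      (1L, Map("sf" -> "100", "la" -> "50")),
      (2L, Map("ny" -> "200"))
    ).toDF("id", "tags")

    selectPlaces(df, 100).show()
    spark.stop()
  }
}
```

Because the lambda only has to return a Column, any boolean Column expression over the key and value can be combined this way.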

If you want to filter on a large number of keys, you can do the following:

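val resultDf = df
  // array(String, String*) treats its arguments as column names, so wrap the keys in lit()
  .withColumn("filterList", array(lit("sf"), lit("place_n")))
  .select(explode(map_filter(col("tags"), (k, _) => array_contains(col("filterList"), k))))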

Answer 2

The idea is to extract the keys of the map column (tags) and then use array_contains to check for a key called "name".

import org.apache.spark.sql.functions._
val newdf = df.filter(array_contains(map_keys($"tags"), "name"))
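A likely reason the attempts in the question returned nothing is that $"tags"("key") looks up the map entry stored under the literal key "key", not the map's keys themselves. Building on the filter above, the following sketch keeps rows that have a "name" key and then reads that entry's value into a place column. The object name, the helper namedPlaces, and the sample tags are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_contains, col, map_keys}

object ExtractNameTag {
  // Hypothetical helper: keep rows whose tags map has a "name" key,
  // then pull that entry's value out as the "place" column.
  def namedPlaces(df: DataFrame): DataFrame =
    df.filter(array_contains(map_keys(col("tags")), "name"))
      .select(col("id"), col("tags").getItem("name").as("place"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("extract-name-tag")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows: only the first one carries a "name" tag
    val df = Seq(
      (1L, Map("name" -> "Springfield", "highway" -> "residential")),
      (2L, Map("highway" -> "primary"))
    ).toDF("id", "tags")

    namedPlaces(df).show()
    spark.stop()
  }
}
```

The same getItem call can be repeated for any other key that should become a column of the new DataFrame.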

Answer 1
1, accepted, 2021-10-18 07:45:31

Answer 2
0, 2021-10-18 05:53:32
