
Spark scala select certain columns in a dataframe as a map

I have a dataframe df and a list of column names to select from this dataframe as a map.

I have tried the following approach to build the map.

var df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")

val cols = List("from_value","to_value")

df.select(
  map(
    lit(cols(0)),col(cols(0))
    ,lit(cols(1)),col(cols(1))
  )
  .as("mapped")
  ).show(false)

Output:

+------------------------------------+
|mapped                              |
+------------------------------------+
|[from_value -> 66, to_value -> xyz1]|
|[from_value -> 67, to_value -> abc1]|
|[from_value -> 68, to_value -> fgr1]|
|[from_value -> 69, to_value -> yte1]|
|[from_value -> 70, to_value -> erx1]|
|[from_value -> 71, to_value -> ter1]|
+------------------------------------+

However, I see a few issues with this approach:

  • The list of column names may contain 0 up to 3 column names. The code above, with its hard-coded indices, would throw an IndexOutOfBoundsException for shorter lists.
  • The order in which the column names appear in the map matters, and the map keys need to preserve that order.
  • A column value can be null, and such values would need to be coalesced to an empty string.
  • A column specified in the list may not exist in df.

Is there an elegant way to handle the above scenarios without being too verbose?

You can select certain columns in a dataframe as a map using the following function mappingExpr:

import org.apache.spark.sql.functions.{col, lit, map, when}
import org.apache.spark.sql.{Column, DataFrame}

def mappingExpr(columns: Seq[String], dataframe: DataFrame): Column = {
  def getValue(columnName: String): Column = when(col(columnName).isNull, lit("")).otherwise(col(columnName))

  map(
    columns
      .filter(dataframe.columns.contains)
      .flatMap(columnName => Seq(lit(columnName), getValue(columnName))): _*
  ).as("mapped")
}  

So given your example's data:

> val df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List("from_value","to_value")
> 
> df.select(mappingExpr(cols, df)).show(false)

+------------------------------------+
|mapped                              |
+------------------------------------+
|[from_value -> 66, to_value -> xyz1]|
|[from_value -> 67, to_value -> abc1]|
|[from_value -> 68, to_value -> fgr1]|
|[from_value -> 69, to_value -> yte1]|
|[from_value -> 70, to_value -> erx1]|
|[from_value -> 71, to_value -> ter1]|
+------------------------------------+

Detailed explanation

The main idea of my function is to transform the list of columns into a list of tuples, where the first element of each tuple holds the column name as a column and the second holds the column value as a column. This list of tuples is then flattened, and the result is passed to the map Spark SQL function.
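The flattening step can be illustrated in plain Scala without a Spark session; here plain strings stand in for the lit/col expressions:

```scala
// Plain-Scala sketch of the flattening idea: each column name expands to a
// (key, value) pair, and flatMap produces the alternating key/value sequence
// that Spark's `map` function expects as varargs.
val cols = List("from_value", "to_value")
val flattened = cols.flatMap(name => Seq(s"key:$name", s"value:$name"))
// flattened == List("key:from_value", "value:from_value",
//                   "key:to_value",   "value:to_value")
```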

Let's now go through your different constraints.

The list of column names may contain 0 up to 3 column names

As the elements inserted into the map are built by iterating over the list of columns, the number of column names does not change anything. If we pass an empty list of column names, there is no error:

> val df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List[String]()
>
> df.select(mappingExpr(cols, df)).show(false)
+------+
|mapped|
+------+
|[]    |
|[]    |
|[]    |
|[]    |
|[]    |
|[]    |
+------+

I need the keys in the map to preserve the order

This is the trickiest one. Usually when you create a map, the order is not preserved, due to how maps are implemented. In Spark, however, the map expression preserves the order of its arguments, so the result depends only on the order of the list of column names. So in your example, if we change the order of the column names:

> val df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List("to_value","from_value")
> 
> df.select(mappingExpr(cols, df)).show(false)

+------------------------------------+
|mapped                              |
+------------------------------------+
|[to_value -> xyz1, from_value -> 66]|
|[to_value -> abc1, from_value -> 67]|
|[to_value -> fgr1, from_value -> 68]|
|[to_value -> yte1, from_value -> 69]|
|[to_value -> erx1, from_value -> 70]|
|[to_value -> ter1, from_value -> 71]|
+------------------------------------+
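As a driver-side analogue (not part of the Spark code above): ordinary Scala Maps give no ordering guarantee, but an insertion-ordered map such as LinkedHashMap mirrors the behaviour Spark's map expression shows here:

```scala
import scala.collection.mutable

// LinkedHashMap preserves insertion order, mirroring how Spark's `map`
// expression keeps its key/value arguments in the order they were passed.
val ordered = mutable.LinkedHashMap("to_value" -> "xyz1", "from_value" -> "66")
// ordered.keys.toList == List("to_value", "from_value")
```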

The column value can be null and that would need to be coalesced to an empty string

I do that in the inner function getValue, using Spark's when SQL function: when the column value is null, return an empty string, otherwise return the column value: when(col(columnName).isNull, lit("")).otherwise(col(columnName)). So when you have null values in your dataframe, they are replaced by empty strings:

> val df = Seq((66, null,"a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List("from_value","to_value")
> 
> df.select(mappingExpr(cols, df)).show(false)

+------------------------------------+
|mapped                              |
+------------------------------------+
|[from_value -> 66, to_value -> ]    |
|[from_value -> 67, to_value -> abc1]|
|[from_value -> 68, to_value -> fgr1]|
|[from_value -> 69, to_value -> yte1]|
|[from_value -> 70, to_value -> erx1]|
|[from_value -> 71, to_value -> ter1]|
+------------------------------------+
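The when(...).otherwise(...) expression is just the column-level form of a plain null check; a minimal plain-Scala sketch of the same logic (nullToEmpty is a hypothetical name, not Spark API):

```scala
// Plain-Scala analogue of when(col(c).isNull, lit("")).otherwise(col(c)):
// replace null with an empty string, keep any other value as-is.
def nullToEmpty(value: String): String =
  if (value == null) "" else value
// nullToEmpty(null) == "" ; nullToEmpty("abc1") == "abc1"
```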

The column specified in the list may not exist in the dataframe

You can retrieve the list of columns of a dataframe with the columns method. I use it to filter out all column names that are not in the dataframe, with the line .filter(dataframe.columns.contains). So when the list of column names contains a name that is not in the dataframe, it is ignored:

> val df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List("a_column_that_does_not_exist", "from_value","to_value")
> 
> df.select(mappingExpr(cols, df)).show(false)

+------------------------------------+
|mapped                              |
+------------------------------------+
|[from_value -> 66, to_value -> xyz1]|
|[from_value -> 67, to_value -> abc1]|
|[from_value -> 68, to_value -> fgr1]|
|[from_value -> 69, to_value -> yte1]|
|[from_value -> 70, to_value -> erx1]|
|[from_value -> 71, to_value -> ter1]|
+------------------------------------+
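The filtering step itself is ordinary Scala collection code, so it can be checked without a Spark session; dfColumns here stands in for dataframe.columns:

```scala
// dataframe.columns returns Array[String]; filtering the requested names
// against it silently drops anything the dataframe does not have.
val dfColumns = Array("from_value", "to_value", "label")
val requested = List("a_column_that_does_not_exist", "from_value", "to_value")
val kept = requested.filter(dfColumns.contains)
// kept == List("from_value", "to_value")
```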
