Spark scala select certain columns in a dataframe as a map
I have a dataframe df and a list of column names to select from this dataframe as a map.

I have tried the following approach to build the map.
var df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
val cols = List("from_value","to_value")
df.select(
  map(
    lit(cols(0)), col(cols(0)),
    lit(cols(1)), col(cols(1))
  ).as("mapped")
).show(false)
Output:
+------------------------------------+
|mapped |
+------------------------------------+
|[from_value -> 66, to_value -> xyz1]|
|[from_value -> 67, to_value -> abc1]|
|[from_value -> 68, to_value -> fgr1]|
|[from_value -> 69, to_value -> yte1]|
|[from_value -> 70, to_value -> erx1]|
|[from_value -> 71, to_value -> ter1]|
+------------------------------------+
However, I do see a few issues with this approach, such as the columns specified in the list possibly not existing in df.

Is there an elegant way to handle the above scenarios without being too verbose?
You can select certain columns in a dataframe as a map using the following function mappingExpr:
import org.apache.spark.sql.functions.{col, lit, map, when}
import org.apache.spark.sql.{Column, DataFrame}
def mappingExpr(columns: Seq[String], dataframe: DataFrame): Column = {
  def getValue(columnName: String): Column =
    when(col(columnName).isNull, lit("")).otherwise(col(columnName))

  map(
    columns
      .filter(dataframe.columns.contains)
      .flatMap(columnName => Seq(lit(columnName), getValue(columnName))): _*
  ).as("mapped")
}
So given your example's data:
> val df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List("from_value","to_value")
>
> df.select(mappingExpr(cols, df)).show(false)
+------------------------------------+
|mapped |
+------------------------------------+
|[from_value -> 66, to_value -> xyz1]|
|[from_value -> 67, to_value -> abc1]|
|[from_value -> 68, to_value -> fgr1]|
|[from_value -> 69, to_value -> yte1]|
|[from_value -> 70, to_value -> erx1]|
|[from_value -> 71, to_value -> ter1]|
+------------------------------------+
The main idea of my function is to transform the list of columns into a list of pairs, where the first element of each pair contains the column name as a Column and the second element contains the column value as a Column. I then flatten this list of pairs and pass the result to the map Spark SQL function.
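The interleaving step can be illustrated with plain Scala collections. This is only a sketch of the shape of the transformation: the hypothetical valueOf helper stands in for getValue, and in the real function both the names and the values are Spark Column expressions rather than strings.

```scala
// Plain-Scala sketch of the interleaving done inside mappingExpr.
// `valueOf` is a stand-in for getValue; it just marks where the
// column value expression would go.
val columns = List("from_value", "to_value")
def valueOf(name: String): String = s"<value of $name>"

// flatMap turns each column name into a (name, value) pair and
// flattens the pairs into one alternating sequence, which is exactly
// the argument shape the `map` SQL function expects.
val interleaved = columns.flatMap(name => Seq(name, valueOf(name)))
// interleaved == List("from_value", "<value of from_value>",
//                     "to_value", "<value of to_value>")
```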
Let's now go through your different constraints.

As I build the elements inserted into the map by iterating over the list of columns, the number of column names does not change anything. If we pass an empty list of column names, there is no error:
> val df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List()
>
> df.select(mappingExpr(List(), df)).show(false)
+------+
|mapped|
+------+
|[] |
|[] |
|[] |
|[] |
|[] |
|[] |
+------+
Keeping the order of the map entries is the trickiest part. Usually, when you create a map, the order is not preserved, due to how maps are implemented. However, in Spark the order seems to be preserved, so it depends only on the order of the list of column names. So in your example, if we change the order of the column names:
> val df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List("to_value","from_value")
>
> df.select(mappingExpr(cols, df)).show(false)
+------------------------------------+
|mapped |
+------------------------------------+
|[to_value -> xyz1, from_value -> 66]|
|[to_value -> abc1, from_value -> 67]|
|[to_value -> fgr1, from_value -> 68]|
|[to_value -> yte1, from_value -> 69]|
|[to_value -> erx1, from_value -> 70]|
|[to_value -> ter1, from_value -> 71]|
+------------------------------------+
I handle null values in the inner function getValue, using Spark's when SQL function: when the column value is null, return an empty string, otherwise return the column value: when(col(columnName).isNull, lit("")).otherwise(col(columnName)). So when you have null values in your dataframe, they are replaced by an empty string:
> val df = Seq((66, null,"a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List("from_value","to_value")
>
> df.select(mappingExpr(cols, df)).show(false)
+------------------------------------+
|mapped |
+------------------------------------+
|[from_value -> 66, to_value -> ] |
|[from_value -> 67, to_value -> abc1]|
|[from_value -> 68, to_value -> fgr1]|
|[from_value -> 69, to_value -> yte1]|
|[from_value -> 70, to_value -> erx1]|
|[from_value -> 71, to_value -> ter1]|
+------------------------------------+
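For intuition, the null-to-empty-string fallback that when/otherwise performs on a Column can be expressed with plain Scala using Option; the getValueOrEmpty name is just an illustrative analogue, not part of the Spark API.

```scala
// Plain-Scala analogue of
//   when(col(c).isNull, lit("")).otherwise(col(c))
// Option(x) is None when x is null, so getOrElse supplies the fallback.
def getValueOrEmpty(value: String): String = Option(value).getOrElse("")

println(getValueOrEmpty(null))   // prints an empty string
println(getValueOrEmpty("abc1")) // prints "abc1"
```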
You can retrieve the list of columns of a dataframe using the columns method, so I use it to filter out all column names that are not in the dataframe, with the line .filter(dataframe.columns.contains). When the list of column names contains a name that is not in the dataframe, it is simply ignored:
> val df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List("a_column_that_does_not_exist", "from_value","to_value")
>
> df.select(mappingExpr(cols, df)).show(false)
+------------------------------------+
|mapped |
+------------------------------------+
|[from_value -> 66, to_value -> xyz1]|
|[from_value -> 67, to_value -> abc1]|
|[from_value -> 68, to_value -> fgr1]|
|[from_value -> 69, to_value -> yte1]|
|[from_value -> 70, to_value -> erx1]|
|[from_value -> 71, to_value -> ter1]|
+------------------------------------+
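The missing-column filter itself is ordinary collection code; here is the same logic on plain Scala values, with dataframeColumns standing in for what df.columns would return on the example dataframe.

```scala
// Plain-Scala sketch of the filter step in mappingExpr:
// keep only the requested names that actually exist in the dataframe.
val dataframeColumns = Array("from_value", "to_value", "label") // what df.columns returns
val requested = List("a_column_that_does_not_exist", "from_value", "to_value")

val kept = requested.filter(dataframeColumns.contains)
// kept == List("from_value", "to_value") -- the unknown name is dropped
```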