In Spark and Scala, how to convert or map a DataFrame to specific columns info?
Spark Scala: select certain columns in a DataFrame as a map
I have a dataframe df and a list of column names that should be selected from this dataframe as a map. I tried the following to build the map.
import org.apache.spark.sql.functions.{col, lit, map}

var df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
val cols = List("from_value","to_value")

df.select(
  map(
    lit(cols(0)), col(cols(0)),
    lit(cols(1)), col(cols(1))
  ).as("mapped")
).show(false)
Output:
+------------------------------------+
|mapped |
+------------------------------------+
|[from_value -> 66, to_value -> xyz1]|
|[from_value -> 67, to_value -> abc1]|
|[from_value -> 68, to_value -> fgr1]|
|[from_value -> 69, to_value -> yte1]|
|[from_value -> 70, to_value -> erx1]|
|[from_value -> 71, to_value -> ter1]|
+------------------------------------+
However, I think this approach has a few issues: the number of columns in cols may vary, the column values may be null, and cols may contain names that do not exist in df. Is there an elegant way to handle these scenarios without being too verbose?
You can select certain columns of a dataframe as a map using the following function, mappingExpr:
import org.apache.spark.sql.functions.{col, lit, map, when}
import org.apache.spark.sql.{Column, DataFrame}

def mappingExpr(columns: Seq[String], dataframe: DataFrame): Column = {
  // Replace null column values with an empty string
  def getValue(columnName: String): Column =
    when(col(columnName).isNull, lit("")).otherwise(col(columnName))

  map(
    columns
      .filter(dataframe.columns.contains)  // ignore names not in the dataframe
      .flatMap(columnName => Seq(lit(columnName), getValue(columnName))): _*
  ).as("mapped")
}
So, given your example data:
> val df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List("from_value","to_value")
>
> df.select(mappingExpr(cols, df)).show(false)
+------------------------------------+
|mapped |
+------------------------------------+
|[from_value -> 66, to_value -> xyz1]|
|[from_value -> 67, to_value -> abc1]|
|[from_value -> 68, to_value -> fgr1]|
|[from_value -> 69, to_value -> yte1]|
|[from_value -> 70, to_value -> erx1]|
|[from_value -> 71, to_value -> ter1]|
+------------------------------------+
The main idea of my function is to transform the list of columns into a list of pairs, where the first element of each pair is the column name as a Column and the second element is the column value as a Column. I then flatten this list of pairs and pass the result to the map Spark SQL function.
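Concretely, the flatten step expands a two-element column list into the alternating key/value argument sequence that map expects. A small sketch of this expansion (the variable names here are illustrative, not part of the answer's code):

```scala
import org.apache.spark.sql.functions.{col, lit}

val cols = List("from_value", "to_value")

// flatMap turns each name into a (key, value) pair and flattens the result:
val args = cols.flatMap(c => Seq(lit(c), col(c)))
// args is equivalent to:
//   Seq(lit("from_value"), col("from_value"), lit("to_value"), col("to_value"))
// so map(args: _*) is the same as
//   map(lit("from_value"), col("from_value"), lit("to_value"), col("to_value"))
```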
Now let's go through your different constraints.
Since I build the elements inserted into the map by iterating over the list of column names, the number of column names does not change anything. If we pass an empty list of column names, there is no error:
> val df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List()
>
> df.select(mappingExpr(List(), df)).show(false)
+------+
|mapped|
+------+
|[] |
|[] |
|[] |
|[] |
|[] |
|[] |
+------+
This is the trickiest one. Usually, when you create a map, the order is not preserved because of how maps are implemented. However, in Spark the order appears to be preserved, so it depends only on the order of the list of column names. In your example, if we change the order of the column names:
> val df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List("to_value","from_value")
>
> df.select(mappingExpr(cols, df)).show(false)
+------------------------------------+
|mapped |
+------------------------------------+
|[to_value -> xyz1, from_value -> 66]|
|[to_value -> abc1, from_value -> 67]|
|[to_value -> fgr1, from_value -> 68]|
|[to_value -> yte1, from_value -> 69]|
|[to_value -> erx1, from_value -> 70]|
|[to_value -> ter1, from_value -> 71]|
+------------------------------------+
Null values are handled in the inner function getValue, using Spark's when SQL function: when the column value is null, an empty string is returned, otherwise the column value is returned: when(col(columnName).isNull, lit("")).otherwise(col(columnName)). So when you have null values in your dataframe, they are replaced by an empty string:
> val df = Seq((66, null,"a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List("from_value","to_value")
>
> df.select(mappingExpr(cols, df)).show(false)
+------------------------------------+
|mapped |
+------------------------------------+
|[from_value -> 66, to_value -> ] |
|[from_value -> 67, to_value -> abc1]|
|[from_value -> 68, to_value -> fgr1]|
|[from_value -> 69, to_value -> yte1]|
|[from_value -> 70, to_value -> erx1]|
|[from_value -> 71, to_value -> ter1]|
+------------------------------------+
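As an aside, the same null replacement could presumably be written more compactly with Spark's coalesce function, which returns its first non-null argument (an equivalent alternative, not the answer's code):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Equivalent to when(col(columnName).isNull, lit("")).otherwise(col(columnName)):
def getValue(columnName: String): Column =
  coalesce(col(columnName), lit(""))
```

Either way, note that map values must share a single type, so numeric columns such as from_value end up coerced to strings in the resulting map.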
You can retrieve the list of a dataframe's columns with the columns method. I use it to filter out all column names that are not present in the dataframe: .filter(dataframe.columns.contains). So when the list of column names contains a name that does not exist in the dataframe, it is simply ignored:
> val df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List("a_column_that_does_not_exist", "from_value","to_value")
>
> df.select(mappingExpr(cols, df)).show(false)
+------------------------------------+
|mapped |
+------------------------------------+
|[from_value -> 66, to_value -> xyz1]|
|[from_value -> 67, to_value -> abc1]|
|[from_value -> 68, to_value -> fgr1]|
|[from_value -> 69, to_value -> yte1]|
|[from_value -> 70, to_value -> erx1]|
|[from_value -> 71, to_value -> ter1]|
+------------------------------------+
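Finally, since mappingExpr returns an ordinary Column, it composes with other selections. A usage sketch, assuming the df and cols defined above:

```scala
// Keep the label column alongside the generated map column:
df.select(col("label"), mappingExpr(cols, df)).show(false)
```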