
Spark scala select certain columns in a dataframe as a map

I have a dataframe df and a list of column names that should be selected from this dataframe as a map.

I tried the following approach to build the map:

var df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")

val cols = List("from_value","to_value")

df.select(
  map(
    lit(cols(0)),col(cols(0))
    ,lit(cols(1)),col(cols(1))
  )
  .as("mapped")
  ).show(false)

Output:

+------------------------------------+
|mapped                              |
+------------------------------------+
|[from_value -> 66, to_value -> xyz1]|
|[from_value -> 67, to_value -> abc1]|
|[from_value -> 68, to_value -> fgr1]|
|[from_value -> 69, to_value -> yte1]|
|[from_value -> 70, to_value -> erx1]|
|[from_value -> 71, to_value -> ter1]|
+------------------------------------+

However, I see a few problems with this approach, for example:

  • The list of column names may contain anywhere from 0 to 3 names. The code above would throw an IndexOutOfBoundsException.
  • The order in which the column names appear in the map matters; I need the map keys to preserve that order.
  • Column values can be null and need to be coalesced to an empty string.
  • A column specified in the list may not exist in df.

Is there an elegant way to handle the above scenarios without being too verbose?

You can select certain columns of a dataframe as a map with the following function, mappingExpr:

import org.apache.spark.sql.functions.{col, lit, map, when}
import org.apache.spark.sql.{Column, DataFrame}

def mappingExpr(columns: Seq[String], dataframe: DataFrame): Column = {
  def getValue(columnName: String): Column = when(col(columnName).isNull, lit("")).otherwise(col(columnName))

  map(
    columns
      .filter(dataframe.columns.contains)
      .flatMap(columnName => Seq(lit(columnName), getValue(columnName))): _*
  ).as("mapped")
}  

So, given your example data:

> val df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List("from_value","to_value")
> 
> df.select(mappingExpr(cols, df)).show(false)

+------------------------------------+
|mapped                              |
+------------------------------------+
|[from_value -> 66, to_value -> xyz1]|
|[from_value -> 67, to_value -> abc1]|
|[from_value -> 68, to_value -> fgr1]|
|[from_value -> 69, to_value -> yte1]|
|[from_value -> 70, to_value -> erx1]|
|[from_value -> 71, to_value -> ter1]|
+------------------------------------+

Detailed explanation

The main idea of my function is to transform the list of columns into a list of tuples, where the first element of each tuple contains the column name as a column and the second element contains the column value as a column. I then flatten this list of tuples and pass the result to the map Spark SQL function.
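The interleaving step can be illustrated with plain Scala collections. The strings below are just stand-ins for the lit/col expressions, so this sketch shows only the shape of the argument list, not actual Spark code:

```scala
// Stand-in for the flatMap inside mappingExpr: each column name expands
// into a (key expression, value expression) pair, and flattening yields
// the alternating key/value argument list that Spark's `map` expects.
val columns = Seq("from_value", "to_value")

val mapArgs: Seq[String] =
  columns.flatMap(name => Seq(s"lit($name)", s"col($name)"))

// mapArgs == Seq("lit(from_value)", "col(from_value)",
//                "lit(to_value)",  "col(to_value)")
```

An empty `columns` list simply yields an empty argument list, which is why the zero-column case below works without errors.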

Now let's go through your different constraints.

The list of column names may contain 0 to 3 names

Since I build the elements inserted into the map by iterating over the column list, the number of column names doesn't change anything. If we pass an empty list of column names, there is no error:

> val df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List()
>
> df.select(mappingExpr(List(), df)).show(false)
+------+
|mapped|
+------+
|[]    |
|[]    |
|[]    |
|[]    |
|[]    |
|[]    |
+------+

I need the map keys to preserve order

This is the trickiest one. Normally, when you create a map, the order is not preserved, due to how maps are implemented. In Spark, however, the order appears to be preserved, so it depends only on the order of the names in the column list. Thus, in your example, if we change the order of the column names:

> val df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List("to_value","from_value")
> 
> df.select(mappingExpr(cols, df)).show(false)

+------------------------------------+
|mapped                              |
+------------------------------------+
|[to_value -> xyz1, from_value -> 66]|
|[to_value -> abc1, from_value -> 67]|
|[to_value -> fgr1, from_value -> 68]|
|[to_value -> yte1, from_value -> 69]|
|[to_value -> erx1, from_value -> 70]|
|[to_value -> ter1, from_value -> 71]|
+------------------------------------+

Column values can be null and need to be coalesced to an empty string

I do this with Spark's when SQL function in the inner function getValue. When the column value is null, an empty string is returned; otherwise the column value is returned: when(col(columnName).isNull, lit("")).otherwise(col(columnName)). So when you have null values in your dataframe, they are replaced by empty strings:

> val df = Seq((66, null,"a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List("from_value","to_value")
> 
> df.select(mappingExpr(cols, df)).show(false)

+------------------------------------+
|mapped                              |
+------------------------------------+
|[from_value -> 66, to_value -> ]    |
|[from_value -> 67, to_value -> abc1]|
|[from_value -> 68, to_value -> fgr1]|
|[from_value -> 69, to_value -> yte1]|
|[from_value -> 70, to_value -> erx1]|
|[from_value -> 71, to_value -> ter1]|
+------------------------------------+
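As an aside, the same null handling can be written more compactly with Spark's coalesce function, which returns its first non-null argument. A sketch of a drop-in replacement for getValue (same behaviour as the when/otherwise version, assuming the value coerces to string):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Equivalent to when(col(name).isNull, lit("")).otherwise(col(name)):
// coalesce picks col(columnName) when it is non-null, else the empty string.
def getValue(columnName: String): Column =
  coalesce(col(columnName), lit(""))
```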

A column specified in the list may not exist in the dataframe

You can retrieve the list of a dataframe's columns with the columns method, so I use it to filter out every column name that is not in the dataframe: .filter(dataframe.columns.contains). Thus, when the list contains a column name that does not exist in the dataframe, it is simply ignored:

> val df = Seq((66, "xyz1","a"),(67, "abc1","a"),(68, "fgr1","b"),(69, "yte1","d"),(70, "erx1","q"),(71, "ter1","q")).toDF("from_value", "to_value","label")
> val cols = List("a_column_that_does_not_exist", "from_value","to_value")
> 
> df.select(mappingExpr(cols, df)).show(false)

+------------------------------------+
|mapped                              |
+------------------------------------+
|[from_value -> 66, to_value -> xyz1]|
|[from_value -> 67, to_value -> abc1]|
|[from_value -> 68, to_value -> fgr1]|
|[from_value -> 69, to_value -> yte1]|
|[from_value -> 70, to_value -> erx1]|
|[from_value -> 71, to_value -> ter1]|
+------------------------------------+
