Change the value of a row using multiple columns in a Spark DataFrame

Question

I have a DataFrame (df) in this format.

df.show()
********************
X1 | x2  | X3 | ..... | Xn   | id_1 | id_2 | .... id_23
1  |  ok |good|         john | null | null |     |null
2  |rick |good|       | ryan | null | null |     |null
....

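The DataFrame, named df, has many columns, and I need to edit some of them. I have two maps: m1 (Integer -> Integer) and m2 (Integer -> String).

For each row, I need to take the value of column X1 and look it up in m1, which gives a number in the range [1, 23], say 5. I also look up the same X1 value in m2, which gives a column name, something like X8. I then need to put the value of column X8 into id_5. I have the following code, but I can't get it to work.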

val dfEdited = df.map( (row) => {
  val mapValue = row.getAs("X1")
  row.getAs("id_"+m1.get(mapValue)) = row.getAs(m2.get(mapValue)
})

Answer 1

What you are doing in row.getAs("id_"+m1.get(mapValue)) = row.getAs(m2.get(mapValue) does not make sense.

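First, you are trying to assign a value to the result of getAs("id_"+m1.get(mapValue)), which gives you an immutable value. Second, you are not using the getAs method correctly, because you need to specify the data type that the method returns.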
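There is also a smaller issue hiding in that expression: on a Scala Map, get returns an Option, so concatenating it into a string does not produce the column name you expect. A minimal sketch in plain Scala, assuming m1 and m2 are ordinary Scala maps (the sample entries are made up for illustration):

```scala
// Hypothetical sample maps standing in for the question's m1 and m2.
val m1: Map[Int, Int] = Map(1 -> 5)
val m2: Map[Int, String] = Map(1 -> "X8")

// Map.get wraps the result in an Option, so the wrapper ends up in the string.
val viaGet: String = "id_" + m1.get(1)   // "id_Some(5)"

// Map.apply returns the bare value, which gives the intended column name.
val viaApply: String = "id_" + m1(1)     // "id_5"

// The same applies to m2: apply gives the plain column name.
val sourceColumn: String = m2(1)         // "X8"
```

This is why the code below uses m1(mapValue) and m2(mapValue) rather than get.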

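I'm not sure I have understood correctly what you want to do, and I guess you left out some details. Anyway, this is what I came up with, and it runs fine.

I have commented every line of the code so you can follow it easily.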

// Brings the encoder for the case class into scope.
// Assumes a SparkSession named spark is available, as in spark-shell.
import spark.implicits._

// First of all we need to create a case class to wrap the content of each row.
case class Schema(X1: Int, X2: String, X3: String, X4: String, id_1: Option[String], id_2: Option[String], id_3: Option[String])


val dfEdited = ds.map( row => {
  // We use the getInt method to get the value of a field which is expected to be Int.
  // fieldIndex gives you the position inside the row of the field you are looking for.
  val mapValue = row.getInt(row.fieldIndex("X1"))

  // Regarding m1(mapValue), a NoSuchElementException is thrown if mapValue is not in that Map.
  // You need to implement mechanisms to deal with it (for example, an if...else clause, or using the method getOrElse).
  val indexToModify = row.fieldIndex("id_" + m1(mapValue))

  // We convert the row to a sequence, and pair each element with its index.
  // Then, with the map method we generate a new sequence.
  // We replace the element situated in the position indexToModify.
  // This stores m2(mapValue) itself; to copy the value of the column that m2(mapValue) names,
  // use Option(row.getAs[String](m2(mapValue))) instead.
  // In addition, every other id_ column is wrapped in an Option (null becomes None).
  // It is necessary for the next step.
  val seq = row.toSeq.zipWithIndex.map { case (value, index) =>
    if (index == indexToModify) Some(m2(mapValue))
    else if (row.schema.fieldNames(index).startsWith("id_")) Option(value)
    else value
  }


  // Finally, you have to create the Schema object by using pattern matching.
  seq match {
    case Seq(x1: Int, x2: String, x3: String, x4: String, id_1: Option[String], id_2: Option[String], id_3: Option[String]) => Schema(x1, x2, x3, x4, id_1, id_2, id_3)
  }
})
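For comparison, the same update can probably be expressed with column expressions instead of mapping over rows, which avoids the case class altogether. The following is only a sketch under assumptions: fillIds is a hypothetical helper name, the sample data and map entries are made up, and the id_ columns are assumed to be string columns of the same type as the source columns.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

// Hypothetical helper: for every key k present in both maps, copy column m2(k)
// into column "id_" + m1(k) on the rows where X1 == k. Other cells are untouched.
def fillIds(df: DataFrame, m1: Map[Int, Int], m2: Map[Int, String]): DataFrame = {
  val keys = m1.keySet.intersect(m2.keySet).toSeq
  keys.foldLeft(df) { (acc, k) =>
    val target = s"id_${m1(k)}"
    acc.withColumn(target, when(col("X1") === k, col(m2(k))).otherwise(col(target)))
  }
}

// Made-up sample data, only to show the call.
val spark = SparkSession.builder().master("local[1]").appName("fill-ids-sketch").getOrCreate()
import spark.implicits._

val sample = Seq(
  (1, "ok", "good", Option.empty[String], Option.empty[String]),
  (2, "rick", "good", Option.empty[String], Option.empty[String])
).toDF("X1", "X2", "X3", "id_1", "id_2")

val sampleM1 = Map(1 -> 2, 2 -> 1)        // X1 value -> id column number
val sampleM2 = Map(1 -> "X2", 2 -> "X3")  // X1 value -> source column name

val result = fillIds(sample, sampleM1, sampleM2)
```

Each withColumn call adds a projection to the plan, so with very many keys it may be worth building one expression per id_ column instead; for a handful of keys the fold should be fine.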

A few comments:

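The ds object is a Dataset. A Dataset must have a structure. You cannot modify the rows inside the map method and return them as they are, because Spark will not know whether the structure of the Dataset has changed. For this reason I return a case class object, since it gives the resulting Dataset a structure.
Keep in mind that you may run into problems with null values and missing keys. If you do not put a mechanism in place for the case where the value of X1 is not in m1, this code will throw an exception (NoSuchElementException for a Scala Map).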
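One way to handle the missing-key case is to resolve both lookups through Option and only touch the row when both succeed. A minimal sketch in plain Scala, where resolve is a hypothetical helper and the map entries are made up:

```scala
import scala.util.Try

// Hypothetical sample maps standing in for m1 and m2.
val m1: Map[Int, Int] = Map(1 -> 5)
val m2: Map[Int, String] = Map(1 -> "X8")

// Returns Some((target id column, source column)) only when both maps contain the key.
def resolve(x1: Int): Option[(String, String)] =
  for {
    idNum  <- m1.get(x1)
    source <- m2.get(x1)
  } yield (s"id_$idNum", source)

val hit = resolve(1)      // Some(("id_5", "X8"))
val miss = resolve(7)     // None, so the row can be left unchanged

// Direct application on a missing key fails instead of returning null.
val unsafe = Try(m1(7))   // Failure wrapping a NoSuchElementException
```

Inside the map function you could then match on resolve(mapValue) and return the row's values unmodified in the None case.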

Hope it works.

Solution 1
2 2019-02-05 14:00:17
