
Splitting columns in a Spark dataframe into new rows [Scala]

I have output from a Spark dataframe that looks like the following:

Amt | id | num | Start_date | Identifier
43.45 | 19840 | A345 | [2014-12-26, 2013-12-12] | [232323, 45466]
43.45 | 19840 | A345 | [2010-03-16, 2013-16-12] | [34343, 45454]

My requirement is to generate output in the following format from the output above:

Amt | id | num | Start_date | Identifier
43.45 | 19840 | A345 | 2014-12-26 | 232323
43.45 | 19840 | A345 | 2013-12-12 | 45466
43.45 | 19840 | A345 | 2010-03-16 | 34343
43.45 | 19840 | A345 | 2013-16-12 | 45454

Can someone help me achieve this?

Is this what you are looking for?

import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val sparkSession = ...
import sparkSession.implicits._

val input = sc.parallelize(Seq(
  (43.45, 19840, "A345", Seq("2014-12-26", "2013-12-12"), Seq(232323, 45466)),
  (43.45, 19840, "A345", Seq("2010-03-16", "2013-16-12"), Seq(34343, 45454))
)).toDF("amt", "id", "num", "start_date", "identifier")

// Pair each date with the identifier at the same position.
val zipArrays = udf { (dates: Seq[String], identifiers: Seq[Int]) =>
  dates.zip(identifiers)
}

// explode gives each (date, identifier) pair its own row; the generated
// column is named "col" and is a struct with fields _1 and _2.
val output = input.select($"amt", $"id", $"num", explode(zipArrays($"start_date", $"identifier")))
  .select($"amt", $"id", $"num", $"col._1".as("start_date"), $"col._2".as("identifier"))

output.show()

which returns:

+-----+-----+----+----------+----------+
|  amt|   id| num|start_date|identifier|
+-----+-----+----+----------+----------+
|43.45|19840|A345|2014-12-26|    232323|
|43.45|19840|A345|2013-12-12|     45466|
|43.45|19840|A345|2010-03-16|     34343|
|43.45|19840|A345|2013-16-12|     45454|
+-----+-----+----+----------+----------+
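
As an aside (not part of the original answer): on Spark 2.4+ you can skip the UDF entirely with the built-in arrays_zip function, which pairs arrays positionally into an array of structs. A minimal sketch against the same input:

import org.apache.spark.sql.functions.{arrays_zip, explode}

// arrays_zip (Spark 2.4+) builds an array of structs whose fields keep the
// source column names, so the struct can be exploded and unpacked directly.
val zipped = input
  .withColumn("z", explode(arrays_zip($"start_date", $"identifier")))
  .select($"amt", $"id", $"num",
    $"z.start_date".as("start_date"),
    $"z.identifier".as("identifier"))

zipped.show()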

Edit:

Since you want to zip multiple columns, you could try something like this:

val input = sc.parallelize(Seq(
  (43.45, 19840, "A345", Seq("2014-12-26", "2013-12-12"), Seq("232323", "45466"), Seq("123", "234")),
  (43.45, 19840, "A345", Seq("2010-03-16", "2013-16-12"), Seq("34343", "45454"), Seq("345", "456"))
)).toDF("amt", "id", "num", "start_date", "identifier", "another_column")

// Generalised zip: for each index i, collect the i-th element of every array.
val zipArrays = udf { seqs: Seq[Seq[String]] =>
  for (i <- seqs.head.indices) yield seqs.fold(Seq.empty)((accu, seq) => accu :+ seq(i))
}

val columnsToSelect = Seq($"amt", $"id", $"num")
val columnsToZip = Seq($"start_date", $"identifier", $"another_column")

// After the explode, the zipped values live in an array column named "col";
// pull each element out under its original column name.
val outputColumns = columnsToSelect ++ columnsToZip.zipWithIndex.map { case (column, index) =>
  $"col".getItem(index).as(column.toString())
}

val output = input.select($"amt", $"id", $"num", explode(zipArrays(array(columnsToZip: _*)))).select(outputColumns: _*)

output.show()

/*
+-----+-----+----+----------+----------+--------------+
|  amt|   id| num|start_date|identifier|another_column|
+-----+-----+----+----------+----------+--------------+
|43.45|19840|A345|2014-12-26|    232323|           123|
|43.45|19840|A345|2013-12-12|     45466|           234|
|43.45|19840|A345|2010-03-16|     34343|           345|
|43.45|19840|A345|2013-16-12|     45454|           456|
+-----+-----+----+----------+----------+--------------+
*/
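
An alternative sketch without the UDF, assuming Spark 2.1+ for posexplode: explode one array together with its position, then index the remaining arrays with that position so the elements stay aligned.

import org.apache.spark.sql.functions.posexplode

// posexplode emits (pos, value) pairs; indexing the other array columns
// with `pos` selects the element at the same position.
val byPosition = input
  .select($"amt", $"id", $"num", $"identifier", $"another_column",
    posexplode($"start_date").as(Seq("pos", "start_date")))
  .select($"amt", $"id", $"num", $"start_date",
    $"identifier"($"pos").as("identifier"),
    $"another_column"($"pos").as("another_column"))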

If I understood correctly, you want the first elements of the 3rd and 4th columns. Does that make sense?

import org.apache.spark.sql.Row

// A for-comprehension does not compile against a DataFrame; mapping over the
// underlying RDD[Row] with typed getters achieves the same thing. Note that
// this keeps only the first element of each array.
val newRdd = oldDataFrame.rdd.map { row =>
  val zro = row.getDouble(0)           // 43.45
  val one = row.getInt(1)              // 19840
  val two = row.getString(2)           // A345
  val dates = row.getSeq[String](3)    // [2014-12-26, 2013-12-12]
  val numbers = row.getSeq[Int](4)     // [232323, 45466]
  Row(zro, one, two, dates(0), numbers(0))
}
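
(Not part of the original answer:) to turn the resulting RDD[Row] back into a DataFrame you need an explicit schema; a minimal sketch, with field names and types assumed from the example data:

import org.apache.spark.sql.types._

// Schema matching the Rows built above (names are illustrative).
val schema = StructType(Seq(
  StructField("amt", DoubleType),
  StructField("id", IntegerType),
  StructField("num", StringType),
  StructField("start_date", StringType),
  StructField("identifier", IntegerType)
))

val newDataFrame = sparkSession.createDataFrame(newRdd, schema)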

You can use Spark SQL.

  • First, create a view with the information we need to process:

    df.createOrReplaceTempView("tableTest")

  • Then you can select the data with the arrays exploded:

     sparkSession.sqlContext.sql(
       "SELECT Amt, id, num, expanded_start_date, expanded_id " +
       "FROM tableTest " +
       "LATERAL VIEW explode(Start_date) Start_date AS expanded_start_date " +
       "LATERAL VIEW explode(Identifier) Identifier AS expanded_id"
     ).show()

(Note that two independent LATERAL VIEW explodes produce every combination of the two arrays, i.e. a cross product of 4 rows per input row here, rather than pairing the elements by position.)
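A positional variant is possible with posexplode (a standard Spark SQL generator): explode one array together with its index, then use that index to select the matching identifier. A sketch, assuming Spark SQL's 0-based array indexing:

     sparkSession.sqlContext.sql(
       "SELECT Amt, id, num, expanded_start_date, Identifier[pos] AS expanded_id " +
       "FROM tableTest " +
       "LATERAL VIEW posexplode(Start_date) Start_date AS pos, expanded_start_date"
     ).show()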

