
Splitting columns in a Spark dataframe into new rows [Scala]

I have output from a Spark data frame like the one below:

Amt  |id   |num |Start_date              |Identifier
43.45|19840|A345|[2014-12-26, 2013-12-12]|[232323, 45466]
43.45|19840|A345|[2010-03-16, 2013-16-12]|[34343, 45454]

My requirement is to generate output in the format below from the output above:

Amt  |id   |num |Start_date|Identifier
43.45|19840|A345|2014-12-26|232323
43.45|19840|A345|2013-12-12|45466
43.45|19840|A345|2010-03-16|34343
43.45|19840|A345|2013-16-12|45454

Can somebody help me achieve this?

Is this the thing you're looking for?

import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val sparkSession = ...
import sparkSession.implicits._

val input = sc.parallelize(Seq(
  (43.45, 19840, "A345", Seq("2014-12-26", "2013-12-12"), Seq(232323,45466)),
  (43.45, 19840, "A345", Seq("2010-03-16", "2013-16-12"), Seq(34343,45454))
)).toDF("amt", "id", "num", "start_date", "identifier")

// Pair the two arrays element by element:
// Seq(d1, d2).zip(Seq(i1, i2)) => Seq((d1, i1), (d2, i2))
val zipArrays = udf { (dates: Seq[String], identifiers: Seq[Int]) =>
  dates.zip(identifiers)
}

// explode() turns each (date, identifier) pair into its own row;
// the exploded struct ends up in a column named "col"
val output = input.select($"amt", $"id", $"num", explode(zipArrays($"start_date", $"identifier")))
  .select($"amt", $"id", $"num", $"col._1".as("start_date"), $"col._2".as("identifier"))

output.show()

Which returns:

+-----+-----+----+----------+----------+
|  amt|   id| num|start_date|identifier|
+-----+-----+----+----------+----------+
|43.45|19840|A345|2014-12-26|    232323|
|43.45|19840|A345|2013-12-12|     45466|
|43.45|19840|A345|2010-03-16|     34343|
|43.45|19840|A345|2013-16-12|     45454|
+-----+-----+----+----------+----------+
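
For reference, on Spark 2.4 or later the same pairing can be done without a UDF by using the built-in arrays_zip function. This is only a sketch, assuming that version is available and that both arrays always have the same length (if the struct field names differ in your Spark version, select them by whatever names printSchema shows):

import org.apache.spark.sql.functions.arrays_zip

// arrays_zip pairs the arrays element by element into an array of structs,
// whose fields take the names of the input columns (start_date, identifier)
val zipped = input
  .select($"amt", $"id", $"num", explode(arrays_zip($"start_date", $"identifier")).as("z"))
  .select($"amt", $"id", $"num", $"z.start_date", $"z.identifier")

zipped.show()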

EDIT:

Since you would like to zip multiple columns, you could try something like this:

val input = sc.parallelize(Seq(
  (43.45, 19840, "A345", Seq("2014-12-26", "2013-12-12"), Seq("232323","45466"), Seq("123", "234")),
  (43.45, 19840, "A345", Seq("2010-03-16", "2013-16-12"), Seq("34343","45454"), Seq("345", "456"))
)).toDF("amt", "id", "num", "start_date", "identifier", "another_column")

// For each index i, collect the i-th element of every input array into one Seq,
// e.g. (dates, identifiers, others) => Seq(date_i, identifier_i, other_i)
val zipArrays = udf { seqs: Seq[Seq[String]] =>
  for(i <- seqs.head.indices) yield seqs.fold(Seq.empty)((accu, seq) => accu :+ seq(i))
}

val columnsToSelect = Seq($"amt", $"id", $"num")
val columnsToZip = Seq($"start_date", $"identifier", $"another_column")
// After the explode, the zipped Seq lands in a column named "col";
// pull each element out by index and restore the original column name
val outputColumns = columnsToSelect ++ columnsToZip.zipWithIndex.map { case (column, index) =>
  $"col".getItem(index).as(column.toString())
}

val output = input.select($"amt", $"id", $"num", explode(zipArrays(array(columnsToZip: _*))))
  .select(outputColumns: _*)

output.show()

/*
+-----+-----+----+----------+----------+--------------+
|  amt|   id| num|start_date|identifier|another_column|
+-----+-----+----+----------+----------+--------------+
|43.45|19840|A345|2014-12-26|    232323|           123|
|43.45|19840|A345|2013-12-12|     45466|           234|
|43.45|19840|A345|2010-03-16|     34343|           345|
|43.45|19840|A345|2013-16-12|     45454|           456|
+-----+-----+----+----------+----------+--------------+
*/
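
Note that this generalised UDF expects every zipped column to be an array of strings (which is why the identifier values are given as strings here) and that all arrays in a row have the same length; seq(i) will throw an IndexOutOfBoundsException if one of them is shorter than the first.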

If I understand correctly, you want the first elements of columns 3 and 4. Does this make sense?

import org.apache.spark.sql.Row

// Map over the underlying RDD; fields are accessed by position,
// and only the first element of each array is kept
val newRows = oldDataFrame.rdd.map { row =>
  val zro     = row(0)                  // 43.45
  val one     = row(1)                  // 19840
  val two     = row(2)                  // A345
  val dates   = row.getSeq[String](3)   // [2014-12-26, 2013-12-12]
  val numbers = row.getSeq[Int](4)      // [232323, 45466]
  Row(zro, one, two, dates(0), numbers(0))
}
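
To turn those rows back into a DataFrame you also need to supply a schema. A minimal sketch, assuming a SparkSession named sparkSession is in scope and that the column types match the sample data above:

import org.apache.spark.sql.types._

// Hypothetical schema matching the five values produced above
val schema = StructType(Seq(
  StructField("amt", DoubleType),
  StructField("id", IntegerType),
  StructField("num", StringType),
  StructField("start_date", StringType),
  StructField("identifier", IntegerType)
))

val newDataFrame = sparkSession.createDataFrame(newRows, schema)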

You could use SparkSQL.

  • First, create a view with the information you need to process:

    df.createOrReplaceTempView("tableTest")

  • Then you can select the data with the expansions (a position-based variant is sketched after this list):

     sparkSession.sqlContext.sql(
       "SELECT Amt, id, num, expanded_start_date, expanded_id " +
       "FROM tableTest " +
       "LATERAL VIEW explode(Start_date) Start_date AS expanded_start_date " +
       "LATERAL VIEW explode(Identifier) Identifier AS expanded_id")
       .show()
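
Keep in mind that two independent LATERAL VIEW explodes produce every combination of date and identifier from the same row, not just the pairs at matching positions. If the elements must be paired by index, a hedged sketch using posexplode (the alias names d, i, date_pos, id_pos, expanded_start_date and expanded_id are just illustrative):

     sparkSession.sqlContext.sql(
       "SELECT Amt, id, num, expanded_start_date, expanded_id " +
       "FROM tableTest " +
       "LATERAL VIEW posexplode(Start_date) d AS date_pos, expanded_start_date " +
       "LATERAL VIEW posexplode(Identifier) i AS id_pos, expanded_id " +
       "WHERE date_pos = id_pos")
       .show()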
