
How to loop through a DataFrame with array-type columns and append the values to a final DataFrame using Scala

Could you please help me with the questions below? Question 01: Is there a way to loop over only the array-type columns? Looping over a string-type column as if it were an array throws an error. I cannot drop the string-type column (VIN) because I need this data in the final df.

df.printSchema

returns:

root
  |-- APP: array (nullable = true)
  |    |-- element: struct (containsNull = true)
  |    |    |-- E: long (nullable = true)
  |    |    |-- V: double (nullable = true)
  |-- B1X: array (nullable = true)
  |    |-- element: struct (containsNull = true)
  |    |    |-- E: long (nullable = true)
  |    |    |-- V: long (nullable = true)
  |-- B2X: array (nullable = true)
  |    |-- element: struct (containsNull = true)
  |    |    |-- E: long (nullable = true)
  |    |    |-- V: long (nullable = true)
  |-- B3X: array (nullable = true)
  |    |-- element: struct (containsNull = true)
  |    |    |-- E: long (nullable = true)
  |    |    |-- V: long (nullable = true)
  |-- VIN: string (nullable = true)

Question 02: After running the for loop below, the DataFrame jsonDF2 holds only the last E, V values (as stime, can_value) of the last signal, B3X. Is there a way to append all the values (I mean all the signal values {APP, B1X, B2X, B3X, VIN}) to jsonDF2 once the loop has finished?

val columns:Array[String] = df.columns

for(col_name <- columns){
  df = df.withColumn("element", explode(col(col_name)))
    .withColumn("stime", col("element.E"))
    .withColumn("can_value", col("element.V"))
    .withColumn("SIGNAL", lit(col_name))
    .drop(col("element"))
    .drop(col(col_name))
}

You can use the DataFrame's schema member and filter the fields beforehand with a filter and a map, keeping only the array-typed columns. Then run your for loop over the resulting column names.

import org.apache.spark.sql.types._
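// Test the type with isInstanceOf: `datatype == ArrayType` compares against the companion object and never matches.
val schema = df.schema.filter{ case StructField(_, datatype, _, _) => datatype.isInstanceOf[ArrayType] }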
val columns = schema.map{ case StructField(columnName, _ , _, _) => columnName }
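
To connect this back to Question 02: the loop in the question reassigns a single DataFrame on every pass, so the results for earlier signals are not accumulated. A minimal sketch of one alternative is to build one small DataFrame per array column and stack them with union. The sample data, the app name, and the names arrayCols and perSignal are assumptions made for illustration; the cast to double is there only because APP.V is a double while the other V fields are longs. It also assumes there is at least one array column, since reduce fails on an empty collection.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, lit}
import org.apache.spark.sql.types.{ArrayType, StructField}

val spark = SparkSession.builder().master("local[*]").appName("explode-sketch").getOrCreate()

// Hypothetical sample data with the same shape as the schema in the question.
val df = spark.range(1).selectExpr(
  "array(named_struct('E', 1L, 'V', 1.5D)) AS APP",
  "array(named_struct('E', 2L, 'V', 20L), named_struct('E', 3L, 'V', 30L)) AS B1X",
  "'vin-1' AS VIN"
)

// Only the array-typed columns are exploded; VIN is carried along unchanged.
val arrayCols = df.schema.fields.collect { case StructField(name, _: ArrayType, _, _) => name }

// One small DataFrame per signal, each built from the original df.
val perSignal: Array[DataFrame] = arrayCols.map { c =>
  df.select(col("VIN"), explode(col(c)).as("element"))
    .select(
      col("VIN"),
      lit(c).as("SIGNAL"),
      col("element.E").as("stime"),
      col("element.V").cast("double").as("can_value") // same type in every branch
    )
}

// Stack the per-signal results instead of overwriting a single variable.
val jsonDF2 = perSignal.reduce(_ union _)
jsonDF2.show()
```

Each row of jsonDF2 then carries the VIN, the signal name, and one E/V pair, which seems to be the shape the question is after.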

Here's one approach illustrated using the following example:

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._

case class Elem(e: Long, v: Double)

val df = Seq(
  (Seq(Elem(1, 1.0)), Seq(Elem(2, 2.0), Elem(3, 3.0)), Seq(Elem(4, 4.0)), Seq(Elem(5, 5.0)), "a"),
  (Seq(Elem(6, 6.0)), Seq(Elem(7, 7.0), Elem(8, 8.0)), Seq(Elem(9, 9.0)), Seq(Elem(10, 10.0)), "b")
).toDF("APP", "B1X", "B2X", "B3X", "VIN")

Question #1: Is there a way I can loop over only the array types?

You can simply collect all the top-level field names of ArrayType as follows:

val arrCols = df.schema.fields.collect{
  case StructField(name, dtype: ArrayType, _, _) => name
}
// arrCols: Array[String] = Array(APP, B1X, B2X, B3X)

Question #2: Is there a way to append all the signal values {APP, B1X, B2X, B3X, VIN}?

Without sample output, I'm not sure I completely understand your requirement. Based on your code snippet, I'm assuming your goal is to flatten all array columns of struct-typed elements into separate top-level columns. Below are the steps:

Step 1: Group all the array columns into a single array column of struct(colName, colValue); then transform each row using foldLeft to generate a combined array of struct(colName, Elem-E, Elem-V):

case class ColElem(c: String, e: Long, v: Double)

val df2 = df.
  select(array(arrCols.map(c => struct(lit(c).as("_1"), col(c).as("_2"))): _*)).
  map{ case Row(rs: Seq[Row] @unchecked) => rs.foldLeft(Seq[ColElem]()){  
    (acc, r) => r match { case Row(c: String, s: Seq[Row] @unchecked) =>
      acc ++ s.map(el => ColElem(c, el.getAs[Long](0), el.getAs[Double](1)))
    }
  }}.toDF("combined_array")

df2.show(false)
// +-----------------------------------------------------------------------------+
// |combined_array                                                               |
// +-----------------------------------------------------------------------------+
// |[[APP, 1, 1.0], [B1X, 2, 2.0], [B1X, 3, 3.0], [B2X, 4, 4.0], [B3X, 5, 5.0]]  |
// |[[APP, 6, 6.0], [B1X, 7, 7.0], [B1X, 8, 8.0], [B2X, 9, 9.0], [B3X, 10, 10.0]]|
// +-----------------------------------------------------------------------------+

Step 2: Flatten the combined array of struct-typed elements into top-level columns:

df2.
  select(explode($"combined_array").as("flattened")).
  select($"flattened.c".as("signal"), $"flattened.e".as("stime"), $"flattened.v".as("can_value")).
  orderBy("signal", "stime").
  show
// +------+-----+---------+
// |signal|stime|can_value|
// +------+-----+---------+
// |   APP|    1|      1.0|
// |   APP|    6|      6.0|
// |   B1X|    2|      2.0|
// |   B1X|    3|      3.0|
// |   B1X|    7|      7.0|
// |   B1X|    8|      8.0|
// |   B2X|    4|      4.0|
// |   B2X|    9|      9.0|
// |   B3X|    5|      5.0|
// |   B3X|   10|     10.0|
// +------+-----+---------+
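
The output above no longer contains VIN, which the question says is needed in the final DataFrame. As a rough sketch under stated assumptions, the two steps can also be expressed with two explode calls in the DataFrame API, which lets VIN ride along on every row. The sample data, the app name, and the names flattened, elems, and el are hypothetical; the sketch also assumes every array column has the same element type (here, a long e and a double v), because array requires its arguments to share one type.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}
import org.apache.spark.sql.types.{ArrayType, StructField}

val spark = SparkSession.builder().master("local[*]").appName("flatten-sketch").getOrCreate()

// Hypothetical sample data; every v is a double so the structs share one type.
val df = spark.range(1).selectExpr(
  "array(named_struct('e', 1L, 'v', 1.0D)) AS APP",
  "array(named_struct('e', 2L, 'v', 2.0D), named_struct('e', 3L, 'v', 3.0D)) AS B1X",
  "'a' AS VIN"
)

val arrCols = df.schema.fields.collect { case StructField(name, _: ArrayType, _, _) => name }

val flattened = df
  // First explode: one row per (VIN, signal name, that signal's array).
  .select(
    col("VIN"),
    explode(array(arrCols.map(c => struct(lit(c).as("signal"), col(c).as("elems"))): _*)).as("s")
  )
  // Second explode: one row per element of each signal's array.
  .select(col("VIN"), col("s.signal").as("signal"), explode(col("s.elems")).as("el"))
  .select(col("VIN"), col("signal"), col("el.e").as("stime"), col("el.v").as("can_value"))

flattened.orderBy("signal", "stime").show()
```

If the value types differ between columns, as they do in the question's schema, they would need to be cast to a common type before being combined.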
