
Transpose Specific Columns and Rows in Dataframe based on Condition in Spark Scala

I have a scenario like the one below, where a source DataFrame needs to be converted from columns to rows using Spark Scala.

Source DataFrame:

+--+----+-----+---+---+---+---+---+---+---+---+
|ID|LOAN|COUNT|A1 |A2 |A3 |A4 |B1 |B2 |B3 |B4 |
+--+----+-----+---+---+---+---+---+---+---+---+
| 1| 100|    1| 35|   |   |   |444|   |   |   |
| 2| 200|    3| 30| 15| 18|   |111|222|333|   |
| 3| 300|    2| 18| 20|   |   |555|666|   |   |
| 4| 400|    4| 28| 60| 80| 90|777|888|123|456|
| 5| 500|    1| 45|   |   |   |245|   |   |   |
+--+----+-----+---+---+---+---+---+---+---+---+

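The expected result is shown below: the A and B columns need to be converted into rows based on the value in the COUNT field.

Expected DataFrame: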

+--+----+---+---+
|ID|LOAN|  A|  B|
+--+----+---+---+
| 1| 100| 35|444|
| 2| 200| 30|111|
| 2| 200| 15|222|
| 2| 200| 18|333|
| 3| 300| 18|555|
| 3| 300| 20|666|
| 4| 400| 28|777|
| 4| 400| 60|888|
| 4| 400| 80|123|
| 4| 400| 90|456|
| 5| 500| 45|245|
+--+----+---+---+
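
Stated in plain Scala collections, the reshaping asked for above might look like the sketch below: for each row, keep the first COUNT pairs of (A, B). It needs no Spark. `LoanRow`, `LoanOut`, `unpivot`, and `cells` are hypothetical names introduced only for this illustration, and blank cells are assumed to mean missing values.

```scala
// Plain-Scala model of the requested reshaping; no Spark involved.
// LoanRow, LoanOut, unpivot and cells are hypothetical names for this sketch.
object CountDrivenUnpivot {
  final case class LoanRow(id: Int, loan: Int, count: Int,
                           aVals: Seq[Option[Int]], bVals: Seq[Option[Int]])
  final case class LoanOut(id: Int, loan: Int, a: Int, b: Int)

  // Pair A_i with B_i, keep the first `count` pairs, and skip any pair with a blank side.
  def unpivot(row: LoanRow): Seq[LoanOut] =
    row.aVals.zip(row.bVals).take(row.count).collect {
      case (Some(a), Some(b)) => LoanOut(row.id, row.loan, a, b)
    }

  // Builds a 4-cell group from the non-blank values, padding the rest with None.
  private def cells(xs: Int*): Seq[Option[Int]] =
    xs.map(x => Option(x)).padTo(4, None)

  // The five rows of the source DataFrame.
  val source: Seq[LoanRow] = Seq(
    LoanRow(1, 100, 1, cells(35), cells(444)),
    LoanRow(2, 200, 3, cells(30, 15, 18), cells(111, 222, 333)),
    LoanRow(3, 300, 2, cells(18, 20), cells(555, 666)),
    LoanRow(4, 400, 4, cells(28, 60, 80, 90), cells(777, 888, 123, 456)),
    LoanRow(5, 500, 1, cells(45), cells(245))
  )

  val result: Seq[LoanOut] = source.flatMap(unpivot)
}
```

`CountDrivenUnpivot.result` holds 11 rows, one per (ID, LOAN, A, B) line of the expected DataFrame.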

I think your use case is to unpivot the table.

I tried to solve this problem with the following approach:

  1. Read the input
    import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

    // Assumes a `sqlContext` is already in scope
    val spark = sqlContext.sparkSession
    val implicits = spark.implicits
    import implicits._
    // Build an all-integer schema from the pipe-separated header
    val schema = StructType(
      "ID|LOAN|COUNT|A1 |A2 |A3 |A4 |B1 |B2 |B3 |B4"
        .split("\\|")
        .map(f => StructField(f.trim, DataTypes.IntegerType))
    )
    val data =
      """
        | 1| 100|    1| 35|   |   |   |444|   |   |
        | 2| 200|    3| 30| 15| 18|   |111|222|333|
        | 3| 300|    2| 18| 20|   |   |555|666|   |
        | 4| 400|    4| 28| 60| 80| 90|777|888|123|456
        | 5| 500|    1| 45|   |   |   |245|   |   |
      """.stripMargin
    // Strip the padding so blank cells become empty fields, which are read as null
    val df = spark.read
      .schema(schema)
      .option("sep", "|")
      .csv(data.split(System.lineSeparator()).map(_.replaceAll("\\s*", "")).toSeq.toDS())
    df.show(false)
    df.printSchema()

Result:

+---+----+-----+---+----+----+----+---+----+----+----+
|ID |LOAN|COUNT|A1 |A2  |A3  |A4  |B1 |B2  |B3  |B4  |
+---+----+-----+---+----+----+----+---+----+----+----+
|1  |100 |1    |35 |null|null|null|444|null|null|null|
|2  |200 |3    |30 |15  |18  |null|111|222 |333 |null|
|3  |300 |2    |18 |20  |null|null|555|666 |null|null|
|4  |400 |4    |28 |60  |80  |90  |777|888 |123 |456 |
|5  |500 |1    |45 |null|null|null|245|null|null|null|
+---+----+-----+---+----+----+----+---+----+----+----+

root
 |-- ID: integer (nullable = true)
 |-- LOAN: integer (nullable = true)
 |-- COUNT: integer (nullable = true)
 |-- A1: integer (nullable = true)
 |-- A2: integer (nullable = true)
 |-- A3: integer (nullable = true)
 |-- A4: integer (nullable = true)
 |-- B1: integer (nullable = true)
 |-- B2: integer (nullable = true)
 |-- B3: integer (nullable = true)
 |-- B4: integer (nullable = true)

  2. Unpivot the table and remove null entries
    // stack(4, ...) emits 4 (A, B) rows per input row; pairs containing a null are dropped
    df.selectExpr(
      "ID",
      "LOAN",
      "stack(4, A1, B1, A2, B2, A3, B3, A4, B4) as (A, B)"
    ).where("A is not null and B is not null").show(false)

Result:

+---+----+---+---+
|ID |LOAN|A  |B  |
+---+----+---+---+
|1  |100 |35 |444|
|2  |200 |30 |111|
|2  |200 |15 |222|
|2  |200 |18 |333|
|3  |300 |18 |555|
|3  |300 |20 |666|
|4  |400 |28 |777|
|4  |400 |60 |888|
|4  |400 |80 |123|
|4  |400 |90 |456|
|5  |500 |45 |245|
+---+----+---+---+
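
To see why the argument order A1, B1, A2, B2, ... matters, here is a plain-Scala model of how stack lays out its arguments: it fills rows left to right, so 8 expressions over 4 rows give 2 columns. The `stack` function below is a hypothetical stand-in written for this sketch, not Spark's implementation, and it assumes the number of expressions divides evenly by the row count.

```scala
// Plain-Scala model of stack(n, e1, ..., ek); not Spark's implementation.
object StackModel {
  // Lays the expressions out row by row into n rows
  // (assumes a non-empty list whose size is a multiple of n).
  def stack[T](n: Int, exprs: Seq[T]): List[Seq[T]] = {
    require(n > 0 && exprs.nonEmpty && exprs.size % n == 0,
      "expression count must be a non-zero multiple of n")
    exprs.grouped(exprs.size / n).toList
  }

  // Row ID = 2 from the input: A1..A4 = 30, 15, 18, null and B1..B4 = 111, 222, 333, null.
  val a: Seq[Option[Int]] = Seq(Some(30), Some(15), Some(18), None)
  val b: Seq[Option[Int]] = Seq(Some(111), Some(222), Some(333), None)

  // Interleave as A1, B1, A2, B2, ... in the same order as the stack(...) call.
  val interleaved: Seq[Option[Int]] = a.zip(b).flatMap { case (x, y) => Seq(x, y) }

  // Equivalent of: where A is not null and B is not null
  val kept: List[(Int, Int)] = stack(4, interleaved).collect {
    case Seq(Some(x), Some(y)) => (x, y)
  }
}
```

`StackModel.kept` contains the three pairs (30, 111), (15, 222), (18, 333), the same rows the result table shows for ID 2. Passing the columns as A1, A2, A3, A4, B1, ... instead would pair A1 with A2, which is why the interleaved order is needed.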

If you are reading the data as string type, filter the result on empty strings instead of null.

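array_except removes the empty values from array(a1,a2,a3,a4) and array(b1,b2,b3,b4).

arrays_zip combines the two cleaned arrays position by position.

explode turns each pair combined by arrays_zip into its own row.

Check the code below.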

scala> adf.show(false)
+---+----+-----+---+---+---+---+---+---+---+---+
|id |loan|count|a1 |a2 |a3 |a4 |b1 |b2 |b3 |b4 |
+---+----+-----+---+---+---+---+---+---+---+---+
|1  |100 |1    |35 |   |   |   |444|   |   |   |
|2  |200 |3    |30 |15 |18 |   |111|222|333|   |
|3  |300 |2    |18 |20 |   |   |555|666|   |   |
|4  |400 |4    |28 |60 |80 |90 |777|888|123|456|
|5  |500 |1    |45 |   |   |   |245|   |   |   |
+---+----+-----+---+---+---+---+---+---+---+---+


scala> :paste
// Entering paste mode (ctrl-D to finish)

adf
.withColumn("ab",explode(
    arrays_zip(
        array_except(array($"a1",$"a2",$"a3",$"a4"),array(lit(""))),
        array_except(array($"b1",$"b2",$"b3",$"b4"),array(lit("")))
        )
    )
)
.select($"id",$"loan",$"ab".cast("struct<a:string,b:string>"))
.select($"id",$"loan",$"ab.a".as("a"),$"ab.b".as("b"))
.show(false)

// Exiting paste mode, now interpreting.

+---+----+---+---+
|id |loan|a  |b  |
+---+----+---+---+
|1  |100 |35 |444|
|2  |200 |30 |111|
|2  |200 |15 |222|
|2  |200 |18 |333|
|3  |300 |18 |555|
|3  |300 |20 |666|
|4  |400 |28 |777|
|4  |400 |60 |888|
|4  |400 |80 |123|
|4  |400 |90 |456|
|5  |500 |45 |245|
+---+----+---+---+
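
One thing to watch with this approach: Spark documents array_except as returning its result without duplicates, and arrays_zip pairs elements purely by position, padding the shorter array with null. The plain-Scala model below sketches what that could mean for a row whose A values repeat. `arrayExcept` and `arraysZip` are hypothetical stand-ins written for this illustration (assuming element order is preserved), and the sample row is made up rather than taken from the data above.

```scala
// Plain-Scala model of array_except + arrays_zip; not Spark's implementation.
object ZipCaveat {
  // Model of array_except: elements of xs that are not in ys, with duplicates removed.
  def arrayExcept(xs: Seq[String], ys: Seq[String]): Seq[String] =
    xs.distinct.filterNot(ys.contains)

  // Model of arrays_zip: pair by position, padding the shorter side (None plays the role of null).
  def arraysZip(xs: Seq[String], ys: Seq[String]): Seq[(Option[String], Option[String])] = {
    val left: Seq[Option[String]] = xs.map(x => Option(x))
    val right: Seq[Option[String]] = ys.map(y => Option(y))
    left.zipAll(right, None, None)
  }

  // Made-up row: the A side contains the value "30" twice.
  val a: Seq[String] = Seq("30", "30", "18", "")
  val b: Seq[String] = Seq("111", "222", "333", "")

  // Order used in the answer: remove blanks first, then zip.
  val viaExcept: Seq[(Option[String], Option[String])] =
    arraysZip(arrayExcept(a, Seq("")), arrayExcept(b, Seq("")))

  // Alternative order: zip by position first, then drop pairs with a blank side.
  val zipThenFilter: Seq[(String, String)] =
    a.zip(b).filter { case (x, y) => x.nonEmpty && y.nonEmpty }
}
```

In the model, the repeated 30 collapses, so `viaExcept` pairs 18 with 222 instead of 333 and leaves 333 without a partner. `zipThenFilter` keeps the positions aligned. The sample data in the question has no repeated values within a row, so the output shown above is unaffected.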
