Transpose Specific Columns and Rows in Dataframe based on Condition in Spark Scala
I have the scenario below: the source DataFrame needs to be transposed from columns to rows using Spark Scala.
Source DataFrame:
+--+----+-----+---+---+---+---+---+---+---+---+
|ID|LOAN|COUNT|A1 |A2 |A3 |A4 |B1 |B2 |B3 |B4 |
+--+----+-----+---+---+---+---+---+---+---+---+
| 1| 100|    1| 35|   |   |   |444|   |   |   |
| 2| 200|    3| 30| 15| 18|   |111|222|333|   |
| 3| 300|    2| 18| 20|   |   |555|666|   |   |
| 4| 400|    4| 28| 60| 80| 90|777|888|123|456|
| 5| 500|    1| 45|   |   |   |245|   |   |   |
+--+----+-----+---+---+---+---+---+---+---+---+
The expected result is shown below: the A and B columns are converted to rows according to the value in the COUNT field.
Expected DataFrame:
+--+----+---+---+
|ID|LOAN|  A|  B|
+--+----+---+---+
| 1| 100| 35|444|
| 2| 200| 30|111|
| 2| 200| 15|222|
| 2| 200| 18|333|
| 3| 300| 18|555|
| 3| 300| 20|666|
| 4| 400| 28|777|
| 4| 400| 60|888|
| 4| 400| 80|123|
| 4| 400| 90|456|
| 5| 500| 45|245|
+--+----+---+---+
I think your use case is unpivoting the table.
I tried to solve it with the approach below-
Read the input
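import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

// sqlContext is assumed to come from the surrounding environment;
// in spark-shell a SparkSession named `spark` is already defined.
val spark = sqlContext.sparkSession
val implicits = spark.implicits
import implicits._
// Build an all-integer schema from the header line.
val schema = StructType(
  "ID|LOAN|COUNT|A1 |A2 |A3 |A4 |B1 |B2 |B3 |B4"
    .split("\\|")
    .map(f => StructField(f.trim, DataTypes.IntegerType))
)
// stripMargin removes everything up to and including the leading '|' of each line.
val data =
  """
    | 1| 100| 1| 35| | | |444| | |
    | 2| 200| 3| 30| 15| 18| |111|222|333|
    | 3| 300| 2| 18| 20| | |555|666| |
    | 4| 400| 4| 28| 60| 80| 90|777|888|123|456
    | 5| 500| 1| 45| | | |245| | |
  """.stripMargin
// Strip all whitespace, then parse the lines as pipe-separated CSV;
// empty fields are read as null.
val df = spark.read
  .schema(schema)
  .option("sep", "|")
  .csv(data.split(System.lineSeparator()).map(_.replaceAll("\\s*", "")).toSeq.toDS())
df.show(false)
df.printSchema()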
Result-
+---+----+-----+---+----+----+----+---+----+----+----+
|ID |LOAN|COUNT|A1 |A2 |A3 |A4 |B1 |B2 |B3 |B4 |
+---+----+-----+---+----+----+----+---+----+----+----+
|1 |100 |1 |35 |null|null|null|444|null|null|null|
|2 |200 |3 |30 |15 |18 |null|111|222 |333 |null|
|3 |300 |2 |18 |20 |null|null|555|666 |null|null|
|4 |400 |4 |28 |60 |80 |90 |777|888 |123 |456 |
|5 |500 |1 |45 |null|null|null|245|null|null|null|
+---+----+-----+---+----+----+----+---+----+----+----+
root
|-- ID: integer (nullable = true)
|-- LOAN: integer (nullable = true)
|-- COUNT: integer (nullable = true)
|-- A1: integer (nullable = true)
|-- A2: integer (nullable = true)
|-- A3: integer (nullable = true)
|-- A4: integer (nullable = true)
|-- B1: integer (nullable = true)
|-- B2: integer (nullable = true)
|-- B3: integer (nullable = true)
|-- B4: integer (nullable = true)
Unpivot the table and remove the null entries
// stack(4, ...) emits four (A, B) rows per input row, one per column pair.
df.selectExpr(
  "ID",
  "LOAN",
  "stack(4, A1, B1, A2, B2, A3, B3, A4, B4) as (A, B)"
).where("A is not null and B is not null").show(false)
Result-
+---+----+---+---+
|ID |LOAN|A |B |
+---+----+---+---+
|1 |100 |35 |444|
|2 |200 |30 |111|
|2 |200 |15 |222|
|2 |200 |18 |333|
|3 |300 |18 |555|
|3 |300 |20 |666|
|4 |400 |28 |777|
|4 |400 |60 |888|
|4 |400 |80 |123|
|4 |400 |90 |456|
|5 |500 |45 |245|
+---+----+---+---+
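The filter above drops pairs by nullness, which agrees with COUNT only when the filled cells are exactly the first COUNT pairs. As a hedged sketch of applying the COUNT condition directly, each pair can be tagged with its position and kept only while the position is at most COUNT. The `Loan` case class, the `pos` column, and the local session are illustrative assumptions, not part of the original answer; the two rows are taken from the question.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical row type for the sample; Option[Int] models the empty cells.
case class Loan(ID: Int, LOAN: Int, COUNT: Int,
                A1: Option[Int], A2: Option[Int], A3: Option[Int], A4: Option[Int],
                B1: Option[Int], B2: Option[Int], B3: Option[Int], B4: Option[Int])

// Local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("unpivot-by-count").getOrCreate()
import spark.implicits._

val df = Seq(
  Loan(1, 100, 1, Some(35), None, None, None, Some(444), None, None, None),
  Loan(2, 200, 3, Some(30), Some(15), Some(18), None, Some(111), Some(222), Some(333), None)
).toDF()

val byCount = df
  .selectExpr(
    "ID", "LOAN", "`COUNT`",
    // Each group of three values becomes one output row: (position, A, B).
    "stack(4, 1, A1, B1, 2, A2, B2, 3, A3, B3, 4, A4, B4) as (pos, A, B)"
  )
  .where($"pos" <= $"COUNT") // keep only the first COUNT pairs
  .select("ID", "LOAN", "A", "B")

byCount.show(false)
```

Whether to filter on nulls or on COUNT depends on which of the two the data guarantees; if both hold, the two filters should return the same rows.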
If you read the data as string type instead, you can filter the unpivoted result on empty strings rather than on null.
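A minimal sketch of that string-typed variant, assuming the empty cells arrive as empty strings (the CSV reader may still turn them into null, depending on its options). The name `sdf` and the local session are illustrative; the rows are rows 1 and 3 of the question.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("unpivot-strings").getOrCreate()
import spark.implicits._

// All columns are strings, with "" standing in for the empty cells.
val sdf = Seq(
  ("1", "100", "1", "35", "", "", "", "444", "", "", ""),
  ("3", "300", "2", "18", "20", "", "", "555", "666", "", "")
).toDF("ID", "LOAN", "COUNT", "A1", "A2", "A3", "A4", "B1", "B2", "B3", "B4")

val unpivoted = sdf
  .selectExpr("ID", "LOAN", "stack(4, A1, B1, A2, B2, A3, B3, A4, B4) as (A, B)")
  // A comparison with null is never true, so null pairs are dropped here too.
  .where("A <> '' and B <> ''")

unpivoted.show(false)
```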
Use arrays_zip to combine the a1,a2,a3,a4 & b1,b2,b3,b4 columns.
Use array_except to remove the empty values.
Use explode to split the combined values of arrays_zip into rows.
Check the code below.
scala> adf.show(false)
+---+----+-----+---+---+---+---+---+---+---+---+
|id |loan|count|a1 |a2 |a3 |a4 |b1 |b2 |b3 |b4 |
+---+----+-----+---+---+---+---+---+---+---+---+
|1 |100 |1 |35 | | | |444| | | |
|2 |200 |3 |30 |15 |18 | |111|222|333| |
|3 |300 |2 |18 |20 | | |555|666| | |
|4 |400 |4 |28 |60 |80 |90 |777|888|123|456|
|5 |500 |1 |45 | | | |245| | | |
+---+----+-----+---+---+---+---+---+---+---+---+
scala> :paste
// Entering paste mode (ctrl-D to finish)
adf
  .withColumn("ab", explode(
    arrays_zip(
      array_except(array($"a1", $"a2", $"a3", $"a4"), array(lit(""))),
      array_except(array($"b1", $"b2", $"b3", $"b4"), array(lit("")))
    )
  ))
  .select($"id", $"loan", $"ab".cast("struct<a:string,b:string>"))
  .select($"id", $"loan", $"ab.a".as("a"), $"ab.b".as("b"))
  .show(false)
// Exiting paste mode, now interpreting.
+---+----+---+---+
|id |loan|a |b |
+---+----+---+---+
|1 |100 |35 |444|
|2 |200 |30 |111|
|2 |200 |15 |222|
|2 |200 |18 |333|
|3 |300 |18 |555|
|3 |300 |20 |666|
|4 |400 |28 |777|
|4 |400 |60 |888|
|4 |400 |80 |123|
|4 |400 |90 |456|
|5 |500 |45 |245|
+---+----+---+---+
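One caveat with this approach: array_except is documented to return its result without duplicates, so a row whose a-values (or b-values) repeat could lose an element before zipping and end up misaligned. A hedged alternative, not taken from the answer above, is to pair the columns first as an array of structs, explode that, and filter afterwards. The second sample row is hypothetical and added only to show a repeated value.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, explode, struct}

// Local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("unpivot-structs").getOrCreate()
import spark.implicits._

val adf = Seq(
  ("3", "300", "2", "18", "20", "", "", "555", "666", "", ""), // row 3 of the question
  ("9", "900", "2", "30", "30", "", "", "111", "222", "", "")  // hypothetical: a1 == a2
).toDF("id", "loan", "count", "a1", "a2", "a3", "a4", "b1", "b2", "b3", "b4")

// Build one struct per (aN, bN) pair so the pairing is fixed before any filtering.
val pairStructs = (1 to 4).map(i => struct(col(s"a$i").as("a"), col(s"b$i").as("b")))

val pairs = adf
  .withColumn("ab", explode(array(pairStructs: _*)))
  .where($"ab.a" =!= "" && $"ab.b" =!= "") // drop the empty pairs after exploding
  .select($"id", $"loan", $"ab.a".as("a"), $"ab.b".as("b"))

pairs.show(false)
```

Because each a-value is tied to its b-value before anything is removed, the hypothetical row keeps both of its pairs.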