
Add values to a dataframe against some particular ID in Spark Scala

My Code:

var data = spark.read.option("header", "true")
  .option("inferSchema", "true")
  .csv("src/main/resources/student.csv")
data.show()

Data looks like:

ID  Name  City
1   Ali   swl
2   Sana  lhr
3   Ahad  khi
4   ABC   fsd

Now I have a list of values like (1,2,1).

val nums: List[Int] = List(1, 2, 1)

I want to add these values against ID = 3, so that the DataFrame looks like:

ID  Name  City  newCol  newCol2  newCol3
1   Ali   swl    null     null    null
2   Sana  lhr    null     null    null
3   Ahad  khi     1        2        1
4   ABC   fsd    null     null    null

Is this possible? Any help will be appreciated. Thanks.

First, you can convert the list to a DataFrame with a single array column, and then "decompose" the array column into separate columns as follows:

import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

val numsDf =
  Seq(nums)
    .toDF("nums")
    .select(nums.indices.map(i => col("nums")(i).alias(s"newCol$i")): _*)
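For reference, numsDf is a single-row DataFrame with one column per list element; numsDf.show() should print:

+-------+-------+-------+
|newCol0|newCol1|newCol2|
+-------+-------+-------+
|      1|      2|      1|
+-------+-------+-------+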

After that, you can outer join data to numsDf with an ID == 3 condition as follows:

val resultDf = data.join(numsDf, data.col("ID") === lit(3), "outer") 

resultDf.show() will print:

+---+----+----+-------+-------+-------+
| ID|Name|City|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
|  1| Ali| swl|   null|   null|   null|
|  2|Sana| lhr|   null|   null|   null|
|  3|Ahad| khi|      1|      2|      1|
|  4| ABC| fsd|   null|   null|   null|
+---+----+----+-------+-------+-------+

Because the join condition does not reference numsDf at all, Spark plans this as a cross join, so make sure you have added the spark.sql.crossJoin.enabled = true option to the Spark session:

val spark = SparkSession.builder()
  ...
  .config("spark.sql.crossJoin.enabled", value = true)
  .getOrCreate()
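If you would rather not enable cross joins, a minimal alternative sketch (reusing the data and numsDf defined above) is to tag numsDf with the target ID and use an ordinary left join, which produces the same result:

val numsWithId = numsDf.withColumn("ID", lit(3))          // tag the single row with the target ID
val resultDf2 = data.join(numsWithId, Seq("ID"), "left")  // rows with other IDs get nulls in the new columns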

Yes, it's possible.

Use when to populate values for matched rows and otherwise for unmatched rows.

I have used zipWithIndex to make the column names unique.

Please check the code below.

scala> import org.apache.spark.sql.functions._

scala> val df = Seq((1,"Ali","swl"),(2,"Sana","lhr"),(3,"Ahad","khi"),(4,"ABC","fsd")).toDF("id","name","city") // Creating DataFrame with given sample data.
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]

scala> val nums = List(1,2,1) // List values.
nums: List[Int] = List(1, 2, 1)

scala> val filterData = List(3,4) // IDs whose rows should be populated; use List(3) to populate only ID 3, as in the question.
filterData: List[Int] = List(3, 4)

scala> spark.time{ nums.zipWithIndex.foldLeft(df)((df,c) => df.withColumn(s"newCol${c._2}",when($"id".isin(filterData:_*),c._1).otherwise(null))).show(false) } // Used zipWithIndex to make column names unique.
+---+----+----+-------+-------+-------+
|id |name|city|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
|1  |Ali |swl |null   |null   |null   |
|2  |Sana|lhr |null   |null   |null   |
|3  |Ahad|khi |1      |2      |1      |
|4  |ABC |fsd |1      |2      |1      |
+---+----+----+-------+-------+-------+

Time taken: 43 ms

scala>
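Outside the REPL, the same fold can be wrapped in a small helper. Below is a sketch under the assumption that df and nums are defined as above; addNumCols is a hypothetical name, and List(3) is passed so that only ID 3 is populated, matching the question exactly:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when, lit}

// Add one newCol per list element, populated only for rows whose id is in ids.
// when(...) without otherwise leaves unmatched rows as null, same as .otherwise(null).
def addNumCols(df: DataFrame, values: List[Int], ids: List[Int]): DataFrame =
  values.zipWithIndex.foldLeft(df) { case (acc, (value, idx)) =>
    acc.withColumn(s"newCol$idx", when(col("id").isin(ids: _*), lit(value)))
  }

addNumCols(df, nums, List(3)).show(false)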
