
Add values to a dataframe against some particular ID in Spark Scala

My code:

var data = spark.read.option("header", "true")
      .option("inferSchema", "true")
      .csv("src/main/resources/student.csv")
data.show()

The data looks like:

ID  Name  City
1   Ali   swl
2   Sana  lhr
3   Ahad  khi
4   ABC   fsd

Now I have a list of values like (1, 2, 1):

val nums: List[Int] = List(1, 2, 1)

I want to add these values against ID = 3, so that the DataFrame looks like:

ID  Name  City  newCol  newCol2  newCol3
1   Ali   swl    null     null    null
2   Sana  lhr    null     null    null
3   Ahad  khi     1        2        1
4   ABC   fsd    null     null    null

I wonder if this is possible? Any help would be appreciated. Thanks.

First you can convert the list to a DataFrame with a single array column, and then "decompose" the array column into separate columns as follows:

import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

val numsDf =
  Seq(nums)
    .toDF("nums")
    .select(nums.indices.map(i => col("nums")(i).alias(s"newCol$i")): _*)
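
For reference, numsDf at this point is a single-row DataFrame (one column per list element), so numsDf.show() should print:

+-------+-------+-------+
|newCol0|newCol1|newCol2|
+-------+-------+-------+
|      1|      2|      1|
+-------+-------+-------+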

After that you can use an outer join to join data to numsDf with the ID == 3 condition as follows:

val resultDf = data.join(numsDf, data.col("ID") === lit(3), "outer") 

resultDf.show() will print:

+---+----+----+-------+-------+-------+
| ID|Name|City|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
|  1| Ali| swl|   null|   null|   null|
|  2|Sana| lhr|   null|   null|   null|
|  3|Ahad| khi|      1|      2|      1|
|  4| ABC| fsd|   null|   null|   null|
+---+----+----+-------+-------+-------+
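
As a variant (my addition, not part of the original answer): since the join condition only references data's ID column, a left join produces the same result here while guaranteeing that only rows of data are kept:

val resultDf = data.join(numsDf, data.col("ID") === lit(3), "left")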

Make sure you have added the spark.sql.crossJoin.enabled = true option to the Spark session:

val spark = SparkSession.builder()
  ...
  .config("spark.sql.crossJoin.enabled", value = true)
  .getOrCreate()
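
(Note: in Spark 3.0 and later, spark.sql.crossJoin.enabled defaults to true, so this configuration should only be necessary on older Spark versions.)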

Yes, it's possible.

Use when for populating matched values and otherwise for unmatched values.

I have used zipWithIndex to make the column names unique.

Please check the code below.

scala> import org.apache.spark.sql.functions._

scala> val df = Seq((1,"Ali","swl"),(2,"Sana","lhr"),(3,"Ahad","khi"),(4,"ABC","fsd")).toDF("id","name","city") // Creating DataFrame with given sample data.
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]

scala> val nums = List(1,2,1) // List values.
nums: List[Int] = List(1, 2, 1)

scala> val filterData = List(3) // Only populate values for ID = 3, per the question.
filterData: List[Int] = List(3)

scala> spark.time{ nums.zipWithIndex.foldLeft(df)((df,c) => df.withColumn(s"newCol${c._2}",when($"id".isin(filterData:_*),c._1).otherwise(null))).show(false) } // Used zipWithIndex to make column names unique.
+---+----+----+-------+-------+-------+
|id |name|city|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
|1  |Ali |swl |null   |null   |null   |
|2  |Sana|lhr |null   |null   |null   |
|3  |Ahad|khi |1      |2      |1      |
|4  |ABC |fsd |null   |null   |null   |
+---+----+----+-------+-------+-------+

Time taken: 43 ms

scala>
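
For readability, the same logic as the REPL one-liner above can be written out as a standalone snippet (a sketch under the same assumptions: spark.implicits._ in scope, and df, nums, filterData as defined above):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.when

// Add one column per list element: rows whose id is in filterData get
// the element's value; all other rows are left as null (the default
// for a `when` with no `otherwise`).
val result: DataFrame = nums.zipWithIndex.foldLeft(df) {
  case (acc, (value, idx)) =>
    acc.withColumn(s"newCol$idx", when($"id".isin(filterData: _*), value))
}

result.show(false)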
