
Add values to a dataframe against some particular ID in Spark Scala

My code:

var data = spark.read.option("header", "true")
      .option("inferSchema", "true")
      .csv("src/main/resources/student.csv")
data.show()

The data looks like:

ID  Name  City
1   Ali   swl
2   Sana  lhr
3   Ahad  khi
4   ABC   fsd

Now I have a list of values like (1, 2, 1):

val nums: List[Int] = List(1, 2, 1)

I want to add these values against ID = 3, so that the DataFrame looks like:

ID  Name  City  newCol  newCol2  newCol3
1   Ali   swl    null     null    null
2   Sana  lhr    null     null    null
3   Ahad  khi     1        2        1
4   ABC   fsd    null     null    null

I wonder if this is possible? Any help would be appreciated. Thanks.

First you can convert the list to a DataFrame with a single array column, and then "decompose" the array column into separate columns as follows:

import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

val numsDf =
  Seq(nums)
    .toDF("nums")
    .select(nums.indices.map(i => col("nums")(i).alias(s"newCol$i")): _*)
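
For reference, numsDf at this point is a single-row DataFrame (one column per list element), so numsDf.show() should print:

+-------+-------+-------+
|newCol0|newCol1|newCol2|
+-------+-------+-------+
|      1|      2|      1|
+-------+-------+-------+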

After that you can use an outer join to join data to numsDf with the ID == 3 condition as follows:

val resultDf = data.join(numsDf, data.col("ID") === lit(3), "outer") 

resultDf.show() will print:

+---+----+----+-------+-------+-------+
| ID|Name|City|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
|  1| Ali| swl|   null|   null|   null|
|  2|Sana| lhr|   null|   null|   null|
|  3|Ahad| khi|      1|      2|      1|
|  4| ABC| fsd|   null|   null|   null|
+---+----+----+-------+-------+-------+
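
As a variant (my addition, not part of the original answer): since the join condition only references data's ID column, a left join produces the same result here while guaranteeing that only rows of data are kept:

val resultDf = data.join(numsDf, data.col("ID") === lit(3), "left")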

Make sure you have added the spark.sql.crossJoin.enabled = true option to the Spark session:

val spark = SparkSession.builder()
  ...
  .config("spark.sql.crossJoin.enabled", value = true)
  .getOrCreate()
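
(Note: in Spark 3.0 and later, spark.sql.crossJoin.enabled defaults to true, so this configuration should only be necessary on older Spark versions.)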

Yes, it's possible.

Use when for populating matched values and otherwise for unmatched values.

I have used zipWithIndex to make the column names unique.

Please check the code below.

scala> import org.apache.spark.sql.functions._

scala> val df = Seq((1,"Ali","swl"),(2,"Sana","lhr"),(3,"Ahad","khi"),(4,"ABC","fsd")).toDF("id","name","city") // Creating DataFrame with given sample data.
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]

scala> val nums = List(1,2,1) // List values.
nums: List[Int] = List(1, 2, 1)

scala> val filterData = List(3) // Only populate values for ID = 3, per the question.
filterData: List[Int] = List(3)

scala> spark.time{ nums.zipWithIndex.foldLeft(df)((df,c) => df.withColumn(s"newCol${c._2}",when($"id".isin(filterData:_*),c._1).otherwise(null))).show(false) } // Used zipWithIndex to make column names unique.
+---+----+----+-------+-------+-------+
|id |name|city|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
|1  |Ali |swl |null   |null   |null   |
|2  |Sana|lhr |null   |null   |null   |
|3  |Ahad|khi |1      |2      |1      |
|4  |ABC |fsd |null   |null   |null   |
+---+----+----+-------+-------+-------+

Time taken: 43 ms

scala>
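
For readability, the same logic as the REPL one-liner above can be written out as a standalone snippet (a sketch under the same assumptions: spark.implicits._ in scope, and df, nums, filterData as defined above):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.when

// Add one column per list element: rows whose id is in filterData get
// the element's value; all other rows are left as null (the default
// for a `when` with no `otherwise`).
val result: DataFrame = nums.zipWithIndex.foldLeft(df) {
  case (acc, (value, idx)) =>
    acc.withColumn(s"newCol$idx", when($"id".isin(filterData: _*), value))
}

result.show(false)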
