Add values to a DataFrame against a particular ID in Spark Scala
My code:
var data = spark.read.option("header", "true")
  .option("inferSchema", "true")
  .csv("src/main/resources/student.csv")
data.show()
The data looks like:
ID Name City
1 Ali swl
2 Sana lhr
3 Ahad khi
4 ABC fsd
Now I have a list of values like (1, 2, 1):
val nums: List[Int] = List(1, 2, 1)
I want to add these values against ID = 3, so that the DataFrame looks like:
ID Name City newCol newCol2 newCol3
1 Ali swl null null null
2 Sana lhr null null null
3 Ahad khi 1 2 1
4 ABC fsd null null null
I wonder if it is possible? Any help will be appreciated. Thanks.
First, you can convert the list to a DataFrame with a single array column, and then "decompose" the array column into separate columns as follows:
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._
val numsDf = Seq(nums)
  .toDF("nums")
  .select(nums.indices.map(i => col("nums")(i).alias(s"newCol$i")): _*)
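The select above produces one column per list index. The name generation itself is plain Scala and can be sketched in isolation (using the same List(1, 2, 1) from the question); because the suffix is the index, not the value, the names stay distinct even when values repeat:

```scala
val nums = List(1, 2, 1)
// indices is the Range 0 until nums.length; each index i becomes the
// suffix of the alias given to col("nums")(i) in the select above.
val colNames = nums.indices.map(i => s"newCol$i")
// colNames: Vector("newCol0", "newCol1", "newCol2")
```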
After that, you can use an outer join to join data to numsDf with the condition ID == 3, as follows:
val resultDf = data.join(numsDf, data.col("ID") === lit(3), "outer")
resultDf.show()
will print:
+---+----+----+-------+-------+-------+
| ID|Name|City|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
| 1| Ali| swl| null| null| null|
| 2|Sana| lhr| null| null| null|
|  3|Ahad| khi|      1|      2|      1|
| 4| ABC| fsd| null| null| null|
+---+----+----+-------+-------+-------+
Because the join condition references only data's columns, Spark plans this as a cross join, so make sure you have added the spark.sql.crossJoin.enabled = true option to the Spark session:
val spark = SparkSession.builder()
...
.config("spark.sql.crossJoin.enabled", value = true)
.getOrCreate()
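Alternatively, you can avoid the cross-join setting entirely by attaching the target ID to numsDf and doing a plain left equi-join. This is a minimal self-contained sketch of that idea, not part of the original answer; it assumes a local Spark session, and keyedNums and resultDf2 are hypothetical names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
import spark.implicits._

// Same sample data as in the question.
val data = Seq((1, "Ali", "swl"), (2, "Sana", "lhr"), (3, "Ahad", "khi"), (4, "ABC", "fsd"))
  .toDF("ID", "Name", "City")
val nums = List(1, 2, 1)

// Single-row DataFrame with one column per list index, as in the answer above.
val numsDf = Seq(nums).toDF("nums")
  .select(nums.indices.map(i => col("nums")(i).alias(s"newCol$i")): _*)

// Attach the target ID so the join becomes an equi-join on "ID":
// the row with ID = 3 picks up the values, every other row gets nulls
// from the left join, and no cross-join configuration is needed.
val keyedNums = numsDf.withColumn("ID", lit(3))
val resultDf2 = data.join(keyedNums, Seq("ID"), "left")
resultDf2.show()
```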
Yes, it's possible. Use when for populating matched values and otherwise for non-matched values.
I have used zipWithIndex to make the column names unique.
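As a plain-Scala illustration (using the same List(1, 2, 1) from the question), zipWithIndex pairs each value with its position, and the position becomes the column-name suffix, so duplicate values such as the two 1s still produce distinct names:

```scala
val nums = List(1, 2, 1)
// Pair each value with its index; the index makes the generated name unique
// even though the value 1 appears twice.
val pairs = nums.zipWithIndex.map { case (v, i) => (s"newCol$i", v) }
// pairs: List(("newCol0", 1), ("newCol1", 2), ("newCol2", 1))
```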
Please check the code below.
scala> import org.apache.spark.sql.functions._
scala> val df = Seq((1,"Ali","swl"),(2,"Sana","lhr"),(3,"Ahad","khi"),(4,"ABC","fsd")).toDF("id","name","city") // Creating DataFrame with given sample data.
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> val nums = List(1,2,1) // List values.
nums: List[Int] = List(1, 2, 1)
scala> val filterData = List(3,4) // IDs whose rows get the values; note this populates ID 4 as well as ID 3.
filterData: List[Int] = List(3, 4)
scala> spark.time{ nums.zipWithIndex.foldLeft(df)((df,c) => df.withColumn(s"newCol${c._2}",when($"id".isin(filterData:_*),c._1).otherwise(null))).show(false) } // Used zipWithIndex to make column names unique.
+---+----+----+-------+-------+-------+
|id |name|city|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
|1 |Ali |swl |null |null |null |
|2 |Sana|lhr |null |null |null |
|3 |Ahad|khi |1 |2 |1 |
|4 |ABC |fsd |1 |2 |1 |
+---+----+----+-------+-------+-------+
Time taken: 43 ms