Add new rows to a Spark DataFrame using Scala
I have a dataframe like:

Name_Index City_Index
2.0        1.0
0.0        2.0
1.0        0.0
I have a new list of values:

List(1.0, 1.0)

I want to add these values as a new row in the dataframe, and have all the previous rows dropped.
My code:

val spark = SparkSession.builder
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

var data = spark.read.option("header", "true")
  .option("inferSchema", "true")
  .csv("src/main/resources/student.csv")

val someDF = Seq(
  (1.0, 1.0)
).toDF("Name_Index", "City_Index")

data = data.union(someDF)
data.show()
It shows output like:

Name_Index City_Index
2.0        1.0
0.0        2.0
1.0        0.0
1.0        1.0
But the output should look like this, with all the previous rows dropped and only the new values added:

Name_Index City_Index
1.0        1.0
You can achieve this using the limit and union functions. Check below.
scala> val df = Seq((2.0,1.0),(0.0,2.0),(1.0,0.0)).toDF("name_index","city_index")
df: org.apache.spark.sql.DataFrame = [name_index: double, city_index: double]
scala> df.show(false)
+----------+----------+
|name_index|city_index|
+----------+----------+
|2.0 |1.0 |
|0.0 |2.0 |
|1.0 |0.0 |
+----------+----------+
scala> val ndf = Seq((1.0,1.0)).toDF("name_index","city_index")
ndf: org.apache.spark.sql.DataFrame = [name_index: double, city_index: double]
scala> ndf.show
+----------+----------+
|name_index|city_index|
+----------+----------+
| 1.0| 1.0|
+----------+----------+
scala> df.limit(0).union(ndf).show(false) // not a great approach in practice; you could simply call ndf.show
+----------+----------+
|name_index|city_index|
+----------+----------+
|1.0 |1.0 |
+----------+----------+
Change the last two lines to:

data = data.except(data).union(someDF)
data.show()
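Note that except compares data against itself just to produce an empty DataFrame, so it is heavier than limit(0) or a false filter, but it works. A minimal self-contained sketch of the pattern (using an inline Seq instead of the CSV source above, so it runs on its own):

```scala
import org.apache.spark.sql.SparkSession

object ReplaceRowsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    import spark.implicits._

    var data = Seq((2.0, 1.0), (0.0, 2.0), (1.0, 0.0)).toDF("Name_Index", "City_Index")
    val someDF = Seq((1.0, 1.0)).toDF("Name_Index", "City_Index")

    // except(data) removes every row that also appears in data itself,
    // leaving an empty DataFrame that keeps the original schema.
    data = data.except(data).union(someDF)
    data.show()

    spark.stop()
  }
}
```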
You could try this approach:

data = data.filter(_ => false).union(someDF)
Output:
+----------+----------+
|Name_Index|City_Index|
+----------+----------+
|1.0 |1.0 |
+----------+----------+
I hope this gives you some insight. Regards.
As far as I can see, you only need the list of columns from the source DataFrame.
If your sequence has its columns in the same order as the source DataFrame, you can reuse the schema without actually querying the source DataFrame. Performance-wise, this will be faster.
val srcDf = Seq((2.0,1.0),(0.0,2.0),(1.0,0.0)).toDF("name_index","city_index")
val dstDf = Seq((1.0, 1.0)).toDF(srcDf.columns:_*)
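One caveat worth checking (a small sketch continuing from the srcDf and dstDf values above, assuming spark.implicits._ is in scope): toDF(srcDf.columns: _*) copies only the column names from the source, while the column types are inferred from the new Seq.

```scala
// Continuing from srcDf and dstDf defined above:
println(dstDf.columns.mkString(", "))  // name_index, city_index
dstDf.printSchema()  // types come from the new Seq, not from srcDf
```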