Add new rows to a Spark DataFrame using Scala
I have a dataframe like:

Name_Index City_Index
2.0        1.0
0.0        2.0
1.0        0.0
I have a new list of values:

List(1.0, 1.0)

I want to add these values as a new row in the dataframe, and have all the previous rows dropped.
My code:

val spark = SparkSession.builder
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

var data = spark.read.option("header", "true")
  .option("inferSchema", "true")
  .csv("src/main/resources/student.csv")

val someDF = Seq(
  (1.0, 1.0)
).toDF("Name_Index", "City_Index")

data = data.union(someDF)
data.show()
It shows output like:

Name_Index City_Index
2.0        1.0
0.0        2.0
1.0        0.0
1.0        1.0
But the output should look like this, with all the previous rows dropped and only the new values added:

Name_Index City_Index
1.0        1.0
You can achieve this using the limit and union functions. Check below.
scala> val df = Seq((2.0,1.0),(0.0,2.0),(1.0,0.0)).toDF("name_index","city_index")
df: org.apache.spark.sql.DataFrame = [name_index: double, city_index: double]
scala> df.show(false)
+----------+----------+
|name_index|city_index|
+----------+----------+
|2.0 |1.0 |
|0.0 |2.0 |
|1.0 |0.0 |
+----------+----------+
scala> val ndf = Seq((1.0,1.0)).toDF("name_index","city_index")
ndf: org.apache.spark.sql.DataFrame = [name_index: double, city_index: double]
scala> ndf.show
+----------+----------+
|name_index|city_index|
+----------+----------+
| 1.0| 1.0|
+----------+----------+
scala> df.limit(0).union(ndf).show(false) // not a great approach in practice; you could simply call ndf.show
+----------+----------+
|name_index|city_index|
+----------+----------+
|1.0 |1.0 |
+----------+----------+
Change the last two lines to:

data = data.except(data).union(someDF)
data.show()
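Note that except compares data against itself just to produce an empty DataFrame, so it is heavier than limit(0) or a false filter, but it works. A minimal self-contained sketch of the pattern (using an inline Seq instead of the CSV source above, so it runs on its own):

```scala
import org.apache.spark.sql.SparkSession

object ReplaceRowsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    import spark.implicits._

    var data = Seq((2.0, 1.0), (0.0, 2.0), (1.0, 0.0)).toDF("Name_Index", "City_Index")
    val someDF = Seq((1.0, 1.0)).toDF("Name_Index", "City_Index")

    // except(data) removes every row that also appears in data itself,
    // leaving an empty DataFrame that keeps the original schema.
    data = data.except(data).union(someDF)
    data.show()

    spark.stop()
  }
}
```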
You could try this approach:

data = data.filter(_ => false).union(someDF)
Output:
+----------+----------+
|Name_Index|City_Index|
+----------+----------+
|1.0 |1.0 |
+----------+----------+
I hope this gives you some insight. Regards.
As far as I can see, you only need the list of columns from the source DataFrame.
If your sequence has its columns in the same order as the source DataFrame, you can reuse the schema without actually querying the source DataFrame. Performance-wise, this will be faster.
val srcDf = Seq((2.0,1.0),(0.0,2.0),(1.0,0.0)).toDF("name_index","city_index")
val dstDf = Seq((1.0, 1.0)).toDF(srcDf.columns:_*)
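One caveat worth checking (a small sketch continuing from the srcDf and dstDf values above, assuming spark.implicits._ is in scope): toDF(srcDf.columns: _*) copies only the column names from the source, while the column types are inferred from the new Seq.

```scala
// Continuing from srcDf and dstDf defined above:
println(dstDf.columns.mkString(", "))  // name_index, city_index
dstDf.printSchema()  // types come from the new Seq, not from srcDf
```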