
Add row to Spark Dataframe with timestamps and id

I have a dataframe named timeDF which has the schema below:

root
 |-- Id: long (nullable = true)
 |-- Model: timestamp (nullable = true)
 |-- Prevision: timestamp (nullable = true)

I want to add a new row at the end of timeDF by transforming two Calendar objects c1 & c2 to Timestamp. I know I can do it by first converting them to Timestamp like so:

val t1 = new Timestamp(c1.getTimeInMillis)
val t2 = new Timestamp(c2.getTimeInMillis)

However, I can't figure out how to write those variables to timeDF as a new row, nor how to let Spark increment the Id column value.

Should I create a List object with t1 and t2 and make a temporary dataframe from this list, then union the two dataframes? If so, how do I manage the Id column? Isn't that too much of a mess for such a simple operation?
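One idea I had is to rebuild the Id column after the union with row_number(), since Spark has no auto-increment columns; a rough sketch of what I mean (assuming unionedDF is the result of the union, and ordering by Model is an arbitrary choice):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Rebuild Id as a 0-based sequence after the union.
// A window with no partitionBy moves all rows to a single partition,
// which is fine for a table this small.
val w = Window.orderBy("Model")
val reIndexedDF = unionedDF.withColumn("Id", (row_number().over(w) - 1).cast("long"))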

Can someone please explain?

Thanks.

If your first dataframe can be sorted by ID and you need to add rows one by one, you can find the maximum ID in your list:

long max = timeDF.agg(functions.max("Id")).head().getLong(0);

and then increment it and add it to your dataframe with a union. To do this, follow the example below, in which age can act as the id. people.json is a file in the Spark examples.

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;

Dataset<Row> df = spark.read().json("H:\\work\\HadoopWinUtils\\people.json");
df.show();

// Find the current maximum id (age plays that role here)
long max = df.agg(functions.max("age")).head().getLong(0);
// Build a one-row dataframe holding the incremented id
List<Row> rows = Arrays.asList(RowFactory.create(max + 1, "test"));

StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("age", DataTypes.LongType, false, Metadata.empty()),
                DataTypes.createStructField("name", DataTypes.StringType, false, Metadata.empty())));
Dataset<Row> df2 = spark.createDataFrame(rows, schema);
df2.show();
// union() appends by column position
Dataset<Row> df3 = df.union(df2);
df3.show();

I tried this, but I don't know why: when printing the saved table, it only keeps the last 2 rows, all the others being deleted.

This is how I init the delta table:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType, TimestampType}

val schema = StructType(
               StructField("Id", LongType, false) ::
               StructField("Model", TimestampType, false) ::
               StructField("Prevision", TimestampType, false) :: Nil
             )

var timestampDF = spark.createDataFrame(sc.emptyRDD[Row], schema)

val write_format = "delta"
val partition_by = "Model"
val save_path = "/mnt/path/to/folder"
val table_name = "myTable"

spark.sql("DROP TABLE IF EXISTS " + table_name)
dbutils.fs.rm(save_path, true)

timestampDF.write.partitionBy(partition_by)
                 .format(write_format)
                 .save(save_path)

spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'")

And this is how I add a new item to it:

import java.util.{Arrays, Calendar}
import java.sql.Timestamp
import org.apache.spark.sql.RowFactory
import org.apache.spark.sql.functions.max

def addTimeToData(model: Calendar, target: Calendar): Unit = {
  var timeDF = spark.read
                    .format("delta")
                    .load("/mnt/path/to/folder")
  
  val modelTS = new Timestamp(model.getTimeInMillis)
  val targetTS = new Timestamp(target.getTimeInMillis)
  var id: Long = 0
  
  if (!timeDF.head(1).isEmpty) {
    id = timeDF.agg(max("Id")).head().getLong(0) + 1
  }
  
  val newTime = Arrays.asList(RowFactory.create(id, modelTS, targetTS))
  val schema = StructType(
                 StructField("Id", LongType, false) ::
                 StructField("Model", TimestampType, false) ::
                 StructField("Prevision", TimestampType, false) :: Nil
               )  
  var newTimeDF = spark.createDataFrame(newTime, schema)
  val unionTimeDF = timeDF.union(newTimeDF)
  timeDF = unionTimeDF
  unionTimeDF.show
  val save_path = "/mnt/datalake/Exploration/Provisionning/MeteoFrance/Timestamps/"
  val table_name = "myTable"

  spark.sql("DROP TABLE IF EXISTS " + table_name)
  dbutils.fs.rm(save_path, true)
  timeDF.write.partitionBy("Model")
              .format("delta")
              .save(save_path)

  spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'")
}
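For context, this is how I call it (the Calendar values here are just placeholders):

val model = Calendar.getInstance()
val target = Calendar.getInstance()
target.add(Calendar.HOUR_OF_DAY, 6) // arbitrary example horizon
addTimeToData(model, target)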

I'm not very familiar with delta tables, so I don't know if I can just use SQL on it to add values like so:

spark.sql("INSERT INTO 'myTable' VALUES (" + id + ", " + modelTS + ", " + previsionTS + ")");

And I don't know whether just interpolating the timestamp variables like that will work.
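Alternatively, I wonder if I could simply append the new row instead of dropping and rewriting the whole table on every call; a sketch of what I mean, reusing newTimeDF and save_path from the function above (untested):

// Append only the new row; Delta keeps the table's existing partitioning.
newTimeDF.write
         .format("delta")
         .mode("append")
         .save(save_path)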

Here is a solution you can try, in a nutshell:

  1. Ingest your file.
  2. Create a new dataframe with your data and union the two with unionByName().
  3. Correct the id.
  4. Clean up.

Create the extra record

First you create the extra record from scratch. As you are mixing several types, I used a POJO; here is the code:

import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// Create the extra record with a placeholder id of -1
List<ModelPrevisionRecord> data = new ArrayList<>();
ModelPrevisionRecord b = new ModelPrevisionRecord(
    -1L,
    new Timestamp(System.currentTimeMillis()),
    new Timestamp(System.currentTimeMillis()));
data.add(b);
Dataset<ModelPrevisionRecord> ds = spark.createDataset(data,
    Encoders.bean(ModelPrevisionRecord.class));
// unionByName() matches columns by name rather than by position
timeDf = timeDf.unionByName(ds.toDF());

ModelPrevisionRecord is a very basic POJO:

package net.jgp.labs.spark.l999_scrapbook.l000;

import java.sql.Timestamp;

public class ModelPrevisionRecord {

  public long getId() {
    return id;
  }

  public void setId(long id) {
    this.id = id;
  }

  public Timestamp getModel() {
    return model;
  }

  public void setModel(Timestamp model) {
    this.model = model;
  }

  public Timestamp getPrevision() {
    return prevision;
  }

  public void setPrevision(Timestamp prevision) {
    this.prevision = prevision;
  }

  private long id;
  private Timestamp model;
  private Timestamp prevision;

  public ModelPrevisionRecord(long id, Timestamp model, Timestamp prevision) {
    this.id = id;
    this.model = model;
    this.prevision = prevision;
  }
}

Correct the Id

The id is -1, so the idea is to create a new column, id2, with the right id:

// assumes the usual static import: import static org.apache.spark.sql.functions.*;
timeDf = timeDf.withColumn("id2",
    when(col("id").equalTo(-1),
        timeDf.agg(max("id")).head().getLong(0) + 1)
        .otherwise(col("id")));

Cleanup the dataframe

Finally, clean up your dataframe:

timeDf = timeDf.drop("id").withColumnRenamed("id2", "id");

