
Add row to Spark Dataframe with timestamps and id

I have a dataframe named timeDF which has the schema below:

root
 |-- Id: long (nullable = true)
 |-- Model: timestamp (nullable = true)
 |-- Prevision: timestamp (nullable = true)

I want to add a new row at the end of timeDF by transforming two Calendar objects c1 & c2 to Timestamp. I know I can do it by first converting them to Timestamp like so:

val t1 = new Timestamp(c1.getTimeInMillis)
val t2 = new Timestamp(c2.getTimeInMillis)

However, I can't figure out how to write those variables to timeDF as a new row, nor how to let Spark increment the Id column value.

Should I create a List object with t1 and t2 and make a temporary dataframe from this list, then union the two dataframes? If so, how do I manage the Id column? Isn't that too much of a mess for such a simple operation?
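One idea I had is to rebuild the Id column after the union with row_number(), since Spark has no auto-increment columns; a rough sketch of what I mean (assuming unionedDF is the result of the union, and ordering by Model is an arbitrary choice):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Rebuild Id as a 0-based sequence after the union.
// A window with no partitionBy moves all rows to a single partition,
// which is fine for a table this small.
val w = Window.orderBy("Model")
val reIndexedDF = unionedDF.withColumn("Id", (row_number().over(w) - 1).cast("long"))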

Can someone please explain?

Thanks.

If your first dataframe can be sorted by ID and you need to add rows one by one, you can find the maximum ID in your list:

long max = timeDF.agg(functions.max("Id")).head().getLong(0);

and then increment it and add it to your dataframe with a union. To do this, follow the example below, in which age can act as the id. people.json is a file in the Spark examples.

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;

Dataset<Row> df = spark.read().json("H:\\work\\HadoopWinUtils\\people.json");
df.show();

// Find the current maximum id (age plays that role here)
long max = df.agg(functions.max("age")).head().getLong(0);
// Build a one-row dataframe holding the incremented id
List<Row> rows = Arrays.asList(RowFactory.create(max + 1, "test"));

StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("age", DataTypes.LongType, false, Metadata.empty()),
                DataTypes.createStructField("name", DataTypes.StringType, false, Metadata.empty())));
Dataset<Row> df2 = spark.createDataFrame(rows, schema);
df2.show();
// union() appends by column position
Dataset<Row> df3 = df.union(df2);
df3.show();

I tried this, but I don't know why: when printing the saved table, it only keeps the last 2 rows, all the others being deleted.

This is how I init the delta table:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType, TimestampType}

val schema = StructType(
               StructField("Id", LongType, false) ::
               StructField("Model", TimestampType, false) ::
               StructField("Prevision", TimestampType, false) :: Nil
             )

var timestampDF = spark.createDataFrame(sc.emptyRDD[Row], schema)

val write_format = "delta"
val partition_by = "Model"
val save_path = "/mnt/path/to/folder"
val table_name = "myTable"

spark.sql("DROP TABLE IF EXISTS " + table_name)
dbutils.fs.rm(save_path, true)

timestampDF.write.partitionBy(partition_by)
                 .format(write_format)
                 .save(save_path)

spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'")

And this is how I add a new item to it:

import java.util.{Arrays, Calendar}
import java.sql.Timestamp
import org.apache.spark.sql.RowFactory
import org.apache.spark.sql.functions.max

def addTimeToData(model: Calendar, target: Calendar): Unit = {
  var timeDF = spark.read
                    .format("delta")
                    .load("/mnt/path/to/folder")
  
  val modelTS = new Timestamp(model.getTimeInMillis)
  val targetTS = new Timestamp(target.getTimeInMillis)
  var id: Long = 0
  
  if (!timeDF.head(1).isEmpty) {
    id = timeDF.agg(max("Id")).head().getLong(0) + 1
  }
  
  val newTime = Arrays.asList(RowFactory.create(id, modelTS, targetTS))
  val schema = StructType(
                 StructField("Id", LongType, false) ::
                 StructField("Model", TimestampType, false) ::
                 StructField("Prevision", TimestampType, false) :: Nil
               )  
  var newTimeDF = spark.createDataFrame(newTime, schema)
  val unionTimeDF = timeDF.union(newTimeDF)
  timeDF = unionTimeDF
  unionTimeDF.show
  val save_path = "/mnt/datalake/Exploration/Provisionning/MeteoFrance/Timestamps/"
  val table_name = "myTable"

  spark.sql("DROP TABLE IF EXISTS " + table_name)
  dbutils.fs.rm(save_path, true)
  timeDF.write.partitionBy("Model")
              .format("delta")
              .save(save_path)

  spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'")
}
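For context, this is how I call it (the Calendar values here are just placeholders):

val model = Calendar.getInstance()
val target = Calendar.getInstance()
target.add(Calendar.HOUR_OF_DAY, 6) // arbitrary example horizon
addTimeToData(model, target)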

I'm not very familiar with delta tables, so I don't know if I can just use SQL on it to add values like so:

spark.sql("INSERT INTO 'myTable' VALUES (" + id + ", " + modelTS + ", " + previsionTS + ")");

And I don't know whether just interpolating the timestamp variables like that will work.
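Alternatively, I wonder if I could simply append the new row instead of dropping and rewriting the whole table on every call; a sketch of what I mean, reusing newTimeDF and save_path from the function above (untested):

// Append only the new row; Delta keeps the table's existing partitioning.
newTimeDF.write
         .format("delta")
         .mode("append")
         .save(save_path)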

Here is a solution you can try, in a nutshell:

  1. Ingest your file.
  2. Create a new dataframe with your data and union the two with unionByName().
  3. Correct the id.
  4. Clean up.

Create the extra record

First you create the extra record from scratch. As you are mixing several types, I used a POJO; here is the code:

import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// Create the extra record with a placeholder id of -1
List<ModelPrevisionRecord> data = new ArrayList<>();
ModelPrevisionRecord b = new ModelPrevisionRecord(
    -1L,
    new Timestamp(System.currentTimeMillis()),
    new Timestamp(System.currentTimeMillis()));
data.add(b);
Dataset<ModelPrevisionRecord> ds = spark.createDataset(data,
    Encoders.bean(ModelPrevisionRecord.class));
// unionByName() matches columns by name rather than by position
timeDf = timeDf.unionByName(ds.toDF());

ModelPrevisionRecord is a very basic POJO:

package net.jgp.labs.spark.l999_scrapbook.l000;

import java.sql.Timestamp;

public class ModelPrevisionRecord {

  public long getId() {
    return id;
  }

  public void setId(long id) {
    this.id = id;
  }

  public Timestamp getModel() {
    return model;
  }

  public void setModel(Timestamp model) {
    this.model = model;
  }

  public Timestamp getPrevision() {
    return prevision;
  }

  public void setPrevision(Timestamp prevision) {
    this.prevision = prevision;
  }

  private long id;
  private Timestamp model;
  private Timestamp prevision;

  public ModelPrevisionRecord(long id, Timestamp model, Timestamp prevision) {
    this.id = id;
    this.model = model;
    this.prevision = prevision;
  }
}

Correct the Id

The id is -1, so the idea is to create a new column, id2, with the right id:

// assumes the usual static import: import static org.apache.spark.sql.functions.*;
timeDf = timeDf.withColumn("id2",
    when(col("id").equalTo(-1),
        timeDf.agg(max("id")).head().getLong(0) + 1)
        .otherwise(col("id")));

Cleanup the dataframe

Finally, clean up your dataframe:

timeDf = timeDf.drop("id").withColumnRenamed("id2", "id");

