Add row to Spark Dataframe with timestamps and id
I have a dataframe named timeDF which has the schema below:
root
|-- Id: long (nullable = true)
|-- Model: timestamp (nullable = true)
|-- Prevision: timestamp (nullable = true)
I want to add a new row at the end of timeDF by transforming two Calendar objects, c1 and c2, to Timestamp. I know I can first convert them to Timestamp like so:
val t1 = new Timestamp(c1.getTimeInMillis)
val t2 = new Timestamp(c2.getTimeInMillis)
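As a side note, this conversion is plain JDK code and needs no Spark at all; a minimal standalone sketch (the fixed epoch value is just for a deterministic illustration):

```java
import java.sql.Timestamp;
import java.util.Calendar;

public class CalendarToTimestamp {
    public static void main(String[] args) {
        // Pin the Calendar to a fixed instant so the output is deterministic
        Calendar c1 = Calendar.getInstance();
        c1.setTimeInMillis(0L); // 1970-01-01T00:00:00Z

        // java.sql.Timestamp wraps the epoch milliseconds directly
        Timestamp t1 = new Timestamp(c1.getTimeInMillis());
        System.out.println(t1.getTime()); // prints 0
    }
}
```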
However, I can't figure out how to then write those variables to timeDF as a new row, and how to let Spark increase the Id column value.
Should I create a List object with t1 and t2, make a temporary dataframe from this list, and then union the two dataframes? If so, how do I manage the Id column? Isn't that too much of a mess for such a simple operation?
Can someone explain this to me, please?
Thanks.
If your first dataframe can be sorted by ID and you need to add rows one by one, you can find the maximum ID in your dataframe:
long max = timeDF.agg(functions.max("Id")).head().getLong(0);
Then increment it and add the new row to your dataframe with a union. To do this, follow the example below, in which age acts as the id. people.json is a file from the Spark examples.
Dataset<Row> df = spark.read().json("H:\\work\\HadoopWinUtils\\people.json");
df.show();
long max = df.agg(functions.max("age")).head().getLong(0);
List<Row> rows = Arrays.asList(RowFactory.create(max+1, "test"));
StructType schema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("age", DataTypes.LongType, false, Metadata.empty()),
    DataTypes.createStructField("name", DataTypes.StringType, false, Metadata.empty())));
Dataset<Row> df2 = spark.createDataFrame(rows, schema);
df2.show();
Dataset<Row> df3 = df.union(df2);
df3.show();
I tried this, but I don't know why, when printing the saved table, it only keeps the last 2 rows; all the others are deleted.
This is how I initialize the delta table:
val schema = StructType(
  StructField("Id", LongType, false) ::
  StructField("Model", TimestampType, false) ::
  StructField("Prevision", TimestampType, false) :: Nil
)
var timestampDF = spark.createDataFrame(sc.emptyRDD[Row], schema)

val write_format = "delta"
val partition_by = "Model"
val save_path = "/mnt/path/to/folder"
val table_name = "myTable"

spark.sql("DROP TABLE IF EXISTS " + table_name)
dbutils.fs.rm(save_path, true)
timestampDF.write.partitionBy(partition_by)
  .format(write_format)
  .save(save_path)
spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'")
And this is how I add a new item to it:
def addTimeToData(model: Calendar, target: Calendar): Unit = {
  var timeDF = spark.read
    .format("delta")
    .load("/mnt/path/to/folder")
  val modelTS = new Timestamp(model.getTimeInMillis)
  val targetTS = new Timestamp(target.getTimeInMillis)
  var id: Long = 0
  if (!timeDF.head(1).isEmpty) {
    id = timeDF.agg(max("Id")).head().getLong(0) + 1
  }
  val newTime = Arrays.asList(RowFactory.create(id, modelTS, targetTS))
  val schema = StructType(
    StructField("Id", LongType, false) ::
    StructField("Model", TimestampType, false) ::
    StructField("Prevision", TimestampType, false) :: Nil
  )
  var newTimeDF = spark.createDataFrame(newTime, schema)
  val unionTimeDF = timeDF.union(newTimeDF)
  timeDF = unionTimeDF
  unionTimeDF.show

  val save_path = "/mnt/datalake/Exploration/Provisionning/MeteoFrance/Timestamps/"
  val table_name = "myTable"
  spark.sql("DROP TABLE IF EXISTS " + table_name)
  dbutils.fs.rm(save_path, true)
  timeDF.write.partitionBy("Model")
    .format("delta")
    .save(save_path)
  spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'")
}
I'm not very familiar with delta tables, so I don't know if I can just use SQL on it to add values like so:
spark.sql("INSERT INTO 'myTable' VALUES (" + id + ", " + modelTS + ", " + previsionTS + ")");
And I don't know if just putting the timestamp variables in like that will work.
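Two likely problems with that statement, as a side note: the table name should not be wrapped in single quotes, and the timestamps need to be quoted as string literals for Spark SQL to parse them. A sketch of building a well-formed statement (the table name myTable and the sample timestamp values are just illustrative assumptions):

```java
import java.sql.Timestamp;

public class BuildInsert {
    public static void main(String[] args) {
        long id = 1L;
        Timestamp modelTS = Timestamp.valueOf("2021-01-01 12:00:00");
        Timestamp previsionTS = Timestamp.valueOf("2021-01-02 12:00:00");

        // Timestamp.toString() yields "yyyy-MM-dd HH:mm:ss[.f]", a format Spark SQL
        // can cast to a timestamp when the value is wrapped in single quotes.
        String sql = String.format(
            "INSERT INTO myTable VALUES (%d, '%s', '%s')",
            id, modelTS, previsionTS);
        System.out.println(sql);
        // prints: INSERT INTO myTable VALUES (1, '2021-01-01 12:00:00.0', '2021-01-02 12:00:00.0')
    }
}
```

The resulting string would then be passed to spark.sql(...). String concatenation is fine for a scratchpad, but note that interpolating values directly into SQL is fragile in general.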
Here is a solution you can try. In a nutshell: create a new dataframe with the extra record and combine it with your data using unionByName().
First you create the extra record from scratch. As you mix several types, I used a POJO; here is the code:
List<ModelPrevisionRecord> data = new ArrayList<>();
ModelPrevisionRecord b = new ModelPrevisionRecord(
    -1L,
    new Timestamp(System.currentTimeMillis()),
    new Timestamp(System.currentTimeMillis()));
data.add(b);
Dataset<ModelPrevisionRecord> ds = spark.createDataset(data,
    Encoders.bean(ModelPrevisionRecord.class));
timeDf = timeDf.unionByName(ds.toDF());
ModelPrevisionRecord is a very basic POJO:
package net.jgp.labs.spark.l999_scrapbook.l000;

import java.sql.Timestamp;

public class ModelPrevisionRecord {
  private long id;
  private Timestamp model;
  private Timestamp prevision;

  public ModelPrevisionRecord(long id, Timestamp model, Timestamp prevision) {
    this.id = id;
    this.model = model;
    this.prevision = prevision;
  }

  public long getId() {
    return id;
  }

  public void setId(long id) {
    this.id = id;
  }

  public Timestamp getModel() {
    return model;
  }

  public void setModel(Timestamp model) {
    this.model = model;
  }

  public Timestamp getPrevision() {
    return prevision;
  }

  public void setPrevision(Timestamp prevision) {
    this.prevision = prevision;
  }
}
The id is -1 at this point, so the idea is to create a new column, id2, with the right id:
timeDf = timeDf.withColumn("id2",
    when(
        col("id").$eq$eq$eq(-1), timeDf.agg(max("id")).head().getLong(0) + 1)
    .otherwise(col("id")));
Finally, clean up your dataframe:
timeDf = timeDf.drop("id").withColumnRenamed("id2", "id");