
spark partition data writing by timestamp

I have some data with a timestamp column field which is a long in standard epoch format. I need to save that data in a split format like yyyy/mm/dd/hh using Spark Scala.

data.write.partitionBy("timestamp").format("orc").save("mypath") 

This just splits the data by the raw timestamp value, like below:

timestamp=1458444061098
timestamp=1458444061198

but I want it to be like:

└── YYYY
    └── MM
        └── DD
            └── HH

You can leverage various Spark SQL date/time functions for this. First, add a new date-type column created from the unix timestamp column.

import org.apache.spark.sql.functions._
// "timestamp" holds epoch milliseconds, so divide by 1000 for from_unixtime (which expects seconds)
val withDateCol = data
  .withColumn("date_col", from_unixtime(col("timestamp") / 1000))

After this, you can add year, month, day and hour columns to the DF and then partition by these new columns for the write.

withDateCol
  .withColumn("year", year(col("date_col")))
  .withColumn("month", month(col("date_col")))
  .withColumn("day", dayofmonth(col("date_col")))
  .withColumn("hour", hour(col("date_col")))
  .drop("date_col")
  .write                       // partitionBy for writing lives on DataFrameWriter, so call .write first
  .partitionBy("year", "month", "day", "hour")
  .format("orc")
  .save("mypath")

The columns included in the partitionBy clause won't be part of the file schema; they are encoded in the directory names instead.
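As a quick illustration, here is a minimal read-back sketch (assuming the "mypath" output from above; the year/month values are taken from the sample timestamps in the question, which fall in March 2016). Spark's partition discovery recovers year, month, day and hour from the directory names even though they are not stored inside the ORC files:

val readBack = spark.read.orc("mypath")
readBack.printSchema()                              // schema includes year, month, day, hour again
readBack.where("year = 2016 AND month = 3").show()  // filters on partition columns prune to the matching folders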

First, I would caution you against over-partitioning. That is, make sure you have sufficient data to make it worth partitioning by hour, otherwise you could end up with lots of partition folders containing small files. The second caution is about using a partition hierarchy (year/month/day/hour), since it will require recursive partition discovery.
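For example, a sketch of what reading only a subtree of such a hierarchy looks like (assuming the "mypath" layout from the first answer; the year/month values are illustrative): you need the basePath option so Spark still discovers the partition columns from the directory names:

spark.read
  .option("basePath", "mypath")       // tells Spark where the partition hierarchy starts
  .orc("mypath/year=2016/month=3")    // read one subtree; year and month are still recovered as columns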

Having said that, if you definitely want to partition by hour segments, I would suggest truncating your timestamp to the hour in a new column and partitioning by that. Then Spark will be smart enough to recognize the format as a timestamp when you read it back in, and you can actually perform full filtering as needed.

// assumes 'timestamp is already a timestamp-typed column (the question's epoch longs would need converting first)
input
  .withColumn("ts_trunc", date_trunc("HOUR", 'timestamp)) // date_trunc added in Spark 2.3.0
  .write
  .partitionBy("ts_trunc")
  .save("/mnt/warehouse/part-test")

spark.read.load("/mnt/warehouse/part-test").where("hour(ts_trunc) = 10")

The other option would be to partition by date and hour of day, like so:

input
  .withColumn("date", to_date('timestamp))
  .withColumn("hour", hour('timestamp))
  .write
  .partitionBy("date", "hour")
  .save("/mnt/warehouse/part-test")
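For completeness, a minimal read-back sketch for this layout (assuming the same /mnt/warehouse/part-test path; the literal values are illustrative, based on the question's sample timestamps from 2016-03-20). Because date and hour are now partition columns, filters on them prune directories directly:

spark.read
  .load("/mnt/warehouse/part-test")
  .where("date = '2016-03-20' and hour = 10")  // only the matching date=/hour= folders are scanned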
