
Spark Structured Streaming: Parquet partition name uniqueness

While streaming from Kafka using Spark Structured Streaming 2.1, partitioning by a string column (which I believed contained a date in the format yyyy-MM-dd), I expected one directory per date. Instead I see this:

interval27e/_spark_metadata
interval27e/interval_read_date=2010-10-27 08%3A02%3A48
interval27e/interval_read_date=2010-10-30 04%3A27%3A34
interval27e/interval_read_date=2010-11-03 02%3A22%3A13
interval27e/interval_read_date=2010-11-03 07%3A27%3A08
interval27e/interval_read_date=2010-11-14 08%3A37%3A52
interval27e/interval_read_date=2010-11-19 01%3A46%3A50

Spark is appending a time string ("08%3A02%3A48") to each directory name, producing multiple directories per date.

This is the writeStream command:

interval3 = interval2 \
  .writeStream \
  .format("parquet") \
  .option("path", "/user/usera/interval27e") \
  .partitionBy("interval_read_date") \
  .trigger(processingTime='15 seconds') \
  .option("checkpointLocation", "/user/usera/checkpoint27e") \
  .start()

I don't see this happening in other Stack Overflow questions about writing partitioned Parquet.

How can I partition the Parquet output without that time string being appended to the directory names?

It looks like interval_read_date is not a date or date-like string at all, but a timestamp. %3A is a percent-encoded `:`, so the directory names you actually have are:

interval_read_date=2010-10-27 08:02:48
interval_read_date=2010-10-30 04:27:34
...
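You can confirm the encoding yourself: the directory suffixes round-trip through standard URL percent-encoding, which is what Spark applies to characters such as `:` that it escapes in partition paths. A minimal check with Python's standard library:

```python
from urllib.parse import quote, unquote

# Decoding a directory name from the listing recovers the raw
# timestamp value that Spark partitioned on.
encoded = "interval_read_date=2010-10-27 08%3A02%3A48"
print(unquote(encoded))  # interval_read_date=2010-10-27 08:02:48

# Encoding the colon reproduces exactly what appears on disk.
print(quote("08:02:48"))  # 08%3A02%3A48
```

Because each row carries a distinct timestamp, every distinct value becomes its own partition directory.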

Please double-check that you're using the right data, and truncate or cast the column to a date if necessary.
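A sketch of what truncation means here, using only the standard library (the column and variable names are illustrative, not from the original code): parse the timestamp string and keep only the date part, so all rows from the same day map to one partition value. In Spark you would do the equivalent with `withColumn` and `pyspark.sql.functions.to_date` before `partitionBy`.

```python
from datetime import datetime

# A raw value like those seen in the directory names: it carries
# time-of-day, so every distinct value creates its own partition.
ts = "2010-10-27 08:02:48"

# Truncate to the date; all timestamps from this day now share
# one partition value, giving one directory per day.
day = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date().isoformat()
print(day)  # 2010-10-27
```

With a date-only column as the `partitionBy` key, the output layout becomes `interval_read_date=2010-10-27/`, `interval_read_date=2010-10-30/`, and so on.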
