While streaming from Kafka using Spark Structured Streaming 2.1, partitioning by a string column (containing a date string in the format yyyy-mm-dd), I expected one directory per date. Instead I got:
interval27e/_spark_metadata
interval27e/interval_read_date=2010-10-27 08%3A02%3A48
interval27e/interval_read_date=2010-10-30 04%3A27%3A34
interval27e/interval_read_date=2010-11-03 02%3A22%3A13
interval27e/interval_read_date=2010-11-03 07%3A27%3A08
interval27e/interval_read_date=2010-11-14 08%3A37%3A52
interval27e/interval_read_date=2010-11-19 01%3A46%3A50
Spark is appending a time string (e.g. "08%3A02%3A48") to the directory names, producing multiple directories per date.
This is the writeStream command:
interval3 = interval2 \
    .writeStream \
    .format("parquet") \
    .option("path", "/user/usera/interval27e") \
    .partitionBy("interval_read_date") \
    .trigger(processingTime='15 seconds') \
    .option("checkpointLocation", "/user/usera/checkpoint27e") \
    .start()
I don't see this happening in other Stack Overflow questions about writing partitioned parquet.
How can I partition the parquet directories by date without that time string being appended to the directory name?
It looks like interval_read_date is not a date-like string at all but a timestamp. %3A is the percent encoding of the character ":", so the directory names you actually have are:
interval_read_date=2010-10-27 08:02:48
interval_read_date=2010-10-30 04:27:34
...
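The %3A sequences in the directory names are just percent encoding (Spark escapes characters that are unsafe in file paths); the standard library's urllib.parse can confirm the mapping, as a quick sanity check:

```python
# Demonstrate that %3A is the percent encoding of ":".
from urllib.parse import quote, unquote

assert quote(":") == "%3A"

# Decoding a directory name from the question recovers the timestamp.
decoded = unquote("interval_read_date=2010-10-27 08%3A02%3A48")
print(decoded)  # → interval_read_date=2010-10-27 08:02:48
```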
Please double-check that you're using the right data, and truncate or cast the column if necessary.