I save a partitioned file to an S3 bucket from a DataFrame in Scala:
data_frame.write.mode("append").partitionBy("date").parquet("s3n://...")
When I read this partitioned data back I'm experiencing very slow performance, and I'm just doing a simple group by:
val load_df = sqlContext.read.parquet(s"s3n://...").cache()
I also tried load_df.registerTempTable("dataframe")
Any advice? Am I doing something wrong?
It depends on what you mean by "very slow performance". If you have too many files in your date partitions, it will take some time just to list and read them all. Try reducing the granularity of the partitioning so each partition holds fewer, larger files.
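As a rough sketch of that advice (the coalesce count and filter cutoff here are hypothetical; tune them to your data volume), you can cap the number of files written per date partition and prune partitions on read:

```scala
// Sketch under assumptions: Spark 1.x-style sqlContext as in the question.
import org.apache.spark.sql.functions.col

// Coalescing before the write produces fewer, larger files per partition.
data_frame
  .coalesce(8)                    // hypothetical count; tune to your data volume
  .write
  .mode("append")
  .partitionBy("date")
  .parquet("s3n://...")

// On read, filter on the partition column so Spark prunes whole date
// directories instead of listing every file in the bucket:
val load_df = sqlContext.read
  .parquet(s"s3n://...")
  .filter(col("date") >= "2017-01-01") // hypothetical cutoff
```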
You should also use the S3A driver for better performance (this may be as simple as changing the URL scheme to s3a://, or you may need some extra classpath entries for the hadoop-aws and aws-sdk jars).
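One way to pull in those jars (a sketch; the version shown is hypothetical and must match the Hadoop version bundled with your Spark build) is via the `--packages` flag, which resolves the aws-sdk dependency transitively:

```shell
# hadoop-aws version must match your Hadoop build; the jar name is hypothetical.
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:2.7.3 \
  your-app.jar
```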