
Slow performance reading Parquet files from S3 with Scala in Spark

I save a partitioned file to an S3 bucket from a DataFrame in Scala:

data_frame.write.mode("append").partitionBy("date").parquet("s3n://...")

When I read this partitioned file back, I'm experiencing very slow performance, and I'm just doing a simple group by:

val load_df = sqlContext.read.parquet(s"s3n://...").cache()

I also tried load_df.registerTempTable("dataframe").
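For reference, the aggregation is roughly the following; this is a minimal sketch, and the column name "date" stands in for whatever I actually group on:

// The same group-by two ways: through the DataFrame API and
// through SQL against the registered temp table.
load_df.groupBy("date").count().show()
sqlContext.sql("SELECT date, COUNT(*) AS n FROM dataframe GROUP BY date").show()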

Any advice? Am I doing something wrong?

It depends on what you mean by "very slow performance".

If you have too many small files in your date partitions, it will take some time to read them all, since each file adds S3 listing and open-request overhead.

Try reducing the granularity of the partitioning, or write fewer files per partition, as sketched below.
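A minimal sketch of the second idea, assuming the same data_frame as in the question and Spark 1.6+ (where repartition by column expressions is available): shuffle the data by the partition column before writing, so each date value lands in one task and produces one output file instead of one per upstream task.

import org.apache.spark.sql.functions.col

// Collapse each date into a single task's output so that one file is
// written per partition directory rather than many small ones.
data_frame
  .repartition(col("date"))
  .write
  .mode("append")
  .partitionBy("date")
  .parquet("s3n://...")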

You should use the S3A driver for better performance (it may be as simple as changing the URL scheme to s3a://, though you may need some extra classpath entries for the hadoop-aws and aws-sdk JARs).
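A hedged sketch of what that change looks like; the hadoop-aws version below is a placeholder and must match the Hadoop version your Spark build uses:

// Assumes hadoop-aws (which pulls in its aws-java-sdk dependency) is on
// the classpath, e.g. via:
//   spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.3 ...
// Then simply read with the s3a:// scheme instead of s3n://.
val load_df = sqlContext.read.parquet("s3a://...").cache()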
