
Spark S3 write (s3 vs s3a connectors)

I am working on a job that runs on EMR and saves thousands of partitions to S3. The partitions are year/month/day.

I have data from the last 50 years. When Spark writes 10,000 partitions using the s3a connector, it takes around an hour, which is extremely slow:

df.repartition($"year", $"month", $"day").write.mode("append").partitionBy("year", "month", "day").parquet("s3a://mybucket/data")

Then I tried the s3 prefix instead, and it took only a few minutes to save all the partitions to S3:

df.repartition($"year", $"month", $"day").write.mode("append").partitionBy("year", "month", "day").parquet("s3://mybucket/data")

When I overwrote 1000 partitions, s3 was also very fast compared to s3a:

 df
 .repartition($"year", $"month", $"day")
 .write
 .option("partitionOverwriteMode", "dynamic")
 .mode("overwrite").partitionBy("year", "month", "day")
 .parquet("s3://mybucket/data")
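The same dynamic-overwrite behavior can also be enabled cluster-wide rather than per-write. A minimal sketch as a spark-defaults.conf fragment (this is Spark's standard `spark.sql.sources.partitionOverwriteMode` setting, available since Spark 2.3):

```properties
# spark-defaults.conf: make .mode("overwrite") replace only the
# partitions present in the incoming DataFrame, not the whole table path
spark.sql.sources.partitionOverwriteMode dynamic
```

With this set, the `.option("partitionOverwriteMode", "dynamic")` call on the writer is no longer needed.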

As per my understanding, s3a is the more mature connector and the one in general use, while s3/s3n are old, deprecated connectors. So I am wondering what to use. Should I use `s3`? What is the best S3 connector or S3 URI scheme to use with EMR jobs that save data to S3?

As Stevel pointed out, the s3:// connector used in Amazon EMR is built by Amazon for EMR to interact with S3, and it is the recommended way to do so according to the Amazon EMR documentation, "Work with storage and file systems":

Previously, Amazon EMR used the s3n and s3a file systems. While both still work, we recommend that you use the s3 URI scheme for the best performance, security, and reliability.
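Under the hood, the URI scheme simply selects which Hadoop FileSystem implementation handles the path. A hedged illustration of the mapping as it typically appears on an EMR cluster (the class names below are the standard EMRFS and S3A implementations; the exact defaults can vary by EMR release):

```properties
# Illustrative core-site.xml / spark.hadoop.* mapping on EMR:
# s3:// and s3n:// -> EMRFS, Amazon's proprietary connector
fs.s3.impl   com.amazon.ws.emr.hadoop.fs.EmrFileSystem
fs.s3n.impl  com.amazon.ws.emr.hadoop.fs.EmrFileSystem
# s3a:// -> the open-source Apache Hadoop connector
fs.s3a.impl  org.apache.hadoop.fs.s3a.S3AFileSystem
```

So on EMR, `s3://` does not refer to Hadoop's long-retired original S3 client; it refers to EMRFS, which explains the performance difference observed in the question.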

Some more interesting context: the Apache Hadoop community also developed its own S3 connectors, and s3a:// is the actively maintained one. The Hadoop community had also once used a connector named s3://, which probably added to the confusion. From the Hadoop docs:

There are other Hadoop connectors to S3. Only S3A is actively maintained by the Hadoop project itself.

  1. Apache Hadoop's original s3:// client. This is no longer included in Hadoop.
  2. Amazon EMR's s3:// client. This is from the Amazon EMR team, who actively maintain it.
  3. Apache Hadoop's s3n: filesystem client. This connector is no longer available: users must migrate to the newer s3a: client.
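If you do need s3a:// (for example, on clusters outside EMR), much of the slowness described in the question comes from the default rename-based output committer, which copies every file at job commit. A hedged sketch of the S3A "magic" committer configuration (an assumption for illustration: it requires Hadoop 3.1+ and Spark's hadoop-cloud bindings; verify the exact properties against the Hadoop S3A committer documentation for your versions):

```properties
# spark-defaults.conf: commit via S3 multipart uploads directly to the
# destination instead of renaming files (s3a only, Hadoop 3.1+)
spark.hadoop.fs.s3a.committer.name          magic
spark.hadoop.fs.s3a.committer.magic.enabled true
```

On EMR itself, though, the EMRFS s3:// connector (optionally with the EMRFS S3-optimized committer) is the supported path.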
