
Can you translate (or alias) s3:// to s3a:// in Spark/Hadoop?

We have some code that we run on Amazon's servers that loads parquet using the s3:// scheme, as advised by Amazon. However, some developers want to run the code locally using a Spark installation on Windows, and there Spark stubbornly insists on using the s3a:// scheme.

We can read files just fine using s3a, but with s3 we get a java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException.

SparkSession available as 'spark'.
>>> spark.read.parquet('s3a://bucket/key')
DataFrame[********************************************]
>>> spark.read.parquet('s3://bucket/key')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\readwriter.py", line 316, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
  File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o37.parquet.
: java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
        at org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore(S3FileSystem.java:99)
        at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
        at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:644)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ClassNotFoundException: org.jets3t.service.S3ServiceException
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        ... 24 more

Is there a way to get Hadoop, Spark or pyspark to "translate" the URI scheme from s3 to s3a via some sort of magic configuration? Changing the code is not an option we want to entertain, as it would involve quite a lot of testing.

The local environment is Windows 10, PySpark 2.4.4 with Hadoop 2.7 (prebuilt), Python 3.7.5, and the right AWS libs installed.

EDIT: One hack I used - since we're not supposed to use s3:// paths anyway - is to just convert them to s3a:// in pyspark.

I've added the following function in readwriter.py and invoked it wherever there was a call out to the JVM with paths. It works fine, but it would be nice if this were a config option.

def massage_paths(paths):
    # Rewrite s3:// URIs to s3a:// before they are handed to the JVM.
    # basestring works here because readwriter.py aliases it to str on Python 3.
    if isinstance(paths, basestring):
        return 's3a' + paths[2:] if paths.startswith('s3:') else paths
    t = list if isinstance(paths, list) else tuple
    return t('s3a' + x[2:] if x.startswith('s3:') else x for x in paths)
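
For reference, each call site that ships paths to the JVM then passes them through the helper first; the patched line inside parquet() looks roughly like this (the same call shown in the traceback above, just wrapped - a sketch, not the exact upstream source):

def parquet(self, *paths):
    ...
    # rewrite any s3:// paths before handing them to the JVM reader
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, massage_paths(paths))))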

Ideally, you could refactor the code to detect the runtime environment, or externalize the paths to a config file that could be read in the respective environments - something like the sketch below.
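
A minimal sketch of the externalized approach, assuming the existing 'spark' session; the FS_SCHEME environment variable and build_path helper are made-up names for illustration, not part of the original code:

import os

# pick the URI scheme per environment: s3 on EMR, s3a for a local Spark install
FS_SCHEME = os.environ.get('FS_SCHEME', 's3')

def build_path(bucket, key):
    # assemble a fully qualified path using the configured scheme
    return '{}://{}/{}'.format(FS_SCHEME, bucket, key)

df = spark.read.parquet(build_path('bucket', 'key'))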

Otherwise, you would need to edit hdfs-site.xml to add an fs.s3.impl key that points the s3 scheme at the same implementation class fs.s3a.impl uses - the value stays the same, only the scheme in the key changes. That change would need to be done on all Spark workers.

cricket007 is correct.

spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem
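
In PySpark that property can also be supplied when the session is built (or in spark-defaults.conf); a minimal sketch, assuming the hadoop-aws/S3A jars are already on the classpath (the app name is arbitrary):

from pyspark.sql import SparkSession

# map the bare s3:// scheme onto the S3A filesystem implementation
spark = (SparkSession.builder
         .appName('s3-to-s3a-alias')
         .config('spark.hadoop.fs.s3.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
         .getOrCreate())

# s3:// paths should now resolve through S3AFileSystem instead of the jets3t-based S3FileSystem
df = spark.read.parquet('s3://bucket/key')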

There's some code in org.apache.hadoop.fs.FileSystem which maps a scheme such as "s3" to an implementation class, loads it and instantiates it with the full URL.
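
One quick way to see what a scheme currently maps to from the pyspark shell (this pokes at the internal _jsc handle, so treat it as a diagnostic sketch only):

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get('fs.s3.impl'))   # whatever class the s3 scheme is mapped to, or None if unset
print(hadoop_conf.get('fs.s3a.impl'))  # typically org.apache.hadoop.fs.s3a.S3AFileSystem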

Warning: there's no specific code in the core S3A FS which looks for the FS scheme being s3a, but you will encounter problems if you use the DynamoDB consistency layer "S3Guard" - that's probably a bit of an edge case someone could fix.

You probably won't be able to configure Spark to help you "translate".

Instead, this is more of a design issue. The code should be made configurable so it can choose a different protocol for each environment (that is what I did in a similar situation). If you insist on running locally, some code refactoring may be unavoidable...
