
Spark 2.3.3 outputting parquet to S3

A while back I had the problem that outputting parquet files directly to S3 isn't really feasible, and I needed a caching layer before finally copying the parquet files to S3; see this post.

I know that HADOOP-13786 should fix this problem, and it seems to be implemented in Hadoop 3.1.0 and later.

Now the question is how do I use it in Spark 2.3.3; as far as I understand, Spark 2.3.3 comes with HDFS 2.8.5. I usually use flintrock to orchestrate my cluster on AWS. Is it just a matter of setting HDFS to 3.1.1 in the flintrock config, and then I get all the goodies? Or do I still have to set something in code like I did before? For example like this:

from pyspark import SparkConf

# appname and master are defined elsewhere (application name and master URL);
# fs.s3a.* options reach the Hadoop configuration via the spark.hadoop. prefix
conf = SparkConf().setAppName(appname)\
    .setMaster(master)\
    .set('spark.executor.memory', '13g')\
    .set('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version', '2')\
    .set('spark.hadoop.fs.s3a.fast.upload', 'true')\
    .set('spark.hadoop.fs.s3a.fast.upload.buffer', 'disk')\
    .set('spark.hadoop.fs.s3a.buffer.dir', '/tmp/s3a')

(I know this is the old code and probably no longer relevant)
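For completeness, a minimal sketch of how a conf like that would be used to write parquet to S3; the session, DataFrame, and bucket path are placeholders added here, not part of the original question:

from pyspark.sql import SparkSession

# Hypothetical usage of the conf above; 's3a://my-bucket/output/' is a placeholder path
spark = SparkSession.builder.config(conf=conf).getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").parquet("s3a://my-bucket/output/")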

You'll need Hadoop 3.1, and a build of Spark 2.4 which has this PR applied: https://github.com/apache/spark/pull/24970

Some downstream products with their own Spark builds do this (HDP-3.1), but it's not (yet) in the Apache builds.

With that you then need to configure Parquet to use the new bridging committer (Parquet only allows subclasses of the Parquet committer), and select which of the three specific S3A committers (long story) to use. The Staging committer is the one I'd recommend as it's (a) based on the one Netflix uses and (b) the one I've tested the most.
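For illustration, the configuration would look something like the sketch below, which selects the Staging (directory) committer. This assumes Hadoop 3.1+ and a Spark build that includes the hadoop-cloud module from the PR above; the keys and class names follow the Hadoop S3A committer documentation:

from pyspark import SparkConf

# Sketch only, assuming Hadoop 3.1+ and the spark-hadoop-cloud module on the classpath.
# 'directory' selects the staging-based directory committer.
conf = SparkConf() \
    .set('spark.hadoop.fs.s3a.committer.name', 'directory') \
    .set('spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a',
         'org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory') \
    .set('spark.sql.sources.commitProtocolClass',
         'org.apache.spark.internal.io.cloud.PathOutputCommitProtocol') \
    .set('spark.sql.parquet.output.committer.class',
         'org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter')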

There's no fundamental reason why the same PR can't be applied to Spark 2.3, just that nobody has tried.
