Writing many files to parquet from Spark - Missing some parquet files

We developed a job that processes and writes a huge number of files as Parquet to Amazon S3 (s3a) using Spark 2.3. Every source file should create a different partition in S3. The code was tested (with fewer files) and worked as expected.

However, after running it on the real data, we noticed that some files (a small fraction of the total) were not written to Parquet. There were no errors or anything unusual in the logs. We re-ran the code for the files that were missing and it worked (?). We want to use this code in a production environment, but we need to find out what the problem is here. We are writing to Parquet like this:

dataframe_with_data_to_write
  .repartition($"field1", $"field2")
  .write
  .option("compression", "snappy")
  .option("basePath", path_out)
  .partitionBy("field1", "field2", "year", "month", "day")
  .mode(SaveMode.Append)
  .parquet(path_out)

We used the recommended parameters:

spark.sparkContext.hadoopConfiguration.set("mapreduce.output.fileoutputformat.compress", "true")
spark.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")  
spark.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.cleanup-failures.ignored", "true")
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

Is there any known issue or bug with these parameters? Maybe something related to S3 eventual consistency? Any suggestions?

Any help will be appreciated.

Yes, it is a known issue. Work is committed by listing the output in the attempt working directory and renaming it into the destination directory. If that listing under-reports files, output is missing. If that listing lists files which aren't there, the commit fails.

Fixes in the ASF Hadoop releases:

  1. Hadoop 2.7-2.8 connectors: write to HDFS, then copy the files to S3.
  2. Hadoop 2.9-3.0: turn on S3Guard for a consistent S3 listing (it uses DynamoDB for this).
  3. Hadoop 3.1: switch to the S3A committers, which are designed with these consistency and performance issues in mind. The "staging" one from Netflix is the simplest to use here; a configuration sketch follows this list.
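
For option 3, the wiring looks roughly like the sketch below. This is only a minimal sketch, assuming Spark built with the hadoop-cloud (spark-hadoop-cloud) module and Hadoop 3.1+ S3A on the classpath; exact keys and class names can differ between releases, so check the S3A committer documentation for your versions:

// Sketch: select the "directory" variant of the Netflix staging committer.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.committer.name", "directory")
// What to do if destination partition data already exists: fail, append or replace.
hadoopConf.set("fs.s3a.committer.staging.conflict-mode", "append")
// Spark SQL also has to be bound to the S3A committer, normally via --conf or spark-defaults.conf:
//   spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
//   spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter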

Further reading: A zero-rename committer.

Update 11-01-2019: Amazon has its own closed-source implementation of the ASF zero-rename committer. Ask the EMR team for their own proofs of correctness, as the rest of us cannot verify this.

Update 11-Dec-2020: Amazon S3 is now fully consistent, so listings will be up to date and correct; no more update inconsistency or 404 caching.

  • The v1 commit algorithm is still unsafe, as directory rename is non-atomic.
  • The v2 commit algorithm is always broken, as it renames files one by one.
  • Renames are slow O(data) copy operations on S3, so the window of failure during task commit is bigger.

You aren't at risk of data loss any more, but besides the performance being awful, failures during task commit aren't handled properly.
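
Now that the listing-consistency problem is gone, the zero-rename "magic" committer is an option as well. Again, only a minimal sketch, assuming Hadoop 3.1+ S3A; on recent releases the enabling flag may already default to true:

// Sketch: use the S3A "magic" committer, which writes task output as pending
// multipart uploads in the destination bucket and completes them at job commit,
// so no rename/copy step is involved at all.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.committer.magic.enabled", "true")
hadoopConf.set("fs.s3a.committer.name", "magic")
// The same Spark SQL bindings as in the previous sketch (PathOutputCommitProtocol /
// BindingParquetOutputCommitter) still apply for DataFrame/Dataset writes.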
