
Spark doesn't output .crc files on S3

When I use Spark locally, writing data to my local filesystem, it creates some useful .crc files.

Using the same job on AWS EMR and writing to S3, the .crc files are not written.

Is this normal? Is there a way to force the writing of .crc files on S3?
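
For reference, a minimal sketch of the kind of write I mean (spark-shell, placeholder data and paths); locally the output directory contains a hidden .crc sidecar next to each part file, while the same write on EMR against S3 leaves none:

    import spark.implicits._

    // Placeholder data; any DataFrame write shows the same behaviour.
    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // Local write: the output directory holds part-*.csv plus hidden .part-*.csv.crc files.
    df.write.mode("overwrite").csv("file:///tmp/crc-test")

    // The same write on EMR against S3: no .crc objects show up in the bucket listing.
    // df.write.mode("overwrite").csv("s3://my-bucket/crc-test")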

Those .crc files are just created by the low-level bits of the Hadoop FS binding so that it can identify when a block is corrupt, and, on HDFS, switch to another datanode's copy of the data for the read and kick off a re-replication of one of the good copies.
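
You can see that it is this Hadoop layer, not Spark, writing them: Hadoop's checksummed local filesystem produces a .crc sidecar for every file it creates, while the raw local filesystem underneath it does not. A minimal sketch against the Hadoop FS API (paths are placeholders):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path, RawLocalFileSystem}

    val conf = new Configuration()

    // LocalFileSystem is a ChecksumFileSystem: each file it writes gets a .<name>.crc sidecar.
    val checksummed = FileSystem.getLocal(conf)
    val out1 = checksummed.create(new Path("/tmp/with-checksum.txt"))
    out1.writeBytes("hello")
    out1.close()    // /tmp now also contains .with-checksum.txt.crc

    // RawLocalFileSystem bypasses the checksum layer, so no .crc file is written.
    val raw = new RawLocalFileSystem()
    raw.initialize(new java.net.URI("file:///"), conf)
    val out2 = raw.create(new Path("/tmp/no-checksum.txt"))
    out2.writeBytes("hello")
    out2.close()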

On S3, stopping corruption is left to AWS.

What you can get from S3 is the etag of a file, which is the md5sum for a small upload; on a multipart upload it is some other string, which, again, changes when you upload it.
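
To see this, you can HEAD the object and compare the ETag with a local md5sum. This is only a sketch, assuming the AWS SDK for Java v2 is on the classpath, with placeholder bucket, key and path; the values only match for single-part uploads without SSE-KMS:

    import java.nio.file.{Files, Paths}
    import java.security.MessageDigest
    import software.amazon.awssdk.services.s3.S3Client
    import software.amazon.awssdk.services.s3.model.HeadObjectRequest

    val s3 = S3Client.create()

    // HEAD the object; for a single-part, non-SSE-KMS upload the ETag is the hex MD5 of the bytes.
    val head = s3.headObject(
      HeadObjectRequest.builder().bucket("my-bucket").key("output/part-00000").build())
    val etag = head.eTag().replace("\"", "")   // the ETag comes back wrapped in quotes

    // MD5 of a local copy of the same file, for comparison.
    val md5 = MessageDigest.getInstance("MD5")
      .digest(Files.readAllBytes(Paths.get("/tmp/part-00000")))
      .map("%02x".format(_)).mkString

    println(s"etag=$etag local-md5=$md5 match=${etag == md5}")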

You can get at the etag with the Hadoop 3.1+ version of the S3A connector, though it's off by default as distcp gets very confused when uploading from HDFS. For earlier versions, you can't get at it, nor does the aws s3 command show it. You'd have to try some other S3 libraries (it's just a HEAD request, after all).
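
For the Hadoop 3.1+ route, a sketch of what turning it on looks like; the property name and behaviour are how I read the S3A documentation, so treat them as an assumption to verify, and the bucket/key are placeholders:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val conf = new Configuration()
    // Assumed S3A property (Hadoop 3.1+): expose the object etag through getFileChecksum.
    // Off by default because distcp compares HDFS and S3 checksums and gets confused.
    conf.setBoolean("fs.s3a.etag.checksum.enabled", true)

    val path = new Path("s3a://my-bucket/output/part-00000")
    val fs = FileSystem.get(path.toUri, conf)

    // With the flag on, this returns the etag wrapped as a FileChecksum; otherwise null.
    val checksum = fs.getFileChecksum(path)
    println(if (checksum == null) "no checksum exposed" else checksum.toString)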
