
Spark doesn't output .crc files on S3

When I use Spark locally, writing data to my local filesystem, it creates some useful .crc files.

Using the same job on AWS EMR and writing to S3, the .crc files are not written.

Is this normal? Is there a way to force the writing of .crc files on S3?
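
For reference, a minimal sketch of the kind of write I mean (spark-shell, placeholder data and paths); locally the output directory contains a hidden .crc sidecar next to each part file, while the same write on EMR against S3 leaves none:

    import spark.implicits._

    // Placeholder data; any DataFrame write shows the same behaviour.
    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // Local write: the output directory holds part-*.csv plus hidden .part-*.csv.crc files.
    df.write.mode("overwrite").csv("file:///tmp/crc-test")

    // The same write on EMR against S3: no .crc objects show up in the bucket listing.
    // df.write.mode("overwrite").csv("s3://my-bucket/crc-test")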

Those .crc files are just created by the low-level bits of the Hadoop FS binding so that it can identify when a block is corrupt, and, on HDFS, switch to another datanode's copy of the data for the read and kick off a re-replication of one of the good copies.
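
You can see that it is this Hadoop layer, not Spark, writing them: Hadoop's checksummed local filesystem produces a .crc sidecar for every file it creates, while the raw local filesystem underneath it does not. A minimal sketch against the Hadoop FS API (paths are placeholders):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path, RawLocalFileSystem}

    val conf = new Configuration()

    // LocalFileSystem is a ChecksumFileSystem: each file it writes gets a .<name>.crc sidecar.
    val checksummed = FileSystem.getLocal(conf)
    val out1 = checksummed.create(new Path("/tmp/with-checksum.txt"))
    out1.writeBytes("hello")
    out1.close()    // /tmp now also contains .with-checksum.txt.crc

    // RawLocalFileSystem bypasses the checksum layer, so no .crc file is written.
    val raw = new RawLocalFileSystem()
    raw.initialize(new java.net.URI("file:///"), conf)
    val out2 = raw.create(new Path("/tmp/no-checksum.txt"))
    out2.writeBytes("hello")
    out2.close()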

On S3, stopping corruption is left to AWS.

What you can get from S3 is the etag of a file, which is the md5sum for a small upload; on a multipart upload it is some other string, which, again, changes when you upload it.
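
To see this, you can HEAD the object and compare the ETag with a local md5sum. This is only a sketch, assuming the AWS SDK for Java v2 is on the classpath, with placeholder bucket, key and path; the values only match for single-part uploads without SSE-KMS:

    import java.nio.file.{Files, Paths}
    import java.security.MessageDigest
    import software.amazon.awssdk.services.s3.S3Client
    import software.amazon.awssdk.services.s3.model.HeadObjectRequest

    val s3 = S3Client.create()

    // HEAD the object; for a single-part, non-SSE-KMS upload the ETag is the hex MD5 of the bytes.
    val head = s3.headObject(
      HeadObjectRequest.builder().bucket("my-bucket").key("output/part-00000").build())
    val etag = head.eTag().replace("\"", "")   // the ETag comes back wrapped in quotes

    // MD5 of a local copy of the same file, for comparison.
    val md5 = MessageDigest.getInstance("MD5")
      .digest(Files.readAllBytes(Paths.get("/tmp/part-00000")))
      .map("%02x".format(_)).mkString

    println(s"etag=$etag local-md5=$md5 match=${etag == md5}")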

You can get at the etag with the Hadoop 3.1+ version of the S3A connector, though it's off by default as distcp gets very confused when uploading from HDFS. For earlier versions, you can't get at it, nor does the aws s3 command show it. You'd have to try some other S3 libraries (it's just a HEAD request, after all).
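
For the Hadoop 3.1+ route, a sketch of what turning it on looks like; the property name and behaviour are how I read the S3A documentation, so treat them as an assumption to verify, and the bucket/key are placeholders:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val conf = new Configuration()
    // Assumed S3A property (Hadoop 3.1+): expose the object etag through getFileChecksum.
    // Off by default because distcp compares HDFS and S3 checksums and gets confused.
    conf.setBoolean("fs.s3a.etag.checksum.enabled", true)

    val path = new Path("s3a://my-bucket/output/part-00000")
    val fs = FileSystem.get(path.toUri, conf)

    // With the flag on, this returns the etag wrapped as a FileChecksum; otherwise null.
    val checksum = fs.getFileChecksum(path)
    println(if (checksum == null) "no checksum exposed" else checksum.toString)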
