
How to find zero byte files in Amazon S3

Is there a way to programmatically find zero-byte files in Amazon S3?

The total size of the bucket is more than 100 GB, so it is impractical for me to sync it back to a server and then run

find . -size 0 -type f

Combining s3cmd with awk should do the trick easily.

Note: s3cmd ls outputs four columns: date, time, size, and name. You want to match the size (column 3) against 0 and print the object name (column 4). This should do the trick:

$ s3cmd ls -r s3://bucketname | awk '{if ($3 == 0) print $4}'
s3://bucketname/root/
s3://bucketname/root/e

If you want to see all of the information, just drop the $4 so that awk prints the whole line:

$ s3cmd ls -r s3://bucketname | awk '{if ($3 == 0) print}' 
2013-03-04 06:28         0   s3://bucketname/root/
2013-03-04 06:28         0   s3://bucketname/root/e

Memory-wise this should be fine, since it is a simple bucket listing.
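If you use the AWS CLI rather than s3cmd, `aws s3 ls --recursive` prints the same four columns (date, time, size, key), so an equivalent awk filter works. A sketch, with the bucket name as a placeholder; the second pipeline just demonstrates the filter on a captured listing:

```shell
# With the AWS CLI (bucket name is a placeholder):
#   aws s3 ls --recursive s3://bucketname | awk '$3 == 0 {print $4}'

# The awk filter itself, demonstrated on a captured listing:
printf '2013-03-04 06:28          0 s3://bucketname/root/\n2013-03-04 06:28        512 s3://bucketname/big.bin\n' \
  | awk '$3 == 0 {print $4}'
```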

There is no direct way to search for zero-byte objects in Amazon S3. You can do it by listing all objects and then sorting them by size, which groups all zero-size files together.

If you want a list of all files with size zero, you can use Bucket Explorer: list the objects of the selected bucket, then click the Size column header (sort by size), and it will group the zero-byte files together.

Disclosure: I am a developer of Bucket Explorer.

Just use Boto:

from boto.s3.connection import S3Connection

aws_access_key = ''
aws_secret_key = ''
bucket_name = ''

s3_conn = S3Connection(aws_access_key, aws_secret_key)
bucket = s3_conn.get_bucket(bucket_name)  # assign the bucket before iterating
for key in bucket.list():
    if key.size == 0:
        print(key.key)

Regarding the number of files: Boto requests the object metadata (not the actual file content) 1,000 keys at a time (the AWS limit), and it uses a generator, so memory usage stays small.
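The same pattern carries over to boto3, the current SDK, using a paginator. The filtering step is a plain function over `list_objects_v2` result pages, so it can be sketched separately; the boto3 wiring in the comment and the bucket name are assumptions:

```python
def zero_byte_keys(pages):
    """Yield the keys of zero-byte objects from list_objects_v2 result pages."""
    for page in pages:
        for obj in page.get("Contents", []):
            if obj["Size"] == 0:
                yield obj["Key"]

# With boto3 it would be wired up like this (bucket name is a placeholder):
#   import boto3
#   s3 = boto3.client("s3")
#   pages = s3.get_paginator("list_objects_v2").paginate(Bucket="bucketname")
#   for key in zero_byte_keys(pages):
#       print(key)

# Demonstration on a fake result page:
fake_pages = [{"Contents": [{"Key": "root/", "Size": 0},
                            {"Key": "big.bin", "Size": 1024}]}]
print(list(zero_byte_keys(fake_pages)))  # ['root/']
```

The paginator fetches 1,000 keys per request, so like the boto2 version this never holds the whole listing in memory.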

Find zero-length files using basic pattern matching:

hdfs dfs -ls -R s3a://bucket_path/ | grep '^-' | awk -F " " '{if ($4 == 0) print $4, $7}'

JMESPath query:

aws s3api list-objects --bucket $BUCKET --prefix $PREFIX --output json --query 'Contents[?Size==`0`]'
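The same `Size == 0` filter can also be applied client-side to the JSON that `list-objects` returns; a sketch using only Python's standard library, where the response shape follows the s3api output and the sample keys are made up for illustration:

```python
import json

def zero_byte_objects(listing_json):
    """Return the objects with Size == 0 from an s3api list-objects response."""
    data = json.loads(listing_json)
    return [obj for obj in data.get("Contents", []) if obj["Size"] == 0]

# Demonstration on a minimal response body:
sample = json.dumps({"Contents": [
    {"Key": "root/", "Size": 0},
    {"Key": "root/e", "Size": 0},
    {"Key": "big.bin", "Size": 1024},
]})
print([obj["Key"] for obj in zero_byte_objects(sample)])  # ['root/', 'root/e']
```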
With the AWS SDK for JavaScript (v2):

const getBucketFileSize = async function () {
  try {
    const response = await s3
      .listObjectsV2({
        Bucket: "bucket-name",       // your bucket name
        Prefix: "optional/prefix/",  // optional: restrict to a prefix
      })
      .promise();

    response.Contents.forEach((item) => {
      if (item.Size === 0) {
        console.log(item);
      }
    });
  } catch (e) {
    console.log("err", e);
  }
};
