
How to append multiple files into one in Amazon's s3 using Python and boto3?

I have a bucket in Amazon's S3 called test-bucket. Within this bucket, json files look like this:

test-bucket
    | continent
        | country
            | <filename>.json

Essentially, filenames are continent/country/name/. Within each country, there are about 100k files, each containing a single dictionary, like this:

{"data":"more data", "even more data":"more data", "other data":"other other data"}

Different files have different lengths. What I need to do is compile all these files together into a single file, then re-upload that file into s3. The easy solution would be to download all the files with boto3, read them into Python, then append them using this script:

import json


def append_to_file(data, filename):
    # Append one JSON record per line to the output file.
    with open(filename, "a") as f:
        json.dump(data, f)
        f.write("\n")

However, I do not know all the filenames (the names are a timestamp). How can I read all the files in a folder, e.g. Asia/China/*, then append them to a file, with the filename being the country?

Optimally, I don't want to have to download all the files into local storage. If I could load these files into memory that would be great.

EDIT: to make things more clear. Files on s3 aren't stored in folders; the file path is just set up to look like a folder. All files are stored under test-bucket.
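To illustrate what I mean by loading the files into memory, here is a minimal sketch of the kind of approach I have in mind (the prefix and the output filename are just examples, not code I already have working):

import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("test-bucket")

# Read each object under one country prefix into memory and append it,
# one JSON record per line, to a single local file named after the country.
with open("China.json", "a") as out:
    for obj in bucket.objects.filter(Prefix="Asia/China/"):
        body = obj.get()["Body"].read().decode("utf-8")
        out.write(body.strip() + "\n")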

The answer to this is fairly simple. You can list all the files in the bucket using a filter that restricts the listing to a "subdirectory" via the prefix. If you have a list of the continents and countries in advance, then you can reduce the list returned. The returned list will have the prefix, so you can filter the list of object names down to the ones you want.

    import re
    import boto3

    s3 = boto3.resource('s3')
    bucket_obj = s3.Bucket(bucketname)

    # List every object key under the given prefix (the "subdirectory").
    all_s3keys = list(obj.key for obj in bucket_obj.objects.filter(Prefix=job_prefix))

    # Optionally narrow the listing further with a regex on the key names.
    if file_pat:
        filtered_s3keys = [key for key in all_s3keys if re.search(file_pat, key)]
    else:
        filtered_s3keys = all_s3keys

The code above will return all the files, with their complete prefix in the bucket, restricted to the prefix provided. So if you provide prefix='Asia/China/', then it will return a list of only the files with that prefix. In some cases, I take a second step and filter the file names within that 'subdirectory' before I use the full prefix to access the files.
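For the layout in the question, the inputs to the snippet above might be set along these lines (the values are placeholders for illustration, not part of the original answer):

    bucketname = 'test-bucket'   # the bucket from the question
    job_prefix = 'Asia/China/'   # list only this country's "subdirectory"
    file_pat = r'\.json$'        # optional regex used in the second filtering step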

The second step is to download all the files:

    import concurrent.futures

    # Download the filtered keys in parallel (MAX_THREADS, CUSTOM_CONFIG and local_filepath are defined elsewhere).
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
        executor.map(lambda s3key: bucket_obj.download_file(s3key, local_filepath, Config=CUSTOM_CONFIG),
                     filtered_s3keys)

For simplicity, I skipped showing that the code generates a local_filepath for each file downloaded, so that each file ends up with the name you want in the place you want it.
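As one possible way to fill that gap (a sketch only, not the answer's original code; the helper name, download directory, and output filename are placeholders, and it assumes keys look like continent/country/<filename>.json), the local path can be derived from each key, and the downloaded files can then be concatenated into one per-country file, much like the question's append_to_file:

    import os

    def local_path_for(s3key, download_dir='downloads'):
        # e.g. 'Asia/China/1234567890.json' -> 'downloads/Asia/China/1234567890.json'
        path = os.path.join(download_dir, s3key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        return path

    # Inside the download step above, the lambda would then use a per-key path:
    #   lambda s3key: bucket_obj.download_file(s3key, local_path_for(s3key), Config=CUSTOM_CONFIG)

    # Finally, concatenate everything downloaded for the country into one file.
    with open('China.json', 'a') as out:
        for s3key in filtered_s3keys:
            with open(local_path_for(s3key)) as f:
                out.write(f.read().strip() + '\n')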
