Sync local folder to s3 bucket using boto3
I noticed that boto3 has no API for the "sync" operation that you can perform through the command line.
So, how do I sync a local folder to a given bucket using boto3?
I just implemented a simple class for this matter. I'm posting it here hoping it helps anyone with the same problem.
You could modify S3Sync.sync in order to take file size into account; a sketch of that follows the code below.
import boto3

from bisect import bisect_left
from pathlib import Path
from typing import List


class S3Sync:
    """
    Class that holds the operations needed to synchronize a local dir to a given bucket.
    """

    def __init__(self):
        self._s3 = boto3.client('s3')

    def sync(self, source: str, dest: str) -> None:
        """
        Sync source to dest: every element that exists in source
        but not in dest is copied to dest.
        No element is deleted.

        :param source: Source folder.
        :param dest: Destination bucket.
        :return: None
        """
        paths = self.list_source_objects(source_folder=source)
        objects = self.list_bucket_objects(dest)

        # Get the keys and sort them so we can perform a binary search
        # each time we check whether a path is already there.
        object_keys = [obj['Key'] for obj in objects]
        object_keys.sort()
        object_keys_length = len(object_keys)

        for path in paths:
            # Binary search: bisect_left returns the insertion point for
            # path; if path is not at that position, it is not in the bucket.
            index = bisect_left(object_keys, path)
            if index == object_keys_length or object_keys[index] != path:
                # Path not found in object_keys, so it has to be synced.
                self._s3.upload_file(str(Path(source).joinpath(path)),
                                     Bucket=dest, Key=path)

    def list_bucket_objects(self, bucket: str) -> List[dict]:
        """
        List all objects for the given bucket.

        :param bucket: Bucket name.
        :return: A list of dicts describing the elements in the bucket.

        Example of a single object:

            {
                'Key': 'example/example.txt',
                'LastModified': datetime.datetime(2019, 7, 4, 13, 50, 34, 893000, tzinfo=tzutc()),
                'ETag': '"b11564415be7f58435013b414a59ae5c"',
                'Size': 115280,
                'StorageClass': 'STANDARD',
                'Owner': {
                    'DisplayName': 'webfile',
                    'ID': '75aa57f09aa0c8caeab4f8c24e99d10f8e7faeebf76c078efc7c6caea54ba06a'
                }
            }
        """
        try:
            # Note: list_objects returns at most 1000 keys per call; for
            # larger buckets a list_objects(_v2) paginator would be needed.
            contents = self._s3.list_objects(Bucket=bucket)['Contents']
        except KeyError:
            # No 'Contents' key: the bucket is empty.
            return []
        else:
            return contents

    @staticmethod
    def list_source_objects(source_folder: str) -> List[str]:
        """
        :param source_folder: Root folder for the resources you want to list.
        :return: A list of relative file names.

        Example:

            /tmp
                - example
                    - file_1.txt
                    - some_folder
                        - file_2.txt

            >>> sync.list_source_objects("/tmp/example")
            ['file_1.txt', 'some_folder/file_2.txt']
        """
        path = Path(source_folder)
        paths = []
        for file_path in path.rglob("*"):
            if file_path.is_dir():
                continue
            # Strip the source folder prefix to get the relative key.
            str_file_path = str(file_path)
            str_file_path = str_file_path.replace(f'{str(path)}/', "")
            paths.append(str_file_path)
        return paths


if __name__ == '__main__':
    sync = S3Sync()
    sync.sync("/temp/some_folder", "some_bucket_name")
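As mentioned above, sync can be modified to take file size into account. A minimal sketch, assuming that a missing key or a size mismatch means the file should be uploaded (same-size edits would be missed); the subclass name S3SyncWithSize is hypothetical:

from pathlib import Path

# Hypothetical variant of S3Sync.sync that also re-uploads files whose
# size differs from the stored object.
class S3SyncWithSize(S3Sync):
    def sync(self, source: str, dest: str) -> None:
        objects = self.list_bucket_objects(dest)
        # Map each remote key to its size; missing keys yield None below.
        sizes = {obj['Key']: obj['Size'] for obj in objects}
        for path in self.list_source_objects(source_folder=source):
            local_file = Path(source).joinpath(path)
            if sizes.get(path) != local_file.stat().st_size:
                self._s3.upload_file(str(local_file), Bucket=dest, Key=path)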
@Z.Wei commented:

Dig into this a little to deal with the weird bisect function. We may just use if path not in object_keys:?

I thought it was an interesting question that deserved an update to the answer rather than getting lost in the comments.

Answer:

No, if path not in object_keys performs a linear search, O(n). bisect_* performs a binary search (the list has to be sorted), which is O(log(n)).

Most of the time you will be dealing with enough objects that sorting once and then doing binary searches is faster than just using the in keyword.

Bear in mind that with in you would be checking every path in the source against every path in the destination, which is O(m * n), where m is the number of objects in the source and n the number in the destination. Using bisect, the whole thing is O((m + n) * log(n)): sorting the keys costs O(n * log(n)) and each of the m lookups costs O(log(n)).
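For reference, a minimal, self-contained sketch of how bisect_left doubles as a membership test on a sorted list (the helper name contains_key is illustrative):

from bisect import bisect_left

def contains_key(sorted_keys: list, key: str) -> bool:
    """Binary-search membership test: O(log n) on a sorted list."""
    index = bisect_left(sorted_keys, key)
    return index < len(sorted_keys) and sorted_keys[index] == key

keys = sorted(['a/1.txt', 'b/2.txt', 'c/3.txt'])
print(contains_key(keys, 'b/2.txt'))   # True
print(contains_key(keys, 'b/9.txt'))   # False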
Come to think of it, you could use sets to make the algorithm even faster (and simpler, hence more Pythonic):
    def sync(self, source: str, dest: str) -> None:
        # Local paths.
        paths = set(self.list_source_objects(source_folder=source))

        # Getting the keys (remote S3 paths).
        objects = self.list_bucket_objects(dest)
        object_keys = {obj['Key'] for obj in objects}

        # Compute the set difference: what we have in paths that does
        # not exist in object_keys.
        to_sync = paths - object_keys

        source_path = Path(source)
        for path in to_sync:
            self._s3.upload_file(str(source_path / path),
                                 Bucket=dest, Key=path)
Lookup in a set is O(1), so using sets the whole thing becomes O(m + n), faster than the previous O((m + n) * log(n)) approach.
The code could be improved further by making list_bucket_objects and list_source_objects return sets instead of lists, as sketched below.
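A sketch of that change, under the assumption that list_bucket_objects is renamed to a hypothetical list_bucket_keys: since dicts are not hashable, it has to return the object keys themselves rather than the full object dicts.

from pathlib import Path
from typing import Set

# Hypothetical set-returning variants of the two listing methods.
class S3SyncSets(S3Sync):

    def list_bucket_keys(self, bucket: str) -> Set[str]:
        try:
            contents = self._s3.list_objects(Bucket=bucket)['Contents']
        except KeyError:
            # No 'Contents' key: the bucket is empty.
            return set()
        return {obj['Key'] for obj in contents}

    @staticmethod
    def list_source_objects(source_folder: str) -> Set[str]:
        root = Path(source_folder)
        return {str(p.relative_to(root)) for p in root.rglob("*") if p.is_file()}

    def sync(self, source: str, dest: str) -> None:
        # The whole diff is now a single set difference.
        to_sync = self.list_source_objects(source) - self.list_bucket_keys(dest)
        source_path = Path(source)
        for path in to_sync:
            self._s3.upload_file(str(source_path / path), Bucket=dest, Key=path)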