
How can I get only the latest file/files created/modified on an S3 location through Python

Using boto, I tried the code below:

from boto.s3.connection import S3Connection
conn = S3Connection('XXX', 'YYYY')

bucket = conn.get_bucket('myBucket')

file_list = bucket.list('just/a/prefix/')

but I am unable to get the length of the list or the last element of file_list, because it is a BucketListResultSet, not a plain list. Please suggest a solution for this scenario.
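(For reference, boto 2's BucketListResultSet is a lazy iterable; it can simply be materialized with list() before taking its length or last element. A minimal sketch follows; the S3 call itself is commented out because it needs credentials, and it assumes that on boto Key objects last_modified is an ISO-8601 string, so plain string comparison orders it chronologically.)

```python
# Sketch only: materialize the lazy BucketListResultSet before using
# len() or indexing. Requires a live S3 connection, hence commented out:
# keys = list(bucket.list('just/a/prefix/'))

def newest(keys):
    """Return the key with the greatest last_modified timestamp.

    Assumes each item has a `last_modified` attribute whose values
    compare chronologically (true for ISO-8601 strings in one timezone).
    """
    return max(keys, key=lambda k: k.last_modified)
```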

You are trying to use the boto library, which is rather obsolete and no longer maintained; its number of open issues keeps growing.

Better to use the actively developed boto3.

First, let us define the parameters of our search:

>>> bucket_name = "bucket_of_m"
>>> prefix = "region/cz/"

Import boto3 and create s3 representing the S3 resource:

>>> import boto3
>>> s3 = boto3.resource("s3")

Get the bucket:

>>> bucket = s3.Bucket(name=bucket_name)
>>> bucket
s3.Bucket(name='bucket_of_m')

Define a filter for objects with the given prefix:

>>> res = bucket.objects.filter(Prefix=prefix)
>>> res
s3.Bucket.objectsCollection(s3.Bucket(name='bucket_of_m'), s3.ObjectSummary)

and iterate over it:

>>> for obj in res:
...     print(obj.key)
...     print(obj.size)
...     print(obj.last_modified)
...

Each obj is an ObjectSummary (not an Object itself), but it holds enough to learn something about the object:

>>> obj
s3.ObjectSummary(bucket_name='bucket_of_m', key=u'region/cz/Ostrava/Nadrazni.txt')
>>> type(obj)
boto3.resources.factory.s3.ObjectSummary

You can get the Object from it and use it as you need:

>>> o = obj.Object()
>>> o
s3.Object(bucket_name='bucket_of_m', key=u'region/cz/rodos/fusion/AdvancedDataFusion.xml')

There are not many options for filtering, but Prefix is available.
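Putting this together for the original question, the newest object under a prefix can be picked with the built-in max(). This is a sketch: ObjectSummary.last_modified is a timezone-aware datetime in boto3, so summaries can be compared by it directly; the live S3 call is commented out because it needs credentials.

```python
def latest_object(summaries):
    """Return the summary with the newest last_modified, or None if empty.

    Works on any iterable of objects carrying a comparable
    `last_modified` attribute (boto3 ObjectSummary qualifies).
    """
    return max(summaries, key=lambda s: s.last_modified, default=None)

# With a live bucket (needs credentials), this would be:
# latest = latest_object(bucket.objects.filter(Prefix=prefix))
# print(latest.key, latest.last_modified)
```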

As an addendum to Jan's answer:


It seems that the boto3 library has changed in the meantime; currently (version 1.6.19 at the time of writing) the filter method offers more parameters:

object_summary_iterator = bucket.objects.filter(
    Delimiter='string',
    EncodingType='url',
    Marker='string',
    MaxKeys=123,
    Prefix='string',
    RequestPayer='requester'
)

Three useful parameters for limiting the number of entries in your scenario are Marker, MaxKeys and Prefix:

  • Marker (string) -- Specifies the key to start with when listing objects in a bucket.
  • MaxKeys (integer) -- Sets the maximum number of keys returned in the response. The response might contain fewer keys but will never contain more.
  • Prefix (string) -- Limits the response to keys that begin with the specified prefix.
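As an illustration of how these parameters combine, here is a small helper that resumes a listing after a known key. The helper name list_after is hypothetical; the filter() call and its Marker/MaxKeys/Prefix parameters are the API described above.

```python
def list_after(bucket, marker, prefix="", page_size=1000):
    """Resume listing the bucket after `marker`, restricted to `prefix`.

    `list_after` is a hypothetical wrapper; it only forwards the
    Marker/MaxKeys/Prefix parameters to bucket.objects.filter().
    """
    return bucket.objects.filter(Marker=marker, MaxKeys=page_size, Prefix=prefix)
```

Remember that the key passed as Marker is itself excluded from the result, so the listing continues with the key that follows it.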

Two notes:

  • The key you specify as Marker will not be included in the result, i.e. the listing starts from the key following the one you specify as Marker.

  • The boto3 library performs automatic pagination on the results. The size of each page is determined by the MaxKeys parameter of the filter function (defaulting to 1000).

    If you iterate over the s3.Bucket.objectsCollection object beyond that, it will automatically download the next page. While this is generally useful, it can be surprising when you specify e.g. MaxKeys=10 and want to iterate over only 10 keys: the iterator will still go over all matched keys, simply issuing a new request to the server every 10 keys.
    So, if you want only e.g. the first three results, break out of the loop manually; don't rely on the iterator.

    (Unfortunately this is not clear in the docs, and is in fact quite misleading there, because the library parameter description is copied from the API parameter description, where it does make sense: "The response might contain fewer keys but will never contain more.")
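One way to "break out of the loop manually" without writing the loop yourself is itertools.islice, which stops consuming the iterator after n items, so boto3 never requests further pages. A sketch (the live S3 call is commented out because it needs credentials):

```python
from itertools import islice

def first_n(iterable, n):
    """Consume at most n items and stop; boto3 will not fetch more pages."""
    return list(islice(iterable, n))

# With a live bucket (needs credentials), this would take the first
# three keys in S3's lexicographic listing order:
# first_three = first_n(bucket.objects.filter(Prefix=prefix), 3)
```

Note that this takes the first n keys in listing order (lexicographic), not the n most recently modified ones.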

