Get last modified file from AWS S3 bucket in R

I've got an S3 bucket being updated in real time with API data. The files are saved with a .XXX extension, where XXX is 1...n.

My R script needs to be able to grab the latest files and add them to the analysis dataframe. I've been using the aws.s3 package so far. After setting the secret and access keys as environment variables:

mybucket <- get_bucket("mybucket1")

This returns an s3 object of 1000 elements (the bucket presumably holds more), and it looks like each element has Contents: List of 7, one of which is $LastModified. How do I get the name of the last modified file?

Mybucket     Large s3_bucket (1000 elements, 2.1Mb)
contents:List of 7
..$ Key : chr "folder1"
..$ LastModified: chr "2018-01-16T09:58:47.000Z"
..$ ETag : chr "\" nnnnnnnnnnn\""
etc (.. $Owner, $Storage class, $bucket, $-attr)
contents: List of 7
..$ Key : chr "folder1/file.1"
..$ LastModified: chr "2018....etc"
..$ ETag : chr "...etc..."
etc....
contents: List of 7
etc.....

It's really the number after 'file.' that I need (in this case it would be 1).

After experimentation, I think a CLI command through RCurl would be a better option.

aws s3 ls s3://mybucket --recursive | grep APIdata@symbol=XXX&interval=5.1*

This gets me really close, but the command is leaving out the '&interval=5.1*' part, so it's returning ALL objects matching 'APIdata@symbol=XXX*'.
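That behaviour is consistent with the shell treating an unquoted & as its background operator, so grep only ever receives the pattern up to XXX. Below is a minimal sketch of building the command from R with the pattern quoted via shQuote(). The bucket name and pattern are the placeholders from the question. Using grep -F (fixed-string matching) and dropping the trailing * are my assumptions, not part of the original command.

```r
# Pattern from the question, without the trailing "*": grep -F matches
# substrings, so the common prefix is enough.
pattern <- "APIdata@symbol=XXX&interval=5.1"

# shQuote() wraps the pattern in single quotes so the shell passes "&"
# through to grep instead of treating it as the background operator.
cmd <- paste(
  "aws s3 ls s3://mybucket --recursive | grep -F",
  shQuote(pattern, type = "sh")
)

print(cmd)

# Running it needs the AWS CLI and credentials, so it is left commented out:
# listing <- system(cmd, intern = TRUE)
```

Quoting the pattern by hand with single quotes inside the command string would have the same effect. shQuote() just avoids nesting quotes manually.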

I think your question is independent of AWS S3. I would classify it as "how can I create a data.frame from a list of lists", and there are existing answers for that, e.g.:

R list of lists to data.frame

My solution uses the handy rbindlist from the data.table package.

I had to guess the data types of Mybucket, but a solution could look like this:

# https://cran.r-project.org/web/packages/aws.s3/aws.s3.pdf
# get_bucket: returns a list of objects in the bucket (with class “s3_bucket”)
library(data.table)
library(lubridate)

# my personal assumption of the output of "get_bucket" is list of list (I have no S3 at hand to verify this)
Mybucket <- list(  list(Key = "folder1/file.1", LastModified = "2018-01-16T09:58:47.000Z", ETag = "\" nnnnnnnnnnn\"")
                 , list(Key = "folder2/file.2", LastModified = "2018-01-16T08:58:47.000Z", ETag = "xyz"))

dt <- rbindlist(Mybucket)  # convert into a data.table (enhanced data.frame)

dt[, LastModAsDate := ymd_hms(LastModified)]  # add a date column parsed from the timestamp string

dt.most.recent <- dt[order(-LastModAsDate)][1]  # order by date descending, then pick the top-most row

which results in

> dt.most.recent
              Key             LastModified           ETag       LastModAsDate
1: folder1/file.1 2018-01-16T09:58:47.000Z " nnnnnnnnnnn" 2018-01-16 09:58:47

Please note that the date conversion may lose precision (milliseconds), but this sketches the overall solution.
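For comparison, the same idea can be sketched in base R without data.table or lubridate. As above, the list-of-lists shape of the get_bucket result is an assumption, and the mock data is illustrative rather than taken from a real bucket.

```r
# Mock of the assumed get_bucket() result: a list of lists (not real bucket data).
Mybucket <- list(
  list(Key = "folder1/file.1", LastModified = "2018-01-16T09:58:47.000Z"),
  list(Key = "folder2/file.2", LastModified = "2018-01-16T08:58:47.000Z")
)

# Pull out the two fields of interest as plain character vectors.
keys  <- vapply(Mybucket, function(x) x$Key, character(1))
stamp <- vapply(Mybucket, function(x) x$LastModified, character(1))

# Parse the timestamps as UTC; the trailing ".000Z" is ignored by the format,
# so milliseconds are dropped here as well.
modified <- as.POSIXct(stamp, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

# which.max() on the numeric time gives the index of the newest object.
latest_key <- keys[which.max(as.numeric(modified))]
print(latest_key)
```

which.max() avoids sorting the whole listing when only the newest entry is needed.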

To extract the number contained in the file extension, use:

tools::file_ext(dt.most.recent$Key)
# [1] "1"
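Since file_ext returns a character string, it has to be converted before any numeric comparison (as text, "9" would sort after "1600"). Here is a small sketch with hypothetical keys. Only the "file.<number>" shape comes from the question, and picking the highest number is an illustration that assumes the numbers increase over time.

```r
# Hypothetical keys; only the "name.<number>" shape is taken from the question.
keys <- c("folder1/file.9", "folder1/file.1600", "folder1/file.1599")

# file_ext() returns the text after the last dot, as character.
ext <- tools::file_ext(keys)

# Convert to integer so the comparison is numeric rather than alphabetical.
num <- as.integer(ext)

highest_key <- keys[which.max(num)]
print(num)
print(highest_key)
```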

The easiest way ended up being a system command:

currentfile <- system("aws s3 ls s3://bucket/folder --recursive | grep 'file.16' | sort | tail -n 1 | awk '{print $4}'", intern=TRUE)

grep grabs files with 'file.16' present, which significantly narrows the search as current file numbers are in the 1600s. intern=TRUE captures the command's output, in this case saving it in 'currentfile' as a character string. sort orders the listing by modified date (each line starts with the date and time), tail -n 1 keeps the last line, i.e. the most recently modified file, and awk '{print $4}' prints its fourth column (the file name).
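The sort, tail and awk steps can also be done in R on the raw listing, which leaves only `aws s3 ls` itself to the shell. The sketch below uses hypothetical listing lines in the date, time, size, key layout that `aws s3 ls --recursive` prints. The file names and sizes are made up for illustration.

```r
# Hypothetical lines in the layout printed by `aws s3 ls --recursive`:
# date, time, size in bytes, key.
listing <- c(
  "2018-01-16 09:58:47       2048 folder/file.1601",
  "2018-01-16 08:58:47       1024 folder/file.1600",
  "2018-01-15 23:10:02        512 folder/file.1599"
)

# Split each line on runs of whitespace.
parts <- strsplit(trimws(listing), "\\s+")

# Columns 1 and 2 form the timestamp; column 4 is the key
# (keys containing spaces would need extra handling).
stamp <- vapply(parts, function(p) paste(p[1], p[2]), character(1))
keys  <- vapply(parts, function(p) p[4], character(1))

# "YYYY-MM-DD HH:MM:SS" sorts chronologically as plain text, which is also
# why the shell `sort` works on the raw listing.
sorted_keys <- keys[order(stamp, method = "radix")]
currentfile <- sorted_keys[length(sorted_keys)]
print(currentfile)
```

With real data, `listing` would come from something like system("aws s3 ls s3://bucket/folder --recursive", intern = TRUE), using the same placeholder bucket path as the command above.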

For reference: Downloading the latest file in an S3 bucket using AWS CLI?
