How to grep into files stored in S3
Does anybody know how to perform grep on S3 files with aws s3 directly in the bucket? For example, I have FILE1.csv and FILE2.csv with many rows, and I want to look for the rows that contain the string JZZ:

aws s3 ls --recursive s3://mybucket/loaded/*.csv.gz | grep 'JZZ'
The aws s3 cp command can send output to stdout:

aws s3 cp s3://mybucket/foo.csv - | grep 'JZZ'

The dash (-) signals the command to send output to stdout.

See: How to use AWS S3 CLI to dump files to stdout in BASH?
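For compressed objects like the .csv.gz files in the question, the same dash trick works with gunzip in the middle of the pipe. A minimal sketch, assuming a bucket and key name; the local pipeline below only demonstrates the streaming pattern without touching AWS:

```shell
# Streaming a gzipped S3 object through grep (assumed bucket/key names):
#   aws s3 cp s3://mybucket/loaded/FILE1.csv.gz - | gunzip | grep 'JZZ'
#
# The decompress-and-grep half of that pipe, demonstrated on local data
# so nothing uncompressed is ever written to disk:
printf 'id,name\n1,abc\n2,JZZ-item\n' | gzip | gunzip | grep 'JZZ'
# prints: 2,JZZ-item
```

The object is decompressed in flight, so only the compressed bytes cross the network.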
You can also use the Glue/Athena combo, which allows you to execute queries directly within AWS. Depending on data volumes, queries can be costly and can take time.

Basically:

Use Athena to query, e.g.
select "$path", line from <your_table> where line like '%some%fancy%string%'
and get something like:

$path                       line
s3://mybucket/mydir/my.csv  "some I did find some,yes, "fancy, yes, string"
This saves you from having to run any external infrastructure.
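The Athena flow above can also be driven from the CLI. A sketch with hypothetical database/table names (mydb.raw_lines) and an assumed results bucket; the table would be declared with a single string column over the S3 prefix, so each CSV row arrives as one line:

```shell
# Hypothetical query against a one-column table over the raw files:
QUERY='SELECT "$path", line FROM mydb.raw_lines WHERE line LIKE '\''%JZZ%'\'''

# Submitting it would look like this (not executed here; needs Athena set up):
#   aws athena start-query-execution \
#     --query-string "$QUERY" \
#     --work-group primary \
#     --result-configuration OutputLocation=s3://mybucket/athena-results/

echo "$QUERY"
```

The "$path" pseudo-column is what maps each matching line back to the S3 object it came from.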
You can do it locally with the following command:
aws s3 ls --recursive s3://<bucket_name>/<path>/ | awk '{print $4}' | xargs -I FNAME sh -c "echo FNAME; aws s3 cp s3://<bucket_name>/FNAME - | grep --color=always '<regex_pattern>'"
Explanation: the ls command generates a list of files, then we select the file name from the output, and for each file (xargs command) we download the file from S3 and grep the output.
I don't recommend this approach if you have to download a lot of data from S3 (due to transfer costs). You can avoid the cost of internet transfer, though, if you run the command on an EC2 instance located in a VPC with an S3 VPC endpoint attached to it.
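The shape of that ls → awk → xargs pipeline can be tried on a local directory first (hypothetical file names below); awk picks $4 because aws s3 ls prints date, time, size, and key in four columns:

```shell
# Simulate two "objects" locally and run the per-file echo-then-grep step:
mkdir -p /tmp/s3grep-demo
printf 'aaa\nJZZ here\n' > /tmp/s3grep-demo/FILE1.csv
printf 'nothing\n'       > /tmp/s3grep-demo/FILE2.csv

# Stand-in for: aws s3 ls ... | awk '{print $4}' | xargs -I FNAME ...
# `|| true` keeps the loop going when grep finds no match in a file.
ls /tmp/s3grep-demo | xargs -I FNAME sh -c \
  "echo FNAME; grep 'JZZ' /tmp/s3grep-demo/FNAME || true"
# prints FILE1.csv, its matching line "JZZ here", then FILE2.csv
```

Printing the file name before each grep is what lets you tell which object a match came from.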
There is a way to do it through the aws command line, but it will require some tools and fancy pipes. Here are some examples.

S3:

aws s3api list-objects --bucket my-logging-bucket --prefix "s3/my-events-2022-01-01" | jq -r '.Contents[]|.Key' | sort -r | xargs -I{} aws s3 cp s3://my-logging-bucket/{} -

Cloudfront:

aws s3api list-objects --bucket my-logging-bucket --prefix "cloudfront/blog.example.com/EEQEEEEEEEEE.2022-01-01" | jq -r '.Contents[]|.Key' | sort -r | xargs -I{} aws s3 cp s3://my-logging-bucket/{} - | zgrep GET

The "sort -r" just reverses the order so it shows the newest objects first. You can omit it if you want to look at the objects in chronological order.
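The effect of that sort -r step is easy to see on a couple of date-stamped keys (hypothetical names); because the timestamps sort lexicographically, reversing the sort puts the newest key first:

```shell
# Newest key comes out first once the lexicographic sort is reversed:
printf 's3/my-events-2022-01-01-00.10\ns3/my-events-2022-01-01-09.30\n' | sort -r
# prints the 09.30 key before the 00.10 key
```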