How to grep into files stored in S3
Does anybody know how to perform grep on S3 files with aws s3 directly in the bucket? For example, I have FILE1.csv and FILE2.csv with many rows, and I want to look for the rows that contain the string JZZ:

aws s3 ls --recursive s3://mybucket/loaded/*.csv.gz | grep 'JZZ'
The aws s3 cp command can send output to stdout:

aws s3 cp s3://mybucket/foo.csv - | grep 'JZZ'

The dash (-) signals the command to send output to stdout.

See: How to use AWS S3 CLI to dump files to stdout in BASH?
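For compressed objects like the .csv.gz files in the question, the same dash trick works with gunzip in the middle of the pipe. A minimal sketch, assuming a bucket and key name; the local pipeline below only demonstrates the streaming pattern without touching AWS:

```shell
# Streaming a gzipped S3 object through grep (assumed bucket/key names):
#   aws s3 cp s3://mybucket/loaded/FILE1.csv.gz - | gunzip | grep 'JZZ'
#
# The decompress-and-grep half of that pipe, demonstrated on local data
# so nothing uncompressed is ever written to disk:
printf 'id,name\n1,abc\n2,JZZ-item\n' | gzip | gunzip | grep 'JZZ'
# prints: 2,JZZ-item
```

The object is decompressed in flight, so only the compressed bytes cross the network.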
You can also use the Glue/Athena combo, which allows you to execute queries directly within AWS. Depending on data volumes, queries can be costly and can take time.

Basically:

Use Athena to query, e.g.
select "$path", line from <your_table> where line like '%some%fancy%string%'
and get something like:

$path                       line
s3://mybucket/mydir/my.csv  "some I did find some,yes, "fancy, yes, string"
This saves you from having to run any external infrastructure.
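The Athena flow above can also be driven from the CLI. A sketch with hypothetical database/table names (mydb.raw_lines) and an assumed results bucket; the table would be declared with a single string column over the S3 prefix, so each CSV row arrives as one line:

```shell
# Hypothetical query against a one-column table over the raw files:
QUERY='SELECT "$path", line FROM mydb.raw_lines WHERE line LIKE '\''%JZZ%'\'''

# Submitting it would look like this (not executed here; needs Athena set up):
#   aws athena start-query-execution \
#     --query-string "$QUERY" \
#     --work-group primary \
#     --result-configuration OutputLocation=s3://mybucket/athena-results/

echo "$QUERY"
```

The "$path" pseudo-column is what maps each matching line back to the S3 object it came from.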
You can do it locally with the following command:
aws s3 ls --recursive s3://<bucket_name>/<path>/ | awk '{print $4}' | xargs -I FNAME sh -c "echo FNAME; aws s3 cp s3://<bucket_name>/FNAME - | grep --color=always '<regex_pattern>'"
Explanation: the ls command generates a list of files, then we select the file name from the output, and for each file (xargs command) we download the file from S3 and grep the output.
I don't recommend this approach if you have to download a lot of data from S3 (due to transfer costs). You can avoid the cost of internet transfer, though, if you run the command on an EC2 instance located in a VPC with an S3 VPC endpoint attached to it.
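The shape of that ls → awk → xargs pipeline can be tried on a local directory first (hypothetical file names below); awk picks $4 because aws s3 ls prints date, time, size, and key in four columns:

```shell
# Simulate two "objects" locally and run the per-file echo-then-grep step:
mkdir -p /tmp/s3grep-demo
printf 'aaa\nJZZ here\n' > /tmp/s3grep-demo/FILE1.csv
printf 'nothing\n'       > /tmp/s3grep-demo/FILE2.csv

# Stand-in for: aws s3 ls ... | awk '{print $4}' | xargs -I FNAME ...
# `|| true` keeps the loop going when grep finds no match in a file.
ls /tmp/s3grep-demo | xargs -I FNAME sh -c \
  "echo FNAME; grep 'JZZ' /tmp/s3grep-demo/FNAME || true"
# prints FILE1.csv, its matching line "JZZ here", then FILE2.csv
```

Printing the file name before each grep is what lets you tell which object a match came from.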
There is a way to do it through the aws command line, but it will require some tools and fancy pipes. Here are some examples.

S3:

aws s3api list-objects --bucket my-logging-bucket --prefix "s3/my-events-2022-01-01" | jq -r '.Contents[]|.Key' | sort -r | xargs -I{} aws s3 cp s3://my-logging-bucket/{} -

Cloudfront:

aws s3api list-objects --bucket my-logging-bucket --prefix "cloudfront/blog.example.com/EEQEEEEEEEEE.2022-01-01" | jq -r '.Contents[]|.Key' | sort -r | xargs -I{} aws s3 cp s3://my-logging-bucket/{} - | zgrep GET

The "sort -r" just reverses the order so it shows the newest objects first. You can omit it if you want to look at the objects in chronological order.
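The effect of that sort -r step is easy to see on a couple of date-stamped keys (hypothetical names); because the timestamps sort lexicographically, reversing the sort puts the newest key first:

```shell
# Newest key comes out first once the lexicographic sort is reversed:
printf 's3/my-events-2022-01-01-00.10\ns3/my-events-2022-01-01-09.30\n' | sort -r
# prints the 09.30 key before the 00.10 key
```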