How to grep into files stored in S3

Does anybody know how to perform grep on S3 files directly in the bucket with the aws s3 CLI? For example, I have FILE1.csv and FILE2.csv with many rows and want to look for the rows that contain the string JZZ:

aws s3 ls --recursive s3://mybucket/loaded/*.csv.gz | grep 'JZZ'

The aws s3 cp command can send output to stdout:

aws s3 cp s3://mybucket/foo.csv - | grep 'JZZ'

The dash (-) signals the command to send output to stdout.

See: How to use AWS S3 CLI to dump files to stdout in BASH?
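Since the files in the question are gzipped (.csv.gz), a minimal variation of the same idea (assuming gunzip is available locally) decompresses on the fly before grepping:

aws s3 cp s3://mybucket/loaded/FILE1.csv.gz - | gunzip -c | grep 'JZZ'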

You can also use the GLUE/Athena combo, which allows you to run the search directly within AWS. Depending on data volumes, queries can be costly and take time.

Basically:

  • Create a GLUE classifier that reads the CSVs line by line
  • Create a crawler for your S3 data directory against a database (csvdumpdb) - it will create a table with all the lines across all the CSVs found
  • Use Athena to query, e.g.

    select "$path",line from where line like '%some%fancy%string%' 选择“ $ path”,从“%some%fancy%string%”等行开始

  • and get something like

    $path                         line

    s3://mybucket/mydir/my.csv    "some I did find some,yes, "fancy, yes, string"

Saves you from having to run any external infrastructure.
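If you prefer to run the Athena query from the command line instead of the console, here is a minimal sketch; the table name (csvlines) and the results location (s3://my-athena-results/) are hypothetical placeholders for whatever the crawler created and wherever you keep query output:

aws athena start-query-execution \
  --query-string "select \"\$path\", line from csvlines where line like '%some%fancy%string%'" \
  --query-execution-context Database=csvdumpdb \
  --result-configuration OutputLocation=s3://my-athena-results/

Then fetch the rows with the QueryExecutionId it returns:

aws athena get-query-results --query-execution-id <QueryExecutionId>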

You can do it locally with the following command:

aws s3 ls --recursive s3://<bucket_name>/<path>/ | awk '{print $4}' | xargs -I FNAME sh -c "echo FNAME; aws s3 cp s3://<bucket_name>/FNAME - | grep --color=always '<regex_pattern>'"

Explanation: the ls command generates a list of files, then we select the file name from the output and, for each file (xargs command), download the file from S3 and grep the output.
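For example, adapted to the bucket and the gzipped files from the question (an illustrative sketch; note that awk '{print $4}' assumes object keys without spaces):

aws s3 ls --recursive s3://mybucket/loaded/ | awk '{print $4}' | xargs -I FNAME sh -c "echo FNAME; aws s3 cp s3://mybucket/FNAME - | gunzip -c | grep --color=always 'JZZ'"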

I don't recommend this approach if you have to download a lot of data from S3 (due to transfer costs). You can avoid the costs of internet transfer, though, if you run the command on an EC2 instance located in a VPC with an S3 VPC endpoint attached to it.

There is a way to do it through the aws command line, but it requires some extra tools (jq) and fancy pipes. Here are some examples.

S3:

aws s3api list-objects --bucket my-logging-bucket --prefix "s3/my-events-2022-01-01" | jq -r '.Contents[]|.Key' | sort -r | xargs -I{} aws s3 cp s3://my-logging-bucket/{} -

CloudFront:

aws s3api list-objects --bucket my-logging-bucket --prefix "cloudfront/blog.example.com/EEQEEEEEEEEE.2022-01-01" | jq -r '.Contents[]|.Key' | sort -r | xargs -I{} aws s3 cp s3://my-logging-bucket/{} - | zgrep GET

The "sort -r" just reverses the order so it shows the newest objects first. “sort -r”只是颠倒顺序,因此它首先显示最新的对象。 You can omit that if you want to look at them in chronological order.如果您想按时间顺序查看它们,可以省略它。
