
How to grep a term from S3 and output object name

I need to grep for a term across thousands of files in S3 and write the names of the matching files to an output file. I'm quite new to the CLI, so I've been testing both on my local machine and on a small subset in S3.

So far I've got this:

aws s3 cp s3://mybucket/path/to/file.csv - | grep -iln searchterm > output.txt

The problem is the hyphen: since I'm copying to standard output, grep's -l switch reports (standard input) instead of file.csv.

My desired output is

file.csv

Eventually, I'll need to iterate this over the whole bucket, and then all buckets, to get

file1.csv
file2.csv
file3.csv

But I need to get over this hurdle first. Thanks!

Because you write the file to stdout and pipe that into grep's stdin, grep has no way of knowing that the original file was file.csv. If you have a long list of files, I would do:

while read -r file; do aws s3 cp "s3://mybucket/path/to/${file}" - | grep -q searchterm && echo "${file}" >> output.txt; done < files_list.txt

I cannot try it, because I do not have access to an S3 bucket, but the trick is to run grep quietly ( -q ): it exits with success if it finds at least one match and with failure otherwise, so you can then print the name of the file.

EDIT: Explanation

  1. The while loop iterates over each line of files_list.txt
  2. The aws command writes that file to stdout
  3. We pipe stdout into grep in quiet mode ( -q ), which acts as a pattern matcher, returning true if a match was found and false otherwise.
  4. If grep returns true, we append the name of the file ( ${file} ) to our output file.
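
The four steps above can also be sketched as a small script, which may be easier to extend than a one-liner. This is only a sketch that has not been run against a real bucket: fetch_object and grep_objects are hypothetical helper names, and the bucket, prefix and search term are the placeholders from the question.

```shell
#!/bin/sh
# Sketch: print the names of objects whose content matches a search term.

# Hypothetical helper: stream one object to stdout.
# "mybucket/path/to" is the placeholder location used in the question.
fetch_object() {
    aws s3 cp "s3://mybucket/path/to/$1" -
}

# Hypothetical helper: read object names on stdin, one per line, and
# print each name whose content contains the term (case-insensitive).
grep_objects() {
    term=$1
    while IFS= read -r file; do
        # < /dev/null keeps the fetch command from reading the name list.
        # grep -q prints nothing; only its exit status is used.
        if fetch_object "$file" < /dev/null | grep -qi -- "$term"; then
            printf '%s\n' "$file"
        fi
    done
}
```

It would be called as grep_objects searchterm < files_list.txt > output.txt. Keeping the download in its own function means it can be swapped for cat while testing on local files.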

EDIT2: Other solution

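while read -r file; do aws s3 cp "s3://mybucket/path/to/${file}" - | sed -n "/searchpattern/{s|.*|${file}|p;q;}" >> output.txt; done < files_list.txt

Explanation

Steps 1 and 2 are the same, then:

  3. stdout is piped to sed, which reads the stream line by line until the first line matching the pattern, replaces that line with the file name ( s|.*|${file}| ), prints it ( p ), and quits ( q ), so the name lands in the output file. GNU sed's F command is no help here: on piped input it prints - rather than the object name. This assumes the file name contains no | , & or \ characters, which are special in the s command.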
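
Another way to let the matcher itself emit the name, assuming a grep that supports the --label option (GNU grep does; it is not part of POSIX): --label names standard input in grep's output, so -l can report the object name instead of (standard input). The sketch below has not been run against a real bucket, and fetch_object and label_matches are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: let grep -l print the object name by labelling standard input.

# Hypothetical helper: stream one object to stdout (placeholder bucket/prefix).
fetch_object() {
    aws s3 cp "s3://mybucket/path/to/$1" -
}

# Hypothetical helper: read object names on stdin, print those that match.
label_matches() {
    term=$1
    while IFS= read -r file; do
        # --label makes grep -l print "$file" instead of "(standard input)".
        fetch_object "$file" < /dev/null | grep -il --label="$file" -- "$term"
    done
}
```

It would be called as label_matches searchterm < files_list.txt > output.txt.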

