
How to search for file contents in an Amazon S3 bucket without downloading the files

I have n files uploaded to Amazon S3, and I need to search those files for occurrences of a string in their contents. I tried one method: downloading each file from the S3 bucket, converting its input stream to a string, and then searching for the word in the content. But if there are more than five or six files, this process takes a lot of time.

Is there any other way to do this? Please help. Thanks in advance.

If your files contain CSV, TSV, JSON, Parquet, or ORC data, you can take a look at AWS Athena: https://aws.amazon.com/athena/

From their intro:

Amazon Athena is a fast, cost-effective, interactive query service that makes it easy to analyze petabytes of data in S3 with no data warehouses or clusters to manage.

It's unlikely to help you, though, as it sounds like you have plain text to search through.

Thought I'd mention it as it might help others looking to solve a similar problem.

Nope!

If you can't infer where the matches are from object metadata (e.g., the file name), then you're stuck with downloading and searching manually. If you have spare bandwidth, I suggest downloading a few files at a time to speed things up.
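A minimal sketch of that approach, assuming hypothetical bucket and key names. The S3 client call is injected as a function, so with the AWS SDK for JavaScript v2 you would pass something like `(p) => s3.getObject(p).promise().then((d) => d.Body)`:

```javascript
// Search a batch of S3 objects for a string, downloading a few at a time.
// `getObject` is an injected function that resolves to an object's body,
// so this logic is independent of the SDK version in use.
async function searchObjects(getObject, bucket, keys, needle, concurrency = 5) {
  const matches = [];
  // Walk the key list in chunks of `concurrency` parallel downloads.
  for (let i = 0; i < keys.length; i += concurrency) {
    const chunk = keys.slice(i, i + concurrency);
    const bodies = await Promise.all(
      chunk.map((key) => getObject({ Bucket: bucket, Key: key }))
    );
    bodies.forEach((body, j) => {
      if (body.toString().includes(needle)) matches.push(chunk[j]);
    });
  }
  return matches;
}
```

Downloading in chunks rather than all at once keeps memory use bounded when the bucket holds many or large objects.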

In a single word: NO!

One thing you can do to improve performance is to cache the files locally, so that you don't have to download the same file again and again.

You can use the Last-Modified header to check whether the local copy is stale, and download the file again only when it has changed.
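A sketch of that staleness check, assuming a hypothetical local cache that records the Last-Modified value it saw at download time. The metadata lookup is injected, so with the AWS SDK for JavaScript v2 it would be `(p) => s3.headObject(p).promise()`:

```javascript
// Decide whether a cached copy is stale by comparing the Last-Modified
// timestamp recorded at download time against the one S3 reports now.
// `headObject` is an injected function resolving to object metadata.
async function isCacheStale(headObject, bucket, key, cachedLastModified) {
  if (!cachedLastModified) return true; // never downloaded yet
  const head = await headObject({ Bucket: bucket, Key: key });
  return new Date(head.LastModified) > new Date(cachedLastModified);
}
```

A HEAD request transfers only headers, so checking freshness this way costs almost no bandwidth compared to re-downloading the object.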

My suggestion, since you seem to own the files, is to index them manually, based on content. If there are a lot of "keywords", or metadata associated with each file, you can help yourself by using a lightweight database, where you will perform your queries and get the exact file(s) users are looking for. This will preserve bandwidth and also be much faster, at the cost of maintaining a kind of "indexing" system.

Another option (if each file does not contain much metadata) would be to reorganize the files in your buckets, adding prefixes which would "auto-index" them, like follows:

 /foo/bar/randomFileContainingFooBar.dat
 /foo/zar/anotherRandomFileContainingFooZar.dat

This way you might end up scanning the whole bucket in order to find the set of files you need (this is why I suggested this option only if you have little metadata), but you will only download the matching ones, which is still much better than your original approach.
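With such prefixes in place, you only need the key listing (not the object bodies) to find candidates. A sketch, with the listing call injected; with the AWS SDK for JavaScript v2 it would be `(p) => s3.listObjectsV2(p).promise()`:

```javascript
// List every key under a prefix, following ContinuationToken pagination
// (S3 returns at most 1000 keys per page). `listObjectsV2` is an
// injected function resolving to one page of results.
async function keysUnderPrefix(listObjectsV2, bucket, prefix) {
  const keys = [];
  let token;
  do {
    const page = await listObjectsV2({
      Bucket: bucket,
      Prefix: prefix,
      ContinuationToken: token,
    });
    for (const obj of page.Contents) keys.push(obj.Key);
    token = page.IsTruncated ? page.NextContinuationToken : undefined;
  } while (token);
  return keys;
}
```

Listing keys transfers only names and metadata, which is why this beats downloading every object even when the listing spans the whole bucket.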

Yes, it is now possible with AWS S3 Select, if your objects are stored in CSV, JSON, or Apache Parquet format.

AWS details: https://aws.amazon.com/blogs/developer/introducing-support-for-amazon-s3-select-in-the-aws-sdk-for-javascript/

AWS S3 Select getting-started examples: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-select.html

Just in case anyone is looking for the same thing.

For example, with the SDK:

If you have a CSV like this:

 user_name,age
 jsrocks,13
 node4life,22
 esfuture,29
 ...

And for example we would like to retrieve something like:

SELECT user_name FROM S3Object WHERE cast(age as int) > 20

Then with the AWS SDK for JavaScript we do the following:

 const S3 = require('aws-sdk/clients/s3');
 const client = new S3({ region: 'us-west-2' });
 const params = {
   Bucket: 'my-bucket',
   Key: 'target-file.csv',
   ExpressionType: 'SQL',
   Expression: 'SELECT user_name FROM S3Object WHERE cast(age as int) > 20',
   InputSerialization: {
     CSV: {
       FileHeaderInfo: 'USE',
       RecordDelimiter: '\n',
       FieldDelimiter: ','
     }
   },
   OutputSerialization: { CSV: {} }
 };
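To actually run the query, the params object is passed to `selectObjectContent`, which returns a payload stream of events; the result rows arrive in `Records` events. A sketch of collecting them, assuming a Node.js version where the payload stream is async-iterable (the helper accepts any async iterable of events, so it can be tested without S3):

```javascript
// Collect the CSV rows from an S3 Select event stream. With aws-sdk v2,
// `events` would be the Payload from:
//   const data = await client.selectObjectContent(params).promise();
// Records events carry chunks of the query result; Stats and End events
// carry bookkeeping and are skipped here.
async function collectSelectRecords(events) {
  let out = '';
  for await (const event of events) {
    if (event.Records) out += event.Records.Payload.toString();
  }
  return out;
}
```

Note that a single logical row can be split across two `Records` chunks, which is why the payloads are concatenated before the result is split into lines.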

I am not familiar with Amazon S3, but the general way to deal with searching remote files is to use indexing, with the index itself being stored on the remote server. That way each search will use the index to deduce a relatively small number of potentially matching files, and only those will be scanned directly to verify whether they are indeed a match. Depending on your search terms and the complexity of the pattern, it might even be possible to avoid the direct file scan altogether.

That said, I do not know whether Amazon S3 has an indexing engine that you can use or whether there are supplemental libraries that do that for you, but the concept is simple enough that you should be able to get something working by yourself without too much work.

EDIT:

Generally the tokens that exist in each file are what is indexed. For example, if you want to search for "foo bar", the index will tell you which files contain "foo" and which contain "bar". The intersection of these results is the set of files that contain both "foo" and "bar". You will then have to scan those files directly to select those (if any) where "foo" and "bar" appear right next to each other, in the right order.

In any case, the amount of data that is downloaded to the client would be far less than downloading and scanning everything, although that would also depend on how your files are structured and what your search patterns look like.
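The token-index-then-intersect idea above can be sketched in a few lines; this is an in-memory toy (file names and contents are hypothetical), not a production index:

```javascript
// Minimal inverted index: token -> set of file names containing it.
// `files` maps each file name to its text contents.
function buildIndex(files) {
  const index = new Map();
  for (const [name, text] of Object.entries(files)) {
    for (const token of text.toLowerCase().split(/\W+/).filter(Boolean)) {
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(name);
    }
  }
  return index;
}

// Intersect the per-token file sets to get the candidate files for a
// phrase; only these candidates need to be downloaded and scanned for
// the exact phrase.
function candidates(index, phrase) {
  const tokens = phrase.toLowerCase().split(/\W+/).filter(Boolean);
  let result = null;
  for (const token of tokens) {
    const names = index.get(token) || new Set();
    result = result === null
      ? new Set(names)
      : new Set([...result].filter((n) => names.has(n)));
  }
  return result || new Set();
}
```

In practice the index itself would be serialized and stored as one more object in the bucket, so clients download the small index instead of every file.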
