
How to index JSON files stored in S3 by keys?

Suppose I want to store hundreds of JSON files in S3. All these JSON files have the same schema. I would like to search these JSON files by keys and values: e.g., find all JSON files in which key a has a value matching "abc*" and key x has the value "xyz". I expect the search to return the file names and the keys that match the query.

What is the best way to index JSON files stored in S3 by keys?

This is a follow-up to my previous question.

You might want to consider using S3 Select.

With Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of Amazon S3 objects and retrieve just the subset of data that you need. By using Amazon S3 Select to filter this data, you can reduce the amount of data that Amazon S3 transfers, which reduces the cost and latency to retrieve this data.

Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format.

Full docs on S3 Select.

Here's a nice blog post on how to use S3 Select:

https://aws.amazon.com/blogs/storage/querying-data-without-servers-or-databases-using-amazon-s3-select/

Sample code would look like this:

import boto3

# S3 bucket to query (Change this to your bucket)
S3_BUCKET = 'greg-college-data'

s3 = boto3.client('s3')

r = s3.select_object_content(
        Bucket=S3_BUCKET,
        Key='COLLEGE_DATA_2015.csv',
        ExpressionType='SQL',
        Expression="select \"INSTNM\" from s3object s where s.\"STABBR\" in ['OR', 'IA']",
        InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
        OutputSerialization={'CSV': {}},
)

for event in r['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)

Code source.
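The sample above queries a single CSV object. S3 Select runs against one object per call, so to search hundreds of JSON files you would list the objects and query each one in turn. This scans every file on each search rather than building an index. Here is a minimal sketch of that loop, under some assumptions that are not in the original question: each file holds a single JSON document with top-level keys a and x, the wildcard "abc*" is expressed as the SQL pattern 'abc%', and the function name, bucket, and prefix are hypothetical.

```python
import json

# Hypothetical query for the question's example: key "a" starts with "abc"
# and key "x" equals "xyz". SQL LIKE uses % as the wildcard.
EXPRESSION = (
    "SELECT s.a, s.x FROM S3Object s "
    "WHERE s.a LIKE 'abc%' AND s.x = 'xyz'"
)


def search_json_objects(s3, bucket, prefix, expression=EXPRESSION):
    """Run one S3 Select query per object under `prefix`.

    `s3` is a boto3 S3 client, e.g. boto3.client('s3').
    Returns a list of (object_key, matching_record) pairs.
    """
    matches = []
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            response = s3.select_object_content(
                Bucket=bucket,
                Key=obj['Key'],
                ExpressionType='SQL',
                Expression=expression,
                # Assumption: each object is one JSON document, not JSON Lines.
                InputSerialization={'JSON': {'Type': 'DOCUMENT'}},
                OutputSerialization={'JSON': {}},
            )
            # Records events carry raw bytes; join them before decoding,
            # because a record may be split across events.
            chunks = [
                event['Records']['Payload']
                for event in response['Payload']
                if 'Records' in event
            ]
            text = b''.join(chunks).decode('utf-8')
            # JSON output is newline-delimited, one record per line.
            for line in text.splitlines():
                if line.strip():
                    matches.append((obj['Key'], json.loads(line)))
    return matches
```

Calling search_json_objects(boto3.client('s3'), 'my-bucket', 'data/') would return the file names together with the matching key values. Because every search issues one request per object, this is only reasonable for a modest number of files.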
