

Copying json objects with multiple layouts from S3 into Redshift

I have an S3 bucket with many files containing "\n"-delimited json objects. These json objects can have a few different layouts. There is a standard set of keys that are common across all the layouts. Most variants just add a few extra keys, but some contain nested json objects. A single file can contain any or all of these layouts.

I have managed to define a single, basic table in Redshift and copy the data into that table, but any keys not in my table are lost.
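For context, the basic load I have now looks roughly like this (table, bucket, and IAM role names are placeholders); with JSON 'auto', any key that has no matching column is simply dropped:

-- Basic single-table load; json keys without a matching column are ignored.
COPY events_base
FROM 's3://my-bucket/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS JSON 'auto';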

I would like to create one table for each layout I have and have each json object copied into the appropriate table. The layouts with nested json objects could probably keep the nested part as json in a single string column, since Redshift is able to parse json in a query.

I am new to AWS, so any help would be appreciated. Also, feel free to suggest non-Redshift services that might work as well.

Thanks!

You'll need to run a separate COPY for each table that you want to load. However, you may have trouble with nested objects (as of right now).
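A minimal sketch of what that could look like, assuming the objects for each layout have already been separated into their own S3 prefixes and that each layout has its own jsonpaths file (all names below are placeholders):

-- One COPY per layout/table, each pointing at its own prefix and jsonpaths file.
COPY layout_basic
FROM 's3://my-bucket/events/basic/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
JSON 's3://my-bucket/jsonpaths/basic_jsonpaths.json';

COPY layout_extended
FROM 's3://my-bucket/events/extended/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
JSON 's3://my-bucket/jsonpaths/extended_jsonpaths.json';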

We gave up on direct JSON loads because they cannot load an arbitrary number of nested objects. Each nested object has to be referred to by its index (e.g. 'nest[0]') in order to load it, which is not ideal when there could be many thousands of objects.
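To illustrate, a jsonpaths file can only pull nested array elements out by fixed position, roughly like this (the field names are made up for the example):

{
    "jsonpaths": ["$.id", "$.nest[0].value", "$.nest[1].value"]
}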

You cannot skip lines via the COPY command. One option, assuming the files are being loaded to S3: split each file by layout and land the pieces in different folders, so that you can run a different COPY command for each folder to load the data into different tables.

Other option: load just the first-level json keys into a staging table and use the Redshift JSON functions after that.

Example: JSON 1:

{
    "a": "value",
    "b": "value",
    "c": "value",
    "d": "value",
    "f": {
        "fa": "value",
        "fb": "value",
        "fc": "value",
        "fd": "value"
    },
    "g": "value"
}

JSON 2:

{
    "a": "value",
    "b": "value",
    "c": "value",
    "d": "value",
    "e": {
        "ea": "value",
        "eb": "value",
        "ec": {
            "eca": "",
            "ecb": "value",
            "ecc": "value",
            "ecd": "value",
        }
    },
    "f": {
        "fa": "value"
    },
    "g": "value"
}

In JSON 2, the 'e' key is extra. You can load these two different types of JSON into the same table and then use the Redshift JSON functions to process them further.

Your target table should have the columns:

a, b, c, d, e, f, g
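A minimal sketch of such a staging table, assuming the nested keys e and f are kept as raw json strings to be parsed later (column types and sizes are guesses):

-- Staging table: plain keys as text, nested objects kept as raw json strings.
CREATE TABLE staging_events (
    a VARCHAR(256),
    b VARCHAR(256),
    c VARCHAR(256),
    d VARCHAR(256),
    e VARCHAR(65535),  -- nested json object, null when absent
    f VARCHAR(65535),  -- nested json object
    g VARCHAR(256)
);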

Your jsonpaths file should look like:

{
    "jsonpaths": ["$.a", "$.b", "$.c", "$.d", "$.e", "$.f", "$.g"]
}

When JSON 1 is loaded, column e is loaded as null.
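Putting it together, the COPY and a follow-up query could look roughly like this (bucket, role, table, and file names are placeholders):

-- Load both layouts into the same staging table using the jsonpaths file above.
COPY staging_events
FROM 's3://my-bucket/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
JSON 's3://my-bucket/jsonpaths/events_jsonpaths.json';

-- Parse the nested objects afterwards with Redshift's JSON functions.
SELECT
    a,
    json_extract_path_text(f, 'fa')        AS f_fa,
    json_extract_path_text(e, 'ec', 'eca') AS e_ec_eca  -- null for JSON 1 rows
FROM staging_events;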

I hope this is what you are looking for. Let me know if you found a solution.
