How to compare big files in a big data platform?

Some big files arrive every day, not very frequently, only 2-3 per day, and they are converted into JSON format.

The file's content looks like:

[
    {
        "spa_ref_data": {
            "approval_action": "New",
            "spa_ref_no": "6500781413",
            "begin_date": null,
            "end_date": "20191009",
            "doc_file_name": "LEN_SPA_6500781413.json",
            "LEN_V": "v1",
            "version_no": null,
            "spa_ref_id": null,
            "spa_ref_notes": "MC00020544",
            "vend_code": "LEN"
        },
        "cust_data": [
            {
                "cust_name": null,
                "cust_no": null,
                "cust_type": "E",
                "state": null,
                "country": null
            },
            {
                "cust_name": null,
                "cust_no": null,
                "cust_type": "C",
                "state": null,
                "country": null
            }
        ],
        "product_data": [
            {
                "mfg_partno": "40AH0135US",
                "std_price": null,
                "rebate_amt": "180",
                "max_spa_qty": null,
                "rebate_type": null,
                "min_spa_qty": null,
                "min_cust_qty": null,
                "max_cust_qty": null,
                "begin_date": "20180608",
                "end_date": null
            },
            {
                "mfg_partno": "40AJ0135US",
                "std_price": null,
                "rebate_amt": "210",
                "max_spa_qty": null,
                "rebate_type": null,
                "min_spa_qty": null,
                "min_cust_qty": null,
                "max_cust_qty": null,
                "begin_date": "20180608",
                "end_date": null
            }
        ]
    },
    {
        "spa_ref_data": {
            "approval_action": "New",
            "spa_ref_no": "5309745006",
            "begin_date": null,
            "end_date": "20190426",
            "doc_file_name": "LEN_SPA_5309745006.json",
            "LEN_V": "v1",
            "version_no": null,
            "spa_ref_id": null,
            "spa_ref_notes": "MC00020101",
            "vend_code": "LEN"
        },
        "cust_data": [
            {
                "cust_name": null,
                "cust_no": null,
                "cust_type": "E",
                "state": null,
                "country": null
            },
            {
                "cust_name": null,
                "cust_no": null,
                "cust_type": "C",
                "state": null,
                "country": null
            }
        ],
        "product_data": [
            {
                "mfg_partno": "10M8S0HU00",
                "std_price": null,
                "rebate_amt": "698",
                "max_spa_qty": null,
                "rebate_type": null,
                "min_spa_qty": null,
                "min_cust_qty": null,
                "max_cust_qty": null,
                "begin_date": "20180405",
                "end_date": null
            },
            {
                "mfg_partno": "20K5S0CM00",
                "std_price": null,
                "rebate_amt": "1083",
                "max_spa_qty": null,
                "rebate_type": null,
                "min_spa_qty": null,
                "min_cust_qty": null,
                "max_cust_qty": null,
                "begin_date": "20180405",
                "end_date": null
            }
        ]
    }
]

This is a mock data file. In fact, the real file is an array of length 30,000+.

My target is to compare the incoming file with the latest one and get the changed data.
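
To make the target concrete, here is a minimal sketch of the comparison logic in plain Python, assuming spa_ref_no can serve as the unique record key (the file paths are placeholders):

import json

def index_by_ref(records):
    # Build a lookup table keyed on spa_ref_no.
    return {r["spa_ref_data"]["spa_ref_no"]: r for r in records}

def diff_files(previous_path, current_path):
    with open(previous_path) as f:
        previous = index_by_ref(json.load(f))
    with open(current_path) as f:
        current = index_by_ref(json.load(f))

    added   = [current[k] for k in current.keys() - previous.keys()]
    removed = [previous[k] for k in previous.keys() - current.keys()]
    # A record counts as changed when the same key maps to different content.
    changed = [current[k] for k in current.keys() & previous.keys()
               if current[k] != previous[k]]
    return {"added": added, "removed": removed, "changed": changed}

if __name__ == "__main__":
    result = diff_files("latest.json", "incoming.json")  # placeholder paths
    print({k: len(v) for k, v in result.items()})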

My leader says I must use big data technologies, and the performance must be good.

We use Apache NiFi and Hadoop big data tools to do it.

Is there any advice?

For example, you can use the ExecuteScript processor with a JS script to compare the JSON files; it works fast. You can also split your big JSON array with the SplitRecord processor and compare each record with an ExecuteScript processor; that also works well.
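
As a rough illustration of how such a comparison script could be wired into ExecuteScript: the sketch below uses python (Jython), which ExecuteScript also supports, instead of the JS script mentioned above, and assumes a hypothetical flowfile attribute previous.file that points to the latest snapshot on disk.

import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

class DiffCallback(StreamCallback):
    def __init__(self, previous_path):
        self.previous_path = previous_path

    def process(self, inputStream, outputStream):
        # Parse the incoming flowfile content and the previous snapshot.
        current = json.loads(IOUtils.toString(inputStream, StandardCharsets.UTF_8))
        with open(self.previous_path) as f:
            previous = json.load(f)
        prev_idx = {r["spa_ref_data"]["spa_ref_no"]: r for r in previous}
        # Keep only records that are new or whose content changed.
        changed = [r for r in current
                   if prev_idx.get(r["spa_ref_data"]["spa_ref_no"]) != r]
        outputStream.write(bytearray(json.dumps(changed).encode("utf-8")))

flowFile = session.get()
if flowFile is not None:
    previous_path = flowFile.getAttribute("previous.file")  # assumed attribute
    flowFile = session.write(flowFile, DiffCallback(previous_path))
    session.transfer(flowFile, REL_SUCCESS)

If the array is split with SplitRecord first, the same per-record check applies, but the previous snapshot should then be loaded or cached once rather than re-read for every record.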
