
Jackson JsonParser: restart parsing in broken JSON

I am using Jackson to process JSON that comes in chunks in Hadoop. That means they are big files that are cut up into blocks (128 MB in my problem, but it doesn't really matter). For efficiency reasons, I need the processing to be streaming (it is not possible to build the whole tree in memory).

I am using a mixture of JsonParser and ObjectMapper to read from my input. At the moment, I am using a custom InputFormat that is not splittable, so I can read my whole JSON.

The structure of the (valid) JSON is something like:

[    {    "Rep":
        {
        "date":"2013-07-26 00:00:00",
        "TBook":
        [
            {
            "TBookC":"ABCD",            
            "Records":
            [
                {"TSSName":"AAA", 
                    ... 
                },
                {"TSSName":"AAB", 
                    ... 
                },
                {"TSSName":"ZZZ", 
                ... 
                }
            ] } ] } } ]

The records I want to read in my RecordReader are the elements inside the "Records" array. The "..." means that there is more info there, which makes up my record. If I have only one split, there is no problem at all. I use a JsonParser for the fine-grained part (reading the headers and moving to the "Records" token), and then I use ObjectMapper together with the JsonParser to read records as objects. In detail:

MappingJsonFactory factory = new MappingJsonFactory();
factory.configure(JsonParser.Feature.AUTO_CLOSE_SOURCE, false);
mapper = new ObjectMapper(factory);
mapper.configure(DeserializationConfig.Feature.FAIL_ON_UNKNOWN_PROPERTIES, false);
mapper.configure(SerializationConfig.Feature.FAIL_ON_EMPTY_BEANS, false);
parser = factory.createJsonParser(iStream);
mapper.readValue(parser, JsonNode.class);
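To make the two-phase approach concrete, here is a minimal self-contained sketch of the single-split read path, assuming Jackson 1.x (the org.codehaus.jackson packages used above); the class name `RecordsReader` and the decision to collect records into a list are illustrative only:

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.codehaus.jackson.JsonNode;
import org.codehaus.jackson.JsonParser;
import org.codehaus.jackson.JsonToken;
import org.codehaus.jackson.map.MappingJsonFactory;
import org.codehaus.jackson.map.ObjectMapper;

public class RecordsReader {

    // Skip to the "Records" array, then stream each element out one at a
    // time, without ever materializing the whole document.
    public static List<JsonNode> readRecords(InputStream iStream) throws Exception {
        MappingJsonFactory factory = new MappingJsonFactory();
        factory.configure(JsonParser.Feature.AUTO_CLOSE_SOURCE, false);
        ObjectMapper mapper = new ObjectMapper(factory);
        JsonParser parser = factory.createJsonParser(iStream);

        // Fine-grained phase: advance token by token until the "Records" field.
        while (parser.nextToken() != null) {
            if (parser.getCurrentToken() == JsonToken.FIELD_NAME
                    && "Records".equals(parser.getCurrentName())) {
                break;
            }
        }
        parser.nextToken(); // now positioned at START_ARRAY of "Records"

        // Object phase: let the ObjectMapper read one record at a time.
        List<JsonNode> records = new ArrayList<JsonNode>();
        while (parser.nextToken() == JsonToken.START_OBJECT) {
            records.add(mapper.readValue(parser, JsonNode.class));
        }
        return records;
    }
}
```

In a real RecordReader you would of course emit each record as you go instead of accumulating a list.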

Now, let's imagine I have a file that spans two input splits (i.e. there are a lot of elements in "Records"). The valid JSON starts in the first split, where I read and keep the headers (which I need for each record; in this case the "date" field).

The split can cut anywhere inside the Records array. So let's assume I get a second split like this:

                ... 
                },
                {"TSSName":"ZZZ", 
                ... 
                },
                {"TSSName":"ZZZ2", 
                ... 
                }
            ] } ] } } ]

Before I start parsing, I can move the InputStream (FSDataInputStream) to the beginning ("{") of the next record containing "TSSName" (and this is done OK). It's fine to discard the trailing "garbage" at the beginning. So we get this:

                {"TSSName":"ZZZ", 
                ... 
                },
                {"TSSName":"ZZZ2", 
                ... 
                },
                ...
            ] } ] } } ]
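Locating that "{" can be done on the raw bytes before Jackson is involved. Here is a hypothetical sketch of that alignment step (`SplitAligner` and `skipToNextRecord` are names I made up; it assumes each record literally starts with `{"TSSName"` with no whitespace between the brace and the field name, as in the data above):

```java
import java.io.IOException;
import java.io.PushbackInputStream;

public class SplitAligner {

    private static final byte[] MARKER = "{\"TSSName\"".getBytes();

    // Consume bytes until the stream is positioned at the '{' that starts
    // the next record. Returns true if a record was found; the marker bytes
    // are pushed back so parsing starts exactly at the '{'. Returns false
    // at end of stream.
    public static boolean skipToNextRecord(PushbackInputStream in) throws IOException {
        int matched = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (b == MARKER[matched]) {
                matched++;
                if (matched == MARKER.length) {
                    in.unread(MARKER); // rewind so the parser sees the '{'
                    return true;
                }
            } else {
                // Restart the match; '{' never occurs inside the marker,
                // so only the first byte needs re-checking.
                matched = (b == MARKER[0]) ? 1 : 0;
            }
        }
        return false;
    }
}
```

You would wrap the FSDataInputStream in a PushbackInputStream with a buffer at least as large as the marker, e.g. `new PushbackInputStream(fsIn, 64)`.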

Then I hand it to the JsonParser/ObjectMapper pair seen above. The first object, "ZZZ", is read fine. But for the next one, "ZZZ2", it breaks: the JsonParser complains about malformed JSON, because it encounters a "," that is not inside an array. So it fails, and I cannot keep reading my records.

How can this problem be solved, so that I can still read my records from the second (and nth) split? How could I make the parser ignore these errors on the commas, or let the parser know in advance that it is reading the contents of an array?

It seems that simply catching the exception is enough: the parser goes on, and it is able to keep reading objects via the ObjectMapper.
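A defensive version of that workaround could look like the following (again Jackson 1.x; `readAllLenient` and the error cap are my own additions, since the API does not guarantee that the parser can recover from every kind of error):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.codehaus.jackson.JsonNode;
import org.codehaus.jackson.JsonParseException;
import org.codehaus.jackson.JsonParser;
import org.codehaus.jackson.JsonToken;
import org.codehaus.jackson.map.ObjectMapper;

public class LenientRecordIterator {

    // Read every top-level object the parser can find, swallowing the
    // JsonParseExceptions raised by the stray "," and "]" between records.
    public static List<JsonNode> readAllLenient(JsonParser parser, ObjectMapper mapper)
            throws IOException {
        List<JsonNode> records = new ArrayList<JsonNode>();
        int consecutiveErrors = 0;
        while (consecutiveErrors < 100) { // safety valve against a stuck parser
            try {
                JsonToken t = parser.nextToken();
                if (t == null) {
                    break; // end of input
                }
                if (t == JsonToken.START_OBJECT) {
                    records.add(mapper.readValue(parser, JsonNode.class));
                }
                consecutiveErrors = 0;
            } catch (JsonParseException e) {
                consecutiveErrors++; // stray separator; just try the next token
            }
        }
        return records;
    }
}
```

The counter is there so that a parser that stops advancing after an error cannot spin forever; 100 is an arbitrary limit.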

I don't really like this approach; I would prefer an option where the parser does not throw exceptions on nonstandard or even bad JSON. So I don't know if this fully answers the question, but I hope it helps.

Statement: the technical posts on this site follow the CC BY-SA 4.0 license; if you need to repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.

 