How to parse huge JSON file as stream in Json.NET?

I have a very, very large JSON file (1000+ MB) of identically structured JSON objects. For example:

[
    {
        "id": 1,
        "value": "hello",
        "another_value": "world",
        "value_obj": {
            "name": "obj1"
        },
        "value_list": [
            1,
            2,
            3
        ]
    },
    {
        "id": 2,
        "value": "foo",
        "another_value": "bar",
        "value_obj": {
            "name": "obj2"
        },
        "value_list": [
            4,
            5,
            6
        ]
    },
    {
        "id": 3,
        "value": "a",
        "another_value": "b",
        "value_obj": {
            "name": "obj3"
        },
        "value_list": [
            7,
            8,
            9
        ]
    },
    ...
]

Every item in the root JSON list follows the same structure and so can be deserialized individually. I already have the C# classes written to receive this data, and deserializing a JSON file that contains a single object (without the enclosing list) works as expected.

At first, I tried to deserialize my objects directly in a loop:

JsonSerializer serializer = new JsonSerializer();
MyObject o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (!sr.EndOfStream)
    {
        o = serializer.Deserialize<MyObject>(reader);
    }
}

This didn't work: it threw an exception clearly stating that an object was expected, not a list. My understanding is that this call reads a single object at the root level of the JSON file, and since the root is a list of objects, the request is invalid.

My next idea was to deserialize into a C# List of objects:

JsonSerializer serializer = new JsonSerializer();
List<MyObject> o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (!sr.EndOfStream)
    {
        o = serializer.Deserialize<List<MyObject>>(reader);
    }
}

This does succeed. However, it only somewhat reduces the high RAM usage. In this case it does look like the application deserializes items one at a time and so does not read the entire JSON file into RAM, but we still end up using a lot of RAM because the C# List object now holds all of the data from the JSON file. This has only displaced the problem.

I then decided to simply try taking a single character off the beginning of the stream (to eliminate the [) by calling sr.Read() before going into the loop. The first object then does read successfully, but subsequent ones do not, failing with an "unexpected token" exception. My guess is that the comma and space between the objects are throwing the reader off.

Simply removing square brackets won't work, since the objects contain a primitive list of their own, as you can see in the sample. Even using }, as a separator won't work because, as you can see, there are sub-objects within the objects.

My goal is to read the objects from the stream one at a time: read an object, do something with it, discard it from RAM, read the next object, and so on. This would eliminate the need to load either the entire JSON string or the entire contents of the data into RAM as C# objects.

What am I missing?

This should resolve your problem. It works just like your initial code, except that it only deserializes an object when the reader hits the { character in the stream; otherwise it skips ahead until it finds another start-object token.

JsonSerializer serializer = new JsonSerializer();
MyObject o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (reader.Read())
    {
        // deserialize only when there's "{" character in the stream
        if (reader.TokenType == JsonToken.StartObject)
        {
            o = serializer.Deserialize<MyObject>(reader);
        }
    }
}
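The loop above can also be wrapped in an iterator so that callers process one object at a time with foreach. The following is a minimal sketch, not part of the original answer: the helper name ReadItems, the class JsonArrayStreamer, and the trimmed-down MyObject are hypothetical, and a StringReader stands in for the real file stream.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

// Trimmed-down stand-in for the asker's class (assumption: only two fields shown).
public class MyObject
{
    [JsonProperty("id")] public int Id { get; set; }
    [JsonProperty("value")] public string Value { get; set; }
}

public static class JsonArrayStreamer
{
    // Hypothetical helper: lazily yields one top-level array element at a time.
    public static IEnumerable<T> ReadItems<T>(TextReader text)
    {
        var serializer = new JsonSerializer();
        using (var reader = new JsonTextReader(text))
        {
            while (reader.Read())
            {
                // Deserialize consumes the whole object, including nested
                // objects, so only array elements reach this check.
                if (reader.TokenType == JsonToken.StartObject)
                {
                    yield return serializer.Deserialize<T>(reader);
                }
            }
        }
    }
}

public static class Program
{
    public static void Main()
    {
        const string json =
            "[{\"id\":1,\"value\":\"hello\"},{\"id\":2,\"value\":\"foo\"}]";

        // Each item becomes eligible for garbage collection after its iteration.
        foreach (MyObject item in JsonArrayStreamer.ReadItems<MyObject>(new StringReader(json)))
        {
            Console.WriteLine($"{item.Id}: {item.Value}");
        }
    }
}
```

For the real file, a StreamReader over File.OpenRead("bigfile.json") could be passed in place of the StringReader. Because the method is an iterator, nothing is read until the foreach starts, and only the current object needs to be held in memory.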

I think we can do better than the accepted answer, using more features of JsonReader to build a more general solution.

As a JsonReader consumes tokens from a JSON document, the current path is recorded in the JsonReader.Path property.

We can use this to precisely select deeply nested data from a JSON file, using a regex to ensure that we're on the right path.
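To get a feel for what JsonReader.Path reports, it can help to print the token type and path for every token of a small document. This is a rough sketch for exploration only; the class PathDemo and the method TracePaths are hypothetical names, and the sample JSON is a cut-down version of the question's data.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public static class PathDemo
{
    // Hypothetical helper: returns (token type, path) for each token read.
    public static IEnumerable<(JsonToken Token, string Path)> TracePaths(string json)
    {
        using (var reader = new JsonTextReader(new StringReader(json)))
        {
            while (reader.Read())
            {
                yield return (reader.TokenType, reader.Path);
            }
        }
    }

    public static void Main()
    {
        const string json = "[{\"id\":1,\"value_obj\":{\"name\":\"obj1\"}}]";

        // Print one line per token so the path of each position is visible.
        foreach (var (token, path) in TracePaths(json))
        {
            Console.WriteLine($"{token,-12} {path}");
        }
    }
}
```

Running this against your own data shows which paths to target. The start of the first array element should be reported with the path [0], which is the position the regex approach keys on.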

So, using the following extension method:

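using System.Collections.Generic;
using System.Text.RegularExpressions;
using Newtonsoft.Json;

public static class JsonReaderExtensions
{
    // Lazily yields every value whose JsonReader.Path matches the regex.
    public static IEnumerable<T> SelectTokensWithRegex<T>(
        this JsonReader jsonReader, Regex regex)
    {
        JsonSerializer serializer = new JsonSerializer();
        while (jsonReader.Read())
        {
            // A property name shares its value's path, so skip the name token
            // and deserialize only once the reader is on the value itself.
            if (regex.IsMatch(jsonReader.Path)
                && jsonReader.TokenType != JsonToken.PropertyName)
            {
                yield return serializer.Deserialize<T>(jsonReader);
            }
        }
    }
}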

The data you are concerned with lies on these paths:

[0]
[1]
[2]
... etc

We can construct the following regex to precisely match these paths:

var regex = new Regex(@"^\[\d+\]$");

It now becomes possible to stream objects out of your data (without fully loading or parsing the entire JSON) as follows:

IEnumerable<MyObject> objects = jsonReader.SelectTokensWithRegex<MyObject>(regex);

Or, if we want to dig even deeper into the structure, we can be even more precise with our regex:

var regex = new Regex(@"^\[\d+\]\.value$");
IEnumerable<string> objects = jsonReader.SelectTokensWithRegex<string>(regex);

to extract only the value properties from the items in the array.

I've found this technique extremely useful for extracting specific data from huge (100 GiB) JSON dumps, directly from HTTP using a network stream (with low memory requirements and no intermediate storage required).
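Putting the pieces together, an end-to-end sketch might look like the following. The helper ReadValues is a hypothetical name, and a MemoryStream stands in for the real source; any readable Stream should work the same way, for example one obtained from HttpClient.GetStreamAsync or File.OpenRead. The extension method is repeated only so the sketch compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using Newtonsoft.Json;

public static class JsonReaderExtensions
{
    // Same extension method as above, repeated for self-containment.
    public static IEnumerable<T> SelectTokensWithRegex<T>(
        this JsonReader jsonReader, Regex regex)
    {
        var serializer = new JsonSerializer();
        while (jsonReader.Read())
        {
            if (regex.IsMatch(jsonReader.Path)
                && jsonReader.TokenType != JsonToken.PropertyName)
            {
                yield return serializer.Deserialize<T>(jsonReader);
            }
        }
    }
}

public static class Program
{
    // Hypothetical helper: streams the "value" property of every array element.
    public static IEnumerable<string> ReadValues(Stream stream)
    {
        var regex = new Regex(@"^\[\d+\]\.value$");
        using (var sr = new StreamReader(stream))
        using (var jsonReader = new JsonTextReader(sr))
        {
            foreach (string value in jsonReader.SelectTokensWithRegex<string>(regex))
            {
                yield return value;
            }
        }
    }

    public static void Main()
    {
        // In-memory stand-in for a file or network stream (assumption for the demo).
        byte[] bytes = Encoding.UTF8.GetBytes(
            "[{\"id\":1,\"value\":\"hello\",\"another_value\":\"world\"}," +
            "{\"id\":2,\"value\":\"foo\",\"another_value\":\"bar\"}]");

        foreach (string value in ReadValues(new MemoryStream(bytes)))
        {
            Console.WriteLine(value);
        }
    }
}
```

Because the regex is anchored at both ends, a property such as another_value is not picked up even though its name ends in "value".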

Here is another easy way to parse a large JSON file, using Cinchoo ETL, an open source library (it uses JSON.NET under the hood to parse the JSON in a streaming manner):

using (var r = ChoJSONReader<MyObject>.LoadText(json))
{
    foreach (var rec in r)
        Console.WriteLine(rec.Dump());
}

Sample fiddle: https://dotnetfiddle.net/i5qJ5R

.NET 6

This is easily done with System.Text.Json.JsonSerializer in .NET 6:

using (FileStream fileStream = new FileStream("hugefile.json", FileMode.Open))
{
    IAsyncEnumerable<Person?> people = JsonSerializer.DeserializeAsyncEnumerable<Person?>(fileStream);
    await foreach (Person? p in people)
    {
        // p may be null if the array contains a JSON null, hence the ?. access
        Console.WriteLine($"Hello! My name is {p?.Name}!");
    }
}

Is this what you're looking for? Found on a previous question.

The current version of Json.NET does not allow you to use the accepted answer's code. A current alternative is:

public static object DeserializeFromStream(Stream stream)
{
    var serializer = new JsonSerializer();

    using (var sr = new StreamReader(stream))
    using (var jsonTextReader = new JsonTextReader(sr))
    {
        return serializer.Deserialize(jsonTextReader);
    }
}

Documentation: Deserialize JSON from a file stream
