Performance issue with deserializing a large JSON file into objects and adding them to a database

I'm working on an ASP.NET Core 7 MVC project and want to seed a list of countries with their cities into the database from a locally stored JSON file (more than 600,000 lines).

Here is a sample of the JSON file:

[{
    "id": 1,
    "name": "Afghanistan",
    "iso3": "AFG",
    "iso2": "AF",
    "cities": [
      {
        "id": 141,
        "name": "‘Alāqahdārī Dīshū"
      },
      {
        "id": 53,
        "name": "Aībak"
      },
      {
        "id": 50,
        "name": "Andkhoy"
      },
      {
        "id": 136,
        "name": "Āqchah"
      },
      {
        "id": 137,
        "name": "Ārt Khwājah"
      },
      {
        "id": 51,
        "name": "Asadabad"
      },
      {
        "id": 52,
        "name": "Ashkāsham"
      },
      {
        "id": 138,
        "name": "Āsmār"
      },
      {
        "id": 54,
        "name": "Baghlān"
      },
      {
        "id": 55,
        "name": "Balkh"
      }
    ]
  } 
]

This is what I tried to do.

Country model:

public class Country
{
    [Key]
    [DatabaseGenerated(DatabaseGeneratedOption.None)]
    public int Id { get; set; }
    public string Name { get; set; }
    public string ISO2 { get; set; }
    public string ISO3 { get; set; }

    public ICollection<City> Cities { get; set; }
}

City model:

public class City
{
    [Key]
    [DatabaseGenerated(DatabaseGeneratedOption.None)]
    public int Id { get; set; }
    public string Name { get; set; }
    public int CountryId { get; set; }
    public Country Country { get; set; }
}

Code for deserializing:

public static async Task SeedCountries(AppDbContext context, IWebHostEnvironment web)
{
    if (!(context.Countries.Count() > 0) || !(context.Cities.Count() > 0))
    {
        string json = await System.IO.File.ReadAllTextAsync(Path.Combine(web.ContentRootPath, "countries.json"));
        var jsonObject = JArray.Parse(json);

        IList<Country> countries = new List<Country>();
        foreach (var item in jsonObject)
        {
            Country country = item.ToObject<Country>();
            var citites = item["cities"] as JArray;
            var citis = new City();
            countries.Add(country);
            foreach (var city in citites)
            {
                City cities = city.ToObject<City>();
                cities.CountryId = country.Id;
                context.Cities.Add(cities);
                context.Entry(cities).State = Microsoft.EntityFrameworkCore.EntityState.Detached;
            }
        }
        await context.Countries.AddRangeAsync(countries);
        await context.SaveChangesAsync();
    }
}

The problem is performance, as shown in this image:

[image]

You have two problems here:

  1. You are loading your entire 600,000-line JSON file into a single json string. This string will be much larger than 85,000 bytes, so it will be allocated on the large object heap, causing the problems described in Why Large Object Heap and why do we care?.

  2. You are then parsing that huge string into a JArray, which will take up even more memory.
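To make the first point concrete, here is a minimal sketch (not part of the original answer) of how little text it takes to cross that threshold. .NET strings are UTF-16, so 50,000 characters already occupy roughly 100,000 bytes. It assumes default GC settings, and LohDemo and GenerationOf are illustrative names only.

```csharp
using System;

public static class LohDemo
{
    // Reports the GC generation of an object. Objects on the large object heap
    // are reported as GC.MaxGeneration even when they were just allocated.
    public static int GenerationOf(object o) => GC.GetGeneration(o);

    public static void Main()
    {
        string small = new string('x', 1_000);   // about 2 KB, ordinary heap
        string large = new string('x', 50_000);  // about 100 KB, above the 85,000-byte threshold

        Console.WriteLine($"small string generation: {GenerationOf(small)}");
        Console.WriteLine($"large string generation: {GenerationOf(large)}");
        Console.WriteLine($"GC.MaxGeneration:        {GC.MaxGeneration}");
    }
}
```

A whole 600,000-line file read with ReadAllTextAsync ends up as one such allocation, only far larger.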

I reckon you are creating the intermediate JSON string because you need to use asynchronous file reading, but Json.NET's serializer does not support async deserialization. However, JsonTextReader does support asynchronous reading via JsonTextReader.ReadAsync(), and JToken does support asynchronous loading via JToken.LoadAsync(). Putting those together, it is possible to iterate through a huge JSON array asynchronously, load each item into a JToken asynchronously, then deserialize the token to your final array item (here Country) with bounded memory use.
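Roughly, the idea can be sketched as follows. This is my own minimal sketch under stated assumptions, not the code from any linked answer: JsonStreamingSketch and ReadArrayItemsAsync are hypothetical names, default serializer settings are assumed, and it requires the Newtonsoft.Json package.

```csharp
#nullable enable
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;
using System.Threading;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class JsonStreamingSketch
{
    // Hypothetical helper: reads a root-level JSON array one item at a time.
    public static async IAsyncEnumerable<T?> ReadArrayItemsAsync<T>(
        Stream stream,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        var serializer = JsonSerializer.CreateDefault();
        using var textReader = new StreamReader(stream, leaveOpen: true);
        using var reader = new JsonTextReader(textReader) { CloseInput = false };

        // Move to the first real token, skipping any leading comments.
        do
        {
            if (!await reader.ReadAsync(cancellationToken).ConfigureAwait(false))
                yield break; // empty stream
        }
        while (reader.TokenType == JsonToken.Comment);

        if (reader.TokenType != JsonToken.StartArray)
            throw new JsonSerializationException("Expected a JSON array at the root.");

        // Each iteration reads asynchronously, then loads exactly one array item.
        while (await reader.ReadAsync(cancellationToken).ConfigureAwait(false)
               && reader.TokenType != JsonToken.EndArray)
        {
            if (reader.TokenType == JsonToken.Comment)
                continue;

            // Only the current item is materialized as a JToken, so memory use
            // is bounded by the largest single item rather than the whole file.
            var token = await JToken.LoadAsync(reader, cancellationToken).ConfigureAwait(false);
            yield return token.ToObject<T>(serializer);
        }
    }
}
```

The conversion from JToken to T is still synchronous; only the I/O is asynchronous, which is the trade-off described above.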

And in fact, this answer to Deserializing to AsyncEnumerable using Newtonsoft.Json has an extension method that does exactly that:

public static partial class JsonExtensions
{
    /// <summary>
    /// Asynchronously load and synchronously deserialize values from a stream containing a JSON array.  The root object of the JSON stream must in fact be an array, or an exception is thrown
    /// </summary>
    public static async IAsyncEnumerable<T?> DeserializeAsyncEnumerable<T>(Stream stream, JsonSerializerSettings? settings = default, [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        // See https://stackoverflow.com/a/72502371/3744182 for the body of this method

So grab all the code for the JsonExtensions extension class from that answer, and you will be able to create your countries list as follows:

var fileName = Path.Combine(web.ContentRootPath, "countries.json");

var countries = new List<Country>();

await using var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize: 4096, useAsync: true);
await foreach (var country in JsonExtensions.DeserializeAsyncEnumerable<Country>(stream))
{
    if (country == null)
        continue;
    if (country.Cities != null)
        foreach (var city in country.Cities)
        {
            // Do you need another instance of City?  If so you may clone the current one as follows:
            // city = JToken.FromObject(city).ToObject<City>();
            city.CountryId = country.Id;
            context.Cities.Add(city);
            context.Entry(city).State = Microsoft.EntityFrameworkCore.EntityState.Detached;
        }
    countries.Add(country);
}

Notes:

  • In your sample code, you seem to be deserializing Cities twice. item.ToObject<Country>(); will deserialize the Country.Cities list, but then you deserialize the cities again manually in your foreach (var city in citites) loop. It's unclear why you are doing this, since the list of cities is already present under the country, so I removed the duplication from my sample code.

    If for some reason you do need to create two instances of each City, you may need to clone them (e.g. by round-tripping them with the JSON serializer) inside the foreach (var city in country.Cities) loop in my sample code.

Demo fiddle #1 here.

As an alternative, as suggested by CodeCaster in comments, you could switch to System.Text.Json, which has built-in support for asynchronous deserialization of huge JSON arrays via JsonSerializer.DeserializeAsyncEnumerable():

using System.Text.Json;
using System.Text.Json.Serialization;

var fileName = Path.Combine(web.ContentRootPath, "countries.json");

var countries = new List<Country>();

await using var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize: 4096, useAsync: true);
var options = new JsonSerializerOptions
{
    PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
    // PropertyNameCaseInsensitive is required to deserialize "iso2" and "iso3" correctly
    PropertyNameCaseInsensitive = true,
};
await foreach (var country in JsonSerializer.DeserializeAsyncEnumerable<Country>(stream, options))
{
    if (country == null)
        continue;
    if (country.Cities != null)
        foreach (var city in country.Cities)
        {
            // Do you need another instance of City?  If so you may clone the current one as follows:
            // city = JsonSerializer.Deserialize<City>(JsonSerializer.SerializeToUtf8Bytes(city));
            city.CountryId = country.Id;
            context.Cities.Add(city);
            context.Entry(city).State = Microsoft.EntityFrameworkCore.EntityState.Detached;
        }
    countries.Add(country);
}

Demo fiddle #2 here.
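If you want to see why PropertyNameCaseInsensitive matters for the ISO2 and ISO3 properties, the following small sketch compares the two settings on a single object. As far as I can tell, the camel-case policy does not map the all-caps property name ISO2 to "iso2", so a case-sensitive match fails. CountryDto and CaseDemo are illustrative names, not types from your project.

```csharp
#nullable enable
using System;
using System.Text.Json;

// Cut-down stand-in for the Country model, for illustration only.
public class CountryDto
{
    public int Id { get; set; }
    public string? Name { get; set; }
    public string? ISO2 { get; set; }
}

public static class CaseDemo
{
    // Deserializes with the camel-case naming policy, toggling case-insensitive matching.
    // (Real code should cache the options instance instead of creating one per call.)
    public static CountryDto? Parse(string json, bool caseInsensitive) =>
        JsonSerializer.Deserialize<CountryDto>(json, new JsonSerializerOptions
        {
            PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
            PropertyNameCaseInsensitive = caseInsensitive,
        });

    public static void Main()
    {
        const string json = "{\"id\":1,\"name\":\"Afghanistan\",\"iso2\":\"AF\"}";

        // Print what each configuration makes of the "iso2" property.
        Console.WriteLine(Parse(json, caseInsensitive: false)?.ISO2 ?? "(not populated)");
        Console.WriteLine(Parse(json, caseInsensitive: true)?.ISO2 ?? "(not populated)");
    }
}
```

An alternative to case-insensitive matching would be to annotate the properties with [JsonPropertyName("iso2")] and [JsonPropertyName("iso3")].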
